Inference and Learning from Data: Volume 2 (II): Inference [1 ed.] 1009218263, 9781009218269


English · 1070 [1166] pages · 2023


Table of contents:
Cover
Half-title
Title page
Copyright information
Dedication
Contents
Preface
P.1 Emphasis on Foundations
P.2 Glimpse of History
P.3 Organization of the Text
P.4 How to Use the Text
P.5 Simulation Datasets
P.6 Acknowledgments
Notation
27 Mean-Square-Error Inference
27.1 Inference without Observations
27.2 Inference with Observations
27.3 Gaussian Random Variables
27.4 Bias–Variance Relation
27.5 Commentaries and Discussion
Problems
27.A Circular Gaussian Distribution
References
28 Bayesian Inference
28.1 Bayesian Formulation
28.2 Maximum A-Posteriori Inference
28.3 Bayes Classifier
28.4 Logistic Regression Inference
28.5 Discriminative and Generative Models
28.6 Commentaries and Discussion
Problems
References
29 Linear Regression
29.1 Regression Model
29.2 Centering and Augmentation
29.3 Vector Estimation
29.4 Linear Models
29.5 Data Fusion
29.6 Minimum-Variance Unbiased Estimation
29.7 Commentaries and Discussion
Problems
29.A Consistency of Normal Equations
References
30 Kalman Filter
30.1 Uncorrelated Observations
30.2 Innovations Process
30.3 State-Space Model
30.4 Measurement- and Time-Update Forms
30.5 Steady-State Filter
30.6 Smoothing Filters
30.7 Ensemble Kalman Filter
30.8 Nonlinear Filtering
30.9 Commentaries and Discussion
Problems
References
31 Maximum Likelihood
31.1 Problem Formulation
31.2 Gaussian Distribution
31.3 Multinomial Distribution
31.4 Exponential Family of Distributions
31.5 Cramer–Rao Lower Bound
31.6 Model Selection
31.7 Commentaries and Discussion
Problems
31.A Derivation of the Cramer–Rao Bound
31.B Derivation of the AIC Formulation
31.C Derivation of the BIC Formulation
References
32 Expectation Maximization
32.1 Motivation
32.2 Derivation of the EM Algorithm
32.3 Gaussian Mixture Models
32.4 Bernoulli Mixture Models
32.5 Commentaries and Discussion
Problems
32.A Exponential Mixture Models
References
33 Predictive Modeling
33.1 Posterior Distributions
33.2 Laplace Method
33.3 Markov Chain Monte Carlo Method
33.4 Commentaries and Discussion
Problems
References
34 Expectation Propagation
34.1 Factored Representation
34.2 Gaussian Sites
34.3 Exponential Sites
34.4 Assumed Density Filtering
34.5 Commentaries and Discussion
Problems
References
35 Particle Filters
35.1 Data Model
35.2 Importance Sampling
35.3 Particle Filter Implementations
35.4 Commentaries and Discussion
Problems
References
36 Variational Inference
36.1 Evaluating Evidences
36.2 Evaluating Posterior Distributions
36.3 Mean-Field Approximation
36.4 Exponential Conjugate Models
36.5 Maximizing the ELBO
36.6 Stochastic Gradient Solution
36.7 Black Box Inference
36.8 Commentaries and Discussion
Problems
References
37 Latent Dirichlet Allocation
37.1 Generative Model
37.2 Coordinate-Ascent Solution
37.3 Maximizing the ELBO
37.4 Estimating Model Parameters
37.5 Commentaries and Discussion
Problems
References
38 Hidden Markov Models
38.1 Gaussian Mixture Models
38.2 Markov Chains
38.3 Forward–Backward Recursions
38.4 Validation and Prediction Tasks
38.5 Commentaries and Discussion
Problems
References
39 Decoding Hidden Markov Models
39.1 Decoding States
39.2 Decoding Transition Probabilities
39.3 Normalization and Scaling
39.4 Viterbi Algorithm
39.5 EM Algorithm for Dependent Observations
39.6 Commentaries and Discussion
Problems
References
40 Independent Component Analysis
40.1 Problem Formulation
40.2 Maximum-Likelihood Formulation
40.3 Mutual Information Formulation
40.4 Maximum Kurtosis Formulation
40.5 Projection Pursuit
40.6 Commentaries and Discussion
Problems
References
41 Bayesian Networks
41.1 Curse of Dimensionality
41.2 Probabilistic Graphical Models
41.3 Active and Blocked Pathways
41.4 Conditional Independence Relations
41.5 Commentaries and Discussion
Problems
References
42 Inference over Graphs
42.1 Probabilistic Inference
42.2 Inference by Enumeration
42.3 Inference by Variable Elimination
42.4 Chow–Liu Algorithm
42.5 Graphical LASSO
42.6 Learning Graph Parameters
42.7 Commentaries and Discussion
Problems
References
43 Undirected Graphs
43.1 Cliques and Potentials
43.2 Representation Theorem
43.3 Factor Graphs
43.4 Message-Passing Algorithms
43.5 Commentaries and Discussion
Problems
43.A Proof of the Hammersley–Clifford Theorem
43.B Equivalence of Markovian Properties
References
44 Markov Decision Processes
44.1 MDP Model
44.2 Discounted Rewards
44.3 Policy Evaluation
44.4 Linear Function Approximation
44.5 Commentaries and Discussion
Problems
References
45 Value and Policy Iterations
45.1 Value Iteration
45.2 Policy Iteration
45.3 Partially Observable MDP
45.4 Commentaries and Discussion
Problems
45.A Optimal Policy and State–Action Values
45.B Convergence of Value Iteration
45.C Proof of ε-Optimality
45.D Convergence of Policy Iteration
45.E Piecewise Linear Property
45.F Bellman Principle of Optimality
References
46 Temporal Difference Learning
46.1 Model-Based Learning
46.2 Monte Carlo Policy Evaluation
46.3 TD(0) Algorithm
46.4 Look-Ahead TD Algorithm
46.5 TD(λ) Algorithm
46.6 True Online TD(λ) Algorithm
46.7 Off-Policy Learning
46.8 Commentaries and Discussion
Problems
46.A Useful Convergence Result
46.B Convergence of TD(0) Algorithm
46.C Convergence of TD(λ) Algorithm
46.D Equivalence of Offline Implementations
References
47 Q-Learning
47.1 SARSA(0) Algorithm
47.2 Look-Ahead SARSA Algorithm
47.3 SARSA(λ) Algorithm
47.4 Off-Policy Learning
47.5 Optimal Policy Extraction
47.6 Q-Learning Algorithm
47.7 Exploration versus Exploitation
47.8 Q-Learning with Replay Buffer
47.9 Double Q-Learning
47.10 Commentaries and Discussion
Problems
47.A Convergence of SARSA(0) Algorithm
47.B Convergence of Q-Learning Algorithm
References
48 Value Function Approximation
48.1 Stochastic Gradient TD-Learning
48.2 Least-Squares TD-Learning
48.3 Projected Bellman Learning
48.4 SARSA Methods
48.5 Deep Q-Learning
48.6 Commentaries and Discussion
Problems
References
49 Policy Gradient Methods
49.1 Policy Model
49.2 Finite-Difference Method
49.3 Score Function
49.4 Objective Functions
49.5 Policy Gradient Theorem
49.6 Actor–Critic Algorithms
49.7 Natural Gradient Policy
49.8 Trust Region Policy Optimization
49.9 Deep Reinforcement Learning
49.10 Soft Learning
49.11 Commentaries and Discussion
Problems
49.A Proof of Policy Gradient Theorem
49.B Proof of Consistency Theorem
References
Author Index
Subject Index

Inference and Learning from Data, Volume II

This extraordinary three-volume work, written in an engaging and rigorous style by a world authority in the field, provides an accessible, comprehensive introduction to the full spectrum of mathematical and statistical techniques underpinning contemporary methods in data-driven learning and inference. This second volume, Inference, builds on the foundational topics established in Volume I to introduce students to techniques for inferring unknown variables and quantities, including Bayesian inference, Markov chain Monte Carlo methods, maximum likelihood, variational inference, hidden Markov models, Bayesian networks, and reinforcement learning. A consistent structure and pedagogy is employed throughout this volume to reinforce student understanding, with over 350 end-of-chapter problems (including solutions for instructors), 180 solved examples, almost 200 figures, datasets, and downloadable Matlab code. Supported by sister volumes Foundations and Learning, and unique in its scale and depth, this textbook sequence is ideal for early-career researchers and graduate students across many courses in signal processing, machine learning, statistical analysis, data science, and inference.

Ali H. Sayed is Professor and Dean of Engineering at École Polytechnique Fédérale de Lausanne (EPFL), Switzerland. He has also served as Distinguished Professor and Chairman of Electrical Engineering at the University of California, Los Angeles (UCLA), USA, and as President of the IEEE Signal Processing Society. He is a member of the US National Academy of Engineering (NAE) and The World Academy of Sciences (TWAS), and a recipient of several awards, including the 2022 IEEE Fourier Award and the 2020 IEEE Norbert Wiener Society Award. He is a Fellow of the IEEE, EURASIP, and AAAS.

Inference and Learning from Data
Volume II: Inference

Ali H. Sayed
École Polytechnique Fédérale de Lausanne
University of California at Los Angeles

Shaftesbury Road, Cambridge CB2 8EA, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
314–321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre, New Delhi – 110025, India
103 Penang Road, #05–06/07, Visioncrest Commercial, Singapore 238467

Cambridge University Press is part of Cambridge University Press & Assessment, a department of the University of Cambridge. We share the University’s mission to contribute to society through the pursuit of education, learning and research at the highest international levels of excellence.

www.cambridge.org
Information on this title: www.cambridge.org/highereducation/isbn/9781009218269
DOI: 10.1017/9781009218245

© Ali H. Sayed 2023

This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press & Assessment.

First published 2023
Printed in the United Kingdom by Bell and Bain Ltd
A catalogue record for this publication is available from the British Library

ISBN 978-1-009-21810-8 Hardback (3 Volume Set)
ISBN 978-1-009-21812-2 Hardback (Volume I)
ISBN 978-1-009-21826-9 Hardback (Volume II)
ISBN 978-1-009-21828-3 Hardback (Volume III)

Additional resources for this publication at www.cambridge.org/sayed-vol2.

Cambridge University Press & Assessment has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this publication and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.

In loving memory of my parents

Contents

VOLUME I: FOUNDATIONS

Preface: P.1 Emphasis on Foundations; P.2 Glimpse of History; P.3 Organization of the Text; P.4 How to Use the Text; P.5 Simulation Datasets; P.6 Acknowledgments
Notation

1 Matrix Theory: 1.1 Symmetric Matrices; 1.2 Positive-Definite Matrices; 1.3 Range Spaces and Nullspaces; 1.4 Schur Complements; 1.5 Cholesky Factorization; 1.6 QR Decomposition; 1.7 Singular Value Decomposition; 1.8 Square-Root Matrices; 1.9 Kronecker Products; 1.10 Vector and Matrix Norms; 1.11 Perturbation Bounds on Eigenvalues; 1.12 Stochastic Matrices; 1.13 Complex-Valued Matrices; 1.14 Commentaries and Discussion; Problems; 1.A Proof of Spectral Theorem; 1.B Constructive Proof of SVD; References
2 Vector Differentiation: 2.1 Gradient Vectors; 2.2 Hessian Matrices; 2.3 Matrix Differentiation; 2.4 Commentaries and Discussion; Problems; References
3 Random Variables: 3.1 Probability Density Functions; 3.2 Mean and Variance; 3.3 Dependent Random Variables; 3.4 Random Vectors; 3.5 Properties of Covariance Matrices; 3.6 Illustrative Applications; 3.7 Complex-Valued Variables; 3.8 Commentaries and Discussion; Problems; 3.A Convergence of Random Variables; 3.B Concentration Inequalities; References
4 Gaussian Distribution: 4.1 Scalar Gaussian Variables; 4.2 Vector Gaussian Variables; 4.3 Useful Gaussian Manipulations; 4.4 Jointly Distributed Gaussian Variables; 4.5 Gaussian Processes; 4.6 Circular Gaussian Distribution; 4.7 Commentaries and Discussion; Problems; References
5 Exponential Distributions: 5.1 Definition; 5.2 Special Cases; 5.3 Useful Properties; 5.4 Conjugate Priors; 5.5 Commentaries and Discussion; Problems; 5.A Derivation of Properties; References
6 Entropy and Divergence: 6.1 Information and Entropy; 6.2 Kullback–Leibler Divergence; 6.3 Maximum Entropy Distribution; 6.4 Moment Matching; 6.5 Fisher Information Matrix; 6.6 Natural Gradients; 6.7 Evidence Lower Bound; 6.8 Commentaries and Discussion; Problems; References
7 Random Processes: 7.1 Stationary Processes; 7.2 Power Spectral Density; 7.3 Spectral Factorization; 7.4 Commentaries and Discussion; Problems; References
8 Convex Functions: 8.1 Convex Sets; 8.2 Convexity; 8.3 Strict Convexity; 8.4 Strong Convexity; 8.5 Hessian Matrix Conditions; 8.6 Subgradient Vectors; 8.7 Jensen Inequality; 8.8 Conjugate Functions; 8.9 Bregman Divergence; 8.10 Commentaries and Discussion; Problems; References
9 Convex Optimization: 9.1 Convex Optimization Problems; 9.2 Equality Constraints; 9.3 Motivating the KKT Conditions; 9.4 Projection onto Convex Sets; 9.5 Commentaries and Discussion; Problems; References
10 Lipschitz Conditions: 10.1 Mean-Value Theorem; 10.2 δ-Smooth Functions; 10.3 Commentaries and Discussion; Problems; References
11 Proximal Operator: 11.1 Definition and Properties; 11.2 Proximal Point Algorithm; 11.3 Proximal Gradient Algorithm; 11.4 Convergence Results; 11.5 Douglas–Rachford Algorithm; 11.6 Commentaries and Discussion; Problems; 11.A Convergence under Convexity; 11.B Convergence under Strong Convexity; References
12 Gradient-Descent Method: 12.1 Empirical and Stochastic Risks; 12.2 Conditions on Risk Function; 12.3 Constant Step Sizes; 12.4 Iteration-Dependent Step-Sizes; 12.5 Coordinate-Descent Method; 12.6 Alternating Projection Algorithm; 12.7 Commentaries and Discussion; Problems; 12.A Zeroth-Order Optimization; References
13 Conjugate Gradient Method: 13.1 Linear Systems of Equations; 13.2 Nonlinear Optimization; 13.3 Convergence Analysis; 13.4 Commentaries and Discussion; Problems; References
14 Subgradient Method: 14.1 Subgradient Algorithm; 14.2 Conditions on Risk Function; 14.3 Convergence Behavior; 14.4 Pocket Variable; 14.5 Exponential Smoothing; 14.6 Iteration-Dependent Step Sizes; 14.7 Coordinate-Descent Algorithms; 14.8 Commentaries and Discussion; Problems; 14.A Deterministic Inequality Recursion; References
15 Proximal and Mirror-Descent Methods: 15.1 Proximal Gradient Method; 15.2 Projection Gradient Method; 15.3 Mirror-Descent Method; 15.4 Comparison of Convergence Rates; 15.5 Commentaries and Discussion; Problems; References
16 Stochastic Optimization: 16.1 Stochastic Gradient Algorithm; 16.2 Stochastic Subgradient Algorithm; 16.3 Stochastic Proximal Gradient Algorithm; 16.4 Gradient Noise; 16.5 Regret Analysis; 16.6 Commentaries and Discussion; Problems; 16.A Switching Expectation and Differentiation; References
17 Adaptive Gradient Methods: 17.1 Motivation; 17.2 AdaGrad Algorithm; 17.3 RMSprop Algorithm; 17.4 ADAM Algorithm; 17.5 Momentum Acceleration Methods; 17.6 Federated Learning; 17.7 Commentaries and Discussion; Problems; 17.A Regret Analysis for ADAM; References
18 Gradient Noise: 18.1 Motivation; 18.2 Smooth Risk Functions; 18.3 Gradient Noise for Smooth Risks; 18.4 Nonsmooth Risk Functions; 18.5 Gradient Noise for Nonsmooth Risks; 18.6 Commentaries and Discussion; Problems; 18.A Averaging over Mini-Batches; 18.B Auxiliary Variance Result; References
19 Convergence Analysis I: Stochastic Gradient Algorithms: 19.1 Problem Setting; 19.2 Convergence under Uniform Sampling; 19.3 Convergence of Mini-Batch Implementation; 19.4 Convergence under Vanishing Step Sizes; 19.5 Convergence under Random Reshuffling; 19.6 Convergence under Importance Sampling; 19.7 Convergence of Stochastic Conjugate Gradient; 19.8 Commentaries and Discussion; Problems; 19.A Stochastic Inequality Recursion; 19.B Proof of Theorem 19.5; References
20 Convergence Analysis II: Stochastic Subgradient Algorithms: 20.1 Problem Setting; 20.2 Convergence under Uniform Sampling; 20.3 Convergence with Pocket Variables; 20.4 Convergence with Exponential Smoothing; 20.5 Convergence of Mini-Batch Implementation; 20.6 Convergence under Vanishing Step Sizes; 20.7 Commentaries and Discussion; Problems; References
21 Convergence Analysis III: Stochastic Proximal Algorithms: 21.1 Problem Setting; 21.2 Convergence under Uniform Sampling; 21.3 Convergence of Mini-Batch Implementation; 21.4 Convergence under Vanishing Step Sizes; 21.5 Stochastic Projection Gradient; 21.6 Mirror-Descent Algorithm; 21.7 Commentaries and Discussion; Problems; References
22 Variance-Reduced Methods I: Uniform Sampling: 22.1 Problem Setting; 22.2 Naïve Stochastic Gradient Algorithm; 22.3 Stochastic Average-Gradient Algorithm (SAGA); 22.4 Stochastic Variance-Reduced Gradient Algorithm (SVRG); 22.5 Nonsmooth Risk Functions; 22.6 Commentaries and Discussion; Problems; 22.A Proof of Theorem 22.2; 22.B Proof of Theorem 22.3; References
23 Variance-Reduced Methods II: Random Reshuffling: 23.1 Amortized Variance-Reduced Gradient Algorithm (AVRG); 23.2 Evolution of Memory Variables; 23.3 Convergence of SAGA; 23.4 Convergence of AVRG; 23.5 Convergence of SVRG; 23.6 Nonsmooth Risk Functions; 23.7 Commentaries and Discussion; Problems; 23.A Proof of Lemma 23.3; 23.B Proof of Lemma 23.4; 23.C Proof of Theorem 23.1; 23.D Proof of Lemma 23.5; 23.E Proof of Theorem 23.2; References
24 Nonconvex Optimization: 24.1 First- and Second-Order Stationarity; 24.2 Stochastic Gradient Optimization; 24.3 Convergence Behavior; 24.4 Commentaries and Discussion; Problems; 24.A Descent in the Large Gradient Regime; 24.B Introducing a Short-Term Model; 24.C Descent Away from Strict Saddle Points; 24.D Second-Order Convergence Guarantee; References
25 Decentralized Optimization I: Primal Methods: 25.1 Graph Topology; 25.2 Weight Matrices; 25.3 Aggregate and Local Risks; 25.4 Incremental, Consensus, and Diffusion; 25.5 Formal Derivation as Primal Methods; 25.6 Commentaries and Discussion; Problems; 25.A Proof of Lemma 25.1; 25.B Proof of Property (25.71); 25.C Convergence of Primal Algorithms; References
26 Decentralized Optimization II: Primal–Dual Methods: 26.1 Motivation; 26.2 EXTRA Algorithm; 26.3 EXACT Diffusion Algorithm; 26.4 Distributed Inexact Gradient Algorithm; 26.5 Augmented Decentralized Gradient Method; 26.6 ATC Tracking Method; 26.7 Unified Decentralized Algorithm; 26.8 Convergence Performance; 26.9 Dual Method; 26.10 Decentralized Nonconvex Optimization; 26.11 Commentaries and Discussion; Problems; 26.A Convergence of Primal–Dual Algorithms; References

Author Index
Subject Index

VOLUME II: INFERENCE

Preface: P.1 Emphasis on Foundations; P.2 Glimpse of History; P.3 Organization of the Text; P.4 How to Use the Text; P.5 Simulation Datasets; P.6 Acknowledgments
Notation

27 Mean-Square-Error Inference: 27.1 Inference without Observations; 27.2 Inference with Observations; 27.3 Gaussian Random Variables; 27.4 Bias–Variance Relation; 27.5 Commentaries and Discussion; Problems; 27.A Circular Gaussian Distribution; References
28 Bayesian Inference: 28.1 Bayesian Formulation; 28.2 Maximum A-Posteriori Inference; 28.3 Bayes Classifier; 28.4 Logistic Regression Inference; 28.5 Discriminative and Generative Models; 28.6 Commentaries and Discussion; Problems; References
29 Linear Regression: 29.1 Regression Model; 29.2 Centering and Augmentation; 29.3 Vector Estimation; 29.4 Linear Models; 29.5 Data Fusion; 29.6 Minimum-Variance Unbiased Estimation; 29.7 Commentaries and Discussion; Problems; 29.A Consistency of Normal Equations; References
30 Kalman Filter: 30.1 Uncorrelated Observations; 30.2 Innovations Process; 30.3 State-Space Model; 30.4 Measurement- and Time-Update Forms; 30.5 Steady-State Filter; 30.6 Smoothing Filters; 30.7 Ensemble Kalman Filter; 30.8 Nonlinear Filtering; 30.9 Commentaries and Discussion; Problems; References
31 Maximum Likelihood: 31.1 Problem Formulation; 31.2 Gaussian Distribution; 31.3 Multinomial Distribution; 31.4 Exponential Family of Distributions; 31.5 Cramer–Rao Lower Bound; 31.6 Model Selection; 31.7 Commentaries and Discussion; Problems; 31.A Derivation of the Cramer–Rao Bound; 31.B Derivation of the AIC Formulation; 31.C Derivation of the BIC Formulation; References
32 Expectation Maximization: 32.1 Motivation; 32.2 Derivation of the EM Algorithm; 32.3 Gaussian Mixture Models; 32.4 Bernoulli Mixture Models; 32.5 Commentaries and Discussion; Problems; 32.A Exponential Mixture Models; References
33 Predictive Modeling: 33.1 Posterior Distributions; 33.2 Laplace Method; 33.3 Markov Chain Monte Carlo Method; 33.4 Commentaries and Discussion; Problems; References
34 Expectation Propagation: 34.1 Factored Representation; 34.2 Gaussian Sites; 34.3 Exponential Sites; 34.4 Assumed Density Filtering; 34.5 Commentaries and Discussion; Problems; References
35 Particle Filters: 35.1 Data Model; 35.2 Importance Sampling; 35.3 Particle Filter Implementations; 35.4 Commentaries and Discussion; Problems; References
36 Variational Inference: 36.1 Evaluating Evidences; 36.2 Evaluating Posterior Distributions; 36.3 Mean-Field Approximation; 36.4 Exponential Conjugate Models; 36.5 Maximizing the ELBO; 36.6 Stochastic Gradient Solution; 36.7 Black Box Inference; 36.8 Commentaries and Discussion; Problems; References
37 Latent Dirichlet Allocation: 37.1 Generative Model; 37.2 Coordinate-Ascent Solution; 37.3 Maximizing the ELBO; 37.4 Estimating Model Parameters; 37.5 Commentaries and Discussion; Problems; References
38 Hidden Markov Models: 38.1 Gaussian Mixture Models; 38.2 Markov Chains; 38.3 Forward–Backward Recursions; 38.4 Validation and Prediction Tasks; 38.5 Commentaries and Discussion; Problems; References
39 Decoding Hidden Markov Models: 39.1 Decoding States; 39.2 Decoding Transition Probabilities; 39.3 Normalization and Scaling; 39.4 Viterbi Algorithm; 39.5 EM Algorithm for Dependent Observations; 39.6 Commentaries and Discussion; Problems; References
40 Independent Component Analysis: 40.1 Problem Formulation; 40.2 Maximum-Likelihood Formulation; 40.3 Mutual Information Formulation; 40.4 Maximum Kurtosis Formulation; 40.5 Projection Pursuit; 40.6 Commentaries and Discussion; Problems; References
41 Bayesian Networks: 41.1 Curse of Dimensionality; 41.2 Probabilistic Graphical Models; 41.3 Active and Blocked Pathways; 41.4 Conditional Independence Relations; 41.5 Commentaries and Discussion; Problems; References
42 Inference over Graphs: 42.1 Probabilistic Inference; 42.2 Inference by Enumeration; 42.3 Inference by Variable Elimination; 42.4 Chow–Liu Algorithm; 42.5 Graphical LASSO; 42.6 Learning Graph Parameters; 42.7 Commentaries and Discussion; Problems; References
43 Undirected Graphs: 43.1 Cliques and Potentials; 43.2 Representation Theorem; 43.3 Factor Graphs; 43.4 Message-Passing Algorithms; 43.5 Commentaries and Discussion; Problems; 43.A Proof of the Hammersley–Clifford Theorem; 43.B Equivalence of Markovian Properties; References
44 Markov Decision Processes: 44.1 MDP Model; 44.2 Discounted Rewards; 44.3 Policy Evaluation; 44.4 Linear Function Approximation; 44.5 Commentaries and Discussion; Problems; References
45 Value and Policy Iterations: 45.1 Value Iteration; 45.2 Policy Iteration; 45.3 Partially Observable MDP; 45.4 Commentaries and Discussion; Problems; 45.A Optimal Policy and State–Action Values; 45.B Convergence of Value Iteration; 45.C Proof of ε-Optimality; 45.D Convergence of Policy Iteration; 45.E Piecewise Linear Property; 45.F Bellman Principle of Optimality; References
46 Temporal Difference Learning: 46.1 Model-Based Learning; 46.2 Monte Carlo Policy Evaluation; 46.3 TD(0) Algorithm; 46.4 Look-Ahead TD Algorithm; 46.5 TD(λ) Algorithm; 46.6 True Online TD(λ) Algorithm; 46.7 Off-Policy Learning; 46.8 Commentaries and Discussion; Problems; 46.A Useful Convergence Result; 46.B Convergence of TD(0) Algorithm; 46.C Convergence of TD(λ) Algorithm; 46.D Equivalence of Offline Implementations; References
47 Q-Learning: 47.1 SARSA(0) Algorithm; 47.2 Look-Ahead SARSA Algorithm; 47.3 SARSA(λ) Algorithm; 47.4 Off-Policy Learning; 47.5 Optimal Policy Extraction; 47.6 Q-Learning Algorithm; 47.7 Exploration versus Exploitation; 47.8 Q-Learning with Replay Buffer; 47.9 Double Q-Learning; 47.10 Commentaries and Discussion; Problems; 47.A Convergence of SARSA(0) Algorithm; 47.B Convergence of Q-Learning Algorithm; References
48 Value Function Approximation: 48.1 Stochastic Gradient TD-Learning; 48.2 Least-Squares TD-Learning; 48.3 Projected Bellman Learning; 48.4 SARSA Methods; 48.5 Deep Q-Learning; 48.6 Commentaries and Discussion; Problems; References
49 Policy Gradient Methods: 49.1 Policy Model; 49.2 Finite-Difference Method; 49.3 Score Function; 49.4 Objective Functions; 49.5 Policy Gradient Theorem; 49.6 Actor–Critic Algorithms; 49.7 Natural Gradient Policy; 49.8 Trust Region Policy Optimization; 49.9 Deep Reinforcement Learning; 49.10 Soft Learning; 49.11 Commentaries and Discussion; Problems; 49.A Proof of Policy Gradient Theorem; 49.B Proof of Consistency Theorem; References

Author Index
Subject Index

VOLUME III: LEARNING

Preface: P.1 Emphasis on Foundations; P.2 Glimpse of History; P.3 Organization of the Text; P.4 How to Use the Text; P.5 Simulation Datasets; P.6 Acknowledgments
Notation

50 Least-Squares Problems: 50.1 Motivation; 50.2 Normal Equations; 50.3 Recursive Least-Squares; 50.4 Implicit Bias; 50.5 Commentaries and Discussion; Problems; 50.A Minimum-Norm Solution; 50.B Equivalence in Linear Estimation; 50.C Extended Least-Squares; References
51 Regularization: 51.1 Three Challenges; 51.2 ℓ2-Regularization; 51.3 ℓ1-Regularization; 51.4 Soft Thresholding; 51.5 Commentaries and Discussion; Problems; 51.A Constrained Formulations for Regularization; 51.B Expression for LASSO Solution; References
52 Nearest-Neighbor Rule: 52.1 Bayes Classifier; 52.2 k-NN Classifier; 52.3 Performance Guarantee; 52.4 k-Means Algorithm; 52.5 Commentaries and Discussion; Problems; 52.A Performance of the NN Classifier; References
53 Self-Organizing Maps: 53.1 Grid Arrangements; 53.2 Training Algorithm; 53.3 Visualization; 53.4 Commentaries and Discussion; Problems; References
54 Decision Trees: 54.1 Trees and Attributes; 54.2 Selecting Attributes; 54.3 Constructing a Tree; 54.4 Commentaries and Discussion; Problems; References
55 Naïve Bayes Classifier: 55.1 Independence Condition; 55.2 Modeling the Conditional Distribution; 55.3 Estimating the Priors; 55.4 Gaussian Naïve Classifier; 55.5 Commentaries and Discussion; Problems; References
56 Linear Discriminant Analysis: 56.1 Discriminant Functions; 56.2 Linear Discriminant Algorithm; 56.3 Minimum Distance Classifier; 56.4 Fisher Discriminant Analysis; 56.5 Commentaries and Discussion; Problems; References
57 Principal Component Analysis: 57.1 Data Preprocessing; 57.2 Dimensionality Reduction; 57.3 Subspace Interpretations; 57.4 Sparse PCA; 57.5 Probabilistic PCA; 57.6 Commentaries and Discussion; Problems; 57.A Maximum-Likelihood Solution; 57.B Alternative Optimization Problem; References
58 Dictionary Learning: 58.1 Learning Under Regularization; 58.2 Learning Under Constraints; 58.3 K-SVD Approach; 58.4 Nonnegative Matrix Factorization; 58.5 Commentaries and Discussion; Problems; 58.A Orthogonal Matching Pursuit; References
59 Logistic Regression: 59.1 Logistic Model; 59.2 Logistic Empirical Risk; 59.3 Multiclass Classification; 59.4 Active Learning; 59.5 Domain Adaptation; 59.6 Commentaries and Discussion; Problems; 59.A Generalized Linear Models; References
60 Perceptron: 60.1 Linear Separability; 60.2 Perceptron Empirical Risk; 60.3 Termination in Finite Steps; 60.4 Pocket Perceptron; 60.5 Commentaries and Discussion; Problems; 60.A Counting Theorem; 60.B Boolean Functions; References
61 Support Vector Machines: 61.1 SVM Empirical Risk; 61.2 Convex Quadratic Program; 61.3 Cross Validation; 61.4 Commentaries and Discussion; Problems; References
62 Bagging and Boosting: 62.1 Bagging Classifiers; 62.2 AdaBoost Classifier; 62.3 Gradient Boosting; 62.4 Commentaries and Discussion; Problems; References
63 Kernel Methods: 63.1 Motivation; 63.2 Nonlinear Mappings; 63.3 Polynomial and Gaussian Kernels; 63.4 Kernel-Based Perceptron; 63.5 Kernel-Based SVM; 63.6 Kernel-Based Ridge Regression; 63.7 Kernel-Based Learning; 63.8 Kernel PCA; 63.9 Inference under Gaussian Processes; 63.10 Commentaries and Discussion; Problems; References
64 Generalization Theory: 64.1 Curse of Dimensionality; 64.2 Empirical Risk Minimization; 64.3 Generalization Ability; 64.4 VC Dimension; 64.5 Bias–Variance Trade-off; 64.6 Surrogate Risk Functions; 64.7 Commentaries and Discussion; Problems; 64.A VC Dimension for Linear Classifiers; 64.B Sauer Lemma; 64.C Vapnik–Chervonenkis Bound; 64.D Rademacher Complexity; References
65 Feedforward Neural Networks: 65.1 Activation Functions; 65.2 Feedforward Networks; 65.3 Regression and Classification; 65.4 Calculation of Gradient Vectors; 65.5 Backpropagation Algorithm; 65.6 Dropout Strategy; 65.7 Regularized Cross-Entropy Risk; 65.8 Slowdown in Learning; 65.9 Batch Normalization; 65.10 Commentaries and Discussion; Problems; 65.A Derivation of Batch Normalization Algorithm; References
66 Deep Belief Networks: 66.1 Pre-Training Using Stacked Autoencoders; 66.2 Restricted Boltzmann Machines; 66.3 Contrastive Divergence; 66.4 Pre-Training using Stacked RBMs; 66.5 Deep Generative Model; 66.6 Commentaries and Discussion; Problems; References
67 Convolutional Networks: 67.1 Correlation Layers; 67.2 Pooling; 67.3 Full Network; 67.4 Training Algorithm; 67.5 Commentaries and Discussion; Problems; 67.A Derivation of Training Algorithm; References
68 Generative Networks: 68.1 Variational Autoencoders; 68.2 Training Variational Autoencoders; 68.3 Conditional Variational Autoencoders; 68.4 Generative Adversarial Networks; 68.5 Training of GANs; 68.6 Conditional GANs; 68.7 Commentaries and Discussion; Problems; References
69 Recurrent Networks: 69.1 Recurrent Neural Networks; 69.2 Backpropagation Through Time; 69.3 Bidirectional Recurrent Networks; 69.4 Vanishing and Exploding Gradients; 69.5 Long Short-Term Memory Networks; 69.6 Bidirectional LSTMs; 69.7 Gated Recurrent Units; 69.8 Commentaries and Discussion; Problems; References
70 Explainable Learning: 70.1 Classifier Model; 70.2 Sensitivity Analysis; 70.3 Gradient X Input Analysis; 70.4 Relevance Analysis; 70.5 Commentaries and Discussion; Problems; References
71 Adversarial Attacks: 71.1 Types of Attacks; 71.2 Fast Gradient Sign Method; 71.3 Jacobian Saliency Map Approach; 71.4 DeepFool Technique; 71.5 Black-Box Attacks; 71.6 Defense Mechanisms; 71.7 Commentaries and Discussion; Problems; References
72 Meta Learning: 72.1 Network Model; 72.2 Siamese Networks; 72.3 Relation Networks; 72.4 Exploration Models; 72.5 Commentaries and Discussion; Problems; 72.A Matching Networks; 72.B Prototypical Networks; References

Author Index
Subject Index

Preface

Learning directly from data is critical to a host of disciplines in engineering and the physical, social, and life sciences. Modern society is literally driven by an interconnected web of data exchanges at rates unseen before, and it relies heavily on decisions inferred from patterns in data. There is nothing fundamentally wrong with this approach, except that the inference and learning methodologies need to be anchored on solid foundations, be fair and reliable in their conclusions, and be robust to unwarranted imperfections and malicious interference.

P.1 EMPHASIS ON FOUNDATIONS

Given the explosive interest in data-driven learning methods, it is not uncommon to encounter claims of superior designs in the literature that are substantiated mainly by sporadic simulations and the potential for “life-changing” applications rather than by an approach that is founded on the well-tested scientific principle to inquiry. For this reason, one of the main objectives of this text is to highlight, in a unified and formal manner, the firm mathematical and statistical pillars that underlie many popular data-driven learning and inference methods. This is a nontrivial task given the wide scope of techniques that exist, and which have often been motivated independently of each other. It is nevertheless important for practitioners and researchers alike to remain cognizant of the common foundational threads that run across these methods. It is also imperative that progress in the domain remains grounded on firm theory. As the aphorism often attributed to Lewin (1945) states, “there is nothing more practical than a good theory.” According to Bedeian (2016), this saying has an even older history.

Rigorous data analysis, and conclusions derived from experimentation and theory, have been driving science since time immemorial. As reported by Heath (1912), the Greek scientist Archimedes of Syracuse devised the now famous Archimedes’ Principle about the volume displaced by an immersed object from observing how the level of water in a tub rose when he sat in it. In the account by Hall (1970), Gauss’ formulation of the least-squares problem was driven by his desire to predict the future location of the planetoid Ceres from observations of its location over 41 prior days. There are numerous similar examples by notable scientists where experimentation led to hypotheses and from there to substantiated theories and well-founded design methodologies. Science is also full of progress in the reverse direction, where theories have been developed first to be validated only decades later through experimentation and data analysis. Einstein (1916) postulated the existence of gravitational waves over 100 years ago. It took until 2016 to detect them! Regardless of which direction one follows, experimentation to theory or the reverse, the match between solid theory and rigorous data analysis has enabled science and humanity to march confidently toward the immense progress that permeates our modern world today.

For similar reasons, data-driven learning and inference should be developed with strong theoretical guarantees. Otherwise, the confidence in their reliability can be shaken if there is over-reliance on “proof by simulation or experience.” Whenever possible, we explain the underlying models and statistical theories for a large number of methods covered in this text. A good grasp of these theories will enable practitioners and researchers to devise variations with greater mastery. We weave through the foundations in a coherent and cohesive manner, and show how the various methods blend together techniques that may appear decoupled but are actually facets of the same common methodology. In this process, we discover that a good number of techniques are well-grounded and meet proven performance guarantees, while other methods are driven by ingenious insights but lack solid justifications and cannot be guaranteed to be “fail-proof.”

Researchers on learning and inference methods are of course aware of the limitations of some of their approaches, so much so that we encounter today many studies, for example, on the topic of “explainable machine learning.” The objective here is to understand why learning algorithms produce certain recommendations. While this is an important area of inquiry, it nevertheless highlights one interesting shift in paradigm. In the past, the emphasis would have been on designing inference methods that respond to the input data in certain desirable and controllable ways. Today, in many instances, the emphasis is to stick to the available algorithms (often, out of convenience) and try to understand or explain why they are responding in certain ways to the input!

Writing this text has been a rewarding journey that took me from the early days of statistical mathematical theory to the modern state of affairs in learning theory. One can only stand in awe at the wondrous ideas that have been introduced by notable researchers along this trajectory. At the same time, one observes with some concern an emerging trend in recent years where solid foundations receive less attention in lieu of “speed publishing” and over-reliance on “illustration by simulation.” This is of course not the norm and most researchers in the field stay honest to the scientific approach to inquiry and design. After concluding this comprehensive text, I stand humbled at the realization of “how little we know!” There are countless questions that remain open, and even for many of the questions that have been answered, their answers rely on assumptions or (over)simplifications. It is understandable that the complexity of the problems we face today has increased manifold, and ingenious approximations become necessary to enable tractable solutions.

P.2 GLIMPSE OF HISTORY

Reading through the text, the alert reader will quickly realize that the core foundations of modern-day machine learning, data analytics, and inference methods date back for at least two centuries, with contributions arising from a range of fields including mathematics, statistics, optimization theory, information theory, signal processing, communications, control, and computer science. For the benefit of the reader, I reproduce here with permission from IEEE some historical remarks from the editorial I published in Sayed (2018). I explained there that these disciplines have generated a string of “big ideas” that are driving today multi-faceted efforts in the age of “big data” and machine learning. Generations of students in the statistical sciences and engineering have been trained in the art of modeling, problem solving, and optimization. Their algorithms power everything from cell phones, to spacecraft, robotic explorers, imaging devices, automated systems, computing machines, and also recommender systems. These students mastered the foundations of their fields and have been well prepared to contribute to the explosive growth of data analysis and machine learning solutions. As the list below shows, many well-known engineering and statistical methods have actually been motivated by data-driven inquiries, even from times remote. The list is a tour of some older historical contributions, which is of course biased by my personal preferences and is not intended to be exhaustive. It is only meant to illustrate how concepts from statistics and the information sciences have always been at the center of promoting big ideas for data and machine learning. Readers will encounter these concepts in various chapters in the text. Readers will also encounter additional historical accounts in the concluding remarks of each chapter, and in particular comments on newer contributions and contributors.

Let me start with Gauss himself, who in 1795 at the young age of 18, was fitting lines and hyperplanes to astronomical data and invented the least-squares criterion for regression analysis – see the collection of his works in Gauss (1903). He even devised the recursive least-squares solution to address what was a “big” data problem for him at the time: He had to avoid tedious repeated calculations by hand as more observational data became available. What a wonderful big idea for a data-driven problem! Of course, Gauss had many other big ideas.

de Moivre (1730), Laplace (1812), and Lyapunov (1901) worked on the central limit theorem. The theorem deals with the limiting distribution of averages of “large” amounts of data. The result is also related to the law of “large” numbers, which even has the qualification “large” in its name. Again, big ideas motivated by “large” data problems.

Bayes (ca mid-1750s) and Laplace (1774) appear to have independently discovered the Bayes rule, which updates probabilities conditioned on observations – see the article by Bayes and Price (1763). The rule forms the backbone of much of statistical signal analysis, Bayes classifiers, Naïve classifiers, and Bayesian networks. Again, a big idea for data-driven inference.

Fourier (1822), whose tools are at the core of disciplines in the information sciences, developed the phenomenal Fourier representation for signals. It is meant to transform data from one domain to another to facilitate the extraction and visualization of information. A big transformative idea for data.

Forward to modern times. The fast Fourier transform (FFT) is another example of an algorithm driven by challenges posed by data size. Its modern version is due to Cooley and Tukey (1965). Their algorithm revolutionized the field of discrete-time signal processing, and FFT processors have become common components in many modern electronic devices. Even Gauss had a role to play here, having proposed an early version of the algorithm some 160 years before, again motivated by a data-driven problem while trying to fit astronomical data onto trigonometric polynomials. A big idea for a data-driven problem.

Closer to the core of statistical mathematical theory, both Kolmogorov (1939) and Wiener (1949) laid out the foundations of modern statistical signal analysis and optimal prediction methods. Their theories taught us how to extract information optimally from data, leading to further refinements by Wiener’s student Levinson (1947) and more dramatically by Kalman (1960). The innovations approach by Kailath (1968) exploited to great effect the concept of orthogonalization of the data and recursive constructions. The Kalman filter is applied across many domains today, including in financial analysis from market data. Kalman’s work was an outgrowth of the model-based approach to system theory advanced by Zadeh (1954). The concept of a recursive solution from streaming data was a novelty in Kalman’s filter; the same concept is commonplace today in most online learning techniques. Again, big ideas for recursive inference from data.

Cauchy (1847) early on, and Robbins and Monro (1951) a century later, developed the powerful gradient-descent method for root finding, which is also recursive in nature. Their techniques have grown to motivate huge advances in stochastic approximation theory. Notable contributions that followed include the work by Rosenblatt (1957) on the perceptron algorithm for single-layer networks, and the impactful delta rule by Widrow and Hoff (1960), widely known as the LMS algorithm in the signal processing literature. Subsequent work on multilayer neural networks grew out of the desire to increase the approximation power of single-layer networks, culminating with the backpropagation method of Werbos (1974). Many of these techniques form the backbone of modern learning algorithms. Again, big ideas for recursive online learning.

Shannon (1948a, b) contributed fundamental insights to data representation, sampling, coding, and communications. His concepts of entropy and information measure helped quantify the amount of uncertainty in data and are used, among other areas, in the design of decision trees for classification purposes and in driving learning algorithms for neural networks. Nyquist (1928) contributed to the understanding of data representations as well. Big ideas for data sampling and data manipulation.

Bellman (1957a, b), a towering system-theorist, introduced dynamic programming and the notion of the curse of dimensionality, both of which are core underpinnings of many results in learning theory, reinforcement learning, and the theory of Markov decision processes. Viterbi’s algorithm (1967) is one notable example of a dynamic programming solution, which has revolutionized communications and has also found applications in hidden Markov models widely used in speech recognition nowadays. Big ideas for conquering complex data problems by dividing them into simpler problems.

Kernel methods, building on foundational results by Mercer (1909) and Aronszajn (1950), have found widespread applications in learning theory since the mid-1960s with the introduction of the kernel perceptron algorithm. They have also been widely used in estimation theory by Parzen (1962), Kailath (1971), and others. Again, a big idea for learning from data.

Pearson and Fisher launched the modern field of mathematical statistical signal analysis with the introduction of methods such as principal component analysis (PCA) by Pearson (1901) and maximum likelihood and linear discriminant analysis by Fisher (1912, 1922, 1925). These methods are at the core of statistical signal processing. Pearson (1894, 1896) also had one of the earliest studies of fitting a mixture of Gaussian models to biological data. Mixture models have now become an important tool in modern learning algorithms. Big ideas for data-driven inference.

Markov (1913) introduced the formalism of Markov chains, which is widely used today as a powerful modeling tool in a variety of fields including word and speech recognition, handwriting recognition, natural language processing, spam filtering, gene analysis, and web search. Markov chains are also used in Google’s PageRank algorithm. Markov’s motivation was to study letter patterns in texts. He laboriously went through the first 20,000 letters of a classical Russian novel and counted pairs of vowels, consonants, vowels followed by a consonant, and consonants followed by a vowel. A “big” data problem for his time. Great ideas (and great patience) for data-driven inquiries.

And the list goes on, with many modern-day and ongoing contributions by statisticians, engineers, and computer scientists to network science, distributed processing, compressed sensing, randomized algorithms, optimization, multi-agent systems, intelligent systems, computational imaging, speech processing, forensics, computer visions, privacy and security, and so forth. We provide additional historical accounts about these contributions and contributors at the end of the chapters.

P.3 ORGANIZATION OF THE TEXT

The text is organized into three volumes, with a sizable number of problems and solved examples. The table of contents provides details on what is covered in each volume. Here we provide a condensed summary listing the three main themes:

1. (Volume I: Foundations). The first volume covers the foundations needed for a solid grasp of inference and learning methods. Many important topics are covered in this part, in a manner that prepares readers for the study of inference and learning methods in the second and third volumes. Topics include: matrix theory, linear algebra, random variables, Gaussian and exponential distributions, entropy and divergence, Lipschitz conditions, convexity, convex optimization, proximal operators, gradient-descent, mirror-descent, conjugate-gradient, subgradient methods, stochastic optimization, adaptive gradient methods, variance-reduced methods, distributed optimization, and nonconvex optimization. Interestingly enough, the following concepts occur time and again in all three volumes and the reader is well-advised to develop familiarity with them: convexity, sample mean and law of large numbers, Gaussianity, Bayes rule, entropy, Kullback–Leibler divergence, gradient-descent, least squares, regularization, and maximum-likelihood. The last three concepts are discussed in the initial chapters of the second volume.

2. (Volume II: Inference). The second volume covers inference methods. By “inference” we mean techniques that infer some unknown variable or quantity from observations. The difference we make between “inference” and “learning” in our treatment is that inference methods will target situations where some prior information is known about the underlying signal models or signal distributions (such as their joint probability density functions or generative models). The performance by many of these inference methods will be the ultimate goal that learning algorithms, studied in the third volume, will attempt to emulate. Topics covered here include: mean-square-error inference, Bayesian inference, maximum-likelihood estimation, expectation maximization, expectation propagation, Kalman filters, particle filters, posterior modeling and prediction, Markov chain Monte Carlo methods, sampling methods, variational inference, latent Dirichlet allocation, hidden Markov models, independent component analysis, Bayesian networks, inference over directed and undirected graphs, Markov decision processes, dynamic programming, and reinforcement learning.

3. (Volume III: Learning). The third volume covers learning methods. Here, again, we are interested in inferring some unknown variable or quantity from observations. The difference, however, is that the inference will now be solely data-driven, i.e., based on available data and not on any assumed knowledge about signal distributions or models. The designer is only given a collection of observations that arise from the underlying (unknown) distribution. New phenomena arise related to generalization power, overfitting, and underfitting depending on how representative the data is and how complex or simple the approximate models are. The target is to use the data to learn about the quantity of interest (its value or evolution). Topics covered here include: least-squares methods, regularization, nearest-neighbor rule, self-organizing maps, decision trees, naïve Bayes classifier, linear discriminant analysis, principal component analysis, dictionary learning, perceptron, support vector machines, bagging and boosting, kernel methods, Gaussian processes, generalization theory, feedforward neural networks, deep belief networks, convolutional networks, generative networks, recurrent networks, explainable learning, adversarial attacks, and meta learning.

Figure P.1 shows how various topics are grouped together in the text; the numbers in the boxes indicate the chapters where these subjects are covered. The figure can be read as follows. For example, instructors wishing to cover:

(a) Background material on linear algebra and matrix theory: they can use Chapters 1 and 2.
(b) Background material on random variables and probability theory: they can select from Chapters 3 through 7.
(c) Background material on convex functions and convex optimization: they can use Chapters 8 through 11.

Figure P.1 Organization of the text.


The three groupings (a)–(c) contain introductory core concepts that are needed for subsequent chapters. For instance, instructors wishing to cover gradient descent and iterative optimization techniques would then proceed to Chapters 12 through 15, while instructors wishing to cover stochastic optimization methods would use Chapters 16–24 and so forth. Figure P.2 provides a representation of the estimated dependencies among the chapters in the text. The chapters are color-coded depending on the volume they appear in. An arrow from Chapter a toward Chapter b implies that the material in the latter chapter benefits from the material in the earlier chapter. In principle, we should have added arrows from Chapter 1, which covers background material on matrix and linear algebra, into all other chapters. We ignored obvious links of this type to avoid crowding the figure.

P.4 HOW TO USE THE TEXT

Each chapter in the text consists of several blocks: (1) the main text where theory and results are presented, (2) a couple of solved examples to illustrate the main ideas and also to extend them, (3) comments at the end of the chapter providing a historical perspective and linking the references through a motivated timeline, (4) a list of problems of varying complexity, (5) appendices when necessary to cover some derivations or additional topics, and (6) references. In total, there are close to 470 solved examples and 1350 problems in the text. A solutions manual is available to instructors.

In the comments at the end of each chapter I list in boldface the life span of some influential scientists whose contributions have impacted the results discussed in the chapter. The dates of birth and death rely on several sources, including the MacTutor History of Mathematics Archive, Encyclopedia Britannica, Wikipedia, Porter and Ogilvie (2000), and Daintith (2008).

Several of the solved examples in the text involve computer simulations on datasets to illustrate the conclusions. The simulations, and several of the corresponding figures, were generated using the software program Matlab®, which is a registered trademark of MathWorks Inc., 24 Prime Park Way, Natick, MA 01760-1500, www.mathworks.com. The computer codes used to generate the figures are provided “as is” and without any guarantees. While these codes are useful for the instructional purposes of the book, they are not intended to be examples of full-blown or optimized designs; practitioners should use them at their own risk. We have made no attempts to optimize the codes, perfect them, or even check them for absolute accuracy. On the contrary, in order to keep the codes at a level that is easy to follow by students, we have often decided to sacrifice performance or even programming elegance in lieu of simplicity. Students can use the computer codes to run variations of the examples shown in the text.

Figure P.2 A diagram showing the approximate dependencies among the chapters in the text. The color scheme identifies chapters from the same volume, with the numbers inside the circles referring to the chapter numbers.


In principle, each volume could serve as the basis for a master-level graduate course, such as courses on Foundations of Data Science (volume I), Inference from Data (volume II), and Learning from Data (volume III). Once students master the foundational concepts covered in volume I (especially in Chapters 1–17), they will be able to grasp the topics from volumes II and III more confidently. Instructors need not cover volumes II and III in this sequence; the order can be switched depending on whether they desire to emphasize data-based learning over model-based inference or the reverse. Depending on the duration of each course, one can also consider covering subsets of each volume by focusing on particular subjects. The following grouping explains how chapters from the three volumes cover specific topics and could be used as reference material for several potential course offerings:

(1) (Core foundations, Chapters 1–11, Vol. I): matrix theory, linear algebra, random variables, Gaussian and exponential distributions, entropy and divergence, Lipschitz conditions, convexity, convex optimization, and proximal operators. These chapters can serve as the basis for an introductory course on foundational concepts for mastering data science.

(2) (Stochastic optimization, Chapters 12–26, Vol. I): gradient-descent, mirror-descent, conjugate-gradient, subgradient methods, stochastic optimization, adaptive gradient methods, variance-reduced methods, convergence analyses, distributed optimization, and nonconvex optimization. These chapters can serve as the basis for a course on stochastic optimization for both convex and nonconvex environments, with attention to performance and convergence analyses. Stochastic optimization is at the core of most modern learning techniques, and students will benefit greatly from a solid grasp of this topic.

(3) (Statistical or Bayesian inference, Chapters 27–37, 40, Vol. II): mean-square-error inference, Bayesian inference, maximum-likelihood estimation, expectation maximization, expectation propagation, Kalman filters, particle filters, posterior modeling and prediction, Markov chain Monte Carlo methods, sampling methods, variational inference, latent Dirichlet allocation, and independent component analysis. These chapters introduce students to optimal methods to extract information from data, under the assumption that the underlying probability distributions or models are known. In a sense, these chapters reveal limits of performance that future data-based learning methods, covered in subsequent chapters, will try to emulate when the models are not known.

(4) (Probabilistic graphical models, Chapters 38, 39, 41–43, Vol. II): hidden Markov models, Bayesian networks, inference over directed and undirected graphs, factor graphs, message passing, belief propagation, and graph learning. These chapters can serve as the basis for a course on Bayesian inference over graphs. Several methods and techniques are discussed along with supporting examples and algorithms.


(5) (Reinforcement learning, Chapters 44–49, Vol. II): Markov decision processes, dynamic programming, value and policy iterations, temporal difference learning, Q-learning, value function approximation, and policy gradient methods. These chapters can serve as the basis for a course on reinforcement learning. They cover many relevant techniques, illustrated by means of examples, and include performance and convergence analyses.

(6) (Data-driven and online learning, Chapters 50–64, Vol. III): least-squares methods, regularization, nearest-neighbor rule, self-organizing maps, decision trees, naïve Bayes classifier, linear discriminant analysis, principal component analysis, dictionary learning, perceptron, support vector machines, bagging and boosting, kernel methods, Gaussian processes, and generalization theory. These chapters cover a variety of methods for learning directly from data, including various methods for online learning from sequential data. The chapters also cover performance guarantees from statistical learning theory.

(7) (Neural networks, Chapters 65–72, Vol. III): feedforward neural networks, deep belief networks, convolutional networks, generative networks, recurrent networks, explainable learning, adversarial attacks, and meta learning. These chapters cover various architectures for neural networks and the respective algorithms for training them. The chapters also cover topics related to explainability and adversarial behavior over these networks.

The above groupings assume that students have been introduced to background material on matrix theory, random variables, entropy, convexity, and gradient-descent methods. One can, however, rearrange the groupings by designing stand-alone courses where the background material is included along with the other relevant chapters. By doing so, it is possible to devise various course offerings, covering themes such as stochastic optimization, online or sequential learning, probabilistic graphical models, reinforcement learning, neural networks, Bayesian machine learning, kernel methods, decentralized optimization, and so forth. Figure P.3 shows several suggested selections of topics from across the text, and the respective chapters, which can be used to construct courses with particular emphasis. Other selections are of course possible, depending on individual preferences and on the intended breadth and depth for the courses.

Figure P.3 Suggested selections of topics from across the text, which can be used to construct stand-alone courses with particular emphases. Other options are possible based on individual preferences.

P.5 SIMULATION DATASETS

In several examples in this work we run simulations that rely on publicly available datasets. The sources for these datasets are acknowledged in the appropriate locations in the text. Here we provide an aggregate summary for ease of reference:

(1) Iris dataset. This classical dataset contains information about the sepal length and width for three types of iris flowers: virginica, setosa, and versicolor. It was originally used by Fisher (1936) and is available at the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/iris. Actually, several of the datasets in our list are downloaded from this useful repository – see Dua and Graff (2019).

(2) MNIST dataset. This is a second popular dataset, which is useful for classifying handwritten digits. It was used in the work by LeCun et al. (1998) on document recognition. The data contains 60,000 labeled training examples and 10,000 labeled test examples for the digits 0 through 9. It can be downloaded from http://yann.lecun.com/exdb/mnist/.

(3) CIFAR-10 dataset. This dataset consists of color images that can belong to one of 10 classes: airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks. It is described by Krizhevsky (2009) and can be downloaded from www.cs.toronto.edu/~kriz/cifar.html.

(4) FBI crime dataset. This dataset contains statistics showing the burglary rates per 100,000 inhabitants for the period 1997–2016. The source of the data is the US Criminal Justice Information Services Division at the link https://ucr.fbi.gov/crime-in-the-u.s/2016/crime-in-the-u.s.-2016/tables/table-1.

(5) Sea level and global temperature changes dataset. The sea level dataset measures the change in sea level relative to the start of year 1993. There are 952 data points consisting of fractional year values. The source of the data is the NASA Goddard Space Flight Center at https://climate.nasa.gov/vital-signs/sea-level/. For information on how the data was generated, the reader may consult Beckley et al. (2017) and the report GSFC (2017). The temperature dataset measures changes in the global surface temperature relative to the average over the period 1951–1980. There are 139 measurements between the years 1880 and 2018. The source of the data is the NASA Goddard Institute for Space Studies (GISS) at https://climate.nasa.gov/vital-signs/global-temperature/.

(6) Breast cancer Wisconsin dataset. This dataset consists of 569 samples, with each sample corresponding to a benign or malignant cancer classification. It can be downloaded from the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic). For information on how the data was generated, the reader may consult Mangasarian, Street, and Wolberg (1995).

(7) Heart-disease Cleveland dataset. This dataset consists of 297 samples that belong to patients with and without heart disease. It is available on the UCI Machine Learning Repository and can be downloaded from https://archive.ics.uci.edu/ml/datasets/heart+Disease. The investigators responsible for the collection of the data are the four leading co-authors of the article by Detrano et al. (1989).


P.6 ACKNOWLEDGMENTS

A project of this magnitude is not possible without the support of a web of colleagues and students. I am indebted to all of them for their input at various stages of this project, either through feedback on earlier drafts or through conversations that deepened my understanding of the topics.

I am grateful to several of my former and current Ph.D. students and post-doctoral associates in no specific order: Stefan Vlaski, Kun Yuan, Bicheng Ying, Zaid Towfic, Jianshu Chen, Xiaochuan Zhao, Sulaiman Alghunaim, Qiyue Zou, Zhi Quan, Federico Cattivelli, Lucas Cassano, Roula Nassif, Virginia Bordignon, Elsa Risk, Mert Kayaalp, Hawraa Salami, Mirette Sadek, Sylvia Dominguez, Sheng-Yuan Tu, Waleed Younis, Shang-Kee Tee, Chung-Kai Tu, Alireza Tarighat, Nima Khajehnouri, Vitor Nascimento, Ricardo Merched, Cassio Lopes, Nabil Yousef, Ananth Subramanian, Augusto Santos, and Mansour Aldajani. I am also indebted to former internship and visiting undergraduate and MS students Mateja Ilic, Chao Yutong, Yigit Efe Erginbas, Zhuoyoue Wang, and Edward Nguyen for their help with some of the simulations.

I also wish to acknowledge several colleagues with whom I have had fruitful interactions over the years on topics of relevance to this text, including coauthoring joint publications, and who have contributed directly or indirectly to my work: Thomas Kailath (Stanford University, USA), Vince Poor (Princeton University, USA), José Moura (Carnegie Mellon University, USA), Mos Kaveh (University of Minnesota, USA), Bernard Widrow (Stanford University, USA), Simon Haykin (McMaster University, Canada), Thomas Cover (Stanford University, USA, in memoriam), Gene Golub (Stanford University, USA, in memoriam), Sergios Theodoridis (University of Athens, Greece), Vincenzo Matta (University of Salerno, Italy), Abdelhak Zoubir (Technical University Darmstadt, Germany), Cedric Richard (Universite Côte d'Azur, France), John Treichler (Raytheon, USA), Tiberiu Constantinescu (University of Texas Dallas, USA, in memoriam), Shiv Chandrasekaran (University of California, Santa Barbara, USA), Ming Gu (University of California, Berkeley, USA), Babak Hassibi (Caltech, USA), Jeff Shamma (University of Illinois, Urbana Champaign, USA), P. P. Vaidyanathan (Caltech, USA), Hanoch Lev-Ari (Northeastern University, USA), Markus Rupp (Tech. Universität Wien, Austria), Alan Laub, Wotao Yin, Lieven Vandenberghe, Mihaela van der Schaar, and Vwani Roychowdhury (University of California, Los Angeles), Vitor Nascimento (University of São Paulo, Brazil), Jeronimo Arena Garcia (Universidad Carlos III, Spain), Tareq Al-Naffouri (King Abdullah University of Science and Technology, Saudi Arabia), Jie Chen (Northwestern Polytechnical University, China), Sergio Barbarossa (Universita di Roma, Italy), Paolo Di Lorenzo (Universita di Roma, Italy), Alle-Jan van der Veen (Delft University, the Netherlands), Paulo Diniz (Federal University of Rio de Janeiro, Brazil), Sulyman Kozat (Bilkent University, Turkey), Mohammed Dahleh (University of California, Santa Barbara,


USA, in memoriam), Alexandre Bertrand (Katholieke Universiteit Leuven, Belgium), Marc Moonen (Katholieke Universiteit Leuven, Belgium), Phillip Regalia (National Science Foundation, USA), Martin Vetterli, Michael Unser, Pascal Frossard, Pierre Vandergheynst, Rudiger Urbanke, Emre Telatar, and Volkan Cevher (EPFL, Switzerland), Helmut Bölcskei (ETHZ, Switzerland), Visa Koivunen (Aalto University, Finland), Isao Yamada (Tokyo Institute of Technology, Japan), Zhi-Quan Luo and Shuguang Cui (Chinese University of Hong Kong, Shenzhen, China), Soummya Kar (Carnegie Mellon University, USA), Waheed Bajwa (Rutgers University, USA), Usman Khan (Tufts University, USA), Michael Rabbat (Facebook, Canada), Petar Djuric (Stony Brook University, USA), Lina Karam (Lebanese American University, Lebanon), Danilo Mandic (Imperial College, United Kingdom), Jonathon Chambers (University of Leicester, United Kingdom), Rabab Ward (University of British Columbia, Canada), and Nikos Sidiropoulos (University of Virginia, USA).

I would like to acknowledge the support of my publisher, Elizabeth Horne, at Cambridge University Press during the production phase of this project. I would also like to express my gratitude to the publishers IEEE, Pearson Education, NOW, and Wiley for allowing me to adapt some excerpts and problems from my earlier works, namely, Sayed (Fundamentals of Adaptive Filtering, © 2003 Wiley), Sayed (Adaptive Filters, © 2008 Wiley), Sayed (Adaptation, Learning, and Optimization over Networks, © 2014 A. H. Sayed by NOW Publishers), Sayed ("Big ideas or big data?", © 2018 IEEE), and Kailath, Sayed, and Hassibi (Linear Estimation, © 2000 Prentice Hall).

I initiated my work on this project in Westwood, Los Angeles, while working as a faculty member at the University of California, Los Angeles (UCLA), and concluded it in Lausanne, Switzerland, while working at the École Polytechnique Fédérale de Lausanne (EPFL). I am grateful to both institutions for their wonderful and supportive environments.

My wife Laila, and daughters Faten and Samiya, have always provided me with their utmost support and encouragement without which I would not have been able to devote my early mornings and good portions of my weekend days to the completion of this text. My beloved parents, now deceased, were overwhelming in their support of my education. For all the sacrifices they have endured during their lifetime, I dedicate this text to their loving memory, knowing very well that this tiny gift will never match their gift.

Ali H. Sayed
Lausanne, Switzerland
March 2021


References

Aronszajn, N. (1950), "Theory of reproducing kernels," Trans. Amer. Math. Soc., vol. 68, no. 3, pp. 337–404.
Bayes, T. and R. Price (1763), "An essay towards solving a problem in the doctrine of chances," Bayes's article communicated by R. Price and published posthumously in Philos. Trans. Roy. Soc. Lond., vol. 53, pp. 370–418.
Beckley, B. D., P. S. Callahan, D. W. Hancock, G. T. Mitchum, and R. D. Ray (2017), "On the cal-mode correction to TOPEX satellite altimetry and its effect on the global mean sea level time series," J. Geophy. Res. Oceans, vol. 122, no. 11, pp. 8371–8384.
Bedeian, A. G. (2016), "A note on the aphorism 'there is nothing as practical as a good theory'," J. Manag. Hist., vol. 22, no. 2, pp. 236–242.
Bellman, R. E. (1957a), Dynamic Programming, Princeton University Press. Also published in 2003 by Dover Publications.
Bellman, R. E. (1957b), "A Markovian decision process," Indiana Univ. Math. J., vol. 6, no. 4, pp. 679–684.
Cauchy, A.-L. (1847), "Méthode générale pour la résolution des systèmes d'équations simultanées," Comptes Rendus Hebd. Séances Acad. Sci., vol. 25, pp. 536–538.
Cooley, J. W. and J. W. Tukey (1965), "An algorithm for the machine calculation of complex Fourier series," Math. Comput., vol. 19, no. 90, pp. 297–301.
Daintith, J. (2008), editor, Biographical Encyclopedia of Scientists, 3rd ed., CRC Press.
de Moivre, A. (1730), Miscellanea Analytica de Seriebus et Quadraturis, J. Tonson and J. Watts, London.
Detrano, R., A. Janosi, W. Steinbrunn, M. Pfisterer, J. Schmid, S. Sandhu, K. Guppy, S. Lee, and V. Froelicher (1989), "International application of a new probability algorithm for the diagnosis of coronary artery disease," Am. J. Cardiol., vol. 64, pp. 304–310.
Dua, D. and C. Graff (2019), UCI Machine Learning Repository, available at http://archive.ics.uci.edu/ml, School of Information and Computer Science, University of California, Irvine.
Einstein, A. (1916), "Näherungsweise Integration der Feldgleichungen der Gravitation," Sitzungsberichte der Königlich Preussischen Akademie der Wissenschaften Berlin, part 1, pp. 688–696.
Fisher, R. A. (1912), "On an absolute criterion for fitting frequency curves," Messeg. Math., vol. 41, pp. 155–160.
Fisher, R. A. (1922), "On the mathematical foundations of theoretical statistics," Philos. Trans. Roy. Soc. Lond. Ser. A, vol. 222, pp. 309–368.
Fisher, R. A. (1925), "Theory of statistical estimation," Proc. Cambridge Philos. Soc., vol. 22, pp. 700–725.
Fisher, R. A. (1936), "The use of multiple measurements in taxonomic problems," Ann. Eugenics, vol. 7, no. 2, pp. 179–188.
Fourier, J. (1822), Théorie Analytique de la Chaleur, Firmin Didot Père et Fils. English translation by A. Freeman in 1878 reissued as The Analytic Theory of Heat, Dover Publications.
Gauss, C. F. (1903), Carl Friedrich Gauss Werke, Akademie der Wissenschaften.
GSFC (2017), "Global mean sea level trend from integrated multi-mission ocean altimeters TOPEX/Poseidon, Jason-1, OSTM/Jason-2," ver. 4.2 PO.DAAC, CA, USA. Dataset accessed 2019-03-18 at http://dx.doi.org/10.5067/GMSLM-TJ42.
Hall, T. (1970), Carl Friedrich Gauss: A Biography, MIT Press.
Heath, J. L. (1912), The Works of Archimedes, Dover Publications.
Kailath, T. (1968), "An innovations approach to least-squares estimation, part I: Linear filtering in additive white noise," IEEE Trans. Aut. Control, vol. 13, pp. 646–655.
Kailath, T. (1971), "RKHS approach to detection and estimation problems I: Deterministic signals in Gaussian noise," IEEE Trans. Inf. Theory, vol. 17, no. 5, pp. 530–549.
Kailath, T., A. H. Sayed, and B. Hassibi (2000), Linear Estimation, Prentice Hall.


Kalman, R. E. (1960), "A new approach to linear filtering and prediction problems," Trans. ASME J. Basic Eng., vol. 82, pp. 34–45.
Kolmogorov, A. N. (1939), "Sur l'interpolation et extrapolation des suites stationnaires," C. R. Acad. Sci., vol. 208, p. 2043.
Krizhevsky, A. (2009), Learning Multiple Layers of Features from Tiny Images, MS dissertation, Computer Science Department, University of Toronto, Canada.
Laplace, P. S. (1774), "Mémoire sur la probabilité des causes par les événements," Mém. Acad. R. Sci. de MI (Savants étrangers), vol. 4, pp. 621–656. See also Oeuvres Complètes de Laplace, vol. 8, pp. 27–65, published by L'Académie des Sciences, Paris, during the period 1878–1912. Translated by S. M. Stigler, Statistical Science, vol. 1, no. 3, pp. 366–367.
Laplace, P. S. (1812), Théorie Analytique des Probabilités, Paris.
LeCun, Y., L. Bottou, Y. Bengio, and P. Haffner (1998), "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324.
Levinson, N. (1947), "The Wiener RMS error criterion in filter design and prediction," J. Math. Phys., vol. 25, pp. 261–278.
Lewin, K. (1945), "The research center for group dynamics at MIT," Sociometry, vol. 8, pp. 126–135. See also page 169 of Lewin, K. (1952), Field Theory in Social Science: Selected Theoretical Papers by Kurt Lewin, Tavistock.
Lyapunov, A. M. (1901), "Nouvelle forme du théorème sur la limite de probabilité," Mémoires de l'Académie de Saint-Petersbourg, vol. 12, no. 8, pp. 1–24.
Mangasarian, O. L., W. N. Street, and W. H. Wolberg (1995), "Breast cancer diagnosis and prognosis via linear programming," Op. Res., vol. 43, no. 4, pp. 570–577.
Markov, A. A. (1913), "An example of statistical investigation in the text of Eugene Onyegin illustrating coupling of texts in chains," Proc. Acad. Sci. St. Petersburg, vol. 7, no. 3, pp. 153–162. English translation in Science in Context, vol. 19, no. 4, pp. 591–600, 2006.
Mercer, J. (1909), "Functions of positive and negative type and their connection with the theory of integral equations," Philos. Trans. Roy. Soc. Lond. Ser. A, vol. 209, pp. 415–446.
Nyquist, H. (1928), "Certain topics in telegraph transmission theory," Trans. AIEE, vol. 47, pp. 617–644. Reprinted as classic paper in Proc. IEEE, vol. 90, no. 2, pp. 280–305, 2002.
Parzen, E. (1962), "Extraction and detection problems and reproducing kernel Hilbert spaces," J. Soc. Indus. Appl. Math. Ser. A: Control, vol. 1, no. 1, pp. 35–62.
Pearson, K. (1894), "Contributions to the mathematical theory of evolution," Philos. Trans. Roy. Soc. Lond., vol. 185, pp. 71–110.
Pearson, K. (1896), "Mathematical contributions to the theory of evolution. III. Regression, heredity and panmixia," Philos. Trans. Roy. Soc. Lond., vol. 187, pp. 253–318.
Pearson, K. (1901), "On lines and planes of closest fit to systems of points in space," Philos. Mag., vol. 2, no. 11, pp. 559–572.
Porter, R. and M. Ogilvie (2000), editors, The Biographical Dictionary of Scientists, 3rd ed., Oxford University Press.
Robbins, H. and S. Monro (1951), "A stochastic approximation method," Ann. Math. Stat., vol. 22, pp. 400–407.
Rosenblatt, F. (1957), The Perceptron: A Perceiving and Recognizing Automaton, Technical Report 85-460-1, Project PARA, Cornell Aeronautical Lab.
Sayed, A. H. (2003), Fundamentals of Adaptive Filtering, Wiley.
Sayed, A. H. (2008), Adaptive Filters, Wiley.
Sayed, A. H. (2014a), Adaptation, Learning, and Optimization over Networks, Foundations and Trends in Machine Learning, NOW Publishers, vol. 7, no. 4–5, pp. 311–801.
Sayed, A. H. (2018), "Big ideas or big data?" IEEE Sign. Process. Mag., vol. 35, no. 2, pp. 5–6.
Shannon, C. E. (1948a), "A mathematical theory of communication," Bell Syst. Tech. J., vol. 27, pp. 379–423.
Shannon, C. E. (1948b), "A mathematical theory of communication," Bell Syst. Tech. J., vol. 27, pp. 623–656.


Viterbi, A. J. (1967), "Error bounds for convolutional codes and an asymptotically optimal decoding algorithm," IEEE Trans. Inf. Theory, vol. 13, pp. 260–269.
Werbos, P. J. (1974), Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences, Ph.D. dissertation, Harvard University.
Widrow, B. and M. E. Hoff (1960), "Adaptive switching circuits," IRE WESCON Conv. Rec., Institute of Radio Engineers, pt. 4, pp. 96–104.
Wiener, N. (1949), Extrapolation, Interpolation and Smoothing of Stationary Time Series, Technology Press and Wiley. Originally published in 1942 as a classified Nat. Defense Res. Council Report. Also published under the title Time Series Analysis by MIT Press.
Zadeh, L. A. (1954), "System theory," Columbia Engr. Quart., vol. 8, pp. 16–19.

Notation

The following is a list of notational conventions used in the text:

(a) We use boldface letters to denote random variables and normal font letters to refer to their realizations (i.e., deterministic values), like x and x, respectively. In other words, we reserve the boldface notation for random quantities.

(b) We use CAPITAL LETTERS for matrices and small letters for both vectors and scalars, for example, X and x, respectively. In view of the first convention, X would denote a matrix with random entries, while X would denote a matrix realization (i.e., a matrix with deterministic entries). Likewise, x would denote a vector with random entries, while x would denote a vector realization (or a vector with deterministic entries). One notable exception to the capital letter notation is the use of such letters to refer to matrix dimensions or the number of data points. For example, we usually write M to denote the size of a feature vector and N to denote the number of data samples. These exceptions will be obvious from the context.

(c) Small Greek letters generally refer to scalar quantities, such as α and β, while CAPITAL Greek letters generally refer to matrices such as Σ and Γ.

(d) All vectors in our presentation are column vectors unless mentioned otherwise. Thus, if h ∈ IR^M refers to a feature vector and w ∈ IR^M refers to a classifier, then their inner product is h^T w, where (·)^T denotes the transposition symbol.

(e) If P(w) : IR^M → IR is some objective function, then its gradient relative to w^T is denoted by either ∇_{w^T} P(w) or ∂P(w)/∂w, and this notation refers to the column vector consisting of the partial derivatives of P(w) relative to the individual entries of w:

\partial P(w)/\partial w = \nabla_{w^T} P(w) = \begin{bmatrix} \partial P(w)/\partial w_1 \\ \partial P(w)/\partial w_2 \\ \vdots \\ \partial P(w)/\partial w_M \end{bmatrix} \qquad (M \times 1)


Symbols

We collect here, for ease of reference, a list of the main symbols used throughout the text.

IR                set of real numbers
C                 set of complex numbers
ZZ                set of integer numbers
IR^M              set of M × 1 real vectors
(·)^T             matrix transposition
(·)^*             complex conjugation (transposition) for scalars (matrices)
x                 boldface letter denotes a random scalar or vector
X                 boldface capital letter denotes a random matrix
x                 letter in normal font denotes a scalar or vector
X                 capital letter in normal font denotes a matrix
E x               expected value of the random variable x
E_x g(x)          expected value of g(x) relative to pdf of x
x ⊥ y             orthogonal random variables x and y, i.e., E x y^T = 0
x ⊥ y             orthogonal vectors x and y, i.e., x^T y = 0
x ⫫ y | z         x and y are conditionally independent given z
‖x‖^2             x^T x for a real x; squared Euclidean norm of x
‖x‖_W^2           x^T W x for a real x and positive-definite matrix W
‖x‖ or ‖x‖_2      √(x^T x) for a real column vector x; Euclidean norm of x
‖x‖_1             ℓ1-norm of vector x; sum of its absolute entries
‖x‖_∞             ℓ∞-norm of vector x; maximum absolute entry
‖x‖_⋆             dual norm of vector x
‖A‖ or ‖A‖_2      maximum singular value of A; also the spectral norm of A
‖A‖_F             Frobenius norm of A
‖A‖_1             ℓ1-norm of matrix A or maximum absolute column sum
‖A‖_∞             ℓ∞-norm of matrix A or maximum absolute row sum
‖A‖_⋆             dual norm of matrix A
col{a, b}         column vector with a and b stacked on top of each other
diag{a, b}        diagonal matrix with diagonal entries a and b
diag{A}           column vector formed from the diagonal entries of A
diag{a}           diagonal matrix with entries read from column a
a ⊕ b             same as diag{a, b}
a = vec{A}        column vector a formed by stacking the columns of A
A = vec^{-1}{a}   square matrix A recovered by unstacking its columns from a
blkcol{a, b}      columns a and b stacked on top of each other
blkdiag{A, B}     block diagonal matrix with blocks A and B on diagonal
a ⊙ b             Hadamard elementwise product of vectors a and b
a ⊘ b             Hadamard elementwise division of vectors a and b
A ⊗ B             Kronecker product of matrices A and B
A ⊗_b B           block Kronecker product of block matrices A and B


A^†               pseudo-inverse of A
a ≜ b             quantity a is defined as b
0                 zero scalar, vector, or matrix
I_M               identity matrix of size M × M
P > 0             positive-definite matrix P
P ≥ 0             positive-semidefinite matrix P
P^{1/2}           square-root factor of P ≥ 0, usually lower triangular
A > B             means that A − B is positive-definite
A ≥ B             means that A − B is positive-semidefinite
det A             determinant of matrix A
Tr(A)             trace of matrix A
A = QR            QR factorization of matrix A
A = UΣV^T         SVD factorization of matrix A
ρ(A)              spectral radius of matrix A
λ(A)              refers to a generic eigenvalue of A
σ(A)              refers to a generic singular value of A
N(A)              nullspace of matrix A
R(A)              range space or column span of matrix A
rank(A)           rank of matrix A
In(A)             inertia of matrix A
x̂                 estimator for x
x̃                 error in estimating x
x̄ or E x          mean of random variable x
σ_x^2             variance of a scalar random variable x
R_x               covariance matrix of a vector random variable x
P(A)              probability of event A
P(A|B)            probability of event A conditioned on knowledge of B
f_x(x)            pdf of random variable x
f_x(x; θ)         pdf of x parameterized by θ
f_{x|y}(x|y)      conditional pdf of random variable x given y
S(θ)              score function, equal to ∇_{θ^T} f_x(x; θ)
F(θ)              Fisher information matrix: covariance matrix of S(θ)
N_x(x̄, R_x)       Gaussian distribution over x with mean x̄, covariance R_x
x ∼ f_x(x)        random variable x distributed according to pdf f_x(x)
G_g(m(x), K(x, x'))  Gaussian process g; mean m(x) and covariance K(x, x')
K(x, x')          kernel function with arguments x, x'
H(x)              entropy of random variable x
H(x|y)            conditional entropy of random variable x given y
I(x; y)           mutual information of random variables x and y
D_KL(p‖q)         KL divergence of pdf distributions p_x(x) and q_x(x)
D_φ(a, b)         Bregman divergence of points a, b relative to mirror φ(w)


S_x(z)            z-spectrum of stationary random process x(n)
S_x(e^{jω})       power spectral density function
x̂_MSE             mean-square-error estimator of x
x̂_MAP             maximum a-posteriori estimator of x
x̂_MVUE            minimum-variance unbiased estimator of x
x̂_ML              maximum-likelihood estimate of x
ℓ(θ)              log-likelihood function; parameterized by θ
P_C(z)            projection of point z onto convex set C
dist(x, C)        distance from point x to convex set C
prox_{µh}(z)      proximal projection of z relative to h(w)
M_{µh}(z)         Moreau envelope of prox_{µh}(z)
h^⋆(x)            conjugate function of h(w)
T_α(x)            soft-thresholding applied to x with threshold α
{x}^+             max{0, x}
P(w) or P(W)      risk functions
Q(w, ·) or Q(W, ·)  loss functions
∇_w P(w)          row gradient vector of P(w) relative to w
∇_{w^T} P(w)      column gradient vector of P(w) relative to w^T
∂P(w)/∂w          same as the column gradient ∇_{w^T} P(w)
∇_w^2 P(w)        Hessian matrix of P(w) relative to w
∂_w P(w)          subdifferential set of P(·) at location w
h_n               nth feature vector
γ(n)              nth target or label signal, when scalar
γ_n               nth target or label signal, when vector
c(h_n)            classifier applied to h_n
w_n               weight iterate at nth iteration of an algorithm
w^⋆               minimizer of an empirical risk, P(w)
w^o               minimizer of a stochastic risk, P(w)
w̃_n               weight error at iteration n
R_emp(c)          empirical risk of classifier c(h)
R(c)              actual risk of classifier c(h)
q^⋆(z)            variational factor for estimating the posterior f_{z|y}(z|y)
I[a]              indicator function: 1 when a is true; otherwise 0
I_{C,∞}[x]        indicator function: 0 when x ∈ C; otherwise ∞
log a             logarithm of a relative to base 10
ln a              natural logarithm of a
O(µ)              asymptotically bounded by a constant multiple of µ
o(µ)              asymptotically bounded by a higher power of µ
O(1/n)            decays asymptotically at rate comparable to 1/n
o(1/n)            decays asymptotically at rate faster than 1/n
mb(x)             Markov blanket of node x in a graph
pa(x)             parents of node x in a graph
φ_C(·)            potential function associated with clique C


N_k               neighborhood of node k in a graph
M = (S, A, P, r)  Markov decision process
S                 set of states for an MDP
A                 set of actions for an MDP
P                 transition probabilities for an MDP
r                 reward function for an MDP
π(a|s)            policy for selecting action a conditioned on state s
π^⋆(a|s)          optimal policy for an MDP
v^π(s)            state value function at state s
v^⋆(s)            optimal state value function at state s
q^π(s, a)         state–action value function at state s and action a
q^⋆(s, a)         optimal state–action value function at (s, a)
softmax(z)        softmax operation applied to entries of vector z
z_ℓ               pre-activation vector at layer ℓ of a neural network
y_ℓ               post-activation vector at layer ℓ of a neural network
W_ℓ               weighting matrix between layers ℓ and ℓ + 1
θ_ℓ               bias vector feeding into layer ℓ + 1
w_{ij}^{(ℓ)}      weight from node i in layer ℓ to node j in layer ℓ + 1
θ_ℓ(j)            weight from bias source in layer ℓ to node j in layer ℓ + 1
δ_ℓ               sensitivity vector for layer ℓ

Abbreviations

ADF     assumed density filtering
ae      convergence almost everywhere
AIC     Akaike information criterion
BIBO    bounded-input bounded-output
BIC     Bayesian information criterion
BPTT    backpropagation through time
cdf     cumulative distribution function
CNN     convolutional neural network
CPT     conditional probability table
DAG     directed acyclic graph
EKF     extended Kalman filter
ELBO    evidence lower bound
ELU     exponential linear unit
EM      expectation maximization
EP      expectation propagation
ERC     exact recovery condition
ERM     empirical risk minimization
FDA     Fisher discriminant analysis
FGSM    fast gradient sign method
GAN     generative adversarial network
GLM     generalized linear model
GMM     Gaussian mixture model
GRU     gated recurrent unit


HALS    hierarchical alternating least-squares
HMM     hidden Markov model
ICA     independent component analysis
iid     independent and identically distributed
IRLS    iterative reweighted least-squares
ISTA    iterated soft-thresholding algorithm
JSMA    Jacobian saliency map approach
KKT     Karush–Kuhn–Tucker
KL      Kullback–Leibler divergence
LASSO   least absolute shrinkage and selection operator
LDA     latent Dirichlet allocation
LDA     linear discriminant analysis
LLMSE   linear least-mean-squares error
LMSE    least-mean-squares error
LOESS   locally estimated scatter-plot smoothing
LOWESS  locally weighted scatter-plot smoothing
LSTM    long short-term memory network
LWRP    layer-wise relevance propagation
MAB     multi-armed bandit
MAP     maximum a-posteriori
MCMC    Markov chain Monte Carlo
MDL     minimum description length
MDP     Markov decision process
ML      maximum likelihood
MMSE    minimum mean-square error
MP      matching pursuit
MRF     Markov random field
MSD     mean-square deviation
MSE     mean-square error
MVUE    minimum variance unbiased estimator
NMF     nonnegative matrix factorization
NN      nearest-neighbor rule
OCR     optical character recognition
OMP     orthogonal matching pursuit
PCA     principal component analysis
pdf     probability density function
PGM     probabilistic graphical model
pmf     probability mass function
POMDP   partially observable MDP
PPO     proximal policy optimization
RBF     radial basis function
RBM     restricted Boltzmann machine
ReLu    rectified linear unit
RIP     restricted isometry property


RKHS    reproducing kernel Hilbert space
RLS     recursive least-squares
RNN     recurrent neural network
RTRL    real time recurrent learning
SARSA   sequence of state, action, reward, state, action
SNR     signal-to-noise ratio
SOM     self-organizing map
SVD     singular value decomposition
SVM     support vector machine
TD      temporal difference
TRPO    trust region policy optimization
UCB     upper confidence bound
VAE     variational autoencoder
VC      Vapnik–Chervonenkis dimension
VI      variational inference
□       end of theorem/lemma/proof/remark


27 Mean-Square-Error Inference

Inference deals with the estimation of hidden parameters or random variables from observations of other related variables. In this chapter, we study the basic, yet fundamental, problem of inferring an unknown random quantity from observations of another random quantity by using the mean-square-error (MSE) criterion. Several other design criteria can be used for inference purposes besides MSE, such as the mean-absolute error (MAE) and the maximum a-posteriori (MAP) criteria. We will encounter these possibilities in future chapters, starting with the next chapter. We initiate our discussions of inference problems though by focusing on the MSE criterion due to its mathematical tractability and because it sheds light on several important questions that arise in the study of inference problems in general.

In our treatment of inference problems, we will encounter three broad formulations:

(a) In one instance, we will model the unknown as a random variable, denoted by the boldface symbol x. The objective will be to predict (or infer) the value of x from observations of a related variable y. The predictor or estimator for x will be denoted by x̂. Different design criteria will lead to different constructions for x̂. In this chapter, we discuss the popular MSE criterion.

(b) In another instance, we will continue to model the unknown x as a random variable but will limit its values to discrete levels, such as having x assume the values +1 or −1. This type of formulation is prevalent in classification problems and will be discussed at length in future chapters, starting with the next chapter, where we examine the Bayes classifier.

(c) In a third instance, we will model the unobservable as an unknown constant, denoted by the Greek symbol θ, rather than a random variable. This type of problem is frequent in applications requiring fitting models onto data and will be discussed in future chapters, for example, when we introduce the maximum-likelihood and expectation maximization paradigms.


27.1 INFERENCE WITHOUT OBSERVATIONS

We consider first a simple yet useful estimation problem, which relates to estimating a random variable x ∈ IR when the only information that is available about x is its mean, x̄. Our objective is to estimate the value that x will assume in a given experiment. We denote the estimate for x by the notation x̂; it is a deterministic quantity (i.e., a number). But how do we come up with a value for x̂? And how do we decide whether this value is optimal or not? And if optimal, in what sense? These inquiries are at the heart of every inference problem. To answer them, we first need to choose a cost (also called risk) function to penalize the estimation error. The resulting estimate x̂ will be optimal only in the sense that it leads to the smallest cost or risk value. Different choices for the cost function will generally lead to different choices for x̂, each of which will be optimal in its own way.

27.1.1 Problem Formulation

The design criterion we study first is the famed MSE criterion; several other criteria are possible and we discuss other important choices in the next chapter. The MSE criterion is based on introducing the error signal:

\tilde{x} \triangleq x - \hat{x} \qquad (27.1)

and on determining x̂ by minimizing the MSE, which is defined as the mean of the squared error x̃²:

\hat{x} \triangleq \arg\min_{\hat{x}} \; \mathbb{E}\,\tilde{x}^2 \qquad (27.2)

The error x̃ is a random variable since x is random. The resulting estimate, x̂, is called the least-mean-squares estimate (LMSE) of x. For added emphasis, we could have written x̂_MSE, with the subscript MSE, in order to highlight the fact that this is an estimate for x that is based on minimizing the MSE criterion defined by (27.2). This notation is unnecessary in this chapter because we will be discussing mainly the MSE criterion. However, in later chapters, when we introduce other design criteria, it will become necessary to use the MSE subscript to distinguish the MSE estimate, x̂_MSE, from other estimates such as x̂_MAP, x̂_MAE, or x̂_ML, where the subscripts MAP, MAE, and ML refer to maximum a-posteriori, mean-absolute error, and maximum-likelihood estimators. Returning to (27.2), it is immediate to verify that the solution x̂ is given by

\hat{x} = \bar{x} \qquad (27.3)

and that the resulting minimum mean-square error (MMSE) is

\text{MMSE} = \mathbb{E}\,\tilde{x}^2 = \sigma_x^2 \qquad (27.4)

We therefore say that the best estimate of a random variable (best from the perspective of MSE estimation) when only its mean is known is the mean value of the random variable itself.

Proof of (27.3)–(27.4) using differentiation: We expand the MSE from (27.2) to get

\mathbb{E}\,\tilde{x}^2 = \mathbb{E}\,(x - \hat{x})^2 = \mathbb{E}\,x^2 - 2\bar{x}\hat{x} + \hat{x}^2 \qquad (27.5)

and then differentiate the right-hand side with respect to the unknown x̂. Setting the derivative to zero gives

\frac{\partial}{\partial \hat{x}}\, \mathbb{E}\,(x - \hat{x})^2 = -2\bar{x} + 2\hat{x} = 0 \;\Longrightarrow\; \hat{x} = \bar{x} \qquad (27.6)

This choice for x̂ minimizes the MSE since the risk (27.5) is quadratic in x̂.

Proof of (27.3)–(27.4) using completion-of-squares: An alternative argument to arrive at the same conclusion relies on the use of a completion-of-squares step. It is immediate to verify that the risk function in (27.2) can be rewritten in the following form, by adding and subtracting x̄:

\mathbb{E}\,\tilde{x}^2 = \mathbb{E}\,\big( (x - \bar{x}) + (\bar{x} - \hat{x}) \big)^2
            = \mathbb{E}\,(x - \bar{x})^2 + (\bar{x} - \hat{x})^2 + 2\,\underbrace{\mathbb{E}\,(x - \bar{x})}_{=0}\,(\bar{x} - \hat{x})
            = \sigma_x^2 + (\bar{x} - \hat{x})^2 \qquad (27.7)

The above result expresses the risk as the sum of two nonnegative terms where only the second term depends on the unknown, x̂. It is clear that the choice x̂ = x̄ annihilates the second term and results in the smallest possible value for the MSE. This value is referred to as the MMSE and is equal to σ_x².
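This conclusion is easy to check numerically. The following short Python sketch is an illustrative stand-in only (the book's own simulations are written in Matlab); it draws samples of an arbitrary random variable, sweeps candidate estimates x̂ over a grid, and confirms that the empirical mean-square error is smallest near the sample mean, with a minimum value close to the sample variance, in agreement with (27.3)–(27.4). The chosen distribution and variable names are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=1.5, size=100_000)   # any distribution works; here mean = 3.0, variance = 4.5

candidates = np.linspace(x.mean() - 2, x.mean() + 2, 201)   # grid of candidate estimates xhat
mse = [np.mean((x - c) ** 2) for c in candidates]           # empirical E (x - xhat)^2 for each candidate

best = candidates[int(np.argmin(mse))]
print("sample mean     :", x.mean())     # the LMSE estimate predicted by (27.3)
print("minimizing xhat :", best)         # should be close to the sample mean
print("minimum MSE     :", min(mse))     # should be close to the sample variance, cf. (27.4)
print("sample variance :", x.var())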



27.1.2 Interpretation

There are good reasons for using the MSE criterion (27.2). The simplest one perhaps is that the criterion is amenable to mathematical manipulations, more so than any other criterion. In addition, the criterion is attempting to force the estimation error, x̃, to assume values close to its mean, which is zero since

\mathbb{E}\,\tilde{x} = \mathbb{E}\,(x - \hat{x}) = \mathbb{E}\,(x - \bar{x}) = \bar{x} - \bar{x} = 0 \qquad (27.8)

Therefore, by minimizing E x̃², we are in effect minimizing the variance of the error, x̃. Then, in view of the discussion in Section 3.2 regarding the interpretation of the variance of a random variable, we find that the MSE criterion is attempting to increase the likelihood of small errors.


The effectiveness of the estimation procedure (27.2) can be measured by examining the value of the resulting minimum risk, which is the variance of the estimation error, denoted by σ_x̃² = E x̃². The above discussion tells us that

\sigma_{\tilde{x}}^2 = \sigma_x^2 \qquad (27.9)

so that the estimate x̂ = x̄ does not reduce our initial uncertainty about x: The error variable has the same variance as x itself. We therefore find that, in this initial scenario, the performance of the MSE design procedure is limited. We are interested in estimation procedures that result in error variances that are smaller than the original signal variance. We discuss one such procedure in the next section. The reason for the poor performance of the estimate x̂ = x̄ lies in the lack of more sophisticated prior information about x. Note that result (27.3) simply tells us that the best we can do, in the absence of any other information about a random variable x, other than its mean, is to use the mean value of x as our estimate. This result is at least intuitive. After all, the mean value of a random variable is, by definition, an indication of the value that we expect to observe on average in repeated experiments. Hence, in answer to the question, "What is the best guess for x?" the analysis tells us that the best guess is what we would expect for x on average! This is a circular answer, but one that is at least consistent with intuition.

Example 27.1 (Guessing the class of an image) Assume a box contains an equal number of images of cats and dogs. An image is selected at random from the box and a random variable x is associated with this experiment. The variable x assumes the value x = +1 if the selected image is a cat and it assumes the value x = −1 otherwise. We say that x represents the class (or label) variable: Its value specifies whether the selected image belongs to one class (+1 corresponding to cats) or the other (−1 corresponding to dogs). It is clear that x is a binary random variable assuming the values ±1 with equal probability. Then,

\bar{x} = \tfrac{1}{2}\times(+1) + \tfrac{1}{2}\times(-1) = 0 \qquad (27.10)
\sigma_x^2 = \mathbb{E}\,x^2 = 1 \qquad (27.11)

If a user were to predict the class of the image that will be selected from the box beforehand then, according to the MSE criterion, the best estimate for x will be x̂ = x̄ = 0. This estimate value is neither +1 nor −1. This example shows that the LMSE estimate does not always lead to a meaningful solution! In this case, using x̂ = 0 is not useful in guessing whether the realization for x will be +1 or −1. If we could incorporate into the design procedure some additional information, besides the mean of x, then we could perhaps come up with a better prediction for the class of the image.

Example 27.2 (Predicting a crime statistic) The US Federal Bureau of Investigation (FBI) publishes statistics on the crime rates in the country on an annual basis. Figure 27.1 plots the burglary rates per 100,000 inhabitants for the period 1997–2016. Assume we model the annual burglary rate as a random variable x with some mean x̄. By examining the plot, we find that this assumption is more or less reasonable only over the shorter range 2000–2009 during which the burglary rate remained practically flat


with fluctuations around some nominal average value. The rates are declining before 2000 and after 2010. Assume we did not know the burglary rate for the year 2010 and wanted to predict its value from the burglary rates observed in prior years. In this example, the probability distribution of x is not known, so we cannot evaluate its mean x̄. Instead, we have access to measurements for the years 1997–2015. We can use the data from the years 2000–2009 to compute a sample mean and use it to predict x(2010):

\hat{x}(2010) = \frac{1}{10} \sum_{n=2000}^{2009} x(n) \approx 732.6 \qquad (27.12)

Figure 27.1 Plot of the annual burglary rates per 100,000 inhabitants in the United States during 1997–2016. Source: US Criminal Justice Information Services Division. Data found at https://ucr.fbi.gov/crime-in-the-u.s/2016/crime-in-the-u.s.-2016/tables/table-1.

The value 732.6 is close enough to the actual burglary rate observed for 2010, which is 701. If we were instead to predict the burglary rate for the year 2016 by using the data for the entire period 1997–2015, we would end up with

\hat{x}(2016) \approx 715.5 \qquad (27.13)

which is clearly a bad estimate since the actual value is 468.9. This example illustrates the fact that we will often be dealing with distributions that vary (i.e., drift) over time for various reasons, such as changing environmental conditions or, in the case of this example, crime deterrence policies that may have been put in place. This possibility necessitates the development of inference techniques that are able to adapt to variations in the statistical properties of the data in an automated manner. In this example, the statistical properties of the data during the period 2000–2009 are clearly different from the periods before 2000 and after 2009.
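The sample-mean predictor used in this example amounts to a one-line computation. The Python sketch below is a hypothetical illustration of (27.12)–(27.13): the function sample_mean_prediction, the placeholder rate values, and the variable names are all invented for this sketch, and the actual burglary rates should be read from the FBI table cited in the caption of Fig. 27.1.

import numpy as np

def sample_mean_prediction(rates, window):
    # Average the observed rates over the years in `window` and use the average as the prediction.
    return float(np.mean([rates[year] for year in window]))

# rates: year -> burglary rate per 100,000 inhabitants (placeholder values only).
rates = {year: 700.0 for year in range(1997, 2016)}

xhat_2010 = sample_mean_prediction(rates, range(2000, 2010))   # counterpart of (27.12)
xhat_2016 = sample_mean_prediction(rates, range(1997, 2016))   # counterpart of (27.13)
print(xhat_2010, xhat_2016)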

27.2 INFERENCE WITH OBSERVATIONS

Let us examine next the case in which more is known about the random variable x, beyond its mean. Let us assume that we have access to an observation of a second random variable y that is related to x in some way. For example, y could


be a noisy measurement of x, say, y = x + v, where v denotes the disturbance, or y could be the sign of x, or dependent on x in some other way.

27.2.1 Conditional Mean Estimator

Given two dependent random variables {x, y}, we pose the problem of determining the LMSE estimator for x from an observation of y. Observe that we are now employing the terminology "estimator" of x as opposed to "estimate" of x. In order to highlight this distinction, we denote the estimator of x by the boldface notation x̂; it is now a random variable since it will be a function of y, i.e.,

\hat{x} = c(y) \qquad (27.14)

for some function c(·) to be determined. Once the function c(·) has been determined, evaluating it at a particular observation for y, say, at y = y, will result in an estimate for x, i.e.,

\hat{x} = c(y)\big|_{y=y} = c(y) \qquad (27.15)

Different occurrences for y will lead to different estimates x̂. In Section 27.1 we did not need to make this distinction between an estimator x̂ and an estimate x̂. There we sought directly an estimate x̂ for x since we did not have access to a random variable y; we only had access to the deterministic quantity x̄. The criterion we use to determine the estimator x̂ will continue to be the same MSE criterion. We define the error signal:

\tilde{x} \triangleq x - \hat{x} \qquad (27.16)

and determine x̂ by minimizing the MSE over all possible choices for the function c(·):

\min_{c(\cdot)} \; \mathbb{E}\,\tilde{x}^2, \quad \text{subject to } \hat{x} = c(y) \qquad (27.17)

We establish in the following that the solution of (27.17) is given by the conditional mean estimator:

\hat{x} = \mathbb{E}\,(x \,|\, y = y) \qquad (27.18)

That is, the optimal choice for c(y) in (27.15) is

c^{o}(y) = \mathbb{E}\,(x \,|\, y = y) \qquad (27.19)

Recall from (3.100) that for continuous random variables x and y, the conditional expectation is computed via the integral expression:

\mathbb{E}\,(x \,|\, y = y) = \int_{x \in X} x\, f_{x|y}(x|y)\, dx \qquad (27.20a)

over the domain of x, while for discrete random variables:

\mathbb{E}\,(x \,|\, y = y) = \sum_{x} x\; \mathbb{P}(x = x \,|\, y = y) \qquad (27.20b)

in terms of the conditional probability values and where the sum is over the possible realizations for x. We continue our presentation by using the notation for continuous random variables without loss in generality. Returning to (27.18), this estimator is obviously unbiased since from the result of Prob. 3.25 we know that

\mathbb{E}\,\hat{x} = \mathbb{E}\,\big[ \mathbb{E}\,(x|y) \big] = \mathbb{E}\,x = \bar{x} \qquad (27.21)

Moreover, the resulting minimum cost or MMSE will be given by

\text{MMSE} \triangleq \mathbb{E}\,\tilde{x}^2 = \sigma_x^2 - \sigma_{\hat{x}}^2 \qquad (27.22)

which is smaller than the earlier value (27.4). Result (27.18) states that the least-mean-squares estimator of x is its conditional expectation given y. This result is again intuitive. In answer to the question, "What is the best guess for x given that we observe y?" the analysis tells us that the best guess is what we would expect for x given the occurrence of y!

Derivation of (27.18) using differentiation: Using again the result of Prob. 3.25 we have

\mathbb{E}\,(x - \hat{x})^2 = \mathbb{E}\,\big[ \mathbb{E}\,\big( (x - \hat{x})^2 \,\big|\, y \big) \big]
 = \mathbb{E}\,\big[ \mathbb{E}\,\big( x^2 - 2x\hat{x} + \hat{x}^2 \,\big|\, y \big) \big]
 = \mathbb{E}\,\big[ \mathbb{E}\,\big( x^2 - 2x\,c(y) + c^2(y) \,\big|\, y \big) \big], \quad \text{since } \hat{x} = c(y)
 = \mathbb{E}\,\big[ \mathbb{E}\,(x^2|y) - 2c(y)\,\mathbb{E}\,(x|y) + c^2(y) \big] \qquad (27.23)

It is sufficient to minimize the inner expectation relative to c(y), for any realization y = y. Differentiating and setting the derivative to zero at the optimal solution c^o(y) gives c^o(y) = E(x|y = y).

Derivation of (27.18) using completion-of-squares: In this second derivation, we will establish two useful intermediate results, namely, the orthogonality conditions (27.26) and (27.29). We again refer to the result of Prob. 3.25, which we write more explicitly as

\mathbb{E}\,x = \mathbb{E}_{y}\,\big[ \mathbb{E}_{x|y}\,(x|y) \big] \qquad (27.24)

where the outer expectation is relative to the distribution of y, while the inner expectation is relative to the conditional distribution of x given y. It follows that, for any real-valued function of y, say, g(y),

\mathbb{E}\,x\,g(y) = \mathbb{E}_{y}\big[ \mathbb{E}_{x|y}\big( x\,g(y) \,\big|\, y \big) \big]
           = \mathbb{E}_{y}\big[ \mathbb{E}_{x|y}(x|y)\,g(y) \big]
           = \mathbb{E}_{y}\big[ \big\{ \mathbb{E}_{x|y}(x|y) \big\}\, g(y) \big] \qquad (27.25)

This means that, for any g(y), it holds that

\mathbb{E}\,\big( x - \mathbb{E}\,(x|y) \big)\, g(y) = 0 \;\Longleftrightarrow\; \tilde{x} \perp g(y) \qquad \text{(orthogonality condition)} \qquad (27.26)

Result (27.26) states that the error variable x̃ = x − E(x|y) is orthogonal to any function g(·) of y; we also represent this result by the compact notation x̃ ⊥ g(y). However, since x − E(x|y) is zero mean, then it also holds that x̃ is uncorrelated with g(y). Using this intermediate result, we return to the risk (27.17), add and subtract E(x|y) to its argument, and express it as

\mathbb{E}\,(x - \hat{x})^2 = \mathbb{E}\,\big( (x - \mathbb{E}\,(x|y)) + (\mathbb{E}\,(x|y) - \hat{x}) \big)^2 \qquad (27.27)

The term E(x|y) − x̂ is a function of y. Therefore, if we choose g(y) = E(x|y) − x̂, then from the orthogonality property (27.26) we conclude that

\mathbb{E}\,(x - \hat{x})^2 = \mathbb{E}\,\big( x - \mathbb{E}\,(x|y) \big)^2 + \mathbb{E}\,\big( \mathbb{E}\,(x|y) - \hat{x} \big)^2 \qquad (27.28)

Now, only the second term on the right-hand side is dependent on x̂ and the MSE is minimized by choosing x̂ = E(x|y). To evaluate the resulting MMSE we first use the orthogonality property (27.26), along with the fact that x̂ = E(x|y) is itself a function of y, to conclude that

\mathbb{E}\,(x - \hat{x})\,\hat{x} = 0 \;\Longleftrightarrow\; \tilde{x} \perp \hat{x} \qquad (27.29)

In other words, the estimation error, x̃, is uncorrelated with the optimal estimator. Using this result, we can evaluate the MMSE as follows:

\mathbb{E}\,\tilde{x}^2 = \mathbb{E}\,(x - \hat{x})(x - \hat{x})
 = \mathbb{E}\,(x - \hat{x})\,x \qquad \text{(because of (27.29))}
 = \mathbb{E}\,x^2 - \mathbb{E}\,\hat{x}\,(\tilde{x} + \hat{x}) \qquad \text{(since } x = \tilde{x} + \hat{x}\text{)}
 = \mathbb{E}\,x^2 - \mathbb{E}\,\hat{x}^2 \qquad \text{(because of (27.29) again)}
 = \big( \mathbb{E}\,x^2 - \bar{x}^2 \big) + \big( \bar{x}^2 - \mathbb{E}\,\hat{x}^2 \big)
 = \sigma_x^2 - \sigma_{\hat{x}}^2 \qquad (27.30)
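Both the orthogonality property (27.29) and the MMSE expression (27.30) can be verified by simulation. The Python sketch below (illustrative only; the book's simulations use Matlab) works with a model for which the conditional mean is available in closed form: when x is zero-mean Gaussian and is observed as y = x + v with independent zero-mean Gaussian noise, the conditional mean is E(x|y) = (σ_x²/(σ_x² + σ_v²)) y, a standard fact about Gaussian variables developed in Section 27.3. In the sketch, the empirical error is (approximately) uncorrelated with the estimator, and the empirical MSE matches σ_x² − σ_{x̂}²; the variances and sample size are arbitrary choices.

import numpy as np

rng = np.random.default_rng(1)
N = 1_000_000
sx2, sv2 = 2.0, 0.5                               # variances of x and of the noise v

x = rng.normal(0.0, np.sqrt(sx2), N)
y = x + rng.normal(0.0, np.sqrt(sv2), N)

xhat = (sx2 / (sx2 + sv2)) * y                    # conditional-mean estimator E(x|y) for this Gaussian model
xerr = x - xhat                                   # estimation error

print("E(xerr * xhat)  :", np.mean(xerr * xhat))  # approximately zero; orthogonality (27.29)
print("empirical MMSE  :", np.mean(xerr ** 2))    # compare with sigma_x^2 - sigma_xhat^2, cf. (27.30)
print("sx2 - var(xhat) :", sx2 - np.var(xhat))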

 Example 27.3 (From soft to hard decisions) Let us return to Example 27.1, where x is a binary signal that assumes the values ±1 with probability 1/2. Recall that x represents the class of the selected image (cats or dogs). In that example, we only assumed knowledge of the mean of x and concluded that the resulting MSE estimate was not meaningful because it led to x b = 0, which is neither +1 nor −1. We are going to assume now that we have access to some additional information about x. Specifically,


we are going to assume that we have some noisy measurement of x, denoted by y, say, as

y = x + v \qquad (27.31)

where the symbol v denotes the disturbance. How do measurements y of this type arise? We are going to encounter later in this text several inference techniques that provide approximate estimates for discrete variables. Rather than detect whether the unknown x is +1 or −1 (which we refer to as performing hard decisions), these other methods will return approximate values for x such as claiming that it is 0.9 or −0.7 (which we refer to as performing soft decisions). The soft value is a real (and not discrete) number and it can be interpreted as being a perturbed version of the actual label x. We are then faced with the problem of deciding whether x = ±1 from the soft version y. Obviously, the nature of the perturbation v in (27.31) depends on the method that is used to generate the approximation y. In this example, and in order to keep the analysis tractable, we will assume that v and x are independent of each other and, moreover, that v has zero mean, unit variance, and is Gaussian-distributed with probability density function (pdf):

f_v(v) = \frac{1}{\sqrt{2\pi}}\, e^{-v^2/2} \qquad (27.32)

Our intuition tells us that we should be able to do better here than in Example 27.1. But beware, even here, we will arrive at some interesting conclusions. According to (27.18), the optimal estimator for x given y is the conditional mean x̂ = E(x|y), which we evaluated earlier in (3.117) and determined that:

\hat{x} = \tanh(y) \triangleq \frac{e^{y} - e^{-y}}{e^{y} + e^{-y}} \qquad (27.33)

The result is represented in Fig. 27.2. If the variance of the measurement noise v were not fixed at 1 but denoted more generically by σ_v², then the same argument would lead to x̂ = tanh(y/σ_v²), with y scaled by σ_v² – see Prob. 27.2.
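Expression (27.33) can also be checked by simulation: generate many pairs (x, y) according to the model y = x + v, group the samples into narrow bins of y, and average the values of x that fall in each bin. The bin averages should trace out tanh(y). A Python sketch of this check follows; it is illustrative only, and the bin edges and sample size are arbitrary choices.

import numpy as np

rng = np.random.default_rng(2)
N = 2_000_000
x = rng.choice([-1.0, 1.0], size=N)        # equiprobable binary symbols
y = x + rng.normal(0.0, 1.0, N)            # unit-variance Gaussian noise

edges = np.linspace(-3, 3, 25)             # narrow bins over a range of observed values
centers = 0.5 * (edges[:-1] + edges[1:])
bin_index = np.digitize(y, edges) - 1

for k, c in enumerate(centers):
    inside = bin_index == k
    if inside.any():
        # the conditional average of x within each bin approximates E(x|y) at the bin center
        print(f"y ~ {c:+.2f}   empirical E(x|y) = {x[inside].mean():+.3f}   tanh(y) = {np.tanh(c):+.3f}")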


Figure 27.2 Estimation of a binary signal ±1 observed under unit-variance additive Gaussian noise.

Figure 27.3 plots the function tanh(y). We see that it tends to ±1 as y → ±∞. For other values of y, the function assumes real values that are distinct from ±1. This is a bit puzzling from the designer's perspective. The designer is interested in knowing whether the symbol x is +1 or −1 based on the observed value of y.


Construction (27.33) tells the designer to estimate x by computing tanh(y). But, once again, this value will not be +1 or −1; it will be a real number somewhere inside the interval (−1, 1). The designer will be induced to make a hard decision of the form:

\text{decide in favor of } \begin{cases} +1, & \text{if } \tanh(y) \text{ is nonnegative} \\ -1, & \text{if } \tanh(y) \text{ is negative} \end{cases} \qquad (27.34)

Figure 27.3 A plot of the hyperbolic tangent function, tanh(y). Observe that the curve tends to ±1 as y → ±∞.

In effect, the designer is implementing the alternative estimator:

\hat{x} = \text{sign}\big(\tanh(y)\big)    (27.35)

where sign(·) denotes the sign of its argument; it is equal to +1 if the argument is nonnegative and −1 otherwise:

\text{sign}(x) \triangleq \begin{cases} +1, & \text{if } x \geq 0 \\ -1, & \text{if } x < 0 \end{cases}    (27.36)

We therefore have a situation where the optimal estimator (27.33), although known in closed form, does not solve the original problem of recovering the symbols ±1 directly. Instead, the designer is forced to implement a suboptimal solution; it is suboptimal from an LMSE point of view. Actually, the designer could consider implementing the following simpler suboptimal estimator directly:

\hat{x} = \text{sign}(y)    (27.37)

where the sign(·) function operates directly on y rather than on tanh(y) – see Fig. 27.4. Both suboptimal implementations (27.35) and (27.37) lead to the same result since, as is evident from Fig. 27.3:

\text{sign}\big(\tanh(y)\big) = \text{sign}(y)    (27.38)

We say that implementation (27.37) provides hard decisions, while implementation (27.33) provides soft decisions. We will revisit this problem later in Example 28.3 and show how the estimator (27.37) can be interpreted as being optimal relative to another design criterion; specifically, it will be the optimal Bayes classifier for the situation under study.

The purpose of Examples 27.1 and 27.3 is not to confuse the reader, but to stress the fact that an optimal estimator is optimal only in the sense that it satisfies a certain optimality criterion. One should not confuse an optimal guess with a perfect guess. One should also not confuse an optimal guess with a practical one; an optimal guess need not be perfect or even practical, though it often suggests good practical solutions.

Figure 27.4 Suboptimal MSE estimation of a binary signal embedded in unit-variance additive Gaussian noise. (Block diagram: the binary ±1 signal plus Gaussian noise yields the noisy observation, which is mapped into a hard decision.)

Example 27.4 (Estimating petal length for iris flowers) Consider the data shown in Fig. 27.6, which represents scatter diagrams for measurements of petal and sepal dimensions for three types of iris flowers (setosa, versicolor, and virginica). The flowers are shown in Fig. 27.5.

Figure 27.5 Illustration of three types of iris flowers: (left) setosa, (middle) versicolor, and (right) virginica. The center figure also points to a petal and a sepal within the flower. The source of these flower images is Wikimedia Commons, where they are available for use under the Creative Commons Attribution Share-Alike License. The labels and arrows in white have been added by the author.

Each row in Fig. 27.6 corresponds to one flower type and includes two plots showing the scatter diagrams for the sepal width × sepal length and petal width × petal length for the flower. There are 50 data measurements in each diagram and all dimensions are measured in centimeters. The axes in the figure are normalized to similar scales to facilitate comparison. Let us model the petal length of a flower as some random variable x. Let us also model the flower type as another random variable y, which assumes one of three discrete values: setosa, versicolor, and virginica. Given that we observe a setosa flower at random, we would like to estimate its petal length. We can do so by computing E(x|y). However, we do not have any information about the conditional distribution f_{x|y}(x|y).


Figure 27.6 Each row shows a scatter diagram for the sepal width × sepal length and petal width × petal length for one type of iris flower: (top) setosa, (middle) versicolor, and (bottom) virginica. The red circles in the “center” of the scatter diagrams correspond to the centers of gravity of the plots whose coordinates correspond to the mean values of the vertical and horizontal dimensions. The iris dataset is available from https://archive.ics.uci.edu/ml/datasets/iris.

Table 27.1 Average dimensions measured in centimeters for the three types of iris flowers, computed from the available data measurements.

Flower      | Sepal length | Sepal width | Petal length | Petal width
setosa      | 5.0060       | 3.4180      | 1.4640       | 0.2440
versicolor  | 5.9360       | 2.7700      | 4.2600       | 1.3260
virginica   | 6.5880       | 2.9740      | 5.5520       | 2.0260

Still, we can use the available measurements to approximate E(x|y) and estimate the petal length for the new setosa flower. Table 27.1 lists the sample average values for the petal length, petal width, sepal length, and sepal width for the three types of flowers computed from the available data; each scatter diagram has 50 points. From the table we estimate, for y = setosa:

predicted petal length = 1.4640 cm ≈ E(x | y = setosa)    (27.39)
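These per-class sample averages are easy to compute. The sketch below uses scikit-learn's bundled copy of the iris dataset as a convenience (this choice, the column order noted in the comments, and all variable names are our own assumptions, not part of the text):

```python
import numpy as np
from sklearn.datasets import load_iris

# Sketch: approximate E(petal length | flower type) by per-class sample means,
# as in Table 27.1.  Feature order in this copy of the dataset:
# [sepal length, sepal width, petal length, petal width], in centimeters.
iris = load_iris()
X, labels = iris.data, iris.target

for k, name in enumerate(iris.target_names):
    class_means = X[labels == k].mean(axis=0)
    print(name, np.round(class_means, 4))

# The 'setosa' entry in the petal-length column (about 1.46 cm) plays the
# role of the estimate used in Eq. (27.39).
```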

27.2.2 Orthogonality Principle

Two important conclusions follow from the derivation of result (27.18), namely, the orthogonality properties (27.26) and (27.29). The first property states that the error x̃ is orthogonal to any function g(·) of y. In other words, no matter how we modify the data y, we cannot extract additional information from the data to reduce the uncertainty about x any further. This result is represented schematically in Fig. 27.7.

Figure 27.7 The orthogonality condition: x̃ ⊥ g(y) ⟺ E x̃ g(y) = 0; the estimation error is orthogonal to any transformation g(·) of the observation.

The second orthogonality property (27.29) is a special case of (27.26) since x̂ is a function of the observations. Actually, the orthogonality condition (27.26) is a defining property for optimality under MSE inference.

Lemma 27.1. (Optimal estimators and orthogonality) An estimator x̂ = c(y) is optimal in the LMSE sense (27.17) if, and only if, x̂ is unbiased (i.e., E x̂ = x̄) and x − x̂ ⊥ g(y) for any function g(·).

Proof: One direction has already been proven in (27.26), namely, if x̂ is the optimal estimator and hence x̂ = E(x|y), then we already know from (27.26) that x̃ ⊥ g(y) for any g(·). Moreover, we know from (27.21) that this estimator is unbiased.

Conversely, assume x̂ is some unbiased estimator for x and that it satisfies x − x̂ ⊥ g(y) for any g(·). Define the random variable z = x̂ − E(x|y) and let us show that z is equal to the zero variable in probability by appealing to the result of Example 3.5. Note first that z has zero mean since

E\, z = E\, \hat{x} - E\big( E(x|y) \big) = \bar{x} - \bar{x} = 0    (27.40)

Moreover, from (27.26) we have x − E(x|y) ⊥ g(y) and, by assumption, x − x̂ ⊥ g(y). Subtracting these two conditions we conclude that z ⊥ g(y), which is the same as


E z g(y) = 0. Now, since the variable z itself is a function of y, we may choose g(y) = z to conclude that E z² = 0. We thus find that z is a zero-mean random variable with zero variance, so that from result (3.36) we conclude that z = 0 in probability or, equivalently, x̂ = E(x|y) in probability. ∎

Although we have focused so far on scalar random variables {x, y}, it is straightforward to verify that the same conclusions hold when x and/or y are vector-valued. The mean-square estimator continues to be x̂ = E(x|y) and the orthogonality condition continues to hold. The next section deals with a case involving such vector-valued random variables.
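The orthogonality condition also lends itself to a quick numerical sanity check. The sketch below (our own illustration; the particular functions g(·) and the suboptimal estimator are arbitrary choices) reuses the binary-signal setting of Example 27.3 and verifies that E x̃ g(y) ≈ 0 for the optimal estimator tanh(y), whereas a suboptimal estimator fails the test for some choices of g(·):

```python
import numpy as np

# Sketch: numerically check the orthogonality principle E[(x - xhat) g(y)] = 0
# for the optimal estimator xhat = E(x|y) = tanh(y) of Example 27.3.
rng = np.random.default_rng(1)
N = 500_000
x = rng.choice([-1.0, 1.0], size=N)
y = x + rng.standard_normal(N)        # unit-variance Gaussian noise

err_opt = x - np.tanh(y)              # error of the optimal estimator
err_sub = x - 0.5 * y                 # error of an arbitrary linear estimator

for name, g in [("g(y)=y", lambda u: u), ("g(y)=y^3", lambda u: u**3), ("g(y)=sin(y)", np.sin)]:
    print(name,
          " optimal:", round(np.mean(err_opt * g(y)), 4),
          " suboptimal:", round(np.mean(err_sub * g(y)), 4))
```

Note that the linear estimator above happens to be orthogonal to g(y) = y (it is the best linear fit), but not to the nonlinear transformations, which is exactly what Lemma 27.1 rules out for a truly optimal estimator.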

27.3 GAUSSIAN RANDOM VARIABLES

It is not always possible to determine a closed-form expression for the optimal estimator E(x|y). Only in some special cases can this calculation be carried out to completion (as we did in Example 27.3 and will do again for another example in this section). We will consider one important case for which the optimal estimator E(x|y) turns out to be affine in y. This scenario happens when the random variables x and y are jointly Gaussian.

Thus, consider two jointly Gaussian-distributed random vectors x and y of sizes p × 1 and q × 1, respectively. Referring to (4.79a) and (4.79b), their individual pdfs are Gaussian and given by

f_x(x) = \frac{1}{\sqrt{(2\pi)^p \det R_x}} \exp\left\{ -\frac{1}{2}(x - \bar{x})^T R_x^{-1} (x - \bar{x}) \right\}    (27.41a)

f_y(y) = \frac{1}{\sqrt{(2\pi)^q \det R_y}} \exp\left\{ -\frac{1}{2}(y - \bar{y})^T R_y^{-1} (y - \bar{y}) \right\}    (27.41b)

where {x̄, ȳ} denote the means and {Rx, Ry} denote the covariance matrices:

R_x = E\,(x - \bar{x})(x - \bar{x})^T, \qquad R_y = E\,(y - \bar{y})(y - \bar{y})^T    (27.42)

Moreover, according to (4.77), their joint pdf is given by

f_{x,y}(x,y) = \frac{1}{\sqrt{(2\pi)^{p+q} \det R}} \exp\left\{ -\frac{1}{2} \begin{bmatrix} (x-\bar{x})^T & (y-\bar{y})^T \end{bmatrix} R^{-1} \begin{bmatrix} x-\bar{x} \\ y-\bar{y} \end{bmatrix} \right\}    (27.43)

in terms of the covariance matrix R of the aggregate vector col{x, y}, namely,

R \triangleq E\left( \begin{bmatrix} x \\ y \end{bmatrix} - \begin{bmatrix} \bar{x} \\ \bar{y} \end{bmatrix} \right)\left( \begin{bmatrix} x \\ y \end{bmatrix} - \begin{bmatrix} \bar{x} \\ \bar{y} \end{bmatrix} \right)^T = \begin{bmatrix} R_x & R_{xy} \\ R_{xy}^T & R_y \end{bmatrix}    (27.44)


Figure 27.8 plots a typical joint Gaussian distribution for scalar variables {x, y} for illustration purposes, along with its contour curves, using

\bar{x} = 3, \quad \bar{y} = 4, \quad R = \begin{bmatrix} 2 & 1 \\ 1 & 4 \end{bmatrix}    (27.45)

The figure also shows the individual Gaussian distributions for the variables x and y separately.

Figure 27.8 (Top) A joint Gaussian distribution fx,y (x, y) with parameters given by

(27.45) along with its contour curves. (Bottom) Individual Gaussian distributions for the variables x and y.

We showed in Lemma 4.3 that the conditional pdf of x given y is Gaussian and defined by:

\Sigma_x \triangleq R_x - R_{xy} R_y^{-1} R_{xy}^T    (27.46a)

\hat{x} = \bar{x} + R_{xy} R_y^{-1} (y - \bar{y})    (27.46b)

f_{x|y}(x|y) = \frac{1}{\sqrt{(2\pi)^p \det \Sigma_x}} \exp\left\{ -\frac{1}{2}(x - \hat{x})^T \Sigma_x^{-1} (x - \hat{x}) \right\}    (27.46c)

Note that the expression for x̂ is the mean of the conditional distribution, f_{x|y}(x|y). Therefore, it corresponds to the optimal MSE estimator. Moreover, the matrix Σx corresponds to the error covariance matrix, i.e., Σx = E x̃x̃ᵀ.


To see this, we start from x = x̃ + x̂ and use the orthogonality condition x̃ ⊥ x̂ to conclude that

E\, xx^T = E\, \tilde{x}\tilde{x}^T + E\, \hat{x}\hat{x}^T    (27.47)

Subtracting x̄x̄ᵀ from both sides, this relation translates into an equality involving the covariance matrices of {x, x̃, x̂}, which we denote by the symbols {Rx, R_x̃, R_x̂}:

R_x = R_{\tilde{x}} + R_{\hat{x}}    (27.48)

It follows that

R_{\tilde{x}} = R_x - R_{\hat{x}}
 = R_x - E\,(\hat{x} - \bar{x})(\hat{x} - \bar{x})^T
 \overset{(27.46b)}{=} R_x - R_{xy} R_y^{-1} \underbrace{E\,(y - \bar{y})(y - \bar{y})^T}_{=R_y}\, R_y^{-1} R_{xy}^T
 = R_x - R_{xy} R_y^{-1} R_{xy}^T
 = \Sigma_x    (27.49)

as claimed. Observe that the error covariance matrix Σx is independent of y, and that the estimator x̂ depends in an affine manner on the observation y. For this reason, the problem of estimating x from y for jointly Gaussian random variables is sometimes referred to as the linear Gaussian model.

We also derived in Lemma 4.3 a similar expression for the reverse conditional pdf of y given x, leading to

\Sigma_y \triangleq R_y - R_{xy}^T R_x^{-1} R_{xy}    (27.50a)

\hat{y} = \bar{y} + R_{xy}^T R_x^{-1} (x - \bar{x})    (27.50b)

f_{y|x}(y|x) = \frac{1}{\sqrt{(2\pi)^q \det \Sigma_y}} \exp\left\{ -\frac{1}{2}(y - \hat{y})^T \Sigma_y^{-1} (y - \hat{y}) \right\}    (27.50c)

These expressions can be used to identify ŷ as the conditional mean estimator of y given x, with Σy corresponding to the error covariance matrix, Σy = E ỹỹᵀ.
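Expressions (27.46a)-(27.46b) translate directly into code. The sketch below is our own illustration (the covariance construction and the Monte Carlo check are arbitrary choices, not from the text); it computes x̂ and Σx for a jointly Gaussian pair and compares Σx against the sample error covariance:

```python
import numpy as np

# Sketch: optimal MSE estimator for jointly Gaussian vectors, Eqs. (27.46a)-(27.46b).
rng = np.random.default_rng(2)
p, q = 2, 3

# Build an arbitrary positive-definite joint covariance R = [[Rx, Rxy], [Rxy^T, Ry]].
A = rng.standard_normal((p + q, p + q))
R = A @ A.T + (p + q) * np.eye(p + q)
Rx, Rxy, Ry = R[:p, :p], R[:p, p:], R[p:, p:]
x_bar, y_bar = np.zeros(p), np.zeros(q)

K = Rxy @ np.linalg.inv(Ry)          # gain Rxy Ry^{-1}
Sigma_x = Rx - K @ Rxy.T             # error covariance, Eq. (27.46a)

# Monte Carlo check: sample (x, y) and form xhat = x_bar + K (y - y_bar).
Z = rng.multivariate_normal(np.zeros(p + q), R, size=200_000)
x, y = Z[:, :p], Z[:, p:]
x_hat = x_bar + (y - y_bar) @ K.T
err = x - x_hat
print("Sigma_x (theory):\n", np.round(Sigma_x, 3))
print("Sigma_x (sampled):\n", np.round(err.T @ err / len(err), 3))
```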

Example 27.5 (A second linear Gaussian model) We consider a second situation dealing with Gaussian random variables, which will arise in later chapters in the context of predictive modeling and variational inference. We derive the main expressions in this example, and comment on their significance in Example 27.7 further ahead. Consider again two random vectors x and y of size p × 1 and q × 1, respectively. It is assumed now that f_x(x) and f_{y|x}(y|x) are Gaussian-distributed with pdfs given by

f_x(x) = \frac{1}{\sqrt{(2\pi)^p \det R_x}} \exp\left\{ -\frac{1}{2}(x - \bar{x})^T R_x^{-1} (x - \bar{x}) \right\}    (27.51a)

f_{y|x}(y|x) = \frac{1}{\sqrt{(2\pi)^q \det \Gamma}} \exp\left\{ -\frac{1}{2}\big(y - (Wx - \theta)\big)^T \Gamma^{-1} \big(y - (Wx - \theta)\big) \right\}    (27.51b)


In this description, we are assuming that the mean of the conditional distribution of y is dependent on x and is linearly parameterized in the form Wx − θ for some known parameters (W, θ); W is q × p and θ is q × 1. Moreover, Γ denotes the covariance matrix of the conditional distribution. We can interpret f_{y|x}(y|x) as representing a model for generating observations y once x is known. Now, given the model parameters {x̄, Rx, W, θ, Γ}, we would like to determine expressions for the marginal f_y(y) and for the other conditional pdf f_{x|y}(x|y). To do so, we will need to determine the respective mean vectors and covariance matrices {ȳ, Ry, x̂, Σx} from knowledge of the above parameters.

To begin with, comparing with (27.50c) we find that Γ plays the role of Σy and Wx − θ plays the role of ŷ. It follows that

W = R_{xy}^T R_x^{-1}, \qquad -\theta = \bar{y} - W\bar{x}    (27.52a)

so that

\hat{y} = Wx - \theta \quad\text{and}\quad \bar{y} = W\bar{x} - \theta    (27.52b)

From expression (27.50a) and the fact that Γ = Σy, we readily conclude that

R_y = \Gamma + W R_x W^T    (27.53)

which is determined from knowledge of {Γ, Rx, W}. Moreover, using (27.46a) and applying the matrix inversion formula (1.81) we get:

\Sigma_x^{-1} = R_x^{-1} + R_x^{-1} R_{xy} \big( R_y - R_{yx} R_x^{-1} R_{xy} \big)^{-1} R_{yx} R_x^{-1}
 \overset{(27.53)}{=} R_x^{-1} + W^T \Gamma^{-1} W    (27.54)

and, hence,

\Sigma_x = \big( R_x^{-1} + W^T \Gamma^{-1} W \big)^{-1}    (27.55)

which is again determined in terms of the given quantities {Γ, Rx, W}. In the same token, we manipulate expression (27.46b) to find that

\hat{x} = \bar{x} + R_x W^T \big( \Gamma + W R_x W^T \big)^{-1} (y + \theta - W\bar{x})
 = \Big( I - R_x W^T \big( \Gamma + W R_x W^T \big)^{-1} W \Big)\bar{x} + R_x W^T \big( \Gamma + W R_x W^T \big)^{-1} (y + \theta)    (27.56)

Applying the matrix inversion formula again, it is easy to verify that

R_x W^T \big( \Gamma + W R_x W^T \big)^{-1}
 = R_x W^T \Big( \Gamma^{-1} - \Gamma^{-1} W \big( R_x^{-1} + W^T \Gamma^{-1} W \big)^{-1} W^T \Gamma^{-1} \Big)
 = R_x \Big( R_x^{-1} + W^T \Gamma^{-1} W - W^T \Gamma^{-1} W \Big)\big( R_x^{-1} + W^T \Gamma^{-1} W \big)^{-1} W^T \Gamma^{-1}
 \overset{(27.55)}{=} \Sigma_x W^T \Gamma^{-1}    (27.57)

Substituting into (27.56) we arrive at

\hat{x} = \Sigma_x R_x^{-1} \bar{x} + \Sigma_x W^T \Gamma^{-1} (y + \theta)    (27.58)


Table 27.2 summarizes how to move from the parameters {x̄, Rx, W, θ, Γ} for the distributions {f_x(x), f_{y|x}(y|x)} to the parameters {ȳ, Ry, x̂, Σx} for {f_y(y), f_{x|y}(x|y)}.

Table 27.2 Transforming the parameters {x̄, Rx, W, θ, Γ} of {f_x(x), f_{y|x}(y|x)} into the parameters {ȳ, Ry, x̂, Σx} of {f_y(y), f_{x|y}(x|y)} for jointly Gaussian distributed random variables.

Given distributions (given {x̄, Rx, W, θ, Γ}):
  x ∼ N(x̄, Rx)
  y|x ∼ N(ŷ, Γ),  with  ŷ = Wx − θ

Deduced distributions (find {ȳ, Ry, x̂, Σx}):
  y ∼ N(ȳ, Ry),    with  ȳ = W x̄ − θ,   Ry = Γ + W Rx W^T
  x|y ∼ N(x̂, Σx),  with  x̂ = Σx Rx^{-1} x̄ + Σx W^T Γ^{-1} (y + θ),   Σx = (Rx^{-1} + W^T Γ^{-1} W)^{-1}
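The mapping in Table 27.2 can be wrapped into a small helper function. The sketch below is our own (the function name gaussian_posterior and the example numbers are assumptions, not from the text); it simply transcribes the four deduced formulas:

```python
import numpy as np

def gaussian_posterior(x_bar, Rx, W, theta, Gamma, y):
    """Sketch of the parameter mapping in Table 27.2.

    Given x ~ N(x_bar, Rx) and y|x ~ N(W x - theta, Gamma), return
    (y_bar, Ry) for the marginal of y and (x_hat, Sigma_x) for x given y.
    """
    y_bar = W @ x_bar - theta
    Ry = Gamma + W @ Rx @ W.T
    Rx_inv, Gamma_inv = np.linalg.inv(Rx), np.linalg.inv(Gamma)
    Sigma_x = np.linalg.inv(Rx_inv + W.T @ Gamma_inv @ W)
    x_hat = Sigma_x @ (Rx_inv @ x_bar + W.T @ Gamma_inv @ (y + theta))
    return y_bar, Ry, x_hat, Sigma_x

# Example usage with arbitrary (assumed) numbers:
p, q = 2, 3
x_bar, Rx = np.zeros(p), np.eye(p)
W, theta, Gamma = np.ones((q, p)), np.zeros(q), 0.5 * np.eye(q)
print(gaussian_posterior(x_bar, Rx, W, theta, Gamma, y=np.array([1.0, 2.0, 3.0])))
```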

Example 27.6 (A useful integral expression) We can use the results of the previous example to establish a useful integral expression. Recall that we are given two Gaussian distributions:

f_x(x) \sim N_x(\bar{x}, R_x), \qquad f_{y|x}(y|x) \sim N_y(\hat{y}, \Gamma)    (27.59)

where

\hat{y} = Wx - \theta    (27.60)

Using these distributions we can determine the joint pdf

f_{x,y}(x,y) = f_x(x)\, f_{y|x}(y|x)    (27.61)

and marginalize over x to get

f_y(y) = \int_{-\infty}^{\infty} f_x(x)\, f_{y|x}(y|x)\, dx    (27.62)

We already showed that y ∼ N_y(ȳ, Ry). We therefore arrive at the useful equality:

N_y(\bar{y}, R_y) = \int_{x=-\infty}^{\infty} N_x(\bar{x}, R_x) \times N_y(Wx - \theta, \Gamma)\, dx    (27.63)

The right-hand side involves the integration of the product of two Gaussians (one over x and the other over y); the result is another Gaussian distribution over y.

Example 27.7 (Encoding-decoding architecture) We use the results of the previous two examples to motivate a useful encoder-decoder architecture that we will encounter on several other occasions in this text, e.g., when we study predictive modeling and variational inference in Chapter 33 and generative networks in Chapter 68.


The conditional distribution fy|x (y|x) can be regarded as a model for generating samples y once x is known. We refer to this model as the generative model or the decoder since it decodes the information contained in x to generate y. On the other hand, the second conditional fx|y (x|y), with the roles of x and y reversed, is known as the encoder since it maps y to x and helps encode the information from y into x. It is useful to represent these mappings in graphical form, as shown in Fig. 27.9. Such structures are useful for the following reason.


Figure 27.9 The decoder is represented by fy|x (y|x) since it decodes the information contained in x, while the encoder is represented by fx|y (x|y) since it maps the information from y into x.

Assume the parameter x is selected randomly from a Gaussian prior, i.e., x ∼ N_x(x̄, Rx), while the observation y is generated according to another Gaussian distribution parameterized by x, i.e., y ∼ f_{y|x}(y|x). Usually, in practice, the model parameter x is unknown, in which case we would not know the exact f_{y|x}(y|x) but only its general Gaussian form. As such, if we face a situation where we need to generate additional independent observations y' from the same statistical distribution as the observed y, then we would not be able to do so. That is where we can appeal to the results from Table 27.2 to determine first the posterior distribution f_{x|y}(x|y); this distribution does not require knowledge of x. Then, by applying the Bayes rule and marginalizing over x, we can determine the conditional distribution of new observations given y (also known as the predictive pdf) as follows:

f_{y'|y}(y'|y) = \int_{-\infty}^{\infty} f_{x,y'|y}(x, y'|y)\, dx = \int_{-\infty}^{\infty} \underbrace{f_{x|y}(x|y)}_{\text{posterior}}\; \underbrace{f_{y'|x}(y'|x)}_{\text{likelihood}}\, dx    (27.64)

where the two distributions under the integral are known. This process is known as predictive modeling. The integral involves the product of two Gaussian distributions, which can be evaluated using result (27.63) to find that

f_{y'|y}(y'|y) = N_{y'}(\bar{y}', R_{y'})    (27.65a)

\bar{y}' = W\hat{x} - \theta    (27.65b)

R_{y'} = \Gamma + W \Sigma_x W^T    (27.65c)

Once f_{y'|y}(y'|y) is determined, it can be sampled to generate realizations y', or one could use its mean ȳ'. Predictive modeling will be pursued in greater detail in Chapter 33. The purpose of the current example is to motivate the encoder-decoder structure in a convenient Gaussian setting.
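In this Gaussian setting the predictive pdf (27.65a)-(27.65c) can be computed and sampled in a few lines. The sketch below is our own illustration with arbitrary numbers; it simply chains the posterior formulas of Table 27.2 with (27.65b)-(27.65c):

```python
import numpy as np

# Sketch: predictive distribution of a new observation y' given y,
# for the linear Gaussian model of Examples 27.5-27.7.
rng = np.random.default_rng(3)
p, q = 2, 2
x_bar, Rx = np.zeros(p), np.eye(p)
W = np.array([[1.0, 0.5], [0.0, 2.0]])
theta, Gamma = np.zeros(q), 0.1 * np.eye(q)

y = np.array([0.7, -1.2])                     # an observed sample (assumed)

# Posterior x|y (Table 27.2).
Rx_inv, Gamma_inv = np.linalg.inv(Rx), np.linalg.inv(Gamma)
Sigma_x = np.linalg.inv(Rx_inv + W.T @ Gamma_inv @ W)
x_hat = Sigma_x @ (Rx_inv @ x_bar + W.T @ Gamma_inv @ (y + theta))

# Predictive pdf of y' given y, Eqs. (27.65b)-(27.65c).
y_bar_new = W @ x_hat - theta
Ry_new = Gamma + W @ Sigma_x @ W.T
samples = rng.multivariate_normal(y_bar_new, Ry_new, size=5)
print("predictive mean:", y_bar_new)
print("a few sampled y':\n", samples)
```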

27.4 BIAS-VARIANCE RELATION

It is generally difficult to evaluate the conditional mean estimator x̂ = E(x|y) since its computation requires evaluating an integral involving the conditional pdf f_{x|y}(x|y). For this reason, we will be employing suboptimal estimators for x in future chapters and, in particular, estimators that are computed directly from data measurements and without appealing to probability distributions. We will describe many alternatives, including constructions based on stochastic approximation algorithms. One would be inclined to think that more complex approximations for x̂ are likely to lead to better performance. It turns out that an interesting and "counterintuitive" bias-variance trade-off arises, where it is not necessarily true that "more" complexity is "better."

Assume we have N pairs of scalar data points denoted by the letter D:

D \triangleq \big\{ (x_0, y_0), (x_1, y_1), \ldots, (x_{N-1}, y_{N-1}) \big\}    (27.66)

where the realizations (x_n, y_n) correspond to independent samples arising from some joint distribution f_{x,y}(x,y). Using the dataset D, we can fit some function to map y-values to x-values. We denote this function or mapping by the notation:

\hat{x}_a = c_D(y)    (27.67)

For any new measurement y, we use the function to estimate the corresponding x-value, which we are denoting by x̂_a. There are many ways to construct functions c_D(·). For example, we can consider fitting a line through the given data points {(x_n, y_n)} by minimizing a least-squares criterion of the form:

(\alpha^\star, \theta^\star) \triangleq \underset{\alpha,\theta \in \mathbb{R}}{\text{argmin}}\; \left\{ \frac{1}{N} \sum_{n=0}^{N-1} \big( x_n - (\alpha y_n - \theta) \big)^2 \right\}    (27.68)

for some scalar parameters (α, θ). This criterion determines the optimal values (α*, θ*) that minimize the sum of squared errors in mapping the {y_n} to the {x_n} by means of a regression line. The result would be an affine mapping of the form:

\hat{x}_a = c_D(y) \triangleq \alpha^\star y - \theta^\star    (27.69)

To determine (α*, θ*), we differentiate the cost function in (27.68) relative to α and θ and set the derivatives to zero to find:

\alpha^\star = \frac{ N \sum_n x_n y_n - \big(\sum_n x_n\big)\big(\sum_n y_n\big) }{ N \sum_n y_n^2 - \big(\sum_n y_n\big)^2 }    (27.70a)

\theta^\star = \frac{ \big(\sum_n y_n\big)\big(\sum_n x_n y_n\big) - \big(\sum_n y_n^2\big)\big(\sum_n x_n\big) }{ N \sum_n y_n^2 - \big(\sum_n y_n\big)^2 }    (27.70b)
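A direct transcription of (27.70a)-(27.70b) is sketched below (our own code; the synthetic data model and the cross-check against a generic least-squares solver are assumptions for illustration only):

```python
import numpy as np

def fit_line(x, y):
    """Return (alpha, theta) minimizing (1/N) sum_n (x_n - (alpha*y_n - theta))^2,
    using the closed-form expressions (27.70a)-(27.70b)."""
    N = len(x)
    Sy, Sx = y.sum(), x.sum()
    Syy, Sxy = (y * y).sum(), (x * y).sum()
    den = N * Syy - Sy ** 2
    alpha = (N * Sxy - Sx * Sy) / den
    theta = (Sy * Sxy - Syy * Sx) / den
    return alpha, theta

# Quick check on synthetic data (assumed model and seed):
rng = np.random.default_rng(4)
y = rng.standard_normal(1000)
x = 0.8 * y - 0.3 + 0.1 * rng.standard_normal(1000)
print(fit_line(x, y))                       # close to (0.8, 0.3)

# Cross-check with the matrix least-squares solution of (27.93):
H = np.column_stack([np.ones_like(y), y])
w, *_ = np.linalg.lstsq(H, x, rcond=None)   # w = [-theta, alpha]
print(w[1], -w[0])                          # alpha, theta again
```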

where the notation Σ_n involves summing over n = 0, 1, ..., N − 1. We can of course use alternative criteria, other than the quadratic cost in (27.68), to design other mappings c_D(y), such as fitting higher-order polynomials through the data.

What we would like to highlight in this discussion is the fact that the dataset D is stochastic in nature. This is because different experiments to collect the data {(x_n, y_n)} will generally result in different realizations for these samples due to the random nature of the variables x and y. For this reason, we will denote the observations in boldface and write {(x_n, y_n)}. In the same token, we will denote the dataset by the boldface notation D = {(x_n, y_n)}. As a result, the mapping c_D(y) will be stochastic: Its expression will change depending on the dataset used to construct it. This is clear from expressions (27.70a)-(27.70b): The values for (α*, θ*) depend on the data samples {(x_n, y_n)} used to compute them. For this reason, these parameters should also be viewed as realizations for two random variables (α*, θ*) whose values change from one experiment to another. We highlight this randomness by rewriting (27.69) in the form

\hat{x}_a = c_D(y) = \alpha^\star y - \theta^\star    (27.71)

where the boldface notation is used to reflect the random nature of the variables involved. In this simple example, we have three random variables on the right-hand side, (α*, θ*, y). The randomness in (α*, θ*) is due to the randomness in the dataset D, while the variable y is independent of D.

Now, for any estimator c_D(y), it is useful to examine how its MSE performance compares to that of the optimal estimator, E(x|y). Some degradation is of course expected. Let x̂ = c^o(y) denote the optimal MSE estimator for x. We already know that c^o(y) = E(x|y) and that this estimator is unbiased since

E\, c^o(y) = E\, x    (27.72)

We denote the corresponding MSE by

\text{MMSE} \triangleq E\, \big( x - c^o(y) \big)^2    (27.73)

Obviously, since c^o(y) is the optimal MSE estimator, the error variance that would result from using c_D(y) in its place to generate estimators for x will be larger:

\text{error variance of } c_D(y) = E\,(x - \hat{x}_a)^2 \;\geq\; E\,(x - \hat{x})^2 \;\triangleq\; \text{error variance of } c^o(y)    (27.74)

The next result is known as the bias–variance relation and it reveals that the degradation in error variance arises from two effects: one is the bias effect and the other is the variance effect. We establish the result first and then comment on its significance. In the statement and subsequent proof of the result, when needed for emphasis, we use the notation E b to indicate that the expectation is computed relative to the distribution of the random variable b.

Theorem 27.1. (Bias-variance relation) Let c^o(y) denote the optimal MSE estimator of x given y, with the corresponding mean-square error denoted by MMSE. Consider any other estimator c_D(y) that is computed from a random dataset D. It then holds that:

E_{x,y,D}\, \big( x - c_D(y) \big)^2 = \text{MMSE} + \underbrace{ E\, \big( c^o(y) - E\, c_D(y) \big)^2 }_{\text{bias}^2} + \underbrace{ E\, \big( c_D(y) - E\, c_D(y) \big)^2 }_{\text{variance}}    (27.75)

where the degradation in performance is captured by two terms referred to as (squared) bias and variance, as explained further ahead.

Proof: Conditioning on y = y, and computing the expectation relative to the randomness in x and D, we have:

E_{x,D}\Big[ \big( x - c_D(y) \big)^2 \,\big|\, y = y \Big]
 = E_{x,D}\Big[ \big( x - c^o(y) + c^o(y) - c_D(y) \big)^2 \,\big|\, y = y \Big]
 = E_x\Big[ \big( x - c^o(y) \big)^2 \,\big|\, y = y \Big] + E_D\Big[ \big( c^o(y) - c_D(y) \big)^2 \,\big|\, y = y \Big]
   + \underbrace{ 2\, E_{x,D}\Big[ \big( x - c^o(y) \big)\big( c^o(y) - c_D(y) \big) \,\big|\, y = y \Big] }_{=0}    (27.76)

The last term is zero because of the following argument. We recall that the randomness in D arises from the randomness in the samples that are included in D over repeated experiments. Thus, note that

E_{x,D}\Big[ \big( x - c^o(y) \big)\big( c^o(y) - c_D(y) \big) \,\big|\, y = y \Big]
 \overset{(a)}{=} E_x\Big\{ \big( x - c^o(y) \big)\, E_D\big[ c^o(y) - c_D(y) \,\big|\, y = y \big] \,\Big|\, y = y \Big\}
 \overset{(b)}{=} E_D\big[ c^o(y) - c_D(y) \,\big|\, y = y \big]\; E_x\big[ x - c^o(y) \,\big|\, y = y \big]
 \overset{(c)}{=} \underbrace{ \big( E(x \,|\, y = y) - c^o(y) \big) }_{=0}\; E_D\big[ c^o(y) - c_D(y) \,\big|\, y = y \big] = 0    (27.77)

where step (a) is because the term x − c^o(y) does not depend on D and it can move out of the inner expectation over the distribution of D. Likewise, in step (b), the term c^o(y) − c_D(y) does not depend on x and it can move out of the expectation over x. Step (c) is because c^o(y) = E(x | y = y). Consequently, returning to (27.76) we find that

E_{x,D}\Big[ \big( x - c_D(y) \big)^2 \,\big|\, y = y \Big] = E_x\Big[ \big( x - c^o(y) \big)^2 \,\big|\, y = y \Big] + E_D\Big[ \big( c^o(y) - c_D(y) \big)^2 \,\big|\, y = y \Big]    (27.78)

We continue with the last term in (27.78):

E_D\Big[ \big( c^o(y) - c_D(y) \big)^2 \,\big|\, y = y \Big]
 = E_D\Big[ \big( c^o(y) - E\, c_D(y) + E\, c_D(y) - c_D(y) \big)^2 \,\big|\, y = y \Big]
 = \big( c^o(y) - E\, c_D(y) \big)^2 + E_D\Big[ \big( c_D(y) - E\, c_D(y) \big)^2 \,\big|\, y = y \Big]
   + \underbrace{ 2\, E_D\Big[ \big( c^o(y) - E\, c_D(y) \big)\big( E\, c_D(y) - c_D(y) \big) \,\big|\, y = y \Big] }_{=0}    (27.79)

where the last term is zero because the factor c^o(y) − E c_D(y) is a constant conditioned on y = y and can be taken out of the expectation, so that the last term becomes equal to

\big( c^o(y) - E\, c_D(y) \big)\, E\,\big( c_D(y) - E\, c_D(y) \big) = 0    (27.80)

Substituting (27.79) into (27.78) gives, at y = y:

E_{x,D}\Big[ \big( x - c_D(y) \big)^2 \,\big|\, y = y \Big] = E_x\big( x - c^o(y) \big)^2 + \big( c^o(y) - E\, c_D(y) \big)^2 + E_D\big( c_D(y) - E\, c_D(y) \big)^2    (27.81)

Taking expectations to remove the conditioning over y, we arrive at the desired result (27.75). ∎

Interpretation

The bias-variance relation (27.75) shows that the degradation in MSE performance is due to two effects: a variance effect and a bias effect. We illustrate these effects in Fig. 27.10 and explain their meaning by considering the conditioned form of the bias-variance relation given by (27.81).


Figure 27.10 Illustration of the terms involved in the bias–variance relation (27.81) at

y = y. The (squared) bias provides an indication of how far the data-based estimate cD (y) is from the optimal estimate co (y), on average. The variance term provides an indication of the size of the dispersion by the data-based estimate cD (y) around its mean, E cD (y).

Thus, assume we are given some observation y = y and would like to use it to estimate x by means of a data-based estimator, c_D(y). The last term in (27.81) refers to the variance effect at location y that would result from using this estimator. This effect arises from the randomness in the datasets. Each dataset D leads to an estimate x̂_a = c_D(y) at the given y. This estimate is a realization for a random quantity because different datasets lead to different constructions for c_D(·) and these in turn lead to different estimates x̂_a at the same location y. Note that the randomness in x̂_a results from the randomness in D since y is fixed. Repeated experiments on datasets will generate repeated estimates {x̂_a}, one for each experiment, and these estimates will be dispersed around the mean value E c_D(y); this is the mean of the data-based estimator over experiments, evaluated at the given observation y = y. The last term in (27.81) is measuring the size of this dispersion and the expectation in it is relative to the distribution of D:

\text{variance effect at } y = E_D\, \big( c_D(y) - E\, c_D(y) \big)^2    (27.82)

The second effect in the bias-variance relation (27.81) is the (squared) bias term at y = y, which measures how far the data-based estimator x̂_a is, on average, from the optimal estimate c^o(y), namely:

(\text{squared) bias effect at } y = \big( c^o(y) - E\, c_D(y) \big)^2    (27.83)


Unfortunately, the bias and variance effects tend to work in opposite directions. A data-based estimator, c_D(y), with a small (large) bias will have a large (small) variance and vice versa. As a result, it is not possible to design an estimator c_D(y) that reduces the bias and variance effects simultaneously. For example, assume we use some complex model c_D(y), such as fitting a higher-order polynomial to the measurements {(x_n, y_n)}, and use it to determine the estimator x̂_a at y = y over the various experiments. It is likely that the estimator will approximate x well and have a small bias. However, its error variance will be large since the coefficients of the higher-order polynomial will likely change appreciably over different datasets to ensure good fitting. On the other hand, if we use a simpler model to estimate x from the data, such as fitting a line through the data, then the bias will be large because, on average, this simpler estimator may not provide a good approximation for x. However, the error variance of this estimator will be smaller than in the higher-order polynomial case since the coefficients of the regression line will vary less drastically over datasets.

In future chapters we will explain that some care is needed with the design of data-based estimators because they can lead to overfitting or underfitting of the data. These terms will be described more precisely then. Here, it is sufficient to mention that the scenario of overfitting arises when overly complex models are used to fit the data (i.e., more complex than is actually necessary); this situation leads to small bias but large variance. On the other hand, underfitting arises when very simple models are used (simpler than is actually necessary), which leads to large bias but small variance. We summarize these useful remarks in Table 27.3, where it is seen that small bias is an indication of overfitting, while large bias is an indication of underfitting.

Table 27.3 Bias and variance sizes for data-based estimators under overfitting and underfitting scenarios.

Scenario      | Bias | Variance
overfitting   | low  | high
underfitting  | high | low

Example 27.8 (Illustrating the bias-variance relation) We reconsider the setting of Example 27.3, where we are interested in estimating a binary random variable x ∈ {±1} from noisy measurements y = x + v under zero-mean Gaussian noise with variance σ_v². We know from the derivation that led to (27.33), as well as from the result of Prob. 27.2, that the optimal MSE estimator for x is given by

\hat{x} = \tanh(y / \sigma_v^2)    (27.84)

We simulate this solution by generating N = 50,000 random samples x chosen uniformly from {+1, −1}. We index the samples by the letter n and write x_n to refer to the nth sample. Each x_n is perturbed by a noise component v_n generated from a zero-mean Gaussian distribution with variance σ_v² = 0.1. The observation is denoted by y_n = x_n + v_n. For each of the noisy measurements y_n, expression (27.84) is used to estimate the corresponding x_n as follows:

\hat{x}_n = \tanh(y_n / \sigma_v^2)    (27.85)


where we are assuming knowledge of the noise variance in this example. The MSE for this construction can be estimated by computing the sample error variance over all N = 50,000 trials:

\text{MMSE} \approx \frac{1}{N} \sum_{n=0}^{N-1} (x_n - \hat{x}_n)^2 \approx 0.0021 \qquad (\text{optimal MSE})    (27.86)
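A sketch of this first part of the simulation follows (our own code; the random seed is arbitrary, so the printed value will only approximately match the figure 0.0021 quoted in (27.86)):

```python
import numpy as np

# Sketch: sample MSE of the optimal estimator xhat = tanh(y / sigma_v^2).
rng = np.random.default_rng(5)
N, sigma_v2 = 50_000, 0.1
x = rng.choice([-1.0, 1.0], size=N)
y = x + np.sqrt(sigma_v2) * rng.standard_normal(N)
x_hat = np.tanh(y / sigma_v2)
print("estimated MMSE ≈", np.mean((x - x_hat) ** 2))   # about 0.002
```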

First-order linear estimator

Next, we examine four suboptimal estimators, evaluate their MSEs from data measurements, and verify that these MSEs satisfy the bias-variance relation (27.75). We consider initially a first-order linear regression model for estimating x from y of the form:

\hat{x}_a = \alpha^\star y - \theta^\star    (27.87)

where the parameters (α*, θ*) are determined from minimizing the squared error defined earlier by (27.68):

(\alpha^\star, \theta^\star) \triangleq \underset{\alpha,\theta}{\text{argmin}}\; \left\{ \frac{1}{N} \sum_{n=0}^{N-1} \big( x_n - (\alpha y_n - \theta) \big)^2 \right\}    (27.88)

We already know that the solution is given by expressions (27.70a)-(27.70b). Before examining the performance of this estimator, and before introducing three other suboptimal estimators of higher order, we will first rewrite model (27.87) in a useful vector form that will be more convenient for these other estimators. Thus, note that we can rewrite (27.87) as

\hat{x}_a = \begin{bmatrix} 1 & y \end{bmatrix} \begin{bmatrix} -\theta^\star \\ \alpha^\star \end{bmatrix}    (27.89)

We introduce the matrix and vector quantities:

d \triangleq \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_{N-1} \end{bmatrix}, \qquad H \triangleq \begin{bmatrix} 1 & y_0 \\ 1 & y_1 \\ 1 & y_2 \\ \vdots & \vdots \\ 1 & y_{N-1} \end{bmatrix}, \qquad w \triangleq \begin{bmatrix} -\theta \\ \alpha \end{bmatrix}    (27.90)

so that problem (27.88) can be written more succinctly in the least-squares form:

w^\star \triangleq \underset{w \in \mathbb{R}^2}{\text{argmin}}\; \| d - Hw \|^2    (27.91)

where the scaling by 1/N is inconsequential and is removed. If we set the gradient vector of this cost function relative to w to zero at w = w*, i.e.,

2 H^T (d - Hw) \Big|_{w = w^\star} = 0    (27.92)

we find that the solution w*, whose entries are w* = col{−θ*, α*}, is given by

w^\star = (H^T H)^{-1} H^T d    (27.93)

It is straightforward to verify that this expression for w* leads to the same expressions (27.70a)-(27.70b) for (α*, θ*) – see Prob. 27.14. We now proceed to evaluate the MSE, bias effect, and variance effect for this first-order estimator from experimental data.
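The repeated-experiments procedure described next can be sketched as follows (our own code; the quantities obtained will differ slightly from the values reported in (27.96)-(27.101) because of the random seed):

```python
import numpy as np

# Sketch: estimate mse_1, bias_1^2, and variance_1 for the first-order model.
rng = np.random.default_rng(6)
sigma_v2, N, L, N_ell = 0.1, 50_000, 200, 1000

def sample(n):
    x = rng.choice([-1.0, 1.0], size=n)
    return x, x + np.sqrt(sigma_v2) * rng.standard_normal(n)

x, y = sample(N)                       # reference data
x_opt = np.tanh(y / sigma_v2)          # optimal estimates c^o(y_n)
H = np.column_stack([np.ones(N), y])   # first-order regressors [1, y]

# L independent experiments, each fitting w = [-theta, alpha] by least squares.
W = np.empty((L, 2))
for ell in range(L):
    x_l, y_l = sample(N_ell)
    H_l = np.column_stack([np.ones(N_ell), y_l])
    W[ell], *_ = np.linalg.lstsq(H_l, x_l, rcond=None)

X_hat = H @ W.T                        # N x L estimates, one column per experiment
x_mean = H @ W.mean(axis=0)            # mean estimator, approximating E c_D(y)

print("mse_1      ≈", np.mean((x[:, None] - X_hat) ** 2))
print("bias_1^2   ≈", np.mean((x_opt - x_mean) ** 2))
print("variance_1 ≈", np.mean((X_hat - x_mean[:, None]) ** 2))
```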


For this purpose, we perform the following repeated experiments to generate multiple data-based estimators and then use these estimators to approximate the bias and variance effects. We index each experiment by the letter ℓ, for ℓ = 1, 2, ..., L, where L = 200. For each experiment, we generate N_ℓ = 1000 data samples {(x_n^{(ℓ)}, y_n^{(ℓ)})} in the same manner described before and use these samples to construct the data quantities {H^{(ℓ)}, d^{(ℓ)}} and estimate the parameters (α^{(ℓ)⋆}, θ^{(ℓ)⋆}) for the ℓth experiment. In other words, each collection of data points {(x_n^{(ℓ)}, y_n^{(ℓ)})} gives rise to an estimator whose mapping is given by:

\hat{x}_a^{(\ell)} = \alpha^{(\ell)\star} y - \theta^{(\ell)\star}, \qquad \ell = 1, 2, \ldots, L    (27.94)

Obviously, if we now apply these estimators to the original N = 50,000 data points {(x_n, y_n)}, then each of them will take the same observation y_n and generate an estimate x̂_n^{(ℓ)} for the corresponding x_n. We therefore note that, for the same observation y, there will be some variability in the estimates for its x-component. This variability gives rise to the bias and variance effects, whose sizes can be evaluated as explained next. First, for each y_n we obtain a collection of L estimates for its x-component, denoted by

\hat{x}_n^{(\ell)} = \alpha^{(\ell)\star} y_n - \theta^{(\ell)\star}, \qquad \ell = 1, 2, \ldots, L    (27.95)

We use these estimates to estimate the MSE for the first-order estimator by using

\text{mse}_1 \triangleq E_{x,y,D}\, \big( x - c_D(y) \big)^2 \approx \frac{1}{L} \sum_{\ell=1}^{L} \left( \frac{1}{N} \sum_{n=0}^{N-1} \big( x_n - \hat{x}_n^{(\ell)} \big)^2 \right) \approx 0.0913    (27.96)

where the subscript 1 refers to the fact that this is the MSE for the first-order estimator, and where we are averaging over all experiments. Next, we evaluate the bias and variance effects. If we average the model parameters {α^{(ℓ)⋆}, θ^{(ℓ)⋆}} across all L experiments, we obtain an approximation for the mean estimator, E c_D(y), whose structure will take the form

\hat{x}_a^{\text{mean}} = \bar{\alpha}^\star y - \bar{\theta}^\star \qquad \big(\text{mean estimator, } E\, c_D(y)\big)    (27.97)

in terms of the sample average coefficients:

\bar{\alpha}^\star = \frac{1}{L} \sum_{\ell=1}^{L} \alpha^{(\ell)\star}, \qquad \bar{\theta}^\star = \frac{1}{L} \sum_{\ell=1}^{L} \theta^{(\ell)\star}    (27.98)

For each of the original measurements y_n, we use this average estimator to estimate its x-component and denote the estimates by:

\hat{x}_n^{\text{mean}} = \bar{\alpha}^\star y_n - \bar{\theta}^\star    (27.99)

These estimates can be used to approximate the bias term relative to the earlier optimal estimates x̂_n:

\text{bias}_1^2 \triangleq E_y\, \big( c^o(y) - E\, c_D(y) \big)^2 \approx \frac{1}{N} \sum_{n=0}^{N-1} \big( \hat{x}_n - \hat{x}_n^{\text{mean}} \big)^2 \approx 0.0893    (27.100)

We similarly approximate the variance effect by using

\text{variance}_1 \triangleq E_D\, \big( c_D(y) - E\, c_D(y) \big)^2 \approx \frac{1}{L} \sum_{\ell=1}^{L} \left( \frac{1}{N} \sum_{n=0}^{N-1} \big( \hat{x}_n^{(\ell)} - \hat{x}_n^{\text{mean}} \big)^2 \right) \approx 0.0002    (27.101)

Higher-order estimator

We repeat similar calculations for higher-order estimators, such as a third-order estimator of the form:

\hat{x}_a = \alpha^\star y + \beta^\star y^2 + \lambda^\star y^3 - \theta^\star    (27.102)

where the parameters (α*, β*, λ*, θ*) are determined by minimizing the squared error defined by:

(\alpha^\star, \beta^\star, \lambda^\star, \theta^\star) \triangleq \underset{\alpha,\beta,\lambda,\theta}{\text{argmin}}\; \left\{ \frac{1}{N} \sum_{n=0}^{N-1} \big( x_n - (\alpha y_n + \beta y_n^2 + \lambda y_n^3 - \theta) \big)^2 \right\}    (27.103)

The solution can be found by introducing the matrix and vector quantities:

H \triangleq \begin{bmatrix} 1 & y_0 & y_0^2 & y_0^3 \\ 1 & y_1 & y_1^2 & y_1^3 \\ 1 & y_2 & y_2^2 & y_2^3 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & y_{N-1} & y_{N-1}^2 & y_{N-1}^3 \end{bmatrix}, \qquad d \triangleq \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_{N-1} \end{bmatrix}, \qquad w \triangleq \begin{bmatrix} -\theta \\ \alpha \\ \beta \\ \lambda \end{bmatrix}    (27.104)

and considering the same least-squares problem (27.91), for which (27.93) continues to hold. We can then evaluate the MSE, bias effect, and variance effect for this third-order estimator as follows.

For each experiment with N_ℓ = 1000 data points {(x_n^{(ℓ)}, y_n^{(ℓ)})}, we construct the data quantities {H^{(ℓ)}, d^{(ℓ)}} and estimate the parameter vector w^{(ℓ)⋆}. Then, the structure of the estimator resulting from the ℓth experiment is given by:

\hat{x}_a^{(\ell)} = h^T w^{(\ell)\star}, \qquad \ell = 1, 2, \ldots, L    (27.105)

where the data vector h is constructed as follows:

h = \text{col}\{ 1, y, y^2, y^3 \}    (27.106)

We apply each of the estimators w^{(ℓ)⋆} to the data vector h_n corresponding to y_n to estimate its x-component:

\hat{x}_n^{(\ell)} = h_n^T w^{(\ell)\star}, \qquad \ell = 1, 2, \ldots, L    (27.107)


and use these values to approximate the MSE for the third-order estimator by using

\text{mse}_3 \triangleq E_{x,y,D}\, \big( x - c_D(y) \big)^2 \approx \frac{1}{L} \sum_{\ell=1}^{L} \left( \frac{1}{N} \sum_{n=0}^{N-1} \big( x_n - \hat{x}_n^{(\ell)} \big)^2 \right) \approx 0.0305    (27.108)

where the subscript 3 refers to the fact that this is the MSE for the third-order estimator. If we average the model parameters {w^{(ℓ)⋆}} across the L = 200 experiments, we obtain an approximation for the mean estimator whose structure has the form:

\hat{x}_a^{\text{mean}} = h^T \bar{w}^\star \qquad \big(\text{mean estimator, } E\, c_D(y)\big)    (27.109)

in terms of the average model:

\bar{w}^\star = \frac{1}{L} \sum_{\ell=1}^{L} w^{(\ell)\star}    (27.110)

For each of the original measurements y_n, we use this average estimator to estimate its x-component as well and denote the estimates by:

\hat{x}_n^{\text{mean}} = h_n^T \bar{w}^\star    (27.111)

These values can be used to approximate the bias term relative to the optimal estimates x̂_n:

\text{bias}_3^2 \triangleq E_y\, \big( c^o(y) - E\, c_D(y) \big)^2 \approx \frac{1}{N} \sum_{n=0}^{N-1} \big( \hat{x}_n - \hat{x}_n^{\text{mean}} \big)^2 \approx 0.0283    (27.112)

We similarly approximate the variance effect by computing

\text{variance}_3 \triangleq E_D\, \big( c_D(y) - E\, c_D(y) \big)^2 \approx \frac{1}{L} \sum_{\ell=1}^{L} \left( \frac{1}{N} \sum_{n=0}^{N-1} \big( \hat{x}_n^{(\ell)} - \hat{x}_n^{\text{mean}} \big)^2 \right) \approx 0.0003    (27.113)

We repeat the same construction for two other estimators of orders 5 and 7, respectively, where their h_n vectors are constructed as follows:

h_n = \text{col}\{ 1, y_n, y_n^2, y_n^3, y_n^4, y_n^5 \} \qquad (\text{5th order})    (27.114)

h_n = \text{col}\{ 1, y_n, y_n^2, y_n^3, y_n^4, y_n^5, y_n^6, y_n^7 \} \qquad (\text{7th order})    (27.115)

Table 27.4 lists the MSEs, squared bias terms, and variance terms for all estimators considered in this example. It is seen that as the complexity of the estimator increases from first order toward seventh order, the MSE improves, its bias effect also improves, while its variance effect worsens – this behavior is also illustrated in Fig. 27.11.


Table 27.4 Mean-square error, bias, and variance sizes for data-based estimators of first, third, fifth, and seventh order.

Order           | MSE    | Bias²  | Variance | MMSE + bias² + variance
optimal (MMSE)  | 0.0021 | 0      | 0        | 0.0021
1               | 0.0913 | 0.0893 | 0.0002   | 0.0916
3               | 0.0305 | 0.0283 | 0.0003   | 0.0307
5               | 0.0154 | 0.0128 | 0.0007   | 0.0156
7               | 0.0096 | 0.0068 | 0.0010   | 0.0099

Figure 27.11 (Top) Illustration of the bias and variance effects as a function of the

order of the regression polynomial. The horizontal line in the top plot shows the MMSE of the optimal conditional estimator, E (x|y). (Bottom) Scatter diagrams for the estimation errors generated by suboptimal polynomial estimators of increasing order. It is seen that the variance of the errors increases with the order.
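The whole sweep over polynomial orders can be automated. The sketch below generalizes the earlier snippet and prints an MSE/bias²/variance breakdown for orders 1, 3, 5, and 7 (our own code with an arbitrary seed; the exact numbers will differ slightly from Table 27.4):

```python
import numpy as np

# Sketch: bias-variance breakdown for polynomial estimators of increasing order.
rng = np.random.default_rng(7)
sigma_v2, N, L, N_ell = 0.1, 50_000, 200, 1000

def sample(n):
    x = rng.choice([-1.0, 1.0], size=n)
    return x, x + np.sqrt(sigma_v2) * rng.standard_normal(n)

def features(y, order):                     # columns [1, y, y^2, ..., y^order]
    return np.vander(y, order + 1, increasing=True)

x, y = sample(N)
x_opt = np.tanh(y / sigma_v2)
print("MMSE ≈", round(np.mean((x - x_opt) ** 2), 4))

for order in (1, 3, 5, 7):
    H = features(y, order)
    W = np.empty((L, order + 1))
    for ell in range(L):
        x_l, y_l = sample(N_ell)
        W[ell], *_ = np.linalg.lstsq(features(y_l, order), x_l, rcond=None)
    X_hat = H @ W.T                         # estimates from each experiment
    x_mean = H @ W.mean(axis=0)             # mean estimator over experiments
    print("order", order,
          " mse ≈", round(np.mean((x[:, None] - X_hat) ** 2), 4),
          " bias^2 ≈", round(np.mean((x_opt - x_mean) ** 2), 4),
          " variance ≈", round(np.mean((X_hat - x_mean[:, None]) ** 2), 4))
```

As the order grows, the printed bias term shrinks while the variance term grows, which is the trade-off summarized in Table 27.3.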

27.5 COMMENTARIES AND DISCUSSION

Mean-square-error criterion. The earlier part of the chapter, dealing with the formulation of the MSE inference problem, is adapted from the presentation in Sayed (2003, 2008). It is worth noting that the underlying squared-error criterion, where the square of the estimation error is used as a measure of performance, has a distinguished but controversial history. It dates back to works by the German and French mathematicians Carl Friedrich Gauss (1777-1855) and Adrien Legendre (1752-1833),


respectively, who developed in Gauss (1809) and Legendre (1805) a deterministic least-squares-error criterion as opposed to the stochastic least-mean-squares criterion (27.17) used in this chapter. Legendre (1805) was the first to publish an account of the least-squares formulation and used it to fit a line to experimental data. A controversy erupted at the time with Gauss (1809) claiming priority for having developed (but not published) the least-squares formulation earlier in 1795, almost a decade before Legendre. Plackett (1972) and Stigler (1981, 1986) provide insightful accounts of this history and comment on the indirect evidences presented by Gauss to solidify his claim to the method. Interestingly, the method of least-squares was also published around the same time by the Irish-American mathematician Robert Adrain (1775–1843) in the work by Adrain (1808). Gauss’ (1809) formulation was broader than Legendre’s (1805) and was motivated by Gauss’ work on celestial orbits. A distinctive feature of the square-error criterion is that it penalizes large errors more than small errors. In this way, it is more sensitive to the presence of outliers in the data. This is in contrast, for example, to the technique proposed by the French mathematician Pierre-Simon Laplace (1749–1827), which relied on the use of the absolute error criterion as performance measure (see Sheynin (1977)). We noted in the body of the chapter, following at times the presentation from Sayed (2003, 2008), that the stochastic version of Gauss’ criterion for estimating a random variable x from observations of another random variable y is to minimize the MSE, which leads to the conditional mean estimator: argmin E (x − c(y))2 =⇒ x b = E (x|y = y)

(27.116)

c(·)

We will formulate in the next chapter Laplace’s criterion, which seeks to minimize the MAE, namely, argmin E |x − c(y)|

(27.117)

c(y)

and show in Prob. 27.13 that the solution is given by the median (rather than the mean) of the conditional distribution, fx|y (x|y), i.e., by the value x b that satisfies ˆ

x b −∞

ˆ



fx|y (x|y)dx =

fx|y (x|y)dx = 1/2

(27.118)

x b

As noted in Kailath, Sayed, and Hassibi (2000) and Sayed (2003, 2008), Gauss was very much aware of the distinction between both design criteria and this is how he commented on his squared-error criterion in relation to Laplace’s absolute-error criterion: Laplace has also considered the problem in a similar manner, but he adopted the absolute value of the error as the measure of this loss. Now if I am not mistaken, this convention is no less arbitrary than mine. Should an error of double size be considered as tolerable as a single error twice repeated or worse? Is it better to assign only twice as much influence to a double error or more? The answers are not self-evident, and the problem cannot be resolved by mathematical proofs, but only by an arbitrary decision. Extracted from the translation by Stewart (1995) Besides Gauss’ motivation, there are many good reasons for using the MSE criterion, not the least of which is the fact that it leads to a closed-form characterization of the solution as a conditional mean. In addition, for Gaussian random variables, it can be shown that the MSE estimator is practically optimal for any other choice of the error cost function (quadratic or otherwise) – see, for example, Pugachev (1958) and Zakai (1964). Bias–variance relation. The bias–variance relation (27.75) helps explain some counter intuitive behavior about the performance of data-based estimators. It reveals that more

1084

Mean-Square-Error Inference

complex estimators do not necessarily outperform simpler ones because the MSE is degraded by two competing effects. One effect relates to a squared bias error and the second effect relates to a variance error. When one error is small, the other is large and vice-versa. For example, the smaller the bias is (i.e., the closer the estimate is to the optimal MSE estimator), the larger its variance will be, meaning that, over repeated experiments, we will tend to observe larger variations in the estimated quantity. On the other hand, the smaller the variance effect is (i.e., the fewer variations we observe in the value of the estimated quantity over repeated experiments), the more likely it is that the estimate has a larger bias than normal and lies further away from the optimal estimator. In later chapters, when we discuss classification problems and the generalization error of classifiers, we will encounter a similar bias–variance relation in future expression (64.25). We will comment in the concluding remarks of Chapter 64 on the role of this relation in the context of classification theory and provide several relevant references, including works by German, Bienenstock, and Doursat (1992), Kong and Dietterich (1995), Breiman (1994, 1996a, b), Tibshirani (1996a), James and Hastie (1997), Kohavi and Wolpert (1996), Friedman (1997), Domingos (2000), James (2003), and Geurts (2005), as well as the text by Hastie, Tibshirani, and Friedman (2009). Circularity and MSE estimation. We introduced in Section 4.6 the circularity condition (4.142) while discussing complex-valued random variables that are Gaussian distributed. The derivation in that section helped clarify how the pdf of a circular (also called proper) Gaussian random variable, z, is fully characterized by its firstand second-order moments, {¯ z , Rz }, in a manner similar to the real case. It also follows from examining expression (27.127) in the appendix that uncorrelated circular Gaussian variables are independent. It should be noted though that complex random signals are not always circular, in which case the lack of circularity needs to be taken into account in the solution of inference problems to avoid performance degradation. For the sake of illustration, and motivated by the discussion in Kailath, Sayed, and Hassibi (2000) and Sayed (2003, 2008), consider jointly distributed real Gaussian random vectors {d, x, y} with zero mean, and let us examine the problem of estimating d from the complex random variable z = x + jy. We already know that the optimal MSE estimator for d given {x, y} is the conditional mean E (d|x, y). When z is circular, knowledge of the moment Rz is equivalent to knowledge of the covariance matrices {Rx , Ry } so that this same estimator can be viewed as the conditional mean of d given z, i.e., E (d|z). On the other hand, when z is not circular, then knowledge of Rz alone is not sufficient to characterize the distribution of z, as already noted by (4.141), since Rz = (Rx + Ry ) + j(Ryx − Rxy )

(27.119)

Let us introduce the extended vector ∆

s =



z (z ∗ )T



E zz T (E zz ∗ )T



(27.120)

and its covariance matrix ∆

Rs = E ss∗ =



E zz ∗ ∗ E zz T

 =

Rz R2T

R2 RzT

 (27.121)

where ∆

R2 = E zz T = (Rx − Ry ) + j(Rxy + Ryx )

(27.122)

Observe now that knowledge of both {Rz , R2 } enables us to recover the covariances {Rx , Ry , Rxy } by solving  Rx + Ry = Re(Rz ), Ryx − Rxy = Im(Rz ) (27.123) Rx − Ry = Re(R2 ), Ryx + Rxy = Im(R2 )


Therefore, when z is noncircular, the problem of estimating d from {x, y} can be equivalently rewritten as E (d|z, z ∗ ) involving the conditional mean of d given both {z, z ∗ }. One of the earliest works to consider noncircularity in the characterization of complex random signals is Brown and Crane (1969), followed by Comon (1986), Picinbono (1993), Picinbono and Chevalier (1995), Amblard and Duvaut (1995), and Lacoume (1998), among others. The texts by Mandic and Goh (2009), Schreier and Scharf (2010), and the article by Adali, Schreier, and Scharf (2011) provide useful expositions and examples on this subject matter. Iris dataset. This dataset was originally used by Fisher (1936) and is available at the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/ iris – see Dua and Graff (2019). Figure 27.5 displays three types of iris flowers: virginica (photo by Frank Mayfield), setosa (photo by Radomil Binek), and versicolor (photo by Danielle Langlois). The source of the images is Wikimedia commons, where they are available for use under the Creative Commons Attribution Share-Alike License. The relevant links are: (a) https://commons.wikimedia.org/wiki/File:Iris_virginica.jpg (b) https://commons.wikimedia.org/wiki/File:Kosaciec_szczecinkowaty_Iris_setosa.jpg (c) https://commons.wikimedia.org/wiki/File:Iris_versicolor_3.jpg

PROBLEMS

Note: Some problems in this section are adapted from exercises in Kailath, Sayed, and Hassibi (2000) and Sayed (2003, 2008).

27.1 Consider Example 27.3 but assume now that the noise is uniformly distributed between −1/2 and 1/2. Show that x̂_MSE = 1 when y ∈ [1/2, 3/2] and x̂_MSE = −1 when y ∈ [−3/2, −1/2].

27.2 Consider the same setting of Example 27.3 but assume now that the noise v has a generic variance σ_v².
(a) Show that the optimal least-mean-squares estimator of x is x̂_MSE = tanh(y/σ_v²).
(b) Argue that x̂_MSE = sign(y) can be chosen as a suboptimal estimator.

27.3 Consider the same setting of Example 27.3 but assume now that the noise v has mean v̄ and unit variance.
(a) Show that the optimal least-mean-squares estimator is x̂_MSE = tanh(y − v̄).
(b) Argue that x̂_MSE = sign(y − v̄) can be chosen as a suboptimal estimator.

27.4 Consider noisy observations y(n) = x + v(n), where x and v(n) are independent random variables, v(n) is a white-noise Gaussian random process with zero mean and variance σ_v², and x takes the values ±1 with equal probability. The value of x is fixed at either +1 or −1 for all measurements {y(n)}. The whiteness assumption on v(n) means that E v(n)v(m) = 0 for n ≠ m.
(a) Show that the least-mean-squares estimate of x given {y(0), . . . , y(N − 1)} is

$$\widehat{x}_N = \tanh\left( \sum_{n=0}^{N-1} y(n)/\sigma_v^2 \right)$$

(b) Assume x takes the value +1 with probability p and the value −1 with probability 1 − p. Show that the least-mean-squares estimate of x is now given by

$$\widehat{x}_N = \tanh\left\{ \frac{1}{2}\ln\left(\frac{p}{1-p}\right) + \sum_{n=0}^{N-1} y(n)/\sigma_v^2 \right\}$$

in terms of the natural logarithm of p/(1 − p).



(c) Assume the noise is correlated with covariance matrix R_v = E vv^T, where v = col{v(0), . . . , v(N − 1)}. Show that the least-mean-squares estimate of x becomes

$$\widehat{x}_N = \tanh\left\{ \frac{1}{2}\ln\left(\frac{p}{1-p}\right) + y^{\sf T} R_v^{-1}\mathbf{1} \right\}$$

where y = col{y(0), . . . , y(N − 1)} and 1 = col{1, 1, . . . , 1}.

27.5 Refer to Prob. 3.16. Show that the least-mean-squares estimate of x given y = y is

$$\widehat{x}_{\rm MSE} = \frac{1}{\lambda_1 - \lambda_2} - \frac{y\, e^{-\lambda_1 y}}{e^{-\lambda_2 y} - e^{-\lambda_1 y}}$$

27.6 A random variable x assumes the value +1 with probability p and the value −1 with probability 1 − p. The distribution of a second random variable v depends on the value assumed by x. If x = +1, then v is either Gaussian distributed N_v(0, σ_a²) with probability a or uniformly distributed within the interval [−σ_a, σ_a] with probability 1 − a. On the other hand, if x = −1, then v is either Gaussian distributed N_v(0, σ_b²) with probability b or uniformly distributed within the interval [−σ_b, σ_b] with probability 1 − b. Let y = x + v.
(a) Determine E(v|x = +1), E(v|x = −1), and E(v).
(b) Determine E(y|x = +1), E(y|x = −1), and E(y).
(c) Determine the optimal MSE estimator, x̂_MSE = E(x|y).
(d) Compute σ_x², σ_y², and σ_xy = E xy.

27.7 The state of some system of interest is described by a Gaussian random variable θ with mean θ̄ and variance σ_θ². An observation of θ is collected under additive white Gaussian noise, namely, y = θ + v, where v has zero mean and variance σ_v² and is independent of θ.
(a) Using the fact that y and θ are jointly Gaussian distributed, show that the optimal MSE estimator, θ̂_MSE = E(θ|y), has a Gaussian distribution. Find the mean and variance of this distribution in terms of the signal-to-noise ratio, SNR = σ_θ²/σ_v².
(b) Show that the optimal estimator of θ given N independent observations y = col{y(n)}_{n=1}^{N} is

$$\widehat{\theta}_{\rm MSE} = \bar{\theta} + \sigma_\theta^2\, \mathbf{1}_N^{\sf T} \left( \sigma_\theta^2 \mathbf{1}\mathbf{1}^{\sf T} + \sigma_v^2 I_N \right)^{-1} (y - \mathbf{1}\bar{\theta})$$

(c) Show that the variance of the estimator in part (b) is given by

$$\sigma_{\hat{\theta}}^2 = \frac{\sigma_\theta^2}{1 + \frac{1}{N \times {\rm SNR}}}$$

Conclude that this variance approaches σ_θ² as N → ∞.

27.8 The state of some system of interest is described by a binary variable θ, which can be either 0 or 1 with equal probability. Let y be a random variable that is observed according to the following probability distribution:

             θ = 0     θ = 1
   y = 0       q       1 − q
   y = 1     1 − q       q

We collect N independent observations {y(1), . . . , y(N)}. We wish to employ these observations to learn about the state of the system. Assume the true state is θ = 0.
(a) Find the optimal MSE estimator of θ given these observations, namely, θ̂_N = E(θ|y(1), y(2), . . . , y(N)).


(b) Assume q ≠ 0.5. Show that θ̂_N decays to zero exponentially in the MSE sense as N → ∞. That is, verify that the MSE converges to zero at an exponential rate. What happens when q = 0.5?
(c) Find an expression for the variance of θ̂_N. Find its limit as N → ∞.
(d) Why are these results useful?

27.9 Consider two scalar random variables x₁ and x₂. The random variable x₁ assumes the value +1 with probability p and the value −1 with probability 1 − p. The random variable x₂ is distributed as follows:

if x₁ = +1, then x₂ = +2 with probability q and x₂ = −2 with probability 1 − q;
if x₁ = −1, then x₂ = +3 with probability r and x₂ = −3 with probability 1 − r.

Consider further the variables y₁ = x₁ + v₁ and y₂ = x₁ + x₂ + v₂, where {v₁, v₂} are independent zero-mean Gaussian random variables with unit variance. The variables {v₁, v₂} are independent of x₁ and x₂.
(a) Express the pdfs of x₁ and x₂ in terms of delta functions.
(b) Find the joint pdf of {x₁, x₂}.
(c) Find the joint pdf of {y₁, y₂}.
(d) Find the joint pdf of {x₁, x₂, y₁, y₂}.
(e) Find the conditional pdf of {x₁, x₂} given {y₁, y₂}.
(f) Find the MSE estimator of x₂ given {y₁, y₂}.
(g) Find the MSE estimator of x₂ given {y₁, y₂, x₁}.

27.10 A random variable x assumes the value +1 with probability p and the value −1 with probability 1 − p. The variable x is observed under additive noise, say, as y = x + v, where v has mean v̄ and variance σ_v². Both x and v are independent of each other. In this problem, we consider different pdfs for v.
(a) Assume initially that v is Gaussian. What is the optimal MSE estimator for x given y? What is the corresponding MMSE?
(b) Assume instead that v is uniformly distributed. What is the optimal MSE estimator for x given y? What is the corresponding MMSE?
(c) Assume now that v is exponentially distributed. What is the optimal MSE estimator for x given y? What is the corresponding MMSE?
(d) Which noise distribution results in the smallest MMSE?
(e) Over all possible pdfs, determine the pdf of noise that results in the smallest MMSE for optimal estimation.

27.11 Refer to expression (27.129) for estimating one random vector x from observations of another random vector y, where x and y are possibly complex-valued as well. Show that the solution is again given by the conditional mean estimator, x̂_MSE = E(x|y). We remark that the conditional mean of a complex random variable, with real and imaginary parts x = x_r + jx_i, given another complex random variable y = y_r + jy_i, is defined as E(x|y) = E(x_r|y_r, y_i) + jE(x_i|y_r, y_i).

27.12 Show that the least-mean-squares estimator x̂_MSE = E(x|y) also minimizes E x̃^T W x̃, for any symmetric nonnegative-definite matrix W.

27.13 Consider two zero-mean scalar random variables {x, y} and let f_{x|y}(x|y) denote the conditional pdf of x given y. We already know that the estimator for x that minimizes the MSE, E(x − c(y))², over c(·) is given by the mean of the conditional distribution f_{x|y}(x|y). Show that the solution to the following alternative problem, where the quadratic measure is replaced by the absolute measure:

$$\widehat{x}_{\rm MAE} = \mathop{\text{argmin}}_{c(\cdot)}\; E\, |x - c(y)|$$


is given by the median of the same conditional distribution, f_{x|y}(x|y), where the median is the point a such that

$$\int_{-\infty}^{a} f_{x|y}(x|y)\,dx = \int_{a}^{\infty} f_{x|y}(x|y)\,dx = 1/2$$

27.14 Derive expressions (27.70a)–(27.70b). Show that expression (27.93) leads to the same conclusion.

27.15 All variables are scalar. Let x̂ = c°(y) denote the optimal MSE estimator for x with MMSE given by E(x − c°(y))². We already know that c°(y) = E(x|y). Let c(y) denote any other estimator for x.
(a) Show that

$$E\big(x - c(y)\big)^2 = {\rm MMSE} + E\big(c^o(y) - c(y)\big)^2$$

(b) Verify that

$$E\big(c^o(y) - c(y)\big)^2 = E\big(c^o(y) - E\,c(y)\big)^2 + E\big(c(y) - E\,c(y)\big)^2 + 2\, E\Big\{ \big(c^o(y) - E\,c(y)\big)\big(E\,c(y) - c(y)\big) \Big\}$$

(c) Conclude that, in general,

$$E\big(x - c(y)\big)^2 \neq {\rm MMSE} + E\big(c^o(y) - E\,c(y)\big)^2 + E\big(c(y) - E\,c(y)\big)^2$$

27.16 All variables are scalar. Let y = f(x) + v for some known function f(·), and where v is zero-mean noise with variance σ_v² and independent of all other random variables. Let x̂ denote some arbitrary estimator for x. Let further ŷ = f(x̂) denote an estimator for y that is computed from x̂. We measure the quality of x̂ by measuring the MSE in estimating y, namely, the quantity E(y − ŷ)². Show that the following bias–variance relation holds:

$$E\big( (y - \widehat{y})^2 \,|\, x \big) = \sigma_v^2 + \big( f(x) - E(f(\widehat{x})|x) \big)^2 + E\big( f(\widehat{x}) - E(f(\widehat{x})|x) \big)^2$$

Explain why the second and third terms on the right-hand side of the above expression are commonly referred to as (squared) bias and variance terms.

27.17 Verify the validity of expression (27.135).

27.A CIRCULAR GAUSSIAN DISTRIBUTION

We focused in Section 27.3 on real scalar-valued random variables. However, the discussion is general and extends to complex vector-valued random variables as well. We illustrate this fact in this appendix following Sayed (2003, 2008). Thus, assume now that x and y are circular complex-valued random vectors of dimensions p × 1 and q × 1, respectively. Referring to (4.147), this means that their individual pdfs are given by:

$$f_x(x) = \frac{1}{\pi^p \det R_x} \exp\Big\{ -(x - \bar{x})^* R_x^{-1} (x - \bar{x}) \Big\} \tag{27.124a}$$

$$f_y(y) = \frac{1}{\pi^q \det R_y} \exp\Big\{ -(y - \bar{y})^* R_y^{-1} (y - \bar{y}) \Big\} \tag{27.124b}$$

where the covariance matrices of x and y are assumed to be positive-definite and, hence, invertible:

$$R_x = E\,(x - \bar{x})(x - \bar{x})^* > 0 \tag{27.125a}$$
$$R_y = E\,(y - \bar{y})(y - \bar{y})^* > 0 \tag{27.125b}$$


Assume further that x and y are second-order circular, which means that

$$E\,(x - \bar{x})(y - \bar{y})^{\sf T} = 0 \tag{27.126}$$

This condition implies that the aggregate variable z = col{x, y} is circular and, therefore, it can be verified from (4.147) that the joint pdf of x and y has the form:

$$f_{x,y}(x, y) = \frac{1}{\pi^{p+q} \det R} \exp\left\{ -\begin{bmatrix} (x - \bar{x})^* & (y - \bar{y})^* \end{bmatrix} R^{-1} \begin{bmatrix} x - \bar{x} \\ y - \bar{y} \end{bmatrix} \right\} \tag{27.127}$$

where

$$R = \begin{bmatrix} R_x & R_{xy} \\ R_{yx} & R_y \end{bmatrix}, \qquad R_{xy} = E\,(x - \bar{x})(y - \bar{y})^* \tag{27.128}$$

We continue to assume that R is positive-definite so that it is invertible. We are interested in determining the form of the optimal estimator for x given y. For this purpose, we note first that the MSE formulation (27.17) needs to be adjusted as follows for estimating a vector from another vector:

$$\widehat{x} = \mathop{\text{argmin}}_{c(y)}\; E\,\|\widetilde{x}\|^2 \tag{27.129}$$

in terms of the mean of the squared Euclidean norm of the error vector, ‖x̃‖² = x̃*x̃. In this case, the estimator function c(y) maps the vector y into the vector x̂ = c(y). If we let R_x̃ denote the covariance matrix of the error vector, namely,

$$R_{\widetilde{x}} \;\stackrel{\Delta}{=}\; E\, \widetilde{x}\widetilde{x}^* \tag{27.130}$$

then

$$E\,\|\widetilde{x}\|^2 = {\rm Tr}(R_{\widetilde{x}}) \tag{27.131}$$

so that, in effect, problem (27.129) is seeking the estimator x̂ that minimizes the trace of the error covariance matrix:

$$\widehat{x} = \mathop{\text{argmin}}_{c(y)}\; {\rm Tr}(R_{\widetilde{x}}) \tag{27.132}$$

We show in Prob. 27.11 that the solution continues to be given by the conditional mean estimator, x̂ = E(x|y). So let us determine the form of this conditional mean for circular Gaussian vectors that are jointly distributed according to (27.127). For this purpose, we introduce the quantities

$$\widehat{x} \;\stackrel{\Delta}{=}\; \bar{x} + R_{xy} R_y^{-1} (y - \bar{y}) \tag{27.133}$$
$$\Sigma_x \;\stackrel{\Delta}{=}\; R_x - R_{xy} R_y^{-1} R_{yx} \tag{27.134}$$

and then recall from result (1.75b) that Σ_x > 0. If we now repeat the derivation that led to (4.88a) we similarly conclude that the conditional pdf of x given y is given by the following expression:

$$f_{x|y}(x|y) = \frac{1}{\pi^p \det \Sigma_x} \exp\Big\{ -(x - \widehat{x})^* \Sigma_x^{-1} (x - \widehat{x}) \Big\} \tag{27.135}$$

which has the form of a circular Gaussian distribution. It follows that the MSE estimator of x given y again depends in an affine manner on the observation, i.e.,

$$\widehat{x} = E\,(x|y) = \bar{x} + R_{xy} R_y^{-1} (y - \bar{y}) \tag{27.136}$$
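As a concrete illustration, here is a minimal Python/NumPy sketch (not from the text) that evaluates the affine estimator (27.136) and the error covariance (27.134) from given first- and second-order moments; the moment values are arbitrary placeholders chosen only for the example.

```python
import numpy as np

def circular_mse_estimate(x_bar, y_bar, Rxy, Ry, y):
    """Affine MSE estimator (27.136): x_hat = x_bar + Rxy Ry^{-1} (y - y_bar)."""
    return x_bar + Rxy @ np.linalg.inv(Ry) @ (y - y_bar)

# Illustrative placeholder moments (p = q = 2).
x_bar = np.array([1 + 1j, 0.5 - 0.5j])
y_bar = np.array([0.0 + 0j, 1.0 + 0j])
Rx  = 2.0 * np.eye(2)
Ry  = np.array([[2.0, 0.3 - 0.1j], [0.3 + 0.1j, 1.5]])   # Hermitian, positive-definite
Rxy = np.array([[0.4 + 0.2j, 0.1], [0.0, 0.3 - 0.1j]])

y_obs = np.array([0.2 - 0.3j, 1.1 + 0.4j])
x_hat = circular_mse_estimate(x_bar, y_bar, Rxy, Ry, y_obs)

# Error covariance (27.134), using Ryx = Rxy^* (conjugate transpose):
Sigma_x = Rx - Rxy @ np.linalg.inv(Ry) @ Rxy.conj().T

print("x_hat        :", x_hat)
print("diag(Sigma_x):", np.real_if_close(np.diag(Sigma_x)))   # per-entry MMSE
```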


REFERENCES

Adali, T., P. J. Schreier, and L. L. Scharf (2011), "Complex-valued signal processing: The proper way to deal with impropriety," IEEE Trans. Signal Process., vol. 59, no. 11, pp. 5101–5125.
Adrain, R. (1808), "Research concerning the probabilities of the errors which happen in making observations," The Analyst or Mathematical Museum, vol. 1, no. 4, pp. 93–109.
Amblard, P. O. and P. Duvaut (1995), "Filtrage adapté dans le cas Gaussien complexe non circulaire," in Proc. GRETSI Conf., pp. 141–144, Juan-les-Pins.
Breiman, L. (1994), "Heuristics of instability in model selection," Ann. Statist., vol. 24, no. 6, pp. 2350–2383.
Breiman, L. (1996a), "Stacked regressions," Mach. Learn., vol. 24, no. 1, pp. 41–64.
Breiman, L. (1996b), "Bagging predictors," Mach. Learn., vol. 24, no. 2, pp. 123–140.
Brown, W. M. and R. B. Crane (1969), "Conjugate linear filtering," IEEE Trans. Inf. Theory, vol. 15, no. 4, pp. 462–465.
Comon, P. (1986), "Estimation multivariable complexe," Traitement du Signal, vol. 3, pp. 97–102.
Domingos, P. (2000), "A unified bias–variance decomposition," Proc. Int. Conf. Machine Learning (ICML), pp. 231–238, Stanford, CA.
Dua, D. and C. Graff (2019), UCI Machine Learning Repository, available at http://archive.ics.uci.edu/ml, School of Information and Computer Science, University of California, Irvine.
Fisher, R. A. (1936), "The use of multiple measurements in taxonomic problems," Ann. Eugenics, vol. 7, no. 2, pp. 179–188.
Friedman, J. H. (1997), "On bias, variance, 0/1 loss, and the curse-of-dimensionality," Data Mining Knowl. Discov., vol. 1, pp. 55–77.
Gauss, C. F. (1809), Theory of the Motion of the Heavenly Bodies Moving about the Sun in Conic Sections, English translation by C. H. Davis, 1857, Little, Brown, and Company, Boston, MA.
Geman, S., E. Bienenstock, and R. Doursat (1992), "Neural networks and the bias/variance dilemma," Neural Comput., vol. 4, pp. 1–58.
Geurts, P. (2005), "Bias vs variance decomposition for regression and classification," in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, editors, pp. 749–763, Springer.
Hastie, T., R. Tibshirani, and J. Friedman (2009), The Elements of Statistical Learning, 2nd ed., Springer.
James, G. M. (2003), "Variance and bias for general loss functions," Mach. Learn., vol. 51, pp. 115–135.
James, G. and T. Hastie (1997), "Generalizations of the bias/variance decomposition for prediction error," Technical Report, Department of Statistics, Stanford University, Stanford, CA.
Kailath, T., A. H. Sayed, and B. Hassibi (2000), Linear Estimation, Prentice Hall.
Kohavi, R. and D. H. Wolpert (1996), "Bias plus variance decomposition for zero-one loss functions," Proc. Int. Conf. Machine Learning (ICML), pp. 275–283, Tahoe City, CA.
Kong, E. B. and T. G. Dietterich (1995), "Error-correcting output coding corrects bias and variance," Proc. Int. Conf. Machine Learning (ICML), pp. 313–321, Tahoe City, CA.
Lacoume, J. L. (1998), "Variables et signaux aléatoires complexes," Traitement du Signal, vol. 15, pp. 535–544.
Legendre, A. M. (1805), Nouvelles Méthodes pour la Détermination des Orbites de Comètes, Courcier.
Mandic, D. P. and V. S. L. Goh (2009), Complex Valued Nonlinear Adaptive Filters, Wiley.


Picinbono, B. (1993), Random Signals and Systems, Prentice Hall.
Picinbono, B. and P. Chevalier (1995), "Widely linear estimation with complex data," IEEE Trans. Signal Process., vol. 43, pp. 2030–2033.
Plackett, R. L. (1972), "The discovery of the method of least-squares," Biometrika, vol. 59, pp. 239–251.
Pugachev, V. S. (1958), "The determination of an optimal system by some arbitrary criterion," Aut. Remote Control, vol. 19, pp. 513–532.
Sayed, A. H. (2003), Fundamentals of Adaptive Filtering, Wiley.
Sayed, A. H. (2008), Adaptive Filters, Wiley.
Schreier, P. J. and L. L. Scharf (2010), Statistical Signal Processing of Complex-Valued Data, Cambridge University Press.
Sheynin, O. B. (1977), "Laplace's theory of errors," Arch. Hist. Exact Sci., vol. 17, pp. 1–61.
Stewart, G. W. (1995), Theory of the Combination of Observations Least Subject to Errors, SIAM. Translation of original works by C. F. Gauss under the title Theoria Combinationis Observationum Erroribus Minimis Obnoxiae.
Stigler, S. M. (1981), "Gauss and the invention of least-squares," Ann. Statist., vol. 9, no. 3, pp. 465–474.
Stigler, S. M. (1986), The History of Statistics: The Measurement of Uncertainty before 1900, Harvard University Press.
Tibshirani, R. (1996a), "Bias, variance, and prediction error for classification rules," Technical Report, Department of Preventive Medicine and Biostatistics and Department of Statistics, University of Toronto, Toronto, Canada.
Zakai, M. (1964), "General error criteria," IEEE Trans. Inf. Theory, vol. 10, pp. 94–95.

28 Bayesian Inference

The mean-square-error (MSE) criterion (27.17) is one notable example of the Bayesian approach to statistical inference. In the Bayesian approach, both the unknown quantity, x, and the observation, y, are treated as random variables and an estimator x̂ for x is sought by minimizing the expected value of some loss function denoted by Q(x, x̂). In the previous chapter, we focused exclusively on the quadratic loss Q(x, x̂) = (x − x̂)² for scalar x. In this chapter, we consider more general loss functions, which will lead to other types of inference solutions such as the mean-absolute error (MAE) and the maximum a-posteriori (MAP) estimators. We will also derive the famed Bayes classifier as a special case when the realizations for x are limited to the discrete values x ∈ {±1}.

28.1 BAYESIAN FORMULATION

Consider scalar random variables {x, y}, where y is observable and the objective is to infer the value of x. The estimator for x is denoted by x̂ and is defined as some function of y, denoted by c(y), to be determined by minimizing an average loss over the joint distribution of {x, y}. The purpose of the loss function is to measure the discrepancy between x and its estimator. The inference problem is stated as:

$$\widehat{x}_Q \;\stackrel{\Delta}{=}\; \mathop{\text{argmin}}_{\widehat{x}=c(y)}\; E\, Q(x, \widehat{x}) \tag{28.1}$$

where the loss Q(·, ·) is nonnegative, and the expectation is over the joint probability density function (pdf) f_{x,y}(x, y). Similar to what we did in the last chapter, we will continue to employ continuous-time distributions in our presentation with the understanding that the arguments can be easily adjusted for discrete distributions. Observe that we are attaching a subscript Q to x̂_Q to highlight its dependence on the choice of loss functions. The cost that appears in (28.1) in the form of an expected loss is also referred to as the risk and is denoted by:

$$R(c) \;\stackrel{\Delta}{=}\; E\, Q(x, \widehat{x}) \qquad \text{(risk function)} \tag{28.2}$$


Note that the risk depends on c(y). Different choices for c(y) will generally have different risk values and the objective is to choose an optimal mapping, denoted by c°(y), with the smallest risk value possible:

$$R(c^o) = \min_{\widehat{x}=c(y)}\; E\, Q(x, \widehat{x}) \tag{28.3}$$

For later use, it is useful to note that formulation (28.1) admits an equivalent characterization. Using the conditional mean property (27.24), we rewrite the mean loss in the following form by conditioning on the observation y:

$$E\, Q(x, \widehat{x}) = E_{y}\Big\{ E_{x|y}\big[ Q(x, \widehat{x}) \,\big|\, y \big] \Big\} \tag{28.4}$$

Now, since the loss function assumes nonnegative values, the minimization in (28.1) can be attained by solving instead:

$$\widehat{x}_Q(y) \;\stackrel{\Delta}{=}\; \mathop{\text{argmin}}_{\widehat{x}=c(y)}\; E_{x|y}\big[ Q(x, \widehat{x}) \,\big|\, y = y \big] \tag{28.5}$$

where the expectation of the loss function is now evaluated relative to the conditional pdf, f_{x|y}(x|y). This conditional pdf is known as the predictive distribution because it enables us to "predict" values for x for each individual observation for y. The predictor x̂_Q in (28.5) is a function of y and that is why we are denoting it more explicitly by writing x̂_Q(y), with an argument y. Using relation (28.4), we then find that the minimal risk value admits the representation:

$$R(c^o) = E_{y}\Big\{ E_{x|y}\big[ Q(x, \widehat{x}_Q(y)) \,\big|\, y = y \big] \Big\} \tag{28.6}$$

Formulations (28.1) and (28.5) are also valid when either x or y (or both) are vector-valued. We continue with the scalar case for illustration purposes. Two special cases of Bayesian estimators are evident.
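To make formulation (28.5) concrete, the following minimal Python sketch (an illustration added here, not from the text) evaluates the conditional expected loss over a grid of candidate values x̂ for a toy discrete conditional pmf, and shows how different losses single out different summary statistics of f_{x|y}(x|y). The pmf values are placeholders.

```python
import numpy as np

# Toy conditional distribution of x given an observed y = y (placeholder numbers).
x_vals = np.array([-1.0, 0.0, 1.0, 2.0])
p_x_given_y = np.array([0.1, 0.2, 0.5, 0.2])   # sums to 1

def best_estimate(loss):
    """Solve (28.5): minimize E_{x|y}[ Q(x, x_hat) | y = y ] over candidates x_hat."""
    candidates = np.linspace(-2, 3, 501)
    risks = [np.sum(p_x_given_y * loss(x_vals, c)) for c in candidates]
    return candidates[int(np.argmin(risks))]

mse_est = best_estimate(lambda x, c: (x - c) ** 2)    # quadratic loss -> conditional mean
mae_est = best_estimate(lambda x, c: np.abs(x - c))   # absolute loss  -> conditional median

print("MSE estimate :", mse_est, " (conditional mean =", np.sum(p_x_given_y * x_vals), ")")
print("MAE estimate :", mae_est, " (conditional median = 1.0)")
```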

MSE inference
In the MSE case, the loss function is quadratic and chosen as

$$Q(x, \widehat{x}) = (x - \widehat{x})^2 \tag{28.7}$$

In this case, we can solve problem (28.1) explicitly; we showed in the last chapter that the estimator is the mean of the conditional distribution of x given y, i.e.,

$$\widehat{x}_{\rm MSE} = E\,(x|y) \tag{28.8}$$

Note that we are attaching a subscript MSE to distinguish this estimator from other estimators discussed below.


MAE inference
In this case, the loss function is the absolute error, namely,

$$Q(x, \widehat{x}) = |x - \widehat{x}| \tag{28.9}$$

We showed in Prob. 27.13 that the corresponding estimator x̂ is given by the median (rather than the mean) of the conditional distribution, f_{x|y}(x|y). That is, the value of x̂_MAE is the point that enforces the equality:

$$\int_{-\infty}^{\widehat{x}_{\rm MAE}} f_{x|y}(x|y)\,dx = \int_{\widehat{x}_{\rm MAE}}^{\infty} f_{x|y}(x|y)\,dx = 1/2 \tag{28.10}$$

28.2 MAXIMUM A-POSTERIORI INFERENCE

Another popular inference solution is the MAP estimator. While the MSE estimator, x̂_MSE, selects the value x that corresponds to the mean of the conditional pdf, f_{x|y}(x|y), the MAP estimator, denoted by x̂_MAP, selects the location x that corresponds to the peak of the same pdf:

$$\widehat{x}_{\rm MAP} = \mathop{\text{argmax}}_{x \in {\sf X}}\; f_{x|y}(x|y) \tag{28.11}$$

where the maximization is over the domain of x ∈ X. MAP estimators need not be unique because f_{x|y}(x|y) may be a multimodal distribution. MAP estimators can be viewed as a limiting case of Bayesian inference if the loss function is set to the 0/1-loss defined as follows (see Prob. 28.1):

$$Q(x, \widehat{x}) \;\stackrel{\Delta}{=}\; \begin{cases} 1, & x \neq \widehat{x} \\ 0, & \text{otherwise} \end{cases} \tag{28.12}$$

We illustrate this fact by considering the case in which x has a discrete support set. Thus, given an observation y = y, and considering the 0/1-loss (28.12), we have:

$$\begin{aligned} E_{x|y}\big[ Q(x, \widehat{x}) \,\big|\, y = y \big] &\overset{\rm (a)}{=} \sum_{x \in {\sf X}} Q(x, \widehat{x})\, P(x = x | y = y) \\ &\overset{(28.12)}{=} \sum_{x \neq \widehat{x}} P(x = x | y = y) \\ &= P(x \neq \widehat{x} | y = y) \\ &= 1 - P(x = \widehat{x} | y = y) \end{aligned} \tag{28.13}$$

where the expectation on the left-hand side of (a) is relative to the conditional distribution of x given y = y. It follows that

$$\mathop{\text{argmin}}_{\widehat{x}}\; E_{x|y}\big[ Q(x, \widehat{x}) \,\big|\, y = y \big] = \mathop{\text{argmax}}_{\widehat{x}}\; P(x = \widehat{x} | y = y) \tag{28.14}$$


In other words, and in view of (28.5), the mean of the loss function (28.12) is minimized when x̂ is selected as the location that maximizes the conditional distribution, P(x = x|y = y). For jointly Gaussian-distributed random variables {x, y}, the MSE and MAP estimators for x will agree with each other. This is because the conditional pdf, f_{x|y}(x|y), will be Gaussian and the locations of its mean and peak will coincide. This conclusion, however, is not generally true for other distributions.

Example 28.1 (MSE and MAP estimators) Assume the conditional pdf of a scalar random variable, x, given observations of another scalar random variable, y > 0, follows a Rayleigh distribution of the form (3.26), namely,

$$f_{x|y}(x|y) = \frac{x}{y^2}\, e^{-x^2/2y^2}, \qquad x \geq 0 \tag{28.15}$$

Then, we know from (3.27) that the mean and variance of this distribution, denoted by μ_{x|y} and σ²_{x|y}, respectively, are given by

$$\mu_{x|y} = y\sqrt{\frac{\pi}{2}}, \qquad \sigma_{x|y}^2 = \left(2 - \frac{\pi}{2}\right) y^2 \tag{28.16}$$

Moreover, the peak location of the Rayleigh distribution (its mode location) and its median are given by

$$\text{mode} = y, \qquad \text{median} = y\sqrt{2\ln 2} \tag{28.17}$$

It follows from these expressions that the MSE, MAE, and MAP estimators for x given y are given by

$$\widehat{x}_{\rm MSE} = y\sqrt{\frac{\pi}{2}}, \qquad \widehat{x}_{\rm MAE} = y\sqrt{2\ln 2}, \qquad \widehat{x}_{\rm MAP} = y \tag{28.18}$$

Example 28.2 (Election poll) Two candidates A and B are running for office in a local district election. The probability of success for candidate A is p. We survey a fraction of the voters in the district, say, a number of N potential voters, and ask them whether they will be voting for one candidate or the other. We would like to use the result of the survey to estimate p, i.e., the likelihood of success for candidate A. Let y denote a binomial variable with parameters N and p. The probability of observing y successes in N trials (i.e., the probability of obtaining y positive answers in favor of candidate A out of N) is given by the expression:

$$P(y = y) = \binom{N}{y}\, p^y (1 - p)^{N-y}, \qquad y = 0, 1, \ldots, N \tag{28.19}$$

The value of the parameter p can be estimated in a number of ways, for example, by using a MSE formulation (as described in Prob. 28.12), or a maximum-likelihood formulation (as discussed in Prob. 31.8), or a MAP formulation. In this example, we focus on the MAP approach. In Bayesian inference, we treat the quantities we wish to estimate as random variables. For this reason, we will need to model p as a random variable and then determine an expression for the conditional pdf, f_{p|y}(p|y). Once this pdf is computed, its peak location will provide the desired MAP estimate, p̂_MAP.


Treating p as random requires that we specify its distribution, f_p(p), also called the prior. Since the value of p is confined to the interval [0, 1], we can select the prior from the family of beta distributions. This family is useful in modeling random variables that are confined to finite intervals. The beta distribution is defined by two positive shape parameters (a, b) as follows:

$$f_p(p; a, b) = \begin{cases} \dfrac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, p^{a-1}(1-p)^{b-1}, & 0 \leq p \leq 1 \\ 0, & \text{otherwise} \end{cases} \tag{28.20}$$

where Γ(x) denotes the gamma function defined earlier in Prob. 4.3. Different choices for (a, b) result in different behavior for the distribution f_p(p). For example, the uniform distribution over the interval [0, 1] corresponds to the choice a = b = 1. In this case, the variable p is equally likely to assume any value within the interval. Other values for a and b will give more likelihood to smaller or larger values in the interval. The top part of Fig. 28.1 plots some typical curves for the beta distribution.
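For reference, a small Python sketch (illustrative, not from the text) that evaluates the beta density (28.20) for a few shape pairs (a, b), which is essentially what the top panel of Fig. 28.1 displays; the shape pairs below are arbitrary choices.

```python
import math

def beta_pdf(p, a, b):
    """Beta density (28.20) with shape parameters a, b > 0 on 0 <= p <= 1."""
    if p < 0.0 or p > 1.0:
        return 0.0
    const = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return const * p ** (a - 1) * (1 - p) ** (b - 1)

# A few illustrative shape pairs; (1, 1) recovers the uniform prior.
for a, b in [(1, 1), (2, 5), (5, 2), (3, 3)]:
    values = [round(beta_pdf(p, a, b), 3) for p in (0.1, 0.3, 0.5, 0.7, 0.9)]
    print(f"a={a}, b={b}:", values)
```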


Figure 28.1 (Top) Plots of several beta distributions for different values of the shape parameters (a, b). Observe how a = b = 1 results in the uniform distribution, while other values for (a, b) give more likelihood to smaller or larger values within the interval [0, 1]. (Bottom) Results of polling N = 500 likely voters. The colors refer to votes for candidates A or B.

The mean and variance of the beta distribution (28.20) are known to be:

$$\bar{p} = \frac{a}{a+b}, \qquad \sigma_p^2 = \frac{ab}{(a+b)^2(a+b+1)} \tag{28.21}$$


When a > 1 and b > 1, the mode of the distribution is also known to occur at

$$\text{mode} = \frac{a-1}{a+b-2} \tag{28.22}$$

Using these facts, we derive an expression for the conditional pdf, f_{p|y}(p|y), in Prob. 28.11 and deduce there that its peak occurs at location:

$$\widehat{p}_{\rm MAP} = \frac{y + a - 1}{N + a + b - 2} \tag{28.23}$$

The bottom plot in Fig. 28.1 shows the polling results from surveying N = 500 potential voters in the district. The simulation assumes a beta distribution with parameters a = 3 and b = 2. The actual success probability was generated randomly according to this distribution and took the value p = 0.5565. Out of the N = 500 surveys, there were y = 287 votes in favor of candidate A. Substituting into (28.23) we find that

$$\widehat{p}_{\rm MAP} = \frac{287 + 3 - 1}{500 + 3 + 2 - 2} \approx 0.5746 \tag{28.24}$$

Note that we could have also estimated p by simply dividing y by N; this computation is a common solution and we will encounter it later in Prob. 31.8, where we will show that it amounts to the maximum-likelihood estimate for p denoted by:

$$\widehat{p}_{\rm ML} = \frac{287}{500} = 0.5740 \tag{28.25}$$

This latter solution method, however, treats p as an unknown constant and not as a random variable.
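The two estimates are straightforward to evaluate; the short Python sketch below (an illustration, not the book's code) reproduces the computation in (28.24)–(28.25) for the survey counts used in this example.

```python
# Beta(a, b) prior on p combined with y successes out of N surveyed voters.
a, b = 3, 2          # shape parameters of the beta prior
N, y = 500, 287      # number of surveyed voters and votes for candidate A

p_map = (y + a - 1) / (N + a + b - 2)    # posterior mode, expression (28.23)
p_ml  = y / N                            # maximum-likelihood estimate

print(f"MAP estimate : {p_map:.4f}")     # ~ 0.5746, as in (28.24)
print(f"ML  estimate : {p_ml:.4f}")      # ~ 0.5740, as in (28.25)
```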

28.3 BAYES CLASSIFIER

One useful application of the MAP formulation (28.11) arises in the context of classification problems, which we will study in great detail in later chapters. In these problems, the unknown variable x is discrete and assumes a finite number of levels.

28.3.1 Binary Classification

We motivate classification problems by considering first the case in which x is a discrete binary random variable assuming one of two possible values, say, x ∈ {±1}. Given some possibly vector-valued observation y ∈ IR^M that is dependent on x, we would like to infer x by determining a mapping, now called a classifier, c(y), that maps y into one of the two discrete values:

$$c(y) : {\rm I\!R}^M \longrightarrow \{\pm 1\} \tag{28.26}$$

We refer to x as the class variable or the label corresponding to y. The intention is to employ this mapping to deduce from the observation y whether it belongs to class +1 or −1. We can attain this objective by seeking the optimal estimator,


denoted by x̂_bayes, that minimizes the probability of erroneous decisions (or misclassifications), i.e., that solves:

$$\widehat{x}_{\rm bayes} = \mathop{\text{argmin}}_{\widehat{x}=c(y)}\; P\big( c(y) \neq x \big) \tag{28.27}$$

We verify below that the following classifier, known as the Bayes classifier, solves (28.27):

$$\widehat{x}_{\rm bayes} = \begin{cases} +1, & \text{when } P(x = +1 | y = y) \geq 1/2 \\ -1, & \text{otherwise} \end{cases} \tag{28.28}$$

which can also be written in a single equation as

$$\widehat{x}_{\rm bayes} = 2\, \mathbb{I}\big[ P(x = +1 | y = y) \geq 1/2 \big] - 1 \tag{28.29}$$

in terms of the indicator function:

$$\mathbb{I}[a] = \begin{cases} 1, & \text{if statement } a \text{ is true} \\ 0, & \text{otherwise} \end{cases} \tag{28.30}$$

Expression (28.28) indicates that the classifier decides in favor of +1 when the conditional probability of the event x = +1 is at least 1/2. In other words, the classifier x̂_bayes selects the value for x that maximizes the conditional probability of observing x given y, which means that x̂_bayes coincides with the MAP estimator:

$$\widehat{x}_{\rm bayes} = \widehat{x}_{\rm MAP} = \mathop{\text{argmax}}_{x \in \{\pm 1\}}\; P(x = x | y = y) \tag{28.31}$$
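A minimal Python sketch (illustrative only, not from the text) of the decision rule (28.28) and its single-equation form (28.29); the posterior values fed to it are placeholders.

```python
def bayes_classify(p_plus_given_y):
    """Bayes classifier (28.28): decide +1 when P(x=+1|y=y) >= 1/2, else -1."""
    return +1 if p_plus_given_y >= 0.5 else -1

def bayes_classify_indicator(p_plus_given_y):
    """Equivalent single-equation form (28.29): 2*I[P(x=+1|y=y) >= 1/2] - 1."""
    return 2 * int(p_plus_given_y >= 0.5) - 1

for p in (0.2, 0.5, 0.9):   # placeholder posterior values
    assert bayes_classify(p) == bayes_classify_indicator(p)
    print(p, "->", bayes_classify(p))
```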

Proof of (28.28): First, note that problem (28.27) is equivalent to solving

$$\widehat{x}_{\rm bayes} \;\stackrel{\Delta}{=}\; \mathop{\text{argmax}}_{\widehat{x}=c(y)}\; P\big( c(y) = x \big) \tag{28.32}$$

where

$$P\big( c(y) = x \big) = \int_{y \in {\sf Y}} P\big( c(y) = x \,|\, y = y \big)\, f_y(y)\, dy = \int_{y \in {\sf Y}} \Delta(y)\, f_y(y)\, dy \tag{28.33}$$

where the integration is over the observation space, y ∈ Y. In the above expression, the term f_y(y) denotes the pdf of the observation and the shorthand notation ∆(y) denotes the conditional probability that appears multiplying f_y(y) in the first line. Since f_y(y) ≥ 0, we can solve (28.32) by seeking a classifier c(y) that maximizes ∆(y). Now observe that, since the events x = +1 and x = −1 are mutually exclusive conditioned on y:

$$\begin{aligned} \Delta(y) &\;\stackrel{\Delta}{=}\; P\big( c(y) = x \,|\, y = y \big) \\ &= P\big( c(y) = +1, x = +1 \,|\, y = y \big) + P\big( c(y) = -1, x = -1 \,|\, y = y \big) \\ &= \mathbb{I}\big[ c(y) = +1 \big]\, P(x = +1 | y = y) + \mathbb{I}\big[ c(y) = -1 \big]\, P(x = -1 | y = y) \end{aligned} \tag{28.34}$$


For any given observation y, we need to select c(y) to maximize ∆(y). There are only two possibilities for c(y) in the binary classification problem, either c(y) = +1 or c(y) = −1:

if we set c(y) = +1, then ∆(y) = P(x = +1|y = y)    (28.35)
if we set c(y) = −1, then ∆(y) = P(x = −1|y = y)    (28.36)

Therefore, we should set c(y) = +1 whenever

P(x = +1|y = y) ≥ P(x = −1|y = y) = 1 − P(x = +1|y = y)    (28.37)

which is equivalent to the condition P(x = +1|y = y) ≥ 1/2.
∎

28.3.2 Likelihood Ratio Test

The Bayes classifier (28.28) can be expressed in an equivalent form involving a likelihood ratio test. To see this, note from expression (28.28) that deciding on whether x is +1 or −1 amounts to checking the inequality:

$$P(x = +1 | y = y) \geq P(x = -1 | y = y) \iff \widehat{x}_{\rm bayes}(y) = +1 \tag{28.38}$$

Using the Bayes rule (3.39) for conditional probabilities, the above inequality is equivalent to checking whether

$$f_{y|x}(y | x = +1)\, P(x = +1) \;\geq\; f_{y|x}(y | x = -1)\, P(x = -1) \tag{28.39}$$

Let π_{±1} denote the prior probabilities for the events x = +1 and x = −1, i.e.,

$$\pi_{+1} \;\stackrel{\Delta}{=}\; P(x = +1), \qquad \pi_{-1} \;\stackrel{\Delta}{=}\; P(x = -1) \tag{28.40}$$

where

$$\pi_{-1} + \pi_{+1} = 1 \tag{28.41}$$

Let further L(y) denote the likelihood ratio:

$$L(y) \;\stackrel{\Delta}{=}\; \frac{f_{y|x}(y | x = +1)}{f_{y|x}(y | x = -1)} \tag{28.42}$$

Then, condition (28.39) translates into deciding for x = +1 or x = −1 depending on whether

$$L(y) \;\mathop{\gtrless}_{-1}^{+1}\; \frac{\pi_{-1}}{\pi_{+1}} \tag{28.43}$$

This test is equivalent to the Bayes classifier (28.28): It decides for x = +1 when the likelihood ratio is larger than or equal to π_{−1}/π_{+1}. When the classes are equally probable so that π_{−1} = π_{+1} = 1/2, the threshold value on the right-hand side of (28.43) reduces to 1.

Example 28.3 (Hard classifier) Let us apply the Bayes classifier (28.28) to the situation encountered earlier in Example 27.3. In that example, we discussed recovering the class variable x ∈ {+1, −1} for cat and dog images from soft measurements y = x + v in the presence of additive Gaussian perturbation, v.


Given y, we would like now to recover x by minimizing the probability of misclassification (rather than the MSE, as was done in Example 27.3). The solution is given by the Bayes classifier (28.28); its computation requires that we evaluate P(x = +1|y = y). This quantity has already been evaluated in Example 3.17. Indeed, from that example we know that:

$$P(x = +1 | y = y) = \frac{f_v(y - 1)}{f_v(y + 1) + f_v(y - 1)}, \qquad \text{where } f_v(v) = {\rm N}_v(0, 1) \tag{28.44}$$

Simplifying gives

$$P(x = +1 | y = y) = \frac{1}{1 + \dfrac{e^{-(y+1)^2/2}}{e^{-(y-1)^2/2}}} \tag{28.45}$$

According to (28.28), we need to compare P(x = +1|y = y) against the threshold 1/2. It is easy to verify from the above expression that

$$P(x = +1 | y = y) \geq 1/2 \iff y \geq 0 \tag{28.46}$$

In this way, expression (28.28) for the optimal classifier reduces to

$$\widehat{x}_{\rm bayes} = \begin{cases} +1, & \text{when } y \geq 0 \\ -1, & \text{otherwise} \end{cases} \tag{28.47}$$

which is equivalent to

$$\widehat{x}_{\rm bayes} = {\rm sign}(y) \tag{28.48}$$

fy|x (y|x = −1) ∼ Ny (−1, 1)

(28.49)

In other words, 2 1 1 fy|x (y|x = +1) = √ e− 2 (y−1) 2π 1 − 12 (y+1)2 fy|x (y|x = −1) = √ e 2π

(28.50a) (28.50b)

so that the likelihood ratio is L(y) =

1

2

1

2

e− 2 (y−1) e− 2 (y+1)

(28.51)

Assuming equally probable classes, we need to compare this ratio against 1 or, equivalently, n 1 o +1 1 exp − (y − 1)2 + (y + 1)2 R 1 2 2 −1

(28.52)

28.3 Bayes Classifier

1101

Computing the natural logarithms of both sides, it is straightforward to verify that the above condition reduces to +1

y R 0

(28.53)

−1

which is equivalent to (28.48). The likelihood ratio test can be illustrated graphically as shown in Fig. 28.2. The figure shows two Gaussian distributions centered at ±1 and with unit variances. These distributions represent the conditional pdfs (28.50a)–(28.50b) of the observation given x. In the example under consideration, the means are symmetric around the origin and both distributions have equal variances. Obviously, more general situations can be considered as well – see Prob. 28.5. For the scenario illustrated in the figure, the likelihood ratio test (28.53) leads to comparing the value of y against zero. That is, given an observation y, we decide that its class is x = +1 whenever y ≥ 0 (i.e., whenever it lies to the right of the zero threshold). Likewise, we decide that its class is x = −1 whenever y < 0 (i.e., whenever it lies to the left of the zero threshold). The figure highlights in color two small areas under the pdf curves. The smaller area to the right of the zero threshold (colored in red) corresponds to the following probability of error: ˆ P( deciding x = +1|x = −1 ) =



ˆ0 ∞

= 0

fy|x (y|x = −1)dy 2 1 1 ∆ √ e− 2 (y+1) dy =  2π

P(x = +1|x =

(28.54)

1)

AAACGXicbVC7TsMwFHXKq5RXgJElokUqQlRxF2BAqmBhLBKhlZqochynteo8ZDuIKuQ7WPgVFgZAjDDxNzhtBmg5kuWjc+7Vvfe4MaNCmua3VlpYXFpeKa9W1tY3Nrf07Z1bESUcEwtHLOJdFwnCaEgsSSUj3ZgTFLiMdNzRZe537ggXNApv5DgmToAGIfUpRlJJfR3W7ADJoeum7axuuxHzxDhQX3qfnR/BhxnlGB7W+nrVbJgTGPMEFqQKCrT7+qftRTgJSCgxQ0L0oBlLJ0VcUsxIVrETQWKER2hAeoqGKCDCSSenZcaBUjzDj7h6oTQm6u+OFAUi309V5neIWS8X//N6ifRPnZSGcSJJiKeD/IQZMjLynAyPcoIlGyuCMKdqVwMPEUdYqjQrKgQ4e/I8sZqNswa8blZbF0UaZbAH9kEdQHACWuAKtIEFMHgEz+AVvGlP2ov2rn1MS0ta0bML/kD7+gGc6KDI

Figure 28.2 Illustration of the conditional Gaussian distributions (28.50a)–(28.50b).

That is, the probability of assigning y wrongly to class x = +1 when it actually originates from class x = −1 is given by the red-colored area in the figure, whose value we are denoting by . Likewise, from the same figure, the smaller area to the left of the

1102

Bayesian Inference

zero threshold (colored in blue) corresponds to the following probability of error: P( deciding x = −1|x = +1 ) =

ˆ

0

ˆ

−∞ 0

fy|x (y|x = +1)dy

= −∞

2 1 1 √ e− 2 (y−1) dy =  2π

(28.55)

In this example, both error probabilities (or areas) are equal in size and we denote each one of them by . The probabilities can be combined to determine an expression for the probability of error of the Bayes classifier since: P(b xbayes 6= x)

= = = (28.41)

=

P( erroneous decisions ) 1 1 P( deciding x = +1|x = −1) + P(deciding x = −1|x = +1) 2 2 /2 + /2 

(28.56)

Example 28.5 (Classifying iris flowers) We reconsider the iris flower dataset encountered earlier in Example 27.4. The top row in Fig. 28.3 shows two histogram distributions for the petal length measured in centimeters for two types of flowers: iris setosa and iris virginica. Each histogram constructs 5 bins based on 50 measurements for each flower type. The width of the bin is 0.32 cm for setosa flowers and 0.70 cm for virginica flowers. The bottom row shows the same histograms normalized by dividing each bin value by the number of samples (which is 50) and by the bin width (0.32 for setosa flowers and 0.70 for virginica flowers). This normalization results in approximations for the pdfs. We assume that a flower can only be one of two kinds: either setosa or virginica. Given an observation of a flower with petal length equal to 5.5 cm, we would like to decide whether it is of one type or the other. We will be solving classification problems of this type in a more structured manner in later chapters, and in many different ways. The current example is only meant to illustrate Bayes classifiers. Let x denote the class label, namely, x = +1 if the flower is iris setosa and x = −1 if the flower is iris virginica. We model the petal length as a random variable y. According to the Bayes classifier (28.28), we need to determine the conditional probability P(x = +1|y = 5.5). To do so, we assume the flowers are equally distributed so that P(x = +1) = P(x = −1) = 1/2

(28.57)

According to the Bayes rule (3.42b), we have: P(x = x|y = y) =

P(x = x) fy|x (y|x = x) fy (y)

(28.58)

Therefore, we need to evaluate the pdfs fy|x (y|x) and fy (y) that appear on the righthand side. We do not have these pdfs but we will estimate them from the data measurements by assuming they follow Gaussian distributions. For that purpose, we only need to identify the mean and variance parameters for these distributions; in later chapters, we will learn how to fit more complex distributions into data measurements such as mixtures of Gaussian models. The sample means and variances for the petal length computed from the respective 50 measurements for each flower type are found to be:

28.3 Bayes Classifier

25

setosa (50 flowers)

frequency

frequency

20

15

10

5

0 4

virginica (50 flowers)

25

20

15

10

5

4.5

5

5.5

0 4

6

5

6

petal length (cm)

1.4

setosa (50 flowers)

8

9

virginica (50 flowers)

0.7 0.6

probability density

probability density

7

petal length (cm)

1.2 1 0.8 0.6 0.4 0.2 0 4

1103

normalized histogram

fitted pdf

0.5 0.4 0.3 0.2 0.1

4.5

5

5.5

6

petal length (cm)

0 4

5

6

7

8

9

petal length (cm)

Figure 28.3 (Top) Histogram distribution of the petal length measured in centimeters for iris setosa flowers on the left and for iris virginica flowers on the right. (Bottom) The same histogram plots are normalized by dividing the value for each bin by the bin size and by the total number of 50 samples to generate approximate probability distributions for the petal length variable. (Bottom) Two Gaussian distributions are fitted on top of the normalized histograms.

E(petal length | flower = setosa) ≈ 5.0060    (28.59a)
E(petal length | flower = virginica) ≈ 6.5880    (28.59b)
var(petal length | flower = setosa) ≈ 0.1242    (28.59c)
var(petal length | flower = virginica) ≈ 0.4043    (28.59d)

where, for example, the sample mean and variance for the setosa flower are computed by using:

$$E(\text{petal length} \,|\, \text{flower = setosa}) \approx \frac{1}{50} \sum_{n=1}^{50} y_n \;\stackrel{\Delta}{=}\; \bar{y}_{\rm setosa} \tag{28.60}$$

$${\rm var}(\text{petal length} \,|\, \text{flower = setosa}) \approx \frac{1}{49} \sum_{n=1}^{50} (y_n - \bar{y}_{\rm setosa})^2 \tag{28.61}$$

Here, the sum is over the 50 setosa samples and yn is the petal length for the nth setosa sample.


The bottom row in Fig. 28.3 shows two Gaussian distributions with these means and variances fitted on top of the histograms. These are used as approximations for the conditional pdfs f_{y|x}(y|x = x), namely,

$$f_{y|x}(y | x = \text{setosa}) = \frac{1}{\sqrt{2\pi \times 0.1242}} \exp\left\{ -\frac{1}{2 \times 0.1242}(y - 5.0060)^2 \right\} \tag{28.62}$$

$$f_{y|x}(y | x = \text{virginica}) = \frac{1}{\sqrt{2\pi \times 0.4043}} \exp\left\{ -\frac{1}{2 \times 0.4043}(y - 6.5880)^2 \right\} \tag{28.63}$$

The combined distribution for the petal length variable can then be approximated by

$$f_y(y) = \frac{1}{2}\, \frac{1}{\sqrt{2\pi \times 0.1242}} \exp\left\{ -\frac{(y - 5.0060)^2}{2 \times 0.1242} \right\} + \frac{1}{2}\, \frac{1}{\sqrt{2\pi \times 0.4043}} \exp\left\{ -\frac{(y - 6.5880)^2}{2 \times 0.4043} \right\} \tag{28.64}$$

since it is equally likely for a petal length to arise from one Gaussian distribution or the other. Figure 28.4 shows the normalized histogram distribution for all 100 petal lengths and fits the sum of two Gaussian distributions on top of it.


Figure 28.4 Combined normalized histogram for the distribution of the petal length measured in centimeters for both classes of iris setosa and iris virginica flowers. The sum of two Gaussian distributions is fitted on top of the histogram.

We now have all the elements needed to evaluate the right-hand side of (28.58) for the given petal length of y = 5.5 cm. Indeed,

$$P(x = \text{setosa} \,|\, y = 5.5) = \frac{P(x = \text{setosa})\, f_{y|x}(y = 5.5 | x = \text{setosa})}{f_y(y = 5.5)} = \frac{0.5 \times 0.3309}{0.3498} \approx 0.4730 \tag{28.65}$$

This value is less than 1/2 and we therefore classify the flower as being of the iris virginica type.
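As a sketch of the computation in this example (not the book's code), the snippet below fits one Gaussian per class from scalar measurements and evaluates the posterior (28.58) at a query length. The arrays `setosa_lengths` and `virginica_lengths` are hypothetical placeholders; with the actual 50 + 50 iris measurements the fitted moments would be those in (28.59a)–(28.59d), and the resulting posterior would correspond to (28.65).

```python
import numpy as np

def gaussian_pdf(y, mean, var):
    return np.exp(-(y - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Hypothetical measurement arrays standing in for the 50 samples of each class.
setosa_lengths    = np.array([4.8, 5.1, 4.9, 5.3, 5.0, 5.2, 4.7, 5.1])
virginica_lengths = np.array([6.3, 6.9, 6.5, 7.1, 6.4, 6.8, 6.2, 6.6])

# Fit one Gaussian per class, as in (28.60)-(28.61): sample mean, unbiased variance.
mu_s, var_s = setosa_lengths.mean(), setosa_lengths.var(ddof=1)
mu_v, var_v = virginica_lengths.mean(), virginica_lengths.var(ddof=1)

y_query = 5.5          # petal length to classify, in cm
prior = 0.5            # equally likely classes, as in (28.57)

f_s = gaussian_pdf(y_query, mu_s, var_s)
f_v = gaussian_pdf(y_query, mu_v, var_v)
f_y = prior * f_s + prior * f_v          # evidence, as in (28.64)

posterior_setosa = prior * f_s / f_y     # expression (28.58)
label = "setosa" if posterior_setosa >= 0.5 else "virginica"
print(f"P(setosa | y = {y_query}) = {posterior_setosa:.4f} -> classify as {label}")
```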

28.3.3 Multiclass Classification

Problem 28.3 at the end of the chapter extends conclusion (28.28) to multiclass classification problems, where x could assume one of R ≥ 2 discrete values, say, x ∈ {1, 2, . . . , R}. In this case, the classifier maps the observation vector y to integer values in the range {1, 2, . . . , R}, i.e.,

$$c(y) : {\rm I\!R}^M \longrightarrow \{1, 2, \ldots, R\} \tag{28.66}$$

and its optimal construction is now given by the MAP formulation:

$$\widehat{x}_{\rm bayes} = \widehat{x}_{\rm MAP} = \mathop{\text{argmax}}_{x \in \{1, 2, \ldots, R\}}\; P(x = x | y = y) \tag{28.67}$$

which is the natural generalization of (28.31). This construction seeks the class x that maximizes the posterior probability given the observation. Since the observation y is random, the resulting classifier x̂_bayes is also random, i.e., each realization value y = y results in a realization x̂_bayes. We can assess the probability of erroneous decisions by the Bayes classifier as follows. If the true label corresponding to y = y is x, then the probability of error for this observation is

$$P({\rm error} | y = y) \;\stackrel{\Delta}{=}\; P(\widehat{x}_{\rm bayes} \neq x | y = y) = 1 - P(\widehat{x}_{\rm bayes} = x | y = y) \tag{28.68}$$

If we average over the distribution of the observations, we remove the conditioning over y and arrive at the probability of error for the Bayes classifier denoted by

$$P_e^{\rm bayes} \;\stackrel{\Delta}{=}\; P(\widehat{x}_{\rm bayes} \neq x) = \int_{y \in {\sf Y}} \Big( 1 - P\big( \widehat{x}_{\rm bayes}(y) = x_y \,|\, y = y \big) \Big)\, f_y(y)\, dy \tag{28.69}$$

Here, we are writing x_y, with an explicit subscript y, inside the integral expression to emphasize that x_y is the label that corresponds to the observation y. The following bound holds.

Theorem 28.1. (Performance of the Bayes classifier) Consider a multiclass classification problem with R labels, x = 1, 2, . . . , R. It holds that

$$P_e^{\rm bayes} \leq \frac{R-1}{R} \tag{28.70}$$

Proof: We employ the result of Theorem 52.1 in the following argument. By construction, the Bayes classifier minimizes the probability of erroneous decisions, i.e.,

$$\widehat{x}_{\rm bayes} = \mathop{\text{argmin}}_{\widehat{x}=c(y)}\; P\big( c(y) \neq x \big) \tag{28.71}$$

In Theorem 52.1, we will study one particular suboptimal classifier called the nearest-neighbor rule. It is suboptimal in the sense that it does not minimize the probability


of error. We denote its probability of error by P_e, which is of course worse than that of the optimal Bayes classifier, i.e., P_e^bayes ≤ P_e. We will establish in Theorem 52.1 that the probability of error for the nearest-neighbor classifier is upper-bounded by

$$P_e \;\leq\; P_e^{\rm bayes}\left( 2 - \frac{R}{R-1}\, P_e^{\rm bayes} \right) \tag{28.72}$$

The right-hand side is a quadratic function in P_e^bayes; its maximum is attained at the location P_e^bayes = (R − 1)/R. Substituting this value into the upper bound we get

$$P_e^{\rm bayes} \;\leq\; P_e \;\leq\; \frac{R-1}{R}\left( 2 - \frac{R}{R-1} \cdot \frac{R-1}{R} \right) = \frac{R-1}{R} \tag{28.73}$$

as claimed. ∎
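The quadratic bound (28.72) used in this proof is easy to inspect numerically. The small Python check below (illustrative only; R = 4 is a placeholder choice) confirms that the right-hand side of (28.72), viewed as a function of P_e^bayes, peaks at (R − 1)/R with value (R − 1)/R, which is how (28.70) follows.

```python
import numpy as np

R = 4                                      # number of classes (placeholder value)
p = np.linspace(0.0, (R - 1) / R, 1001)    # feasible range of the Bayes error
bound = p * (2.0 - (R / (R - 1.0)) * p)    # right-hand side of (28.72)

i_max = int(np.argmax(bound))
print("argmax of the bound :", p[i_max])       # ~ (R-1)/R = 0.75
print("maximum bound value :", bound[i_max])   # ~ (R-1)/R = 0.75
```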

28.3.4 Discriminant Function Structure

Regardless of whether we are dealing with a binary or multiclass classification problem, both solutions (28.31) and (28.67) admit a discriminant function interpretation. The solution first associates a discriminant function with each discrete class x, which we denote by:

$$d_x(y) \;\stackrel{\Delta}{=}\; P(x = x | y = y), \qquad x = 1, 2, \ldots, R \tag{28.74}$$

This function measures the likelihood that observation y belongs to class x. Then, the optimal classifier selects the class label, x̂_bayes, with the largest discrimination value – see Fig. 28.5. We will encounter this type of structure multiple times in our treatment – see, e.g., expression (56.10), which will arise in the design of linear discriminant classifiers; see also Prob. 56.1.
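A minimal Python sketch of the discriminant-function view in (28.74) (an illustration, not taken from the text): each class x is assigned the score d_x(y) = P(x = x|y = y) and the classifier simply returns the argmax, as in (28.67). The posterior vector below is a placeholder.

```python
import numpy as np

def bayes_multiclass(posteriors):
    """Given d_x(y) = P(x = x | y = y) for x = 1, ..., R (as in (28.74)),
    return the class with the largest discriminant value, as in (28.67)."""
    return int(np.argmax(posteriors)) + 1   # classes are labeled 1, ..., R

# Placeholder posterior vector for one observation y (R = 4 classes).
d = np.array([0.10, 0.35, 0.40, 0.15])
print("decided class:", bayes_multiclass(d))   # -> 3
```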

28.4 LOGISTIC REGRESSION INFERENCE

We will encounter other choices for the loss function Q(x, x̂) in our future development. One of them is the logistic regression loss for binary variables x ∈ {±1} defined by

$$Q(x, \widehat{x}) = \ln\big( 1 + e^{-x\widehat{x}} \big), \qquad \widehat{x} = c(y) \tag{28.75}$$

We will study logistic regression in greater detail later in Chapter 59. Here we provide some motivation based on the following theorem, which provides an expression for the optimal estimator x̂ that follows from minimizing the logistic risk.



Figure 28.5 Classifier structure in the form of a collection of discriminant functions, d_x(y); one for each discrete value x. For each observation vector, y, the optimal classifier is obtained by selecting the class label x̂_bayes with the largest discrimination value. This value is denoted by x̂_bayes(y) at the bottom of the figure.

Theorem 28.2. (Minimizer of logistic risk) Consider a binary classification problem where x ∈ {±1} and the following Bayesian inference problem:

$$\widehat{x}_{\rm LR} = \mathop{\text{argmin}}_{\widehat{x}=c(y)}\; E\, \ln\big( 1 + e^{-x\widehat{x}} \big) \tag{28.76}$$

The optimal estimator that minimizes the above risk is given by

$$\widehat{x}_{\rm LR} = c^o(y) = \ln\left( \frac{P(x = +1 | y = y)}{P(x = -1 | y = y)} \right) \;\stackrel{\Delta}{=}\; {\rm logit}(y) \tag{28.77}$$

where we are denoting the logarithm of the probability ratio by the notation logit(y). The sign of x̂_LR determines the Bayes classifier for x.

Proof: Let R(c) = E ln(1 + e^{−xx̂}) denote the logistic risk. We recall the conditional mean property from Prob. 3.25 that E a = E[E(a|b)], for any two random variables a and b. Applying this property to the logistic risk we get

$$R(c) = E_y\Big\{ E_{x|y}\big[ \ln(1 + e^{-x\widehat{x}}) \,\big|\, y \big] \Big\} \tag{28.78}$$

where the inner expectation is over the conditional pdf f_{x|y}(x|y), while the outer expectation is over the distribution of y. The inner expectation is always nonnegative. Therefore, it is sufficient to examine the problem of minimizing its value to arrive at a minimizer for R(c). Since x assumes the discrete values ±1, we can assess the inner expectation and write

$$E_{x|y}\big[ \ln(1 + e^{-x\widehat{x}}) \,\big|\, y = y \big] = P(x = +1 | y = y)\, \ln(1 + e^{-\widehat{x}}) + P(x = -1 | y = y)\, \ln(1 + e^{\widehat{x}}) \tag{28.79}$$


Differentiating over x̂ and setting the derivative to zero at x̂_LR = c°(y) gives

$$-\,P(x = +1 | y = y)\, \frac{e^{-\widehat{x}_{\rm LR}}}{1 + e^{-\widehat{x}_{\rm LR}}} + P(x = -1 | y = y)\, \frac{e^{\widehat{x}_{\rm LR}}}{1 + e^{\widehat{x}_{\rm LR}}} = 0 \tag{28.80}$$

or, equivalently,

$$\frac{P(x = +1 | y = y)}{P(x = -1 | y = y)} = \frac{e^{\widehat{x}_{\rm LR}}\big( 1 + e^{-\widehat{x}_{\rm LR}} \big)}{e^{-\widehat{x}_{\rm LR}}\big( 1 + e^{\widehat{x}_{\rm LR}} \big)} = e^{\widehat{x}_{\rm LR}} = e^{c^o(y)} \tag{28.81}$$

from which we arrive at (28.77). We explain in (28.85) that the sign of x̂_LR determines the Bayes classifier for x.
∎
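Expression (28.77) can also be checked numerically: for a fixed conditional probability P(x = +1|y = y), minimizing the inner risk (28.79) over x̂ should return the logit. Here is a short Python sketch (illustrative only, using a grid search instead of the derivative argument); the value of the posterior probability is a placeholder.

```python
import math
import numpy as np

def inner_logistic_risk(x_hat, p_plus):
    """Conditional logistic risk (28.79) for a scalar candidate x_hat."""
    return (p_plus * math.log(1 + math.exp(-x_hat))
            + (1 - p_plus) * math.log(1 + math.exp(x_hat)))

p_plus = 0.8                                 # placeholder value of P(x = +1 | y = y)
grid = np.linspace(-5, 5, 100001)
risks = [inner_logistic_risk(c, p_plus) for c in grid]
x_lr = grid[int(np.argmin(risks))]

print("grid minimizer :", round(x_lr, 4))
print("logit(y)       :", round(math.log(p_plus / (1 - p_plus)), 4))   # ~ 1.3863
```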

The reason for the qualification "logistic" is because the solution (28.77) expresses the conditional label probabilities in the form of logistic functions evaluated at x̂_LR. Indeed, it follows from (28.77) that

$$P(x = +1 | y = y) = \frac{1}{1 + e^{-\widehat{x}_{\rm LR}}} \tag{28.82a}$$

$$P(x = -1 | y = y) = \frac{1}{1 + e^{+\widehat{x}_{\rm LR}}} \tag{28.82b}$$

Figure 28.6 illustrates the logistic functions 1/(1 + e^{−z}) and 1/(1 + e^{z}). Note that these functions return values between 0 and 1 (as befits a true probability measure).


Figure 28.6 Typical behavior of logistic functions for two classes. The figure shows

plots of the functions 1/(1 + e−z ) (left) and 1/(1 + ez ) (right) assumed to correspond to classes +1 and −1, respectively.

Note that the logit of y is the logarithm of the odds of y belonging to one class or the other:

$${\rm odds}(y) \;\stackrel{\Delta}{=}\; \frac{P(x = +1 | y = y)}{P(x = -1 | y = y)} = \frac{P(x = +1 | y = y)}{1 - P(x = +1 | y = y)} \tag{28.83}$$

so that

$${\rm odds}(y) \geq 1 \iff P(x = +1 | y = y) \geq 1/2 \tag{28.84}$$

which agrees with the condition used by the Bayes classifier. Therefore, once the logarithm is applied to the odds function, the value of x̂_LR will be nonnegative when P(x = +1|y = y) ≥ 1/2 and negative otherwise. For this reason, we can use the logistic estimator x̂_LR to deduce the value of the Bayes classifier by rewriting (28.28) in the form:

$$\widehat{x}_{\rm bayes} = \begin{cases} +1, & \text{when } \widehat{x}_{\rm LR} = {\rm logit}(y) \geq 0 \\ -1, & \text{otherwise} \end{cases} \tag{28.85}$$

Example 28.6 (Exponential loss and boosting) We will encounter later in Chapter 62, while studying boosting algorithms for learning, the exponential loss function Q(x, x̂) = e^{−xx̂}. Consider again a binary classification problem where x ∈ {±1} and assume we seek to solve

$$\widehat{x}_{\rm EXP} = \mathop{\text{argmin}}_{\widehat{x}=c(y)}\; E\, e^{-x\widehat{x}} \tag{28.86}$$

Then, the same derivation will lead to

$$\widehat{x}_{\rm EXP} = \frac{1}{2} \ln\left( \frac{P(x = +1 | y = y)}{P(x = -1 | y = y)} \right) = \frac{1}{2}\, {\rm logit}(y) \tag{28.87}$$

with an additional scaling by 1/2.

Example 28.7 (Motivating the logistic risk) One way to motivate the logistic risk function used in (28.76) is to invoke the Kullback–Leibler (KL) divergence measure. Let f_{x|y}(x|y) denote some unknown conditional pdf that we wish to estimate, where we are using the pdf notation f_{x|y}(x|y) instead of the more explicit form P(x = +1|y = y) for convenience. Assume we opt to use a sigmoid function to approximate the unknown pdf and choose the approximation to be of the form:

$$g_{x|y}(x|y) = \frac{1}{1 + e^{-x\, c(y)}} \tag{28.88}$$

for some function c(y) to be determined. Such sigmoidal functions are particularly useful to model distributions for binary-valued discrete variables x ∈ {±1} since they return values between 0 and 1 (as befitting a true probability measure). Recall from the discussion in Chapter 6 that the KL divergence is a useful measure of closeness between probability distributions. Accordingly, we can choose c(y) to minimize the KL divergence between f_{x|y}(x|y) and g_{x|y}(x|y), i.e.,

$$c^o(y) \;\stackrel{\Delta}{=}\; \mathop{\text{argmin}}_{c(y)}\; E_f\, \ln\left( \frac{f_{x|y}(x|y)}{g_{x|y}(x|y)} \right) \tag{28.89}$$

where the expectation is relative to the unknown distribution f_{x|y}(x|y). But since this distribution is independent of c(y), the above problem is equivalent to

$$c^o(y) = \mathop{\text{argmin}}_{c(y)}\; \Big\{ -E_f\, \ln g_{x|y}(x|y) \Big\} \tag{28.90}$$

Substituting the assumed form (28.88) for g_{x|y}(x|y), we arrive at

$$c^o(y) = \mathop{\text{argmin}}_{c(y)}\; E\, \ln\big( 1 + e^{-x\, c(y)} \big) \tag{28.91}$$

which agrees with the logistic risk formulation (28.76).


28.5 DISCRIMINATIVE AND GENERATIVE MODELS

The solution of Bayesian inference problems requires knowledge of the conditional distribution f_{x|y}(x|y), as is evident from (28.5). For example, the MSE estimator, x̂_MSE, corresponds to the mean of this conditional distribution, while the MAP estimator, x̂_MAP, corresponds to the location of its mode. The same is true for the Bayes classifier when x is discrete since it requires knowledge of the conditional probabilities P(x = r|y = y). Implementing inference solutions that depend on knowledge of the conditional distribution f_{x|y}(x|y) can be challenging, as explained below. For this reason, in future chapters we will be pursuing various methodologies that attempt to solve the inference problem of predicting x from y in different ways, either by insisting on approximating the conditional pdf f_{x|y}(x|y) or by ignoring it altogether and working directly with data realizations instead. Four broad classes of approaches stand out:

(a) (Approaches based on discriminative models). Even if f_{x|y}(x|y) were known in closed form, computing its mean or mode locations can be demanding and need not admit closed-form solutions. In later chapters we will assume that this conditional distribution has particular forms that are easy to work with. These approximate techniques will belong to the class of discriminative methods because they assume models for the conditional pdf f_{x|y}(x|y) and allow us to discriminate between classes.

(b) (Approaches based on generative models). In some other instances, we may actually have more information than f_{x|y}(x|y) and know the full joint distribution f_{x,y}(x, y). In principle, this joint distribution should be sufficient to determine the conditional pdf since, from the Bayes rule:

fx,y (x, y) fy (y)

(28.92)

where the distribution for y (also called its evidence), and which appears in the denominator, can be determined by marginalizing the joint distribution: ˆ fy (y) = fx,y (x, y)dy (28.93) y∈Y

The difficulty, however, lies in the fact that this marginalization does not always admit a tractable closed-form expression. In later chapters, we will describe various approximation methods to forgo the need to evaluate the evidence, such as the Laplace method, the Markov chain Monte Carlo method, and the expectation propagation method, in addition to variational inference techniques. Besides solving inference problems, knowledge of the joint distribution can also be used to determine the generative distribution, fy|x (y|x), which allows us to generate samples y from knowledge of x. We will encounter

28.5 Discriminative and Generative Models

1111

many examples of this approach in the form of Gaussian mixture models (GMM), restricted Boltzmann machines (RBMs), hidden Markov models (HMMs), and variational autoencoders. These techniques belong to the class of generative methods because they allow us to determine models for the reverse conditional pdf fy|x (y|x). Observe that the main difference between the discriminative and generative approaches is that the former works with (or approximates) fx|y (x|y) while the latter works with (or approximates) fy|x (y|x): discriminative approach =⇒ works with fx|y (x|y)

(28.94a)

generative approach =⇒ works with fy|x (y|x)

(28.94b)

(c) (Approaches based on model-based inference). The inference methods under (a) and (b) work directly with joint or conditional distributions for the variables involved, namely, fx|y (x|y) and fx,y (x, y). These distributions are either known or approximated. The approximations can take different forms. For example, one can assume a parametric model for the conditional pdf fx|y (x|y; θ), assumed parameterized by some θ (such as assuming a Gaussian form with its mean and variance playing the role of the parameter θ). One can then seek to estimate θ in order to fit the assumed distribution model onto the data, and proceed from there to perform inference. The maximum-likelihood technique and Gaussian mixture models are examples of this approach. Alternatively, one can assume a model relating the variables {x, y} directly, such as a state-space model or a linear regression model that tells us how x generates y. The Kalman and particle filter solutions are prominent examples of this approach. The assumed state-space models implicitly define a conditional distribution linking x and y. One can then work with the model equations to perform inference. In many instances of interest, the assumed model removes the need to know the full conditional pdf fx|y (x|y), and only some of its moments are necessary. We will encounter our first example of this scenario in the next chapter. There, by assuming a linear regression model, it will be seen that the Bayesian solution only requires knowledge of the first- and second-order moments of the variables {x, y}, namely, their means, cross-covariance, and variances. Our treatment of inference methods will cover steps (a)–(c) in some detail, and introduce various techniques that fit into one of these approaches starting from the next chapter. (d) (Approaches based on data-driven inference). Model-based solutions can be complex and computationally demanding; for example, it is not unusual for these implementations to involve the computation of challenging integrals or to require fitting complex distributions onto data. Moreover, in a large number of applications, designers do not know the general forms of the conditional or joint probability distributions, or even models linking the


variables, and will only have access to data realizations $\{x(n), y_n\}$ that arise from these distributions or models. For this reason, we will be motivated to introduce a variety of learning methods that perform inference directly from data. In contrast to the inference methods under (a)–(c), which attempt to approximate or emulate the underlying distributions or models, learning algorithms will be largely data-driven and will arrive at inference conclusions without the need to know or determine explicitly the forms of the underlying distributions or models. The learning methods will differ by how they process the data. Some methods will operate directly on $\{x(n), y_n\}$ to estimate values for the conditional probabilities (rather than their actual forms). Examples include the nearest-neighbor (NN) rule and self-organizing maps (SOMs). Other learning methods will go a step further. They will require the mapping c(y) to be an affine model of the observations, say, $c(y) = y^T w - \theta$, for some parameters $w\in\mathbb{R}^M$ and $\theta\in\mathbb{R}$, or use some more involved nonlinear models as happens with kernel methods and neural networks. For affine models, the Bayesian inference problem (28.1) will reduce to minimizing over the parameters $(w,\theta)$:

$$(w^o, \theta^o) = \operatorname*{argmin}_{w,\theta}\; \mathbb{E}\, Q(w,\theta;x,y) \tag{28.95}$$

For example, in the MSE case, the loss function will take the form (for scalar x):

$$Q(x,\hat{x}) = (x-\hat{x})^2 = (x - y^T w + \theta)^2 = Q(w,\theta;x,y) \tag{28.96}$$

which shows that the loss is dependent on the parameters $(w,\theta)$ and on the variables $\{x,y\}$. Formulation (28.95) is an optimization problem with a stochastic risk. If we observe a collection of realizations $\{x(n), y_n\}$ arising from the underlying (but unknown) distribution $f_{x,y}(x,y)$, then we already know how to run stochastic gradient algorithms and many variations thereof to seek the optimizers $(w^o,\theta^o)$. We will also consider empirical risk versions of (28.95) such as

$$(w^\star, \theta^\star) = \operatorname*{argmin}_{w,\theta}\; \left\{ \frac{1}{N}\sum_{n=0}^{N-1} Q(w,\theta;x(n),y_n) \right\} \tag{28.97}$$

Most learning algorithms discussed in later chapters will correspond to stochastic approximation methods applied to the minimization of the stochastic or empirical formulations similar to (28.95) or (28.97). We will encounter a variety of methods that fit into this paradigm, such as support vector machines, the perceptron, kernel methods, and neural networks. These methods will differ by their choice of the loss function.
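The following sketch illustrates the data-driven route in (28.97) for the affine MSE loss (28.96): it generates synthetic realizations from an assumed noisy affine relation and runs a plain stochastic gradient recursion on the empirical risk. The generating model, step size, and number of passes are arbitrary choices made only for illustration.

```python
# Fit c(y) = y^T w - theta by stochastic gradient descent on the empirical
# MSE risk (28.97), using only realizations {x(n), y_n}.  The data model
# below is an assumption used purely to generate sample data.
import numpy as np

rng = np.random.default_rng(0)
M, N = 3, 2000
w_true, theta_true = np.array([1.0, -2.0, 0.5]), 0.3

Y = rng.normal(size=(N, M))                              # observations y_n
x = Y @ w_true - theta_true + 0.1 * rng.normal(size=N)   # targets x(n)

w, theta, mu = np.zeros(M), 0.0, 0.01                    # parameters, step size
for epoch in range(50):
    for n in rng.permutation(N):                         # one sample per update
        err = x[n] - (Y[n] @ w - theta)                  # instantaneous error
        w += mu * err * Y[n]                             # gradient step on w
        theta -= mu * err                                # gradient step on theta

print(np.round(w, 2), round(theta, 2))                   # ~[1.0, -2.0, 0.5], ~0.3
```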


For ease of reference, we represent the various inference and learning methods described above in the diagram shown in Fig. 28.7, where we also embedded the encoder and decoder cells that map the variables {x, y} to each other.

28.6

COMMENTARIES AND DISCUSSION Bayesian and non-Bayesian formulations. In statistics there is a clear distinction between the classical approach and the Bayesian approach to estimation. In the classical approach, the unknown quantity to be estimated is modeled as a deterministic but unknown constant. One popular non-Bayesian technique is the maximum-likelihood (ML) approach discussed later in Chapter 31. This approach was developed by the English statistician Ronald Fisher (1890–1962) in the works by Fisher (1912, 1922, 1925) – see the presentations by Pratt (1976), Savage (1976), and Aldrich (1997). The ML formulation does not assume any prior distribution for the unknown x and relies on maximizing a certain likelihood function. The Bayesian approach, on the other hand, models both the unknown quantity and the observation as random variables. It allows the designer to incorporate prior knowledge about the unknown into the solution, such as information about its pdf. This fact helps explain why Bayesian techniques are dominant in many successful filtering and estimation designs. We will provide a more detailed comparison of the ML and Bayesian approaches in the comments at the end of Chapter 31. For additional information on Bayesian and non-Bayesian techniques, readers may refer to the texts by Zacks (1971), Box and Tiao (1973), Scharf (1991), Kay (1993), Cassella and Berger (2002), Cox (2006), Hogg and McKean (2012), and Van Trees (2013). Bayes classifiers. The Bayes classifier (28.67) is one notable application of the method of Bayesian inference in statistical analysis. Some early references on the application of Bayesian inference to classification problems include the works by Chow (1957) and Miller (1962) and the texts by Davenport and Root (1958), Middleton (1960), and Wald (1950). For readers interested in learning more about Bayes classifiers and Bayesian inference, there are many available treatments in the literature, including, among others, the textbooks by Bernardo and Smith (2000), Lee (2002), DeGroot (2004), Cox (2006), Bolstad (2007), Robert (2007), Hoff (2009), and Young and Smith (2010). Likelihood ratio tests. In expression (28.31) we showed that the optimal classifier that minimizes the probability of misclassification can be obtained by maximizing the posterior probability of the variable x given the observation y = y. This construction provides a useful interpretation for the Bayes classifier as a MAP solution – see Duda, Hart, and Stork (2000), Webb (2002), Bishop (2007), and Theodoridis and Koutroumbas (2008). In Section 28.3.2 we explained how the Bayes classifier (28.28) can be recast in terms of the likelihood ratio test (28.43). This reformulation brings forth connections with another notable framework in statistical analysis, namely, the solution of detection problems by evaluating likelihood ratios and comparing them against threshold values. There is an extensive literature on this important topic, starting with the seminal works by Neyman and Pearson (1928, 1933), which laid the foundation for most of the subsequent development in this field. In one of its most basic forms, the Neyman– Pearson construction allows us to select between two simple hypotheses represented by parameter values x0 and x1 . For example, in the context of binary classification, the parameter x0 could be chosen to represent class +1, while the parameter x1 could be chosen to represent class −1. The two hypotheses are then stated as follows:


Figure 28.7 Schematic representation of inference and learning approaches based on discriminative methods, generative methods, model-based methods, and data-based methods.

$$\begin{cases} H_0: x = x_0 & \text{(null or positive hypothesis)} \\ H_1: x = x_1 & \text{(alternative or negative hypothesis)} \end{cases} \tag{28.98}$$

It is customary to refer to $H_0$ as the null or positive hypothesis, while $H_1$ is the alternative or negative hypothesis. Given an observation y, the Neyman–Pearson test would accept $H_0$ in lieu of $H_1$ (i.e., declare that the null hypothesis is valid) when the following likelihood ratio exceeds some threshold value $\eta$:

$$L(y) \;\triangleq\; \frac{f_{y|x}(y|x=x_0)}{f_{y|x}(y|x=x_1)} \;\underset{H_1}{\overset{H_0}{\gtrless}}\; \eta \tag{28.99}$$

The value of $\eta$ is usually selected to ensure that some upper bound, denoted by $\alpha$, is imposed on the probability of erroneously rejecting $H_0$ when $H_0$ is true. The resulting type-I error, or the probability of false negatives or missed detection, is given by:

$$\text{(false negative or type-I error)}\quad P(L(y) < \eta \,|\, H_0) = P(\text{reject } H_0 \,|\, \text{when } H_0 \text{ is true}) < \alpha \tag{28.100}$$

On the other hand, the probability of false positives or false alarm, also called type-II error, corresponds to

$$\text{(false positive or type-II error)}\quad \beta = P(L(y) \ge \eta \,|\, H_1) = P(\text{accept } H_0 \,|\, \text{when } H_1 \text{ is true}) \tag{28.101}$$

The Neyman–Pearson theory establishes that the likelihood test (28.99) is the most powerful test at level $\alpha$. This means that it is the test that results in the largest power, defined as the following probability:

$$\text{power} \;\triangleq\; P(\text{reject } H_0 \,|\, \text{when } H_1 \text{ is true}) = 1 - \beta \tag{28.102}$$

which measures the ability of the test to reject $H_0$ when $H_1$ is present. Table 28.1 summarizes the various decision possibilities and their respective probabilities.

Table 28.1 Definitions of the probabilities of false negatives, false positives, and the power of a hypothesis test.

                |  H0 is true                           |  H1 is true
    accept H0   |  correct decision, 1 − α              |  type-II error (false positive), β
    reject H0   |  type-I error (false negative), α     |  correct decision (power), 1 − β

Returning to the Bayes classifier, we noted in expression (28.43) that the threshold value $\eta$ should be selected as the ratio $\eta = \pi_{-1}/\pi_{+1}$, in terms of the priors for the classes ±1. Moreover, from expressions (28.54)–(28.55), we can deduce the probabilities of errors of types I and II (i.e., the fraction of false negatives and false positives by the classifier) for the situation discussed in the example:

$$P(\text{deciding } x=-1 \,|\, x=+1) = \int_{-\infty}^{0} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(y-1)^2}\,dy = \epsilon \qquad \text{(type-I)} \tag{28.103a}$$
$$P(\text{deciding } x=+1 \,|\, x=-1) = \int_{0}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(y+1)^2}\,dy = \epsilon \qquad \text{(type-II)} \tag{28.103b}$$

Consequently, for this example, $\alpha = \beta = \epsilon$, and the resulting power level is

$$P(\text{rejecting } x=+1 \,|\, x=-1) = 1 - \epsilon \tag{28.104}$$
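For concreteness, the probabilities in (28.103a)–(28.103b) and the power in (28.104) can be evaluated numerically with the Gaussian cdf; the snippet below does so for the unit-variance example with means ±1.

```python
# Numerical evaluation of (28.103a)-(28.103b) and the resulting power for the
# unit-variance Gaussian example with class means +1 and -1 and equal priors.
from scipy.stats import norm

eps = norm.cdf(0, loc=1, scale=1)          # P(y < 0 | x = +1): type-I error
beta = 1 - norm.cdf(0, loc=-1, scale=1)    # P(y >= 0 | x = -1): type-II error
power = 1 - beta

print(eps, beta, power)                    # eps = beta ~ 0.1587, power ~ 0.8413
```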


For further reading on hypothesis testing and statistical inference, some useful references include the texts by Kay (1998), Poor (1998), Cassella and Berger (2002), DeGroot (2004), Lehmann and Romano (2005), Cox (2006), Levy (2008), Young and Smith (2010), and Van Trees (1968, 2013). Beta distribution. The beta distribution (28.20), also known as the beta distribution of first-kind, is very useful to model random variables that are confined to the finite interval [0, 1]. It is parameterized by two positive shape parameters a and b, and includes the uniform distribution as a special case. It is often used as a prior in Bayesian inference, as was illustrated in Example 28.2. For more information on the beta distribution, the reader may consult the texts by Hahn and Shapiro (1994), Johnson, Kotz, and Balakrishnan (1994), and Gupta and Nadarajah (2004).

PROBLEMS

28.1 Motivated by the 0/1-loss (28.12), consider the alternative loss function:

$$Q(x,\hat{x}) \;\triangleq\; \begin{cases} 1, & |x-\hat{x}| > \epsilon \\ 0, & |x-\hat{x}| \le \epsilon \end{cases}$$

for some small $\epsilon > 0$. Show that

$$\mathbb{E}\left[\, Q(x,\hat{x}) \,|\, y=y \,\right] = 1 - \int_{\hat{x}-\epsilon}^{\hat{x}+\epsilon} f_{x|y}(x|y)\,dx$$

28.2 Consider a collection of N independent Gaussian realizations {y n } with mean µ and unit variance, i.e., y n ∼ Nyn (µ, 1) for n = 0, 1, . . . , N − 1. The mean µ is unknown but arises from a Gaussian prior distribution µ ∼ Nµ (0, σµ2 ) with known variance. (a) Determine the posterior distribution fµ|y0 ,...,yN −1 (µ|y0 , y1 , . . . , yN −1 ). (b) Determine the optimal MSE estimator of µ. (c) Determine the MAP estimator for µ. (d) Determine the MAE estimator for µ. 28.3 Refer to the derivation of the Bayes classifier (28.28). We wish to extend the solution to multiclass classification problems consisting of R classes, say, x ∈ {1, 2, . . . , R}. Given y, we again seek to solve over all possible classifiers: minc(y) P (c(y) 6= x). Show that the Bayes classifier in this case is given by the MAP construction x bbayes = argmax P(x = x|y = y) 1≤x≤R

28.4 A binary label x ∈ {+1, −1} is observed under zero-mean additive Gaussian noise v with variance σv2 . The observation is denoted by y = x + v. Assume x = +1 with probability p and x = −1 with probability 1 − p. Determine the form of the Bayes classifier. Compare with the result of Example 28.3. 28.5 Consider a binary classification problem in which x = +1 with probability p and x = −1 with probability 1 − p. The observation is scalar valued, y ∈ IR, and it 2 has a Gaussian distribution with mean m+1 and variance σ+1 when x = +1, and mean 2 m−1 and variance σ−1 when x = −1. (a) Determine the form of the Bayes classifier. 2 2 (b) Assume σ+1 = σ−1 = σ 2 and m+1 > m−1 . Determine an expression for the probability of error of this classifier. 28.6 Consider a binary classification problem in which x = +1 with probability p and x = −1 with probability 1 − p. The observation y is M -dimensional, y ∈ IRM , and it has a Gaussian distribution with mean m+1 and covariance matrix Σ+1 when x = +1,


and mean m−1 and covariance matrix Σ−1 when x = −1. Follow the log-likelihood ratio test of Section 28.3.2 to determine the form of the Bayes classifier. 28.7 We consider binary classification problems with x = ±1. The Bayes classifier was derived by minimizing the probability of erroneous decisions, as defined by (28.27). There are two types of error that can occur: assigning an observation to x = +1 when it actually arises from x = −1 or, conversely, assigning an observation to x = −1 when it arises from x = +1. The formulation (28.27) treats these two errors equally. However, there are situations where one type of error is more serious than the other, e.g., in deciding whether a person has a particular disease or not. To address such situations, we can assign weights to the errors and define instead a weighted risk function, also called the Bayes risk, for the classifier c(y) as follows: ∆

R(c) = α+1,−1 π+1 P(c(y) = −1|x = +1) + α−1,+1 π−1 P(c(y) = +1|x = −1) In this expression, the nonnegative scalar α+1,−1 weighs the error in assigning an observation from x = +1 to x = −1; similarly, for α−1,+1 . Moreover, the scalars π±1 denote the prior probabilities for x = ±1. (a) Follow arguments similar to those in Section 28.3 to determine the optimal classifier that minimizes the above weighted risk, R(c). Verify that the expression reduces to (28.28) when the weights {α+1,−1 , α−1,+1 } are equal. (b) Follow the derivation of the log-likelihood ratio test of Section 28.3.2 to show that the optimal classifier admits the following equivalent representation:   +1  α−1,+1 π−1 L(y) R α+1,11 π+1 −1 where L(y) is the likelihood ratio defined by (28.42). 28.8 We continue with the setting of Prob. 28.7, except that now we consider situations where it is also important to emphasize correct decisions in addition to erroneous decisions. There are two types of correct decisions: assigning an observation to x = +1 when it arises from x = +1 or assigning it to x = −1 when it arises from x = −1. Again, there are situations where one type of correct decisions is more relevant than the other. We can address these scenarios by defining a general weighted risk function as follows: ∆

R(y) = α+1,−1 π+1 P(c(y) = −1|x = +1) + α−1,+1 π−1 P(c(y) = +1|x = −1) + α+1,+1 π+1 P(c(y) = +1|x = +1) + α−1,−1 π−1 P(c(y) = −1|x = −1) Given α−1,+1 > α−1,−1 and α+1,−1 > α+1,+1 , follow the derivation of the log-likelihood ratio test from Section 28.3.2 to show that the optimal classifier admits the following equivalent representation:   +1  α−1,+1 − α−1,−1 π−1 L(y) R α+1,−1 − α+1,+1 π+1 −1 where L(y) is the likelihood ratio defined by (28.42). 28.9 Let π+1 and π−1 denote the prior probabilities for x = ±1, i.e., P(x = +1) = π+1 and P(x = −1) = π−1 . Introduce the conditional pdfs: ∆

$$f_{+1}(y) \;\triangleq\; f_{y|x}(y|x=+1), \qquad f_{-1}(y) \;\triangleq\; f_{y|x}(y|x=-1)$$

Let $t(y) = P(x=+1|y=y)$. Show that

$$t(y) = \frac{\pi_{+1} f_{+1}}{\pi_{+1} f_{+1} + \pi_{-1} f_{-1}}$$


Conclude that the test $t(y) > 1/2$ is equivalent to checking for the condition $\pi_{+1} f_{+1}(y) \ge \pi_{-1} f_{-1}(y)$.

28.10 Let y denote a random variable that is distributed according to a Poisson distribution with mean $\lambda \ge 0$, i.e.,

$$P(y=k) = \frac{\lambda^k e^{-\lambda}}{k!}, \qquad k = 0,1,2,\ldots$$

where $\lambda$ is the average number of events occurring in an interval of time. We model $\lambda$ as a random variable following an exponential distribution with mean equal to 1, i.e., $f_\lambda(\lambda) = e^{-\lambda}$ for $\lambda \ge 0$. Assume we collect N iid observations $\{y_1, \ldots, y_N\}$. We would like to estimate $\lambda$ from the observations.
(a) Let $S = \sum_{n=1}^N y_n$. Verify that

$$f_{\lambda|y_1,\ldots,y_N}(\lambda|y_1,\ldots,y_N) \;\propto\; e^{-\lambda(N+1)} \lambda^S$$

where $\propto$ denotes proportionality. Conclude that the conditional pdf of $\lambda$ given the observations follows a gamma distribution (which we defined earlier in Prob. 5.2). Determine the mean of this distribution and conclude that the MSE estimate for $\lambda$ is given by

$$\hat{\lambda}_{\rm MSE} = \frac{1}{N+1}\left( 1 + \sum_{n=1}^N y_n \right)$$

(b) Show that the MAP estimate for $\lambda$ is given by

$$\hat{\lambda}_{\rm MAP} = \frac{1}{N+1}\sum_{n=1}^N y_n$$

(c) Show that the MAE estimate for $\lambda$ is found by solving the integral equation

$$\frac{1}{S!} \int_0^{\hat{\lambda}_{\rm MAE}} \lambda^S (N+1)^{S+1} e^{-\lambda(N+1)}\,d\lambda = 1/2$$

(d) Which of the estimators found in parts (a)–(c) is unbiased?

28.11 A random variable y follows a binomial distribution with parameters N and p, i.e., the probability of observing k successes in N trials is given by:

$$P(y=k) = \binom{N}{k} p^k (1-p)^{N-k}, \qquad k = 0,1,\ldots,N$$

Having observed $y=y$, we wish to estimate the probability of success, p, using a MAP estimator. To do so, and as explained in Example 28.2, we assume that the marginal distribution of p follows the beta distribution (28.20).
(a) Using the assumed forms for the distributions of p and y, determine an expression for the conditional pdf $f_{p|y}(p|y)$.
(b) Show that the peak of $f_{p|y}(p|y)$ occurs at location

$$\hat{p}_{\rm MAP} = \frac{y+a-1}{N+a+b-2}$$

(c) Compare the MAP solution to the ML solution in Prob. 31.8.

28.12 Consider the same setting of Prob. 28.11. Show that the MSE estimate of p given $y=y$ (i.e., the conditional mean estimate) is given by:

$$\hat{p}_{\rm MSE} = \mathbb{E}\,(p|y=y) = \frac{y+a}{N+a+b}$$

Find the resulting MSE. Compare the MSE solution to the ML solution from Prob. 31.8.

REFERENCES Aldrich, J. (1997), “R. A. Fisher and the making of maximum likelihood 1912–1922,” Statist. Sci., vol. 12, no. 3, pp. 162–176. Bernardo, J. M. and A. F. M. Smith (2000), Bayesian Theory, Wiley. Bishop, C. (2007), Pattern Recognition and Machine Learning, Springer. Bolstad, W. M. (2007), Bayesian Statistics, 2nd ed., Wiley. Box, G. E. P. and G. C. Tiao (1973), Bayesian Inference in Statistical Analysis, Addison-Wesley. Cassella, G. and R. L. Berger (2002), Statistical Inference, Duxbury. Chow, C. K. (1957), “An optimum character recognition system using decision functions,” IRE Trans. Electron. Comput., vol. 6, pp. 247–254. Cox, D. R. (2006), Principles of Statistical Inference, Cambridge University Press. Davenport, W. B. and W. L. Root (1958), The Theory of Random Signals and Noise, McGraw-Hill. DeGroot, M. H. (2004), Optimal Statistical Decisions, Wiley. Duda, R. O., P. E. Hart, and D. G. Stork (2000), Pattern Classification, 2nd ed., Wiley. Fisher, R. A. (1912), “On an absolute criterion for fitting frequency curves,” Mess. Math., vol. 41, pp. 155–160. Fisher, R. A. (1922), “On the mathematical foundations of theoretical statistics,” Philos. Trans. Roy. Soc. Lond. Ser. A., vol. 222, pp. 309–368. Fisher, R. A. (1925), “Theory of statistical estimation,” Proc. Cambridge Philos. Soc., vol. 22, pp. 700–725. Gupta, A. K. and S. Nadarajah (2004), Handbook of Beta Distribution and its Applications, Marcel Dekker. Hahn, G. J. and S. S. Shapiro (1994), Statistical Models in Engineering, Wiley. Hoff, P. D. (2009), A First Course in Bayesian Statistical Methods, Springer. Hogg, R. V. and J. McKean (2012), Introduction to Mathematical Statistics, 7th ed., Pearson. Johnson, N. L., S. Kotz, and N. Balakrishnan (1994), Continuous Univariate Distributions, vol. 1, Wiley. Kay, S. (1993), Fundamentals of Statistical Signal Processing: Estimation Theory, Prentice Hall. Kay, S. (1998), Fundamentals of Statistical Signal Processing: Detection Theory, Prentice Hall. Lee, P. M. (2002), An Introduction to Bayesian Statistics, 4th ed., Wiley. Lehmann, E. L. and J. P. Romano (2005), Testing Statistical Hypothesis, 3rd ed., Springer. Levy, B. C. (2008), Principles of Signal Detection and Parameter Estimation, Springer. Middleton, D. (1960), An Introduction to Statistical Communication Theory, McGrawHill. Miller, R. G. (1962), “Statistical prediction by discriminant analysis,” Meteorol. Monogr., vol. 4, no. 25, pp. 1–54. Neyman, J. and E. Pearson (1928), “On the use and interpretation of certain test criteria for purposes of statistical inference: Part I,” Biometrika, vol. 20A, nos. 1–2, pp. 175–240. Neyman, J. and E. Pearson (1933), “On the problem of the most efficient tests of statistical hypotheses,” Philos. Trans. Roy. Soc. Lond., vol. 231, pp. 289–337. Poor, H. V. (1998), An Introduction to Signal Detection and Estimation, Springer.


Pratt, J. W. (1976), “F. Y. Edgeworth and R. A. Fisher on the efficiency of maximum likelihood estimation,” Ann. Statist., vol. 4, no. 3, pp. 501–514. Robert, C. P. (2007), The Bayesian Choice, 2nd ed., Springer. Savage, L. J. (1976), “On rereading R. A. Fisher,” Ann. Statist., vol. 4, pp. 441–500. Scharf, L. L. (1991), Statistical Signal Processing: Detection, Estimation, and TimeSeries Analysis, Addison-Wesley. Theodoridis, S. and K. Koutroumbas (2008), Pattern Recognition, 4th ed., Academic Press. Van Trees, H. L. (1968), Detection, Estimation, and Modulation Theory, Wiley. Van Trees, H. L. (2013), Detection, Estimation, and Modulation Theory, Part I, 2nd ed., Wiley. Wald, A. (1950), Statistical Decision Functions, Wiley. Webb, A. (2002), Statistical Pattern Recognition, Wiley. Young, G. A. and R. L. Smith (2010), Essentials of Statistical Inference, Cambridge University Press. Zacks, S. (1971), The Theory of Statistical Inference, Wiley.

29 Linear Regression

The mean-square-error (MSE) problem of estimating a random variable x from observations of another random variable y seeks a mapping c(y) that solves

$$\hat{x} = \operatorname*{argmin}_{\hat{x}=c(y)}\; \mathbb{E}\,(x - c(y))^2 \tag{29.1}$$

We showed in (27.18) that the optimal estimate is given by the conditional mean $\hat{x} = \mathbb{E}\,(x|y=y)$. For example, for continuous random variables, the MSE estimate involves an integral computation of the form:

$$\hat{x} = \int_{x\in X} x f_{x|y}(x|y)\,dx \tag{29.2}$$

over the domain of realizations $x\in X$. Evaluation of this solution requires knowledge of the conditional distribution, $f_{x|y}(x|y)$. Even if $f_{x|y}(x|y)$ were available, computation of the integral expression is generally not possible in closed form. In this chapter, we address this challenge by limiting c(y) to the class of affine functions of y. Assuming y is M-dimensional, affine functions take the following form:

$$c(y) = y^T w - \theta \tag{29.3}$$

for some parameters $w\in\mathbb{R}^M$ and $\theta\in\mathbb{R}$; the latter is called the offset parameter. The problem of determining the MSE estimator $\hat{x}$ is then reduced to the problem of selecting optimal parameters $(w,\theta)$ to minimize the same MSE in (29.1). Despite its apparent narrowness, this class of estimators leads to solutions that are tractable mathematically and deliver laudable performance in a wide range of applications.

29.1 REGRESSION MODEL

Although we can treat the inference problem in greater generality than below, by considering directly the problem of estimating a random vector x from another random vector y, we will consider first the case of estimating a scalar x from a vector $y\in\mathbb{R}^M$.

Let $\{\bar{x}, \bar{y}\}$ denote the first-order moments of the random variables $x\in\mathbb{R}$ and $y\in\mathbb{R}^M$, i.e., their means:

$$\bar{x} = \mathbb{E}\,x, \qquad \bar{y} = \mathbb{E}\,y \tag{29.4a}$$

and let $\{\sigma_x^2, R_y, r_{xy}\}$ denote their second-order moments, i.e., their (co)-variances and cross-covariance vector:

$$\sigma_x^2 = \mathbb{E}\,(x-\bar{x})^2 \qquad \text{(scalar)} \tag{29.4b}$$
$$R_y = \mathbb{E}\,(y-\bar{y})(y-\bar{y})^T \qquad (M\times M) \tag{29.4c}$$
$$r_{xy} = \mathbb{E}\,(x-\bar{x})(y-\bar{y})^T = r_{yx}^T \qquad (1\times M) \tag{29.4d}$$

The cross-covariance vector, $r_{xy}$, between x and y is a useful measure of the amount of information that one variable conveys about the other. We then pose the problem of determining the linear least-mean-square-error (LLMSE) estimator of x given y, namely, an estimator of the form

$$\hat{x}_{\rm LMSE} = y^T w - \theta = w^T y - \theta \qquad \text{(estimator for } x\text{)} \tag{29.5}$$

where $(w,\theta)$ are determined by solving

$$(w^o, \theta^o) = \operatorname*{argmin}_{w,\theta}\; \mathbb{E}\,(x - \hat{x}_{\rm LMSE})^2 \tag{29.6}$$

The minus sign in front of the offset parameter $\theta$ in (29.5) is chosen for convenience, and the subscript LMSE refers to the linear MSE estimator. Since in this chapter we will be dealing almost exclusively with linear estimators under the MSE criterion, we will refrain from including the subscript and will simply write $\hat{x}$. It is customary to refer to model (29.5) as a linear regression model in the sense that the individual entries of y are being combined linearly, or a linear model is being fitted to the entries of y, in order to estimate x.

Theorem 29.1. (Linear estimators) The solution $(w^o,\theta^o)$ to (29.6) satisfies the relations

$$R_y w^o = r_{yx}, \qquad \theta^o = \bar{y}^T w^o - \bar{x} \tag{29.7}$$

so that, when $R_y$ is invertible, the estimator and the resulting minimum mean-square error (MMSE) are given by

$$\hat{x} - \bar{x} = r_{xy} R_y^{-1} (y - \bar{y}) \tag{29.8a}$$
$$\text{MMSE} = \sigma_x^2 - r_{xy} R_y^{-1} r_{yx} \tag{29.8b}$$

Proof: We provide an algebraic proof. Let $\tilde{x} = x - \hat{x}$. We expand the MSE to get

$$\text{MSE} \;\triangleq\; \mathbb{E}\,\tilde{x}^2 = \mathbb{E}\,(x - y^T w + \theta)^2 = \mathbb{E}\,x^2 - 2(\mathbb{E}\,xy^T)w + 2\theta\bar{x} - 2\theta\bar{y}^T w + w^T(\mathbb{E}\,yy^T)w + \theta^2 \tag{29.9}$$

This MSE is a quadratic function of w and θ. Differentiating with respect to w and θ and setting the derivatives to zero at the optimal solution gives:

$$\partial\,\text{MSE}/\partial\theta\,\Big|_{\theta=\theta^o,\,w=w^o} = 2\bar{x} - 2\bar{y}^T w^o + 2\theta^o = 0 \tag{29.10}$$
$$\nabla_w\,\text{MSE}\,\Big|_{\theta=\theta^o,\,w=w^o} = -2\,\mathbb{E}\,xy^T - 2\theta^o \bar{y}^T + 2(w^o)^T(\mathbb{E}\,yy^T) = 0 \tag{29.11}$$

Solving for $\theta^o$ and $w^o$ we find

$$\theta^o = \bar{y}^T w^o - \bar{x} \qquad \text{and} \qquad (\mathbb{E}\,yy^T)w^o = \mathbb{E}\,yx + \theta^o\bar{y} \tag{29.12}$$

Replacing $\theta^o$ in the second expression for $w^o$ and grouping terms gives

$$\left(\mathbb{E}\,yy^T - \bar{y}\bar{y}^T\right) w^o = \mathbb{E}\,yx - \bar{y}\bar{x} \;\Longleftrightarrow\; R_y w^o = r_{yx} \tag{29.13}$$

which leads to (29.7). Substituting the expressions for $(w^o,\theta^o)$ into (29.9) leads to (29.8b). Finally, the Hessian matrix of the MSE relative to w and θ is given by

$$H \;\triangleq\; \begin{bmatrix} \dfrac{\partial^2\,\text{MSE}}{\partial\theta^2} & \dfrac{\partial}{\partial\theta}\big(\nabla_w\,\text{MSE}\big) \\[1mm] \nabla_{w^T}\Big(\dfrac{\partial\,\text{MSE}}{\partial\theta}\Big) & \nabla_w^2\,\text{MSE} \end{bmatrix} = \begin{bmatrix} 2 & -2\bar{y}^T \\ -2\bar{y} & 2\,\mathbb{E}\,yy^T \end{bmatrix} \tag{29.14}$$

This Hessian matrix is nonnegative-definite since its (1,1) entry is positive and the Schur complement relative to this entry is nonnegative-definite:

$$\text{Schur complement} = 2\,\mathbb{E}\,yy^T - 2\bar{y}\bar{y}^T = 2R_y \ge 0 \tag{29.15}$$

Since the MSE cost is quadratic in the parameters $(w,\theta)$, we conclude that the solution $(w^o,\theta^o)$ corresponds to a global minimizer.
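The construction in Theorem 29.1 is straightforward to exercise numerically. The sketch below generates data from an assumed affine relation, estimates the moments by sample averages, and solves (29.7); the recovered $(w^o, \theta^o)$ and the MMSE in (29.8b) match the assumed model and noise level. All numerical values are illustrative assumptions.

```python
# Numerical illustration of (29.7)-(29.8b) using sample moments.
import numpy as np

rng = np.random.default_rng(1)
N = 100_000
w_true, theta_true = np.array([2.0, -1.0]), 0.5

y = rng.multivariate_normal([1.0, 3.0], [[2.0, 0.3], [0.3, 1.0]], size=N)
x = y @ w_true - theta_true + rng.normal(scale=0.5, size=N)

x_bar, y_bar = x.mean(), y.mean(axis=0)
Ry = np.cov(y, rowvar=False)                    # sample covariance of y
ryx = ((y - y_bar).T @ (x - x_bar)) / (N - 1)   # sample cross-covariance

w_o = np.linalg.solve(Ry, ryx)                  # R_y w^o = r_yx
theta_o = y_bar @ w_o - x_bar                   # theta^o = ybar^T w^o - xbar
mmse = x.var() - ryx @ w_o                      # sigma_x^2 - r_xy R_y^{-1} r_yx

print(np.round(w_o, 2), round(theta_o, 2), round(mmse, 2))
# ~[2.0, -1.0], ~0.5, ~0.25 (the assumed noise variance)
```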

It is sufficient for our purposes to assume that $R_y > 0$. Observe from the statement of the theorem that the solution to the linear regression problem only requires knowledge of the first- and second-order moments $\{\bar{x}, \bar{y}, \sigma_x^2, R_y, r_{xy}\}$; there is no need to know the full conditional probability density function (pdf) $f_{x|y}(x|y)$ as was the case with the optimal conditional mean estimator (27.18). Moreover, the observation, y, does not appear in the MMSE expression. This means that we can assess beforehand, even before receiving the observation, the performance level that will be expected from the solution.

Remark 29.1. (Linear model) Consider two random variables {x, y} and assume they are related by a linear model of the form:

$$x = y^T z^o + v \tag{29.16}$$

for some unknown parameter $z^o\in\mathbb{R}^M$, and where v has zero mean and is orthogonal to y, i.e., $\mathbb{E}\,vy^T = 0$. Taking expectation of both sides gives $\bar{x} = \bar{y}^T z^o$ so that

$$x - \bar{x} = (y - \bar{y})^T z^o + v \tag{29.17}$$

Multiplying both sides by $(y-\bar{y})$ from the left and taking expectations again gives

$$r_{yx} = R_y z^o \tag{29.18}$$

This is the same equation satisfied by the solution $w^o$ in (29.7). We therefore conclude that when the variables {x, y} happen to be related by a linear model as in (29.16), then the estimator, $\hat{x}_{\rm LMSE}$, is able to recover the exact model $z^o$. Put in another way, when we solve a linear MSE problem, we are implicitly assuming that the variables {x, y} satisfy a linear model of the above form.

Two properties

The linear least-mean-squares estimator satisfies two useful properties. First, the estimator is unbiased since by taking expectations of both sides of (29.8a) we get

$$\mathbb{E}\,(\hat{x} - \bar{x}) = r_{xy} R_y^{-1}\, \underbrace{\mathbb{E}\,(y - \bar{y})}_{=0} = 0 \tag{29.19}$$

so that,

$$\mathbb{E}\,\hat{x} = \bar{x} \qquad \text{(unbiased estimator)} \tag{29.20}$$

Moreover, using (29.8a) again, we find that the variance of the estimator $\hat{x}$ is given by

$$\sigma_{\hat{x}}^2 = \mathbb{E}\,(\hat{x} - \bar{x})^2 \;\overset{(29.8a)}{=}\; r_{xy} R_y^{-1}\, \underbrace{\mathbb{E}\,(y-\bar{y})(y-\bar{y})^T}_{=R_y}\, R_y^{-1} r_{yx} = r_{xy} R_y^{-1} r_{yx} \tag{29.21}$$

so that expression (29.8b) for the MMSE can be equivalently rewritten as

$$\text{MMSE} = \mathbb{E}\,\tilde{x}^2 = \sigma_x^2 - \sigma_{\hat{x}}^2 \tag{29.22}$$

Second, the LLMSE estimator satisfies an important orthogonality condition, namely, it is uncorrelated with the observation:

$$\mathbb{E}\,\tilde{x} y^T = 0 \qquad \text{(orthogonality principle)} \tag{29.23}$$

Proof of (29.23): From expression (29.8a) we note that

$$\begin{aligned} \mathbb{E}\,\tilde{x}(y-\bar{y})^T &= \mathbb{E}\,(x - \hat{x})(y - \bar{y})^T \\ &\overset{(29.8a)}{=} \mathbb{E}\left( x - \bar{x} - r_{xy} R_y^{-1}(y-\bar{y}) \right)(y-\bar{y})^T \\ &= \mathbb{E}\,(x-\bar{x})(y-\bar{y})^T - r_{xy} R_y^{-1}\,\mathbb{E}\,(y-\bar{y})(y-\bar{y})^T \\ &= r_{xy} - r_{xy} R_y^{-1} R_y \\ &= 0 \end{aligned} \tag{29.24}$$

But since $\mathbb{E}\,\tilde{x} = 0$, we conclude that (29.23) holds.

It is because of the orthogonality property (29.23) between the estimation error and the observation that equations (29.7) for $w^o$ are referred to as the normal equations:

$$R_y w^o = r_{yx} \qquad \text{(normal equations)} \tag{29.25}$$

It can be verified that these equations are always consistent, i.e., a solution $w^o$ always exists independent of whether $R_y$ is invertible or not – see Appendix 29.A. Moreover, the orthogonality condition (29.23) plays a critical role in characterizing linear estimators.

Theorem 29.2. (Orthogonality principle) An unbiased linear estimator is optimal in the LMSE sense if, and only if, its estimation error satisfies the orthogonality condition (29.23).

Proof: One direction of the argument has already been proven prior to the statement, namely, the LLMSE estimator, $\hat{x}$, given by (29.8a), satisfies (29.23). With regards to the converse statement, assume now that we are given an unbiased linear estimator for x of the form

$$\hat{x}_u = y^T w_u - \theta_u \tag{29.26}$$

and that the corresponding estimation error satisfies the orthogonality condition (29.23), i.e., $\tilde{x}_u \perp y$. We verify that the parameters $\{w_u, \theta_u\}$ must coincide with the optimal parameters $\{w^o, \theta^o\}$ given by (29.7).

Indeed, the fact that $\hat{x}_u$ is unbiased means that $\mathbb{E}\,\hat{x}_u = \bar{x}$ and, hence, $\theta_u$ satisfies

$$\theta_u = -(\bar{x} - \bar{y}^T w_u) \tag{29.27}$$

so that, by substituting into (29.26), we get that $\hat{x}_u$ satisfies

$$\hat{x}_u - \bar{x} = (y - \bar{y})^T w_u \tag{29.28}$$

Now using the assumed orthogonality condition $\tilde{x}_u \perp y$ we must have

$$\mathbb{E}\,(x - \hat{x}_u) y^T = 0 \tag{29.29}$$

which is equivalent to

$$\mathbb{E}\,(x - \hat{x}_u)(y - \bar{y})^T = 0 \tag{29.30}$$

since, by assumption, $\mathbb{E}\,\hat{x}_u = \mathbb{E}\,x$. Substituting expression (29.28) for $\hat{x}_u$ into (29.30), we find that $w_u$ must satisfy

$$\mathbb{E}\left( x - \bar{x} - (y-\bar{y})^T w_u \right)(y-\bar{y})^T = 0 \tag{29.31}$$

which leads to

$$r_{yx} = R_y w_u \tag{29.32}$$

so that $w^o$ and $w_u$ satisfy the same normal equations. Substituting $w_u$ by $w^o$ into expression (29.27) we get that $\theta_u = \theta^o$.

Example 29.1 (Multiple noisy measurements of a binary signal) Consider a signal x that assumes the values ±1 with probability 1/2 each. We collect N noisy measurements

$$y_\ell = x + v_\ell, \qquad \ell = 1,\ldots,N \tag{29.33}$$

where $v_\ell$ is zero-mean noise of unit variance and independent of x. We introduce the observation vector $y = \text{col}\{y_1, y_2, \ldots, y_N\}$. Then, say, for N = 5, it is straightforward to find that

$$r_{xy} = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \end{bmatrix}, \qquad R_y = \begin{bmatrix} 2 & 1 & 1 & 1 & 1 \\ 1 & 2 & 1 & 1 & 1 \\ 1 & 1 & 2 & 1 & 1 \\ 1 & 1 & 1 & 2 & 1 \\ 1 & 1 & 1 & 1 & 2 \end{bmatrix} \tag{29.34}$$

so that

$$\hat{x} = r_{xy} R_y^{-1} y = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} 2 & 1 & 1 & 1 & 1 \\ 1 & 2 & 1 & 1 & 1 \\ 1 & 1 & 2 & 1 & 1 \\ 1 & 1 & 1 & 2 & 1 \\ 1 & 1 & 1 & 1 & 2 \end{bmatrix}^{-1} y \tag{29.35}$$

We need to evaluate $R_y^{-1}$. Due to the special structure of $R_y$, its inverse can be evaluated in closed form for any N. Observe that, for any N, the matrix $R_y$ can be expressed as $R_y = I_N + \mathbb{1}\mathbb{1}^T$, where $I_N$ is the N×N identity matrix and $\mathbb{1}$ is the N×1 column vector with unit entries, $\mathbb{1} = \text{col}\{1,1,1,\ldots,1\}$. In other words, $R_y$ is a rank-one modification of the identity matrix. This is a useful observation since the inverse of every such matrix has a similar form (see Prob. 1.10). Specifically, it can be verified that, for any column vector $a\in\mathbb{R}^N$,

$$\left( I_N + aa^T \right)^{-1} = I_N - \frac{aa^T}{1 + \|a\|^2} \tag{29.36}$$

where $\|a\|^2 = a^T a$ denotes the squared Euclidean norm of a. Using this result with $a = \mathbb{1}$, we find that

$$r_{xy} R_y^{-1} = \mathbb{1}^T \left( I_N - \frac{\mathbb{1}\mathbb{1}^T}{N+1} \right) = \mathbb{1}^T - \frac{N}{N+1}\mathbb{1}^T = \frac{\mathbb{1}^T}{N+1} \tag{29.37}$$

so that

$$\hat{x} = \frac{1}{N+1}\mathbb{1}^T y = \frac{1}{N+1}\sum_{\ell=1}^N y_\ell \tag{29.38}$$
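The rank-one inversion formula (29.36) and the averaging estimator (29.38) are easy to confirm numerically for N = 5; the measurement values below are arbitrary sample numbers used only for the check.

```python
# Numerical check of (29.36)-(29.38) for N = 5.
import numpy as np

N = 5
ones = np.ones(N)
Ry = np.eye(N) + np.outer(ones, ones)                  # R_y = I_N + 1 1^T
Ry_inv = np.eye(N) - np.outer(ones, ones) / (1 + N)    # formula (29.36)
print(np.allclose(Ry_inv, np.linalg.inv(Ry)))          # True

rxy = ones                                             # r_xy = 1^T here
w = rxy @ Ry_inv                                       # equals 1^T / (N + 1)
y = np.array([0.9, 1.2, -0.3, 1.1, 0.7])               # arbitrary measurements
print(w @ y, y.sum() / (N + 1))                        # identical values
```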

Recall that in this problem the variable x is discrete and assumes the values ±1. The estimator $\hat{x}$ in (29.38) will generally assume real values. If one wishes to use $\hat{x}$ to decide whether x = +1 or x = −1, then one may consider examining the sign of $\hat{x}$ and use the suboptimal estimator:

$$\hat{x}_{\rm sub} = \text{sign}\left( \frac{1}{N+1}\sum_{n=1}^N y_n \right) \tag{29.39}$$

where the sign function was defined earlier in (27.36).

Example 29.2 (Learning a regression model from data) Assume we can estimate the price of a house in some neighborhood A, measured in units of ×1000 USD, in some affine manner from the surface area s (measured in square meters) and the unit's age a (measured in years), say, as:

$$\hat{P} = \alpha s + \beta a - \theta \tag{29.40}$$

for some unknown scalar parameters $(\alpha, \beta, \theta)$. Here, the scalar θ denotes an offset parameter and the above relation represents the equation of a plane mapping values (s, a) into an estimate for the house price, denoted by $\hat{P}$. We collect the attributes {s, a} into an observation vector:

$$y = \begin{bmatrix} s \\ a \end{bmatrix} \tag{29.41}$$

and the unknown parameters $\{\alpha, \beta\}$ into a column vector:

$$w = \begin{bmatrix} \alpha \\ \beta \end{bmatrix} \tag{29.42}$$

Then, the mapping from y to $\hat{P}$ can be written more compactly as:

$$\hat{P} = y^T w - \theta \tag{29.43}$$

The price P plays the role of the variable x that we wish to estimate from observations of y. If we happen to know the first- and second-order moments of the price and observation variables, then we could estimate $(w,\theta)$ by using

$$w^o = R_y^{-1} r_{yP}, \qquad \theta^o = \bar{y}^T w^o - \bar{P} \tag{29.44}$$

where $\bar{P}$ is the average price of houses in neighborhood A. Often, in practice, these statistical moments are not known beforehand. They can, however, be estimated from measurements. Assume we have available a list of N houses from neighborhood A with their prices, sizes, and ages; obviously, the house whose price we are interested in estimating should not be part of this list. We denote the available information by $\{P_n, y_n\}$ where $n = 1, 2, \ldots, N$. Then, we can estimate the first- and second-order moments from this data by using the sample averages:

$$\hat{\bar{P}} = \frac{1}{N}\sum_{n=1}^N P_n, \qquad \hat{\bar{y}} = \frac{1}{N}\sum_{n=1}^N y_n \tag{29.45a}$$

and

$$\hat{r}_{yP} = \frac{1}{N}\sum_{n=1}^N (y_n - \hat{\bar{y}})(P_n - \hat{\bar{P}}) \tag{29.46a}$$
$$\hat{R}_y = \frac{1}{N}\sum_{n=1}^N (y_n - \hat{\bar{y}})(y_n - \hat{\bar{y}})^T \tag{29.46b}$$

The parameters $(w^o, \theta^o)$ would be approximated by

$$w^\star = \hat{R}_y^{-1}\hat{r}_{yP} \tag{29.47a}$$
$$\theta^\star = \hat{\bar{y}}^T w^\star - \hat{\bar{P}} \tag{29.47b}$$

where we are using the star notation to refer to parameters estimated directly from data measurements; this will be a standard convention in our treatment. Figure 29.1 illustrates these results by means of a simulation. The figure shows the scatter diagram for N = 500 points $(P_n, y_n)$ representing the triplet (price, area, age) for a collection of 500 houses. The price is measured in units of ×1000 USD, the area in units of square meters, and the age in units of years. The spheres represent the measured data. The flat plane represents the regression plane that results from the above calculations, namely,

$$\hat{P} = y^T w^\star - \theta^\star \tag{29.48}$$

with

$$w^\star = \begin{bmatrix} 3.02 \\ -2.04 \end{bmatrix}, \qquad \theta^\star = 1.26 \tag{29.49}$$






Figure 29.1 Scatter diagram of N = 500 points (Pn , yn ) representing the triplet (price,

area, age) for a collection of 500 houses. The price is measured in units of ×1000 USD, the area in units of square meters, and the age in units of years. The spheres represent the measured data. The flat plane represents the fitted regression plane (29.48).

The values of the parameters $(\alpha, \beta, \theta)$ are measured in units of ×1000 USD. Given a 17-year-old house with area 102 m², we can use the above parameter values to estimate/predict its price as follows:

$$\hat{P} = (3.02\times 102) - (2.04\times 17) - 1.26 = 272.1\text{K USD} \tag{29.50}$$
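A minimal sketch of the procedure in (29.45a)–(29.47b): estimate $(w^\star, \theta^\star)$ from a synthetic list of (price, area, age) records via sample moments and then predict the price of a 102 m², 17-year-old house. The generating values and noise level below are assumptions for illustration and are not the dataset used in Fig. 29.1.

```python
# Estimate (w*, theta*) from synthetic (price, area, age) records.
import numpy as np

rng = np.random.default_rng(3)
N = 500
area = rng.uniform(60, 160, size=N)            # m^2
age = rng.uniform(0, 40, size=N)               # years
P = 3.0 * area - 2.0 * age - 1.0 + rng.normal(scale=10.0, size=N)

Y = np.column_stack([area, age])               # observations y_n
P_bar, y_bar = P.mean(), Y.mean(axis=0)
Ry_hat = np.cov(Y, rowvar=False)               # sample R_y
ryP_hat = ((Y - y_bar).T @ (P - P_bar)) / (N - 1)

w_star = np.linalg.solve(Ry_hat, ryP_hat)      # (29.47a)
theta_star = y_bar @ w_star - P_bar            # (29.47b)

# predict the price of a 17-year-old, 102 m^2 house
print(np.array([102.0, 17.0]) @ w_star - theta_star)
```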

29.2 CENTERING AND AUGMENTATION

We describe in this section two useful preprocessing steps that are common in inference and learning implementations in order to remove the need for the offset parameter, θ.

29.2.1 Centering

We start with centering. One useful fact to note is that the mean values $\{\bar{x}, \bar{y}\}$ appear subtracted from both sides of (29.8a). We say that the random variables {x, y} are being centered, by subtracting their means, and that the estimation problem actually amounts to estimating one centered variable from another centered variable, i.e., to estimating $x_c = x - \bar{x}$ from $y_c = y - \bar{y}$ in a linear manner. Indeed, note that the variables {x, y} and $\{x_c, y_c\}$ have the same second-order moments since

$$R_{y_c} \;\triangleq\; \mathbb{E}\,y_c y_c^T = \mathbb{E}\,(y-\bar{y})(y-\bar{y})^T = R_y \tag{29.51a}$$

and

$$r_{x_c y_c} \;\triangleq\; \mathbb{E}\,x_c (y_c)^T = \mathbb{E}\,(x-\bar{x})(y-\bar{y})^T = r_{xy} \tag{29.51b}$$

so that estimating $x_c$ from $y_c$ leads to the same relation

$$\hat{x}_c = r_{xy} R_y^{-1} y_c \tag{29.52}$$

For this reason, and without loss in generality, it is customary in linear estimation problems to assume that the variables {x, y} have zero means (or have already been centered) and to solve the estimation task under this condition. The fact that centering arises in the context of LLMSE estimation can also be seen directly from (29.5) if we were to impose the requirement that the estimator should be unbiased. In that case, we would conclude from (29.5) that θ must satisfy

$$\bar{x} = \bar{y}^T w - \theta \;\Longrightarrow\; \theta = \bar{y}^T w - \bar{x} \tag{29.53}$$

Substituting this condition for θ into (29.5) we find that, in effect, the estimator we are seeking should be of the form

$$\hat{x} - \bar{x} = (y - \bar{y})^T w \tag{29.54}$$

with centered variables appearing on both sides of the expression.

29.2.2 Augmentation

There is a second equivalent construction to centering, which will be used extensively in later chapters. We refer to the affine model (29.5) and introduce the extended vectors:

$$y' \;\triangleq\; \begin{bmatrix} 1 \\ y \end{bmatrix}, \qquad w' \;\triangleq\; \begin{bmatrix} -\theta \\ w \end{bmatrix}, \qquad (M+1)\times 1 \tag{29.55}$$

where we have added the scalars 1 and −θ as leading entries on top of y and w, respectively. Then, the affine estimator (29.5) can be rewritten in the linear (rather than affine) form:

$$\hat{x} = (y')^T w' \tag{29.56}$$

and the design problem becomes one of determining a parameter vector $(w')^o$ that minimizes $\mathbb{E}\,\tilde{x}^2$. We already know from the statement of Theorem 29.1 applied to this extended problem that $(w')^o$ should satisfy the linear equations:

$$R_{y'} (w')^o = r_{y'x} \tag{29.57}$$

where the first- and second-order moments are computed as follows:

$$\bar{y}' \;\triangleq\; \mathbb{E}\,y' = \begin{bmatrix} 1 \\ \bar{y} \end{bmatrix} \tag{29.58a}$$

while

$$r_{y'x} \;\triangleq\; \mathbb{E}\,(y' - \bar{y}')(x - \bar{x}) = \begin{bmatrix} 0 \\ \mathbb{E}\,(y-\bar{y})(x-\bar{x}) \end{bmatrix} = \begin{bmatrix} 0 \\ r_{yx} \end{bmatrix} \tag{29.58b}$$

and

$$R_{y'} \;\triangleq\; \mathbb{E}\,y'(y')^T - \bar{y}'(\bar{y}')^T = \mathbb{E}\begin{bmatrix} 1 & y^T \\ y & yy^T \end{bmatrix} - \begin{bmatrix} 1 & \bar{y}^T \\ \bar{y} & \bar{y}\bar{y}^T \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 0 & R_y \end{bmatrix} \tag{29.58c}$$

If we denote the individual entries of $(w')^o$ by

$$(w')^o \;\triangleq\; \begin{bmatrix} -\theta^o \\ w^o \end{bmatrix} \tag{29.59}$$

then it follows from the expressions for $R_{y'}$ and $r_{y'x}$ and from (29.57) that the $w^o$ component satisfies $R_y w^o = r_{yx}$. This result agrees with (29.7). Again, if we impose the unbiasedness condition that $\mathbb{E}\,\hat{x} = \bar{x}$ at the optimal solution $(w')^o$, then we conclude from (29.56) that $\theta^o$ must satisfy

$$\bar{x} = (\bar{y}')^T (w')^o = \begin{bmatrix} 1 & \bar{y}^T \end{bmatrix}\begin{bmatrix} -\theta^o \\ w^o \end{bmatrix} = -\theta^o + \bar{y}^T w^o \tag{29.60}$$

from which we conclude that $\theta^o = \bar{y}^T w^o - \bar{x}$, which again agrees with (29.7). For this reason, it is customary to assume that the variables have been extended to $(y', w')$ as in (29.55), and to seek a linear (as opposed to affine) model as in (29.56). It is a matter of convenience whether we assume that the variables (x, y) are centered and replaced by $(x_c, y_c)$ or that (y, w) are extended and replaced by $(y', w')$. The net effect in both cases is that we can assume that we are dealing with an offset-free problem that estimates $x_c$ from $y_c$ or x from $y'$ in a linear (rather than affine) manner.
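The equivalence between the centered and augmented routes can be checked numerically. In the sketch below, the augmented problem is solved through its second-moment normal equations $\mathbb{E}\,y'(y')^T w' = \mathbb{E}\,y'x$ (which follow from minimizing the same MSE in (29.56) and pin down both −θ and w directly); the result matches the centered solution of (29.7). The assumed data-generating model is arbitrary.

```python
# Centering vs. augmentation: both recover the same (w, theta).
import numpy as np

rng = np.random.default_rng(4)
N = 100_000
y = rng.multivariate_normal([1.0, -2.0], [[1.0, 0.2], [0.2, 2.0]], size=N)
x = y @ np.array([0.7, -1.1]) + 0.4 + 0.3 * rng.normal(size=N)

# centered solution: R_y w = r_yx, theta = ybar^T w - xbar
x_bar, y_bar = x.mean(), y.mean(axis=0)
Ry = np.cov(y, rowvar=False)
ryx = ((y - y_bar).T @ (x - x_bar)) / (N - 1)
w_c = np.linalg.solve(Ry, ryx)
theta_c = y_bar @ w_c - x_bar

# augmented solution: prepend 1 to y and solve E[y'(y')^T] w' = E[y' x]
Yp = np.column_stack([np.ones(N), y])
wp = np.linalg.solve(Yp.T @ Yp / N, Yp.T @ x / N)   # w' = [-theta; w]

print(np.round(w_c, 3), round(theta_c, 3))
print(np.round(wp, 3))    # first entry ~ -theta_c, remaining entries ~ w_c
```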

29.3 VECTOR ESTIMATION

The results from Section 29.1 can be easily extended to the case of estimating a vector (as opposed to a scalar) variable x from multiple measurements. Thus, consider a column vector x with entries

$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_M \end{bmatrix}, \qquad \bar{x} = \begin{bmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_M \end{bmatrix}, \qquad (M\times 1) \tag{29.61}$$

We wish to estimate x from multiple observations, say,

$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}, \qquad \bar{y} = \begin{bmatrix} \bar{y}_1 \\ \bar{y}_2 \\ \vdots \\ \bar{y}_N \end{bmatrix}, \qquad (N\times 1) \tag{29.62}$$

The solution to this problem can be deduced from the results of the previous sections. First, according to (29.8a), we express the estimate for an arbitrary mth entry of x from all observations as follows:

$$\hat{x}_m - \bar{x}_m = w_m^T (y - \bar{y}) \tag{29.63}$$

for some column vector $w_m\in\mathbb{R}^N$, and where the variables have been centered around their respective means. The vectors $\{w_m\}$ are then determined by minimizing the aggregate MSE:

$$\{w_m^o\} \;\triangleq\; \operatorname*{argmin}_{\{w_m\in\mathbb{R}^N\}_{m=1}^M}\; \left\{ \sum_{m=1}^M \mathbb{E}\,(x_m - \hat{x}_m)^2 \right\} \tag{29.64}$$

29.3.1 Error Covariance Matrix

We can rewrite the cost in an equivalent form. Assume we collect the estimators into a column vector,

$$\hat{x} = \begin{bmatrix} \hat{x}_1 \\ \hat{x}_2 \\ \vdots \\ \hat{x}_M \end{bmatrix} \tag{29.65}$$

and introduce the error vector:

$$\tilde{x} = x - \hat{x} \tag{29.66}$$

Then, the mean-squared Euclidean norm of $\tilde{x}$ agrees with the cost appearing in (29.64), i.e.,

$$\mathbb{E}\,\|\tilde{x}\|^2 = \sum_{m=1}^M \mathbb{E}\,(x_m - \hat{x}_m)^2 \tag{29.67}$$

If we further let $R_{\tilde{x}}$ denote the covariance matrix of the error vector, namely,

$$R_{\tilde{x}} \;\triangleq\; \mathbb{E}\,\tilde{x}\tilde{x}^T \tag{29.68}$$

then we also have that

$$\mathbb{E}\,\|\tilde{x}\|^2 = \text{Tr}(R_{\tilde{x}}) \tag{29.69}$$

so that, in effect, problem (29.64) is seeking the weight vectors $\{w_m^o\}$ that minimize the trace of the error covariance matrix:

$$\{w_m^o\} \;\triangleq\; \operatorname*{argmin}_{\{w_m\in\mathbb{R}^N\}_{m=1}^M}\; \text{Tr}(R_{\tilde{x}}) \tag{29.70}$$

We refer to the trace of Rx˜ , which appears in the above expression, as the MSE in the vector case. We also refer to the matrix Rx˜ as the MSE matrix. In this way, for vector estimation problems, when we write MSE we may be referring either to the scalar quantity Tr(Rx˜ ) or to the matrix quantity Rx˜ , depending on the context. It is common to use the matrix representation for the MSE in the vector case.

29.3.2 Normal Equations

Continuing with (29.64), we observe that the cost consists of the sum of M nonnegative separable terms, with each term depending on the respective $w_m$. Therefore, we can determine the optimal coefficients $\{w_m, m=1,\ldots,M\}$ by minimizing each term separately:

$$w_m^o = \operatorname*{argmin}_{w_m\in\mathbb{R}^N}\; \mathbb{E}\,(x_m - \hat{x}_m)^2 \tag{29.71}$$

This is the same problem we studied before: estimating a scalar variable $x_m$ from multiple observations $\{y_\ell, \ell=1,\ldots,N\}$. We already know that the solution is given by the normal equations:

$$R_y w_m^o = r_{y x_m}, \qquad m = 1,2,\ldots,M \tag{29.72}$$

where $r_{x_m y}$ is the cross-covariance vector of $x_m$ with y:

$$r_{x_m y} \;\triangleq\; \mathbb{E}\,(x_m - \bar{x}_m)(y - \bar{y})^T \tag{29.73}$$

Moreover, the resulting estimation error satisfies the orthogonality condition:

$$\mathbb{E}\,\tilde{x}_m y^T = 0, \qquad m = 1,\ldots,M \tag{29.74}$$

where $\tilde{x}_m = x_m - \hat{x}_m$. Note that the $\{r_{x_m y}\}$ are the rows of the M × N cross-covariance matrix $R_{xy}$ between the vectors x and y:

$$R_{xy} = \mathbb{E}\,(x-\bar{x})(y-\bar{y})^T = \begin{bmatrix} r_{x_1 y} \\ r_{x_2 y} \\ \vdots \\ r_{x_M y} \end{bmatrix}, \qquad (M\times N) \tag{29.75}$$

Therefore, by collecting the $w_m^o$ from (29.72) as columns into a matrix $W^o$, for all $m = 1,2,\ldots,M$:

$$W^o = \begin{bmatrix} w_1^o & w_2^o & \ldots & w_M^o \end{bmatrix}, \qquad (N\times M) \tag{29.76}$$

and by noting that $R_{yx} = R_{xy}^T$, we find that $W^o$ satisfies the normal equations

$$R_y W^o = R_{yx} \qquad \text{(normal equations)} \tag{29.77}$$

It follows that the optimal estimator is given by

$$\hat{x} - \bar{x} = (W^o)^T (y - \bar{y}) \tag{29.78}$$

or, equivalently,

$$\hat{x} - \bar{x} = R_{xy} R_y^{-1} (y - \bar{y}) \tag{29.79}$$

In view of (29.74), the resulting estimation error vector satisfies the orthogonality condition:

$$\mathbb{E}\,\tilde{x} y^T = 0 \qquad \text{(orthogonality principle)} \tag{29.80}$$

where $\tilde{x} = x - \hat{x}$. The corresponding MMSE matrix is given by

$$\begin{aligned} \text{MMSE} &= \mathbb{E}\,(x-\hat{x})(x-\hat{x})^T \\ &= \mathbb{E}\,(x-\hat{x})(x - \bar{x} + \bar{x} - \hat{x})^T \\ &= \mathbb{E}\,(x-\hat{x})(x-\bar{x})^T \qquad \text{(the cross term vanishes by (29.80))} \\ &= \mathbb{E}\,\big((x-\bar{x}) - (\hat{x}-\bar{x})\big)(x-\bar{x})^T \\ &= \mathbb{E}\,(x-\bar{x})(x-\bar{x})^T - \mathbb{E}\,(\hat{x}-\bar{x})(x-\bar{x})^T \end{aligned} \tag{29.81}$$

and we conclude from (29.79) that

$$\text{MMSE} = R_x - R_{xy} R_y^{-1} R_{yx} \tag{29.82}$$

in terms of the covariance and cross-covariance matrices of x and y.
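The matrix relations (29.77)–(29.82) can be illustrated with a small simulation in which correlated (x, y) pairs are generated from an assumed linear observation model with known moments; the empirical error covariance then matches the MMSE matrix (29.82). All numerical values are illustrative assumptions.

```python
# Vector estimation: W^o from (29.77) and the MMSE matrix (29.82).
import numpy as np

rng = np.random.default_rng(5)
M, L, Ns = 2, 3, 200_000
A = rng.normal(size=(L, M))
sigma2 = 0.04

x = rng.normal(size=(Ns, M))                     # zero-mean, R_x = I
y = x @ A.T + np.sqrt(sigma2) * rng.normal(size=(Ns, L))

Rx = np.eye(M)
Rxy = Rx @ A.T                                   # E x y^T = R_x A^T
Ry = A @ Rx @ A.T + sigma2 * np.eye(L)

Wo = np.linalg.solve(Ry, Rxy.T)                  # (29.77): R_y W^o = R_yx
x_hat = y @ Wo                                   # row-wise (W^o)^T y, cf. (29.78)
mmse = Rx - Rxy @ np.linalg.solve(Ry, Rxy.T)     # (29.82)

emp = (x - x_hat).T @ (x - x_hat) / Ns           # empirical error covariance
print(np.round(mmse, 4))
print(np.round(emp, 4))                          # close to the MMSE matrix
```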

29.4 LINEAR MODELS

We apply the MSE estimation theory of the previous sections to the important case of linear models, which arises often in applications. Thus, assume that zero-mean random vectors {x, y} are related via a linear model of the form:

$$y = Hx + v \tag{29.83}$$

for some N × M known matrix H. We continue to assume that y is N × 1 and x is M × 1 so that we are estimating a vector variable from another vector variable. We explained earlier that the zero-mean assumption is not restrictive since the random variables x and y can be assumed to have been centered. In the above model, the variable v denotes a zero-mean random additive noise vector with known covariance matrix, $R_v = \mathbb{E}\,vv^T$. The covariance matrix of x is also assumed to be known, say, $R_x = \mathbb{E}\,xx^T$. Both {x, v} are uncorrelated with each other, i.e., $\mathbb{E}\,xv^T = 0$, and we further assume that

$$R_x > 0, \qquad R_v > 0 \tag{29.84}$$

Two equivalent representations

According to (29.79), when $R_y > 0$, the LLMSE estimator of x given y is

$$\hat{x} = R_{xy} R_y^{-1} y \tag{29.85}$$

Because of (29.83), the covariances $\{R_{xy}, R_y\}$ can be determined in terms of the linear model parameters $\{H, R_x, R_v\}$. Indeed, the uncorrelatedness of {x, v} gives

$$R_y = \mathbb{E}\,yy^T = \mathbb{E}\,(Hx+v)(Hx+v)^T = HR_xH^T + R_v \tag{29.86}$$
$$R_{xy} = \mathbb{E}\,xy^T = \mathbb{E}\,x(Hx+v)^T = R_xH^T \tag{29.87}$$

Moreover, since $R_v > 0$ we immediately get $R_y > 0$. Expression (29.85) for $\hat{x}$ then becomes

$$\hat{x} = R_xH^T\left( R_v + HR_xH^T \right)^{-1} y \tag{29.88}$$

This expression can be rewritten in an equivalent form by using the matrix inversion formula (1.81). The result states that for arbitrary matrices {A, B, C, D} of compatible dimensions, if A and C are invertible, then

$$(A + BCD)^{-1} = A^{-1} - A^{-1}B(C^{-1} + DA^{-1}B)^{-1}DA^{-1} \tag{29.89}$$

Applying this identity to the matrix $\left( R_v + HR_xH^T \right)^{-1}$ in (29.88), with the identifications

$$A \leftarrow R_v, \qquad B \leftarrow H, \qquad C \leftarrow R_x, \qquad D \leftarrow H^T \tag{29.90}$$

we obtain

$$\begin{aligned} \hat{x} &= R_xH^T\left\{ R_v^{-1} - R_v^{-1}H\big(R_x^{-1} + H^TR_v^{-1}H\big)^{-1}H^TR_v^{-1} \right\} y \\ &= \left\{ R_x\big(R_x^{-1} + H^TR_v^{-1}H\big) - R_xH^TR_v^{-1}H \right\}\big(R_x^{-1} + H^TR_v^{-1}H\big)^{-1}H^TR_v^{-1}y \\ &= \big( R_x^{-1} + H^TR_v^{-1}H \big)^{-1}H^TR_v^{-1}y \end{aligned} \tag{29.91}$$

where in the second equality we factored out $\big(R_x^{-1} + H^TR_v^{-1}H\big)^{-1}H^TR_v^{-1}y$ from the right. Hence,

$$\hat{x} = \left( R_x^{-1} + H^TR_v^{-1}H \right)^{-1}H^TR_v^{-1}y \tag{29.92}$$

This alternative form is useful in several contexts. Observe, for example, that when H happens to be a column vector (i.e., when x is a scalar), the quantity $(R_v + HR_xH^T)$ that appears in (29.88) is a matrix, while the quantity $(R_x^{-1} + H^TR_v^{-1}H)$ that appears in (29.92) is a scalar. In this case, the representation (29.92) leads to a simpler expression for $\hat{x}$. In general, the convenience of using (29.88) or (29.92) depends on the situation at hand. It further follows that the MMSE matrix is given by

$$\begin{aligned} \text{MMSE} = \mathbb{E}\,\tilde{x}\tilde{x}^T &= \mathbb{E}\,xx^T - \mathbb{E}\,\hat{x}\hat{x}^T \\ &= \mathbb{E}\,(x-\hat{x})x^T, \qquad \text{since } \tilde{x}\perp\hat{x} \\ &= R_x - \left( R_x^{-1} + H^TR_v^{-1}H \right)^{-1}H^TR_v^{-1}HR_x \\ &= \left( R_x^{-1} + H^TR_v^{-1}H \right)^{-1} \end{aligned} \tag{29.93}$$

where in the last equality we used the matrix inversion lemma again. That is,

$$\text{MMSE} = \left( R_x^{-1} + H^TR_v^{-1}H \right)^{-1} \tag{29.94}$$

Lemma 29.1. (Equivalent linear estimators) Consider two zero-mean random vectors {x, y} that are related via a linear model of the form y = Hx + v. The variables {x, v} have zero-mean, positive-definite covariance matrices $R_x$ and $R_v$, respectively, and are uncorrelated with each other. The LLMSE estimator of x given y can be computed by either of the following equivalent expressions:

$$\hat{x} = R_xH^T\left( R_v + HR_xH^T \right)^{-1}y \tag{29.95a}$$
$$\hat{x} = \left( R_x^{-1} + H^TR_v^{-1}H \right)^{-1}H^TR_v^{-1}y \tag{29.95b}$$

with the resulting MMSE value given by

$$\text{MMSE} = \left( R_x^{-1} + H^TR_v^{-1}H \right)^{-1} \tag{29.96}$$
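A short check of Lemma 29.1: for arbitrary assumed values of H, $R_x$, $R_v$, and y of compatible dimensions, the two expressions (29.95a) and (29.95b) return the same estimate, and (29.96) gives the MMSE matrix.

```python
# Verify that (29.95a) and (29.95b) coincide for assumed model values.
import numpy as np

rng = np.random.default_rng(6)
L, M = 4, 2
H = rng.normal(size=(L, M))
Rx = np.diag([2.0, 0.5])
Rv = 0.3 * np.eye(L)
y = rng.normal(size=L)

x_hat_a = Rx @ H.T @ np.linalg.solve(Rv + H @ Rx @ H.T, y)           # (29.95a)
x_hat_b = np.linalg.solve(np.linalg.inv(Rx) + H.T @ np.linalg.inv(Rv) @ H,
                          H.T @ np.linalg.inv(Rv) @ y)               # (29.95b)
mmse = np.linalg.inv(np.linalg.inv(Rx) + H.T @ np.linalg.inv(Rv) @ H)

print(np.allclose(x_hat_a, x_hat_b))   # True
print(np.round(mmse, 3))               # (29.96)
```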

29.5 DATA FUSION

We illustrate one application of the theory for linear models to the important problem of fusing information from several sources in order to enhance the accuracy of the estimation process. Thus, assume that we have a collection of N sensors that are distributed over some region in space. All sensors are interested in estimating the same zero-mean vector x with covariance matrix $R_x = \mathbb{E}\,xx^T$. For example, the sensors could be tracking a moving object and their objective is to estimate the speed and direction of motion of the target. Assume each sensor k collects a measurement vector $y_k$ that is related to x via a linear model, say,

$$y_k = H_k x + v_k \tag{29.97}$$

where $H_k$ is the model matrix that maps x to $y_k$ at sensor k and $v_k$ is the zero-mean measurement noise with covariance matrix $R_k = \mathbb{E}\,v_kv_k^T$. In general, the quantity $y_k$ is a vector, say, of size $L_k\times 1$. If the vector x has size M × 1, then $H_k$ is $L_k\times M$. We assume $L_k \ge M$ so that each sensor k has at least as many measurements as the size of the unknown vector x.

29.5.1 Fusing Raw Data

In the first fusion method, each sensor $k$ transmits its measurement vector $y_k$ and its model parameters $\{H_k, R_k, R_x\}$ to a remote fusion center. The latter collects all measurements from across the nodes, $\{y_k, k = 1, 2, \ldots, N\}$, and all model parameters $\{(H_k, R_k), k = 1, 2, \ldots, N\}$. The collected data satisfies the model:

$$ \underbrace{\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}}_{y} = \underbrace{\begin{bmatrix} H_1 \\ H_2 \\ \vdots \\ H_N \end{bmatrix}}_{H} x + \underbrace{\begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_N \end{bmatrix}}_{v} \tag{29.98} $$

which is a linear model of the same form as (29.83). We assume that the noises $\{v_k\}$ across all nodes are uncorrelated with each other so that the covariance matrix of the aggregate noise vector, $v$, is block diagonal:

$$ R_v = \text{blkdiag}\left\{ R_1, R_2, \ldots, R_N \right\} \tag{29.99} $$

Now, using (29.92), the fusion center can determine the estimator of $x$ that is based on all measurements $\{y_k\}$ as follows:

$$ \hat{x} = \left( R_x^{-1} + H^T R_v^{-1} H \right)^{-1} H^T R_v^{-1} y \tag{29.100} $$

The resulting MMSE would be

$$ P = \left( R_x^{-1} + H^T R_v^{-1} H \right)^{-1} \tag{29.101} $$


where we are denoting the MMSE by the letter $P$; it is a matrix of size $M \times M$. We note that we are deliberately using form (29.92) for the estimator of $x$ because, as we are going to see shortly, this form will allow us to derive a second, more efficient fusion method. The solution method (29.100) requires all sensors to transmit $\{y_k, H_k, R_k\}$ to the fusion center; this amounts to a total of

$$ L_k + L_k M + \tfrac{1}{2} L_k^2 \ \text{ entries to be transmitted per sensor} \tag{29.102} $$

where the last term ($L_k^2/2$) arises from the transmission of (half of) the entries of the $L_k \times L_k$ symmetric matrix $R_k$.

29.5.2 Fusing Processed Data

In the fusion method (29.100), the fusion center fuses the raw data $\{y_k, H_k, R_k\}$ that is collected at the sensors. A more efficient data fusion method is possible and leads to a reduction in the amount of communication resources that are necessary between the sensors and the fusion center. This alternative method is based on the sensors performing some local processing first and then sharing the results of the processing step with the fusion center. The two fusion modes are illustrated in Fig. 29.2.

Figure 29.2 Two modes for data fusion: (a) on the left, the sensors share their raw data $\{y_k, H_k, R_k\}$ with the fusion center; (b) on the right, the sensors share their local estimation results $\{\hat{x}_k, P_k\}$ with the fusion center.

Specifically, assume that each node estimates $x$ using its own data $y_k$. We denote the resulting estimator by $\hat{x}_k$ and it is given by:

$$ \hat{x}_k = \left( R_x^{-1} + H_k^T R_k^{-1} H_k \right)^{-1} H_k^T R_k^{-1} y_k \tag{29.103} $$


The corresponding MMSE is

$$ P_k = \left( R_x^{-1} + H_k^T R_k^{-1} H_k \right)^{-1} \tag{29.104} $$

We assume now that the nodes share the processed data $\{\hat{x}_k, P_k\}$ with the fusion center rather than the raw data $\{y_k, H_k, R_k\}$. It turns out that the desired global quantities $\{\hat{x}, P\}$ can be recovered from these processed data. To begin with, observe that we can rework expression (29.101) for the global MMSE as follows:

$$
\begin{aligned}
P^{-1} &= R_x^{-1} + H^T R_v^{-1} H \\
&= R_x^{-1} + \sum_{k=1}^{N} H_k^T R_k^{-1} H_k \\
&= \sum_{k=1}^{N} \left( R_x^{-1} + H_k^T R_k^{-1} H_k \right) - (N-1) R_x^{-1} \\
&= \sum_{k=1}^{N} P_k^{-1} - (N-1) R_x^{-1}
\end{aligned} \tag{29.105}
$$

This expression allows the fusion center to determine $P^{-1}$ directly from knowledge of the quantities $\{P_k^{-1}, R_x^{-1}\}$. Note further that the global expression (29.100) can be rewritten as

$$ P^{-1} \hat{x} = H^T R_v^{-1} y \tag{29.106} $$

which, using $H$ and $R_v$ from (29.98)–(29.99), leads to

$$ P^{-1} \hat{x} = \sum_{k=1}^{N} H_k^T R_k^{-1} y_k = \sum_{k=1}^{N} P_k^{-1} \hat{x}_k \tag{29.107} $$

Therefore, we arrive at the following conclusion to fuse the data from multiple sensors.

Lemma 29.2. (Data fusion) Consider a collection of $N$ linear measurements of the form $y_k = H_k x + v_k$, where $x$ and $v_k$ have zero-mean, positive-definite covariance matrices $R_x$ and $R_k$, respectively, and are uncorrelated with each other. Let $\hat{x}$ denote the LLMSE estimator of $x$ given the $N$ observations $\{y_1, y_2, \ldots, y_N\}$ with error covariance matrix $P = E\, \tilde{x}\tilde{x}^T$. Let $\hat{x}_k$ denote the LLMSE estimator of the same $x$ given only $y_k$ with error covariance matrix $P_k = E\, \tilde{x}_k \tilde{x}_k^T$. It holds that

$$ P^{-1} = P_1^{-1} + P_2^{-1} + \cdots + P_N^{-1} - (N-1) R_x^{-1} \tag{29.108a} $$
$$ P^{-1} \hat{x} = P_1^{-1} \hat{x}_1 + P_2^{-1} \hat{x}_2 + \cdots + P_N^{-1} \hat{x}_N \tag{29.108b} $$

Observe from (29.108b) that the individual estimators are scaled by the inverses of their MMSE matrices, so that more accurate estimators are given more weight. In this method, the sensors need to send to the fusion center the processed information $\{\hat{x}_k, P_k\}$. This amounts to a total of

$$ M + M^2/2 \ \text{ entries to be transmitted per node} \tag{29.109} $$

which is smaller than (29.102) given that $L_k \geq M$.
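The following sketch (with hypothetical sensor sizes and randomly generated model matrices, not taken from the text) simulates $N$ sensors and checks numerically that fusing the locally processed pairs $\{\hat{x}_k, P_k\}$ via (29.108a)–(29.108b) reproduces the centralized raw-data estimator (29.100)–(29.101).

```python
# Numerical check that distributed fusion of {xhat_k, P_k} matches centralized fusion
# of the raw data; all sizes and matrices below are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
M, L, N = 2, 4, 3                               # state size, measurements per sensor, number of sensors
Rx = np.diag([2.0, 1.0])                        # covariance of x
x = rng.multivariate_normal(np.zeros(M), Rx)    # one realization of the unknown

Hs, Rs, ys = [], [], []
for k in range(N):
    Hk = rng.standard_normal((L, M))            # model matrix at sensor k
    Rk = np.diag(rng.uniform(0.5, 2.0, L))      # noise covariance at sensor k
    vk = rng.multivariate_normal(np.zeros(L), Rk)
    Hs.append(Hk); Rs.append(Rk); ys.append(Hk @ x + vk)

# Centralized fusion of the raw data, eqs. (29.100)-(29.101)
H = np.vstack(Hs)
y = np.concatenate(ys)
Rv = np.zeros((N * L, N * L))                   # block-diagonal aggregate noise covariance, eq. (29.99)
for k in range(N):
    Rv[k*L:(k+1)*L, k*L:(k+1)*L] = Rs[k]
P = np.linalg.inv(np.linalg.inv(Rx) + H.T @ np.linalg.solve(Rv, H))
xhat = P @ H.T @ np.linalg.solve(Rv, y)

# Distributed fusion of the processed data {xhat_k, P_k}, eqs. (29.103)-(29.104) and (29.108a)-(29.108b)
P_inv = -(N - 1) * np.linalg.inv(Rx)
weighted = np.zeros(M)
for Hk, Rk, yk in zip(Hs, Rs, ys):
    Pk = np.linalg.inv(np.linalg.inv(Rx) + Hk.T @ np.linalg.solve(Rk, Hk))
    xk = Pk @ Hk.T @ np.linalg.solve(Rk, yk)
    P_inv += np.linalg.inv(Pk)
    weighted += np.linalg.inv(Pk) @ xk
P_fused = np.linalg.inv(P_inv)
xhat_fused = P_fused @ weighted

print(np.allclose(P, P_fused), np.allclose(xhat, xhat_fused))   # True True
```

Note how each sensor only ships $\{\hat{x}_k, P_k\}$, in agreement with the count (29.109).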

29.6 MINIMUM-VARIANCE UNBIASED ESTIMATION

In the previous section we examined the linear model (29.83) where the unknown, $x$, is modeled as a random variable with covariance matrix, $R_x$. We encountered one instance of this model earlier in Example 29.1, which dealt with the problem of estimating a zero-mean scalar random variable, $x$, from a collection of noisy measurements, $\{y_1, y_2, \ldots, y_N\}$. The model relating the variables in that example is a special case of (29.83) since it amounts to

$$ \underbrace{\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}}_{\triangleq\, y} = \underbrace{\begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}}_{\triangleq\, H} x + \underbrace{\begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_N \end{bmatrix}}_{\triangleq\, v} \tag{29.110} $$

where

$$ H = \text{col}\{1, 1, \ldots, 1\} \tag{29.111a} $$
$$ R_x = \sigma_x^2 \tag{29.111b} $$
$$ R_v = \sigma_v^2 I_N \tag{29.111c} $$

Note that, for generality, we are using generic values for $\sigma_x^2$ and $\sigma_v^2$ rather than setting them equal to 1, as was the case in Example 29.1. We can recover the solution (29.38) by appealing to expression (29.92), which gives

$$ \hat{x} = \left( \frac{1}{\sigma_x^2} + \frac{N}{\sigma_v^2} \right)^{-1} \frac{1}{\sigma_v^2} \sum_{\ell=1}^{N} y_\ell = \frac{1}{N + \frac{1}{\text{SNR}}} \sum_{\ell=1}^{N} y_\ell \tag{29.112} $$

in terms of the signal-to-noise ratio (SNR) defined as

$$ \text{SNR} \triangleq \sigma_x^2 / \sigma_v^2 \tag{29.113} $$
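As a quick illustration of (29.112)–(29.113), the short sketch below (with hypothetical values for $N$, $\sigma_x^2$, and $\sigma_v^2$) computes the estimator and prints it alongside the sample mean; the two coincide only as the SNR grows.

```python
# Hypothetical illustration of estimator (29.112): it shrinks the sample mean toward zero
# by the factor N/(N + 1/SNR).
import numpy as np

rng = np.random.default_rng(2)
N, sigma_x2, sigma_v2 = 20, 1.0, 4.0
snr = sigma_x2 / sigma_v2                      # SNR = sigma_x^2 / sigma_v^2, eq. (29.113)

x = rng.normal(0.0, np.sqrt(sigma_x2))         # the random variable being estimated
y = x + rng.normal(0.0, np.sqrt(sigma_v2), N)  # N noisy measurements of x

xhat = np.sum(y) / (N + 1.0 / snr)             # LLMSE estimator (29.112)
sample_mean = np.mean(y)                       # recovered from (29.112) only as SNR -> infinity
print(xhat, sample_mean)
```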

In (29.110), the variable $x$ is assumed to have been selected at random and, subsequently, $N$ noisy measurements of this same value are collected. The observations are used to estimate $x$ according to (29.112). Observe that the solution does not correspond to computing the sample mean of the observations. Expression (29.112) would reduce to the sample mean only when $\text{SNR} \to \infty$.

We now consider an alternative formulation for estimating the unknown by modeling it as a deterministic unknown constant, say, $x$, rather than a random quantity, $\boldsymbol{x}$. For this purpose, we replace the earlier linear model (29.83) by one of the form:

$$ y = Hx + v \tag{29.114} $$

where, compared with (29.83), we are replacing the boldface letter $\boldsymbol{x}$ by the normal letter $x$ (remember that we reserve the boldface notation for random variables). The observation vector $y$ in (29.114) continues to be random since the disturbance $v$ is random. Any estimator for $x$ that is based on $y$ will also be a random variable itself. Given model (29.114), we now study the problem of designing an optimal linear estimator for $x$ of the form

$$ \hat{x} = W^T y \tag{29.115} $$

for some coefficient matrix $W^T \in \mathbb{R}^{M \times N}$ to be determined. It will turn out that, for such problems, $W$ is found by solving a constrained LMSE estimation problem, as opposed to the unconstrained estimation problem (29.70).

29.6.1 Problem Formulation

Thus, consider a zero-mean random noise variable $v$ with a positive-definite covariance matrix $R_v = E\, vv^T > 0$, and let $y$ be a noisy measurement of $Hx$ according to model (29.114), where $x$ is the unknown constant vector that we wish to estimate. The dimensions of the data matrix $H$ are denoted by $N \times M$ and it is assumed that $N \geq M$ and that $H$ has full rank:

$$ \text{rank}(H) = M, \qquad N \geq M \tag{29.116} $$

That is, $H$ is a tall matrix so that the number of available measurements is at least as many as the number of unknown entries in $x$. The full-rank condition on $H$ guarantees that the matrix product $H^T R_v^{-1} H$ is positive-definite; recall result (1.59). The inverse of this matrix product will appear in the expression for the estimator.

We are interested in determining a linear estimator for $x$ of the form $\hat{x} = W^T y$. The choice of $W$ should satisfy two conditions:

(a) (Unbiasedness). That is, we must guarantee $E\, \hat{x} = x$, which is the same as $W^T E\, y = x$. But from (29.114) we have $E\, y = Hx$, so that $W$ must satisfy $W^T H x = x$, no matter what the value of $x$ is. This condition means that $W$ should satisfy

$$ W^T H = I_M \tag{29.117} $$

(b) (Optimality). The choice of $W$ should minimize the trace of the covariance matrix of the estimation error, $\tilde{x} = x - \hat{x}$. Using the condition $W^T H = I_M$, we find that

$$ \hat{x} = W^T y = W^T (Hx + v) = W^T H x + W^T v = x + W^T v \tag{29.118} $$


so that $\tilde{x} = -W^T v$. This means that the error covariance matrix, as a function of $W$, is given by

$$ E\, \tilde{x}\tilde{x}^T = W^T R_v W \tag{29.119} $$

Combining (29.117) and (29.119), we conclude that the desired $W$ can be found by solving the following constrained optimization problem:

$$ W^o \triangleq \operatorname*{argmin}_{W} \ \text{Tr}\!\left( W^T R_v W \right), \quad \text{subject to } W^T H = I_M, \ R_v > 0 \tag{29.120} $$

The estimator $\hat{x} = (W^o)^T y$ that results from the solution of (29.120) is known as the minimum-variance unbiased estimator, or MVUE for short. It is also sometimes called the best linear unbiased estimator, or BLUE.

Example 29.3 (Guessing the solution) Let us first try to guess the form of the solution to the constrained problem (29.120) by appealing to the solution of the LLMSE estimation problem (29.83). In that formulation, the unknown, $x$, is modeled as a random variable with covariance matrix $R_x$. From expression (29.92), we know that the linear estimator is given by

$$ \hat{x} = (R_x^{-1} + H^T R_v^{-1} H)^{-1} H^T R_v^{-1} y \tag{29.121} $$

Now assume that the covariance matrix of $x$ has the particular form $R_x = \sigma_x^2 I$, with a sufficiently large positive scalar $\sigma_x^2$ (i.e., $\sigma_x^2 \to \infty$). That is, assume that the variance of each of the entries of $x$ is "infinitely" large. In this way, the variable $x$ can be "regarded" as playing the role of some unknown constant vector, $x$. Then, the above expression for $\hat{x}$ reduces to

$$ \hat{x} = (H^T R_v^{-1} H)^{-1} H^T R_v^{-1} y \tag{29.122} $$

This conclusion suggests that the choice

$$ (W^o)^T = (H^T R_v^{-1} H)^{-1} H^T R_v^{-1} \tag{29.123} $$

should solve the problem of estimating the unknown constant vector $x$ for model (29.114). We establish this result more formally next.

29.6.2 Gauss–Markov Theorem

Result (29.123) is a manifestation of the Gauss–Markov theorem.


Theorem 29.3. (Gauss–Markov theorem) Consider a linear model of the form $y = Hx + v$, where $x$ is an unknown constant, $v$ has zero mean and covariance matrix $R_v > 0$, and $H$ is a tall full-rank matrix (with at least as many rows as columns). The MVUE for $x$, the one that solves (29.120), is given by

$$ \hat{x}_{\text{MVUE}} = (H^T R_v^{-1} H)^{-1} H^T R_v^{-1} y \tag{29.124} $$

Equivalently, the optimal $W$ in (29.120) is

$$ (W^o)^T = (H^T R_v^{-1} H)^{-1} H^T R_v^{-1} \tag{29.125} $$

Moreover, the resulting MMSE is

$$ R_{\tilde{x}} \triangleq E\, \tilde{x}\tilde{x}^T = (H^T R_v^{-1} H)^{-1} \tag{29.126} $$

Proof: Let $J(W) = W^T R_v W$ denote the cost function that appears in (29.120). Some straightforward algebra shows that $J(W)$ can be expressed as

$$ J(W) = (W - W^o)^T R_v (W - W^o) + (W^o)^T R_v W^o \tag{29.127} $$

This is because, using $W^T H = I$,

$$ W^T R_v W^o = W^T R_v \left[ R_v^{-1} H (H^T R_v^{-1} H)^{-1} \right] = W^T H (H^T R_v^{-1} H)^{-1} = (H^T R_v^{-1} H)^{-1} \tag{29.128} $$

Likewise, $(W^o)^T R_v W^o = (H^T R_v^{-1} H)^{-1}$. Relation (29.127) expresses the cost $J(W)$ as the sum of two nonnegative-definite terms: one is independent of $W$ and is equal to $(W^o)^T R_v W^o$, while the other is dependent on $W$. It is then clear, since $R_v > 0$, that the trace of the cost is minimized by choosing $W = W^o$. Note further that the matrix $W^o$ satisfies the constraint $(W^o)^T H = I_M$. ∎
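A small Monte Carlo sketch (with a hypothetical model matrix, noise covariance, and unknown vector, not taken from the text) can be used to check the two claims of the theorem empirically: the estimator (29.124) is unbiased and its empirical error covariance approaches (29.126).

```python
# Monte Carlo check of the Gauss-Markov theorem under hypothetical data.
import numpy as np

rng = np.random.default_rng(3)
N, M, trials = 8, 3, 20000
H = rng.standard_normal((N, M))                  # tall, full-rank model matrix
Rv = np.diag(rng.uniform(0.5, 2.0, N))           # positive-definite noise covariance
x = np.array([1.0, -2.0, 0.5])                   # unknown constant vector (hypothetical)

# (W^o)^T = (H^T Rv^{-1} H)^{-1} H^T Rv^{-1}, eq. (29.125)
Wo_T = np.linalg.inv(H.T @ np.linalg.solve(Rv, H)) @ H.T @ np.linalg.inv(Rv)

errors = np.empty((trials, M))
for t in range(trials):
    v = rng.multivariate_normal(np.zeros(N), Rv)
    errors[t] = x - Wo_T @ (H @ x + v)           # estimation error x - xhat_MVUE

print(errors.mean(axis=0))                       # close to zero: the estimator is unbiased
print(np.cov(errors.T))                          # close to (H^T Rv^{-1} H)^{-1}, eq. (29.126)
print(np.linalg.inv(H.T @ np.linalg.solve(Rv, H)))
```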

Example 29.4 (Sample mean estimator) Let us reconsider problem (29.110) where $x$ is now modeled as an unknown constant, i.e., we now write

$$ \underbrace{\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}}_{\triangleq\, y} = \underbrace{\begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}}_{\triangleq\, H} x + \underbrace{\begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_N \end{bmatrix}}_{\triangleq\, v} \tag{29.129} $$

where the boldface $\boldsymbol{x}$ is replaced by $x$. In this case, the value of $x$ can be regarded as the mean value of each measurement, $y_\ell$. Using expression (29.124) we find that the MVUE for $x$ is given by (compare with (29.112)):

$$ \hat{x}_{\text{MVUE}} = \frac{1}{N} \sum_{\ell=1}^{N} y_\ell \tag{29.130} $$

which is simply the sample mean estimator that we are familiar with from introductory courses on statistics.

29.7 COMMENTARIES AND DISCUSSION

Linear estimation. In this chapter we covered the basics of LMSE regression analysis and highlighted only concepts that are relevant to the subject matter of the book, motivated by the presentations in Sayed (2003, 2008) and Kailath, Sayed, and Hassibi (2000). The pioneering work in this domain was done independently by the Russian mathematician Andrey Kolmogorov (1903–1987) in the works by Kolmogorov (1939, 1941a, b) and the American mathematician Norbert Wiener (1894–1964) in the work by Wiener (1949); the latter reference was originally published in 1942 as a classified report during World War II. Kolmogorov was motivated by the work of Wold (1938) on stationary processes and solved a linear prediction problem for discrete-time stationary random processes. Wiener, on the other hand, solved a continuous-time prediction problem under causality constraints by means of an elegant technique now known as the Wiener–Hopf technique, introduced in Wiener and Hopf (1931). Readers interested in more details about Wiener's contribution, and linear estimation theory in general, may consult the textbook by Kailath, Sayed, and Hassibi (2000).

Unbiased estimators. In the LMSE estimation problems studied here, the estimators were required to be unbiased. Sometimes, unbiasedness can be a hurdle to minimizing the MSE. This is because there are estimators that are biased but that can achieve smaller error variances than unbiased estimators; see, e.g., Rao (1973), Cox and Hinkley (1974), and Kendall and Stuart (1976–1979). Two interesting examples to this effect are the following, given in Sayed (2008): the first example is from Kay (1993, pp. 310–311) while the second example is from Rao (1973). In the derivation leading to (29.130) we studied the problem of estimating the mean value, $x$, of $N$ measurements $\{y_\ell\}$. The MVUE for $x$ was seen to be given by the sample mean estimator:

$$ \hat{x}_{\text{MVUE}} = \frac{1}{N} \sum_{\ell=1}^{N} y_\ell \tag{29.131} $$

The value of $x$ was not restricted in any way; it was only assumed to be an unknown constant that could assume any value in the interval $(-\infty, \infty)$. But what if we know beforehand that $x$ is limited to some interval, say $[-\alpha, \alpha]$ for some finite $\alpha > 0$? One way to incorporate this piece of information into the design of an estimator for $x$ is to consider the following alternative construction:

$$ \hat{x} = \begin{cases} -\alpha, & \text{if } \hat{x}_{\text{MVUE}} < -\alpha \\ \hat{x}_{\text{MVUE}}, & \text{if } -\alpha \leq \hat{x}_{\text{MVUE}} \leq \alpha \\ \alpha, & \text{if } \hat{x}_{\text{MVUE}} > \alpha \end{cases} \tag{29.132} $$

in terms of a realization for $\hat{x}_{\text{MVUE}}$. In this way, $\hat{x}$ will always assume values within $[-\alpha, \alpha]$. A calculation in Kay (1993) shows that although the above (truncated mean) estimator $\hat{x}$ is biased, it nevertheless satisfies $E\,(x - \hat{x})^2 < E\,(x - \hat{x}_{\text{MVUE}})^2$; see Prob. 29.5. In other words, the truncated mean estimator results in a smaller MSE.

A second classical example from the realm of statistics is the variance estimator. In this case, the parameter to be estimated is the variance of a random variable $y$ given access to several observations of it, say $\{y_\ell\}$. Let $\sigma_y^2$ denote the variance of $y$. Two well-known estimators for $\sigma_y^2$ are

$$ \widehat{\sigma}_y^2 = \frac{1}{N-1} \sum_{\ell=1}^{N} (y_\ell - \bar{y})^2 \qquad \text{and} \qquad \check{\sigma}_y^2 = \frac{1}{N+1} \sum_{\ell=1}^{N} (y_\ell - \bar{y})^2 \tag{29.133} $$


where $\bar{y} = \frac{1}{N} \sum_{\ell=1}^{N} y_\ell$ is the sample mean. The first one is unbiased while the second one is biased. However, it is shown in Rao (1973) that

$$ E\, \big( \sigma_y^2 - \check{\sigma}_y^2 \big)^2 < E\, \big( \sigma_y^2 - \widehat{\sigma}_y^2 \big)^2 \tag{29.134} $$

We therefore see that biased estimators can result in smaller MSEs. However, unbiasedness is often a desirable property in practice since it guarantees that, on average, the estimator agrees with the unknown quantity that we seek to estimate.

Gauss–Markov theorem. Theorem 29.3 characterizes unbiased linear estimators of smallest error variance (or covariance), also known as BLUE estimators. Given an unknown $x \in \mathbb{R}^M$ and a random observation $y \in \mathbb{R}^N$, the theorem was obtained by solving the constrained optimization problem (29.120), namely,

$$ \min_{W} \ E\, \|\tilde{x}\|^2, \quad \text{subject to } \hat{x} = W^T y, \ y = Hx + v, \ W^T H = I_M, \ E\, vv^T = R_v \tag{29.135} $$

The significance of the solution (29.124) is perhaps best understood if we recast it in the context of least-squares problems. We will study such problems in greater detail in Chapter 50; see also Sayed (2008). For now, let us assume that $R_v = \sigma_v^2 I_N$, so that the noise components are uncorrelated with each other and have equal variances. Then, expression (29.124) for the estimate of $x$ reduces to

$$ \hat{x} = (H^T H)^{-1} H^T y \tag{29.136} $$

We are going to see later that this expression can be interpreted as the solution to the following least-squares problem. Given a noisy deterministic observation vector $y$ satisfying

$$ y = Hx + v \tag{29.137} $$

the unknown $x$ can be estimated by solving

$$ \hat{x} = \operatorname*{argmin}_{x \in \mathbb{R}^M} \ \|y - Hx\|^2 \tag{29.138} $$

in terms of the squared Euclidean norm of the difference $y - Hx$. Indeed, if we expand the squared error we get

$$ J(x) \triangleq \|y - Hx\|^2 = \|y\|^2 - 2 y^T H x + x^T H^T H x \tag{29.139} $$

Setting the gradient vector relative to $x$ to zero at $\hat{x}$ gives

$$ \nabla_x J(x) \Big|_{x = \hat{x}} = 2 \hat{x}^T H^T H - 2 y^T H = 0 \ \Longrightarrow \ \hat{x} = (H^T H)^{-1} H^T y \tag{29.140} $$

which is the same expression we had in (29.136). We therefore find that the standard least-squares estimate for $x$ coincides with the BLUE estimate. More generally, consider a weighted least-squares problem of the form

$$ \hat{x} = \operatorname*{argmin}_{x \in \mathbb{R}^M} \ (y - Hx)^T R (y - Hx) \tag{29.141} $$

where $R > 0$ denotes some weighting matrix. Differentiating again and solving for $\hat{x}$ gives

$$ \hat{x} = (H^T R H)^{-1} H^T R y \tag{29.142} $$

Comparing with (29.124) we find that the weighted least-squares solution would agree with the BLUE estimate if we select the weighting matrix as $R = R_v^{-1}$ (i.e., as the inverse of the noise covariance matrix). Therefore, the Gauss–Markov theorem is essentially stating that the least-squares solution leads to the BLUE if the weighting matrix is matched with $R_v^{-1}$.
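The following sketch (with hypothetical data) illustrates this equivalence numerically: solving the weighted least-squares problem (29.141) with $R = R_v^{-1}$, here by pre-whitening with a Cholesky factor of $R_v$ and calling an ordinary least-squares solver, returns the same estimate as the BLUE formula (29.124).

```python
# Weighted least squares with R = Rv^{-1} versus the BLUE formula; data are hypothetical.
import numpy as np

rng = np.random.default_rng(4)
N, M = 10, 3
H = rng.standard_normal((N, M))
B = rng.standard_normal((N, N))
Rv = B @ B.T + N * np.eye(N)                       # colored-noise covariance (positive-definite)
y = H @ np.array([0.5, -1.0, 2.0]) + rng.multivariate_normal(np.zeros(N), Rv)

# BLUE formula (29.124)
x_blue = np.linalg.solve(H.T @ np.linalg.solve(Rv, H), H.T @ np.linalg.solve(Rv, y))

# Weighted LS (29.141) with R = Rv^{-1}: whiten with Rv = C C^T, then ordinary least squares
C = np.linalg.cholesky(Rv)
Hw = np.linalg.solve(C, H)                         # C^{-1} H
yw = np.linalg.solve(C, y)                         # C^{-1} y
x_wls, *_ = np.linalg.lstsq(Hw, yw, rcond=None)

print(np.allclose(x_blue, x_wls))                  # True
```

The pre-whitening step works because $(y - Hx)^T R_v^{-1} (y - Hx) = \|C^{-1}(y - Hx)\|^2$ when $R_v = CC^T$.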


The original version of the Gauss–Markov theorem with $R_v = \sigma_v^2 I_N$ is due to the German mathematician Carl Friedrich Gauss (1777–1855) and the Russian mathematician Andrey Markov (1856–1922), who published versions of the result in Gauss (1821) and Markov (1912). The extension to the case of arbitrary covariance matrices $R_v$ was given by Aitken (1935); see also the overview by Plackett (1949, 1950). The discussion in Section 29.6 on the Gauss–Markov theorem, and the leading Sections 29.4 and 29.5 on linear models and data fusion, are adapted from the discussion in Kailath, Sayed, and Hassibi (2000).

PROBLEMS

(Some problems in this section are adapted from exercises in Sayed (2003, 2008) and Kailath, Sayed, and Hassibi (2000).)

29.1 Show that the LLMSE estimator defined by (29.70) also minimizes the determinant of the error covariance matrix, $\det(R_{\tilde{x}})$.

29.2 Show that the LLMSE estimator defined by (29.70) also minimizes $E\, \tilde{x}^T W \tilde{x}$ for any $W \geq 0$.

29.3 All variables are zero mean. Show that for any three random variables $\{x, y, z\}$ it holds that

$$ \hat{x}_{y,z} = \hat{x}_y + \widehat{(\tilde{x}_y)}_{\tilde{z}_y} $$

where
$\hat{x}_{y,z}$ = LLMSE estimator of $x$ given $\{z, y\}$;
$\hat{x}_y$ = LLMSE estimator of $x$ given $y$;
$\hat{z}_y$ = LLMSE estimator of $z$ given $y$;
$\tilde{x}_y = x - \hat{x}_y$;
$\tilde{z}_y = z - \hat{z}_y$;
$\widehat{(\tilde{x}_y)}_{\tilde{z}_y}$ = LLMSE estimator of $\tilde{x}_y$ given $\tilde{z}_y$.
What is the geometric interpretation of this result?

29.4 Verify that the MSE values that correspond to the estimators $\{\hat{x}^o, \hat{x}^\bullet\}$ defined by (29.149a)–(29.149b) coincide.

29.5 Refer to the truncated mean estimator (29.132). Show that it results in a smaller MSE, namely, $E\,(x - \hat{x})^2 < E\,(x - \hat{x}_{\text{MVUE}})^2$.

29.6 Let $\{x, y\}$ denote two zero-mean random variables with positive-definite covariance matrices $\{R_x, R_y\}$. Let $\hat{x}$ denote the LLMSE estimator of $x$ given $y$. Likewise, let $\hat{y}$ denote the LLMSE estimator of $y$ given $x$. Introduce the estimation errors $\tilde{x} = x - \hat{x}$ and $\tilde{y} = y - \hat{y}$, and denote their covariance matrices by $R_{\tilde{x}}$ and $R_{\tilde{y}}$, respectively.
(a) Show that $R_{\tilde{x}}^{-1} \hat{x} = R_x^{-1} R_{xy} R_{\tilde{y}}^{-1} y$.
(b) Assume $\{y, x\}$ are related via a linear model of the form $y = Hx + v$, where $H$ is a matrix of appropriate dimensions while $v$ has zero mean with covariance matrix $R_v$ and is uncorrelated with $x$. Verify that the identity of part (a) reduces to $R_{\tilde{x}}^{-1} \hat{x} = H^T R_v^{-1} y$.

29.7 Let $x$ be a zero-mean random variable with an $M \times M$ positive-definite covariance matrix $R_x$. Let $\hat{x}_1$ denote the LLMSE estimator of $x$ given a zero-mean observation $y_1$. Likewise, let $\hat{x}_2$ denote the LLMSE estimator of the same variable $x$ given a second zero-mean observation $y_2$. That is, we have two separate estimators for $x$ from two separate sources. Let $P_1$ and $P_2$ denote the corresponding error covariance matrices: $P_1 = E\,\tilde{x}_1\tilde{x}_1^T$ and $P_2 = E\,\tilde{x}_2\tilde{x}_2^T$, where $\tilde{x}_j = x - \hat{x}_j$. Assume $P_1 > 0$ and $P_2 > 0$ and that the cross-covariance matrix

$$ E \begin{bmatrix} x \\ y_1 \end{bmatrix} \begin{bmatrix} x \\ y_2 \end{bmatrix}^T $$

has rank $M$.
(a) Show that the LLMSE estimator of $x$ given both $\{y_1, y_2\}$, denoted by $\hat{x}$, satisfies $P^{-1}\hat{x} = P_1^{-1}\hat{x}_1 + P_2^{-1}\hat{x}_2$, where $P$ denotes the resulting error covariance matrix and is given by $P^{-1} = P_1^{-1} + P_2^{-1} - R_x^{-1}$.
(b) Assume $\{y_1, x\}$ and $\{y_2, x\}$ are related via linear models of the form $y_1 = H_1 x + v_1$ and $y_2 = H_2 x + v_2$, where $\{v_1, v_2\}$ have zero means with covariance matrices $\{R_{v_1}, R_{v_2}\}$ and are uncorrelated with each other and with $x$. Verify that this situation satisfies the required rank-deficiency condition and conclude that the estimator of $x$ given $\{y_1, y_2\}$ is given by the expression in part (a).

29.8 Let $y_1 = H_1 x + v_1$ and $y_2 = H_2 x + v_2$ denote two linear observation models with the same unknown random vector $x$. All random variables have zero mean. The covariance and cross-covariance matrices of $\{x, v_1, v_2\}$ are denoted by

$$ E \begin{bmatrix} x \\ v_1 \\ v_2 \end{bmatrix} \begin{bmatrix} x \\ v_1 \\ v_2 \end{bmatrix}^T = \begin{bmatrix} R_x & 0 & 0 \\ 0 & R_1 & C \\ 0 & C^T & R_2 \end{bmatrix} $$

In particular, observe that we are assuming the noises to be correlated with $C = E\, v_1 v_2^T$. All covariance matrices are assumed to be invertible whenever necessary.
(a) Show how you would replace the observation vectors $\{y_1, y_2\}$ by two other observation vectors $\{z_1, z_2\}$ of similar dimensions such that they satisfy linear models of the form

$$ z_1 = G_1 x + w_1, \qquad z_2 = G_2 x + w_2 $$

for some matrices $G_1$ and $G_2$ to be specified, and where the noises $\{w_1, w_2\}$ are now uncorrelated. What are the covariance matrices of $w_1$ and $w_2$ in terms of $R_1$ and $R_2$?
(b) Let $\hat{x}_1$ be the LLMSE estimator of $x$ given $z_1$ with error covariance matrix $P_1$. Similarly, let $\hat{x}_2$ be the LLMSE estimator of $x$ given $y_2$ with error covariance matrix $P_2$. Let further $\hat{x}$ denote the LLMSE estimator of $x$ given $\{y_1, y_2\}$ with error covariance matrix $P$. Determine expressions for $\hat{x}$ and $P$ in terms of $\{\hat{x}_1, \hat{x}_2, P_1, P_2, C, R_x, R_1, R_2\}$.

29.9 Let $y = Hx + v$. All random variables have zero mean. The covariance and cross-covariance matrices of $\{x, v\}$ are denoted by

$$ E \begin{bmatrix} x \\ v \end{bmatrix} \begin{bmatrix} x \\ v \end{bmatrix}^T = \begin{bmatrix} R_x & C \\ C^T & R_v \end{bmatrix} $$

with positive-definite $R_x$ and $R_v$.
(a) What is the LLMSE estimator of $x$ given $y$? What is the corresponding MMSE?
(b) A new scalar observation, $\alpha$, is added to $y$ and a new row vector is added to $H$ such that

$$ \begin{bmatrix} y \\ \alpha \end{bmatrix} = \begin{bmatrix} H \\ h^T \end{bmatrix} x + \begin{bmatrix} v \\ n \end{bmatrix} $$

where $h^T$ is a row vector and $n$ is uncorrelated with all other variables and has variance $\sigma^2$. Let $\hat{x}_{\text{new}}$ denote the new estimator for $x$ given $\{y, \alpha\}$. Relate $\hat{x}_{\text{new}}$ to $\hat{x}$ from part (a). Relate also their MMSE values.

29.10 All variables are zero mean. Let

$$ \begin{bmatrix} y_a \\ y \\ y_b \end{bmatrix} = \begin{bmatrix} H_a \\ H \\ H_b \end{bmatrix} x + \begin{bmatrix} v_a \\ v \\ v_b \end{bmatrix} $$

where $\{v_a, v, v_b\}$ are uncorrelated with $x$ and have zero mean and covariance matrices:

$$ E \begin{bmatrix} v_a \\ v \\ v_b \end{bmatrix} \begin{bmatrix} v_a \\ v \\ v_b \end{bmatrix}^T = \begin{bmatrix} R_a & S_a & 0 \\ S_a^T & R & S_b \\ 0 & S_b^T & R_b \end{bmatrix} $$

Let $\hat{x}_{y_a,y}$ denote the linear estimator of $x$ given $\{y_a, y\}$. Let $\hat{x}_{y_b,y}$ denote the linear estimator of $x$ given $\{y_b, y\}$. Can you relate these estimators, and their MMSE, to each other?

29.11 Let $y = x + v$, where $x$ and $v$ are independent zero-mean Gaussian random variables with variances $\sigma_x^2$ and $\sigma_v^2$, respectively. Show that the LLMSE estimator of $x^2$ using $\{y, y^2\}$ is

$$ \widehat{x^2} = \sigma_x^2 + \frac{\sigma_x^4}{\sigma_x^4 + 2\sigma_x^2\sigma_v^2 + \sigma_v^4} \left( y^2 - \sigma_x^2 - \sigma_v^2 \right) $$

29.12 A random variable $z$ is defined as follows

$$ z = \begin{cases} -x, & \text{with probability } p \\ Hx + v, & \text{with probability } 1 - p \end{cases} $$

where $x$ and $v$ are zero-mean uncorrelated random vectors. Assume we know the LLMSE estimator of $x$ given $y$, namely, $\hat{x}_{|y}$, where $y$ is a zero-mean random variable that is also uncorrelated with $v$.
(a) Find an expression for $\hat{z}_{|y}$ in terms of $\hat{x}_{|y}$.
(b) Find an expression for the LLMSE estimator $\hat{x}_{|z}$ and the corresponding MMSE.

29.13 Consider the distributed network with $m$ nodes, shown in Fig. 29.3. Each node $k$ observes a zero-mean measurement $y_k$ that is related to an unknown zero-mean variable $x$ via a linear model of the form $y_k = H_k x + v_k$, where the data matrix $H_k$ is known, and the noise $v_k$ is zero mean and uncorrelated with $x$. The noises across all nodes are uncorrelated with each other. Let $\{R_x, R_k\}$ denote the positive-definite covariance matrices of $\{x, v_k\}$, respectively. Introduce the following notation:
• At each node $k$, the notation $\hat{x}_k$ denotes the LLMSE estimator of $x$ that is based on the observation $y_k$. Likewise, $P_k$ denotes the resulting error covariance matrix, $P_k = E\,\tilde{x}_k\tilde{x}_k^T$.
• At each node $k$, the notation $\hat{x}_{1:k}$ denotes the LLMSE estimator of $x$ that is based on the observations $\{y_1, y_2, \ldots, y_k\}$, i.e., on the observations collected at nodes 1 through $k$. Likewise, $P_{1:k}$ denotes the resulting error covariance matrix, $P_{1:k} = E\,\tilde{x}_{1:k}\tilde{x}_{1:k}^T$.
The network functions as follows. Node 1 uses $y_1$ to estimate $x$. The resulting estimator, $\hat{x}_1$, and the corresponding error covariance matrix, $P_1 = E\,\tilde{x}_1\tilde{x}_1^T$, are transmitted to node 2. Node 2 in turn uses its measurement $y_2$ and the data $\{\hat{x}_1, P_1\}$ received from node 1 to compute the estimator of $x$ that is based on both observations $\{y_1, y_2\}$. Note that node 2 does not have access to $y_1$ but only to $y_2$ and the information received from node 1. The estimator computed by node 2, $\hat{x}_{1:2}$, and the corresponding error covariance matrix, $P_{1:2}$, are then transmitted to node 3. Node 3 evaluates $\{\hat{x}_{1:3}, P_{1:3}\}$ using $\{y_3, \hat{x}_{1:2}, P_{1:2}\}$ and transmits $\{\hat{x}_{1:3}, P_{1:3}\}$ to node 4, and so forth.
(a) Find an expression for $\hat{x}_{1:m}$ in terms of $\hat{x}_{1:m-1}$ and $\hat{x}_m$.
(b) Find an expression for $P_{1:m}^{-1}$ in terms of $\{P_{1:m-1}^{-1}, P_m^{-1}, R_x^{-1}\}$.
(c) Find a recursion relating $P_{1:m}$ to $P_{1:m-1}$.
(d) Show that $P_{1:m}$ is a nonincreasing sequence as a function of $m$.


Figure 29.3 A distributed network with m nodes for Prob. 29.13.

Assume Hk = H for all k and Rvk = Rv > 0. Assume further that H is tall and has full column rank. Find limm→∞ P1:m . 29.14 Consider two sensors labeled k = 1, 2, and assume each sensor has an unbiased estimator, {wk , k = 1, 2} for some M × 1 column vector wo . Let {Pk , k = 1, 2} denote the error covariance matrix, Pk = E (wo − wk )(wo − wk )T . Assume the errors of the two estimators are uncorrelated, i.e., E (wo − w1 )(wo − w2 )T = 0. Consider a new aggregate b = αw1 + (1 − α)w2 . estimator of the form w (a) If α is nonnegative, determine the optimal scalar α that minimizes the MSE, i.e., b 2. minα≥0 E kwo − wk (b) Repeat part (a) when α is not restricted to being nonnegative. When would a negative α be advantageous? o b b T . How does P compare to P1 and P2 in both cases (c) Let P = E (wo − w)(w − w) (a) and (b)? (d) Now assume the errors of the two estimators are correlated instead, i.e., E (wo − w1 )(wo − w2 )T = C, for some matrix C. Repeat parts (a)–(c). 29.15 Consider N sensors labeled k = 1, 2, . . . , N . Each node has an unbiased estimate of some unknown column vector wo ∈ IRM . We denote the individual estimator at node k by wk . We also denote the error covariance matrix of wk by Pk and the cross-covariance matrix of wk and w` by Pk` P. A sensor S wishes to combine the esbS = N timators {wk , k = 1, . . . , N } through w k=1 ak w k in order to optimize the cost function: (e)

N

2 X

min E wo − ak wk

{ak }

where the {ak } are real-valued scalars.

k=1

Problems

1149

b S is an Find a condition on the coefficients {ak } to ensure that the resulting w unbiased estimator for wo . (b) Under condition (a), find the optimal coefficients {ak }. Your solution should not depend on wo . (c) Assume the reliability of each estimator wk is measured by the scalar σk2 = Tr(Pk ). The smaller the σk2 is, the more reliable the estimator will be. What is the relation between the optimal coefficients {ak } and the reliability factors {σk2 }? b S. (d) Evaluate the reliability of the estimator w (e) Motivate and derive a stochastic gradient algorithm for updating the coefficients {ak } in part (b). (f) How is the estimator of part (b) different from the unbiased LLMSE estimator of wo based on the {wk }? Find the latter estimator. (g) Find the MMSE of the estimators in parts (b) and (f) for the case where Pk` = 0 when ` 6= k. Specialize your result to the case Pk = P for all k and compare the resulting MSEs. 29.16 Consider a collection of N iid random variables, {y(n), n = 0, 1, . . . , N − 1}. Each y(n) has a Gaussian distribution with zero mean and variance σy2 . We want to use the observations {y(n)} to estimate the variance σy2 in the following manner: (a)

b 2y = α σ

N −1 X

y 2 (n)

n=0

for some scalar parameter α to be determined. b 2y in terms of α and σy2 ? (a) What is the mean of the estimator σ (b) Evaluate the MSE below in terms of α and σy2 : MSE = E (b σ 2y − σy2 )2 Determine the optimal scalar α that minimizes the MSE. Is the corresponding estimator biased or unbiased? (d) For what value of α would the estimator be unbiased? What is the MSE of this estimator and how does it compare to the MSE of the estimator from part (c)? 29.17 Consider noisy observations y(n) = x+v(n), where x and v(n) are independent random variables, v(n) is a white-noise random process with zero mean and distributed as follows: (c)

v(n) is Gaussian with variance σv2 with probability q v(n) is uniformly distributed over [−a, a] with probability 1 − q Moreover, x assumes the values ±1 with equal probability. The value of x is the same for all measurements {y(n)}. All variables are real-valued. (a) Find an expression for the LLMSE estimator of x given the collection of N observations {y(0), y(1), . . . , y(N − 1)}. (b) Find the LLMSE of x given the observations {y(0), y(1), . . . , y(N − 1)} and {y 2 (0), y 2 (1), . . . , y 2 (N − 1)}. How does the answer compare to part (a)? 29.18 This problem deals with constrained MSE estimation. Let d denote a scalar zero-mean random variable with variance σd2 , and let u denote an M × 1 zero-mean random vector with covariance matrix Ru = E uuT > 0. Consider the constrained optimization problem min E (d − uT w)2 , w

subject to cT w = α

where c is a known M × 1 vector and α is a known real scalar. (a) Let z = w − Ru−1 rud and rdu = E duT . Show that the above optimization problem is equivalent to the following: n o min σd2 − rdu Ru−1 rud + z T Ru z , subject to cT z = α − cT Ru−1 rud z∈IRM

1150

Linear Regression

(b)

Show that the optimal solution, wo , of the constrained optimization problem is given by  T −1  c Ru rud − α wo = Ru−1 rud − Ru−1 c cT Ru−1 c

Verify that this solution satisfies the constraint cT wo = α. 29.19 Let x be a zero-mean random variable with an M × M positive-definite cob y1 denote the LLMSE estimator of x given a zero-mean variance matrix Rx . Let x b y2 denote the LLMSE estiobservation y 1 with covariance matrix Ry1 . Likewise, let x mator of x given another zero-mean observation y 2 with covariance matrix Ry2 . Let b y1 and Ry1 ,y2 = E y 1 y T2 . We want to determine another estimator for x by combining x b y2 in a convex manner as follows: x b = λb x xy1 + (1 − λ)b xy2 where λ is a real scalar lying inside the interval 0 ≤ λ ≤ 1. b with the smallest MSE. (a) Determine the value of λ that results in an estimator x (b) If λ is allowed to be any arbitrary real scalar (not necessarily limited to the range [0, 1]), how much smaller can the MSE be? 29.20 Let y = s + v be a vector of measurements, where v is noise and s is the desired signal. Both v and s are zero-mean uncorrelated random vectors with covariance matrices {Rv , Rs }, respectively. We wish to determine a unit-norm column vector, w, such that the SNR in the output signal, y T w, is maximized. (a) Verify that the covariance matrices of the signal and noise components in y T w are equal to wT Rs w and wT Rv w, respectively. (b) Assume first that Rv = σv2 I. Use the Rayleigh–Ritz characterization (1.16) to conclude that the solution of   T w Rs w max kwk=1 σv2 kwk2

(c)

is given by the unit-norm eigenvector that corresponds to the maximum eigenvalue of Rs , written as wo = qmax , where Rs qmax = λmax qmax . Verify further that the resulting maximum SNR is equal to λmax /σv2 . Assume now that v is colored noise so that its covariance matrix is not necessarily diagonal. Introduce the eigen-decomposition Rv = U ΛU T , where U is orthogonal and Λ is diagonal with positive entries. Let L = U Λ1/2 . Repeat the argument of part (b) to show that the solution of   T w Rs w max kwk=1 wT Rv w

is now related to the unit-norm eigenvector that corresponds to the maximum −1 eigenvalue of L−1 Rs LT . 29.21 Consider the optimization problem: min W T Rv W, W

subject to W T H = A, Rv > 0

where W T is M × N , H is N × P , A is M × P , P < N , M < N , and H has full rank. In the text we assumed A is square and equal to the identity matrix (see (29.120)). Show that the optimal solution is given by (W o )T = A(H T Rv−1 H)−1 H T Rv−1 and that the resulting minimum cost is A(H T Rv−1 H)−1 AT . 29.22 Refer to (29.108a). Compare P to Pk , for each k = 1, 2, . . . , N . Specifically, verify that the difference Pk − P is nonnegative-definite.

29.A Consistency of Normal Equations

1151

29.23 Refer to (29.108a)–(29.108b). Let {b x, P } be the estimator and the MMSE that result from estimating x from data across all N sensors. Let {b x0 , P 0 } be the estimator and the MMSE that result from estimating x from data across the first N − 1 sensors. Relate {b x, P } to {b x0 , P 0 }. 29.24 All variables are zero mean. Consider a complex-valued scalar random variable d and a complex-valued M × 1 regression vector u. Let b = (wo )T u = uT wo d denote the LLMSE estimator of d given u for some M × 1 vector wo . Consider additionally the problem of estimating separately the real and imaginary parts of d using knowledge of the real and imaginary parts of u, also in the LLMSE sense, namely,     Re(u) Re(u) o o breal = (wreal bimag = (wimag d )T , d )T Im(u) Im(u) o o for some 2M × 1 vectors wreal , and wimag . (a) Argue that estimating the real and imaginary parts of d from the real and imaginary parts of u is equivalent to estimating the real and imaginary parts of d from {u, u∗ }, where u∗ is the complex conjugate transpose of u. o o (b) What are the optimal choices for wo , wreal , and wimag ? b b b (c) Let d2 = dreal +j dimag denote the estimator that is obtained for d from this second construction. What is the corresponding MMSE? How does it compare to the b = (wo )T u? Under what conditions will both constructions MMSE obtained for d lead to the same MMSE?

29.A

CONSISTENCY OF NORMAL EQUATIONS In this appendix we verify that the normal equations (29.25) are always consistent, i.e., we establish that a solution wo always exists. Moreover, the solution is either unique or there are infinitely many solutions. In the latter case, all solutions will differ by vectors in the nullspace of Ry and, moreover, all of them will lead to the same estimator for x and to the same MSE. Only these possibilities can occur. Proof (Consistency of normal equations): We verify that at least one solution wo exists to the normal equations Ry wo = ryx . For this purpose, we need to verify that ryx belongs to the range space of Ry , i.e., ryx ∈ R (Ry )

(29.143)

We show this property by contradiction. Assume that (29.143) does not hold. Under this assumption, there should exist some nonzero vector p ∈ N(Ry ) that is not orthogonal to ryx , namely, T ∃ p such that Ry p = 0, ryx p 6= 0

It follows from Ry p = 0 that pT Ry p = 0 so that   pT Ry p = 0 ⇐⇒ pT E (y − y¯)(y − y¯)T p = 0  2 ⇐⇒ E (y − y¯)T p = 0

(29.144)

(29.145)

1152

Linear Regression

from which we conclude that the zero-mean scalar random variable (y − y¯)T p has zero variance and, hence, (y − y¯)T p = 0,

in probability

(29.146)

This conclusion leads to a contradiction since it implies that T ryx p

= (29.146)

=

E (x − x ¯)(y − y¯)T p 0,

in probability

(29.147)

T which violates the assertion that ryx p 6= 0. We conclude that (29.143) holds and the normal equations (29.25) are consistent. Next we verify that the solution wo is either unique or there are infinitely many solutions. To begin with, it is clear from (29.25) that the solution is unique whenever Ry is invertible, in which case wo = Ry−1 ryx . On the other hand, when Ry is singular, then infinitely many solutions exist. This is because if we let p denote any nontrivial vector in the nullspace of Ry , i.e., Ry p = 0, then the vector wo + p will also satisfy the normal equations (29.25). The next property we verify is that when infinitely many solutions exist, any solution will continue to lead to the same estimator for x and to the same MSE value. Let wo and w• denote any two solutions to the normal equations, i.e.,

Ry wo = ryx ,

Ry w• = ryx

(29.148)

The corresponding estimators for x are denoted by bo = x x ¯ + (y − y¯)T wo •

T

b =x x ¯ + (y − y¯) w



(29.149a) (29.149b)

Subtracting both equalities in (29.148) gives Ry (wo − w• ) = 0

(29.150)

so that any two solution vectors differ by vectors in the nullspace of Ry , namely, w• = wo + p,

for some p ∈ N(Ry )

(29.151)

Moreover, we obtain from (29.150) that Ry (wo − w• ) = 0 =⇒ (wo − w• )T Ry (wo − w• ) = 0   =⇒ (wo − w• )T E (y − y¯)(y − y¯)T (wo − w• ) = 0  2 =⇒ E (y − y¯)T (wo − w• ) = 0 (29.152) which implies that the following zero-mean scalar random variable is equal to zero in probability: ∆

α = (y − y¯)T (wo − w• ) = 0,

in probability

(29.153)

That is, for any  > 0, P(|α| ≥ ) = 0

(29.154)

Subtracting expressions (29.149a)–(29.149b) gives bo − x b• x

= (29.153)

=

(y − y¯)T (wo − w• ) 0,

in probability

(29.155)

References

1153

which confirms our claim that different solutions to the normal equations continue to lead to the same estimator. It is left as an exercise to check that the MSEs corresponding b o and x b • also agree with each other – see Prob. 29.4. to x 

REFERENCES

Aitken, A. C. (1935), "On least squares and linear combinations of observations," Proc. Roy. Soc. Edinburgh, vol. 55, pp. 42–48.
Cox, D. R. and D. V. Hinkley (1974), Theoretical Statistics, Chapman & Hall.
Gauss, C. F. (1821), Theoria Combinationis Observationum Erroribus Minimis Obnoxiae, vol. 2, H. Dieterich.
Kailath, T., A. H. Sayed, and B. Hassibi (2000), Linear Estimation, Prentice Hall.
Kay, S. (1993), Fundamentals of Statistical Signal Processing: Estimation Theory, Prentice Hall.
Kendall, M. and A. Stuart (1976–1979), The Advanced Theory of Statistics, vols. 1–3, Macmillan.
Kolmogorov, A. N. (1939), "Sur l'interpolation et extrapolation des suites stationnaires," C. R. Acad. Sci., vol. 208, p. 2043.
Kolmogorov, A. N. (1941a), "Stationary sequences in Hilbert space (in Russian)," Bull. Math. Univ. Moscow, vol. 2.
Kolmogorov, A. N. (1941b), "Interpolation and extrapolation of stationary random processes," Bull. Acad. Sci. USSR, vol. 5. A translation has been published by the RAND Corp., Santa Monica, CA, as Memo. RM-3090-PR, April 1962.
Markov, A. A. (1912), Wahrscheinlichkeitsrechnung, B. G. Teubner.
Plackett, R. L. (1949), "A historical note on the method of least squares," Biometrika, vol. 36, no. 3–4, pp. 458–460.
Plackett, R. L. (1950), "Some theorems in least-squares," Biometrika, vol. 37, no. 1–2, pp. 149–157.
Rao, C. R. (1973), Linear Statistical Inference and Its Applications, Wiley.
Sayed, A. H. (2003), Fundamentals of Adaptive Filtering, Wiley.
Sayed, A. H. (2008), Adaptive Filters, Wiley.
Wiener, N. (1949), Extrapolation, Interpolation and Smoothing of Stationary Time Series, Technology Press and Wiley. Originally published in 1942 as a classified National Defense Research Council Report. Also published under the title Time Series Analysis by MIT Press.
Wiener, N. and E. Hopf (1931), "On a class of singular integral equations," Proc. Prussian Acad. Math. Phys. Ser., vol. 31, pp. 696–706.
Wold, H. (1938), A Study in the Analysis of Stationary Time Series, Almqvist & Wiksell.

30 Kalman Filter

In this chapter we illustrate one important application of the linear meansquare-error (MSE) theory to the derivation of the famed Kalman filter. The filter is a powerful recursive technique for updating the estimates of the state (hidden) variables of a state-space model from noisy observations. The state evolution satisfies a Markovian property in the sense that the distribution of the state xn at time n is only dependent on the most recent past state, xn−1 . Likewise, the distribution of the observation y n at the same time instant is only dependent on the state xn . The state and observation variables are represented by a linear state-space model, which will be shown to enable a powerful recursive solution. One key step in the argument is the introduction of the innovations process and the exploitation to great effect of the principle of orthogonality. In Chapter 35 we will allow for nonlinear state-space models and derive the class of particle filters by relying instead on the concept of sequential importance sampling. Before deriving the Kalman filter, we introduce some preliminary results that will facilitate the exposition. All variables in this chapter are assumed to have zero means.

30.1

UNCORRELATED OBSERVATIONS

Consider a random variable $x$ (scalar or vector-valued) and a collection of observations $\{y_0, y_1, \ldots, y_N\}$. Each of the observations, $y_n$, can be scalar or vector-valued. We collect the observations into the column vector:

$$ y \triangleq \operatorname{col}\{ y_0, y_1, \ldots, y_N \} \tag{30.1} $$

and consider the problem of estimating x from y optimally in the linear leastb is mean-square-error (LLMSE) sense. We already know from (29.79) that x given by b = Rxy Ry−1 y x

(30.2)

30.1 Uncorrelated Observations

1155

in terms of the covariances ∆

Rxy = E xy T

and



Ry = E yy T

(30.3)

where we will be assuming Ry > 0. The covariance matrix Ry is a block matrix and its entries will consist of the cross-covariance matrices between the individual observations. For example, for N = 2, we have   E y0 yT E y0 yT E y0 yT 0 1 2  Ry =  E y 1 y T (30.4) E y1 yT E y1 yT 0 1 2 T T T E y2 y0 E y2 y1 E y2 y2

where each covariance matrix, E y n y T m , is a scalar when the observations are scalars and is a square matrix when the observations are vectors. Likewise,   Rxy = E xy T (30.5) E xy T E xy T 0 1 2 where each of the terms E xy T m is either a scalar (when both x and the observations are scalars) or a matrix (when both x and the observations are vectors). b. According to (30.2) we need to invert the matrix Ry in order to evaluate x Inverting a full matrix is generally demanding. However, a special case is of particular interest. Assume for now that the observations happen to be uncorrelated with each other, so that E yn yT m = 0, for n 6= m In this case, the matrix Ry will be block diagonal:  E y0 yT 0 Ry =  E y1 yT 1

(30.6)

E y2 yT 2

 

(30.7)

and its inverse is obtained by simply inverting the block diagonal entries:   −1 E y0 yT 0 −1   (30.8) Ry−1 =  E y1 yT  1  T −1 E y2 y2

b by means of the sum Substituting into (30.2) we find that we can evaluate x expression:

where each term

b = x

N X

E xy T n

n=0

E xy T n





E yn yT n

E yn yT n

−1

−1

yn

yn

(30.9)

(30.10)

corresponds to the estimator for x that is based solely on observation, y n . In other words, we find that when the observations happen to be uncorrelated, we can reduce the problem of estimating x from all observations into the equivalent problem of estimating x from each of the individual observations and then

1156

Kalman Filter

adding the estimators up as in (30.9). This simplified construction is at the core of the Kalman filter derivation, as we are going to illustrate in the future sections. The main issue we will be facing in the derivation of the filter is that the available observations {y n } are generally correlated. For this reason, the Kalman filter will include a critical step whose purpose is to decorrelate the observations and replace them with an uncorrelated sequence of data called the innovations process. Then the estimation problem is solved by using the innovations instead of the original observations.

Recursiveness

A second important property that follows from decomposition (30.9) when the observations are uncorrelated is the ability to update the estimator in a recursive manner. To see this, let us introduce a new notation. Let us replace $\hat{x}$ by $\hat{x}_{|N}$ in order to indicate that $\hat{x}_{|N}$ is the estimator of $x$ that is based on the observations up to and including time $N$:

b |N = LLMSE of x given {y o , y 1 , . . . , y N } x

(30.11)

Then, expression (30.9) gives b |N = x =

N X

E xy T n

n=0 N −1 X



E xy T n

n=0

E yn yT n



−1

E yn yT n

b |N −1 + x b |yN , =x

−1

yn T −1 y n + (E xy T yN N ) (E y N y N )

when the {y n } are uncorrelated

(30.12)

where $\hat{x}_{|y_N}$ denotes the estimator of $x$ that is based on the last observation $y_N$, and $\hat{x}_{|N-1}$ denotes the estimator of $x$ that is based on the observations up to time $N-1$. We therefore arrive at a useful decomposition result when the observations are uncorrelated: when a new observation $y_N$ is added, we simply estimate $x$ from $y_N$ and add the result to the previous estimator, $\hat{x}_{|N-1}$. The Kalman filter will actively exploit this useful construction.
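A toy construction (with hypothetical variances, not taken from the text) in which the observations are genuinely uncorrelated makes this decomposition concrete: below, $x$ has independent entries and $y_n = x_n + v_n$, so that $E\, y_n y_m^T = 0$ for $n \neq m$, and the joint LLMSE estimator (30.2) coincides with the sum of the individual estimators in (30.9), which is exactly what the recursive update (30.12) exploits.

```python
# Toy check of (30.9)/(30.12) with uncorrelated observations; all variances are hypothetical.
import numpy as np

rng = np.random.default_rng(5)
sig2 = np.array([1.0, 2.0, 0.5])         # variances of the independent entries of x
r2 = np.array([0.3, 0.1, 0.4])           # variances of the independent noises v_n

Rx = np.diag(sig2)                       # E x x^T
Ry = np.diag(sig2 + r2)                  # E y y^T is diagonal: the y_n are uncorrelated
Rxy = np.diag(sig2)                      # E x y^T

y = rng.standard_normal(3)               # an arbitrary realization of y

xhat_joint = Rxy @ np.linalg.solve(Ry, y)          # joint estimator, eq. (30.2)

xhat_sum = np.zeros(3)
for n in range(3):
    # estimator of x based solely on y_n: (E x y_n^T)(E y_n y_n^T)^{-1} y_n, eq. (30.10)
    xhat_sum += (Rxy[:, n] / Ry[n, n]) * y[n]

print(np.allclose(xhat_joint, xhat_sum))  # True: estimators from uncorrelated y_n simply add
```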

Linear transformations

A third useful property pertains to the effect of linear transformations on estimators. Assume we replace y by some linear transformation of the form e = Ay, where A is an invertible matrix so that e and y define each other uniquely. Then, the estimator of x given y will coincide with the estimator of x given e, namely, b |e = x b |y , x

when e = Ay and A invertible

(30.13)

Proof of (30.13): Note that, by definition, Rxe = E xeT = (E xy T )AT = Rxy AT T

T

T

(30.14) T

Ree = E ee = A(E yy )A = ARy A

(30.15)

30.2 Innovations Process

1157

and, hence, b |e = Rxe Re−1 e x = Rxy AT (AT )−1 Ry−1 A−1 e = Rxy Ry−1 y b |y =x

(30.16)



30.2

INNOVATIONS PROCESS We now use the previous results to motivate the concept of the innovations process, which we will use to derive the Kalman filter. Consider again two zero-mean random variables {x, y}. Usually, the variable y is vector-valued, say, y = col{y 0 , y 1 , . . . , y N }, where each y n is also possibly a vector. Now assume that we could somehow replace y by another vector e of similar dimensions and structure, say, e = Ay

(30.17)

for some lower triangular invertible matrix A. Assume further that the transformation A could be chosen such that the individual entries of e, denoted by e = col{e0 , e1 , . . . , eN }, are uncorrelated with each other, i.e., ∆

E en eT m = Re,n δnm

(30.18)

where δnm denotes the Kronecker delta function that is unity when n = m and zero otherwise, and Re,n denotes the covariance matrix of the nth entry, en . Then, the covariance matrix of e will be block diagonal, n o ∆ Re = E eeT = blkdiag Re,0 , Re,1 , . . . , Re,N (30.19)

and, in view of property (30.13), the problem of estimating x from y would be equivalent to the problem of estimating x from e, namely,

x̂_{|e} = x̂_{|y}   (30.20)

The key advantage of working with e instead of y is that the covariance matrix R_e in (30.19) is block-diagonal and, hence, the estimator x̂_{|e} can be evaluated as the combined sum of individual estimators (recall the discussion that led to (30.12)). Specifically, expression (30.9) gives

x̂_{|e} = Σ_{n=0}^{N} (E x e_n^T) R_{e,n}^{-1} e_n = Σ_{n=0}^{N} x̂_{|e_n}   (30.21)

which shows that we can estimate x individually from each e_n and then combine the resulting estimators. In particular, if we replace the notation x̂_{|e} by the more suggestive notation x̂_{|N}, in order to indicate that the estimator of x is based on the observations y_0 through y_N, then (30.21) gives

x̂_{|N} = x̂_{|N-1} + x̂_{|e_N}   (30.22)

That is,

x̂_{|N} = x̂_{|N-1} + (E x e_N^T) R_{e,N}^{-1} e_N   (30.23)

This is a useful recursive formula; it shows how the estimator for x can be updated recursively by adding the contribution of the most recent variable, e_N.

Gram–Schmidt procedure

The main question, therefore, is how to generate the variables {e_n} from the observations {y_n}. One possible transformation is the Gram–Schmidt procedure. Let ŷ_{n|n-1} denote the estimator of y_n that is based on the observations up to time n − 1, i.e., on {y_0, y_1, ..., y_{n-1}}. The same argument that led to (30.20) shows that ŷ_{n|n-1} can be alternatively calculated by estimating y_n from {e_0, ..., e_{n-1}}. We then choose e_n as

e_n ≜ y_n − ŷ_{n|n-1}   (30.24)

In order to verify that the terms of the sequence {e_n} constructed in this manner are indeed uncorrelated with each other, we recall that, by virtue of the orthogonality condition (29.23) of linear least-mean-squares estimation:

e_n ⊥ {y_0, y_1, ..., y_{n-1}}   (30.25)

That is, e_n is uncorrelated with the observations {y_0, y_1, ..., y_{n-1}}. It then follows that e_n should be uncorrelated with any e_m for m < n since, by definition, e_m is a linear combination of the observations {y_0, y_1, ..., y_m} and, moreover,

{y_0, y_1, ..., y_m} ∈ span{y_0, y_1, ..., y_{n-1}},   for m < n

By the same token, e_n is uncorrelated with any e_m for m > n.

It is instructive to see what choice of transformation A corresponds to the Gram–Schmidt procedure (30.24). Assume, for illustration purposes, that N = 2. Then, writing (30.24) for n = 0, 1, 2, we get

\begin{bmatrix} e_0 \\ e_1 \\ e_2 \end{bmatrix} = \underbrace{\begin{bmatrix} I & & \\ -(E\,y_1 y_0^T)(E\,y_0 y_0^T)^{-1} & I & \\ \times & \times & I \end{bmatrix}}_{=A} \begin{bmatrix} y_0 \\ y_1 \\ y_2 \end{bmatrix}   (30.26)

where the entries × arise from the calculation

\begin{bmatrix} \times & \times \end{bmatrix} = - \begin{bmatrix} E\,y_2 y_0^T & E\,y_2 y_1^T \end{bmatrix} \left( E \begin{bmatrix} y_0 \\ y_1 \end{bmatrix} \begin{bmatrix} y_0 \\ y_1 \end{bmatrix}^T \right)^{-1}   (30.27)


We thus find that A is lower triangular with unit entries along its diagonal. The lower triangularity of A is relevant since it translates into a causal relationship between the {e_n} and the {y_n}. By causality we mean that each e_n can be computed from {y_m, m ≤ n} and, similarly, each y_n can be recovered from {e_m, m ≤ n}. We also see from the construction (30.24) that we can regard e_n as the "new information" in y_n given {y_0, ..., y_{n-1}}. Therefore, it is customary to refer to the {e_n} as the innovations process associated with {y_n}.
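To make the construction concrete, the following short Python sketch (an illustration, not part of the text; it assumes numpy and uses an arbitrary made-up covariance structure) builds the Gram–Schmidt transformation A row by row from the second-order moments of y, applies it to realizations of y, and verifies that the resulting entries e_n are (approximately) uncorrelated.

import numpy as np

rng = np.random.default_rng(0)
N, trials = 5, 200000
# correlated zero-mean scalar observations y_0,...,y_{N-1} (arbitrary example)
L = np.tril(rng.standard_normal((N, N))) + 2 * np.eye(N)
Y = L @ rng.standard_normal((N, trials))        # each column is one realization of col{y_0,...,y_{N-1}}
Ry = (Y @ Y.T) / trials                          # sample covariance E y y^T

# Gram-Schmidt: e_n = y_n - LLMSE estimate of y_n from y_0,...,y_{n-1}
A = np.eye(N)
for n in range(1, N):
    Rpast = Ry[:n, :n]                           # E y_{0:n-1} y_{0:n-1}^T
    r = Ry[n, :n]                                # E y_n y_{0:n-1}^T
    A[n, :n] = -np.linalg.solve(Rpast, r)        # subtracts the prediction of y_n
E = A @ Y                                        # innovations realizations
Re = (E @ E.T) / trials
print(np.round(Re, 2))                           # approximately diagonal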

30.3 STATE-SPACE MODEL

The Kalman filter turns out to be an efficient procedure for determining the innovations process when the observations {y_n} arise from a finite-dimensional linear state-space model. We assume now that y_n satisfies an equation of the form:

y_n = H_n x_n + v_n,   n ≥ 0   (y_n : p × 1)   (30.28)

in terms of an s × 1 state vector x_n, which in turn obeys a recursion of the form

x_{n+1} = F_n x_n + G_n u_n,   n ≥ 0   (x_n : s × 1)   (30.29)

The processes v_n and u_n are assumed to be p × 1 and q × 1 zero-mean "white-noise" processes, respectively, with covariances and cross-covariances denoted by

E \begin{bmatrix} u_n \\ v_n \end{bmatrix} \begin{bmatrix} u_k \\ v_k \end{bmatrix}^T = \begin{bmatrix} Q_n & S_n \\ S_n^T & R_n \end{bmatrix} δ_{nk}   (30.30)

Note that we are allowing the moment matrices {Q_n, R_n, S_n} to vary with time (although, strictly speaking, we defined white-noise processes earlier in (7.11c) as having constant second-order moments). It is customary to assume R_n > 0 to avoid degenerate situations (see the explanation after (30.52)). The initial state x_0 of the model is also assumed to have zero mean, covariance matrix Π_0 ≥ 0, and to be uncorrelated with {u_n} and {v_n}, i.e.,

E x_0 x_0^T = Π_0,   E u_n x_0^T = 0,   E v_n x_0^T = 0,   for all n ≥ 0   (30.31)

The process v_n is called measurement noise and the process u_n is called process noise. The assumptions on {x_0, u_n, v_n} can be compactly restated in matrix form as:

E \begin{bmatrix} u_n \\ v_n \\ x_0 \end{bmatrix} \begin{bmatrix} u_k \\ v_k \\ x_0 \\ 1 \end{bmatrix}^T = \begin{bmatrix} Q_n δ_{nk} & S_n δ_{nk} & 0 & 0 \\ S_n^T δ_{nk} & R_n δ_{nk} & 0 & 0 \\ 0 & 0 & Π_0 & 0 \end{bmatrix}   (30.32)


It is also assumed that the model parameters

F_n (s × s),  G_n (s × q),  H_n (p × s),  Q_n (q × q),  R_n (p × p),  S_n (q × p)   (30.33)
Π_0 (s × s)   (30.34)

are known beforehand. Since these parameters are allowed to vary with time, the processes {x_n, y_n} will not be stationary in general. We now examine how the innovations {e_n} of the process {y_n} satisfying a state-space model of the form (30.28)–(30.32) can be evaluated.

30.3.1 State Estimator

Let {ŷ_{n|n-1}, x̂_{n|n-1}, v̂_{n|n-1}} denote the estimators of {y_n, x_n, v_n} from the observations {y_0, y_1, ..., y_{n-1}}, respectively. Then, using

y_n = H_n x_n + v_n   (30.35)

and appealing to linearity, we have

ŷ_{n|n-1} = H_n x̂_{n|n-1} + v̂_{n|n-1}   (30.36)

Now the assumptions on the state-space model imply that

v_n ⊥ y_m,   for m ≤ n − 1   (30.37)

i.e., v_n is uncorrelated with the observations {y_m, m ≤ n − 1}. This is because, from the model (30.28)–(30.29), the observation y_m is a linear combination of the variables {v_m, u_{m-1}, ..., u_0, x_0}, all of which are uncorrelated with v_n for m ≤ n − 1. It follows that

v̂_{n|n-1} = 0   (30.38)

Consequently,

e_n = y_n − ŷ_{n|n-1} = y_n − H_n x̂_{n|n-1}   (30.39)

Therefore, the problem of finding the innovations reduces to one of finding x̂_{n|n-1}. For this purpose, we can use (30.23) to write

x̂_{n+1|n} = x̂_{n+1|n-1} + (E x_{n+1} e_n^T) R_{e,n}^{-1} e_n
           = x̂_{n+1|n-1} + (E x_{n+1} e_n^T) R_{e,n}^{-1} (y_n − H_n x̂_{n|n-1})   (30.40)

where we are introducing the p × p matrix:

R_{e,n} ≜ E e_n e_n^T   (30.41)

But since x_{n+1} obeys the state equation

x_{n+1} = F_n x_n + G_n u_n   (30.42)

we also obtain, again by linearity, that

x̂_{n+1|n-1} = F_n x̂_{n|n-1} + G_n û_{n|n-1} = F_n x̂_{n|n-1} + 0   (30.43)


since u_n ⊥ y_m, m ≤ n − 1. By combining (30.39)–(30.43) we arrive at the following recursive equations for determining the innovations:

e_n = y_n − H_n x̂_{n|n-1}
x̂_{n+1|n} = F_n x̂_{n|n-1} + K_{p,n} e_n,   n ≥ 0   (30.44)

with initial conditions

x̂_{0|-1} = 0,   e_0 = y_0   (30.45)

and where we introduced the s × p gain matrix:

K_{p,n} ≜ (E x_{n+1} e_n^T) R_{e,n}^{-1}   (30.46)

The subscript p indicates that K_{p,n} is used to update a predicted estimator of the state vector. By combining the equations in (30.44) we also find that

x̂_{n+1|n} = F_{p,n} x̂_{n|n-1} + K_{p,n} y_n,   x̂_{0|-1} = 0,   n ≥ 0   (30.47)

where we introduced the s × s closed-loop matrix:

F_{p,n} ≜ F_n − K_{p,n} H_n   (30.48)

Relation (30.47) shows that in finding the innovations, we actually have a complete recursion for the state estimator x̂_{n|n-1}. The initial condition x̂_{0|-1} refers to the estimator of the initial state vector, x_0, given no observations. In this case, the estimator coincides with the mean of the variable, which is zero. That is why we set x̂_{0|-1} = 0.

30.3.2 Computing the Gain Matrix

We still need to evaluate K_{p,n} and R_{e,n}. To do so, we introduce the state-estimation error:

x̃_{n|n-1} ≜ x_n − x̂_{n|n-1}   (30.49)

and denote its s × s covariance matrix by

P_{n|n-1} ≜ E x̃_{n|n-1} x̃_{n|n-1}^T   (30.50)

Then, as we are going to see, the {K_{p,n}, R_{e,n}} can be expressed in terms of P_{n|n-1} and, in addition, the evaluation of P_{n|n-1} will require propagating a so-called Riccati recursion. To see this, note first that

e_n = y_n − H_n x̂_{n|n-1} = H_n x_n − H_n x̂_{n|n-1} + v_n = H_n x̃_{n|n-1} + v_n   (30.51)


Moreover, v_n ⊥ x̃_{n|n-1}. This is because x̃_{n|n-1} is a linear combination of the variables {v_0, ..., v_{n-1}, x_0, u_0, ..., u_{n-1}}, all of which are uncorrelated with v_n. This claim follows from the definition x̃_{n|n-1} = x_n − x̂_{n|n-1} and from the fact that x̂_{n|n-1} is a linear combination of {y_0, ..., y_{n-1}} while x_n is a linear combination of {x_0, u_0, ..., u_{n-1}}. Therefore, we get

R_{e,n} = E e_n e_n^T = R_n + H_n P_{n|n-1} H_n^T   (30.52)

Observe that R_{e,n} > 0 since, by assumption, R_n > 0. In this way, the matrix R_{e,n} is always invertible. Likewise, since

E x_{n+1} e_n^T = F_n E x_n e_n^T + G_n E u_n e_n^T   (30.53)

with the terms E x_n e_n^T and E u_n e_n^T given by

E x_n e_n^T = E (x̂_{n|n-1} + x̃_{n|n-1}) e_n^T
            = E x̃_{n|n-1} e_n^T,   since e_n ⊥ x̂_{n|n-1}   (30.54)
            = E x̃_{n|n-1} (H_n x̃_{n|n-1} + v_n)^T
            = E x̃_{n|n-1} (H_n x̃_{n|n-1})^T + 0,   since v_n ⊥ x̃_{n|n-1}   (30.55)
            = P_{n|n-1} H_n^T   (30.56)

and

E u_n e_n^T = E u_n (H_n x̃_{n|n-1} + v_n)^T
            = 0 + E u_n v_n^T,   since u_n ⊥ x̃_{n|n-1}   (30.57)
            = S_n

we get

K_{p,n} = (E x_{n+1} e_n^T) R_{e,n}^{-1} = (F_n P_{n|n-1} H_n^T + G_n S_n) R_{e,n}^{-1}   (30.58)

30.3.3 Riccati Recursion

Since u_n ⊥ x_n, it can be easily seen from

x_{n+1} = F_n x_n + G_n u_n   (30.59)

that the covariance matrix of x_n, defined by

Π_n ≜ E x_n x_n^T   (30.60)

obeys the recursion

Π_{n+1} = F_n Π_n F_n^T + G_n Q_n G_n^T   (30.61)

This is because

Π_{n+1} ≜ E x_{n+1} x_{n+1}^T = E (F_n x_n + G_n u_n)(F_n x_n + G_n u_n)^T = F_n (E x_n x_n^T) F_n^T + G_n (E u_n u_n^T) G_n^T = F_n Π_n F_n^T + G_n Q_n G_n^T


Likewise, since e_n ⊥ x̂_{n|n-1}, it can be seen from

x̂_{n+1|n} = F_n x̂_{n|n-1} + K_{p,n} e_n   (30.62)

that the covariance matrix of x̂_{n|n-1}, defined by

Σ_n ≜ E x̂_{n|n-1} x̂_{n|n-1}^T   (30.63)

satisfies the recursion

Σ_{n+1} = F_n Σ_n F_n^T + K_{p,n} R_{e,n} K_{p,n}^T,   with initial condition Σ_0 = 0

Now the orthogonal decomposition

x_n = x̂_{n|n-1} + x̃_{n|n-1},   with x̂_{n|n-1} ⊥ x̃_{n|n-1}   (30.64)

shows that

Π_n = Σ_n + P_{n|n-1}   (30.65)

Indeed,

Π_n = E x_n x_n^T = E (x̂_{n|n-1} + x̃_{n|n-1})(x̂_{n|n-1} + x̃_{n|n-1})^T = E x̂_{n|n-1} x̂_{n|n-1}^T + E x̃_{n|n-1} x̃_{n|n-1}^T = Σ_n + P_{n|n-1}   (30.66)

It is then immediate to conclude that the matrix

P_{n+1|n} ≜ Π_{n+1} − Σ_{n+1}   (30.67)

satisfies the recursion

P_{n+1|n} = F_n P_{n|n-1} F_n^T + G_n Q_n G_n^T − K_{p,n} R_{e,n} K_{p,n}^T,   P_{0|-1} = Π_0   (30.68)

which is known as the Riccati recursion. This recursion involves propagating an s × s matrix P_{n|n-1} over time; it is clear from (30.68) that the recursion is nonlinear in the elements of P_{n|n-1} due to the presence of the last term, K_{p,n} R_{e,n} K_{p,n}^T.

30.3.4 Covariance Form

In summary, we arrive at the following statement of the Kalman filter, also known as the covariance form of the filter. This algorithm can be viewed as a filter that takes the correlated sequence of observations {y_n}, decorrelates them, and produces the uncorrelated sequence {e_n}. To do so, the filter iterates a Riccati recursion involving P_{n|n-1} and a state estimator recursion involving x̂_{n|n-1}.

Kalman filter for hidden state estimation and decorrelation.
given observations {y_n} that satisfy the state-space model (30.28)–(30.32);
objective: generate innovations process {e_n} and predictions {x̂_{n|n-1}};
start from x̂_{0|-1} = 0, P_{0|-1} = Π_0.
repeat for n ≥ 0:
    e_n = y_n − H_n x̂_{n|n-1}
    R_{e,n} = R_n + H_n P_{n|n-1} H_n^T
    K_{p,n} = (F_n P_{n|n-1} H_n^T + G_n S_n) R_{e,n}^{-1}
    x̂_{n+1|n} = F_n x̂_{n|n-1} + K_{p,n} e_n
    P_{n+1|n} = F_n P_{n|n-1} F_n^T + G_n Q_n G_n^T − K_{p,n} R_{e,n} K_{p,n}^T
end                                                                  (30.69)
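As an illustration only (not from the text), the recursions in (30.69) translate almost line by line into Python; the sketch below assumes numpy and that the model matrices F, G, H, Q, R, S, Pi0 and the observation sequence are supplied by the user.

import numpy as np

def kalman_predictor(y, F, G, H, Q, R, S, Pi0):
    """Covariance (prediction) form: returns innovations e_n and predictions xhat_{n|n-1}."""
    s = F.shape[0]
    xhat = np.zeros(s)                                  # xhat_{0|-1} = 0
    P = Pi0.copy()                                      # P_{0|-1} = Pi_0
    innovations, predictions = [], []
    for yn in y:
        e = yn - H @ xhat                               # e_n = y_n - H_n xhat_{n|n-1}
        Re = R + H @ P @ H.T                            # R_{e,n}
        Kp = (F @ P @ H.T + G @ S) @ np.linalg.inv(Re)  # K_{p,n}
        innovations.append(e)
        predictions.append(xhat.copy())
        xhat = F @ xhat + Kp @ e                        # state-estimator recursion
        P = F @ P @ F.T + G @ Q @ G.T - Kp @ Re @ Kp.T  # Riccati recursion
    return np.array(innovations), np.array(predictions)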

Example 30.1 (Nonzero means) Sometimes the initial state x_0 does not have zero mean. Let us denote its mean by x̄_0 and its covariance matrix by Π_0, where now Π_0 = E (x_0 − x̄_0)(x_0 − x̄_0)^T. We continue to consider the state-space model:

x_{n+1} = F_n x_n + G_n u_n   (30.70a)
y_n = H_n x_n + v_n   (30.70b)

but allow the noise processes {u_n, v_n} to have known means {ū_n, v̄_n} for generality. In this case, condition (30.32) is replaced by

E \begin{bmatrix} u_n - ū_n \\ v_n - v̄_n \\ x_0 - x̄_0 \end{bmatrix} \begin{bmatrix} u_k - ū_k \\ v_k - v̄_k \\ x_0 - x̄_0 \\ 1 \end{bmatrix}^T = \begin{bmatrix} Q_n δ_{nk} & S_n δ_{nk} & 0 & 0 \\ S_n^T δ_{nk} & R_n δ_{nk} & 0 & 0 \\ 0 & 0 & Π_0 & 0 \end{bmatrix}   (30.71)

It is straightforward to verify that the means of the state and output processes {x_n, y_n} evolve according to the equations:

x̄_{n+1} = F_n x̄_n + G_n ū_n   (30.72a)
ȳ_n = H_n x̄_n + v̄_n   (30.72b)

If we now introduce the centered variables

δx_n ≜ x_n − x̄_n   (30.73)
δy_n ≜ y_n − ȳ_n   (30.74)
δu_n ≜ u_n − ū_n   (30.75)
δv_n ≜ v_n − v̄_n   (30.76)

then these variables satisfy a similar state-space model, albeit one where all variables have zero means, namely,

δx_{n+1} = F_n δx_n + G_n δu_n   (30.77a)
δy_n = H_n δx_n + δv_n   (30.77b)

We can therefore apply the Kalman recursions (30.69) to this model, starting with δx̂_{0|-1} = 0 and P_{0|-1} = Π_0, in order to estimate δx̂_{n+1|n}. This computation would employ the centered observations {δy_n = y_n − ȳ_n}. Then, the desired state estimators are recovered from the correction:

x̂_{n+1|n} = x̄_{n+1} + δx̂_{n+1|n}   (30.78)

Whitening and modeling filters

Examining the covariance form of the Kalman filter, we note that we can associate two state-space models with the filter: One model amounts to a whitening filter and the other to a modeling filter.

Whitening filter. It is clear from the equations for the filter that the following state-space model holds:

x̂_{n+1|n} = (F_n − K_{p,n} H_n) x̂_{n|n-1} + K_{p,n} y_n   (30.79a)
e_n = −H_n x̂_{n|n-1} + y_n   (30.79b)

This model has y_n as input and e_n as output. Since e_n is an uncorrelated sequence with covariance matrix R_{e,n}, we refer to the above model as a whitening filter: Its purpose is to decorrelate the input sequence, y_n.

Modeling filter. It is also clear from the equations for the Kalman filter that the following state-space model holds:

x̂_{n+1|n} = F_n x̂_{n|n-1} + K_{p,n} e_n   (30.80a)
y_n = H_n x̂_{n|n-1} + e_n   (30.80b)

This model has e_n as input and y_n as output. Since e_n is an uncorrelated sequence, we refer to the above filter as a modeling filter: It shows how to generate the sequence y_n from the uncorrelated input sequence, e_n.

Example 30.2 (First-order model) Consider the first-order model:

x(n + 1) = (1/4) x(n) + u(n)   (30.81)
y(n) = (1/2) x(n) + v(n),   n ≥ 0   (30.82)

Comparing with the standard model (30.28)–(30.29), we find that all variables are now scalars (and we are writing x(n) instead of x_n to emphasize its scalar nature). The model coefficients are time-invariant and given by

F = 1/4,  G = 1,  H = 1/2,  Q = 1,  R = 1/2,  S = 0,  Π_0 = 1   (30.83)


Writing down the Kalman recursions we get the following. We start from x̂(0|−1) = 0, p(0|−1) = 1, and repeat for n ≥ 0:

r_e(n) = 1/2 + (1/4) p(n|n−1)   (30.84a)
k_p(n) = ((1/8) p(n|n−1)) / (1/2 + (1/4) p(n|n−1))   (30.84b)
e(n) = y(n) − (1/2) x̂(n|n−1)   (30.84c)
x̂(n+1|n) = (1/4) x̂(n|n−1) + k_p(n) e(n) = (1/4 − (1/2) k_p(n)) x̂(n|n−1) + k_p(n) y(n)   (30.84d)
p(n+1|n) = (1/16) p(n|n−1) + 1 − ((1/64) p²(n|n−1)) / (1/2 + (1/4) p(n|n−1))   (30.84e)

These recursions allow us to evaluate the predictors for y(n) given all prior observations from time 0 up to and including time n − 1:

ŷ(n|n−1) = (1/2) x̂(n|n−1)   (30.85)

Substituting into (30.84d) we get

2 ŷ(n+1|n) = (1/4 − (1/2) k_p(n)) · 2 ŷ(n|n−1) + k_p(n) y(n)   (30.86)

or, delaying by one unit of time:

ŷ(n|n−1) = (1/4 − (1/2) k_p(n−1)) ŷ(n−1|n−2) + (1/2) k_p(n−1) y(n−1)   (30.87)

This is a first-order difference recursion with time-variant coefficients: The input is y(n − 1) and the output is ŷ(n|n−1).
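The scalar recursions of this example are easy to simulate. The following Python snippet (an illustration, not from the text; the number of samples is an arbitrary choice) generates data from model (30.81)–(30.82) and runs (30.84a)–(30.84e).

import numpy as np

rng = np.random.default_rng(1)
Nsamp = 200
x = rng.standard_normal()                     # realization of x(0) with Pi_0 = 1
ys = []
for n in range(Nsamp):
    ys.append(0.5 * x + np.sqrt(0.5) * rng.standard_normal())  # y(n), R = 1/2
    x = 0.25 * x + rng.standard_normal()                        # x(n+1), Q = 1

xhat, p = 0.0, 1.0                            # xhat(0|-1) = 0, p(0|-1) = Pi_0 = 1
for y in ys:
    re = 0.5 + 0.25 * p                       # (30.84a)
    kp = (p / 8) / re                         # (30.84b)
    e = y - 0.5 * xhat                        # (30.84c)
    xhat = 0.25 * xhat + kp * e               # (30.84d)
    p = p / 16 + 1 - (p ** 2 / 64) / re       # (30.84e)
print("steady-state values p, kp:", round(p, 4), round(kp, 4))

Running the loop long enough shows p(n|n−1) and k_p(n) settling at constant values; this steady-state behavior is the subject of Section 30.5 and Example 30.8.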

Deterministic driving terms

In some applications, such as the target-tracking problem studied in Example 30.5, the state-space model (30.28)–(30.32) appears modified with an additional known deterministic (i.e., nonrandom) driving sequence added to the state recursion, such as

x_{n+1} = F_n x_n + G_n u_n + d_n,   n ≥ 0   (30.88)
y_n = H_n x_n + v_n   (30.89)

where d_n denotes the deterministic sequence. The arguments in the previous sections can be repeated to conclude that a minor modification occurs to the Kalman recursions – see Prob. 30.4. Specifically, the s × 1 sequence d_n will now appear included in the recursion for the state estimator – see (30.90d). We again start from x̂_{0|-1} = 0, P_{0|-1} = Π_0, and then repeat for n ≥ 0:

R_{e,n} = R_n + H_n P_{n|n-1} H_n^T   (30.90a)
K_{p,n} = (F_n P_{n|n-1} H_n^T + G_n S_n) R_{e,n}^{-1}   (30.90b)
e_n = y_n − H_n x̂_{n|n-1}   (30.90c)
x̂_{n+1|n} = F_n x̂_{n|n-1} + K_{p,n} e_n + d_n   (30.90d)
P_{n+1|n} = F_n P_{n|n-1} F_n^T + G_n Q_n G_n^T − K_{p,n} R_{e,n} K_{p,n}^T   (30.90e)

Example 30.3 (Estimating a DC level) Consider an unknown zero-mean scalar random variable x(0) with variance π_0. Assume the state realization x(n) remains constant over time across all collected noisy measurements, i.e.,

x(n + 1) = x(n),   n ≥ 0
y(n) = x(n) + v(n)   (30.91)

where v(n) is a zero-mean white-noise process with variance σ_v² and uncorrelated with x(0) for all n. The state equation asserts that the initial state vector remains unchanged over time; thus, any Kalman filtering solution that estimates the state vector from the measurements will in effect be estimating the initial state vector from these same measurements. Models of this type are useful in many contexts. For example, assume we model the sea level as fluctuating around some constant height x(0). The quantities y(n) would correspond to noisy measurements of the sea level collected by a sensor at various time instants n – see Fig. 30.1.

Figure 30.1 Noisy sensor measurements of the sea level over time.

We can write down the Kalman filtering equations for model (30.91) to estimate the state variable from the noisy measurements. Comparing with the standard model (30.28)–(30.29), we find that all variables are now scalars. The model coefficients are time-invariant and given by

F = 1,  G = 0,  H = 1,  Q = 0,  R = σ_v²,  S = 0,  Π_0 = π_0   (30.92)


We start from x̂(0|−1) = 0, p(0|−1) = π_0, and repeat for n ≥ 0:

r_e(n) = σ_v² + p(n|n−1)   (30.93a)
k_p(n) = p(n|n−1) / (σ_v² + p(n|n−1))   (30.93b)
e(n) = y(n) − x̂(n|n−1)   (30.93c)
x̂(n+1|n) = x̂(n|n−1) + k_p(n) e(n)   (30.93d)
p(n+1|n) = p(n|n−1) − p²(n|n−1) / (σ_v² + p(n|n−1))   (30.93e)

A couple of simplifications are possible given the nature of the state-space model in this case, especially since G = 0 and the state variable remains pegged at its initial value x(0). To begin with, since x(n) = x(0) for all n, we can replace the notation x̂(n|n−1) by x̂(0|n−1):

x̂(0|n−1) = x̂(n|n−1)   (30.94)

The term x̂(0|n−1) denotes the estimator for x(0) given the measurements {y(m)} up to time n − 1. Moreover, we can rework recursion (30.93e) and derive a closed-form expression for p(n+1|n) as follows. First note that by grouping the two terms on the right-hand side of (30.93e) we get

p(n+1|n) = 1 / (1/p(n|n−1) + 1/σ_v²)   (30.95)

so that the inverse of p(n+1|n) satisfies the following recursion:

1/p(n+1|n) = 1/p(n|n−1) + 1/σ_v²,   1/p(0|−1) = 1/π_0   (30.96)

Iterating from n = 0 we find that

p^{-1}(n+1|n) = 1/π_0 + (n+1)/σ_v²   (30.97)

so that

p(n+1|n) = σ_v² / ((n+1) + σ_v²/π_0)   (30.98)

and

k_p(n) = 1 / ((n+1) + σ_v²/π_0)   (30.99)

Substituting into (30.93d) we arrive at the following update for estimating the DC level x(0):

x̂(0|n) = ( (n + σ_v²/π_0) / ((n+1) + σ_v²/π_0) ) x̂(0|n−1) + ( 1 / ((n+1) + σ_v²/π_0) ) y(n)   (30.100)

with initial condition x̂(0|−1) = 0. The ratio π_0/σ_v² can be interpreted as a signal-to-noise ratio (SNR) since π_0 corresponds to the variance of the desired signal and σ_v² corresponds to the variance of the interfering noise.
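A minimal Python simulation of update (30.100), assuming numpy and the values π_0 = 2 and σ_v² = 0.01 used in the figure below (illustrative sketch, not from the text), could read:

import numpy as np

rng = np.random.default_rng(2)
pi0, sigv2, N = 2.0, 0.01, 100
x0 = np.sqrt(pi0) * rng.standard_normal()          # constant DC level x(0)
y = x0 + np.sqrt(sigv2) * rng.standard_normal(N)   # noisy measurements y(n)

xhat = 0.0                                         # xhat(0|-1) = 0
for n in range(N):
    kp = 1.0 / ((n + 1) + sigv2 / pi0)             # gain (30.99)
    xhat = (1 - kp) * xhat + kp * y[n]             # update (30.100)
print("true level:", round(x0, 4), " estimate:", round(xhat, 4))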


Figure 30.2 illustrates the result of a simulation using π_0 = 2 and σ_v² = 0.01; these values correspond to an SNR value of

SNR = 10 log_{10}(π_0/σ_v²) ≈ 23 dB   (30.101)

The figure shows the simulated initial state x(0); the noisy measurements y(n) fluctuating around x(0); and the successive estimates x̂(0|n) over the first 100 iterations.

Figure 30.2 Estimation of the initial state value x(0) for a simulation using π_0 = 2 and σ_v² = 0.01, which corresponds to an SNR level of approximately 23 dB.

Example 30.4 (Linear regression) We can generalize the previous example and illustrate how the Kalman filter can be used to solve regression problems. Consider an unknown zero-mean random variable w ∈ IR^M with covariance matrix Π_0. A realization w for w is selected and noisy measurements y(n) are generated according to the model:

x_{n+1} = x_n,   n ≥ 0,   x_0 = w
y(n) = h_n^T x_n + v(n)   (30.102)

where v(n) is a zero-mean white-noise process with variance σ_v² and uncorrelated with x_0 for any n. In the above model, the h_n are given M × 1 column vectors. Once again, the state equation asserts that the initial state vector remains unchanged over time at the initial realization w. For this reason, when we write x̂_{n|n-1} to refer to the linear MSE estimator of x_n given all observations up to time n − 1, we are in effect estimating the model w based on these observations. We denote this estimator by w_{n-1} so that

w_{n-1} ≜ x̂_{n|n-1} = LLMSE of w given {y_0, ..., y_{n-1}}   (30.103)

In a similar vein, we denote the covariance matrix P_{n|n-1} by P_{n-1}. We then write the Kalman recursions using the (w_n, P_n) notation instead of (x̂_{n|n-1}, P_{n|n-1}). We start from w_{-1} = 0, P_{-1} = Π_0 and repeat:

r_e(n) = σ_v² + h_n^T P_{n-1} h_n   (30.104a)
k_{p,n} = P_{n-1} h_n / r_e(n)   (30.104b)
e(n) = y(n) − h_n^T w_{n-1}   (30.104c)
w_n = w_{n-1} + k_{p,n} e(n)   (30.104d)
P_n = P_{n-1} − P_{n-1} h_n h_n^T P_{n-1} / r_e(n)   (30.104e)


Example 30.5 (Tracking a moving target) Our third example involves tracking a moving target. We consider a simplified model and assume the target is moving within the vertical plane. The target is launched from location (x_o, y_o) at an angle θ with the horizontal axis at an initial speed v – see Fig. 30.3. The initial velocity components along the horizontal and vertical directions are

v_x(0) = v cos θ,   v_y(0) = v sin θ   (30.105)

The motion of the object is governed by Newton's equations; the acceleration along the vertical direction is downward and its magnitude is given by g ≈ 10 m/s². The motion along the horizontal direction is uniform (with zero acceleration) so that the horizontal velocity component is constant for all time instants and remains equal to v_x(0):

v_x(t) = v cos θ,   t ≥ 0   (30.106)

For the vertical direction, the velocity component satisfies the equation of motion:

v_y(t) = v sin θ − gt,   t ≥ 0   (30.107)

We denote the location coordinates of the object at any time t by (x(t), y(t)). These coordinates satisfy the differential equations

dx(t)/dt = v_x(t),   dy(t)/dt = v_y(t)   (30.108)

Figure 30.3 The object is launched from location (x_o, y_o) at an angle θ with the horizontal direction. Under idealized conditions, the trajectory is parabolic.

Using noisy measurements of the target location (x(t), y(t)), the objective is to estimate the actual trajectory of the object. We sample the equations every T units of time and write

v_x(n) ≜ v_x(nT) = v cos θ   (30.109a)
v_y(n) ≜ v_y(nT) = v sin θ − ngT   (30.109b)
x(n + 1) = x(n) + T v_x(n)   (30.109c)
y(n + 1) = y(n) + T v_y(n)   (30.109d)

As such, the dynamics of the moving object can be approximated by the following discretized state-space equation:

x_{n+1} = F x_n + d_n,   F = \begin{bmatrix} 1 & 0 & T & 0 \\ 0 & 1 & 0 & T \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix},   x_n = col{x(n), y(n), v_x(n), v_y(n)},   d_n = col{0, 0, 0, −gT}   (30.110)

Note that the state vector x_n in this model involves four entries. Compared with (30.88), we see that the state recursion includes a deterministic driving term and does not include process noise (i.e., G = 0); if desired, we may include a process noise term to model disturbances in the state evolution (such as errors arising from the discretization process). The tracking problem deals with estimating and tracking the state vector x_n based on noisy measurements of the location coordinates (x(n), y(n)) of the object. We denote the measurement vector by z_n and it satisfies:

z_n = H x_n + w_n,   H = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}   (30.111)

where w_n denotes a 2 × 1 zero-mean white-noise process with some covariance matrix R. The entries of the vector z_n are noisy measurements of the x- and y-coordinates of the location of the moving object. Using (30.90a)–(30.90e), the Kalman filter equations for this example are as follows. We start from x̂_{0|-1} = 0, P_{0|-1} = Π_0, and then repeat for n ≥ 0:

R_{e,n} = R + H P_{n|n-1} H^T   (30.112a)
K_{p,n} = F P_{n|n-1} H^T R_{e,n}^{-1}   (30.112b)
e_n = z_n − H x̂_{n|n-1}   (30.112c)
x̂_{n+1|n} = F x̂_{n|n-1} + K_{p,n} e_n + d_n   (30.112d)
P_{n+1|n} = F P_{n|n-1} F^T − K_{p,n} R_{e,n} K_{p,n}^T   (30.112e)

Figure 30.4 illustrates the result of simulating this solution using the following numerical values:

Π_0 = I,   R = diag{0.3, 0.3},   (x_0, y_0) = (1, 30),   v = 15,   T = 0.01,   θ = 60°

It is seen in the bottom plot that the actual and estimated trajectories are close to each other.
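As a sketch only (assuming numpy; the constants mirror the numerical values quoted above and the simulation length is arbitrary), recursions (30.112a)–(30.112e) can be coded as follows.

import numpy as np

rng = np.random.default_rng(4)
T, g, v, theta = 0.01, 10.0, 15.0, np.deg2rad(60)
F = np.array([[1, 0, T, 0], [0, 1, 0, T], [0, 0, 1, 0], [0, 0, 0, 1.0]])
H = np.array([[1, 0, 0, 0], [0, 1, 0, 0.0]])
d = np.array([0, 0, 0, -g * T])                       # deterministic driving term d_n
R = 0.3 * np.eye(2)

x = np.array([1.0, 30.0, v * np.cos(theta), v * np.sin(theta)])   # true initial state
xhat, P = np.zeros(4), np.eye(4)                      # xhat_{0|-1} = 0, P_{0|-1} = I
for n in range(300):
    z = H @ x + np.sqrt(0.3) * rng.standard_normal(2) # noisy position measurement z_n
    Re = R + H @ P @ H.T                              # (30.112a)
    Kp = F @ P @ H.T @ np.linalg.inv(Re)              # (30.112b)
    e = z - H @ xhat                                  # (30.112c)
    xhat = F @ xhat + Kp @ e + d                      # (30.112d)
    P = F @ P @ F.T - Kp @ Re @ Kp.T                  # (30.112e)
    x = F @ x + d                                     # propagate the true trajectory
print("final position error:", np.linalg.norm((x - xhat)[:2]))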

30.4 MEASUREMENT- AND TIME-UPDATE FORMS

Implementation (30.69) is known as the prediction form of the Kalman filter since it relies on propagating the one-step prediction x̂_{n|n-1}. There is an alternative implementation of the Kalman filter, known as the time- and measurement-update form. It relies on going from x̂_{n|n-1} to x̂_{n|n} (a measurement-update step), and from x̂_{n|n} to x̂_{n+1|n} (a time-update step).


Figure 30.4 The top plot shows the actual target trajectory (solid curve) together with the noisy position measurements (dots); the bottom plot shows the actual and estimated trajectories, which essentially overlap.

For the measurement-update step, it can be verified by arguments similar to the ones used in deriving the prediction form that

x̂_{n|n} = x̂_{n|n-1} + K_{f,n} e_n   (30.113)

where

K_{f,n} ≜ (E x_n e_n^T) R_{e,n}^{-1} = P_{n|n-1} H_n^T R_{e,n}^{-1}   (30.114)

with error covariance matrix

P_{n|n} ≜ E x̃_{n|n} x̃_{n|n}^T = P_{n|n-1} − P_{n|n-1} H_n^T R_{e,n}^{-1} H_n P_{n|n-1} = P_{n|n-1} − K_{f,n} R_{e,n} K_{f,n}^T   (30.115)

Likewise, for the time-update step we get

x̂_{n+1|n} = F_n x̂_{n|n} + G_n S_n R_{e,n}^{-1} e_n   (30.116)

with

P_{n+1|n} = F_n P_{n|n} F_n^T + G_n (Q_n − S_n R_{e,n}^{-1} S_n^T) G_n^T − F_n K_{f,n} S_n^T G_n^T − G_n S_n K_{f,n}^T F_n^T   (30.117)


Time- and measurement-update forms of the Kalman filter.
given observations {y_n} that satisfy the state-space model (30.28)–(30.32);
objective: generate innovations process {e_n} and estimators {x̂_{n|n-1}};
start from x̂_{0|-1} = 0, P_{0|-1} = Π_0.
repeat for n ≥ 0:
    e_n = y_n − H_n x̂_{n|n-1}
    R_{e,n} = R_n + H_n P_{n|n-1} H_n^T

    (measurement-update)
    K_{f,n} = P_{n|n-1} H_n^T R_{e,n}^{-1}
    P_{n|n} = P_{n|n-1} − P_{n|n-1} H_n^T R_{e,n}^{-1} H_n P_{n|n-1}
    x̂_{n|n} = x̂_{n|n-1} + K_{f,n} e_n

    (time-update)
    x̂_{n+1|n} = F_n x̂_{n|n} + G_n S_n R_{e,n}^{-1} e_n
    P_{n+1|n} = F_n P_{n|n} F_n^T + G_n (Q_n − S_n R_{e,n}^{-1} S_n^T) G_n^T − F_n K_{f,n} S_n^T G_n^T − G_n S_n K_{f,n}^T F_n^T
end                                                                  (30.118)
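A compact Python rendering of one step of these two updates (an illustration only; the model matrices are assumed given, with S possibly nonzero) might read:

import numpy as np

def measurement_update(xhat_pred, P_pred, y, H, R):
    e = y - H @ xhat_pred
    Re = R + H @ P_pred @ H.T
    Kf = P_pred @ H.T @ np.linalg.inv(Re)                 # K_{f,n}
    xhat_filt = xhat_pred + Kf @ e                        # xhat_{n|n}
    P_filt = P_pred - Kf @ Re @ Kf.T                      # P_{n|n}
    return xhat_filt, P_filt, e, Re, Kf

def time_update(xhat_filt, P_filt, e, Re, Kf, F, G, Q, S):
    Rei = np.linalg.inv(Re)
    xhat_next = F @ xhat_filt + G @ S @ Rei @ e           # xhat_{n+1|n}
    P_next = (F @ P_filt @ F.T + G @ (Q - S @ Rei @ S.T) @ G.T
              - F @ Kf @ S.T @ G.T - G @ S @ Kf.T @ F.T)  # P_{n+1|n}
    return xhat_next, P_next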

Alternative forms

Assume the error covariance matrices are invertible as necessary. Applying the matrix inversion formula (29.89) to the measurement-update relation that updates P_{n|n-1} to P_{n|n} we get, after some algebra left to Prob. 30.9:

(measurement-update)
P_{n|n}^{-1} = P_{n|n-1}^{-1} + H_n^T R_n^{-1} H_n   (30.119a)
P_{n|n}^{-1} x̂_{n|n} = P_{n|n-1}^{-1} x̂_{n|n-1} + H_n^T R_n^{-1} y_n   (30.119b)

Some simplifications occur in the time-update relation when S_n = 0:

(time-update)
P_{n+1|n} = F_n P_{n|n} F_n^T + G_n Q_n G_n^T   (30.120a)
x̂_{n+1|n} = F_n x̂_{n|n}   (30.120b)

Example 30.6 (Gaussian noise processes) The derivation of the Kalman recursions in the earlier sections was based on the MSE formulation and exploited the structure of the state-space model and the orthogonality principle to a great extent. There was no Gaussianity assumption imposed on the noise processes {v n , un } or on the initial state x0 . In other words, the recursions provide the optimal linear state estimators independent of the nature of the distributions for {un , v n , x0 }, as long as these processes satisfy the uncorrelatedness conditions (30.32).


The purpose of this example is to show that the Kalman recursions can be re-derived from a different Bayesian perspective by propagating probability density functions (pdfs). To do so, we impose the Gaussian conditions: v n ∼ Nvn (0, Rn ),

un ∼ Nun (0, Qn ),

x0 ∼ Nx0 (0, Π0 )

(30.121)

The derivation given here is a precursor to the approach considered in Chapter 35 for nonlinear state-space models when we derive the class of particle filters. Thus, consider again the same state-space model (30.28)–(30.32) albeit with the additional Gaussian assumptions (30.121). For simplicity, we assume Sn = 0 so that the noise processes are uncorrelated with each other. Clearly, conditioned on xn we have ∆

xn+1 ∼ Nxn+1 (Fn xn , Gn Qn GTn ) = fxn+1 |xn (xn+1 |xn )

(30.122)

y n ∼ Nyn (Hn xn , Rn ) = fyn |xn (yn |xn )

(30.123)



We are interested in estimating xn from the collection of observations up to time n, which we denote by the notation: ∆

y 0:n = {y m , 0 ≤ m ≤ n}

(30.124)

This task requires that we determine the (posterior) conditional pdfs or beliefs: fxn |y0:n (xn |y0:n )

and fxn |y0:n−1 (xn |y0:n−1 )

(30.125)

The pdf on the left allows us to filter the observations and estimate xn , for example, by computing the mean or mode of the distribution and using either of them as a filtered b n|n , then estimate for xn . For instance, if we denote the mean of the distribution by x its evaluation would require that we compute: ∆

ˆ

b n |n = x

x∈X

xfx|y0:n (x|y0:n )dx

(measurement-update or filtering) (30.126)

over the domain of x ∈ X. The derivation will reveal that fxn |y0:n (xn |y0:n ) is Gaussian b n|n will and, therefore, its mean and mode locations will agree so that the estimator x also correspond to the maximum a-posteriori (MAP) solution in this case. Likewise, the second pdf in (30.125) allows us to predict xn from the past observations y 0:n−1 , for example, by computing the mean or mode of the distribution and using either one of them as a predicted estimate for xn . If we denote the mean of the distribution b n|n−1 , then its evaluation involves by x ∆

b n |n−1 = x

ˆ x∈X

xfx|y0:n−1 (x|y0:n−1 )dx

(time-update or prediction) (30.127)

This relation is also known as the Chapman–Kolmogorov equation, and we will encounter its discrete version later in Section 38.2.4 when we study hidden Markov models. We evaluate the required pdfs as follows. To begin with, using the Bayes rule we have fxn ,yn |y0:n−1 (xn , yn |y0:n−1 )

= fyn |y0:n−1 (yn |y0:n−1 ) fxn |yn ,y0:n−1 (xn |yn , y0:n−1 ) = fyn |y0:n−1 (yn |y0:n−1 ) fxn |y0:n (xn , y0:n )

(30.128)


and, similarly, factoring the same joint pdf in a different order: fxn ,yn |y0:n−1 (xn , yn |y0:n−1 ) = fxn |y0:n−1 (xn |y0:n−1 )fyn |xn ,y0:n−1 (yn |xn , y0:n−1 ) = fxn |y0:n−1 (xn |y0:n−1 )fyn |xn (yn |xn )

(30.129)

Equating (30.128) and (30.129) we arrive at the measurement update equation for computing the first pdf in (30.125):

(measurement-update equation or filtering distribution) ! fyn |xn (yn |xn ) fxn |y0:n (xn |y0:n ) = fxn |y0:n−1 (xn |y0:n−1 ) fyn |y0:n−1 (yn |y0:n−1 )

(30.130)

∝ fyn |xn (yn |xn ) fxn |y0:n−1 (xn |y0:n−1 ) This result allows us to update the conditional pdf of xn that is based on the past measurements y0:n−1 by incorporating the newer measurement yn . Regarding the rightmost term in (30.130), we again appeal to the Bayes rule to write fxn−1 ,xn |y0:n−1 = fxn−1 |y0:n−1 (xn−1 |y0:n−1 ) fxn |xn−1 ,y0:n−1 (xn |xn−1 , y0:n−1 ) = fxn−1 |y0:n−1 (xn−1 |y0:n−1 ) fxn |xn−1 (xn |xn−1 )

(30.131)

Consequently, by marginalizing over xn−1 , we obtain the time-update or prediction relation for computing the second pdf in (30.125):

(time-update equation or predictive distribution) fxn |y0:n−1 (xn |y0:n−1 ) ˆ = fxn |xn−1 (xn |xn−1 )fxn−1 |y0:n−1 (xn−1 |y0:n−1 )dxn−1

(30.132)

xn−1 ∈X

This relation requires computing an integral. Moreover, by comparing with (30.130), we find that these two relations lead to a recursive structure for updating the pdf of xn based on streaming observations. The recursive construction is useful because it enables us to carry out the calculations without the need to store the growing number of observations. Specifically, note the following sequence of updates (where we are removing the subscripts from the pdfs for compactness of notation):

f (xn−1 |y0:n−1 )

(30.132)

− −−−−−−−− → time-update

f (xn |y0:n−1 )

(30.130)

−−−−−−−−−−−−−−−→ measurement-update

f (xn |y0:n )


We now carry out these calculations for the linear state-space model (30.28)–(30.32) with the additional Gaussian assumptions (30.121). b n|n and x b n|n−1 to denote the means of the conTo begin with, we use the notation x ditional pdfs shown in (30.125); the derivation will reveal that these means coincide with the MSE estimators we computed in the earlier sections. For this reason, we are using the same notation. We also denote the respective error covariance matrices by Pn|n and Pn|n−1 . In the derivation that follows we will assume inverses exist whenever needed. At n = 0 we have, in the absence of observations, that the initial conditional pdf is Gaussian: fx0 |y0:−1 (x0 |y0:−1 ) ∼ Nx0 (0, Π0 )

(30.133)

b 0|−1 = 0 and P0|−1 = Π0 . We proceed by induction. Assume, for a generic That is, x n > 0, that the distribution of the state xn given all prior observations is Gaussian with some mean x bn|n−1 and covariance matrix Pn|n−1 , i.e., fxn |y0:n−1 (xn |y1:n−1 ) ∼ Nxn (b xn|n−1 , Pn|n−1 )

(30.134)

Then, using the measurement-update step (30.130) we find fxn |y0:n (xn |y0:n ) ∝ fyn |xn (yn |xn ) fxn |y0:n−1 (xn |y0:n−1 )

xn|n−1 , Pn|n−1 ) = Nyn (Hn xn , Rn ) × Nxn (b n 1 o 1 −1 −1 ∝ exp − (yn − Hn xn )T Rn (xn − x bn|n−1 ) (yn − Hn xn ) − (xn − x bn|n−1 )T Pn|n−1 2 2 n 1 o T −1 T −1 T −1 −1 ∝ exp − xn (Pn|n−1 + Hn Rn Hn )xn − 2xn (Pn|n−1 x bn|n−1 + HnT Rn yn ) 2 (30.135) where we are leaving out terms that do not depend on xn . Next, introduce the quantities ∆

−1 −1 −1 + HnT Rn Hn = Pn|n−1 Pn|n −1 b n|n Pn|n x



=

−1 b n|n−1 Pn|n−1 x

+

−1 HnT Rn yn

(30.136a) (30.136b)

Substituting into (30.135) we can rewrite the updated distribution in the form: n 1 o −1 −1 fxn |y0:n (xn |y0:n ) ∝ exp − xTn Pn|n xn − 2xTn Pn|n x bn|n 2 o n 1 −1 ∝ exp − (xn − x bn|n )T Pn|n (xn − x bn|n ) (30.137) 2 from which we conclude that fxn |y0:n (xn |y0:n ) ∼ Nxn (b xn|n , Pn|n )

(30.138)

On the other hand, using the time-update step (30.132) we have ˆ fxn+1 |y0:n (xn+1 |y0:n ) = fxn+1 |xn (xn+1 |xn )fxn |y0:n (xn |y0:n )dxn x ∈X ˆ n = Nxn+1 (Fn xn , Gn Qn GTn ) × Nxn (b xn|n , Pn|n )dxn xn ∈X

(30.139)


which expresses the updated pdf as the integral of the product of two Gaussian distributions. We know from (30.131) that the product amounts to the joint conditional pdf of {xn+1 , xn } given the observations y 0:n . Using the result of Example 4.6, we know that this joint pdf is a Gaussian distribution as well, and given by fxn ,xn+1 |y0:n (xn+1 , xn |y0:n )    x bn|n Pn|n ∼ Nxn ,xn+1 , Fn x bn|n Fn Pn|n

Pn|n FnT Fn Pn|n FnT + Gn Qn GTn

! (30.140)

It follows that the marginal distribution of xn+1 conditioned on y 0:n is given by fxn+1 |y0:n (xn+1 |y0:n ) ∼ Nxn+1 (b xn+1|n , Pn+1|n )

(30.141)

where b n+1|n = Fn x b n|n x Pn+1|n =

30.5 STEADY-STATE FILTER

One distinctive feature of the Kalman filter is that, under appropriate conditions on the dynamics of the state-space model, the filter ends up converging to a stable steady-state operation as time progresses. Assume the state-space model (30.28)–(30.32) is time-invariant, so that its parameters do not change with time, and assume also that operation started in the remote past:

x_{n+1} = F x_n + G u_n,   n > −∞   (30.143)
y_n = H x_n + v_n   (30.144)

with

E \begin{bmatrix} u_n \\ v_n \\ x_0 \end{bmatrix} \begin{bmatrix} u_m \\ v_m \\ x_0 \\ 1 \end{bmatrix}^T = \begin{bmatrix} Q δ_{nm} & S δ_{nm} & 0 & 0 \\ S^T δ_{nm} & R δ_{nm} & 0 & 0 \\ 0 & 0 & Π_0 & 0 \end{bmatrix}   (30.145)

It follows from the state-space model that the transfer matrix function from u_n to y_n is given by

H_{uy}(z) = H (zI − F)^{-1} G   (30.146)

while the transfer function from v_n to y_n is H_{vy}(z) = 1. Assuming F is a stable matrix (meaning all its eigenvalues are inside the unit circle), then H_{uy}(z) will be a bounded-input bounded-output (BIBO) stable mapping (i.e., it maps bounded input sequences {u_n} to bounded output sequences {y_n}). The processes {u_n, v_n} are assumed to be wide-sense stationary and, hence, {y_n} will also be wide-sense stationary.


30.5.1 Solutions of DARE

The Kalman recursions that correspond to the above model are given by:

R_{e,n} = R + H P_{n|n-1} H^T   (30.147a)
K_{p,n} = (F P_{n|n-1} H^T + G S) R_{e,n}^{-1}   (30.147b)
e_n = y_n − H x̂_{n|n-1}   (30.147c)
x̂_{n+1|n} = F x̂_{n|n-1} + K_{p,n} e_n   (30.147d)
P_{n+1|n} = F P_{n|n-1} F^T + G Q G^T − K_{p,n} R_{e,n} K_{p,n}^T   (30.147e)

There are conditions under which the Riccati recursion (30.147e) can be shown to converge to a unique matrix, P. Discussion of these technical conditions is beyond the scope of this text – see though the comments at the end of the chapter. It suffices to consider here the case when S = 0 and F is stable. In this case, the Riccati recursion can be shown to converge, as n → ∞, to a positive semi-definite matrix P that satisfies the so-called discrete algebraic Riccati equation (DARE):

P = F P F^T + G Q G^T − K_p R_e K_p^T   (30.148a)

where

R_e = R + H P H^T   (30.148b)
K_p = F P H^T R_e^{-1}   (30.148c)

Moreover, the resulting closed-loop matrix F − K_p H will be stable (i.e., will have all its eigenvalues inside the unit circle). The DARE (30.148a) is a nonlinear algebraic equation in the unknown P. In general, the DARE may have many solutions P, including the possibility of many positive semi-definite solutions. None of these solutions need stabilize the closed-loop matrix F − K_p H (when this happens, we say that P is a stabilizing solution). However, when S = 0 and F is stable, it can be shown that the DARE will have a unique positive semi-definite solution P and that this solution will be the only stabilizing solution. If S ≠ 0, then it is sufficient to require instead the matrix F_s = F − G S R^{-1} H to be stable for the same convergence conclusion to hold; in this case, expression (30.148c) for K_p would be replaced by

K_p = (F P H^T + G S) R_e^{-1},   when S ≠ 0   (30.149)

Example 30.7 (Multiple solutions to DARE) We illustrate the varied behavior of the DARE by considering several situations corresponding to scalar model parameters: (a) (Multiple nonnegative-definite solutions but none is stabilizing): F = 1, G = 0, H = 1, Q = 1, S = 0, and R = 1. In this case, the DARE (30.148a) reduces to P =P +0−

P2 P = 1+P 1+P

(30.150)

which leads to the quadratic equation P 2 = 0. Therefore, in this case, we have two solutions at P = 0. For these solutions, Kp = 0 and, hence, the closed-loop


matrix Fp is given by Fp = 1. Thus, although the DARE has multiple nonnegative solutions, none of them is stabilizing. (Multiple nonnegative-definite solutions with one being stabilizing): F = 2, G = 0, H = 1, Q = 1, S = 0, and R = 1. In this case, the DARE (30.148a) reduces to P = 4P + 0 −

(c)

4P 2 4P = 1+P 1+P

(30.151)

which leads to the quadratic equation P 2 −3P = 0 with two nonnegative solutions P1 = 0 and P2 = 3. For the solution P2 we obtain Kp = 3/2 and Fp = 1/2. Therefore, this solution is stabilizing. It is easy to verify that the solution P1 = 0 is not stabilizing. Thus, we have an example of a DARE with multiple nonnegative solutions but not all of them are stabilizing. (Solutions can be negative-definite): F = 1/2, G = 0, H = 1, Q = 1, S = 0, and R = 1. In this case, the DARE (30.148a) reduces to P =

(d)

1179

P P2 P − = 4 4(1 + P ) 4(1 + P )

(30.152)

which leads to the quadratic equation 4P 2 + 3P = 0 with solutions at P1 = 0 and P2 = −3/4. It is easy to see that P1 is the stabilizing solution in this case. We therefore have an example showing that not all solutions of the DARE need to be nonnegative-definite. (Stability of F is not sufficient when S 6= 0): F = 0, G = 1, H = 1, Q = 1, S = 1, and R = 1, for which F is stable but S is nonzero. The stability of F is not sufficient to ensure the existence of a stabilizing solution to the DARE. Indeed, in this case, the DARE (30.148a) reduces to P =1−

P 1 = 1+P 1+P

(30.153)

which leads to the quadratic equation P 2 = 0 with two solutions at P = 0. None of these solutions is stabilizing because they lead to Fp = −1.

30.5.2 Spectral Factorization

Let us now focus on the case in which the sequences {y_n, v_n} are scalar sequences, replaced by {y(n), v(n)}; the discussion can be easily extended to vector processes but it is sufficient for our purposes to study scalar-valued output processes. In this situation, the H matrix becomes a row vector, say h^T, and the covariance matrix R becomes a scalar, say r. Additionally, the innovations process e_n becomes scalar-valued, say, e(n), with variance r_e in the steady state, and the gain matrix K_p becomes a column vector, k_p. Writing down the resulting modeling filter (30.80a)–(30.80b) in the steady state, as n → ∞, we have

x̂_{n+1|n} = F x̂_{n|n-1} + k_p e(n)   (30.154)
y(n) = h^T x̂_{n|n-1} + e(n)   (30.155)

The transfer function from e(n) to y(n) is the causal modeling filter:

L(z) = 1 + h^T (zI − F)^{-1} k_p   (30.156)


This is a stable transfer function since its poles are given by the eigenvalues of F and these eigenvalues lie inside the unit disc. Accordingly, y(n) is a wide-sense stationary process and its z-spectrum is given by (recall the definition from Section 7.3):

S_y(z) = r_e L(z) [L(1/z^*)]^*   (30.157)

where * denotes the complex conjugation symbol. We refer to L(z) as the canonical spectral factor of S_y(z). We therefore find that the steady-state Kalman equations allow us to determine the canonical spectral factor, L(z), of the process {y(n)} (since they allow us to determine k_p). If we similarly consider the whitening filter (30.79a)–(30.79b) in the steady state, we have

x̂_{n+1|n} = (F − k_p h^T) x̂_{n|n-1} + k_p y(n)   (30.158)
e(n) = −h^T x̂_{n|n-1} + y(n)   (30.159)

This model allows us to determine the transfer function from y(n) to e(n), which is the inverse of L(z):

L^{-1}(z) = 1 − h^T [zI − (F − k_p h^T)]^{-1} k_p   (30.160)

This result is also known as the Wiener filter. Note that both L(z) and its inverse are stable transfer functions, which is the reason for the qualification canonical spectral factor.

Example 30.8 (Spectral factorization) Let us reconsider Example 30.2 and examine what happens in the steady state as n → ∞. Since F = 1/4 is a stable matrix, the Riccati recursion will converge to a positive value p that satisfies the DARE:

p = (1/16) p + 1 − ((1/64) p²) / (1/2 + (1/4) p)   (30.161)

This equality leads to a quadratic expression in p:

8p² + 7p − 16 = 0   (30.162)

with one negative root and one positive root. We pick the positive root since p is the error variance (and, hence, cannot be negative):

p = (−7 + √561)/16   (30.163)

With this value for p, the scalar gain k_p(n) will converge to the value

k_p = p/(4 + 2p) = (−23 + √561)/4   (30.164)

so that, in the steady state as n → ∞, we arrive at the time-invariant filter:

ŷ(n|n−1) = ((25 − √561)/8) ŷ(n−1|n−2) + ((−23 + √561)/8) y(n−1)   (30.165)

30.6 SMOOTHING FILTERS

The Kalman filter allows us to compute the predicted and filtered state estimators, x̂_{n|n-1} and x̂_{n|n}. However, computation of smoothed estimators, such as x̂_{n|N} for n < N, is not as direct and will require two passes over the data: a forward pass and a backward pass. There are different variations of the smoothing problem, including fixed-interval, fixed-lag, and fixed-point smoothing. In fixed-interval smoothing, we determine x̂_{n|N}, which estimates x_n from the observations {y_0, y_1, ..., y_N} over the fixed interval 0 ≤ n ≤ N. If we fix n at some n = n_o and increase N, then we have a fixed-point smoothing formulation. If, on the other hand, we allow both n and N to vary according to the relation N = n + L, for a fixed L > 0, then we have a fixed-lag smoothing problem for determining x̂_{n|n+L}. We discuss the fixed-interval problem in this section, and leave the results for fixed-point and fixed-lag smoothers to Probs. 30.14 and 30.15.

30.6.1

Fixed-Interval Smoothing We introduce the cross-covariance matrices for the predicted estimators {e xn|n−1 }, namely, ∆

eT e n|n−1 x Pn,m = E x m|m−1 ∆

eT e n|n−1 x Pn,n = Pn|n−1 = E x n|n−1

(30.166a) (30.166b)

Now, from the general result (30.21) and by the orthogonality of the innovations process, we can express the desired fixed-interval smoother in the form: b n|N = x

N X

m=0

 −1 E xn eT m Re,m em

(30.167)

and split the sum into two parts to obtain b n|N = x b n|n−1 + x

N X

m=n

 −1 E xn eT m Re,m em

(30.168)

Using (30.51), we can be more explicit about the cross-covariance terms E xn eT m for m ≥ n:   T T T e E xn eT = E x x n m m|m−1 Hm + E xn v m     T T T e n|n−1 x eT b n|n−1 x eT = Ex Ex m|m−1 Hm + m|m−1 Hm + E xn v m

(30.169)

However, for m ≥ n, we have that E xn v T m = 0 by assumption, and e m|m−1 ⊥ {y 0 , . . . , y n , . . . , y m } x

(30.170)


e m|m−1 ⊥ x b n|n−1 . Hence, from which we conclude that x   T T e n|n−1 x eT E xn eT m = E x m|m−1 Hm = Pn,m Hm ,

m≥n

b n|n−1 for m ≥ n, we also conclude that and, noting that em ⊥ x T T e n|n−1 eT Ex m = E xn em = Pn,m Hm ,

m≥n

(30.171)

(30.172)

It follows that (30.168) reduces to

b n|N = x b n|n−1 + x

N X

T −1 Pn,m Hm Re,m em ,

m=n

0≤n≤N

(30.173)

Subtracting xn from both sides of (30.173) we get e n|N = x e n|n−1 − x

N X

T −1 Pn,m Hm Re,m em

(30.174)

m=n

so that computing the error covariance matrix ∆

eT e n|N x Pn|N = E x n|N

(30.175)

e n|n−1 for m ≥ n, we find that Pn|N and recalling from (30.172) that em ⊥ x satisfies: Pn|N = Pn|n−1 −

N X

T −1 Pn,m Hm Re,m Hm Pn,m

(30.176)

m=n

Example 30.9 (Alternative smoothing formulas) We could have obtained a similar decomposition to (30.168) using filtered estimators {b xn|n } rather than predicted estimators {b xn|n−1 }. Repeating the arguments we would arrive instead at the following relations: b n|N = x b n|n + x

N X

T −1 Pn,m Hm Re,m em

(30.177a)

T −1 Pn,m Hm Re,m Hm Pn,m

(30.177b)

m=n+1

Pn|N = Pn|n −

30.6.2

N X m=n+1

Bryson–Frazier Form We can rewrite the smoothing relations (30.173) and (30.176) in an alternative more revealing form by exploiting the state-space structure to evaluate the crosscovariances Pn,m more explicitly. Indeed, it holds that Pn,m = Pn|n−1 ΦT p (m, n),

m≥n

(30.178)


for m > n, for m = n,

1183

(30.179)

in terms of products of successive closed-loop matrices, Fp,n = Fn − Kp,n Hn . Proof of (30.178): Subtracting the state model (30.29) from the Kalman filter recursion b n+1|n in (30.44) and using (30.51), we find that for x e n+1|n = Fp,n x e n|n−1 + Gn un − Kp,n v n x

(30.180)

Therefore, by iterating for m ≥ n, we obtain e m|m−1 = Φp (m, n)e x xn|n−1 +

m−1 X k=n

Φp (m, k + 1) (Gk uk − Kp,k v k ) , m ≥ n (30.181)

from which it follows that ∆

e n|n−1 x e Tm|m−1 Pn,m = E x   e n|n−1 x e Tn|n−1 ΦTp (m, n) + 0 = Ex = Pn|n−1 ΦTp (m, n),

m≥n

(30.182)

 Now substituting (30.178) into (30.173) we can rewrite the latter expression in the form b n|N = x b n|n−1 + Pn|n−1 λn|N , x

0≤n≤N

(30.183)

where we introduced the variable: ∆

λn|N =

N X

T −1 ΦT p (m, n)Hm Re,m em

(30.184)

m=n

The formula (30.179) for Φp (m, n) suggests that λn|N can be computed via the backwards-time recursion: T −1 λn|N = Fp,n λn+1|N + HnT Re,n en ,

λN +1|N = 0

(30.185)

If we now subtract xn from both sides of (30.183) we find that the smoothing error satisfies e n|N = x e n|n−1 − Pn|n−1 λn|N x

(30.186)

which we can rewrite more conveniently as

e n|n−1 = x e n|N + Pn|n−1 λn|N , since λn|N ⊥ x e n|N x

(30.187)


e n|N ⊥ {e0 , e1 , . . . , eN }, while the variable λn|N is formed The reason is that x by combining the innovations {en , en+1 , . . . , eN }. If we denote the covariance matrix of λn|N by ∆

Λn|N = E λn|N λT n|N

(30.188)

then we conclude from (30.187) that T Pn|n−1 = Pn|N + Pn|n−1 Λn|N Pn|n−1

(30.189)

T Pn|N = Pn|n−1 − Pn|n−1 Λn|N Pn|n−1

(30.190)

or, equivalently,

Moreover, from (30.185) we immediately have that T −1 Λn|N = Fp,n Λn+1|N Fp,n + HnT Re,n Hn , ΛN +1|N = 0

(30.191)

In summary, we arrive at the following Bryson–Frazier smoothing recursions: b n|N = x b n|n−1 + Pn|n−1 λn|N , x

λn|N = Λn|N =

0≤n≤N

T −1 Fp,n λn+1|N + HnT Re,n en , λN +1|N = 0 T T −1 Fp,n Λn+1|N Fp,n + Hn Re,n Hn , ΛN +1|N =

(30.192a) (30.192b) 0

Pn|N = Pn|n−1 − Pn|n−1 Λn+1|N Pn|n−1

(30.192c) (30.192d)

The resulting smoothing filter, including both the forward and backward passes, is listed in (30.194). Note that the Bryson–Frazier algorithm involves two passes over the data: On a forward pass, we compute the innovations and the predicted state estimators along with the error covariance matrices and the closed-loop matrices, i.e., the b n|n−1 , Re,n , Pn|n−1 , Fp,n }, starting from x b 0|−1 = 0 and P0|−1 = quantities {en , x Πo . Then a backward pass uses the innovations to compute the variables, λn|N +1 , b n|n−1 and and their covariances, Λn|N . Finally, an appropriate combination of x b n|N . λn|N gives the smoothed estimator x It is not difficult to verify that we could have derived alternative Bryson– Frazier smoothing expressions in terms of filtered estimators as follows: T b n|N = x b n|n + Pn|n−1 Fp,n x λn+1|N ,

λn|N = Λn|N =

Pn|N =

0≤n≤N

T −1 Fp,n λn+1|N + HnT Re,n en , λN +1|N = 0 T T −1 Fp,n Λn+1|N Fp,n + Hn Re,n Hn , ΛN +1|N = T Pn|n − Pn|n−1 Fp,n Λn+1|N Fp,n Pn|n−1

(30.193a) (30.193b) 0

(30.193c) (30.193d)


Bryson–Frazier fixed-interval smoothing filter.
given observations {y_n} that satisfy model (30.28)–(30.32);
given a fixed interval duration N;
objective: generate the smoothed estimators {x̂_{n|N}};
start from x̂_{0|-1} = 0, P_{0|-1} = Π_0, λ_{N+1|N} = 0, Λ_{N+1|N} = 0.

(forward pass)
repeat for n = 0, 1, ..., N − 1, N:
    e_n = y_n − H_n x̂_{n|n-1}
    R_{e,n} = R_n + H_n P_{n|n-1} H_n^T
    K_{p,n} = (F_n P_{n|n-1} H_n^T + G_n S_n) R_{e,n}^{-1}
    x̂_{n+1|n} = F_n x̂_{n|n-1} + K_{p,n} e_n
    P_{n+1|n} = F_n P_{n|n-1} F_n^T + G_n Q_n G_n^T − K_{p,n} R_{e,n} K_{p,n}^T
end

(backward pass)
repeat for n = N, N − 1, ..., 1, 0:
    λ_{n|N} = F_{p,n}^T λ_{n+1|N} + H_n^T R_{e,n}^{-1} e_n
    x̂_{n|N} = x̂_{n|n-1} + P_{n|n-1} λ_{n|N}
    Λ_{n|N} = F_{p,n}^T Λ_{n+1|N} F_{p,n} + H_n^T R_{e,n}^{-1} H_n
    P_{n|N} = P_{n|n-1} − P_{n|n-1} Λ_{n|N} P_{n|n-1}
end                                                                  (30.194)
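For illustration (not from the text, and assuming numpy with model matrices supplied by the caller), the two passes of the Bryson–Frazier smoother can be sketched as follows; the forward pass stores the quantities needed by the backward recursion for λ_{n|N}.

import numpy as np

def bryson_frazier(y, F, G, H, Q, R, S, Pi0):
    s = F.shape[0]
    xp, Pp, ev, Rev, Fpv = [], [], [], [], []
    xhat, P = np.zeros(s), Pi0.copy()
    for yn in y:                                           # forward (Kalman) pass
        e = yn - H @ xhat
        Re = R + H @ P @ H.T
        Kp = (F @ P @ H.T + G @ S) @ np.linalg.inv(Re)
        xp.append(xhat.copy()); Pp.append(P.copy())
        ev.append(e); Rev.append(Re); Fpv.append(F - Kp @ H)
        xhat = F @ xhat + Kp @ e
        P = F @ P @ F.T + G @ Q @ G.T - Kp @ Re @ Kp.T
    lam = np.zeros(s)                                      # lambda_{N+1|N} = 0
    xs = [None] * len(y)
    for n in reversed(range(len(y))):                      # backward pass
        lam = Fpv[n].T @ lam + H.T @ np.linalg.solve(Rev[n], ev[n])  # lambda_{n|N}
        xs[n] = xp[n] + Pp[n] @ lam                        # smoothed estimator xhat_{n|N}
    return np.array(xs)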

30.7 ENSEMBLE KALMAN FILTER

Many applications involve large-dimensional state and measurement vectors, with the dimensions of x_n (s × 1) and y_n (p × 1) running into the tens of thousands or even millions of entries. One example is weather forecasting, where s ≈ 10^7 (state dimension) and p ≈ 10^5 (measurement dimension). The ensemble Kalman filter is a popular technique that is well suited for such large-dimensional models. It relies on propagating ensembles (i.e., collections) of state variables and on estimating the state and error covariance matrices by means of sample calculations. We describe the filter for the case of linear state-space models with Gaussian noise processes. As the presentation will reveal, the arguments extend with minimal adjustments to other noise distributions (other than Gaussian) and to nonlinear state-space models.

ENSEMBLE KALMAN FILTER Many applications involve large-dimensional state and measurement vectors with the dimensions of xn (s × 1) and y n (p × 1) running into the tens of thousands or even millions of entries. One example is weather forecasting where s ≈ 107 (state dimension) and p ≈ 105 (measurement dimension). The ensemble Kalman filter is a popular technique that is well suited for such large-dimensional models. It relies on propagating ensembles (i.e., collections) of state variables and on estimating the state and error covariance matrices by means of sample calculations. We describe the filter for the case of linear state-space models with Gaussian noise processes. As the presentation will reveal, the arguments extend with minimal adjustments to other noise distributions (other than Gaussian) and to nonlinear state-space models.


Consider the state-space model: xn+1 = Fn xn + Gn un y n = Hn xn + v n

(xn : s × 1)

(y n : p × 1)

(30.195a) (30.195b)

where we now assume, for simplicity, that the noises are Gaussian-distributed and uncorrelated with each other so that 

  T un  Qn δnk u k  vn   0   = E  x0  v k  0 x0 1 0

0 Rn δnk 0 0

 0 0   Π0  0

(30.195c)

We know from listing (30.118) and the result of Prob. 30.10 that the time and measurement-update form of the Kalman filter for this model is given by the b 0|−1 = 0, P0|−1 = Π0 , and repeat: following relations. Start from x                    

b n|n−1 = Hn x b n|n−1 y b n|n−1 en = y n − y Re,n = Rn + Hn Pn|n−1 HnT

(measurement-update) −1 Kf,n = Pn|n−1 HnT Re,n T Pn|n = (I − Kf,n Hn )Pn|n−1 (I − Kf,n Hn )T + Kf,n Rn Kf,n b n|n = (I − Kf,n Hn )b x xn|n−1 + Kf,n y n

            (time-update)     b n+1|n = Fn x b n|n x    Pn+1|n = Fn Pn|n FnT + Gn Qn GT n

(30.196)

We further know from definition (30.114) that the gain matrix Kf,n corresponds to the following calculation in terms of covariance and cross-covariance quantities associated with the zero-mean random variables {xn , en }: Kf,n

(30.197)



T −1 = (E xn eT n ) (E en en ) T −1 e n|n−1 eT b n|n−1 + x e n|n−1 , en ⊥ x b n|n−1 = (E x , since xn = x n ) (E en en )    −1 b n|n−1 )(y n − y b n|n−1 )T × E (y n − y b n|n−1 )(y n − y b n|n−1 )T = E (xn − x

The ensemble Kalman filter will exploit this expression to approximate Kf,n . For now, observe that implementation (30.196) relies on the propagation of the s × s error covariance matrices {Pn|n , Pn+1|n }, which is computationally demanding when s is large. The ensemble Kalman filter employs a low-rank approximation for these quantities.


Filtered estimates Assume we have available an ensemble of state estimates for xn at time n, denoted by n o (1) (2) (L) xn|n , xn|n , . . . , xn|n , L  s (30.198) where L is much smaller that the state dimension, s. We explain in the following how these realizations are generated. Once they become available, we compute their sample mean and (unbiased) covariance matrix: ∆

x ¯n|n =

L

1 X (`) xn|n L

(30.199a)

`=

∆ P¯n|n =

T  1 X (`) (`) ¯n|n xn|n − x ¯n|n xn|n − x L−1 L

(30.199b)

`=1

and use these quantities to serve as the filtered state estimate, x bn|n , and its error covariance matrix: x bn|n ≈ x ¯n|n ,

Pn|n ≈ P¯n|n

(30.200)

Observe that P¯n|n is a low-rank matrix; its rank is L  s, while the size of P¯n|n is s × s.

Time-update or forecast step Next, we generate L realizations for the state noise process, denoted by u(`) n ∼ Nun (0, Qn ),

` = 1, 2, . . . , L

(30.201)

We are employing the Gaussian distribution here because the noise process un is assumed to be Gaussian. However, other distributions are possible, in which (`) case we would need to sample the {un } from the alternative distribution. Using (`) (`) the realizations {un } and the ensemble {xn|n }, we propagate these variables through the state equation to get a new ensemble of state realizations denoted (`) by {xn+1|n }: (`)

(`)

xn+1|n = Fn xn|n + Gn u(`) n ,

` = 1, 2, . . . , L

(30.202)

This step, which amounts to a time-update, is called forecasting in the ensemble Kalman filtering literature. Again, we compute their sample mean and (unbiased) covariance matrix: ∆

x ¯n+1|n =

L

1 X (`) xn+1|n L

(30.203a)

`=

∆ P¯n+1|n =

 T 1 X (`) (`) xn+1|n − x ¯n+1|n xn+1|n − x ¯n+1|n L−1 L

`=1

(30.203b)


and use these quantities to serve as the predicted state estimate, x bn+1|n , and its error covariance matrix: Pn+1|n ≈ P¯n+1|n

x bn+1|n ≈ x ¯n+1|n ,

(30.204)

The matrix P¯n+1|n is again low rank; its rank is L  s, while the size of P¯n+1|n is s × s.

Measurement-update or analysis step
We now explain how to generate the original ensemble {x^{(ℓ)}_{n|n}} from (30.198). For this purpose, we sample L realizations for the measurement noise process, denoted by

v^{(ℓ)}_n ∼ N_{v_n}(0, R_n),  ℓ = 1, 2, ..., L    (30.205)

Here, again, other distributions (other than the Gaussian) could be used. Then, motivated by the form of the measurement-update in (30.196), we use these noise samples to transform the ensemble of predicted states {x^{(ℓ)}_{n|n−1}} into an ensemble of filtered states {x^{(ℓ)}_{n|n}}:

x^{(ℓ)}_{n|n} = (I − K_{f,n} H_n) x^{(ℓ)}_{n|n−1} + K_{f,n} ( y_n − v^{(ℓ)}_n ),  ℓ = 1, 2, ..., L    (30.206)

where we still need to explain how to compute the gain matrix K_{f,n}. Observe that, in comparison with (30.196), an additional noise term v^{(ℓ)}_n appears added to the right-hand side of the above expression. This noise perturbation is needed here. Otherwise, if it were absent and we relied instead on an update of the form

x^{(ℓ)}_{n|n} = (I − K_{f,n} H_n) x^{(ℓ)}_{n|n−1} + K_{f,n} y_n    (30.207)

where we view the observation y_n as a given deterministic quantity, then

E x^{(ℓ)}_{n|n} = (I − K_{f,n} H_n) E x^{(ℓ)}_{n|n−1} + K_{f,n} y_n,  ℓ = 1, 2, ..., L    (30.208)

Subtracting (30.208) from (30.207) and equating the covariance matrices of both sides of the equality, we get

cov( x^{(ℓ)}_{n|n}, x^{(ℓ)}_{n|n} ) = (I − K_{f,n} H_n) cov( x^{(ℓ)}_{n|n−1}, x^{(ℓ)}_{n|n−1} ) (I − K_{f,n} H_n)^T    (30.209)

Comparing with the update from P_{n|n−1} to P_{n|n} in listing (30.196), we observe that the last term K_{f,n} R_n K_{f,n}^T is missing. The perturbation by the noise realizations v^{(ℓ)}_n in (30.206) helps ensure that this term is present. Continuing with (30.206), if we introduce the quantity

y^{(ℓ)}_{n|n−1} = H_n x^{(ℓ)}_{n|n−1} + v^{(ℓ)}_n    (30.210)

then the measurement-update or analysis step (30.206) can be rewritten in the equivalent form

x^{(ℓ)}_{n|n} = x^{(ℓ)}_{n|n−1} + K_{f,n} ( y_n − y^{(ℓ)}_{n|n−1} ),  ℓ = 1, 2, ..., L    (30.211)


Gain matrix
It remains to explain how to compute K_{f,n}. We first denote the sample mean of the output realizations computed via (30.210) by

ȳ_{n|n−1} = (1/L) Σ_{ℓ=1}^{L} y^{(ℓ)}_{n|n−1}    (30.212)

and use this quantity to approximate the covariance and cross-covariance quantities that appear in expression (30.197):

R_{e,n} = E e_n e_n^T ≈ (1/(L−1)) Σ_{ℓ=1}^{L} ( y^{(ℓ)}_{n|n−1} − ȳ_{n|n−1} )( y^{(ℓ)}_{n|n−1} − ȳ_{n|n−1} )^T    (30.213a)

R_{xe,n} = E x_n e_n^T ≈ (1/(L−1)) Σ_{ℓ=1}^{L} ( x^{(ℓ)}_{n|n−1} − x̄_{n|n−1} )( y^{(ℓ)}_{n|n−1} − ȳ_{n|n−1} )^T    (30.213b)

Then, we compute K_{f,n} by solving the linear system of equations K_{f,n} R_{e,n} = R_{xe,n}. In this implementation of the ensemble Kalman filter, the noise realizations {v^{(ℓ)}_n} influence performance in two locations: they perturb the measurement y_n in (30.206), and they also affect the value of K_{f,n} through ȳ_{n|n−1}. There is an alternative way to reduce the effect of these noise realizations by approximating R_{e,n} in a different manner. For this purpose, we introduce the ensemble errors (also called ensemble anomalies):

x̃^{(ℓ)}_{n|n−1} = x^{(ℓ)}_{n|n−1} − x̄_{n|n−1},  ℓ = 1, 2, ..., L    (30.214)

and use the expression R_{e,n} = R_n + H_n P_{n|n−1} H_n^T to motivate the following approximation (see also Prob. 30.20):

R_{e,n} ≈ R_n + (1/(L−1)) Σ_{ℓ=1}^{L} H_n x̃^{(ℓ)}_{n|n−1} ( x̃^{(ℓ)}_{n|n−1} )^T H_n^T    (30.215)

One advantage of this computation is that R_{e,n} is nonsingular since R_n > 0. In summary, we arrive at listing (30.217) for the ensemble Kalman filter.

Remark 30.1. (Nonlinear models) Minor adjustments are necessary to handle nonlinear state-space models of the form considered in the next section in (30.218a)–(30.218b). In this case, the ensemble variables { y^{(ℓ)}_{n|n−1}, x^{(ℓ)}_{n+1|n} } appearing in the algorithm would be evaluated by computing instead:

x^{(ℓ)}_{n+1|n} = f( x^{(ℓ)}_{n|n} ) + g( x^{(ℓ)}_{n|n} ) u^{(ℓ)}_n    (30.216a)

y^{(ℓ)}_{n|n−1} = h( x^{(ℓ)}_{n|n−1} ) + v^{(ℓ)}_n    (30.216b)


Ensemble Kalman filter for large-dimensional models (listing (30.217)).

given observations {y_n} that satisfy model (30.195a)–(30.195c);
objective: generate state estimates { x̂_{n|n−1}, x̂_{n|n} };
start from L ≪ s realizations { x^{(ℓ)}_{0|−1} }, ℓ = 1, 2, ..., L.
repeat for n ≥ 0:
  x̂_{n|n−1} = (1/L) Σ_{ℓ=1}^{L} x^{(ℓ)}_{n|n−1}
  x̃^{(ℓ)}_{n|n−1} = x^{(ℓ)}_{n|n−1} − x̂_{n|n−1},  ℓ = 1, 2, ..., L
  P_{n|n−1} = (1/(L−1)) Σ_{ℓ=1}^{L} x̃^{(ℓ)}_{n|n−1} ( x̃^{(ℓ)}_{n|n−1} )^T
  R_{e,n} ≈ R_n + H_n P_{n|n−1} H_n^T
  v^{(ℓ)}_n ∼ N_{v_n}(0, R_n),  ℓ = 1, 2, ..., L
  y^{(ℓ)}_{n|n−1} = H_n x^{(ℓ)}_{n|n−1} + v^{(ℓ)}_n,  ℓ = 1, 2, ..., L
  ŷ_{n|n−1} = (1/L) Σ_{ℓ=1}^{L} y^{(ℓ)}_{n|n−1}
  e^{(ℓ)}_n = y^{(ℓ)}_{n|n−1} − ŷ_{n|n−1},  ℓ = 1, 2, ..., L
  R_{xe,n} ≈ (1/(L−1)) Σ_{ℓ=1}^{L} x̃^{(ℓ)}_{n|n−1} ( e^{(ℓ)}_n )^T
  solve K_{f,n} R_{e,n} = R_{xe,n} for K_{f,n}
  x^{(ℓ)}_{n|n} = x^{(ℓ)}_{n|n−1} + K_{f,n} ( y_n − y^{(ℓ)}_{n|n−1} ),  ℓ = 1, 2, ..., L
  x̂_{n|n} = (1/L) Σ_{ℓ=1}^{L} x^{(ℓ)}_{n|n}
  x̃^{(ℓ)}_{n|n} = x^{(ℓ)}_{n|n} − x̂_{n|n},  ℓ = 1, 2, ..., L
  P_{n|n} = (1/(L−1)) Σ_{ℓ=1}^{L} x̃^{(ℓ)}_{n|n} ( x̃^{(ℓ)}_{n|n} )^T
  u^{(ℓ)}_n ∼ N_{u_n}(0, Q_n),  ℓ = 1, 2, ..., L
  x^{(ℓ)}_{n+1|n} = F_n x^{(ℓ)}_{n|n} + G_n u^{(ℓ)}_n,  ℓ = 1, 2, ..., L
end    (30.217)
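The following sketch renders one pass of listing (30.217) in NumPy for the linear model; it is an illustrative implementation with names of my own choosing (enkf_step, X for the predicted ensemble), not code from the text:

```python
import numpy as np

def enkf_step(X, y, F, G, H, Q, R, rng):
    """One ensemble Kalman filter step in the spirit of listing (30.217).

    X: (L, s) ensemble of predicted states; y: (p,) observation y_n.
    Returns (x_filt, x_pred, X_next): filtered estimate, predicted estimate
    for time n+1, and the propagated ensemble.
    """
    L, s = X.shape
    p = y.shape[0]
    # predicted estimate, ensemble anomalies, and covariance approximations
    x_hat = X.mean(axis=0)
    Xt = X - x_hat                                     # anomalies (30.214)
    P_pred = Xt.T @ Xt / (L - 1)
    Re = R + H @ P_pred @ H.T                          # (30.215)-style approximation
    # perturbed predicted observations (30.210)
    V = rng.multivariate_normal(np.zeros(p), R, size=L)
    Y = X @ H.T + V
    y_hat = Y.mean(axis=0)
    E = Y - y_hat
    Rxe = Xt.T @ E / (L - 1)
    Kf = np.linalg.solve(Re.T, Rxe.T).T                # solve Kf Re = Rxe
    # analysis / measurement update (30.211)
    X_filt = X + (y - Y) @ Kf.T
    x_filt = X_filt.mean(axis=0)
    # forecast / time update (30.202)
    U = rng.multivariate_normal(np.zeros(Q.shape[0]), Q, size=L)
    X_next = X_filt @ F.T + U @ G.T
    x_pred = X_next.mean(axis=0)
    return x_filt, x_pred, X_next
```

Because only ensemble averages and anomaly products appear, the s × s covariance never needs to be stored explicitly in large problems; the sketch forms P_pred only for clarity.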

30.8 NONLINEAR FILTERING

Most systems in practice involve nonlinear dynamics and cannot be described by linear state-space models. Optimal estimation of the hidden state variable will generally require that we propagate pdfs, which is a difficult task to pursue. Later, in Section 35.1, we will introduce sequential Monte Carlo methods that are based on recursive importance sampling techniques to propagate the necessary pdfs. The discussion will result in the class of particle filters as one solution to nonlinear filtering problems. In this section, we describe some popular approximation methods that are based on linearizing the state-space model through Taylor series approximations. Thus, consider a nonlinear state-space model of the general form:

x_{n+1} = f(x_n) + g(x_n) u_n,  n ≥ 0    (30.218a)

y_n = h(x_n) + v_n    (30.218b)

where {u_n, v_n} are zero-mean white-noise Gaussian processes that are independent of each other (so that S_n = 0) and with covariance matrices {Q_n, R_n}, respectively. Moreover, f(·), g(·), and h(·) are known vector-valued functions that control the evolution of the model.

30.8.1 Linearized Kalman Filter

The first approximate solution is the linearized Kalman filter. Let x^{nom}_n denote some nominal trajectory generated according to the dynamics:

x^{nom}_{n+1} = f( x^{nom}_n ),  x^{nom}_0 = 0    (30.219)

That is, we run the unforced state dynamics starting from the zero initial condition. The resulting state trajectory will be different from the true state evolution, and we denote the error between the nominal and actual state trajectories by

Δx_n = x_n − x^{nom}_n    (30.220)

We perform the following zeroth- and first-order Taylor series expansions around x^{nom}_n to determine the coefficient matrices {F_n, G_n, H_n}:

g(x_n) ≈ g( x^{nom}_n ) = G_n    (30.221a)

f(x_n) ≈ f( x^{nom}_n ) + F_n Δx_n    (30.221b)

h(x_n) ≈ h( x^{nom}_n ) + H_n Δx_n    (30.221c)

where the matrices F_n and H_n are defined in terms of the Jacobians:

F_n = ∂f(x)/∂x evaluated at x = x^{nom}_n,  H_n = ∂h(x)/∂x evaluated at x = x^{nom}_n    (30.222)


In this notation, the (i, j)th element of matrix F_n is the partial derivative of the ith component of f(x) with respect to the jth component of x, and similarly for H_n. The derivatives are evaluated at the nominal trajectory x^{nom}_n. Using these quantities, the original state equation (30.218a) can be approximated as

x^{nom}_{n+1} + Δx_{n+1} = f( x^{nom}_n ) + F_n Δx_n + g( x^{nom}_n ) u_n    (30.223)

Using (30.219), we obtain

Δx_{n+1} = F_n Δx_n + G_n u_n    (30.224a)

y_n − h( x^{nom}_n ) = H_n Δx_n + v_n    (30.224b)

Since x^{nom}_n is known, these equations describe a linear state-space model in the variable Δx_n and, therefore, a standard Kalman filter solution can be used to estimate it. We write this solution in terms of time- and measurement-update relations:

Δx̂_{n+1|n} = F_n Δx̂_{n|n},  with Δx̂_{0|−1} = 0, P_{0|−1} = Π_0    (30.225a)

Δx̂_{n|n} = Δx̂_{n|n−1} + K_{f,n} ( y_n − h( x^{nom}_n ) − H_n Δx̂_{n|n−1} )    (30.225b)

R_{e,n} = R_n + H_n P_{n|n−1} H_n^T    (30.225c)

K_{f,n} = P_{n|n−1} H_n^T R_{e,n}^{−1}    (30.225d)

P_{n|n} = (I − K_{f,n} H_n) P_{n|n−1}    (30.225e)

P_{n+1|n} = F_n P_{n|n} F_n^T + G_n Q_n G_n^T    (30.225f)

Estimators for the state of the original nonlinear model (30.218a)–(30.218b) can be found by adjusting the nominal values using

x̂_{n|n} = x^{nom}_n + Δx̂_{n|n},  x̂_{n+1|n} = x^{nom}_{n+1} + Δx̂_{n+1|n}

which results in listing (30.226).

Linearized Kalman filter for model (30.218a)–(30.218b) (listing (30.226)).

given a nominal trajectory x^{nom}_{n+1} = f( x^{nom}_n ), x^{nom}_0 = 0;
given observations {y_n} that arise from the model;
objective: estimate the hidden state;
start from x̂_{0|−1} = 0, P_{0|−1} = Π_0.
repeat for n ≥ 0:
  R_{e,n} = R_n + H_n P_{n|n−1} H_n^T
  K_{f,n} = P_{n|n−1} H_n^T R_{e,n}^{−1}
  x̂_{n|n} = x̂_{n|n−1} + K_{f,n} ( y_n − h( x^{nom}_n ) − H_n x̂_{n|n−1} + H_n x^{nom}_n )
  x̂_{n+1|n} = F_n ( x̂_{n|n} − x^{nom}_n ) + f( x^{nom}_n )
  P_{n|n} = (I − K_{f,n} H_n) P_{n|n−1}
  P_{n+1|n} = F_n P_{n|n} F_n^T + G_n Q_n G_n^T
end
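When closed-form derivatives are inconvenient, the Jacobians required in (30.222) (and later in (30.230) for the EKF) can be approximated numerically. The sketch below, which is illustrative rather than part of the text, uses central finite differences:

```python
import numpy as np

def jacobian(func, x, eps=1e-6):
    """Central-difference approximation of the Jacobian of func at x.

    func maps an (s,) vector to an (m,) vector; the result is an (m, s) matrix
    whose (i, j) entry approximates the partial derivative of the i-th output
    with respect to the j-th input.
    """
    x = np.asarray(x, dtype=float)
    f0 = np.asarray(func(x), dtype=float)
    J = np.zeros((f0.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (np.asarray(func(x + dx)) - np.asarray(func(x - dx))) / (2 * eps)
    return J

# example: f(x) = [x0**2, x0*x1] has Jacobian [[2*x0, 0], [x1, x0]]
print(jacobian(lambda x: np.array([x[0]**2, x[0] * x[1]]), np.array([1.0, 2.0])))
```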


There is one major problem with this solution. The linearization is performed around a nominal trajectory, which is decoupled from the driving term g(x_n) u_n and cannot provide a faithful representation for the original state trajectory. While the nominal and actual state trajectories may be close to each other for the initial time instants n, they will nevertheless grow further apart over time and lead to poor performance for the filter.

30.8.2 Extended Kalman Filter (EKF)

A more popular implementation for nonlinear filtering relies on linearizing the model around state estimates, as opposed to nominal states, thus leading to the extended Kalman filter (EKF). By doing so, it is expected that the filter will be able to track the actual state trajectory more closely. For this method, we construct the coefficient matrices {F_n, G_n, H_n} by means of the following zeroth- and first-order Taylor series expansions around the most recent filtered and predicted state estimates:

f(x_n) ≈ f( x̂_{n|n} ) + F_n ( x_n − x̂_{n|n} )    (30.227)

h(x_n) ≈ h( x̂_{n|n−1} ) + H_n ( x_n − x̂_{n|n−1} )    (30.228)

g(x_n) ≈ g( x̂_{n|n} ) = G_n    (30.229)

F_n = ∂f(x)/∂x evaluated at x = x̂_{n|n},  H_n = ∂h(x)/∂x evaluated at x = x̂_{n|n−1}    (30.230)

In the last two expressions for {F_n, H_n}, the partial derivatives are evaluated at the filtered and predicted state values, respectively. Substituting into the original model (30.218a)–(30.218b) we obtain the approximations:

x_{n+1} ≈ F_n x_n + ( f( x̂_{n|n} ) − F_n x̂_{n|n} ) + G_n u_n,  where the term in parentheses is known at time n    (30.231a)

y_n − ( h( x̂_{n|n−1} ) − H_n x̂_{n|n−1} ) ≈ H_n x_n + v_n,  where the term in parentheses is known at time n − 1    (30.231b)

This is a linear state-space model for x_n with a deterministic driving term in the first relation (30.231a), represented by the known component, in a manner similar to (30.88). Therefore, we can apply the Kalman filter equations as in the case of the linearized Kalman filter to write

x̂_{n+1|n} = F_n x̂_{n|n} + f( x̂_{n|n} ) − F_n x̂_{n|n} = f( x̂_{n|n} )    (30.232)

where the first and last terms on the right-hand side cancel each other, so that the calculation simplifies to f( x̂_{n|n} ). Similarly,

x̂_{n|n} = x̂_{n|n−1} + K_{f,n} ( y_n − h( x̂_{n|n−1} ) + H_n x̂_{n|n−1} − H_n x̂_{n|n−1} ) = x̂_{n|n−1} + K_{f,n} ( y_n − h( x̂_{n|n−1} ) )    (30.233)

where again the last two terms inside the parentheses cancel each other. The corresponding covariance and gain equations are the same as those of the linearized Kalman filter, thus leading to listing (30.234). Unlike the linearized Kalman filter, however, the quantities {P_{n|n−1}, K_{f,n}} cannot be pre-computed because the coefficient matrices {F_n, H_n, G_n} are now dependent on the measurements through their dependence on the predicted and filtered state estimates.

Extended Kalman filter for model (30.218a)–(30.218b) (listing (30.234)).

given observations {y_n} that arise from the model;
objective: estimate the hidden state;
start from x̂_{0|−1} = x̄_0, P_{0|−1} = Π_0.
repeat for n ≥ 0:
  K_{f,n} = P_{n|n−1} H_n^T ( H_n P_{n|n−1} H_n^T + R_n )^{−1}
  x̂_{n|n} = x̂_{n|n−1} + K_{f,n} ( y_n − h( x̂_{n|n−1} ) )
  x̂_{n+1|n} = f( x̂_{n|n} )
  P_{n|n} = (I − K_{f,n} H_n) P_{n|n−1}
  P_{n+1|n} = F_n P_{n|n} F_n^T + G_n Q_n G_n^T
end
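A compact NumPy rendering of one iteration of listing (30.234) is shown below; it is a sketch with illustrative names (ekf_step, F_jac, H_jac for user-supplied Jacobian routines evaluated as in (30.230)), not the text's own implementation:

```python
import numpy as np

def ekf_step(x_pred, P_pred, y, f, h, F_jac, H_jac, Q, R, G=None):
    """One extended Kalman filter iteration in the spirit of listing (30.234).

    x_pred, P_pred: predicted state and covariance at time n (x_{n|n-1}, P_{n|n-1}).
    f, h: model functions; F_jac(x), H_jac(x): their Jacobians; Q, R: noise covariances.
    If G is None, the driving term is taken as u_n directly (G_n = I).
    Returns (x_filt, P_filt, x_next, P_next).
    """
    H = H_jac(x_pred)                               # Jacobian of h at the predicted state
    Re = H @ P_pred @ H.T + R
    Kf = np.linalg.solve(Re.T, (P_pred @ H.T).T).T  # Kf = P_pred H^T Re^{-1}
    x_filt = x_pred + Kf @ (y - h(x_pred))          # measurement update
    P_filt = (np.eye(P_pred.shape[0]) - Kf @ H) @ P_pred
    F = F_jac(x_filt)                               # Jacobian of f at the filtered state
    Gn = np.eye(P_pred.shape[0]) if G is None else G
    x_next = f(x_filt)                              # time update
    P_next = F @ P_filt @ F.T + Gn @ Q @ Gn.T
    return x_filt, P_filt, x_next, P_next
```

Note that F and H are recomputed at every step from the current estimates, which is why the gains cannot be precomputed offline as they could for the linearized filter.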

30.8.3 Unscented Kalman Filter (UKF)

The linearized and extended Kalman filters rely on Taylor approximations of the nonlinear model and require the evaluation of certain Jacobian (derivative) matrices. The unscented Kalman filter (UKF) is a variation that avoids linearization and the use of derivatives. It relies on approximating the first- and second-order moments of the state estimators by sampling a collection of sigma points and using them to perform sample mean and variance calculations in a manner that resembles the operation of the ensemble Kalman filter. The construction tends to show improved performance in general. We consider the same state-space model (30.218a)–(30.218b) with

g(x_n) = 1    (30.235)

This entails little loss of generality: if g(x_n) = G_n, for example, then we can treat the term u'_n = G_n u_n as the driving noise process with covariance matrix Q'_n = G_n Q_n G_n^T.

Time-update step
Recall that the state dimension is x_n ∈ IR^s. Assume we are at step n−1 with a filtered estimator x̂_{n−1|n−1} and error covariance matrix P_{n−1|n−1}. We let P^{1/2}_{n−1|n−1} denote a square-root factor for P_{n−1|n−1}, i.e., a matrix that satisfies

P_{n−1|n−1} = P^{1/2}_{n−1|n−1} ( P^{1/2}_{n−1|n−1} )^T    (30.236)

The UKF solution requires that we compute these square-root factors. We denote the columns of P^{1/2}_{n−1|n−1} by

P^{1/2}_{n−1|n−1} = [ a_1  a_2  ...  a_s ]    (30.237)

The first step involves selecting 2L + 1 points (referred to as sigma points):

{ z_ℓ ∈ IR^s,  ℓ = 0, 1, 2, ..., 2L }    (30.238)

and the associated weights {w_ℓ}. One typical choice for L is L = s. The selection of the sigma points is driven by the "distribution" of x_{n−1}, with the quantities x̂_{n−1|n−1} and P_{n−1|n−1} approximating its first- and second-order moments (mean and variance). The points {z_ℓ} are constructed in a deterministic manner according to the following expressions:

z_0 = x̂_{n−1|n−1}    (30.239a)

w_0 ∈ (−1, 1)    (30.239b)

where w_0 is selected arbitrarily from within the open interval (−1, 1), and

z_ℓ = x̂_{n−1|n−1} + sqrt( s / (1 − w_0) ) a_ℓ,  ℓ = 1, 2, ..., L    (30.239c)

z_{ℓ+L} = x̂_{n−1|n−1} − sqrt( s / (1 − w_0) ) a_ℓ,  ℓ = 1, 2, ..., L    (30.239d)

w_ℓ = (1 − w_0) / (2L),  ℓ = 1, 2, ..., 2L    (30.239e)

Observe that the points {z_ℓ} are centered around x̂_{n−1|n−1}, with each one of them perturbed along the direction of a column vector from the square-root factor P^{1/2}_{n−1|n−1}. Observe also that the weights {w_ℓ} add up to 1:

Σ_{ℓ=0}^{2L} w_ℓ = 1    (30.240)

This normalization ensures that the sample moments computed next do not introduce bias. The points {z_ℓ} are propagated through the nonlinear state transformation and used to generate the quantities, indexed by the subscript ℓ:

x_{n,ℓ} = f( z_ℓ ),  ℓ = 0, 1, ..., 2L    (30.241)

These samples are then used to approximate x̂_{n|n−1} and its error covariance matrix as follows:

x̂_{n|n−1} = Σ_{ℓ=0}^{2L} w_ℓ x_{n,ℓ}    (30.242a)

P_{n|n−1} = Σ_{ℓ=0}^{2L} w_ℓ ( x_{n,ℓ} − x̂_{n|n−1} )( x_{n,ℓ} − x̂_{n|n−1} )^T + Q_n    (30.242b)

We observe that the unscented method provides a way to estimate the mean and covariance matrix of the state variable after it undergoes the nonlinear transformation by f(·). In the above implementation, the same weights {w_ℓ} are used in the expressions for both x̂_{n|n−1} and P_{n|n−1}. This need not be the case in general, and one can use different weights for calculating these quantities. For instance, another popular choice for the weights w_ℓ is the following:

w_0 = λ / (L + λ)  (for estimating states)    (30.243a)

w_0 = λ / (L + λ) + (1 − α^2 + β)  (for estimating covariances)    (30.243b)

λ = α^2 (L + κ) − L    (30.243c)

w_ℓ = 1 / ( 2(L + λ) ),  ℓ = 1, 2, ..., 2L    (30.243d)

where κ is usually small (say, κ ≈ 0 or κ = 1), and the parameters (α, β) are typically chosen as α = 0.001 or α = 1, and β = 2.
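The deterministic construction (30.239a)–(30.239e) can be coded as follows for the typical choice L = s; the Cholesky factor is used as one possible square-root factor, and the helper name is an illustrative choice rather than code from the text:

```python
import numpy as np

def sigma_points(x_hat, P, w0=0.0):
    """Sigma points and weights as in (30.239a)-(30.239e), with L = s.

    x_hat: (s,) mean; P: (s, s) positive-definite covariance; w0 in (-1, 1).
    Returns (Z, w): Z of shape (2s + 1, s) and weights w of shape (2s + 1,).
    """
    s = x_hat.size
    sqrtP = np.linalg.cholesky(P)                 # square-root factor with columns a_1,...,a_s
    scale = np.sqrt(s / (1.0 - w0))
    Z = np.empty((2 * s + 1, s))
    Z[0] = x_hat                                  # (30.239a)
    Z[1:s + 1] = x_hat + scale * sqrtP.T          # rows are x_hat + scale * a_l   (30.239c)
    Z[s + 1:] = x_hat - scale * sqrtP.T           # (30.239d)
    w = np.full(2 * s + 1, (1.0 - w0) / (2 * s))  # (30.239e)
    w[0] = w0
    return Z, w

# the weights sum to 1 and the weighted sample mean recovers x_hat, as in (30.240)
Z, w = sigma_points(np.array([1.0, -2.0]), np.diag([0.3, 0.5]), w0=0.2)
print(w.sum(), w @ Z)   # 1.0 and approximately [1.0, -2.0]
```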

Measurement-update step
We repeat the same construction for the measurement-update step. We now have the predicted estimator x̂_{n|n−1} and its error covariance matrix P_{n|n−1}. We let P^{1/2}_{n|n−1} denote a square-root factor for P_{n|n−1}, i.e., a matrix that satisfies

P_{n|n−1} = P^{1/2}_{n|n−1} ( P^{1/2}_{n|n−1} )^T    (30.244)

We further denote the columns of P^{1/2}_{n|n−1} by

P^{1/2}_{n|n−1} = [ b_1  b_2  ...  b_s ]    (30.245)

We again need to select 2L + 1 new points:

{ t_ℓ ∈ IR^s,  ℓ = 0, 1, 2, ..., 2L }    (30.246)

and the associated weights {w_ℓ}. We can either continue with the same points {z_ℓ} computed in the previous step for simplicity, i.e., set t_ℓ = z_ℓ with the same weights or, more generally, we can generate a new set of sigma points. We assume the latter case for generality and create the t_ℓ and their weights as follows:

t_0 = x̂_{n|n−1}    (30.247a)

w_0 ∈ (−1, 1)    (30.247b)

as well as

t_ℓ = x̂_{n|n−1} + sqrt( s / (1 − w_0) ) b_ℓ,  ℓ = 1, 2, ..., L    (30.247c)

t_{ℓ+L} = x̂_{n|n−1} − sqrt( s / (1 − w_0) ) b_ℓ,  ℓ = 1, 2, ..., L    (30.247d)

w_ℓ = (1 − w_0) / (2L),  ℓ = 1, 2, ..., 2L    (30.247e)

The points {t_ℓ} are propagated through the nonlinear output transformation and used to generate the quantities, indexed by the subscript ℓ:

y_{n,ℓ} = h( t_ℓ ),  ℓ = 0, 1, ..., 2L    (30.248)

These samples are then used to approximate the mean and the "innovations" covariance matrix as follows:

ŷ_{n|n−1} = Σ_{ℓ=0}^{2L} w_ℓ y_{n,ℓ}    (30.249a)

R_{e,n} ≈ Σ_{ℓ=0}^{2L} w_ℓ ( y_{n,ℓ} − ŷ_{n|n−1} )( y_{n,ℓ} − ŷ_{n|n−1} )^T + R_n    (30.249b)

We also approximate the cross-covariance matrix by

R_{xe,n} = E x_n e_n^T ≈ Σ_{ℓ=0}^{2L} w_ℓ ( x_{n,ℓ} − x̂_{n|n−1} )( y_{n,ℓ} − ŷ_{n|n−1} )^T    (30.250)

so that the gain matrix is approximated by (recall definition (30.114) in the linear case):

K_{f,n} = R_{xe,n} R_{e,n}^{−1}    (30.251)

and

x̂_{n|n} = x̂_{n|n−1} + K_{f,n} ( y_n − ŷ_{n|n−1} )    (30.252)

P_{n|n} = P_{n|n−1} − K_{f,n} R_{e,n} K_{f,n}^T    (30.253)

We arrive at listing (30.254). The initial conditions are set to x̂_{−1|−1} = x̄_{−1} and P_{−1|−1} = Π_0.

Unscented Kalman filter for model (30.218a)–(30.218b) with g(x_n) = 1 (listing (30.254)).

given observations {y_n} that arise from the model;
objective: estimate the hidden state;
given the number of sigma points, 2L + 1;
start from x̂_{−1|−1} = x̄_{−1}, P_{−1|−1} = Π_0.
repeat for n ≥ 0:

  (time-update)
  using { x̂_{n−1|n−1}, P_{n−1|n−1} }, select 2L + 1 sigma points {z_ℓ, w_ℓ} according to (30.239a)–(30.239e) and generate {x_{n,ℓ}};
  x̂_{n|n−1} = Σ_{ℓ=0}^{2L} w_ℓ x_{n,ℓ}
  P_{n|n−1} = Σ_{ℓ=0}^{2L} w_ℓ ( x_{n,ℓ} − x̂_{n|n−1} )( x_{n,ℓ} − x̂_{n|n−1} )^T + Q_n

  (measurement-update)
  using { x̂_{n|n−1}, P_{n|n−1} }, select 2L + 1 sigma points {t_ℓ, w_ℓ} according to (30.247a)–(30.247e) and generate {y_{n,ℓ}};
  ŷ_{n|n−1} = Σ_{ℓ=0}^{2L} w_ℓ y_{n,ℓ}
  R_{e,n} ≈ Σ_{ℓ=0}^{2L} w_ℓ ( y_{n,ℓ} − ŷ_{n|n−1} )( y_{n,ℓ} − ŷ_{n|n−1} )^T + R_n
  R_{xe,n} ≈ Σ_{ℓ=0}^{2L} w_ℓ ( x_{n,ℓ} − x̂_{n|n−1} )( y_{n,ℓ} − ŷ_{n|n−1} )^T
  K_{f,n} = R_{xe,n} R_{e,n}^{−1}
  x̂_{n|n} = x̂_{n|n−1} + K_{f,n} ( y_n − ŷ_{n|n−1} )
  P_{n|n} = P_{n|n−1} − K_{f,n} R_{e,n} K_{f,n}^T
end    (30.254)
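The sketch below stitches the time- and measurement-update halves of listing (30.254) into one step. It is an illustrative rendering, not code from the text: the sigma-point helper is inlined with the simple weight rule (30.239e), the Cholesky factor serves as the square-root factor, and the cross-covariance pairs the fresh sigma points t_ℓ with their outputs, which is the natural reading of (30.250) when a new set is drawn.

```python
import numpy as np

def _sigma_points(x, P, w0=0.0):
    """Sigma points and weights as in (30.239a)-(30.239e), with L = s; P must be positive definite."""
    s = x.size
    A = np.linalg.cholesky(P).T                 # rows of A are the columns of a square-root factor of P
    scale = np.sqrt(s / (1.0 - w0))
    Z = np.vstack([x[None, :], x + scale * A, x - scale * A])
    w = np.full(2 * s + 1, (1.0 - w0) / (2 * s))
    w[0] = w0
    return Z, w

def ukf_step(x_filt, P_filt, y, f, h, Q, R, w0=0.0):
    """One unscented Kalman filter iteration in the spirit of listing (30.254)."""
    # time update: propagate sigma points through the state map f
    Z, w = _sigma_points(x_filt, P_filt, w0)
    Xs = np.array([f(z) for z in Z])
    x_pred = w @ Xs                                      # (30.242a)
    dX = Xs - x_pred
    P_pred = (dX * w[:, None]).T @ dX + Q                # (30.242b)
    # measurement update: fresh sigma points around the predicted estimate
    T, w = _sigma_points(x_pred, P_pred, w0)
    Ys = np.array([h(t) for t in T])
    y_pred = w @ Ys                                      # (30.249a)
    dY = Ys - y_pred
    Re = (dY * w[:, None]).T @ dY + R                    # (30.249b)
    Rxe = ((T - x_pred) * w[:, None]).T @ dY             # cross-covariance as in (30.250)
    Kf = np.linalg.solve(Re.T, Rxe.T).T                  # (30.251)
    x_new = x_pred + Kf @ (y - y_pred)                   # (30.252)
    P_new = P_pred - Kf @ Re @ Kf.T                      # (30.253)
    return x_new, P_new
```

No Jacobians appear anywhere in this routine; all moments are formed from weighted sample averages of the propagated sigma points.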

Example 30.10 (Tracking a moving object in 2D) An object is moving in 2D as shown in Fig. 30.5. We denote its coordinates at any instant n by col{a_n, b_n}, where a_n is the horizontal coordinate and b_n is the vertical coordinate. The velocity components along the horizontal and vertical axes are denoted by col{v_{a,n}, v_{b,n}}. Therefore, at any instant in time, the state of the object is described by the four-dimensional state vector x_n = col{a_n, v_{a,n}, b_n, v_{b,n}}. We assume this vector evolves according to the dynamics:

x_n = F x_{n−1} + G u_n,  F = [ 1 1 0 0; 0 1 0 0; 0 0 1 1; 0 0 0 1 ],  G = [ 0.5 0; 1 0; 0 0.5; 0 1 ],  u_n = col{u_{a,n}, u_{b,n}}    (30.255)

where the perturbation u_n ∼ N_{u_n}(0, q I_2). The observations are noisy measurements of the distance and angle viewed from an observer at the origin, i.e.,

y_n = col{ sqrt(a_n^2 + b_n^2),  arctan(b_n / a_n) } + v_n    (30.256)

where v_n ∼ N_{v_n}(0, r I_2).

Figure 30.5 A moving object in 2D; its coordinates are denoted by (a_n, b_n) at time n; the speed components along the horizontal and vertical axes are denoted by (v_{a,n}, v_{b,n}); the bearing θ_n = arctan(b_n / a_n) is the angle viewed from the origin.

The initial state x_0 ∼ N_{x_0}(x̄_0, Π_0) is Gaussian-distributed with

x̄_0 = col{0, 0, 0.2, −0.1},  Π_0 = diag{0.10, 0.01, 0.10, 0.01}    (30.257)

We assume the numerical values

q = 1 × 10^{−6},  r = 5 × 10^{−6}    (30.258)

(30.258)

Noting that b_n and a_n are the third and first entries of the state vector, we conclude from the output equation (30.256) that

h(x) = col{ sqrt(a_n^2 + b_n^2),  arctan(b_n / a_n) }    (30.259)

H_n = ∂h(x)/∂x = [ a_n / sqrt(a_n^2 + b_n^2)   0   b_n / sqrt(a_n^2 + b_n^2)   0 ;  −b_n / (a_n^2 + b_n^2)   0   a_n / (a_n^2 + b_n^2)   0 ]    (30.260)

The EKF equations in this case become:

d_n = sqrt( â_{n|n−1}^2 + b̂_{n|n−1}^2 )    (30.261a)

H_n = [ â_{n|n−1}/d_n   0   b̂_{n|n−1}/d_n   0 ;  −b̂_{n|n−1}/d_n^2   0   â_{n|n−1}/d_n^2   0 ]    (30.261b)

R_{e,n} = r I_2 + H_n P_{n|n−1} H_n^T    (30.261c)

K_{f,n} = P_{n|n−1} H_n^T R_{e,n}^{−1}    (30.261d)

x̂_{n|n} = x̂_{n|n−1} + K_{f,n} ( y_n − h( x̂_{n|n−1} ) )    (30.261e)

x̂_{n+1|n} = F x̂_{n|n}    (30.261f)

P_{n|n} = (I_4 − K_{f,n} H_n) P_{n|n−1}    (30.261g)

P_{n+1|n} = F P_{n|n} F^T + q G G^T    (30.261h)
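For this example, the model quantities can be coded directly as below (a sketch with illustrative names; arctan2 is used in place of arctan(b/a) for numerical robustness, and the Jacobian matches (30.260)–(30.261b) for the state ordering col{a, v_a, b, v_b}):

```python
import numpy as np

# state x = [a, v_a, b, v_b]; constant-velocity dynamics as in (30.255)
F = np.array([[1., 1., 0., 0.],
              [0., 1., 0., 0.],
              [0., 0., 1., 1.],
              [0., 0., 0., 1.]])
G = np.array([[0.5, 0. ],
              [1.0, 0. ],
              [0.,  0.5],
              [0.,  1.0]])

def h(x):
    """Range and bearing measurement, as in (30.256)."""
    a, b = x[0], x[2]
    return np.array([np.sqrt(a**2 + b**2), np.arctan2(b, a)])

def H_jac(x):
    """Jacobian of h, as in (30.260)-(30.261b)."""
    a, b = x[0], x[2]
    d = np.sqrt(a**2 + b**2)
    return np.array([[ a / d,     0., b / d,    0.],
                     [-b / d**2,  0., a / d**2, 0.]])

# illustrative check at the initial mean (30.257): a = 0, b = 0.2
x0 = np.array([0., 0., 0.2, -0.1])
print(h(x0))        # range 0.2, bearing pi/2
print(H_jac(x0))
```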

The EKF and UKF filters do not perform uniformly well across all simulations; they fail in several instances. Figure 30.6 shows the actual and estimated trajectories over 0 ≤ n ≤ 100 using both EKF and UKF for a situation where the filters are able to track the actual trajectory to some reasonable degree. The main challenge for EKF arises from the error in approximating h(x) by the Jacobian Hn . One challenge for UKF arises from the relation updating Pn|n−1 to Pn|n ; the latter matrix may lose its required nonnegative definiteness due to numerical errors, in which case the square-root factorization would fail. We will revisit this application later in Example 35.2 when we apply particle filters to track the same state variable.


Figure 30.6 Actual and estimated trajectories by means of the extended and the unscented Kalman filters applied to model (30.255)–(30.256).


COMMENTARIES AND DISCUSSION Kalman filter. The discussion in the chapter dealing with filtering and smoothing recursions is motivated by the presentation from Kailath, Sayed, and Hassibi (2000) and Sayed (2003, 2008). As indicated in the first reference, the seminal works by Kolmogorov (1939, 1941a, b) and Wiener (1949) laid the foundations for statistical signal analysis in the first half of the twentieth century. Nevertheless, their contributions focused on filtering and prediction problems for stationary random signals. And, moreover, the Wiener–Hopf technique was particular to scalar random processes and required the computation of canonical spectral factors. For scalar processes with rational z-spectra, determination of the spectral factors generally involves searching for roots of polynomials. Determining these spectral factors for vector-valued processes, and extending the notions of zeros and poles to these processes, was not a trivial task at the time. These complications hindered the application of the Wiener theory to vector stationary processes until the landmark contributions of Rudolf Kalman (1930–2016) in the works by Kalman (1960) and Kalman and Bucy (1961). Kalman’s filter had three distinctive features relative to the earlier contributions of Kolmogorov and Wiener: (a) His filter was the optimal solution for both stationary and nonstationary processes. (b) The filter applied to both scalar and vector-valued processes. (c) Importantly, his filter was recursive in nature. Instead of providing an expression (say, in the form of a transfer function representation for the optimal filter as was the case in Wiener’s earlier work, which necessitates stationarity), Kalman (1960) followed an innovative route and developed a recursive solution – one that responds to measurements in real time. In other words, his filter was a notable example of an online algorithm and, moreover, the algorithm was optimal during all stages of operation and converged, in steady state, to the Wiener solution. These features were possible because the Kalman formulation did not operate directly on z-spectra, but rather assumed that the signals involved in the filtering operations satisfied a state-space model. The Kalman filter is a notable example of a model-based solution. There are many useful texts that cover Kalman filtering and its many variations in some detail, including the works by Gelb (1974), Anderson and Moore (1979), Kailath (1981), Maybeck (1982), Grewal and Andrews (1993), and Kailath, Sayed, and Hassibi (2000). Stochastic modeling. What is particularly significant about the works of Kolmogorov (1939, 1941a, b), Wiener (1949), and Kalman (1960) is the insight that the problem of separating signals from noise, as well as prediction and filtering problems, can be approached statistically. In other words, they can be approached by formulating statistical performance measures and by modeling the variables involved as random rather than deterministic quantities. This point of view is in clear contrast, for example, to deterministic (least-squares-based) estimation techniques studied earlier by Gauss (1809) and Legendre (1805, 1810), and which we study in a later chapter. An overview of the historical progress from Gauss’ to Kalman’s work can be found in Sorenson (1970) and in the edited volume by Sorenson (1985) – see also Kailath (1974), Kailath, Sayed, and Hassibi (2000), and Sayed (2003, 2008). 
The insight of formulating estimation problems stochastically in this manner has had a significant impact on the fields of signal processing, communications, control, inference, and learning. In particular, it has led to the development of several disciplines that nowadays go by the names of statistical signal processing, statistical communications theory, optimal stochastic control, stochastic system theory, and statistical learning theory, as can be seen, for instance, from examining the titles of some of the references in these fields, e.g., Lee (1960), Aström (1970), Caines (1988), Scharf (1991), Kay (1993, 1998), Vapnik (1995, 1998), Hastie, Tibshirani, and Friedman (2009), and Theodoridis (2015).


Innovations process. According to the description in Kailath, Sayed, and Hassibi (2000), the innovations concept is due to Wold (1938), who was motivated by an observation from Fréchet (1937) that random variables can be treated as elements of a linear vector space with uncorrelatedness playing the role of orthogonality. Wold (1938) showed that the Gram–Schmidt procedure can be used to transform a set of observations into uncorrelated variables. Kolmogorov (1939, 1941a, b) developed the ideas further for the case of discrete-time stationary processes. The term “innovations” was perhaps first used by Wiener and Masani (1958); it was reintroduced in connection with (continuous-time) nonstationary linear and nonlinear process estimation by the Indian-American electrical engineer and 2012 US National Medal of Science recipient Thomas Kailath (1935–) in the works by Kailath (1968, 1969, 1972). State-space models. State-space realizations are powerful mathematical models that emphasize the fundamental concept of a state variable, xn The state variable is defined as the smallest set of information variables that are sufficient to represent the dynamics of a system. Realizations with the smallest dimension for the state variable are called minimal-order realizations; their implementations require the smallest number of delay elements. State-space models are particularly popular in control engineering and in the design of feedback control structures and estimators. State-space theory evolved from the modeling of electric circuits in the late 1950s. The work of Zadeh (1962) and the title of the work by Linvill (1956) on the development of “system theory as an extension of circuit theory” are suggestive of this evolution. Nevertheless, the major impetus in the adoption of state-space models came with the work by the Azerbaijani-American engineer Lotfi Zadeh (1921–2017), who published an insightful article on the topic of moving “from circuit theory to system theory” in Zadeh (1962). Zadeh is the same person who introduced the designation “z-transform” in the article by Ragazzini and Zadeh (1952) in their work on sampled-data systems. Since Zadeh’s work on state-space models in the early 1960s, the theory has evolved into so many directions that state-space models are nowadays very common in a wide range of applications covering space science, economics and finance, statistics, filtering and estimation, and ecological and biological modeling. One of the earliest treatments of state-space theory was given in the textbook written by Zadeh and Desoer (1963). A comprehensive and insightful treatment of linear state-space theory that covers the wide range of developments that transpired over the span of two decades since Zadeh’s 1962 publication appears in the textbook by Kailath (1980). Other useful treatments appear in Callier and Desoer (1991), Rugh (1995), Bay (1998), Chen (1999), Antsaklis and Michel (2005), and Williams and Lawrence (2007). Riccati recursion. In Kalman’s filter, recursion (30.68) plays a prominent role in propagating the error covariance matrix. Kalman termed this recursion the Riccati recursion in analogy to a differential equation attributed to the Venetian mathematician Jacopo Riccati (1676–1754) in the work by Riccati (1724), and later used by the French mathematician Adrien-Marie Legendre (1752–1833) in his studies on the calculus of variations and by Bellman (1957a) in optimal control theory. 
There is an extensive body of literature on the convergence behavior of the Riccati recursion and the existence of solutions for the resulting DARE when the underlying state-space model is time-invariant, i.e., when {F, G, H, Q, S, R} are constant matrices – see, e.g., the treatment in Kailath, Sayed, and Hassibi (2000, ch. 14) and the many references therein, including the treatment by Lancaster and Rodman (1995) devoted to Riccati equations. We comment on some of these results here, assuming, for convenience, that S = 0. It is explained in the above references that when S 6= 0, the results continue to hold by replacing in the statement below the matrices {F, Q} by {F s = F − GSR−1 H, Qs = Q − SR−1 S T }. We recall that the DARE (30.148a) is a nonlinear algebraic equation in the unknown P . In general, the DARE may have many solutions P , including the possibility of many positive semi-definite solutions. None


of these solutions need to stabilize the closed-loop matrix Fp = F − Kp H (when this happens, we say that P is a stabilizing solution). (Stabilizing solution) (see, e.g., Kailath, Sayed, and Hassibi (2000)): The DARE (30.148a) will have a stabilizing solution if, and only if, {F, H} is detectable and {F, GQ1/2 } is controllable on the unit circle. Moreover, any such stabilizing solution is unique and positive semi-definite. Here, the notation Q1/2 denotes a square-root factor for Q. The pair {F, GQ1/2 } is said to be unit-circle controllable if   rank λIs − F GQ1/2 = s (30.262) at all unit-circle eigenvalues of F . This condition ensures that a matrix L can be found such that (F − GQ1/2 L) has no unit-circle eigenvalues. Likewise, the pair {F, H} is said to be detectable if   λIs − F = s, at all unstable eigenvalues of F (30.263) rank H This ensures that a matrix L can be found such that (F − LH) is stable. Ensemble Kalman filter. The filter is an approximate implementation of the Kalman solution that is well suited for large dimensional state-space models. It is based on propagating ensembles of state variables and using them to compute filtered and predicted state estimates and their covariance matrices. The ensemble Kalman filter was introduced in the geophysics literature by Evensen (1994, 2003, 2009a, b), where it is viewed as an example of a data assimilation technique – see, e.g., Hamill (2006), Evensen (2009a), and Nychka and Anderson (2010). The filter has been very successful in this community, especially in applications related to weather forecasting where the state dimension can run into the millions of entries. There are several reviews of the ensemble Kalman filter, including by Houtekamer and Mitchell (2005), Lakshmivarahan and Stensrud (2009), Anderson (2009), Katzfuss, Stroud, and Wikle (2016), and Roth et al. (2017). Convergence results on the performance of the filter appear in Furrer and Bengtsson (2007), Butala et al. (2008), and Mandel, Cobb, and Beezley (2011). For example, for Gaussian linear state-space models, and for large ensemble sizes L, it is known that the state estimates will converge to the classical Kalman filter solution. The ensemble Kalman filter, nevertheless, suffers from some challenges. The low-rank approximations for the error covariance matrices lead to some performance deterioration. Moreover, for small ensemble sizes L, the filter tends to underestimate the error covariances (i.e., it tends to assume that the state estimates are of better quality than they actually are), which can cause filter divergence. And, the ensemble elements become coupled rather than remain independent. The coupling happens through the computation of the gain matrix Kf,n . The aforementioned references discuss techniques to address some of these difficulties. In the body of the chapter we provided one example related to reducing the effect of the measurement noise ensemble on performance. Approximate nonlinear filtering. The presentation in Section 30.8 on the linearized and extended Kalman filters is adapted from Kailath, Sayed, and Hassibi (2000). According to this reference, the idea of the EKF was originally proposed by Schmidt (1966, 1970). This variant has become a popular scheme for nonlinear filtering, especially since many applications in navigation, control, and communications involve nonlinear state-space models. 
While EKF is derived by relying on a first-order Taylor expansion, higherorder filters can also be obtained by retaining more terms in the Taylor expansions, as shown in Schweppe (1973), Gelb (1974), and Maybeck (1982). However, these filters are not necessarily better than the EKF. The UKF avoids the use of derivatives and is based on approximating moments from sigma samples. The filter is due to Julier and


Uhlmann (1997); see also the accounts in Wan and van der Merwe (2001) and Julier and Uhlmann (2004). One challenge for the UKF implementation arises from the last relation updating Pn|n−1 to Pn|n ; the latter matrix may lose its required nonnegativedefiniteness due to numerical errors, in which case the square-root factorization would fail. One way to ameliorate this difficulty is to resort to square-root or array algorithms, 1/2 1/2 which propagate directly the square-root factor Pn|n−1 to Pn|n . This class of algorithms is described at length in Kailath, Sayed, and Hassibi (2000) and Sayed (2003, 2008). More generally, for nonlinear models, Bucy and Senne (1971) attempted to directly propagate the conditional density functions via the Bayes rule, but the computational burden was excessive. In future Chapter 35 we will develop a sequential version of an importance sampling technique that will allow us to solve nonlinear filtering and prediction problems rather effectively by means of particle filters. It is worth noting that there have been earlier investigations on nonlinear inference problems such as the works by Swerling (1959, 1968, 1971), who developed recursive solutions for nonlinear least-squares problems in the context of satellite tracking. These works followed a purely deterministic formulation and did not adopt state-space models or rely on stochastic formulations for the estimation problems. By linearizing the equations around a reference trajectory, one recovers Kalman-type recursions. While Swerling’s contribution is significant, his work was more difficult to follow than Kalman (1960). For comments on the works by Swerling (1959) and Kalman (1960), the reader may review Sorenson (1985).

PROBLEMS1

30.1 Consider three zero-mean observations {y(0), y(1), y(2)} arising from a stationary random process y(n) with auto-correlation sequence ry (`) = ( 12 )|`| . Find the innovations {e(0), e(1), e(2)} using the Gram–Schmidt procedure. Find also the entries of the matrix A relating the three innovations {e(m)} to the three observations {y(m)} as in (30.26). Find the variances of the innovations {e(0), e(1), e(2)}. 30.2 Consider three zero-mean observations {y(0), y(1), y(2)} arising from a stationary random process y(n) with auto-correlation sequence with ry (`) = ( 21 )|`| . Consider further a zero-mean random variable x with the cross-correlations: E xy(0) = 1, E xy(1) =

1 , E xy(2) = 2 2

Find: b |2 using the observations {y(0), y(1), y(2)}. (a) x b |2 using the innovations {e(0), e(1), e(2)}. (b) x b |1 using the observations {y(0), y(1)}. (c) x b |1 using the innovations {e(0), e(1)}. (d) x b |2 from x b |1 and e(2). (e) x 30.3 Consider the following state-space model: xn+1 = αxn + un + d, T

|α| < 1

y(n) = 1 (xn + xn−1 ) + v(n),

1

n≥0

Several problems in this section are adapted from exercises in Kailath, Sayed, and Hassibi (2000) and Sayed (2003, 2008).


with a constant driving factor d of size s × 1, and where 1 is the column vector with unit entries and    T  qIs δnm 0 0 un um 0 rδnm 0    v(n)   v(m)  =  E 0 0 π0 Is  x0  x0 0 0 0 1 where {q, r, π0 } are positive scalars. Determine the innovations process of {y(n)}. 30.4 Establish the validity of recursions (30.90a)–(30.90e) when a deterministic driving term is present. 30.5 Let y n = v n + v n−1 , n ≥ 0, where {v m , m ≥ −1} is a zero-mean stationary white-noise scalar process with unit variance. Show that b n+1|n = y

n+1 b n|n−1 ) (y − y n+2 n

30.6 Consider the model y n = z n + v n , with E v n v Tm = Rn δnm , E v n z Tm = 0 for n > m, and E v n z Tn = Dn . All random variables are zero mean. Let Re,n denote the bn|n denote the covariance matrix of the innovations of the process {y n }. Let also z −1 bn|n = y n − (Dn + Rn )Re,n LLMSE of z n given {y m , 0 ≤ m ≤ n}. Show that z en . 30.7 Consider a process y n = Hn xn + v n , where {v n } is a white-noise zero-mean process with covariance matrix Rn and uncorrelated with the zero-mean process {xn }. b n|n . Show that the {ν n } The filtered residuals of {y n } are defined as ν n = y n − Hn x form a white-noise sequence with covariance matrix Rν,n = Rn − Hn Pn|n HnT . 30.8 Consider the state-space model xn+1 = F xn + G1 un + G2 un+1 , y n = Hxn + v n for n ≥ 0, with zero-mean uncorrelated random variables {x0 , un , v n } such that E un uTm = Qn δnm , E v n v Tm = Rn δnm , E x0 xT0 = Π0 . Find recursive equations for b n|n−1 and Pn|n−1 . x 30.9 Establish the measurement-update relations (30.119a) and (30.119b). 30.10 Refer to the measurement update from Pn|n−1 to Pn|n in listing (30.118). Verify that it can also be written in the equivalent form T Pn|n = (I − Kf,n H)Pn|n−1 (I − Kf,n H)T + Kf,n Rn Kf,n

30.11

For any n × n matrices P and Pn|n−1 , define the quantities

−1 Re,n = R + HPn|n−1 H T , Re = R + HP H T , Kp,n = Kn Re,n , Kp = KRe−1

where Kn = F Pn|n−1 H T + GS and K = F P H T + GS. Define also Fp,n = F − Kp,n H and Fp = F − Kp H, and introduce the difference matrix ∆Pn = Pn+1|n − P . (a) Show that Re,n+1 − Re = H∆Pn H T Kn+1 − K = F ∆Pn H T

−1 Kp,n+1 − Kp = Fp ∆Pn H T Re,n+1

−1 Fp,n+1 = Fp (I − ∆Pn H T Re,n+1 H) = Fp (I + ∆Pn H T Re−1 H)−1

(b)

Now assume that P is any solution of the DARE (30.148a) and that Pn|n−1 satisfies the Riccati recursion (30.68). Show that −1 ∆Pn+1 = Fp (∆Pn − ∆Pn H T Re,n+1 H∆Pn )FpT

= Fp (I + ∆Pn H T Re−1 H)−1 ∆Pn FpT = Fp,n+1 ∆Pn FpT

Remark. See Kailath, Sayed, and Hassibi (2000, ch. 11) for a related discussion.


(1)

(2)

30.12 Let Pn+1|n and Pn+1|n denote the solutions to the Riccati recursion (30.68) with the same constant parameters {F, G, H, Q, R} and S = 0, but with different (1) (2) (2) (1) initial conditions Πo and Πo , respectively. Let δPn+1|n = Pn+1|n − Pn+1|n . Establish the following identities:  T (1) (2) δPn+1|n = Fp,n δPn|n−1 Fp,n h i T (1) (2) −1 (1) δPn+1|n = Fp,n δPn|n−1 − δPn|n−1 H T (Re,n ) H δPn|n−1 Fp,n (m)

(m)

(m)

(m)

(m)

(m)

(m)

where Fp,n = F −Kp,n H, Kp,n = F Pn|n−1 H T (Re,n )−1 , and Re,n = R+HPn|n−1 H T , for m = 1, 2. Remark. See Kailath, Sayed, and Hassibi (2000, ch. 11) for a related discussion. b n|N 30.13 Refer to the derivation of the Bryson–Frazier algorithm (30.194). Let u denote the LLMSE estimator of un given all observations up to time N . Show that b n|N = Qn GTn λn+1|N . u 30.14 This problem deals with fixed-point smoothing. Fix a time instant no and let b no |N denote the LLMSE estimator of xno given the observations N increase. Let x {y o , . . . , y N }, and denote the error covariance matrix at time no by Pno |N . Likewise, b no |N +1 denote the LLMSE estimator of xno given {y o , . . . , y N , y N +1 } and let let x Pno |N +1 denote the corresponding error covariance matrix. Show that T −1 b no |N +1 = x b no |N + Pno |no −1 ΦTp (N + 1, no )HN x +1 Re,N +1 eN +1 T −1 Pno |N +1 = Pno |N − Pno |no −1 ΦTp (N + 1, no )HN +1 Re,N +1 HN +1 Φp (N + 1, no )Pno |no −1

30.15 This problem deals with the fixed-lag smoothing problem. Assume Sn = 0 for b n|n+L denote the simplicity. Now choose a positive integer L and let n increase. Let x LLMSE estimator of xn given the observations {y o , . . . , y n+L }. We want to determine b n|n+L and x b n−1|n−1+L . a recursion relating x −1 T b n|n+L = x b n|n+L−1 + Pn|n−1 ΦTp (n + L, n)Hn+L Re,n+L en+L . (a) Show that x T b n|n+L−1 = Fn−1 x b n−1|n−1+L + Gn−1 Qn−1 Gn−1 λn|n+L−1 where (b) Show also that x we defined n+L−1 X ∆ T −1 λn|n+L−1 = ΦTp (m, n)Hm Re,m em m=n

(c)

T Using λn|n+L−1 and the relation ΦTp (m, n) = Fp,n ΦTp (m, n + 1), conclude that

b n|n+L = Fn−1 x b n−1|n−1+L + Gn−1 Qn−1 GTn−1 λn|n+L−1 + x T −1 Pn|n−1 ΦTp (n + L, n)Hn+L Re,n+L en+L T −1 T −1 λn|n+L−1 = Fp,n λn+1|n+L + HnT Re,n en − ΦTp (n + L, n)Hn+L Re,n+L en+L

30.16 This problem deals with an alternative approach to the solution of the fixedpoint smoothing problem. Consider the standard state-space model of this chapter with b no |N denote the LLMSE estimator of Sn = 0. Let no be a fixed time instant. Also let x the state xn0 given the observations {y o , y 1 , . . . , y N } with N > n0 . We are interested b no |N and x b no |N +1 , as well as a recursive in finding a recursive update that relates x formula that relates Pno |N and Pno |N +1 , where Pno |N is the covariance matrix of the b no |N ). Argue that the answer can be obtained by using the following error (xno − x augmented state-space model: Define the variable z n+1 = xno and write        xn+1 Fn 0 xn Gn = + un zn 0 z n+1 0 I     xn y n = Hn 0 + v n , for n ≥ n0 zn

Problems

1207

Use the above model to derive the desired recursions and specify the necessary initial conditions. 30.17 Consider a standard state-space model of the form: xn+1 = Fn xn + Gun y n = Hxn + v n , n ≥ 0 where   T Qδnm un  um T  vn    S δnm vm  =  E x0  0 x0 1 0 

Sδnm Rδnm 0 0

 0 0  Π0  0

where all parameters are time-invariant, with the exception of Fn , which is defined below. (a) Assume first that Fn = F1 with probability p and Fn = F2 with probability 1 − p, where F1 and F2 are some constant matrices. Determine a recursive procedure for constructing the innovations process. (b) Assume instead that F2n = F1 and F2n+1 = F2 . Determine a recursive procedure for constructing the innovations process. (c) Assume the filters converge to the steady state in both cases so that the corresponding output processes {y n } are stationary. Determine their auto-correlation sequences, {Ry (`)}, and the corresponding z-spectra. (d) Under part (c), determine the corresponding spectral factors and the pre-whitening filters. 30.18 What is the steady-state version of recursion (30.100) as n → ∞? Simplify the recursion (30.100) when the SNR is high, i.e., when π0 /σv2 → ∞, and conclude that it leads to n 1 X b (0|n) = y(m) x n + 1 m=0 30.19 Refer to the time- and measurement-update form of the Kalman filter in listing (30.118) and assume Sn = 0 for simplicity. An alternative and popular way to implement the filter is to propagate square-root factors of the error covariance matrices through orthogonal transformations. This so-called array form or array algorithm has improved numerical properties and can be implemented through effective QR1/2 1/2 type transformations. Let {Pn|n−1 , Pn|n } denote lower-triangular Cholesky factors for 1/2

1/2

1/2

{Pn|n−1 , Pn|n }. Likewise, for {Qn , Rn , Re,n }. At every iteration n, let Θ1 and Θ2 denote two orthogonal matrices that transform the pre-array matrices shown on the left-hand side below to the respective post-array triangular forms: h i   1/2 1/2 Θ1 = X 0 0 Fn Pn|n Gn Qn " 1/2 #   1/2 Rn Hn Pn|n−1 X 0 Θ = 2 1/2 Y Z 0 Pn|n−1 where the quantities {X0 , X, Z} are lower triangular. Show that the quantities in the post-arrays can be identified as follows: 1/2

X0 = Pn+1|n ,

1/2 X = Re,n ,

1/2 Y = Kf,n Re,n ,

1/2

Z = Pn|n

Remark. For more information on array algorithms in the context of both Kalman filtering and adaptive filtering, the reader may refer to Kailath, Sayed, and Hassibi (2000) and Sayed (2003, 2008).

1208

Kalman Filter

30.20 Refer to expression (30.215) for the approximation of Re,n in the ensemble Kalman filter. Collect the ensemble realizations into the s × L (tall) matrix h i (1) (2) (L) Xn|n−1 = xn|n−1 xn|n−1 . . . xn|n−1 1/2

1/2

and introduce the square-root factorization Rn = Rn (Rn )T . Show that a squareroot factor for Re,n can be determined as follows. Find an orthogonal matrix Θ that transforms the pre-array matrix on the left below into the post-array matrix on the 1/2 right where X is p × p and lower triangular. Show that X can be taken as Re : h i   1/2 X 0 Θ = Hn Xn|n−1 Rn

REFERENCES Anderson, B. D. O. and J. B. Moore (1979), Optimal Filtering, Prentice Hall. Anderson, J. (2009), “Ensemble Kalman filters for large geophysical applications,” IEEE Control. Syst. Mag., vol. 29, no. 3, pp. 66–82. Antsaklis, P. J. and A. N. Michel (2005), A Linear Systems Primer, Birkhauser. Aström, K. J. (1970), Introduction to Stochastic Control Theory, Academic Press. Bay, J. S. (1998), Fundamentals of Linear State Space Systems, McGraw-Hill. Bellman, R. E. (1957a), Dynamic Programming, Princeton University Press. Also published in 2003 by Dover Publications. Bucy, R. S. and K. D. Senne (1971), “Digital synthesis of nonlinear filters,” Automatica, vol. 7, pp. 287–289. Butala, M., J. Yun, Y. Chen, R. Frazin, and F. Kamalabadi (2008), “Asymptotic convergence of the ensemble Kalman filter,” Proc. IEEE Int. Conf. Image Processing (ICIP), pp. 825–828, San Diego, CA. Caines, P. E. (1988), Linear Stochastic Systems, Wiley. Callier, F. M. and C. A. Desoer (1991), Linear System Theory, Springer. Chen, C.-T. (1999). Linear System Theory and Design, 3rd ed., Oxford University Press. Evensen, G. (1994), “Sequential data assimilation with a nonlinear quasi-geostrophic model using Monte Carlo methods to forecast error statistics,” J. Geophys. Res. Oceans, vol. 99(C5), art. 3-10162. Evensen, G. (2003), “The ensemble Kalman filter: Theoretical formulation and practical implementation,” Ocean Dyn., vol. 53, no. 4, pp. 343–367. Evensen, G. (2009a), Data Assimilation: The Ensemble Kalman Filter, 2nd ed., Springer. Evensen, G. (2009b), “The ensemble Kalman filter for combined state and parameter estimation,” IEEE Control. Syst. Mag., vol. 29, no. 3, pp. 83–104. Fréchet, G. (1937), “Généralités sur les probabilités,” Variables aléatoires, Borel Series, Traité du Calcul des Probabilités et des ses Applications, vol. 1, pp. 123–126. Furrer, R. and T. Bengtsson (2007), “Estimation of high-dimensional prior and posterior covariance matrices in Kalman filter variants,” J. Multivar. Anal., vol. 98, no. 2, pp. 227–255. Gauss, C. F. (1809), Theory of the Motion of the Heavenly Bodies Moving about the Sun in Conic Sections, English translation by C. H. Davis, 1857, Little, Brown, and Company. Gelb, A., editor, (1974) Applied Optimal Estimation, MIT Press. Grewal, M. S. and A. P. Andrews (1993), Kalman Filtering: Theory and Practice, Prentice Hall.

References

1209

Hamill, T. M. (2006), “Ensemble-based atmospheric data assimilation,” in Predictability of Weather and Climate, T. Palmer and R. Hagedorn, editors, chapter 6, Cambridge University Press. Hastie, T., R. Tibshirani, and J. Friedman (2009), The Elements of Statistical Learning, 2nd ed., Springer. Houtekamer, P. L. and H. L Mitchell (2005), “Ensemble Kalman filtering,” Q. J. R. Meteorol. Soc., vol. 131, no. 613, pp. 3269–3289. Julier, S. J. and J. K. Uhlmann (1997), “New extension of the Kalman filter to nonlinear systems,” Proc. SPIE, vol. 3068, pp. 182–193. Julier, S. J. and J. K. Uhlmann (2004), “Unscented filtering and nonlinear estimation,” Proc. IEEE, vol. 92, no. 3, pp. 401–422. Kailath, T. (1968), “An innovations approach to least-squares estimation, part I: Linear filtering in additive white noise,” IEEE Trans. Aut. Control, vol. 13, pp. 646–655. Kailath, T. (1969), “A general likelihood ratio formula for random signals in noise,” IEEE Trans. Inf. Theory, vol. 15, pp. 350–361. Kailath, T. (1972), “A note on least-squares estimation by the innovations method,” SIAM J. Comput., vol. 10, no. 3, pp. 477–486. Kailath, T. (1974), “A view of three decades of linear filtering theory,” IEEE Trans. Inf. Theory, vol. 20, pp. 146–181. Kailath, T. (1980), Linear Systems, Prentice Hall. Kailath, T. (1981), Lectures on Wiener and Kalman Filtering, 2nd ed., Springer. Kailath, T., A. H. Sayed, and B. Hassibi (2000), Linear Estimation, Prentice Hall. Kalman, R. E. (1960), “A new approach to linear filtering and prediction problems,” Trans. ASME J. Basic Eng., vol. 82, pp. 34–45. Kalman, R. E. and R. S. Bucy (1961), “New results in linear filtering and prediction theory,” Trans. ASME J. Basic Eng., vol. 83, pp. 95–107. Katzfuss, M., J. R. Stroud, and C. K. Wikle (2016), “Understanding the ensemble Kalman filter,” Amer. Statist., vol. 70, no. 4, pp. 350–357. Kay, S. (1993), Fundamentals of Statistical Signal Processing: Estimation Theory, Prentice Hall. Kay, S. (1998), Fundamentals of Statistical Signal Processing: Detection Theory, Prentice Hall. Kolmogorov, A. N. (1939), “Sur l’interpolation et extrapolation des suites stationnaires,” C. R. Acad. Sci., vol. 208, p. 2043. Kolmogorov, A. N. (1941a), “Stationary sequences in Hilbert space (in Russian),” Bull. Math. Univ. Moscow, vol. 2. Kolmogorov, A. N. (1941b), “Interpolation and extrapolation of stationary random processes,” Bull. Acad. Sci. USSR, vol. 5. A translation has been published by the RAND Corp., Santa Monica, CA, as Memo. RM-3090-PR, April 1962. Lakshmivarahan, S. and D. Stensrud (2009), “Ensemble Kalman filter,” IEEE Control Syst. Mag., vol. 29, no. 3, pp. 34–46. Lancaster, P. and L. Rodman (1995), Algebraic Riccati Equations, Oxford University Press. Lee, Y. W. (1960), Statistical Theory of Communication, Wiley. Legendre, A. M. (1805), Nouvelles Méthodes pour la Détermination des Orbites de Comètes, Courcier, Paris. Legendre, A. M. (1810), “Méthode de moindres quarres, pour trouver le milieu de plus probable entre les résultats des différentes observations,” Mem. Inst. France, pp. 149–154. Linvill, W. K. (1956), “System theory as an extension of circuit theory,” IRE Trans. Circuit Theory, vol. 3, pp. 217–223. Mandel, J., L. Cobb, and J. D. Beezley (2011), “On the convergence of the ensemble Kalman filter,” Appl. Math., vol. 56, no. 6., pp. 533–541. Maybeck, P. S. (1982), Stochastic Models, Estimation, and Control, Academic Press.

1210

Kalman Filter

Nychka, D., and J. L. Anderson (2010), “Data assimilation,” in Handbook of Spatial Statistics, A. Gelfand, P. Diggle, P. Guttorp, and M. Fuentes, editors, pp. 477–492, Chapman & Hall. Ragazzini, J. R. and L. A. Zadeh (1952), “The analysis of sampled-data systems,” Trans. Am. Inst. Elec. Eng., vol. 71, pp. 225–234. Riccati, J. F. (1724), “Animadversationnes in aequationes differentiales secundi gradus,” Eruditorum Quae Lipside Publicantur Supplementa, vol. 8, pp. 66–73. Roth, M., G. Hendeby, C. Fritsche, and F. Gustafsson (2017), “The ensemble Kalman filter: A signal processing perspective,” EURASIP J. Adv. Signal Process., vol. 56, pp. 1–16. Rugh, W. J. (1995), Linear System Theory, 2nd ed., Prentice Hall. Sayed, A. H. (2003), Fundamentals of Adaptive Filtering, Wiley. Sayed, A. H. (2008), Adaptive Filters, Wiley. Scharf, L. L. (1991), Statistical Signal Processing: Detection, Estimation, and TimeSeries Analysis, Addison-Wesley. Schmidt, S. F. (1966), “Applications of state space methods to navigation problems,” in Advanced Control Systems, C. T. Leondes, editor, pp. 293–340, Academic Press. Schmidt, S. F. (1970), “Computational techniques in Kalman filtering,” in Theory and Applications of Kalman Filtering, AGARDograph 139, Technical report, NATO Advisory Group for Aerospace Research and Development. Schweppe, F. C. (1973), Uncertain Dynamic Systems, Prentice Hall. Sorenson, H. W. (1970), “Least-squares estimation: From Gauss to Kalman,” IEEE Spectrum, vol. 7, no. 7, pp. 63–68. Sorenson, H. W., editor, (1985), Kalman Filtering: Theory and Application, IEEE Press. Swerling, P. (1959), “First-order error propagation in a stagewise smoothing procedure for satellite observations,” J. Astronaut. Sci., vol. 6, pp. 46–52. Swerling, P. (1968), “A proposed stagewise differential correction procedure for satellite tracking and prediction,” Technical Report P-1292, RAND Corporation. Swerling, P. (1971), “Modern state estimation methods from the viewpoint of the method of least-squares,” IEEE Trans. Aut. Control, vol. 16, pp. 707–719. Theodoridis, S. (2015), Machine Learning: A Bayesian and Optimization Perspective, Academic Press. Vapnik, V. N. (1995), The Nature of Statistical Learning Theory, Springer. Vapnik, V. N. (1998), Statistical Learning Theory, Wiley. Wan, E. and R. van der Merwe (2001), The Unscented Kalman Filter, Wiley. Wiener, N. (1949), Extrapolation, Interpolation and Smoothing of Stationary Time Series, Technology Press and Wiley, NY. Originally published in 1942 as a classified Nat. Defense Res. Council Report. Also published under the title Time Series Analysis by MIT Press. Wiener, N. and P. Masani (1958), “The prediction theory of multivariate processes, part II: The linear predictor,” Acta Math., pp. 93–139. Williams, R. L. and D. A. Lawrence (2007), Linear State-Space Control Systems, Wiley. Wold, H. (1938), A Study in the Analysis of Stationary Time Series, Almqvist & Wiksell. Zadeh, L. A. (1954), “System theory,” Columbia Engr. Quart., vol. 8, pp. 16–19. Zadeh, L. A. (1962), “From circuit theory to system theory,” Proc. IRE, vol. 50, pp. 856– 865. Zadeh, L. A. and C. A. Desoer (1963), Linear System Theory, McGraw-Hill. Also published in 2008 by Dover Publications.

31 Maximum Likelihood

The maximum-likelihood (ML) formulation is one of the most formidable tools for the solution of inference problems in modern statistical analysis. It allows the estimation of unknown parameters in order to fit probability density functions (pdfs) onto data measurements. We introduce the ML approach in this chapter and limit our discussions to properties that will be relevant for the future developments in the text. The presentation is not meant to be exhaustive, but targets key concepts that will be revisited in later chapters. We also avoid anomalous situations and focus on the main features of ML inference that are generally valid under some reasonable regularity conditions. The ML approach is one notable example of the non-Bayesian viewpoint to inference whereby the unknown quantity to be estimated is modeled as a deterministic unknown but fixed parameter, rather than as a random variable. This viewpoint is very relevant when we attempt to fit probability density models onto data. We will comment at the end of the chapter, as well as in later chapters, on the relation to the Bayesian approach to inference problems. In this latter case, both the unknown and observations are treated as random variables.

31.1

PROBLEM FORMULATION Consider a random variable y with pdf denoted by fy (y). In statistical inference, this pdf is also called the evidence of y. We assume that fy (y) is dependent on some parameters that are denoted generically by the letter θ. For emphasis, we will write fy (y; θ) instead of fy (y). For example, the pdf fy (y) could be a Gaussian distribution, in which case θ would refer to its mean or variance or both. In this case, we write fy (y; µ, σy2 ). Given an observation y, the ML formulation deals with the problem of estimating θ by maximizing the likelihood function: θb = argmax fy (y; θ)

(31.1)

θ

That is, it selects the value of θ that maximizes the likelihood of the observation. The pdf, fy (y; θ), is called the likelihood function and its logarithm is called the log-likelihood function:

1212

Maximum Likelihood



`(y; θ) = ln fy (y; θ)

(31.2)

Since the logarithm function is monotonically increasing, the ML estimate can also be determined by solving instead: θb = argmax `(y; θ)

(31.3)

θ

Usually, in the context of ML estimation, we observe N iid realizations, {yn }, and use them to estimate θ by maximizing the likelihood function corresponding to these joint observations: ! N Y θb = argmax `(y1 , y2 , . . . , yN ; θ) = argmax ln fy (yn ; θ) (31.4) θ

θ

n=1

so that

θbML = argmax θ

(

N X

ln fy (yn ; θ)

n=1

)

(31.5)

where we are adding the ML subscript for clarity. Clearly, the ML estimate need not exist; it also need not be unique. We will sometimes write θbN , with a subscript N , to indicate that the computation of the estimate is based on N measurements. It is important to realize that the estimate θbML is dependent on the observations {yn }. A different collection of N observations arising from the same underlying true distribution fy (y) will generally lead to a different value for the estimate θbML . For this reason, we treat the ML solution as a random variable bML , which we denote in boldface notation. and introduce the ML estimator, θ From this perspective, every estimate θbML corresponds to a realization for the bML . We introduce the estimation error random variable θ ∆ eML = bML θ θ−θ

(31.6)

where θ represents the true unknown parameter. We associate three measures of quality with the ML estimator, namely, its bias, variance, and mean-square error (MSE) defined by (bias) (variance) (mean-square-error)

bML ) bias(θ bML ) var(θ

∆ bML = E θ eML = θ − Eθ



bML − E θ bML )2 = E (θ ∆

bML ) = E (θ bML − θ)2 = MSE(θ

e2 Eθ ML

(31.7a) (31.7b) (31.7c)

where the expectation is relative to the true distribution, fy (y; θ). When the estimator is unbiased, the MSE coincides with its variance. The bias measures how far the estimator is on average from the true parameter, θ. The variance measures how well concentrated the distribution of the estimator is around its mean, whereas the MSE measures how well concentrated the same distribution is around the true parameter, θ. Ideally, we would like the error to have zero mean,

in which case we say that the ML estimator is unbiased. We would also like the estimator to have a small MSE (or variance). We will explain in the following that the ML estimator has two useful properties for large measurement sizes, N. Specifically, it will be asymptotically unbiased, i.e.,

$$
\lim_{N\to\infty} \mathbb{E}\,\big\{\widehat{\boldsymbol{\theta}}_N\big\} = \theta \tag{31.8}
$$

as well as asymptotically efficient, meaning that it will attain the smallest variance (and MSE) possible:

$$
\lim_{N\to\infty} \text{var}(\widehat{\boldsymbol{\theta}}_{\rm ML}) = \text{smallest value it can be} \tag{31.9}
$$

We will quantify the value of this smallest MSE by means of the Cramer–Rao bound.

Example 31.1 (Bias–variance relation) It is not always the case that unbiased estimators are preferred. Consider an unknown parameter θ whose estimator is θ̂ with mean denoted by θ̄ = E θ̂. The MSE of the estimator is given by

$$
\begin{aligned}
\text{MSE} \;&\stackrel{\Delta}{=}\; \mathbb{E}\,(\theta - \widehat{\boldsymbol{\theta}})^2 \\
&= \mathbb{E}\,(\theta - \bar{\theta} + \bar{\theta} - \widehat{\boldsymbol{\theta}})^2 \\
&= (\theta - \bar{\theta})^2 + \mathbb{E}\,(\bar{\theta} - \widehat{\boldsymbol{\theta}})^2 + 2(\theta - \bar{\theta})\,\underbrace{\mathbb{E}\,(\bar{\theta} - \widehat{\boldsymbol{\theta}})}_{=\,0} \\
&= (\theta - \bar{\theta})^2 + \mathbb{E}\,(\bar{\theta} - \widehat{\boldsymbol{\theta}})^2 \\
&= \text{bias}^2(\widehat{\boldsymbol{\theta}}) + \text{var}(\widehat{\boldsymbol{\theta}})
\end{aligned} \tag{31.10}
$$

In other words, the MSE is the sum of two components, namely, the squared bias and the variance of the estimator. This means that one may still employ a biased estimator as long as the sum of both components remains small. We commented on the bias–variance relation earlier in Section 27.4.

Example 31.2 (Comparing ML and the maximum a-posteriori approach) The ML formulation treats the parameter θ as some unknown constant, and parameterizes the pdf of the observation y in terms of θ by writing f_y(y; θ). This same pdf can be rewritten in the suggestive conditional form f_{y|θ}(y|θ) to emphasize that we are referring to the distribution of y given that the parameter is fixed at the value θ. The value of θ is then estimated by maximizing the likelihood function:

$$
\widehat{\theta}_{\rm ML} = \arg\max_{\theta}\, f_{y|\theta}(y|\theta) \tag{31.11}
$$

It is instructive to compare this formulation with the Bayesian maximum a-posteriori (MAP) approach where both θ and y are treated as random variables. Returning to (28.11), and using the Bayes rule (3.39), we find that the MAP estimator (28.11) corresponds to solving:

$$
\widehat{\theta}_{\rm MAP} = \arg\max_{\theta}\, \Big\{ f_{\theta}(\theta)\, f_{y|\theta}(y|\theta) \Big\} \tag{31.12}
$$

where we are ignoring the marginal pdf, f_y(y), because it does not depend on the unknown θ. Observe from the term on the right-hand side of (31.12) that, in contrast to (31.11), the MAP formulation incorporates information about the prior distribution for θ into the problem statement.
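The contrast between (31.11) and (31.12) can be made concrete with a small sketch. The example below is not from the text; it assumes a Gaussian likelihood for the mean with a Gaussian prior, for which the ML and MAP estimates have well-known closed forms (sample mean versus shrunk sample mean):

```python
import numpy as np

# Minimal sketch (not from the text): ML vs. MAP estimation of a Gaussian mean.
# Model assumption: y_n ~ N(theta, sigma2), with prior theta ~ N(0, tau2) for MAP.
rng = np.random.default_rng(0)
theta_true, sigma2, tau2, N = 1.5, 4.0, 1.0, 20
y = theta_true + np.sqrt(sigma2) * rng.standard_normal(N)

# ML estimate: maximizes the likelihood alone -> sample mean.
theta_ml = y.mean()

# MAP estimate: maximizes prior * likelihood; for this conjugate pair it is a
# shrunk version of the sample mean (pulled toward the prior mean, here 0).
theta_map = (N / sigma2) / (N / sigma2 + 1 / tau2) * y.mean()

print(theta_ml, theta_map)  # the MAP value sits between 0 and the ML value
```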


Example 31.3 (Comparing ML and minimum-variance unbiased estimation) We discussed the Gauss–Markov theorem in Section 29.6, where we considered observation vectors y generated by a linear model of the form y = Hθ + v. The parameter θ ∈ IR^M is unknown and the perturbation v has zero mean and covariance matrix R_v > 0. The minimum-variance unbiased estimator (MVUE) for θ, i.e., the unbiased estimator with the smallest MSE, was found to be

$$
\widehat{\boldsymbol{\theta}}_{\rm MVUE} = (H^{\sf T} R_v^{-1} H)^{-1} H^{\sf T} R_v^{-1} \boldsymbol{y} \tag{31.13}
$$

In this example, we wish to explain the relation to ML estimation. Although we are dealing now with vector quantities {θ, y}, the same ML construction applies: We form the log-likelihood function and maximize it over θ. For the ML derivation, however, we will assume additionally that v is Gaussian distributed. It follows from the model y = Hθ + v that y is Gaussian distributed with mean vector ȳ = Hθ and covariance matrix

$$
R_y \;\stackrel{\Delta}{=}\; \mathbb{E}\,(\boldsymbol{y}-\bar{y})(\boldsymbol{y}-\bar{y})^{\sf T} = \mathbb{E}\,\boldsymbol{v}\boldsymbol{v}^{\sf T} = R_v \tag{31.14}
$$

In other words, the pdf of y is given by

$$
f_{y}(y;\theta) = \frac{1}{\sqrt{(2\pi)^N}\,\sqrt{\det R_v}}\, \exp\left\{ -\frac{1}{2}(y - H\theta)^{\sf T} R_v^{-1}(y - H\theta) \right\} \tag{31.15}
$$

The corresponding log-likelihood function is

$$
\ell(y;\theta) = -\frac{1}{2}(y - H\theta)^{\sf T} R_v^{-1}(y - H\theta) + \text{cte} \tag{31.16}
$$

where terms independent of θ are grouped into the constant factor. Maximizing ℓ(y; θ) over θ amounts to minimizing the weighted least-squares cost:

$$
\widehat{\theta} = \arg\min_{\theta\in{\rm I\!R}^M}\, \Big\{ (y - H\theta)^{\sf T} R_v^{-1}(y - H\theta) \Big\} \tag{31.17}
$$

Differentiating with respect to θ we find that the minimizer occurs at

$$
\widehat{\theta}_{\rm ML} = (H^{\sf T} R_v^{-1} H)^{-1} H^{\sf T} R_v^{-1} y \tag{31.18}
$$

which has the same form as (31.13). The main difference, though, is that the ML derivation assumes the noise component to be Gaussian-distributed and seeks to maximize the log-likelihood function, while the Gauss–Markov theorem is independent of the distribution of the noise and minimizes the MSE.
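The closed form (31.18) is straightforward to evaluate numerically. The following sketch (illustrative only; the dimensions and true parameter are made up) builds a random linear model and evaluates the weighted least-squares estimator with a linear solve instead of an explicit matrix inverse:

```python
import numpy as np

# Sketch of the weighted least-squares / ML estimator (31.18); values are arbitrary.
rng = np.random.default_rng(1)
N, M = 50, 3
H = rng.standard_normal((N, M))
theta_true = np.array([1.0, -2.0, 0.5])
Rv = np.diag(rng.uniform(0.5, 2.0, size=N))          # positive-definite noise covariance
v = rng.multivariate_normal(np.zeros(N), Rv)
y = H @ theta_true + v

Rv_inv = np.linalg.inv(Rv)
# theta_hat = (H^T Rv^{-1} H)^{-1} H^T Rv^{-1} y, computed via a linear solve
theta_hat = np.linalg.solve(H.T @ Rv_inv @ H, H.T @ Rv_inv @ y)
print(theta_hat)   # close to theta_true for moderate noise levels
```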

31.2 GAUSSIAN DISTRIBUTION

We illustrate the ML construction by considering the problem of estimating the mean and variance of a Gaussian distribution. Thus, consider a collection of N iid Gaussian observations, {y_n}, with unknown mean µ and variance σ_y². The joint pdf (or likelihood function) of the observations is given by

$$
f_{y_1,\ldots,y_N}(y_1,\ldots,y_N;\mu,\sigma_y^2) = \prod_{n=1}^{N} \frac{1}{(2\pi\sigma_y^2)^{1/2}}\, e^{-\frac{1}{2\sigma_y^2}(y_n-\mu)^2} \tag{31.19}
$$

so that

$$
\ell(y_1,\ldots,y_N;\mu,\sigma_y^2) = -\frac{N}{2}\ln(2\pi\sigma_y^2) - \frac{1}{2\sigma_y^2}\sum_{n=1}^{N}(y_n-\mu)^2 \tag{31.20}
$$

Differentiating this log-likelihood function relative to µ and σ_y² and setting the derivatives to zero, we obtain two equations in the unknowns (µ̂, σ̂_y²):

$$
\frac{1}{\widehat{\sigma}_y^2}\sum_{n=1}^{N}(y_n - \widehat{\mu}) = 0 \tag{31.21a}
$$

$$
-N\widehat{\sigma}_y^2 + \sum_{n=1}^{N}(y_n - \widehat{\mu})^2 = 0 \tag{31.21b}
$$

Solving these equations leads to the ML estimates:

$$
\widehat{\mu}_{\rm ML} = \frac{1}{N}\sum_{n=1}^{N} y_n \tag{31.22a}
$$

$$
\widehat{\sigma}_{y,{\rm ML}}^2 = \frac{1}{N}\sum_{n=1}^{N}(y_n - \widehat{\mu}_{\rm ML})^2 \tag{31.22b}
$$

as well as to similar expressions for the ML estimators, where all variables are treated as random variables and expressed in boldface notation:

$$
\widehat{\boldsymbol{\mu}}_{\rm ML} = \frac{1}{N}\sum_{n=1}^{N} \boldsymbol{y}_n \tag{31.23a}
$$

$$
\widehat{\boldsymbol{\sigma}}_{y,{\rm ML}}^2 = \frac{1}{N}\sum_{n=1}^{N}(\boldsymbol{y}_n - \widehat{\boldsymbol{\mu}}_{\rm ML})^2 \tag{31.23b}
$$

It is straightforward to verify from these expressions that one of the estimators is unbiased while the other is biased; see Prob. 31.1 where it is shown that

$$
\mathbb{E}\,\widehat{\boldsymbol{\mu}}_{\rm ML} = \mu, \qquad \mathbb{E}\,\widehat{\boldsymbol{\sigma}}_{y,{\rm ML}}^2 = \left(\frac{N-1}{N}\right)\sigma_y^2 \tag{31.24}
$$

Although the variance estimator is biased, it nevertheless becomes asymptotically unbiased as N → ∞. This does not mean that we cannot construct an unbiased estimator for σ_y² for finite N. Actually, the rightmost expression in (31.24) suggests the following construction:

$$
\widehat{\boldsymbol{\sigma}}_{y,{\rm unbiased}}^2 = \frac{1}{N-1}\sum_{n=1}^{N}(\boldsymbol{y}_n - \widehat{\boldsymbol{\mu}}_{\rm ML})^2 \tag{31.25}
$$

where the scaling by 1/N in (31.23b) is replaced by 1/(N − 1) so that

$$
\mathbb{E}\,\widehat{\boldsymbol{\sigma}}_{y,{\rm unbiased}}^2 = \sigma_y^2 \tag{31.26}
$$

This second construction, however, is not an ML estimator.


What about the MSE performance? In this case, we can construct yet another estimator for σ_y² with a smaller MSE than σ̂²_{y,ML}. To see this, assume we pose the problem of searching for an estimator for σ_y² of the following form:

$$
\widehat{\boldsymbol{\sigma}}_{y,{\rm MSE}}^2 \;\stackrel{\Delta}{=}\; \alpha \sum_{n=1}^{N}(\boldsymbol{y}_n - \widehat{\boldsymbol{\mu}}_{\rm ML})^2 \tag{31.27}
$$

for some scalar α > 0 chosen to minimize the resulting MSE:

$$
\alpha^o = \arg\min_{\alpha}\, \mathbb{E}\,(\sigma_y^2 - \widehat{\boldsymbol{\sigma}}_{y,{\rm MSE}}^2)^2 \tag{31.28}
$$

We show in Prob. 31.2 that α^o = 1/(N + 1) so that the third estimator is:

$$
\widehat{\boldsymbol{\sigma}}_{y,{\rm MSE}}^2 = \frac{1}{N+1}\sum_{n=1}^{N}(\boldsymbol{y}_n - \widehat{\boldsymbol{\mu}}_{\rm ML})^2 \tag{31.29}
$$

Obviously, this estimator is biased. It agrees with neither the ML estimator (31.23b), which is scaled by 1/N, nor the unbiased estimator (31.25), which is scaled by 1/(N − 1).
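The three scalings 1/N, 1/(N − 1), and 1/(N + 1) are easy to compare by simulation. The following sketch (illustrative only; the true mean and variance are arbitrary) estimates the bias and MSE of each variance estimator by Monte Carlo:

```python
import numpy as np

# Sketch comparing the 1/N (ML), 1/(N-1) (unbiased), and 1/(N+1) (smallest-MSE)
# variance estimators for Gaussian data; the true mean/variance are arbitrary.
rng = np.random.default_rng(2)
mu, sigma2, N, trials = 1.0, 4.0, 20, 200_000

y = mu + np.sqrt(sigma2) * rng.standard_normal((trials, N))
ss = ((y - y.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)   # sum of squared deviations

for name, scale in [("ML (1/N)", N), ("unbiased (1/(N-1))", N - 1), ("MSE (1/(N+1))", N + 1)]:
    est = ss / scale
    print(f"{name:>20s}: bias = {est.mean() - sigma2:+.4f}, MSE = {((est - sigma2) ** 2).mean():.4f}")
# The 1/(N-1) version has (near) zero bias; the 1/(N+1) version has the smallest MSE.
```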

Example 31.4 (Fitting Gaussian and beta distributions) The top row in Fig. 31.1 shows on the left a histogram distribution for the serum cholesterol level measured in mg/dl for N = 297 patients. The vertical axis measures absolute frequencies. The plot uses 15 bins of width 30 mg/dl each, and shows how many patients fall within each bin. The same plot is normalized on the right by dividing each bin value by the N = 297 measurements and by the bin width – recall the explanation given in Remark 6.1. By doing so, the result is an approximate pdf. A Gaussian pdf is fitted on top of the normalized data. The mean and variance of the Gaussian distribution are determined by using expressions (31.23a) and (31.23b). If we denote the cholesterol level by the random variable y, then the sample mean and variance values are found to be

$$
\widehat{\mu}_{\rm cholesterol} = \widehat{\mu}_y = \frac{1}{N}\sum_{n=1}^{N} y_n \approx 247.35 \tag{31.30a}
$$

$$
\widehat{\sigma}_{\rm cholesterol}^2 = \frac{1}{N-1}\sum_{n=1}^{N}(y_n - \widehat{\mu}_y)^2 \approx 2703.7 \tag{31.30b}
$$

where y_n refers to the nth cholesterol measurement. The bottom row in Fig. 31.1 repeats the same construction for the maximal heart rate of a patient measured in beats per minute (bpm) from the same dataset. If we denote the heart rate by the random variable z, then the sample mean and variance values are found to be

$$
\widehat{\mu}_{\rm heartrate} = \widehat{\mu}_z = \frac{1}{N}\sum_{n=1}^{N} z_n \approx 149.60 \tag{31.31a}
$$

$$
\widehat{\sigma}_{\rm heartrate}^2 = \frac{1}{N-1}\sum_{n=1}^{N}(z_n - \widehat{\mu}_z)^2 \approx 526.32 \tag{31.31b}
$$

where z_n refers to the nth heart-rate measurement. By examining the rightmost lower plot in Fig. 31.1, it appears that the histogram distribution is skewed to the right. This observation motivates us to consider fitting a different distribution onto the data in order to better capture the skewness in the histogram; this is not possible if we persist with the Gaussian distribution due to its symmetry.


Figure 31.1 (Top) Histogram distribution of the serum cholesterol level measured in mg/dl on the left using 15 bins of width 30 mg/dl each, and its normalized version on the right where each bin value is divided by N = 297 measurements and by the bin width. By doing so, the result is an approximate pdf. A Gaussian pdf is fitted on top of the normalized data. (Bottom) A similar construction for the maximum heart rate of a patient measured in bpm. A Gaussian pdf is fitted on top of the normalized data. The data is derived from the processed Cleveland dataset from https://archive.ics.uci.edu/ml/datasets/heart+Disease.

First, we normalize the heart rate variable so that it is confined to the interval [0, 1). We do so by dividing z by (a slightly larger number than) the maximum heart rate in the data, which is 202. We denote this normalized variable by t. We have access to N = 297 measurements {t_n}, obtained by normalizing the heart rates z_n by ε + max z_n (for a small ε; this ensures that all values of t_n are strictly less than 1 so that logarithms of 1 − t_n will be well defined further ahead in (31.34)). Next, we consider fitting a beta distribution onto the data {t_n}. The pdf of a beta distribution has the form:

$$
f_{t}(t;a,b) = \begin{cases} \dfrac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, t^{a-1}(1-t)^{b-1}, & 0 \le t \le 1 \\[4pt] 0, & \text{otherwise} \end{cases} \tag{31.32}
$$


where Γ(x) denotes the gamma function defined earlier in Prob. 4.3. Different choices for (a, b) result in different behavior for the distribution f_t(t). We need to estimate the shape parameters (a, b). We have a collection of N independent measurements {t_n}. The likelihood function of these observations is given by

$$
f_{t_1,\ldots,t_N}(t_1,\ldots,t_N;a,b) = \left(\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\right)^{\!N}\, \prod_{n=1}^{N} t_n^{a-1}(1-t_n)^{b-1} \tag{31.33}
$$

so that, in the log domain,

$$
\ell(t_1,\ldots,t_N;a,b) = N\ln\Gamma(a+b) - N\ln\Gamma(a) - N\ln\Gamma(b) + (a-1)\sum_{n=1}^{N}\ln t_n + (b-1)\sum_{n=1}^{N}\ln(1-t_n) \tag{31.34}
$$

Differentiating with respect to a and b gives

$$
\partial\ell(t_1,\ldots,t_N;a,b)/\partial a = N\left(\frac{\Gamma'(a+b)}{\Gamma(a+b)} - \frac{\Gamma'(a)}{\Gamma(a)}\right) + \sum_{n=1}^{N}\ln t_n \tag{31.35}
$$

$$
\partial\ell(t_1,\ldots,t_N;a,b)/\partial b = N\left(\frac{\Gamma'(a+b)}{\Gamma(a+b)} - \frac{\Gamma'(b)}{\Gamma(b)}\right) + \sum_{n=1}^{N}\ln(1-t_n) \tag{31.36}
$$

where Γ′(x) denotes the derivative of the Γ-function. Two complications arise here. First, we need to know how to compute ratios of the form ψ(x) = Γ′(x)/Γ(x) for the gamma function; this ratio is known as the digamma function and it is equal to the derivative of ln Γ(x). The computation of the digamma function is not straightforward. As was mentioned earlier in (5.62), and based on properties of the gamma function, it is known that

$$
\psi(x) \;\stackrel{\Delta}{=}\; \frac{\Gamma'(x)}{\Gamma(x)} \approx -0.577215665 + \sum_{m=0}^{\infty}\left(\frac{1}{1+m} - \frac{1}{x+m}\right) \tag{31.37}
$$

The expression on the right-hand side can be used to approximate Γ′(x)/Γ(x) by replacing the infinite series by a finite sum. Second, even then, if we set the derivatives (31.35)–(31.36) to zero, the resulting equations will not admit a closed-form solution for the parameters (a, b). Another way to seek values (â, b̂) that maximize the likelihood function is to employ a gradient-ascent recursion of the following form for n ≥ 0 (along the lines discussed in Chapter 12 on gradient-descent algorithms):

$$
a_n = a_{n-1} + \mu\left\{ \frac{\Gamma'(a_{n-1}+b_{n-1})}{\Gamma(a_{n-1}+b_{n-1})} - \frac{\Gamma'(a_{n-1})}{\Gamma(a_{n-1})} + \frac{1}{N}\sum_{n=1}^{N}\ln t_n \right\} \tag{31.38}
$$

$$
b_n = b_{n-1} + \mu\left\{ \frac{\Gamma'(a_{n-1}+b_{n-1})}{\Gamma(a_{n-1}+b_{n-1})} - \frac{\Gamma'(b_{n-1})}{\Gamma(b_{n-1})} + \frac{1}{N}\sum_{n=1}^{N}\ln(1-t_n) \right\} \tag{31.39}
$$

where µ is a small step-size parameter. These recursions need to be initialized from a good starting point. In this example, we repeat the iterations for a total of 10,000 times using µ = 0.001. We use the construction explained next to determine good initial conditions (a_{−1}, b_{−1}). Actually, the construction provides yet another method to fit a beta distribution onto the data, albeit one that does not need to run the gradient-ascent recursion altogether.



Figure 31.2 (Top) Normalized histogram for the scaled heart rate variables {tn }, along with three pdfs: a Gaussian fit, a beta distribution fit obtained from the moment matching method, and a beta distribution fit obtained from a gradient-ascent iteration for ML. (Bottom) The same probability distributions with the horizontal axis returned to the original heart rate scale (obtained by multiplying the horizontal axis of the top figure by the maximum heart rate, as well as scaling the vertical axis down by the same value to ensure that the area under each of the probability distributions stays normalized to 1).

Indeed, the mean and variance of a beta distribution with shape parameters a and b are given by

$$
\bar{t} = \frac{a}{a+b}, \qquad \sigma_t^2 = \frac{ab}{(a+b)^2(a+b+1)} \tag{31.40}
$$

We can solve these equations in terms of a and b and find that

$$
a = \bar{t}\left(\frac{\bar{t}(1-\bar{t})}{\sigma_t^2} - 1\right) \tag{31.41}
$$

$$
b = (1-\bar{t})\left(\frac{\bar{t}(1-\bar{t})}{\sigma_t^2} - 1\right) \tag{31.42}
$$

These expressions suggest another method (called a moment matching method) to fit the beta distribution to data measurements. We estimate the mean and variance of the distribution from the data, say, as

$$
\widehat{\bar{t}} = \frac{1}{N}\sum_{n=1}^{N} t_n, \qquad \widehat{\sigma}_t^2 = \frac{1}{N-1}\sum_{n=1}^{N}\big(t_n - \widehat{\bar{t}}\,\big)^2 \tag{31.43}
$$

and then use these values in (31.41)–(31.42) to estimate a and b. Using this construction we obtain

$$
\widehat{a} = 10.2900, \quad \widehat{b} = 3.6043 \qquad \text{(moment matching)} \tag{31.44}
$$

This is of course not an ML solution. Using these values as initial conditions for the gradient-ascent iterations (31.38)–(31.39) we arrive at a second set of estimates for a and b:

$$
\widehat{a} = 10.2552, \quad \widehat{b} = 3.0719 \qquad \text{(ML method)} \tag{31.45}
$$

The resulting beta distributions are shown in Fig. 31.2 along with the Gaussian distribution from the earlier figure for comparison purposes.
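As a rough illustration of the two fitting routes just described, here is a minimal sketch that computes the moment-matching estimates (31.41)–(31.43) and then refines them with the gradient-ascent recursions (31.38)–(31.39), using a truncated version of the digamma series (31.37). It runs on synthetic beta-distributed samples rather than the heart-rate data, so the numbers will differ from (31.44)–(31.45):

```python
import numpy as np

# Sketch: fit a beta distribution by moment matching, then refine by gradient ascent.
# Synthetic data is used here; the book example uses the normalized heart-rate data.
rng = np.random.default_rng(3)
t = rng.beta(10.0, 3.0, size=297)          # stand-in for the normalized measurements t_n

def digamma(x, terms=5_000):
    # Truncated series approximation of psi(x) = Gamma'(x)/Gamma(x), as in (31.37)
    m = np.arange(terms)
    return -0.577215665 + np.sum(1.0 / (1.0 + m) - 1.0 / (x + m))

# Moment matching, (31.41)-(31.43)
tbar, s2 = t.mean(), t.var(ddof=1)
c = tbar * (1 - tbar) / s2 - 1
a, b = tbar * c, (1 - tbar) * c            # initial conditions (a_{-1}, b_{-1})

# Gradient ascent, (31.38)-(31.39)
mu, L1, L2 = 0.001, np.log(t).mean(), np.log(1 - t).mean()
for _ in range(10_000):
    common = digamma(a + b)
    a_new = a + mu * (common - digamma(a) + L1)
    b_new = b + mu * (common - digamma(b) + L2)
    a, b = a_new, b_new

print(a, b)   # ML-refined shape parameters
```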

Example 31.5 (Fitting the empirical data distribution – discrete case) Let us return to the ML formulation (31.5) and provide two useful interpretations for it in terms of what is known as the empirical data distribution. The interpretations are easier to describe for discrete random variables, y. For convenience, we will continue to denote the probability mass function (pmf) for y by the notation f_y(y; θ) so that

$$
f_{y}(y;\theta) \;\text{ stands for }\; \mathbb{P}(\boldsymbol{y}=y;\theta) \tag{31.46}
$$

In this notation, the pmf is dependent on the parameter θ. The observations for y arise from a discrete set Y representing the support of its pmf, i.e.,

$$
y \in \mathcal{Y} \;\stackrel{\Delta}{=}\; \{o_1, o_2, \ldots, o_L\} \tag{31.47}
$$

where we are denoting the individual elements of Y by {o_ℓ}. We consider a collection of N independent realizations {y_n}. The ML problem for estimating θ does not change if we scale the likelihood function by 1/N so that:

$$
\widehat{\theta}_{\rm ML} = \arg\max_{\theta}\, \left\{ \frac{1}{N}\sum_{n=1}^{N}\ln f_{y}(y_n;\theta) \right\} \qquad \text{(log-likelihood)} \tag{31.48}
$$

If we now examine the observations {y_n}, some of them may assume repeated values. Let p_ℓ denote the relative frequency for realization o_ℓ in the observed set (this is a measure of how often o_ℓ appears within the observation set). In particular, if o_ℓ appears a_ℓ times within the N observations, then

$$
p_\ell \;\stackrel{\Delta}{=}\; a_\ell / N \qquad \text{(relative frequency for } o_\ell\text{)} \tag{31.49}
$$

In this way, we end up constructing an empirical distribution (or histogram) with the observed data defined by

$$
\widehat{f}_{y}(y=o_\ell) = p_\ell \tag{31.50}
$$

Figure 31.3 compares the parameterized pmf f_y(y; θ) and the empirical distribution f̂_y(y), which corresponds to a normalized histogram with all relative frequencies {p_ℓ} adding up to 1.


Figure 31.3 Parameterized pmf f_y(y; θ) on the left versus the empirical distribution, f̂_y(y), on the right, which amounts to a normalized histogram with the relative frequencies {p_ℓ} adding up to 1.

Now, using expression (6.43), the Kullback–Leibler (KL) divergence between the empirical and actual pmfs is given by

$$
\begin{aligned}
D_{\rm KL}(\widehat{f}_y \,\|\, f_y) \;&\stackrel{\Delta}{=}\; \mathbb{E}_{\widehat{f}_y}\ln\widehat{f}_y(\boldsymbol{y}) - \mathbb{E}_{\widehat{f}_y}\ln f_y(\boldsymbol{y};\theta) \\
&= \sum_{\ell=1}^{L} p_\ell \ln p_\ell - \sum_{\ell=1}^{L} p_\ell \ln f_y(y=o_\ell;\theta) \\
&= \sum_{\ell=1}^{L} p_\ell \ln p_\ell - \frac{1}{N}\sum_{\ell=1}^{L} a_\ell \ln f_y(y=o_\ell;\theta) \\
&= \sum_{\ell=1}^{L} p_\ell \ln p_\ell - \frac{1}{N}\sum_{n=1}^{N} \ln f_y(y_n;\theta)
\end{aligned} \tag{31.51}
$$

where the expectations are computed relative to the empirical distribution. It follows that the ML solution, which maximizes the rightmost term in the above equation, is effectively minimizing the KL divergence between the empirical pmf f̂_y(y) and the fitted model f_y(y; θ):

$$
\widehat{\theta}_{\rm ML} = \arg\min_{\theta}\, D_{\rm KL}(\widehat{f}_y \,\|\, f_y) \qquad \text{(KL divergence)} \tag{31.52}
$$

Another useful interpretation for the ML solution follows if we appeal to the conclusion from Example 6.10, which relates the cross-entropy between two distributions to their KL divergence. The cross-entropy between the empirical pmf f̂_y(y) and the fitted model f_y(y; θ) is defined by

$$
H(\widehat{f}_y, f_y) = -\mathbb{E}_{\widehat{f}_y}\ln f_y(\boldsymbol{y};\theta) = -\sum_{\ell=1}^{L} p_\ell \ln f_y(y=o_\ell;\theta) = -\frac{1}{N}\sum_{n=1}^{N}\ln f_y(y_n;\theta) \tag{31.53}
$$

and, hence, it also holds that

$$
\widehat{\theta}_{\rm ML} = \arg\min_{\theta}\, H(\widehat{f}_y, f_y) \qquad \text{(cross-entropy)} \tag{31.54}
$$

In summary, the following interpretations hold:

$$
\begin{cases}
\widehat{\theta}_{\rm ML} \text{ maximizes the log-likelihood function (31.48)}\\
\widehat{\theta}_{\rm ML} \text{ minimizes the KL divergence (31.52)}\\
\widehat{\theta}_{\rm ML} \text{ minimizes the cross-entropy (31.54)}
\end{cases} \tag{31.55}
$$
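To make the equivalence concrete, the following sketch (with an arbitrary two-outcome model p(y = 1) = θ, not from the text) evaluates the scaled log-likelihood and the cross-entropy with the empirical pmf over a grid of θ values and checks that they are maximized/minimized at the same point:

```python
import numpy as np

# Sketch: for a two-outcome pmf with parameter theta = P(y = 1), the scaled
# log-likelihood and the cross-entropy with the empirical pmf peak/dip at the same theta.
y = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])        # arbitrary observations
p1 = y.mean()                                        # empirical relative frequency of "1"

thetas = np.linspace(0.01, 0.99, 99)
loglik = p1 * np.log(thetas) + (1 - p1) * np.log(1 - thetas)   # (1/N) sum ln f(y_n; theta)
cross_entropy = -loglik                                         # H(f_hat, f) from (31.53)

print(thetas[np.argmax(loglik)], thetas[np.argmin(cross_entropy)], p1)
# Both selections coincide (and approach the relative frequency as the grid is refined).
```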

Example 31.6 (Fitting the data distribution – continuous case) We extend the conclusions of the previous example to continuous random variables y as follows. We assume the realizations for y arise from a true pdf denoted by f_y(y; θ^o) and parameterized by some unknown parameter θ^o. We again consider a collection of N independent realizations {y_n} arising from this distribution. The ML formulation fits a model f_y(y; θ) by seeking a value for θ that solves:

$$
\widehat{\theta}_{\rm ML} = \arg\max_{\theta}\, \left\{ \frac{1}{N}\sum_{n=1}^{N}\ln f_y(y_n;\theta) \right\} \tag{31.56}
$$

Under ergodicity, and as N → ∞, the above problem can be replaced by

$$
\widehat{\theta}_{\rm ML} = \arg\max_{\theta}\, \mathbb{E}_{\theta^o}\big\{ \ln f_y(\boldsymbol{y};\theta) \big\} \qquad \text{(log-likelihood)} \tag{31.57}
$$

where the expectation is over the true distribution, y ∼ f_y(y; θ^o). We are highlighting this fact by writing E_{θ^o}, with a subscript θ^o. Now note from expression (6.43) that the KL divergence between the true and fitted pdfs is given by

$$
D_{\rm KL}\big(f_y(y;\theta^o)\,\|\,f_y(y;\theta)\big) \;\stackrel{\Delta}{=}\; \mathbb{E}_{\theta^o}\big\{\ln f_y(\boldsymbol{y};\theta^o)\big\} - \mathbb{E}_{\theta^o}\big\{\ln f_y(\boldsymbol{y};\theta)\big\} \tag{31.58}
$$

where the first term is independent of θ. Since the ML solution maximizes the rightmost term in the above equation, we conclude that it is also minimizing the KL divergence between the true and fitted models, namely, for large enough N:

$$
\widehat{\theta}_{\rm ML} = \arg\min_{\theta}\, D_{\rm KL}\big(f_y(y;\theta^o)\,\|\,f_y(y;\theta)\big) \qquad \text{(KL divergence)} \tag{31.59}
$$

We encountered one application for this result earlier in Example 28.7 while motivating the logistic risk. Another interpretation for the ML solution can be obtained if we refer to the conclusion of Example 6.10, which relates the cross-entropy between two distributions to their KL divergence. The cross-entropy between the true and fitted models is defined by

$$
H\big(f_y(y;\theta^o), f_y(y;\theta)\big) \;\stackrel{\Delta}{=}\; -\mathbb{E}_{\theta^o}\big\{\ln f_y(\boldsymbol{y};\theta)\big\} \tag{31.60}
$$

and, hence, it also holds that for N large enough:

$$
\widehat{\theta}_{\rm ML} = \arg\min_{\theta}\, H\big(f_y(y;\theta^o), f_y(y;\theta)\big) \qquad \text{(cross-entropy)} \tag{31.61}
$$

In summary, the following interpretations hold:

$$
\begin{cases}
\widehat{\theta}_{\rm ML} \text{ maximizes the log-likelihood function (31.57)}\\
\widehat{\theta}_{\rm ML} \text{ minimizes the KL divergence (31.59)}\\
\widehat{\theta}_{\rm ML} \text{ minimizes the cross-entropy (31.61)}
\end{cases} \tag{31.62}
$$

31.3 MULTINOMIAL DISTRIBUTION

We continue to illustrate the ML construction by considering next the problem of estimating the parameters defining a multinomial distribution, which is a generalization of the binomial distribution. Recall that the binomial distribution is useful to model the outcome of an experiment involving the tossing of a coin N times. Each experiment consists of only two possible outcomes: "heads" or "tails." The binomial distribution then allows us to assess the likelihood of observing y heads in N tosses by means of the following expression defined in terms of the factorial operation:

$$
\mathbb{P}(y \text{ heads in } N \text{ tosses}) = \binom{N}{y} p^{y}(1-p)^{N-y} = \frac{N!}{y!(N-y)!}\, p^{y}(1-p)^{N-y} \tag{31.63}
$$

where p ∈ [0, 1] denotes the probability of observing a "head" in any given toss of the coin. The multinomial distribution generalizes this setting and allows each experiment to involve more than two outcomes, say, L ≥ 2 of them. This situation arises, for example, if we toss a die with L faces, with each face ℓ having a probability p_ℓ of being observed and with the probabilities satisfying the obvious normalization:

$$
\sum_{\ell=1}^{L} p_\ell = 1 \tag{31.64}
$$

The multinomial distribution then allows us to assess the likelihood of observing y_1 times the first face, y_2 times the second face, and so on, in a total of N tosses, by means of the following expression:

$$
\mathbb{P}(y_1, y_2, \ldots, y_L \text{ in } N \text{ tosses}) = \frac{N!}{y_1!\, y_2! \cdots y_L!}\, p_1^{y_1} p_2^{y_2} \cdots p_L^{y_L} = N! \prod_{\ell=1}^{L} \frac{p_\ell^{y_\ell}}{y_\ell!} = f_{y_1,y_2,\ldots,y_L}(y_1, y_2, \ldots, y_L) \tag{31.65}
$$

where the last line introduces a compact notation for the probability expression. It is clear that the multinomial distribution is parameterized by the scalars {p_1, p_2, . . . , p_L}. In the next two examples, we consider special cases and delay the general case to Section 31.4, where we study the exponential family of distributions.

Example 31.7 (Elephants, horses, and cars) Consider a multinomial distribution with L = 3 outcomes with probabilities {p_1, p_2, p_3}. The same arguments and derivation that follow can be extended to an arbitrary number L of outcomes. Thus, assume we are dealing with an experiment involving a box with L = 3 types of images in it: Type #1 are images of horses, type #2 are images of elephants, and type #3 are images of cars. The probability of selecting any given type ℓ is p_ℓ; this situation is illustrated schematically in Fig. 31.4, where we are assuming for this example that the probabilities {p_1, p_2, p_3} are parameterized in terms of some unknown scalar parameter θ as follows:

$$
p_1 = \frac{1}{4}, \qquad p_2 = \frac{1}{4} + \theta, \qquad p_3 = \frac{1}{2} - \theta \tag{31.66}
$$

Assume we repeat the experiment a total of N independent times, and write down the number of times, {y_1, y_2, y_3}, that images of types #1, #2, and #3 are observed. In view of the multinomial distribution (31.65), the probability of observing each type ℓ a number y_ℓ of times is given by

$$
f_{y_1,y_2,y_3}(y_1,y_2,y_3) = \frac{N!}{y_1!\, y_2!\, y_3!}\, p_1^{y_1} p_2^{y_2} p_3^{y_3} \tag{31.67}
$$

where the y_ℓ assume integer values and satisfy:

$$
y_\ell \in \{0, 1, \ldots, N\}, \qquad y_1 + y_2 + y_3 = N \tag{31.68}
$$

Expression (31.67) allows us to assess the likelihood of observing an "elephant" y_1 times, a "horse" y_2 times, and a "car" y_3 times. Now, using the observations {y_1, y_2, y_3}, we are interested in determining the ML estimate for the parameter θ (and, consequently, for the probabilities {p_1, p_2, p_3}). This can be done by maximizing directly the log-likelihood function, which in this case is given by

$$
\begin{aligned}
\ell(y_1,y_2,y_3;\theta) \;&\stackrel{\Delta}{=}\; \ln\left\{ \frac{N!}{y_1!y_2!y_3!}\left(\frac{1}{4}\right)^{y_1}\left(\frac{1}{4}+\theta\right)^{y_2}\left(\frac{1}{2}-\theta\right)^{y_3} \right\}\\
&= \ln\left(\frac{N!}{y_1!y_2!y_3!}\right) + y_1\ln\left(\frac{1}{4}\right) + y_2\ln\left(\frac{1}{4}+\theta\right) + y_3\ln\left(\frac{1}{2}-\theta\right)
\end{aligned} \tag{31.69}
$$


Figure 31.4 The box contains L = 3 types of images (elephants, horses, and cars). The probability of picking with replacement an image of type ℓ is p_ℓ; these probabilities are parameterized by the scalar θ in (31.66). The source of the individual images is www.pixabay.com, where images are free to use.

Differentiating with respect to θ and setting the derivative to zero at θ = θ̂ gives

$$
\left.\left( \frac{y_2}{\frac{1}{4}+\theta} - \frac{y_3}{\frac{1}{2}-\theta} \right)\right|_{\theta=\widehat{\theta}} = 0 \;\Longrightarrow\; \widehat{\theta}_{|y_1,y_2,y_3} = \frac{\frac{1}{2}y_2 - \frac{1}{4}y_3}{y_2 + y_3} \tag{31.70}
$$

where the subscript added to θ̂ is meant to indicate that this estimate is based on the measurements y_1, y_2, and y_3.

Example 31.8 (Partial information) The solution in the previous example assumes that we have access to the number of outcomes {y_1, y_2, y_3}. Let us now consider a scenario where we only have access to partial information. Assume that all we know is the number of times that an "animal" image has been observed and the number of times that a "car" image has been observed. That is, we only know the quantities y_1 + y_2 and y_3. We are still interested in estimating θ from this information. This particular problem can still be solved directly using the ML formulation. To do so, we need to determine the likelihood function for the random variables {y_1 + y_2, y_3}, which can be found from the following calculation (note that the variables (s, y_3) below satisfy s + y_3 = N):

$$
\begin{aligned}
f_{s,y_3}(y_1+y_2=s, y_3;\theta) &= \sum_{m=0}^{s} f_{y_1,y_2,y_3}(y_1=m, y_2=s-m, y_3;\theta)\\
&= \sum_{m=0}^{s} \frac{N!}{m!(s-m)!\,y_3!}\left(\frac{1}{4}\right)^{m}\left(\frac{1}{4}+\theta\right)^{s-m}\left(\frac{1}{2}-\theta\right)^{y_3}\\
&= \frac{N!}{s!\,y_3!}\left(\frac{1}{2}-\theta\right)^{y_3}\sum_{m=0}^{s}\frac{s!}{m!(s-m)!}\left(\frac{1}{4}\right)^{m}\left(\frac{1}{4}+\theta\right)^{s-m}\\
&= \frac{N!}{s!\,y_3!}\left(\frac{1}{2}-\theta\right)^{y_3}\sum_{m=0}^{s}\binom{s}{m}\left(\frac{1}{4}\right)^{m}\left(\frac{1}{4}+\theta\right)^{s-m}
\end{aligned} \tag{31.71}
$$


We now call upon the binomial theorem, which states that, for any integer s and real numbers a and b:

$$
(a+b)^s = \sum_{m=0}^{s}\binom{s}{m} a^m b^{s-m} \tag{31.72}
$$

and use it to simplify (31.71) as

$$
\begin{aligned}
f_{s,y_3}(y_1+y_2=s, y_3;\theta) &= \frac{N!}{s!\,y_3!}\left(\frac{1}{2}-\theta\right)^{y_3}\left(\frac{1}{2}+\theta\right)^{s}\\
&\overset{(a)}{=} \frac{N!}{s!(N-s)!}\left(\frac{1}{2}+\theta\right)^{s}\left(\frac{1}{2}-\theta\right)^{N-s}\\
&= \binom{N}{s}\left(\frac{1}{2}+\theta\right)^{s}\left(\frac{1}{2}-\theta\right)^{N-s}
\end{aligned} \tag{31.73}
$$

where step (a) follows from the fact that the value of y_3 is fixed at y_3 = N − s. Note that expression (31.73) shows that the sum variable s = y_1 + y_2 follows a binomial distribution with success rate equal to ½ + θ. It follows that the log-likelihood function is given by:

$$
\ln f_{s,y_3}(y_1+y_2=s, y_3;\theta) = \ln\binom{N}{s} + s\ln\left(\frac{1}{2}+\theta\right) + y_3\ln\left(\frac{1}{2}-\theta\right) \tag{31.74}
$$

Differentiating with respect to θ and setting the derivative to zero at θ = θ̂ leads to:

$$
\left.\left( \frac{s}{\frac{1}{2}+\theta} - \frac{y_3}{\frac{1}{2}-\theta} \right)\right|_{\theta=\widehat{\theta}} = 0 \;\Longrightarrow\; \widehat{\theta}_{|y_1+y_2,y_3} = \frac{\frac{1}{2}(y_1+y_2) - \frac{1}{2}y_3}{N} \tag{31.75}
$$

where the subscript added to θ̂ is meant to indicate that this estimate is based on the measurements y_1 + y_2 and y_3.
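As a quick sanity check (an illustrative simulation, not from the text), the two closed-form estimates (31.70) and (31.75) can be compared on synthetic draws from the parameterization (31.66):

```python
import numpy as np

# Sketch: simulate the box experiment of Example 31.7 and compare the ML estimate
# from the full counts (31.70) with the one from partial counts (31.75).
rng = np.random.default_rng(4)
theta_true, N = 0.1, 10_000
p = [0.25, 0.25 + theta_true, 0.5 - theta_true]
y1, y2, y3 = rng.multinomial(N, p)

theta_full = (0.5 * y2 - 0.25 * y3) / (y2 + y3)       # uses (y1, y2, y3)
theta_partial = (0.5 * (y1 + y2) - 0.5 * y3) / N      # uses only (y1 + y2, y3)
print(theta_full, theta_partial)                      # both close to theta_true
```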

31.4 EXPONENTIAL FAMILY OF DISTRIBUTIONS

We illustrate next the application of the ML formulation to the exponential family of probability distributions. We showed in Chapter 5 that this family includes many other distributions as special cases, such as the Gaussian distribution, the binomial distribution, the multinomial distribution, the Dirichlet distribution, the gamma distribution, and others. Thus, consider a vector random variable y ∈ IR^P that follows the exponential distribution described by (5.2) in its natural or canonical form, namely,

$$
f_{y}(y;\theta) = h(y)\, e^{\theta^{\sf T} T(y) - a(\theta)} \tag{31.76}
$$

where the pdf is parameterized by θ ∈ IR^M, and the functions {h(y), T(y), a(θ)} satisfy

$$
h(y) \ge 0: {\rm I\!R}^P \to {\rm I\!R}, \qquad T(y): {\rm I\!R}^P \to {\rm I\!R}^M, \qquad a(\theta): {\rm I\!R}^M \to {\rm I\!R} \tag{31.77}
$$

Note that we are allowing the parameter θ to be vector-valued, as well as the observation y. We refer to h(y) as the base function, to a(θ) as the log-partition function, and to T(y) as the sufficient statistic. We showed in Table 5.1 how different distributions can be obtained as special cases of (31.76) through the selection of {h(y), T(y), a(θ)}. We continue with the general description (31.76) and derive the ML estimator for θ. Assume we have a collection of N iid realizations {y_n}, arising from the exponential distribution (31.76) with unknown parameter vector, θ. The joint pdf (or likelihood function) of the observations is given by

$$
f_{y_1,\ldots,y_N}(y_1,\ldots,y_N;\theta) = \prod_{n=1}^{N} h(y_n)\, e^{\theta^{\sf T} T(y_n) - a(\theta)} = \left(\prod_{n=1}^{N} h(y_n)\right) e^{-Na(\theta)} \exp\left\{ \theta^{\sf T}\sum_{n=1}^{N} T(y_n) \right\} \tag{31.78}
$$

so that the log-likelihood function is

$$
\ell(y_1,\ldots,y_N;\theta) = \sum_{n=1}^{N}\ln h(y_n) - Na(\theta) + \sum_{n=1}^{N}\theta^{\sf T} T(y_n) \tag{31.79}
$$

It was argued after (5.87) that this function is concave in the parameter θ. Computing the gradient vector relative to θ and setting it to zero at θ = θ̂ gives

$$
-N\,\nabla_{\theta^{\sf T}}\, a(\widehat{\theta}) + \sum_{n=1}^{N} T(y_n) = 0 \tag{31.80}
$$

and, hence, the ML estimate θ̂ is found by solving the equation:

$$
\nabla_{\theta^{\sf T}}\, a(\widehat{\theta}_{\rm ML}) = \frac{1}{N}\sum_{n=1}^{N} T(y_n) \tag{31.81}
$$

This is an important conclusion. It shows that in order to estimate the parameter θ, it is sufficient to know the average of the values of {T(y_n)}; the individual measurements {y_n} are not needed. We say that the function T(y) plays the role of a sufficient statistic for y, or that the sample average of the values {T(y_n)} is sufficient knowledge for the problem of estimating θ – recall the comments on the concept of a sufficient statistic at the end of Chapter 5. We comment further on this concept in Example 31.10.


Example 31.9 (Gaussian distribution) Let us illustrate how construction (31.81) reduces to known results by considering the first row of Table 5.1, which corresponds to the Gaussian distribution. In this case we have

$$
\theta = \begin{bmatrix} \mu/\sigma_y^2 \\ -1/2\sigma_y^2 \end{bmatrix}, \qquad T(y) = \begin{bmatrix} y \\ y^2 \end{bmatrix}, \qquad a(\theta) = -\frac{1}{2}\ln(-2\theta_2) - \frac{\theta_1^2}{4\theta_2} \tag{31.82}
$$

so that

$$
\nabla_{\theta^{\sf T}}\, a(\theta) = \begin{bmatrix} \partial a(\theta)/\partial\theta_1 \\ \partial a(\theta)/\partial\theta_2 \end{bmatrix} = \begin{bmatrix} -\dfrac{\theta_1}{2\theta_2} \\[8pt] -\dfrac{1}{2\theta_2} + \dfrac{\theta_1^2}{4\theta_2^2} \end{bmatrix} = \begin{bmatrix} \mu \\ \sigma_y^2 + \mu^2 \end{bmatrix} \tag{31.83}
$$

It follows from (31.81) that

$$
\begin{bmatrix} \widehat{\mu}_{\rm ML} \\ \widehat{\sigma}_{y,{\rm ML}}^2 + \widehat{\mu}_{\rm ML}^2 \end{bmatrix} = \begin{bmatrix} \dfrac{1}{N}\displaystyle\sum_{n=1}^{N} y_n \\[10pt] \dfrac{1}{N}\displaystyle\sum_{n=1}^{N} y_n^2 \end{bmatrix} \tag{31.84}
$$

These expressions agree with the earlier result (31.23b) since they lead to the estimates:

$$
\widehat{\mu}_{\rm ML} = \frac{1}{N}\sum_{n=1}^{N} y_n, \qquad \widehat{\sigma}_{y,{\rm ML}}^2 = \frac{1}{N}\sum_{n=1}^{N} y_n^2 - \widehat{\mu}_{\rm ML}^2 \tag{31.85}
$$
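A small numeric sketch (illustrative values only) of the moment-matching relation (31.81) for the Gaussian member of the family: the sample averages of T(y) = [y, y²] are all that is needed to recover (µ̂, σ̂²):

```python
import numpy as np

# Sketch: ML estimation for the Gaussian via its sufficient statistics T(y) = [y, y^2].
# Only the sample averages of T(y_n) are needed, as in (31.81)/(31.84).
rng = np.random.default_rng(5)
mu_true, sigma2_true, N = 2.0, 1.5, 5_000
y = mu_true + np.sqrt(sigma2_true) * rng.standard_normal(N)

T_avg = np.array([y.mean(), (y ** 2).mean()])   # (1/N) sum T(y_n)
mu_hat = T_avg[0]
sigma2_hat = T_avg[1] - mu_hat ** 2             # from (31.85)
print(mu_hat, sigma2_hat)
```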

Example 31.10 (Sufficient statistic) The function T(y) that appears in (31.81) plays the important role of a sufficient statistic. Generally, any function of an observation, T(y), is a statistic. However, this concept is most relevant when the statistic happens to be sufficient. The statistic is said to be sufficient for estimating a parameter θ from a measurement y if the conditional distribution of y given T(y) does not depend on θ:

$$
f_{y|T(y)}(y|T(y)) \text{ does not depend on } \theta \;\Longrightarrow\; T(y) \text{ is a sufficient statistic} \tag{31.86}
$$

A key factorization theorem in statistics states that T(y) is a sufficient statistic for θ if, and only if, the pdf f_y(y; θ) can be factored in the form – recall expression (5.138) and Prob. 5.11:

$$
f_{y}(y;\theta) = h(y)\, g(T(y);\theta) \tag{31.87}
$$

That is, the pdf can be factored as the product of two nonnegative functions, h(·) and g(·; θ), such that h(y) depends solely on y, while g(T(y); θ), which depends on θ, depends on the observation only through T(y). If we examine the form of the exponential distribution (31.76) we observe that it can be written in this form with

$$
g(T(y);\theta) = e^{\theta^{\sf T} T(y) - a(\theta)} \tag{31.88}
$$

More commonly, one often discusses sufficiency in the context of estimating θ from a collection of iid observations {y_n} arising from the distribution f_y(y). If we consider N such observations then their joint pdf is given by

$$
f_{y_1,y_2,\ldots,y_N}(y_1,y_2,\ldots,y_N;\theta) = \prod_{n=1}^{N} h(y_n)\, e^{\theta^{\sf T} T(y_n) - a(\theta)} = \left(\prod_{n=1}^{N} h(y_n)\right) \exp\left\{ \theta^{\sf T}\left(\sum_{n=1}^{N} T(y_n)\right) - Na(\theta) \right\} \tag{31.89}
$$

which is seen to be of the desired factored form (31.87) with

$$
g\!\left(\sum_{n=1}^{N} T(y_n);\theta\right) = \exp\left\{ \theta^{\sf T}\left(\sum_{n=1}^{N} T(y_n)\right) - Na(\theta) \right\} \tag{31.90}
$$

Therefore, the ML estimate for θ from a collection of iid observations {y_1, . . . , y_N} can be determined by relying solely on knowledge of the sufficient statistic given by the sum

$$
\text{sufficient statistic} \;\stackrel{\Delta}{=}\; \sum_{n=1}^{N} T(y_n) \tag{31.91}
$$

rather than on the individual observations. We already observe this feature in expression (31.81).

31.5 CRAMER–RAO LOWER BOUND

The derivation in Section 31.2 provides examples of ML estimators that can be biased or unbiased. It also shows examples of estimators that differ in their MSE. Ideally, we would like our estimators to be unbiased and to have the smallest MSE (or variance) possible. The Cramer–Rao bound is a useful result in that regard. It provides a lower bound on the variance for any estimator (whether of ML-type or not). Estimators that meet the Cramer–Rao bound are said to be efficient since no other estimator will be able to deliver a smaller variance or MSE. Efficient estimators that are also unbiased belong to the class of MVUE because they are both unbiased and attain the smallest variance. It turns out that, under certain regularity conditions, the ML estimators are asymptotically unbiased and efficient as the number of observations grows, i.e., as N → ∞. Thus, consider the problem of estimating an unknown constant parameter θ, which may be scalar or vector-valued, from an observation vector y. For generality, we describe the Cramer–Rao bound for the case of vector parameters, θ ∈ IR^M. We denote the individual entries of θ by {θ_m} and denote the corresponding estimation error by

$$
\widetilde{\boldsymbol{\theta}}_m \;\stackrel{\Delta}{=}\; \theta_m - \widehat{\boldsymbol{\theta}}_m \tag{31.92}
$$

where the estimators {θ̂_m} are all assumed to be unbiased, i.e., E θ̂_m = θ_m. The error covariance matrix is denoted by

$$
R_{\tilde{\theta}} = \mathbb{E}\,(\theta - \widehat{\boldsymbol{\theta}})(\theta - \widehat{\boldsymbol{\theta}})^{\sf T} = \mathbb{E}\,\widetilde{\boldsymbol{\theta}}\,\widetilde{\boldsymbol{\theta}}^{\sf T} \tag{31.93}
$$

where θ̂ = col{θ̂_m} and θ̃ = θ − θ̂.


31.5.1 Fisher Information Matrix

We associate with the inference problem an M × M Fisher information matrix, whose entries are constructed as follows. The (n, m)th entry is defined in terms of the (negative) expectation of the second-order partial derivative of the log-likelihood function relative to the parameter entries:

$$
[F(\theta)]_{n,m} \;\stackrel{\Delta}{=}\; -\mathbb{E}\left\{ \frac{\partial^2 \ln f_y(\boldsymbol{y};\theta)}{\partial\theta_n\,\partial\theta_m} \right\}, \qquad n,m = 1,2,\ldots,M \tag{31.94}
$$

or, in matrix form and using the Hessian matrix notation:

$$
F(\theta) \;\stackrel{\Delta}{=}\; -\mathbb{E}\,\nabla_{\theta}^2 \ln f_y(\boldsymbol{y};\theta) \tag{31.95}
$$

The derivatives in these expressions are evaluated at the true value for the parameter θ. The Fisher information matrix helps reflect how much information the distribution of y conveys about θ. The above expressions define the Fisher information matrix relative to a single observation y.

31.5.2 Score Function

Expression (31.94) assumes that the log-likelihood function is twice-differentiable with respect to θ. Under a regularity condition that integration and differentiation operations are exchangeable (recall the discussion in Appendix 16.A), there is an equivalent form for the Fisher matrix as the covariance matrix of the so-called score function, namely,

$$
F(\theta) = \mathbb{E}\, S(\theta) S^{\sf T}(\theta) \tag{31.96}
$$

where the score function is defined in terms of the gradient vector with respect to θ (i.e., only first-order derivatives are involved):

$$
S(\theta) \;\stackrel{\Delta}{=}\; \nabla_{\theta^{\sf T}} \ln f_y(\boldsymbol{y};\theta) \tag{31.97}
$$

Proof of (31.96): Note first that

$$
\begin{aligned}
\nabla_{\theta}^2 \ln f_y(y;\theta) &= \nabla_{\theta^{\sf T}}\big(\nabla_{\theta}\ln f_y(y;\theta)\big)\\
&= \nabla_{\theta^{\sf T}}\left( \frac{\nabla_{\theta} f_y(y;\theta)}{f_y(y;\theta)} \right)\\
&= \frac{f_y(y;\theta)\,\nabla_{\theta}^2 f_y(y;\theta) - \nabla_{\theta^{\sf T}} f_y(y;\theta)\,\nabla_{\theta} f_y(y;\theta)}{f_y^2(y;\theta)}\\
&= \frac{\nabla_{\theta}^2 f_y(y;\theta)}{f_y(y;\theta)} - S(\theta)S^{\sf T}(\theta)
\end{aligned} \tag{31.98}
$$

Consequently, if we denote the domain of y by y ∈ Y,

$$
\begin{aligned}
\mathbb{E}\, S(\theta)S^{\sf T}(\theta) &\overset{(31.98)}{=} -\mathbb{E}\,\nabla_{\theta}^2 \ln f_y(\boldsymbol{y};\theta) + \mathbb{E}\left\{ \frac{\nabla_{\theta}^2 f_y(\boldsymbol{y};\theta)}{f_y(\boldsymbol{y};\theta)} \right\}\\
&= -\mathbb{E}\,\nabla_{\theta}^2 \ln f_y(\boldsymbol{y};\theta) + \int_{y\in\mathcal{Y}} \nabla_{\theta}^2 f_y(y;\theta)\, dy\\
&\overset{(a)}{=} -\mathbb{E}\,\nabla_{\theta}^2 \ln f_y(\boldsymbol{y};\theta) + \nabla_{\theta}^2 \underbrace{\int_{y\in\mathcal{Y}} f_y(y;\theta)\, dy}_{=\,1}\\
&= -\mathbb{E}\,\nabla_{\theta}^2 \ln f_y(\boldsymbol{y};\theta)
\end{aligned} \tag{31.99}
$$

where step (a) assumes that the operations of integration and differentiation can be exchanged.

It is explained in Appendix 31.A that under two regularity conditions stated in (31.229)–(31.230), the log-likelihood function satisfies for any θ:

$$
\mathbb{E}\left\{ \frac{\partial \ln f_y(\boldsymbol{y};\theta)}{\partial\theta_m} \right\} = 0, \qquad m = 1,2,\ldots,M \tag{31.100}
$$

which implies that the score function exists, is bounded, and has zero mean:

$$
\mathbb{E}\, S(\theta) = \mathbb{E}\,\nabla_{\theta}\ln f_y(\boldsymbol{y};\theta) = 0 \tag{31.101}
$$

It follows that the Fisher information matrix defined by (31.96) is the actual covariance matrix of the score function. We obtain from (31.96) that we also have:

$$
[F(\theta)]_{n,m} = \mathbb{E}\left\{ \frac{\partial \ln f_y(\boldsymbol{y};\theta)}{\partial\theta_n}\, \frac{\partial \ln f_y(\boldsymbol{y};\theta)}{\partial\theta_m} \right\}, \qquad n,m = 1,2,\ldots,M \tag{31.102}
$$

This expression again defines the Fisher information matrix relative to a single observation.
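The identity (31.96) is easy to spot-check numerically. The sketch below (illustrative only, for a scalar Gaussian mean with known variance) compares the Monte Carlo average of the squared score with the negative expected second derivative, both of which should approach 1/σ²:

```python
import numpy as np

# Sketch: verify F(theta) = E[S^2] = -E[d^2 ln f / d theta^2] for y ~ N(theta, sigma2),
# where the score is S = (y - theta)/sigma2 and the second derivative is -1/sigma2.
rng = np.random.default_rng(6)
theta, sigma2, samples = 0.7, 2.0, 1_000_000
y = theta + np.sqrt(sigma2) * rng.standard_normal(samples)

score = (y - theta) / sigma2
print(np.mean(score ** 2), 1 / sigma2)   # covariance of the score vs. Fisher information
print(np.mean(score))                    # score has (approximately) zero mean, cf. (31.101)
```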

Example 31.11 (Diagonal covariance matrix) Consider a vector Gaussian distribution with mean µ ∈ IR^P and diagonal covariance matrix

$$
\Sigma_y = {\rm diag}\{\sigma_1^2, \sigma_2^2, \ldots, \sigma_P^2\} \tag{31.103}
$$

We denote the individual entries of the mean vector by

$$
\mu = {\rm col}\{\mu_1, \mu_2, \ldots, \mu_P\} \tag{31.104}
$$

We can write the pdf in the form:

$$
f_y(y) = \frac{1}{\sqrt{(2\pi)^P}}\,\frac{1}{\sqrt{\prod_{p=1}^{P}\sigma_p^2}}\, \prod_{p=1}^{P}\exp\left\{ -\frac{1}{2\sigma_p^2}(y_p - \mu_p)^2 \right\} \tag{31.105}
$$

We wish to evaluate the Fisher information matrix of this distribution relative to its mean and variance parameters. First, note that the log-likelihood function is given by

$$
\ln f_y(y) = -\frac{P}{2}\ln(2\pi) - \frac{1}{2}\sum_{p=1}^{P}\ln\sigma_p^2 - \sum_{p=1}^{P}\frac{1}{2\sigma_p^2}(y_p - \mu_p)^2 \tag{31.106}
$$


so that

$$
\partial \ln f_y(y)/\partial\mu_p = \frac{1}{\sigma_p^2}(y_p - \mu_p) \tag{31.107}
$$

$$
\partial \ln f_y(y)/\partial\sigma_p^2 = -\frac{1}{2\sigma_p^2} + \frac{1}{2\sigma_p^4}(y_p - \mu_p)^2 \tag{31.108}
$$

The Fisher information matrix in this case has dimensions 2P × 2P. Let us order the parameters in θ with the {µ_p} coming first followed by the {σ_p²}, for p = 1, 2, . . . , P:

$$
\theta = \{\mu_1, \ldots, \mu_P, \sigma_1^2, \ldots, \sigma_P^2\} \tag{31.109}
$$

Then, the diagonal entries of the Fisher information matrix are given by

$$
[F(\mu,\Sigma)]_{p,p} = \mathbb{E}\left( \frac{\partial\ln f_y(\boldsymbol{y})}{\partial\mu_p}\,\frac{\partial\ln f_y(\boldsymbol{y})}{\partial\mu_p} \right) = \frac{1}{\sigma_p^4}\,\mathbb{E}\,(\boldsymbol{y}_p - \mu_p)^2 = 1/\sigma_p^2 \tag{31.110}
$$

and

$$
\begin{aligned}
[F(\mu,\Sigma)]_{p+P,p+P} &= \mathbb{E}\left( \frac{\partial\ln f_y(\boldsymbol{y})}{\partial\sigma_p^2}\,\frac{\partial\ln f_y(\boldsymbol{y})}{\partial\sigma_p^2} \right)\\
&= \mathbb{E}\left( -\frac{1}{2\sigma_p^2} + \frac{1}{2\sigma_p^4}(\boldsymbol{y}_p - \mu_p)^2 \right)^{\!2}\\
&= \frac{1}{4\sigma_p^4} + \frac{1}{4\sigma_p^8}\,\mathbb{E}\,(\boldsymbol{y}_p - \mu_p)^4 - \frac{1}{2\sigma_p^6}\,\mathbb{E}\,(\boldsymbol{y}_p - \mu_p)^2\\
&= \frac{1}{4\sigma_p^4} + \frac{3\sigma_p^4}{4\sigma_p^8} - \frac{\sigma_p^2}{2\sigma_p^6}\\
&= \frac{1}{2\sigma_p^4}
\end{aligned} \tag{31.111}
$$

where we used the fact that, for a Gaussian random variable x with mean x̄ and variance σ_x², it holds that E(x − x̄)⁴ = 3σ_x⁴. On the other hand, for p ≠ p′ = 1, 2, . . . , P, the off-diagonal entries of the Fisher information matrix are given by

$$
[F(\mu,\Sigma)]_{p,p'} = \mathbb{E}\left( \frac{\partial\ln f_y(\boldsymbol{y})}{\partial\mu_p}\,\frac{\partial\ln f_y(\boldsymbol{y})}{\partial\mu_{p'}} \right) = \frac{1}{\sigma_p^2\sigma_{p'}^2}\,\mathbb{E}\,(\boldsymbol{y}_p - \mu_p)(\boldsymbol{y}_{p'} - \mu_{p'}) = 0 \tag{31.112}
$$

since {y_p, y_{p′}} are uncorrelated (actually independent) random variables due to the diagonal covariance structure. Likewise, for p ≠ p′:

$$
\begin{aligned}
[F(\mu,\Sigma)]_{p+P,p'+P} &= \mathbb{E}\left( \frac{\partial\ln f_y(\boldsymbol{y})}{\partial\sigma_p^2}\,\frac{\partial\ln f_y(\boldsymbol{y})}{\partial\sigma_{p'}^2} \right)\\
&= \mathbb{E}\left( -\frac{1}{2\sigma_p^2} + \frac{1}{2\sigma_p^4}(\boldsymbol{y}_p - \mu_p)^2 \right)\left( -\frac{1}{2\sigma_{p'}^2} + \frac{1}{2\sigma_{p'}^4}(\boldsymbol{y}_{p'} - \mu_{p'})^2 \right)\\
&= \frac{1}{4\sigma_p^2\sigma_{p'}^2} - \frac{\sigma_{p'}^2}{4\sigma_p^2\sigma_{p'}^4} - \frac{\sigma_p^2}{4\sigma_p^4\sigma_{p'}^2} + \frac{\sigma_p^2\sigma_{p'}^2}{4\sigma_p^4\sigma_{p'}^4}\\
&= \frac{1}{4\sigma_p^2\sigma_{p'}^2} - \frac{1}{4\sigma_p^2\sigma_{p'}^2} - \frac{1}{4\sigma_p^2\sigma_{p'}^2} + \frac{1}{4\sigma_p^2\sigma_{p'}^2}\\
&= 0
\end{aligned} \tag{31.113}
$$

while

$$
[F(\mu,\Sigma)]_{p,p'+P} = \mathbb{E}\left( \frac{\partial\ln f_y(\boldsymbol{y})}{\partial\mu_p}\,\frac{\partial\ln f_y(\boldsymbol{y})}{\partial\sigma_{p'}^2} \right) = \frac{1}{\sigma_p^2}\,\mathbb{E}\left\{ (\boldsymbol{y}_p - \mu_p)\left( -\frac{1}{2\sigma_{p'}^2} + \frac{1}{2\sigma_{p'}^4}(\boldsymbol{y}_{p'} - \mu_{p'})^2 \right) \right\} = 0 \tag{31.114}
$$

and

$$
[F(\mu,\Sigma)]_{p+P,p'} = \mathbb{E}\left( \frac{\partial\ln f_y(\boldsymbol{y})}{\partial\sigma_p^2}\,\frac{\partial\ln f_y(\boldsymbol{y})}{\partial\mu_{p'}} \right) = \frac{1}{\sigma_{p'}^2}\,\mathbb{E}\left\{ (\boldsymbol{y}_{p'} - \mu_{p'})\left( -\frac{1}{2\sigma_p^2} + \frac{1}{2\sigma_p^4}(\boldsymbol{y}_p - \mu_p)^2 \right) \right\} = 0 \tag{31.115}
$$

We conclude that the Fisher information matrix in this case is diagonal and given by

$$
F(\mu,\Sigma) = {\rm diag}\left\{ \frac{1}{\sigma_1^2}, \frac{1}{\sigma_2^2}, \ldots, \frac{1}{\sigma_P^2}, \frac{1}{2\sigma_1^4}, \frac{1}{2\sigma_2^4}, \ldots, \frac{1}{2\sigma_P^4} \right\} \tag{31.116}
$$

31.5.3 Cramer–Rao Bound

We list the Cramer–Rao bound for the two situations of unbiased and biased estimators, and also show the correction that needs to be made to the Fisher information matrix when a multitude of observations is used to determine the ML estimator rather than a single observation.

Unbiased estimators
The Cramer–Rao lower bound for unbiased estimators amounts to the statement that (see Appendix 31.A for one derivation):

$$
\mathbb{E}\,\widetilde{\boldsymbol{\theta}}_m^2 \;\ge\; \big[F^{-1}(\theta)\big]_{m,m} \tag{31.117}
$$


in terms of the mth diagonal entry of the inverse of the Fisher matrix. The result can also be rewritten in terms of the variance of the individual entries as follows:

$$
{\rm var}(\widehat{\boldsymbol{\theta}}_m) \;\ge\; \big[F^{-1}(\theta)\big]_{m,m} \tag{31.118}
$$

or in terms of the aggregate covariance matrix of the estimator:

$$
R_{\hat{\theta}} \;\ge\; F^{-1}(\theta) \qquad \text{(unbiased vector estimators)} \tag{31.119}
$$

where the notation A ≥ B for two nonnegative-definite matrices means that A − B ≥ 0. In the special case in which θ happens to be a scalar, the Cramer–Rao lower bound (31.117) can be rewritten equivalently in the forms:

$$
\mathbb{E}\,\widetilde{\boldsymbol{\theta}}^2 \;\ge\; \left( -\mathbb{E}\,\frac{\partial^2\ln f_y(\boldsymbol{y};\theta)}{\partial\theta^2} \right)^{-1} \tag{31.120a}
$$

$$
\phantom{\mathbb{E}\,\widetilde{\boldsymbol{\theta}}^2\;} = \left( \mathbb{E}\left( \frac{\partial\ln f_y(\boldsymbol{y};\theta)}{\partial\theta} \right)^{\!2} \right)^{-1} \tag{31.120b}
$$

or, more compactly,

$$
{\rm var}(\widehat{\boldsymbol{\theta}}) \;\ge\; 1/F(\theta) \qquad \text{(unbiased scalar estimators)} \tag{31.121}
$$

Biased estimators
When the estimator is biased, with mean denoted by g(θ) = E θ̂ for some function of θ, statement (31.120a) for the Cramer–Rao bound for scalar parameters is modified as follows (see Appendix 31.A):

$$
\mathbb{E}\,\big(g(\theta) - \widehat{\boldsymbol{\theta}}\big)^2 \;\ge\; \left( -\mathbb{E}\,\frac{\partial^2\ln f_y(\boldsymbol{y};\theta)}{\partial\theta^2} \right)^{-1}\left( \frac{\partial g(\theta)}{\partial\theta} \right)^{\!2} \tag{31.122}
$$

or, equivalently, in terms of the variance of the estimator:

$$
{\rm var}(\widehat{\boldsymbol{\theta}}) \;\ge\; \frac{\big(\partial g(\theta)/\partial\theta\big)^2}{F(\theta)} \qquad \text{(biased scalar estimators)} \tag{31.123}
$$

The special case g(θ) = θ reduces this expression to (31.120a). For vector parameters θ, the corresponding relation becomes

$$
\mathbb{E}\,\big(g(\theta) - \widehat{\boldsymbol{\theta}}\big)\big(g(\theta) - \widehat{\boldsymbol{\theta}}\big)^{\sf T} \;\ge\; \big(\nabla_{\theta^{\sf T}}\, g(\theta)\big)\, F^{-1}(\theta)\, \big(\nabla_{\theta}\, g(\theta)\big) \tag{31.124}
$$

or, equivalently, in terms of the covariance matrix of the estimator:

$$
R_{\hat{\theta}} \;\ge\; \big(\nabla_{\theta^{\sf T}}\, g(\theta)\big)\, F^{-1}(\theta)\, \big(\nabla_{\theta}\, g(\theta)\big) \qquad \text{(biased vector estimators)} \tag{31.125}
$$


Multiple observations
The Cramer–Rao bounds listed so far are expressed in terms of the Fisher information matrix for a single observation. If a collection of N iid realizations {y_n} is used to determine the estimator θ̂, as is usually the case, then F(θ) will need to be scaled by N. This is because the Fisher information matrix that is associated with N observations will be

$$
\begin{aligned}
F_N(\theta) \;&\stackrel{\Delta}{=}\; -\mathbb{E}\,\nabla_{\theta}^2 \ln\left( \prod_{n=1}^{N} f_y(\boldsymbol{y}_n;\theta) \right)\\
&= -\mathbb{E}\,\nabla_{\theta}^2 \left( \sum_{n=1}^{N} \ln f_y(\boldsymbol{y}_n;\theta) \right)\\
&= -\sum_{n=1}^{N} \mathbb{E}\,\nabla_{\theta}^2 \ln f_y(\boldsymbol{y}_n;\theta)\\
&= -N\,\mathbb{E}\,\nabla_{\theta}^2 \ln f_y(\boldsymbol{y}_n;\theta)
\end{aligned} \tag{31.126}
$$

where in the last step we used the fact that the observations are identically distributed. It follows that

$$
F_N(\theta) = N\,F(\theta) \qquad \text{(when } N \text{ observations are used)} \tag{31.127}
$$

31.5.4 Efficiency and Consistency

The ML estimator exhibits several important properties, which have been studied at great length in the literature. We list three classical properties here without proof; derivations can be found in the references listed at the end of the chapter. It can be shown, again under some reasonable regularity conditions, that the ML estimator satisfies the following three conclusions:

(a) (Consistency). An estimator θ̂_N, based on N observations, for some unknown θ is said to be consistent if θ̂_N converges to θ in probability, meaning that for any ε > 0:

$$
\lim_{N\to\infty} \mathbb{P}\big(|\theta - \widehat{\boldsymbol{\theta}}_N| > \epsilon\big) = 0 \qquad \text{(convergence in probability)} \tag{31.128}
$$

Maximum-likelihood estimators satisfy this property and are therefore consistent.

(b) (Asymptotic normality). The random variable √N θ̃_N converges in distribution to a Gaussian pdf with zero mean and covariance matrix F⁻¹(θ), written as:

$$
\sqrt{N}\,\widetilde{\boldsymbol{\theta}}_N = \sqrt{N}\,(\theta - \widehat{\boldsymbol{\theta}}_N) \;\overset{d}{\longrightarrow}\; \mathcal{N}\big(0, F^{-1}(\theta)\big), \quad \text{as } N\to\infty \tag{31.129}
$$

where θ is the true unknown parameter. It follows that the ML estimator is asymptotically unbiased as well.


(c) (Efficiency). Maximum-likelihood estimators are asymptotically efficient, meaning that their covariance matrix R_θ̂ attains the Cramer–Rao bound (31.119) in the limit as N → ∞. There are also situations in which ML estimators attain the bound even for finite sample sizes, N – see the next example. The property of asymptotic efficiency follows from asymptotic normality since the latter implies that

$$
\widehat{\boldsymbol{\theta}}_N \;\overset{d}{\longrightarrow}\; \mathcal{N}\big(\theta, F_N^{-1}(\theta)\big) \tag{31.130}
$$

where F_N(θ) = N F(θ).

Example 31.12 (Estimating a DC level) Consider a collection of N iid random measurements:

$$
\boldsymbol{y}_n = \theta + \boldsymbol{v}_n \tag{31.131}
$$

where v_n is Gaussian noise with zero mean and variance σ_v², while θ is an unknown constant parameter (which amounts to the mean of y_n). The likelihood function is given by

$$
f_{y_1,\ldots,y_N}(y_1,\ldots,y_N;\theta) = \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\sigma_v^2}}\,\exp\left\{ -\frac{1}{2\sigma_v^2}(y_n - \theta)^2 \right\} \tag{31.132}
$$

so that the log-likelihood function is

$$
\ell(y_1,\ldots,y_N;\theta) = -\frac{N}{2}\ln(2\pi\sigma_v^2) - \frac{1}{2\sigma_v^2}\sum_{n=1}^{N}(y_n - \theta)^2 \tag{31.133}
$$

We conclude from setting the derivative to zero that the ML estimate for θ is given by

$$
\widehat{\theta}_{\rm ML} = \frac{1}{N}\sum_{n=1}^{N} y_n \tag{31.134}
$$

which is clearly unbiased with error variance given by

$$
\mathbb{E}\,(\theta - \widehat{\boldsymbol{\theta}}_{\rm ML})^2 = \mathbb{E}\left( \theta - \frac{1}{N}\sum_{n=1}^{N}\boldsymbol{y}_n \right)^{\!2} = \mathbb{E}\left( \theta - \frac{1}{N}\sum_{n=1}^{N}(\theta + \boldsymbol{v}_n) \right)^{\!2} = \frac{1}{N^2}\sum_{n=1}^{N}\mathbb{E}\,\boldsymbol{v}_n^2 = \sigma_v^2/N \tag{31.135}
$$

It can be verified that the two regularity conditions (31.229)–(31.230) hold in this case – see Prob. 31.18. We compute next the Cramer–Rao lower bound. For this purpose, we first evaluate

$$
\frac{\partial^2 \ell(y_1,\ldots,y_N;\theta)}{\partial\theta^2} = -N/\sigma_v^2 \tag{31.136}
$$

so that the lower bound is given by

$$
\left( -\mathbb{E}\,\frac{\partial^2 \ln f(y_1,\ldots,y_N;\theta)}{\partial\theta^2} \right)^{-1} = \sigma_v^2/N \tag{31.137}
$$

which agrees with the error variance found in (31.135). Therefore, the ML estimator in this case is efficient. Observe from (31.135) that the error variance in this example decays at the rate of 1/N, in inverse proportion to the sample size. There are situations where the error variance can decay exponentially fast. One such example is given in the commentaries at the end of the chapter in (31.206); that example deals with the same problem of recovering θ from noisy measurements under Gaussian noise, except that the unknown θ is constrained to being an integer.
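A quick Monte Carlo check (with illustrative values) that the sample-mean estimator of Example 31.12 attains the Cramer–Rao bound σ_v²/N:

```python
import numpy as np

# Sketch: empirical variance of the sample-mean estimator vs. the Cramer-Rao bound.
rng = np.random.default_rng(7)
theta, sigma2_v, N, trials = 3.0, 2.0, 25, 100_000

y = theta + np.sqrt(sigma2_v) * rng.standard_normal((trials, N))
theta_hat = y.mean(axis=1)                 # ML estimate (31.134) for each trial

print(np.var(theta_hat), sigma2_v / N)     # empirical error variance vs. bound (31.137)
```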

Example 31.13 (Bias, efficiency, and consistency) Let θ be an unknown scalar parameter that we wish to estimate. Let θ̂_N be an estimator for θ that is based on N observations. We encountered in our presentation of ML estimators three important properties related to the notions of bias, efficiency, and consistency. These properties apply to other types of estimators as well. We describe them together in this example for ease of comparison:

(a) An estimator θ̂_N is said to be unbiased if E θ̂_N = θ. It is said to be asymptotically unbiased if this equality holds in the limit as N → ∞. Thus, the notion of bias relates to a property about the mean (or the first-order moment) of the distribution of the random variable θ − θ̂_N.

(b) An estimator θ̂_N is said to be efficient if its variance attains the Cramer–Rao bound (31.123). The estimator is said to be asymptotically efficient if it attains the Cramer–Rao bound in the limit as N → ∞. Thus, the notion of efficiency relates to a property about the second-order moment of the distribution of the random variable θ − θ̂_N.

(c) An estimator θ̂_N is said to be consistent if θ̂_N converges to θ in probability, meaning that for any ε > 0:

$$
\lim_{N\to\infty} \mathbb{P}\big(|\theta - \widehat{\boldsymbol{\theta}}_N| > \epsilon\big) = 0 \qquad \text{(convergence in probability)} \tag{31.138}
$$

Thus, the notion of consistency relates to a property about the limiting distribution of the estimator θ̂_N, which tends to concentrate around θ.

31.6 MODEL SELECTION

Once a family of density distributions f_y(y; θ) is selected, parameterized by some θ, the ML formulation allows us to estimate θ from observations {y_n}. This construction presumes that the designer has already selected what family of distributions to use. In many situations of interest, the designer will be faced with the additional task of selecting the family of distributions from among a collection of possible choices. Each family k will be parameterized by its own θ_k. In these cases, the designer will need to (a) select the best family of distributions from among the available choices and, moreover, (b) estimate the optimal parameter θ for the selected family.


There are several criteria that can be used to solve this problem, with the log-likelihood function and the ML estimate playing a prominent role in the solution. In this section, we describe the Bayesian information criterion (BIC), the Akaike information criterion (AIC), and the minimum description length (MDL) criterion for selecting among density models. They all serve as “goodness-of-fit” tests and guide the selection of the best model. In Section 31.6.5 we describe another method for choosing among models that is based on the cross-validation technique; this technique has found widespread application in learning and inference problems and leads to good performance, often under weaker conditions than needed for the BIC, AIC, and MDL methods of this section.

31.6.1 Motivation and Overfitting

Consider a collection of N data measurements {y_1, y_2, . . . , y_N} and a collection of K probability density models with parameters {θ_1, θ_2, . . . , θ_K}. Each model (or class or family) θ_k amounts to assuming a pdf parameterized by θ_k for the observation y, say, according to

$$
f_y(y;\theta_k) \qquad \text{(kth pdf model with parameter } \theta_k\text{)} \tag{31.139}
$$

For each k, the size of θ_k is denoted by M_k and it can vary over k. For example, θ_1 could be representing a Gaussian distribution with unknown mean µ_1 but known variance σ_1², in which case

$$
f_y(y;\theta_1) \sim \mathcal{N}_y(\mu_1, \sigma_1^2), \qquad \theta_1 = \{\mu_1\} \tag{31.140}
$$

This corresponds to a problem where the number of parameters to be selected is M_1 = 1. The second model θ_2 could be representing a second Gaussian distribution with both unknown mean and variance:

$$
f_y(y;\theta_2) \sim \mathcal{N}_y(\mu_2, \sigma_2^2), \qquad \theta_2 = \{\mu_2, \sigma_2^2\} \tag{31.141}
$$

In this case, we would need to learn two parameters with M_2 = 2. Likewise, θ_3 could correspond to a third model where we are trying to fit the sum of two Gaussian distributions, say,

$$
f_y(y;\theta_3) \sim \pi\,\mathcal{N}_y(\mu_a, \sigma_a^2) + (1-\pi)\,\mathcal{N}_y(\mu_b, \sigma_b^2) \tag{31.142a}
$$

$$
\theta_3 = \{\mu_a, \mu_b, \sigma_a^2, \sigma_b^2, \pi\} \tag{31.142b}
$$

where π ∈ (0, 1). In this case, we would need to learn five parameters with M_3 = 5.

Overfitting
Generally, the more complex the model θ_k is, the more parameters it will involve (i.e., the larger the value of M_k will be). While complex models can be expected to fit the data better because of the degree of freedom that results from using a larger number of parameters, they are nevertheless less desirable in practice. We are going to learn later in this text that complex models lead to overfitting; a property that we should avoid. Overfitting essentially amounts to using more complex models to fit the data than is necessary. This can be illustrated by means of an example. Assume each y_n is a scalar measurement that arises from small perturbations to a quadratic function of the form:

$$
y = a x^2 + b x + c + \text{small noise} \qquad \text{(true model)} \tag{31.143}
$$

where x is given and y is the response. For each given x_n, we measure the corresponding noisy y_n according to this model. We could then use the N data points {x_n, y_n} to fit a quadratic model to the data. This can be done by estimating the parameter vector θ = {a, b, c} of size M = 3 by solving a least-squares problem of the form:

$$
\{\widehat{a}, \widehat{b}, \widehat{c}\} = \arg\min_{\{a,b,c\}}\, \left\{ \sum_{n=1}^{N} (y_n - a x_n^2 - b x_n - c)^2 \right\} \tag{31.144}
$$

Each term in the above cost function penalizes the squared error between the noisy measurement y_n and its quadratic fit. It is straightforward to differentiate the above cost relative to {a, b, c} and determine expressions for their estimates – see Prob. 31.23. The expressions are not relevant for the discussion here, but once they are determined they can be used, for example, to compute predictions for future values x_m by using the fitted model:

$$
\widehat{y}_m = \widehat{a}\, x_m^2 + \widehat{b}\, x_m + \widehat{c} \qquad \text{(prediction)} \tag{31.145}
$$

If the model parameters have been learned well, one would expect ŷ_m to provide a good prediction for the noiseless value of y_m that would have been observed under the true model (a, b, c), namely,

$$
y_m = a x_m^2 + b x_m + c \tag{31.146}
$$

This situation is illustrated in the left plot of Fig. 31.5. The red curve shows N = 21 noisy measurements resulting from the parameters

$$
\{a, b, c\} = \{-0.2883,\; 0.3501,\; -1.8359\}, \qquad \sigma_v^2 = 3 \tag{31.147}
$$

The locations of the measurements are indicated on the red line by the filled circles; the horizontal axis shows the values of x over the range x ∈ [−5, 5] in increments of 1. The blue curve with squares shows the same measurements without the noise component. The black line shows the fitted curve (31.145) resulting from the following estimated parameters for this particular simulation:

$$
\{\widehat{a}, \widehat{b}, \widehat{c}\} = \{-0.2115,\; 0.0803,\; -2.3376\} \tag{31.148}
$$

The quality of these estimated parameters would be better and their values would be closer to the true (a, b, c) if we use larger N and have less noise. We continue with the values (31.148) to illustrate the main idea and to facilitate the visualization of the resulting effects. Using the fitted curve (31.145) we can predict values for the nonnoisy curve for any given x. For example, for x = −1.3, we get

$$
x = -1.3 \;\Longrightarrow\; \begin{cases} a x^2 + b x + c \approx -2.7781 & \text{(nonnoisy measurement)}\\ \widehat{y} = \widehat{a}\, x^2 + \widehat{b}\, x + \widehat{c} \approx -2.7993 & \text{(prediction)} \end{cases} \tag{31.149}
$$


Figure 31.5 The plot on the left shows the result of fitting a second-order model onto

the measurements, while the plot on the right shows the result of fitting a model of order 20.

Now, given the same N data points {x_n, y_n}, we could consider fitting a higher-order model; one that weaves more closely through the {x_n, y_n} points in the 2D plane. For instance, we could consider fitting a fifth-order model with parameters θ = {a, b, c, d, e, f} instead of the second-order model, such as:

$$
\widehat{y}_m = \widehat{a}\, x_m^5 + \widehat{b}\, x_m^4 + \widehat{c}\, x_m^3 + \widehat{d}\, x_m^2 + \widehat{e}\, x_m + \widehat{f} \tag{31.150}
$$

Doing so would amount to overfitting (fitting a more complex model than necessary since the data originates from a second-order model to begin with). While the fifth-order model may fit the given data points {x_n, y_n} better than the second-order model, the higher-order model will perform poorly on predicting future samples y_m. Poor performance means that if we were to substitute x_m into the higher-order model, the predicted value ŷ_m will generally be far from the value y_m that would result from the true model. This situation is illustrated in the right-hand plot in Fig. 31.5. The black curve shows the same N = 21 noisy measurements from before, while the blue curve shows the same nonnoisy measurements. We now fit a model of order 20 even though the data was generated from a second-order model. We observe in this case that the fitted curve lies on top of the black curve. In other words, the fitting now is so good that the fitted curve weaves through the measurement points and even accommodates the presence of noise in the measurements. This property is undesirable because it will generally lead to bad prediction performance. For instance, for the same point x = −1.3, the new fitted curve now predicts:

$$
x = -1.3 \;\Longrightarrow\; \widehat{y} \approx -0.4718 \tag{31.151}
$$

which is further away from the true value at approximately −2.7781.
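The effect is easy to reproduce. The following sketch (with its own random noise, so the numbers will differ from (31.148)–(31.151)) fits a second-order and a high-order polynomial to the same noisy samples and compares their predictions at x = −1.3:

```python
import numpy as np

# Sketch: second-order vs. high-order polynomial fits to data from a quadratic model.
rng = np.random.default_rng(8)
a, b, c, sigma2_v = -0.2883, 0.3501, -1.8359, 3.0
x = np.arange(-5, 6, 1.0)                              # 21 points in [-5, 5]
y = a * x**2 + b * x + c + np.sqrt(sigma2_v) * rng.standard_normal(x.size)

fit2 = np.polyfit(x, y, deg=2)      # quadratic least-squares fit, as in (31.144)
fit20 = np.polyfit(x, y, deg=20)    # overfitted model of order 20 (ill-conditioned by design)

xm = -1.3
print(a * xm**2 + b * xm + c)       # noiseless value under the true model
print(np.polyval(fit2, xm))         # close to the true value
print(np.polyval(fit20, xm))        # typically much farther off
```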

31.6.2 Akaike Information Criterion

The AIC formulation discourages overfitting (i.e., it discourages overly complex models) by penalizing the number of parameters in the model. From a collection of candidate models, it selects the optimal model as follows:

$$
(k^\star, \theta^\star) = \arg\min_{k,\theta_k}\, \Big\{ 2M_k - 2\,\ell(y_1, y_2, \ldots, y_N;\theta_k) \Big\} \tag{31.152}
$$

where ℓ(·) is the log-likelihood function of the observations, assumed independent of each other:

$$
\ell(y_1, y_2, \ldots, y_N;\theta_k) = \ln\left( \prod_{n=1}^{N} f_y(y_n;\theta_k) \right) \tag{31.153}
$$

In other words, the AIC formulation selects the model as follows:

$$
(k^\star, \theta^\star) = \arg\min_{k,\theta_k}\, \left\{ 2M_k - 2\sum_{n=1}^{N}\ln f_y(y_n;\theta_k) \right\} \qquad \text{(AIC)} \tag{31.154}
$$

The first factor 2M_k penalizes the complexity of the model, while the second term (also known as the "goodness-of-fit" measure) quantifies how well the model θ_k fits the data by calculating its log-likelihood value. We can of course remove the factor 2 from both terms; it is kept for "historical" reasons to match the original formulation. We provide one motivation for the cost function (31.154) in Appendix 31.B. Since only the second term depends on θ_k, we find that the AIC solution can be determined as follows:

(a) For each model class θ_k, we determine its ML estimate by solving

$$
\widehat{\theta}_k = \arg\max_{\theta_k}\, \left\{ \sum_{n=1}^{N}\ln f_y(y_n;\theta_k) \right\}, \qquad k = 1, 2, \ldots, K \tag{31.155}
$$

(b) We assign an AIC score to each model k:

$$
{\rm AIC}(k) \;\stackrel{\Delta}{=}\; 2M_k - 2\sum_{n=1}^{N}\ln f_y(y_n;\widehat{\theta}_k) \tag{31.156}
$$

(c) We select the model class with the smallest AIC score:

$$
k^\star = \arg\min_{1\le k\le K}\, {\rm AIC}(k) \;\Longrightarrow\; \theta^\star = \widehat{\theta}_{k^\star} \tag{31.157}
$$

It is explained in Appendix 31.B that the AIC formulation seeks the model that minimizes the KL divergence between the true model and the ML models {θ̂_1, ..., θ̂_K}. Since the true model is unknown, the AIC ignores the entropy of the true distribution in expression (31.238) in the appendix. For this reason, the AIC score is a relative measure of the "distance" from the true model: the lower the AIC score, the closer the selected model will be to the true model. In practice, these scores are handled as follows:

(a') We determine the model with the lowest AIC score and denote it by θ^⋆, as already explained above.

(b') We associate with each model k a (nonnegative) delta score computed as follows:

\delta(k) \triangleq AIC(k) - AIC(k^\star)        (31.158)

which measures how far model k is from the optimal model k^⋆.

(c') We associate a probability distribution with the models conditioned on the observations (also known as a softmax mapping) as follows:

\pi(k \,|\, y_1, \ldots, y_N) \triangleq \frac{e^{-\delta(k)/2}}{\sum_{k'=1}^{K} e^{-\delta(k')/2}}, \quad k = 1, 2, \ldots, K        (31.159)

where the division by 2 removes the “unnecessary” scaling that appears in the AIC score expression. The ratios π(k|y1 , . . . , yN ) lie in the interval [0, 1] and they add up to 1. Therefore, they can be interpreted as probability values. In this way, for an arbitrary model k, the value π(k|y1 , . . . , yN ) indicates how likely it is for model k to be the best model based on the observed measurements.
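To make the scoring and weighting steps concrete, the following Python sketch computes the AIC scores (31.156), the delta scores (31.158), and the softmax weights (31.159) from a list of already-fitted candidate models. The function and variable names (aic_weights, logliks, num_params) and the numerical values are illustrative assumptions, not from the text; the maximized log-likelihood of each model is assumed to be supplied by the user.

```python
import numpy as np

def aic_weights(logliks, num_params):
    """AIC scores, delta scores, and softmax weights for a set of fitted models."""
    logliks = np.asarray(logliks, dtype=float)     # maximized log-likelihoods, one per model
    Mk = np.asarray(num_params, dtype=float)       # number of parameters M_k per model

    aic = 2.0 * Mk - 2.0 * logliks                 # AIC score of each model, as in (31.156)
    delta = aic - aic.min()                        # delta scores relative to the best model
    weights = np.exp(-delta / 2.0)
    weights /= weights.sum()                       # softmax mapping: weights lie in [0,1] and sum to 1
    return aic, delta, weights

# hypothetical example with three fitted candidate models
aic, delta, w = aic_weights(logliks=[-120.3, -118.9, -118.7], num_params=[1, 2, 3])
k_star = int(np.argmin(aic))                       # index of the selected model
```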

31.6.3 Bayesian Information Criterion

The BIC formulation is closely related to the AIC. The main difference between the two criteria is the manner in which they penalize the complexity of the model; the AIC penalizes the model selection less strongly than the BIC, which selects the model as follows:

(k^\star, \theta^\star) = \arg\min_{k,\theta_k} \left\{ M_k \ln N - 2\sum_{n=1}^{N} \ln f_y(y_n; \theta_k) \right\}   (BIC)        (31.160)

with an additional factor ln N multiplying M_k. We motivate this cost function in Appendix 31.C. Since only the second term depends on θ_k, we find that the BIC solution can be determined as follows:

(a) For each model class θ_k, we determine its ML estimate by solving:

\widehat{\theta}_k = \arg\max_{\theta_k} \left\{ \sum_{n=1}^{N} \ln f_y(y_n; \theta_k) \right\}, \quad k = 1, 2, \ldots, K        (31.161)

(b) We assign a BIC score to each model k:

BIC(k) \triangleq M_k \ln N - 2\sum_{n=1}^{N} \ln f_y(y_n; \widehat{\theta}_k)        (31.162)

(c) We select the model class with the smallest BIC score:

k^\star = \arg\min_{1\le k\le K} BIC(k) \;\Longrightarrow\; \theta^\star = \widehat{\theta}_{k^\star}        (31.163)

It is explained in Appendix 31.C that the BIC formulation maximizes the a-posteriori probability of the model selection given the observations. Thus, the lower the BIC score is, the more likely it is that the selected model is a good approximation for the true model. Specifically, from expression (31.279) in the appendix we deduce that the likelihood of selecting model k given the observations satisfies:

\pi(k \,|\, y_1, y_2, \ldots, y_N) \approx \frac{e^{-BIC(k)/2}}{\sum_{k'=1}^{K} e^{-BIC(k')/2}}        (31.164)

Example 31.14 (Illustrating BIC and AIC procedures) We illustrate the BIC and AIC procedures by considering the problem of fitting a Gaussian distribution onto a collection of N data points {y_n}. We wish to select the best fit among two models for the data. The first model is a Gaussian distribution with known variance σ² but unknown mean μ_1. That is, θ_1 = {μ_1} and M_1 = 1:

model θ_1:   \theta_1 = \{\mu_1\}        (31.165a)

f_y(y; \theta_1) \sim N_y(\theta_1, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(y-\theta_1)^2}        (31.165b)

The log-likelihood function in this case is

\ell(\theta_1) \triangleq \ln\left( \prod_{n=1}^{N} f_y(y_n; \theta_1) \right) = -\frac{N}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{n=1}^{N} (y_n - \theta_1)^2        (31.166)

Differentiating relative to θ_1 and setting the derivative to zero leads to the ML estimate:

\widehat{\theta}_1 = \frac{1}{N}\sum_{n=1}^{N} y_n,   with "goodness-of-fit" measure \ell(\widehat{\theta}_1)        (31.167)

The second model is also a Gaussian distribution, albeit with unknown variance and mean. That is, θ_2 = {μ_2, σ_2²} and M_2 = 2:

model θ_2:   \theta_2 = \{\mu_2, \sigma_2^2\}        (31.168a)

f_y(y; \theta_2) \sim N_y(\mu_2, \sigma_2^2) = \frac{1}{\sqrt{2\pi\sigma_2^2}}\, e^{-\frac{1}{2\sigma_2^2}(y-\mu_2)^2}        (31.168b)

The log-likelihood function in this case is

\ell(\theta_2) = -\frac{N}{2}\ln(2\pi\sigma_2^2) - \frac{1}{2\sigma_2^2}\sum_{n=1}^{N} (y_n - \mu_2)^2        (31.169)

Differentiating relative to (μ_2, σ_2²) and setting the derivatives to zero leads to the same ML estimates we encountered before in (31.23a)–(31.23b):

\widehat{\mu}_2 = \frac{1}{N}\sum_{n=1}^{N} y_n, \quad \widehat{\sigma}_2^2 = \frac{1}{N}\sum_{n=1}^{N} (y_n - \widehat{\mu}_2)^2, \quad \widehat{\theta}_2 = \{\widehat{\mu}_2, \widehat{\sigma}_2^2\}        (31.170)

with "goodness-of-fit" measure ℓ(θ̂_2). The BIC and AIC scores for the two models are given by

BIC(1) = \ln N - 2\,\ell(\widehat{\theta}_1)        (31.171a)
BIC(2) = 2\ln N - 2\,\ell(\widehat{\theta}_2)        (31.171b)
AIC(1) = 2 - 2\,\ell(\widehat{\theta}_1)        (31.171c)
AIC(2) = 4 - 2\,\ell(\widehat{\theta}_2)        (31.171d)
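As a rough numerical illustration of this example (with assumed synthetic data and an assumed known variance σ² = 1 for the first model), the following Python sketch computes the two maximized log-likelihoods and the corresponding AIC and BIC scores.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=0.5, scale=1.3, size=200)      # assumed synthetic data
N = len(y)
sigma2 = 1.0                                      # known variance assumed for model 1

# model 1: unknown mean, known variance (M_1 = 1)
theta1 = y.mean()
ll1 = -0.5 * N * np.log(2 * np.pi * sigma2) - np.sum((y - theta1) ** 2) / (2 * sigma2)

# model 2: unknown mean and variance (M_2 = 2)
mu2 = y.mean()
s2 = np.mean((y - mu2) ** 2)
ll2 = -0.5 * N * np.log(2 * np.pi * s2) - np.sum((y - mu2) ** 2) / (2 * s2)

AIC = {1: 2 * 1 - 2 * ll1, 2: 2 * 2 - 2 * ll2}            # expressions (31.171c)-(31.171d)
BIC = {1: 1 * np.log(N) - 2 * ll1, 2: 2 * np.log(N) - 2 * ll2}
best_aic = min(AIC, key=AIC.get)                  # model with the smallest score wins
best_bic = min(BIC, key=BIC.get)
```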

Example 31.15 (Moving average model) We studied regression problems in Chapter 29. Here we consider a motivating example. Assume we collect N iid scalar observations {γ(n), h(n)}. We wish to model γ(n) by a linear model of the form:

\gamma(n) = h_n^T w + v(n)        (31.172)

where w ∈ IR^M is a parameter vector to be determined, and h_n is an observation vector consisting of M delayed samples of h(n), namely,

h_n \triangleq \mathrm{col}\{h(n), h(n-1), \ldots, h(n-M+1)\}        (31.173)

Moreover, the term v(n) represents some small zero-mean discrepancy assumed to be Gaussian-distributed:

f_v(v) = \frac{1}{\sqrt{2\pi\sigma_v^2}}\, e^{-\frac{v^2}{2\sigma_v^2}}        (31.174)

In this way, expression (31.172) is attempting to fit a linear regression model into the data by representing γ(n) as a combination of current and delayed samples {h(m)}. The order of the model is M because it uses M samples {h(n), ..., h(n−M+1)}. The parameter w plays the role of the model θ, and M is its size. We wish to determine the optimal size M. It is straightforward to see that the log-likelihood function of the observations {γ(n), h_n} for a particular model w is

\ell(w) \triangleq \ln f_{\gamma_1,\ldots,\gamma_N,h_1,\ldots,h_N}(\gamma(1),\ldots,\gamma(N), h_1,\ldots,h_N; w)
       = \ln f_{v_1,\ldots,v_N}(v(1),\ldots,v(N); w)
       = \ln\left\{ \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\sigma_v^2}}\, e^{-\frac{(\gamma(n)-h_n^T w)^2}{2\sigma_v^2}} \right\}
       = -\frac{N}{2}\ln(2\pi\sigma_v^2) - \frac{1}{2\sigma_v^2}\sum_{n=1}^{N} (\gamma(n) - h_n^T w)^2        (31.175)

It follows that the maximum of the log-likelihood function over w is obtained by solving the least-squares problem:

w_M^\star = \arg\min_{w \in \mathrm{IR}^M} \left\{ \sum_{n=1}^{N} (\gamma(n) - h_n^T w)^2 \right\},   with "goodness-of-fit" measure \ell(w_M^\star)        (31.176)

Differentiating relative to w and setting the gradient to zero, we find the following expressions for the minimizer and the corresponding minimum cost (see Prob. 31.26):

w_M^\star = \left( \sum_{n=1}^{N} h_n h_n^T \right)^{-1} \sum_{n=1}^{N} \gamma(n)\, h_n        (31.177a)

E_M \triangleq \left. \sum_{n=1}^{N} (\gamma(n) - h_n^T w)^2 \right|_{w = w_M^\star} = \sum_{n=1}^{N} \gamma(n)\big(\gamma(n) - h_n^T w_M^\star\big)        (31.177b)

so that

\ell(w_M^\star) = -\frac{N}{2}\ln(2\pi\sigma_v^2) - \frac{1}{2\sigma_v^2}\, E_M        (31.178)

The BIC and AIC scores for models of order M are then given by

BIC(M) = M \ln N - 2\,\ell(w_M^\star)        (31.179a)
AIC(M) = 2M - 2\,\ell(w_M^\star)        (31.179b)

Either of these scores can now be minimized over M to select the best model order.
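A minimal sketch of this order-selection procedure is shown below, using assumed synthetic data generated from a second-order model; the driving sequence, noise level, helper names (regressors, aic_bic), and the range of candidate orders are illustrative choices, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(1)
N, sigma_v, M_max = 400, 0.5, 8
h = rng.normal(size=N + M_max)                     # assumed driving sequence
w_true = np.array([1.0, -0.4])                     # assumed true model of order 2

def regressors(h, M):
    # row n is h_n = [h(n), h(n-1), ..., h(n-M+1)], built with an index offset of
    # M_max so that every row is complete
    return np.stack([h[n + M_max : n + M_max - M : -1] for n in range(N)])

gamma = regressors(h, 2) @ w_true + sigma_v * rng.normal(size=N)

def aic_bic(M):
    H = regressors(h, M)                                        # N x M data matrix
    w_ls, *_ = np.linalg.lstsq(H, gamma, rcond=None)            # least-squares fit (31.176)
    E_M = np.sum((gamma - H @ w_ls) ** 2)                       # minimum cost (31.177b)
    ll = -0.5 * N * np.log(2 * np.pi * sigma_v**2) - E_M / (2 * sigma_v**2)   # (31.178)
    return 2 * M - 2 * ll, M * np.log(N) - 2 * ll               # (31.179b), (31.179a)

orders = list(range(1, M_max + 1))
scores = [aic_bic(M) for M in orders]
best_by_aic = orders[int(np.argmin([s[0] for s in scores]))]
best_by_bic = orders[int(np.argmin([s[1] for s in scores]))]
```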

31.6.4 Minimum Description Length

The MDL criterion is based on the principle that the best model fit is one that compresses the data the most, i.e., a solution where the representation of both the model and the data requires the smallest number of bits. This is based on the intuition that the more we are able to compress the data (i.e., the easier it is for us to describe it), the more we would have learned about its inherent structure.

Let B(θ) represent the number of bits that are needed to represent a generic model θ. For example, if the model θ corresponds to choosing the means of two Gaussian distributions in a mixture model of the form (1/2) N_y(μ_a, 1) + (1/2) N_y(μ_b, 1), and if the means are known to be one of only four possibilities:

(\mu_a, \mu_b) \in \Big\{ (\mu_{a1}, \mu_{b1}),\, (\mu_{a2}, \mu_{b2}),\, (\mu_{a3}, \mu_{b3}),\, (\mu_{a4}, \mu_{b4}) \Big\}        (31.180)

then, for this example, B(θ) = 2 bits. In a second example, assume θ has M dimensions and lies within a bounded region. We can discretize every dimension into √N small segments (where N is the number of data points); this choice is motivated by the fact that the size of the error in estimating each entry of θ is on the order of 1/√N, as suggested by (31.129). Then, we would need roughly log_2 √N bits to represent the value of θ along each of its dimensions, so that B(θ) can be approximated by

B(\theta) \propto \frac{1}{2} M \ln N        (31.181)

which is similar to the term that appears in the BIC objective function in (31.160). In other situations, it is justified to treat θ as a realization of some random variable θ and to assign a distribution to θ (also called its prior), say, f_θ(θ). For instance, if θ corresponds to the mean of a Gaussian distribution f_y(y; θ) ∼ N_y(μ, σ²), then one could assume that the unknown mean μ is a realization that arises from some exponential distribution θ ∼ λe^{−λθ}. When the model parameters are treated in this manner as random variables with priors assigned to them, we can appeal to our earlier discussion on information and entropy from Chapter 6 to deduce that the number of bits that are needed to code θ is on the order of (recall expression (6.1)):

B(\theta) = -\ln f_\theta(\theta)        (31.182)

The MDL approach exploits this connection between code lengths and pdfs to great effect. Since we are using the natural logarithm in (31.182), the units should be listed as nats instead of bits. Let B(y_1, y_2, ..., y_N; θ) represent the number of bits that are needed to represent the N data points under model θ. The MDL criterion then selects the model as follows:

(k^\star, \theta^\star) = \arg\min_{k,\theta_k} \Big\{ B(\theta_k) + B(y_1, y_2, \ldots, y_N; \theta_k) \Big\}        (31.183)

Again, motivated by (31.182), we can select

B(y_1, y_2, \ldots, y_N; \theta) = -\ln\left\{ \prod_{n=1}^{N} f_y(y_n; \theta) \right\}        (31.184)

in terms of the natural logarithm of the likelihood function and rewrite the MDL formulation as

(k^\star, \theta^\star) = \arg\min_{k,\theta_k} \left\{ B(\theta_k) - \sum_{n=1}^{N} \ln f_y(y_n; \theta_k) \right\}   (MDL)        (31.185)

One useful interpretation for the MDL objective arises when the choice (31.182) is used, i.e., when

(k^\star, \theta^\star) = \arg\max_{k,\theta_k} \left\{ \ln f_\theta(\theta_k) + \sum_{n=1}^{N} \ln f_y(y_n; \theta_k) \right\}        (31.186)

where we replaced argmin by argmax and removed the negative signs. In this case, MDL can be shown to correspond to choosing among a collection of MAP estimators. This can be seen as follows. When θ is treated as random, each term f_y(y_n; θ_k) in (31.185) has the interpretation of the conditional distribution of y given the realization for θ_k:

f_y(y_n; \theta_k) = f_{y|\theta_k}(y_n \,|\, \theta_k)        (31.187)

Then, using (31.182), we can rewrite the cost that is being optimized in (31.186) in the form:

\ln f_\theta(\theta_k) + \sum_{n=1}^{N} \ln f_y(y_n; \theta_k)
  = \ln\left\{ f_\theta(\theta_k) \prod_{n=1}^{N} f_{y|\theta_k}(y_n \,|\, \theta_k) \right\}
  = \ln\Big\{ f_\theta(\theta_k)\, f_{y_1,\ldots,y_N|\theta_k}(y_1, \ldots, y_N \,|\, \theta_k) \Big\}
  = \ln f_{y_1,\ldots,y_N,\theta_k}(y_1, y_2, \ldots, y_N, \theta_k)
  = \ln\Big\{ f_{y_1,\ldots,y_N}(y_1, \ldots, y_N) \times f_{\theta_k|y_1,\ldots,y_N}(\theta_k \,|\, y_1, \ldots, y_N) \Big\}
  = \ln f_{\theta_k|y_1,\ldots,y_N}(\theta_k \,|\, y_1, \ldots, y_N) + \text{term independent of } \theta_k        (31.188)

It follows that the MDL criterion (31.186) is the best fit from among a collection of MAP estimators {θ̂_k}:

(k^\star, \theta^\star) = \arg\max_{k,\theta_k} \Big\{ f_{\theta_k|y_1,\ldots,y_N}(\theta_k \,|\, y_1, \ldots, y_N) \Big\}        (31.189)

This construction involves the following steps:

(a) For each model class θ_k, we determine its MAP estimate by solving:

\widehat{\theta}_k = \arg\max_{\theta_k} \Big\{ f_{\theta_k|y_1,\ldots,y_N}(\theta_k \,|\, y_1, \ldots, y_N) \Big\}, \quad k = 1, 2, \ldots, K        (31.190)

(b) We assign an MDL score to each class k:

MDL(k) \triangleq f_{\theta_k|y_1,\ldots,y_N}(\widehat{\theta}_k \,|\, y_1, \ldots, y_N)        (31.191)

(c) We select the model class with the largest MDL score:

k^\star = \arg\max_{1\le k\le K} MDL(k) \;\Longrightarrow\; \theta^\star = \widehat{\theta}_{k^\star}        (31.192)
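As a rough numerical sketch (not from the text), the following Python code evaluates the MDL objective (31.185) for a single candidate class consisting of a Gaussian model with unknown mean, a standard normal prior on that mean, and a known unit variance; the minimization over the parameter is carried out on a grid for simplicity, and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(loc=0.8, scale=1.0, size=100)      # assumed synthetic data

def mdl_objective(mu, y):
    # B(theta) = -ln f_theta(theta) for a standard normal prior on the mean
    bits_model = 0.5 * np.log(2 * np.pi) + 0.5 * mu**2
    # negative log-likelihood of the data under N(mu, 1), as in (31.184)
    neg_loglik = 0.5 * len(y) * np.log(2 * np.pi) + 0.5 * np.sum((y - mu) ** 2)
    return bits_model + neg_loglik                # objective in (31.185)

grid = np.linspace(-3, 3, 2001)                   # simple grid search over the parameter
scores = np.array([mdl_objective(mu, y) for mu in grid])
mu_mdl = grid[int(np.argmin(scores))]             # coincides with the MAP estimate of the mean
mdl_score = scores.min()                          # score to compare against other model classes
```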

Example 31.16 (MDL with model prior) Consider a collection of N iid realizations {y_n}. We wish to select the best fit among two models for the data. The first model is a Gaussian distribution with known variance σ_1² and unknown mean μ_1, i.e., θ_1 = {μ_1}. We model θ_1 as a random variable and assume it is Gaussian-distributed with zero mean and unit variance:

model θ_1:
\theta_1 = \mu_1        (31.193a)
f_{\mu_1}(\mu_1) \sim N_{\mu_1}(0, 1) = \frac{1}{\sqrt{2\pi}}\, e^{-\mu_1^2/2}        (31.193b)
f_y(y; \theta_1) \sim N_y(\theta_1, \sigma_1^2) = \frac{1}{\sqrt{2\pi\sigma_1^2}}\, e^{-\frac{1}{2\sigma_1^2}(y-\theta_1)^2}        (31.193c)

If we denote the cost that is maximized in the MDL formulation (31.186) by J(θ_k), then it is given by the following expression for model θ_1 = μ_1:

J(\mu_1) = -\frac{1}{2}\mu_1^2 - \frac{1}{2\sigma_1^2}\sum_{n=1}^{N} (y_n - \mu_1)^2 + \text{cte}        (31.194)

where constant terms that are independent of θ_1 = μ_1 are separated. The second model is also Gaussian, albeit with unknown variance and mean. That is, θ_2 = {μ_2, σ_2²}. We assume the two components of θ_2 are independent of each other, with μ_2 being Gaussian-distributed with zero mean and unit variance and σ_2² being exponentially distributed with parameter λ = 1, i.e.,

model θ_2:
f_{\mu_2}(\mu_2) \sim N_{\mu_2}(0, 1) = \frac{1}{\sqrt{2\pi}}\, e^{-\mu_2^2/2}        (31.195a)
f_{\sigma_2^2}(\sigma_2^2) = e^{-\sigma_2^2}        (31.195b)
f_{\theta_2}(\theta_2) = \frac{1}{\sqrt{2\pi}}\, e^{-\mu_2^2/2} \times e^{-\sigma_2^2}        (31.195c)
f_y(y; \theta_2) \sim N_y(\mu_2, \sigma_2^2) = \frac{1}{\sqrt{2\pi\sigma_2^2}}\, e^{-\frac{1}{2\sigma_2^2}(y-\mu_2)^2}        (31.195d)

The cost in (31.186) that corresponds to this model is given by

J(\mu_2, \sigma_2^2) = -\frac{1}{2}\mu_2^2 - \sigma_2^2 - \frac{1}{2\sigma_2^2}\sum_{n=1}^{N} (y_n - \mu_2)^2 + \text{cte}        (31.196)

where again constant terms that are independent of θ_2 = {μ_2, σ_2²} are separated. This example is pursued further in Probs. 31.24 and 31.25.

31.6.5 Cross-Validation Method

We describe next an alternative method for choosing among models that is based on a popular technique known as cross-validation; the method often leads to good performance under weaker conditions than those needed for the other methods (AIC, BIC, MDL) described before. Consider again a collection of N iid data measurements {y_1, y_2, ..., y_N} arising from an underlying unknown pdf, f_y(y). Introduce K models with parameters {θ_1, θ_2, ..., θ_K}, where each θ_k defines a pdf for the observations. Cross-validation splits the N data points {y_n} into L segments of size N/L each. At each iteration of the construction described below, we use L − 1 segments for estimation purposes and the last segment for a supporting role – see Fig. 31.6. To simplify the notation, we let E = (L − 1)N/L denote the total number of samples used for estimation from the L − 1 segments and T = N/L denote the number of samples used for the support role from the remaining segment, so that N = E + T. The objective is to select the "best fit" model from among the {θ_1, ..., θ_K}.


Figure 31.6 The data is divided into L segments, with L − 1 of them used for estimation purposes for a total of E samples and the remaining segment with T samples used for a supporting role.

The first step is to use the E data points to estimate θ_k by maximizing the corresponding log-likelihood function:

\widehat{\theta}_k = \arg\max_{\theta_k} \left\{ \sum_{e=1}^{E} \ln f_y(y_e \,|\, \theta_k) \right\}, \quad k = 1, 2, \ldots, K        (31.197)

where the subscript e is used to index the samples from the E collection. This step generates K estimated models {θ̂_k}. Next, we need to select a "best fit" model from among these models. One way to achieve this task is to minimize the KL divergence between the true (unknown) pdf and its approximations, namely,

k^\star = \arg\min_{1\le k\le K} D_{KL}\big( f_y(y) \,\|\, f_y(y; \widehat{\theta}_k) \big) \;\Longrightarrow\; \theta^\star = \widehat{\theta}_{k^\star}        (31.198)

where

D_{KL}\big( f_y(y) \,\|\, f_y(y; \widehat{\theta}_k) \big) = \int_{y\in Y} f_y(y) \ln f_y(y)\, dy - \int_{y\in Y} f_y(y) \ln f_y(y; \widehat{\theta}_k)\, dy        (31.199)

The first term on the right-hand side is independent of k. Therefore, problem (31.198) for selecting the optimal model k reduces to

k^\star = \arg\max_{1\le k\le K} \int_{y\in Y} f_y(y) \ln f_y(y; \widehat{\theta}_k)\, dy = \arg\max_{1\le k\le K} \Big\{ \mathbb{E}_y \ln f_y(y; \widehat{\theta}_k) \Big\}        (31.200)

(31.200)

The quantity that is being maximized is the mean of ln fy (y; θbk ), where the expectation is computed relative to the true distribution, fy (y). The expectation cannot be computed since fy (y) is unknown. One approximation is derived in Appendix 31.B in the form of expression (31.240) and used to motivate the AIC method. Here, we pursue a different approach based on cross-validation. In the cross-validation approach, the quantity E y f (y; θb` ) is estimated by using the T samples from the support segment, which were not involved in the estimation of the {θbk }. That is, we use T 1X E y ln fy (y; θbk ) ≈ ln fy (yt ; θbk ) T t=1

(31.201)

where the samples {yt } in this expression arise from the T collection and are independent from the samples {ye } used to estimate θbk . Therefore, the sample average estimator on the right-hand side of (31.201) is an unbiased estimator for the quantity of interest, namely, ( ) T 1X b E ln fy (yt ; θk ) = E y ln fy (y; θbk ) (31.202) T t=1 | {z } ∆ | {z } = Xk ∆ b =X k

To simplify the notation, we denote the variable that we wish to approximate bk ; we use the subscript k to indicate by Xk and its sample approximation by X that these values relate to model k. bk by considering a single pass over the So far we have computed the estimate X L data segments. More generally, cross-validation performs L passes over these segments. During each pass, one segment is chosen for support and the remaining bk,` as above L − 1 segments for estimation. Each pass ` generates an estimate X by computing the ensemble average over the samples of the support segment for that pass. Subsequently, the final estimate for Xk is obtained by averaging these multiple pass estimates as follows: L

X ∆ 1 bk = bk,` X X L

(cross-validation score)

(31.203)

`=1

bk in this manner for each model θk and then Cross-validation generates a score X selects the model with the highest score in view of (31.200): n o bk k ? = argmax X =⇒ θ? = θk? (31.204) 1≤k≤K
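The following Python sketch (an illustration under assumed synthetic data, not a prescribed implementation) carries out this L-pass procedure for the two Gaussian model classes of Example 31.14: on each pass, the models are fit on the estimation segments and scored by the average log-likelihood (31.201) on the support segment.

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(loc=0.5, scale=1.5, size=300)      # assumed synthetic data
L = 5                                             # number of segments (assumed)
segments = np.array_split(np.arange(len(y)), L)

def loglik_gauss(y, mu, var):
    return -0.5 * np.log(2 * np.pi * var) - (y - mu) ** 2 / (2 * var)

def fit_and_score(train, test, model):
    if model == 1:                                # unknown mean, known unit variance
        mu, var = train.mean(), 1.0
    else:                                         # unknown mean and variance
        mu, var = train.mean(), train.var()
    return loglik_gauss(test, mu, var).mean()     # per-sample support score, as in (31.201)

scores = {}
for model in (1, 2):
    passes = []
    for l in range(L):                            # one pass per choice of support segment
        test_idx = segments[l]
        train_idx = np.concatenate([segments[j] for j in range(L) if j != l])
        passes.append(fit_and_score(y[train_idx], y[test_idx], model))
    scores[model] = np.mean(passes)               # cross-validation score (31.203)

k_star = max(scores, key=scores.get)              # model with the highest score, as in (31.204)
```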


We explain in Prob. 31.29 that, under certain conditions, the cross-validation construction is able to discover the best model with high likelihood. We will discuss cross-validation further in Chapter 59 and provide comments on its history and application in the context of inference and learning methods.

31.7 COMMENTARIES AND DISCUSSION

Maximum likelihood. The ML approach was developed by the English statistician Ronald Fisher (1890–1962) in the works by Fisher (1912, 1922, 1925) – see the presentations by Pratt (1976), Savage (1976), and Aldrich (1997). The approach does not assume any prior distribution for the parameter θ and estimates it from observations of the random variable y by maximizing the likelihood function defined by (31.1). Since its inception, the ML technique has grown to become one of the most formidable tools in modern statistical analysis, motivated largely by the foundational works of Fisher (1922, 1956) and also by the efficiency of this class of estimators. As already noted by (31.129), ML estimators are asymptotically efficient in that their MSEs approach the Cramer–Rao bound as the number of observations grows. For additional information on ML estimators, and for more details on the Cramer–Rao bound, its ramifications, and the asymptotic efficiency and normality of ML solutions, readers may refer to the texts by Zacks (1971), Box and Tiao (1973), Scharf (1991), Kay (1993, 1998), Lehmann (1998), Cassella and Berger (2002), Cox (2006), Hogg and McKean (2012), and Van Trees (2013).

Cramer–Rao bound. This important bound, which is due to Rao (1945) and Cramer (1946), provides a lower limit on the achievable MSE for any unbiased (and also biased) estimator of unknown constant parameters. The lower bound is determined by the inverse of the Fisher information matrix which, although named after Fisher (1922, 1956), was actually advanced in the works by Edgeworth (1908a, b, c) – see the expositions by Savage (1976) and Pratt (1976). The entries of the Fisher matrix reflect the amount of information that the observations convey about the unknown parameter – see Frieden (2004). In Appendix 31.A we provide one derivation for the Cramer–Rao bound for the case of scalar parameters by following an argument similar to Cassella and Berger (2002), Frieden (2004), and Van Trees (1968, 2013). Expression (31.129), and the result in Example 31.12, suggest that the MSE decays at the rate of 1/N, in inverse proportion to the sample size. There are situations, however, where the rate of decay of the MSE can be exponentially fast in N. These situations arise, for example, when estimating unknown parameters θ that are restricted in certain ways, as discussed by Hammersley (1950). To illustrate this possibility, let us reconsider Example 31.12, where we are still interested in determining the ML estimate for θ, except that θ is now constrained to being an integer. It is shown in Hammersley (1950) that, in this case (see Prob. 31.20):

\widehat{\theta}_{ML} = \mathrm{round}\left( \frac{1}{N} \sum_{n=1}^{N} y_n \right)        (31.205)

where the function round(x) denotes the integer value that is closest to x. The corresponding MSE behaves asymptotically as

\mathbb{E}\,\widetilde{\theta}_{ML}^2 = \left( \frac{8\sigma_v^2}{\pi N} \right)^{1/2} \times e^{-N/8\sigma_v^2}, \quad \text{as } N \to \infty        (31.206)

which decays exponentially with N.
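As a quick check of this behavior (a simulation sketch with arbitrary parameter choices, not from the text), the following Python code compares the ordinary sample-mean estimator with the rounded estimator (31.205) when the true parameter is known to be an integer.

```python
import numpy as np

rng = np.random.default_rng(5)
theta, sigma_v, N, trials = 3, 1.0, 50, 20000     # integer parameter and noise level (assumed)

errors_real, errors_int = [], []
for _ in range(trials):
    y = theta + sigma_v * rng.normal(size=N)
    est_real = y.mean()                           # ordinary ML estimate
    est_int = np.round(est_real)                  # rounded estimate (31.205)
    errors_real.append((est_real - theta) ** 2)
    errors_int.append((est_int - theta) ** 2)

mse_real = np.mean(errors_real)                   # decays like sigma_v^2 / N
mse_int = np.mean(errors_int)                     # decays exponentially fast in N
```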


Sufficient statistics. We commented on sufficient statistics at the end of Chapter 5. Let {y_1, y_2, ..., y_N} denote realizations for a random variable y whose pdf is parameterized by some θ (scalar or vector), written as f_y(y; θ). Let the short-hand notation S(y) denote any function of these realizations, i.e.,

S(y) \triangleq S(y_1, y_2, \ldots, y_N)        (31.207)

We refer to S(y) as a statistic. The statistic is said to be sufficient for θ if the conditional distribution of {y_1, y_2, ..., y_N} given S(y) does not depend on θ. This concept plays a key role in ML estimation theory and was introduced by Fisher (1922) in his development of theoretical statistics. In a way, the sufficient statistic, when it exists, contains all the information that is embedded in the observations about θ so that the observations can be discarded and replaced by S(y) for estimation purposes. This step amounts to compressing the observations down to S(y). Let us consider the following classical example. Let y denote the outcome of a Bernoulli experiment with success rate p, i.e., y = 1 with probability p and y = 0 with probability 1 − p. The variable p plays the role of the parameter θ. Assume we perform N experiments and observe the outcomes {y_1, ..., y_N}. Define the function

S(y) \triangleq \sum_{n=1}^{N} y_n        (31.208)

which counts the number of successes. The conditional pdf of {y_1, y_2, ..., y_N} given S(y) can be shown to be independent of p. Indeed, it is left as an exercise for Prob. 31.16 to show that:

P\big( y_1 = y_1, \ldots, y_N = y_N \,|\, S(y) = s \big) = \begin{cases} \dfrac{s!\,(N-s)!}{N!}, & \text{if } \sum_{n=1}^{N} y_n = s \\[1mm] 0, & \text{if } \sum_{n=1}^{N} y_n \neq s \end{cases}        (31.209)

which shows that the conditional distribution is independent of the parameter p. Therefore, the statistic S(y) defined by (31.208) is sufficient for p. The following result explains how, starting from some initial crude estimator for a parameter θ that is not necessarily optimal, we can construct better estimators for it by conditioning on a sufficient statistic for θ – see Prob. 31.17 and also Caines (1988), Scharf (1991), and Kay (1993, 1998). Observe how the conditional mean plays a useful role in constructing estimators.

(Rao–Blackwell theorem) (Rao (1945) and Blackwell (1947)): Let θ̂_1 denote an unbiased estimator for a parameter θ given observations of a variable y, and assume E θ̂_1² < ∞. Let S(y) denote a sufficient statistic for θ and construct the estimator:

\widehat{\theta}_2 = \mathbb{E}\big( \widehat{\theta}_1 \,|\, S(y) \big)        (31.210)

Then, θ̂_2 is also an unbiased estimator for θ with at most the same MSE (or variance), namely,

\mathbb{E}\,(\theta - \widehat{\theta}_2)^2 \;\leq\; \mathbb{E}\,(\theta - \widehat{\theta}_1)^2        (31.211)

The inequality holds strictly, except when the estimator θ̂_1 is a function of S(y).


Gamma function. We encountered the gamma function in Example 31.4 while fitting a beta distribution onto measured data. The function is defined by the integral expression

\Gamma(z) = \int_0^{\infty} s^{z-1} e^{-s}\, ds, \quad z > 0        (31.212)

and has several useful properties such as Γ(1/2) = √π, Γ(z + 1) = zΓ(z) for any z > 0, and Γ(n + 1) = n! for any integer n ≥ 0. This last property shows that the gamma function can be viewed as an extension of the factorial operation to real (and even complex) numbers. In the example, we needed to evaluate the digamma function ψ(z) = Γ'(z)/Γ(z), which involves the derivative of the gamma function and arises often in applications. This ratio is sometimes referred to as the polygamma function of order zero and is known to satisfy the relation:

\frac{\Gamma'(z)}{\Gamma(z)} = -c + \sum_{m=0}^{\infty} \left( \frac{1}{1+m} - \frac{1}{z+m} \right)        (31.213)

where

c \triangleq -\lim_{z\to 1} \Gamma'(z) \approx 0.577215665        (31.214)

is Euler's constant (often denoted instead by the letter γ); it appears in many other problems in mathematical analysis. For more information on the gamma function and its properties, the reader may consult the works by Davis (1959), Abramowitz and Stegun (1965), Lebedev (1972), Temme (1996), and Artin (2015).

Method of moments. We discussed in Example 31.4 two methods to fit a beta distribution onto data measurements. One method was based on the ML formulation and required an iterative procedure to learn the shape parameters (a, b), while the second method estimated these parameters by matching the first- and second-order moments (mean and variance) of the resulting beta distribution to the sample mean and sample variance computed from the data. It is a historical curiosity that the ML approach to fitting a beta distribution was favored by the English statistician Ronald Fisher (1890–1962), while the moment matching approach was favored by the English statistician Karl Pearson (1857–1936) – see Pearson (1936) and the account by Bowman and Shenton (2007). Both Fisher and Pearson were giants in their field and are credited with establishing the modern field of mathematical statistics.

Akaike and Bayesian information criteria. The AIC is due to the Japanese statistician Hirotugu Akaike (1927–2009) and appeared in the work by Akaike (1974). Since its inception, it has flourished to become one of the main tools in statistical analysis. The criterion is based on information-theoretic concepts and seeks the "best fit" that minimizes the KL divergence relative to the true (unknown) distribution of the data. We explain in Appendix 31.B that the AIC achieves this goal by constructing an "unbiased" estimate for the mean log-likelihood function defined in (31.243) as follows:

L(\theta) \triangleq \int_{y\in Y} f(y) \ln f(y; \theta)\, dy = \mathbb{E}_y \ln f(y; \theta)        (31.215)

The expectation is relative to the true distribution of y. Since L(θ) is unavailable, the AIC approximates it from data measurements by using (31.240), namely,

\mathbb{E}_y \ln f(y; \widehat{\theta}_k) \;\approx\; \frac{1}{N} \sum_{n=1}^{N} \ln f(y_n; \widehat{\theta}_k) - \frac{M_k}{N}        (31.216)

where the correction by M_k/N is necessary to remove the bias from the first term. It is explained in the survey article by Cavanaugh and Neath (2019) that the AIC formulation is efficient. This means that the selected model θ⋆ will generate predictions ŷ for y that have the lowest MSE, E(y − ŷ)². In other words, the AIC favors model selections that are good predictors. The approximation (31.216) is known to perform well for large N, but its performance degrades for small or moderate-size datasets, as explained in Linhart and Zucchini (1986) and McQuarrie and Tsai (1998). To address this difficulty, Hurvich and Tsai (1989) suggested one correction to the AIC score based on replacing the original expression (31.156) by the following corrected score for sample sizes satisfying N < 40M_k:

AIC(k) \triangleq 2M_k + \frac{2M_k(M_k+1)}{N - M_k - 1} - 2\sum_{n=1}^{N} \ln f_y(y_n; \widehat{\theta}_k)   (corrected AIC)        (31.217)

The text by Burnham and Anderson (2002) provides a theoretical justification for the superior performance of the corrected AIC in relation to the BIC. The BIC was proposed by Schwarz (1978); it bears a strong resemblance to the AIC except that it penalizes the model selection more strongly by scaling the model order Mk by ln N instead of 1 as happens with the AIC. The BIC adopts a Bayesian approach and assigns a prior π(k) to the model variable k ∈ {1, 2, . . . , K}, as well as a prior to the parameter θ k under each class k. It then maximizes the posterior π(k|y1 , y2 , . . . , yN ) given the observations. One of the main features of the BIC formulation is that its criterion is consistent. This means that if the true unknown distribution that generated the observations happens to belong to the collection of candidate models {θ1 , . . . , θK }, then the BIC solution is guaranteed to select it with probability 1 in the limit of large datasets – see, e.g., Claeskens and Hjort (2008). This property may also explain the observation in McQuarrie and Tsai (1998) that the BIC outperforms the AIC for moderate-size datasets in the sense that the BIC tends to select the true model more frequently. It is clear from the derivations in Appendices 31.B and 31.C that the AIC and BIC formulations are based on some approximations, especially asymptotically as the sample size N tends to infinity. This reflects on their behavior. For instance, the BIC penalizes the model complexity more heavily than the AIC; it uses the penalty term Mk ln N versus Mk for the AIC. For this reason, the AIC is more likely to favor more complex models, whereas the BIC favors simpler models. For more information on the AIC and BIC, their derivations, and applications, readers may refer to Linhart and Zucchini (1986), Ghosh, Delampady, and Samanta (2006), Claeskens and Hjort (2008), Konishi and Kitagawa (2008), Hastie, Tibshirani, and Friedman (2009), and Neath and Cavanaugh (2012). Minimum description length. The MDL criterion is due to Rissanen (1978, 1986). It selects solutions that can represent the model and the data in the most compressed form (in terms of bit representation). This line of reasoning is consistent with what is generally referred to as the Occam razor principle. The principle basically states that simpler explanations or hypotheses should be preferred over more complex explanations or hypotheses. We will encounter it again in Section 64.5 when we discuss the issue of overfitting by complex models and the bias–variance trade-off. MDL relies on information-theoretic and coding theory arguments, and exploits to great effect the connection between code lengths and the pdfs of the variables through the notion of entropy. Specifically, − log2 fx (x) bits are needed to represent realizations x for the random variable with distribution fx (x). The MDL approach exploits this connection to formulate a design criterion. We explained in the body of the chapter that when prior distributions are assigned to the models, the MDL solution reduces to selecting the best fit from among a collection of MAP estimators. We also explained that MDL and BIC are closely related. In particular, if we use the bit representation (31.182), then the MDL formulation in (31.185) will reduce to BIC. A good overview of MDL is given by Hansen and Yu (2001). For more information, the reader may consult Barron and Cover (1991), Barron, Rissanen, and Yu (1998), and Grunwald (2007).


Frequentist view. The ML approach of this chapter treats the parameter θ as an unknown but fixed quantity and does not attach any probability distribution to it. This approach is reminiscent of the frequentist viewpoint to probability and statistics. In the frequentist approach, the notion of probability is defined as the long-term frequency of occurrence of events, evaluated from repeated experimentation or observation. For example, the probability of landing heads (H) in a coin toss can be determined by repeatedly tossing the coin N times and counting how many heads are observed, say, M times. The ratio M/N then approaches the probability of the event, P(H), as N → ∞. We can re-examine the ML formulation in light of this description, namely,

\widehat{\theta}_N = \arg\max_{\theta} \left\{ \sum_{n=1}^{N} \ln f_y(y_n; \theta) \right\}        (31.218)

where we are adding the subscript N to indicate that θ̂_N is computed from N measurements. This solution leads to a mapping from the observations to the estimate:

\{ y_n \}_{n=1}^{N} \;\stackrel{\text{ML}}{\longrightarrow}\; \widehat{\theta}_N        (31.219)

But since the observations {y_n} are realizations of a random process, the randomness will reflect itself on the ML estimate as well. Specifically, θ̂_N will vary with the measurements {y_n}: Two different collections of N measurements, each arising from the same data distribution f_y(y), will generally lead to two different values for θ̂_N. For this reason, we will denote the ML estimate in boldface and write θ̂_N to highlight its random nature. Thus, although the ML formalism models θ as an unknown but fixed parameter, its estimator θ̂_N is a random quantity with mean and variance. In particular, the Cramer–Rao bound (31.117) provides a lower bound on the expected MSE in terms of the inverse of the Fisher information matrix. Obviously, the bound is useful when the Fisher matrix can be evaluated in closed form. This is often challenging since, as can be seen from expressions (31.102) and (31.99), the designer will need to compute certain expectations. As befits the frequentist approach, one useful alternative to assess the performance of ML estimators is to resort to a bootstrap calculation. The term "bootstrap" is commonly used in the statistical literature to refer to a technique where a statistic (such as mean or variance) is estimated by resampling with replacement from existing measurements. We will encounter this approach in other contexts in this text, e.g., when studying bagging classifiers in Chapter 62 and also temporal-difference techniques in reinforcement learning in Chapter 46. Under bootstrap, we resample the original N measurements {y_n} with replacement and obtain another collection of N samples, denoted by {y_n^{(1)}}. These sample values arise from the same original set and some values may appear repeated due to resampling with replacement. We then compute the ML estimate again from this new collection and denote it by:

\{ y_n^{(1)} \}_{n=1}^{N} \;\stackrel{\text{ML}}{\longrightarrow}\; \widehat{\theta}_N^{(1)}        (31.220)

We repeat the resampling operation multiple times. Each time b leads to a new ML estimate, θ̂_N^{(b)}. The main advantage of carrying out these bootstrap calculations is that, without collecting any additional data, the estimates {θ̂_N^{(b)}} lead to a histogram distribution that approximates the pdf for the estimator θ̂_N. We can subsequently use the estimated pdf to deduce useful statistics about the ML estimator, such as its sample mean, variance, and confidence interval, as illustrated in Fig. 31.7.

Figure 31.7 Illustration of a histogram constructed for the distribution of an ML estimator θ̂_ML obtained by applying the bootstrap method. (The plot marks the histogram distribution for θ̂_ML, its sample mean, and a 95% confidence interval.)
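As a rough sketch of this bootstrap procedure (with assumed Gaussian data; the number of resamples B is an arbitrary choice), the following Python code resamples the measurements with replacement, recomputes the ML estimate of the mean on each resample, and summarizes the resulting histogram distribution.

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(loc=2.0, scale=1.0, size=150)      # assumed original N measurements
B = 2000                                          # number of bootstrap resamples (assumed)

boot_estimates = np.empty(B)
for b in range(B):
    resample = rng.choice(y, size=len(y), replace=True)      # resample with replacement
    boot_estimates[b] = resample.mean()                      # ML estimate from the resample

boot_mean = boot_estimates.mean()                 # sample mean of the distribution
boot_std = boot_estimates.std()                   # spread of the ML estimator
ci_95 = np.percentile(boot_estimates, [2.5, 97.5])           # 95% confidence interval
```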

Bayesian view. In contrast to the frequentist approach, the Bayesian view to inference treats the concept of probability as a measure of uncertainty rather than frequency of occurrence. In this case, the probability of an event is a subjective measure and provides an indication of the belief we have about its occurrence. This point of view is particularly useful to model events that do not occur frequently and are therefore difficult to capture by long-term frequency calculations. The early proponent of this interpretation for the notion of probability was the British philosopher and mathematician Frank Ramsey (1903–1930) in the work by Ramsey (1931), published posthumously – see also the accounts by Sahlin (2008) and Misak (2020). Under the Bayesian paradigm, it is possible to assign probabilities to events that are not necessarily repeatable. For example, consider again the case of a likelihood function that is parameterized by some θ, as in (31.218). The objective continues to be finding an estimate for θ. However, in many instances, the designer may have available some prior information about which values for θ are more or less likely. This information can be codified into a pdf. One can model the unknown as a random variable, θ, and associate a pdf with it, fθ (θ). This pdf is called the prior and it can be interpreted as a weighting function. The prior models the amount of uncertainty we have about θ: regions of high confidence will have higher likelihood of occurring than regions of lower confidence. For example, assume θ is a scalar and we know that its value lies in the range θ ∈ [0, 1]. If we adopt a uniform prior over this interval, then we are codifying that all values in this range are equally likely. If, on the other hand, θ can assume values over (−∞, ∞) but its true value lies somewhere within [−1, 1], then we could perhaps select a Gaussian prior with zero mean and unit standard deviation to reflect this knowledge. Once a prior is selected for θ, the Bayesian approach then seeks the

estimate for θ by maximizing the posterior likelihood of θ given the observations. Using the Bayes rule, this posterior is given by

f_{\theta|y_1,\ldots,y_N}(\theta \,|\, y_1, \ldots, y_N) = \frac{ f_\theta(\theta) \times f_{y_1,\ldots,y_N|\theta}(y_1, \ldots, y_N \,|\, \theta) }{ f_{y_1,\ldots,y_N}(y_1, \ldots, y_N) }        (31.221)

The evidence in the denominator is independent of θ and can be ignored in the maximization so that we can write

\underbrace{ f_{\theta|y_1,\ldots,y_N}(\theta \,|\, y_1, \ldots, y_N) }_{\text{posterior}} \;\propto\; \underbrace{ f_\theta(\theta) }_{\text{prior}} \times \underbrace{ f_{y_1,\ldots,y_N|\theta}(y_1, \ldots, y_N \,|\, \theta) }_{\text{likelihood}}        (31.222)

or, equivalently, in the log domain (to bring forth the analogy with the ML approach):

\ln f_{\theta|y_1,\ldots,y_N}(\theta \,|\, y_1, \ldots, y_N) = \ln f_\theta(\theta) + \sum_{n=1}^{N} \ln f_y(y_n; \theta) + \text{constant}        (31.223)

The Bayesian approach then seeks the estimate that solves:

\widehat{\theta}_N = \arg\max_{\theta} \left\{ \ln f_\theta(\theta) + \sum_{n=1}^{N} \ln f_y(y_n; \theta) \right\}        (31.224)

This construction leads to the MAP estimate for θ, which we have already encountered in Chapter 28. We find another instance of it in our derivation of the BIC in Appendix 31.C. Comparing with the ML formulation (31.218), we observe that the main difference is the appearance of the additional term ln f_θ(θ) originating from the prior. We therefore find that the frequentist approach ignores the prior and employs only the likelihood function to arrive at the ML estimate θ̂_N, while the Bayesian approach keeps the prior and uses it to arrive at the MAP estimate. Table 31.1 compares the main features of the frequentist and Bayesian approaches to inference: the former focuses on finding a θ that best fits the likelihood model to the data, while the latter focuses on finding a θ that best fits the posterior distribution.

Table 31.1 Comparing frequentist and Bayesian approaches.

Frequentist inference:
1. ML is a prime example
2. Probability is long-term frequency
3. Model unknown but fixed
4. Does not use a prior for the model
5. Uses likelihood of data given model
6. Finds best-fit model for the data
7. Usually less complex

Bayesian inference:
1. MAP is a prime example
2. Probability is measure of uncertainty
3. Model random with uncertainty
4. Uses a prior for the model
5. Uses likelihood of data given model
6. Finds best model for the parameter
7. Finding evidence is challenging

The additional term ln f_θ(θ) in (31.224) has another useful interpretation. Consider an example where θ is M-dimensional with a Gaussian prior of the form:

f_\theta(\theta) = \frac{1}{(2\pi)^{M/2}}\, e^{-\frac{1}{2}\|\theta\|^2}        (31.225)

then

\ln f_\theta(\theta) = -\frac{1}{2}\|\theta\|^2 + \text{constant}        (31.226)

and the cost that is being maximized in (31.224) becomes

\widehat{\theta} = \arg\max_{\theta} \left\{ -\frac{1}{2}\|\theta\|^2 + \sum_{n=1}^{N} \ln f_y(y_n; \theta) \right\}        (31.227)

We see that the effect of the additional term is to discourage large values for θ̂. We refer to ln f_θ(θ) in (31.224) as a regularization term, and we will study its effect more closely in Chapter 51, where we will also consider other choices for the regularization factor. We will find out then that different choices for this factor help infuse into θ̂ certain desirable properties such as forcing it to have small norm or sparse structure. The frequentist and Bayesian approaches lead to different but related results in general. To illustrate the difference, we consider in Prob. 31.8 a random variable y that follows a binomial distribution with parameters N and p, i.e., the probability of observing y = k successes in N trials is given by:

P(y = k) = \binom{N}{k}\, p^k (1-p)^{N-k}, \quad k = 0, 1, \ldots, N        (31.228a)

We show in that problem that having observed y = y, the ML estimate for the probability of success p is

\widehat{p}_{ML} = y/N        (31.228b)

whereas in the earlier Prob. 28.11 we modeled p as a random variable that follows a beta distribution with shape parameters (2, 1). We found then that the MAP estimate for p is given by y+1 , assuming p ∼ beta(2, 1) (31.228c) pbMAP = N +1 Observe that the expressions for both estimates are different, although they tend to each other for large sample sizes, N . The fact that the expressions are different should not come as a surprise. After all, while the ML formulation is using solely the observation y to estimate p, the Bayesian formulation is using one additional piece of information represented by the assumption of a beta distribution for p. We will encounter repeated instances of the Bayesian formulation in our treatment. One of the main difficulties that arises in this technique is the following. Referring back to expression (31.221), we indicated that the evidence in the denominator is b However, independent of θ and can therefore be ignored in the process of seeking θ. in many instances, we still desire to know the resulting posterior distribution that appears on the left-hand side. We will explain in future chapters that the posterior quantity is useful in many cases, for example, to predict future values for y from past observations – see Chapter 33. The difficulty lies in computing the evidence that appears in the denominator. We will describe several techniques in later chapters for this purpose, including variational inference methods and Markov chain Monte Carlo (MCMC) methods. In summary, the Bayesian formulation is anchored on the Bayes rule for mapping priors to posteriors. Although the frequentist approach was popular in the twentieth century, the Bayesian approach has become more prominent in recent years due to several theoretical and computational advances described in later chapters. Its main challenge continues to be selecting a suitable prior that conforms to the physical reality of θ. This is actually one of the main criticisms directed at the Bayesian approach: The prior is often selected in a subjective manner and for the convenience of mathematical tractability; it need not represent a faithful codification of the truth about the unknown, θ. Moreover, different priors will lead to different MAP estimates even for the same measurements. This old-age debate about the merits of the frequentist and Bayesian approaches is likely to continue for decades to come. However, both approaches have merit and have


proven useful in many contexts. It is better to view them as complementary rather than competing or conflicting approaches. While both methodologies assume knowledge of the likelihood function, we should note that the likelihood expressions are not exact but rather approximations in most cases anyway. In this way, the ML and Bayesian inference approaches seek, in their own ways, parameter estimates that “best” explain the observed data and they lead to good performance in many applications of interest. One may view the Bayesian formulation as a “regularized” frequentist formulation, in which case more commonalities link these two approaches than actual differences. The main distinction between them then becomes one of interpreting what their respective costs mean. For more discussion on the frequentist and Bayesian approaches, the reader may refer to Savage (1954), de Finetti (1974), Samaniego and Reneau (1994), Barnett (1999), Samaniego (2010), Wakefield (2013), and VanderPlas (2014). Heart disease Cleveland dataset. Figure 31.1 relies on data derived from the heartdisease Cleveland dataset. The dataset consists of 297 samples that belong to patients with and without heart disease. It is available on the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/heart+Disease. The investigators responsible for the collection of the data are the leading four co-authors of the article by Detrano et al. (1989).

PROBLEMS

31.1 Establish relations (31.24). 31.2 Establish the validity of (31.29). 31.3 Let y(n) = x + v(n), where x is an unknown scalar constant and v(n) is zeromean white noise with power σv2 . An estimator for x is constructed recursively in the following manner: b (n) = (1 − α)b x x(n − 1) + α y(n),

n≥0

b (−1) = 0 and where 0 < α < 1. Determine the steady-state mean and starting from x b (n) as n → ∞. Any optimal choice for α? variance of x 31.4 Consider a vector-valued Gaussian distribution y ∼ Ny (µ, Ry ) where y ∈ IRM . Follow the same ML arguments from Section 31.1 to motivate the following unbiased estimates for (µ, Ry ) from N independent realizations {yn }: µ b=

N 1 X yn , N n=1

by = R

N 1 X (yn − µ b)(yn − µ b)T N − 1 n=1

31.5 Consider the same setting of Prob. 28.10. Assume we collect N independent realizations {yn } for n = 1, 2, . . . , N . (a) Verify that λS e−N λ fy1 ,...,yN (y1 , . . . , yN ; λ) = QN n=1 yn ! P b where S = N n=1 yn , and conclude that the ML estimate of λ is given by λML = P N 1 n=1 yn . Is the estimator unbiased? N PN (b) Show that T (y) = N1 n=1 yn is a sufficient statistic for λ. 31.6 A random variable y is uniformly distributed over the interval 0 ≤ y ≤ a. We observe N iid realizations {y 1 , y 2 , . . . , y N } and we wish to determine the ML estimate for a.


(a)

Verify that fy1 ,...,yN (y1 , . . . , yN ; a) =

  1 I 0 ≤ max y ≤ a n 1≤n≤N aN

where I[x] denotes the indicator function and was defined earlier in (4.164). (b) Conclude that b aML = max1≤n≤N yn . b ML = N a/(N + 1). (c) Show that the estimator is biased, namely, establish that E a (d) Show that T (y) = max1≤n≤N yn is a sufficient statistic for a. 31.7 Assume y is an exponentially distributed random variable with rate λ > 0, i.e., fy (y; λ) = λe−λy for y ≥ 0. Verify that the maximum-likelihood estimate of λ given N b = N/ PN yn . Show that iid realizations {y1 , y2 , . . . , yN } is λ n=1 b= Eλ

Nλ , N −1

b = var(λ)

N 2 λ2 (N − 1)2 (N − 2)

Is the ML estimator efficient in this case? 31.8 A random variable y follows a binomial distribution with parameters N and p, i.e., the probability of observing k successes in N trials is given by: ! N k P(y = k) = p (1 − p)N −k , k = 0, 1, . . . , N k Having observed y = y, show that the ML estimate of p when N is known is given by pbML = y/N . (b) Having observed y = y, show that the ML estimate of N when p is known is b that satisfies N bML + 1 ≥ y/p. given by the smallest value of N (c) Under part (b), assume y/p is an integer. Conclude that there are two ML estibML = y/p and N bML = (y/p) − 1. mates given by N Remark. We compare in Probs. 28.11 and 28.12 the above ML solution to the MAP and MMSE solutions. 31.9 Derive expressions (31.41)–(31.42). 31.10 Is the unbiased estimator (31.25) efficient? That is, does it attain the Cramer– Rao lower bound? 31.11 Let y be a Bernoulli random variable that is equal to 1 with probability p and equal to 0 with probability 1 − p. Show that the Fisher information value for p is given by F (p) = 1/p(1 − p). 31.12 Let y be distributed according to a Poisson distribution with mean λ ≥ 0, i.e., (a)

λk e−λ , k = 0, 1, 2, . . . k! where λ is the average number of events occurring in an interval of time. Show that the Fisher information value for λ is given by F (λ) = 1/λ. Show further that the ML estimate of λ from N iid observations {y 1 , y 2 , . . . , y N } is efficient. 31.13 Let y be distributed according to the beta distribution P(y = k) =

fy (y; θ) = ay a−1 ,

y ∈ (0, 1), a > 0

Assume we collect N iid observations {y 1 , y 2 , . . . , y N }. (a) Show that the ML estimate of a is given by b aML = − (b)

N 1 X ln yn N n=1

!−1

Verify that the variance of b aML is given by σaˆ2ML =

N 2 a2 (N − 1)2 (N − 2)


(c) Show that the Fisher information value for a is F (a) = N/a2 . (d) Is the ML estimator efficient? 31.14 Consider the vector-valued Gaussian distribution (31.105) with a diagonal covariance matrix. Introduce the vector of parameters θ = col{µ1 , µ2 , . . . , µP , ln σ1 , ln σ2 , . . . , ln σP } Compute the Fisher information matrix I(θ) relative to these parameters. 31.15 Conclude from (31.122) that the MSE for any biased estimator satisfies −1  2  2 ∂g(θ) e2 ≥ (θ − g(θ))2 − E ∂ ln fy (y; θ) Eθ ∂2θ ∂θ where θ − g(θ) is the bias. 31.16 Establish expression (31.209) for the conditional distribution of the observations given the statistic T (y). 31.17 Establish the validity of inequality (31.211) given by the Rao–Blackwell theorem. 31.18 Refer to the log-likelihood function (31.133) and verify that the two regularity conditions (31.229)–(31.230) are satisfied. 31.19 Let y denote a random variable that is uniformly distributed over the interval [0, θ], where θ > 0 is an unknown parameter, i.e., fy (y; θ) = 1/θ for 0 ≤ y ≤ θ. b = 2y is an unbiased estimator for θ, i.e., E θ b = θ. (a) Argue that θ (b) Verify that the second regularity condition (31.230) fails in this case. (c) Verify that 2  ∂ ln fy (y; θ) e2 = θ2 /3, = 1/θ2 Eθ E ∂θ and conclude that the Cramer–Rao bound result does not hold in this case. 31.20 Consider the problem of estimating the mean of a sequence of N observations {y(n)} drawn from a Gaussian distribution with mean µ and known variance σy2 . b real denote the ML estimator for µ from the observa(a) Assume that µ ∈ IR. Let µ b real ? Is the estimator µ b real unbiased? tions {y(n)}. What is the variance of µ (b) Continue assuming that µ ∈ IR. What is the Cramer–Rao bound (lower bound on var(b µreal )) for the parameter µ as a function of σy2 and N . Is the estimator b real efficient? µ (c) Now assume that µ ∈ ZZ, where ZZ denotes the set of integers. Consider the b integer = round(b estimator µ µreal ), where round(x) : IR → ZZ returns the nearestinteger to the real number x (where round(x − 1/2) = x when x itself is an b integer and express it using the standard Gaussian integer). Compute the pmf of µ cumulative distribution function (cdf) Q(x) defined earlier in part (b) of Prob. b integer unbiased? 4.31. Based on the pmf, is the estimator µ b integer can be written as: (d) Using the fact that the variance of µ  √    1 N N var(b µinteger ) = Q − 2Q 2 σy σy where r Q(x) ∼

∞ 2 1 X − 21 y2 x2 e , π x y=1

as x → ∞

b integer is of the form: show that the variance of the rounded estimator µ r N 8σy2 − 8σ 2 y , var(b µinteger ) ∼ e as N/σy2 → ∞ πN


(e)

Compare the variances of parts (a) and (d). How fast does the variance of each estimator decay as a function of N ? Is this a surprising result? Remark. For more details on these results, the reader may refer to Hammersley (1950). 31.21 Refer to the expressions in Table 5.1 for several traditional probability distributions. Use result (31.81) to determine the ML estimates for the parameter θ in each case. Express your results in terms of estimates for the original parameters for the various distributions. 31.22 Refer to expression (31.79) for the log-likelihood function of an exponential distribution. Establish the identity N 1 X 1 ∇θT ln fy1 ,...,yN (y1 , . . . , yN ; θ) = T (yn ) − E T (y) N N n=1

31.23 Refer to the least-squares problem (31.144) with unknown scalar parameters {a, b, c}. Determine the estimates {b a, b b, b c}. 31.24 Refer to expressions (31.194)–(31.196) derived under an MDL formulation for selecting the best of two models. (a) Maximize J(µ1 ) over the parameter µ1 and determine an expression for µ b1 . Ignore the constant term and evaluate J(b µ1 ). (b) Maximize J(µ2 , σ22 ) over the parameters (µ2 , σ22 ) and determine expressions for (b µ2 , σ b22 ). Ignore the constant term and evaluate J(b µ2 , σ b22 ). (c) Which model would you choose according to the MDL criterion? 31.25 We continue with Prob. 31.24 but assume now that the number of bits needed to represent each model is computed based on (31.181). In this case, the cost functions J(µ1 ) and J(µ2 , σ22 ) will be replaced by N 1 X 1 (yn − µ1 )2 + cte J(µ1 ) = − ln(N ) − 2 2 2σ1 n=1

J(µ2 , σ22 ) = − ln(N ) −

N 1 X (yn − µ2 )2 + cte 2σ22 n=1

Repeat the derivations of Prob. 31.24 for this case. 31.26 Refer to the least-squares problem (31.176). Differentiate the cost with respect ? to w and establish the validity of the solution wM and the corresponding minimum cost EM given by (31.177a)–(31.177b). 31.27 Consider N independent measurements {y1 , y2 , . . . , yN } and introduce their sample mean and sample variance estimates: µ b=

N 1 X yn , N n=1

σ b2 =

N 1 X (yn − µ b)2 N n=1

We wish to use the data to decide between two models, Ny (0, 1) and Ny (a, 1). One of the models has θ = 0 and Mθ = 0 while the other model has θ = a and Mθ = 1. Here, Mθ denotes the number of parameters under model θ. Show that the AIC scores for both models are given by AIC(0) = N ln(2π) + N σ b2 + N µ b2 AIC(a) = 2 + N ln(2π) + N σ b2 Conclude that model Ny (a, 0) is selected whenever µ b2 > 2/N . Verify that under BIC 2 the condition changes to µ b > (ln N )/N .


31.28 Use approximation (31.279) for large sample size N in the BIC formulation to justify the expression eBIC(k)/2 π(k|y1 . . . , yN ) = PK BIC(k0 )/2 k0 =1 e 31.29 Refer to the cross-validation construction (31.204). Assume, for simplicity, that cross-validation performs a single pass over the data so that, for each model θk , a single bk,1 is generated using X bk,1 = 1 PT ln fy (yt ; θbk ). Assume further that the estimate X t=1 T log-likelihood functions are bounded, say, 0 ≤ ln fy (y; θb` ) ≤ b for all y. ck,1 − (a) For any small δ > 0, use Hoeffding inequality (2.232b) to verify that P(|X −2T δ 2 /b2 Xk | ≥ δ) ≤ 2e . (b) Show how to select δ to ensure that with high probability 1 − : !1/2 b2 ln(2/) b max |Xk,1 − Xk | < 1≤k≤K 2T (c)

Explain that bk,1 is close to max Xk with high probability max X 1≤k≤K

1≤k≤K

bk,1 and maxk Xk . Quantify the likely distance between maxk X 31.30 This example is extracted from Eddy (2004) and VanderPlas (2014). Alice and Bob are present in a room with a billiard table; they cannot see the table. Their friend Carol rolls a ball down the table and marks the location where it lands. Subsequently, Carol rolls more balls. If the ball lands to the left of the mark, a point is given to Alice. Otherwise, a point is given to Bob. The first person to reach six points wins the game. After 8 throws, Alice has 5 points and Bob has 3 points. We wish to evaluate the chance of Bob winning the game using the frequentist and Bayesian approaches. (a) Based on the frequentist approach, the probability of a ball throw favoring Alice is pb = 5/8. The same probability for Bob is 1 − pb = 3/8. Use these values to estimate the likelihood that Bob will win the game. (b) Based on the Bayesian approach, we model the unknown success probability for Alice as a random variable p. Assume a uniform prior for p, i.e., fp (p) = 1 for p ∈ [0, 1]. The measurements in this problem are ya = 5 and yb = 3 (the number of balls successfully assigned to Alice and Bob, respectively). Use the Bayes rule to show that ˆ 1 p5 (1 − p)6 dp   0 P Bob wins | ya = 5, yb = 3 = ˆ 1 p5 (1 − p)3 dp 0

Evaluate the expression and compare your result with part (a). Comment on the difference. 31.31 Let {y n } denote a collection of N random variables, for n = 1, 2, . . . , N , with each variable y n distributed according to its own individual pdf, denoted by fn (y), with its own mean µn . In many problems of interest (e.g., in economics, decision-making, and in reinforcement learning studied later in Section 47.9), the objective is to estimate the maximum mean value of the variables, i.e., to solve problems of the type: ∆

µmax =

max E y n

1≤n≤N

If the means {µn } were known beforehand, then the answer is obviously the largest value among the {µn }, i.e., µmax = max1≤n≤N µn . The challenge is to estimate µmax directly

1264

Maximum Likelihood

from observations. Thus, assume we collect Mn iid observations for each variable y n . We denote the observations by {yn,m , m = 1, . . . , Mn }. These observations can be used to determine unbiased estimates for the {µn } using the sample mean calculation P n µ bn = (1/Mn ) M m=1 yn,m . Now consider the estimate construction: ∆

µ bmax =

max µ bn

1≤n≤N

b max constructed in this manner for µmax is biased. It is Show that the estimator µ sufficient to provide an example. Remark. The bias of the maximum sample mean estimator is well studied in the literature, with ramifications in the fields of economics and management – see, e.g., the works by Capen, Clapp, and Campbell (1971), Smith and Winkler (2006), Thaler (1988), Van den Steen (2004), and van Hasselt (2010). 31.32 We continue with the setting of Prob. 31.31. Assume we collect, for each vari(b) (a) able y n , two disjoint sets of iid observations with Mn and Mn samples in each. We use these measurements to compute two sample means for y n , namely, (a)

(a) µ bn

=

1

X

(a)

Mn

(b)

Mn

µ b(b) n

yn,m ,

=

m=1

1

Mn

X

(b)

Mn

yn,m

m=1

where we are using the superscripts (a) and (b) to distinguish between the two sample means. The observations {yn,m } used in each expression originate from the correspond(a) (b) ing set of measurements. Both sample means {b µn , µ bn } are unbiased estimators for the same mean, µn , i.e., (a) b (b) bn = Eµ Eµ n = E yn

Let N be the set of indices that maximize the expected values of the {y n }: n o N = n? n? = argmax E y n 1≤n≤N

(a)

Let na be an index that maximizes {b µn }, i.e., na = argmax µ b(a) n 1≤n≤N

and consider the sample mean from set (b) that corresponds to this same index, namely, (b) µ bna (that is, we maximize over one sample mean set and consider the corresponding b (b) sample mean from the other set). Verify that µ na continues to be a biased estimator for µmax but that it does not overestimate it in the sense that (b)



b na ≤ µmax = Eµ

max E y n

1≤n≤N

Show further that the inequality is strict if, and only if, P(na ∈ / N) > 0. Remark. For further motivation and discussion on this construction, the reader may refer to van Hasselt (2010).

31.A Derivation of the Cramer–Rao Bound

31.A

1265

DERIVATION OF THE CRAMER–RAO BOUND We provide in this appendix one derivation for the Cramer–Rao bounds (31.120a)– (31.120b) and (31.122) for the case of scalar parameters by following an argument similar to Cassella and Berger (2002), Frieden (2004), and Van Trees (1968, 2013); the argument can be extended with proper adjustments to the vector case. The derivation relies on two regularity conditions on the density function, namely,

(a) The score function exists and is finite, i.e., for every y where fy (y; θ) > 0, it should hold that ∆

S(θ) =

∂ ln fy (y; θ) 0. According to (4.79a), the pdf of each Gaussian component has the form:   1 1 1 T −1 (y − µ ) R (y − µ ) (32.40) fk (y) = exp − k k k 2 (2π)P/2 (det Rk )1/2 where we are using the subscript k in {fk (y), µk , Rk } to refer to the kth distribution. In order to highlight the fact that this distribution is dependent on the parameters {µk , Rk }, we also use the notation Ny (µk , Rk ) and Nk to refer to the same distribution.

32.3.1

Mixture Model Now let k denote a latent random variable, which assumes integer values in the range {1, 2, . . . , K} with probabilities πk ≥ 0, i.e., P(k = k) = πk ,

K X

πk = 1

(32.41)

k=1

The notation πk , with a subscript k, should not be confused with the irrational number, π, which appears in the pdf expression (32.40) for the Gaussian distribution; the scalar πk refers to the mixing probability specified in (32.41) and it will appear as a weighting coefficient in the mixture model (32.43). The mixture model for y arises as follows. We first select k randomly according to the distribution (32.41), and then select y randomly according to fk (y). If we let fy,k (y, k) denote the joint distribution for the random variables (y, k), we conclude using the Bayes rule (3.42a) that the marginal distribution for y is given by:

1288

Expectation Maximization

fy (y) =

K X

fy,k (y, k)

k=1

=

K X

P (k = k) fy (y|k = k)

k=1

=

K X

πk fk (y)

(32.42)

k=1

In other words, the distribution of y is a GMM of the form:

fy (y) =

K X

πk Ny (µk , Rk )

(GMM)

(32.43)

k=1

It is useful to note that, given an observation y = y, we can use this model to assess the likelihood that the kth mixture component gave rise to it. This likelihood can be evaluated by appealing again to the Bayes rule (3.42a) to note that: P(k = k|y = y)fy (y) = P(k = k)fy (y|k = k)

(32.44)

from which we conclude using (32.43) that πk Ny (µk , Rk ) P(k = k|y = y) = PK `=1 π` Ny (µ` , R` )

(32.45)

This conditional probability is also called the responsibility that component k has in explaining the observation y, and we denote it by the shorthand notation: ∆

r(k, y) = P(k = k|y = y) (responsibility measure)

(32.46)

Example 32.4 (Different modes of operation) We can envision a scenario where each component k represents a particular mode of operation for a manufacturing machine. In mode k, the machine produces a product whose dimensions (length, width, and height) are approximated by the Gaussian distribution Ny (µk , Rk ). If the machine can switch from one mode to another at random with probability πk for mode k, then we end up with a situation where the dimensions of the product generated by the machine, collected into a vector y ∈ IR3 , follow a mixture model of the form (32.43). Example 32.5 (Clustering) We illustrate the Gaussian mixture model graphically by considering K = 3 two-dimensional (i.e., P = 2) Gaussian components in Fig. 32.4. The figure shows a scatter diagram consisting of N = 1500 points selected from their respective Gaussian distributions according to the probabilities: π1 = 0.2515,

π2 = 0.5236,

π3 = 0.2249

(32.47)

32.3 Gaussian Mixture Models

1289

three individual two-dimensional Gaussian components 20

15

10

5

0

-5

-10

-5

0

5

10

15

20

25

Scatter diagram for N = 1500 points generated from three individual Gaussian components with mean vectors and covariance matrices given by (32.48a)–(32.48b). For each sample point, the corresponding Gaussian component is selected randomly according to the probabilities (32.47). Figure 32.4

The mean vectors and covariance matrices of the mixture components for this example are given by:       2.0187 10.5560 −4.3228 µ1 = , µ2 = , µ3 = (32.48a) 4.3124 10.7832 −4.5155       4 −4 45 −18 10 −2 R1 = , R2 = , R3 = (32.48b) −4 8 −18 9 −2 2 Observe how the sample points in the figure appear clustered in three groups. This observation motivates two useful questions. Assume we are given the sample points {yn , n = 1, 2, . . . , N } shown in the figure. Then, the following questions become pertinent: (a) (Modeling). Is it possible to devise a procedure to estimate the parameters {µk , Rk } for the individual components of the underlying GMM? (Clustering). If a new datum y arrives, is it possible to determine which Gaussian component is more likely to have generated it? The answer to the second question can be obtained by employing expression (32.45) to determine the responsibility score for each mixture component and to select the component that is most likely to have generated the observation. This procedure provides one useful way to solve clustering problems. We will be discussing other clustering

(b)

1290

Expectation Maximization

techniques in future chapters. Clustering is an important form of unsupervised learning; it allows us to group the data {yn } into their most likely components or classes. The answer to the first question can be obtained by applying the EM algorithm, which we discuss next.

32.3.2

Learning the Model We now apply the EM procedure to the GMM (32.43) in order to learn its parameters {πk , µk , Rk } from iid data measurements {yn } for n = 1, 2, . . . , N . It is generally assumed that the number of components K is known or pre-selected. To begin with, we introduce a latent variable k that assumes integer values in the range 1 ≤ k ≤ K with probabilities P(k = k) = πk

(32.49)

For each observation y, the variable k refers to the Gaussian component that generates it. The joint distribution of {y, k} is given by: fy,k (y, k) =

K Y

k=1

(πk )

I[k=k]



I[k=k] Ny (µk , Rk )

(32.50)

where the notation I[·] denotes the indicator function, which is equal to 1 when its argument is true and 0 otherwise. The form (32.50) is justified by noting that, for example, if y is generated by model k = 1, then I[k = 1] = 1 while I[k 6= 1] = 0. In this case, expression (32.50) will collapse to the corresponding Gaussian component: fy,k (y, k = 1) = π1 Ny (µ1 , R1 ) = P(k = k) fy|k (y|k = 1)

(32.51)

Observe that (32.50) is expressed in the form of a product of individual distributions. As the discussion will reveal, the product form is useful for the application of the EM procedure. This is because the logarithm of a product expression transforms into the sum of individual logarithms. We ignore this issue for now. There is an equivalent way to rewrite expression (32.50) by introducing an auxiliary vector variable z in place of the scalar k. We let z be a K × 1 column vector with individual entries denoted by {z k }. These entries assume binary values, either 1 or 0, depending on whether the kth component generates the observation. That is, zk replaces the notation I[k = k]. For example,  1, if mixture component k = 1 generates y z1 = (32.52) 0, otherwise and similarly for z2 , z3 , . . . , zK . The possible realizations for the vector z are the basis vectors {ek } in IRK , namely, z = ek ,

if mixture component k generates y

(32.53)

32.3 Gaussian Mixture Models

1291

The variable z is hidden because, while we observe the realizations{y1 , y2 , . . . , yN}, we do not observe which Gaussian component generates each of these observations. Figure 32.5 illustrates this situation. The hidden variable z assumes one of K possibilities with probability πk each.

(hidden variable)

state of hidden variable chosen according to mixture probability, πk .

fy|z (y|z = e1 )

fy|z (y|z = eK )

fy|z (y|z = e2 )

(mixture components)

generates observations according to component 1

Figure 32.5 The hidden variable z assumes one of K possibilities with probability πk each. Depending on the value of z (i.e., k), the mixture component is one from K possibilities as well.

Using the latent variable z and (32.50), we rewrite the joint distribution (32.50) as fy,z (y, z) =

K Y

(πk )

k=1

zk



zk Ny (µk , Rk )

(32.54)

Of course, we could have written this relation directly, in terms of the {zk }, without starting from (32.50). We followed this slower approach for the benefit of the reader. Using the above mixture model, we can identify the conditional pmf for the variable z given an observation as follows:   P(z = ek |y = yn ) = P I[k = k] | y = yn (32.46)

=

(32.45)

=

r(k, yn ) πk Nyn (µk , Rk ) PK j=1 πj Ny n (µj , Rj )

(32.55)

We now have sufficient information to apply the expectation and maximization steps of the EM procedure (32.38).

1292

Expectation Maximization

Expectation step The first step involves computing the following conditional expectation, where the scalars z k are treated as random variables: (K ) n o   X z k ln πk Nyn (µk , Rk ) E z|yn ;θm−1 ln fy,z (yn , z; θ) = E z|yn ;θm−1 k=1

=

K X

k=1

=

K X

k=1

 ln πk Nyn (µk , Rk ) E z|yn ;θm−1 {z k }

 ln πk Nyn (µk , Rk ) E zk |yn ;θm−1 {z k }

(32.56)

where in the last step we replaced the expectation over z by the expectation over its entry z k . Observe how the logarithm operation is now applied to a single factor rather than to a sum of factors, as was the case with (32.28). This useful observation facilitates the subsequent derivation and is a direct consequence of the fact that we are now starting from the product representation (32.54) for the joint pdf rather than the sum representation (32.43) used for the marginal pdf fy (y; θ). To continue, we need to evaluate the conditional expectation that appears in (32.56). Thus, recall from (32.52) that the variable z k assumes either the value zk = 1 or the value zk = 0 and, hence, E zk |yn ;θm−1 {z k }

= 1 × P(z k = 1|y = yn ; θm−1 ) + 0 × P(z k = 0|y = yn ; θm−1 )

= P(z k = 1|y = yn ; θm−1 ) = P(k = k|y = yn ; θm−1 ) (32.46)

=

r(m) (k, yn )

  (m−1) (m−1) Nyn µk , Rk   =P (m−1) (m−1) (m−1) K π N µ , R y j j j=1 j n (m−1)

πk

We therefore arrive at the following computation for the E-step:   (m−1) (m−1) (m−1) πk Nyn µk , Rk   r(m) (k, yn ) = P (m−1) (m−1) (m−1) K Nyn µj , Rj j=1 πj

(32.57)

(32.58)

Maximization step

Let us now consider the M-step. First, note from (32.56) and (32.57) that the function Q(yn ; θ) that results from the E-step is given by Q(yn ; θ) =

K X

k=1

  r(m) (k, yn ) ln πk Nyn (µk , Rk )

(32.59)

32.3 Gaussian Mixture Models

1293

so that the parameters θ = {πk , µk , Rk , k = 1, . . . , K} can be sought by solving the maximization problem: ( N K )   XX max r(m) (k, yn ) ln πk Nyn (µk , Rk ) (32.60) {πk ,µk ,Rk }K k=1

n=1 k=1

Observe that the variable r(m) (k, yn ) does not depend on the unknown parameters {πk , µk , Rk } since its value is evaluated at the previous iterate, θm−1 . One useful consequence of this fact is that by splitting the computations into two separate E and M steps, the M-step ends up involving the maximization of a cost that has a convenient form: It consists of a separable sum of terms where each term depends on the parameters for one of the components. Now differentiating (32.60) with respect to µk and Rk , using the result of Prob. 32.2 which shows how to compute the gradient of a Gaussian pdf relative to its co(m) (m) variance matrix, and setting the derivatives to zero at the locations {µk , Rk }, we arrive at the following relations for the M-step for k = 1, 2, . . . , K: (m)

µk

(m)

Rk

= =

N X

1

(m) Nk n=1 N X

1

(m)

Nk

n=1

r(m) (k, yn )yn

(32.61)

  T (m) (m) r(m) (k, yn ) yn − µk yn − µk

(32.62)

These relations show that the estimates for µk and Rk involve weighted averages of the observations and their centered outer-products. We still need to estimate the mixing coefficients πk by maximizing Q(yn ; θ) subject to the constraint that these coefficients should add up to 1. We therefore introduce the Lagrangian function: L(πk , λ) =

N X K X

n=1 k=1

K X

 r(m) (k, yn ) ln πk Nyn (µk , Rk ) + λ

k=1

!

(32.63)

πk − 1

for some real scalar λ. Differentiating with respect to the πk and setting the derivative to zero at the estimate locations gives: (m)

πk λ +

N X

r(m) (k, yn ) = 0

(32.64)

n=1

Adding over k and using the fact that the {πk } add up to 1 we get λ=−

K X N X

k=1 n=1

r

(m)

(k, yn ) = −

N K X X

n=1

|

k=1

r

(m)

!

(k, yn )

{z

=1

}

= −N

(32.65)

1294

Expectation Maximization

so that (m)

πk

(m)

=

Nk , N

(m)

Nk

=

N X

r(m) (k, yn )

(32.66)

n=1

In other words, the mixing probability for the kth component is found by estimating the fraction of the N observations that belongs to this model. We summarize the resulting EM procedure in listing (32.67).

EM algorithm for learning the parameters of the GMM (32.43) using independent observations. given observation vectors {yn ∈ IRP }, for n = 1, 2, . . . , N ; assumed K Gaussian mixture components; (0) (0) (0) given initial conditions : πk , µk , Rk , k = 1, 2, . . . , K; repeat until convergence over m ≥ 1: (E-step) for each component k = 1, . . . , K: for each data sample n = 1, . . . , N :   (m−1) (m−1) Nyn µk , Rk   r(m) (k, yn ) = P (m−1) (m−1) (m−1) K Nyn µj , Rj j=1 πj (m−1)

πk

end

(m) Nk

=

N X

(32.67) r

(m)

(k, yn )

n=1

end

(M-step) for each component k = 1, . . . , K: N 1 X (m) (m) r (k, yn )yn µk = (m) Nk n=1 (m)

Rk

(m)

=

1

N X

(m) Nk n=1 (m)

  T (m) (m) r(m) (k, yn ) yn − µk yn − µk

πk = Nk /N end end bk } ← {π (m) , µ(m) , R(m) }. return {b πk , µ bk , R k k k

32.3 Gaussian Mixture Models

1295

Initial conditions Procedure (32.67) is repeated until the log-likelihood function does not undergo significant change; this test can be performed by evaluating the log-likelihood function for each iteration, i.e., ( K ) N   X X (m) (m) (m) `(y1 , . . . , yN ; θm ) = ln πk Nyn µk , Rk (32.68) n=1

k=1

and stopping after no significant change is observed in the value of the function, namely, when |` (θm ) − ` (θm−1 )| < 

(32.69)

for some small  > 0 and large enough m. The initial conditions for procedure (32.67) can generally be chosen at random. One useful initialization procedure, though, is as follows. We set the priors to uniform values: (0)

πk = 1/K,

k = 1, 2, . . . , K

(32.70)

and compute the sample mean of all observations, ∆

y¯ =

N 1 X yn N n=1

(32.71)

We then set the mean vectors to randomly perturbed versions of this value for all components: (0)

µk = y¯ + (some random vector perturbation) k = 1, 2, . . . , K

(32.72)

The perturbations help ensure that the initial mean vectors are not uniform across the K components. We also compute the sample variances for each entry of the observations vectors yn . For example, the sample variance corresponding to the leading entry of yn is given by ∆

σ ¯12 =

N 2 1 X yn (1) − y¯(1) N − 1 n=1

(32.73)

where we are writing x(1) to denote the first entry of vector x. Likewise, for the remaining P − 1 entries of yn : ∆ σ ¯p2 =

N 2 1 X yn (p) − y¯(p) , p = 1, 2, . . . , P N − 1 n=1

(32.74)

Then, we set the initial covariance matrices to randomly perturbed versions of a diagonal covariance matrix for all components as follows:  2 2 (0) Rk = diag σ ¯1 , σ ¯2 , . . . , σ ¯P2 + (random diagonal perturbation) k = 1, 2, . . . , K

(32.75)

1296

Expectation Maximization

where the diagonal perturbation has positive entries. These perturbations help ensure that the initial covariance matrices are positive-definite (and, hence, invertible) as well as nonuniform across the K components. Example 32.6 (Numerical example for GMM) We apply the EM recursions (32.67) to a GMM consisting of K = 3 two-dimensional (i.e., P = 2) Gaussian components. Figure 32.6 shows the scatter diagram for N = 1500 sample points generated from their respective Gaussian distributions according to the probabilities: π1 =

1 , 3

π2 =

1 , 4

π3 =

5 12

(32.76)

The mean vectors and covariance matrices of the mixture components for this example are given by:       2 10 30 µ1 = , µ2 = , µ3 = (32.77a) 3 10 15       4 −4 45 −18 10 −2 R1 = , R2 = , R3 = (32.77b) −4 8 −18 9 −2 2 We apply the EM recursions (32.67) for M = 150 iterations and arrive at the estimates below; the figure also shows the level curves for the estimated Gaussian distributions:       2.0515 10.1401 29.9229 µ b1 = , µ b2 = , µ b3 = (32.78a) 2.9422 9.9494 14.9785   4.0944 −4.0233 b1 = (32.78b) R −4.0233 7.4445   45.9752 −18.2610 b2 = R (32.78c) −18.2610 9.1126   9.9815 −2.0386 b3 = (32.78d) R −2.0386 1.9939 Example 32.7 (Fitting a GMM onto the iris dataset) Figure 32.7 shows scatter diagrams for measurements of petal length and petal width in centimeters for three types of iris flowers (setosa, versicolor, and virginica) from the iris dataset. The flowers were shown earlier in Fig. 27.5. For each flower, we collect the petal length and petal width into a 2 × 1 vector:   petal length yn = (measured in cm) (32.79) petal width There are 50 measurements for each flower type with a total of N = 150 measurements. The top leftmost plot in Fig. 32.7 shows a scatter diagram for all N = 150 petal length × petal width observations for the three flower types. We are using the same color to represent all scatter points in the diagram in order to indicate the fact that the EM procedure does not know beforehand that these points arise from three underlying distributions. In the top plot on the right, we repeat the same scatter diagram but now color the scatter points and also use circle, square, and diamond markers in order to specify which measurements arise from which flower type. This grouping represents the underlying truth and is based on the information we already know about the dataset and is not the result of any automatic clustering procedure. It is clear from this latter scatter diagram that the data from the versicolor and virginica flowers overlap with

32.3 Gaussian Mixture Models

1297

25

20

15

10

5

0

-5

-10

-10

0

10

20

30

40

Figure 32.6 Scatter diagram for N = 1500 points generated from three individual Gaussian components with mean vectors and covariance matrices given by (32.77a)–(32.77b). For each sample point, the corresponding Gaussian component is selected randomly according to the probabilities (32.76).

each other, which is going to interfere with the procedure of fitting a GMM onto the data. We run the EM procedure for M = 150 iterations assuming K = 3 GMMs. The result is the level curves shown in the bottom left plot in the figure. The Gaussian components are found to have the following means and covariance matrices:       5.0159 6.1045 6.3471 µ b1 = , µ b2 = , µ b3 = (32.80) 3.4441 2.8779 2.8620 and  b1 = R

0.1195 0.0896

0.0896 0.1205



 ,

b2 = R

0.4832 0.2187

0.2187 0.1300



 ,

b3 = R

0.4304 0.0466

 0.0466 0.0964 (32.81)

The mixing probabilities are also estimated to be π b1 = 0.3212,

π b2 = 0.3138,

π b3 = 0.3649

(32.82)

The EM procedure further provides for each observation vector yn a measure of the likelihood that it belongs to one of the Gaussian components through the responsibility measures rb(k, yn ). We use these measures to determine the most likely component for each yn and construct the clustering scatter diagram shown in the bottom right plot.

Expectation Maximization

4.5

original data for 3 flower types

colored original data for 3 flower types

4.5

4

petal width (cm)

petal width (cm)

setosa

3.5

3

2.5

2 4

4

3.5

3

2.5

5

6

7

2 4

8

5

6

7

8

petal length (cm)

EM fitting of 3 Gaussian distributions

4

3.5

3

2.5

clustering by EM

4.5

petal width (cm)

4.5

2 4

virginica

versicolor

petal length (cm)

petal width (cm)

1298

4

setosa

3.5

3

2.5

5

6

petal length (cm)

7

8

2 4

virginica

versicolor 5

6

7

8

petal length (cm)

Figure 32.7 (Top left) Scatter diagram for N = 150 petal length × petal width measurements for three types of iris flowers: setosa, versicolor, and virginica. (Top right) The same scatter diagram albeit with color and marker groupings to specify which measurements arise from which flower type. This grouping is based on the information we have and is not the result of any automatic clustering procedure. (Bottom left) Gaussian level sets obtained from running M = 150 iterations of the EM procedure (32.67). (Bottom right) Clustering result from the EM procedure where each data point is assigned to the most likely Gaussian component. It is seen in this example, by comparing the two plots on the right (showing actual clustering versus predicted clustering) that the data from the versicolor and virginica flowers overlap and cause degradation in the clustering result.

Comparing this plot with the actual one shown above it, it is clear that the data from the versicolor and virginica flowers interfere with each other and cause errors in the clustering results with, for example, some data originating from the virginica dataset being declared as originating from the versicolor dataset. We repeat the same experiment by focusing exclusively on the data arising from the setosa and versicolor flowers. That is, we now have N = 100 data points and use the EM procedure to fit K = 2 Gaussian components. The result is shown in Fig. 32.8. It is seen from the clustering result in the bottom right plot that the EM procedure is now almost entirely successful in clustering the data points correctly (with the exception of one setosa data point assigned erroneously to the versicolor class). In this case, the mean and covariance values obtained by the EM procedure are

32.3 Gaussian Mixture Models

4.5

original data for 2 flower types

1299

colored original data for 2 flower types

4.5

petal width (cm)

petal width (cm)

setosa 4

3.5

3

2.5

2 4

4

versicolor

3.5

3

2.5

4.5

5

5.5

6

6.5

7

2 4

7.5

4.5

petal length (cm) EM fitting of 2 Gaussian distributions

4.5

4

4

petal width (cm)

petal width (cm)

5.5

6

6.5

7

7.5

petal length (cm)

4.5

3.5

3

2.5

2 4

5

clustering by EM setosa versicolor

3.5

3

misclassified 2.5

5

6

7

2 4

8

4.5

petal length (cm)

5

5.5

6

6.5

7

7.5

petal length (cm)

Figure 32.8 (Top left) Scatter diagram for N = 100 petal length × petal width

measurements for two flower types: setosa and versicolor. (Top right) The same scatter diagram albeit with color and marker groupings to specify which measurements arise from which flower type. (Bottom left) Gaussian level sets obtained from running M = 150 iterations of the EM procedure (32.67). (Bottom right) Clustering result from the EM procedure where each data point is assigned to the most likely Gaussian component.

5.0158 3.4407



µ b1 =



0.1196 0.0894

0.0894 0.1210



 , µ b2 =

5.9050 2.7634

 (32.83)

and  b1 = R

 ,

b2 = R

0.2964 0.0919

0.0919 0.0990

 (32.84)

The mixing probabilities are also estimated to be π b1 = 0.4881,

π b2 = 0.5119

(32.85)

Example 32.8 (Image segmentation) Besides clustering, another useful application of the EM procedure is image segmentation. The goal of image segmentation is to cluster pixels of similar properties into the same group, such as pixels of similar color. Image segmentation is relevant in many important applications, including object recognition, edge and boundary delineation, and image compression.

1300

Expectation Maximization

A color image with R rows and C columns is described by three R × C matrices (also called channels), one for each of the RGB colors: red, green, and blue. In other words, the image is described by a three-dimensional array of size R×C ×3, which is also called a tensor. The array consists of three matrices, each containing the pixel intensities for one of the fundamental colors (red, green, and blue). The intensities in this example vary from 0 to 255. We normalize them to the range [0, 1] by dividing by 255. The image has R = 332 rows and C = 539 columns for a total of N = R × C = 178, 948 pixels

(32.86)

Each pixel in the image is represented by a 3 × 1 vector containing its RGB intensities:   R yn =  G  (32.87) B In this way, the input to the EM algorithm (32.67) will be a collection of N = 178, 948 vectors {yn } of size 3 × 1 each. We apply the EM procedure for M = 30 iterations assuming K = 4 Gaussian components of dimension 3 × 1 each. The algorithm generates four mean vectors and the corresponding covariance matrices: n o n o b1 , R b2 , R3, b R b4 µ b1 , µ b2 , µ b3, µ b4 , R (32.88) as well as four mixing probabilities: n

π b1 , π b2 , π b3 , π b4

o

(32.89)

It also generates a responsibility measure rb(k, n) for each pixel location and allows us to determine which Gaussian component the pixel belongs to. The algorithm also bk , that belong to each cluster. The quantities rb(k, n) estimates the number of pixels, N (m) b and Nk refer to the values of the iterated quantities r(m) (k, yn ) and Nk at the end of the EM procedure. Once the clustering is completed, all pixels belonging to the same cluster k are replaced by the mean value µ bk for that cluster o n (32.90) every yn ∈ cluster k is replaced by µ bk These values, which are in the range [0, 1], are then scaled up and rounded to integer values in the range [0, 255]. In this way, pixels within the same cluster will be colored similarly. The result is the image shown in the bottom left plot of Fig. 32.9. We subsequently smooth the image by sliding a 3 × 3 median filter over it. The filter slides over the image one location at a time and replaces the pixel at that location by the median of the values covered by the 3 × 3 mask. This is achieved by first sorting the pixel values covered by the mask into increasing order and then selecting the middle value from this list as the output. For example, consider a 3 × 3 mask whose top leftmost corner is placed at location (1, 1) of the image. The mask will cover nine pixel values whose row and column indices are:   (1, 1) (1, 2) (1, 3) 1 3 5 (2, 1) (2, 2) (2, 3) = 3 4 3  (32.91) (3, 1) (3, 2) (3, 3) 3 3 3 The pixel values at these locations, say, the numbers shown on the right, will be ordered and the middle value used to replace the (1, 1) location in the image: o n median 1, 3, 3, 3, 3 , 3, 3, 4, 5 = 3 (32.92)

32.3 Gaussian Mixture Models

Royce Hall building

Segmented image

Image processed by EM

EM followed by median filtering

1301

Figure 32.9 (Top left) Original image capturing the Royce Hall building on the

campus of the University of California, Los Angeles, USA. (Top right) Final segmented image where pixels of similar color in the original image are grouped together and boundaries between groups are colored in red. (Bottom left) The result of the original image after processing by the EM procedure (32.67) and replacing each pixel by the average color level for all pixels within the same group. (Bottom right) The result of applying a median filter to the EM image from the bottom left to remove outliers. The source for the Royce Hall image in the top left corner is Wikimedia Commons, where the image is available for use under the Creative Commons Attribution Share-Alike License. The other three images have been processed by the segmentation procedure under discussion.

Note that when most of the pixel values under a mask assume the same value due to clustering, except perhaps for some outlier pixel values, the outlier values will be ignored by the mask since the output of the median filter will be the most repeated pixel value in the set. The edges in the figure are determined by scanning through the image and detecting adjacent pixels of different colors. These pixels are colored in red. The final segmented image is shown in the top right corner of Fig. 32.9. The implementation is not perfected but is only meant to illustrate the main concepts. In particular, observe how the tree on the left side of the image ends up blended with the building.

1302

Expectation Maximization

32.4

BERNOULLI MIXTURE MODELS We next apply the EM procedure to models involving a mixture of Bernoulli distributions, defined as follows. Let y denote a vector random variable of size L×1. Each individual entry, y(`), follows a Bernoulli distribution with outcomes equal to 0 or 1 with probability of success p(`): P(y(`) = 1) = p(`),

` = 1, 2, . . . , L

(32.93)

We can interpret these entries as corresponding to the on/off states of some underlying system. We assume the individual entries are independent of each other so that the pdf of y is given by fy (y) =

L  Y

`=1

y(`)  1−y(`) p(`) 1 − p(`)

(32.94)

We are using parentheses, as in y(`), to refer to the individual entries of y because we will be using subscripts, such as y n , to refer to observations of the vector y. We express the above distribution more compactly as follows: fy (y; p) = By (p)

(32.95)

where p is the vector of probabilities, p = col{p(`)}. It is straightforward to verify that the mean vector of y and its covariance matrix are given by (see Prob. 32.5): n o n o E y = col p(`) , Ry = diag p(`)(1 − p(`)) (32.96) We now consider a mixture model of such distributions, also known as a latent class analysis model, where each mixture component k is of the form (32.94) with probability vector pk = col{pk (`)}. That is, we let fk (y; pk ) =

L  Y

`=1

fy (y) =

K X

y(`)   ∆ pk (`) 1 − pk (`) )1−y(`) = By (pk )

πk fk (y; pk )

(32.97)

(32.98)

k=1

with mixing probabilities πk , which add up to 1. The mean vector and covariance matrix for this mixture distribution are given by Ey =

K X

πk µk

(32.99a)

k=1

Ry =

K X

k=1 ∆

πk (Dk +

µk µT k)



K X

k=1

πk µk

!

K X

k=1

πk µk

!T

µk = col{pk (`)} (mean of kth component) n o ∆ Dk = diag pk (`)(1 − pk (`)) (covariance of kth component)

(32.99b) (32.99c) (32.99d)

32.4 Bernoulli Mixture Models

1303

Example 32.9 (Collection of biased coins) We illustrate the Bernoulli mixture model by considering repeated experiments on a collection of biased coins. Consider L coins, with each coin ` = 1, 2, . . . , L having a probability p(`) of showing “heads” in a coin toss. This situation is illustrated in the top part of Fig. 32.10. During an experiment of index n, we toss all coins and write down the outcome for each one of them using yn (`) = 1 if coin ` showed “heads” and yn (`) = 0 if the same coin showed “tails.” In this way, each experiment of index n results in an L-dimensional vector, yn , consisting of the values of yn (`) for all L coins for that experiment. In a Bernoulli mixture model, we will have available K collections of such coins, as shown in the bottom part of Fig. 32.10 for K = 4 and L = 5. In each experiment of index n, we first select one of the rows with probability πk and subsequently toss the coins in that row, whose individual probabilities of success are denoted by {pk (`)} and collected into the vector pk = col{pk (`)}. Assume we repeat this type of experiment a total of N times and collect the corresponding observation vectors, {yn , n = 1, 2, . . . , N }. The problem of interest then is to employ these measurements to estimate the parameters {πk , pk }.

AAACI3icbVDLSgMxFM34rPVVdekm2ArtpsxUUBGEghuXFawKnVIy6W0bmseQZNQy9F/c+CtuXCjFjQv/xUztwteBwOGce7k5J4o5M9b33725+YXFpeXcSn51bX1js7C1fWVUoik0qeJK30TEAGcSmpZZDjexBiIiDtfR8Czzr29BG6bkpR3F0BakL1mPUWKd1CmclEJB7CCK0sa4HIpI3aehFpgqJnEpBM5L2AzUHZN9PADSNePKaVzO9EqpUyj6VX8K/JcEM1JEMzQ6hUnYVTQRIC3lxJhW4Me2nRJtGeUwzoeJgZjQIelDy1FJBJh2Os04xvtO6eKe0u5Ji6fq942UCGNGInKTWSDz28vE/7xWYnvH7ZTJOLEg6dehXsKxVTgrDHeZBmr5yBFCNXN/xXRANKHW1Zp3JQS/I/8lV7VqcFg9uKgV6/6sjhzaRXuojAJ0hOroHDVQE1H0gJ7QC3r1Hr1nb+K9fY3OebOdHfQD3scnfSejeA==

P(coin ` showing heads) = p(`)

Figure 32.10 A total of L coins (top part), with each coin ` having a probability p(`) of showing “heads” in a coin toss. A collection of K = 4 such sets, with L = 5 coins per set, is shown in the bottom part. The probability of picking any set at random is πk , with the probabilities of success for the coins in that row represented by the vector pk . The source for the coin image is Wikimedia Commons, where the image is available in the public domain.

Assume we are given a collection of N iid observations {y n } and the objective is to estimate the model parameters {pk , πk }. Returning to the mixture model

1304

Expectation Maximization

(32.98), we introduce a latent variable k that assumes integer values in the range 1 ≤ k ≤ K with probabilities P(k = k) = πk

(32.100)

This variable is used to indicate which Bernoulli component is responsible for the observation, y. The joint distribution of the variables {y, k} is then given by fy,k (y, k) =

K Y

k=1

 I[k=k] (πk )I[k=k] By (pk )

(32.101)

where the notation I[·] denotes the indicator function, which is equal to 1 when its argument is true and 0 otherwise. We denote the value I[k = k] more succinctly by the variable zk ; its value is either 0 or 1 depending on whether the mixture component happens to have index k or not. If we introduce the column vector ∆

z = col{z 1 , z 2 , . . . , z K }

(32.102)

with entries z k , then only one entry in z is equal to 1. The possible realizations for z are therefore the basis vectors {ek } in IRK : z = ek ,

if mixture component k generates y

(32.103)

The variable z is hidden because, while we observe the realizations {y1 , y2 , . . . , yN }, we do not know which Bernoulli component generates each of these observations. Using the latent variable z, we rewrite the joint distribution of {y, z} as fy,z (y, z) =

K Y

zk

(πk )

zk

(By (pk ))

(32.104)

k=1

The conditional pmf of the variable z given any observation is given by: P(z = ek |y = yn ) = P (I[k = k] | y = yn ) πk Byn ( pk ) ∆ = r(k, yn ) = PK j=1 πj By n ( pj )

(32.105)

We now have sufficient information to examine the expectation and maximization steps of the EM procedure (32.38).

32.4 Bernoulli Mixture Models

1305

Expectation step The first step involves computing the following conditional expectation, where the scalars zk are treated as random variables: ( !) K n o Y zk zk (πk ) (By (pk )) E z|yn ;θm−1 ln fy,z (yn , z; θ) = E z|yn ;θm−1 ln = E z|yn ;θm−1 =

K X

k=1

=

K X

k=1

(

k=1

K X

z k ln πk Byn (pk )

k=1



)

 ln πk Byn (pk ) E z|yn ;θm−1 {z k }

 ln πk Byn (pk ) E zk |yn ;θm−1 {z k }

(32.106)

where in the last step we replaced the expectation over z by the expectation over its entry z k . Recall that the variable z k assumes either the value zk = 1 or zk = 0 and, hence, E zk |yn ;θm−1 {z k } = 1 × P(z k = 1|y = yn ; θm−1 ) + 0 × P(z k = 0|y = yn ; θm−1 ) = P(z k = 1|y = yn ; θm−1 ) = P(k = k|y = yn ; θm−1 ) = r(m) (k, yn )

  (m−1) B yn ; pk   =P (m−1) (m−1) K B yn ; p j j=1 πj (m−1)

πk

(32.107)

Maximization step Let us now consider the M-step of the EM implementation (32.38). First, note that the function Q(yn ; θ) that results from the E-step is given by Q(yn ; θ) =

K X

k=1

 r(m) (k, yn ) ln πk Byn (pk )

(32.108)

so that the parameters θ = {πk , pk , k = 1, . . . , K} can be sought by solving the maximization problem: ( N K )   XX (m) max r (k, yn ) ln πk Byn (pk ) (32.109) {πk ,pk }K k=1

n=1 k=1

Observe that the variable r(m) (k, yn ) does not depend on the unknown parameters {πk , pk } since its value is evaluated at the previous iterate, θm−1 . Now

1306

Expectation Maximization

differentiating (32.109) with respect to pk and setting the derivatives to zero at (m) the location pk gives N X

r

(m)

yn (`)

(k, yn )

(m)

pk (`)

n=1



1 − yn (`) (m)

1 − pk (`)

!

=0

(32.110)

which implies that (m) pk (`)

PN

(m) (k, yn )yn (`) n=1 r PN (m) (k, yn ) n=1 r

=

or, in vector form, (m)

pk

=

1 (m)

Nk

N X

(32.111)

r(m) (k, yn )yn

(32.112)

n=1

where (m)

Nk

=

N X

r(m) (k, yn )

(32.113)

n=1

We still need to estimate the mixing coefficients πk by maximizing the likelihood function subject to the constraint that these coefficients should add up to 1. We introduce the Lagrangian function:

L(πk , λ) =

N X K X

n=1 k=1

r

(m)

K X  (k, yn ) ln πk Byn (pk ) + λ πk − 1 k=1

!

(32.114)

for some real scalar λ. Differentiating with respect to the πk and setting the derivative to zero at the estimate locations gives (m)

πk λ +

N X K X

r(m) (k, yn ) = 0

(32.115)

n=1 k=1

which shows, in an argument similar to (32.66), that λ = −N so that (m)

πk

(m)

=

Nk N

(32.116)

In summary, we arrive at the following listing for the EM algorithm applied to a mixture of vector Bernoulli components, each with a L × 1 probability pk .

32.4 Bernoulli Mixture Models

1307

EM algorithm for learning the parameters of the Bernoulli mixture model (32.98) using independent observations. given observation vectors {yn ∈ IRL }, for n = 1, 2, . . . , N ; assumed K Bernoulli mixture components; (0) (0) given initial conditions : πk , pk , k = 1, 2, . . . , K; repeat until convergence over m ≥ 1: (E-step) for each component k = 1, . . . , K: for each data sample n = 1, . . . , N : (m−1)

πk r(m) (k, yn ) = PK

j=1

  (m−1) By n p k

(m−1)

πj

(m−1)

Byn ( pj

) (32.117)

end

(m)

Nk

=

N X

r(m) (k, yn )

n=1

end

(M-step) for each component k = 1, . . . , K: N 1 X (m) (m) pk = r (k, yn )yn (m) Nk n=1 (m)

(m)

πk = Nk /N end end (m) (m) return {b πk , pbk } ← {πk , pk }. Example 32.10 (Numerical example for a Bernoulli mixture model) Figure 32.11 illustrates the operation of this procedure on a numerical example. A total of N = 2000 vector samples, yn , of size 3 × 1 each are generated according to the Bernoulli mixture model (32.94) and (32.98) with parameters 

1 π1 = , 3

 0.6 p1 =  0.3  , 0.7

2 π2 = , 3



 0.2 p2 =  0.8  0.4

(32.118)

Iterations (32.117) are repeated M = 100 times, starting from the initial conditions (0)

π1

=

1 , 2

(0)

π2

=

1 2

(32.119)

1308

Expectation Maximization

(0)

(0)

and randomly generated p1 and p2 . The figure shows the evolution of the parameter iterates for the first 30 iterations with their values converging toward:     0.1925 0.5904 (m) (m) (m) (m) π1 → 0.3594, π2 → 0.6406, p1 →  0.3615  , p2 →  0.7775  0.4077 0.7090 (32.120) evolution of

1

evolution of

1

0.6

0.6

2

0.8

1

0.8

0.4

0.4

0.2

0.2

0

20

40

60

80

0

100

0.8

0.6

0.6

0.4

0.4

0.2

0.2

20

40

60

60

80

100

80

100

1

0.8

0

40

evolution of

evolution of

1

20

80

0

100

(m)

(m)

20

(m)

40

60

(m)

Evolution of the iterates {π1 , π2 , p1 , p2 } toward their steady-state values for increasing values of the iteration index, m. Figure 32.11

32.5

COMMENTARIES AND DISCUSSION Mixture models. There are several texts on mixture models, including Everitt and Hand (1981), Titterington, Smith, and Makov (1985), McLachlan and Peel (2000), and Mengersen, Robert, and Titterington (2011). According to Stigler (1986) and McLachlan and Peel (2000, pp. 2–3), the earliest application of mixture models to data analysis appears to be the work by the English statistician Karl Pearson (1857–1936), who is regarded as one of the giants and early founders of modern-day mathematical and applied statistics – see the exposition by Tankard (1984). In Pearson (1894, 1896), the author fitted a convex combination of two Gaussian distributions to biological data collected by the English biologist Walter Weldon (1860–1906) in the influential

32.5 Commentaries and Discussion

1309

works by Weldon (1890, 1892, 1893); these contributions laid the foundations to the field of biostatistics. Weldon collected measurements of different features on crabs from different locations. Based on a histogram distribution of the data, Weldon postulated that there may exist two races of the same crab species in the geographical locations he examined. Pearson (1894, 1896) was able to fit a combination of two Gaussian distributions to the data to confirm the suspicion. However, Pearson’s solution method required determining the roots of a ninth-order polynomial equation, which was a challenging feat at the time. His technique preceded the ML approach, which we used in the body of the chapter; this latter approach was developed almost two decades later by the English statistician Ronald Fisher (1890–1962) in the works by Fisher (1912, 1922, 1925). Useful overviews of the contributions of Weldon and Pearson in the context of statistical analysis and biostatistics appear in the articles by Pearson (1906) and Magnello (1998, 2009). Robust statistics. Mixture models are also used in the design of robust estimators and in the modeling of outlier effects. For example, in robust statistics it is common to model the data distribution as consisting of the combination of two normal distributions of the form fy (y) = (1 − ) Ny (µ1 , σ 2 ) +  Ny (µ2 , Aσ 2 )

(32.121)

for some small  > 0 and large A. The rightmost term is used to indicate that there is a small chance of contamination by outliers occurring in the data. This line of reasoning has motivated extensive research in the field of robust statistics, starting with the seminal contributions by Tukey (1960) and Huber (1964). They were among the first to recognize that slight deviations from assumed conditions can lead to significant deterioration in the performance of estimators and that more care was needed in accounting for deviations in the models and in designing less sensitive (or more robust) estimators – see Prob. 32.15. The article by Huber (2002) gives an overview of the contributions of Tukey (1960). The survey article by Zoubir et al. (2012) and the subsequent text by Zoubir et al. (2018) provide further insights into the field of robust statistics in the context of estimation problems. Expectation maximization. The earliest version of the EM algorithm appeared in the context of statistical analysis in genetics in the works by Ceppellini, Siniscalco, and Smith (1955) and Smith (1957), and in the context of ML estimation from missing data in the work by Hartley (1958). Subsequent generalizations include the contributions by Baum et al. (1970) on hidden Markov models, Sundberg (1974) on ML estimation from missing data for the exponential family of distributions, and the classical work by Dempster, Laird, and Rubin (1977) where the term “EM” for the expectation-maximization algorithm was first coined. This latter reference also examined the convergence properties of the EM procedure, and launched EM into prominence as a powerful statistical technique. Their convergence analysis, however, had flaws and it was not until the work of Wu (1983) that the convergence of the EM algorithm toward local minima was studied correctly. Accessible overviews on the EM algorithm appear in Redner and Walker (1984), Moon (1996), Bilmes (1998), and Do and Batzoglou (2008); these references include, among other results, useful examples illustrating the application of the EM algorithm to multinomial distributions and hidden Markov models (HMM). Such Markov models are particularly useful in speech recognition applications – see, e.g., Rabiner (1989, 1993), Deller, Proakis, and Hansen (1993), and also our treatment later in Chapter 38. For additional treatments on the EM algorithm and its variations and properties, the reader may consult the texts by Everitt (1984), Duda, Hart, and Stork (2000), Bishop (2007), McLachlan and Krishnan (2008), and Hastie, Tibshirani, and Friedman (2009). Although the EM algorithm converges reasonably well in practice, its convergence rate can be slow at times. Results on the convergence properties of the EM algorithm appear in some of these texts as well as in the works by Wu (1983) and Xu and Jordan (1996).

1310

Expectation Maximization

Source of images. Figure 32.9 uses an image of the Royce Hall building on the campus of the University of California, Los Angeles. The source of the image is Wikimedia Commons, where it is available for use under the Creative Commons Attribution Share-Alike License at https://commons.wikimedia.org/wiki/File:Royce_Hall,_University_of _California,_Los_Angeles_(23-09-2003).jpg (photo by user Satyriconi). Figure 32.10 illustrates a Bernoulli mixture model using a coin image by P. S. Burton from Wikimedia Commons, where the image is in the public domain at https://commons. wikimedia.org/wiki/File:2010_cent_obverse.png.

PROBLEMS

32.1 Consider a P -dimensional Gaussian distribution with mean vector µ and diagonal covariance matrix Σ = diag{σ12 , σ22 , . . . , σP2 }, i.e., o n 1 1 1 fy (y; µ, Σ) = exp − (y − µ)T Σ−1 (y − µ) P/2 1/2 2 (2π) (det Σ) Verify that the partial derivative of the pdf relative to σk2 is given by   ∂fy (y; µ, Σ) 1 1 2 = f (y; µ, Σ) (y − µ ) − 1 y k k ∂σk2 2σk2 σk2 in terms of the kth entries of the vectors {y, µ}. Conclude that if we collect the partial derivatives into a column vector and define  ∆ ∇Σ fy (y; µ, Σ) = col ∂fy (y)/∂σ12 , . . . , ∂fy (y)/∂σP2 then ( )   1 −1 −1 T ∇Σ fy (y; µ, Σ) = fy (y)Σ Σ diag (y − µ)(y − µ) − 1 2 where 1 is the vector of all 1s of size P × 1, and the notation diag(A) extracts the diagonal entries of the matrix A into a column vector. 32.2 Continuing with Prob. 32.1, consider a P -dimensional Gaussian distribution with mean vector µ and covariance matrix R, i.e., n 1 o 1 1 fy (y; µ, R) = exp − (y − µ)T R−1 (y − µ) P/2 1/2 2 (2π) (det R) where R is not necessarily diagonal. We wish to evaluate the gradient of the pdf relative to R. Use properties from Table 2.1 to establish that   1 ∇R fy (y; µ, R) = fy (y)R−1 (y − µ)(y − µ)T R−1 − IP 2 32.3 Refer to the EM procedure (32.67). Given N iid observations {y1 , y2 , . . . , yN }, what would be an estimator (predictor) for the future measurement y N +1 ? 32.4 Consider the GMM (32.121). Derive the EM algorithm for estimating the parameters {, µ1 , µ2 , σ 2 , A}. 32.5 Derive expression (32.96) for the mean vector and the covariance matrix of the Bernoulli vector distribution (32.94) with independent entries. Derive also expressions (32.99a)–(32.99d). 32.6 Refer to the identifications in Table 5.1. Use the expressions for T (y) and a(γ) from the table to show how the EM algorithm (32.139) for exponential mixture models reduces to:

Problems

1311

(a) The EM algorithm (32.67) for Gaussian mixture models. (b) The EM algorithm (32.117) for Bernoulli mixture models. 32.7 Consider a mixture of exponential pdfs of the form: fy (y) =

K X

πk λk e−λk y

k=1

where the {λk > 0} are unknown parameters and the πk are nonnegative mixing probabilities that add up to 1. Derive the EM procedure for estimating the parameters {πk , λk } from N iid observations, {yn , n = 1, 2, . . . , N }. 32.8 Consider a mixture of Poisson distributions of the form: fy (y) =

K X k=1

πk

λyk e−λk y!

where the {λk > 0} are unknown parameters and the πk are nonnegative mixing probabilities that add up to 1. Derive the EM procedure for estimating the parameters {πk , λk } from N iid observations, {yn , n = 1, 2, . . . , N }. 32.9 We reconsider the ML problem studied earlier in Example 31.3 involving a multinomial distribution with partial or incomplete information. Introduce two hidden variables, z 1 and z 2 . These variables correspond to the number of times that each of the animal images #1 and #2 occur. This information is hidden since we are only observing the total number of animal images, y1 + y2 . (a) Verify that the joint pdf for the observed and hidden variables is given by  z1  z2  y3 1 1 N! 1 +θ −θ fy1 +y2 ,y3 ,z1 ,z2 (y1 + y2 , y3 , z1 , z2 ; θ) = z1 !z2 !y3 ! 4 4 2 (b)

(c)

We wish to apply the EM procedure (32.38) to the above model to estimate the parameter θ. Let s = y1 + y2 . Establish the following expressions for the conditional means of the latent variables, which are needed in the E-step of the algorithm:   1/4 (m−1) ∆ zb1 = E (z 1 |y1 + y2 = s; θm−1 ) = s × θm−1 + 1/2   θm−1 + 1/4 (m−1) ∆ zb2 = E (z 2 |y1 + y2 = s; θm−1 ) = s × θm−1 + 1/2 The expectations are computed relative to the conditional pdf of the latent variables given {y1 + y2 , y3 } and with the parameter fixed at θm−1 . Verify that the M-step leads to the recursion θm =

θm−1 (4s − 2y3 ) + s − y3 , s = y1 + y2 θm−1 (4s + 8y3 ) + s + 4y3

32.10 Refer to the discussion on the multinomial distribution in Prob. 32.9 and change the probabilities shown in (31.66) to the following values: p1 = 2θ, (a) (b) (c)

p2 =

1 − θ, 3

p3 =

2 −θ 3

Derive an expression for the ML estimate for θ assuming knowledge of {y1 , y2 , y3 }. Derive the EM algorithm for estimating θ assuming knowledge of {y1 + y2 , y3 }. Derive the EM algorithm for estimating θ assuming knowledge of {y1 , y2 + y3 }.

1312

Expectation Maximization

32.11 Consider two coins, one with probability p for “heads” and the other with probability q for “heads.” The first coin is chosen with probability α and the second coin is chosen with probability 1−α. When a coin is chosen, it is tossed and the outcome is noted as either yn = 1 (for heads) or yn = 0 (for tails) in the nth experiment. Given a collection of N measurements {yn , n = 1, 2, . . . , N }, write down the EM recursions for estimating the parameters {α, p, q}. 32.12 A person is tossing one of two biased coins. The probability of coin 1 landing “heads” is 0.6, while the probability of coin 2 landing “heads” is 0.3. At every experiment, one of the coins is chosen at random. The first coin is chosen with probability 0.7. Four experiments are performed and the observed outcome is the sequence HTHH. It is not known which coin generated each outcome. (a) What is the likelihood of observing this outcome? (b) What would be the most likely outcome for n = 5? (c) Can you estimate the sequence of coin selections for these four time instants? 32.13 A person is choosing colored balls, one at a time and with replacement, from one of two urns. Each urn has red (R) and blue (B) balls. The proportion of red balls in urn 1 is 70%, while the proportion of blue balls in urn 2 is 40%. Either urn can be selected with equal probability. Four balls are selected and the following colors are observed: RBBR. It is not known which urn generated each of the balls. (a) What is the likelihood of observing this outcome? (b) What would be the most likely color for the ball at n = 5? (c) Can you estimate the sequence of urn selections for these four time instants? 32.14 We continue with Prob. 32.13. A second person is independently choosing colored balls, one at a time and with replacement, from a second set of two urns. Each of the urns also has red (R) and blue (B) balls. The proportion of red balls in urn 1 is 60%, while the proportion of blue balls in urn 2 is 50%. Urn 1 is selected with probability 0.7 while urn 2 is selected with probability 0.3. A sequence of six balls are observed with colors BBRBRR. Due to a lapse in the records, it is not known whether this sequence of balls was observed by the user of Prob. 32.13 under their experiment conditions or by the user of this problem under their experiment conditions. Either option is possible. Can you help figure out which user is most likely to have generated this data? 32.15 Two estimates for the centrality of a collection of measurements {yn , n = 1, . . . , N } is their median and their mean. Explain why the median is a more robust estimate than the mean to the presence of outliers?

32.A

EXPONENTIAL MIXTURE MODELS In this appendix we extend the derivation in the body of the chapter to mixtures of exponential distributions, where each individual component is of the form (5.2), namely, n o ∆ fk (y; γk ) = h(y) exp γkT T (y) − a(γk ) = Ey (γk ) (32.122) fy (y; θ) =

K X

πk fk (y; γk )

(32.123)

k=1

with mixing probabilities πk , which add up to 1, and where we are using the notation γk ∈ IRM to refer to the parameter vector that is associated with the kth exponential component; note that before, in (5.2), we used the letter θ to refer to the parameter of an exponential distribution. We are now writing γk instead of θk because we have been using the letter θ to refer to the collection of all unknown parameters in an ML or EM problem formulation. In the current context, the symbol θ will be referring to

32.A Exponential Mixture Models

1313

the collection of parameters {πk , γk } that define the aggregate pdf in (32.123). The functions {h(y), T (y), a(γ)} satisfy h(y) ≥ 0 : IRP → IR,

T (y) : IRP → IRM ,

a(γ) : IRM → IR

(32.124)

Note that, for simplicity of presentation, we also denote the individual exponential components by the compact notation Ey (γk ). We assume that we are given a collection of N iid observations {yn } and the objective is to estimate the model parameters {γk , πk }, assuming K is known or pre-selected. As before, we introduce a latent variable denoted by k; it assumes integer values in the range 1 ≤ k ≤ K with probabilities P(k = k) = πk

(32.125)

This variable is used to indicate which exponential component is responsible for the observation, y. The joint distribution of the variables {y, k} is given by

fy,k (y, k; θ) =

K Y

 I[k=k] (πk )I[k=k] Ey (γk )

(32.126)

k=1

We denote the value I[k = k] more succinctly by zk and introduce the column vector z = col{z k }. The possible realizations for z are the basis vectors {ek } in IRK : z = ek ,

if mixture component is k

(32.127)

The variable z is hidden because, while we observe {y1 , y2 , . . . , yN }, we do not know which exponential component generates each of the observations. Using the latent variable z, we rewrite the joint distribution of {y, z} as

fy,z (y, z; θ) =

K Y

(πk )zk (Ey (γk ))zk

(32.128)

k=1

The conditional pmf of the variable z given any observation is given by: P(z = ek |y = yn ) = P (I[k = k] | y = yn ) πk Eyn ( γk ) ∆ = r(k, yn ) = PK j=1 πj Ey n ( γj )

(32.129)

We now have sufficient information to examine the expectation and maximization steps of the general EM procedure (32.38).

1314

Expectation Maximization

Expectation step
The first step involves computing the following conditional expectation, where the scalars z_k are treated as random variables:

  E_{z|y_n;θ_{m−1}} { ln f_{y,z}(y_n, z; θ) }
    = E_{z|y_n;θ_{m−1}} { ln ( ∏_{k=1}^{K} (π_k)^{z_k} ( E_{y_n}(γ_k) )^{z_k} ) }
    = E_{z|y_n;θ_{m−1}} { Σ_{k=1}^{K} z_k ln ( π_k E_{y_n}(γ_k) ) }
    = Σ_{k=1}^{K} ln ( π_k E_{y_n}(γ_k) ) E_{z|y_n;θ_{m−1}} {z_k}
    = Σ_{k=1}^{K} ln ( π_k E_{y_n}(γ_k) ) E_{z_k|y_n;θ_{m−1}} {z_k}        (32.130)

where in the last step we replaced the expectation over z by the expectation over its entry z_k. Recall that the variable z_k assumes either the value z_k = 1 or z_k = 0 and, hence,

  E_{z_k|y_n;θ_{m−1}} (z_k) = 1 × P(z_k = 1 | y = y_n; θ_{m−1}) + 0 × P(z_k = 0 | y = y_n; θ_{m−1})
                            = P(z_k = 1 | y = y_n; θ_{m−1})
                            = P(k = k | y = y_n; θ_{m−1})
                            = r^{(m)}(k, y_n)
                            = π_k^{(m−1)} E_{y_n}( γ_k^{(m−1)} ) / Σ_{j=1}^{K} π_j^{(m−1)} E_{y_n}( γ_j^{(m−1)} )        (32.131)

Maximization step
Let us now consider the M-step of the EM implementation (32.38). First, note that the function Q(y_n; θ) that results from the E-step is given by

  Q(y_n; θ) = Σ_{k=1}^{K} r^{(m)}(k, y_n) ln ( π_k E_{y_n}(γ_k) )
            = Σ_{k=1}^{K} r^{(m)}(k, y_n) ln π_k + Σ_{k=1}^{K} r^{(m)}(k, y_n) ln E_{y_n}(γ_k)
            = Σ_{k=1}^{K} r^{(m)}(k, y_n) ln π_k + Σ_{k=1}^{K} r^{(m)}(k, y_n) ln h(y_n) + Σ_{k=1}^{K} r^{(m)}(k, y_n) ( γ_k^T T(y_n) − a(γ_k) )        (32.132)

so that the parameters θ = {π_k, γ_k, k = 1, . . . , K} can be sought by solving the maximization problem:

  max_{ {π_k, γ_k}_{k=1}^{K} }  Σ_{n=1}^{N} Σ_{k=1}^{K} r^{(m)}(k, y_n) ln π_k + Σ_{n=1}^{N} Σ_{k=1}^{K} r^{(m)}(k, y_n) ( γ_k^T T(y_n) − a(γ_k) )        (32.133)


Observe that the variable r^{(m)}(k, y_n) does not depend on the unknown parameters {π_k, γ_k} since its value is evaluated at the previous iterate, θ_{m−1}. Differentiating (32.133) with respect to γ_k^T and setting the gradient vector to zero at the location γ_k^{(m)} gives

  Σ_{n=1}^{N} r^{(m)}(k, y_n) ( T(y_n) − ∇_{γ_k^T} a(γ_k) ) = 0        (32.134)

which implies that the estimate γ_k^{(m)} is found by solving:

  ∇_{γ_k^T} a( γ_k^{(m)} ) = (1/N_k^{(m)}) Σ_{n=1}^{N} r^{(m)}(k, y_n) T(y_n)        (32.135)

where

  N_k^{(m)} = Σ_{n=1}^{N} r^{(m)}(k, y_n)        (32.136)

We still need to estimate the mixing coefficients π_k by maximizing the likelihood function subject to the constraint that these coefficients should add up to 1. We introduce the Lagrangian function:

  L(π_k, λ) = Σ_{n=1}^{N} Σ_{k=1}^{K} r^{(m)}(k, y_n) ln ( π_k E_{y_n}(γ_k) ) + λ ( Σ_{k=1}^{K} π_k − 1 )        (32.137)

for some real scalar λ. Differentiating with respect to π_k and setting the derivative to zero at the estimate locations, we again find that λ = −N so that

  π_k^{(m)} = N_k^{(m)} / N        (32.138)

In summary, we arrive at the following listing for the EM algorithm applied to a mixture of exponential components, each with a parameter vector γ_k.


EM algorithm for learning the parameters of the exponential mixture model (32.123) using independent observations.
given observation vectors {y_n ∈ IR^P}, for n = 1, 2, . . . , N;
assumed K exponential mixture components;
given initial conditions: π_k^{(0)}, γ_k^{(0)}, k = 1, 2, . . . , K;
repeat until convergence over m ≥ 1:

  (E-step) for each component k = 1, . . . , K:
    for each data sample n = 1, . . . , N:
      r^{(m)}(k, y_n) = π_k^{(m−1)} E_{y_n}( γ_k^{(m−1)} ) / Σ_{j=1}^{K} π_j^{(m−1)} E_{y_n}( γ_j^{(m−1)} )        (32.139)
    end
    N_k^{(m)} = Σ_{n=1}^{N} r^{(m)}(k, y_n)
  end

  (M-step) for each component k = 1, . . . , K:
    find γ_k^{(m)} from ∇_{γ_k^T} a( γ_k^{(m)} ) = (1/N_k^{(m)}) Σ_{n=1}^{N} r^{(m)}(k, y_n) T(y_n)
    π_k^{(m)} = N_k^{(m)} / N
  end

end
return {π̂_k, γ̂_k} ← {π_k^{(m)}, γ_k^{(m)}}.
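To make the listing concrete, the following is a minimal Python sketch of recursion (32.139) specialized to a mixture of univariate exponential components f_k(y) = λ_k e^{−λ_k y}; for this member of the family T(y) = y, so the M-step (32.135) reduces to matching 1/λ_k to the responsibility-weighted sample mean. The function name, initialization, iteration count, and synthetic data are illustrative choices, not part of the text:

import numpy as np

def em_exponential_mixture(y, K, num_iters=200, seed=0):
    # EM for a mixture of K univariate exponential components (rates lam_k);
    # iteration count and initialization are illustrative assumptions.
    rng = np.random.default_rng(seed)
    N = len(y)
    pi = np.full(K, 1.0 / K)                                  # mixing coefficients
    lam = 1.0 / (y.mean() * rng.uniform(0.5, 1.5, size=K))    # initial rates

    for _ in range(num_iters):
        # E-step: responsibilities r[n, k], as in (32.139)
        log_pdf = np.log(lam) - np.outer(y, lam)              # log f_k(y_n)
        log_w = np.log(pi) + log_pdf
        log_w -= log_w.max(axis=1, keepdims=True)             # numerical stability
        r = np.exp(log_w)
        r /= r.sum(axis=1, keepdims=True)

        # M-step
        Nk = r.sum(axis=0)                                    # (32.136)
        lam = Nk / (r * y[:, None]).sum(axis=0)               # solves (32.135) for this family
        pi = Nk / N                                           # (32.138)
    return pi, lam

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    y = np.concatenate([rng.exponential(scale=1.0, size=600),   # rate 1
                        rng.exponential(scale=5.0, size=400)])  # rate 0.2
    print(em_exponential_mixture(y, K=2))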


33 Predictive Modeling

Maximum likelihood (ML) is a powerful statistical tool that determines model parameters θ in order to fit probability density functions (pdfs) onto data measurements. The estimated pdfs can then be used for at least two purposes. First, they can help construct optimal estimators or classifiers (such as the conditional mean estimator, the maximum a-posteriori (MAP) estimator, or the Bayes classifier) since, as we already know from previous chapters, these optimal constructions require knowledge of the conditional or joint probability distributions of the variables involved in the inference problem. Second, once a pdf is learned, we can sample from it to generate additional observations. For example, consider a database consisting of images of cats and assume we are able to characterize (or learn) the pdf distribution of the pixel values in these images. Then, we could use the learned pdf to generate “fake” cat-like images (i.e., ones that look like real cats). We will learn later in this text that this construction is possible and some machine-learning architectures are based on this principle: They use data to learn what we call a “generative model,” and then use the model to generate “similar” data. We provide a brief explanation to this effect in the next section, where we explain the significance of posterior distributions. Now, we showed in Chapter 31 that closed-form expressions for the ML solution are possible only in some cases. For example, when we fit Gaussian distributions onto data, it is sufficient to compute sample means and variances to estimate the first- and second-order moments of the Gaussian pdf. However, challenges quickly arise when we try to fit mixture models or when some latent (or unobservable) variables are present, as was discussed in some detail in our treatment of the expectation-maximization (EM) formulation in the previous chapter. We explained there that the EM algorithm deals with the problem of determining the ML parameter θ that characterizes a joint pdf fy,z (y, z; θ), where realizations for y are observable while z represents latent (or hidden) variables whose realizations cannot be observed. The EM solution enabled us to estimate these latent values as well. Unfortunately, the EM construction (32.38) continues to have two key limitations. First, it assumes that the latent variables are discrete, such as corresponding to class labels. And there are many situations in practice where the latent variables are continuous in nature, as will be illustrated in the following. More importantly, the second limitation in the EM construction is that the expectation


step (32.38) needs to be carried out in closed form; this step involves computing the mean of the joint likelihood function, ln fy,z (y, z; θ), relative to the conditional distribution of z given y, namely, fz|y (z|y; θm−1 ). It is implicitly assumed that this conditional mean is easy to evaluate, which in turn implicitly assumes that the conditional pdf fz|y (z|y; θm−1 ) is easy to infer. We illustrated this construction for Gaussian mixture models (GMMs) and for multinomial variables in the previous chapter. In this and the next three chapters, we will discuss three approximate inference methods that help remove the two limitations. These methods will not assume that the latent variables are discrete and, moreover, they will be able to deal with situations where evaluation of the conditional pdf of z given y (i.e., of the latent variable given the observations) is prohibitive or impossible to obtain in closed form. Actually, the approaches studied in these chapters will go a step further and model all unknown variables as random in nature, as befits a true Bayesian formulation; this includes both the latent variables z as well as the needed parameter θ. The parameters will be modeled as being distributed according to some prior distribution, fθ (θ), in a manner similar to the MAP formulation from Section 28.2. In this chapter, we motivate the Bayesian inference approach and describe two enabling techniques: the Laplace method and the Markov chain Monte Carlo (MCMC) method. In the next chapter, we describe the expectation propagation method and in Chapter 36 we introduce a fourth method known as Bayesian variational inference. We will build up the theory in stages. First, we motivate the important role that posterior distributions play in Bayesian inference and, subsequently, derive several methods for the estimation of these distributions.

33.1 POSTERIOR DISTRIBUTIONS

To begin with, and moving forward, we will use the letter z to refer generically to all hidden or unknown random variables in a problem, including the model parameter θ. If a problem does not involve any latent variable, but only a parameter θ, then z will be θ. In this chapter, we will describe methods for approximating or learning the conditional pdf of z given the observations, namely, f_{z|y}(z|y), which we refer to as the posterior. We already know from previous chapters that conditional distributions of this type are important for solving inference problems to learn z from y, such as constructing conditional mean estimators, MAP estimators, or Bayes classifiers. Another important application for learning posterior distributions is predictive modeling, which we motivate as follows. Let f_{y|z}(y|z) denote the model for generating the data y, based on some parameter realization z. We refer to this model as the generative model, and also as the decoder since it maps z to a realization y (i.e., it decodes the information summarized in z to produce y). The distribution f_{y|z}(y|z) represents the


likelihood of y given the model parameter, z. On the other hand, the posterior fz|y (z|y), with the roles of z and y reversed, is known as the encoder since it maps y to z and, by doing so, it encodes the information from y into z. It is useful to represent these mappings in graphical form, as shown in Fig. 33.1.

Figure 33.1 The decoder is represented by f_{y|z}(y|z) since it decodes the information contained in z, while the encoder is represented by f_{z|y}(z|y) since it maps the information from y into z.

In Bayesian formulations, we assume z is selected randomly from some prior distribution, denoted by f_z(z):

  z ∼ f_z(z)        (33.1)

For example, z could be Gaussian-distributed, z ∼ N_z(0, σ_z²), and f_{y|z}(y|z) could in turn be another Gaussian with mean z selected from the prior distribution (recall the discussion in Example 27.7), say,

  f_{y|z}(y|z) ∼ N_y(z, Σ), for some covariance Σ        (33.2)

Since z is hidden and the realization for z is unknown beforehand, we do not know the exact f_{y|z}(y|z) but only its general form. As such, if we face a situation where we need to generate observations from f_{y|z}(y|z), then we would not be able to do so. That is where predictive modeling becomes useful. Assume we are able to collect N independent observations {y_n}_{n=1}^{N} arising from the distribution f_{y|z}(y|z). We denote the aggregate of these observations by the short-hand notation y_{1:N}. The likelihood of y_{1:N} is given by

  f_{y_{1:N}|z}(y_{1:N}|z) = ∏_{n=1}^{N} f_{y|z}(y_n|z)        (33.3)

As already indicated, in this chapter we will describe several methods that allow us to use the observations to approximate or learn the posterior, fz|y1:N (z|y1:N ). This is the conditional pdf of the unknown parameter z given all observations. Once this is done, we will be able to subsequently generate, on our


own, additional realizations for y. This can be achieved by first noting from the Bayes rule (3.39) that

  f_{y,z|y_{1:N}}(y, z|y_{1:N}) = f_{z|y_{1:N}}(z|y_{1:N}) f_{y|z,y_{1:N}}(y|z, y_{1:N})
                               = f_{z|y_{1:N}}(z|y_{1:N}) f_{y|z}(y|z)        (33.4)
                                   (posterior)            (likelihood)

where the last equality is because the distribution of y is determined by z so that the conditioning over y_{1:N} can be removed. The two distributions appearing on the right-hand side of the last equation have known forms (either given or learned). Therefore, by marginalizing (i.e., integrating) the right-hand side over the unknown z we can recover the posterior predictive distribution for y given the observations:

  f_{y|y_{1:N}}(y|y_{1:N}) = ∫_{z∈Z} f_{z|y_{1:N}}(z|y_{1:N}) f_{y|z}(y|z) dz        (33.5)

where we use the notation z ∈ Z to indicate that the integration is carried over the range of the variable z, denoted generically by Z. Observe from (33.5) how the posterior is used within the integration step. With the predictive distribution computed, we can, for example, use the location of its peak (i.e., the maximum a-posteriori, or MAP, estimator) to predict a value for y. We will illustrate this construction in a couple of solved examples in the body of the chapter. In summary, the following diagram indicates how the stages of data generation, posterior estimation, and prediction are related to each other:

  f_{y|z}(y|z) --(generate data)--> y_n --(estimate posterior)--> f_{z|y_{1:N}}(z|y_{1:N}) --(perform prediction)--> f_{y|y_{1:N}}(y|y_{1:N})
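As a small illustration of how the posterior enters the integration in (33.5), the following sketch approximates the predictive pdf by a Monte Carlo average of the likelihood over posterior samples; the Gaussian likelihood and the stand-in posterior samples used here are assumptions made purely for this example:

import numpy as np

def predictive_pdf(y_grid, posterior_samples, likelihood_pdf):
    # Monte Carlo approximation of (33.5): f(y|y_{1:N}) ~ (1/J) sum_j f(y|z_j)
    vals = np.array([likelihood_pdf(y_grid, z) for z in posterior_samples])
    return vals.mean(axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z_samples = rng.normal(loc=2.0, scale=0.3, size=1000)   # hypothetical posterior samples
    y_grid = np.linspace(-2, 6, 401)
    lik = lambda y, z: np.exp(-0.5 * (y - z) ** 2) / np.sqrt(2 * np.pi)  # assumed N(z,1) likelihood
    f_pred = predictive_pdf(y_grid, z_samples, lik)
    print("peak of predictive pdf near:", y_grid[np.argmax(f_pred)])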

We illustrate these concepts by means of several examples involving Bayesian regression, logit, and probit models. We start with Bayesian regression.

33.1.1 Bayesian Linear Regression

Consider N scalar observations y(n) arising from the linear regression model:

  y(n) = x_n^T w + v(n),   v(n) ∼ N_v(0, σ_v²),   w ∼ N_w(0, R_w)        (33.6)

where the x_n ∈ IR^M are given input vectors and the parameter w ∈ IR^M is assumed Gaussian distributed. This parameter plays the role of the latent variable.


Moreover, the perturbation v(n) is a white-noise process that is independent of w for all n. It follows from (33.6) that conditioned on the knowledge of w, each y(n) is Gaussian distributed:

  f_{y(n)|w}(y(n)|w) = (1/√(2πσ_v²)) exp{ −(1/(2σ_v²)) (y(n) − x_n^T w)² }        (33.7)

We collect the measurements {y(n)}, the input vectors {x_n}, and the perturbations v(n) into matrix and vector quantities and write (for simplicity, we are denoting the vectors by v and y instead of v_{1:N} and y_{1:N}):

  X ≜ col{x_1^T, x_2^T, . . . , x_N^T},   v ≜ col{v(1), v(2), . . . , v(N)},   y ≜ col{y(1), y(2), . . . , y(N)}        (33.8a)
  y = Xw + v,   v ∼ N_v(0, σ_v² I_N),   X ∈ IR^{N×M}        (33.8b)

The data matrix X is deterministic and known. On the other hand, conditioned on knowledge of w, the pdf of y is Gaussian with the following pdf:

  f_{y|w}(y|w) (33.6)= ∏_{n=1}^{N} N_{y(n)}(x_n^T w, σ_v²)
               = ( 1/√(2πσ_v²) )^N exp{ −(1/(2σ_v²)) Σ_{n=1}^{N} (y(n) − x_n^T w)² }
               = ( 1/√(2πσ_v²) )^N exp{ −(1/(2σ_v²)) ‖y − Xw‖² }        (33.9)

Moreover, since y is the sum of two independent Gaussian components, Xw and v, we find that y is also Gaussian with pdf:

  y ∼ N_y( 0, σ_v² I_N + X R_w X^T )        (33.10)

From the Bayes rule, we can estimate the posterior distribution for w given the observations as follows:

  f_{w|y}(w|y) ∝ f_w(w) f_{y|w}(y|w)
             ∝ exp{ −(1/2) w^T R_w^{−1} w } × exp{ −(1/(2σ_v²)) ‖y − Xw‖² }
             = exp{ −(1/2) w^T ( R_w^{−1} + (1/σ_v²) X^T X ) w + (1/σ_v²) y^T X w − (1/(2σ_v²)) ‖y‖² }        (33.11)

The exponent on the right-hand side is quadratic in w and, therefore, after proper normalization, the conditional pdf f_{w|y}(w|y) will be Gaussian. Using the completion-of-squares formula (1.80) with the identifications:

  A ← R_w^{−1} + (1/σ_v²) X^T X,   b ← (1/σ_v²) X^T y,   α ← (1/σ_v²) ‖y‖²        (33.12)


we conclude that

  f_{w|y}(w|y) ∝ exp{ −(1/2) (w − ŵ)^T R̂_w^{−1} (w − ŵ) }        (33.13a)
  R̂_w ≜ ( R_w^{−1} + (1/σ_v²) X^T X )^{−1}        (33.13b)
  ŵ ≜ (1/σ_v²) R̂_w X^T y        (33.13c)

and, therefore, the posterior is given by

  f_{w|y}(w|y) ∼ N_w(ŵ, R̂_w)        (33.14)

The quantity ŵ serves as a MAP estimate for w given the observations. We can now use these results to perform prediction. That is, given a new input vector x′, we would like to predict its likely observation y′ based on the knowledge gained from the previous observations in y. The predictive distribution can be obtained by first marginalizing over w and then applying the Bayes rule as follows:

  f_{y′|y}(y′|y) = ∫_w f_{y′,w|y}(y′, w|y) dw                                       (marginalization)
             (a) = ∫_w f_{w|y}(w|y) f_{y′|w}(y′|w) dw                               (Bayes rule)
                 ∝ ∫_w exp{ −(1/2)(w − ŵ)^T R̂_w^{−1}(w − ŵ) } exp{ −(1/(2σ_v²)) (y′ − (x′)^T w)² } dw        (33.15)

where in step (a) we continue to assume all observations are independent of each other so that y′ and y are conditionally independent given w. The right-hand side involves the integration of the product of two Gaussian distributions. We appeal to the identity (27.63) with the identifications

  W ← (x′)^T,   θ ← 0,   Γ ← σ_v²,   R_x ← R̂_w,   x̄ ← ŵ        (33.16)

to conclude that the predictive distribution is again Gaussian

  f_{y′|y_{1:N}}(y′|y_{1:N}) ∼ N_{y′}( (x′)^T ŵ, σ_v² + (x′)^T R̂_w x′ )        (33.17)

We can now use the peak value (x′)^T ŵ as a predictor for the new observation. The above expression further allows us to specify the size of the error variance around this estimator.
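The following is a brief Python sketch of the computations (33.13b)–(33.13c) and of the predictive mean and variance in (33.17); the synthetic data and the identity prior covariance used in the usage portion are illustrative assumptions:

import numpy as np

def bayes_linear_regression(X, y, sigma_v2, R_w):
    # posterior covariance (33.13b) and posterior mean / MAP estimate (33.13c)
    R_hat = np.linalg.inv(np.linalg.inv(R_w) + (X.T @ X) / sigma_v2)
    w_hat = R_hat @ X.T @ y / sigma_v2

    def predict(x_new):
        # predictive mean and variance from (33.17)
        return x_new @ w_hat, sigma_v2 + x_new @ R_hat @ x_new

    return w_hat, R_hat, predict

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, M, sigma_v2 = 200, 3, 0.25
    w_true = rng.normal(size=M)
    X = rng.normal(size=(N, M))
    y = X @ w_true + np.sqrt(sigma_v2) * rng.normal(size=N)
    w_hat, R_hat, predict = bayes_linear_regression(X, y, sigma_v2, np.eye(M))  # assumed prior R_w = I
    print("MAP estimate:", w_hat)
    print("prediction at x' = [1, 0, 0]:", predict(np.array([1.0, 0.0, 0.0])))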

33.1.2 Bayesian Logit and Probit Models

Consider next an example involving N feature vectors h_n ∈ IR^M with binary labels γ(n) ∈ {±1}. Assume the {γ(n)} are independent and identically


distributed, with the distribution for each γ parameterized by some vector w ∈ IR^M. We describe in this section two popular models for generating the labels from the features, known as the logit and probit models. In the logit model, the label γ follows a Bernoulli distribution with its probability of success (γ = +1) defined by a logistic function, namely,

  P(γ = γ | w; h) = 1/(1 + e^{−γ h^T w}),   γ ∈ {−1, +1}        (33.18)

Note that the sigmoid function σ(x) = 1/(1 + e^{−x}) maps real numbers x ∈ IR to probability values in the range [0, 1], as illustrated in Fig. 33.2. We will encounter this logistic model again in Chapter 59, with one important distinction in relation to the discussion in this chapter. There, the vector w will be modeled as some deterministic unknown parameter and its value will be estimated by resorting to a ML formulation and a stochastic gradient procedure. Here, in this section, we pursue instead a Bayesian formulation where w is modeled as some random variable arising from a Gaussian prior. This formulation leads to the Bayesian logit model.

Figure 33.2 Typical behavior of logistic functions for two classes. The figure shows plots of the functions 1/(1 + e^{−x}) (left) and 1/(1 + e^{x}) (right) assumed to correspond to classes +1 and −1, respectively.

We therefore assume the following generative process for the labels:

  (Bayesian logit model)
    w ∼ N_w(0, σ_w² I_M)
    P(γ = γ | w = w; h) = 1/(1 + e^{−γ h^T w}),   γ ∈ {−1, +1}        (33.19)

where, in the Bayesian context, we assume knowledge of σ_w². In this case, the variable w plays the role of the latent variable z, while γ plays the role of the observation y. The feature h is treated as deterministic (i.e., nonrandom); this fact is the reason why we are using a semicolon to separate w from h in (33.19). This notation stresses that w is random while h is a given deterministic quantity.
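As a short illustration of this generative process, the following sketch draws w from the Gaussian prior and then samples labels according to (33.19); the feature matrix H, the value of σ_w², and the seed are whatever the designer supplies and are not specified by the text:

import numpy as np

def sample_logit_labels(H, sigma_w2, seed=0):
    # draw the latent parameter from the prior and labels from the Bernoulli model (33.19)
    rng = np.random.default_rng(seed)
    N, M = H.shape
    w = rng.normal(scale=np.sqrt(sigma_w2), size=M)
    p_plus = 1.0 / (1.0 + np.exp(-(H @ w)))          # P(gamma = +1 | w; h_n)
    gamma = np.where(rng.uniform(size=N) < p_plus, 1, -1)
    return w, gamma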


There is a second popular model for the distribution of γ given the feature information, known as the probit model. In this model, the logistic function in (33.18) is replaced by the cumulative distribution function (cdf) of the standard Gaussian distribution with zero mean and unit variance, i.e., by

  Φ(x) ≜ (1/√(2π)) ∫_{−∞}^{x} e^{−τ²/2} dτ        (33.20)

so that expression (33.18) becomes

  (Bayesian probit model)
    w ∼ N_w(0, σ_w² I_M)
    P(γ = γ | w = w; h) = Φ(γ h^T w),   γ ∈ {−1, +1}        (33.21)

Note that, similar to the sigmoid function, the cumulative function Φ(x) also maps real numbers x ∈ IR to probability values in the range [0, 1]. We explain the terminology for the “logit” and “probit” models in the comments at the end of Chapter 59.

One useful application for logit and probit models arises in the context of financial planning. For example, the entries of h could represent attributes used by a financial institution to decide on whether to extend a loan to an individual (γ = +1) or not (γ = −1). The entries of h could correspond to information about income level, education level, family size, current debt level, etc. The vector w can vary across institutions and provides an indication on how these institutions weigh the various factors in h before arriving at a recommendation about the loan. In this context, the variable x = h^T w plays the role of a credit score. From the plots in Fig. 33.2, we see that higher scores (i.e., higher values for x) push the recommendation toward γ = +1 (approval), while lower scores push the recommendation toward γ = −1 (denial). If we were able to observe N loan applications {h_n, γ(n)} from a particular institution, then we could process the data to determine a predictive distribution that would allow us to learn the likelihood of success or failure for a new loan application with feature vector h given the past observations:

  prediction distribution ≜ P(γ = γ | γ_N = γ_N; h, H_N)        (33.22)

Here, the notation γ_N and H_N refers to the collection of past observations:

  H_N = col{h_1^T, h_2^T, . . . , h_N^T},   γ_N = col{γ(1), γ(2), . . . , γ(N)}        (33.23)


According to the Bayes rule, evaluation of the prediction probability requires knowledge of the posterior f_{w|γ_N,H_N}(w|γ_N, H_N) since:

  prediction distribution ≜ P(γ = γ | γ_N = γ_N; h, H_N)
                       (a) = ∫_{w∈W} f_{γ,w|γ_N}(γ, w|γ_N; h, H_N) dw
                           = ∫_{w∈W} f_{w|γ_N}(w|γ_N; H_N) f_{γ|w}(γ|w; h) dw        (33.24)
                                       (posterior)          (logit or probit)

where step (a) involves marginalizing (i.e., integrating) the conditional joint distribution of (γ, w) over w, and the second equality follows from the Bayes rule. Unfortunately, evaluation of the posterior distribution in this case is difficult because the integration cannot be carried out in closed form. To see this, note again from the Bayes rule that for the logit case:

  posterior ≜ f_{w|γ_N}(w|γ_N; H_N) = f_{γ_N,w}(γ_N, w; H_N) / f_{γ_N}(γ_N; H_N)
            = f_{γ_N|w}(γ_N|w; H_N) f_w(w) / ∫_{w∈W} f_{γ_N|w}(γ_N|w; H_N) f_w(w) dw        (33.25)

where the integral in the denominator is the pdf of the observations:

  evidence = f_{γ_N}(γ_N; H_N) = ∫_{w∈W} f_{γ_N|w}(γ_N|w; H_N) f_w(w) dw        (33.26)

which involves both the known prior f_w(w) = N_w(0, σ_w² I_M) and the likelihood function

  f_{γ_N|w}(γ_N|w; H_N) (33.19)= ∏_{n=1}^{N} 1/(1 + e^{−γ(n) h_n^T w})
                               = ∏_{n=1}^{N} ( 1/(1 + e^{−h_n^T w}) )^{(1+γ(n))/2} ( 1/(1 + e^{h_n^T w}) )^{(1−γ(n))/2}        (33.27)

Evaluation of (33.26) in closed form is not possible, which in turn means that the posterior (33.25) will need to be approximated. We will present various methods to approximate posterior distributions in this chapter. Once estimated, the posterior for w can then be used to evaluate the desired prediction probability using the integral expression (33.24). We will return to this example to conclude the discussion once we explain how to approximate posterior distributions.
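Since the approximation methods discussed next only require the numerator of (33.25) (or its logarithm), the following short sketch evaluates that unnormalized log-posterior for the logit model; the constant term of the Gaussian prior is dropped because only differences and gradients of this quantity are needed, and the function name is an illustrative choice:

import numpy as np

def log_unnormalized_posterior(w, H, gamma, sigma_w2):
    # log prior (up to a constant) plus log likelihood (33.27);
    # H is N x M with rows h_n^T, gamma holds labels in {-1, +1}
    log_prior = -0.5 * np.dot(w, w) / sigma_w2
    margins = gamma * (H @ w)                       # gamma(n) * h_n^T w
    log_lik = -np.sum(np.log1p(np.exp(-margins)))   # sum_n log 1/(1 + e^{-gamma h^T w})
    return log_prior + log_lik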


33.2 LAPLACE METHOD

The first technique for approximating posterior distributions is the Laplace method. The method has some limitations and does not always lead to accurate results. It is nevertheless a popular technique and leads to good first-order approximations. It also provides a useful way to motivate the topic of posterior approximation. Consider again jointly distributed random variables y and z, where realizations for y are observable while z models latent (unobservable) variables. We denote the pdf of the observations by f_y(y); this pdf is known as the evidence of y. We denote the joint distribution by f_{y,z}(y, z). Many inference problems to learn z from y require knowledge of the posterior, f_{z|y}(z|y), which is the conditional pdf of the latent variable given the observation. Unfortunately, it is not always possible to determine this distribution in closed form from knowledge of the joint pdf f_{y,z}(y, z). Recall from the Bayes rule (3.39) that

  f_{z|y}(z|y) = f_{y,z}(y, z) / f_y(y) = f_z(z) f_{y|z}(y|z) / f_y(y)        (33.28)

In most problem formulations, we have knowledge of the term in the numerator such as knowing the joint pdf f_{y,z}(y, z) or a reasonable model for it. The difficulty lies in obtaining the evidence term, f_y(y), which appears in the denominator. This is because it entails marginalizing over z:

  f_y(y) = ∫_{z∈Z} f_{y,z}(y, z) dz = ∫_{z∈Z} f_z(z) f_{y|z}(y|z) dz        (33.29)

These integrals can be difficult to evaluate, especially for higher-dimensional problems. One method to approximate posterior distributions without the need to compute evidences is the Laplace method. The method is suitable for continuous latent variables z of small dimensions and is based on approximating f_{z|y}(z|y) by a Gaussian form. It is motivated as follows. First, we rewrite the Bayes relation (33.28) in the equivalent form:

  f_{z|y}(z|y) = exp{ ln f_{y,z}(y, z) } / ∫_{z∈Z} exp{ ln f_{y,z}(y, z) } dz        (33.30)

in terms of the logarithm of the joint distribution of (y, z). Let z_MAP denote the location where a peak for ln f_{y,z}(y, z) occurs. At this location, the gradient of this function evaluates to zero:

  ∇_{z^T} ln f_{y,z}(y, z) |_{z=z_MAP} = 0        (33.31)

Introduce the (negative of the) inverse Hessian matrix at location z = z_MAP:

  R_z ≜ − [ ∇²_z ln f_{y,z}(y, z_MAP) ]^{−1}        (33.32)


and consider a second-order Taylor-series expansion of ln f_{y,z}(y, z) around z = z_MAP to get

  ln f_{y,z}(y, z) ≈ ln f_{y,z}(y, z_MAP) − (1/2) (z − z_MAP)^T R_z^{−1} (z − z_MAP)        (33.33)

Substituting this expression into (33.30) shows that

  f_{z|y}(z|y) ≈ exp{ −(1/2) (z − z_MAP)^T R_z^{−1} (z − z_MAP) } / ∫_z exp{ −(1/2) (z − z_MAP)^T R_z^{−1} (z − z_MAP) } dz
             ∝ exp{ −(1/2) (z − z_MAP)^T R_z^{−1} (z − z_MAP) }        (33.34)

from which we conclude that f_{z|y}(z|y) can be approximated by a Gaussian form with mean z_MAP and covariance matrix R_z:

  f_{z|y}(z|y) ≈ N_z(z_MAP, R_z)        (33.35)

The value of z_MAP can be estimated by means of a gradient-ascent recursion of the form:

  z^{(m)} = z^{(m−1)} + µ(m) ∇_{z^T} ln f_{y,z}( y, z^{(m−1)} ),   m ≥ 0        (33.36)

where µ(m) is a step-size sequence, usually satisfying the conditions:

  Σ_{m=0}^{∞} µ(m) = ∞,   Σ_{m=0}^{∞} µ²(m) < ∞        (33.37)

such as µ(m) = τ /(m + 1) for some τ > 0. Clearly, in general, this recursion can only approach local maxima for ln fy,z (y, z). What we need to ensure is that the Hessian matrix at this location is invertible and negative-definite so that Rz > 0. The need to invert the Hessian in (33.32) limits the applicability of the Laplace method to problems of small dimension. We list the resulting construction in (33.38).


Laplace method for approximating posterior distributions.
objective: approximate f_{z|y}(z|y);
given: f_{y,z}(y, z);
initial condition: z^{(−1)}.
repeat until convergence for m = 0, 1, 2, . . . :
  z^{(m)} = z^{(m−1)} + µ(m) ∇_{z^T} ln f_{y,z}( y, z^{(m−1)} )        (33.38)
end
z_MAP ← z^{(m)} after convergence
R_z = − [ ∇²_z ln f_{y,z}(y, z_MAP) ]^{−1}
f_{z|y}(z|y) ≈ N_z(z_MAP, R_z).
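As an illustration of listing (33.38), the following sketch runs the gradient-ascent recursion and then fits the Gaussian approximation. It is specialized to a scalar latent variable; the step-size schedule, iteration count, and the toy model in the usage portion (for which the posterior is exactly Gaussian, so the Laplace fit can be checked) are illustrative assumptions:

import numpy as np

def laplace_approximation(grad_log_joint, hess_log_joint, z0, mu=0.05, iters=2000):
    # gradient-ascent search for z_MAP, recursion (33.38) with mu(m) = mu/(m+1)
    z = z0
    for m in range(iters):
        z = z + (mu / (m + 1)) * grad_log_joint(z)
    R = -1.0 / hess_log_joint(z)          # scalar case of (33.32)
    return z, R

if __name__ == "__main__":
    # toy model: y(n) = z + noise, z ~ N(0,1), noise ~ N(0,1)
    rng = np.random.default_rng(0)
    y = 1.5 + rng.normal(size=20)
    grad = lambda z: -z + np.sum(y - z)
    hess = lambda z: -(1.0 + len(y))
    z_map, R_z = laplace_approximation(grad, hess, z0=0.0)
    print("Laplace mean/variance:", z_map, R_z)
    print("exact posterior mean/variance:", y.sum() / (1 + len(y)), 1.0 / (1 + len(y)))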

Example 33.1 (Revisiting the logistic model with Gaussian latent variable) We revisit the logit model from Section 33.1.2 and apply the Laplace method to approximate the posterior distribution. We know from the numerator of (33.28) that the joint pdf of the latent variable and the observations is proportional to:

  f_{w,γ_N}(w, γ_N; H_N) ∝ e^{−‖w‖²/(2σ_w²)} ∏_{n=1}^{N} ( 1/(1 + e^{−h_n^T w}) )^{(1+γ(n))/2} ( 1/(1 + e^{h_n^T w}) )^{(1−γ(n))/2}        (33.39)

so that

  ln f_{w,γ_N}(w, γ_N; H_N) = cte + Σ_{n=1}^{N} ((1+γ(n))/2) ln( 1/(1 + e^{−h_n^T w}) ) + Σ_{n=1}^{N} ((1−γ(n))/2) ln( 1/(1 + e^{h_n^T w}) ) − (1/(2σ_w²)) ‖w‖²        (33.40)

from which we conclude that the gradient is given by

  ∇_{w^T} ln f_{w,γ_N}(w, γ_N; H_N) = −(1/σ_w²) w + (1/2) Σ_{n=1}^{N} h_n ( γ(n) − (1 − e^{−h_n^T w})/(1 + e^{−h_n^T w}) )        (33.41)

while the Hessian is given by

  ∇²_w ln f_{w,γ_N}(w, γ_N; H_N) = −(1/σ_w²) I_M − Σ_{n=1}^{N} h_n h_n^T e^{−h_n^T w}/(1 + e^{−h_n^T w})² ≜ −R_w^{−1}        (33.42)

The gradient-ascent recursion for estimating w_MAP then takes the form:

  w^{(m)} = w^{(m−1)} + µ(m) ∇_{w^T} ln f_{w,γ_N}( w^{(m−1)}, γ_N; H_N ),   m ≥ 0        (33.43)


where µ(m) is a decaying step-size sequence. Once the recursion approaches w_MAP, we approximate the desired conditional pdf by

  f_{w|γ_N}(w|γ_N; H_N) ≈ N_w(w_MAP, R_w)        (33.44)

We can now use this result to approximate the predictive distribution of γ given a new feature h. Indeed, introduce the auxiliary scalar variable:

  x ≜ γ h^T w        (33.45)

Conditioned on γ, and in view of the assumed Gaussian distribution (33.44) for w given all data, the variable x is also Gaussian-distributed with mean and variance given by:

  f_{x|γ_N}(x|γ_N; H_N) ≈ N_x( γ h^T w_MAP, h^T R_w h )        (33.46)

The predictive distribution is given by

  P(γ = γ | γ_N = γ_N; h, H_N) = ∫_{−∞}^{∞} f_{γ,x|γ_N}(γ, x|γ_N; h, H_N) dx
                               = ∫_{−∞}^{∞} f_{γ|x}(γ; h) f_{x|γ_N}(x|γ_N; H_N) dx
                               = ∫_{−∞}^{∞} (1/(1 + e^{−x})) N_x( γ h^T w_MAP, h^T R_w h ) dx
                               = ∫_{−∞}^{∞} (1/(1 + e^{−x})) (1/(2π h^T R_w h)^{1/2}) exp{ −(1/(2 h^T R_w h)) (x − γ h^T w_MAP)² } dx        (33.47)

The last integral is difficult to evaluate in closed form. However, it can be approximated by using two properties of Gaussian distributions. The first property is the following approximation for the sigmoid function from (4.13):

  1/(1 + e^{−x}) ≈ Φ(ax),   a² = π/8        (33.48)

The second property uses the cdf (33.20) of the standard Gaussian distribution and result (4.7):

  ∫_{−∞}^{∞} Φ(y) (1/√(2πσ_y²)) exp{ −(1/(2σ_y²)) (y − µ)² } dy = Φ( µ / √(1 + σ_y²) )        (33.49)

Using these two properties in (33.47) we conclude that

  P(γ = γ | γ_N = γ_N; h, H_N) ≈ Φ( a γ h^T w_MAP / √(1 + a² h^T R_w h) )        (33.50)

In summary, using the approximation (33.48) again, we arrive at the following approximation for the predictive distribution:

  P(γ = γ | γ_N = γ_N; h, H_N) ≈ ( 1 + exp{ − γ h^T w_MAP / √(1 + π h^T R_w h / 8) } )^{−1}        (33.51)

We illustrate this construction by means of a simulation. A total of N = 400 feature vectors {h_n ∈ IR²} are generated randomly from a Gaussian distribution with zero mean and unit variance. The weight vector w ∈ IR² is also generated randomly from a Gaussian distribution with zero mean and variance σ_w² = 2. For each h_n, we assign its label to γ(n) = +1 whenever the following condition is met:

  1/(1 + e^{−h_n^T w}) ≥ 0.5  =⇒  γ(n) = +1        (33.52)

Otherwise, the label is assigned to −1. The 400 pairs {h_n, γ(n)} generated in this manner are used for training (i.e., for learning the conditional and predictive distributions using the Laplace method). The samples are shown in the top row of Fig. 33.3. We also generate separately an additional 50 sample pairs {h_n, γ(n)} to be used during testing. These samples are shown in the bottom row.

Figure 33.3 (Top row) A total of N = 400 pairs {h_n, γ(n)} used for training in order to learn the predictive distribution. (Bottom row) The plot on the left shows the 50 pairs {h_n, γ(n)} used for testing. The plot on the right shows the labels that are assigned by the test (33.51) to these points.

Running the gradient-ascent recursion (33.43) for 600 iterations and computing the resulting covariance matrix leads to

  w_MAP = col{1.2766, 1.1055},   R_w = [ 0.0242  0.0081 ; 0.0081  0.0210 ]        (33.53)

Using these values, we can predict the label for each of the test features h_n using construction (33.51). We assign h_n to class γ(n) = +1 if

  ( 1 + exp{ − h_n^T w_MAP / √(1 + π h_n^T R_w h_n / 8) } )^{−1} ≥ 1/2  =⇒  γ(n) = +1        (33.54)


The result is shown in the bottom row of the figure with 12 errors (21% error rate). In later chapters we will design more effective learning methods with lower error rates. One of the main inconveniences of the Laplace approach to learning posterior distributions is that it assumes knowledge of the underlying joint distribution, such as assuming in this example that the labels follow a logistic distribution and that the model w is Gaussian-distributed with known variance σ_w². This remark is also applicable to other methods discussed in this chapter. In later chapters, we will study alternative learning methods for solving prediction/classification problems that operate directly on the data {h_n, γ(n)} without imposing (or assuming) specific probabilistic models for the data distributions. It is instructive, for example, to compare the approach of this example to the results discussed later in Chapter 59 on logistic regression.
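For readers who wish to reproduce the mechanics of this example, the following is a minimal sketch of recursion (33.43) with gradient (33.41), Hessian (33.42), and the predictive test (33.51); the step-size schedule and iteration count are illustrative choices and are not taken from the text:

import numpy as np

def laplace_logit(H, gamma, sigma_w2, mu=0.05, iters=2000):
    # gradient-ascent recursion (33.43); note (1-e^{-x})/(1+e^{-x}) = tanh(x/2)
    N, M = H.shape
    w = np.zeros(M)
    for m in range(iters):
        grad = -w / sigma_w2 + 0.5 * H.T @ (gamma - np.tanh((H @ w) / 2))
        w = w + (mu / (m + 1)) * grad                      # assumed step-size schedule
    s = 1.0 / (1.0 + np.exp(-(H @ w)))                     # sigmoids at w_MAP
    hessian = -(np.eye(M) / sigma_w2 + (H * (s * (1 - s))[:, None]).T @ H)   # (33.42)
    return w, np.linalg.inv(-hessian)                      # w_MAP and R_w

def predict_prob(h, w_map, R_w):
    # approximate predictive probability of gamma = +1, from (33.51)
    scale = np.sqrt(1.0 + np.pi * (h @ R_w @ h) / 8.0)
    return 1.0 / (1.0 + np.exp(-(h @ w_map) / scale))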

33.3 MARKOV CHAIN MONTE CARLO METHOD

A popular and more powerful technique for approximating posterior distributions is the MCMC method, which is based on the idea of sampling. There is extensive theory on MCMC methods, their convergence, and performance guarantees. The purpose of this section is to present the key steps by focusing on one of the most used variants, namely, the Metropolis–Hastings construction.

33.3.1 Importance Sampling

Monte Carlo techniques are sampling techniques, which exploit to great effect the ability of computing machines to generate and manipulate samples or data. We motivate the methodology as follows. Let f_x(x) denote some known pdf and assume we are interested in evaluating one of its moments, say, its mean:

  x̄ ≜ ∫_{−∞}^{∞} x f_x(x) dx        (33.55)

Although the argument x can be scalar or vector-valued, we will carry out the presentation assuming a scalar x without loss in generality. The difficulty in evaluating (33.55) is that, in general, the integral may be intractable and cannot be evaluated in closed form. However, if we could sample J realizations {x_j} from the given distribution f_x(x), then we could approximate the calculation by using the sample mean:

  x̄ ≈ (1/J) Σ_{j=1}^{J} x_j,   x_j ∼ f_x(x)        (33.56)

Under an ergodicity assumption, and for a sufficiently large number of samples, it is known that the sample mean will approach the actual mean value. However, the main difficulty is that, in general, generating samples from an arbitrary


distribution is challenging. Importance sampling provides one elegant maneuver around this challenge. Let π_x(x) be a second pdf that the designer is free to choose. We refer to π_x(x) as the proposal or importance distribution. Usually, it is a distribution that can be easily sampled from, such as Gaussian or uniform distributions. One choice is

  π_x(x) = N_x(µ, σ²)        (33.57)

for some parameters (µ, σ²), whose choice affects the performance of the sampling technique and should be done judiciously; a vector Gaussian distribution would be selected for the case when x is a vector. Once π_x(x) is selected, we rewrite the desired integration (33.55) in the equivalent form:

  x̄ = ∫_{−∞}^{∞} x ( f_x(x)/π_x(x) ) π_x(x) dx        (33.58)

It is clear that we should choose π_x(x) such that it assumes nonzero (i.e., positive) values at locations where f_x(x) is nonzero to avoid division by zero:

  f_x(x) > 0 =⇒ π_x(x) > 0        (33.59)

A Gaussian choice for π_x(x) would satisfy this property. We also need to assume that the domain of π_x(x) is broad enough to cover the domain of f_x(x); this essentially translates into choosing a proper value for the mean and variance parameters (µ, σ²) in the Gaussian example above. The modified integral (33.58) can be interpreted as representing the mean of the quantity x f_x(x)/π_x(x) relative to the distribution defined by π_x(x). Therefore, a second way to approximate x̄ is as follows. We generate J realizations x_j from π_x(x) (as opposed to f_x(x)) and then use the sample mean calculation:

  x̄ ≈ (1/J) Σ_{j=1}^{J} x_j ( f_x(x_j)/π_x(x_j) ),   x_j ∼ π_x(x)        (33.60)

That is,

  x̄ ≈ x̂ ≜ (1/J) Σ_{j=1}^{J} w_j x_j,   w_j = f_x(x_j)/π_x(x_j),   x_j ∼ π_x(x)        (33.61)

where the scalars {w_j}, also known as importance weights, are used to scale the J individual samples {x_j}. The convergence of x̂ to x̄ = E x is guaranteed by the law of large numbers as J → ∞ – recall statements (3.224) and (3.227). In particular, from the strong version of the law:

  P( lim_{J→∞} x̂ = E x ) = 1        (33.62)


The samples {x_j} are called particles and they are useful for evaluating other integrals as well. For example, assume we are interested in evaluating an integral of the form:

  E_f h(x) ≜ ∫_{−∞}^{∞} h(x) f_x(x) dx        (33.63)

for some function h(x). In the previous example, we had h(x) = x. But other choices for h(x) are of course possible, such as higher powers of x to evaluate other moments of f_x(x), or other choices. The above integral can be interpreted as the mean of h(x) relative to the distribution defined by f_x(x). We again introduce a proposal distribution π_x(x) and use it to rewrite the integral in the form:

  E_f h(x) = ∫_{−∞}^{∞} h(x) ( f_x(x)/π_x(x) ) π_x(x) dx = E_π { h(x) f_x(x)/π_x(x) }        (33.64)

This relation shows that the desired expectation of h(x) relative to f_x(x) can be equivalently computed by determining the expectation of h(x) f_x(x)/π_x(x) relative to π_x(x). Therefore, using particles sampled according to π_x(x), we can use the approximation:

  E_f h(x) ≈ ĥ ≜ (1/J) Σ_{j=1}^{J} w_j h(x_j),   w_j = f_x(x_j)/π_x(x_j),   x_j ∼ π_x(x)        (33.65)

We summarize the procedure in (33.66), which amounts to a basic form of the Monte Carlo method. We will encounter one useful application of importance sampling methods in Example 46.10 when we study on-policy and off-policy techniques for reinforcement learning.

Basic importance sampling algorithm.
objective: evaluate integrals of the form (33.63);
given: h(x) and f_x(x);
choose proposal distribution, π_x(x).
repeat for j = 1, 2, . . . , J:
  generate sample x_j ∼ π_x(x)
  compute w_j = f_x(x_j)/π_x(x_j)
end
approximate the integral by using (33.65):
  ∫_{−∞}^{∞} h(x) f_x(x) dx ≈ (1/J) Σ_{j=1}^{J} w_j h(x_j)        (33.66)
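A minimal Python sketch of listing (33.66) follows; the beta(2,5) target and uniform proposal in the usage portion mirror the setting of Example 33.2 below, while the sample size and seed are arbitrary choices:

import numpy as np
from math import gamma as gamma_fn

def importance_sampling(h, f, proposal_sample, proposal_pdf, J=40_000, seed=0):
    # estimate E_f h(x) with samples drawn from the proposal, as in (33.65)-(33.66)
    rng = np.random.default_rng(seed)
    x = proposal_sample(rng, J)
    w = f(x) / proposal_pdf(x)            # importance weights w_j = f(x_j)/pi(x_j)
    return np.mean(w * h(x))

if __name__ == "__main__":
    a, b = 2, 5
    const = gamma_fn(a + b) / (gamma_fn(a) * gamma_fn(b))
    f_beta = lambda x: const * x**(a - 1) * (1 - x)**(b - 1)
    sample_unif = lambda rng, J: rng.uniform(0.0, 1.0, size=J)
    pdf_unif = lambda x: np.ones_like(x)
    m1 = importance_sampling(lambda x: x, f_beta, sample_unif, pdf_unif)
    m2 = importance_sampling(lambda x: x**2, f_beta, sample_unif, pdf_unif)
    print("estimated mean and variance:", m1, m2 - m1**2)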

In summary, given a target distribution fx (x) and a proposal or importance distribution πx (x), we generate J particles {xj } with their respective weights {wj } to “represent” the target pdf. The pairs {(xj , wj )} can then be used to


evaluate integral expressions involving f_x(x) as described in listing (33.66), for given functions h(x). Note that this method does not generate samples from the target distribution f_x(x). It only generates samples that can be used to evaluate certain integrals of interest.

Let us illustrate the importance sampling construction (33.66) by applying it to the core problem of this chapter, which is the evaluation of the posterior f_{z|y}(z|y) for the latent variable z given the observation y. This pdf is given by expression (33.28), namely,

  f_{z|y}(z|y) = f_{y|z}(y|z) f_z(z) / f_y(y)        (33.67)

where we already know that the main challenge is in evaluating the evidence that appears in the denominator, namely, f_y(y). However, we know that the evidence can be obtained by marginalizing the numerator over z, i.e., by computing

  f_y(y) = ∫_z f_{y|z}(y|z) f_z(z) dz = E_{f_z} { f_{y|z}(y|z) }        (33.68)

The problem is therefore reduced to one similar to (33.63), where z is replaced by x, f_{y|z}(y|z) plays the role of h(x), and f_z(z) plays the role of f_x(x). We arrive, for every value of y, at the approximation

  f_y(y) ≈ (1/J) Σ_{j=1}^{J} w_j f_{y|z}(y|z_j),   w_j = f_z(z_j)/π_z(z_j),   z_j ∼ π_z(z)        (33.69)

A second approach is described in Example 33.5, which allows us to determine the posterior f_{z|y}(z|y) by working directly with the joint pdf f_{y,z}(y, z).

Example 33.2 (Mean and variance of a beta distribution) We employ construction (33.66) to evaluate the mean and variance of a beta distribution with positive shape parameters a = 2 and b = 5, namely,

  f_x(x) = ( Γ(a + b)/(Γ(a)Γ(b)) ) x^{a−1} (1 − x)^{b−1},   x ∈ [0, 1]        (33.70)

We select π_x(x) to be the uniform distribution over [0, 1]. We generate J = 40,000 samples and apply (33.66) using h(x) = x and h(x) = x². We find

  E_f x ≈ 0.2841,   E_f x² ≈ 0.1064        (33.71)

We can estimate the variance from these two values by using

  E_f x² − (E_f x)² ≈ 0.1064 − (0.2841)² ≈ 0.0257        (33.72)

For comparison purposes, the actual mean and variance values for a beta distribution with parameters (a, b) are given by

  E_f x = a/(a + b) = 0.2857,   σ_x² = ab / ( (a + b)²(a + b + 1) ) = 0.0255        (33.73)


Example 33.3 (Estimating probabilities) The importance sampling technique can also be used to approximate probabilities of events. For example, assume we are interested in assessing

  P(α ≤ x ≤ β) = ∫_{α}^{β} f_x(x) dx        (33.74)

We introduce the indicator function h(x) = I[α ≤ x ≤ β]. This function evaluates to 1 when its argument is true and is 0 otherwise. Then,

  P(α ≤ x ≤ β) = ∫_x I[α ≤ x ≤ β] f_x(x) dx ≈ (1/J) Σ_{j=1}^{J} w_j I[α ≤ x_j ≤ β],   w_j = f_x(x_j)/π_x(x_j)        (33.75)

Example 33.4 (Performance of importance sampling) Let h̄ = E_f h(x) denote the quantity that we are estimating under x ∼ f_x(x) using the importance sampling method (33.66). We denote the estimate by

  ĥ = (1/J) Σ_{j=1}^{J} w_j h(x_j),   w_j = f_x(x_j)/π_x(x_j)        (33.76)

It is easy to see that the estimator ĥ is unbiased since

  E_π ĥ = (1/J) Σ_{j=1}^{J} E_π { w_j h(x_j) },   x_j ∼ π_x(x)
        = (1/J) Σ_{j=1}^{J} E_π { ( f_x(x_j)/π_x(x_j) ) h(x_j) }
        = (1/J) Σ_{j=1}^{J} ∫_x ( f_x(x)/π_x(x) ) h(x) π_x(x) dx
        = (1/J) Σ_{j=1}^{J} ∫_x h(x) f_x(x) dx        (33.77)

so that

  E_π ĥ = E_f h(x) = h̄   (unbiased estimator)        (33.78)

We know from the law of large numbers that ĥ approaches h̄ for large J. The quality of the estimator can be measured in terms of its error variance. Since the particles {x_j} are sampled independently of each other from π_x(x) we have:

  E (ĥ − h̄)² = (1/J²) E_π ( Σ_{j=1}^{J} [ w_j h(x_j) − h̄ ] )²,   x_j ∼ π_x(x)
             = (1/J²) Σ_{j=1}^{J} { E_π w_j² h²(x_j) − h̄² }
             = (1/J²) Σ_{j=1}^{J} { E_π ( f_x²(x_j)/π_x²(x_j) ) h²(x_j) − h̄² }        (33.79)


so that

  E (ĥ − h̄)² = (1/J) { ∫_x f_x²(x) h²(x) / π_x(x) dx − h̄² }   (decaying variance)        (33.80)

The rate of decay of the error variance will depend on the choice of π_x(x). In practice, the main criterion for selecting π_x(x) is to choose a distribution that is easy to sample from and to use sufficiently large sample sizes, J.

One useful special case of result (33.80) arises when we set h(x) = 1 so that h̄ = 1. In this case, the estimate ĥ in (33.76) reduces to the mean of the weights with variance given by

  var( (1/J) Σ_{j=1}^{J} w_j ) = (1/J) { ∫_x f_x²(x)/π_x(x) dx − 1 }        (33.81)

Note, in particular, that the right-hand side evaluates to zero when πx (x) = fx (x). This choice is not feasible because, after all, the main reason we are resorting to sampling is because we cannot sample that easily from fx (x). In practice, we would prefer importance distributions πx (x) that keep the above variance small. Since the ensemble mean of the weights is 1, the individual weights will assume values close to 1/J each and all particles will contribute to the inference process. On the other hand, if the variance is “large,” then some weights will be relatively larger than others and the contribution of some particles to the inference process will be limited. Example 33.5 (Normalization of weights) Continuing with Example 33.4, it is useful to normalize the weights to add up to 1 by rescaling them: wj ← PJ

wj

j 0 =1

(33.82)

wj 0

and therefore replace the estimate (33.76) by b h=

J X

!

wj PJ

j 0 =1

j=1

wj 0

h(xj )

(33.83)

Although this normalization introduces some bias, the size of the bias decreases with J and the construction leads to estimators with reduced error variance that continue ¯ – see Prob. 33.8. to converge to the true value h Proof (asymptotic bias goes to zero): We rewrite (33.83) in the form b h=

1 J

PJ

j=1

1 J

PJ

wj h(xj )

j 0 =1

wj 0

(33.84)

¯ as J → ∞: We know from the strong law of large numbers that the numerator tends to h J   1X ¯ =1 P lim wj h(xj ) = h J→∞ J j=1

(33.85)

33.3 Markov Chain Monte Carlo Method

1339

The quantity in the denominator has mean 1 since ( ! ) J J fx (xj ) 1 X 1X Eπ , xj ∼ πx (x) wj 0 = Eπ J 0 J j=1 πx (xj ) j =1

=

J ˆ 1X fx (x)dx J j=1 x

=1

(33.86)

It follows that the denominator also converges to 1 by the strong law of large numbers. It follows from part (c) of Prob. 3.46 (Slutsky theorem) that the ratio in (33.84) converges ¯ as desired. to h,  The normalization (33.82) is useful when the target distribution fx (x) is only known up to a scaling factor. Consider the following example. Assume we are given the joint distribution of two random variables, fy,z (y, z), and we are interested in computing the conditional mean of z given y, namely, E (z|y). For that purpose, we would need to evaluate the posterior, which according to the Bayes rule is given by: fz|y (z|y) =

fy,z (y, z) fy (y)

(33.87)

In principle, this expression suggests that in order to find fz|y (z|y) we need to determine first the evidence pdf, fy (y), which appears in the denominator. This is, however, unnecessary. We can work directly with fy,z (y, z) since it only differs from the target pdf fz|y (z|y) by a scaling factor. For such cases, the normalization step (33.82) is useful because the unknown scaling factor fy (y) will cancel out (it would appear in both the numerator and denominator of (33.82)). In other words, we can still apply the importance sampling procedure for these situations. We summarize the construction in (33.88). Importance sampling algorithm with weight normalization. objective: evaluate integrals of the form (33.63); given: h(x) and some g(x) ∝ fx (x); choose proposal distribution, πx (x). repeat for j = 1, 2, . . . , J: generate sample xj ∼ πx (x) compute wj = g(xj )/πx (xj ) end approximate the integral by using (33.65): ! ˆ ∞ J X wj h(x)fx (x)dx ≈ h(xj ) PJ 0 −∞ 0 j =1 wj j=1

33.3.2

(33.88)

Metropolis–Hastings Algorithm There is one important limitation with the importance sampling procedure (33.66). Specifically, the realizations {xj } are sampled independently of each other from

1340

Predictive Modeling

πx (x). This means that some realizations will be more relevant than others because not all samples will fall within regions where fx (x) assumes expressive values. This fact results in degraded performance and necessitates the generation of a large amount of samples. We will modify the procedure in order to select the sample xn in a manner that benefits from the sample right before it, which amounts to introducing a Markovian property into the sampling process. By doing so, we will be able to move the selection of samples toward more expressive (or more likely) regions in the domain of fx (x). Since the sampling will now follow a Markovian chain, we will end up with a MCMC method. We continue to assume that we are interested in evaluating integrals of the form (33.63), which is repeated here: ˆ ∞ ∆ E f h(x) = h(x)fx (x)dx (33.89) −∞

Now, however, we will devise a procedure that generates samples distributed directly according to fx (x). By doing so, we will be able to approximate the integral by using: E f h(x) ≈

J 1X h(xj ), J j=1

xj ∼ fx (x)

(33.90)

To generate the samples xj ∼ fx (x), we again select a proposal distribution πx (x) with one important modification. We center πx (x) at the previous sample, xj−1 , say; in the Gaussian case, we would select πx (x) in the following form: πx (x|xj−1 ) = Nx (xj−1 , σ 2 )

(33.91)

with mean at xj−1 and some variance σ 2 . We subsequently sample from this distribution: x0 ∼ πx (x|xj−1 )

(33.92)

which means that x0 is dependent solely on the most recent sample xj−1 ; it does not depend on any of the earlier samples in a manner that enforces a Markovian property. The value of x0 is not taken immediately to be the new sample xj ; instead, a test is performed to decide whether to accept or reject x0 as a valid sample. The purpose of the test is to ensure that x0 is moving toward more expressive regions of fx (x). If the test is affirmative, then we set xj ← x0 ; otherwise, we reject the sample and repeat the procedure. The test involves comparing a certain ratio A(x0 , xj−1 ), also called the acceptance ratio or the Hastings ratio, against the value of a uniformly distributed variable, u ∼ U[0, 1]. The resulting Metropolis–Hastings algorithm is listed in (33.93). One important feature of this algorithm is that, although its purpose is to generate samples according to any given distribution fx (x), it does not actually need to know fx (x) exactly. It is sufficient that it be given any function g(x) that is proportional to fx (x). The reason for this can be seen in the expression for the ratio A(x0 , xj−1 )

33.3 Markov Chain Monte Carlo Method

1341

below. Since g(x) ∝ fx (x), it is clear that the ratio g(x0 )/g(xj−1 ) agrees with the ratio f (x0 )/f (xn−1 ). The implementation of this sampling procedure usually involves a burn-in interval (i.e., an initial number of iterations) where only samples generated after this interval are retained. This is meant to provide the construction sufficient time for its samples to approach the desired distribution.

Metropolis–Hastings algorithm. objective: generate realizations from a pdf fx (x); given: g(x) ∝ fx (x); choose proposal distribution, πx (x|xj−1 ). sample initial condition: x0 ∼ πx (x|0). repeat for j = 1, 2, . . . , J: (33.93)

sample x0 ∼ πx (x|xj−1 )

calculate Hastings ratio A(x0 , xj−1 ) =

0

0

g(x ) πx (xj−1 |x ) g(xj−1 ) πx (x0 |xj−1 )

sample from the uniform distribution, u ∼ U[0, 1]  0 x, if A(x0 , xj−1 ) ≥ u set xj = xj−1 , otherwise end Observe that the candidate sample x0 is retained with probability equal to min{1, A(x0 , xj−1 )}. In the special case when the proposal distribution πx (x|xj−1 ) is symmetric relative to its arguments, namely, when πx (x0 |xj−1 ) = πx (xj−1 |x0 )

(33.94)

then the expression for the acceptance ratio simplifies to the Metropolis ratio: A(x0 , xj−1 ) =

g(x0 ) g(xj−1 )

(33.95)

In this case, the algorithm is referred to as simply the Metropolis algorithm. The symmetry condition (33.94) can be satisfied, for example, by selecting πx (x0 |xj−1 ) to be Gaussian with mean at xj−1 or by generating x0 as follows x0 = xj−1 + v

(33.96)

where v is independent of xj−1 and its distribution is symmetric around zero, say, v ∼ Nv (0, σv2 ).

1342

Predictive Modeling

Metropolis algorithm. objective: generate realizations from a pdf fx (x); given: g(x) ∝ fx (x); choose symmetric proposal distribution, πx (x|xj−1 ); sample initial condition: x0 ∼ πx (x|0). repeat for j = 1, 2, . . . , J: sample x0 ∼ πx (x|xj−1 )

calculate Metropolis ratio A(x0 , xj−1 ) =

(33.97) 0

g(x ) g(xj−1 )

sample from the uniform distribution, u ∼ U[0, 1]  0 x, if A(x0 , xj−1 ) ≥ u set xj = xj−1 , otherwise end One variant of the Metropolis algorithm, besides the symmetry condition (33.94), is to remove the sample u and to retain x0 with probability min{1, A(x0 , xj−1 )}:  r = min{1, A(x0 , xj−1 )}    0 x, with probability r   set xj = xj−1 , otherwise

(33.98)

Another special case of the Metropolis–Hastings algorithm is the Gibbs sampler, which we describe later in the concluding remarks of Chapter 66. Example 33.6 (Approximating a beta distribution) We apply the Metropolis algorithm (33.97) over J = 40,000 iterations and generate samples {xj } from a beta distribution with positive shape parameters a = 2 and b = 5, namely, fx (x) =

Γ(a + b) a−1 x (1 − x)b−1 , Γ(a)Γ(b)

x ∈ [0, 1]

(33.99)

We generate the successive samples according to (33.96) with v ∼ Nv (0, 0.01):  sample v ∼ Nv (0, 0.01)     set x0 = xj−1 + v     calculate A(x0 , xj−1 ) = fx (x0 )/fx (xj−1 ) sample from the uniform distribution, u ∼ U[0, 1]     0   x, if A(x0 , xj−1 ) ≥ u    set xj = xj−1 , otherwise

(33.100)

The result is shown in Fig. 33.4, where the smooth curve corresponds to the actual beta distribution.

33.3 Markov Chain Monte Carlo Method

1343

beta distribution

3 2.5 2

beta(2,5)

1.5 1 0.5 0 0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Figure 33.4 Histogram constructed from J = 40,000 iterations of the Metropolis

algorithm (33.97) for a beta distribution with parameters a = 2 and b = 5.

Example 33.7 (Evaluating posterior distributions) The main reason for computing the evidence in (33.69) is that we need it to evaluate the posterior f_{z|y}(z|y). Under the Metropolis algorithm (33.97), however, we can evaluate this conditional pdf without first computing the evidence, f_y(y). For this purpose, the pdf f_{z|y}(z|y) plays the role of f_x(x), the distribution from which we want to generate samples (so that a histogram can be constructed for it). We do not know f_{z|y}(z|y) exactly due to the missing normalization factor f_y(y). However, we know the product that appears in the numerator, namely, f_z(z)f_{y|z}(y|z), which is proportional to f_{z|y}(z|y). Therefore, this product plays the role of the function g(x). By using it in the Metropolis algorithm (33.97), we iterate J = 5000 times and generate samples for the desired conditional f_{z|y}(z|y):

  sample v ∼ N_v(0, σ_v²)
  set z′ = z_{j−1} + v
  calculate A(z′, z_{j−1}) = [f_z(z′) f_{y|z}(y|z′)] / [f_z(z_{j−1}) f_{y|z}(y|z_{j−1})]
  sample from the uniform distribution, u ∼ U[0, 1]
  set z_j = z′ if A(z′, z_{j−1}) ≥ u; otherwise set z_j = z_{j−1}
                                                                                     (33.101)

Let us apply this construction to the logistic model with Gaussian latent variable w from Example 33.1. We already applied the Laplace method to estimate the conditional pdf of w given the labels {γ(n)}, for n = 1, 2, . . . , N. This led to result (33.44), namely,

  f_{w|γ_N}(w|γ_N; H_N) ≈ N_w(w_MAP, R_w)                                            (33.102)

with mean w_MAP and covariance matrix R_w. We will evaluate this same posterior by using instead the Metropolis algorithm (33.97). In this simulation, and for simplicity, we assume M = 2 so that we are dealing with a two-dimensional vector w. We know from (33.25) that the desired posterior is proportional to

  f_{w|γ_N}(w|γ_N; H_N) ∝ f_{γ_N|w}(γ_N|w; H_N) f_w(w)                               (33.103)

where

  f_w(w) = N_w(0, σ_w² I_2)                                                          (33.104a)

  f_{γ_N|w}(γ_N|w; H_N) = ∏_{n=1}^{N} ( 1/(1 + e^{−h_n^T w}) )^{(1+γ(n))/2} × ( 1/(1 + e^{h_n^T w}) )^{(1−γ(n))/2}      (33.104b)


The term on the right-hand side of (33.103) therefore plays the role of g(x). The listing of the algorithm for this case takes the following form (a code sketch appears after the listing):

  sample v ∼ N_v(0, σ_v² I_2)
  let w′ = w_{j−1} + v
  calculate A(w′, w_{j−1}) = [f_{γ_N|w}(γ_N|w′; H_N) f_w(w′)] / [f_{γ_N|w}(γ_N|w_{j−1}; H_N) f_w(w_{j−1})]
  sample from the uniform distribution, u ∼ U[0, 1]
  set w_j = w′ if A(w′, w_{j−1}) ≥ u; otherwise set w_j = w_{j−1}
                                                                                     (33.105)
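The following is a minimal Python sketch of recursion (33.105) for sampling from the unnormalized posterior f_{γ_N|w}(γ_N|w; H_N) f_w(w). The synthetic data generation, the log-domain evaluation of the acceptance ratio (used to avoid numerical underflow of the likelihood product), and the burn-in choice are our own pragmatic choices and are not prescribed by the text.

import numpy as np

rng = np.random.default_rng(0)
N, M, sigma_w2, sigma_v2 = 400, 2, 1.0, 0.01

# hypothetical synthetic data: features h_n and labels gamma(n) in {+1, -1}
H = rng.standard_normal((N, M))
w_true = rng.standard_normal(M) * np.sqrt(sigma_w2)
gamma = np.where(rng.uniform(size=N) < 1.0 / (1.0 + np.exp(-H @ w_true)), 1.0, -1.0)

def log_g(w):
    """log of the unnormalized posterior: log-likelihood (33.104b) + log-prior (33.104a)."""
    loglik = -np.sum(np.logaddexp(0.0, -gamma * (H @ w)))   # sum_n log sigma(gamma_n h_n^T w)
    logprior = -0.5 * (w @ w) / sigma_w2                    # Gaussian prior N(0, sigma_w^2 I)
    return loglik + logprior

J = 40_000
W = np.empty((J, M))
w_prev = np.zeros(M)
for j in range(J):
    w_cand = w_prev + np.sqrt(sigma_v2) * rng.standard_normal(M)   # w' = w_{j-1} + v
    if np.log(rng.uniform()) < log_g(w_cand) - log_g(w_prev):      # accept when u < A
        w_prev = w_cand
    W[j] = w_prev

burn = J // 4                                   # discard an initial burn-in segment
print("posterior mean estimate:", W[burn:].mean(axis=0))
print("posterior covariance estimate:\n", np.cov(W[burn:].T))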

In the simulations, we have N = 400 observations {γ(n), h_n} and run the Metropolis algorithm for J = 40,000 iterations. We use σ_v² = 0.01. Figure 33.5 plots the (normalized) histogram distributions for the first and second components of the latent variable w. Using the samples generated by the Metropolis algorithm, we estimate the mean and covariance matrix of w as follows:

  w̄_Metropolis = col{1.2901, 1.1168},   R_{w,Metropolis} = [ 0.0275  0.0066 ; 0.0066  0.0237 ]          (33.106)

These values can be compared against the estimates (w̄, R_w) generated by the Laplace method in (33.53). Figure 33.5 also plots the Gaussian distributions for the components of w that result from the Laplace method, namely, N_{w1}(1.2766, 0.0242) and N_{w2}(1.1055, 0.0210).

Figure 33.5 Histogram distributions for the first and second components of the latent variable w constructed from J = 40,000 iterations of the Metropolis algorithm (33.97).

Example 33.8 (Justifying Metropolis–Hastings construction) We still need to justify that the samples {x_j} generated by the Metropolis–Hastings algorithm (33.93) arise from the desired distribution f_x(x). To simplify the presentation, we assume first that all variables are discrete and replace pdfs by probability mass functions (pmfs). Let P(x_j = x_j | x_{j−1} = x_{j−1}) denote the transition probability from a sample value x_{j−1} to a new value x_j. We know from the construction of the algorithm that, when x_j ≠ x_{j−1}:

  P(x_j = x_j | x_{j−1} = x_{j−1}) = π_x(x_j | x_{j−1}) × min{1, A(x_j, x_{j−1})}          (33.107)

which is the product of the probability of generating x_j starting from x_{j−1} and the probability of retaining x_j. Likewise, for the reverse transition probability we have

  P(x_{j−1} = x_{j−1} | x_j = x_j) = π_x(x_{j−1} | x_j) × min{1, A(x_{j−1}, x_j)}          (33.108)

It then follows that

  f_x(x_{j−1}) P(x_j = x_j | x_{j−1} = x_{j−1})
    = f_x(x_{j−1}) π_x(x_j | x_{j−1}) × min{ 1, [f_x(x_j) π_x(x_{j−1} | x_j)] / [f_x(x_{j−1}) π_x(x_j | x_{j−1})] }
    = min{ f_x(x_{j−1}) π_x(x_j | x_{j−1}), f_x(x_j) π_x(x_{j−1} | x_j) }                  (33.109)

and, similarly,

  f_x(x_j) P(x_{j−1} = x_{j−1} | x_j = x_j) = min{ f_x(x_j) π_x(x_{j−1} | x_j), f_x(x_{j−1}) π_x(x_j | x_{j−1}) }      (33.110)

Comparing (33.109) and (33.110) we arrive at the equality

  f_x(x_{j−1}) P(x_j = x_j | x_{j−1} = x_{j−1}) = f_x(x_j) P(x_{j−1} = x_{j−1} | x_j = x_j)          (33.111)

This equality is also obviously true when x_j = x_{j−1}. The result allows us to conclude that the samples {x_j} arise from the desired pdf f_x(x). Indeed, for this to be true we must check whether

  f_x(x_j) ≟ Σ_{x_{j−1}} f_x(x_{j−1}) P(x_j = x_j | x_{j−1} = x_{j−1})                     (33.112)

That is, we need to verify whether we can obtain f_x(x_j) by marginalizing the term on the right-hand side over x_{j−1}. This is the case since, in view of (33.111),

  Σ_{x_{j−1}} f_x(x_{j−1}) P(x_j = x_j | x_{j−1} = x_{j−1})
    = Σ_{x_{j−1}} f_x(x_j) P(x_{j−1} = x_{j−1} | x_j = x_j)
    = f_x(x_j) Σ_{x_{j−1}} P(x_{j−1} = x_{j−1} | x_j = x_j)
    = f_x(x_j)                                                                             (33.113)

where the last sum equals 1, as desired. In other words, we have shown that the distribution f_x(x) satisfies the following equality for successive states (x, x′):

  f_x(x′) = Σ_x f_x(x) P(x′ = x′ | x = x)                                                  (33.114)

with the same distribution f_x(x) appearing on both sides of the equality. In this case, we say that f_x(x) is the steady-state distribution for the Markov chain defined over the successive states {x_j}; we will be discussing Markov chains in greater detail in Chapter 38 – see, e.g., the comments at the end of the chapter and relation (38.124). More generally but briefly, for continuous variables and working with pdfs, the product on the right-hand side of (33.107) would correspond to the value of the transition kernel function that is associated with the Markov chain when x_j ≠ x_{j−1}. We denote it by K(x_j | x_{j−1}) = π_x(x_j | x_{j−1}) × min{1, A(x_j, x_{j−1})}. In this case, equality (33.111) is replaced by f_x(x_{j−1}) K(x_j | x_{j−1}) = f_x(x_j) K(x_{j−1} | x_j), from which it can be verified that f_x(x) satisfies the equality f_x(x′) = ∫ f_x(x) K(x′|x) dx, in a manner similar to (33.114), to arrive again at the desired conclusion.
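As a quick numerical sanity check of the detailed-balance equality (33.111) and the stationarity condition (33.112), the following Python snippet builds the Metropolis–Hastings transition matrix for a small discrete target and an arbitrary proposal. The particular target pmf and proposal values are made up purely for illustration.

import numpy as np

# a discrete target pmf f over K states and an arbitrary row-stochastic proposal
f = np.array([0.1, 0.2, 0.3, 0.4])
K = len(f)
rng = np.random.default_rng(1)
prop = rng.uniform(size=(K, K))
prop /= prop.sum(axis=1, keepdims=True)        # prop[i, j] = pi(j | i)

# Metropolis-Hastings transition probabilities (33.107); rejection mass on the diagonal
P = np.zeros((K, K))
for i in range(K):
    for j in range(K):
        if i != j:
            A = (f[j] * prop[j, i]) / (f[i] * prop[i, j])   # acceptance ratio in (33.93)
            P[i, j] = prop[i, j] * min(1.0, A)
    P[i, i] = 1.0 - P[i].sum()

# detailed balance (33.111): f(i) P(i -> j) = f(j) P(j -> i)
assert np.allclose(f[:, None] * P, (f[:, None] * P).T)
# stationarity (33.112): f P = f
assert np.allclose(f @ P, f)
print("detailed balance and stationarity verified")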


33.4 COMMENTARIES AND DISCUSSION

Logit and probit models. We will explain the origin of the logit and probit models in the comments of Chapter 59 when we discuss logistic regression solutions. Here, we provide a brief summary. First, the logistic and logit functions are inverses of each other in the following sense. For any x ∈ IR, the logistic function, denoted by σ(x), is defined as the transformation:

  σ(x) ≜ 1/(1 + e^{−x})        (logistic function)                                   (33.115)

while the logit function is defined by (we encountered the logit function earlier in the statement of Theorem 28.2):

  g(y) ≜ logit(y) = ln( y/(1 − y) ),   y ∈ (0, 1)        (logit function)            (33.116)

It then holds that

  y = σ(g(y))                                                                        (33.117)

so that the logistic function maps g(y) back to y. This observation provides one useful interpretation for the Bayesian logit model (33.19). If we make the identifications:

  y ← P(γ = γ | w = w; h),   x ← γ h^T w                                             (33.118)

then y = σ(x) and x = logit(y). The designation “logit” for the function (33.116) was introduced by Berkson (1944, 1951), whose work was motivated by the earlier introduction of the “probit” function by Gaddum (1933) and Bliss (1934a, b); the term “probit” was used in the references by Bliss (1934a, b). For any y ∈ (0, 1), the probit function is defined by (compare with (33.116)):

  g(y) ≜ probit(y) = Φ^{−1}(y)        (probit function)                              (33.119)

where Φ(x) denotes the cdf of the standard Gaussian distribution with zero mean and unit variance, i.e.,

  Φ(x) ≜ (1/√(2π)) ∫_{−∞}^{x} e^{−τ²/2} dτ                                           (33.120)

and Φ^{−1}(y) is the inverse transformation. Again, it holds that

  g(y) = probit(y) ⟺ y = Φ(g(y))                                                     (33.121)
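As a brief numerical illustration of the inverse relations (33.117) and (33.121), the short Python check below uses scipy.stats.norm for Φ and Φ^{−1}; it is only a sanity check and not part of the original discussion.

import numpy as np
from scipy.stats import norm

def sigmoid(x):            # logistic function (33.115)
    return 1.0 / (1.0 + np.exp(-x))

def logit(y):              # logit function (33.116)
    return np.log(y / (1.0 - y))

y = np.linspace(0.01, 0.99, 99)
assert np.allclose(sigmoid(logit(y)), y)          # y = sigma(logit(y)), cf. (33.117)
assert np.allclose(norm.cdf(norm.ppf(y)), y)      # y = Phi(probit(y)), cf. (33.121)
print("logit/logistic and probit/Phi are numerical inverses on (0, 1)")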

For more information on the history of the logistic function and the logit and probit functions, the reader may refer to Cramer (2003). Additional discussion on Bayesian inference appears, for example, in Box and Tiao (1973), Bolstad (2007), Winkler (2013), Bolstad and Curran (2017), and Bailer-Jones (2017).

Laplace method. The method is attributed to the French mathematician Pierre-Simon Laplace (1749–1827) and appeared in Laplace (1774) in the context of evaluating integral expressions of the form ∫_a^b e^{αf(x)} dx for large α. In Bayesian inference, we showed in the body of the chapter that the method approximates the posterior f_{z|y}(z|y)


by means of a Gaussian distribution and is suitable for continuous latent variables z of small dimensions since it involves inverting a Hessian matrix. Readers interested in more details can refer to Bishop (2007), Friston et al. (2007), Daunizeau, Friston, and Kiebel (2009), and Daunizeau (2011). We applied the method to a prediction problem involving logit models in Example 33.1. The derivation relied on the use of two properties for Gaussian distributions. The first property (33.49) relates to an integral expression involving the Gaussian pdf and its cdf – see, e.g., Abramowitz and Stegun (1965), Gradshteyn and Ryzhik (1994), Patel and Read (1996), and Owen (1980). The second property (33.48) relates to an approximation for the sigmoid function in terms of the cdf of a standard Gaussian distribution – see, e.g., Shah (1985), Johnson, Kotz, and Balakrishnan (1994), Waissi and Rossin (1996), and Bowling et al. (2009).

Markov chain Monte Carlo method. Monte Carlo methods are based on the idea of sampling from proposal distributions, as was the case with the importance sampling method (33.66). The performance of the method depends on the choice of the proposal distribution, π_x(x). Different choices lead to different error variances for the estimated quantity. For example, assume the method is used to estimate the mean h̄ = E h(x) of some function h(x) of the random variable x ∼ f_x(x). We denote the estimate by

  ĥ = (1/J) Σ_{j=1}^{J} w_j h(x_j)                                                   (33.122)

Then, the resulting error variance amounts to E (ĥ − h̄)². It was shown by Kahn and Marshall (1953) that the optimal choice for π_x(x) in order to minimize this error variance is given by (see Prob. 33.7 and also Owen (2013)):

  π_x^o(x) = |h(x)| f_x(x) / ∫_x |h(x)| f_x(x) dx                                    (33.123)

which depends on both h(x) and f_x(x). Unfortunately, this expression is not very helpful because the quantity in the denominator is similar to the integral we are trying to approximate in the first place. It is common practice to select π_x(x) more freely as some distribution from which sampling can be performed with ease. It is also useful to normalize the weights in the importance sampling method (33.66) to add up to 1:

  w_j ← w_j / Σ_{j′=1}^{J} w_{j′}                                                    (33.124)

It was shown by Casella and Robert (1998) that, although this normalization introduces some bias, its size decreases with J and the construction leads to estimators with reduced error variance. Still, the limitation exists that the particles {x_j} are sampled from π_x(x) independently of each other. This means that some particles will be more relevant than others because not all samples will fall within regions where f_x(x) assumes expressive values. The MCMC method addresses this difficulty by introducing a Markovian property into the sampling process. By doing so, we are able to move the selection of samples toward more expressive (or more likely) regions in the domain of f_x(x). There is extensive theory on MCMC methods, their convergence, and performance guarantees (cf. Examples 33.4, 33.5, and 33.8) – see, e.g., Chib and Greenberg (1995), Robert and Casella (2004), Kalos and Whitlock (2008), Liang, Liu, and Carroll (2010), Brooks et al. (2011), and Rubinstein and Kroese (2016). The purpose of the presentation in this chapter is to summarize the key steps by focusing on two of the most used variants, namely, the Metropolis and Metropolis–Hastings algorithms. The Metropolis algorithm was first proposed by Metropolis et al. (1953) and later extended by Hastings (1970). For further readings and applications in the context of engineering and statistics, readers may consult Geman and Geman (1984), Gelfand and Smith (1990), Chib and Greenberg (1995), Gamerman (1997), Newman and Barkema (1999), Robert and Casella (2004), Bolstad (2010), and Robert (2016).
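The following is a minimal Python sketch of importance sampling with the normalized weights (33.124), estimating E h(x) for a made-up target and proposal; the specific densities and the function h are illustrative choices, not taken from the text.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
J = 100_000

# target f_x: standard Gaussian; proposal pi_x: wider Gaussian N(0, 2^2); h(x) = x^2
target, proposal = norm(0, 1), norm(0, 2)
h = lambda x: x**2

x = proposal.rvs(size=J, random_state=rng)       # particles x_j ~ pi_x
w = target.pdf(x) / proposal.pdf(x)              # raw importance weights
w_norm = w / w.sum()                             # normalized weights, cf. (33.124)

h_hat = np.sum(w_norm * h(x))                    # self-normalized estimate of E h(x)
print(h_hat)                                     # should be close to E x^2 = 1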


PROBLEMS

33.1 One method to generate samples from a desired distribution is to transform one random variable to another. Consider a random variable x ∈ IR^M with pdf f_x(x). Transform it into the variable y = g(x), where g(x) is a one-to-one and differentiable function and y ∈ IR^M. We denote the inverse mapping for g(·) by x = g^{−1}(y); it allows us to recover x from y. Introduce the M × M Jacobian matrix J of the inverse function with individual entries

  [J]_{m,n} ≜ ∂g^{−1}(y_m)/∂y_n = ∂x_m/∂y_n

A well-known result in probability theory is that the pdf of y can be determined from knowledge of {f_x(x), J} as follows: f_y(y) = f_x(x) |det J|. Establish the result.

33.2 Assume y = Ax + b for some invertible matrix A so that g(x) = Ax + b and g^{−1}(y) = A^{−1}(y − b). Show that the distribution of y is given by

  f_y(y) = (1/|det A|) f_x( A^{−1}(y − b) )

How does the result simplify when A and b are scalars?

33.3 Refer to Prob. 33.1. Assume x is scalar-valued and uniformly distributed in the interval [0, 1]. How should x be transformed in order to generate a random variable y that is distributed according to an exponential distribution with parameter λ, i.e., y ∼ λe^{−λy} with λ > 0? That is, what should the function g(x) be? What is the corresponding inverse function g^{−1}(y)? Specifically, verify that

  g(x) = (1/λ) ln( 1/(1 − x) ),   g^{−1}(y) = 1 − e^{−λy}

33.4 The result of Prob. 33.1 suggests one way of generating random variables with any given cdf F(·). Let x be uniformly distributed within [0, 1] and define the transformation y = F^{−1}(x). Show that y has cdf equal to F(y).

33.5 Consider a collection of N observed vectors {h_n ∈ IR^M}, also called feature vectors. Each vector has an associated response variable γ(n) ∈ IR, which is assumed to be generated according to a regression model of the form γ(n) = h_n^T w + v(n), where v(n) is a white-noise Gaussian process with v(n) ∼ N(0, σ_v²) and w ∼ N_w(0, σ_w² I_M). In this problem, the variable w plays the role of the latent variable. The same realization for w is used in all measurements {γ(n), h_n}. Introduce the N × M matrix and N × 1 vector quantities:

  H_N = col{h_1^T, h_2^T, . . . , h_N^T},   γ_N = col{γ(1), γ(2), . . . , γ(N)}

(a) Let SNR = σ_w²/σ_v². Verify that the posterior of w is Gaussian, namely,

  f_{w|H_N,γ_N}(w|H_N, γ_N) ∼ N_w(w̄, R_w)

where

  R_w ≜ σ_v² ( (1/SNR) I_M + H_N^T H_N )^{−1},   w̄ = (1/σ_v²) R_w H_N^T γ_N

(b) What is the MAP estimator for w?
(c) For simplicity, we drop the subscripts from the notation used for pdfs. Given a new observation h, we would like to predict its response γ. Argue that

  f(γ, w|h, H_N, γ_N) = f(w|H_N, γ_N) f(γ|w, h)

and conclude that the predictive distribution for γ given the observations satisfies

  f(γ|h, H_N, γ_N) = ∫_{w∈W} N_w(w̄, R_w) × N_γ(h^T w, σ_v²) dw

where the notation N_a refers to a Gaussian distribution relative to the random variable a. Carry out the integration and show that the predictive distribution is Gaussian with

  f(γ|h, H_N, γ_N) ∼ N_γ( h^T w̄, σ_v² + h^T R_w h )

What are the MAP and conditional mean predictors for γ given h and the prior measurements?

33.6 Refer to the probit model (33.21). Introduce the Gaussian random variable z ∼ N_z(h^T w, 1) with mean h^T w and unit variance. Here, the value of w is a realization from the assumed model w ∼ N_w(0, σ_w² I_M). Show that the probit model (33.21) can be equivalently written in the form: γ = +1 if z > 0, otherwise γ = −1.

33.7 Use expression (33.80) for the error variance of the importance sampling estimator to verify that the optimal choice for the importance distribution is given by expression (33.123).

33.8 Refer to the normalized expression (33.83). Argue that the error variance for this estimator can be approximated by

  E_π (ĥ − h̄)² ≈ (1/J) ∫_x ( f_x²(x)/π_x(x) ) ( h(x) − h̄ )² dx

33.9 Consider the following generative model for a vector-valued observation y involving a mixture of Gaussian models:

  generate two random means µ_1 ∼ N_{µ_1}(µ̄_1, σ_1² I_M) and µ_2 ∼ N_{µ_2}(µ̄_2, σ_2² I_M)
  select an index k ∈ {1, 2} according to Bernoulli(p)
  generate the observation y ∼ N_y(µ_k, σ_{y,k}² I_M)

where the notation Bernoulli(p) means that P(k = 1) = p. Write down the Metropolis algorithm for estimating the joint conditional pdf of the latent variables (k, µ_1, µ_2) given the observation y.

33.10 Consider a collection of N measurements {γ(n)} arising from a Gaussian distribution with mean h_n^T w and variance σ², i.e., γ(n) ∼ N_{γ(n)}(h_n^T w, σ²). The inverse variance 1/σ² is chosen randomly from a gamma distribution with parameters (α, β), i.e., 1/σ² ∼ Γ(α, β). The vector w ∈ IR^M is chosen randomly from a Gaussian distribution with w ∼ N_w(0, σ_w² I_M). Write down the Laplace method to estimate the joint conditional pdf of the latent variables (w, τ = 1/σ²) given the N observations {γ(n)}.

REFERENCES

Abramowitz, M. and I. Stegun (1965), Handbook of Mathematical Functions, Dover Publications. Bailer-Jones, C. A. L. (2017), Practical Bayesian Inference: A Primer for Physical Scientists, Cambridge University Press. Berkson, J. (1944), “Application of the logistic function to bio-assay,” J. Amer. Statist. Assoc., vol. 39, no. 227, pp. 357–365.


Berkson, J. (1951), “Why I prefer logits to probits?,” Biometrics, vol. 7, pp. 327–339. Bishop, C. (2007), Pattern Recognition and Machine Learning, Springer. Bliss, C. I. (1934a), “The method of probits,” Science, vol. 79, pp. 38–39. Bliss, C. I. (1934b), “The method of probits,” Science, vol. 79, pp. 409–410. Bolstad, W. M. (2007), Bayesian Statistics, 2nd ed., Wiley. Bolstad, W. M. (2010), Understanding Computational Bayesian Statistics, Wiley. Bolstad, W. M. and J. M. Curran (2017), Introduction to Bayesian Statistics, 3rd ed., Wiley. Bowling, S. R., M. T. Khasawneh, S. Kaewkuekool, and B. R. Cho (2009), “A logistic approximation to the cumulative normal distribution,” J. Indust. Eng. Manag., vol. 2, no. 1, pp. 114–127. Box, G. E. P. and G. C. Tiao (1973), Bayesian Inference in Statistical Analysis, Addison-Wesley. Brooks, S., A. Gelman, G. Jones, and X-L. Meng, editors (2011), Handbook of Markov Chain Monte Carlo, Chapman & Hall. Casella, G. and C. P. Robert (1998), “Post-processing accept–reject samples: Recycling and rescaling, J. Comput. Graphic. Statist., vol. 7, no. 2, pp. 139–157. Chib, S. and E. Greenberg (1995), “Understanding the Metropolis–Hastings algorithm,” Amer. Statist.,vol. 49, no. 4, pp. 327–335. Cramer, J. S. (2003), Logit Models from Economics and Other Fields, Cambridge University Press. Daunizeau, J. (2011), “The variational Laplace approach to approximate Bayesian inference,” available at arXiv:1703.02089. Daunizeau, J., K. Friston, and S. J. Kiebel (2009), “Variational Bayesian identification and prediction of stochastic nonlinear dynamic causal models,” Phys. Nonlinear Phenom., vol. 238, pp. 2089–2118. Friston, K., J. Mattout, N. Trujillo-Barreto, J. Ashburner, and W. Penny (2007), “Variational free energy and the Laplace approximation,” NeuroImage, vol. 34, pp. 220–234. Gaddum, J. H. (1933), “Reports on biological standard III. Methods of biological assay depending on a quantal response,” Special Report Series of the Medical Research Council, no. 183, London. Gamerman, D. (1997), Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference, Chapman & Hall. Gelfand, A. E. and A. F. M. Smith (1990), “Sampling-based approaches to calculating marginal densities,” J. Amer. Statist. Assoc., vol. 85, pp. 398–409. Geman, S. and D. Geman (1984), “Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images,” IEEE Trans. Patt. Anal. Mach. Intell., vol. 6, pp. 721–741. Gradshteyn, I. S. and I. M. Ryzhik (1994), Table of Integrals, Series, and Products, Academic Press. Hastings, W. K. (1970), “Monte Carlo sampling methods using Markov chains and their applications,” Biometrika, vol. 57, pp. 97–109. Johnson, N. L., S. Kotz, and N. Balakrishnan (1994), Continuous Univariate Distributions, vol. 1, Wiley. Kahn, H. and A. W. Marshall (1953), “Methods of reducing sample size in Monte Carlo computations,” J. Oper. Res. Soc. Amer., vol. 1, no. 5, pp. 263–278. Kalos, M. H. and P. A. Whitlock (2008), Monte Carlo Methods, 2nd ed., Wiley. Laplace, P. S. (1774), “Mémoire sur la probabilité des causes par les événements,” Mém. Acad. R. Sci. de MI (Savants étrangers), vol. 4, pp. 621–656. See also Oeuvres Complètes de Laplace, vol. 8, pp. 27–65 published by L’Académie des Sciences, Paris, during the period 1878–1912. Translated by S. M. Sitgler, Statistical Science, vol. 1, no. 3, pp. 366–367. Liang, F., C. Liu, and R. Carroll (2010), Advanced Markov Chain Monte Carlo Methods: Learning from Past Samples, Wiley.


Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller (1953), “Equations of state calculations by fast computing machines,” J. Chem. Phys., vol. 21, pp. 1087–1091. Newman, M. E. J. and G. T. Barkema (1999), Monte Carlo Methods in Statistical Physics, Oxford University Press. Owen, B. A. (2013), Monte Carlo Theory, Methods and Examples, available at https://statweb.stanford.edu/owen/mc/ Owen, D. (1980), “A table of normal integrals,” Commun. Statist.: Sim. Comput., vol. B9, pp. 389–419. Patel, J. K. and C. B. Read (1996), Handbook of the Normal Distribution, 2nd ed., CRC Press. Robert, C. P. (2016), “The Metropolis–Hastings algorithm,” available at arXiv:1504.01896v3. Robert, C. P. and G. Casella (2004), Monte Carlo Statistical Methods, Springer. Rubinstein, R. Y. and D. P. Kroese (2016), Simulation and the Monte Carlo Method, 3rd ed., Wiley. Shah, A. K. (1985), “A simpler approximation for area under the standard normal curve,” Amer. Statist., vol. 39, no. 1, p. 80. Waissi, G. R. and D. F. Rossin (1996), “A sigmoid approximation of the standard normal integral,” App. Math. Comput., vol. 77, pp. 91–95. Winkler, R. L. (2013), An Introduction to Bayesian Inference, 2nd ed., Probabilistic Publishing.

34 Expectation Propagation

The Laplace method approximates the posterior distribution f_{z|y}(z|y) through a Gaussian probability density function (pdf) that is not always accurate. The Markov chain Monte Carlo (MCMC) method, on the other hand, relies on sampling from auxiliary (proposal) distributions and provides a powerful way to approximate posterior distributions, albeit through repeated simulations. In this chapter, we describe a third approach for approximating the posterior distribution, known as expectation propagation (EP). This method restricts the class of distributions from which the posterior is approximated to the Gaussian or exponential family and assumes a factored form for the posterior. The method can become analytically demanding, depending on the nature of the factors used for the posterior, because these factors can make the computation of certain moments unavailable in closed form. The EP method has been observed to lead to good performance in some applications, such as the Bayesian logit classification problem, but this behavior is not universal and performance can degrade for other problems, especially when the posterior distribution admits a mixture model.

34.1 FACTORED REPRESENTATION

Consider a latent random model z ∼ π(z), where π(z) denotes its assumed prior and z ∈ IR^M. Some data is collected into the vector:

  y = col{ y(1), y(2), . . . , y(N) } ∈ IR^N                                          (34.1)

The posterior of z is assumed to admit a factored representation of the form:

  f_{z|y}(z|y) = (1/Z) × π(z) × ∏_{n=1}^{N} f_n(y(n)|z)                               (34.2a)

in terms of individual likelihood functions, denoted by f_n(·), and for some normalizing factor Z given by

  Z = ∫_{z∈Z} π(z) ( ∏_{n=1}^{N} f_n(y(n)|z) ) dz                                     (34.2b)

whose purpose is to ensure that the posterior expression (34.2a) integrates to 1. Note that Z is solely a function of the observations since z is integrated out.
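To make (34.2a)–(34.2b) concrete, here is a small Python sketch that evaluates the evidence Z and the normalized posterior on a grid for a scalar toy model with a Gaussian prior and Gaussian likelihood sites; the particular prior, likelihoods, sample size, and grid are illustrative choices only and are not taken from the text.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# toy scalar model: prior pi(z) = N(0, 1); sites f_n(y(n)|z) = N(y(n); z, 0.5^2)
z_grid = np.linspace(-5.0, 5.0, 2001)
prior = norm.pdf(z_grid, 0.0, 1.0)
y = 1.0 + 0.5 * rng.standard_normal(10)          # N = 10 synthetic observations

unnorm = prior.copy()
for yn in y:                                      # product of sites in (34.2a)
    unnorm *= norm.pdf(yn, z_grid, 0.5)

Z = np.trapz(unnorm, z_grid)                      # evidence (34.2b) by quadrature
posterior = unnorm / Z                            # normalized posterior (34.2a)
print("evidence Z ~", Z, "; posterior integrates to", np.trapz(posterior, z_grid))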


Example 34.1 (Linear regression model) Assume the entries of y correspond to noisy measurements of a linear regression model such as: y(n) = xTn θ + v(n),

v(n) ∼ Nv (0, σv2 )

(34.3)

for some input vectors {xn ∈ IRM }, and where v(n) is a white-noise process independent of θ ∈ IRM . Assume further that θ follows a Gaussian prior with π(θ) = Nθ (0, Rθ ). Here, θ plays the role of the latent variable z. Then, applying the Bayes rule, the posterior distribution of θ given the observations y = col{y(1), y(2), . . . , y(N )} is: fθ (θ)fy|θ (y|θ)

fθ |y (θ|y) = =

(34.4) )

fy (y)

N oY n 1 1 1 1 1 p √ √ exp exp − θT Rθ−1 θ 2 Z (2π)M/2 det Rθ 2 2πσ v n=1

( −

1 (y(n) − xTn θ)2 2σv2

where Z denotes the evidence fy (y) and is equal to the integral of the product appearing in the numerator of the first line: ˆ ∆ Z = fy (y) = fθ (θ)fy|θ (y|θ)dθ (34.5) θ∈Θ

Expression (34.4) is of the form (34.2a) with the identifications: π(θ) = Nθ (0, Rθ ) n 1 o 1 exp − 2 (y(n) − xTn θ)2 fn (yn |θ) = √ 2 2σv 2πσv

(34.6a) (34.6b)

Example 34.2 (Mixture of two Gaussian distributions) Consider a second example, where a total of N observations y(n) are collected from a mixture model of the form: y(n) ∼ αNy (θ, σ12 ) + (1 − α)Ny (0, σ22 ),

α ∈ (0, 1), θ ∼ Nθ (0, σθ2 )

(34.7)

where the mean of the first Gaussian component arises from a Gaussian prior with variance σθ2 . This formulation assumes a scalar θ but can easily be extended to vector parameters. The parameter θ again plays the role of the latent variable z. The posterior distribution of θ given the observations y = col{y(1), . . . , y(N )} is given by N Y 1 × π(θ) × fy(n)|θ (y(n)|θ) fθ |y (θ|y) = Z n=1

=

(34.8)

N Y 1 × Nθ (0, σθ2 ) × αNy(n) (θ, σ12 ) + (1 − α)Ny(n) (0, σ22 ) Z n=1

!

where the normalization factor is the evidence (i.e., fy (y)): ˆ Z= θ∈Θ

Nθ (0, σθ2 )

N Y n=1

! αNy(n) (θ, σ12 )

+ (1 −

α)Ny(n) (0, σ22 )



(34.9)

Factored approximation In general, it is difficult to arrive at a closed-form expression for evidences Z = fy (y) and posterior distributions fz|y (z|y) represented in the form (34.2a)


from knowledge of the joint pdf fy,z (y, z). For this reason, we will describe an approximation algorithm. Returning to (34.2a), we will blend the prior into the product and simplify the notation for the likelihood fn (y(n)|z) by replacing it by fn (z) since it is a function of z (and the observation y(n) is implicit through the subscript n). We thus consider the following representation as our starting point:

fz|y (z|y) =

N 1 Y fn (z) Z n=0

(starting posterior model)

(34.10a)

where, by convention, f0 (z) = π(z), and ∆

Z =

ˆ

N Y

fn (z)dz

(starting evidence)

(34.10b)

z∈Z n=0

Note that while each fn (z) for n ≥ 1 depends on an observation y(n), the initial pdf f0 (z) is only dependent on z. We assume we know the forms of the individual likelihoods fn (z); these factors are called sites in the context of the EP formulation. We will denote the approximation for fz|y (z|y) by qz|y (z|y) and the approximation for Z by W . We will impose the following requirements in our quest for constructing {qz|y (z|y), W }: (a) First, we will assume that the approximation qz|y (z|y) has a similar structural form to the posterior fz|y (z|y), namely, that it factors into the product of individual functions (or sites) denoted by {qn (z)}: qz|y (z|y) =

N 1 Y qn (z) W n=0

(assumed posterior model)

(34.11)

We will then focus on determining the factors qn (z). Each of these factors is regarded as an approximation for the corresponding original site fn (z). (b) Second, and more importantly, each site qn (z) will be assumed to belong to some predefined class of distributions over z. One common class is the exponential family (which also includes Gaussian distributions); this class helps simplify some of the steps in the EP algorithm; other classes will generally lead to more challenging calculations. In the exponential case, each factor qn (z) will be of the form: n o qn (z) = h(z) exp λT n T (z) − a(λn )

(assumed site model) (34.12)

for some given functions {h(z), T (z), a(λ)}. Each site is parameterized by a vector λn ∈ IRQ and determining qn (z) amounts to finding its λn . For


example, for the special case of Gaussian distributions, each factor is modeled as qn (z) ∼ Nz (µn , Rn )

(34.13)

with mean parameter µn and covariance matrix Rn . Determining qn (z) in this case amounts to finding (µn , Rn ). One useful property of the class of exponential (or Gaussian) distributions is that it is closed under multiplication, meaning that the product of exponential (Gaussian) distributions is again exponential (Gaussian) under proper normalization. For example, for the exponential case (34.12), the approximate posterior will also be exponential and given by:

qz|y (z|y) =

n o 1 T exp λ T (z) − b(λ) W0

(34.14)

for some normalization coefficient W 0 where (see Prob. 34.1): ∆

λ =

N X

λn ,

n=0

ˆ 0

W = z∈Z

b(λ) =

N X

a(λn )

(34.15a)

n=0

n o exp λT T (z) − b(λ) dz

(34.15b)

Note that we are using W 0 to refer to the normalization factor in (34.14) rather than W , as was the case in (34.11). This is because expression (34.14) is not written in terms of the product of the individual factors {qn (z)} from (34.12).

Example 34.3 (A useful representation for Gaussian sites) There is a useful representation for Gaussian sites that is particularly helpful in the context of the EP algorithm. Assume qn (z) = Nz (µn , Rn ), which we already know is a special case of the exponential family (34.12). If we expand the exponent in the pdf expression we get: n 1 o 1 1 −1 √ qn (z) = p exp − (z − µn )T Rn (z − µn ) M 2 det Rn (2π) o n 1 o n 1 1 1 −1 −1 −1 √ = p exp − µTn Rn µn exp − z T Rn z + µTn Rn z 2 2 (2π)M det Rn n 1 o −1 −1 z + µTn Rn z (34.16) ∝ exp − z T Rn 2 where the last line provides a compact representation in terms of the defining parameters {µn , Rn }. This argument shows that the resulting global approximation for the posterior will have a similar form, namely, n 1 o qz|y (z|y) ∝ exp − z T R−1 z + µT R−1 z = Nz (µ, R) (34.17) 2 where N N X  X ∆ ∆ −1 −1 R−1 = Rn , µ = R × Rn µn (34.18) n=0

n=0


Inverse covariance matrices are called precision matrices. Therefore, the above expressions are in terms of precision matrices. The normalizing factor W 0 is given by ˆ ∞ n 1 o ∆ W0 = exp − z T R−1 z + µT R−1 z dz 2 −∞ n1 o √ (a) p = (2π)M det R exp µT R−1 µ (34.19) 2 where step (a) uses equality (4.34). It follows that o n 1 1 T −1 T −1 qz|y (z|y) = z R z + µ R z exp − W0 2 We can also write the equivalent expression: o n 1 1 1 √ exp − (z − µ)T R−1 (z − µ) qz|y (z|y) = p 2 (2π)M det R

(34.20)

(34.21)

This example shows that, when convenient, we can work with “unnormalized” sites qn (z), as in the last line (34.16), and then perform the normalization when necessary (as was done through (34.19)). Example 34.4 (Computing the evidence) We explained that the factor Z in (34.2a) amounts to the evidence, fy (y). It is approximated by the factor W in (34.11). We will describe in the following an algorithm for determining the factors {qn (z)} with their respective means and covariance matrices {µn , Rn } for Gaussian sites. These factors allow us to determine the posterior approximation qz|y (z|y), as already shown in (34.20). They also allow us to approximate the evidence as follows. Note first from the definition (34.11) that W × qz|y (z|y) =

N Y

qn (z)

(34.22)

n=0

or, in the log domain, N     X ln(W ) + ln qz|y (z|y) = ln qn (z)

(34.23)

n=0

We derived in the previous example expressions for both distributions qz|y (z|y) and qn (z), namely,   (34.20) 1 ln qz|y (z|y) = − ln(W 0 ) − z T R−1 z + µT R−1 z (34.24) 2 (34.16) M 1 1 1 −1 −1 −1 ln(qn (z)) = − ln(2π) − ln det Rn − µTn Rn µn − z T R n z + µTn Rn z 2 2 2 2 (34.25) Substituting into (34.23) gives

ln W = ln W 0 −

N N M 1 X T −1 1X (N + 1) ln(2π) − ln det Rn − µn R n µn 2 2 n=0 2 n=0

(34.26) 0

where W was computed earlier in (34.19).


34.2


GAUSSIAN SITES We are now ready to describe the EP algorithm, which is an iterative procedure for determining the site functions qn (z). The algorithm starts with the initialization qz|y (z|y) = π(z), which corresponds to the initial choices: W = 1 (initial normalization factor)

(34.27a)

q0 (z) = π(z), qn (z) = 1, for n = 1, 2, . . . , N

(34.27b)

and iterates to improve on qz|y (z|y). We will use the superscript (`) to refer to the successive iterations so that (0)

q0 (z) = π(z), qn(0) (z) = 1, W (0) = 1 (initial conditions)

(34.28a)

qn(`) (z) (`)

(34.28b)

W

(approximate nth site at iteration `)

(approximate normalization at iteration `)

(34.28c)

The approximate posterior at iteration ` would then be given by N 1 Y (`) q (z) W (`) n=0 n

(`)

qz|y (z|y) =

(34.29)

and the normalization factor would correspond to the computation ˆ W

(`)

N Y

=

z∈Z n=0

qn(`) (z)dz

(34.30)

After a sufficient number of iterations, say, L of them: (0)

(1)

(2)

(L)

qz|y (z|y) → qz|y (z|y) → qz|y (z|y) → . . . → qz|y (z|y) W

(0)

→ W

(1)

→ W

(2)

→ ... → W

(L)

(34.31a) (34.31b)

we will end up with the desired approximation for the posterior and evidence: (L)

qz|y (z|y) ← qz|y (z|y),

W ← W (L)

(34.32)

(`)

We proceed to explain how the individual factors qn (z) are determined at every iteration. To guide the explanation and help illustrate the construction, we will assume that these sites are limited to the Gaussian family. A similar construction will apply to the exponential family and is discussed in Section 34.3. (`−1)

Starting point. Thus, assume we have available the N + 1 factors {qn (z)}, represented by their means and covariance matrices:   (`−1) qn(`−1) (z) = Nz µ(`−1) , R , n = 0, 1, . . . , N (34.33) n n Using the result of Example 34.3 we readily conclude that the corresponding posterior approximation at this iteration is Gaussian with


  (`−1) qz|y (z|y) = Nz µ(`−1) , R(`−1)

(34.34a)

R(`−1)

(34.34b)



−1



=

N  X

n=0

(`−1) ∆

µ

Rn(`−1)

= R

−1

N  X

(`−1)

Rn(`−1)

n=0

−1

µn(`−1)

!

(34.34c)

At every iteration `, we cycle through the sites by selecting a random index n (`−1) (z) to uniformly without replacement and updating the corresponding site qn (`) qn (z), one factor at a time, by following four steps. This process is illustrated in Fig. 34.1. (`−1)

(z). This Step I (Construct the cavity function). Consider the nth factor qn factor serves as an approximation for fn (z). We want to improve the approx(`) (`−1) (z) by an updated factor qn (z). To do so, we first imation and replace qn (`−1) remove it from the global approximation qz|y (z|y). This is accomplished by division; the result is the cavity function denoted by the following notation with a subscript −n to indicate exclusion of the nth factor: (`−1)

∆ (`−1) qz,−n (z) =

qz|y (z|y) (`−1)

qn

(z)

(cavity function)

(34.35)

The cavity is not an actual distribution because normalization is lost when we extract one site. However, we explained in Example 34.3 that we can work with unnormalized “distributions” and carry out the normalization at a later stage. Given that we are restricting the selection of the sites to the Gaussian family for illustration purposes, it is immediate to see from the earlier result (4.56) dealing with the division of Gaussian distributions that the cavity function has the form: 1 (`−1) W−n

  (`−1) (`−1) (`−1) qz,−n (z) ∝ Nz µ−n , R−n

(34.36a)

with parameters 

(`−1)

R−n

−1

(`−1)

µ−n

−1  −1 R(`−1) − Rn(`−1) (34.36b) ( )  −1  −1 ∆ (`−1) = R−n R(`−1) µ(`−1) − Rn(`−1) µ(`−1) n ∆

=

(a)



(`−1)

= µ(`−1) + R−n



Rn(`−1)

−1 

µ(`−1) − µ(`−1) n



(34.36c)

where step (a) follows from a similar result to Prob. 1.11 and, moreover, from (4.64):

Figure 34.1 Repeated application of the four steps shown on the left ends up updating the posterior approximation from q_{z|y}^{(ℓ−1)}(z|y) at step ℓ − 1 to q_{z|y}^{(ℓ)}(z|y) at step ℓ through the construction of cavity and hybrid functions. The diagram on the right illustrates the evolution from posterior approximations to hybrid functions and the projection of the latter back onto the space of distributions. The process is repeated sufficiently many times until the posterior approximation approaches the desired distribution f_{z|y}(z|y).


(`−1)

W−n

=

q

(`−1)

(2π)M r 

det Rn

 × (`−1) (`−1) det Rn −R

(34.37)

 −1  T  1 µ(`−1) − µ(`−1) R(`−1) − Rn(`−1) exp − µ(`−1) − µ(`−1) n n 2



(`−1)

(z) Step II (Construct the hybrid distribution). Next, we replace the factor qn that was removed by the true factor fn (z) and transform the cavity function into a hybrid distribution involving approximate factors and one true factor. The result is denoted by (`−1)

1



gz|y (z|y) =

(`−1)

(`−1) W+n

× fn (z) × qz,−n (z)

(hybrid)

(34.38)

where we also added a normalization factor to ensure that the result is a pdf; its value is given by ˆ (`−1) ∆ (`−1) W+n = fn (z) qz,−n (z)dz (34.39a) z∈Z

We expect the hybrid construction to be “closer” to the desired posterior fz|y (z|y) because one of the sites has been replaced by its true value. In general, depending on the nature of fn (z), the hybrid function will be outside the Gaussian class (or whichever other class of distributions we are limiting the approximations to). We will therefore need to “project” the hybrid function back onto the desired class of distributions (which is done in Step III below). To carry out that step, we will need to compute certain moments for the hybrid distribution. For the Gaussian case, it is sufficient to evaluate its mean and covariance matrix defined as follows: (`−1) ∆

µ+n

z∈Z

(`−1) ∆

R+n

ˆ

= ˆ

= z∈Z

(`−1)

zgz|y (z|y)dz 

(`−1)

z − µ+n



(34.39b) (`−1)

z − µ+n

T

(`−1)

gz|y (z|y)dz

(34.39c)

We need to assume that the form of the hybrid function is such that these quantities can be evaluated in a tractable manner, as well as the normalization (`−1) factor W+n in (34.39a). These three quantities are needed in Step III – see expression (34.45). Evaluation of these quantities is one of the most challenging tasks in implementing the EP algorithm. In some cases, as shown in the example further ahead, the mean and covariance matrix can be computed in closed form. However, this is very much dependent on the nature of the function fn (z). In many other cases, the computation is not feasible analytically. In some occasions, we may appeal to auxiliary results to facilitate the computation.


Continuing with the case of Gaussian sites, and using the fact that the normalized cavity is Gaussian, we have ! ˆ 1 (`−1) (`−1) (`−1) q (z) dz W+n = W−n fn (z) (`−1) z,−n z∈Z W−n | {z } Gaussian

=

(`−1) W−n

E fn (z)

(34.40)

which amounts to computing the expectation of fn (z) relative to the normalized Gaussian cavity function. Likewise, if we substitute the hybrid by (34.38) in expression (34.39b) for the mean we get ! (`−1) ˆ W−n 1 (`−1) (`−1) zfn (z) µ+n = (`−1) q (z) dz (`−1) z,−n z W+n W−n | {z } Gaussian

=

(`−1) W−n E zfn (z) (`−1) W+n

(34.41)

which requires computing the mean of zfn (z) relative to the same normalized Gaussian cavity distribution. We can then appeal to the Stein lemma (4.67) to conclude that ( ) (`−1) W−n (`−1) (`−1) (`−1) (34.42) µ+n = (`−1) µ−n E fn (z) + R−n E ∇zT fn (z) W+n where the expectations on the right-hand side are relative to the normalized Gaussian cavity distribution. Similarly, one can derive an expression for the (`−1) covariance matrix R+n using the same Stein lemma (see Prob. 34.2): (`−1)

R+n

(`−1)

=

W−n

(`−1)

W+n

ˆ zz T fn (z) z∈Z

|

(`−1)

=

1

W−n

(`−1) q (z) (`−1) z,−n W−n

(`−1)

E zz T fn (z) − µ+n (`−1)

W+n

{z

Gaussian



(`−1)

µ+n

T

! }

(`−1)

dz − µ+n



(`−1)

µ+n

T (34.43)

Step III (Projection). We now need to bring the hybrid distribution back to the class of distributions from which the sites are being chosen. This step can be achieved by seeking an approximation that has the smallest Kullback–Leibler (KL) divergence to the hybrid:   (`−1) (`−1) gbz|y (z|y) = argmin DKL gz|y (z|y) k gz|y (z|y) g∈G

(34.44)


where the symbol G refers to the class of distributions we are working with for the approximations, e.g., the exponential class or the Gaussian class. We continue with the Gaussian case. Thus, we are looking for a Gaussian pdf gz|y (z|y) that is (`−1)

closest to the hybrid function gz|y (z|y) in the KL divergence sense. We know from the earlier result (6.86)–(6.91) that the solution is obtained by matching the mean and covariance matrix of gz|y (z|y) to those of the hybrid function. That is, the optimal solution is given by   (`−1) (`−1) (`−1) gbz|y (z|y) = Nz µ+n , R+n (34.45) (`−1)

Step IV (Update). Substituting the “projection” gbz|y (z|y) into the left-hand side of (34.38), the calculation suggests that we can improve the estimate for fn (z) by using the updated site: (`−1)

(`−1)

qn(`) (z) ∝ W+n

gbz|y (z|y)

(34.46)

(`−1)

qz,−n (z)

(`−1)

(z) and ended up with a replacement for it Recall that we started with qn (`) denoted by qn (z) where the superscript has changed from ` − 1 to `. Let us examine more closely the nature of the update (34.46) when the approximations are limited to Gaussian distributions. The analysis will reveal that this step can be a source of instability for the EP algorithm. Using the normalized version of the cavity function we can write   (`−1) (`−1) Nz µ+n , R+n   qn(`) (z) ∝ (34.47) (`−1) (`−1) Nz µ−n , R−n We appeal to expression (4.56) for the ratio of two Gaussian distributions to conclude that n 1 T  o qn(`) (z) ∝ exp − z − µn(`) Rn(`) z − µ(`) (34.48) n 2 in terms of the parameters:  −1  −1  −1 ∆ (`−1) (`−1) − R−n Rn(`) = R+n ∆

µ(`) = Rn(`) n (a)

(`−1)

= µ+n



(`−1)

R+n

−1

(`−1)

µ+n



(34.49a) 

(`−1)

R−n

−1

(`−1)

µ−n

 −1   (`−1) (`−1) (`−1) + Rn(`) R−n µ+n − µ−n

! (34.49b)

where step (a) follows from a result similar to Prob. 1.11. Observe that, due to (`) the division of distributions in (34.47), the precision matrix (Rn )−1 is computed as the difference of two other precision matrices; it is not difficult to envision


situations (for numerical reasons or otherwise) where this difference loses its (`) positive-definiteness and Rn ceases to be a true covariance matrix. In summary, the normalized version of the updated site is given by   (`) qn(`) (z) = Nz µ(`) , R n n

(34.50)

and the procedure now continues in a similar manner and updates all other sites before moving on to iteration ` + 1. The resulting EP algorithm for Gaussian sites is listed in (34.54). We can approximate the evidence Z = fy (y) based on the outcome of the `th iteration by using, for example, expressions (34.19) and (34.26) in the Gaussian case to compute W (`) :   1   −1 1  ∆ M ln(2π) + ln det R(`) + (µ(`) )T R(`) ln W 0(`) = µ(`) 2 2 2 (34.51a) and N      M 1X  (N + 1) ln(2π) − ln det Rn(`) ln W (`) = ln W 0(`) − 2 2 n=0



N 1 X (`) T  (`) −1 (`) (µ ) Rn µn 2 n=0 n

(34.51b)

We observe from the listing that the most challenging step, which is left unspecified, is evaluation of the three moments: o n (`−1) (`−1) (`−1) (34.52) W+n , µ+n , R+n

from expressions (34.40), (34.41), and (34.43). We repeat the expressions here for ease of reference: (`−1)

W+n

(`−1)

= W−n

(`−1)

=

(`−1)

=

µ+n R+n

E fn (z)

(`−1) W−n E zfn (z) (`−1) W+n (`−1) W−n E zz T fn (z) (`−1) W+n

(34.53a) (34.53b) (`−1)

− µ+n

 T (`−1) µ+n

(34.53c)

All expectations are relative to the Gaussian distribution of the cavity function, (`−1) (`−1) namely, Nz (µ−n , R−n ).


Expectation-propagation algorithm using Gaussian sites. input: prior π(z) and N distributions {fn (z)}. (0) (0) initialization: q0 (z) = π(z), qn (z) = 1 for n = 1, 2, . . . , N . (0) (0) (0) (0) µ0 = z¯, R0 = Rz , µn = 0, Rn → ∞IM repeat until sufficient convergence for ` = 1, 2, . . .: (parameters for posterior approximation) N   −1 X −1 R(`−1) = Rn(`−1) (`−1)

µ

=R

n=0 N  X

(`−1)

n=0

Rn(`−1)

−1

µn(`−1)

repeat for each site n = 0, 1, . . . , N : (parameters for cavity function)  −1  −1  −1 (`−1) (`−1) R−n = R(`−1) − Rn (`−1)

µ−n

(`−1)

= R−n

n −1  −1 o (`−1) (`−1) R(`−1) µ(`−1) − Rn µn (`−1)

use expression (34.37) to find W−n

.

(parameters for hybrid distribution) n o (`−1) (`−1) (`−1) evaluate the three moments W+n , µ+n , R+n using expressions (34.53a)–(34.53c).

(update parameters for nth site and keep other parameters unchanged) −1  −1  −1  (`) (`−1) (`−1) − R−n Rn = R+n o −1  −1 n (`−1) (`) (`) (`−1) (`−1) (`−1) µ−n µ+n − R−n µn = Rn R+n n o n o (`) (`) (`−1) (`−1) µm , Rm = µm , Rm , m 6= n

end end

  return posterior approximation qz|y (z|y) = Nz µ(`) , R(`) ;

return evidence approximation W (`) using (34.51a)–(34.51b). (34.54) We assume the computation of the expectations in (34.53a)–(34.53c) is tractable, which will very much depend on the nature of the likelihood function fn (z). In


some cases, these moments may not be available in closed form, in which case one may resort to sampling or MCMC techniques to evaluate them (but, of course, one could have applied the MCMC technique directly to the evaluation of the original posterior distribution, fz|y (z|y)). We will illustrate in the following how the moments can be computed in closed form for two cases: in one example, each fn (z) is modeled as a mixture of two Gaussian distributions and in a second example, each fn (z) is defined in terms of the cumulative distribution of a Gaussian (which arises in the probit formulation of the binary classification problem). Example 34.5 (Mixture of two Gaussian distributions) We illustrate the steps of the EP algorithm by considering an example with scalar variables for ease of exposition. Thus, consider a collection of N observations {y(n)} arising from a Gaussian mixture distribution of the form (here, the parameter θ plays the role of the latent variable z; its prior is Gaussian with mean θ¯ and variance σθ2 ): 1 2 1 ¯ σθ2 ) (prior) (34.55) e− 2 θ = Nθ (θ, π(θ) = p 2 2πσθ ( ) ( ) − 12 (y(n)−θ)2 − 12 y 2 (n) 1 1 fn (θ) = (1 − α) p e 2σ1 + α p e 2σ2 2 2 2πσ1 2πσ2

= (1 − α)Ny(n) (θ, σ12 ) + αNy(n) (0, σ22 ), α ∈ (0, 1)

(34.56)

Note that θ influences the mean of the first Gaussian component in fn (θ). We collect the observations into a column vector n o y = col y(1), y(2), . . . , y(N ) (34.57) The posterior is given by fθ |y (θ|y) =

N Y 1 × π(θ) × fn (θ) Z n=1

(34.58)

for some normalization factor Z. Our objective is to approximate it by a product of Gaussian sites of the form N 1 Y qn (θ), qθ |y (θ|y) = W n=0

qn (θ) ∼ Nθ (µn , rn )

(34.59)

where each qn (θ) is selected from the Gaussian family of distributions with some mean µn and variance rn to be determined. We construct the sites iteratively using the EP algorithm, which involves several steps. Initialization at ` = 0: (0) ¯ σθ2 ), q0 (θ) = Nθ (θ,

qn(0) (θ) = 1, n = 1, 2, . . . , N

¯ r(0) = σθ2 , µ(0) = θ, n = 0, 0 ¯ σθ2 ) q (θ|y) = Nθ (θ, θ |y

(0) µ0 (0)

rn(0)

=∞

(34.60) (34.61) (34.62)

(`−1)

We proceed iteratively for ` ≥ 1. Assume we know the factors {qn (θ)} from iteration (`−1) (`−1) ` − 1 for all n = 0, 1, 2 . . . , N , along with their means and variances, {µn , rn }.


The global mean and variance for the {µ(`−1) , r(`−1) } for the posterior approximation at iteration (` − 1) are given by: N X

1/r(`−1) =

1/rn(`−1)

(34.63)

n=0 N X

µ(`−1) = r(`−1)

! µ(`−1) /rn(`−1) n

(34.64)

n=0

It can be easily verified from the recursive construction that we do not need to update (`−1) (`) (`) the moments for the zeroth site q0 (θ); they will stay fixed at µ0 = θ¯ and r0 = σθ2 for all ` ≥ 1. We therefore focus on updating the sites fn (θ) for n ≥ 1. Step I (Construct the cavity function). We extract the nth factor and introduce the cavity function: (`−1)

(θ|y) q θ |y q (θ) = (`−1) θ ,−n qn (θ) (`−1)



(34.65)

We know from the discussion prior to the example that this cavity function will be an (`−1) (`−1) (unnormalized) Gaussian with mean and variance {µ−n , r−n }, namely, 1 (`−1)

W−n

q

(`−1)

θ ,−n

  (`−1) (`−1) (θ) = Nθ µ−n , r−n

(`−1)

1/r−n

(`−1)

µ−n

= 1/r(`−1) − 1/rn(`−1) (`−1)   r (`−1) (`−1) = µ(`−1) + −n µ − µ n (`−1) rn

(34.66a) (34.66b) (34.66c)

with normalization factor (`−1) W−n

=

2π (`−1)

rn

− r(`−1)

!1/2

( rn(`−1)

exp

2 ) (`−1) µ(`−1) − µn  −  (`−1) 2 r(`−1) − rn 

(34.67)

Step II (Construct the hybrid distribution). Next, we introduce the hybrid distribution: 1 ∆ (`−1) (`−1) g (θ|y) = × fn (θ) × q (θ) (`−1) θ |y θ ,−n W+n

(34.68)

We need to compute three quantities associated with this distribution to enable Step III, namely, ˆ (`−1) (`−1) W+n = fn (θ) q (θ)dθ (normalization) (34.69a) θ ,−n θ ˆ (`−1) (`−1) µ+n = θg (θ|y)dθ (mean) (34.69b) θ |y θ ˆ  2 (`−1) (`−1) (`−1) r+n = θ − µ+n g (θ|y)dθ (variance) θ |y θ ˆ  2 (`−1) (`−1) = θ2 g (θ|y)dθ − µ+n (34.69c) θ |y θ


This is the most demanding step within the EP algorithm. For the normalization factor we have (using the normalized cavity function): (`−1)

W+n

(`−1)

= W−n

×

ˆ n θ

(1 − α)Ny(n) (θ, σ12 ) + αNy(n) (0, σ22 ) ×

(`−1) α)W−n

ˆθ (`−1)

= (1 − α)W−n

θ

Ny(n) (θ, σ12 )Nθ



(`−1) (`−1) µ−n , r−n



dθ +

(34.70) ! q

(`−1)

(θ) dθ θ ,−n {z }

(`−1) W−n

|

ˆ = (1 −

1

o

Gaussian

(`−1) αW−n Ny(n) (0, σ22 )

  (`−1) (`−1) (`−1) dθ + αW−n Ny(n) (0, σ22 ) Nθ (y(n), σ12 )Nθ µ−n , r−n

where in the last step we are treating Ny(n) (θ, σ12 ) as a Gaussian expression over θ with mean y(n). We still need to evaluate the integral of the product of the two Gaussian distributions. To do so, we appeal to the result of Lemma 4.1 to conclude that

( (`−1)

(`−1)

= W−n

W+n

×

)   (`−1) (`−1) + αNy(n) (0, σ22 ) (1 − α)Ny(n) µ−n , σ12 + r−n (34.71)

We can follow a similar argument for the mean, except that the hybrid function is not Gaussian. Nevertheless, we can replace it by its expression (34.68) and use the normalized cavity function to get: (`−1)

µ+n

(`−1)

=

W−n

(`−1)

W+n

(`−1)

=

W−n

(`−1)

W+n

ˆ θ

θ × fn (θ) ×

α = (1 −

(`−1)

q (θ) dθ (`−1) θ ,−n W−n | {z } Gaussian

ˆ

n

θ

o   (`−1) (`−1) θ × (1 − α)Ny(n) (θ, σ12 ) + αNy(n) (0, σ22 ) × Nθ µ−n , r−n dθ

(`−1)

= (1 − α)

!

1

W−n

(`−1) W+n (`−1) W−n (`−1) W+n

(`−1) W α) −n (`−1) W+n (`−1) W α −n (`−1) W+n

ˆ θ

  (`−1) (`−1) θ × Ny(n) (θ, σ12 ) × Nθ µ−n , r−n dθ + ˆ

× Ny(n) (0, σ22 ) × ˆ θ

θ

  (`−1) (`−1) θ × Nθ µ−n , r−n dθ

  (`−1) (`−1) θ × Nθ (y(n), σ12 ) × Nθ µ−n , r−n dθ + (`−1)

× Ny(n) (0, σ22 ) × µ−n

(34.72)

We appeal again to the result of Lemma 4.1 to express the product of two Gaussian distributions as another Gaussian so that


  (`−1) (`−1) Nθ (y(n), σ12 ) × Nθ µ−n , r−n = X × Nθ (¯ c, σc2 ) ∆

(34.73a)

(`−1) 1/r−n

1/σc2 = 1/σ12 +   ∆ (`−1) (`−1) c¯ = σc2 y(n)/σ12 + µ−n /r−n ∆

(34.73b) (34.73c)

(`−1)

σa2 = σ12 + r−n (34.73d) n 1  2 o   1 ∆ (`−1) (`−1) (`−1) 2 X = √ exp − y(n) − µ = N µ , σ + r y(n) 1 −n −n −n 2σa2 2πσa2 (34.73e) It follows that (`−1)

(`−1)

µ+n

=

W−n

(

(`−1)

W+n

  (`−1) (`−1) (`−1) + α µ−n Ny(n) (0, σ22 ) (1 − α) c¯ Ny(n) µ−n , σ12 + r−n

)

(34.74) which upon simplification leads to (see Prob. 34.4): (`−1)

(`−1)

µ+n

(`−1)

= µ−n

+ λ(n)

r−n σ12

+



(`−1) r−n

(`−1)

y(n) − µ−n



(34.75)

where we introduced (`−1)

( ∆

λ(n) = 1 −

α×

W−n

(`−1)

W+n

) ×

Ny(n) (0, σ22 )

(34.76)

We can repeat the argument for the variance and find that ( (`−1)   W−n (`−1) (`−1) (`−1) + r+n = (1 − α) (¯ c2 + σc2 ) Ny(n) µ−n , σ12 + r−n (`−1) W+n )    2 (`−1) (`−1) (`−1) α Ny(n) (0, σ22 ) (µ−n )2 + r−n − µ+n

(34.77)

which, upon using (34.71) to eliminate the Gaussian distribution in the first line, we can rewrite as (see also Prob. 34.5): (`−1)

r+n

(`−1)

= r−n

    (`−1) (`−1) (`−1) (`−1) + (µ−n )2 − (µ+n )2 − λ(n) (µ−n )2 + r−n − c¯2 − σc2 (34.78) (`−1)

Step III (Projection). We now approximate the hybrid distribution g (θ|y) by a θ |y Gaussian distribution that is closest in the KL divergence sense. The result is   (`−1) (`−1) (`−1) gb (θ|y) = Nθ µ+n , r+n (34.79) θ |y (`−1)

Step IV (Update). We update qn

(θ) to

  (`) qn(`) (θ) = Nθ µ(`) n , rn

(34.80)


where the mean and variance parameters are given by (`−1)

(`−1)

1/rn(`) = 1/r+n − 1/r−n   (`−1) (`−1) (`−1) (`−1) (`) µn = rn(`) µ+n /r+n − µ−n /r−n

(34.81) (34.82)

and the procedure repeats. We simulate the operation of the algorithm by considering a mixture model with parameters θ¯ = 3, σθ2 = 1/2, σ12 = 1, σ22 = 1/2, α = 0.2 (34.83) ¯ σθ2 ), giving We select θ randomly according to the Gaussian distribution Nθ (θ, θ = 2.3061

(34.84)

We subsequently generate N = 100 observations {y(n)} and run L = 25 iterations of the EP algorithm. The objective is to estimate the posterior distribution f_{θ|y}(θ|y) and, in particular, the value of the parameter θ used to generate the observations. The result is the approximate Gaussian distribution q_{θ|y}(θ|y) shown in the right plot of Fig. 34.2; its mean and variance parameters are given by

  µ^{(L)} = 2.4399 ≈ θ,   r^{(L)} = 0.0166                                           (34.85)

Figure 34.2 (Left) Gaussian mixture model consisting of two components: a clutter

and a signal. The plot shows the separate components and their combination fn (θ) = fy|θ (y|θ). (Right) The Gaussian signal component, which is centered at the value of θ, along with the estimated posterior qθ |y (θ|y).
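The following Python sketch implements one possible realization of the scalar EP recursions of this example (the cavity, tilted-moment, and site updates in (34.63)–(34.82)), using natural (precision) parameters for the sites. The synthetic data generation, the random seed, and the safeguard that skips an update when the cavity precision becomes non-positive are our own pragmatic choices and are not prescribed by the text.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# model parameters from the example (34.83)
theta_bar, sigma_theta2 = 3.0, 0.5
sigma1_2, sigma2_2, alpha = 1.0, 0.5, 0.2
N = 100

theta = theta_bar + np.sqrt(sigma_theta2) * rng.standard_normal()
is_clutter = rng.uniform(size=N) < alpha
y = np.where(is_clutter,
             np.sqrt(sigma2_2) * rng.standard_normal(N),
             theta + np.sqrt(sigma1_2) * rng.standard_normal(N))

# site parameters in natural form: precision p[n] = 1/r_n and p[n]*mu_n; site 0 is the prior
p = np.zeros(N + 1)
pm = np.zeros(N + 1)
p[0], pm[0] = 1.0 / sigma_theta2, theta_bar / sigma_theta2

for it in range(25):                            # L = 25 EP sweeps
    for n in range(1, N + 1):
        P, PM = p.sum(), pm.sum()               # global posterior, cf. (34.63)-(34.64)
        p_cav, pm_cav = P - p[n], PM - pm[n]    # cavity, cf. (34.66a)-(34.66c)
        if p_cav <= 0:                          # pragmatic safeguard (not in the text)
            continue
        r_cav, mu_cav = 1.0 / p_cav, pm_cav / p_cav
        yn = y[n - 1]

        # responsibility of the signal component under the cavity, cf. (34.71), (34.76)
        Z_sig = (1 - alpha) * norm.pdf(yn, mu_cav, np.sqrt(sigma1_2 + r_cav))
        Z_clu = alpha * norm.pdf(yn, 0.0, np.sqrt(sigma2_2))
        lam = Z_sig / (Z_sig + Z_clu)

        # tilted (hybrid) mean and variance, cf. (34.73), (34.75), (34.77)
        sc2 = 1.0 / (1.0 / sigma1_2 + 1.0 / r_cav)
        cbar = sc2 * (yn / sigma1_2 + mu_cav / r_cav)
        mu_new = mu_cav + lam * r_cav / (sigma1_2 + r_cav) * (yn - mu_cav)
        r_new = lam * (cbar**2 + sc2) + (1 - lam) * (mu_cav**2 + r_cav) - mu_new**2

        # site update (34.81)-(34.82) in natural parameters
        p[n] = 1.0 / r_new - p_cav
        pm[n] = mu_new / r_new - pm_cav

P, PM = p.sum(), pm.sum()
print("EP posterior mean:", PM / P, " variance:", 1.0 / P, " true theta:", theta)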

Example 34.6 (Revisiting the probit problem) We revisit the probit formulation where we showed in (33.25) and (33.27) that the posterior distribution has the form: fw|γ N (w|γN ; HN ) ∝

N Y 1 2 × Nw (0, σw IM ) × Φ(γ(n)hTn w) Z n=1

(34.86)

in terms of the cumulative distribution Φ(·) for the standard Gaussian distribution. The vector w plays the role of the latent variable z, and the {γ(n)} play the role of the


observations {y(n)}. The {hn } are given feature vectors, assumed deterministic. The normalization Z is the evidence: ! ˆ N   Y 2 T Nw (0, σw IM ) Z= Φ γ(n)hn w dw (34.87) w∈W

n=1

2 We therefore encounter a situation with a Gaussian prior, π(w) = Nw (0, σw IM ) and the likelihood functions are defined in terms of the cumulative distribution function (cdf) (`−1) (`−1) Φ(·). To run the EP algorithm (34.54), we need to evaluate the moments {W+n , µ+n , (`−1) R+n } shown in (34.53a)–(34.53c). We rewrite these expressions more explicitly here:

ˆ (`−1)

W+n

(`−1)

= W−n

    (`−1) (`−1) Φ γ(n)hTn w Nw µ−n , R−n dw

(34.88a)

    (`−1) (`−1) w Φ γ(n)hTn w Nw µ−n , R−n dw

(34.88b)

w

(`−1)

(`−1)

µ+n

=

W−n

(`−1) W+n (`−1)

(`−1)

R+n

=

W−n

(`−1) W+n

ˆ

w

ˆ w

T      (`−1) (`−1) (`−1) (`−1) µ+n dw − µ+n wwT Φ γ(n)hTn w Nw µ−n , R−n (34.88c)

The three integrals in these expressions can be evaluated by appealing to the identities (4.111)–(4.114) for Gaussian distributions where the variable w in the above expressions plays the role of y and the variable γ(n)hn plays the role of h. Noting that γ(n) ∈ {+1, −1} so that γ 2 (n) = 1, we let (`−1)

γ(n)hTn µ−n ∆ yb(n) = q (`−1) 1 + hTn R−n hn Using the identities we have ˆ ∞     ∆ (`−1) (`−1) Z0 = Φ γ(n)hTn w Nw µ−n , R−n dw = Φ(b y (n))

(34.89)

(34.90)

−∞

and ∆

ˆ

Z1 =

    (`−1) (`−1) w Φ γ(n)hTn w Nw µ−n , R−n dw

w (`−1)

γ(n)R−n hn (`−1) = µ−n Φ(b y (n)) + q Nyb (n) (0, 1) (`−1) 1 + hTn R−n hn

(34.91)

and ∆

ˆ

Z2 =

    (`−1) (`−1) wwT Φ γ(n)hTn w Nw µ−n , R−n dw

(34.92)

w

  1 (`−1) (`−1) (`−1) = R−n + µ−n (µ−n )T Φ(b y (n) ) + q (`−1) T 1 + hn R−n hn ( ) (`−1) (`−1) yb(n)R−n hn hTn R−n (`−1) (`−1) T × 2γ(n)R−n hn (µ−n ) − q Nyb (0, 1) (`−1) 1 + hTn R−n hn


so that (`−1)

W+n

(`−1)

= W−n

(`−1)

=

(`−1)

=

µ+n

(`−1) W−n (`−1) W+n

× Z0

(34.93a)

× Z1

(34.93b)

(`−1)

R+n

W−n

(`−1)

(`−1) W+n

× Z2 − µ+n

 T (`−1) µ+n

(34.93c)

and the procedure continues.
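As a complement, the following Python helper evaluates the tilted moments required in (34.93a)–(34.93c) for a single probit site Φ(γ(n) h_n^T w) under a Gaussian cavity, written in the equivalent closed form that is standard for Gaussian-cdf sites (tilted mean and covariance expressed through N(ŷ)/Φ(ŷ)). The function name and interface are our own; it returns Z0 = Φ(ŷ(n)), which the normalization W_{+n} multiplies by W_{−n} as in (34.93a).

import numpy as np
from scipy.stats import norm

def probit_site_moments(mu_cav, R_cav, h, gamma):
    """Tilted moments for one probit site Phi(gamma * h^T w) under a cavity N(mu_cav, R_cav)."""
    g = gamma * h
    s2 = 1.0 + g @ R_cav @ g                 # 1 + h^T R h (since gamma^2 = 1)
    s = np.sqrt(s2)
    yhat = (g @ mu_cav) / s                  # cf. (34.89)
    Z0 = norm.cdf(yhat)                      # cf. (34.90)
    ratio = norm.pdf(yhat) / Z0              # N(yhat; 0, 1) / Phi(yhat)
    Rg = R_cav @ g
    mu_new = mu_cav + (ratio / s) * Rg       # tilted mean, cf. (34.91), (34.93b)
    R_new = R_cav - (ratio * (yhat + ratio) / s2) * np.outer(Rg, Rg)   # tilted covariance, cf. (34.93c)
    return Z0, mu_new, R_new

# small usage example with made-up cavity parameters and feature vector
mu_c, R_c = np.zeros(2), np.eye(2)
print(probit_site_moments(mu_c, R_c, h=np.array([1.0, -0.5]), gamma=+1.0))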

34.3 EXPONENTIAL SITES

We focused on describing the EP algorithm for Gaussian sites. The same analysis extends with minor adjustments to the broader family of exponential distributions. We will therefore be brief.

Starting point. Assume we have available the N factors {q_n^{(ℓ−1)}(z)}, each of which is now assumed to belong to the exponential family of distributions with natural parameter λ_n^{(ℓ−1)} and statistic T(z):

   q_n^{(ℓ−1)}(z) = (1/W_n^{(ℓ−1)}) exp{ (λ_n^{(ℓ−1)})^T T(z) },   n = 0, 1, . . . , N        (34.94)

   W_n^{(ℓ−1)} = ∫_{z∈Z} exp{ (λ_n^{(ℓ−1)})^T T(z) } dz        (34.95)

We know from property (5.91) that, for any exponential distribution with natural parameter λ_n^{(ℓ−1)} and statistic T(z), there exists an invertible mapping that recovers λ_n^{(ℓ−1)} from E T(z); we denote this mapping by ν(·), namely, it holds that

   λ_n^{(ℓ−1)} = ν( E T(z) )        (34.96)

We will be using this mapping later in Step III – see (34.107). One useful property of the class of exponential distributions is that it is closed under multiplication, so that the approximate posterior will also be exponential and can be written in the form:

   q_{z|y}^{(ℓ−1)}(z|y) = (1/W^{(ℓ−1)}) exp{ (λ^{(ℓ−1)})^T T(z) }        (34.97a)

   λ^{(ℓ−1)} ≜ Σ_{n=0}^{N} λ_n^{(ℓ−1)}        (34.97b)

   W^{(ℓ−1)} = ∫_{z∈Z} exp{ (λ^{(ℓ−1)})^T T(z) } dz        (34.97c)

We update each q_n^{(ℓ−1)}(z) to q_n^{(ℓ)}(z), one factor at a time, by following four steps:

Step I (Construct the cavity function). We introduce the cavity function:

   q_{z,−n}^{(ℓ−1)}(z) = q_{z|y}^{(ℓ−1)}(z|y) / q_n^{(ℓ−1)}(z)        (cavity function)   (34.98)

which will again be exponential:

   q_{z,−n}^{(ℓ−1)}(z) = (1/W_{−n}^{(ℓ−1)}) exp{ (λ_{−n}^{(ℓ−1)})^T T(z) }        (34.99a)

   λ_{−n}^{(ℓ−1)} = λ^{(ℓ−1)} − λ_n^{(ℓ−1)}        (34.99b)

   W_{−n}^{(ℓ−1)} = ∫_{z∈Z} exp{ (λ_{−n}^{(ℓ−1)})^T T(z) } dz        (34.99c)

Step II (Construct the hybrid distribution). We construct the hybrid distribution:

   g_{z|y}^{(ℓ−1)}(z|y) = (1/W_{+n}^{(ℓ−1)}) × f_n(z) × q_{z,−n}^{(ℓ−1)}(z)        (hybrid)   (34.100)

where we also added the normalization factor:

   W_{+n}^{(ℓ−1)} ≜ ∫_{z∈Z} f_n(z) q_{z,−n}^{(ℓ−1)}(z) dz
                 = W_{−n}^{(ℓ−1)} ∫_{z∈Z} f_n(z) [ (1/W_{−n}^{(ℓ−1)}) q_{z,−n}^{(ℓ−1)}(z) ] dz        (34.101)

which is seen to be the mean of f_n(z) relative to the normalized cavity distribution (the bracketed factor), which is exponential. We write the above relation more compactly as

   W_{+n}^{(ℓ−1)} = W_{−n}^{(ℓ−1)} E f_n(z)        (34.102)

where the expectation is relative to the normalized cavity. We will also need to evaluate the moment E T(z) relative to the hybrid distribution for Step III. We denote this moment by η_{+n}^{(ℓ−1)}. We can transform this calculation into an expectation over the normalized cavity as follows:

   η_{+n}^{(ℓ−1)} ≜ E_{g^{(ℓ−1)}} T(z)        (34.103)
                 = (W_{−n}^{(ℓ−1)} / W_{+n}^{(ℓ−1)}) ∫_{z∈Z} T(z) f_n(z) [ (1/W_{−n}^{(ℓ−1)}) q_{z,−n}^{(ℓ−1)}(z) ] dz

That is,

   η_{+n}^{(ℓ−1)} = (W_{−n}^{(ℓ−1)} / W_{+n}^{(ℓ−1)}) E[ T(z) f_n(z) ]        (34.104)

which amounts to computing the expectation of T(z)f_n(z) relative to the normalized exponential cavity function. We require this computation to be tractable in order to proceed with the implementation.

Step III (Projection). We bring the hybrid distribution back to the exponential family by solving:

   ĝ_{z|y}^{(ℓ−1)}(z|y) = argmin_{g∈G} D_KL( g_{z|y}^{(ℓ−1)}(z|y) ‖ g_{z|y}(z|y) )        (34.105)

where G now refers to the class of exponential distributions. We know from the earlier result (6.91) that the solution is obtained by selecting the parameter λ_{+n}^{(ℓ−1)} for ĝ_{z|y}^{(ℓ−1)}(z|y) in order to match its moments with those of the hybrid function, i.e., so that

   E_{ĝ^{(ℓ−1)}} T(z) = E_{g^{(ℓ−1)}} T(z)        (34.106)

It follows that it must hold:

   λ_{+n}^{(ℓ−1)} = ν( η_{+n}^{(ℓ−1)} )        (34.107)

Step IV (Update). The updated site is given by

   q_n^{(ℓ)}(z) ∝ W_{+n}^{(ℓ−1)} ĝ_{z|y}^{(ℓ−1)}(z|y) / q_{z,−n}^{(ℓ−1)}(z)        (34.108)

which is again exponential:

   q_n^{(ℓ)}(z) = (1/W_n^{(ℓ)}) exp{ (λ_n^{(ℓ)})^T T(z) }        (34.109)

   λ_n^{(ℓ)} = λ_{+n}^{(ℓ−1)} − λ_{−n}^{(ℓ−1)}        (34.110)

   W_n^{(ℓ)} = ∫_{z∈Z} exp{ (λ_n^{(ℓ)})^T T(z) } dz        (34.111)

The resulting EP algorithm for exponential sites is listed in (34.112).

Expectation-propagation algorithm using exponential sites.
input: prior π(z) and N distributions {f_n(z)}.
initialization: q_0^{(0)}(z) = π(z), q_n^{(0)}(z) = 1 for n = 1, 2, . . . , N;
   λ_0^{(0)} = E_π T(z), λ_n^{(0)} = 0.
repeat until sufficient convergence for ℓ = 1, 2, . . .:
   (parameters for posterior approximation)
      λ^{(ℓ−1)} = Σ_{n=0}^{N} λ_n^{(ℓ−1)},   W^{(ℓ−1)} = ∫_z exp{ (λ^{(ℓ−1)})^T T(z) } dz
   repeat for each site n = 0, 1, . . . , N:
      (parameters for cavity function)
         λ_{−n}^{(ℓ−1)} = λ^{(ℓ−1)} − λ_n^{(ℓ−1)},   W_{−n}^{(ℓ−1)} = ∫_z exp{ (λ_{−n}^{(ℓ−1)})^T T(z) } dz
      (parameters for hybrid distribution)
         evaluate { W_{+n}^{(ℓ−1)}, η_{+n}^{(ℓ−1)} } using (34.102) and (34.104)
         λ_{+n}^{(ℓ−1)} = ν( η_{+n}^{(ℓ−1)} )
      (update parameters for nth site)
         λ_n^{(ℓ)} = λ_{+n}^{(ℓ−1)} − λ_{−n}^{(ℓ−1)},   W_n^{(ℓ)} = ∫_z exp{ (λ_n^{(ℓ)})^T T(z) } dz
         λ_m^{(ℓ)} = λ_m^{(ℓ−1)},  W_m^{(ℓ)} = W_m^{(ℓ−1)},  for m ≠ n
   end
end
return the posterior approximation q_{z|y}(z|y) = (1/W^{(ℓ)}) exp{ (λ^{(ℓ)})^T T(z) };
return the evidence approximation W^{(ℓ)}.        (34.112)

34.4 ASSUMED DENSITY FILTERING

The EP algorithm is a generalization of a procedure known as assumed density filtering (ADF). This method iterates over the likelihood functions and introduces them one at a time into the approximation, whereas the EP algorithm iterates repeatedly over the functions in order to refine the approximation for the posterior. Since ADF is useful in some contexts, including in online Bayesian implementations with the data {y(n)} streaming in, we describe its main steps in this section. We again consider some latent random model z ∼ π(z) and data collected into y = col{y(1), y(2), . . . , y(N)} ∈ IR^N. The posterior of z is assumed to admit the factored representation:

   f_{z|y}(z|y) = (1/Z) π(z) ∏_{n=1}^{N} f_n(z)        (34.113a)

in terms of individual likelihood functions, f_n(·), and for some normalizing factor Z given by

   Z = ∫_{z∈Z} π(z) ( ∏_{n=1}^{N} f_n(z) ) dz        (34.113b)

We again wish to approximate the posterior f_{z|y}(z|y). The ADF algorithm starts from the initialization

   q_{z|y}^{(0)}(z|y) = π(z)        (34.114)

and updates this function repeatedly by incorporating one factor f_n(z) at a time. ADF does not assume a factored form for the approximation but rather updates the approximation directly through a sequence of steps. We will focus on approximations from the family of Gaussian distributions so that q_{z|y}(z|y) will be Gaussian; the extension to the exponential family is straightforward, as was done for EP.

Starting point. Assume we have available the Gaussian approximation q_{z|y}^{(ℓ−1)}(z|y) at iteration ℓ − 1, represented by its mean vector μ^{(ℓ−1)} and covariance matrix R^{(ℓ−1)}.

Step I (Incorporate one likelihood function). Using the Bayes rule we incorporate a new likelihood function and define

   g_{z|y}^{(ℓ−1)}(z|y) = (1/W_{+n}^{(ℓ−1)}) f_n(z) q_{z|y}^{(ℓ−1)}(z|y)        (34.115)

   W_{+n}^{(ℓ−1)} ≜ ∫_{z∈Z} f_n(z) q_{z|y}^{(ℓ−1)}(z|y) dz        (34.116)

µ+n

ˆ

(`−1)

= z∈Z

(`−1) ∆

R+n

ˆ

= z∈Z

zgz|y (z|y)dz 

(`−1)

z − µ+n



(34.117) (`−1)

z − µ+n

T

(`−1)

gz|y (z|y)dz

(34.118)

(`−1)

We need to assume that the form of the function gz|y (z|y) is such that these quantities can be evaluated in a tractable manner, as well as the normalization (`−1) factor W+n in (34.116). The same observations leading to (34.40), (34.42), and (34.43) apply here so that (`−1)

W+n

(`−1)

µ+n

(`−1)

R+n

= E fn (z) = =

1 (`−1) W+n

(

)

µ(`−1) E fn (z) + R(`−1) E ∇zT fn (z)

1

(`−1)

E zz T fn (z) − µ+n (`−1)

W+n



(`−1)

µ+n

T

(34.119a) (34.119b) (34.119c)

(`−1)

where the expectations are relative to the Gaussian distribution qz|y (z|y). (`−1)

Step II (Projection and update). We bring the function g_{z|y}^{(ℓ−1)}(z|y) back to the class of Gaussian distributions by solving:

   q_{z|y}^{(ℓ)}(z|y) = argmin_{g∈G} D_KL( g_{z|y}^{(ℓ−1)}(z|y) ‖ g_{z|y}(z|y) )        (34.120)

The solution is obtained by matching the mean and covariance matrix of g_{z|y}(z|y) to those of g_{z|y}^{(ℓ−1)}(z|y). We therefore conclude that

   q_{z|y}^{(ℓ)}(z|y) = N_z( μ^{(ℓ)}, R^{(ℓ)} )        (34.121)

where

   μ^{(ℓ)} = μ_{+n}^{(ℓ−1)},   R^{(ℓ)} = R_{+n}^{(ℓ−1)}        (34.122)

The resulting ADF algorithm for Gaussian sites is listed in (34.123).

Assumed density filtering using Gaussian sites.
input: prior π(z) and incoming likelihood distributions {f_n(z)}.
initialization: q_0^{(0)}(z) = π(z), μ^{(0)} = z̄, R^{(0)} = R_z.
repeat for each site n = 0, 1, 2, . . .:
   evaluate the three moments { W_{+n}^{(ℓ−1)}, μ_{+n}^{(ℓ−1)}, R_{+n}^{(ℓ−1)} }
      using expressions (34.119a)–(34.119c)
   set μ^{(ℓ)} = μ_{+n}^{(ℓ−1)}, R^{(ℓ)} = R_{+n}^{(ℓ−1)}
end
return posterior approximation q_{z|y}(z|y) = N_z(μ^{(ℓ)}, R^{(ℓ)}).        (34.123)
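For a scalar latent variable, the moments required by listing (34.123) can also be approximated numerically when closed forms are unavailable. The sketch below is only an illustration, not part of the text: it evaluates W_{+n}, μ_{+n}, and R_{+n} directly from their definitions (34.116)–(34.118) by one-dimensional quadrature rather than through the Stein-lemma forms (34.119a)–(34.119c); the function name adf_gaussian_1d and the trapezoidal grid are hypothetical choices.

```python
import numpy as np

def adf_gaussian_1d(prior_mean, prior_var, likelihoods, grid):
    # One pass of assumed density filtering (34.123) for a scalar z.
    # `likelihoods` is a list of functions f_n(z); `grid` is a dense 1-D array
    # over which the integrals (34.116)-(34.118) are approximated by trapezoids.
    mu, r = prior_mean, prior_var
    for f_n in likelihoods:
        q = np.exp(-0.5 * (grid - mu)**2 / r) / np.sqrt(2*np.pi*r)   # current Gaussian q(z)
        g = f_n(grid) * q                        # unnormalized hybrid, cf. (34.115)
        W = np.trapz(g, grid)                    # normalization (34.116)
        mu_new = np.trapz(grid * g, grid) / W    # mean (34.117)
        r_new = np.trapz((grid - mu_new)**2 * g, grid) / W   # variance (34.118)
        mu, r = mu_new, r_new                    # projection and update (34.121)-(34.122)
    return mu, r
```

As a hypothetical usage, the likelihoods could be the two-component Gaussian mixture factors of Example 34.5, which is also the setting of Prob. 34.7.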

By following the arguments from Section 34.3 we can similarly extend the statement of the ADF algorithm to the class of exponential distributions. The result is stated in (34.124).

Assumed density filtering using exponential sites.
input: prior π(z) and incoming likelihood functions {f_n(z)}.
initialization: q_0^{(0)}(z) = π(z), λ^{(0)} = E_π T(z).
repeat for each site n = 0, 1, 2, . . .:
   evaluate { W_{+n}^{(ℓ−1)}, η_{+n}^{(ℓ−1)} } using
      W_{+n}^{(ℓ−1)} = E f_n(z)
      η_{+n}^{(ℓ−1)} = (1/W_{+n}^{(ℓ−1)}) E T(z) f_n(z)
   set λ^{(ℓ)} = ν( η_{+n}^{(ℓ−1)} )
   set W^{(ℓ)} = ∫_z exp{ (λ^{(ℓ)})^T T(z) } dz
end
return the posterior approximation q_{z|y}(z|y) = (1/W^{(ℓ)}) exp{ (λ^{(ℓ)})^T T(z) };
return the evidence approximation W^{(ℓ)}.        (34.124)

34.5 COMMENTARIES AND DISCUSSION

Expectation propagation. The EP algorithm described by (34.54) and (34.112) is a useful method to approximate intractable posterior distributions. The algorithm is due to Minka (2001), who generalized an algorithm derived by Opper and Winther (2000) for Gaussian distributions. Example 34.5 dealing with a Gaussian signal embedded in clutter is an expansion of an example from Minka (2001) and Bishop (2007). The EP algorithm can further be viewed as an extension of the ADF method (34.123) in that it introduces iterative refinement into the computation of the successive site approximations. The ADF method also goes by the name of online Bayesian learning or moment matching – see, e.g., Maybeck (1982), Lauritzen (1992), Boyen and Koller (1998), Opper and Winther (1999), and Raymond, Manoel, and Opper (2014). The EP algorithm is particularly useful for probability distributions that can be expressed in factored form, as the product of elementary distributions. We will see later in Chapter 43 that factored distributions of this form appear in the study of message-passing algorithms over undirected graphs. It is therefore not surprising to find, as already noted in Minka (2001) and Murphy (2012), that the EP algorithm is related to the message-passing and belief-propagation algorithms of Chapter 43. Further discussion on and applications of EP appear, for example, in Birlutiu and Heskes (2007), Seeger and Nickisch (2011), Li, Hernández-Lobato, and Turner (2015), Vehtari et al. (2020), and Wilkinson et al. (2020).

PROBLEMS

34.1 Is the normalization factor W in (34.11) the same as the normalization factor W′ in (34.15b)? What is the relation between them?
34.2 Follow an argument similar to the one that led to expression (34.42) using the Stein lemma and derive an analogous expression for the covariance matrix defined by (34.39c).
34.3 Conclude from (34.46) that

   ∫_θ f_n(θ) q_{θ,−n}^{(ℓ−1)}(θ) dθ = ∫_θ q_n^{(ℓ)}(θ) q_{θ,−n}^{(ℓ−1)}(θ) dθ

34.4 Use expressions (34.71) and (34.74) to arrive at (34.75) for the mean term. Derive also expression (34.78) for the variance term.
34.5 Verify that expression (34.78) reduces to the following:

   r_{+n}^{(ℓ−1)} = r_{−n}^{(ℓ−1)} − λ(n) (r_{−n}^{(ℓ−1)})^2 / (σ_1^2 + r_{−n}^{(ℓ−1)})
                  + λ(n)(1 − λ(n)) [ (r_{−n}^{(ℓ−1)})^2 / (σ_1^2 + r_{−n}^{(ℓ−1)})^2 ] ( y(n) − μ_{−n}^{(ℓ−1)} )^2

34.6 Assume the prior π(z) ∼ N_z(z̄, R_z) and the likelihoods f_n(z) ∼ N(a_n, Σ_n) are all Gaussian. Assume also we limit the choice of all approximation sites to Gaussian distributions. How do the expectation propagation recursions simplify in this case?
34.7 Apply the ADF algorithm (34.123) to the model in Example 34.5 involving the mixture of two Gaussian distributions.


REFERENCES
Birlutiu, A. and T. Heskes (2007), “Expectation propagation for rating players in sports competitions,” Proc. European Conf. Principles of Data Mining and Knowledge Discovery (PKDD), pp. 374–381, Warsaw.
Bishop, C. (2007), Pattern Recognition and Machine Learning, Springer.
Boyen, X. and D. Koller (1998), “Tractable inference for complex stochastic processes,” Proc. Conf. Uncertainty in Artificial Intelligence (UAI), pp. 33–42, Madison, WI.
Lauritzen, S. L. (1992), “Propagation of probabilities, means and variances in mixed graphical association models,” J. Amer. Statist. Assoc., vol. 87, pp. 1098–1108.
Li, Y., J. M. Hernández-Lobato, and R. E. Turner (2015), “Stochastic expectation propagation,” Proc. Advances Neural Information Processing Systems (NIPS), pp. 1–9, Montreal.
Maybeck, P. S. (1982), Stochastic Models, Estimation, and Control, Academic Press.
Minka, T. (2001), “Expectation propagation for approximate Bayesian inference,” Proc. Conf. Uncertainty in Artificial Intelligence (UAI), pp. 362–369, San Francisco, CA.
Murphy, K. P. (2012), Machine Learning: A Probabilistic Perspective, MIT Press.
Opper, M. and O. Winther (1999), “A Bayesian approach to on-line learning,” in On-Line Learning in Neural Networks, D. Saad, editor, pp. 363–378, Cambridge University Press.
Opper, M. and O. Winther (2000), “Gaussian processes for classification: Mean field algorithms,” Neural Comput., vol. 12, pp. 2655–2684.
Raymond, J., A. Manoel, and M. Opper (2014), “Expectation propagation,” available at arXiv:1409.6179.
Seeger, M. and H. Nickisch (2011), “Fast convergent algorithms for expectation propagation approximate Bayesian inference,” Proc. Mach. Learn. Res. (PMLR), vol. 15, pp. 652–660.
Vehtari, A., A. Gelman, T. Sivula, P. Jylänki, D. Tran, S. Sahai, P. Blomstedt, J. P. Cunningham, D. Schiminovich, and C. P. Robert (2020), “Expectation propagation as a way of life: A framework for Bayesian inference on partitioned data,” J. Mach. Learn. Res., vol. 21, no. 17, pp. 1–53.
Wilkinson, W., P. Chang, M. Andersen, and A. Solin (2020), “State space expectation propagation: Efficient inference schemes for temporal Gaussian processes,” Proc. Mach. Learn. Res. (PMLR), vol. 119, pp. 10270–10281.

35 Particle Filters

We develop a sequential version of the importance sampling technique from Chapter 33 in order to respond to streaming data, thus leading to a sequential Monte Carlo solution. The algorithm will lead to the important class of particle filters. This chapter presents the basic data model and the main construction that enables recursive inference. Many of the inference and learning methods in subsequent chapters will possess a recursive structure, which is a fundamental property to enable them to continually learn in response to the arrival of sequential data measurements. Particle filters are particularly well suited for scenarios involving nonlinear models and non-Gaussian signals, and they have found applications in a wide range of areas where these two features (nonlinearity and non-Gaussianity) are prevalent, including in guidance and control, robot localization, visual tracking of objects, and finance.

35.1 DATA MODEL

We reconsider the basic importance sampling method described in (33.66) and apply it to an inference problem involving two (scalar or vector-valued) random variables x and y. Variable y will represent an observable quantity while variable x will represent a latent (or hidden) variable that we cannot observe directly. We attach a time (or space) index to the variables so that the {x_n} are generated sequentially and the {y_n} are also observed sequentially for n ≥ 0. For simplicity, we refer to n as the time index even though it can also admit other interpretations, such as a space index.

35.1.1 State Trajectory

We refer to x_n as the state variable at time n and assume that its evolution follows a first-order Markovian property, namely, that the distribution of x_n is only dependent on the immediate past state x_{n−1} and not on any of the previous states, i.e.,

   x_n ∼ f_{x_n|x_{n−1}}(x_n|x_{n−1})        (state equation)   (35.1a)

for some known conditional distribution. We refer to the probability density function (pdf) in (35.1a) as the transition kernel of the associated Markov chain. At time n = 0, we assume the initial state is generated according to some known distribution as well, denoted by

   x_0 ∼ f_{x_0}(x_0)        (initial state distribution)   (35.1b)

By the same token, we assume that, given x_n, the observation variable y_n is independent of all previous observations and follows the distribution:

   y_n ∼ f_{y_n|x_n}(y_n|x_n)        (measurement equation)   (35.1c)

We further assume that the pdfs f_{x_n|x_{n−1}}(x_n|x_{n−1}) and f_{y_n|x_n}(y_n|x_n) do not change over time, which corresponds to studying a first-order Markovian model under homogeneous (i.e., time-invariant) conditions.

Example 35.1 (Nonlinear state-space model) Assume the variables {x_n, y_n} satisfy a model of the form:

   x_{n+1} = g(x_n) + u_n,   x_0 ∼ N_{x_0}(x̄, Π_0)        (35.2a)
   y_n = k(x_n) + v_n        (35.2b)

where the {u_n, v_n} are zero-mean white-noise Gaussian processes that are independent of each other and with covariance matrices {Q, R}, respectively:

   u_n ∼ N_{u_n}(0, Q),  u_n ⟂ u_{m≠n},   v_n ∼ N_{v_n}(0, R),  v_n ⟂ v_{m≠n},   u_n ⟂ v_m        (35.3)

Moreover, g(·) and k(·) are known functions that control the evolution of the model. Then, it is easy to verify that

   x_n | x_{n−1} ∼ N_{x_n}( g(x_{n−1}), Q ) ≜ f_{x_n|x_{n−1}}(x_n|x_{n−1})        (35.4)
   y_n | x_n ∼ N_{y_n}( k(x_n), R ) ≜ f_{y_n|x_n}(y_n|x_n)        (35.5)

which is a special case of model (35.1a)–(35.1c).
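As an illustration only, the following sketch draws a trajectory from a scalar instance of model (35.2a)–(35.2b); the particular choices of g(·), k(·), and the noise levels, as well as the function name simulate_ssm, are hypothetical and not prescribed by the text.

```python
import numpy as np

def simulate_ssm(N, g, k, Q, R, x0_mean=0.0, Pi0=1.0, seed=0):
    # Draw states x_{0:N-1} and observations y_{0:N-1} from the scalar
    # nonlinear model (35.2a)-(35.2b).
    rng = np.random.default_rng(seed)
    x, y = np.zeros(N), np.zeros(N)
    x[0] = x0_mean + np.sqrt(Pi0) * rng.standard_normal()
    for n in range(N):
        y[n] = k(x[n]) + np.sqrt(R) * rng.standard_normal()        # (35.2b)
        if n + 1 < N:
            x[n+1] = g(x[n]) + np.sqrt(Q) * rng.standard_normal()  # (35.2a)
    return x, y

# hypothetical choices for g(.) and k(.)
x_traj, y_obs = simulate_ssm(N=200, g=lambda x: 0.9*x + np.sin(x),
                             k=lambda x: x**2 / 20, Q=0.1, R=0.5)
```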

We will encounter variations of model (35.1a)–(35.1c) later in Chapter 38 when we study hidden Markov models (HMMs). Under HMMs, the state variables and observations {xn , y n } assume discrete values with a finite number of realizations described by probability mass functions (pmfs). The transition between states will be described by transition probabilities, and the generation of the observations will be described by emission probabilities. As a result, we will be able to derive for HMMs explicit closed-form expressions for solving several inference problems, such as predicting current and future state values from observations {y m }. These calculations will involve evaluating sums with a finite number of terms and will be tractable in the discrete case. The situation is more challenging in the current chapter with continuous random variables {xn , y n } because, depending on the form of the distributions fxn |xn−1 (xn |xn−1 ) and fyn |xn (yn |xn ),


it will become difficult to assess certain necessary moments and integrals in closed form. For this reason, we will need to resort to sampling and Monte Carlo techniques. These difficulties are illustrated in Section 35.1.2. Before discussing the example, we introduce some useful notation.

Notation

We will employ the compact notation y_{0:n} to denote the collection of all observations received up to and including index n, i.e.,

   y_{0:n} ≜ { y_0, y_1, . . . , y_n }        (35.6)

Likewise, we will write x_{0:n} to denote the state trajectory up to that same time instant, including the initial state:

   x_{0:n} ≜ { x_0, x_1, . . . , x_n }        (35.7)

As the discussion will reveal, we will be interested in evaluating (or approximating) two types of posterior distributions:

   f_{x_{0:n}|y_{0:n}}(x_{0:n}|y_{0:n})   and   f_{x_n|y_{0:n}}(x_n|y_{0:n})        (35.8)

Once known, these distributions allow us to estimate the state trajectory x_{0:n}, or simply the current state x_n, from the observations that are available up to time n. For example, the location of the peak of the second posterior can provide a maximum a-posteriori (MAP) estimator for x_n based on knowledge of y_{0:n}. We will also be interested in updating the above two distributions over time in an effective recursive manner, i.e., in computing updates of the form:

   f_{x_{0:n−1}|y_{0:n−1}}(x_{0:n−1}|y_{0:n−1}) −→ f_{x_{0:n}|y_{0:n}}(x_{0:n}|y_{0:n})        (35.9a)
   f_{x_{n−1}|y_{0:n−1}}(x_{n−1}|y_{0:n−1}) −→ f_{x_n|y_{0:n}}(x_n|y_{0:n})        (35.9b)

In both cases, the updated pdfs are conditioned on an additional measurement, y n , and involve a new state xn .

35.1.2 Time and Measurement Updates

Although we will not be using the time- and measurement-update relations of this section explicitly in our treatment, we will nevertheless describe them for reference purposes and to motivate the need for sampling techniques. The derivation will illustrate how some of the integrals involved in the updates are generally difficult to compute in closed form, which is the main reason to resort to Monte Carlo methods. The analogy between the arguments used below and those applied earlier in Example 30.6 for linear state-space models is purposely explicit in order to highlight the similarities (and also the differences). Thus, consider again the data model (35.1a)–(35.1c). We are interested in estimating the underlying state variable x_n from the observations {y_m, m ≤ n} up to time n. Doing so allows us to track the evolution of the hidden state


from the observations. This task requires that we determine the two (posterior) conditional pdfs, also called beliefs:

   f_{x_n|y_{0:n}}(x_n|y_{0:n})   and   f_{x_n|y_{0:n−1}}(x_n|y_{0:n−1})        (35.10)

The pdf on the left allows us to filter the observations and estimate x_n, for example, by computing the mean or mode of the distribution and using either of them as a filtered estimate for x_n. The evaluation of the mean requires that we compute:

   x̂_{n|n} ≜ ∫_x x f_{x|y_{0:n}}(x|y_{0:n}) dx        (filtering)   (35.11)

where the integration is over the domain of x. There are three challenges in this expression: (a) first, we need to determine the conditional pdf f_{x_n|y_{0:n}}(x_n|y_{0:n}); (b) second, the form of the pdf changes with n since more measurements are added into it as n grows; and, above all, (c) we need to compute the integral. Likewise, the second pdf in (35.10) allows us to predict x_n from the past observations y_{0:n−1}, for example, by computing the mean or mode of the distribution and using either one of them as a predicted estimate for x_n. In the mean case, we need to evaluate

   x̂_{n|n−1} ≜ ∫_x x f_{x|y_{0:n−1}}(x|y_{0:n−1}) dx        (prediction)   (35.12)

This computation faces the same challenges as the previous one: We need to determine the pdf, whose argument grows with time, and, moreover, we need to evaluate the integral. The above prediction relation is also known as the Chapman–Kolmogorov equation; we will encounter its discrete counterpart later in Section 38.2.4 when we study HMMs. Let us examine more closely the evaluation of the two pdfs in (35.10) and assess how tractable their calculation is. To begin with, using the Bayes rule we have

   f_{x_n,y_n|y_{0:n−1}}(x_n, y_n|y_{0:n−1}) = f_{y_n|y_{0:n−1}}(y_n|y_{0:n−1}) f_{x_n|y_n,y_{0:n−1}}(x_n|y_n, y_{0:n−1})
                                            = f_{y_n|y_{0:n−1}}(y_n|y_{0:n−1}) f_{x_n|y_{0:n}}(x_n|y_{0:n})        (35.13)

and, similarly, factoring the same joint pdf in a different order:

   f_{x_n,y_n|y_{0:n−1}}(x_n, y_n|y_{0:n−1}) = f_{x_n|y_{0:n−1}}(x_n|y_{0:n−1}) f_{y_n|x_n,y_{0:n−1}}(y_n|x_n, y_{0:n−1})
                                            = f_{x_n|y_{0:n−1}}(x_n|y_{0:n−1}) f_{y_n|x_n}(y_n|x_n)        (35.14)

Equating (35.13) and (35.14) we arrive at the measurement update equation for computing the first pdf in (35.10):

   (measurement-update equation or filtering distribution)
   f_{x_n|y_{0:n}}(x_n|y_{0:n}) = ( f_{y_n|x_n}(y_n|x_n) / f_{y_n|y_{0:n−1}}(y_n|y_{0:n−1}) ) f_{x_n|y_{0:n−1}}(x_n|y_{0:n−1})
                                ∝ f_{y_n|x_n}(y_n|x_n) f_{x_n|y_{0:n−1}}(x_n|y_{0:n−1})        (35.15)

The term in the denominator in the first line is independent of x_n; it only serves as a normalization factor and can be dropped for convenience. The above result allows us to update the conditional pdf of x_n that is based on the past measurements y_{0:n−1} by incorporating the newer measurement y_n. Only the factor that appears in the numerator, namely, f_{y_n|x_n}(y_n|x_n), is known. The other two factors on the right-hand side are not available and need to be evaluated. Let us persist a bit more with these calculations. Regarding the rightmost term in (35.15), we again appeal to the Bayes rule and write

   f_{x_{n−1},x_n|y_{0:n−1}}(x_{n−1}, x_n|y_{0:n−1}) = f_{x_{n−1}|y_{0:n−1}}(x_{n−1}|y_{0:n−1}) f_{x_n|x_{n−1},y_{0:n−1}}(x_n|x_{n−1}, y_{0:n−1})
                                                    = f_{x_{n−1}|y_{0:n−1}}(x_{n−1}|y_{0:n−1}) f_{x_n|x_{n−1}}(x_n|x_{n−1})        (35.16)

where the conditioning over y_{0:n−1} is eliminated from the rightmost term because of the Markovian property of the state variable (i.e., the fact that the distribution of x_n is solely dependent on the immediate past state x_{n−1}). Consequently, by marginalizing over x_{n−1}, we obtain the time-update or prediction relation for computing the second pdf in (35.10):

   (time-update equation or predictive distribution)
   f_{x_n|y_{0:n−1}}(x_n|y_{0:n−1}) = ∫_{x_{n−1}} f_{x_n|x_{n−1}}(x_n|x_{n−1}) f_{x_{n−1}|y_{0:n−1}}(x_{n−1}|y_{0:n−1}) dx_{n−1}        (35.17)

This relation requires computing an integral, which may not be straightforward. Moreover, by comparing with (35.15), we find that these two relations lead to a recursive structure for updating the pdf of x_n based on the streaming observations. The recursive construction is useful because it enables us to carry out the calculations without the need to store the growing number of observations. Specifically, note the following sequence of updates (where we are removing the subscripts from the pdfs for compactness of notation):

   f(x_{n−1}|y_{0:n−1})  −−(35.17) time-update−−→  f(x_n|y_{0:n−1})  −−(35.15) measurement-update−−→  f(x_n|y_{0:n})

These steps are represented schematically in Fig. 35.1. Nevertheless, challenges persist with this sequence of calculations. First, the time-update step requires an integration as shown by (35.17) and, moreover, we still need to provide an expression for the term that appears in the denominator of the measurement-update relation (35.15). That term will also require computing an integral. To see this, we again apply the Bayes rule to write:

   f_{x_{n−1},x_n,y_n|y_{0:n−1}}(x_{n−1}, x_n, y_n|y_{0:n−1})
      = f_{x_{n−1}|y_{0:n−1}}(x_{n−1}|y_{0:n−1}) f_{x_n|x_{n−1}}(x_n|x_{n−1}) f_{y_n|x_{n−1},x_n,y_{0:n−1}}(y_n|x_{n−1}, x_n, y_{0:n−1})
      = f_{x_{n−1}|y_{0:n−1}}(x_{n−1}|y_{0:n−1}) f_{x_n|x_{n−1}}(x_n|x_{n−1}) f_{y_n|x_n}(y_n|x_n)        (35.18)

Marginalizing over {x_{n−1}, x_n} gives the expression

   f_{y_n|y_{0:n−1}}(y_n|y_{0:n−1}) = ∫_{x_{n−1}} ∫_{x_n} f_{x_{n−1}|y_{0:n−1}}(x_{n−1}|y_{0:n−1}) f_{x_n|x_{n−1}}(x_n|x_{n−1}) f_{y_n|x_n}(y_n|x_n) dx_n dx_{n−1}        (35.19)

which requires integrating the right-hand side. The derivation in this section reveals several of the computational challenges that arise when we attempt to evaluate and propagate the posterior distribution for xn given the observations. For this reason, we need to appeal to sampling techniques. The main purpose for generating samples is to estimate integrals by means of empirical averages over the samples, as we proceed to explain in greater detail.
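As a rough illustration of why these updates become demanding, the sketch below propagates (35.17) and (35.15) on a fixed grid for a scalar state; the quadrature grid, the callback interfaces, and the function name grid_filter_step are hypothetical, and the approach only scales to very low state dimensions.

```python
import numpy as np

def grid_filter_step(post_prev, grid, trans_pdf, lik_n):
    # post_prev: values of f(x_{n-1}|y_{0:n-1}) on `grid`
    # trans_pdf(x_next, x_prev): transition kernel f(x_n|x_{n-1}), vectorized
    # lik_n(x): likelihood f(y_n|x_n) evaluated at the current observation
    dx = grid[1] - grid[0]
    K = trans_pdf(grid[:, None], grid[None, :])   # K[i, j] = f(grid_i | grid_j)
    pred = K @ post_prev * dx                     # time update (35.17)
    post = lik_n(grid) * pred                     # measurement update (35.15)
    post /= np.sum(post) * dx                     # normalize, cf. (35.19)
    return post
```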

35.2 IMPORTANCE SAMPLING

At every time instant n, we will generate samples from the conditional pdf f_{x_{0:n}|y_{0:n}}(x_{0:n}|y_{0:n}). The samples will then allow us, among other things, to estimate the trajectory of the state variable x based on the history of the observations up to the present time. For instance, one such estimator could be the conditional mean of the trajectory given the observations, namely,

   E(x_{0:n}|y_{0:n}) = ∫_{x_{0:n}} x_{0:n} f_{x_{0:n}|y_{0:n}}(x_{0:n}|y_{0:n}) dx_{0:n}        (35.20)


Figure 35.1 Schematic representation of the time- and measurement-update steps listed in (35.15). The time-update step marginalizes over xn−1 , while the measurement-update step performs a Bayesian update. The variables are color-coded depending on whether they correspond to observations, marginalized variables, or prediction variables.


The samples will further allow us, as the discussion will reveal, to estimate only the current state, xn , based on the history of observations, rather than estimate the entire trajectory up to time n.

35.2.1 Particles

We will be generally dealing with distributions that involve vector arguments. Nevertheless, the importance sampling construction (33.66) will continue to hold. We start by selecting a proposal distribution π_{x_{0:n}}(x_{0:n}) and apply construction (33.66) with f_x(x) replaced by f_{x_{0:n}|y_{0:n}}(x_{0:n}|y_{0:n}). The construction then leads to a collection of J trajectories denoted by

   { x_{0:n}^j,  j = 1, 2, . . . , J }        (J trajectories)   (35.21)

along with their scalar weights {w^j}. Note that in this case, each particle corresponds to a full trajectory from time 0 to n. These particles and their weights can be used to estimate the desired trajectory using

   x̂_{0:n} = (1/J) Σ_{j=1}^{J} w^j x_{0:n}^j        (35.22)

As was explained earlier in Example 33.4, if desired we can compute the same estimate by working instead with the joint pdf f_{x_{0:n},y_{0:n}}(x_{0:n}, y_{0:n}), which happens to be proportional to f_{x_{0:n}|y_{0:n}}(x_{0:n}|y_{0:n}). In that case, we would normalize the weights to add up to 1 – recall (33.88). We will continue our exposition by assuming knowledge of the conditional pdf f_{x_{0:n}|y_{0:n}}(x_{0:n}|y_{0:n}) since only minor adjustments are needed to accommodate the case in which the joint pdf is known but not the conditional pdf. Recall that the observations {y_n} arrive sequentially over time, and the state variable x_n also evolves sequentially. We are therefore interested in estimating the state trajectory in a sequential manner using a succession of conditional pdfs of the form:

   f_{x_0|y_0}(x_0|y_0), f_{x_{0:1}|y_{0:1}}(x_{0:1}|y_{0:1}), f_{x_{0:2}|y_{0:2}}(x_{0:2}|y_{0:2}), . . . , f_{x_{0:n}|y_{0:n}}(x_{0:n}|y_{0:n})        (35.23)

The first distribution allows us to estimate the initial condition x0 from observation of y0 , while the second distribution allows us to estimate the two-state trajectory {x0 , x1 } once {y0 , y1 } are observed. The third distribution allows us to estimate the three-state trajectory {x0 , x1 , x2 } from knowledge of {y0 , y1 , y2 }, and so on. Observe that for each new observation, we are estimating the entire trajectory from time 0 up to n. This is because new observations provide information about prior states as well and, therefore, can be used to “improve” or “smooth” our estimates of these earlier states. Since states and observations appear repeated in the above pdfs, one would expect to be able to simplify the


importance sampling process to reduce its computational complexity, as we now explain. The result will lead to a recursive procedure known as the particle filter.

35.2.2 Weight Update

Thus, assume we have already generated J weighted trajectories {x_{0:n−1}^j, w^j} based on the observations up to time n − 1 and using the conditional pdf f_{x_{0:n−1}|y_{0:n−1}}(x_{0:n−1}|y_{0:n−1}). We will add a subscript (n − 1) and a superscript j to the weights and rewrite them as shown in the next equation to indicate that they were computed by sampling from the pdf f_{x_{0:n−1}|y_{0:n−1}}(x_{0:n−1}|y_{0:n−1}) involving states and observations up to time n − 1:

   { x_{0:n−1}^j, w_{n−1}^j },   j = 1, 2, . . . , J        (J particles from iteration n − 1)   (35.24)

We know from the importance sampling method (33.66) that the weights are computed as follows:

   x_{0:n−1} ∼ π_{x_{0:n−1}|y_{0:n−1}}(x_{0:n−1}|y_{0:n−1})        (35.25a)

   w_{n−1}^j = f_{x_{0:n−1}|y_{0:n−1}}(x_{0:n−1}|y_{0:n−1}) / π_{x_{0:n−1}|y_{0:n−1}}(x_{0:n−1}|y_{0:n−1})        (35.25b)

where π_{x_{0:n−1}|y_{0:n−1}}(x_{0:n−1}|y_{0:n−1}) denotes the importance distribution; it is often a function of the states only but we are allowing it to be conditioned on the observations as well for generality. Next, a new observation y_n arrives and we now need to deal with the updated pdf f_{x_{0:n}|y_{0:n}}(x_{0:n}|y_{0:n}) with y_{0:n−1} replaced by y_{0:n}. We will also replace the importance distribution π_{x_{0:n−1}|y_{0:n−1}}(x_{0:n−1}|y_{0:n−1}) by the updated version π_{x_{0:n}|y_{0:n}}(x_{0:n}|y_{0:n}). However, we will approximate the proposal distribution by the following factored form (see Prob. 35.12):

   π_{x_{0:n}|y_{0:n}}(x_{0:n}|y_{0:n}) ≈ π_{x_{0:n−1}|y_{0:n−1}}(x_{0:n−1}|y_{0:n−1}) × π_{x_n|x_{0:n−1},y_{0:n}}(x_n|x_{0:n−1}, y_{0:n})        (35.26)
                                            (old factor)                                  (new factor)

where the rightmost factor is used to generate samples for the new state xn . We are assuming in this general form that the new factor is conditioned on the past states and past and current observations. In the following, we will consider special cases for this importance factor and arrive at the bootstrap and auxiliary forms of particle filters. For now, we persist with (35.26). By factorizing the importance distribution in the above manner, we can take advantage of the samples that were generated from the old importance distribution rather than discard them. We do so by showing how to derive an update for the weights of the particles


(we remove the subscripts from the pdfs in the equations below to simplify the notation; the subscripts are evident from the arguments for each function):

   w_n^j ≜ f(x_{0:n}^j|y_{0:n}) / π(x_{0:n}^j|y_{0:n}),   x_{0:n}^j ∼ π(x_{0:n}|y_{0:n})
        = f(x_{0:n}^j|y_{0:n−1}, y_n) / π(x_{0:n}^j|y_{0:n})
    (a) = f(x_{0:n}^j|y_{0:n−1}) f(y_n|x_{0:n}^j, y_{0:n−1}) / [ π(x_{0:n}^j|y_{0:n}) f(y_n|y_{0:n−1}) ]
    (b) = [ f(x_{0:n−1}^j|y_{0:n−1}) / π(x_{0:n−1}^j|y_{0:n−1}) ] × f(x_n^j|x_{0:n−1}^j, y_{0:n−1}) f(y_n|x_{0:n}^j, y_{0:n−1}) / [ π(x_n^j|x_{0:n−1}^j, y_{0:n}) f(y_n|y_{0:n−1}) ]
    (c) = w_{n−1}^j f(x_n^j|x_{n−1}^j) f(y_n|x_n^j) / [ π(x_n^j|x_{0:n−1}^j, y_{0:n}) f(y_n|y_{0:n−1}) ]
        ∝ w_{n−1}^j × f(x_n^j|x_{n−1}^j) f(y_n|x_n^j) / π(x_n^j|x_{0:n−1}^j, y_{0:n})        (35.27)

where in step (a) we used the conditional probability result (3.55), and in step (b) we used construction (35.26) and the factorization:

   f(x_{0:n}^j|y_{0:n−1}) = f(x_{0:n−1}^j|y_{0:n−1}) × f(x_n^j|x_{0:n−1}^j, y_{0:n−1})        (35.28)

In step (c) we used the fact that, conditioned on x_{n−1}, the state x_n is independent of all other variables. Likewise, conditioned on x_n, the observation y_n is independent of all other variables. In the last step, we removed f(y_n|y_{0:n−1}), which is difficult to compute and appears in the denominator; we can remove this term because it will get canceled out anyway by the normalization step shown below. All the terms multiplying w_{n−1}^j in (35.27) are known and, therefore, we arrive at the following update relation for the weights:

   x_n^j ∼ π_{x_n|x_{0:n−1},y_{0:n}}(x_n^j|x_{0:n−1}^j, y_{0:n})
   w_n^j = w_{n−1}^j × ( f_{x_n|x_{n−1}}(x_n^j|x_{n−1}^j) f_{y_n|x_n}(y_n|x_n^j) / π_{x_n|x_{n−1}}(x_n^j|x_{n−1}^j) ),   j = 1, 2, . . . , J
   normalize {w_n^j} to add up to 1 (i.e., divide by their sum)        (35.29)

The proportionality constant in (35.27) is irrelevant because we are normalizing the sum of these weights to 1. We therefore have an initial procedure, sometimes referred to as sequential importance sampling, where:


   (basic recursive Bayesian inference without resampling)
   1. We start from J trajectories {x_{0:n−1}^j} up to time n − 1.
   2. Observe y_n and generate J new samples x_n^j ∼ π(x_n|x_{0:n−1}^j, y_{0:n}).
   3. This augments each of the trajectories to {x_{0:n}^j}.
   4. We then update the weights from {w_{n−1}^j} to {w_n^j} using (35.29);
      and repeat the process.        (35.30)
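A minimal sketch of one sequential importance sampling step from (35.29)–(35.30) follows; the function name sis_step and the callback interfaces are hypothetical, and for simplicity the proposal here is conditioned only on the previous state and the current observation.

```python
import numpy as np

def sis_step(particles, weights, y_n, sample_proposal, proposal_pdf,
             trans_pdf, lik_pdf):
    # particles: array holding the states x_{n-1}^j; weights: normalized w_{n-1}^j
    x_prev = particles
    x_new = sample_proposal(x_prev, y_n)                # x_n^j ~ pi(.|x_{n-1}^j, y_n)
    w = weights * trans_pdf(x_new, x_prev) * lik_pdf(y_n, x_new) \
        / proposal_pdf(x_new, x_prev, y_n)              # weight update (35.29)
    w /= np.sum(w)                                      # normalize to add up to 1
    return x_new, w
```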

35.2.3 Resampling

Each sampled trajectory {x_{0:n}^j} corresponds to one particle j in this implementation with weight w_n^j. It is desirable for the sampling procedure to generate and propagate particles that contribute similarly to the inference process (i.e., that have more or less uniform weights, say, w_n^j ≈ 1/J in the normalized case). However, it has been observed in practice that one difficulty with construction (35.29) is that, after a few iterations, the weights {w_n^j} will degenerate and become negligible with the exception of one weight approaching 1; meaning that all particles will become irrelevant and not contribute to the solution except for one particle (or trajectory). One approximate way to detect this degeneracy problem is to evaluate at each time n the effective sample size using the normalized weights:

   J_n ≜ 1 / Σ_{j=1}^{J} (w_n^j)^2        (35.31)

Using the property of norms ‖a‖_1 ≤ √J ‖a‖_2 for J-dimensional vectors a, we have that

   Σ_{j=1}^{J} (w_n^j)^2 ≥ (1/J) ( Σ_{j=1}^{J} |w_n^j| )^2 = 1/J        (35.32)

since the sum of the weights is normalized to 1. Hence, it holds that

   J_n ≤ J        (35.33)

The upper bound is achieved when all particles are equally weighted with w_n^j = 1/J. The lower bound is J_n ≥ 1 and is achieved when all weights are equal to 0 except for one of them (which is equal to 1). This result suggests that small values for J_n (say, J_n < J/2) can be used to flag the occurrence of degeneracy:

   small J_n  ⟹  degeneracy is likely at time n        (35.34)

35.2 Importance Sampling

  P jth trajectory xj0:n is selected = wnj

1391

(35.35)

We repeat the sampling procedure J times and replace the starting trajectories {xj0:n } with a new collection of J trajectories denoted by {xj? 0:n }. On average, j we expect each particle j to appear Jwn times in the resampled trajectories. During this resampling procedure, particles with larger weights will be more likely to be sampled than particles with smaller weights. In this way, we end up “eliminating” particles of smaller significance and retaining more prominent particles in a process that amounts to the “survival of the fittest.” While the original trajectories {xj0:n } were constructed by sampling from the importance distribution πx (x), the resampled trajectories {xj? 0:n } will amount to samples obtained approximately from the true distribution fx0:n |y0:n (x0:n |y0:n ). For this reason, following the resampling procedure, we reset the weights of the new collection of J particles to uniform values, i.e., we set wnj = 1/J,

j = 1, 2, . . . , J

(35.36)

Residual resampling

A second way to address degeneracy is residual resampling. In this case, we ensure that each particle j is represented by at least n_j = ⌊J w_n^j⌋ copies, where ⌊x⌋ represents the truncation of x. Let

   J̄ ≜ J − Σ_{j=1}^{J} n_j        (35.37)

denote the residual number of particles that are needed to arrive at a total of J particles. Let also

   w̄_n^j ≜ w_n^j − n_j/J        (35.38)

denote the residual weight after excluding the n_j copies. Then, we generate the remaining J̄ samples by resampling from {x_{0:n}^j} using weights {w̄_n^j}. In this way, most of the trajectories that appear in the resampled set would have been selected in a deterministic manner and only a few of them would have been selected randomly. The benefit of doing so is to reduce the increase in the variance of the weights that random resampling causes.

Systematic resampling

One inconvenience of multinomial resampling is that during this process, particles with large weights will likely be repeated multiple times in the resampled trajectories. In systematic resampling, which is one of the preferred resampling approaches, the interval [0, 1] is divided into segments of size 1/J each. A root number u_1 is sampled from the uniform distribution u_1 ∼ U[0, 1/J] and used to generate a total of J numbers, one for each segment:

   u_i = u_{i−1} + 1/J,   i = 2, 3, . . . , J        (35.39)

These numbers are represented by the colored circles on the vertical axis in Fig. 35.2. At the same time, we construct the cumulative density function (cdf) for the particles based on their normalized weights:

   c_0 = 0,   c_j = c_{j−1} + w_n^j,   j = 1, 2, . . . , J        (35.40)

Clearly, we will have c_J = 1. A particle j with large weight w_n^j will cause a large jump from c_{j−1} to c_j; this is illustrated by the jump from level c_2 to level c_3 in the figure. The jump may run across several of the 1/J-long segments where the {u_i} belong. For instance, the two numbers {u_2, u_3} lie within the gap between (c_2, c_3). For every particle of index j, we count how many numbers u_i appear within the interval u_i ∈ [c_{j−1}, c_j], say, n_j times. For j = 3, we have n_j = 2 in the figure. This number will then correspond to the number of times that we include particle j in the resampled trajectories – see Prob. 35.8.


Figure 35.2 In systematic resampling, the vertical axis is divided into segments of size

1/J each. One root u1 is sampled from within [0, 1/J] and displaced uniformly to other locations ui within the J segments. The cdf of the trajectories is generated based on their weights {wnj }.
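As an illustration only, the sketch below computes the effective sample size (35.31) and implements systematic resampling via the construction (35.39)–(35.40); the function names effective_sample_size and systematic_resample are hypothetical.

```python
import numpy as np

def effective_sample_size(w):
    # w: normalized weights; implements J_n in (35.31)
    return 1.0 / np.sum(w**2)

def systematic_resample(w, rng=None):
    # Returns the indices of the resampled particles using (35.39)-(35.40).
    if rng is None:
        rng = np.random.default_rng()
    J = len(w)
    u = rng.uniform(0.0, 1.0 / J) + np.arange(J) / J   # u_1, ..., u_J as in (35.39)
    c = np.cumsum(w)                                   # cdf values c_1, ..., c_J in (35.40)
    return np.searchsorted(c, u)                       # particle index selected by each u_i

# hypothetical usage: resample only when degeneracy is flagged, cf. (35.34)
# if effective_sample_size(w) < J / 2:
#     idx = systematic_resample(w)
#     particles, w = particles[idx], np.full(J, 1.0 / J)
```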

35.3 PARTICLE FILTER IMPLEMENTATIONS

We are ready to group the various elements together to arrive at the basic form of the particle filter listed in (35.41).

Generic form of the particle filter for model (35.1a)–(35.1c).
given some positive threshold T < J (number of particles);
given initial state distribution x_0 ∼ f(x_0);
given conditional state distribution x_n ∼ f(x_n|x_{n−1});
given conditional observation distribution y_n ∼ f(y_n|x_n);
objective: estimate moments involving f(x_{0:n}|y_{0:n}) such as the
   conditional mean for the trajectory x_{0:n}.
choose importance distribution, π(x|x_{0:n−1}, y_{0:n});
sample J initial states: x_0^j ∼ π(x|0, 0), j = 1, 2, . . . , J;
set their initial weights to w_0^j = f(x_0^j)/π(x_0^j|0, 0);
normalize these weights to add up to 1 (divide by their sum).
repeat over n = 1, 2, 3, . . . :
   sample J realizations for time n: x_n^j ∼ π(x|x_{0:n−1}^j, y_{0:n})
   augment trajectories from {x_{0:n−1}^j} to {x_{0:n}^j}, ∀ j
   w_n^j = w_{n−1}^j × ( f(x_n^j|x_{n−1}^j) f(y_n|x_n^j) / π(x_n^j|x_{0:n−1}^j, y_{0:n}) ), ∀ j
   w_n^j ← w_n^j / Σ_{j′=1}^{J} w_n^{j′}, ∀ j   (normalization)
   let J_n = 1 / Σ_{j=1}^{J} (w_n^j)^2
   if J_n ≤ T:
      resample the trajectories with replacement
         and generate {x_{0:n}^{j⋆}} from {x_{0:n}^j}, j = 1, 2, . . . , J
      set {x_{0:n}^j} ← {x_{0:n}^{j⋆}}, j = 1, 2, . . . , J
      set w_n^j = 1/J, j = 1, 2, . . . , J
   end
end
return particles and weights {x_{0:n}^j, w_n^j}, ∀ j.        (35.41)

In filtering problems, we are usually interested in recovering the current state x_n (or some function of it), and not the entire state trajectory, from the observations y_{0:n}. For instance, assume we are interested in computing E h(x_n), for some function h(·) such as h(x) = x. Then, by definition, conditioned on the observations:

   E h(x_n) = ∫_{x_n} h(x_n) f_{x_n|y_{0:n}}(x_n|y_{0:n}) dx_n
            = ∫_{x_n} ∫_{x_{0:n−1}} h(x_n) f_{x_{0:n−1},x_n|y_{0:n}}(x_{0:n−1}, x_n|y_{0:n}) dx_n dx_{0:n−1}
            = ∫_{x_{0:n}} h(x_n) f_{x_{0:n}|y_{0:n}}(x_{0:n}|y_{0:n}) dx_{0:n}
            ≈ Σ_{j=1}^{J} w_n^j h(x_n^j),   x_n^j ∼ π(x|x_{n−1}^j)        (35.42)

In other words, we ignore the other states and use the particles corresponding to the nth state only. This also means that we can approximate the posterior distribution for x_n given all observations up to time n by means of the discrete representation (written with a clear abuse of notation):

   f_{x_n|y_{0:n}}(x_n|y_{0:n}) ≈ Σ_{j=1}^{J} w_n^j δ(x_n − x_n^j)        (35.43)

where δ(x) represents the Dirac delta function. Essentially, we examine all samples {x_n^j} at time n and generate a weighted histogram from these values, or place delta functions at the sample locations weighted by {w_n^j}.
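In code, the approximation (35.42) is simply a weighted average over the particles at time n. A minimal sketch (function and variable names are illustrative):

import numpy as np

def weighted_estimate(particles, weights, h=lambda x: x):
    # approximates E h(x_n | y_{0:n}) by sum_j w_n^j h(x_n^j), as in (35.42)
    hx = np.asarray([h(x) for x in particles])
    return np.tensordot(weights, hx, axes=1)

The same weights, together with the particle locations, define the discrete posterior approximation in (35.43).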

35.3.1 Bootstrap Particle Filter

There are several variations of the generic listing (35.41). The weights w_n^j are random quantities in view of the randomness in generating the samples x_n^j. We would like these weights to have small variance so that they stay clustered close to their mean and all particles contribute "uniformly" to the inference solution. We explain in Prob. 35.9 that the optimal choice for the importance factor to minimize the variance of these weights is given by

   π(x_n|x_{0:n−1}^j, y_{0:n}) = f(x_n|x_{n−1}^j, y_n)      (35.44)

in terms of the distribution of x_n conditioned on both x_{n−1}^j and y_n. The bootstrap implementation ignores the observation and sets the importance factor to

   π(x_n|x_{0:n−1}^j, y_{0:n}) ≜ π(x_n|x_{n−1}) = f(x_n|x_{n−1}^j)      (35.45)

That is, π(x_n|·) is conditioned only on the immediate past state. This approximation is motivated by the Markovian property of the state variable under the assumption that x_{n−1}^j ≈ x_{n−1} (the actual state); the approximation is not true in general since x_{n−1}^j is generated from the importance sampling distribution while x_{n−1} is generated from the true model dynamics. Nevertheless, we continue with the above approximation. Of course, this choice assumes that it is possible to sample from f(x_n|x_{n−1}^j). Moreover, in the bootstrap filter, resampling of the trajectories is carried out at every iteration n, so the computation of J_n becomes unnecessary. This version of the algorithm is known as the bootstrap filter, and also as the sampling importance resampling (SIR) filter.

Bootstrap particle filter for model (35.1a)–(35.1c).
given initial state distribution x_0 ∼ f(x_0);
given conditional state distribution x_n ∼ f(x_n|x_{n−1});
given conditional observation distribution y_n ∼ f(y_n|x_n);
objective: estimate moments involving f(x_{0:n}|y_{0:n}) such as the conditional mean for the trajectory x_{0:n}.

sample J initial states: x_0^j ∼ f(x_0), j = 1, 2, . . . , J;
set their initial weights to w_0^j = 1/J.
repeat over n = 1, 2, 3, . . . :
   sample J realizations for time n: x_n^j ∼ f(x_n|x_{n−1}^j)
   augment trajectories from {x_{0:n−1}^j} to {x_{0:n}^j}, ∀ j
   w_n^j = f(y_n|x_n^j), ∀ j
   w_n^j ← w_n^j / Σ_{j'=1}^J w_n^{j'}, ∀ j  (normalization)
   resample the trajectories with replacement and generate {x_{0:n}^{j⋆}} from {x_{0:n}^j}, j = 1, 2, . . . , J
   set {x_{0:n}^j} ← {x_{0:n}^{j⋆}}, j = 1, 2, . . . , J
   set w_n^j = 1/J, j = 1, 2, . . . , J
end
return particles and weights {x_{0:n}^j, w_n^j}, ∀ j.      (35.46)

Note that under (35.45), the update equation (35.29) for the weights simplifies to

   w_n^j = w_{n−1}^j × f(y_n|x_n^j),   j = 1, 2, . . . , J
   normalize {w_n^j} to add up to 1      (35.47)

But since resampling is applied at every time instant n so that w_{n−1}^j = 1/J, we conclude that the weight is proportional to the conditional pdf of the observation:

   w_n^j ∝ f(y_n|x_n^j),   j = 1, 2, . . . , J
   normalize {w_n^j} to add up to 1      (35.48)

We still need to know how to sample from the conditional pdf f(x_n|x_{n−1}^j). This step benefits from the model assumed for the evolution of the state variable x_n. For instance, assume x_n evolves according to (35.2a). Then, we can first generate a sample v_n ∼ N_v(0, R_v) and then set x_n = g(x_{n−1}^j) + v_n.
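To make the recursions concrete, here is a minimal Python sketch of one time step of the bootstrap filter (35.46) for a model of the form x_n = g(x_{n−1}) + v_n with v_n ∼ N_v(0, R_v). The functions g and lik (the observation likelihood f(y_n|x_n)) are assumed to be supplied by the user, and multinomial resampling is used for brevity in place of the systematic scheme of Fig. 35.2; the function and argument names are ours.

import numpy as np

def bootstrap_step(particles, y, g, lik, Rv_chol, rng=None):
    """One time step of the bootstrap (SIR) filter for x_n = g(x_{n-1}) + v_n.

    particles : (J, d) array of state samples x_{n-1}^j
    y         : current observation y_n
    g         : maps a state vector to the next mean state
    lik       : observation likelihood, lik(y, x) -> f(y | x), a scalar
    Rv_chol   : Cholesky factor of the state-noise covariance R_v
    """
    if rng is None:
        rng = np.random.default_rng()
    J, d = particles.shape
    # propagate each particle through the state equation: x_n^j = g(x_{n-1}^j) + v_n
    noise = rng.standard_normal((J, d)) @ Rv_chol.T
    particles = np.array([g(x) for x in particles]) + noise
    # weight each particle by f(y_n | x_n^j) and normalize
    w = np.array([lik(y, x) for x in particles], dtype=float)
    w = w / w.sum()
    # resample with replacement and reset the weights to 1/J
    idx = rng.choice(J, size=J, p=w)
    return particles[idx], np.full(J, 1.0 / J)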

35.3.2 Auxiliary Particle Filter

The sampling that produces x_n^j ∼ f(x_n|x_{n−1}^j) in the bootstrap filter is not influenced by the current observation and can generate state values x_n^j with low likelihood of occurrence, namely, with low f(y_n|x_n^j). The auxiliary filter is a variation that helps address this anomaly by introducing an auxiliary variable. It is based on replacing the approximation (35.45) by an importance factor that involves an additional term that depends on the likelihood function. It is motivated as follows. We know from the Bayes rule that

   f(x_n, y_n|x_{n−1}^j) = f(y_n|x_{n−1}^j) f(x_n|x_{n−1}^j, y_n)
                        = f(x_n|x_{n−1}^j) f(y_n|x_n, x_{n−1}^j)
                        = f(x_n|x_{n−1}^j) f(y_n|x_n)      (35.49)

where in the last line we used the fact that, given x_n, the observation is independent of x_{n−1}^j. It follows that the optimal choice shown in (35.44) satisfies

   f(x_n|x_{n−1}^j, y_n) = f(x_n|x_{n−1}^j) f(y_n|x_n) / f(y_n|x_{n−1}^j)      (35.50)

and, hence, as a function of x_n:

   π(x_n|x_{0:n−1}^j, y_{0:n}) ∝ f(x_n|x_{n−1}^j) f(y_n|x_n)      (35.51)

Recall that the importance factor is needed to generate samples {x_n}. The above choice cannot be used in this form because the rightmost factor involves conditioning on the unknown x_n. We therefore need to estimate x_n; we refer to this estimate as an auxiliary variable and denote it by s_n^j, with a superscript j because different particles will generally contribute different estimates. There are many ways by which we can compute a statistic s_n^j. One way is to estimate the "future" state x_n based on the prior state x_{n−1}^j by using

   s_n^j = E(x_n|x_{n−1}^j) = ∫_x x f(x|x_{n−1}^j) dx      (35.52)

in terms of the mean of the distribution f(x_n|x_{n−1}^j). A second way is to sample a realization from the distribution:

   s_n^j ∼ f(x_n|x_{n−1}^j)      (35.53)

Both of these calculations lead to auxiliary variables s_n^j that depend on the particle index j. We may also consider averaging the past states across all particles and use the sample mean (or the sample mode) to compute

   s_n = (1/J) Σ_{j=1}^J x_{n−1}^j    or    s_n = mode{ x_{n−1}^j }_{j=1}^J      (35.54)

In the last two constructions, the auxiliary variable s_n becomes independent of the particle index j. Now, given the statistic s_n^j, we end up with the following choice for the proposal distribution:

   π(x_n|x_{0:n−1}^j, y_{0:n}) ∝ f(x_n|x_{n−1}^j) × f(y_n|s_n^j)      (35.55)

The first factor on the right-hand side of (35.55) is the transition kernel, the same one used in the bootstrap implementation to generate x_n^j. The second factor in (35.55), on the other hand, includes an approximation for the likelihood function f(y_n|x_n). Substituting (35.55) into (35.29) leads to the following update relation for the unnormalized weights:

   w_n^j = w_{n−1}^j × f(y_n|x_n^j) / f(y_n|s_n^j),   j = 1, 2, . . . , J      (35.56)

We arrive at listing (35.59) for the auxiliary particle filter. At every iteration, and motivated by expression (35.55), the weights are first scaled by f(y_n|s_n^j) in order to exploit the likelihood information. After resampling, the weights are set to 1/J and the trajectories are augmented. Finally, the weights are set according to (35.56), followed by normalization to ensure they add up to 1. Motivated by the form of the importance factor in (35.55), this filter implements resampling in a different manner than the bootstrap filter. Starting at time n − 1, each particle j has weight w_{n−1}^j. The implementation first scales this weight by the likelihood f(y_n|s_n^j), so that trajectories that are likely under the observation y_n receive an additional boost in their weight relative to trajectories that are less likely:

   w_{n−1}^j ← w_{n−1}^j × f(y_n|s_n^j),   j = 1, 2, . . . , J      (35.57)

The weights are normalized to add up to 1 and the trajectories {x_{0:n−1}} are subsequently resampled using these adjusted weights to generate {x_{0:n−1}^⋆}. By doing so, trajectories with more likely states are selected and propagated forward. Subsequently, J samples for the state x_n at time n are sampled from the same distribution x_n ∼ f(x_n|x_{n−1}^⋆) used in the bootstrap filter, albeit applied to the trajectories after resampling. This step allows us to extend the particles to {x_{0:n}^{j⋆}} and to update their respective weights using

   w_n^j = f(y_n|x_n^{j⋆}) / f(y_n|s_n^{j⋆})      (35.58)

The factor w_{n−1}^j does not appear multiplying the fraction on the right-hand side, as is also the case in listing (35.59), because w_{n−1}^j is already used in the initial scaling (35.57).

Auxiliary particle filter for model (35.1a)–(35.1c).
given initial state distribution x_0 ∼ f(x_0);
given conditional state distribution x_n ∼ f(x_n|x_{n−1});
given conditional observation distribution y_n ∼ f(y_n|x_n);
given statistic function s(·);
objective: estimate moments involving f(x_{0:n}|y_{0:n}) such as the conditional mean for the trajectory x_{0:n}.

sample J initial states: x_0^j ∼ f(x_0), j = 1, 2, . . . , J;
set their initial weights to w_0^j = 1/J, ∀ j.
repeat over n = 1, 2, 3, . . . :
   evaluate a statistic s_n^j, e.g., using (35.53) or (35.54)
   w_{n−1}^j ← w_{n−1}^j × f(y_n|s_n^j), ∀ j  (scaling)
   w_{n−1}^j ← w_{n−1}^j / Σ_{j'=1}^J w_{n−1}^{j'}, ∀ j  (normalization)
   resample trajectories using {w_{n−1}^j} with replacement and generate {x_{0:n−1}^{j⋆}} from {x_{0:n−1}^j}, ∀ j
   set w_{n−1}^j = 1/J, ∀ j
   sample J realizations for time n: x_n^{j⋆} ∼ f(x_n|x_{n−1}^{j⋆})
   augment trajectories from {x_{0:n−1}^{j⋆}} to {x_{0:n}^{j⋆}}, ∀ j
   w_n^j = f(y_n|x_n^{j⋆}) / f(y_n|s_n^{j⋆}), ∀ j
   w_n^j ← w_n^j / Σ_{j'=1}^J w_n^{j'}, ∀ j  (normalization)
   set {x_{0:n}^j} ← {x_{0:n}^{j⋆}}, ∀ j
end
return particles and weights {x_{0:n}^j, w_n^j}, ∀ j.      (35.59)
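A matching sketch of one time step of the auxiliary filter in listing (35.59), using the predicted state s_n^j = g(x_{n−1}^j) as the auxiliary statistic (which coincides with the conditional mean (35.52) for an additive zero-mean noise model); the same user-supplied g and lik as in the bootstrap sketch are assumed, and all names are illustrative.

import numpy as np

def auxiliary_step(particles, weights, y, g, lik, Rv_chol, rng=None):
    """One time step of the auxiliary particle filter (single-resampling form)."""
    if rng is None:
        rng = np.random.default_rng()
    J, d = particles.shape
    # auxiliary statistic s_n^j: predicted state for each particle
    s = np.array([g(x) for x in particles])
    # scale the previous weights by f(y_n | s_n^j), then normalize, as in (35.57)
    w = weights * np.array([lik(y, sj) for sj in s], dtype=float)
    w = w / w.sum()
    # resample trajectories (and their statistics) using the scaled weights
    idx = rng.choice(J, size=J, p=w)
    particles, s = particles[idx], s[idx]
    # propagate the resampled particles: x_n^{j*} = g(x_{n-1}^{j*}) + v_n
    noise = rng.standard_normal((J, d)) @ Rv_chol.T
    new_particles = np.array([g(x) for x in particles]) + noise
    # reweight by the likelihood ratio (35.58) and normalize
    w_new = np.array([lik(y, x) for x in new_particles], dtype=float)
    w_new = w_new / np.array([lik(y, sj) for sj in s], dtype=float)
    w_new = w_new / w_new.sum()
    return new_particles, w_new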

Example 35.2 (Tracking a moving object in 2D) We reconsider a variation of Example 30.10. An object is moving in 2D, as shown in Fig. 35.3. We denote its coordinates at any instant n by col{a_n, b_n}, where a_n is the horizontal coordinate and b_n is the vertical coordinate. The velocity components along the horizontal and vertical axes are denoted by col{v_{a,n}, v_{b,n}}. Therefore, at any instant in time, the state of the object is described by the four-dimensional state vector x_n = col{a_n, v_{a,n}, b_n, v_{b,n}}. We assume this vector evolves according to the dynamics:

   x_n = col{a_n, v_{a,n}, b_n, v_{b,n}} = F x_{n−1} + G u_n      (35.60)

where

   F = | 1  1  0  0 |        G = | 0.5   0  |
       | 0  1  0  0 |            |  1    0  |
       | 0  0  1  1 |            |  0   0.5 |
       | 0  0  0  1 |            |  0    1  |

where the perturbation u_n ∼ N_{u_n}(0, σ_u² I_2). The observations are noisy measurements of the angle viewed from an observer at the origin, i.e.,

   y_n = arctan(b_n/a_n) + v_n,   v_n ∼ N_v(0, σ_v²)      (35.61)

Figure 35.3 A moving object in 2D; its coordinates are denoted by (a_n, b_n) at time n; the speed components along the horizontal and vertical axes are denoted by (v_{a,n}, v_{b,n}). The bearing θ_n = arctan(b_n/a_n) is the angle viewed from the origin.

The initial state is Gaussian-distributed with

   x_0 ∼ N_{x_0}(x̄_0, Π_0),   x̄_0 = col{0, 0, 0.4, −0.05},   Π_0 = diag{0.25, 25 × 10⁻⁶, 0.09, 10⁻⁴}      (35.62)

We assume the numerical values

   σ_u² = 1 × 10⁻⁶,   σ_v² = 25 × 10⁻⁶      (35.63)

The actual trajectory is generated in the simulation by starting from the initial state vector

   x_0 = col{−0.05, 0.001, 0.7, −0.055}      (35.64)


Figure 35.4 shows the actual and estimated trajectories using J = 5000 particles and the bootstrap and auxiliary particle filters applied to model (35.60)–(35.61) over 0 ≤ n ≤ 100. The auxiliary variables s_n^j are computed by using s_n^j = F x_{n−1}^j.

Figure 35.4 Actual and estimated trajectories using J = 5000 particles and the bootstrap and auxiliary particle filters applied to model (35.60)–(35.61).
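The simulation can be sketched in Python as follows; this is an illustrative reconstruction under model (35.60)–(35.61), not the code used to produce Fig. 35.4 (the random seed and the resampling scheme are arbitrary choices, plotting is omitted, and arctan2 is used in place of arctan(b_n/a_n) to avoid quadrant and division issues).

import numpy as np

rng = np.random.default_rng(0)
F = np.array([[1., 1, 0, 0], [0, 1, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1]])
G = np.array([[0.5, 0.], [1, 0], [0, 0.5], [0, 1]])
sigma_u2, sigma_v2 = 1e-6, 25e-6
N, J = 100, 5000

# generate the actual trajectory and bearing observations from (35.60)-(35.61)
x = np.array([-0.05, 0.001, 0.7, -0.055])          # initial state (35.64)
obs = []
for n in range(N):
    x = F @ x + G @ (np.sqrt(sigma_u2) * rng.standard_normal(2))
    obs.append(np.arctan2(x[2], x[0]) + np.sqrt(sigma_v2) * rng.standard_normal())

# bootstrap particle filter (35.46)
xbar0 = np.array([0.0, 0.0, 0.4, -0.05])
Pi0 = np.diag([0.25, 25e-6, 0.09, 1e-4])           # prior (35.62)
parts = rng.multivariate_normal(xbar0, Pi0, size=J)
estimates = []
for y in obs:
    # propagate particles through the state model
    parts = parts @ F.T + (np.sqrt(sigma_u2) * rng.standard_normal((J, 2))) @ G.T
    # weight by the bearing likelihood and normalize
    w = np.exp(-0.5 * (y - np.arctan2(parts[:, 2], parts[:, 0])) ** 2 / sigma_v2)
    w = w + 1e-300
    w = w / w.sum()
    estimates.append(w @ parts)                    # weighted estimate of x_n
    parts = parts[rng.choice(J, size=J, p=w)]      # resample with replacement

Plotting (estimates[n][0], estimates[n][2]) against the true (a_n, b_n) reproduces the qualitative behavior shown in Fig. 35.4.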

35.4 COMMENTARIES AND DISCUSSION

Particle filters. These filters provide a recursive procedure for implementing Bayesian inference by means of the Monte Carlo technique. They are particularly well suited for scenarios involving nonlinear models and non-Gaussian signals, and they have found applications in a wide range of areas where these two features (nonlinearity and non-Gaussianity) are prevalent, including in guidance and control, robot localization, visual tracking of objects, and finance. One of the earliest attempts at applying sequential Monte Carlo to inference is the work by Rosenbluth and Rosenbluth (1956) in their study of polymer growth and molecular chains. The first author, Marshall Rosenbluth (1927–2003), was an American physicist who was awarded the US National Medal of Science in 1997. Both authors in Rosenbluth and Rosenbluth (1956) also appear as co-authors on the foundational paper by Metropolis et al. (1953), where the original version of the Metropolis–Hastings algorithm was put forward.

The basic form shown in (35.30) for the particle filter prior to the incorporation of resampling appeared in Handschin and Mayne (1969) and Handschin (1970). The bootstrap version (35.46) was introduced by Gordon, Salmond, and Smith (1993) and led to a flurry of interest in the field. This version limits the importance distribution to the conditional pdf of the state variable, i.e., it selects π_x(x_n) = f(x_n|x_{n−1}). The alternative construction, known as the auxiliary filter, was proposed by Pitt and Shephard (1999); it involves two resampling procedures and exploits the likelihood f_{y_n|x_n}(y_n|x_n). The version (35.59) involving a single resampling step is due to Carpenter, Clifford, and Fearnhead (1999).

Overviews on particle filters are given by Doucet, Godsill, and Andrieu (2000), Arulampalam et al. (2002), Gustafsson (2010), Doucet and Johansen (2011), Speekenbrink (2016), Sayed, Djuric, and Hlawatsch (2018), and Elfring, Torta, and van de Molengraft (2021). Overviews on sequential Monte Carlo methods and nonlinear filtering can be found in Doucet, de Freitas, and Gordon (2001), Crisan (2001), Cappe, Moulines, and Ryden (2005), Cappe, Godsill, and Moulines (2007), Liu (2008),


Crisan and Rozovskii (2011), and Künsch (2013). Discussion on the effective sample size to control the degeneracy problem in particle filters can be found in Kong, Liu, and Wong (1994) and Liu (2008). Convergence results for particle filters rely largely on central limit theorems and the law of large numbers, as was illustrated in Chapter 33 for the importance sampling and Monte Carlo methods – see, e.g., Chopin (2004). Discussions on multinomial, systematic, and residual replacement sampling can be found in Kong, Liu, and Wong (1994), Kitagawa (1996), Liu and Chen (1998), and Carpenter, Clifford, and Fearnhead (1999). The model used in Example 35.2 involving the tracking of a moving object in a two-dimensional space from bearing measurements is from Gordon, Salmond, and Smith (1993). In this reference, certain additional refinements, including “roughening” and “prior editing” are introduced to improve the simulation results. Some general results on the convergence behavior of particle filters appear in Del Moral (1996, 1998, 2004), Künsch (2005), Atar (2011), and Del Moral, Hu, and Wu (2012).

PROBLEMS

35.1 Consider the following linear state-space model, which we encountered in our studies of the Kalman filter:

   x_n = F x_{n−1} + G u_n,   u_n ∼ N_{u_n}(0, Q),   x_0 ∼ N_{x_0}(0, Π_0)
   y_n = H x_n + v_n,   v_n ∼ N_{v_n}(0, R)

for some matrices {F, G, H, Q, R, Π_0} of appropriate dimensions. The processes {u_n, v_m} are white noise and independent of each other and of x_0 for all n, m. Determine f_{x_n|x_{n−1}}(x_n|x_{n−1}) and f_{y_n|x_n}(y_n|x_n).
35.2 Write down the main recursions that result from applying the bootstrap particle filter (35.46) to the linear state-space model of the previous problem.
35.3 How would your answers to the first two problems change if the coefficient matrix F is actually random: It assumes the value F_a with probability p or the value F_b with probability 1 − p? Moreover, F_n is independent of all other random variables.
35.4 Consider the following so-called stochastic volatility model used in finance (see, e.g., Kim, Shephard, and Chib (1998)):

   x_n = a x_{n−1} + v_n,   v_n ∼ N_{v_n}(0, σ_v²),   x_0 ∼ N_{x_0}(0, σ_v²/(1 − a²))
   y_n = b u_n exp(x_n/2),   u_n ∼ N_{u_n}(0, 1)

Show that this model satisfies assumptions (35.1a)–(35.1c) with

   f_{x_n|x_{n−1}}(x_n|x_{n−1}) = N_{x_n}(a x_{n−1}, σ_v²),   f_{y_n|x_n}(y_n|x_n) = N_{y_n}(0, b² e^{x_n})

Determine the marginal pdf f_{x_n}(x_n).
35.5 Write down the main recursions that result from applying the bootstrap particle filter (35.46) to the stochastic volatility model of the previous problem.
35.6 Refer to the discussion in Section 35.1.2. Verify that we can predict the future observation by using the distribution

   f_{y_n|y_{0:n−1}}(y_n|y_{0:n−1}) = ∫_x f_{x_{n−1}|y_{0:n−1}}(x_{n−1}|y_{0:n−1}) f_{x_n|x_{n−1}}(x_n|x_{n−1}) f_{y_n|x_n}(y_n|x_n) dx_{n−1} dx_n


35.7 For the same discussion in Section 35.1.2, show further that the likelihood function of the observations satisfies

   f_{y_{0:n}}(y_{0:n}) = f_{y_0}(y_0) × ∏_{m=1}^n f_{y_m|y_{0:m−1}}(y_m|y_{0:m−1})

where each of the factors inside the product is given by

   f_{y_m|y_{0:m−1}}(y_m|y_{0:m−1}) = ∫_x f_{x_{m−1}|y_{0:m−1}}(x_{m−1}|y_{0:m−1}) f_{x_m|x_{m−1}}(x_m|x_{m−1}) f_{y_m|x_m}(y_m|x_m) dx_m dx_{m−1}

35.8 Refer to the description of the systematic resampling procedure leading to Fig. 35.2. Recall that n_j denotes the number of times that the jth particle ends up appearing in the resampled trajectories. Show that this resampling procedure is unbiased in the sense that E n_j = J w_n^j.
35.9 Refer to the assumed factorization in (35.26) for the importance distribution.
(a) Compute the expectation relative to the distribution π(x_n^j|x_{0:n−1}^j, y_{0:n}) and verify that, conditioned on the past,

   E w_n^j = w_{n−1}^j × f(y_n|x_{n−1}^j) / f(y_n|y_{0:n−1})

(b) Verify similarly that, conditioned on the past, the variance of w_n^j is given by

   var(w_n^j) = [ (w_{n−1}^j)² / (f(y_n|y_{0:n−1}))² ] { ∫_{x_n} (f(x_n^j, y_n|x_{n−1}^j))² / π(x_n^j|x_{0:n−1}^j, y_{0:n}) dx_n − (f(y_n|x_{n−1}^j))² }

(c) Conclude that the variance is zero when the importance sampling factor is selected according to π(x_n^j|x_{0:n−1}^j, y_{0:n}) = f(x_n^j|x_{n−1}^j, y_n).
Remark. For a related discussion, the reader may refer to Zaritskii, Svetnik, and Shimelevich (1975), where the form of the optimal importance distribution first appeared, as well as Akashi and Kumamoto (1977).
35.10 According to the result of the previous problem, what are the optimal choices for the importance sampling factor for the models in Probs. 35.1 and 35.4?
35.11 Ignore the observation equation (35.1c) and consider solely the state evolution according to the first-order Markovian model (35.1a). We wish to apply the importance sampling procedure to the joint pdf f_{x_{0:n}}(x_{0:n}) to generate J trajectories or particles. Assume the importance distribution factorizes as follows:

   π_{x_{0:n}}(x_{0:n}) = π_{x_{0:n−1}}(x_{0:n−1}) × π_{x_n|x_{n−1}}(x_n|x_{n−1})

(a) Repeat the argument that led to (35.27) to show that the weights for the respective particles are now given by

   w_n^j ∝ w_{n−1}^j × f(x_n^j|x_{n−1}^j) / π(x_n^j|x_{n−1}^j)

(b) Let the symbol F_{n−1} represent the collection of all possible random events generated by the states up to time n − 1 (more formally, F_{n−1} is the filtration generated by these random variables). The construction of w_n^j depends on elements from F_{n−1} and on {x_n^j}. The sample x_n^j is generated at random according to the distribution x_n^j ∼ π(x_n^j|x_{n−1}^j). Verify that w_n^j is a martingale process, namely, it satisfies E_π(w_n^j|F_{n−1}) = w_{n−1}^j.
(c) Conclude that var(w_n^j) ≥ var(w_{n−1}^j), so that the variance of the weights is nondecreasing over time.
Remark. This conclusion is a special case of a result in Kong, Liu, and Wong (1994) involving both state vectors and observations.


35.12 Use the Bayes rule to explain why assumption (35.26) is an approximate factorization for the proposal distribution.

REFERENCES Akashi, H. and H. Kumamoto (1977), “Random sampling approach to state estimation in switching environments,” Automatica, vol. 13, pp. 429–434. Arulampalam, S., S. Maskell, S. N. Gordon, and T. Clapp (2002), “A tutorial on particle filters for on-line nonlinear/non-Gaussian Bayesian tracking,” IEEE Trans. Signal Process., vol. 50, no. 2, pp. 174–188. Atar, R. (2011), “Exponential decay rate of the filter’s dependence on the initial distribution,” in The Oxford Handbook of Nonlinear Filtering, D. Crisan and B. Rozovskii, editors, pp. 299–318, Oxford University Press. Cappe, O., S. Godsill, and E. Moulines (2007), “An overview of existing methods and recent advances in sequential Monte Carlo,” Proc. IEEE, vol. 95, pp. 899–924. Cappe, O., E. Moulines, and T. Ryden (2005), Inference in Hidden Markov Models, Springer. Carpenter, J., P. Clifford, and P. Fearnhead (1999), “An improved particle filter for non-linear problems,” IEEE Proc. Radar Sonar Navig., vol. 146, no. 1, pp. 2–7. Chopin, N. (2004), “Central limit theorem for sequential Monte Carlo methods and its application to Bayesian inference,” Ann. Statist., vol. 32, pp. 2385–2411. Crisan, D. (2001), “Particle filters: A theoretical perspective,” in Sequential Monte Carlo Methods in Practice, A. Doucet, N. de Freitas, and N. Gordon, editors, pp. 17–41, Springer. Crisan, D. and B. Rozovskii, editors (2011), Handbook on Nonlinear Filtering, Oxford University Press. Del Moral, P. (1996), “Non linear filtering: Interacting particle solution,” Markov Process. Related Fields, vol. 2, no. 4, pp. 555–580. Del Moral, P. (1998), “Measure valued processes and interacting particle systems: Application to non linear filtering problems,” Ann. Appl. Probab., vol. 8, no. 2, pp. 438–495. Del Moral, P. (2004), Feynman–Kac Formulae: Genealogical and Interacting Particle Systems with Applications, Springer. Del Moral, P., P. Hu, and L. Wu (2012), On the Concentration Properties of Interacting Particle Processes, NOW Publishers. Doucet, A., N. de Freitas, and N. Gordon, editors (2001), Sequential Monte Carlo Methods in Practice, Springer. Doucet, A., S. Godsill, and C. Andrieu (2000), “On sequential Monte Carlo sampling methods for Bayesian filtering,” Statist. Comput., vol. 10, pp. 197–208. Doucet, A. and A. M. Johansen (2011), “A tutorial on particle filtering and smoothing: Fifteen years later,” in The Oxford Handbook of Nonlinear Filtering, D. Crisan and B. Rozovskii, editors, pp. 656–704, Oxford University Press. Elfring, J., E. Torta, and R. van de Molengraft (2021), “Particle filters: A hands-on tutorial,” Sensors, vol. 21, 438. Gordon, N., D. Salmond, and A. F. M. Smith (1993), “Novel approach to nonlinear and non-Gaussian Bayesian state estimation,” Proc. Inst. Elect. Eng., F, vol. 140, pp. 107–113. Gustafsson, F. (2010), “Particle filter theory and practice with positioning applications,” IEEE Aerosp. Electro. Syst. Mag., vol. 25, no. 7, pp. 53–82. Handschin, J. E. (1970), “Monte Carlo techniques for prediction and filtering of nonlinear stochastic processes,” Automatica 6, pp. 555–563.


Handschin, J. E. and D. Q. Mayne (1969), “Monte Carlo techniques to estimate the conditional expectation in multi-stage non-linear filtering,” Int. J. Control, vol. 9, pp. 547–559. Kim, S., N. Shephard, and S. Chib (1998), “Stochastic volatility: Likelihood inference and comparison with ARCH models,” Rev. Econ. Studies, vol. 65, pp. 361–393. Kitagawa, G. (1996), “Monte Carlo filter and smoother for non-Gaussian nonlinear state space models,” J. Comput. Graph. Statist., vol. 5, no. 1, pp. 1–25. Kong, A., J. S. Liu, and W. H. Wong (1994), “Sequential imputations and Bayesian missing data problems,” J. Amer. Statist. Assoc., vol. 89, pp. 278–288. Künsch, H. R. (2005), “Recursive Monte Carlo filters: Algorithms and theoretical analysis,” Ann. Statist., vol. 33, pp. 1983–2021. Künsch, H. R. (2013), “Particle filters,” Bernoulli, vol. 19, no. 4, pp. 1391–1403. Liu, J. S. (2008), Monte Carlo Strategies in Scientific Computing, Springer. Liu, J. S. and R. Chen (1998), “Sequential Monte Carlo methods for dynamical systems,” J. Amer. Statist. Assoc., vol. 93, pp. 1032–1044. Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller (1953), “Equations of state calculations by fast computing machines,” J. Chem. Phys., vol. 21, pp. 1087–1091. Pitt, M. and N. Shephard (1999), “Filtering via simulation: Auxiliary particle filters,” J. Amer. Statist. Assoc., vol. 94, no. 446, pp. 590–599. Rosenbluth, M. N. and A. W. Rosenbluth (1956), “Monte Carlo calculation of the average extension of molecular chains,” J. Chem. Phys., vol. 23, no. 2, pp. 356–359. Sayed, A. H., P. Djuric, and F. Hlawatsch (2018), “Distributed Kalman and particle filtering,” in Cooperative and Graph Signal Processing, P. Djuric and C. Richard, editors, pp. 169–207, Academic Press. Speekenbrink, M. (2016), “A tutorial on particle filters,” J. Math. Psychol., vol. 73, pp. 140–152. Zaritskii, V. S., V. B. Svetnik, and L. I. Shimelevich (1975), “Monte Carlo technique in problems of optimal data processing,” Aut. Remote Control, vol. 12, pp. 95–103.

36 Variational Inference

In Chapters 33 and 34 we described three methods for approximating posterior distributions: the Laplace method, the Markov chain Monte Carlo (MCMC) method, and the expectation-propagation (EP) method. Given an observable y and a latent variable z, the Laplace method approximates fz|y (z|y) by a Gaussian distribution and was seen to be suitable for problems with small-dimensional latent spaces because its implementation involves a matrix inversion. The Gaussian approximation, however, is not sufficient in many instances and can perform poorly. The MCMC method is more powerful, and also more popular, and relies on elegant sampling techniques and the Metropolis–Hastings algorithm. However, MCMC requires a large number of samples, does not perform well on complex models, and does not scale well to higher dimensions and large datasets. The EP method, on the other hand, limits the class of distributions from which the posterior is approximated to the Gaussian or exponential families, and can be analytically demanding. In this chapter, we develop a fourth powerful method for posterior approximation known as variational inference. One of its advantages is that it usually scales better to large datasets and large dimensions.

36.1 EVALUATING EVIDENCES

Recall that the main challenge is to evaluate posterior distributions, f_{z|y}(z|y), without the need to evaluate the evidence, f_y(y), which appears in the denominator of the relations:

   f_{z|y}(z|y) = f_{y,z}(y, z) / f_y(y) = f_z(z) f_{y|z}(y|z) / f_y(y)      (36.1)

We already illustrated this difficulty by considering data arising from logit and probit models in Chapter 33. We consider additional examples next, which will be used as references in the subsequent discussions. These examples will also help reveal connections to mixture models, which were used earlier in the study of the expectation maximization (EM) algorithm in Chapter 32.


Example 36.1 (Difficulty in evaluating the evidence – I) We consider a generative model for the observations y in the form of a Bayesian mixture of Gaussian models. Specifically, we let (µ_1, µ_2) denote two mean parameters that are selected independently from two separate Gaussian distributions with means (µ̄_1, µ̄_2) and variances (σ_1², σ_2²), i.e.,

   µ_1 ∼ N_{µ_1}(µ̄_1, σ_1²),   µ_2 ∼ N_{µ_2}(µ̄_2, σ_2²)      (36.2)

We also let k denote a Bernoulli variable that assumes the values k ∈ {1, 2} with probabilities:

   P(k = 1) = p,   P(k = 2) = 1 − p,   p > 0      (36.3)

Since k assumes the values {1, 2} rather than the usual values {0, 1} for Bernoulli variables, its probability density function (pdf) is given by

   f_k(k) = p^{2−k}(1 − p)^{k−1} ≜ Bernoulli(p)      (36.4)

The model for generating the observation y is the following (see Fig. 36.1):

   (generative model)
   generate two random means µ_1 ∼ N_{µ_1}(µ̄_1, σ_1²) and µ_2 ∼ N_{µ_2}(µ̄_2, σ_2²)
   select a model index k ∈ {1, 2} according to Bernoulli(p)
   generate the observation y ∼ N_y(µ_k, σ²_{y,k})      (36.5)

where the variance of y is also dependent on the index k; it is either σ²_{y,1} when k = 1 or σ²_{y,2} when k = 2. The values (σ²_{y,1}, σ²_{y,2}) are known. In this way, the variance of y, denoted by σ²_y, is itself another Bernoulli random variable with

   P(σ²_y = σ²_{y,1}) = p,   P(σ²_y = σ²_{y,2}) = 1 − p      (36.6)

The main difference in model (36.5) in relation to the Gaussian mixture models considered earlier in Chapter 32 is that we are now allowing the parameters (µ_1, µ_2) to be random variables with Gaussian priors of their own; thus, each realization for y would arise from some random choice for (µ_1, µ_2). As a result, the model now involves three latent variables, denoted by z = (k, µ_1, µ_2); one of the variables is discrete while the other two are continuous. The variable σ²_y is attached to k; once the value of k is known then the value of σ²_y is also known. For this reason we do not include σ²_y in the list of latent variables. There are many examples that fit into this mixture model. For instance, let y refer to the systolic blood pressure measure of an individual, assumed to be Gaussian distributed. Its mean value can be dependent on whether the patient has a particular medical condition (model 1) or not (model 2). Now, according to the Bayes rule (3.39), if we were to determine the conditional pdf of the latent variables given the observation (i.e., the posterior), we would need to evaluate the ratio:

   f_{k,µ_1,µ_2|y}(k, µ_1, µ_2|y) = f_{y,k,µ_1,µ_2}(y, k, µ_1, µ_2) / f_y(y)      (36.7)

The numerator is easy to compute due to independence since, assuming the parameters (σ²_{y,1}, σ²_{y,2}, σ_1², σ_2²) and (µ̄_1, µ̄_2, p) are known:

   f_{y,k,µ_1,µ_2}(y, k, µ_1, µ_2) = f_{µ_1}(µ_1) × f_{µ_2}(µ_2) × f_k(k) × f_{y|k,µ_1,µ_2}(y|k, µ_1, µ_2)
      = N_{µ_1}(µ̄_1, σ_1²) × N_{µ_2}(µ̄_2, σ_2²) × p^{2−k}(1 − p)^{k−1} × N_y(µ_k, σ²_{y,k})      (36.8)


Figure 36.1 Illustration of the generative model (36.5). A switch selects between µ_1 ∼ N_{µ_1}(µ̄_1, σ_1²) with probability p and µ_2 ∼ N_{µ_2}(µ̄_2, σ_2²) with probability 1 − p.

The difficulty in (36.7) lies in computing the evidence f_y(y), which appears in the denominator, since it involves marginalizing (36.8) over the variables {k, µ_1, µ_2} to get:

   f_y(y) = ∫_{µ_1,µ_2} ( Σ_{k=1}^2 f_{y,k,µ_1,µ_2}(y, k, µ_1, µ_2) ) dµ_1 dµ_2
          = ∫_{µ_1,µ_2} ( Σ_{k=1}^2 N_{µ_1}(µ̄_1, σ_1²) × N_{µ_2}(µ̄_2, σ_2²) × p^{2−k}(1 − p)^{k−1} × N_y(µ_k, σ²_{y,k}) ) dµ_1 dµ_2
          = p ∫_{µ_1,µ_2} N_{µ_1}(µ̄_1, σ_1²) × N_{µ_2}(µ̄_2, σ_2²) × N_y(µ_1, σ²_{y,1}) dµ_1 dµ_2
            + (1 − p) ∫_{µ_1,µ_2} N_{µ_1}(µ̄_1, σ_1²) × N_{µ_2}(µ̄_2, σ_2²) × N_y(µ_2, σ²_{y,2}) dµ_1 dµ_2
          = p ∫_{µ_1} N_{µ_1}(µ̄_1, σ_1²) × N_y(µ_1, σ²_{y,1}) dµ_1 + (1 − p) ∫_{µ_2} N_{µ_2}(µ̄_2, σ_2²) × N_y(µ_2, σ²_{y,2}) dµ_2      (36.9)


That is,

   f_y(y) = [ p / (2π(σ_1² σ²_{y,1})^{1/2}) ] A + [ (1 − p) / (2π(σ_2² σ²_{y,2})^{1/2}) ] B      (36.10)

where

   A ≜ ∫_{µ_1} exp{ −(1/2)[ (µ_1 − µ̄_1)²/σ_1² + (y − µ_1)²/σ²_{y,1} ] } dµ_1
   B ≜ ∫_{µ_2} exp{ −(1/2)[ (µ_2 − µ̄_2)²/σ_2² + (y − µ_2)²/σ²_{y,2} ] } dµ_2

Each of the last two integrals can be evaluated in closed form for this example. This is not generally the case for other generative models and/or probability distributions. Even when the evaluation is possible, it can still be tedious, as illustrated by continuing with the calculations in this case. We consider one of the integrals and the same derivation will apply to the other integral. Consider the exponent that appears in the exponential for the term designated by A. Introduce, for convenience, the centered variables

   µ_c ≜ µ_1 − µ̄_1,   y_c ≜ y − µ̄_1      (36.11)

and note that

   −(1/2)[ (µ_1 − µ̄_1)²/σ_1² + (y − µ_1)²/σ²_{y,1} ]
      = −(1/2)[ µ_c²/σ_1² + (y_c − µ_c)²/σ²_{y,1} ]
      = −(1/(2σ_1²))[ µ_c² + (σ_1²/σ²_{y,1})(y_c − µ_c)² ]
      = −(1/(2σ_1²))[ µ_c² + (σ_1²/σ²_{y,1})(µ_c² + y_c² − 2µ_c y_c) ]
      = −(1/(2σ_1²))[ (1 + σ_1²/σ²_{y,1}) µ_c² + (σ_1²/σ²_{y,1}) y_c² − (2σ_1²/σ²_{y,1}) µ_c y_c ]
     (a)
      = −(b/(2σ_1²))[ µ_c² + (σ_1²/(b σ²_{y,1})) y_c² − (2σ_1²/(b σ²_{y,1})) µ_c y_c ]
      = −(b/(2σ_1²))[ ( µ_c − (σ_1²/(b σ²_{y,1})) y_c )² + ( (1/b)(σ_1²/σ²_{y,1}) − (1/b²)(σ_1⁴/σ⁴_{y,1}) ) y_c² ]      (36.12)

where in step (a) we introduced

   b ≜ 1 + σ_1²/σ²_{y,1} = (σ²_{y,1} + σ_1²)/σ²_{y,1}      (36.13)

Let σ²_{H,1} denote the harmonic mean of σ_1² and σ²_{y,1}, i.e.,

   σ²_{H,1} ≜ 2 / [ (1/σ_1²) + (1/σ²_{y,1}) ]      (36.14)

Then, some algebra shows that (36.12) simplifies to

   −(1/2)[ µ_c²/σ_1² + (y_c − µ_c)²/σ²_{y,1} ] = −(1/σ²_{H,1})( µ_c − (σ_1²/(σ²_{y,1} + σ_1²)) y_c )² − (1/2) y_c²/(σ²_{y,1} + σ_1²)      (36.15)

and, consequently, the term designated by the letter A in (36.10) evaluates to

   A = ∫_{µ_1} exp{ −(1/2)( µ_c − (σ_1²/(σ²_{y,1}+σ_1²)) y_c )²/(σ²_{H,1}/2) − (1/2) y_c²/(σ²_{y,1}+σ_1²) } dµ_1
     (b)
     = exp{ −(1/2) y_c²/(σ²_{y,1}+σ_1²) } × √(2πσ²_{H,1}/2) × ∫_{µ_1} (1/√(2πσ²_{H,1}/2)) exp{ −(1/2)( µ_c − (σ_1²/(σ²_{y,1}+σ_1²)) y_c )²/(σ²_{H,1}/2) } dµ_1
     = √(πσ²_{H,1}) exp{ −(1/2)(y − µ̄_1)²/(σ²_{y,1}+σ_1²) }      (36.16)

where the integral in step (b) evaluates to 1 because it integrates a Gaussian pdf over the entire range of µ_1 (or, equivalently, µ_c). A similar argument will show that the term B in (36.10) evaluates to

   B = √(πσ²_{H,2}) exp{ −(1/2)(y − µ̄_2)²/(σ²_{y,2}+σ_2²) }      (36.17)

where

   σ²_{H,2} ≜ 2 / [ (1/σ_2²) + (1/σ²_{y,2}) ]      (36.18)

Combining the results, some algebra leads to

   f_y(y) = p N_y(µ̄_1, σ_1² + σ²_{y,1}) + (1 − p) N_y(µ̄_2, σ_2² + σ²_{y,2})      (36.19)

In other words, the evidence has the form of a Gaussian mixture model; it is a combination of two Gaussian distributions with means (µ̄_1, µ̄_2) and variances (σ_1² + σ²_{y,1}, σ_2² + σ²_{y,2}). Figure 36.2 simulates this calculation for two situations. In the first case, we set σ_1² = 2, σ_2² = 10, σ²_{y,1} = 1, σ²_{y,2} = 3, µ̄_1 = 3, µ̄_2 = 15, and p = 0.3. We generate N = 10,000 realizations {y_n}. For each realization: (a) we generate two random means µ_1 and µ_2 from the Gaussian distributions N_{µ_1}(3, 2) and N_{µ_2}(15, 10); (b) we pick an index k ∈ {1, 2} according to Bernoulli(p); and (c) we generate y_n from N_y(µ_k, σ²_{y,k}). The figure shows a normalized histogram for the 10,000 realizations along with the (smooth) curve corresponding to the evidence (36.19). There are 2937 samples under model k = 1 and 7063 samples under model k = 2.

Using expression (36.19) we can evaluate the conditional pdf f_{k,µ_1,µ_2|y}(k, µ_1, µ_2|y), as well as other useful conditional pdfs. In particular, we can determine the responsibility for each model k in explaining the observation y as follows. Using the Bayes rule and the fact that k is independent of µ_1 and µ_2, we have (we are dropping the subscripts from the pdf functions to simplify the notation):

   f(y, k|µ_1, µ_2) = f(y|µ_1, µ_2) P(k = k|y = y, µ_1, µ_2) = f(y|µ_1, µ_2) P(k = k|y = y)
                    = P(k = k|µ_1, µ_2) f(y|k, µ_1, µ_2) = P(k = k) f(y|k, µ_1, µ_2)      (36.20)

and, hence,

   P(k = 1|y = y) = P(k = 1) f(y|k = 1, µ_1, µ_2) / f(y|µ_1, µ_2)      (36.21)


Figure 36.2 Normalized histogram of N = 10,000 realizations {y_n} generated according to the Bayesian mixture model with σ_1² = 2, σ_2² = 10, σ²_{y,1} = 1, σ²_{y,2} = 3, µ̄_1 = 3, µ̄_2 = 15, and p = 0.3. The solid curve plots the resulting evidence from (36.19).
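The histogram and the evidence curve in Fig. 36.2 can be regenerated along the following lines; this is a sketch of the simulation described above with the stated parameter values, not the original code, and the variable names are illustrative.

import numpy as np

rng = np.random.default_rng(0)
mu1_bar, mu2_bar = 3.0, 15.0
s1, s2 = 2.0, 10.0            # prior variances sigma_1^2, sigma_2^2
sy1, sy2 = 1.0, 3.0           # observation variances sigma_{y,1}^2, sigma_{y,2}^2
p, N = 0.3, 10_000

# draw N realizations according to the generative model (36.5)
mu1 = mu1_bar + np.sqrt(s1) * rng.standard_normal(N)
mu2 = mu2_bar + np.sqrt(s2) * rng.standard_normal(N)
k = np.where(rng.uniform(size=N) < p, 1, 2)
y = np.where(k == 1, mu1, mu2) + np.sqrt(np.where(k == 1, sy1, sy2)) * rng.standard_normal(N)

# evidence (36.19): a mixture of two Gaussians in y
def gauss(t, m, v):
    return np.exp(-0.5 * (t - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

grid = np.linspace(-5, 30, 400)
evidence = p * gauss(grid, mu1_bar, s1 + sy1) + (1 - p) * gauss(grid, mu2_bar, s2 + sy2)
# a normalized histogram of y, e.g., np.histogram(y, bins=60, density=True),
# should track the curve `evidence` over `grid`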

That is,

   P(k = 1|y = y) = p N_y(µ_1, σ²_{y,1}) / [ p N_y(µ_1, σ²_{y,1}) + (1 − p) N_y(µ_2, σ²_{y,2}) ]      (36.22)

In this expression, the scalars (µ_1, µ_2) correspond to the means used to generate the observation y. For the realizations used to generate Fig. 36.2, this calculation led to a misclassification error on the order of 0.35%. It is not difficult to envision that slight variations in the generative model (36.5) can make the computation of the evidence f_y(y) more demanding than what we have seen here, or even impossible to obtain in closed form. For example, what if the means (µ_1, µ_2) are selected according to some non-Gaussian distribution? The calculations can quickly become more cumbersome.

Example 36.2 (Difficulty in evaluating the evidence – II) Let p denote a probability value that is chosen uniformly from within the interval p ∼ U[0, 1]. The quantity p plays the role of a latent variable. Let y ∈ {0, 1} denote a Bernoulli random variable with probability of success given by the chosen p, i.e., P(y = 1|p = p) = p. For example, the value of p can represent the likelihood that a member of Congress will vote in favor of a particular resolution. Assume we collect N iid observations {y_n}. The likelihood of these observations is given by:

   f_{y_{1:N}|p}(y_{1:N}|p) = ∏_{n=1}^N p^{y_n}(1 − p)^{1−y_n}      (36.23)

where the notation y_{1:N} refers to the collection of N observations {y_1, y_2, . . . , y_N}. The prior distribution is f_p(p) = U[0, 1]. It follows that the posterior is given by

   f_{p|y_{1:N}}(p|y_{1:N}) = f_p(p) f_{y_{1:N}|p}(y_{1:N}|p) / f_{y_{1:N}}(y_{1:N}) ∝ p^{a−1}(1 − p)^{b−1},   p ∈ [0, 1]      (36.24)

where we introduced

   a ≜ 1 + Σ_{n=1}^N y_n,   b ≜ N + 1 − Σ_{n=1}^N y_n      (36.25)


The evidence (the term that appears in the denominator of (36.24)) is obtained by marginalizing (integrating) the numerator over p:

   f_{y_{1:N}}(y_{1:N}) = ∫_0^1 f_p(p) f_{y_{1:N}|p}(y_{1:N}|p) dp ∝ ∫_0^1 p^{a−1}(1 − p)^{b−1} dp      (36.26)

This integral is challenging to evaluate. However, in this particular case, by comparing expression (36.24) to the general form of a beta distribution, given earlier in (31.32), we deduce that f_{p|y_{1:N}}(p|y_{1:N}) can be normalized to the form of a beta distribution, namely,

   f_{p|y_{1:N}}(p|y_{1:N}) = [ Γ(a + b) / (Γ(a)Γ(b)) ] p^{a−1}(1 − p)^{b−1},   0 ≤ p ≤ 1      (36.27)

in terms of the gamma function defined in Prob. 4.3. As a result, it will hold that the evidence in this case is given by

   f_{y_{1:N}}(y_{1:N}) = Γ(a)Γ(b) / Γ(a + b)      (36.28)
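As a quick numerical companion to (36.23)–(36.28), the following sketch forms the Beta(a, b) posterior and the log-evidence from simulated Bernoulli data; the variable names and the chosen value of p are illustrative.

import numpy as np
from math import lgamma

rng = np.random.default_rng(0)
p_true, N = 0.7, 50
y = (rng.uniform(size=N) < p_true).astype(int)   # Bernoulli observations

a = 1 + y.sum()                                  # (36.25)
b = N + 1 - y.sum()

# posterior (36.27) evaluated on a grid, and log-evidence from (36.28)
p_grid = np.linspace(1e-3, 1 - 1e-3, 500)
log_post = (lgamma(a + b) - lgamma(a) - lgamma(b)
            + (a - 1) * np.log(p_grid) + (b - 1) * np.log(1 - p_grid))
log_evidence = lgamma(a) + lgamma(b) - lgamma(a + b)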

36.2 EVALUATING POSTERIOR DISTRIBUTIONS

We illustrated in the last two examples how the computation of the evidence, f_y(y), can be demanding. Moving forward, we will devise a technique that allows us to approximate posterior distributions without the need to evaluate the evidence directly. For generality, we will use in this section the letter z to refer to the collection of all latent variables in an inference problem. For instance, in Example 36.1, this latent variable consists of the three components (k, µ_1, µ_2). The problem we will be dealing with is the following.

(Approximating posterior distributions) Given the joint pdf f_{y,z}(y, z), where y is observable and z is hidden, we would like to approximate the posterior f_{z|y}(z|y) by some function q_z(z) without the need to evaluate the evidence f_y(y).

To be more explicit, we should refer to the approximation by writing q_{z|y}(z|y) instead of q_z(z) in order to emphasize that we are approximating a posterior distribution that is conditioned on y. However, it is sufficient for our purposes to use the lighter notation q_z(z) since the posterior is a function of z. In the remainder of this chapter, we will develop the methodology of variational inference (VI) to determine q_z(z), which is also referred to as the variational factor. Variational inference seeks the distribution q_z(z) that is closest to f_{z|y}(z|y) in the Kullback–Leibler (KL) divergence sense:

   q_z^⋆(z) ≜ argmin_{q_z(·)} D_KL( q_z(z) ‖ f_{z|y}(z|y) )      (36.29)

This is in contrast to the EP construction from the previous chapter, which minimizes the reverse KL divergence, D_KL( f_{z|y}(z|y) ‖ q_z(z) ) – see (34.44). Problem (36.29) is challenging to solve because the argument f_{z|y}(z|y) is unknown. However, we established a fundamental equality earlier in (6.164) relating the KL divergence of two distributions to a second quantity called the evidence lower bound (ELBO), which we denoted by L. Specifically, we showed that:

   D_KL( q_z(z) ‖ f_{z|y}(z|y) ) + L( q_z(z) ‖ f_{y,z}(y, z) ) = ln f_y(y)      (36.30)

where, by definition,

   L( q_z(z) ‖ f_{y,z}(y, z) ) ≜ ∫_{z∈Z} q_z(z) ln[ f_{y,z}(y, z) / q_z(z) ] dz
                              = E_q{ ln[ f_{y,z}(y, z) / q_z(z) ] }
                              = E_q[ ln f_{y,z}(y, z) ] − E_q[ ln q_z(z) ]      (36.31)

and the notation E_q refers to expectation relative to the distribution q_z(z). The last term in (36.31) is the (differential) entropy of the variable z relative to the distribution q_z(z):

   H(z) ≜ −∫_{z∈Z} q_z(z) ln q_z(z) dz = −E_q[ ln q_z(z) ]      (36.32)

If we drop the arguments, we can rewrite (36.30) more succinctly in the form

   D_KL + L(q) = ln f_y(y)      (36.33)

where we are writing L(q) to indicate that L changes with the choice of q. Now recall that the KL divergence is always nonnegative. It follows from (36.33) that

   L(q) ≤ ln f_y(y)      (36.34)

which explains why the term L(q) is referred to as the evidence lower bound or ELBO(q); it provides a lower bound for the (natural) logarithm of the evidence. The general result (36.33) states that the KL divergence and ELBO always add up to the (natural) logarithm of the evidence. By changing qz (z), the values of both DKL and L will change, with one of them increasing and the other decreasing in such a way that their sum remains invariant. For this reason, the minimization problem (36.29) can be replaced by the equivalent maximization problem:

   q_z^⋆(z) ≜ argmax_{q_z(·)} L( q_z(z) ‖ f_{y,z}(y, z) )      (36.35)

where the KL divergence is replaced by the ELBO function. If we compare with (36.29), we observe that the main difference is that the unknown posterior fz|y (z|y) in (36.29) has been replaced by the known joint pdf fy,z (y, z) in (36.35). Problem (36.35) continues to be challenging to solve, as we explain in the following. However, since it replaces fz|y (z|y) by fy,z (y, z), this substitution will facilitate the search for qz? (z). It is worth noting that the ELBO, L(q), is generally a nonconvex function of its argument so that multiple local maxima may exist for (36.35).

36.3 MEAN-FIELD APPROXIMATION

Let the latent variable z have dimension M and denote its individual entries by z = col{z_m}_{m=1}^M. One useful approach to solving (36.35) is to rely on the mean-field approximation, which limits the class of distributions q_z(z) to those that can be expressed in multiplicative form as follows:

   q_z(z) = ∏_{m=1}^M q_{z_m}(z_m)   (mean-field approximation)      (36.36)

In this formulation, q_z(z) is expressed as the product of M elementary functions, with each one of them depending solely on one of the entries of z. For simplicity of notation, we will write q_m(·) instead of q_{z_m}(·). Thus, under the mean-field approximation model:

   q_z(z) approximates f_{z|y}(z|y)
   q_1(z_1) approximates f_{z_1|y}(z_1|y)
   q_2(z_2) approximates f_{z_2|y}(z_2|y)
     ⋮
   q_M(z_M) approximates f_{z_M|y}(z_M|y)      (36.37)

The mean-field model (36.36) implicitly assumes that the individual entries of the latent variable z are “independent” of each other. This is obviously not true in general. Still, the mean-field model is commonly used in practice and leads to reasonable approximations. Under this assumption, problem (36.35) becomes more tractable. To illustrate how it can be solved, we limit the presentation to the case M = 3 to simplify the derivation. The final conclusion is applicable more broadly and will be stated for general M .


36.3.1 Motivation

The objective is to maximize the ELBO, L(q), in (36.35). Under (36.36), it is possible to maximize L(q) over one q_m(·) at a time. Assume for the time being that the optimal choices for q_2(·) and q_3(·) have been determined, and denote them by q_2^⋆(·) and q_3^⋆(·), respectively. Our objective is to maximize L(q) over q_1(·). Thus, note using (36.31) that the expression for L(q) is given by

   L ≜ E_{q_1,q_2^⋆,q_3^⋆}[ ln f_{y,z}(y, z) ] − E_{q_1,q_2^⋆,q_3^⋆}[ ln q_1(z_1) q_2^⋆(z_2) q_3^⋆(z_3) ]
     = E_{q_1}{ E_{q_2^⋆,q_3^⋆}[ ln f_{y,z}(y, z) ] } − E_{q_1}[ ln q_1(z_1) ] − E_{q_2^⋆,q_3^⋆}[ ln q_2^⋆(z_2) q_3^⋆(z_3) ]      (36.38)

where the last term, which we denote by C, is a constant that is independent of z_1. We would like to maximize L over q_1(z_1). To facilitate the computation, we express the inner expectation over (q_2^⋆, q_3^⋆) as the logarithm of some pdf. To do so, we express C as the sum of two constants, C = C_1 + C_2, for some values C_1 and C_2 to be specified in the following, and introduce the function g_{y,z_1}(y, z_1) defined by:

   g_{y,z_1}(y, z_1) ≜ exp{ E_{q_2^⋆,q_3^⋆}[ ln f_{y,z}(y, z) ] − C_1 }
                    = e^{−C_1} exp{ E_{q_2^⋆,q_3^⋆}[ ln f_{y,z}(y, z) ] }
                    ≜ K_1 exp{ E_{q_2^⋆,q_3^⋆}[ ln f_{y,z}(y, z) ] }      (36.39)

If we select C_1 such that the area under g_{y,z_1}(y, z_1) evaluates to 1, then g_{y,z_1}(y, z_1) behaves like an actual pdf. In other words, we use C_1 as a normalization factor chosen to satisfy this condition. Then, we set C_2 = C − C_1. Using g_{y,z_1}(y, z_1) we can rewrite expression (36.38) for L in the form:

   L = E_{q_1}[ ln g_{y,z_1}(y, z_1) + C_1 ] − E_{q_1}[ ln q_1(z_1) ] − C
     = ∫_{z_1} q_1(z_1) ln[ g_{y,z_1}(y, z_1) / q_1(z_1) ] dz_1 − C_2
     = −D_KL( q_1(z_1) ‖ g_{y,z_1}(y, z_1) ) − C_2      (36.40)

Since C_2 is a constant, we find that maximizing L over q_1(z_1) in (36.38) is equivalent to minimizing the KL divergence between q_1(z_1) and g_{y,z_1}(y, z_1). The minimum is attained at

   q_1^⋆(z_1) = g_{y,z_1}(y, z_1) ∝ exp{ E_{q_2^⋆,q_3^⋆}[ ln f_{y,z}(y, z) ] }      (36.41)

36.3.2 Coordinate-Ascent Algorithm

We can repeat the argument by assuming that (q_1^⋆, q_3^⋆) are known and seek q_2^⋆, or that (q_1^⋆, q_2^⋆) are known and seek q_3^⋆. The presentation would lead to a similar conclusion to (36.41). We state the result more generally, for latent vectors z of arbitrary size M, by noting that under the mean-field approximation (36.36), the optimal mth component q_m^⋆(z_m) is a distribution that is proportional to

   q_m^⋆(z_m) ∝ exp{ E_{−m}[ ln f_{y,z}(y, z) ] }      (36.42)

where the subscript −m means that the expectation is evaluated relative to the optimal distributions of all other components {q_{m'}^⋆(z_{m'})} excluding the mth component, q_m(z_m). It can be verified that result (36.42) can be rewritten in an equivalent form where the joint pdf on the right-hand side is replaced by a conditional pdf for z_m given the observation and all other latent variables (see Prob. 36.3):

   q_m^⋆(z_m) ∝ exp{ E_{−m}[ ln f_{z_m|z_{−m},y}(z_m|z_{−m}, y) ] }      (36.43)

In this expression, the notation z_{−m} refers to all entries of the latent variable z excluding z_m.

Coordinate-ascent algorithm for variational inference.
objective: approximate f_{z|y}(z|y) using ∏_{m=1}^M q_m(z_m).
given: f_{y,z}(y, z).
repeat for m = 1, 2, . . . , M:
   q_m^⋆(z_m) ∝ exp{ E_{−m}[ ln f_{y,z}(y, z) ] }
end      (36.44)

Construction (36.42) is referred to as the coordinate-ascent iteration. The qualification "coordinate" refers to the fact that the algorithm updates one variational factor q_m(z_m) (or one coordinate) at a time. The qualification "ascent" refers to the fact that the algorithm is maximizing (rather than minimizing) the ELBO measure. Solving (36.42) is still challenging because the equations are coupled: to determine q_m^⋆ we need to know the optimal values for all other factors, and determining these optimal components in turn requires knowledge of q_m^⋆. The equations are usually solved iteratively by assuming initial choices for the q-factors and updating the relations continually; sometimes, we assume some parameterized form for the q-factors and iterate to estimate their parameters. These repeated constructions would continue until the ELBO approaches some saturation level or is large enough. We illustrate the procedure by means of an example.


Example 36.3 (Gaussian mixture model) We reconsider the generative model (36.5) and use it to generate N independent observations, {y_n}, as follows:

   generate two random means µ_1 ∼ N_{µ_1}(µ̄_1, σ_1²), µ_2 ∼ N_{µ_2}(µ̄_2, σ_2²)
   for each n = 1, 2, . . . , N:
      select a model index k_n ∈ {1, 2} according to Bernoulli(p)
      generate the observation y_n ∼ N_y(µ_{k_n}, σ²_{y,k_n})
   end      (36.45)

That is, we associate a Bernoulli variable k_n, with subscript n, with each observation y_n such that

   P(k_n = 1) = p > 0      (36.46a)
   f_k(k_n) = p^{2−k_n}(1 − p)^{k_n−1}      (36.46b)

where the second expression shows the form of the pdf for k_n. We therefore end up with a collection of N independent variables {k_n}. We will use the notation y_{1:N} and k_{1:N} to refer to the N observations and their class variables. In Prob. 36.6 we extend the analysis to a case involving K models rather than only two.

In this problem we have three latent variables, (µ_1, µ_2, k_{1:N}). Since the value of k determines the variance of y, either σ²_{y,1} or σ²_{y,2}, we do not need to incorporate σ²_{y,k} into the set of latent variables because it would be redundant information. The objective is to apply construction (36.42) to estimate the conditional pdf of the latent variables given the observations, namely, f_{µ_1,µ_2,k_{1:N}|y_{1:N}}(µ_1, µ_2, k_{1:N}|y_{1:N}). We denote the approximation for the posterior by q_{µ_1,µ_2,k_{1:N}}(µ_1, µ_2, k_{1:N}) and assume it has the factored form:

   q_{µ_1,µ_2,k_{1:N}}(µ_1, µ_2, k_{1:N}) = q_{µ_1}(µ_1) q_{µ_2}(µ_2) ∏_{n=1}^N q_{k_n}(k_n)      (36.47)

36.3 Mean-Field Approximation

1417

described in (36.46a)–(36.46b). We will therefore adopt a similar Bernoulli model for its variational factor, namely, we will assume that

qkn (kn ) = pn2−kn (1 − pn )kn −1 , kn ∈ {1, 2}, pn ∈ [0, 1]

(36.49)

which depends on some unknown parameter pn to be learned. The value for this parameter will follow from relation (36.48c), as we explain toward the end of this example. For now, we are only adopting a form for qkn (kn ). This factor approximates the conditional pdf of kn given all observations, fkn |y1:N (kn |y1:N ). We now determine the optimal variational factors from (36.48a)–(36.48c). To begin with, in a manner similar to (36.8), we have

fy1:N ,k1:N ,µ1 ,µ2 (y1:N , k1:N , µ1 , µ2 ) " N # Y = fµ1 (µ1 ) × fµ2 (µ2 ) × fk (kn ) × fy|kn ,µ1 ,µ2 (yn |kn , µ1 , µ2 )

(36.50)

n=1

" =

Nµ1 (¯ µ1 , σ12 )

×

Nµ2 (¯ µ2 , σ22 )

×

N Y

# p

2−kn

n=1

(1 − p)

kn −1

×

2 ) Nyn (µkn , σy,k n

so that the log-likelihood function is given by

ln fy1:N ,k1:N ,µ1 ,µ2 (y1:N , k1:N , µ1 , µ2 )

(36.51)

= ln Nµ1 (¯ µ1 , σ12 ) + ln Nµ2 (¯ µ2 , σ22 ) + N h X n=1

2 ln p2−kn (1 − p)kn −1 + ln Nyn (µkn , σy,k ) n

i

1 1 (µ1 − µ ¯1 )2 − 2 (µ2 − µ ¯ 2 )2 + 2σ12 2σ2 " # N   X 1 2−kn kn −1 2 ln p (1 − p) − 2 (yn − µkn ) + C 2σy,kn n=1

=−

where we have grouped all terms that are independent of the latent variables into the constant C. We can use the above expansion to evaluate the expectations that appear on the right-hand side of (36.48a)–(36.48c). For example, refer to the first equality (36.48a). If we take expectations relative to the variational distributions of k1:N and µ2 , only two terms will remain that depend on µ1 . We collect the other terms into a constant that is independent of µ1 and write:   E k1:N ,µ2 ln fy1:N ,k1:N ,µ1 ,µ2 (y1:N , k1:N , µ1 , µ2 ) =−

(36.52)

N  1  X 1 2 2 (µ − µ ¯ ) − E (y − µ ) + (cte independent of µ1 ) 1 1 k , µ n n k 2 n 2 2σ12 2σy,k n n=1

1418

Variational Inference

We can evaluate the second term on the right-hand side to find ( ! !) 1 1 2 2 = E µ2 E kn (36.53) E kn ,µ2 (y n − µkn ) (y n − µkn ) 2 2 σy,k σy,k n n   (a) 1 1 = E µ2 pn 2 (y n − µ1 )2 + (1 − pn ) 2 (y − µ2 )2 σy,1 σy,2 1 = pn 2 (yn − µ1 )2 + term independent of µ1 σy,1 1 ∆ = pn 2 (yn,c − µ1,c )2 + term independent of µ1 σy,1 where in step (a) we used the fact that the variational distribution for the latent variable kn is Bernoulli with the probability of class kn = 1 equal to pn . And in the last step we introduced the centered variables ∆

¯1 , µ1,c = µ1 − µ 0



yn,c = yn − µ ¯1

(36.54)

Consequently, by letting C denote terms that are independent of µ1 :   E k1:N ,µ2 ln fy1:N ,k1:N ,µ1 ,µ2 (y1:N , k1:N , µ1 , µ2 ) =− =

N  1 2 1 X 2 2 µ − p y + p µ − 2p y µ + C0 n n n n,c 1,c 1,c n,c 1,c 2 2σ12 2σy,1 n=1 ! ! N N 1 1 X 1 X + 2 pn yn,c µ1,c − pn µ21,c + C 0 2 σy,1 2σ12 2σy,1 n=1 n=1

Substituting into (36.48a) we obtain ( ! N 1 X qµ1 (µ1 ) ∝ exp pn yn,c µ1,c − 2 σy,1 n=1

N 1 1 X + 2 pn 2 2σ1 2σy,1 n=1

!

(36.55)

) µ21,c

(36.56)

The expression on the right-hand side can be normalized to a Gaussian distribution 2 over the centered variable µ1,c , with mean and variance denoted by µ ¯q,1 and σq,1 , respectively. To see this, we refer back to the analysis from Chapter 5 showing how Gaussian distributions are special cases of the exponential family of distributions. If we refer to expression (5.11) we observe that the expression on the right-hand side of (36.56) is written in a similar form, with the corresponding sufficient statistics vector given by " # µ1,c T (µ1 ) = (36.57) µ21,c and with coefficient vector     θ=  

N 1 X 2 σy,1

 pn yn,c

n=1

N 1 X 1 pn − 2 − 2 2σ1 2σy,1 n=1

     

(36.58)

We know from expression (5.11) that the entries of θ define the mean and variance of the distribution as follows: " # 2 µ ¯q,1 /σq,1 θ= (36.59) 2 −1/2σq,1

36.3 Mean-Field Approximation

1419

2 Therefore, we can determine (¯ µq,1 , σq,1 ) by solving N µ ¯q,1 1 X = pn yn,c 2 2 σq,1 σy,1 n=1



N 1 1 1 X = − − pn 2 2 2σq,1 2σ12 2σy,1 n=1

(36.60a)

(36.60b)

from which we conclude that PN µ ¯q,1 = 2 σq,1 =

pn (yn − µ ¯1 ) PN n=1 pn

(36.61a)

2 σy,1 P 2 /σ12 ) + N (σy,1 n=1 pn

(36.61b)

n=1

2 (σy,1 /σ12 ) +

2 Observe in passing how the summation in the denominator of σq,1 grows with N so 2 that σq,1 → 0. Observe also that N µ ¯q,1 1 X = 2 pn (yn − µ ¯1 ) 2 σq,1 σy,1 n=1

(36.62)

The expression on the right-hand side of (36.56) can be normalized to the following Gaussian distribution over the random variable µ1,c :   1 1 ∆ fµ1,c (µ1,c ) = q ¯q,1 )2 (36.63) exp − 2 (µ1,c − µ 2σq,1 2 2πσq,1 That is, the distribution for µ1,c is Gaussian, written as 2 µ1.c ∼ Nµ1,c (¯ µq,1 , σq,1 )

(36.64)

from which we conclude that the distribution qµ1 (µ1 ) is also Gaussian with   2 ¯1 + µ ¯q,1 , σq,1 qµ1 (µ1 ) ∼ Nµ1 µ

(36.65)

where we added \bar{\mu}_1 to the mean value to undo the centering operation; the variance remains unchanged. Repeating the argument for the second equality (36.48b) we similarly conclude that q_{\mu_2}(\mu_2) is Gaussian-distributed with

q_{\mu_2}(\mu_2) \sim \mathcal{N}_{\mu_2}\big( \bar{\mu}_2 + \bar{\mu}_{q,2},\; \sigma_{q,2}^2 \big)    (36.66)

where now

\bar{\mu}_{q,2} = \frac{\sum_{n=1}^{N} (1-p_n)(y_n - \bar{\mu}_2)}{(\sigma_{y,2}^2/\sigma_2^2) + \sum_{n=1}^{N} (1-p_n)}    (36.67a)

\sigma_{q,2}^2 = \frac{\sigma_{y,2}^2}{(\sigma_{y,2}^2/\sigma_2^2) + \sum_{n=1}^{N} (1-p_n)}    (36.67b)

Observe again that

\frac{\bar{\mu}_{q,2}}{\sigma_{q,2}^2} = \frac{1}{\sigma_{y,2}^2}\sum_{n=1}^{N} (1-p_n)(y_n - \bar{\mu}_2)    (36.68)
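Before proceeding, note that the updates (36.61a)–(36.61b) and (36.67a)–(36.67b) are straightforward to evaluate numerically once the responsibilities {p_n} are available. The following Python/NumPy sketch is only an illustration of these two formulas; the function and variable names are our own, and the responsibilities are assumed to be given.

import numpy as np

def update_gaussian_factor(y, resp, mu_bar, sigma2_prior, sigma2_y):
    # Implements (36.61a)-(36.61b) for component 1 when resp = p_n,
    # and (36.67a)-(36.67b) for component 2 when resp = 1 - p_n.
    denom = sigma2_y / sigma2_prior + np.sum(resp)
    mu_q = np.sum(resp * (y - mu_bar)) / denom       # mean offset, (36.61a)/(36.67a)
    sigma2_q = sigma2_y / denom                      # variance, (36.61b)/(36.67b)
    return mu_q, sigma2_q

The quantity mu_bar + mu_q then corresponds to the mean of the Gaussian variational factor, as in (36.65) and (36.66).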


Although we are able to deduce analytically the form of the variational factors in this problem, this is not possible in general and we will need to resort to alternative iterative techniques to complete the arguments (as will be illustrated in subsequent sections). Finally, we turn to (36.48c) and the assumed Bernoulli form for q_{k_n}(k_n) to deduce that it must hold (where terms that are independent of k_n are grouped into the constant):

\ln p_n = E_{k_{-n},\mu_1,\mu_2}\Big\{ \ln f_{y_{1:N},k_{1:N},\mu_1,\mu_2}(y_{1:N},k_{1:N},\mu_1,\mu_2) \,\Big|\, k_n = 1 \Big\}    (36.51)
= -\frac{1}{2\sigma_{y,1}^2} E_{\mu_1}(y_n-\mu_1)^2 + (\text{constant})
= -\frac{1}{2\sigma_{y,1}^2} E_{\mu_1}(y_n-\bar{\mu}_1+\bar{\mu}_1-\mu_1)^2 + (\text{constant})
= \frac{1}{\sigma_{y,1}^2}(y_n-\bar{\mu}_1)E_{\mu_1}(\mu_1-\bar{\mu}_1) - \frac{1}{2\sigma_{y,1}^2}E_{\mu_1}(\mu_1-\bar{\mu}_1)^2 + (\text{constant})
= \frac{1}{\sigma_{y,1}^2}(y_n-\bar{\mu}_1)\bar{\mu}_{q,1} - \frac{\sigma_{q,1}^2}{2\sigma_{y,1}^2} + (\text{constant})    (36.69)

It follows that

\ln p_n \propto \frac{1}{\sigma_{y,1}^2}(y_n-\bar{\mu}_1)\bar{\mu}_{q,1} - \frac{\sigma_{q,1}^2}{2\sigma_{y,1}^2} \;\triangleq\; \ln(a_n)    (36.70)

We could have alternatively expanded (36.69) in the following form:

\ln p_n = -\frac{1}{2\sigma_{y,1}^2} E_{\mu_1}(y_n-\mu_1)^2 + (\text{constant})
= \frac{1}{\sigma_{y,1}^2}\, y_n\, E_{\mu_1}(\mu_1) - \frac{1}{2\sigma_{y,1}^2}\, E_{\mu_1}(\mu_1)^2 + (\text{constant})
= \frac{1}{\sigma_{y,1}^2}\, y_n (\bar{\mu}_1+\bar{\mu}_{q,1}) - \frac{1}{2\sigma_{y,1}^2}\big( \sigma_{q,1}^2 + (\bar{\mu}_1+\bar{\mu}_{q,1})^2 \big) + (\text{constant})    (36.71)

so that we can also use the expression

\ln p_n \propto \frac{1}{\sigma_{y,1}^2}\, y_n (\bar{\mu}_1+\bar{\mu}_{q,1}) - \frac{1}{2\sigma_{y,1}^2}\big( \sigma_{q,1}^2 + (\bar{\mu}_1+\bar{\mu}_{q,1})^2 \big) \;\triangleq\; \ln(a_n)    (36.72)

We list this second expression to facilitate comparison with another derivation given later in our presentation. Likewise, we can derive either of the following expressions:

\ln(1-p_n) \propto \frac{1}{\sigma_{y,2}^2}(y_n-\bar{\mu}_2)\bar{\mu}_{q,2} - \frac{\sigma_{q,2}^2}{2\sigma_{y,2}^2} \;\triangleq\; \ln(b_n)    (36.73)

\ln(1-p_n) \propto \frac{1}{\sigma_{y,2}^2}\, y_n (\bar{\mu}_2+\bar{\mu}_{q,2}) - \frac{1}{2\sigma_{y,2}^2}\big( \sigma_{q,2}^2 + (\bar{\mu}_2+\bar{\mu}_{q,2})^2 \big) \;\triangleq\; \ln(b_n)    (36.74)

Since the estimates p_n and 1-p_n must add up to 1, we can obtain true probability values by normalizing by the sum a_n + b_n, namely, by using

p_n = \frac{a_n}{a_n + b_n}    (36.75)

We conclude that the variational distribution for each k_n takes the form

q_{k_n}(k_n) = \text{Bernoulli}(p_n)    (36.76)
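The responsibilities themselves follow from (36.70), (36.73), and (36.75). A minimal Python/NumPy illustration is shown below (the names are ours); subtracting the larger exponent guards against overflow when forming the ratio.

import numpy as np

def responsibilities(y, mq1, vq1, mq2, vq2, mu1, mu2, sy1, sy2):
    # ln a_n and ln b_n from (36.70) and (36.73)
    ln_a = ((y - mu1) * mq1 - vq1 / 2.0) / sy1
    ln_b = ((y - mu2) * mq2 - vq2 / 2.0) / sy2
    # p_n = a_n / (a_n + b_n) from (36.75), computed in a stable manner
    m = np.maximum(ln_a, ln_b)
    a, b = np.exp(ln_a - m), np.exp(ln_b - m)
    return a / (a + b)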


In summary, we find that the variational distributions for \mu_1 and \mu_2 are Gaussian with means (\bar{\mu}_1+\bar{\mu}_{q,1}, \bar{\mu}_2+\bar{\mu}_{q,2}) and variances (\sigma_{q,1}^2, \sigma_{q,2}^2), while the variational distribution for k_n is Bernoulli with success probability p_n. The estimate p_n depends on the means of the variational factors, which in turn depend on all {p_n}. We can implement the coordinate-ascent algorithm in an iterative form to undo this coupling and estimate the desired parameters for the variational factors, as shown in listing (36.77). This listing is for a Gaussian mixture model with two components. In Prob. 36.6 we extend the analysis to a case involving K \geq 2 models.

Coordinate-ascent algorithm for a Bayesian mixture of two Gaussians.    (36.77)
input: N observations {y_n} arising from model (36.45).
given: parameters {\bar{\mu}_1, \bar{\mu}_2, \sigma_1^2, \sigma_2^2, \sigma_{y,1}^2, \sigma_{y,2}^2}.
start with initial conditions for {\bar{\mu}_{q,1}^{(-1)}, \bar{\mu}_{q,2}^{(-1)}, \sigma_{q,1}^{2,(-1)}, \sigma_{q,2}^{2,(-1)}, p_n^{(-1)}}.
repeat until the ELBO value L(m) has converged over m = 0, 1, 2, ...:
    a_n = \exp\Big\{ \frac{1}{\sigma_{y,1}^2}\Big[ (y_n-\bar{\mu}_1)\bar{\mu}_{q,1}^{(m-1)} - \frac{1}{2}\sigma_{q,1}^{2,(m-1)} \Big] \Big\},  n = 1, 2, ..., N
    b_n = \exp\Big\{ \frac{1}{\sigma_{y,2}^2}\Big[ (y_n-\bar{\mu}_2)\bar{\mu}_{q,2}^{(m-1)} - \frac{1}{2}\sigma_{q,2}^{2,(m-1)} \Big] \Big\},  n = 1, 2, ..., N
    p_n^{(m-1)} = a_n/(a_n+b_n),  n = 1, 2, ..., N
    \bar{\mu}_{q,1}^{(m)} \leftarrow \dfrac{\sum_{n=1}^{N} p_n^{(m-1)}(y_n-\bar{\mu}_1)}{(\sigma_{y,1}^2/\sigma_1^2) + \sum_{n=1}^{N} p_n^{(m-1)}}
    \bar{\mu}_{q,2}^{(m)} \leftarrow \dfrac{\sum_{n=1}^{N} (1-p_n^{(m-1)})(y_n-\bar{\mu}_2)}{(\sigma_{y,2}^2/\sigma_2^2) + \sum_{n=1}^{N} (1-p_n^{(m-1)})}
    \sigma_{q,1}^{2,(m)} \leftarrow \dfrac{\sigma_{y,1}^2}{(\sigma_{y,1}^2/\sigma_1^2) + \sum_{n=1}^{N} p_n^{(m-1)}}
    \sigma_{q,2}^{2,(m)} \leftarrow \dfrac{\sigma_{y,2}^2}{(\sigma_{y,2}^2/\sigma_2^2) + \sum_{n=1}^{N} (1-p_n^{(m-1)})}

    compute L(m) using (36.78a)–(36.78d) below.
end
return q_{\mu_1}(\mu_1) \sim \mathcal{N}_{\mu_1}\big( \bar{\mu}_1+\bar{\mu}_{q,1}, \sigma_{q,1}^2 \big)
       q_{\mu_2}(\mu_2) \sim \mathcal{N}_{\mu_2}\big( \bar{\mu}_2+\bar{\mu}_{q,2}, \sigma_{q,2}^2 \big)
       q_{k_n}(k_n) = \text{Bernoulli}(p_n),  n = 1, 2, ..., N

We monitor the progress of the algorithm by examining how close successive variational factors become to each other. We measure this closeness by means of the KL divergence, through the quantity below defined in terms of successive variational factors indexed by m:

L(m) \triangleq -D_{KL}\big( q_{\mu_1}^{(m-1)} \| q_{\mu_1}^{(m)} \big) - D_{KL}\big( q_{\mu_2}^{(m-1)} \| q_{\mu_2}^{(m)} \big) - \sum_{n=1}^{N} D_{KL}\big( q_{k_n}^{(m-1)} \| q_{k_n}^{(m)} \big)    (36.78a)

We know from the result of Prob. 36.4 that as each factor q^{(m)} approaches its optimal value q^{\star}, the quantity L(m) would correspond to the ELBO value at the {q^{(m)}}. The


first two terms on the right-hand side of (36.78a) involve the KL divergences between Gaussian distributions. We appeal to result (6.66) to write

D_{KL}\big( q_{\mu_1}^{(m-1)} \| q_{\mu_1}^{(m)} \big) = \frac{1}{2}\Bigg\{ \ln\Big( \frac{\sigma_{q,1}^{2,(m)}}{\sigma_{q,1}^{2,(m-1)}} \Big) - 1 + \frac{\sigma_{q,1}^{2,(m-1)}}{\sigma_{q,1}^{2,(m)}} + \frac{\big( \bar{\mu}_{q,1}^{(m-1)} - \bar{\mu}_{q,1}^{(m)} \big)^2}{\sigma_{q,1}^{2,(m)}} \Bigg\}    (36.78b)

and

D_{KL}\big( q_{\mu_2}^{(m-1)} \| q_{\mu_2}^{(m)} \big) = \frac{1}{2}\Bigg\{ \ln\Big( \frac{\sigma_{q,2}^{2,(m)}}{\sigma_{q,2}^{2,(m-1)}} \Big) - 1 + \frac{\sigma_{q,2}^{2,(m-1)}}{\sigma_{q,2}^{2,(m)}} + \frac{\big( \bar{\mu}_{q,2}^{(m-1)} - \bar{\mu}_{q,2}^{(m)} \big)^2}{\sigma_{q,2}^{2,(m)}} \Bigg\}    (36.78c)

The last term in (36.78a) involves the KL divergence between two Bernoulli distributions. We appeal to result (6.51) to write

D_{KL}\big( q_{k_n}^{(m-1)} \| q_{k_n}^{(m)} \big) = \big( 1-p_n^{(m-1)} \big)\ln\Big( \frac{1-p_n^{(m-1)}}{1-p_n^{(m)}} \Big) + p_n^{(m-1)}\ln\Big( \frac{p_n^{(m-1)}}{p_n^{(m)}} \Big)    (36.78d)
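For readers who wish to experiment numerically, the listing (36.77) together with the stopping rule based on (36.78a)–(36.78d) can be coded in a few lines. The Python/NumPy sketch below is one possible implementation under the stated assumption of known model parameters; the function and variable names are our own choices.

import numpy as np

def kl_gauss(m0, v0, m1, v1):
    # KL divergence between two scalar Gaussians, cf. (36.78b)-(36.78c)
    return 0.5 * (np.log(v1 / v0) - 1.0 + v0 / v1 + (m0 - m1)**2 / v1)

def kl_bern(p0, p1):
    # KL divergence between two Bernoulli distributions, cf. (36.78d)
    return (1 - p0) * np.log((1 - p0) / (1 - p1)) + p0 * np.log(p0 / p1)

def cavi_two_gaussians(y, mu1, mu2, s1, s2, sy1, sy2, max_iter=100, eps=1e-8, seed=0):
    rng = np.random.default_rng(seed)
    p = rng.uniform(0.25, 0.75, size=len(y))     # p_n^{(-1)}
    mq1 = mq2 = 0.0                              # mean offsets of q_mu1, q_mu2
    vq1 = vq2 = 1.0                              # variances of q_mu1, q_mu2
    L_old = None
    for m in range(max_iter):
        # responsibilities, cf. (36.70), (36.73), (36.75)
        ln_a = ((y - mu1) * mq1 - vq1 / 2) / sy1
        ln_b = ((y - mu2) * mq2 - vq2 / 2) / sy2
        p_new = np.clip(np.exp(ln_a - np.logaddexp(ln_a, ln_b)), 1e-12, 1 - 1e-12)
        # Gaussian factors, cf. (36.61a)-(36.61b) and (36.67a)-(36.67b)
        d1 = sy1 / s1 + p_new.sum()
        d2 = sy2 / s2 + (1 - p_new).sum()
        mq1_new = np.sum(p_new * (y - mu1)) / d1
        mq2_new = np.sum((1 - p_new) * (y - mu2)) / d2
        vq1_new, vq2_new = sy1 / d1, sy2 / d2
        # L(m) from (36.78a), built from the KL terms above
        L = -(kl_gauss(mq1, vq1, mq1_new, vq1_new)
              + kl_gauss(mq2, vq2, mq2_new, vq2_new)
              + np.sum(kl_bern(p, p_new)))
        mq1, mq2, vq1, vq2, p = mq1_new, mq2_new, vq1_new, vq2_new, p_new
        if L_old is not None and abs(L - L_old) < eps:
            break
        L_old = L
    return mu1 + mq1, vq1, mu2 + mq2, vq2, p

The returned values correspond to the means and variances appearing in (36.80b)–(36.80c), along with the responsibilities {p_n}.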

The algorithm is repeated until the difference |L(m) - L(m-1)| < \epsilon, for \epsilon > 0 small enough. Once the variational factors are estimated, we can approximate the desired posterior for the latent variables by writing

f_{\mu_1,\mu_2,k_{1:N}|y_{1:N}}(\mu_1,\mu_2,k_{1:N}|y_{1:N}) \approx q_{\mu_1}(\mu_1)\, q_{\mu_2}(\mu_2) \prod_{n=1}^{N} q_{k_n}(k_n)    (36.79)
= \mathcal{N}_{\mu_1}(\bar{\mu}_1+\bar{\mu}_{q,1}, \sigma_{q,1}^2) \times \mathcal{N}_{\mu_2}(\bar{\mu}_2+\bar{\mu}_{q,2}, \sigma_{q,2}^2) \times \prod_{n=1}^{N} \text{Bernoulli}(p_n)

Likewise, we obtain the approximate distributions:

f_{k_n|y_n}(k_n|y_n) \approx q_{k_n}(k_n) = \text{Bernoulli}(p_n)    (36.80a)
f_{\mu_1|y_{1:N}}(\mu_1|y_{1:N}) \approx q_{\mu_1}(\mu_1) = \mathcal{N}_{\mu_1}(\bar{\mu}_1+\bar{\mu}_{q,1}, \sigma_{q,1}^2)    (36.80b)
f_{\mu_2|y_{1:N}}(\mu_2|y_{1:N}) \approx q_{\mu_2}(\mu_2) = \mathcal{N}_{\mu_2}(\bar{\mu}_2+\bar{\mu}_{q,2}, \sigma_{q,2}^2)    (36.80c)

We can use the estimated variational parameters to approximate the generative distribution for the observations (and, hence, use this approximation to perform prediction or to generate observations with "similar" properties):

f_{y|y_{1:N}}(y|y_{1:N}) \approx \hat{p}\, \mathcal{N}_y(\bar{\mu}_1+\bar{\mu}_{q,1}, \sigma_{y,1}^2) + (1-\hat{p})\, \mathcal{N}_y(\bar{\mu}_2+\bar{\mu}_{q,2}, \sigma_{y,2}^2)    (36.81)

where, for example, the value of \hat{p} in this expression can be estimated by averaging all learned values {p_n}:

\hat{p} = \frac{1}{N}\sum_{n=1}^{N} p_n    (36.82)


Likewise, in a manner similar to (36.21)–(36.22), we can estimate the responsibility for model k = 1 in explaining an observation y:

P(k=1|y=y, y_{1:N}) = \frac{ P(k=1|y_{1:N})\, f_{y|k=1,y_{1:N}}(y|k=1,y_{1:N}) }{ f_{y|y_{1:N}}(y|y_{1:N}) }    (36.83)

so that

P(k=1|y=y, y_{1:N}) = \frac{ \hat{p}\, \mathcal{N}_y(\bar{\mu}_1+\bar{\mu}_{q,1}, \sigma_{y,1}^2) }{ \hat{p}\, \mathcal{N}_y(\bar{\mu}_1+\bar{\mu}_{q,1}, \sigma_{y,1}^2) + (1-\hat{p})\, \mathcal{N}_y(\bar{\mu}_2+\bar{\mu}_{q,2}, \sigma_{y,2}^2) }    (36.84)
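The predictive density (36.81) and the responsibility (36.84) are then immediate to evaluate; a small Python/NumPy illustration (with our own naming) is:

import numpy as np

def predictive_density(y, p_hat, m1, v1, m2, v2):
    # m1, m2 stand for (bar-mu_1 + bar-mu_{q,1}) and (bar-mu_2 + bar-mu_{q,2});
    # v1, v2 are the observation variances sigma^2_{y,1} and sigma^2_{y,2}.
    g1 = np.exp(-(y - m1)**2 / (2 * v1)) / np.sqrt(2 * np.pi * v1)
    g2 = np.exp(-(y - m2)**2 / (2 * v2)) / np.sqrt(2 * np.pi * v2)
    f = p_hat * g1 + (1 - p_hat) * g2          # predictive density (36.81)
    resp1 = p_hat * g1 / f                     # P(k = 1 | y, y_{1:N}), cf. (36.84)
    return f, resp1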


Figure 36.3 Successive estimates of the parameters (\bar{\mu}_1, \bar{\mu}_2, \sigma_1^2, \sigma_2^2) for the variational factors q_{\mu_1}(\mu_1), q_{\mu_2}(\mu_2), and q_k(k) obtained by applying algorithm (36.77) to N = 2000 observations {y_n} generated according to the Gaussian mixture model (36.45).

We run a simulation by generating N = 2000 observations {y_n} according to the Gaussian mixture model (36.45) using \sigma_1^2 = 2, \sigma_2^2 = 5, \sigma_{y,1}^2 = 1, \sigma_{y,2}^2 = 4, and p = 0.3. We apply construction (36.77) starting from random initial conditions:

\big\{ \bar{\mu}_{q,1}^{(-1)}, \bar{\mu}_{q,2}^{(-1)}, \sigma_{q,1}^{2,(-1)}, \sigma_{q,2}^{2,(-1)}, p_n^{(-1)} \big\}    (36.85)

The mean realizations used to generate the observations are \mu_1 = 1.4554 and \mu_2 = 6.3255. The results are shown in Fig. 36.3. It is seen that the ELBO value L(m) converges quickly to its steady-state value. The resulting estimated parameters are:


\bar{\mu}_1 + \bar{\mu}_{q,1} \approx 6.6323, \qquad \bar{\mu}_2 + \bar{\mu}_{q,2} \approx 1.5558    (36.86)
\sigma_{q,1}^2 \approx 0.0015, \qquad \sigma_{q,2}^2 \approx 0.0117    (36.87)
\hat{p} = 0.7104    (36.88)

where the estimate for p is obtained by using (36.82) and the values of {p_n} when the algorithm stops. Figure 36.4 shows the predictive distribution (36.81) along with a normalized histogram for the observations. We generate an additional collection of 200 observations and use (36.84) to predict their labels. The error rate is 88%. Observe, however, that the estimated models are switched: the mean estimated for model 1 actually corresponds to model 2 and vice-versa. Also, the value estimated for p is switched with 1-p. Therefore, the actual error rate is 12% in this contrived example.


Figure 36.4 Histogram for N = 2000 observations {yn } generated according to the Gaussian mixture model (36.45), along with the predictive distribution (36.81).

Remark 36.1. (Estimating the model parameters) The coordinate-ascent recursions (36.77) require knowledge of the model parameters {\bar{\mu}_1, \bar{\mu}_2, \sigma_1^2, \sigma_2^2, \sigma_{y,1}^2, \sigma_{y,2}^2}. When these parameters are not available beforehand, they would need to be estimated. One approach to estimate them is to maximize the log-likelihood function, \ln f_{y_{1:N}}(y_{1:N}) (i.e., to solve a maximum-likelihood estimation problem). However, as already explained in the earlier parts of this chapter, the evidence f_{y_{1:N}}(y_{1:N}) may not be easy to compute in closed form; this is one of the main reasons why we are pursuing variational inference techniques in the first place. Nevertheless, we know from property (36.34) that the ELBO is a lower bound for the (log of the) evidence, namely,

L(q) \leq \ln f_{y_{1:N}}(y_{1:N})    (36.89)

We can replace the maximum-likelihood estimation step by the problem of maximizing the ELBO over the parameters:

\big\{ \hat{\bar{\mu}}_1, \hat{\bar{\mu}}_2, \hat{\sigma}_1^2, \hat{\sigma}_2^2, \hat{\sigma}_{y,1}^2, \hat{\sigma}_{y,2}^2 \big\} \triangleq \underset{\bar{\mu}_1,\bar{\mu}_2,\sigma_1^2,\sigma_2^2,\sigma_{y,1}^2,\sigma_{y,2}^2}{\text{argmax}}\; L(q)    (36.90)

We continue to assume in this chapter that the model parameters are known beforehand and focus on the variational inference construction. We will illustrate later in Section 37.4 how the above maximization of the ELBO can be used to estimate the parameters in the context of a topic modeling application where variational inference is used.


Example 36.4 (Estimating the posterior for probit models) Let us reconsider the logit and probit models introduced in Chapter 33. There we explained that evaluation of the evidence (33.26) in closed form is not possible, which in turn meant that determining a closed-form expression for the posterior (33.25) was not possible either and it needed to be approximated. We showed in Example 33.1 how this approximation can be pursued for logit models by using the Laplace method. Here, we carry out the approximation by relying on the coordinate-ascent construction (36.44). For illustration purposes, we consider the probit model, leaving the extension to the logit case to Prob. 36.5. To begin with, we showed in Prob. 33.6 that we can rewrite the probit model (33.21) in a convenient equivalent form. For any feature vector h, we introduce the Gaussian random variable z \sim \mathcal{N}_z(h^T w, 1) with mean h^T w and unit variance, where w is a realization from a Gaussian prior, w \sim \mathcal{N}_w(0, \sigma_w^2 I_M). We verified in that problem that the probit model (33.21) can be equivalently rewritten in the form:

\gamma = +1 \text{ if } z > 0, \text{ otherwise } \gamma = -1    (36.91)

In other words, the variable \gamma will assume the value +1 with probability equal to that of the event z > 0. We associate a local latent Gaussian variable z_n with each observation \gamma(n). These auxiliary latent variables will facilitate the evaluation of the expectations that arise during the application of the coordinate-ascent construction. We therefore end up with a generative model that contains one global latent variable, w, and N local latent variables {z_n}:

    generate w \sim \mathcal{N}_w(0, \sigma_w^2 I_M)
    for each data point n = 1, 2, ..., N:
        generate z_n \sim \mathcal{N}_{z_n}(h_n^T w, 1)
        set \gamma(n) = +1 if z_n > 0; otherwise, \gamma(n) = -1
    end    (36.92)
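As an aside, data obeying the generative description (36.92) can be simulated in a few lines; the Python/NumPy sketch below (our own code, with the feature distribution chosen as standard Gaussian, as in the simulation reported at the end of this example) may help in reproducing the experiments.

import numpy as np

def sample_probit_data(N=400, M=2, sigma2_w=2.0, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, np.sqrt(sigma2_w), size=M)   # global latent variable w
    H = rng.normal(size=(N, M))                      # feature vectors h_n (rows)
    z = H @ w + rng.normal(size=N)                   # local latents z_n ~ N(h_n^T w, 1)
    gamma = np.where(z > 0, 1, -1)                   # class labels, cf. (36.91)
    return H, gamma, w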

Since we are denoting the class values by +1 and -1, we can express the assignment (36.91) by writing the pdf:

f_{\gamma|z,w}(\gamma|z,w) = \Big( \frac{1+\gamma}{2} \Big)^{I[z>0]} \Big( \frac{1-\gamma}{2} \Big)^{I[z\leq 0]}    (36.93)

in terms of the indicator function. For example, when z > 0, the first term (1+\gamma)/2 is retained. This term evaluates to 1 when \gamma = +1 and to zero when \gamma = -1, thus placing the entire probability mass at location \gamma = +1. The above pdf expression depends on w as well since the distribution of z depends on it. It follows that the joint pdf of the latent variables and the observations is given by

f_{w,z_{1:N},\gamma_{1:N}}(w, z_{1:N}, \gamma_{1:N}) = f_w(w) \prod_{n=1}^{N} f_{z|w}(z_n|w)\, f_{\gamma|z,w}(\gamma_n|z_n,w)    (36.94)

and, consequently,

\ln f_{w,z_{1:N},\gamma_{1:N}}(w, z_{1:N}, \gamma_{1:N})    (36.95)
= -\frac{1}{2\sigma_w^2}\|w\|^2 - \frac{1}{2}\sum_{n=1}^{N}(z_n - h_n^T w)^2 + \sum_{n=1}^{N} I[z_n > 0]\ln\Big( \frac{1+\gamma(n)}{2} \Big) + \sum_{n=1}^{N} I[z_n \leq 0]\ln\Big( \frac{1-\gamma(n)}{2} \Big) + C

where the constant collects all terms that are independent of the latent variables.


We are interested in estimating the posterior f_{w,z_{1:N}|\gamma_{1:N}}(w, z_{1:N}|\gamma_{1:N}). Based on the mean-field approximation method, we assume a factored form:

f_{w,z_{1:N}|\gamma_{1:N}}(w, z_{1:N}|\gamma_{1:N}) \approx q_w(w) \prod_{n=1}^{N} q_{z_n}(z_n)    (36.96)

where, according to (36.42), the variational factors should satisfy the coupled equations (we are dropping the \star superscript for convenience of notation):

\ln q_w(w) = E_{z_{1:N}}\big[ \ln f_{w,z_{1:N},\gamma_{1:N}}(w, z_{1:N}, \gamma_{1:N}) \big]    (36.97a)
\ln q_{z_n}(z_n) = E_{z_{-n},w}\big[ \ln f_{w,z_{1:N},\gamma_{1:N}}(w, z_{1:N}, \gamma_{1:N}) \big], \quad n = 1, 2, ..., N    (36.97b)

Here, again, the notation E_a(\cdot) on the right-hand side is used to refer to expectation relative to the variational distribution of a, i.e., relative to q_a(a). Using (36.95) and (36.97b), taking expectations of the right-hand side, and keeping only terms that depend on z_n, we find that:

\ln q_{z_n}(z_n) \propto -\frac{1}{2} E_w(z_n - h_n^T w)^2 + I[z_n > 0]\ln\Big( \frac{1+\gamma(n)}{2} \Big) + I[z_n \leq 0]\ln\Big( \frac{1-\gamma(n)}{2} \Big)    (36.98)

where the expectation is relative to the variational distribution q_w(w). Now note that

E_w(z_n - h_n^T w)^2 = E_w\big( z_n^2 - 2z_n h_n^T w + w^T h_n h_n^T w \big)
= z_n^2 - 2z_n h_n^T E_w(w) + \text{term independent of } z_n
= \big( z_n - h_n^T E_w(w) \big)^2 + \text{term independent of } z_n    (36.99)

Substituting into (36.98) gives

\ln q_{z_n}(z_n) \propto -\frac{1}{2}\big( z_n - h_n^T E_w(w) \big)^2 + I[z_n > 0]\ln\Big( \frac{1+\gamma(n)}{2} \Big) + I[z_n \leq 0]\ln\Big( \frac{1-\gamma(n)}{2} \Big)    (36.100)

If the last two terms were not present and we exponentiate the right-hand side, then a Gaussian-type distribution would result for q_{z_n}(z_n) with mean h_n^T E_w(w). However, the range of z_n is coupled to the class variable \gamma(n). Specifically, if \gamma(n) = +1, then it must hold that z_n > 0; otherwise, z_n \leq 0. This means that the variational distribution for z_n will either have its support over the positive real axis when \gamma(n) = +1 or over the negative real axis when \gamma(n) = -1. As such, the distribution for z_n cannot be Gaussian, whose support extends over the entire real axis. Actually, the variational distribution q_{z_n}(z_n) will have a truncated Gaussian form. To see this, let us first note that if x \sim \mathcal{N}_x(\mu, \sigma_x^2) denotes some generic Gaussian distribution over x \in (-\infty, +\infty), then truncating it to the interval x \in [a,b] results in a new distribution that is given by (see Prob. 36.8):

f_{x,[a,b]}(x) = \begin{cases} 0, & x < a \\ \dfrac{\mathcal{N}_x(\mu, \sigma_x^2)}{\Phi\big( (b-\mu)/\sigma_x \big) - \Phi\big( (a-\mu)/\sigma_x \big)}, & a \leq x \leq b \\ 0, & x > b \end{cases}    (36.101)

This new pdf assumes zero values outside the interval [a,b]. The middle line includes the cumulative distribution \Phi(\cdot) of the standard Gaussian, which was defined earlier


in (33.20). The original normal distribution for x appears in the numerator. A lower truncated Gaussian has b = +\infty while an upper truncated Gaussian has a = -\infty. We denote them by writing \mathcal{N}_{x>a}(\mu, \sigma_x^2) and \mathcal{N}_{x<b}(\mu, \sigma_x^2), respectively. It follows that the variational factor in (36.100) is a truncated Gaussian:

q_{z_n}(z_n) = \begin{cases} \mathcal{N}_{z>0}\big( h_n^T E_w(w), 1 \big), & \text{when } \gamma(n) = +1 \\ \mathcal{N}_{z\leq 0}\big( h_n^T E_w(w), 1 \big), & \text{when } \gamma(n) = -1 \end{cases}    (36.102)

For later use, we note that the mean of a general truncated distribution of the form (36.101) is given by

E\,x_{[a,b]} = \mu + \sigma_x \left( \frac{ \phi\big( (a-\mu)/\sigma_x \big) - \phi\big( (b-\mu)/\sigma_x \big) }{ \Phi\big( (b-\mu)/\sigma_x \big) - \Phi\big( (a-\mu)/\sigma_x \big) } \right)    (36.103)

in terms of the normalized Gaussian function

\phi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}x^2}

It follows for lower and upper truncated Gaussian distributions that

E\,x_{x>a} = \mu + \sigma_x\, \frac{ \phi\big( (a-\mu)/\sigma_x \big) }{ 1 - \Phi\big( (a-\mu)/\sigma_x \big) }, \qquad E\,x_{x<b} = \mu - \sigma_x\, \frac{ \phi\big( (b-\mu)/\sigma_x \big) }{ \Phi\big( (b-\mu)/\sigma_x \big) }

Carrying these expressions through the coordinate-ascent equations (36.97a)–(36.97b), the variational factors take the form

q_{z_n}(z_n) = \mathcal{N}_{z>0}(\alpha_n, 1) \text{ or } \mathcal{N}_{z\leq 0}(\alpha_n, 1), \qquad q_w(w) = \mathcal{N}_w(\bar{w}, R_w)    (36.111)

where the notation used for q_{z_n}(z_n) is a compact representation for (36.102) and where the parameters are given by

R_w = \Big( \frac{1}{\sigma_w^2} I_M + \sum_{n=1}^{N} h_n h_n^T \Big)^{-1}
\bar{w} = R_w \Big( \sum_{n=1}^{N} \beta_n h_n \Big)    (36.112)
\beta_n = \alpha_n + \gamma(n)\, \frac{\phi(\alpha_n)}{\Phi(\alpha_n)}
\alpha_n = h_n^T \bar{w}

The expressions for \{\bar{w}, \alpha_n, \beta_n\} are coupled. We can iterate them to estimate the desired values:

repeat for m = 0, 1, 2, ...:
    \bar{w}^{(m)} = R_w \Big( \sum_{n=1}^{N} \beta_n^{(m-1)} h_n \Big)
    \alpha_n^{(m)} = h_n^T \bar{w}^{(m)}    (36.113)
    \beta_n^{(m)} = \alpha_n^{(m)} + \gamma(n)\, \frac{\phi\big(\alpha_n^{(m)}\big)}{\Phi\big(\alpha_n^{(m)}\big)}
end

starting from some initial conditions w^{(-1)}, \alpha_n^{(-1)}, and \beta_n^{(-1)}. Once we converge to a value \bar{w}, we obtain a characterization for the variational factor q_w(w). This factor approximates the conditional distribution of w given the observations, i.e.,

f_{w|\gamma_N}(w|\gamma_N; H_N) \approx \mathcal{N}_w(\bar{w}, R_w)    (36.114)
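A compact Python sketch of the recursions (36.112)–(36.113), together with the predictive rule that will be obtained in (36.118) below, is given next. This is our own code (SciPy is used only for the Gaussian pdf/cdf); for the \gamma(n) = -1 branch we write the truncated-Gaussian mean in the unified form \alpha_n + \gamma(n)\phi(\alpha_n)/\Phi(\gamma(n)\alpha_n), which coincides with the expression in (36.113) when \gamma(n) = +1.

import numpy as np
from scipy.stats import norm

def probit_cavi(H, gamma, sigma2_w=2.0, iters=100):
    # H is N x M (rows h_n^T); gamma has entries in {+1, -1}
    N, M = H.shape
    Rw = np.linalg.inv(np.eye(M) / sigma2_w + H.T @ H)    # (36.112)
    beta = np.zeros(N)                                    # beta_n^{(-1)}
    for _ in range(iters):
        w_bar = Rw @ (H.T @ beta)                         # (36.113)
        alpha = H @ w_bar
        beta = alpha + gamma * norm.pdf(alpha) / norm.cdf(gamma * alpha)
    return w_bar, Rw

def prob_plus_one(h, w_bar, Rw):
    # predictive probability that gamma = +1 for a new feature h, cf. (36.118)
    return norm.cdf(h @ w_bar / np.sqrt(1.0 + h @ Rw @ h))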


From this point onward we can follow the derivation from Example 33.1 after (33.44). Specifically, we can use the above distribution to approximate the predictive distribution for \gamma given a new feature h. Indeed, introduce the auxiliary scalar variable:

x \triangleq \gamma h^T w    (36.115)

Conditioned on \gamma, and in view of the Gaussian distribution (36.114) for w given all data, the variable x is also Gaussian-distributed with mean and variance given by:

f_{x|\gamma_N}(x|\gamma_N; H_N) \approx \mathcal{N}_x\big( \gamma h^T \bar{w},\; h^T R_w h \big)    (36.116)

Similar to (33.47), we can evaluate the predictive distribution as follows:

P(\gamma = \gamma|\gamma_N = \gamma_N; h, H_N) = \int_{-\infty}^{\infty} \Phi(x)\, \frac{1}{(2\pi h^T R_w h)^{1/2}} \exp\Big\{ -\frac{1}{2 h^T R_w h}(x - \gamma h^T \bar{w})^2 \Big\}\, dx    (36.117)

The last integral is difficult to evaluate in closed form. However, it can be approximated by using property (33.49) to conclude that

P(\gamma = \gamma|\gamma_N = \gamma_N; h, H_N) \approx \Phi\Big( \frac{\gamma h^T \bar{w}}{\sqrt{1 + h^T R_w h}} \Big)    (36.118)

We illustrate this construction by means of a simulation. A total of N = 400 feature vectors {h_n \in \mathbb{R}^2} are generated randomly from a Gaussian distribution with zero mean and unit variance. The weight vector w \in \mathbb{R}^2 is also generated randomly from a Gaussian distribution with zero mean and variance \sigma_w^2 = 2. For each h_n, we assign its label to \gamma(n) = +1 with probability equal to \Phi(h_n^T w); otherwise, it is assigned to \gamma(n) = -1. The 400 pairs {h_n, \gamma(n)} generated in this manner are used for training (i.e., for learning the conditional and predictive distributions). These samples are shown in the top row of Fig. 36.5. We also generate separately 50 additional sample pairs {h_n, \gamma(n)} to be used during testing. These samples are shown in the plot on the left in the bottom row. Running recursions (36.113) for 100 iterations leads to

\bar{w} = \begin{bmatrix} -2.0055 \\ -1.2971 \end{bmatrix}, \qquad R_w = \begin{bmatrix} 0.0022 & -0.0001 \\ -0.0001 & 0.0027 \end{bmatrix}    (36.119)

Using these values, we can predict the label for each of the test features h_n using construction (33.51). We assign h_n to class \gamma(n) = +1 if

\Phi\Big( \frac{h^T \bar{w}}{\sqrt{1 + h^T R_w h}} \Big) \geq 1/2 \;\Longrightarrow\; \gamma(n) = +1    (36.120)

The result is shown in the plot on the right in the bottom row of Fig. 36.5, with eight errors (corresponding to a 16% error rate).

Example 36.5 (Legislative voting record) We examine next a more involved generative model in the context of legislative voting. Consider a legislative body consisting of N members (say, N senators). Assume we collect the voting record of the senators on L legislative issues. The objective is to infer from the data whether there is some natural grouping that occurs within the legislative body. Do some senators tend to vote together on certain issues? Assume the senators can be divided into B groups or blocks, where senators in the same block tend to vote similarly on legislative issues presented to them. We would like to identify which senators belong to each block. Before we address this problem by means of variational inference, let us first introduce a generative model to explain the data (i.e., the voting record).



Figure 36.5 (Top row) A total of N = 400 pairs (hn , γ(n)) used for training in order to learn the predictive distribution. (Bottom row) The plot on the left shows the 50 pairs {hn , γ(n)} used for testing. The plot on the right shows the labels that are assigned by the test (33.51) to these points.

For each senator n, we let g_n denote the index of the group (or block) that the senator belongs to with

P(g_n = b) = \pi_b, \qquad b = 1, 2, ..., B    (36.121)

The variable g_n follows a categorical distribution defined by the B \times 1 vector of probabilities \pi = \text{col}\{\pi_1, \pi_2, ..., \pi_B\}; the class g_n assumes one of the values b = 1, 2, ..., B with probability \pi_b. In this way, the likelihood that senator n belongs to the bth block is \pi_b, with all the {\pi_b} adding up to 1. Each senator can belong to only one block. We model \pi as arising from a Dirichlet distribution with positive scalar parameters {\alpha_1, \alpha_2, ..., \alpha_B}, collected into a column vector \alpha. Recall that Dirichlet distributions are useful for generating realizations of probability vectors such as \pi. According to (5.53), this means that the pdf of \pi takes the form

f_{\pi}(\pi) = K(\alpha) \prod_{b=1}^{B} \pi_b^{\alpha_b - 1}, \qquad K(\alpha) = \Gamma\Big( \sum_{b=1}^{B} \alpha_b \Big) \Big/ \prod_{b=1}^{B} \Gamma(\alpha_b)    (36.122)

where Γ(·) is the gamma function. We refer to the {αb } as the “proportion” parameters; smaller or larger values for these parameters can end up concentrating the senators into fewer groups or dispersing them among more groups. We recall from (5.66) that


every Dirichlet distribution of the above form belongs to the exponential family of distributions. Specifically, we can write

f_{\pi}(\pi) = h(\pi)\, e^{\alpha^T T(\pi) - a(\alpha)}    (36.123a)

where

\alpha \triangleq \begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_B \end{bmatrix}, \qquad T(\pi) \triangleq \begin{bmatrix} \ln(\pi_1) \\ \ln(\pi_2) \\ \vdots \\ \ln(\pi_B) \end{bmatrix}, \qquad h(\pi) = 1 \Big/ \prod_{b=1}^{B} \pi_b    (36.123b)

as well as

a(\alpha) = \sum_{b=1}^{B} \ln\Gamma(\alpha_b) - \ln\Gamma\Big( \sum_{b=1}^{B} \alpha_b \Big)    (36.123c)

It follows that

f_{\pi}(\pi) = \frac{1}{\prod_{b=1}^{B}\pi_b} \exp\Big\{ \sum_{b=1}^{B} \alpha_b \ln(\pi_b) - a(\alpha) \Big\}
= \exp\Big\{ \sum_{b=1}^{B} \alpha_b \ln(\pi_b) - \ln\Big( \prod_{b=1}^{B} \pi_b \Big) - a(\alpha) \Big\}
= \exp\Big\{ \sum_{b=1}^{B} (\alpha_b - 1)\ln(\pi_b) - a(\alpha) \Big\}
\propto \exp\Big\{ \sum_{b=1}^{B} (\alpha_b - 1)\ln(\pi_b) \Big\}    (36.124)

We will call upon this form later in the discussion; its purpose is to assist us in identifying exponential distributions. We assume next that the senators vote on L legislative issues. Let \theta_{b,\ell} denote the likelihood (i.e., probability) that senators in block b would support legislation \ell. There are L such likelihoods for each block b, one for each legislative item. We collect them into a vector of size L \times 1:

\theta_b \triangleq \text{col}\{\theta_{b,1}, \theta_{b,2}, ..., \theta_{b,L}\}    (36.125)
\theta_{b,\ell} = \text{likelihood of block } b \text{ voting in favor of legislation } \ell    (36.126)

The entries of \theta_b are nonnegative. If we select a senator n from block b, the likelihood that this senator's vote on legislation \ell, denoted by v_{n,\ell}, will be in the affirmative is given by

P\big( v_{n,\ell} = 1 \,|\, g_n = b; \theta_{b,\ell} \big) = \theta_{b,\ell}    (36.127)

This means that the vote v_{n,\ell} \in \{0,1\} is a Bernoulli variable with success probability given by \theta_{b,\ell}. We can model the distribution of the observations (i.e., the votes) by writing

f_{v_{n,\ell}|g_n}\big( v_{n,\ell} \,|\, g_n; \{\theta_{b,\ell}\} \big) = \prod_{b=1}^{B} \theta_{b,\ell}^{\,I[g_n=b]v_{n,\ell}}\, (1-\theta_{b,\ell})^{I[g_n=b](1-v_{n,\ell})}    (36.128)

Only one term inside the product expression will be active at a time (the one corresponding to the actual block for the nth member), with all other terms inside the product evaluating to 1.
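Because of the indicator exponents, the product in (36.128) collapses to a single Bernoulli factor. The tiny Python/NumPy check below (our own notation) makes this explicit:

import numpy as np

def vote_likelihood(v, g, theta_l):
    # theta_l: vector of theta_{b,l} over the B blocks for a fixed legislation l
    one_hot = np.eye(len(theta_l))[g]
    prod_form = np.prod(theta_l**(one_hot * v) * (1 - theta_l)**(one_hot * (1 - v)))
    direct = theta_l[g]**v * (1 - theta_l[g])**(1 - v)   # single active factor
    return prod_form, direct                             # the two values coincide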


We model each \theta_{b,\ell} as being beta-distributed with parameters (\lambda_1, \lambda_2), say,

f_{\theta_{b,\ell}}(\theta_{b,\ell}) = \text{beta}(\lambda_1, \lambda_2) = \frac{\Gamma(\lambda_1+\lambda_2)}{\Gamma(\lambda_1)\Gamma(\lambda_2)}\, \theta_{b,\ell}^{\lambda_1-1}(1-\theta_{b,\ell})^{\lambda_2-1}    (36.129)

Recall that beta distributions are useful in generating realizations for the probability variable of a Bernoulli distribution. According to (5.66), every beta distribution of the above form belongs to the exponential family of distributions as well. In this case, we can write

f_{\theta_{b,\ell}}(\theta_{b,\ell}) = h(\theta_{b,\ell})\, e^{\lambda^T T(\theta_{b,\ell}) - s(\lambda)}    (36.130a)

where

\lambda \triangleq \begin{bmatrix} \lambda_1 \\ \lambda_2 \end{bmatrix}, \qquad T(\theta_{b,\ell}) = \begin{bmatrix} \ln(\theta_{b,\ell}) \\ \ln(1-\theta_{b,\ell}) \end{bmatrix}, \qquad h(\theta_{b,\ell}) = \frac{1}{\theta_{b,\ell}(1-\theta_{b,\ell})}    (36.130b)

as well as

s(\lambda) = \ln\Gamma(\lambda_1) + \ln\Gamma(\lambda_2) - \ln\Gamma(\lambda_1+\lambda_2)    (36.130c)

It follows, after some simple algebra similar to (36.124), that

f_{\theta_{b,\ell}}(\theta_{b,\ell}) \propto \exp\Big\{ (\lambda_1-1)\ln(\theta_{b,\ell}) + (\lambda_2-1)\ln(1-\theta_{b,\ell}) \Big\}    (36.131)

(36.132)

The observations in this problem consist of the voting record {v n,` } for all senators n = 1, 2, . . . , N and for all legislative issues ` = 1, 2, . . . , L. The variables {g n } are local latent variables, while {π, θ b,` } are global latent variables. For compactness of notation, we collect the various quantities into vectors or matrices: ∆

V = [v n,` ]N,L n=1,`=1

(N × L)

(36.133a)

(B × L)

(36.133b)

(collection of votes by N senators on L pieces of legislation) Θ = [θ b,` ]B,L b=1,`=1

(likelihoods of voting by the B blocks on the L pieces of legislation) ∆

g = col{g n }N (N × 1) n=1 (block assignments for all senators)

(36.133c)

Conditioned on block assignments, the entries of V are Bernoulli-distributed with success probabilities given by the entries of Θ. The entries of Θ are beta-distributed with parameters (λ1 , λ2 ), while the entries of g follow a categorical distribution with

36.3 Mean-Field Approximation

1433

parameter vector π. The joint distribution of all observations V and the latent variables {π, Θ, g} is given by: fπ ,Θ,g,V (π, Θ, g, V )

(36.134)

= fπ (π)fΘ (Θ)fg|π (g|π)fV |g,Θ (V |g, Θ) ! ! B Y L N Y B Y Y I[gn =b] = fπ (π) f (θb,` ) πb × n=1 b=1

b=1 `=1

L Y

B Y

n=1 `=1

b=1

N Y

I[g =b]vn,` θb,`n (1

I[gn =b](1−vn,` )

!

− θb,` )

Consequently, by collecting all terms that are independent of the latent variables into a constant C and grouping the remaining terms, the log-likelihood becomes ln fπ ,Θ,g,V (π, Θ, g, V ) =

B  X

N X

b=1

n=1

αb − 1 +

(36.135) 

I[gn = b] ln(πb ) +

B X N X L   X 1 (λ1 − 1) + I[gn = b]vn,` ln(θb,` ) + N n=1 b=1

`=1

B X N X L  X b=1 n=1 `=1

 1 (λ2 − 1) + I[gn = b](1 − vn,` ) ln(1 − θb,` ) + C N

Under the mean-field approximation, we estimate the conditional distribution of the latent variables given the observations in factored form, i.e., we adopt the model fπ ,Θ,b|V (π, Θ, b|V ) ≈ qπ (π) qΘ (Θ) qg (g) = qπ (π)

B Y L Y

(36.136) !



b,`

(θb,` )

N Y

! qgn (gn )

n=1

b=1 `=1

According to (36.42), the individual variational factors should satisfy the coupled equations (we drop the ? superscript for convenience of notation):   ln qπ (π) = E g,Θ ln fπ ,Θ,g,V (π, Θ, g, V ) (36.137a)   ln qθ (θb,` ) = E π ,g,Θ−b,−` ln fπ ,Θ,b,V (π, Θ, g, V ) (36.137b) b,`   ln qgn (gn ) = E π ,g ,Θ ln fπ ,Θ,g,V (π, Θ, g, V ) (36.137c) −n where the expectations are relative to the variational distributions. Let us examine (36.137c) first. We know from (36.121) that the entries of g n follow a categorical distribution, i.e., ∆

P(g n = b) = fgn (g n = b) =

B Y

I[gn =b]

πb

(36.138)

b=1

We seek a similar distribution for the variational factor qgn (gn ) and denote the individual probabilities by {b πn,b } (i.e., there is one estimate π bn,b for each sample n), namely, we assume B Y ∆ I[g =b] P(g n = b|V ) = qgn (gn ) = π bn,b n (36.139) b=1

1434

Variational Inference

Then, for a given n and gn = b, relations (36.135) and (36.137c) would imply the following probability value (where we are ignoring terms that are independent of πb ): ln π bn,b ( =

(36.140)

  E π ln(π b ) +

L X `=1

  vn,` E Θ ln(θ b,` −

L X `=1

  vn,` E Θ ln(1 − θ b,` )

) + C0

We will verify soon in (36.152) and (36.154) that the variational factor for π has a Dirichlet form with B ×1 parameter vector α b = col{b αb }, while the variational factor for bb`,1 , λ bb`,2 ). We can therefore use property each θ b,` has a beta form with parameters (λ (5.60) for the mean of the logarithm of a Dirichlet-distributed variable (of which the beta distribution is a special case) to deduce that: B   X  E π ln(π b ) = ψ(b αb ) − ψ α bb0

(36.141)

b0 =1

in terms of the digamma function ψ(x) defined as the ratio of the derivative of the gamma function to itself: ∆

ψ(x) =

Γ0 (x) d ln Γ(x) = dx Γ(x)

(36.142)

We provided in (5.62) one expression for evaluating the digamma function:  ∞  X 1 1 − ψ(x) ≈ −0.577215665 + 1+m x+m m=0

(36.143)

This expression can be used to approximate ψ(x) by replacing the infinite series by a finite sum. Likewise, since the beta distribution is a special case of the Dirichlet distribution, we conclude from property (5.60) that     bb`,1 ) − ψ λ bb`,1 + λ bb`,2 E Θ ln(θ b,` ) = ψ(λ (36.144a)     bb`,2 ) − ψ λ bb`,1 + λ bb`,2 E Θ ln(1 − θ b,` ) = ψ(λ (36.144b) Substituting into (36.140) we find that ( π bn,b ∝ exp ψ(b αb ) − ψ

B X b0 =1

α bb0



+

L X

vn,`



`=1

bb`,1 ) − ψ(λ bb`,2 ) ψ(λ



) (36.145)

We can normalize the π bn,b to add up to 1 by setting, for each n: π bn,b ← π bn,b

B .X b0 =1

π bn,b0

(36.146)

Remark 36.2. (Possibility of overflow and/or underflow) Some care is needed in the numerical implementation of this procedure due to the exponentiation in (36.145) and the possibility of overflow or underflow. If the number in the exponent assumes large values, then π bn,b given by (36.145) can saturate in finite-precision implementations. One way to avoid this possibility is as follows. Assume, for illustration purposes, that we have a collection of exponential quantities of the form πk = eak , which we would like to normalize to add up to 1, i.e., we would like to replace each πk by eak πk ← PK k0 =1

eak0

(36.147)

36.3 Mean-Field Approximation

1435

The difficulty is that some of the {ak } may assume large values, thus leading to saturation when attempting to compute eak . This problem can be avoided by subtracting from all the {ak } their largest value and introducing: n o ∆ bk = ak − max ak0 (36.148) 0 1≤k ≤K

By doing so, the largest value of the {bk } will be zero. It is easy to see that using the {bk } instead of the {ak } does not change the value of the normalized {πk }: ebk π k ← PK k0 =1

(36.149)

ebk0

 Let us now return to (36.137b). For a given b and `, using (36.135), relation (36.137b) implies (where we are ignoring terms that are independent of θb,` ): ln qθ

b,`

N   X (θb,` ) = (λ1 − 1) + E g (I[g n = b]v n,` ) ln(θb,` ) +

(36.150)

n=1



(λ2 − 1) +

N X n=1

h i E g I[g n = b](1 − v n,` ) ln(1 − θb,` ) + C 00

where the expectation is relative to the variational distribution of g. Using (36.127) we readily recognize that   E g I[g n = b] v n,` = P(g n = b)vn,` = π bn,b vn,` (36.151) Using the analogy with (36.131) for beta distributions, we conclude from (36.150) that qθ (θb,` ) b,`

=

bb`,1 λ

=

bb`,2 λ

=

bb`,1 , λ bb`,2 ) beta( λ N X λ1 + π bn,b vn,` λ2 +

n=1 N X n=1

(36.152)

π bn,b (1 − vn,` )

Let us now consider (36.137a). Using (36.135), the relation implies ln qπ (π) =

B  N  X X αb − 1 + E g I[g n = b] ln(πb ) + C 000 n=1

b=1

B  N  X X = αb − 1 + π bn,b ln(πb ) + C 000

(36.153)

n=1

b=1

Comparing with (36.124) we conclude that the variational factor for π has a Dirichlet form: qπ (π)

=

α b

=

α bb

=

Dirichlet(b α) n oB col α bb b=1

αb +

N X n=1

π bn,b

(36.154)

1436

Variational Inference

In summary, we arrive at listing (36.155) for estimating the distributions of the latent variables in model (36.132) conditioned on the observations. Coordinate-ascent algorithm for generative model (36.132). input: N × L votes {v n,` } arising from model (36.132). given: parameters λ1 , λ2 , and {αb }B b=1 n oB,L n oB (−1) b(−1) , λ b(−1) initial conditions: λ , and α b b`,1 b`,2 b b=1,`=1

b=1

repeat until convergence over m = 0, 1, 2, . . . :   P  (m−1) (m−1) B ab,1 = ψ α bb −ψ b b0 , b = 1, 2, . . . , B b0 =1 α     b(m−1) − ψ λ b(m−1) , b = 1, 2, . . . , B ab`,2 = ψ λ b`,1 b`,2 ` = 1, 2, . . . , L ( !) L X (m) π bn,b ∝ exp ab,1 + ab`,2 vn,` `=1

(m)

(m)

π bn,b ← π bn,b

b(m) = λ1 + λ b`,1

B .X b0 =1 N X

(m)

π bn,b0 ,

(m)

π bn,b vn,` ,

n=1

b(m) = λ2 + λ b`,2

N X n=1

(m)

α bb

= αb +

N X

(36.155)

n = 1, 2, . . . , N b = 1, 2, . . . , B b = 1, 2, . . . , B ` = 1, 2, . . . , L

(m)

π bn,b (1 − vn,` ) (m)

π bn,b , b = 1, 2, . . . , B

n=1

end Once convergence is attained, we can estimate the parameters by using the estimates at the end of the recursions to compute: (m)

α bb = α bb

, b = 1, 2, . . . , B

(36.156a)

N 1 X (m) π bb = π b , b = 1, 2, . . . , B N n=1 n,b

(36.156b)

B X L X b1 = 1 b(m) λ λ b`,1 BL

(36.156c)

B X L X b2 = 1 b(m) λ λ b`,2 BL

(36.156d)

b=1 `=1

b=1 `=1

We can also estimate the parameters {θb,` } by using the location of the peak values of the respective beta distributions, qθ (θb,` ). Recall that this function approximates the b,` conditional distribution of θ b,` given the observations V , and its peak therefore amounts to a maximum a-posteriori (MAP) estimate for θb,` . It is known that the mode of a b(m) , λ b(m) } larger than 1 is located at (see beta distribution with shape parameters {λ b`,1 b`,2 Prob. 36.9): θbb,` =

b(m) − 1 λ b`,1 b(m) > 1 and λ b(m) > 1 , when λ b`,1 b`,2 b(m) + λ b(m) − 2 λ b`,1

b`,2

(36.157)

36.3 Mean-Field Approximation

1437

For convenience, we will adopt the following conventions for the location of the mode(s) for other (less common) possibilities for the shape parameters (although, technically, one can question whether the endpoints {0, 1} should be treated as modes):  b(m) = λ b(m) = 1 any number in the interval (0, 1), when λ  b`,1 b`,2     b(m) < 1 and λ b(m) < 1  0 and 1 (two peaks), when λ b`,1 b`,2 θbb,` = (36.158) (m) (m)  b b  0, when λb`,1 ≤ 1 and λb`,2 > 1     b(m) > 1 and λ b(m) ≤ 1 1, when λ b`,1 b`,2 We can also infer the block membership of each senator. Thus, let the L × 1 vector V n denote the collection of all votes by the nth senator (its entries consist of 1s and 0s). Using the Bayes rule we have P(g n = b, V n = Vn |π = π, Θ = Θ) = fV n |π ,Θ (Vn |π, Θ) P(g n = b|V n = Vn , π = π, Θ = Θ) = P(g n = b|π = π, Θ = Θ) fV n |π ,Θ,g

n =b

(Vn |π, Θ, gn = b)

n =b

(Vn |π, Θ, gn = b)

(36.159)

so that P(g n = b|V n = Vn , π = π, Θ = Θ) =

=

P(g n = b|π = π, Θ = Θ) fV n |π ,Θ,g

fV n |π ,Θ (Vn |π, Θ)

P(g n = b|π = π, Θ = Θ) fV n |π ,Θ,g =b (Vn |π, Θ, gn = b) n PB 0 )f 0 P(g = b (V n |π, Θ, gn = b ) 0 0 n b =1 V n |π ,Θ,g =b

(36.160)

n

and, hence,

P(g n = b|Vn , π, Θ) = P B

πb

b0 =1

Q

v

L `=1

πb0

Q

n,` θb,` (1 − θb,` )(1−vn,` )

L `=1

v



(1−vn,` ) 0 θb0n,` ,` (1 − θb ,` )



(36.161)

We evaluate B such probabilities for user n, one for each value of b. Then, we select its block location as the index of the maximum probability value. In order to avoid the possibility of overflow or underflow, it is preferable to work with the logarithm of the above expression. Moreover, since the denominator remains unchanged as we vary b, we can set the block choice to: ( b

o



=

argmax 1≤b≤B

ln(b πb ) +

L X `=1

) b b vn,` ln(θb,` ) + (1 − vn,` ) ln(1 − θb,` )

=⇒ assign nth senator to block b

(36.162)

o

We use the estimates {b πb , θbb,` } in place of the true values {πb , θb,` } in (36.161) to carry out this assignment. We illustrate the operation of recursions (36.155) by means of a numerical simulation using N = 100 senators, L = 1000 legislative issues, B = 7 blocks, and (λ1 , λ2 ) = (0.1, 0.1). The entries of the vector α are chosen randomly:  T α = 8.1079 8.0130 9.7175 6.7358 0.4260 9.5652 4.8754 (36.163)

Variational Inference

The vector of block assignments π = col{πb } is generated from a Dirichlet distribution with the above parameters {αb }: π=



0.1198

0.1641

0.1953

0.1420

0.0106

0.1855

0.1827

T

(36.164)

The entries of the B × L matrix Θ are generated independently from a beta(λ1 , λ2 ) distribution. Figure 36.6 plots in a gray-coded graph the values of the parameters {θb,` } that are generated in this manner across all blocks and all legislation topics. We further generate the voting data {vn,` } according to the generative model (36.132); we assign the senators to random blocks according to the distribution defined by the vector π, and subsequently assign their votes to 1 or 0 according to the probabilities defined by the {θb,` }. Figure 36.7 lists the voting record for all senators across all legislation instances. A dark color corresponds to an affirmative vote (vn,` = 1), while a light color corresponds to a negative vote (vn,` = 0). beta-distributed entries of

estimated entries of

7

7

6

6

5

5

4

4

3

3

2

2

1

0.8

0.6

0.4

0.2

1

1 200

400

600

800

200

1000

400

600

800

1000

0

Figure 36.6 (Left) A gray-coded representation of the values of the probabilities {θb,` } generated independently according to beta(0.1, 0.1). (Right) A similar diagram using the estimated values {θbb,` } at the end of recursion (36.157).

voting record

senators

1438

100 50 1 1

100

200

300

400

500

600

700

800

900

1000

legislation Figure 36.7 A diagram representing the votes by the senators on the various

legislative issues posed to them. A dark color corresponds to an affirmative vote (vn,` = 1), while a light color corresponds to a negative vote (vn,` = 0). We apply recursions (36.155) for 100 iterations. Using (36.156a)–(36.156d), and the estimated values at the end of these 100 iterations, we find

36.3 Mean-Field Approximation

1439

b1 , λ b2 ) = (7.3337, 7.1520) (λ (36.165)  T π b = 0.5600 0.0000 0.0000 0.0000 0.28000.0000 0.1600 (36.166)  T α b = 64.1079 8.0130 9.7175 6.7358 28.4260 9.5652 20.8754 (36.167) The simulation ends up identifying four blocks of senators with (30, 12, 20, 38) members, respectively. The blocks are shown in the left plot in the top row of Fig. 36.8, where the senators are renumbered and grouped together into four blocks indicated by the squares. Senators within the same block tend to vote together. To generate this figure, we renumber the senators so that those belonging to the same group are renumbered sequentially. The right plot in the top row maintains the original numbering of the senators along the horizontal axis, and colors their block assignments. For example, the senator with original number #45 is assigned to block 6). The same information appears in the left plot in the bottom row, while the right plot in that row shows the new numbering of the senators against their original numbers (for example, the senator with original number #45 is now assigned number 20).

4

80

4

60

blocks

reordered senators

100

3 40

2

20

20

2 1

1 1

3

40

60

80

100

1

4

3 2 1 1

20

40

60

80

original senator number

20

40

60

80

100

orignal senator number reordered senator number

block assignment

reordered senators

100

100 80 60 40 20

1

20

40

60

80

100

original senator number

Figure 36.8 (Top left) Senators are renumbered and grouped together into four blocks

indicated by the colored squares. Senators within the same block tend to vote together. (Top right) Original numbering for senators is shown on the horizontal axis with their block assignments across the vertical axis. (Bottom left) Similar information in graphical form showing the block assignments using the original numbering for the senators. (Bottom right) Mapping from original numbers to reordered numbers.

1440

Variational Inference

36.4

EXPONENTIAL CONJUGATE MODELS In the examples considered so far it was possible to infer from relation (36.42) the form of the optimal variational factors, as was illustrated by Example 36.3. This is generally not possible for arbitrary distributions. However, for the class of distributions in the exponential family, a similar conclusion can be deduced, namely, that the variational factors will continue to have exponential forms. When this happens, the expressions for implementing the coordinate-ascent construction can be simplified. We treat this case in this section, which includes many useful special cases such as Bayesian mixtures of exponential and Gaussian models.

36.4.1

Data Model From this point onward in our treatment, we will divide the latent variables into two types: a global variable denoted by θ ∈ IRM and a collection of local variables denoted by {z n }; one for each of the observations {y n }. We assume there are N observations for n = 1, 2, . . . , N , and that each y n depends only on (z n , θ). The variables (y n , z n ) can be vector or scalar valued. While each z n affects its own y n , the variable θ affects all observations. For example, z n could refer to the class of the nth observation, in which case z n would be discrete. The variable z n could also refer to some other property about y n such as the mean of its distribution, in which case z n would be a continuous. We will consider one important application in the next chapter in the context of topic modeling and later in Example 42.5 in the context of hidden Markov models (HMMs). Here we illustrate the definition with two simple examples. Example 36.6 (Gaussian mixture model revisited) Consider the generative model (36.45) from Example 36.3, namely,  generate two random means µ1 ∼ Nµ1 (¯ µ1 , σ12 ), µ2 ∼ Nµ2 (¯ µ2 , σ22 ).     for each n = 1, 2, . . . , N : select a model index kn ∈ {1, 2} according to Bernoulli(p)   generate the observation y n ∼ Nyn (µkn , σk2 n )   end

(36.168)

In this case, the means θ = col{µ1 , µ2 } are global latent variables, which affect all observations {y n }. The class label kn , on the other hand, is a local variable that affects only y n . Example 36.7 (Bernoulli model revisited) Consider the Bernoulli model described in Example 36.2. Let p denote a probability value that is chosen uniformly from within the interval p ∼ U[0, 1]. We collect N independent Bernoulli observations y n ∈ {0, 1} where, for each y n , the probability of success is p, i.e., P(y n = 1|p = p) = p. In this case, the probability θ = p is a global latent variable that affects all observations, and there are no local latent variables z n .

36.4 Exponential Conjugate Models

1441

Joint distribution Let y 1:N denote the collection of all observations from n = 1 to n = N . Likewise, for z 1:N . We assume that, conditioned on the parameter θ, the data (y n , z n ) are independent of each other over n so that the joint pdf can be factored as: fy

1:N

(y , z , θ) = fθ (θ) ,z 1:N ,θ 1:N 1:N

N Y

n=1

fy

θ (yn , zn |θ)

n ,z n |

(36.169)

We further assume that each of the individual pdfs of (y n , z n ) that appear inside the product sign belong to the exponential family of distributions with a general form given by fy

n

n o T (y , z |θ) = h(y , z ) exp (φ(θ)) T (y , z ) − αa(θ) n n n n n n ,z n |θ

(36.170)

for some base function h(·), parameter θ ∈ IRM , vector function φ(·) : IRM → IRM , log-normalization function a(θ), and statistic T (yn , zn ) ∈ IRM . One important special case is when φ(θ) = θ. The analysis and results that follow can be easily specialized to this case. Note that we are including a scaling factor α ∈ IR for convenience; its value is generally equal to 1. The choice α = 0 accommodates situations when a(θ) may not be present.

Global latent parameter Using the analogy with the earlier expression (5.132) and the corresponding conjugate prior in (5.126), we will similarly assume that the pdf for the global parameter θ is a conjugate prior for (36.170), namely, that it is of the form: n o fθ (θ) = k(θ) exp λT 1 θ − λ2 a(θ) − s(λ1 , λ2 ) n o ∆ = k(θ) exp λT G(θ) − s(λ) (36.171) for some base function k(θ), log-normalization factor s(λ), and statistic vector   θ G(θ) =  −a(θ)  ∈ IR2M +1 (36.172) φ(θ)

The last entry in this vector is unnecessary for characterizing the prior fθ (θ) from (36.171); it is included here in order to allow us to show that the prior and the posterior (to be derived soon) can be represented uniformly by using the same statistic vector G(θ). For this reason, the λ-vector in (36.171) has a trailing zero, i.e.,   λ1 ∈ IRM , λ =  λ2 ∈ IR (2M + 1) × 1 (36.173) M λ3 = 0 ∈ IR

1442

Variational Inference

so that λT G(θ) = λT 1 θ − λ2 a(θ)

(36.174)

as required by the first line in (36.171). Now, repeating the argument that led to (5.133), it can be verified that the posterior for θ would continue to belong to the same exponential family and would take the form: (θ|y1:N , z1:N )   ∝ k(θ)exp λT θ − (λ2 + N α)a(θ) +  1 fθ |y

1:N ,z 1:N

N X

!T

T (yn , zn )

n=1

  φ(θ) 

(36.175)

If we introduce the (2M + 1)-dimensional parameter vector λ0 = col{λ01 , λ02 , λ03 } with entries ∆

λ01 = λ1 ∈ IRM ∆

λ02 = (λ2 + N α) ∈ IR ∆

λ03 =

N X

n=1

T (yn , zn ) ∈ IRM

(36.176a) (36.176b) (36.176c)

then the posterior (36.175) for θ can be rewritten in the same form as the prior: n o fθ |y ,z1:N (θ|y1:N , z1:N ) = k(θ) exp (λ0 )T G(θ) − s(λ0 ) 1:N

(36.177)

in terms of the log-normalization function s(λ0 ), which can be written more explicitly as s(λ01 , λ02 , λ03 ) to emphasize that it is a function of (λ01 , λ02 , λ03 ). We remark that the same log-normalization factor s(·) from (36.171) appears in (36.177) since expression (36.177) needs to be a valid exponential pdf that integrates to 1. The posterior (36.177) is what will appear in expression (36.194b) when solving for the variational factor for θ; thus, having it belong to the exponential family helps facilitate the algebraic manipulations since the logarithm operation will undo the exponential operation. Observe that the only major change in moving from the prior distribution (36.171) to the posterior distribution (36.177) is replacing the parameter λ by λ0 . We compare the relation between the prior and posterior distributions in Table 36.1. The posterior (36.177) is also referred to as the complete conditional model for θ, where the term “complete conditional” for a latent variable refers to its conditional pdf given all observations and all other latent variables. Example 36.8 (Special case φ(θ) = θ) Assume φ(θ) = θ and α = 1. We can group the terms involving θ and φ(θ) in (36.171) and (36.177) so that expressions (36.170), (36.171), and (36.177) reduce to

36.4 Exponential Conjugate Models

1443

Table 36.1 Comparison of the prior and posterior distributions for the global latent variable, θ. Prior (36.171) for θ n o k(θ) exp λT G(θ) − s(λ)   θ G(θ) =  −a(θ)  φ(θ)   λ1 λ =  λ2  λ3

Posterior (36.177) for θ n o k(θ) exp (λ0 )T G(θ) − s(λ0 )   θ G(θ) =  −a(θ)  φ(θ)  0  λ1 λ0 =  λ02  λ03

λ1 ∈ IRM λ2 ∈ IR

λ01 = λ1 λ02 = λ2 + N α

λ3 = 0M

λ03 =

N X

T (yn , zn )

n=1

fy

fθ |y

n o T (y , z |θ) = h(y , z ) exp θ T (y , z ) − a(θ) n n n n n n n ,z n |θ n o fθ (θ) = k(θ) exp λT G(θ) − s(λ) n o (θ|y1:N , z1:N ) = k(θ) exp (λ0 )T G(θ) − s(λ0 )

1:N ,z 1:N

(36.178a) (36.178b) (36.178c)

with  θ ∈ IRM +1 −a(θ)    0  λ1 λ1 ∈ IRM +1 λ= ∈ IRM +1 , λ0 = λ2 λ02 

G(θ) =

λ01 = λ1 +

N X n=1

T (yn , zn ) ∈ IRM

λ02 = λ2 + N ∈ IR

(36.179a) (36.179b) (36.179c) (36.179d)

Local latent variables Since the latent variable {z n } is conditionally independent of all other latent variables {z m6=n } given (y n , θ), then the complete conditional for z n is the posterior fzn |y ,θ (zn |yn , θ). Given that the joint pdf of (y n , z n ) is in the exponential n family, we know from the result of the earlier Prob. 5.6 that n o fzn |y ,θ (zn |yn , θ) = c(zn ) exp η T S(zn ) − r(η) n

(36.180)

1444

Variational Inference

for some base function c(zn ), log-normalization factor r(η), statistic S(zn ), and parameter η; this parameter is actually a function of both (yn , θ) due to the conditioning, i.e., we should have written η(yn , θ) instead of η to be more explicit about the dependence. Again, this conditional pdf will appear in the coordinateascent update (36.194a) when solving for the variational factor for z n ; thus, having it belong to the exponential family helps facilitate the algebraic manipulations since the logarithm operation will undo the exponential operation. We summarize the discussions so far in the following statement. (Exponential conjugate model) Usually, when variational inference is performed under exponential conjugate models, it is meant that the joint pdf of the observations and the local latent variables satisfies (36.170), while the prior for the global latent variable satisfies (36.171): n o fy ,zn |θ (yn , zn |θ) = h(yn , zn ) exp (φ(θ))T T (yn , zn ) − α a(θ) (36.181a) n n o fθ (θ) = k(θ) exp λT G(θ) − s(λ) (36.181b) where G(θ) = col{θ, −a(θ), φ(θ)}. It then follows that the complete conditionals of the global latent variable θ and the local latent variables {z n } satisfy (36.177) and (36.180), namely, n o fθ |y ,z1:N (θ|y1:N , z1:N ) = k(θ) exp (λ0 )T G(θ) − s(λ0 ) (36.181c) 1:N n o fzn |y ,θ (zn |yn , θ) = c(zn ) exp η T S(zn ) − r(η) (36.181d) n

These models are our starting point for the derivations that follow. In particular, we will assume that we know the parameters {λ, η} and the form of the statistics T (yn , zn ) and S(zn ); their actual values are of course not known since they depend on the latent variables {zn }.

Example 36.9 (Gaussian mixture model as a special case) Let us reconsider the generative model (36.168) and verify that it is a special case of (36.181a)–(36.181b). To begin with, for θ = col{µ1 , µ2 } we have ! ! ¯1 ) ¯2 ) − 12 (µ1 −µ − 12 (µ2 −µ 1 1 2σ1 2σ2 × p (36.182) fθ (θ) = p e e 2πσ12 2πσ22 Using the exponential representation (5.12) we arrive at  µ1 (  µ2  µ ¯1 µ ¯2 1   1 04×1  fθ (θ) = × exp  − 1 2 µ21 + 1 2 µ22 σ12 σ22 2π 2σ1 2σ2 φ(θ) )   µ ¯2 µ ¯2 ln σ1 + ln σ2 + 12 + 22 2σ1 2σ2

  − 

(36.183)

36.4 Exponential Conjugate Models

1445

for some vector-valued function φ(θ), whose form is irrelevant at this stage but will be deduced further ahead in (36.189). The above expression is of the form shown in (36.181b) with k(θ) = 1/2π  µ ¯1 /σ12  µ ¯2 /σ22 λ=  1 0   µ1 θ= µ2 a(θ) =

(36.184a) 

λ1 ∈ IR2  ∆  =  λ2 ∈ IR   λ3 = 04×1 

 (36.184b)

(36.184c)

1 1 2 µ1 + 2 µ22 2σ12 2σ2

b(λ) = ln σ1 + ln σ2 +

(36.184d)

µ ¯21 µ ¯2 + 22 2 2σ1 2σ2

(36.184e)

On the other hand, for the joint distribution of (y n , z n ), where z n = kn , and using again the exponential representation (5.12), we have fy

θ (yn , zn |θ)

n ,z n |

I[zn =1] I[zn =2] 2 2 = pI[zn =1] Ny (µ1 , σy,1 ) × (1 − p)I[zn =2] Ny (µ2 , σy,2 )     I[zn =1] µ1 1 µ2 1 yn − 2 − ln σy,1 − 21 × = pI[zn =1] √ exp 2 2 σy,1 2σy,1 yn 2σy,1 2π  I[zn =2]    µ2 1 1 µ22 yn − (1 − p)I[zn =2] √ exp − ln σ − y,2 2 2 2 σy,2 2σy,2 yn2 2σy,2 2π    2 (yn /σy,1 )I[zn = 1]          2     )I[z = 2] (y /σ   n n y,2  (36.185) ∝ h(yn , zn ) exp µ1 µ2 −µ21 −µ22   (1/2σ 2 )I[z = 1]    n   y,1       2 1/2σy,2 )I[zn = 2] where h(yn , zn ) collects all remaining terms, which are independent of (µ1 , µ2 ); its exact expression is unnecessary (it can be seen that, in this example, h(yn , zn ) is only a function of zn and not yn ). The above expression is of the same form as (36.181a) with α = 0 and 

µ1

  µ2 φ(θ) =   −µ2 1  −µ22

   ,  

2 (yn /σy,1 )I[zn = 1]



 2  (yn /σy,2 )I[zn = 2] T (yn , zn ) =   (1/2σ 2 )I[z = 1] n  y,1

    



2 (1/2σy,2 )I[zn = 2]

(36.186)

Next, using the Bayes rule, we examine the posterior: fzn |y ,θ (zn |yn , θ) = n

fy

fy ,zn |θ (yn , zn |θ) θ (yn , zn |θ) n = ˆ fy |θ (yn |θ) n fy ,z|θ (yn , z|θ)dz

n ,z n |

z∈Z

n

(36.187)

1446

Variational Inference

so that fzn |y ,θ (zn |yn , θ) = n

fy ,z |θ (yn , zn |θ)  ˆ n n fy ,z|θ (yn , z|θ)dz exp ln

(36.188)



n

z∈Z

 ˆ  T ∝ exp (φ(θ))T T (yn , zn ) − ln h(yn , z)e(φ(θ)) T (yn ,z) dz z

Conditioned on y n , the integral term is a function of θ. Moreover, using (36.186), the inner product (φ(θ))T T (yn , zn ) can be rewritten as η T S(zn ) in terms of some function of zn :

(φ(θ))T T (yn , zn ) =



µ1 yn µ2 − 21 2 σy,1 2σy,1

|



µ2 y n µ2 − 22 2 σy,2 2σy,2 {z

 I[zn = 1] I[zn = 2] {z } }|





= S(zn )

= η(yn ,θ)T

(36.189) Therefore, expression (36.189) is of the same form as (36.188) with µ1 yn µ21 − 2  σ2 2σy,1  y,1 η=  µ2 yn µ2 − 22 2 σy,2 2σy,2 

   , 

" S(zn ) =

I[zn = 1]

#

I[zn = 2]

(36.190)

Although we already know that the complete conditionals for the latent variables have particular exponential forms in (36.181c)–(36.181d), we are interested in computing the conditional pdf of the aggregate latent variables given the observations, namely, the quantity: fz1:N ,θ |y

1:N

(z1:N , θ|y1:N )

(desired posterior)

(36.191)

We will do so by applying the mean-field method, which approximates the above pdf by a product of variational factors:

fz1:N ,θ |y

1:N

(z1:N , θ|y1:N ) ≈ qθ (θ)

N Y

n=1

!

qzn (zn )

(36.192)

where each qa (a) approximates the individual conditional pdf fa|y1:N (a|y1:N ), i.e., qθ (θ) approximates fθ |y (θ|y1:N ) 1:N qzn (zn ) approximates fzn |y1:N (zn |y1:N )

(36.193a) (36.193b)

36.4 Exponential Conjugate Models

36.4.2

1447

Coordinate-Ascent Solution Now, according to construction (36.43), the sought-after variational factors for the latent variables satisfy the coupled equations (we are again dropping the ? superscript for simplicity of notation):   (36.194a) ln qzn (zn ) = E z−n ,θ ln fzn | y ,z−n ,θ (zn |y1:N , z−n , θ) 1:N   ln qθ (θ) = E z1:N ln fθ | y ,z1:N (θ|y1:N , z1:N ) (36.194b) 1:N

where, as explained earlier, the expectation E a is relative to the variational distribution of a, i.e., relative to qa (a). Since the two conditional pdfs appearing on the right-hand side of (36.194a)–(36.194b) are assumed to be exponential in form, some important simplifications occur due to the logarithm annihilating the exponential function. Thus, using (36.181d) and the fact that z n is independent of all other z −n and y −n , equality (36.194a) reduces to   ln qzn (zn ) = E θ ln fzn |y ,θ (zn |yn , θ) (36.195) n n  o (36.181d) = ln c(zn ) + E θ η(y n , θ)T S(zn ) − r η(y n , θ) It follows that the variational factor for each z n is exponential and given by: n o qzn (zn ) = c(zn ) exp βnT S(zn ) − r(βn ) where we introduced

h i ∆ βn = E θ η(y n , θ)

(36.196)

(36.197)

The expectation is relative to the variational distribution qθ (θ), which we will determine next. Moreover, the same log-normalization function r(·) with argument βn appears in expression (36.196) because this expression needs to be a valid exponential pdf that integrates to 1. Observe how the resulting variational factor for z n has the same form as the conditional pdf for z n given by (36.181d) except that the parameter η is replaced by βn . It is useful to remember that η is also dependent on n since it depends on the observation yn . Next, using (36.194b) and (36.181c) we have

ln qθ (θ) = ln k(θ) + λT 1θ +

(36.198) N X

n=1

!T

E zn T (y n , z n )

φ(θ) − (λ2 + N α)a(θ) − E z1:N s(λ0 )

It follows again that the variational factor for the global parameter θ is exponential and given by

1448

Variational Inference

n o qθ (θ) = k(θ) exp (λ00 )T G(θ) − s(λ00 )

(36.199)

with the same statistic G(θ) = col{θ, −a(θ), φ(θ)} but with a new hyper-parameter vector λ00 = col{λ001 , λ002 , λ003 } with entries λ001 = λ1 ∈ IRM

λ002

(36.200a)

= (λ2 + N α) ∈ IR

λ003 =

N X

n=1

(36.200b)

E zn T (y n , z n ) ∈ IRM

(36.200c)

where the expectation is relative to the variational distribution qzn (zn ). The same log-normalization factor s(·) appears in (36.199) since expression (36.199) needs to be a valid exponential pdf that integrates to 1. Thus, observe again that the variational factor for θ has the same form as the conditional pdf given by (36.181c) except that the parameter λ0 is replaced by λ00 . We summarize the main conclusion in Table 36.2, which lists the assumed conditional pdfs and the resulting forms for their variational factors (or approximations). Note how the variational factors are parameterized by (λ00 , βn ), which need to be determined. In the listing in the table we remove the subscripts from the conditional pdfs to simplify the notation. Specifically, we write f (zn |yn , θ) and q(zn ) instead of fzn |y ,θ (zn |yn , θ) and qzn (zn ). We also write f (θ|y1:N , z1:N ) n and q(θ) instead of fθ |y ,z1:N (θ|y1:N , z1:N ) and qθ (θ). 1:N

Table 36.2 Forms of the original pdfs and the resulting variational factor approximations. Original pdfs f (zn |yn , θ) = c(zn )eη

Variational factors T

S(zn )−r(η)

f (θ|y1:N , z1:N ) = k(θ)e(λ

0 T

) G(θ)−s(λ0 )

T

q(zn ) = c(zn )eβn S(zn )−r(βn ) 00 T

00

 λ01 0 λ =  λ2  λ03

q(θ) = k(θ)e(λ ) G(θ)−s(λ   βn = E θ η(y n , θ)  00  λ1 00 λ =  λ002  λ003

λ01 = λ1

λ001 = λ1

λ02 = λ2 + N α

λ002 = λ2 + N α

η(yn , θ)  0

λ03 =

N X n=1

T (yn , zn )

λ003 =

N X

)

E zn T (y n , z n )

n=1

We thus find that in order to determine qzn (zn ) in (36.196) we need to evaluate qθ (θ) because βn depends on it. In return, to compute qθ (θ) we need to

36.4 Exponential Conjugate Models

1449

determine qzn (zn ). This suggests that we need to iterate between determining both variational factors until sufficient convergence is attained. We therefore arrive at the general listing (36.201) of a coordinate-ascent algorithm applied to conditionally conjugate exponential models.

Coordinate-ascent for conjugate exponential models. input: N observations {y n } satisfying models (36.181a)–(36.181b). given: parameters (λ1 , λ2 , λ3 , α, η), statistics {T (yn , zn )}; log-normalization factors a(θ), s(λ), r(η), as well as φ(θ); base functions k(θ), c(zn ). 00(m) set λ1 = λ1 , for all m; 00(m) set λ2 = λ2 + N α, for all m; 00(−1) select initial λ3 ∈ IRM randomly; ∆

let G(θ) = col{θ, n −a(θ), φ(θ)}; o 00(−1) 00(−1) 00(−1) 00(−1) let λ = col λ1 , λ2 , λ3 ; (−1)

select βn

randomly, for n = 1, 2, . . . , N .

repeat until ELBO value L(m) has converged over  o m = 0, 1, 2, . . . : n  (m−1) 00(m−1) T 00(m−1) G(θ) − s λ q (θ) = k(θ) exp λ θ h i (m−1) = E θ (m−1) η(y n , θ) , n = 1, . . . , N βn (m−1)

qzn

00(m)

λ3

(zn ) = c(zn ) exp

=

N X

n=1

T o n  (m−1) (m−1) , n = 1, . . . , N βn S(zn ) − r βn

  E z(m−1) T (y n , z n ) n

compute L(m) using (36.204) end (36.201) (m−1)

One of the challenges in running this algorithm is the need to evaluate βn , which is dependent on computing an expectation relative to the distribution of θ (m−1) . We will illustrate this calculation in the next example. We will again assess the performance of the algorithm by using

L

N   X   (m−1) (m) = −DKL q kq − kqz(m) DKL qz(m−1) n n θ θ

(m) ∆

(36.202)

n=1

The terms on the right-hand side involve the KL divergences between exponential distributions. Using results (6.54)–(6.55), we have

1450

Variational Inference

  (m−1) (m) DKL q kq (36.203a) θ θ  T 00(m−1) 00(m) = λ3 − λ3 E θ (m−1) (φ(θ)) + s(λ00(m) ) − s(λ00(m−1) )

where we used the fact that 00(m)

λ1

00(m−1)

= λ1

= λ1 ,

00(m)

λ2

00(m−1)

= λ2

= λ2 + N, for all m

(36.203b)

Likewise,   DKL qz(m−1) kqz(m) (36.203c) n n  T = βn(m−1) − βn(m) E z(m−1) (S(z n )) + r(βn(m) ) − r(βn(m−1) ) n

where the expectation is over the distribution qzn (zn ) from the (m − 1)th iteration. We thus find that, apart from constant factors (which cancel out when we compare two successive ELBO values): L(m) = (36.204)  T 00(m) 00(m−1) λ3 − λ3 E θ (m−1) (φ(θ)) + s(λ00(m) ) − s(λ00(m−1) ) + N  X

n=1

βn(m) − βn(m−1)

T

E z(m−1) (S(z n )) + r(βn(m) ) − r(βn(m−1) ) n

The algorithm is repeated until this value has converged, i.e., until the difference |L(m) − L(m−1) | is small enough. Example 36.10 (Application to the Gaussian mixture model) Let us reconsider the generative Gaussian mixture model from Example 36.9 and apply recursions (36.201). To begin with, we determine the forms of the variational factors. Thus, note that the variational factor for z n = kn is given by n o qzn (zn ) ∝ exp βnT S(zn ) − r(η)

(36.205)

where   µ1 yn µ21 E − 2 2 θ  σy,1 2σy,1 (36.190)  ∆ βn = E θ η(y n , θ) =     µ22 µ2 yn − 2 Eθ 2 σy,2 2σy,2 " # I[zn = 1] S(zn ) = I[zn = 2] 

    

(36.206)

(36.207)

and the expectations are relative to the variational distribution for θ = col{µ1 , µ2 }. We will determine this distribution in the following in order to be able to assess these

36.4 Exponential Conjugate Models

1451

expectations. For now, let us denote the means and variances of the individual entries of θ by ∆

µ ¯1 + µ ¯q,1 = E θ (µ1 )

(36.208)



µ ¯2 + µ ¯q,2 = E θ (µ2 ) 2 σq,1



=

E θ (µ21 )

2 σq,2



E θ (µ22 )

=

(36.209) − (E θ µ1 ) − (E θ µ2 )

2

(36.210)

2

(36.211)

2 2 where we need to determine the parameters (¯ µq,1 , µ ¯q,2 ) and (σq,1 , σq,2 ). We are adding (¯ µ1 , µ ¯2 ) to the means of the distributions for µ1 and µ2 in order to remain consistent, and to facilitate comparison, with the models we derived earlier in (36.65)–(36.66).

Using (36.206), it follows that (compare with (36.72)–(36.74)):   1 1  2 yn (¯ µ1 + µ ¯q,1 ) − 2 σq,1 + (¯ µ1 + µ ¯q,1 )2 2 2σy,1  σy,1 βn =    1 1  2 yn (¯ µ2 + µ ¯q,2 ) − 2 σq,2 + (¯ µ2 + µ ¯q,2 )2 2 σy,2 2σy,2

   

(36.212)

For convenience, we rewrite the entries of βn in the form βn = col{ln(an ), ln(bn )} where   1 1  2 2 an = exp y (¯ µ + µ ¯ ) − σ + (¯ µ + µ ¯ ) (36.213) n 1 q,1 1 q,1 q,1 2 2 σy,1 2σy,1   1 1  2 bn = exp yn (¯ µ2 + µ ¯q,2 ) − 2 σq,2 + (¯ µ2 + µ ¯q,2 )2 (36.214) 2 σy,2 2σy,2 Consequently, using expression (36.207) for S(zn ) we get βnT S(zn ) = ln(an )I[zn = 1] + ln(bn )I[zn = 2]

(36.215)

This inner product appears in the exponent of the variational factor for qzn (zn ) in (36.205). Now, consider a generic Bernoulli distribution with P(x = 1) = p and P(x = 2) = 1 − p. This can be written as fx (x) = pI[x=1] (1 − p)I[x=2] n h io = exp ln pI[x=1] (1 − p)I[x=2] n o = exp I[x = 1] ln(p) + I[x = 2] ln(1 − p)

(36.216)

We would like the variational factor for z n , namely, qzn (zn ), to correspond to a Bernoulli distribution since kn is a Bernoulli variable. Comparing the form of the exponent in (36.216) with (36.215), we deduce that the variational factor for z n can be made to correspond to a Bernoulli distribution if we scale the entries of βn properly; specifically, the scalars an and bn should add up to 1. We can regard an as providing an estimate for pbn and bn as providing an estimate for 1 − pbn . These two estimates need to have unit sum. This can be achieved by scaling them as follows: an ← an /(an + bn ),

bn ← bn /(an + bn )

(36.217)

in which case expression (36.215) becomes βnT S(zn )

an I[zn = 1] + = an + bn

 1−

an an + bn



I[zn = 2]

(36.218)

1452

Variational Inference

We now recognize that qzn (zn ) will correspond to a Bernoulli distribution with success probability   an pn = exp (36.219) an + bn where, by definition, ∆

pn = E zn I[z n = 1]

(36.220)

Let us examine next the variational factor for θ = col{µ1 , µ2 }. It is given by n o qθ (θ) ∝ exp (λ00 )T G(θ) − s(λ00 ) (36.221) where µ ¯1 /σ12 µ ¯2 /σ22 1



  ∆ λ00 = 

λ001 λ002 λ003

   Tab.=36.2   

        λ1   (36.186)  λ2 + N α   N  =  X    E zn T (y n , z n )   n=1      



    N  X  2 (yn /σy,1 )E zn I[z n = 1]   n=1  N  X  2 (yn /σy,2 )E zn I[z n = 2]   n=1   N X  2 (1/2σy,1 )E zn I[z n = 1]    n=1  N X  2 (1/2σy,2 )E zn I[z n = 2] n=1

(36.222) and 

µ1 µ2

      2    µ22 µ1 θ  (36.186)  − + ∆ G(θ) =  −a(θ)  =  2σ12 2σ22  φ(θ)  µ1   µ2   −µ21 −µ22

              

(36.223)

The entries (µ1 , µ2 ) are repeated in G(θ). It follows that we can group terms as follows: ! N µ ¯1 1 X 00 T (λ ) G(θ) = µ1 + 2 pn yn + σ12 σy,1 n=1 ! N µ ¯2 1 X µ2 + 2 (1 − pn )yn − σ22 σy,2 n=1 ! N 1 1 X 2 µ1 + 2 pn − 2σ12 2σy,1 n=1 ! N 1 1 X 2 µ2 + 2 (1 − pn ) (36.224) 2σ22 2σy,2 n=1

36.4 Exponential Conjugate Models

1453

This inner product appears in the exponent of qθ (θ) in (36.221). Comparing with the exponential form for a Gaussian distribution with a diagonal covariance matrix from Prob. 5.10 we conclude that θ is Gaussian-distributed and that the means and variances of the entries of θ = col{µ1 , µ2 } satisfy the relations (compare with (36.60a)–(36.60b) and see also Prob. 36.10): N µ ¯1 + µ ¯q,1 µ ¯1 1 X = 2 + 2 pn yn 2 σq,1 σ1 σy,1 n=1

(36.225a)

N µ ¯2 1 X µ ¯2 + µ ¯q,2 = 2 + 2 (1 − pn )yn 2 σq,2 σ2 σy,2 n=1

(36.225b)



N 1 1 X 1 =− 2 − 2 pn 2 2σq,1 2σ1 2σy,1 n=1

(36.225c)



N 1 1 1 X =− 2 − 2 (1 − pn ) 2 2σq,2 2σ2 2σy,2 n=1

(36.225d)

It follows from these relations that PN

µ ¯q,1 =

¯1 ) n=1 pn (yn − µ PN 2 2 (σy,1 /σ1 ) + n=1 pn

2 σy,1 P 2 (σy,1 /σ12 ) + N n=1 pn PN pn (yn − µ ¯2 ) = 2 n=1 P (1 − pn ) (σy,2 /σ22 ) + N n=1

(36.226a)

2 σq,1 =

(36.226b)

µ ¯q,2

(36.226c)

2 σq,2 =

2 σy,2 P 2 (σy,2 /σ22 ) + N n=1 (1 − pn )

(36.226d)

and also N µ ¯q,1 1 X pn (yn − µ ¯1 ) = 2 2 σq,1 σy,1 n=1

(36.227a)

N µ ¯q,2 1 X = (1 − pn )(yn − µ ¯2 ) 2 2 σq,2 σy,2 n=1

(36.227b)

2 2 We still need to determine the parameters {¯ µ1 , µ ¯2 , σq,1 , σq,2 }. For this purpose, we note that   N X 2 pn yn /σy,1       N n=1  X      2 1/x1 (1 − p )y /σ  n n y,2  N X   (36.186) ∆  1/x2  n=1  = λ003 = E zn T (y n , z n ) =   1/2x  N   X 3   2 n=1   pn /2σy,1 1/2x4   n=1     X N   2 (1 − pn )/2σy,2 n=1

(36.228) λ003

where we are denoting the individual entries of by {1/x1 , 1/x2 , 1/2x3 , 1/2x4 } with inverses and further scaling by 1/2 the last two entries for convenience. The entries of

1454

Variational Inference

λ003 are related to the ratios (36.227a)–(36.227b) and the variance parameters (36.225c) and (36.225c). Indeed, note that 1 x3 1 x4 1 µ ¯1 − x1 x3 1 µ ¯2 − x2 x4

1 1 − 2 2 σq,1 σ1 1 1 = 2 − 2 σq,2 σ2 µ ¯q,1 = 2 σq,1 µ ¯q,2 = 2 σq,2 =

(36.229) (36.230) (36.231) (36.232)

2 2 We can therefore estimate the parameters {¯ µ1 , µ ¯2 , σq,1 , σq,2 } from the recursion for λ003 . Collecting the various results and applying recursions (36.201) we arrive at the listing below, which is similar to the earlier construction (36.77): (−1)

(−1)

2,(−1)

2,(−1)

start from initial conditions µq,1 , µq,2 , σq,1 , σq,2 . repeat for m  = 0, 1, 2, . . .:  1 1  2 2 an = exp y (¯ µ + µ ¯ ) − σ + (¯ µ + µ ¯ ) n 1 q,1 1 q,1 q,1 2 2 2σy,1  σy,1   1 1 2 2 bn = exp y (¯ µ + µ ¯ ) − σ + (¯ µ + µ ¯ ) n 2 q,2 2 q,2 q,2 2 2 σy,2 2σy,2 (m−1) pn = an /(an + bn )   N X (m−1) 2 pn yn /σy,1      N n=1     X  (m)   (m−1) 2 1/x 1 (1 − pn )yn /σy,2   (m)   n=1  ∆  1/x2  (m)  =  λ3 =    N   X   1/2x(m)   3 (m−1) 2   pn /2σy,1 (m) 1/2x4   n=1    X  N   2 (1 − p(m−1) )/2σy,2 n 2,(m)

n=1

(36.233)

(m)

1/σq,1 = 1/x3 + 1/σ12 2,(m) (m) 1/σq,2 = 1/x4 + 1/σ22 (m) 2,(m) (m) (m) µ ¯q,1 /σq,1 = 1/x1 − µ ¯1 /x3 (m) 2,(m) (m) (m) µ ¯q,2 /σq,2 = 1/x2 − µ ¯2 /x4 end

36.5

MAXIMIZING THE ELBO We arrived at recursions (36.201) by working with the coordinate-ascent construction (36.43). The derivation allowed us to conclude that the variational factors qθ and qzn have exponential forms defined by parameters (λ00 , βn ). In the process, we arrived at an iterative algorithm for estimating these parameters. We can solve the same problem and arrive at the same estimates for (λ00 , βn ) by working directly with the ELBO expression and optimizing it over these

36.5 Maximizing the ELBO

1455

same parameters. The main difference between the derivation in this section and the one leading to (36.201) is that we now need to assume beforehand that the variational factors for θ and z n have exponential forms, say, as: n o qzn (zn ) = c(zn ) exp βnT S(zn ) − r(βn ) (36.234a) n o qθ (θ) = k(θ) exp (λ00 )T G(θ) − s(λ00 ) (36.234b) and treat the (λ00 , βn ) as unknown parameters that we wish to estimate. To do so, we first evaluate the ELBO under these assumed exponential models. Thus, recall from (36.31) that the ELBO is given by: h i h i L(q) = E q ln fy ,z1:N ,θ (y1:N , z1:N , θ) − E q ln qz1:N ,θ (z1:N , θ) 1:N h  i (a) = E q ln fy1:N ,z1:N (y1:N , z1:N )fθ |y ,z1:N (θ|y1:N , z1:N − 1:N N h  i Y E q ln qθ (θ) qzn (zn ) n=1 h i h i = E q ln fy1:N ,z1:N (y1:N , z1:N ) + E q ln fθ |y ,z1:N (θ|y1:N , z1:N ) − 1:N N h i X h i E q ln qθ (θ; λ00 ) − E q ln qzn (zn ; βn ) (36.235) n=1

where in step (a) we used the Bayes rule and the assumed factorized variational model from (36.192). The expectations over q(·) refer to expectations over the variational factors of θ and z 1:N . In the last two terms of (36.235) we are indicating the parameters (λ00 , βn ) explicitly.

Optimizing over λ00 Let us optimize first over λ00 . Using the assumed exponential forms (36.234a)– (36.234b), we rewrite the above ELBO as a function of λ00 and collect all terms that are independent of λ00 into a constant: h i h i L(λ00 ) = E q (λ0 )T G(θ) − s(λ0 ) − E θ (λ00 )T G(θ) − s(λ00 ) + constant T      = E z1:N λ0 E θ G(θ) − (λ00 )T E θ G(θ) + s(λ00 ) + constant (36.236) Recall that the expectations are relative to the distributions of the variational factors, which we know are exponential in form. Moreover, we know from property (5.89) for exponential distributions that the mean of their statistic is equal to the gradient of their log-normalization factor relative to the parameter. Therefore, E θ G(θ) = ∇(λ00 )T [s(λ00 )]

(a column vector)

(36.237)

1456

Variational Inference

from which we conclude that L(λ00 ) = (36.238)  T E z1:N λ0 ∇(λ00 )T [s(λ00 )] − (λ00 )T ∇(λ00 )T [s(λ00 )] + s(λ00 ) + constant

If we now differentiate L with respect to λ00 and set the gradient vector to zero, we can recover the optimal expression we derived earlier in (36.200b)–(36.200c). To see this, let h  i ∆ ∆ Hλ00 = ∇λ00 ∇(λ00 )T s(λ00 ) = ∇2λ00 s(λ00 ) (36.239) denote the Hessian matrix of s(λ00 ) relative to λ00 . Then,

  ∇(λ00 )T L = Hλ00 E z1:N (λ0 ) − λ00

(36.240)

Setting this gradient to zero gives

00

λ = E z1:N



λ1

  λ2 + N α λ =   X N  E zn T (y n , z n ) 0

n=1

     

(36.241)

which agrees with (36.200b)–(36.200c). The Hessian matrix Hλ00 has a useful interpretation in this case. Referring to properties (5.92) and (5.96) for exponential distributions, we know that Hλ00 is equal to the Fisher information matrix associated with fy |θ (y1:N |θ), namely, using definitions (31.96)–(31.97): 1:N



Hλ00 = F (λ00 ) = E S(λ00 )ST (λ00 )

(36.242)

where S(λ00 ) refers to the score function:  ∆ S(λ00 ) = ∇(λ00 )T ln fy

1:N |θ

(y1:N |θ)



(36.243)

This result is in agreement with the general property (6.99), which showed that the Hessian matrix of a divergence measure coincides with the Fisher information matrix. This conclusion shows that Hλ00 is at least nonnegative-definite. We assume it is positive-definite to avoid degenerate situations.

Optimizing over β n We can repeat the argument and derive the optimal expression for the parameters {βn }. For this purpose, we return to (36.235) and use a different Bayes factorization in step (a) as follows:

36.5 Maximizing the ELBO

1457

h i h i L(q) = E q ln fy ,z1:N ,θ (y1:N , z1:N , θ) − E q ln qz1:N ,θ (z1:N , θ) 1:N h  i = E q ln fy ,θ (y1:N , θ) fz1:N |y ,θ (z1:N |y1:N , θ) − 1:N 1:N N h  i Y E q ln qθ (θ) qzn (zn ) h

n=1

= E q ln fy

1:N ,θ

i h (y1:N , θ) + E q ln fz1:N |y

1:N ,θ

N h i X h i E q ln qθ (θ; λ00 ) − E q ln qzn (zn ; βn )

i (z1:N |y1:N , θ) − (36.244)

n=1

The first and third terms in (36.244) are independent of the variational factors {βn }. Using the assumed exponential forms (36.234a)–(36.234b), we get for the remaining terms: L({βn }) = =

N X

n=1

    E q η T S(z n ) − r(η) − E zn βnT S(z n ) − r(βn ) + constant

N n X

n=1

T     o E θ (η) E zn S(z n ) − βnT E zn S(z n ) + r(βn ) + constant

(36.245)

where we have collected all factors independent of βn into the constant term. Again, using property (5.89) for exponential distributions we have E zn S(z n ) = ∇βnT [r(βn )] (a column vector)

(36.246)

and, therefore, L({βn }) = (36.247) N n  o X T E θ (η) ∇βnT [r(βn )] − βnT ∇βnT [r(βn )] + r(βn ) + constant n=1

If we now differentiate L with respect to βn and set the gradient vector to zero, we can recover the optimal expression we derived earlier for βn in (36.197). To see this, let h  i ∆ ∆ Hβn = ∇βn ∇(βn )T r(βn ) = ∇2βn r(βn ) (36.248) denote the Hessian matrix of r(βn ) relative to βn . Then,   ∇βnT L = Hβn E θ η − βn

(36.249)

Setting this gradient to zero gives

h i βn = E θ η(y n , θ)

(36.250)

1458

Variational Inference

which agrees with (36.197). Again, the Hessian matrix Hβn has a useful interpretation in this case, namely, it is equal to the Fisher information matrix associated with fy |β (y1:N |βn ): 1:N

n



Hβn = F (βn ) = E S(βn )ST (βn ) where S(βn ) refers to the score function  ∆ S(βn ) = ∇(βn )T ln fy

1:N |β n

(36.251)

(y1:N |β n )



(36.252)

This shows that Hβn is at least nonnegative-definite. We assume it is positivedefinite to avoid degenerate situations.

36.6

STOCHASTIC GRADIENT SOLUTION One of the main difficulties with the iterative coordinate-ascent solution (36.201) 00(m) is that the expression for λ3 involves a sum over N terms corresponding to the expectations E zn T (y n , z n ). This computation can be demanding for large datasets. We now pursue an alternative solution method that relies on the use of stochastic gradients. In this solution, the maximum of the ELBO is sought iteratively by moving along the negative direction of a stochastic gradient vector. Thus, assume first that we employ a gradient-ascent algorithm to maximize the ELBO over the vector parameter λ00 . Using expression (36.240) for the gradient vector of L(λ) relative to λ00 we would write:  00(m)   00(m−1)  λ1 λ1      00(m)   00(m−1)  (36.253)  λ2  =  λ2  + µ(m)Hλ00 ×     00(m)

00(m−1)

λ3

λ3 

λ1

    X N 

n=1





00(m−1)

λ1

    00(m−1) − λ   2  E z(m−1) T (y n , z n ) 00(m−1) n λ3 λ2 + N α

     

where µ(m) ≥ 0 is a decaying step-size sequence satisfying conditions similar to (12.66a): ∞ ∞ X X µ2 (m) < ∞, µ(m) = ∞ (36.254) m=0

m=0

and the notation E z(m−1) denotes expectation relative to the variational factor n

(m−1)

qzn (·) given by (36.234a) at iteration m − 1, i.e., using the estimate βn . We can simplify recursion (36.253) by adjusting the search direction. We know that in gradient-ascent (or -descent) methods we can scale the search direction by any

36.6 Stochastic Gradient Solution

1459

positive-definite matrix. Assume we scale the gradient vector by Hλ−1 00 . Then, the above recursion reduces to  00(m)   00(m−1)  λ1 λ1      00(m)   00(m−1)  (36.255)  λ2  =  λ2 +     00(m−1)

00(m)

λ3

λ3





λ1

   µ(m)   X N 

n=1



00(m−1)

λ1

    00(m−1) − λ   2  E z(m−1) T (y n , z n ) 00(m−1) n λ3 λ2 + N α

     

where Hλ00 is now removed and does not need to be evaluated. This algorithm is a Newton-type recursion where the search direction is scaled by the inverse of the Hessian matrix. Comparing (36.255) with (36.253), we see that the former recursion performs updates along the direction of the standard gradient vector   ∆ ggrad = Hλ00 E z1:N λ0 − λ00 (36.256) whereas recursion (36.255) updates the iterates along the direction of the natural gradient vector – recall definition (6.130): ∆

gnatural = Hλ−1 00 ggrad

(36.257)

Therefore, recursion (36.255) is a natural gradient solution since we know from (36.242) that Hλ00 coincides with the Fisher information matrix for λ00 . Note further that (36.241) gives the optimal values for λ001 and λ002 , namely, λ001 = λ1 ,

λ002 = λ2 + N,

for all m

(36.258)

We can simplify recursion (36.255) further and focus only on learning λ003 : 00(m) λ3

=

00(m−1) λ3

+ µ(m)

N X

n=1

E z(m−1) T (y n , z n ) − n

00(m−1) λ3

!

(36.259)

This recursion continues to suffer from the same problem mentioned before for large datasets: It involves a sum of N terms. We can derive a simpler implementation by appealing to a stochastic gradient substitution. Specifically, at every iteration m, we select a random integer index n ∈ [1, N ] and the corresponding pair (z n , y n ). Then, we approximate the sum in the expression by N X

n=1

E z(m−1) T (y n , z n ) ≈ N E z(m−1) T (y n , z n ) n

n

(36.260)

1460

Variational Inference

The approximation leads to the following stochastic variational inference algorithm, which is based on a natural gradient implementation:

00(m)

λ3

00(m−1)

= λ3

  00(m−1) + µ(m) N E z(m−1) T (y n , z n ) − λ3 n

(36.261)

In summary, we arrive at the following listing.

Stochastic variational inference for conjugate exponential models. input: N observations {y n } satisfying models (36.181a)–(36.181b). given: parameters (λ1 , λ2 , λ3 , α, η), statistics {T (yn , zn )}; log-normalization factors a(θ), s(λ), r(η), as well as φ(θ); base functions k(θ), c(zn ). 00(m) set λ1 = λ1 ∈ IRM , for all m; 00(m) set λ2 = λ2 + N α ∈ IR, for all m; 00(−1) select initial λ3 ∈ IRM randomly; ∆

let G(θ) = col{θ, n −a(θ), φ(θ)}; o 00(−1) 00(−1) 00(−1) 00(−1) let λ = col λ1 , λ2 , λ3 ; (−1)

select βn

randomly, for n = 1, 2, . . . , N

repeat for m = 0, 1, 2, n ...: q

(m−1)

θ

(θ) = k(θ) exp

λ00(m−1)

T

o G(θ) − s(λ00(m−1) )

select a random integer uniformly from the interval n ∈ [1, N ] h i (m−1) βn = E θ (m−1) η(y n , θ) , for selected n (m−1)

qz n

00(m)

λ3 end

(zn ) = c(zn ) exp 00(m−1)

= λ3

n

(m−1)

βn

T

(m−1)

zn − r(βn

)

o

  00(m−1) + µ(m) N E z(m−1) T (y n , z n ) − λ3 n

(36.262) Example 36.11 (Application to the Gaussian mixture model) Let us reconsider the generative Gaussian mixture model from Example 36.9 and the analysis from Example 36.10. Applying recursions (36.262) we get (where we are now writing ρ(m) for the step-size sequence instead of µ(m) to avoid confusion with the µ-mean parameters in the example):

36.7 Black Box Inference

(−1)

(−1)

2,(−1)

2,(−1)

start from initial conditions µq,1 , µq,2 , σq,1 , σq,2 . repeat for m = 0, 1, 2, . . .: select a random integer n uniformly from the interval [1, N]   1 1  2 an = exp yn (¯ µ1 + µ ¯q,1 ) − 2 σq,1 + (¯ µ1 + µ ¯q,1 )2 2 2σy,1  σy,1  1  2 1 2 y (¯ µ + µ ¯ ) − σ + (¯ µ + µ ¯ ) bn = exp n 2 q,2 2 q,2 q,2 2 2 σy,2 2σy,2 (m−1) pn = an /(an + bn )   N X (m−1) 2 p b y /σ n   n y,1     N n=1   X   (m−1) 2 (1 − p b )y /σ (36.228)  n n y,2    ∆ n=1  δ (m) =  N   X   (m−1) 2   pbn /2σy,1   n=1     X N   (m−1) 2 (1 − pbn )/2σy,2 n=1   (m) 1/x1    (m)  ∆  1/x2  00(m) 00(m−1) 00(m−1) λ3 = λ3 + ρ(m) N δ (m) − λ3 =     1/2x(m) 3 (m) 1/2x4 (m) 2,(m) 1/σq,1 = 1/x3 + 1/σ12 (m) 2,(m) 1/σq,2 = 1/x4 + 1/σ22 (m) (m) 2,(m) (m) ¯1 /x3 µ ¯q,1 /σq,1 = 1/x1 − µ (m) (m) 2,(m) (m) ¯2 /x4 µ ¯q,2 /σq,2 = 1/x2 − µ end

36.7

1461

(36.263)

BLACK BOX INFERENCE The analysis in the previous sections assumed conditionally conjugate exponential models and exploited the simplifications that occur in the derivation of the variational inference solution. In particular, we showed that the variational factors continue to exhibit the same exponential form as the original pdfs. There are many situations, however, where the underlying models are not conditionally conjugate exponential and the simplifications in the derivation will not occur anymore. We already encountered examples to this effect in the context of the logit and probit models, where the underlying models do not fall into the class of conditional conjugate exponential distributions. Example 36.12 (Revisiting the logit model) Consider again the discussion on the logit model from Chapter 33. There we showed that the posterior for w is given by

1462

Variational Inference



posterior = fw|γ (w|γN ; HN ) . = fγ N ,w (γN , w; HN ) fγ N (γN ; HN ) = ˆ

fγ N |w (γN |w; HN ) fw (w)

(36.264)

fγ N |w (γN |w; HN ) fw (w) dw

w∈W

where the integral in the denominator is the evidence of the observations: ˆ



evidence = fγ N (γN ; HN ) =

w∈W

fγ N |w (γN |w; HN ) fw (w) dw

(36.265)

2 which involves both the prior fw (w) = Nw (0, σw IM ) and the likelihood function:

fγ N |w (γN |w; HN ) =

N  Y n=1

1 T 1 + e−hn w

 1+γ(n) 2

 ×

1 T 1 + ehn w

 1−γ(n) 2

(36.266)

Evaluation of the evidence (36.265) in closed form is not possible. Therefore, while the 2 prior distribution is Gaussian, fw (w) = Nw (0, σw IM ), the posterior fw|γ N (w|γN ; HN ) is not Gaussian. It follows that the prior fw (w) and the likelihood fγ N |w (γN |w; HN ) are not conjugate priors. In this case, the derivation for exponential conjugate models from the previous sections, and the associated simplifications, will not be applicable.

Even when the underlying models do not belong to the conjugate exponential family of distributions, we can still pursue the variational analysis directly by appealing to the coordinate-ascent construction (36.44). This is what we did for the probit model in Example 36.4. In this section, we derive an alternative approach. We first provide a general formulation. We will now denote all latent variables by z (whether local or global) and consider the problem of estimating the posterior fz|y (z|y). As customary, we approximate this pdf by some variational factor, denoted by qz (z; ζ), and whose form is known beforehand but is dependent on some parameter ζ that we wish to estimate. For example, we can use a Gaussian distribution for qz (·), in which case the entries of ζ would be the mean and variance for the distribution. Now, to derive a general-purpose stochastic algorithm for maximizing the ELBO, we first recall from (36.31) that the ELBO is given by h i h i L(q) = E q ln fy,z (y, z) − E q ln qz (z; ζ) (36.267) ˆ ˆ = qz (z; ζ) ln fy,z (y, z)dz − qz (z; ζ) ln qz (z; ζ)dz z∈Z

z∈Z

where the notation E q refers to expectation relative to the variational distribution qz (z; ζ). Computing the gradient of L(q) relative to ζ gives

36.7 Black Box Inference

1463

∇ζ T L(q) ˆ ˆ h i i h ∇ζ T qz (z; ζ) ln qz (z; ζ) dz = ∇ζ T qz (z; ζ) ln fy,z (y, z)dz − ˆz∈Z h ˆz∈Z h i i = ∇ζ T qz (z; ζ) ln fy,z (y, z)dz − ∇ζ T qz (z; ζ) ln qz (z; ζ)dz − z∈Z z∈Z ˆ ∇ζ T qz (z; ζ)dz z∈Z   ˆ ∇ζ T qz (z; ζ) = qz (z; ζ) ln fy,z (y, z)dz − q(z; ζ) z∈Z   ˆ  ˆ ∇ζ T q(z; ζ) qz (z; ζ) ln qz (z; ζ)dz + ∇ζ T qz (z; ζ)dz q(z; ζ) z∈Z | {z } } | z∈Z {z ˆ

=

z∈Z

=∇ζ T ln qz (z;θ)

=1

h in o qz (z; ζ) ∇ζ T ln qz (z; ζ) ln fy,z (y, z)dz − ln qz (z; ζ) dz

(36.268)

and, hence,

∇ζ T L(q) = E q

( 

) h i ∇ζ T ln qz (z; ζ) ln fy,z (y, z) − ln qz (z; ζ)

(36.269)

which involves both qz (z; ζ) and its score function. The derivation has therefore led to an expression for the gradient of L(q) in the form of an expectation. We can now replace the expectation by a stochastic approximation and arrive at a mini-batch stochastic gradient algorithm, as shown in (36.270). The step-size parameter µ(m) is replaced by the diagonal matrix Dm , which allows us to assign different step-size values to different entries of the vector ζ; this is because these entries may converge at different rates.

Stochastic variational inference for general models. given: fy,z (y, z) and variational form qz (z; ζ); select ζ (−1) randomly. repeat over m = 0, 1, 2, . . . : generate B samples from zb ∼ qz (z; ζ (m−1) ) B  h i X ∆ 1 δ (m) = ∇ζ T ln qz (zb ; ζ (m−1) ) ln fy,z (y, zb ) − ln qz (zb ; ζ (m−1) ) B b=1

ζ (m) = ζ (m−1) + Dm δ (m) end

(36.270)

1464

Variational Inference

In order to generate samples {zb }, we will generally need to assume some prior form for the variational factor, such as assuming qz (z; ζ) to be Gaussian with ζ representing its mean and variance. Example 36.13 (Estimating the posterior for probit models) Let us reconsider the probit model from Example 36.4, which consisted of the following generative model:  2 generate w ∼ Nw (0, σw IM )    for each data point n = 1, 2, . . . , N :  (36.271) generate z n ∼ Nzn (hTn w, 1)   set γ(n) = +1 if z n > 0; otherwise, γ(n) = −1   end We therefore have a model with one global latent variable, w, and N local latent variables {z n }. We collect all these variables into:   w  z1  ∆   z =  .  (36.272)  ..  zN We know that the joint pdf is given by (we are indicating the variance of z n explicitly by writing σn2 although we know from (36.271) that it is equal to 1): ! N Y fγ ,z (y, z) = fw (w) fz|w (zn |w) fγ |z,w (γn |zn , w) n=1

N

2 Y − 12 kwk2 − 1 (zn −hT 1 1 n w) √ e 2σw e 2σn2 × = √ 2 2 2πσw 2πσn n=1   I[zn ≤0] !  N I[z >0] n Y 1 + γ(n) 1 − γ(n) 2 2 n=1

! × (36.273)

so that ln fγ ,z (y, z) N N 1 1X 1X 1 2 ) − 2 kwk2 − ln(2πσn2 ) − (zn − hn w)2 + = − ln(2πσw 2 2σw 2 n=1 2 n=1     N X 1 + γ(n) 1 − γ(n) I[zn > 0] ln + I[zn ≤ 0] ln (36.274) 2 2 n=1

We next assume a Gaussian distribution for each entry of the latent vector z with means and variances (these are approximations for the conditional distributions of w and z n , conditioned on all observations {hn , γ(n)}): 2 w ∼ Nw (w, ¯ σ bw IM )

z n ∼ Nzn (¯ zn , σ bn2 )

so that the entries of the parameter vector ζ are n o ∆ 2 2 ζ = w, ¯ σ bw , z¯1 , σ b12 , . . . , z¯N , σ bN

(36.275) (36.276)

(36.277)

36.7 Black Box Inference

1465

and the variational distribution is given by − 12 kw−wk ¯ 2 1 qz (z; ζ) = √ e 2σb w × 2 2πb σw

N Y n=1

− 1 (zn −¯ zn )2 1 √ e 2σb n2 2 2πb σn

! (36.278)

It follows that ln qz (z; ζ)

(36.279)

1 1 1 2 σw ) − 2 kw − wk = − ln(2πb ¯ 2− 2 2b σw 2

N X n=1

ln(2πb σn2 ) −

1 2

N X n=1

1 (zn − z¯n )2 σ bn2

Differentiating this expression relative to the means and variances we get: 1 (w − w) ¯ 2 σ bw 1 1 = − 2 + 4 kw − wk ¯ 2 2b σw 2b σw 1 = 2 (zn − z¯n ) σ bn 1 1 = − 2 + 4 (zn − z¯n )2 2b σn 2b σn

∇w¯ T ln qz (z; ζ) = ∂ ln qz (z; ζ)/∂σbw 2 ∂ ln qz (z; ζ)/∂z¯n ∂ ln qz (z; ζ)/∂σbn2

(36.280a) (36.280b) (36.280c) (36.280d)

By concatenating these values we obtain the gradient vector of ln qz (z; ζ) relative to ζ. We therefore have all the elements necessary to run recursions (36.270). Stochastic variational inference for probit model (36.271). given: features and labels {hn , γ(n)}, no= 1, 2, . . . , N ; n 2,(−1) (−1) 2,(−1) start with w ¯ (−1) , σ bw , z¯n , σ bn , n = 1, 2, . . . , N . repeat over m = 0, 1, 2, . . . :   2,(m−1) IM generate B samples wb ∼ N w ¯ (m−1) , σ bw   (m−1) 2,(m−1) for each n, generate B samples zn,b ∼ N z¯n ,σ bn collect the samples into B vectors zb = col{wb , z1,b , z2,b , . . . , zN,b } B  h i X ∆ 1 ∇ζ T ln qz (zb ; ζ (m−1) ) ln fγ ,z (γ, zb ) − ln qz (zb ; ζ (m−1) ) δ (m) = B b=1

ζ (m) = ζ (m−1) + Dm δ (m) end (36.281) (m)

The term ln fγ ,z (γ, zb ) in the expression for δ in the above algorithm is evaluated by substituting (w, zn ) in (36.274) by (wb , zn,b ). We obtain better performance in this 2 2 case if we also replace (σw , σn2 ) in the expression for fγ ,z (γ, zb ) by (b σw ,σ bn2 ), which estimate the conditional variances of the latent variables. Once the variational factors 2 are estimated, we can use expression (36.118), with Rw replaced by σ bw IM , to perform prediction, namely, P(γ = γ|γ N = γN ; h, HN ) ≈ Φ

γhT w ¯ p 2 khk2 1+σ bw

! (36.282)

We illustrate this construction by means of a simulation. A total of N = 400 feature vectors {hn ∈ IR2 } are generated randomly from a Gaussian distribution with zero

1466

Variational Inference

mean and unit variance. The weight vector w ∈ IR2 is also generated randomly from 2 a Gaussian distribution with zero mean and variance σw = 2. For each hn , we assign T its label to γ(n) = +1 with probability equal to Φ(hn w); otherwise, it is assigned to γ(n) = −1. The 400 pairs {hn , γ(n)} generated in this manner are used for training. These samples are shown in the top row of Fig. 36.9. We also generate separately 50 additional sample pairs {hn , γ(n)} to be used during testing. These samples are shown in the plot on the left in the bottom row. We run recursions (36.281) for 200,000 −8 iterations using a small step size Dm = 10 I. The algorithm leads to m+1   −0.9091 2 , σ bw = 0.7047 (36.283) w ¯= 0.2560 Using these values, we predict the label for each of the test features hn using construction (36.282), namely, we assign hn to class γ(n) = +1 if ! hT w ¯ ≥ 1/2 =⇒ γ(n) = +1 (36.284) Φ p 2 khk2 1+σ bw The result is shown in the plot on the right in the bottom row of Fig. 36.9 with an error rate of 16% (meaning that eight test samples are misclassified). training features

3

-1

2 1 0 -1 -2

+1

-3 -3

3

-2

-1

0

samples used for testing

2

1

3

test results

3 2

-1

1

1

0

0

-1

-1

-2

2

-2

+1

-3

-1

+1

-3 -3

-2

-1

0

1

2

3

-3

-2

-1

0

1

2

3

Figure 36.9 (Top row) A total of N = 400 pairs {hn , γ(n)} used for training in order

to learn the predictive distribution (36.282). (Bottom row) The plot on the left shows the 50 pairs {hn , γ(n)} used for testing. The plot on the right shows the labels that are assigned by the test (36.284) to these points.

Problems

36.8

1467

COMMENTARIES AND DISCUSSION Variational inference. Variational inference methods are powerful techniques for approximating the posterior distribution in Bayesian formulations. They construct the approximation by solving an optimization problem that seeks to minimize the KL divergence between the desired posterior and a postulated family of distributions. Variational inference has found applications in a wide range of fields including topic modeling, natural language processing, speech processing, population genetics, computational biology, computer vision, robotics, and political science. The discussion on the probit model in Example 36.4 is motivated by results from Albert and Chib (1993), Consonni and Marin (2007), and Li, Pericchi, and Wang (2020). The discussion on the legislative voting record in Example 36.5 is from Spirling and Quinn (2010) and Grimmer (2011). The useful surveys by Ormerod and Wand (2010), Blei, Kucukelbir, and McAuliffe (2017), and Zhang et al. (2019) provide a list of references applying variational techniques to several other domains; see also the texts by Barber (2012) and Nakajima, Watanabe, and Sugiyama (2019). One of the earliest references on the application of variational inference methods is the work by Peterson and Anderson (1987) in the context of training neural networks. This work exploited advances from mean-field theory in statistical physics applied to the study of spin glass, which refers to a model for a type of magnet where the magnetic spins are not aligned regularly – see the publication Parisi (1988) by Giorgio Parisi (1948–), who was awarded the Nobel Prize in Physics in 2021. The work by Peterson and Anderson (1987) motivated the application of variational inference to many other contexts and models. Some of the subsequent contributions appear in the works by Hinton and Van Camp (1993), Jaakkola and Jordan (1996, 1997, 2000), Saul, Jaakkola, and Jordan (1996), and Opper and Winther (1996). The texts by MacKay (2003) and Bishop (2007) provide introductions to variational Bayes methods. The text by Wainwright and Jordan (2008) provides an overview of variational inference in the context of learning and graphical models, motivated by the earlier study from Jordan et al. (1999). Other useful surveys include the articles by Fox and Roberts (2012), Blei, Kucukelbir, and McAuliffe (2017), and Zhang et al. (2019). An overview of mean field techniques appears in the text by Opper and Saad (2001). The presentation in the body of the chapter is motivated by the treatments in Jordan et al. (1999), Wainwright and Jordan (2008), Hoffman et al. (2013), and Ranganath, Gerrish, and Blei (2014). In particular, application of stochastic gradient methods to variational inference (cf. Section 36.6) appears in Hoffman et al. (2013). This reference, along with Wainwright and Jordan (2008), applies variational techniques to the class of exponentially conjugate models (cf. Section 36.4), while the references by Jordan et al. (1999) and Ranganath, Gerrish, and Blei (2014) develop black box variational inference methods that apply to a broader class of generative models (cf. Section 36.7).

PROBLEMS

36.1 Refer to the Bayesian Gaussian mixture model (36.5) and replace it by one where the variables are now vector-valued:  µ1 , σ12 IM ) and µ2 ∼ Nµ2 (¯ µ2 , σ22 IM ).  generate two random means µ1 ∼ Nµ1 (¯ select an index k ∈ {1, 2} according to Bernoulli(p)  2 generate the observation y ∼ Ny (µk , σy,k IM ) (a) (b)

Derive a closed-form expression for the evidence, fy (y). Derive a closed-form expression for the responsibility, P(k = k|y = y).

1468

Variational Inference

2 What about the case when the covariance matrices {σ12 IM , σ22 IM , σy,k IM } are replaced by M × M diagonal matrices {D1 , D2 , Dy,k }? 36.2 Model (36.5) assumes two Gaussian components. Assume instead that there are µk , σk2 ) for k = 1, 2, . . . , K. The label variable k K Gaussian components, µk ∼ Nµk (¯ assumes each value k with probability πk , i.e., P(k = k) = πk with all the {πk } adding 2 up to 1. The observation y continues to be generated according to y ∼ Ny (µk , σy,k ). Derive a closed-form expression for the evidence, fy (y). 36.3 Verify that the coordinate-ascent iteration (36.42) can be written in the form n  o ? qm (zm ) ∝ exp E −m ln fzm |z−m ,y (zm |z−m , y)

(c)

where z −m refers to all entries of the latent variable z excluding z m . 36.4 Refer to expressions (36.38) and (36.42). (a) Verify that the ELBO can be written in the following form as a function of the mth variational factor: n h io h i L(qm ) = E qm E −m ln fy,z (y, z) − E qm ln qm (zm ) + constant where E q−m refers to expectation over all variational distributions excluding qm , and the constant term does not depend on qm . (b) Apart from a constant, conclude that L(qm ) is equal to the negative divergence ? ? between qm (zm ) and the optimal factor qm (zm ), i.e., L(qm ) = −DKL (qm kqm )+ constant. 36.5 The derivation in Example 36.4 was limited to the probit model. We wish to extend it to the logit model. The following expression provides the form of what is known as the logistic pdf with mean µ and variance σx2 :   exp − √π 2 (x − µ) 3σx π fx (x) = √ 2   2 , x ∈ (−∞, +∞) 3σx π 1 + exp − √ 2 (x − µ) 3σx

Its cdf is given by Φ(x) =

−1   π 1 + exp − √ 2 (x − µ) 3σx

Use this information to extend the derivation from Example 36.4 to the logit case. Estimate the posterior distribution by applying the coordinate-ascent construction (36.44), and use the result to approximate the predictive distribution. 36.6 The generative model (36.45) assumes two underlying mixture components, {µ1 , µ2 }. Consider instead a model with K mixture components as in Prob. 36.2. Repeat the derivation of Example 36.3 and explain that the listing in (36.77) should be modified as shown below. Derive the expression for L(m) . Argue that the responsibility for model k in explaining an observation y can be estimated by means of the expression P(k = k|y = y; y1:N ) = PK

2 π bk Ny (¯ µk + µ ¯q,k , σy,k )

k0 =1

2 π bk0 Ny (¯ µk 0 + µ ¯q,k0 , σy,k 0)

where the {¯ µq,k } are the estimates obtained at the end of the algorithm, and π bk can be estimated by averaging over all observations at the end of the algorithm as well: π bk =

N 1 X π bk,n N n=1

Problems

1469

Coordinate-ascent for a Bayesian mixture of K zero-mean Gaussians. input: N observations {yn } satisfying the model of Example 36.3. 2 given: {¯ µk , σk2 , σy,k }. n o (−1) 2,(−1) (−1) start with initial conditions µ ¯q,k , σq,k , π bk,n , k = 1, . . . , K. repeat until ( ELBO value L(m) has converged over m = 0, 1, 2, . . . : )   2 i 1 1 h (m−1) (m−1) 2,(m−1) yn µk + µ ¯q,k − 2 µ ¯k + µ ¯q,k ak,n = exp + σq,k 2 σy,k 2σ

(m)

π bk,n

(m)

µ ¯k

n = 1, 2, . . . , N, k = 1, 2, . . . , K .P n = 1, 2, . . . , N K = ak,n k=1 ak,n , k = 1, 2, . . . , K PN (m) bk,n (yn − µ ¯k ) n=1 π ← , k = 1, 2, . . . , K P (m) N 2 (σy,k /σk2 ) + n=1 π bk,n

2,(m)

σq,k



2 σy,k , k = 1, 2, . . . , K P (m) 2 (σy,k /σk2 ) + N bk,n n=1 π

compute L(m) end 36.7 Extend the result of Prob. 36.6 to vector-valued random variables of size M 2 2 with σk2 and σy,k replaced by σk2 IM and σy,k IM , respectively. 36.8 Derive expression (36.101) for the pdf of a Gaussian random variable restricted to the interval x ∈ [a, b]. How does the expression simplify when a = −∞ and when b = +∞? Establish also the mean expression (36.103). 36.9 Establish (36.157) for the location of the mode of a beta distribution. 36.10 Verify that expression (36.225a) matches (36.60a). 36.11 Consider a collection of N measurements {y n } arising from a Gaussian distribution with mean µ and variance σ 2 , i.e., y n ∼ Ny (µ, σ 2 ). The inverse variance 1/σ 2 is chosen randomly from a gamma distribution with parameters (α, β), i.e., 1/σ 2 ∼ Γ(α, β) – recall Prob. 5.2 for the definition of a gamma distribution. The mean µ is chosen randomly from a Gaussian distribution with mean µ ¯ and variance σ 2 /λ, i.e., µ ∼ Nµ (¯ µ, σ 2 /λ). Use the coordinate-ascent algorithm (36.42) to show that the conditional distributions of µ and 1/σ 2 given the observations are given by the following variational factors. Let τ = 1/σ 2 . Then,   ? ? qµ (µ) = Nµ µ ¯N , 1/τN , qτ (τ ) = Γ(a, b) where the parameters can be computed iteratively as follows: N 1 X yn , N n=1

N y¯ + λ¯ µ N +λ a (m) a = α, τN = (N + λ) (m−1) , m ≥ 0 b " # N   X 1 (m) (m) 2 2 2 b =β+ (N + λ) 1/τN + µ ¯N + λ¯ µ + yn − 2¯ µN (λ¯ µ + N y¯) 2 n=1 y¯ =

µ ¯N =

36.12 Consider a collection of N measurements {γ(n)} arising from a Gaussian distribution with mean hTn w and variance σ 2 , i.e., γ(n) ∼ Nγ (n) (hTn w, σ 2 ). The inverse variance 1/σ 2 is chosen randomly from a gamma distribution with parameters (α, β),

1470

Variational Inference

i.e., 1/σ 2 ∼ Γ(α, β). The vector w ∈ IRM is chosen randomly from a Gaussian distri2 bution with w ∼ Nw (0, σw IM ). Use the coordinate-ascent algorithm (36.42) to show that the conditional distributions of w and 1/σ 2 given the observations are given by the following variational factors. Let τ = 1/σ 2 . Then,   ? ? qw (w) = Nw w, ¯ Rw , q τ (τ ) = Γ(a, b) where the parameters (w, ¯ Rw , a, b) can be estimated iteratively as follows: a(m) = α, for all m ≥ 0 N 2 1 X (m−1) b(m) = β + γ(n) − hTn w ¯ (m−1) + hTn Rw hn 2 n=1 !−1 N 1 a(m) X (m) T Rw = + (m) hn hn 2 σw b n=1 w(m) =

N a(m) (m) X R γ(n)hn w b(m) n=1

REFERENCES Albert, J. H. and S. Chib (1993), “Bayesian analysis of binary and polychotomous response data,” J. Amer. Statist. Assoc., vol. 88, pp. 669–679. Barber, D. (2012), Bayesian Reasoning and Machine Learning, Cambridge University Press. Bishop, C. (2007), Pattern Recognition and Machine Learning, Springer. Blei, D. M., A. Kucukelbir, and J. D. McAuliffe (2017), “Variational inference: A review for statisticians,” J. Amer. Statist. Assoc., vol. 112, no. 518, pp. 859–877. Also available at arXiv:1601.00670. Consonni, G. and J. M. Marin (2007), “Mean-field variational approximate Bayesian inference for latent variable models,” Comput. Stat. Data Anal., vol. 52, pp. 790–798. Fox, C. W. and S. J. Roberts (2012), “A tutorial on variational Bayesian inference,” Artif. Intell. Rev., vol. 38, no. 2, pp. 85–95. Grimmer, J. (2011), “An introduction to Bayesian inference via variational approximations,” Polit. Anal., vol. 19, no. 1, pp. 32–47. Hinton, G. and D. Van Camp (1993), “Keeping the neural networks simple by minimizing the description length of the weights,” Proc. ACM Conf. Computational Learning Theory, pp. 5–13, Santa Cruz, CA. Hoffman, M. D., D. M. Blei, C. Wang, and J. Paisley (2013), “Stochastic variational inference,” J. Mach. Learn. Res., vol. 14, no. 1, pp. 1303–1347. Jaakkola, T. and M. I. Jordan (1996), “Computing upper and lower bounds on likelihoods in intractable networks,” Proc. Conf. Uncertainty in Artificial Intelligence (UAI), pp. 340–348, Portland, OR. Also available at arXiv:1302.3586 Jaakkola, T. and M. I. Jordan (1997), “A variational approach to Bayesian logistic regression models and their extensions,” Proc. Int. Workshop on Artificial Intelligence and Statistics (AISTAT), pp. 1–12, Fort Lauderdale, FL. Jaakkola, T. and M. I. Jordan (2000), “Bayesian parameter estimation via variational methods,” Stat. Comput., vol. 10, pp. 25–37. Jordan, M. I., Z. Ghahramani, T. Jaakkola, and L. Saul (1999), “Introduction to variational methods for graphical models,” Mach. Learn., vol. 37, pp. 183–233.

References

1471

Li, A., L. Pericchi, and K. Wang (2020), “Objective Bayesian inference in probit models with intrinsic priors using variational approximations,” Entropy, vol. 22, no. 5:513, doi:10.3390/e22050513. MacKay, D. J. C. (2003), Information Theory, Inference, and Learning Algorithms, Cambridge University Press. Nakajima, S., K. Watanabe, and M. Sugiyama (2019), Variational Bayesian Learning Theory, Cambridge University Press. Opper, M. and D. Saad (2001), Advanced Mean Field Methods: Theory and Practice, MIT Press. Opper, M. and O. Winther (1996), “A mean field algorithm for Bayes learning in large feed-forward neural networks,” Proc. Advances Neural Information Processing Systems (NIPS), pp. 225–231, Denver, CO. Ormerod, J. T. and M. P. Wand (2010), “Explaining variational approximations,” Amer. Statist., vol. 64, pp. 140–153. Parisi, G. (1988), Statistical Field Theory, Addison-Wesley. Peterson, C. and J. Anderson (1987), “A mean field theory learning algorithm for neural networks,” Complex Syst., vol. 15, pp. 995–1019. Ranganath, R., S. Gerrish, and D. Blei (2014), “Black box variational inference,” Proc. Mach. Learn. Res., vol. 33, pp. 814–822. Saul, L., T. Jaakkola, and M. I. Jordan (1996), “Mean field theory for sigmoid belief networks, ” J. Artif. Intell. Res., vol. 4, no. 1, pp. 61–76. Spirling, A. and K. Quinn (2010), “Identifying intraparty voting blocs in the U.K. House of Commons,” J. Amer. Statist. Assoc., vol. 105, no. 490, pp. 447–457. Wainwright, M. J. and M. I. Jordan (2008). Graphical Models, Exponential Families, and Variational Inference, Foundations and Trends in Machine Learning, NOW Publishers, vol. 1, nos. 1–2, pp. 1–305. Zhang, C., J. Butepage, H. Kjellstrom, and S. Mandt (2019), “Advances in variational inference,” IEEE Trans. Patt. Anal. Mach. Intell., vol. 41, no. 8, pp. 2008–2026.

37 Latent Dirichlet Allocation

one prominent application of the variational inference methodology of Chapter 36 arises in the context of topic modeling. In this application, the objective is to discover similarities between texts or documents such as news articles. For example, given a large library of articles, running perhaps into the millions, such as a database of newspaper articles written over 100 years, it would be useful to be able to discover in an automated manner the multitude of topics that are covered in the database and to cluster together articles dealing with similar topics such as sports or health or politics. In another example, when a user is browsing an article online, it would be useful to be able to identify automatically the subject matter of the article in order to recommend to the reader other articles of similar content. Latent Dirichlet allocation (or LDA) refers to the procedure that results from applying variational inference techniques to topic modeling in order to address questions of this type. Latent Dirichlet allocation performs a soft form of unsupervised clustering. By “unsupervised” we mean that LDA will be able to identify the variety of topics present in a library of documents without the necessity for training. It will also be able to associate topics with documents. The qualification “soft clustering” refers to the fact that any particular document does not necessarily cover a single topic, but can be associated with several topics. For example, 30% of the material in a document may deal with one topic, 50% with a second topic, and 20% with a third topic. As such, this specific document would be assigned by LDA to three topic clusters, albeit at different strength levels of 30%, 50%, and 20%, respectively. The techniques developed in this chapter are broad and can be applied to problems other than topic modeling. We encountered one instance earlier when we studied the voting record of a legislative body in Example 36.5. The main unifying theme across the different problems that can be handled by LDA is that they should involve observations (such as documents) that consist of smaller units (such as words) that arise probabilistically from underlying clusters (such as topics). For simplicity of presentation, we will motivate LDA by focusing on the topic modeling context, with the understanding that the formulation is general enough and can be applied more broadly.

37.1 Generative Model

37.1

1473

GENERATIVE MODEL In order to apply variational inference to topic modeling, we need to adopt a generative model for the articles (i.e., a model that explains how articles are generated). Every article will deal with a blend of topics, with each topic discoverable (or revealed to the reader) through words appearing in the article. Consider an article discussing the three topics of classroom teaching, online courses, and government. A reader browsing through the article would be able to infer the presence of these topics from certain representative keywords. One collection of words appearing in the article could be “teacher, school, textbook, blackboard, desk” relating to the topic of classroom teaching. A second collection of words in the same article could be “Internet, video, streaming, computer, online, recording” relating to the topic of online courses. And a third collection of words could be “policy, investment, educational board, funding” relating to the third topic of government. Motivated by this example, we describe next the generative process for articles (or documents). We start by modeling the range of topics in each article.

37.1.1

Bag of Topics We assume a dataset with a total of D documents. We denote the number of topics running across these documents by T ≥ 2. That is, if we were to browse through the entire database, the collection would contain T identifiable and separate topics. Each individual document would consist of a blend of topics at various strength levels. Let the subscript d refer to a generic document or article. Each document d blends a subset of the T topics at different proportions. For example, one article may be 50% sports, 20% education, and 30% arts. A second article may be 40% politics and 60% economics, while a third article may be 20% movies, 25% TV, 30% actors and actresses, and 25% technology. We model the topic proportions as follows. For each document d, we let θd,t denote the proportion of document d that deals with topic t (i.e., the fraction of words in document d that relate to topic t). For example, if θd,t = 0.1 for sports, this means that 10% of the words in document d deal with sports. We collect all {θd,t } into a T × 1 vector: ∆

θd = col{θd,1 , θd,2 , . . . , θd,T },

T X t=1

θd,t = 1,

θd,t ≥ 0

(37.1a)

The entries of this vector describe the likelihood of each topic in document d (i.e., what fraction of words belong to each topic). Obviously, several entries of θd can be zero or very small. Figure 37.1 illustrates this scenario. The figure shows documents along the horizontal direction and topics along the vertical direction. Each document d is represented by a vertical column with colored squares; the darker the color of a square is, the stronger the representation of that topic will

sha1_base64="wHXKhXqAFGfgmwnYm5FrcWQeMxc=">AAACIHicbVC7TsMwFHV4lvIKMLJYtEhMVdKFslViYSyI0kptVN04TmPVsSPbAVVVf4WFX2FhAAQbfA3uY4CWMx2dc6/v8QkzzrTxvC9nZXVtfWOzsFXc3tnd23cPDu+0zBWhTSK5VO0QNOVM0KZhhtN2piikIaetcHA58Vv3VGkmxa0ZZjRIoS9YzAgYK/XcWldIJiIqDI5ADajC0zdxuXvD+okBpeRDGSeWWytTMoSQcWaGxZ5b8ireFHiZ+HNSQnM0eu5nN5IkT+0pwkHrju9lJhiBMoxwOi52c00zIAPo046lAlKqg9E0zRifWiXCsQ0WSxt1qv7eGEGq9TAN7WQKJtGL3kT8z+vkJq4FIyay3FBBZofinGMj8aQuHDFFieFDS4AoZrNikoACYmypkxL8xS8vk2a1clHxr6ulenXeRgEdoxN0hnx0juroCjVQExH0iJ7RK3pznpwX5935mI2uOPOdI/QHzvcP5d6jfw==

sha1_base64="wHXKhXqAFGfgmwnYm5FrcWQeMxc=">AAACIHicbVC7TsMwFHV4lvIKMLJYtEhMVdKFslViYSyI0kptVN04TmPVsSPbAVVVf4WFX2FhAAQbfA3uY4CWMx2dc6/v8QkzzrTxvC9nZXVtfWOzsFXc3tnd23cPDu+0zBWhTSK5VO0QNOVM0KZhhtN2piikIaetcHA58Vv3VGkmxa0ZZjRIoS9YzAgYK/XcWldIJiIqDI5ADajC0zdxuXvD+okBpeRDGSeWWytTMoSQcWaGxZ5b8ireFHiZ+HNSQnM0eu5nN5IkT+0pwkHrju9lJhiBMoxwOi52c00zIAPo046lAlKqg9E0zRifWiXCsQ0WSxt1qv7eGEGq9TAN7WQKJtGL3kT8z+vkJq4FIyay3FBBZofinGMj8aQuHDFFieFDS4AoZrNikoACYmypkxL8xS8vk2a1clHxr6ulenXeRgEdoxN0hnx0juroCjVQExH0iJ7RK3pznpwX5935mI2uOPOdI/QHzvcP5d6jfw==AAACIHicbVC7TsMwFHV4lvIKMLJYtEhMVdKFslViYSyI0kptVN04TmPVsSPbAVVVf4WFX2FhAAQbfA3uY4CWMx2dc6/v8QkzzrTxvC9nZXVtfWOzsFXc3tnd23cPDu+0zBWhTSK5VO0QNOVM0KZhhtN2piikIaetcHA58Vv3VGkmxa0ZZjRIoS9YzAgYK/XcWldIJiIqDI5ADajC0zdxuXvD+okBpeRDGSeWWytTMoSQcWaGxZ5b8ireFHiZ+HNSQnM0eu5nN5IkT+0pwkHrju9lJhiBMoxwOi52c00zIAPo046lAlKqg9E0zRifWiXCsQ0WSxt1qv7eGEGq9TAN7WQKJtGL3kT8z+vkJq4FIyay3FBBZofinGMj8aQuHDFFieFDS4AoZrNikoACYmypkxL8xS8vk2a1clHxr6ulenXeRgEdoxN0hnx0juroCjVQExH0iJ7RK3pznpwX5935mI2uOPOdI/QHzvcP5d6jfw==AAACIHicbVC7TsMwFHV4lvIKMLJYtEhMVdKFslViYSyI0kptVN04TmPVsSPbAVVVf4WFX2FhAAQbfA3uY4CWMx2dc6/v8QkzzrTxvC9nZXVtfWOzsFXc3tnd23cPDu+0zBWhTSK5VO0QNOVM0KZhhtN2piikIaetcHA58Vv3VGkmxa0ZZjRIoS9YzAgYK/XcWldIJiIqDI5ADajC0zdxuXvD+okBpeRDGSeWWytTMoSQcWaGxZ5b8ireFHiZ+HNSQnM0eu5nN5IkT+0pwkHrju9lJhiBMoxwOi52c00zIAPo046lAlKqg9E0zRifWiXCsQ0WSxt1qv7eGEGq9TAN7WQKJtGL3kT8z+vkJq4FIyay3FBBZofinGMj8aQuHDFFieFDS4AoZrNikoACYmypkxL8xS8vk2a1clHxr6ulenXeRgEdoxN0hnx0juroCjVQExH0iJ7RK3pznpwX5935mI2uOPOdI/QHzvcP5d6jfw==AAAB8HicbVA9TwJBEJ3DL8Qv1NJmI5hQkTsatSOxscTEEyJcyN6ywIa9vcvunAm58C9sLNTY+nPs/DcucIWCL5nk5b2ZzMwLEykMuu63U9jY3NreKe6W9vYPDo/KxycPJk414z6LZaw7ITVcCsV9FCh5J9GcRqHk7XByM/fbT1wbEat7nCY8iOhIiaFgFK30WO3hmCPtN6r9csWtuwuQdeLlpAI5Wv3yV28QszTiCpmkxnQ9N8EgoxoFk3xW6qWGJ5RN6Ih3LVU04ibIFhfPyIVVBmQYa1sKyUL9PZHRyJhpFNrOiOLYrHpz8T+vm+LwKsiESlLkii0XDVNJMCbz98lAaM5QTi2hTAt7K2FjqilDG1LJhuCtvrxO/Eb9uu7dNSrNWp5GEc7gHGrgwSU04RZa4AMDBc/wCm+OcV6cd+dj2Vpw8plT+APn8wftUI/hAAAB8HicbVA9TwJBEJ3DL8Qv1NJmI5hQkTsatSOxscTEEyJcyN6ywIa9vcvunAm58C9sLNTY+nPs/DcucIWCL5nk5b2ZzMwLEykMuu63U9jY3NreKe6W9vYPDo/KxycPJk414z6LZaw7ITVcCsV9FCh5J9GcRqHk7XByM/fbT1wbEat7nCY8iOhIiaFgFK30WO3hmCPtN6r9csWtuwuQdeLlpAI5Wv3yV28QszTiCpmkxnQ9N8EgoxoFk3xW6qWGJ5RN6Ih3LVU04ibIFhfPyIVVBmQYa1sKyUL9PZHRyJhpFNrOiOLYrHpz8T+vm+LwKsiESlLkii0XDVNJMCbz98lAaM5QTi2hTAt7K2FjqilDG1LJhuCtvrxO/Eb9uu7dNSrNWp5GEc7gHGrgwSU04RZa4AMDBc/wCm+OcV6cd+dj2Vpw8plT+APn8wftUI/hAAAB8HicbVA9TwJBEJ3DL8Qv1NJmI5hQkTsatSOxscTEEyJcyN6ywIa9vcvunAm58C9sLNTY+nPs/DcucIWCL5nk5b2ZzMwLEykMuu63U9jY3NreKe6W9vYPDo/KxycPJk414z6LZaw7ITVcCsV9FCh5J9GcRqHk7XByM/fbT1wbEat7nCY8iOhIiaFgFK30WO3hmCPtN6r9csWtuwuQdeLlpAI5Wv3yV28QszTiCpmkxnQ9N8EgoxoFk3xW6qWGJ5RN6Ih3LVU04ibIFhfPyIVVBmQYa1sKyUL9PZHRyJhpFNrOiOLYrHpz8T+vm+LwKsiESlLkii0XDVNJMCbz98lAaM5QTi2hTAt7K2FjqilDG1LJhuCtvrxO/Eb9uu7dNSrNWp5GEc7gHGrgwSU04RZa4AMDBc/wCm+OcV6cd+dj2Vpw8plT+APn8wftUI/hAAAB8HicbVA9TwJBEN3DL8Qv1NJmI5hQkTsatSOxscTEEyJcyN4yBxv29i67cyaE8C9sLNTY+nPs/DcucIWCL5nk5b2ZzMwLUykMuu63U9jY3NreKe6W9vYPDo/KxycPJsk0B58nMtGdkBmQQoGPAiV0Ug0sDiW0w/HN3G8/gTYiUfc4SSGI2VCJSHCGVnqs9nAEyPpetV+uuHV3AbpOvJxUSI5Wv/zVGyQ8i0Ehl8yYruemGEyZRsElzEq9zEDK+JgNoWupYjGYYLq4eEYvrDKgUaJtKaQL9ffElMXGTOLQdsYMR2bVm4v/ed0Mo6tgKlSaISi+XBRlkmJC5+/TgdDAUU4sYVwLeyvlI6YZRxtSyYbgrb68TvxG/bru3TUqzVqeRpGckXNSIx65JE1yS1rEJ5wo8kxeyZtjnBfn3flYthacfOaU/IHz+QPrzI/gAAAB8HicbVA9TwJBEN3DL8Qv1NJmI5hQkTsatSOxscTEEyJcyN4yBxv29i67cyaE8C9sLNTY+nPs/DcucIWCL5nk5b2ZzMwLUykMuu63U9jY3NreKe6W9vYPDo/KxycPJsk0B58nMtGdkBmQQoGPAiV0Ug0sDiW0w/HN3G8/gTYiUfc4SSGI2VCJSHCGVnqs9nAEyPpetV+uuHV3AbpOvJxUSI5Wv/zVGyQ8i0Ehl8yYruemGEyZ
RsElzEq9zEDK+JgNoWupYjGYYLq4eEYvrDKgUaJtKaQL9ffElMXGTOLQdsYMR2bVm4v/ed0Mo6tgKlSaISi+XBRlkmJC5+/TgdDAUU4sYVwLeyvlI6YZRxtSyYbgrb68TvxG/bru3TUqzVqeRpGckXNSIx65JE1yS1rEJ5wo8kxeyZtjnBfn3flYthacfOaU/IHz+QPrzI/gAAAB8HicbVA9TwJBEN3DL8Qv1NJmI5hQkTsatSOxscTEEyJcyN4yBxv29i67cyaE8C9sLNTY+nPs/DcucIWCL5nk5b2ZzMwLUykMuu63U9jY3NreKe6W9vYPDo/KxycPJsk0B58nMtGdkBmQQoGPAiV0Ug0sDiW0w/HN3G8/gTYiUfc4SSGI2VCJSHCGVnqs9nAEyPpetV+uuHV3AbpOvJxUSI5Wv/zVGyQ8i0Ehl8yYruemGEyZRsElzEq9zEDK+JgNoWupYjGYYLq4eEYvrDKgUaJtKaQL9ffElMXGTOLQdsYMR2bVm4v/ed0Mo6tgKlSaISi+XBRlkmJC5+/TgdDAUU4sYVwLeyvlI6YZRxtSyYbgrb68TvxG/bru3TUqzVqeRpGckXNSIx65JE1yS1rEJ5wo8kxeyZtjnBfn3flYthacfOaU/IHz+QPrzI/gAAAB6XicbVA9TwJBEJ3zE/ELtbTZCCZU5I5G7UhsLDFyQgIXsrcssGFv77I7Z0Iu/AQbCzW2/iM7/40LXKHgSyZ5eW8mM/PCRAqDrvvtbGxube/sFvaK+weHR8elk9NHE6eacZ/FMtadkBouheI+CpS8k2hOo1Dydji5nfvtJ66NiFULpwkPIjpSYigYRSs9VFqVfqns1twFyDrxclKGHM1+6as3iFkacYVMUmO6nptgkFGNgkk+K/ZSwxPKJnTEu5YqGnETZItTZ+TSKgMyjLUthWSh/p7IaGTMNAptZ0RxbFa9ufif101xeB1kQiUpcsWWi4apJBiT+d9kIDRnKKeWUKaFvZWwMdWUoU2naEPwVl9eJ369dlPz7uvlRjVPowDncAFV8OAKGnAHTfCBwQie4RXeHOm8OO/Ox7J1w8lnzuAPnM8fzP2M7A==

sha1_base64="zhIcjWcO1/zV08YSZkEHDw32RsE=">AAAB6XicbVA9TwJBEJ3zE/ELtbTZCCZU5I5G7UhsLDFyQgIXsrcssGFv77I7Z0Iu/AQbCzW2/iM7/40LXKHgSyZ5eW8mM/PCRAqDrvvtbGxube/sFvaK+weHR8elk9NHE6eacZ/FMtadkBouheI+CpS8k2hOo1Dydji5nfvtJ66NiFULpwkPIjpSYigYRSs9VFqVfqns1twFyDrxclKGHM1+6as3iFkacYVMUmO6nptgkFGNgkk+K/ZSwxPKJnTEu5YqGnETZItTZ+TSKgMyjLUthWSh/p7IaGTMNAptZ0RxbFa9ufif101xeB1kQiUpcsWWi4apJBiT+d9kIDRnKKeWUKaFvZWwMdWUoU2naEPwVl9eJ369dlPz7uvlRjVPowDncAFV8OAKGnAHTfCBwQie4RXeHOm8OO/Ox7J1w8lnzuAPnM8fzP2M7A==AAAB6XicbVA9TwJBEJ3zE/ELtbTZCCZU5I5G7UhsLDFyQgIXsrcssGFv77I7Z0Iu/AQbCzW2/iM7/40LXKHgSyZ5eW8mM/PCRAqDrvvtbGxube/sFvaK+weHR8elk9NHE6eacZ/FMtadkBouheI+CpS8k2hOo1Dydji5nfvtJ66NiFULpwkPIjpSYigYRSs9VFqVfqns1twFyDrxclKGHM1+6as3iFkacYVMUmO6nptgkFGNgkk+K/ZSwxPKJnTEu5YqGnETZItTZ+TSKgMyjLUthWSh/p7IaGTMNAptZ0RxbFa9ufif101xeB1kQiUpcsWWi4apJBiT+d9kIDRnKKeWUKaFvZWwMdWUoU2naEPwVl9eJ369dlPz7uvlRjVPowDncAFV8OAKGnAHTfCBwQie4RXeHOm8OO/Ox7J1w8lnzuAPnM8fzP2M7A==AAAB6XicbVA9TwJBEJ3zE/ELtbTZCCZU5I5G7UhsLDFyQgIXsrcssGFv77I7Z0Iu/AQbCzW2/iM7/40LXKHgSyZ5eW8mM/PCRAqDrvvtbGxube/sFvaK+weHR8elk9NHE6eacZ/FMtadkBouheI+CpS8k2hOo1Dydji5nfvtJ66NiFULpwkPIjpSYigYRSs9VFqVfqns1twFyDrxclKGHM1+6as3iFkacYVMUmO6nptgkFGNgkk+K/ZSwxPKJnTEu5YqGnETZItTZ+TSKgMyjLUthWSh/p7IaGTMNAptZ0RxbFa9ufif101xeB1kQiUpcsWWi4apJBiT+d9kIDRnKKeWUKaFvZWwMdWUoU2naEPwVl9eJ369dlPz7uvlRjVPowDncAFV8OAKGnAHTfCBwQie4RXeHOm8OO/Ox7J1w8lnzuAPnM8fzP2M7A==AAACEHicbVC7TsMwFL3hWcorwMhi0SIYUJV0AMZKLIxFog+piSrHcVqrjhPZDlIV9RNY+BUWBhBiZWTjb3DbDNByJEtH59x7fe8JUs6Udpxva2V1bX1js7RV3t7Z3du3Dw7bKskkoS2S8ER2A6woZ4K2NNOcdlNJcRxw2glGN1O/80ClYom41+OU+jEeCBYxgrWR+vaZJxImQio0qnpBxjnVnoeWWLVvV5yaMwNaJm5BKlCg2be/vDAhWWwmE46V6rlOqv0cS80Ip5OylymaYjLCA9ozVOCYKj+fHTRBp0YJUZRI88xmM/V3R45jpcZxYCpjrIdq0ZuK/3m9TEfXfs5EmmkqyPyjKONIJ2iaDgqZpETzsSGYSGZ2RWSIJSbaZFg2IbiLJy+Tdr3mXtacu3qlcVHEUYJjOIFzcOEKGnALTWgBgUd4hld4s56sF+vd+piXrlhFzxH8gfX5AzWznJ0=

AAACC3icbVC7TsMwFL3hWcqrwMgS0SIxVUkXYKvEwlgkQis1VeU4TmrVsSPbAVVRP4CFX2FhAMTKD7DxNzhtBmg5kqWjc+69vvcEKaNKO863tbK6tr6xWdmqbu/s7u3XDg7vlMgkJh4WTMhegBRhlBNPU81IL5UEJQEj3WB8VfjdeyIVFfxWT1IySFDMaUQx0kYa1uo+F5SHhGtbi5RiZTd8JngsaTzSSErx0KiaKqfpzGAvE7ckdSjRGda+/FDgLDFTMUNK9V0n1YMcSU0xI9OqnymSIjxGMekbylFC1CCfHTO1T40S2pGQ5pmtZurvjhwlSk2SwFQmSI/UoleI/3n9TEcXg5zyNNOE4/lHUcbM3XaRjB1SSbBmE0MQltTsauMRkghrk18Rgrt48jLxWs3LpnvTqrdbZRoVOIYTOAMXzqEN19ABDzA8wjO8wpv1ZL1Y79bHvHTFKnuO4A+szx/caJr/AAACC3icbVC7TsMwFL3hWcqrwMgS0SIxVUkXYKvEwlgkQis1VeU4TmrVsSPbAVVRP4CFX2FhAMTKD7DxNzhtBmg5kqWjc+69vvcEKaNKO863tbK6tr6xWdmqbu/s7u3XDg7vlMgkJh4WTMhegBRhlBNPU81IL5UEJQEj3WB8VfjdeyIVFfxWT1IySFDMaUQx0kYa1uo+F5SHhGtbi5RiZTd8JngsaTzSSErx0KiaKqfpzGAvE7ckdSjRGda+/FDgLDFTMUNK9V0n1YMcSU0xI9OqnymSIjxGMekbylFC1CCfHTO1T40S2pGQ5pmtZurvjhwlSk2SwFQmSI/UoleI/3n9TEcXg5zyNNOE4/lHUcbM3XaRjB1SSbBmE0MQltTsauMRkghrk18Rgrt48jLxWs3LpnvTqrdbZRoVOIYTOAMXzqEN19ABDzA8wjO8wpv1ZL1Y79bHvHTFKnuO4A+szx/caJr/AAACC3icbVC7TsMwFL3hWcqrwMgS0SIxVUkXYKvEwlgkQis1VeU4TmrVsSPbAVVRP4CFX2FhAMTKD7DxNzhtBmg5kqWjc+69vvcEKaNKO863tbK6tr6xWdmqbu/s7u3XDg7vlMgkJh4WTMhegBRhlBNPU81IL5UEJQEj3WB8VfjdeyIVFfxWT1IySFDMaUQx0kYa1uo+F5SHhGtbi5RiZTd8JngsaTzSSErx0KiaKqfpzGAvE7ckdSjRGda+/FDgLDFTMUNK9V0n1YMcSU0xI9OqnymSIjxGMekbylFC1CCfHTO1T40S2pGQ5pmtZurvjhwlSk2SwFQmSI/UoleI/3n9TEcXg5zyNNOE4/lHUcbM3XaRjB1SSbBmE0MQltTsauMRkghrk18Rgrt48jLxWs3LpnvTqrdbZRoVOIYTOAMXzqEN19ABDzA8wjO8wpv1ZL1Y79bHvHTFKnuO4A+szx/caJr/AAAB6XicbVA9TwJBEJ3zE/ELtbTZCCZU5A4LtSOxscToCQlcyN6yBxv29i67cyaE8BNsLNTY+o/s/DcucIWCL5nk5b2ZzMwLUykMuu63s7a+sbm1Xdgp7u7tHxyWjo4fTZJpxn2WyES3Q2q4FIr7KFDydqo5jUPJW+HoZua3nrg2IlEPOE55ENOBEpFgFK10X7mo9Eplt+bOQVaJl5My5Gj2Sl/dfsKymCtkkhrT8dwUgwnVKJjk02I3MzylbEQHvGOpojE3wWR+6pScW6VPokTbUkjm6u+JCY2NGceh7YwpDs2yNxP/8zoZRlfBRKg0Q67YYlGUSYIJmf1N+kJzhnJsCWVa2FsJG1JNGdp0ijYEb/nlVeLXa9c1765eblTzNApwCmdQBQ8uoQG30AQfGAzgGV7hzZHOi/PufCxa15x85gT+wPn8AZr5jMs=AAAB6XicbVA9TwJBEJ3zE/ELtbTZCCZU5A4LtSOxscToCQlcyN6yBxv29i67cyaE8BNsLNTY+o/s/DcucIWCL5nk5b2ZzMwLUykMuu63s7a+sbm1Xdgp7u7tHxyWjo4fTZJpxn2WyES3Q2q4FIr7KFDydqo5jUPJW+HoZua3nrg2IlEPOE55ENOBEpFgFK10X7mo9Eplt+bOQVaJl5My5Gj2Sl/dfsKymCtkkhrT8dwUgwnVKJjk02I3MzylbEQHvGOpojE3wWR+6pScW6VPokTbUkjm6u+JCY2NGceh7YwpDs2yNxP/8zoZRlfBRKg0Q67YYlGUSYIJmf1N+kJzhnJsCWVa2FsJG1JNGdp0ijYEb/nlVeLXa9c1765eblTzNApwCmdQBQ8uoQG30AQfGAzgGV7hzZHOi/PufCxa15x85gT+wPn8AZr5jMs=AAAB6XicbVA9TwJBEJ3zE/ELtbTZCCZU5A4LtSOxscToCQlcyN6yBxv29i67cyaE8BNsLNTY+o/s/DcucIWCL5nk5b2ZzMwLUykMuu63s7a+sbm1Xdgp7u7tHxyWjo4fTZJpxn2WyES3Q2q4FIr7KFDydqo5jUPJW+HoZua3nrg2IlEPOE55ENOBEpFgFK10X7mo9Eplt+bOQVaJl5My5Gj2Sl/dfsKymCtkkhrT8dwUgwnVKJjk02I3MzylbEQHvGOpojE3wWR+6pScW6VPokTbUkjm6u+JCY2NGceh7YwpDs2yNxP/8zoZRlfBRKg0Q67YYlGUSYIJmf1N+kJzhnJsCWVa2FsJG1JNGdp0ijYEb/nlVeLXa9c1765eblTzNApwCmdQBQ8uoQG30AQfGAzgGV7hzZHOi/PufCxa15x85gT+wPn8AZr5jMs=AAAB6XicbVA9TwJBEJ3DL8Qv1NJmI5hQkTsatSOxscToCQlcyN6ywIa9vcvunAm58BNsLNTY+o/s/DcucIWCL5nk5b2ZzMwLEykMuu63U9jY3NreKe6W9vYPDo/KxyePJk414z6LZaw7ITVcCsV9FCh5J9GcRqHk7XByM/fbT1wbEasHnCY8iOhIiaFgFK10X21U++WKW3cXIOvEy0kFcrT65a/eIGZpxBUySY3pem6CQUY1Cib5rNRLDU8om9AR71qqaMRNkC1OnZELqwzIMNa2FJKF+nsio5Ex0yi0nRHFsVn15uJ/XjfF4VWQCZWkyBVbLhqmkmBM5n+TgdCcoZxaQpkW9lbCxlRThjadkg3BW315nfiN+nXdu2tUmrU8jSKcwTnUwINLaMIttMAHBiN4hld4c6Tz4rw7H8vWgpPPnMIfOJ8/mXWMyg==

sha1_base64="+sUoP1xTCrumg1PX3FE56wMPqpE=">AAAB6XicbVA9TwJBEJ3DL8Qv1NJmI5hQkTsatSOxscToCQlcyN6ywIa9vcvunAm58BNsLNTY+o/s/DcucIWCL5nk5b2ZzMwLEykMuu63U9jY3NreKe6W9vYPDo/KxyePJk414z6LZaw7ITVcCsV9FCh5J9GcRqHk7XByM/fbT1wbEasHnCY8iOhIiaFgFK10X21U++WKW3cXIOvEy0kFcrT65a/eIGZpxBUySY3pem6CQUY1Cib5rNRLDU8om9AR71qqaMRNkC1OnZELqwzIMNa2FJKF+nsio5Ex0yi0nRHFsVn15uJ/XjfF4VWQCZWkyBVbLhqmkmBM5n+TgdCcoZxaQpkW9lbCxlRThjadkg3BW315nfiN+nXdu2tUmrU8jSKcwTnUwINLaMIttMAHBiN4hld4c6Tz4rw7H8vWgpPPnMIfOJ8/mXWMyg==AAAB6XicbVA9TwJBEJ3DL8Qv1NJmI5hQkTsatSOxscToCQlcyN6ywIa9vcvunAm58BNsLNTY+o/s/DcucIWCL5nk5b2ZzMwLEykMuu63U9jY3NreKe6W9vYPDo/KxyePJk414z6LZaw7ITVcCsV9FCh5J9GcRqHk7XByM/fbT1wbEasHnCY8iOhIiaFgFK10X21U++WKW3cXIOvEy0kFcrT65a/eIGZpxBUySY3pem6CQUY1Cib5rNRLDU8om9AR71qqaMRNkC1OnZELqwzIMNa2FJKF+nsio5Ex0yi0nRHFsVn15uJ/XjfF4VWQCZWkyBVbLhqmkmBM5n+TgdCcoZxaQpkW9lbCxlRThjadkg3BW315nfiN+nXdu2tUmrU8jSKcwTnUwINLaMIttMAHBiN4hld4c6Tz4rw7H8vWgpPPnMIfOJ8/mXWMyg==AAAB6XicbVA9TwJBEJ3DL8Qv1NJmI5hQkTsatSOxscToCQlcyN6ywIa9vcvunAm58BNsLNTY+o/s/DcucIWCL5nk5b2ZzMwLEykMuu63U9jY3NreKe6W9vYPDo/KxyePJk414z6LZaw7ITVcCsV9FCh5J9GcRqHk7XByM/fbT1wbEasHnCY8iOhIiaFgFK10X21U++WKW3cXIOvEy0kFcrT65a/eIGZpxBUySY3pem6CQUY1Cib5rNRLDU8om9AR71qqaMRNkC1OnZELqwzIMNa2FJKF+nsio5Ex0yi0nRHFsVn15uJ/XjfF4VWQCZWkyBVbLhqmkmBM5n+TgdCcoZxaQpkW9lbCxlRThjadkg3BW315nfiN+nXdu2tUmrU8jSKcwTnUwINLaMIttMAHBiN4hld4c6Tz4rw7H8vWgpPPnMIfOJ8/mXWMyg==AAAB6XicbVA9TwJBEJ3DL8Qv1NJmI5hQkTsatSOxscToCQlcyN6yBxv29i67cyaE8BNsLNTY+o/s/DcucIWCL5nk5b2ZzMwLUykMuu63U9jY3NreKe6W9vYPDo/KxyePJsk04z5LZKI7ITVcCsV9FCh5J9WcxqHk7XB8M/fbT1wbkagHnKQ8iOlQiUgwila6r3rVfrni1t0FyDrxclKBHK1++as3SFgWc4VMUmO6nptiMKUaBZN8VuplhqeUjemQdy1VNOYmmC5OnZELqwxIlGhbCslC/T0xpbExkzi0nTHFkVn15uJ/XjfD6CqYCpVmyBVbLooySTAh87/JQGjOUE4soUwLeythI6opQ5tOyYbgrb68TvxG/bru3TUqzVqeRhHO4Bxq4MElNOEWWuADgyE8wyu8OdJ5cd6dj2VrwclnTuEPnM8fl/GMyQ==

sha1_base64="jTRdB2AeitGvUk/ONBsjhHzHyCA=">AAAB6XicbVA9TwJBEJ3DL8Qv1NJmI5hQkTsatSOxscToCQlcyN6yBxv29i67cyaE8BNsLNTY+o/s/DcucIWCL5nk5b2ZzMwLUykMuu63U9jY3NreKe6W9vYPDo/KxyePJsk04z5LZKI7ITVcCsV9FCh5J9WcxqHk7XB8M/fbT1wbkagHnKQ8iOlQiUgwila6r3rVfrni1t0FyDrxclKBHK1++as3SFgWc4VMUmO6nptiMKUaBZN8VuplhqeUjemQdy1VNOYmmC5OnZELqwxIlGhbCslC/T0xpbExkzi0nTHFkVn15uJ/XjfD6CqYCpVmyBVbLooySTAh87/JQGjOUE4soUwLeythI6opQ5tOyYbgrb68TvxG/bru3TUqzVqeRhHO4Bxq4MElNOEWWuADgyE8wyu8OdJ5cd6dj2VrwclnTuEPnM8fl/GMyQ==AAAB6XicbVA9TwJBEJ3DL8Qv1NJmI5hQkTsatSOxscToCQlcyN6yBxv29i67cyaE8BNsLNTY+o/s/DcucIWCL5nk5b2ZzMwLUykMuu63U9jY3NreKe6W9vYPDo/KxyePJsk04z5LZKI7ITVcCsV9FCh5J9WcxqHk7XB8M/fbT1wbkagHnKQ8iOlQiUgwila6r3rVfrni1t0FyDrxclKBHK1++as3SFgWc4VMUmO6nptiMKUaBZN8VuplhqeUjemQdy1VNOYmmC5OnZELqwxIlGhbCslC/T0xpbExkzi0nTHFkVn15uJ/XjfD6CqYCpVmyBVbLooySTAh87/JQGjOUE4soUwLeythI6opQ5tOyYbgrb68TvxG/bru3TUqzVqeRhHO4Bxq4MElNOEWWuADgyE8wyu8OdJ5cd6dj2VrwclnTuEPnM8fl/GMyQ==AAAB6XicbVA9TwJBEJ3DL8Qv1NJmI5hQkTsatSOxscToCQlcyN6yBxv29i67cyaE8BNsLNTY+o/s/DcucIWCL5nk5b2ZzMwLUykMuu63U9jY3NreKe6W9vYPDo/KxyePJsk04z5LZKI7ITVcCsV9FCh5J9WcxqHk7XB8M/fbT1wbkagHnKQ8iOlQiUgwila6r3rVfrni1t0FyDrxclKBHK1++as3SFgWc4VMUmO6nptiMKUaBZN8VuplhqeUjemQdy1VNOYmmC5OnZELqwxIlGhbCslC/T0xpbExkzi0nTHFkVn15uJ/XjfD6CqYCpVmyBVbLooySTAh87/JQGjOUE4soUwLeythI6opQ5tOyYbgrb68TvxG/bru3TUqzVqeRhHO4Bxq4MElNOEWWuADgyE8wyu8OdJ5cd6dj2VrwclnTuEPnM8fl/GMyQ==AAAB6XicbVA9TwJBEJ3DL8Qv1NJmI5hQkTsatSOxscToCQlcyN6yBxv29i67cyaE8BNsLNTY+o/s/DcucIWCL5nk5b2ZzMwLUykMuu63U9jY3NreKe6W9vYPDo/KxyePJsk04z5LZKI7ITVcCsV9FCh5J9WcxqHk7XB8M/fbT1wbkagHnKQ8iOlQiUgwila6r3rVfrni1t0FyDrxclKBHK1++as3SFgWc4VMUmO6nptiMKUaBZN8VuplhqeUjemQdy1VNOYmmC5OnZELqwxIlGhbCslC/T0xpbExkzi0nTHFkVn15uJ/XjfD6CqYCpVmyBVbLooySTAh87/JQGjOUE4soUwLeythI6opQ5tOyYbgrb68TvxG/bru3TUqzVqeRhHO4Bxq4MElNOEWWuADgyE8wyu8OdJ5cd6dj2VrwclnTuEPnM8fl/GMyQ==

sha1_base64="jTRdB2AeitGvUk/ONBsjhHzHyCA=">AAAB6XicbVA9TwJBEJ3DL8Qv1NJmI5hQkTsatSOxscToCQlcyN6yBxv29i67cyaE8BNsLNTY+o/s/DcucIWCL5nk5b2ZzMwLUykMuu63U9jY3NreKe6W9vYPDo/KxyePJsk04z5LZKI7ITVcCsV9FCh5J9WcxqHk7XB8M/fbT1wbkagHnKQ8iOlQiUgwila6r3rVfrni1t0FyDrxclKBHK1++as3SFgWc4VMUmO6nptiMKUaBZN8VuplhqeUjemQdy1VNOYmmC5OnZELqwxIlGhbCslC/T0xpbExkzi0nTHFkVn15uJ/XjfD6CqYCpVmyBVbLooySTAh87/JQGjOUE4soUwLeythI6opQ5tOyYbgrb68TvxG/bru3TUqzVqeRhHO4Bxq4MElNOEWWuADgyE8wyu8OdJ5cd6dj2VrwclnTuEPnM8fl/GMyQ==AAAB6XicbVA9TwJBEJ3DL8Qv1NJmI5hQkTsatSOxscToCQlcyN6yBxv29i67cyaE8BNsLNTY+o/s/DcucIWCL5nk5b2ZzMwLUykMuu63U9jY3NreKe6W9vYPDo/KxyePJsk04z5LZKI7ITVcCsV9FCh5J9WcxqHk7XB8M/fbT1wbkagHnKQ8iOlQiUgwila6r3rVfrni1t0FyDrxclKBHK1++as3SFgWc4VMUmO6nptiMKUaBZN8VuplhqeUjemQdy1VNOYmmC5OnZELqwxIlGhbCslC/T0xpbExkzi0nTHFkVn15uJ/XjfD6CqYCpVmyBVbLooySTAh87/JQGjOUE4soUwLeythI6opQ5tOyYbgrb68TvxG/bru3TUqzVqeRhHO4Bxq4MElNOEWWuADgyE8wyu8OdJ5cd6dj2VrwclnTuEPnM8fl/GMyQ==AAAB6XicbVA9TwJBEJ3DL8Qv1NJmI5hQkTsatSOxscToCQlcyN6yBxv29i67cyaE8BNsLNTY+o/s/DcucIWCL5nk5b2ZzMwLUykMuu63U9jY3NreKe6W9vYPDo/KxyePJsk04z5LZKI7ITVcCsV9FCh5J9WcxqHk7XB8M/fbT1wbkagHnKQ8iOlQiUgwila6r3rVfrni1t0FyDrxclKBHK1++as3SFgWc4VMUmO6nptiMKUaBZN8VuplhqeUjemQdy1VNOYmmC5OnZELqwxIlGhbCslC/T0xpbExkzi0nTHFkVn15uJ/XjfD6CqYCpVmyBVbLooySTAh87/JQGjOUE4soUwLeythI6opQ5tOyYbgrb68TvxG/bru3TUqzVqeRhHO4Bxq4MElNOEWWuADgyE8wyu8OdJ5cd6dj2VrwclnTuEPnM8fl/GMyQ==AAAB8HicbVA9TwJBEN3zE/ELtbTZCCZU5A4LtSOxscTEEyJcyN4yBxv29i67cyaE8C9sLNTY+nPs/DcucIWCL5nk5b2ZzMwLUykMuu63s7a+sbm1Xdgp7u7tHxyWjo4fTJJpDj5PZKLbITMghQIfBUpopxpYHEpohaObmd96Am1Eou5xnEIQs4ESkeAMrfRY6eIQkPUuKr1S2a25c9BV4uWkTHI0e6Wvbj/hWQwKuWTGdDw3xWDCNAouYVrsZgZSxkdsAB1LFYvBBJP5xVN6bpU+jRJtSyGdq78nJiw2ZhyHtjNmODTL3kz8z+tkGF0FE6HSDEHxxaIokxQTOnuf9oUGjnJsCeNa2FspHzLNONqQijYEb/nlVeLXa9c1765eblTzNArklJyRKvHIJWmQW9IkPuFEkWfySt4c47w4787HonXNyWdOyB84nz/u1I/i

sha1_base64="POAwq3CwPhXEmswdaoPkFsig6+E=">AAAB8HicbVA9TwJBEN3zE/ELtbTZCCZU5I5G7Ui0sMTEEyJcyN4yBxv29i67cyaE8C9sLNTY+nPs/DcucIWCL5nk5b2ZzMwLUykMuu63s7a+sbm1Xdgp7u7tHxyWjo4fTJJpDj5PZKLbITMghQIfBUpopxpYHEpohaPrmd96Am1Eou5xnEIQs4ESkeAMrfRY6eIQkPVuKr1S2a25c9BV4uWkTHI0e6Wvbj/hWQwKuWTGdDw3xWDCNAouYVrsZgZSxkdsAB1LFYvBBJP5xVN6bpU+jRJtSyGdq78nJiw2ZhyHtjNmODTL3kz8z+tkGF0GE6HSDEHxxaIokxQTOnuf9oUGjnJsCeNa2FspHzLNONqQijYEb/nlVeLXa1c1765eblTzNArklJyRKvHIBWmQW9IkPuFEkWfySt4c47w4787HonXNyWdOyB84nz8Ip4/z

sha1_base64="POAwq3CwPhXEmswdaoPkFsig6+E=">AAAB8HicbVA9TwJBEN3zE/ELtbTZCCZU5I5G7Ui0sMTEEyJcyN4yBxv29i67cyaE8C9sLNTY+nPs/DcucIWCL5nk5b2ZzMwLUykMuu63s7a+sbm1Xdgp7u7tHxyWjo4fTJJpDj5PZKLbITMghQIfBUpopxpYHEpohaPrmd96Am1Eou5xnEIQs4ESkeAMrfRY6eIQkPVuKr1S2a25c9BV4uWkTHI0e6Wvbj/hWQwKuWTGdDw3xWDCNAouYVrsZgZSxkdsAB1LFYvBBJP5xVN6bpU+jRJtSyGdq78nJiw2ZhyHtjNmODTL3kz8z+tkGF0GE6HSDEHxxaIokxQTOnuf9oUGjnJsCeNa2FspHzLNONqQijYEb/nlVeLXa1c1765eblTzNArklJyRKvHIBWmQW9IkPuFEkWfySt4c47w4787HonXNyWdOyB84nz8Ip4/zAAAB8HicbVA9TwJBEN3zE/ELtbTZCCZU5I5G7Ui0sMTEEyJcyN4yBxv29i67cyaE8C9sLNTY+nPs/DcucIWCL5nk5b2ZzMwLUykMuu63s7a+sbm1Xdgp7u7tHxyWjo4fTJJpDj5PZKLbITMghQIfBUpopxpYHEpohaPrmd96Am1Eou5xnEIQs4ESkeAMrfRY6eIQkPVuKr1S2a25c9BV4uWkTHI0e6Wvbj/hWQwKuWTGdDw3xWDCNAouYVrsZgZSxkdsAB1LFYvBBJP5xVN6bpU+jRJtSyGdq78nJiw2ZhyHtjNmODTL3kz8z+tkGF0GE6HSDEHxxaIokxQTOnuf9oUGjnJsCeNa2FspHzLNONqQijYEb/nlVeLXa1c1765eblTzNArklJyRKvHIBWmQW9IkPuFEkWfySt4c47w4787HonXNyWdOyB84nz8Ip4/zAAAB8HicbVA9TwJBEN3zE/ELtbTZCCZU5I5G7Ui0sMTEEyJcyN4yBxv29i67cyaE8C9sLNTY+nPs/DcucIWCL5nk5b2ZzMwLUykMuu63s7a+sbm1Xdgp7u7tHxyWjo4fTJJpDj5PZKLbITMghQIfBUpopxpYHEpohaPrmd96Am1Eou5xnEIQs4ESkeAMrfRY6eIQkPVuKr1S2a25c9BV4uWkTHI0e6Wvbj/hWQwKuWTGdDw3xWDCNAouYVrsZgZSxkdsAB1LFYvBBJP5xVN6bpU+jRJtSyGdq78nJiw2ZhyHtjNmODTL3kz8z+tkGF0GE6HSDEHxxaIokxQTOnuf9oUGjnJsCeNa2FspHzLNONqQijYEb/nlVeLXa1c1765eblTzNArklJyRKvHIBWmQW9IkPuFEkWfySt4c47w4787HonXNyWdOyB84nz8Ip4/zAAAB6XicbVA9TwJBEJ3zE/ELtbTZCCZU5I5G7Ui0sMToCQlcyN6yBxv29i67cyaE8BNsLNTY+o/s/DcucIWCL5nk5b2ZzMwLUykMuu63s7a+sbm1Xdgp7u7tHxyWjo4fTZJpxn2WyES3Q2q4FIr7KFDydqo5jUPJW+Hoeua3nrg2IlEPOE55ENOBEpFgFK10X7mp9Eplt+bOQVaJl5My5Gj2Sl/dfsKymCtkkhrT8dwUgwnVKJjk02I3MzylbEQHvGOpojE3wWR+6pScW6VPokTbUkjm6u+JCY2NGceh7YwpDs2yNxP/8zoZRpfBRKg0Q67YYlGUSYIJmf1N+kJzhnJsCWVa2FsJG1JNGdp0ijYEb/nlVeLXa1c1765eblTzNApwCmdQBQ8uoAG30AQfGAzgGV7hzZHOi/PufCxa15x85gT+wPn8AbS9jNw=AAAB6XicbVA9TwJBEJ3zE/ELtbTZCCZU5I5G7Ui0sMToCQlcyN6yBxv29i67cyaE8BNsLNTY+o/s/DcucIWCL5nk5b2ZzMwLUykMuu63s7a+sbm1Xdgp7u7tHxyWjo4fTZJpxn2WyES3Q2q4FIr7KFDydqo5jUPJW+Hoeua3nrg2IlEPOE55ENOBEpFgFK10X7mp9Eplt+bOQVaJl5My5Gj2Sl/dfsKymCtkkhrT8dwUgwnVKJjk02I3MzylbEQHvGOpojE3wWR+6pScW6VPokTbUkjm6u+JCY2NGceh7YwpDs2yNxP/8zoZRpfBRKg0Q67YYlGUSYIJmf1N+kJzhnJsCWVa2FsJG1JNGdp0ijYEb/nlVeLXa1c1765eblTzNApwCmdQBQ8uoAG30AQfGAzgGV7hzZHOi/PufCxa15x85gT+wPn8AbS9jNw=AAAB6XicbVA9TwJBEJ3zE/ELtbTZCCZU5I5G7Ui0sMToCQlcyN6yBxv29i67cyaE8BNsLNTY+o/s/DcucIWCL5nk5b2ZzMwLUykMuu63s7a+sbm1Xdgp7u7tHxyWjo4fTZJpxn2WyES3Q2q4FIr7KFDydqo5jUPJW+Hoeua3nrg2IlEPOE55ENOBEpFgFK10X7mp9Eplt+bOQVaJl5My5Gj2Sl/dfsKymCtkkhrT8dwUgwnVKJjk02I3MzylbEQHvGOpojE3wWR+6pScW6VPokTbUkjm6u+JCY2NGceh7YwpDs2yNxP/8zoZRpfBRKg0Q67YYlGUSYIJmf1N+kJzhnJsCWVa2FsJG1JNGdp0ijYEb/nlVeLXa1c1765eblTzNApwCmdQBQ8uoAG30AQfGAzgGV7hzZHOi/PufCxa15x85gT+wPn8AbS9jNw=AAAB6XicbVA9TwJBEJ3zE/ELtbTZCCZU5A4LtSOxscToCQlcyN6yBxv29i67cyaE8BNsLNTY+o/s/DcucIWCL5nk5b2ZzMwLUykMuu63s7a+sbm1Xdgp7u7tHxyWjo4fTZJpxn2WyES3Q2q4FIr7KFDydqo5jUPJW+HoZua3nrg2IlEPOE55ENOBEpFgFK10X7mo9Eplt+bOQVaJl5My5Gj2Sl/dfsKymCtkkhrT8dwUgwnVKJjk02I3MzylbEQHvGOpojE3wWR+6pScW6VPokTbUkjm6u+JCY2NGceh7YwpDs2yNxP/8zoZRlfBRKg0Q67YYlGUSYIJmf1N+kJzhnJsCWVa2FsJG1JNGdp0ijYEb/nlVeLXa9c1765eblTzNApwCmdQBQ8uoQG30AQfGAzgGV7hzZHOi/PufCxa15x85gT+wPn8AZr5jMs=

sha1_base64="6Qu1GQp/+nOgzxHS/2VijO+61cM=">AAAB6XicbVA9TwJBEJ3zE/ELtbTZCCZU5A4LtSOxscToCQlcyN6yBxv29i67cyaE8BNsLNTY+o/s/DcucIWCL5nk5b2ZzMwLUykMuu63s7a+sbm1Xdgp7u7tHxyWjo4fTZJpxn2WyES3Q2q4FIr7KFDydqo5jUPJW+HoZua3nrg2IlEPOE55ENOBEpFgFK10X7mo9Eplt+bOQVaJl5My5Gj2Sl/dfsKymCtkkhrT8dwUgwnVKJjk02I3MzylbEQHvGOpojE3wWR+6pScW6VPokTbUkjm6u+JCY2NGceh7YwpDs2yNxP/8zoZRlfBRKg0Q67YYlGUSYIJmf1N+kJzhnJsCWVa2FsJG1JNGdp0ijYEb/nlVeLXa9c1765eblTzNApwCmdQBQ8uoQG30AQfGAzgGV7hzZHOi/PufCxa15x85gT+wPn8AZr5jMs=AAAB6XicbVA9TwJBEJ3zE/ELtbTZCCZU5A4LtSOxscToCQlcyN6yBxv29i67cyaE8BNsLNTY+o/s/DcucIWCL5nk5b2ZzMwLUykMuu63s7a+sbm1Xdgp7u7tHxyWjo4fTZJpxn2WyES3Q2q4FIr7KFDydqo5jUPJW+HoZua3nrg2IlEPOE55ENOBEpFgFK10X7mo9Eplt+bOQVaJl5My5Gj2Sl/dfsKymCtkkhrT8dwUgwnVKJjk02I3MzylbEQHvGOpojE3wWR+6pScW6VPokTbUkjm6u+JCY2NGceh7YwpDs2yNxP/8zoZRlfBRKg0Q67YYlGUSYIJmf1N+kJzhnJsCWVa2FsJG1JNGdp0ijYEb/nlVeLXa9c1765eblTzNApwCmdQBQ8uoQG30AQfGAzgGV7hzZHOi/PufCxa15x85gT+wPn8AZr5jMs=AAAB6XicbVA9TwJBEJ3zE/ELtbTZCCZU5A4LtSOxscToCQlcyN6yBxv29i67cyaE8BNsLNTY+o/s/DcucIWCL5nk5b2ZzMwLUykMuu63s7a+sbm1Xdgp7u7tHxyWjo4fTZJpxn2WyES3Q2q4FIr7KFDydqo5jUPJW+HoZua3nrg2IlEPOE55ENOBEpFgFK10X7mo9Eplt+bOQVaJl5My5Gj2Sl/dfsKymCtkkhrT8dwUgwnVKJjk02I3MzylbEQHvGOpojE3wWR+6pScW6VPokTbUkjm6u+JCY2NGceh7YwpDs2yNxP/8zoZRlfBRKg0Q67YYlGUSYIJmf1N+kJzhnJsCWVa2FsJG1JNGdp0ijYEb/nlVeLXa9c1765eblTzNApwCmdQBQ8uoQG30AQfGAzgGV7hzZHOi/PufCxa15x85gT+wPn8AZr5jMs=AAACAHicbVA9TwJBEN3DL8SvUwsLm41gYmHIHYVakthYYiJIAheyt8zBhr2P7M6ZkAuNf8XGQmNs/Rl2/hsXuELAl0zm5b2Z7M7zEyk0Os6PVVhb39jcKm6Xdnb39g/sw6OWjlPFocljGau2zzRIEUETBUpoJwpY6Et49Ee3U//xCZQWcfSA4wS8kA0iEQjO0Eg9+6TS9VMpARdbpWeXnaozA10lbk7KJEejZ393+zFPQ4iQS6Z1x3US9DKmUHAJk1I31ZAwPmID6BgasRC0l80OmNBzo/RpECtTEdKZ+ncjY6HW49A3kyHDoV72puJ/XifF4MbLRJSkCBGfPxSkkmJMp2nQvlDAUY4NYVwJ81fKh0wxjiazkgnBXT55lbRqVfeq6tzXyvXLPI4iOSVn5IK45JrUyR1pkCbhZEJeyBt5t56tV+vD+pyPFqx855gswPr6BcVQlnI=

AAAB8HicbVA9TwJBEN3zE/ELtbTZCCZU5A4LtSOxscTEEyJcyN4yBxv29i67cyaE8C9sLNTY+nPs/DcucIWCL5nk5b2ZzMwLUykMuu63s7a+sbm1Xdgp7u7tHxyWjo4fTJJpDj5PZKLbITMghQIfBUpopxpYHEpohaObmd96Am1Eou5xnEIQs4ESkeAMrfRY6eIQkPUuKr1S2a25c9BV4uWkTHI0e6Wvbj/hWQwKuWTGdDw3xWDCNAouYVrsZgZSxkdsAB1LFYvBBJP5xVN6bpU+jRJtSyGdq78nJiw2ZhyHtjNmODTL3kz8z+tkGF0FE6HSDEHxxaIokxQTOnuf9oUGjnJsCeNa2FspHzLNONqQijYEb/nlVeLXa9c1765eblTzNArklJyRKvHIJWmQW9IkPuFEkWfySt4c47w4787HonXNyWdOyB84nz/u1I/iAAAB8HicbVA9TwJBEN3zE/ELtbTZCCZU5A4LtSOxscTEEyJcyN4yBxv29i67cyaE8C9sLNTY+nPs/DcucIWCL5nk5b2ZzMwLUykMuu63s7a+sbm1Xdgp7u7tHxyWjo4fTJJpDj5PZKLbITMghQIfBUpopxpYHEpohaObmd96Am1Eou5xnEIQs4ESkeAMrfRY6eIQkPUuKr1S2a25c9BV4uWkTHI0e6Wvbj/hWQwKuWTGdDw3xWDCNAouYVrsZgZSxkdsAB1LFYvBBJP5xVN6bpU+jRJtSyGdq78nJiw2ZhyHtjNmODTL3kz8z+tkGF0FE6HSDEHxxaIokxQTOnuf9oUGjnJsCeNa2FspHzLNONqQijYEb/nlVeLXa9c1765eblTzNArklJyRKvHIJWmQW9IkPuFEkWfySt4c47w4787HonXNyWdOyB84nz/u1I/iAAAB8HicbVA9TwJBEN3zE/ELtbTZCCZU5A4LtSOxscTEEyJcyN4yBxv29i67cyaE8C9sLNTY+nPs/DcucIWCL5nk5b2ZzMwLUykMuu63s7a+sbm1Xdgp7u7tHxyWjo4fTJJpDj5PZKLbITMghQIfBUpopxpYHEpohaObmd96Am1Eou5xnEIQs4ESkeAMrfRY6eIQkPUuKr1S2a25c9BV4uWkTHI0e6Wvbj/hWQwKuWTGdDw3xWDCNAouYVrsZgZSxkdsAB1LFYvBBJP5xVN6bpU+jRJtSyGdq78nJiw2ZhyHtjNmODTL3kz8z+tkGF0FE6HSDEHxxaIokxQTOnuf9oUGjnJsCeNa2FspHzLNONqQijYEb/nlVeLXa9c1765eblTzNArklJyRKvHIJWmQW9IkPuFEkWfySt4c47w4787HonXNyWdOyB84nz/u1I/iAAAB6XicbVA9TwJBEJ3DL8Qv1NJmI5hQkTsatSOxscToCQlcyN6ywIa9vcvunAm58BNsLNTY+o/s/DcucIWCL5nk5b2ZzMwLEykMuu63U9jY3NreKe6W9vYPDo/KxyePJk414z6LZaw7ITVcCsV9FCh5J9GcRqHk7XByM/fbT1wbEasHnCY8iOhIiaFgFK10X21U++WKW3cXIOvEy0kFcrT65a/eIGZpxBUySY3pem6CQUY1Cib5rNRLDU8om9AR71qqaMRNkC1OnZELqwzIMNa2FJKF+nsio5Ex0yi0nRHFsVn15uJ/XjfF4VWQCZWkyBVbLhqmkmBM5n+TgdCcoZxaQpkW9lbCxlRThjadkg3BW315nfiN+nXdu2tUmrU8jSKcwTnUwINLaMIttMAHBiN4hld4c6Tz4rw7H8vWgpPPnMIfOJ8/mXWMyg==AAAB6XicbVA9TwJBEJ3DL8Qv1NJmI5hQkTsatSOxscToCQlcyN6ywIa9vcvunAm58BNsLNTY+o/s/DcucIWCL5nk5b2ZzMwLEykMuu63U9jY3NreKe6W9vYPDo/KxyePJk414z6LZaw7ITVcCsV9FCh5J9GcRqHk7XByM/fbT1wbEasHnCY8iOhIiaFgFK10X21U++WKW3cXIOvEy0kFcrT65a/eIGZpxBUySY3pem6CQUY1Cib5rNRLDU8om9AR71qqaMRNkC1OnZELqwzIMNa2FJKF+nsio5Ex0yi0nRHFsVn15uJ/XjfF4VWQCZWkyBVbLhqmkmBM5n+TgdCcoZxaQpkW9lbCxlRThjadkg3BW315nfiN+nXdu2tUmrU8jSKcwTnUwINLaMIttMAHBiN4hld4c6Tz4rw7H8vWgpPPnMIfOJ8/mXWMyg==AAAB6XicbVA9TwJBEJ3DL8Qv1NJmI5hQkTsatSOxscToCQlcyN6ywIa9vcvunAm58BNsLNTY+o/s/DcucIWCL5nk5b2ZzMwLEykMuu63U9jY3NreKe6W9vYPDo/KxyePJk414z6LZaw7ITVcCsV9FCh5J9GcRqHk7XByM/fbT1wbEasHnCY8iOhIiaFgFK10X21U++WKW3cXIOvEy0kFcrT65a/eIGZpxBUySY3pem6CQUY1Cib5rNRLDU8om9AR71qqaMRNkC1OnZELqwzIMNa2FJKF+nsio5Ex0yi0nRHFsVn15uJ/XjfF4VWQCZWkyBVbLhqmkmBM5n+TgdCcoZxaQpkW9lbCxlRThjadkg3BW315nfiN+nXdu2tUmrU8jSKcwTnUwINLaMIttMAHBiN4hld4c6Tz4rw7H8vWgpPPnMIfOJ8/mXWMyg==AAACDnicbVC7TsMwFHV4lvIKMLJYtCCmKukCbJVYGItEaKWmqhzHSa06dmQ7oCrqH7DwKywMgFiZ2fgbnDZI0HIkS0fn3Ht97wlSRpV2nC9raXlldW29slHd3Nre2bX39m+VyCQmHhZMyG6AFGGUE09TzUg3lQQlASOdYHRZ+J07IhUV/EaPU9JPUMxpRDHSRhrYJz4XlIeEaxgKnCWGKFj3meCxpPFQIynFfb06sGtOw5kCLhK3JDVQoj2wP/2feZghpXquk+p+jqSmmJFJ1c8USREeoZj0DOUoIaqfT++ZwGOjhDAS0jyz2FT93ZGjRKlxEpjKBOmhmvcK8T+vl+novJ9TnmaacDz7KMoY1AIW4cCQSoI1GxuCsKRmV4iHSCKsTYRFCO78yYvEazYuGu51s9ZqlmlUwCE4AqfABWegBa5AG3gAgwfwBF7Aq/VoPVtv1vusdMkqew7AH1gf31vwnF0=

sha1_base64="TEXj5UN8z3fYu37eLkj+A29Zgi8=">AAACDnicbVC7TsMwFHV4lvIKMLJYtCCmKukCbJVYGItEaKWmqhzHSa06dmQ7oCrqH7DwKywMgFiZ2fgbnDZI0HIkS0fn3Ht97wlSRpV2nC9raXlldW29slHd3Nre2bX39m+VyCQmHhZMyG6AFGGUE09TzUg3lQQlASOdYHRZ+J07IhUV/EaPU9JPUMxpRDHSRhrYJz4XlIeEaxgKnCWGKFj3meCxpPFQIynFfb06sGtOw5kCLhK3JDVQoj2wP/2feZghpXquk+p+jqSmmJFJ1c8USREeoZj0DOUoIaqfT++ZwGOjhDAS0jyz2FT93ZGjRKlxEpjKBOmhmvcK8T+vl+novJ9TnmaacDz7KMoY1AIW4cCQSoI1GxuCsKRmV4iHSCKsTYRFCO78yYvEazYuGu51s9ZqlmlUwCE4AqfABWegBa5AG3gAgwfwBF7Aq/VoPVtv1vusdMkqew7AH1gf31vwnF0=AAACDnicbVC7TsMwFHV4lvIKMLJYtCCmKukCbJVYGItEaKWmqhzHSa06dmQ7oCrqH7DwKywMgFiZ2fgbnDZI0HIkS0fn3Ht97wlSRpV2nC9raXlldW29slHd3Nre2bX39m+VyCQmHhZMyG6AFGGUE09TzUg3lQQlASOdYHRZ+J07IhUV/EaPU9JPUMxpRDHSRhrYJz4XlIeEaxgKnCWGKFj3meCxpPFQIynFfb06sGtOw5kCLhK3JDVQoj2wP/2feZghpXquk+p+jqSmmJFJ1c8USREeoZj0DOUoIaqfT++ZwGOjhDAS0jyz2FT93ZGjRKlxEpjKBOmhmvcK8T+vl+novJ9TnmaacDz7KMoY1AIW4cCQSoI1GxuCsKRmV4iHSCKsTYRFCO78yYvEazYuGu51s9ZqlmlUwCE4AqfABWegBa5AG3gAgwfwBF7Aq/VoPVtv1vusdMkqew7AH1gf31vwnF0=AAACDnicbVC7TsMwFHV4lvIKMLJYtCCmKukCbJVYGItEaKWmqhzHSa06dmQ7oCrqH7DwKywMgFiZ2fgbnDZI0HIkS0fn3Ht97wlSRpV2nC9raXlldW29slHd3Nre2bX39m+VyCQmHhZMyG6AFGGUE09TzUg3lQQlASOdYHRZ+J07IhUV/EaPU9JPUMxpRDHSRhrYJz4XlIeEaxgKnCWGKFj3meCxpPFQIynFfb06sGtOw5kCLhK3JDVQoj2wP/2feZghpXquk+p+jqSmmJFJ1c8USREeoZj0DOUoIaqfT++ZwGOjhDAS0jyz2FT93ZGjRKlxEpjKBOmhmvcK8T+vl+novJ9TnmaacDz7KMoY1AIW4cCQSoI1GxuCsKRmV4iHSCKsTYRFCO78yYvEazYuGu51s9ZqlmlUwCE4AqfABWegBa5AG3gAgwfwBF7Aq/VoPVtv1vusdMkqew7AH1gf31vwnF0= 1 and α2 > 1 α1 + α2 − 2

(42.169)
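For later reference, a one-line helper that returns this mode is sketched below; it is not from the text, the function name beta_mode is our own, and the assertion simply enforces the validity condition α1 > 1 and α2 > 1.

def beta_mode(a1, a2):
    # mode of a Beta(a1, a2) distribution, valid for a1 > 1 and a2 > 1, as in (42.169)
    assert a1 > 1 and a2 > 1
    return (a1 - 1) / (a1 + a2 - 2)

print(beta_mode(3.0, 2.0))   # -> 0.666..., the MAP value for a Beta(3, 2) factor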

We will call upon this result later in this example in order to estimate the desired model parameters {π, A, B}.

Data model. We assume we have a collection of N independent measurements of the form {y1^(n), y2^(n), y3^(n)} for the observable variables. We associate with each of these measurement sets a realization for the latent variables and denote them by {x1^(n), x2^(n), x3^(n)}. In order to simplify the notation, and when there is no room for confusion, we will employ the following more compact notation to refer to the collection of variables in each nth set:

yn ≜ col{ y1^(n), y2^(n), y3^(n) },   (3 × 1)     (42.170a)
xn ≜ col{ x1^(n), x2^(n), x3^(n) },   (3 × 1)     (42.170b)

The measurements {yn} are assumed to be collected independently of each other; likewise, the associated latent variables {xn} are independent of each other. Our objective is to use the observations {yn, n = 1, . . . , N} to learn the graph parameters {π, A, B} subject to the assumption that these parameters are modeled as random quantities (as befits a Bayesian formulation) and that they are generated according to the beta distributions defined above. The main question is to estimate the variables {θ1, θ2, θ3}, from which estimates for {π, A, B} can be constructed. We will approach this problem as follows. For each parameter θℓ, we will determine its posterior distribution given the observations, namely, the quantity:

f(θℓ | y1, . . . , yN, x1, . . . , xN)     (42.171)

which we also write more compactly as f(θℓ | y1:N, x1:N). We will determine this posterior by using mean-field approximation theory. Once this is concluded, we will be able to recognize that the posterior has the form of a beta distribution. As such, we will then estimate each parameter θℓ by seeking, for example, the MAP estimate for it. This estimate is given by the location of the mode of the beta distribution, as already specified by (42.169).

Prior distribution is exponential. Let us first examine the form of the prior distribution for the model parameters. Since the variables {θ1, θ2, θ3} are assumed independent of each other, their joint pdf is given by the product of the individual pdfs, written as

f(θ) = f(θ1) × f(θ2) × f(θ3)     (42.172)

where we are writing θ = col{θ1, θ2, θ3} to refer to the three model parameters. Since each of the individual pdfs has an exponential form similar to (42.168), we conclude that the prior distribution f(θ) is exponential as well and given by

f(θ) = h(θ) exp{ α^T φ(θ) − c(α) }     (42.173)

where

α ≜ col{ α1, α2, . . . , α6 },   φ(θ) ≜ col{ ln(θ1), ln(1 − θ1), ln(θ2), ln(1 − θ2), ln(θ3), ln(1 − θ3) },   h(θ) = 1 / ∏_{ℓ=1}^{3} θℓ(1 − θℓ)     (42.174a)

c(α) = Σ_{ℓ=1}^{6} ln Γ(αℓ) − ln{ Γ(α1 + α2) Γ(α3 + α4) Γ(α5 + α6) }     (42.174b)
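For reference, the log-normalization factor c(α) in (42.174b) can be evaluated numerically with the log-gamma function from SciPy. The sketch below is not from the text; the function name and the illustrative hyper-parameter vector are our own choices.

import numpy as np
from scipy.special import gammaln

def c_of_alpha(alpha):
    # c(alpha) from (42.174b): sum of ln Gamma(alpha_l) minus ln of the three pairwise Gamma terms
    a = np.asarray(alpha, dtype=float)
    return gammaln(a).sum() - (gammaln(a[0] + a[1]) + gammaln(a[2] + a[3]) + gammaln(a[4] + a[5]))

print(c_of_alpha([2.0, 3.0, 2.0, 2.0, 4.0, 1.5]))   # arbitrary illustrative values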

Note that the quantities {α, c(α)} are known.

Joint distribution is exponential. We next examine the joint distribution of the observed and latent variables given the model parameters. Thus, note that

f( x1^(n), x2^(n), x3^(n), y1^(n), y2^(n), y3^(n) | θ )
  = f(x1^(n)) f(x2^(n) | x1^(n)) f(x3^(n) | x2^(n)) f(y1^(n) | x1^(n)) f(y2^(n) | x2^(n)) f(y3^(n) | x3^(n))
  = θ1^{I[x1^(n) = 0]} (1 − θ1)^{I[x1^(n) = 1]} × { ∏_{k=2}^{3} θ2^{I[x_{k−1}^(n) = x_k^(n)]} (1 − θ2)^{I[x_{k−1}^(n) ≠ x_k^(n)]} } × { ∏_{k=1}^{3} θ3^{I[x_k^(n) = y_k^(n)]} (1 − θ3)^{I[x_k^(n) ≠ y_k^(n)]} }     (42.175)

or, equivalently,

ln f( x1^(n), x2^(n), x3^(n), y1^(n), y2^(n), y3^(n) | θ )
  = I[x1^(n) = 0] ln(θ1) + I[x1^(n) = 1] ln(1 − θ1)
  + Σ_{k=2}^{3} { I[x_{k−1}^(n) = x_k^(n)] ln(θ2) + I[x_{k−1}^(n) ≠ x_k^(n)] ln(1 − θ2) }
  + Σ_{k=1}^{3} { I[x_k^(n) = y_k^(n)] ln(θ3) + I[x_k^(n) ≠ y_k^(n)] ln(1 − θ3) }     (42.176)

This shows that, given the parameters θ, the joint distribution of the latent and observed variables has an exponential form as well:

f(xn, yn | θ) = exp{ (φ(θ))^T T(yn, xn) }

(42.177)


where we are introducing

T(yn, xn) ≜ col{ I[x1^(n) = 0],
                 I[x1^(n) = 1],
                 Σ_{k=2}^{3} I[x_{k−1}^(n) = x_k^(n)],
                 Σ_{k=2}^{3} I[x_{k−1}^(n) ≠ x_k^(n)],
                 Σ_{k=1}^{3} I[x_k^(n) = y_k^(n)],
                 Σ_{k=1}^{3} I[x_k^(n) ≠ y_k^(n)] },
φ(θ) ≜ col{ ln(θ1), ln(1 − θ1), ln(θ2), ln(1 − θ2), ln(θ3), ln(1 − θ3) }     (42.178)

with the same φ(θ) function defined in (42.174a). Unfortunately, evaluation of the function T(yn, xn) requires knowledge of the latent variables, which are not available. This issue will be addressed in the following; we will eliminate the latent variables by computing the average of the T(yn, xn) function over the distribution of the latent variables. Since we do not know this distribution either, we will learn it using the mean-field approximation.

Posterior distribution is exponential. By examining the prior distribution (42.173) for θ and the joint conditional distribution (42.177) for the latent and observed variables, we notice that we are dealing with a scenario similar to what was described in Section 36.4 on VI applied to exponential models. Hence, much of the analysis from that section is applicable here, with some minimal adjustments. For instance, given all the observations and latent variables, the posterior for θ is again exponential since

f(θ | y1:N, x1:N) ∝ f(θ) ∏_{n=1}^{N} f(xn, yn | θ)
  ∝ exp{ α^T φ(θ) − c(α) + ( Σ_{n=1}^{N} T(yn, xn) )^T φ(θ) }
  ∝ exp{ ( α + Σ_{n=1}^{N} T(yn, xn) )^T φ(θ) − c(α) }     (42.179)

That is,

f(θ | y1:N, x1:N) = h(θ) exp{ (α′)^T φ(θ) − c(α′) }

(42.180)

Observe that the only major change in moving from the prior distribution for θ to its posterior distribution is replacing the parameter α by

α′ ≜ α + Σ_{n=1}^{N} T(yn, xn)     (42.181)

where α is known. In view of the form of φ(θ), it is easy to see that the above posterior can be expressed as the product of individual beta-type distributions (one for each of {θ1, θ2, θ3}):

f(θ | y1:N, x1:N) ∝ θ1^{α′1 − 1} (1 − θ1)^{α′2 − 1} × θ2^{α′3 − 1} (1 − θ2)^{α′4 − 1} × θ3^{α′5 − 1} (1 − θ3)^{α′6 − 1}     (42.182)

where the {α′k} denote the individual entries of α′. However, we still do not know the value of these parameters.

Local latent distribution is exponential. We know from the result of Prob. 5.6 regarding a property of the exponential family of distributions that, since the joint pdf of (yn, xn) is in the exponential family, then the conditional pdf of xn given yn will also have an exponential form. Specifically, note that

f(xn | yn, θ) = f(xn, yn | θ) / f(yn | θ)
  = f(xn, yn | θ) / Σ_{xn} f(xn, yn | θ),   since xn is discrete
  = exp{ (φ(θ))^T T(yn, xn) − ln Σ_{xn} e^{(φ(θ))^T T(yn, xn)} }     (42.183)

where the last step uses (42.177).

Now, conditioned on yn (i.e., given yn), the term T(yn, xn) becomes solely a function of xn:

S(xn) ≜ T(yn, xn),   with yn given     (42.184)

Unfortunately, evaluation of the function S(xn) continues to require knowledge of the latent variables, which are not available. Likewise, given yn and marginalizing over xn, the term in the denominator in the second line of (42.183) becomes some function of θ and yn, denoted by r(θ, yn):

r(θ, yn) ≜ ln{ Σ_{xn ∈ {0,1}^3} e^{(φ(θ))^T T(yn, xn)} }     (42.185)

The form of r(θ, yn) is known and its value can be evaluated since it involves marginalizing over the values of xn. Therefore, we arrive at

f(xn | yn, θ) = exp{ (φ(θ))^T S(xn) − r(θ, yn) }

(42.186)

Mean-field approximation. Our problem involves two types of latent variables: the global parameters {θ1, θ2, θ3} and the nodes {x1^(n), x2^(n), x3^(n)} across all measurements. We now employ the mean-field approximation approach to estimate the conditional pdfs of these latent variables given the observations, namely, the quantity:

f(x1:N, θ | y1:N)   (desired conditional pdf)

(42.187)

We approximate this pdf by a product of variational factors in the form:

f(x1:N, θ | y1:N) ≈ q(θ) ∏_{n=1}^{N} qn(xn)     (42.188)

where

q(θ) approximates f(θ | y1:N)     (42.189a)
qn(xn) approximates f(xn | y1:N)     (42.189b)


Now, according to the coordinate-ascent construction (36.43), the sought-after variational factors for the latent variables can be obtained by solving the following coupled equations (we are dropping the ⋆ superscript for simplicity of notation):

ln qn(xn) = E_{x−n, θ}{ ln f(xn | y1:N, x−n, θ) }     (42.190a)
ln q(θ) = E_{x1:N}{ ln f(θ | y1:N, x1:N) }     (42.190b)

where the notation E_a on the right-hand side is used to refer to the expectation of the argument relative to the variational distribution of the random variable a, i.e., relative to q(a). Moreover, the notation E_{x−n} refers to expectation relative to the variational distributions for all {xm} with the exception of xn. Since the two conditional pdfs appearing on the right-hand side of (42.190a)–(42.190b) are exponential in form, some important simplifications occur due to the logarithm annihilating the exponential function. Thus, using (42.186) and the fact that xn is independent of all other x−n and y−n, equality (42.190a) reduces to

ln qn(xn) = E_θ{ ln f(xn | yn, θ) } = E_θ{ (φ(θ))^T S(xn) − r(θ, yn) }     (42.191)

so that

qn(xn) = exp{ β^T S(xn) − r(β, yn) }     (42.192)

where we introduced

β ≜ E_θ{ φ(θ) }     (42.193)

The expectation is relative to the variational distribution q(θ), which we will determine next. Observe how the resulting variational factor for xn has the same form as the conditional pdf for xn given by (42.186), except that the parameter φ(θ) is replaced by its mean β. Next, using (42.180) and (42.190b) we have

ln q(θ) = ln h(θ) + ( E_{x1:N} α′ )^T φ(θ) − E_{x1:N} c(α′)     (42.194)

and, hence, the variational factor for θ is exponential in form as well:

q(θ) = h(θ) exp{ (α″)^T φ(θ) − c(α″) }     (42.195)

with the same statistic φ(θ) but with a new hyper-parameter vector:

α″ ≜ E_{x1:N}(α′) = α + Σ_{n=1}^{N} E_{xn} T(yn, xn) = α + Σ_{n=1}^{N} E_{xn} S(xn)     (42.196)

where the expectation is relative to the variational distribution qn (xn ), and the second equality is because qn (xn ) is determined by conditioning on the {yn }; hence, these vectors are known. Observe that the variational factor for θ has the same form as the


conditional pdf (42.180) except that the parameter α′ is replaced by α″. In view of the form of φ(θ), it is easy to see that the above variational factor for θ can be expressed as the product of three individual beta-type distributions (one for each of {θ1, θ2, θ3}):

q(θ) ∝ θ1^{α″1 − 1} (1 − θ1)^{α″2 − 1} × θ2^{α″3 − 1} (1 − θ2)^{α″4 − 1} × θ3^{α″5 − 1} (1 − θ3)^{α″6 − 1}     (42.197)

where the {α″k} denote the individual entries of α″. However, we still do not know the value of these parameters. This is necessary if we are to rely on expression (42.169) to determine the MAP estimates for {θ1, θ2, θ3}. Therefore, we need to explain how to estimate α″ from the observations. In the meantime, observe that in view of the form (42.197) involving a product of beta distributions, we can appeal to results (42.165a)–(42.165b) to evaluate β in (42.193) as follows:

β ≜ E_θ{ φ(θ) } = col{ ψ(α″1) − ψ(α″1 + α″2),
                        ψ(α″2) − ψ(α″1 + α″2),
                        ψ(α″3) − ψ(α″3 + α″4),
                        ψ(α″4) − ψ(α″3 + α″4),
                        ψ(α″5) − ψ(α″5 + α″6),
                        ψ(α″6) − ψ(α″5 + α″6) }     (42.198)
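As an illustration, the entries of β in (42.198) can be computed with the digamma function available in SciPy. The helper below is a small sketch and not from the text; its name and the length-6 ordering of the hyper-parameters follow the convention of (42.174a).

import numpy as np
from scipy.special import digamma

def beta_from_alpha(alpha2):
    # beta = E_theta{phi(theta)} as in (42.198); alpha2 plays the role of alpha''
    a = np.asarray(alpha2, dtype=float)
    totals = digamma(a[0::2] + a[1::2])        # psi(a1+a2), psi(a3+a4), psi(a5+a6)
    return digamma(a) - np.repeat(totals, 2)   # pairwise digamma differences

print(beta_from_alpha([2.0, 3.0, 2.0, 2.0, 4.0, 1.5]))   # illustrative values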

We can also evaluate α″ as follows. We know from property (5.89) for exponential distributions that the mean of their statistic is equal to the gradient of their log-normalization factor relative to the parameter. In other words, if we consider the exponential distribution (42.192) for qn(xn), we can write

E_{xn} S(xn) = ∇_{β^T} r(β, yn)   (a column vector)

(42.199)

where, from (42.185),

r(β, yn) = ln( Σ_{xn ∈ {0,1}^3} e^{β^T S(xn)} ),   S(xn) = T(yn, xn)     (42.200)

For illustration purposes, let us compute the derivative of r(β, yn) relative to the first entry of β, denoted by β1. This entry is multiplied by the leading entry of S(xn), which is equal to I[x1^(n) = 0]. Therefore,

∂r(β, yn)/∂β1 = ( 1 / Σ_{xn ∈ {0,1}^3} e^{β^T S(xn)} ) × Σ_{xn ∈ {0,1}^3} I[x1^(n) = 0] e^{β^T S(xn)}
  = ( 1 / e^{r(β, yn)} ) × Σ_{x2^(n), x3^(n) ∈ {0,1}} e^{β^T S(x1^(n) = 0, x2^(n), x3^(n))}     (42.201)

Repeating this calculation for all entries of β leads to the expression:


E_{xn} S(xn) = ∇_{β^T} r(β, yn)     (42.202)

  = ( 1 / e^{r(β, yn)} ) col{
      Σ_{x2^(n), x3^(n) ∈ {0,1}} e^{β^T S(x1^(n) = 0, x2^(n), x3^(n))},
      Σ_{x2^(n), x3^(n) ∈ {0,1}} e^{β^T S(x1^(n) = 1, x2^(n), x3^(n))},
      Σ_{xn ∈ {0,1}^3} ( Σ_{k=2}^{3} I[x_{k−1}^(n) = x_k^(n)] ) e^{β^T S(xn)},
      Σ_{xn ∈ {0,1}^3} ( Σ_{k=2}^{3} I[x_{k−1}^(n) ≠ x_k^(n)] ) e^{β^T S(xn)},
      Σ_{xn ∈ {0,1}^3} ( Σ_{k=1}^{3} I[x_k^(n) = y_k^(n)] ) e^{β^T S(xn)},
      Σ_{xn ∈ {0,1}^3} ( Σ_{k=1}^{3} I[x_k^(n) ≠ y_k^(n)] ) e^{β^T S(xn)} }

k=1

We thus find that in order to determine qn (xn ) in (42.192) we need to evaluate q(θ) because β depends on it. In turn, to compute q(θ) we need to determine qn (xn ) because α00 depends on it. This suggests that we need to iterate between determining both variational factors until sufficient convergence is attained (e.g., by using the ELBO measure, as explained after listing (36.201)). We therefore arrive at the following coordinateascent algorithm:

Coordinate-ascent algorithm for the HMM.
input: N observations yn = {y1^(n), y2^(n), y3^(n)}.
given: parameter vector α, statistics T(yn, xn) and S(xn); log-normalization factors c(·), r(·), as well as φ(θ) and base function h(θ).
select α″(−1) randomly with positive entries.
repeat until sufficient convergence over m = 0, 1, 2, . . .:
    q^(m−1)(θ) = h(θ) exp{ (α″(m−1))^T φ(θ) − c(α″(m−1)) }
    β^(m−1) = E_{θ^(m−1)}{ φ(θ) }   using (42.198) at α″(m−1)
    qn^(m−1)(xn) = exp{ (β^(m−1))^T S(xn) − r(β^(m−1), yn) },   n = 1, . . . , N
    α″(m) = α + Σ_{n=1}^{N} E_{xn^(m−1)}{ S(xn) }   using (42.202) at β^(m−1)
end
return α″ ← α″(m); estimate {θ̂1, θ̂2, θ̂3} using α″ and (42.169).

(42.203)
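To make the recursion concrete, a minimal Python sketch of listing (42.203) for this three-node example is shown below. It is not from the text: the function names, the observation array Y (assumed to be an N × 3 array of binary measurements), the prior vector alpha, and the fixed iteration count are illustrative assumptions; the expectation E_{xn} S(xn) is evaluated by brute-force enumeration over the eight configurations of xn ∈ {0,1}^3, and beta_from_alpha repeats the digamma helper shown after (42.198).

import itertools
import numpy as np
from scipy.special import digamma

STATES = np.array(list(itertools.product([0, 1], repeat=3)))   # the 8 configurations of x_n

def sufficient_stat(y, x):
    # T(y_n, x_n) from (42.178): six indicator counts
    return np.array([
        float(x[0] == 0),
        float(x[0] == 1),
        float((x[0] == x[1]) + (x[1] == x[2])),
        float((x[0] != x[1]) + (x[1] != x[2])),
        float(sum(int(x[k] == y[k]) for k in range(3))),
        float(sum(int(x[k] != y[k]) for k in range(3))),
    ])

def beta_from_alpha(a):
    # same computation as the helper shown after (42.198)
    a = np.asarray(a, dtype=float)
    totals = digamma(a[0::2] + a[1::2])
    return digamma(a) - np.repeat(totals, 2)

def expected_stat(y, beta):
    # E_{x_n} S(x_n) as in (42.202): average of S with weights proportional to exp(beta^T S)
    S = np.array([sufficient_stat(y, x) for x in STATES])      # 8 x 6
    w = np.exp(S @ beta)
    return (w / w.sum()) @ S

def coordinate_ascent(Y, alpha, num_iter=50):
    # Y: assumed N x 3 binary array; alpha: length-6 prior hyper-parameter vector
    alpha2 = np.random.rand(6) + 1.0                 # alpha'' initialized with positive entries
    for _ in range(num_iter):
        beta = beta_from_alpha(alpha2)                              # step based on (42.198)
        alpha2 = alpha + sum(expected_stat(y, beta) for y in Y)     # step based on (42.196)
    return alpha2

# MAP estimates then follow from (42.169), e.g.,
# theta1_hat = (alpha2[0] - 1) / (alpha2[0] + alpha2[1] - 2)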


Maximizing the ELBO directly. Alternatively, in place of using the coordinate-ascent construction (42.190a)–(42.190b), we can estimate (α″, β) by working directly with the ELBO and maximizing it over these same parameters (e.g., by using a stochastic gradient algorithm). Once α″ is estimated, we can readily extract MAP estimates for the model parameters {θ1, θ2, θ3}, as explained before. Thus, recall from (36.31) that the ELBO is defined by:

L(q) = E_q[ ln f(y1:N, x1:N, θ) ] − E_q[ ln q(x1:N, θ) ]
 (a)= E_q[ ln( f(y1:N, x1:N) f(θ | y1:N, x1:N) ) ] − E_q[ ln( q(θ) ∏_{n=1}^{N} qn(xn) ) ]
   = E_q[ ln f(y1:N, x1:N) ] + E_q[ ln f(θ | y1:N, x1:N) ] − E_q[ ln q(θ; α″) ] − Σ_{n=1}^{N} E_q[ ln qn(xn; β) ]
   = E_q[ ln h(θ) + (α′)^T φ(θ) − c(α′) ] − E_q[ ln h(θ) + (α″)^T φ(θ) − c(α″) ] + cte
   = E_q[ (α′)^T φ(θ) ] − E_q[ (α″)^T φ(θ) − c(α″) ] + cte     (42.204)

where in step (a) we use the Bayes rule and the assumed factorized variational model from (42.188). The expectations over q(·) refer to expectations over the variational factors of θ and x1:N. The constant in the last step represents terms that are independent of (α″, β). To highlight the dependence of the ELBO on (α″, β), we will write L(α″, β) instead of L(q). Thus, our objective is to solve

{ α̂″, β̂ } = argmax_{α″, β} L(α″, β)     (42.205)

To employ a stochastic gradient construction, we need to evaluate the gradients of L relative to α″ and β. For the first gradient, we rewrite the above ELBO as a function of α″ and collect all terms that are independent of α″ into a constant:

L(α″) = E_q[ (α′)^T φ(θ) ] − E_θ[ (α″)^T φ(θ) − c(α″) ] + cte
      = ( E_{x1:N} α′ )^T E_θ( φ(θ) ) − (α″)^T E_θ( φ(θ) ) + c(α″) + cte
      = (α″)^T β − (α″)^T β + c(α″) + cte
      = c(α″) + cte     (42.206)

where we used the fact that, by definition, α″ = E_{x1:N} α′ and β = E_θ φ(θ). We recall again property (5.89) for exponential distributions that the mean of their statistic is equal to the gradient of their log-normalization factor relative to the parameter. In other words, if we consider the exponential distribution (42.195) for q(θ), we can write

∇_{(α″)^T} c(α″) = E_θ{ φ(θ) } = β     (42.207)

and, hence,

∇_{(α″)^T} L(α″, β) = β     (42.208)


Let us evaluate next the derivative of L(q) relative to β. For this purpose, we return to (42.204) and use a different Bayes factorization in step (a) as follows:

L(β) = E_q[ ln f(y1:N, x1:N, θ) ] − E_q[ ln q(x1:N, θ) ]
  = E_q[ ln( f(y1:N, θ) f(x1:N | y1:N, θ) ) ] − E_q[ ln( q(θ) ∏_{n=1}^{N} qn(xn) ) ]
  = E_q[ ln f(y1:N, θ) ] + E_q[ ln f(x1:N | y1:N, θ) ] − E_q[ ln q(θ; α″) ] − Σ_{n=1}^{N} E_q[ ln qn(xn; β) ]
  = E_q[ ln f(x1:N | y1:N, θ) ] − Σ_{n=1}^{N} E_q[ ln qn(xn; β) ] + cte
  = E_q[ ln ∏_{n=1}^{N} f(xn | yn, θ) ] − Σ_{n=1}^{N} E_q[ ln qn(xn; β) ] + cte
  = Σ_{n=1}^{N} E_q[ (φ(θ))^T S(xn) − r(θ, yn) ] − Σ_{n=1}^{N} E_q[ β^T S(xn) − r(β, yn) ] + cte
  = Σ_{n=1}^{N} ( E_θ φ(θ) )^T E_{xn} S(xn) − Σ_{n=1}^{N} ( β^T E_{xn} S(xn) − r(β, yn) ) + cte     (42.209)

since the first and third terms in the third equality are independent of β. Using E θ φ(θ) = β, we get N X L(β) = r(β, yn ) + cte (42.210) n=1

We appeal again to property (5.89) for exponential distributions and consider the exponential distribution (42.192) for qn (xn ) to note that n o ∇(β)T r(β, yn ) = E xn S(xn ) (42.211) which implies N X n=1

∇(β)T r(β, yn ) =

N X

E xn S(xn )

(42.196)

n=1

=

α00 − α

(42.212)

and, hence, ∇(β)T L(α00 , β) = α00 − α

(42.213)

The exact gradient expression (42.213) is not practical, especially when N is large, since according to (42.212) we will need to compute a sum over N terms. One stochastic approximation for the gradient of L relative to β for iteration m can be computed as follows: (a) (b)

Assume we have available estimates α00(m−1) and β (m−1) from iteration m − 1. Select a random integer n ∈ [1, N ] and consider the corresponding measurement (n) (n) (n) vector yn = {y1 , y2 , y3 }.

42.7 Commentaries and Discussion

(c)

1733

Replace the sum in (42.212) by a stochastic approximation of the form n o 00 (42.214) ∇\ (β)T L(α , β) ≈ N E xn S(xn ) where the last expectation can be computed from (42.202) with β replaced by β (m−1) .

We therefore arrive at the stochastic gradient algorithm listed in (42.216), where µ(m) ≥ 0 is a decaying step-size sequence satisfying conditions similar to (12.66a): ∞ X m=0

µ2 (m) < ∞,

∞ X m=0

µ(m) = ∞

Stochastic gradient algorithm for the HMM. o n (n) (n) (n) input: N observations yn = {y1 , y2 , y3 } . given: parameter vector α, statistics T (yn , zn ) and S(xn ); log-normalization factors c(·), r(·), as well as φ(θ) and base function h(θ). select α00(−1) randomly with positive entries; evaluate β (−1) from α00(−1) using (42.198) repeat until sufficient convergence m = 0, 1, 2, . . . : α00(m) = α00(m−1) + µ(m) β (m−1) select random index n and compute E n oxn S(xn ) using (42.202) β (m) = β (m−1) + µ(m)N E xn S(xn ) end return α00 ← α00(m) estimate {θb1 , θb2 , θb3 } using α00 and (42.169).

42.7

(42.215)

(42.216)
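A corresponding Python sketch of the stochastic gradient recursion (42.216) is given below, under the same assumptions as the coordinate-ascent sketch after listing (42.203), whose helpers beta_from_alpha() and expected_stat() it reuses; the step-size schedule µ(m) = µ0/(m + 1) is only one convenient choice satisfying (42.215).

import numpy as np

def stochastic_gradient(Y, alpha, num_iter=2000, mu0=0.05):
    # reuses beta_from_alpha() and expected_stat() from the coordinate-ascent sketch;
    # Y (N x 3 binary array), alpha, num_iter, and mu0 are illustrative assumptions
    N = len(Y)
    alpha2 = np.random.rand(6) + 1.0         # alpha''(-1) with positive entries
    beta = beta_from_alpha(alpha2)           # beta(-1) evaluated via (42.198)
    for m in range(num_iter):
        mu = mu0 / (m + 1)                   # decaying step size, cf. (42.215)
        alpha2 = alpha2 + mu * beta                          # gradient step using (42.208)
        n = np.random.randint(N)                             # pick a random measurement
        beta = beta + mu * N * expected_stat(Y[n], beta)     # stochastic step using (42.214)
    return alpha2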

COMMENTARIES AND DISCUSSION Inferring graph structure. The variable elimination method for inference over Bayesian networks is due to Zhang and Poole (1996) and Dechter (1996). Learning graph structure from data was shown to be NP-complete by Chickering (1996). In Section 42.4 we presented one heuristic-search method for learning tree graph structures that is based on the work by Chow and Liu (1968); see also the discussion in Pearl (1988). Extensions, variations, and further results on this topic appear in Chow and Wagner (1973), Lam and Bacchus (1994), Meila (1999), Lee, Ganapathi, and Koller (2006), and Tan et al. (2011). The Chow–Liu algorithm builds a weighted graph where the weights on the edges are estimates for the mutual information of the nodes defining each edge. One then searches for tree structures with maximal cumulative weight. There are many well-known algorithms for finding spanning trees with maximal or minimal aggregate weight from a weighted connected graph, such as the Kruskal algorithm and the Prim algorithm – see, e.g., the texts by Aho, Hopcroft, and Ullman (1985), Kucera (1990), and Cormen et al. (2001), and the original references by Kruskal (1956) and Prim (1957). The Prim method appears to have been discovered earlier by Jarník (1930) and independently by Dijkstra (1959). For K nodes, these algorithms are able to determine a spanning tree in O(K 2 ln K) operations. Some other useful references on graph theory and structure are the texts by Diestel (2005), Mesbahi and Egerstedt (2010), and Rosen

1734

Inference over Graphs

(2011). There are of course alternative approaches to learning a graph structure – see, e.g., Cooper and Herskovits (1992), Lam and Bacchus (1994), Jordan et al. (1999), Friedman and Koller (2000), Vlaski, Ying, and Sayed (2016), Vlaski et al. (2018), Dong et al. (2019), Matta, Santos, and Sayed (2020), and the many references in these works. The `1 -regularized graph learning procedure resulting from (42.80) is from Meinshausen and Buhlmann (2006). The graphical LASSO procedure is from Banerjee, El Ghaoui, and d’Aspremont (2008), Friedman, Hastie, and Tibshirani (2008), and Mazumder and Hastie (2012). Computational complexity. Probabilistic inference over Bayesian networks belongs to the class of NP-complete problems. To clarify this statement, we provide a brief overview of complexity in computational theory. Thus, consider a problem involving K variables, such as determining the index k of a number a within an array of numbers of size K, say, array = {0, 1, 2, 5, −2, 4, 12, −1}, K = 8 (42.217) If a = 5, then its location in the array is index k = 4. This type of linear search problem, which involves searching for a key within an array of numbers, can be solved with polynomial complexity in K. That is, there exists an algorithm that requires on the order of O(K n ) computations for some finite power n. For this particular example, the complexity is O(K) with n = 1. Consider the alternative problem of searching whether there exists a subset of entries within the same array whose sum equals some given value S, say, S = 3. This is sometimes referred to as the subset sum problem. In this case, the problem cannot be solved in polynomial complexity. However, if we are given a solution such as the entries {5, −2}, then we are able to verify its validity in polynomial time (by simply adding its entries in this case). Motivated by these examples, we list four popular classes of problems of varying complexity depending on whether a problem is verifiable or solvable in polynomial time: (a) P problems: This class consists of all decision or inference problems that can be solved in polynomial complexity. That is, for any problem p ∈ P , there exists an algorithm that finds its solution in polynomial complexity. The linear search problem belongs to this class. (b) NP problems: This class consists of all decision or inference problems that are verifiable in polynomial complexity. That is, if a candidate solution is given, it is possible to verify its validity in polynomial complexity. It obviously holds that P⊆ NP, i.e., all P problems are NP. The subset sum problem belongs to this class. The designation “NP” stands for “nondeterministic polynomial time.” This is because the class of NP problems can also be equivalently defined as the set of problems for which nondeterministic algorithms (which include some form of guessing) exist for solving them in polynomial time. Interestingly, verifying whether the classes P and NP coincide is one of the outstanding open research problems in theoretical computer science. Intuitively, the open question is whether a decision problem that is “easily verifiable” is also “easily solvable.” (c) NP-complete problems: This class is a subset of NP problems. All decision or inference problems in NP-complete can therefore be verified in polynomial time and, moreover, every problem in NP can be reduced to a problem in NP-complete in polynomial time. In this way, NP-complete problems are the hardest problems in the NP class. The subset sum problem is one example. 
(d) NP-hard problems: This class consists of all decision or inference problems that are at lowest as hard to solve as NP-complete problems and they do not need to be in NP. One example is the maximum clique problem over undirected graphs, which corresponds to determining the size of the largest possible clique (i.e., the largest completely connected subgraph). This problem is NP-hard but not NPcomplete. Another NP-hard problem is finding the lowest cost route between two arbitrary nodes in an undirected weighted graph (i.e., with weights over the edges corresponding to costs).



The relationship between these classes is illustrated in Fig. 42.9. For more discussion on computational complexity, the reader may refer to Garey and Johnson (1979), Papadimitriou (1993), Cormen et al. (2001), Goldreich (2008), Arora and Barak (2009), or Sipser (2012).


Figure 42.9 Diagram representation of the classes P, NP, NP-hard, and NP-complete.

PROBLEMS

42.1 Refer to the DAG in Fig. 42.1. Use the enumeration inference method to determine the marginal P(x1 = 1) and the conditional P(x3 = 0|x5 = 1). Only variable x5 is observed and it assumes the value x5 = 1.
42.2 Refer to the DAG in Fig. 42.1. Use the variable elimination inference method to determine the marginal P(x1 = 1) and the conditional P(x3 = 0|x5 = 1). Only variable x5 is observed and it assumes the value x5 = 1.
42.3 Refer to the DAG in Fig. 42.1. Use the variable elimination inference method to determine the marginal P(x1 = 1, x2 = 0) and the conditional P(x3 = 0, x4 = 1|x5 = 1). Only variable x5 is observed and it assumes the value x5 = 1.
42.4 Establish the backward recursion (42.39) for HMMs.
42.5 Refer to the mutual information values shown in (42.77), which were computed from training measurements in the heart disease dataset. Use the listed values to justify the spanning tree structure in Fig. 42.6.
42.6 Refer again to the discussion in Example 42.3 on Markov chains and consider the part dealing with estimating the parameters {A, π}. Assume we observe N independent realizations for the variables {x1, x2, . . . , xK}. We denote each collection by {x1^(n), x2^(n), . . . , xK^(n)} by using a superscript (n). Write down the log-likelihood function for this case and use it to estimate the entries of {A, π}.
42.7 Refer to the procedure for estimating the parameters {π, A, B} of a hidden Markov chain in Example 42.4. Assume the entries of {π, A, B} are defined as follows:

π = [ ε ; 1 − ε ],   A = [ a  1 − a ; 1 − a  a ],   B = [ r  1 − r ; 1 − r  r ],   ε, a, r ∈ (0, 1)

Repeat the derivation to estimate the model parameters {ε, a, r}.


42.8 The estimates shown in the tables for the marginal and joint probabilities for the three nodes {x1 , x19 , x26 } were computed from the training dataset used to generate the directed spanning tree of Fig. 42.6. A randomly selected feature from the test data has (x1 , x19 ) = (1, 0). Can you predict its label? That is, can you predict whether the patient has heart disease or not, based on this data?

x1 | P̂(x1)          x19 | P̂(x19)          x26 | P̂(x26)
0  | 0.5             0   | 0.4083           0   | 0.5542
1  | 0.5             1   | 0.5917           1   | 0.4458

x1  x19 | P̂(x1, x19)          x19  x26 | P̂(x19, x26)
0   0   | 0.1281               0    0   | 0.1157
0   1   | 0.3719               0    1   | 0.2934
1   0   | 0.2810               1    0   | 0.4380
1   1   | 0.2190               1    1   | 0.1529

42.9 Denote the cost function that is being minimized in (42.87) by J(Θ).
(a) Show that J(Θ) is convex over Θ.
(b) Verify that the solution Θ⋆ satisfies (αK + ‖R̂x‖2)^{−1} ≤ ‖Θ⋆‖2 ≤ K/α for any α > 0. Verify further that a subgradient for the cost function can be chosen as

∂J(Θ) = −Θ^{−1} + R̂x + α sign(Θ)

(c) Is the cost J(A) in (42.89) convex over A? Find a subgradient for J(A) relative to A.
(d) Write down a coordinate-descent recursion for estimating the entries of A.
Remark. See the work by Banerjee, El Ghaoui, and d'Aspremont (2008) for a related discussion.

42.10 Refer to the partitioning (42.93). Establish the relations:

u = −β Θ11^{−1} z,   1/ν = β − w^T u,   Θ11^{−1} = W11 − u u^T/β,   ν = (1 − z^T u)/β

42.11 Refer to the discussion on the graphical LASSO algorithm. Let y = βz. Substitute the expression for u from Prob. 42.10 into (42.94) and conclude that the resulting optimality condition corresponds to the following alternative problem:

y⋆ = argmin_{y ∈ IR^{K−1}} { (1/2) ‖ c + Θ11^{−1/2} y ‖^2 + α ‖y‖1 },   c = Θ11^{T/2} r

Use this result to justify the following statement for the primal-graphical LASSO algorithm. Remark. The reader may refer to Mazumder and Hastie (2012) for a related discussion.


Primal graphical LASSO for learning a graph structure. given N vector observations {xn ∈ IRK } arising from x ∼ N(0, Rx ); given regularization parameter α > 0; N X bx = 1 bx } + αIK , Θ = W −1 . compute R xn xTn , W = diag{R N n=1 repeat multiple times until sufficient convergence: for each column k = K, K − 1, . . . , 1: bx , W, Θ) such that the kth permute rows and columns of (R column becomes last and its diagonal entry becomes corner entry let β = permuted matrices:  ρ + α. Partition     b11 r W u Θ11 R 11 bx = R , W = , Θ = uT β zT rT ρ T compute Θ−1 11 = W11 − u u/β 1/2

1/2

z ν



1/2

let A = (Θ11 )−1 where Θ11 = Θ11 (Θ11 )T let c = (A−1 )T r ( solve y

?

= argmin y∈IRK−1

2 1

c + Ay + αkyk1 2

)

update z to zb = y ? /β 1 update ν to νb = (1 − zbT u) β update W = Θ−1 using: βb = 1/(b ν − zbT Θ−1 b) 11 z  −1 −1 b Θ11 + βΘ bzbT Θ−1 11 z 11 W = −1 T bz Θ −βb 11

b −1 zb −βΘ 11 βb



end end undo the permutations return Θ.

REFERENCES

Aho, A. V., J. Hopcroft, and J. D. Ullman (1985), Data Structures and Algorithms, Addison-Wesley. Arora, S. and B. Barak (2009), Computational Complexity: A Modern Approach, Cambridge University Press. Banerjee, O., L. El Ghaoui, and A. d’Aspremont (2008), “Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data,” J. Mach. Learn. Res., vol. 9, pp. 485–516. Chickering, D. M. (1996), “Learning Bayesian networks is NP-complete,” in Learning from Data, D. Fisher and H. J. Lenz, editors, vol. 112, pp. 121–130, Springer. Chow, C. K. and C. N. Liu (1968), “Approximating discrete probability distributions with dependence trees,” IEEE Trans. Inf. Theory, vol. 14, no. 3, pp. 462–467. Chow, C. K. and T. Wagner (1973), “Consistency of an estimate of tree-dependent probability distribution,” IEEE Trans. Inf. Theory, vol. 19, no. 3, pp. 369–371.


Cooper, G. F. and E. Herskovits (1992), “A Bayesian method for the induction of probabilistic networks from data,” Mach. Learn., vol. 9, pp. 309–347. Cormen, T. H., C. E. Leiserson, R. L. Rivest, and C. Stein (2001), Introduction to Algorithms, 2nd ed., MIT Press. Dechter, R. (1996), “Bucket elimination: A unifying framework for probabilistic inference,” Proc. Conf. Uncertainty in Artificial Intelligence (UAI), pp. 211–219, Portland, OR. Diestel, R. (2005), Graph Theory, 3rd ed., Springer. Dijkstra, E. W. (1959), “A note on two problems in connection with graphs,” Numerische Mathematik, vol. 1, no. 1, pp. 269–271. Dong, X., D. Thanou, M. Rabbat, and P. Frossard (2019), “Learning graphs from data: A signal representation perspective,” IEEE Signal Process. Mag., vol. 36, pp. 44–63. Friedman, J. H., T. Hastie, and R. Tibshirani (2008), “Sparse inverse covariance estimation with the graphical LASSO,” Biostatistics, vol. 9, no. 3, pp. 432–441. Friedman, N. and D. Koller (2000), Being Bayesian about Network Structure: A Bayesian Approach to Structure Discovery in Bayesian Networks, Kluwer Academic Publishers. Garey, M. R. and D. S. Johnson (1979), Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman and Company. Goldreich, O. (2008), Computational Complexity: A Conceptual Perspective, Cambridge University Press. Jarník, V. (1930), “O jistém problému minimálním,” Práce Moravské Prírodovédecké Spolecnosti (in Czech), vol. 6, no. 4, pp. 57–63. Jordan, M. I., Z. Ghahramani, T. Jaakkola, and L. Saul (1999), “Introduction to variational methods for graphical models,” Mach. Learn., vol. 37, pp. 183–233. Kruskal, J. B. (1956), “On the shortest spanning subtree of a graph and the traveling salesman problem,” Proc. Amer. Math. Soc., vol. 7, no. 1, pp. 48–50. Kucera, L. (1990), Combinatorial Algorithms, Adam Hilger. Lam, W. and F. Bacchus (1994), “Learning Bayesian networks: An approach based on the MDL principle,” Comput. Intell., vol. 10, pp. 269–293. Lee, S., V. Ganapathi, and D. Koller (2006), “Efficient structure learning of Markov networks using `1 -regularization,” Proc. Advances Neural Information Processing Systems (NIPS), pp. 1–8, Vancouver. Matta, V., A. Santos, and A. H. Sayed (2020), “Graph learning under partial observability,” Proc. IEEE, vol. 108, no. 11, pp. 2049–2066. Mazumder, R. and T. Hastie (2012), “The graphical LASSO: New insights and alternatives,” Electron. J. Statist., vol. 6, pp. 2125–2149. Meila, M. (1999), “An accelerated Chow and Liu algorithm: Fitting tree distributions to high-dimensional sparse data,” Proc. Int. Conf. Machine Learning (ICML), pp. 249–257, Bled. Meinshausen, N. and P. Buhlmann (2006), “High-dimensional graphs and variable selection with the LASSO,” Ann. Statist., vol. 34, no. 3, pp. 1436–1462. Mesbahi, M. and M. Egerstedt (2010), Graph Theoretic Methods in Multiagent Networks, Princeton University Press. Papadimitriou, C. H. (1993), Computational Complexity, Pearson. Pearl, J. (1988), Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann. Prim, R. C. (1957), “Shortest connection networks and some generalizations,” Bell Sys. Tech. J., vol. 36, no. 6, pp. 1389–1401. Rosen, K. (2011), Discrete Mathematics and Its Applications, 7th ed., McGraw-Hill. Sipser, M. (2012), Introduction to the Theory of Computation, 3rd ed., Cengage Learning. Tan, V. V. F., A. Anandkumar, L. Tong, and A. Willsky (2011), “A large-deviation analysis of the maximum-likelihood learning of Markov tree structures,” IEEE Trans. Inf. Theory, vol. 
57, no. 3, pp. 1714–1735.


Vlaski, S., H. P. Maretic, R. Nassif, P. Frossard, and A. H. Sayed (2018), “Online graph learning from sequential data,” Proc. IEEE Data Science Workshop, pp. 190–194, Lausanne. Vlaski, S., B. Ying, and A. H. Sayed (2016), “The BRAIN strategy for online learning,” Proc. IEEE GlobalSIP Symp. Signal Processing of Big Data, pp. 1285–1289, Washington, DC. Zhang, N. L. and D. Poole (1996), “Exploiting causal independence in Bayesian network inference,” J. Artif. Intell. Res., vol. 5, pp. 301–328.

43 Undirected Graphs

The discussion in the last two chapters focused on directed graphical models or Bayesian networks, where a directed link from a variable x1 toward another variable x2 carries with it an implicit connotation of “causal effect” by x1 on x2 . In many instances, this implication need not be appropriate or can even be limiting. For example, there are cases where conditional independence relations cannot be represented by a directed graph. One such example is provided in Prob. 43.1. In this chapter, we examine another form of graphical representations where the links are not required to be directed anymore, and the probability distributions are replaced by potential functions. These are strictly positive functions defined over sets of connected nodes; they broaden the level of representation by graphical models. The potential functions carry with them a connotation of “similarity” or “affinity” among the variables, but can also be rolled back to represent probability distributions. Over undirected graphs, edges linking nodes will continue to reflect pairwise relationship between the variables but will lead to a fundamental factorization result in terms of the product of clique potential functions. We will show that these functions play a prominent role in the development of message-passing algorithms for the solution of inference problems.

43.1 CLIQUES AND POTENTIALS

Undirected graphs consist of three components: nodes, edges, and potential functions. The nodes continue to designate random variables, while the potential functions will be strictly positive functions from which we will be able to construct probability distributions if desired. While over directed graphs we associated a marginal or conditional probability distribution with each individual node, we will now associate a potential function with each clique. A clique is a collection of connected nodes defined as follows.

43.1.1 Cliques

We refer to an example to guide the discussion. Consider a collection of K = 5 random variables, denoted by {x1, x2, x3, x4, x5}, as shown in Fig. 43.1. On the far left, we show a graph with undirected edges linking the nodes. A clique is a subgraph consisting of any collection of nodes that are fully connected to each other (i.e., where every node is connected to every other node directly). In this way, every connected pair of nodes constitutes a clique. This is illustrated in the middle plot in the figure, where we are highlighting three two-node cliques consisting of the pairs {x1, x4}, {x4, x5}, and {x2, x3}. There are of course other two-node cliques in the same graph, such as the pairs {x3, x4} and {x1, x3}. We therefore say that every edge in a graph identifies a two-node clique. Singletons are also cliques, although not used frequently. For example, {x1} can be treated as a clique. There can exist larger cliques in a graph. In the rightmost plot, we are showing a clique involving the three nodes {x1, x3, x4}. Actually, this particular clique is maximal. A maximal clique is such that if we add any new node to its member nodes, it will cease to be a valid clique (i.e., not all nodes will be connected to each other anymore). For example, if we add x2 to the set {x1, x3, x4} we will not have a clique because x1 and x2 are not connected to each other.

Figure 43.1 Two representations of an undirected graphical model with potential functions associated with each clique in the middle and right plots.

Now, consider an undirected graph G with C total cliques. We denote the set of cliques by C. For example, the rightmost plot in Fig. 43.1 shows C = 3 cliques defined by:

C1 = { {x2, x3}, {x4, x5}, {x1, x3, x4} }                                            (43.1)

while the middle plot shows C = 3 cliques defined by

C2 = { {x2, x3}, {x4, x5}, {x1, x4} }                                                (43.2)

Both sets of cliques cover all nodes in the graph. This example shows that there can be many choices of clique sets even for the same graph. Each clique set C must cover all nodes in a graph and, moreover, the individual cliques must be different from each other in that one clique cannot be a subset of another; they must always differ by some node(s). For example, in the rightmost plot, the cliques {x1, x3, x4} and {x2, x3} share one node, x3, but they differ by other nodes. One main difference between the sets C1 and C2 shown above is that all cliques in C1 happen to be maximal.

43.1.2 Joint Probability Distribution

Given a clique set C for an undirected graph G, we associate a potential function φC(·) with every clique C ∈ C; the function φC(·) is also called a clique potential or factor. The potential function is strictly positive and its arguments are the nodes that constitute the clique. For example, the function that is associated with the clique {x1, x3, x4} is denoted by φC(x1, x3, x4); the subscript C refers to "clique" and the arguments are the random variables that constitute the clique. The arguments {x1, x3, x4} are also called the scope of the potential function. Likewise, the functions that are associated with the cliques {x2, x3} and {x4, x5} are denoted by φC(x2, x3) and φC(x4, x5), respectively. As future examples will reveal, the role of a clique potential is to model "affinity" or "similarity" or "compatibility" between the variables within the clique; the potential function does not need to correspond to a probability distribution or even have a probabilistic interpretation, although we will be using it to construct probability distributions. As a matter of notation, although we are using the same notation φC(·) to refer to the clique functions, these functions are generally different from each other. For example, the function φC(x2, x3), as a function of two variables, can be different from the function φC(x4, x5), which is also a function of two variables. If desired, and to be more explicit, we could add subscripts to the potential functions in order to distinguish between them, such as writing

φC,23(x2, x3),   φC,45(x4, x5),   φC,134(x1, x3, x4)                                 (43.3)

We will, however, lighten the notation and let the arguments of each function indicate the likely differences between them.

Example 43.1 (Tabular representation of clique potentials) Each clique potential function φC(·) can be represented in tabular form for discrete random nodes. For example, when the nodes are Boolean, the function φC(x2, x3) assumes four values, one for each combination (x2, x3). Hence, it can be represented by means of a 2×2 matrix, as shown in Table 43.1, as well as by means of a 2^2 × 1 vector. More generally, when each variable x assumes B discrete levels, the function φC(x2, x3) will be represented by a B × B matrix or a B^2 × 1 vector, while φC(x1, x3, x4) will be represented by a B × B × B tensor or a B^3 × 1 vector.


Table 43.1 Tabular and vector representations for the clique potential function φC(x2, x3) for Boolean variables.

  Matrix form:
              x3 = 0        x3 = 1
    x2 = 0    φC(0, 0)      φC(0, 1)
    x2 = 1    φC(1, 0)      φC(1, 1)

  Vector form:
    x2   x3   φC(x2, x3)
     0    0   φC(0, 0)
     0    1   φC(0, 1)
     1    0   φC(1, 0)
     1    1   φC(1, 1)
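The following minimal Python sketch illustrates these two representations; the numerical potential values and the array names are purely illustrative and are not taken from the text.

```python
import numpy as np

# Hypothetical clique potential phi_C(x2, x3) for Boolean variables,
# stored as a 2x2 matrix indexed by (x2, x3); the values are illustrative.
phi_23 = np.array([[1.0, 0.5],
                   [0.5, 2.0]])

# Vector (flattened) representation, rows ordered as (x2, x3) = 00, 01, 10, 11
phi_23_vec = phi_23.reshape(-1, 1)

# A three-node potential phi_C(x1, x3, x4) over Boolean variables would be a
# 2x2x2 tensor (or, equivalently, a 2**3 x 1 vector)
phi_134 = np.ones((2, 2, 2))

print(phi_23_vec.shape)   # (4, 1)
print(phi_134.size)       # 8
```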

Given a graph G and a clique set C (such as the graph shown in the rightmost plot of Fig. 43.1), we will explain in the following that the graph represents the following factorization for the joint probability mass function (pmf) of the random variables in terms of the product of potential functions:

P(x1, x2, x3, x4, x5) = (1/Z) φC(x2, x3) φC(x4, x5) φC(x1, x3, x4)                    (43.4)

Here, the symbol Z is referred to as the "partition function," and its role is to normalize the product of potential functions into a probability distribution; Z is also called the "free entropy" term in the statistical physics literature. The normalization is achieved by computing Z as follows:

Z ≜ Σ_{x1,...,x5 ∈ {0,1}} φC(x2, x3) φC(x4, x5) φC(x1, x3, x4)                        (43.5)

where we continue to assume Boolean random variables for simplicity. If the random variables could assume multiple levels, say, x ∈ X, then the sum in (43.5) would be over all the values in X and not only over {0,1}. Also, if the random variables were continuous rather than discrete, then the summation in (43.5) would be replaced by integration. In all cases, it is observed that Z = 1 when the potential functions happen to be marginal or conditional probability distributions to begin with. We continue with the Boolean case. More generally, for an undirected graph G with a clique set C with C elements, the joint pmf will be given by

P(x1, x2, . . . , xK) = (1/Z) ∏_{C∈C} φC(xC)                                          (43.6a)

Z = Σ_{x1,...,xK ∈ {0,1}} { ∏_{C∈C} φC(xC) }                                          (43.6b)

where we are using the notation xC to denote the collection of random variables that define the Cth clique in the graph. For example, xC = {x2, x3} for the clique defined by these two random variables. Any probability distribution that factorizes into the product of positive functions defined over the cliques of a graph is called a Gibbs distribution. Therefore, the joint pmf shown in (43.6a) is a Gibbs distribution.
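The following Python sketch evaluates the partition function (43.6b) and the Gibbs distribution (43.6a) by brute-force enumeration over Boolean variables. The clique indices follow the rightmost plot of Fig. 43.1, but the potential values are random and purely illustrative.

```python
import itertools
import numpy as np

# cliques of Fig. 43.1 (rightmost plot); indices are 0-based for x1..x5
cliques = [(1, 2), (3, 4), (0, 2, 3)]          # {x2,x3}, {x4,x5}, {x1,x3,x4}

rng = np.random.default_rng(0)
# one strictly positive potential table per clique (2 states per variable)
potentials = [rng.uniform(0.5, 2.0, size=(2,) * len(c)) for c in cliques]

def unnormalized(x):
    """Product of clique potentials evaluated at the configuration x."""
    p = 1.0
    for c, phi in zip(cliques, potentials):
        p *= phi[tuple(x[i] for i in c)]
    return p

K = 5
Z = sum(unnormalized(x) for x in itertools.product([0, 1], repeat=K))

def joint_pmf(x):
    # Gibbs distribution (43.6a)
    return unnormalized(x) / Z

# the probabilities over all 2^5 configurations add up to 1
total = sum(joint_pmf(x) for x in itertools.product([0, 1], repeat=K))
print(Z, total)
```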

Example 43.2 (Representing a joint pmf) Let us employ the formalism of undirected graphs to represent the joint pmf factorization:

P(x1, x2, x3, x4, x5) = P(x1) P(x2) P(x3|x2, x1) P(x4|x3, x1) P(x5|x4)                (43.7)

This is the same factorization we encountered earlier in (41.10) while studying Bayesian networks. Assume we start by drawing the directed graph shown in the leftmost plot of Fig. 43.2. As an initial step, we remove the direction of the arrows to obtain the undirected graph shown in the middle plot in the same figure with three colored cliques. Then, on a first attempt and using the terms that appear in (43.7), we could select the potential functions in the following form:

φC(x4, x5) = P(x5|x4)                                                                 (43.8a)
φC(x1, x3, x4) = P(x1) P(x4|x3, x1)                                                   (43.8b)
φC(x2, x3) =? P(x2)   (incomplete)                                                    (43.8c)

which will not lead to a factorization of the form (43.7) simply because the conditional pmf P(x3|x2, x1), which is a function of the three variables {x1, x2, x3}, does not appear in any of the potential functions (and, therefore, will not appear in their product). This particular pmf will only appear in the product if we can construct a clique that contains the variables {x1, x2, x3}. None of the cliques shown in the middle plot of Fig. 43.2 involves these three nodes together.

Figure 43.2 The original directed graph is "moralized" and transformed into the undirected graph shown on the far right with three colored cliques.

This simple example shows that the transformation from a directed graph representation for a pmf to an undirected graph representation cannot always be attained by simply removing the arrows from the directed links; this is because a node and its parents need not end up in the same clique. The graph will need to be moralized by "marrying" the parents, meaning that all parent nodes will need to have a link between them as well. The rightmost plot in Fig. 43.2 illustrates this procedure, and shows the parent nodes {x1, x2} linked together. The figure also shows three colored cliques defined by:

C = { {x1, x2, x3}, {x1, x3, x4}, {x4, x5} }                                          (43.9)

where the first clique now contains the variables {x1, x2, x3}, which was made possible by the newly added link to ensure all nodes within this clique are connected to each other, as required under the definition of a clique. Observe that the choice of cliques is often suggested by the desired factorization (43.7) that we seek for the joint pmf. Using the cliques defined by (43.9), we can now set the potential functions to the following choices:

φC(x1, x2, x3) = P(x2) P(x3|x1, x2)                                                   (43.10a)
φC(x1, x3, x4) = P(x1) P(x4|x3, x1)                                                   (43.10b)
φC(x4, x5) = P(x5|x4)                                                                 (43.10c)

with Z = 1. In this case, expression (43.6a) would lead to the same factorization (43.7).
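A minimal Python sketch of the moralization step described above: starting from the parent sets of the DAG in Fig. 43.2, it builds the undirected edge set by linking each node to its parents and marrying the parents of every node. The node names mirror the figure; the code is illustrative and not from the text.

```python
import itertools

# parent sets of the directed graph in the leftmost plot of Fig. 43.2
parents = {
    "x1": [], "x2": [],
    "x3": ["x1", "x2"],
    "x4": ["x1", "x3"],
    "x5": ["x4"],
}

edges = set()
for child, pa in parents.items():
    for p in pa:                                  # child-parent links
        edges.add(frozenset((child, p)))
    for p, q in itertools.combinations(pa, 2):    # "marry" the parents
        edges.add(frozenset((p, q)))

print(sorted(tuple(sorted(e)) for e in edges))
# the added link (x1, x2) appears because both are parents of x3
```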

43.1.3 Potential Functions

The previous example shows one way by which clique potentials can be constructed directly from marginal and conditional probability distributions. However, clique potentials do not need to correspond to probability measures and can be defined more generally as strictly positive functions that reflect some measure of correlation among their elements: nodes that are "close" or "similar" to each other in behavior or have some correlation will have "higher" potential values than other nodes. Due to the positivity assumption, clique potentials can be expressed in the form of exponentials:

φC(xC) ≜ exp{ −EC(xC) }                                                               (43.11)

for some energy function EC(xC) or, equivalently,

EC(xC) ≜ − ln{ φC(xC) }                                                               (43.12)

where EC(xC) is real-valued (i.e., it does not need to satisfy a positivity constraint, as is the case with potential functions). Energy functions continue to be functions of the same variables xC included in the Cth clique. Allowing the energy function to assume positive or negative values is useful in many applications (e.g., in statistical physics, the interactions between atoms can be through attraction or repulsion). The potential function (43.11) plays a role similar to the term that appears in the numerator of the earlier expression (3.168) for characterizing the likelihood of observing state xC in a Boltzmann distribution. Using (43.11) in (43.6a), the joint pmf of the graph will take the form of a Boltzmann distribution, namely,

P(x1, x2, . . . , xK) = (1/Z) exp{ − Σ_{C∈C} EC(xC) }                                  (43.13)


where the sum in the exponent is also called the "free energy":

E(x1, x2, . . . , xK) ≜ Σ_{C∈C} EC(xC)                                                (43.14)

Expression (43.13) is referred to as the log-linear model in the statistical literature since it implies

ln P(x1, x2, . . . , xK) = −E − ln(Z)                                                  (43.15)

which is linear in the energy function. Formulation (43.13) is a useful representation in that it links the fields of probabilistic reasoning and statistical physics; it allows one to move back and forth between the pmf and energy representations. Let us consider a few examples dealing with both discrete and continuous random variables.

Example 43.3 (Gaussian Markov field) Consider a vector-valued Gaussian random variable x ∈ IR^K with individual entries {x1, x2, . . . , xK}. Assume x has zero mean and covariance matrix Rx = E xx^T > 0. Then, the probability density function (pdf) of x is given by

fx(x) ∝ exp{ −(1/2) x^T Rx^{-1} x }                                                    (43.16)

The free energy function in this case is the quadratic

E(x) = (1/2) x^T Rx^{-1} x                                                             (43.17)

Let Θ = Rx^{-1}. The matrix Θ is K × K symmetric with individual entries Θ = [θkℓ]; it is known as the precision matrix. Using Θ in the pdf expression allows us to write

fx(x) ∝ exp{ −(1/2) ( Σ_{k=1}^{K} θkk xk^2 + Σ_{k<ℓ} (θkℓ + θℓk) xk xℓ ) }
       = exp{ −(1/2) Σ_{k=1}^{K} θkk xk^2 − Σ_{k<ℓ} θkℓ xk xℓ },   since θkℓ = θℓk     (43.18)
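As a quick numerical check of (43.16)–(43.18), the following Python sketch (with illustrative numbers, not from the text) evaluates the quadratic free energy and verifies that exp{−E(x)} is proportional to the Gaussian pdf; the entry θkℓ of the precision matrix is the coefficient multiplying the pairwise term xk xℓ in (43.18).

```python
import numpy as np

# Illustrative covariance matrix for a zero-mean Gaussian vector of size K = 3
Rx = np.array([[2.0, 0.8, 0.0],
               [0.8, 1.5, 0.3],
               [0.0, 0.3, 1.0]])
Theta = np.linalg.inv(Rx)            # precision matrix

def energy(x):
    # E(x) = (1/2) x^T Theta x, as in (43.17)
    return 0.5 * x @ Theta @ x

def gaussian_pdf(x):
    K = len(x)
    norm = np.sqrt((2 * np.pi) ** K * np.linalg.det(Rx))
    return np.exp(-energy(x)) / norm

# exp(-E(x)) is proportional to the pdf: the ratio is the same constant for any x
for x in (np.array([0.4, -1.0, 0.2]), np.array([-1.3, 0.5, 2.0])):
    print(np.exp(-energy(x)) / gaussian_pdf(x))

print(Theta)   # theta_{k,l} multiplies the pairwise term x_k x_l in (43.18)
```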

Example 43.4 (Image denoising) Consider next the problem of recovering an image of K binary pixels, xk ∈ {−1, +1}, from noisy pixel measurements yk; the pixels and their measurements are linked as shown in Fig. 43.4. With every pair of neighboring pixels (xk, xℓ) we associate the energy function

E(xk, xℓ) = −β xk xℓ,   β > 0                                                          (43.25)

In this way, the energy is negative when the neighboring pixels are at the same state, and positive otherwise. In other words, negative energy values are assigned to similar neighbors (which will translate into a higher likelihood of occurrence in the corresponding clique potential) and positive energy values to dissimilar neighbors (which will translate into lower likelihood values). Likewise, we associate an energy function with the interaction between a pixel and its measurement in the following form:

E(xk, yk) = −α xk yk,   α > 0                                                          (43.26)

Here again the energy will be negative when xk and yk have the same sign (leading to higher likelihood) and the energy will be positive when xk and yk have different signs.


As a result, the overall free energy function that is associated with the graph shown in Fig. 43.4 is

E({xk}, {yk}) = − Σ_{k=1}^{K} α xk yk − Σ_{k=1}^{K} Σ_{ℓ∈Nk} β xk xℓ                    (43.27)

where the last sum is only over the neighbors of node xk. This choice of the free energy function is a special case of the so-called Ising model (described further in the comments at the end of the chapter). It is a popular graphical model used in computer vision applications over 2D graphs. The corresponding joint pmf is given by

P({xk}, {yk}) = (1/Z) exp{ −E({xk}, {yk}) }
             = (1/Z) ( ∏_{k=1}^{K} exp{α xk yk} ) ( ∏_{k=1}^{K} ∏_{ℓ∈Nk} exp{β xk xℓ} )   (43.28)
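The following Python sketch evaluates the Ising-type free energy (43.27) for a small grid of ±1 pixels; the grid size, the parameter values (α, β), and the noise level are all illustrative and not from the text.

```python
import numpy as np

alpha, beta = 1.0, 0.5

def neighbors(i, j, shape):
    # 4-connected grid neighbors
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        if 0 <= i + di < shape[0] and 0 <= j + dj < shape[1]:
            yield i + di, j + dj

def free_energy(x, y):
    # E({x_k},{y_k}) in (43.27); the neighbor sum counts each pair twice,
    # exactly as the double sum in the formula does
    data_term = -alpha * np.sum(x * y)
    smooth_term = 0.0
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            for ni, nj in neighbors(i, j, x.shape):
                smooth_term -= beta * x[i, j] * x[ni, nj]
    return data_term + smooth_term

rng = np.random.default_rng(1)
x_true = np.ones((4, 4))                                   # clean +1 image
y = np.where(rng.random((4, 4)) < 0.9, x_true, -x_true)    # noisy measurements

# a configuration that agrees with its neighbors and with y has lower energy
print(free_energy(x_true, y), free_energy(-x_true, y))
```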

Example 43.5 (Pairwise Markov network) This terminology refers to undirected graphs where the individual cliques are pairs of connected random variables C = {xk, xℓ}. A potential function φC(xk, xℓ) is associated with every such clique (i.e., a potential function is associated with every edge in the graph). For generality, we can also associate potential functions with the individual nodes in the graph, say, {φC(xk)}. In this way, the joint pmf becomes

P(x1, x2, . . . , xK) = (1/Z) ( ∏_{k=1}^{K} φC(xk) ) ( ∏_{k=1}^{K} ∏_{ℓ∈Nk, k<ℓ} φC(xk, xℓ) )    (43.29)

Example 43.6 (Restricted Boltzmann machine) A restricted Boltzmann machine (RBM) is a pairwise Markov network with two layers of variables: a visible vector h ∈ IR^M and a hidden (latent) vector z with Boolean entries, with edges linking only variables from different layers. The model is parameterized by Θ = {W, θ, θr}, where W is a weight matrix and {θ, θr} are vectors associated with the hidden and visible variables, respectively. From the corresponding energy model, the conditional probabilities of each entry zq given h = h satisfy P(zq = 1|h = h) = αq e^{Wq,: h − θq} and P(zq = 0|h = h) = αq, for some αq > 0. Using the fact that these probabilities should add up to 1 we get:

αq = 1 / (1 + e^{−θq + Wq,: h})                                                         (43.42)


and arrive at the claimed logistic distribution:

P(zq = 1|h = h) = 1 / (1 + e^{−(Wq,: h − θq)})                                          (43.43a)
P(zq = 0|h = h) = e^{−(Wq,: h − θq)} / (1 + e^{−(Wq,: h − θq)})                          (43.43b)

In a similar manner, we can verify that (see Prob. 43.26):

P(hm = 1|z = z) = 1 / (1 + e^{−((W^T)m,: z − θr,m)})                                    (43.44a)
P(hm = 0|z = z) = e^{−((W^T)m,: z − θr,m)} / (1 + e^{−((W^T)m,: z − θr,m)})              (43.44b)

in terms of the individual entries of θr. In summary, RBMs allow us to recover the distribution of the latent components via (43.43a)–(43.43b) given the observation h. We still need to explain how to determine Θ (i.e., how to learn the parameters of the distributions that link the variables in the undirected graph). Thus, assume that we are able to observe N independent realizations for the vectors h ∈ IR^M, denoted by {h1, h2, . . . , hN}; we now use the subscript notation hn to refer to the realizations of h; each hn is a vector of size M × 1. Then, the likelihood of observing this set of realizations is given by the product:

P(h1 = h1, . . . , hN = hN; Θ) = (1/Z^N(Θ)) ∏_{n=1}^{N} e^{−E(hn; Θ)}                    (43.45)

where E(hn; Θ) is given by (43.37). We can then formulate the problem of estimating Θ by maximizing the log-likelihood function, which is equivalent to solving

Θ⋆ = argmin_Θ { ln(Z(Θ)) + (1/N) Σ_{n=1}^{N} E(hn; Θ) }                                  (43.46)

One could consider employing a stochastic gradient iteration to solve this problem. However, this is a challenging task since we only have access to realizations for h; there are no realizations for the latent variable z. We will devise a recursive method for solving (43.46) in Section 66.3, known as contrastive divergence. The method includes a mechanism for generating realizations z′ for the hidden component z using model (43.43a)–(43.43b).
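The logistic conditionals (43.43a) and (43.44a) are easy to evaluate in code. The following Python sketch uses random illustrative parameters (the sizes M, Q and all values are assumptions, not taken from the text):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

M, Q = 6, 3                            # visible and hidden dimensions (illustrative)
rng = np.random.default_rng(2)
W = rng.normal(size=(Q, M))            # weight matrix
theta = rng.normal(size=Q)             # parameters for the hidden units
theta_r = rng.normal(size=M)           # parameters for the visible units

h = rng.integers(0, 2, size=M)         # an observed visible vector

# P(z_q = 1 | h) for all q at once, as in (43.43a)
p_z_given_h = sigmoid(W @ h - theta)

# sample a hidden vector and evaluate P(h_m = 1 | z) as in (43.44a)
z = (rng.random(Q) < p_z_given_h).astype(int)
p_h_given_z = sigmoid(W.T @ z - theta_r)

print(p_z_given_h, p_h_given_z)
```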

43.2 REPRESENTATION THEOREM

We examine next the conditional independence relations that are induced by undirected graphs, and characterize their representation power by identifying the class of joint probability distributions that are represented by these graphs. We start with the conditional independence relations.

43.2.1 Conditional Independence

Over directed graphs in Chapter 41 there was the notion of directed edges alongside parents and children. These concepts led to the local conditional independence characterization (41.23), repeated here for ease of reference:

x ⊥⊥ y | pa(x),  for all y ∉ descendant(x)          (over directed graphs)              (43.47)

This characterization needs to be adjusted for undirected graphs, especially since the notions of parents and children are not applicable anymore. For every undirected graph G, we will instead introduce the concept of the neighborhood of a node x, which will be denoted by Nx. This set consists of all nodes that are connected to x by an edge:

Nx ≜ { nodes y ∈ G such that x and y are linked }                                        (43.48)

For example, for the undirected graph shown in the rightmost plot of Fig. 43.2, the neighborhoods of the various nodes are given by the following expressions (where we are using the index of the node as a subscript):

N1 = {x2, x3, x4}                                                                        (43.49a)
N2 = {x1, x3}                                                                            (43.49b)
N3 = {x1, x2, x4}                                                                        (43.49c)
N4 = {x1, x3, x5}                                                                        (43.49d)
N5 = {x4}                                                                                (43.49e)

The Markov blanket of a node is now defined as its neighborhood (i.e., the set of nodes connected to it); it continues to be denoted by mb(x):

mb(x) = Nx,   for undirected graphs                                                      (43.50)

We can now state the analogue of (43.47) for undirected graphs:

x ⊥⊥ y | Nx,  for all y ∉ Nx          (over undirected graphs)                            (43.51)

In other words, given the neighborhood of a node x, this node will be conditionally independent of the remaining nodes in the graph. This is referred to as a local Markov property for the undirected graph. For instance, if we refer to the rightmost plot of Fig. 43.2:

x1 ⊥⊥ x5 | N1  ⟺  x1 ⊥⊥ x5 | {x2, x3, x4}                                                (43.52a)
x2 ⊥⊥ {x4, x5} | N2  ⟺  x2 ⊥⊥ {x4, x5} | {x1, x3}                                        (43.52b)
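A minimal Python sketch of this construction: it computes the neighborhoods (Markov blankets) of the moralized graph in the rightmost plot of Fig. 43.2 and lists the local independence relations implied by (43.51). The edge list mirrors that figure.

```python
edges = [("x1", "x2"), ("x1", "x3"), ("x2", "x3"),
         ("x1", "x4"), ("x3", "x4"), ("x4", "x5")]

nodes = sorted({v for e in edges for v in e})
nbrs = {v: set() for v in nodes}
for a, b in edges:
    nbrs[a].add(b)
    nbrs[b].add(a)

# local Markov property (43.51): x is independent of every non-neighbor given N_x
for x in nodes:
    others = [y for y in nodes if y != x and y not in nbrs[x]]
    if others:
        print(f"{x} _||_ {others} | {sorted(nbrs[x])}")
# e.g., x1 _||_ ['x5'] | ['x2', 'x3', 'x4'], which is relation (43.52a)
```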

We can restate the conditional independence property (43.51) by noting that the conditional pmf of any node given the remaining nodes in the graph is only dependent on the nodes in its neighborhood. Specifically, let x denote an arbitrary node with neighborhood Nx. Let also Nx^c denote the remaining nodes in the graph (i.e., the complementary set):

Nx^c = G \ {x, Nx}                                                                       (43.53)

Note that Nx^c excludes x and its neighbors. Then, in view of (43.51), we can equivalently write

P(x | Nx, Nx^c) = P(x | Nx)          (over undirected graphs)                             (43.54)

This property is analogous to (41.34) for directed graphs. It highlights the fact that nodes that are separated by evidence are conditionally independent – see also property (43.58). It is because of this property that undirected graphs are also called Markov random fields (MRFs) or Markov networks in statistics. We are now ready to define undirected graphs more formally.

Definition 43.1. (Undirected graph) An undirected graph G consists of a list of random variables, say, {x1, x2, . . . , xK}, with edges linking the variables and such that for any nodes {xk, y} ∈ G it holds that

xk ⊥⊥ y | Nk,  for all y ∉ Nk                                                            (43.55)

Under these conditions, if C denotes a collection of distinct cliques whose union covers all nodes in the graph, then it will hold, as shown in Theorem 43.1, that the joint pmf of the graph variables can be factored in the form:

P(x1, . . . , xK) = (1/Z) ∏_{C∈C} φC(xC)                                                 (43.56a)

Z = Σ_{x1,...,xK ∈ {0,1}} { ∏_{C∈C} φC(xC) }                                             (43.56b)

where xC denotes the collection of random variables in the Cth clique, and φC(xC) the corresponding potential function.

As was the case with directed graphs, we will refer to the conditional independence relations that follow directly from applying (43.55) as the local independence map of the graph, also written as I-map or IL(G). This map will include all independence relations of the form:

IL(G) = { (x, y) such that x ⊥⊥ y | Nx, for all y ∉ Nx }                                 (43.57)

There are other conditional independence results that are embedded in the graph and which can be read directly by inspection. For example, refer to Fig. 43.6. Consider three disjoint collections of random variables denoted by the letters {X, Y, Z}. The following definition extends the notion of d-separation from directed graphs to undirected graphs. We say that the sets X and Y are separated given Z if every trajectory linking any variable in X to any variable in Y needs to pass through the nodes in Z. When this is the case, it will hold that

X ⊥⊥ Y | Z          (over undirected graphs)                                             (43.58)

In this case, we say that Z blocks the paths between X and Y. We will refer to these properties as global Markov conditions and they define the global independence map. This map will contain all independence relations of the form:

IG(G) = { (x, y) such that (x, y) are separated given z }                                (43.59)

x1

x2

n o Z = z1, z2, z3

z1

z2

y1

x3

y2

y3

x4

z3

y4

y5

z4 x5

n o X = x2 , x3 , x4 , x5

y6

n o Y = y1 , y3 , y4 , y6

Figure 43.6 The random variables in X are conditionally independent of the random

variables in Y given Z since all trajectories linking X to Y should pass through nodes in Z.

43.2.2 Fundamental Factorization Result

In a manner similar to Theorem 41.1 for Bayesian networks, we can establish that if we require the nodes of an undirected graph to satisfy the conditional independence relations (43.51), then their joint pmf will need to be of the form (43.56a) for some clique set C. The converse statement is also true: If the joint pmf of a collection of random variables factorizes according to (43.56a), then the variables should satisfy the independence relations (43.51). In other words, an undirected graph with local Markovian properties is equivalent to the joint pmf factorizing as a product of clique potentials. The following factorization theorem for undirected graphs is a fundamental result in statistics and is known as the Hammersley–Clifford theorem. It characterizes the type of distributions that are encoded by an undirected graph: These are all the distributions that incorporate the independence relations (43.51). The proof appears in Appendix 43.A.

Theorem 43.1. (Factorization theorem for undirected graphs) Consider an undirected graph G with K nodes denoted by {x1, x2, . . . , xK}, and some associated clique selection C. The graph G is an I-map for a joint distribution P defined over these variables if, and only if, the distribution factorizes in the form:

IL(G) ⊆ I(P)  ⟺  P(x1, x2, . . . , xK) = (1/Z) ∏_{C∈C} φC(xC)                            (43.60)

Using the result of the theorem, we are now able to establish the equivalence between the local and global independence relations. The proof of the following corollary is given in Appendix 43.B.

Corollary 43.1. (Equivalence of local and global Markov properties) The local and global Markov properties (43.55) and (43.58) are equivalent to each other.

43.3 FACTOR GRAPHS

We are now in a position to derive powerful procedures for inference over both directed and undirected graphs. The procedures will be based on passing messages between nodes and they will provide exact answers for graph structures that can be classified as trees. The definition of a tree varies slightly depending on whether the graph is directed or not:

(a) For undirected graphs, a tree is a structure where there is only a single path between any arbitrary pair of nodes – see the leftmost plot in Fig. 43.7. Thus, if x and y are two nodes, then we can only find a single sequence of links forming a trajectory from x to y. In this way, there can be no cycles in the tree (i.e., no sequence of links starting from one node and returning to it), and if a single link is removed, then the graph will become disconnected. It follows that a tree is necessarily an undirected acyclic graph.

(b) For directed graphs, a tree is a structure that has a single root node (namely, a node without any parents), and every other node has a single parent – see the rightmost plot in Fig. 43.7. Again, if x and y are two nodes, then we can only find a single sequence of directed links forming a trajectory from x to y. The tree can still have many leaves (i.e., end nodes) but there can only be a single root node.

(c) The term polytree is used to describe directed acyclic graphs (DAGs) whereby if we transform all directed links into undirected links, then the resulting undirected graph is a tree (i.e., it is connected and does not have cycles) – see the two middle plots in Fig. 43.7. The main difference between a polytree and a directed tree is that some nodes may have more than one parent. There can also be more than one root node.

Figure 43.7 The last three graphs are directed while the leftmost graph is undirected. The first and last graphs are trees. The second graph from the left is a polytree, while the third graph from the left is neither a tree nor a polytree (its undirected version has a cycle).

43.3.1 Factor Representation

We already know how to transform a DAG into an undirected graph such that the joint pmf distribution that is encoded by the DAG is also encoded by the undirected graph. This is attained through the process of "moralization," as was illustrated in Fig. 43.2. Specifically, a link is added between any two unconnected parents in the directed graph. We will now move a step further and explain how undirected graphs can be transformed into factor graphs; the main motivation for this transformation is that factor graphs will facilitate the derivation of inference algorithms and will enable a framework that applies uniformly to both directed and undirected graphs. Once this is done, we will end up with a scheme that involves transformations of the form:

directed acyclic graph −→ undirected graph −→ factor graph

along with an effective procedure for inference over factor graphs.

Let us start with an undirected graph since we already know how to transform a DAG into an undirected representation through moralization. The undirected graph consists of a collection of nodes, say, K of them, along with a clique set C that covers its nodes. A potential function φC(xC) is associated with each clique C ∈ C; it is a function of the nodes present in xC. Recall that the potential function is also called a factor. We transform the undirected graph into a factor graph as follows. For every clique C ∈ C, we perform the following operations (see also the code sketch following Fig. 43.8):

(a) We add a new node (called a factor node), represented by a square symbol. The original nodes in the graph are represented by circular symbols.
(b) We remove all edges linking the nodes in the clique. We replace these edges by links from the newly added factor node to each of the nodes in the clique.
(c) We associate the clique potential φC(xC) with the factor node.

These steps are illustrated in Fig. 43.8, where the undirected graph with five cliques shown on the left is transformed into the factor graph with five factors shown on the right.

Figure 43.8 The undirected graph with five cliques on the left is transformed into the factor graph with five factors on the right.
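A minimal Python sketch of steps (a)–(c): it builds the bipartite factor-graph structure from a clique set. The cliques follow the left plot of Fig. 43.8; potential values are omitted since only the structure is being illustrated.

```python
cliques = [("x1", "x2", "x3"), ("x3", "x6"), ("x3", "x4"),
           ("x4", "x5"), ("x4", "x7")]

factor_nodes = [f"f{i+1}" for i in range(len(cliques))]    # square nodes
regular_nodes = sorted({v for c in cliques for v in c})     # circular nodes

# (b) edges exist only between a factor node and the regular nodes in its clique
edges = [(f, v) for f, c in zip(factor_nodes, cliques) for v in c]

# (c) each factor node carries the clique potential phi_C(x_C); here we simply
# record which variables each factor depends on (its scope)
scope = dict(zip(factor_nodes, cliques))

print(regular_nodes)
print(edges)
print(scope["f1"])   # ('x1', 'x2', 'x3')
```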

Factor graphs consist of two types of nodes: regular nodes, representing the underlying random variables, and which are represented by circular symbols, and factor nodes, represented by square symbols. All edges are undirected and links exist only between factor and regular nodes. The factor nodes are not connected to other factor nodes, and the same for regular nodes. The joint pmf that is represented by the factor graph is still given by the product of factor terms:

P(x1, x2, . . . , xK) = (1/Z) ∏_{C∈C} fC(xC)                                             (43.61)

where we are writing fC(·) to denote the potential function that is associated with the Cth factor. In the example of Fig. 43.8, the factors agree with the original clique functions, e.g.,

fC(x1, x2, x3) = φC(x1, x2, x3)                                                          (43.62)

and there are as many factors as there are cliques in the original graph. Factor graphs constructed in this manner have useful properties:

(a) Structurally, factor graphs are bipartite graphs, meaning that they consist of two types of nodes, and links only exist between nodes of different types.

(b) More importantly, starting from a tree or a polytree, factor graphs constructed in the manner described above will have a tree structure with only a single path between any two nodes in the graph – see Prob. 43.13. Here, the designation "node" refers to any type of node (regular or factor node). If we refer to the undirected graph on the left in Fig. 43.8, we note that it has a cycle linking {x1, x2, x3}. This cycle disappears from the factor graph shown on the right side of the same figure because all links between regular nodes are disconnected, and only links to factor nodes appear.

Example 43.7 (Factor graph with cycles) We can sometimes end up with factor graphs that contain cycles in them. If we start from the third plot from the left in Fig. 43.7, which is neither a tree nor a polytree, its undirected graph will involve two cycles due to the need to moralize the parents appearing at the top of the graph. When this undirected graph is transformed into a factor graph, the latter will include a cycle and will not be a tree. This construction is illustrated in Fig. 43.9, which shows that the resulting factor graph is not a tree and has cycles in it. In our treatment in this chapter, we will be dealing primarily with tree factor graphs, which are frequent in applications, and will derive powerful message-passing algorithms for solving inference problems over such structures. In the comments at the end of the chapter, we explain how the message-passing algorithms can be extended to arbitrary factor graphs (with and without cycles) by transforming these graphs into junction tree graphs and by transforming the message-passing procedures into the junction tree algorithm.

Example 43.8 (Factor graphs are not unique) Factor models are not unique and this degree of freedom can be exploited to add more flexibility into the representation of joint distributions. This is because we can incorporate additional factors without changing the form of the joint pmf for the resulting factor graph. This is illustrated in Fig. 43.10, where two factor graphs are shown for the same undirected graph. The representation on the right employs two factors, denoted by fa(x1, x2, x3) and fb(x1, x2), for the three-node clique {x1, x2, x3}. As long as these factors are chosen such that their product matches the original clique potential, i.e.,

fa(x1, x2, x3) fb(x1, x2) = φC(x1, x2, x3)                                               (43.63)

Figure 43.9 The directed graph on the left is neither a tree nor a polytree. It is first moralized and then transformed into a factor graph.

then the overall joint pmf for the rightmost factor graph will continue to match the joint pmf for the original undirected graph. Observe, however, that the resulting factor graph in this case is not a tree anymore (it induces a cycle among the variables {x1 , x2 } along with the factor nodes {fa , fb }).

Figure 43.10 Factor graphs are not unique. The figure shows two factor graphs corresponding to the same undirected graph from Fig. 43.8.

43.4 MESSAGE-PASSING ALGORITHMS

Now that we know how to transform directed and undirected graphs into factor graphs with tree structure, we move on to solve inference problems over these latter graphs. In this section, we derive two types of algorithms: one for determining likelihoods or beliefs for some nodes given evidence, and another for the most probable hypothesis given evidence. We addressed these questions by means of enumeration and elimination methods for Bayesian networks in Chapter 42. Here, we will derive more effective recursive algorithms that apply to both directed and undirected graphs. We will solve the first type of inference by means of a sum-product algorithm and the second type of inference by means of a max-sum algorithm.

43.4.1 Sum-Product Algorithm

Consider a factor graph with a total of K regular nodes and F factors (i.e., F square nodes and K circular nodes). Each factor a has a set of neighbors denoted by Na; these consist of the regular (circular) nodes that are connected to a. We will denote the variables at these regular nodes by xA and the factor that is associated with a by fa(xA). We use the capital letter A as a subscript in xA because xA will generally consist of a subset of random variables linked to factor a. The nodes that appear in xA are the neighbors of factor a:

xA = Na = {set of regular nodes connected to factor a}                                   (43.64)

It follows that the joint pmf that is represented by the factor graph is given by

P(x1, x2, . . . , xK) = (1/Z) ∏_{a=1}^{F} fa(xA) ≜ ∏_{a=0}^{F} fa(xA)                     (43.65)

where we added a fictitious factor f0 defined over an empty set of nodes and such that f0 = 1/Z. For simplicity of notation, we let the symbol X denote the collection of regular nodes in the graph:

X = {x1, x2, . . . , xK}                                                                 (43.66)

Our initial objective is to devise a scheme that allows us to compute the marginal distribution for any of the regular nodes, say, node xk. We know that P(xk) can be obtained by marginalizing the joint pmf over all other random variables excluding xk:

P(xk) = Σ_{X\{xk}} P(x1, x2, . . . , xK)                                                 (43.67a)
      = Σ_{X\{xk}} ∏_{a=0}^{F} fa(xA)                                                    (43.67b)

where the second line follows from (43.65).


The second equality shows that P(xk) has the form of a sum of product terms. The sum-product algorithm derived in this section provides an effective procedure for evaluating quantities of this type (sums of products).

Remark 43.1. (Ignoring the partition function Z) If desired, we can ignore the partition function Z altogether and replace (43.65) by an unnormalized pmf, say,

Pu(x1, x2, . . . , xK) = ∏_{a=1}^{F} fa(xA)                                              (43.68)

We can also replace the marginal (43.67a) by the unnormalized version:

Pu(xk) = Σ_{X\{xk}} Pu(x1, x2, . . . , xK)                                               (43.69)

We then apply the same message-passing algorithm derived in this section to the unnormalized pmf Pu(x1, x2, . . . , xK) and determine any desired unnormalized marginal Pu(xk). If we normalize the unnormalized marginal, then we will recover the value of Z from it:

Z = Σ_{xk} Pu(xk)                                                                        (43.70)



Subtrees

In order to explain the simplifications that result from the graph structure, let us refer to the factor graph shown in Fig. 43.11 and use it to guide our derivations. The figure shows a factor graph with K = 7 regular nodes and F = 5 factor nodes. Regular node x3 is highlighted and it will play the role of the generic node xk whose marginal distribution we wish to evaluate. In the figure, node x3 is connected to three factor nodes {f1, f2, f3}. For this reason, we will first express the joint pmf of the graph as the product of three terms constructed as follows. We know that the joint pmf is given by (where f0 = 1/Z):

P(x1, x2, . . . , x7) = f0 f1(x1, x2, x3) f2(x3, x6) f3(x3, x4) f4(x4, x7) f5(x4, x5)     (43.71)

The desired node x3 can only be reached through its neighbors {f1, f2, f3}. All factors involving random variables that reach x3 through node f1 will be collected into a single term denoted by λ1; its arguments will involve all these random variables along with x3. For the case under consideration we have (we will include f0 into this term; it could have been included into any of the other terms):

λ1(x1, x2, x3) ≜ f0 f1(x1, x2, x3)                                                       (43.72)

Figure 43.11 A factor graph with K = 7 regular nodes and F = 5 factor nodes. Regular node x3 is highlighted.

Likewise, all factors involving random variables that reach x3 through node f2 will be collected into a second single term denoted by λ2; its arguments will involve all these random variables along with x3:

λ2(x3, x6) ≜ f2(x3, x6)                                                                  (43.73)

Finally, all factors involving random variables that reach x3 through node f3 will be collected into a third single term denoted by λ3; its arguments will involve all these random variables along with x3:

λ3(x3, x4, x5, x7) ≜ f3(x3, x4) f4(x4, x7) f5(x4, x5)                                    (43.74)

Observe that all three functions {λ1, λ2, λ3} involve x3 as an argument. The joint pmf (43.71) can then be rewritten in the equivalent form

P(x1, x2, . . . , x7) = λ1(x1, x2, x3) × λ2(x3, x6) × λ3(x3, x4, x5, x7)                  (43.75)

Observe further that for each λ-function, besides x3, the random variables that are included as arguments are those that appear in the subtree that reaches x3 through the corresponding factor. For example, the subtree that is connected to factor f3 involves the random variables {x4, x5, x7}; these are the arguments that appear in λ3, in addition to x3. We will denote these random variables from the subtree associated with f3 by X3:

Xa ≜ {collection of random variables reaching x3 through factor fa}                      (43.76)

Likewise, consider factor f2. The subtree that is connected to it involves only the random variable x6; this is the argument that appears in λ2 besides x3, and similarly for λ1. Motivated by this example, we can write the following expression for the joint pmf of a factor graph and a generic node xk:

P(x1, x2, . . . , xK) = ∏_{a∈Nk} λa(xk, Xa)                                               (43.77)

where the product is over the factors fa that are neighbors to xk (i.e., connected to xk). Each of the factors fa involves a subtree with a collection of random variables denoted by Xa, and contributes a term λa(xk, Xa) to (43.77) whose arguments are xk and Xa.

Step I: (Construction of λ-terms from subtrees). In summary, we started from a product of factors of the form (43.65) for the joint pmf and selected a reference node xk . We then transformed the same pmf expression into the product of a smaller number of λ-terms in (43.77): one term for each neighboring factor of xk . We will apply this construction repeatedly. Each time we encounter a product of factors, we can select a reference node, identify its neighboring factors, and rewrite the product in terms of λ-terms defined by subtrees under these factors.

Messages

Returning to the computation of the marginal distribution in (43.67a) and using the representation (43.77) we have

P(xk) = Σ_{X\{xk}} { ∏_{a∈Nk} λa(xk, Xa) }
      = ∏_{a∈Nk} { Σ_{Xa} λa(xk, Xa) }
      = ∏_{a∈Nk} mfa→xk(xk)                                                               (43.78)

where, in the second equality, we switched the order of the summation and product operations because they act on different variables and, moreover, each set of variables Xa appears within a single λ-factor due to the tree structure. In the last line we introduced the message functions moving from the neighboring factor nodes fa to xk:

mfa→xk(xk) ≜ Σ_{Xa} λa(xk, Xa)                                                            (43.79)

Expression (43.78) reveals an interesting structure. It shows that the marginal of xk can be expressed as the product of |Nk | terms, called messages and denoted by mfa →xk . One message is associated with each factor a connected to xk . Each message is a function of xk and it is obtained by “marginalizing” the corresponding λa over the X a (i.e., over the variables in the corresponding subtree), as shown by (43.79).

Step II: (Computing messages through marginalization). For an arbitrary node xk , we determine its λ-terms under Step I; each λa term is associated with a neighboring factor fa and it is simply the product of all factors appearing in the subtree reaching xk through fa . Each neighboring factor fa contributes a message to xk obtained by “marginalizing” its λa -term over all variables in the corresponding subtree.

Let us illustrate this second step by returning to the example from Fig. 43.11. In this case, we have three messages arriving at node x3 through its three neighboring factors {f1, f2, f3}. The first two messages are given by

mf1→x3(x3) ≜ Σ_{X1} λ1(x1, x2, x3),   X1 = {x1, x2}
           = Σ_{X1} f0 f1(x1, x2, x3)
           = Σ_{x1,x2∈{0,1}} f0 f1(x1, x2, x3)      (assuming Boolean variables)
           = f0 × { f1(0, 0, x3) + f1(0, 1, x3) + f1(1, 0, x3) + f1(1, 1, x3) }            (43.80a)

and

mf2→x3(x3) ≜ Σ_{X2} λ2(x3, x6),   X2 = {x6}
           = Σ_{x6∈{0,1}} f2(x3, x6)
           = f2(x3, 0) + f2(x3, 1)                                                        (43.80b)

Observe that each of these messages involves a single factor: f1(·) for the message from f1 to x3 and f2(·) for the message from f2 to x3. This is because the subtrees connected to x3 through f1 and f2 do not include any additional factor nodes besides f1 and f2. The situation is different for the subtree connected to x3 through f3; this subtree involves two other factor nodes f4 and f5:

mf3→x3(x3) = Σ_{X3} λ3(x3, x4, x5, x7),   X3 = {x4, x5, x7}
           = Σ_{x4,x5,x7∈{0,1}} f3(x3, x4) f4(x4, x7) f5(x4, x5)                           (43.80c)

When messages involve multiple factors, as revealed by the above expression, we can repeat the same construction from Steps I and II to simplify their calculation. Observe, for instance, that the above expression for mf3→x3(x3) involves a sum of a product of factors, similar to the original problem (43.67b). The product of factors can be handled by Step I and the "marginalization" by Step II. We repeat these steps as follows. We move slowly to illustrate the construction and later describe the solution more generally. Consider the λ3 term in (43.80c):

λ3(x3, x4, x5, x7) = f3(x3, x4) f4(x4, x7) f5(x4, x5)                                      (43.81)

It involves the product of three factors {f3, f4, f5}; this product summarizes the contribution of the subtree that reaches x3 through f3. The subtree has root node x4 and two branches: one through f4 and the other through f5. Let us focus on this subtree in isolation from the remainder of the graph. This subtree can be used to evaluate the product f4(x4, x7) f5(x4, x5) and to "marginalize" it over {x5, x7} following similar steps to what we have done before:

(a) Step I: Node x4 has two neighboring factors f4 and f5 (we are only focusing on the subtree below f3). The λ-terms associated with these factors are

λ4(x4, x7) = f4(x4, x7)                                                                    (43.82a)
λ5(x4, x5) = f5(x4, x5)                                                                    (43.82b)

so that

f4(x4, x7) f5(x4, x5) = λ4(x4, x7) λ5(x4, x5)                                              (43.83)

(b) Step II: "Marginalizing" λ5 over x5 provides a message from f5 to x4, while "marginalizing" λ4 over x7 provides a message from f4 to x4:

mf5→x4(x4) = Σ_{x5∈{0,1}} λ5(x4, x5)
           = Σ_{x5∈{0,1}} f5(x4, x5)
           = f5(x4, 0) + f5(x4, 1)                                                         (43.84a)

and

mf4→x4(x4) = Σ_{x7∈{0,1}} λ4(x4, x7)
           = Σ_{x7∈{0,1}} f4(x4, x7)
           = f4(x4, 0) + f4(x4, 1)                                                         (43.84b)

Substituting into (43.80c) we find that

mf3→x3(x3) = Σ_{x4∈{0,1}} f3(x3, x4) { mf5→x4(x4) × mf4→x4(x4) }
           = Σ_{x4∈{0,1}} f3(x3, x4) mx4→f3(x4)                                            (43.85)

where we introduced

mx4→f3(x4) ≜ mf5→x4(x4) × mf4→x4(x4)                                                       (43.86)

Expression (43.85) has an interesting interpretation. Thus, refer to the subtree that reaches x3 through f3. This subtree has other factor nodes within it. In this particular example, the two factor nodes {f4, f5} appear in simple branches (but one could envision more general situations where there will be additional subtrees connected to these factor nodes with additional factor nodes within these subtrees). Focusing on the case of the two branches involving f4 and f5, we determine the messages from f4 to x4 and from f5 to x4. These messages are straightforward to obtain for the simple branches under consideration; we marginalize the respective λ-terms as shown in (43.84a)–(43.84b). These messages arriving at x4 are first multiplied to determine the message that moves forward from x4 to f3, as indicated by (43.86); the order is now reversed and this message moves from a regular node toward a factor node. Note that in order to find the message originating from node x4 toward the factor node f3, we multiply the messages from all other neighboring factor nodes (f4, f5) and exclude f3. The result is then multiplied by the factor f3. In other words, the messages arriving from the subtree linked to f3 get multiplied by the factor f3. The product of these terms is then marginalized to determine the message from f3 to x3 as in (43.85).

Motivated by this example, we conclude that in the general setting (43.78)–(43.79), we should consider two types of messages: from factor nodes to regular nodes and from regular nodes to factor nodes. Thus, consider an arbitrary node xk and one factor node fa connected to it. We define two types of neighborhoods:

Nx(fa) = {set of regular nodes connected to fa}                                            (43.87a)
Nf(x) = {set of factor nodes connected to a regular node x}                                (43.87b)

For example, if we return to Fig. 43.11 we have

Nx(f3) = {x3, x4}        (regular nodes connected to f3)                                   (43.88a)
Nf(x4) = {f3, f4, f5}    (factor nodes connected to x4)                                    (43.88b)

Then, as suggested by the derivation leading to (43.85), the message from a factor node to a regular node is given by

mfa→xk(xk) = Σ_{Xa} fa(xk, Xa) { ∏_{x∈Nx(fa)\{xk}} mx→fa(x) }                              (43.89a)

mx→fa(x) = ∏_{fb∈Nf(x)\{fa}} mfb→x(x),   ∀ x ∈ Nx(fa)\{xk}                                 (43.89b)

The construction is illustrated in Fig. 43.12 for a factor graph with K = 11 regular nodes and F = 7 factor nodes. Regular node x3 is highlighted; it plays the role of xk. We indicate in the figure some of the messages from regular to factor nodes and from factor to regular nodes. Let us focus on evaluating the message from factor f3 to node x3. That is, we select fa = f3. Then, the set Xa = X3 consists of all regular nodes reaching x3 through f3. These nodes are

X3 = {x4, x5, x7, x8, x9, x10, x11}                                                        (43.90)

The first expression (43.89a) has the form of a sum of products (hence, the name of the algorithm). The product is over all regular neighbors of f3 with the exception of the target node x3; these are nodes {x4, x8}. We need to multiply the messages arriving from them, and further multiply the result by the factor f3 to get

mf3→x3(x3) = Σ_{X3} f3(x3, X3) { mx8→f3(x8) × mx4→f3(x4) }                                 (43.91)

The second expression (43.89b) shows how to evaluate the messages from the regular nodes toward f3, for every regular neighbor of f3 excluding x3. The neighbors are {x4, x8}. For each of these neighbors, we multiply the messages arriving at them from their factor neighbors excluding f3. Consider x4. Its factor neighbors are {f4, f5} apart from f3. Thus, we have

mx4→f3(x4) = mf4→x4(x4) × mf5→x4(x4)                                                       (43.92a)

Likewise,

mx8→f3(x8) = mf6→x8(x8) × mf7→x8(x8)                                                       (43.92b)

and the process continues in this manner. We summarize the listing of the sum-product algorithm in (43.93).

Figure 43.12 A factor graph with K = 11 regular nodes and F = 7 factor nodes. Regular node x3 is highlighted.

Sum-product algorithm over tree factor graphs.

initial conditions:
(a) if a leaf node is regular, the message it sends toward its factor node neighbor is 1;
(b) if a leaf node is a factor node, the message it sends toward its regular neighbor is its factor.

for every leaf node, repeat the following construction until its message reaches all other leaves in the graph:

    compute messages from regular nodes x to factor nodes f:
        mx→f(x) = ∏_{f′∈Nf(x)\{f}} mf′→x(x)

    compute messages from factor nodes f to regular nodes x:
        mf→x(x) = Σ_{Xf} f(x, Xf) { ∏_{x′∈Nx(f)\{x}} mx′→f(x′) }

end                                                                                        (43.93)
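The following Python sketch implements the recursions of (43.93) for the tree factor graph of Fig. 43.11 and checks the resulting marginal of x3 against brute-force enumeration. The potential values are random and purely illustrative; because the graph is a tree, each message can be computed recursively without ever retracing the edge it arrived on.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
factors = {                       # factor name -> (scope, table over the scope)
    "f1": (("x1", "x2", "x3"), rng.uniform(0.5, 2.0, (2, 2, 2))),
    "f2": (("x3", "x6"),       rng.uniform(0.5, 2.0, (2, 2))),
    "f3": (("x3", "x4"),       rng.uniform(0.5, 2.0, (2, 2))),
    "f4": (("x4", "x7"),       rng.uniform(0.5, 2.0, (2, 2))),
    "f5": (("x4", "x5"),       rng.uniform(0.5, 2.0, (2, 2))),
}
variables = sorted({v for scope, _ in factors.values() for v in scope})

def msg_var_to_factor(x, f):
    # product of messages from all other neighboring factors of x, as in (43.89b);
    # for a leaf variable the product is empty and the message equals 1
    out = np.ones(2)
    for g, (scope, _) in factors.items():
        if g != f and x in scope:
            out *= msg_factor_to_var(g, x)
    return out

def msg_factor_to_var(f, x):
    # sum-product step (43.89a): multiply incoming variable messages into the
    # factor table and sum over every variable in the scope except x
    scope, table = factors[f]
    result = table.copy()
    for axis, v in enumerate(scope):
        if v != x:
            incoming = msg_var_to_factor(v, f)
            shape = [1] * len(scope)
            shape[axis] = 2
            result = result * incoming.reshape(shape)
    sum_axes = tuple(a for a, v in enumerate(scope) if v != x)
    return result.sum(axis=sum_axes)

def marginal(x):
    # (43.94): product of all factor-to-variable messages arriving at x
    p = np.ones(2)
    for f, (scope, _) in factors.items():
        if x in scope:
            p *= msg_factor_to_var(f, x)
    return p / p.sum()            # normalizing also absorbs the constant 1/Z

def brute_marginal(x):
    p = np.zeros(2)
    for vals in itertools.product([0, 1], repeat=len(variables)):
        assign = dict(zip(variables, vals))
        w = 1.0
        for scope, table in factors.values():
            w *= table[tuple(assign[v] for v in scope)]
        p[assign[x]] += w
    return p / p.sum()

print(marginal("x3"), brute_marginal("x3"))   # the two agree
```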

The listing of the algorithm does not focus on a single reference node xk. Regardless of the reference node for which we wish to compute the marginal, the same messages will still need to be propagated over the graph and subsequently combined. We therefore start from all leaf nodes and compute their outgoing messages. These messages are subsequently combined through product or marginalization (i.e., sum) operations; every node sends messages to its neighbors after it has received all the messages it needs to compute the outgoing message. The computations continue in this manner until every message has been sent in both directions across all edges. For example, in Fig. 43.12, we are interested in the marginal of x3 and, for this reason, the messages are propagating toward that node. If, however, we become interested in the marginal of x4, then we would need to propagate messages toward that node. In particular, we would need to compute the message mf3→x4(x4), which is not shown in the figure; it moves in the opposite direction to the message currently shown on the link from x4 to f3. Figure 43.13 provides a diagram representation of the product and sum-product operations that are involved in transmitting messages from regular to factor nodes and from factor to regular nodes.

Figure 43.13 Diagram representation of the product and sum-product operations that are involved in transmitting messages from regular to factor nodes and from factor to regular nodes.

Remark 43.2. (Finding marginal distributions) Once all messages are determined by using algorithm (43.93), we can evaluate the marginal distribution for any node xk by using expression (43.78), namely,

P(xk) = ∏_{a∈Nk} mfa→xk(xk)                                                                (43.94)

In other words, we multiply all messages arriving at xk from its neighboring factors.

Remark 43.3. (Dealing with evidence) The sum-product algorithm allows us to evaluate marginal distributions for any of the regular variables in the graph. Sometimes, evidence is available when some variables are observed and need to have their values

pinned to specific realizations. This can be accommodated as follows. Assume variable x4 is observed and its value is x4 = 1. We consider all factors involving x4. In the example of Fig. 43.12, these are factors {f3, f4, f5}. We then replace each one of them by

f3(x3, x4, x8) ← f3(x3, x4, x8) × I[x4 = 1]                                                (43.95a)
f4(x4, x7) ← f4(x4, x7) × I[x4 = 1]                                                        (43.95b)
f5(x4, x5) ← f5(x4, x5) × I[x4 = 1]                                                        (43.95c)

That is, we multiply the relevant factor functions by an indicator function assuming the value 1 when x4 = 1 and 0 otherwise. In this case, the marginal distribution that results from applying the product (43.94) would correspond to the joint pmf:

P(xk, x4 = 1) = ∏_{a∈Nk} mfa→xk(xk)                                                        (43.96)

Clearly, the conditional pmf P(xk | x4 = 1) would be proportional to the above value since

P(xk | x4 = 1) = P(xk, x4 = 1) / P(x4 = 1)                                                 (43.97)
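The pinning operation (43.95a)–(43.95c) amounts to masking every factor table that involves the observed variable. A minimal Python sketch (the table values and the assumed axis ordering are illustrative, not from the text):

```python
import numpy as np

# an illustrative factor over (x3, x4, x8); axis 1 corresponds to x4
f3 = np.array([[[0.2, 1.1], [0.7, 0.4]],
               [[1.3, 0.6], [0.9, 0.5]]])

def pin(table, axis, value):
    # multiply by the indicator I[x = value] along the given axis
    mask = np.zeros(table.shape[axis])
    mask[value] = 1.0
    shape = [1] * table.ndim
    shape[axis] = table.shape[axis]
    return table * mask.reshape(shape)

f3_pinned = pin(f3, axis=1, value=1)   # evidence x4 = 1
print(f3_pinned[:, 0, :])              # all entries with x4 = 0 are now zero
```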

Remark 43.4. (Belief propagation) If the factors fa(·) are constructed from marginal and conditional probability distributions as happens, for example, when we transform a Bayesian network (or DAG) into a factor graph, then the message-passing algorithm reduces to what is known as the belief propagation algorithm. We illustrate this construction in the next two examples. The algorithm can be used to compute marginal distributions, as already seen through the derivation of the sum-product algorithm. It can also be used to compute maximum a-posteriori (MAP) probabilities (or the most likely events), as will be seen in the next section when we derive the max-sum algorithm. It should be noted, though, that the derivation of the message-passing algorithm (43.93), and the corresponding belief propagation special case, assumes tree-structured graphs. In this case, the algorithm provides exact inference solutions (i.e., it leads to exact marginals). If the factor graph contains cycles, however, messages can move around the graph repeatedly and the message-passing procedure need not converge. There are still useful instances in practice where the procedure, referred to in this case as loopy belief propagation for Bayesian graphs with cycles, has been observed to give good (approximate) results after repeated iterations passing the messages around the graph – see, e.g., the iterative decoding procedure described later in Example 43.13, where loopy belief propagation has been observed to lead to good decoding performance. There are scheduling schemes to plan how messages are passed over graphs with cycles to alleviate the convergence issues. We forgo this discussion here, but note that in the loopy implementation all messages are initialized to the value 1.

Example 43.9 (Belief propagation: All nodes are hidden) When the message-passing algorithm (43.93) is applied to a tree-structured Bayesian network (i.e., a tree-structured directed graph), it reduces to the belief propagation algorithm. We illustrate the calculations by considering the directed graph shown in Fig. 43.14; the graph consists of K = 6 random nodes denoted by {x1 , x2 , . . . , x6 }. Node 3 is observed and appears shaded. The figure also shows the corresponding marginal or conditional probability tables, where the variables are assumed to be Boolean. We transform the directed graph into an undirected graph shown in the left plot of Fig. 43.15. In this case, we simply remove the direction of the arrows and there is no need for moralization. The plot shows that the graph consists of five cliques. The plot on the right


Figure 43.14 A directed acyclic graph with tree structure consisting of 6 nodes (x1 → x2, x2 → {x3, x6}, x3 → {x4, x5}; node x3 carries the evidence x3 = 1 used in Example 43.10) and the corresponding conditional probability tables:

x1   P(x1)
0    0.3
1    0.7

x1   x2   P(x2|x1)
0    0    0.1
0    1    0.9
1    0    0.6
1    1    0.4

x2   x3   P(x3|x2)
0    0    0.3
0    1    0.7
1    0    0.4
1    1    0.6

x3   x4   P(x4|x3)
0    0    0.7
0    1    0.3
1    0    0.8
1    1    0.2

x3   x5   P(x5|x3)
0    0    0.4
0    1    0.6
1    0    0.1
1    1    0.9

x2   x6   P(x6|x2)
0    0    0.5
0    1    0.5
1    0    0.2
1    1    0.8

in the same figure provides the corresponding factor graph, which consists of K = 6 regular nodes and F = 5 factor nodes. The factor functions are also indicated next to each factor node. In particular, the tabular form for factor f1(x1, x2) = P(x1, x2) can be deduced from P(x1) and P(x2|x1), leading to:

x1   x2   f1(x1, x2)
0    0    0.3 × 0.1 = 0.03
0    1    0.3 × 0.9 = 0.27
1    0    0.7 × 0.6 = 0.42
1    1    0.7 × 0.4 = 0.28
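As a quick check, the table above can be reproduced with a one-line outer product; the minimal sketch below assumes the CPT values of Fig. 43.14.

```python
import numpy as np

P_x1 = np.array([0.3, 0.7])                      # P(x1)
P_x2_given_x1 = np.array([[0.1, 0.9],            # rows: x1, columns: x2
                          [0.6, 0.4]])
f1 = P_x1[:, None] * P_x2_given_x1               # f1(x1, x2) = P(x1) P(x2|x1)
print(f1)                                        # [[0.03 0.27], [0.42 0.28]]
```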

To apply the message-passing algorithm, we consider each node (regular or factor) individually and write down the messages from this node to all of its neighbors. Alternatively, we could start from a leaf node and propagate the message forward until it reaches every other leaf in the graph, and then propagate the message backward. In this example, we organize the computations in a manner that facilitates the exposition. We start from the leaf nodes {x1, x4, x5, x6}:

1. node x1: This node has one neighbor only, given by f1. The message from x1 to its neighbor is one that moves from a regular node to a factor node. Since x1 is a leaf node we set

   m_{x1→f1}(x1) = 1    (43.98)

2. node x4: This node has one neighbor only, given by f4. It is a leaf node and therefore

   m_{x4→f4}(x4) = 1    (43.99)


Figure 43.15 Undirected and factor graph representations for the DAG from Fig. 43.14. The factor graph consists of K = 6 regular nodes and F = 5 factor nodes, with factor functions:

f1(x1, x2) = P(x1) P(x2|x1)
f2(x2, x6) = P(x6|x2)
f3(x2, x3) = P(x3|x2)
f4(x3, x4) = P(x4|x3)
f5(x3, x5) = P(x5|x3)

3. node x5: This node has one neighbor only, given by f5. It is a leaf node and therefore

   m_{x5→f5}(x5) = 1    (43.100)

4. node x6: This node has one neighbor only, given by f2. It is a leaf node and therefore

   m_{x6→f2}(x6) = 1    (43.101)

Now that all leaf nodes have been treated, we consider the nodes reached by them, namely, {f1, f2, f4, f5}.

5. node f1: This node has two neighbors, given by {x1, x2}. Hence, we need to compute two messages, which now move from a factor node toward regular nodes:

   m_{f1→x1}(x1) = ∑_{x2∈{0,1}} f1(x1, x2) × m_{x2→f1}(x2)
                 = ∑_{x2∈{0,1}} P(x1, x2) × m_{x2→f1}(x2)    (43.102a)

where the message m_{x2→f1}(x2), highlighted in the original derivation, has not been computed yet. For the second message we have


   m_{f1→x2}(x2) = ∑_{x1∈{0,1}} f1(x1, x2) × m_{x1→f1}(x1)
                 =(43.98)= ∑_{x1∈{0,1}} P(x1, x2) × 1
                 = P(x2)    (43.102b)

   x2   P(x2)
   0    0.03 + 0.42 = 0.45
   1    0.27 + 0.28 = 0.55

6. node f2: This node has two neighbors, given by {x2, x6}. Hence, we need to compute two messages, one toward each of these nodes:

   m_{f2→x2}(x2) = ∑_{x6∈{0,1}} f2(x2, x6) × m_{x6→f2}(x6)
                 =(43.101)= ∑_{x6∈{0,1}} P(x6|x2) × 1
                 = 1, ∀ x2 ∈ {0, 1}    (43.103a)

and

   m_{f2→x6}(x6) = ∑_{x2∈{0,1}} f2(x2, x6) × m_{x2→f2}(x2)
                 = ∑_{x2∈{0,1}} P(x6|x2) m_{x2→f2}(x2)    (43.103b)

7. node f4: This node has two neighbors, given by {x3, x4}. Hence, we need to compute two messages, one toward each of these nodes:

   m_{f4→x3}(x3) = ∑_{x4∈{0,1}} f4(x3, x4) × m_{x4→f4}(x4)
                 =(43.99)= ∑_{x4∈{0,1}} P(x4|x3) × 1
                 = 1, ∀ x3 ∈ {0, 1}    (43.104a)

For the second message we have

   m_{f4→x4}(x4) = ∑_{x3∈{0,1}} f4(x3, x4) × m_{x3→f4}(x3)
                 = ∑_{x3∈{0,1}} P(x4|x3) m_{x3→f4}(x3)    (43.104b)

8. node f5: This node has two neighbors, given by {x3, x5}. Hence, we need to compute two messages, one toward each of these nodes:

   m_{f5→x3}(x3) = ∑_{x5∈{0,1}} f5(x3, x5) × m_{x5→f5}(x5)
                 =(43.100)= ∑_{x5∈{0,1}} P(x5|x3) × 1
                 = 1, ∀ x3 ∈ {0, 1}    (43.105a)


and

   m_{f5→x5}(x5) = ∑_{x3∈{0,1}} f5(x3, x5) × m_{x3→f5}(x3)
                 = ∑_{x3∈{0,1}} P(x5|x3) m_{x3→f5}(x3)    (43.105b)

Now we move to a third layer of nodes that are reached from the factor nodes {f1, f2, f4, f5}. These are nodes {x2, x3}.

9. node x2: This node has three neighbors, given by {f1, f2, f3}. Hence, we need to compute three messages, one toward each of these nodes:

   m_{x2→f1}(x2) = m_{f2→x2}(x2) × m_{f3→x2}(x2) =(43.103a)= m_{f3→x2}(x2)    (43.106a)
   m_{x2→f2}(x2) = m_{f1→x2}(x2) × m_{f3→x2}(x2) =(43.102b)= P(x2) m_{f3→x2}(x2)    (43.106b)
   m_{x2→f3}(x2) = m_{f1→x2}(x2) × m_{f2→x2}(x2) =(a)= P(x2) × 1 = P(x2)    (43.106c)

where step (a) follows from (43.102b) and (43.103a).

10. node x3: This node has three neighbors, given by {f3, f4, f5}. Hence, we need to compute three messages, one toward each of these nodes:

   m_{x3→f3}(x3) = m_{f4→x3}(x3) × m_{f5→x3}(x3) =(a)= 1 × 1 = 1    (43.107a)
   m_{x3→f4}(x3) = m_{f3→x3}(x3) × m_{f5→x3}(x3) =(43.105a)= m_{f3→x3}(x3)    (43.107b)
   m_{x3→f5}(x3) = m_{f3→x3}(x3) × m_{f4→x3}(x3) =(43.104a)= m_{f3→x3}(x3)    (43.107c)

where step (a) is because of (43.104a) and (43.105a). The only node left to consider is f3.

11. node f3: This node has two neighbors, given by {x2, x3}. Hence, we need to compute two messages, one toward each of these nodes:

   m_{f3→x2}(x2) = ∑_{x3∈{0,1}} f3(x2, x3) × m_{x3→f3}(x3)
                 =(43.107a)= ∑_{x3∈{0,1}} P(x3|x2)
                 = 1, ∀ x2 ∈ {0, 1}    (43.108a)


and

   m_{f3→x3}(x3) = ∑_{x2∈{0,1}} f3(x2, x3) × m_{x2→f3}(x2)
                 =(43.106c)= ∑_{x2∈{0,1}} P(x3|x2) P(x2)
                 = ∑_{x2∈{0,1}} P(x2, x3)
                 = P(x3)    (43.108b)

Since we know P(x2) and P(x3|x2) we can determine P(x2, x3) and marginalize it to obtain P(x3):

x2   x3   P(x2, x3)
0    0    0.45 × 0.3 = 0.135
0    1    0.45 × 0.7 = 0.315
1    0    0.55 × 0.4 = 0.220
1    1    0.55 × 0.6 = 0.330

x3   P(x3)
0    0.135 + 0.220 = 0.355
1    0.315 + 0.330 = 0.645

If we examine the expressions derived so far, we find that all messages that had been left undetermined earlier have now been determined. For example, if we return to (43.102a) and use (43.106a) and (43.108a) we find that

   m_{f1→x1}(x1) = ∑_{x2∈{0,1}} P(x1, x2) × 1 = P(x1)    (43.109)

x1   P(x1)
0    0.3
1    0.7

Similarly, other messages will be expressed in terms of P(x4), P(x5), and P(x6). We can compute these marginals in a manner similar to how we evaluated P(x3) above, leading to

x4   P(x4)
0    0.7645
1    0.2355

x5   P(x5)
0    0.2065
1    0.7935

x6   P(x6)
0    0.3350
1    0.6650
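A compact way to verify these marginals is to run the sum-product recursions directly. The sketch below is a minimal Python illustration (not the book's code); it exploits the fact that all factors in Fig. 43.15 are pairwise, so the node-to-factor and factor-to-node updates along an edge can be folded into a single recursive call.

```python
import numpy as np

# Pairwise factor tables of Fig. 43.15, oriented so that the key (u, v) gives a
# table indexed as [x_u, x_v].
pair_tables = {
    (1, 2): np.array([[0.03, 0.27], [0.42, 0.28]]),   # f1(x1, x2) = P(x1, x2)
    (2, 3): np.array([[0.3, 0.7], [0.4, 0.6]]),       # f3(x2, x3) = P(x3|x2)
    (2, 6): np.array([[0.5, 0.5], [0.2, 0.8]]),       # f2(x2, x6) = P(x6|x2)
    (3, 4): np.array([[0.7, 0.3], [0.8, 0.2]]),       # f4(x3, x4) = P(x4|x3)
    (3, 5): np.array([[0.4, 0.6], [0.1, 0.9]]),       # f5(x3, x5) = P(x5|x3)
}
neighbors = {1: [2], 2: [1, 3, 6], 3: [2, 4, 5], 4: [3], 5: [3], 6: [2]}

def factor(u, v):
    # table of the factor linking u and v, oriented as [x_u, x_v]
    return pair_tables[(u, v)] if (u, v) in pair_tables else pair_tables[(v, u)].T

def factor_to_node(u, v):
    """Message along the edge u -> v: sum over x_u of the factor times the
    product of the messages arriving at x_u from its other neighbors
    (leaf nodes contribute the all-ones message)."""
    incoming = np.ones(2)
    for w in neighbors[u]:
        if w != v:
            incoming *= factor_to_node(w, u)
    return factor(u, v).T @ incoming     # sums over x_u

def marginal(v):
    m = np.ones(2)
    for u in neighbors[v]:
        m *= factor_to_node(u, v)
    return m

for v in (3, 4, 5, 6):
    print(v, marginal(v))   # reproduces P(x3), P(x4), P(x5), P(x6) above
```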

For ease of reference, we list all the messages derived in these calculations in Table 43.2. They are also depicted graphically in Fig. 43.16.

Example 43.10 (Belief propagation: Some nodes are observed) Let us repeat the analysis when node x3 is observed with value x3 = 1. In other words, evidence is available. If we examine the derivation in the previous example, we find that adjustments need to be made to steps 7, 8, and 11 where the factors f4(x3, x4), f5(x3, x5), and f3(x2, x3) appear; these factors depend on x3. They will therefore be modulated by the indicator function I[x3 = 1]. We reconsider the steps here:

Figure 43.16 Directed messages sent across the nodes for the factor graph from Fig. 43.15.

7. node f4: This node has two neighbors, given by {x3, x4}. Hence, we need to compute two messages, one toward each of these nodes:

   m_{f4→x3}(x3 = 1) = ∑_{x4∈{0,1}} I[x3 = 1] f4(x3, x4) × m_{x4→f4}(x4)
                     =(43.99)= ∑_{x4∈{0,1}} P(x4|x3 = 1) × 1
                     = 0.8 + 0.2 = 1    (43.110a)

and

   m_{f4→x4}(x4) = ∑_{x3∈{0,1}} I[x3 = 1] f4(x3, x4) × m_{x3→f4}(x3)
                 = P(x4|x3 = 1) m_{x3→f4}(x3 = 1)    (43.110b)

Table 43.2 List of messages derived for the factor graph of Fig. 43.15 assuming all regular nodes are hidden.

Node   Message
x1     m_{x1→f1}(x1) = 1
x2     m_{x2→f1}(x2) = 1
x2     m_{x2→f2}(x2) = P(x2)
x2     m_{x2→f3}(x2) = P(x2)
x3     m_{x3→f3}(x3) = 1
x3     m_{x3→f4}(x3) = P(x3)
x3     m_{x3→f5}(x3) = P(x3)
x4     m_{x4→f4}(x4) = 1
x5     m_{x5→f5}(x5) = 1
x6     m_{x6→f2}(x6) = 1
f1     m_{f1→x1}(x1) = P(x1)
f1     m_{f1→x2}(x2) = P(x2)
f2     m_{f2→x2}(x2) = 1
f2     m_{f2→x6}(x6) = P(x6)
f3     m_{f3→x2}(x2) = 1
f3     m_{f3→x3}(x3) = P(x3)
f4     m_{f4→x3}(x3) = 1
f4     m_{f4→x4}(x4) = P(x4)
f5     m_{f5→x3}(x3) = 1
f5     m_{f5→x5}(x5) = P(x5)

8. node f5: This node has two neighbors, given by {x3, x5}. Hence, we need to compute two messages, one toward each of these nodes:

   m_{f5→x3}(x3 = 1) = ∑_{x5∈{0,1}} I[x3 = 1] f5(x3, x5) × m_{x5→f5}(x5)
                     =(43.100)= ∑_{x5∈{0,1}} P(x5|x3 = 1) × 1
                     = 0.1 + 0.9 = 1    (43.111a)

and

   m_{f5→x5}(x5) = ∑_{x3∈{0,1}} I[x3 = 1] f5(x3, x5) × m_{x3→f5}(x3)
                 = P(x5|x3 = 1) m_{x3→f5}(x3 = 1)    (43.111b)

11. node f3: This node has two neighbors, given by {x2, x3}. Hence, we need to compute two messages, one toward each of these nodes:

   m_{f3→x2}(x2) = ∑_{x3∈{0,1}} I[x3 = 1] f3(x2, x3) × m_{x3→f3}(x3 = 1)
                 =(43.107a)= P(x3 = 1|x2)    (43.112a)


and

   m_{f3→x3}(x3 = 1) = ∑_{x2∈{0,1}} I[x3 = 1] f3(x2, x3) × m_{x2→f3}(x2)
                     =(43.106c)= ∑_{x2∈{0,1}} P(x3 = 1|x2) P(x2)
                     = ∑_{x2∈{0,1}} P(x2, x3 = 1)
                     = P(x3 = 1)
                     = 0.645    (43.112b)

For ease of reference, we list all the messages derived in these calculations with evidence in Table 43.3.

Table 43.3 List of messages derived for the factor graph of Fig. 43.15 assuming the evidence x3 = 1.

Node   Message
x1     m_{x1→f1}(x1) = 1
x2     m_{x2→f1}(x2) = 1
x2     m_{x2→f2}(x2) = P(x2)
x2     m_{x2→f3}(x2) = P(x2)
x3     m_{x3→f3}(x3 = 1) = 1
x3     m_{x3→f4}(x3 = 1) = P(x3 = 1)
x3     m_{x3→f5}(x3 = 1) = P(x3 = 1)
x4     m_{x4→f4}(x4) = 1
x5     m_{x5→f5}(x5) = 1
x6     m_{x6→f2}(x6) = 1
f1     m_{f1→x1}(x1) = P(x1)
f1     m_{f1→x2}(x2) = P(x2)
f2     m_{f2→x2}(x2) = 1
f2     m_{f2→x6}(x6) = P(x6)
f3     m_{f3→x2}(x2) = 1
f3     m_{f3→x3}(x3 = 1) = P(x3 = 1)
f4     m_{f4→x3}(x3 = 1) = 1
f4     m_{f4→x4}(x4) = P(x4)
f5     m_{f5→x3}(x3 = 1) = 1
f5     m_{f5→x5}(x5) = P(x5)
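As a small numerical check of (43.112a)–(43.112b), the products P(x2)P(x3 = 1|x2) can be evaluated directly; the values below are taken from the tables of Fig. 43.14.

```python
import numpy as np

P_x2 = np.array([0.45, 0.55])                    # P(x2) from Example 43.9
P_x3_given_x2 = np.array([[0.3, 0.7],            # rows: x2, columns: x3
                          [0.4, 0.6]])
joint_x2_x3is1 = P_x2 * P_x3_given_x2[:, 1]      # P(x2, x3 = 1)
print(joint_x2_x3is1, joint_x2_x3is1.sum())      # [0.315 0.33], 0.645 = P(x3 = 1)
```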

Example 43.11 (Markov chain) Consider the DAG shown in the top part of Fig. 43.17, where the nodes are connected linearly in a chain structure, with one node feeding into the subsequent node. This structure corresponds to a Markov chain. The lower part of the figure shows the corresponding factor graph with factors:


Figure 43.17 A DAG corresponding to a Markov chain where nodes are connected linearly in a chain structure, with one node feeding into the subsequent node. The middle plot shows the corresponding factor graph. The lower plot shows the resulting forward and backward messages from applying the sum-product algorithm for general factor functions.

f1(x1, x2) = P(x1) P(x2|x1) = P(x1, x2)    (43.113a)
f2(x2, x3) = P(x3|x2)    (43.113b)
f3(x3, x4) = P(x4|x3)    (43.113c)
⋮
fK−1(xK−1, xK) = P(xK|xK−1)    (43.113d)

Writing down the sum-product recursions, we find that the messages propagating from left to right are given by:

m_{x1→f1}(x1) = 1    (43.114a)
m_{f1→x2}(x2) = ∑_{x1} P(x1, x2) × 1 = P(x2)    (43.114b)
m_{x2→f2}(x2) = P(x2)    (43.114c)
m_{f2→x3}(x3) = ∑_{x2} P(x3|x2) × P(x2) = P(x3)    (43.114d)
m_{x3→f3}(x3) = P(x3)    (43.114e)
m_{f3→x4}(x4) = ∑_{x3} P(x4|x3) × P(x3) = P(x4)    (43.114f)
⋮


That is,

m_{fk−1→xk}(xk) = P(xk)    (43.115a)
m_{xk→fk}(xk) = P(xk)    (43.115b)

Likewise, the messages that propagate from right to left are given by

m_{xK→fK−1}(xK) = 1    (43.116a)
m_{fK−1→xK−1}(xK−1) = ∑_{xK} P(xK|xK−1) × 1 = 1    (43.116b)
m_{xK−1→fK−2}(xK−1) = 1    (43.116c)
m_{fK−2→xK−2}(xK−2) = ∑_{xK−1} P(xK−1|xK−2) × 1 = 1    (43.116d)
⋮

That is,

m_{xk→fk−1}(xk) = 1    (43.117a)
m_{fk→xk}(xk) = 1    (43.117b)

More generally, if the factor functions do not have the interpretation of probability distributions, we can keep the messages expressed in terms of these functions. The messages propagating from left to right would be given by

m_{xk→fk}(xk) = m_{fk−1→xk}(xk)    (43.118a)
m_{fk→xk+1}(xk+1) = ∑_{xk} fk(xk, xk+1) m_{xk→fk}(xk)    (43.118b)

We note in this case that the two messages in the first equation are identical. To simplify the notation, we define:

→m(xk) ≜ m_{fk−1→xk}(xk)    (43.119)

The arrow is meant to indicate that this message is propagating from xk to the right. Combining both equations (43.118a)–(43.118b) gives the following relation for propagating this forward message from one xk to another:

→m(xk+1) = ∑_{xk} fk(xk, xk+1) →m(xk)    (43.120)

with boundary condition →m(x1) = 1. Likewise, the messages propagating from right to left will be given by

m_{xk+1→fk}(xk+1) = m_{fk+1→xk+1}(xk+1)    (43.121a)
m_{fk→xk}(xk) = ∑_{xk+1} fk(xk, xk+1) m_{xk+1→fk}(xk+1)    (43.121b)

We note again that the messages in the first equation are identical. We simplify the notation and define:

←m(xk) ≜ m_{fk→xk}(xk)    (43.122)


This message propagates to the left. Combining (43.121a)–(43.121b) gives the following relation for propagating this backward message from one xk to another:

←m(xk) = ∑_{xk+1} fk(xk, xk+1) ←m(xk+1)    (43.123)

with boundary condition ←m(xK) = 1. The messages are indicated in the lower plot of Fig. 43.17. For any regular node xk, its marginal pmf can then be found by multiplying the messages arriving at it from its neighboring factor nodes so that

P(xk) = (1/Z) →m(xk) × ←m(xk)    (43.124)

and Z is the normalizing factor. Let us return to the probabilistic interpretation for the factor functions and assume that all variables are Boolean for simplicity. We note that

→m(xk) = m_{fk−1→xk}(xk) =(43.115a)= P(xk) = ∑_{xk−1} P(xk|xk−1) × P(xk−1)    (43.125)

so that, if we represent →m(xk) in vector form for the two possible realizations of xk, we can write

[P(xk = 0); P(xk = 1)] = A^T [P(xk−1 = 0); P(xk−1 = 1)],   A^T = [a00 a10; a01 a11]    (43.126)

where we are denoting the transition probabilities from one node to another by:

a00 ≜ P(xk = 0 | xk−1 = 0)    (43.127a)
a01 ≜ P(xk = 1 | xk−1 = 0)    (43.127b)
a10 ≜ P(xk = 0 | xk−1 = 1)    (43.127c)
a11 ≜ P(xk = 1 | xk−1 = 1)    (43.127d)

for any k. It follows, by iteration, that

[P(xk = 0); P(xk = 1)] = (A^T)^{k−1} [P(x1 = 0); P(x1 = 1)]    (43.128)

which agrees with result (38.53) for Markov chains.
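A minimal sketch of this forward recursion, with a hypothetical transition matrix and initial distribution, is shown below; it also confirms that iterating (43.120) reproduces the matrix-power expression (43.128).

```python
import numpy as np

# Hypothetical Boolean Markov chain parameters in the notation of (43.127a)-(43.127d).
A = np.array([[0.3, 0.7],      # a00, a01
              [0.4, 0.6]])     # a10, a11
p1 = np.array([0.45, 0.55])    # P(x1 = 0), P(x1 = 1)

def forward_message(p1, A, k):
    """Propagate the forward message of (43.120), which here equals P(x_j),
    from node 1 up to node k."""
    m = p1.copy()
    for _ in range(k - 1):
        m = A.T @ m                # sum over x_j of P(x_{j+1}|x_j) m(x_j)
    return m

print(forward_message(p1, A, 4))                   # P(x4) via message passing
print(np.linalg.matrix_power(A.T, 3) @ p1)         # same marginal via (43.128)
```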

43.4.2 Max-Sum Algorithm

The sum-product algorithm (43.93) enables us to compute marginals from distributions represented by tree-structured factor graphs. We now describe a similar


procedure, known as the max-sum algorithm, which enables us to determine the most likely realization for the random variables (with or without evidence) and the probability of this maximizing realization occurring. It turns out that the max-sum algorithm is one instance of dynamic programming applied to graphical models. Although unnecessary for the derivation in this section, interested readers may refer to the discussion on dynamic programming and multistage decision processes in the comments section of Chapter 45. Our objective in this section is to devise a scheme that allows us to compute the most likely realization for the regular nodes, namely,

{x̂1, x̂2, . . . , x̂K} = argmax_{x1,...,xK} P(x1, x2, . . . , xK)
                      =(43.65)= argmax_{x1,...,xK} { ∑_{a=1}^{F} ln fa(Xa) }    (43.129)

The second equality is the reason for the name max-sum algorithm (since Z is a constant, we ignore it from the maximization altogether). The solution structure shares many features with the sum-product algorithm from the previous section. We will therefore focus on the main differences. For later reference, we denote the likelihood of the maximizing solution by the notation:

pmax ≜ P(x̂1, x̂2, . . . , x̂K)    (43.130)

Messages

Let xk be some arbitrary node in the graph. We start from expression (43.77), which we repeat here for ease of reference:

P(x1, x2, . . . , xK) = ∏_{a∈Nk} λa(xk, Xa)    (43.131)

where the product is over the factors fa that are neighbors to xk. Each of these factors involves a subtree with a collection of random variables denoted by Xa, and contributes a term λa(xk, Xa) whose arguments are xk and Xa. Taking logarithms of both sides gives

ln P(x1, x2, . . . , xK) = ∑_{a∈Nk} ln λa(xk, Xa)    (43.132)

where all λ-terms depend on xk, whereas the Xa sets are disjoint among themselves, i.e.,

{x1, x2, . . . , xK} = {xk} ∪ Xa1 ∪ Xa2 ∪ . . . ∪ Xa|Nk|   (|Nk| of these sets)    (43.133)

where we are enumerating the sets Xa in the neighborhood of xk by using the indices {a1, a2, . . . , a|Nk|}. Therefore, maximizing over all random variables allows us to split the maximization operation as follows:


max_{x1,...,xK} { ln P(x1, x2, . . . , xK) } = max_{xk} { ∑_{a∈Nk} max_{Xa} ln λa(xk, Xa) }
                                            = max_{xk} { ∑_{a∈Nk} m_{fa→xk}(xk) }    (43.134)

where the message from factor fa to node xk is now defined as

m_{fa→xk}(xk) ≜ max_{Xa} { ln λa(xk, Xa) }    (43.135)

In (43.134) we have a sum of messages as opposed to a product of messages as was the case with the earlier derivation leading to (43.78). Also, the message in (43.135) involves maximization over Xa as opposed to marginalization over Xa as was the case in (43.79). If we continue from here and repeat the same line of reasoning as in the derivation of the sum-product algorithm we will find that the node-to-factor and factor-to-node message expressions (43.89a)–(43.89b) will need to be replaced by the following form, where "marginalization" is replaced by maximization and the messages are defined in the logarithmic scale:

m_{fa→xk}(xk) = max_{Xa} { ln fa(xk, Xa) + ∑_{x∈Nx(fa)\{xk}} m_{x→fa}(x) }    (43.136a)
m_{x→fa}(x) = ∑_{fb∈Nf(x)\{fa}} m_{fb→x}(x),  ∀ x ∈ Nx(fa)\{xk}    (43.136b)

Accordingly, the boundary conditions at the leaf nodes are now given by

µ_{x→fa}(x) = 0   (if leaf node is regular)    (43.137a)
µ_{fa→x}(x) = ln fa   (if leaf node is a factor)    (43.137b)

We illustrate the construction by referring again to Fig. 43.12 with K = 11 regular nodes and F = 7 factor nodes. Regular node x3 is highlighted; it plays the role of xk . We indicate in the figure some messages from regular to factor nodes and from factor to regular nodes. Let us focus on evaluating the message from factor f3 to node x3 . That is, we select fa = f3 . Then, the set X a = X 3 consists of all regular nodes reaching x3 through f3 . These nodes are X 3 = {x4 , x5 , x7 , x8 , x9 , x10 , x11 }

(43.138)

The first expression (43.136a) has the form of the maximum of a sum. The sum is over all regular neighbors of f3 with the exception of the target node x3 ; these are nodes {x4 , x8 }. We need to add the messages arriving from them, and further add the result to ln f3 to get


m_{f3→x3}(x3) = max_{X3} { ln f3(x3, X3) + m_{x8→f3}(x8) + m_{x4→f3}(x4) }    (43.139)

The second expression (43.136b) shows how to evaluate the messages from the regular nodes toward f3, for every regular neighbor of f3 excluding x3. The neighbors are {x4, x8}. For each of these neighbors, we add the messages arriving at them from their factor neighbors excluding f3. Consider x4. Its factor neighbors are {f4, f5} apart from f3. Thus, we have

m_{x4→f3}(x4) = m_{f4→x4}(x4) + m_{f5→x4}(x4)    (43.140a)

Likewise,

m_{x8→f3}(x8) = m_{f6→x8}(x8) + m_{f7→x8}(x8)    (43.140b)

and the process continues in this manner. We summarize the listing of the max-sum algorithm in (43.141). The algorithm can start from all leaf nodes and compute their outgoing messages. These messages are subsequently combined through sum and maximization operations. The computations continue until every message has been sent in both directions across all edges.

Max-sum algorithm over tree factor graphs.
initial conditions:
  (a) if a leaf node is regular, the message it sends toward its factor node neighbor is 0;
  (b) if a leaf node is a factor node, the message it sends toward its regular neighbor is the log of its factor.
for every leaf node, repeat the following construction until its message reaches all other leaves in the graph:
  compute messages from regular nodes x to factor nodes f:
     m_{x→f}(x) = ∑_{f′∈Nf(x)\{f}} m_{f′→x}(x)
  compute messages from factor nodes f to regular nodes x:
     m_{f→x}(x) = max_{Xf} { ln f(x, Xf) + ∑_{x′∈Nx(f)\{x}} m_{x′→f}(x′) }
end
                                                                        (43.141)
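For concreteness, the sketch below implements listing (43.141) for the special case of a chain of pairwise factors (where each node-to-factor message simply forwards the single incoming factor-to-node message), together with the backtracking of maximizers discussed next. The two-factor example at the end is hypothetical.

```python
import numpy as np

def max_sum_chain(log_factors):
    """Max-sum on a chain factor graph with pairwise factors f_k(x_k, x_{k+1}),
    given as log-factor matrices indexed [x_k, x_{k+1}]. Returns the maximizing
    assignment and its log-value."""
    msgs = [np.zeros(log_factors[0].shape[0])]       # leaf boundary message = 0
    argmax = []
    for F in log_factors:                            # forward factor-to-node sweep
        scores = F + msgs[-1][:, None]               # ln f_k(x_k, x_{k+1}) + m(x_k)
        argmax.append(scores.argmax(axis=0))         # best x_k for each x_{k+1}
        msgs.append(scores.max(axis=0))              # maximize over x_k
    xs = [int(msgs[-1].argmax())]                    # maximize over the last node
    for back in reversed(argmax):                    # backtrack the stored maximizers
        xs.append(int(back[xs[-1]]))
    xs.reverse()
    return xs, float(msgs[-1].max())

# Hypothetical 3-node chain with f1 = P(x1, x2) and f2 = P(x3 | x2).
f1 = np.log([[0.03, 0.27], [0.42, 0.28]])
f2 = np.log([[0.3, 0.7], [0.4, 0.6]])
print(max_sum_chain([f1, f2]))   # maximizing assignment and its log-probability
```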

Performing maximization

Once all messages have been found, we still need to determine the optimizing arguments {x̂1, x̂2, . . . , x̂K} defined by (43.129). This last step requires some computations that are reminiscent of a dynamic programming argument, and


similar in spirit to the derivation of the Viterbi algorithm that we encountered earlier in Section 39.4 while decoding observations from hidden Markov models (which are special directed graphs). We start by selecting an arbitrary node xk from the graph and let X−k denote the collection of remaining nodes:

X−k ≜ {x1, . . . , xk−1, xk+1, . . . , xK}    (43.142)

Thus, we can write

P(x1, x2, . . . , xK) = P(xk, X−k)    (43.143)

Assume we fix xk and maximize the joint pmf over all remaining variables in X−k. We denote the maximal value by

P−k(xk) ≜ max_{X−k} P(xk, X−k)    (43.144)

Then, it is clear from (43.134) that

P−k(xk) = ∑_{a∈Nk} m_{fa→xk}(xk)    (43.145)

In other words, we can determine P−k(xk) by summing all the messages arriving at xk from its neighboring factors. If we now maximize P−k(xk) over xk, we arrive at the desired maximizer x̂k:

x̂k = argmax_{xk} { ∑_{a∈Nk} m_{fa→xk}(xk) }    (43.146)

along with the resulting maximal likelihood value from (43.130):

ln pmax = max_{xk} { ∑_{a∈Nk} m_{fa→xk}(xk) }    (43.147)

We still need to determine the optimal values for the remaining nodes in X−k. For this purpose, we recall that step (43.136a) involves a maximization over Xa; these are the nodes that reach xk through its neighboring factor fa. The result of this maximization is denoted by X̂a; its value will depend on xk because xk appears as an argument on the right-hand side of (43.136a). We write

X̂a(xk) = argmax_{Xa} { ln fa(xk, Xa) + ∑_{x∈Nx(fa)\{xk}} m_{x→fa}(x) }

43.4 Message-Passing Algorithms

1787

All these optimizers are functions of xk . If we replace the maximizing xk from (43.146), we end up determining an optimal collection of random variables that maximize the joint pmf: n o ca (b b2, . . . , x bK } = x bk, X {b x1 , x xk ), a ∈ Nx (43.149) Example 43.12 (Max-sum algorithm: All nodes are hidden) We illustrate the application of the max-sum algorithm to the same factored graph from Fig. 43.15. To apply the message-passing algorithm, start from a leaf node and propagate the message forward until it reaches every other leaf in the graph, and then propagate the message backward. We organize the computations in a manner that facilitates the exposition. We start from the leaf nodes {x1 , x4 , x5 , x6 }: 1.

node x1 : This node has one neighbor only, given by f1 : mx1 →f1 (x1 ) = 0

2.

(43.150)

node x4 : This node has one neighbor only, given by f4 : mx4 →f4 (x4 ) = 0

3.

(43.151)

node x5 : This node has one neighbor only, given by f5 : mx5 →f5 (x5 ) = 0

4.

(43.152)

node x6 : This node has one neighbor only, given by f2 : mx6 →f2 (x6 ) = 0

(43.153)

Now that all leaf nodes have been treated, we consider the nodes reached by them, namely, {f1 , f2 , f4 , f5 }. 5.

node f 1 : This node has two neighbors, given by {x1 , x2 }: n o mf1 →x1 (x1 ) = max ln f1 (x1 , x2 ) + mx2 →f1 (x2 ) x2 ∈{0,1} ( ) =

max

x2 ∈{0,1}

ln P(x1 , x2 ) + mx2 →f1 (x2 )

(43.154a)

where the values for ln P(x1 , x2 ) are given by the following table: x1

x2

ln P(x1 , x2 )

0 0 1 1

0 1 0 1

−3.5066 −1.3093 −0.8675 −1.2730

For the second message we have mf1 →x2 (x2 )

= (43.150)

=

=

max

n

max

n

x1 ∈{0,1}

x1 ∈{0,1}

ln f1 (x1 , x2 ) + mx1 →f1 (x1 ) o ln P(x1 , x2 ) + 0

max ln P(x1 , x2 )

x1 ∈{0,1}

o

(43.154b)

If we perform the maximization, we can determine the message for each value of x2 and the corresponding maximizing x1 :

1788

Undirected Graphs

6.

x2

mf1 →x2 (x2 )

b1 x

0 1

−0.8675 −1.2730

1 1

node f 2 : This node has two neighbors, given by {x2 , x6 }: mf2 →x2 (x2 )

=

o n ln f2 (x2 , x6 ) + mx6 →f2 (x6 )

max

x6 ∈{0,1}

(43.101)

=

max ln P(x6 |x2 )

(43.155a)

x6 ∈{0,1}

which leads to x2

mf2 →x2 (x2 )

b6 x

0 1

−0.6931 −0.2231

0 or 1 1

For the second message, we have mf2 →x6 (x6 ) = = 7.

n o ln f2 (x2 , x6 ) + mx2 →f2 (x2 ) x2 ∈{0,1} ( ) max

ln P(x6 |x2 ) + mx2 →f2 (x2 )

max

x2 ∈{0,1}

(43.155b)

node f 4 : This node has two neighbors, given by {x3 , x4 }: mf4 →x3 (x3 )

=

n

max

x4 ∈{0,1}

(43.99)

=

ln f4 (x3 , x4 ) + mx4 →f4 (x4 )

o

max ln P(x4 |x3 )

x4 ∈{0,1}

(43.156a)

which leads to x3

mf4 →x3 (x3 )

b4 x

0 1

−0.3567 −0.2231

0 0

For the second message, we have mf4 →x4 (x4 ) = = 8.

n o ln f4 (x3 , x4 ) + mx3 →f4 (x3 ) x3 ∈{0,1} ( ) max

max

x3 ∈{0,1}

ln P(x4 |x3 ) + mx3 →f4 (x3 )

(43.156b)

node f 5 : This node has two neighbors, given by {x3 , x5 }: mf5 →x3 (x3 )

= (43.100)

=

which leads to

max

n o ln f5 (x3 , x5 ) + mx5 →f5 (x5 )

x5 ∈{0,1}

max ln P(x5 |x3 )

x5 ∈{0,1}

(43.157a)

43.4 Message-Passing Algorithms

x3

mf5 →x3 (x3 )

b5 x

0 1

−0.5108 −0.1054

1 1

1789

For the second message, we have mf5 →x5 (x5 ) = =

o n ln f5 (x3 , x5 ) + mx3 →f5 (x3 ) x3 ∈{0,1} ( ) max

max

x3 ∈{0,1}

ln P(x5 |x3 ) + mx3 →f5 (x3 )

(43.157b)

Now we move to a third layer of nodes that are reached from the factor nodes {f1 , f2 , f4 , f5 }. These are nodes {x2 , x3 }. 9.

node x2 : This node has three neighbors, given by {f1 , f2 , f3 }: mx2 →f1 (x2 )

= (43.155a)

mx2 →f2 (x2 )

=

mf2 →x2 (x2 ) + mf3 →x2 (x2 )

=

mf1 →x2 (x2 ) + mf3 →x2 (x2 )

(43.12)

mx2 →f3 (x2 )

mf2 →x2 (x2 ) + mf3 →x2 (x2 ) (43.158a)

=

mf1 →x2 (x2 ) + mf3 →x2 (x2 )

(43.158b)

=

mf1 →x2 (x2 ) + mf2 →x2 (x2 )

(43.158c)

where both messages on the right-hand side of the above expression are known. Combining them gives

10.

x2

mx2 →f3 (x2 )

b1 x

b6 x

0 1

−0.8675 − 0.6931 = −1.5606 −1.2730 − 0.2231 = −1.4961

1 1

0 or 1 1

node x3 : This node has three neighbors, given by {f3 , f4 , f5 }: mx3 →f4 (x3 ) = mf3 →x3 (x3 ) + mf5 →x3 (x3 )

(43.159a)

mx3 →f5 (x3 ) = mf3 →x3 (x3 ) + mf4 →x3 (x3 )

(43.159b)

mx3 →f3 (x3 ) = mf4 →x3 (x3 ) + mf5 →x3 (x3 )

(43.159c)

where both messages on the right-hand side of the above expression are known. Combining them gives mx3 →f3 (x3 ), and from there the other two messages as well: x3

mx3 →f3 (x3 )

b4 x

b5 x

0 1

−0.3567 − 0.5108 = −0.8675 −0.2231 − 0.1054 = −0.3285

0 0

1 1

x3

mx3 →f4 (x3 )

b4 x

b5 x

0 1

−0.8675 − 0.5108 = −1.3783 −0.3285 − 0.1054 = −0.4339

0 0

1 1

1790

Undirected Graphs

x3

mx3 →f5 (x3 )

b4 x

b5 x

0 1

−0.8675 − 0.3567 = −1.2242 −0.3285 − 0.2231 = −0.5516

0 0

1 1

The only node left to consider is f3 . 11.

node f3 : This node has two neighbors, given by {x2 , x3 }: n o mf3 →x2 (x2 ) = max ln f3 (x2 , x3 ) + mx3 →f3 (x3 ) x3 ∈{0,1} ( ) (43.107a)

=

mf3 →x3 (x3 )

ln P(x3 |x2 ) + mx3 →f3 (x3 )

max

x3 ∈{0,1}

(43.160a)

n o ln f3 (x2 , x3 ) + mx2 →f3 (x2 ) x2 ∈{0,1} ( )

=

max

=

ln P(x3 |x2 ) + mx2 →f3 (x2 )

max

x2 ∈{0,1}

First, we evaluate the logarithm of P(x3 |x2 ): x2

x3

0 0 1 1

0 1 0 1

ln P(x3 |x2 ) −1.2040 −0.3567 −0.9163 −0.5108

and use these values to determine the above messages: x2

mf3 →x2 (x2 )

b3 x

b4 x

b5 x

0 1

−0.3285 − 0.3567 = −0.6852 −0.3285 − 0.5108 = −0.8393

1 1

0 0

1 1

x3

mf3 →x3 (x3 )

b2 x

b1 x

b6 x

0 1

−1.4961 − 0.9163 = −2.4124 −1.5606 − 0.3567 = −1.9173

1 0

1 1

0 or 1 1

We can now return and evaluate all the remaining messages: x2

mx2 →f1 (x2 )

b3 x

b4 x

b5 x

b6 x

0 1

−0.6852 − 0.6931 = −1.3783 −0.8393 − 0.2231 = −1.0624

1 1

0 0

1 1

0 or 1 1

x2

mx2 →f2 (x2 )

b1 x

b3 x

b4 x

b5 x

0 1

−0.6852 − 0.8675 = −1.5527 −0.8393 − 1.2730 = −2.1123

1 1

1 1

0 0

1 1

(43.160b)

43.4 Message-Passing Algorithms

x5

mf5 →x5 (x5 )

b3 x

b4 x

b5 x

0 1

−1.2242 − 0.9163 = −2.1405 −0.5516 − 0.1054 = −0.6570

0 1

0 0

1 1

x4

mf4 →x4 (x4 )

b3 x

b4 x

b5 x

0 1

−0.4339 − 0.2231 = −0.6570 −0.4339 − 1.6094 = −2.0433

1 1

0 0

1 1

x6

mf2 →x6 (x6 )

b1 x

b2 x

b3 x

b4 x

b5 x

0 1

−1.5527 − 0.6931 = −2.2458 −1.5527 − 0.6931 = −2.2458

1 1

0 0

1 1

0 0

1 1

x1

mf1 →x1 (x1 )

b2 x

b3 x

b4 x

b5 x

b6 x

0 1

−1.0624 − 1.3093 = −2.3717 −1.3783 − 0.8676 = −2.2459

1 0

1 1

0 0

1 1

0 or 1 1

1791

Now that all messages have been determined, we can consider determining the combination of random variables that results in the largest joint likelihood value. For this purpose, we select any of the variables, say, x3 , and apply the following two steps: (a)

First, we sum all the messages arriving at x3 from its factor neighbors to get P−3 (x3 ) = mf3→x3 (x3 ) + mf4→x3 (x3 ) + mf5→x3 (x3 )

(43.161)

Adding the messages gives x2

P−3 (x3 )

0 1

−2.4124 − 0.3567 − 0.5108 = −3.2799 −1.9173 − 0.2231 − 0.1054 = −2.2458

Maximizing the result over x3 we find the maximizing argument: b3 = 1 x

(43.162)

pmax = e−2.2458 = 0.1058

(43.163)

as well as the maximum value:

1792

Undirected Graphs

(b)

b 3 = 1 and for every neighboring factor of x3 we solve Second, we set x3 to x ( b2 = x

) b 3 = 1) + mx2 →f3 (x2 ) ln f3 (x2 , x

argmax x1 ,x2 ,x6 ∈{0,1}

)

( =

argmax x1 ,x2 ,x6 ∈{0,1}

ln P(b x3 = 1|x2 ) + mx2 →f3 (x2 )

( b 4 = argmax x

(43.164a)

) ln f4 (b x3 , x4 ) + mx4 →f4 (x4 )

x4 ∈{0,1}

= argmax ln P(x4 |b x3 = 1)

(43.164b)

x4 ∈{0,1}

( b 5 = argmax x

x5 ∈{0,1}

) ln f5 (b x3 = 1, x5 ) + mx5 →f5 (x5 )

= argmax ln P(x5 |b x3 = 1)

(43.164c)

x5 ∈{0,1}

Evaluating the expressions that need to be maximized, we get: x4 0 1

P(x4 |b x3 = 1)

x5

−0.2231 −1.6094

0 1

P(x5 |b x3 = 1) −2.3026 −0.1054

and x2

ln P(b x3 = 1|x2 ) + mx2 →f3 (x2 )

b1 x

b6 x

0 1

−0.3567 − 1.5606 = −1.9173 −0.5108 − 1.4961 = −2.0069

1 1

0 or 1 1

and we conclude that

x̂1 = 1,  x̂2 = 0,  x̂3 = 1,  x̂4 = 0,  x̂5 = 1,  x̂6 ∈ {0, 1}    (43.165)
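Since the graph involves only six Boolean variables, conclusion (43.165) is easy to confirm by exhaustive enumeration; a minimal sketch using the CPTs of Fig. 43.14 follows.

```python
import numpy as np
from itertools import product

# CPTs of Fig. 43.14 (rows index the conditioning variable, columns the child).
P1  = np.array([0.3, 0.7])
P21 = np.array([[0.1, 0.9], [0.6, 0.4]])
P32 = np.array([[0.3, 0.7], [0.4, 0.6]])
P43 = np.array([[0.7, 0.3], [0.8, 0.2]])
P53 = np.array([[0.4, 0.6], [0.1, 0.9]])
P62 = np.array([[0.5, 0.5], [0.2, 0.8]])

best, best_p = None, -1.0
for x in product([0, 1], repeat=6):
    x1, x2, x3, x4, x5, x6 = x
    p = P1[x1] * P21[x1, x2] * P32[x2, x3] * P43[x3, x4] * P53[x3, x5] * P62[x2, x6]
    if p > best_p:
        best, best_p = x, p

# Prints (1, 0, 1, 0, 1, 0) with p = 0.1058; the assignment with x6 = 1 attains
# the same value, which is why x6 can be either 0 or 1 in (43.165).
print(best, best_p)
```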

We can verify the validity of this result by computing the joint pmf from first principles – see Prob. 43.18.

Example 43.13 (Application to error-correcting codes) We discuss an application of the max-sum algorithm to iterative decoding. We consider the case of block parity-check codes. The intention is to transmit K information bits {0, 1} embedded in a block of N bits where N is larger than K. The additional bits provide redundancy and help recover from errors during transmission over noisy communication channels. Table 43.4 provides one example of a block parity-check code with K = 3 information bits and N = 6 total bits per block; the information bits are the first three bits and the parity-check bits are the last three bits. In this example, the codewords are constructed to satisfy the following three constraints:

(b1 + b2 + b4) mod 2 = 0    (43.166a)
(b1 + b3 + b5) mod 2 = 0    (43.166b)
(b2 + b3 + b6) mod 2 = 0    (43.166c)


Table 43.4 Example of a block parity-check code with K = 3 information bits and N = 6 total bits per block.

b1  b2  b3  b4  b5  b6
0   0   0   0   0   0
0   0   1   0   1   1
0   1   0   1   0   1
0   1   1   1   1   0
1   0   0   1   1   0
1   0   1   1   0   1
1   1   0   0   1   1
1   1   1   0   0   0

That is, the sum of the three bits in each case is even. Errors in transmission are discovered when any of the parity-check constraints are violated. For example, if codeword "001011" is transmitted and the received codeword is "001010," then the third constraint is violated. We assume that the channel is "memoryless" so that each bit is flipped independently of the other bits during transmission from 1 to 0 or from 0 to 1 with small probability ε > 0. The observed bits are denoted by {y1, . . . , y6} so that

P(bk = 0 | yk = 1) = ε    (43.167a)
P(bk = 1 | yk = 1) = 1 − ε    (43.167b)
P(bk = 0 | yk = 0) = 1 − ε    (43.167c)
P(bk = 1 | yk = 0) = ε    (43.167d)

One useful way to describe the operation of the block code is by means of a Tanner graph, which bears a strong resemblance to the factor graphs we described before (except that the factor functions now have a particular parity-check form, whereas they can assume more arbitrary forms under factor graphs). Tanner graphs are again bipartite graphs, where each of the N bits is represented by a circular node and the three constraints are represented by square nodes. The factor functions that are associated with the square nodes are the constraints (43.166a)–(43.166c). The N bits are represented by circular nodes, and they are connected to square nodes where parity-check calculations are performed – see Fig. 43.18. We therefore end up with a factor graph containing six hidden nodes {bk}, six observed nodes {yk}, and three factor nodes {fℓ}. We associate with the graph the following aggregate distribution:

P({bk}, {yk}) = (1/Z) { ∏_{k=1}^{6} P(bk|yk) } × f1(b1, b2, b4) × f2(b1, b3, b5) × f3(b2, b3, b6)    (43.168)

where Z is the partition function (or normalization factor). While the factor graph shown in Fig. 43.18 has cycles in it, the message-passing algorithm can still be applied and it has been observed in practice to lead to good (approximate) results – see Probs. 43.23 and 43.24.
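Because this particular code has only eight valid codewords, exact maximum-likelihood decoding can be carried out by enumeration, which provides a useful reference point for the loopy message-passing solution. The sketch below is a minimal illustration; the received word and the value ε = 0.1 are hypothetical choices matching the corrupted-transmission example mentioned above.

```python
import numpy as np
from itertools import product

eps = 0.1   # assumed bit-flip probability of the memoryless channel

# The eight valid codewords of Table 43.4: b4 = b1^b2, b5 = b1^b3, b6 = b2^b3.
codewords = [(b1, b2, b3, b1 ^ b2, b1 ^ b3, b2 ^ b3)
             for b1, b2, b3 in product([0, 1], repeat=3)]

def likelihood(codeword, received):
    """Product of the per-bit channel factors (43.167a)-(43.167d); over valid
    codewords this is proportional to the aggregate distribution (43.168)."""
    return np.prod([(1 - eps) if b == y else eps
                    for b, y in zip(codeword, received)])

received = (0, 0, 1, 0, 1, 0)          # "001011" transmitted with one bit flipped
best = max(codewords, key=lambda c: likelihood(c, received))
print(best, likelihood(best, received))   # recovers (0, 0, 1, 0, 1, 1)
```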

43.5 COMMENTARIES AND DISCUSSION

Undirected and factor graphs. Compared to Bayesian networks, undirected graphical models remove the directions on the edges and replace the marginal and conditional probability measures by broader potential functions, thus bringing up useful connections with statistical physics. There are many useful references on these types of models,


Figure 43.18 Tanner graph for a block parity-check code with K = 3 information bits and N = 6 block size, along with three parity constraints represented by the factor nodes. The {yk} are the noisy bits.

including the works by Jordan and Weiss (1994), Lauritzen (1996), Frey (1998), Jordan (1998), Edwards (2000), Wainwright and Jordan (2008), Koller and Friedman (2009), Borgelt, Steinbrecher, and Kruse (2009), and Grimmett (2018). Factor graphs extend undirected graphs by incorporating two types of nodes, and help decouple the regular graph nodes from their potential factors. They usually lead to tree structures over which exact inference can be performed by means of the sum-product and max-sum algorithms. Factor graphs were introduced in Kschischang, Frey, and Loeliger (2001) as a means to link two types of graphs: Bayesian networks (which are directed graphs) and Tanner graphs from coding theory (which employ factor nodes). The same reference describes the corresponding message-passing (sum-product and max-sum) algorithms. A review of factor graphs appears in Loeliger (2004). Belief propagation and message passing. The belief propagation algorithm was developed by Pearl (1982, 1986) and used to solve exact inference problems on treestructured directed graphs, such as computing the probability distributions for each node in the network in response to evidence. The algorithm was extended by Kim and Pearl (1983) to inference over polytrees. Although belief propagation was derived originally for acyclic graphs, it has been observed to provide good (approximate) performance even for graphs with cycles. When applied in this context, the algorithm is referred to as loopy belief propagation. It, however, need not converge when cycles are present, and there are many studies on sufficient conditions to ensure convergence, such as the works by Frey and Mackay (1998), Murphy, Weiss, and Jordan (1999), Weiss (2000), and Mooij and Kappen (2007). Belief propagation is a special case of the sum-product algorithm described in the body of the chapter. A good overview of belief propagation and undirected graphs appears in the book chapter by Yedida, Freeman, and Weiss (2003). The discussion in Example 43.13 on error-correcting codes is motivated by a case study from this reference. For further discussion on this type of application to iterative decoding, the reader may refer to Gallager (1968), Tanner (1981), McEliece, MacKay, and Cheung (1998), and Mackay (1999, 2003). In the body of the chapter, we derived two types of message-passing algorithms: the sum-product


and the max-sum methods. They were both derived for factor graphs, along the lines of Kschischang, Frey, and Loeliger (2001). We also explained how to transform DAGs and undirected graphs into factor graphs. Junction tree graphs. The sum-product and max-sum message-passing algorithms derived in the body of the chapter assume tree-structured factor graphs. If cycles are present, the algorithms need not converge. Nevertheless, message passing can be extended to arbitrary factor graph structures to yield exact inference solutions. The resulting extension is known as the junction tree algorithm. A junction tree is a tree of cliques constructed as follows. Starting from a DAG, we first moralize it by connecting all parents. This is illustrated by the red segment in the second plot of Fig. 43.19; all arrows are removed so that the resulting graph is now undirected. However, this graph exhibits cycles with at least four nodes, such as the cycle linking {x1 , x2 , x3 , x6 }. The next step involves adding a chord linking {x2 , x3 } in order to divide this cycle into two smaller cliques with three nodes each: {x1 , x2 , x3 } and {x2 , x3 , x6 }. The third plot in the figure shows the resulting “triangulated” undirected graph with five cliques; a triangulated graph is one where no cycles with four or more nodes exist without chords in them. Each clique is finally represented by a “super node” in the last plot. The first node originates from the clique involving {x1 , x2 , x3 }. This node is linked to two other “super nodes” originating from the cliques involving {x1 , x4 } and {x2 , x3 , x6 }. The rectangular nodes include the variables that are common to two connected “super nodes.” Observe that the junction graph has a tree structure. There are several extensions of message passing for inference over junction graphs, such as the Hugin algorithm and the Shafer–Shenoy algorithm – see, e.g., the treatments in Jordan and Weiss (1994), Jensen (1996), Cowell et al. (1999), Bach and Jordan (2002), Koller and Friedman (2009), Barber (2012), and the works by Shenoy and Shafer (1986, 1990), Lauritzen and Spiegelhalter (1988), Jensen, Olesen, and Andersen (1990), and Jensen, Lauritzen, and Olesen (1990).

(Figure 43.19 panels: directed acyclic graph; "moralized" undirected graph, which includes a cycle with at least 4 nodes; "triangulated" undirected graph, with one chord added and five cliques; junction tree, i.e., a tree of cliques such as {x1, x2, x3}, {x1, x4}, {x2, x3, x6}, {x4, x5}, and {x3, x6, x7}.)

Figure 43.19 The DAG on the left is first moralized and transformed into an

undirected graph. The graph is further triangulated by adding one chord, ending up with four cliques. The junction tree graph corresponds to the tree linking the cliques.

Hammersley–Clifford theorem. The theorem is one of the fundamental pillars of Markov random fields in statistics – see, e.g., Rozanov (1982), Clifford (1990), Chellappa and Jain (1993), Lauritzen (1996), Brooks et al. (2011), and Grimmett (2018). The proof we provide in Appendix 43.A follows closely the constructive argument given in the unpublished notes by Cheung (2008). A proof also appears in Grimmett (2018, ch. 7), including one for the equivalence of the local and global Markovian properties


studied in Appendix 43.B. The theorem is named after Hammersley and Clifford (1971), although they never published their note and the theorem actually first appeared in Besag (1974). According to the accounts given in Hammersley (1974) and Brooks et al. (2011, p. 53), the original work by Hammersley and Clifford (1971) was not published because the authors were not satisfied with the positivity requirement on the joint pdf and were hoping to relax the condition prior to submission. Interestingly, it was shown by Moussouris (1974) by means of a counterexample that the positivity condition is actually necessary.

Ising model. The Ising model (or, more accurately, the Lenz–Ising model) is widely used in statistical physics. We encountered one special case of it in (43.27). More generally, the free energy function is chosen as

E({xk}, {yk}) = −α ∑_{k=1}^{K} θkk xk yk − β ∑_{k=1}^{K} ∑_{ℓ∈Nk} θkℓ xk xℓ    (43.169)

with parameters {θkk , θk` }, where θk` is nonzero when (xk , x` ) are neighbors and zero otherwise. When the random variables {xk } are binary assuming the states +1 or −1, the free energy function (43.169) is also known as the “spin glass” energy function – see Mezard, Parisi, and Virasoro (1987) and Fischer and Hertz (1991). In this context, the state of each variable can be interpreted as corresponding to the orientation of the spin of a magnet. In statistical physics, the parameters {θkk } represent an external magnetic field, while the parameters {θk` } among neighboring agents indicate the type of their interaction: ferromagnetic when θk` > 0 and antiferromagnetic when θk` < 0. The one-dimensional version of the Ising model appeared in the doctoral dissertation by Ising (1925), on the suggestion of his doctoral advisor who proposed the model first in Lenz (1920), as acknowledged in Ising’s own article. The more challenging extension to two-dimensional lattices appeared later in Onsager (1944) by the Norwegian-American scientist Lars Onsager (1903–1976), who was awarded the Nobel Prize in Chemistry in 1968 – see the accounts by Brush (1967) and Niss (2005) on the history of the Lenz–Ising model.
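A minimal sketch of evaluating the energy (43.169) on a small lattice is shown below; the parameter choices θkk = θkℓ = 1 are hypothetical, and each neighboring pair is counted only once.

```python
import numpy as np

def ising_energy(x, y, alpha, beta):
    """Energy of a grid of +/-1 spins x with observations y, in the spirit of
    (43.169) with theta_kk = theta_kl = 1 and each lattice edge counted once."""
    coupling = np.sum(x[:, :-1] * x[:, 1:]) + np.sum(x[:-1, :] * x[1:, :])
    return -alpha * np.sum(x * y) - beta * coupling

rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=(3, 3))     # hypothetical observed spins
x = y.copy()                             # a candidate hidden configuration
print(ising_energy(x, y, alpha=1.0, beta=0.5))
```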

PROBLEMS

43.1 Consider four random variables satisfying the conditional independence relations x1 ⊥⊥ x3 | {x2, x4} and x2 ⊥⊥ x4 | {x1, x3}. Show that a directed graphical model does not exist that captures these relations. Show that an undirected graph exists that encodes these relations.
43.2 Consider an undirected graph G. Show that any two nodes in the graph that are not directly connected to each other are conditionally independent given all other nodes. This property is referred to as pairwise independence over undirected graphs.
43.3 True or false? (a) All independence relations encoded in a Bayesian network can be encoded in an undirected graph or Markov random field. (b) All independence relations encoded in an undirected graph or Markov random field can be encoded in a Bayesian network. (c) Not every probability distribution can be encoded in an undirected graph.
43.4 Consider three collections of random variables {X, Y, Z} over an undirected graph. Show that if X and Y are not separated given Z (i.e., there exists some path linking an element from X to an element from Y that does not pass through Z), then X and Y are conditionally dependent given Z for some probability distribution that factorizes over the given graph.


43.5 What is the partition function Z that corresponds to the use of the clique potentials (43.21)–(43.22)?
43.6 Refer to the undirected graph shown in the rightmost plot of Fig. 43.2. Assume first that the distribution P defined over the variables of the graph factorizes as

P(x1, x2, x3, x4, x5) = (1/Z) φC(x1, x2, x3) φC(x1, x3, x4) φC(x4, x5)

(a) Show from first principles that x2 ⊥⊥ {x4, x5} | {x1, x3}.
(b) Consider further the three sets: X = {x2}, Y = {x1, x3}, Z = {x4, x5}. Verify from first principles that X and Z are conditionally independent given Y.
43.7 Consider the problem of selecting k elements from N for k = 0, 1, . . . , N. Show that

(N choose 0) − (N choose 1) + (N choose 2) − (N choose 3) + . . . + (−1)^N (N choose N) = 0

43.8 Consider K random variables. Roughly, how many connected trees can be drawn linking these variables? How many connected undirected graphs? 43.9 The Hammersley–Clifford theorem (43.60) does not hold if the joint distribution P(x1 , x2 , . . . , xK ) is zero at some variable location. Find an example to illustrate this fact. 43.10 Argue that the following statements are equivalent: (a) A tree is an undirected graph with a single path connecting any two nodes. (b) A tree is an undirected acyclic graph. (c) A tree of K nodes is a connected undirected graph with K − 1 links. 43.11 True or false. Every tree can be treated as a bipartite graph. 43.12 True or false. Every regular node in a factor graph has only one link and that link connects it to a factor node. 43.13 Start from a tree or a polytree and transform them into factor graphs following the procedure described in the body of the chapter. Show that the factor graph will have a tree structure (i.e., it will only have a single path between any two nodes). 43.14 Consider the directed graph consisting of K nodes {x1 , x2 , . . . , xK } connected linearly in a chain with x1 feeding into x2 , x2 feeding into x3 , and so forth, until xK−1 feeds into xK , as shown in Fig. 43.17. Show how the max-sum algorithm simplifies in this case. 43.15 Refer to the undirected graph shown in Fig. 43.14. Apply the sum-product message-passing algorithm to the corresponding factor graph and determine the marginal P(x2 |x3 = 1). 43.16 Refer to the undirected graph shown in Fig. 43.14. Apply the sum-product message-passing algorithm to the corresponding factor graph and determine the marginal P(x2 , x5 |x3 = 1). 43.17 Refer to the undirected graph shown in Fig. 43.14. Apply the sum-product message passing algorithm to the corresponding factor graph and determine the marginal P(x1 , x2 |x3 = 1, x5 = 0). 43.18 Refer to Example 43.12. Determine the joint pmf of the variables {x1 , . . . , x6 } from first principles by using the expression P(x1 , x2 , x3 , x4 , x5 , x6 ) = P(x1 )P(x2 |x1 )P(x3 |x2 )P(x4 |x3 )P(x5 |x3 )P(x6 |x2 ) and verify that it is maximized at the locations specified by (43.165). 43.19 Refer to the undirected graph shown in Fig. 43.14. Apply the max-sum messagepassing algorithm to the corresponding factor graph and determine the most likely value for x2 given the evidence x3 = 1. 43.20 Refer to the undirected graph shown in Fig. 43.14. Apply the max-sum messagepassing algorithm to the corresponding factor graph and determine the most likely value for {x1 , x2 , x4 , x5 , x6 } given the evidence x3 = 1.


43.21 Refer to the undirected graph shown in Fig. 43.14. Apply the max-sum message passing algorithm to the corresponding factor graph and determine the value of x2 that maximizes P(x2 |x3 = 1, x5 = 0) and P(x2 , x3 = 1, x5 = 0). 43.22 Refer to the undirected graph shown in Fig. 43.14 and assume all nodes are hidden. Apply the max-sum message-passing algorithm to the corresponding factor graph and determine the variables {x1 , x4 } that maximize P(x1 , x4 ). 43.23 Refer to expression (43.168) for the joint distribution over the factor graph given in Fig. 43.18. Assume the received codeword is {y 1 , . . . , y 6 } = 110,000 and  = 0.1. Use the max-sum message-passing algorithm to determine the most likely codeword {b1 , . . . , b6 }. 43.24 Refer to expression (43.168) for the joint distribution over the factor graph given in Fig. 43.18. Assume y 1 = 1, y 6 = 1, and  = 0.1. What is the most likely bit b3 ? 43.25 Consider an Ising model applied to the grid structure shown in Fig. 43.20 with K = 9 nodes and free energy function given by E(x1 , . . . , x9 ) = −

∑_{k=1}^{K−1} α xk xk+1 − ∑_{k=1}^{K} β xk

Draw the corresponding factor graph and identify the factor functions. Write down the sum-product message-passing algorithm and deduce expressions for the marginal distributions of the individual nodes.

Figure 43.20 A regular lattice grid of K = 9 binary variables assuming the values xk = ±1.

43.26 Derive relations (43.44a) and (43.44b) for the restricted Boltzmann machine.
43.27 Consider the hidden Markov model from Fig. 42.2 and the corresponding factor graph shown in Fig. 43.21. The variables {y1, y2, y3} are observed and their values are denoted by {y1, y2, y3}. (a) Determine the messages that result from applying the sum-product message-passing algorithm. (b) Assume π0 = 1/3 and π1 = 2/3. Let a01 = a10 = 0.1 and b01 = b10 = 0.1. The sequence {y1, y2, y3} = {0, 1, 0} is observed. What is the likelihood of z2 = 1 given these observations?
43.28 We continue with the setting of Prob. 43.27. Apply the max-sum algorithm and determine the most likely sequence {z1, z2, z3}.


Figure 43.21 A factor graph representation for the hidden Markov model from Fig. 42.2.

43.A PROOF OF THE HAMMERSLEY–CLIFFORD THEOREM

We establish in this appendix the validity of Theorem 43.1 by following the derivation given in the unpublished notes by Cheung (2008), adjusted to our notation and conventions. Another proof is given in Grimmett (2018, ch. 7); see also the arguments in Besag (1974), Moussouris (1974), and Lauritzen (1996). Assume first that the distribution P factorizes as shown in (43.60), and consider an arbitrary node xk with neighborhood Nk and complementary set Nk^c. Clearly, all nodes in the graph G are contained in the union:

G = {xk} ∪ Nk ∪ Nk^c    (43.170)

We will verify that the local Markov properties hold, i.e.,

P(xk | Nk, Nk^c) = P(xk | Nk)    (43.171)

as required by (43.54). Indeed, note that

P(xk | Nk) = P(xk, Nk)/P(Nk)    (43.172)

where the factors in the numerator and denominator can be obtained by marginalizing the joint pmf P(xk, Nk, Nk^c) over the relevant variables as follows:

P(xk | Nk) = [ ∑_{Nk^c} ∏_{C∈C} φC(xC) ] / [ ∑_{xk} ∑_{Nk^c} ∏_{C∈C} φC(xC) ]    (43.173)

The sum over Nk^c means marginalizing over the variables included in Nk^c. Let Ck denote the collection of cliques in C that contain xk (i.e., each clique in Ck will contain xk). Likewise, let Ck^c denote the remaining cliques, which do not contain xk. Then,

C = Ck ∪ Ck^c    (43.174)

and we can split the product of clique potentials as follows (the partition factor Z cancels out):

∏_{C∈C} φC(xC) = ( ∏_{C∈Ck} φC(xC) ) ( ∏_{C∈Ck^c} φC(xC) )    (43.175)

Moreover, if we select an arbitrary clique from Ck , then by definition this clique should contain xk along with neighbors of xk (this is because all nodes in the clique will be


connected to xk). This means that all nodes appearing in the clique set Ck will belong to Nk. It follows that we can write

P(xk | Nk) = [ ∑_{Nk^c} (∏_{C∈Ck} φC(xC)) (∏_{C∈Ck^c} φC(xC)) ] / [ ∑_{xk} ∑_{Nk^c} (∏_{C∈Ck} φC(xC)) (∏_{C∈Ck^c} φC(xC)) ]
           = (∏_{C∈Ck} φC(xC)) / ( ∑_{xk} ∏_{C∈Ck} φC(xC) )
           =(a)= (∏_{C∈C} φC(xC)) / ( ∑_{xk} ∏_{C∈C} φC(xC) )
           = P(x1, x2, . . . , xK) / P(Nk, Nk^c)
           = P(xk, Nk, Nk^c) / P(Nk, Nk^c)
           = P(xk | Nk, Nk^c)   (local Markov property)    (43.176)

as desired, and where in step (a) we added the same multiplicative factors in the numerator and denominator and replaced Ck by C. Conversely, we need to show that if the nodes of an undirected graph satisfy (43.55), then the factorization for P holds for some clique potentials that we are going to construct. Let X denote any subset of the nodes of G and split the graph into

G = X ∪ X^c    (43.177)

where X^c is the collection of the remaining nodes. The set X may or may not be a clique (i.e., the nodes in X need not be fully connected to each other). Let Y denote any subset of nodes in X, with Y^c its complement. We also have

G = Y ∪ Y^c    (43.178)

We associate with X the following positive function:

φ(X = X) ≜ ∏_{Y⊂X} P(Y = Y, Y^c = 0)^{α(Y)}    (43.179a)
α(Y) ≜ (−1)^{|X|−|Y|}   (exponent)    (43.179b)

where the product is over every possible subset Y of X. The above expression sets the likelihood of X assuming value X (which in turn implies Y assumes some value Y already determined by X since Y ⊂ X). The notation |X| is the cardinality of X (i.e., the number of variables in it). The exponent in the above expression is the difference in the number of entries between X and Y. The exponent α(Y) is either +1 (when the difference in the number of entries is even) or −1 (otherwise). It turns out that the function φ(X) satisfies two properties:

∏_{X⊂G} φ(X) = P(x1, x2, . . . , xK)    (43.180a)
φ(X) = 1, if X is not a clique    (43.180b)

where the product in the first equation is over all subsets X of the graph nodes. Once we establish these two properties we will be able to conclude that the joint pmf P(x1, . . . , xK) can be expressed as the product of clique potentials φ(·) over the graph. Let us consider (43.180a) first. Using (43.179a) we have that

∏_{X⊂G} φ(X) = ∏_{X⊂G} { ∏_{Y⊂X} P(Y = Y, Y^c = 0)^{α(Y)} }    (43.181)

43.A Proof of the Hammersley–Clifford Theorem

1801

We want to verify that the expression on the right-hand side simplifies to the joint pmf P(x1, . . . , xK). Consider a particular strict subset Y of nodes so that |Y| < |G|. We fix Y and examine how its probability factors influence the result. Later we examine the case when Y = G (the entire set of nodes). We denote the probability factor corresponding to Y by

β ≜ P(Y = Y, Y^c = 0)    (43.182)

The subset Y and its probability factor β will appear in several terms in the innermost product in (43.181), namely, in the following situations:

X = Y  ⟹  α(Y) = +1    (43.183a)
X = Y ∪ {one more node}  ⟹  α(Y) = −1    (43.183b)
X = Y ∪ {two more nodes}  ⟹  α(Y) = +1    (43.183c)
X = Y ∪ {three more nodes}  ⟹  α(Y) = −1    (43.183d)
⋮

In the first case, when X = Y, the factor appears as β. In the second case, there is a total of |Y^c| possibilities for selecting X; this is the number of combinations when choosing one element from among |Y^c| possibilities:

(|Y^c| choose 1) = |Y^c|    (43.184)

This is the number of nodes left outside Y. Adding one node at a time to the set Y results in a valid X, which is enlarged by a single element. For each of these |Y^c| cases, the probability factor corresponding to Y will appear as β^{−1} since α(Y) = −1. In the third case, when two more nodes are added to form X, there are a total of

(|Y^c| choose 2) combinations    (43.185)

For each of these cases, the probability factor corresponding to Y will appear as β since α(Y) = +1, and so forth. The subset Y will therefore contribute a product of factors of the following form in the expression (43.181):

β × (β^{−1})^{(|Y^c| choose 1)} × β^{(|Y^c| choose 2)} × (β^{−1})^{(|Y^c| choose 3)} × . . . × β^{±1}
  = β^{ 1 − (|Y^c| choose 1) + (|Y^c| choose 2) − (|Y^c| choose 3) + . . . + (−1)^{|Y^c|} (|Y^c| choose |Y^c|) }
  = β^S    (43.186)

where the exponent of the last factor β is +1 or −1 depending on whether |Y^c| is even or odd, respectively. The exponent S is a sum of combinatorial terms of the following form:

S = 1 − (|Y^c| choose 1) + (|Y^c| choose 2) − (|Y^c| choose 3) + . . . + (−1)^{|Y^c|} (|Y^c| choose |Y^c|) = 0    (43.187)

which can be verified to add up to zero – see Prob. 43.7. Therefore, every strict subset Y of G will contribute an aggregate factor to (43.181) that collapses to β^0 = 1. It remains to examine the case Y = G. Here, Y^c = ∅ (the empty set), and the factor contributed by Y is the desired joint pmf

β = P(Y = Y) = P(x1, x2, . . . , xK)    (43.188)


The set Y now appears only when X = Y, in which case α(Y) = 1, and we conclude that

$$
\prod_{X \subset G} \phi(X) = P(x_1, x_2, \ldots, x_K) \tag{43.189}
$$

as desired. Let us establish next result (43.180b). When X is not a clique, there exist at least two nodes within X that are not connected to each other. We denote these nodes by {x_a, x_b} ⊂ X. Let Y denote any subset of X and let W denote any subset of X' = X\{x_a, x_b} (with {x_a, x_b} excluded). Then, each possible Y can be one of four possibilities:

$$
Y = W, \qquad Y = W \cup \{x_a\}, \qquad Y = W \cup \{x_b\}, \qquad Y = W \cup \{x_a, x_b\} \tag{43.190}
$$

That is, the subset Y excludes both {x_a, x_b}, includes both of them, or includes one of them. The corresponding exponents α(Y) are given by

$$
\begin{aligned}
\alpha(W) &= (-1)^{|X|-|W|} \;\stackrel{\Delta}{=}\; \alpha \qquad &\text{(43.191a)}\\
\alpha(W \cup \{x_a\}) &= (-1)^{|X|-|W|-1} = -(-1)^{|X|-|W|} = -\alpha \qquad &\text{(43.191b)}\\
\alpha(W \cup \{x_b\}) &= (-1)^{|X|-|W|-1} = -(-1)^{|X|-|W|} = -\alpha \qquad &\text{(43.191c)}\\
\alpha(W \cup \{x_a, x_b\}) &= (-1)^{|X|-|W|-2} = (-1)^{|X|-|W|} = \alpha \qquad &\text{(43.191d)}
\end{aligned}
$$

where α is either +1 or −1 depending on whether the difference |X| − |W| is even or odd. Then, we can write

$$
\prod_{Y \subset X} \Big[\, P(\mathbf{Y} = Y,\; \mathbf{Y}^c = 0) \,\Big]^{\alpha(Y)}
= \prod_{W \subset X'} \Big\{
\big[P(\mathbf{W} = W,\; \mathbf{W}^c = 0)\big]^{\alpha} \times
\big[P(\mathbf{W} = W,\; \mathbf{x}_a = x_a,\; (\mathbf{W} \cup \{\mathbf{x}_a\})^c = 0)\big]^{-\alpha} \times
\big[P(\mathbf{W} = W,\; \mathbf{x}_b = x_b,\; (\mathbf{W} \cup \{\mathbf{x}_b\})^c = 0)\big]^{-\alpha} \times
\big[P(\mathbf{W} = W,\; \mathbf{x}_a = x_a,\; \mathbf{x}_b = x_b,\; (\mathbf{W} \cup \{\mathbf{x}_a, \mathbf{x}_b\})^c = 0)\big]^{\alpha}
\Big\} \tag{43.192}
$$

or, equivalently,

$$
\prod_{Y \subset X} \Big[\, P(\mathbf{Y} = Y,\; \mathbf{Y}^c = 0) \,\Big]^{\alpha(Y)}
= \prod_{W \subset X'} \left\{
\frac{P(\mathbf{W} = W,\; \mathbf{W}^c = 0)\; P(\mathbf{W} = W,\; \mathbf{x}_a = x_a,\; \mathbf{x}_b = x_b,\; (\mathbf{W} \cup \{\mathbf{x}_a, \mathbf{x}_b\})^c = 0)}
{P(\mathbf{W} = W,\; \mathbf{x}_a = x_a,\; (\mathbf{W} \cup \{\mathbf{x}_a\})^c = 0)\; P(\mathbf{W} = W,\; \mathbf{x}_b = x_b,\; (\mathbf{W} \cup \{\mathbf{x}_b\})^c = 0)}
\right\}^{\alpha} \tag{43.193}
$$


The above ratio simplifies to 1, as desired, since from the Bayes rule:

$$
\begin{aligned}
&\frac{P(\mathbf{W} = W,\; \mathbf{W}^c = 0)}{P(\mathbf{W} = W,\; \mathbf{x}_a = x_a,\; (\mathbf{W} \cup \{\mathbf{x}_a\})^c = 0)}\\[4pt]
&\overset{(a)}{=} \frac{P(\mathbf{W} = W,\; \mathbf{x}_a = 0,\; \mathbf{x}_b = 0,\; (\mathbf{W} \cup \{\mathbf{x}_a, \mathbf{x}_b\})^c = 0)}{P(\mathbf{W} = W,\; \mathbf{x}_a = x_a,\; \mathbf{x}_b = 0,\; (\mathbf{W} \cup \{\mathbf{x}_a, \mathbf{x}_b\})^c = 0)}\\[4pt]
&\overset{(b)}{=} \frac{P(\mathbf{x}_a = 0 \,|\, \mathbf{W} = W,\; \mathbf{x}_b = 0,\; (\mathbf{W} \cup \{\mathbf{x}_a, \mathbf{x}_b\})^c = 0)}{P(\mathbf{x}_a = x_a \,|\, \mathbf{W} = W,\; \mathbf{x}_b = 0,\; (\mathbf{W} \cup \{\mathbf{x}_a, \mathbf{x}_b\})^c = 0)}\\[4pt]
&\overset{(c)}{=} \frac{P(\mathbf{x}_a = 0 \,|\, \mathbf{W} = W,\; \mathbf{x}_b = x_b,\; (\mathbf{W} \cup \{\mathbf{x}_a, \mathbf{x}_b\})^c = 0)}{P(\mathbf{x}_a = x_a \,|\, \mathbf{W} = W,\; \mathbf{x}_b = x_b,\; (\mathbf{W} \cup \{\mathbf{x}_a, \mathbf{x}_b\})^c = 0)}\\[4pt]
&\overset{(d)}{=} \frac{P(\mathbf{W} = W,\; \mathbf{x}_b = x_b,\; (\mathbf{W} \cup \{\mathbf{x}_b\})^c = 0)}{P(\mathbf{W} = W,\; \mathbf{x}_a = x_a,\; \mathbf{x}_b = x_b,\; (\mathbf{W} \cup \{\mathbf{x}_a, \mathbf{x}_b\})^c = 0)}
\end{aligned} \tag{43.194}
$$

where step (a) is because W c includes {xa , xb }, step (b) applies the Bayes rule, and step (c) replaces xb = 0 by xb = xb for any xb since xa and xb are conditionally independent of each other given all other nodes in the graph (recall that xa and xb are disconnected). Step (d) applies the Bayes rule again. Substituting (43.194) into (43.193) shows that φ(X) = 1 when X is not a clique.

43.B  EQUIVALENCE OF MARKOVIAN PROPERTIES

We establish in this appendix the validity of Corollary 43.1 following arguments similar to Grimmett (2018, proposition 7.13). To begin with, if X ⊥⊥ Y | Z for any subsets {X, Y} that can only be linked through nodes in Z, then it is straightforward to conclude that x_k is independent of any other node in the graph conditioned on knowledge of its neighborhood. This is because any link from x_k to these other nodes will need to pass through the neighborhood of x_k. The reverse direction is more demanding. Assume the local Markovian property (43.55) holds. Then, we know from the result of Theorem 43.1 that the joint pmf for the variables in the graph admits a factorization of the following form over some clique set C:

$$
P(x_1, x_2, \ldots, x_K) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \phi_C(x_C) \tag{43.195}
$$

Now consider three node subsets {X, Y, Z} such that nodes in X and Y are only linked by trajectories that pass through Z. This means that nodes in X and nodes in Y cannot be immediate neighbors. Therefore, if we consider any arbitrary node from X and any arbitrary node from Y, then they cannot belong to the same clique. It follows that we can split the cliques in C into two subsets:

$$
\mathcal{C}_x = \text{cliques that do not contain any nodes from } Y \tag{43.196a}
$$
$$
\mathcal{C}_y = \text{cliques that do not contain any nodes from } X \tag{43.196b}
$$

The clique potentials over C_x will generally include variables from X or Z (but not from Y), while the clique potentials over C_y will generally include variables from Y or Z (but not from X). It follows that we can write

$$
P(x_1, x_2, \ldots, x_K) \;\propto\; \left\{ \prod_{C \in \mathcal{C}_x} \phi_C(x_C) \right\} \left\{ \prod_{C \in \mathcal{C}_y} \phi_C(x_C) \right\} \tag{43.197}
$$


Moreover, if we denote all other nodes in the graph besides those in X ∪ Y by

$$
G^c = G \backslash \{X \cup Y\} \tag{43.198}
$$

then this complementary set will include the nodes in Z and any other nodes not accounted for in X ∪ Y ∪ Z. These latter nodes cannot connect nodes from X and Y because otherwise there would be trajectories linking X and Y that do not go through Z. Therefore, these latter nodes cannot appear in cliques together with nodes from both X and Y; a clique can only contain nodes from X or nodes from Y but not from both. For this reason, we can decompose G^c into three sets:

$$
G^c = Z \cup X^a \cup Y^a \tag{43.199}
$$

where X^a (likewise, Y^a) denotes those nodes whose potentials can only involve nodes from X (Y). In this way, the arguments of the joint pmf over the graph can be written as

$$
P(x_1, x_2, \ldots, x_K) = P(X, Z, Y, X^a, Y^a) \tag{43.200}
$$

and, hence (we are denoting the realization for Z by Z' to avoid confusion with the partition function Z):

$$
\begin{aligned}
P(X, Y \,|\, Z = Z') &= P(X, Y, Z = Z')/P(Z = Z') \;\propto\; P(X, Y, Z = Z')\\[2pt]
&\overset{(a)}{\propto} \sum_{X^a,\, Y^a} P(X, Y, Z = Z', X^a, Y^a)\\[2pt]
&\;\propto\; \sum_{X^a,\, Y^a} \left\{ \prod_{C \in \mathcal{C}_x} \phi_C(x_C) \right\} \left\{ \prod_{C \in \mathcal{C}_y} \phi_C(x_C) \right\}\\[2pt]
&\overset{(b)}{=} \underbrace{\left\{ \sum_{X^a} \prod_{C \in \mathcal{C}_x} \phi_C(x_C) \right\}}_{\psi(X,\; Z = Z')}\;
\underbrace{\left\{ \sum_{Y^a} \prod_{C \in \mathcal{C}_y} \phi_C(x_C) \right\}}_{\psi(Y,\; Z = Z')}\\[2pt]
&= \psi(X, Z = Z')\, \psi(Y, Z = Z')
\end{aligned} \tag{43.201}
$$

where in step (a) we are marginalizing over {X^a, Y^a} and in step (b) we split the sums since variables in X^a can only appear in the product of potentials over C_x; likewise for Y^a. The last line expresses the conditional pmf of X and Y given Z as the product of two functions: one depends on X and Z' while the second depends on Y and Z'. We conclude that X and Y are conditionally independent given Z.

REFERENCES Bach, F. and M. Jordan (2002), “Thin junction trees,” Proc. Advances Neural Information Processing Systems (NIPS), pp. 1–8, Vancouver. Barber, D. (2012), Bayesian Reasoning and Machine Learning, Cambridge University Press. Besag, J. (1974), “Spatial interaction and the statistical analysis of lattice systems,” J. Roy. Statist. Soc., Ser. B, vol. 36, no. 2, pp. 192–236. Borgelt, C., M. Steinbrecher, and R. Kruse (2009), Graphical Models: Representations for Learning, Reasoning, and Data Mining, 2nd ed., Wiley. Brooks, S., A. Gelman, G. Jones, and X.-L. Meng, editors (2011), Handbook of Markov Chain Monte Carlo, Chapman & Hall.


Brush, S. G. (1967), “History of the Lenz–Ising model,” Rev. Modern Phys., vol. 39, pp. 883–893. Chellappa, R. and A. Jain (1993), Markov Random Fields: Theory and Applications, Academic Press. Cheung, S. (2008), “Proof of Hammersley–Clifford theorem,” unpublished note at http://read.pudn.com/downloads329/ebook/1447911/Hammersley-Clifford_Theorem.pdf. Clifford, P. (1990), “Markov random fields in statistics,” in Disorder in Physical Systems, G. R. Grimmett and D. J. A. Welsh, editors, pp. 19–32, Oxford University Press. Cowell, R. G., P. Dawid, S. L. Lauritzen, and D. J. Spiegelhalter, (1999), Probabilistic Networks and Expert Systems, Springer. Edwards, D. (2000), Introduction to Graphical Modeling, 2nd ed., Springer. Fischer, K. H. and J. A. Hertz (1991) Spin Glasses, Cambridge University Press. Frey, B. J. (1998), Graphical Models for Machine Learning, MIT Press. Frey, B. J. and D. J. C. Mackay (1998), “A revolution: Belief propagation in graphs with cycles,” Proc. Advances Neural Information Processing Systems (NIPS), pp. 1–7, Denver, CO. Gallager, R. G. (1968), Information Theory and Reliable Communication, Wiley. Grimmett, G. (2018), Probability on Graphs, 2nd ed., Cambridge University Press. Hammersley, J. M. (1974), “Discussion of Mr. Besag’s paper,” J. Roy. Statist. Soci. Ser. B, vol. 36, pp. 230–231. Hammersley, J. M. and P. Clifford (1971), “Markov fields on finite graphs and lattices,” unpublished notes, available at: www.statslab.cam.ac.uk/∼grg/books/hammfest/ hamm-cliff.pdf. Ising, E. (1925), “Contribution to the theory of ferromagnetism,” Zeitschrift fur Physik, vol. XXXI. Jensen, F. (1996), An Introduction to Bayesian Networks, Springer. Jensen, F. V., S. L. Lauritzen, and K. G. Olesen (1990), “Bayesian updating in causal probabilistic networks by local computation,” Comput. Statist. Q., vol. 4, pp. 269– 282. Jensen, F. V., K. G. Olesen, and S. K. Andersen (1990), “An algebra of Bayesian belief universes for knowledge-based systems,” Networks, vol. 20, no. 5, pp. 637–659. Jordan, M. I. (1998), Learning in Graphical Models, MIT Press. Jordan, M. I. and Y. Weiss (1994), Graphical Models: Probabilistic Inference, 2nd ed., Addison Wesley. Kim, J. H. and J. Pearl (1983), “A computational model for combined causal and diagnostic reasoning in inference systems,” Proc. Int. Joint Conf. Artificial Intelligence (IJCAI), pp. 190–193, Karlsruhe. Koller, D. and N. Friedman (2009), Probabilistic Graphical Models: Principles and Techniques, MIT Press. Kschischang, F. R., B. J. Frey, and H. A. Loeliger (2001), “Factor graphs and the sum-product algorithm,” IEEE Trans. Inf. Theory, vol. 47, pp. 498–519. Lauritzen, S. L. (1996), Graphical Models, Oxford University Press. Lauritzen, S. L. and D. J. Spiegelhalter (1988), “Local computations with probabilities on graphical structures and their application to expert systems (with discussion),” J. Roy. Statist. Soc. Ser. B, vol. 50, no. 2, pp. 157–224. Lenz, W. (1920), “Beitrag zum Verstandnis der magnetischen Erscheinungen in festen Korpern,” Physikalische Zeitschrift, vol. 21, pp. 613–615. Loeliger, H. A. (2004), “An introduction to factor graphs,” IEEE Signal Process. Mag., vol. 21, no. 1, pp. 28–41. Mackay, D. J. C. (1999), “Good error-correcting codes based on very sparse matrices,” IEEE Trans. Inf. Theory, vol. 45, pp. 399–431. MacKay, D. J. C. (2003), Information Theory, Inference, and Learning Algorithms, Cambridge University Press. McEliece, R. J., D. J. C. MacKay, and J. F. 
Cheung (1998), “Turbo decoding as an instance of Pearl’s belief propagation algorithm,” IEEE J. Sel. Topics Commun., vol. 16, no. 2, pp. 140–152.


Mezard, M., G. Parisi, and M. A. Virasoro (1987), Spin Glass Theory and Beyond, World Scientific. Mooij, J. M. and H. J. Kappen (2007), “Sufficient conditions for convergence of the sum-product algorithm,” IEEE Trans. Inf. Theory, vol. 53, no. 12, pp. 4422–4437. Moussouris, J. (1974), “Gibbs and Markov random systems with constraints,” J. Statist. Phys., vol. 10, pp. 11–33. Murphy, K., Y. Weiss, and M. I. Jordan (1999), “Loopy belief propagation for approximate inference: An empirical study,” Proc. Conf. Uncertainty in Artificial Intelligence (UAI), pp. 467–475, Stockholm. Niss, M. (2005), “History of the Lenz–Ising model 1920–1950: From ferromagnetic to cooperative phenomena,” Arch. Hist. Exact Sci., vol. 59, pp. 267–318. Onsager, L. (1944), “Crystal statistics, I. A two-dimensional model with an order– disorder transition,” Phys. Rev. Ser. II, vol. 65, nos. 3–4, pp. 117–149. Pearl, J. (1982), “Reverend Bayes on inference engines: A distributed hierarchical approach,”Proc. Nat. Conf. Artificial Intelligence (AAAI), pp. 133–136, Pittsburgh, PA. Pearl, J. (1986), “Fusion, propagation, and structuring in belief networks,” Artif. Intell., vol. 29, no. 3, pp. 241–288. Rozanov, Y. A. (1982), Markov Random Fields, Springer. Shenoy, P. P. and G. Shafer (1986), “Propagating belief functions using local computation,” IEEE Expert, vol. 1, no. 3, pp. 43–52. Shenoy, P. P. and G. Shafer (1990), “Axioms for probability and belief-function propagation,” in Uncertainty in Artificial Intelligence, R. D. Shachter, T. S. Levitt, J. F. Lemmer, and L. N. Kanal, editors, pp. 169–198, North-Holland. Tanner, R. M. (1981), “A recursive approach to low complexity codes,” IEEE Trans. Inf. Theory, vol. 27, pp. 533–547. Wainwright, M. J. and M. I. Jordan (2008). Graphical Models, Exponential Families, and Variational Inference, Foundations and Trends in Machine Learning, NOW Publishers, vol. 1, nos. 1–2, pp. 1–305. Weiss, Y. (2000), “Correctness of local probability propagation in graphical models with loops,” Neural Comput., vol. 12, no. 1, pp. 1–41. Yedida, J. S., W. T. Freeman, and Y. Weiss (2003), “Understanding belief propagation and its generalizations,” in Exploring Artificial Intelligence in the New Millennium, G. Lakemeyer and B. Nebel, editors, pp. 239–269, Morgan Kaufmann.

44 Markov Decision Processes

Markov decision processes (MDPs) are at the core of reinforcement learning theory. Similar to Markov chains, MDPs involve an underlying Markovian process that evolves from one state to another, with the probability of visiting a new state being dependent on the most recent state. Different from Markov chains, MDPs involve both agents and actions taken by these agents. As a result, the next state is dependent on which action was chosen at the state preceding it. MDPs therefore provide a powerful framework to explore state spaces and to learn from actions and rewards. The presentation in this chapter defines MDPs and introduces several of their properties, including the concepts of state value functions and state–action value functions. Algorithms for computing these quantities are presented in this chapter, while algorithms for selecting actions optimally are discussed in the next chapter. In particular, we will derive methods used for policy evaluation, value iteration, and policy iteration. The results in these two chapters assume that the parameters that define the MDP model are known. In future chapters on reinforcement learning, starting from Chapter 46, we will devise algorithms that allow agents to learn the underlying model parameters.

44.1  MDP MODEL

The state variable can be continuous or discrete. There exist applications with continuous state spaces, as well as continuous action spaces. In our treatment, however, we will focus on the discrete case because it is easier to visualize and understand the concepts in this domain, while the conclusions can be extended to the continuous case with proper adjustments. We therefore study MDPs that involve a finite (though still possibly large) number of states and a finite number of actions, with the transition from one state to another occurring at discrete time instants.

44.1.1  Model Parameters

Formally, an MDP is modeled as a 4-tuple written as

M = (S, A, P, r)   (44.1)


where the letters S, A, P, and r refer to the following entities: (a) (State-space) S denotes the finite collection of states of size |S| (the cardinality of S). We denote an individual state by the letter s ∈ S.

(b) (Action set) A denotes the finite set of actions of size |A| (the cardinality of A). We denote an individual action by the letter a ∈ A.

(c) (Transition probability kernel) P : S × A × S → [0, 1] is a matrix consisting of the transition probabilities between states conditioned on the actions taken by the agent. Specifically, if we consider an MDP at state s, taking action a, and moving to state s', then the scalar P(s, a, s') represents the probability of reaching state s' given that the agent started from state s and selected action a. If desired, we can write this quantity using the conditional probability notation for added clarity:

P(s, a, s') = P(s' = s' | s = s, a = a) = P(s'|s, a)   (44.2)

where the boldface notation refers to the random nature of states and actions. The rightmost notation, P(s'|s, a), is a more compact representation. We refer to s as the state and to (s, a) as the state–action pair or q-state. Thus, we can interpret the transition from state s to state s' as involving two steps. First, while at state s, the agent interacting with the MDP selects an action a and the MDP moves to the "intermediate" state–action location (s, a). We explain below how the action is selected according to some policy π(a|s), which refers to the likelihood of selecting action a at state s. From there, the MDP moves to a random state s' according to the transition probability P(s, a, s') – see Fig. 44.1. The Markovian qualification in the title of an MDP refers to the fact that the conditional probability P(s'|s, a) of the future state, s', is only dependent on the most recent state, s, and its corresponding action, a. That is, if we attach a time index n to states and actions to highlight their evolution over time, then the Markovian property for MDPs means that:

P(s_n = s' | s_{n−1} = s, a_{n−1} = a, . . . , s_2 = s_2, a_2 = a_2, s_1 = s_1, a_1 = a_1) = P(s_n = s' | s_{n−1} = s, a_{n−1} = a)   (44.3)

(d) (Reward function) The letter r : S × A × S → IR denotes the reward function, which determines the amount of reward that is received by an agent when it takes an action and transitions from one state to another, namely,

r(s, a, s') = reward for transitioning from s to s' after taking action a   (44.4)

In this definition, the reward is a function of three elements: the current state, the action taken at that state, and the future state. There are situations where the reward may only be a function of either the origin state or the destination state, say, r(s) or r(s'). There are also situations when the reward is stochastic, for example, when it is subjected to some random perturbation. For this reason, we will sometimes write r(s, a, s'), using the boldface symbol r, to allow for this possibility.

Figure 44.1 Starting from a state s, the agent selects an action according to policy π(a|s) and moves to an intermediate q-state (s, a). Starting from this q-state, the agent ends up at state s' according to the transition probability P(s, a, s') and collects the reward value r(s, a, s').

We can represent the transition probability kernel and the reward function in the form of three-dimensional tensors with one layer for each possible action, as shown in Fig. 44.2. For each choice of an action by the agent, there is a matrix of probabilities mapping current states s to future states s', along with the corresponding matrix of reward values. We denote these matrices, corresponding to every action a ∈ A, by

P_a = [P(s, a, s')],  s, s' ∈ S   (|S| × |S|)   (44.5a)
R_a = [R(s, a, s')],  s, s' ∈ S   (|S| × |S|)   (44.5b)

These are square matrices, indexed by the action a. There is a total of |A| matrices P_a and |A| matrices R_a, one for each a.
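For readers who wish to experiment numerically, the tensor representation in (44.5a)–(44.5b) maps naturally onto three-dimensional arrays. The sketch below is a minimal illustration (not from the text): the toy MDP with three states, two actions, and its numbers are made-up assumptions; it stores P and R with one |S| × |S| layer per action and checks that each layer of P is row-stochastic.

```python
import numpy as np

# hypothetical toy MDP with |S| = 3 states and |A| = 2 actions
num_states, num_actions = 3, 2

# P[a, s, sp] = probability of landing in state sp after taking action a at state s
P = np.zeros((num_actions, num_states, num_states))
P[0] = [[0.8, 0.2, 0.0],
        [0.1, 0.8, 0.1],
        [0.0, 0.2, 0.8]]
P[1] = [[0.5, 0.5, 0.0],
        [0.0, 0.5, 0.5],
        [0.0, 0.0, 1.0]]

# R[a, s, sp] = reward r(s, a, s') collected during the transition
R = np.zeros((num_actions, num_states, num_states))
R[:, :, 2] = 1.0      # e.g., landing in state 2 pays +1 under any action

# each layer P_a must be right-stochastic: its rows sum to 1
assert np.allclose(P.sum(axis=2), 1.0)
```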

Figure 44.2 The transition probability kernel, P(s, a, s'), and the reward function, r(s, a, s'), can be represented as three-dimensional tensors, with one layer for each possible action. For each choice of an action by the agent, there is a matrix of probabilities mapping current states s to future states s', along with the corresponding matrix of reward values. The highlighted squares in the figure correspond to the values P(s_2, a_1, s'_4) and r(s_2, a_1, s'_4) when action a_1 is selected.

Policy function
We also associate a policy function with an MDP, denoted by the letter π : A × S → [0, 1]. This function represents the likelihood (probability) of selecting actions based on the agent's state, i.e.,

π(a|s) = probability of choosing action a when the agent is at state s   (44.6)

In this way, the policy of an MDP defines a distribution over the action space given the state, and helps determine how the agent interacts with its environment. As the agent moves around in the state space, the policy determines the likelihood of selecting specific actions by the agent. We will be assuming the policy π(a|s) to be Markovian and to depend only on the current state, s, and not on the history of states, i.e.,

π(a | s_n = s_n, s_{n−1} = s_{n−1}, . . . , s_1 = s_1) = π(a | s_n = s)   (44.7)

We will also be assuming the policy to be stationary, meaning that regardless of the time instant at which the MDP finds itself at a particular state s, the policy will continue to be the same π(a|s) at that time instant, i.e.,

π(a | s_n = s) = π(a | s_{n'} = s),  for any n, n'   (44.8)

Formulation (44.6) refers to stochastic policies. A special case of (44.6) corresponds to deterministic policies where there is a specific action, say a_s, associated with every state, s. We represent this situation by writing

π(a|s) = 1 if a = a_s, and π(a|s) = 0 if a ≠ a_s   (44.9)

This means that the agent selects action a_s with certainty when at state s. We show in Fig. 44.3 how the policy π(a|s) can be represented in matrix form, with each row providing the probabilities for selecting the various actions under the state corresponding to that row. We denote the |S| × |A| matrix by

Π = [π(a|s)]_{s,a},  s ∈ S, a ∈ A   (44.10)

where the (s, a)th entry of Π is the probability value given by π(a|s).

Figure 44.3 The policy π(a|s) can be represented in matrix form, with each row providing the probabilities for selecting the various actions under the state corresponding to that row. The highlighted row on the left, for example, corresponds to state s_2 and each entry in that row provides the likelihood of selecting the corresponding action at state s_2. When the policy is deterministic, only one action is available for each state and a single unit entry appears on each row.
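A policy matrix of the form (44.10) can be stored in the same way, as an |S| × |A| array whose rows sum to 1; sampling an action at a given state then amounts to drawing from the corresponding row. A minimal sketch (the uniform policy and the helper name sample_action are illustrative assumptions, not from the text):

```python
import numpy as np

num_states, num_actions = 3, 2

# Pi[s, a] = pi(a|s); a uniform policy is used here purely for illustration
Pi = np.full((num_states, num_actions), 1.0 / num_actions)
assert np.allclose(Pi.sum(axis=1), 1.0)   # each row is a distribution over actions

rng = np.random.default_rng(0)

def sample_action(s, Pi, rng):
    # draw an action index according to the row pi(.|s)
    return rng.choice(Pi.shape[1], p=Pi[s])

a = sample_action(0, Pi, rng)   # action chosen at state s = 0
```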

Example 44.1 (Playing a game over a grid) We illustrate the elements of an MDP by considering an example involving the displacement of an agent over a grid with obstacles and rewards. We refer to Fig. 44.4, which shows a grid with 16 squares labeled #1 through #16. Four squares are special; these are squares #4, #8, #11, and #16. Squares #8 (danger) and #16 (star) are terminal states. If the agent reaches one of these squares, the agent moves to an EXIT state and the game stops. We refer to the EXIT state as an absorbing state. The agent collects either a reward of +10 at square #16 or a reward of −10 at square #8. Squares #4 and #11 represent obstacles in the grid; the space occupied by these obstacles cannot be traversed by the agent. Each of the 14 remaining squares can be viewed as a valid state so that the state space for this game is given by:

s ∈ S = {1, 2, 3, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, EXIT},  |S| = 15   (44.11)

Figure 44.4 A Markov decision process is defined over a grid with 16 tiles. The actions available for the agent are A = {UP, DOWN, LEFT, RIGHT, STOP}. If the agent hits the boundaries of the grid, or the obstacle squares at #4 and #11, then it stays put in its location. The arrows in the figure at state s = 7 indicate that there is a 70% chance that the agent will move UPWARD and a 30% chance that the agent will move in a perpendicular direction (either right or left).

Note that EXIT refers to a fictitious state that is not shown explicitly in the grid diagram for the game. The actions by the agent at any valid state s ∈ S are defined as follows. At any of the nonterminal states, the agent chooses an action corresponding to one direction of motion, say, a ∈ {UP, DOWN, LEFT, RIGHT, STOP}. If, for example, these actions are chosen with equal probability, independently of the state, then this means that the policy that is followed by the agent is given by:

π(a|s) = 1/4 for a ∈ {UP, DOWN, RIGHT, LEFT}, and π(a = STOP|s) = 0,  for any s ∈ S\{8, 16, EXIT}   (44.12)

At the terminal states, s ∈ {8, 16}, the only action available is for the agent to exit and STOP the game so that

π(a = STOP|s) = 1 for s = 8, 16, and π(a = STOP|s) = 0 for s ∈ S\{8, 16}   (44.13)

There is no action associated with the EXIT state. Therefore, the set of actions for this game is given by

A = {UP, DOWN, LEFT, RIGHT, STOP},  |A| = 5   (44.14)

Note that, in general, not all actions are available at all states. For instance, in this example, the action a = STOP is only available at the terminal states, s ∈ {8, 16}. Of course, other policies are possible, with the values of π(a|s) depending on the state s.


Continuing with the example, we next note that the selection of an action by the agent at any of the nonterminal states does not automatically determine what the next state will be. This is because we will be introducing an element of randomness that is characteristic of MDPs. For example, the figure shows the ball in state s = 7. At that state, and according to (44.12), the agent can select one of four actions uniformly. Assume the agent chooses the action a = UP. It will not necessarily follow from this choice that the next state for the agent will be s = 10. This is because, in an MDP, randomness interferes with the transition between states. For instance, in this game scenario, one can envision that the tiles in the squares are not perfectly smooth so that even if the agent decides to move the ball upward, it may still get deflected and move either to the right (ending up at the terminal state s = 8) or to the left (ending up at state s = 6). The percentages next to the ball in state s = 7 in the figure are meant to indicate the likelihood with which the ball will end up moving to states s ∈ {6, 8, 10} after selecting the action a = UP. As the numbers show, we are assuming that the probability of moving in a different direction than the intended upward direction is 30%, split equally between two opposing directions. For simplicity, we will be assuming in this game that deflections can only occur along the direction that is perpendicular to the intended direction. In other words, if the agent decides to move upward, then the deflections can only be to the right or to the left, and not downward. However, if the chosen direction of motion by the agent happens to hit a wall or the obstacles located at squares #4 and #11, then the agent will remain in its current state. For instance, if the agent happens to be in state s = 10 and selects action a = LEFT, then the next state continues to be s = 10 because of the obstacle at location #11. Likewise, if the agent happens to be in state s = 2 and selects action a = DOWN, then the next state continues to be s = 2 because of the wall around the grid. Based on this description of the game, we can now construct the transition probabilities P(s, a, s0 ) for all possible state transitions. For example, let us refer to Fig. 44.5, which shows the agent in three different states, s ∈ {7, 12, 14}. The wide arrows indicate the actions chosen by the agent at the respective locations, while the narrow arrows indicate the deflections that can occur at these locations. For instance, consider first state s = 7 where the action chosen by the agent is a = UP. Then, the next state s0 can be one of three possibilities, namely, s0 ∈ {6, 8, 10} with transition probabilities: P(s = 7, a = UP, s0 = 6) = 0.15 P(s = 7, a = UP, s0 = 8) = 0.15 P(s = 7, a = UP, s0 = 10) = 0.70

(44.15a) (44.15b) (44.15c)

Consider next state s = 12 where the action chosen by the agent is a = RIGHT. Then, the next state s0 can be one of the following three possibilities s0 ∈ {5, 12, 13} due to the obstacle at square #11, with transition probabilities: P(s = 12, a = RIGHT, s0 = 5) = 0.15 P(s = 12, a = RIGHT, s0 = 12) = 0.70 P(s = 12, a = RIGHT, s0 = 13) = 0.15

(44.16a) (44.16b) (44.16c)

Likewise, consider state s = 14 where the action chosen by the agent is a = RIGHT. Then, the next state s0 can be one of the following two possibilities s0 ∈ {14, 15} due to the obstacle at square #11 and the boundary of the grid, with transition probabilities: P(s = 14, a = RIGHT, s0 = 14) = 0.30 P(s = 14, a = RIGHT, s0 = 15) = 0.70

(44.17a) (44.17b)

We can perform similar calculations for all other states and compute the corresponding transition probabilities as a function of the actions the agent can take at these states.


Obviously, once the agent arrives at a terminal state, it moves to the EXIT state and the game stops. In that case, we have

P(s = 16, a = STOP, s' = EXIT) = 1   (44.18a)
P(s = 8, a = STOP, s' = EXIT) = 1   (44.18b)

and, likewise, regardless of any action by the agent:

P(s = EXIT, a = any, s' = EXIT) = 1   (44.19)

Figure 44.5 The agent is shown in three different states, s = 7, 12, 14. The wide arrows indicate the actions chosen by the agent at the respective locations, while the narrow arrows indicate the deflections that can occur at these locations.

The only element that is left to specify in order to complete the definition of the MDP is the reward function, r(s, a, s'). We assume in this example that, for all nonterminal states, the agent collects a constant reward value (also called a "living reward") that is a function of the "future" state, i.e., r(s, a, s') = r(s'). This constant value can be negative or positive (or even zero) and it represents the reward that the agent keeps collecting as it moves across the state space and before exiting; for example,

r(s') = −0.1,  s' ∈ S\{8, 16, EXIT}   (44.20)

and for the terminal states,

r(s' = 16) = +10,  r(s' = 8) = −10   (44.21)

while for the EXIT state we have

r(EXIT) = 0   (44.22)

As the agent moves around the grid, from one state to another, it accumulates rewards. If our objective is to maximize the cumulative reward that is received by the agent by the time the game stops (which we are going to learn how to do), then using a negative reward value in (44.20) for the nonterminal states has the effect of encouraging the agent to exit the game quickly in order to reduce the accumulation of negative rewards. In summary, we have identified the components {S, A, P, r} that define the MDP for this game problem, in addition to a policy π(a|s) used by the agent. We will use this example to illustrate other concepts in this and future chapters.
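The deflection rule used in this example is also easy to encode programmatically. The sketch below is illustrative only (the dictionaries and helper name are assumptions, not from the text): it lists the neighbors of state s = 7 as described above, assigns probability 0.70 to the intended direction and 0.15 to each perpendicular direction, and reproduces (44.15a)–(44.15c) for the action UP. State 7 has no adjacent walls or obstacles, so no bounce-back term is needed in this particular sketch.

```python
# Transition probabilities out of state s = 7 of the grid game.
# Neighbors of state 7 as described in the text: UP -> 10, DOWN -> 2,
# LEFT -> 6, RIGHT -> 8.  A move reaches the intended neighbor with
# probability 0.70 and is deflected to each perpendicular neighbor
# with probability 0.15.
neighbors = {"UP": 10, "DOWN": 2, "LEFT": 6, "RIGHT": 8}
perpendicular = {"UP": ("LEFT", "RIGHT"), "DOWN": ("LEFT", "RIGHT"),
                 "LEFT": ("UP", "DOWN"), "RIGHT": ("UP", "DOWN")}

def transition_row(action):
    # returns a dictionary {next_state: probability} for state s = 7
    row = {}
    row[neighbors[action]] = row.get(neighbors[action], 0.0) + 0.70
    for side in perpendicular[action]:
        row[neighbors[side]] = row.get(neighbors[side], 0.0) + 0.15
    return row

print(transition_row("UP"))   # {10: 0.7, 6: 0.15, 8: 0.15}, matching (44.15a)-(44.15c)
```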

44.1.2  State Markov Process

We now attach a time index n to the evolution of an MDP, M = (S, A, P, r). We denote the state at time n by s_n, the action selected at that time by a_n, and the next state by s_{n+1}. We further denote the reward collected during this transition by r(n), i.e.,

r(n) = r(s_n, a_n, s_{n+1})   (44.23)

Schematically, we represent the transition at time n by writing

s_n  --(a_n, r(n))-->  s_{n+1}   (44.24)

When it is not necessary to explicitly indicate the time stamp, we will write s for the current state, s' for the future state, a for the action taken at state s, and r(s, a, s') for the reward collected during this one-step transition:

s  --(a, r(s, a, s'))-->  s'   (44.25)

Now, in view of the Markovian properties (44.3) and (44.7), the sequence of states will satisfy a first-order Markovian property, namely,

P(s_n = s_n | s_{n−1} = s_{n−1}, . . . , s_1 = s_1) = P(s_n = s_n | s_{n−1} = s_{n−1})   (44.26)

We establish this fact, for convenience of notation, for time n = 3; the calculation can be easily extended for arbitrary n.

Proof of (44.26): Let α = P(s_3 = s_3 | s_2 = s_2, s_1 = s_1) denote the quantity on the left-hand side of (44.26) for n = 3. Using the Bayes rule (3.42c) and the Markovian property (44.3) we have:

α = Σ_{a_2, a_1 ∈ A} P(s_3 = s_3, a_2 = a_2, a_1 = a_1 | s_2 = s_2, s_1 = s_1)
(a)= Σ_{a_2, a_1 ∈ A} P(a_2 = a_2, a_1 = a_1 | s_2 = s_2, s_1 = s_1) × P(s_3 = s_3 | s_2 = s_2, a_2 = a_2, s_1 = s_1, a_1 = a_1)   (44.27)

where in step (a) we are marginalizing over the actions {a_1, a_2 ∈ A}. Continuing we get

α (b)= Σ_{a_2, a_1 ∈ A} P(a_2 = a_2 | s_2 = s_2, s_1 = s_1) × P(a_1 = a_1 | s_2 = s_2, a_2 = a_2, s_1 = s_1) × P(s_3 = s_3 | s_2 = s_2, a_2 = a_2)
     = Σ_{a_2, a_1 ∈ A} P(a_2 = a_2 | s_2 = s_2) P(a_1 = a_1 | s_1 = s_1) P(s_3 = s_3 | s_2 = s_2, a_2 = a_2)
     = Σ_{a_1 ∈ A} π(a_1|s_1) ( Σ_{a_2 ∈ A} π(a_2|s_2) P(s_3 = s_3 | s_2 = s_2, a_2 = a_2) )
 (c) = Σ_{a_1 ∈ A} π(a_1|s_1) ( Σ_{a_2 ∈ A} P(s_3 = s_3, a_2 = a_2 | s_2 = s_2) )   (44.28)

where in step (b) we are applying the Bayes rule, and in step (c) we are using the Bayes rule to combine the probabilities. It follows that

P(s_3 = s_3 | s_2 = s_2, s_1 = s_1) = Σ_{a_1 ∈ A} π(a_1|s_1) P(s_3 = s_3 | s_2 = s_2)
                                    = P(s_3 = s_3 | s_2 = s_2) ( Σ_{a_1 ∈ A} π(a_1|s_1) )
                                    = P(s_3 = s_3 | s_2 = s_2)   (44.29)

since the policy probabilities add up to 1, as claimed by (44.26). ∎

44.1.3  State Transition Probabilities

Now that we know that the state evolution in an MDP forms a Markov chain, we can evaluate the transition probabilities between states, similar to what was defined earlier in (38.14b) for hidden Markov models (HMMs). Here, however, we need to adjust the notation to avoid confusion with the symbols {a, A} already used for actions in this chapter. We denote the probability of transitioning from state s to state s' by the notation p^π_{s,s'}; the superscript π is to emphasize that this probability depends on the policy π(a|s) that is followed by the agent. Using the Bayes rule (3.42c), and marginalizing over actions, we get

p^π_{s,s'} = P(s_n = s' | s_{n−1} = s),  for any n
           = Σ_{a∈A} P(s_n = s', a_{n−1} = a | s_{n−1} = s)
           = Σ_{a∈A} P(a_{n−1} = a | s_{n−1} = s) P(s_n = s' | s_{n−1} = s, a_{n−1} = a)   (44.30)

where the first factor inside the sum equals π(a|s) and the second equals P(s, a, s'), so that

p^π_{s,s'} = P(s' = s' | s = s) = Σ_{a∈A} π(a|s) P(s, a, s')   (44.31)

Note that the expression for p^π_{s,s'} is independent of n since, by assumption, the probabilities π(a|s) and P(s, a, s') in the MDP model are independent of n. Note also that the expression for p^π_{s,s'} depends on the policy π(a|s). Therefore, different selections for π(a|s) can influence the evolution or dynamics of the MDP in a significant manner. We collect the transition probabilities (44.31) into the |S| × |S| matrix:

P^π = [ p^π_{s,s'} ],  s, s' ∈ S   (44.32)

Clearly, each row of P^π adds up to 1:

Σ_{s'∈S} p^π_{s,s'} = 1, for every s  ⟺  P^π 1 = 1   (44.33)
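Since (44.31) is just a weighted sum of the action layers P_a, the matrix P^π in (44.32) can be assembled in one line. A minimal sketch under the array conventions assumed in the earlier snippets (P of shape |A| × |S| × |S| and Pi of shape |S| × |A|; the toy numbers are made up for illustration):

```python
import numpy as np

def transition_matrix_under_policy(P, Pi):
    """P[a, s, sp] = P(s, a, s'); Pi[s, a] = pi(a|s).
    Returns the |S| x |S| matrix with entries sum_a pi(a|s) P(s, a, s'), as in (44.31)."""
    return np.einsum('sa,asp->sp', Pi, P)

# tiny illustrative example (arbitrary numbers, not from the text)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # action 0
              [[0.4, 0.6], [0.5, 0.5]]])     # action 1
Pi = np.array([[0.5, 0.5],
               [1.0, 0.0]])
P_pi = transition_matrix_under_policy(P, Pi)
assert np.allclose(P_pi.sum(axis=1), 1.0)    # rows sum to 1, cf. (44.33)
```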

Example 44.2 (State transitions for grid problem) We refer to the grid MDP game described in Example 44.1 and evaluate the state transition probabilities from s = 7 to all other states for the policy described in the example. First, it is clear that the agent can only transition to states {2, 6, 8, 10}. Therefore,

p^π_{7,s'} = 0,  for s' ∈ S\{2, 6, 8, 10}   (44.34)

Let us now consider the transition to state s' = 10. According to (44.31), we have

p^π_{7,10} = Σ_{a∈A} π(a|s = 7) P(s = 7, a, s' = 10)
           = π(a = UP|s = 7) P(s = 7, a = UP, s' = 10)
             + π(a = DOWN|s = 7) P(s = 7, a = DOWN, s' = 10)
             + π(a = RIGHT|s = 7) P(s = 7, a = RIGHT, s' = 10)
             + π(a = LEFT|s = 7) P(s = 7, a = LEFT, s' = 10)
  (44.12)  = (1/4 × 0.7) + (1/4 × 0) + (1/4 × 0.15) + (1/4 × 0.15)
           = 0.25   (44.35)

Likewise, for state s' = 2:

p^π_{7,2} = Σ_{a∈A} π(a|s = 7) P(s = 7, a, s' = 2)
          = π(a = UP|s = 7) P(s = 7, a = UP, s' = 2)
            + π(a = DOWN|s = 7) P(s = 7, a = DOWN, s' = 2)
            + π(a = RIGHT|s = 7) P(s = 7, a = RIGHT, s' = 2)
            + π(a = LEFT|s = 7) P(s = 7, a = LEFT, s' = 2)
  (44.12) = (1/4 × 0) + (1/4 × 0.7) + (1/4 × 0.15) + (1/4 × 0.15)
          = 0.25   (44.36)

while for state s' = 6:

p^π_{7,6} = Σ_{a∈A} π(a|s = 7) P(s = 7, a, s' = 6)
          = π(a = UP|s = 7) P(s = 7, a = UP, s' = 6)
            + π(a = DOWN|s = 7) P(s = 7, a = DOWN, s' = 6)
            + π(a = RIGHT|s = 7) P(s = 7, a = RIGHT, s' = 6)
            + π(a = LEFT|s = 7) P(s = 7, a = LEFT, s' = 6)
  (44.12) = (1/4 × 0.15) + (1/4 × 0.15) + (1/4 × 0) + (1/4 × 0.7)
          = 0.25   (44.37)

and for state s' = 8:

p^π_{7,8} = Σ_{a∈A} π(a|s = 7) P(s = 7, a, s' = 8)
          = π(a = UP|s = 7) P(s = 7, a = UP, s' = 8)
            + π(a = DOWN|s = 7) P(s = 7, a = DOWN, s' = 8)
            + π(a = RIGHT|s = 7) P(s = 7, a = RIGHT, s' = 8)
            + π(a = LEFT|s = 7) P(s = 7, a = LEFT, s' = 8)
  (44.12) = (1/4 × 0.15) + (1/4 × 0.15) + (1/4 × 0.7) + (1/4 × 0)
          = 0.25   (44.38)

Therefore, the seventh row in the state transition matrix P^π is given by

[P^π]_{7,:} = [ 0  1/4  0  0  0  1/4  0  1/4  0  1/4  0  0  0  0  0 ]   (44.39)

44.1.4  Expected One-Step Rewards

Using the probabilities p^π_{s,s'} computed via (44.31) we can evaluate the reward that an agent at state s is expected to collect during a transition. Recall that there are two sources of randomness in a transition: (1) the randomness in the selection of the action a at state s, and (2) the randomness in the next state s'. We therefore define the expected one-step reward (also called the expected immediate reward) as

r^π(s) = E_{π,P} { r(s, a, s') }   (44.40)

where the expectation is over the randomness in the action selection and the next state, i.e., over the distributions defined by π(a|s) and P(s, a, s'). This explains why the variables a and s' appear in boldface in (44.40), while the state s is known and written in normal font. We already know that the transition from state s to an arbitrary state s' rewards the agent by an amount r(s, a, s') when action a is taken. This reward occurs with probability P(s, a, s'). Therefore, conditioned on an action a at state s, the expected one-step reward denoted by

r(s|a) = E_P { r(s, a, s') | a = a }   (44.41)

is given by

r(s|a) = Σ_{s'∈S} P(s, a, s') r(s, a, s')   (44.42)

In this expression, we are averaging over all possible landing states, s', with the action fixed at a. Observe that r(s|a) is independent of the policy π(a|s) and, hence, we do not need to use a superscript π for r(s|a). The calculation in (44.42) corresponds to computing the inner product between the row corresponding to state s in the transition kernel, P(s, a, s'), and the row corresponding to the same state s in the reward kernel, r(s, a, s'), with both layers corresponding to action a – see Fig. 44.6.

Figure 44.6 The expected one-step reward, conditioned on an action a, is obtained by computing the inner product between the rows corresponding to state s in the transition kernel, P(s, a, s'), and in the reward kernel, r(s, a, s'), both in the layer corresponding to action a.

We can now use r(s|a) to evaluate the desired one-step reward (44.40) since

r^π(s) = Σ_{a∈A} π(a|s) r(s|a)   (44.43)

which, in view of (44.42), is equivalent to

r^π(s) = Σ_{a∈A} π(a|s) ( Σ_{s'∈S} P(s, a, s') r(s, a, s') )   (44.44)

Relation (44.43) shows that the expected one-step reward is obtained by averaging the conditional rewards (44.42) for state s across all action layers – see Fig. 44.6. For later reference, we collect the expected one-step rewards for all states into a vector of size S × 1:

r^π = [ r^π(s = 1)  r^π(s = 2)  r^π(s = 3)  · · ·  r^π(s = |S|) ]^T,   (S × 1)   (44.45)
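Expressions (44.42)–(44.43) likewise reduce to elementwise products and sums over those arrays, so the vector in (44.45) can be computed directly. A minimal sketch under the same assumed array conventions (P and R of shape |A| × |S| × |S|, Pi of shape |S| × |A|; the toy numbers are illustrative):

```python
import numpy as np

def expected_one_step_rewards(P, R, Pi):
    """Returns (r_sa, r_pi) where
       r_sa[s, a] = sum_{s'} P(s, a, s') r(s, a, s')   as in (44.42)
       r_pi[s]    = sum_a  pi(a|s) r_sa[s, a]          as in (44.43)."""
    r_sa = np.einsum('asp,asp->sa', P, R)   # inner products over the landing state s'
    r_pi = np.sum(Pi * r_sa, axis=1)        # average across action layers
    return r_sa, r_pi

# tiny illustration (arbitrary numbers, not the grid of Example 44.1)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.4, 0.6], [0.5, 0.5]]])
R = np.full((2, 2, 2), -0.1)
R[:, :, 1] = 1.0                            # +1 for landing in state 1, -0.1 otherwise
Pi = np.array([[0.5, 0.5], [1.0, 0.0]])
r_sa, r_pi = expected_one_step_rewards(P, R, Pi)
```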

Example 44.3 (Expected one-step rewards for the grid problem) We refer to the grid MDP game described in Example 44.1 and proceed to evaluate the expected one-step reward for state s = 7. We start by computing the conditional expected reward (44.42) for each of the actions that are possible at state s = 7. Thus, note that X r(s = 7|a = UP) = P(s = 7, a = UP, s0 ) r(s = 7, a = UP, s0 ) s0 ∈S

= P(s = 7, a = UP, s0 = 2) r(s = 7, a = UP, s0 = 2) + P(s = 7, a = UP, s0 = 6) r(s = 7, a = UP, s0 = 6) + P(s = 7, a = UP, s0 = 8) r(s = 7, a = UP, s0 = 8) + P(s = 7, a = UP, s0 = 10) r(s = 7, a = UP, s0 = 10) = (0 × −0.1) + (0.15 × −0.1) + (0.15 × −10) + (0.70 × −0.1) = −1.585 (44.46) while r(s = 7|a = DOWN) =

X

P(s = 7, a = DOWN, s0 ) r(s = 7, a = DOWN, s0 )

s0 ∈S

= P(s = 7, a = DOWN, s0 = 2) r(s = 7, a = DOWN, s0 = 2) + P(s = 7, a = DOWN, s0 = 6) r(s = 7, a = DOWN, s0 = 6) + P(s = 7, a = DOWN, s0 = 8) r(s = 7, a = DOWN, s0 = 8) + P(s = 7, a = DOWN, s0 = 10) r(s = 7, a = DOWN, s0 = 10) = (0.7 × −0.1) + (0.15 × −0.1) + (0.15 × −10) + (0 × −0.1) = −1.585 (44.47)


Likewise, r(s = 7|a = LEFT) =

X

P(s = 7, a = LEFT, s0 ) r(s = 7, a = LEFT, s0 )

s0 ∈S

= P(s = 7, a = LEFT, s0 = 2) r(s = 7, a = LEFT, s0 = 2) + P(s = 7, a = LEFT, s0 = 6) r(s = 7, a = LEFT, s0 = 6) + P(s = 7, a = LEFT, s0 = 8) r(s = 7, a = LEFT, s0 = 8) + P(s = 7, a = LEFT, s0 = 10) r(s = 7, a = LEFT, s0 = 10) = (0.15 × −0.1) + (0.7 × −0.1) + (0 × −10) + (0.15 × −0.1) = −0.1 (44.48) and r(s = 7|a = RIGHT) =

X

P(s = 7, a = RIGHT, s0 ) r(s = 7, a = RIGHT, s0 )

s0 ∈S

= P(s = 7, a = RIGHT, s0 = 2) r(s = 7, a = RIGHT, s0 = 2) + P(s = 7, a = RIGHT, s0 = 6) r(s = 7, a = RIGHT, s0 = 6) + P(s = 7, a = RIGHT, s0 = 8) r(s = 7, a = RIGHT, s0 = 8) + P(s = 7, a = RIGHT, s0 = 10) r(s = 7, a = RIGHT, s0 = 10) = (0.15 × −0.1) + (0 × −0.1) + (0.7 × −10) + (0.15 × −0.1) = −7.030 (44.49) It follows from (44.43) that the expected one-step reward for state s = 7 is given by rπ (s = 7) =

X

π(a|s = 7)rπ (s = 7|a)

a∈A

= π(a = UP|s = 7)rπ (s = 7|a = UP) + π(a = DOWN|s = 7)rπ (s = 7|a = DOWN) + π(a = RIGHT|s = 7)rπ (s = 7|a = RIGHT) + π(a = LEFT|s = 7)rπ (s = 7|a = LEFT)         1 1 1 1 × −1.585 + × −1.585 + × −7.03 + × −0.1 = 4 4 4 4 = −2.575 (44.50)

44.2  DISCOUNTED REWARDS

Besides expected immediate rewards, received during one-step transitions, we are also interested in evaluating the expected cumulative reward that an agent can earn well into the future or until the MDP reaches an absorbing state. This objective requires that we develop procedures to forecast future rewards. We start our exposition by defining cumulative rewards, with and without discounting. These definitions will lead to the concepts of state value function and state–action value function.


44.2.1  Cumulative Reward

To begin with, at every time n in an MDP, the agent launches from some state s_n, selects an action a_n, and lands at some state s_{n+1}. The agent receives a reward r(n) for the transition; these rewards accumulate over time, say, starting from n = 0:

s_0 --(a_0, r(0))--> s_1 --(a_1, r(1))--> s_2 --(a_2, r(2))--> s_3 . . .   (44.51)

The sequence of transitions defines a trajectory in the state space, starting from s_0. One useful measure to assess the efficacy of a trajectory is to evaluate its discounted cumulative reward, also called utility or simply the discounted reward, which we define as follows. Without loss of generality, we set the origin of time for the trajectory to n = 0, as is done in (44.51). Then, for any trajectory starting from some state s_0 = s at time n = 0, its discounted cumulative reward is defined by the following exponentially weighted sum:

U(s) = Σ_{n=0}^{∞} γ^n r(n)   (44.52)

where γ ∈ [0, 1) is a discount factor. We emphasize that the value of U(s) is dependent on the initial state, s. The discount factor γ has at least three useful roles:

(1) First, we will see later that its presence ensures well-posedness and the convergence of iterative algorithms for learning purposes – see, e.g., the argument prior to (44.73).

(2) Second, if γ were not present (i.e., if γ = 1), then it is not difficult to envision situations where the cumulative reward becomes unbounded. This can happen, for example, if the reward value is constant for all n. In contrast, if γ < 1 is present and if rewards are bounded, say, |r(n)| ≤ R for all n, then we obtain from (44.52) that the utility value will remain bounded for any initial state since

|U(s)| ≤ Σ_{n=0}^{∞} γ^n |r(n)| ≤ R/(1 − γ)   (44.53)

This does not mean that the choice γ = 1 is not useful. This choice is still useful, for example, when the MDP terminates, as in the grid game discussed in Example 44.1. In that example, the agent moves toward the EXIT state, and remains there indefinitely with zero rewards. These MDPs are called episodic MDPs, with finite-duration trajectories that end at an absorbing state. Such Markov chains with absorbing states toward which the chain converges with probability 1 are also known as absorbing chains. Therefore, for episodic MDPs with trajectories lasting at most N steps, we can define the cumulative reward of any trajectory starting from some state s_0 = s by using instead

U(s) = Σ_{n=0}^{N−1} r(n)   (44.54)

(3) Third, if we expand the discounted sum (44.52), we find that

U(s) = r(0) + γ r(1) + γ^2 r(2) + γ^3 r(3) + . . .   (44.55)

so that future rewards are discounted more heavily than earlier rewards. In this way, the incorporation of γ makes it more attractive for agents to favor state trajectories that collect rewards earlier rather than later, as the next example illustrates. Observe further that small values for γ tend to emphasize the contribution of rewards from the present and immediate future; this is because the powers of γ will decay rapidly to zero. We say in this case that small values of γ favor a "myopic evaluation" of cumulative rewards. In comparison, values of γ closer to 1 make the powers of γ decay more slowly. The effect is that the window of rewards now looks further into the future. We therefore say that values of γ closer to 1 favor a "far-sighted evaluation" of cumulative rewards.
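A truncated version of the discounted sum (44.52) is straightforward to evaluate numerically. The following sketch is illustrative only (the helper name and the numbers are assumptions, not from the text); it also shows how quickly the powers of γ suppress later rewards:

```python
import numpy as np

def discounted_return(rewards, gamma):
    # rewards[n] = r(n); returns sum_n gamma^n r(n), a truncated version of (44.52)
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, rewards))

print(discounted_return([1.0] * 50, gamma=0.9))   # approaches 1/(1 - 0.9) = 10 as the horizon grows
```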

Example 44.4 (Preference for collecting rewards earlier) Assume an agent in an MDP faces a choice between two possible trajectories with rewards:

(trajectory #1)  r(0) = 20,  r(n) = 1 for n ≥ 1   (44.56a)
(trajectory #2)  r(0) = 1,  r(1) = 1,  r(2) = 20,  r(n) = 1 for n ≥ 3   (44.56b)

where the main difference between both options is that the larger reward of 20 is delayed until instant n = 2 in the second possibility. Then, the first state trajectory would lead to the utility value:

U_1 = 20 + Σ_{n=1}^{∞} γ^n   (44.57)

while the second state trajectory would lead to

U_2 = 1 + γ + 20γ^2 + Σ_{n=3}^{∞} γ^n   (44.58)

Subtracting both utilities we find that

U_1 − U_2 = 19(1 − γ^2) > 0   (44.59)

so that the first state sequence, where the larger reward is collected earlier, will lead to a larger utility.
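The comparison in this example can be checked numerically; since the two trajectories differ only in their first three rewards, the gap equals 19(1 − γ²) exactly, in agreement with (44.59). A small illustrative check (the horizon N below is an arbitrary truncation, an assumption for the sketch):

```python
import numpy as np

def discounted_return(rewards, gamma):
    # sum_n gamma^n r(n) over a finite reward sequence
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, np.asarray(rewards, dtype=float)))

gamma, N = 0.9, 2000                           # long horizon approximates the infinite sums
traj1 = [20.0] + [1.0] * (N - 1)               # reward of 20 collected immediately
traj2 = [1.0, 1.0, 20.0] + [1.0] * (N - 3)     # reward of 20 delayed to n = 2
gap = discounted_return(traj1, gamma) - discounted_return(traj2, gamma)
print(gap, 19 * (1 - gamma**2))                # both evaluate to (approximately) 3.61
```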

44.2.2  Selecting a Policy

Now, given an MDP M = (S, A, P, r), there are several important design questions that we would like to address, such as:

(a) Starting from some initial state, what is the largest cumulative reward value that the agent can expect to receive? (b) Which policy π(a|s) should the agent follow to achieve this largest cumulative reward? (c) At every state, what is the best choice of action by the agent? We will answer these questions in some great detail in the following and in the next chapters. This section is only meant to motivate the policy design question by considering a contrived example for illustration purposes. Obviously, as the discussion in future sections and chapters will reveal, selecting optimal actions for general MDPs is more challenging than the example discussed here, and the design question will therefore need to be approached more systematically. Example 44.5 (Selecting an optimal policy) Consider a simplified game problem defined over the grid shown in Fig. 44.7. The figure shows two terminal states s ∈ {1, 5} with rewards r(1) = +5, r(5) = +15, and three additional states s ∈ {2, 3, 4} with constant reward r(s) = −1. The terminal squares lead to the EXIT state where the game stops. In the center states, the agent can select equally between two actions: a ∈ {RIGHT,LEFT}, and the motion occurs according to this action without any randomness. For example, if the agent is at state s = 3 and selects action a = RIGHT, then the next state is s = 4. The discount factor is set to γ = 0.5. The figure also provides a graphical representation for the MDP in terms of six nodes (representing the states) and arrows representing the transitions between these nodes. On each branch we show two values: the probability of that transition occurring and the reward that is collected by that action. If we examine the diagram of Fig. 44.7, it is clear by inspection that if the agent starts from state s = 4, then the largest cumulative reward it can expect is attained by moving to the right, where it lands in state s = 5 and collects a reward of +15 for that transition. From there, the agent moves into the EXIT state where the reward is continuously zero. It follows that the cumulative reward for this trajectory from state 4, written as U (4), is U (4) = 15 + (γ × 0) + (γ 2 × 0) + . . . = 15

(44.60)

Likewise, if the agent starts from state s = 3, then the largest cumulative reward it can expect is attained by moving to the right to state s = 4, from where we already know that the best action is again to move to the right. It follows that the cumulative reward for this trajectory from state 3, written as U (3), is U (3) = −1 + (γ × 15) + (γ 2 × 0) + . . . = 6.5

(44.61)

Finally, if the agent starts from state s = 2, then the largest cumulative reward it can expect is attained by moving to the left to state s = 1. In this case, its cumulative reward will be U (2) = 5 + (γ × 0) + (γ 2 × 0) + + . . . = 5 (44.62) If, while at state s = 2, the agent decides instead to move to the right to state s = 3, from where we already know that the best actions are to keep moving right, then the cumulative reward for this alternative trajectory from state 2 would be U (2) = −1 + (γ × −1) + (γ 2 × 15) + (γ 3 × 0) + . . . = 2.25

(44.63)

which is smaller than 5. Figure 44.8 summarizes these results by showing the optimal action at each of the center states, along with the resulting cumulative reward.


Figure 44.7 The figure shows two terminal states s ∈ {1, 5} with rewards {+5, +15} and three additional states s ∈ {2, 3, 4} with constant reward r(s) = −1. The agent can select one of two actions, a ∈ {RIGHT,LEFT}, and the motion occurs according to this action. The lower part of the figure provides a graphical representation for the evolution of the MDP in terms of six nodes (representing the states). On each branch we show two values: the probability of that transition occurring followed by the reward that is collected by that action.

Figure 44.8 Optimal actions at the states s ∈ {2, 3, 4}, along with the corresponding cumulative rewards.

44.3  POLICY EVALUATION

We defined in (44.52) the discounted cumulative reward for a particular trajectory starting from an initial state, s_0 = s. This computation assumes that the rewards that are collected during that trajectory are known. In general, starting from any state s_0 = s, there will exist a multitude of trajectories that can be followed by the agent. This is because, at any time and state, the agent selects actions randomly and the transitions to subsequent states also occur randomly. Each possible trajectory from state s will have its own cumulative reward denoted by:

U_1(s), U_2(s), U_3(s), . . .   (44.64)

for trajectories #1, #2, #3, and so forth. Each of these trajectories has a certain likelihood of occurring – see Fig. 44.9.

Figure 44.9 Multiple trajectories are possible for each initial state s_0 = s, with their discounted rewards denoted by U_k(s), k = 1, 2, 3, . . .. If we average the discounted rewards over all possible trajectories we arrive at the state value function for state s, which we denote by v^π(s).

If we do not know beforehand which one of these trajectories the agent will follow, then we need to assess the expected cumulative reward by computing the average over all possibilities. This is what the concept of state value function does. This value is denoted by v^π(s), with the superscript π used to emphasize the dependency on the policy π(a|s), and is defined as

v^π(s) = E_{π,P} { U(s) | s_0 = s }
       = E_{π,P} { Σ_{n=0}^{∞} γ^n r(n) | s_0 = s }
       = E_{π,P} { Σ_{n=0}^{∞} γ^n r(s_n, a_n, s_{n+1}) | s_0 = s }   (44.65)

where we are regarding the cumulative reward for state s as a random variable, denoted in boldface notation U(s). The expectation in (44.65) is over the distributions π(a|s) and P(s, a, s'), which are the two sources of randomness. The state s itself is fixed. We also refer to v^π(s) as the state value, i.e., the value of state s. If we let the integer τ ≥ 1 serve as an index for the trajectories, and if we let p(τ) denote the probability that trajectory #τ occurs, then finding (44.65) is equivalent to computing a sum of the form:

v^π(s) = Σ_τ p(τ) U_τ(s)   (44.66)

τ

This method of calculating v π (s) is tedious because it requires enumeration of all possible trajectories, determination of their discounted rewards, {Uτ (s)}, and computation of the likelihoods, {p(τ )}. We will discuss a more efficient approach based on the principle of dynamic programming, which we already encountered in Section 39.4 while discussing the Viterbi algorithm. Recall that dynamic programming is a technique that relies on a divide-and-conquer strategy, whereby the solution to a complex problem is constructed by recursively combining the optimal solutions to less-complex sub-problems.

44.3.1  State Value Function

Therefore, given an MDP with a specified policy π(a|s), the value of a state s is denoted by v^π(s) and it amounts to the expected cumulative reward that would result if the MDP starts from state s and acts continually thereafter according to the policy π(a|s). It is instructive to compare v^π(s) with the expected immediate reward r^π(s) from (44.43): the latter is the expected one-step reward for state s, whereas v^π(s) is the expected cumulative reward well into the future. By leveraging the Markovian property, the quantity v^π(s) can be evaluated by adding two components: the expected one-step reward from state s and the discounted cumulative reward from the next step onward, i.e.,

v^π(s) = r^π(s) + γ ( Σ_{s'∈S} p^π_{s,s'} v^π(s') )      (Poisson equation)   (44.67)

Relation (44.67) is known as the Poisson equation in the study of Markov chains and processes, especially when it is written in the linear equations form introduced later in (44.72). The above relation expresses the value of state s in terms of the values of states that are reachable in one step from s.

Proof of (44.67): Let s denote the initial state, a the action taken at state s, and s' the state following it. We establish the result in two ways to illustrate the main concepts. First, algebraically, we know from the definition (44.65) that

v^π(s) = E_{π,P} { U(s) | s_0 = s }
       = E_{π,P} { Σ_{n=0}^{∞} γ^n r(n) | s_0 = s }                                   (using (44.52))
       = E_{π,P} { r(0) | s_0 = s } + E_{π,P} { Σ_{n=1}^{∞} γ^n r(n) | s_0 = s }
       = E_{π,P} { r(s, a, s') } + γ E_{π,P} { Σ_{n=1}^{∞} γ^{n−1} r(n) | s_0 = s }
   (a) = r^π(s) + γ E_{π,P} { Σ_{n=0}^{∞} γ^n r(n + 1) | s_0 = s }
   (b) = r^π(s) + γ E_{s'} ( E_{π,P} { Σ_{n=0}^{∞} γ^n r(n + 1) | s_1 = s', s_0 = s } )
   (c) = r^π(s) + γ E_{s'} { v^π(s') | s_0 = s }
       = r^π(s) + γ ( Σ_{s'∈S} p^π_{s,s'} v^π(s') )   (44.68)

where step (a) is by a change of variables, step (b) uses the conditional mean property E x = E (E (x|y)) for two random variables {x, y}, and step (c) is because the probability of transitioning from state s to state s0 is given by pπs,s0 . The second proof is more intuitive. We already know that the expected reward for the first transition is rπ (s). This transition lands the agent at some state s0 ∈ S. There are multiple possibilities for s0 , and each can be reached from state s with probability pπs,s0 . Once at some state s0 , the expected cumulative reward from that state onward will be v π (s0 ). Accordingly, the last term in expression (44.67) is simply the average of the cumulative rewards from s0 since each v π (s0 ) occurs with probability pπs,s0 . The discounting by γ is because the second term in (44.67) between parentheses is the expected cumulative reward from the second transition onwards.

∎

Using the expressions for p^π_{s,s'} and r^π(s) from (44.31) and (44.44), we can expand (44.67) and rewrite the expression in terms of the original quantities {P, r, π(a|s)} as follows:

v^π(s) = Σ_{a∈A} π(a|s) ( Σ_{s'∈S} P(s, a, s') [ r(s, a, s') + γ v^π(s') ] )   (44.69)

This argument motivates the following equivalent characterization for the state value function (see Prob. 44.9):

v^π(s) = E_{π,P} { r(s, a, s') + γ v^π(s') | s = s }   (44.70)

This relation is referred to as the Bellman equation for the value function. Later, in (45.15a), we will present the corresponding Bellman optimality equation; the latter does not depend on the policy.


We are interested in evaluating the state value function v π (s) for all states s ∈ S. We refer to this task as the policy evaluation problem, i.e., the task of computing the values that would result for all states in the MDP if it follows a particular policy, π(a|s). On the face of it, the unknowns {v π (·)} appear on both sides of the above equation. However, these equations amount to a linear system of equations, as explained next.

Linear system of equations
The Poisson equation (44.67) holds for every state s ∈ S. We can rewrite it more compactly using the vector notation. In a manner similar to the one-step reward vector, r^π, defined earlier in (44.45), we collect the values of all states into the S × 1 vector:

v^π = [ v^π(s = 1)  v^π(s = 2)  v^π(s = 3)  · · ·  v^π(s = |S|) ]^T,   (S × 1)   (44.71)

We refer to v^π as the value vector. Then, expression (44.67) translates into the linear system of equations:

v^π = r^π + γ P^π v^π  ⟺  (I − γ P^π) v^π = r^π   (44.72)

These equations can be solved to arrive at v^π, which contains the values for all states. Although this procedure is straightforward, several technical issues deserve attention. First, the above linear system of equations has a unique solution, v^π, whenever γ ∈ [0, 1). This is because the matrix P^π is right-stochastic, as shown by (44.33), which means that its spectral radius is equal to 1 (recall Prob. 1.49). It then follows that, whenever γ ∈ [0, 1), the eigenvalues of γP^π are strictly inside the unit disc and the matrix I − γP^π is invertible so that

v^π = (I − γ P^π)^{−1} r^π   (44.73)

For small state spaces (i.e., small dimension |S|), these equations can be solved rather efficiently to recover v^π. One major challenge arises, however, when the size of the state space is enormous. In such cases, direct inversion of the matrix I − γP^π becomes intractable. We will discuss in Section 44.4 a useful parameterization procedure that ameliorates the challenge for high-dimensional state spaces. For now we continue with the policy evaluation procedure (44.73).

Example 44.6 (Policy evaluation for a simplified grid) If we apply construction (44.73) to the grid problem of Fig. 44.4, we would be able to assess the state value function for all its states. This calculation is more easily done by means of a computer simulation. Here, we illustrate the procedure by considering the simpler grid problem of Fig. 44.7. We also use this example to illustrate a situation corresponding to a unit discount factor,


i.e., γ = 1 (in which case I − γP^π is no longer invertible). We first list the model parameters for the grid problem of Fig. 44.7, namely,

S = {1, 2, 3, 4, 5, EXIT}                                        (44.74a)
A = {RIGHT, LEFT, STOP}                                          (44.74b)
r(s′ = 1) = +5,   r(s′ = 5) = +15                                (44.74c)
r(s′ = 2) = r(s′ = 3) = r(s′ = 4) = −1                           (44.74d)
r(EXIT) = 0                                                      (44.74e)
π(a = RIGHT|s) = π(a = LEFT|s) = 1/2,   s = 2, 3, 4              (44.74f)
π(a = STOP|s) = 1,   s = 1, 5                                    (44.74g)

where we used the fact that, for this problem, the reward function depends solely on the landing state, i.e., r(s, a, s′) = r(s′). Moreover, the transition probabilities are given by:

P(s = 2, a = RIGHT, s′ = 3) = 1                                  (44.75a)
P(s = 2, a = LEFT,  s′ = 1) = 1                                  (44.75b)
P(s = 3, a = RIGHT, s′ = 4) = 1                                  (44.75c)
P(s = 3, a = LEFT,  s′ = 2) = 1                                  (44.75d)
P(s = 4, a = RIGHT, s′ = 5) = 1                                  (44.75e)
P(s = 4, a = LEFT,  s′ = 3) = 1                                  (44.75f)
P(s = 5, a = "any", s′ = EXIT) = 1                               (44.75g)
P(s = 1, a = "any", s′ = EXIT) = 1                               (44.75h)
P(s = EXIT, a = "any", s′ = EXIT) = 1                            (44.75i)

with all other probabilities equal to zero. Let us now evaluate the state transition probabilities p^π_{s,s′} for all states. Note that most of the entries of p^π_{s,s′} will be zero since from any of the central states we can only transition to the adjacent states, while from the terminal states we can only transition to the EXIT state. We therefore evaluate the nonzero probabilities as follows:

p^π_{2,3} = Σ_{a∈A} π(a|s = 2) P(s = 2, a, s′ = 3)
          = π(a = RIGHT|s = 2) P(s = 2, a = RIGHT, s′ = 3) + π(a = LEFT|s = 2) P(s = 2, a = LEFT, s′ = 3)
          = (1/2) × 1 + (1/2) × 0
          = 1/2                                                  (44.76a)

and

p^π_{3,2} = Σ_{a∈A} π(a|s = 3) P(s = 3, a, s′ = 2)
          = π(a = RIGHT|s = 3) P(s = 3, a = RIGHT, s′ = 2) + π(a = LEFT|s = 3) P(s = 3, a = LEFT, s′ = 2)
          = (1/2) × 0 + (1/2) × 1
          = 1/2                                                  (44.76b)

Likewise,

p^π_{3,4} = p^π_{4,3} = p^π_{2,1} = p^π_{4,5} = 1/2              (44.76c)


so that

P^π = [ 0     0     0     0     0     1
        1/2   0     1/2   0     0     0
        0     1/2   0     1/2   0     0
        0     0     1/2   0     1/2   0
        0     0     0     0     0     1
        0     0     0     0     0     1 ]                        (44.77)

We next evaluate the one-step expected rewards. We start by computing the conditional expected rewards (44.42) for each of the actions that are possible at the various states. Thus, note that

r(s = 1|a = STOP) = Σ_{s′∈S} P(s = 1, a = STOP, s′) r(s = 1, a = STOP, s′)
                  = P(s = 1, a = STOP, s′ = EXIT) r(s = 1, a = STOP, s′ = EXIT)
                  = 1 × 0
                  = 0                                            (44.78)

while

r(s = 2|a = RIGHT) = Σ_{s′∈S} P(s = 2, a = RIGHT, s′) r(s = 2, a = RIGHT, s′)
                   = P(s = 2, a = RIGHT, s′ = 3) r(s = 2, a = RIGHT, s′ = 3)
                   = 1 × (−1)
                   = −1                                          (44.79a)

r(s = 2|a = LEFT) = Σ_{s′∈S} P(s = 2, a = LEFT, s′) r(s = 2, a = LEFT, s′)
                  = P(s = 2, a = LEFT, s′ = 1) r(s = 2, a = LEFT, s′ = 1)
                  = 1 × 5
                  = 5                                            (44.79b)

and, similarly,

r(s = 3|a = RIGHT) = −1                                          (44.79c)
r(s = 3|a = LEFT)  = −1                                          (44.79d)
r(s = 4|a = RIGHT) = 15                                          (44.79e)
r(s = 4|a = LEFT)  = −1                                          (44.79f)
r(s = 5|a = STOP)  = 0                                           (44.79g)


Continuing, we now use (44.43) to compute the expected one-step rewards for each state:

r^π(s = 1) = Σ_{a∈A} π(a|s = 1) r(s = 1|a)
           = π(a = STOP|s = 1) r(s = 1|a = STOP)
           = 1 × 0
           = 0                                                   (44.80a)

r^π(s = 2) = Σ_{a∈A} π(a|s = 2) r(s = 2|a)
           = π(a = RIGHT|s = 2) r(s = 2|a = RIGHT) + π(a = LEFT|s = 2) r(s = 2|a = LEFT)
           = (1/2) × (−1) + (1/2) × 5
           = 2                                                   (44.80b)

r^π(s = 3) = Σ_{a∈A} π(a|s = 3) r(s = 3|a)
           = π(a = RIGHT|s = 3) r(s = 3|a = RIGHT) + π(a = LEFT|s = 3) r(s = 3|a = LEFT)
           = (1/2) × (−1) + (1/2) × (−1)
           = −1                                                  (44.80c)

r^π(s = 4) = Σ_{a∈A} π(a|s = 4) r(s = 4|a)
           = π(a = RIGHT|s = 4) r(s = 4|a = RIGHT) + π(a = LEFT|s = 4) r(s = 4|a = LEFT)
           = (1/2) × 15 + (1/2) × (−1)
           = 7                                                   (44.80d)

and

r^π(s = 5) = Σ_{a∈A} π(a|s = 5) r(s = 5|a)
           = π(a = STOP|s = 5) r(s = 5|a = STOP)
           = 1 × 0
           = 0                                                   (44.80e)

r^π(s = 6) = Σ_{a∈A} π(a|s = 6) r(s = 6|a)
           = π(a = any|s = 6) r(s = 6|a = any)
           = 0                                                   (44.80f)

We therefore conclude that the one-step reward vector is given by

r^π = [ 0   2   −1   7   0   0 ]^T                               (44.81)


Assuming a discount factor γ = 1, we can determine the value vector by solving

(I − P^π) v^π = r^π                                              (44.82)

That is,

[  1     0     0     0     0    −1 ] [ v^π(1) ]   [  0 ]
[ −1/2   1    −1/2   0     0     0 ] [ v^π(2) ]   [  2 ]
[  0    −1/2   1    −1/2   0     0 ] [ v^π(3) ] = [ −1 ]          (44.83)
[  0     0    −1/2   1    −1/2   0 ] [ v^π(4) ]   [  7 ]
[  0     0     0     0     1    −1 ] [ v^π(5) ]   [  0 ]
[  0     0     0     0     0     0 ] [ v^π(6) ]   [  0 ]

This system of equations has infinitely many solutions (in contrast, when γ ∈ [0, 1), the solution is always unique). The above equations imply that the unknowns {v^π(s)} must satisfy the following relations, for any α ∈ IR:

v^π(1) = v^π(5) = v^π(6) = α                                     (44.84a)
v^π(2) = 15/4 + α                                                (44.84b)
v^π(3) = 7/2 + α                                                 (44.84c)
v^π(4) = 21/4 + α                                                (44.84d)

Now, recall that, by definition, v^π(6) is the expected cumulative reward for state s = 6, which is the EXIT state. We therefore have v^π(6) = 0, which means α = 0. We conclude that the values of the states are given by

v^π(1) = v^π(5) = v^π(6) = 0,   v^π(2) = 15/4,   v^π(3) = 7/2,   v^π(4) = 21/4          (44.85)
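The calculation in this example can also be reproduced numerically. The following is a minimal sketch, assuming NumPy is available; it uses the quantities P^π from (44.77) and r^π from (44.81) and applies the closed-form policy evaluation (44.73) with a discount factor γ < 1, so that I − γP^π remains invertible (the particular value of γ below is an arbitrary illustrative choice):

import numpy as np

# state transition matrix P^pi from (44.77) and one-step reward vector r^pi from (44.81)
P_pi = np.array([
    [0,   0,   0,   0,   0,   1],
    [0.5, 0,   0.5, 0,   0,   0],
    [0,   0.5, 0,   0.5, 0,   0],
    [0,   0,   0.5, 0,   0.5, 0],
    [0,   0,   0,   0,   0,   1],
    [0,   0,   0,   0,   0,   1],
])
r_pi = np.array([0, 2, -1, 7, 0, 0], dtype=float)
gamma = 0.9   # any gamma in [0,1) keeps (I - gamma*P^pi) nonsingular

# closed-form policy evaluation (44.73): v^pi = (I - gamma*P^pi)^{-1} r^pi
v_pi = np.linalg.solve(np.eye(6) - gamma * P_pi, r_pi)
print(v_pi)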

44.3.2 State–Action Value Function

The state value function v^π(s) returns the expected cumulative reward for every state s. We can also assign expected cumulative rewards to the intermediate state–action pairs (s, a). Recall that the evolution of an MDP can be interpreted as consisting of transitions from states s to state–action pairs (s, a), to states s′, and so forth. For example, we can expand description (44.51) to include the intermediate state–action pairs in the trajectory:

s_0 →^{a_0} (s_0, a_0) →^{r(0)} s_1 →^{a_1} (s_1, a_1) →^{r(1)} s_2 →^{a_2} (s_2, a_2) . . .          (44.86)

We will denote the value of each state–action pair by the notation q^π(s, a). This quantity is referred to as the state–action value function (or, more simply, the action value function); it measures the expected cumulative reward starting from state s_0 = s, taking action a, and acting according to the policy π(a|s) thereafter, i.e., in two equivalent forms:


q^π(s, a) ≜ E_{π,P}{ U(s) | s_0 = s, a_0 = a }          (44.87a)

and

q^π(s, a) = E_{π,P}{ Σ_{n=0}^∞ γ^n r(n) | s_0 = s, a_0 = a }          (44.87b)

For instance, referring to the trajectory shown in (44.86), the state–action value for the pair (s_0, a_0) is denoted by q^π(s_0, a_0); it gives the expected cumulative reward from that point in time onward, knowing that the agent has already started from state s_0 and taken action a_0. In comparison, the value of state s_0, which we denoted earlier by v^π(s_0), gives the expected cumulative reward from that time onward knowing only that the agent has started from state s_0. Thus, in the first case the initial action is fixed at a_0, while in the second case the initial action is unknown and the reward value is computed by averaging over all possible initial actions. This suggests that there is a direct relationship between state values and state–action values, namely, for any state s it holds that:

v^π(s) = E_π{ q^π(s, a) }          (44.88)

where the expectation over π means that the actions are selected according to the distribution π(a|s).

Proof of (44.88): Indeed, referring to definition (44.65), we have

v^π(s) = E_{π,P}{ Σ_{n=0}^∞ γ^n r(n) | s_0 = s }
       = E_{a∼π}( E_{π,P}{ Σ_{n=0}^∞ γ^n r(n) | s_0 = s, a_0 = a } )
       = E_π{ q^π(s, a) }          (44.89)

Again, in a manner similar to (44.69), the state–action value q^π(s, a) can be evaluated by adding two components: the expected one-step conditional reward from state s (given that the action is now fixed at a), and the discounted cumulative reward from the next step onward:

q^π(s, a) = r(s|a) + γ ( Σ_{s′∈S} P(s, a, s′) v^π(s′) )          (44.90)

where r(s|a) is computed via (44.42). Comparing with (44.69), we see that p^π_{s,s′} is now replaced by P(s, a, s′) since the initial action a is fixed.


Proof of (44.90): Starting from s and taking action a, there are multiple states s′ that can be reached in one step, each with probability P(s, a, s′). The expected cumulative reward from each of these states is v^π(s′). Accordingly, the last term in the above expression is simply the average of these cumulative rewards: each v^π(s′) now occurs with probability P(s, a, s′).

If we replace r(s|a) in (44.90) by its expression (44.42), we can rewrite the state–action value in terms of the original quantities {P, r}:

q^π(s, a) = Σ_{s′∈S} P(s, a, s′) [ r(s, a, s′) + γ v^π(s′) ]          (44.91)

This relation shows how the state–action values {q^π(s, a)} can be determined from knowledge of the state values {v^π(s)}. Using (44.88), or comparing the above expression with (44.69), we can readily relate the value and action-value functions for any state as follows:

v^π(s) = Σ_{a∈A} π(a|s) q^π(s, a)          (44.92)

Substituting (44.92) into (44.91), we can eliminate v^π(s) from the latter expression and find that the state–action value function satisfies a Poisson equation similar to (44.69):

q^π(s, a) = Σ_{s′∈S} P(s, a, s′) ( r(s, a, s′) + γ Σ_{a′∈A} π(a′|s′) q^π(s′, a′) )
      (a) = Σ_{s′∈S} P(s, a, s′) ( Σ_{a′∈A} π(a′|s′) r(s, a, s′) + γ Σ_{a′∈A} π(a′|s′) q^π(s′, a′) )          (44.93)

where in step (a) we used the fact that, for any s′, the probabilities {π(a′|s′)} add up to 1 over a′ ∈ A. It follows that

q^π(s, a) = Σ_{s′∈S} Σ_{a′∈A} π(a′|s′) P(s, a, s′) [ r(s, a, s′) + γ q^π(s′, a′) ]          (44.94)

which shows that the state–action value function can be equivalently interpreted as a conditional expectation over the distributions of both actions and transitions:

q^π(s, a) = E_{π,P}{ r(s, a, s′) + γ q^π(s′, a′) | s = s, a = a }          (44.95)

This relation is a second Bellman equation, albeit now for the state–action value function. Later, in (45.15b), we will present the corresponding Bellman optimality equation; the latter does not depend on the policy.


For reference, we collect the state–action values into a matrix of size |S| × |A| as follows:

Q^π ≜ [ q^π(s, a) ],   s ∈ S,  a ∈ A          (44.96)

That is,

              a_1   a_2   a_3   · · ·   a_{|A|}
      s_1   [  ×     ×     ×    · · ·     ×   ]
      s_2   [  ×     ×     ×    · · ·     ×   ]
Q^π = s_3   [  ×     ×     ×    · · ·     ×   ]          (44.97)
      ...
      s_{|S|} [ ×     ×     ×    · · ·     ×   ]

Example 44.7 (Evaluating state–action values for a simplified grid) We continue with the setting of Example 44.6 with γ = 1 and proceed to evaluate the state–action values for all possible state–action pairs. The calculations that follow will show that these values are given by the entries of the following matrix, whose rows correspond to the states s ∈ {1, 2, 3, 4, 5, 6} and whose columns correspond to the actions a ∈ {RIGHT, LEFT, STOP}:

Q^π = [ 0      0      0
        5/2    5      0
        17/4   11/4   0
        15     5/2    0
        0      0      0
        0      0      0 ]          (44.98)

Indeed, note from (44.90) that for states s = 1, 5, 6:

q^π(s = 1, a = STOP) = r(s = 1|a = STOP) + Σ_{s′∈S} P(s = 1, a = STOP, s′) v^π(s′)
                     = 0 + P(s = 1, a = STOP, s′ = 6) v^π(s′ = 6)
                     = 0 + 1 × 0
                     = 0                                         (44.99a)

q^π(s = 5, a = STOP) = r(s = 5|a = STOP) + Σ_{s′∈S} P(s = 5, a = STOP, s′) v^π(s′)
                     = 0 + P(s = 5, a = STOP, s′ = 6) v^π(s′ = 6)
                     = 0 + 1 × 0
                     = 0                                         (44.99b)

q^π(s = 6, a = any) = r(s = 6|a = any) + Σ_{s′∈S} P(s = 6, a = any, s′) v^π(s′)
                    = 0 + P(s = 6, a = any, s′ = 6) v^π(s′ = 6)
                    = 0 + 1 × 0
                    = 0                                          (44.99c)


while for the remaining states:

q^π(s = 2, a = RIGHT) = r(s = 2|a = RIGHT) + Σ_{s′∈S} P(s = 2, a = RIGHT, s′) v^π(s′)
                      = −1 + P(s = 2, a = RIGHT, s′ = 3) v^π(s′ = 3)
                      = −1 + 1 × 7/2
                      = 5/2                                      (44.99d)

q^π(s = 2, a = LEFT) = r(s = 2|a = LEFT) + Σ_{s′∈S} P(s = 2, a = LEFT, s′) v^π(s′)
                     = 5 + P(s = 2, a = LEFT, s′ = 1) v^π(s′ = 1)
                     = 5 + 1 × 0
                     = 5                                         (44.99e)

q^π(s = 3, a = RIGHT) = r(s = 3|a = RIGHT) + Σ_{s′∈S} P(s = 3, a = RIGHT, s′) v^π(s′)
                      = −1 + P(s = 3, a = RIGHT, s′ = 4) v^π(s′ = 4)
                      = −1 + 1 × 21/4
                      = 17/4                                     (44.99f)

q^π(s = 3, a = LEFT) = r(s = 3|a = LEFT) + Σ_{s′∈S} P(s = 3, a = LEFT, s′) v^π(s′)
                     = −1 + P(s = 3, a = LEFT, s′ = 2) v^π(s′ = 2)
                     = −1 + 1 × 15/4
                     = 11/4                                      (44.99g)

and

q^π(s = 4, a = RIGHT) = r(s = 4|a = RIGHT) + Σ_{s′∈S} P(s = 4, a = RIGHT, s′) v^π(s′)
                      = 15 + P(s = 4, a = RIGHT, s′ = 5) v^π(s′ = 5)
                      = 15 + 1 × 0
                      = 15                                       (44.99h)

q^π(s = 4, a = LEFT) = r(s = 4|a = LEFT) + Σ_{s′∈S} P(s = 4, a = LEFT, s′) v^π(s′)
                     = −1 + P(s = 4, a = LEFT, s′ = 3) v^π(s′ = 3)
                     = −1 + 1 × 7/2
                     = 5/2                                       (44.99i)
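The state–action values in this example follow from (44.91) once the state values are known. The snippet below is a small sketch, assuming NumPy; the dictionaries encoding the grid's transitions and landing rewards are written out here only for illustration:

import numpy as np

gamma = 1.0
v_pi = np.array([0, 15/4, 7/2, 21/4, 0, 0])        # state values from (44.85), states 1..6
r_land = {1: 5, 2: -1, 3: -1, 4: -1, 5: 15, 6: 0}  # reward for landing in each state

# deterministic transitions of the grid: (state, action) -> landing state
next_state = {
    (1, 'STOP'): 6, (5, 'STOP'): 6, (6, 'any'): 6,
    (2, 'RIGHT'): 3, (2, 'LEFT'): 1,
    (3, 'RIGHT'): 4, (3, 'LEFT'): 2,
    (4, 'RIGHT'): 5, (4, 'LEFT'): 3,
}

q_pi = {}
for (s, a), s_next in next_state.items():
    # P(s, a, s_next) = 1 for these deterministic transitions, so (44.91) reduces to one term
    q_pi[(s, a)] = r_land[s_next] + gamma * v_pi[s_next - 1]

print(q_pi[(3, 'RIGHT')])   # reproduces 17/4 from (44.99f)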

Example 44.8 (Bounded state–action values) It can be verified that, whenever the reward function is uniformly bounded, the state–action value function is also uniformly bounded. This fact can be deduced from (44.94). Thus, note from


q^π(s, a) = Σ_{s′∈S} Σ_{a′∈A} π(a′|s′) P(s, a, s′) [ r(s, a, s′) + γ q^π(s′, a′) ]          (44.100)

that

|q^π(s, a)| ≤ max_{s′∈S} |r(s, a, s′)| + γ ( max_{s′∈S, a′∈A} |q^π(s′, a′)| )          (44.101)

and, hence,

max_{s∈S, a∈A} |q^π(s, a)| ≤ max_{s,s′∈S, a∈A} |r(s, a, s′)| + γ ( max_{s′∈S, a′∈A} |q^π(s′, a′)| )          (44.102)

It follows that

max_{s∈S, a∈A} |q^π(s, a)| ≤ ( 1/(1 − γ) ) max_{s,s′∈S, a∈A} |r(s, a, s′)|          (44.103)

which shows that q^π(s, a) is bounded for all states and actions.

44.3.3 Fixed-Point Iteration

The presentation so far shows that determination of the state value function v^π(s) requires that we solve the Poisson equation (44.72). Once the state values v^π(s) are determined from (44.72), we can substitute them into (44.91) and determine the state–action values q^π(s, a). In this section, we describe an alternative procedure that relies on an iterative approximation of the solution. This alternative construction, which is recursive in nature, will form the backbone of many of the computational procedures described in future chapters on reinforcement learning. We therefore motivate it from first principles.

We introduce a mapping T(x) : IR^{|S|} → IR^{|S|}, which maps a vector of size |S| into another vector of the same size, as follows:

T(x) ≜ r^π + γ P^π x          (44.104)

This mapping transforms the vector x into the vector T(x). If a vector x^o can be found such that x^o is mapped back onto itself, i.e., T(x^o) = x^o, then we say that x^o is a fixed point of the mapping T(x). If we examine the Poisson equation (44.72), we readily recognize that the desired vector v^π corresponds to a fixed point of the mapping T(·):

v^π = r^π + γ P^π v^π = T(v^π)          (44.105)

Given that γP^π is a stable matrix, meaning that its spectral radius satisfies

ρ(γP^π) < 1,   for any γ ∈ [0, 1)          (44.106)


then one useful iterative scheme for determining v^π is to apply the following fixed-point iteration (recall the similar discussion in Section 11.2 on proximal methods based on fixed-point iterations):

v_k^π = r^π + γ P^π v_{k−1}^π,   v_0^π = 0,   k ≥ 1          (44.107)

where k is the iteration index. Starting from any initial condition v_0^π, it will hold that:

lim_{k→∞} v_k^π = v^π          (44.108)

Proof of (44.108): Subtracting relations (44.105) and (44.107) we obtain the error recursion

(v^π − v_k^π) = γ P^π (v^π − v_{k−1}^π),   k ≥ 1          (44.109)

which shows that the error converges to zero as k → ∞ since γP^π is stable.

It is useful to rewrite, for later use, recursion (44.107) in terms of the individual state values v_k^π(s) as follows (recall (44.69)):

v_k^π(s) = Σ_{a∈A} π(a|s) ( Σ_{s′∈S} P(s, a, s′) [ r(s, a, s′) + γ v_{k−1}^π(s′) ] )
         = E_{π,P}{ r(s, a, s′) + γ v_{k−1}^π(s′) | s = s }          (44.110)

One advantage of the recursive calculation (44.107) is that it avoids the need to invert the matrix (I − γP^π). This is useful when P^π is large and sparse (i.e., has few nonzero entries); in that case, the matrix–vector multiplication in (44.107) can be performed efficiently. A second advantage of the recursive procedure (44.107) is the useful interpretation of the successive iterates v_k^π(s), discussed next.
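To illustrate the point about sparsity, the following is a minimal sketch, assuming NumPy and SciPy are available; it runs the fixed-point iteration (44.107) with P^π stored in compressed sparse row form (the dense array below is only a stand-in for a matrix that would normally be assembled directly in sparse format):

import numpy as np
from scipy.sparse import csr_matrix

# P^pi and r^pi for the simplified grid of Example 44.6, converted to sparse form
P_dense = np.array([
    [0,   0,   0,   0,   0,   1],
    [0.5, 0,   0.5, 0,   0,   0],
    [0,   0.5, 0,   0.5, 0,   0],
    [0,   0,   0.5, 0,   0.5, 0],
    [0,   0,   0,   0,   0,   1],
    [0,   0,   0,   0,   0,   1],
])
P_sparse = csr_matrix(P_dense)
r_pi = np.array([0, 2, -1, 7, 0, 0], dtype=float)
gamma = 0.9

v = np.zeros(6)                       # boundary condition v_0^pi = 0
for _ in range(500):
    v = r_pi + gamma * (P_sparse @ v)  # sparse matrix-vector product per (44.107)
print(v)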

44.3.4 Expected K-Step Rewards

The successive iterates {v_k^π} have a useful interpretation when the boundary condition is set to v_0^π = 0. This interpretation is easier to explain by following an inductive argument. Note first from (44.107) that

v_1^π = r^π          (44.111)

which shows that v_1^π is the one-step reward vector. Moving to the next iteration, we get

v_2^π = r^π + γ P^π v_1^π          (44.112)

which can be interpreted as the value vector that would result if we look two steps into the future. This conclusion is perhaps better appreciated if we break


down the above vector relation into its individual entries in a manner similar to (44.67) and write

v_2^π(s) = r^π(s) + γ ( Σ_{s′∈S} p^π_{s,s′} v_1^π(s′) )          (44.113)

The first term, r^π(s), on the right-hand side is the expected reward for state s during the first transition from s, while the second term is the discounted expected reward for the next transition. Likewise, for the next iteration we have

v_3^π(s) = r^π(s) + γ ( Σ_{s′∈S} p^π_{s,s′} v_2^π(s′) )          (44.114)

Again, the first term, r^π(s), is the expected reward for state s during the first transition from s, while the second term is now the discounted expected reward for the next two transitions. More generally, the vector v_k^π from (44.107) contains the discounted expected rewards for all states by looking k steps into the future:

v_k^π = discounted expected reward vector for k steps into the future          (44.115)

The resulting algorithm is listed in (44.116).

Policy evaluation procedure for an MDP, M = {S, A, P, r, π, γ}.
start from the initial condition v_0^π = 0_{|S|};
compute the one-step reward vector r^π using (44.44);
compute the transition probability matrix P^π using (44.31);
compute v^π by solving (I − γP^π) v^π = r^π or, alternatively,
repeat until convergence over k = 1, 2, 3, . . .:
    v_k^π = r^π + γ P^π v_{k−1}^π
end
return v^π ← v_k^π.                                              (44.116)
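The listing translates into only a few lines of code. The following is a minimal sketch, assuming NumPy is available and that the arrays P_pi and r_pi holding P^π and r^π have already been formed; the tolerance and iteration cap are arbitrary illustrative choices:

import numpy as np

def policy_evaluation(P_pi, r_pi, gamma, tol=1e-10, max_iter=100000):
    """Fixed-point iteration (44.107): v_k = r^pi + gamma * P^pi * v_{k-1}."""
    v = np.zeros(len(r_pi))                    # boundary condition v_0^pi = 0
    for _ in range(max_iter):
        v_next = r_pi + gamma * P_pi @ v       # one application of the mapping T(.)
        if np.max(np.abs(v_next - v)) < tol:   # stop once the iterates stabilize
            return v_next
        v = v_next
    return v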

44.4 LINEAR FUNCTION APPROXIMATION

The Poisson equation (44.72) allows us to determine the state values that correspond to a given policy π(a|s) by solving:

(I − γ P^π) v^π = r^π          (44.117)

There are many powerful and effective techniques for the solution of such linear systems of equations. However, one main challenge remains for very large state


spaces. For example, the game of backgammon has on the order of 10^20 states while the game of chess has on the order of 10^47 states; numbers of this size make the solution of (44.117) computationally intractable.

44.4.1 Projected Bellman Error

One useful technique to address this computational challenge is to associate feature vectors with the states. Specifically, we associate with each state s ∈ S a feature vector h_s ∈ IR^M, where the value of M is much smaller than the dimension of the state space, M ≪ |S|:

h_s : S → IR^M          (44.118)

It is generally assumed that the agent cannot observe the state variable s, but only its feature vector h_s, so that the agent "senses" the state only through the feature vector. Using the feature representation, we approximate the value of each state by a linear parametric model of the form:

v^π(s) ≈ h_s^T w          (44.119)

for some vector w ∈ IR^M. Observe that the vector w is fixed for all states, while the feature vector changes with the state. In this way, rather than determine the value of v^π(s) for every s, it becomes sufficient to determine w (in some optimal manner) and then use (44.119) to estimate v^π(s) even for states that the agent may not have visited. Other, more complex parameterizations can be used in lieu of (44.119), such as neural network models, but linear models are sufficient here to convey the main idea. The selection of which features to use is often governed by the nature of the problem under study and by prior information and understanding of its state space and dynamics. In many cases, the choice of features is suggested naturally by the problem formulation, while in other cases a more guided approach is necessary to select informative features.

Example 44.9 (Selection of feature entries) Figure 44.10 provides one example of feature selection. The figure shows a large geographic space, with 48 × 64 = 3072 states represented by the small squares. The space is traversed by an agent, which collects rewards at the star locations and faces obstacles/danger elsewhere. One possibility for defining feature entries is to measure distances to certain anchor locations in the space, represented by the dark squares. There are 22 anchor locations in the figure so that, for this example, M = 22 and |S| = 3072. Instead of measuring distances directly, one can consider giving more weight to anchors that are closer to the agent's location and less weight to anchors that are further away. One way to achieve this type of modulation is to employ a Gaussian radial basis function centered at the agent's location. For example, if s denotes the current location (state) of the agent and s_a denotes the location of the ath anchor, then the entry in h_s relative to this ath anchor can be computed as

h_s(a) ≜ exp( − (1/(2σ^2)) ‖ s_a − s ‖^2 )          (44.120)


where σ^2 denotes a variance parameter that controls how fast the Gaussian function decays away from s.

Figure 44.10 Example of selection of feature entries. In a large state space, with |S| = 3072 states represented by the small squares, the distances to M = 22 anchor states (represented by the dark squares) can be used as features.
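A possible implementation of the feature construction (44.120) is sketched below, assuming NumPy; the anchor coordinates and the value of σ are hypothetical placeholders chosen only for illustration:

import numpy as np

def rbf_features(state_xy, anchors_xy, sigma=4.0):
    """Return the feature vector h_s for a state located at state_xy (2D coordinates)."""
    diffs = anchors_xy - state_xy                 # displacement to each anchor
    sq_dist = np.sum(diffs**2, axis=1)            # ||s_a - s||^2 for each anchor a
    return np.exp(-sq_dist / (2 * sigma**2))      # entry-wise Gaussian modulation (44.120)

anchors = np.array([[5.0, 5.0], [20.0, 40.0], [40.0, 10.0]])  # M = 3 hypothetical anchors
h = rbf_features(np.array([6.0, 7.0]), anchors)
print(h)   # feature vector of length M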

Returning to (44.119), we stack the feature vectors on top of each other and introduce the |S| × M feature matrix:

H ≜ col{ h_1^T, h_2^T, h_3^T, . . . , h_{|S|}^T },   (|S| × M)          (44.121)

For well-posedness, we assume that H has full rank, meaning that its columns are linearly independent so that H^T H is invertible (actually positive-definite). If this were not the case, then there would be redundancy in the feature space and we could reduce the dimension of the feature vectors. Using the feature matrix H, we can transform the entry-wise expression (44.119) into a vector relation for the value vector and write

v^π ≈ H w          (44.122)


If we assume this model, then it follows from the Poisson equation (44.117) that a good choice for w is to seek a vector that satisfies the approximate equality:

H w ≈ r^π + γ P^π H w          (44.123)

If this were to be an exact equation, then a solution w would exist only if the term on the right-hand side belongs to the column span of H, i.e., only when

r^π + γ P^π H w ∈ R(H)          (44.124)

We can enforce this property by replacing (44.123) by the so-called projected Bellman equation:

H w ≈ Π ( r^π + γ P^π H w )          (44.125)

where the matrix Π ∈ IR^{|S|×|S|} is chosen to perform (weighted) projections onto R(H). In other words, the matrix Π is selected such that, when applied to some generic vector d ∈ IR^{|S|}, it transforms it into the unique vector d̂ = H ẑ ∈ R(H), where ẑ is obtained by minimizing the squared weighted norm:

ẑ = argmin_{z∈IR^M} (d − Hz)^T D (d − Hz)          (44.126)

In this formulation, the matrix D ∈ IR^{|S|×|S|} is some symmetric positive-definite weighting matrix that we are free to choose (e.g., D = I or some other choice). Later, we will find that a diagonal choice for D is useful. Problem (44.126) seeks a solution ẑ such that Hẑ is closest to d in a weighted least-squares sense. Differentiating the cost in (44.126) relative to z and setting the gradient equal to zero at z = ẑ gives

ẑ = (H^T D H)^{−1} H^T D d          (44.127)

so that

d̂ = H (H^T D H)^{−1} H^T D d          (44.128)

The "projection" matrix Π that achieves this transformation from d to d̂ is therefore given by

Π ≜ H (H^T D H)^{−1} H^T D          (44.129)

Using this expression for Π, we can now return to (44.125) and seek a weight vector w that makes both sides of the approximate equation close to each other. Specifically, we seek the vector w that solves the following weighted least-squares problem:

ŵ = argmin_{w∈IR^M} { J_PB(w) ≜ (1/2) ‖ H w − Π( r^π + γ P^π H w ) ‖^2_D }          (44.130)

Here, the notation ‖x‖^2_D stands for x^T D x. We say that the above problem minimizes the squared projected Bellman error. The cost J_PB(w) is quadratic in w. If we expand it and use (44.129), we find that (see Prob. 44.12):


J_PB(w) = (1/2) (H^T D r^π − B w)^T (H^T D H)^{−1} (H^T D r^π − B w)          (44.131a)
B = H^T D (I − γ P^π) H          (44.131b)

in terms of the matrix B defined above. Since ρ(γP^π) < 1 and H has full rank, the matrix B is invertible. Likewise, the matrix H^T D H is invertible. It is then clear that J_PB(w) is minimized when

B ŵ = H^T D r^π          (44.132)

from which we conclude that

ŵ = { H^T D (I − γ P^π) H }^{−1} H^T D r^π          (44.133)

and, consequently, the linear approximation model for the value vector is

v̂^π = H { H^T D (I − γ P^π) H }^{−1} H^T D r^π          (44.134)
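The closed-form solution (44.133)–(44.134) can be sketched in a few lines, assuming NumPy; the randomly generated H, D, P^π, and r^π below are placeholders used only to make the snippet self-contained:

import numpy as np

rng = np.random.default_rng(0)
S, M, gamma = 50, 5, 0.9
H = rng.standard_normal((S, M))                  # feature matrix (full column rank w.h.p.)
D = np.diag(rng.uniform(0.5, 1.5, size=S))       # positive-definite diagonal weighting
P_pi = rng.uniform(size=(S, S))
P_pi /= P_pi.sum(axis=1, keepdims=True)          # make P^pi right-stochastic
r_pi = rng.standard_normal(S)

B = H.T @ D @ (np.eye(S) - gamma * P_pi) @ H     # matrix B from (44.131b)
w_hat = np.linalg.solve(B, H.T @ D @ r_pi)       # w_hat from (44.133)
v_approx = H @ w_hat                             # approximate value vector (44.134)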

44.4.2 Arrow–Hurwicz Iteration

Result (44.133) provides a closed-form expression for the solution ŵ in terms of the quantities {P^π, r^π} and the feature matrix H. This construction requires the inversion of the M × M matrix B, and the computation of B can be demanding since it involves the matrix H of dimensions |S| × M with |S| very large. We now devise an alternative procedure that avoids the matrix inversion and approximates ŵ in an iterative manner. We do so by pursuing duality arguments in order to replace the original (or primal) optimization problem (44.130) by a dual problem whose solution solves the original problem – recall the discussion from Section 9.2. In Example 44.11 we provide a more direct argument based on the concept of conjugate functions.

We motivate the dual problem as follows. Using expressions (44.131a)–(44.131b), we start by noting that minimizing J_PB(w) over w in (44.130) is equivalent to solving:

min_{y,w} { (1/2) y^T (H^T D H)^{−1} y },   subject to  y = H^T D r^π − B w          (44.135)

where y, w ∈ IR^M. This problem involves a quadratic cost; it is quadratic over y and convex over y and w. The problem also involves an affine constraint relating y and w. Such problems are special cases of the general class of convex optimization problems described earlier in Section 9.1. Based on the discussion in that section, the property of strong duality holds. What this means is that the global solutions (ŷ, ŵ) to the primal problem (44.135) can be determined by working instead with the dual problem corresponding to (44.135).


The dual problem is determined as follows. First, we introduce the Lagrangian function

L(w, y, λ) = (1/2) y^T (H^T D H)^{−1} y + λ^T ( H^T D r^π − B w − y )          (44.136)

where λ ∈ IR^M is the Lagrange multiplier. Next, we minimize L(w, y, λ) over y and w. Differentiating with respect to y and w and setting the gradients to zero at ŷ and ŵ gives:

∇_y L(w, y, λ) = ŷ^T (H^T D H)^{−1} − λ^T = 0          (44.137a)
∇_w L(w, y, λ) = −λ^T B = 0          (44.137b)

Substituting these two conditions into the expression for the Lagrangian function, we arrive at the dual function:

g(λ) = −(1/2) λ^T (H^T D H) λ + λ^T H^T D r^π          (44.138)

The dual problem now involves maximizing g(λ) (or minimizing −g(λ)) over λ under the condition B^T λ = 0:

λ̂ = argmin_{λ∈IR^M} { (1/2) λ^T (H^T D H) λ − λ^T H^T D r^π },   subject to  B^T λ = 0          (44.139)

Therefore, we are reduced to solving the constrained problem (44.139). Observe that this dual problem removes the inverse of H^T D H in comparison with the primal problem (44.135). This fact facilitates the derivation of an iterative procedure for determining ŵ by working with (44.139). To solve (44.139), we again introduce its Lagrangian function:

L(λ, w) ≜ (1/2) λ^T (H^T D H) λ − λ^T H^T D r^π + w^T B^T λ          (44.140)

where we are denoting its Lagrange multiplier by the same symbol w ∈ IR^M. That is, we are using the same notation w for the dual variable in (44.140) because it can be verified that, by computing the dual of the dual problem (44.139), we recover the original problem (44.135) or (44.130). Indeed, some straightforward algebra shows that the dual function resulting from (44.140) is:

g_2(w) = −(1/2) (H^T D r^π − B w)^T (H^T D H)^{−1} (H^T D r^π − B w)          (44.141)

so that maximizing g_2(w) over w is equivalent to minimizing J_PB(w) over w. This means that the optimal dual variable ŵ of (44.140) is the optimal solution to (44.135). To determine a saddle point (λ̂, ŵ) of the Lagrangian function (44.140), we can invoke an iterative procedure that alternates between applying gradient descent to L(λ, w) with respect to λ (since we minimize over λ) and gradient ascent to L(λ, w) with respect to w (since we maximize the dual function over w):


λ_k = λ_{k−1} − µ H^T D ( H λ_{k−1} + (I − γP^π) H w_{k−1} − r^π )          (44.142a)
w_k = w_{k−1} + µ H^T (I − γP^π)^T D H λ_{k−1}          (44.142b)

with boundary conditions λ_0 = 0, w_0 = 0, and where k ≥ 1 is the iteration index. The scalar µ > 0 is a small step-size parameter. This construction is a special case of the well-known Arrow–Hurwicz algorithm applied to the particular Lagrangian function (44.140); for a generic Lagrangian function, the algorithm alternates between a gradient-descent step over the primal variable and a gradient-ascent step over the dual variable; for problem (44.140), the primal variable is λ and the dual variable is w. Later, in Example 48.2, we will explain how this formulation can be used to motivate useful algorithms for reinforcement learning when the model parameters of the MDP, such as its kernel probability matrix P, are not known beforehand.

Arrow–Hurwicz algorithm for solving (44.130).
start from initial conditions w_0 = 0, λ_0 = 0;
select a small step size µ > 0;
given a positive-definite diagonal matrix D;
set B = H^T D (I − γP^π) H.
repeat until convergence over k = 1, 2, . . .:
    λ_k = λ_{k−1} − µ ( H^T D H λ_{k−1} + B w_{k−1} − H^T D r^π )
    w_k = w_{k−1} + µ B^T λ_{k−1}
end
return w ← w_k.                                                  (44.143)
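A minimal sketch of listing (44.143) is given below, assuming NumPy; the step size and iteration count are illustrative choices, and the inputs are the same kind of arrays used in the earlier snippet for (44.133):

import numpy as np

def arrow_hurwicz(H, D, P_pi, r_pi, gamma, mu=1e-3, num_iter=200000):
    """Primal-dual iteration (44.143) for minimizing the projected Bellman error."""
    S, M = H.shape
    B = H.T @ D @ (np.eye(S) - gamma * P_pi) @ H
    HtDH = H.T @ D @ H
    HtDr = H.T @ D @ r_pi
    lam = np.zeros(M)
    w = np.zeros(M)
    for _ in range(num_iter):
        lam_next = lam - mu * (HtDH @ lam + B @ w - HtDr)  # gradient descent over lambda
        w = w + mu * (B.T @ lam)                           # gradient ascent over w (uses old lambda)
        lam = lam_next
    return w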

Example 44.10 (Convergence of the Arrow–Hurwicz algorithm) We now examine the dynamics of recursions (44.142a)–(44.142b). We show that a small enough µ exists that ensures convergence of the recursions to the desired saddle point. To begin with, by setting the gradient vectors of the Lagrangian function (44.140) to zero we find that its saddle point (λ̂, ŵ) satisfies the relations

H^T D ( H λ̂ + (I − γP^π) H ŵ − r^π ) = 0          (44.144a)
H^T (I − γP^π)^T D H λ̂ = 0          (44.144b)

We next introduce the error quantities

w̃_k ≜ w_{k−1} − ŵ,   λ̃_k ≜ λ_{k−1} − λ̂          (44.145)

We would like to verify that these errors go to zero as k → ∞. Subtracting λ̂ from both sides of (44.142a) and adding the zero quantity (44.144a) we find that

λ̃_k = λ̃_{k−1} − µ H^T D ( H λ̃_{k−1} + (I − γP^π) H w̃_{k−1} )          (44.146)

Likewise, subtracting ŵ from both sides of (44.142b) and adding the zero quantity (44.144b) we find that

w̃_k = w̃_{k−1} + µ H^T (I − γP^π)^T D H λ̃_{k−1}          (44.147)

For compactness of representation, we let B = H^T D (I − γP^π) H. We can then combine the last two recursions into a single vector recursion as follows:

col{ λ̃_k, w̃_k } = col{ λ̃_{k−1}, w̃_{k−1} } + [ −µ H^T D H   −µB ;  µB^T   0 ] col{ λ̃_{k−1}, w̃_{k−1} }
                = ( I + µA ) col{ λ̃_{k−1}, w̃_{k−1} },   where  A ≜ [ −H^T D H   −B ;  B^T   0 ]          (44.148)

From the result of Prob. 44.14 we know that all eigenvalues of the matrix A defined above have strictly negative real parts. From the result of Prob. 44.15 we know there exists a small positive µ that ensures that all eigenvalues of the matrix I + µA lie strictly inside the unit circle, namely, for

0 < µ < min_ℓ { −2 Re[λ_ℓ(A)] / |λ_ℓ(A)|^2 }          (44.149)

where λ_ℓ(A) denotes the ℓth eigenvalue of A. We conclude that λ_k → λ̂ and w_k → ŵ.

Example 44.11 (Derivation of Arrow–Hurwicz using conjugate functions) We provide an alternative (simpler) derivation of the Arrow–Hurwicz algorithm (44.142a)–(44.142b) by appealing to the concept of conjugate functions. Recall that our objective is to minimize the weighted quadratic cost function (44.130), which according to (44.131a) can be rewritten in the equivalent form

ŵ = argmin_{w∈IR^M} { (1/2) ‖ B w − b ‖^2_{C^{−1}} }          (44.150)

in terms of the quantities:

B = H^T D (I − γP^π) H          (44.151a)
C = H^T D H          (44.151b)
b = H^T D r^π          (44.151c)

Now we appeal to the easily verifiable property satisfied by every quadratic function:

(1/2) ‖ B w − b ‖^2_{C^{−1}} = max_{λ∈IR^M} { −(B w − b)^T λ − (1/2) ‖ λ ‖^2_C }          (44.152)

which motivates us to replace the minimization problem (44.150) by the saddle point problem:

min_{w∈IR^M} max_{λ∈IR^M} { −(B w − b)^T λ − (1/2) ‖ λ ‖^2_C }          (44.153)

If we now write down a gradient-ascent recursion over λ and a gradient-descent recursion over w, we recover recursions (44.142a)–(44.142b).


Although unnecessary for the argument, it is useful to note that equality (44.152) can be interpreted as a conjugate function calculation. To see this, consider the quadratic function f(λ) = (1/2)‖λ‖^2_C. Its conjugate function is defined by (recall expression (51.114)):

f^⋆(z) ≜ sup_λ { z^T λ − f(λ) }
   (a) = max_λ { z^T λ − f(λ) }
       = max_λ { z^T λ − (1/2) ‖ λ ‖^2_C }
   (b) = (1/2) ‖ z ‖^2_{C^{−1}}          (44.154)

where step (a) is because the supremum is attained and step (b) follows from the result of Prob. 8.51. Equality (44.152) follows by selecting z = b − B w.

44.5 COMMENTARIES AND DISCUSSION

Markov decision processes. The history of Markov decision processes is intertwined with the history of multistage decision processes and dynamic programming, all of which are widely used tools in control theory, decision-making, robotics, learning, and economics. A multistage decision process is one where an optimal decision needs to be taken at each stage in order to maximize (or minimize) a certain objective function: the problem setting will involve states, actions, rewards, and transitions. While the transitions between states in an MDP are generally stochastic in nature, and governed by a transition probability kernel P(s, a, s′), multistage decision processes refer more broadly to problems where these transitions can be either stochastic or deterministic (one example in the context of optimal trajectory planning is discussed in the next chapter in Example 45.4). Some of the earliest works on MDPs and multistage decision processes are due to the American mathematician and control theorist Richard Bellman (1920–1984) and appear in the works by Bellman (1953, 1954, 1957a, b). The following text is extracted from Bellman (1954, p. 503); it describes the essential elements of multistage decision processes (including states, decisions or actions, transitions, and the objective function) and how these problems served as motivation for the development of the theory of dynamic programming:

    ... the theory was created to treat the mathematical problems arising from the study of various multi-stage decision processes, which may roughly be described in the following way: We have a physical system whose state at any time t is determined by a set of quantities which we call state parameters, or state variables. At certain times, which may be prescribed in advance, or which may be determined by the process itself, we are called upon to make decisions which will affect the state of the system. These decisions are equivalent to transformations of the state variables, the choice of a decision being identical with the choice of a transformation. The outcome of the preceding decisions is to be used to guide the choice of future ones, with the purpose of the whole process that of maximizing some function of the parameters describing the final state.

We will describe and illustrate Markov decision processes in greater detail in the comments at the end of the next chapter. For more information on such processes, readers may refer to the texts by Howard (1960), Derman (1970), Feinberg and Shwartz (2002), and Bertsekas (2007).


Value function approximation. For MDPs with large state dimensions, it often becomes necessary to employ feature representations for the state variables, as was explained in Section 44.4. The problem of state value evaluation then amounts to estimating a vector w ∈ IR^M that enforces the linear approximation (44.119) with M ≪ |S|. In this representation, the vector w is fixed for all states, while the feature vector changes with the state. In this way, rather than determine the value of v^π(s) for every s, it becomes sufficient to determine w and then use (44.119) to estimate v^π(s) even for states that the agent may not have visited. The Bellman error criterion (44.130) for estimating w was proposed in the context of reinforcement learning problems by Baird (1995, 1999) using Π = I, and adjusted by Bradtke and Barto (1996), Lagoudakis and Parr (2003), Sutton, Szepesvari, and Maei (2009), and Sutton et al. (2009) to include the projection matrix Π to help enforce property (44.124) and reduce mismatch errors. According to Bertsekas (2011), this projected error formulation is one manifestation of the class of Galerkin methods, which are used to determine approximate solutions of partial differential equations. In this framework, the solution is approximated as a linear combination of a finite number of basis functions, and the coefficients of the combination are determined so as to minimize a weighted error criterion. For additional information on Galerkin methods, the reader may consult the text by Fletcher (1984), or the more introductory presentations in Reddy (2006) and Logan (2016). The work by Schweitzer and Seidman (1985) discusses Galerkin methods in the context of reinforcement learning and Markovian decision processes.

Arrow–Hurwicz algorithm. Recursions (44.142a)–(44.142b) are a special instance of the Arrow–Hurwicz algorithm, which is a general technique for reducing the solution of constrained maximization/minimization problems to the determination of saddle points – see, e.g., the overview by Zulehner (2001). The algorithm was introduced by Arrow and Hurwicz (1956) – see also Arrow, Hurwicz, and Uzawa (1958). Interestingly, both authors Kenneth Arrow (1921–2017) and Leonid Hurwicz (1917–2008), who happen to be economists, went on to receive (separately) the Nobel Prize in Economics in 1972 (for Arrow) and 2007 (for Hurwicz). In the paper by Arrow and Hurwicz (1956), the authors considered constrained problems of the form (using our notation – compare with (44.139)):

min_{λ∈IR^N} g(λ),   subject to  f(λ) = 0          (44.155)

where g(λ) : IR^N → IR and f(λ) : IR^N → IR^M. To solve the above constrained problem, we generally introduce a Lagrange multiplier w ∈ IR^M and the Lagrangian function:

L(λ, w) = g(λ) + w^T f(λ)          (44.156)

We then minimize L(λ, w) over λ; the solution λ̂ is a function of w and, upon substitution into L(λ, w), we arrive at the dual function, say, D(w). Under certain favorable technical conditions, such as strong duality, the maximization of D(w) over w helps determine the solution λ̂ to the original constrained problem (44.155). The Arrow–Hurwicz algorithm provides an alternative approach. It works directly with the Lagrangian function L(λ, w) and seeks a saddle point (λ̂, ŵ) for it. Specifically, the method alternates between applying gradient descent to L(λ, w) with respect to λ (since we are minimizing over λ) and gradient ascent to L(λ, w) with respect to w (since we need to maximize the dual function over w):

λ_k = λ_{k−1} − µ_λ(k) ( ∇_{λ^T} L(λ, w) evaluated at (λ_{k−1}, w_{k−1}) )          (44.157a)
w_k = w_{k−1} + µ_w(k) ( ∇_{w^T} L(λ, w) evaluated at (λ_{k−1}, w_{k−1}) )          (44.157b)

with boundary conditions λ_0 = 0, w_0 = 0, and where k ≥ 1 is the iteration index. The scalars {µ_λ(k), µ_w(k)} are nonnegative step-size sequences.


PROBLEMS

44.1 Follow arguments similar to Example 44.2 and compute the transition probabilities p^π_{6,s′} from state s = 6 to any state s′.
44.2 Follow arguments similar to Example 44.3 and compute the expected one-step reward for state s = 6.
44.3 Refer to the MDP described in Example 44.5. What would the optimal actions be for each of the center states if γ = 1?
44.4 Refer to the MDP described in Example 44.5. Replace the reward +15 by a generic value R at state s = 5. For what values of R does the optimal action at state 4 become "moving left"?
44.5 Refer to the MDP described in Example 44.5. Replace the reward +5 by +30 and consider a generic discount factor γ ∈ [0, 1). For what value(s) of γ does the optimal action at state 4 become "moving left"?
44.6 Refer to the MDP described in Example 44.5. Assume that whenever the action RIGHT is chosen, there is a 25% chance that the agent will move left.
  (a) Write down the model parameters {S, A, P, r} for this MDP and the policy π(a|s).
  (b) Determine the state transition probability matrix P^π.
  (c) Determine the one-step expected reward vector r^π.
  (d) Determine the state values v^π(s) under the assumed policy.
44.7 Refer to the MDP described in Example 44.5. Assume that whenever the agent reaches state s = 1, there is a 50% chance it will move to the EXIT state and stop the game, and a 50% chance it will instead transition either to state s = 2 or state s = 3 with equal probability.
  (a) Write down the model parameters {S, A, P, r} for this MDP and the policy π(a|s).
  (b) Determine the state transition probability matrix P^π.
  (c) Determine the one-step expected reward vector r^π.
  (d) Write down the Poisson equation for this problem.
  (e) Determine the state values v^π(s) under the assumed policy.
44.8 An MDP consists of two states, S = {s, s′}, and two actions a ∈ {MOVE, STAY}, with transition probabilities
  P(s, a = MOVE, s) = 0,   P(s, a = MOVE, s′) = 1
  P(s, a = STAY, s) = 1,   P(s′, a = any, s′) = 1
The only nonzero reward is r(s, a = STAY, s) = 1. The only action available at state s′ is STAY, while at state s the action MOVE is selected with probability p and the action STAY is selected with probability 1 − p.
  (a) Draw a graphical representation of the MDP similar to the one appearing in the lower part of Fig. 44.7.
  (b) Identify the parameters {S, A, P, r, π} of the MDP.
  (c) Determine the state values of s and s′, i.e., v^π(s) and v^π(s′), using γ = 1/2.
44.9 Refer to the argument leading to (44.70). Show that, for any finite N, it also holds that
  v^π(s) = E_{π,P}{ γ^N v^π(s_N) + Σ_{n=0}^{N−1} γ^n r(n) | s_0 = s }

44.10 Refer to the linear equation (44.73). Show that if the entries of r^π are nonnegative, then the entries of v^π are also nonnegative.
44.11 Consider a deterministic policy of the form
  π(a|s) = { 1, if a = a_s;  0, if a ≠ a_s }
Verify that v^π(s) = q^π(s, a_s).


44.12 Establish expression (44.131a).
44.13 Estimate the number of states in the games of backgammon and chess and argue that they are on the order of 10^20 and 10^47 states, respectively.
44.14 A real square matrix is called Hurwitz if all its eigenvalues have strictly negative real parts. Consider a block matrix of the form
  A ≜ [ −X   −Y ;  Y^T   0 ]
where X ∈ IR^{P×P} is positive-definite and Y ∈ IR^{P×Q} has full column rank. Show that A is Hurwitz. Remark: see Lemma 2 in Towfic and Sayed (2015).
44.15 A real square matrix is called Schur if all its eigenvalues lie strictly inside the unit circle. Let A be a Hurwitz matrix of the form defined in Prob. 44.14. Show that the matrix B = I + µA is a Schur matrix if, and only if, the scalar µ satisfies
  0 < µ < min_ℓ { −2 Re[λ_ℓ(A)] / |λ_ℓ(A)|^2 }
where λ_ℓ(A) denotes the ℓth eigenvalue of A. Remark: see Polyak (1987, p. 39) and also Lemma 1 in Towfic and Sayed (2015).

REFERENCES

Arrow, K. J. and L. Hurwicz (1956), "Reduction of constrained maxima to saddle-point problems," Proc. 3rd Berkeley Symp. Math. Statist. Probab., pp. 1–20, Berkeley, CA.
Arrow, K. J., L. Hurwicz, and H. Uzawa (1958), Studies in Linear and Non-Linear Programming, Stanford University Press.
Baird, L. C. (1995), "Residual algorithms: Reinforcement learning with function approximation," Proc. Int. Conf. Machine Learning (ICML), pp. 30–37, Tahoe City, CA.
Baird, L. C. (1999), Reinforcement Learning Through Gradient Descent, Ph.D. thesis, Carnegie-Mellon University, USA.
Bellman, R. E. (1953), "An introduction to the theory of dynamic programming," Report R-245, RAND Corporation.
Bellman, R. E. (1954), "The theory of dynamic programming," Bull. Amer. Math. Soc., vol. 60, no. 6, pp. 503–516.
Bellman, R. E. (1957a), Dynamic Programming, Princeton University Press. Also published in 2003 by Dover Publications.
Bellman, R. E. (1957b), "A Markovian decision process," Indiana Univ. Math. J., vol. 6, no. 4, pp. 679–684.
Bertsekas, D. P. (2007), Dynamic Programming and Optimal Control, 4th ed., 2 vols., Athena Scientific.
Bertsekas, D. P. (2011), "Approximate policy iteration: A survey and some new methods," J. Control Theory Appl., vol. 9, no. 3, pp. 310–335.
Bradtke, S. J. and A. G. Barto (1996), "Linear least-squares algorithms for temporal difference learning," J. Mach. Learn. Res., vol. 22, pp. 33–57.
Derman, C. (1970), Finite State Markovian Decision Processes, Academic Press.
Feinberg, E. A. and A. Shwartz, editors (2002), Handbook of Markov Decision Processes, Kluwer.
Fletcher, C. A. J. (1984), Computational Galerkin Methods, Springer.
Howard, R. A. (1960), Dynamic Programming and Markov Processes, MIT Press.
Lagoudakis, M. G. and R. Parr (2003), "Least-squares policy iteration," J. Mach. Learn. Res., vol. 4, pp. 1107–1149.


Logan, D. L. (2016), A First Course in the Finite Element Method, 6th ed., Cengage Learning.
Polyak, B. T. (1987), Introduction to Optimization, Optimization Software.
Reddy, J. N. (2006), An Introduction to the Finite Element Method, McGraw-Hill.
Schweitzer, P. and A. Seidman (1985), "Generalized polynomial approximation in Markovian decision processes," J. Math. Anal. Appl., vol. 110, pp. 568–582.
Sutton, R. S., H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesvari, and E. Wiewiora (2009), "Fast gradient-descent methods for temporal-difference learning with linear function approximation," Proc. Int. Conf. Machine Learning (ICML), pp. 993–1000, Montreal.
Sutton, R. S., C. Szepesvari, and H. R. Maei (2009), "A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation," Proc. Advances Neural Information Processing Systems (NIPS), pp. 1609–1616, Vancouver.
Towfic, Z. J. and A. H. Sayed (2015), "Stability and performance limits of adaptive primal–dual networks," IEEE Trans. Signal Process., vol. 63, no. 11, pp. 2888–2903.
Zulehner, W. (2001), "Analysis of iterative methods for saddle point problems: A unified approach," Math. Comput., vol. 71, no. 238, pp. 479–505.

45 Value and Policy Iterations

We continue our treatment of Markov decision processes (MDPs) and focus in this chapter on methods for determining optimal actions or policies. We derive two popular methods known as value iteration and policy iteration, and establish their convergence properties. We also examine the Bellman optimality principle in the context of value and policy learning. In a later section, we extend the discussion to the more challenging case of partially observable MDPs (POMDPs), where the successive states of the MDP are unobservable to the agent, and the agent is only able to sense measurements emitted randomly by the MDP from the various states. We will define POMDPs and explain that they can be reformulated as belief-MDPs with continuous (rather than discrete) states. This fact complicates the solution of the value iteration. Nevertheless, we will show that the successive value iterates share a useful property, namely, that they are piecewise linear and convex. This property can be exploited by computational methods to reduce the complexity of solving the value iteration for POMDPs.

45.1 VALUE ITERATION

The policy evaluation procedure (44.116) for MDPs assesses the performance of a particular policy π(a|s) by computing the state value function v^π(s) for all states. Obviously, different policies will generally lead to different state value functions.

45.1.1 Optimal Behavior

Ideally, if the agent is given the freedom to select a policy, it would like to follow the policy that maximizes the state value function for each state. That is, the agent would like to solve

π^⋆(a|s) ≜ argmax_π v^π(s)          (45.1)

where the maximization is over the policy π(a|s). We are denoting the optimal strategy by the notation π^⋆(a|s), with a ⋆ superscript. By solving problem (45.1), the agent will be able to determine, for each state s, a distribution π^⋆(a|s) for


the actions conditioned on that state. The resulting optimal state value function will be denoted by

v^⋆(s) ≜ v^{π^⋆}(s)
       = max_π v^π(s)
       = expected cumulative reward starting from state s and following
         an optimal policy π^⋆(a|s) thereafter.          (45.2)

As we are going to see, optimal policies π^⋆(a|s) are not unique, but they all must attain the same optimal state value v^⋆(s) by (45.1). The purpose of this section is to derive an iterative procedure, known as the value iteration, that satisfies two objectives: (a) determine the optimal state value function v^⋆(s); and (b) determine an optimal policy π^⋆(a|s) that attains these optimal state values. We will also assign optimal values to the intermediate state–action pairs or q-states of the MDP. We denote the optimal state–action values by

q^⋆(s, a) ≜ q^{π^⋆}(s, a)
          = expected cumulative reward starting from state s, taking action a,
            and following an optimal policy π^⋆(a|s) thereafter.          (45.3)

Observe that the optimal values v^⋆(s) and q^⋆(s, a) are obtained by following the same policy π^⋆(a|s) (which we have not determined yet). Therefore, applying expressions (44.91), (44.92), and (44.94), we find that these optimal values satisfy similar relations:

v^⋆(s) = Σ_{a∈A} π^⋆(a|s) q^⋆(s, a)          (45.4a)
q^⋆(s, a) = Σ_{s′∈S} P(s, a, s′) [ r(s, a, s′) + γ v^⋆(s′) ]          (45.4b)
q^⋆(s, a) = Σ_{a′∈A} Σ_{s′∈S} π^⋆(a′|s′) P(s, a, s′) [ r(s, a, s′) + γ q^⋆(s′, a′) ]          (45.4c)

We observe from (45.4b) that all optimal policies π^⋆(a|s) must also share the same optimal state–action values q^⋆(s, a); this is because v^⋆(s′) is the same for all optimal policies. We note in passing that, by following the same type of argument as in (44.103), whenever the reward function is uniformly bounded the optimal state–action value function will also be uniformly bounded, namely,

max_{s∈S, a∈A} |q^⋆(s, a)| ≤ ( 1/(1 − γ) ) max_{s,s′∈S, a∈A} |r(s, a, s′)|          (45.5)

Relations (45.4a)–(45.4c) cannot be solved for the optimal values {v^⋆(s), q^⋆(s, a)} because they depend on the yet-unknown optimal policy π^⋆(a|s). We will resolve this ambiguity by establishing the Bellman optimality condition further ahead


in Section 45.1.3. In preparation for that discussion, we first establish a useful property of the optimal policy and explain that it amounts to a greedy strategy in the state × action space.

45.1.2 Greedy Policy

One way to construct a particular optimal policy is as follows. Let a^o denote an action that maximizes q^⋆(s, a) over a:

a^o ≜ argmax_{a∈A} q^⋆(s, a)          (45.6)

There can be several optimal actions a^o. We select one of them and construct a deterministic policy π^o(a|s) as follows:

π^o(a|s) = { 1, when a = a^o;  0, otherwise }          (45.7)

In this way, the agent selects the specific action a^o whenever at state s. Alternatively, when multiple optimal actions a^o exist, we can distribute the probability among all of them. For example, assume there are three optimal actions {a_1^o, a_2^o, a_3^o} at state s. Then another construction for a policy is:

π^o(a|s) = { 1/3, when a = a_1^o;  1/3, when a = a_2^o;  1/3, when a = a_3^o;  0, otherwise }          (45.8)

We continue with (45.7) and refer to it as the deterministic policy. Let v^{π^o}(s) denote the value function for this π^o(a|s). We establish in Appendix 45.A that v^{π^o}(s) = v^⋆(s), so that π^o(a|s) is an optimal policy that solves (45.1). In other words, every MDP has at least one optimal deterministic policy. We say that π^o(a|s) is a greedy strategy: it selects the action that maximizes the state–action value function.

Remark 45.1 (Notation for optimal deterministic policy) Since π^o(a|s) is an optimal policy, we can refer to it by using the same notation π^⋆(a|s). However, in order to highlight its deterministic nature, we will refer to construction (45.6)–(45.7) in the more suggestive (but less precise) form:

π^⋆(a|s) := argmax_{a∈A} q^⋆(s, a)          (45.9)

using the symbol :=. The symbol is meant to indicate that this π^⋆(a|s) is constructed by first determining an optimal action a^o that maximizes q^⋆(s, a) and then using it in (45.7) to obtain π^o(a|s). Since this policy is optimal, it satisfies (45.1) and we also refer to it by π^⋆(a|s). However, when necessary, we will use the explicit superscript o to distinguish between an optimal deterministic policy and any other optimal policy.

On the face of it, construction (45.9) is a circular result. In order to construct an optimal deterministic policy, we need to maximize q ? (s, a) which in turn


requires knowledge of an optimal policy. For now, we will view (45.9) as simply a useful property that relates π^o(a|s) and q^⋆(s, a). Later, we will show that construction (45.9) serves as the launchpad for actual computational procedures for determining optimal policies and optimal state–action values. This is because we will devise procedures that are able to determine q^⋆(s, a) directly "without" explicit knowledge of any optimal policy. When this happens, we will then be able to use (45.9) to deduce an optimal deterministic policy.

The solution of (45.6) is straightforward in general, especially for MDPs involving a finite number of states and actions. To see this, assume we collect the optimal state–action values {q^⋆(s, a)} into a matrix Q^⋆, with rows corresponding to states and columns corresponding to actions:

Q^⋆ = [ q^⋆(s, a) ],   s ∈ S,  a ∈ A,   (|S| × |A|)          (45.10)

Then, construction (45.6) amounts to selecting, for each row s, the column index of the maximum value in that row; say, for a situation with nine states {s_1, . . . , s_9} and four actions {a_1, . . . , a_4}, the matrix Q^⋆ is a 9 × 4 array of entries q^⋆(s, a), and in each row the maximum entry is identified (the boxed entries in the original display), with its column index giving the optimal action for that state. For example, if for state s = 5 the maximum entry occurs under column a_2, this means that the optimal action for state s = 5 is a_2.          (45.11)
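The row-wise maximization just described is a one-line operation in code. The following is a small sketch, assuming NumPy; the Q array is a hypothetical placeholder:

import numpy as np

# hypothetical 3 x 4 array of q-star values: rows are states, columns are actions
Q = np.array([[1.0, 0.5, 0.2, 0.0],
              [0.1, 0.9, 0.3, 0.4],
              [0.0, 0.2, 0.8, 0.1]])

greedy_actions = np.argmax(Q, axis=1)   # one optimal action index per state, as in (45.6)
print(greedy_actions)                   # array([0, 1, 2])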

45.1.3 Bellman Optimality Condition

Let q^{π^o}(s, a) further denote the state–action value function for π^o(a|s). From the second equality in relation (45.137a) in the appendix we have:

v^⋆(s) =^{(45.137a)} v^{π^o}(s) = q^{π^o}(s, a^o) = q^⋆(s, a^o)   (since π^o is optimal)          (45.12)

Note that the action argument appearing in q^⋆(s, a^o) is a^o. We then conclude that the optimal functions {v^⋆(s), q^⋆(s, a)} satisfy the following optimality condition:

v^⋆(s) = max_{a∈A} q^⋆(s, a),   for every s ∈ S          (45.13)


from which we deduce, along with (45.4b), that the following two relations hold:

(Bellman optimality equations)
v^⋆(s) = max_{a∈A} { Σ_{s′∈S} P(s, a, s′) [ r(s, a, s′) + γ v^⋆(s′) ] }          (45.14a)
q^⋆(s, a) = Σ_{s′∈S} P(s, a, s′) [ r(s, a, s′) + γ max_{a′∈A} q^⋆(s′, a′) ]          (45.14b)

Comparing with the earlier expressions (45.4a)–(45.4b), we observe that the optimal policy π^⋆(a|s) does not appear explicitly on the right-hand side of equations (45.14a) and (45.14b); these two equations are referred to as the Bellman optimality equations. It is shown, in a guided manner, in Probs. 45.8 and 45.9 that there exists a unique solution v^⋆(s) to the Bellman equation (45.14a) when γ ∈ [0, 1). However, while v^⋆(s) is unique, the optimal policy π^⋆(a|s) need not be unique: there can be multiple optimal policies leading to the same optimal state value function. Using (44.42), we can interpret the above expressions for v^⋆(s) and q^⋆(s, a) as the following expectations over the distribution of the transitions:

(Bellman optimality equations)
v^⋆(s) = max_{a∈A} { E_P{ r(s, a, s′) + γ v^⋆(s′) } }          (45.15a)
q^⋆(s, a) = E_P{ r(s, a, s′) + γ max_{a′∈A} q^⋆(s′, a′) } =^{(45.13)} E_P{ r(s, a, s′) + γ v^⋆(s′) }          (45.15b)

Moreover, using (45.9), we can construct an optimal deterministic policy by using either of the expressions:

(optimal deterministic policy)
π^⋆(a|s) := argmax_{a∈A} { Σ_{s′∈S} P(s, a, s′) [ r(s, a, s′) + γ v^⋆(s′) ] }          (45.16a)
π^⋆(a|s) := argmax_{a∈A} q^⋆(s, a)          (45.16b)


where the maximization is now over the policy, $\pi(a|s)$, and the expectation is over both $\pi$ and $\mathbb{P}$. In other words, this alternative form restores the policy.

Proof of (45.17): We already know from (45.7) that one optimal deterministic policy is given by
$$
\pi^\star(a|s) = \begin{cases} 1, & \text{for some action } a^o\\ 0, & \text{otherwise} \end{cases} \tag{45.18}
$$
Therefore, returning to (45.17) we have
$$
\max_{\pi}\ \mathbb{E}_{\pi}\Big\{ \mathbb{E}_{\mathbb{P}}\big\{ r(s,a,s') + \gamma\, v^\star(s') \big\} \Big\}
= \max_{\pi} \sum_{a\in\mathcal{A}} \pi(a|s)\, \mathbb{E}_{\mathbb{P}}\big\{ r(s,a,s') + \gamma\, v^\star(s') \big\}
\overset{(45.18)}{=} \mathbb{E}_{\mathbb{P}}\big\{ r(s,a^o,s') + \gamma\, v^\star(s') \big\}
\overset{(45.15b)}{=} q^\star(s,a^o)
\overset{(45.12)}{=} v^\star(s) \tag{45.19}
$$
□

45.1.4 Value Iteration Algorithm

It is instructive to compare (45.14a) with the earlier state value policy evaluation relation (44.69). There, we were able to reduce the Poisson equation (44.69) to the linear system of equations (44.72), from which $v^\pi(s)$ could be determined. In contrast, we now have a maximization step embedded in (45.14a), which makes the problem of determining $v^\star(s)$ more challenging since the optimal value function now satisfies a nonlinear equation instead. Nevertheless, motivated by the fixed-point iteration (44.107), which we employed earlier to solve for $v^\pi(s)$, we can apply a similar technique to solve (45.14a) and determine $v^\star(s)$, namely,
$$
v_k^\star(s) = \max_{a\in\mathcal{A}}\left\{ \sum_{s'\in\mathcal{S}} \mathbb{P}(s,a,s')\big[ r(s,a,s') + \gamma v_{k-1}^\star(s') \big] \right\},\qquad v_0^\star(s)=0,\quad k\ge 1 \tag{45.20}
$$
This procedure is known as the value iteration. Observe that, in contrast to (44.107) where the recursion iterated on the vector $v^\pi$, we are now writing the recursion for individual entries, $v^\star(s)$; one for each state $s$. Alternatively, we can apply the same construction to the state–action value function (45.14b) in the following manner:
$$
\begin{cases}
\displaystyle q_k^\star(s,a) = \sum_{s'\in\mathcal{S}} \mathbb{P}(s,a,s')\big[ r(s,a,s') + \gamma v_{k-1}^\star(s') \big], & q_0^\star(s,a)=0\\[2mm]
\displaystyle v_k^\star(s) = \max_{a\in\mathcal{A}} q_k^\star(s,a), & k\ge 1
\end{cases} \tag{45.21}
$$
This procedure is listed in (45.23) where, although unnecessary for the recursion, we are also introducing the policy that is generated at iteration $k$ and denoting it by $\pi_k^\star(a|s)$; one useful property for this policy iterate is revealed in Prob. 45.12.

Table 45.1 Expressions relating the value and state-value functions in an MDP.

| Relation | Description |
|---|---|
| $v^\pi(s) \stackrel{\Delta}{=} \mathbb{E}_{\pi,\mathbb{P}}\big( \sum_{n=0}^{\infty} \gamma^n r(n) \,\big|\, s_0=s \big)$ | state value function |
| $q^\pi(s,a) \stackrel{\Delta}{=} \mathbb{E}_{\pi,\mathbb{P}}\big( \sum_{n=0}^{\infty} \gamma^n r(n) \,\big|\, s_0=s, a_0=a \big)$ | state–action value function |
| $v^\pi(s) = \mathbb{E}_{\pi,\mathbb{P}}\big\{ r(s,a,s') + \gamma v^\pi(s') \big\}$ | Bellman equation |
| $q^\pi(s,a) = \mathbb{E}_{\pi,\mathbb{P}}\big\{ r(s,a,s') + \gamma q^\pi(s',a') \big\}$ | Bellman equation |
| $v^\pi(s) = \mathbb{E}_{\pi}\big\{ q^\pi(s,a) \big\}$ | useful relation |
| $\pi^\star(a|s) = \underset{\pi}{\mathrm{argmax}}\ v^\pi(s)$ | optimal policy |
| $\pi^\star(a|s) := \underset{a\in\mathcal{A}}{\mathrm{argmax}}\ q^\star(s,a)$ | optimal deterministic policy |
| $v^\star(s) = \max_{a\in\mathcal{A}} q^\star(s,a),\quad (v^\star = v^{\pi^\star},\ q^\star = q^{\pi^\star})$ | optimal state value function |
| $v^\star(s) = \max_{a\in\mathcal{A}} \mathbb{E}_{\mathbb{P}}\big\{ r(s,a,s') + \gamma v^\star(s') \big\}$ | Bellman optimality equation |
| $q^\star(s,a) = \mathbb{E}_{\mathbb{P}}\big\{ r(s,a,s') + \gamma v^\star(s') \big\}$ | Bellman optimality equation |
| $q^\star(s,a) = \mathbb{E}_{\mathbb{P}}\big\{ r(s,a,s') + \gamma \max_{a'\in\mathcal{A}} q^\star(s',a') \big\}$ | Bellman optimality equation |
| $\pi^\star(a|s) := \underset{a\in\mathcal{A}}{\mathrm{argmax}}\ \mathbb{E}_{\mathbb{P}}\big\{ r(s,a,s') + \gamma v^\star(s') \big\}$ | optimal deterministic policy |

One criterion for stopping iterations (45.20) or (45.21) is to check for the condition:
$$
\max_{s\in\mathcal{S}}\ \big| v_k^\star(s) - v_{k-1}^\star(s) \big| < \epsilon\,\frac{1-\gamma}{2\gamma} \tag{45.22}
$$
for some small $\epsilon>0$. That is, we run the algorithm until the state value functions at two successive iterations are sufficiently close to each other. We explain in the following that this stopping criterion ensures that the performance of the policy at the end of the iterations is $\epsilon$-close to the performance of any optimal policy – see expression (45.26). Actually, if $\epsilon$ is small enough, we may still obtain exactly the optimal policy. This is because we do not need to know the exact value of $q^\star(s,a)$ but only for which actions it is maximized.


Value iteration for an MDP, M = {S, A, P, r, π, γ}.
  start from the initial condition $v_0^\star(s)=0$, for all $s\in\mathcal{S}$.
  repeat until convergence over $k\ge 1$:
    for every $(s,a)\in\mathcal{S}\times\mathcal{A}$:
      $q_k^\star(s,a) = \sum_{s'\in\mathcal{S}} \mathbb{P}(s,a,s')\big[ r(s,a,s') + \gamma v_{k-1}^\star(s') \big]$
      $v_k^\star(s) = \max_{a\in\mathcal{A}} q_k^\star(s,a)$
      $\pi_k^\star(a|s) := \mathrm{argmax}_{a\in\mathcal{A}}\ q_k^\star(s,a)$   (not needed for the algorithm)
    end
  end                                                                     (45.23)

We establish in Appendix 45.B that the value iteration (45.23) converges to the optimal values $v^\star(s)$ and $q^\star(s,a)$, namely,
$$
\lim_{k\to\infty} q_k^\star(s,a) = q^\star(s,a), \quad \forall (s,a)\in\mathcal{S}\times\mathcal{A} \tag{45.24a}
$$
$$
\lim_{k\to\infty} v_k^\star(s) = v^\star(s), \quad \forall s\in\mathcal{S} \tag{45.24b}
$$
and that convergence occurs at an exponential rate. In other words, the absolute difference $|v^\star(s)-v_k^\star(s)|$ decays exponentially fast; similarly for the difference $|q^\star(s,a)-q_k^\star(s,a)|$. We also find that the convergence of $v_k^\star(s)$ to $v^\star(s)$ is monotonic, meaning that $v_k^\star(s)$ gets closer to $v^\star(s)$ as $k$ increases. In a similar vein, the policy iterates $\pi_k^\star(a|s)$ approach an optimal deterministic policy:
$$
\lim_{k\to\infty} \pi_k^\star(a|s) = \pi^\star(a|s), \quad \forall (s,a)\in\mathcal{S}\times\mathcal{A} \tag{45.25}
$$
Assume we stop the value iteration (45.23) at some iteration $k_o$ after condition (45.22) has been met. We denote the resulting policy iterate by $\pi_{k_o}^\star(a|s)$ and its state value function by $v_{k_o}^\star(s)$. We establish in Appendix 45.C the useful property that this policy iterate will be $\epsilon$-optimal, meaning that its state value function will be $\epsilon$-close to the optimal state value function:
$$
\big| v_{k_o}^\star(s) - v^\star(s) \big| \le \epsilon, \quad \forall s\in\mathcal{S} \tag{45.26}
$$
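The following short Python sketch shows how the value iteration (45.23) can be implemented for a finite MDP with known transition and reward kernels. It is a minimal illustration, not the text's own code: the array layout, function name, and the use of the stopping rule (45.22) with a user-chosen tolerance are our own choices.

```python
import numpy as np

def value_iteration(P, r, gamma, eps=1e-8):
    """Value iteration (45.23) for a finite MDP.

    P : array of shape (S, A, S) with entries P(s, a, s')
    r : array of shape (S, A, S) with entries r(s, a, s')
    Returns the optimal state values v, state-action values q,
    and a greedy deterministic policy (one action index per state).
    """
    S, A, _ = P.shape
    v = np.zeros(S)                                   # v_0(s) = 0
    while True:
        # q_k(s, a) = sum_{s'} P(s, a, s') [ r(s, a, s') + gamma * v_{k-1}(s') ]
        q = np.einsum('sap,sap->sa', P, r + gamma * v[None, None, :])
        v_new = q.max(axis=1)                         # v_k(s) = max_a q_k(s, a)
        # stopping rule in the spirit of (45.22)
        if np.max(np.abs(v_new - v)) < eps * (1 - gamma) / (2 * gamma):
            v = v_new
            break
        v = v_new
    policy = q.argmax(axis=1)                         # greedy deterministic policy
    return v, q, policy
```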

45.1.5 Principle of Optimality

As was the case with (44.115), it turns out that the successive iterates $\{v_k^\star(s)\}$ that are generated by the value iteration (45.20) have a useful interpretation when the boundary condition is set to $v_0^\star(s)=0$ for all $s\in\mathcal{S}$. Thus, note that for $k=1$:
$$
v_1^\star(s) = \max_{a\in\mathcal{A}}\left\{ \sum_{s'\in\mathcal{S}} \mathbb{P}(s,a,s')\, r(s,a,s') \right\} \tag{45.27}
$$


which is the maximum expected reward for one-step transitions from state $s$. Moving to the next iteration, we get
$$
v_2^\star(s) = \max_{a\in\mathcal{A}}\left\{ \sum_{s'\in\mathcal{S}} \mathbb{P}(s,a,s')\big[ r(s,a,s') + \gamma v_1^\star(s') \big] \right\} \tag{45.28}
$$
where the right-hand side is the sum of two terms: the first term is the expected reward for state $s$ during the first transition, while the second term is the discounted optimal reward for the next transition from state $s'$. Therefore, if we maximize the combination of both terms over the action variable, $a$, we find that $v_2^\star(s)$ corresponds to the maximum reward value that can be expected for two steps into the future. More generally, the value $v_k^\star(s)$ corresponds to the maximum discounted expected reward for state $s$ by looking $k$ steps into the future – see Prob. 45.11 and also the comments at the end of the chapter on dynamic programming leading to result (45.123):
$$
v_k^\star(s) = \text{maximum expected reward for state } s \text{ looking } k \text{ steps into the future} \tag{45.29}
$$
We refer to $v_k^\star(s)$ as the state value function of depth $k$. We show in Prob. 45.12 that the iterate $\pi_k^\star(a|s)$ is a (deterministic) policy that attains this maximum reward value. We therefore observe from the value iteration (45.23) that the algorithm constructs the optimal state value function recursively. It finds the optimal state value function of depth $k$ from the optimal state value function of depth $k-1$, and so forth. This is an example of a procedure that constructs the optimal solution for a longer time horizon by using optimal solutions for shorter time horizons. This construction is one notable example of dynamic programming. This approach is made viable by a fundamental optimality principle, which essentially states that whatever the initial state is, an optimal policy must remain optimal from the first transition onward. More formally, we have the following statement.

(Principle of optimality) A policy $\pi(a|s)$ is optimal for state $s$ and achieves $v^\pi(s)=v^\star(s)$ if, and only if, for any state $s'$ that is reachable from $s$, the same policy $\pi(a|s')$ will achieve the optimal value for $s'$, i.e., $v^\pi(s')=v^\star(s')$.

Proof: The result is motivated in the comments at the end of the chapter, where we discuss dynamic programming more broadly – see the argument leading to (45.123) and also see Prob. 45.19. □

Example 45.1 (Policy extraction using value iteration) We re-examine Example 44.5 and re-derive the optimal policy and the optimal state values under γ = 0.5; these optimal quantities were shown earlier in Fig. 44.8. We were able to derive them by


inspection for this simple example. Here, we want to arrive at the same conclusion by applying the value iteration (45.20). The calculations below show that the value iteration converges in three steps in this case, and the conclusions will match the results derived earlier in Fig. 44.8. To begin with, we have from (44.79c)–(44.79d) that
$$
r(s=3|a=\text{RIGHT}) = -1,\qquad r(s=3|a=\text{LEFT}) = -1,\qquad r(s=3|a=\text{STOP}) = -1 \tag{45.30}
$$
We also know that
$$
\mathbb{P}(s=3,a=\text{RIGHT},s'=4)=1,\qquad \mathbb{P}(s=3,a=\text{LEFT},s'=2)=1 \tag{45.31}
$$
We start with $v_0^\star(s)=0$ and iterate. All transitions in this example are deterministic, so the indicated successor state is reached with probability 1.

(iteration $k=1$)
$$
\begin{aligned}
q_1^\star(s=1,a=\text{STOP}) &= r(s=1|a=\text{STOP}) + \tfrac12\, v_0^\star(s'=6) = 0 + \tfrac12(0) = 0;\qquad v_1^\star(s=1) = 0\ (\text{STOP})\\[1mm]
q_1^\star(s=2,a=\text{RIGHT}) &= -1 + \tfrac12\, v_0^\star(s'=3) = -1,\qquad q_1^\star(s=2,a=\text{LEFT}) = +5 + \tfrac12\, v_0^\star(s'=1) = +5\\
v_1^\star(s=2) &= \max_{a\in\{\text{RIGHT},\text{LEFT}\}} q_1^\star(s=2,a) = +5\ (\text{LEFT})\\[1mm]
q_1^\star(s=3,a=\text{RIGHT}) &= -1 + \tfrac12\, v_0^\star(s'=4) = -1,\qquad q_1^\star(s=3,a=\text{LEFT}) = -1 + \tfrac12\, v_0^\star(s'=2) = -1\\
v_1^\star(s=3) &= -1\ (\text{RIGHT or LEFT})\\[1mm]
q_1^\star(s=4,a=\text{RIGHT}) &= +15 + \tfrac12\, v_0^\star(s'=5) = +15,\qquad q_1^\star(s=4,a=\text{LEFT}) = -1 + \tfrac12\, v_0^\star(s'=3) = -1\\
v_1^\star(s=4) &= \max_{a\in\{\text{RIGHT},\text{LEFT}\}} q_1^\star(s=4,a) = +15\ (\text{RIGHT})\\[1mm]
q_1^\star(s=5,a=\text{STOP}) &= 0 + \tfrac12\, v_0^\star(s'=6) = 0;\qquad v_1^\star(s=5) = 0\ (\text{STOP})\\
q_1^\star(s=6,a=\text{any}) &= 0 + \tfrac12\, v_0^\star(s'=6) = 0;\qquad v_1^\star(s=6) = 0\ (\text{STOP})
\end{aligned} \tag{45.32}
$$

(iteration $k=2$)
$$
\begin{aligned}
q_2^\star(s=1,a=\text{STOP}) &= 0 + \tfrac12\, v_1^\star(s'=6) = 0;\qquad v_2^\star(s=1) = 0\ (\text{STOP})\\[1mm]
q_2^\star(s=2,a=\text{RIGHT}) &= -1 + \tfrac12\, v_1^\star(s'=3) = -1 + \tfrac12(-1) = -1.5,\qquad q_2^\star(s=2,a=\text{LEFT}) = +5 + \tfrac12(0) = +5\\
v_2^\star(s=2) &= +5\ (\text{LEFT})\\[1mm]
q_2^\star(s=3,a=\text{RIGHT}) &= -1 + \tfrac12\, v_1^\star(s'=4) = -1 + \tfrac12(15) = +6.5,\qquad q_2^\star(s=3,a=\text{LEFT}) = -1 + \tfrac12(5) = +1.5\\
v_2^\star(s=3) &= +6.5\ (\text{RIGHT})\\[1mm]
q_2^\star(s=4,a=\text{RIGHT}) &= +15 + \tfrac12\, v_1^\star(s'=5) = +15,\qquad q_2^\star(s=4,a=\text{LEFT}) = -1 + \tfrac12(-1) = -1.5\\
v_2^\star(s=4) &= +15\ (\text{RIGHT})\\[1mm]
q_2^\star(s=5,a=\text{STOP}) &= 0;\qquad v_2^\star(s=5) = 0\ (\text{STOP})\\
q_2^\star(s=6,a=\text{any}) &= 0;\qquad v_2^\star(s=6) = 0\ (\text{STOP})
\end{aligned} \tag{45.33}
$$

(iteration $k=3$)
$$
\begin{aligned}
q_3^\star(s=1,a=\text{STOP}) &= 0;\qquad v_3^\star(s=1) = 0\ (\text{STOP})\\[1mm]
q_3^\star(s=2,a=\text{RIGHT}) &= -1 + \tfrac12\, v_2^\star(s'=3) = -1 + \tfrac12(6.5) = +2.25,\qquad q_3^\star(s=2,a=\text{LEFT}) = +5 + \tfrac12(0) = +5\\
v_3^\star(s=2) &= +5\ (\text{LEFT})\\[1mm]
q_3^\star(s=3,a=\text{RIGHT}) &= -1 + \tfrac12\, v_2^\star(s'=4) = -1 + \tfrac12(15) = +6.5,\qquad q_3^\star(s=3,a=\text{LEFT}) = -1 + \tfrac12(5) = +1.5\\
v_3^\star(s=3) &= +6.5\ (\text{RIGHT})\\[1mm]
q_3^\star(s=4,a=\text{RIGHT}) &= +15 + \tfrac12(0) = +15,\qquad q_3^\star(s=4,a=\text{LEFT}) = -1 + \tfrac12(6.5) = +2.25\\
v_3^\star(s=4) &= +15\ (\text{RIGHT})\\[1mm]
q_3^\star(s=5,a=\text{STOP}) &= 0;\qquad v_3^\star(s=5) = 0\ (\text{STOP})\\
q_3^\star(s=6,a=\text{any}) &= 0;\qquad v_3^\star(s=6) = 0\ (\text{STOP})
\end{aligned} \tag{45.34}
$$

We therefore arrive at the state values shown in Table 45.2 for iterations $k=1,2,3$.

Table 45.2 Listing of the optimal state values of depths 1, 2, 3.

| $s$ | $v_1^\star(s)$ | $v_2^\star(s)$ | $v_3^\star(s)$ |
|---|---|---|---|
| 1 | 0 | 0 | 0 |
| 2 | +5 | +5 | +5 |
| 3 | −1 | +6.5 | +6.5 |
| 4 | +15 | +15 | +15 |
| 5 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 |

Likewise, if we collect the $q_k^\star(s,a)$ values into successive matrix iterates, denoted by $Q_k^\star$ with columns corresponding to the actions $a\in\{\text{RIGHT},\text{LEFT},\text{STOP}\}$, we have:
$$
Q_1^\star = \begin{bmatrix} 0&0&0\\ -1&+5&0\\ -1&-1&0\\ +15&-1&0\\ 0&0&0\\ 0&0&0 \end{bmatrix},\quad
Q_2^\star = \begin{bmatrix} 0&0&0\\ -1.5&+5&0\\ +6.5&+1.5&0\\ +15&-1.5&0\\ 0&0&0\\ 0&0&0 \end{bmatrix},\quad
Q_3^\star = \begin{bmatrix} 0&0&0\\ +2.25&+5&0\\ +6.5&+1.5&0\\ +15&+2.25&0\\ 0&0&0\\ 0&0&0 \end{bmatrix} \tag{45.35}
$$
The row maxima of $Q_3^\star$ (the boxed entries in the original display) identify the column locations of the optimal actions for the respective states along with their optimal state values:
$$
\begin{aligned}
&\text{state } s=2:\ \text{move LEFT with } v^\star(s=2)=+5\\
&\text{state } s=3:\ \text{move RIGHT with } v^\star(s=3)=+6.5\\
&\text{state } s=4:\ \text{move RIGHT with } v^\star(s=4)=+15
\end{aligned} \tag{45.36}
$$
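The few lines of Python below reproduce the numbers of this example. They are a sketch of our own: the state/action encoding (states 1–6 mapped to indices 0–5, actions RIGHT=0, LEFT=1, STOP=2) and the dictionaries describing the deterministic transitions and rewards are hypothetical conveniences, transcribed from (45.30)–(45.31) and (44.79c)–(44.79d).

```python
import numpy as np

gamma = 0.5
n_states = 6
# allowed actions per state and the (deterministic) successor and reward per (state, action)
allowed = {0: [2], 1: [0, 1], 2: [0, 1], 3: [0, 1], 4: [2], 5: [2]}
next_state = {(0, 2): 5, (1, 0): 2, (1, 1): 0, (2, 0): 3, (2, 1): 1,
              (3, 0): 4, (3, 1): 2, (4, 2): 5, (5, 2): 5}
reward = {(0, 2): 0, (1, 0): -1, (1, 1): 5, (2, 0): -1, (2, 1): -1,
          (3, 0): 15, (3, 1): -1, (4, 2): 0, (5, 2): 0}

v = np.zeros(n_states)
for k in range(3):                       # three sweeps, as in the example
    q = {sa: reward[sa] + gamma * v[next_state[sa]] for sa in next_state}
    v = np.array([max(q[(s, a)] for a in allowed[s]) for s in range(n_states)])
    print(k + 1, v)
# The third sweep should print [0, 5, 6.5, 15, 0, 0], matching Table 45.2.
```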


45.2 POLICY ITERATION

The value iteration (45.23) relies on iterating the state and state–action value functions, $v_k^\star(s)$ and $q_k^\star(s,a)$, and requires a sweep of the entire state × action space. We now describe a more efficient method, which relies instead on iterating the deterministic policy, i.e., on propagating iterates of the form $\pi_\ell^\star(a|s)$. This method is motivated by the observation that policy iterates tend to converge to an optimal policy well before the state value function converges to its optimal value.

45.2.1 Derivation

We start from some initial deterministic policy $\pi_0(a|s)$, at iteration $\ell=0$. This policy assigns a particular action, $a$, to each state $s$. We express this assignment more explicitly by writing $a_0(s)$, which refers to the action $a$ that is assigned at iteration $\ell=0$ to state $s$ under policy $\pi_0(a|s)$. That is,
$$
\pi_0(a|s) = \begin{cases} 1, & \text{if } a=a_0(s)\\ 0, & \text{otherwise} \end{cases} \tag{45.37}
$$
Using the deterministic policy $\pi_0(a|s)$ or, equivalently, $a_0(s)$, it holds that
$$
r^{\pi_0}(s) \overset{(44.44)}{=} \sum_{a\in\mathcal{A}} \pi_0(a|s)\left( \sum_{s'\in\mathcal{S}} \mathbb{P}(s,a,s')\, r(s,a,s') \right)
= \sum_{s'\in\mathcal{S}} \mathbb{P}(s,a_0(s),s')\, r(s,a_0(s),s') \tag{45.38a}
$$
and
$$
p^{\pi_0}_{s,s'} \overset{(44.31)}{=} \sum_{a\in\mathcal{A}} \pi_0(a|s)\,\mathbb{P}(s,a,s') = \mathbb{P}(s,a_0(s),s') \tag{45.38b}
$$
and
$$
v^{\pi_0}(s) = \sum_{a\in\mathcal{A}} \pi_0(a|s)\left( \sum_{s'\in\mathcal{S}} \mathbb{P}(s,a,s')\big[ r(s,a,s') + \gamma v^{\pi_0}(s') \big] \right)
= \sum_{s'\in\mathcal{S}} \mathbb{P}(s,a_0(s),s')\big[ r(s,a_0(s),s') + \gamma v^{\pi_0}(s') \big]
\overset{(44.91)}{=} q^{\pi_0}(s,a=a_0(s)) \tag{45.38c}
$$
Now, given $\pi_0(a|s)$, or equivalently $a_0(s)$, we can evaluate the corresponding state value function (i.e., perform policy evaluation) by using any of the linear forms described in the previous chapter. We can use either the linear system of equations (44.72) to solve for the value vector $v^{\pi_0}$ with entries $\{v^{\pi_0}(s)\}$, i.e.,
$$
v^{\pi_0} = \left( I - \gamma P^{\pi_0} \right)^{-1} r^{\pi_0} \tag{45.39}
$$


or apply the fixed-point iteration (44.107) repeatedly to converge to this vector:
$$
v_k^{\pi_0} = r^{\pi_0} + \gamma P^{\pi_0} v_{k-1}^{\pi_0},\qquad v_0^{\pi_0}=0,\quad k\ge 1 \tag{45.40}
$$
This second recursion can be written for each state entry separately as (see (44.69)):
$$
v_k^{\pi_0}(s) = \sum_{s'\in\mathcal{S}} \mathbb{P}(s,a_0(s),s')\big[ r(s,a_0(s),s') + \gamma v_{k-1}^{\pi_0}(s') \big] \tag{45.41}
$$
Using these state values, and motivated by (45.16a), we now consider improving on the policy by solving:
$$
\pi_1(a|s) := \underset{a\in\mathcal{A}}{\mathrm{argmax}}\left\{ \sum_{s'\in\mathcal{S}} \mathbb{P}(s,a,s')\big[ r(s,a,s') + \gamma v^{\pi_0}(s') \big] \right\} \tag{45.42}
$$
where the result is a deterministic policy again and, hence, the use of the assignment symbol :=. The process is repeated until convergence (which we still need to establish). The algorithm is listed in (45.43).

Policy iteration for an MDP, M = {S, A, P, r, π, γ}.
  initial policy and state values: $\pi_0(a|s)=a_0(s)$, $v_0^{\pi_0}(s)=0$, for all $(s,a)\in\mathcal{S}\times\mathcal{A}$.
  repeat until convergence over $\ell\ge 1$:
    repeat until convergence over $k\ge 1$ (for all states $s\in\mathcal{S}$):
      $v_k^{\pi_{\ell-1}}(s) = \sum_{s'\in\mathcal{S}} \mathbb{P}(s,a_{\ell-1}(s),s')\big[ r(s,a_{\ell-1}(s),s') + \gamma\, v_{k-1}^{\pi_{\ell-1}}(s') \big]$
    end
    $v^{\pi_{\ell-1}}(s) \leftarrow v_k^{\pi_{\ell-1}}(s)$, for all $s\in\mathcal{S}$
    $q^{\pi_{\ell-1}}(s,a) = \sum_{s'\in\mathcal{S}} \mathbb{P}(s,a,s')\big[ r(s,a,s') + \gamma v^{\pi_{\ell-1}}(s') \big]$, for all $(s,a)\in\mathcal{S}\times\mathcal{A}$
    $a_\ell(s) = \mathrm{argmax}_{a\in\mathcal{A}}\ q^{\pi_{\ell-1}}(s,a)$, for all $s\in\mathcal{S}$
    $\pi_\ell(a|s) = 1$ if $a=a_\ell(s)$, and $0$ otherwise, for all $s\in\mathcal{S}$
  end                                                                     (45.43)
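For concreteness, the sketch below implements the policy iteration (45.43) in Python for a finite MDP with known kernels. It is our own illustration under two stated assumptions: the policy-evaluation step is carried out exactly by solving the linear system (45.39) rather than by the inner fixed-point recursion, and the initial deterministic policy simply picks the first action in every state.

```python
import numpy as np

def policy_iteration(P, r, gamma):
    """Policy iteration (45.43) with exact policy evaluation (45.39).

    P : (S, A, S) transition kernel, r : (S, A, S) rewards.
    Returns the optimal state values, state-action values, and the
    final deterministic policy (one action index per state).
    """
    S, A, _ = P.shape
    policy = np.zeros(S, dtype=int)                  # arbitrary initial deterministic policy
    while True:
        # policy evaluation: v = (I - gamma P_pi)^{-1} r_pi
        P_pi = P[np.arange(S), policy]               # (S, S) rows P(s, a(s), .)
        r_pi = np.einsum('sp,sp->s', P_pi, r[np.arange(S), policy])
        v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # policy improvement: act greedily with respect to q^{pi}
        q = np.einsum('sap,sap->sa', P, r + gamma * v[None, None, :])
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return v, q, policy
        policy = new_policy
```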

We establish in Appendix 45.D that the policy iteration (45.43) converges to the optimal state and state–action value functions, $v^\star(s)$ and $q^\star(s,a)$:
$$
\lim_{\ell\to\infty} v^{\pi_\ell}(s) = v^\star(s), \quad \forall s\in\mathcal{S} \tag{45.44a}
$$
$$
\lim_{\ell\to\infty} q^{\pi_\ell}(s,a) = q^\star(s,a), \quad \forall (s,a)\in\mathcal{S}\times\mathcal{A} \tag{45.44b}
$$


where $v^\star(s)$ is the unique solution to the Bellman optimality equation (45.14a). It also holds that
$$
v^{\pi_{\ell-1}}(s) \le v^{\pi_\ell}(s) \le v^\star(s), \quad \forall s\in\mathcal{S} \tag{45.45}
$$
so that the policy iteration generates increasing state values that are bounded from above (and are therefore convergent). Moreover, the policy iterates converge to an optimal deterministic policy:
$$
\lim_{\ell\to\infty} \pi_\ell(a|s) = \pi^\star(a|s), \quad \forall (s,a)\in\mathcal{S}\times\mathcal{A} \tag{45.46}
$$
with the convergence of $\pi_\ell(a|s)$ to $\pi^\star(a|s)$ occurring in a finite number of steps, namely, in at most $|\mathcal{A}|^{|\mathcal{S}|}$ steps (the number of distinct deterministic policies).

Example 45.2 (Policy extraction using policy iteration) We re-examine Example 45.1 and re-derive the optimal policy for $\gamma=1/2$. There, we applied the value iteration (45.23). Now we apply the policy iteration (45.43), with one modification. Rather than iterate to determine the state values, $v^{\pi_\ell}(s)$, we will instead solve for them directly by using the linear system of equations:
$$
\left( I - \tfrac12 P^{\pi_\ell} \right) v^{\pi_\ell} = r^{\pi_\ell} \tag{45.47}
$$
where the vectors $\{v^{\pi_\ell}, r^{\pi_\ell}\}$ consist of the entries $\{v^{\pi_\ell}(s), r^{\pi_\ell}(s)\}$ for all states. To begin with, we already know from Example 44.6 that (see, e.g., (44.79c)–(44.79d)):
$$
\begin{aligned}
&r(s=1|a=\text{STOP}) = 0\\
&r(s=2|a=\text{RIGHT}) = -1,\qquad r(s=2|a=\text{LEFT}) = +5\\
&r(s=3|a=\text{RIGHT}) = -1,\qquad r(s=3|a=\text{LEFT}) = -1\\
&r(s=4|a=\text{RIGHT}) = +15,\qquad r(s=4|a=\text{LEFT}) = -1\\
&r(s=5|a=\text{STOP}) = 0,\qquad r(s=6|a=\text{STOP}) = 0
\end{aligned} \tag{45.48}
$$
We start from the initial deterministic policy:
$$
\pi_0(a|s=1)=\text{STOP},\quad \pi_0(a|s=2)=\text{RIGHT},\quad \pi_0(a|s=3)=\text{RIGHT},\quad \pi_0(a|s=4)=\text{LEFT},\quad \pi_0(a|s=5)=\text{STOP},\quad \pi_0(a|s=6)=\text{STOP} \tag{45.49}
$$
and proceed to evaluate its state and state–action values. Note that, for such deterministic policies, when we write, for example, $\pi_0(a|s=3)=\text{RIGHT}$, it is meant that
$$
\mathbb{P}(a=a|s=3) = \begin{cases} 1, & a=\text{RIGHT}\\ 0, & \text{otherwise} \end{cases} \tag{45.50}
$$


(iteration $\ell=1$) The expected one-step rewards under policy $\pi_0(a|s)$ are given by
$$
\begin{aligned}
r^{\pi_0}(s=1) &= \sum_{a\in\mathcal{A}} \pi_0(a|s=1)\, r(s=1|a) = 1\times r(s=1|a=\text{STOP}) = 0\\
r^{\pi_0}(s=2) &= 1\times r(s=2|a=\text{RIGHT}) = -1,\qquad r^{\pi_0}(s=3) = 1\times r(s=3|a=\text{RIGHT}) = -1\\
r^{\pi_0}(s=4) &= 1\times r(s=4|a=\text{LEFT}) = -1,\qquad r^{\pi_0}(s=5) = 0,\qquad r^{\pi_0}(s=6) = 0
\end{aligned} \tag{45.51}
$$
so that
$$
r^{\pi_0} = \begin{bmatrix} 0\\ -1\\ -1\\ -1\\ 0\\ 0 \end{bmatrix} \tag{45.52}
$$
Likewise, the nonzero state transition probabilities are given by
$$
\begin{aligned}
p^{\pi_0}_{s=1,s'=6} &= \sum_{a\in\mathcal{A}} \pi_0(a|s=1)\,\mathbb{P}(s=1,a,s'=6) = 1\times \mathbb{P}(s=1,a=\text{STOP},s'=6) = 1\\
p^{\pi_0}_{s=2,s'=3} &= 1\times \mathbb{P}(s=2,a=\text{RIGHT},s'=3) = 1,\qquad p^{\pi_0}_{s=3,s'=4} = 1\times \mathbb{P}(s=3,a=\text{RIGHT},s'=4) = 1\\
p^{\pi_0}_{s=4,s'=3} &= 1\times \mathbb{P}(s=4,a=\text{LEFT},s'=3) = 1,\qquad p^{\pi_0}_{s=5,s'=6} = 1,\qquad p^{\pi_0}_{s=6,s'=6} = 1
\end{aligned} \tag{45.53}
$$
That is, the state transition matrix under policy $\pi_0(a|s)$ is given by
$$
P^{\pi_0} = \begin{bmatrix}
0&0&0&0&0&1\\
0&0&1&0&0&0\\
0&0&0&1&0&0\\
0&0&1&0&0&0\\
0&0&0&0&0&1\\
0&0&0&0&0&1
\end{bmatrix} \tag{45.54}
$$
It follows that
$$
v^{\pi_0} = \left( I - \tfrac12 P^{\pi_0} \right)^{-1} r^{\pi_0} = \begin{bmatrix} 0\\ -2\\ -2\\ -2\\ 0\\ 0 \end{bmatrix} \tag{45.55}
$$
while (recall that every transition here is deterministic, so each $q$-value involves a single successor state):
$$
\begin{aligned}
q^{\pi_0}(s=1,a=\text{STOP}) &= 0 + \tfrac12\, v^{\pi_0}(s'=6) = 0\\
q^{\pi_0}(s=2,a=\text{RIGHT}) &= -1 + \tfrac12\, v^{\pi_0}(s'=3) = -1 + \tfrac12(-2) = -2,\qquad q^{\pi_0}(s=2,a=\text{LEFT}) = +5 + \tfrac12\, v^{\pi_0}(s'=1) = +5\\
q^{\pi_0}(s=3,a=\text{RIGHT}) &= -1 + \tfrac12\, v^{\pi_0}(s'=4) = -2,\qquad q^{\pi_0}(s=3,a=\text{LEFT}) = -1 + \tfrac12\, v^{\pi_0}(s'=2) = -2\\
q^{\pi_0}(s=4,a=\text{RIGHT}) &= +15 + \tfrac12\, v^{\pi_0}(s'=5) = +15,\qquad q^{\pi_0}(s=4,a=\text{LEFT}) = -1 + \tfrac12\, v^{\pi_0}(s'=3) = -2\\
q^{\pi_0}(s=5,a=\text{STOP}) &= 0 + \tfrac12\, v^{\pi_0}(s'=6) = 0,\qquad q^{\pi_0}(s=6,a=\text{STOP}) = 0
\end{aligned} \tag{45.56}
$$
It follows that the matrix of state–action values under $\pi_0(a|s)$, with columns corresponding to the actions $a\in\{\text{RIGHT},\text{LEFT},\text{STOP}\}$, is given by
$$
Q^{\pi_0} = \begin{bmatrix}
0 & 0 & \boxed{0}\\
-2 & \boxed{+5} & 0\\
\boxed{-2} & -2 & 0\\
\boxed{+15} & -2 & 0\\
0 & 0 & \boxed{0}\\
0 & 0 & \boxed{0}
\end{bmatrix} \tag{45.57}
$$
where we are boxing the largest values in each row, with the understanding that the STOP action is only permitted at states $s=1,5,6$, while the actions RIGHT and LEFT are only available to states $s=2,3,4$. Note that for the row corresponding to $s=3$, we could have boxed the $-2$ in the first column or the $-2$ in the second column. We adopt the convention that, when there are multiple choices, we select the value that keeps the current action for that state. We conclude that the new iterate for the policy is given by
$$
\pi_1(a|s=1)=\text{STOP},\quad \pi_1(a|s=2)=\text{LEFT},\quad \pi_1(a|s=3)=\text{RIGHT},\quad \pi_1(a|s=4)=\text{RIGHT},\quad \pi_1(a|s=5)=\text{STOP},\quad \pi_1(a|s=6)=\text{STOP} \tag{45.58}
$$
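The evaluation step of this first iteration can be checked numerically with a few lines of Python. This is a sketch of our own; the vector and matrix below are transcribed directly from (45.52) and (45.54).

```python
import numpy as np

gamma = 0.5
r_pi0 = np.array([0., -1., -1., -1., 0., 0.])     # expected one-step rewards (45.52)
P_pi0 = np.zeros((6, 6))                          # transition matrix (45.54)
P_pi0[0, 5] = 1   # s=1 --STOP--> s=6
P_pi0[1, 2] = 1   # s=2 --RIGHT--> s=3
P_pi0[2, 3] = 1   # s=3 --RIGHT--> s=4
P_pi0[3, 2] = 1   # s=4 --LEFT--> s=3
P_pi0[4, 5] = 1   # s=5 --STOP--> s=6
P_pi0[5, 5] = 1   # s=6 --STOP--> s=6

v_pi0 = np.linalg.solve(np.eye(6) - gamma * P_pi0, r_pi0)
print(v_pi0)   # expect [0, -2, -2, -2, 0, 0], matching (45.55)
```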

(iteration $\ell=2$) The expected one-step rewards under policy $\pi_1(a|s)$ are given by
$$
\begin{aligned}
r^{\pi_1}(s=1) &= \sum_{a\in\mathcal{A}} \pi_1(a|s=1)\, r(s=1|a) = 1\times r(s=1|a=\text{STOP}) = 0\\
r^{\pi_1}(s=2) &= 1\times r(s=2|a=\text{LEFT}) = +5,\qquad r^{\pi_1}(s=3) = 1\times r(s=3|a=\text{RIGHT}) = -1\\
r^{\pi_1}(s=4) &= 1\times r(s=4|a=\text{RIGHT}) = +15,\qquad r^{\pi_1}(s=5) = 0,\qquad r^{\pi_1}(s=6) = 0
\end{aligned} \tag{45.59}
$$
so that
$$
r^{\pi_1} = \begin{bmatrix} 0\\ +5\\ -1\\ +15\\ 0\\ 0 \end{bmatrix} \tag{45.60}
$$
Likewise, the nonzero state transition probabilities are given by
$$
\begin{aligned}
p^{\pi_1}_{s=1,s'=6} &= 1\times \mathbb{P}(s=1,a=\text{STOP},s'=6) = 1,\qquad p^{\pi_1}_{s=2,s'=1} = 1\times \mathbb{P}(s=2,a=\text{LEFT},s'=1) = 1\\
p^{\pi_1}_{s=3,s'=4} &= 1\times \mathbb{P}(s=3,a=\text{RIGHT},s'=4) = 1,\qquad p^{\pi_1}_{s=4,s'=5} = 1\times \mathbb{P}(s=4,a=\text{RIGHT},s'=5) = 1\\
p^{\pi_1}_{s=5,s'=6} &= 1,\qquad p^{\pi_1}_{s=6,s'=6} = 1
\end{aligned} \tag{45.61}
$$
That is, the state transition matrix under policy $\pi_1(a|s)$ is given by
$$
P^{\pi_1} = \begin{bmatrix}
0&0&0&0&0&1\\
1&0&0&0&0&0\\
0&0&0&1&0&0\\
0&0&0&0&1&0\\
0&0&0&0&0&1\\
0&0&0&0&0&1
\end{bmatrix} \tag{45.62}
$$
It follows that
$$
v^{\pi_1} = \left( I - \tfrac12 P^{\pi_1} \right)^{-1} r^{\pi_1} = \begin{bmatrix} 0\\ +5\\ +6.5\\ +15\\ 0\\ 0 \end{bmatrix} \tag{45.63}
$$
while
$$
\begin{aligned}
q^{\pi_1}(s=1,a=\text{STOP}) &= 0 + \tfrac12\, v^{\pi_1}(s'=6) = 0\\
q^{\pi_1}(s=2,a=\text{RIGHT}) &= -1 + \tfrac12\, v^{\pi_1}(s'=3) = -1 + \tfrac12(6.5) = +2.25,\qquad q^{\pi_1}(s=2,a=\text{LEFT}) = +5 + \tfrac12\, v^{\pi_1}(s'=1) = +5\\
q^{\pi_1}(s=3,a=\text{RIGHT}) &= -1 + \tfrac12\, v^{\pi_1}(s'=4) = +6.5,\qquad q^{\pi_1}(s=3,a=\text{LEFT}) = -1 + \tfrac12\, v^{\pi_1}(s'=2) = +1.5\\
q^{\pi_1}(s=4,a=\text{RIGHT}) &= +15 + \tfrac12\, v^{\pi_1}(s'=5) = +15,\qquad q^{\pi_1}(s=4,a=\text{LEFT}) = -1 + \tfrac12\, v^{\pi_1}(s'=3) = +2.25\\
q^{\pi_1}(s=5,a=\text{STOP}) &= 0,\qquad q^{\pi_1}(s=6,a=\text{STOP}) = 0
\end{aligned} \tag{45.64}
$$
It follows that the matrix of state–action values under $\pi_1(a|s)$, with columns corresponding to the actions $a\in\{\text{RIGHT},\text{LEFT},\text{STOP}\}$, is given by
$$
Q^{\pi_1} = \begin{bmatrix}
0 & 0 & \boxed{0}\\
+2.25 & \boxed{+5} & 0\\
\boxed{+6.5} & +1.5 & 0\\
\boxed{+15} & +2.25 & 0\\
0 & 0 & \boxed{0}\\
0 & 0 & \boxed{0}
\end{bmatrix} \tag{45.65}
$$
where we are boxing the largest values in each row, with the understanding that the STOP action is only permitted at states $s=1,5,6$, while the actions RIGHT and LEFT are only available to states $s=2,3,4$. We conclude that the new iterate for the policy is
$$
\pi_2(a|s=1)=\text{STOP},\quad \pi_2(a|s=2)=\text{LEFT},\quad \pi_2(a|s=3)=\text{RIGHT},\quad \pi_2(a|s=4)=\text{RIGHT},\quad \pi_2(a|s=5)=\text{STOP},\quad \pi_2(a|s=6)=\text{STOP} \tag{45.66}
$$

(iteration $\ell=3$) The expected one-step rewards under policy $\pi_2(a|s)$ are given by
$$
\begin{aligned}
r^{\pi_2}(s=1) &= 1\times r(s=1|a=\text{STOP}) = 0,\qquad r^{\pi_2}(s=2) = 1\times r(s=2|a=\text{LEFT}) = +5\\
r^{\pi_2}(s=3) &= 1\times r(s=3|a=\text{RIGHT}) = -1,\qquad r^{\pi_2}(s=4) = 1\times r(s=4|a=\text{RIGHT}) = +15\\
r^{\pi_2}(s=5) &= 0,\qquad r^{\pi_2}(s=6) = 0
\end{aligned} \tag{45.67}
$$
so that
$$
r^{\pi_2} = \begin{bmatrix} 0\\ +5\\ -1\\ +15\\ 0\\ 0 \end{bmatrix} \tag{45.68}
$$
Likewise, the nonzero state transition probabilities are given by
$$
p^{\pi_2}_{s=1,s'=6} = 1,\quad p^{\pi_2}_{s=2,s'=1} = 1,\quad p^{\pi_2}_{s=3,s'=4} = 1,\quad p^{\pi_2}_{s=4,s'=5} = 1,\quad p^{\pi_2}_{s=5,s'=6} = 1,\quad p^{\pi_2}_{s=6,s'=6} = 1 \tag{45.69}
$$
That is, the state transition matrix under policy $\pi_2(a|s)$ is given by
$$
P^{\pi_2} = \begin{bmatrix}
0&0&0&0&0&1\\
1&0&0&0&0&0\\
0&0&0&1&0&0\\
0&0&0&0&1&0\\
0&0&0&0&0&1\\
0&0&0&0&0&1
\end{bmatrix} \tag{45.70}
$$
It follows that
$$
v^{\pi_2} = \left( I - \tfrac12 P^{\pi_2} \right)^{-1} r^{\pi_2} = \begin{bmatrix} 0\\ +5\\ +6.5\\ +15\\ 0\\ 0 \end{bmatrix} \tag{45.71}
$$

and we observe that v π2 coincides with the value vector v π1 from the previous iteration. We conclude that the construction has converged. The resulting optimal policy shown in (45.66a)–(45.66f), and the corresponding optimal state values in (45.71), coincide with the results shown earlier in Fig. 44.8.

45.2.2 Policy Improvement

The proof of convergence for the policy iteration (45.43) suggests one useful way for improving a deterministic policy. Assume we are given an MDP M = (S, A, P, r, π), with a deterministic policy $\pi(a|s)$. We can evaluate the policy and compute its state–action value function, $q^\pi(s,a)$, for all $s\in\mathcal{S}$, $a\in\mathcal{A}$. Then, an improved policy (which will also be deterministic) can be obtained as follows:
$$
\pi'(a|s) := \underset{a\in\mathcal{A}}{\mathrm{argmax}}\ q^\pi(s,a) \tag{45.72}
$$
That is, for every state $s$, we maximize over the action $a$ and use the result to construct the deterministic policy $\pi'(a|s)$. If we let $\{v^\pi(s), v^{\pi'}(s)\}$ denote the state values for the policies $\{\pi(a|s), \pi'(a|s)\}$, then the same argument that led to (45.45) will show that (see Prob. 45.14):
$$
v^{\pi'}(s) \ge v^\pi(s), \quad s\in\mathcal{S} \tag{45.73}
$$
We then say that $\pi'(a|s)$ is a better policy than $\pi(a|s)$ because it leads to higher state values. We already illustrated this construction in Example 45.2. For instance, the policy $\pi_1(a|s)$ obtained there at iteration $\ell=1$ is better than the initial policy $\pi_0(a|s)$, where we had
$$
v^{\pi_0} = \begin{bmatrix} 0\\ -2\\ -2\\ -2\\ 0\\ 0 \end{bmatrix},\qquad
v^{\pi_1} = \begin{bmatrix} 0\\ +5\\ +6.5\\ +15\\ 0\\ 0 \end{bmatrix} \tag{45.74}
$$
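In code, the improvement step (45.72) is a one-line greedy operation once $q^\pi(s,a)$ is available. The small Python sketch below is our own illustration; the argument names and array shapes are assumptions consistent with the earlier sketches.

```python
import numpy as np

def improve_policy(P, r, gamma, v_pi):
    """One policy-improvement step (45.72): act greedily with respect to q^pi.

    P : (S, A, S) transition kernel, r : (S, A, S) rewards,
    v_pi : (S,) state values of the current policy.
    Returns the improved deterministic policy (one action index per state).
    """
    q_pi = np.einsum('sap,sap->sa', P, r + gamma * v_pi[None, None, :])
    return q_pi.argmax(axis=1)
```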

45.3 PARTIALLY OBSERVABLE MDP

The development in the earlier sections assumed that the state $s$ of the MDP is observable, and that actions by the agent are selected according to a policy $\pi(a|s)$ that is conditioned on knowledge of the state variable. In a manner similar to hidden Markov models (HMMs), we will now examine the situation where the successive states of the MDP are unobservable. At every state $s$, the agent can only observe a measurement $o$ that is emitted at that location. It is sufficient for our purposes to examine the case in which the observation $o\in\mathcal{O}$ is discrete and can assume one of $|\mathcal{O}|$ possible values denoted by $\{o_1,o_2,\ldots,o_{|\mathcal{O}|}\}$. Here, the symbol $\mathcal{O}$ refers to the set of observables, and we will describe the mapping from the state variable $s$ to the observation $o$ by means of an emission probability kernel (or matrix) $B$ with entries
$$
B(s,o) \;\stackrel{\Delta}{=}\; [B]_{so} = \mathbb{P}(o=o\,|\,s=s) \tag{45.75}
$$
For example, referring to the grid problem shown in Fig. 44.4, we may consider a situation in which the agent is not aware of the exact square number (i.e., state) where it is located. Instead, it may only know how many columns separate it from the exit at square 16. In this case, if the agent happens to be in any of the squares {1, 8, 9}, then it is zero columns away from square 16, and if it is in any of the squares {3, 6, 14} then it is two columns away from square 16, and so forth. The number of columns would correspond to the observations, and the elements of the observation space are $\mathcal{O}=\{0,1,2,3\}$. Formally, a POMDP is modeled as a 6-tuple written as
$$
M = (\mathcal{S}, \mathcal{O}, \mathcal{A}, \mathbb{P}, B, r) \tag{45.76}
$$
where the letters $\mathcal{S}$, $\mathcal{A}$, $\mathbb{P}$, and $r$ continue to have the same interpretation as in the original MDP context (44.1), and where we are incorporating the quantities $\mathcal{O}$ and $B$:

(a) (Observation space) $\mathcal{O}$ denotes the finite collection of observations of size $|\mathcal{O}|$. We denote an individual observation by the letter $o\in\mathcal{O}$.

(b) (Emission probability kernel) $B:\mathcal{S}\times\mathcal{O}\to[0,1]$ is a matrix consisting of the emission probabilities for the observations conditioned on the state:
$$
B(s,o) = \mathbb{P}(o=o\,|\,s=s) = \mathbb{P}(o|s) \tag{45.77}
$$
where the boldface notation refers to the random nature of observables and states. The rightmost notation, $\mathbb{P}(o|s)$, is a more compact representation.

We will continue to associate a policy function with the POMDP, denoted by the same letter $\pi:\mathcal{A}\times\mathcal{O}\to[0,1]$, except that this function now represents the likelihood (probability) of selecting actions based on the agent's observation and not state, i.e.,
$$
\pi(a|o) = \text{probability of choosing action } a \text{ when the agent observes } o \tag{45.78}
$$
In a POMDP, the transition from state $s$ to state $s'$ involves the following steps. First, once at state $s$, an observation $o$ is emitted according to the emission kernel $B(s,o)$. The agent interacting with the MDP senses the observation and selects an action $a$ according to the policy $\pi(a|o)$. This action moves the MDP to a new (unknown) state $s'$ according to the transition kernel $\mathbb{P}(s,a,s')$. At this new state, a new observation $o'$ is emitted according to the kernel $B(s',o')$, a new action is selected according to $\pi(a'|o')$, and the MDP moves to a new (unknown) state $s''$ according to $\mathbb{P}(s',a',s'')$, and the process continues in this manner – see Fig. 45.1.

[Figure 45.1 Starting from a state $s$, an observation $o$ is emitted according to the kernel $B(s,o)$ and the agent selects an action according to policy $\pi(a|o)$. The MDP then moves to state $s'$ according to the transition kernel $\mathbb{P}(s,a,s')$ and collects the reward value $r(s,a,s')$.]

45.3.1 Belief Vectors

The objective continues to be learning a policy in order to maximize the cumulative reward. First, however, we need to adjust the definition of the state value function for POMDPs. Since the state $s$ is not observable anymore, we will associate with the POMDP a belief vector $b$ of size $|\mathcal{S}|\times 1$; it is a probability vector whose entries are nonnegative and add up to 1. The size of $b$ matches the number of states, and each entry in $b$ corresponds to one state location. If we number the states $s=1,2,\ldots,|\mathcal{S}|$, then the value of the $s$th entry of vector $b$, denoted by $b(s)$, corresponds to the likelihood that the state is at location $s$:
$$
b(s) \;\stackrel{\Delta}{=}\; \text{likelihood that the state is at location } s \tag{45.79}
$$
For example, consider a POMDP with four states $\{s_1,s_2,s_3,s_4\}$ and a belief vector $b=[0.1\ \ 0.2\ \ 0.6\ \ 0.1]$. This means that the POMDP is most likely to be at state $s=s_3$. The belief vector will be updated over time as the agent collects more observations, moving from the belief $b_0$ at time $n=0$ to the belief $b_1$ at time $n=1$, and so on. We write $b_n$ to denote the belief vector at a generic time instant $n$. It turns out that we can derive an update for the belief vector as follows.

Let $b_n$ be the starting belief vector at time $n$. The agent takes an action $a_n$ and the MDP moves to some new unknown state where observation $o_{n+1}$ is emitted. We would like to update $b_n$ to $b_{n+1}$, as illustrated in the following diagram:
$$
\begin{array}{ccccc}
s_n\ (?) & \xrightarrow{\ \text{action } a_n\ } & s_{n+1}\ (?) & \xrightarrow{\ \text{observation}\ } & o_{n+1}\\
\updownarrow & & \updownarrow & &\\
b_n & \xrightarrow{\ \text{update}\ } & b_{n+1} & &
\end{array}
$$
For any new state $s'\in\mathcal{S}$, the entry $b_{n+1}(s')$ measures the likelihood of being at state $s'$ at time $n+1$, following action $a_n$ at state $s_n$ and having observed $o_{n+1}$. That is,
$$
b_{n+1}(s') \;\stackrel{\Delta}{=}\; \text{likelihood that state is at location } s' \text{ at } n+1 = \mathbb{P}(s_{n+1}=s'\,|\,a_n, o_{n+1}; b_n) \tag{45.80}
$$
The belief vector at time $n$ is added as an argument to (45.80) because it defines the likelihoods of the state values at that instant. It follows from the Bayes rule that
$$
b_{n+1}(s') = \frac{\mathbb{P}(s_{n+1}=s'|a_n;b_n)\,\mathbb{P}(o_{n+1}|s_{n+1}=s',a_n;b_n)}{\mathbb{P}(o_{n+1}|a_n;b_n)}
= \frac{\mathbb{P}(s_{n+1}=s'|a_n;b_n)\,\mathbb{P}(o_{n+1}|s_{n+1}=s')}{\mathbb{P}(o_{n+1}|a_n;b_n)}
= \frac{B(s',o_{n+1})\,\mathbb{P}(s_{n+1}=s'|a_n;b_n)}{\mathbb{P}(o_{n+1}|a_n;b_n)} \tag{45.81}
$$
where the terms in the numerator and denominator are given by
$$
\mathbb{P}(s_{n+1}=s'|a_n;b_n) = \sum_{x\in\mathcal{S}} b_n(x)\,\mathbb{P}(x,a_n,s') \tag{45.82a}
$$
and
$$
\begin{aligned}
\mathbb{P}(o_{n+1}|a_n;b_n) &= \sum_{x'\in\mathcal{S}} \mathbb{P}(o_{n+1},x'\,|\,a_n;b_n)
= \sum_{x'\in\mathcal{S}} \mathbb{P}(o_{n+1}|a_n,x';b_n)\,\mathbb{P}(x'|a_n;b_n)\\
&= \sum_{x'\in\mathcal{S}} B(x',o_{n+1})\,\mathbb{P}(x'|a_n;b_n)
\overset{(45.82a)}{=} \sum_{x'\in\mathcal{S}} B(x',o_{n+1})\left( \sum_{x\in\mathcal{S}} b_n(x)\,\mathbb{P}(x,a_n,x') \right)
\end{aligned} \tag{45.82b}
$$
Substituting into (45.81) gives the update

(state estimator)
$$
b_{n+1}(s') = \frac{\displaystyle\sum_{x\in\mathcal{S}} b_n(x)\,\mathbb{P}(x,a_n,s')\,B(s',o_{n+1})}{\displaystyle\sum_{x'\in\mathcal{S}}\sum_{x\in\mathcal{S}} b_n(x)\,\mathbb{P}(x,a_n,x')\,B(x',o_{n+1})},\qquad \forall s'\in\mathcal{S} \tag{45.83}
$$
We therefore have a relation that allows us to update the belief vector $b_n$ to $b_{n+1}$ by relying solely on the transition and emission kernels $\mathbb{P}$ and $B$. It is customary to refer to the above relation as the state estimator. This is because, once $o_{n+1}$ is observed, it updates $b_n$ to $b_{n+1}$ and finds the new beliefs for the states. Observe that the computation of $b_{n+1}$ requires knowledge of the most immediate belief $b_n$ and none of the previous beliefs, $b_m$ for $m<n$. For this reason, we say that the POMDP is transformed into a belief-MDP.
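The state estimator (45.83) amounts to a "predict, weight by the emission likelihood, and normalize" operation, which takes only a few lines of code. The Python sketch below is our own; the argument names and array layouts are assumptions.

```python
import numpy as np

def belief_update(b, a, o_next, P, B):
    """State estimator (45.83): update the belief after action a and observation o_next.

    b : (S,) current belief vector b_n
    P : (S, A, S) transition kernel, B : (S, O) emission kernel.
    Returns the updated belief vector b_{n+1}.
    """
    pred = b @ P[:, a, :]                 # P(s_{n+1} = s' | a; b), as in (45.82a)
    unnormalized = pred * B[:, o_next]    # weight by the emission likelihood B(s', o')
    return unnormalized / unnormalized.sum()
```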

45.3.2 Belief Value Function

The earlier state value function $v^\pi(s)$, which was a function of the state under MDPs, will now be defined as a function of the belief vector and denoted by $v^\pi(b)$. We will refer to it as the belief value function. This is because the state is unobservable and what is available is a belief vector informing the agent about which states are most likely. By defining the value function in this manner, one major difference arises in relation to the traditional MDP formulation we had in the earlier sections. The discrete state $s$ in $v^\pi(s)$ is now replaced by a real-valued belief vector $b$ in $v^\pi(b)$. We interpret $b$ as playing the role of a new state variable so that we are in effect moving from studying a traditional MDP with discrete states $s$ to studying an MDP with continuous states $b$. Moreover, the vector $b$ of size $|\mathcal{S}|\times 1$ lies within the probability simplex in $\mathbb{R}^{|\mathcal{S}|}$ defined by
$$
\Delta = \left\{ b \;\middle|\; b(s)\ge 0,\ \sum_{s=1}^{|\mathcal{S}|} b(s)=1 \right\} \qquad \text{(probability simplex)} \tag{45.84}
$$
While the evaluation of $v^\pi(s)$ could be attained by computing its value for every discrete realization of $s$, the evaluation of $v^\pi(b)$ will require computing the value function for "infinitely" many $b$ values; this poses a challenge. We will comment on this issue as we proceed with the presentation. The belief value function $v^\pi(b)$ is defined as follows (compare with (44.65)):
$$
v^\pi(b) = \mathbb{E}_{\pi,\mathbb{P},B}\left\{ \sum_{n=0}^{\infty} \gamma^n r(n) \,\middle|\, b_0=b \right\}
= \mathbb{E}_{\pi,\mathbb{P},B}\left\{ \sum_{n=0}^{\infty} \gamma^n r(s_n,a_n,s_{n+1}) \,\middle|\, b_0=b \right\} \tag{45.85}
$$
where the conditioning is relative to the initial belief vector $b_0$ at time $n=0$. Moreover, the expectation is relative to the policy distribution $\pi(a|o)$, the transition kernel $\mathbb{P}$, and the additional emission kernel $B$. We expand (45.85) into an equivalent form as follows:
$$
v^\pi(b) = \mathbb{E}_{\pi,\mathbb{P},B}\big\{ r(s,a,s') \big\} + \gamma\, \mathbb{E}_{\pi,\mathbb{P},B}\left\{ \sum_{n=1}^{\infty} \gamma^{n-1} r(n) \,\middle|\, b_0=b \right\} \tag{45.86}
$$
The first term on the right-hand side is given by
$$
\mathbb{E}_{\pi,\mathbb{P},B}\big\{ r(s,a,s') \big\} = \sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}\sum_{s'\in\mathcal{S}} b(s)\,\pi(a|s)\,\mathbb{P}(s,a,s')\,r(s,a,s') \tag{45.87}
$$
while the second term evaluates to
$$
\begin{aligned}
\mathbb{E}_{\pi,\mathbb{P},B}&\left\{ \sum_{n=1}^{\infty} \gamma^{n-1} r(n) \,\middle|\, b_0=b \right\}\\
&\overset{(a)}{=} \mathbb{E}_{\pi,\mathbb{P},B}\left\{ \sum_{n=0}^{\infty} \gamma^{n} r(n+1) \,\middle|\, b_0=b \right\}\\
&\overset{(b)}{=} \mathbb{E}_{o'}\, \mathbb{E}_{\pi,\mathbb{P},B}\left\{ \sum_{n=0}^{\infty} \gamma^{n} r(n+1) \,\middle|\, o_1=o', b_0=b \right\}\\
&\overset{(c)}{=} \mathbb{E}_{o'}\, \mathbb{E}_{\pi,\mathbb{P},B}\left\{ \sum_{n=0}^{\infty} \gamma^{n} r(n+1) \,\middle|\, b_1=b' \right\}\\
&= \mathbb{E}_{o'}\, v^\pi(b')
= \sum_{o'\in\mathcal{O}} \mathbb{P}(o'=o')\, v^\pi(b')
= \sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}\sum_{s'\in\mathcal{S}}\sum_{o'\in\mathcal{O}} b(s)\,\pi(a|s)\,\mathbb{P}(s,a,s')\,B(s',o')\,v^\pi(b')
\end{aligned} \tag{45.88}
$$
where step (a) is by a change of variables, step (b) uses the conditional mean property $\mathbb{E}\,x = \mathbb{E}\big(\mathbb{E}(x|y)\big)$ for any two random variables $\{x,y\}$, and step (c) is because, under observation $o'$, the belief vector is updated from $b_0$ to the next value denoted by $b_1=b'$ according to (45.83). In the last equality we applied the Bayes rule to note that
$$
\begin{aligned}
\mathbb{P}(o'=o') &= \sum_{s'\in\mathcal{S}} \mathbb{P}(o'=o', s'=s')
= \sum_{s'\in\mathcal{S}} \mathbb{P}(o'=o'|s'=s')\,\mathbb{P}(s'=s')
= \sum_{s'\in\mathcal{S}} B(s',o')\,\mathbb{P}(s'=s')\\
&= \sum_{s'\in\mathcal{S}} B(s',o') \sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}} \mathbb{P}(s'=s', a=a, s=s)
= \sum_{s'\in\mathcal{S}} B(s',o') \sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}} \mathbb{P}(s'=s'|a=a,s=s)\,\mathbb{P}(a=a,s=s)\\
&= \sum_{s'\in\mathcal{S}} B(s',o') \sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}} \mathbb{P}(s,a,s')\,b(s)\,\pi(a|s)
\end{aligned} \tag{45.89}
$$

Returning to (45.86) we arrive at the Bellman equation (compare with (44.70)):

$$
v^\pi(b) = \sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}\sum_{s'\in\mathcal{S}} b(s)\,\pi(a|s)\,\mathbb{P}(s,a,s')\,r(s,a,s')
+ \gamma \sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}\sum_{s'\in\mathcal{S}}\sum_{o'\in\mathcal{O}} b(s)\,\pi(a|s)\,\mathbb{P}(s,a,s')\,B(s',o')\,v^\pi(b') \tag{45.90}
$$
or, equivalently,
$$
v^\pi(b) = \mathbb{E}_{b,\pi,\mathbb{P},B}\big[ r(s,a,s') + \gamma\, v^\pi(b') \big] \tag{45.91}
$$
where the new belief vector $b'$ is related to the prior vector $b$ via relation (45.83). For compactness, we introduce the following quantities that appear in (45.90) by excluding the policy:
$$
r(b,a) \;\stackrel{\Delta}{=}\; \sum_{s\in\mathcal{S}}\sum_{s'\in\mathcal{S}} b(s)\,\mathbb{P}(s,a,s')\,r(s,a,s') \tag{45.92a}
$$
$$
\mathbb{P}(o'|b,a) \;\stackrel{\Delta}{=}\; \sum_{s\in\mathcal{S}}\sum_{s'\in\mathcal{S}} b(s)\,\mathbb{P}(s,a,s')\,B(s',o') \tag{45.92b}
$$
The quantity $r(b,a)$ measures the expected reward, starting from a belief vector $b$ and taking action $a$. The quantity $\mathbb{P}(o'|b,a)$ is the likelihood of emitting $o'$ from the same initial conditions. Using these quantities, we can rewrite the Bellman equation in the form:
$$
v^\pi(b) = \mathbb{E}_{\pi}\left\{ r(b,a) + \gamma \sum_{o'\in\mathcal{O}} \mathbb{P}(o'|b,a)\, v^\pi(b') \right\} \tag{45.93}
$$
where the belief vector is updated from $b$ to $b'$ according to (45.83) under observation $o'$. We can also rewrite the state estimator (45.83) as
$$
b_{n+1}(s') = \frac{1}{\mathbb{P}(o_{n+1}|b_n,a_n)} \sum_{x\in\mathcal{S}} b_n(x)\,\mathbb{P}(x,a_n,s')\,B(s',o_{n+1}) \tag{45.94}
$$

45.3.3 Optimal Policy

The optimal policy is defined by
$$
\pi^\star(a|o) \;\stackrel{\Delta}{=}\; \underset{\pi}{\mathrm{argmax}}\ v^\pi(b) \tag{45.95}
$$
with the corresponding optimal belief value function denoted by
$$
v^\star(b) \;\stackrel{\Delta}{=}\; v^{\pi^\star}(b) = \max_{\pi}\ v^\pi(b) \tag{45.96}
$$
By following arguments similar to the ones that led to (45.14a) in the traditional MDP context, we can similarly derive the following Bellman optimality condition under POMDPs (see Prob. 45.17):
$$
v^\star(b) = \max_{a\in\mathcal{A}}\left\{ r(b,a) + \gamma \sum_{o'\in\mathcal{O}} \mathbb{P}(o'|b,a)\, v^\star(b') \right\} \tag{45.97}
$$
Motivated by the fixed-point iteration (44.107), which we employed earlier to solve for $v^\pi(s)$, we can apply a similar technique to (45.97) and write the value-iteration algorithm:
$$
v_k^\star(b) = \max_{a\in\mathcal{A}}\left\{ r(b,a) + \gamma \sum_{o'\in\mathcal{O}} \mathbb{P}(o'|b,a)\, v_{k-1}^\star(b') \right\},\qquad v_0^\star(b)=0,\quad k\ge 1 \tag{45.98}
$$
where the belief vector is updated from $b$ to the next value $b'$ according to (45.83) for each observation $o'$. If we compare this recursion with the earlier value iteration (45.20) for traditional MDPs, we find that the discrete state variable $s$ is now replaced by the continuous belief vector, $b$. This poses a challenge. One may consider propagating the recursion over a dense grid in the simplex $b\in\Delta$. One may also consider using an approximation for the belief value function, such as linear approximations, as was explained in an earlier section for traditional MDPs. There exist several other solution methods that exploit properties of the belief value function to solve (45.98) more directly. We will not be discussing these methods here, except to comment on one useful property of the iterates $v_k^\star(b)$ that is exploited by several methods, namely, the fact that $v_k^\star(b)$ is piecewise linear and convex.
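As a concrete, if crude, way to propagate (45.98), the Python sketch below discretizes the belief simplex of a two-state POMDP by the scalar $p=b(s_1)$ and interpolates the previous iterate on that grid. This is only an illustration of the grid-based idea mentioned above, with our own function name, arguments, and grid resolution; it is not one of the exact alpha-vector methods discussed later.

```python
import numpy as np

def pomdp_value_iteration_grid(P, B, r, gamma, n_grid=201, n_iter=50):
    """Approximate value iteration (45.98) for a two-state POMDP on a belief grid.

    P : (S, A, S) transition kernel, B : (S, O) emission kernel,
    r : (S, A, S) rewards.  The belief is parameterized by p = b(s1).
    """
    S, A, _ = P.shape
    O = B.shape[1]
    grid = np.linspace(0.0, 1.0, n_grid)
    v = np.zeros(n_grid)                          # v_0(b) = 0
    for _ in range(n_iter):
        v_new = np.empty(n_grid)
        for i, p in enumerate(grid):
            b = np.array([p, 1.0 - p])
            best = -np.inf
            for a in range(A):
                # r(b, a) and P(o'|b, a) from (45.92a)-(45.92b)
                total = np.sum(b[:, None] * P[:, a, :] * r[:, a, :])
                pred = b @ P[:, a, :]
                for o in range(O):
                    po = pred @ B[:, o]
                    if po > 0:
                        b_next = pred * B[:, o] / po          # state estimator (45.83)
                        total += gamma * po * np.interp(b_next[0], grid, v)
                best = max(best, total)
            v_new[i] = best
        v = v_new
    return grid, v
```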

Theorem 45.1. (Piecewise linear and convex property) Let $\alpha_{k,t}$ denote a generic $|\mathcal{S}|\times 1$ real-valued vector at iteration $k$ and indexed by $t$; its individual entries are denoted by $\alpha_{k,t}(s)$. It holds that every belief value function $v_k^\star(b)$ generated by the value iteration (45.98) is piecewise linear and convex over $b$, meaning that it can be expressed in the form
$$
v_k^\star(b) = \max_{t\in\mathcal{C}_k} \sum_{s\in\mathcal{S}} b(s)\,\alpha_{k,t}(s)
= \max\Big\{ b^{\mathsf{T}}\alpha_{k,1},\ b^{\mathsf{T}}\alpha_{k,2},\ b^{\mathsf{T}}\alpha_{k,3},\ \ldots,\ b^{\mathsf{T}}\alpha_{k,|\mathcal{C}_k|} \Big\} \tag{45.99}
$$
over a collection of vectors $\{\alpha_{k,t}\}$, where the set of all indices $t$ at iteration $k$ is denoted by $t\in\mathcal{C}_k$.

The theorem is established in Appendix 45.E. Observe that each inner product of the form $v=b^{\mathsf{T}}\alpha$ describes a hyperplane in $\mathbb{R}^{|\mathcal{S}|}$ where the independent variable is $b$ and the normal direction to the hyperplane is $\alpha$. Therefore, expression (45.99) is stating that $v_k^\star(b)$ is obtained by determining which hyperplane lies above the other hyperplanes. As Example 45.3 will show, the hyperplanes will

overrun each other over different regions of the probability simplex $b\in\Delta$, so that $v_k^\star(b)$ will end up with a piecewise linear and convex form.

Example 45.3 (Illustrating the value iteration for POMDPs) We illustrate the operation of the value iteration (45.98) by considering the two-state MDP with two actions shown in Fig. 45.2. We set the discount factor to $\gamma=0.9$. The states are denoted by $\{s_1,s_2\}$ and the actions by $\{a_1,a_2\}$. The initial belief vector is
$$
b_0 = \begin{bmatrix} \mathbb{P}(s=s_1)\\ \mathbb{P}(s=s_2) \end{bmatrix} = \begin{bmatrix} 1/4\\ 3/4 \end{bmatrix} \tag{45.100}
$$
The observation signal $o\in\{0,1\}$ is binary with emission probabilities:
$$
\begin{aligned}
B(s_1,o=1) &= \mathbb{P}(o=1|s=s_1) = 0.6, &\qquad B(s_1,o=0) &= \mathbb{P}(o=0|s=s_1) = 0.4\\
B(s_2,o=1) &= \mathbb{P}(o=1|s=s_2) = 0.3, &\qquad B(s_2,o=0) &= \mathbb{P}(o=0|s=s_2) = 0.7
\end{aligned} \tag{45.101}
$$

[Figure 45.2 A two-state POMDP with two actions $\{a_1,a_2\}$ and a binary observation $o\in\{0,1\}$.]

The arrows in the figure show the transitions that occur when actions $a_1$ or $a_2$ are taken. For example, when the MDP is at state $s_1$ and action $a_1$ is selected, then the MDP moves with probability 1 to state $s_2$. If, on the other hand, action $a_2$ is selected at state $s_1$, then the MDP moves to state $s_2$ with probability 0.7 or stays at state $s_1$ with probability 0.3. Thus, the transition probability kernel is given by
$$
\begin{aligned}
\mathbb{P}(s_1,a_1,s_1) &= 0, &\quad \mathbb{P}(s_1,a_1,s_2) &= 1\\
\mathbb{P}(s_1,a_2,s_1) &= 0.3, &\quad \mathbb{P}(s_1,a_2,s_2) &= 0.7\\
\mathbb{P}(s_2,a_1,s_1) &= 1, &\quad \mathbb{P}(s_2,a_1,s_2) &= 0\\
\mathbb{P}(s_2,a_2,s_1) &= 0.6, &\quad \mathbb{P}(s_2,a_2,s_2) &= 0.4
\end{aligned} \tag{45.102}
$$
We assume the rewards are state-dependent and set them to $r(s_1)=2$ and $r(s_2)=1$. We start from $v_0^\star(b)=0$ for all belief vectors $b$; these are two-dimensional vectors in the probability simplex in $\mathbb{R}^2$.

(first iteration $k=1$) Any belief vector $b$ can be parameterized in the form
$$
b = \begin{bmatrix} p\\ 1-p \end{bmatrix} \tag{45.103}
$$
for some scalar $p\in[0,1]$. Using (45.92a)–(45.92b) we have
$$
\begin{aligned}
r(b,a_1) &= b(s_1)\mathbb{P}(s_1,a_1,s_1)r(s_1) + b(s_1)\mathbb{P}(s_1,a_1,s_2)r(s_2) + b(s_2)\mathbb{P}(s_2,a_1,s_1)r(s_1) + b(s_2)\mathbb{P}(s_2,a_1,s_2)r(s_2)\\
&= (p\times 0\times 2) + (p\times 1\times 1) + ((1-p)\times 1\times 2) + ((1-p)\times 0\times 1) = 2-p
\end{aligned} \tag{45.104a}
$$
$$
\begin{aligned}
r(b,a_2) &= (p\times 0.3\times 2) + (p\times 0.7\times 1) + ((1-p)\times 0.6\times 2) + ((1-p)\times 0.4\times 1) = 1.6-0.3p
\end{aligned} \tag{45.104b}
$$
Observe that the expressions for $r(b,a_1)$ and $r(b,a_2)$ are both linear in the parameter $p$. Their values would agree at
$$
2-p = 1.6-0.3p \iff p = 4/7 \tag{45.105}
$$
The linear curves are shown in Fig. 45.3; they intersect at location $p=4/7$. According to the value iteration (45.98), we need to choose the maximum over the actions $\{a_1,a_2\}$ since
$$
v_1^\star(b) = \max_{a\in\{a_1,a_2\}}\big\{ r(b,a_1),\ r(b,a_2) \big\} \tag{45.106}
$$
Doing so results in the thick dark piecewise segment shown in the figure. Note that the value function, $v_1^\star(\cdot)$, is a function of the parameter $p$ as well and we could write $v_1^\star(p)$ if desired, as in the figure:
$$
v_1^\star(b) = \begin{cases} 2-p, & \text{if } p\le 4/7\\ 1.6-0.3p, & \text{if } p>4/7 \end{cases},\qquad \text{where } b=\begin{bmatrix} p\\ 1-p \end{bmatrix} \tag{45.107}
$$
Moreover, the optimal action at the first iteration is dependent on the value of $b$ or $p$:
$$
a_0^\star(b) = \begin{cases} a_1, & \text{if } p\le 4/7\\ a_2, & \text{if } p>4/7 \end{cases} \tag{45.108}
$$
These expressions indicate what the optimal value function and action should be for any value of the argument $p\in[0,1]$ (i.e., for any belief vector $b$).

[Figure 45.3 Plot of the reward values $r(b_0,a_1)$ and $r(b_0,a_2)$, which are linear functions in $p$. The solid dark line represents the maximizing curve.]

(second iteration $k=2$) We now use (45.92b) to find
$$
\begin{aligned}
\mathbb{P}(o'=1|b,a_1) &= (p\times 0\times 0.6) + (p\times 1\times 0.3) + ((1-p)\times 1\times 0.6) + ((1-p)\times 0\times 0.3) = 0.6-0.3p\\
\mathbb{P}(o'=0|b,a_1) &= (p\times 0\times 0.4) + (p\times 1\times 0.7) + ((1-p)\times 1\times 0.4) + ((1-p)\times 0\times 0.7) = 0.4+0.3p\\
\mathbb{P}(o'=1|b,a_2) &= (p\times 0.3\times 0.6) + (p\times 0.7\times 0.3) + ((1-p)\times 0.6\times 0.6) + ((1-p)\times 0.4\times 0.3) = 0.48-0.09p\\
\mathbb{P}(o'=0|b,a_2) &= (p\times 0.3\times 0.4) + (p\times 0.7\times 0.7) + ((1-p)\times 0.6\times 0.4) + ((1-p)\times 0.4\times 0.7) = 0.52+0.09p
\end{aligned} \tag{45.109}
$$
We also need to determine $v_1^\star(b')$, where $b'$ is computed from $b$ and $o'$ according to (45.83), for the various possibilities of $o'$ and action $a$:

(a) case $(a=a_1, o'=1)$:
$$
b'(s_1) = \frac{p\times 0\times 0.6 + (1-p)\times 1\times 0.6}{0.6-0.3p} = \frac{0.6-0.6p}{0.6-0.3p},\qquad
b'(s_2) = \frac{0.3p}{0.6-0.3p} \tag{45.110a–c}
$$
so that, referring to the general form of $v_1^\star(b)$ given in (45.107),
$$
v_1^\star(b') = \frac{1}{0.6-0.3p}\begin{cases} 0.6, & \text{if } p\ge 0.6\\ 0.78-0.3p, & \text{otherwise} \end{cases} \tag{45.110d–e}
$$

(b) case $(a=a_1, o'=0)$:
$$
b' = \frac{1}{0.4+0.3p}\begin{bmatrix} 0.4-0.4p\\ 0.7p \end{bmatrix},\qquad
v_1^\star(b') = \frac{1}{0.4+0.3p}\begin{cases} 0.4+p, & \text{if } p\ge 0.3\\ 0.52+0.6p, & \text{otherwise} \end{cases} \tag{45.111}
$$

(c) case $(a=a_2, o'=1)$:
$$
b' = \frac{1}{0.48-0.09p}\begin{bmatrix} 0.36-0.18p\\ 0.12+0.09p \end{bmatrix},\qquad
v_1^\star(b') = \frac{1}{0.48-0.09p}\begin{cases} 0.6, & \text{if } p\ge 2/3\\ 0.66-0.09p, & \text{otherwise} \end{cases} \tag{45.112}
$$

(d) case $(a=a_2, o'=0)$:
$$
b' = \frac{1}{0.52+0.09p}\begin{bmatrix} 0.24-0.12p\\ 0.28+0.21p \end{bmatrix},\qquad
v_1^\star(b') = \frac{0.8+0.3p}{0.52+0.09p},\quad \forall p \tag{45.113}
$$

Returning to the value iteration (45.98), and using the expressions derived above, we have for any $b$:
$$
v_2^\star(b) = \max_{a\in\{a_1,a_2\}}\left\{
r(b,a_1) + \gamma\Big[ \mathbb{P}(o'=1|b,a_1)v_1^\star(b'|a_1,o'=1) + \mathbb{P}(o'=0|b,a_1)v_1^\star(b'|a_1,o'=0) \Big],\right.
$$
$$
\left.\quad r(b,a_2) + \gamma\Big[ \mathbb{P}(o'=1|b,a_2)v_1^\star(b'|a_2,o'=1) + \mathbb{P}(o'=0|b,a_2)v_1^\star(b'|a_2,o'=0) \Big] \right\} \tag{45.114}
$$
so that (note how the factors $\mathbb{P}(o'|b,a)$ cancel against the denominators in (45.110)–(45.113)):
$$
v_2^\star(b) = \max\Big\{
(2-p) + 0.9\big[ 0.6\,\mathbb{I}[p\ge 0.6] + (0.78-0.3p)\,\mathbb{I}[p<0.6] + (0.4+p)\,\mathbb{I}[p\ge 0.3] + (0.52+0.6p)\,\mathbb{I}[p<0.3] \big],
$$
$$
(1.6-0.3p) + 0.9\big[ 0.6\,\mathbb{I}[p\ge 2/3] + (0.66-0.09p)\,\mathbb{I}[p<2/3] + 0.8+0.3p \big]
\Big\} \tag{45.115}
$$

which simplifies to
$$
v_2^\star(b) = \begin{cases}
\max\big\{ 3.170-0.730p,\ 2.914-0.111p \big\}, & 0\le p<0.3\\
\max\big\{ 3.062-0.370p,\ 2.914-0.111p \big\}, & 0.3\le p<0.6\\
\max\big\{ 2.900-0.100p,\ 2.914-0.111p \big\}, & 0.6\le p<2/3\\
\max\big\{ 2.900-0.100p,\ 2.860-0.030p \big\}, & 2/3\le p\le 1
\end{cases} \tag{45.116}
$$
This result illustrates the piecewise linear nature of $v_2^\star(b)$, as was the case with the previous iterate $v_1^\star(b)$. The resulting function is plotted in Fig. 45.4. It is sufficient to illustrate the calculations for the first two steps. The value iterations can be continued for several steps from here as desired. It is clear that the complexity of the computations grows with the iterations as more segmentation of the probability range becomes necessary in order to describe the successive belief value functions.

[Figure 45.4 The thick solid line shows the resulting piecewise linear and convex belief value function $v_2^\star(b)$ that results from the calculation (45.116).]
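The numbers in (45.116) can be checked by brute force with a short, self-contained Python script. This is a hypothetical numeric check of our own making (index conventions, grid size, and the use of interpolation are our choices): two grid sweeps of the backup (45.98) should approximately reproduce $v_2^\star$ at the endpoints $p=0$ and $p=1$.

```python
import numpy as np

# Example 45.3 data: s1=0, s2=1, a1=0, a2=1, observations o in {0, 1}.
gamma = 0.9
P = np.array([[[0.0, 1.0], [0.3, 0.7]],     # P(s1, a1, .), P(s1, a2, .)
              [[1.0, 0.0], [0.6, 0.4]]])    # P(s2, a1, .), P(s2, a2, .)
B = np.array([[0.4, 0.6],                   # B(s1, o=0), B(s1, o=1)
              [0.7, 0.3]])                  # B(s2, o=0), B(s2, o=1)
r = np.broadcast_to(np.array([2.0, 1.0]), (2, 2, 2))   # r(s, a, s') = r(s')

grid = np.linspace(0, 1, 1001)              # belief parameterized by p = b(s1)
v = np.zeros_like(grid)                     # v_0(b) = 0
for k in range(2):                          # two sweeps give v_1 and then v_2
    v_new = np.empty_like(grid)
    for i, p in enumerate(grid):
        b = np.array([p, 1 - p])
        vals = []
        for a in range(2):
            total = np.sum(b[:, None] * P[:, a, :] * r[:, a, :])   # r(b, a)
            pred = b @ P[:, a, :]
            for o in range(2):
                po = pred @ B[:, o]                                 # P(o | b, a)
                b_next = pred * B[:, o] / po                        # eq. (45.83)
                total += gamma * po * np.interp(b_next[0], grid, v)
            vals.append(total)
        v_new[i] = max(vals)
    v = v_new

print(v[0], v[-1])   # expect approximately 3.170 and 2.830, per (45.116)
```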

45.4 COMMENTARIES AND DISCUSSION

Value and policy iterations. The value iteration algorithm (45.23) was introduced by Bellman (1953, 1954, 1957a, b) and is a direct consequence of his principle of optimality for dynamic programming. The value iteration also appeared in Shapley (1953) in work on stochastic games by Lloyd Shapley (1923–2016), winner of the Nobel Prize in Economics. Another influential contribution to the theory of MDPs is the work by Howard (1960), which introduced the policy iteration method (45.43). Howard's algorithm relied on determining the intermediate value function $v_k^{\pi_{\ell-1}}(s)$ by directly solving linear systems of equations of the form (44.72), as was illustrated in (45.39). The form of the algorithm described in the listing (45.43), which relies instead on running the fixed-point iteration


(44.107), is due to Puterman and Shin (1978). For further reading on MDPs, the reader may consult Derman (1970), Howard (1971), Bertsekas and Shreve (1978), Ross (1983), White (1993), Sutton and Barto (1998), Bather (2000), Feinberg and Shwartz (2002), Puterman (2005), Sheskin (2010), and Kochenderfer (2015).
Partially observable MDPs. We examined POMDPs in Section 45.3, where the state information is not available and agents can only observe measurements generated according to some emission distribution. We explained that POMDPs can be transformed into regular MDPs, albeit ones with continuous states, where the traditional discrete state s is replaced by a continuous belief vector b. We derived relation (45.83) to update the belief vectors. It was remarked that the new belief vector requires information from only the most recent belief, thus leading to a belief-MDP. This observation is due to Aström (1965). The corresponding optimal belief value function, v^\star(b), was further shown to satisfy the Bellman optimality condition (45.97), which in turn motivated the value iteration (45.98). One major difficulty in updating the iteration is that the argument b is now continuous, rather than discrete, and lies within the probability simplex in IR^{|S|}. The calculations in Example 45.3 were meant to illustrate how the complexity of the computations increases as we move from one iteration to another, and soon becomes intractable. The intractability of determining the optimal belief value function for POMDPs was formalized in the work by Papadimitriou and Tsitsiklis (1987). Existing solution methods exploit the piecewise linear and convex nature of v^\star(b), which was observed by Smallwood (1971), Sondik (1971, 1978), and Smallwood and Sondik (1973). This property was exploited in these references and in other works, such as those by Sawaki and Ichikawa (1978), Cassandra, Kaelbling, and Littman (1994), Zhang and Liu (1996), Littman, Cassandra, and Kaelbling (1996), and Cassandra, Littman, and Zhang (1997); the reference by Littman, Cassandra, and Kaelbling (1996) introduces the "Witness algorithm." The methods in these works rely on transforming one piecewise linear representation for v_{k-1}^\star at iteration k - 1 into another piecewise linear representation at the next iteration. Other methods in policy space appear in Hansen (1997, 1998) and Meuleau et al. (1999a, b).
Dynamic programming. The term dynamic programming was coined by the American mathematician and control theorist Richard Bellman (1920–1984) in the late 1940s in his studies on multistage decision processes – see Bellman (1953, 1954, 1957a). The qualification "dynamic" is meant to refer to the evolving nature of the decision process across stages. We encountered in the body of the chapter one important application of dynamic programming in the form of the optimality condition (45.14a) for determining the optimal value function. There, we explained that the problem of optimizing the value function v^π(s) over the policy π(a|s) over a longer horizon (or depth) can be solved by constructing the solution recursively from the optimal value functions over shorter horizons. This technique, which is a direct consequence of the Bellman principle of optimality in dynamic optimization, has a wide range of applications across many fields, including optimal control theory, communications theory, and economics.
We also encountered another instance of dynamic programming earlier in Section 39.4 while discussing the Viterbi algorithm in the context of HMMs. For further reading on dynamic programming and its applications, the reader may consult the texts by Howard (1960, 1971), Dreyfus and Law (1977), Ross (1983), Bryson (1998), Bather (2000), Puterman (2005), Bertsekas (2007), Kamien and Schwartz (2012), and Kochenderfer (2015).
Multistage decision processes. A multistage decision process is one where an optimal decision needs to be taken at each stage in order to maximize (or minimize) a certain objective function. The decision at stage k ≥ 0 will generally involve looking k steps into the future (i.e., it will involve a decision of depth k). It turns out that for objective functions with certain structure, the optimal decision of depth k can be evaluated recursively from the optimal decision of depth k - 1, and so forth. This property allows the decomposition of more complex/deeper optimization problems into the solution of smaller/shallower optimization problems and forms the essence of dynamic programming. We motivate it as follows. Let s ∈ S denote a generic state variable for the decision process, a ∈ A the action variable, and r(s, a, s') ∈ IR the reward variable. The next state, reached from s after action a, is denoted by s'. Although we are using the same notation we introduced in the body of the chapter for MDPs, this notation is general and applies to a variety of other optimization problems that involve sequential decision-making (as illustrated by the next example on optimal trajectory determination). For generality, we will allow the transition from state s to state s' to be stochastic, as was the case with MDPs where the probability kernel P(s, a, s') defined the probability of reaching s' given the state–action pair (s, a). In the stochastic setting, the states, actions, and reward variables are treated as random variables and written in boldface notation. For any generic state s ∈ S, the value function that is associated with stage k is denoted by v_k^π(s) and defined as

v_k^π(s) \triangleq \mathbb{E}\left( \sum_{n=0}^{k-1} \gamma^n\, r(s_n, a_n, s_{n+1}) \,\Big|\, s_0 = s \right) \qquad (45.117)

where γ ∈ [0, 1] is a discount factor, s_n denotes the state at time n, a_n denotes the action taken at that point in time, and s_{n+1} denotes the landing state at the next time instant n + 1. We sometimes denote the reward function more compactly by writing

r(n) \triangleq r(s_n, a_n, s_{n+1}) \qquad (45.118)

Expression (45.117) computes the expected reward when the system is launched from some initial state s_0 = s, performs k transitions (corresponding to depth k), and receives the rewards {r(s, a_0, s_1), r(s_1, a_1, s_2), ..., r(s_{k-1}, a_{k-1}, s_k)} during these transitions:

s_0 \xrightarrow{a_0,\, r(0)} s_1 \xrightarrow{a_1,\, r(1)} s_2 \longrightarrow \cdots \xrightarrow{a_{k-1},\, r(k-1)} s_k \qquad (45.119)

The superscript π in v_k^π(s) refers to the policy for choosing actions based on states, which we denoted in the body of the chapter more explicitly by π(a|s). The expectation in (45.117) is in relation to the randomness in selecting actions and also in relation to the randomness in state transitions. For this reason, we will write more explicitly:

v_k^π(s) \triangleq \mathbb{E}_{\substack{a_0,\ldots,a_{k-1}\sim\pi \\ s_1,\ldots,s_k\sim(P,\pi)}} \left( \sum_{n=0}^{k-1} \gamma^n\, r(s_n, a_n, s_{n+1}) \,\Big|\, s_0 = s \right) \qquad (45.120)

where the notation a_n ∼ π means that action a_n is selected according to the distribution π(a|s_n), while the notation s' ∼ (P, π) means that the state s' is selected according to the transition probabilities defined by P(s, a, s') and π(a|s). We are interested in maximizing the value function over all actions, i.e., in determining the optimal value:

v_k^\star(s) \triangleq \max_{\pi}\; \mathbb{E}_{\substack{a_0,\ldots,a_{k-1}\sim\pi \\ s_1,\ldots,s_k\sim(P,\pi)}} \left( \sum_{n=0}^{k-1} \gamma^n\, r(s_n, a_n, s_{n+1}) \,\Big|\, s_0 = s \right) \qquad (45.121)

with boundary conditions v_0^\star(s) = 0 for all s ∈ S. It is important to note that, in general, the optimal policy is not necessarily stationary (i.e., it can change with n) so that it needs to be defined for every time step. For this reason, we will use the more explicit notation π_n(a|s), with a subscript n, to refer to the policy for selecting actions at time n. Thus, when we write a_n ∼ π_n, it is understood that the selection of a_n occurs according to policy π_n(a|s) defined at step n. In this way, a nonstationary policy is defined by a sub-policy for each possible time step. In turn, each sub-policy


π_n(a|s) is in effect a collection of |S| distributions (one distribution per state). For this reason, and for generality, we rewrite problem (45.121) in the form:

v_k^\star(s) \triangleq \max_{\pi_0,\ldots,\pi_{k-1}}\; \mathbb{E}_{\substack{a_0\sim\pi_0,\ldots,a_{k-1}\sim\pi_{k-1} \\ s_1\sim(P,\pi_0),\ldots,s_k\sim(P,\pi_{k-1})}} \left( \sum_{n=0}^{k-1} \gamma^n\, r(s_n, a_n, s_{n+1}) \,\Big|\, s_0 = s \right) \qquad (45.122)

Observe how problem (45.122) involves looking k steps into the future, i.e., into the policies (and, hence, actions) that determine the rewards {r(0), . . . , r(k −1)} that enter into the objective function. We are ready to establish the Bellman optimality principle, which we encountered earlier in (45.15a) and Section 45.1.5. The first lines below reveal how the principle of optimality was stated by Bellman (1954, p. 504); a proof is given in Appendix 45.F.

Bellman principle of optimality (Bellman (1954)) "An optimal policy has the property that whatever the initial state and initial decisions are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision." More explicitly, it should hold that

v_k^\star(s) = \max_{a\in A}\; \mathbb{E}\Big( r(s, a, s') + \gamma\, v_{k-1}^\star(s') \,\Big|\, s_0 = s \Big), \qquad v_0^\star(s) = 0 \qquad (45.123)

In particular, if {a_1^\star, a_2^\star, ..., a_{k-1}^\star} are the optimal actions/decisions for a path originating from state s_0 and looking k - 1 steps into the future, then these decisions will be optimal for the longer horizon of k steps originating from an earlier state s and taking an optimal action at that step. Although the above description treats a stochastic optimization problem, the same arguments and derivation apply to deterministic optimization problems. For example, if we remove the expectation from (45.121) and consider instead an optimization problem of the form:

v_k^\star(s) \triangleq \max_{a_0,\ldots,a_{k-1}\in A} \left\{ \sum_{n=0}^{k-1} \gamma^n\, r(n) \right\} \qquad (45.124)

where no randomness is involved in transitioning between states, with state s and action a determining completely the next state s', then the same derivation will show that the optimality principle in this case leads to the Bellman equation:

v_k^\star(s) = \max_{a\in A} \Big\{ r(s, a, s') + \gamma\, v_{k-1}^\star(s') \Big\}, \qquad v_0^\star(s) = 0 \qquad (45.125)

By the same token, we can consider optimization problems that involve minimization rather than maximization, such as

v_k^\star(s) \triangleq \min_{a_0,\ldots,a_{k-1}\in A} \left\{ \sum_{n=0}^{k-1} \gamma^n\, r(n) \right\} \qquad (45.126)

In this case, the optimality principle will take the form

v_k^\star(s) = \min_{a\in A} \Big\{ r(s, a, s') + \gamma\, v_{k-1}^\star(s') \Big\}, \qquad v_0^\star(s) = 0 \qquad (45.127)
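Recursions (45.125) and (45.127) translate directly into a short routine. The sketch below is only an illustration under our own naming conventions: the callables actions, next_state, and reward are assumed to be supplied by the user (they are not objects defined in the chapter), and terminal states are expected to carry a zero-reward self-loop, as done for the destination node in the example that follows.

def finite_horizon_values(states, actions, next_state, reward, K, gamma=1.0, minimize=True):
    # Iterate the deterministic Bellman recursion (45.125)/(45.127) for K steps.
    # actions(s) lists the actions available at s; next_state(s, a) and reward(s, a)
    # describe the (deterministic) model. Returns [v_0, v_1, ..., v_K] as dictionaries.
    pick = min if minimize else max
    values = [{s: 0.0 for s in states}]          # boundary condition v_0(s) = 0
    for _ in range(K):
        prev = values[-1]
        values.append({
            s: pick(reward(s, a) + gamma * prev[next_state(s, a)] for a in actions(s))
            for s in states
        })
    return values

The trajectory-planning example below applies exactly this recursion, with γ = 1 and minimization, to the graph of Fig. 45.5.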


Example 45.4 (Optimal trajectory planning) We illustrate the application of the optimality principle (and the dynamic programming construction) to a classical problem in optimal trajectory determination. The leftmost plot in Fig. 45.5 shows trajectories between an origin node, denoted by O on the far left, and a destination node, denoted by D on the far right. The trajectories pass through intermediate nodes, denoted by {X, Y, Z, W}. The edges linking the nodes have numbers on them referring to the cost of following that particular path. For example, these values could be interpreted as the cost of traveling from one node to another. In other problem settings, the values could correspond to the distances traveled over the edges, or the rewards collected over these edges (see Prob. 45.20). There are many problem formulations that fit into this scenario.

Figure 45.5 (three panels: finding the optimal trajectory, naïve solution, optimal solution) The plot on the left shows trajectories between an origin node, O, and a destination node, D. The trajectories pass through intermediate nodes, denoted by {X, Y, Z, W}. The edges linking the nodes have numbers on them referring to the cost of following that particular path.

The objective is to select an optimal trajectory according to some criterion of interest. In this example, we assume that the values on the edges correspond to cost of travel and our objective is to determine a trajectory from origin to destination with the smallest total cost. Obviously, for this simple example, this determination is trivial and can be done by inspection. However, for more involved graphs with many more nodes and edges, the determination needs to follow a more methodical approach. The example in the figure is meant to illustrate the approach. Let us attempt initially a naïve approach. Starting from O, we follow the shortest path from there. That would be the path that goes to node X. We then follow the shortest path from X. That would be the path to node W . From there we have a single possibility to D. The result is the red trajectory shown in the middle plot of the figure, with a total cost of 2 + 3 + 4 = 9 units. The rightmost plot shows an optimal path, with a smaller cost of 4 + 3 + 1 = 8 units. This second path can be determined by inspection in this case. We illustrate next the application of the optimality recursion (45.127) to arrive at the same conclusion. For this purpose, let us first identify the parameters of the model. We are going to treat the destination, D, as an EXIT state with zero cost associated with it. Once the agent reaches that state it stops. We therefore have a total of six states S = {O, X, Y, Z, W, D}

(45.128)

The actions are the choices of pathways at each state. For example, at state s = O, the agent can only choose between two actions: "move to X" or "move to Y." Likewise, at state s = Y, the agent can choose between three actions: "move to X," "move to Z," or "move to W." Similarly for the other states. At state s = D, the agent stays in D. The transition between states is deterministic. Knowing the launch state and the action, the landing state is completely identified. For example, starting from state s = X, and taking the action "move to W," the agent ends at state s' = W. The rewards over these transitions are the costs over the edges. For example, the reward for moving from s = X to s' = W is r(X, "move to W", W) = 3. For compactness, we write this reward as r(X → W) = 3 using the arrow notation. For every state s, we now let v_k^\star(s) denote the minimum cumulative cost that would result if we start from node s and are able to take k steps. We set the discount factor to γ = 1. Obviously, the boundary conditions are

v_0^\star(s) = 0, \qquad s ∈ S \qquad (45.129)

That is, starting from any of the states and taking no steps, the costs incurred will be zero. We now apply iteration (45.127) to evaluate v_1^\star(s), namely, the minimum cumulative cost for each state if only one step is taken. Applying (45.127) to the various states we get

v_1^\star(D) = \min_{a\in A} \{ r(D, a, D) + v_0^\star(s') \} = \min_{a\in A} r(D, a, D) = 0 \qquad (45.130a)

v_1^\star(Z) = \min_{a\in A} \{ r(Z, a, s') + v_0^\star(s') \} = r(Z → D) + v_0^\star(D) = 1  (optimal action = "move to D") \qquad (45.130b)

v_1^\star(W) = \min_{a\in A} \{ r(W, a, s') + v_0^\star(s') \} = r(W → D) + v_0^\star(D) = 4  (optimal action = "move to D") \qquad (45.130c)

v_1^\star(X) = \min_{a\in A} \{ r(X, a, s') + v_0^\star(s') \} = \min \{ r(X → Z) + v_0^\star(Z),\; r(X → W) + v_0^\star(W) \} = \min \{6 + 0,\; 3 + 0\} = 3  (optimal action = "move to W") \qquad (45.130d)

v_1^\star(Y) = \min_{a\in A} \{ r(Y, a, s') + v_0^\star(s') \} = \min \{ r(Y → X) + v_0^\star(X),\; r(Y → Z) + v_0^\star(Z),\; r(Y → W) + v_0^\star(W) \} = \min \{1 + 0,\; 3 + 0,\; 2 + 0\} = 1  (optimal action = "move to X") \qquad (45.130e)

v_1^\star(O) = \min_{a\in A} \{ r(O, a, s') + v_0^\star(s') \} = \min \{ r(O → X) + v_0^\star(X),\; r(O → Y) + v_0^\star(Y) \} = \min \{2 + 0,\; 4 + 0\} = 2  (optimal action = "move to X") \qquad (45.130f)


We now repeat for the next iteration:

v_2^\star(D) = \min_{a\in A} \{ r(D, a, D) + v_1^\star(s') \} = \min_{a\in A} r(D, a, D) = 0 \qquad (45.131a)

v_2^\star(Z) = \min_{a\in A} \{ r(Z, a, s') + v_1^\star(s') \} = r(Z → D) + v_1^\star(D) = 1  (optimal action = "move to D") \qquad (45.131b)

v_2^\star(W) = \min_{a\in A} \{ r(W, a, s') + v_1^\star(s') \} = r(W → D) + v_1^\star(D) = 4  (optimal action = "move to D") \qquad (45.131c)

v_2^\star(X) = \min_{a\in A} \{ r(X, a, s') + v_1^\star(s') \} = \min \{ r(X → Z) + v_1^\star(Z),\; r(X → W) + v_1^\star(W) \} = \min \{6 + 1,\; 3 + 4\} = 7  (optimal action = "move to Z or W") \qquad (45.131d)

v_2^\star(Y) = \min_{a\in A} \{ r(Y, a, s') + v_1^\star(s') \} = \min \{ r(Y → X) + v_1^\star(X),\; r(Y → Z) + v_1^\star(Z),\; r(Y → W) + v_1^\star(W) \} = \min \{1 + 3,\; 3 + 1,\; 2 + 4\} = 4  (optimal action = "move to X or Z") \qquad (45.131e)

v_2^\star(O) = \min_{a\in A} \{ r(O, a, s') + v_1^\star(s') \} = \min \{ r(O → X) + v_1^\star(X),\; r(O → Y) + v_1^\star(Y) \} = \min \{2 + 3,\; 4 + 1\} = 5  (optimal action = "move to X") \qquad (45.131f)

We repeat one last time:

v_3^\star(D) = \min_{a\in A} \{ r(D, a, D) + v_2^\star(s') \} = \min_{a\in A} r(D, a, D) = 0 \qquad (45.132a)

v_3^\star(Z) = \min_{a\in A} \{ r(Z, a, s') + v_2^\star(s') \} = r(Z → D) + v_2^\star(D) = 1  (optimal action = "move to D") \qquad (45.132b)

v_3^\star(W) = \min_{a\in A} \{ r(W, a, s') + v_2^\star(s') \} = r(W → D) + v_2^\star(D) = 4  (optimal action = "move to D") \qquad (45.132c)


and

v_3^\star(X) = \min_{a\in A} \{ r(X, a, s') + v_2^\star(s') \} = \min \{ r(X → Z) + v_2^\star(Z),\; r(X → W) + v_2^\star(W) \} = \min \{6 + 1,\; 3 + 4\} = 7  (optimal action = "move to Z or W") \qquad (45.132d)

v_3^\star(Y) = \min_{a\in A} \{ r(Y, a, s') + v_2^\star(s') \} = \min \{ r(Y → X) + v_2^\star(X),\; r(Y → Z) + v_2^\star(Z),\; r(Y → W) + v_2^\star(W) \} = \min \{1 + 7,\; 3 + 1,\; 2 + 4\} = 4  (optimal action = "move to Z") \qquad (45.132e)

v_3^\star(O) = \min_{a\in A} \{ r(O, a, s') + v_2^\star(s') \} = \min \{ r(O → X) + v_2^\star(X),\; r(O → Y) + v_2^\star(Y) \} = \min \{2 + 7,\; 4 + 4\} = 8  (optimal action = "move to Y") \qquad (45.132f)

Table 45.3 lists the value functions for all states over these three iterations, along with the optimal actions at k = 3. We therefore find that the optimal trajectory has cost v_3^\star(O) = 8 and involves moving from O to Y to Z and to D, as already predicted in Fig. 45.5.

Table 45.3 Listing of the optimal state values, v_k^\star(s), of depths k = 1, 2, 3.

State, s    v_1^\star(s)    v_2^\star(s)    v_3^\star(s)    Action, a
D           0               0               0               → D
Z           1               1               1               → D
W           4               4               4               → D
X           3               7               7               → Z, W
Y           1               4               4               → Z
O           2               5               8               → Y

One famous algorithm for finding shortest paths over graphs by using dynamic programming ideas is the solution by the Dutch computer scientist Edsger W. Dijkstra (1930–2002) in the work by Dijkstra (1959).
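For concreteness, the calculations of Table 45.3 can be reproduced with a few lines of code. The edge costs below are read off from Fig. 45.5 as used in the computations above (the dictionary encoding and variable names are our own); the destination D is given a zero-cost self-loop so that it acts as the EXIT state.

# Edge costs of Fig. 45.5; D has a zero-cost self-loop so that it behaves as an exit state.
edges = {
    'O': {'X': 2, 'Y': 4},
    'X': {'Z': 6, 'W': 3},
    'Y': {'X': 1, 'Z': 3, 'W': 2},
    'Z': {'D': 1},
    'W': {'D': 4},
    'D': {'D': 0},
}

gamma = 1.0
v = {s: 0.0 for s in edges}                      # v_0(s) = 0, as in (45.129)
for k in (1, 2, 3):                              # iterate (45.127) three times
    v = {s: min(cost + gamma * v[nxt] for nxt, cost in nbrs.items())
         for s, nbrs in edges.items()}
    print(k, v)

# After k = 3 the printout gives v = {O: 8, X: 7, Y: 4, Z: 1, W: 4, D: 0},
# matching Table 45.3; the minimizing choices trace the route O -> Y -> Z -> D with cost 8.

For larger graphs one would of course prefer a dedicated shortest-path routine such as Dijkstra's algorithm mentioned above.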

PROBLEMS

45.1 Continuing with Prob. 44.6, follow the arguments of Example 45.1 and apply the value iteration procedure to determine the optimal policy and the optimal state values.
45.2 Continuing with Prob. 44.6, follow the arguments of Example 45.2 and apply the policy iteration procedure to determine the optimal policy and the optimal state values.
45.3 Continuing with Prob. 44.7, follow the arguments of Example 45.1 and apply the value iteration procedure to determine the optimal policy and the optimal state values.
45.4 Continuing with Prob. 44.7, follow the arguments of Example 45.2 and apply the policy iteration procedure to determine the optimal policy and the optimal state values.
45.5 Prove that a policy π^\star(a|s) is optimal if, and only if, its discounted reward v^\star(s) satisfies equality (45.14a).
45.6 Refer to the policy iteration (45.43). Verify that, at any iteration ℓ, it holds that
v^{π_ℓ}(s) = \sum_{a\in A} π_ℓ(a|s)\, q^{π_ℓ}(s, a)
45.7 Refer to the argument (45.148). Let v_ℓ^\star(s) and v_k^\star(s) denote two iterates that are generated by the value iteration (45.23). Show that
\max_{s\in S} \big| v_ℓ^\star(s) - v_k^\star(s) \big| \le γ \max_{s\in S} \big| v_ℓ^\star(s) - v_{k-1}^\star(s) \big|

45.8 The purpose of this problem is to establish that at least one solution exists to the Bellman equation (45.14a) when γ ∈ [0, 1); in the next problem we show that this solution is actually unique. Consider the sequence of state values {v_k^\star(s)} that are generated by the value recursion (45.20).
(a) Use an argument similar to (45.147) to show that
\max_{s\in S} \big| v_k^\star(s) - v_{k-1}^\star(s) \big| \le γ \max_{s\in S} \big| v_{k-1}^\star(s) - v_{k-2}^\star(s) \big|
(b) Verify that, for any integer m:
\max_{s\in S} \big| v_{k+m}^\star(s) - v_k^\star(s) \big| \le \frac{(1 - γ^m)γ^k}{1 - γ} \max_{s\in S} \big| v_1^\star(s) - v_0^\star(s) \big|
(c) Conclude that for any ε > 0, there exists K such that |v_{k+m}^\star(s) - v_k^\star(s)| < ε for all k ≥ K.
(d) Argue therefore that the sequence v_k^\star(s) has a limit.
45.9 The purpose of this problem is to establish that, when a solution exists to the Bellman equation (45.14a), then it must be unique when γ ∈ [0, 1). Assume solutions exist but are not unique. Let v_x^\star(s) and v_y^\star(s) denote any two solutions, with the corresponding state–action value functions denoted by q_x^\star(s, a) and q_y^\star(s, a). The argument below is patterned after the proof given to establish the convergence property (45.24b).
(a) Show that
\max_{s\in S, a\in A} \big| q_x^\star(s, a) - q_y^\star(s, a) \big| \le γ \max_{s\in S, a\in A} \big| q_x^\star(s, a) - q_y^\star(s, a) \big|
\max_{s\in S} \big| v_x^\star(s) - v_y^\star(s) \big| \le γ \max_{s\in S} \big| v_x^\star(s) - v_y^\star(s) \big|
(b) Conclude that q_x^\star(s, a) = q_y^\star(s, a) and v_x^\star(s) = v_y^\star(s).
45.10 We know from Prob. 45.9 that the Bellman equation (45.14a) has a unique solution v^\star(s) when γ ∈ [0, 1). Is the optimal policy obtained from (45.16a) unique?
45.11 Let ν_k(s) denote the solution to the following Bellman recursion (compare with (45.20)):
ν_k(s) = \max_{a\in A} \Big\{ \sum_{s'\in S} P(s, a, s') \big[ r(s, a, s') + γ\, ν_{k-1}(s') \big] \Big\}, \qquad ν_0(s) = 0, \quad k ≥ 1
The purpose of this problem is to establish that the values {ν_k(s)} that result from this recursion coincide with the optimal values {v_k^\star(s)} and admit the interpretation


(45.29). The argument follows by induction. First, we note that ν_0(s) = v_0^\star(s) = 0, for all s ∈ S. Next, assume that ν_ℓ(s) ≥ v_ℓ^\star(s), for ℓ < k.
(a) Show that, for any stochastic policy π(a|s), it holds that ν_k(s) ≥ v_k^π(s). Conclude that ν_k(s) ≥ v_k^\star(s).
(b) Select the deterministic policy:
π^o(a|s) := \argmax_{a\in A} \Big\{ \sum_{s'\in S} P(s, a, s') \big[ r(s, a, s') + γ\, ν_{k-1}(s') \big] \Big\}
Show that v_k^{π^o}(s) = ν_k(s) for all k and s. Conclude that there exists a policy whose value matches ν_k(s) and, therefore, ν_k(s) = v_k^\star(s) for all s.
45.12 Refer to recursion (45.14a) for the optimal policy value, v_k^\star(s). At time n, the state s on the right-hand side of this expression can be replaced by the more explicit notation s_n with a time subscript attached to it. Introduce the following deterministic policy:
π_k^\star(a|s_n) := \argmax_{a\in A} \Big\{ \sum_{s'\in S} P(s, a, s') \big[ r(s, a, s') + γ\, v_{k-1}^\star(s') \big] \Big\}
(a) Argue that the optimal value v_k^\star(s_n) is solely dependent on s_n (i.e., it does not depend on prior state values).
(b) Show that π_k^\star(a|s_n) is the optimal policy when the time horizon is limited to k steps into the future. Argue further that the policy is Markovian, i.e., it depends solely on the current state, s_n.
45.13 Refer to expression (45.14a) for the optimal state value, v_k^\star(s). We would like to verify that maximizing the right-hand side over deterministic or random policies yields the same result, so that it is sufficient to maximize over deterministic policies. Recall that, in the stochastic case, the notation π(a|s) corresponds to a probability distribution over the actions. Obviously, since the set of random policies is larger and includes the set of deterministic policies, it holds that
\max_{π^d(a|s)} \Big\{ r(s|a) + γ \sum_{s'\in S} P(s, a, s')\, v^\star(s') \Big\} \le \max_{π^r(a|s)} \Big\{ r(s|a) + γ \sum_{s'\in S} P(s, a, s')\, v^\star(s') \Big\}
where we are using the notation π^d(a|s) on the left to refer to deterministic policies, and the notation π^r(a|s) on the right to refer to stochastic policies. Now given v^\star(s'), select some arbitrary random policy π^r(a|s) and let
c(s, a) \triangleq r(s|a) + γ \sum_{s'\in S} P(s, a, s')\, v^\star(s'), \qquad \text{for every } a ∈ A,\; s ∈ S
Construct the deterministic policy π^d(a|s) := \argmax_{a\in A} c(s, a), for a ∈ A, s ∈ S.
(a) What is the value of policy π^d(a|s)?
(b) What is the value of policy π^r(a|s)?
(c) Show that the value of π^r(a|s) is upper-bounded by that of π^d(a|s). Conclude that maximizing the right-hand side of (45.14a) over deterministic or random policies yields the same result.

45.14 Consider an MDP M = (S, A, P, r, π), with a policy π(a|s). Define the new policy
π'(a|s) := \argmax_{a\in A} q^π(s, a)
Show that q^{π'}(s, a) ≥ q^π(s, a), for all (s, a) ∈ S × A, with equality if, and only if, π'(a|s) and π(a|s) are both optimal policies.
45.15 Follow arguments similar to the proof of (45.26) to show that the following holds for the value iteration:
\max_{s\in S} |v_k^\star(s) - v^\star(s)| \le \frac{2γ^k}{1 - γ} \max_{s\in S} |v_1^\star(s) - v_0^\star(s)|
How many iterations are needed to attain an ε-optimal policy?
45.16 Refer to the value iteration (45.23). Show that, for all k:
|v^\star(s) - v_{k+1}^\star(s)| \le |v^\star(s) - v_k^\star(s)|

45.17 Derive the Bellman optimality condition (45.97) for POMDPs.
45.18 Refer to the value iteration (45.98) for the optimal belief value function for POMDPs. Assume we start from a value function v_{k-1}(b') that differs from the optimal value v^\star(b') by at most ε, i.e., |v_{k-1}(b') - v^\star(b')| ≤ ε. Let v_k(b) denote the value function that is obtained from applying (45.98) to v_{k-1}(b'). Show that |v_k(b) - v^\star(b)| ≤ γε. Remark. The reader may refer to the discussion in Bertsekas (1987).
45.19 Motivated by the comments at the end of the chapter, establish the validity of the optimality principle stated in Section 45.1.5. Use this principle to arrive at the recursions for the value iteration (45.23).
45.20 Refer to Fig. 45.5 and assume the values next to the edges correspond to gains the agent collects as it moves over these paths. The objective then becomes that of maximizing the total reward. Iterate the optimality principle (45.125) to determine an optimal trajectory.
45.21 Refer again to Fig. 45.5, where D is a terminal state, and assume the values next to the edges correspond to gains the agent collects as it moves over these paths. Assume further that at state s = b, there exists a small likelihood that the edge to node C may be closed with probability 0.25. When that happens, the agent stays at state s = b and incurs a cost of −5. Determine an optimal trajectory under these conditions.

45.A OPTIMAL POLICY AND STATE–ACTION VALUES

In the next four appendices we establish several properties for the value and policy iterations described in the text, motivated by treatments from Bellman (1957a), Howard (1960), Sutton and Barto (1998), and Bertsekas (2007). To begin with, in this first appendix, we establish that the deterministic policy (45.6) is optimal and solves (45.1). Specifically, we verify that if we maximize q^\star(s, a) over a ∈ A, then the result can be used to construct an optimal deterministic policy. Let

a^o \triangleq \argmax_{a\in A} q^\star(s, a) \qquad (45.133)

and construct the deterministic policy π^o(a|s):

π^o(a|s) = \begin{cases} 1, & \text{when } a = a^o \\ 0, & \text{otherwise} \end{cases} \qquad (45.134)



Let v^{π^o}(s) and q^{π^o}(s, a) denote the corresponding value and action functions. We want to show that v^{π^o}(s) = v^\star(s) so that π^o(a|s) is an optimal policy. For this purpose, we note first that

v^\star(s) \overset{(45.4a)}{=} \sum_{a\in A} π^\star(a|s)\, q^\star(s, a) \le \sum_{a\in A} π^\star(a|s) \Big( \max_{a\in A} q^\star(s, a) \Big) = \max_{a\in A} q^\star(s, a) = q^\star(s, a^o) \qquad (45.135)

That is,

v^\star(s) \le q^\star(s, a^o) \qquad (45.136)

We next deduce from expressions (44.91)–(44.92) and (45.4b) that

v^{π^o}(s) \overset{(44.92)}{=} \sum_{a\in A} π^o(a|s)\, q^{π^o}(s, a) \overset{(45.134)}{=} q^{π^o}(s, a^o) \overset{(44.91)}{=} \sum_{s'\in S} P(s, a^o, s') \big[ r(s, a^o, s') + γ\, v^{π^o}(s') \big] \qquad (45.137a)

and

q^\star(s, a^o) \overset{(44.91)}{=} \sum_{s'\in S} P(s, a^o, s') \big[ r(s, a^o, s') + γ\, v^\star(s') \big] \overset{(45.136)}{\ge} v^\star(s) \qquad (45.137b)

We rewrite the last two results as follows to facilitate comparison:

v^{π^o}(s) = \sum_{s'\in S} P(s, a^o, s') \big[ r(s, a^o, s') + γ\, v^{π^o}(s') \big] \qquad (45.138a)

v^\star(s) \le \sum_{s'\in S} P(s, a^o, s') \big[ r(s, a^o, s') + γ\, v^\star(s') \big] \qquad (45.138b)

Subtracting, we find that

v^{π^o}(s) - v^\star(s) \ge γ \sum_{s'\in S} P(s, a^o, s') \big[ v^{π^o}(s') - v^\star(s') \big] \qquad (45.139)

We can rewrite this inequality in vector form by introducing the vector and matrix quantities:

v^{π^o} \triangleq \text{col}\{ v^{π^o}(s) \}_{s\in S}, \qquad v^\star \triangleq \text{col}\{ v^\star(s) \}_{s\in S}, \qquad P^{π^o} \triangleq \big[ P(s, a^o, s') \big]_{s,s'\in S} \qquad (45.140)

so that

(v^{π^o} - v^\star) \succeq γ P^{π^o} (v^{π^o} - v^\star) \;\Longleftrightarrow\; (I - γ P^{π^o})\,δ \succeq 0, \qquad δ \triangleq v^{π^o} - v^\star \qquad (45.141)

where we introduced the difference vector, δ, and where the notation a \succeq b means elementwise comparison for the entries of the vectors. Let y = (I - γP^{π^o})δ. Then,



the entries of the vector y are nonnegative. Now, since ρ(γP^{π^o}) < 1 and γP^{π^o} has nonnegative entries, we have

δ = \big( I - γP^{π^o} \big)^{-1} y = y + \big( γP^{π^o} \big) y + \big( γP^{π^o} \big)^2 y + \big( γP^{π^o} \big)^3 y + \cdots \succeq 0 \qquad (45.142)

so that all entries of δ are nonnegative as well. We conclude that

v^{π^o}(s) \ge v^\star(s) \qquad (45.143)

But since, by definition, v^\star(s) is the largest possible state value function, we conclude that

v^\star(s) = v^{π^o}(s) \qquad (45.144)


This shows that π^o(a|s) is indeed an optimal policy according to (45.1).

45.B CONVERGENCE OF VALUE ITERATION

We establish in this appendix the convergence properties (45.24a)–(45.24b) and (45.25) for the value iteration (45.23). For a related discussion, the reader may refer to Bellman (1957a), Howard (1960), Sutton and Barto (1998), and Bertsekas (2007). Thus, note from (45.14b) and (45.21) that for any state s and action a:

|q^\star(s, a) - q_k^\star(s, a)| = \Big| \sum_{s'\in S} γP(s, a, s') \big[ v^\star(s') - v_{k-1}^\star(s') \big] \Big|
\le \sum_{s'\in S} γP(s, a, s') \Big| \max_{a'\in A} q^\star(s', a') - \max_{a'\in A} q_{k-1}^\star(s', a') \Big|
\le \sum_{s'\in S} γP(s, a, s') \max_{a'\in A} \big| q^\star(s', a') - q_{k-1}^\star(s', a') \big|
\le \sum_{s'\in S} γP(s, a, s') \max_{s',a'} \big| q^\star(s', a') - q_{k-1}^\star(s', a') \big|
\le γ \max_{s',a'} \big| q^\star(s', a') - q_{k-1}^\star(s', a') \big| \sum_{s'\in S} P(s, a, s')
= γ \max_{s',a'} \big| q^\star(s', a') - q_{k-1}^\star(s', a') \big| \qquad (45.145)

where the right-hand side is independent of s and a, and the sum of the transition probabilities over s' is equal to 1. We conclude that

\max_{s,a} \big| q^\star(s, a) - q_k^\star(s, a) \big| \le γ \max_{s,a} \big| q^\star(s, a) - q_{k-1}^\star(s, a) \big| \qquad (45.146)

But since γ ∈ [0, 1), we find that the absolute maximum difference between qk? (s, a) and q ? (s, a) is decaying exponentially fast so that (45.24a) holds. Now since qk? (s, a) converges to q ? (s, a), it follows that πk? (a|s) converges to an optimal (deterministic) policy π ? (a|s).


Likewise, using (45.13) and (45.21):

|v^\star(s) - v_k^\star(s)| = \Big| \max_{a\in A} q^\star(s, a) - \max_{a\in A} q_k^\star(s, a) \Big|
\le \max_{a\in A} \big| q^\star(s, a) - q_k^\star(s, a) \big|
= \max_{a\in A} \Big| \sum_{s'\in S} γP(s, a, s') \big[ v^\star(s') - v_{k-1}^\star(s') \big] \Big|
\le \max_{a\in A} \sum_{s'\in S} γP(s, a, s') \big| v^\star(s') - v_{k-1}^\star(s') \big|
\le \max_{a\in A} \sum_{s'\in S} γP(s, a, s') \max_{s'\in S} \big| v^\star(s') - v_{k-1}^\star(s') \big|
\le γ \max_{s'\in S} \big| v^\star(s') - v_{k-1}^\star(s') \big| \max_{a\in A} \sum_{s'\in S} P(s, a, s')
= γ \max_{s'\in S} \big| v^\star(s') - v_{k-1}^\star(s') \big| \qquad (45.147)

where the right-hand side is independent of s. We conclude that

\max_{s\in S} \big| v^\star(s) - v_k^\star(s) \big| \le γ \max_{s\in S} \big| v^\star(s) - v_{k-1}^\star(s) \big| \qquad (45.148)

and the absolute maximum difference between vk? (s) and v ? (s) is also decaying exponentially fast so that (45.24b) holds. We observe further that the convergence of vk? (s) toward v ? (s) is monotonic, meaning that vk? (s) gets closer to v ? (s) as we run more iterations – see Prob. 45.16.
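The contraction behavior in (45.146) and (45.148) is easy to observe numerically. The sketch below builds a small random MDP (the sizes, seed, and variable names are arbitrary choices of ours), runs the value iteration, and checks that the maximum error with respect to a long-run proxy for the optimal value function shrinks by at least the factor γ at every step.

import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9

# random transition kernel P[a, s, s'] (rows normalized) and rewards r[a, s, s']
P = rng.random((A, S, S))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((A, S, S))

def bellman(v):
    # q[a, s] = sum_s' P(s, a, s') ( r(s, a, s') + gamma v(s') );  new v(s) = max_a q[a, s]
    q = (P * (r + gamma * v[None, None, :])).sum(axis=2)
    return q.max(axis=0)

v_star = np.zeros(S)                  # proxy for the optimal values, obtained by iterating many times
for _ in range(2000):
    v_star = bellman(v_star)

v, prev_err = np.zeros(S), None
for k in range(1, 16):
    v = bellman(v)
    err = np.abs(v - v_star).max()
    if prev_err is not None:
        assert err <= gamma * prev_err + 1e-12   # the contraction (45.148)
    prev_err = err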

45.C PROOF OF ε-OPTIMALITY

In this appendix we establish property (45.26). To begin with, an argument similar to (45.148) can be used to establish part (a) of Prob. 45.8, namely,

\max_{s\in S} \big| v_k^\star(s) - v_{k-1}^\star(s) \big| \le γ \max_{s\in S} \big| v_{k-1}^\star(s) - v_{k-2}^\star(s) \big| \qquad (45.149)

from which we conclude, by iterating, that

\max_{s\in S} \big| v_k^\star(s) - v_{k-1}^\star(s) \big| \le γ^{k-1} \max_{s\in S} \big| v_1^\star(s) - v_0^\star(s) \big| \qquad (45.150)

in terms of the state value function iterates at steps {0, 1, k - 1, k}. Since γ ∈ [0, 1), this means that there exists a finite k_0 such that condition (45.22) is satisfied for all k ≥ k_0, i.e.,

\max_{s\in S} \big| v_{k_0}^\star(s) - v_{k_0-1}^\star(s) \big| < \Big( \frac{1 - γ}{2γ} \Big) ε, \qquad k \ge k_0 \qquad (45.151)

Now, note that

\max_{s\in S} \big| v_{k_o}^\star(s) - v^\star(s) \big| = \max_{s\in S} \big| v_{k_o}^\star(s) - v_{k_0}^\star(s) + v_{k_0}^\star(s) - v^\star(s) \big| \le \max_{s\in S} \big| v_{k_o}^\star(s) - v_{k_0}^\star(s) \big| + \max_{s\in S} \big| v_{k_0}^\star(s) - v^\star(s) \big| \qquad (45.152)


We can bound each of the terms on the right-hand side as follows:

\max_{s\in S} \big| v_{k_o}^\star(s) - v_{k_0}^\star(s) \big| \le \max_{s\in S} \big| v_{k_o}^\star(s) - v_{k_0+1}^\star(s) + v_{k_0+1}^\star(s) - v_{k_0}^\star(s) \big|
\le \max_{s\in S} \big| v_{k_o}^\star(s) - v_{k_0+1}^\star(s) \big| + \max_{s\in S} \big| v_{k_0+1}^\star(s) - v_{k_0}^\star(s) \big|
\le γ \max_{s\in S} \big| v_{k_o}^\star(s) - v_{k_0}^\star(s) \big| + γ \max_{s\in S} \big| v_{k_0}^\star(s) - v_{k_0-1}^\star(s) \big| \qquad (45.153)

where for the first term on the right-hand side we used the result of Prob. 45.7, and for the second term we used (45.149). It follows that

\max_{s\in S} \big| v_{k_o}^\star(s) - v_{k_0}^\star(s) \big| \le \frac{γ}{1 - γ} \max_{s\in S} \big| v_{k_0}^\star(s) - v_{k_0-1}^\star(s) \big| \qquad (45.154)

Likewise,

\max_{s\in S} \big| v_{k_0}^\star(s) - v^\star(s) \big| \le \frac{γ}{1 - γ} \max_{s\in S} \big| v_{k_0}^\star(s) - v_{k_0-1}^\star(s) \big| \qquad (45.155)

and we conclude from (45.152) that

\max_{s\in S} \big| v_{k_o}^\star(s) - v^\star(s) \big| \le \frac{2γ}{1 - γ} \max_{s\in S} \big| v_{k_0}^\star(s) - v_{k_0-1}^\star(s) \big| \overset{(45.22)}{\le} \frac{2γ}{1 - γ} \cdot \frac{1 - γ}{2γ}\, ε = ε \qquad (45.156)

as claimed by (45.26).

45.D CONVERGENCE OF POLICY ITERATION

We establish in this appendix the convergence properties (45.44a)–(45.44b) and (45.46) for the policy iteration (45.43). For a related discussion, the reader may refer to Bellman (1957a), Howard (1960), Sutton and Barto (1998), and Bertsekas (2007). Thus, note that it is clear from the last maximization over the actions in the policy iteration (45.43) that

q^{π_{ℓ-1}}(s, a_ℓ(s)) \ge q^{π_{ℓ-1}}(s, a_{ℓ-1}(s)) \overset{(45.38c)}{=} v^{π_{ℓ-1}}(s) \qquad (45.157)

or, more explicitly, using the expression for q^{π_{ℓ-1}}(s, a_ℓ(s)) from (45.43):

\sum_{s'\in S} P(s, a = a_ℓ(s), s') \big[ r(s, a = a_ℓ(s), s') + γ\, v^{π_{ℓ-1}}(s') \big] \ge v^{π_{ℓ-1}}(s) \qquad (45.158)

At the same time, due to the Bellman optimality condition (45.14a), we have

\sum_{s'\in S} P(s, a = a_ℓ(s), s') \big[ r(s, a = a_ℓ(s), s') + γ\, v^{π_ℓ}(s') \big] = v^{π_ℓ}(s) \qquad (45.159)

Subtracting (45.158) from (45.159), we find that

v^{π_ℓ}(s) - v^{π_{ℓ-1}}(s) \ge γ \sum_{s'\in S} P(s, a = a_ℓ(s), s') \big[ v^{π_ℓ}(s') - v^{π_{ℓ-1}}(s') \big] \qquad (45.160)


We can rewrite this inequality in vector form by introducing the vector and matrix quantities:

v^{π_ℓ} \triangleq \text{col}\{ v^{π_ℓ}(s) \}_{s\in S}, \qquad P^{π_ℓ} \triangleq \big[ P(s, a = a_ℓ(s), s') \big]_{s,s'\in S} \qquad (45.161)

so that

(v^{π_ℓ} - v^{π_{ℓ-1}}) \succeq γ P^{π_ℓ} (v^{π_ℓ} - v^{π_{ℓ-1}}) \;\Longleftrightarrow\; (I - γ P^{π_ℓ})\,δ \succeq 0, \qquad δ \triangleq v^{π_ℓ} - v^{π_{ℓ-1}} \qquad (45.162)

where we introduced the difference vector, δ, and where the notation x \succeq y means elementwise comparison for the entries of the vectors. Let y = (I - γP^{π_ℓ})δ. Then, the entries of the vector y are nonnegative. Now, since ρ(γP^{π_ℓ}) < 1 and γP^{π_ℓ} has nonnegative entries, we have

δ = (I - γP^{π_ℓ})^{-1} y = y + (γP^{π_ℓ})\, y + (γP^{π_ℓ})^2 y + (γP^{π_ℓ})^3 y + \cdots \succeq 0 \qquad (45.163)

so that all entries of δ are nonnegative as well. We conclude that (45.45) holds, so that the policy iteration (45.43) generates increasing state value functions that are bounded from above, which in turn means that the sequence {v^{π_ℓ}(s)} has a limit. Let us denote this limit temporarily by {v^o(s)} and verify that it coincides with v^\star(s). Indeed, when {v^{π_ℓ}(s)} converges, it also follows that {q^{π_ℓ}(s, a), π_ℓ(a|s), a_ℓ(s)} converge to limit values, denoted by {q^o(s, a), π^o(a|s), a^o(s)}. These limit values satisfy the relations:

v^o(s) = \sum_{s'\in S} P(s, a = a^o(s), s') \big[ r(s, a = a^o(s), s') + γ\, v^o(s') \big]

π^o(a|s) := \argmax_{a\in A} \Big\{ \sum_{s'\in S} P(s, a, s') \big[ r(s, a, s') + γ\, v^o(s') \big] \Big\} \qquad (45.164a)

The maximum in the second line is attained at some action a^o(s) from which π^o(a|s) is constructed. Comparing the expressions for v^o(s) and π^o(a|s), we conclude that v^o(s) satisfies the Bellman equation

v^o(s) = \max_{a\in A} \Big\{ \sum_{s'\in S} P(s, a, s') \big[ r(s, a, s') + γ\, v^o(s') \big] \Big\} \qquad (45.165)

which implies, by uniqueness, that v^o(s) = v^\star(s) and q^o(s, a) = q^\star(s, a). Accordingly, π^o(a|s) is an optimal (deterministic) policy, namely, π^o(a|s) = π^\star(a|s). Finally, the convergence of π_ℓ(a|s) to the optimal (deterministic) policy π^\star(a|s) occurs in a finite number of steps. To see this, we verify that the sequence of state vectors, {v^{π_ℓ}}, generated by the policy iteration (45.43) consists of distinct vectors for all ℓ unless the sequence has converged to v^\star. Indeed, assume that for some ℓ_o, it holds that v^{π_{ℓ_o}} = v^{π_{ℓ_o - 1}}. Then, it must hold that v^{π_ℓ} = v^{π_{ℓ_o}} for all ℓ > ℓ_o, which means that v^{π_ℓ} should have converged to v^\star. Thus, note from the policy iteration (45.43) that when v^{π_{ℓ_o}} = v^{π_{ℓ_o - 1}}, it will hold that

q^{π_{ℓ_o}}(s, a) = q^{π_{ℓ_o - 1}}(s, a) \qquad (45.166)

and, therefore,

π_{ℓ_o + 1}(a|s) = π_{ℓ_o}(a|s) \qquad (45.167)

This in turn implies that

v^{π_{ℓ_o + 1}} = v^{π_{ℓ_o}} \qquad (45.168)


The argument can be repeated to conclude that v π` = v π`o = v ? for all ` > `o . This argument shows that the policy iteration converges by generating a sequence of distinct policies and state values until convergence. Since there can be at most |A||S| distinct policies, we conclude that convergence cannot take more than this number of steps.

45.E PIECEWISE LINEAR PROPERTY

In this appendix, we establish result (45.99) by induction; see Smallwood (1971), Smallwood and Sondik (1973), and Sondik (1978) for a related discussion. First, we know that the representation holds at k = 0 where v_0^\star(b) = 0. In this case, we have a single α_{0,1} = 0_{|S|} with C_0 = {1}. Now, assume the piecewise linear representation holds for v_{k-1}^\star(b') at iteration k - 1, namely,

) X

= 0 max

t ∈Ck−1

0

0

0

b (s )αk−1,t0 (s )

=

max

t0 ∈C

s0 ∈S

k−1

o n (b0 )T αk−1,t0

(45.169)

Using this assumption, we will verify that a similar representation is valid for vk? (b), from which we would conclude that (45.99) holds. For this purpose, using expressions (45.92a) and (45.92b) we observe first that, for a fixed action a ∈ A, the term between brackets in (45.98) can be written as (where b0 is the belief vector that results from taking this action a, which moves the MDP into a new state s0 where o0 is emitted):

X

r(b, a) + γ

o0 ∈O

=

XX

? P(o0 |b, a) vk−1 (b0 )

b(s)P(s, a, s0 )r(s, a, s0 ) + γ

s∈S s0 ∈S

=

XX

(45.170a) X o0 ∈O

? P(o0 |b, a)vk−1 (b0 )

b(s)P(s, a, s0 )r(s, a, s0 ) +

s∈S s0 ∈S

! γ

X o0 ∈O

P(o0 |b, a)

X

max 0

t ∈Ck−1

b0 (s0 )αk−1,t0 (s0 )

s0 ∈S

( =

X

X

o0 ∈O

s∈S

b(s)

X

1 + |O| }

P(s, a, s0 )r(s, a, s0 )

s0 ∈S

|



{z

= ρ(s,a)

!) γ

max

t0 ∈Ck−1

X

b(s)

s∈S

X

0

0

0

0

P(s, a, s )B(s , o )αk−1,t0 (s )

s0 ∈S

|



{z

}

= βk−1,t (s,a,o0 )

( =

!)

X

X

o0 ∈O

s∈S

b(s)ρa (s) + γ

max

t0 ∈Ck−1

X s∈S

b(s)β

k−1,t0

0

(s, a, o )

1910

Value and Policy Iterations

That is, r(b, a) + γ

X

? P(o0 |b, a) vk−1 (b0 )

o0 ∈O

( X

=

max

t0 ∈Ck−1

o0 ∈O

X

)! 0

b(s) ρ(s, a) + γβk−1,t0 (s, a, o )

s∈S

|



{z

}

= αk,t0 (a,o0 )

! X

=

max

t0 ∈Ck−1

o0 ∈O

X

0

b(s)αk,t0 (a, o )

(45.170b)

s∈S

It is now clear that the maximization in the last step is over a collection of hyperplanes whose number is |Ck−1 |; this maximization should be repeated for each possible o0 . By summing the piecewise linear representations over the o0 ∈ O, we obtain another piecewise linear representation where the number of hyperplanes is now increased to at most |Ck−1 ||O| . This conclusion is valid for expression (45.170a), which is limited to one action a. Therefore, if we consider all actions a ∈ A, then the total number of hyperplanes to be considered is at most |Ck−1 ||O| |A|. According to (45.98), we need to maximize over these hyperplanes and, therefore, we conclude that form (45.99) holds for a new set Ck that has at most |Ck−1 ||O| |A| elements.
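The growth in the number of hyperplanes can also be mimicked directly in code. The sketch below is our own illustrative enumeration (it is not one of the pruning algorithms cited in the commentaries): given the vectors representing the previous value function, it forms one candidate α-vector per action and per assignment of a predecessor vector to each observation, which is exactly the count obtained above.

import itertools
import numpy as np

def backup_alpha_vectors(alphas, P, B, rho, gamma):
    # alphas : list of length-|S| arrays representing v_{k-1}
    # P[a, s, s'] : transition kernel;  B[s', o] : emission probabilities
    # rho[s, a]   : expected one-step reward sum_s' P(s, a, s') r(s, a, s')
    A, S, _ = P.shape
    O = B.shape[1]
    new_alphas = []
    for a in range(A):
        # beta[t', o, :] has entries sum_s' P(s, a, s') B(s', o) alpha_{t'}(s')
        beta = np.array([[P[a] @ (B[:, o] * alpha) for o in range(O)] for alpha in alphas])
        # one candidate per choice of a predecessor vector t'(o) for every observation o
        for choice in itertools.product(range(len(alphas)), repeat=O):
            new_alphas.append(rho[:, a] + gamma * sum(beta[choice[o], o] for o in range(O)))
    return new_alphas   # at most len(alphas)**O * A vectors; v_k(b) = max over t of b @ new_alphas[t]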

45.F

BELLMAN PRINCIPLE OF OPTIMALITY In this appendix we establish the statement of the Bellman principle of optimality given by (45.123). For further details and related discussion, the reader may consult the texts by Howard (1960), Dreyfus and Law (1977), Ross (1983), Bryson (1998), Puterman (2005), and Bertsekas (2007). To begin with, the initial reward r(0) in (45.122) involves only the first action variable a0 (also called the decision variable in the general setting), whose selection is only affected by π0 . The remaining reward values in (45.122) depend on all action variables including a0 . This is because this initial action influences the state s1 , which in turn influences all subsequent reward values {r(1), r(2), . . . , r(k − 1)}. Therefore, we can write: (   (a) ? vk (s) = max E a0 ∼π0 r(s, a0 , s1 ) | s0 = s + π0 ,··· ,πk−1

s1 ∼(P,π0 )

E (b)

= max π0

 E

a0 ∼π0 ,··· ,ak−1 ∼πk−1 s1 ∼(P,π0 ),··· ,sk ∼(P,πk−1 )

a0 ∼π0 s1 ∼(P,π0 )



max

E

n=1

!) n

γ r(sn , an , sn+1 ) | s0 = s

 r(s, a0 , s1 ) | s0 = s +

" π1 ,··· ,πk−1

k−1 X

a0 ∼π0 ,··· ,ak−1 ∼πk−1 s1 ∼(P,π0 ),··· ,sk ∼(P,πk−1 )

k−1 X n=1

!#) n

γ r(sn , an , sn+1 ) | s0 = s (45.171)

where in step (a) we split the sum from (45.122) into two parts and in step (b) we split the maximization into two steps: We first maximize the innermost term over {π1 , . . . , πk−1 } and subsequently maximize over π0 . The result of the innermost maximization is a function of a0 (or π0 ) since it influences the next state s1 = s0 . That is the reason why we have the outer maximization of the combination over π0 .

45.F Bellman Principle of Optimality

1911

Next, using the conditional mean property E x = E (E (x|y) for two random variables x and y, we manipulate the expectation in the last term as follows:

E

k−1 X

a0 ∼π0 ,··· ,ak−1 ∼πk−1 s1 ∼(P,π0 ),··· ,sk ∼(P,πk−1 )

n=1

! n

γ r(sn , an , sn+1 ) | s0 = s

( =E

E

a0 ∼π0 s1 ∼(P,π0 )

a1 ∼π1 ,··· ,ak−1 ∼πk−1 s2 ∼(P,π1 ),··· ,sk ∼(P,πk−1 )

a0 ∼π0 s1 ∼(P,π0 )

E |

n

0

)

γ r(sn , an , sn+1 ) | s1 = s

n=1 ∆

0

!

k−1 X

a1 ∼π1 ,··· ,ak−1 ∼πk−1 s2 ∼(P,π1 ),··· ,sk ∼(P,πk−1 )

n

γ r(sn , an , sn+1 ) | s1 = s , s0 = s

n=1

( =E

!)

k−1 X

{z

| s0 = s }

= L(s,a0 ,s1 )

=E

a0 ∼π0 L(s, a0 , s1 ) s1 ∼(P,π0 )

(45.172)

where, for compactness, we introduced for the moment the auxiliary function:



L(s, a0 , s1 ) = E

a1 ∼π1 ,··· ,ak−1 ∼πk−1 s2 ∼(P,π1 ),··· ,sk ∼(P,πk−1 )

!

k−1 X

γ n r(sn , an , sn+1 ) | s1 = s0

n=1

(45.173)

Now consider a function of two random variable arguments, L(x, y). It is straightforward to verify that the following inequality always holds n o n o max E L(x, y) ≤ E max L(x, y)

(45.174)

x

x

This is because of the following sequence of arguments: max L(x, y) ≥ L(x, y) =⇒ x

E max L(x, y) ≥ E L(x, y) =⇒ x

E max L(x, y) ≥ max E L(x, y) x

(45.175)

x

Using property (45.174) and result (45.172) we get

( max

π1 ,··· ,πk−1 (45.172)

=

(45.174)



E

a0 ∼π0 ,··· ,ak−1 ∼πk−1 s1 ∼(P,π0 ),··· ,sk ∼(P,πk−1 )

 max

π1 ,··· ,πk−1

E a0 ∼π0 s1 ∼(P,π0 )

E

k−1 X n=1

!) n

γ r(sn , an , sn+1 ) | s0 = s

 a0 ∼π0 L(s, a0 , s1 ) s1 ∼(P,π0 )



 max

π1 ,··· ,πk−1

L(s, a0 , s1 )

(45.176)

1912

Value and Policy Iterations

Using expression (45.173) for L(s, a0 , s1 ) we have max

π1 ,··· ,πk−1

L(s, a0 , s1 ) (

=

max

π1 ,··· ,πk−1

(c)

=E

 ? ? a1 ∼π1 ,··· ,ak−1 ∼πk−1 ? ? s2 ∼(P,π1 ),··· ,sk ∼(P,πk−1 )

=γE



= γE

=γE

n=1

!) 0

n

γ r(sn , an , sn+1 ) | s1 = s

r(1) + γr(2) + . . . + γ k−2 r(k − 1) | s1 = s0 

? ? ,··· ,ak−2 ∼πk−2 a0 ∼π0 ? ? s1 ∼(P,π0 ),··· ,sk−1 ∼(P,πk−2 )

? ? ,··· ,ak−2 ∼πk−2 a0 ∼π0 ? ? s1 ∼(P,π0 ),··· ,sk−1 ∼(P,πk−2 )

(45.117)

k−1 X

γr(1) + γ 2 r(2) + . . . + γ k−1 r(k − 1) | s1 = s0

? ? ,··· ,ak−1 ∼πk−1 a1 ∼π1 ? ? s2 ∼(P,π1 ),··· ,sk ∼(P,πk−1 )

(d)

=

E a1 ∼π1 ,··· ,ak−1 ∼πk−1 s2 ∼(P,π1 ),··· ,sk ∼(P,πk−1 )





r(0) + γr(1) + . . . + γ k−2 r(k − 2) | s0 = s0

k−2 X n=0



! 0

n

γ r(sn , an , sn+1 ) | s0 = s

∗ γ vk−1 (s0 )

(45.177)

where in step (c) we are denoting optimal policies by the notation πn? and step (d) uses the fact that the transition probabilities P(s, a, s0 ) do not change with time. Substituting (45.177) into (45.176) and returning to (45.171) we find that this line of reasoning establishes the validity of the following inequality: vk? (s) ≤ max



π0

E

a0 ∼π0 s1 ∼(P,π0 )



  ? r(s, a0 , s1 ) + γ vk−1 (s1 ) | s0 = s

(45.178)

which we can also rewrite as vk? (s) ≤ max a∈A

n   o ? E s1 ∼P r(s, a, s1 ) + γ vk−1 (s1 ) | s0 = s

(45.179)

where we replaced the maximization over π0 by maximization over the action space a ∈ A; this is because the initial state s0 = s is known and, hence, maximizing over the policy space is equivalent to maximizing over the action space. It turns out that equality actually holds in (45.179) or in (45.176), which we establish by induction. We start with k = 2 for which v2? (s) = max



π0 ,π1

E

 a0 ∼π0 ,a1 ∼π1 s1 ∼(P,π0 ),s2 ∼(P,π1 )

(r(s, a0 , s1 ) + γ r(s1 , a1 , s2 ) | s0 = s)

( = max E π0

a0 ∼π0 s1 ∼(P,π0 )

 γ max π1

(r(s, a0 , s1 ) | s0 = s) +

E a0 ∼π0 ,a1 ∼π1 s1 ∼(P,π0 ),s2 ∼(P,π1 )

) (r(s1 , a1 , s2 ) | s0 = s)

( = max

a0 ∈A

E s1 ∼(P,π0 ) (r(s, a0 , s1 ) | s0 = s) + )



γ max E π1

a0 ∼π0 ,a1 ∼π1 s1 ∼(P,π0 ),s2 ∼(P,π1 )

(r(s1 , a1 , s2 ) | s0 = s)

(45.180)

45.F Bellman Principle of Optimality

1913

We expand the second term as follows:   (r(s1 , a1 , s2 ) | s0 = s) max E a0 ∼π0 ,a1 ∼π1 π1 s1 ∼(P,π0 ),s2 ∼(P,π1 )     = max E a0 ∼π0 E a1 ∼π1 (r(s1 , a1 , s2 ) | s1 = s) | s = s0 π1 s2 ∼(P,π1 ) s1 ∼(P,π0 )    X X  X X π0 (a0 |s)P(s, a0 , s1 )  π1 (a1 |s1 )P(s1 , a1 , s2 )r(s1 , a1 , s2 ) = max π1   a0 ∈A s1 ∈S a1 ∈A s2 ∈S    X X  X X = π0 (a0 |s)P(s, a0 , s1 )  max πa (a1 |s1 )P(s1 , a1 , s2 )r(s1 , a1 , s2 )   π1 (·|s1 )  a0 ∈A s1 ∈S a1 ∈A s2 ∈S   = E s1 ∼(P,π0 ) v1? (s1 )|s0 = s (45.181) Substituting into (45.180) we get n  o v2? (s) = max E s1 ∼P r(s, a, s1 ) + γ v1? (s1 ) | s0 = s a∈A

(45.182)

so that equality in (45.179) holds for k = 2. We continue by induction. Assume that for a given horizon k, there exists at least one policy, denoted by {πnk,? }, for which equality holds in (45.179), namely, vk? (s) = max a∈A

n

  o ? E s0 ∼P r(s, a, s0 ) + γ vk−1 (s0 ) | s0 = s

(45.183)

Then, for horizon k + 1, there should exist a policy {πnk+1,? } for which equality will hold at that time step as well, i.e., for which k+1,?

π vk+1

(s) = max a∈A

n

  o E s0 ∼P r(s, a, s0 ) + γ vk? (s0 ) | s0 = s

(45.184)

We establish this fact by construction. Define a policy π k+1,? that chooses its initial action a0 to maximize the value function and then follows πnk,? after the first step, i.e., k,? πnk+1,? (a|s) = πn−1 ,

for n = 1, 2, . . .

(45.185)

while for n = 0: ( a?0

= argmax a0 ∈A

E

k,?

k,? ,··· ,ak ∼πk−1 k,? k,? s1 ∼P,s2 ∼(P,π0 ),··· ,sk+1 ∼(P,πk−1 )

a1 ∼π0

k X

!) n

γ r(sn , an , sn+1 )|s0 = s

n=0

(45.186)

1914

Value and Policy Iterations

The value function for this policy is given by k+1,?

π vk+1

(s)    = max E k,? k,? a1 ∼π0 ,··· ,ak ∼πk−1 a0 ∈A   k,? k,? s1 ∼P,s2 ∼(P,π0 ),··· ,sk+1 ∼(P,πk−1 ) (

 !  γ n r(sn , an , sn+1 )|s0 = s   n=0 k X

E s1 ∼P r(s, a0 , s1 ) +

= max

a0 ∈A

E

k,?

a1 ∼π0

k,?

,··· ,ak ∼πk−1

k,?

s1 ∼P,s2 ∼(P,π0

k X

!) n

γ r(sn , an , sn+1 )|s0 = s

n=1

k,?

),··· ,sk+1 ∼(P,πk−1 )

( = max

a0 ∈A

E s1 ∼P r(s, a0 , s1 ) + "

E s1 ∼P E

k,? k,? ,··· ,ak ∼πk−1 k,? k,? s2 ∼(P,π0 ),··· ,sk+1 ∼(P,πk−1 )

a1 ∼π0

k X n=1

! n

γ r(sn , an , sn+1 ) | s1 = s1

(e)

n  o = max E s1 ∼P r(s, a0 , s1 ) + γvk? (s1 ) | s0 = s a0 ∈A

#) | s0 = s (45.187)

where step (e) is because of the optimality of the policy πnk,? for horizons of depth k. Note that this result achieves the bound (45.184), as claimed. Comparing (45.184) with (45.179) we conclude that equality should hold in the latter relation as well so that n   o ? vk? (s) = max E s0 ∼P r(s, a, s0 ) + γ vk−1 (s0 ) | s0 = s (45.188) a∈A

and we arrive at the Bellman principle of optimality given by (45.123).

REFERENCES Aström, K. J. (1965), “Optimal control of Markov decision processes with incomplete state estimation,” J. Math. Anal. Appl., vol. 10, pp. 174–205. Bather, J. A. (2000), Decision Theory: An Introduction to Dynamic Programming and Sequential Decisions, Wiley. Bellman, R. E. (1953), “An introduction to the theory of dynamic programming,” Report R-245, RAND Corporation. Bellman, R. E. (1954), “The theory of dynamic programming,” Bull. Amer. Math. Soc., vol. 60, no. 6, pp. 503–516. Bellman, R. E. (1957a), Dynamic Programming, Princeton University Press. Also published in 2003 by Dover Publications. Bellman, R. E. (1957b), “A Markovian decision process,” Indiana Univ. Math. J., vol. 6, no. 4, pp. 679–684. Bertsekas, D. P. (1987), Dynamic Programming: Deterministic and Stochastic Models, Prentice Hall. Bertsekas, D. P. (2007), Dynamic Programming and Optimal Control, 4th ed., 2 vols, Athena Scientific. Bertsekas, D. P. and S. Shreve (1978), Stochastic Optimal Control, Academic Press. Bryson, A. E. (1998), Dynamic Optimization, Pearson.


Cassandra, A. R., L. P. Kaelbling, and M. L. Littman (1994), “Acting optimally in partially observable stochastic domains,” Proc. Nat. Conf. Artificial Intelligence, pp. 1023–1028, Seattle, WA. Cassandra, A. R., M. L. Littman, and N. L. Zhang (1997), “Incremental pruning: A simple, fast, exact method for partially observable Markov decision processes,” Proc. Conf. Uncertainty in Artificial Intelligence, pp. 54–61, Providence, RI. Derman, C. (1970), Finite State Markovian Decision Processes, Academic Press. Dijkstra, E. W. (1959), “A note on two problems in connection with graphs,” Numerische Mathematik, vol. 1, no. 1, pp. 269–271. Dreyfus, S. E. and A. M. Law (1977), The Art and Theory of Dynamic Programming, Academic Press. Feinberg, E. A. and A. Shwartz, editors, (2002), Handbook of Markov Decision Processes, Kluwer. Hansen, E. A. (1997), “An improved policy iteration algorithm for partially observable MDPs,” Proc. Advances Neural Information Processing Systems (NIPS), pp. 1015– 1021, Denver, CO. Hansen, E. A. (1998), “Solving POMDPs by searching in policy space,” Proc. Conf. Uncertainty in Artificial Intelligence, pp. 211–219, Madison, WI. Howard, R. A. (1960), Dynamic Programming and Markov Processes, MIT Press. Howard, R. A. (1971), Dynamic Probabilistic Systems, Wiley. Kamien, M. I. and N. L. Schwartz (2012), Dynamic Optimization, Dover Publications. Kochenderfer, M. J. (2015), Decision Making Under Uncertainty: Theory and Application, MIT Press. Littman, M. L., A. R. Cassandra, and L. P. Kaelbling (1996), Efficient DynamicProgramming Updates in Partially Observable Markov Decision Processes, Technical Report CS–95–19, Brown University. Meuleau, N., K.-E. Kim, L. P. Kaelbling, and A. R. Cassandra (1999a), “Solving POMDPs by searching the space of finite policies,” Proc. Conf. Uncertainty in Artificial Intelligence, pp. 417–426, Stockholm. Meuleau, N., L. Peshkin, K.-E. Kim, and L. P. Kaelbling (1999b), “Learning finitestate controllers for partially observable environments,” Proc. Conf. Uncertainty in Artificial Intelligence, pp. 427–436, Stockholm. Papadimitriou, C. H. and J. N. Tsitsiklis (1987) “The complexity of Markov decision processes,” Math. Oper. Res., vol. 12, no. 3, pp. 441–450. Puterman, M. L. (2005), Markov Decision Processes: Discrete Stochastic Dynamic Programming, Wiley. Puterman, M. L. and M. C. Shin (1978), “Modified policy iteration algorithms for discounted Markov decision problems,” Manage. Sci., vol. 24, pp. 1127–1137. Ross, S. (1983), Introduction to Stochastic Dynamic Programming, Academic Press. Sawaki, K., and A. Ichikawa (1978) “Optimal control for partially observable Markov decision processes over an infinite horizon,” J. Oper. Res. Soc. Japan, vol. 21, no. 1, pp. 1–14. Shapley, L. S. (1953), “Stochastic games,” Proc. Nat. Acad. Sci., vol. 39, pp. 1095–1100. Sheskin, T. J. (2010), Markov Chains and Decision Processes for Engineers and Managers, CRC Press. Smallwood, R. D. (1971), “The analysis of economic teaching strategies for a simple learning model,” J. Math. Psychol., vol. 8, no. 2, pp. 285–301. Smallwood, R. D. and E. J. Sondik (1973), “The optimal control of partially observable Markov processes over a finite horizon,” Oper. Res., vol. 21, pp. 1071–1088. Sondik, E. J. (1971), The Optimal Control of Partially Observable Markov Decision Processes, Ph.D. dissertation, Stanford University, USA. Sondik, E. J. (1978), “The optimal control of partially observable Markov processes over the infinite horizon: Discounted costs,” Oper. Res., vol. 26, pp. 
282–304.


Sutton, R. S. and A. G. Barto (1998), Reinforcement Learning: An Introduction, A Bradford Book. White, D. J. (1993), Markov Decision Processes, Wiley. Zhang, N. L. and W. Liu (1996), Planning in Stochastic Domains: Problem Characteristics and Approximation, Technical Report HKUST-CS96-31, Department of Computer Science, Hong Kong University of Science and Technology.

46 Temporal Difference Learning

We derived in the previous two chapters procedures for assessing the performance of strategies used by agents interacting with a Markov decision process (MDP), including obtaining optimal policies. Among other methods, we discussed the policy evaluation algorithm (44.116) and the value and policy iterations (45.23) and (45.43), respectively. All these procedures were based on the assumption that the transition kernel, P(s, a, s0 ), and the reward function, r(s, a, s0 ), for the MDP are known beforehand for all states s ∈ S and actions a ∈ A. When this information is unavailable, the procedures will need to be redesigned in order to incorporate steps that enable the agent to explore and learn the unknown parameters (either directly or indirectly). There are at least two broad routes that can be used to accomplish this task, depending on whether one relies on model-based or model-free learning. We will describe several procedures in this and the coming chapters that fit into one route or the other. The resulting general framework for learning is known as reinforcement learning, since it will involve an agent learning by trial and error through a repeated process of actions, rewards, and experiences. In this way, the agent ends up influencing the evolution of the MDP through its own actions. The agent explores the state–action space and receives feedback in real time in the form of rewards in response to the actions it is taking. These rewards inform the agent about how good its actions have been but provide no additional information about how good other actions could have been (or whether the actions taken by the agent are optimal or not). We will distinguish between two broad classes of reinforcement learning mechanisms: on-policy learning and off-policy learning. In the first case, the agent selects actions according to a target policy that the agent wishes to assess, and uses the observed rewards/experiences to guide the learning process about this policy. In off-policy learning, the agent acts according to some other behavior policy (other than the target policy), and uses the observed rewards/experiences to guide the learning process about the target policy. This latter mode of operation enables agents to learn from observing behavior by other agents, even when these other agents are following policies that may be different from the target policy. For instance, if an agent wishes to learn how to move a car from city X to city Y, it can learn from observing other cars moving between other cities. Many


possibilities open up once the critical element of learning is embedded into the operation of value-based and policy-based algorithms for MPDs.

46.1

MODEL-BASED LEARNING We initiate our discussion by examining a model-based approach to learning the parameters of an MDP. In model-based learning, we assume that the MDP under consideration has at least one exit state, denoted by se , with zero reward and where the operation of the MDP comes to a stop, i.e., P(se , a = “any”, se ) = 1,

r(s, a = “any”, se ) = 0, for any s ∈ S

(46.1)

For this MDP, we observe/collect many episodes involving actions, rewards, and transitions. Each episode would start from some state s and end at an exit state. For example, one episode may involve transitions of the following form: r(s,a,s0 )

r(s0 ,a0 ,s00 )

r(s00 ,a00 ,se )

s −−−−−→ s0 −−−−−−−→ s00 −−−−−−−→ se

(46.2)

starting from state s, and with intermediate states {s0 , s00 } and actions {a, a0 , a00 }. Each transition involves an observed reward value, which is indicated above the arrow. Once collected, the episodes can be used to estimate missing parameters for the MDP model, such as the transition probability kernel P(s, a, s0 ) or the reward function r(s, a, s0 ), as follows. For every state s, we examine the collection of episodes and count how many transitions exist from state s. We denote this number by N (s). Then, for this same state s, we count how many times a transition occurs from s to some other state s0 under some specific action a. We denote this number by N (s, a, s0 ) times. Then, we set 0 b a, s0 ) = N (s, a, s ) P(s, N (s)

(46.3)

That is, we estimate the transition probabilities by calculating the relative frequencies of each transition in the training data. Observe that the denominator in the above expression is not the total number of episodes but rather the total number of one-step transitions starting from state s. If desired, we can incorporate Laplace smoothing to avoid situations where N (s, a, s0 ) = 0. For this purpose, we would use b a, s0 ) = P(s,

N (s, a, s0 ) + 1 N (s) + |A| × |S|

(46.4)

using the cardinalities of the action and state sets, {A, S}. Likewise, each transition from a state s to a state s0 under an action a will allow us to learn the reward value: rb(s, a, s0 ) = r(s, a, s0 )

(46.5)


If rewards are measured under some uncertainty (such as noise), then we can consider averaging repeated instances of r(s, a, s0 ) to obtain rb(s, a, s0 ): X 1 rb(s, a, s0 ) = r(i) (s, a, s0 ) (46.6) N (s, a, s0 ) i

where the integer i indexes the repeated instances of r(s, a, s′). Once the episodes have been used to estimate P and r, we can then apply the same policy evaluation procedure (44.116) and value and policy iteration algorithms (45.23) and (45.43) from the earlier chapter to the empirical MDP model M̂ = (S, A, P̂, r̂). Of course, the efficacy of this approach is only as good as the accuracy of the parameter estimates generated from the episodes. It is not difficult to see that this approach is problematic for large state spaces since it is based on estimating the entries of P and r for all states; a large number of experiments/episodes is needed and the state × action space needs to be sampled densely enough. In the following, we will describe more efficient (but also more elaborate) approaches.
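To make the counting estimator concrete, here is a minimal Python sketch of (46.3)–(46.6). The function name, the representation of episodes as lists of (s, a, r, s′) tuples, and the optional Laplace-smoothing flag are our own illustrative choices rather than constructs from the text.

```python
from collections import defaultdict

def estimate_mdp(episodes, states, actions, laplace=False):
    """Estimate transition probabilities and rewards from recorded episodes.

    episodes : list of episodes, each a list of (s, a, r, s_next) tuples
    states, actions : iterables listing the state and action sets
    Returns dictionaries P_hat[(s, a, s_next)] and r_hat[(s, a, s_next)].
    """
    N_s = defaultdict(int)        # N(s): one-step transitions leaving s
    N_sas = defaultdict(int)      # N(s, a, s')
    r_sum = defaultdict(float)    # accumulated rewards observed for (s, a, s')

    for episode in episodes:
        for (s, a, r, s_next) in episode:
            N_s[s] += 1
            N_sas[(s, a, s_next)] += 1
            r_sum[(s, a, s_next)] += r

    P_hat, r_hat = {}, {}
    denom_extra = len(actions) * len(states) if laplace else 0
    for s in states:
        for a in actions:
            for s_next in states:
                key = (s, a, s_next)
                num = N_sas[key] + (1 if laplace else 0)   # (46.3) or (46.4)
                den = N_s[s] + denom_extra
                P_hat[key] = num / den if den > 0 else 0.0
                # average repeated rewards as in (46.6); zero if never observed
                r_hat[key] = r_sum[key] / N_sas[key] if N_sas[key] > 0 else 0.0
    return P_hat, r_hat
```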

Example 46.1 (Constructing an empirical MDP) We reconsider the simplified grid problem from Example 44.5, which is reproduced in Fig. 46.1. We assume now that we do not know the transition probabilities, P(s, a, s′), and the reward function, r(s, a, s′). We continue to number the states from s = 1 through s = 5, with states s = 1 and s = 5 leading to the exit state s = 6 (which is not shown in the figure).

Figure 46.1 The figure shows a one-dimensional grid with s ∈ {1, 2, 3, 4, 5} states of unknown rewards. The exit state is s = 6 and is not shown. The transition probabilities between states are also not known.

We perform seven experiments and observe the following episodes (where we are using the letters {R, L} to refer to the actions {RIGHT, LEFT}):

#1 :  2 --r(2,L,1)=+5--> 1 --r(1,R,6)=0--> 6        (46.7a)
#2 :  2 --r(2,R,3)=−1--> 3 --r(3,R,4)=−1--> 4 --r(4,R,5)=+15--> 5 --r(5,R,6)=0--> 6        (46.7b)
#3 :  5 --r(5,R,6)=0--> 6        (46.7c)
#4 :  4 --r(4,L,3)=−1--> 3 --r(3,L,2)=−1--> 2 --r(2,L,1)=+5--> 1 --r(1,L,6)=0--> 6        (46.7d)
#5 :  4 --r(4,R,5)=+15--> 5 --r(5,L,6)=0--> 6        (46.7e)
#6 :  3 --r(3,L,2)=−1--> 2 --r(2,L,1)=+5--> 1 --r(1,R,6)=0--> 6        (46.7f)
#7 :  1 --r(1,L,6)=0--> 6        (46.7g)

We use the episodes to estimate/discover the transition probabilities and rewards. To begin with, since s = 6 is the exit state, we set


P̂(s = 6, a = "any", s′ = 6) = 1,    r̂(s = 6, a = "any", s′ = 6) = 0        (46.8)

Next, we observe that within the above episodes there are four transitions from state s = 2, occurring in episodes {#1, #2, #4, #6}. Three of these transitions occur to state s′ = 1 and one of them occurs to state s′ = 3. Thus, we have

N(s = 2) = 4        (46.9a)
N(s = 2, a = LEFT, s′ = 1) = 3        (46.9b)
N(s = 2, a = RIGHT, s′ = 3) = 1        (46.9c)

so that

P̂(s = 2, a = LEFT, s′ = 1) = 3/4,    P̂(s = 2, a = RIGHT, s′ = 3) = 1/4        (46.10)

For these same transitions, we observe that

r̂(s = 2, a = LEFT, s′ = 1) = +5,    r̂(s = 2, a = RIGHT, s′ = 3) = −1        (46.11)

Similarly, we have

P̂(s = 3, a = LEFT, s′ = 2) = 2/3,     r̂(s = 3, a = LEFT, s′ = 2) = −1        (46.12a)
P̂(s = 3, a = RIGHT, s′ = 4) = 1/3,    r̂(s = 3, a = RIGHT, s′ = 4) = −1        (46.12b)
P̂(s = 4, a = LEFT, s′ = 3) = 1/3,     r̂(s = 4, a = LEFT, s′ = 3) = −1        (46.12c)
P̂(s = 4, a = RIGHT, s′ = 5) = 2/3,    r̂(s = 4, a = RIGHT, s′ = 5) = 15        (46.12d)
P̂(s = 5, a = RIGHT, s′ = 6) = 2/3,    r̂(s = 5, a = RIGHT, s′ = 6) = 0        (46.12e)
P̂(s = 5, a = LEFT, s′ = 6) = 1/3,     r̂(s = 5, a = LEFT, s′ = 6) = 0        (46.12f)
P̂(s = 1, a = LEFT, s′ = 6) = 1/2,     r̂(s = 1, a = LEFT, s′ = 6) = 0        (46.12g)
P̂(s = 1, a = RIGHT, s′ = 6) = 1/2,    r̂(s = 1, a = RIGHT, s′ = 6) = 0        (46.12h)

while all other entries in P̂ and r̂ are set to zero. Of course, the estimated entries in P̂ are approximate values: Their accuracy will improve with more experiments. Using the empirical MDP M̂ = (S, A, P̂, r̂), we can now appeal to any of the procedures described before, such as the policy evaluation algorithm (44.116), or the value and policy iterations (45.23) and (45.43), to interact with the MDP. For example, given a particular policy π(a|s), we can use the policy evaluation algorithm (44.116) to estimate its state values. We can also determine an approximate optimal policy to maximize the total expected rewards. These answers will be approximate due to the empirical nature of the MDP model. We will describe better-performing procedures in the following. For example, in the next section we describe an alternative technique for performing policy evaluation directly (rather than building an empirical model M̂ = (S, A, P̂, r̂) first and then running (44.116)).
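As a quick numerical check, the short sketch below transcribes the transitions of the seven episodes in (46.7a)–(46.7g) and reproduces the counts and relative frequencies obtained above for state s = 2; the tuple encoding of the transitions is an illustrative choice.

```python
from collections import Counter

# transitions (s, a, s_next, reward) transcribed from episodes (46.7a)-(46.7g)
transitions = [
    (2, 'L', 1, +5), (1, 'R', 6, 0),                                     # episode 1
    (2, 'R', 3, -1), (3, 'R', 4, -1), (4, 'R', 5, 15), (5, 'R', 6, 0),   # episode 2
    (5, 'R', 6, 0),                                                      # episode 3
    (4, 'L', 3, -1), (3, 'L', 2, -1), (2, 'L', 1, +5), (1, 'L', 6, 0),   # episode 4
    (4, 'R', 5, 15), (5, 'L', 6, 0),                                     # episode 5
    (3, 'L', 2, -1), (2, 'L', 1, +5), (1, 'R', 6, 0),                    # episode 6
    (1, 'L', 6, 0),                                                      # episode 7
]

N_s = Counter(s for (s, a, s_next, r) in transitions)
N_sas = Counter((s, a, s_next) for (s, a, s_next, r) in transitions)

# relative frequencies (46.3) for the transitions leaving state s = 2
print(N_s[2])                          # 4, as in (46.9a)
print(N_sas[(2, 'L', 1)] / N_s[2])     # 0.75, as in (46.10)
print(N_sas[(2, 'R', 3)] / N_s[2])     # 0.25, as in (46.10)
```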

46.2 MONTE CARLO POLICY EVALUATION

We will be dealing mainly with deterministic policies and will consider stochastic policies later in Section 49.1. A deterministic policy is one where a specific action is associated with each state. Thus, assume we are given a deterministic policy π(a|s) and the objective is to evaluate the policy, i.e., determine its state values,


v^π(s), under the assumption that the underlying MDP model is unknown. Rather than estimate P̂ and r̂, as was the case with the approach of the previous section, we describe a more direct procedure for policy evaluation. Specifically, given a policy π(a|s), the agent simply follows the actions dictated by this policy and observes the rewards. The agent does not make decisions; it simply executes the policy that is given and learns from observed experiences. For this reason, this technique is viewed as passive learning; it is also referred to as a Monte Carlo (MC)-based method since the policy values are learned from averaging over multiple samples. The procedure operates as follows. For every state s ∈ S, the agent executes policy π(a|s) until it reaches an exit state. This process is repeated L independent times for each s. There are now at least three different ways by which the collected episodes can be used to estimate the state values v^π(s). These are known by the following names: (a) naïve policy evaluation; (b) first-visit policy evaluation; and (c) every-visit policy evaluation. One of the main advantages of these methods is their simplicity. They also do not require the environment to be Markovian. On the other hand, one of their main limitations is that they require full episodes for each state to carry out the learning process. These methods are not able to learn from incomplete episodes or from streaming data, as will be the case, for example, with the temporal difference methods described in Section 46.3.

46.2.1 Naïve Policy Evaluation

Recall that for each state s, we have available L episodes starting from s and ending at an exit state. For each such state s, we select one of its L episodes, say, the ℓth episode, and evaluate its discounted reward until the end state. For example, if we let n = 0 correspond to the origin of time at state s, and denote this state by s_0 and the states and actions along this path by (s_n, a_n), then the discounted reward for the ℓth episode is given by:

U_ℓ(s) = r(s_0 = s, a_0, s_1) + γ r(s_1, a_1, s_2) + γ^2 r(s_2, a_2, s_3) + ···
       = \sum_{n=0}^{∞} γ^n r(s_n, a_n, s_{n+1}) = \sum_{n=0}^{∞} γ^n r(n)        (46.13)

where we are using the short-hand notation

r(n) ≜ r(s_n, a_n, s_{n+1}),    s_0 = s        (46.14)

and where the actions are determined from the given policy, a_n = π(a|s = s_n). The sum in (46.13) will generally involve a finite number of terms since the MDP is assumed to have an exit state where actions stop. Observe that we are attaching a subscript ℓ to U_ℓ(s) to emphasize that this is the discounted reward associated with the ℓth episode. If we repeat this evaluation for L episodes starting from


state s, we can then average the resulting {U_ℓ(s)}, and use the average as an approximation for the state value, v^π(s), i.e.,

v̂^π(s) = (1/L) \sum_{ℓ=1}^{L} U_ℓ(s),    s ∈ S        (46.15)

This construction is illustrated in Fig. 46.2. Clearly, this estimate is consistent with definition (44.65), where the actual state value was defined as:

v^π(s) = E_{π,P} [ U(s) | s_0 = s ]        (46.16)

in terms of the mean of the discounted utility:

U(s) ≜ \sum_{n=0}^{∞} γ^n r(n)        (46.17)

where we are denoting the rewards in boldface since their values depend on the distribution of actions and transitions, i.e., r(n) = r(s, a, s′).

Figure 46.2 Under naïve policy evaluation, the state value v^π(s) is estimated by averaging the discounted rewards over L experiments according to (46.15).

Since v̂^π(s) amounts to computing a sample mean, we can appeal to the result of Prob. 16.9 to devise an alternative recursive procedure for evaluating it. If we let v_ℓ^π(s) denote the estimate for v^π(s) that is available after the evaluation of the ℓth term U_ℓ(s), then we have

v_ℓ^π(s) = v_{ℓ-1}^π(s) + (1/ℓ) [ U_ℓ(s) − v_{ℓ-1}^π(s) ],    ℓ ≥ 1        (46.18)


with boundary condition v_0^π(s) = 0. At the end of this iteration, when ℓ = L, it will hold that

v̂^π(s) = v_L^π(s)        (46.19)

More generally, if the number of experiments L is large, by appealing to the same Prob. 16.9, we can employ instead a step-size sequence and apply the recursive construction:

v_ℓ^π(s) = v_{ℓ-1}^π(s) + µ(ℓ) [ U_ℓ(s) − v_{ℓ-1}^π(s) ],    ℓ ≥ 1        (46.20)

where µ(ℓ) is any sequence satisfying the conditions:

0 < µ(ℓ) < 1,    lim_{ℓ→∞} µ(ℓ) = 0,    \sum_{ℓ=1}^{∞} µ(ℓ) = ∞        (46.21)

General sequences of the form µ(ℓ) = τ/ℓ^c, for constants τ ∈ (0, 1) and c ∈ (1/2, 1], satisfy these conditions. The choice c = 1 is common, and the condition on τ can be relaxed to any τ > 0. In the recursive forms (46.18) and (46.20), we refer to U_ℓ(s) as the target signal or sample or observation. It is clear from definition (46.13) that U_ℓ(s) (in boldface notation) is an unbiased estimator for the true utility function U(s), i.e.,

E U_ℓ(s) = E U(s) = v^π(s)        (46.22)

Recall that we use the boldface notation to refer to random variables. Returning to (46.15), and since the episodes for each state are assumed to be based on independent experiment runs, the rewards {U_ℓ(s)} can therefore be viewed as realizations of independent observations for U(s). By the law of large numbers (see Prob. 4.28), we conclude that the sample mean estimators, v̂^π(s) or v̂_ℓ^π(s), defined by (46.15), (46.18), or (46.20), are expected to converge to the true state value, v^π(s), in probability for large L. That is, for any ε > 0, it will hold that

lim_{L→∞} P( | v̂^π(s) − v^π(s) | ≥ ε ) = 0        (46.23a)
lim_{ℓ→∞} P( | v_ℓ^π(s) − v^π(s) | ≥ ε ) = 0        (46.23b)

This means that, for any small margin ε, and for a sufficient number L of episodes, there is a very high probability that the sample mean, v̂^π(s), will be close to the desired state value. In addition, we know from Prob. 3.54 that the sample mean estimator will approach the actual state value exponentially fast in the number of episodes, i.e.,

P( | v̂^π(s) − v^π(s) | ≥ ε ) ≤ 2 e^{−2ε²L}        (46.24)

Implementation (46.20) employs a decaying step-size sequence satisfying (46.21). We can employ constant step sizes and replace (46.20) by

v_ℓ^π(s) = v_{ℓ-1}^π(s) + µ [ U_ℓ(s) − v_{ℓ-1}^π(s) ],    ℓ ≥ 1        (46.25)


where µ > 0 is a small fixed parameter and independent of ℓ. Constant step sizes are beneficial for two main reasons: (a) they are useful in nonstationary environments where a constant µ keeps learning alive while a decaying µ turns off learning; and (b) they lead to a faster convergence rate of v_ℓ^π(s) toward v^π(s) than decaying step sizes. These two advantages come at the expense of a small deterioration in accuracy. We know from result (19.15a) for stochastic gradient algorithms with constant step sizes that v_ℓ^π(s) in (46.25) will not converge to v^π(s) anymore, but only to a small neighborhood around it. Specifically, it will now hold that, for small values of µ, the mean-square error (MSE) in the steady state will be on the order of µ:

lim sup_{ℓ→∞} E ( v_ℓ^π(s) − v^π(s) )^2 = O(µ)        (46.26)
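A minimal sketch of naïve Monte Carlo policy evaluation follows, combining the episode returns through the recursive constructions (46.18) and (46.25). The run_episode(s) interface, assumed to sample the policy and the MDP and return the reward sequence observed from s until the exit state, is a hypothetical stand-in for whatever simulator is available.

```python
def discounted_return(rewards, gamma):
    """Discounted reward (46.13) of one episode, given its reward sequence."""
    U, factor = 0.0, 1.0
    for r in rewards:
        U += factor * r
        factor *= gamma
    return U

def mc_policy_evaluation(run_episode, states, gamma, L, mu=None):
    """Naive Monte Carlo policy evaluation.

    run_episode(s) -> list of rewards observed from s until the exit state
    (hypothetical simulator interface). For each state, L independent episodes
    are generated; with mu=None the exact sample mean (46.18) is used, while a
    constant mu gives the constant step-size update (46.25).
    """
    v = {s: 0.0 for s in states}
    for s in states:
        for ell in range(1, L + 1):
            U = discounted_return(run_episode(s), gamma)
            step = (1.0 / ell) if mu is None else mu
            v[s] = v[s] + step * (U - v[s])
    return v
```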

46.2.2 First-Visit Policy Evaluation

The naïve construction assesses each state value by averaging over all L episodes collected for that state; each of these episodes starts from a state s and runs according to policy π(a|s) until the exit state. We can “increase” the number of episodes that are available for each state, beyond L, without actually increasing the number of experiments. This is accomplished as follows. Observe that any episode starting from a state s will generally traverse other states before it lands at an exit state. For instance, if for a particular episode the state following s is s′, then this means that we can extract from this episode a path that runs from s′ to the exit state. This path can be added to the L episodes that were originally collected for state s′, thus increasing the number of available episodes for s′ by one to L + 1. In the first-visit technique, the first time that a state s′ occurs within an episode for another state s, the path from that location s′ to the exit state is added to the episodes for s′. Observe that we are limiting the extraction of the additional path to the first occurrence of s′; if future occurrences arise for s′ in that same path, they are ignored under this construction (but taken into account in the construction described in the next section). We repeat this construction for all states. This construction is illustrated in Fig. 46.3, where we show two episodes starting from two arbitrary states s_x and s_y. The first occurrences of state s′ in each episode are highlighted and the paths from these occurrences to the exit state are subsequently added to the episodes available for s′. Let L(s) denote the total number of episodes that end up being available for each state s once the first-visit trajectories from s to exit states have been added to the original L episodes for s. For each of these L(s) episodes, we evaluate its discounted reward and average the results, as in (46.15) with L replaced by L(s), to estimate v^π(s).


added to the episodes for s0 . Even if the state s itself is repeated along an episode starting from s, then we can also extract multiple paths for s. This construction is illustrated in Fig. 46.4.


The same final conclusion will continue to hold if we allow the step sizes to be iteration-dependent but matching in value for all n for both the backward and forward views. By an offline implementation we mean the following. We will now have two indices: One index n denotes time and runs over states and rewards, and another index k denotes episodes (assumed of length N each). Rather than update the state value estimates for every time n, we will update them only for every episode.


Specifically, we now let v_{k-1}^π(s) denote the state value estimate that becomes available at the end of the (k − 1)th episode. This estimate will be updated by using samples from the kth episode, which starts from some initial state and runs forward in time until some terminal state. In this case, the online forward procedure (46.73) is replaced by the following offline construction running for n ≥ 0 over the observations from episode k:

v_k^{π,f}(s) = v_{k-1}^{π,f}(s) + µ_k(s) \sum_{n=0}^{N-1} [ U_n^λ(s) − v_{k-1}^{π,f}(s) ] I[s_n = s]        (46.89)

with boundary condition v_{-1}^{π,f}(s) = 0 for all states. Here, we are using the superscript f to refer to the iterates that are generated by the forward implementation in order to distinguish them from the iterates generated by the backward implementation described below, where we employ the superscript b instead. Also, µ_k(s) now denotes the step-size parameter used for this episode. Observe in (46.89) that the error signals for state s are aggregated during the run over the kth episode, using the same fixed iterate v_{k-1}^{π,f}(s). The aggregate error is then used to update v_{k-1}^{π,f}(s) to v_k^{π,f}(s) at the end of the episode. We then repeat the construction for the (k + 1)th episode and the process continues in this manner. In a similar vein, the online backward procedure (46.87) is replaced by the following offline construction running for n ≥ 0 over observations from episode k:

δ_k^b(s_n) = r(n) + γ v_{k-1}^{π,b}(s_{n+1}) − v_{k-1}^{π,b}(s_n)        (46.90a)
v_k^{π,b}(x) = v_{k-1}^{π,b}(x) + µ_k(s) \sum_{n=0}^{N-1} t_n(x) δ_k^b(s_n),    x ∈ S        (46.90b)

with boundary condition v_{-1}^{π,b}(x) = 0. Here, again, the product t_n(x) δ_k^b(s_n) is aggregated during the run over the episode, using the same fixed iterate v_{k-1}^{π,b}(s) in the computation of all δ_k^b(s_n). This aggregate error is then used to update v_{k-1}^{π,b}(s) to v_k^{π,b}(s) at the end of the episode. To establish the equivalence between the offline forward and backward iterations (46.89) and (46.90a)–(46.90b), it is sufficient to verify that:

\sum_{n=0}^{N-1} [ U_n^λ(s) − v_{k-1}^{π,f}(s) ] I[s_n = s] = \sum_{n=0}^{N-1} t_n(s) δ_k^b(s_n)        (46.91)

so that if the recursions start from the same initial conditions, v_{-1}^{π,f}(s) = v_{-1}^{π,b}(s) = 0, the iterates will remain identical thereafter, i.e., v_k^{π,f}(s) = v_k^{π,b}(s) for all k and all states s ∈ S. Equality (46.91) is established in Appendix 46.D.
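The equality (46.91) can also be checked numerically. The sketch below evaluates both sides for a single made-up episode and fixed value estimates: the forward side computes U_n^λ directly from its definition as a weighted combination of the p-step approximations (with the convention that the exit state has zero value and zero subsequent rewards), while the backward side accumulates t_n(s) δ_k^b(s_n) using the eligibility traces of (46.86). The helper names and the dictionary bookkeeping are our own.

```python
def forward_increments(states_seq, rewards, v, gamma, lam):
    """Left side of (46.91): sum_n [U_n^lambda - v(s_n)] I[s_n = s], per state,
    with the value estimates v held fixed over the episode."""
    T = len(rewards)                         # number of transitions in the episode
    def U_p(n, p):                           # p-step approximation of the return
        horizon = min(p, T - n)
        ret = sum(gamma**m * rewards[n + m] for m in range(horizon))
        if n + p < T:                        # bootstrap only before termination
            ret += gamma**p * v[states_seq[n + p]]
        return ret
    def U_lambda(n):                         # weighted combination of p-step terms
        total = sum((1 - lam) * lam**(p - 1) * U_p(n, p) for p in range(1, T - n))
        return total + lam**(T - n - 1) * U_p(n, T - n)
    delta = {s: 0.0 for s in set(states_seq[:T])}
    for n in range(T):
        s = states_seq[n]
        delta[s] += U_lambda(n) - v[s]
    return delta

def backward_increments(states_seq, rewards, v, gamma, lam):
    """Right side of (46.91): sum_n t_n(s) delta_n, with traces as in (46.86)."""
    T = len(rewards)
    trace = {s: 0.0 for s in set(states_seq[:T])}
    delta = {s: 0.0 for s in set(states_seq[:T])}
    for n in range(T):
        s, s_next = states_seq[n], states_seq[n + 1]
        td = rewards[n] + gamma * v.get(s_next, 0.0) - v[s]
        for x in trace:                      # decay every trace, bump current state
            trace[x] = gamma * lam * trace[x] + (1.0 if x == s else 0.0)
            delta[x] += trace[x] * td
    return delta

# a tiny made-up episode on hypothetical states; terminal state 'E' has zero value
states_seq = ['A', 'B', 'A', 'C', 'E']
rewards    = [1.0, -1.0, 2.0, 0.5]
v = {'A': 0.3, 'B': -0.2, 'C': 0.1, 'E': 0.0}
print(forward_increments(states_seq, rewards, v, 0.9, 0.8))
print(backward_increments(states_seq, rewards, v, 0.9, 0.8))
# the two dictionaries agree up to rounding, as claimed in (46.91)
```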

46.6 TRUE ONLINE TD(λ) ALGORITHM

There is an alternative TD(λ) implementation that admits equivalent forward and backward interpretations. The resulting algorithm is known as true online TD(λ). It employs a modified construction for the approximations U_n^{(p)} used for the discounted utility. Recall from the earlier discussion that:

U_n^{(1)}(s) = r(n) + γ v_{n-1}^π(s′)        (46.92)
U_n^{(2)}(s) = r(s, a, s′) + γ r(s′, a′, s″) + γ^2 v_{n-1}^π(s″)        (46.93)
U_n^{(3)}(s) = r(s, a, s′) + γ r(s′, a′, s″) + γ^2 r(s″, a″, s‴) + γ^3 v_{n-1}^π(s‴)        (46.94)

⋮

and, more generally, for a forward-looking approximation of order P:

U_n^{(P)}(s) = γ^P v_{n-1}^π(s_{n+P}) + \sum_{m=0}^{P-1} γ^m r(n + m)        (46.95)

In these constructions, the value functions v_{n-1}^π(·) have the same subscript n − 1, indicating that the true value function, v^π(·), is evaluated by using the approximation that is available for it at iteration n − 1 of the learning algorithm. The true online implementation employs updated approximations for the value function. Specifically, since each U_n^{(p)}(s) is looking p steps into the future, updated estimates will become available for v^π(·) and they can be used to enhance performance. For this reason, the above expressions will now be adjusted as follows where the subscripts for v^π(·) change from n − 1 to n to n + 1 and so forth:

U_n^{(1)}(s) = r(n) + γ v_{n-1}^π(s′)        (46.96)
U_n^{(2)}(s) = r(s, a, s′) + γ r(s′, a′, s″) + γ^2 v_n^π(s″)        (46.97)
U_n^{(3)}(s) = r(s, a, s′) + γ r(s′, a′, s″) + γ^2 r(s″, a″, s‴) + γ^3 v_{n+1}^π(s‴)        (46.98)

and, more generally, for a forward-looking approximation of order P:

U_n^{(P)}(s) = γ^P v_{n+P-2}^π(s_{n+P}) + \sum_{m=0}^{P-1} γ^m r(n + m)        (46.99)

We again combine these utility approximations in a weighted manner. Now, however, we employ a truncated version and replace (46.67) by:


U_{n,N}^λ(s) ≜ (1 − λ) \sum_{p=1}^{N-n} λ^{p-1} U_n^{(p)}(s) + λ^{N-n} U_n^{(N-n+1)}(s)
            = (1 − λ) { U_n^{(1)}(s) + λ U_n^{(2)}(s) + ··· + λ^{N-n-1} U_n^{(N-n)}(s) } + λ^{N-n} U_n^{(N-n+1)}(s)        (46.100)

for some positive integer N; it is added as a subscript to U_{n,N}^λ(s). Observe that U_{n,N}^λ(s) now consists of the weighted addition of N − n terms (rather than an infinite number of terms). We next replace the forward TD algorithm (46.73) by

v_{n,N}^π(s) = v_{n-1,N}^π(s) + µ_n(s) I[s_n = s] [ U_{n,N}^λ(s) − v_{n-1,N}^π(s) ],    0 ≤ n ≤ N        (46.101)

Note that we are adding a second subscript to the estimate for the value function, v_{n,N}^π(·), where the subscript N is used to indicate that the resulting iterates are based on a finite horizon of length N. The final iterate is given by v_{N,N}^π(s) and it would correspond to the desired value function at time N:

v_N^π(s) ≜ v_{N,N}^π(s)        (46.102)

We will now explain how to derive a recursion that updates v_{n-1}^π(s) directly to v_n^π(s). To begin with, some simple algebra allows us to deduce from (46.100) that (see Prob. 46.9):

U_{n,N+1}^λ(s) − U_{n,N}^λ(s) = (γλ)^{N-n+1} δ_{N+1}(s)        (46.103)

where the earlier expression (46.66c) for the temporal difference δ_n(s) is modified to

δ_n(s) ≜ r(n) + γ v_{n-1}^π(s′) − v_{n-2}^π(s)        (46.104)

We next establish that the difference v_{n-1,N+1}^π(s) − v_{n-1,N}^π(s) admits the following representation:

v_{n-1,N+1}^π(s) − v_{n-1,N}^π(s) = (γλ)^{N-n+2} δ_{N+1}(s) t_{n-1}(s)        (46.105)

for some scalar t_{n-1}(s). This is certainly true for n = 1 if we start from the boundary conditions v_{-1,N+1}^π(s) = v_{-1,N}^π(s) = 0. Indeed, using (46.101) we would have

v_{0,N}^π(s) = µ_0(s) U_{0,N}^λ(s)        (46.106)
v_{0,N+1}^π(s) = µ_0(s) U_{0,N+1}^λ(s)        (46.107)


so that

v_{0,N+1}^π(s) − v_{0,N}^π(s) = µ_0(s) [ U_{0,N+1}^λ(s) − U_{0,N}^λ(s) ] = (γλ)^{N+1} δ_{N+1}(s) µ_0(s)        (46.108)

where the second equality uses (46.103),

which is of the same form as (46.105) with t_0(s) = µ_0(s). We proceed by induction. Assume form (46.105) is valid at n − 1. Then, some algebra using (46.101) and (46.103) shows that relation (46.105) holds at time n as well in the following manner (see Prob. 46.10):

v_{n,N+1}^π(s) − v_{n,N}^π(s) = (γλ)^{N-n+1} δ_{N+1}(s) t_n(s)        (46.109)

in terms of new trace variables computed as

t_n(s) = µ_n(s) I[s_n = s] ( 1 − γλ t_{n-1}(s) ) + γλ t_{n-1}(s),    t_{-1}(s) = 0        (46.110)

Note that if the state s_n does not coincide with s then the trace value is simply scaled down by γλ; otherwise, it is scaled down and adjusted further by µ_n(s)(1 − γλ t_{n-1}(s)). We are now ready to derive an equivalent backward-view recursion that focuses on propagating the desired endpoints v_{n,n}^π(s) (which correspond to the desired iterates v_n^π(s)). Indeed, using (46.101) and (46.105), we can write

v_{n,n}^π(s) = v_{n-1,n}^π(s) + µ_n(s) I[s_n = s] [ U_{n,n}^λ(s) − v_{n-1,n}^π(s) ]        (46.111)
v_{n-1,n}^π(s) = v_{n-1,n-1}^π(s) + γλ δ_n(s) t_{n-1}(s)        (46.112)

Noting from definition (46.100), and using (46.96) and (46.104), that

U_{n,n}^λ(s) = U_n^{(1)}(s) = r(n) + γ v_{n-1}^π(s_{n+1}) = δ_n(s) + v_{n-2}^π(s_n)        (46.113)

and substituting (46.112) into (46.111) we arrive at

v_{n,n}^π(s) = v_{n-1,n-1}^π(s) + δ_n(s) t_n(s) + µ_n(s) [ v_{n-2}^π(s_n) − v_{n-1}^π(s_n) ]        (46.114)

In summary, we arrive at the following listing for the true online TD(λ) algorithm.


True online TD(λ) for on-policy state value function evaluation.

given a (deterministic or stochastic) policy π(a|s);
initial state values: v_{-1}^π(s) = arbitrary, for all s ∈ S;
initial trace values: t_{-1}(s) = 0, for all s ∈ S.
repeat over episodes:
    for each episode, let s_0 denote its initial state
    repeat over episode for n ≥ 0:
        s = s_n
        a ∼ π(a|s = s_n)
        s′ = s_{n+1}
        r(n) = r(s, a, s′)
        δ_n(s) = r(n) + γ v_{n-1}^π(s′) − v_{n-2}^π(s)
        t_n(x) = µ_n(x) I[x = s] ( 1 − γλ t_{n-1}(x) ) + γλ t_{n-1}(x),  ∀ x ∈ S
        v_n^π(x) = v_{n-1}^π(x) + δ_n(s) t_n(x) + µ_n(x) [ v_{n-2}^π(x) − v_{n-1}^π(x) ],  ∀ x ∈ S
    end
    v_{-1}^π(s) ← v_n^π(s) and t_{-1}(s) ← t_n(s), ∀ s ∈ S, in preparation for next episode
end

(46.115)
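For reference, the listing can be implemented along the following lines. The sketch below is written in the standard tabular form of true online TD(λ) due to van Seijen and Sutton (2014), which the listing expresses in this chapter's notation through the pair of estimates v_{n-1}^π and v_{n-2}^π; the environment interface (reset, step) and the constant step size µ are illustrative assumptions.

```python
def true_online_td_lambda(reset, step, states, gamma, lam, mu, num_episodes):
    """Tabular true online TD(lambda) for policy evaluation.

    reset() returns the initial state of an episode and step(s) returns
    (reward, next_state, done); both are hypothetical interfaces that sample
    the policy being evaluated and the underlying MDP.
    """
    v = {s: 0.0 for s in states}
    for _ in range(num_episodes):
        trace = {s: 0.0 for s in states}
        s = reset()
        v_old = 0.0                       # previous estimate of the current state's value
        done = False
        while not done:
            r, s_next, done = step(s)
            V = v[s]
            V_next = 0.0 if done else v[s_next]
            delta = r + gamma * V_next - V
            # dutch eligibility trace; compare with (46.110) for a constant step size
            bump = mu * (1.0 - gamma * lam * trace[s])
            for x in states:
                trace[x] *= gamma * lam
            trace[s] += bump
            # value update: every state with a nonzero trace is adjusted
            for x in states:
                v[x] += (delta + V - v_old) * trace[x]
            v[s] -= mu * (V - v_old)
            v_old = V_next
            s = s_next
    return v
```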

Example 46.9 (Playing a game over a grid) We illustrate the operation of the true online TD(λ) algorithm (46.115) by reconsidering the same grid problem from Fig. 46.6. We again evaluate the state value function v^π(s) in two ways for comparison purposes. In the first method, we assume knowledge of the MDP parameters {π(a|s), P(s, a, s′)} and compute v^π(s) by using the Poisson equation (44.72) with γ = 0.9. In the second method, we employ the true online TD(λ) algorithm and run it over 2000 episodes using the constant step-size value µ = 0.01 and setting v_{-1}^π(s) to random initial values. We also use λ = 0.9. Figure 46.12 plots the estimated state values.

Figure 46.12 State value function v^π(s) estimated by running the true online TD(λ) algorithm over 2000 episodes using λ = 0.9. The figure also shows the state values obtained from the Poisson equation (44.72) when the MDP parameters are assumed known.

46.7 OFF-POLICY LEARNING

The TD algorithms are examples of on-policy strategies where the agent takes actions according to a given target policy π(a|s) and observes the corresponding rewards. The learning process is then guided by these observations. Before progressing further, we explain how these (and other similar) strategies can be modified to operate off-policy. That is, to operate in a manner where the agent takes actions and observes rewards according to some other behavior policy, φ(a|s), which need not agree with the target policy, π(a|s). As explained earlier, endowing agents with the ability to operate off-policy is useful because, among other


things, it enables agents to learn from observing the behavior of other agents. The policies π(a|s) and φ(a|s) are assumed to be stochastic in the derivation below. We illustrate the off-policy construction by reconsidering the TD(0) algorithm. Thus, referring to the derivation of the TD(0) algorithm, we recall that it started with the characterization of the state value function as the expectation in (46.36), namely,

v^π(s) = E_{π,P} [ r(s, a, s′) + γ v^π(s′) | s = s ]        (46.116)

and then replaced this expectation by a sample estimate using (46.38), i.e.,

E_{π,P} [ r(s, a, s′) + γ v^π(s′) | s = s ] ≈ r(s, a, s′) + γ v_{n-1}^π(s′)        (46.117)

There are two facts that need to be highlighted here. First, the expectation in (46.116) is computed relative to the target distribution, π(a|s), and, second, the transition from s to s′ is carried out according to the action dictated by this same policy with the resulting reward given by r(s, a, s′):

s --a = π(a|s), r(s,a,s′)--> s′        (46.118)

The superscript π in v^π(·) reflects these two facts, which characterize on-policy operation. These facts will be changed when we transform the algorithm to operate off-policy. Let φ(a|s) denote some behavior policy and introduce the importance weights:

ξ(s, a) ≜ π(a|s) / φ(a|s)  ⟺  π(a|s) = φ(a|s) ξ(s, a)        (46.119)

These weights measure the level of dissimilarity between the target (π) and behavior (φ) policies. We assume that φ(a|s) is positive whenever π(a|s) is nonzero.


If π(a|s) and φ(a|s) happen to be zero at the same state–action pair (s, a), then we set ξ(s, a) = 1. The following example illustrates the role of importance weights in computing expectations relative to different distributions; we encountered a similar argument when discussing importance sampling techniques in an earlier chapter. We will use the result of the example afterwards to develop the off-policy version of TD(0).

Example 46.10 (Importance sampling revisited) The reason for the designation “importance weights” for the scalars ξ(s, a) is because this calculation relates to the concept of importance sampling. We already discussed this concept earlier in Section 33.3.1 in the context of Markov chain Monte Carlo (MCMC) methods. We review the idea briefly by using the notation of the current section. Thus, consider two generic discrete probability distributions, π(x) and φ(x), with the same support x ∈ X. We assume φ(x) is nonzero whenever π(x) is nonzero. Given some discrete random variable h(x), say, h(x) = x², the means of h(x) relative to the distributions π(x) and φ(x) are defined by

E_φ h(x) ≜ \sum_{x_k ∈ X} h(x_k) φ(x_k)        (46.120a)
E_π h(x) ≜ \sum_{x_k ∈ X} h(x_k) π(x_k)        (46.120b)

where the sums run over the discrete values of x ∈ X. Importance sampling allows us to evaluate the mean relative to one distribution from the mean relative to the other distribution. This can be seen as follows:

E_π h(x) = \sum_{x_k ∈ X} h(x_k) [ π(x_k)/φ(x_k) ] φ(x_k) = E_φ [ (π(x)/φ(x)) h(x) ] = E_φ [ ξ(x) h(x) ]        (46.121)

where we introduced the importance weight function ξ(x) = π(x)/φ(x). Result (46.121) shows that the mean of h(x) relative to π(x) can be evaluated by transforming h(x) by the importance weight function ξ(x) and computing the mean of the product ξ(x)h(x) relative to φ(x). Therefore, importance sampling allows us to evaluate the mean relative to one distribution by sampling from another distribution. For example, if we collect samples {x_n, n = 1, 2, . . . , N} arising from some distribution φ(x), then the mean of h(x) relative to distribution π(x) can be estimated from these samples by using

Ê_π h(x) = (1/N) \sum_{n=1}^{N} ξ(x_n) h(x_n),    ξ(x_n) = π(x_n)/φ(x_n)        (46.122)

We will be employing importance sampling below, where the agent will be taking actions according to some behavior policy φ(a|s) while it is interested in evaluating a target policy π(a|s) – see the underlined terms in (46.123). By incorporating importance weights, ξ(s, a), we will be able to replace π(a|s) by φ(a|s) and compute averages relative to π(a|s) from observations collected based on φ(a|s).
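The estimator (46.122) is easy to verify numerically. The sketch below uses made-up distributions π(x) and φ(x) on a small support and estimates E_π h(x) for h(x) = x² from samples drawn under φ(x).

```python
import numpy as np

# discrete support and two distributions over it (illustrative numbers)
support = np.array([0.0, 1.0, 2.0, 3.0])
pi  = np.array([0.1, 0.2, 0.3, 0.4])   # target distribution pi(x)
phi = np.array([0.4, 0.3, 0.2, 0.1])   # behavior distribution phi(x)
h = lambda x: x**2

# exact mean of h(x) under pi, computed as in (46.120b)
exact = np.sum(h(support) * pi)

# estimate it from samples drawn according to phi, using (46.122)
rng = np.random.default_rng(0)
idx = rng.choice(len(support), size=100_000, p=phi)
xi = pi[idx] / phi[idx]                # importance weights pi(x_n)/phi(x_n)
estimate = np.mean(xi * h(support[idx]))

print(exact, estimate)                 # the two values should be close
```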

Returning to the TD(0) algorithm, we reconsider the same characterization (46.116) but rework it into an alternative form involving expectations relative to the behavior policy φ(a|s) rather than π(a|s) as follows:


v^π(s) = E_{π,P} [ r(s, a, s′) + γ v^π(s′) | s = s ]
       = \sum_{a ∈ A} \sum_{s′ ∈ S} π(a|s) P(s, a, s′) [ r(s, a, s′) + γ v^π(s′) ]
       = \sum_{a ∈ A} \sum_{s′ ∈ S} φ(a|s) ξ(s, a) P(s, a, s′) [ r(s, a, s′) + γ v^π(s′) ]
       = \sum_{a ∈ A} \sum_{s′ ∈ S} φ(a|s) P(s, a, s′) ξ(s, a) [ r(s, a, s′) + γ v^π(s′) ]
       = E_{φ,P} { ξ(s, a) [ r(s, a, s′) + γ v^π(s′) ] | s = s }        (46.123)

where the expectation in the last line is relative to the behavior distribution φ(a|s). Moreover, the selection of the action and the resulting transition from s to s′ is now assumed to occur according to policy φ(a|s). The last expression, along with the result of Example 46.10, suggests that we can estimate v^π(s) by constructing a sample realization based on observations from policy φ(a|s):

s --a = φ(a|s), r(s,a,s′)--> s′        (46.124)

The resulting off-policy TD(0) algorithm is listed in (46.125), where the main difference in relation to the earlier on-policy TD(0) implementation (46.50) is the addition of the importance sampling weights, ξn (sn , an ). By setting these weights to the value 1, the algorithm reduces to an on-policy implementation.

TD(0) algorithm for off-policy state value function evaluation.

target policy π(a|s) and behavior policy φ(a|s);
initial state values: v_{-1}^π(s) = arbitrary, for any s ∈ S.
repeat over episodes:
    for each episode, let s_0 denote its initial state
    repeat over episode for n ≥ 0:
        s = s_n
        a ∼ φ(a|s_n)
        s′ = s_{n+1}
        r(n) = r(s, a, s′)
        ξ_n(s, a) = π(a|s)/φ(a|s)
        δ_n(s) = ξ_n(s, a) [ r(n) + γ v_{n-1}^π(s′) ] − v_{n-1}^π(s)
        v_n^π(s) = v_{n-1}^π(s) + µ_n(s) δ_n(s)
    end
    v_{-1}^π(s) ← v_n^π(s), ∀ s ∈ S, in preparation for next episode
end

(46.125)
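A minimal tabular sketch of listing (46.125) is shown below. The environment/behavior interface step_behavior(s), which samples an action from φ(a|s) and returns the observed transition, and the dictionary representation of the two policies are illustrative assumptions.

```python
def off_policy_td0(reset, step_behavior, pi, phi, states, gamma, mu, num_episodes):
    """Off-policy TD(0) evaluation of the target policy pi(a|s).

    reset() returns the initial state of an episode;
    step_behavior(s) -> (a, reward, next_state, done), with a ~ phi(a|s)
    (hypothetical interfaces). pi[s][a] and phi[s][a] give the two policies'
    action probabilities, with phi assumed positive wherever pi is nonzero.
    """
    v = {s: 0.0 for s in states}
    for _ in range(num_episodes):
        s = reset()
        done = False
        while not done:
            a, r, s_next, done = step_behavior(s)
            xi = pi[s][a] / phi[s][a]                 # importance weight (46.119)
            v_next = 0.0 if done else v[s_next]
            delta = xi * (r + gamma * v_next) - v[s]  # TD error as in (46.125)
            v[s] += mu * delta                        # constant step size mu
            s = s_next
    return v
```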


Example 46.11 (Playing a game over a grid) We illustrate the operation of the off-policy TD(0) algorithm (46.125) by reconsidering the same grid problem from Fig. 46.6. We again evaluate the state value function v^π(s) in two ways for comparison purposes. In the first method, we assume knowledge of the MDP parameters {π(a|s), P(s, a, s′)} and compute v^π(s) by using the Poisson equation (44.72) with γ = 0.9. In the second method, we employ the off-policy TD(0) algorithm and run it over 4000 episodes using the constant step-size value µ = 0.01 and setting v_{-1}^π(s) to random initial values. Figure 46.13 plots the estimated state values.


Figure 46.13 State value function v^π(s) estimated by running the off-policy TD(0) algorithm (46.125) over 4000 episodes. The figure also shows the state values obtained from the Poisson equation (44.72) when the MDP parameters are assumed known.

For this simulation, the target and behavior policies are chosen as follows:

(target policy)

π(a|s) = 1/4 for a = UP,  1/4 for a = DOWN,  1/4 for a = RIGHT,  1/4 for a = LEFT,  0 for a = STOP,  for any s ∈ S\{8, 16, EXIT}        (46.126)

At the terminal states, s ∈ {8, 16}, the only action available is for the agent to exit and STOP the game so that

π(a = STOP|s) = 1 for s = 8, 16,  and  π(a = STOP|s) = 0 for s ∈ S\{8, 16}        (46.127)

(behavior policy)

φ(a|s) = 3/8 for a = UP,  1/8 for a = DOWN,  2/6 for a = RIGHT,  1/6 for a = LEFT,  0 for a = STOP,  for any s ∈ S\{8, 16, EXIT}        (46.128)


At the terminal states, s ∈ {8, 16}, the only action available is for the agent to exit and STOP the game so that

φ(a = STOP|s) = 1 for s = 8, 16,  and  φ(a = STOP|s) = 0 for s ∈ S\{8, 16}        (46.129)

46.8 COMMENTARIES AND DISCUSSION

Markov decision processes. Two of the earliest works on integrating the formalism of MDPs into the study of reinforcement learning algorithms are the contributions by Andreae (1969) and Witten (1977). This integration was pursued more comprehensively in the dissertation by Watkins (1989), which has had a strong impact on subsequent progress in the field of reinforcement learning. Since then, many authors have adopted Watkins' MDP formulation in the development of their algorithms. Two other early contributions to reinforcement learning are the works by Widrow, Gupta, and Maitra (1973) and Werbos (1977). The former uses the concept of decision-directed adaptation (by using a bootstrap estimate) to drive the operation of a learning agent for the game of blackjack, while the latter uses the language of approximate dynamic programming instead of MDPs to perform forecasting. Although Minsky (1961) was the first to connect reinforcement learning to dynamic programming, it was not until the work by Watkins (1989) that this connection was made more explicit.

Reinforcement learning. There are several accessible texts and tutorial articles that deal with various issues related to reinforcement learning, including the tutorials by Barto, Bradtke, and Singh (1995), Kaelbling, Littman, and Moore (1996), Gosavi (2009), Szepesvari (2010), Grondman et al. (2012), and Geramifard et al. (2013), as well as the texts by Bertsekas and Tsitsiklis (1996), Sutton and Barto (1998), Bertsekas (2007), and Cao (2007). The designation “reinforcement learning” is motivated by connections to the law of effect in educational psychology, which is deduced from observations of cat behavior. The principle is attributed to the American physiologist Edward Thorndike (1874–1949), and it states that responses that are rewarded positively are more likely to reoccur while responses that are punished are less likely to reoccur – see, e.g., Thorndike (1932) and the description in the text by Gray (2010). In engineering and computer science, the concept of reinforcement learning can be traced back to the early contributions by Samuel (1959), Michie and Chambers (1968), Klopf (1972), and Holland (1975). Samuel (1959) proposed a checker game-playing algorithm, which can be viewed as an early form of TD learning. In his algorithm, the difference between the assessment of two successive positions of the game was used to guide the learning process in a manner similar to what is done, for example, by the TD(0) learning algorithm (46.50) through the use of the TD signal δ_n(s). The TD(0) learning rule itself, although not under this same name, was first proposed by Witten (1977) in his work on adaptive controllers. Temporal difference algorithms in their present form were formally developed and extended by Sutton (1988), who proposed in this work the backward TD(λ) formulation (46.87). Sutton (1988) motivated a flurry of research into temporal difference algorithms and their applications. For example, the TD(λ) recursion formed the basis for the TD-Gammon, a famous program developed by Tesauro (1995) to play the game of backgammon. The program was able to play the game at a level that matched human experts, and helped shine further light on reinforcement learning techniques. In more recent years, in 2016, reinforcement learning techniques were employed by Google to develop a computer program

1958

Temporal Difference Learning

that was able to defeat champion players in the game of GO – see, e.g., the article by Silver et al. (2016). The forward view of TD(λ) in terms of look-ahead reward values, as well as the equivalence between this forward view and Sutton’s TD(λ) backward implementation with eligibility traces, was developed in Watkins (1989). The true online TD(λ) algorithm (46.115) is from van Seijen and Sutton (2014). The earlier work by Michie and Chambers (1968) applied a form of the every-visit MC method of Section 46.2.3 to reinforcement learning. Studies on the performance of several Monte Carlo methods in the context of reinforcement learning appear in Barto and Duff (1994) and Singh and Sutton (1996). Convergence studies. Some of the earliest works on the convergence of the TD(0) and TD(λ) algorithms, as well as the Q-learning algorithm discussed in the next chapter, are those by Sutton (1988), Watkins and Dayan (1992), Dayan (1992), and Dayan and Sejnowski (1994). In this and the next chapter, we pursue a unified approach to the convergence analysis of various algorithms by relying on the stochastic inequality (46.130) in the appendix in a manner similar to what was proposed by Jaakkola, Jordan, and Singh (1994). Related convergence analyses were also given by Tsitsiklis (1994) and Singh et al. (2000).

PROBLEMS

46.1 Refer to the naïve and first-visit policy evaluation procedures of Sections 46.2.1 and 46.2.2. If the number of episodes for each state s ends up increasing four times from L to L(s) = 4L, by how much is the rate of convergence in (46.24) increased?
46.2 Refer to (46.24). Estimate how many episodes L are needed to ensure that, for some small ε_0:

\sum_{s ∈ S} P( | v̂^π(s) − v^π(s) | ≥ ε ) ≤ ε_0

46.3 Refer to the sample mean calculation (46.27) over all episodes starting from some state s.
(a) Using the result of Prob. 16.9, verify that v̂^π(s) can be evaluated recursively as follows. Repeat for ℓ = 1, 2, . . . , L(s):

v̂_ℓ^π(s) = v̂_{ℓ-1}^π(s) + (1/ℓ) [ U_ℓ(s) − v̂_{ℓ-1}^π(s) ],    v̂_0^π(s) = 0

with v̂^π(s) = v̂_{L(s)}^π(s).
(b) Show further that the same sample mean can be approximated recursively as follows:

v̂_ℓ^π(s) = v̂_{ℓ-1}^π(s) + µ(ℓ) [ U_ℓ(s) − v̂_{ℓ-1}^π(s) ],    v̂_0^π(s) = 0

where, in the case of a large number of episodes, the scalar sequence {µ(ℓ)} satisfies

0 ≤ µ(ℓ) < 1,    lim_{ℓ→∞} µ(ℓ) = 0,    \sum_{ℓ=1}^{∞} µ(ℓ) = ∞

46.4 Derive recursion (46.66a).
46.5 Follow arguments similar to the proof of result (46.51) and Example 46.5 to establish that the same convergence result holds for the look-ahead TD algorithm (46.63).


46.6 Refer to the random variable U_n^{(P)}(s) defined by (46.64a). Use an argument similar to (46.54) to show that U_n^{(P)}(s) is an asymptotically unbiased estimator for v^π(s).
46.7 Follow arguments similar to the proof of result (46.51) and Example 46.5 to establish that the same convergence result holds for the forward view of the TD(λ) algorithm given by (46.73).
46.8 Assume we combine a collection of independent realizations of scalar random variables {x(p)}, each with mean x̄ and variance σ_x², by using

x^λ ≜ (1 − λ) \sum_{p=1}^{∞} λ^{p-1} x(p),    λ ∈ (0, 1)

Verify that E x^λ = x̄ while the variance σ²_{x^λ} = E (x^λ − x̄)² is given by

σ²_{x^λ} = [ (1 − λ)/(1 + λ) ] σ_x²

Conclude that the fused variable x^λ has a smaller variance than the individual variables.
46.9 Derive expression (46.103).
46.10 Derive relation (46.109) for the true online TD(λ) algorithm.
46.11 Refer to recursion (46.85). Assume state s is observed only once at time n_o. Determine t_n(s) for all n.

46.A USEFUL CONVERGENCE RESULT

We state below a useful convergence result for a particular type of stochastic difference equation, which is used in the subsequent appendices to establish convergence properties for reinforcement learning algorithms. The statement is an adjustment of a result by Dvoretzky (1956) and it was introduced by Jaakkola, Jordan, and Singh (1994). Let u_n(θ_n) denote a scalar sequence of random variables over the integer n ≥ 0. The sequence is parameterized by a discrete variable θ ∈ Θ, which can vary with n as well, i.e., at each n, the variable θ assumes some random value from within the set Θ. The sequence u_n(θ_n) is assumed to satisfy a stochastic recursion of the form

u_n(θ_n) = [ 1 − α_n(θ_n) ] u_{n-1}(θ_n) + β_n(θ_n) e_n(θ_{ℓ≥n}),    n ≥ 0        (46.130)

At every iteration n, some choice θ ∈ Θ occurs randomly and we denote it by θ_n. The value u_n(θ_n) is then updated for that state. The scalars {α_n(θ_n), β_n(θ_n)} are dependent on θ_n, while e_n can be dependent on future states θ(ℓ) for ℓ ≥ n. We denote this possibility by writing e_n(θ_{ℓ≥n}) or, more compactly, e_n(θ). Introduce the following collections of past and present variables:

F_n = {θ_n} ∪ { θ_m, α_m(θ_m), β_m(θ_m), e_m(θ), m ≤ n − 1 }        (46.131)
X_n = {β_n(θ_n)} ∪ F_n        (46.132)


Lemma 46.1. (Convergence of a stochastic recursion) Under conditions (a)–(d) listed below it holds that

lim_{n→∞} u_n(θ_n) = 0,    almost surely        (46.133)

where the conditions are:
(a) θ ∈ Θ, where Θ is a finite set (i.e., it has a finite number of elements).
(b) {α_n(θ_n) ≥ 0, β_n(θ_n) ≥ 0} are scalar nonnegative sequences satisfying the following conditions with probability 1 and uniformly for all θ ∈ Θ:

\sum_{n=0}^{∞} α_n(θ_n) = ∞,    \sum_{n=0}^{∞} α_n²(θ_n) < ∞,    0 ≤ α_n(θ_n) < 1
\sum_{n=0}^{∞} β_n(θ_n) = ∞,    \sum_{n=0}^{∞} β_n²(θ_n) < ∞
E [ β_n(θ_n) | F_n ] ≤ E [ α_n(θ_n) | F_n ]        (46.134)

(c) With probability 1 and for some γ ∈ (0, 1), it holds that

| E [ e_n(θ) | X_n ] | ≤ γ max_{θ ∈ Θ} | u_{n-1}(θ_n) |        (46.135)

(d) For some bounded nonnegative constant c, it holds that

var ( e_n(θ) | X_n ) ≤ c ( 1 + max_{θ ∈ Θ} | u_{n-1}(θ_n) |² )        (46.136)

where the notation var(x) denotes the variance of the random variable x.

46.B

CONVERGENCE OF TD(0) ALGORITHM In this and the next appendix, we pursue the convergence analysis of the TD(0) and TD(λ) algorithms by relying on the stochastic inequality (46.130) in a manner similar to what was done by Jaakkola, Jordan, and Singh (1994). Related convergence analyses were also given by Tsitsiklis (1994) and Singh et al. (2000). We introduce the error signal: ∆

venπ (s) = v π (s) − vnπ (s),

s∈S

(46.137)

and refer to the TD(0) recursion (46.41) where, for completeness of the analysis, we need to consider the indicator function. Subtracting v π (s) from both sides of the TD(0) recursion (46.41) gives:   π π π venπ (s) = ven−1 (s) − µn (s)I[sn = s] r(n) + γvn−1 (s0 ) − vn−1 (s) (a)

π π π = ven−1 (s) −µn (s)I[sn = s] ven−1 (s) + µn (s)I[sn = s]e vn−1 (s) {z } | =0   π π −µn (s)I[sn = s] r(n) + γvn−1 (s0 ) − vn−1 (s) (46.138)

where in step (a) we added and subtracted the same quantity on the right-hand side. Next we introduce the driving term en (s, s0 ), which is a function of both the current and next states (s, s0 ), to get

46.B Convergence of TD(0) Algorithm

1961

    π π venπ (s) = 1 − µn (s)I[sn = s] ven−1 (s) − µn (s)I[sn = s] r(n) + γvn−1 (s0 ) − v π (s) {z } | ∆

= en (s,s0 )

  π = 1 − µn (s)I[sn = s] ven−1 (s) − µn (s)I[sn = s] en (s, s0 ) | {z }

(46.139)

µ ¯ n (s)

We also introduce the compact notation ∆

µ ¯n (s) = µn (s)I[sn = s]

(46.140)

which incorporates the indicator function into the step size. Then, it follows that the error variable evolves according to the following dynamics (with time subscripts added to the state variables for clarity): venπ (sn ) =

  π 1−µ ¯n (sn ) ven−1 (sn ) − µ ¯n (sn )en (sn , sn+1 ), n ≥ 0

(46.141)

We further note, for later use, that the expression for en (s, s0 ) can be written in the π following equivalent form in terms of the error ven−1 (s0 ): ∆

π en (s, s0 ) = r(n) + γvn−1 (s0 ) − v π (s)

π (s0 ) − v π (s) = r(n) + γv π (s0 ) − γv π (s0 ) + γvn−1 | {z } =0

π = r(n) + γv π (s0 ) − v π (s) − γe vn−1 (s0 )

(46.142)

This form is useful because, under expectation, the first three terms on the righthand side will cancel out as we are going to see in (46.145). To proceed, we rely on the convergence result stated in Appendix 46.A for a particular type of stochastic difference equation. Specifically, comparing (46.141) with recursion (46.130) in the appendix, we make the identifications: θ←s π un−1 (θ n ) ← ven−1 (sn ) ¯ n (sn ) αn (θ(n)) ← µ ¯ n (sn ) β n (θ(n)) ← µ en (θ) ← en (sn , sn+1 )

(46.143a) (46.143b) (46.143c) (46.143d) (46.143e)

where we are now writing sn and sn+1 in boldface to emphasize their random nature. We next introduce the following collection of past and present variables: ∆

Xn =

n

¯ n (sn ) sn , µ

o



n

¯ m (sm ), em (sm , sm+1 ), m ≤ n − 1 sm , µ

o

(46.144)

The reason why past states {sm } are included in the set Xn , even though we are π examining the update for state sn = s, is because the estimates vm (s) from prior iterations may have been influenced by transitions from state s to some other states.

1962

Temporal Difference Learning

Now conditioning (46.142) on Xn we get:   E en (s, s0 )| Xn     (a) π = E π,P r(n) + γv π (s0 )| sn = s − v π (s) − γ E π,P ven−1 (s0 ) | sn = s   (44.70) π π = v (s) − v π (s) − γ E π,P ven−1 (s0 ) | sn = s ! X X 0 π 0 =− π(a|s) γ P(s, a, s ) ven−1 (s ) (46.145) a∈A

s0 ∈S

where the conditioning over sn = s in step (a) is because of the Markovian property (since s0 is determined by sn = s and not by the prior states), and where the expectation in the second line is over the randomness in s0 ∈ S and a ∈ A. It follows that !   X X 0 0 π 0 e (s, s )| X ≤ π(a|s) γ P(s, a, s ) v e (s ) E n n n−1 s0 ∈S

a∈A

! ≤

X

π(a|s)

X s0 ∈S

a∈A

π ven−1 (s0 ) γP(s, a, s0 ) max 0 s ∈S

! π X ven−1 (s0 ) = γ max π(a|s) 0 s ∈S

a∈A

X

P(s, a, s0 )

s0 ∈S

π X = γ max ven−1 (s0 ) π(a|s) s0 ∈S

a∈A

π ven−1 (s0 ) = γ max 0 s ∈S

(46.146)

Therefore, condition (46.135) is satisfied. Next, we recall that, for any random variable x, its variance satisfies σx2 ≤ E x2 . Using this fact, along with (46.142) and (46.145), we get   var en (s, s0 )| Xn n o 2 ≤ E en (s, s0 ) | Xn n o 2 π = E π,P r(n) + γv π (s0 ) − v π (s) − γe vn−1 (s0 ) | Xn n o 2 (46.36) π = E π,P r(n) − E r(n) + γv π (s0 ) − γE v π (s0 ) − γe vn−1 (s0 ) | Xn     (a) ≤ 3 var r(n) | sn = s + 3γ 2 var v π (s0 ) sn = s + n o 2 π 3γ 2 E π,P ven−1 (s0 ) | sn = s     2 ≤ 3 E π,P r 2 (n) | sn = s + 3γ 2 E π,P v π (s0 ) | sn = s + n o 2 π 3γ 2 E π,P ven−1 (s0 ) | sn = s (46.147) where in step (a) we used the Jensen inequality (recall Example 8.9): (x + y + z)2 ≤ 3x2 + 3y 2 + 3z 2 , 0

for any scalars x, y, z.

(46.148)

Now since the reward function, r(s, a, s ), is uniformly bounded for all states and actions, as well as the actual state value function v π (s), we conclude that the first two terms in (46.147) are bounded by some constant d ≥ 0:

46.C Convergence of TD(λ) Algorithm

1963

    2 ∆ 3 E π,P r 2 (n) | sn = s + 3γ 2 E π,P v π (s0 ) | sn = s ≤ 3R2 + 3γ 2 V 2 = d (46.149) Note that it is not necessary to assume uniformly bounded reward and state value functions; it is sufficient to require that the rewards r(n) have a bounded second-order moment, E r 2 (n) (or variance), so that (46.149) holds. On the other hand, we have !   X X π 0 2 0 π 0 2 E π,P ven−1 (s ) | sn = s ≤ π(a|s) P(s, a, s ) ven−1 (s ) a∈A

s0 ∈S

X

X

! ≤

π(a|s)

a∈A

s0 ∈S

π 2 P(s, a, s ) max ven−1 (s0 ) 0

s0 ∈S

! π 2 X = max ven−1 (s0 ) π(a|s) s0 ∈S

a∈A

X

0

P(s, a, s )

s0 ∈S

π ven−1 (s0 ) 2 = max 0

(46.150)

s ∈S

Let ∆

c = max {d, 3γ 2 }

(46.151)

Substituting into (46.147) we find that     π ven−1 (s0 ) 2 var en (s, s0 )| Xn ≤ c 1 + max 0 s ∈S

(46.152)

Therefore, condition (46.136) is satisfied. We conclude from (46.133) that vnπ (s) converges to v π (s) almost surely. This conclusion is valid under condition (46.134), i.e., we must have ∞ X n=0

∞ X

µn (s)I[sn = s] = ∞,

n=0

µ2n (s)I[sn = s] < ∞,

0 ≤ µn (s)I[sn = s] < 1

(46.153) The last two conditions are satisfied in view of (46.43). The first condition requires each state s to be visited infinitely often.

46.C

CONVERGENCE OF TD(λ) ALGORITHM We introduce the error signal ∆

venπ (s) = v π (s) − vnπ (s),

s∈S

(46.154)

and refer to the TD(λ) recursion (46.83). This recursion is updated for all states s at every iteration n. Also, in this recursion the notation sn refers to the state that occurs at time n and, moreover, the TD error term δn (sn ) is defined by the state sn alone: ∆

π π δn (sn ) = r(n) + γvn−1 (sn+1 ) − vn−1 (sn )

(46.155)

For compactness of notation, we further introduce the scaled step-size sequence ∆

µ ¯n (s) = µn (s)tn (s)

(46.156)

1964

Temporal Difference Learning

where, from (46.86), the variable tn (s) counts in a discounted manner the number of occurrences of the state s up to time n, i.e.,

tn (s) =

n X

(λγ)n−m I[sm = s]

(46.157)

m=0

Subtracting v π (s, a) from both sides of the TD(λ) recursion (46.83) gives: π venπ (s) = ven−1 (s) − µ ¯n (s)δn (sn ) π π π µn (s)δn (sn ) = ven−1 (s) −¯ µn (s)e vn−1 (s) + µ ¯n (s)e vn−1 (s) −¯ {z } |

(46.158)

=0

π π π π = (1 − µ ¯n (s))e vn−1 (s) − µ ¯n (s) (r(n) + γvn−1 (sn+1 ) − vn−1 (sn ) − ven−1 (s)) | {z } ∆

= e(s,sn ,sn+1 )

where we are introducing the driving term en (s, sn , sn+1 ). This term is a function of three parameters: the state s for which the state value is being updated by (46.83), the current state sn , and the next state sn+1 . It follows that the error variable for TD(λ) evolves according to the dynamics: venπ (s) =

  π 1−µ ¯n (s) ven−1 (s) − µ ¯n (sn )en (s, sn , sn+1 ), n ≥ 0

(46.159)

We further note, for later use, that the expression for en (s, sn , sn+1 ) can be written in π the following equivalent form in terms of the error variable ven−1 (·): en (s, sn , sn+1 ) π π π = r(n) + γvn−1 (sn+1 ) − vn−1 (sn ) − ven−1 (s) (46.160) π π π π π (sn ) − ven−1 (s) = r(n) + γv (sn+1 ) − γv (sn+1 ) +γvn−1 (sn+1 ) − vn−1 {z } | =0

π π π = r(n) + γv π (sn+1 ) − vn−1 (sn ) − γe vn−1 (sn+1 ) − ven−1 (s)

This form is useful because, under expectation, the first three terms on the right-hand π side will combine to ven−1 (sn ). In that case, the expected value of en (·) will involve three error terms. For now, comparing (46.141) with recursion (46.130) in Appendix 46.A we make the identifications: θ←s π un−1 (θ n ) ← ven−1 (s) ¯ n (s) αn (θ n ) ← µ ¯ n (s) β n (θ n ) ← µ en (θ) ← en (s, sn , sn+1 )

(46.161a) (46.161b) (46.161c) (46.161d) (46.161e)

where we are now writing {s, sn , sn+1 } in boldface to emphasize their random nature. We next introduce the following collection of past and present variables: ∆

Xn =

n o ¯ n (s) ∪ {sm , µ ¯ m (sm ), em (s, sm , sm+1 ), m ≤ n − 1} sn , µ

(46.162)

46.C Convergence of TD(λ) Algorithm

1965

Now conditioning (46.160) on Xn we get:

  E en (s, sn , sn+1 ) | Xn (46.163)   (a) π = E π,P r(n) + γv π (sn+1 ) | sn = s − vn−1 (sn ) −     π π γ E π,P ven−1 (sn+1 ) | sn = s − E π,P ven−1 (s) | Xn   (44.70) π π π = v (sn ) − vn−1 (sn ) − γE π,P ven−1 (sn+1 ) | sn = s −   E π,P ven−1 (s) | Xn     π π π = ven−1 (sn ) − γ E π,P ven−1 (sn+1 ) | sn = s − E π,P ven−1 (s) | Xn where the conditioning over sn = s in step (a) is because of the Markovian property (since sn+1 is determined by sn and not by the prior states). It follows that

  E en (s, sn , sn+1 ) | Xn     π π π ≤ e vn−1 (sn ) + γ E π,P ven−1 (s0 ) | sn = s + E π,P ven−1 (s)| Xn π π π ≤ max |e vn−1 (s)| + γ max |e vn−1 (s)| + max |e vn−1 (s)| s∈S

s∈S

s∈S

π = (2 + γ) max |e vn−1 (s)|

(46.164)

s∈S

Therefore, condition (46.135) is satisfied. Next, note from (46.160) and (46.163) that

  var en (s, sn , sn+1 ) Xn   2 ≤E en (s, sn , sn+1 ) | Xn   2 (46.160) π π π r(n) + γv π (sn+1 ) − vn−1 (sn ) − γe vn−1 (sn+1 ) − ven−1 (s) | Xn = E   (a) = E r(n) −v π (sn ) + v π (sn ) +  {z } | =0  2 π π π γv π (sn+1 ) − vn−1 (sn ) − γe vn−1 (sn+1 ) − ven−1 (s) | Xn (46.35)

=

E

nh

i r(n) − E r(n) − γE v π (sn+1 ) + v π (sn )+

π π π γv π (sn+1 ) − vn−1 (sn ) − γe vn−1 (sn+1 ) − ven−1 (s) (46.35)

=

2

| Xn

i h i r(n) − E r(n) + γ v π (sn+1 ) − E v π (sn+1 ) +  2 π π π ven−1 (sn ) − γe vn−1 (sn+1 ) − ven−1 (s) | Xn

E



nh

(46.165)

1966

Temporal Difference Learning

where in step (a) we added and subtracted v π (sn ). Applying the Jensen inequality (46.148) we have   var en (s, sn , sn+1 ) | Xn ≤     = 3 var r(n)|sn = s + 3γ 2 var v π (sn+1 )|sn = s +   2 π π π 3E ven−1 (sn ) − γe vn−1 (sn+1 ) − ven−1 (s) | Xn     ≤ 3 var r(n)|sn = s + 3γ 2 E (v π (sn+1 ))2 |sn = s +   2 2 2 π π π vn−1 (s) | Xn vn−1 (sn+1 ) + 3 e 3 E 3 e vn−1 (sn ) + 3γ 2 e (46.150)

2    π 3 var r(n)|sn = s + 9γ 2 + 18 max e vn−1 (s) + 3γ 2 V 2 s∈S 2 π ≤ 3R2 + (9γ 2 + 18) max e vn−1 (s) + 3γ 2 V 2 s∈S 2   π ≤ c 1 + max e (46.166) vn−1 (s) ≤

s∈S

where we introduced o n ∆ c = max 3R2 + 3γ 2 V 2 , 12γ 2 + 18

(46.167)

In other words, we established that  2    π var en (s, sn , sn+1 )| Xn ≤ c 1 + max e vn−1 (s) s∈S

(46.168)

Therefore, condition (46.136) is satisfied. We conclude from (46.133) that vnπ (s) converges to v π (s) almost surely. This conclusion is valid under condition (46.134), i.e., we must have ∞ X n=0

µn (s)tn (s) = ∞,

∞ X n=0

µ2n (s)t2n (s) < ∞,

0 ≤ µn (s)tn (s) < 1

(46.169)

Sufficient conditions for these requirements to hold are as follows. It is easy to see from (46.157) that tn (s) ≤

n X

(λγ)n−m

m=0

= 1 + (λγ) + (λγ)2 + . . . + (λγ)n ∞ X ≤ (λγ)m m=0

=

1 1 − λγ

(46.170)

while tn (s) ≥ I[sn = s]

(46.171)

The last two conditions in (46.169) can then be easily satisfied in view of (46.43), while the first condition in (46.169) is satisfied if each state s is visited infinitely often.

46.D Equivalence of Offline Implementations

46.D

1967

EQUIVALENCE OF OFFLINE IMPLEMENTATIONS In this appendix we establish the equivalence between the offline forward and backward iterations (46.89) and (46.90a)–(46.90b) by showing that equality (46.91) holds. Indeed, repeating the same argument that led to (46.76) we now obtain   π,f π,f π,f Unλ (s) − vk−1 (s) = (λγ)0 r(n) + γvk−1 (sn+1 ) − vk−1 (s) +   π,f π,f (sn+1 ) + (sn+2 ) − vk−1 (λγ)1 r(n + 1) + γvk−1   π,f π,f (sn+2 ) + . . . (sn+3 ) − vk−1 (λγ)2 r(n + 2) + γvk−1 (46.172) and, hence, ∆

Xn

= (46.66c)

=



 π,f (s) I[sn = s] Unλ (s) − vk−1 ! N −n−1 X ` f (λγ) δk (sn+` ) I[sn = s]

(46.173)

`=0

where π,f π,f (sn+` ) (sn+`+1 ) − vk−1 δkf (sn+` ) = r(n + `) + γvk−1

(46.174)

and the upper limit for the summation index in (46.173) is N − n − 1. This is because the corresponding TD error δn (sN −1 ) would involve the reward value, r(N − 1), in the N -long episode (since r(m) = 0 for m ≥ N ). Consequently, N −1  X n=0

N −1 N −n−1  X X π,f (s) I[sn = s] = Unλ (s) − vk−1 (λγ)m δkf (sn+m )I[sn = s] n=0

(46.175)

m=0

At the same time, using (46.86) and (46.90a), we have ∆

Yn = tn (s)δkb (sn ) ! n X b n−m = δk (sn ) (λγ) I[sm = s]

(46.176)

m=0

so that N −1 X

tn (s)δnb (sn ) =

n=0

N −1 X n=0

=

N −1 X

(λγ)n−m I[sm = s]

m=0

δkb (sm )

m=0

=

n X

δkb (sn )

N −1 X m X

m X

(λγ)m−n I[sn = s], n ← m, m ← n

n=0

(λγ)m−n δkb (sm )I[sn = s]

(46.177)

m=0 n=0

On the right-hand side we have two sums: one sum is running over the (row) index m = 0 : N − 1 while the second sum is running over the (column) index n = 0 : m. If we envision a square N × N matrix with (n, m)th entry given by (λγ)m−n δkb (sm )I[sn = s], then for each row of index m, we are adding the entries up to column m – see Fig. 46.14.

1968

Temporal Difference Learning

On the right-hand side of expression (46.177), for each row of index m, we are adding the entries up to column m. We can arrive at the same value for the sum of all entries in the matrix by adding, for each column n, all entries from row m onwards.

Figure 46.14

We can arrive at the same value for the sum of all entries in the matrix by adding, for each column n, all entries from row m onwards. That is, it holds that N −1 X

tn (s)δnb (sn ) =

n=0

N −1 N −1 X X

(λγ)m−n δkb (sm )I[sn = s]

n=0 m=n

=

N −1 N −n−1 X X n=0

(λγ)m δkb (m + n)I[sn = s]

(46.178)

m=0

which is the same result as in (46.175) once we establish that the TD errors that are generated by the forward and backward recursions coincide with each other. This can be easily established by induction. Assume we have a total of K episodes running over k = 1, 2, . . . , K. The state value estimates for iteration k = 0 for both the forward and backward recursion are initialized to π,f π,b v−1 (s) = v−1 = 0,

s∈S

(46.179)

Then, it follows from expressions (46.174) and (46.90a) for the TD errors that these errors will coincide after the first iteration, i.e., δ0f (s) = δ0b (s),

for all s

(46.180)

and, therefore, from the just derived expressions (46.178) and (46.175), the state value iterates will also coincide after the first iteration v1π,f (s) = v1π,b (s),

for all s

(46.181)

This argument can be continued for all successive iterates to conclude that δkf (s) = δkb (s) and vkπ,f (s) = vkπ,b (s) for all k and all states.

References

1969

REFERENCES Andreae, J. H. (1969), “Learning machines: A unified view,” in Encyclopedia of Information, Linguistics, and Control, A. R. Meetham and R. A. Hudson, editors, pp. 261–270, Pergamon. Barto, A. G., S. J. Bradtke, and S. P. Singh (1995), “Learning to act using real-time dynamic programming,” Artif. Intell., vol. 72, pp. 81–138. Barto, A. G. and M. Duff (1994), “Monte Carlo matrix inversion and reinforcement learning,” Proc. Advances Neural Information Processing Systems (NIPS), pp. 687– 694, Denver, CO. Bertsekas, D. P. (2007), Dynamic Programming and Optimal Control, 4th ed., 2 vols, Athena Scientific. Bertsekas, D. P. and J. N. Tsitsiklis (1996), Neuro-Dynamic Programming, Athena Scientific. Cao, X. R. (2007), Stochastic Learning and Optimization: A Sensitivity-Based Approach, Springer. Dayan, P. (1992), “The convergence of TD(λ) for general λ,” Mach. Learn., vol. 8, pp. 341–362. Dayan, P. and T. Sejnowski (1994), “TD(λ) converges with probability 1,” Mach. Learn., vol. 14, pp. 295–301. Dvoretzky, A. (1956), “On stochastic approximation,” Proc. 3rd Berkeley Symp. Math. Statist. Probab., vol. 1, pp. 39–56, Berkeley, CA. Geramifard, A., T. J. Walsh, S. Tellex, G. Chowdhary, N. Roy, and J. P. How (2013), “A tutorial on linear function approximators for dynamic programming and reinforcement learning,” Found. Trends Mach. Learn., vol. 6, no. 4, pp. 375–454. Gosavi, A. (2009), “Reinforcement learning: A tutorial survey and recent advances,” INFORMS J. Comput., vol. 21, no. 2, pp. 178–192. Gray, P. O. (2010), Psychology, 6th ed., Worth. Grondman, I., L. Busoniu, G. A. D. Lopes, and R. Babuska (2012), “A survey of actor– critic reinforcement learning: Standard and natural policy gradients,” IEEE Trans. Syst. Man Cyber. Part C, vol. 42, no. 6, pp. 1291–1307. Holland, J. H. (1975), Adaptation in Natural and Artificial Systems, University of Michigan Press. Jaakkola, T., M. I. Jordan, and S. P. Singh (1994), “On the convergence of stochastic iterative dynamic programming algorithms,” Neural Comput., vol. 6, no. 6, pp. 1185– 1201. Kaelbling, L. P., M. L. Littman, and A. W. Moore (1996), “Reinforcement learning: A survey,” J. Artif. Intell. Res., vol. 4, pp. 237–285. Klopf, A. H. (1972), “Brain function and adaptive systems: A heterostatic theory,” Technical Report AFCRL-72-0164, Air Force Cambridge Research Laboratories. Michie, D. and R. A. Chambers (1968), “BOXES: An experiment in adaptive control,” in Machine Intelligence, E. Dale and D. Michie, editors, pp. 137–152, Oliver and Boyd. Minsky, M. L. (1961), “Steps toward artificial intelligence,” Proc. Inst. Radio Engineers, vol. 49, pp. 8–30. Reprinted in Computers and Thought, E. A. Feigenbaum and J. Feldman, editors, McGraw-Hill, 1963. Samuel, A. L. (1959), “Some studies in machine learning using the game of checkers,” IBM J. Res. Develop., vol. 3, pp. 210–229. Reprinted in Computers and Thought, E. A. Feigenbaum and J. Feldman, editors, McGraw-Hill, 1963. Silver, D., A. Huang, C. J. Maddison, et al. (2016), “Mastering the game of GO with deep neural networks and tree search,” Nature, vol. 529, pp. 484–489. Singh, S., T. Jaakkola, M. L. Littman, and C. Szepesvari (2000), “Convergence results for single-step on-policy reinforcement-learning algorithms,” Mach. Learn., vol. 39, pp. 287–308.

1970

Temporal Difference Learning

Singh, S. and R. S. Sutton (1996), “Reinforcement learning with replacing eligibility traces,” Mach. Learn., vol. 22, pp. 123–158. Sutton, R. S. (1988), “Learning to predict by the method of temporal differences,” Mach. Learn., vol. 3, no. 1, pp. 9–44. Sutton, R. S. and A. G. Barto (1998), Reinforcement Learning: An Introduction, A Bradford Book. Szepesvari, C. (2010), Algorithms for Reinforcement Learning, Morgan and Claypool Publishers. Tesauro, G. (1995), “Temporal difference learning and TD-Gammon,” Commun. ACM, vol. 38, no. 3, pp. 58–68. Thorndike, E. (1932), The Fundamentals of Learning, AMS Press. Tsitsiklis, J. N. (1994), “Asynchronous stochastic approximation and Q-learning,” Mach. Learn., vol. 16, pp. 185–202. van Seijen, H. and R. S. Sutton (2014), “True online TD(λ),” Proc. Int. Conf. Machine Learning (ICML), pp. 692–700, Beijing. Watkins, C. J. (1989), Learning from Delayed Rewards, Ph.D. dissertation, Cambridge University, UK. Watkins, C. J. and P. Dayan (1992), “Q-learning,” Mach. Learn., vol. 8, pp. 279–292. Werbos, P. J. (1977), “Advanced forecasting methods for global crisis warning and models of intelligence,” Gen. Syst. Yearb., vol. 22, pp. 25–38. Widrow, B., N. K. Gupta, and S. Maitra (1973), “Punish/reward: Learning with a critic in adaptive threshold systems,” IEEE Trans. Syst., Man Cybern., vol. 3, no. 5, pp. 455–465. Witten, I. H. (1977), “An adaptive optimal controller for discrete-time Markov environments,” Inf. Control, vol. 34, pp. 286–295.

47 Q-Learning

The temporal learning algorithms TD(0) and TD(λ) of the previous chapter are useful procedures for state value evaluation; i.e., they permit the estimation of the state value function v π (s) for a given target policy π(a|s) by observing actions and rewards arising from this policy (on-policy learning) or another behavior policy (off-policy learning). In most situations, however, we are not interested in state values but rather in determining optimal policies, denoted by π ? (a|s) (i.e., in selecting what optimal actions an agent should follow in a Markov decision process (MDP)). For that purpose, we will need to estimate the state–action values, q π (s, a), rather than the state values v π (s). This is because, once the state–action values are available, they can be maximized over the action space to determine an optimal policy. In this chapter, we describe two families of algorithms known as SARSA and Q-learning. The temporal difference and SARSA algorithms are examples of passive learning methods because they observe rewards and take actions according to some already given target policy, π(a|s). The Q-learning procedure, on the other hand, is an example of an active learning method where the agent takes actions according to some exploratory policies before learning the optimal policy.

47.1

SARSA(0) ALGORITHM We consider first the class of SARSA algorithms, where the letters in the acronym stand for “State, Action, Reward, State, and Action.” The SARSA methods extend the temporal difference (TD) construction to the estimation of the state– action function, q π (s, a), directly. The algorithms are on-policy procedures, meaning that they learn the state–action values of a target policy, π(a|s), by experiencing actions and rewards from this same policy. In a later section, we will describe an alternative off-policy technique, known as Q-learning, which focuses instead on estimating the optimal state–action value, q ? (s, a). In that case, the algorithm will be able to learn q ? (s, a) by taking actions and observing rewards from policies that do not match the optimal policy, π ? (a|s). SARSA algorithms can be motivated in much the same way as TD(0) and TD(λ); the similarities are many and, therefore, we will be brief. We discuss first the SARSA(0) algorithm. We are given a policy π(a|s) and wish to estimate its state–action value function, q π (s, a). We know from (44.70) that q π (s, a) satisfies the Poisson equation:

1972

Q-Learning

  q π (s, a) = E π,P r(s, a, s0 ) + γq π (s0 , a0 ) | s = s, a = a

(47.1)

π q π (s0 , a0 ) ≈ qn−1 (sn+1 , an+1 )

(47.2)

This expression characterizes q π (s, a) as an expectation, except that the state– action value function appears on both sides of the equation: q π (s, a) on the left and q π (s0 , a0 ) on the right. As was the case with the motivation for (46.38), we can employ the following sample approximation:

π in terms of the iterate qn−1 (·, ·) for the state–action value function at iteration n − 1, and in terms of the future state and action values, sn+1 and an+1 . In this way, we end up with the recursion:

  π π π qnπ (s, a) = qn−1 (s, a) + µn (s, a) r(n) + γqn−1 (s0 , a0 ) − qn−1 (s, a) , n ≥ 0

(47.3) in terms of a step-size sequence µn (s, a) that is now dependent on the state– action pair (s, a). We continue to use the notation ∆

r(n) = r(sn , an , sn+1 ),

an = π(a|sn )

(47.4)

The target or sample signal that is used by the algorithm is the quantity r(s, a, s0 )+ π π γqn−1 (s0 , a0 ). If s0 happens to be the terminal state, we set qn−1 (s0 , a0 ) = 0 so that, in effect, we are employing the following sample approximation for q π (s, a) from (47.1):  r(s, a, s0 ), if s0 is terminal q π (s, a) ≈ (47.5) 0 π 0 0 r(s, a, s ) + γqn−1 (s , a ), if s0 is non-terminal We comment further on the step-size sequence, µn (s, a). First, it is important to note that the above recursion is applied at iteration n only if the transition at this time happens to originate from state sn = s under action an = a. For all other state–action pairs, (x, b) 6= (s, a), their state–action values remain invariant during this iteration. That is, when (47.3) is applied at iteration n, it is implicitly assumed that π qnπ (x, b) = qn−1 (x, b), for (x, b) ∈ S × A\{(s, a)}

(47.6)

If desired, we can capture this situation in a single expression by writing π qnπ (s, a) = qn−1 (s, a) +

(47.7)   π π µn (s, a)I[sn = s, an = a] r(n) + γqn−1 (s0 , a0 ) − qn−1 (s, a)

where we added the indicator function for the state–action pair (s, a). Now note that if update (47.3) were to occur at every time instant n, then we would simply select µn (s, a) in the form µn (s, a) =

τ , (n + 1)c

τ ∈ (0, 1),

c ∈ (1/2, 1]

(47.8)

47.1 SARSA(0) Algorithm

1973

Here, the denominator of µn (s, a) would be tied to the iteration index n. This choice satisfies

0 < µn (s, a) < 1,

lim µn (s, a) = 0,

n→∞

∞ X

n=0

µn (s, a) = ∞

(47.9)

which are the conditions required by Prob. 16.9. However, the full update in (47.3) will only occur occasionally, and only when the pair (s, a) is observed. In that case, the denominator in the step-size sequence µn (s, a) should not be (n + 1)c , but should instead be defined in terms of the number of actual updates that involved the state–action pair (s, a) until time n. If we let



n(s, a) = number of times (s, a) has been observed until time n

(47.10)

then the step-size sequence µn (s, a) should be set instead as follows:

µn (s, a) =

τ , (n(s, a) + 1)c

τ ∈ (0, 1),

c ∈ (1/2, 1]

(47.11)

As explained before, we will rarely employ the indicator function explicitly in our presentation, except when necessary for added clarity. We will often rely instead on the more compact representation (47.3) with the understanding that the recursion updates the state–action value for the pair (s, a) alone, which is the state–action pair occurring at time n, while all other state–action values remain unchanged. And, moreover, the step-size value is defined according to (47.11), the value c = 1 is common along with any τ ∈ (0, 1). Recursion (47.3) leads to the SARSA(0) algorithm. It is clear from this implementation that state–action pairs (s0 , a0 ) that are more likely to occur will contribute more heavily to the update of the state–action values. Observe that π the rightmost term in (47.3) is a difference between two terms: r(n)+γqn−1 (sn+1 , π an+1 ) and qn−1 (sn , an ). For compactness of notation, we denote this term by (it plays a role similar to δn (sn ) from (46.48) in TD learning): ∆

π π βn (sn , an ) = r(n) + γqn−1 (sn+1 , an+1 ) − qn−1 (sn , an )

= βn (s, a)

(47.12)

The resulting algorithm is listed in (47.13). At each time n, the agent acts according to policy π(a|s), experiences the reward r(s, a, s0 ), and updates the relevant state–action value.

1974

Q-Learning

SARSA(0) algorithm for on-policy state–action function evaluation. given a (deterministic or stochastic) policy π(a|s); π initial state–action values q−1 (s, a) = arbitrary, s ∈ S, a ∈ A. repeat over episodes: for each episode, let s0 denote its initial state repeat over episode for n ≥ 0: s = sn a ∼ π(a|s) s0 = sn+1 r(n) = r(s, a, s0 ) a0 ∼ π(a|s0 ) π π βn (s, a) = r(n) + γqn−1 (s0 , a0 ) − qn−1 (s, a) π π qn (s, a) = qn−1 (s, a) + µn (s, a)βn (s, a) end π q−1 (s, a) ← qnπ (s, a), ∀ (s, a) ∈ S × A in preparation for next episode end (47.13) We examine in Appendix 47.A the convergence behavior of the SARSA(0) algorithm (47.13). During the convergence analysis we assume, as already indicated in the previous chapter, that the iteration index n in recursions such as (47.13) increases continually even across episodes. In this way, when qnπ (s, a) is reached π at the end of one episode, this value is not set to q−1 (s, a) for the next episode. π π Instead, the iterations for the next episode continue to qn+1 (s, a), qn+2 (s, a), and so forth. In Appendix 47.A we establish that if every state–action pair (s, a) is visited infinitely often, and assuming a bounded reward function r(s, a, s0 ), then the iterated state–action values, qnπ (s, a), converge to the true state–action values q π (s, a) almost surely for discount values γ ∈ [0, 1), namely, lim qn (s, a) = q π (s, a), a.s.

n→∞

(47.14)

Example 47.1 (Expected SARSA) In listing (47.13) for the SARSA(0) algorithm, the TD term βn (s, a) depends on the future action a0 , which is chosen stochastically according to a0 ∼ π(a|s0 ). The randomness in the selection of a0 at each iteration can affect the performance of the algorithm in that it can enlarge the variance of the state– action value estimates qnπ (s, a). This effect can in turn slow down convergence. One way to reduce the variance effect is to introduce averaging over all possible choices for the action a0 , such as redefining the TD factor as follows (see Prob. 47.5): βn (s, a) = r(n) + γ

X a0 ∈A

π π π(a0 |s0 )qn−1 (s0 , a0 ) − qn−1 (s, a)

(47.15)

47.2 Look-Ahead SARSA Algorithm

1975

The resulting algorithm is known as expected SARSA and is listed in (47.16). Expected SARSA(0) for on-policy state–action function evaluation. given a (deterministic or stochastic) policy π(a|s); π initial state–action values q−1 (s, a) = arbitrary, s ∈ S, a ∈ A. repeat over episodes: for each episode, let s0 denote its initial state repeat over episode for n ≥ 0: s = sn a ∼ π(a|s) s0 = sn+1 r(n) = r(s, a, s0 ) X π π βn (s, a) = r(n) + γ π(a0 |s0 )qn−1 (s0 , a0 ) − qn−1 (s, a)

(47.16)

a0 ∈A

π qnπ (s, a) = qn−1 (s, a) + µn (s, a)βn (s, a) end π q−1 (s, a) ← qnπ (s, a), ∀ (s, a) ∈ S × A in preparation for next episode end

We can similarly establish the convergence of this variant of SARSA(0) by applying the same proof technique from Appendix 47.A; this analysis is left to Prob. 47.3.

47.2

LOOK-AHEAD SARSA ALGORITHM The SARSA(0) algorithm estimates q π (s, a) by generating samples for the random variable r(s, a, s0 ) + γq π (s0 , a0 ) in the form: ∆

π 0 0 Q(1) n (s, a) = r(n) + γqn−1 (s , a )

(47.17)

(1)

where we are introducing the notation Qn (s, a) for this sample. The superscript refers to the fact that this estimate is computed by adding the one-step reward, r(n), over the immediate transition from s to s0 , to an estimate for the state–action value from state s0 onward. The reason for introducing the superscript notation is because we can also consider employing higher-order approximations for r(s, a, s0 ) + γq π (s0 , a0 ) by combining rewards over several steps into the future, as was the case for the look-ahead TD algorithm. Since the argument is similar to what was carried out for TD learning, we proceed directly to list the resulting look-ahead SARSA algorithm where the agent employs a P th or(P ) der approximation, Qn (s, a), to generate samples for q π (s, a). Specifically, the SARSA(0) recursion (47.3) is replaced by (1)

  π ) π qnπ (s, a) = qn−1 (s, a) + µn (s, a) Q(P (s, a) − q (s, a) , n≥0 n n−1

(47.18)

1976

Q-Learning

where, by definition,



) P π Q(P n (s, a) = γ qn−1 (sn+P , an+P ) +

P −1 X

γ m r(n + m)

(47.19a)

m=0

  r(n + m) = r sn+m , a ∼ π(a|s = sn+m ), sn+m+1

(47.19b)

under the understanding that sn = s. This implementation relies on using the rewards for P transitions into the future. It can be verified (see Prob. 47.1) that these successive estimates satisfy the following recursion: (p−1) Q(p) (s, a) + γ p−1 βn (sn+p−1 ), p = 1, 2, . . . , P n (s, a) = Qn

Q(0) n (s)

=

π qn−1 (sn , an )

(boundary condition)

(47.20a) (47.20b)



π π βn (sn+p ) = r(n + p) + γqn−1 (sn+p+1 , an+p+1 ) − qn−1 (sn+p , an+p )

(47.20c)

The resulting look-ahead SARSA algorithm is listed in (47.21). The case P = 1 corresponds to SARSA(0). By following an argument similar to the proof of result (47.14), it can be verified that the same convergence result holds for the look-ahead SARSA algorithm – see Prob. 47.2.

Look-ahead SARSA algorithm for on-policy state–action evaluation. given a (deterministic or stochastic) policy π(a|s); π (s, a) = arbitrary, s ∈ S, a ∈ A. initial state–action values q−1 repeat over episodes: for each episode, let s0 denote its initial state repeat over episode for n ≥ 0: s = sn r(n + m) = r(sn+m , a ∼ π(a|sn+m ), sn+m+1 ), m = 0, . . . , P − 1 P −1 X (P ) π Qn (s, a) = γ P qn−1 (sn+P , an+P ) + γ m r(n + m) m=0   (P ) π π qnπ (s, a) = qn−1 (s, a) + µn (s, a) Qn (s, a) − qn−1 (s, a)

end π (s, a) ← qnπ (s, a), ∀ (s, a) ∈ S × A in preparation for next episode q−1 end (47.21)

47.3 SARSA(λ) Algorithm

47.3

1977

SARSA(λ) ALGORITHM One of the inconveniences of the look-ahead SARSA implementation (47.21) is that, for each iteration n, the agent needs to examine P transitions into the future and use their rewards to carry out the estimation process for the state–action value function. In a manner similar to the derivation of TD(λ), we can motivate a more effective backward SARSA(λ) algorithm to resolve this inconvenience. (p) Assume the pth order estimates Qn (s, a) are available for the state–action pair (s, a) for p = 1, 2, . . .. Then, similar to (46.67), we can generate a newer estimate for q π (s, a) by combining these values in a weighted manner, say, as follows: ∆

Qλn (s, a) = (1 − λ) (P )

∞ X

λp−1 Qn(p) (s, a)

(47.22)

p=1

(λ)

where λ ∈ (0, 1]. Replacing Qn (s, a) by Qn (s, a), we arrive at the forward view for the SARSA (λ) algorithm:   π π qnπ (s, a) = qn−1 (s, a) + µn (s, a) Qλn (s, a) − qn−1 (s, a) , n ≥ 0 (47.23) where the update is performed only when the state–action pair (s, a) occurs at time n. Again, by following an argument similar to the proof of result (47.14), it can be verified that the same convergence result holds for (47.23) – see Prob. 47.4. As was the case with the TD(λ) implementation, the forward nature of the above recursion is not practical since evaluation of Qλn (s, a) requires that we (p) compute all pth level approximations, Qn (s, a), for all p ≥ 1. This in turn requires access to full episodes. By repeating the arguments of Example 46.7, we can similarly motivate a backward view for SARSA(λ) by using eligibility traces and arrive at the following recursion (see Prob. 47.6): π qnπ (s, a) = qn−1 (s, a) + µn (s, a)tn (s, a)βn (sn ), n ≥ 0

(47.24)

where the eligibility traces are now dependent on both s and a and are generated recursively via: tn (s, a) = λγtn−1 (s, a) + I[sn = s, an = a]

(47.25)

starting with t−1 (s, a) = 0. We refer to tn (s, a) as the accumulating eligibility trace for the state–action pair (s, a). By iterating (47.25), we get tn (s, a) =

n X

(λγ)n−m I[sm = s, am = a]

(47.26)

m=0

These traces are now incorporated into the SARSA(0) algorithm, which is modified to the listing shown in (47.28) where the update for qnπ (x, b) includes an additional multiplication by tn (x, b) for any state x and action b. Although this

1978

Q-Learning

backward view is not fully equivalent to the forward implementation for online operation, we show in Prob. 47.7 that the offline implementations continue to be equivalent. Furthermore, the same argument used to establish the convergence property (46.88) for TD(λ) can be repeated here to establish that, if every state–action pair (s, a) is visited infinitely often, and assuming a bounded reward function r(s, a, s0 ), then the iterated state–action values, qnπ (s, a), converge to the true state–action values q π (s, a) almost surely for discount values γ ∈ [0, 1) (see Prob. 47.8): lim qnπ (s, a) = q π (s, a), a.s.

n→∞

(47.27)

SARSA(λ) algorithm for on-policy state–action evaluation. given a (deterministic or stochastic) policy π(a|s); π initial state–action values : q−1 (s, a) = arbitrary, s ∈ S, a ∈ A; initial trace values, t−1 (s, a) = 0, s ∈ S, a ∈ A. repeat over episodes: for each episode, let s0 denote its initial state repeat over episode for n ≥ 0: s = sn a ∼ π(a|s) s0 = sn+1 r(n) = r(s, a, s0 ) a0 ∼ π(a|s0 ) π π βn (s, a) = r(n) + γqn−1 (s0 , a0 ) − qn−1 (s, a) tn (x, b) = λγtn−1 (x, b) + I[s = x, a = b], x ∈ S, b ∈ A π qnπ (x, b) = qn−1 (x, b) + µn (x, b)tn (x, b)βn (s, a), x ∈ S, b ∈ A end π (s, a) ← qnπ (s, a) and t−1 (s, a) ← tn (s, b), ∀ (s, a) ∈ S × A q−1 in preparation for next episode end

(47.28)

There are three crucial differences in relation to SARSA(0): (1) First, observe that, at every iteration n, new variables in the form of the eligibility traces tn (x, b) are updated for all states and actions, x ∈ S, b ∈ A.

(2) Second, and contrary to SARSA(0) where only qnπ (s, a) is updated at iteration n for the state–action pair that occurs at that point in time, we now update all state–action pairs for all states x ∈ S and actions b ∈ A using the same βn (s, a) computed for the state–action pair (sn , an ). In other words, all state–action values are adjusted (the one corresponding to the observed state–action pair (sn , an ) and the ones corresponding to all other states as well).

47.4 Off-Policy Learning

1979

(3) Third, since the updates for qnπ (x, b) now occur continuously, for every n, the step-size parameter µn (x, b) will evolve according to µn (x, b) = c/(n + 1)β , with the iteration index n in the denominator. In contrast, for SARSA(0), the update for qnπ (s, a) only occurs whenever state–action pair (s, a) is observed and, as explained earlier, the step size evolved instead according to µn (s, a) = τ /(n(s, a) + 1)c with n replaced by the number of observations n(s, a) in the denominator.

47.4

OFF-POLICY LEARNING The SARSA algorithms derived so far for state–action function evaluation are examples of on-policy strategies where the agent takes actions according to a given target policy π(a|s) and observes the corresponding rewards. Following the same arguments from Section 46.7 we can motivate off-policy implementations, where the agent takes actions and observes rewards according to some other behavior policy, φ(a|s). Listing (47.13) shows the resulting off-policy implementation for SARSA(0) where the main difference in relation to the earlier on-policy SARSA(0) algorithm (47.13) is the addition of the importance sampling weights, ξn (sn , an ). By setting these weights to the value 1, the algorithm reduces to an on-policy implementation.

SARSA(0) algorithm for off-policy state–action function evaluation. target policy π(a|s) and behavior policy φ(a|s); π (s, a) = arbitrary, s ∈ S, a ∈ A. initial state–action values q−1 repeat over episodes: for each episode, let s0 denote its initial state repeat over episode for n ≥ 0: s = sn a ∼ π(a|s) s0 = sn+1 r(n) = r(s, a, s0 ) a0 ∼ φ(a|s0 ) ξn (s, a) = π(a|s)/φ(a|s)   π π βn (s, a) = ξn (s, a) r(n) + γqn−1 (s0 , a0 ) − qn−1 (s, a)

π qnπ (s, a) = qn−1 (s, a) + µn (s, a)βn (s, a) end π q−1 (s, a) ← qnπ (s, a), ∀ (s, a) ∈ S × A in preparation for next episode end (47.29)

1980

Q-Learning

47.5

OPTIMAL POLICY EXTRACTION The descriptions (47.13) and (47.28) for SARSA(0) and SARSA(λ) provide procedures for evaluating the state–action function q π (s, a) for a given policy π(a|s). We can modify the statements of both algorithms to learn an optimal policy by including a greedy selection of actions at each step. The modification for SARSA(0) is listed in (47.30), where we are now denoting the successive iterates for the state–action value function by qn (s, a), without the superscript π . While the earlier implementation (47.13) followed actions dictated by the given policy π(a|s) and estimated its state–action value function, q π (s, a), the algorithm now follows a greedy policy selection and chooses a0 , for each step, through the maximization of qn−1 (s0 , a) over the action set that is permissible at state s0 .

SARSA(0) algorithm for optimal policy extraction. initial state–action values q−1 (s, a) = arbitrary, s ∈ S, a ∈ A. repeat over episodes: for each episode, let (s0 , a0 ) denote its initial state–action pair repeat over episode for n ≥ 0: s0 = sn+1 r(n) = r(s, a, s0 ) a0 = argmax qn−1 (s0 , a) a∈A

βn (s, a) = r(n) + γqn−1 (s0 , a0 ) − qn−1 (s, a) qn (s, a) = qn−1 (s, a) + µn (s, a)βn (s, a) s ← s0 a ← a0 end q−1 (s, a) ← qn (s, a), ∀ (s, a) ∈ S × A in preparation for next episode end (47.30) If we examine the expression for βn (s, a) in listing (47.30) we find that, in view of the greedy construction for action a0 , it can be rewritten as n o βn (s, a) = r(n) + γ max qn−1 (s0 , b) − qn−1 (s, a) (47.31) b∈A

which, as we are going to see from the discussion in Section 47.6, has the same form as the correction term used to update qn−1 (s, a) to qn (s, a) in the Q-learning algorithm (47.45). Therefore, the same convergence conclusion obtained there for Q-learning will hold here and the iterates qn (s, a) generated by the modified SARSA(0) algorithm (47.30) will converge to the optimal state–action function q ? (s, a) almost surely for discount values γ ∈ [0, 1), namely,

47.5 Optimal Policy Extraction

lim qn (s, a) = q ? (s, a), a.s.

n→∞

1981

(47.32)

which in turn implies that we can recover optimal deterministic policies from (45.6)

π ? (a|s) := argmax q ? (s, a)

(47.33)

a∈A

Recall that the notation := means that an optimal action ao is first selected by maximizing q ? (s, a) over a ∈ A and subsequently used to construct a deterministic policy π ? (a|s) that associates action ao with state s – recall construction (45.7). We say that implementation (47.30) performs active (rather than passive) learning because the agent decides on what action to take at each step, observes the resulting reward, and then decides on what action to take at the next step, and so forth. In this approach, the agent guides the exploration of the state space. The Q-learning algorithm described in the next section will have a similar structure. We can similarly transform listing (47.28) for SARSA(λ) into one that enables extraction of an optimal policy, as shown in (47.34).

SARSA(λ) algorithm for optimal policy extraction. initial state–action values q−1 (s, a) = arbitrary, s ∈ S, a ∈ A; initial trace values, t−1 (s, a) = 0, s ∈ S, a ∈ A. repeat over episodes: for each episode, let (s0 , a0 ) denote its initial state–action pair repeat over episode for n ≥ 0: s0 = sn+1 r(n) = r(s, a, s0 ) a0 = argmax qn−1 (s0 , a) a∈A

0

0

βn (s, a) = r(n) + γqn−1 (s , a ) − qn−1 (s, a) tn (x, b) = λγtn−1 (x, b) + I[s = x, a = b], x ∈ S, b ∈ A qn (x, b) = qn−1 (x, b) + µn (x, b)tn (x, b)βn (s, a), x ∈ S, b ∈ A s ← s0 a ← a0 end q−1 (s, a) ← qn (s, a) and t−1 (s, a) ← tn (s, a), ∀ (s, a) ∈ S × A in preparation for next episode end

(47.34)

1982

Q-Learning

47.6

Q-LEARNING ALGORITHM The temporal difference and SARSA learning algorithms for state and state– action function evaluations are examples of passive learning procedures. They estimate these functions by observing rewards and taking actions according to some already given target policy. We now introduce an active learning procedure, known as Q-learning, which enables agents to estimate directly the optimal state– action value function, q ? (s, a). They also enable agents to determine the optimal policy, π ? (a|s), in a manner similar to the SARSA variants (47.30) and (47.34) for optimal policy extraction. The Q-learning algorithm is an off-policy procedure because, although the agent is interested in learning an optimal (or greedy) policy π ? (a|s), it will nevertheless take actions according to some exploratory policies until it ultimately converges to the optimal policy. We know from Bellman optimality condition (45.15b) that the optimal state– action value satisfies:   ? 0 0 q ? (s, a) = E P r(s, a, s0 ) + γ max q (s , a ) | s = s, a = a (47.35) 0 a ∈A

?

This expression characterizes q (s, a) as an expectation except that the state– action value function appears on both sides of the equation: q ? (s, a) on the left and q ? (s0 , a0 ) on the right. One useful feature of the above relation for the derivation of off-policy algorithms is that the expression is independent of the policy. It relates q-values with each other to form a relation that involves only the desired quantity (i.e., the optimal q-function). Efficient algorithms can be derived by exploiting this observation. Thus, following an argument similar to the derivation that led to (46.39), we generate sample realizations for the quantity under expectation on the righthand side by using the iterate qn−1 (·, ·) that is available for q ? (s, a) at iteration n − 1 to arrive at the following recursion for n ≥ 0: 

0

0

qn (s, a) = qn−1 (s, a) + µn (s, a) r(n) + γ max qn−1 (s , a ) − qn−1 (s, a) 0 a ∈A



(47.36) with a step-size sequence µn (s, a), now dependent on the state–action pair (s, a). Moreover, the symbols {s, a, s0 } denote the states and action {sn , an , sn+1 }, while r(n) denotes the reward in moving from state sn to state sn+1 under action an : ∆

r(n) = r(s, a, s0 ) = r(sn , an , sn+1 )

(47.37)

π If s0 happens to be the terminal state, we set qn−1 (s0 , a0 ) = 0 so that, in effect, we are employing the following sample approximation for q ? (s, a) from (47.35): ( r(s, a, s0 ), if s0 is terminal q ? (s, a) ≈ (47.38) 0 0 0 r(s, a, s ) + γ max qn−1 (s , a ), if s0 is nonterminal 0 a ∈A

47.6 Q-Learning Algorithm

1983

The maximization in (47.38) is over the set of actions a0 that are permissible at state s0 . Again, recursion (47.36) is applied at iteration n only if the transition at this time happens to originate from state sn = s under action an = a. For all other state–action pairs, (x, b) 6= (s, a), their state–action values remain invariant during this iteration. That is, when (47.36) is applied at iteration n, it is implicitly assumed that qn (x, b) = qn−1 (x, b), for (x, b) ∈ S × A\{(s, a)}

(47.39)

If desired, we can capture this situation in a single expression by writing qn (s, a) = qn−1 (s, a) +

  0 0 µn (s, a)I[sn = s, an = a] r(n) + γ max q (s , a ) − q (s, a) n−1 n−1 0 a ∈A

(47.40)

where we added the indicator function for the state–action pair (s, a). Now note that if update (47.36) were to occur at every time instant n, then we would simply select µn (s, a) in the form τ , τ ∈ (0, 1), c ∈ (1/2, 1] (47.41) µn (s, a) = (n + 1)c for some constants τ and c. Here, the denominator of µn (s, a) would be tied to the iteration index n. This choice satisfies ∞ X 0 < µn (s, a) < 1, lim µn (s, a) = 0, µn (s, a) = ∞ (47.42) n→∞

n=0

which are the conditions required by Prob. 16.9. However, the full update in (47.36) will only occur occasionally, and only when the pair (s, a) is observed. In that case, the denominator in the step-size sequence µn (s, a) should not be (n + 1)c but should instead be defined in terms of the number of actual full updates that involved the state–action pair (s, a) until time n. If we let ∆

n(s, a) = number of times (s, a) has been observed until time n

(47.43)

then the step-size sequence µn (s, a) should be set to µn (s, a) =

τ , (n(s, a) + 1)c

τ ∈ (0, 1),

c ∈ (0.5, 1]

(47.44)

As explained before, we will rarely employ the indicator function explicitly in our presentation, except when necessary for added clarity. We will often rely instead on the more compact representation (47.36) with the understanding that the recursion updates the state–action value for the pair (s, a) alone, which is the state–action pair occurring at time n, while all other state–action values remain unchanged. And, moreover, the step-size value is defined according to

1984

Q-Learning

(47.44); the value c = 1 is common along with any τ ∈ (0, 1). We explain in the following that the successive iterates qn (s, a) converge toward q ? (s, a) almost surely for increasing values of n. The resulting algorithm is listed in (47.45). We can initialize q−1 (s, a) with zero or arbitrary values, or with any prior information the designer may have about the problem.

Q-learning algorithm for optimal policy extraction. initial state–action values q−1 (s, a) = arbitrary, s ∈ S, a ∈ A. repeat over episodes: for each episode, let s0 denote its initial state repeat over episode for n ≥ 0: s = sn a = argmax qn−1 (s, a0 ) a0 ∈A

s0 = sn+1 r(n) = r(s, a, s0 )

  0 0 βn (s, a) = r(n) + γ max q (s , a ) − qn−1 (s, a) n−1 0 a ∈A

qn (s, a) = qn−1 (s, a) + µn (s, a)βn (s, a) end q−1 (s, a) ← qn (s, a), ∀ (s, a) ∈ S × A in preparation for next episode end (47.45) If we compare listings (47.30) and (47.45) for both SARSA(0) and Q-learning algorithms for optimal policy extraction we find, in light of (47.31), that the two methods are similar. The main difference is that the derivation of SARSA(0) starts from the Poisson equation (47.1) and uses it first to motivate a procedure for evaluating the state–action function, and subsequently incorporates a greedy search for the actions. In comparison, the derivation of Q-learning starts directly from the Bellman optimality criterion (47.35). We examine the convergence of the Q-learning algorithm (47.45) in Appendix 47.B. There we show that if every state–action pair (s, a) is visited infinitely often (which requires that we adjust the algorithm to enhance the likelihood of visiting any (s, a) pair by using some of the exploration strategies described next, such as -greedy or Boltzmann exploration), and assuming a bounded reward function r(s, a, s0 ), then the iterated state–action values, qn (s, a), converge to the optimal state–action values q ? (s, a) almost surely for discount values γ ∈ [0, 1), namely, lim

n→∞

n o qn (s, a) = q ? (s, a), a.s.

(47.46)

47.7 Exploration versus Exploitation

1985

which in turn allows us to determine optimal deterministic policies by setting: (45.6)

π ? (a|s) := argmax q ? (s, a)

(47.47)

a∈A

47.7

EXPLORATION VERSUS EXPLOITATION The convergence analysis given in Appendix 47.B for the Q-algorithm (47.45) shows that the state–action space S × A needs to be explored densely enough so that state–action pairs are visited with sufficient frequency to enable proper learning. Implementation (47.45) chooses the action a by maximizing qn−1 (s, a0 ) over the action space. It is necessary to ensure that there exists a positive probability for choosing any (s, a) pair so that the state × action space is explored densely. There are many ways by which the state × action space can be explored. The following are examples of adjustments that can be made for this purpose.

47.7.1

Optimistic Initialization Under implementation (47.45), the agent explores the space by following greedy actions, namely, at every iteration n, the action is selected as: an = argmax qn−1 (sn , a0 )

(47.48)

a0 ∈A

These actions may not be exploratory enough, especially during the initial stages of learning. One useful technique to enhance the exploration ability of the learning procedure is to employ optimistic initialization. Specifically, if we appeal to definition (44.87b) for the state–action value function, we find that its values vary within the interval [rmin /(1 − γ), rmax /(1 − γ)] in terms of the largest and smallest possible reward values. Then, we set the initial condition for algorithm (47.45) to the largest possible value as follows: q−1 (s, a) =

max{ |rmin |, |rmax | } , ∀ (s, a) ∈ S × A 1−γ

(47.49)

By doing so, the greedy step is encouraged to explore across the actions.

47.7.2

-Greedy Exploration A second useful technique to enhance exploration is to employ the -greedy strategy as follows. Let  ∈ [0, 1) denote a small exploration parameter. The larger the value of  is, the wider the amount of exploration will be. Let qn−1 (s, a0 ) denote some generic state–action value function, and denote the corresponding greedy action for state s by: a(s) = argmax qn−1 (s, a0 ) a0 ∈A

(47.50)

1986

Q-Learning

In -greedy exploration, rather than follow action a(s), the agent will select randomly between following a(s) with some high probability 1 −  or selecting any random action from A with the smaller probability  according to the rule (see also Prob. 47.15): ( a(s), with probability (1 − )  a (s) =  any random a ∈ A, with probability |A| each ∆

= -greedy[a]

(47.51)

Note that we are denoting the modified action by a (s) and using the notation a (s) = -greedy[a] to refer to the transformation. The resulting Q-algorithm is listed in (47.52). The same arguments used to establish the convergence results (47.46)–(47.47) can be applied to this case as well – see Prob. 47.10. -greedy Q-learning algorithm for optimal policy extraction. initial state–action values q−1 (s, a) = arbitrary, s ∈ S, a ∈ A; exploration parameter,  > 0. repeat over episodes: for each episode, let s0 denote its initial state repeat over episode for n ≥ 0: s = sn a = argmax qn−1 (s, a0 ) a0 ∈A

a = -greedy [a] s0 = sn+1 (based on action a ) r(n) = r(s, a , s0 )   0 0 q (s , a ) − qn−1 (s, a ) βn (s, a ) = r(n) + γ max n−1 0 a ∈A

qn (s, a ) = qn−1 (s, a ) + µn (s, a )βn (s, a ) end q−1 (s, a) ← qn (s, a), ∀ (s, a) ∈ S × A in preparation for next episode end (47.52) As can be seen from the -greedy policy, exploration strategies allow the agents to explore new locations in the state × action space in order to collect information that will help drive the learning process more thoroughly. The -greedy strategy achieves this objective by assigning a probability  to selecting any other action besides the greedy action a(s). If the agent were to continue employing solely the greedy action a(s) during learning, it would be performing exploitation as opposed to exploration; this is because it would be driving its learning process by exploiting at each instant the aggregate of the information that it has seen until then. Exploration, on the other hand, adds an element of randomness and allows the agent to explore new scenarios. This is helpful because, sometimes, the

47.7 Exploration versus Exploitation

1987

best strategy requires taking actions that may look suboptimal in the short term but turn out to be rewarding in the long run. As such, we say that the -greedy strategy will be exploring -fraction of the time and exploiting (1 − )-fraction of the time. Example 47.2 (Greedy exploration improves state values) We verify that -greedy exploration improves the state values of any policy. Specifically, let π(a|s) denote an arbitrary policy with state and state–action value functions denoted by v π (s) and q π (s, a), respectively. We modify the policy π(a|s) and replace it by an -greedy policy constructed according to (47.51). Specifically, for every state s, we first determine the greedy action: a(s) = argmax q π (s, a0 )

(47.53)

a0 ∈A

and then replace a(s) by the action a (s) according to (47.51). These new actions define a new policy π  (a|s), where each state s is mapped to a (s). We therefore say that, starting from the original policy π(a|s) prior to exploration, we modified it to an greedy policy π  (a|s) for enhanced exploration. We can now compare the state values under both policies and show that, for any state, 

v π (s) ≥ v π (s), s ∈ S

(47.54)

Proof: First, introduce the scalar coefficients ∆

φ(a|s) =

1 1−

 π(a|s) −

 |A|



, a∈A

(47.55)

Then, it is straightforward to check that the {φ(a|s)} add up to 1 since X

φ(a|s) =

a∈A

/|A| 1 X × |A| π(a|s) − 1 −  a∈A 1− | {z } =1

1  = − = 1 1− 1−

(47.56)

Using relation (44.92), which relates state values to state–action values, we then have X    v π (s) = π (a|s)q π (s, a) a∈A

= (47.56)

=

 X π q (s, a) + (1 − ) q π (s, a(s)) |A| a∈A X  X π q (s, a) + (1 − ) φ(b|s)q π (s, a(s)) |A| a∈A b∈A

(47.50)

X  X π ≥ q (s, a) + (1 − ) φ(b|s)q π (s, b) |A| a∈A b∈A (47.55) X π = π(a|s)q (s, a) a∈A (44.92)

=

v π (s)

(47.57)

as claimed. 

1988

Q-Learning

Example 47.3 (Boltzmann or softmax greedy exploration) Although simple, one of the disadvantages of the -greedy exploration strategy (47.51) is that all actions in the set A\{a(s)} are selected uniformly with equal probability. This selection can be problematic since not all remaining actions are equally effective. There are other ways to select the exploratory action a (s) besides construction (47.51) by assigning more or less likelihood to an action based on its state–action value. Let again q(s, a) denote an arbitrary state–action value function. In (47.51), we first determine the greedy action a(s) by maximizing q π (s, a0 ) over a0 , and then use the resulting a(s) to select the exploratory action a (s). Now, instead, we select a (s) according to the following softmax (also called Gibbs or Boltzmann) distribution, for some parameter  > 0 (referred to as the temperature parameter – recall (71.89)): π  (a|s) = eq(s,a)/

0

.X

eq(s,a

)/

(47.58)

a0 ∈A

That is, we transform the state–action values into probabilities and assign to each action a ∈ A a likelihood value under state s that is given by the above expression. The action a (s) is then selected randomly according to the distribution π  (a|s). The same arguments used to establish the convergence results (47.46)–(47.47) can be applied to this case as well – see Prob. 47.12. Observe from (47.58) that as  → ∞, the exploration policy becomes the uniform distribution, i.e., lim π  (a|s) =

→∞

1 |A|

(47.59)

On the other hand, when  → 0+ , the exploration policy tends to the greedy policy – see Prob. 47.14: n o lim a (s) = argmax q π (s, a0 ) = a(s) (47.60) →0+

a0 ∈A

This is because the probability of the action a(s), which has the largest state–action value, tends to 1 in (47.58). Example 47.4 (Optimal policy for a game over a grid) We illustrate the operation of the Q-learning algorithm under -greedy exploration by reconsidering the earlier grid problem from Fig. 44.4. Recall that the grid consisted of 16 squares labeled #1 through #16. Four squares are special; these are squares #4, #8, #11, and #16. Square #8 (danger) and #16 (stars) are terminal states. If the agent reaches one of these terminal states, the agent moves to an EXIT state and the game stops. The agent collects either a reward of +10 at square #16 or a reward of −10 at square #8. The agent collects a reward of −0.1 at all other states. We iterate the Q-learning iteration (47.52) over 50,000 episodes using  = 0.1, γ = 0.9, and a constant step size µ = 0.01. We employ the optimistic initialization (47.49) for the state–action value function, namely, q−1 (s, a) =

max{0.1, 10} = 100, 1 − 0.9

∀ (s, a) ∈ S × A

(47.61)

In the implementation of the algorithm, the maximization of qn−1 (sn , a0 ) and qn−1 (s0 , a0 ) is performed over the actions that are permissible at the respective states, sn and s0 , respectively. After convergence, we examine the resulting state–action value function q(s, a), whose values we collect into a matrix Q with each row corresponding to a state value and each column to an action:

47.7 Exploration versus Exploitation

               Q=              

s=1 s=2 s=3 s=4 s=5 s=6 s=7 s=8 s=9 s = 10 s = 11 s = 12 s = 13 s = 14 s = 15 s = 16 s = 17

up 94.8026 108.7510 108.8600 − 109.2077 108.9250 106.4342 − 109.8279 109.7102 − 109.3789 109.4312 109.5837 109.7302 − −

down 108.0683 108.2464 108.4572 − 109.0464 108.7999 105.5675 − 95.5984 108.9413 − 109.1449 109.3414 109.5774 109.6244 − −

left 105.7811 108.4248 108.3800 − 109.0972 109.0351 109.0016 − 107.0995 109.4142 − 109.2611 109.3911 109.4614 109.6038 − −

right 105.9368 107.9651 108.3750 − 108.9634 108.8364 94.6887 − 106.5895 109.5597 − 109.2649 109.5225 109.6903 109.8346 − −

stop 100.0000 100.0000 100.0000 − 100.0000 100.0000 100.0000 90.0000 100.0000 100.0000 − 100.0000 100.0000 100.0000 100.0000 110.0000 100.0000

1989

                (47.62)              

For each state s, we determine the action a that maximizes q(s, a). These results are indicated by boxes in the above expression, leading to the following deterministic optimal policy:

π(a = “down”|s = 1) = 1 π(a = “up”|s = 2) = 1 π(a = “up”|s = 3) = 1 π(a = “up”|s = 5) = 1 π(a = “left”|s = 6) = 1 π(a = “left”|s = 7) = 1 π(a = “stop”|s = 8) = 1 π(a = “up”|s = 9) = 1 π(a = “up”|s = 10) = 1 π(a = “up”|s = 12) = 1 π(a = “right”|s = 13) = 1 π(a = “right”|s = 14) = 1 π(a = “right”|s = 15) = 1 π(a = “stop”|s = 16) = 1 π(a = “stop”|s = “EXIT”) = 1

(47.63a) (47.63b) (47.63c) (47.63d) (47.63e) (47.63f) (47.63g) (47.63h) (47.63i) (47.63j) (47.63k) (47.63l) (47.63m) (47.63n) (47.63o)

The optimal actions are represented by arrows in Fig. 47.1. Observe in particular that the optimal action at state s = 7 is to move left. By doing so, the agent has a 70% chance to move to state s = 6 and 15% chance each to move to states s = 2 and s = 10. Note that there is no chance to end up in the danger state s = 8. Likewise, the optimal action selected for state s = 1 avoids any possibility of ending up in the danger state s = 1. This is because by choosing to move downward, the agent has 85% chance of staying at state s = 1 and 15% chance of moving to state s = 2.

1990

Q-Learning

SUCCESS

11

DANGER

4

Figure 47.1 Optimal actions at each square are indicated by red arrows representing

motions in the directions {upward, downward, left, right}. The circular arrows in states 8 and 16 indicate that the game stops at these locations.

47.7.3

Upper Confidence Bound The -greedy strategy is simple and works relatively well in practice. As mentioned in Example 47.3, one of its inconveniences is that, during exploration, it treats all actions equally even though some actions may be superior to other actions. Another approach to exploration is to modulate the selection of actions by the number of times that certain state × action pairs have been visited. The idea is to encourage the agent to explore regions in the state × action space that have not been explored enough. This approach is useful for tabular state × action spaces, i.e., for situations with a manageable number of discrete states and actions so that counters can be used to keep track of how many times any (s, a)-pair has been visited. Let C(s, a) denote the number of times, up to iteration n, that action a has been taken at state s – see definition (47.43). If this number is large, then this means that, with good likelihood, information related to the state–action pair (s, a) has been incorporated sufficiently enough into the learning process. In this case, it would be useful for the agent to explore actions whose performance is less certain in order to enlarge its exploration of the state × action space. This can be achieved as follows. We replace the state–action value function qn−1 (s, a0 ) by a perturbed version of the form: ∆

e qn−1 (s, a0 ) = qn−1 (s, a0 ) + Un−1 (s, a0 )

(47.64)

47.7 Exploration versus Exploitation

1991

where U (s, a0 ) is a function to be determined and whose role is to model the size of the uncertainty in the estimate qn−1 (s, a0 ) at iteration n − 1. Remember that qn−1 (s, a0 ) is an estimate for the (unknown) true value q ? (s, a0 ). We will explain shortly how to estimate Un−1 (s, a0 ). Once this is done, one exploration strategy would be to select the action according to (this selection would replace the -greedy action a in listing (47.52)): n o e ae = argmax qn−1 (s, a0 ) (47.65) a0 ∈A

We are adding the superscript e to refer to “exploration.” Under this construction, the action corresponding to the largest uncertainty is selected. This approach to selecting ae is sometimes referred to as optimism in the face of uncertainty. The intuition is as follows. The state s is fixed and we need to select one of the actions. Each action a0 has some uncertainty Un−1 (s, a0 ) around its state–action value estimate qn−1 (s, a0 ). As the argument below will show, the smaller the size of the uncertainty Un−1 (s, a0 ), the more confident we are about its state–action value. It is then likely that by choosing an action that we are most uncertain about, the agent would be exploring new regions in the state × action space where we have learned less about their state–action values. We are therefore optimistic that by exploring this larger uncertain space, we would be able to discover trajectories that could lead to better rewards in the long term. One way to estimate Un−1 (s, a0 ) is to determine an upper confidence bound (UCB) for it. To do so, we rely on future Hoeffding inequality from part (a) of Prob. 3.55. Noting that the estimate qn−1 (s, a0 ) can be viewed as a sample mean approximation for q ? (s, a0 ) using C(s, a0 ) samples, we write   n o 2 P q ? (s, a0 ) > qn−1 (s, a0 ) + Un−1 (s, a0 ) ≤ exp −2C(s, a0 )Un−1 (s, a0 )/δq2 (47.66)

where we are assuming that all state–action value functions are bounded within some interval [q` , qu ] of length δq = qu − q` (which in turn requires the rewards to be bounded). If we use the fact that the state–action value function is bounded to the interval [rmin /(1 − γ), rmax /(1 − γ)] in terms of the largest and smallest possible reward values, then we can select δq = ∆r/(1 − γ),

where ∆r = rmax − rmin

(47.67)

The bound on the right-hand side of (47.66) suggests one way to select the uncertainty Un−1 (s, a0 ). Assume we want the likelihood of event (47.66) occurring to be at most some small probability α (so that at least 1−α fraction of the time, the true q ? (s, a0 ) will be upper-bounded by qn−1 (s, a0 ) + Un−1 (s, a0 )). Setting n o 2 exp −2C(s, a0 )Un−1 (s, a0 )/δq2 = α (47.68)

1992

Q-Learning

gives Un−1 (s, a0 ) = δq

s

ln(1/α) 2 C(s, a0 )

(47.69)

In other words, we can select Un−1 (s, a0 ) to be a function that decays with the square-root of C(s, a0 ). Under this construction, the amount of uncertainty will decay as the pair (s, a0 ) is visited more frequently, which is in line with intuition. e In this way, maximizing qn−1 (s, a0 ) over a0 will tend to favor the selection of actions from less frequent occurrences. We can further adjust the probability factor α and decrease it over time as we observe more data, say, as α = 1/(n + 1)4 so that 0

Un−1 (s, a ) = δq

s

2 ln(n + 1) C(s, a0 )

(47.70)

(47.71)

We arrive at listing (47.72), where we add a small positive  to C(s, a0 ) in the above expression to avoid division by zero; this implementation is suitable for tabular state × action spaces. Other possibilities for defining the exploration e state value function qn−1 (s, a0 ) are of course possible. Q-learning using UCB exploration for optimal policy extraction. initial state–action values q−1 (s, a) = arbitrary, s ∈ S, a ∈ A; initial state–action counters C(s, a) = 0, s ∈ S, a ∈ A; interval bound δq = ∆r/(1 − γ), ∆r = rmax − rmin ; a small positive  > 0. repeat over episodes: for each episode, let s0 denote its initial state repeat over episode for n ≥ 0: s = sn s 2 ln(n + 1) Un−1 (s, a0 ) = δq , ∀a0 ∈ A  + C(s, a0 ) n o ae = argmax qn−1 (sn , a0 ) + Un−1 (s, a0 ) a0 ∈A

s0 = sn+1 (based on action ae ) r(n) = r(s, ae , s0 )   0 0 βn (s, ae ) = r(n) + γ max q (s , a ) − qn−1 (s, ae ) n−1 0 a ∈A

qn (s, ae ) = qn−1 (s, ae ) + µn (s, ae )βn (s, ae ) end q−1 (s, a) ← qn (s, a), ∀ (s, a) ∈ S × A in preparation for next episode end (47.72)

47.8 Q-Learning with Replay Buffer

47.8

1993

Q-LEARNING WITH REPLAY BUFFER The Q-learning procedure (47.45) can face challenges for large state spaces. First, the states visited during training can end up being localized in a particular region of space with limited representation for what happens beyond this region. This can happen for at least two reasons: (a) the large size of the state space may only allow for a sparse exploration, and (b) the Markovian property of the state transitions makes nearby states dependent on each other and highly correlated, which limits the efficacy of the learning process. It is therefore useful to adjust the procedure to employ state–action pairs that are largely independent of each other, and that can cover a broader region of the state × action space. The concept of an experience replay buffer helps attain this objective. The buffer will be used to save information about transitions (s, a, s0 , r) visited by the algorithm during learning. By accumulating experiences into the buffer, rather than discarding them, it becomes possible to address another challenge for Q-learning, which arises when sampling of the state × action space is costly. The buffer will allow the algorithm to tap into saved experiences. The Q-learning algorithm with replay buffer is listed in (47.73).

Q-learning for optimal policy extraction with replay buffer. initial state–action values q−1 (s, a) = arbitrary, s ∈ S, a ∈ A; apply repeated actions according to some arbitrary policy and construct a buffer containing sufficient samples for the transitions (s, a, s0 , r). repeat over episodes: for each episode, let s0 denote its initial state repeat over episode for n ≥ 0: s = sn a = argmax qn−1 (s, a0 ) 0

a0 ∈A

s = sn+1 r(n) = r(s, a, s0 ) save the just-found values (s, a, s0 , r) into the replay buffer select at random a point (¯ s, a ¯, s¯0 , r¯) from the replay buffer  βn (¯ s, a ¯) = r¯(n) + γ max qn−1 (¯ s0 , a0 ) − qn−1 (¯ s, a ¯) 0 a ∈A

qn (¯ s, a ¯) = qn−1 (¯ s, a ¯) + µn (¯ s, a ¯)βn (¯ s, a ¯) end q−1 (s, a) ← qn (s, a), ∀ (s, a) ∈ S × A in preparation for next episode end (47.73)

1994

Q-Learning

Exploration policies such as -greedy can be incorporated into the algorithm. We list the algorithm, for simplicity, assuming greedy actions. There are three modifications in relation to the standard Q-learning procedure (47.45): (a) The data point (a, s, s0 , r) is saved into the buffer. This point is not used in the update of the state–action function at that iteration. (b) Instead, another point (¯ s, a ¯, s¯0 , r¯) is selected at random from the buffer and this point is used to update the state–action function. (c) The state–action value qn (·) is updated at (¯ s, a ¯) and not (s, a).

47.9 DOUBLE Q-LEARNING

We motivated the Q-learning algorithm (47.45) by starting from the Bellman optimality condition (47.35), where a maximization operation appears on the right-hand side. We know from the definition of the state–action function in (44.87b) that q*(s', a') is the expected value of the accumulated return under the optimal policy, with the value of q*(s', a') varying over the choice of the action a':

  q*(s', a') = E_{π*,P} { Σ_{n=0}^{∞} γ^n r(n) | s_0 = s', a_0 = a' } = E y(a')        (47.74)

On the right-hand side, we are using the short-hand notation y(a') to refer to the sum inside the expectation; it is a random quantity parameterized by the action a'. We therefore encounter a situation where we need to compute the maximum of the expected value of a collection of random variables parameterized by a':

  max_{a'∈A} q*(s', a') = max_{a'∈A} E y(a')        (47.75)

This situation is analogous to the case studied earlier in Prob. 31.31, where we examined constructing an estimator for the maximum of the means of a collection of random variables. If we examine what the Q-learning algorithm is doing, we find that it is first computing instantaneous estimates for E y(a'), given by q_{n-1}(s', a'), one for each a', and then maximizing over a'. Specifically, it first finds

  q_{n-1}(s', a') = Ê y(a')        (47.76)

and then uses

  max_{a'∈A} q_{n-1}(s', a') ≈ max_{a'∈A} Ê y(a')        (47.77)

This construction mimics the solution from Prob. 31.31 if we assume the agent is collecting a single observation for each a' (namely, q_{n-1}(s', a')). We explained there that this construction is biased. Moreover, due to the maximization over a', the above construction can end up overestimating max_{a'∈A} E y(a') and lead to poor performance. We describe an alternative implementation for Q-learning that relies on the double estimator idea from Prob. 31.32, which was shown to lead to estimators that do not overestimate the quantity of interest. The algorithm is listed in (47.78).

Double Q-learning algorithm for optimal policy extraction.
  initial state–action values q^{(1)}_{-1}(s, a), q^{(2)}_{-1}(s, a) = arbitrary, s ∈ S, a ∈ A.
  repeat over episodes:
    for each episode, let s_0 denote its initial state
    repeat over episode for n ≥ 0:
      s = s_n
      choose at random to update state–action function using (1) or (2)
      if (1) is selected:
        a = argmax_{a'∈A} q^{(1)}_{n-1}(s, a')
        s' = s_{n+1}
        r(n) = r(s, a, s')
        β_n(s, a) = r(n) + γ q^{(2)}_{n-1}(s', a') − q^{(1)}_{n-1}(s, a)
        q^{(1)}_n(s, a) = q^{(1)}_{n-1}(s, a) + μ_n(s, a) β_n(s, a)
      else if (2) is selected:
        a = argmax_{a'∈A} q^{(2)}_{n-1}(s, a')
        s' = s_{n+1}
        r(n) = r(s, a, s')
        β_n(s, a) = r(n) + γ q^{(1)}_{n-1}(s', a') − q^{(2)}_{n-1}(s, a)
        q^{(2)}_n(s, a) = q^{(2)}_{n-1}(s, a) + μ_n(s, a) β_n(s, a)
      end
    end
    q^{(2)}_{-1}(s, a) ← q^{(1)}_n(s, a), q^{(1)}_{-1}(s, a) ← q^{(2)}_n(s, a), ∀ (s, a) ∈ S × A
      in preparation for next episode
  end                                                                  (47.78)

Algorithm (47.78) relies on propagating two estimates for the state–action function, {q^{(1)}_{n-1}(s', a'), q^{(2)}_{n-1}(s', a')}, where we are using the superscripts (1) and (2) to distinguish between them. As explained in Prob. 31.32, we basically use one function to determine the maximizing action and then compute the state–action value from the other function. Specifically, we first determine the index

  a^o = argmax_{a'∈A} q^{(1)}_{n-1}(s', a')        (47.79)

and subsequently use q^{(2)}_{n-1}(s', a^o) in the update relation. We can actually alternate between these steps: we can select a^o by maximizing either q^{(1)}_{n-1}(s', a') or q^{(2)}_{n-1}(s', a') and then use the value of the other function in the update relation.


The algorithm is listed in (47.78); exploration policies such as ε-greedy can be incorporated, and we list the algorithm, for simplicity, assuming greedy actions. Its convergence analysis follows the same arguments used for Q-learning; we leave it to Prob. 47.13.
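As an illustration, a minimal tabular sketch of the double-estimator update is given below. The environment interface (reset/step) and the greedy behavior policy are assumptions made for the sketch; the selection/evaluation split follows the idea in (47.79).

  import numpy as np

  def double_q_learning(env, num_states, num_actions, episodes=500,
                        gamma=0.9, mu=0.1, seed=0):
      """Tabular double Q-learning: two tables are kept, one selects the maximizing
      action and the other evaluates it. `env` is a hypothetical reset()/step() interface."""
      rng = np.random.default_rng(seed)
      q1 = np.zeros((num_states, num_actions))
      q2 = np.zeros((num_states, num_actions))

      for _ in range(episodes):
          s = env.reset()
          done = False
          while not done:
              if rng.random() < 0.5:                   # update table (1)
                  a = int(np.argmax(q1[s]))
                  s_next, r, done = env.step(a)
                  a_star = int(np.argmax(q1[s_next]))  # select with q1, evaluate with q2
                  beta = r + gamma * q2[s_next, a_star] - q1[s, a]
                  q1[s, a] += mu * beta
              else:                                    # update table (2)
                  a = int(np.argmax(q2[s]))
                  s_next, r, done = env.step(a)
                  a_star = int(np.argmax(q2[s_next]))  # select with q2, evaluate with q1
                  beta = r + gamma * q1[s_next, a_star] - q2[s, a]
                  q2[s, a] += mu * beta
              s = s_next
      return q1, q2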

47.10 COMMENTARIES AND DISCUSSION

SARSA algorithms. The SARSA(0) and SARSA(λ) algorithms (47.13) and (47.28), including their variations for linear function approximation from Section 48.4, were developed by Rummery and Niranjan (1994), who referred to them as modified Q-learning algorithms. The acronym SARSA was introduced by Sutton (1996). The expected SARSA algorithm (47.16) was introduced in the text by Sutton and Barto (1998) and analyzed by van Seijen et al. (2009).

Q-learning. The original reference on the Q-learning algorithm (47.45) is the Ph.D. dissertation by Watkins (1989), where the terminology of "action value function" was introduced to refer to the state–action value function, q^π(s, a). The concept of a state–action value function is actually older and was mentioned, although not under this name, decades earlier by the American engineer Claude Shannon (1916–2001) in his work on programming a computer to play the game of chess. In Shannon (1950), it was suggested to employ a function h(P, M) to assess the efficacy of performing move M in position P. The double Q-learning algorithm (47.78) is from van Hasselt (2010). The idea of an experience replay buffer in algorithm (47.73) is from Lin (1992) – see also Andrychowicz et al. (2017) for another application. One of the earliest works on the convergence of Q-learning algorithms is by Watkins and Dayan (1992). In the appendices to this chapter, we pursue a unified approach to convergence analysis by relying on the stochastic inequality (46.130) in a manner similar to what was proposed by Jaakkola, Jordan, and Singh (1994).

Exploration strategies. The ε-greedy strategy (47.51) enhances the exploration ability of agents by assigning a nonzero likelihood to nongreedy actions. This simple yet useful strategy was proposed by Watkins (1989) in the same dissertation where the Q-learning algorithm was developed. One disadvantage of this rule is that it assigns equal probabilities to all nongreedy actions. The softmax variation (47.58), on the other hand, assigns likelihoods to the actions in proportion to their state–action values. This formulation was proposed by Bridle (1990a, b) and is motivated by a result in probability theory from the late 1950s, known as the Luce choice axiom. The axiom, published in Luce (1959, 1977), states that the probability of selecting one item from a collection of N items is not influenced by the presence of the other items and can be modeled as:

  P(select item k) = α(k) / Σ_{n=1}^{N} α(n)        (47.80)

where α(n) refers to the nonnegative weight (or strength) assigned to the nth item. In the softmax formulation, the strength of action a at state s is chosen as:

  α(a) = exp{ q^π(s, a)/ε }        (47.81)

where ε > 0 is referred to as the temperature parameter. We show in Example 47.3 how the softmax selection converges to two extremes as ε approaches either 0+ or ∞.

We explained in Section 47.7 that there exists a fundamental trade-off between exploitation and exploration in the context of probing MDPs with unknown parameters to maximize the cumulative reward. Exploitation favors selecting actions that have proven to be rewarding based on past experience and is generally the best strategy to follow in the short term. Exploration, on the other hand, favors taking new actions to gain more information and hopefully increase the overall reward in the long term. Exploration is necessary because there will always be errors and uncertainties infused into the learning process (such as uncertainties in estimating the state and state–action values). If we were to rely exclusively on exploitation, by selecting the best actions based on prior estimates, then these selections would be subject to error and may not lead to optimal performance. Exploration alleviates this problem. In the body of the chapter, we described several exploration strategies, such as ε-greedy, softmax exploration, and UCB exploration. The UCB bound (47.69) was derived by relying on the Hoeffding inequality (47.66). Other similar bounds can be derived by using other concentration inequalities, such as the Chernoff, Azuma, or Bernstein inequalities – see, e.g., Lin and Bai (2010) for an overview of probability inequalities. Other approaches to exploration exist, such as those using Thompson sampling, where one selects an action according to the belief that it is the best action; this approach is motivated by the work of Thompson (1933). Counter-based approaches, such as the UCB approach discussed in Section 47.7.3, are not convenient for high-dimensional spaces and do not work when the value functions are approximated. An alternative method is to rely on "pseudo-counts," where counters are replaced by density models that represent how frequently state–action pairs have been visited – see, e.g., Bellemare et al. (2016) and the references therein. Likewise, more elaborate schemes are needed for deep exploration when the reward structure is sparse, such as when an agent plays a game and only receives a reward at the end depending on whether the game was won or lost. There is a significant literature on deep exploration methods and interested readers may consult, for example, Stadie, Levine, and Abbeel (2015), Osband, Van Roy, and Wen (2016), Osband et al. (2016), Osband, Aslanides, and Cassirer (2018), Burda et al. (2018), O'Donoghue et al. (2018), and Cassano and Sayed (2019), and the references therein.

Multi-armed bandits. The exploration method based on optimism in the face of uncertainty from Section 47.7.3, leading to (47.71), is motivated by results from the rich literature on multi-armed bandits (MABs) – see, e.g., the overviews by Bubeck and Cesa-Bianchi (2012), Slivkins (2019), and Lattimore and Szepesvari (2020). MABs correspond to single-step decision processes, in contrast to MDPs, which are sequential in nature, with the MDP moving from one state to another. An MAB problem consists of |A| arms, with each arm corresponding to an action in the MDP context. At every instant n, the user selects one of the arms, i.e., one of the actions a_n ∈ A. In return, the user observes some reward denoted by r_n(a_n). This reward is random because it arises from some unknown distribution determined by the selected action; different arms will generally have different reward distributions associated with them (with some actions being more generous than others). Observe that there is no concept of state in the MAB context. However, if we were to draw an analogy with MDPs, we could consider the situation in which the state is fixed at some value s. There are multiple actions a ∈ A that can be chosen at that state according to some random distribution determined by the policy π(a|s). In response to the selected action and the future state s', some reward is received. This reward is random in view of the random nature of the landing state, s': each state s' is reached with probability P(s, a, s') and delivers some reward r(s, a, s').

Returning to the MAB setting, for any generic action a, we let r(a) denote the random reward received according to the unknown conditional reward distribution associated with that action, denoted by f_{r|a}(r|a):

  r(a) ∼ f_{r|a}(r|a)        (47.82)

We also let

  q(a) = E r(a)        (47.83)

denote the average reward that results from action a, where the expectation is relative to the distribution f_{r|a}(r|a). In other words, q(a) is the mean of the distribution associated with action a. These means are unknown because the reward distributions are themselves unknown. Ideally, at every step, the user would like to select the optimal action:

  a* = argmax_{a∈A} q(a)      (optimal arm)        (47.84a)
  q* = q(a*)                  (optimal reward)     (47.84b)

The gap for an action a measures the difference between its expected reward and the optimal reward:

  q̃(a) = q* − q(a)        (47.85)

On the other hand, the regret measures the average gap size:

  R = E_π { q̃(a) } = q* − E_π q(a)        (47.86)

where the expectation is over the randomness in action selection (represented by a distribution π). The regret at step n is denoted by

  R_n = q* − q(a_n)        (47.87)

It follows that, over a period of N iterations, the average total regret would be given by

  R(N) = q* − (1/N) Σ_{n=0}^{N−1} q(a_n) = (1/N) Σ_{n=0}^{N−1} R_n        (47.88)

The total regret is N R(N). The goal of an MAB problem is to select a sequence of actions, driven by the history of the observed reward values, in order to minimize the total regret. A fundamental bound in MAB problems is the following result, which shows that the total regret grows at least logarithmically with the number of steps.

Logarithmic regret (Lai and Robbins (1984, 1985)). Let A+ denote the subset of all actions with positive gaps, A+ = {a | q̃(a) > 0}. Then, the total regret for any algorithm that attempts to minimize the regret satisfies the asymptotic bound

  lim_{N→∞} N R(N) ≥ ln N × Σ_{a∈A+} q̃(a) / D_KL( f_{r|a}(r|a), f_{r|a*}(r|a*) )        (47.89)

in terms of the Kullback–Leibler (KL) divergence between the reward distributions at a and a*.

We can estimate the means q(a) from observations by keeping track of how many times action a has been chosen and the corresponding rewards. Specifically, the estimate of q(a) at iteration n − 1 can be computed as follows:

  C(a) = number of times action a has been taken until then        (47.90a)
  q̂_{n−1}(a) = (1/C(a)) Σ_{m=0}^{n−1} r_m(a_m) I[a_m = a]        (47.90b)

In a manner similar to the derivation in the body of the chapter, the corresponding UCB algorithm takes the form

  a*_n = argmax_{a∈A} { q̂_{n−1}(a) + U_{n−1}(a) }        (47.91a)
  U_{n−1}(a) = δ_q sqrt( 2 ln(n + 1) / C(a) )        (47.91b)

where δ_q = r_max − r_min denotes the size of the interval within which the mean values q(a) are assumed bounded. This algorithm was derived by Auer, Cesa-Bianchi, and Fischer (2002), motivated by earlier results from Agrawal (1995). It was shown in their work that the algorithm's total regret achieves N R(N) = O(ln N). In comparison, if one implements instead greedy or ε-greedy solutions, say,

  a*_n = argmax_{a∈A} q̂_{n−1}(a)      (greedy solution)       (47.92a)
  a_n = ε-greedy[a*_n]                 (ε-greedy solution)     (47.92b)

then it can be verified that these strategies lead to total regrets, N R(N), that grow linearly with N rather than logarithmically.
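For illustration, a minimal sketch of the UCB rule (47.91a)–(47.91b) for a bandit is given below; the function pull(a), which returns a random reward for arm a, is an assumed placeholder standing in for the unknown reward distributions.

  import numpy as np

  def ucb_bandit(pull, num_actions, horizon=10000, r_min=0.0, r_max=1.0):
      """UCB rule (47.91a)-(47.91b) for a multi-armed bandit.
      `pull(a)` is assumed to return a random reward for arm a (hypothetical)."""
      delta_q = r_max - r_min
      counts = np.zeros(num_actions)        # C(a)
      q_hat = np.zeros(num_actions)         # running estimates of q(a)

      for n in range(horizon):
          if n < num_actions:               # play each arm once so that C(a) > 0
              a = n
          else:
              bonus = delta_q * np.sqrt(2.0 * np.log(n + 1) / counts)
              a = int(np.argmax(q_hat + bonus))
          r = pull(a)
          counts[a] += 1
          q_hat[a] += (r - q_hat[a]) / counts[a]   # sample-mean update, as in (47.90b)
      return q_hat, counts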

PROBLEMS

47.1 Derive recursions (47.20a)–(47.20c).
47.2 Follow arguments similar to the proof of result (47.14) to establish that the same convergence result holds for the look-ahead SARSA algorithm described by (47.21).
47.3 Follow the same convergence analysis from Appendix 47.A to establish the convergence of the expected SARSA(0) algorithm (47.16).
47.4 Follow arguments similar to the proof of result (47.14) to establish that the same convergence result holds for the forward view of the SARSA(λ) algorithm described by (47.23).
47.5 Refer to listings (47.13) and (47.16) for the SARSA(0) and expected SARSA(0) algorithms and the expressions for their respective TD factors, β_n(s, a). We denote the target signals for these implementations by the notation

  d = r(n) + γ q^π_{n−1}(s', a')                                (for SARSA(0))
  d_e = r(n) + γ Σ_{a'∈A} π(a'|s') q^π_{n−1}(s', a')            (for expected SARSA(0))

where r(n) = r(s, a, s'). These target signals are actually random variables due to the randomness in selecting (s', a') according to the distributions π(a'|s') and P(s, a, s'). We will therefore denote them in boldface.
(a) Show that E d = E d_e, where the expectations are over all random quantities. Conclude that the above target constructions have the same bias relative to the true state value function q^π(s, a).
(b) Introduce the second-order moments v = E d^2 and v_e = E d_e^2. Verify that

  v − v_e = γ^2 Σ_{s'} P(s, a, s') { Σ_{a'∈A} π(a'|s') [ q^π_{n−1}(s', a') ]^2 − [ Σ_{a'∈A} π(a'|s') q^π_{n−1}(s', a') ]^2 }

(c) Comment on conditions under which v ≥ v_e, so that the expected SARSA(0) construction would lead to target values with reduced variance.
47.6 Refer to expression (47.22) for Q^λ_n(s, a). Follow arguments similar to Example 46.7 to establish the following relations:

  Q^λ_n(s, a) − q^π_{n−1}(s, a) = Σ_{ℓ=0}^{∞} (λγ)^ℓ β_n(s_{n+ℓ}, a_{n+ℓ})
  β_n(s_{n+ℓ}, a_{n+ℓ}) I[s_n = s, a_n = a] = β_n(s_n, a_n) I[s_{n−ℓ} = s, a_{n−ℓ} = a]
  [ Q^λ_n(s, a) − q^π_{n−1}(s, a) ] I[s_n = s, a_n = a] = β_n(s_n, a_n) t_n(s, a) + O((λγ)^{n+1})
  t_n(s, a) = Σ_{t=0}^{n} (λγ)^{n−t} I[s_t = s, a_t = a]

47.7 The recursive forward and backward implementations of SARSA(λ) are not equivalent. Here, we show that their offline versions are equivalent by following the arguments of Section 46.5.3. We consider two indices: one index n denotes time and runs over states and rewards, and another index k denotes episodes (assumed of length N each). The online forward SARSA(λ) procedure (47.23) is replaced by the following offline construction running for n ≥ 0 over the observations from episode k:

  q^{π,f}_k(s, a) = q^{π,f}_{k−1}(s, a) + μ_k(s, a) Σ_{n=0}^{N−1} [ Q^λ_n(s, a) − q^{π,f}_{k−1}(s, a) ] I[s_n = s, a_n = a]

with boundary condition q^{π,f}_{−1}(s, a) = 0 for all states and actions. Here, we are using the subscript f to refer to the iterates generated by the forward implementation in order to distinguish them from the iterates generated by the backward implementation:

  β^b_k(s_n, a_n) = r(n) + γ q^{π,b}_{k−1}(s_{n+1}, a_{n+1}) − q^{π,b}_{k−1}(s_n, a_n)
  q^{π,b}_k(x, c) = q^{π,b}_{k−1}(x, c) + μ_k(x, c) Σ_{n=0}^{N−1} t_n(x, c) β^b_k(s_n, a_n),   x ∈ S, c ∈ A

with boundary condition q^{π,b}_{−1}(s, a) = 0. Show that these offline implementations are equivalent, i.e., q^{π,f}_k(s, a) = q^{π,b}_k(s, a) for all k and all state–action pairs (s, a) ∈ S × A.
47.8 Repeat the argument that was used to establish (46.88) to show that the SARSA(λ) algorithm (47.28) also converges, i.e., that (47.27) holds.
47.9 Refer to the Bellman optimality condition (47.35). Assume the reward function has a bounded second-order moment, i.e., E{ r^2(s, a, s') | s = s, a = a } ≤ d (here the expectation is over the randomness in s'). Show that the variable

  ψ(s') = max_{a'∈A} q*(s', a')

also has a bounded second-order moment over the randomness of s'.
47.10 Follow arguments similar to the ones used to establish the convergence results (47.46)–(47.47) to show that similar conclusions hold for the ε-greedy version of Q-learning given by listing (47.52).
47.11 Refer to the Q-learning algorithm (47.45) and its convergence proof. This form of the algorithm chooses the exploratory action a by maximizing q_{n−1}(s, a') over the action space. Assume we modify the selection of a by choosing it uniformly from within the action set A. Show again that if every state–action pair (s, a) is visited infinitely often, the resulting Q-learning implementation converges (i.e., result (47.46) continues to hold).
47.12 Follow arguments similar to the ones used to establish the convergence results (47.46)–(47.47) to show that similar conclusions hold for the softmax greedy version of Q-learning where a is selected according to (47.58).
47.13 Follow arguments similar to the ones used to establish the convergence results (47.46)–(47.47) to show that similar conclusions hold for the double Q-learning algorithm (47.78).
47.14 Establish the validity of (47.60). That is, show that the probability of the action with the largest q^π(s, a) value tends to 1.
47.15 We presented in (47.51) one construction for an ε-greedy action. Consider the alternative construction

  a^ε(s) = a(s) with probability (1 − ε) + ε/|A|;  a^ε(s) = a ∈ A\{a(s)} with probability ε/|A| each

Show that this exploration policy improves the state value function, i.e., v^{π_ε}(s) ≥ v^π(s), s ∈ S.

47.A CONVERGENCE OF SARSA(0) ALGORITHM

In this and the next appendix we pursue the convergence analysis of SARSA(0) and Q-learning by relying on the stochastic inequality (46.130) in a manner similar to what was proposed by Jaakkola, Jordan, and Singh (1994); see also Sutton and Barto (1998) and Singh et al. (2000) for a related discussion. We introduce the error signal

  q̃^π_n(s, a) = q^π(s, a) − q^π_n(s, a)        (47.93)

and the scaled step-size sequence

  μ̄_n(s, a) = μ_n(s, a) I[s_n = s, a_n = a]        (47.94)

Subtracting q^π(s, a) from both sides of the state–action recursion (47.3) gives

  q̃^π_n(s, a) = (1 − μ̄_n(s, a)) q̃^π_{n−1}(s, a) − μ̄_n(s, a) [ r(n) + γ q^π_{n−1}(s', a') − q^π(s, a) ]
              = (1 − μ̄_n(s, a)) q̃^π_{n−1}(s, a) − μ̄_n(s, a) e_n(s, a, s', a')        (47.95)

where the bracketed term is denoted by e_n(s, a, s', a'). Comparing with recursion (46.130) in Appendix 46.A, we can make the identifications:

  θ ← (s, a)        (47.96a)
  u_n(θ_n) ← q̃^π_n(s, a)        (47.96b)
  α_n(θ_n) ← μ̄_n(s, a)        (47.96c)
  β_n(θ_n) ← μ̄_n(s, a)        (47.96d)
  e_n(θ_n) ← e_n(s, a, s', a')        (47.96e)

We further introduce the following collection of past and present variables:

  X_n = { (s_n, a_n), μ̄_n(s_n, a_n) } ∪ { (s_m, a_m), μ̄_m(s_m, a_m), e_m(s, a, s', a'), m ≤ n − 1 }        (47.97)

Now note that, conditioned on X_n, we have

  E[ e_n(s, a, s', a') | X_n ]
    = E_P { r(n) + γ q^π_{n−1}(s', a') − q^π(s, a) | X_n }
    = γ E_P { q^π_{n−1}(s', a') − q^π(s', a') | s_n = s, a_n = a }          (step (a))
    = γ Σ_{s'∈S} P(s, a, s') [ q^π_{n−1}(s', a') − q^π(s', a') ]        (47.98)

where step (a) is because of (47.1) and where the expectation is over the randomness in s' ∈ S. It follows that

  | E[ e_n(s, a, s', a') | X_n ] |
    ≤ γ Σ_{s'∈S} P(s, a, s') | q^π_{n−1}(s', a') − q^π(s', a') |
    ≤ γ ( max_{s',a'} | q^π_{n−1}(s', a') − q^π(s', a') | ) Σ_{s'∈S} P(s, a, s')
    = γ max_{s',a'} | q̃^π_{n−1}(s', a') |        (47.99)

Therefore, condition (46.135) is satisfied. Next, using the fact that for any random variable x its variance satisfies σ_x^2 ≤ E x^2, we have that

  var[ e_n(s, a, s', a') | X_n ]
    ≤ E_P { ( e_n(s, a, s', a') )^2 | X_n }
    = E_P { ( r(n) + γ q^π_{n−1}(s', a') − q^π(s, a) )^2 | X_n }
    = E_P { ( r(n) − E r(n) + γ [ q^π_{n−1}(s', a') − q^π(s', a') ] )^2 | s_n = s, a_n = a }        (47.100)

where the last step uses (47.1). Using inequality (46.148) we get

  var[ e_n(s, a, s', a') | X_n ]
    ≤ 2 var[ r(n) | s_n = s, a_n = a ] + 2γ^2 E_P { | q^π(s', a') − q^π_{n−1}(s', a') |^2 | s_n = s, a_n = a }     (step (a))
    ≤ 2 E_P { r^2(n) | s_n = s, a_n = a } + 2γ^2 max_{s',a'} | q̃^π_{n−1}(s', a') |^2     (step (b))        (47.101)

where in step (a) we used the inequality (x + y)^2 ≤ 2x^2 + 2y^2 and in step (b) we used (47.99). Now, since the reward function is assumed to be uniformly bounded for all states and actions, |r(n)| ≤ R, as well as the state–action value function q^π(s', a'), we conclude that

  var[ e_n(s, a, s', a') | X_n ] ≤ 2R^2 + 2γ^2 ( max_{s',a'} | q̃^π_{n−1}(s', a') | )^2        (47.102)

As remarked earlier after (46.149), it is sufficient to require the reward function to have a bounded second-order moment, E r^2(n) (or variance); it is not necessary to require a uniformly bounded reward function – see also Prob. 47.9. Let c = max{2R^2, 2γ^2}. Then, we further have

  var[ e_n(s, a, s', a') | X_n ] ≤ c ( 1 + max_{s',a'} | q̃^π_{n−1}(s', a') |^2 )        (47.103)

and we conclude that condition (46.136) is satisfied. It then follows from (46.133) that q^π_n(s, a) converges to q^π(s, a) almost surely. This conclusion is valid under condition (46.134), i.e., we must have

  Σ_{n=0}^{∞} μ̄_n(s, a) = ∞,   Σ_{n=0}^{∞} μ̄_n^2(s, a) < ∞,   0 ≤ μ̄_n(s, a) < 1        (47.104)

The last two conditions are satisfied in view of (47.9). The first condition requires each state–action pair (s, a) to be visited infinitely often.

47.B CONVERGENCE OF Q-LEARNING ALGORITHM

We again rely on the useful lemma stated in Appendix 46.A, along the lines of Jaakkola, Jordan, and Singh (1994); see also Watkins and Dayan (1992) and Singh et al. (2000). We introduce the error signal

  q̃_n(s, a) = q*(s, a) − q_n(s, a)        (47.105)

and the scaled step-size sequence

  μ̄_n(s, a) = μ_n(s, a) I[s_n = s, a_n = a]        (47.106)

Subtracting q*(s, a) from both sides of the state–action recursion (47.40) gives

  q̃_n(s, a) = (1 − μ̄_n(s, a)) q̃_{n−1}(s, a) − μ̄_n(s, a) [ r(n) + γ max_{a'∈A} q_{n−1}(s', a') − q*(s, a) ]
            = (1 − μ̄_n(s, a)) q̃_{n−1}(s, a) − μ̄_n(s, a) e_n(s, a, s', a')        (47.107)

where the bracketed term is denoted by e_n(s, a, s', a'). Comparing with recursion (46.130) we can make the identifications

  θ ← (s, a)        (47.108a)
  u_n(θ_n) ← q̃_n(s, a)        (47.108b)
  α_n(θ_n) ← μ̄_n(s, a)        (47.108c)
  β_n(θ_n) ← μ̄_n(s, a)        (47.108d)
  e_n(θ_n) ← e_n(s, a, s', a')        (47.108e)

We further introduce the following collection of past and present variables:

  X_n = { (s_n, a_n), μ̄_n(s_n, a_n) } ∪ { (s_m, a_m), μ̄_m(s_m, a_m), e_m(s, a, s', a'), m ≤ n − 1 }        (47.109)

Now note that, conditioned on X_n, we have

  E[ e_n(s, a, s', a') | X_n ]
    = E_P { r(n) + γ max_{a'∈A} q_{n−1}(s', a') − q*(s, a) | X_n }
    = γ E_P { max_{a'∈A} q_{n−1}(s', a') − max_{a'∈A} q*(s', a') | s_n = s, a_n = a }     (step (a))
    = γ Σ_{s'∈S} P(s, a, s') [ max_{a'∈A} q_{n−1}(s', a') − max_{a'∈A} q*(s', a') ]        (47.110)

where step (a) is because of (47.35) and where the expectation is over the randomness in s' ∈ S. It follows that

  | E[ e_n(s, a, s', a') | X_n ] |
    ≤ Σ_{s'∈S} γ P(s, a, s') | max_{a'∈A} q_{n−1}(s', a') − max_{a'∈A} q*(s', a') |
    ≤ Σ_{s'∈S} γ P(s, a, s') max_{a'} | q_{n−1}(s', a') − q*(s', a') |
    ≤ γ ( max_{s',a'} | q_{n−1}(s', a') − q*(s', a') | ) Σ_{s'∈S} P(s, a, s')
    = γ max_{s',a'} | q̃_{n−1}(s', a') |        (47.111)

Therefore, condition (46.135) is satisfied. Next, using the fact that for any random variable x its variance satisfies σ_x^2 ≤ E x^2, we have that

  var[ e_n(s, a, s', a') | X_n ]
    ≤ E_P { ( e_n(s, a, s', a') )^2 | X_n }
    = E_P { ( r(n) + γ max_{a'∈A} q_{n−1}(s', a') − q*(s, a) )^2 | X_n }
    = E_P { ( r(n) + γ max_{a'∈A} q_{n−1}(s', a') − E[ r(n) + γ max_{a'∈A} q*(s', a') | s_n = s, a_n = a ]
              − γ max_{a'∈A} q*(s', a') + γ max_{a'∈A} q*(s', a') )^2 | X_n }        (47.112)

where the last step uses (47.35) and the last two terms, which cancel each other, are added to facilitate the bound that follows. Using inequality (46.148) we get

  var[ e_n(s, a, s', a') | X_n ]
    ≤ 3 var[ r(n) | s_n = s, a_n = a ] + 3γ^2 var[ max_{a'∈A} q*(s', a') | s_n = s, a_n = a ]
       + 3γ^2 E_P { | max_{a'∈A} q*(s', a') − max_{a'∈A} q_{n−1}(s', a') |^2 | s_n = s, a_n = a }
    ≤ 3 E_P { r^2(n) | s_n = s, a_n = a } + 3γ^2 E_P { | max_{a'∈A} q*(s', a') |^2 | s_n = s, a_n = a }
       + 3γ^2 max_{s',a'} | q̃_{n−1}(s', a') |^2     (step (a))        (47.113)

where in step (a) we used (47.111). Now, since the reward function is assumed to be uniformly bounded for all states and actions, say |r(n)| ≤ R, as well as the optimal state–action value function q*(s', a'), say |q*(s', a')| ≤ C, we conclude that

  var[ e_n(s, a, s', a') | X_n ] ≤ 3R^2 + 3γ^2 C^2 + 3γ^2 max_{s',a'} | q̃_{n−1}(s', a') |^2        (47.114)

As remarked earlier after (46.149), it is sufficient to require the reward function to have a bounded second-order moment, E r^2(n) (or variance); it is not necessary to require a uniformly bounded reward function – see also Prob. 47.9. Let c = max{3R^2 + 3γ^2 C^2, 3γ^2}. Then, we further have

  var[ e_n(s, a, s', a') | X_n ] ≤ c ( 1 + max_{s',a'} | q̃_{n−1}(s', a') |^2 )        (47.115)

and we conclude that condition (46.136) is satisfied. It then follows from (46.133) that q_n(s, a) converges to q*(s, a) almost surely. This conclusion is valid under condition (46.134), i.e., we must have

  Σ_{n=0}^{∞} μ̄_n(s, a) = ∞,   Σ_{n=0}^{∞} μ̄_n^2(s, a) < ∞,   0 ≤ μ̄_n(s, a) < 1        (47.116)

The last two conditions are satisfied in view of (47.42). The first condition requires each state–action pair (s, a) to be visited infinitely often.

REFERENCES

Agrawal, R. (1995), "Sample mean based index policies with O(log n) regret for the multi-armed bandit problem," Adv. Appl. Probab., vol. 27, pp. 1054–1078.
Andrychowicz, M., F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba (2017), "Hindsight experience replay," Proc. Advances Neural Information Processing Systems (NIPS), pp. 1–11, Long Beach, CA.
Auer, P., N. Cesa-Bianchi, and P. Fischer (2002), "Finite-time analysis of the multiarmed bandit problem," Mach. Learn., vol. 47, pp. 235–256.
Bellemare, M. G., S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos (2016), "Unifying count-based exploration and intrinsic motivation," Proc. Advances Neural Information Processing Systems (NIPS), pp. 1–9, Barcelona.
Bridle, J. S. (1990a), "Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters," Proc. Advances Neural Information Processing Systems (NIPS), pp. 211–217, Denver, CO.
Bridle, J. S. (1990b), "Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition," in Neurocomputing: Algorithms, Architectures and Applications, F. F. Soulie and J. Herault, editors, pp. 227–236, Springer.
Bubeck, S. and N. Cesa-Bianchi (2012), "Regret analysis of stochastic and nonstochastic multi-armed bandit problems," Found. Trends Mach. Learn., vol. 5, no. 1, pp. 1–122.
Burda, Y., H. Edwards, D. Pathak, A. Storkey, T. Darrell, and A. A. Efros (2018), "Large-scale study of curiosity-driven learning," available at arXiv:1808.04355.
Cassano, L. and A. H. Sayed (2019), "ISL: Optimal policy learning with optimal exploration-exploitation trade-off," available at arXiv:1909.06293. Presented at the 2019 NeurIPS Workshop on Optimization Foundations for Reinforcement Learning, Vancouver.
Jaakkola, T., M. I. Jordan, and S. Singh (1994), "On the convergence of stochastic iterative dynamic programming algorithms," Neural Comput., vol. 6, no. 6, pp. 1185–1201.
Lai, T. L. and H. Robbins (1984), "Asymptotically optimal allocation of treatments in sequential experiments," in Design of Experiments, T. J. Santer and A. J. Tamhane, editors, pp. 127–142, Marcel Dekker.
Lai, T. L. and H. Robbins (1985), "Asymptotically efficient adaptive allocation rules," Adv. Appl. Math., vol. 6, pp. 4–22.
Lattimore, T. and C. Szepesvari (2020), Bandit Algorithms, Cambridge University Press.
Lin, L.-J. (1992), "Self-improving reactive agents based on reinforcement learning, planning and teaching," Mach. Learn., vol. 8, pp. 293–321.
Lin, Z. and Z. Bai (2010), Probability Inequalities, Springer.
Luce, R. D. (1959), Individual Choice Behavior: A Theoretical Analysis, Wiley.
Luce, R. D. (1977), "The choice axiom after twenty years," J. Math. Psychol., vol. 15, no. 3, pp. 215–233.
O'Donoghue, B., I. Osband, R. Munos, and V. Mnih (2018), "The uncertainty Bellman equation and exploration," Proc. Int. Conf. Machine Learning (ICML), pp. 3839–3848, Stockholm.
Osband, I., J. Aslanides, and A. Cassirer (2018), "Randomized prior functions for deep reinforcement learning," Proc. Advances Neural Information Processing Systems (NIPS), pp. 1–9, Montreal.
Osband, I., C. Blundell, A. Pritzel, and B. Van Roy (2016), "Deep exploration via bootstrapped DQN," Proc. Advances Neural Information Processing Systems (NIPS), pp. 1–9, Barcelona.
Osband, I., B. Van Roy, and Z. Wen (2016), "Generalization and exploration via randomized value functions," Proc. Int. Conf. Machine Learning (ICML), vol. 48, pp. 2377–2386.
Rummery, G. A. and M. Niranjan (1994), "On-line Q-learning using connectionist systems," Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department.
Shannon, C. E. (1950), "Programming a computer for playing chess," Philos. Mag., vol. 41, pp. 256–275.
Singh, S., T. Jaakkola, M. L. Littman, and C. Szepesvari (2000), "Convergence results for single-step on-policy reinforcement-learning algorithms," Mach. Learn., vol. 39, pp. 287–308.
Slivkins, A. (2019), Introduction to Multi-Armed Bandits, Foundations and Trends in Machine Learning, NOW Publishers, vol. 38, pp. 1–306.
Stadie, B. C., S. Levine, and P. Abbeel (2015), "Incentivizing exploration in reinforcement learning with deep predictive models," available at arXiv:1507.00814.
Sutton, R. S. (1996), "Generalization in reinforcement learning: Successful examples using sparse coarse coding," Proc. Advances Neural Information Processing Systems (NIPS), pp. 1038–1044, Denver, CO.
Sutton, R. S. and A. G. Barto (1998), Reinforcement Learning: An Introduction, A Bradford Book.
Thompson, W. R. (1933), "On the likelihood that one unknown probability exceeds another in view of the evidence of two samples," Biometrika, vol. 25, no. 3–4, pp. 285–294.
van Hasselt, H. (2010), "Double Q-learning," Proc. Advances Neural Information Processing Systems (NIPS), pp. 1–9, Vancouver.
van Seijen, H., H. van Hasselt, S. Whiteson, and M. Wiering (2009), "A theoretical and empirical analysis of expected SARSA," Proc. IEEE Symp. Adaptive Dynamic Programming and Reinforcement Learning, pp. 1–8, Nashville, TN.
Watkins, C. J. (1989), Learning from Delayed Rewards, Ph.D. dissertation, Cambridge University, UK.
Watkins, C. J. and P. Dayan (1992), "Q-learning," Mach. Learn., vol. 8, pp. 279–292.

48 Value Function Approximation

The various reinforcement learning algorithms described in the last two chapters rely on estimating state values, v^π(s), or state–action values, q^π(s, a), directly. We explained earlier in Section 44.4 that this approach becomes problematic for Markov decision processes (MDPs) with large state or action spaces (i.e., MDPs with large |S| or |A| or both). We also explained there that one useful technique for these scenarios is to rely on the use of feature vectors. In this chapter, we motivate several algorithms that involve feature-based modeling for both the state value function, v^π(s), and the state–action value function, q^π(s, a).

48.1 STOCHASTIC GRADIENT TD-LEARNING

We consider first solution methods that employ linear approximations for the state value function. Thus, we assume henceforth that a feature vector, h_s ∈ IR^M, is associated with every state, s ∈ S, where the value of M satisfies M ≪ |S|:

  h_s : S → IR^M        (48.1)

The feature vectors will be used to approximate state values by a linear parametric model of the form:

  v^π(s) ≈ h_s^T w        (48.2)

for some vector w ∈ IR^M. In this way, rather than estimate the value of v^π(s) for each state s, it becomes sufficient to estimate a single vector w ∈ IR^M for all states. One typical construction for h_s when the state space S has moderate size is to employ one-hot encoding, namely, to employ the basis vectors e_s ∈ IR^{|S|} as the feature vectors. For example, if |S| = 4, we may employ

  h_1 = col{1, 0, 0, 0},  h_2 = col{0, 1, 0, 0},  h_3 = col{0, 0, 1, 0},  h_4 = col{0, 0, 0, 1}        (48.3)

Model (48.2) implicitly assumes that the state value v^π(s) is zero for states where the feature vector h_s happens to be zero. More generally, we can incorporate an offset parameter into the linear approximation model and use instead

  v^π(s) ≈ h_s^T w − θ        (48.4)

where θ ∈ IR is a scalar parameter to be estimated along with w. If we redefine the feature and parameter vectors as

  h_s^e = col{1, h_s},   w^e = col{−θ, w}        (48.5)

where a constant bias value of +1 is added as a leading entry to the feature vector, and −θ is added as a leading entry to w, then we rewrite (48.4) as the inner product expression:

  v^π(s) ≈ (h_s^e)^T w^e        (48.6)

This relation has the same form as (48.2). Therefore, without loss of generality, we will continue our discussion by assuming form (48.2), with the understanding that the quantities {h_s, w} may have been extended as necessary. We will derive several algorithms for estimating the parameter w; the methods will differ by the objective functions they optimize. In a later section, we will consider nonlinear approximation models in the form of neural networks when we study deep Q-learning.

Our first implementation for estimating the model w in (48.2) is based on a straightforward mean-square-error (MSE) formulation. We seek the vector w that solves:

  w^o = argmin_{w∈IR^M} J(w),   J(w) = (1/2) E ( v^π(s) − h_s^T w )^2        (48.7)

where we are denoting the objective function by J(w). The expectation in J(w) is over the randomness in the state variable. The solution w^o can be determined iteratively by pursuing a stochastic gradient implementation. Thus, note first that the gradient vector of J(w) is given by

  ∇_{w^T} J(w) = −E ( v^π(s) − h_s^T w ) h_s        (48.8)

We drop the expectation operator and employ a sample realization for the gradient vector. If we were able to observe, at some iteration n, the feature vector h_{s_n} and its state value v^π(s_n) corresponding to state s_n, then we could use these values to drive a stochastic gradient update of the form:

  w_n = w_{n−1} + μ(n) h_n ( v^π(s_n) − h_n^T w_{n−1} ),   n ≥ 0        (48.9)

where μ(n) denotes the step-size sequence, and the compact notation h_n refers to the feature vector associated with the state s_n occurring at time n, i.e., h_n = h_{s_n}. A small constant step size μ(n) = μ > 0 can also be used to enable continuous learning, especially in nonstationary environments where the properties of the MDP model may drift with time. Note that we are now denoting the step-size sequence by μ(n), instead of the earlier notation μ_n(s). This is because there is no need here to use the state variable s as an argument for the step size, since we are assuming the same model w for all states.

48.1.1 TD(0) Implementation

The main difficulty with iteration (48.9) is that the reference (or target) signal in (48.9), which is represented by v^π(s_n), is unavailable (after all, we are interested in estimating its value). We can nevertheless follow the same argument that led to (46.39) and use the sample realization (46.57) to approximate v^π(s_n), i.e.,

  v^π(s_n) ≈ r(n) + γ v^π_{n−1}(s_{n+1})
           ≈ r(n) + γ h_{n+1}^T w_{n−1}        (48.10)

where the second step uses (48.2), and where we are using the existing iterate, w_{n−1}, to generate the estimate for v^π_{n−1}(s_{n+1}). By doing so, we arrive at the stochastic gradient TD(0) learning algorithm listed in (48.11). Note that we are denoting the TD error by the compact notation δ(n) (while we denoted it before by δ_n(s_n)); this is because it is assumed that the agent now senses the feature vector, h_n, and not the state, s_n. For this reason, we also denote the policy by π(a|h) rather than π(a|s). Note further that the update from w_{n−1} to w_n is now performed at every iteration n, regardless of which feature vector is observed at that time instant. The last line of the listing provides the approximation for v^π(s), for any state s, based on the estimate for w^o.

Stochastic gradient TD(0) for linear state value approximation.
  given a (deterministic or stochastic) policy, π(a|h);
  start with w_{-1} = 0_{M×1}.
  repeat over episodes:
    for each episode, let h_0 denote its initial feature vector
    repeat over episode for n ≥ 0:
      a_n ∼ π(a|h_n)
      observe h_{n+1}
      r(n) = r(h_n, a_n, h_{n+1})
      δ(n) = r(n) + (γ h_{n+1} − h_n)^T w_{n-1}
      w_n = w_{n-1} + μ(n) δ(n) h_n
    end
    w_{-1} ← w_n in preparation for next episode
  end
  w^o ← w_n
  v^π(s) ≈ h_s^T w^o, ∀ s ∈ S.                                            (48.11)
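A minimal Python sketch of the TD(0) recursion in (48.11) with a linear model is shown below. The environment object (reset/step/sample_action) and the feature map feature(s) are assumed interfaces introduced for illustration; they are not part of the text.

  import numpy as np

  def td0_linear(env, feature, M, episodes=20000, gamma=0.9, mu=0.01):
      """Stochastic-gradient TD(0) with the linear model v(s) ~ feature(s)^T w, as in (48.11).
      `env` (reset/step/sample_action) and `feature(s) -> length-M vector` are assumed."""
      w = np.zeros(M)
      for _ in range(episodes):
          s = env.reset()
          h = feature(s)
          done = False
          while not done:
              a = env.sample_action(s)               # action drawn from the given policy
              s_next, r, done = env.step(a)
              h_next = feature(s_next)
              delta = r + (gamma * h_next - h) @ w   # TD error delta(n)
              w = w + mu * delta * h                 # gradient-type update
              h, s = h_next, s_next
      return w                                       # then v(s) ~ feature(s) @ w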

Example 48.1 (Playing a game over a grid) We illustrate the operation of the TD(0) algorithm (48.11) by reconsidering the earlier grid problem from Fig. 44.4; we reproduce the grid in Fig. 48.1 for ease of reference. The figure shows a grid with 16 squares labeled #1 through #16. Four squares are special; these are squares #4, #8, #11, and #16. Squares #8 (danger) and #16 (stars) are terminal states. If the agent reaches one of these terminal states, the agent moves to an EXIT state and the game stops. The agent collects either a reward of +10 at square #16 or a reward of −10 at square #8. The agent collects a reward of −0.1 at all other states.

Figure 48.1 An MDP is defined over a grid with 16 tiles. The actions available for the agent are A = {UP, DOWN, LEFT, RIGHT, STOP}. If the agent hits the boundaries of the grid, or the obstacle squares at #4 and #11, then it stays put in its location. The arrows in the figure at state s = 7 indicate that there is a 70% chance that the agent will move UPWARD and a 30% chance that the agent will move in a perpendicular direction (either right or left).

In this simulation, we will first represent each state s by a reduced feature vector of size M = 4 with binary entries defined as follows:

  h = [ b1  b2  b3  b4 ]^T        (48.12a)
  b1 = 1 if agent is on same row as SUCCESS, otherwise b1 = 0        (48.12b)
  b2 = 1 if agent is on same row as DANGER, otherwise b2 = 0        (48.12c)
  b3 = 1 if agent is in rightmost two columns, otherwise b3 = 0        (48.12d)
  b4 = 1 if agent is in leftmost two columns, otherwise b4 = 0        (48.12e)

If we collect all transposed feature vectors into a matrix H of size 17 × 4 we get expression (48.13). Note that locations s = 4 and s = 11 are not actual states since they are not visited by the agent; we assign feature vectors to them for convenience of presentation. We are also not adding a bias value of +1 to the feature vectors because, for this example, the state value function at the EXIT state s = 17 is zero, namely, v^π(17) = 0:

           b1  b2  b3  b4
  s = 1  [  0   0   1   0 ]
  s = 2  [  0   0   1   0 ]
  s = 3  [  0   0   0   1 ]
  s = 4  [  0   0   0   0 ]
  s = 5  [  0   1   0   1 ]
  s = 6  [  0   1   0   1 ]
  s = 7  [  0   1   1   0 ]
  s = 8  [  0   1   1   0 ]
  s = 9  [  0   0   1   0 ]
  s = 10 [  0   0   1   0 ]
  s = 11 [  0   0   0   0 ]
  s = 12 [  0   0   0   1 ]
  s = 13 [  1   0   0   1 ]
  s = 14 [  1   0   0   1 ]
  s = 15 [  1   0   1   0 ]
  s = 16 [  1   0   1   0 ]
  s = 17 [  0   0   0   0 ]        (48.13)

Figure 48.2 State value function v^π(s) estimated by running the TD(0) algorithm (48.11) over 20,000 episodes using the features defined in (48.13). The figure also shows the state values obtained from the Poisson equation (44.72) when the MDP parameters are assumed known.

There is an implicit assumption that, for this selection of feature vectors, the state value function v^π(s) can be well approximated by a linear functional of the form h^T w^o, for some weight vector w^o. This assumption is not always true, and different constructions for the feature vectors lead to different approximations for the state value function. To illustrate this effect, we evaluate v^π(s) in two ways for comparison purposes. In the first method, we assume knowledge of the MDP parameters {π(a|s), P(s, a, s')} and compute the true v^π(s) by using the Poisson equation (44.72) with γ = 0.9; the computer code determines {r^π, P^π} and finds v^π(s). We listed these values earlier in (46.55). In the second method, we employ the TD(0) algorithm (48.11) using the above feature construction and run it over 20,000 episodes using the constant step-size value μ = 0.01 and setting v^π_{-1}(s) to random initial values. Figure 48.2 plots the estimated state values. It is clear from this simulation that the feature vectors used in (48.13) are not well representative of the evolution of the game.


We consider next a more extreme construction for the feature vectors by relying on a one-hot encoded representation, where h now has dimension 17 × 1, with the jth entry of h set to 1 if the agent is located at the jth square and 0 otherwise; the only exception is the last row of H, corresponding to the EXIT state s = 17, where the feature vector is all zeros. We can view this construction as taking a screenshot of the game and identifying the location of the agent in the image. Constructions of this type are viable only for small state-space dimensions; otherwise, one needs to resort to reduced feature representations. In this way, we end up with the new feature matrix H of size 17 × 17, whose first 16 rows coincide with the basis vectors {e_1, . . . , e_16} and whose last row is zero:

  H = [ e_1  e_2  · · ·  e_16  0 ]^T        (48.14)

Figure 48.3 plots the estimated state values for these new feature vectors; the simulation shows an improved match between the estimates and the true values. Of course, this second construction amounts, for all practical purposes, to assuming knowledge of the state values. The example is meant to show that the performance of state value estimation algorithms is dependent on the selection of the feature space.

Figure 48.3 State value function v^π(s) estimated by running the TD(0) algorithm (48.11) over 20,000 episodes using the extended feature representation (48.14). The figure also shows the state values obtained from the Poisson equation (44.72) when the MDP parameters are assumed known.


48.1.2 TD(λ) Implementation

We can similarly derive a stochastic gradient TD(λ) variant. For example, we can follow arguments similar to those employed in Section 46.4 and consider higher-order approximations for v^π(s) in lieu of (48.10), such as:

  v^π(s_n) ≈ U_n^{(2)}(h_n) = r(n) + γ r(n+1) + γ^2 v^π_{n−1}(s_{n+2})
                            = r(n) + γ r(n+1) + γ^2 h_{n+2}^T w_{n−1}        (48.15)

where the first equality uses (46.58) and the second uses (48.2), and where we are denoting the second-order approximation by U_n^{(2)}(h_n) (while we denoted it before by U_n^{(2)}(s_n)). We can also employ higher-order approximations such as:

  v^π(s_n) ≈ U_n^{(P)}(h_n)        (48.16a)
  U_n^{(P)}(h_n) = γ^P h_{n+P}^T w_{n−1} + Σ_{m=0}^{P−1} γ^m r(n + m)        (48.16b)
  r(n + m) = r(h_{n+m}, a = π(a|h = h_{n+m}), h_{n+m+1})        (48.16c)

where (48.16b) and (48.16c) follow from (46.64a) and (46.64b). More broadly, we can combine higher-order approximations by means of a weighting factor λ ∈ (0, 1] and use, in accordance with (46.67),

  v^π(s_n) ≈ U_n^λ(h_n)        (48.17a)
  U_n^λ(h_n) = (1 − λ) Σ_{p=1}^{∞} λ^{p−1} U_n^{(p)}(h_n)        (48.17b)

so that the update for w_n in (48.11) would be replaced by the forward implementation:

  w_n = w_{n−1} + μ(n) h_n ( U_n^λ(h_n) − h_n^T w_{n−1} ),   n ≥ 0        (48.18)

If we now follow the arguments that led to the TD(λ) algorithm (46.87), we arrive at the backward implementation (48.19) for linear approximation models – see Prob. 48.2. Note that we are again employing eligibility trace variables, which are now M × 1 vectors denoted by t_n; they are updated by adding the current feature vector to prior values. Again, the update from w_{n−1} to w_n is performed at every iteration n, regardless of which feature vector is observed at that time instant. The last line of the listing provides the approximation for v^π(s), for any state s, based on the iterate available at time n.

Stochastic gradient TD(λ) for linear state value approximation.
  given a (deterministic or stochastic) policy, π(a|h);
  initial trace vector, t_{-1} = 0_{M×1}; start with w_{-1} = 0_{M×1}.
  repeat over episodes:
    for each episode, let h_0 denote its initial feature vector
    repeat over episode for n ≥ 0:
      a_n ∼ π(a|h_n)
      observe h_{n+1}
      r(n) = r(h_n, a_n, h_{n+1})
      δ(n) = r(n) + (γ h_{n+1} − h_n)^T w_{n-1}
      t_n = λγ t_{n-1} + h_n
      w_n = w_{n-1} + μ(n) δ(n) t_n
    end
    w_{-1} ← w_n, t_{-1} ← t_n in preparation for next episode
  end
  w^o ← w_n
  v^π(s) ≈ h_s^T w^o, ∀ s ∈ S.                                            (48.19)
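The following sketch implements the backward-view updates of listing (48.19) with a vector eligibility trace; the environment and feature-map interfaces are the same assumed placeholders used earlier.

  import numpy as np

  def td_lambda_linear(env, feature, M, episodes=10000, gamma=0.9, lam=0.7, mu=0.01):
      """Backward-view TD(lambda) with linear approximation and eligibility traces,
      in the spirit of (48.19). `env` and `feature` are assumed interfaces."""
      w = np.zeros(M)
      t = np.zeros(M)                               # eligibility trace vector
      for _ in range(episodes):
          s = env.reset()
          h = feature(s)
          done = False
          while not done:
              a = env.sample_action(s)
              s_next, r, done = env.step(a)
              h_next = feature(s_next)
              delta = r + (gamma * h_next - h) @ w  # TD error
              t = lam * gamma * t + h               # trace accumulates the current feature
              w = w + mu * delta * t
              h, s = h_next, s_next
      return w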

48.1.3 True Online TD(λ) Implementation

The same arguments can be applied to the true online TD(λ) algorithm from Section 46.6. For instance, we now employ the approximations:

  U_n^{(1)}(h_n) = r(n) + γ h_{n+1}^T w_{n−1}        (48.20)
  U_n^{(2)}(h_n) = r(s, a, s') + γ r(s', a', s'') + γ^2 h_{n+2}^T w_n        (48.21)
  U_n^{(3)}(h_n) = r(s, a, s') + γ r(s', a', s'') + γ^2 r(s'', a'', s''') + γ^3 h_{n+3}^T w_{n+1}        (48.22)

and, more generally, for a forward-looking approximation of order P:

  U_n^{(P)}(h_n) = γ^P h_{n+P}^T w_{n+P−2} + Σ_{m=0}^{P−1} γ^m r(n + m)        (48.23)

We further combine these utility approximations in the same weighted manner as (46.100) to obtain U_{n,N}^λ(h_n), i.e.,

  U_{n,N}^λ(h_n) = (1 − λ) Σ_{p=1}^{N−n+1} λ^{p−1} U_n^{(p)}(h_n) + λ^{N−n+1} U_n^{(N−n+2)}(h_n)        (48.24)


and then replace the update in (48.11) by:

  w_{n,N} = w_{n−1,N} + μ(n) h_n ( U_{n,N}^λ(h_n) − h_n^T w_{n−1,N} ),   0 ≤ n ≤ N        (48.25)

Note that we are adding a second subscript to the estimate for the parameter, w_{n,N}, where the subscript N is used to indicate that the resulting iterates are based on a finite horizon of length N. The final desired iterate is given by w_N = w_{N,N} and it would allow us to recover the state value function at time N:

  v_N^π(s) ≈ h_N^T w_N        (48.26)

We explain next how to derive a recursion that updates the endpoint w_{n−1} directly to the next endpoint w_n.

Deriving a recursion
To begin with, some simple algebra allows us to deduce from the expression for U_{n,N}^λ(h_n) that (recall Prob. 46.9):

  U_{n,N+1}^λ(h_n) − U_{n,N}^λ(h_n) = (γλ)^{N−n+1} δ(N + 1)        (48.27)

where the earlier expression (46.66c) for the temporal difference δ_n is modified to

  δ(n) = r(n) + γ h_{n+1}^T w_{n−1} − h_n^T w_{n−2}        (48.28)

Moreover, the difference w_{n−1,N+1} − w_{n−1,N} admits the following representation:

  w_{n−1,N+1} − w_{n−1,N} = (γλ)^{N−n+2} δ(N + 1) t_{n−1}        (48.29)

for some vector t_{n−1}. This is certainly true for n = 1 if we start from the boundary conditions w_{−1,N+1} = w_{−1,N} = 0. Indeed, using (48.25) we have

  w_{0,N} = μ(0) h_0 U_{0,N}^λ(h_0)        (48.30)
  w_{0,N+1} = μ(0) h_0 U_{0,N+1}^λ(h_0)        (48.31)

so that

  w_{0,N+1} − w_{0,N} = μ(0) h_0 ( U_{0,N+1}^λ(h_0) − U_{0,N}^λ(h_0) )
                      = μ(0) h_0 (γλ)^{N−n+1} δ(N + 1)        (48.32)

where the last step uses (48.27). This is of the same form as (48.29) with t_0 = μ(0) h_0. We proceed by induction. Assume form (48.29) is valid at n − 1. Then, some algebra using (48.25) and (48.27) shows that relation (48.29) holds at time n as well (see Prob. 46.10):

  w_{n,N+1} − w_{n,N} = (γλ)^{N−n+1} δ(N + 1) t_n        (48.33)

in terms of new trace vector variables computed as

  t_n = γλ t_{n−1} + μ(n) h_n − μ(n) γλ h_n (h_n^T t_{n−1}),   t_{−1} = 0        (48.34)

2017

We are now ready to derive an equivalent backward view recursion that focuses on propagating the desired endpoints wn . Indeed, using (48.25) and (48.29), we can write λ wn,n = wn−1,n + µ(n)hn Un,n (hn ) − hT n wn−1,n

wn−1,n = wn−1,n−1 + γλδ(n)tn−1



(48.35) (48.36)

Noting from definition (48.24) that λ Un,n (hn ) = Un(1) (hn )

(48.20)

=

r(n) + γhT n+1 wn−1

(48.28)

=

δ(n) + hT n wn−2

(48.37)

and substituting (48.36) into (48.35), we arrive at   T wn,n = wn−1,n−1 + δ(n)tn + µ(n)hn hT n wn−2 − hn wn−1

(48.38)

In summary, we arrive at the following listing for the true online TD(λ) algorithm:

True online TD(λ) for linear state value approximation. given a (deterministic or stochastic) policy π(a|h); initial trace values, t−1 = 0M ×1 ; start from w−1 = w−2 = 0M ×1 . repeat over episodes: for each episode, let h0 denote its initial feature vector repeat over episode for n ≥ 0: an ∼ π(a|hn ) observe hn+1 r(n) = r(hn , an , hn+1 ) T δ(n) = r(n) + γhT n+1 wn−1 − hn wn−2 T tn = γλtn−1 + µ(n)hn − µ(n)γλh  n (hn tn−1 )  T wn = wn−1 + δ(n)tn + µ(n)hn hn wn−2 − hT n wn−1

end w−1 ← wn , w−2 = wn−1 , t−1 = tn in preparation for next episode end wo ← wn o v π (s) ≈ hT s w , ∀ s ∈ S. (48.39)


48.2 LEAST-SQUARES TD-LEARNING

Instead of the MSE criterion used in (48.7), we can seek the model w in (48.2) by minimizing a regularized least-squares risk of the form:

  w* = argmin_{w∈IR^M} { ρ‖w‖^2 + Σ_{n=0}^{N−1} ( v^π(s_n) − h_n^T w )^2 },   ρ > 0        (48.40)

where we now denote the solution by w? . The minimization is written over data arising from one episode extending from time n = 0 to n = N − 1. In this formulation, the scalar ρ is a regularization parameter where a large ρ favors solutions w? with small norm. In Chapter 50 we will show that a recursive procedure can be derived to update the solution to least-squares problems of the form (48.40). Let wN −1 denote the above solution w? , where the subscript N − 1 is used to indicate that the solution is based on data up to time N − 1. Likewise, let wN denote the solution to the same problem except that the data runs now up to time N :



  w_N = argmin_{w∈IR^M} { ρ‖w‖^2 + Σ_{n=0}^{N} ( v^π(s_n) − h_n^T w )^2 }        (48.41)

Then, the recursive least-squares algorithm (50.123) derived in that chapter allows us to update wN −1 to wN and from there to wN +1 and so forth. We leave the derivation to the relevant section; here, we are only interested in the form of the algorithm. Applying the future recursive least-squares recursions (50.123) to problem (48.40), and using (48.10), we arrive at listing (48.43), where the symbol Pn refers to an M × M Riccati matrix variable used by the algorithm at iteration n. The last line provides the approximation for v π (s), for any state s, based on w? . When there are multiple episodes, we can apply the algorithm repeatedly with the initial conditions {P−1 , w−1 } for each episode set to the final values obtained from the previous episode. Doing so would amount to solving a least-squares problem of the following form in place of (48.40): (

  w* = argmin_{w∈IR^M} { ρ‖w‖^2 + Σ_{e=1}^{E} Σ_{n=0}^{N−1} ( v^π(s_n^{(e)}) − (h_n^{(e)})^T w )^2 }        (48.42)

where e = 1, 2, . . . , E is the episode index and {sn , hn } refer to the corresponding state and feature variables.


Least-squares TD learning for linear state value approximation.
  given a (deterministic or stochastic) policy π(a|h);
  start with P_{-1} = ρ^{-1} I_{M×M}, w_{-1} = 0_{M×1}.
  repeat over episodes:
    for each episode, let h_0 denote its initial feature vector
    repeat over episode for n ≥ 0:
      a_n ∼ π(a|h_n)
      observe h_{n+1}
      r(n) = r(h_n, a_n, h_{n+1})
      τ(n) = 1/(1 + h_n^T P_{n-1} h_n)
      g_n = P_{n-1} h_n τ(n)
      δ(n) = r(n) + (γ h_{n+1} − h_n)^T w_{n-1}
      w_n = w_{n-1} + δ(n) g_n
      P_n = P_{n-1} − g_n g_n^T / τ(n)
    end
    P_{-1} ← P_n, w_{-1} ← w_n in preparation for next episode
  end
  w* ← w_n
  v^π(s) ≈ h_s^T w*, s ∈ S.                                              (48.43)
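The recursive updates of listing (48.43) can be sketched as follows, again under the assumed environment and feature-map interfaces; P plays the role of the M × M Riccati variable.

  import numpy as np

  def lstd_recursive(env, feature, M, episodes=5000, gamma=0.9, rho=1.0):
      """Recursive least-squares TD updates in the spirit of listing (48.43).
      `env` and `feature` are assumed interfaces introduced for illustration."""
      w = np.zeros(M)
      P = (1.0 / rho) * np.eye(M)
      for _ in range(episodes):
          s = env.reset()
          h = feature(s)
          done = False
          while not done:
              a = env.sample_action(s)
              s_next, r, done = env.step(a)
              h_next = feature(s_next)
              tau = 1.0 / (1.0 + h @ P @ h)          # conversion factor tau(n)
              g = P @ h * tau                        # gain vector g_n
              delta = r + (gamma * h_next - h) @ w   # TD error delta(n)
              w = w + delta * g
              P = P - np.outer(g, g) / tau           # Riccati update
              h, s = h_next, s_next
      return w, P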

48.3 PROJECTED BELLMAN LEARNING

One of the main disadvantages of the stochastic gradient and least-squares TD algorithms (48.11) and (48.43) is that they do not enforce condition (44.117), namely, the equality

  (I − γ P^π) v^π = r^π        (48.44)

and as such they can face convergence difficulties, with their learning parameters growing unbounded, especially when used in off-policy scenarios. In this section, we derive better-performing algorithms by minimizing instead the projected Bellman cost introduced earlier in (44.131a), namely,

  J_PB(w) = (1/2) ( H^T D r^π − B w )^T (H^T D H)^{-1} ( H^T D r^π − B w )        (48.45a)
  B = H^T D (I − γ P^π) H        (48.45b)

We will derive two algorithms for this purpose: one is known as the TDC algorithm and the other is known as the GTD2 algorithm. The algorithms are largely identical except for a minor difference in how they approximate the gradient vector. Also, the TDC version will be a variation of the TD(0) algorithm (48.11) with a correction term; hence the name TDC, with the letter "C" referring to "correction." The reason for the designation GTD2 is that there was an earlier GTD algorithm proposed in the literature, but its performance is problematic compared to GTD2 (for this reason, we do not describe GTD here). For generality, we derive the algorithms to operate under both on-policy and off-policy strategies.

48.3.1 Selecting the Weighting Matrix D

Recall that the weight matrix D is a free parameter that the designer can choose. We explain how to select D in a convenient manner to facilitate the derivation of the learning algorithms. We explained earlier in Section 44.1.2 that, for an MDP M = (S, A, P, r), the process of transitioning from one state to another forms a Markov chain with transition probability matrix given by P^π and whose entries were computed via (44.31):

  p^π_{s,s'} = P(s' = s' | s = s) = Σ_{a∈A} π(a|s) P(s, a, s')        (48.46)

We further explained in the concluding remarks of Chapter 38 that, when the Markov chain is aperiodic and irreducible, the transition matrix P^π will be primitive. This means that an eigenvector of P^π, denoted now by d^π of size |S| × 1, will exist with only strictly positive entries that add up to 1 (recall (38.142)):

  (P^π)^T d^π = d^π,   1^T d^π = 1,   d^π(s) > 0, ∀ s ∈ S        (48.47)

Here, the notation d^π(s) denotes the sth entry of d^π. We refer to d^π as the Perron vector of P^π and observe that its entries can be used to define a probability distribution. We further know from relation (38.124) that this vector determines the stationary probability distribution of the Markov chain under the target policy π(a|s). This explains why we are attaching a superscript π to d^π.

Now, assume that during the learning process the agent follows some behavior policy, φ(a|s). That is, the action and state visitations are based on φ(a|s) rather than π(a|s). When φ(a|s) = π(a|s), the agent is operating on-policy; otherwise, the agent is operating off-policy. We let d^φ denote the stationary distribution of the Markov chain under the behavior policy φ(a|s), with individual entries denoted by d^φ(s), i.e.,

  (P^φ)^T d^φ = d^φ,   1^T d^φ = 1,   d^φ(s) > 0, ∀ s ∈ S        (48.48)

where the entries of the transition matrix P^φ under policy φ(a|s) are given by:

  p^φ_{s,s'} = Σ_{a∈A} φ(a|s) P(s, a, s')        (48.49)

where the entries of the transition matrix P^φ under the policy φ(a|s) are given by

    p^φ_{s,s'} = Σ_{a∈A} φ(a|s) P(s, a, s')                                      (48.49)


The entries of d^φ can again be used to define a probability distribution. We now select D as the diagonal matrix with entries {d^φ(s)}:

    D ≜ diag{d^φ} = D^φ                                                          (48.50)

and introduce the importance weights:

    ξ(s, a) ≜ π(a|s)/φ(a|s)   ⟺   π(a|s) = φ(a|s) ξ(s, a)                        (48.51)

These weights measure the level of dissimilarity between the target (π) and behavior (φ) policies. Recall further that we defined in (44.40), for every state s, its expected one-step reward:

    r^π(s) ≜ E_{π,P} r(s, a, s')                                                 (48.52)

and found in (44.44) that r^π(s) is given by

    r^π(s) = Σ_{a∈A} π(a|s) ( Σ_{s'∈S} P(s, a, s') r(s, a, s') )                 (48.53)

We also collected all these rewards into a vector r^π = col{r^π(s)}.
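The following is a minimal numerical sketch (not from the text) that assembles the quantities (48.46), (48.47), and (48.53) for a small synthetic MDP; the sizes, the random kernel, and the variable names are illustrative assumptions.

```python
# Sketch: computing P^pi, r^pi, and the Perron vector d^pi for a synthetic MDP.
import numpy as np

rng = np.random.default_rng(0)
S, A = 5, 3                                   # hypothetical sizes

# random transition kernel P(s, a, s') and reward r(s, a, s')
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
r = rng.standard_normal((S, A, S))

# random target policy pi(a | s)
pi = rng.random((S, A)); pi /= pi.sum(axis=1, keepdims=True)

# (48.46): P^pi[s, s'] = sum_a pi(a|s) P(s, a, s')
P_pi = np.einsum('sa,sap->sp', pi, P)

# (48.53): r^pi(s) = sum_a pi(a|s) sum_{s'} P(s, a, s') r(s, a, s')
r_pi = np.einsum('sa,sap,sap->s', pi, P, r)

# (48.47): Perron vector, i.e., eigenvector of (P^pi)^T with eigenvalue 1
eigvals, eigvecs = np.linalg.eig(P_pi.T)
d_pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
d_pi = d_pi / d_pi.sum()                      # normalize so the entries add up to 1

assert np.allclose(P_pi.sum(axis=1), 1.0)
assert np.allclose(P_pi.T @ d_pi, d_pi)
print("d_pi =", np.round(d_pi, 4))
```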

48.3.2 Equivalent Representation for Bellman Error

Using D = D^φ in (48.45a), we can expand each of the terms (H^T D r^π − B w) and H^T D H as follows:

    H^T D H = Σ_{s∈S} d^φ(s) h_s h_s^T
            = Σ_{s∈S} [ Σ_{a∈A} π(a|s) Σ_{s'∈S} P(s, a, s') ] d^φ(s) h_s h_s^T      (the bracketed factor equals 1)
            = Σ_{s∈S} Σ_{a∈A} Σ_{s'∈S} π(a|s) P(s, a, s') d^φ(s) h_s h_s^T
     (48.51)= Σ_{s∈S} Σ_{a∈A} Σ_{s'∈S} φ(a|s) ξ(s, a) P(s, a, s') d^φ(s) h_s h_s^T
            = E_{d^φ,φ,P} { ξ(s, a) h_s h_s^T }                                     (48.54)

where the expectation is relative to the randomness defined by the behavior policy φ(a|s), the state transition probabilities P, and the state likelihoods d^φ(s)


under the Markovian evolution defined by the behavior policy. Similarly, we have

    B w = H^T D (I − γ P^π) H w
        = Σ_{s∈S} d^φ(s) h_s ( h_s − γ Σ_{s'∈S} p^π_{s,s'} h_{s'} )^T w
        = Σ_{s∈S} d^φ(s) h_s ( h_s − γ Σ_{a∈A} Σ_{s'∈S} π(a|s) P(s, a, s') h_{s'} )^T w
        = Σ_{s∈S} Σ_{a∈A} Σ_{s'∈S} π(a|s) P(s, a, s') d^φ(s) h_s (h_s − γ h_{s'})^T w
 (48.51)= Σ_{s∈S} Σ_{a∈A} Σ_{s'∈S} φ(a|s) ξ(s, a) P(s, a, s') d^φ(s) h_s (h_s − γ h_{s'})^T w             (48.55)

and

    H^T D r^π = Σ_{s∈S} d^φ(s) h_s r^π(s)
        (48.53)= Σ_{s∈S} d^φ(s) h_s ( Σ_{a∈A} Σ_{s'∈S} π(a|s) P(s, a, s') r(s, a, s') )
               = Σ_{s∈S} Σ_{a∈A} Σ_{s'∈S} π(a|s) P(s, a, s') d^φ(s) h_s r(s, a, s')
        (48.51)= Σ_{s∈S} Σ_{a∈A} Σ_{s'∈S} φ(a|s) ξ(s, a) P(s, a, s') d^φ(s) h_s r(s, a, s')               (48.56)

It follows that

    H^T D r^π − B w = Σ_{s∈S} Σ_{a∈A} Σ_{s'∈S} φ(a|s) ξ(s, a) P(s, a, s') d^φ(s) h_s ( r(s, a, s') + γ h_{s'}^T w − h_s^T w )
                    = E_{d^φ,φ,P} { ξ(s, a) h_s δ(w) }                                                    (48.57)

where we introduced the random variable (dependent on the unknown w):

    δ(w) ≜ r(s, a, s') + γ h_{s'}^T w − h_s^T w                                                           (48.58)

We conclude that the projected Bellman cost (48.45a) can be rewritten equivalently as the product of three expectation terms:

    J_PB(w) = (1/2) ( E_{d^φ,φ,P} ξ(s, a) h_s δ(w) )^T ( E_{d^φ,φ,P} ξ(s, a) h_s h_s^T )^{−1} ( E_{d^φ,φ,P} ξ(s, a) h_s δ(w) )        (48.59)
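The sketch below is a Monte Carlo sanity check (illustrative, not from the text) of the equivalent representation: it estimates E{ξ h_s h_s^T} and E{ξ h_s δ(w)} by sampling (s, a, s') under a behavior policy and compares them with the matrix expressions in (48.45b) and (48.57). All sizes and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, M, gamma = 6, 3, 4, 0.9

P   = rng.random((S, A, S)); P  /= P.sum(axis=2, keepdims=True)   # kernel
r   = rng.standard_normal((S, A, S))                              # rewards
pi  = rng.random((S, A));   pi  /= pi.sum(axis=1, keepdims=True)  # target policy
phi = rng.random((S, A));   phi /= phi.sum(axis=1, keepdims=True) # behavior policy
H   = rng.standard_normal((S, M))                                 # rows h_s^T
w   = rng.standard_normal(M)

# stationary distribution d^phi of the behavior chain P^phi
P_phi = np.einsum('sa,sap->sp', phi, P)
vals, vecs = np.linalg.eig(P_phi.T)
d_phi = np.real(vecs[:, np.argmin(np.abs(vals - 1))]); d_phi /= d_phi.sum()

# matrix expressions: H^T D H and H^T D r^pi - B w
D     = np.diag(d_phi)
P_pi  = np.einsum('sa,sap->sp', pi, P)
r_pi  = np.einsum('sa,sap,sap->s', pi, P, r)
B     = H.T @ D @ (np.eye(S) - gamma * P_pi) @ H
A_mat = H.T @ D @ H
b_vec = H.T @ D @ r_pi - B @ w

# sample averages of xi*h*h^T and xi*h*delta(w) under (d^phi, phi, P)
N = 100_000
A_hat = np.zeros((M, M)); b_hat = np.zeros(M)
for _ in range(N):
    s  = rng.choice(S, p=d_phi)
    a  = rng.choice(A, p=phi[s])
    sp = rng.choice(S, p=P[s, a])
    xi = pi[s, a] / phi[s, a]
    delta = r[s, a, sp] + gamma * H[sp] @ w - H[s] @ w
    A_hat += xi * np.outer(H[s], H[s]) / N
    b_hat += xi * H[s] * delta / N

print(np.max(np.abs(A_hat - A_mat)), np.max(np.abs(b_hat - b_vec)))  # small deviations
```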

48.3.3 Gradient Correction Algorithm (TDC)

Our objective is to minimize this cost over w. Since the quantities P(s, a, s') and r(s, a, s') are not known, the expectations cannot be computed and we need to resort to stochastic approximations. However, since the cost function involves a product of expectation terms, direct approximation of the means by sample realizations will generally introduce bias. We proceed instead as follows. We first combine the product of the last two terms in J_PB(w) into the auxiliary column vector (which is a function of w):

    λ^o(w) ≜ ( E_{d^φ,φ,P} ξ(s, a) h_s h_s^T )^{−1} E_{d^φ,φ,P} ξ(s, a) h_s δ(w)                          (48.60)

It is easy to see that this vector is the solution to the MSE estimation problem:

    λ^o = argmin_λ { J(λ) ≜ (1/2) E_{d^φ,φ,P} ( √ξ(s, a) δ(w) − √ξ(s, a) h_s^T λ )^2 }                    (48.61)

We can apply a stochastic gradient recursion to estimate λ^o from sample realizations. For this purpose, we note that the gradient vector of the above MSE relative to λ is given by (after switching the expectation and differentiation operations, which is possible under certain technical conditions when the function under expectation and its gradient are continuous functions of λ – recall the discussion in Appendix 16.A on the dominated convergence theorem):

    ∇_{λ^T} J(λ) = −E_{d^φ,φ,P} ( √ξ(s, a) δ(w) − √ξ(s, a) h_s^T λ ) √ξ(s, a) h_s
                 = −E_{d^φ,φ,P} ξ(s, a) ( δ(w) − h_s^T λ ) h_s                                            (48.62)

If we drop the expectation operator and employ a sample approximation for the gradient vector, we arrive at the following stochastic gradient recursion for learning λ^o (where we are now replacing state variables s_n by their feature representations h_n):

    ξ(n) = π(a_n | h_n) / φ(a_n | h_n)                                                                    (48.63a)
    δ(n) = r(n) + γ h_{n+1}^T w_{n−1} − h_n^T w_{n−1}                                                     (48.63b)
    λ_n = λ_{n−1} + μ_λ(n) ξ(n) h_n ( δ(n) − h_n^T λ_{n−1} )                                              (48.63c)

in terms of a step-size sequence μ_λ(n), and where we are denoting the importance sampling factor by the compact notation ξ(n) = ξ(h_n, a_n). The expression for δ(n) is in terms of the weight iterate w_{n−1}, which we still need to evaluate. For this purpose, we use expression (48.59) to evaluate the gradient vector of J_PB(w) relative to w and express the result in terms of the auxiliary variable λ^o from (48.60). Thus, note that the first and last expectation terms in J_PB(w) are functions of w, so that the cost function has the generic form J_PB(w) = (1/2) (b(w))^T A^{−1} b(w) for some function b(w). Using the result of Prob. 2.4 on vector differentiation, we


find that (where we drop the subscripts {d^φ, φ, P} from the expectations in the first line for compactness of notation):

    ∇_w J_PB(w) = ( E ξ(s, a) h_s δ(w) )^T ( E ξ(s, a) h_s h_s^T )^{−1} ∇_w ( E ξ(s, a) h_s δ(w) )
                = (λ^o)^T ∇_w ( E ξ(s, a) h_s δ(w) )
                = (λ^o)^T E_{d^φ,φ,P} { ξ(s, a) h_s (γ h_{s'} − h_s)^T }
         (48.60)= γ (λ^o)^T E_{d^φ,φ,P} ( ξ(s, a) h_s h_{s'}^T ) − ( E_{d^φ,φ,P} ξ(s, a) h_s δ(w) )^T      (48.64)

If we now drop the expectations, employ sample realizations, and replace λ^o by its estimate λ_{n−1}, we obtain the following recursion for estimating w:

    w_n = w_{n−1} + μ_w(n) ξ(n) ( δ(n) h_n − γ (h_n^T λ_{n−1}) h_{n+1} )                                  (48.65)

with a step-size sequence μ_w(n). We therefore arrive at the listing shown in (48.66).

Gradient correction algorithm (TDC) for either on-policy or off-policy linear state value approximation.

  given target and behavior policies π(a|h) and φ(a|h);
  start with λ_{−1} = 0_{M×1}, w_{−1} = 0_{M×1}.
  repeat over episodes:
    for each episode, let h_0 denote its initial feature vector
    repeat over episode for n ≥ 0:
      a_n ∼ φ(a|h_n)
      observe h_{n+1}
      r(n) = r(h_n, a_n, h_{n+1})
      ξ(n) = π(a_n|h_n) / φ(a_n|h_n)
      δ(n) = r(n) + (γ h_{n+1} − h_n)^T w_{n−1}
      λ_n = λ_{n−1} + μ_λ(n) ξ(n) h_n ( δ(n) − h_n^T λ_{n−1} )
      w_n = w_{n−1} + μ_w(n) ξ(n) ( δ(n) h_n − γ (h_n^T λ_{n−1}) h_{n+1} )
    end
    w_{−1} ← w_n, λ_{−1} ← λ_n in preparation for next episode
  end
  w^⋆ ← w_n
  v^π(s) ≈ h_s^T w^⋆,  ∀ s ∈ S.                                                  (48.66)
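The following is an illustrative Python sketch of the TDC recursions in (48.66). The `env` object (with `reset`/`step` returning feature vectors) and the `pi`/`phi` objects (with `sample` and `prob` methods) are assumed interfaces, not APIs from the text; the iterates are NumPy vectors.

```python
def tdc_episode(env, pi, phi, w, lam, gamma=0.9, mu_w=0.01, mu_lam=0.01):
    """One episode of the TDC recursions; returns the updated (w, lam)."""
    h, done = env.reset(), False              # h_0: initial feature vector
    while not done:
        a = phi.sample(h)                     # a_n ~ phi(a | h_n)
        h_next, reward, done = env.step(a)    # observe h_{n+1} and r(n)
        xi = pi.prob(h, a) / phi.prob(h, a)   # importance weight xi(n)
        delta = reward + (gamma * h_next - h) @ w
        # both updates use lambda_{n-1}, as in listing (48.66)
        lam_new = lam + mu_lam * xi * (delta - h @ lam) * h
        w = w + mu_w * xi * (delta * h - gamma * (h @ lam) * h_next)
        lam = lam_new
        h = h_next
    return w, lam
```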

The algorithm can be used for on-policy evaluation by setting ξ(n) = 1 for all n. If desired, the step-size parameters {µλ (n), µw (n)} can be set to small


constant values. The qualification "gradient correction" for this algorithm refers to the fact that, apart from a correction term, the update for w_n is similar to the one used by the TD(0) algorithm in (48.11). The difference becomes evident if we write both updates next to each other for on-policy learning (when ξ(n) = 1):

    w_n = w_{n−1} + μ_w(n) h_n δ(n)                                       (gradient TD(0), from (48.11))     (48.67a)
    w_n = w_{n−1} + μ_w(n) h_n δ(n) − γ μ_w(n) h_{n+1} (h_n^T λ_{n−1})    (TDC, from (48.66))                (48.67b)

where the last term in (48.67b) is the correction.

48.3.4 GTD2 Algorithm

Gradient TD (GTD2) algorithm for either on-policy or off-policy linear state value approximation. given target and behavior policies π(a|h) and φ(a|h); start with λ−1 = 0M ×1 , w−1 = 0M ×1 . repeat over episodes: for each episode, let h0 denote its initial feature vector repeat over episode for n ≥ 0: an ∼ φ(a|hn ) observe hn+1 r(n) = r(hn , an , hn+1 ) ξ(n) = π(an |hn )/φ(an |hn ) δn = r(n) + (γhn+1 − hn )T wn−1 λn = λn−1 + µλ (n)ξ(n)hn (δn − hT n λn−1 ) wn = wn−1 + µw (n)ξ(n) (hn − γhn+1 ) hT n λn−1 end w−1 ← wn , λ−1 ← λn in preparation for next episode end w? ← wn ? v π (s) ≈ hT s w , ∀ s ∈ S.

(48.68)

To arrive at the above algorithm, we return to the gradient calculation (48.64) and stay with the form given in the third equality, rather than the form from the last equality, i.e., we now use

    ∇_w J_PB(w) = (λ^o)^T E_{d^φ,φ,P} { ξ(s, a) h_s (γ h_{s'} − h_s)^T }          (48.69)

In this way, a stochastic gradient iteration to estimate w would take the form (compare with (48.65)):

    w_n = w_{n−1} + μ_w(n) ξ(n) (h_n − γ h_{n+1}) h_n^T λ_{n−1}                   (48.70)


We thus observe that the main difference between the GTD2 and TDC variants is that they use different sample approximations for the same gradient vector.

Example 48.2 (Relation to the Arrow–Hurwicz procedure) We can relate the GTD2 algorithm (48.68) to the Arrow–Hurwicz recursions we encountered earlier in (44.142a)–(44.142b), which were based on seeking the vector w that minimizes the projected Bellman cost (48.45a), namely,

    λ_k = λ_{k−1} − μ_λ(k) H^T D^φ ( H λ_{k−1} + (I − γ P^π) H w_{k−1} − r^π )     (48.71a)
    w_k = w_{k−1} + μ_w(k) H^T (I − γ P^π)^T D^φ H λ_{k−1}                         (48.71b)

with boundary conditions λ_0 = 0, w_0 = 0, and where k ≥ 1 is the iteration index. These iterations assumed the availability of the quantities (P^π, r^π), which are not known during learning. We now explain how these recursions reduce to GTD2. The derivation provides an interpretation for the intermediate variable λ in the GTD2 algorithm as the Lagrange multiplier in the Arrow–Hurwicz formulation.

First, we verify in Prob. 48.3 that the gradient vector that appears in (48.71a) can be rewritten in terms of the moment values of the behavior policy as follows:

    H^T D^φ ( H λ_{k−1} + (I − γ P^π) H w_{k−1} − r^π )
        = E_{d^φ,φ,P} { h_s ( h_s^T λ_{k−1} + (h_s − γ h_{s'})^T w_{k−1} − r(s, a, s') ) ξ(s, a) }        (48.72)

Similarly, we verify in Prob. 48.4 that we can write the gradient vector appearing in (48.71b) under D = D^φ as:

    H^T (I − γ P^π)^T D^φ H λ_{k−1} = E_{d^φ,φ,P} { (h_s − γ h_{s'}) h_s^T ξ(s, a) } λ_{k−1}              (48.73)

Results (48.72) and (48.73) indicate that the gradient directions along which the updates are performed in recursions (48.71a) and (48.71b) correspond to the expected values shown above. These expected values depend on knowledge of the quantities {P(s, a, s'), r(s, a, s')}. When the MDP model is not known, recursions (48.71a)–(48.71b) cannot be applied anymore. However, the expectation results (48.72) and (48.73) suggest that we can replace the actual gradient vectors by stochastic approximations, with the gradients estimated from sample measurements collected over the time variable n as follows:

    δ_n = r(n) + γ h_{n+1}^T w_{n−1} − h_n^T w_{n−1}                                                      (48.74a)
    H^T (I − γ P^π)^T D^φ H λ_{n−1} ≈ ξ(n) (h_n − γ h_{n+1}) h_n^T λ_{n−1}                                (48.74b)
    H^T D^φ ( H λ_{n−1} + (I − γ P^π) H w_{n−1} − r^π ) ≈ −ξ(n) h_n ( δ_n − h_n^T λ_{n−1} )               (48.74c)

Using these approximations in (48.71a)–(48.71b) reduces the recursions to the GTD2 algorithm (48.68).

48.4 SARSA METHODS

We can also apply linear approximation models directly to the estimation of the state–action value function, q^π(s, a). In this case, we associate feature vectors,


now denoted by f_{s,a} ∈ IR^T, with every state–action pair (s, a) ∈ S × A, where the dimension T satisfies T ≪ |S| × |A|:

    f_{s,a} : S × A → IR^T                                                        (48.75)

48.4.1 Feature Representation

There are many ways by which these feature vectors can be constructed. For example, one immediate choice is to construct f_{s,a} from the concatenation of two parts, say, as

    f_{s,a} = col{ h_s, e_a } ∈ IR^T                                              (48.76)

where h_s is the feature vector associated with the state s ∈ S and e_a is a one-hot encoded representation for the action a ∈ A (i.e., a basis vector in IR^{|A|}). In this case, the dimension T would be T = M + |A|. We illustrate later, at the end of Example 49.1 in the context of policy gradient methods, that this particular construction is not always meaningful and can lead to erroneous behavior. Other constructions for f_{s,a} are of course possible. For instance, in future Example 49.5 we introduce the following representation. We collect all transposed feature vectors {h_s} for the states into a matrix H whose rows are h_s^T:

    H = col{ h_1^T, h_2^T, . . . , h_{|S|}^T },   (|S| × M)                       (48.77)

We also collect all transposed feature vectors {e_a} for the actions into a second matrix A:

    A = col{ e_1^T, e_2^T, . . . , e_{|A|}^T },   (|A| × |A|)                     (48.78)

We compute the Kronecker product of H and A and obtain a new matrix F:

    F ≜ H ⊗ A,   (|S| |A| × M |A|)                                                (48.79)

Each row in F would then correspond to the transpose of a feature vector f_{s,a} for a state–action pair (s, a). For any particular (s, a), the row in F that corresponds to it has index

    row index = (s − 1) × |A| + a,   s = 1, 2, . . . , |S|,  a = 1, 2, . . . , |A|        (48.80)
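A minimal sketch of this Kronecker construction is shown below; the sizes and the sample (s, a) pair are illustrative assumptions (not values from the text), and the indexing converts the 1-based formula (48.80) to NumPy's 0-based rows.

```python
import numpy as np

S, A, M = 4, 3, 2
H = np.arange(S * M, dtype=float).reshape(S, M)   # rows h_s^T
E = np.eye(A)                                     # rows e_a^T (one-hot actions)
F = np.kron(H, E)                                 # (S*A) x (M*A), cf. (48.79)

s, a = 3, 2                                       # 1-indexed, as in (48.80)
row = (s - 1) * A + a                             # row index from (48.80)
f_sa = F[row - 1]                                 # convert to 0-based indexing
assert np.allclose(f_sa, np.kron(H[s - 1], E[a - 1]))
print(f_sa)
```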


It is generally useful to assume the feature vectors have been centered around their means, i.e., to replace f_{s,a} by

    f_{s,a} ← f_{s,a} − Σ_{a'∈A} π(a'|s) f_{s,a'}                                 (48.81)

where the expectation is over the distribution of the actions when the state is at location s. Once available, the feature vectors f_{s,a} can be used to approximate the state–action values by a linear parametric model of the form

    q^π(s, a) ≈ f_{s,a}^T c                                                       (48.82)

for some vector c ∈ IR^T. We consider two implementations for estimating c.

48.4.2 SARSA(0) Implementation

First, in a manner similar to (48.7), we consider the MSE criterion

    c^o = argmin_{c∈IR^T} { J(c) ≜ (1/2) E ( q^π(s, a) − f_{s,a}^T c )^2 }        (48.83)

where the state value function is replaced by the state–action value function. Computing the gradient vector of the MSE relative to c gives

    ∇_{c^T} J(c) = −E ( q^π(s, a) − f_{s,a}^T c ) f_{s,a}                         (48.84)

A stochastic gradient algorithm for solving (48.83) then takes the form

    c_n = c_{n−1} + μ(n) f_n ( q^π(s_n, a_n) − f_n^T c_{n−1} ),   n ≥ 0           (48.85)

where μ(n) denotes the step-size sequence, and f_n represents the feature vector that corresponds to the state–action pair (s_n, a_n) occurring at time n, i.e., f_n = f_{s_n,a_n}. One difficulty with this recursion is that the value of q^π(s_n, a_n) is not known beforehand. Following an argument similar to (47.1)–(47.2), we employ the sample approximation

    q^π(s_n, a_n) ≈ r(n) + γ q^π_{n−1}(s_{n+1}, a_{n+1}) = r(n) + γ f_{n+1}^T c_{n−1}        (48.86)

where r(n) is the reward received in transitioning from state hn to state hn+1 under action an , namely, r(n) = r(hn , an , hn+1 ). We therefore arrive at the following stochastic gradient SARSA(0) algorithm listed in (48.87).


Stochastic SARSA(0) for linear state–action approximation.

  given a (deterministic or stochastic) policy model, π(a|h);
  start with c_{−1} = 0_{T×1}.
  repeat over episodes:
    let (f_0, h_0) denote initial feature vectors for episode
    let a_0 denote its initial action
    repeat over episode for n ≥ 0:
      observe h_{n+1} using action a_n
      r(n) = r(h_n, a_n, h_{n+1})
      a_{n+1} ∼ π(a|h_{n+1})
      observe f_{n+1}
      β(n) = r(n) + (γ f_{n+1} − f_n)^T c_{n−1}
      c_n = c_{n−1} + μ(n) β(n) f_n
    end
    c_{−1} ← c_n in preparation for next episode
  end
  c^o ← c_n
  q^π_n(s, a) ≈ f_{s,a}^T c^o,  ∀ (s, a) ∈ S × A.                                 (48.87)
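An illustrative Python sketch of these recursions is given next. The `env`, `pi`, and `feat` interfaces (environment stepping, action sampling, and the map from a state feature and action to f_{s,a}) are assumptions for the sketch, not APIs from the text.

```python
def sarsa0_episode(env, pi, feat, c, gamma=0.9, mu=0.01):
    """One episode of stochastic SARSA(0) with linear features; returns c."""
    h, done = env.reset(), False
    a = pi.sample(h)                        # initial action a_0
    while not done:
        h_next, reward, done = env.step(a)  # observe h_{n+1} and r(n)
        a_next = pi.sample(h_next)          # a_{n+1} ~ pi(a | h_{n+1})
        f, f_next = feat(h, a), feat(h_next, a_next)
        beta = reward + (gamma * f_next - f) @ c
        c = c + mu * beta * f
        h, a = h_next, a_next
    return c
```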

48.4.3 SARSA(λ) Implementation

We can similarly derive a stochastic gradient SARSA(λ) variant. For example, we can follow arguments similar to those employed in Section 47.1 and consider higher-order approximations for q^π(s, a) in lieu of (48.86), such as

    q^π(s_n, a_n) ≈ r(n) + γ r(n + 1) + γ^2 q^π_{n−1}(s_{n+2}, a_{n+2})
                  ≈ r(n) + γ r(n + 1) + γ^2 f_{n+2}^T c_{n−1}
                  ≜ Q_n^{(2)}(f_n, a_n)                                           (48.88)

where we are denoting the second-order approximation by Q_n^{(2)}(f_n, a_n) (while we denoted it before by Q_n^{(2)}(s_n, a_n)). We can also employ higher-order approximations such as:

    q^π(s_n, a_n) ≈ Q_n^{(P)}(f_n, a_n)                                           (48.89a)
    Q_n^{(P)}(f_n, a_n) = γ^P f_{n+P}^T c_{n−1} + Σ_{m=0}^{P−1} γ^m r(n + m)      (48.89b)
    r(n + m) = r(h_{n+m}, a = π(a|h_{n+m}), h_{n+m+1})                            (48.89c)


More broadly, we can combine higher-order approximations by means of a weighting factor λ ∈ (0, 1] and use

    q^π(s_n, a_n) ≈ Q_n^λ(f_n, a_n)                                               (48.90a)
    Q_n^λ(f_n, a_n) = (1 − λ) Σ_{p=1}^{∞} λ^{p−1} Q_n^{(p)}(f_n, a_n)             (48.90b)

so that the update for c_n in (48.85) would be replaced by the forward implementation

    c_n = c_{n−1} + μ(n) f_n ( Q_n^λ(f_n, a_n) − f_n^T c_{n−1} ),   n ≥ 0         (48.91)

If we now follow the arguments that led to the SARSA(λ) algorithm (47.28), we can arrive at the backward implementation (48.92) for linear approximation models. The last line provides the approximation for q π (s, a) for any state–action pair (s, a) based on the estimate for co .

Stochastic SARSA(λ) for linear state–action approximations.

  given a (deterministic or stochastic) policy, π(a|h);
  start with c_{−1} = 0_{T×1} and t_{−1} = 0_{T×1}.
  repeat over episodes:
    let (f_0, h_0) denote initial feature vectors for episode
    let a_0 denote its initial action
    repeat over episode for n ≥ 0:
      observe h_{n+1} using action a_n
      r(n) = r(h_n, a_n, h_{n+1})
      a_{n+1} ∼ π(a|h_{n+1})
      observe f_{n+1}
      β(n) = r(n) + (γ f_{n+1} − f_n)^T c_{n−1}
      t_n = λγ t_{n−1} + f_n
      c_n = c_{n−1} + μ(n) β(n) t_n
    end
    c_{−1} ← c_n, t_{−1} ← t_n in preparation for next episode
  end
  c^o ← c_n
  q^π_n(s, a) ≈ f_{s,a}^T c^o,  ∀ (s, a) ∈ S × A.                                 (48.92)


48.4.4 Least-Squares SARSA Learning

We can replace the MSE criterion (48.83) by the regularized least-squares cost

    c^⋆ = argmin_{c∈IR^T} { J(c) = ρ‖c‖^2 + Σ_{n=0}^{N−1} ( q^π(s_n, a_n) − f_n^T c )^2 }        (48.93)

over an episode extending from time n = 0 to n = N − 1, and where ρ > 0 is a regularization parameter. Applying the future recursive least-squares recursions (50.123) to this case, and using (48.86) again, we arrive at the on-policy recursions listed in (48.94) (where the symbol Pn refers to the Riccati variable at iteration n):

Least-squares SARSA learning for linear state–action approximation.

  given a (deterministic or stochastic) policy π(a|h);
  start with P_{−1} = ρ^{−1} I_{T×T}, c_{−1} = 0_{T×1}.
  repeat over episodes:
    let (f_0, h_0) denote initial feature vectors for episode
    let a_0 denote its initial action
    repeat over episode for n ≥ 0:
      observe h_{n+1} using action a_n
      r(n) = r(h_n, a_n, h_{n+1})
      a_{n+1} ∼ π(a|h_{n+1})
      observe f_{n+1}
      τ(n) = 1/(1 + f_n^T P_{n−1} f_n)
      g_n = P_{n−1} f_n τ(n)
      β(n) = r(n) + γ f_{n+1}^T c_{n−1} − f_n^T c_{n−1}
      c_n = c_{n−1} + β(n) g_n
      P_n = P_{n−1} − g_n g_n^T / τ(n)
    end
    P_{−1} ← P_n, c_{−1} ← c_n in preparation for next episode
  end
  c^⋆ ← c_n
  q^π_n(s, a) ≈ f_{s,a}^T c^⋆,  ∀ (s, a) ∈ S × A.                                 (48.94)

The last line in the algorithm provides the intermediate approximation for q π (s, a) for any state–action pair (s, a). When there are multiple episodes, we can apply the algorithm repeatedly with the initial conditions {P−1 , c−1 } for one episode set to the final values obtained from the previous episode. Doing so would amount to solving a least-squares problem of the form:


    c^⋆ = argmin_{c∈IR^T} { ρ‖c‖^2 + Σ_{e=1}^{E} Σ_{n=0}^{N−1} ( q^π(s_n^{(e)}, a_n^{(e)}) − (f_n^{(e)})^T c )^2 }        (48.95)

where e = 1, 2, . . . , E is the episode index and {s_n^{(e)}, a_n^{(e)}, f_n^{(e)}} refer to the corresponding state, action, and feature variables.
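The sketch below illustrates the rank-one (Riccati) updates used inside listing (48.94); the variable names follow that listing, while the initialization values and the function interface are illustrative assumptions.

```python
import numpy as np

def ls_sarsa_step(P, c, f, f_next, reward, gamma=0.9):
    """One inner-loop step of least-squares SARSA learning (cf. (48.94))."""
    tau = 1.0 / (1.0 + f @ P @ f)
    g = P @ f * tau
    beta = reward + gamma * (f_next @ c) - f @ c
    c = c + beta * g
    P = P - np.outer(g, g) / tau
    return P, c

# typical initialization, as in the listing: P_{-1} = (1/rho) I, c_{-1} = 0
T, rho = 8, 0.1
P, c = np.eye(T) / rho, np.zeros(T)
```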

48.5 DEEP Q-LEARNING

In principle, one could also consider adjusting the original Q-learning procedure (47.45) by incorporating the linear approximation model (48.82) into it. Doing so would lead to a construction of the following form (compare with the expressions for β(n) and c_n in the SARSA(0) implementation (48.87)):

    β(n) = r(n) + γ max_{a'∈A} ( f_{h_{n+1},a'}^T c_{n−1} ) − f_n^T c_{n−1}       (48.96a)
    c_n = c_{n−1} + μ(n) β(n) f_n                                                 (48.96b)

where the notation f_{h_{n+1},a'} refers to the feature vector corresponding to the state defined by h_{n+1} and an arbitrary action a'. However, it has been observed in practice that this version of Q-learning leads to stability problems with diverging weight vectors, for three main reasons: (a) the use of the linear approximation model, which introduces modeling errors that influence the dynamics of the algorithm; (b) the greedy nature of the above implementation, which involves a maximization step over actions a' ∈ A; and (c) the approximation of the true state–action function q^π(·, ·) by sample-based estimates q_{n−1}(·, ·). In particular, the use of models (linear or otherwise) to approximate the state–action value function q^π(·, ·) leads to the effect of delusional bias, which refers to the fact that the chosen model may not be expressive enough to realize all the action choices available to the original problem. We illustrate this effect by means of an example.

48.5.1 Illustrating the Instability Problem

Consider an MDP and let (s_1, s_2, s_3) denote three generic states. Assume there are only two action choices, a_1, a_2 ∈ A. Once the MDP lands in state s_1, it can only take action a_1, and this action moves the MDP to state s_2, namely,

    π(a_1|s_1) = 1,  π(a_2|s_1) = 0,  P(s_1, a_1, s_2) = 1                        (48.97)

Likewise, when the MDP lands in state s_2, it can only select the same action a_1 and the MDP moves to state s_3:

    π(a_1|s_2) = 1,  π(a_2|s_2) = 0,  P(s_2, a_1, s_3) = 1                        (48.98)


All transitions and action choices at the other states in the MDP can happen stochastically. Now assume we use a linear model to approximate the state–action value function for this MDP. In particular, let us assume a model of order T = 2:

    q^π(s, a) ≈ f_{s,a}^T c,   c, f_{s,a} ∈ IR^2                                  (48.99)

Assume further that the entries of the feature vectors f_{s,a} associated with the state–action choices involving (s_1, s_2) and (a_1, a_2) are given by

    f_{s_1,a_1} = {0, 0},   f_{s_1,a_2} = {0, 1}                                  (48.100)
    f_{s_2,a_1} = {0, −1},  f_{s_2,a_2} = {0, −2}                                 (48.101)

Accordingly, if we denote the entries of c by c = {c_1, c_2}, the inner products f_{s,a}^T c at states s_1 and s_2 will assume the values:

    f_{s_1,a}^T c = 0 for a = a_1,   and   f_{s_1,a}^T c = c_2 for a = a_2        (48.102)
    f_{s_2,a}^T c = −c_2 for a = a_1,  and  f_{s_2,a}^T c = −2c_2 for a = a_2     (48.103)

It follows that the maximization of f_{s,a}^T c over a, for each of the states s_1 and s_2, can never result in the selection of the same action a_1 for both of them. For example, the first expression shows that the maximizing action will be a_1 at state s_1 only if c_2 < 0. When this happens, the maximizing action at state s_2 will be a_2, and we know that a_2 is not a feasible action at state s_2; only action a_1 can be taken there. In other words, the approximate linear model can lead to unrealizable greedy selections for the actions, which in turn can cause divergence problems. The instability problem is one key motivation for the study of policy gradient methods in the next chapter; these methods seek the optimal policy by directly parameterizing the policy itself rather than the state–action value function. A second approach is to consider more general models than the linear approximation model used so far, such as neural networks; we discuss this possibility next. Another approach is to incorporate entropy regularization and rely on soft Bellman conditions, as described in Section 49.10.
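A small numeric check of this argument is sketched below (the test values of c_2 and the value of c_1 are arbitrary assumptions; c_1 plays no role since the first feature entry is zero in (48.100)–(48.101)). For any nonzero c_2, action a_1 is never the greedy choice at both states simultaneously.

```python
import numpy as np

f_s1 = {'a1': np.array([0.0, 0.0]), 'a2': np.array([0.0, 1.0])}
f_s2 = {'a1': np.array([0.0, -1.0]), 'a2': np.array([0.0, -2.0])}

for c2 in [-2.0, -0.5, -0.01, 0.01, 0.5, 2.0]:      # any nonzero c2
    c = np.array([3.7, c2])                          # c1 is irrelevant here
    g1 = max(f_s1, key=lambda a: f_s1[a] @ c)        # greedy action at s1
    g2 = max(f_s2, key=lambda a: f_s2[a] @ c)        # greedy action at s2
    assert not (g1 == 'a1' and g2 == 'a1')
print("a1 is never the greedy action at both states simultaneously")
```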

48.5.2 Deep Learning^1

We will be discussing neural networks in some detail in Chapter 65, where we also derive algorithms for training them. In the description that follows, we assume familiarity with the cascade structure of a neural network. Readers can refer to that chapter if necessary.

^1 This section can be skipped on a first reading. It requires familiarity with feedforward neural networks, the notation used to describe their structure, and the algorithms used to train them. These networks will be studied in Chapter 65.


In deep Q-learning, a multilayer feedforward network is employed to approximate (or model) the optimal state–action value function, denoted by q^⋆(s, a). At every iteration, the input to the network will be the feature vector h_s ∈ IR^M corresponding to state s ∈ S, and the output of the network will be a vector denoted by y_s of size |A| × 1. The entries of y_s correspond to approximations for the optimal state–action values at that state and are denoted by

    y_s ≜ col{ q(s, a_1), q(s, a_2), . . . , q(s, a_{|A|}) },   (|A| × 1)          (48.104)

Once the network is trained, it becomes possible to determine greedy or ε-greedy actions by examining the output of the network. For instance, for a greedy selection, the optimal action would be determined by the index of the largest entry in y_s, namely,

    a^o ≜ argmax_{a'∈A} q(s, a')                                                  (48.105)

Sometimes, for convenience of notation, we will drop the subscript s and write simply h and y to refer to the input and output vectors. We also write y(h) to emphasize that y is the output vector generated by h:

    y(h) : IR^M → IR^{|A|}                                                        (48.106)

In place of (48.104), and for compactness, we will denote the individual entries of the output vector y by {y(k)} for k = 1, 2, . . . , |A| or, more explicitly, by {q(k)} to highlight the connection with the state–action value function. In this notation, q(k) stands for q(s, ak ), which is the state–action approximate value for state s and action ak . We may also write q(h, ak ) using the feature vector h corresponding to the state s. All these possibilities will be evident from the context.

Neural network model
We consider a neural network that consists of L layers, including the input layer where h is applied and the output layer where y is observed. The output layer employs linear activation functions, meaning that y = z, where z denotes the pre-activation vector for the output layer. The network receives input vectors h ∈ IR^M and generates output vectors y ∈ IR^{|A|}. For example, the input h can be the vectorized version of an image representing a screenshot of a video game. Figure 48.4 illustrates a network structure with L = 5 layers, M = 3 attributes per feature vector, and 4 entries per output vector. We use the notation introduced in future expression (65.31) for the weights, biases, and pre- and post-activation signals across the successive layers of the neural network, such as {W_ℓ, θ_ℓ, z_ℓ, y_ℓ}.

[Figure 48.4  A feedforward neural network with three hidden layers, a linear output layer y = z, |A| = 4 actions, and M = 3 attributes per feature vector; the output entries are y(k) = q(h, a_k) for k = 1, . . . , 4.]

The network parameters are learned by minimizing an ℓ_2-regularized least-squares risk of the form:

    (W^⋆, θ^⋆) ≜ argmin_{W,θ} { ρ Σ_{ℓ=1}^{L−1} ‖W_ℓ‖_F^2 + (1/(NE)) Σ_{e=1}^{E} Σ_{n=0}^{N−1} Σ_{k=1}^{|A|} ( y_n^{(e)}(k) − q^⋆(h_n^{(e)}, a_k) )^2 }        (48.107)

where e = 1, 2, . . . , E is the episode index and the superscript (e) is used to refer to feature and action variables within the episode. Here, the notation {W, θ} denotes the aggregate of all weight and bias parameters within the network. In the above formulation, we are assuming a total of N data points per episode (it can be made episode-dependent and replaced by N_e, with NE replaced by Σ_{e=1}^{E} N_e). Moreover, h_n^{(e)} denotes the feature vector at time n and a_k denotes the kth action choice. The vector y_n^{(e)} is the output of the neural network in response to h_n^{(e)}, while q^⋆(h_n^{(e)}, a_k) is the optimal state–action value that corresponds to the state–action pair (h_n^{(e)}, a_k). In this way, by training the neural network and finding its parameters, we would be determining a neural structure that approximates well the optimal state–action values. There is one problem, though, with this formulation. The values q^⋆(h_n^{(e)}, a_k) are not known beforehand. Following an argument similar to the derivation that led to (46.39), we approximate q^⋆(h_n^{(e)}, a_k) by using the approximation that is available at that point in time for it, namely,


    h_n^{(e)}  --a_k-->  h_{n+1}^{(e),k}                                                         (48.108a)
    q^⋆(h_n^{(e)}, a_k) ≈ r(n) + γ max_{a'∈A} { q_{m−1}( h_{n+1}^{(e),k}, a' ) }                 (48.108b)

where the first line means that we determine the next feature vector (or state) in the eth episode that would result from action a_k applied at state h_n^{(e)}. We denote this new state by h_{n+1}^{(e),k}, where we are adding a superscript k to highlight that there will be several such feature vectors, one for each action k:

    { h_{n+1}^{(e),1}, h_{n+1}^{(e),2}, . . . , h_{n+1}^{(e),|A|} }                              (48.109)

All these next-step states start from the same feature hn but result from the application of different actions – see the illustration in Fig. 48.5.

[Figure 48.5  Generation of the next-step feature vectors {h_{n+1}^{(e),k}} from the starting state h_n^{(e)} at time n in the eth episode, in response to the actions a_k ∈ A.]

Subsequently, we use the h_{n+1}^{(e),k} in the expression on the right-hand side of (48.108b) to approximate q^⋆(h_n^{(e)}, a_k). Note that we are adding a subscript m − 1 to the state–action approximation q_{m−1}(·, ·). There will be two indices in our implementation: one is n, which will be used as a running index over the samples as we traverse an episode, and the other is m, which will be used as the iteration index for updating the parameters of the neural network. Since these parameters will be updated after processing a batch of data, it becomes necessary to have two separate indices. Thus, the notation q_{m−1}(·, ·) means that this value is based on the neural network model that is available after the (m − 1)th update iteration. For example, during each iteration m, we will be processing a batch of state–action transitions (with their samples indexed by n), and at the end of each


batch, we would update the network parameters, with the index m running over batches. For compactness of notation, we introduce the following notation for the target signals that result from construction (48.108a)–(48.108b):

    o_n^{(e)}(k) ≜ r(n) + γ max_{a'∈A} { q_{m−1}( h_{n+1}^{(e),k}, a' ) },   k = 1, . . . , |A|        (48.110)
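A compact sketch of this target construction is shown below. The `q_model` callable (a frozen copy of the network returning the vector of |A| outputs) and the `next_feature` simulator are assumed interfaces introduced only for illustration; they are not APIs from the text.

```python
import numpy as np

def build_targets(h, q_model, next_feature, num_actions, gamma=0.9):
    """Return the targets o(k), k = 1..|A|, for a feature vector h, cf. (48.110)."""
    o = np.zeros(num_actions)
    for k in range(num_actions):
        h_next, reward, terminal = next_feature(h, k)   # apply action a_k to h
        if terminal:
            o[k] = reward
        else:
            # max over a' of q_{m-1}(h_next, a') is the largest network output
            o[k] = reward + gamma * np.max(q_model(h_next))
    return o
```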

These signals depend on both n and k, and they are computable. Substituting (48.110) into (48.107), the problem is transformed into

    (W^⋆, θ^⋆) ≜ argmin_{W,θ} { ρ Σ_{ℓ=1}^{L−1} ‖W_ℓ‖_F^2 + (1/(NE)) Σ_{e=1}^{E} Σ_{n=0}^{N−1} Σ_{k=1}^{|A|} ( y_n^{(e)}(k) − o_n^{(e)}(k) )^2 }
               = argmin_{W,θ} { ρ Σ_{ℓ=1}^{L−1} ‖W_ℓ‖_F^2 + (1/(NE)) Σ_{e=1}^{E} Σ_{n=0}^{N−1} ‖ y_n^{(e)} − o_n^{(e)} ‖^2 }                               (48.111)

where the vectors yn refer to the output signals for the neural network, while (e) the vectors on refer to available target signals constructed by exploration. We will learn in Chapter 65 how to train a neural network under this regularized least-squares risk in a mini-batch form by using, for example, algorithm (65.82). We reproduce here the recursions that would operate on one batch of data and update the network parameters. Later, we will use the description as a building block during our listing of the full deep Q-learning algorithm.

48.5.3

Training Algorithm Let us ignore the episode superscript (e) to lighten the notation. We denote the parameters for the network by {W`,m−1 , θ`,m−1 } and will show how to update them to {W`,m , θ`,m } by processing a mini-batch of data. Here, the letter ` refers to the layer index, ` = 1, . . . , L − 1. We also write {Wm−1 , θm−1 } without the layer index to refer to the aggregate parameters representing the network model at m − 1; these will be updated to {Wm , θm }. Accordingly, assume we select a random batch of B points, with each point consisting of a transition of the form (hb , hb+1 , ab , r(b)) for b = 0, 1, . . . , B − 1. Every such transition is obtained from observing the behavior of the MDP during some episode. Here, hb is a feature vector, ab is the action taken then, hb+1 is the feature observed in response to this action, and r(b) is the reward received: ab ,r(b)

hb −−−−→ hb+1

(b = 0, 1, . . . , B − 1)

(48.112)


We first feed the feature vectors forward through the neural network and determine their corresponding output signals, where the activation function for the last layer (the one that generates yL from zL ) is simply f (z) = z while the earlier layers can employ other activation functions (e.g., sigmoidal functions, ReLu, tanh): repeat for b = 0, 1, . . . , B − 1 (forward processing ): y1,b = hb (input layer; input feature vector) repeat for ` = 1, 2, . . . , L − 1: T z`+1,b = W`,m−1 y`,b − θ`,m−1 y`+1,b = f (z`+1,b ) end yb = yL,b (output of network) zb = zL,b (same as output) end

(48.113)

Next, we determine the target signals that correspond to each feature vector. Of course, this calculation can be incorporated into the for-loop described above. We separate the descriptions for ease of exposition. Once the target signals are determined we can use them to set the boundary conditions for the sensitivity factors of the neural networks. We use the letter δ to refer to these sensitivity factors in Chapter 65 (e.g., in the listing (65.82)). In this chapter, we have reserved the δ-notation to temporal difference factors. We will therefore use the alternative symbol σ to refer to sensitivity factors:

repeat for b = 0, 1, . . . , B − 1: start from state hb repeat for each possible action k = 1, 2, . . . , |A|: apply action ak to hb and determine hkb+1 if hkb+1 is terminal: set ob (k) = r(b) else k feed hkb+1 through network {Wm−1 , θm−1 } and determine yb+1 k set ob (k) = r(b) + γ kyb+1 k∞ (using ∞-norm for vectors) end end σL,b = 2(yb − ob ) end (48.114) We can combine the two forward-processing steps into a single description as follows:


repeat for b = 0, 1, . . . , B − 1 (forward processing ): y1,b = hb (input layer; input feature vector) repeat for each possible action k = 1, 2, . . . , |A|: apply action ak to hb and determine hkb+1 if hkb+1 is terminal: set ob (k) = r(b) else k feed hkb+1 through network {Wm−1 , θm−1 } and determine yb+1 k set ob (k) = r(b) + γ kyb+1 k∞ (using ∞−norm for vectors) end end (finds ob ) repeat for ` = 1, 2, . . . , L − 1: T z`+1,b = W`,m−1 y`,b − θ`,m−1 y`+1,b = f (z`+1,b ) end (finds yb ) yb = yL,b , (output of network) zb = zL,b σL,b = 2(yb − ob ) end (48.115) Finally, we update the network model through backward processing over the layers:

repeat for ` = L − 1, . . . , 2, 1 (backward processing ): B−1 µ X T y`,b σ`+1,b W`,m = (1 − 2µρ)W`,m−1 − B b=0 B−1 µ X θ`,m = θ`,m−1 + σ`+1,b B

(48.116)

b=0

σ`,b = f 0 (z`,b ) (W`,m−1 σ`+1,b ) , ` ≥ 2, b = 0, 1, . . . , B − 1 end We therefore arrive at listing (48.119) for training a deep Q-learner. The algorithm involves a replay buffer similar to what was described earlier for implementation (47.73). Once the network is well trained, its parameters are fixed, say, to {W`? , θ`? }. Then, in response to any new state (i.e., feature vector) h, the network propagates it to the output and generates the corresponding vector y. The index of the largest entry of y determines the greedy action: a = argmax y(k)

(greedy policy)

(48.117)

(-greedy policy)

(48.118)

1≤k≤|A|

a = -greedy[a]


Deep Q-learning for optimal policy extraction with replay buffer. initial network model parameters (W−1 , θ−1 ) = {W`,−1 , θ`,−1 }L−1 `=1 ; apply repeated actions according to some arbitrary policy and construct a buffer containing sufficient samples for the transitions (s, a, s0 , r); repeat for m ≥ 0 until sufficient convergence: generate a batch of B transitions for the MDP as follows: select a random initial state s0 (or feature h0 ) repeat for b = 0, 1, . . . , B − 1: feed hb through network and find output yb as in (48.113) set qm−1 (hb , ak ) = yb (k), k = 1, 2, . . . , |A| find a = argmax qm−1 (hb , ak ) 1≤k≤|A|

set action to ab = -greedy[a] apply action ab to obtain the next feature (state) hb+1 obtain reward r(b) = r(hb , ab , hb+1 ) save (sb , ab , sb+1 , r(b)) into the replay buffer end end select at random a batch of B transitions from the replay buffer feed them forward through the network using (48.115) update parameters {W`,m−1 , θ`,m−1 } to {W`,m , θ`,m } using (48.116) end return {W`? , θ`? } ← {W`,m , θ`,m }. (48.119)

Example 48.3 (Playing a game over a grid) We illustrate the operation of the deep Q-learning algorithm (48.119) by reconsidering the earlier grid problem from Fig. 48.1. For this example, we do not implement a replay buffer but train the network directly using constructions (48.115) and (48.116). We consider a feedforward neural network with L = 4 layers (including one input layer, one output layer, and two hidden layers). All neurons employ the ReLU activation function except for the output layer, where the neurons employ linear activation. The step size, regularization, and discount factor parameters are set to µ = 0.00001, ρ = 0.001, γ = 0.9

(48.120)

We employ the same one-hot encoded feature vectors (48.14) from Example 48.1. As explained in that example, the feature vectors do not incorporate a constant bias value of +1 since the state value function at the EXIT state s = 17 is zero, i.e., v π (17) = 0. For this reason, we will only be training the weight parameters {W` } for the network layers; there is no need for the offset parameters {θ` } for this example. The input layer has M = 17 neurons while the output layer has |A| = 5 neurons. The two hidden layers have n2 = 32 = n3 neurons each. We train the network using E = 50,000 episodes and batches of size B = 1 (single samples). For each episode, we generate a random state

48.6 Commentaries and Discussion

2041

s (and the corresponding feature h_s), and update the parameters of the network using constructions (48.115) and (48.116). Subsequently, we move from s to a new state s' using an ε-greedy exploration policy with ε = 0.1, feed s' (i.e., its feature vector h_{s'}) into the network, and repeat this process until the end of the episode. We continue to the next episode, and so forth. After convergence, we fix the network parameters at {W_ℓ^⋆}. Then, for each state s, we feed the corresponding feature vector h_s through the trained network and determine the corresponding output vector y_s. The entries of this vector approximate the optimal state–action values q^⋆(s, a_k) for all actions a_k ∈ A at state s. The entry with the largest value over the set of permissible actions at state s identifies the optimal action at that state. We collect the resulting "optimal" state–action values q^⋆(s, a) into a matrix Q^⋆, with each row corresponding to a state and each column to an action:

    Q^⋆:                                                             (48.121)

      state |     up      down     left     right     stop
      ------+--------------------------------------------------
       s=1  | −5.4262    4.7707   2.2983    2.7980   −0.0034
       s=2  |  3.8333    4.0790   3.5466    4.8131    0.0499
       s=3  |  3.6247    3.1124   3.4883    3.6524   −0.0673
       s=4  |     −         −        −         −         −
       s=5  |  4.8261    3.7703   4.2803    4.0417   −0.0296
       s=6  |  3.8548    3.3647   4.1220    3.3806   −0.1519
       s=7  |  3.8548    2.0752   3.9053   −4.5246    0.5266
       s=8  |     −         −        −         −      0.0012
       s=9  |  9.7817   −3.9148   5.2275    5.5454    0.0265
       s=10 |  8.7231    5.8284   6.7440    7.5686   −0.0193
       s=11 |     −         −        −         −         −
       s=12 |  5.6183    4.4485   4.7203    5.0721   −0.0521
       s=13 |  6.0812    5.4040   5.6415    6.5038   −0.0640
       s=14 |  7.1153    7.1385   6.3988    8.0406   −0.0077
       s=15 |  8.5392    7.7579   7.4839    9.5775    0.0423
       s=16 |     −         −        −         −      0.0498
       s=17 |     −         −        −         −      0

The resulting optimal actions are represented by arrows in Fig. 48.6. We repeat the simulation by using instead the reduced feature vectors (48.13), leading to the outcome shown in Fig. 48.7. Observe that the "optimal" action at state s = 1 is not plausible, since it leads the agent directly into the danger zone. This is a reflection of the fact that the reduced feature vectors for this example are not discriminatory enough. For example, observe from (48.13) that both states s = 1 and s = 9 have the same feature representation; by moving upward, one state leads to danger while the other state leads to success.

[Figure 48.6  Optimal actions at each square are indicated by red arrows representing motions in the directions {upward, downward, left, right}. The circular arrows in states 8 and 16 indicate that the game stops at these locations. This simulation employs the extended feature vectors (48.14).]

[Figure 48.7  Optimal actions at each square are indicated by red arrows representing motions in the directions {upward, downward, left, right}. The circular arrows in states 8 and 16 indicate that the game stops at these locations. This simulation employs the reduced feature vectors (48.13).]

48.6 COMMENTARIES AND DISCUSSION

Linear value approximation. The use of linear models for inference, estimation, and control purposes has a rich history in estimation and adaptation theories, as detailed in the earlier chapters and, for example, in the texts by Kailath, Sayed, and Hassibi (2000) and Sayed (2003, 2008). In the context of reinforcement learning, some of the earliest uses of linear approximation models for the state value function appear to be Farley and Clark (1954), Samuel (1959), and Bellman and Dreyfus (1959). Samuel (1959) was


motivated by a suggestion in Shannon (1950) on the latter's work on programming a computer to play the game of chess. Shannon (1950) suggested employing a function h(P, M) to assess the efficacy of performing move M in position P, and indicated that this function does not need to be represented exactly but can be approximated by a linear model using feature vectors. The article by Bellman and Dreyfus (1959) did not explicitly discuss reinforcement learning but dealt with approximate dynamic programming, which is at the core of reinforcement learning mechanisms. Overviews of linear approximation methods in reinforcement learning appear in Busoniu et al. (2010) and Geramifard et al. (2013). Convergence studies of the TD(0) and TD(λ) algorithms (48.11) and (48.19) for linear policy evaluation, under the assumption of linearly independent feature vectors, appear in Baird (1999), Sutton (1988), Dayan and Sejnowski (1994), Jaakkola, Jordan, and Singh (1994), Tsitsiklis (1994), and Tsitsiklis and Van Roy (1997). In particular, it is shown by Tsitsiklis and Van Roy (1997) and Tadić (2001) that TD(λ) is guaranteed to converge under linear value function approximations when used for on-policy learning. In contrast, it is noted in Baird (1995), Tsitsiklis and Van Roy (1996, 1997), and Sutton and Barto (1998) that the same is not true under off-policy training, where TD, SARSA, and Q-learning can become unstable with their parameters growing unbounded. This is in part due to the fact that these solutions do not enforce the Poisson condition (48.44). Subsequent variants based on Bellman projected error constructions, such as TDC and GTD2, lead to stable algorithms.

Bellman projected error. The Bellman error cost (48.45a) for estimating w was proposed in the context of reinforcement learning problems by Baird (1995, 1999) using Π = I, and adjusted by Bradtke and Barto (1996), Lagoudakis and Parr (2003), Sutton, Szepesvari, and Maei (2009), and Sutton et al. (2009) to include the projection matrix Π to help enforce property (48.44) and reduce mismatch errors. The last reference by Sutton et al. (2009) derives the TDC and GTD2 algorithms and establishes their


convergence toward the value function (fixed-point) solution of the Poisson equation (48.44) for decaying step-size sequences – see Probs. 48.6 and 48.7. The reference by Bradtke and Barto (1996) also discusses the TD least-squares implementation (48.43). A useful survey is given by Bertsekas (2011). The derivations in Section 48.3 are motivated by the exposition from Macua et al. (2015) and Cassano, Yuan, and Sayed (2021).

Deep Q-learning. The delusional bias problem described in Section 48.5.1 is from Lu, Schuurmans, and Boutilier (2018). The deep Q-learning algorithm (48.119) originates from the work by Mnih et al. (2013), where the method was used to train a convolutional neural network and applied to several Atari games. An extension of the algorithm using the double Q-learning construction is given by van Hasselt, Guez, and Silver (2016).

PROBLEMS

48.1  Refer to the importance weights ξ(s, a) defined by (48.51). Assume π(a|s) is a deterministic policy. Compute ξ(s, a).
48.2  The purpose of this problem is to derive the TD(λ) algorithm (48.19) by adjusting the arguments that led to (46.87). Introduce the error signal

    δ_n(h_{n+ℓ}) = r(n + ℓ) + γ h_{n+ℓ+1}^T w_{n−1} − h_{n+ℓ}^T w_{n−1}

  (a) Show that U_n^λ(h_n) − h_n^T w_{n−1} = Σ_{ℓ=0}^{∞} (λγ)^ℓ δ_n(h_{n+ℓ}).
  (b) Verify that δ_n(h_{n+ℓ}) I[h_n = h] = δ_n(h_n) I[h_{n−ℓ} = h].
  (c) Conclude that

      [ U_n^λ(h) − h^T w_{n−1} ] I[h = h_n] = δ_n(h_n) ( Σ_{ℓ=0}^{n} (λγ)^ℓ h_{n−ℓ} I[h = h_{n−ℓ}] ) + O( (λγ)^{n+1} )

  Introduce the change of variables m = n − ℓ and the eligibility trace vector

      t_n = Σ_{m=0}^{n} (λγ)^{n−m} h_m

  Complete the argument to arrive at the listing (48.19).
48.3  Establish equality (48.72).
48.4  Establish equality (48.73).
48.5  Refer to the projected Bellman cost in (48.45a) and modify it by adding an elastic regularization term, say, as:

    J_PB(w) = α‖w‖_1 + ρ‖w‖^2 + (1/2) (H^T D r^π − B w)^T (H^T D H)^{−1} (H^T D r^π − B w)

where α, ρ ≥ 0. How would the listings of the TDC and GTD2 algorithms be modified if we were to repeat the arguments of Section 48.3?
48.6  Study the convergence of the TDC algorithm (48.66) following arguments similar to the work by Sutton et al. (2009). This work establishes that the value function approximation converges with probability 1 toward the solution of the Poisson equation (48.44) for decaying step-size sequences.
48.7  Study the convergence of the GTD2 algorithm (48.68) following arguments similar to the work by Sutton et al. (2009). This work establishes that the value function approximation converges with probability 1 toward the solution of the Poisson equation (48.44) for decaying step-size sequences.
48.8  Consider an MDP M = (S, A, P, r), where the process of transitioning from one state to another forms a Markov chain with transition probability matrix P^π, whose entries are computed via (44.31). We explained in the concluding remarks of Chapter 38 that when the Markov chain is aperiodic and irreducible, P^π will be primitive. This means that an eigenvector, denoted by d^π ∈ IR^{|S|} and with entries d^π(s), will exist with strictly positive entries satisfying (recall (38.142)):

    (P^π)^T d^π = d^π,   1^T d^π = 1,   d^π(s) > 0

We referred to this vector as the Perron vector. We further know from (38.124) that it corresponds to the stationary probability distribution of the Markov chain under policy π(a|s; θ). Introduce the weighted least-squares error:

    J(w) ≜ Σ_{s∈S} d^π(s) ( v^π(s) − h_s^T w )^2

where h_s is the feature vector associated with state s. Let w^⋆ denote the minimizer of J(w) over w ∈ IR^M. Verify that the iterates w_n generated by the stochastic gradient TD(0) algorithm (48.11) approach a limit point w^† that satisfies

    J(w^†) ≤ (1/(1 − γ)) J(w^⋆)

Remark: The reader may refer to Tsitsiklis and Van Roy (1997) for a related discussion.
48.9  Examine the convergence of the stochastic gradient TD(λ) algorithm (48.19). Refer, for example, to the statement of theorem 1 in Tsitsiklis and Van Roy (1997), where convergence with probability 1 is established for any λ ∈ [0, 1].




REFERENCES Baird, L. C. (1995), “Residual algorithms: Reinforcement learning with function approximation,” Proc. Int. Conf. Machine Learning (ICML), pp. 30–37, Tahoe City, CA. Baird, L. C. (1999), Reinforcement Learning Through Gradient Descent, Ph.D. thesis, Carnegie Mellon University, USA. Bellman, R. E. and S. E. Dreyfus (1959), “Functional approximations and dynamic programming,” Math Tables Other Aides Comput., vol. 13, pp. 247–251. Bertsekas, D. P. (2011), “Approximate policy iteration: A survey and some new methods,” J. Control Theory Appl., vol. 9, no. 3, pp. 310–335. Bradtke, S. J. and A. G. Barto (1996), “Linear least-squares algorithms for temporal difference learning,” J. Mach. Learn. Res., vol. 22, pp. 33–57. Busoniu, L., R. Babuska, B. Schutter, and D. Ernst (2010), Reinforcement Learning and Dynamic Programming Using Function Approximators, CRC Press. Cassano, L., K. Yuan, and A. H. Sayed (2021), “Multi-agent fully decentralized value function learning with linear convergence rates,” IEEE Trans. Aut. Control, vol. 66, no. 4, pp. 1497–1512. Dayan, P. and T. Sejnowski (1994), “TD(λ) converges with probability 1,” Mach. Learn., vol. 14, pp. 295–301. Farley, B. G. and W. A. Clark (1954), “Simulation of self-organizing systems by digital computer,” IRE Trans. Inf. Theory, vol. 4, pp. 76–84. Geramifard, A., T. J. Walsh, S. Tellex, G. Chowdhary, N. Roy, and J. P. How (2013), “A tutorial on linear function approximators for dynamic programming and reinforcement learning,” Found. Trends Mach. Learn., vol. 6, no. 4, pp. 375–454. Jaakkola, T., M. I. Jordan, and S. P. Singh (1994), “On the convergence of stochastic iterative dynamic programming algorithms,” Neural Comput., vol. 6, no. 6, pp. 1185– 1201. Kailath, T., A. H. Sayed, and B. Hassibi (2000), Linear Estimation, Prentice Hall. Lagoudakis, M. G. and R. Parr (2003), “Least-squares policy iteration,” J. Mach. Learn. Res., vol. 4, pp. 1107–1149. Lu, T., D. Schuurmans, and C. Boutilier (2018), “Non-delusional Q-learning and value iteration,” Proc. Advances Neural Information Processing Systems (NIPS), pp. 1–11, Montreal. Macua, S. V., J. Chen, S. Zazo, and A. H. Sayed (2015), “Distributed policy evaluation under multiple behavior strategies,” IEEE Trans. Aut. Control, vol. 60, no. 5, pp. 1260–1274. Mnih, V., K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmille (2013), “Playing Atari with deep reinforcement learning,” NIPS Deep Learning Workshop, pp. 1–9, Lake Tahoe, NV. Available at arXiv:1312.5602. Samuel, A. L. (1959), “Some studies in machine learning using the game of checkers,” IBM J. Res. Develop., vol. 3, no. 3, pp. 210–229. Reprinted in Computers and Thought, E. A. Feigenbaum and J. Feldman, editors, McGraw-Hill, 1963. Sayed, A. H. (2003), Fundamentals of Adaptive Filtering, Wiley. Sayed, A. H. (2008), Adaptive Filters, Wiley. Shannon, C. E. (1950), “Programming a computer for playing chess,” Philos. Mag., vol. 41, pp. 256–275. Sutton, R. S. (1988), “Learning to predict by the method of temporal differences,” Mach. Learn., vol. 3, no. 1, pp. 9–44. Sutton, R. S. and A. G. Barto (1998), Reinforcement Learning: An Introduction, A Bradford Book. Sutton, R. S., H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesvari, and E. Wiewiora (2009), “Fast gradient-descent methods for temporal-difference learning with linear function approximation,” Proc. Int. Conf. Machine Learning (ICML), pp. 993–1000, Montreal.


Sutton, R. S., C. Szepesvari, and H. R. Maei (2009), "A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation," Proc. Advances Neural Information Processing Systems (NIPS), pp. 1609–1616, Vancouver. Tadić, V. (2001), "On the convergence of temporal-difference learning with linear function approximation," Mach. Learn., vol. 42, pp. 241–267. Tsitsiklis, J. N. (1994), "Asynchronous stochastic approximation and Q-learning," Mach. Learn., vol. 16, pp. 185–202. Tsitsiklis, J. N. and B. Van Roy (1996), "Feature-based methods for large scale dynamic programming," Mach. Learn., vol. 22, pp. 59–94. Tsitsiklis, J. N. and B. Van Roy (1997), "An analysis of temporal-difference learning with function approximation," IEEE Trans. Aut. Control, vol. 42, no. 5, pp. 674–690. van Hasselt, H., A. Guez, and D. Silver (2016), "Deep reinforcement learning with double Q-learning," Proc. AAAI Conf. Artificial Intelligence, pp. 2094–2100, Phoenix, AZ.

49 Policy Gradient Methods

In most multistage decision problems, we are interested in determining the optimal strategy, π ? (a|s) (i.e., the optimal actions to follow in the state–action space). Most of the algorithms described in the previous chapters focused on evaluating the state and state–action value functions, v π (s) and q π (s, a), for a given policy π(a|s). More is needed to learn the optimal policy. The deep Q-learning method provided one such approach. All these previous methods, however, are based on parameterizing the value functions. Policy gradient methods, on the other hand, parameterize the policy directly and they can help ameliorate the divergence and instability problems observed under Q-learning.

49.1 POLICY MODEL

In policy gradient methods, the policy π(a|s) is parameterized by some vector θ ∈ IR^T and is written as π(a|s; θ). The value of θ will be sought optimally, which means that these methods directly seek optimal actions for the agents. One common model is to associate a feature vector, denoted by f_{s,a} ∈ IR^T, with each state–action pair (s, a). We continue to denote the feature vector associated with state s by h_s. We then model the probability measure π(a|s; θ) for selecting action a at state s by the Gibbs distribution (also called the Boltzmann or softmax distribution):

    π(a|h; θ) = exp( f_{s,a}^T θ ) / Σ_{a'∈A} exp( f_{s,a'}^T θ ),   θ ∈ IR^T

θ ∈ IRT

(49.1)

where we are writing π(a|h; θ), with conditioning on h rather than the corresponding state s. We will use the notations π(a|s; θ) and π(a|h; θ) interchangeably. Other models for π(a|h; θ) are of course possible. Again, as was the case in (48.76), the feature vector f_{s,a} can be constructed from the concatenation of the feature vector h_s and the action vector e_a, or from some other representation. The normalization in (49.1) ensures that the probabilities π(a|h; θ) add up to 1:

    Σ_{a∈A} π(a|h; θ) = 1                                                          (49.2)



Observe how the expression on the right-hand side of (49.1) is a function of the state–action feature vector f_{s,a}, and not of the state feature alone. It is implicit that the sum in the denominator of (49.1) is computed only over the actions a' that are permissible at state s, so that we set π(a|h; θ) = 0 for actions that are not permitted at that state. In policy gradient methods, the parameter θ will be updated iteratively in order to maximize certain objective functions, denoted by J(θ). We describe in the sequel several choices for J(θ). Once an objective function is selected, the updates for θ will take the form of a (stochastic) gradient-ascent method:

    θ_m = θ_{m−1} + μ(m) ∇_{θ^T} J(θ) |_{θ = θ_{m−1}},   m ≥ 0                     (49.3)

where µ(m) ≥ 0 is the step-size sequence, and where we will be using the letter m to index the iterations for updating the parameter vector. The actual gradient vector will rarely be available because its value will depend on statistical information that is unknown. We will present various methods, differing in sophistication, for estimating the gradient vector from data.

49.2 FINITE-DIFFERENCE METHOD

We start with one of the most basic methods which, although inefficient, is still useful in some applications (especially if the assumed policy model π(a|h; θ) is not differentiable relative to θ). The method is based on approximating the gradient vector of J(θ) by perturbing the parameter vector and measuring the effect of these small perturbations on the cost. Specifically, let θ_{m−1} denote the estimate for the parameter vector that is available at iteration m − 1. We perturb the estimate by adding δθ_k to it, for several choices k = 1, 2, . . . , K. Each perturbation results in a modified parameter vector



θm−1 = θm−1 + δθk , k = 1, 2, . . . , K From a Taylor series expansion around θ = θm−1 we have   (k) T J θm−1 ≈ J(θm−1 ) + (δθk ) ∇θT J(θm−1 ) | {z }

(49.4)

(49.5)



=x

in terms of the gradient vector that we want to approximate (denoted by the column vector x). The quantity on the left-hand side is the value of the cost function at the perturbed vector, which can be evaluated by the agent by running the Markov decision process (MDP) for that parameter value (e.g., by using multiple runs and averaging the results). For instance, consider a situation where the MDP starts always from the same initial state s0 and where the objective is to select θ to maximize the state value until termination, i.e.,


n o ∆ ∆ θo = argmax J(θ) = v π (s0 )


(49.6)

θ∈IRT

(k)

Using the perturbed parameter θm−1 , the agent can perform several runs, say, E of them, from state s0 until the terminal state (for episodic MDPs), and determine the state value v π,(e) (s0 ) for each run, e. Averaging these values over the runs provides an estimate for v π (s0 ) and for the cost: E   1 X π,(e) (k) v (s0 ) J θm−1 ≈ E e=1

(49.7)

Now, returning to (49.5), we introduce the matrix and vector quantities for the various perturbations:

y \triangleq \begin{bmatrix} J\big(\theta_{m-1}^{(1)}\big) - J(\theta_{m-1}) \\ J\big(\theta_{m-1}^{(2)}\big) - J(\theta_{m-1}) \\ \vdots \\ J\big(\theta_{m-1}^{(K)}\big) - J(\theta_{m-1}) \end{bmatrix}, \qquad H \triangleq \begin{bmatrix} (\delta\theta_1)^T \\ (\delta\theta_2)^T \\ (\delta\theta_3)^T \\ \vdots \\ (\delta\theta_K)^T \end{bmatrix}    (49.8)

Both quantities are known by the agent: it can estimate y and it knows the perturbations in H. We then formulate the problem of estimating the desired gradient vector x as the solution to the least-squares problem:

\widehat{\nabla_{\theta^T} J}(\theta_{m-1}) \triangleq \operatorname{argmin}_{x\in\mathbb{R}^T} \|y - Hx\|^2    (49.9)

Assuming H has full column rank, we differentiate the quadratic cost relative to x and set its gradient vector to zero to find that

\widehat{\nabla_{\theta^T} J}(\theta_{m-1}) = (H^T H)^{-1} H^T y    (49.10)

This construction is then used in (49.3) to update θm−1 to θm , and the process continues. The above method is costly and requires recurrent sampling and broad exploration of the state × action space. There are of course other methods for estimating the gradient vector for the update (49.3). We describe these methods in the next sections; they will all assume the policy model π(a|h; θ) is differentiable with respect to θ. Under this condition, the concept of the score function (encountered earlier in (31.97)) will play an important role in evaluating the gradient vector; it will appear in the update relations for most policy gradient algorithms.
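The least-squares construction (49.8)–(49.10) can be prototyped in a few lines. The following Python sketch is illustrative only: the quadratic cost J(θ) is a made-up stand-in for the averaged returns of (49.7), and the perturbation scale and the number K of perturbations are arbitrary choices.

import numpy as np

# Minimal sketch of the finite-difference gradient estimate (49.8)-(49.10).
# In practice J(theta) would be obtained by running the MDP and averaging
# returns as in (49.7); here a synthetic quadratic cost is used instead.

def J(theta):
    return -0.5 * np.sum(theta ** 2)          # placeholder cost (assumption)

def finite_difference_gradient(J, theta, K=20, scale=1e-2, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    T = theta.size
    H = scale * rng.standard_normal((K, T))   # rows are (delta theta_k)^T
    y = np.array([J(theta + H[k]) - J(theta) for k in range(K)])
    # least-squares solution (H^T H)^{-1} H^T y of (49.10)
    g_hat, *_ = np.linalg.lstsq(H, y, rcond=None)
    return g_hat

theta = np.ones(5)
mu = 0.1
for m in range(50):                           # gradient-ascent updates (49.3)
    theta = theta + mu * finite_difference_gradient(J, theta)
print(theta)                                  # approaches the maximizer of J

Each gradient estimate requires K additional evaluations of the cost, which is why the method is considered expensive whenever evaluating J(θ) itself requires many runs of the MDP.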


49.3 SCORE FUNCTION

Assuming the policy model π(a|h; θ) is differentiable relative to θ, it holds that

\nabla_{\theta^T}\, \pi(a|h;\theta) = \pi(a|h;\theta)\, \frac{\nabla_{\theta^T}\, \pi(a|h;\theta)}{\pi(a|h;\theta)} = \pi(a|h;\theta)\, \nabla_{\theta^T} \ln \pi(a|h;\theta)    (49.11)

where the rightmost term is a column vector and is referred to as the score function of π(a|h; θ) – we encountered this definition before in (31.97):

S(h, a;\theta) \triangleq \nabla_{\theta^T} \ln \pi(a|h;\theta) \qquad \text{(score function)}    (49.12)

This function will appear repeatedly in the forthcoming discussions on policy gradient algorithms; it is important to emphasize that S is a function of the parameter θ.

Example 49.1 (Score function for Gibbs distribution) Let us evaluate the score function that corresponds to the Gibbs distribution model (49.1). Thus, note that

S(h,a;\theta) = \nabla_{\theta^T} \ln \pi(a|h;\theta)
= \nabla_{\theta^T} \ln\Big\{ e^{f_{s,a}^T\theta} \Big(\sum_{a'\in\mathcal{A}} e^{f_{s,a'}^T\theta}\Big)^{-1} \Big\}
= \nabla_{\theta^T} \Big\{ f_{s,a}^T\theta - \ln\Big(\sum_{a'\in\mathcal{A}} e^{f_{s,a'}^T\theta}\Big) \Big\}
= f_{s,a} - \Big(\sum_{a'\in\mathcal{A}} e^{f_{s,a'}^T\theta}\Big)^{-1} \nabla_{\theta^T}\Big(\sum_{a'\in\mathcal{A}} e^{f_{s,a'}^T\theta}\Big)
= f_{s,a} - \Big(\sum_{a'\in\mathcal{A}} e^{f_{s,a'}^T\theta}\Big)^{-1} \Big(\sum_{a'\in\mathcal{A}} f_{s,a'}\, e^{f_{s,a'}^T\theta}\Big)    (49.13)

That is,

S(h,a;\theta) = f_{s,a} - \sum_{a'\in\mathcal{A}} \pi(a'|h;\theta)\, f_{s,a'}    (49.14a)
= f_{s,a} - \bar{f}_s    (49.14b)
= f_{s,a} - \mathbb{E}_\pi\, \boldsymbol{f}_{s,a'}    (49.14c)

Recall that fs,a is the feature vector for state–action pair (s, a). The summation term in (49.14a) corresponds to the average feature vector under state s, where the averaging is over the policy (action selection). We are denoting this average by f¯s .
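A minimal Python sketch of the Gibbs policy (49.1) and its score (49.14a) at a single state is shown next; the feature matrix, whose rows stack the vectors f_{s,a} over the actions, and the parameter θ are synthetic values chosen only for illustration.

import numpy as np

# Minimal sketch of the Gibbs (softmax) policy (49.1) and its score function
# (49.14a) at one state. F_s stacks the feature vectors f_{s,a} for all actions.

def gibbs_policy(F_s, theta):
    logits = F_s @ theta
    logits -= logits.max()                 # numerical stability
    p = np.exp(logits)
    return p / p.sum()                     # pi(a|h; theta) for each action a

def score(F_s, theta, a):
    pi = gibbs_policy(F_s, theta)
    f_bar = F_s.T @ pi                     # mean feature vector, eq. (49.14b)
    return F_s[a] - f_bar                  # f_{s,a} - f_bar_s

rng = np.random.default_rng(0)
F_s = rng.standard_normal((4, 6))          # 4 actions, feature dimension T = 6
theta = rng.standard_normal(6)
print(gibbs_policy(F_s, theta))            # probabilities adding up to 1
print(score(F_s, theta, a=2))              # score vector for action a = 2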


The above useful result shows that, for each state–action pair (s, a), its score value measures the amount of deviation of its feature vector, f_{s,a}, from the mean feature vector (with the expected value measured under the distribution model π(a|h; θ)). The boldface notation for f_{s,a'} refers to the stochastic nature of the feature vector over the distribution a' ∼ π(a'|h; θ). The first expression (49.14a) will be used extensively in the learning algorithms in this chapter. The main reason is that it provides an expression for the gradient of the logarithm of π(a|h; θ); this gradient vector will be used to construct the update directions for the various stochastic algorithms. Expression (49.14a), however, already highlights one difficulty that arises from constructing feature vectors f_{s,a} through the concatenation procedure (48.76) for policy gradient algorithms, namely,

f_{s,a} = \begin{bmatrix} h_s \\ e_a \end{bmatrix} \in \mathbb{R}^T    (49.15)

where h_s is a feature vector for state s and e_a is a feature vector for action a. If we substitute this construction into (49.14a) we find that

\nabla_{\theta^T} \ln \pi(a|h;\theta) = f_{s,a} - \sum_{a'\in\mathcal{A}} \pi(a'|h;\theta)\, f_{s,a'}
= \begin{bmatrix} h_s \\ e_a \end{bmatrix} - \sum_{a'\in\mathcal{A}} \pi(a'|h;\theta) \begin{bmatrix} h_s \\ e_{a'} \end{bmatrix}
= \begin{bmatrix} 0 \\ e_a - \bar{e}_a \end{bmatrix}    (49.16)

where we introduced the mean feature representation for the actions:

\bar{e}_a \triangleq \sum_{a'\in\mathcal{A}} \pi(a'|h;\theta)\, e_{a'}    (49.17)

Observe from (49.16) that the gradient vector will be independent of the state feature since its leading block is zero. As a result, learning algorithms that rely on using this gradient direction will not be able to discriminate between the states and will lead to policies π(a|h; θ) that are independent of s (or h). We propose an alternative construction for f_{s,a} in Example 49.5 to avoid this difficulty.

Example 49.2 (Useful property of score functions) One useful property of a parameterized policy function π(a|h; θ) is that

\mathbb{E}_\pi \big\{ S(h,\boldsymbol{a};\theta)\, g(s) \big\} = 0    (49.18)

for any function g(s) that depends solely on the state variable, s. In other words, multiplying the score function by any such g(s) evaluates to zero on average. In particular, if we select g(s) = v^π(s) (the value function corresponding to the assumed policy model π(a|h; θ), which depends solely on the state variable), then it holds that

\mathbb{E}_\pi \big\{ \nabla_{\theta^T} \ln \pi(\boldsymbol{a}|h;\theta)\, v^\pi(s) \big\} = 0    (49.19)

We will use this useful property to propose adjustments to policy gradient methods in the following. This is because, as the presentation will reveal, policy gradient methods will tend to suffer from high variance in their estimates for the gradient vectors. By using property (49.19), we will be able to introduce adjustments into the recursions to reduce the variance.


Proof of (49.18): Note that

\mathbb{E}_\pi \big\{ S(h,\boldsymbol{a};\theta)\, g(s) \big\} = \mathbb{E}_\pi \big\{ \nabla_{\theta^T} \ln \pi(\boldsymbol{a}|h;\theta)\, g(s) \big\}
= g(s) \Big( \sum_{a\in\mathcal{A}} \pi(a|h;\theta)\, \nabla_{\theta^T} \ln \pi(a|h;\theta) \Big)
\overset{(49.11)}{=} g(s) \Big( \sum_{a\in\mathcal{A}} \nabla_{\theta^T}\, \pi(a|h;\theta) \Big)
= g(s)\, \nabla_{\theta^T} \Big( \underbrace{\sum_{a\in\mathcal{A}} \pi(a|h;\theta)}_{=1} \Big)
= g(s) \times 0
= 0    (49.20)

■
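The zero-mean property (49.18) is easy to verify numerically for the Gibbs policy. The sketch below uses synthetic features and an arbitrary baseline value; it simply checks that the policy-weighted sum of the score vectors vanishes.

import numpy as np

# Minimal numerical check of property (49.18): for a softmax policy, the
# expected score at any state is zero, so E_pi{ S(h,a;theta) g(s) } = 0 for
# any baseline g(s). The feature matrix and baseline value are arbitrary.

rng = np.random.default_rng(1)
F_s = rng.standard_normal((5, 4))          # f_{s,a} for 5 actions, T = 4
theta = rng.standard_normal(4)
g_s = 3.7                                  # arbitrary state-dependent baseline

logits = F_s @ theta
pi = np.exp(logits - logits.max())
pi /= pi.sum()

f_bar = F_s.T @ pi                         # mean feature vector
scores = F_s - f_bar                       # rows are S(h,a;theta), eq. (49.14a)

expected = (pi[:, None] * scores).sum(axis=0) * g_s
print(expected)                            # numerically zero (up to round-off)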

49.4 OBJECTIVE FUNCTIONS

There are several criteria that can be used to optimize the selection of θ. We denote the cost function by J(θ) : \mathbb{R}^T \to \mathbb{R}.

49.4.1 Discounted Reward

One first option is to select J(θ) as follows (as already advanced in example (49.6)):

J_1(\theta) = (1-\gamma)\, v^\pi(s_0), \qquad s_0 = \text{initial state}    (49.21)

This formulation is used for situations where the initial state of the MDP is fixed at some initial value, s_0. The scaling by (1 − γ) is not important and can be removed if desired since it does not alter the final result; it is only added for normalization purposes – see argument (49.223) in the Appendix. In this case, the objective is to select θ in order to maximize the cumulative reward that would result if the MDP starts from state s_0 and acts continually thereafter until termination according to the policy π(a|h; θ):

\theta^o = \operatorname{argmax}_{\theta\in\mathbb{R}^T} \Big\{ J_1(\theta) \triangleq (1-\gamma)\, v^\pi(s_0) \Big\}    (49.22)

We know from (44.69) and (44.92) that the term v^π(s_0) satisfies

v^\pi(s_0) = \sum_{a\in\mathcal{A}} \pi(a|h;\theta)\, q^\pi(s_0, a)    (49.23)

where

q^\pi(s_0, a) = \sum_{s'\in\mathcal{S}} \mathbb{P}(s_0, a, s') \big[ r(s_0, a, s') + \gamma\, v^\pi(s') \big]    (49.24)

We also know from (44.65) that expression (49.23) can be unfolded and v^π(s_0) rewritten in the series form:

v^\pi(s_0) = \mathbb{E}_{\pi,\mathcal{P}} \Big( \sum_{n=0}^{\infty} \gamma^n\, \boldsymbol{r}(n) \,\Big|\, \boldsymbol{s}_0 = s_0 \Big)    (49.25)

where γ ∈ [0, 1), and r(n) denotes the reward r(s, a, s') resulting from the transition at time n. We are denoting the rewards in boldface since their values depend on the distribution of actions and transitions. In this way, problem (49.22) becomes (see also Prob. 49.3):

\theta^o = \operatorname{argmax}_{\theta\in\mathbb{R}^T} \Big\{ J_1(\theta) \triangleq (1-\gamma)\, \mathbb{E}_{\pi,\mathcal{P}} \Big( \sum_{n=0}^{\infty} \gamma^n\, \boldsymbol{r}(n) \,\Big|\, \boldsymbol{s}_0 = s_0 \Big) \Big\}, \qquad \gamma\in[0,1)    (49.26)

49.4.2 Expected Discounted Reward

We can modify the cost (49.21) to allow the initial state to assume any value from the set of states, S, according to some probability distribution. We select this distribution to be the steady-state distribution for the MDP. Recall that we explained earlier in Section 44.1.2 that for an MDP M = (S, A, P, r), the process of transitioning from one state to another forms a Markov chain with transition probability matrix given by P^π and whose entries are computed via (44.31):

p^\pi_{s,s'} \triangleq \mathbb{P}(\boldsymbol{s}' = s' \,|\, \boldsymbol{s} = s) = \sum_{a\in\mathcal{A}} \pi(a|h;\theta)\, \mathbb{P}(s, a, s'), \qquad s, s'\in\mathcal{S}    (49.27)

We further explained in the concluding remarks of Chapter 38 that when the Markov chain is aperiodic and irreducible, then the transition matrix P^π = [p^π_{s,s'}] will be primitive. This means that an eigenvector, denoted by d^π ∈ \mathbb{R}^{|\mathcal{S}|} and with entries d^π(s), will exist with strictly positive entries satisfying (recall (38.142)):

(P^\pi)^T d^\pi = d^\pi, \qquad \mathbb{1}^T d^\pi = 1, \qquad d^\pi(s) > 0    (49.28)

We refer to this vector as the Perron vector. We further know from (38.124) that it corresponds to the stationary probability distribution of the Markov chain under policy π(a|s; θ). In other words, each entry d^π(s) denotes the likelihood of state s_n being at value s after sufficient time n has elapsed:

d^\pi(s) = \lim_{n\to\infty} \mathbb{P}(\boldsymbol{s}_n = s)    (49.29)

Using these steady-state probabilities we replace J_1(θ) in (49.21) by

J_2(\theta) = (1-\gamma) \sum_{s\in\mathcal{S}} d^\pi(s)\, v^\pi(s)    (49.30)


so that the problem of selecting θ becomes

\theta^o = \operatorname{argmax}_{\theta\in\mathbb{R}^T} \Big\{ J_2(\theta) = (1-\gamma) \sum_{s\in\mathcal{S}} d^\pi(s)\, v^\pi(s) \Big\}    (49.31)

where the scaling by (1 − γ) is again added for normalization purposes only – see argument (49.223) and Prob. 49.5.
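The Perron vector in (49.28) and the cost J_2(θ) in (49.30) are straightforward to compute for a small MDP once P^π and v^π are available. The following sketch uses made-up numbers for both quantities purely to illustrate the eigenvector computation.

import numpy as np

# Minimal sketch of computing the Perron vector d^pi in (49.28) and the
# objective J_2(theta) in (49.30) for a small synthetic MDP. The transition
# matrix P_pi and value vector v_pi below are made-up numbers.

P_pi = np.array([[0.8, 0.2, 0.0],
                 [0.1, 0.7, 0.2],
                 [0.3, 0.0, 0.7]])        # state-transition matrix under pi
v_pi = np.array([1.0, 2.5, -0.5])         # state values v^pi(s) (assumption)
gamma = 0.9

# left eigenvector of P_pi with eigenvalue 1, normalized to add up to one
eigvals, eigvecs = np.linalg.eig(P_pi.T)
idx = np.argmin(np.abs(eigvals - 1.0))
d_pi = np.real(eigvecs[:, idx])
d_pi = d_pi / d_pi.sum()

J2 = (1 - gamma) * d_pi @ v_pi
print(d_pi, J2)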

49.4.3 Expected Immediate Reward

A third possibility is to select θ in order to maximize the immediate (or one-step) reward over the state distribution by considering

J_3(\theta) = \sum_{s\in\mathcal{S}} d^\pi(s)\, r^\pi(s)    (49.32)

where r^π(s) denotes the expected one-step reward for the transition from state s. We already know from (44.43) and (44.44) that this reward value is given by:

r^\pi(s) = \sum_{a\in\mathcal{A}} \pi(a|h;\theta) \Big( \sum_{s'\in\mathcal{S}} \mathbb{P}(s, a, s')\, r(s, a, s') \Big)    (49.33)

The problem of selecting θ now becomes

\theta^o = \operatorname{argmax}_{\theta\in\mathbb{R}^T} \Big\{ J_3(\theta) = \sum_{s\in\mathcal{S}} d^\pi(s)\, r^\pi(s) \Big\}    (49.34)

49.4.4 Average Reward

When discounting is not used (i.e., when γ = 1), the series in (49.26) may diverge. We therefore only allow γ = 1 for episodic MDPs, i.e., for MDPs with terminal states that are reached within some finite time horizon, N. More broadly, for both episodic and nonepisodic MDPs, we can consider an alternative objective function that is based on measuring the average reward per time step, i.e.,

\theta^o = \operatorname{argmax}_{\theta\in\mathbb{R}^T} \Big\{ J_4(\theta) = \lim_{N\to\infty} \mathbb{E}_{\pi,\mathcal{P}} \Big( \frac{1}{N} \sum_{n=0}^{N-1} \boldsymbol{r}(n) \Big) \Big\}    (49.35)

If the MDP happens to be episodic and terminates after N steps, then we can remove the limiting operation and use instead:

\theta^o = \operatorname{argmax}_{\theta\in\mathbb{R}^T} \Big\{ J_4(\theta) \triangleq \mathbb{E}_{\pi,\mathcal{P}} \Big( \frac{1}{N} \sum_{n=0}^{N-1} \boldsymbol{r}(n) \Big) \Big\} \qquad \text{(episodic MDP)}    (49.36)

Note that there is no discount factor in formulation (49.35), i.e., γ = 1.

Example 49.3 (Relation between objectives J_3(θ) and J_4(θ)) We can relate objectives J_3(θ) and J_4(θ) and show that they coincide for stationary policies (cf. (44.8)) and MDPs operating in the steady state. Indeed, assume the Markov chain has reached steady-state operation with state distribution d^π(s). At any time n, the agent can be at any of the |S| states with probability d^π(s) each. From there, it can take an action a with probability π(a|h; θ), move to a state h' with probability P(h, a, h'), and collect reward r(n) = r(h, a, h'). Then, the expected reward at time n is given by

\bar{r} \triangleq \mathbb{E}_{d^\pi,\pi,\mathcal{P}}\, \boldsymbol{r}(n) = \sum_{s\in\mathcal{S}} d^\pi(s)\, r^\pi(s) = J_3(\theta)    (49.37)

where the expectation is over the randomness in state transitions (represented by P), action selections (represented by π(a|h; θ)), and launch state (represented by d^π(s)). On the other hand, when the rewards process r(n) is ergodic in the mean or first-order ergodic (see (7.18)), the average of a long collection of measurements r(n) will tend to the actual mean value:

\lim_{N\to\infty} \frac{1}{N} \sum_{n=0}^{N-1} \boldsymbol{r}(n) = \mathbb{E}_{d^\pi}\, \boldsymbol{r}(n)    (49.38)

where the expectation is over the randomness in the launch state. Comparing with (49.37), we conclude that J_3(θ) = J_4(θ).

49.4.5 Centered Poisson Equation

When the average reward J_4(θ) in (49.35) is selected as an objective function for policy design, it is customary to redefine the state and state–action value functions by centering them around the average reward r̄ as follows:

v^\pi(s) \triangleq \mathbb{E}_{\pi,\mathcal{P}} \Big\{ \sum_{n=0}^{\infty} \big( \boldsymbol{r}(n) - \bar{r} \big) \,\Big|\, \boldsymbol{s}_0 = s \Big\}    (49.39a)

q^\pi(s,a) \triangleq \mathbb{E}_{\pi,\mathcal{P}} \Big\{ \sum_{n=0}^{\infty} \big( \boldsymbol{r}(n) - \bar{r} \big) \,\Big|\, \boldsymbol{s}_0 = s, \boldsymbol{a}_0 = a \Big\}    (49.39b)

where the value of r̄ is dependent on θ; it can also be written as r̄(θ) and is given by J_4(θ), i.e.,

\bar{r}(\theta) \triangleq \lim_{N\to\infty} \mathbb{E}_{\pi,\mathcal{P}} \Big\{ \frac{1}{N} \sum_{n=0}^{N-1} \boldsymbol{r}(n) \Big\}    (49.40a)

For episodic MDPs, we would use instead the expression for J_4(θ) from (49.36), i.e.,

\bar{r}(\theta) \triangleq \mathbb{E}_{\pi,\mathcal{P}} \Big\{ \frac{1}{N} \sum_{n=0}^{N-1} \boldsymbol{r}(n) \Big\} \qquad \text{(episodic MDPs)}    (49.40b)


We could have used new notation for the centered variables (49.39a)–(49.39b), such as v̄^π(s) and q̄^π(s, a). However, we prefer to continue with the same notation to avoid additional symbols. It follows from the above expressions that v^π(s) and q^π(s, a) now measure the excess values around the mean, r̄. It is straightforward to verify, by subtracting r̄ from both sides of (44.92), that these centered values continue to satisfy the same relation:

v^\pi(s) = \sum_{a\in\mathcal{A}} \pi(a|h;\theta)\, q^\pi(s,a)    (49.41)

The centered variables {v^π(s), q^π(s, a)} also satisfy a Poisson equation similar to (44.92), albeit with a minor modification with r̄(θ) appearing on the left, namely,

v^\pi(s) + \bar{r}(\theta) = \mathbb{E}_{\pi,\mathcal{P}} \big[ \boldsymbol{r}(s,\boldsymbol{a},\boldsymbol{s}') + v^\pi(\boldsymbol{s}') \,\big|\, \boldsymbol{s} = s \big]
= \sum_{s'\in\mathcal{S}} \sum_{a\in\mathcal{A}} \pi(a|h;\theta)\, \mathbb{P}(s,a,s') \big[ r(s,a,s') + v^\pi(s') \big]    (49.42a)

and

q^\pi(s,a) + \bar{r}(\theta) = \mathbb{E}_{\pi,\mathcal{P}} \big[ \boldsymbol{r}(s,a,\boldsymbol{s}') + q^\pi(\boldsymbol{s}',\boldsymbol{a}') \,\big|\, \boldsymbol{s} = s, \boldsymbol{a} = a \big]
= \sum_{s'\in\mathcal{S}} \sum_{a'\in\mathcal{A}} \pi(a'|h;\theta)\, \mathbb{P}(s,a,s') \big[ r(s,a,s') + q^\pi(s',a') \big]
= \sum_{s'\in\mathcal{S}} \mathbb{P}(s,a,s') \big[ r(s,a,s') + v^\pi(s') \big]    (49.42b)

Proof of (49.42a)–(49.42b): This can be seen as follows:

v^\pi(s) \overset{(49.39a)}{=} \mathbb{E}_{\pi,\mathcal{P}} \Big\{ \sum_{n=0}^{\infty} \big( \boldsymbol{r}(n) - \bar{r} \big) \,\Big|\, \boldsymbol{s}_0 = s \Big\}
= \mathbb{E}_{\pi,\mathcal{P}} \big\{ \boldsymbol{r}(0) - \bar{r} \,\big|\, \boldsymbol{s}_0 = s \big\} + \mathbb{E}_{\pi,\mathcal{P}} \Big\{ \sum_{n=1}^{\infty} \big( \boldsymbol{r}(n) - \bar{r} \big) \,\Big|\, \boldsymbol{s}_0 = s \Big\}
\overset{(a)}{=} \mathbb{E}_{\pi,\mathcal{P}} \Big\{ \boldsymbol{r}(s,\boldsymbol{a},\boldsymbol{s}') - \bar{r} + \mathbb{E}_{\pi,\mathcal{P}} \Big( \sum_{n=1}^{\infty} \big( \boldsymbol{r}(n) - \bar{r} \big) \,\Big|\, \boldsymbol{s}_1 = s' \Big) \Big\}
\overset{(b)}{=} \mathbb{E}_{\pi,\mathcal{P}} \Big\{ \boldsymbol{r}(s,\boldsymbol{a},\boldsymbol{s}') - \bar{r} + \mathbb{E}_{\pi,\mathcal{P}} \Big( \sum_{n=0}^{\infty} \big( \boldsymbol{r}(n) - \bar{r} \big) \,\Big|\, \boldsymbol{s}_0 = s' \Big) \Big\}
= \mathbb{E}_{\pi,\mathcal{P}} \big\{ \boldsymbol{r}(s,\boldsymbol{a},\boldsymbol{s}') - \bar{r} + v^\pi(\boldsymbol{s}') \big\}    (49.43)

where step (a) is because of the Markovian property, and step (b) is because we are assuming a stationary policy. A similar argument holds for q^π(s, a) – see Prob. 49.4. ■

49.5 POLICY GRADIENT THEOREM

We now derive expressions for the gradient vectors of the objective functions introduced in the previous section. The main result is referred to as the policy gradient theorem. It states that for any of the objective functions listed in Table 49.1, and for Markov chains that are irreducible and aperiodic, it holds that

\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi,d} \Big[ \big( q^\pi(\boldsymbol{s},\boldsymbol{a}) - g(\boldsymbol{s}) \big)\, \nabla_{\theta} \ln \pi(\boldsymbol{a}|\boldsymbol{h};\theta) \Big]    (49.44)

where we are denoting the cost function by J(θ) and where g(s) denotes any state-dependent baseline function, such as g(s) = v^π(s). The expectation is relative to the policy distribution π(a|h; θ) (i.e., the randomness in the action variable, a) and also relative to a stationary state distribution, denoted generically by d, and defined over s ∈ S. The value of its entries, d(s), depends on whether we are optimizing an average reward measure (J_3(θ) or J_4(θ)) or a discounted reward measure (J_1(θ)):

\text{(average reward)} \qquad d(s) \triangleq d^\pi(s) = \lim_{n\to\infty} \mathbb{P}(\boldsymbol{s}_n = s)    (49.45)

\text{(discounted reward)} \qquad d(s) \triangleq d^\pi_\gamma(s) = (1-\gamma) \sum_{n=0}^{\infty} \gamma^n\, \mathbb{P}(\boldsymbol{s}_n = s \,|\, \boldsymbol{s}_0 = s_0)    (49.46)

The distribution d^π_γ is called the discounted stationary distribution, while d^π(s) is the regular stationary distribution given by the Perron vector for the state transition matrix P^π. The form of the distribution d for J_2(θ) is treated in Prob. 49.5.

Table 49.1 Listing of objective functions.

Objective function           | Expression
discounted reward            | J_1(θ) = (1 − γ) v^π(s_0)
expected discounted reward   | J_2(θ) = (1 − γ) Σ_{s∈S} d^π(s) v^π(s)
expected immediate reward    | J_3(θ) = Σ_{s∈S} d^π(s) r^π(s)
average reward               | J_4(θ) = lim_{N→∞} E_{π,P} { (1/N) Σ_{n=0}^{N−1} r(n) }

The significance of result (49.44) is that it shows that the gradient of the objective function can be estimated empirically from observations/estimates of the state–action value function, q^π(s, a), as we will clarify in the following. We establish the validity of (49.44) in Appendix 49.A.

Example 49.4 (Compatible approximations) The policy gradient theorem (49.44) expresses the gradient vector of the objective function in terms of the actual state–action value function, q^π(s, a). There is an alternative useful result, which expresses the same exact gradient vector in terms of an approximation for the state–action value function. The result only holds for a particular class of approximations, known as compatible approximations. To see this, let us assume a parameterized approximation for q^π(s, a) in the form:

q^\pi(s,a;c) = \big( \nabla_{\theta^T} \ln \pi(a|h;\theta) \big)^T c    (49.47)

where we select the vector c as the optimal solution to the mean-square-error (MSE) problem:

c^o \triangleq \operatorname{argmin}_{c\in\mathbb{R}^T} \Big\{ \frac{1}{2}\, \mathbb{E}_{\pi,d} \big( q^\pi(\boldsymbol{s},\boldsymbol{a}) - q^\pi(\boldsymbol{s},\boldsymbol{a};c) \big)^2 \Big\}    (49.48)

Under the assumed representation (49.47), it will hold that

\nabla_{\theta^T} J(\theta) = \mathbb{E}_{\pi,d}\, q^\pi(\boldsymbol{s},\boldsymbol{a};c^o)\, \nabla_{\theta^T} \ln \pi(\boldsymbol{a}|\boldsymbol{h};\theta)    (49.49)

Comparing with (49.44), and ignoring g(s) since it can be removed, the earlier expression uses the true state–action value q^π(s, a) while the above expression uses the linear state–action value approximation, q^π(s, a; c^o). In other words, the gradient expression continues to be exact and no bias is introduced if the linear approximation for q^π(s, a) satisfies the two conditions (49.47)–(49.48). We refer to approximations that satisfy these conditions as compatible function approximations. This is a useful conclusion because, in the context of policy gradient algorithms, the result motivates the use of linear approximation models for the state–action value function. Moreover, if we compare model (49.47) with the earlier formulation (48.83), we see that the term ∇_{θ^T} ln π(a|h; θ) is now playing the role of the feature vectors f_{s,a}. Actually, feature vectors chosen as

f_{s,a} = \nabla_{\theta^T} \ln \pi(a|h;\theta)    (49.50)

are called compatible features. In particular, if the model selected for π(a|h; θ) is the Gibbs distribution (49.1), then it follows from result (49.14c) that condition (49.47) is met by selecting

q^\pi(s,a;c^o) = \big( f_{s,a} - \bar{f}_s \big)^T c^o    (49.51)

That is, once the feature vectors are centered around their mean (as was advanced by (48.81)), then they become compatible features under the assumed Gibbs model for the policy.

Proof of (49.49): It follows from (49.47) that

\nabla_c\, q^\pi(s,a;c) \Big|_{c=c^o} = \nabla_\theta \ln \pi(a|h;\theta)    (49.52)


Setting the gradient vector of the MSE (49.48) to zero at c = c^o gives:

0 = \mathbb{E}_{\pi,d} \Big\{ \big( q^\pi(\boldsymbol{s},\boldsymbol{a};c^o) - q^\pi(\boldsymbol{s},\boldsymbol{a}) \big)\, \nabla_c\, q^\pi(\boldsymbol{s},\boldsymbol{a};c) \Big|_{c=c^o} \Big\}
\overset{(49.52)}{=} \mathbb{E}_{\pi,d} \Big\{ \big( q^\pi(\boldsymbol{s},\boldsymbol{a};c^o) - q^\pi(\boldsymbol{s},\boldsymbol{a}) \big)\, \nabla_\theta \ln \pi(\boldsymbol{a}|\boldsymbol{h};\theta) \Big\}    (49.53)

from which we conclude that

\mathbb{E}_{\pi,d}\, q^\pi(\boldsymbol{s},\boldsymbol{a};c^o)\, \nabla_\theta \ln \pi(\boldsymbol{a}|\boldsymbol{h};\theta) = \mathbb{E}_{\pi,d}\, q^\pi(\boldsymbol{s},\boldsymbol{a})\, \nabla_\theta \ln \pi(\boldsymbol{a}|\boldsymbol{h};\theta)    (49.54)

and the result follows from (49.44) using g(s) = 0. ■

49.6 ACTOR–CRITIC ALGORITHMS

Expression (49.44) identifies the gradient vector of the objective function. Recall that in this expression the term g(s) refers to an arbitrary baseline function of the state s; its role is clarified further ahead. We can estimate the gradient from observations as follows. Assume we collect E episodes or trajectories by following the existing policy π(a|h; θ_{m−1}). Let N_e denote the number of samples in each trajectory. The total number of data points available for estimation is then

N = \sum_{e=1}^{E} N_e    (49.55)

Let the following symbols, with superscript (e),

\big\{ s_n^{(e)},\ h_n^{(e)},\ a_n^{(e)},\ r^{(e)}(n),\ f_n^{(e)} \big\}    (49.56)

denote the state, feature vectors, action, and reward associated with the eth episode at time n. In particular,

f_n^{(e)} = f_{s_n^{(e)},\, a_n^{(e)}}    (49.57)

Then, we can approximate (49.44) by using

g_{m-1} \triangleq \nabla_{\theta^T} J(\theta_{m-1}) \quad \text{(gradient vector at } \theta = \theta_{m-1}\text{)}
\approx \frac{1}{N} \sum_{e=1}^{E} \sum_{n=0}^{N_e-1} \Big[ q^\pi\big(s_n^{(e)}, a_n^{(e)}\big) - g\big(s_n^{(e)}\big) \Big]\, \nabla_{\theta^T} \ln \pi\big(a_n^{(e)} \big| h_n^{(e)}; \theta_{m-1}\big)    (49.58)

Observe that the score function, ∇_{θ^T} ln π(a|h; θ_{m−1}), appears in the computation of g_{m−1}. Its value depends on the assumed model for the policy. For example, under the Gibbs distribution model (49.1), we can use expression (49.14a), namely,

\bar{f}_n^{(e)} \triangleq \sum_{a\in\mathcal{A}} \pi\big(a \big| h_n^{(e)}; \theta_{m-1}\big)\, f_{s_n^{(e)},\, a}    (49.59a)

\nabla_{\theta^T} \ln \pi\big(a_n^{(e)} \big| h_n^{(e)}; \theta_{m-1}\big) = f_n^{(e)} - \bar{f}_n^{(e)}    (49.59b)

Using g_{m−1}, we can now resort to a stochastic gradient-ascent iteration of the following form to update θ_{m−1} to θ_m:

\theta_m = \theta_{m-1} + \mu(m)\, g_{m-1}, \qquad m \ge 0    (49.60)

where µ(m) ≥ 0 is a step-size sequence and θ_m is the estimate at iteration m. This recursion is still not useful in this form because it requires knowledge of q^π(s, a); we also need to explain how to select g(s). Different policy gradient algorithms differ by how these two quantities are selected. We first remark that, given the state s = s, the conditional mean of the action value function satisfies

\mathbb{E}_\pi \big[ q^\pi(s, \boldsymbol{a}) \,\big|\, \boldsymbol{s} = s \big] = v^\pi(s)    (49.61)

This conclusion is obvious from expression (44.92). Result (49.61) means that the state value v^π(s) can be used as an unbiased sample estimate for the state–action value function at state s. The next issue is to estimate v^π(s) itself. There are several possibilities, leading to various algorithms.

49.6.1 REINFORCE Algorithm

The first possibility is to rely on a Monte Carlo construction, similar to (46.13). Let us consider first the case of discounted rewards, such as those corresponding to the objective functions J_1(θ) and J_2(θ) from Table 49.1. We comment after the listing of the algorithm on the changes that are necessary for the average rewards J_3(θ) and J_4(θ). Thus, consider one episode e under policy π(a|h; θ_{m−1}), consisting of a sequence of state transitions and rewards where the actions are chosen according to this policy iterate, written as (without the superscript (e) for simplicity):

s_1 \xrightarrow{\ a_1,\, r(1)\ } s_2 \xrightarrow{\ a_2,\, r(2)\ } s_3 \xrightarrow{\ a_3,\, r(3)\ } s_4 \ldots, \qquad a \sim \pi(a|h;\theta_{m-1})    (49.62)

Assume that some state s happens at time n in this episode. Starting from s_n, we evaluate the discounted reward until the end state, say, as

\widehat{v}^\pi(s) = \sum_{m=n}^{N-1} \gamma^{m-n}\, r(m)    (49.63)

The resulting REINFORCE algorithm sets the baseline function g(s) to zero and is listed in (49.64).


REINFORCE for optimal policy design under discounted rewards.

start from an arbitrary θ_{−1} ∈ ℝ^T, which defines π(a|h; θ_{−1}).
repeat over m = 0, 1, 2, . . .:
    collect E episodes/trajectories following current policy π(a|h; θ_{m−1});
    let N_e denote the number of transitions in episode e;
    when necessary, let notation (h_n^{(e)}, a_n^{(e)}, r^{(e)}(n)) with superscript (e)
        index the (s, a, r) quantities over episodes and time.
    repeat over samples in each episode:
        v̂^π(s_n^{(e)}) = Σ_{m=n}^{N_e−1} γ^{m−n} r^{(e)}(m)
    end
    N = Σ_{e=1}^{E} N_e
    g_{m−1} = (1/N) Σ_{e=1}^{E} Σ_{n=0}^{N_e−1} v̂^π(s_n^{(e)}) ∇_{θ^T} ln π(a_n^{(e)} | h_n^{(e)}; θ_{m−1})
    θ_m = θ_{m−1} + µ(m) g_{m−1}
end
θ^o ← θ_m
π^⋆(a|s) ← π(a|h; θ^o).

(49.64)
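A minimal Python sketch of the REINFORCE recursion in (49.64) is given next. The toy episodic MDP (three states, two actions), its dynamics, rewards, and features are made-up values used only to exercise the update equations; they are not part of the grid example discussed later.

import numpy as np

# Minimal sketch of the REINFORCE update (49.64) with a Gibbs policy and
# state-action features f_{s,a}; everything about the toy MDP is synthetic.

rng = np.random.default_rng(0)
num_s, num_a, T = 3, 2, 6
F = rng.standard_normal((num_s, num_a, T))     # features f_{s,a}
gamma, mu = 0.9, 0.05

def probs(s, theta):
    logits = F[s] @ theta
    p = np.exp(logits - logits.max())
    return p / p.sum()

def step(s, a):
    # toy dynamics: action 1 moves toward terminal state num_s-1 and pays +1
    s_next = min(s + a + (rng.random() < 0.2), num_s - 1)
    r = 1.0 if s_next == num_s - 1 else -0.1
    return int(s_next), r, s_next == num_s - 1

theta = np.zeros(T)
for m in range(200):                           # policy updates (49.60)
    g, N = np.zeros(T), 0
    for _ in range(10):                        # E = 10 episodes
        s, done, traj = 0, False, []
        while not done:
            a = rng.choice(num_a, p=probs(s, theta))
            s_next, r, done = step(s, a)
            traj.append((s, a, r))
            s = s_next
        G = 0.0
        for (s_t, a_t, r_t) in reversed(traj):
            G = r_t + gamma * G                # reward-to-go, eq. (49.63)
            g += G * (F[s_t, a_t] - F[s_t].T @ probs(s_t, theta))
            N += 1
    theta += mu * g / N                        # REINFORCE update
print(probs(0, theta))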

The above derivation is applicable to the discounted reward objective functions J_1(θ) and J_2(θ) from Table 49.1 since for these costs, expression (49.63) can be used to estimate the value function v^π(s). However, when the average rewards J_3(θ) or J_4(θ) are used, we know from (49.39a) that we should now rely instead on the following expression for estimating v^π(s):

\widehat{v}^\pi(s_n) = \sum_{m=n}^{N_e} \big( r(m) - \bar{r} \big)    (49.65)

with the correction factor r̄; this factor can be estimated from the episode samples by using

\widehat{\bar{r}} = \frac{1}{N_e} \sum_{n=0}^{N_e-1} r(n)    (49.66)

Sometimes it is convenient to estimate r̄ iteratively in an online manner. In that case, we can appeal to the result of Prob. 16.9 and employ the following recursion to compute recursive estimates:

\widehat{\bar{r}}(n) = \widehat{\bar{r}}(n-1) + \mu_r(n) \big( r(n) - \widehat{\bar{r}}(n-1) \big), \qquad n \ge 0    (49.67)


with boundary condition \widehat{\bar{r}}(-1) = 0. In this way, algorithm (49.64) will be adjusted as shown in listing (49.68).

REINFORCE for optimal policy design under average rewards.

start from an arbitrary θ_{−1} ∈ ℝ^T, which defines π(a|h; θ_{−1});
repeat over m = 0, 1, 2, . . .:
    collect E episodes/trajectories following current policy π(a|h; θ_{m−1});
    let N_e denote the number of transitions in episode e;
    when necessary, let notation (h_n^{(e)}, a_n^{(e)}, r^{(e)}(n)) with superscript (e)
        index the (s, a, r) quantities over episodes and time.
    repeat over samples in each episode:
        r̄ ← (1/N_e) Σ_{n=0}^{N_e−1} r^{(e)}(n)
        v̂^π(s_n^{(e)}) = Σ_{m=n}^{N_e−1} γ^{m−n} ( r^{(e)}(m) − r̄ )
    end
    N = Σ_{e=1}^{E} N_e
    g_{m−1} = (1/N) Σ_{e=1}^{E} Σ_{n=0}^{N_e−1} v̂^π(s_n^{(e)}) ∇_{θ^T} ln π(a_n^{(e)} | h_n^{(e)}; θ_{m−1})
    θ_m = θ_{m−1} + µ(m) g_{m−1}
end
θ^o ← θ_m
π^⋆(a|s) ← π(a|h; θ^o).

(49.68)

In the following we will ignore the correction that is necessary in the average reward case and list the algorithms only for the discounted case, with the understanding that simple adjustments similar to the above should be incorporated for average rewards. The adjustments may involve either a block estimation of r¯ using all available rewards from an observed episode, as done above, or an iterative estimation of r¯(n) using (49.67).
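The online recursion (49.67) is itself easy to illustrate: with the step-size choice µ_r(n) = 1/(n + 1) it reproduces the running sample mean of the rewards. The reward stream below is synthetic.

import numpy as np

# Minimal sketch of the online average-reward estimate (49.67).

rng = np.random.default_rng(2)
rewards = rng.normal(loc=0.5, scale=1.0, size=1000)

r_bar = 0.0                                   # boundary condition r_bar(-1) = 0
for n, r in enumerate(rewards):
    mu_r = 1.0 / (n + 1)
    r_bar = r_bar + mu_r * (r - r_bar)        # recursion (49.67)

print(r_bar, rewards.mean())                  # the two values coincide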

49.6.2 Standard Gradient Policies

The REINFORCE algorithm sets the baseline function g(s) in expression (49.44) to zero and employs instantaneous estimates for the gradient vector ∇_θ J(θ) by constructing samples of the following form (we are ignoring the episode superscript and assuming a single measurement, which is sufficient to convey the main idea):

\widehat{\nabla_{\theta^T} J}(\theta_{m-1}) \approx \widehat{q}^\pi(s_n, a_n)\, \nabla_{\theta^T} \ln \pi(a_n | h_n; \theta_{m-1})    (49.69)

This gradient estimate can have large variance. We will therefore exploit the ability to employ a nontrivial baseline function g(s) to reduce this variance. In particular, we will set g(s) = v^π(s) (which we already know corresponds to the conditional mean of the state–action function). By doing so, we end up with a gradient estimate of the form:

\widehat{\nabla_{\theta^T} J}(\theta_{m-1}) \approx \big( \widehat{q}^\pi(s_n, a_n) - \widehat{v}^\pi(s_n) \big)\, \nabla_{\theta^T} \ln \pi(a_n | h_n; \theta_{m-1})    (49.70)

which will have a smaller variance. This is because, for any random variable x, it holds that

\mathbb{E}\,(\boldsymbol{x} - \mathbb{E}\,\boldsymbol{x})^2 = \mathbb{E}\,\boldsymbol{x}^2 - (\mathbb{E}\,\boldsymbol{x})^2 \le \mathbb{E}\,\boldsymbol{x}^2    (49.71)

so that

\mathbb{E}_{\pi,\mathcal{P}} \Big[ \big( q^\pi(s,\boldsymbol{a}) - v^\pi(s) \big)^2 \,\Big|\, \boldsymbol{s} = s \Big] \le \mathbb{E}_{\pi,\mathcal{P}} \Big[ \big( q^\pi(s,\boldsymbol{a}) \big)^2 \,\Big|\, \boldsymbol{s} = s \Big]    (49.72)

We are therefore motivated to introduce the advantage function, defined as the difference:

A^\pi(s,a) \triangleq q^\pi(s,a) - v^\pi(s), \qquad s\in\mathcal{S},\ a\in\mathcal{A}    (49.73)

The advantage function A^π(s, a) measures how much additional cumulative reward the agent collects by taking action a at state s, relative to the average value v^π(s) of that state. The advantage can assume zero, negative, or positive values depending on whether action a decreases or increases the cumulative reward. The algorithms that we describe in this section are referred to as actor–critic algorithms. The critic terminology refers to that component in the algorithm that estimates q^π(s_n, a_n) and v^π(s_n), while the actor terminology refers to that component in the algorithm that updates θ_{m−1} and learns the policy. Note that the critic component is dealing with a problem that we already studied in previous chapters, related to evaluating state or state–action value functions, which can be accomplished by using any of the variations of temporal difference (TD) or SARSA learning with and without linear approximation. For example, by relying on the TD(0) and SARSA constructions (48.11) and (48.87) for linear models, we arrive at listing (49.74), where we are denoting the weight vectors for estimating v^π(s) and q^π(s, a) by the letters w and c, respectively. We are also denoting the feature vectors for the states and state–action pairs by {h_s, f_{s,a}}.


Actor–critic algorithm for optimal policy design using two critics.

set initial weight vectors, c_{−1} = 0_{T×1}, w_{−1} = 0_{M×1};
choose an arbitrary θ_{−1} ∈ ℝ^T, which defines π(a|h; θ_{−1});
repeat over m = 0, 1, 2, . . .:
    collect E episodes/trajectories following current policy π(a|h; θ_{m−1});
    let N_e denote the number of transitions in episode e;
    when necessary, let notation (h_n^{(e)}, a_n^{(e)}, f_n^{(e)}, r^{(e)}(n)) with superscript (e)
        index the (s, a, f, r) quantities over episodes and time
    repeat over episodes:
        for each episode, let (h_0, f_0) denote the initial feature vectors
        repeat over episode for n ≥ 0:
            (we are ignoring the superscript (e) on (h, a, r, f) for simplicity)
            a ∼ π(a|h_n; θ_{m−1})
            observe h_{n+1}
            r(n) = r(h_n, a, h_{n+1})
            a_{n+1} ∼ π(a|h_{n+1}; θ_{m−1}) and set f_{n+1} = f_{s_{n+1}, a_{n+1}}
            (critic for state value function)
            δ(n) = r(n) + (γ h_{n+1} − h_n)^T w_{n−1}
            w_n = w_{n−1} + µ_1(n) δ(n) h_n
            (critic for state–action value function)
            β(n) = r(n) + (γ f_{n+1} − f_n)^T c_{n−1}
            c_n = c_{n−1} + µ_2(n) β(n) f_n
        end
        c_{−1} ← c_n, w_{−1} ← w_n, in preparation for next episode
    end
    (learned critic models for iteration m − 1)
    c^o_{m−1} ← c_n, w^o_{m−1} ← w_n
    (actor for policy parameter)
    N = Σ_{e=1}^{E} N_e   (number of data points)
    g_{m−1} = (1/N) Σ_{e=1}^{E} Σ_{n=0}^{N_e−1} [ (f_n^{(e)})^T c^o_{m−1} − (h_n^{(e)})^T w^o_{m−1} ] ∇_{θ^T} ln π(a_n^{(e)} | h_n^{(e)}; θ_{m−1})
    θ_m = θ_{m−1} + µ(m) g_{m−1}
end
θ^o ← θ_m
π^⋆(a|s) ← π(a|h; θ^o).

(49.74)


Listing (49.74) is valid for discounted rewards – see Prob. 49.11 for adjustments for average rewards. We remark though that this algorithm is rather inefficient and involves two critics: one for the state value function and another for the state–action value function. There is an alternative construction that relies on a single linear approximation (and, hence, on a single critic). It can be motivated as follows. We consider the TD term:

\delta^\pi(s,a,s') \triangleq r(s,a,s') + \gamma\, v^\pi(s') - v^\pi(s)    (49.75)

and compute its conditional expectation to find:

\mathbb{E}\big[ \boldsymbol{\delta}^\pi(s,a,\boldsymbol{s}') \,\big|\, \boldsymbol{s}=s, \boldsymbol{a}=a \big]
= \mathbb{E}\big[ \boldsymbol{r}(s,a,\boldsymbol{s}') + \gamma\, v^\pi(\boldsymbol{s}') - v^\pi(s) \,\big|\, \boldsymbol{s}=s, \boldsymbol{a}=a \big]
= \mathbb{E}\big[ \boldsymbol{r}(s,a,\boldsymbol{s}') + \gamma\, v^\pi(\boldsymbol{s}') \,\big|\, \boldsymbol{s}=s, \boldsymbol{a}=a \big] - v^\pi(s)
\overset{(44.70)}{=} q^\pi(s,a) - v^\pi(s)
= A^\pi(s,a)    (49.76)

In other words, realizations of the TD term δ^π(s, a, s') can be used to construct an unbiased estimate for the advantage function. These realizations are already present in the actor–critic algorithm (49.74) in the form of the factors δ(n). We therefore arrive at listing (49.77), again for discounted rewards – see Prob. 49.12 for adjustments for the case of average rewards. The procedure in listing (49.77) is referred to as the advantage actor–critic (A2C) implementation. Figure 49.1 provides a block diagram representation for the resulting actor–critic structure. The critic component generates estimates, denoted by Â^π(s, a), for the advantage function, through the parameter δ(n). These in turn are used by the actor component to generate estimates, denoted generically by θ̂, for the policy parameter, θ. The policy model feeds actions into the MDP and the score function back into the actor component. The states (s, s'), represented by their feature vectors (h, h'), are fed into the critic component along with the reward for that transition. The process continues until convergence. Algorithm (49.77) relies on the TD(0) construction (48.11) for estimating v^π(s). One can rely instead on the TD(λ) construction (48.19) to arrive at the listing shown in (49.78). This listing is valid for discounted rewards – see Prob. 49.13 for adjustments for the case of average rewards.


Advantage actor–critic (A2C) algorithm for optimal policy design.

set initial weight vector, w_{−1} = 0_{M×1};
choose an arbitrary θ_{−1} ∈ ℝ^T, which defines π(a|h; θ_{−1}).
repeat over m = 0, 1, 2, . . .:
    collect E episodes/trajectories following current policy π(a|h; θ_{m−1});
    let N_e denote the number of transitions in episode e;
    when necessary, let notation (h_n^{(e)}, a_n^{(e)}, r^{(e)}(n)) with superscript (e)
        index the (s, a, r) quantities over episodes and time
    repeat over episodes:
        for each episode, let (h_0, f_0) denote the initial feature vectors
        repeat over episode for n ≥ 0:
            (we are ignoring the superscript (e) on (h, a, r, f) for simplicity)
            a ∼ π(a|h_n; θ_{m−1})
            observe h_{n+1}
            r(n) = r(h_n, a, h_{n+1})
            (critic for state value function)
            δ(n) = r(n) + (γ h_{n+1} − h_n)^T w_{n−1}
            w_n = w_{n−1} + µ_1(n) δ(n) h_n
        end
        w_{−1} ← w_n, in preparation for next episode
    end
    (learned critic model for iteration m − 1)
    w^o_{m−1} ← w_n
    (actor for policy parameter)
    N = Σ_{e=1}^{E} N_e   (number of data points)
    δ_n^{(e)} = r^{(e)}(n) + (γ h_{n+1}^{(e)} − h_n^{(e)})^T w^o_{m−1}, for all n, e
    g_{m−1} = (1/N) Σ_{e=1}^{E} Σ_{n=0}^{N_e−1} δ_n^{(e)} ∇_{θ^T} ln π(a_n^{(e)} | h_n^{(e)}; θ_{m−1})
    θ_m = θ_{m−1} + µ(m) g_{m−1}
end
θ^o ← θ_m
π^⋆(a|s) ← π(a|h; θ^o).

(49.77)
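The essential A2C mechanics (a linear TD(0) critic whose TD error serves as the advantage estimate in the actor step) can be sketched compactly. Everything about the toy MDP below (features, dynamics, rewards, step sizes) is synthetic and chosen only to make the snippet self-contained; it is a simplified variant that updates the actor after every trajectory rather than after E episodes.

import numpy as np

# Minimal sketch of the A2C construction (49.77): critic v^pi(s) = h_s^T w
# trained by TD(0), with the TD error used as the advantage in the actor step.

rng = np.random.default_rng(3)
num_s, num_a = 4, 3
H = np.eye(num_s)                              # one-hot state features h_s
F = rng.standard_normal((num_s, num_a, 6))     # state-action features f_{s,a}
gamma, mu1, mu = 0.9, 0.05, 0.01

def pi(s, theta):
    z = F[s] @ theta
    p = np.exp(z - z.max())
    return p / p.sum()

def step(s, a):                                # toy MDP dynamics (assumption)
    s_next = (s + a) % num_s
    return s_next, float(s_next == num_s - 1)

w = np.zeros(num_s)                            # critic weights
theta = np.zeros(6)                            # policy parameter
for m in range(500):
    grad, N, s = np.zeros(6), 0, 0
    for n in range(30):                        # one trajectory per iteration
        a = rng.choice(num_a, p=pi(s, theta))
        s_next, r = step(s, a)
        delta = r + (gamma * H[s_next] - H[s]) @ w          # TD error
        w = w + mu1 * delta * H[s]                          # critic update
        grad += delta * (F[s, a] - F[s].T @ pi(s, theta))   # actor direction
        N += 1
        s = s_next
    theta = theta + mu * grad / N              # actor update (49.60)
print(pi(0, theta))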


Advantage actor–critic (A2C(λ)) algorithm for optimal policy design.

set initial weight vector, w_{−1} = 0_{M×1};
set initial eligibility trace, t_{−1} = 0_{|S|×1}, s ∈ S;
choose an arbitrary θ_{−1} ∈ ℝ^T, which defines π(a|h; θ_{−1}).
repeat over m = 0, 1, 2, . . .:
    collect E episodes/trajectories following current policy π(a|h; θ_{m−1});
    let N_e denote the number of transitions in episode e;
    when necessary, let notation (h_n^{(e)}, a_n^{(e)}, r^{(e)}(n)) with superscript (e)
        index the (s, a, r) quantities over episodes and time
    repeat over episodes:
        for each episode, let (h_0, f_0) denote the initial feature vectors
        repeat over episode for n ≥ 0:
            (we are ignoring the superscript (e) on (h, a, r, f) for simplicity)
            a ∼ π(a|h_n; θ_{m−1})
            observe h_{n+1}
            r(n) = r(h_n, a, h_{n+1})
            (critic for state value function)
            δ(n) = r(n) + (γ h_{n+1} − h_n)^T w_{n−1}
            t_n = λγ t_{n−1} + h_n
            w_n = w_{n−1} + µ_1(n) δ(n) t_n
        end
        w_{−1} ← w_n, t_{−1} ← t_n in preparation for next episode
    end
    (learned critic model for iteration m − 1)
    w^o_{m−1} ← w_n
    (actor for policy parameter)
    N = Σ_{e=1}^{E} N_e   (number of data points)
    δ_n^{(e)} = r^{(e)}(n) + (γ h_{n+1}^{(e)} − h_n^{(e)})^T w^o_{m−1}, for all n, e
    g_{m−1} = (1/N) Σ_{e=1}^{E} Σ_{n=0}^{N_e−1} δ_n^{(e)} ∇_{θ^T} ln π(a_n^{(e)} | h_n^{(e)}; θ_{m−1})
    θ_m = θ_{m−1} + µ(m) g_{m−1}
end
θ^o ← θ_m
π^⋆(a|s) ← π(a|h; θ^o).

(49.78)


Figure 49.1 Block diagram representation for the actor–critic structure. The critic component generates estimates, denoted by Â^π(s, a), for the advantage function, which are used by the actor component to generate estimates θ̂ for the policy parameter, θ. The policy model feeds actions into the MDP and the score function into the actor component. The states (s, s'), represented by their feature vectors (h, h'), are fed into the critic component along with the reward for that transition.

As a general remark, it is worth noting from listing (49.78) that policy gradient algorithms tend to be sample inefficient; every iteration m involves collecting several trajectories and estimating the critics {w^o_{m−1}, c^o_{m−1}} before updating the policy parameter from θ_{m−1} to θ_m. It is necessary to store all state–action–reward transitions during the repeated episodes in order to re-evaluate the TD factors δ_n^{(e)} using the critic models {w^o_{m−1}, c^o_{m−1}}. These policies can also perform poorly in environments with very sparse reward structures where rewards are zero almost everywhere. For example, in a game of chess, an agent would need to wait until the game concludes to collect the reward (say, a positive reward for winning and a negative reward for losing). More efficient exploration strategies of the state–action space are necessary.

Example 49.5 (Playing a game over a grid) We illustrate the operation of the advantage actor–critic (A2C) algorithm (49.77) by reconsidering the same grid problem from Fig. 48.1. We assign one-hot encoded feature vectors to the actions a ∈ A and collect them into a matrix of size 5 × 5 (one row per action):

A = \begin{bmatrix} 1&0&0&0&0 \\ 0&1&0&0&0 \\ 0&0&1&0&0 \\ 0&0&0&1&0 \\ 0&0&0&0&1 \end{bmatrix} \begin{matrix} \leftarrow \text{up} \\ \leftarrow \text{down} \\ \leftarrow \text{left} \\ \leftarrow \text{right} \\ \leftarrow \text{stop} \end{matrix}    (49.79)


We continue to employ the same one-hot encoded feature vectors (48.14) from Example 48.1, which we collected into a 17 × 17 matrix H with one row corresponding to each state:

H = \begin{bmatrix} I_{16} & 0_{16\times 1} \\ 0_{1\times 16} & 0 \end{bmatrix} \in \mathbb{R}^{17\times 17}    (49.80)

with one row for each state s = 1, 2, . . . , 17; rows 1 through 16 are the one-hot basis vectors for the corresponding states, while the row for state 17 is identically zero.

Using the matrices H and A, we define feature vectors f_{s,a} for the state–action pairs (s, a) by computing the Kronecker product of H and A as follows:

F = H \otimes A    (49.81)

The resulting matrix F has dimensions 85 × 85, with each row of F corresponding to the transpose of the feature vector f_{s,a} associated with the state–action pair (s, a). For any particular (s, a), the row of F that corresponds to it has index:

\text{row index} = (s-1)\times|\mathcal{A}| + a, \qquad s = 1, 2, \ldots, |\mathcal{S}|, \quad a = 1, 2, \ldots, |\mathcal{A}|    (49.82)

where |S| = 17 and |A| = 5. We run the A2C algorithm for 1,000,000 iterations using constant step sizes:

µ1 = 0.01, µ = 0.0001, ρ = 0.001, γ = 0.9

(49.83)

Each iteration involves a run over E = 100 episodes to train the critic. At the end of the simulation, we evaluate the policies at each state and arrive at the matrix Π in (49.84), with each row corresponding to one state and each column to one action (i.e., each row corresponds to the distribution π(a|s)). The largest entry in each row is used to define the "optimal" policy at that state. The resulting optimal actions are represented by arrows in Fig. 49.2.


Π =
          up      down    left    right   stop
s = 1     0.0020  0.8653  0.1253  0.0074  0
s = 2     0.9254  0.0144  0.0540  0.0061  0
s = 3     0.9156  0.0250  0.0337  0.0257  0
s = 4     0       0       0       0       0
s = 5     0.9735  0.0080  0.0128  0.0057  0
s = 6     0.0230  0.0094  0.9429  0.0247  0
s = 7     0.9840  0.0026  0.0124  0.0011  0
s = 8     0       0       0       0       1.0000
s = 9     0.9940  0.0008  0.0024  0.0027  0
s = 10    0.9905  0.0013  0.0032  0.0050  0
s = 11    0       0       0       0       0
s = 12    0.9878  0.0025  0.0049  0.0049  0
s = 13    0.0058  0.0027  0.0037  0.9878  0
s = 14    0.0030  0.0030  0.0016  0.9924  0
s = 15    0.0019  0.0010  0.0009  0.9963  0
s = 16    0       0       0       0       1.0000
s = 17    0       0       0       0       1.0000
                                          (49.84)


Figure 49.2 Optimal actions at each square are indicated by red arrows representing motions in the directions {upward, downward, left, right}. The circular arrows in states 8 and 16 indicate that the game stops at these locations. This simulation employs the A2C algorithm (49.77) and relies on the extended feature vectors (49.80) and (49.81).
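The Kronecker-product construction (49.79)–(49.82) used in this example is compact to implement; the sketch below builds the 85 × 85 feature matrix and recovers the row associated with a given state–action pair (the 0-based index is simply the 1-based index of (49.82) minus one).

import numpy as np

# Minimal sketch of the feature construction of Example 49.5: one-hot action
# features A, state features H (with a zero row for state 17), and
# state-action features F = H kron A from (49.81), indexed via (49.82).

num_states, num_actions = 17, 5
A = np.eye(num_actions)                        # rows: up, down, left, right, stop
H = np.zeros((num_states, num_states))
H[:16, :16] = np.eye(16)                       # state 17 keeps the zero feature

F = np.kron(H, A)                              # 85 x 85 feature matrix

def feature(s, a):
    """Return f_{s,a} for state s = 1..17 and action a = 1..5 (1-indexed)."""
    row = (s - 1) * num_actions + (a - 1)      # row index (49.82), 0-based
    return F[row]

print(F.shape)                                 # (85, 85)
print(np.nonzero(feature(3, 2))[0])            # a single nonzero entry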

49.7 NATURAL GRADIENT POLICY

The policy gradient methods described in the previous section follow gradient-ascent iterations along the direction defined by the gradient vector ∇_{θ^T} J(θ). These methods are therefore parameter-dependent since the search directions depend on θ; this in turn means that the performance of the algorithms is sensitive to the choice of the coordinate system. For example, assume the coordinate system is modified and the parameter θ for the objective function is replaced by a transformed vector θ' in the new space. Then, the gradient vector of the same objective function J(·) relative to θ' will generally result in a different search direction than the gradient vector relative to θ (despite the objective function remaining invariant). This situation was illustrated earlier in Example 6.1. Natural gradient policies are search procedures that are parameter-independent. Their main motivation is to ensure that a change in the parameterization of the policy does not affect the policy result. As was shown earlier in (6.130), the search direction for natural gradient algorithms is modulated by the inverse of the Fisher information matrix. We motivate the construction in the context of reinforcement learning policies as follows. Given a policy π(a|h; θ), parameterized by θ, we introduce its point Fisher information matrix F(θ, s) (recall (31.96)). This matrix corresponds to the covariance matrix of the score function at state s (or feature h), namely,

F(\theta, s) = \mathbb{E}_\pi\, S(h,\boldsymbol{a};\theta)\, \big( S(h,\boldsymbol{a};\theta) \big)^T
\overset{(49.12)}{=} \mathbb{E}_\pi \big[ \nabla_{\theta^T} \ln \pi(\boldsymbol{a}|h;\theta) \big] \big[ \nabla_{\theta} \ln \pi(\boldsymbol{a}|h;\theta) \big]
= \sum_{a\in\mathcal{A}} \pi(a|s;\theta) \big[ \nabla_{\theta^T} \ln \pi(a|h;\theta) \big] \big[ \nabla_{\theta} \ln \pi(a|h;\theta) \big]    (49.85)

Note that F(θ, s) is state-dependent. We can average over a distribution for the state variables s ∈ S and remove the dependency on s (or the feature space). For this purpose, we introduce the average Fisher information matrix defined by (see Prob. 49.15):

F(\theta) \triangleq \sum_{s\in\mathcal{S}} d(s)\, F(\theta, s)    (49.86)

where the distribution d(s) is chosen as defined earlier in (49.45) and (49.46), depending on whether we are dealing with the optimization of an average reward objective (J3 (θ) or J4 (θ)) or a discounted reward objective (J1 (θ)). In the first case, we set d(s) = dπ (s) and in the second case we set d(s) = dπγ (s). Continuing with the generic notation d(s), we then have


F(\theta) = \sum_{s\in\mathcal{S}} \sum_{a\in\mathcal{A}} d(s)\, \pi(a|h;\theta) \big[ \nabla_{\theta^T} \ln \pi(a|h;\theta) \big] \big[ \nabla_{\theta} \ln \pi(a|h;\theta) \big]    (49.87)

Natural gradient policy methods are based on replacing the gradient direction ∇_{θ^T} J(θ) (which is sometimes referred to as the "vanilla" gradient) by the natural gradient direction F^{−1}(θ) ∇_{θ^T} J(θ):

\text{replace gradient } \nabla_{\theta^T} J(\theta) \ \text{ by natural gradient } \ F^{-1}(\theta)\, \nabla_{\theta^T} J(\theta)    (49.88)

This substitution is possible because gradient-ascent algorithms can employ any search direction of the form B ∇_{θ^T} J(θ) for any B > 0. We employed B = I (the identity matrix) in the previous derivations, but other choices are possible and can lead to faster convergence, such as the choice B = F^{−1}(θ), under the assumption that the Fisher matrix is invertible. The motivation for this choice of search direction becomes evident if the compatible state–action approximation (49.47) is employed. In this case, the natural gradient direction reduces to

F^{-1}(\theta)\, \nabla_{\theta^T} J(\theta) \overset{(49.49)}{=} F^{-1}(\theta)\, \mathbb{E}_{\pi,d} \big[ \nabla_{\theta^T} \ln \pi(\boldsymbol{a}|\boldsymbol{h};\theta)\, q^\pi(\boldsymbol{s},\boldsymbol{a};c^o) \big]
\overset{(49.47)}{=} F^{-1}(\theta)\, \mathbb{E}_{\pi,d} \big[ \nabla_{\theta^T} \ln \pi(\boldsymbol{a}|\boldsymbol{h};\theta)\, \nabla_{\theta} \ln \pi(\boldsymbol{a}|\boldsymbol{h};\theta) \big]\, c^o
\overset{(49.87)}{=} F^{-1}(\theta)\, F(\theta)\, c^o
= c^o    (49.89)

This means that natural gradient policy algorithms update the policy parameter, θ_{m−1}, by moving along the direction of the critic parameter used for q^π(s, a). We list the resulting recursions in (49.91), where we use the approximation

F^{-1}(\theta_{m-1})\, \nabla_{\theta^T} J(\theta_{m-1}) \approx c^o_{m-1}    (49.90)

This listing is valid for discounted rewards – see Prob. 49.14 for the case of average rewards. We describe another variant in Prob. 49.16 for the case of compatible feature vectors.
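For policies that are not paired with a compatible critic, the natural gradient direction can still be formed explicitly from (49.85)–(49.88). The sketch below does this for a Gibbs policy at a single state with synthetic features and Q-values; a small ridge term is added because the single-state Fisher matrix is typically rank deficient.

import numpy as np

# Minimal sketch of the natural gradient direction F^{-1}(theta) grad J(theta)
# in (49.88) for a Gibbs policy at one state, with the Fisher matrix built as
# in (49.85). The features, parameter, and Q-values are synthetic.

rng = np.random.default_rng(4)
num_a, T = 4, 5
F_s = rng.standard_normal((num_a, T))        # features f_{s,a}
theta = rng.standard_normal(T)
q_vals = rng.standard_normal(num_a)          # stand-ins for q^pi(s, a)

logits = F_s @ theta
pi = np.exp(logits - logits.max()); pi /= pi.sum()
scores = F_s - F_s.T @ pi                    # rows: grad log pi(a|h;theta)

# point Fisher matrix (49.85) and vanilla gradient estimate at this state
Fisher = (pi[:, None, None] * scores[:, :, None] * scores[:, None, :]).sum(axis=0)
grad_J = (pi[:, None] * q_vals[:, None] * scores).sum(axis=0)

# natural gradient direction; the ridge term keeps the inverse well posed
nat_grad = np.linalg.solve(Fisher + 1e-6 * np.eye(T), grad_J)
print(nat_grad)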


Natural-gradient actor–critic algorithm for optimal policy design.

set initial weight vector, c_{−1} = 0_{T×1};
choose an arbitrary θ_{−1} ∈ ℝ^T, which defines π(a|h; θ_{−1}).
repeat over m = 0, 1, 2, . . .:
    collect E episodes/trajectories following current policy π(a|h; θ_{m−1});
    let N_e denote the number of transitions in episode e;
    when necessary, let notation (h_n^{(e)}, a_n^{(e)}, f_n^{(e)}, r^{(e)}(n)) with superscript (e)
        index the (s, a, f, r) quantities over episodes and time
    repeat over episodes:
        for each episode, let (h_0, f_0) denote the initial feature vectors
        a_0 ∼ π(a|h_0; θ_{m−1})
        f_0 = f_{s_0, a_0}
        repeat over episode for n ≥ 0:
            (we are ignoring the superscript (e) on (h, a, r, f) for simplicity)
            observe h_{n+1} corresponding to s' = s_{n+1}
            r(n) = r(h_n, a, h_{n+1})
            a_{n+1} ∼ π(a|h_{n+1}; θ_{m−1})
            let f_{n+1} = f_{s_{n+1}, a_{n+1}}
            (critic for state–action value function)
            β(n) = r(n) + (γ f_{n+1} − f_n)^T c_{n−1}
            c_n = c_{n−1} + µ_2(n) β(n) f_n
        end
        c_{−1} ← c_n in preparation for next episode
    end
    (learned critic model for iteration m − 1)
    c^o_{m−1} ← c_n
    (actor for policy parameter)
    θ_m = θ_{m−1} + µ(m) c^o_{m−1}
end
θ^o ← θ_m
π^⋆(a|s) ← π(a|h; θ^o).

(49.91)

Example 49.6 (Playing a game over a grid) We illustrate the operation of the natural-gradient actor–critic algorithm (49.91) by reconsidering the same grid problem from Fig. 48.1. We employ the same encoding for states s and state–action pairs (s, a) from Example 49.5. We run the algorithm for 10,000 iterations using constant step sizes


µ1 = 0.01, µ = 0.0001, ρ = 0.001, γ = 0.9

(49.92)

Each iteration involves a run over E = 100 episodes to train the critic. At the end of the simulation, we evaluate the policies at each state and arrive at the matrix Π in (49.93), with each row corresponding to one state and each column to one action (i.e., each row corresponds to the distribution π(a|s)). The largest entry in each row is used to define the "optimal" policy at that state. The resulting optimal actions are represented by arrows in Fig. 49.3.

Π =
          up      down    left    right   stop
s = 1     0.0000  0.9933  0.0067  0.0000  0
s = 2     0.9825  0.0008  0.0168  0.0000  0
s = 3     0.8348  0.0198  0.0341  0.1113  0
s = 4     0       0       0       0       0
s = 5     0.9831  0.0013  0.0155  0.0001  0
s = 6     0.0657  0.0013  0.5922  0.3408  0
s = 7     0.9983  0.0000  0.0017  0.0000  0
s = 8     0       0       0       0       1.0000
s = 9     1.0000  0.0000  0.0000  0.0000  0
s = 10    0.9951  0.0000  0.0000  0.0049  0
s = 11    0       0       0       0       0
s = 12    0.9974  0.0000  0.0013  0.0013  0
s = 13    0.0032  0.0000  0.0001  0.9967  0
s = 14    0.0001  0.0001  0.0000  0.9998  0
s = 15    0.0000  0.0000  0.0000  1.0000  0
s = 16    0       0       0       0       1.0000
s = 17    0       0       0       0       1.0000
                                          (49.93)

49.8 TRUST REGION POLICY OPTIMIZATION

The policy gradient methods presented so far face at least two challenges. First, it is difficult to fine-tune their step-size parameters to ensure smooth changes to the behavior policy. Since the methods adjust the policies directly, small errors can have consequential effects (such as leading an agent to move to the "right" rather than to the "left"). Second, this difficulty is compounded by the nonstationary nature of the MDP environment during the learning process. Since the policies are changing through continual updates, the sampling of the environment according to these policies leads to samples with nonstationary statistical properties (which makes effective learning more difficult to attain). We describe in this section another approach to policy gradient design that helps ameliorate some of these difficulties; the approach is based on updating the policies in a constrained manner to avoid sharp changes from one iteration to another and to prevent the agent from behaving erratically. The approach aims at producing successive policies with improved cost/reward value (as explained further ahead in Example 49.7 and also Prob. 49.8).



Figure 49.3 Optimal actions at each square are indicated by red arrows representing motions in the directions {upward, downward, left, right}. The circular arrows in states 8 and 16 indicate that the game stops at these locations. This simulation employs the natural-gradient actor–critic algorithm (49.91) and relies on the extended feature vectors (49.80) and (49.81).

49.8.1 Vanilla Policy Gradient

We refer to the policy gradient methods from the earlier sections as "vanilla" methods; they were motivated by relying on the policy gradient theorem (49.44), which characterized the gradient of the cost function in the following manner:

\nabla_{\theta^T} J(\theta) = \mathbb{E}_{\pi,d} \Big[ \big( q^\pi(\boldsymbol{s},\boldsymbol{a}) - g(\boldsymbol{s}) \big)\, \nabla_{\theta^T} \ln \pi(\boldsymbol{a}|\boldsymbol{h};\theta) \Big]    (49.94)

for some arbitrary baseline function, g(s). We selected g(s) = v^π(s) and ended up relying on the expression:

\nabla_{\theta^T} J(\theta) = \mathbb{E}_{\pi,d}\, A^\pi(\boldsymbol{s},\boldsymbol{a})\, \nabla_{\theta^T} \ln \pi(\boldsymbol{a}|\boldsymbol{h};\theta)    (49.95)

in terms of the advantage function under policy π(a|s), which was defined as

A^\pi(s,a) \triangleq q^\pi(s,a) - v^\pi(s)    (49.96)

For example, the A2C algorithm (49.77) employs the following sample approximation for the gradient vector of J(θ) evaluated at the current iterate θm−1 (we are ignoring the episode superscript and assuming a single measurement, which is sufficient to convey the main idea):


\theta_m = \theta_{m-1} + \mu_2(m)\, \underbrace{ \underbrace{\delta(n)}_{=\,\widehat{A}^\pi(s,a)}\, \nabla_{\theta^T} \ln \pi(a_n|h_n;\theta_{m-1}) }_{=\,\widehat{\nabla_{\theta^T} J}(\theta_{m-1})}    (49.97)

If we were to reverse this step and integrate the gradient expression in (49.97), we would conclude that this same stochastic gradient algorithm may be interpreted as attempting to maximize the following cost function:

G(\theta) = \mathbb{E}_{\pi,d} \big\{ \ln \pi(\boldsymbol{a}|\boldsymbol{h};\theta)\, A^\pi(\boldsymbol{s},\boldsymbol{a}) \big\} \qquad \text{(one alternative view)}    (49.98)

This is because if we were to differentiate G(θ) relative to θ we would recover the same gradient expression (49.95). The above expression highlights one problem with implementation (49.97): It suggests that at state–action locations where the advantage function A^π(s, a) is positive, the stochastic gradient recursion will update θ in a direction that increases the value of ln π(a|h; θ) toward +∞; this is problematic since it can cause a large change to the policy and lead to erratic behavior by the agent. This is one reason why the trust region methods of this section will impose a constraint on the amount of change that will be tolerated on the policy update from one iteration to another.

We can rewrite the cost (49.98) in another useful form that will be exploited in our derivation of the trust region implementation. Let π(a|h; θ') with parameter θ' denote some second policy; we can view it as an updated version of π(a|h; θ). Then, consider the following cost, which incorporates an importance sampling ratio:

G_{IS}(\theta') = \mathbb{E}_{\pi,d} \Big\{ \frac{\pi(\boldsymbol{a}|\boldsymbol{h};\theta')}{\pi(\boldsymbol{a}|\boldsymbol{h};\theta)}\, A^\pi(\boldsymbol{s},\boldsymbol{a}) \Big\} \qquad \text{(yet another view)}    (49.99)

If we differentiate this cost relative to θ' and assess the value of the gradient vector at θ' = θ, we find that (after switching the expectation and gradient operations):

\nabla_{\theta'}\, G_{IS}(\theta') \Big|_{\theta'=\theta} = \mathbb{E}_{\pi,d} \Big\{ \frac{\nabla_\theta\, \pi(\boldsymbol{a}|\boldsymbol{h};\theta)}{\pi(\boldsymbol{a}|\boldsymbol{h};\theta)}\, A^\pi(\boldsymbol{s},\boldsymbol{a}) \Big\} = \nabla_\theta G(\theta) = \nabla_\theta J(\theta)    (49.100)

This result suggests that we can use G_{IS}(·) as another starting cost to develop learning algorithms since its gradient matches the gradient for the original cost J(θ) that we wish to optimize. We will show further ahead in (49.110) that G_{IS}(·) approximates well the change in the cost function when the parameter is updated from θ to θ'. In this way, by maximizing its value, we will be assisting the algorithm to generate successive policies with improved reward values.
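The gradient identity (49.100) is easy to confirm numerically for a Gibbs policy at a single state. In the sketch below the features and advantage values are synthetic; the importance-sampling surrogate is differentiated by finite differences and compared against the vanilla expression (49.95) restricted to that state.

import numpy as np

# Minimal numerical check of (49.99)-(49.100): the gradient of G_IS(theta')
# at theta' = theta coincides with the vanilla policy gradient.

rng = np.random.default_rng(5)
num_a, T = 4, 5
F_s = rng.standard_normal((num_a, T))
theta = rng.standard_normal(T)
adv = rng.standard_normal(num_a)             # stand-ins for A^pi(s, a)

def probs(th):
    z = F_s @ th
    p = np.exp(z - z.max())
    return p / p.sum()

def G_IS(th_prime):
    return np.sum(probs(theta) * (probs(th_prime) / probs(theta)) * adv)

# numerical gradient of G_IS at theta' = theta
eps, g_num = 1e-6, np.zeros(T)
for i in range(T):
    d = np.zeros(T); d[i] = eps
    g_num[i] = (G_IS(theta + d) - G_IS(theta - d)) / (2 * eps)

# vanilla policy gradient E_pi[ A(s,a) grad log pi ] at the same state
pi = probs(theta)
scores = F_s - F_s.T @ pi
g_pg = (pi * adv) @ scores

print(np.allclose(g_num, g_pg, atol=1e-5))   # True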

49.8.2 TRPO Method

It is sufficient for our purposes to consider the expected discounted reward J_2(θ) defined by (49.30). For simplicity, we will remove the scaling by (1 − γ) and denote the cost function simply by J(θ). Thus, our aim is to seek the policy parameter θ that solves:

\theta^o = \operatorname{argmax}_{\theta\in\mathbb{R}^T} \Big\{ J(\theta) = \mathbb{E}_{d^\pi}\, v^\pi(\boldsymbol{s}) = \sum_{s\in\mathcal{S}} d^\pi(s)\, v^\pi(s) \Big\}    (49.101)

where d^π(s) denotes the stationary distribution of policy π(a|s; θ). Each entry d^π(s) denotes the likelihood of state s being selected. To solve problem (49.101) we proceed by examining first how J(θ) changes when the parameter of the policy is adjusted from θ to θ' = θ + δθ. The main objective of the trust region policy optimization (TRPO) method is to select δθ to ensure two properties. First, J(θ') should be increased relative to J(θ) and, second, the new policy π(a|h; θ') should be obtained by minimally adjusting the current policy π(a|h; θ) so that the behavior of the agent is not perturbed drastically when the policy is updated. This second property will be enforced by requiring the Kullback–Leibler (KL) divergence between the current and updated policies to be small. The details are as follows.

Consider two generic policies: an old policy π(a|h; θ) and a newer policy π(a|h; θ') with value functions v^π(s) and v^{π'}(s), respectively. In this notation, we are assuming that θ' is the updated parameter and we use the prime symbol to refer to both the newer policy, π', and its parameter, θ'. It is easy to verify that the corresponding cost functions satisfy the relation:

J(\theta') = J(\theta) + \mathbb{E}_{\pi',\mathcal{P}} \Big( \sum_{n=0}^{\infty} \gamma^n\, A^\pi(\boldsymbol{s}_n, \boldsymbol{a}_n) \Big)    (49.102)

where the expectation is over actions selected according to the new policy π 0 (a|h; θ0 ) while the advantages are calculated according to the old policy π(a|h; θ) (i.e., using the state and state–action functions v π (s) and q π (s, a) for that policy). This relation provides one initial (though not so convenient) result showing how the cost function is changed when θ is updated to θ0 .


Proof of (49.102): From the definition (49.96) we have

\mathbb{E}_{\pi',\mathcal{P}} \Big\{ \sum_{n=0}^{\infty} \gamma^n A^\pi(\boldsymbol{s}_n,\boldsymbol{a}_n) \Big\}, \qquad \boldsymbol{a}_n \sim \pi(a|h_n;\theta')
= \mathbb{E}_{\pi',\mathcal{P}} \Big\{ \sum_{n=0}^{\infty} \gamma^n \big( q^\pi(\boldsymbol{s}_n,\boldsymbol{a}_n) - v^\pi(\boldsymbol{s}_n) \big) \Big\}
= \mathbb{E}_{\pi',\mathcal{P}} \Big\{ \sum_{n=0}^{\infty} \gamma^n \big( \boldsymbol{r}(n) + \gamma\, v^\pi(\boldsymbol{s}_{n+1}) \big) \Big\} - \mathbb{E}_{\pi',\mathcal{P}} \Big\{ \sum_{n=0}^{\infty} \gamma^n\, v^\pi(\boldsymbol{s}_n) \Big\}
= \mathbb{E}_{\pi',\mathcal{P}} \Big\{ \sum_{n=0}^{\infty} \gamma^n\, \boldsymbol{r}(n) \Big\} + \mathbb{E}_{\pi',\mathcal{P}} \Big\{ \sum_{n=0}^{\infty} \big( \gamma^{n+1} v^\pi(\boldsymbol{s}_{n+1}) - \gamma^n v^\pi(\boldsymbol{s}_n) \big) \Big\}
= \mathbb{E}_{\pi',\mathcal{P}} \Big\{ \sum_{n=0}^{\infty} \gamma^n\, \boldsymbol{r}(n) \Big\} - \mathbb{E}_{\pi',\mathcal{P}} \big\{ v^\pi(\boldsymbol{s}_0) \big\}
\overset{(a)}{=} \mathbb{E}_{d^{\pi'}} \Big\{ \mathbb{E}_{\pi',\mathcal{P}} \Big( \sum_{n=0}^{\infty} \gamma^n\, \boldsymbol{r}(n) \,\Big|\, \boldsymbol{s}_0 \Big) \Big\} - \mathbb{E}_{\pi,\mathcal{P}} \big\{ v^\pi(\boldsymbol{s}_0) \big\}
= \mathbb{E}_{d^{\pi'}} \big\{ v^{\pi'}(\boldsymbol{s}_0) \big\} - \mathbb{E}_{d^\pi} \big\{ v^\pi(\boldsymbol{s}_0) \big\}
= J(\theta') - J(\theta)    (49.103)

where in step (a) we use the conditional mean property E x = E(E(x|y)) in the first term and replace the expectation over π' by the expectation over π in the rightmost term since v^π(s_0) is solely dependent on policy π. ■

Expression (49.102) shows how the cost function changes when we replace policy π by policy π'. The change is given by a discounted sum of advantage values over an entire trajectory (with n running from zero to infinity). We can derive an equivalent expression that involves averaging over the state × action space rather than over time. For this purpose, we first introduce the discounted visitation function:

\rho^{\pi'}(s) = \mathbb{P}(\boldsymbol{s}_0 = s) + \gamma\, \mathbb{P}(\boldsymbol{s}_1 = s) + \gamma^2\, \mathbb{P}(\boldsymbol{s}_2 = s) + \ldots = \mathbb{E}_{\pi'} \Big( \sum_{n=0}^{\infty} \gamma^n\, \mathbb{I}[\boldsymbol{s}_n = s] \Big)    (49.104)

where the actions are taken according to policy a ∼ π(a|h; θ'). Then, it follows that:

J(\theta') = J(\theta) + \sum_{s\in\mathcal{S}} \rho^{\pi'}(s) \Big( \sum_{a\in\mathcal{A}} \pi(a|h;\theta')\, A^\pi(s,a) \Big)    (49.105)

which can also be written in the form:

J(\theta') = J(\theta) + \mathbb{E}_{\pi',\mathcal{P}}\, A^\pi(\boldsymbol{s},\boldsymbol{a})    (49.106)

where the expectation is over the transition probabilities defined by the MDP and over the new policy π 0 (i.e., actions are taken according to this policy).


Proof of (49.105): Starting from (49.102) we have

J(\theta') = J(\theta) + \mathbb{E}_{\mathcal{P}} \Big\{ \sum_{n=0}^{\infty} \gamma^n \Big( \sum_{a\in\mathcal{A}} \pi(a|h;\theta')\, A^\pi(\boldsymbol{s}_n, a) \Big) \Big\}
= J(\theta) + \sum_{s\in\mathcal{S}} \Big( \sum_{n=0}^{\infty} \gamma^n\, \mathbb{P}(\boldsymbol{s}_n = s) \Big) \Big( \sum_{a\in\mathcal{A}} \pi(a|h;\theta')\, A^\pi(s, a) \Big)
= J(\theta) + \sum_{s\in\mathcal{S}} \rho^{\pi'}(s) \Big( \sum_{a\in\mathcal{A}} \pi(a|h;\theta')\, A^\pi(s, a) \Big)    (49.107)

■

Recall that we wish to update θ to θ' in order to increase the cost from J(θ) to J(θ'). Expression (49.106) suggests that we should maximize the correction term E_{π',P} A^π(s, a). However, evaluation of this term involves computing an expectation over actions taken according to the future policy π' (the one we wish to determine) and not over the existing policy π (the one we have available). Nevertheless, a useful approximation is possible if we limit ourselves to small updates of the policy and resort to an off-policy evaluation. Assume, for instance, that the change in going from π to π' is small enough (i.e., the policies are close to each other in a sense that will be made more precise further ahead) to justify writing ρ^{π'}(s) ≈ ρ^π(s). Then, substituting into (49.105) gives

J(\theta') - J(\theta) \approx \sum_{s\in\mathcal{S}} \rho^{\pi}(s) \Big( \sum_{a\in\mathcal{A}} \pi(a|h;\theta')\, A^\pi(s,a) \Big)
= \sum_{s\in\mathcal{S}} \rho^{\pi}(s) \Big( \sum_{a\in\mathcal{A}} \pi(a|h;\theta)\, \frac{\pi(a|h;\theta')}{\pi(a|h;\theta)}\, A^\pi(s,a) \Big)
= \sum_{s\in\mathcal{S}} \rho^{\pi}(s) \Big( \sum_{a\in\mathcal{A}} \pi(a|h;\theta)\, \xi(h,a;\theta')\, A^\pi(s,a) \Big)    (49.108)

where ξ(h, a; θ') refers to the importance sampling ratio:

\xi(h,a;\theta') \triangleq \frac{\pi(a|h;\theta')}{\pi(a|h;\theta)}    (49.109)

That is, the change in the cost function can be approximated by

J(\theta') - J(\theta) \approx \mathbb{E}_{\pi,\mathcal{P}} \big\{ \xi(\boldsymbol{s},\boldsymbol{a};\theta')\, A^\pi(\boldsymbol{s},\boldsymbol{a}) \big\} \overset{(49.99)}{=} G_{IS}(\theta')    (49.110)

where the expectation is now over the old policy π(a|h; θ), while the ξ(s, a) refer to importance sampling weights. This expression shows that the change in the cost can be approximated by sampling data from trajectories generated according to the old policy, π(a|h; θ), as follows:

(a) construct E trajectories according to model π(a|h; θ);
(b) estimate the advantage values as explained later, e.g., in (49.119);
(c) set the total number of samples to N = Σ_{e=1}^{E} N_e;
(d) then

J(\theta') - J(\theta) \approx \frac{1}{N} \sum_{e=1}^{E} \sum_{n=0}^{N_e-1} \xi\big(h_n^{(e)}, a_n^{(e)}; \theta'\big)\, A^\pi\big(s_n^{(e)}, a_n^{(e)}\big)    (49.111)

where we are using the same notation from (48.107), with superscript (e) to refer to feature vectors and actions within episode e. We are also assuming each episode e has duration N_e. We will use this empirical data-based approximation to motivate the TRPO method below.

Example 49.7 (Monotonic improvement in cost values) A stronger result than (49.110) is possible. It is established in Prob. 49.7 that the approximation (49.110) can be replaced by an inequality of the following form:

J(\theta') \ge \underbrace{ J(\theta) + \mathbb{E}_{\pi,\mathcal{P}} \big\{ \xi(\boldsymbol{s},\boldsymbol{a};\theta')\, A^\pi(\boldsymbol{s},\boldsymbol{a}) \big\} - \frac{4 a \gamma}{(1-\gamma)^2}\, \max D_{KL}(\theta,\theta') }_{\triangleq\, K_\theta(\theta')}    (49.112)

where we are denoting the lower bound on J(θ') by K_θ(θ') since the right-hand side depends on both (θ, θ'), and

a \triangleq \max_{s\in\mathcal{S}} \Big\{ \mathbb{E}_{\pi'}\, A^\pi(s,\boldsymbol{a}) \Big\}, \qquad \boldsymbol{a} \sim \pi(a|h;\theta')    (49.113a)

\max D_{KL}(\theta,\theta') \triangleq \max_{s\in\mathcal{S}} D_{KL}\big( \pi(a|h;\theta) \,\|\, \pi(a|h;\theta') \big)    (49.113b)

in terms of the KL divergence between the old and updated policies. Problem 49.8 uses this result to show that if one updates θ_{m−1} = θ (at iteration m − 1) to θ_m (at iteration m) by maximizing the lower bound, namely,

\theta_m = \operatorname{argmax}_{\theta'\in\mathbb{R}^T} K_{\theta_{m-1}}(\theta')    (49.114)

then the successive cost values J(θm ) will be nondecreasing. This is a useful property since it means that the constructed policies continually improve the cost.

Solving problem (49.114) is challenging. The TRPO method introduces two simplifications. First, it employs the empirical approximation (49.111) for the cost to be maximized and, second, it replaces the maximum of the KL divergence by the average KL divergence and seeks to bound this average value:

  max_{θ′∈IR^T}  (1/N) Σ_{e=1}^{E} Σ_{n=0}^{N_e−1} ξ( h_n^{(e)}, a_n^{(e)}; θ′ ) A^π( s_n^{(e)}, a_n^{(e)} )        (49.115)

  subject to  (1/N) Σ_{e=1}^{E} Σ_{n=0}^{N_e−1} D_KL[ π( a_n^{(e)} | h_n^{(e)}; θ ) ‖ π( a_n^{(e)} | h_n^{(e)}; θ′ ) ]  ≤  ε

where the sample expression involving the KL divergence measure is an empirical approximation for the condition E_{d^π} D_KL[ π(a|h; θ) ‖ π(a|h; θ′) ] ≤ ε. The value for ε is typically small (close to 0.02) in order to keep the policies close to each other. For compactness, we will denote the cost function and the constraint function appearing in the problem formulation (49.115) by L(θ′) and D̄(θ′), respectively:

  L(θ′) = (1/N) Σ_{e=1}^{E} Σ_{n=0}^{N_e−1} ξ( h_n^{(e)}, a_n^{(e)}; θ′ ) A^π( s_n^{(e)}, a_n^{(e)} )        (49.116a)

  D̄(θ′) = (1/N) Σ_{e=1}^{E} Σ_{n=0}^{N_e−1} D_KL[ π( a_n^{(e)} | h_n^{(e)}; θ ) ‖ π( a_n^{(e)} | h_n^{(e)}; θ′ ) ]        (49.116b)

so that the constrained optimization problem (49.115) becomes

  θ_m = argmax_{θ′∈IR^T} L(θ′),  subject to  D̄(θ′) ≤ ε

(49.117)
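Before determining the quantities that define (49.117), it may help to see how L(θ′) and D̄(θ′) can be estimated from a batch of sampled transitions. The following Python sketch is illustrative only; the array names and the assumption that the full action distributions are available for every visited state are ours, not the text's.

  import numpy as np

  def surrogate_and_kl(probs_old, probs_new, actions, advantages):
      """probs_old, probs_new: (N, |A|) arrays with pi(a|h_n; theta) and pi(a|h_n; theta')
         actions: (N,) indices of the sampled actions a_n
         advantages: (N,) estimates of A^pi(s_n, a_n), e.g., the TD terms in (49.119)."""
      N = len(actions)
      idx = np.arange(N)
      ratio = probs_new[idx, actions] / probs_old[idx, actions]   # xi(h_n, a_n; theta'), cf. (49.109)
      L = np.mean(ratio * advantages)                             # empirical surrogate, cf. (49.116a)
      kl = np.mean(np.sum(probs_old * np.log(probs_old / probs_new), axis=1))  # cf. (49.116b)
      return L, kl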

To solve this problem, we need to determine the quantities defining it. We consider first the advantage function. It can be estimated from the observations in a manner similar to the stochastic gradient TD(0) algorithm (48.11); other approaches can also be used such as a TD(λ) construction or Monte Carlo simulations. Specifically, we use the observed episodes (or trajectories) to estimate a critic model, which we denote by w^o_{m−1}, and then use this model to compute the advantage values. Listing (49.118) shows the algorithm for learning the critic model.


Stochastic gradient TD(0) algorithm for learning a critic model.

  set initial critic model to w_{−1} = 0_{M×1};
  given current policy θ_{m−1} ∈ IR^T, which defines π(a|h; θ_{m−1}).
  repeat over the episodes (or trajectories):
    for each episode, let h_0 denote the initial feature vector
    repeat over episode for n ≥ 0:
      a ∼ π(a|h_n; θ_{m−1})
      observe h_{n+1}
      r(n) = r(h_n, a, h_{n+1})
      δ(n) = r(n) + (γ h_{n+1} − h_n)^T w_{n−1}
      w_n = w_{n−1} + µ_1(n) δ(n) h_n
    end
    w_{−1} ← w_n in preparation for next episode
  end
  w^o_{m−1} ← w_n (return critic model).

        (49.118)
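A minimal Python sketch of the inner TD(0) update in listing (49.118) is shown below. The trajectory data structure and the constant step size are placeholders we introduce for illustration; the text's listing additionally ties the sampling of actions to the current policy.

  import numpy as np

  def td0_critic(episodes, gamma, mu1):
      """episodes: list of trajectories, each a list of tuples (h_n, r_n, h_next);
         gamma: discount factor; mu1: constant step size.
         Returns the critic weight vector w for the linear model v(s) ~ h_s^T w."""
      M = len(episodes[0][0][0])
      w = np.zeros(M)
      for episode in episodes:
          for (h, r, h_next) in episode:
              delta = r + (gamma * h_next - h) @ w   # TD error delta(n)
              w = w + mu1 * delta * h                # critic update
      return w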

Once the critic model w^o_{m−1} is determined, we use it to compute instantaneous estimates for the advantage function by relying on the TD term in view of property (49.76). For any state–action pair (s, a) (or feature–action pair (h_s, a)), we determine its future state s′ (with feature h′_s) under action a and the corresponding reward r(h_s, a, h′_s). Then, one estimate for A^π(s, a) at model θ_{m−1} is the TD term:

  Â^π(s, a) = r(h_s, a, h′_s) + (γ h′_s − h_s)^T w^o_{m−1}        (49.119)

Observe that the advantage factor is determined fully from the data; it does not depend on the unknown θ′ (or θ_m). Next we consider the KL divergence term. We are assuming a Gibbs model for the policy π in (49.1), namely,

  π(a|h_s; θ) = e^{(f_{s,a})^T θ} / Σ_{a′∈A} e^{(f_{s,a′})^T θ},   θ ∈ IR^T        (49.120)

Let a_r denote an arbitrary action from A to be used as a reference in the calculations. If we invoke the earlier result (6.59), we find that the KL divergence between policies π(a|h_s; θ) and π(a|h_s; θ′) for any given state s (or feature vector h_s) is given by the expression:

  D_KL(θ, θ′; s) = ln( Σ_{a∈A} e^{y′(s,a)} / Σ_{a∈A} e^{y(s,a)} )
                   + Σ_{a∈A\{a_r\}} [ e^{y(s,a)} / Σ_{a′∈A} e^{y(s,a′)} ] ( y(s, a) − y′(s, a) )        (49.121)


where we are simplifying the notation of the KL divergence by showing only the arguments (θ, θ′; s) instead of the policies, and where we introduced the scalar quantities:

  y(s, a) ≜ (f_{s,a} − f_{s,a_r})^T θ,   (s, a) ∈ S × A        (49.122a)

  y′(s, a) ≜ (f_{s,a} − f_{s,a_r})^T θ′        (49.122b)
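As a quick check on (49.121), the following sketch evaluates the KL divergence between two Gibbs policies at one state both directly from the softmax probabilities and through the y, y′ parameterization with reference action a_r = 0; the feature matrix F and the parameter vectors are made-up illustration values.

  import numpy as np

  def softmax(z):
      z = z - np.max(z)          # shift for numerical stability
      p = np.exp(z)
      return p / p.sum()

  F = np.random.randn(4, 3)      # rows are feature vectors f_{s,a} for one state s (|A| = 4, T = 3)
  theta, theta_new = np.random.randn(3), np.random.randn(3)
  p, q = softmax(F @ theta), softmax(F @ theta_new)

  kl_direct = np.sum(p * np.log(p / q))                    # KL between the two policies at this state

  y = F @ theta - F[0] @ theta                             # y(s, a) with a_r = 0, cf. (49.122a)
  y_new = F @ theta_new - F[0] @ theta_new                 # y'(s, a), cf. (49.122b)
  kl_121 = np.log(np.exp(y_new).sum() / np.exp(y).sum()) + np.sum(softmax(y) * (y - y_new))

  print(np.isclose(kl_direct, kl_121))                     # the two evaluations agree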

To solve problem (49.117) we resort to the same line search method we encountered earlier in Section 6.6 while studying natural gradients. To do so, we call upon the second-order approximation result (6.100) for the KL divergence of two distributions. It follows from (49.121) that the first gradient of the KL divergence for any state s is given by

  ∇_{(θ′)^T} D_KL(θ, θ′; s)
    = Σ_{a∈A} [ e^{y′(s,a)} / Σ_{a′∈A} e^{y′(s,a′)} − e^{y(s,a)} / Σ_{a′∈A} e^{y(s,a′)} ] (f_{s,a} − f_{s,a_r})
    = Σ_{a∈A} [ π(a|s; θ′) − π(a|s; θ) ] (f_{s,a} − f_{s,a_r})     (using (49.1))
    = f̄_{s;θ′} − f̄_{s;θ}                                          (using (49.14a))        (49.123)

where f̄_{s;θ′} and f̄_{s;θ} are the mean feature vectors under the policies defined by θ′ and θ, respectively,

  f̄_{s;θ′} ≜ Σ_{a∈A} π(a|s; θ′) f_{s,a},    f̄_{s;θ} ≜ Σ_{a∈A} π(a|s; θ) f_{s,a}        (49.124)

Likewise, the Hessian of the KL divergence for any state s is given by

  ∇²_{θ′} D_KL(θ, θ′; s)
    = Σ_{a∈A} [ e^{y′(s,a)} / Σ_{a′∈A} e^{y′(s,a′)} ] (f_{s,a} − f_{s,a_r})(f_{s,a} − f_{s,a_r})^T
      − ( Σ_{a∈A} [ e^{y′(s,a)} / Σ_{a′∈A} e^{y′(s,a′)} ] (f_{s,a} − f_{s,a_r}) )
        ( Σ_{a∈A} [ e^{y′(s,a)} / Σ_{a′∈A} e^{y′(s,a′)} ] (f_{s,a} − f_{s,a_r}) )^T        (49.125)

so that

  ∇²_{θ′} D_KL(θ, θ′; s) = Σ_{a∈A} π(a|s; θ′)(f_{s,a} − f_{s,a_r})(f_{s,a} − f_{s,a_r})^T − (f̄_{s;θ′} − f_{s,a_r})(f̄_{s;θ′} − f_{s,a_r})^T        (49.126)

To evaluate these expressions at θ′ = θ we replace the quantity y′(s, a) appearing on the right-hand side in both expressions by y(s, a). Once this is done, we denote the Hessian (49.125) at θ′ = θ by the compact notation H_{s,θ}:

  H_{s,θ} ≜ ∇²_{θ′} D_KL(θ, θ′; s) |_{θ′=θ}        (49.127)


where the subscript θ is meant to indicate that this Hessian is evaluated at location θ. More generally, for a state s_n at time n, we will denote the corresponding Hessian matrix by H_{n,θ}. If we now invoke the second-order approximation (6.100) for KL divergences we can write

  D_KL(θ, θ′; s) ≈ (1/2) (θ′ − θ)^T H_{s,θ} (θ′ − θ)        (49.128)

so that the average KL measure appearing in (49.116b) can be approximated by the quadratic expression

  D̄(θ′) ≈ (1/2) (θ′ − θ)^T [ (1/N) Σ_{e=1}^{E} Σ_{n=0}^{N_e−1} H^{(e)}_{n,θ} ] (θ′ − θ) ≜ (1/2) (θ′ − θ)^T H_θ (θ′ − θ)        (49.129)

Let us now consider the function L(θ′) in (49.116a) and compute a first-order approximation for it (we explain the reason for doing so further ahead when we derive a line search algorithm for solving the optimization problem (49.117)). To begin with, consider the general form of the terms appearing inside the sum defining L(θ′), namely,

  ℓ(s, a; θ′) ≜ ξ(h, a; θ′) A^π(s, a) = [ π(a|h; θ′) / π(a|h; θ) ] A^π(s, a)        (49.130)

Using results (49.12) and (49.14c) we have for a Gibbs policy model:

  ∇_{(θ′)^T} ℓ(s, a; θ′) = [ A^π(s, a) / π(a|h; θ) ] ∇_{(θ′)^T} π(a|h; θ′)
                         = ( f_{s,a} − f̄_{s,θ′} ) π(a|h; θ′) [ A^π(s, a) / π(a|h; θ) ]        (49.131)

where f̄_{s,θ′} is the average feature vector under state s and policy θ′:

  f̄_{s,θ′} ≜ Σ_{a∈A} π(a|h; θ′) f_{s,a}        (49.132)

where the subscript θ′ in f̄_{s,θ′} indicates that this mean value is computed relative to the policy defined by θ′. Evaluating the gradient at θ′ = θ we get

  ∇_{(θ′)^T} ℓ(s, a; θ′) |_{θ′=θ} = ( f_{s,a} − f̄_{s,θ} ) A^π(s, a)        (49.133)

Using this expression, we evaluate the gradient of L(θ′) at θ′ = θ in (49.116a):

  ∇_{(θ′)^T} L(θ′) |_{θ′=θ} = (1/N) Σ_{e=1}^{E} Σ_{n=0}^{N_e−1} ( f_n^{(e)} − f̄_{n,θ} ) A^π( s_n^{(e)}, a_n^{(e)} ) ≜ g_θ        (49.134)


where we are denoting the gradient vector by g_θ. In view of the approximate expression (49.119) for A^π(·,·), as well as result (49.14a), the above expression for g_θ agrees with the expression for g_{m−1} that appears in listing (49.78), i.e.,

  δ_n^{(e)} = r^{(e)}(n) + ( γ h_{n+1}^{(e)} − h_n^{(e)} )^T w^o_{m−1}        (49.135a)

  g_θ = (1/N) Σ_{e=1}^{E} Σ_{n=0}^{N_e−1} δ_n^{(e)} ∇_{θ^T} ln π( a_n^{(e)} | h_n^{(e)}; θ )
      = (1/N) Σ_{e=1}^{E} Σ_{n=0}^{N_e−1} ( f_n^{(e)} − f̄_{n,θ} ) δ_n^{(e)}        (49.135b)

where w^o_{m−1} denotes the critic model at iteration m − 1. We therefore arrive at the first-order Taylor series expansion for the cost L(θ′) around θ:

  L(θ′) ≈ L(θ) + g_θ^T (θ′ − θ)        (49.136)

Using the first- and second-order approximations (49.129) and (49.136), we are now ready to derive a line search algorithm for solving (49.117) by following the same argument used to derive the natural gradient policy from (6.128). Specifically, we let θ_m = θ_{m−1} + δθ and consider the approximate optimization problem (we are removing L(θ_{m−1}) since it does not depend on the unknown δθ):

  δθ^o = argmin_{δθ∈IR^T} { −(g_{m−1})^T δθ },  subject to  (1/2) (δθ)^T H_{m−1} δθ ≤ ε
  set θ_m = θ_{m−1} + δθ^o        (49.137)

where we are using the notation g_{m−1} and H_{m−1} to denote the gradient and Hessian matrices (g_θ, H_θ) at θ = θ_{m−1}. To solve this problem we introduce a Lagrangian function with multiplier λ ≥ 0:

  L(δθ, λ) = −(g_{m−1})^T δθ + λ ( (1/2) (δθ)^T H_{m−1} δθ − ε )        (49.138)

Setting the gradient relative to δθ to zero gives

  δθ^o = (1/λ) H_{m−1}^{−1} g_{m−1}        (49.139)

We determine λ by seeking the "largest" perturbation possible. Specifically, we set

  (1/2) (δθ^o)^T H_{m−1} δθ^o = ε   =⇒   1/λ = ( 2ε / ( g_{m−1}^T H_{m−1}^{−1} g_{m−1} ) )^{1/2}        (49.140)


and we arrive at the line search update:

  θ_m = θ_{m−1} + µ(m) ( 2ε / ( g_{m−1}^T H_{m−1}^{−1} g_{m−1} ) )^{1/2} H_{m−1}^{−1} g_{m−1}        (49.141)

where µ(m) is a step-size parameter that allows further control on the evolution of the algorithm, and can be adjusted as necessary. Observe that each step of (49.141) is not an actual gradient step but is rather constructed from the solution to an optimization problem. In summary, we arrive at listing (49.142) for the TRPO method for policy design.

TRPO method for optimal policy design.

  set initial policy parameter θ_{−1} ∈ IR^T;
  given desired KL divergence bound ε > 0;
  given a method for estimating a critic model, such as (49.118);
  given one arbitrary reference action ā ∈ A.
  repeat over m = 0, 1, 2, . . .:
    collect E episodes/trajectories following current policy π(a|h; θ_{m−1})
    let N_e denote the number of transitions in episode e
    let notation (s_n^{(e)}, a_n^{(e)}) index the (s, a) pairs over episodes and time

    (updating policy model)
    use episodes to estimate a critic model, w^o_{m−1}, e.g., using (49.118)
    use w^o_{m−1} and expression (49.135b) to determine the gradient g_{m−1}
    use expressions (49.125) and (49.129) to determine the Hessian H_{m−1}
    update θ_{m−1} to θ̂ using (49.141)

    (checking whether to accept the update)
    use (49.116b) and (49.121) to evaluate the average KL divergence, D̄(θ̂)
    if increment (g_{m−1})^T θ̂ ≥ 0 and KL divergence D̄(θ̂) ≤ ε, set θ_m ← θ̂
    otherwise θ_m ← θ_{m−1}
    end

    (adjusting step-size parameter)
    if D̄(θ̂) ≥ 1.5 ε, set µ(m + 1) = µ(m)/2
    elseif D̄(θ̂) ≤ ε/1.5, set µ(m + 1) = 2µ(m)
    end
  end
  θ^o ← θ_m (return policy model).

        (49.142)
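As a rough illustration of the update step inside listing (49.142), the following sketch forms the scaled natural-gradient step (49.141) from a given gradient and Hessian; the variable names, the damping term, and the use of a direct linear solve are our own choices, not part of the text.

  import numpy as np

  def trpo_step(theta, g, H, eps, mu=1.0, damping=1e-3):
      """One TRPO-style update (49.141):
         theta_m = theta_{m-1} + mu * sqrt(2*eps / (g^T H^{-1} g)) * H^{-1} g."""
      H_d = H + damping * np.eye(len(theta))    # small damping keeps the solve well posed
      Hinv_g = np.linalg.solve(H_d, g)          # H^{-1} g without forming the inverse
      step_size = np.sqrt(2.0 * eps / (g @ Hinv_g))
      return theta + mu * step_size * Hinv_g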

The last steps in the algorithm help control the behavior of the method. Since the objective function and the KL constraint were replaced by first- and second-order approximations, it does not necessarily follow that the update obtained via (49.141) will increase the reward J(θ) or satisfy the constraint on the average KL divergence (which needs to be bounded by ε). We therefore verify whether the updated parameter θ_m satisfies these properties before accepting it. We also adjust the value of the step-size parameter depending on how far the KL divergence is from the bound ε. It should be noted that some computational enhancements are possible to reduce the complexity of the implementation. For example, the computation of quantities of the form H_{m−1}^{−1} g_{m−1} can be attained by selecting among a variety of efficient solvers for linear systems of equations of the form H_{m−1} x = g, including conjugate-gradient methods.
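For reference, a minimal conjugate-gradient loop for solving H_{m−1} x = g without inverting (or even forming) H explicitly is sketched below; only a routine returning the product H v is needed, and the function name hvp is our own placeholder.

  import numpy as np

  def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
      """Approximately solve H x = g given only matrix-vector products hvp(v) = H v."""
      x = np.zeros_like(g)
      r = g.copy()                 # residual g - H x (x = 0 initially)
      p = r.copy()
      rs_old = r @ r
      for _ in range(iters):
          Hp = hvp(p)
          alpha = rs_old / (p @ Hp)
          x += alpha * p
          r -= alpha * Hp
          rs_new = r @ r
          if rs_new < tol:
              break
          p = r + (rs_new / rs_old) * p
          rs_old = rs_new
      return x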

49.8.3 PPO Method with Penalized Cost

The TRPO formulation is computationally demanding; it requires the inversion of a Hessian matrix at every update step. The proximal policy optimization (PPO) method simplifies the implementation and performs equally well. PPO starts by moving the constraint in (49.117) to a penalty term and considers the alternative optimization:

  θ_m = argmax_{θ′∈IR^T} { L(θ′) − β D̄(θ′) }        (49.143)

for some penalty parameter β > 0. We can solve this problem by means of a standard stochastic gradient implementation. For that purpose, we need to determine an expression for the gradient of the cost function. We refer to expression (49.123) for the gradient of a typical KL divergence term evaluated at θ′ = θ; we denote this gradient by the compact notation b_{s,θ}:

  b_{s,θ} ≜ ∇_{(θ′)^T} D_KL(θ, θ′; s) |_{θ′=θ}        (49.144)

where the subscript θ is meant to indicate that this gradient vector is evaluated at location θ. We also write b_{n,θ} for state s_n at time n. It follows that the gradient vector of the average KL measure (49.116b) at θ′ = θ is given by

  ∇_{(θ′)^T} D̄(θ′) |_{θ′=θ} = (1/N) Σ_{e=1}^{E} Σ_{n=0}^{N_e−1} b^{(e)}_{n,θ} ≜ b_θ        (49.145)

Our objective is to update θ_{m−1} = θ to θ_m = θ′. We denote the gradient vectors of D̄(·) and L(·) at θ by {b_{m−1}, g_{m−1}} and combine the above gradient with (49.135b) to write

  θ_m = θ_{m−1} + µ(m) [ g_{m−1} − β(m) b_{m−1} ]        (49.146)

where the penalty parameter, β(m), is also dependent on the iteration index. In summary, we arrive at listing (49.147) for the PPO-penalty method for policy design. The last steps in the algorithm help control the behavior of the method,


as was the case with TRPO, with the penalty parameter β adjusted to give more or less weight to the penalty term.
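A compact sketch of the penalized update (49.146), combined with the β-adjustment rule that appears in the listing below, is given next; the gradient inputs stand in for (49.135b) and (49.145), and kl_fn is a placeholder routine of our own that evaluates D̄ at the candidate parameter.

  def ppo_penalty_update(theta, g, b, mu, beta, kl_fn, eps):
      """One PPO-penalty step (49.146) followed by the penalty adjustment rule.
         g: gradient of L at theta; b: gradient of the average KL at theta;
         kl_fn(theta_new): returns the average KL divergence at theta_new; eps: KL target."""
      theta_new = theta + mu * (g - beta * b)
      kl = kl_fn(theta_new)
      if kl >= 1.5 * eps:          # policies drifted too far apart: penalize more
          beta = 2.0 * beta
      elif kl <= eps / 1.5:        # penalty stronger than needed: relax it
          beta = beta / 2.0
      return theta_new, beta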

PPO-penalty method for optimal policy design.

  set initial policy parameter θ_{−1} ∈ IR^T;
  given desired KL divergence bound ε > 0;
  given initial penalty parameter β(0) > 0;
  given a method for estimating a critic model, such as (49.118);
  given one arbitrary reference action ā ∈ A.
  repeat over m = 0, 1, 2, . . .:
    collect E episodes/trajectories following current policy π(a|h; θ_{m−1})
    let N_e denote the number of transitions in episode e
    let notation (s_n^{(e)}, a_n^{(e)}) index the (s, a) pairs over episodes and time

    (updating policy model)
    use episodes to estimate a critic model, w^o_{m−1}, e.g., using (49.118)
    use w^o_{m−1} and expression (49.135b) to determine the gradient g_{m−1}
    use expressions (49.123) and (49.145) to determine the gradient b_{m−1}
    update θ_{m−1} to θ̂ using (49.146)

    (checking whether to accept the update)
    use (49.116b) and (49.121) to evaluate the average KL divergence, D̄(θ̂)
    if increment (g_{m−1})^T θ̂ ≥ 0 and KL divergence D̄(θ̂) ≤ ε, set θ_m ← θ̂
    otherwise θ_m ← θ_{m−1}
    end

    (adjusting penalty parameter)
    if D̄(θ̂) ≥ 1.5 ε, set β(m + 1) = 2β(m)
    elseif D̄(θ̂) ≤ ε/1.5, set β(m + 1) = β(m)/2
    end
  end
  θ^o ← θ_m (return policy model).

        (49.147)

49.8.4 PPO Method with Clipped Cost

The PPO-penalty method (49.147) employs a penalty factor β and adjusts it to help meet the requirement that the average KL divergence between two successive policy models does not exceed ε. The selection and adjustment of β requires some nontrivial tuning. An alternative approach is to modify the cost function (49.143) by clipping the importance sampling factor. Specifically, each generic term of the form

  ℓ(s, a; θ′) = ξ(h, a; θ′) A^π(s, a) = [ π(a|h; θ′) / π(a|h; θ) ] A^π(s, a)        (49.148)

appearing in the expression for L(θ′) in (49.116a) is clipped and replaced by

  min{ [ π(a|h; θ′) / π(a|h; θ) ] A^π(s, a),  clip( π(a|h; θ′) / π(a|h; θ), 1 − ε, 1 + ε ) A^π(s, a) }        (49.149)

The clipping function is defined as follows:

  clip(x, 1 − ε, 1 + ε) = { 1 + ε, if x > 1 + ε;   x, if 1 − ε ≤ x ≤ 1 + ε;   1 − ε, if x < 1 − ε }

(49.150)
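A direct transcription of the clipping rule (49.149)–(49.150) into code (our own illustrative sketch) is shown below.

  import numpy as np

  def clipped_term(ratio, advantage, eps):
      """Clipped surrogate term (49.149): min( ratio*A, clip(ratio, 1-eps, 1+eps)*A )."""
      return np.minimum(ratio * advantage,
                        np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)

  # example: a large ratio with positive advantage is capped at (1+eps)*A, while a small
  # ratio with negative advantage is replaced by the more pessimistic clipped value
  print(clipped_term(np.array([1.5, 0.5]), np.array([2.0, -1.0]), eps=0.2))   # [ 2.4 -0.8]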

Recall that we desire to keep the updated policy close to the current policy and, hence, the importance ratio π(a|h; θ′)/π(a|h; θ) should stay close to 1. For this reason, for values of this ratio outside the interval [1 − ε, 1 + ε], the clipping function saturates them at the endpoints. In Prob. 49.9, expression (49.149) is shown to be equivalent to the following representation:

  min{ [ π(a|h; θ′) / π(a|h; θ) ] A^π(s, a),  c( ε, A^π(s, a) ) }        (49.151)

where the second term inside the minimization is independent of the unknown θ′ and the function c(·) is defined as follows:

  c(ε, A) = { (1 + ε)A, if A ≥ 0;   (1 − ε)A, if A < 0 }        (49.152)

Using the following property:

  min{ f(x), g(x) } = (1/2) [ f(x) + g(x) − |f(x) − g(x)| ]        (49.153)

we can express (49.151) in the form (we denote it by ℓ_c(s, a; θ′)):

  ℓ_c(s, a; θ′) ≜ (1/2) { ℓ(s, a; θ′) + c(ε, A^π) − | ℓ(s, a; θ′) − c(ε, A^π) | }        (49.154)

Appealing to (49.133), and with a slight abuse of the gradient notation, we use the following expression for a subgradient vector of ℓ_c(·) relative to θ′:

  ∇_{(θ′)^T} ℓ_c(s, a; θ′) = (1/2) ∇_{(θ′)^T} ℓ(s, a; θ′) { 1 − sign( ℓ(s, a; θ′) − c(ε, A^π) ) }        (49.155)

where the sign function is defined by

  sign(x) = { +1, if x ≥ 0;   −1, otherwise }        (49.156)


Evaluating the subgradient at θ′ = θ gives

  ∇_{(θ′)^T} ℓ_c(s, a; θ′) |_{θ′=θ} = (1/2) ∇_{(θ′)^T} ℓ(s, a; θ) { 1 − sign( ℓ(s, a; θ) − c(ε, A^π) ) }        (49.157)

The PPO-clip method replaces the optimization problem (49.117) by the following:

  θ_m = argmax_{θ′∈IR^T} L_c(θ′)        (49.158)

where now

  L_c(θ′) = (1/N) Σ_{e=1}^{E} Σ_{n=0}^{N_e−1} min{ ξ( h_n^{(e)}, a_n^{(e)}; θ′ ) A^π( s_n^{(e)}, a_n^{(e)} ),  c( ε, A^π( s_n^{(e)}, a_n^{(e)} ) ) }        (49.159)

and its gradient vector at θ′ = θ is given by

  ∇_{(θ′)^T} L_c(θ′) |_{θ′=θ} = (1/N) Σ_{e=1}^{E} Σ_{n=0}^{N_e−1} ∇_{(θ′)^T} ℓ_c(s, a; θ) ≜ z_θ        (49.160)

where we are denoting the result by z_θ. Our objective is to update θ_{m−1} = θ to θ_m = θ′. We denote the gradient vector of L_c(·) at θ = θ_{m−1} by z_{m−1} and write

  θ_m = θ_{m−1} + µ(m) z_{m−1}        (49.161)

where, approximating A^π(·, ·) by δ_n^{(e)}:

  z_{m−1} = (1/2N) Σ_{e=1}^{E} Σ_{n=0}^{N_e−1} ( f_n^{(e)} − f̄_{n,θ} ) δ_n^{(e)} { 1 − sign( δ_n^{(e)} − c(ε, δ_n^{(e)}) ) }        (49.162)

In summary, we arrive at listing (49.163) for the PPO-clip method for policy design. The last steps in the algorithm help control the behavior of the method, as was the case with TRPO and PPO-penalty.


PPO-clip method for optimal policy design.

  set initial policy parameter θ_{−1} ∈ IR^T;
  given desired KL divergence bound ε > 0;
  given a method for estimating a critic model, such as (49.118);
  given one arbitrary reference action ā ∈ A.
  repeat over m = 0, 1, 2, . . .:
    collect E episodes/trajectories following current policy π(a|h; θ_{m−1})
    let N_e denote the number of transitions in episode e
    let notation (s_n^{(e)}, a_n^{(e)}) index the (s, a) pairs over episodes and time

    (updating policy model)
    use episodes to estimate a critic model, w^o_{m−1}, e.g., using (49.118)
    use w^o_{m−1} and expression (49.135b) to determine the gradient g_{m−1}
    use w^o_{m−1} and expression (49.162) to determine z_{m−1}
    update θ_{m−1} to θ̂ using (49.161)

    (checking whether to accept the update)
    use (49.116b) and (49.121) to evaluate the average KL divergence, D̄(θ̂)
    if increment (g_{m−1})^T θ̂ ≥ 0 and KL divergence D̄(θ̂) ≤ ε, set θ_m ← θ̂
    otherwise θ_m ← θ_{m−1}
    end

    (adjusting step-size parameter)
    if D̄(θ̂) ≥ 1.5 ε, set µ(m + 1) = 2µ(m)
    elseif D̄(θ̂) ≤ ε/1.5, set µ(m + 1) = µ(m)/2
    end
  end
  θ^o ← θ_m (return policy model).

        (49.163)

Example 49.8 (Playing a game over a grid) We illustrate the operation of the PPO-clip method (49.163) by reconsidering the same grid problem from Fig. 48.1. We employ the same encoding for states s and state–action pairs (s, a) from Example 49.5. For illustration purposes, we run the algorithm for 1,000,000 iterations using constant step sizes throughout the simulation:

  µ_1 = 0.01,  µ = 0.0001,  ρ = 0.001,  γ = 0.9

(49.164)

Each iteration involves a run over E = 100 episodes to train the critic. At the end of the simulation, we evaluate the policies at each state and arrive at the matrix Π in (49.165), with each row corresponding to one state and each column to one action (i.e., each row corresponds to the distribution π(a|s)). The largest entry in each row is used to define the "optimal" policy at that state. The resulting optimal actions are represented by arrows in Fig. 49.4.

  Π =

    state    up      down    left    right   stop
    s = 1    0.0469  0.2984  0.5393  0.1154  0
    s = 2    0.4030  0.1399  0.3889  0.0681  0
    s = 3    0.4543  0.1747  0.2237  0.1472  0
    s = 4    0       0       0       0       0
    s = 5    0.6573  0.1065  0.1546  0.0817  0
    s = 6    0.2107  0.1208  0.4925  0.1760  0
    s = 7    0.8450  0.0313  0.1090  0.0147  0
    s = 8    0       0       0       0       1.0000
    s = 9    0.9136  0.0137  0.0352  0.0375  0
    s = 10   0.8824  0.0196  0.0440  0.0541  0
    s = 11   0       0       0       0       0
    s = 12   0.8148  0.0393  0.0730  0.0729  0
    s = 13   0.0823  0.0417  0.0545  0.8215  0
    s = 14   0.0445  0.0445  0.0238  0.8872  0
    s = 15   0.0278  0.0144  0.0134  0.9445  0
    s = 16   0       0       0       0       1.0000
    s = 17   0       0       0       0       1.0000

        (49.165)


Figure 49.4 Optimal actions at each square are indicated by red arrows representing

motions in the directions {upward, downward, left, right}. The circular arrows in states 8 and 16 indicate that the game stops at these locations. This simulation employs the PPO-clip method (49.163) and relies on the extended feature vectors (49.80) and (49.81).

49.9 DEEP REINFORCEMENT LEARNING¹

The derivations so far in the chapter focused on modeling the policy π(a|h; θ) by means of the Gibbs distribution (49.1). The model assumes a parameter vector θ ∈ IR^T and searches for it by means of various stochastic-type implementations such as REINFORCE, actor–critic algorithms, TRPO, and PPO methods. More generally, we can resort to neural network implementations for the policy in order to enable a broader range of mappings from the feature space to the policy space, besides the inner product parameterization (f_{s,a})^T θ used in the Gibbs distribution. We illustrate the procedure by focusing, without loss of generality, on the A2C algorithm (49.77). Similar constructions apply to other algorithms.

49.9.1 Neural Network Model

We will be discussing neural networks in more detail in Chapter 65, where we also derive algorithms for training them. In the description that follows, we assume familiarity with the cascade structure of a neural network. Readers can refer to that chapter if necessary. We consider a feedforward neural network with L layers (including the input and output layers) – see Fig. 49.5. The size of the input layer is M, which matches the dimension of the state feature space h ∈ IR^M. The size of the output layer is |A|, which matches the dimension of the action space A. The output layer implements a softmax mapping. Specifically, if we denote the preactivation vector by z, with entries {z(a)}, and the output vector by γ̂ with entries {γ̂(a)}, then

  γ̂(a) = e^{z(a)} / Σ_{a′=1}^{|A|} e^{z(a′)},   a = 1, 2, . . . , |A|        (49.166)

Each entry γ̂(a) plays the role of the policy value π(a|h), i.e., the likelihood of selecting action a under feature h. Since sometimes not all actions are permissible at a particular state, it is understood that, for every state s (or corresponding feature h at the input of the network), the softmax mapping (49.166) is applied only over the permissible actions at that state. The entries of γ̂(a) corresponding to nonpermissible actions will have their values set to 0. For example, assume |A| = 5 and consider an input feature h corresponding to some state s where only the first four actions are permissible. Then, for some numerical example,

  z = [ 0.2  0.3  1.2  −0.5  0.1 ]^T        (49.167a)

  =⇒  γ̂ = [ 0.1880  0.2077  0.5110  0.0933  0 ]^T        (49.167b)

where the last entry of γ̂ is 0 and the first four entries add up to 1. Once the network is trained, the output vector γ̂ will approximate the optimal policy π^⋆(a|h) in response to some input feature h. In this way, the optimal action for h can be selected, for example, greedily by choosing the entry of γ̂ with the largest likelihood value over the set of actions that are permissible at that state:

  a^o = argmax_{a∈A} γ̂(a)        (49.168)

¹ This section can be skipped on a first reading. It requires familiarity with feedforward neural networks, the notation used to describe their structure, and the algorithms used to train them. These networks are studied in Chapter 65.
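The masking convention described above can be implemented directly; the following sketch (our own illustration, not part of the text) applies the softmax only over the permissible actions and places zeros elsewhere, reproducing the numbers in (49.167a)–(49.167b).

  import numpy as np

  def masked_softmax(z, permissible):
      """z: pre-activation vector; permissible: boolean mask of allowed actions.
         Returns gamma-hat with zeros at the non-permissible entries."""
      out = np.zeros_like(z, dtype=float)
      zp = z[permissible]
      e = np.exp(zp - np.max(zp))        # shift for numerical stability
      out[permissible] = e / e.sum()
      return out

  z = np.array([0.2, 0.3, 1.2, -0.5, 0.1])
  mask = np.array([True, True, True, True, False])   # last action not permissible
  print(np.round(masked_softmax(z, mask), 4))        # approx [0.188  0.2077 0.511  0.0933 0.]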

In the assumed network model, the mapping from h to π(a|h) is parameterized by weight matrices {W_ℓ} and bias vectors {θ_ℓ} across the neural layers. Note that we are now using the notation θ_ℓ to refer to the bias vectors within the neural network (while we used θ earlier in the Gibbs distribution model); there is no need for the latter θ since the mapping from h to π(a|s) is now modeled more generally by the neural network instead of the Gibbs expression (49.1). The training of the network now requires that we learn the parameters {W_ℓ, θ_ℓ}. We use the notation introduced in future expression (65.31) for the weights, biases, and pre- and post-activation signals across the successive layers of the neural network, such as {W_ℓ, θ_ℓ, z_ℓ, y_ℓ}.

Figure 49.5 A feedforward neural network with three hidden layers, a softmax output

layer with |A| actions, and M attributes per input feature vector.

49.9.2 Training Algorithm

The network parameters will be learned as follows (compare with the description of the A2C algorithm (49.77)):

(a) At the start of every iteration m, the parameters of the neural network are denoted by {W_{ℓ,m−1}, θ_{ℓ,m−1}}. For each input h, after feeding it through the network, these parameters lead to a policy at the output of the network that is denoted by γ̂ = π_{m−1}(a|h) with a subscript m − 1. We are using the notation γ̂ and π_{m−1}(a|h) interchangeably to refer to the output of the network in response to h. We denote the maximum number of iterations or experiments by ITER.

(b) For each iteration m, we run repeated episodes and collect sufficient state–action–state–reward transitions:

  { h_n^{(e)}, a_n^{(e)}, h_{n+1}^{(e)}, r^{(e)}(n) }        (49.169)

where the superscript (e) refers to the episode index and the subscript n refers to the time index within the episode. The actions are selected according to the existing policy model, π_{m−1}(a|h), from step (a).

(c) We use the data realizations collected in step (b) to learn a critic model, denoted by w^o_{m−1}, say, by applying the same critic algorithm that appears within the listing of the A2C algorithm (49.77). Using the critic model, we evaluate the TD factors:

  δ_n^{(e)} = r^{(e)}(n) + ( γ h_{n+1}^{(e)} − h_n^{(e)} )^T w^o_{m−1},  for all n, e        (49.170)

We already know from (49.76) and (49.119) that these factors serve as approximations for the advantage function at the state–action pair (s_n^{(e)}, a_n^{(e)}).

(d) We update the network parameters from {W_{ℓ,m−1}, θ_{ℓ,m−1}} to {W_{ℓ,m}, θ_{ℓ,m}} and repeat steps (a)–(d). We explain how this last step is carried out in the following.

In order to lighten the notation, we are not attaching an iteration index m to the data points. For example, we should have been more explicit and written h_{n,m}^{(e)} instead of h_n^{(e)} to indicate that this feature vector is observed during the nth transition of episode e within the mth iteration. Instead, we will place all the data collected within the same iteration m between parentheses with a subscript m, as shown in (49.172), to indicate that the data is collected during that iteration. The total amount of data collected during iteration m is denoted by

  N_m = ( Σ_{e=1}^{E} N_e )_m        (49.171)

where N_e is the number of transitions within each episode e during this iteration. Motivated by the alternative cost representation (49.98), we will train the parameters of the neural network by considering an ℓ2-regularized empirical risk optimization problem of the following form (where we are now maximizing rather than minimizing the empirical risk, so that the regularization enters with a negative sign):

  (W^⋆, θ^⋆) = argmax_{W,θ} { −ρ Σ_{ℓ=1}^{L−1} ‖W_ℓ‖_F^2 + (1/ITER) Σ_{m=1}^{ITER} ( (1/N_m) Σ_{e=1}^{E} Σ_{n=0}^{N_e−1} δ_n^{(e)} ln π( a_n^{(e)} | h_n^{(e)}; {W_ℓ, θ_ℓ} ) )_m }        (49.172)


The loss function that is associated with each iteration m (or experiment) is given by (where we are replacing π(a|h) by the γ̂(a) notation):

  Q ≜ (1/N_m) Σ_{e=1}^{E} Σ_{n=0}^{N_e−1} δ_n^{(e)} ln γ̂_n( a_n^{(e)} )        (49.173)
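For concreteness, a minimal computation of the loss (49.173) over one iteration's batch of transitions might look as follows; the array names are our own, and gamma_hat is assumed to hold the network outputs for each sample.

  import numpy as np

  def iteration_loss(gamma_hat, actions, td_errors):
      """gamma_hat: (N_m, |A|) network outputs for the N_m samples of iteration m;
         actions: (N_m,) indices of the taken actions a_n;
         td_errors: (N_m,) TD factors delta_n from (49.170).
         Returns the loss Q in (49.173)."""
      N_m = len(actions)
      logp = np.log(gamma_hat[np.arange(N_m), actions])   # ln gamma_hat_n(a_n)
      return np.mean(td_errors * logp)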

Only the logarithm term depends on the network parameters that we wish to update. According to our notation, when the feature vector h_n^{(e)} is fed into the neural network, the corresponding output vector is denoted by γ̂_n. Therefore, the quantity γ̂_n(a_n^{(e)}) that appears in the above sum refers to the particular entry of γ̂_n at the location corresponding to action a_n^{(e)}. We can now derive recursions for the sensitivity factors of the neural network (we denote them by σ since the δ notation is used for TDs). For any generic term within Q we let

  σ_ℓ(j) ≜ ∂( δ ln γ̂(a) ) / ∂z_ℓ(j)        (49.174)

where we have dropped the subscript n and the superscript (e). We consider first the boundary case ℓ = L, where z_L = z, so that, in view of (49.166),

  ∂( δ ln γ̂(a) ) / ∂z(a′) = { δ(1 − γ̂(a)), if a′ = a;   −δ γ̂(a′), otherwise },   for a′ = 1, 2, . . . , |A|        (49.175)

In vector form, using γ̂ as the output vector corresponding to input h, we get

  ∂( δ ln γ̂(a) ) / ∂z = δ( e_a − γ̂ )        (49.176)

where e_a is the basis vector with unit entry at location a. Consequently, the boundary sensitivity factor, for any single data point (h, a), is given by (where δ is a scalar and e_a − γ̂ is a vector):

  σ_L = δ( e_a − γ̂ )    (|A| × 1)        (49.177)

The same argument that leads to expression (65.61) will similarly lead to the following backward recursion for the sensitivity factor relative to a single data point (h, a), which runs backward from ℓ = L − 1 down to ℓ = 2:

  σ_ℓ = f′(z_ℓ) ⊙ ( W_ℓ σ_{ℓ+1} )        (49.178)

We therefore arrive at listing (49.179) where, again, we recall that, for every state s, the softmax operation is applied solely over the actions that are permissible at that state.


Deep A2C algorithm for optimal policy design.

  given a neural network with L layers, including an output softmax layer;
  set initial network parameters {W_{ℓ,−1}, θ_{ℓ,−1}}.
  repeat over m = 0, 1, 2, . . .:
    collect E episodes following current policy π_{m−1}(a|h):
      let {h_n^{(e)}, a_n^{(e)}, r^{(e)}(n)} index the (h, a, r) quantities in the eth episode
      for each h_n^{(e)}, feed it through the network and determine the
        internal and output signals {z_{ℓ,n}^{(e)}, y_{ℓ,n}^{(e)}, γ̂_n^{(e)}}
      sample the action a_n^{(e)} from the distribution γ̂_n^{(e)} = π_{m−1}(a|h_n^{(e)})
      let N_e denote the number of transitions in episode e

    repeat over episodes e = 1, 2, . . . , E:
      repeat over episode for n ≥ 0:
        (we are ignoring the superscript (e) on (h, a, r, f) for simplicity)
        select a ∼ π_{m−1}(a|h_n), observe h_{n+1} and r(n) = r(h_n, a, h_{n+1})
        δ(n) = r(n) + (γ h_{n+1} − h_n)^T w_{n−1}
        w_n = w_{n−1} + µ_1(n) δ(n) h_n
      end
      w_{−1} ← w_n, in preparation for next episode
    end
    w^o_{m−1} ← w_n (critic model)
    N_m = Σ_{e=1}^{E} N_e (number of data points)
    δ_n^{(e)} = r^{(e)}(n) + ( γ h_{n+1}^{(e)} − h_n^{(e)} )^T w^o_{m−1}, for all n, e
    σ_{L,n}^{(e)} = δ_n^{(e)} ( e_{a_n^{(e)}} − γ̂_n^{(e)} ) (boundary sensitivity factor for all n, e)

    repeat for ℓ = L − 1, . . . , 2, 1 (backward processing):
      W_{ℓ,m} = (1 − 2µρ) W_{ℓ,m−1} + µ ( (1/N_m) Σ_{e=1}^{E} Σ_{n=0}^{N_e−1} y_{ℓ,n}^{(e)} ( σ_{ℓ+1,n}^{(e)} )^T )
      θ_{ℓ,m} = θ_{ℓ,m−1} − µ ( (1/N_m) Σ_{e=1}^{E} Σ_{n=0}^{N_e−1} σ_{ℓ+1,n}^{(e)} )
      σ_{ℓ,n}^{(e)} = f′(z_{ℓ,n}^{(e)}) ⊙ ( W_{ℓ,m−1} σ_{ℓ+1,n}^{(e)} ),  ℓ ≥ 2, and all n, e
    end
  end
  {W_ℓ^⋆, θ_ℓ^⋆} ← {W_{ℓ,m}, θ_{ℓ,m}},   π^⋆(a|s) ← π( a|h; {W_ℓ^⋆, θ_ℓ^⋆} ).

        (49.179)


Example 49.9 (Playing a game over a grid) We illustrate the operation of the deep learning A2C implementation (49.179) by reconsidering the same grid problem from Fig. 48.1. We consider a feedforward neural network with L = 4 layers (including one input layer, one output layer, and two hidden layers). All neurons employ the ReLU activation function except for the output layer, which employs a softmax mapping. The step-size, regularization, and discount factor parameters are set to:

  µ_1 = 0.01,  µ = 0.00001,  ρ = 0.001,  γ = 0.9        (49.180)

We employ the same one-hot encoded feature vectors (49.80). We run the deep learning algorithm for 500,000 iterations to learn the parameters {W_ℓ, θ_ℓ}. Each iteration involves a run over E = 100 episodes to train the critic. At the end of the simulation, we evaluate the policies at each state (we feed feature vector h_s and determine γ̂_s = π^⋆(a|s)). We arrive at the matrix Π in (49.181), with each row corresponding to one state and each column to one action (i.e., each row corresponds to the distribution π^⋆(a|s)). The largest entry in each row is used to define the "optimal" policy at that state. The resulting optimal actions are represented by arrows in Fig. 49.6.

  Π =

    state    up      down    left    right   stop
    s = 1    0.0077  0.0014  0.9619  0.0290  0
    s = 2    0.9658  0.0001  0.0342  0.0000  0
    s = 3    0.9792  0.0000  0.0208  0.0000  0
    s = 4    0       0       0       0       0
    s = 5    0.9911  0.0000  0.0089  0.0000  0
    s = 6    0.0178  0.0016  0.9631  0.0175  0
    s = 7    0.9855  0.0000  0.0145  0.0000  0
    s = 8    −       −       −       −       1.0000
    s = 9    0.9980  0.0000  0.0020  0.0000  0
    s = 10   0.9975  0.0000  0.0025  0.0000  0
    s = 11   0       0       0       0       0
    s = 12   0.9960  0.0000  0.0040  0.0000  0
    s = 13   0.0000  0.0000  0.0043  0.9957  0
    s = 14   0.0000  0.0000  0.0019  0.9981  0
    s = 15   0.0000  0.0000  0.0012  0.9988  0
    s = 16   0       0       0       0       1.0000
    s = 17   −       −       −       −       1.0000

        (49.181)

49.10 SOFT LEARNING

The policy gradient methods such as REINFORCE, actor–critic, TRPO, and PPO are sample-inefficient. Their training requires the collection of episodes based on current estimates of the policy, which are then discarded and new samples are generated for the subsequent iteration. We consider an alternative soft-learning approach that will permit off-policy learning and the use of replay buffers, thus increasing sample efficiency to a great extent. Soft learning is motivated as follows.



Figure 49.6 Optimal actions at each square are indicated by red arrows representing

motions in the directions {upward, downward, left, right}. The circular arrows in states 8 and 16 indicate that the game stops at these locations. This simulation employs the deep learning A2C algorithm (49.179) and relies on the feature vectors (49.80).

49.10.1 Log-Sum Exponentiation

Most of the reinforcement learning procedures described earlier have been motivated from the Bellman optimality conditions (45.15a) and (45.15b), namely,

  v^⋆(s) = max_{a∈A} E_P{ r(s, a, s′) + γ v^⋆(s′) }        (49.182a)

  q^⋆(s, a) = E_P{ r(s, a, s′) + γ max_{a′∈A} q^⋆(s′, a′) }        (49.182b)

  π^⋆(a|s) := argmax_{a∈A} q^⋆(s, a)        (49.182c)

  v^⋆(s) = max_{a∈A} q^⋆(s, a)        (49.182d)

where we are using the notation := in the third line to indicate that the resulting optimal policy is deterministic: We first determine an optimal action a^o that maximizes q^⋆(s, a) and then use it to construct an optimal deterministic policy π^o(a|s) according to (45.7), which we set to π^⋆(a|s). For reference, we know from (45.17) that the Bellman condition (49.182a) can be rewritten in the equivalent form:

  v^⋆(s) = max_π E_π{ E_P{ r(s, a, s′) + γ v^⋆(s′) } }        (49.182e)

2100

Policy Gradient Methods

where the maximization is now over the policy, π(a|s), and the expectation is over both π and P. This alternative form makes the policy explicit. The expressions for v^⋆(s) and q^⋆(s, a) involve hard maximization (greedy) operations. These operations can be a source of instability, as already illustrated in Section 48.5.1. One method to smooth the hard operations is to approximate the maximization step by a log-sum exponentiation operation. Specifically, we call upon the following well-known smooth approximation for the maximum of a collection of variables in terms of the log-sum exponential function:

  max{ x_1, x_2, . . . , x_N } ≈ λ ln( Σ_{n=1}^{N} e^{x_n/λ} )        (49.183)

where λ > 0 is a temperature parameter that controls the quality of the approximation; the smaller the value of λ is, the more accurate the approximation will be. Using (49.183) in (49.182a) gives the following expression, which we will refer to as the soft Bellman optimality equation:

  v_λ^⋆(s) = λ ln[ Σ_{a∈A} exp( (1/λ) E_P{ r(s, a, s′) + γ v_λ^⋆(s′) } ) ]        (49.184)

Using (49.183) in (49.182b), and recalling the identity (49.182d), gives

  q_λ^⋆(s, a) = E_P{ r(s, a, s′) + γ v_λ^⋆(s′) }        (49.185)

In the last two expressions we are writing v_λ^⋆(s) and q_λ^⋆(s, a) to denote the approximations to v^⋆(s) and q^⋆(s, a); they are parameterized by the scalar λ. Substituting (49.185) into (49.184) we arrive at the relation:

  v_λ^⋆(s) = λ ln[ Σ_{a∈A} exp( q_λ^⋆(s, a)/λ ) ]        (49.186)

which is the approximation for (49.182d) when the max operation is replaced by the log-sum exponential function. In this way, we can rewrite (49.185) in the following form, which corresponds to a second soft Bellman optimality equation:

  q_λ^⋆(s, a) = E_P{ r(s, a, s′) + γλ ln[ Σ_{a∈A} exp( q_λ^⋆(s′, a)/λ ) ] }        (49.187)

Comparing with (49.182b), we observe that there is no maximization step under expectation on the right-hand side of this expression. Instead, the maximum has been replaced by a smooth log-sum exponential function.
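A quick numerical illustration of the smoothing in (49.183) (with made-up values) shows how the log-sum exponential approaches the true maximum as λ shrinks.

  import numpy as np

  def soft_max(x, lam):
      """Log-sum exponential approximation (49.183) to max(x) with temperature lam."""
      z = x / lam
      return lam * (np.max(z) + np.log(np.sum(np.exp(z - np.max(z)))))   # numerically stable form

  x = np.array([1.0, 2.0, 3.0])
  for lam in [1.0, 0.1, 0.01]:
      print(lam, soft_max(x, lam))   # tends to max(x) = 3 as lam decreases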

49.10.2 Consistency Theorem

Now that smoothed approximations for the state value and state–action value functions have been characterized, we establish an important consistency relation that will form the basis for the derivation of a sample-efficient off-policy algorithm. To begin with, it is straightforward to verify that the solution to the following problem, which is similar to (49.182e), agrees with the v_λ^⋆(s) given by (49.184):

  v_λ^⋆(s) = max_π E_π{ E_P{ r(s, a, s′) + γ v_λ^⋆(s′) } − λ ln π(a|s) }        (49.188)

Comparing expression (49.188) with (49.182e), we notice the presence of the additional factor λ ln π(a|s). This factor gives rise to the entropy of the policy distribution at state s, which is defined as X ∆ Hπ (s) = −E π ln π(a|s) = − π(a|s) ln π(a|s) (49.189) a∈A

Using Hπ (s) we rewrite (49.188) as n  o vλ? (s) = max λHπ (s) + E π,P r(s, a, s0 ) + γ vλ? (s0 ) π

(49.190)

For this reason, we say that the soft maximum approximation adds an entropy “regularization” term into the evaluation of the value function. Proof of equivalence between (49.184) and (49.188): Expanding (49.188) we have ( )   i Xh ? 0 ? 0 vλ (s) = max π(a|s)E P r(s, a, s ) + γ vλ (s ) − λπ(a|s) ln π(a|s) π

a∈A

subject to

X a∈A

π(a|s) = 1 and π(a|s) ≥ 0, ∀a ∈ A

(49.191)

Following a Lagrange multiplier argument, it can be verified that the optimal policy is given by the Gibbs distribution – see Prob. 49.17: o n  exp λ−1 E P r(s, a, s0 ) + γ vλ? (s0 ) n  o πλ? (a|s) = P ? −1 E 0 0 P r(s, b, s ) + γ vλ (s ) b∈A exp λ

(49.192)

Substituting this expression into the right-hand side of (49.191) leads to the right-hand side of (49.184). 

Observe that the optimal policy (49.192) is not deterministic anymore. Using (49.184) and (49.185), we also have n o n  o exp qλ? (s, a)/λ n o (49.193) πλ? (a|s) = exp qλ? (s, a) − vλ? (s) /λ = P ? (s, a0 )/λ exp q 0 a ∈A λ We establish in the same Prob. 49.17 that: lim πλ? (a|s) = π ? (a|s)

(original optimal policy)

(49.194a)

lim πλ? (a|s) = 1/|A|

(uniform policy)

(49.194b)

λ→0 λ→∞

2102

Policy Gradient Methods

One useful property that follows by taking the logarithm of (49.193) is to observe that the regularized optimal policy and the corresponding optimal state value function satisfy the following consistency relation:   vλ? (s) = E P r(s, a, s0 ) + γ vλ? (s0 ) − λ ln π ? (a|s) (49.195)

Actually, {πλ? (a|s), vλ? (s)} are the only quantities that satisfy this consistency result. The following statement is established in Appendix 49.B. (Consistency theorem) If a policy π(a|s) and its state value function v π (s) satisfy the following relation for every (s, a) ∈ S × A:   v π (s) = E P r(s, a, s0 ) + γ v π (s0 ) − λ ln π(a|s) (49.196a) then it must hold that

v π (s) = vλ? (s),

π(a|s) = πλ? (a|s)

(49.196b)

Observe that relation (49.196a) does not involve any expectation over the policy. Therefore, transitions obtained by following any policy will be useful to search for π ? (a|s), as we explain further ahead in the SBEED algorithm. Table 49.2 collects the expressions derived so far for soft state and state–action value functions. Table 49.2 Expressions relating the soft value and state value functions in an MDP using a temperature parameter λ > 0. Relation

Description "

#   1 E P r(s, a, s0 ) + γ vλ? (s0 ) λ a∈A n   o vλ? (s) = max E π E P r(s, a, s0 ) + γ vλ? (s0 ) − λ ln π(a|s) π n o ? qλ (s, a) = E P r(s, a, s0 ) + γvλ? (s0 ) ( " #)   X ? 0 ? 0 qλ (s, a) = E P r(s, a, s ) + γλ ln exp qλ (s , a)/λ vλ? (s) = λ ln

X



exp

Bellman optimality Bellman optimality Bellman optimality Bellman optimality

a∈A

n  o exp λ−1 E P r(s, a, s0 ) + γ vλ? (s0 ) n  o πλ? (a|s) = P ? −1 E 0 0 P r(s, b, s ) + γ vλ (s ) b∈A exp λ n . o πλ? (a|s) = exp qλ? (s, a) − vλ? (s) λ " #   X ? ? vλ (s) = λ ln exp qλ (s, a)/λ a∈A

optimal policy

optimal policy useful relation

49.10 Soft Learning

2103

Example 49.10 (Soft Q-learning) We illustrate the soft Bellman conditions in the construction of a soft version for Q-learning. We refer to the original Q-learning procedure (47.45) where   π 0 0 π βn (s, a) = r(n) + γ max q (s , a − qn−1 (s, a) (49.197) n−1 0 a ∈A

and employ the linear approximation model (48.82) to replace π T qn−1 (s, a) ≈ fs,a cn−1

(49.198)

using the model cn−1 at iteration n − 1. We next use the log-sum exponential form (49.187) to replace the max operation by !   X π 0 0 T max qn−1 (s , a ) ≈ λ ln exp fs0 ,a0 cn−1 /λ (49.199) 0 a ∈A

a0 ∈A

which motivates the soft Q-learning algorithm listed in (49.200).

Soft Q-learning for policy evaluation.

  given a (deterministic or stochastic) policy model, π(a|h);
  given a small temperature parameter λ > 0;
  start with c_{−1} = 0_{T×1}.
  repeat over episodes:
    let (f_0, h_0) denote initial feature vectors for the episode
    let a_0 denote its initial action
    repeat over episode for n ≥ 0:
      observe h_{n+1} using action a_n
      r(n) = r(h_n, a_n, h_{n+1})
      a_{n+1} ∼ π(a|h_{n+1}); observe f_{n+1}
      β(n) = r(n) + γλ ln( Σ_{a∈A} exp( f_{s_{n+1},a}^T c_{n−1}/λ ) ) − f_n^T c_{n−1}
      c_n = c_{n−1} + µ(n) β(n) f_n
    end
    c_{−1} ← c_n in preparation for next episode
  end
  c^o ← c_n
  q_n^π(s, a) ≈ f_{s,a}^T c^o,  ∀(s, a) ∈ S × A.

        (49.200)
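A compact rendering of the core update in listing (49.200) is sketched below; the data conventions (f_n for the visited state–action feature, the rows of F_next for the candidate actions at the next state) are our own illustration.

  import numpy as np

  def soft_q_update(c, f_n, F_next, r_n, gamma, lam, mu):
      """One soft Q-learning step from listing (49.200).
         c: current weight vector; f_n: feature of the visited (s_n, a_n) pair;
         F_next: (|A|, T) features f_{s_{n+1}, a} for all actions at the next state;
         r_n: observed reward; gamma: discount; lam: temperature; mu: step size."""
      z = (F_next @ c) / lam
      soft_max_next = lam * (np.max(z) + np.log(np.sum(np.exp(z - np.max(z)))))
      beta = r_n + gamma * soft_max_next - f_n @ c     # soft TD term beta(n)
      return c + mu * beta * f_n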

49.10.3

SBEED Algorithm The consistency result (49.196a)–(49.196b) motivates an off-policy sampleefficient approach to determining the optimal policy and its state value. The idea is based on searching for functions π(a|s) and v π (s) that satisfy equality (49.196a). We illustrate the procedure by assuming linear parameterizations for the functions involved, say, π

v (s) ≈

hT s w,

π(a|s) ≈ P

e(fs,a )

a0 ∈A

T

θ Tθ

e(fs,a0 )

(49.201)

2104

Policy Gradient Methods

where w ∈ IRM and θ ∈ IRT . Other parameterizations are of course possible. One way to achieve the desired goal is as follows. We collect E trajectories on the MDP, with each trajectory consisting of Ne transitions (s, a, s0 , r) following an arbitrary policy. We then formulate the problem of minimizing the following quadratic cost over the data (where we are adding the superscript (e) to identify data from the eth episode): ( ) E Ne −1h   i2 1 X X (e) π (e) (e) (e) π (e) min E P r (n)+γ v (sn+1 ) −λ ln π(an |sn )−v (sn ) w,θ 2N e=1 n=0 (49.202) where E X ∆ N = Ne (49.203) e=1

Each term inside the summation has the generic form  2 ∆ 1 F (w, θ) = E P r(s, a, s0 ) + γE P v π (s0 ) − λ ln π(a|s) − v π (s) 2

(49.204)

In principle, we could consider solving (49.202) by employing a stochastic gradient algorithm. This step would require approximating gradients of terms of the form F (w, θ) from samples, which is problematic in this case for the following reason. If we compute the gradient of F (w, θ) relative to w we get   ∇wT F (w, θ) = E P r(s, a, s0 ) − λ ln π(a|s) + (γE P h0 − h)T w (γE P h0 − h) (49.205)

where terms involving products of expectations appear, such as the product of the last two terms denoted by: ∆

x = (γE P h0 − h)T w(γE P h0 − h)

(49.206)

If we were to approximate x using sample realizations and the prior iterate for w, we would use x ≈ (γhn+1 − hn )T w(γhn+1 − hn )

(49.207)

However, this expression provides a sample realization for another quantity that is given by the following expression: n o ∆ x0 = E P (γh0 − h)T w(γh0 − h) (49.208)

with a single expectation operation applied to the product of the factors. For this reason, using the approximation (49.207) for x introduces bias. We therefore follow an alternate approach and transform first the original cost (49.202) into another equivalent form where the product of expectations problem does not arise (this problem is due to the presence of an expectation inside the quadratic cost in (49.204)).

49.10 Soft Learning

2105

To begin with, note that if we expand the cost function appearing in (49.202) and remove the quadratic term involving the rewards, which is independent of (w, θ), then the problem is equivalent to: ( E Ne −1   1 X X (e) (e) π (e) min E P r (e) (n) γ E P v π (sn+1 ) − λ ln π(a(e) |s ) − v (s ) n n n w,θ N e=1 n=0 ) 2 1 (e) (e) π (e) + γ E P v π (sn+1 ) − λ ln π(a(e) n |sn ) − v (sn ) 2

(49.209)

Next, we call upon a result similar to the algebraic property (44.152) for quadratic forms to replace the last term in (49.209) by (we are dropping the superscripts for simplicity): 2 1 γ E P v π (s0 ) − λ ln π(a|s) − v π (s) 2 ) (   1 = max γ E P v π (s0 ) − λ ln π(a|s) − v π (s) ρ − ρ2 ρ 2

(49.210)

where ρ is a function of the state–action pair, ρ(s, a). Observe that the expectation of v π (s0 ) is not “squared” anymore. We again adopt a linear approximation model for ρ(s, a) in terms of some parameter vector c, say, ρ(s, a) ≈ (fs,a )T c,

c ∈ IRT

(49.211)

It can now be verified that problem (49.209) is equivalent to the following saddle point problem: min max w,θ

(49.212)

c

( E Ne −1   1 X X (e) (e) π (e) E P r (e) (n) γ E P v π (sn+1 ) − λ ln π(a(e) n |sn ) − v (sn ) + N e=1 n=0

h

γ EP v

π

(e) (sn+1 )



(e) λ ln π(a(e) n |sn )

−v

π

(s(e) n )

i 1 2 (e) (e) (e) ρ(s(e) n , an ) − ρ (sn , an ) 2

)

where the maximization over c is placed before the summations – see Prob. 49.19. We can write down gradient-descent recursions over w and θ and a gradientascent recursion over c. Since the exploration can follow arbitrary policies, the resulting procedure can learn off-policy and can also rely on the use of a replay buffer for more efficient sampling. The resulting algorithm appears in listing (49.213) and is referred to as the smoothed Bellman error embedding (SBEED) algorithm – see Prob. 49.20.

2106

Policy Gradient Methods

SBEED algorithm for optimal policy and state value learning. initial model parameters w−1 ∈ IRM , θ−1 ∈ IRT , c−1 ∈ IRT ; step-size parameters µw , µc , µθ ; apply repeated actions according to some arbitrary policy and construct a buffer containing sufficient samples for the transitions (s, a, s0 , r). repeat for m ≥ 0 until sufficient convergence: using policy π(a|s; θm−1 ), generate a mini-batch of B transitions for the MDP and save them into the replay buffer select at random a mini-batch of B transitions {(sb , ab , sb+1 , r(b))} from the replay buffer for each b = 0, 1, . . . , B − 1: set fb = fsb ,ab X set f¯b = π(a|sb ; θm−1 )fsb ,a a∈A

end

(compute the mini-batch gradient vectors) B−1 1 X gw,m−1 = (r(b) + fbT cm−1 )(γhb+1 − hb ) B b=0 B−1   X 1 gc,m−1 = (γhb+1 − hb )T wm−1 −λ ln π(ab |sb ; θm−1 ) fb −fbT cm−1 B b=0 B−1 1 X λ(r(b) + fbT cm−1 )(fb − f¯b ) gθ,m−1 = − B b=0

(update parameter models) wm = wm−1 − µw gw,m−1 θn = θn−1 − µθ gθ,m−1 cm = cm−1 + µc gc,m−1 end return {w? , θ? , c? } ← {wm , θm , cm }.

49.11

(49.213)

COMMENTARIES AND DISCUSSION Classes of algorithms. Based on the presentation in Chapters 46–49, we find that there are mainly three broad classes of reinforcement learning algorithms, depending on whether actor or critic components are present:

49.11 Commentaries and Discussion

2107

(a) Actor-only algorithms such as the REINFORCE algorithm (49.64). This algorithm parameterizes the policy function, π(a|h; θ), and estimates the parameter without storing estimates for the value function. (b) Critic-only algorithms such as TD learning, SARSA learning, and Q-learning for estimating the state and state–action value functions. These algorithms observe rewards and continuously update estimates for the value functions by incorporating feedback errors into the update equations, such as the TD term δn (s) – see, e.g., (46.50). (c) Actor–critic algorithms such as the policy gradient and natural gradient algorithms of Section 49.1. These techniques help address the potential divergence of critic algorithms (such as Q-learning) under linear approximation models; they also help reduce the variance in the approximation of the gradient directions by means of baseline functions. These algorithms have encountered widespread application in a variety of areas, including learning in games (such as chess, backgammon, GO, solitaire), learning in robotics, fleet management, supply chain management, finance, diagnostics, and helicopter control, among other applications. Policy gradient algorithms. One of the main drivers for the development of policy gradient algorithms is to help address convergence difficulties encountered by Q-learning and SARSA algorithms; these latter algorithms may diverge when used with linear approximation models for the value functions. Several works examined the divergence problem, providing examples and counterexamples under certain conditions, including Baird (1995), Gordon (1995), Tsitsiklis and Van Roy (1996, 1997), Schoknecht (2003), and Melo, Meyn, and Ribeiro (2008). In particular, we described earlier in Section 48.5.1 the delusional bias problem from Lu, Schuurmans, and Boutilier (2018). Policy gradient algorithms parameterize the policy function itself rather than the value functions, and estimate the parameter vector by means of stochastic gradient methods. In its most naïve version, the gradient vector is estimated by means of a finitedifference perturbation method; useful overviews of methods for gradient estimation are given by Aleksandrov, Sysoyev, and Shemeneva (1968) and L’Ecuyer (1990, 1991) and the references therein. The development of more sophisticated variants of policy gradient algorithms relies on the application of the policy gradient theorem (49.44). This form of the result was derived independently by Sutton et al. (2000) and Konda and Tsitsiklis (2000, 2003), based on earlier versions by Cao and Chen (1997) and Jaakkola, Singh, and Jordan (1995); the result characterizes the gradient vector of the objective function in terms of the score function, which is a critical step in facilitating the derivation of policy gradient algorithms. One of the earliest implementations of a policy gradient method is the REINFORCE algorithm by Williams (1992), which is essentially an actor-only implementation. Some of the earliest works on reinforcement learning architectures involving both actor and critic elements appear in Samuel (1959), Witten (1977), and Barto, Sutton, and Anderson (1983). This last article is considered by many as the main reference that has motivated most of the subsequent work on actor–critic structures. A good survey on actor–critic algorithms is given by Grondman et al. (2012), with many relevant references and commentaries. Another useful overview is the book by Cao (2007). 
The TRPO method described in Section 49.8 is from Schulman et al. (2015). The method solves a constrained optimization problem by imposing a constraint on the average divergence measure between the new and previous policies (note that the constraint is on the behavior policy and not on the parameters). The objective is to avoid sudden changes in the policy that would cause erratic behavior by the agent.The TRPO for-

2108

Policy Gradient Methods

mulation is complex and its solution is computationally demanding. The PPO method alleviates these difficulties. The method is from Schulman et al. (2017) and it has two versions: One uses a penalized cost function while the other uses a clipped cost function. PPO is able to deliver similar performance to TRPO but is much simpler to implement. These constrained formulations aim to limit the size of the policy update and they are motivated by earlier results on natural gradient policies from Kakade (2001) and on the conservation policy iteration from Kakade and Langford (2002) – recall the earlier formulation (6.127) while deriving the natural gradient algorithm. The useful identity (49.102) is from Kakade and Langford (2002).

Soft learning. The algorithms derived in the body of the chapter based on the policy gradient approach, such as REINFORCE, actor–critic, TRPO, and PPO, are sampleinefficient. Their training requires the collection of episodes based on current estimates of the policy, which are then discarded and new samples are generated for the subsequent iteration. This inefficiency has motivated the work on soft learning described in Section 49.10 by using the log-sum exponential to approximate the maximization operation, which induces the use of the entropy of the policy as a regularization factor. This step leads to the fundamental consistency result (49.196a)–(49.196b), which is from Rawlik, Toussaint, and Vijayakumar (2012) and Nachum et al. (2017). The consistency equation does not involve any expectation over the policy and, therefore, allows for the generation of samples from arbitrary trajectories. This observation was exploited to arrive at the SBEED algorithm (49.213), which incorporates a replay buffer as well. The algorithm is from Dai et al. (2018). The incorporation of entropy regularizers into the evaluation of the optimal value function, and the replacement of hard max operations by soft log-sum exponential operations, is motivated by the works of Rawlik, Toussaint, and Vijayakumar (2012), Fox, Pakman, and Tishby (2016), Nachum et al. (2017), and Asadi and Littman (2017); see also Neu, Jonsson, and Gomez (2017). The derivation in Section 49.10 follows arguments from Nachum et al. (2017) and Cassano, Alghunaim, and Sayed (2019).

Natural gradient algorithms. We explained in Section 49.7 that policy gradient methods follow gradient-ascent iterations and are sensitive to the choice of the coordinate system. Natural gradient policies are search procedures that are parameter-independent (recall the discussion in Section 6.6). Their main motivation is to ensure that a change in the parameterization of the policy does not affect the policy result. These policies resemble, to a certain extent, Newton-type iterations where the search direction is scaled by the inverse of the Hessian matrix of the objective function; for natural policies, the search direction is scaled by the inverse of the Fisher information matrix (which is the covariance matrix of the score function). For compatible state–action approximations satisfying (49.47), the scaling of the search direction by the inverse of the Fisher information matrix leads to the important simplification (49.90). This result means that it is sufficient for policy gradient algorithms to update the policy parameter along the direction of the critic parameter. One of the first works to recognize the significance of natural gradients for learning purposes is Amari (1998). The use of the average Fisher information matrix (49.86) for reinforcement learning was proposed by Kakade (2001), and further supported by the analysis in Bagnell and Schneider (2003) and Peters, Vijayakumar, and Schaal (2003). Variations on actor–critic algorithms appear in Peters and Schaal (2008) and Bhatnagar et al. (2009).

Problems

2109

PROBLEMS

49.1 Consider the Gaussian policy parameterization $\pi(a|s;\theta) = \mathcal N(f_{s,a}^{\mathsf T}\theta,\ \sigma^2)$. What is the score function of $\pi(a|s;\theta)$?

49.2 Consider a cost function $J(\theta):\mathbb R^Q\to\mathbb R$ of the form
$$J(\theta) \;\stackrel{\Delta}{=}\; \int_{x\in\mathcal X} f_x(x;\theta)\,g(x)\,dx \;=\; \mathbb E_x\, g(x)$$
in terms of a probability density function $f_x(x;\theta)$ parameterized by the vector $\theta\in\mathbb R^Q$, and where $\mathcal X$ denotes the domain of $x$. Assume regularity conditions hold so that the operations of differentiation and integration can be switched with each other. Show that
$$\nabla_\theta J(\theta) = \mathbb E_x\big[\nabla_\theta \ln f_x(x;\theta)\, g(x)\big]$$

49.3 Refer to the objective function $J_1(\theta)$ defined in (49.26). Derive the following equivalent expression:
$$J_1(\theta) = (1-\gamma)\sum_{s\in\mathcal S}\sum_{a\in\mathcal A}\sum_{s'\in\mathcal S} d_\gamma^\pi(s)\,\pi(a|h;\theta)\,P(s,a,s')\,r(s,a,s')$$
where $d_\gamma^\pi(s)$ is the discounted state distribution defined by (49.222).

49.4 Refer to the average reward objective function $J_4(\theta)$ defined by (49.35). Let $J^\star$ denote its optimal (i.e., maximum) value over $\theta$. Refer also to the definitions (49.39a)–(49.39b) for the state and state–action value functions. Show that the Bellman optimality conditions in this case are given by
$$v^\star(s) + J^\star = \max_{a\in\mathcal A}\Big\{\sum_{s'\in\mathcal S} P(s,a,s')\big[r(s,a,s') + v^\star(s')\big]\Big\}$$
$$q^\star(s,a) + J^\star = \sum_{s'\in\mathcal S} P(s,a,s')\Big[r(s,a,s') + \max_{a'\in\mathcal A} q^\star(s',a')\Big]$$
Show further that these optimality relations can be written as
$$v^\star(s) + J^\star = \max_{a\in\mathcal A}\,\mathbb E_P\big[\,r(s,a,s') + v^\star(s')\,\big|\, s = s\,\big]$$
$$q^\star(s,a) + J^\star = \mathbb E_P\big[\,r(s,a,s') + \max_{a'\in\mathcal A} q^\star(s',a')\,\big|\, s = s,\ a = a\,\big]$$

49.5 Refer to the argument used to establish (49.44) for $J_1(\theta)$, $J_3(\theta)$, and $J_4(\theta)$. Assume the Markov chain is irreducible and aperiodic, the value function $v^\pi(s)$ is ergodic for all states, and the chain is operating in the steady state.
(a) Argue that expression (49.30) can be interpreted as $J_2(\theta) = \frac{1}{1-\gamma}\,\mathbb E_{\pi,P}\, v^\pi(s)$.
(b) Argue by ergodicity that one may consider the alternative objective function:
$$J_5(\theta) = \lim_{N\to\infty}\ \mathbb E_{\pi,P}\Big\{\frac{1}{N}\sum_{n=0}^{N-1} v^\pi(s_n)\Big\}$$
(c) Introduce the variables
$$d_\gamma^\pi(s;s') \;\stackrel{\Delta}{=}\; (1-\gamma)\sum_{k=0}^{\infty}\gamma^k\, P(s_k = s'\,|\, s_0 = s)$$
$$\beta_\gamma^\pi(s) \;\stackrel{\Delta}{=}\; \sum_{s'\in\mathcal S} d^\pi(s')\, d_\gamma^\pi(s';s)$$
(d) Verify that the $\{\beta_\gamma^\pi(s)\}$ add up to 1 over $s\in\mathcal S$. Repeat the argument that led to (49.224) to show that
$$\nabla_\theta J_5(\theta) = \sum_{s\in\mathcal S}\beta_\gamma^\pi(s)\sum_{a\in\mathcal A}\pi(a|h;\theta)\,q^\pi(s,a)\,\nabla_\theta\ln\pi(a|h;\theta)$$
Conclude that (49.44) holds for $J_2(\theta)$ with the entries of the distribution $d$ given by $d(s) = \beta_\gamma^\pi(s)$.

49.6 Refer to the definition of the advantage function (49.73). Verify that $\mathbb E_\pi\, A^\pi(s,a) = 0$, i.e.,
$$\sum_{a\in\mathcal A}\pi(a|s)\,A^\pi(s,a) = 0$$

49.7 Refer to the discussion in Section 49.8 on the TRPO method. Introduce the quantities
$$\epsilon \;\stackrel{\Delta}{=}\; \max_{s\in\mathcal S}\Big\{\big|\mathbb E_{\pi'} A^\pi(s,a)\big|\Big\},\qquad a\sim\pi(a|h;\theta')$$
$$D_{\rm KL}^{\max}(\theta,\theta') \;\stackrel{\Delta}{=}\; \max_{s\in\mathcal S}\Big\{D_{\rm KL}\big(\pi(a|h;\theta)\,\|\,\pi(a|h;\theta')\big)\Big\}$$
Show that $J(\theta')$ is bounded from below by $J(\theta')\geq K_\theta(\theta')$ where (the first two terms on the right-hand side are from (49.108)):
$$K_\theta(\theta') \;\stackrel{\Delta}{=}\; J(\theta) + \sum_{s\in\mathcal S}\rho^\pi(s)\Big(\sum_{a\in\mathcal A}\pi(a|h;\theta')\,A^\pi(s,a)\Big) - \frac{4\epsilon\gamma}{(1-\gamma)^2}\,D_{\rm KL}^{\max}(\theta,\theta')$$
Here, the notation $\pi$ and $\pi'$ stands for the policies $\pi(a|h;\theta)$ and $\pi(a|h;\theta')$, respectively. Remark. For more details on this inequality and its role in policy gradient design, the reader may refer to the work by Schulman et al. (2015).

49.8 We continue with Prob. 49.7. Consider the following optimization problem:
$$\theta_m = \underset{\theta'\in\mathbb R^T}{\arg\max}\; K_{\theta_{m-1}}(\theta')$$
Show that $J(\theta_m) - J(\theta_{m-1}) \geq K_{\theta_{m-1}}(\theta_m) - K_{\theta_{m-1}}(\theta_{m-1})$. Conclude that by maximizing $K_{\theta_{m-1}}(\theta')$ at each iteration $m$, it will hold that $J(\theta_m)$ is nondecreasing. Remark. The optimization formulation in this problem provides an example of a minorization–maximization procedure, where one maximizes the lower bound (or surrogate function) $K_{\theta_{m-1}}(\theta')$ to move toward the maximum of $J(\theta')$ – for more information on this class of optimization methods, see for example the treatment by Lange (2016).

49.9 Establish the equivalence of expressions (49.149) and (49.151). Remark. This useful observation is from Achiam (2018).

49.10 Refer to the derivation that established (49.49). Let us assume instead of (49.47) that we model the advantage function $A^\pi(s,a)$ linearly as
$$A^\pi(s,a;w^o) = \big(\nabla_{\theta^{\mathsf T}}\ln\pi(a|h;\theta)\big)^{\mathsf T} w^o$$
where $A^\pi(s,a) = q^\pi(s,a) - v^\pi(s)$, and $w^o$ solves
$$w^o = \underset{w\in\mathbb R^Q}{\arg\min}\;\frac{1}{2}\,\mathbb E_\pi\big(A^\pi(s,a;w) - A^\pi(s,a)\big)^2$$
Show that result (49.49) continues to hold in the form
$$\nabla_{\theta^{\mathsf T}} J(\theta) = \mathbb E_{\pi,P}\, A^\pi(s,a)\,\nabla_\theta\ln\pi(a|h;\theta)$$

49.11 The actor–critic listing (49.74) is valid for discounted rewards. Explain that for the average rewards $J_3(\theta)$ or $J_4(\theta)$, a recursion similar to (49.67) needs to be included prior to the expression for $\delta(n)$ and the value of $\widehat{\bar r}(n)$ needs to be subtracted from both $\delta(n)$ and $\beta(n)$, i.e.,
$$\delta(n) = r(n) + (\gamma h_{n+1} - h_n)^{\mathsf T} w_{n-1} - \widehat{\bar r}(n)$$
$$\beta(n) = r(n) + (\gamma f_{n+1} - f_n)^{\mathsf T} w_{n-1} - \widehat{\bar r}(n)$$

49.12 The A2C actor–critic listing (49.77) is valid for discounted rewards. Explain that for the average rewards $J_3(\theta)$ or $J_4(\theta)$, a recursion similar to (49.67) needs to be included prior to the expression for $\delta(n)$ and the value of $\widehat{\bar r}(n)$ needs to be subtracted from $\delta(n)$, i.e.,
$$\delta(n) = r(n) + (\gamma h_{n+1} - h_n)^{\mathsf T} w_{n-1} - \widehat{\bar r}(n)$$

49.13 The A2C($\lambda$) actor–critic listing (49.78) is valid for discounted rewards. Explain that for the average rewards $J_3(\theta)$ or $J_4(\theta)$, we set $\gamma = 1$ and a recursion similar to (49.67) needs to be included prior to the expression for $\delta(n)$ and the value of $\widehat{\bar r}(n)$ needs to be subtracted from $\delta(n)$, i.e.,
$$\delta(n) = r(n) + (\gamma h_{n+1} - h_n)^{\mathsf T} w_{n-1} - \widehat{\bar r}(n)$$

49.14 The natural gradient actor–critic listing (49.91) is valid for discounted rewards. Explain that for the average rewards $J_3(\theta)$ or $J_4(\theta)$, a recursion similar to (49.67) needs to be included prior to the expression for $\delta_n(h_n)$ and the value of $\widehat{\bar r}(n)$ needs to be subtracted from $\delta_n(h_n)$, i.e.,
$$\delta_n(h_n) = r(n) + \gamma\, v^\pi_{n-1}(s') - v^\pi_{n-1}(s) - \widehat{\bar r}(n)$$
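Problems 49.11–49.14 all hinge on the same modification: maintain a running estimate of the average reward and subtract it from the temporal-difference term. The minimal Python sketch below illustrates the pattern for a linear critic; the exponential-averaging form used for the running estimate is an illustrative stand-in for the recursion referred to as (49.67), and the step sizes nu and mu are arbitrary choices, not the book's listings.

```python
import numpy as np

def average_reward_td_step(w, r_bar, r, h, h_next, gamma=1.0, nu=0.05, mu=0.01):
    """One linear critic update in the average-reward setting of Probs. 49.11-49.14.

    w         : critic weights, with value estimate h^T w
    r_bar     : running average-reward estimate, playing the role of \\hat{\\bar r}(n)
    r         : observed reward r(n)
    h, h_next : feature vectors h_n and h_{n+1}
    gamma     : kept general here; set gamma = 1 for the A2C(lambda) case of Prob. 49.13
    """
    r_bar = r_bar + nu * (r - r_bar)                 # update the average-reward estimate
    delta = r + (gamma * h_next - h) @ w - r_bar     # TD term with r_bar subtracted
    w = w + mu * delta * h                           # critic update along h_n
    return w, r_bar, delta
```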

49.15 Let $\tau_{0:N}$ denote an arbitrary state trajectory in an MDP $\mathcal M = (\mathcal S,\mathcal A,P,r)$, with policy $\pi(a|s;\theta)$ parameterized by $\theta\in\mathbb R^T$. Starting from any arbitrary state $s_0$, the trajectory moves forward to other states at times $n = 1,2,3,\ldots,N$. The transitions among states occur randomly according to the transition probabilities $p^\pi_{s,s'}$. Therefore, there are many such possible trajectories. Each trajectory generates a sequence of reward values, $r(0), r(1), r(2),\ldots,r(N-1)$. In order to indicate that these rewards relate to a particular trajectory $\tau$, we denote them by writing $r^\tau(n)$, with a superscript $\tau$. Likewise, we denote the states corresponding to the same trajectory by $s_n^\tau$ and the action at time $n$ by $a_n^\tau$.
(a) Using the Markovian property, show that the probability of each trajectory $\tau_{0:N}$ occurring is given by
$$p(\tau_{0:N};\theta) = P(s_0^\tau = s_0)\prod_{n=0}^{N-1} P\big(s_n^\tau, a_n^\tau, s_{n+1}^\tau\big)\,\pi\big(a_n^\tau\,\big|\,s_n^\tau;\theta\big)$$
where $p(\tau_{0:N};\theta)$ is dependent on $\theta$. Conclude the following equality for the Hessian matrices:
$$\nabla_\theta^2\ln p(\tau_{0:N};\theta) = \sum_{n=0}^{N-1}\nabla_\theta^2\ln\pi\big(a_n^\tau\,\big|\,s_n^\tau;\theta\big)$$
(b) Define the Fisher information matrix over the distribution of the trajectories:
$$G(\theta) \;\stackrel{\Delta}{=}\; \lim_{N\to\infty}\frac{1}{N}\,\mathbb E_p\Big\{\nabla_{\theta^{\mathsf T}}\ln p(\tau_{0:N};\theta)\,\big(\nabla_{\theta^{\mathsf T}}\ln p(\tau_{0:N};\theta)\big)^{\mathsf T}\Big\}$$
Use the equivalence between expressions (31.95) and (31.96) for Fisher information matrices to conclude that $G(\theta)$ is also given by
$$G(\theta) = -\lim_{N\to\infty}\frac{1}{N}\,\mathbb E_p\,\nabla_\theta^2\ln p(\tau_{0:N};\theta)$$
in terms of the Hessian matrix of the trajectory distribution.
(c) Consider the case of the average reward objective function, $J_4(\theta)$, defined by (49.35). Show that $G(\theta) = F(\theta)$, where $F(\theta)$ is the average Fisher information matrix given by (49.86).
(d) Consider the case of the discounted reward objective function, $J_1(\theta)$, defined by (49.26). Show that $G(\theta) = F(\theta)$, where $F(\theta)$ is the average Fisher information matrix given by (49.86) with $d^\pi(s)$ replaced by the discounted distribution $d_\gamma^\pi(s)$ defined by (49.222).

49.16 Assume compatible feature vectors (49.50) are used to represent the state–action value function, $q^\pi(s,a)$.
(a) In a manner similar to the derivation that led to the A2C actor–critic algorithm (49.77), argue that the gradient vector $\nabla_{\theta^{\mathsf T}} J(\theta_{m-1})$ can be approximated by
$$g_{m-1} = \frac{1}{N}\sum_{e=1}^{E}\sum_{n=0}^{N_e-1}\delta_n^{(e)} f_n^{(e)},\qquad N = \sum_{e=1}^{E} N_e$$
where $f_n^{(e)} = f_{s_n^{(e)},\,a_n^{(e)}}$.
(b) Refer to the natural-gradient actor–critic listing (49.91). Argue that a second natural-gradient actor–critic listing can be obtained by replacing the update for $\theta_m$ by
$$\theta_m = \theta_{m-1} + \mu(m)\,F_{m-1}^{-1}\,g_{m-1}$$
where $F_{m-1}$ denotes an approximation for the average Fisher information matrix (49.86) at iteration $m-1$.
(c) Motivated by (49.87), argue that the average Fisher information matrix can be estimated from sample realizations as follows:
$$F_m = \epsilon I + \frac{1}{m}\sum_{n=0}^{m} f_n f_n^{\mathsf T}$$
where $\epsilon>0$ is a small constant to ensure invertibility, and $I$ is the identity matrix. Let $J_m = F_m^{-1}$. Apply the matrix inversion lemma and follow the derivation from Section 50.3.2 to establish the following recursion:
$$J_m = J_{m-1} - \frac{J_{m-1} f_m f_m^{\mathsf T} J_{m-1}}{1 + f_m^{\mathsf T} J_{m-1} f_m},\qquad J_0 = \epsilon^{-1} I$$
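The rank-one inverse recursion in part (c) of Prob. 49.16 is easy to sanity-check numerically. The Python sketch below verifies the matrix-inversion-lemma identity in its unnormalized form: with $J_0 = \epsilon^{-1}I$, the recursion tracks the exact inverse of $\epsilon I + \sum_{n\le m} f_n f_n^{\mathsf T}$. The dimensions, the value of $\epsilon$, and the random features are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(0)
Q, eps, M = 4, 0.1, 50            # feature dimension, regularization, number of samples
J = np.eye(Q) / eps               # J_0 = eps^{-1} I
A = eps * np.eye(Q)               # running matrix eps*I + sum_n f_n f_n^T

for _ in range(M):
    f = rng.standard_normal(Q)
    A += np.outer(f, f)
    Jf = J @ f
    J = J - np.outer(Jf, Jf) / (1.0 + f @ Jf)    # recursion from Prob. 49.16(c)

print(np.allclose(J, np.linalg.inv(A)))          # True: recursion matches the direct inverse
```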

49.17 Use a Lagrange multiplier argument to solve problem (49.191) and establish (49.192). Verify that (a) $\lim_{\lambda\to 0}\pi_\lambda^\star(a|s) = \pi^\star(a|s)$, where the latter is the original optimal policy (49.182c); and (b) $\lim_{\lambda\to\infty}\pi_\lambda^\star(a|s) = 1/|\mathcal A|$ (the uniform policy).

49.18 Let $q\in\mathbb R^M$ with entries $\{q_m\}$. Consider $\lambda>0$ and the log-sum exponential function:
$$F(q,\lambda) = \lambda\ln\Big[\sum_{m=1}^{M}\exp(q_m/\lambda)\Big]$$
Show that $|F(q_1,\lambda) - F(q_2,\lambda)| \leq \|q_1 - q_2\|_\infty$ in terms of the $\infty$-norm of the vector difference. Remark. See the appendix in Nachum et al. (2017) for a useful discussion.

49.19 Establish the equivalence of problems (49.209) and (49.212).

49.20 Verify the gradient expressions in algorithm (49.213).
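As a quick numerical illustration of the bound in Prob. 49.18, the following Python snippet evaluates the log-sum exponential on random vector pairs and confirms that the gap never exceeds the $\infty$-norm of the vector difference; the dimension, the temperature, and the number of trials are arbitrary choices.

```python
import numpy as np

def lse(q, lam):
    """Log-sum exponential F(q, lam) = lam * ln(sum_m exp(q_m / lam))."""
    return lam * np.log(np.sum(np.exp(q / lam)))

rng = np.random.default_rng(1)
lam, M = 0.5, 6
for _ in range(1000):
    q1, q2 = rng.normal(size=M), rng.normal(size=M)
    gap = abs(lse(q1, lam) - lse(q2, lam))
    assert gap <= np.max(np.abs(q1 - q2)) + 1e-12    # bound from Prob. 49.18
print("bound verified on 1000 random pairs")
```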

49.A PROOF OF POLICY GRADIENT THEOREM

We establish in this appendix the validity of the policy gradient theorem (49.44) based on the arguments from Sutton et al. (2000) and Konda and Tsitsiklis (2000, 2003). We establish the result for $J_1(\theta)$ and $J_4(\theta)$ (the latter also applies to $J_3(\theta)$). Similar arguments apply to $J_2(\theta)$ – see Prob. 49.5. Considering the cost $J_1(\theta)$ in (49.21), we can evaluate its gradient vector relative to $\theta$ as follows:
$$\frac{1}{1-\gamma}\nabla_\theta J_1(\theta) \overset{(49.23)}{=} \nabla_\theta v^\pi(s_0) = \nabla_\theta\Big(\sum_{a\in\mathcal A}\pi(a|h_0;\theta)\,q^\pi(s_0,a)\Big)$$
$$= \sum_{a\in\mathcal A} q^\pi(s_0,a)\,\nabla_\theta\pi(a|h_0;\theta) + \sum_{a\in\mathcal A}\pi(a|h_0;\theta)\,\nabla_\theta q^\pi(s_0,a)$$
$$\overset{(49.24)}{=} \sum_{a\in\mathcal A} q^\pi(s_0,a)\,\nabla_\theta\pi(a|h_0;\theta) + \sum_{a\in\mathcal A}\pi(a|h_0;\theta)\sum_{s'\in\mathcal S}\gamma\,P(s_0,a,s')\,\nabla_\theta v^\pi(s')$$
$$= \sum_{a\in\mathcal A} q^\pi(s_0,a)\,\nabla_\theta\pi(a|h_0;\theta) + \gamma\sum_{s'\in\mathcal S}\underbrace{\Big(\sum_{a\in\mathcal A}\pi(a|h_0;\theta)\,P(s_0,a,s')\Big)}_{=\;p^\pi_{s_0,s'}}\nabla_\theta v^\pi(s')$$
$$\overset{(44.31)}{=} \underbrace{\sum_{a\in\mathcal A} q^\pi(s_0,a)\,\nabla_\theta\pi(a|h_0;\theta)}_{=\;\eta(s_0)} + \gamma\sum_{s'\in\mathcal S} p^\pi_{s_0,s'}\,\nabla_\theta v^\pi(s') \tag{49.214}$$
where $p^\pi_{s_0,s'}$ is the probability of transitioning from state $s_0$ to state $s'$:
$$p^\pi_{s_0,s'} \;\stackrel{\Delta}{=}\; \mathbb P(\boldsymbol{s}'=s'\,|\,\boldsymbol{s}=s_0) = \sum_{a\in\mathcal A}\pi(a|h_0;\theta)\,P(s_0,a,s') \tag{49.215}$$
and where we introduced the state-dependent notation:
$$\eta(s) \;\stackrel{\Delta}{=}\; \sum_{a\in\mathcal A} q^\pi(s,a)\,\nabla_\theta\pi(a|h;\theta) \tag{49.216}$$
Therefore, we arrive at the relation:
$$\nabla_\theta v^\pi(s_0) = \eta(s_0) + \gamma\sum_{s'\in\mathcal S} p^\pi_{s_0,s'}\,\nabla_\theta v^\pi(s') \tag{49.217}$$

We observe that the gradient of the state value function appears on both sides of the above relation. We can expand the right-hand side to find:
$$\nabla_\theta v^\pi(s_0) = \eta(s_0) + \gamma\sum_{s'\in\mathcal S} p^\pi_{s_0,s'}\Big(\underbrace{\eta(s') + \gamma\sum_{s''\in\mathcal S} p^\pi_{s',s''}\,\nabla_\theta v^\pi(s'')}_{=\;\nabla_\theta v^\pi(s')}\Big)$$
$$= \eta(s_0) + \gamma\sum_{s'\in\mathcal S} p^\pi_{s_0,s'}\,\eta(s') + \gamma^2\sum_{s''\in\mathcal S}\Big(\sum_{s'\in\mathcal S} p^\pi_{s_0,s'}\,p^\pi_{s',s''}\Big)\nabla_\theta v^\pi(s'') \tag{49.218}$$
Now we recognize that
$$\sum_{s'\in\mathcal S} p^\pi_{s_0,s'}\,p^\pi_{s',s''} = \sum_{s'\in\mathcal S} P(s_1=s'\,|\,s_0=s_0)\,P(s_2=s''\,|\,s_1=s')$$
$$\overset{(a)}{=} \sum_{s'\in\mathcal S} P(s_1=s'\,|\,s_0=s_0)\,P(s_2=s''\,|\,s_1=s',\,s_0=s_0)$$
$$\overset{(b)}{=} \sum_{s'\in\mathcal S} P(s_2=s'',\,s_1=s'\,|\,s_0=s_0) = P(s_2=s''\,|\,s_0=s_0)$$
$$= P(\text{moving from state } s_0 \text{ to state } s'' \text{ in two time steps}) \;\stackrel{\Delta}{=}\; p^{\pi,(2)}_{s_0,s''} \tag{49.219}$$
where step (a) is because of the Markovian property, step (b) follows from the Bayes rule (3.42c), and where we are introducing the notation $p^{\pi,(k)}_{s,x}$ to refer to the probability of moving from some state $s$ to another state $x$ in $k$ steps. It is worth noting that we could have arrived at this same conclusion by appealing to the Chapman–Kolmogorov equation (38.56). Substituting into (49.218) we have
$$\nabla_\theta v^\pi(s_0) = \eta(s_0) + \gamma\Big(\sum_{s'\in\mathcal S} p^{\pi,(1)}_{s_0,s'}\,\eta(s')\Big) + \gamma^2\Big(\sum_{s''\in\mathcal S} p^{\pi,(2)}_{s_0,s''}\,\nabla_\theta v^\pi(s'')\Big)$$
$$= \eta(s_0) + \gamma\Big(\sum_{s\in\mathcal S} p^{\pi,(1)}_{s_0,s}\,\eta(s)\Big) + \gamma^2\Big(\sum_{s\in\mathcal S} p^{\pi,(2)}_{s_0,s}\,\nabla_\theta v^\pi(s)\Big) \tag{49.220}$$
where in the second equality we simply redefined $s'$ and $s''$ as $s$. We can now proceed with the argument and substitute the rightmost gradient vector of $v^\pi(s)$ and repeat the calculations to arrive at the series representation:

$$\nabla_\theta v^\pi(s_0) = \eta(s_0) + \gamma\Big(\sum_{s\in\mathcal S} p^{\pi,(1)}_{s_0,s}\,\eta(s)\Big) + \gamma^2\Big(\sum_{s\in\mathcal S} p^{\pi,(2)}_{s_0,s}\,\eta(s)\Big) + \ldots$$
$$= \eta(s_0) + \sum_{k=1}^{\infty}\gamma^k\Big(\sum_{s\in\mathcal S} p^{\pi,(k)}_{s_0,s}\,\eta(s)\Big) \overset{(c)}{=} \sum_{k=0}^{\infty}\gamma^k\Big(\sum_{s\in\mathcal S} p^{\pi,(k)}_{s_0,s}\,\eta(s)\Big)$$
$$= \sum_{s\in\mathcal S}\underbrace{\Big(\sum_{k=0}^{\infty}\gamma^k\,p^{\pi,(k)}_{s_0,s}\Big)}_{=\;d^\pi_\gamma(s)/(1-\gamma)}\eta(s) \;\stackrel{\Delta}{=}\; \frac{1}{1-\gamma}\sum_{s\in\mathcal S} d^\pi_\gamma(s)\,\eta(s) \tag{49.221}$$
where step (c) is because $p^{\pi,(0)}_{s_0,s_0} = 1$ and $p^{\pi,(0)}_{s_0,s} = 0$ for any $s\neq s_0$, and in the last equality we introduced the nonnegative (discounted) weighting coefficients:
$$d^\pi_\gamma(s) \;\stackrel{\Delta}{=}\; (1-\gamma)\sum_{k=0}^{\infty}\gamma^k\,p^{\pi,(k)}_{s_0,s},\qquad s\in\mathcal S$$
$$= (1-\gamma)\sum_{k=0}^{\infty}\gamma^k\, P(s_k = s\,|\,s_0 = s_0),\qquad s\in\mathcal S \tag{49.222}$$
Observe that these coefficients play the role of a probability distribution since they are nonnegative and, moreover,
$$\sum_{s\in\mathcal S} d^\pi_\gamma(s) = (1-\gamma)\sum_{s\in\mathcal S}\sum_{k=0}^{\infty}\gamma^k\,p^{\pi,(k)}_{s_0,s} = (1-\gamma)\sum_{k=0}^{\infty}\gamma^k\underbrace{\Big(\sum_{s\in\mathcal S} p^{\pi,(k)}_{s_0,s}\Big)}_{=1} = (1-\gamma)\sum_{k=0}^{\infty}\gamma^k = \frac{1-\gamma}{1-\gamma} = 1 \tag{49.223}$$
Substituting (49.221) into the first line of (49.214) we arrive at
$$\nabla_\theta J_1(\theta) \overset{(49.216)}{=} \sum_{s\in\mathcal S} d^\pi_\gamma(s)\sum_{a\in\mathcal A} q^\pi(s,a)\,\nabla_\theta\pi(a|h;\theta) \overset{(49.11)}{=} \sum_{s\in\mathcal S} d^\pi_\gamma(s)\sum_{a\in\mathcal A}\pi(a|h;\theta)\,q^\pi(s,a)\,\nabla_\theta\ln\pi(a|h;\theta) \tag{49.224}$$
or, equivalently,
$$\nabla_\theta J_1(\theta) = \mathbb E_{\pi,d^\pi_\gamma}\, q^\pi(s,a)\,\nabla_\theta\ln\pi(a|h;\theta) \tag{49.225}$$
Using property (49.19), for any baseline function $g(s)$, we arrive at the desired result (49.44) for $J_1(\theta)$.
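Before turning to $J_4(\theta)$, it is instructive to verify (49.224) numerically. By (49.214), $\nabla_\theta J_1(\theta)$ coincides with $(1-\gamma)\nabla_\theta v^\pi(s_0)$, so the weighted-score expression can be checked against a finite-difference derivative of $(1-\gamma)v^\pi(s_0)$. The Python sketch below does this for a small randomly generated MDP with a tabular softmax (Gibbs) policy; the MDP, the parameterization, and the finite-difference step are illustrative choices and not part of the chapter's listings.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, s0 = 4, 2, 0.9, 0                                  # small illustrative MDP
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)    # kernel P(s,a,s')
R = rng.random((S, A, S))                                       # rewards r(s,a,s')
theta = rng.normal(size=(S, A))                                 # tabular softmax parameters

def policy(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)                     # pi(a|s)

def values(theta):
    pi = policy(theta)
    Ppi = np.einsum('sa,sap->sp', pi, P)                        # transition matrix under pi
    rpi = np.einsum('sa,sap,sap->s', pi, P, R)                  # expected one-step reward
    v = np.linalg.solve(np.eye(S) - gamma * Ppi, rpi)           # v^pi
    q = np.einsum('sap,sap->sa', P, R) + gamma * np.einsum('sap,p->sa', P, v)   # q^pi
    d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * Ppi.T, np.eye(S)[s0]) # d_gamma^pi, (49.222)
    return pi, v, q, d

# gradient via the weighted-score formula (49.224)
pi, v, q, d = values(theta)
grad = np.zeros((S, A))
for s in range(S):
    for a in range(A):
        score = -pi[s]; score[a] += 1.0                         # d ln pi(a|s) / d theta[s,:]
        grad[s] += d[s] * pi[s, a] * q[s, a] * score

# finite-difference check of (1-gamma) * v^pi(s0)
eps, fd = 1e-6, np.zeros((S, A))
for s in range(S):
    for a in range(A):
        tp = theta.copy(); tp[s, a] += eps
        tm = theta.copy(); tm[s, a] -= eps
        fd[s, a] = (1 - gamma) * (values(tp)[1][s0] - values(tm)[1][s0]) / (2 * eps)

print(np.max(np.abs(grad - fd)))                                # small, confirming (49.224)
```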

Let us now establish the same conclusion for $J_4(\theta)$, with the understanding that the state and state–action value functions, $\{v^\pi(s), q^\pi(s,a)\}$, are centered according to (49.39a)–(49.39b). Thus, starting from (49.42a), we note first that
$$\nabla_\theta v^\pi(s) \overset{(49.42a)}{=} \nabla_\theta\Big(\sum_{a\in\mathcal A}\pi(a|s;\theta)\,q^\pi(s,a)\Big)$$
$$= \sum_{a\in\mathcal A} q^\pi(s,a)\,\nabla_\theta\pi(a|s;\theta) + \sum_{a\in\mathcal A}\pi(a|s;\theta)\,\nabla_\theta q^\pi(s,a)$$
$$\overset{(49.42b)}{=} \sum_{a\in\mathcal A} q^\pi(s,a)\,\nabla_\theta\pi(a|s;\theta) + \sum_{a\in\mathcal A}\pi(a|s;\theta)\Big(-\nabla_\theta\bar r(\theta) + \sum_{s'\in\mathcal S} P(s,a,s')\,\nabla_\theta v^\pi(s')\Big)$$
$$\overset{(a)}{=} \sum_{a\in\mathcal A} q^\pi(s,a)\,\nabla_\theta\pi(a|s;\theta) - \nabla_\theta J_4(\theta) + \sum_{a\in\mathcal A}\pi(a|s;\theta)\sum_{s'\in\mathcal S} P(s,a,s')\,\nabla_\theta v^\pi(s') \tag{49.226}$$
where step (a) replaces $\bar r(\theta)$ by $J_4(\theta)$. We can now solve for the gradient of $J_4(\theta)$ and write
$$\nabla_\theta J_4(\theta) = -\nabla_\theta v^\pi(s) + \sum_{a\in\mathcal A} q^\pi(s,a)\,\nabla_\theta\pi(a|s;\theta) + \sum_{s'\in\mathcal S}\underbrace{\Big(\sum_{a\in\mathcal A}\pi(a|s;\theta)\,P(s,a,s')\Big)}_{p^\pi_{s,s'}}\nabla_\theta v^\pi(s') \tag{49.227}$$
Let $d^\pi$ denote the Perron vector that is associated with the transition matrix $P^\pi = [p^\pi_{s,s'}]$. We denote its entries by $d^\pi(s)$; these entries are strictly positive and add up to 1. Multiplying both sides of the above equality by $d^\pi(s)$ and summing over $s\in\mathcal S$ gives
$$\nabla_\theta J_4(\theta) = \sum_{s\in\mathcal S} d^\pi(s)\,\nabla_\theta J_4(\theta)$$
$$\overset{(a)}{=} -\sum_{s\in\mathcal S} d^\pi(s)\,\nabla_\theta v^\pi(s) + \sum_{s\in\mathcal S} d^\pi(s)\sum_{a\in\mathcal A} q^\pi(s,a)\,\nabla_\theta\pi(a|s;\theta) + \sum_{s'\in\mathcal S}\underbrace{\Big(\sum_{s\in\mathcal S} d^\pi(s)\,p^\pi_{s,s'}\Big)}_{=\;d^\pi(s')}\nabla_\theta v^\pi(s')$$
$$= -\sum_{s\in\mathcal S} d^\pi(s)\,\nabla_\theta v^\pi(s) + \sum_{s\in\mathcal S} d^\pi(s)\sum_{a\in\mathcal A} q^\pi(s,a)\,\nabla_\theta\pi(a|s;\theta) + \sum_{s'\in\mathcal S} d^\pi(s')\,\nabla_\theta v^\pi(s')$$
$$\overset{(49.11)}{=} \sum_{s\in\mathcal S}\sum_{a\in\mathcal A} d^\pi(s)\,\pi(a|s;\theta)\,q^\pi(s,a)\,\nabla_\theta\ln\pi(a|s;\theta) \tag{49.228}$$
where step (a) is by (49.28). We conclude that
$$\nabla_\theta J_4(\theta) = \mathbb E_{\pi,d^\pi}\, q^\pi(s,a)\,\nabla_\theta\ln\pi(a|s;\theta) \tag{49.229}$$
Using property (49.19), for any baseline function $g(s)$, we arrive again at the same result (49.44) for $J_4(\theta)$. ∎

49.B PROOF OF CONSISTENCY THEOREM

We establish in this appendix the validity of expressions (49.196a)–(49.196b) for the consistency theorem motivated by arguments from Rawlik, Toussaint, and Vijayakumar (2012) and Nachum et al. (2017). We first introduce the quantities:
$$q^\pi(s,a) \;\stackrel{\Delta}{=}\; \mathbb E_P\big[r(s,a,s') + \gamma\,v^\pi(s')\big],\qquad (s,a)\in\mathcal S\times\mathcal A \tag{49.230}$$
Then, from the consistency relation (49.196a) we have
$$\pi(a|s) = \exp\Big\{\big(q^\pi(s,a) - v^\pi(s)\big)/\lambda\Big\},\qquad \forall a\in\mathcal A \tag{49.231}$$
Summing over $a$, and using $\sum_{a\in\mathcal A}\pi(a|s) = 1$, we conclude that
$$v^\pi(s) = \lambda\ln\Big[\sum_{a'\in\mathcal A}\exp\big\{q^\pi(s,a')/\lambda\big\}\Big] \tag{49.232}$$
Substituting into (49.231) we conclude that the policy $\pi(a|s)$ satisfying the consistency relation (49.196a) should be of the form:
$$\pi(a|s) = \frac{\exp\big\{q^\pi(s,a)/\lambda\big\}}{\sum_{a'\in\mathcal A}\exp\big\{q^\pi(s,a')/\lambda\big\}} \tag{49.233}$$
Next, we introduce the $|\mathcal A|\times 1$ column vectors:
$$\pi_s \;\stackrel{\Delta}{=}\; \mathrm{col}\{\pi(a|s),\ a\in\mathcal A\} \tag{49.234}$$
$$q_s \;\stackrel{\Delta}{=}\; \mathrm{col}\{q^\pi(s,a),\ a\in\mathcal A\} \tag{49.235}$$
Both vectors are defined at state $s$ and, hence, the subscript $s$ in their notation. Let $\Delta$ denote the probability simplex in $\mathbb R^M$ consisting of the set of all probability vectors $p\in\mathbb R^M$ with nonnegative entries $\{p_m\}$ that add up to 1:
$$\Delta = \big\{\,p\in\mathbb R^M\ \big|\ \mathbb 1^{\mathsf T} p = 1,\ p_m\geq 0\,\big\} \tag{49.236}$$
Motivated by (49.184) and (49.186) we define the mapping
$$S[v^\pi(s)] \;\stackrel{\Delta}{=}\; \lambda\ln\Big[\sum_{a\in\mathcal A}\exp\Big(\frac{1}{\lambda}\,\mathbb E_P\big[r(s,a,s') + \gamma\,v^\pi(s')\big]\Big)\Big] \tag{49.237}$$

and note that
$$S[v^\pi(s)] \overset{(a)}{=} \lambda\ln\Big[\sum_{a\in\mathcal A}\exp\big(q^\pi(s,a)/\lambda\big)\Big]$$
$$\overset{(b)}{=} \max_{\pi_s\in\Delta}\Big\{\pi_s^{\mathsf T} q_s + \lambda H_\pi(s)\Big\}$$
$$\overset{(c)}{=} q^\pi(s,a) - \lambda\ln\pi(a|s),\qquad \forall a\in\mathcal A$$
$$\overset{(49.230)}{=} \mathbb E_P\big[r(s,a,s') + \gamma\,v^\pi(s')\big] - \lambda\ln\pi(a|s),\qquad \forall a\in\mathcal A$$
$$\overset{(49.196a)}{=} v^\pi(s) \tag{49.238}$$
where in step (a) we used (49.230), in step (b) we used the entropy expression (49.189) and the result from part (a) in Prob. 9.21, and in step (c) we used the result from part (c) in the same Prob. 9.21 as well as (49.233). We conclude that $v^\pi(s)$ is a fixed point for the mapping $S$. Next we verify that this mapping is a strict contraction for $\gamma<1$ and therefore has a unique fixed point so that $v^\pi(s)$ must agree with $v_\lambda^\star(s)$. Indeed, consider two arbitrary functions $v_1^\pi(s)$ and $v_2^\pi(s)$ and note that
$$\max_{s\in\mathcal S}\Big|S[v_1^\pi(s)] - S[v_2^\pi(s)]\Big| = \max_{s\in\mathcal S}\lambda\,\Big|\ln\Big[\sum_{a\in\mathcal A}\exp\big(q_1^\pi(s,a)/\lambda\big)\Big] - \ln\Big[\sum_{a\in\mathcal A}\exp\big(q_2^\pi(s,a)/\lambda\big)\Big]\Big|$$
$$\overset{(a)}{\leq} \max_{s\in\mathcal S}\max_{a\in\mathcal A}\big|q_1^\pi(s,a) - q_2^\pi(s,a)\big|$$
$$\overset{(b)}{=} \gamma\,\max_{s\in\mathcal S}\max_{a\in\mathcal A}\Big|\mathbb E_P\big[v_1^\pi(s') - v_2^\pi(s')\big]\Big|$$
$$= \gamma\,\max_{s\in\mathcal S}\max_{a\in\mathcal A}\Big|\sum_{s'\in\mathcal S} P(s,a,s')\big[v_1^\pi(s') - v_2^\pi(s')\big]\Big|$$
$$\leq \gamma\,\max_{s\in\mathcal S}\max_{a\in\mathcal A}\sum_{s'\in\mathcal S} P(s,a,s')\,\big|v_1^\pi(s') - v_2^\pi(s')\big|$$
$$\leq \gamma\,\max_{s'\in\mathcal S}\big|v_1^\pi(s') - v_2^\pi(s')\big|\,\max_{s\in\mathcal S}\max_{a\in\mathcal A}\sum_{s'\in\mathcal S} P(s,a,s')$$
$$= \gamma\,\max_{s'\in\mathcal S}\big|v_1^\pi(s') - v_2^\pi(s')\big| \tag{49.239}$$
where step (a) uses the result of Prob. 49.18 and step (b) uses definition (49.230). It follows that the mapping $S[\cdot]$ has a unique fixed point by the fixed-point theorem. Since $v_\lambda^\star(s)$ satisfies the same mapping and the same consistency relation, as shown by (49.195), we conclude that $v^\pi(s) = v_\lambda^\star(s)$ and $\pi(a|s) = \pi_\lambda^\star(a|s)$. ∎
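A short numerical check of the contraction property (49.239): apply the soft (log-sum-exp) mapping $S$ to two arbitrary value functions on a random MDP and confirm that the gap shrinks by at least the factor $\gamma$ in the sup norm. The random MDP, the value of $\lambda$, and the test vectors in the Python sketch below are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA, gamma, lam = 5, 3, 0.9, 0.5
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA, nS))

def soft_map(v):
    """S[v](s) = lam * ln sum_a exp( E_P[r + gamma*v(s')] / lam ), cf. (49.237)."""
    q = np.einsum('sap,sap->sa', P, R) + gamma * np.einsum('sap,p->sa', P, v)
    return lam * np.log(np.exp(q / lam).sum(axis=1))

v1, v2 = rng.normal(size=nS), rng.normal(size=nS)
lhs = np.max(np.abs(soft_map(v1) - soft_map(v2)))
rhs = gamma * np.max(np.abs(v1 - v2))
print(lhs <= rhs + 1e-12)          # True: S is a gamma-contraction in the sup norm
```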

REFERENCES

Achiam, J. (2018), "Simplified PPO-clip objective," unpublished note. Available at https://spinningup.openai.com/en/latest/algorithms/ppo.html.
Aleksandrov, V. M., V. I. Sysoyev, and V. V. Shemeneva (1968), "Stochastic optimization," Eng. Cybern., vol. 5, pp. 11–16.
Amari, S. I. (1998), "Natural gradient works efficiently in learning," Neural Comput., vol. 10, no. 2, pp. 251–276.
Asadi, K. and M. L. Littman (2017), "An alternative softmax operator for reinforcement learning," Proc. Int. Conf. Machine Learning (ICML), pp. 243–252, Sydney.
Bagnell, J. A. and J. Schneider (2003), "Covariant policy search," Proc. Int. Joint Conf. Artificial Intelligence, pp. 1019–1024, Acapulco.
Baird, L. C. (1995), "Residual algorithms: Reinforcement learning with function approximation," Proc. Int. Conf. Machine Learning (ICML), pp. 30–37, Tahoe City, CA.
Barto, A. G., R. S. Sutton, and C. W. Anderson (1983), "Neuronlike adaptive elements that can solve difficult learning control problems," IEEE Trans. Syst., Man Cybern., vol. 13, no. 5, pp. 834–846.
Bhatnagar, S., R. S. Sutton, M. Ghavamzadeh, and M. Lee (2009), "Natural actor–critic algorithms," Automatica, vol. 45, no. 11, pp. 2471–2482.
Cao, X. R. (2007), Stochastic Learning and Optimization: A Sensitivity-Based Approach, Springer.
Cao, X. R. and H. F. Chen (1997), "Perturbation realization, potentials, and sensitivity analysis of Markov processes," IEEE Trans. Aut. Control, vol. 42, pp. 1382–1393.
Cassano, L., S. A. Alghunaim, and A. H. Sayed (2019), "Team policy learning for multiagent reinforcement learning," Proc. IEEE ICASSP, pp. 3062–3066, Brighton.
Dai, B., A. Shaw, L. Li, L. Xiao, N. He, Z. Liu, J. Chen, and L. Song (2018), "SBEED: Convergent reinforcement learning with nonlinear function approximation," Proc. Int. Conf. Machine Learning (ICML), pp. 1125–1134, Stockholm.
Fox, R., A. Pakman, and N. Tishby (2016), "Taming the noise in reinforcement learning via soft updates," Proc. Conf. Uncertainty in Artificial Intelligence (UAI), pp. 202–211, New York.
Gordon, G. J. (1995), "Stable function approximation in dynamic programming," Proc. Int. Conf. Machine Learning (ICML), pp. 261–268, Tahoe City, CA.
Grondman, I., L. Busoniu, G. A. D. Lopes, and R. Babuska (2012), "A survey of actor–critic reinforcement learning: Standard and natural policy gradients," IEEE Trans. Syst. Man Cybern. Part C, vol. 42, no. 6, pp. 1291–1307.
Jaakkola, T., S. Singh, and M. I. Jordan (1995), "Reinforcement learning algorithms for partially observable Markov decision problems," Proc. Advances Neural Information Processing Systems (NIPS), pp. 345–352, Denver, CO.
Kakade, S. (2001), "Natural policy gradient," Proc. Advances Neural Information Processing Systems (NIPS), pp. 1531–1538, British Columbia.
Kakade, S. and J. Langford (2002), "Approximately optimal approximate reinforcement learning," Proc. Int. Conf. Machine Learning (ICML), pp. 267–274, Sydney.
Konda, V. R. and J. N. Tsitsiklis (2000), "Actor–critic algorithms," Proc. Advances Neural Information Processing Systems (NIPS), pp. 1008–1014, Denver, CO.
Konda, V. R. and J. N. Tsitsiklis (2003), "On actor–critic algorithms," SIAM J. Control Optim., vol. 42, no. 4, pp. 1143–1166.
Lange, K. (2016), MM Optimization Algorithms, SIAM.
L'Ecuyer, P. (1990), "A unified version of the IPA, SF, and LR gradient estimation techniques," Manag. Sci., vol. 36, no. 11, pp. 1364–1383.
L'Ecuyer, P. (1991), "An overview of derivative estimation," Proc. Winter Simulation Conf., pp. 207–217, Phoenix, AZ.
Lu, T., D. Schuurmans, and C. Boutilier (2018), "Non-delusional Q-learning and value iteration," Proc. Advances Neural Information Processing Systems (NIPS), pp. 1–11, Montreal.
Melo, F. S., S. P. Meyn, and M. I. Ribeiro (2008), "An analysis of reinforcement learning with function approximation," Proc. Int. Conf. Machine Learning (ICML), pp. 664–671, Helsinki.
Nachum, O., M. Norouzi, K. Xu, and D. Schuurmans (2017), "Bridging the gap between value and policy based reinforcement learning," Proc. Advances Neural Information Processing Systems (NIPS), pp. 2772–2782, Long Beach, CA.
Neu, G., A. Jonsson, and V. Gomez (2017), "A unified view of entropy-regularized Markov decision processes," available at arXiv:1705.07798.
Peters, J. and S. Schaal (2008), "Natural actor–critic," Neurocomputing, vol. 71, pp. 1180–1190.
Peters, J., S. Vijayakumar, and S. Schaal (2003), "Reinforcement learning for humanoid robotics," Proc. IEEE-RAS Int. Conf. Humanoid Robots, pp. 1–20, Karlsruhe.
Rawlik, K., M. Toussaint, and S. Vijayakumar (2012), "On stochastic optimal control and reinforcement learning by approximate inference," Proc. Conf. Robotics Science and Systems, pp. 1–8, Sydney.
Samuel, A. L. (1959), "Some studies in machine learning using the game of checkers," IBM J. Res. Develop., vol. 3, no. 3, pp. 210–229. Reprinted in Computers and Thought, E. A. Feigenbaum and J. Feldman, editors, McGraw-Hill.
Schoknecht, R. (2003), "Optimality of reinforcement learning algorithms with linear function approximation," Proc. Advances Neural Information Processing Systems (NIPS), pp. 1555–1562, Vancouver.
Schulman, J., S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel (2015), "Trust region policy optimization," Proc. Int. Conf. Machine Learning (ICML), pp. 1–9, Lille.
Schulman, J., F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017), "Proximal policy optimization algorithms," available at https://arxiv.org/abs/1707.06347.
Sutton, R. S., D. McAllester, S. Singh, and Y. Mansour (2000), "Policy gradient methods for reinforcement learning with function approximation," Proc. Advances Neural Information Processing Systems (NIPS), pp. 1057–1063, Denver, CO.
Tsitsiklis, J. N. and B. Van Roy (1996), "Feature-based methods for large scale dynamic programming," Mach. Learn., vol. 22, pp. 59–94.
Tsitsiklis, J. N. and B. Van Roy (1997), "An analysis of temporal-difference learning with function approximation," IEEE Trans. Aut. Control, vol. 42, no. 5, pp. 674–690.
Williams, R. J. (1992), "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Mach. Learn., vol. 8, pp. 229–256.
Witten, I. H. (1977), "An adaptive optimal controller for discrete-time Markov environments," Inf. Control, vol. 34, pp. 286–295.

Author Index

Please note: this index contains terms for all three volumes of the book. Only terms listed below with page numbers 1053–2120 are featured in this volume. Abaya, E., 2281 Abbeel, P., 1997, 2110, 3136 Abdi, H., 2412 Abramowitz, M., 292, 1253, 1347 Abu-Mostafa, Y. S., 2686 Achiam, J., 2110, 3136 Adadi, A., 3061 Adali, T., 1085 Adamcík, M., 293 Adams, R. P., 2836 Adebayo, J., 3061 Adrain, R., 158, 1083, 2198 Agarwal, A., 497, 585, 674 Aggarwal, C. C., 2886 Aggoun, L., 1556, 1605 Agmon, S., 496, 2519 Agrawal, R., 1999 Aguiar, P., 996, 2253 Aharon, M., 2445 Aho, A. V., 1734 Aiello, L. M., 2445 Ailon, N., 2201 Aitken, A. C., 42, 1145 Aizerman, M. A., 2634 Akaike, H., 1253 Akashi, H., 1402 Akhtar, N., 3095 Al-Baali, M., 466, 713 Alacaoglu, A., 627, 631, 632, 638 Albert, A. E., 583, 713 Albert, J. H., 1467 Aldrich, J., 1113, 1251 Aleksandrov, V. M., 2108 Alghunaim, S. A., 430, 775, 996, 999, 1000, 2108 Ali, S. M., 233 Allen, D. M., 2553 Allen-Zhu, Z., 873 Allwein, E., 2486 Alon, N., 110, 123, 126, 2678 Alriksson, P., 942

Altman, E. I., 2378 Altman, N. S., 2280 Aly, M., 2486 Amari, S. I., 233, 584, 1638, 2108, 2445 Amblard, P. O., 1085 Amit, Y., 2581 Anandkumar, A., 1734 Ancona, M., 3061 Andersen, E. B., 189 Andersen, S. K., 1795 Anderson, B. D. O., 1201, 2202 Anderson, C. W., 2108 Anderson, J., 1203, 1467 Anderson, T. W., 2205, 2378, 2412 Andreae, J. H., 1957 Andrews, A. P., 1201 Andrieu, C., 1401, 2834 Andrychowicz, M., 1996, 3136 Anouar, F., 2378 Ans, B., 1637 Anthony, M., 2528, 2553 Antos, A., 2679 Antsaklis, P. J., 1202 Apostol, T., 591 Applebaum, S. P., 584 Archimedes, xxvii, 290 Ariki, Y., 1556, 1605 Arjovsky, M., 2961 Armijo, L., 418 Arnold, B. C., 159 Arnold, V. I., 338 Aronszajn, N., xxxi, 2638 Arora, S., 1735 Arrow, K. J., 430, 1849 Arsenin, V. Y., 2242 Arthur, D., 2281 Artin, E., 1253 Arulampalam, S., 1401 Arulkumaran, K., 2961 Asadi, K., 2108 Ash, R. B., 109, 232



Ashburner, J., 1347 Aslanides, J., 1997 Aström, K. J., 1201, 1894, 2202 Atchadé, Y. F., 775 Athalye, A., 3095 Athans, M., 941, 942 Atlas, L., 2487 Attar, R., 1401 Auer, P., 1999 Aurenhammer, F., 2280 Austin, J., 2778 Autonne, L., 45 Avriel, M., 323 Azizan, N., 775 Azoury, K. S., 293 Azulay, A., 2886 Azuma, K., 126 Ba, J. L., 627, 631, 632, 3095 Babuska, R., 1957, 2042 Bacchus, F., 1734 Bach, F., 362, 418, 585, 775, 806, 809, 832, 996, 1515, 1734, 1795, 2248, 2249, 2445 Bach, S., 3061 Bachman, G., 2414, 2637 Baehrens, D., 3061 Baeza-Yates, R., 2280 Bagnell, J. A., 2108 Bahdanau, D., 3037, 3038 Bahl, L., 1556, 1605 Bai, Z., 1997 Bailer-Jones, C. A. L., 1346 Baird, L. C., 2043, 2108 Baker, J. K., 1556, 1605 Bakiri, G., 2486, 2779 Balakrishnan, N., 1116, 1347 Balakrishnan, V. K., 940 Banachiewicz, T., 42 Banerjee, A., 293 Banerjee, O., 1734, 1736 Banks, S. C., 2243 Bapat, R. B., 940 Barak, B., 1735 Baram, Y., 2487 Barankin, E. W., 189 Barber, D., 1467, 1677, 1795, 2634, 2646 Barkema, G. T., 1347 Barnett, A., 188 Barnett, V., 1259 Barron, A., 1254 Barry, D., 2836 Bartle, R. G., 591 Bartlett, P. L., 291, 496, 497, 585, 752, 2581, 2679, 2680, 2685, 2708 Barto, A. G., 1849, 1894, 1903, 1905, 1907, 1957, 1958, 1996, 2001, 2042, 2043, 2108

Basharin, G. P., 1551 Bass, R. F., 111 Bather, J. A., 1894 Batzoglou, S., 1309 Bau, D., 43, 50 Baudat, G., 2378 Baum, L. E., 1309, 1556, 1604, 1605 Bauschke, H. H., 360, 362, 424, 539 Bay, J. S., 1202 Bayes, T., xxix, 110 Bazerque, J. A., 996 Beale, P. D., 111 Beaver, B. M., 2199 Beaver, R. J., 2199 Beck, A., 360, 362, 363, 423, 540, 541 Beckenbach, E. F., 638 Becker, S., 418, 540 Beckley, B. D., xxxix Bedeian, A. G., xxvii Beezley, J. D., 1203 Belkin, M., 2205 Bell, A. J., 1638 Bell, R., 586 Bellemare, M. G., 1997 Bellman, R. E., xxxi, 41, 638, 1202, 1605, 1848, 1894, 1903, 1905, 1907, 2042, 2672 Beltrami, E., 44 Ben-David, S., 713, 2336, 2488, 2583, 2675, 2678, 2679, 2685, 2686, 2701, 2702 Ben-Tal, A., 541 Benesty, J., 627 Bénézit, F., 942 Bengio, S., 3095, 3136 Bengio, Y., 584, 2553, 2778, 2779, 2830, 2831, 2886, 2960, 2961, 3037, 3038, 3136 Bengtsson, T., 1203 Bennett, K. P., 2486, 2518, 2551, 2552 Benveniste, A., 583, 713 Berg, H. C., 110 Berger, R. L., 188, 940, 941, 995, 1113, 1116, 1251, 1265 Berkowitz, J., 3037 Berkson, J., 1346, 1677, 1794, 2485 Berlinet, A., 2635 Berman, A., 46, 1555, 2445 Bernardo, J. M., 1113 Bernoulli, J., 121 Bernshtein, S. N., 110 Bernstein, D. S., 41, 65 Berrada, M., 3061 Bertsekas, D. P., 45, 290–292, 322, 323, 338, 362, 418, 496, 497, 540, 583, 674, 713, 752, 873, 941, 942, 1848, 1849, 1894, 1903, 1905, 1907, 1910, 1957, 1958, 2043 Besag, J., 1796 Bethard, S., 3037


Bhambri, S., 3095 Bharath, A., 2961 Bhatia, R., 45 Bhatnagar, S., 1849, 1957, 2043, 2108 Bianchi, P., 423, 943 Biau, G., 2280, 2284 Bickel, P. J., 189, 2378 Bickel, S., 2488, 2491 Bien, J., 3061 Bienaymé, I. J., 110 Bienenstock, E., 1084, 2672 Biggio, B., 3095 Billings, S. A., 2446 Billingsley, P., 109, 112, 115, 119, 158, 257, 1551 Bilmes, J., 1309 Binmore, K., 337 Bioucas-Dias, J., 360, 540 Birkhoff, G. D., 257 Birlutiu, A., 1378 Bishop, C., 1113, 1309, 1347, 1378, 1677, 2310, 2378, 2380, 2412, 2416, 2417, 2486, 2634, 2778 Bjorck, A., 43, 2198, 2199 Bjorck, J., 2779 Blackwell, D., 1252 Blanc, G., 2336 Blatt, D., 941 Blei, D. M., 1467, 1515 Blight, B. J. N., 160 Bliss, C. I., 1346, 2485 Block, H. D., 2513, 2516 Blum, J. R., 583 Blumer, A., 2673 Blundell, C., 1997 Bochner, S., 109 Bollobas, B., 940 Bolstad, W. M., 1113, 1346, 1347, 2486 Boltzmann, L., 111 Bondy, A., 940 Boots, B., 2280 Bordes, A., 2634, 2640, 2778, 3136 Borel, E., 112 Borgelt, C., 1794 Borgwardt, K. M., 2962 Borwein, J. M., 424 Boser, B., 2551, 2552, 2831 Bottou, L., 418, 584, 585, 713, 715, 718, 775, 832, 876, 2634, 2640, 2778, 2961 Bouboulis, P., 2634, 2640 Boucheron, S., 110, 123, 2679, 2701 Bouman, C., 423 Bourlard, H., 2778 Bousquet, O., 110, 123, 584, 585, 2679, 2701 Boutilier, C., 2043, 2107 Bowling, S. R., 1347

2123

Bowman, K. O., 1253 Box, G. E. P., 1113, 1251, 1346 Boyce, W. E., 338 Boyd, S., 290, 292, 322, 323, 328, 358, 360, 418, 540, 941, 942, 996, 2486, 2517 Boyen, X., 1378 Boyle, J. P., 424, 432 Bródy, F., 43 Braca, P., 943 Bradtke, S. J., 1849, 1957, 1958, 2043 Brailovsky, V., 2553 Braverman, E. M., 2634 Brazdil, P., 3136 Bredensteiner, E. J., 2486 Bredies, K., 362 Bregman, L. M., 293, 424 Breiman, L., 1084, 2336, 2553, 2581, 2672 Brent, R., 425 Brereton, R. G., 2310 Brezis, H., 360, 539 Bridle, J. S., 1996 Brin, S., 1551 Bromley, J., 2786, 3136 Brooks, S., 1347, 1796 Browder, F., 360, 539 Brown, G., 2581 Brown, L. D., 189 Brown, R., 111 Brown, W. M., 1085 Broyden, C. G., 421 Brualdi, R. A., 45 Bruck, R. E., 360, 539, 540 Bruckstein, A., 2445 Brueckner, M., 2488, 2491 Bruna, J., 3095 Brush, S. G., 1796 Brutzkus, A., 2336 Bubeck, S., 541, 543, 775, 1997 Buck, J. R., 109 Buck, R. C., 591 Bucklew, J. A., 627 Bucy, R. S., 1201, 1204 Budinich, M., 2526 Budka, M., 3136 Buhlmann, P., 1734, 2581 Bui, T. D., 2779 Buja, A., 2378, 2581 Burda, Y., 1997 Burges, C., 2551, 2552 Burt, C., 2412 Businger, P., 2198 Busoniu, L., 1957, 2042 Bussgang, J. J., 163 Butala, M., 1203 Butepage, J., 1467



Caetano, T. S., 806 Cai, J., 364 Cai, T. T., 2446 Caines, P. E., 1252 Calafiore, G. C., 43, 45, 50 Callier, F. M., 1202 Campbell, S. L., 2445 Campbell, T., 1264 Candes, E. J., 50, 364, 2412, 2446 Canini, K., 1515 Cantelli, F. P., 2674 Cao, X. R., 1957, 2107 Capen, E., 1264 Caponnetto, A., 2678 Cappe, O., 1401, 1556, 1605 Cardoso, J. F., 1638 Carin, L., 2486 Carli, R., 942 Carlin, J., 1515 Carlini, N., 3095, 3096 Carpenter, J., 1401 Carrera, J. P., 43 Carroll, R., 1347 Caruana, R., 2353, 2779, 3095 Casella, G., 188, 189, 1347, 2835 Casey, R. G., 2336 Cassandra, A. R., 1894 Cassano, L., 1997, 2043, 2108 Cassella, G., 1113, 1116, 1251, 1265 Cassirer, A., 1997 Castanon, D., 293 Castillo, E., 159 Cattivelli, F. S., 943, 945, 996 Cauchy, A.-L., xxx, 418 Cavanaugh, J. E., 1254, 1271 Celik, Z. B., 3095 Censor, Y., 293 Ceppellini, R., 1309 Cesa-Bianchi, N., 585, 775, 1997, 1999, 2634, 2678 Cevher, V., 418, 540, 627, 631, 632, 638 Chávez, E., 2280 Cha, S., 46, 1555 Chakhchoukh, Y., 1309 Chambers, J., 3037 Chambers, R. A., 1957 Chan, R., 2248 Chandrasekaran, S., 2245, 2249 Chandrasekaran, V., 2412 Chang, K. W., 423 Chapman, S., 1552 Charnes, A., 2551, 2552 Chartrand, G., 940 Chatterjee, C., 2378 Chaudhuri, K., 630, 2280 Chellappa, R., 1796, 3095

Chen, C.-T., 1202 Chen, G., 362, 2280 Chen, H., 2107, 2245, 3095 Chen, J., 674, 878, 941, 943, 954, 998, 2043, 2253, 2445 Chen, L., 364 Chen, P., 293 Chen, R., 1401 Chen, S., 2243, 2446 Chen, S. S., 2445 Chen, Y., 293 Cheney, W., 424 Cherkassky, V., 2673 Chernoff, H., 110, 233, 2673 Chervonenkis, A. Y., 2551, 2552, 2673, 2678 Cheung, J. F., 1795 Cheung, S., 1796 Chevalier, P., 1085 Chi, E. C., 423 Chib, S., 1347, 1401, 1467 Chickering, D. M., 1734 Child, D., 2412 Cho, K., 3037, 3038 Choi, S., 1637, 3136 Chopin, N., 1401 Chopra, S., 2830, 3136 Choromanska, A., 873 Chow, C. K., 1113, 1734 Chowdhary, G., 1957, 2042, 2107 Christensen, C., 420 Christmann, A., 2551, 2552 Chung, F., 110, 123, 126 Chung, K. L., 109 Cichocki, A., 233, 1637, 1638, 2445 Cinlar, E., 1551, 1552 Ciochina, S., 627 Cipparrone, F. A. M., 2244, 2249, 2250 Claerbout, J. F., 2243 Claeskens, G., 1254 Clapp, R., 1264 Clark, D. S., 583, 713 Clark, P., 2353 Clark, W. A., 2042 Clarke, F. H., 291, 496, 752 Clemmensen, L., 2378 Cleveland, W. S., 2198 Clifford, P., 1401, 1796 Clune, J., 3095 Cobb, L., 1203 Cochran, W. G., 2553 Cochrane, J. H., 159 Coddington, E. A., 338 Cohen, A., 466 Cohn, D., 2487


Combettes, P. L., 358, 360, 362, 424, 432, 539, 540 Comon, P., 1085, 1637, 1638 Conconi, A., 585 Conn, A. R., 425 Consonni, G., 1467 Constantinescu, T., 42 Cook, D., 2553 Cooley, J. W., xxx Cooper, G. F., 1734 Cormen, T. H., 1734, 1735 Cornfeld, I., 257 Cortes, C., 2487, 2488, 2551, 2552, 2581 Cotter, A., 291, 496 Cottle, R. W., 42, 323 Courant, R., 45 Courty, N., 2488 Courville, A., 584, 2778, 2831, 2886, 2960, 2961 Cover, T. M., 232, 1254, 2284, 2523, 2525, 2551, 2552, 2674 Cowell, R. G., 1794, 1795 Cox, D. D., 2639 Cox, D. R., 188, 1113, 1116, 1143, 1251, 2281, 2485, 2486, 2553 Cramer, H., 1251 Cramer, J. S., 1346, 2485, 2486 Crammer, K., 2486, 2488, 2634 Crane, R. B., 1085 Cresswell, A., 2961 Criminisi, A., 2581 Crisan, D., 1401 Cristianini, N., 2551, 2552, 2634, 2640 Crowder, H. P., 291, 496 Csiszár, I., 233, 235 Cucker, F., 2678 Cummins, F., 3037 Curran, J. M., 1346 Curry, H. B., 418 Curtis, F. E., 713, 718, 873, 876 Cybenko, G., 2778 d’Aspremont, A., 1734, 1736 Dahl, G. E., 2831 Dahleh, M. A., 2243 Dai, A. M., 2779 Dai, B., 2108 Dai, Y. H., 466 Daintith, J., xxxiv Dale, A. I., 110 Dalenius, T., 2281 Daneshmand, A., 998 Daneshmand, H., 873, 874 Daniel, J. W., 466, 540 Daniely, A., 2336 Dantzig, G. B., 323

2125

Darmois, G., 189 Darrell, T., 2280 Darwiche, A., 1677 Das, A., 942 Dasarathy, B. V., 2581 Dasgupta, S., 2280, 2487 Dattoro, J., 322, 424, 432 Daube-Witherspoon, M., 2445 Daubechies, I., 362, 423 Daunizeau, J., 1347 Davenport, W. B., 1113 Davidson, J., 116, 119 Davies, J., 337 Davis, G., 2446 Davis, P. J., 1253 Dayan, P., 1958, 1996, 2003, 2042 De Boer, P.-T., 2778 de Finetti, B., 1259 de Freitas, N., 1401, 2834 de Moivre, A., xxix De Mol, C., 362, 423, 2242 De Pierro, A. R., 941 De Vito, E., 2242, 2678 De, S., 833 Dean, J., 3095 Debnath, L., 2638 Dechter, R., 1733 Defazio, A., 806, 832 Defrise, M., 362, 423 DeGroot, M. H., 940, 941, 995, 1113, 1116 Dekel, O., 2634 Del Moral, P., 1401 Delampady, M., 1254, 1271 Deller, J. R., 1309 Demir, G. K., 2378 Demmel, J., 45 Dempster, A. P., 1309 Demyanov, V. F., 540 Deng, J., 2779 Deng, L., 2831 Deo, N., 940 Derman, C., 1848, 1894 Descartes, R., 2280 Desoer, C. A., 1202 Detrano, R., xxxix, 1259, 2337 Devijver, P. A., 2553 Devlin, J., 3037 Devlin, S. J., 2198 Devroye, L., 2280, 2284, 2673, 2674, 2683, 2688, 2695, 2960 Dhillon, A., 2886 Diaconis, P., 188, 1638 Diamantaras, K. I., 2412 Diaz-Bobillo, I., 2243 Diestel, R., 1734



Dietterich, T. G., 1084, 2486, 2553, 2581, 2672, 2779 Dieudonné, J., 591 Dijkstra, E. W., 1734, 1900 Dimakis, A. G., 943 DiPrima, R. C., 338 Dirichlet, G. L., 2280 Djuric, P., 1401 Do, C. B., 1309 Dobrushin, R. L., 2962 Dobson, A. J., 188 Doersch, C., 2961 Doksum, K. A., 189 Domingos, P., 1084, 2353, 2672 Domke, J., 806 Donahue, J., 2779 Dong, X., 1734 Dong, Y., 3095 Donini, M., 3061 Donnelly, P., 1515 Donoho, D. L., 358, 2243, 2445, 2446, 2450 Doob, J. L., 109, 256, 2353 Doshi-Velez, F., 3061 Doucet, A., 1401, 2834 Douglas, J., 362 Douglas, S. C., 234 Doursat, R., 1084, 2672 Dozat, T., 627 Drake, A. W., 1551, 1552 Draper, N. R., 2199 Dreyfus, S. E., 1894, 1910, 2042 Drineas, P., 118, 2201 Drissi, Y., 3136 Dror, G., 586 Drucker, H., 2581 Du, Q., 2281 Du, S. S., 873 Dua, D., 1085, 2378 Dubois, J.-M., 112 Dubois-Taine, B., 806 Duchi, J., 291, 324, 325, 362, 425, 496, 627, 631, 752, 775 Duda, R. O., 1113, 1309, 2280, 2378, 2380, 2513, 2551, 2552, 2778 Dudani, S. A., 2280 Dudley, R. M., 119, 2674, 2678 Dudoit, S., 2378 Duff, M., 1957 Dugundji, J., 360, 539 Dumitrescu, B., 2445 Dumoulin, V., 2961 Duncan, W. J., 42 Dunford, N., 2635 Dunham, W., 290 Dunn, J. C., 540 Durbin, R., 1556

Durrett, R., 109, 112, 119, 121, 255 Duttweiler, D. L., 2640 Duvaut, P., 1085 Dvoretzky, A., 583, 713, 1959 Dwilewicz, R. J., 290 Dwork, C., 630 Dyck, W., 65 Dykstra, R. L., 424, 432 Dziugaite, G. K., 2962 Eagon, J. A., 1556, 1605 Eaton, M. L., 2205 Eatwell, J., 292 Eckart, C., 44 Eckstein, J., 362 Eddy, S., 1263, 1556 Edgeworth, F. Y., 1251 Edwards, A. W. F., 110 Edwards, B. H., 337 Edwards, D., 1794 Efron, B., 2553, 2581 Egerstedt, M., 1734 Ehrenfeucht, A., 2673 Einstein, A., xxviii, 111 El Ghaoui, L., 43, 45, 50, 1734, 1736, 2245 Elad, M., 2445 Eldracher, M., 2961 Elfring, J., 1401 Elkan, C., 3037 Elliott, R. J., 1556, 1605 Elman, J. L., 3037 El-Yaniv, R., 2487 Emelianenko, M., 2281 Ene, V., 592 Engl, H. W., 2242 Engstrom, L., 3095 Erciyes, K., 940 Erdogmus, D., 2778 Eremin, I., 2519 Erhan, D., 3095, 3136 Ermoliev, Y. M., 291, 496 Ernst, D., 2042 Ersboll, B., 2378 Ertekin, S., 2640 Erwin, E., 2310 Escalante, R., 424 Evensen, G., 1203 Everitt, B. S., 1309 Faber, V., 2281 Fang, C., 873 Farley, B. G., 2042 Farlow, S. J., 338 Fatemi, E., 499, 541 Fausett, L., 2310 Fawzi, A., 3095


Fawzi, H., 3095 Fawzi, O., 3095 Fearnhead, P., 1401 Fedorov, V., 160, 2634 Fei-Fei, L., 1515, 3136 Feinberg, E. A., 1848, 1894 Feinberg, S. E., 110 Feingold, D. G., 45 Feller, W., 109, 112, 119, 255 Fenchel, W., 290, 292 Fercoq O., 423 Fergus, R., 3060, 3095, 3136 Fernandez-Delgado, M., 2677 Fiat, A., 2336 Field, D. J., 2445 Figueiredo, M., 360, 540, 2486 Fine, B., 43 Finn, C., 3136 Fischer, A., 2830 Fischer, H., 110, 158 Fischer, K. H., 1796 Fischer, P., 1999 Fischione, C., 942 Fisher, J., 2778 Fisher, R. A., xxxix, 111, 158, 187, 189, 1085, 1113, 1251, 1253, 1309, 2378 Fisher, W. D., 2281 Fix, E., 2279 Flaxman, A. D., 425 Fleisher, M., 2778 Fleming, W., 65 Fletcher, R., 2199 Fletcher, C. A. J., 1849 Fletcher, R., 323, 418, 421, 466 Foltin, B., 2961 Fomin, S., 257 Forgy, E. W., 2281 Forney, G. D., 1605 Fort, G., 775 Fort, J. C., 2310 Fourier, J., xxx Fox, C. W., 1467 Fox, L., 466 Fox, R., 2108 Fréchet, G., 1202 Frank, E., 2281, 2353 Frank, I. E., 2248 Frank, M., 540 Frasconi, P., 2778, 3037 Fredrikson, M., 3095 Freedman, D., 1638, 2486 Freeman, R. A., 996 Freeman, W. T., 1795 Freund, Y., 2581, 2640, 2830 Frey, B. J., 1677, 1794, 1795 Fridlyand, J., 2378

2127

Frieden, B. R., 1251, 1265 Friedman, J., 423, 1084, 1201, 1254, 1309, 1638, 1734, 2242, 2248, 2336, 2378, 2446, 2486, 2553, 2581, 2634, 2672, 2673 Friedman, N., 1677, 1734, 1794, 1795 Friston, K., 1347 Fritsche, C., 1203 Frobenius, G., 42, 1556 Frossard, P., 1734, 2445, 3095 Frosst, N., 3095 Fu, K. S., 583, 713 Fu, W. J., 423, 2248 Fu, X., 2445 Fujii, A., 2487 Fukunaga, K., 2280, 2378, 2380, 2673 Fukushima, M., 362 Fullagar, P. K., 2243 Funahashi, K., 2778 Furedi, Z., 2526 Furnkranz, J., 2486 Furrer, R., 1203 Gabor, D., 584 Gabriel, K. R., 2412 Gabrys, B., 3136 Gaddum, J. H., 1346, 2485 Gallager, R. G., 111, 1795 Gallant, S. I., 497, 2516 Gallot, S., 234 Galton, F., 2412 Gamerman, D., 1347 Ganapathi, V., 1734 Gantmacher, F. R., 41 Garcia, V., 189 Garcia-Pedrajas, N., 2486 Gardner, L. A., 583, 713 Garey, M. R., 1735 Garg, A., 2353 Gauchman, H., 941 Gauss, C. F., xxix, 157, 423, 1083, 1145, 1201, 2197, 2198 Ge, R., 873, 874 Geisser, S., 2553 Gelb, A., 1201, 1203 Gelfand, A. E., 1347, 2834 Gelfand, I. M., 41 Gelfand, S., 873 Gelman, A., 1515 Geman, D., 1347, 2248, 2581, 2834 Geman, S., 1347, 2834 Gentile, C., 585, 2634 George, E. I., 2835 Geramifard, A., 1957, 2042, 2107 Germain, P., 2488 German, S., 1084, 2672



Gers, F. A., 3037, 3040 Gershgorin, S., 45 Geurts, P., 1084, 2672 Ghadimi, S., 627, 775 Ghahramani, Z., 1467, 2487, 2962 Ghavamzadeh, M., 1957, 2108 Ghosh, J. K., 1254, 1271 Giannakis, G. B., 996 Gibbs, J. W., 111, 231, 2832 Gilbert, A. C., 2446 Gilbert, J. C., 466 Gilks, W. R., 2834, 2836 Gilles, O., 112 Gillis, N., 2445 Gilpin, L., 3061 Gine, E., 2678 Gini, C., 2336 Girolami, M., 1638 Girosi, F., 160, 2634, 2639 Gladyshev, E. G., 583, 713, 720 Glivenko, V., 2674 Glorot, X., 2778, 2779 Glymour, C., 1678 Glymour, M., 1678 Gnedenko, B. V., 109 Godard, D. N., 2202 Godsil, C., 940 Godsill, S., 1401 Goffin, J.-L., 291, 496 Goh, V. S. L., 1085 Goldenstein, S. K., 2486 Goldfarb, D., 364, 421 Goldreich, O., 1735 Goldstein, A. A., 418, 424, 540 Goldstein, T., 833 Golub, G. H., 41, 43–45, 423, 466, 1556, 2198, 2242, 2245 Gomez, V., 2108 Goode, B., 584 Goodfellow, I., 584, 2778, 2886, 2961, 3095, 3096 Goodman, N., 160 Goodman, T. N. T., 255 Gordon, G. J., 713, 2108 Gordon, N., 1401 Gori, M., 2779 Gorodnitsky, I., 2445 Gorsuch, R. L., 2412 Gosavi, A., 1957 Gou, X., 293 Gower, R. M., 2201 Goyal, V., 109 Gradshteyn, I. S., 1347 Graff, C., 1085, 2378 Gram, J., 43 Granas, A., 360, 539

Grandvalet, Y., 2553 Grant, J. A., 2199 Graves, A., 3037 Gray, P. O., 1957 Gray, R. M., 2281 Greenberg. E., 1347 Greenside, P., 3061 Grenander, U., 256 Gretton, A., 2962 Grewal, M. S., 1201 Griffiths, T., 1514, 1515 Griffits, L. J., 584 Grimmer, J., 1467, 1514 Grimmett, G., 112, 119, 1677, 1794, 1796, 1803 Grimson, E., 1515 Grondman, I., 1957 Gross, J. L., 940 Grunwald, P., 1254 Gu, M., 2245 Gua, J., 2886 Gubin, L. G., 424 Guestrin, C., 3061 Guez, A., 2043 Guler, O., 362 Gunasekar, S., 2201, 2210 Gundelfinger, S., 65 Gunning, R. C., 2636 Gunzburger, M., 2281 Guo, G., 3095 Guo, R., 1678 Guo, Y., 2378 Gupta, A. K., 1116, 2205 Gupta, N. K., 1957 Gustafsson, F., 1203, 1401 Gustavsson, E., 628 Guttman, L., 42 Guyon, I., 2551, 2552 Gyorfi, L., 2280, 2673, 2674, 2683, 2688, 2695 Gürbüzbalaban, M., 715, 832, 833 Habrard, A., 2488 Hackbusch, W., 41 Hadsell, R. M., 2830 Haffner, P., 2778 Hager, W. W., 42 Hahn, G. J., 1116 Hald, A., 110 Hale, E. T., 362 Hall, M. A., 2281 Hall, T., 2197 Halmos, P. R., 41, 43, 189, 255, 2637, 2638 Halperin, I., 424 Hamill, T. M., 1203 Hammer, B., 2310


Hammersley, J. M., 1251, 1262, 1796 Han, S.-P., 424 Hand, D. J., 1309, 2353 Handschin, J. E., 1401 Hanke, M., 2242 Hansen, E. A., 1894 Hansen, J. H. L., 1309 Hansen, L. K., 2581 Hansen, M., 1254 Hansen, P. C., 2242 Hanson, R. J., 2198 Har-Peled, S., 2281 Hardle, W., 2378 Hardt, M., 3061 Hardy, G. H., 110, 292, 638 Harley, T., 2279 Harman, H. H., 2412 Harmeling, S., 3136 Harrell, F. E., 2486 Harremoës, P., 233, 293 Harris, R., 2581 Harshbarger, S., 540 Hart, P. E., 1113, 1309, 2280, 2284, 2378, 2380, 2513, 2551, 2552, 2778 Hartemink, A., 2486 Hartigan, J. A., 2281, 2836 Hartley, H. O., 1309 Hartman, C., 1552 Hassibi, B., 255, 584, 775, 1084, 1143, 1145, 1201–1203, 2042, 2198, 2778 Hastie, T., 364, 542, 1084, 1201, 1254, 1309, 1734, 1736, 2205, 2242, 2336, 2378, 2412, 2421, 2486, 2553, 2581, 2634, 2672, 2673 Hastings, W. K., 945, 1347, 2832 Haussler, D., 2489, 2641, 2673, 2678, 2830 Hawkins, T., 43 Hayes, B., 1551, 1552 Haykin, S., 584, 714, 2202, 2213, 2310, 2513, 2778 Haynsworth, E. V., 41 Hazan, E., 291, 496, 541, 627, 631, 752 He, S., 721 Heath, J. L., xxvii, 290 Heath, R. W., 2448 Hebb, D. O., 2514 Hebden, M. D., 2199 Hecht-Nielsen, R., 2778 Hefny, A., 806 Heidari, H., 3061 Held, M., 291, 496 Hellinger, E., 233 Hellman, M. E., 2280 Helou, E. S., 941 Hendeby, G., 1203 Henderson, H. V., 41, 42 Hensel, K., 41

2129

Herault, J., 1637 Herbrich, R., 2551, 2552, 2634, 2639, 2640 Hernan, M. A., 1678 Hernández-Lobato, J. M., 1378 Hero, A. O., 425, 941 Herskovits, E., 1734 Hertz, J. A., 1796 Heskes, T., 1378 Hestenes, M. R., 466 Hewitt, E., 591 Higham, N. J., 2198 Hilbe, J. M., 2486 Hild, H., 2779 Hildreth, C., 423 Hill, T. L., 111 Hille, E., 2638 Hills, M., 2553 Hinkley, D. V., 1143 Hinton, G., 627, 1467, 2778, 2831, 2886, 3037, 3095, 3136 Hiriart-Urruty, J.-B., 290, 322 Hjort, N. L., 1254 Hlawatsch, F., 1401 Ho, T. K., 2581 Ho, Y. C., 42, 2202 Hochreiter, S., 2778, 3037 Hodges, J. L., 2279 Hoeffding, W., 110, 2673 Hoerl, A., 2242, 2378 Hoerl, R. W., 2242 Hoff, M. E., xxx, 584 Hoff, P. D., 1113 Hoffman, M., 1467, 1515 Hogben, L., 41 Hogg, R. V., 1113, 1251 Holden, A. D. C., 2779 Holden, S. B., 2553 Hölder, O. L., 292 Holland, J. H., 1958 Hopcroft, J., 1734 Hopfield, J. J., 3037 Horn, R. A., 41, 45, 46, 50, 955, 1555 Hornik, K., 2778 Hosmer, D. W., 2486 Hospedales, T., 3136 Hostetler, L., 2280 Hotelling, H., 42, 2412 Hothorn, T., 2581 Householder, A. S., 42, 2198 Houtekamer, P. L., 1203 How, J. P., 1957, 2042, 2107 Howard, R. A., 1848, 1893, 1894, 1903, 1905, 1907, 1910 Hsieh, C. J., 423 Hsu, C.-W., 2486



Hsu, D., 2487 Hu, C., 775 Hu, P., 1401 Hu, Y. F., 586, 2200 Huang, F.-J., 2830 Huang, J. Z., 2412 Huang, K., 112, 2445 Huang, X., 2779 Huang, X. D., 1556, 1605 Hubbard, B. B., 65 Hubbard, J., 65 Hubbard, S., 2310 Huber, P. J., 358, 1309, 1638, 2446 Huberty, C. J., 2378 Hudson, H. M., 159 Huetter, J.-C., 2205 Hughes, G. F., 2672 Huijbregts, C. J., 160, 2634 Hulin, D., 234 Hunt, E. B., 2336 Huo, X., 2446, 2450 Hurvich, C. M., 1254 Hurwicz, L., 430, 1849 Huskey, H. D., 466 Hyvarinen, A., 1637, 1638 Iba, W., 2353 Ichikawa, A., 1894 Igel, C., 2830 Ilyas, A., 3095 Ince, E. L., 338 Indyk, P., 2201, 2280 Ingenhousz, J., 111 Ingersoll, J., 159 Intriligator, M. D., 941 Ioffe, S., 2779 Irofti, P., 2445 Ising, E., 1796 Iskander, D. R., 2581 Isotalo, J., 42 Jaakkola, T., 1467, 1958–1960, 1996, 2001, 2003, 2042, 2107, 2489, 2641, 3061 Jack, M. A., 1556, 1605 Jackel, L. D., 2581 Jackson, J. E., 2412 Jaggi, M., 540 Jain, A., 1796 Jaitly, N., 2831 Jakovetic, D., 996 James, G., 1084, 2672 Jarnik, V., 1734 Jarrett, K., 2778, 2781 Jaynes, E. T., 232, 2353 Jeffreys, H., 2353 Jelinek, F., 1556, 1605

Jenatton, R., 362 Jensen, F., 1677, 1795 Jensen, J., 292 Jewell, N. P., 1678 Jha, S., 3095 Jiang, H., 3061 Jin, C., 873, 874 Jin, H., 2201 Jin, R., 427, 719 Jin, X.-B., 720 Jirstrand, M., 628 Joachims, T., 2486 Johansen, A. M., 1401 Johansson, B., 941, 942 Johansson, K. H., 942 Johansson, M., 941 John, F., 323 Johns, M. V., 2279 Johnson, C. R., 41, 45, 46, 50, 955, 1555 Johnson, D. S., 1735 Johnson, N. L., 1116, 1347 Johnson, R., 806, 832 Johnson, R. A., 2380 Johnson, W., 2201, 2353 Johnstone, I. M., 358, 2412 Jolliffe, I. T., 2412 Jones, M. C., 1638 Jonsson, A., 2108 Jordan, C., 44 Jordan, M. I., 1309, 1467, 1515, 1734, 1794, 1795, 1958–1960, 1996, 2001, 2003, 2042, 2107, 2110, 2353, 2487, 2680, 2834, 3038 Journel, A. G., 160, 2634 Ju, L., 2281 Juditsky, A., 419, 497, 541, 583, 627, 719, 775 Julier, S. J., 1204 Jutten, C., 1637, 1638 Kabadi, S. N., 873 Kabkab, M., 3095 Kaczmarz, S., 432, 588 Kaelbling, L. P., 1894, 1957, 1958 Kahan, W., 44 Kahane, J.-P., 2686 Kahn, H., 1347 Kahng, S. W., 2199 Kailath, T., xxxi, 42, 255, 584, 1084, 1143, 1145, 1201–1203, 2042, 2202, 2513, 2640, 2776, 2778 Kak, A. C., 2412 Kakade, S., 2108, 2685 Kalai, A., 2336 Kalai, A. T., 425 Kale, S., 627, 631, 632 Kalman, R. E., xxx, 1201


Kalos, M. H., 1347 Kamp, Y., 2778 Kanal, L., 2279 Kandola, J., 2634 Kaniel, S., 466 Kannan, R., 118 Kappen, H. J., 1795 Kar, S., 942 Karhunen, J., 1638 Karhunen, K., 2413 Karush, W., 323 Kasiviswanathan, S. P., 2445 Kaski, S., 2310 Kass, R., 1515 Katzfuss, M., 1203 Katznelson, Y., 109, 2636 Kavukcuoglu, K., 2778 Kawaguchi, K., 873 Kay, S., 1113, 1116, 1143, 1201, 1251, 1252 Kearns, M., 2336, 2488, 2581, 2673 Kelley, C. T., 418, 422, 466 Kemperman, J. H. B., 235 Kendall, M., 1143 Kennard, R., 2378 Kennedy, R. A., 2638 Kenner, H., 1552 Kergosien, Y. L., 2779 Kesidis, G., 3095 Khan, U. A., 996 Khintchine, A., 2686 Khoshgoftaar, T. M., 2488 Kiebel, S. J., 1347 Kiefer, J., 425, 583, 713 Kifer, D., 630 Kim, B., 3061 Kim, J. H., 1795 Kim, S., 1401, 2486 Kimeldorf, G. S., 2639 Kingma, D. P., 627, 631, 632, 2961 Kingsbury, B., 2831 Kitagawa, G., 1254, 1401 Kittler, J., 2553 Kivinen, J., 2634, 2640 Kiwiel, K., 291, 496, 497, 752 Kjeldsen, T. H., 323 Kjellstrom, H., 1467 Klautau, A., 2486 Klein, R., 2280 Klopf, A. H., 1958 Kocay, W., 940 Koch, G., 2786, 3136 Kochenderfer, M. J., 1894 Koenigstein, N., 586 Koh, K., 2486 Kohavi, R., 1084, 2672 Kohler, J., 2779

2131

Kohonen, T., 2310 Koivunen, V., 1309 Koller, D., 1378, 1677, 1734, 1794, 1795, 2487 Kollerstrom, N., 420 Kolmogorov, A. N., xxx, 109, 112, 1143, 1201, 1202, 1552 Koltchinskii, V., 2679, 2708 Konda, V. R., 2108, 2113 Konecny, J., 628, 806 Kong, A., 1401, 1402 Kong, E. B., 1084, 2672 Koning, R. H., 41 Konishi, S., 1254 Kononenko, I., 2353 Konukoglu, E., 2581 Koopman, B., 189 Koren, Y., 586, 2200 Kosambi, D. D., 2413 Kotz, S., 1116, 1347 Koutroumbas, K., 1113, 2486, 2778 Kouw, W. M., 2488 Kovacevic, J., 109 Kovoor, N., 325 Kraft, L. G., 232 Kreher, D. L., 940 Kreutz–Delgado, K., 2445 Kreyszig, E., 2414, 2638 Krishnan, T., 1309 Krishnaprasad, P., 2446 Krishnapuram, B., 2486 Krizhevsky, A., xxxix, 2762, 2778, 2779, 2781, 2886 Kroese, D. P., 1347, 2778 Krogh, A., 1556 Kruse, R., 1794 Kruskal, J. B., 1638, 1734 Kschischang, F. R., 1794 Kucera, L., 1734 Kucukelbir, A., 1467 Kulkarni, S., 2673 Kullback, S., 233, 235 Kullis, B., 293 Kumamoto, H., 1402 Kumar, S., 627, 631, 632 Kumar, V., 2281 Kundaje, A., 3061 Kung, S. Y., 2412 Kungurtsev, V., 998 Künsch, H. R., 1401 Kurakin, A., 3095 Kushner, H. J., 583, 713 Kuth, D., 110 Kwok, J. T., 775 Kwok, K., 3095



Lachenbruch, P., 2553 Lacoste-Julien, S., 806, 832 Lacoume, J. L., 1085 Ladner, R., 2487 Lafferty, J., 1515 Lafontaine, J., 234 Lagoudakis, M. G., 1849, 2043 Laheld, B. H., 1638 Lai, T. L., 582, 1999 Laird, N. M., 1309 Lakshmivarahan, S., 1203 Lam, W., 1734 Lampert, C. H., 3136 Lan, G., 627, 775 Lancaster, P., 1202 Landau, L. D., 111 Landecker, W., 3061 Landgrebe, D., 2336 Lange, J., 2336 Lange, K., 423, 2110 Langford, J., 2108 Langley, P., 2353 Langville, A. N., 1551 Lanza, A., 2248 Laplace, P. S., xxix, 110, 1083, 1347, 2352 Larkin, F. M., 2639, 2640 Larochelle, H., 3136 Larranaga, P., 112 Larson, J., 425 Larson, R., 337 Larson, S., 2553 Lattimore, T., 1997 Lau, K. W., 2310 Laub, A. J., 41 Lauritzen, S., 1378, 1794, 1795 Lavielle, M., 2836 Law, A. M., 1894, 1910 Lawler, G. F., 111 Lawrence, D. A., 1202 Lawson, C. L., 2198 Lax, P., 41 Lay, D., 41, 45, 50 Lay, S., 41, 45, 50 Le, Q., 2779, 3037 Lebarbier, E., 2836 Lebedev, N. N., 1253 Lebret, H., 2245 LeCun, Y., xxxix, 418, 775, 2581, 2778, 2779, 2830, 2831, 2886 Ledoux, M., 110, 123, 126, 2685 L’Ecuyer, P., 2108 Lee, D., 2445 Lee, H., 2336 Lee, J. D., 873 Lee, J. M., 234 Lee, M., 1957, 2108, 2445

Lee, P. M., 1113 Lee, S., 943, 1637, 1734 Lee, T. W., 1638 Lee, Y., 1201, 2486, 2551, 2552, 3136 Legendre, A. M., 2197 Legendre, A. M., 1083, 1201 Lehmann, E. L., 188, 189, 1116, 1251 Lehr, M. A., 2778 Lei, Y., 541 Leibler, R. A., 233 Leiserson, C. E., 1735 Leissa, A. W., 45 Lemaire, B., 358 Lemaréchal, C., 290, 322, 418 Lemeshow, S., 2486 Lemke, C., 3136 Lenz, W., 1796 Leon-Garcia, A., 109, 255, 2960 Lerner, A., 2551, 2552 Le Roux, N., 418, 775, 806, 809, 2830 Leshno, M., 2778 Lessard, L., 996 Levin, D. A., 1551, 1552 Levin, E., 2778 Levina, E., 2378 Levine, S., 1997, 2110, 3136 Levinson, N., xxx, 338 Levitin, E. S., 540 Levy, B. C., 1116 Levy, S., 2243 Lewin, J. W., 591 Lewin, K., xxvii Li, A., 1467 Li, B., 2886 Li, M., 2336 Li, N., 996, 998 Li, Y., 873, 1378 Li, Z., 996, 999, 3136 Lian, X., 998 Liang, D., 2960 Liang, F., 1347 Liang, H., 2248 Liang, J., 806 Liberty, E., 2201 Lidstone, G. J., 2353 Liebling, T. M., 2280 Liese, F., 233 Lifshitz, E. M., 111 Limic, V., 111 Lin, C. J., 423, 2486 Lin, H.-T., 2686 Lin, L.-J., 1996 Lin, V. Y., 2778 Lin, Y., 500, 541, 2486, 2551, 2552


Lin, Z., 873, 1997 Lindeberg, J. W., 158 Lindenbaum, M., 2487 Lindenstrauss, J., 2201 Lindsay, R. B., 45 Ling, Q., 996, 1002 Linhart, H., 1254 Link, D., 1551 Linsker, R., 2778 Linvill, W. K., 1202 Lions, P., 362 Lipschitz, R., 338 Lipster, R., 721 Lipton, Z. C., 3037 Littlewood, J. E., 110, 292, 638 Littman, M. L., 1894, 1957, 1958, 2108 Liu, C., 1347, 1734 Liu, J. S., 1401, 1402, 2834, 2836 Liu, S., 41, 425 Liu, W., 1894 Liu, Z., 2779 Ljung, L., 583, 713, 2202 Lloyd, S. P., 2281 Loève, M., 2413 Loeliger, H. A., 1794 Logan, D. L., 1849 Lojasiewicz, S., 290 Loog, M., 2488 Loomis, L. H., 2636 Lopes, C. G., 941, 943 Lopes, G. A. D., 1957 Lorenzo, P. D., 996, 998 Lowd, D., 3095 Lozano, J. A., 112 Lu, L., 110, 123, 126 Lu, T., 2043, 2107 Luce, R. D., 1996 Luenberger, D. G., 323, 418, 466, 2517, 2637 Lugosi, G., 110, 123, 585, 2280, 2673, 2674, 2679, 2683, 2688, 2695, 2701 Lukacs, E., 109 Luntz, A., 2553 Luo, W., 2446 Luo, Z., 360, 423, 540, 2634 Luroth, J., 65 Luthey-Schulten, 112 Luxemburg, W., 592 Luz, K., 2487 Lyapunov, A. M., xxix, 158 Lynch, K. M., 996 Lynch, S. M., 2834, 2836 Ma, S., 364, 2205 Ma, T., 873 Ma, W.-K., 2445 MacDonald, I. L., 1556, 1605

2133

MacDuffee, C. C., 41 MacKay, D. J. C., 232, 1467, 1795, 2487, 2836 Mackenzie, D., 1678 Maclin, R., 2581 MacQueen, J. B., 2281 Macready, W. G., 2675 Macua, S. V., 2043 Madry, A., 3095 Maei, H. R., 1849, 2043 Magdon-Ismail, M., 2686 Magnello, M. E., 1309 Mahoney, M. W., 118, 2201 Maimon, O., 2336 Mairal, J., 362, 2445 Maitra, A., 189 Maitra, S., 1957 Makov, U., 1309 Malach, E., 2336 Mallat, S., 2445, 2446 Mammone, R. J., 3136 Mandal, S., 2205 Mandel, J., 1203 Mandic, D. P., 1085, 3037 Mandt, S., 1467 Mangasarian, O. L., xxxix, 2310, 2518, 2551, 2552 Mannor, S., 2778 Manoel, A., 1378 Mansour, Y., 2108, 2336, 2488 Mantey, P. E., 584 Mao, X., 996 Marano, S., 943 Margalit, T., 541 Marin, J., 2336 Marin, J. M., 1467 Markov, A. A., xxxi, 1145, 1551, 1552 Markovitch, S., 2487 Maron, M. E., 2353 Marroquin, J. L., 2280 Marschak, J., 2281 Marshall, A. W., 1347 Martens, J., 234 Marti, K., 583, 713 Martinetz, T., 2310 Martinez, A. M., 2412 Martinez, S., 996 Masani, P., 1202 Mason, L., 2581 Massart, P., 110, 123, 2678, 2701 Mateos, G., 996 Matta, V., 943, 1734 Mattout, J., 1347 Mattson, R. L., 584 Maybeck, P. S., 1201, 1203, 1378



Mayne, D. Q., 1401 Mazumder, R., 364, 542, 1734 McAllester, D., 2108 McAuliffe, J. D., 1467, 1514, 2680 McClave, J. T., 2199 McCormick, G. P., 540 McCoy, J. F., 2243 McCullagh, P., 188, 2486, 2492 McCulloch, W., 2778 McDaniel, P., 3095 McDiarmid, C., 110, 123, 126, 2678 McDonald, J., 41, 45, 50 McEliece, R. J., 1795 McIrvine, E. C., 231 McKean, J., 1113, 1251 McLachlan, G. J., 1309, 2378, 2553 McLeish, D. L., 2638 McMahan, H. B., 425, 628 McMillan, B., 232 McQuarrie A. D. R., 1254 McSherry, F., 630 Meek, C., 3095 Mehotra, K., 2310 Mei, S., 2205 Meila, M., 1734 Meinshausen N., 1734 Melis, D. A., 3061 Mellendorf, S., 45 Melo, F. S., 2108 Mendel, J. M., 583, 713 Mendelson, S., 2679, 2685, 2708 Mendenhall, W., 2199 Meng, D., 3095 Mengersen, K., 1309 Menickelly, M., 425 Mercer, J., xxxi, 2635 Mercer, R., 1556, 1605 Mercier, B., 360, 362 Mertikopoulos, P., 775 Mesbahi, M., 942, 1734 Métivier, M., 583, 713 Metropolis, N., 1347, 1401, 2832 Meuleau, N., 1894 Meyer, C. D., 41, 45, 46 Meyn, S. P., 2108 Mezard, M., 1796 Mian, A., 3095 Michel, A.N., 1202 Michie, D., 1957 Mickey, M., 2553 Middleton, D., 1113 Miikkulainen, R., 2310 Mika, S., 2378, 2412, 2640, 2643 Mikusinski, P., 2638 Milgate, M., 292 Milgram, M., 2778

Miller, D. J., 3095 Miller, K., 160 Miller, R. G., 1113 Minc, H., 46 Mine, H., 362 Minka, T., 233, 1378, 1515 Minkowski, H., 2516 Minnick, R. C., 2551, 2552 Minsky, M., 1957, 2513, 2779 Minty, G. J., 360, 539 Mirza, M., 2961 Misak, C., 1256 Mishra, N., 3136 Mitchell, H. L., 1203 Mitchell, T., 2336, 3136 Mitchison, G., 1556 Mitra, S., 41 Mitter, S., 873 Mnih, V., 2043 Moens, M.-F., 3037 Mohamed, A., 2831 Mohamed, S., 2961 Mohan, C. K., 2310 Mohri, M., 110, 123, 126, 2488, 2679, 2681, 2685, 2686, 2701, 2702 Mokhtari, A., 996 Monfardini, G., 2779 Monro, S., xxx, 582, 713 Montúfar, G., 2201 Montanari, A., 2205 Montavon, G., 3061 Monteleoni, C., 630, 2487 Mooij, J. M., 1795 Moon, T. K., 1309 Moore, A. W., 1957, 1958 Moore, B. C., 2412 Moore, E. H., 2638 Moore, J. B., 1201, 1556, 1605, 2202 Moosavi-Dezfooli, S. M., 3095 Moré, J. J., 466 Moreau, J. J., 292, 359, 496 Morgan, N., 2778 Morimoto, T., 233 Moritz, P., 2110 Morris, E. R., 3037 Morrison, W. J., 42 Morters, P., 111 Moskowitz, M., 65 Mosteller, F., 2553 Mota, J., 996, 2253 Motwani, R., 1551, 2201 Motzkin, T., 496, 2519 Moulines, E., 585, 775, 1401, 1556, 1605 Moura, J. M. F., 942 Moussouris, J., 1796 Mozer, M. C., 2528

Mudholkar, G. S., 2412 Muehllehner, G., 2445 Muir, F., 2243 Mulaik, S. A., 2412 Mulder, W. D., 3037 Mulier, F. M., 2673 Muller, K. R., 2378, 2412, 2640, 3061 Muma, M., 1309 Munkhdalai, T., 3136 Muroga, S., 2528 Murphy, K., 1378, 1795, 2491 Murty, K. G., 873 Murty, U., 940 Muthukrishnan, S., 2201 Nachum, O., 326, 2108, 2112, 2117 Nadarajah, S., 1116 Nadeau, C., 2553 Nagaoka, H., 233 Nagar, D. K., 2205 Nagy, B. S., 2636 Nagy, G., 2336 Naik, D. K., 3136 Nakajima, S., 1467 Narasimha, R., 2413 Narici, L., 2414, 2637 Naryanan, A., 1515 Nascimento, V. H., 2244, 2249, 2250 Nash, S. G., 418 Nassif, R., 775, 943 Natanson, I. P., 591 Naumov, V. A., 1551 Navarro, G., 2280 Naylor, A. W., 2414 Neal, R., 160, 1467, 2634, 2780 Neapolitan, R. E., 1677 Neath, A. A., 1254, 1271 Nedic, A., 291, 496, 497, 675, 752, 941–943, 996, 998 Needell, D., 716 Nelder, J. A., 188, 2486, 2492 Nemeh, B., 586 Nemirovski, A. S., 291, 496, 497, 541, 674, 713, 752, 775 Nesterov, Y., 290, 291, 322, 338, 339, 360, 419, 423, 425, 427, 432, 496, 497, 540, 752, 775, 873 Neu, G., 2108 Neubauer, A., 2242 Neudecker, H., 41 Neuhoff, D. L., 2281 Neumaier, A., 2242 Newman, M. E. J., 1347 Newman, P., 292 Newton, I., 420

Ney, H., 2778 Neyman, J., 189, 1116 Neyshabur, B., 2201 Ng, A. Y., 1515, 2353, 2486 Nguyen, A., 3095 Nguyen, P., 2831 Niblett, T., 2353 Nica, B., 940 Nichol, A., 3136 Nicholson, W. K., 43, 45, 50 Nickisch, H., 1378, 3136 Niculescu-Mizil, A., 2353 Nielsen, F., 189 Niemitalo, O., 2961 Nilsson, A., 628 Nilsson, N., 2279, 2778 Niranjan, M., 1996 Niss, M., 1796 Nissim, K., 630 Nocedal, J., 422, 466, 468, 713, 718, 876 Noether, M., 65 Norris, J. R., 591, 1551, 1552 Norvig, P., 2353 Novikoff, A. B. J., 2513, 2516 Nowak, R., 360, 540, 941 Nutini, J., 298, 423, 429 Nychka, D., 1203 Nyquist, H., xxx Obermayer, K., 2310 Obozinski, G., 362 O’Connor, D., 362 Odena, A., 2961 O’Donoghue, B., 1997 Ogilvie, M., xxxiv O’Hagan, A., 160, 2634 Oja, E., 1637, 1638, 2310, 2412, 2415, 2515 Okabe, A., 2280 Okamoto, M., 110, 2673 Oldenburg, D., 2243 O’Leary, D. P., 466 Olesen, K. G., 1795 Olfati-Saber, R., 942 Ollila, E., 1309 Olshausen, B. A., 2445 Olshen, R. A., 2336, 2553 Olshevsky, A., 996, 998 Omura, J. K., 1605 Ondar, K. O., 112 Onsager, L., 1796 Onuchic, J. N., 112 Opitz, D., 2581 Oppenheim, A. V., 109 Opper, M., 1378, 1467 Ormerod, J. T., 1467 O’Rourke, J., 1552

Ortega, J. M., 423 Ortiz-Boyer, D., 2486 Ortiz-Jimenez, G., 3095 Osband, I., 1997 Osborne, M. R., 2199 Osher, S., 499, 541, 2486 Osindero, S., 2778, 2961 Ostrovsky, R., 2281 O’Sullivan, F., 2639 Ott, L., 160 Owen, B. A., 1347 Owen, D., 159, 161, 1347 Ozair, S., 2961 Ozdaglar, A., 291, 496, 497, 675, 715, 752, 832, 833, 942 Ozmehmet, K., 2378 Paarmann, L. D., 257 Paatero, P., 2445 Page, L., 1551 Paisley, J., 1467 Pakman, A., 2108 Palatucci, M., 3136 Paleologu, C., 627 Paley, R., 257 Paliogiannis, F., 65 Paliwal, K. K., 3037 Palomar, D., 996 Pan, R., 586 Pan, W., 775 Panchenko, D., 2679, 2708 Papadimitriou, C. H., 1735, 1894 Papernot, N., 3095 Papert, S., 2513, 2779 Papoulis, A., 109, 110, 255, 257, 1552, 2960 Pardalos, P. M., 325 Parikh, N., 358, 360, 540 Parisi, G., 1467, 1796 Park, C. H., 2378 Park, H., 1637, 2378 Parlett, B. N., 45 Parr, R., 1849, 2043 Parra, L. C., 1638 Parrilo, P., 715, 832, 833 Parzen, E., xxxi, 2640 Passty, G., 362 Patel, J. K., 159, 161, 1347 Paterek, 586 Pathria, R. K., 111 Pati, Y., 2446 Paul, D. B., 1556, 1605 Pazzani, M., 2353 Pearl, J., 109, 1677, 1678, 1734 Pearlmutter, B. A., 1638 Pearson, E., 1116 Pearson, J. B., 2243

Pearson, K., xxxi, 111, 1253, 1309, 2412 Pechyony, D., 2336 Pedersen, M. S., 65 Peel, D., 1309 Penny, W., 1347 Peres, Y., 111, 1551, 1552 Peretto, P., 2513, 2528 Pericchi, L., 1467 Perona, P., 1515, 3136 Perron, O., 1556 Pesquet, J.-C., 358, 360, 362, 432, 540 Peters, J., 2108 Petersen, K. B., 65 Petersen, K. E., 255, 257 Peterson, C., 1467 Peterson, D. W., 2280 Petrie, T., 1309, 1556, 1605 Pettis, B. J., 2517 Phillips, D. L., 2242 Phillips, R., 112 Piana, M., 2678 Picard, R., 2553 Picinbono, B., 109, 160, 1085 Pilaszy, I., 586, 2200 Pillai, S. U., 46, 1555 Pinkus, A., 2778 Pinsker, M. S., 235 Pitman, E., 189 Pitt, M., 1401 Pitts, W., 2778 Plackett, R. L., 42, 1083, 1145, 2197, 2198 Platt, J. C., 2486 Pleiss, G., 3061 Plemmons, R. J., 46, 1555, 2445 Póczos, B., 806 Poggio, T., 160, 2634, 2639 Polak, E., 466, 540 Polikar, R., 2581 Pollaczek-Geiringer, H., 582 Pollard, D., 2674 Pollard, H., 338 Polson, N. G., 360 Pólya, G., 110, 158, 292, 638 Polyak, B. T., 290, 291, 322, 337, 339, 360, 418, 419, 426, 496, 497, 501, 540, 583, 627, 674, 713, 719, 720, 752, 873, 1851 Pomerleau, D., 3136 Ponce, J., 2445 Poole, D., 1733 Poole, G. D., 2445 Poon, C., 806 Poor, H. V., 1116 Porter, R., xxxiv Pouget-Abadie, J., 2961 Poulin, B., 3061

Pournin, L., 2280 Powell, M. J. D., 431, 466, 713 Pratt, J. W., 1113, 1251 Pratt, L., 3136 Precup, D., 1849, 2043 Price, E., 3061 Price, R., xxix, 110, 162 Principe, J., 2310, 2778 Priouret, P., 583, 713 Pritchard, J. K., 1515 Pritzel, A., 1997 Proakis, J. G., 1309 Prokhorov, Y. V., 112 Pugachev, V. S., 1083 Pukelsheim, F., 41 Puntanen, S., 41, 42 Puschel, M., 996, 2253 Pushkin, A. S., 1552 Puterman, M. L., 1894, 1910 Qian, G., 2836 Qian, J., 2310 Qu, G., 996, 998 Quenouille, M., 2553 Quinn, K., 1467 Qunilan, J. R., 2336 Rabbat, M. G., 941, 943, 1734 Rabi, M., 941 Rabiner, L. R., 1309, 1556, 1605 Rachford, H. H., 362 Radford, A. K., 2779 Radon, J., 2681 Raik, E. V., 424 Rakhlin, A., 2205 Ram, S. S., 497, 675, 752, 943 Ramage, D., 628 Ramanan, K., 942 Ramavajjala, V., 2779 Ramsey, F., 1256 Ranka, S., 2310 Rantzer, A., 942 Ranzato, M., 2778, 2830 Rao, B., 2445 Rao, C. R., 188, 1143, 1251, 1252, 2412 Rao, M., 293 Raphson, J., 420 Rasch, M., 2962 Rasmussen, C. E., 159–161, 2634, 2646 Ratsch, G., 2378, 2412, 2640 Ravi, S., 2779, 3136 Ravikumar, P., 497, 585 Rawat, W., 2886 Rawlik, K., 2108, 2117 Raydan, M., 424 Rayleigh, J. W. S., 45

Raymond, J., 1378 Re, C., 715, 832 Read, C. B., 159, 161, 1347 Recht, B., 715, 832 Reczko, M., 3037 Reddi, S. J., 627, 631, 632, 806, 873 Reddy, J. N., 1849 Redko, I., 2488 Redner, R., 1309 Reed, M., 2636 Reeves, C. M., 466 Regalia, P. A., 41 Reich, S., 360, 539 Reinsch, C., 44 Reneau, D. M., 1259 Renyi, A., 233 Rezaiifar, R., 2446 Rezende, D. J., 2961 Rheinboldt, W., 423 Ribeiro, A., 996 Ribeiro, M. I., 2108 Ribeiro, M. T., 3061 Ribière, G., 466 Riccati, J. F., 1202 Richard, C., 943 Richardson, S., 2834, 2836 Richtárik, P., 423, 628, 2201 Riesz, F., 2636 Rifkin, R., 2486 Rigollet, P., 2205 Rios, L. M., 425 Ripley, B. D., 160, 2634 Rissanen, J., 1254 Ritter, H., 2310 Ritz, W., 45 Rivest, R. L., 1735 Rizk, E., 628, 875 Robbins, H., xxx, 582, 713, 1999 Robert, C., 1113, 1309, 1347 Roberts, S. J., 1467 Robins, J. M., 1678 Robinson, D. P., 873 Rocha, A., 2486 Rockafellar, R. T., 290–292, 322, 323, 358, 360, 362, 496, 539, 594, 752 Rodman, L., 1202 Rogers, L. C. G., 111 Rojas, R., 2581 Rokach, L., 2336, 2581 Roli, F., 3095 Romano, J. P., 1116 Romberg, J. K., 50, 2446 Romera-Paredes, B., 3136 Ronning, G., 1515 Root, W. L., 1113 Rosasco, L., 775, 2242, 2678

Rosen, K., 1734 Rosenberg, C. R., 2486 Rosenberger, G., 43 Rosenblatt, F., xxx, 2513, 2514 Rosenbluth, A. W., 1401, 2832 Rosenbluth, M. N., 1401, 2832 Ross, S., 1551, 1552, 1894, 1910 Rossin, D. F., 1347 Rostamizadeh, A., 110, 123, 126, 2488, 2679, 2681, 2685, 2686, 2701, 2702 Roth, A., 630 Roth, D., 2353 Roth, M., 1203 Rouzaire-Dubois, B., 112 Roy, D. M., 2962 Roy, N., 1957, 2042, 2107 Roy, S., 627 Roychowdhury, V. P., 2378, 2513, 2776 Royden, H. L., 590 Royle, G. F., 940 Rozanov, Y. A., 1796 Rozoner, L. I., 2634 Rozovskii, B., 1401 Rubin, D., 1309, 1515 Rubinov, A. M., 540 Rubinstein, R. Y., 1347, 2445, 2778 Rudin, C., 3061 Rudin, L. I., 499, 541 Rudin, W., 590, 2636 Rugh, W. J., 1202 Rumelhart, D. E., 2528, 2778, 3037 Rummery, G. A., 1996 Rupp, M., 584 Ruppert, D., 497, 583, 719 Rusakov, D., 2487 Russakovsky, O., 2779, 2886 Ruszczynski, A., 323 Ruzzo, W. L., 2412 Ryden, T., 1401, 1556, 1605 Ryu, E. K., 1000 Ryzhik, I. M., 1347 Saad, D., 1467 Sabin, M. J., 2281 Sackett, D., 1677, 1794 Sadeghi, P., 2638 Sadri, B., 2281 Safavian, S. R., 2336 Sahinidis, N. V., 425 Sahlin, N.-E., 1256 Sainath, T. N., 2831 Saitoh, S., 2638 Sajda, P., 2486 Sakurai, K., 3095 Salakhutdinov, R., 2778, 2779, 2831, 3136 Salamon, P., 2581 Saligrama, V., 293

Salimans, T., 2961 Salmond, D., 1401 Samadi, M., 873 Samangouei, P., 3095 Samaniego, F. J., 1259 Samanta, T., 1254, 1271 Samek, W., 3061 Samuel, A. L., 1958, 2042, 2108 Santana, R., 112 Santoro, A., 3136 Santos, A., 1734 Santosa, F., 2243 Santurkar, S., 2779 Sapiro, G., 2445 Sarabia, J. M., 159 Sarlós, T., 2201 Sarvarayndu, G. P. R., 2336 Sarwate, A. D., 630 Sauer, K., 423 Sauer, N., 2673, 2674, 2690 Saul, L., 1467 Saunders, M. A., 2243, 2445 Saunderson, N., 110 Savage, L. J., 189, 1113, 1251, 1259 Sawaki, K., 1894 Sayed, A. H., 42, 45, 46, 48, 159, 255, 291, 418, 423, 430, 496, 497, 583–585, 627, 628, 631, 674, 676, 677, 679, 713–717, 750, 754, 775, 806, 832, 833, 872–876, 878, 941–943, 945, 946, 949, 954, 996, 998–1000, 1083, 1084, 1143–1145, 1201–1203, 1401, 1734, 1851, 1997, 2042, 2043, 2108, 2197, 2198, 2202, 2213, 2244, 2249, 2250, 2253, 2445, 2581, 2778 Scaglione, A., 943 Scarselli, F., 2779 Schaal, S., 2108 Schafer, R., 109 Schaffer, C., 2675 Schapire, R. E., 2581, 2640 Scharf, L. L., 1085, 1113, 1201, 1251, 1252 Scheffer, T., 2488, 2491 Schein, A. I., 2487 Scheinberg, K., 425 Scheines, R., 1678 Scheuer, W. T., 2243 Schlafli, L., 2523 Schmetterer, L., 583 Schmidhuber, J., 2778, 2961, 3037, 3040, 3136 Schmidt, E., 43, 44 Schmidt, M., 418, 540, 775, 806, 809 Schmidt, S. F., 1203 Schneider, J., 2108 Schocken, S., 2778 Schoenberg, I. J., 496, 2519

Schoenlieb, C., 806 Schoknecht, R., 2108 Scholkopf, B., 2378, 2412, 2551, 2552, 2634, 2639, 2640, 2962 Scholz, M., 2412 Schraudolph, N. N., 3037, 3040 Schreiber, R., 586 Schreier, P. J., 1085 Schulman, J., 2108, 2110, 3136 Schulten, K., 2310 Schur, I., 42 Schuster, M., 3037 Schutter, B., 2042 Schuurmans, D., 2043, 2107 Schwartz, L., 2635 Schwarz, G. E., 1254 Schweitzer, P., 1849 Schwenk, H., 2778 Schweppe, F. C., 1203 Scott, J. G., 360 Scutari, G., 996, 998 Seaman, K., 996 Searle, S. R., 41, 42 Sebban, M., 2488 Sebestyen, G., 2279, 2281 Seeger, M., 1378 Sefl, O., 584 Seidel, L., 423 Seidman, A., 1849 Sejnowski, T., 1638, 2042, 2486 Sell, G., 1556, 1605, 2414 Selvaraju, R. R., 3061 Seneta, E., 46, 112, 1551 Sengupta, B., 2961 Senior, A., 2831 Senne, K. D., 1204 Serfling, R. J., 123 Sethares, W. A., 627 Sethi, I. K., 2336 Settles, B., 2487 Seung, H., 2445 Shafer, G., 109, 1795 Shah, A. K., 1347 Shah, D., 2280 Shakarchi, R., 109 Shakhnarovich, G., 2280 Shalev-Shwartz, S., 291, 298, 325, 338, 496, 585, 713, 752, 775, 806, 2336, 2583, 2634, 2675, 2679, 2685, 2686, 2701, 2702 Shamir, O., 291, 496, 715, 752, 775, 832, 833 Shang, F., 806 Shanno, D. F., 421 Shannon, C. E., xxx, 231, 1552, 1996, 2042, 2336 Shapire, R., 2486 Shapiro, S. S., 1116

Shapley, L. S., 1894 Sharma, R., 627 Shawe-Taylor, J., 2551, 2552, 2634, 2640 Shcherbina, A., 3061 She, Q., 2961 Sheela, B. V., 2581 Shelah, S., 2673, 2690 Shemeneva, V. V., 2108 Shen, H., 2412 Shen, Z., 364 Shenoy, P. P., 1795 Shenton, L. R., 1253 Shephard, N., 1401 Sherman, J., 42 Sherstinsky, A., 3037 Sheskin, T. J., 1551, 1552, 1894 Shewchuk, J. R., 466 Sheynin, O. B., 1083 Shi, H.-J. M., 423 Shi, J., 2486 Shi, L., 1515 Shi, W., 996, 998, 999, 1002 Shimelevich, L. I., 1402 Shimodaira, H., 2488 Shin, M. C., 1894 Shiryayev, A. N., 110, 721 Shlens, J., 3095, 3096 Shor, N. Z., 291, 496, 497, 752 Shotton, J., 2581 Shreve, S., 1894 Shrikumar, A., 3061 Shwartz, A., 1848, 1894 Shynk, J. J., 627 Siahkamari, A., 293 Sibony, M., 360 Sibson, R., 1638 Sidiropoulos, N. D., 2445 Siemon, H. P., 2310 Silver, D., 1849, 1958, 2043 Silvey, S. D., 233 Simar, L., 2378 Simard, P., 2581, 3037 Simon, B., 2636 Simonyan, K., 3061, 3095 Sinai, Y. G., 257 Sincich, T. T., 2199 Singer, Y., 291, 325, 362, 496, 627, 631, 752, 2486, 2581, 2634 Singh, R. P., 41 Singh, S., 1957–1960, 2001, 2003, 2107, 2108, 3061 Singh, S. P., 1996, 2001, 2042 Singleton, R. C., 2551, 2552 Siniscalco, M., 1309 Siu, K.-Y., 2513, 2776 Slater, M., 323

Slavakis, K., 2634, 2640 Slivkins, A., 1997 Slutsky, E., 116 Slymaker, F., 2279 Smale, S., 2678 Small, C. G., 2638 Smallwood, R. D., 1894, 1909 Smith, A., 630, 1113, 1309, 1347, 1401, 2834 Smith, C. A., 1309 Smith, D., 2279 Smith, F. W., 2518, 2551, 2552 Smith, H., 2199 Smith, J. E., 1264 Smith, R. L., 188, 189, 1113, 1116 Smith, S., 628 Smola, A. J., 806, 2412, 2551, 2552, 2634, 2639, 2640, 2962 Smolensky, P., 2528, 2831 Smoluchowski, M., 111 Snell, J., 3136, 3144 Snow, J., 2280 Söderström, T., 2202 Sofer, A., 418 Soh, Y. C., 998 Solla, S. A., 2778 Sondik, E. J., 1894, 1909 Sorenson, H. W., 1201 Sorenson, H. W., 1204, 2202 Soudry, D., 2201, 2554 Soules, G., 1309, 1556, 1605 Soumith, C., 2961 Southwell, R. V., 423 Spall, J. C., 583, 713 Speed, T. P., 2378 Speekenbrink, M., 1401 Spencer, J. H., 110, 123, 126 Speranzon, A., 942 Spiegelhalter, D. J., 1795, 2834, 2836 Spirling, A., 1467 Spirtes, P., 1678 Spokoiny, V., 425, 432 Springenberg, J. T., 2961 Sra, S., 806 Srebro, N., 291, 496, 716, 2201, 3061 Sridharan, K., 2685 Srivastava, K., 497, 675, 752, 943 Srivastava, N., 2778, 2779 Stadie, B. C., 1997 Stankovic, M. S., 943 Stankovic, S. S., 943 Stark, H., 109 Staudemeyer, R. C., 3037 Staudigl, M., 775 Stearns, S. D., 584, 714 Steffey,D., 1515 Stegun, I., 292, 1253, 1347

Stein, C., 159, 1735 Stein, E. M., 109 Steinbach, M., 2281 Steinbrecher, M., 1794 Steinhaus, H., 2281 Steinwart, I., 2551, 2552 Stensrud, D., 1203 Stephens, M., 1515 Stern, H., 1515 Stevens, M. A., 2336 Stewart, G. W., 44, 45, 1083, 2198 Stewart, W. J., 1551 Steyvers, M., 1514 Stiefel, E., 466 Stigler, S. M., 110, 158, 1083, 1309, 2197 Stinchcombe, M., 2778 Stipanovic, D. S., 943 Stirzaker, D., 112, 119 Stone, C., 2336, 2553, 2674 Stone, J. V., 1638 Stone, M., 2553 Stork, D. G., 1113, 1309, 2280, 2378, 2778 Strang, G., 41, 45, 50 Stratonovich, R. L., 1556, 1605 Street, W. N., xxxix, 2310 Strobach, P., 2202 Strohmer, T., 588, 2448 Stromberg, K., 591 Stroud, J. R., 1203 Stuart, A., 1143 Stuetzle, W., 2446, 2581 Styan, G. P. H., 41, 42 Su, J., 3095 Suddarth, S. C., 2779 Sudderth, E. B., 1515 Suel, T., 46, 1555 Sugihara, K., 2280 Sugiyama, M., 1467 Sun, J., 45, 466, 720 Sun, Y., 996 Sundararajan, A., 996 Sundberg, R., 1309 Sung, F., 2786, 3136 Sutherland, W., 111 Sutskever, I., 627, 2778, 2779, 2886, 3037, 3095 Sutton, R. S., 1849, 1894, 1903, 1905, 1907, 1957, 1958, 1996, 2001, 2043, 2044, 2108, 2113 Svensén, M., 2310 Svetnik, V. B., 1402 Swami, A., 3095 Swerling, P., 1204 Swersky, K., 2962, 3136, 3144

Switzer, P., 1638 Symeonidis, P., 586 Symes, W. W., 2243 Sysoyev, V. I., 2108 Szegö, G., 256 Szegedy, C., 2779, 3095, 3096 Szepesvari, C., 1849, 1957, 1997, 2043 Tadić, V., 2042 Takác, M., 423 Takahashi, N., 45 Takahashi, T., 2445 Takayama, A., 323 Talagrand, M., 2685 Talwalkar, A., 110, 123, 126, 2679, 2681, 2685, 2686, 2701, 2702 Tan, L.-Y., 2336 Tan, P.-N., 2281 Tan, V. V. F., 1734 Tang, H., 998 Tanik, Y., 627 Tankard, J. W., 111, 1309, 2412 Tanner, R. M., 1795 Tao, T., 50, 2446 Tapia , R. A., 540 Tapper, U., 2445 Tatarenko, T., 998 Tauvel, C., 775 Taylor, H. L., 2243 Teboulle, M., 360, 362, 540, 541 Teh, Y.-W., 2778 Teller, A. H., 2832 Teller, E., 2832 Tellex, S., 1957, 2042, 2107 Temme, N. M., 1253 Tenenbaum, M., 338 Tesauro, G., 1958 Tetruashvili, L., 423 Tewari, A., 2685 Thakurta, A., 630 Thaler, R. H., 1264 Thanou, D., 1734 Theodoridis, S., 585, 1113, 1201, 1677, 2486, 2513, 2634, 2640, 2778 Thiran, P., 942 Thireou, T., 3037 Thomas, J. A., 232 Thomas, L. B., 2445 Thomas-Agnan, C., 2635 Thompson, K., 2353 Thompson, W. R., 1997 Thorndike, E., 1957 Thrun, S., 3136 Thuente, D. J., 466 Tiao, G. C., 1113, 1251, 1346

Tibshirani, R., 364, 500, 541, 542, 1084, 1201, 1254, 1309, 1734, 2242, 2243, 2336, 2378, 2412, 2421, 2486, 2553, 2581, 2634, 2672, 2673, 3061 Tieleman, T., 627 Tikhonov, A. N., 2242 Tikk, D., 586, 2200 Tipping, M. E., 2412, 2416, 2417 Tishby, N., 2108 Titterington, D., 1309 Tolman, R. C., 111 Tomczak-Jaegermann, N., 2686 Tomioka, R., 2201, 2445 Tong, L., 1734 Tong, S., 2487 Torr, P. H. S., 3136 Torta, E., 1401 Tosic, I., 2445 Touri, B., 998 Toussaint, M., 2108, 2117 Towfic, Z. J., 941, 996, 1851, 2253, 2445 Tracy, D. S., 41 Traskin, M., 2581 Trefethen, L. N., 43, 50 Treves, F., 2637 Tribus, M., 231 Tripuraneni, N., 873 Tropp, J. A., 2446 Troutman, J. L., 591 Trudeau, R. J., 940 Trujillo-Barreto, N., 1347 Tsai, C. L., 1254 Tseng, P., 360, 362, 423, 540 Tsitsiklis, J. N., 45, 418, 540, 583, 674, 713, 873, 941, 942, 1894, 1957, 1958, 1960, 2001, 2042, 2044, 2108, 2113 Tsuda, K., 2640 Tsybakov, B., 2205 Tsypkin, Y. Z., 582–584, 674, 713 Tu, S. Y., 941 Tucker, A. W., 323 Tugay, M. A., 627 Tukey, J. W., xxx, 1309, 1638, 2280, 2553, 2581 Turner, R. E., 1378 Uffink, J., 111 Uhlmann, J. K., 1204 Ullman, J. D., 1734 Ulm, G., 628 Ultsch, A., 2310 Ungar, L. H., 2487 Uribe, C. A., 998 Utgoff, P. E., 2336 Uzawa, H., 1849

Vajda, I., 233 Valiant, L., 2336, 2581, 2673 Vámos, T., 43 Van Camp, D., 1467 van de Molengraft, R., 1401 Van den Steen, E., 1264 van der Hoek, J., 1556, 1605 van der Merwe, R., 1204 van der Pas, P. W., 111 van der Sluis, A., 466 van der Vaart, A. W., 116, 119, 2674 van der Vorst, H. A., 45, 466 Van Erven, T., 233 van Hasselt, H., 1264, 2043 Van Huffel, S., 44 Van Hulle, M. M., 2310 Van Loan, C. F., 41, 43, 45, 423, 466, 1556, 2242 van Merrienboer, B., 3037 Van Roy, B., 1997, 2042, 2044, 2108 Van Scoy, B., 996 van Seijen, H., 1958, 1996 Van Trees, H. L., 1113, 1116, 1251, 1265 Vandenberghe, L., 290, 292, 322, 323, 328, 362, 418, 941, 2517 VanderPlas, J., 1259, 1263 Vandewalle, J., 44 Vanhoucke, V., 2831 Vapnik, V. N., 1201, 2486, 2487, 2551–2553, 2581, 2673, 2678 Varga, R. S., 45 Vargas, D. V., 3095 Varshney, P., 425 Vassilvitskii, S., 2281 Vaswani, A., 3037 Vazirani, U., 2673 Vedaldi, A., 3061, 3095 Veeravalli, V. V., 497, 675, 752, 943 Vehtari, A., 1378 Venkatesan, R., 2886 Venkatesh, S., 2673 Verhulst, P. F., 2484 Verma, G. K., 2886 Verri, A., 2678 Vershynin, R., 110, 123, 588 Vetterli, M., 109, 942 Vicente, L. N., 425 Vidyasagar, M., 2243, 2673 Vijayakumar, S., 2108, 2117 Vilalta, R., 3136 Villa, S., 775 Vincent, P., 2778, 2831, 2960 Vinyals, O., 3037, 3095, 3136, 3138 Virasoro, M. A., 1796 Viterbi, A. J., xxxi, 1605

Vlaski, S., 628, 679, 715, 716, 832, 872–876, 998, 1734 Volinsky, C., 586, 2200 von der Malsburg, C., 2310 von Mises, R., 582 von Neumann, J., 43, 47, 257, 424 Voronoi, G. F., 2280 Vovk, V., 109, 2634 Vu, B. C., 775 Wagner, D., 3095, 3096 Wagner, T., 1734 Wahba, G., 2242, 2486, 2551, 2552, 2639, 2640 Wainwright, M. J., 110, 123, 126, 291, 496, 497, 585, 752, 1467, 2679, 2701 Waissi, G. R., 1347 Wajs, V. R., 360, 362, 540 Wakefield, J., 1259 Wald, A., 1113 Walker, H., 1309 Walker, W., 2279 Wallace, D. L., 2553 Wallis, J., 420 Walsh, T. J., 1957, 2042, 2107 Walters, P., 257 Wan, E., 1204 Wand, M. P., 1467 Wang, C., 423, 1467 Wang, D., 2488 Wang, H., 293 Wang, J., 721 Wang, K., 1467 Wang, L., 2446 Wang, X., 1515 Wang, Y., 998, 2445 Wang, Z., 2886, 2961 Wansbeek, T., 41 Ward, I. R., 2779 Ward, R., 716 Ward, T. E., 2961 Warde-Farley, D., 2961, 3095 Warga, J., 423 Warmuth, M. K., 293, 2673 Wasserstein, L. N., 2962 Watanabe, K., 1467 Watkins, C., 1957, 1996, 2003 Watkins, C. J., 2486 Webb, A., 1113 Wedderburn, R., 2492 Weimer, M., 586 Weinert, H. L., 2640 Weiss, K., 2488 Weiss, N., 115, 1309, 1556, 1605 Weiss, Y., 1794, 1795, 2886 Welch, L. R., 1556, 1604, 1605, 2448

Weldon, W. F. R., 1309 Wellekens, C. J., 2778 Welling, M., 2961 Wellner, J. A., 2674 Wen, Z., 1997 Werbos, P. J., 1957, 2778, 3037 Werner, D., 2635 Weston, J., 2378, 2486, 2634, 2640, 3136 Wetherhill, G. B., 583 Wets, R., 358, 594 Weyl, H., 45 White, D. J., 1894 White, H., 2778 White, T., 2961 Whitlock, P. A., 1347 Wibisono, A., 425 Wichern, D. W., 2380 Widrow, B., xxx, 584, 714, 1957, 2778 Wiener, N., xxx, 257, 1143, 1201, 1202 Wierstra, D., 2961 Wiewiora, E., 1849, 2043 Wikle, C. K., 1203 Wilby, W. P. Z., 584 Wild, S. M., 425 Wilkinson, D., 586 Wilkinson, J. H., 45, 466, 1556 Wilkinson, W., 1378 Willard, B. T., 360 Williams, C., 159–161, 2310, 2634, 2646, 2780 Williams, D., 111, 721 Williams, L. J., 2412 Williams, R. J., 2108, 2778, 3037 Williams, R. L., 1202 Williamson, R. C., 2634, 2640 Willshaw, D. J., 2310 Willsky, A., 1734 Willsky, A. S., 2202 Wilmer, E. L., 1551, 1552 Wilson, D. L., 2280 Winkler, R. L., 1264, 1346 Winograd, T., 1551 Winther, O., 1378, 1467 Wise, F., 2281 Witte, J. S., 2199 Witte, R. S., 2199 Witten, D., 2378 Witten, I. H., 1957, 2281 Wittenmark, B., 2202 Wolberg, W. H., xxxix, 2310 Wold, H., 1143, 1202 Wolfe, P., 291, 418, 466, 496, 540 Wolff, T. H., 2686 Wolfowitz, J., 425, 583, 713 Wolpert, D. H., 1084, 2672, 2675

Wolynes, P. G., 112 Wong, W. H., 1401, 1402 Wong, Y. K., 43 Woodbury, M., 42 Woodcock, R., 584 Wooding, R., 160 Woodruff, D. P., 2201 Woods, J. W., 109 Wortman, J., 2488 Wright, R. M., 1638 Wright, S. J., 422, 423, 466, 468, 713, 1678 Wu, C. F. J., 1309 Wu, G., 996, 1002 Wu, L., 1401 Wu, T. T., 423 Wu, X., 3095 Wu, Y., 2836 Wu, Z., 2779 Wyatt, J., 2581 Xavier, J., 996, 2253 Xia, X., 293 Xiang, Z., 3095 Xiao, L., 627, 775, 806, 942 Xie, C., 3095 Xie, L., 998 Xie, N., 3061, 3095 Xie, S., 2445 Xin, R., 996 Xu, B., 2961 Xu, D., 2778 Xu, J., 996, 998 Xu, L., 1309 Xu, M., 2836 Xue, L., 2412 Yamada, I., 45 Yamanishi, K., 2445 Yamashita, R., 2886 Yan, J., 721 Yan, M., 996, 999 Yang, C., 2248 Yang, H. H., 233, 1638 Yang, P., 996, 3095 Yang, S., 427, 719 Yao, H., 3136 Yao, X., 2581 Ye, Y., 323, 418 Yedida, J. S., 1795 Yellen, J., 940 Yeung, K. Y., 2412 Yin, G. G., 583, 713 Yin, H., 2310 Yin, M., 362 Yin, W., 423, 998, 1002, 2486

Ying, B., 291, 496, 497, 627, 631, 674, 676, 677, 679, 713, 715, 716, 750, 806, 832, 833, 996, 1734, 2581 Ylvisaker, D., 188 Yosida, K., 358 Yosinski, J., 2779, 3061, 3095 Young, G., 44 Young, G. A., 188, 189, 1113, 1116 Young, N., 2638 Yu, B., 1254, 2581 Yu, D., 2831 Yu, H., 427, 719, 3136 Yu, Y., 2353, 3037 Yuan, K., 627, 631, 679, 713, 715, 716, 806, 832, 833, 996, 1000, 2043 Yuan, M., 500, 541 Yuan, X., 3095 Yuan, Y., 466 Yudin, D. B., 291, 496, 541, 775 Yujia, L., 2962 Yun, S., 423 Yushkevich, A. P., 338 Zacks, S., 188, 189 Zadeh, L. A., xxx, 941, 1202 Zakai, M., 1083 Zalinescu, C., 298 Zareba, S., 627 Zaremba, W., 3095 Zaritskii, V. S., 1402 Zazo, S., 2043 Zdunek, R., 2445 Zehfuss, G., 41 Zeiler, M. D., 627, 3060 Zemel, R., 2778, 2962, 3136, 3144 Zeng, J., 998 Zenios, S., 293 Zenko, B., 2581

Zhang, C., 1467 Zhang, F., 42 Zhang, H., 2353 Zhang, J., 466, 720 Zhang, K., 1678 Zhang, N. L., 1733, 1894 Zhang, P., 940 Zhang, T., 291, 496, 716, 752, 775, 806, 832, 873 Zhang, W., 2886, 3095 Zhang, Y., 362, 2445 Zhang, Z., 2445, 2446 Zhao, P., 716 Zhao, Q., 2445 Zhao, X., 717, 754, 943, 954, 996 Zheng, H., 2779 Zheng, Y., 873 Zhou, D., 541 Zhou, G., 2445 Zhou, H., 423 Zhou, J., 2779 Zhou, P., 2778 Zhou, Y., 586, 2200, 2779 Zhou, Z., 775, 2581 Zhu, M., 996 Zhu, S., 998 Zibriczky, D., 586, 2200 Zill, D. G., 338 Zinkevich, M., 713 Zinn, J., 2678 Zioupos, A., 586 Zipser, D., 3037 Zisserman, A., 3061, 3095 Zorich, V. A., 65 Zou, H., 2242, 2412, 2421 Zoubir, A. M., 1309, 2581 Zoutendijk, G., 466, 468, 713 Zucchini, W., 1254, 1556, 1605 Zulehner, W., 1849

Subject Index

Please note: this index contains terms for all three volumes of the book. Only terms listed below with page numbers 1053–2120 are featured in this volume. 0/1−loss, 1094 A2C algorithm, 2065 absorbing chain, 1822 absorbing state, 1811 accelerated momentum, 614, 626 acceptance ratio, 1341 accumulating trace variable, 1944 activation function, 2716, 2717 active learning, 2471, 2487 active path, 1663 actor–critic algorithm, 2059, 2063, 2106 actor-only algorithm, 2106 AdaBoost, 2561, 2576 AdaDelta, 608, 626 AdaGrad, 599, 602, 626 adaline, 557, 584 ADAM, 599, 610, 626, 2744, 3096 AdaMax, 612, 626 adaptive gradient method, 599 adaptive importance sampling, 703 adaptive learning rate, 599, 604, 626 ADMM, 996 advantage actor–critic, 2065 advantage function, 2063, 2110 adversarial attack, 3065 adversarial machine learning, 425, 433, 3065 adversarial training, 3095 affine classifier, 2654, 3042, 3078 affine hull, 292 affine-Lipschitz, 476, 497, 498, 677 agreement, 935, 970 AIC, 1238, 1266 Akaike information criterion, 1238, 1253, 1266 AlexNet, 2886 algebraic connectivity, 908 alignment score, 2992 almost-sure convergence, 119 α-divergence, 232 alternating least-squares, 586, 2192, 2200

alternating minimization, 2272 alternating projection algorithm, 393, 412, 423, 431, 2518, 2553 analysis step, 1188 ancestor node, 1651 AND function, 2517 annotator, 2474 antiferromagnetic, 1796 aperiodic Markov chain, 1553, 1558 a-posteriori error, 2191 a-priori error, 2191 Archimedes principle, xxvii arcing classifier, 2580 ARMA process, 254 Armijo condition, 400 array algorithm, 1204, 1207 Arrow–Hurwicz algorithm, 430, 1844, 1846, 1849, 2026 assumed density filtering, 1375, 1378 Atari game, 2043 atom, 2425 attention mechanism, 2991, 3037, 3038, 3138, 3140 attention score, 2992, 3141 attribute, 2313, 2315 Aug-DGM algorithm, 977, 978 auto-correlation sequence, 242 auto-regressive random process, 258 autoencoder, 2745, 2747, 2769, 2778, 2779, 2784, 2797 automatic character recognition, 2845 autonomous system, 627 auxiliary particle filter, 1396 average regret, 579, 633 average reward, 2054, 2109 averaged projections, 431 averaging combination, 1558 averaging rule, 909 AVRG algorithm, 816, 827 axioms of probability, 109 Azuma inequality, 110, 126, 1997

backgammon, 1841, 1851, 1957, 2107 backpropagation, 852, 2739, 2778, 2779, 2781–2784, 2876, 2887, 2924, 2950 backpropagation through time, 2974, 2982, 3009, 3018 backtracking line search, 393, 398, 455 backward view, 1945 bagging, 2557, 2580, 2750 bagging classifier, 1933 balanced matrix, 944 Banach fixed-point theorem, 347 base function, 1227 basis pursuit, 305, 350, 354, 472, 491, 569, 571, 1706, 2243, 2247 basis pursuit denoising, 2237, 2243 batch algorithm, 383 batch logistic regression, 384, 2459 batch normalization, 2769, 2770, 2779 batch perceptron, 473 batch SOM, 2300 batch subgradient method, 473 Baum–Welch algorithm, 1517, 1537, 1563, 1588, 1592, 1601 Bayes rule, 78, 110 Bayes classifier, 1062, 1092, 1097, 1098, 1113, 1116, 2260, 2262, 2341, 2358 Bayes risk, 1117, 2679 Bayesian inference, 1092, 1113, 1211, 1319, 1352 Bayesian information criterion, 1238, 1253, 1257, 1271 Bayesian mixture of Gaussians, 1406 Bayesian net, 1647 Bayesian network, 85, 1643, 1647, 1682 Bayesian reasoning, 1647 Bayesian variational inference, 1320 Bayesian view, 1256 bearing, 1199, 1399 behavior policy, 1952, 1979, 2020 belief, 1174, 1383, 1682 belief propagation, 1378, 1682, 1684, 1698, 1771, 1794 belief vector, 1880 belief-MDP, 1882, 1894 Bellman projected error, 2042 Bellman equation, 1828, 1835, 1884, 1896 Bellman optimality equation, 1828, 1835 Berkson paradox, 1669, 1677, 1679 Bernoulli distribution, 2343 Bernoulli mixture model, 1302 Bernstein inequality, 1997 BERT method, 3037 best linear unbiased estimator, 1141 best matching unit, 2293 beta distribution, 175, 185, 1096, 1116, 1217, 1253, 1336, 1342, 1411

BFGS method, 421 bi-clustering, 2445 bias, 1143 bias selection, 1670 bias–variance relation, 1072, 1083, 1088, 1213, 1254, 2205, 2282, 2560, 2672 biased coin, 1303 BIBO stability, 248, 1177 BIC, 1238, 1257, 1271 bidirectional LSTM, 3026, 3037 bidirectional RNN, 2994, 3037 big data, xxix, 2185 big-O notation, 121, 355, 388, 576 big-Θ notation, 388 bijection, 521 binary classification, 1097, 2464, 2716, 3078 binary tree, 236 binomial distribution, 170, 185, 1223, 1226 binomial formula, 2641 binomial theorem, 1226, 2693 bioinformatics, 1556, 2551 biostatistics, 1309 bipartite, 1759 bit, 196, 1246, 2320 black-box attack, 3068, 3088 blackjack, 1957 blind separation, 1637 block coordinate descent, 423 block Kronecker product, 28, 41 block maximum norm, 36, 45 block RLS, 2191, 2208–2210 blocked path, 1663 BLUE estimator, 1141, 1144 Bochner theorem, 2595, 2636, 2642 Boltzmann constant, 105, 231 Boltzmann distribution, 99, 100, 111, 190, 1745, 1988, 2047, 2809 Boltzmann exploration, 1987 Boltzmann machine, 98, 1111, 2769, 2797, 2830 Boolean function, 2526 Boolean variable, 197, 2315 boosting, 1109, 2557, 2561, 2580 bootstrap, 1255, 1933, 2557, 2561 bootstrap particle filter, 1395 bounded gradient disagreement, 875 BPTT, 2974, 2982, 3009, 3018 breast cancer dataset, 2307, 2310, 2539, 2607 Bregman divergence, 233, 285, 292, 298, 319 Bregman projection, 523 bridge regression, 2248 Brownian motion, 102, 110 Brownian process, 2417

Bryson–Frazier filter, 1182, 1206 burn-in period, 1341, 2833 Bussgang theorem, 163 C4.5 algorithm, 2336 calculus of variations, 1202 canonical exponential family, 1226 canonical spectral factor, 1180 capacity, 2662 capacity of classifier, 2526 cardinality, 2271 CART algorithm, 2336 cascade, 1662 categorical distribution, 189 Cauchy sequence, 2637 Cauchy–Schwarz inequality, 32, 113, 332, 1266, 2640 causality, 84, 109, 1659, 1677 causation, 84, 1659, 1677 cavity function, 1358, 1372 centered Poisson equation, 2055 central limit theorem, xxix, 104, 120, 158 central processor, 621 centrality of a node, 911 centroid, 991 Ceres, 2198 change-point model, 2835 channel capacity, 308 Chapman-Kolmogorov equation, 1174, 1383, 1530 characteristic function, 99 characteristic polynomial, 96 Chebyshev inequality, 75, 109, 114, 122 Chernoff inequality, 1997 chess, 1841, 1851, 1957, 1996, 2042, 2107 chi-square distribution, 160, 1270 child node, 1651 choice axiom, 1996 cholera outbreak, 2280 Cholesky factor, 23 Cholesky factorization, 14 Chow–Liu algorithm, 1657, 1699, 1733 CIFAR-10 dataset, 2761, 2781 circular Gaussian distribution, 155, 1088 circular random variable, 155, 1084 circularity, 155, 160, 1084 class map, 2306, 2309 classification, 1595, 2260, 2290, 2341, 2728 classification-calibrated loss, 2680 classifier, 1097 client, 620 clipped cost, 2088 clipping, 3003 clipping function, 2089 clique, 1740 clique potential, 1742

closed function, 263 closed set, 261 closed-loop matrix, 1161 closure of a set, 520 cloud computing, 903 clustering, 1278, 1290, 2270 clustering map, 2306 clutter, 1378 CNN, 2715, 2838 co-coercivity, 339, 371, 389, 1003 codeword, 2486 coding matrix, 2486 coefficient of determination, 2181 coercivity, 267 collaborative filtering, 562, 586, 2192, 2200, 2929 colored map, 2303 column span, 7 combination matrix, 905, 909 combining estimators, 1145 common cause, 1662 common effect, 1662 compact set, 594, 2516 compatible approximation, 2058 compatible features, 2058 competitive learning, 2293 complementary condition, 307, 314, 2543 complementary error function, 161 complete conditional, 1442 complete DAG, 1679 complete graph, 1679 complete space, 2638 completion-of-squares, 13, 1055, 1059 complex random variable, 106 compression, 2906, 2925 compressive sensing, 2445, 2449 computational complexity, 1734 computer vision, 1677, 1749, 2838, 2886 concentration inequality, 123 condition number, 20 conditional dependence, 1643 conditional distribution, 80 conditional entropy, 200, 2321 conditional GAN, 2955, 2961 conditional gradient algorithm, 540 conditional independence, 85, 88, 1644 conditional mean, 89 conditional mean estimator, 1058 conditional mutual information, 234, 1661 conditional pdf, 80 conditional probability table, 1647 conditional VAE, 2930 cone, 261 confidence interval, 1255 confidence region, 2179 congruence, 13, 41

conic combination, 293 conic hull, 261 conjugate function, 281, 282, 292, 359, 988, 1844, 1847, 2256 conjugate gradient, 441, 448, 451, 453, 456, 465, 574, 705, 708, 719, 720, 2087 conjugate pair, 183 conjugate prior, 183 conjugate utility, 282, 292 conjugate vectors, 448 connected graph, 905, 908, 1558 consensus strategy, 921, 935, 936, 949, 970 conservation, 3056 conservation policy iteration, 2108 consistency relation, 2102 consistent classifier, 2673 consistent equations, 2171 consistent estimator, 1235, 1237 constant step size, 381 constrained estimator, 1139 constrained least-squares, 2205 constrained optimization, 302, 1150, 2228, 2233, 2250 context, 2992 continuous mapping theorem, 119 continuously differentiable function, 338 contraction, 347 contraction property, 89 contractive operator, 347, 365 contrastive divergence, 1752, 2805, 2820, 2825 contribution propagation, 3061 controllability, 1203 convergence in distribution, 119 convergence in mean, 119 convergence in probability, 117, 119, 164, 244 convergence of random variables, 119 convergence with probability one, 119, 244 conversion factor, 2191 convex cone, 261 convex function, 263 convex hull, 261, 292, 295, 2685 convex optimization, 302, 1844 convex quadratic program, 2542, 2608 convex set, 261 convex–concave saddle problem, 541 convolution, 91 convolution mask, 2839 convolutional neural network, 2715, 2778, 2838, 2839, 2887, 2909 coordinate descent, 402, 422, 493, 495, 573, 690, 764, 776 coordinate-ascent algorithm, 1415, 1482 coordinate-descent LASSO, 495 corrected AIC, 1254 correlation, 1677

correlation coefficient, 113 correlation layer, 2847 cosine similarity, 2785, 3038, 3141 cost function, 1054 counter-based exploration, 1997 counting theorem, 2520–2522 covariance form, 1163 covariance function, 152, 241, 2623 covariance matrix, 94, 96 covariate shift, 2478 crafting, 3072, 3077 Cramer–Rao bound, 1213, 1229, 1251, 1260, 1264 critic, 2962 critic-only algorithm, 2106 critical value, 2199 cross validation, 1238, 1248, 2239, 2546, 2552 cross-covariance matrix, 94 cross-entropy, 204, 208, 2489, 2754, 2755, 2778, 2784, 2887, 2984, 3021 cross-entropy training, 2984, 3021 cumulative distribution function, 116, 119, 133, 147, 1619, 2484 cumulative incoherence, 2448 cumulative reward, 1822 curse of dimensionality, 1644, 2307, 2650, 2672 curvature, 225 curvature tensor, 234 cycle, 943, 1648 cyclic projection algorithm, 424 DAG, 1648 DARE, 1178 data assimilation, 1203 data fusion, 47, 624, 1133, 1136, 1145 data preprocessing, 1615, 2383 data-driven method, 1111 d-dependence, 1672 decaying step size, 392, 489, 514 decentralized algorithm, 949, 1000 decentralized learning, 902, 969 decentralized optimization, 902, 969 decentralized SGD, 925, 938, 941, 943 decentralized stochastic gradient descent, 925, 938, 941, 943 decision tree, 2313, 2336, 2561 decoder, 1070, 1071, 1320, 2391, 2906, 2909 decoding, 1534, 1563 DeconvNet, 3049, 3060 deconvolution, 2243 deep belief network, 2769, 2797, 2830 deep exploration, 1997 deep learning, 2779, 2797

deep network, 111, 2715, 2720, 2722, 2768, 2839 deep Q-learning, 2009, 2033 deep reinforcement learning, 2092 deepfake, 2961 DeepFool, 3069, 3077, 3094 DeepLIFT, 3062 defense to adversarial attacks, 3091 defensive distillation, 3092, 3095 deflation, 2390 degree of bootstrapping, 1940 degree of node, 906 delta rule, 557, 584 delusional bias, 2043, 2107 denoising, 500, 541, 587, 2237, 2243 denoising autoencoder, 2747 dependent observations, 1586 derivative-free optimization, 425 descendant node, 1651 descent direction, 460, 708 descent lemma, 332 deterministic policy, 1856 dichotomy, 2520, 2662, 2688 dictionary learning, 2424, 2425, 2443 differential entropy, 192, 202, 1625 differential privacy, 628 diffusion, 102 diffusion coefficient, 104 diffusion learning, 921 diffusion strategy, 921, 925, 937, 949 digamma function, 176, 1218, 1253, 1500, 1723 DIGing algorithm, 975, 977, 998 dimensionality reduction, 2231, 2307, 2383, 2746 Dirac delta, 70 directed acyclic graph, 1648 directed graph, 907, 981 Dirichlet distribution, 174, 185, 1515, 2354 discount factor, 1822 discounted reward, 1822, 2052 discounted state distribution, 2109 discounted utility, 1821 discounted visitation function, 2078 discrete algebraic Riccati equation, 1178 discrete-time Fourier transform, 246 discriminant function, 1106, 2357, 2359 discriminative method, 1110, 2262 discriminative model, 1110, 2262, 2457 discriminator, 2935 dispersed exponential distribution, 190 distillation, 3092, 3095 distillation temperature, 3092 distilled network, 3092 distributed learning, 902, 969 distributed optimization, 902, 969

divergence, 196, 232 divergence theorem, 436 divide-and-conquer, 1575, 1605, 1827 DNA microarray, 2672 do operator, 84 document classification, 2341 domain adaptation, 2476, 2487 dominated convergence theorem, 212, 548, 590, 646, 2023 double descent, 2205 double Q-learning, 1996, 2043 doubly stochastic matrix, 38, 909, 944 Douglas–Rachford algorithm, 356, 365 dropout, 2724, 2779 d-separation, 1672, 1754 D-SGD algorithm, 925 δ-smooth function, 645 DTFT, 246 dual averaging, 535 dual function, 306, 607, 1845 dual method, 903, 986 dual norm, 32, 48 dual problem, 312, 1844 dual space, 521 dual variable, 306 duality, 322 duality argument, 306 duality gap, 314 dying ReLU, 2719 Dykstra algorithm, 425, 432 dynamic programming, xxxi, 1575, 1604, 1783, 1786, 1827, 1861, 1894, 1897, 2042, 2672 early stopping, 2743 Earth mover distance, 2962 echo cancellation, 627 Eckart–Young theorem, 44, 50, 2200, 2397, 2434 ECOC method, 2486 edge, 904 edge detection, 2845 EEG signal, 1611 effective sample size, 1390 efficient estimator, 1213, 1229, 1237 eigen-decomposition, 3, 2385 eigenfilter, 1150 eigenvalue, 1 eigenvector, 1 Einstein–Smoluchowski relation, 104, 110 EKF algorithm, 1193, 1203 elastic-net regularization, 376, 2226, 2237, 2242 ELBO, 229, 1411, 1412, 1722, 1731, 2907 elevated terrain map, 2303

eligibility trace, 1942, 1944, 1958 ellipsoid, 293, 2198 Elman network, 3037 ELU, 2719 EM algorithm, 1276, 1282, 1405, 1517, 1588, 2271, 2281 embedding, 2785, 2990, 3102 emission distribution, 1586 emission probability, 1382 emission probability kernel, 1879 emission probability matrix, 1689 empirical data distribution, 1220 empirical error, 2263 empirical risk, 376, 642, 683, 730 empirical risk minimization, 376, 852, 2654, 2657, 2913 EMSE, 589 encoder, 1070, 2391, 2909 encoder–decoder RNN, 2990, 3037, 3038 encoder–decoder, 1070 encoding, 2315 energy function, 845, 850, 1745, 2834 energy-based model, 2809 ensemble average, 243 ensemble Kalman filter, 1185, 1203, 1208 ensemble learning, 2557, 2564, 2580 entropy, 192, 196, 231, 326, 1625, 1699, 2033, 2320, 2336, 2338, 2353, 2683 enumeration method, 1685 epidemiology, 2280 epigraph, 327 episode, 3121 episodic MDP, 1822 epoch, 555 −insensitive loss, 2553 equality constraint, 302 equivalence, 2211 erf function, 161, 2779 erfc function, 161 ergodic in the mean, 244 ergodic process, 243, 257, 2055 ergodic theorem, 257, 2833 ergodicity, 243, 257 ERM, 376, 2657 error covariance matrix, 1162 error-correcting code, 1677, 1792, 1794 error-correcting output code, 2486 E-step, 1286 estimate, 1054 estimator, 1058, 1122 Euclidean geometry, 2170 Euler constant, 1253 event, 109 every-visit policy evaluation, 1925 evidence, 78, 1211, 1328, 1405, 1671, 1682, 1683

evidence lower bound, 1412, 2907 EXACT diffusion, 1000 excess kurtosis, 1627 excess mean-square error, 589 excess risk, 389, 391, 713, 2679 expectation maximization, 1276, 1319, 1711, 1715, 1717, 2271, 2281 expectation propagation, 1110, 1378, 1405 expected discounted reward, 2053 expected immediate reward, 2054 expected SARSA, 1974 experience replay, 1992, 1996 expert system, 1677 explainable learning, 3043 explaining away, 1669 exploding gradient, 3004 exploitation, 1986, 1996 exploration, 1985, 1986, 1996 exploration function, 1990 exploration model, 3118 exponential convergence, 387 exponential distribution, 114, 167, 188–190, 1312 exponential family of distributions, 1226 exponential loss, 1109, 2576 exponential LU, 2719 exponential mixture model, 1312 exponential risk, 1109, 2252 exponential weighting, 2188, 2209 extended Kalman filter, 1193, 1203 face recognition, 1556, 2435, 2551, 2886 factor, 1742, 2391 factor analysis, 2406, 2412 factor graph, 1643, 1647, 1740, 1756 factorial, 1253 factorization theorem, 1228 fake image, 2929, 2960 false alarm, 1115 false negative, 1115 false positive, 1115 far-sighted evaluation, 1823 fast causal inference, 1678 fast Fourier transform, xxx fast gradient sign method, 3070, 3094 fast ICA, 1628 FBI, 1056 FCI algorithm, 1678 FDA, 2223, 2357, 2378, 2383, 2412, 2654 F -distribution, 2199 f -divergence, 232 feasible learning, 2662 feasible point, 303 feasible set, 327 feature map, 2847 feature selection, 2383

feature vector, 1841 federated averaging, 623 federated learning, 619, 627, 631, 797, 875, 902 federated SVRG, 797 feedback neural network, 2716 feedforward neural network, 2715, 2716, 2721 Fenchel conjugate, 282, 359, 366, 2256 Fenchel–Young inequality, 296 ferromagnetic, 1796 few-shot learning, 3099, 3118, 3144 FFT, xxx FGSM, 3069, 3070, 3094 filtration, 1402 financial model, 1326 fine tuning, 3099 finite difference method, 2048 finite precision, 1435, 2721 Finito algorithm, 806 firmly nonexpansive operator, 348, 357, 365 first-order ergodic, 244 first-order moment, 71 first-order optimization, 433 first-order stationarity, 854, 873 first-visit policy evaluation, 1924 Fisher discriminant analysis, 2223, 2262, 2341, 2357, 2365, 2380, 2383, 2643, 2654 Fisher information matrix, 181, 213, 601, 1230, 1456, 1458, 2071 Fisher kernel, 2641 Fisher–Neyman theorem, 188 Fisher ratio, 2369 fixed point, 341, 346, 360, 539, 1838 fixed-interval smoothing, 1181 fixed-lag smoothing, 1181, 1206 fixed-point iteration, 1839 fixed-point smoothing, 1181, 1206 fixed-point theorem, 2118 flat prior, 1273 Fletcher–Reeves algorithm, 456, 468, 574, 719, 720 FOCUSS, 2443 FOMAML, 3132 forecast step, 1187 forgery, 2785 forget gate, 3005 forgetting factor, 2188 Forgy initialization, 2272 forward view, 1941, 1977 forward–backward splitting, 351 forward-backward recursions, 1517, 1535, 1537, 1538, 1688 Fourier transform, 2641 fourth-order moment, 137 Frank–Wolfe method, 540 fraud detection, 1556

Fredholm equation, 2242 free energy, 1746, 2811 frequentist approach, 1255 frequentist vs. Bayesian, 1255, 1256 Frobenius norm, 33, 2397 full context embedding, 3138 full rank, 8 functional causal model, 1678 fundamental theorem of algebra, 43, 50 fundamental theorem of calculus, 265, 330 fused LASSO, 500, 541, 587 gain matrix, 1161 Galerkin method, 1849 game theory, 946 gamma distribution, 161, 189, 1118, 1226 Gamma function, 1096, 1218, 1252, 1411, 2183, 2653 gamma function, 160, 185 GAN, 2905, 2935 gap, 1998 gated recurrent unit, 3034, 3039 Gauss–Markov field, 1708 Gauss–Markov theorem, 1141, 1144, 1213 Gauss–Southwell rule, 409, 423 Gaussian distribution, 74, 132, 155, 169, 177, 1066, 1214 Gaussian error function, 161, 2779 Gaussian graphical model, 1747 Gaussian kernel, 153, 159, 2592, 2594 Gaussian Markov field, 1746 Gaussian mask, 2845 Gaussian mixture model, 1111, 1277, 1280, 1287, 1517, 2271, 2281 Gaussian naïve classifier, 2351, 2355 Gaussian policy, 2109 Gaussian prior, 2242 Gaussian process, 150, 159, 2623, 2634, 2779 Gaussian RBM, 2834 Gaussian sketching, 2186 gene analysis, 1551 gene expression, 2427 general position, 2520, 2522 generalization, 2201, 2222 generalization error, 2264, 2661 generalization theory, 379, 2650, 2657, 3065 generalized linear model, 2492 generalized logistic distribution, 189 generalized portrait, 2551 generative adversarial network, 1111, 2905, 2935 generative graphical model, 2824 generative method, 1110, 2262 generative model, 1070, 1110, 1319, 1325, 1473, 1533, 1609, 2262, 2806, 2819, 2830

generative network, 2905 generator, 2935 geodesic, 225 geophysics, 2243 geostatistics, 160, 2634 Gershgorin theorem, 38, 45 Gibbs inequality, 205 Gibbs distribution, 100, 190, 206, 1744, 1988, 2047, 2101, 2720 Gibbs phenomenon, 2832 Gibbs sampling, 1342, 2804, 2819, 2831 Gini coefficient, 2336 Gini gain ratio, 2325 Gini impurity, 2325, 2336, 2338 Gini index, 2325 GLASSO, 1708 GLM, 2492 global independence relation, 1662 global temperature, 2175, 2179 GMM, 1277 GO game, 1957, 1958, 2107 goodness-of-fit test, 1238 Google, 1531, 1958 gradient ascent, 971, 976, 980, 1845 gradient boosting, 2572 gradient correction TD, 2023 gradient descent, 375, 418, 971, 976, 980, 1845 gradient explosion, 3004 gradient × input analysis, 3049 gradient noise, 574, 642, 860, 861 gradient operator, 2846 gradient perturbation, 852, 873 gradient TD, 2008 gradient vector, 59, 65 Gram–Schmidt procedure, 18, 43, 51, 449, 1158, 1620, 1631 Gramian matrix, 153, 2543, 2554, 2592, 2635 Granger causality test, 1678 graph, 903 graph learning, 1698, 1705 graph neural network, 2748, 2779 graphical LASSO, 1705, 1734, 1736 grayscale image, 2275, 2443, 2760, 2822, 2829, 2839, 2884, 2927, 3059, 3086, 3111 greedy exploration, 1985 greedy method, 2445, 2449, 3077 greedy strategy, 1855, 1996, 2830 group LASSO, 500, 541, 587 growing memory, 2188 growth function, 2688, 2690 GRU, 3034, 3039 GTD2 algorithm, 2019, 2025 guided backprop, 3048 guided backpropagation, 3048, 3060

Hadamard division, 2440 Hadamard matrix, 2187 Hadamard product, 66, 604, 1565, 2734, 2791, 2918, 2975 Hadamard sketching, 2186 Hahn–Banach theorem, 2778 halfspace, 293, 323 HALS, 2438 Hammersley–Clifford theorem, 1795, 1799 Hamming distance, 2487 hard assignment, 2272 hard decision, 1060, 1062 hard label, 3092 hard thresholding, 363 hard-margin SVM, 2530, 2533 harmonic mean, 1408 Harmonium, 2830 Hastings ratio, 1341 Hastings rule, 945 heart disease dataset, 1259, 1703, 2329, 2337, 2394, 2511 heatmap, 3042, 3046 heavy-ball implementation, 419, 426 Hebbian learning, 2412, 2415, 2514 Hebbian rule, 2514 Hellinger distance, 233, 236 Hermitian matrix, 39 Hessian matrix, 62, 65, 268, 2111 hidden Markov model, 1111, 1283, 1284, 1381, 1383, 1440, 1517, 1533, 1563, 1643, 1646, 1688, 1717 hidden variable, 1283, 1683 hierarchical alternating least-squares (HALS), 2438 hierarchical model, 1515 Hilbert space, 43, 2637, 2638 hinge loss, 271, 273, 2614 hinge risk, 376, 474, 2252 hit histogram, 2306 HMM, 1517, 1533, 1563 Hoeffding inequality, 110, 117, 123, 1263, 2673, 2679 Hoeffding lemma, 118, 123 Hölder inequality, 192, 638 Hopfield network, 3037 Hotelling transform, 2413 Huber function, 358, 363 Huber loss, 2784 Hughes effect, 2672 Hugin algorithm, 1795 human in the loop, 2474 Hurwitz matrix, 1851 hybrid distribution, 1360, 1372 hyperbolic secant, 2484 hyperbolic tangent, 2778 hypercube, 2526

hyperparameter, 2546 hypothesis space, 2655 hypothesis testing, 1115 ICA, 1609, 1637 ID3 algorithm, 2336 idempotent matrix, 2173 I-FGSM, 3072 iid process, 242 ill-conditioning, 2223, 2667 ill-posed problem, 2242 image denoising, 2445 image filtering, 2839 ImageNet, 2886 image processing, 2838 image recognition, 2886 image segmentation, 1299 immediate reward, 1818 immoral chain, 1676 implicit bias, 2195, 2197, 2201, 2210, 2553 implicit regularization, 2201, 2210, 2553 importance distribution, 1334 importance sampling, 552, 553, 588, 649, 653, 669, 701, 1333, 1347, 1385, 1954, 2076 importance weights, 1334, 1953, 2021, 2479 impurity, 231, 2336 impurity function, 2336 imputation, 561 incidence matrix, 907 incomplete information, 1311 incremental strategy, 430, 509, 919 independence map, 1653 independent component analysis, 1609, 1637 independent random variables, 83 independent sources, 1609 induced norm, 32 inductive bias, 2225, 2657 inductive inference, 379 inductive reasoning, 379 inequality constraint, 302 inequality recursion, 501, 590, 720 inertia, 12, 41 inertia additivity, 41 inference, 1054 information, 196 information gain ratio, 2323 inner-product space, 2637 innovations process, 1157 input gate, 3005 instance-weighting, 2481 instantaneous approximation, 550 integral equation, 2242 Internet-of-Things, 627 interpretability, 2179 interpretable learning, 3043 intersection of convex sets, 325, 412

inverse gamma distribution, 191 inverse Wishart distribution, 2205 ion channel, 100 IoT, 627 iris dataset, 1063, 1085, 1102, 1296, 2363, 2378, 2393, 2463, 2506, 2510, 2537 IRLS, 2199 irreducible Markov chain, 1553 irreducible matrix, 46, 49, 947 Ising model, 1747, 1749, 1796 isolated word recognition, 1595 ISTA, 510, 2238, 2402, 2428 iterated fast gradient sign method, 3072 iterated soft-thresholding algorithm, 2402, 2428 iteration-dependent step size, 392, 489, 514 iterative decoding, 1792, 1794 iterative dichotomizer, 2336 iterative reweighted least-squares, 2199 Jacobian matrix, 59, 1348, 3090 Jacobian saliency map approach, 3069, 3075 3094 Jensen inequality, 279, 292, 646, 918 Jensen–Shannon divergence, 2942 Johnson–Lindenstrauss lemma, 2201 Johnson–Lindenstrauss transform, 2201 joint distribution, 77 joint pdf, 77 joint stationarity, 251 Jordan canonical form, 34, 47 Jordan network, 3038 JSMA, 3069, 3075, 3094 junction tree, 1759, 1795 junction tree algorithm, 1759, 1795 Kaczmarz algorithm, 393, 432, 588, 2201 Kalman filter, 255, 1154, 2201, 2212 Karhunen–Loève transform, 2413 Karush–Kuhn–Tucker (KKT) conditions, 307, 322, 2250, 2541 KDDCup, 586 kelvin scale, 105 kernel, 2839 kernel-based learning, 2613 kernel-based perceptron, 2595 kernel-based SVM, 2603 kernel FDA, 2643 kernel function, 153, 2242 kernel method, 1112, 2587, 2591, 2592, 2715 kernel PCA, 2412, 2618, 2619, 2634 kernel ridge regression, 2610 kernel trick, 2636, 2637 K-fold cross validation, 2547 Khintchine–Kahane inequality, 2686

KKT conditions, 307, 322, 327, 607, 2250, 2251, 2541 KKT multiplier, 306 KL divergence, 196, 204, 232, 235, 1109, 1699, 2077 k–means algorithm, 2270, 2280 k–means clustering, 2260, 2280, 3144 k–means++ clustering, 2273 k–nearest neighbor, 2265 k–NN rule, 2260, 2265, 2282 Kohonen map, 2290, 2310 Kraft inequality, 199, 236 krigging, 160, 2634 Kronecker product, 24, 41 Kruksal algorithm, 1734 Krylov sequence, 467 Krylov subspace, 467 K-SVD, 2444 Kullback–Leibler divergence, 196, 204, 232, 235, 1108, 2756 kurtosis, 1621, 1627, 1640 `1 –regularization, 376, 2226, 2230, 2242 `2 –regularization, 376, 2225, 2226, 2242 `2 –pooling, 2864 L2Boost, 2577 labeled data, 2471 Lagrange multiplier, 306, 1497 Lagrangian, 306, 607, 1844 language processing, 2838 language translation, 2967, 2990, 2995 language understanding, 1677 Laplace mechanism, 629 Laplace prior, 2242 Laplace distribution, 358, 629, 2231 Laplace method, 1110, 1328, 1346, 1405 Laplace smoothing, 1604, 1660, 1701, 2344, 2352, 2354, 2358 Laplacian, 906 Laplacian noise, 629 Laplacian rule, 909 Laplacian matrix, 906 LASSO, 350, 354, 472, 477, 491, 500, 510, 541, 569, 571, 587, 733, 750, 1706, 2237, 2243 latent class analysis, 1302 latent Dirichlet allocation, 1472, 1514 latent factor model, 2406 latent model, 2405 latent variable, 227, 562, 1283, 1319, 1518, 1519, 1683, 2391 law of effect, 1957 law of large numbers, xliii, 112, 117, 164, 1334, 1923, 2937 law of total expectations, 115 law of total probability, 81 lazy mirror descent, 535

LDA, 1472, 2357, 2378 leaf, 1651, 2313 leaky ReLU, 2719 leaky LMS, 557 learning to learn, 3099 least-mean-squares-error estimator, 1058 least-squares, 157, 587, 1244, 2168, 2197, 2221 least-squares risk, 376 least-squares SARSA, 2031 least-squares TD, 2018 leave-one-out, 2549, 2552 left-stochastic matrix, 38, 50, 909, 944 Leibniz integral rule, 591 Lenz-Ising model, 1796 letter sequences, 1528 leverage score, 2186 Levinson algorithm, 255 Lidstone smoothing, 1604, 1660, 2345, 2353, 2358 likelihood, 78, 184, 215, 1211, 2906 likelihood ratio, 1099, 1117 limit superior, 687 line search method, 217, 235, 396, 443, 2083 linear autoencoder, 2747 linear classifier, 2654, 3042 linear combiner, 2717 linear convergence, 387 linear discriminant analysis, 2262, 2341, 2351, 2357, 2378, 2459 linear estimator, 1121, 2165 linear Gaussian model, 1068 linear model, 1134 linear operator, 43, 2413 linear prediction, 1143 linear program, 304, 2518 linear programming, 2552 linear regression, 136, 1121, 1169, 1244, 2165 linear search problem, 1734 linear separability, 2499, 2518, 2520, 2527 linearized Kalman filter, 1191, 1203 link function, 2493 Lipschitz condition, 332, 339, 367 Lipschitz constant, 332, 2962 Lipschitz continuity, 332, 337 Lipschitz continuous, 332, 497 Lipschitz gradient, 339 Lipschitz function, 120 little-o notation, 388, 576 living reward, 1814 LLMSE, 1122 Lloyd algorithm, 2270, 2281 LMS, xxx, 584 LMSE, 1058 local independence, 1653 local Markov property, 1753 local maximum, 601

local minimum, 601 LOESS smoothing, 2175, 2198 log loss, 270, 2756 log-likelihood function, 216, 1276, 1618, 1623, 2811 log-partition function, 168, 1227 log-sum exponential, 2100 logistic density function, 1468 logistic distribution, 1751, 2484, 2818 logistic function, 1325, 1619, 2484 logistic loss, 270, 2614, 2639 logistic model, 1110 logistic regression, 244, 384, 457, 510, 558, 571, 619, 1106, 1110, 1333, 2201, 2262, 2341, 2353, 2457, 2474, 2482, 2485 logistic risk, 376, 2252 logit model, 1324, 1346, 1461, 2359, 2459, 2484 LogitBoost, 2578 long short-term memory, 2715, 2967, 3004, 3026 long-term model, 714 loopy belief propagation, 1771 loss function, 376, 853, 1092, 2677 low-rank approximation, 44 LOWESS smoothing, 2175, 2198 `p -pooling, 2864, 2887 LSTM, 2715, 2967, 3004, 3026, 3036 LTI system, 248 Luce choice axiom, 1996 Lyapunov equation, 26, 27, 41 Lyapunov recursion, 48 machine precision, 2203 MAE estimator, 1094 magnetic field, 1796 Mahalanobis distance, 299, 630, 2362 majorization, 429, 2441 majorization–minimization, 429, 2442 MAML, 3130, 3136 manifold, 225 MAP estimator, 1094, 1113, 1174, 1247, 1257, 1575, 2227, 2231 margin, 2507 margin variable, 2669 marginal distribution, 80 marginal pdf, 80 Markov inequality, 109, 114, 122, 126, 552 Markov blanket, 1651, 1674, 1753 Markov chain, 39, 45, 1345, 1522, 1524, 1713, 1779 Markov chain Monte Carlo, 1110, 1258, 1333, 1347, 1405, 1515, 1722, 2834 Markov decision process, 1807, 1853, 1917, 1957, 1971, 2008, 2047

Markov inequality, 125 Markov network, 1754 Markov process, 1815 Markov random field, 1754 martingale, 126, 1402 martingale difference, 126 mask, 2839 Massart lemma, 2678, 2701 matching network, 3111, 3136–3138 matching pursuit, 2446 matrix completion, 542, 561, 873 matrix factorization, 562, 586, 2192, 2200, 2201 matrix inversion formula, 1134 max-sum algorithm, 1782 maximal clique, 1741 maximum a-posteriori estimator, 1113, 1247, 1257, 1575, 2227, 2231 maximum clique problem, 1734 maximum entropy, 192, 209, 235 maximum likelihood, 157, 1113, 1211, 1251, 1309, 1319, 1617, 1711, 1712, 2169, 2460 maximum mean discrepancy, 2962 maximum of means, 1263 maximum-phase system, 253 McCulloch–Pitts model, 2513, 2515 McDiarmid inequality, 110, 126, 2678, 2707 MCMC method, 1258, 1333, 1347, 1365, 1405, 1515, 1722, 2834 MDC, 2360 MDL, 1238 MDP, 1957 mean, 71, 106 mean absolute error, 1083, 1094 mean estimation, 1142, 1143 mean field approximation, 1413, 1723, 1724, 1727 mean field theory, 1467, 2907 mean function, 152, 2623 mean-square convergence, 119 mean-square deviation, 552, 575, 589, 713 mean-square-error, 244, 1053–1055, 1058, 1082, 1121, 1122, 2165 mean-square-error convergence, 244, 686, 735 mean-square-error estimation, 1053 mean-value theorem, 124, 330, 337, 591, 1269 measure of similarity, 2785 measurement update, 1171, 1175, 1383 median, 1094, 1095, 1312, 2206 median filter, 1300 medical diagnosis, 1677, 2341, 2346 memoryless channel, 1793 Mercer theorem, 153, 2592, 2635 message passing, 1378, 1682, 1684, 1760 meta learning, 2763, 2779, 2786, 3099

meta testing, 3127 meta training, 3125 method of moments, 1253 metric tensor, 225 metric space, 261 Metropolis–Hastings algorithm, 1333, 1339, 1347, 1405, 2832 Metropolis matrix, 1559 Metropolis ratio, 1341 Metropolis rule, 909 midpoint convex, 293 min-max problem, 2940 mini-batch, 550 mini-batch algorithm, 384 minimal I-map, 1655, 1676 minimum description length, 1238, 1253, 1254 minimum distance classifier, 2357, 2360, 2362 minimum mean-square-error, 1055, 1059, 1122, 2166 minimum-norm solution, 2171, 2210 minimum-phase system, 253 minimum-variance unbiased estimator, 1139, 1141, 1213, 1229 minorization–maximization, 2110 mirror descent, 391, 519, 524, 530, 540, 573, 771 mirror function, 521 mirror prox, 519, 771 misclassification, 1098 missed detection, 1115 missing data, 1283, 2355, 2379 mixing matrix, 1611 mixing probability, 1287 mixture model, 1276 mixture of learners, 2557, 2580 ML estimator, 1211 MMSE, 1055, 1059, 1122, 2166 MNIST dataset, 2275, 2282, 2443, 2760, 2822, 2829, 2884, 2927, 2934, 2958, 3059, 3086, 3111, 3134 mode, 1095 mode collapse, 2960, 2962 model-agnostic algorithm, 3130 model-based inference, 1111 model-based learning, 1918 modeling filter, 255, 1165 moment matching, 211, 1219, 1378 moment parameter, 181 momentum acceleration, 418, 426, 612 momentum method, 459, 614 momentum parameter, 614 monotone convergence theorem, 365 monotone function, 264 monotone gradient, 264 Monte Carlo method, 1333, 1921, 2834

Monte Carlo policy evaluation, 1920
Moore–Aronszajn theorem, 2637
moralization, 1744
Moré–Thuente method, 456
Moreau decomposition, 292
Moreau envelope, 342, 358, 363, 366
Moreau–Yosida envelope, 358
moving average, 608, 610, 1244
MRF, 1754
MSD, 552, 589
MSE, 1055, 1122
M-step, 1286
multi-agent system, 39, 45, 903
multi-armed bandits, 1997
multiclass classification, 1105, 2322, 2464, 2486, 2716, 2728, 3081
multiclass logistic regression, 2485, 2491
multilabel classification, 2716, 2728
multilayer network, 2722
multilayer perceptron, 2715
multinomial distribution, 171, 185, 1223, 1311, 1515, 2343, 2355
multinomial logistic regression, 2459, 2486, 2491
multinomial replacement, 1401
multiplicative update algorithm, 2439
multiplicity, 47
multistage decision process, 1848, 1893, 1894
multitask learning, 2716, 2729, 2765, 2779, 2785
mutual incoherence, 2448, 2451
mutual information, 231, 1622, 1623, 1625, 1699, 2320, 2336
MVUE estimator, 1139, 1213, 1229
myopic evaluation, 1823
Nadam, 618
Naïve Bayes, 1649
Naïve Bayes classifier, 2341
Naïve policy evaluation, 1921
NAND function, 2517, 2776
NASA, 2175, 2179
nat, 196, 1246
natural exponential family, 1226
natural gradient, 216, 233, 1459, 2072, 2083
natural gradient method, 420, 601, 2108
natural gradient policy, 2071
natural language processing, 1481, 1514, 2967, 2990, 2995
natural parameter, 167, 181
natural parameter space, 180
nearest neighbor classifier, 1105
nearest-neighbor rule, 1112, 2260, 2265, 2279, 2282, 2284, 2599

negative entropy function, 288, 294, 521, 525
negative hypothesis, 1115
negentropy, 1640
neocognitron, 2886
Nesterov momentum acceleration, 427, 615, 713
Netflix prize, 586, 2200
neural network, 557, 1112, 2499, 2715, 2779
neural Turing machine, 3136
neuron, 2716
Newton equations, 1170
Newton method, 217, 420, 426, 601, 1500, 1515
Newton-type recursion, 1628
Neyman–Pearson test, 1113
NMF, 2435
NN rule, 2260, 2279, 2282, 2284
no free lunch theorem, 2674
Nobel Prize, 1678, 1796
noise, 243, 1136, 1159
non-Bayesian inference, 1211
nonconvex optimization, 375, 852, 872
nonexpansive operator, 318, 348
nonexpansiveness, 762
nonlinear PCA, 1629
nonnegative matrix factorization, 2435, 2445
nonnegative orthant, 2436
nonnegative-definite matrix, 1, 5
nonparametric model, 2268
nonsingular code, 232
nonsmooth function, 341
NOR function, 2517
norm, 30
normal cone, 262, 293
normal distribution, 158
normal equations, 9, 1124, 1151, 2170
normalized mutual information, 2323
novel document detection, 2445
NP complete, 1734
NP complexity, 1734
NP-hard, 873, 921, 1734, 2280
nuclear norm, 33, 363, 542, 2210
null hypothesis, 1115
nullspace, 7, 944, 949
Occam razor principle, 1254, 2589
OCR, 2335
odds function, 1108, 2359
off-policy, 1335, 2105
off-policy learning, 1952, 1979, 1982, 2021, 2488
offset, 386, 1121, 2368
Oja rule, 2415, 2515, 2519
OMP, 50, 2448
on-policy, 1335
on-policy learning, 1952

one-hot encoding, 2315, 2931, 3067
one-point estimate, 433
one-shot learning, 3101
one-step reward, 1818
one-tailed test, 2183
one-versus-all, 2464, 2486, 2716, 3081
one-versus-one, 2464, 2486, 2716
online algorithm, 384, 547, 551, 730, 756, 779, 816, 1201
online Bayesian learning, 1375, 1378
online dictionary learning, 2425
online learning, 551
online mirror descent, 524, 530, 540
online policy evaluation, 1928
operator theory, 423
optical character recognition, 2335
optimal control, 1202
optimal importance sampling, 702
optimal policy, 1853
optimal trajectory, 1897
optimal transport, 2488
optimism in the face of uncertainty, 1991, 1997
optimistic initialization, 1985
OR function, 2517
oracle, 2474
Ornstein–Uhlenbeck kernel, 154, 159
orthant, 293
orthogonal complement space, 359, 2173, 2202
orthogonal matching pursuit, 50, 2445, 2448
orthogonal matrix, 3
orthogonal random variables, 92, 107
orthogonal vectors, 8
orthogonality condition, 2172, 2204
orthogonality principle, 1065, 1124, 1133
outlier, 1312, 1628, 2179, 2185, 2392, 2533
output gate, 3005
OvA strategy, 2464, 2486, 2716, 3081
over-determined least-squares, 2211
over-parameterized least-squares, 2195
overcomplete basis, 2243
overcomplete dictionary, 2427
overcomplete representation, 2747
overfitting, 1077, 1238, 1254, 2223, 2229, 2264, 2334, 2589, 2665, 3065
overflow, 1435, 2721
OvO strategy, 2464, 2468, 2486, 2716
PAC learning, 2660
PageRank, xxxi, 1531
pairwise independence, 1796
pairwise Markov network, 1749
Paley–Wiener condition, 256
parent node, 1651
Pareto optimality, 915, 941, 946

Pareto solution, 915, 946
parity-check code, 1792
Parseval relation, 2641
partial information, 1225, 1311
partial optimization, 325
partially-observable MDP, 1879
particle, 1335
particle filter, 1154, 1174, 1191, 1204, 1380, 1393, 1400
partition function, 168, 1743, 2810
party cocktail problem, 1610
passive learning, 1920
path analysis, 1678
payoff, 2940
PC algorithm, 1678
PCA, 1629, 2223, 2373, 2383, 2411, 2654, 2747, 2770
P complexity, 1734
pdf, 70
peephole LSTM, 3039
penalty-based optimization, 935
perceptron, xxx, 473, 568, 584, 1112, 2499, 2513, 2595
perceptron loss, 2614, 2639
perceptron risk, 376
Perron eigenvector, 39
Perron–Frobenius theorem, 45, 1533, 1553, 1555
Perron vector, 39, 50, 911, 944, 1532, 1556, 2020, 2044, 2053
PG-EXTRA algorithm, 999
PGM, 1647
phase transition, 1796
Phillips–Tikhonov regularization, 2242
photosynthesis, 111
piecewise linear, 1886
Pinsker inequality, 235
P-map, 1655
pmf, 69
pocket perceptron, 2509
pocket variable, 483, 497, 498, 738, 740, 2509
Poisson distribution, 173, 1260, 2835
Poisson equation, 1827, 1835, 1843
Poisson regression, 2492
Polak–Ribière algorithm, 457
Polak–Ribière–Powell algorithm, 457
policy evaluation, 1825
policy function, 1809
policy gradient, 215
policy gradient method, 2027, 2047, 2107
policy gradient theorem, 2057
policy improvement, 1878
policy iteration, 1866
Polyak momentum acceleration, 419, 426, 614, 713
Polyak–Ruppert averaging, 498, 560, 719

polygamma function, 1253
polyhedron, 293, 3083
polynomial kernel, 154, 2592–2594
polytree, 1697
POMDP, 1853, 1879, 1894
pool-based sampling, 2472
pooled statistics, 2358
pooling, 2860, 2887
positive hypothesis, 1115
positive orthant, 321
positive semi-definite matrix, 5
posterior, 78, 184
posterior distribution, 1319, 1320, 1352
potential function, 1740
power, 242
power of a test, 1115
power spectral density, 245, 246
power spectrum, 246
PPO, 2087, 2088, 2107
PPO-clip, 2090
PPO-penalty, 2088
precision matrix, 1356, 1707, 1746
preconditioned conjugate gradient, 452
preconditioning, 452
predictability minimization, 2961
predicted class, 3045, 3066
prediction, 1320, 1535, 1547, 1549
prediction step, 1175, 1384
predictive distribution, 1320, 1322, 1331, 1429, 2631
predictive modeling, 1068, 1070, 1319
predictor, 255
prefix code, 199
Prewitt mask, 2845
Price theorem, 162
Prim algorithm, 1734
primal feasible solution, 303
primal function, 313
primal graphical LASSO, 1736
primal method, 902, 935
primal problem, 312, 1844
primal variable, 303
primal–dual method, 903, 969
primal–dual problem, 311
primitive matrix, 39, 46, 49, 50, 905, 947, 1532, 1553, 1555
principal component, 2390
principal component analysis, 1629, 2223, 2373, 2383, 2411, 2654, 2747, 2770
principle of optimality, 1860, 1896
prior, 78, 184, 2227, 2232
privacy, 620, 903
privacy loss parameter, 629

probabilistic graphical model, 85, 1643, 1647, 1682
probabilistic PCA, 2404
probability density function, 70, 77
probability mass function, 69
probability of error, 1105
probability simplex, 293, 326, 1883
probit model, 1324, 1346, 1365, 1461, 1462, 2484
probit regression, 2485, 2491
product of experts, 2816, 2831
profit function, 282, 292
projected Bellman equation, 1843
projected gradient method, 515
projection, 2171
projection gradient method, 499, 515, 540, 571, 579, 769
projection matrix, 2171
projection onto convex sets, 315, 323, 325, 423
projection pursuit, 1634, 2446
projection subgradient method, 499, 754
proper function, 263
proper random variable, 1084
proportionate LMS, 627
proposal distribution, 1334
protein folding, 101
prototypical network, 3136, 3137, 3144
proximal AVRG algorithm, 831
proximal coordinate descent, 512, 573
proximal decentralized algorithm, 999
proximal EXACT diffusion, 1000
proximal gradient algorithm, 366, 507, 539
proximal LASSO, 541
proximal learning, 507
proximal logistic regression, 510, 2459
proximal operator, 341, 762
proximal point algorithm, 348, 360, 539
proximal policy optimization, 2074, 2087
proximal projection, 507
proximal SAGA algorithm, 802
proximal SVRG algorithm, 805
proximity operator, 341
pruning, 2334, 2335
pseudo-inverse, 21, 2171, 2210
pseudocount, 1997
pure distribution, 2325
Pythagoras relation, 321
Q-function, 164, 1261
Q-learning, 1982, 1996, 2043, 2106
QPSK, 107
QR decomposition, 18, 43, 2229
QR method, 18, 43, 2198, 2203
quadratic loss, 269, 2577, 2614, 2639

quadratic program, 304, 326
quantum mechanics, 43
quasi-Newton method, 421
query set, 3121
query variable, 1683
Rademacher complexity, 2678, 2685, 2701
Rademacher distribution, 2679
radial basis function, 153, 1841, 2594
Radon theorem, 2681
random forest, 2336, 2560, 2581
random partitioning, 2273
random reshuffling, 552, 652, 698, 782, 816, 818
random subsampling, 2186
random variable, 68, 71
random variable transformation, 1348
random vector, 93, 108
random walk, 102, 110
randomized algorithm, 118, 2186, 2200
randomized coordinate descent, 405, 429, 512, 573, 690
range space, 7
rank deficient, 8
Rao–Blackwell theorem, 1252, 1261
ratings matrix, 561
raw data, 621
Rayleigh distribution, 74, 114, 1095
Rayleigh–Ritz ratio, 3, 40, 1150, 2515
RBF kernel, 2594
RBM, 1749, 2802
re-parameterization trick, 2911
real-time recurrent learning, 3037
reasoning under uncertainty, 1647
receptive field, 2831, 2886
recommender system, 542, 561, 586, 2192, 2200, 2929
rectification layer, 2887
rectifier linear function, 2718, 2778, 2782, 2887
recurrent neural network, 2715, 2967, 3036
recursive least-squares, xxix, 2187
reduced-error pruning, 2335
reference input, 3062
region of convergence, 246
regression, 1121, 1169, 1244, 2165, 2198, 2728
regret, 390, 482, 513, 579, 633, 690, 737, 1998
regret analysis, 390, 482, 513, 576, 632, 690, 737
regularization, 376, 1258, 2174, 2221, 2225, 2226, 2230, 2242, 2461, 2504
regularized least-squares, 2243
REINFORCE algorithm, 2060, 2107
reinforcement learning, 215, 234, 1263, 1335, 1807, 1853, 1917, 1971, 2008, 2047
Relation network, 2786, 3112, 3136

relation score, 2786, 3113
relative entropy, 204, 232
relative interior, 291
relative-degree rule, 945
relaxation method, 2518, 2553
relaxation parameter, 365
relevance, 3048
relevance analysis, 3050, 3062
relevance score, 3042, 3050
ReLU, 2718
Renyi divergence, 232
Renyi entropy, 234
replay buffer, 1992, 1996, 2105
replication, 1482
Representer theorem, 2613, 2637, 2638
reproducing kernel Hilbert space, 2614, 2637, 2638
Reptile algorithm, 3123, 3132, 3136
resampling, 1390
residual replacement, 1401
residual resampling, 1391
resolvent, 360
responsibility score, 1288, 1289
restricted Boltzmann machine, 111, 1111, 1749, 2769, 2797, 2802, 2820
restricted isometry property, 50
restricted strongly convex, 1002
reward function, 1808
RGB, 1299, 2299, 2856
ρ-norm, 34
Riccati recursion, 1161, 1163, 1202
ridge analysis, 2242
ridge regression, 2226, 2242, 2610
Riemann tensor, 225
Riemannian geometry, 222, 234
Riesz representation theorem, 2778
right-stochastic, 1522
RIP, 50
risk, 375, 1054, 1092
RKHS, 2637
RLS algorithm, 2187, 2197
RMSprop, 599, 607, 626
RNN, 2715, 2967, 3036
robotic swarm, 903
robotics, 1957, 2107
robust estimator, 1309
robust least-squares, 2243, 2249
robust PCA, 2412, 2416
robust statistics, 1309, 1312
robustness, 903
root finding, 582
root node, 1651, 2313
RTRL, 3037
rule of succession, 2352
saddle point, 601, 857, 971, 976, 979, 1845
saddle point problem, 311

SAG algorithm, 806, 809, 834
SAGA algorithm, 785, 789, 822
saliency analysis, 3050
saliency map, 3042, 3075
sample mean estimator, 1142
sample selection, 552
sample space, 69, 109
sampling, 1333, 1347, 2953
sampling importance resampling filter, 1396
sampling with replacement, 649, 651, 779, 782
sampling without replacement, 649, 652, 782, 816, 818
SARSA learning, 2026, 2063, 2106
SARSA(0) algorithm, 1971, 1973, 2028
SARSA(λ) algorithm, 1977, 2029
Sauer lemma, 2521, 2673, 2683, 2688, 2690
SBEED algorithm, 2105
scheduling, 1771
Schur algorithm, 255
Schur complement, 11, 41, 146
Schur matrix, 1851
SCIgen, 1552
scope, 1742
score function, 181, 1230, 1456, 1458, 2050
sea level, 2175
secant equation, 422
second law of thermodynamics, 111
second-order moment, 72
second-order stationarity, 855, 873
seed, 2472
seismic signal, 2243
self-adjoint operator, 43, 2413
self-loop, 904
self-organizing map, 1112, 2290, 2310
semi-martingale process, 721
semi-supervised learning, 2274
sensitivity, 629
sensitivity analysis, 3042, 3046
sensitivity factor, 2733, 2888, 3071
sensitivity vector, 2917, 2948
sentence analysis, 2988
sentiment analysis, 2987, 2988, 3021
separable data, 2588
separating hyperplane theorem, 2516
separating matrix, 1611
separation theorem, 2516
seq2seq, 2990
sequence classification, 2987, 3021
sequential data, 2967
sequential importance sampling, 1389
sequential Monte Carlo, 1380
server, 621
SGA, 551
Shafer–Shenoy algorithm, 1795

Shannon entropy, 234, 2353
shape parameter, 1096
shatter coefficient, 2523, 2683, 2688, 2690
shattering, 2662, 2689
short-term model, 878
shrinkage, 341, 343, 2225, 2228
Siamese network, 2785, 2786, 3101, 3136
sifting property, 70, 2284
sigma point, 1194
sigmoid function, 134, 1325, 2718, 2778
signal-to-noise ratio, 308
signature, 2785
significance level, 2183
similarity, 47, 2785
similarity score, 3141
simplex, 321, 324, 326, 521
singular value, 20
singular value decomposition, 20, 43, 52, 1638, 2196, 2211, 2392
singular vector, 20
sink node, 907
SIR filter, 1396
site function, 1354
skeleton, 1676
sketching, 2185, 2200
skewness, 1640
slack variable, 2534, 2552
Slater condition, 314
Slutsky theorem, 116, 1339
smooth function, 332, 379
smooth label, 2960
smooth maximum, 2100
smoother, 255
smoothing, 1181, 2213
smoothing filter, 1181, 2212
SNR, 1168
Sobel mask, 2845
social media, 1481, 1514
soft assignment, 2272
soft Bellman condition, 2033
soft Bellman equation, 2100
soft Bellman optimality equation, 2100
soft clustering, 1472
soft decision, 1060, 1062
soft label, 2960, 3092
soft learning, 2098
soft Q-learning, 2103
soft thresholding, 325, 341, 343, 349, 358, 932, 2234
soft-impute algorithm, 542
soft-margin SVM, 2530, 2533
softmax, 1988, 2047, 2486, 2720, 2730, 2736, 2769, 3045, 3066, 3101
softmax exploration, 1987
softmax layer, 2971, 3024
softmax policy, 1988, 2047

softmax regression problem, 2486
softplus, 2719, 2808, 3096
solitaire, 1957, 2107
SOM, 1112, 2290
source coding theorem, 198
source distribution, 2478
source node, 907
spam filtering, 1551, 2341, 2350
spanning tree, 1698
spark, 50, 2448
sparse coding, 2427
sparse dictionary learning, 2427
sparse PCA, 2399, 2412
sparse reward, 1997
sparse signal recovery, 2445, 2449
sparse vector, 2425
sparsity, 2242
spectrahedron, 543
spectral decomposition, 3
spectral factor, 253
spectral factorization, 252, 253, 255
spectral norm, 32
spectral theorem, 3, 42, 50, 192
speech processing, 2838, 2967
speech recognition, 1595
sphere packing problem, 2280
spherically invariant distribution, 156
spin glass, 1467
spin-glass energy function, 1796
square-root matrix, 22, 912
squared exponential kernel, 153, 2594
stabilizing solution, 1178, 1203
stacked autoencoders, 2797, 2830
standard deviation, 72
standard gradient policy, 2062
state, 2968
state estimator, 1882
state trajectory, 1382
state transition matrix, 1689
state transition probability, 1816
state value, 1827
state value function, 1807, 1821, 1825, 1853
state variable, 1519
state vector, 1159
state–action value function, 1807, 1821, 1833, 1853
state-space model, 1154, 1159, 1202
stationarity, 241, 243
stationary probability distribution, 1556
stationary random process, 241, 243
statistical inference, 1116
statistical learning theory, 2658
statistical mechanics, 99, 231
statistical physics, 1467, 1796
steady-state distribution, 1345
steepest descent, 396, 418

Stein equation, 26
Stein lemma, 143, 159, 164, 165, 182, 1361
step size, 382
stochastic approximation, 549, 582
stochastic conjugate gradient, 574, 705
stochastic coordinate descent, 573, 764, 776
stochastic difference equation, 1961
stochastic Fletcher–Reeves, 574
stochastic game, 1893
stochastic gradient algorithm, 384, 547, 548, 683, 860
stochastic gradient TD, 2008
stochastic matrix, 45, 1522
stochastic mirror descent, 573, 756
stochastic optimization, 551
stochastic projection gradient, 571, 756, 776
stochastic proximal algorithm, 512, 569, 756, 775
stochastic risk, 377, 642, 683, 730
stochastic risk minimization, 377, 852
stochastic subgradient algorithm, 565, 730, 775
stochastic variational inference, 1460
streaming data, 552, 649, 653
strict saddle point, 857
strictly convex function, 265
strictly monotone function, 265
stride, 2843
strong duality, 314, 987
strong law of large numbers, 112, 121, 257
strongly connected graph, 905, 947, 1558
strongly convex function, 266
strongly monotone function, 267
structural risk minimization, 2550
Student t-distribution, 2182
stump, 2317, 2319
sub-Gaussianity, 1627
subdifferential, 272
subgradient, 272, 471
subgradient method, 471, 496
sublinear convergence, 392
sublinear rate, 497
submultiplicative property, 30
subset sum problem, 1734
successive projection method, 423
sufficient statistic, 187, 1227, 1228, 1251, 1252
sum-product algorithm, 1682, 1684, 1761
sun problem, 2355
super-Gaussianity, 1627
supervised learning, 1917, 1971, 2008, 2047
support function, 297
support set, 3101, 3121
support vector, 2533, 2539, 2544
support vector machine, 322, 473, 750, 1112, 2201, 2353, 2486, 2499, 2530, 2551, 2669, 3078
supporting hyperplane theorem, 2516
surrogate loss function, 2668, 2677, 2680
surrogate model, 3088
surrogate risk function, 2668, 2677
survival of the fittest, 1391
suspicious behavior, 3089
SVD, 20, 43, 52, 1638, 2211, 2392
SVM, 474, 2486, 2530, 2551
SVM for regression, 2540
SVRG algorithm, 793, 830
Sylvester equation, 27
Sylvester law of inertia, 13
symmetric distribution, 2698
symmetric matrix, 1
symmetrization, 2695
synaptic weight, 2514
synthetic image, 2960
systematic replacement, 1401
systematic resampling, 1402
tanh function, 2718
Tanner graph, 1793
target class, 3068
target distribution, 1336, 2478
target misclassification, 3066, 3070
target model, 3125
target policy, 1952, 1979, 2020
target tracking, 1170
target variable, 2540
targeted attack, 3068
Taylor series, 420, 807
Taylor theorem, 814
t-distribution, 2183
TD learning, 1928, 2008, 2018, 2063, 2106
TD(0) algorithm, 1928, 2010
TD(λ) algorithm, 1940, 1958, 2014
TD-Gammon, 1957
telescoping sequence, 518
telescoping sum, 724
temporal difference learning, 1928, 1931, 1936
tensor, 1300, 2857
tensor decomposition, 873
terminal node, 2313
tessellation, 2280
test data, 2222
test error, 2661
text processing, 2551
text segmentation, 2341
Thompson sampling, 1997
Tikhonov regularization, 2242
time average, 243
time update, 1171, 1175, 1384
topic modeling, 174, 1440, 1472, 1677, 2435

topology-preserving mapping, 2291
torque, 72
torus, 225
total relevance, 3055
total variation, 233, 500, 541, 587, 750
trace, 24, 47
Tracy–Singh product, 41
training data, 2167, 2222
training error, 2172, 2263
transfer learning, 2478, 2763, 2779, 3099
transferability, 3089, 3094
Transformer, 3037
transition probability, 1382, 1808
translation equivariance, 2886
translation invariance, 2839, 2886
tree, 1697, 2313
triangle inequality, 30
triangular factorization, 12
triangulated graph, 1795
trigamma function, 1500
TRPO, 2077, 2107, 2110
true class, 3045
true online TD(λ) algorithm, 1949, 1958, 2015
truncated Gaussian, 1426, 1469
trust region optimization, 2077
trust region policy optimization, 2074
turbo code, 1677
Turing machine, 3136
two timescale, 942
two-player game, 2943
two-point estimate, 433
type-I error, 1115
type-II error, 1115
UDA algorithm, 983
UKF algorithm, 1194, 1203
unbiased estimator, 1140, 1213, 1237
uncertainty, 1647
uncertainty sampling, 2473
uncorrelated random variables, 92, 107
under-determined least-squares, 2195, 2211
undercomplete dictionary, 2427
underfitting, 1077, 2229, 2588, 2665
underflow, 1435, 2721
undirected graph, 904, 1378, 1643, 1740
unified decentralized algorithm, 983
unified distance matrix, 2305
uniform sampling, 552, 651, 779, 782
uniformly continuous function, 338
uninformative prior, 1273
uniquely decodable code, 232
unit-circle controllability, 1203
unitary matrix, 40
universal approximation theorem, 2777, 2778, 2797

universal consistency, 2673
unlabeled data, 2473
unmixing matrix, 1611
unscented Kalman filter, 1194, 1203
unsupervised clustering, 1472
unsupervised learning, 1290, 2290, 2745, 2769
unsupervised method, 1609, 2383
untargeted attack, 3068
upper confidence bound, 1991
user–item matrix, 561
VAE, 2905
validation, 1535, 1547
validation set, 2547
value function, 1821, 1825
value function approximation, 2008
value iteration, 1853, 1858, 1886
value vector, 1829
vanilla gradient, 2072
vanishing gradient problem, 584, 2743, 2768, 2778, 3004, 3050
vanishing step size, 392, 489, 514, 747
Vapnik–Chervonenkis bound, 2523, 2553, 2658, 2672, 2694
variable elimination method, 1686, 1691
variable step-size LMS, 627
variance, 71, 106
variance reduction, 779, 789
variance within, 2367
variance-reduced algorithm, 660, 779, 789, 816
variational autoencoder, 563, 1111, 2905
variational factor, 1411
variational inference, 234, 1070, 1110, 1258, 1405, 1467, 1711, 1723, 1727, 2907
VC dimension, 2553, 2662, 2690
vector estimation, 1131
vector quantization, 2306
vector space, 2637
velocity vector, 614
vertex, 904
video analysis, 2967
Viterbi algorithm, 1517, 1534, 1537, 1563, 1574, 1596, 1604, 1786, 1894
volatility model, 1401
von Neumann trace inequality, 47
Voronoi cell, 2283
Voronoi diagram, 2268, 2280
voting model, 1429
Wasserstein GAN, 2962, 2964
Wasserstein distance, 2962
water filling, 308
weak classifier, 2562
weak convergence, 119
weak duality, 314, 327

weak law of large numbers, 112, 117, 121, 164
weak saddle point, 857
weak union, 89
web crawler, 1531
web search, 1531, 1551
weight matrix, 909
weighted graph, 1734
weighted least-squares, 2169, 2174, 2177, 2204, 2243
weighted nearest neighbor rule, 2264, 2266
Welch bound, 2448
Weyl theorem, 37, 45
white noise, 243, 1159
white-box attack, 3068
whitening, 1616
whitening filter, 255, 1165
wide-sense stationary process, 241
Wiener filter, 1177, 1180
Wiener process, 2416

Wiener–Hopf technique, 1143
winning neuron, 2293
Wishart distribution, 2205
Witness algorithm, 1894
Wolfe conditions, 455, 468
XNOR, 2517
XOR, 2517, 2527, 2591, 2715, 2781
Yahoo music dataset, 586
Young inequality, 636
zero-shot learning, 3136
zero-sum game, 2940
zeroth-order optimization, 425, 433, 585
ZIP codes, 2885
Zoutendijk condition, 461, 468, 709
z-spectrum, 246, 1180