Inference and Learning from Data: Volume 3: Learning [New ed.] 100921828X, 9781009218283

This extraordinary three-volume work, written in an engaging and rigorous style by a world authority in the field, provides an accessible, comprehensive introduction to the full spectrum of mathematical and statistical techniques underpinning contemporary methods in data-driven learning and inference.


English Pages 990 [1082] Year 2023


Table of contents:
Cover
Half-title
Title page
Copyright information
Dedication
Contents
Preface
P.1 Emphasis on Foundations
P.2 Glimpse of History
P.3 Organization of the Text
P.4 How to Use the Text
P.5 Simulation Datasets
P.6 Acknowledgments
Notation
50 Least-Squares Problems
50.1 Motivation
50.2 Normal Equations
50.3 Recursive Least-Squares
50.4 Implicit Bias
50.5 Commentaries and Discussion
Problems
50.A Minimum-Norm Solution
50.B Equivalence in Linear Estimation
50.C Extended Least-Squares
References
51 Regularization
51.1 Three Challenges
51.2 ℓ2-Regularization
51.3 ℓ1-Regularization
51.4 Soft Thresholding
51.5 Commentaries and Discussion
Problems
51.A Constrained Formulations for Regularization
51.B Expression for LASSO Solution
References
52 Nearest-Neighbor Rule
52.1 Bayes Classifier
52.2 k-NN Classifier
52.3 Performance Guarantee
52.4 k-Means Algorithm
52.5 Commentaries and Discussion
Problems
52.A Performance of the NN Classifier
References
53 Self-Organizing Maps
53.1 Grid Arrangements
53.2 Training Algorithm
53.3 Visualization
53.4 Commentaries and Discussion
Problems
References
54 Decision Trees
54.1 Trees and Attributes
54.2 Selecting Attributes
54.3 Constructing a Tree
54.4 Commentaries and Discussion
Problems
References
55 Naïve Bayes Classifier
55.1 Independence Condition
55.2 Modeling the Conditional Distribution
55.3 Estimating the Priors
55.4 Gaussian Naïve Classifier
55.5 Commentaries and Discussion
Problems
References
56 Linear Discriminant Analysis
56.1 Discriminant Functions
56.2 Linear Discriminant Algorithm
56.3 Minimum Distance Classifier
56.4 Fisher Discriminant Analysis
56.5 Commentaries and Discussion
Problems
References
57 Principal Component Analysis
57.1 Data Preprocessing
57.2 Dimensionality Reduction
57.3 Subspace Interpretations
57.4 Sparse PCA
57.5 Probabilistic PCA
57.6 Commentaries and Discussion
Problems
57.A Maximum-Likelihood Solution
57.B Alternative Optimization Problem
References
58 Dictionary Learning
58.1 Learning Under Regularization
58.2 Learning Under Constraints
58.3 K-SVD Approach
58.4 Nonnegative Matrix Factorization
58.5 Commentaries and Discussion
Problems
58.A Orthogonal Matching Pursuit
References
59 Logistic Regression
59.1 Logistic Model
59.2 Logistic Empirical Risk
59.3 Multiclass Classification
59.4 Active Learning
59.5 Domain Adaptation
59.6 Commentaries and Discussion
Problems
59.A Generalized Linear Models
References
60 Perceptron
60.1 Linear Separability
60.2 Perceptron Empirical Risk
60.3 Termination in Finite Steps
60.4 Pocket Perceptron
60.5 Commentaries and Discussion
Problems
60.A Counting Theorem
60.B Boolean Functions
References
61 Support Vector Machines
61.1 SVM Empirical Risk
61.2 Convex Quadratic Program
61.3 Cross Validation
61.4 Commentaries and Discussion
Problems
References
62 Bagging and Boosting
62.1 Bagging Classifiers
62.2 AdaBoost Classifier
62.3 Gradient Boosting
62.4 Commentaries and Discussion
Problems
References
63 Kernel Methods
63.1 Motivation
63.2 Nonlinear Mappings
63.3 Polynomial and Gaussian Kernels
63.4 Kernel-Based Perceptron
63.5 Kernel-Based SVM
63.6 Kernel-Based Ridge Regression
63.7 Kernel-Based Learning
63.8 Kernel PCA
63.9 Inference under Gaussian Processes
63.10 Commentaries and Discussion
Problems
References
64 Generalization Theory
64.1 Curse of Dimensionality
64.2 Empirical Risk Minimization
64.3 Generalization Ability
64.4 VC Dimension
64.5 Bias–Variance Trade-off
64.6 Surrogate Risk Functions
64.7 Commentaries and Discussion
Problems
64.A VC Dimension for Linear Classifiers
64.B Sauer Lemma
64.C Vapnik–Chervonenkis Bound
64.D Rademacher Complexity
References
65 Feedforward Neural Networks
65.1 Activation Functions
65.2 Feedforward Networks
65.3 Regression and Classification
65.4 Calculation of Gradient Vectors
65.5 Backpropagation Algorithm
65.6 Dropout Strategy
65.7 Regularized Cross-Entropy Risk
65.8 Slowdown in Learning
65.9 Batch Normalization
65.10 Commentaries and Discussion
Problems
65.A Derivation of Batch Normalization Algorithm
References
66 Deep Belief Networks
66.1 Pre-Training Using Stacked Autoencoders
66.2 Restricted Boltzmann Machines
66.3 Contrastive Divergence
66.4 Pre-Training using Stacked RBMs
66.5 Deep Generative Model
66.6 Commentaries and Discussion
Problems
References
67 Convolutional Networks
67.1 Correlation Layers
67.2 Pooling
67.3 Full Network
67.4 Training Algorithm
67.5 Commentaries and Discussion
Problems
67.A Derivation of Training Algorithm
References
68 Generative Networks
68.1 Variational Autoencoders
68.2 Training Variational Autoencoders
68.3 Conditional Variational Autoencoders
68.4 Generative Adversarial Networks
68.5 Training of GANs
68.6 Conditional GANs
68.7 Commentaries and Discussion
Problems
References
69 Recurrent Networks
69.1 Recurrent Neural Networks
69.2 Backpropagation Through Time
69.3 Bidirectional Recurrent Networks
69.4 Vanishing and Exploding Gradients
69.5 Long Short-Term Memory Networks
69.6 Bidirectional LSTMs
69.7 Gated Recurrent Units
69.8 Commentaries and Discussion
Problems
References
70 Explainable Learning
70.1 Classifier Model
70.2 Sensitivity Analysis
70.3 Gradient X Input Analysis
70.4 Relevance Analysis
70.5 Commentaries and Discussion
Problems
References
71 Adversarial Attacks
71.1 Types of Attacks
71.2 Fast Gradient Sign Method
71.3 Jacobian Saliency Map Approach
71.4 DeepFool Technique
71.5 Black-Box Attacks
71.6 Defense Mechanisms
71.7 Commentaries and Discussion
Problems
References
72 Meta Learning
72.1 Network Model
72.2 Siamese Networks
72.3 Relation Networks
72.4 Exploration Models
72.5 Commentaries and Discussion
Problems
72.A Matching Networks
72.B Prototypical Networks
References
Author Index
Subject Index

Inference and Learning from Data
Volume III

This extraordinary three-volume work, written in an engaging and rigorous style by a world authority in the field, provides an accessible, comprehensive introduction to the full spectrum of mathematical and statistical techniques underpinning contemporary methods in data-driven learning and inference. This final volume, Learning, builds on the foundational topics established in Volume I to provide a thorough introduction to learning methods, addressing techniques such as least-squares methods, regularization, online learning, kernel methods, generalization theory, feedforward, convolutional, recurrent, and generative neural networks, meta learning, explainable learning, and adversarial attacks. A consistent structure and pedagogy is employed throughout this volume to reinforce student understanding, with over 350 end-of-chapter problems (including solutions for instructors), 100 solved examples, 280 figures, datasets, and downloadable Matlab code. Supported by sister volumes Foundations and Inference, and unique in its scale and depth, this textbook sequence is ideal for early-career researchers and graduate students across many courses in signal processing, machine learning, statistical analysis, data science, and inference.

Ali H. Sayed is Professor and Dean of Engineering at École Polytechnique Fédérale de Lausanne (EPFL), Switzerland. He has also served as Distinguished Professor and Chairman of Electrical Engineering at the University of California, Los Angeles (UCLA), USA, and as President of the IEEE Signal Processing Society. He is a member of the US National Academy of Engineering (NAE) and The World Academy of Sciences (TWAS), and a recipient of several awards, including the 2022 IEEE Fourier Award and the 2020 IEEE Norbert Wiener Society Award. He is a Fellow of the IEEE, EURASIP, and AAAS.

Inference and Learning from Data
Volume III: Learning

ALI H. SAYED
École Polytechnique Fédérale de Lausanne
University of California at Los Angeles

Shaftesbury Road, Cambridge CB2 8EA, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
314–321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre, New Delhi – 110025, India
103 Penang Road, #05–06/07, Visioncrest Commercial, Singapore 238467

Cambridge University Press is part of Cambridge University Press & Assessment, a department of the University of Cambridge. We share the University’s mission to contribute to society through the pursuit of education, learning and research at the highest international levels of excellence.

www.cambridge.org
Information on this title: www.cambridge.org/highereducation/isbn/9781009218283
DOI: 10.1017/9781009218276

© Ali H. Sayed 2023

This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press & Assessment.

First published 2023
Printed in the United Kingdom by Bell and Bain Ltd
A catalogue record for this publication is available from the British Library

ISBN 978-1-009-21810-8 Hardback (3-volume set)
ISBN 978-1-009-21812-2 Hardback (Volume I)
ISBN 978-1-009-21826-9 Hardback (Volume II)
ISBN 978-1-009-21828-3 Hardback (Volume III)

Additional resources for this publication at www.cambridge.org/sayed-vol3.

Cambridge University Press & Assessment has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this publication and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.

In loving memory of my parents

Contents

VOLUME I: FOUNDATIONS

Preface: P.1 Emphasis on Foundations; P.2 Glimpse of History; P.3 Organization of the Text; P.4 How to Use the Text; P.5 Simulation Datasets; P.6 Acknowledgments
Notation

1 Matrix Theory – 1.1 Symmetric Matrices; 1.2 Positive-Definite Matrices; 1.3 Range Spaces and Nullspaces; 1.4 Schur Complements; 1.5 Cholesky Factorization; 1.6 QR Decomposition; 1.7 Singular Value Decomposition; 1.8 Square-Root Matrices; 1.9 Kronecker Products; 1.10 Vector and Matrix Norms; 1.11 Perturbation Bounds on Eigenvalues; 1.12 Stochastic Matrices; 1.13 Complex-Valued Matrices; 1.14 Commentaries and Discussion; Problems; 1.A Proof of Spectral Theorem; 1.B Constructive Proof of SVD; References
2 Vector Differentiation – 2.1 Gradient Vectors; 2.2 Hessian Matrices; 2.3 Matrix Differentiation; 2.4 Commentaries and Discussion; Problems; References
3 Random Variables – 3.1 Probability Density Functions; 3.2 Mean and Variance; 3.3 Dependent Random Variables; 3.4 Random Vectors; 3.5 Properties of Covariance Matrices; 3.6 Illustrative Applications; 3.7 Complex-Valued Variables; 3.8 Commentaries and Discussion; Problems; 3.A Convergence of Random Variables; 3.B Concentration Inequalities; References
4 Gaussian Distribution – 4.1 Scalar Gaussian Variables; 4.2 Vector Gaussian Variables; 4.3 Useful Gaussian Manipulations; 4.4 Jointly Distributed Gaussian Variables; 4.5 Gaussian Processes; 4.6 Circular Gaussian Distribution; 4.7 Commentaries and Discussion; Problems; References
5 Exponential Distributions – 5.1 Definition; 5.2 Special Cases; 5.3 Useful Properties; 5.4 Conjugate Priors; 5.5 Commentaries and Discussion; Problems; 5.A Derivation of Properties; References
6 Entropy and Divergence – 6.1 Information and Entropy; 6.2 Kullback–Leibler Divergence; 6.3 Maximum Entropy Distribution; 6.4 Moment Matching; 6.5 Fisher Information Matrix; 6.6 Natural Gradients; 6.7 Evidence Lower Bound; 6.8 Commentaries and Discussion; Problems; References
7 Random Processes – 7.1 Stationary Processes; 7.2 Power Spectral Density; 7.3 Spectral Factorization; 7.4 Commentaries and Discussion; Problems; References
8 Convex Functions – 8.1 Convex Sets; 8.2 Convexity; 8.3 Strict Convexity; 8.4 Strong Convexity; 8.5 Hessian Matrix Conditions; 8.6 Subgradient Vectors; 8.7 Jensen Inequality; 8.8 Conjugate Functions; 8.9 Bregman Divergence; 8.10 Commentaries and Discussion; Problems; References
9 Convex Optimization – 9.1 Convex Optimization Problems; 9.2 Equality Constraints; 9.3 Motivating the KKT Conditions; 9.4 Projection onto Convex Sets; 9.5 Commentaries and Discussion; Problems; References
10 Lipschitz Conditions – 10.1 Mean-Value Theorem; 10.2 δ-Smooth Functions; 10.3 Commentaries and Discussion; Problems; References
11 Proximal Operator – 11.1 Definition and Properties; 11.2 Proximal Point Algorithm; 11.3 Proximal Gradient Algorithm; 11.4 Convergence Results; 11.5 Douglas–Rachford Algorithm; 11.6 Commentaries and Discussion; Problems; 11.A Convergence under Convexity; 11.B Convergence under Strong Convexity; References
12 Gradient-Descent Method – 12.1 Empirical and Stochastic Risks; 12.2 Conditions on Risk Function; 12.3 Constant Step Sizes; 12.4 Iteration-Dependent Step-Sizes; 12.5 Coordinate-Descent Method; 12.6 Alternating Projection Algorithm; 12.7 Commentaries and Discussion; Problems; 12.A Zeroth-Order Optimization; References
13 Conjugate Gradient Method – 13.1 Linear Systems of Equations; 13.2 Nonlinear Optimization; 13.3 Convergence Analysis; 13.4 Commentaries and Discussion; Problems; References
14 Subgradient Method – 14.1 Subgradient Algorithm; 14.2 Conditions on Risk Function; 14.3 Convergence Behavior; 14.4 Pocket Variable; 14.5 Exponential Smoothing; 14.6 Iteration-Dependent Step Sizes; 14.7 Coordinate-Descent Algorithms; 14.8 Commentaries and Discussion; Problems; 14.A Deterministic Inequality Recursion; References
15 Proximal and Mirror-Descent Methods – 15.1 Proximal Gradient Method; 15.2 Projection Gradient Method; 15.3 Mirror-Descent Method; 15.4 Comparison of Convergence Rates; 15.5 Commentaries and Discussion; Problems; References
16 Stochastic Optimization – 16.1 Stochastic Gradient Algorithm; 16.2 Stochastic Subgradient Algorithm; 16.3 Stochastic Proximal Gradient Algorithm; 16.4 Gradient Noise; 16.5 Regret Analysis; 16.6 Commentaries and Discussion; Problems; 16.A Switching Expectation and Differentiation; References
17 Adaptive Gradient Methods – 17.1 Motivation; 17.2 AdaGrad Algorithm; 17.3 RMSprop Algorithm; 17.4 ADAM Algorithm; 17.5 Momentum Acceleration Methods; 17.6 Federated Learning; 17.7 Commentaries and Discussion; Problems; 17.A Regret Analysis for ADAM; References
18 Gradient Noise – 18.1 Motivation; 18.2 Smooth Risk Functions; 18.3 Gradient Noise for Smooth Risks; 18.4 Nonsmooth Risk Functions; 18.5 Gradient Noise for Nonsmooth Risks; 18.6 Commentaries and Discussion; Problems; 18.A Averaging over Mini-Batches; 18.B Auxiliary Variance Result; References
19 Convergence Analysis I: Stochastic Gradient Algorithms – 19.1 Problem Setting; 19.2 Convergence under Uniform Sampling; 19.3 Convergence of Mini-Batch Implementation; 19.4 Convergence under Vanishing Step Sizes; 19.5 Convergence under Random Reshuffling; 19.6 Convergence under Importance Sampling; 19.7 Convergence of Stochastic Conjugate Gradient; 19.8 Commentaries and Discussion; Problems; 19.A Stochastic Inequality Recursion; 19.B Proof of Theorem 19.5; References
20 Convergence Analysis II: Stochastic Subgradient Algorithms – 20.1 Problem Setting; 20.2 Convergence under Uniform Sampling; 20.3 Convergence with Pocket Variables; 20.4 Convergence with Exponential Smoothing; 20.5 Convergence of Mini-Batch Implementation; 20.6 Convergence under Vanishing Step Sizes; 20.7 Commentaries and Discussion; Problems; References
21 Convergence Analysis III: Stochastic Proximal Algorithms – 21.1 Problem Setting; 21.2 Convergence under Uniform Sampling; 21.3 Convergence of Mini-Batch Implementation; 21.4 Convergence under Vanishing Step Sizes; 21.5 Stochastic Projection Gradient; 21.6 Mirror-Descent Algorithm; 21.7 Commentaries and Discussion; Problems; References
22 Variance-Reduced Methods I: Uniform Sampling – 22.1 Problem Setting; 22.2 Naïve Stochastic Gradient Algorithm; 22.3 Stochastic Average-Gradient Algorithm (SAGA); 22.4 Stochastic Variance-Reduced Gradient Algorithm (SVRG); 22.5 Nonsmooth Risk Functions; 22.6 Commentaries and Discussion; Problems; 22.A Proof of Theorem 22.2; 22.B Proof of Theorem 22.3; References
23 Variance-Reduced Methods II: Random Reshuffling – 23.1 Amortized Variance-Reduced Gradient Algorithm (AVRG); 23.2 Evolution of Memory Variables; 23.3 Convergence of SAGA; 23.4 Convergence of AVRG; 23.5 Convergence of SVRG; 23.6 Nonsmooth Risk Functions; 23.7 Commentaries and Discussion; Problems; 23.A Proof of Lemma 23.3; 23.B Proof of Lemma 23.4; 23.C Proof of Theorem 23.1; 23.D Proof of Lemma 23.5; 23.E Proof of Theorem 23.2; References
24 Nonconvex Optimization – 24.1 First- and Second-Order Stationarity; 24.2 Stochastic Gradient Optimization; 24.3 Convergence Behavior; 24.4 Commentaries and Discussion; Problems; 24.A Descent in the Large Gradient Regime; 24.B Introducing a Short-Term Model; 24.C Descent Away from Strict Saddle Points; 24.D Second-Order Convergence Guarantee; References
25 Decentralized Optimization I: Primal Methods – 25.1 Graph Topology; 25.2 Weight Matrices; 25.3 Aggregate and Local Risks; 25.4 Incremental, Consensus, and Diffusion; 25.5 Formal Derivation as Primal Methods; 25.6 Commentaries and Discussion; Problems; 25.A Proof of Lemma 25.1; 25.B Proof of Property (25.71); 25.C Convergence of Primal Algorithms; References
26 Decentralized Optimization II: Primal–Dual Methods – 26.1 Motivation; 26.2 EXTRA Algorithm; 26.3 EXACT Diffusion Algorithm; 26.4 Distributed Inexact Gradient Algorithm; 26.5 Augmented Decentralized Gradient Method; 26.6 ATC Tracking Method; 26.7 Unified Decentralized Algorithm; 26.8 Convergence Performance; 26.9 Dual Method; 26.10 Decentralized Nonconvex Optimization; 26.11 Commentaries and Discussion; Problems; 26.A Convergence of Primal–Dual Algorithms; References

Author Index
Subject Index

VOLUME II: INFERENCE

Preface: P.1 Emphasis on Foundations; P.2 Glimpse of History; P.3 Organization of the Text; P.4 How to Use the Text; P.5 Simulation Datasets; P.6 Acknowledgments
Notation

27 Mean-Square-Error Inference – 27.1 Inference without Observations; 27.2 Inference with Observations; 27.3 Gaussian Random Variables; 27.4 Bias–Variance Relation; 27.5 Commentaries and Discussion; Problems; 27.A Circular Gaussian Distribution; References
28 Bayesian Inference – 28.1 Bayesian Formulation; 28.2 Maximum A-Posteriori Inference; 28.3 Bayes Classifier; 28.4 Logistic Regression Inference; 28.5 Discriminative and Generative Models; 28.6 Commentaries and Discussion; Problems; References
29 Linear Regression – 29.1 Regression Model; 29.2 Centering and Augmentation; 29.3 Vector Estimation; 29.4 Linear Models; 29.5 Data Fusion; 29.6 Minimum-Variance Unbiased Estimation; 29.7 Commentaries and Discussion; Problems; 29.A Consistency of Normal Equations; References
30 Kalman Filter – 30.1 Uncorrelated Observations; 30.2 Innovations Process; 30.3 State-Space Model; 30.4 Measurement- and Time-Update Forms; 30.5 Steady-State Filter; 30.6 Smoothing Filters; 30.7 Ensemble Kalman Filter; 30.8 Nonlinear Filtering; 30.9 Commentaries and Discussion; Problems; References
31 Maximum Likelihood – 31.1 Problem Formulation; 31.2 Gaussian Distribution; 31.3 Multinomial Distribution; 31.4 Exponential Family of Distributions; 31.5 Cramer–Rao Lower Bound; 31.6 Model Selection; 31.7 Commentaries and Discussion; Problems; 31.A Derivation of the Cramer–Rao Bound; 31.B Derivation of the AIC Formulation; 31.C Derivation of the BIC Formulation; References
32 Expectation Maximization – 32.1 Motivation; 32.2 Derivation of the EM Algorithm; 32.3 Gaussian Mixture Models; 32.4 Bernoulli Mixture Models; 32.5 Commentaries and Discussion; Problems; 32.A Exponential Mixture Models; References
33 Predictive Modeling – 33.1 Posterior Distributions; 33.2 Laplace Method; 33.3 Markov Chain Monte Carlo Method; 33.4 Commentaries and Discussion; Problems; References
34 Expectation Propagation – 34.1 Factored Representation; 34.2 Gaussian Sites; 34.3 Exponential Sites; 34.4 Assumed Density Filtering; 34.5 Commentaries and Discussion; Problems; References
35 Particle Filters – 35.1 Data Model; 35.2 Importance Sampling; 35.3 Particle Filter Implementations; 35.4 Commentaries and Discussion; Problems; References
36 Variational Inference – 36.1 Evaluating Evidences; 36.2 Evaluating Posterior Distributions; 36.3 Mean-Field Approximation; 36.4 Exponential Conjugate Models; 36.5 Maximizing the ELBO; 36.6 Stochastic Gradient Solution; 36.7 Black Box Inference; 36.8 Commentaries and Discussion; Problems; References
37 Latent Dirichlet Allocation – 37.1 Generative Model; 37.2 Coordinate-Ascent Solution; 37.3 Maximizing the ELBO; 37.4 Estimating Model Parameters; 37.5 Commentaries and Discussion; Problems; References
38 Hidden Markov Models – 38.1 Gaussian Mixture Models; 38.2 Markov Chains; 38.3 Forward–Backward Recursions; 38.4 Validation and Prediction Tasks; 38.5 Commentaries and Discussion; Problems; References
39 Decoding Hidden Markov Models – 39.1 Decoding States; 39.2 Decoding Transition Probabilities; 39.3 Normalization and Scaling; 39.4 Viterbi Algorithm; 39.5 EM Algorithm for Dependent Observations; 39.6 Commentaries and Discussion; Problems; References
40 Independent Component Analysis – 40.1 Problem Formulation; 40.2 Maximum-Likelihood Formulation; 40.3 Mutual Information Formulation; 40.4 Maximum Kurtosis Formulation; 40.5 Projection Pursuit; 40.6 Commentaries and Discussion; Problems; References
41 Bayesian Networks – 41.1 Curse of Dimensionality; 41.2 Probabilistic Graphical Models; 41.3 Active and Blocked Pathways; 41.4 Conditional Independence Relations; 41.5 Commentaries and Discussion; Problems; References
42 Inference over Graphs – 42.1 Probabilistic Inference; 42.2 Inference by Enumeration; 42.3 Inference by Variable Elimination; 42.4 Chow–Liu Algorithm; 42.5 Graphical LASSO; 42.6 Learning Graph Parameters; 42.7 Commentaries and Discussion; Problems; References
43 Undirected Graphs – 43.1 Cliques and Potentials; 43.2 Representation Theorem; 43.3 Factor Graphs; 43.4 Message-Passing Algorithms; 43.5 Commentaries and Discussion; Problems; 43.A Proof of the Hammersley–Clifford Theorem; 43.B Equivalence of Markovian Properties; References
44 Markov Decision Processes – 44.1 MDP Model; 44.2 Discounted Rewards; 44.3 Policy Evaluation; 44.4 Linear Function Approximation; 44.5 Commentaries and Discussion; Problems; References
45 Value and Policy Iterations – 45.1 Value Iteration; 45.2 Policy Iteration; 45.3 Partially Observable MDP; 45.4 Commentaries and Discussion; Problems; 45.A Optimal Policy and State–Action Values; 45.B Convergence of Value Iteration; 45.C Proof of ε-Optimality; 45.D Convergence of Policy Iteration; 45.E Piecewise Linear Property; 45.F Bellman Principle of Optimality; References
46 Temporal Difference Learning – 46.1 Model-Based Learning; 46.2 Monte Carlo Policy Evaluation; 46.3 TD(0) Algorithm; 46.4 Look-Ahead TD Algorithm; 46.5 TD(λ) Algorithm; 46.6 True Online TD(λ) Algorithm; 46.7 Off-Policy Learning; 46.8 Commentaries and Discussion; Problems; 46.A Useful Convergence Result; 46.B Convergence of TD(0) Algorithm; 46.C Convergence of TD(λ) Algorithm; 46.D Equivalence of Offline Implementations; References
47 Q-Learning – 47.1 SARSA(0) Algorithm; 47.2 Look-Ahead SARSA Algorithm; 47.3 SARSA(λ) Algorithm; 47.4 Off-Policy Learning; 47.5 Optimal Policy Extraction; 47.6 Q-Learning Algorithm; 47.7 Exploration versus Exploitation; 47.8 Q-Learning with Replay Buffer; 47.9 Double Q-Learning; 47.10 Commentaries and Discussion; Problems; 47.A Convergence of SARSA(0) Algorithm; 47.B Convergence of Q-Learning Algorithm; References
48 Value Function Approximation – 48.1 Stochastic Gradient TD-Learning; 48.2 Least-Squares TD-Learning; 48.3 Projected Bellman Learning; 48.4 SARSA Methods; 48.5 Deep Q-Learning; 48.6 Commentaries and Discussion; Problems; References
49 Policy Gradient Methods – 49.1 Policy Model; 49.2 Finite-Difference Method; 49.3 Score Function; 49.4 Objective Functions; 49.5 Policy Gradient Theorem; 49.6 Actor–Critic Algorithms; 49.7 Natural Gradient Policy; 49.8 Trust Region Policy Optimization; 49.9 Deep Reinforcement Learning; 49.10 Soft Learning; 49.11 Commentaries and Discussion; Problems; 49.A Proof of Policy Gradient Theorem; 49.B Proof of Consistency Theorem; References

Author Index
Subject Index

VOLUME III: LEARNING

Preface: P.1 Emphasis on Foundations; P.2 Glimpse of History; P.3 Organization of the Text; P.4 How to Use the Text; P.5 Simulation Datasets; P.6 Acknowledgments
Notation

50 Least-Squares Problems – 50.1 Motivation; 50.2 Normal Equations; 50.3 Recursive Least-Squares; 50.4 Implicit Bias; 50.5 Commentaries and Discussion; Problems; 50.A Minimum-Norm Solution; 50.B Equivalence in Linear Estimation; 50.C Extended Least-Squares; References
51 Regularization – 51.1 Three Challenges; 51.2 ℓ2-Regularization; 51.3 ℓ1-Regularization; 51.4 Soft Thresholding; 51.5 Commentaries and Discussion; Problems; 51.A Constrained Formulations for Regularization; 51.B Expression for LASSO Solution; References
52 Nearest-Neighbor Rule – 52.1 Bayes Classifier; 52.2 k-NN Classifier; 52.3 Performance Guarantee; 52.4 k-Means Algorithm; 52.5 Commentaries and Discussion; Problems; 52.A Performance of the NN Classifier; References
53 Self-Organizing Maps – 53.1 Grid Arrangements; 53.2 Training Algorithm; 53.3 Visualization; 53.4 Commentaries and Discussion; Problems; References
54 Decision Trees – 54.1 Trees and Attributes; 54.2 Selecting Attributes; 54.3 Constructing a Tree; 54.4 Commentaries and Discussion; Problems; References
55 Naïve Bayes Classifier – 55.1 Independence Condition; 55.2 Modeling the Conditional Distribution; 55.3 Estimating the Priors; 55.4 Gaussian Naïve Classifier; 55.5 Commentaries and Discussion; Problems; References
56 Linear Discriminant Analysis – 56.1 Discriminant Functions; 56.2 Linear Discriminant Algorithm; 56.3 Minimum Distance Classifier; 56.4 Fisher Discriminant Analysis; 56.5 Commentaries and Discussion; Problems; References
57 Principal Component Analysis – 57.1 Data Preprocessing; 57.2 Dimensionality Reduction; 57.3 Subspace Interpretations; 57.4 Sparse PCA; 57.5 Probabilistic PCA; 57.6 Commentaries and Discussion; Problems; 57.A Maximum-Likelihood Solution; 57.B Alternative Optimization Problem; References
58 Dictionary Learning – 58.1 Learning Under Regularization; 58.2 Learning Under Constraints; 58.3 K-SVD Approach; 58.4 Nonnegative Matrix Factorization; 58.5 Commentaries and Discussion; Problems; 58.A Orthogonal Matching Pursuit; References
59 Logistic Regression – 59.1 Logistic Model; 59.2 Logistic Empirical Risk; 59.3 Multiclass Classification; 59.4 Active Learning; 59.5 Domain Adaptation; 59.6 Commentaries and Discussion; Problems; 59.A Generalized Linear Models; References
60 Perceptron – 60.1 Linear Separability; 60.2 Perceptron Empirical Risk; 60.3 Termination in Finite Steps; 60.4 Pocket Perceptron; 60.5 Commentaries and Discussion; Problems; 60.A Counting Theorem; 60.B Boolean Functions; References
61 Support Vector Machines – 61.1 SVM Empirical Risk; 61.2 Convex Quadratic Program; 61.3 Cross Validation; 61.4 Commentaries and Discussion; Problems; References
62 Bagging and Boosting – 62.1 Bagging Classifiers; 62.2 AdaBoost Classifier; 62.3 Gradient Boosting; 62.4 Commentaries and Discussion; Problems; References
63 Kernel Methods – 63.1 Motivation; 63.2 Nonlinear Mappings; 63.3 Polynomial and Gaussian Kernels; 63.4 Kernel-Based Perceptron; 63.5 Kernel-Based SVM; 63.6 Kernel-Based Ridge Regression; 63.7 Kernel-Based Learning; 63.8 Kernel PCA; 63.9 Inference under Gaussian Processes; 63.10 Commentaries and Discussion; Problems; References
64 Generalization Theory – 64.1 Curse of Dimensionality; 64.2 Empirical Risk Minimization; 64.3 Generalization Ability; 64.4 VC Dimension; 64.5 Bias–Variance Trade-off; 64.6 Surrogate Risk Functions; 64.7 Commentaries and Discussion; Problems; 64.A VC Dimension for Linear Classifiers; 64.B Sauer Lemma; 64.C Vapnik–Chervonenkis Bound; 64.D Rademacher Complexity; References
65 Feedforward Neural Networks – 65.1 Activation Functions; 65.2 Feedforward Networks; 65.3 Regression and Classification; 65.4 Calculation of Gradient Vectors; 65.5 Backpropagation Algorithm; 65.6 Dropout Strategy; 65.7 Regularized Cross-Entropy Risk; 65.8 Slowdown in Learning; 65.9 Batch Normalization; 65.10 Commentaries and Discussion; Problems; 65.A Derivation of Batch Normalization Algorithm; References
66 Deep Belief Networks – 66.1 Pre-Training Using Stacked Autoencoders; 66.2 Restricted Boltzmann Machines; 66.3 Contrastive Divergence; 66.4 Pre-Training using Stacked RBMs; 66.5 Deep Generative Model; 66.6 Commentaries and Discussion; Problems; References
67 Convolutional Networks – 67.1 Correlation Layers; 67.2 Pooling; 67.3 Full Network; 67.4 Training Algorithm; 67.5 Commentaries and Discussion; Problems; 67.A Derivation of Training Algorithm; References
68 Generative Networks – 68.1 Variational Autoencoders; 68.2 Training Variational Autoencoders; 68.3 Conditional Variational Autoencoders; 68.4 Generative Adversarial Networks; 68.5 Training of GANs; 68.6 Conditional GANs; 68.7 Commentaries and Discussion; Problems; References
69 Recurrent Networks – 69.1 Recurrent Neural Networks; 69.2 Backpropagation Through Time; 69.3 Bidirectional Recurrent Networks; 69.4 Vanishing and Exploding Gradients; 69.5 Long Short-Term Memory Networks; 69.6 Bidirectional LSTMs; 69.7 Gated Recurrent Units; 69.8 Commentaries and Discussion; Problems; References
70 Explainable Learning – 70.1 Classifier Model; 70.2 Sensitivity Analysis; 70.3 Gradient X Input Analysis; 70.4 Relevance Analysis; 70.5 Commentaries and Discussion; Problems; References
71 Adversarial Attacks – 71.1 Types of Attacks; 71.2 Fast Gradient Sign Method; 71.3 Jacobian Saliency Map Approach; 71.4 DeepFool Technique; 71.5 Black-Box Attacks; 71.6 Defense Mechanisms; 71.7 Commentaries and Discussion; Problems; References
72 Meta Learning – 72.1 Network Model; 72.2 Siamese Networks; 72.3 Relation Networks; 72.4 Exploration Models; 72.5 Commentaries and Discussion; Problems; 72.A Matching Networks; 72.B Prototypical Networks; References

Author Index
Subject Index

Preface

Learning directly from data is critical to a host of disciplines in engineering and the physical, social, and life sciences. Modern society is literally driven by an interconnected web of data exchanges at rates unseen before, and it relies heavily on decisions inferred from patterns in data. There is nothing fundamentally wrong with this approach, except that the inference and learning methodologies need to be anchored on solid foundations, be fair and reliable in their conclusions, and be robust to unwarranted imperfections and malicious interference.

P.1 EMPHASIS ON FOUNDATIONS

Given the explosive interest in data-driven learning methods, it is not uncommon to encounter claims of superior designs in the literature that are substantiated mainly by sporadic simulations and the potential for “life-changing” applications rather than by an approach that is founded on the well-tested scientific principle of inquiry. For this reason, one of the main objectives of this text is to highlight, in a unified and formal manner, the firm mathematical and statistical pillars that underlie many popular data-driven learning and inference methods. This is a nontrivial task given the wide scope of techniques that exist, and which have often been motivated independently of each other. It is nevertheless important for practitioners and researchers alike to remain cognizant of the common foundational threads that run across these methods. It is also imperative that progress in the domain remains grounded on firm theory. As the aphorism often attributed to Lewin (1945) states, “there is nothing more practical than a good theory.” According to Bedeian (2016), this saying has an even older history.

Rigorous data analysis, and conclusions derived from experimentation and theory, have been driving science since time immemorial. As reported by Heath (1912), the Greek scientist Archimedes of Syracuse devised the now famous Archimedes’ Principle about the volume displaced by an immersed object from observing how the level of water in a tub rose when he sat in it. In the account by Hall (1970), Gauss’ formulation of the least-squares problem was driven by his desire to predict the future location of the planetoid Ceres from observations of its location over 41 prior days. There are numerous similar examples by notable scientists where experimentation led to hypotheses and from there to substantiated theories and well-founded design methodologies. Science is also full of progress in the reverse direction, where theories have been developed first to be validated only decades later through experimentation and data analysis. Einstein (1916) postulated the existence of gravitational waves over 100 years ago. It took until 2016 to detect them! Regardless of which direction one follows, experimentation to theory or the reverse, the match between solid theory and rigorous data analysis has enabled science and humanity to march confidently toward the immense progress that permeates our modern world today.

For similar reasons, data-driven learning and inference should be developed with strong theoretical guarantees. Otherwise, the confidence in their reliability can be shaken if there is over-reliance on “proof by simulation or experience.” Whenever possible, we explain the underlying models and statistical theories for a large number of methods covered in this text. A good grasp of these theories will enable practitioners and researchers to devise variations with greater mastery. We weave through the foundations in a coherent and cohesive manner, and show how the various methods blend together techniques that may appear decoupled but are actually facets of the same common methodology. In this process, we discover that a good number of techniques are well-grounded and meet proven performance guarantees, while other methods are driven by ingenious insights but lack solid justifications and cannot be guaranteed to be “fail-proof.”

Researchers on learning and inference methods are of course aware of the limitations of some of their approaches, so much so that we encounter today many studies, for example, on the topic of “explainable machine learning.” The objective here is to understand why learning algorithms produce certain recommendations. While this is an important area of inquiry, it nevertheless highlights one interesting shift in paradigm. In the past, the emphasis would have been on designing inference methods that respond to the input data in certain desirable and controllable ways. Today, in many instances, the emphasis is to stick to the available algorithms (often, out of convenience) and try to understand or explain why they are responding in certain ways to the input!

Writing this text has been a rewarding journey that took me from the early days of statistical mathematical theory to the modern state of affairs in learning theory. One can only stand in awe at the wondrous ideas that have been introduced by notable researchers along this trajectory. At the same time, one observes with some concern an emerging trend in recent years where solid foundations receive less attention in favor of “speed publishing” and over-reliance on “illustration by simulation.” This is of course not the norm and most researchers in the field stay honest to the scientific approach to inquiry and design. After concluding this comprehensive text, I stand humbled at the realization of “how little we know!” There are countless questions that remain open, and even for many of the questions that have been answered, their answers rely on assumptions or (over)simplifications. It is understandable that the complexity of the problems we face today has increased manifold, and ingenious approximations become necessary to enable tractable solutions.

P.2 GLIMPSE OF HISTORY

Reading through the text, the alert reader will quickly realize that the core foundations of modern-day machine learning, data analytics, and inference methods date back for at least two centuries, with contributions arising from a range of fields including mathematics, statistics, optimization theory, information theory, signal processing, communications, control, and computer science. For the benefit of the reader, I reproduce here with permission from IEEE some historical remarks from the editorial I published in Sayed (2018). I explained there that these disciplines have generated a string of “big ideas” that are driving today multi-faceted efforts in the age of “big data” and machine learning.

Generations of students in the statistical sciences and engineering have been trained in the art of modeling, problem solving, and optimization. Their algorithms power everything from cell phones, to spacecraft, robotic explorers, imaging devices, automated systems, computing machines, and also recommender systems. These students mastered the foundations of their fields and have been well prepared to contribute to the explosive growth of data analysis and machine learning solutions. As the list below shows, many well-known engineering and statistical methods have actually been motivated by data-driven inquiries, even from times remote. The list is a tour of some older historical contributions, which is of course biased by my personal preferences and is not intended to be exhaustive. It is only meant to illustrate how concepts from statistics and the information sciences have always been at the center of promoting big ideas for data and machine learning. Readers will encounter these concepts in various chapters in the text. Readers will also encounter additional historical accounts in the concluding remarks of each chapter, and in particular comments on newer contributions and contributors.

Let me start with Gauss himself, who in 1795 at the young age of 18, was fitting lines and hyperplanes to astronomical data and invented the least-squares criterion for regression analysis – see the collection of his works in Gauss (1903). He even devised the recursive least-squares solution to address what was a “big” data problem for him at the time: He had to avoid tedious repeated calculations by hand as more observational data became available. What a wonderful big idea for a data-driven problem! Of course, Gauss had many other big ideas.

de Moivre (1730), Laplace (1812), and Lyapunov (1901) worked on the central limit theorem. The theorem deals with the limiting distribution of averages of “large” amounts of data. The result is also related to the law of “large” numbers, which even has the qualification “large” in its name. Again, big ideas motivated by “large” data problems.

Bayes (ca mid-1750s) and Laplace (1774) appear to have independently discovered the Bayes rule, which updates probabilities conditioned on observations – see the article by Bayes and Price (1763). The rule forms the backbone of much of statistical signal analysis, Bayes classifiers, Naïve classifiers, and Bayesian networks. Again, a big idea for data-driven inference.


Fourier (1822), whose tools are at the core of disciplines in the information sciences, developed the phenomenal Fourier representation for signals. It is meant to transform data from one domain to another to facilitate the extraction and visualization of information. A big transformative idea for data.

Forward to modern times. The fast Fourier transform (FFT) is another example of an algorithm driven by challenges posed by data size. Its modern version is due to Cooley and Tukey (1965). Their algorithm revolutionized the field of discrete-time signal processing, and FFT processors have become common components in many modern electronic devices. Even Gauss had a role to play here, having proposed an early version of the algorithm some 160 years before, again motivated by a data-driven problem while trying to fit astronomical data onto trigonometric polynomials. A big idea for a data-driven problem.

Closer to the core of statistical mathematical theory, both Kolmogorov (1939) and Wiener (1949) laid out the foundations of modern statistical signal analysis and optimal prediction methods. Their theories taught us how to extract information optimally from data, leading to further refinements by Wiener’s student Levinson (1947) and more dramatically by Kalman (1960). The innovations approach by Kailath (1968) exploited to great effect the concept of orthogonalization of the data and recursive constructions. The Kalman filter is applied across many domains today, including in financial analysis from market data. Kalman’s work was an outgrowth of the model-based approach to system theory advanced by Zadeh (1954). The concept of a recursive solution from streaming data was a novelty in Kalman’s filter; the same concept is commonplace today in most online learning techniques. Again, big ideas for recursive inference from data.

Cauchy (1847) early on, and Robbins and Monro (1951) a century later, developed the powerful gradient-descent method for root finding, which is also recursive in nature. Their techniques have grown to motivate huge advances in stochastic approximation theory. Notable contributions that followed include the work by Rosenblatt (1957) on the perceptron algorithm for single-layer networks, and the impactful delta rule by Widrow and Hoff (1960), widely known as the LMS algorithm in the signal processing literature. Subsequent work on multilayer neural networks grew out of the desire to increase the approximation power of single-layer networks, culminating with the backpropagation method of Werbos (1974). Many of these techniques form the backbone of modern learning algorithms. Again, big ideas for recursive online learning.

Shannon (1948a, b) contributed fundamental insights to data representation, sampling, coding, and communications. His concepts of entropy and information measure helped quantify the amount of uncertainty in data and are used, among other areas, in the design of decision trees for classification purposes and in driving learning algorithms for neural networks. Nyquist (1928) contributed to the understanding of data representations as well. Big ideas for data sampling and data manipulation.

Bellman (1957a, b), a towering system-theorist, introduced dynamic programming and the notion of the curse of dimensionality, both of which are core underpinnings of many results in learning theory, reinforcement learning, and the theory of Markov decision processes. Viterbi’s algorithm (1967) is one notable example of a dynamic programming solution, which has revolutionized communications and has also found applications in hidden Markov models widely used in speech recognition nowadays. Big ideas for conquering complex data problems by dividing them into simpler problems.

Kernel methods, building on foundational results by Mercer (1909) and Aronszajn (1950), have found widespread applications in learning theory since the mid-1960s with the introduction of the kernel perceptron algorithm. They have also been widely used in estimation theory by Parzen (1962), Kailath (1971), and others. Again, a big idea for learning from data.

Pearson and Fisher launched the modern field of mathematical statistical signal analysis with the introduction of methods such as principal component analysis (PCA) by Pearson (1901) and maximum likelihood and linear discriminant analysis by Fisher (1912, 1922, 1925). These methods are at the core of statistical signal processing. Pearson (1894, 1896) also had one of the earliest studies of fitting a mixture of Gaussian models to biological data. Mixture models have now become an important tool in modern learning algorithms. Big ideas for data-driven inference.

Markov (1913) introduced the formalism of Markov chains, which is widely used today as a powerful modeling tool in a variety of fields including word and speech recognition, handwriting recognition, natural language processing, spam filtering, gene analysis, and web search. Markov chains are also used in Google’s PageRank algorithm. Markov’s motivation was to study letter patterns in texts. He laboriously went through the first 20,000 letters of a classical Russian novel and counted pairs of vowels, consonants, vowels followed by a consonant, and consonants followed by a vowel. A “big” data problem for his time. Great ideas (and great patience) for data-driven inquiries.

And the list goes on, with many modern-day and ongoing contributions by statisticians, engineers, and computer scientists to network science, distributed processing, compressed sensing, randomized algorithms, optimization, multi-agent systems, intelligent systems, computational imaging, speech processing, forensics, computer vision, privacy and security, and so forth. We provide additional historical accounts about these contributions and contributors at the end of the chapters.

P.3 ORGANIZATION OF THE TEXT

The text is organized into three volumes, with a sizable number of problems and solved examples. The table of contents provides details on what is covered in each volume. Here we provide a condensed summary listing the three main themes:


1. (Volume I: Foundations). The first volume covers the foundations needed for a solid grasp of inference and learning methods. Many important topics are covered in this part, in a manner that prepares readers for the study of inference and learning methods in the second and third volumes. Topics include: matrix theory, linear algebra, random variables, Gaussian and exponential distributions, entropy and divergence, Lipschitz conditions, convexity, convex optimization, proximal operators, gradient-descent, mirror-descent, conjugate-gradient, subgradient methods, stochastic optimization, adaptive gradient methods, variance-reduced methods, distributed optimization, and nonconvex optimization. Interestingly enough, the following concepts occur time and again in all three volumes and the reader is well-advised to develop familiarity with them: convexity, sample mean and law of large numbers, Gaussianity, Bayes rule, entropy, Kullback–Leibler divergence, gradient-descent, least squares, regularization, and maximum-likelihood. The last three concepts are discussed in the initial chapters of the second volume.

2. (Volume II: Inference). The second volume covers inference methods. By “inference” we mean techniques that infer some unknown variable or quantity from observations. The difference we make between “inference” and “learning” in our treatment is that inference methods will target situations where some prior information is known about the underlying signal models or signal distributions (such as their joint probability density functions or generative models). The performance by many of these inference methods will be the ultimate goal that learning algorithms, studied in the third volume, will attempt to emulate. Topics covered here include: mean-square-error inference, Bayesian inference, maximum-likelihood estimation, expectation maximization, expectation propagation, Kalman filters, particle filters, posterior modeling and prediction, Markov chain Monte Carlo methods, sampling methods, variational inference, latent Dirichlet allocation, hidden Markov models, independent component analysis, Bayesian networks, inference over directed and undirected graphs, Markov decision processes, dynamic programming, and reinforcement learning.

3. (Volume III: Learning). The third volume covers learning methods. Here, again, we are interested in inferring some unknown variable or quantity from observations. The difference, however, is that the inference will now be solely data-driven, i.e., based on available data and not on any assumed knowledge about signal distributions or models. The designer is only given a collection of observations that arise from the underlying (unknown) distribution. New phenomena arise related to generalization power, overfitting, and underfitting depending on how representative the data is and how complex or simple the approximate models are. The target is to use the data to learn about the quantity of interest (its value or evolution). Topics covered here include: least-squares methods, regularization, nearest-neighbor rule, self-organizing maps, decision trees, naïve Bayes classifier, linear discriminant analysis, principal component analysis, dictionary learning, perceptron, support vector machines, bagging and boosting, kernel methods, Gaussian processes, generalization theory, feedforward neural networks, deep belief networks, convolutional networks, generative networks, recurrent networks, explainable learning, adversarial attacks, and meta learning.

Figure P.1 shows how various topics are grouped together in the text; the numbers in the boxes indicate the chapters where these subjects are covered. The figure can be read as follows. For example, instructors wishing to cover:

Figure P.1 Organization of the text.

(a) Background material on linear algebra and matrix theory: they can use Chapters 1 and 2.
(b) Background material on random variables and probability theory: they can select from Chapters 3 through 7.
(c) Background material on convex functions and convex optimization: they can use Chapters 8 through 11.


The three groupings (a)–(c) contain introductory core concepts that are needed for subsequent chapters. For instance, instructors wishing to cover gradient descent and iterative optimization techniques would then proceed to Chapters 12 through 15, while instructors wishing to cover stochastic optimization methods would use Chapters 16–24 and so forth.

Figure P.2 provides a representation of the estimated dependencies among the chapters in the text. The chapters are color-coded depending on the volume they appear in. An arrow from Chapter a toward Chapter b implies that the material in the latter chapter benefits from the material in the earlier chapter. In principle, we should have added arrows from Chapter 1, which covers background material on matrix and linear algebra, into all other chapters. We ignored obvious links of this type to avoid crowding the figure.

P.4 HOW TO USE THE TEXT

Each chapter in the text consists of several blocks: (1) the main text where theory and results are presented, (2) a couple of solved examples to illustrate the main ideas and also to extend them, (3) comments at the end of the chapter providing a historical perspective and linking the references through a motivated timeline, (4) a list of problems of varying complexity, (5) appendices when necessary to cover some derivations or additional topics, and (6) references. In total, there are close to 470 solved examples and 1350 problems in the text. A solutions manual is available to instructors.

In the comments at the end of each chapter I list in boldface the life span of some influential scientists whose contributions have impacted the results discussed in the chapter. The dates of birth and death rely on several sources, including the MacTutor History of Mathematics Archive, Encyclopedia Britannica, Wikipedia, Porter and Ogilvie (2000), and Daintith (2008).

Several of the solved examples in the text involve computer simulations on datasets to illustrate the conclusions. The simulations, and several of the corresponding figures, were generated using the software program Matlab®, which is a registered trademark of MathWorks Inc., 24 Prime Park Way, Natick, MA 01760-1500, www.mathworks.com. The computer codes used to generate the figures are provided “as is” and without any guarantees. While these codes are useful for the instructional purposes of the book, they are not intended to be examples of full-blown or optimized designs; practitioners should use them at their own risk. We have made no attempts to optimize the codes, perfect them, or even check them for absolute accuracy. On the contrary, in order to keep the codes at a level that is easy to follow by students, we have often decided to sacrifice performance or even programming elegance in favor of simplicity. Students can use the computer codes to run variations of the examples shown in the text.

Figure P.2 A diagram showing the approximate dependencies among the chapters in the text. The color scheme identifies chapters from the same volume, with the numbers inside the circles referring to the chapter numbers.


In principle, each volume could serve as the basis for a master-level graduate course, such as courses on Foundations of Data Science (volume I), Inference from Data (volume II), and Learning from Data (volume III). Once students master the foundational concepts covered in volume I (especially in Chapters 1–17), they will be able to grasp the topics from volumes II and III more confidently. Instructors need not cover volumes II and III in this sequence; the order can be switched depending on whether they desire to emphasize data-based learning over model-based inference or the reverse. Depending on the duration of each course, one can also consider covering subsets of each volume by focusing on particular subjects. The following grouping explains how chapters from the three volumes cover specific topics and could be used as reference material for several potential course offerings:

(1) (Core foundations, Chapters 1–11, Vol. I): matrix theory, linear algebra, random variables, Gaussian and exponential distributions, entropy and divergence, Lipschitz conditions, convexity, convex optimization, and proximal operators. These chapters can serve as the basis for an introductory course on foundational concepts for mastering data science.

(2) (Stochastic optimization, Chapters 12–26, Vol. I): gradient-descent, mirror-descent, conjugate-gradient, subgradient methods, stochastic optimization, adaptive gradient methods, variance-reduced methods, convergence analyses, distributed optimization, and nonconvex optimization. These chapters can serve as the basis for a course on stochastic optimization for both convex and nonconvex environments, with attention to performance and convergence analyses. Stochastic optimization is at the core of most modern learning techniques, and students will benefit greatly from a solid grasp of this topic.

(3) (Statistical or Bayesian inference, Chapters 27–37, 40, Vol. II): mean-square-error inference, Bayesian inference, maximum-likelihood estimation, expectation maximization, expectation propagation, Kalman filters, particle filters, posterior modeling and prediction, Markov chain Monte Carlo methods, sampling methods, variational inference, latent Dirichlet allocation, and independent component analysis. These chapters introduce students to optimal methods to extract information from data, under the assumption that the underlying probability distributions or models are known. In a sense, these chapters reveal limits of performance that future data-based learning methods, covered in subsequent chapters, will try to emulate when the models are not known.

(4) (Probabilistic graphical models, Chapters 38, 39, 41–43, Vol. II): hidden Markov models, Bayesian networks, inference over directed and undirected graphs, factor graphs, message passing, belief propagation, and graph learning. These chapters can serve as the basis for a course on Bayesian inference over graphs. Several methods and techniques are discussed along with supporting examples and algorithms.


(5) (Reinforcement learning, Chapters 44–49, Vol. II): Markov decision processes, dynamic programming, value and policy iterations, temporal difference learning, Q-learning, value function approximation, and policy gradient methods. These chapters can serve as the basis for a course on reinforcement learning. They cover many relevant techniques, illustrated by means of examples, and include performance and convergence analyses. (6) (Data-driven and online learning, Chapters 50–64, Vol. III): least-squares methods, regularization, nearest-neighbor rule, self-organizing maps, decision trees, naïve Bayes classifier, linear discriminant analysis, principal component analysis, dictionary learning, perceptron, support vector machines, bagging and boosting, kernel methods, Gaussian processes, and generalization theory. These chapters cover a variety of methods for learning directly from data, including various methods for online learning from sequential data. The chapters also cover performance guarantees from statistical learning theory. (7) (Neural networks, Chapters 65–72, Vol. III): feedforward neural networks, deep belief networks, convolutional networks, generative networks, recurrent networks, explainable learning, adversarial attacks, and meta learning. These chapters cover various architectures for neural networks and the respective algorithms for training them. The chapters also cover topics related to explainability and adversarial behavior over these networks. The above groupings assume that students have been introduced to background material on matrix theory, random variables, entropy, convexity, and gradient-descent methods. One can, however, rearrange the groupings by designing stand-alone courses where the background material is included along with the other relevant chapters. By doing so, it is possible to devise various course offerings, covering themes such as stochastic optimization, online or sequential learning, probabilistic graphical models, reinforcement learning, neural networks, Bayesian machine learning, kernel methods, decentralized optimization, and so forth. Figure P.3 shows several suggested selections of topics from across the text, and the respective chapters, which can be used to construct courses with particular emphasis. Other selections are of course possible, depending on individual preferences and on the intended breadth and depth for the courses.

P.5

SIMULATION DATASETS In several examples in this work we run simulations that rely on publicly available datasets. The sources for these datasets are acknowledged in the appropriate locations in the text. Here we provide an aggregate summary for ease of reference: (1) Iris dataset. This classical dataset contains information about the sepal length and width for three types of iris flowers: virginica, setosa, and

Figure P.3 Suggested selections of topics from across the text, which can be used to construct stand-alone courses with particular emphases. Other options are possible based on individual preferences.


versicolor. It was originally used by Fisher (1936) and is available at the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/iris. Actually, several of the datasets in our list are downloaded from this useful repository – see Dua and Graff (2019). (2) MNIST dataset. This is a second popular dataset, which is useful for classifying handwritten digits. It was used in the work by LeCun et al. (1998) on document recognition. The data contains 60,000 labeled training examples and 10,000 labeled test examples for the digits 0 through 9. It can be downloaded from http://yann.lecun.com/exdb/mnist/. (3) CIFAR-10 dataset. This dataset consists of color images that can belong to one of 10 classes: airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks. It is described by Krizhevsky (2009) and can be downloaded from www.cs.toronto.edu/∼kriz/cifar.html. (4) FBI crime dataset. This dataset contains statistics showing the burglary rates per 100,000 inhabitants for the period 1997–2016. The source of the data is the US Criminal Justice Information Services Division at the link https://ucr.fbi.gov/crime-in-the-u.s/2016/crime-in-the-u.s.-2016/tables/table-1. (5) Sea level and global temperature changes dataset. The sea level dataset measures the change in sea level relative to the start of year 1993. There are 952 data points consisting of fractional year values. The source of the data is the NASA Goddard Space Flight Center at https://climate.nasa.gov/vitalsigns/sea-level/. For information on how the data was generated, the reader may consult Beckley et al. (2017) and the report GSFC (2017). The temperature dataset measures changes in the global surface temperature relative to the average over the period 1951–1980. There are 139 measurements between the years 1880 and 2018. The source of the data is the NASA Goddard Institute for Space Studies (GISS) at https://climate.nasa.gov/vital-signs/globaltemperature/. (6) Breast cancer Wisconsin dataset. This dataset consists of 569 samples, with each sample corresponding to a benign or malignant cancer classification. It can be downloaded from the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic). For information on how the data was generated, the reader may consult Mangasarian, Street, and Wolberg (1995). (7) Heart-disease Cleveland dataset. This dataset consists of 297 samples that belong to patients with and without heart disease. It is available on the UCI Machine Learning Repository and can be downloaded from https://archive.ics .uci.edu/ml/datasets/heart+Disease. The investigators responsible for the collection of the data are the four leading co-authors of the article by Detrano et al. (1989).


P.6

ACKNOWLEDGMENTS A project of this magnitude is not possible without the support of a web of colleagues and students. I am indebted to all of them for their input at various stages of this project, either through feedback on earlier drafts or through conversations that deepened my understanding of the topics. I am grateful to several of my former and current Ph.D. students and post-doctoral associates in no specific order: Stefan Vlaski, Kun Yuan, Bicheng Ying, Zaid Towfic, Jianshu Chen, Xiaochuan Zhao, Sulaiman Alghunaim, Qiyue Zou, Zhi Quan, Federico Cattivelli, Lucas Cassano, Roula Nassif, Virginia Bordignon, Elsa Risk, Mert Kayaalp, Hawraa Salami, Mirette Sadek, Sylvia Dominguez, Sheng-Yuan Tu, Waleed Younis, Shang-Kee Tee, Chung-Kai Tu, Alireza Tarighat, Nima Khajehnouri, Vitor Nascimento, Ricardo Merched, Cassio Lopes, Nabil Yousef, Ananth Subramanian, Augusto Santos, and Mansour Aldajani. I am also indebted to former internship and visiting undergraduate and MS students Mateja Ilic, Chao Yutong, Yigit Efe Erginbas, Zhuoyoue Wang, and Edward Nguyen for their help with some of the simulations. I also wish to acknowledge several colleagues with whom I have had fruitful interactions over the years on topics of relevance to this text, including coauthoring joint publications, and who have contributed directly or indirectly to my work: Thomas Kailath (Stanford University, USA), Vince Poor (Princeton University, USA), José Moura (Carnegie Mellon University, USA), Mos Kaveh (University of Minnesota, USA), Bernard Widrow (Stanford University, USA), Simon Haykin (McMaster University, Canada), Thomas Cover (Stanford University, USA, in memoriam), Gene Golub (Stanford University, USA, in memoriam), Sergios Theodoridis (University of Athens, Greece), Vincenzo Matta (University of Salerno, Italy), Abdelhak Zoubir (Technical University Darmstadt, Germany), Cedric Richard (Universite Côte d’Azur, France), John Treichler (Raytheon, USA), Tiberiu Constantinescu (University of Texas Dallas, USA, in memoriam), Shiv Chandrasekaran (University of California, Santa Barbara, USA), Ming Gu (University of California, Berkeley, USA), Babak Hassibi (Caltech, USA), Jeff Shamma (University of Illinois, Urbana Champaign, USA), P. P. Vaidyanathan (Caltech, USA), Hanoch Lev-Ari (Northeastern University, USA), Markus Rupp (Tech. Universität Wien, Austria), Alan Laub, Wotao Yin, Lieven Vandenberghe, Mihaela van der Schaar, and Vwani Roychowdhury (University of California, Los Angeles), Vitor Nascimento (University of São Paulo, Brazil), Jeronimo Arena Garcia (Universidad Carlos III, Spain), Tareq Al-Naffouri (King Abdullah University of Science and Technology, Saudi Arabia), Jie Chen (Northwestern Polytechnical University, China), Sergio Barbarossa (Universita di Roma, Italy), Paolo Di Lorenzo (Universita di Roma, Italy), Alle-Jan van der Veen (Delft University, the Netherlands), Paulo Diniz (Federal University of Rio de Janeiro, Brazil), Sulyman Kozat (Bilkent University, Turkey), Mohammed Dahleh (University of California, Santa Barbara,


USA, in memoriam), Alexandre Bertrand (Katholieke Universiteit Leuven, Belgium), Marc Moonen (Katholieke Universiteit Leuven, Belgium), Phillip Regalia (National Science Foundation, USA), Martin Vetterli, Michael Unser, Pascal Frossard, Pierre Vandergheynst, Rudiger Urbanke, Emre Telatar, and Volkan Cevher (EPFL, Switzerland), Helmut Bölcskei (ETHZ, Switzerland), Visa Koivunen (Aalto University, Finland), Isao Yamada (Tokyo Institute of Technology, Japan), Zhi-Quan Luo and Shuguang Cui (Chinese University of Hong Kong, Shenzhen, China), Soummya Kar (Carnegie Mellon University, USA), Waheed Bajwa (Rutgers University, USA), Usman Khan (Tufts University, USA), Michael Rabbat (Facebook, Canada), Petar Djuric (Stony Brook University, USA), Lina Karam (Lebanese American University, Lebanon), Danilo Mandic (Imperial College, United Kingdom), Jonathon Chambers (University of Leicester, United Kingdom), Rabab Ward (University of British Columbia, Canada), and Nikos Sidiropoulos (University of Virginia, USA). I would like to acknowledge the support of my publisher, Elizabeth Horne, at Cambridge University Press during the production phase of this project. I would also like to express my gratitude to the publishers IEEE, Pearson Education, NOW, and Wiley for allowing me to adapt some excerpts and problems from my earlier works, namely, Sayed (Fundamentals of Adaptive Filtering, © 2003 Wiley), Sayed (Adaptive Filters, © 2008 Wiley), Sayed (Adaptation, Learning, and Optimization over Networks, © 2014 A. H. Sayed by NOW Publishers), Sayed (“Big ideas or big data,” © 2018 IEEE), and Kailath, Sayed, and Hassibi (Linear Estimation, © 2000 Prentice Hall). I initiated my work on this project in Westwood, Los Angeles, while working as a faculty member at the University of California, Los Angeles (UCLA), and concluded it in Lausanne, Switzerland, while working at the École Polytechnique Fédérale de Lausanne (EPFL). I am grateful to both institutions for their wonderful and supportive environments. My wife Laila, and daughters Faten and Samiya, have always provided me with their utmost support and encouragement without which I would not have been able to devote my early mornings and good portions of my weekend days to the completion of this text. My beloved parents, now deceased, were overwhelming in their support of my education. For all the sacrifices they have endured during their lifetime, I dedicate this text to their loving memory, knowing very well that this tiny gift will never match their gift.

Ali H. Sayed
Lausanne, Switzerland
March 2021


References

References Aronszajn, N. (1950), “Theory of reproducing kernels,” Trans. Amer. Math. Soc., vol. 68, no. 3, pp. 337–404. Bayes, T. and R. Price (1763), “An essay towards solving a problem in the doctrine of chances,” Bayes’s article communicated by R. Price and published posthumously in Philos. Trans. Roy. Soc. Lond., vol. 53, pp. 370–418. Beckley, B. D., P. S. Callahan, D. W. Hancock, G. T. Mitchum, and R. D. Ray (2017), “On the cal-mode correction to TOPEX satellite altimetry and its effect on the global mean sea level time series,” J. Geophy. Res. Oceans, vol. 122, no. 11, pp. 8371–8384. Bedeian, A. G. (2016), “A note on the aphorism ‘there is nothing as practical as a good theory’,” J. Manag. Hist., vol. 22, no. 2, pp. 236–242. Bellman, R. E. (1957a), Dynamic Programming, Princeton University Press. Also published in 2003 by Dover Publications. Bellman, R. E. (1957b), “A Markovian decision process,” Indiana Univ. Math. J., vol. 6, no. 4, pp. 679–684. Cauchy, A.-L. (1847), “Methode générale pour la résolution des systems déquations simultanes,” Comptes Rendus Hebd. Séances Acad. Sci., vol. 25, pp. 536–538. Cooley, J. W. and J. W. Tukey (1965), “An algorithm for the machine calculation of complex Fourier series” Math. Comput., vol. 19, no. 90, pp. 297–301. Daintith, J. (2008), editor, Biographical Encyclopedia of Scientists, 3rd ed., CRC Press. de Moivre, A. (1730), Miscellanea Analytica de Seriebus et Quadraturis, J. Tonson and J. Watts, London. Detrano, R., A. Janosi, W. Steinbrunn, M. Pfisterer, J. Schmid, S. Sandhu, K. Guppy, S. Lee, and V. Froelicher (1989), “International application of a new probability algorithm for the diagnosis of coronary artery disease,” Am. J. Cardiol., vol. 64, pp. 304–310. Dua, D. and C. Graff (2019), UCI Machine Learning Repository, available at http://archive.ics.uci.edu/ml, School of Information and Computer Science, University of California, Irvine. Einstein, A. (1916), “Näherungsweise Integration der Feldgleichungen der Gravitation,” Sitzungsberichte der Königlich Preussischen Akademie der Wissenschaften Berlin, part 1, pp. 688–696. Fisher, R. A. (1912), “On an absolute criterion for fitting frequency curves,” Messeg. Math., vol. 41, pp. 155–160. Fisher, R. A. (1922), “On the mathematical foundations of theoretical statistics,” Philos. Trans. Roy. Soc. Lond. Ser. A., vol. 222, pp. 309–368. Fisher, R. A. (1925), “Theory of statistical estimation,” Proc. Cambridge Philos. Soc., vol. 22, pp. 700–725. Fisher, R. A. (1936), “The use of multiple measurements in taxonomic problems,” Ann. Eugenics, vol. 7, no. 2, pp. 179–188. Fourier, J. (1822), Théorie Analytique de la Chaleur, Firmin Didot Père et Fils. English translation by A. Freeman in 1878 reissued as The Analytic Theory of Heat, Dover Publications. Gauss, C. F. (1903), Carl Friedrich Gauss Werke, Akademie der Wissenschaften. GSFC (2017), “Global mean sea level trend from integrated multi-mission ocean altimeters TOPEX/Poseidon, Jason-1, OSTM/Jason-2,” ver. 4.2 PO.DAAC, CA, USA. Dataset accessed 2019-03-18 at http://dx.doi.org/10.5067/GMSLM-TJ42. Hall, T. (1970), Carl Friedrich Gauss: A Biography, MIT Press. Heath, J. L. (1912), The Works of Archimedes, Dover Publications. Kailath, T. (1968), “An innovations approach to least-squares estimation, part I: Linear filtering in additive white noise,” IEEE Trans. Aut. Control, vol. 13, pp. 646–655. Kailath, T. (1971), “RKHS approach to detection and estimation problems I: Deterministic signals in Gaussian noise,” IEEE Trans. Inf. Theory, vol. 17, no. 5, pp. 
530–549. Kailath, T., A. H. Sayed, and B. Hassibi (2000), Linear Estimation, Prentice Hall.


Kalman, R. E. (1960), “A new approach to linear filtering and prediction problems,” Trans. ASME J. Basic Eng., vol. 82, pp. 34–45. Kolmogorov, A. N. (1939), “Sur l’interpolation et extrapolation des suites stationnaires,” C. R. Acad. Sci., vol. 208, p. 2043. Krizhevsky, A. (2009), Learning Multiple Layers of Features from Tiny Images, MS dissertation, Computer Science Department, University of Toronto, Canada. Laplace, P. S. (1774), “Mémoire sur la probabilité des causes par les événements,” Mém. Acad. R. Sci. de MI (Savants étrangers), vol. 4, pp. 621–656. See also Oeuvres Complètes de Laplace, vol. 8, pp. 27–65 published by the L’Académie des Sciences, Paris, during the period 1878–1912. Translated by S. M. Sitgler, Statistical Science, vol. 1, no. 3, pp. 366–367. Laplace, P. S. (1812), Théorie Analytique des Probabilités, Paris. LeCun, Y., L. Bottou, Y. Bengio, and P. Haffner (1998), “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324. Levinson, N. (1947), “The Wiener RMS error criterion in filter design and prediction,” J. Math. Phys., vol. 25, pp. 261–278. Lewin, K. (1945), “The research center for group dynamics at MIT,” Sociometry, vol. 8, pp. 126–135. See also page 169 of Lewin, K. (1952), Field Theory in Social Science: Selected Theoretical Papers by Kurt Lewin, Tavistock. Lyapunov, A. M. (1901), “Nouvelle forme du théoreme sur la limite de probabilité,” Mémoires de l’Académie de Saint-Petersbourg, vol. 12, no. 8, pp. 1–24. Mangasarian, O. L., W. N. Street, and W. H. Wolberg (1995), “Breast cancer diagnosis and prognosis via linear programming,” Op. Res., vol. 43, no. 4, pp. 570–577. Markov, A. A. (1913), “An example of statistical investigation in the text of Eugene Onyegin illustrating coupling of texts in chains,” Proc. Acad. Sci. St. Petersburg, vol. 7, no. 3, p. 153–162. English translation in Science in Context, vol. 19, no. 4, pp. 591–600, 2006. Mercer, J. (1909), “Functions of positive and negative type and their connection with the theory of integral equations,” Philos. Trans. Roy. Soc. Lond. Ser. A, vol. 209, pp. 415–446. Nyquist, H. (1928), “Certain topics in telegraph transmission theory,” Trans. AIEE, vol. 47, pp. 617–644. Reprinted as classic paper in Proc. IEEE, vol. 90, no. 2, pp. 280–305, 2002. Parzen, E. (1962), “Extraction and detection problems and reproducing kernel Hilbert spaces,” J. Soc. Indus. Appl. Math. Ser. A: Control, vol. 1, no. 1, pp. 35–62. Pearson, K. (1894), “Contributions to the mathematical theory of evolution,” Philos. Trans. Roy. Soc. Lond., vol. 185, pp. 71–110. Pearson, K. (1896), “Mathematical contributions to the theory of evolution. III. Regression, heredity and panmixia,” Philos. Trans. Roy. Soc. Lond., vol. 187, pp. 253–318. Pearson, K. (1901), “On lines and planes of closest fit to systems of points in space,” Philos. Mag., vol. 2, no. 11, pp. 559–572. Porter, R. and M. Ogilvie (2000), editors, The Biographical Dictionary of Scientists, 3rd ed., Oxford University Press. Robbins, H. and S. Monro (1951), “A stochastic approximation method,” Ann. Math. Stat., vol. 22, pp. 400–407. Rosenblatt, F. (1957), The Perceptron: A Perceiving and Recognizing Automaton, Technical Report 85-460-1, Project PARA, Cornell Aeronautical Lab. Sayed, A. H. (2003), Fundamentals of Adaptive Filtering, Wiley. Sayed, A. H. (2008), Adaptive Filters, Wiley. Sayed, A. H. (2014a), Adaptation, Learning, and Optimization over Networks, Foundations and Trends in Machine Learning, NOW Publishers, vol. 7, no. 4–5, pp. 
311–801. Sayed, A. H. (2018), “Big ideas or big data?” IEEE Sign. Process. Mag., vol. 35, no. 2, pp. 5–6. Shannon, C. E. (1948a), “A mathematical theory of communication,” Bell Syst. Tech. J., vol. 27, pp. 379–423. Shannon, C. E. (1948b), “A mathematical theory of communication,” Bell Syst. Tech. J., vol. 27, pp. 623–656.


Viterbi, A. J. (1967), “Error bounds for convolutional codes and an asymptotically optimal decoding algorithm,” IEEE Trans. Inf. Theory, vol. 13, pp. 260–269. Werbos, P. J. (1974), Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences, Ph.D. dissertation, Harvard University. Widrow, B. and M. E. Hoff (1960), “Adaptive switching circuits,” IRE WESCON Conv. Rec., Institute of Radio Engineers, pt. 4, pp. 96–104. Wiener, N. (1949), Extrapolation, Interpolation and Smoothing of Stationary Time Series, Technology Press and Wiley. Originally published in 1942 as a classified Nat. Defense Res. Council Report. Also published under the title Time Series Analysis by MIT Press. Zadeh, L. A. (1954), “System theory,” Columbia Engr. Quart., vol. 8, pp. 16–19.

Notation

The following is a list of notational conventions used in the text: (a) We use boldface letters to denote random variables and normal font letters to refer to their realizations (i.e., deterministic values), like x and x, respectively. In other words, we reserve the boldface notation for random quantities. (b) We use CAPITAL LETTERS for matrices and small letters for both vectors and scalars, for example, X and x, respectively. In view of the first convention, X would denote a matrix with random entries, while X would denote a matrix realization (i.e., a matrix with deterministic entries). Likewise, x would denote a vector with random entries, while x would denote a vector realization (or a vector with deterministic entries). One notable exception to the capital letter notation is the use of such letters to refer to matrix dimensions or the number of data points. For example, we usually write M to denote the size of a feature vector and N to denote the number of data samples. These exceptions will be obvious from the context. (c) Small Greek letters generally refer to scalar quantities, such as α and β, while CAPITAL Greek letters generally refer to matrices such as Σ and z. (d) All vectors in our presentation are column vectors unless mentioned otherwise. Thus, if h ∈ IRM refers to a feature vector and w ∈ IRM refers to a classifier, then their inner product is hT w where ·T denotes the transposition symbol. (e) If P (w) : IRM → IR is some objective function, then its gradient relative to wT is denoted by either ∇wT P (w) or ∂P (w)/∂w and this notation refers to the column vector consisting of the partial derivatives of P (w) relative to the individual entries of w: 

  ∂P (w)/∂w = ∇wT P (w) =  

\begin{bmatrix} \partial P(w)/\partial w_1 \\ \partial P(w)/\partial w_2 \\ \vdots \\ \partial P(w)/\partial w_M \end{bmatrix} \qquad (M \times 1)


Symbols We collect here, for ease of reference, a list of the main symbols used throughout the text. IR C ZZ IRM ·T ·∗ x X x X Ex E x g(x) x⊥y x⊥y x y|z kxk2 kxk2W kxk or kxk2 kxk1 kxk∞ kxk? kAk or kAk2 kAkF kAk1 kAk∞ kAk? col{a, b} diag{a, b} diag{A} diag{a} a⊕b a = vec{A} A = vec−1 {a} blkcol{a, b} blkdiag{A, B} a b a b A⊗B A ⊗b B |=


set of real numbers set of complex numbers set of integer numbers set of M × 1 real vectors matrix transposition complex conjugation (transposition) for scalars (matrices) boldface letter denotes a random scalar or vector boldface capital letter denotes a random matrix letter in normal font denotes a scalar or vector capital letter in normal font denotes a matrix expected value of the random variable x expected value of g(x) relative to pdf of x orthogonal random variables x and y, i.e., E xy T = 0 orthogonal vectors x and y, i.e., xT y = 0 x and y are conditionally independent given z xT x for a real x; squared Euclidean norm of x xT W x for a real x and positive-definite matrix W √ xT x for a real column vector x; Euclidean norm of x `1 -norm of vector x; sum of its absolute entries `∞ -norm of vector x; maximum absolute entry dual norm of vector x maximum singular value of A; also the spectral norm of A Frobenius norm of A `1 -norm of matrix A or maximum absolute column sum `∞ -norm of matrix A or maximum absolute row sum dual norm of matrix A column vector with a and b stacked on top of each other diagonal matrix with diagonal entries a and b column vector formed from the diagonal entries of A diagonal matrix with entries read from column a same as diag{a, b} column vector a formed by stacking the columns of A square matrix A recovered by unstacking its columns from a columns a and b stacked on top of each other block diagonal matrix with blocks A and B on diagonal Hadamard elementwise product of vectors a and b Hadamard elementwise division of vectors a and b Kronecker product of matrices A and B block Kronecker product of block matrices A and B


A† ∆

a = b 0 IM P >0 P ≥0 P 1/2 A>B A≥B det A Tr(A) A = QR A = U ΣV T ρ(A) λ(A) σ(A) N(A) R(A) rank(A) In(A) b x e x x ¯ or E x σx2 Rx P(A) P(A|B) fx (x) fx (x; θ) fx|y (x|y) S(θ) F (θ) Nx (¯ x, Rx ) x ∼ fx (x) Gg (m(x), K(x, x0 )) K(x, x0 ) H(x) H(x|y) I(x; y) DKL (pkq) Dφ (a, b)


pseudo-inverse of A quantity a is defined as b zero scalar, vector, or matrix identity matrix of size M × M positive-definite matrix P positive-semidefinite matrix P square-root factor of P ≥ 0, usually lower triangular means that A N B is positive-definite means that A N B is positive-semidefinite determinant of matrix A trace of matrix A QR factorization of matrix A SVD factorization of matrix A spectral radius of matrix A refers to a generic eigenvalue of A refers to a generic singular value of A nullspace of matrix A range space or column span of matrix A rank of matrix A inertia of matrix A estimator for x error in estimating x mean of random variable x variance of a scalar random variable x covariance matrix of a vector random variable x probability of event A probability of event A conditioned on knowledge of B pdf of random variable x pdf of x parameterized by θ conditional pdf of random variable x given y score function, equal to ∇θT fx (x; θ) Fisher information matrix: covariance matrix of S(θ) Gaussian distribution over x with mean x ¯, covariance Rx random variable x distributed according to pdf fx (x) Gaussian process g; mean m(x) and covariance K(x, x0 ) kernel function with arguments x, x0 entropy of random variable x conditional entropy of random variable x given y mutual information of random variables x and y KL divergence of pdf distributions px (x) and qx (x) Bregman divergence of points a, b relative to mirror φ(w)


Sx (z) Sx (ejω ) b MSE x b MAP x b MVUE x x bML `(θ) PC (z) dist(x, C) proxµh (z) Mµh (z) h? (x) Tα (x) {x}+ P (w) or P(W ) Q(w, ·) or Q(W, ·) ∇w P (w) ∇wT P (w) ∂P (w)/∂w ∇2w P (w) ∂w P (w) hn γ(n) γn c(hn ) wn w? wo w en Remp (c) R(c) q ? (z) I[a] IC,∞ [x] log a ln a O(µ) o(µ) O(1/n) o(1/n) mb(x) pa(x) φC (·)

z-spectrum of stationary random process, x(n) power spectral density function mean-square-error estimator of x maximum a-posteriori estimator of x minimum-variance unbiased estimator of x maximum-likelihood estimate of x log-likelihood function; parameterized by θ projection of point z onto convex set C distance from point x to convex set C proximal projection of z relative to h(w) Moreau envelope of proxµh (z) conjugate function of h(w) soft-thresholding applied to x with threshold α max{0, x} risk functions loss functions row gradient vector of P (w) relative to w column gradient vector of P (w) relative to wT same as the column gradient ∇wT P (w) Hessian matrix of P (w) relative to w subdifferential set of P (·) at location w nth feature vector nth target or label signal, when scalar nth target or label signal, when vector classifier applied to hn weight iterate at nth iteration of an algorithm minimizer of an empirical risk, P (w) minimizer of a stochastic risk, P (w) weight error at iteration n empirical risk of classifier c(h) actual risk of classifier c(h) variational factor for estimating the posterior fz|y (z|y) indicator function: 1 when a is true; otherwise 0 indicator function: 0 when x ∈ C; otherwise ∞ logarithm of a relative to base 10 natural logarithm of a asymptotically bounded by a constant multiple of µ asymptotically bounded by a higher power of µ decays asymptotically at rate comparable to 1/n decays asymptotically at rate faster than 1/n Markov blanket of node x in a graph parents of node x in a graph potential function associated with clique C


Nk M = (S, A, P, r) S A P r π(a|s) π ? (a|s) v π (s) v ? (s) π q (s, a) q ? (s, a) softmax(z) z` y` W` θ` (`) wij θ` (j) δ`


neighborhood of node k in a graph Markov decision process set of states for an MDP set of actions for an MDP transition probabilities for an MDP reward function for an MDP policy for selecting action a conditioned on state s optimal policy for an MDP state value function at state s optimal state value function at state s state–action value function at state s and action a optimal state–action value function at (s, a) softmax operation applied to entries of vector z pre-activation vector at layer ` of a neural network post-activation vector at layer ` of a neural network weighting matrix between layers ` and ` + 1 bias vector feeding into layer ` + 1 weight from node i in layer ` to node j in layer ` + 1 weight from bias source in layer ` to node j in layer ` + 1 sensitivity vector for layer `

Abbreviations ADF assumed density filtering ae convergence almost everywhere AIC Akaike information criterion BIBO bounded-input bounded-output BIC Bayesian information criterion BPTT backpropagation through time cdf cumulative distribution function CNN convolutional neural network CPT conditional probability table DAG directed acyclic graph EKF extended Kalman filter ELBO evidence lower bound ELU exponential linear unit EM expectation maximization EP expectation propagation ERC exact recovery condition ERM empirical risk minimization FDA Fisher discriminant analysis FGSM fast gradient sign method GAN generative adversarial network GLM generalized linear model GMM Gaussian mixture model GRU gated recurrent unit


HALS HMM ICA iid IRLS ISTA JSMA KKT KL LASSO LDA LDA LLMSE LMSE LOESS LOWESS LSTM LWRP MAB MAP MCMC MDL MDP ML MMSE MP MRF MSD MSE MVUE NMF NN OCR OMP PCA pdf PGM pmf POMDP PPO RBF RBM ReLu RIP

hierarchical alternating least-squares hidden Markov model independent component analysis independent and identically distributed iterative reweighted least-squares iterated soft-thresholding algorithm Jacobian saliency map approach Karush–Kuhn–Tucker Kullback–Leibler divergence least absolute shrinkage and selection operator latent Dirichlet allocation linear discriminant analysis linear least-mean-squares error least-mean-squares error locally estimated scatter-plot smoothing locally weighted scatter-plot smoothing long short-term memory network layer-wise relevance propagation multi-armed bandit maximum a-posteriori Markov chain Monte Carlo minimum description length Markov decision process maximum likelihood minimum mean-square error matching pursuit Markov random field mean-square deviation mean-square error minimum variance unbiased estimator nonnegative matrix factorization nearest-neighbor rule optical character recognition orthogonal matching pursuit principal component analysis probability density function probabilistic graphical model probability mass function partially observable MDP proximal policy optimization radial basis function restricted Boltzmann machine rectified linear unit restricted isometry property


RKHS RLS RNN RTRL SARSA SNR SOM SVD SVM TD TRPO UCB VAE VC VI 

reproducing kernel Hilbert space recursive least-squares recurrent neural network real time recurrent learning sequence of state, action, reward, state, action signal-to-noise ratio self-organizing map singular value decomposition support vector machine temporal difference trust region policy optimization upper confidence bound variational autoencoder Vapnik–Chervonenkis dimension variational inference end of theorem/lemma/proof/remark


50 Least-Squares Problems

We studied in Chapters 29 and 30 the mean-square error (MSE) criterion in some detail, and applied it to the problem of inferring an unknown (or hidden) variable x from the observation of another variable y when {x, y} are related by means of a linear regression model or a state-space model. In the latter case, we derived several algorithms for the solution of the inference problem, such as the Kalman filter, its measurement and time-update forms, and its approximate nonlinear forms. We revisit the linear least-mean-square error (LLMSE) formulation in this chapter and use it to motivate an alternative least-squares method that is purely data-driven. This second method will not require knowledge of statistical moments of the variables involved because it will operate directly on data measurements to learn the hidden variable. This data-driven approach to inference will be prevalent in all chapters in this volume, where we describe many other learning algorithms for the solution of general inference problems that rely on other choices for the loss function, other than the quadratic loss. We start our analysis of data-driven methods by focusing on the least-squares problem because it is mathematically tractable and sheds useful insights on many challenges that will hold more generally. We will explain how some of these challenges are addressed in least-squares formulations (e.g., by using regularization) and subsequently apply similar ideas to other inference problems, especially in the classification context when x assumes discrete values.

50.1

MOTIVATION The MSE problem of estimating a scalar random variable x ∈ IR from observations of a vector random variable y ∈ IR^M seeks a mapping c(y) that solves

\hat{x} = \arg\min_{c(y)} \; E\,\big(x - c(y)\big)^2 \qquad (50.1)

We showed in (27.18) that the optimal estimate is given by the conditional mean x̂ = E(x | y = y). For example, for continuous random variables, the MSE estimate involves an integral computation of the form:

\hat{x} = \int_{x \in X} x\, f_{x|y}(x|y)\, dx \qquad (50.2)

over the domain of the realizations, x ∈ X. Evaluation of this solution requires knowledge of the conditional distribution, f_{x|y}(x|y). Even if f_{x|y}(x|y) were available, computation of the integral expression is generally not possible in closed form. In Chapter 29, we limited c(y) to the class of affine functions of y and considered instead the problem:

(w^o, \theta^o) = \arg\min_{w,\theta} \; E\,(x - \hat{x})^2, \quad \text{subject to } \hat{x} = y^T w - \theta \qquad (50.3)

for some vector parameter w ∈ IR^M and offset θ ∈ IR. The minus sign in front of θ is for convenience. Let {x̄, ȳ} denote the first-order moments of the random variables x and y, i.e., their means:

\bar{x} = E\,x, \qquad \bar{y} = E\,y \qquad (50.4a)

and let {σ_x^2, R_y, r_{xy}} denote their second-order moments, i.e., their (co)-variances and cross-covariance:

\sigma_x^2 = E\,(x - \bar{x})^2 \qquad (50.4b)
R_y = E\,(y - \bar{y})(y - \bar{y})^T \qquad (50.4c)
r_{xy} = E\,(x - \bar{x})(y - \bar{y})^T = r_{yx}^T \qquad (50.4d)

Theorem 29.1 showed that the LLMSE estimator and the resulting minimum mean-square error (MMSE) are given by

\hat{x}_{\rm LLMSE} - \bar{x} = r_{xy} R_y^{-1} (y - \bar{y}) \qquad (50.5a)
{\rm MMSE} = \sigma_x^2 - r_{xy} R_y^{-1} r_{yx} \qquad (50.5b)

In other words, the optimal parameters are given by

w^o = R_y^{-1} r_{yx}, \qquad \theta^o = \bar{y}^T w^o - \bar{x} \qquad (50.6)

Note in particular that the offset parameter is unnecessary if the variables have zero mean since in that case θ^o = 0. More importantly, observe that the estimator x̂_LLMSE requires knowledge of the first- and second-order moments of the random variables {x, y}. When this information is not available, we need to follow a different route to solve the inference problem. To do so, we will replace the stochastic risk that appears in (50.3) by an empirical risk as follows:

(w^\star, \theta^\star) = \arg\min_{w,\theta} \left\{ P(w, \theta) \stackrel{\Delta}{=} \frac{1}{N} \sum_{n=0}^{N-1} \big( x(n) - (y_n^T w - \theta) \big)^2 \right\} \qquad (50.7)

which is written in terms of a collection of N independent realizations {x(n), y_n}; these measurements are assumed to arise from the underlying joint distribution for the variables {x, y} and they are referred to as the training data because they will be used to determine the solution (w^\star, \theta^\star). Once (w^\star, \theta^\star) are learned, they can then be used to predict the x-value corresponding to some future observation y by using

\hat{x} = y^T w^\star - \theta^\star \qquad (50.8)

Obviously, under ergodicity, the empirical risk in (50.7) converges to the stochastic risk in (50.3) as N → ∞. However, even if ergodicity does not hold, we can still pose the empirical risk minimization problem (50.7) independently and seek its solution. Note that we are denoting the empirical risk by the letter P(·); in this case, it depends on two parameters: w and θ. We are also denoting the optimal parameter values by (w^\star, \theta^\star) to distinguish them from (w^o, \theta^o). As explained earlier in the text, we use the ⋆ superscript to refer to minimizers of empirical risks, and the o superscript to refer to minimizers of stochastic risks.

50.1.1

Stochastic Optimization At this stage, one could consider learning (w^\star, \theta^\star) by applying any of the stochastic optimization algorithms studied in earlier chapters, such as a stochastic gradient algorithm or a mini-batch version of it, say,

select a sample {x(n), y_n} at random at iteration n
let \hat{x}(n) = y_n^T w_{n-1} - \theta(n-1)
update w_n = w_{n-1} + 2\mu\, y_n \big(x(n) - \hat{x}(n)\big)
update \theta(n) = \theta(n-1) - 2\mu \big(x(n) - \hat{x}(n)\big)
\qquad (50.9)

This construction is based on using an instantaneous gradient approximation at iteration n. The recursions can be grouped together as follows:

\hat{x}(n) = \begin{bmatrix} 1 & y_n^T \end{bmatrix} \begin{bmatrix} -\theta(n-1) \\ w_{n-1} \end{bmatrix} \qquad (50.10a)

\begin{bmatrix} -\theta(n) \\ w_n \end{bmatrix} = \begin{bmatrix} -\theta(n-1) \\ w_{n-1} \end{bmatrix} + 2\mu \begin{bmatrix} 1 \\ y_n \end{bmatrix} \big(x(n) - \hat{x}(n)\big) \qquad (50.10b)

which are expressed in terms of the extended variables of dimension M + 1 each:

y' \stackrel{\Delta}{=} \begin{bmatrix} 1 \\ y \end{bmatrix}, \qquad w' \stackrel{\Delta}{=} \begin{bmatrix} -\theta \\ w \end{bmatrix} \qquad (50.11)

Using the extended notation we can write down the equivalent representation:

\hat{x}(n) = (y'_n)^T w'_{n-1} \qquad (50.12a)
w'_n = w'_{n-1} + 2\mu\, y'_n \big(x(n) - \hat{x}(n)\big) \qquad (50.12b)

After sufficient iterations, the estimators (w_n, \theta(n)) approach (w^\star, \theta^\star). These values can then be used to predict the hidden variable x(t) for any new observation y_t as follows:

\hat{x}(t) = y_t^T w^\star - \theta^\star \qquad (50.13)
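For readers who prefer to see the recursions (50.9)–(50.12) in code, the following is a minimal NumPy sketch (not from the text); the step size, number of passes, and synthetic data are illustrative choices only, and the book's own simulation codes are written in Matlab rather than Python.

import numpy as np

def stochastic_gradient_ls(X, Y, mu=0.05, passes=50, seed=0):
    # Stochastic-gradient sketch of recursions (50.9)-(50.12).
    # Y: (N, M) array of feature vectors y_n; X: (N,) array of targets x(n).
    # Returns the extended weight vector w' = [-theta; w], cf. (50.11).
    rng = np.random.default_rng(seed)
    N, M = Y.shape
    Yp = np.hstack([np.ones((N, 1)), Y])      # extended regressors y'_n = [1; y_n]
    wp = np.zeros(M + 1)                      # extended weights w' = [-theta; w]
    for _ in range(passes):
        for n in rng.permutation(N):          # sample the data at random
            x_hat = Yp[n] @ wp                # prediction, cf. (50.12a)
            wp = wp + 2 * mu * Yp[n] * (X[n] - x_hat)   # update, cf. (50.12b)
    return wp

# illustrative usage with synthetic data
rng = np.random.default_rng(1)
Y = rng.normal(size=(200, 3))
X = Y @ np.array([1.0, -2.0, 0.5]) - 0.3 + 0.01 * rng.normal(size=200)
print(stochastic_gradient_ls(X, Y))           # learned [-theta; w]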


It turns out, however, that problem (50.7) has a special structure that can be exploited to motivate a second exact (rather than approximate) recursive solution, for updating wn−1 to wn , known as the recursive least-squares (RLS) algorithm.

50.1.2

Least-Squares Risk Using the extended notation, we rewrite the empirical risk problem (50.7) in the form

(w')^\star = \arg\min_{w' \in {\rm I\!R}^{M+1}} \left\{ P(w') \stackrel{\Delta}{=} \frac{1}{N} \sum_{n=0}^{N-1} \big( x(n) - (y'_n)^T w' \big)^2 \right\} \qquad (50.14)

without an offset parameter. For simplicity of notation, we will assume henceforth that the vectors (w, y_n) have been extended according to (50.11) and will continue to use the same notation (w, y_n), without the prime notation, for the extended quantities:

y \leftarrow \begin{bmatrix} 1 \\ y \end{bmatrix}, \qquad w \leftarrow \begin{bmatrix} -\theta \\ w \end{bmatrix} \qquad (50.15)

We will also continue to denote their dimension generically by M (rather than M + 1). Thus, our problem becomes one of solving

w^\star = \arg\min_{w \in {\rm I\!R}^M} \left\{ P(w) \stackrel{\Delta}{=} \frac{1}{N} \sum_{n=0}^{N-1} \big( x(n) - y_n^T w \big)^2 \right\} \qquad (50.16)

from knowledge of N data pairs {x(n), y_n}. We can rewrite this problem in a more familiar least-squares form by collecting the data into convenient vector and matrix quantities. For this purpose, we introduce the N × M and N × 1 variables

H \stackrel{\Delta}{=} \begin{bmatrix} y_0^T \\ y_1^T \\ y_2^T \\ \vdots \\ y_{N-1}^T \end{bmatrix}, \qquad d \stackrel{\Delta}{=} \begin{bmatrix} x(0) \\ x(1) \\ x(2) \\ \vdots \\ x(N-1) \end{bmatrix} \qquad (50.17)

The matrix H contains all observation vectors {y_n} transposed as rows, while the vector d contains all target signals {x(n)}. Then, the risk function takes the form

P(w) = \frac{1}{N} \|d - Hw\|^2 \qquad (50.18)

in terms of the squared Euclidean norm of the error vector d - Hw. The scaling by 1/N does not affect the location of the minimizer w^\star and, therefore, it can be ignored. In this way, formulation (50.16) becomes the standard least-squares problem:




w^\star = \arg\min_{w \in {\rm I\!R}^M} \; \|d - Hw\|^2 \qquad \text{(standard least-squares)} \qquad (50.19)

We motivated (50.19) by linking it to the MSE formulation (50.3) and replacing the stochastic risk by an empirical risk. Of course, the least-squares problem is of independent interest in its own right. Given a collection of data points {x(n), y_n}, with scalars x(n) and column vectors y_n, we can formulate problem (50.19) directly in terms of these quantities and seek the vector w that matches Hw to d in the least-squares sense.

Example 50.1 (Maximum-likelihood interpretation) There is another way to motivate the least-squares problem as the solution to a maximum-likelihood (ML) estimation problem in the presence of Gaussian noise. Assume we collect N iid observations {x(n), y_n}, for 0 ≤ n ≤ N − 1. Assume further that these observations happen to satisfy a linear regression model of the form:

x(n) = y_n^T w + v(n) \qquad (50.20)

for some unknown vector w ∈ IR^M, and where v(n) is white Gaussian noise with zero mean and variance σ_v^2, i.e., v \sim N_v(0, \sigma_v^2). It is straightforward to conclude that the likelihood function of the joint observations {x(n), y_n}, given the model w, is

f_{x,y}(y_0, \ldots, y_{N-1}, x(0), \ldots, x(N-1); w) = f_v(v(0), \ldots, v(N-1); w)
= \prod_{n=0}^{N-1} \frac{1}{\sqrt{2\pi\sigma_v^2}} \exp\left\{ -\frac{\big(x(n) - y_n^T w\big)^2}{2\sigma_v^2} \right\}
= \frac{1}{(2\pi\sigma_v^2)^{N/2}} \exp\left\{ -\frac{1}{2\sigma_v^2} \sum_{n=0}^{N-1} \big( x(n) - y_n^T w \big)^2 \right\} \qquad (50.21)

so that the log-likelihood function is given by

\ell\big(\{x(n), y_n\}; w\big) = -\frac{N}{2} \ln(2\pi\sigma_v^2) - \frac{1}{2\sigma_v^2} \sum_{n=0}^{N-1} \big( x(n) - y_n^T w \big)^2 \qquad (50.22)

The maximization of the log-likelihood function over w leads to the equivalent problem

w^\star = \arg\min_{w \in {\rm I\!R}^M} \left\{ \sum_{n=0}^{N-1} \big( x(n) - y_n^T w \big)^2 \right\} \qquad (50.23)

which is the same least-squares problem (50.16). In Prob. 50.6 we consider a variation of this argument in which the noise process v(n) is not white, which will then lead to the solution of a weighted least-squares problem.


50.2

NORMAL EQUATIONS Problem (50.19) can be solved in closed form using either algebraic or geometric arguments. We expand the least-squares risk:

\|d - Hw\|^2 = \|d\|^2 - 2 d^T H w + w^T H^T H w \qquad (50.24)

and differentiate with respect to w to find that the minimizer w^\star should satisfy the normal equations:

H^T H w^\star = H^T d \qquad \text{(normal equations)} \qquad (50.25)

Alternatively, we can pursue a geometric argument to arrive at this same conclusion. Note that, for any w, the vector Hw lies in the column span (or range space) of H, written as Hw ∈ R(H). Therefore, the least-squares criterion (50.19) is in effect seeking a column vector in the range space of H that is closest to d in the Euclidean norm sense. We know from Euclidean geometry that the closest vector to d within R(H) can be obtained by projecting d onto R(H), as illustrated in Fig. 50.1. This means that the residual vector, d − Hw^\star, should be orthogonal to all vectors in R(H):

d - H w^\star \;\perp\; H p, \quad \text{for any } p \qquad (50.26)

which is equivalent to

p^T H^T (d - H w^\star) = 0, \quad \text{for any } p \qquad (50.27)

Clearly, the only vector that is orthogonal to any p is the zero vector, so that

H^T (d - H w^\star) = 0 \qquad (50.28)

and we arrive again at the normal equations (50.25).

Figure 50.1 A least-squares solution is obtained when d − Hw^\star is orthogonal to R(H).


50.2.1


Consistent Equations We explained earlier in Section 1.51 that equations of the form (50.25) are always consistent (i.e., they always have a solution). This is because the matrices H^T and H^T H have the same range spaces so that, for any d and H:

H^T d \in R(H^T H) \qquad (50.29)

Moreover, the normal equations will either have a unique solution or infinitely many solutions. The solution will be unique when H^T H is invertible, which happens when H has full column rank. This condition requires N ≥ M, which means that there should be at least as many observations as the number of unknowns in w. The full-rank condition implies that the columns of H are not redundant. In this case, we obtain

w^\star = (H^T H)^{-1} H^T d \qquad (50.30)

In all other cases, the matrix product H^T H will be rank-deficient. For instance, this situation arises when N < M, which corresponds to the case in which we have insufficient data (fewer measurements than the number of unknowns). This situation is not that uncommon in practice. For example, it arises in streaming data implementations when we have not collected enough data to surpass M. When H^T H is singular, the normal equations (50.25) will have infinitely many solutions, all of them differing from each other by vectors in the nullspace of H – recall (1.56). That is, for any two solutions {w_1^\star, w_2^\star} to (50.25), it will hold that

w_2^\star = w_1^\star + p, \quad \text{for some } p \in N(H) \qquad (50.31)

Although unnecessary for the remainder of the discussions in this chapter, we explain in Appendix 50.A that when infinitely many solutions w^\star exist to the least-squares problem (50.19), we can determine the solution with the smallest Euclidean norm among these by employing the pseudo-inverse of H – see expression (50.179). Specifically, the solution to the following problem

\min_{w \in {\rm I\!R}^M} \|w\|^2, \quad \text{subject to } H^T H w = H^T d \qquad (50.32)

is given by

w^\star = H^\dagger d \qquad (50.33)

where H^\dagger denotes the pseudo-inverse matrix.
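As a quick illustration of the two cases above, the following is a minimal NumPy sketch (not from the text) that returns the unique solution (50.30) when H has full column rank and the minimum-norm solution (50.33) otherwise; the matrices used are arbitrary examples.

import numpy as np

def ls_solution(H, d):
    # Least-squares solution of min ||d - Hw||^2 via the normal equations (50.25).
    N, M = H.shape
    if np.linalg.matrix_rank(H) == M:
        # full column rank: unique solution w = (H^T H)^{-1} H^T d, cf. (50.30)
        return np.linalg.solve(H.T @ H, H.T @ d)
    # rank-deficient case: minimum-norm solution w = H† d, cf. (50.33)
    return np.linalg.pinv(H) @ d

rng = np.random.default_rng(0)
H_full = rng.normal(size=(10, 3))                 # generically full column rank
H_defic = np.hstack([H_full, H_full[:, :1]])      # repeated column => rank deficient
d = rng.normal(size=10)
print(ls_solution(H_full, d))
print(ls_solution(H_defic, d))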

50.2.2

Minimum Risk For any solution w^\star of (50.25), we denote the resulting closest vector to d by d̂ = Hw^\star and refer to it as the projection of d onto R(H):

\hat{d} \stackrel{\Delta}{=} H w^\star = \text{projection of } d \text{ onto } R(H) \qquad (50.34)


It is straightforward to verify that even when the normal equations have a multitude of solutions w^\star, all of them will lead to the same value for d̂. This observation can be justified both algebraically and geometrically. From a geometric point of view, projecting d onto R(H) results in a unique projection d̂. From an algebraic point of view, if w_1^\star and w_2^\star are two arbitrary solutions, then from (50.31) we find that

\hat{d}_2 \stackrel{\Delta}{=} H w_2^\star = H(w_1^\star + p) = H w_1^\star = \hat{d}_1 \qquad (50.35)

What the different solutions w^\star amount to, when they exist, are equivalent representations for the unique d̂ in terms of the columns of H. We denote the residual vector resulting from the projection by

\tilde{d} \stackrel{\Delta}{=} d - H w^\star \qquad (50.36)

so that the orthogonality condition (50.28) can be rewritten as

H^T \tilde{d} = 0 \qquad \text{(orthogonality condition)} \qquad (50.37)

We express this orthogonality condition more succinctly by writing d̃ ⊥ R(H), where the ⊥ notation is used to mean that d̃ is orthogonal to any vector in the range space (column span) of H. In particular, since, by construction, d̂ ∈ R(H), it also holds that

\tilde{d} \perp \hat{d} \quad \text{or} \quad (\hat{d})^T \tilde{d} = 0 \qquad (50.38)

Let ξ denote the minimum risk value, i.e., the minimum value of (50.19). This is sometimes referred to as the training error because it is the minimum value evaluated on the training data {x(n), y_n}. It can be evaluated as follows:

ξ = \|d - H w^\star\|^2
  = (d - H w^\star)^T (d - H w^\star)
  = (d - H w^\star)^T (d - \hat{d})
  = d^T (d - H w^\star), \quad \text{since } (d - H w^\star) \perp \hat{d} \text{ by (50.38)}
  = d^T d - d^T H w^\star
  = d^T d - (w^\star)^T H^T H w^\star, \quad \text{since } d^T H = (w^\star)^T H^T H \text{ by (50.25)}
  = d^T d - (\hat{d})^T \hat{d} \qquad (50.39)

That is, we obtain the following two equivalent representations for the minimum risk:

ξ = \|d\|^2 - \|\hat{d}\|^2 = d^T \tilde{d} \qquad \text{(minimum risk)} \qquad (50.40)

50.2.3

Projections

When H has full column rank (and, hence, N ≥ M), the coefficient matrix H^T H becomes invertible and the least-squares problem (50.19) will have a unique solution given by

w^\star = (H^T H)^{-1} H^T d \qquad (50.41)

with the corresponding projection vector

\hat{d} = H w^\star = H (H^T H)^{-1} H^T d \qquad (50.42)

The matrix multiplying d in the above expression is called the projection matrix onto R(H) and we denote it by

P_H \stackrel{\Delta}{=} H (H^T H)^{-1} H^T, \quad \text{when } H \text{ has full column rank} \qquad (50.43)

The designation projection matrix stems from the fact that multiplying d by P_H projects it onto the column span of H and results in d̂. Such projection matrices play a prominent role in least-squares theory and they have many useful properties. For example, projection matrices are symmetric and also idempotent, i.e., they satisfy

P_H^T = P_H, \qquad P_H^2 = P_H \qquad (50.44)

Note further that the residual vector, d̃ = d − Hw^\star, is given by

\tilde{d} = d - P_H d = (I - P_H) d = P_H^{\perp} d \qquad (50.45)

so that the matrix

P_H^{\perp} \stackrel{\Delta}{=} I - P_H \qquad (50.46)

is called the projection matrix onto the orthogonal complement space of H. It is easy to see that the minimum risk value can be expressed in terms of P_H^{\perp} as follows:

ξ = d^T d - (\hat{d})^T \hat{d}
  = d^T d - d^T P_H^T P_H d
  = d^T d - d^T P_H d, \quad \text{since } P_H^T P_H = P_H^2 = P_H \qquad (50.47)

That is,

ξ = d^T P_H^{\perp} d \qquad (50.48)

In summary, we arrive at the following statement for the solution of the standard least-squares problem.


Theorem 50.1. (Solution of least-squares problem) Consider the standard least-squares problem (50.19) where H ∈ IR^{N×M}: (a) When H has full column rank, which necessitates N ≥ M, the least-squares problem will have a unique solution given by w^\star = (H^T H)^{-1} H^T d. (b) Otherwise, the least-squares problem will have infinitely many solutions w^\star satisfying H^T H w^\star = H^T d. Moreover, any two solutions will differ by vectors in N(H) and the solution with the smallest Euclidean norm is given by w^\star = H^\dagger d. In either case, the projection of d onto R(H) is unique and given by d̂ = Hw^\star. Moreover, the minimum risk value is ξ = d^T d̃, where d̃ = d − d̂.

50.2.4

Weighted and Regularized Variations

There are several extensions and variations of the least-squares formulation, which we will encounter at different locations in our treatment. For example, one may consider a weighted least-squares problem of the form

w^\star \stackrel{\Delta}{=} \arg\min_{w \in {\rm I\!R}^M} \left\{ (d - Hw)^T R (d - Hw) \right\} \qquad \text{(weighted least-squares)} \qquad (50.49)

where R ∈ IR^{N×N} is a symmetric positive-definite weighting matrix. Assume, for illustration purposes, that R is diagonal with entries {r(n)}. Then, the above problem reduces to (we prefer to restore the 1/N factor when using the original data):

w^\star \stackrel{\Delta}{=} \arg\min_{w \in {\rm I\!R}^M} \left\{ \frac{1}{N} \sum_{n=0}^{N-1} r(n) \big( x(n) - y_n^T w \big)^2 \right\} \qquad (50.50)

where the individual squared errors appear scaled by r(n). In this way, errors originating from some measurements will be scaled more or less heavily than errors originating from other measurements. In other words, incorporating a weighting matrix R into the least-squares formulation allows the designer to control the relative importance of the errors contributing to the risk value. One can also consider penalizing the size of the parameter w by modifying the weighted risk function in the following manner:

w^\star \stackrel{\Delta}{=} \arg\min_{w \in {\rm I\!R}^M} \left\{ \rho \|w\|^2 + (d - Hw)^T R (d - Hw) \right\} \qquad (\ell_2\text{-regularized weighted least-squares}) \qquad (50.51)

where ρ > 0 is called an \ell_2-regularization parameter (since it penalizes the \ell_2-norm of w). We will discuss regularization in greater detail in the next chapter. Here, we comment briefly on its role. Observe, for instance, that if ρ is large, then the term ρ\|w\|^2 will have a nontrivial effect on the value of the risk function. As such, when ρ is large, the solution w^\star should have smaller Euclidean norm


since the objective is to minimize the overall risk. In this way, the parameter ρ provides the designer with the flexibility to limit the norm of w to small values. Additionally, it is straightforward to verify by differentiating the above risk function that the solution w^\star satisfies the equations:

(\rho I_M + H^T R H)\, w^\star = H^T R\, d \qquad (50.52)

Observe, in particular, that even when the product H^T R H happens to be singular, the coefficient matrix \rho I_M + H^T R H will be positive-definite and, hence, invertible, due to the addition of the positive term \rho I_M. This ensures that the solution will always be unique and given by

w^\star = (\rho I_M + H^T R H)^{-1} H^T R\, d \qquad (50.53)
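As a quick illustration of (50.52)–(50.53), here is a minimal NumPy sketch (not from the text) of the ℓ2-regularized weighted least-squares solution; the weighting matrix, regularization value, and data are arbitrary choices used only for demonstration.

import numpy as np

def regularized_weighted_ls(H, d, R, rho):
    # Solve (rho*I_M + H^T R H) w = H^T R d, cf. (50.52)-(50.53).
    M = H.shape[1]
    A = rho * np.eye(M) + H.T @ R @ H     # positive-definite for any rho > 0
    return np.linalg.solve(A, H.T @ R @ d)

rng = np.random.default_rng(0)
H = rng.normal(size=(50, 4))
d = rng.normal(size=50)
R = np.diag(rng.uniform(0.5, 2.0, size=50))   # diagonal weights r(n)
print(regularized_weighted_ls(H, d, R, rho=0.1))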

Example 50.2 (Sea-level change) We apply the least-squares formalism to the problem of fitting a regression line through measurements related to the change in sea level (measured in millimeters) relative to the start of year 1993. There are N = 952 data points consisting of fractional year values and the corresponding sea-level change. We denote the fractional year value by y(n) and the sea-level change by x(n) for every entry n = 1, 2, . . . , 952. For example, the second entry (n = 2) in the data corresponds to year 1993.0386920, which represents a measurement performed about 14 days into year 1993. Using the least-squares formalism, we already know how to fit a regression line through these data points by solving a problem of the form:

(\alpha^\star, \theta^\star) \stackrel{\Delta}{=} \arg\min_{\alpha, \theta} \left\{ \frac{1}{N} \sum_{n=0}^{N-1} \big( x(n) - (\alpha y(n) - \theta) \big)^2 \right\} \qquad (50.54)

where (α, θ) are scalar parameters in this case. For convenience, we employ the vector notation as follows. We collect the measurements {x(n), y(n)} into the N × 1 vector and N × 2 matrix quantities:

d = \begin{bmatrix} x(0) \\ x(1) \\ \vdots \\ x(N-1) \end{bmatrix} \in {\rm I\!R}^N, \qquad H = \begin{bmatrix} 1 & y(0) \\ 1 & y(1) \\ \vdots & \vdots \\ 1 & y(N-1) \end{bmatrix} \in {\rm I\!R}^{N \times 2} \qquad (50.55)

and introduce the parameter vector:

w = \begin{bmatrix} -\theta \\ \alpha \end{bmatrix} \in {\rm I\!R}^2 \qquad (50.56)

Then, problem (50.54) is equivalent to

w^\star \stackrel{\Delta}{=} \arg\min_{w \in {\rm I\!R}^2} \|d - Hw\|^2 \qquad (50.57)

whose solution is given by

w^\star \stackrel{\Delta}{=} (H^T H)^{-1} H^T d = \begin{bmatrix} -\theta^\star \\ \alpha^\star \end{bmatrix} \qquad (50.58)

We find that

w^\star = \begin{bmatrix} -\theta^\star \\ \alpha^\star \end{bmatrix} \approx \begin{bmatrix} -5961.9 \\ 2.9911 \end{bmatrix} \qquad (50.59)

Figure 50.2 (Top) Result of fitting a linear regression line onto measurements showing the change in sea level (mm) relative to the start of year 1993. (Bottom) Result of fitting a smoother curve to the same data by using the LOWESS procedure described in Example 50.3. The source of the data used in this simulation is the NASA Goddard Space Flight Center https://climate.nasa.gov/vital-signs/sea-level/.

This construction fits an affine relation (or a line) to the data and allows us to estimate x(n) from an observation y(n) by using (50.60):

\hat{x}(n) = \alpha^\star y(n) - \theta^\star \qquad (50.60)
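For concreteness, here is a minimal NumPy sketch (not from the text) of the fit in (50.55)–(50.60); the arrays year and sea_level below are hypothetical stand-ins for the actual dataset, which would be loaded from the NASA source cited in the figure caption.

import numpy as np

# hypothetical stand-ins for the fractional-year values y(n) and sea-level changes x(n)
rng = np.random.default_rng(0)
year = np.linspace(1993.0, 2019.0, 952)
sea_level = 2.99 * (year - 1993.0) + rng.normal(0.0, 4.0, size=year.size)

# build H = [1 y(n)] and d = x(n) as in (50.55), then solve the least-squares problem (50.57)
H = np.column_stack([np.ones_like(year), year])
d = sea_level
w_star = np.linalg.lstsq(H, d, rcond=None)[0]   # w* = [-theta*; alpha*], cf. (50.58)
theta_star, alpha_star = -w_star[0], w_star[1]

# predict via (50.60): x_hat(n) = alpha* y(n) - theta*
x_hat = alpha_star * year - theta_star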

The top plot in Fig. 50.2 shows the resulting regression line x b(n) along with the measurements x(n) (vertical axis) as a function of the year stamp y(n) (horizontal axis). The bottom plot shows a smoother fitted curve using the LOWESS procedure, which is described next. Example 50.3 (LOWESS and LOESS smoothing) Consider N scalar data pairs denoted by {x(n), y(n)}, where n = 0, 1, . . . , N − 1. In many cases of interest, a regression line is not the most appropriate curve to fit onto the data. We now describe two other popular (but similar) schemes that can be used to fit smoother curves. These schemes are known by the acronyms LOWESS, which stands for “locally weighted scatter-plot smoothing” and LOESS, which stands for “locally estimated scatter-plot smoothing.” Both schemes rely on the use of localized least-squares problems. We describe the LOWESS procedure first.


LOWESS slides a window of width L over the N data points, say, one position at a time. Typical values are L = N/20, L = N/10, or L = N/4, but other values are possible leading to less (smaller L) or more (larger L) smoothing in the fitted curve. The fraction of samples used within the window is called the smoothing factor, q. Thus, the choice L = N/10 corresponds to using q = 10%, while the choice L = N/20 corresponds to using q = 5%. The data in each window are used to estimate one particular point in the window, which is normally (but not always) the middle point. For example, assume we wish to estimate the sample x(10) corresponding to n = 10, and assume that the window size is L = 5. In this case, the data samples that belong to the window will be (

) (x(8), y(8)), (x(9), y(9)), (x(10), y(10)) , (x(11), y(11)), (x(12), y(12))

(50.61)

with the desired sample (x(10), y(10)) appearing at the center of the interval. Clearly, it is not always possible to have the desired sample appear in the middle of the interval. This happens, for example, for the first data point (x(1), y(1)). In this case, the other points in the window will lie to its right: (

) (x(1), y(1)) , (x(2), y(2)), (x(3), y(3)), (x(4), y(4)), (x(5), y(5))

(50.62)

The same situation happens for the last data point (x(N − 1), y(N − 1)); in that case, the four other points in the corresponding window will lie to its left. Regardless, for the data pair (x(n_o), y(n_o)) of interest, where we denote the index of interest by n_o, we construct a window with L data samples around this point to estimate its x-component. For convenience of notation, we collect the indices of the samples within the window into a set I_{n_o}. For example, for the cases represented in (50.61)–(50.62), we have

n_o = 10:\quad I_{10} = \{8, 9, 10, 11, 12\}   (50.63)
n_o = 1:\quad\;\; I_{1} = \{1, 2, 3, 4, 5\}   (50.64)

Let \Delta_{n_o} denote the width of the window, defined as follows for the above two cases:

\Delta_{10} = |y(12) - y(8)|, \qquad \Delta_{1} = |y(5) - y(1)|   (50.65)

Next, using the data in each window I_{n_o}, we fit a regression line by solving a weighted least-squares problem of the following form:

(\alpha_{n_o}^\star, \theta_{n_o}^\star) \;\stackrel{\Delta}{=}\; \mathop{\mathrm{argmin}}_{\alpha_{n_o},\,\theta_{n_o}} \left\{ \sum_{n \in I_{n_o}} D(n) \big( x(n) - (\alpha_{n_o} y(n) - \theta_{n_o}) \big)^2 \right\}   (50.66)

where D(n) is a nonnegative scalar weight constructed as follows:

D(n) = \left( 1 - \left| \frac{y(n) - y(n_o)}{\Delta_{n_o}} \right|^3 \right)^3, \qquad n \in I_{n_o}   (50.67)

Other choices for D(n) are possible, but they need to satisfy certain desirable properties. Observe, for example, that the above choice for the weights varies between 0 and 1, with the weight being equal to 1 at n = n_o. Moreover, data samples that are farther away from y(n_o) receive smaller weighting than samples that are closer to it. To solve (50.66), we can again employ vector notation as follows. We first collect the data from within the window, namely, {x(n), y(n)} for n \in I_{n_o}, into the vector and matrix quantities:


d_{n_o} = \mathrm{col}\big\{ x(n) \big\}_{n \in I_{n_o}}   (50.68a)
H_{n_o} = \mathrm{blkcol}\big\{ [\,1 \;\; y(n)\,] \big\}_{n \in I_{n_o}}   (50.68b)
D_{n_o} = \mathrm{diag}\big\{ D(n) \big\}_{n \in I_{n_o}}   (50.68c)

where D_{n_o} is a diagonal matrix. For example, for the case represented by (50.61) we have

d_{10} = \begin{bmatrix} x(8) \\ x(9) \\ x(10) \\ x(11) \\ x(12) \end{bmatrix}, \qquad H_{10} = \begin{bmatrix} 1 & y(8) \\ 1 & y(9) \\ 1 & y(10) \\ 1 & y(11) \\ 1 & y(12) \end{bmatrix}   (50.69)

and

D_{10} = \mathrm{diag}\big\{ D(8), D(9), D(10), D(11), D(12) \big\}   (50.70)

where, for instance,

D(11) = \left( 1 - \left| \frac{y(11) - y(10)}{y(12) - y(8)} \right|^3 \right)^3   (50.71)

We also introduce the parameter vector

w_{n_o} = \begin{bmatrix} -\theta_{n_o} \\ \alpha_{n_o} \end{bmatrix}   (50.72)

Then, problem (50.66) is equivalent to

w_{n_o}^\star \;\stackrel{\Delta}{=}\; \mathop{\mathrm{argmin}}_{w \in \mathbb{R}^2} \; (d_{n_o} - H_{n_o} w_{n_o})^T D_{n_o} (d_{n_o} - H_{n_o} w_{n_o})   (50.73)

whose solution is given by

w_{n_o}^\star \;\stackrel{\Delta}{=}\; (H_{n_o}^T D_{n_o} H_{n_o})^{-1} H_{n_o}^T D_{n_o} d_{n_o} = \begin{bmatrix} -\theta_{n_o}^\star \\ \alpha_{n_o}^\star \end{bmatrix}   (50.74)

This construction now allows us to estimate the sample x(n_o) by using

\widehat{x}(n_o) = \alpha_{n_o}^\star y(n_o) - \theta_{n_o}^\star   (50.75)

Next, we slide the window by one position to the right, collect L data points around (x(n_o + 1), y(n_o + 1)), and use them to estimate x(n_o + 1) in a similar fashion:

\widehat{x}(n_o + 1) = \alpha_{n_o+1}^\star y(n_o + 1) - \theta_{n_o+1}^\star   (50.76)

and continue in this fashion.
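To make the sliding-window construction concrete, the following Python sketch carries out one plain LOWESS pass: tricube weights as in (50.67), a local weighted line fit as in (50.73)–(50.74), and the estimate (50.75). It is a minimal illustration under assumptions of our own choosing (the function name, the use of numpy, and the choice of taking the window as the L samples nearest to y(n_o), which coincides with a centered window away from the edges); the robustness reweighting step discussed later in this example is omitted.

```python
import numpy as np

def lowess_pass(x, y, q=0.10):
    """One plain LOWESS pass (illustrative sketch, no robustness reweighting).
    x: targets x(n); y: observations y(n), assumed sorted in increasing order."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    N = len(x)
    L = max(int(np.ceil(q * N)), 3)            # window length from the smoothing factor
    xhat = np.zeros(N)
    for no in range(N):
        # window: the L samples whose y-values are closest to y(no)
        idx = np.sort(np.argsort(np.abs(y - y[no]))[:L])
        delta = y[idx].max() - y[idx].min()    # window width, cf. (50.65)
        delta = delta if delta > 0 else 1.0
        u = np.abs(y[idx] - y[no]) / delta
        D = (1.0 - u**3)**3                    # tricube weights (50.67)
        # weighted least-squares fit of the local line x ~ alpha*y - theta
        H = np.column_stack([np.ones(len(idx)), y[idx]])
        sqD = np.sqrt(D)
        w_star, *_ = np.linalg.lstsq(sqD[:, None] * H, sqD * x[idx], rcond=None)
        xhat[no] = np.array([1.0, y[no]]) @ w_star   # estimate (50.75)
    return xhat

# example usage on synthetic data
rng = np.random.default_rng(0)
y = np.linspace(0.0, 10.0, 200)
x = np.sin(y) + 0.3 * rng.standard_normal(200)
smooth = lowess_pass(x, y, q=0.10)
```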


The difference between the LOWESS and LOESS procedures is that the latter fits a second-order curve to the data within each interval I_{n_o}. That is, LOESS replaces (50.66) by

(\alpha_{n_o}^\star, \beta_{n_o}^\star, \theta_{n_o}^\star) = \mathop{\mathrm{argmin}}_{\alpha_{n_o},\,\beta_{n_o},\,\theta_{n_o}} \left\{ \sum_{n \in I_{n_o}} D(n) \big( x(n) - (\alpha_{n_o} y(n) + \beta_{n_o} y^2(n) - \theta_{n_o}) \big)^2 \right\}   (50.77)

and uses the resulting coefficients (\alpha_{n_o}^\star, \beta_{n_o}^\star, \theta_{n_o}^\star) to estimate x(n_o) by using

\widehat{x}(n_o) = \alpha_{n_o}^\star y(n_o) + \beta_{n_o}^\star y^2(n_o) - \theta_{n_o}^\star   (50.78)

We continue to slide the L-long window over the data to estimate the subsequent samples x(n). There is one final step that is normally employed to reduce the effect of outliers that may exist in the data. This step redefines the weights D(n) and repeats the calculation of the first- or second-order local curves. Specifically, the following procedure is carried out. Given the target signals {x(n)} and the corresponding estimates {\widehat{x}(n)} that resulted from the above LOWESS or LOESS construction, we introduce the error sequence

e(n) \;\stackrel{\Delta}{=}\; x(n) - \widehat{x}(n), \qquad n = 0, 1, 2, \ldots, N - 1   (50.79)

and list the {|e(n)|} in increasing order. We then let \delta denote the median of this sequence (i.e., the value with as many samples below it as above it):

\delta \;\stackrel{\Delta}{=}\; \mathrm{median}\{ |e(n)| \}   (50.80)

Using these error quantities, the LOWESS and LOESS implementations introduce the following weighting scalars for n = 0, 1, \ldots, N - 1:

A(n) = \begin{cases} \left( 1 - \left( \dfrac{e(n)}{6\delta} \right)^2 \right)^2, & \text{if } |e(n)| < 6\delta \\[2mm] 0, & \text{otherwise} \end{cases}   (50.81)

and use them to update D(n) by

D(n) \leftarrow D(n) A(n), \qquad n \in I_{n_o}   (50.82)

We then repeat the design of the local least-squares estimators using these new weights. The construction leads to new estimates {\widehat{x}(n)}. We can repeat this construction a few times before the process is terminated, leading to the smoothed curve {\widehat{x}(n)}. Figure 50.3 shows the LOWESS and LOESS smoothing curves that result from applying the above construction to data measurements representing the change in the global surface temperature (measured in °C) relative to the average over the period 1951–1980. The data consists of N = 139 temperature measurements between the years 1880 and 2018. The top figure shows the curve that results from LOWESS smoothing with a smoothing factor of q = 5% (corresponding to windows with L = 6 samples), while the bottom figure shows the curve that results from LOESS smoothing with a smoothing factor of q = 10% (corresponding to windows with L = 13 samples). Three repeated runs of the form (50.82) are applied.

Example 50.4 (Confidence levels and interpretability) One useful feature of least-squares solutions is that, under reasonable conditions, we can interpret the results and comment on their confidence level. Consider again the standard least-squares problem (50.19), where we denote the entries of d by {x(n)} and the rows of H by {h_n^T}, e.g., h_n = \mathrm{col}\{1, y_n\} when augmentation is used.

Figure 50.3 LOWESS (top) and LOESS (bottom) smoothing curves that result from applying the smoothing construction of this example to data measurements representing the change in the global surface temperature (measured in °C) relative to the average over the period 1951–1980. Three repeated runs of the form (50.82) are applied. The source of the data is the NASA Goddard Institute for Space Studies (GISS) at https://climate.nasa.gov/vital-signs/global-temperature/.

When H is full rank, we know that the least-squares solution is given by

w^\star = (H^T H)^{-1} H^T d   (50.83)

This vector allows us to predict measurements x(n) using the linear regression model

\widehat{x}(n) = h_n^T w^\star   (50.84)

There are many ways to assess the quality of the solution in the statistical sciences. We summarize some of the main measures. Using the data {x(n)} we define the sample mean and variances:

\bar{x} \;\stackrel{\Delta}{=}\; \frac{1}{N} \sum_{n=0}^{N-1} x(n)   (50.85a)

\sigma_x^2 \;\stackrel{\Delta}{=}\; \frac{1}{N} \sum_{n=0}^{N-1} \big( x(n) - \bar{x} \big)^2   (50.85b)

\sigma_{\widehat{x}}^2 \;\stackrel{\Delta}{=}\; \frac{1}{N} \sum_{n=0}^{N-1} \big( \widehat{x}(n) - \bar{x} \big)^2   (50.85c)

\sigma_{\widetilde{x}}^2 \;\stackrel{\Delta}{=}\; \frac{1}{N} \sum_{n=0}^{N-1} \big( x(n) - \widehat{x}(n) \big)^2   (50.85d)

The variance \sigma_x^2 measures the squared variation of the samples x(n) around their mean, while the variance \sigma_{\widehat{x}}^2 measures the squared variation of the predictions around the same mean. The variance \sigma_{\widetilde{x}}^2 measures the squared error between the x(n) and their predictions. It is straightforward to verify that the variance of the target signal decouples into the sum (this is related to the earlier expression (50.40)):

\sigma_x^2 = \sigma_{\widehat{x}}^2 + \sigma_{\widetilde{x}}^2   (50.86)

The so-called coefficient of determination is defined as the ratio:

r^2 \;\stackrel{\Delta}{=}\; \frac{\sigma_{\widehat{x}}^2}{\sigma_x^2} = 1 - \frac{\sigma_{\widetilde{x}}^2}{\sigma_x^2} \;\in\; [0, 1]   (50.87)

This scalar measures the proportion of the variation in {x(n)} that is predictable from (or explained by) the observations {h_n}. For example, if r = 0.5, then r^2 = 0.25, which means that 25% of the variation in {x(n)} can be explained by the variation in {h_n}; the variation around the regression hyperplane accounts for the remaining 75% of the total variation in the {x(n)}. We can assess the quality of the estimated least-squares model w^\star as follows. Assume that the data {d, H} satisfy a linear model of the form

d = H w^o + v   (50.88)

for some unknown w^o \in \mathbb{R}^M. The least-squares solution w^\star given by (50.83) is estimating this model. Assume further that v is Gaussian-distributed with v \sim \mathcal{N}_v(0, \sigma_v^2 I_N). Then, it is easily seen that w^\star is an unbiased estimator since

w^\star = (H^T H)^{-1} H^T d = (H^T H)^{-1} H^T (H w^o + v) = w^o + (H^T H)^{-1} H^T v   (50.89)

and, consequently, \mathbb{E}\, w^\star = w^o. Using the fact that v is Gaussian, we conclude that w^\star is Gaussian-distributed. Its covariance matrix is given by

\mathbb{E}\, (w^\star - w^o)(w^\star - w^o)^T = (H^T H)^{-1} H^T (\mathbb{E}\, v v^T) H (H^T H)^{-1} = \sigma_v^2 (H^T H)^{-1}   (50.90)

In summary, we find that

w^\star \sim \mathcal{N}_{w^\star}\big( w^o, \; \sigma_v^2 (H^T H)^{-1} \big)   (50.91)

which means that the individual entries of w^\star are Gaussian-distributed with variances given by scaled multiples of the diagonal entries of (H^T H)^{-1}. That is, for the jth entry:

w^\star(j) \sim \mathcal{N}_{w^\star(j)}\Big( w^o(j), \; \sigma_v^2 \big[ (H^T H)^{-1} \big]_{jj} \Big)   (50.92)

in terms of the jth diagonal entry of (H^T H)^{-1}. Using this information, we can now determine a 95% confidence interval for each entry w^o(j) as follows. First, we need to introduce the t-distribution, also called the Student t-distribution. It is symmetric with a shape similar to the Gaussian distribution, but has heavier tails. This means that a generic random variable x that is t-distributed will have a higher likelihood of assuming extreme values than under a Gaussian distribution. Figure 50.4 compares Gaussian and t-distributions with zero mean and unit variance.

Figure 50.4 Comparing Gaussian and t-distributions with zero mean and unit variance. Observe how the t-distribution has higher tails.

The t-distribution can be motivated as follows. Consider a collection of N scalar iid realizations arising from a Gaussian distribution with true mean \mu and variance \sigma^2, i.e., x(n) \sim \mathcal{N}_x(\mu, \sigma^2). Introduce the sample mean and (unbiased) variance quantities

\bar{x} \;\stackrel{\Delta}{=}\; \frac{1}{N} \sum_{n=1}^{N} x(n), \qquad s_x^2 = \frac{1}{N-1} \sum_{n=1}^{N} \big( x(n) - \bar{x} \big)^2   (50.93)

The quantities \{\bar{x}, s_x^2\} should be viewed as random variables, written in boldface notation \{\bar{\boldsymbol{x}}, \boldsymbol{s}_x^2\}, because their values vary with the randomness in selecting the {x(n)}. Next, we define the t-score variable, which measures how far the sample mean is from the true mean (scaled by the sample standard deviation and \sqrt{N}):

\boldsymbol{t} \;\stackrel{\Delta}{=}\; \frac{\bar{\boldsymbol{x}} - \mu}{\boldsymbol{s}_x / \sqrt{N}}   (50.94)

The pdf of the t variable is called the t-distribution with d = N - 1 degrees of freedom. It has zero mean and unit variance and is formally defined by the expression:

f_t(t; d) = \frac{\Gamma((d+1)/2)}{\Gamma(d/2)} \, \frac{1}{\sqrt{d\pi}} \, \frac{1}{(1 + t^2/d)^{(d+1)/2}} \qquad \text{(t-distribution)}   (50.95)


where \Gamma(x) refers to the gamma function encountered earlier in Prob. 4.3. The definition (50.94) explains why the t-distribution is useful in constructing confidence intervals. That is because it assesses how the sample mean is distributed around the true mean. Due to its relevance, the t-distribution appears tabulated in many texts on statistics, and these tables are used in the following manner. Let \alpha = 5% (this value is known as the desired significance level in statistics). We use a table of t-distributions to determine the critical value denoted by t^{N-M}_{\alpha/2}; this is the value in a t-distribution with N - M degrees of freedom beyond which the area under the pdf curve will be 2.5% (this calculation amounts to performing what is known as a one-tailed test) – see Fig. 50.5. An example of this tabular form is shown in Table 50.1. One enters the degree d = N - M along the vertical direction and the value of \alpha/2 along the horizontal direction and reads out the entry corresponding to t^{d}_{\alpha/2}. For example, using N - M = 15 degrees of freedom and \alpha/2 = 2.5%, one reads the value t^{15}_{2.5\%} = 2.131 (row d = 15, column 2.5%).

Table 50.1 Critical values of t^{d}_{\alpha/2} in one-tailed t-tests with d degrees of freedom. The values in the last row can be used for large degrees of freedom.

degree d     5%        2.5%      1%        0.5%      0.1%
1            6.314     12.706    31.821    63.657    318.309
2            2.920      4.303     6.965     9.925     22.327
3            2.353      3.182     4.541     5.841     10.215
4            2.132      2.776     3.747     4.604      7.173
5            2.015      2.571     3.365     4.032      5.893
6            1.943      2.447     3.143     3.707      5.208
7            1.894      2.365     2.998     3.499      4.785
8            1.860      2.306     2.896     3.355      4.501
9            1.833      2.262     2.821     3.250      4.297
10           1.812      2.228     2.764     3.169      4.144
11           1.796      2.201     2.718     3.106      4.025
12           1.782      2.179     2.681     3.055      3.930
13           1.771      2.160     2.650     3.012      3.852
14           1.761      2.145     2.624     2.977      3.787
15           1.753      2.131     2.602     2.947      3.733
16           1.746      2.120     2.583     2.921      3.686
17           1.740      2.110     2.567     2.898      3.646
18           1.734      2.101     2.552     2.878      3.610
19           1.729      2.093     2.539     2.861      3.579
20           1.725      2.086     2.528     2.845      3.552
∞            1.645      1.960     2.326     2.576      3.090

Once t^{N-M}_{\alpha/2} is determined, the confidence interval for each entry of w^o would be given by

w^\star(j) \;\pm\; t^{N-M}_{\alpha/2} \, \sigma_v \sqrt{ \big[ (H^T H)^{-1} \big]_{jj} }   (50.96)

This means that there is a 95% chance that the true value w^o(j) lies within the interval. Likewise, given an observation h_n, we can derive a confidence interval for the unperturbed component h_n^T w^o, which happens to be the mean of x(n) in model (50.88).

Figure 50.5 The critical value t^{N-M}_{\alpha/2} is the point to the right of which the area under a t-distribution with degree N - M is equal to \alpha/2.

That is, we can derive a confidence interval for the expected value of the target signal x(n) that would result from h_n. To see this, consider the prediction \widehat{x}(n) = h_n^T w^\star. This prediction is again Gaussian-distributed since w^\star is Gaussian. Its mean and variance are found as follows. First note that

\widehat{x}(n) = h_n^T w^\star = h_n^T \big\{ w^o + (H^T H)^{-1} H^T v \big\} = h_n^T w^o + h_n^T (H^T H)^{-1} H^T v   (50.97)

We conclude that \mathbb{E}\, \widehat{x}(n) = h_n^T w^o, so that the mean of the prediction agrees with the actual mean, \mathbb{E}\, x(n) = h_n^T w^o. Moreover, the prediction variance is given by

\mathbb{E}\, \big( \widehat{x}(n) - h_n^T w^o \big)^2 = h_n^T (H^T H)^{-1} H^T (\mathbb{E}\, v v^T) H (H^T H)^{-1} h_n = \sigma_v^2 \, h_n^T (H^T H)^{-1} h_n   (50.98)

so that

\widehat{x}(n) \sim \mathcal{N}_{\widehat{x}(n)}\big( h_n^T w^o, \; \sigma_v^2 \, h_n^T (H^T H)^{-1} h_n \big)   (50.99)

which shows that the predictions will be Gaussian-distributed around the actual mean, h_n^T w^o. We can then determine a 95% confidence interval for the mean value h_n^T w^o by using

\widehat{x}(n) \;\pm\; t^{N-M}_{\alpha/2} \, \sigma_v \sqrt{ h_n^T (H^T H)^{-1} h_n }   (50.100)

Given an observation h_n, this means that there is a 95% chance that the mean value h_n^T w^o will lie within the above interval around \widehat{x}(n). In a similar vein, given a feature h_n, we can derive a confidence interval for the target x(n) itself (rather than its mean, as was done above). To see this, we note that the difference \widehat{x}(n) - x(n) is again Gaussian-distributed, albeit with mean zero since

\widehat{x}(n) - x(n) = \big( h_n^T w^o + h_n^T (H^T H)^{-1} H^T v \big) - \big( h_n^T w^o + v(n) \big) = h_n^T (H^T H)^{-1} H^T v - v(n)   (50.101)

Moreover, the variance is given by

\mathbb{E}\, \big( \widehat{x}(n) - x(n) \big)^2 = \sigma_v^2 \big( 1 - h_n^T (H^T H)^{-1} h_n \big)   (50.102)

so that

\widehat{x}(n) \sim \mathcal{N}_{\widehat{x}(n)}\Big( x(n), \; \sigma_v^2 \big( 1 - h_n^T (H^T H)^{-1} h_n \big) \Big)   (50.103)

This result shows that the predictions will be Gaussian-distributed around the actual value x(n). We can then determine a 95% confidence interval for x(n) by using

\widehat{x}(n) \;\pm\; t^{N-M}_{\alpha/2} \, \sigma_v \sqrt{ 1 - h_n^T (H^T H)^{-1} h_n }   (50.104)

The expressions so far assume knowledge of \sigma_v^2. If this information is not available, it can be estimated by noting that v(n) = x(n) - h_n^T w^o and using the sample approximation:

\widehat{\sigma}_v^2 \;\approx\; \frac{1}{N-1} \sum_{n=0}^{N-1} \big( x(n) - h_n^T w^\star \big)^2   (50.105)

The analysis in this example is meant to illustrate that, for least-squares problems and under some reasonable conditions, we are able to assess the confidence levels we have in the results. This is a useful property for learning algorithms to have so that their results become amenable to a more judicious interpretation. It also enables the algorithms to detect outliers and malicious data. For example, if some data pair (x(m), h_m) is received, one may compute \widehat{x}(m) = h_m^T w^\star and verify whether x(m) lies within the corresponding confidence interval (constructed according to (50.104) with n replaced by m). If not, then one can flag this data point as an outlier.
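The following Python sketch illustrates these calculations on synthetic data of our own choosing: it computes the least-squares solution (50.83), the coefficient of determination (50.87), the noise-variance estimate (50.105), and the 95% confidence interval (50.104) for each target value. The function name is hypothetical, and the critical value 1.960 assumes a large number of degrees of freedom (last row of Table 50.1).

```python
import numpy as np

def ls_confidence_intervals(H, d, t_crit=1.960):
    """Least-squares fit with 95% confidence intervals for each target (sketch)."""
    N, M = H.shape
    w_star = np.linalg.solve(H.T @ H, H.T @ d)                      # (50.83)
    d_hat = H @ w_star                                              # predictions (50.84)
    r2 = 1.0 - np.sum((d - d_hat)**2) / np.sum((d - d.mean())**2)   # (50.87)
    sigma2_v = np.sum((d - d_hat)**2) / (N - 1)                     # estimate (50.105)
    G = np.linalg.inv(H.T @ H)
    lev = np.einsum('ij,jk,ik->i', H, G, H)    # terms h_n^T (H^T H)^{-1} h_n
    half = t_crit * np.sqrt(sigma2_v * (1.0 - lev))                 # interval (50.104)
    return w_star, d_hat, r2, d_hat - half, d_hat + half

# example usage: affine fit to noisy data, flagging points outside the interval
rng = np.random.default_rng(1)
y = np.linspace(0.0, 10.0, 400)
H = np.column_stack([np.ones_like(y), y])      # rows h_n = col{1, y_n}
d = 2.0 + 0.5 * y + 0.2 * rng.standard_normal(y.size)
w_star, d_hat, r2, lo, hi = ls_confidence_intervals(H, d)
outliers = (d < lo) | (d > hi)                 # candidate outliers
```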


We apply construction (50.104)–(50.105) to Example 50.2, which involved fitting a regression line to sea levels over multiple years. We use N = 952 and M = 2 (due to the augmentation of the feature data by the unit entry) so that the number of degrees of freedom is 950. Using the data from the last row of Table 50.1 we have t^{950}_{2.5\%} \approx 1.960. The regression lines that result from using the lower and upper limits in (50.104) appear in dotted format in Fig. 50.6.

Figure 50.6 The fitted regression line is shown in solid red color, while the lines that correspond to the upper and lower limits of the confidence interval (50.104) appear in dotted format. (Vertical axis: change in sea level (mm) relative to the start of year 1993; the plot also marks a potential outlier lying outside the confidence band.)


Example 50.5 (Sketching) In big data applications, the amount of available data can be massive, giving rise to situations where N \gg M, i.e., the number of observations far exceeds the number of unknowns in the least-squares problem (50.19). In these cases, the solution of the normal equations (50.25) becomes prohibitively expensive since computing the products H^T H and H^T d requires O(N M^2) and O(N M) additions and multiplications, respectively. One technique to reduce the computational complexity is to employ randomized algorithms that rely on the concept of sketching. The purpose of these algorithms is to seek approximate solutions, denoted by w_s, with the following useful property: with high probability 1 - \delta, the solution w_s should lead to a risk value that is \epsilon-close to the optimal risk value, namely, it should hold that:

\mathbb{P}\Big( \| d - H w_s \|^2 \leq (1 + \epsilon) \| d - H w^\star \|^2 \Big) = 1 - \delta   (50.106)

where \delta > 0 is a small positive number. Sketching procedures operate as follows. They first select some random matrix S of size R \times N, with R \ll N. Subsequently, they compute the products S d and S H and determine w_s by solving the altered least-squares problem:

w_s \;\stackrel{\Delta}{=}\; \mathop{\mathrm{argmin}}_{w \in \mathbb{R}^M} \; \| S d - S H w \|^2   (50.107)

Observe that this is a smaller-size problem because S H is now R \times M. Since there is some nonzero probability of failure in (50.106), it is customary to repeat the sketching construction several times (by choosing different sketching matrices S each time), and then keep the best solution w_s from among the repeated experiments (i.e., the one with the smallest risk value). The three main challenges that arise in sketching solutions relate to: (a) selecting sketching matrices S that guarantee (50.106); (b) selecting suitable values for the dimension R; and, more importantly, (c) choosing sketching matrices S for which the products S d and S H can be computed efficiently. For this last condition, it is desirable to seek sparse choices for S.

One option is to employ Gaussian sketching. We select a dimension R = O((M \log M)/\epsilon) and let the entries of S be iid Gaussian random variables with zero mean and variance equal to 1/R. This construction can be shown to answer points (a) and (b) above, but is costly to implement since it generally leads to dense matrices S for which point (c) is expensive. Computing S H in this case requires O(N M^2 \log M) computations.

A second option that also answers points (a) and (b) above is to employ a random subsampling strategy as follows. We introduce the singular value decomposition (SVD) of H (this is of course a costly step and that is the reason why this option will not be viable in general):

H = U_H \Sigma_H V_H^T   (50.108)

where U_H is N \times N orthonormal; its rows have N entries each. We let u_n^T denote the restriction of the nth row to its M leading entries. That is, each u_n^T consists of the first M entries in the nth row of U_H. The so-called leverage scores of H are defined as the squared norms of these restricted vectors:

\ell_n = \| u_n \|^2, \qquad n = 1, 2, \ldots, N   (50.109)

It is straightforward to verify that the leverage scores correspond to the diagonal entries of the projection matrix onto \mathcal{R}(H), namely,

\ell_n = \big[ P_H \big]_{nn}, \qquad n = 1, 2, \ldots, N   (50.110)

We normalize the leverage scores by dividing by their sum to define a probability distribution over the integer indices 1 \leq r \leq N:

p_n \;\stackrel{\Delta}{=}\; \mathbb{P}(r = n) = \ell_n \Big/ \sum_{m=1}^{N} \ell_m   (50.111)

The scalar p_n defines the probability of selecting at random the index value n. Next, for each row r = 1, 2, \ldots, R of the sketching matrix S:

(a) We select an index n at random from the set \{1, 2, \ldots, N\} with probability equal to p_n.
(b) We set the rth row of S to the basis vector e_n^T scaled by 1/\sqrt{R p_n}, where e_n \in \mathbb{R}^N has a unit entry at the nth location and zeros elsewhere.

Observe that, under this construction, each row of S will contain a single nonzero entry. In this way, the multiplication of this row by H ends up selecting a (scaled) row from H. For this reason, we refer to S as performing random subsampling. The main inconvenience of this construction is that it requires computation of the leverage scores, which in turn require knowledge of the SVD factor U_H. It would be useful to seek sketching matrices that are data-independent. The third construction achieves this goal and is based on selecting a randomly subsampled Hadamard matrix. Assume N = 2^n (i.e., N is a power of 2) and select R = O((M \log^3 N)/\epsilon). Introduce the N \times N orthonormal Hadamard matrix computed as the Kronecker product of 2 \times 2 orthonormal Hadamard matrices:

\mathcal{H} = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \otimes \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \otimes \cdots \otimes \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \quad (n \text{ times})   (50.112)

Apart from the scaling by 1/(\sqrt{2})^n = 1/\sqrt{N}, the entries of \mathcal{H} will be \pm 1. Next, we:

(a) Select uniformly at random R rows from \mathcal{H}, and denote the resulting R \times N matrix by \mathcal{H}_R.
(b) Construct an N \times N random sign matrix in the form of a diagonal matrix D with random \pm 1 entries on its diagonal, with each entry selected with probability 1/2.
(c) Set S = \sqrt{N/R}\, \mathcal{H}_R D.

It can be verified that under this third construction, the complexity of determining w_s is O(N M \log(N/\epsilon) + (M^3 \log^3 N)/\epsilon). The purpose of this example is to introduce the reader to the concept of sketching in the context of least-squares problems. Additional comments are provided at the end of the chapter.
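As a simple illustration of the idea, the following Python sketch implements the Gaussian sketching option; the sketch size R, the repetition count, and the synthetic data are assumptions made for the example rather than prescriptions from the text. It solves the reduced problem (50.107) for several random sketches and keeps the one with the smallest risk on the full data.

```python
import numpy as np

def sketched_least_squares(H, d, R, repeats=5, rng=None):
    """Approximate least-squares via Gaussian sketching (illustrative sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    N, M = H.shape
    best_w, best_risk = None, np.inf
    for _ in range(repeats):
        S = rng.standard_normal((R, N)) / np.sqrt(R)   # iid entries with variance 1/R
        SH, Sd = S @ H, S @ d
        w_s, *_ = np.linalg.lstsq(SH, Sd, rcond=None)  # sketched problem (50.107)
        risk = np.sum((d - H @ w_s)**2)                # evaluate on the full data
        if risk < best_risk:
            best_w, best_risk = w_s, risk
    return best_w, best_risk

# example usage with a tall data matrix (N >> M)
rng = np.random.default_rng(2)
N, M = 20000, 10
H = rng.standard_normal((N, M))
d = H @ rng.standard_normal(M) + 0.1 * rng.standard_normal(N)
w_s, risk_s = sketched_least_squares(H, d, R=400, rng=rng)
```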

50.3 RECURSIVE LEAST-SQUARES

One key advantage of the least-squares empirical risk (50.16) is that it enables an exact recursive computation of the minimizer. The recursive solution is particularly useful for situations involving streaming data arriving successively over time. In this section we derive the RLS algorithm but first introduce two modifications into the empirical risk function for two main reasons: (a) to enable an exact derivation of the recursive solution; and (b) to incorporate a useful tracking mechanism into the algorithm.

50.3.1 Exponential Weighting

We modify the least-squares empirical risk (50.16) to the following exponentially weighted form with \ell_2-regularization:

\min_{w \in \mathbb{R}^M} \left\{ \frac{1}{N} \rho' \lambda^N \|w\|^2 + \frac{1}{N} \sum_{n=0}^{N-1} \lambda^{N-1-n} \big( x(n) - y_n^T w \big)^2 \right\}   (50.113)

There are three modifications in this formulation, which we motivate as follows:

(a) (Exponential weighting). The scalar 0 \ll \lambda < 1 is called the forgetting factor and is a number close to 1. Its purpose is to scale down data from the past more heavily than recent data. For example, in the above risk, data from time n = 0 is scaled by \lambda^{N-1} while data at n = N - 1 is scaled by 1. In this way, the algorithm is endowed with a memory mechanism that "forgets" older data and emphasizes recent data. This is a useful property to enable the algorithm to track drifts in the statistical properties of the data, especially when the subscript n has a time connotation and is used to index streaming data. The special case \lambda = 1 is known as growing memory. Exponential weighting is one form of data windowing where the effective length of the window is approximately 1/(1 - \lambda) samples.

(b) (Decaying \ell_2-regularization). The scalar \rho' > 0 is an \ell_2-regularization parameter. Observe though that the penalty term \rho' \|w\|^2 in (50.113) is scaled by \lambda^N as well; this factor dies out with time at an exponential rate and helps eliminate regularization after sufficient data has been processed. In other words, regularization will be more pronounced during the initial stages of the recursive algorithm and less pronounced later. One advantage of the regularization factor is that it helps ensure that the coefficient matrix that is inverted in the future expression (50.121b) is nonsingular.

(c) (Sample averaging). In addition, both terms in (50.113) are scaled by 1/N, which is independent of w. For this reason, we can ignore the 1/N factor and solve instead:

w_{N-1} \;\stackrel{\Delta}{=}\; \mathop{\mathrm{argmin}}_{w \in \mathbb{R}^M} \left\{ \rho' \lambda^N \|w\|^2 + \sum_{n=0}^{N-1} \lambda^{N-1-n} \big( x(n) - y_n^T w \big)^2 \right\}   (50.114)

where we are now denoting the unique solution by w_{N-1} rather than w^\star. The subscript N - 1 is meant to indicate that the solution w_{N-1} is based on data up to time N - 1. We attach the time subscript to the solution because we will be deriving a recursive construction that allows us to compute w_N from w_{N-1}, where w_N minimizes the enlarged risk:



w_N = \mathop{\mathrm{argmin}}_{w \in \mathbb{R}^M} \left\{ \rho' \lambda^{N+1} \|w\|^2 + \sum_{n=0}^{N} \lambda^{N-n} \big( x(n) - y_n^T w \big)^2 \right\}   (50.115)

where a new pair of data, \{x(N), y_N\}, has been added to the risk. The adjustments introduced through steps (b) and (c) enable the derivation of an exact recursive algorithm, as the argument will show. In a manner similar to (50.17), we introduce the data quantities:

H_N = \begin{bmatrix} y_0^T \\ y_1^T \\ y_2^T \\ \vdots \\ y_N^T \end{bmatrix}, \qquad d_N = \begin{bmatrix} x(0) \\ x(1) \\ x(2) \\ \vdots \\ x(N) \end{bmatrix}   (50.116)

where we are now attaching a time subscript to \{H_N, d_N\} to indicate that they involve data up to time N. Thus, note that we can partition them in the form:

H_N = \begin{bmatrix} H_{N-1} \\ y_N^T \end{bmatrix}, \qquad d_N = \begin{bmatrix} d_{N-1} \\ x(N) \end{bmatrix}   (50.117)

so that \{H_{N-1}, H_N\} differ by one row and \{d_{N-1}, d_N\} differ by one entry. We also introduce the diagonal weighting matrix:

\Lambda_N \;\stackrel{\Delta}{=}\; \mathrm{diag}\big\{ \lambda^N, \lambda^{N-1}, \ldots, 1 \big\}   (50.118)

and note that

\Lambda_N = \begin{bmatrix} \lambda \Lambda_{N-1} & \\ & 1 \end{bmatrix}   (50.119)

Using \{H_N, d_N, \Lambda_N\}, problems (50.114) and (50.115) can be rewritten in matrix form as follows:

w_{N-1} \;\stackrel{\Delta}{=}\; \mathop{\mathrm{argmin}}_{w \in \mathbb{R}^M} \Big\{ \rho' \lambda^N \|w\|^2 + (d_{N-1} - H_{N-1} w)^T \Lambda_{N-1} (d_{N-1} - H_{N-1} w) \Big\}   (50.120a)

w_N \;\stackrel{\Delta}{=}\; \mathop{\mathrm{argmin}}_{w \in \mathbb{R}^M} \Big\{ \rho' \lambda^{N+1} \|w\|^2 + (d_N - H_N w)^T \Lambda_N (d_N - H_N w) \Big\}   (50.120b)

Differentiating the above risks relative to w, we find that the unique solutions w_{N-1} and w_N are given by the expressions:

w_{N-1} = \big( \rho' \lambda^N I_M + H_{N-1}^T \Lambda_{N-1} H_{N-1} \big)^{-1} H_{N-1}^T \Lambda_{N-1} d_{N-1}   (50.121a)

w_N = \big( \rho' \lambda^{N+1} I_M + H_N^T \Lambda_N H_N \big)^{-1} H_N^T \Lambda_N d_N   (50.121b)


These equations allow us to evaluate the solutions {wN −1 , wN } directly from the data matrices. However, a more efficient construction is possible by going from wN −1 to wN more directly, as we explain next. This step will be referred to as the time-update step.

50.3.2 Exponentially Weighted RLS

To derive the recursive algorithm, we introduce the following three quantities:

P_N \;\stackrel{\Delta}{=}\; \big( \rho' \lambda^{N+1} I_M + H_N^T \Lambda_N H_N \big)^{-1}   (50.122a)

t(N) \;\stackrel{\Delta}{=}\; 1 \big/ \big( 1 + \lambda^{-1} y_N^T P_{N-1} y_N \big)   (50.122b)

g_N \;\stackrel{\Delta}{=}\; \lambda^{-1} P_{N-1} y_N \, t(N)   (50.122c)

where P_N is M \times M, g_N is an M \times 1 gain vector, and t(N) is a scalar factor. The derivation below establishes the following result. Given \rho' > 0 and a forgetting factor 0 \ll \lambda \leq 1, the solution w_N of the exponentially weighted regularized least-squares problem (50.115), and the corresponding minimum risk denoted by \xi(N), can be computed recursively, as shown in listing (50.123) – see Prob. 50.7 for a derivation of the recursion for the minimum cost.

Recursive least-squares for solving (50.115).

given N data pairs \{x(n) \in \mathbb{R}, \; y_n \in \mathbb{R}^M\}, n = 0, 1, \ldots, N - 1;
start with P_{-1} = \frac{1}{\rho'} I_M, \; \xi(-1) = 0, \; w_{-1} = 0_M;
repeat for n = 0, 1, 2, \ldots, N - 1:
    t(n) = 1 / (1 + \lambda^{-1} y_n^T P_{n-1} y_n)
    g_n = \lambda^{-1} P_{n-1} y_n \, t(n)
    \widehat{x}(n) = y_n^T w_{n-1}
    e(n) = x(n) - \widehat{x}(n)
    w_n = w_{n-1} + g_n e(n)
    P_n = \lambda^{-1} P_{n-1} - g_n g_n^T / t(n)
    \xi(n) = \lambda \xi(n-1) + t(n) e^2(n)
end
   (50.123)
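A direct translation of listing (50.123) into Python is given below as an illustrative sketch; the function name, the default values chosen for \lambda and \rho', and the synthetic streaming example are our own assumptions rather than part of the listing itself.

```python
import numpy as np

def rls(x, Y, lam=0.995, rho=1.0):
    """Exponentially weighted RLS, following listing (50.123) (illustrative sketch).
    x: array of N scalar targets x(n); Y: N x M array whose rows are y_n^T."""
    N, M = Y.shape
    P = (1.0 / rho) * np.eye(M)     # P_{-1}
    w = np.zeros(M)                 # w_{-1}
    xi = 0.0                        # minimum risk xi(-1)
    for n in range(N):
        y = Y[n]
        Py = P @ y
        t = 1.0 / (1.0 + (y @ Py) / lam)    # conversion factor t(n)
        g = (Py / lam) * t                  # gain vector g_n
        e = x[n] - y @ w                    # a-priori error e(n)
        w = w + g * e                       # weight update
        P = P / lam - np.outer(g, g) / t    # update for P_n
        xi = lam * xi + t * e**2            # minimum risk recursion
    return w, P, xi

# example usage: estimate a linear model from streaming data
rng = np.random.default_rng(3)
N, M = 2000, 4
Y = rng.standard_normal((N, M))
w_true = np.array([1.0, -0.5, 0.25, 2.0])
x = Y @ w_true + 0.05 * rng.standard_normal(N)
w_hat, P_final, xi_final = rls(x, Y, lam=0.995, rho=1.0)
```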

Derivation of (50.123): We first rewrite (50.121a)–(50.121b) more compactly using the matrices \{P_{N-1}, P_N\} as:

w_{N-1} = P_{N-1} H_{N-1}^T \Lambda_{N-1} d_{N-1}   (50.124a)
w_N = P_N H_N^T \Lambda_N d_N   (50.124b)

Next, we exploit the relations between \{H_N, d_N, \Lambda_N\} and \{H_{N-1}, d_{N-1}, \Lambda_{N-1}\} from (50.117) and (50.119) in order to relate w_{N-1} to w_N directly. To begin with, note that

P_N^{-1} = \rho' \lambda^{N+1} I_M + H_N^T \Lambda_N H_N \;\stackrel{(50.119)}{=}\; \rho' \lambda \, \lambda^N I_M + \lambda H_{N-1}^T \Lambda_{N-1} H_{N-1} + y_N y_N^T = \lambda P_{N-1}^{-1} + y_N y_N^T   (50.125)

B ← yN ,

C ← 1,

T D ← yN

(50.126)

we obtain a recursive formula for updating PN directly rather than its inverse, PN = λ−1 PN −1 −

T PN −1 λ−1 λ−1 PN −1 yN yN , T 1 + λ−1 yN PN −1 yN

P−1 =

1 IM ρ0

(50.127)

This recursion for PN also gives one for updating the regularized solution wN itself. Using expression (50.124b) for wN , and substituting the above recursion for PN , we find   T wN = PN λHN −1 ΛN −1 dN −1 + yN x(N ) =

   T PN −1 λ−1  λ−1 PN −1 yN yN T λ−1 PN −1 − λH Λ d + y x(N ) N −1 N −1 N N −1 T 1 + λ−1 yN PN −1 yN

=

λ−1 PN −1 yN T T T PN −1 HN yN PN −1 HN −1 ΛN −1 dN −1 − −1 ΛN −1 dN −1 T | {z } 1 + λ−1 yN | {z } PN −1 yN

(50.127)

=wN −1

+ λ−1 PN −1 yN

=wN −1

 1−

1

T PN −1 yN λ−1 yN T −1 PN −1 yN + λ yN

 x(N )

(50.128)

That is, wN = wN −1 +

λ−1 PN −1 yN T (x(N ) − yN wN −1 ), T PN −1 yN 1 + λ−1 yN

w−1 = 0

(50.129) 

The RLS implementation (50.123) updates the weight iterate from wn−1 to wn for each data pair {x(n), yn }. Such implementations are useful for situations involving streaming data where one data pair arrives at each time instant n and the algorithm responds to it by updating wn−1 to wn in real time. If desired, we can extend the algorithm to deal with blocks of data, as explained in Prob. 50.30.

50.3.3

Useful Relations The scalar t(n) in the RLS algorithm is called the conversion factor. This is because it transforms a-priori errors into a-posteriori errors, as established in Prob. 50.17. Some straightforward algebra, using recursion (50.127) for Pn , shows that {gn , t(n)} can also be expressed in terms of Pn , namely, gn = Pn yn t(n) = 1 N

ynT gn

(50.130a) (50.130b)

2192

Least-Squares Problems

To justify (50.130a)–(50.130b), we simply note the following. Multiplying recursion (50.127) for Pn by yn from the right, we get Pn yn = λ−1 Pn−1 yn N

λ−1 Pn−1 yn ynT Pn−1 yn λ−1 1 + λ−1 ynT Pn−1 yn

λ−1 Pn−1 yn 1 + λ−1 ynT Pn−1 yn = gn

=

(50.131)

By further multiplying the above identity by ynT from the left, we get ynT Pn yn =

λ−1 ynT Pn−1 yn 1 + λ−1 ynT Pn−1 yn

(50.132)

so that, by subtracting 1 from both sides, we obtain (50.130b). Furthermore, we note that at each iteration n, the variable Pn in the algorithm is equal to the following quantity: −1 Pn = ρ0 λn+1 IM + HnT Λn Hn (50.133) and the iterate wn is the solution to the regularized least-squares problem that uses only the data up to time n: ) ( n  2 X ∆ n−m 0 n+1 2 T wn = argmin λ (50.134) ρλ kwk + x(m) N ym w w∈IRM

m=0

The minimum cost for this problem, with w replaced by wn , is equal to ξ(n). Example 50.6 (Recommender systems) We revisit the recommender system studied earlier in Example 16.7. There we introduced a collaborative filtering approach based on matrix factorization to predict ratings by users. We denoted the weight vector by user u by wu ∈ IRM and the latent feature vector for item i by hi ∈ IRM . Subsequently, we formulated the regularized least-squares optimization problem: ( U I n o X X b b w bu , hi , θu , α bi = argmin ρkwu k2 + ρkhi k2 + (50.135) {wu ,hi ,θu ,αi }

u=1

X 

i=1

rui −

hTi wu

+ θu + αi

2

)

(u,i)∈R

where the last sum is over the valid indices (i, u) ∈ R, i.e., over the indices for which valid ratings exist. All entries with missing ratings are therefore excluded. We approximated the minimizer of the above (nonconvex) problem by applying the stochastic gradient solution (16.58). In this example, we pursue instead an alternating least-squares solution. Note that if we fix any three of the parameters, then the risk function is quadratic over the remaining parameter. For example, if we fix (hi , θu , αi ), then the risk is quadratic over wu . For any index u, let the notation Ru represent the set of valid indices i for which (u, i) has a rating. Note that u is fixed within Ru . Likewise, for any index i, let the notation Ri represent the set of valid indices u for which (u, i) has a rating. Note that i is fixed within Ri .

50.3 Recursive Least-Squares

2193

For any specific u, setting the gradient relative to wu to zero leads to the expression: ! !−1 X X T hi (rui + θu + αi ) (50.136) (ρIM + hi hi ) w bu = i∈Ru

i∈Ru

We can obtain similar expressions for b hi , θbu , and α bi , leading to listing (50.137). In the listing, the term wu,m represents the estimate for wu at iteration m; likewise for hi,m , θu (m), and αi (m).

Alternating least-squares algorithm for solving (50.135). given ratings ru,i for (u, i) ∈ R; start with arbitrary {wu,−1 , hi,−1 , θ u (−1), αi (−1)}; repeat until convergence over m = 0, 1, . . .: repeat forX u = 1, . . . , U : Au = (ρIM + hi,m−1 hTi,m−1 ) i∈Ru

wu,m =

!

A−1 u

X

hi,m−1 (rui + θu (m − 1) + αi (m − 1))

i∈Ru

θu (m) = −

 1 X rui − hTi,m−1 wu,m−1 + αi (m − 1) |Ru | i∈Ru

(50.137)

end repeat for Xi = 1, . . . , I: T Bi = (ρIM + wu,m wu,m ) u∈Ri   X −1  hi,m = Bi wu,m (rui + θu (m) + αi (m − 1)) u∈R

i  1 X rui − hTi,m−1 wu,m + θu (m) αi (m) = − |Ri |

u∈Ri

end end return {wu? , h?i , θu? , αi? }. We simulate recursions (50.137) for the same situation discussed earlier in Example 16.7. We consider the same ranking matrix for U = 10 users and I = 10 items with integer scores in the range 1 ≤ r ≤ 5; unavailable scores are marked by the symbol ?:   5 3 2 2 ? 3 4 ? 3 3 5 4 1 3 1 4 4 ? 3 ?    3 5 ? 2 1 5 4 1 4 1     ? 2 3 4 4 5 2 5 1 1     2 1 2 2 1 5 1 4 1 ?   (50.138) R=  ? 2 1 3 ? ? 5 3 3 5     3 4 ? 2 5 5 3 2 ? 4     4 5 3 4 2 2 1 ? 5 5   2 4 2 5 ? 1 1 3 1 4  ?

1

4

4

3

?

5

2

4

3

We set M = 5 (feature vectors hi of size 5) and generate uniform random initial conditions for the variables {wu,−1 , hi,−1 , θ u (−1), αi (−1)} in the open interval (0, 1).

Least-Squares Problems

We set ρ = 0.001. We normalize the entries of R to lie in the range [0, 1] by replacing each numerical entry r by the value r ← (r − 1)/4

(50.139)

where the denominator is the score range (highest value minus smallest value) and the numerator is subtracted from the smallest rating value (which is 1). We repeat recursions (50.137) for 500 runs. At the end of the simulation, we use the parameters {wu? , h?i , θu? , αi? } to estimate each entry of R using rbui = (h?i )T wu? − θu? − αi?

(50.140)

We undo the normalization by replacing each of these predicted values by rbui ← 4 rbui + 1

(50.141)

and rounding rbui to the closest integer; scores above 5 are saturated at 5 and scores b shown below, where we indicate the below 1 are fixed at 1. The result is the matrix R scores predicted for the unknown entries in red:        b R=      

5 5 3 4 2 5 3 4 2 5

3 4 5 2 1 2 4 5 4 1

2 1 1 3 2 1 1 3 2 4

2 3 2 4 2 3 2 4 5 4

1 1 1 4 1 5 5 2 4 3

3 4 5 5 5 5 5 2 1 5

4 4 4 2 1 5 3 1 1 5

1 1 1 5 4 3 2 1 3 2

3 3 4 1 1 3 3 5 1 4

3 5 1 1 1 5 4 5 4 3

             

(50.142)

Compared with the earlier result (16.61) obtained by applying a stochastic gradient procedure, we observe that the current simulation based on the alternating least-squares implementation leads to an estimate of the ratings matrix of similar quality. It is useful to recall that the risk function in (50.135) is not convex over the parameters and local minima are therefore possible. Figure 50.7 provides a color-coded representation of the entries of the original matrix R with the locations of the missing entries highlighted in b on the right. red, and the recovered matrix R original matrix ratings with missing entries

predicted matrix ratings, 5

2

5

2

4 3

6

2

8

1

10

0

2

4 6 columns

8

10

4

4

rows

4

rows

2194

3

6

2

8

1

10

0

2

4 6 columns

8

10

Figure 50.7 Color-coded representation of the entries of the original matrix R with

b (right). missing entries (left) and the recovered matrix R

50.4 Implicit Bias

2195

We further denote the risk value at the start of each epoch of index k by ∆

P (k) =

U X

ρkwu k2 +

u=1

I X

X 

ρkhi k2 +

i=1

rui − hTi wu + θu + αi

2

(50.143)

(u,i)∈R

where the parameters on the right-hand side are set to the values at the start of epoch k. Figure 50.8 plots the evolution of the risk curve (normalized by its maximum value so that its peak value is set to 1). normalized risk values at start of epochs

10 0

10 -1

10 -2 10

20

30

40

50

60

70

80

90

100

Figure 50.8 Evolution of the risk curve (50.143) with its peak value normalized to 1.

50.4 IMPLICIT BIAS

We return to the standard least-squares problem (50.19), repeated here for ease of reference:

w^\star \;\stackrel{\Delta}{=}\; \mathop{\mathrm{argmin}}_{w \in \mathbb{R}^M} \; \| d - H w \|^2   (50.144)

and examine the case in which there are infinitely many solutions. In particular, we will assume N < M so that H is a "fat" matrix with more columns than rows. This also means that there are fewer measurements than the size of w. We refer to this situation as the under-determined or over-parameterized least-squares problem. It turns out that if we apply the traditional gradient-descent recursion to the solution of (50.144), namely,

w_n = w_{n-1} - \mu \nabla_{w^T} \| d - H w \|^2 \big|_{w = w_{n-1}} = w_{n-1} + 2\mu H^T (d - H w_{n-1}), \qquad n \geq 0   (50.145)

where \mu is a small step-size parameter, then the iterate w_n will converge to the minimum-norm solution, w^\star = H^\dagger d:


\lim_{n \to \infty} w_n = H^\dagger d   (50.146)

Proof of (50.146): Assume H has full row rank and introduce its SVD:   H = U Σ 0 V T , U U T = IN , V V T = IM

(50.147)

where Σ is N × N diagonal with positive singular values {σ`2 > 0} for ` = 1, 2, . . . , N . We partition V into   V = V1 V2 , V1 ∈ IRM ×N (50.148) and note from the orthogonality of the M × M matrix V that    T    IN 0 V1 V1 V2 = V T V = IM ⇐⇒ T 0 IM −N V2

(50.149)

Now, we know from result (50.179) that the minimum norm solution of the least-squares problem for the case under study is given by (recall (1.114)):  −1  Σ w? = H † d = V U Td (50.150) 0 We select the initial condition for the gradient-descent recursion (50.145) to lie in the range space of H T , i.e., w−1 ∈ R(H T ) ⇐⇒ w−1 = H T c, for some c ∈ IRN

(50.151)

In this case, it is easy to see by iterating (50.145) that the successive wn will remain in the range space of H T : wn ∈ R(H T ),

n≥0

(50.152)

Moreover, we can characterize the limit point of this sequence. For this purpose, we introduce a convenient change of variables in the form of the M × 1 vector:  T  V1 wn ∆ zn = V T w n = (50.153) T V2 wn Multiplying recursion (50.145) by V T from both sides leads to      Σ zn = zn−1 + 2µ U T d − Σ 0 zn−1 0

(50.154)

We partition zn into zn = col{an , bn } where the leading component an is N × N . Then, the above relation gives:       an an−1 Σ = + 2µ (U T d − Σan−1 ) (50.155) bn bn−1 0 from which we conclude that an = (IN − 2µΣ2 )an−1 + 2µΣU T d bn = bn−1

(50.156a) (50.156b)

Observe that component bn does not evolve with time and stays fixed at its initial value, denoted by ∆

bn = b?2 = V2T w−1 = V2T H T c

(50.149)

=

0

(50.157)


On the other hand, the recursion for an has a diagonal coefficient matrix, IN − 2µΣ2 . We can select µ to ensure this matrix is stable, namely, to guarantee 2 |1 − 2µσ`2 | < 1, ∀ ` ⇐⇒ µ < 1/σmax

(50.158)

in terms of the largest singular value of H. Under this condition, the recursion for an converges to the steady-state value ∆

lim an = a? = Σ−1 U T d

(50.159)

n→∞

We therefore conclude that  lim zn =

n→∞

Σ−1 U T d 0

 (50.160)

and, hence,  lim wn = V

n→∞

Σ−1 U T d 0



 = V

Σ−1 0



U Td = H †d

(50.150)

=

w?

(50.161)

as claimed. 

We therefore find that, in the under-determined case, when the amount of data available is smaller than the size of the parameter vector, the gradient-descent algorithm shows an implicit bias toward the minimum-norm solution. In other words, among all possible minimizers (and there are infinitely many in this case), the gradient-descent iteration converges to the minimum-norm solution. Other algorithms need not behave in the same manner and, therefore, the choice of the algorithm influences which parameter vector is ultimately learned.
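A quick numerical check of this implicit bias, under problem sizes and a step size of our own choosing, is sketched below in Python: gradient descent on an under-determined least-squares problem is run from a zero initial condition (which lies in the range space of H^T) and compared against the minimum-norm solution H^\dagger d computed with the pseudo-inverse.

```python
import numpy as np

# illustrative check: gradient descent converges to the minimum-norm solution
rng = np.random.default_rng(4)
N, M = 20, 100                       # under-determined: fewer measurements than unknowns
H = rng.standard_normal((N, M))
d = rng.standard_normal(N)

w = np.zeros(M)                      # initial condition in the range space of H^T
mu = 0.4 / np.linalg.norm(H, 2)**2   # step size below 1/sigma_max^2, as required by (50.158)
for _ in range(20000):
    w = w + 2 * mu * H.T @ (d - H @ w)     # recursion (50.145)

w_min_norm = np.linalg.pinv(H) @ d         # minimum-norm solution H^dagger d
print(np.linalg.norm(w - w_min_norm))      # close to zero
```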

50.5

COMMENTARIES AND DISCUSSION Least-squares, Gauss, and RLS. The standard least-squares problem (50.19) has had an interesting and controversial history since its inception in the late 1700s, as already indicated in the texts by Kailath, Sayed, and Hassibi (2000) and Sayed (2003, 2008). The criterion was formulated by the German mathematician Carl Friedrich Gauss (1777–1855) in 1795 at the age of 18 – see Gauss (1809). At that time, there was interest in a claim by the German philosopher Georg Hegel (1770–1831), who claimed that he has concluded using deductive logic that only seven planets existed. Then, on January 1, 1801, an astronomer noticed a moving object in the constellation of Aries, and the location of this celestial body was observed for 41 days before suddenly dropping out of sight. Gauss’ contemporaries sought his help in predicting the future location of the heavenly body so that they could ascertain whether it was a planet or a comet (see Hall (1970), Plackett (1972), and Stigler (1981) for accounts of this story). With measurements available from the earlier sightings, Gauss formulated and solved a least-squares problem that could predict the location of the body (which turned out to be the planetoid Ceres). For some reason, Gauss did not bother to publish his least-squares solution, and controversy erupted in 1805 when the French mathematician Adrien Legendre (1752–1833) published a book where he independently invented the least-squares method – see Legendre (1805, 1810). Since then, the controversy has been settled and credit is nowadays given to Gauss as the inventor of the method of least-squares. Interestingly, the method was also published around the same time by the Irish-American mathematician Robert Adrain (1775–1843)

2198

Least-Squares Problems

in the work by Adrain (1808). Here is how Gauss himself motivated the least-squares problem: if several quantities depending on the same unknown have been determined by inexact observations, we can recover the unknown either from one of the observations or from any of an infinite number of combinations of the observations. Although the value of an unknown determined in this way is always subject to error, there will be less error in some combinations than in others . . . One of the most important problems in the application of mathematics to the natural sciences is to choose the best of these many combinations, i.e., the combination that yields values of the unknowns that are least subject to the errors. Extracted from Stewart (1995, pp. 31, 33) Gauss’ choice of the “best” combination was the one that minimizes the least-squares criterion. Actually, Gauss went further and formulated in his work on celestial bodies (ca. 1795) the unweighted (λ = 1) RLS solution, which we described in modern notation in (50.123). This step helped him save the trouble of having to solve a least-squares problem afresh every time a new measurement became available. Of course, Gauss’ notation and derivation were reminiscent of late eighteenth-century mathematics and, therefore, they do not bear much resemblance to the linear algebraic and matrix arguments used in our derivation – see, e.g., the useful translation of Gauss’ original work that appears in Stewart (1995). In modern times, the RLS algorithm is credited to Plackett (1950, 1972). There is also an insightful and strong connection between RLS and Kalman filtering techniques, as detailed in Sayed and Kailath (1994) and in the textbooks by Sayed (2003, 2008) – see Appendix 50.C. Reliable numerical methods. There is a huge literature on least-squares problems and on reliable numerical methods for their solution – see, e.g., Higham (1996), Lawson and Hanson (1995), and Bjorck (1996). Among the most reliable methods for solving least-squares problems is the QR method, which is described in Prob. 50.5. The origin of the QR method goes back to Householder (1953, pp. 72–73), followed by Golub (1965), and Businger and Golub (1965). Since then, there has been an explosion of interest in solution methods for least-squares and recursive least-squares problems – see, for example, the treatment on array methods in Sayed (2003, 2008). LOWESS and LOESS smoothing. We described in Example 50.3 how localized leastsquares formulations can be used to fit smooth curves onto data samples by means of the LOWESS and LOESS procedures, which were originally developed by Cleveland (1979) and Cleveland and Devlin (1988). These are simple but effective nonparametric techniques that slide a window over the data and fit locally either a regression line (LOWESS) or a quadratic curve (LOESS). The methods employ weighting to give more weight to data closer to the point that is being estimated and less weight to points that are farther away. We exhibited one choice for the weighting factor in (50.67) but other choices are possible, as explained in Cleveland (1979), where certain desirable properties on the weight factor are listed. These methods control the effect of outliers by rescaling the weights and repeating the construction a few times. Confidence intervals. We examined confidence intervals for least-squares problems in Example 50.4. In the derivation, we used (50.91) to conclude that the individual entries of the estimator w? are Gaussian and derived confidence intervals for them. 
If desired, we may alternatively work with the entire estimator w? (rather than its individual entries) and use expression (50.91) to describe an ellipsoidal region around w? where the true model is likely to lie with high confidence. It can be shown that for a significance level α (say, α = 5%), the true model wo lies with (1−α)% probability within the region

50.5 Commentaries and Discussion

( ∆

ellipsoid =

w (w − w? )T H T H(w − w? ) ≤ M σv2 Fα(M,N −M )

2199

) (50.162)

(a,b)

where the notation Fα refers to the point to the right of which the area under an F -distribution with parameters (a, b) is equal to α. This area is also called the critical value at which the significance level α is attained. For more discussion on confidence intervals and basic statistical concepts, the reader may refer to Draper and Smith (1998), Mendenhall, Beaver, and Beaver (2012), Witte and Witte (2013), and McClave and Sincich (2016). Iterative reweighted least-squares. It is explained in Sayed (2003, 2008) that the leastsquares solution can also be useful in solving nonquadratic optimization problems of the form: ( N −1 ) 1 X T p min |x(n) − yn w| (50.163) N n=0 w∈IRM for some positive exponent p (usually 1 ≤ p ≤ 2). This can be seen by reformulating the above criterion as a weighted least-squares problem in the following manner. Introduce the scalars (assumed nonzero): ∆

r(n) = |x(n) − ynT w|p−2 ,

n = 0, 1, . . . , N − 1

(50.164a)

and the diagonal weighting matrix n o R = diag r(0), r(1), . . . , r(N − 1)

(50.164b)

Then, the above optimization problem can be rewritten in the form min (d − Hw)T R(d − Hw)

(50.165)

w∈IRM

where the vector d and the matrix H are defined as in (50.17). Of course, this reformulation is not truly a weighted least-squares problem because R is dependent on the unknown vector, w. Still, this rewriting of the risk function suggests the following iterative technique for seeking its minimizer. Given an estimate wk−1 at iteration k − 1 we do the following: compute rk (n) = |x(n) − ynT wk−1 |p−2 , n = 0, 1, . . . , N − 1 n o set Rk = diag rk (0), rk (1), . . . , rk (N − 1) T

−1

(50.166)

T

update the estimate to wk = (H Rk H) H Rk d and repeat until convergence This implementation assumes that the successive Rk are invertible. The algorithm is known as iterative reweighted least-squares (IRLS). It has several variations with improved stability and convergence properties (see, e.g., Osborne (1985) and Bjorck (1996); see also Fletcher, Grant, and Hebden (1971) and Kahng (1972)). One such variation is to evaluate wk not directly as above but as a convex combination using the prior iterate wk−1 for some 0 < β ≤ 1 as follows:

2200

Least-Squares Problems

compute rk (n) = |x(n) − ynT wk−1 |p−2 , n = 0, 1, . . . , N − 1 n o set Rk = diag rk (0), rk (1), . . . , rk (N − 1) set wk = (H T Rk H)−1 H T Rk d set wk = βwk + (1 − β)wk−1 and repeat until convergence

(50.167)

Matrix factorization. We described an alternating least-squares algorithm for the solution of the matrix factorization (or collaborative filtering) problem (50.135) in Example 50.6. We explained in the commentaries at the end of Chapter 16 that matrix factorization problems of this type arise in the design of recommender systems and were largely driven by the Netflix prize challenge, which ran during the period 2006–2009. Solution (50.137) is motivated by the works of Bell and Koren (2007a), Hu, Koren, and Volinsky (2008), Zhou et al. (2008), and Pilaszy, Zibriczky, and Tikk (2010). For more details on alternating methods, see also the treatment by Udell et al. (2016). We recall that we encountered another instance of matrix factorization problems in the concluding remarks of Chapter 1, when we discussed the Eckart–Young theorem right after (1.222). The theorem dealt with the following scenario. Consider a U × I matrix R and assume we wish to determine a low-rank approximation for it in the form of the product R ≈ W H, where W is U × M , H is M × I, and M is the desired rank approximation. The Eckart–Young theorem determines a collection of M column vectors {xm , ym }, where each xm is U × 1 and each ym is I × 1, in order to solve: M

2

X ∆

T b = R argmin R − xm ym

{xm ,ym }

m=1

(50.168)

F

Once the {xm , ym } are determined, they are used to construct  T y1  y2T    W = x1 x2 . . . xM , H =  .  .. T yM

W and H as follows:     

(50.169)

The solution of (50.168) requires all entries of R to be known (which obviously cannot be applied in the context of recommender systems where many entries are normally b is found as follows. We first introduce the SVD of R, missing). The approximation R say, r X R= σn un vnT (50.170) n=1

where r > M denotes the rank of R and the singular values {σn } are ordered in decreasing order, i.e., σ1 ≥ σ2 ≥ . . . ≥ σr > 0. Then, the solution to (50.168) is given by – recall Prob. 1.56: b= R

M X

T σm um vm

(50.171)

m=1

in terms of the singular vectors {um , vm } associated with the M largest singular values. Sketching and randomized algorithms. We described in Example 50.5 some useful results on randomized algorithms and sketching applied to least-squares problems. These methods help to deal with situations involving massive amounts of data, while delivering some important performance guarantees. The basic idea, which involves projecting the

50.5 Commentaries and Discussion

2201

data onto lower-dimensional spaces, is motivated by an important result from Johnson and Lindenstrauss (1984). In one of its simpler forms for Euclidean spaces, the result can be stated as follows.

Johnson–Lindenstrauss lemma (Johnson and Lindenstrauss (1984)).Consider a collection of M column vectors {xm } of dimension N × 1 each. For any 0 <  < 1/2, select a dimension R = O((log M )/2 ). Then, there exists a matrix S ∈ IRR×N such that for all m 6= m0 : kSxm − Sxm0 k 1−≤ ≤1+ (50.172) kxm − xm0 k

In the context of the least-squares problem studied in Example 50.5, the vectors xm correspond to the columns of H or d. The above lemma essentially states that one can map a collection of vectors {xn } from an Euclidean space of high dimension N to another collection of vectors {Sxm } of much smaller dimension R such that the relative distance between any two points changes only by 1 ± . This result has motivated a flurry of investigations on sketching methods. One notable advance was given by Sarlós (2006), who showed how to construct a sketching matrix S using fast Johnson–Lindenstrauss transforms leading to an ultimate complexity of O(N M log M ) for the solution of leastsquares problems. The Gaussian construction for a sketching matrix given in Example 50.5 is from Indyk and Motwani (1998), while the leverage-scores-based construction is from Drineas, Mahoney, and Muthukrishnan (2006b), and the Hadamard construction is from Ailon and Liberty (2013). Extensions to other convex problems appear in Pilanci and Wainwright (2015). Excellent surveys on randomized algorithms and sketching are given by Mahoney (2011) and Woodruff (2014), with derivations and justifications for several of the results and properties mentioned in the body of the chapter. Implicit bias or regularization. We illustrated in Section 50.4 one instance of implicit bias (also called implicit regularization). We considered an over-parameterized leastsquares problem where there are fewer data points than the size of the parameter vector, w ∈ IRM . The analysis showed that the gradient-descent solution has an implicit bias toward the minimum-norm solution of the least-squares problem. Similar behavior occurs for other risk functions and is not limited to the least-squares case – see, e.g., Prob. 50.31 dealing with matrix factorization, the earlier Prob. 16.8 dealing with the Kaczmarz method, and Prob. 61.7 dealing with logistic regression and support vector machines. Other algorithms need not behave in the same manner and may converge to other minimizers. Therefore, the choice of which algorithm to use has an influence on which model is learned in cases when a multiplicity of solutions exist. And some models are “better” than others because they may generalize better in the following sense. Once a solution w? is found, the intent is to use it to predict target values x for future observations y that were not part of the original training data {d, H} by using, for example, x b(t) = ytT w? . The concept of “generalization” relates to how well a learned model w? performs on new observations, i.e., how well it predicts. We will discuss generalization in the context of classification problems in greater detail in Chapter 64. For more discussion on the topic of implicit bias in the machine learning literature, the reader may refer to Gower and Richtárik (2015), Neyshabur, Tomioka, and Srebro (2015), Gunasekar et al. (2017, 2018), Soudry et al. (2018), Jin and Montúfar (2020), and the references therein. Recursive least-squares and Kalman filtering. Following Sayed and Kailath (1994) and Sayed (2003, 2008), Appendix 50.B describes a useful equivalence result between stochastic and deterministic estimation problems with quadratic risks. The equivalence

2202

Least-Squares Problems

is then used in Appendix 50.C, based on arguments from Kailath, Sayed, and Hassibi (2000) and Sayed (2003, 2008), to clarify the fundamental connection that exists between recursive least-squares and Kalman filtering, so much so that solving a problem in one domain is equivalent to solving a problem in the other domain. One of the earliest mentions of a relation between least-squares and Kalman filtering appears to be Ho (1963); however, this reference considers only a special estimation problem where the successive observation vectors are identical. Later references are Sorenson (1966) and Aström and Wittenmark (1971); these works focus only on the standard (i.e., unregularized) least-squares problem, in which case an exact relationship between least-squares and Kalman filtering does not actually exist, especially during the initial stages of adaptation when the least-squares problem is under-determined. Soon afterwards, in work on channel equalization, Godard (1974) rephrased the growing-memory (i.e., λ = 1) RLS problem in a stochastic state-space framework, with the unknown state corresponding to the unknown weight vector in a manner similar to what we encountered in Example 30.4. Similar constructions also appeared in Willsky (1979), Anderson and Moore (1979), Ljung (1987), Strobach (1990), and Söderström (1994). In the works by Anderson and Moore (1979), Ljung (1987), and Söderström (1994), the underlying models went a step further and incorporated the case of exponentially decaying memory (i.e., λ < 1) by formulating state-space models with a time-variant noise variance. Nevertheless, annoying discrepancies persisted that precluded a direct correspondence between the exponentially weighted RLS (λ < 1) and the Kalman variables. Some of these discrepancies were overcome essentially by fiat (see, e.g., the treatment by Haykin (1991)). This lack of a direct correspondence may have inhibited application of the extensive body of Kalman filter results to the adaptive least-squares problem until a resolution was given in the work by Sayed and Kailath (1994). In retrospect, by a simple device, the latter reference was able to obtain a perfectly matched state-space model for the case of exponentially decaying memory (λ < 1), with a direct correspondence between the variables in the exponentially weighted RLS problem and the variables in the state-space estimation problem. Sea-level and global temperature changes. Figure 50.2 illustrates the result of fitting a linear regression model onto measurements of sea-level changes. The source of the data is the NASA Goddard Space Flight Center at https://climate.nasa.gov/ vital-signs/sea-level/. For more information on how the data was generated, the reader may consult Beckley et al. (2017) and the report GSFC (2017). Similarly, Fig. 50.3 illustrates the fitting of LOWESS and LOESS smoothing curves onto measurements of changes in the global surface temperature. The source of the data is the NASA Goddard Institute for Space Studies (GISS) at https://climate.nasa.gov/ vital-signs/global-temperature/.

PROBLEMS1

50.1 Consider an N × M full-rank matrix H with N ≥ M , and two column vectors d e and z of dimensions N × 1 each. Let de = P⊥ e = P⊥ H d and z H z. Are the residual vectors d and ze collinear in general? If your answer is positive, justify it. If the answer is negative, can you give conditions on N and M under which de and ze will be collinear? 50.2 Let H be N × M with full-column rank. Show that any vector in the column span of P⊥ H is orthogonal to any vector in the column span of H. That is, show that H T P⊥ H = 0. 50.3 Consider the standard least-squares problem (50.19). Comment on the solution w? in the following three cases: (a) d ∈ N(H), (b) d ∈ R(H), and (c) d ∈ N(H T ). 1

Several problems in this section are adapted exercises from Sayed (2003, 2008).

Problems

2203

50.4 Solving the normal equations H T Hw? = H T d by forming the matrix H T H (i.e., by squaring the data) is a bad idea in general. Consider the full-rank matrix   1 1 H= 0   1 1 where  is a very small positive number that is of the same order of magnitude as machine precision. Assuming 2 + 2 = 2 in finite precision, what is the rank of H T H? 50.5 A numerically reliable method for solving the normal equations H T Hw? = H T d is the QR method. It avoids forming the product H T H, which is problematic for illconditioned matrices. The QR method works directly with H and uses its QR decomposition – defined earlier in Section 1.6:   R H=Q 0 where Q is N × N orthogonal and R is M × M upper-triangular with positive diagonal entries. Let col{z1 , z2 } = QT d, where z1 is M × 1. Verify that kd − Hwk2 = kz1 − Rwk2 + kz2 k2 Refer to the standard least-squares problem (50.19) and verify that the least-squares solution w? can be obtained by solving the triangular linear system of equations Rw? = z1 . Conclude that the minimum risk is kz2 k2 . 50.6 Refer to Example 50.1 but assume now that the zero-mean Gaussian noise process is colored. Collect the noise terms into the column vector n o ∆ v = col v(0), v(1), . . . , v(N − 1) and denote its covariance matrix by Rv = E vv T > 0. Use the data and vector notation (50.17) to verify that the ML estimate for w is the solution to the weighted least-squares problem: min

w∈IRM

kd − Hwk2R−1 =⇒ w? = (H T Rv−1 H)−1 H T Rv−1 d v

where the notation kak2R stands for aT Ra. 50.7 Let ξ(n) denote the minimum risk value of (50.134) with w replaced by wn . (a) Show that ξ(n) = dTn Λn (dn − Hn wn ). (b) Derive the time-update relation ξ(n) = λξ(n − 1) + t(n)e2 (n), ξ(−1) = 0. 50.8 Consider an `2 -regularized least-squares problem of the form: ( ) N −1 2 1 X  2 T argmin ρkwk + x(n) − yn w + θ N n=0 w∈IRM ,θ∈IR Observe that regularization is applied to w only and not to θ. Introduce the sample averages: x ¯= (a) (b)

N −1 1 X x(n), N n=0

y¯ =

N −1 1 X yn N n=0

Fix w and show that optimizing over θ leads to the expression θ = y¯T w − x ¯. Center the data and define x0 (n) = x(n) − x ¯ and yn0 = yn − y¯. Conclude that the above least-squares problem is equivalent to solving a traditional regularized problem without offset, namely, ( ) N −1 2 1 X  0 2 0 T argmin ρkwk + x (n) − (yn ) w N n=0 w∈IRM

2204

Least-Squares Problems

50.9

? Let w? and wreg denote the solutions to the following problems: ∆

w? = argmin kd − Hwk2 w∈IRM ∆

? wreg

= argmin

n

o ρkwk2 + kd − Hwk2 , ρ > 0

w∈IRM ? Let Q = H T H, assumed invertible. Show that wreg = (IM + ρQ)−1 w? . 50.10 Consider the weighted least-squares problem (50.49). Verify that the orthogonality condition in this case is given by

H T R(d − Hw? ) = 0 ⇐⇒ H T Rde = 0 where de = d − db and db = Hw? . Show further that the minimum risk is given by e ξ = dT Rd. 50.11 Refer to the stochastic model (50.88) where v has covariance matrix σv2 IN but is not necessarily Gaussian. Relation (50.89) will continue to hold, linking the true model wo to the least-squares model w? . Introduce the MSE risk, P (w) = E kd−Hwk2 , where the expectation is over the source of randomness in d. (a) Let w e = wo − w. Verify that P (w) = w eT H T H w e + N σv2 . Conclude that the minimum value is attained at w = wo and is equal to P (wo ) = N σv2 . (b) Using expression (50.89), verify that the least-squares solution w? , which is now random since it depends on d, leads to an average excess risk value of E P (w? ) − P (wo ) = σv2 M , which is dependent on the problem dimension, M . 50.12 We continue with the stochastic model (50.88), but assume now that the rows of H are Gaussian distributed with zero mean and unit covariance matrix, i.e., each y n ∼ Nyn (0, IM ). We continue to assume that v has zero mean and covariance matrix σv2 IN and is independent of H. In this problem we consider both situations in which N ≥ M (over-determined least-squares, with more data than unknowns) and N < M (under-determined or over-parameterized least-squares). Introduce the weight-error e = wo − w? , where w? is a least-squares solution. vector w (a) Assume first that N ≥ M and show that n o e 2 = σv2 E Tr(H T H) = E kwk (b)

σv2 M , N −M −1

for N ≥ M + 2

where we are denoting H in boldface since its entries are now random. Assume next that N < M and let w? refer to the minimum-norm least-squares solution. Show that n o e 2 = E k(IM − H T (HH T )−1 H)wo k2 + σv2 E Tr(HH T ) E kwk =

M −N σv2 N kwo k2 + , M M −N −1

for M ≥ N + 2

(c) Compare both situations as M varies. Remark. The result of this problem, and especially the result in part (c) showing how the MSE behaves as a function of increasing complexity M , is related to the phenomena of double descent and bias–variance trade-off in learning – see, e.g., Belkin, Ma, and Mandal (2018), Belkin, Rakhlin, and Tsybakov (2019), Hastie et al. (2019), and Mei and Montanari (2020). To solve the problem, the reader needs to rely on some properties of the Wishart distribution. Consider a collection of M -dimensional vectors {an }, each arising from a zero-mean PGaussianT distribution with covariance matrix Σ > 0, i.e., a ∼ Na (0, Σ). Let X = N n=1 an an , which is M × M . Then, for N ≥ M , it is known that X is invertible almost surely and it follows a so-called Wishart distribution with mean zero, N degrees of freedom, and scale parameter Σ, written as X ∼ W(N, Σ). Its mean is E X = N Σ. The inverse matrix X −1 follows an inverse Wishart distribution

Problems

2205

with mean zero, N degrees of freedom, and scalar parameter Σ−1 , written as X −1 ∼ W−1 (N, Σ−1 ). The respective pdfs are proportional to (N −M −1)/2

n 1 o × exp − Tr(Σ−1 X) , N ≥ M 2 n 1  −(N +M +1)/2 o −1 −1 × exp − Tr(Σ−1 X −1 ) , N ≥ M fX −1 (X ) ∝ det X 2 fX (X) ∝



det X

For more information on the Wishart distribution, the reader may refer to Eaton (1983), Gupta and Nagar (2000), and Anderson (2003). 50.13 Refer to the stochastic model (50.88) where v has covariance matrix σv2 IN . Show that E kH(w? − wo )k2 ≤ 4σv2 rank(H) Remark. See Rigollet and Huetter (2017) for a related discussion. 50.14 Consider a symmetric positive-definite weighting matrix, R, and a symmetric positive-definite regularization matrix, Π. Verify that the “normal equations” that describe all solutions to the regularized and weighted least-squares problem: n o min wT Πw + (d − Hw)T R(d − Hw) w∈IRM

are given by (Π + H T RH)w? = H T Rd. Verify that the “orthogonality condition” in this case amounts to requiring: H T R(d − Hw? ) = Πw? ⇐⇒ H T Rde = Πw? where de = d − db and db = Hw? . Show further that the minimum cost is given by either expression: ξ = dT Rde = dT (R−1 + HΠ−1 H T )−1 d 50.15 In constrained least-squares problems we seek to minimize kd − Hwk2 over w ∈ IRM subject to the linear constraint Aw = b, where the data matrices H and A have dimensions N × M (N ≥ M ) and P × M (P ≤ M ), respectively. Both matrices {H, A} are assumed to have full rank. Note that H is “tall” while A is “fat.” Show that the solution is given by  −1 wc? = w? − (H T H)−1 AT A(H T H)−1 AT (Aw? − b) where w? is the standard least-squares solution, w? = (H T H)−1 H T d. ¯ z], with d and z de50.16 Consider a data matrix H and partition it as H = [d H b noting its leading and trailing columns, respectively. Let d and zb denote the regularized b ¯ namely, db = Hw ¯ y? , zb = Hw ¯ z? , de = d − d, least-squares estimates of d and z given H, and ze = z − zb, where wy? and wz? are the solutions of n o n o ¯ y k2 ¯ z k2 and min wzT Πwz + kz − Hw min wyT Πwy + kd − Hw wy

wz

∆ e T z = dT ze. Define κ = e T ze/(kdk e ke for some positive-definite matrix Π. Show that (d) (d) z k). Show that |κ| ≤ 1. 50.17 Refer to the RLS algorithm in Section 50.3. Introduce the a-priori and aposteriori errors e(n) = x(n) − ynT wn−1 and r(n) = x(n) − ynT wn . Observe that one error depends on wn−1 while the other error depends on the updated iterate, wn . The conversion factor allows us to transform e(n) into r(n) without the need to update wn−1 to wn . Show that r(n) = t(n)e(n). Conclude that |r(n)| ≤ |e(n)|.

2206

Least-Squares Problems

50.18 Refer to the derivation of the exponentially weighted RLS in Section 50.3, but assume now that dN evolves in time in the following manner:   adN −1 dN = x(N ) for some scalar a. The choice a = 1 reduces to the situation studied in the body of the chapter. Show that the solution wN , and the corresponding minimum cost, ξ(N ), can be computed recursively as follows. Start with w−1 = 0, P−1 = (1/ρ0 )I, and ξ(−1) = 0, and iterate for n ≥ 0: t(n) = 1/(1 + λ−1 ynT Pn−1 yn ) gn = λ−1 Pn−1 yn t(n) e(n) = x(n) − aynT wn−1 wn = awn−1 + gn e(n) Pn = λ−1 Pn−1 − gn gnT /t(n) ξ(n) = λa2 ξ(n − 1) + t(n)e2 (n) In particular, observe that the scalar a appears in the expressions for {wn , e(n), ξ(n)}. Show further that r(n) = t(n)e(n) where r(n) = x(n) − ynT wn . 50.19 All variables are scalars. Consider N noisy measurements of an unknown x, say, d(n) = x + v(n), and formulate the following two optimization problems: ∆

x bmean = argmin x

N 1 X (d(n) − x)2 , N n=1



x bmedian = argmin x

N 1 X |d(n) − x| N n=1

PN Show that x bmean is the sample mean, i.e., x bmean = N1 n=1 d(n). Show that x bmedian is the median of the observations, where the median is such that an equal number of observations exists to its left and to its right. 50.20 At each time n ≥ 0, M noisy measurements of a scalar unknown variable x are collected from M spatially distributed sensors, say, dm (n) = x + vm (n), m = 0, 1, . . . , M − 1. The unknown x is estimated by solving a least-squares problem of the form: ( N !) −1 X N −n M X ∆ 2 x bN = argmin λ αm (n) |dm (n) − x|

(a) (b)

x

n=0

m=0

where 0  λ ≤ 1 is an exponential forgetting factor and the {αk (n)} are some nonnegative weighting coefficients. Show that x bN can be computed recursively as follows: φ(n) = λφ(n − 1) +

M −1 X

αm (n),

φ(−1) = 0

m=0

s(n) = λs(n − 1) +

M −1 X

αm (n)dm (n),

s(−1) = 0

m=0

x bn = s(n)/φ(n) 50.21 Two least-squares estimators are out of sync. At any time N , estimator #1 computes the estimate w1,0:N −1 that corresponds to the solution of ( ) N −1 X ∆ 0 N 2 N −1−n T 2 w1,0:N −1 = argmin ρ λ kwk + λ (x(n) − yn w) w∈IRM

n=0

Problems

2207

where ρ0 > 0 and λ is the forgetting factor. Note that w1,0:N −1 is an estimate that is based on measurements between times n = 0 and n = N − 1. On the other hand, estimator #2 computes the estimate w2,1:N that corresponds to the solution of ( ) N X ∆ 0 N 2 N −n T 2 ρ λ kwk + w2,1:N = argmin λ (x(n) − yn w) w∈IRM

n=1

Here, w2,1:N is an estimate that is based on measurements between times n = 1 and n = N . Can you use the available estimates {w1,0:N −1 , w2,1:N , N ≥ 0} to construct the recursive solution of ( ) N X ∆ 0 N +1 2 N −n T 2 ρλ kwk + wN = argmin λ (x(n) − yn w) w∈IRM

n=0

where wN is an estimate that is based on all data up to time N ? If so, explain the construction. If not, explain why not. 50.22 Node #1 observes even-indexed data {x(2n), y2n } for n ≥ 0 and computes the recursive least-squares solution of ) ( n X ∆ 2n−2j T 2 0 2n+1 2 λ (x(2j) − y2j w) w2n = argmin ρ λ kwk + w∈IRM

j=0

0

where ρ > 0 is a regularization factor and λ is the forgetting factor. Note that w2n is an estimate that is based solely on the even-indexed data. Likewise, node #2 observes odd-indexed data {x(2n+1), y2n+1 } for n ≥ 0 and computes the recursive least-squares solution of ( ) n X ∆ 0 2n+2 2 2n−2j T 2 w2n+1 = argmin ρ λ kwk + λ (x(2j + 1) − y2j+1 w) w∈IRM

j=0

Here, w2n+1 is an estimate that is based solely on the odd-indexed data. Can you use the available estimates {w2n , w2n+1 , n ≥ 0} to construct the recursive solution of ( ) N X ∆ 0 N +1 2 N −j T 2 wN = argmin ρ λ kwk + λ (x(j) − yj w) w∈IRM

j=0

where wN is an estimate that is based on all data (both even- and odd-indexed) up to time N ? If so, explain the construction. If not, explain why not. 50.23 Consider the optimization problem ( !) N X ∆ 0 N +1 2 N −n T 2 wN = argmin ρλ kwk + E λ (x(n) − α yn w) w∈IRM

n=0

where the data {x(n), yn } are deterministic measurements with x(n) a scalar and yn a column vector of size M × 1. The random variable α is Bernoulli and assumes the value α = 1 with probability p and the value α = 0 with probability 1 − p; it is used to model a faulty sensor – when the sensor fails, no data is measured. Let wN denote the solution. Can you determine a recursion to go from wN −1 to wN ? 50.24 Consider an unknown M × 1 vector w = col{w1 , w2 }, where w1 is L × 1. Introduce the least-squares problem: ( ) min

w∈IRM

w1T Πw1 + kzN − HN wk2 + kdN − GN w1 k2

2208

Least-Squares Problems

where Π > 0,    zN =  

z(0) z(1) .. . z(N )

   , 

   dN =  

x(0) x(1) .. . x(N )





  , 

  HN =  

y0T y1T .. . T yN

   , 

   GN =  

sT0 sT1 .. . T sN

    

Let wN denote the solution and let ξ(N ) be the resulting minimum cost. (a) Relate wN to wN −1 . (b) Relate ξ(N ) to ξ(N − 1). 50.25 Let w? denote the solution to the following regularized least-squares problem: n o min wT Πw + (d − Hw)T R(d − Hw) w∈IRM

where R > 0 and Π > 0. Let db = Hw? denote the resulting estimate of d and let ξ denote the corresponding minimum cost. Now consider the extended problem (

    2 )

d ha H hb T

min wz Πz wz + − wz

γ αa hT αb wz ∈IRM +1 R z

where {h, ha , hb } are column vectors, {γ, αa , αb , a, b} are scalars, and     a R , Π Πz =  Rz = 1 b Let  dbz =

ha αa

H hT

hb αb



wz?

and let ξz denote the corresponding minimum risk of the extended problem. Relate b ξ}. {wz? , dbz , ξz } to {w? , d, 50.26 Consider an M × m full-rank matrix A (M > m) and let w be any vector in its range space, i.e., w ∈ R(A). Let wN denote the solution to the following regularized least-squares problem: ( ) N X N +1 T N −n T 2 min λ w Πw + λ (x(n) − yn w) w∈R(A)

n=0

where Π > 0 and yn is M × 1. Find a recursion relating wN to wN −1 . 50.27 Consider a least-squares problem of the form ( ) N X 2 N −n T 2 min ρkwk + λ |x(n) − yn w| w∈IRM

n=0

where ρ > 0 is a regularization parameter, yn is an M × 1 regression vector, and 0  λ ≤ 1 is a forgetting factor defined as follows:  λe , for n even λ= λo , for n odd Let wN denote the solution to the above least-squares problem. Derive a recursive solution that updates wN to wN +1 .

Problems

50.28

2209

Consider a regularized least-squares problem of the form n o min (w − w) ¯ T Π(w − w) ¯ + (zB−1 − HB−1 w)T RB−1 (zB−1 − HB−1 w)

w∈IRM

where Π > 0, RB−1 > 0 is a weighting matrix, and w ¯ is some known initial condition. We partition the entries of {zB−1 , HB−1 } into block vectors and block matrices:     U0 d0  U1   d1      HB−1 =  zB−1 =   , .. ..     . . UB−1 dB−1 where each db has dimensions p × 1 and each Ub has dimensions p × M . We further assume that the positive-definite weighting matrix RB−1 has a block diagonal structure, −1 }. with p × p positive-definite diagonal blocks, say RB−1 = blkdiag{R0−1 , R1−1 , . . . , RB−1 Let wB−1 denote the solution of the above least-squares problem and let PB−1 = T RB−1 HB−1 )−1 . (Π + HB−1 T TB UB PB−1 , with initial condition P−1 = Π−1 (a) Show that PB = PB−1 − PB−1 UB T −1 ) . and where TB = (RB + UB PB−1 UB T (b) Show that wB = wB−1 + PB−1 UB TB (dB − UB wB−1 ). (c) Conclude that wB can be computed recursively by means of the following block RLS algorithm. Start with w−1 = w ¯ and P−1 = Π−1 and repeat for b ≥ 0:   T = (Rb + Ub Pb−1 UbT )−1   b Gb = Pb−1 UbT Tb wb = wb−1 + Gb (db − Ub wb−1 )    Pb = Pb−1 − Gb Tb−1 GTb (d) (e) (f)

−1 −1 T −1 T −1 RB . UB PB UB − RB RB and TB = RB Establish the equalities GB = PB UB Let {rB , eB } denote the a-posteriori and a-priori error vectors, rB = dB − UB wB −1 and eB = dB − UB wB−1 . Show that RB rB = TB eB . Let ξ(B − 1) denote the minimum cost associated with the solution wB−1 . Show that it satisfies the time-update relations:

T −1 ξ(B) = ξ(B − 1) + rB RB eB = ξ(B − 1) + eTB TB eB , ξ(−1) = 0 P T Conclude that ξ(B) = B b=0 eb Tb eb . 50.29 Consider the same formulation of Prob. 50.28 but assume the weighting matrix RB is related to RB−1 as follows:   DB−1 RB−1 RB = −1 RB

where DB−1 = diag{Ip , . . . , Ip , βIp , Ip , . . . , Ip }, and β > 1 is a positive scalar. The scalar β appears at the location corresponding to the kth block Rk−1 . Find a recursion relating wB to wB−1 . 50.30 Consider a regularized block least-squares problem of the form ( ) B X B+1 T B−b T −1 min λ (w − w) ¯ Π(w − w) ¯ + λ (db − Ub w) Rb (yb − Ub w) w∈IRM

b=0

where each db has size p × 1, each Ub has size p × M , and each Rb is p × p and positivedefinite. Moreover, 0  λ ≤ 1 is an exponential forgetting factor and Π > 0. Let ξ(B) denote the value of the minimum risk associated with the optimal solution wB . Repeat

2210

Least-Squares Problems

the arguments of Prob. 50.28 to show that the solution wB can be time-updated by the following block RLS algorithm:  T = (Rb + λ−1 Ub Pb−1 UbT )−1   b   Gb = λ−1 Pb−1 UbT Tb     e = db − Ub wb−1   b wb = wb−1 + Gb (db − Ub wb−1 ), w−1 = w ¯ −1 T −1 −1 P = λ P − G T G , P = Π  −1 b b−1 b b b    rb = db − Ub wb     ξ(b) = λξ(b − 1) + eTb Tb eb , ξ(−1) = 0   = λξ(b − 1) + rbT Rb−1 eb Verify also that the quantities {Gb , Tb } admit the alternative expressions Gb = Pb UbT Rb−1 and Tb = Rb−1 − Rb−1 Ub Pb UbT Rb−1 . 50.31 Consider a collection of N × N symmetric matrices {Am } for m = 1, 2, . . . , M , an N × N full-rank matrix U , and an M × 1 vector b. It is assumed that M  N 2 so that the amount of data represented by the size of b is significantly smaller than the number of entries in U . Define the M × 1 vector A(U ) = col{Tr(U T Am U )} and consider the optimization problem: min

U ∈IRN ×N

kA(U ) − bk2

Under M  N 2 , there are many solutions U that satisfy A(U ) = b. (a) Write down the gradient-descent recursion for seeking a minimizer for the above problem. (b) Assume the matrices {Am } commute so that Am An = An Am for any n and m. Argue that for a sufficiently small step size, and for an initial condition close to zero, the gradient-descent algorithm converges toward the solution with the smallest nuclear norm, i.e., toward the solution U that solves min kU U T k? , U

subject to A(U ) = b

Remark. The result of this problem provides another manifestation of the implicit bias/regularization problem discussed in the comments at the end of the chapter. There are many solutions U for the over-parameterized problem; yet gradient descent converges to the solution with the smallest nuclear norm. See Gunasekar et al. (2017) for more discussion.

50.A

MINIMUM-NORM SOLUTION Let W = {w such that kd − Hwk2 is minimum}

(50.173)

denote the set of all solutions to the standard least-squares problem (50.19). We argue below, motivated by the presentation from Sayed (2003, 2008), that the solution to min kwk

(50.174)

w? = H † d

(50.175)

w∈W

is given by

in terms of the pseudo-inverse of H.

50.B Equivalence in Linear Estimation

2211

Proof: We establish (50.175) for the over-determined case (i.e., when N ≥ M ) by introducing the SVD of H from Section 1.7. A similar argument applies to the underdetermined case (when N < M ). Thus, let r ≤ M denote the rank of H and introduce its SVD:   Σ H=U VT (50.176) 0 n o where Σ = diag σ1 , . . . , σr , 0, . . . , 0 . Then, it holds that

  2

Σ kd − Hwk2 = kU T d − U T HV V T wk2 = f − z

0

(50.177)

where we introduced the vectors z = V T w and f = U T d. Note that z and w have the same Euclidean norm. Therefore, the problem of minimizing kd − Hwk2 over w is equivalent to the problem of minimizing the rightmost term in (50.177) over z. Let {z(i), f (i)} denote the individual entries of {z, f }. Then

  2 X r N X

2

f − Σ z = (f (i) − σ z(i)) + f 2 (i) i

0 i=1

(50.178)

i=r+1

The second term is independent of z. Hence, any solution z has to satisfy z(i) = f (i)/σi for i = 1 to r and z(i) arbitrary for i = r + 1 to i = M . The solution z with the smallest Euclidean norm requires that these latter values be set to zero. In this case, the solution becomes n o w? = V col f (1)/σ1 , . . . , f (r)/σr , 0, . . . , 0 =V



Σ†

0



U Td

(1.115)

=

H †d

(50.179)

as claimed. 

50.B

EQUIVALENCE IN LINEAR ESTIMATION There is a close relation between regularized least-squares problems and linear leastmean-squares estimation problems. Although the former class of problems deals with deterministic variables and the latter deals with random variables, both classes turn out to be equivalent in the sense that solving a problem from one class also solves a problem from the other class and vice-versa. We follow the presentation from Sayed and Kailath (1994), Kailath, Sayed, and Hassibi (2000), and Sayed (2003, 2008).

Stochastic problem Let x and y be two zero-mean vector random variables that are related via a linear model of the form: y = Hx + v

(50.180a)

for some known matrix H and where v denotes a zero-mean random noise vector with known covariance matrix, Rv = E vv T . The covariance matrix of x is also known and denoted by E xxT = Rx . Both {x, v} are uncorrelated, i.e., E xv T = 0, and we further

2212

Least-Squares Problems

assume that Rx > 0 and Rv > 0. We established in (29.95b) that the linear least-meansquares estimator of x given y is  −1 b = Rx−1 + H T Rv−1 H x H T Rv−1 y (50.180b) and that the resulting MMSE matrix is  −1 MMSE = Rx−1 + H T Rv−1 H

(50.180c)

Deterministic problem Now consider instead deterministic vector variables {x, y} and a data matrix H relating them via y = Hx + v (50.181a) where v denotes measurement noise. Assume further that we pose the problem of estimating x by solving the weighted regularized least-squares problem: n o min xT Πx + (y − Hx)T W (y − Hx) (50.181b) x

where Π > 0 is a regularization matrix and W > 0 is a weighting matrix. It is straightforward to verify by differentiation that the solution x b is given by  −1 x b = Π + H TW H H TW y (50.181c) and that the resulting minimum cost is  −1 ξ = y T W −1 + HΠ−1 H T y

(50.181d)

Equivalence Expression (50.180b) provides the linear least-mean-squares estimator of x in a stochastic framework, while expression (50.181c) provides the least-squares estimate of x in a deterministic setting. It is clear that if we replace the quantities in (50.180b) by Rx ←− Π−1 and Rv ←− W −1 , then the stochastic solution (50.180b) would coincide with the deterministic solution (50.181c). We therefore say that both problems are equivalent. Such equivalences play an important role in estimation and inference theories since they allow us to move back and forth between deterministic and stochastic formulations, and to determine the solution for one context from the solution to the other. Table 50.2 summarizes the relations between the variables in both domains. We consider one application of these equivalence results in the next appendix in the context of Kalman and smoothing filters.

50.C

EXTENDED LEAST-SQUARES If we refer to the derivation in Example 30.4 and examine the Kalman recursions in that context, we will find that they agree with the recursive least-squares recursions. In other words, the example shows that the growing memory (λ = 1) RLS algorithm is equivalent to a Kalman filter implementation for estimating an unknown model x0 = w from the observations. Now model (30.102) is special and, therefore, the RLS filter is equivalent not to a full-blown Kalman filter but only to a special case of it – see Haykin et al. (1997) for another special case. In this appendix, following the equivalence approach of Sayed

50.C Extended Least-Squares

2213

Table 50.2 Equivalence of the stochastic and deterministic frameworks. Stochastic setting Deterministic setting random variables {x, y}

deterministic variables {x, y}

model y = Hx + v

model y = Hx + v

covariance matrix, Rx

inverse regularization matrix Π−1

noise covariance, Rv

inverse weighting matrix W −1

b x

x b

min E (x − Ky)(x − Ky)T K



b = Rx−1 + H T Rv−1 H x

−1

H T Rv−1 y

 −1 MMSE = Rx−1 + H T Rv−1 H

n

min xT Πx + ky − Hxk2W

o

x

 −1 x b = Π + H TW H H TW y  −1 min. cost = y T W −1 + HΠ−1 H T y

and Kailath (1994) from the previous appendix, and adapting the presentation from Kailath, Sayed, and Hassibi (2000), we describe the general deterministic least-squares formulation that is equivalent to a full-blown Kalman filter. In so doing, we will arrive at the extended RLS algorithm (50.211), which is better suited for tracking the state of linear state-space models, as opposed to tracking the state of the special model (30.102), as is further illustrated in Sayed (2003, 2008) by means of several special cases.

Deterministic estimation Consider a collection of (N + 1) measurements {yn }, possibly column vectors, that satisfy yn = Hn xn + vn (50.182) where the {xn ∈ IRM } evolve in time according to the state recursion xn+1 = Fn xn + Gn un ,

n≥0

(50.183)

Here, the {Fn , Gn , Hn } are known matrices and the {un , vn } denote disturbances or noises. Let further Π0 be a positive-definite regularization matrix, and let {Qn , Rn } be positive-definite weighting matrices. Given the {yn }, we pose the problem of estimating the initial state vector x0 and the signals {u0 , u1 , . . . , uN } in a regularized least-squares manner by solving ( ) N N X X T −1 T −1 T −1 min x0 Π0 x0 + (yn − Hn xn ) Rn (yn − Hn xn ) + un Qn un {x0 ,u0 ,...,uN }

n=0

n=0

(50.184) subject to the constraint (50.183). We denote the solution by {b x0|N , u bn|N , 0 ≤ n ≤ N }, and we refer to them as smoothed estimates since they are based on observations beyond the times of occurrence of the respective variables {x0 , un }. In principle, we could solve (50.184) by using optimization arguments, e.g., based on the use of Lagrange multipliers. Instead, we will solve it by appealing to the equivalence result of Table 50.2. In other words, we will first determine the equivalent stochastic problem and then solve this latter problem to arrive at the solution of (50.184). This method of solving (50.184) not only serves as an illustration of the convenience of equivalence results in estimation theory, but it also shows that sometimes it is easier

2214

Least-Squares Problems

to solve a deterministic problem in the stochastic domain (or vice-versa). In our case, the problem at hand is more conveniently solved in the stochastic domain. Introduce the column vectors:  x    0 y0 u  0   y1  ∆  u1  ∆   , (50.185) z =  d =  ..   .   .   .  . yN uN as well as the block-diagonal matrices: n o ∆ W−1 = blkdiag R0 , R1 , . . . , RN ,

n o ∆ Π−1 = blkdiag Π0 , Q0 , . . . , QN (50.186)

Then, it holds that xT0 Π−1 0 x0 +

N X

T uTn Q−1 n un = z Πz

(50.187)

n=0

Moreover, by using the state equation (50.183) to express each term Hn xn in terms of combinations of the entries of z, we can verify that N X

−1 (yn − Hn xn )T Rn (yn − Hn xn ) = (d − Hz)T W(d − Hz) = kd − Hzk2W

n=0

(50.188) where the matrix H is block lower-triangular and given by  H0 H1 G0  H1 Φ(1, 0)  H Φ(2, 0) H2 Φ(2, 1)G0 H2 G1 2 ∆  H =  .. .. .. ..  .  . . .  HN Φ(N, 0)

HN Φ(N, 1)G0

HN Φ(N, 2)G1

...

        HN GN −1

0 (50.189)

and the matrices Φ(n, m) are defined by  Fn−1 Fn−2 . . . Fm , ∆ Φ(n, m) = IM ,

n>m n=m

(50.190)

In other words, we find that we can rewrite the original cost function (50.184) as the regularized least-squares problem: n o min z T Πz + (d − Hz)T W(d − Hz) (50.191) z

Let zbN denote the solution to (50.191), i.e., zbN is a column vector that contains the desired solutions: n o zbN = col x b0|N , u b0|N , u b1|N , . . . , u bN |N (50.192) Now, in view of the equivalence result from Table 50.2, we know that zbN can be obtained by solving an equivalent stochastic estimation problem that is determined as follows.

50.C Extended Least-Squares

2215

Stochastic estimation We introduce zero-mean random vectors {z, d}, with the same dimensions and partitioning as the above {z, d}, and assume that they are related via a linear model of the form: d = Hz + v

(50.193)

where H is the same matrix as in (50.189), and where v denotes a zero-mean additive noise vector, uncorrelated with z, and partitioned as v = col{v 0 , v 1 , . . . , v N }. The dimensions of the {v n } are compatible with those of {y n }. We denote the covariance matrices of {z, v} by Rz = E zz T ,

Rv = E vv T

(50.194)

and we choose them as Rz = Π and Rv = W , where {Π, W} are given by (50.186). bN denote the LLMSE estimator of z given {y 0 , y 1 , . . . , y N } in d. We partition Let z z as −1

−1

z = col{x0 , u0 , u1 , . . . , uN }

(50.195)

b|N in terms Then the equivalence result of Table 50.2 states that the expression for z of d in the stochastic setting (50.193) is identical to the expression for zb|N in terms of d in the deterministic problem (50.191). bN or, equivalently, {b b n|N }, we start by noting that the In order to determine z x0|N , u linear model (50.193), coupled with the definitions of {Rz , Rv , H} in (50.186), (50.189), and (50.194), show that the stochastic variables {y n , v n , x0 , un } so defined satisfy the following state-space model: xn+1 yn

= =

Fn xn + Gn un H n xn + v n

(50.196)

with  T  Qn δnm un  um 0  v   E  n   vm  =  x0 0 x0 1 0 

0 Rn δnm 0 0

 0 0  Π0  0

(50.197)

We now use this model to derive recursions for estimating z (i.e., for estimating the variables {x0 , u1 , . . . , uN }).

Solving the stochastic problem bn denote the LLMSE estimator of z given the top entries {y 0 , . . . , y n } in d. To Let z bn , and ultimately z bN , we proceed recursively by employing the innovations determine z {en } of the observations {y n }. Using the basic recursive estimation formula (30.23) we have −1 bn = z bn−1 + (E zeTn ) Re,n z en   T −1 bn−1 + E ze =z xn|n−1 HnT Re,n en ,

b−1 = 0 z

(50.198)

where we used in the second equality the innovations equation (cf. (30.51)): b n|n−1 = Hn x e n|n−1 + v n en = y n − Hn x

(50.199)

bn have and the fact that E x0 v Tm = 0 and E un v Tm = 0 for all m. Clearly, the entries of z the interpretation n o bn = col x b 0|n , u b 0|n , u b 1|n , . . . , u ˆ n−1|n , 0, 0, . . . , 0 z (50.200) bn are zero since u b m|n = 0 for m ≥ n. where the trailing entries of z

2216

Least-Squares Problems

Let Kz,n = E ze xTn|n−1 . The above recursive construction would be complete, and bN , once we show how to evaluate the gain matrix hence provide the desired quantity z Kz,n . For this purpose, we first subtract the equations (from the Kalman filter (30.69)): xn+1 = Fn xn + Gn un b n+1|n = Fn x b n|n−1 + Kp,n (Hn x e n|n−1 + v n ) x

(50.201) (50.202)

e n+1|n = Fp,n x e n|n−1 + Gn un − Kp,n v n x

(50.203)

to obtain

where Fp,n = Fn − Kp,n Hn . Using this recursion, it is easy to verify that Kz,n satisfies the recursion:   0   Π0 ∆  0  T Kz,n+1 = E ze xTn+1|n = Kz,n Fp,n +  (50.204) Qn GTn , Kz,0 =  0 I 0 The identity matrix that appears in the second term of the recursion for Kz,n+1 occurs at the position that corresponds to the entry un in the vector z, e.g.,   T T Π0 Fp,0 Fp,1   T Π0 Fp,0   T   Q0 GT0 Fp,1   , ... Kz,1 =  Q0 GT0  , (50.205) Kz,2 =    T Q1 G1   0 0 Substituting (50.204) into (50.198) we find that the following recursions hold:  −1 b 0|n = x b 0|n−1 + Π0 ΦTp (n, 0)HnT Re,n b 0|−1 = 0  x en , x −1 (50.206) b m|n = u b m|n−1 + Qm GTm ΦTp (n, m + 1)HnT Re,n u en , m < n  b m|n = 0, m ≥ n u where the matrix Φp (n, m) is defined by  Fp,n−1 Fp,n−2 . . . Fp,m , ∆ Φp (n, m) = I,

n>m m=n

(50.207)

If we introduce the auxiliary variable ∆

λn|N =

N X

T −1 ΦTp (m, n)Hm Re,m em

(50.208)

m=n

then it is easy to verify that recursions (50.206) lead to  b 0|N x    b m+1|m  x em   b m|N   u λm|N

= = = = =

Π0 λ0|N b m|m−1 + Kp,m y m , x b 0|−1 = 0 Fp,m x b m|m−1 y m − Hm x Qm GTm λm+1|N T T −1 Fp,m λm+1|N + Hm Re,m em , λN +1|N = 0

(50.209)

These equations are the Bryson–Frazier smoothing recursions (30.194) – refer also to b m|n } for successive Prob. 30.13; the recursions (30.194) evaluate the estimators {b x0|n , u b n|N }, the estimators values of n, and not only for n = N as in (50.209). Just like {b x0|N , u b m|n } can also be related to the solution of a least-squares problem. Indeed, {b x0|n , u b m|n } in (50.206) by equivalence, the expressions that provide the solutions {b x0|n , u

References

2217

should coincide with those that provide the solutions {b x0|n , u bm|n } for the following deterministic problem, with data up to time n (rather than N as in (50.184)): ( ) n n X X T −1 T −1 T −1 min x0 Π0 x0 + (ym − Hm xm ) Rm (ym − Hm xm ) + um Qm um x0 ,u0 ,...,un

m=0

m=0

(50.210) b m|N } in the stochasWe know by equivalence that the mapping from {y m } to {b x0|N , u tic problem (50.193) coincides with the mapping from {ym } to {b x0|N , u bm|N } in the deterministic problem (50.191). We are therefore led to listing (50.211). Extended recursive least-squares algorithm to solve (50.184). given observations {yn } that satisfy xn+1 = Fn xn + Gn un and yn = Hn xn + vn ; objective: estimate {x0 , u0 , u1 , . . . , un } by solving (50.184). start from x b0|−1 = 0, P0|−1 = Π0 , λN +1|N = 0. (forward pass) repeat for n = 0, 1, 2, . . .: en = yn − Hn x bn|n−1 Re,n = Rn + Hn Pn|n−1 HnT −1 Kp,n = Fn Pn|n−1 HnT Re,n x bn+1|n = Fn x bn|n−1 + Kp,n en T Pn+1|n = Fn Pn|n−1 FnT + Gn Qn GTn − Kp,n Re,n Kp,n end

(50.211)

(backward pass) repeat for n = N, N − 1, . . . , 1, 0: Fp,n = Fn − Kp,n Hn T −1 λn|N = Fp,n λn+1|N + HnT Re,n en end (output) set x b0|N = Π0 λ0|N set u bn|N = Qn GTn λn+1|N , 0 ≤ n ≤ N .

REFERENCES Adrain, R. (1808), “Research concerning the probabilities of the errors which happen in making observations,” The Analyst, vol. 1, no. 4, pp. 93–109. Ailon, N. and E. Liberty (2013), “Almost optimal unrestricted fast Johnson– Lindenstrauss transform,” ACM Trans. Algorithms, vol. 9, no. 3, art. 21. Anderson, T. W. (2003), An Introduction to Multivariate Statistical Analysis, 3rd ed., Wiley. Anderson, B. D. O. and J. B. Moore (1979), Optimal Filtering, Prentice Hall. Aström, K. J. and B. Wittenmark (1971), “Problems of identification and control,” J. Math. Anal. App., vol. 34, pp. 90–113. Beckley, B. D., P. S. Callahan, D. W. Hancock, G. T. Mitchum, and R. D. Ray (2017), “On the cal-mode correction to TOPEX satellite altimetry and its effect on the global mean sea level time series,” J. Geophys. Res. Oceans, vol. 122, no. 11, pp. 8371–8384. Belkin, M., S. Ma, and S. Mandal (2018), “To understand deep learning we need to understand kernel learning,” available at arXiv:1802.01396.

2218

Least-Squares Problems

Belkin, M., A. Rakhlin, and A. B. Tsybakov (2019), “Does data interpolation contradict statistical optimality?” Proc. Int. Conf. Artificial Intelligence and Statistics (AISTATS), pp. 1611–1619, Naha. Bell, R. and Y. Koren (2007a), “Scalable collaborative filtering with jointly derived neighborhood interpolation weights,” Proc. IEEE Int. Conf. Data Mining (ICDM), pp. 43–52, Omaha, NE. Bjorck, A. (1996), Numerical Methods for Least Squares Problems, SIAM. Businger, P. and G. H. Golub (1965), “Linear least-squares solution by Householder transformations,” Numer. Math., vol. 7, pp. 269–276. Cleveland, W. S. (1979), “Robust locally weighted regression and smoothing scatterplots,” J. Amer. Statist. Assoc., vol. 74, pp. 829–836. Cleveland, W. S. and Devlin, S. J. (1988), “Locally weighted regression: An approach to regression analysis by local fitting,” J. Amer. Statist. Assoc., vol. 83, pp. 596–610. Draper, N. R. and H. Smith (1998), Applied Regression Analysis, 3rd ed., Wiley. Drineas, P., M. W. Mahoney, and S. Muthukrishnan (2006b), “Subspace sampling and relative-error matrix approximation: Column–row-based methods,” Proc. Algorithms: Annual European Symp. (ESA), pp. 304–314, Zurich. Eaton, M. L. (1983), Multivariate Statistics: A Vector Space Approach, Wiley. Fletcher, R., J. A. Grant, and M. D. Hebden (1971), “The calculation of linear best Lp approximations,” Comput. J., vol. 14, pp. 276–279. Gauss, C. F. (1809), Theory of the Motion of the Heavenly Bodies Moving about the Sun in Conic Sections, English translation by C. H. Davis, 1857, Little, Brown, and Company. Godard, D. N. (1974), “Channel equalization using a Kalman filter for fast data transmission,” IBM J. Res. Develop., vol. 18, pp. 267–273. Golub, G. H. (1965), “Numerical methods for solving linear least-squares problems,” Numer. Math., vol. 7, pp. 206–216. Gower, R. M. and P. Richtárik (2015), “Randomized iterative methods for linear systems,” SIAM J. Matrix Anal. Appl., vol. 36, no. 4, pp. 1660–1690. GSFC (2017), “Global mean sea level trend from integrated multi-mission ocean altimeters TOPEX/Poseidon, Jason-1, OSTM/Jason-2,” ver. 4.2 PO.DAAC, CA, USA. Dataset accessed March 18, 2019 at http://dx.doi.org/10.5067/GMSLM-TJ42. Gunasekar, S., J. Lee, D. Soudry, and N. Srebro (2018), “Characterizing implicit bias in terms of optimization geometry,” Proc. Int. Conf. Machine Learning (ICML), pp. 1832–1841, Stockholm. Gunasekar, S., B. Woodworth, S. Bhojanapalli, B. Neyshabur, and N. Srebro (2017), “Implicit regularization in matrix factorization,” Proc. Advances Neural Information Processing Systems (NIPS), pp. 6151–6159, Long Beach, CA. Gupta, A. K. and D. K. Nagar (2000), Matrix Variate Distributions, Chapman & Hall. Hall, T. (1970), Carl Friedrich Gauss: A Biography, MIT Press. Hastie, T., A. Montanari, S. Rosset, and R. Tibshirani (2019), “Surprises in highdimensional ridgeless least squares interpolation,” available at arXiv:1903.08560. Haykin, S. (1991), Adaptive Filter Theory, 2nd ed., Prentice Hall. Haykin, S., A. H. Sayed, J. Zeidler, P. Wei, and P. Yee (1997), “Adaptive tracking of linear time-variant systems by extended RLS algorithms,” IEEE Trans. Signal Process., vol. 45, no. 5, pp. 1118–1128. Higham, N. J. (1996), Accuracy and Stability of Numerical Algorithms, SIAM. Ho, Y. C. (1963), “On the stochastic approximation method and optimal filter theory,” J. Math. Anal. Appl., vol. 6, pp. 152–154. Householder, A. S. (1953), Principles of Numerical Analysis, McGraw-Hill. Hu, Y. F., Y. Koren, and C. 
Volinsky (2008), “Collaborative filtering for implicit feedback datasets,” Proc. IEEE Int. Conf. Data Mining (ICDM), pp. 263–272, Pisa. Indyk, P. and R. Motwani (1998), “Approximate nearest neighbors: Towards removing the curse of dimensionality,” Proc. ACM Symp. Theory of Computing, pp. 604–613, Dallas, TX.

References

2219

Jin, H. and G. Montúfar (2020), “Implicit bias of gradient descent for mean squared error regression with wide neural networks,” available at arXiv:2006.07356. Johnson, W. and J. Lindenstrauss (1984), “Extensions of Lipschitz maps into a Hilbert space,” Contemp. Math., vol. 26, pp. 189–206. Kahng, S. W. (1972), “Best Lp approximation,” Math. Comput., vol. 26, pp. 505–508. Kailath, T., A. H. Sayed, and B. Hassibi (2000), Linear Estimation, Prentice Hall. Lawson, C. L. and R. J. Hanson (1995), Solving Least-Squares Problems, SIAM. Legendre, A. M. (1805), Nouvelles Méthodes pour la Détermination des Orbites de Comètes, Courcier. Legendre, A. M. (1810), “Méthode de moindres quarres, pour trouver le milieu de plus probable entre les résultats des différentes observations,” Mem. Inst. France, pp. 149–154. Ljung, L. (1987), System Identification: Theory for the User, Prentice Hall. Mahoney, M. W. (2011), Randomized Algorithms for Matrices and Data, Foundations and Trends in Machine Learning, NOW Publishers, vol. 3, no. 2, pp. 123–224. McClave, J. T. and T. T. Sincich (2016), Statistics, 13th ed., Pearson. Mei, S. and A. Montanari (2020), “The generalization error of random features regression: Precise asymptotics and double descent curve,” available at arXiv:1908.05355. Mendenhall, W., R. J. Beaver, and B. M. Beaver (2012), Introduction to Probability and Statistics, 14th ed., Cenage Learning. Neyshabur, B., R. Tomioka, and N. Srebro (2015), “In search of the real inductive bias: On the role of implicit regularization in deep learning,” available at arxiv:1412.6614. Osborne, M. R. (1985), Finite Algorithms in Optimization and Data Analysis, Wiley. Pilanci, M. and M. J. Wainwright (2015), “Randomized sketches of convex programs with sharp guarantees,” IEEE Trans. Inf. Theory, vol. 61, no. 9, pp. 5096–5115. Pilaszy, I., D. Zibriczky, and D. Tikk (2010), “Fast ALS-based matrix factorization for explicit and implicit feedback datasets,” Proc. ACM Conf. Recommender Systems, pp. 71–78, Barcelona. Plackett, R. L. (1950), “Some theorems in least-squares,” Biometrika, vol. 37, no. 1–2, pp. 149–157. Plackett, R. L. (1972), “The discovery of the method of least-squares,” Biometrika, vol. 59, pp. 239–251. Rigollet, P. and J.-C. Huetter (2017), High Dimensional Statistics, MIT lecture notes, available at www-math.mit.edu/∼rigollet/PDFs/RigNotes17.pdf. Sarlós, T. (2006), “Improved approximation algorithms for large matrices via random projections,” Proc. IEEE Symp. Foundations of Computer Science (FOCS), pp. 143– 152, Berkeley, CA. Sayed, A. H. (2003), Fundamentals of Adaptive Filtering, Wiley. Sayed, A. H. (2008), Adaptive Filters, Wiley. Sayed, A. H. and T. Kailath (1994), “A state-space approach to adaptive RLS filtering,” IEEE Signal Process. Mag., vol. 11, no. 3, pp. 18–60. Söderström, T. (1994), Discrete-Time Stochastic Systems: Estimation and Control, Prentice Hall. Sorenson, H. W. (1966), “Kalman filtering techniques,” in Advances in Control Systems Theory and Applications, C. T. Leondes, editor, vol. 3, pp. 219–292, Academic Press. Soudry, D., E. Hoffer, M. S. Nacson, S. Gunasekar, and N. Srebro (2018), “The implicit bias of gradient descent on separable data,” J. Mach. Learn. Res., vol. 19, pp. 1–57. Stewart, G. W. (1995), Theory of the Combination of Observations Least Subject to Errors, SIAM. Translation of original works by C. F. Gauss under the title Theoria Combinationis Observationum Erroribus Minimis Obnoxiae. Stigler, S. M. (1981), “Gauss and the invention of least-squares,” Ann. 
Statist., vol. 9, no. 3, pp. 465–474. Strobach, P. (1990), Linear Prediction Theory, Springer. Udell, M., C. Horn, R. Zadeh, and S. Boyd (2016), “Generalized low rank models,” Found. Trends Mach. Learn., vol. 9, no. 1, pp. 1–118.

2220

Least-Squares Problems

Willsky, A. S. (1979), Digital Signal Processing and Control and Estimation Theory, MIT Press. Witte, R. S. and J. S. Witte (2013), Statistics, 10th ed., Wiley. Woodruff, D. P. (2014), Sketching as a Tool for Numerical Linear Algebra, Foundations and Trends in Theoretical Computer Science, NOW Publishers, vol. 10, no. 1–2, pp. 1–157. Zhou, Y., D. Wilkinson, R. Schreiber, and R. Pan (2008), “Large-scale parallel collaborative filtering for the Netflix prize,” in Algorithmic Aspects in Information and Management, R. Fleischer and J. Xu, editors, pp. 337–348, Springer.

51 Regularization

We discussed the least-squares problem in the previous chapter, which uses

a collection of data points {x(n), yn } to determine an optimal parameter w? by minimizing an empirical quadratic risk of the form: ( ) N −1 2 X ∆ 1 ? T w = argmin P (w) = x(n) N yn w (51.1a) N n=0 w∈IRM where each yn is M -dimensional and each x(n) is a scalar. The solution is determined by solving the normal equations: H T Hw? = H T d

(normal equations)

(51.1b)

where the quantities d ∈ IRN ×1 and H ∈ IRN ×M collect the data: 

   H =    ∆

y0T y1T y2T .. . T yN −1



   ,  



   d =    ∆

x(0) x(1) x(2) .. . x(N N 1)

      

(51.1c)

The normal equations (51.1b) may have a unique solution or infinitely many solutions. They may also be ill-conditioned, meaning that slight perturbations to the data {d, H} can lead to large changes in the solution w? ; this usually occurs when the matrix H is ill-conditioned. In this chapter, we will use the least-squares formulation as a guiding example to illustrate three types of challenges that arise in data-driven learning methods pertaining to (a) nonuniqueness of solutions, (b) ill-conditioning, and (c) the undesirable possibility of over-fitting. We will then explain that regularization is a useful tool to alleviate these challenges. We will also explain how regularization enables the designer to promote preference for certain solutions such as favoring solutions with small norms or sparse structure. We will motivate the main ideas by using the least-squares formulation as a guide due to its mathematical tractability. Subsequently, we will extend the discussion to more general empirical risks, other than least-squares, which will arise in later chapters when we deal with logistic regression, support vector machines, kernel machines, neural networks, and other learning methods.

2222

Regularization

51.1

THREE CHALLENGES In learning problems, we make a distinction between training data and test data. The data {x(n), yn } used to solve the least-squares problem (51.1a) is referred to as training data. Once a solution w? is determined, the value of the risk function at the solution is called the training error: ∆

training error =

N −1 N −1 1 X 1 X (x(n) N ynT w? )2 = (x(n) N x b(n))2 N n=0 N n=0

(51.2)

where x b(n) = ynT w? denotes the prediction for x(n). In this way, the training error is measuring how well the least-squares solution performs on the training data. In general, the training error will be small because the solution w? is purposefully determined to minimize it. In most learning applications, however, the main purpose for learning w? is to employ it to perform prediction on future data that were not part of the training phase. For this reason, it is customary to assess performance on a separate collection of T test data points denoted by {x(t), yt }, and which are assumed to arise from the same underlying distribution fx,y (x, y) as the training data. The corresponding testing error is defined by ∆

testing error =

T −1 T −1 1 X 1 X (x(t) N ytT w? )2 = (x(t) N x b(t))2 T t=0 T t=0

(51.3)

where x b(t) = ytT w? denotes the prediction for x(t). In general, the testing error will be larger than the training error but we desire the gap between them to be small. Learning algorithms that lead to small error gaps are said to generalize well, namely, they are able to extend their good performance on training data to the test data as well. We will discuss generalization and training and testing errors in greater detail in future chapters, especially in the context of classification problems. Here, we are using the least-squares problem to motivate the concepts.

Difficulties We already know that the normal equations (51.1b) are consistent, meaning that a solution w? always exists. The solution is either unique when H has full column rank, in which case it is given by w? = (H T H)−1 H T d

(H has full column rank)

(51.4)

or there are infinitely many solutions differing by vectors in N(H). Some challenges arise in both scenarios, which lead to complications when solving inference problems: (a) (Non-uniqueness). When infinitely many solutions exist, the training error will not change regardless of which solution we pick. This is because all valid

51.1 Three Challenges

2223

solutions w? differ by vectors in the null space of H and, therefore, if w1? and w2? are two valid solutions then w2? = w1? + p, for some p ∈ N(H)

(51.5)

In this case, the predictions x b(n) for the training signals will remain unchanged under w1? or w2? since Hp = 0 and, hence, ynT p = 0 for any of the observation vectors in the training set so that x b(n) = ynT w2? = ynT w1?

(51.6)

It follows that the training error remains invariant. However, the testing error will be sensitive to which solution we select because the test observations {yt } need not be orthogonal anymore to the nullspace of H. We explain in the sequel that `2 -regularization forces a unique solution w? and removes this ambiguity. (b) (Overfitting) Infinitely many solutions w? can exist even when N ≥ M , i.e., even when we have more observations than unknown entries. This occurs when the columns of H are linearly dependent and gives rise to a second challenge. Recall that the least-squares problem is approximating d by db = Hw? . When the columns of H are linearly dependent, some of its columns 0 can be removed to obtain a full-rank lower-dimensional matrix, H 0 ∈ IRN ×M with M 0 < M . This new matrix spans the same column space as H: R(H 0 ) = R(H)

(51.7)

We can then solve an equivalent least-squares problem involving {d, H 0 } instead of {d, H} to obtain the same projection db by using a smaller-size solution (w0 )? of dimension M 0 . We thus see that the rank-deficiency of H amounts to using a more complex model w (i.e., of higher dimensions) than is necessary to approximate d. This issue is a manifestation of the problem of overfitting, which we will discuss in greater detail in later chapters. Overfitting amounts to using more complex models than necessary and it also degrades performance on test data. Rank-deficiency of H also arises when N < M (i.e., when H has more columns than rows). One way to deal with this problem is to collect more data (i.e., to use a larger N ). A second way is to perform dimensionality reduction and reduce the size of the observation vectors. We will discuss techniques for dimensionality reduction in later chapters, including the principal component analysis (PCA) method and the Fisher discriminant analysis (FDA) method. A third way is to employ regularization. For example, we will explain further ahead that `1 -regularization automatically selects a subset of the columns of H to compute w? . (c) (Ill-conditioning) Difficulties can arise even when the normal equations have a unique solution w? but the data matrix H is ill-conditioned (i.e., has a large

2224

Regularization

condition number). In this case, small changes in the data {d, H} can lead to large changes in the solution w? and affect the inference conclusion and testing error – see Prob. 51.2 for a numerical example. One leading cause for ill-conditioning is when the entries within the observation vectors are not normalized properly so that some entries are disproportionately larger by some orders of magnitude than other entries. Such large discrepancies can distort the operation of a learning algorithm, including the least-squares solution, by giving more relevance or attention to larger entries in the observation vector over other entries. One way to deal with ill-conditioning is therefore to scale the observation vectors so that their entries assume values within some uniform range. The next example explains how scaling can be performed. A second way is to employ regularization. In particular, we will see that `2 -regularization reduces the effect of ill-conditioning.

Example 51.1 (Normalization of observation vectors) It is common practice to center the training data around their sample means, as was already suggested by the discussion in Section 29.2. We can take this step further and normalize the entries of the observation vectors to have unit-variance as well. Specifically, the first step is to compute the sample mean vector: ∆

y¯ =

N −1 1 X yn N n=0

(51.8a)

and to use it to center all observation vectors by replacing them by ∆

yn,c = yn − y¯

(51.8b)

where, for clarity, we are adding the subscript “c” to refer to centered variables. If we denote the individual entries of {¯ y , yn } by {¯ y (m), yn (m), m = 1, 2, . . . , M }, then centering amounts to replacing the individual entries by ∆

yn,c (m) = yn (m) − y¯(m)

(51.8c)

The second step in the normalization process is to evaluate the (unbiased) sample variance for each of these centered entries, namely, ∆

2 σ bm =

N −1 X 1 2 yn,c (m), m = 1, 2, . . . , M N − 1 n=0

(51.9a)

and to scale yn,c (m) by the corresponding standard deviation to get ∆

yn,p (m) = yn,c (m)/b σm , m = 1, 2, . . . , M

(51.9b)

where we are now using the subscript “p.” In this way, we start from an observation vector yn and replace it by the normalized vector yn,p , where all entries of yn,p are centered with zero mean and unit variance: remove sample mean

normalize variance

{yn } −−−−−−−−−−−→ {yn,c } −−−−−−−−−−−→ {yn,p }

(51.10)

51.2 `2 -Regularization

2225

A second method to normalize the observation vectors {yn } is as follows. We first identify the smallest and largest entry values within the given dataset: ∆

ymin = min yn (m)

(51.11a)

n,m



ymax = max yn (m)

(51.11b)

∆ = ymax − ymin

(51.11c)

n,m

and then scale all entries in the following manner, for each n and m: ∆

yn,s (m) =

yn (m) − ymin ∆

(51.12)

In this way, each scaled entry yn,s (m) will assume values within the range [0, 1]. We can subsequently center the means of these entries at zero by computing ∆

yn,p = yn,s − y¯n,s ,

where y¯n,s =

N −1 1 X yn,s N n=0

(51.13)

Here again, we start from a given observation vector yn and replace it by yn,p , where all entries lie within the range [−1, 1]: normalize range

remove sample mean

{yn } −−−−−−−−−−−→ {yn,s } −−−−−−−−−−−→ {yn,p }

(51.14)

Regardless of which normalization procedure is used, we will assume that the given observation vectors {yn } have already gone through this process and will continue to use the notation yn rather than switch to yn,p for simplicity.

51.2

`2 -REGULARIZATION One useful technique to avoid the challenges of nonuniqueness of solutions, overfitting, and ill-conditioning is to employ regularization (also called shrinkage in the statistics literature). The technique penalizes some norm of the parameter w in order to favor solutions with desirable properties based on some prior knowledge (such as sparse solutions or solutions with small Euclidean norm). We say that regularization incorporates a form of inductive bias in that it biases the solution away from the unregularized case by incorporating some prior information. This is attained by adding an explicit convex penalty term to the original risk function such as  ρkwk2 (`2 -regularization)    αkwk1 (`1 -regularization) q(w) = (51.15) 2  αkwk1 + ρkwk (elastic-net regularization)   βkwk0 (`0 -regularization)

where (α, β, ρ) are nonnegative parameters, and where kwk0 is a pseudo-norm that counts the number of nonzero elements in w. We will focus on the first

2226

Regularization

three choices due to their mathematical tractability. One can also consider other vector norms, such as the pth norm, kwkp for p < 1 or p = ∞. Regularization will generally have a limited effect on the training error of an algorithm, but will improve the generalization ability of the algorithm by improving its performance on test data for the reasons explained in the sequel. We consider first the case of `2 -regularization, also called ridge regression, where the penalty term is quadratic in w.

51.2.1 Ridge Regression

In ridge regression, we replace the empirical risk (51.1a) by the regularized version:

    w^\star_{\rm reg} \;\stackrel{\Delta}{=}\; \underset{w\in{\rm I\!R}^M}{\rm argmin}\; \Big\{ P_{\rm reg}(w) \stackrel{\Delta}{=} \rho\|w\|^2 + \frac{1}{N}\sum_{n=0}^{N-1} \big(x(n) - y_n^{\sf T} w\big)^2 \Big\}                    (51.16)

where ρ > 0 is the regularization factor; its value may or may not depend on N. In general, the value of ρ is independent of N.

Observe that, for the purposes of this chapter, we are adding a subscript "reg" to (w^⋆_reg, P_reg(w)) to distinguish them from the unregularized versions (w^⋆, P(w)). This is because we will be comparing both risks and their minimizers throughout this chapter. In future chapters, however, where we will be working almost exclusively with regularized risks, we will revert to the traditional notation (w^⋆, P(w)) without the "reg" subscript for simplicity.

Before explaining how ridge regression addresses the aforementioned challenges, we revisit Example 50.1 and show how the regularized empirical risk (51.16) can be motivated as the solution to a maximum a-posteriori (MAP) inference problem.

Example 51.2 (Interpretation in terms of a Gaussian prior on the model) Assume we collect N iid observations {x(n), y_n}, for 0 ≤ n ≤ N − 1. Assume also that these observations satisfy the same linear model (50.20), namely,

    x(n) = y_n^{\sf T} w + v(n)                    (51.17)

for some unknown w ∈ IR^M, and where v(n) is a white Gaussian noise process with zero mean and variance σ_v². In the earlier Example 50.1, the model w was treated as an unknown constant and a maximum-likelihood formulation was used to estimate it, thus leading to the standard least-squares problem. Here, we will instead model w as a realization of some random variable w that is Gaussian-distributed with zero mean and covariance matrix R_w = σ_w² I_M, i.e.,

    f_{\boldsymbol{w}}(w) = \frac{1}{\sqrt{(2\pi\sigma_w^2)^M}} \exp\Big\{ -\frac{1}{2\sigma_w^2}\|w\|^2 \Big\}                    (51.18)

Once w is selected from this distribution, then all observations {x(n)} are generated by this same w from knowledge of {y_n}. We are again interested in estimating w. Using the Bayes rule (3.39), we assess the conditional probability distribution of the model given the observations as follows:

    f_{\boldsymbol{w}|\boldsymbol{x},\boldsymbol{y}}\big(w \,|\, \{x(n), y_n\}\big)
      \;\propto\; f_{\boldsymbol{x},\boldsymbol{y}|\boldsymbol{w}}\big(\{x(n), y_n\} \,|\, w\big)\, f_{\boldsymbol{w}}(w)
      \;=\; \Big\{ \prod_{n=0}^{N-1} f_{\boldsymbol{v}}\big(x(n) - y_n^{\sf T} w\big) \Big\}\, f_{\boldsymbol{w}}(w)
      \;\propto\; \Big\{ \prod_{n=0}^{N-1} \exp\Big\{ -\frac{1}{2\sigma_v^2}\big(x(n) - y_n^{\sf T} w\big)^2 \Big\} \Big\} \times \exp\Big\{ -\frac{1}{2\sigma_w^2}\|w\|^2 \Big\}
      \;=\; \exp\Big\{ -\frac{1}{2\sigma_w^2}\|w\|^2 - \frac{1}{2\sigma_v^2}\sum_{n=0}^{N-1}\big(x(n) - y_n^{\sf T} w\big)^2 \Big\}                    (51.19)

where, in the first and third lines, the equality sign is replaced by a proportionality sign since constant scaling factors are dropped. Consequently, we can now formulate a MAP estimation problem to recover w, which amounts to seeking the value of w that maximizes the above conditional density function:

    w^\star_{\rm reg} \;\stackrel{\Delta}{=}\; \underset{w\in{\rm I\!R}^M}{\rm argmax}\; f_{\boldsymbol{w}|\boldsymbol{x},\boldsymbol{y}}\big(w \,|\, \{x(n), y_n\}\big)
      \;=\; \underset{w\in{\rm I\!R}^M}{\rm argmin}\; \Big\{ \frac{1}{2\sigma_w^2}\|w\|^2 + \frac{1}{2\sigma_v^2}\sum_{n=0}^{N-1}\big(x(n) - y_n^{\sf T} w\big)^2 \Big\}
      \;=\; \underset{w\in{\rm I\!R}^M}{\rm argmin}\; \Big\{ \frac{2\sigma_v^2}{2N\sigma_w^2}\|w\|^2 + \frac{1}{N}\sum_{n=0}^{N-1}\big(x(n) - y_n^{\sf T} w\big)^2 \Big\}
      \;=\; \underset{w\in{\rm I\!R}^M}{\rm argmin}\; \Big\{ \rho\|w\|^2 + \frac{1}{N}\sum_{n=0}^{N-1}\big(x(n) - y_n^{\sf T} w\big)^2 \Big\}                    (51.20)

where we introduced ρ = σ_v²/(Nσ_w²). We therefore recover the regularized empirical risk (51.16). This argument shows that ℓ2-regularization helps ensure that the solution w^⋆_reg is consistent with a prior Gaussian model on the distribution of w.

We now explain how ridge regression promotes solutions with smaller Euclidean norm and alleviates the challenges of ill-conditioning, overfitting, and nonuniqueness of solutions.

Resolving nonuniqueness
Differentiating P_reg(w) in (51.16) with respect to w, we find that the solution is unique and given by

    w^\star_{\rm reg} = (\rho N I_M + H^{\sf T} H)^{-1} H^{\sf T} d                    (51.21)

where the matrix ρN I_M + H^T H is always invertible because the added term ρN I_M > 0 is positive-definite, regardless of whether H is rank-deficient or not.
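As a quick illustration, the closed form (51.21) can be evaluated directly; the sketch below is a minimal NumPy version with illustrative variable names.

    # Minimal sketch of the ridge solution (51.21).
    import numpy as np

    def ridge_solution(H, d, rho):
        """H is N x M, d is length-N, rho > 0."""
        N, M = H.shape
        # rho*N*I + H^T H is positive-definite even when H is rank-deficient
        return np.linalg.solve(rho * N * np.eye(M) + H.T @ H, H.T @ d)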

Promoting smaller solutions
It is seen from the regularized risk (51.16) that larger values for ρ favor solutions w^⋆_reg with smaller Euclidean norm than would result when ρ = 0. This is because the objective is to minimize the aggregate risk, and the first term is influenced by ρ‖w‖². This property can be established more formally as follows (see Prob. 51.3 for an alternative argument). Using the unregularized risk P(w), and since w^⋆_reg minimizes the regularized risk, we have

    \rho\|w^\star_{\rm reg}\|^2 + P(w^\star_{\rm reg}) \;\le\; \rho\|w^\star\|^2 + P(w^\star)
    \;\Longrightarrow\; \rho\|w^\star_{\rm reg}\|^2 - \rho\|w^\star\|^2 \;\le\; P(w^\star) - P(w^\star_{\rm reg})
    \;\overset{(a)}{\Longrightarrow}\; \rho\|w^\star_{\rm reg}\|^2 - \rho\|w^\star\|^2 \;\le\; 0                    (51.22)

where step (a) is because w^⋆ minimizes the unregularized risk, P(w). It follows that ‖w^⋆_reg‖² ≤ ‖w^⋆‖². Actually, strict inequality holds because P(w^⋆) is strictly smaller than P(w^⋆_reg). Otherwise, if P(w) assumed the same value at both (w^⋆, w^⋆_reg), then w^⋆_reg would be a minimizer for P(w) as well. In that case, both (w^⋆, w^⋆_reg) would satisfy the same normal equations, namely,

    H^{\sf T} H w^\star = H^{\sf T} d, \qquad H^{\sf T} H w^\star_{\rm reg} = H^{\sf T} d                    (51.23)

But since w^⋆_reg satisfies (51.21), i.e., (ρN I_M + H^T H) w^⋆_reg = H^T d, we would conclude that w^⋆_reg = 0. This is not possible unless H^T d = 0. Absent this condition, we conclude that

    \|w^\star_{\rm reg}\|^2 \;<\; \|w^\star\|^2                    (51.24)

This proves that the norm of the regularized solution, w^⋆_reg, shrinks in comparison to the norm of the original solution, w^⋆. This property is referred to as shrinkage. We will encounter it in other regularization formulations as well.

Countering ill-conditioning
Regularization also counters the effect of ill-conditioning, i.e., the sensitivity of the solution w^⋆ to small variations in the data {x(n), y_n}. Note that the condition number of the new coefficient matrix is given by

    \kappa(\rho N I_M + H^{\sf T} H) \;\stackrel{\Delta}{=}\; \frac{\rho N + \lambda_{\max}(H^{\sf T} H)}{\rho N + \lambda_{\min}(H^{\sf T} H)} \;=\; \frac{\rho N + \sigma^2_{\max}(H)}{\rho N + \sigma^2_{\min}(H)}                    (51.25)

in terms of the largest and smallest singular values of H. If the value of ρN is large enough in comparison to the singular-value spread of H, then the ratio on the right-hand side approaches 1 and the matrix ρN I_M + H^T H becomes well conditioned.
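The effect described by (51.25) is easy to check numerically. The short sketch below, with hypothetical values for (N, M, ρ) and a deliberately near-collinear column, is meant only as an illustration.

    # Numerical check of (51.25): regularization improves conditioning.
    import numpy as np

    N, M, rho = 200, 10, 0.1
    H = np.random.randn(N, M)
    H[:, 1] = H[:, 0] + 1e-6 * np.random.randn(N)   # nearly collinear columns

    A = H.T @ H
    A_reg = rho * N * np.eye(M) + H.T @ H

    print(np.linalg.cond(A))       # very large: ill-conditioned
    print(np.linalg.cond(A_reg))   # close to 1 when rho*N dominates the spread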

Countering overfitting
By promoting solutions w^⋆_reg with smaller Euclidean norm, regularization helps alleviate the danger of overfitting because it searches for the solution over a reduced region in space. This can be shown more formally by verifying that minimizing a regularized least-squares problem of the form (51.16) is equivalent to solving a constrained optimization problem of the following form:

    w^\star_{\rm reg} \;\stackrel{\Delta}{=}\; \underset{w\in{\rm I\!R}^M}{\rm argmin}\; \frac{1}{N}\sum_{n=0}^{N-1}\big(x(n) - y_n^{\sf T} w\big)^2, \qquad \text{subject to } \rho\|w\|^2 \le \tau                    (51.26)


for some τ > 0. The equivalence between problems (51.16) and (51.26) is established algebraically in Appendix 51.A by using the Lagrange and KKT multiplier arguments from Section 9.1. This equivalent characterization shows that regularization reduces the search space for w to the spherical region kwk2 ≤ τ /ρ instead of searching over the entire space w ∈ IRM . Some care is needed in selecting ρ (or τ ): large values for ρ (or small τ ) can have the opposite effect and constrain the search region excessively, thus leading to the possibility of underfitting (i.e., to the use of simpler models than is actually necessary to fit the data well). These remarks show that there is a compromise in setting the value of ρ: Small ρ does not perform effective regularization and large ρ can cause underfitting.

Biased risk values
Although regularization is effective in countering ill-conditioning and overfitting, there is a price to pay. This is because regularization biases the least attainable risk (i.e., the training error), which becomes larger than in the unregularized case. To see this, consider again the solutions w^⋆ and w^⋆_reg to the standard and regularized least-squares problems. Evaluating the risk functions at the respective minimizers and subtracting them we get, after some algebra (see Prob. 51.4):

    P_{\rm reg}(w^\star_{\rm reg}) - P(w^\star) \;=\; \rho\,(w^\star)^{\sf T} w^\star_{\rm reg} \;>\; 0                    (51.27)

from which we conclude that P_reg(w^⋆_reg) > P(w^⋆), and that the bias increases with ρ.

Example 51.3 (QR solution method) Determination of the ℓ2-regularized solution (51.21) requires that we compute the matrix product H^T H and invert the matrix ρN I_M + H^T H. We explained earlier in Prob. 50.5 that squaring matrix entries through the product H^T H can lead to a loss in numerical precision for small entries; it can also lead to overflow for large entries. A more stable numerical procedure for determining w^⋆_reg can be motivated by using the QR decomposition. We construct the extended quantities:

    H^e \;\stackrel{\Delta}{=}\; \begin{bmatrix} H \\ \sqrt{\rho N}\, I_M \end{bmatrix}, \;\text{of size } (N+M)\times M                    (51.28)

    d^e \;\stackrel{\Delta}{=}\; \begin{bmatrix} d \\ 0_{M\times 1} \end{bmatrix}, \;\text{of size } (N+M)\times 1                    (51.29)

and introduce the QR decomposition:

    H^e = Q \begin{bmatrix} R \\ 0 \end{bmatrix}                    (51.30)

where

    R: (M\times M), \qquad Q: (N+M)\times(N+M), \qquad QQ^{\sf T} = Q^{\sf T} Q = I                    (51.31)

and R is upper-triangular. We apply the orthogonal transformation Q^T to d^e and denote the resulting entries by

    Q^{\sf T} d^e = \begin{bmatrix} \bar{d} \\ \times \end{bmatrix}, \qquad \bar{d}: (M\times 1)                    (51.32)


where × refers to irrelevant entries. Then, note from (51.28) that

    (H^e)^{\sf T} H^e = \rho N I_M + H^{\sf T} H                    (51.33)

while from (51.30)

    (H^e)^{\sf T} H^e = \begin{bmatrix} R^{\sf T} & 0 \end{bmatrix} Q^{\sf T} Q \begin{bmatrix} R \\ 0 \end{bmatrix} = R^{\sf T} R                    (51.34)

It then follows that

    w^\star_{\rm reg} = (\rho N I_M + H^{\sf T} H)^{-1} H^{\sf T} d
                      = \big( (H^e)^{\sf T} H^e \big)^{-1} (H^e)^{\sf T} d^e
                      = R^{-1}(R^{\sf T})^{-1} (H^e)^{\sf T} d^e
                      \overset{(51.30)}{=} R^{-1}(R^{\sf T})^{-1} \begin{bmatrix} R^{\sf T} & 0 \end{bmatrix} Q^{\sf T} d^e
                      \overset{(51.32)}{=} R^{-1} \begin{bmatrix} I_M & 0 \end{bmatrix} \begin{bmatrix} \bar{d} \\ \times \end{bmatrix}                    (51.35)

so that

    w^\star_{\rm reg} = R^{-1}\bar{d}                    (51.36)

We therefore arrive at the QR procedure listed in (51.37) for determining the ℓ2-regularized solution, which involves solving a triangular system of equations.

QR method for minimizing ℓ2-regularized least-squares risk (51.16).
    given ρ > 0 and data d = col{x(n)}, H = blkrow{y_n^T};
    construct H^e = col{H, √(ρN) I_M} and d^e = col{d, 0_M};
    perform the QR decomposition H^e = Q col{R, 0};
    apply Q^T to d^e and find Q^T d^e = col{d̄, ×};
    solve the triangular system of equations R w^⋆_reg = d̄.                    (51.37)
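A minimal numerical sketch of procedure (51.37) is shown below; it relies on NumPy's reduced QR factorization, and the function name and defaults are illustrative.

    # Sketch of the QR procedure (51.37): avoids forming H^T H explicitly.
    import numpy as np

    def ridge_solution_qr(H, d, rho):
        N, M = H.shape
        He = np.vstack([H, np.sqrt(rho * N) * np.eye(M)])   # (N+M) x M, cf. (51.28)
        de = np.concatenate([d, np.zeros(M)])                # (N+M) x 1, cf. (51.29)
        Q, R = np.linalg.qr(He)        # reduced QR: Q is (N+M) x M, R is M x M
        d_bar = Q.T @ de               # the top M entries of the transformation (51.32)
        return np.linalg.solve(R, d_bar)   # R is triangular; a triangular solver could also be used

Applied to the same data, this routine returns (up to rounding) the same vector as the closed form (51.21), but without squaring the entries of H.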

51.3 ℓ1-REGULARIZATION

In ℓ1-regularization, we replace the empirical risk (51.1a) by the regularized version:

    w^\star_{\rm reg} \;\stackrel{\Delta}{=}\; \underset{w\in{\rm I\!R}^M}{\rm argmin}\; \Big\{ P_{\rm reg}(w) \stackrel{\Delta}{=} \alpha\|w\|_1 + \frac{1}{N}\sum_{n=0}^{N-1}\big(x(n) - y_n^{\sf T} w\big)^2 \Big\}                    (51.38)

in terms of the ℓ1-norm of w (i.e., the sum of the absolute values of its entries), and where α > 0 is the regularization factor; its value may or may not depend on N.


In general, the value of α is independent of N. The variant with elastic-net regularization solves instead

    w^\star_{\rm reg} \;\stackrel{\Delta}{=}\; \underset{w\in{\rm I\!R}^M}{\rm argmin}\; \Big\{ P_{\rm reg}(w) \stackrel{\Delta}{=} \alpha\|w\|_1 + \rho\|w\|^2 + \frac{1}{N}\sum_{n=0}^{N-1}\big(x(n) - y_n^{\sf T} w\big)^2 \Big\}                    (51.39)

with both α > 0 and ρ > 0. We will discover in this section that ℓ1-regularization leads to a sparse solution w^⋆_reg, i.e., to a solution with a few nonzero entries. In this way, for any observation vector y, the inner product calculation x̂ = y^T w^⋆_reg ends up using only a few select entries from y due to the sparsity of w^⋆_reg. This means that ℓ1-regularization performs a form of "dimensionality reduction." In particular, when some entries in y are correlated or redundant, the ℓ1-solution will rely on one of them and ignore the others. Elastic-net regularization, on the other hand, inherits useful features from both ℓ2- and ℓ1-regularization. For example, it can handle situations involving more unknowns than measurements (M > N), and it also performs entry selection, albeit in a less dramatic fashion than ℓ1-regularization. The following derivation extends Example 51.2 and provides a similar MAP interpretation for the ℓ1-regularized empirical risk function (51.38).

Example 51.4 (Interpretation in terms of a Laplacian prior on the model) We collect N iid observations {x(n), y_n}, for 0 ≤ n ≤ N − 1, and assume that they satisfy the same linear model (51.17). The main difference is that we now assume that w is a realization of a random vector w whose entries {w_m} are independent of each other and arise from a Laplace distribution with zero mean and variance σ_w²:

    f_{\boldsymbol{w}_m}(w_m) = \frac{1}{\sqrt{2}\,\sigma_w} \exp\Big\{ -\sqrt{2}\,|w_m|/\sigma_w \Big\}                    (51.40)

We also assume that all observations {x(n)} are generated by the same realization w. We are again interested in estimating w. Using the Bayes rule (3.39), we assess the conditional probability distribution of the model given the observations as follows:

    f_{\boldsymbol{w}|\boldsymbol{x},\boldsymbol{y}}\big(w \,|\, \{x(n), y_n\}\big)
      \;\propto\; f_{\boldsymbol{x},\boldsymbol{y}|\boldsymbol{w}}\big(\{x(n), y_n\} \,|\, w\big)\, f_{\boldsymbol{w}}(w)
      \;=\; \Big\{ \prod_{n=0}^{N-1} f_{\boldsymbol{v}}\big(x(n) - y_n^{\sf T} w\big) \Big\}\, f_{\boldsymbol{w}}(w)
      \;\propto\; \Big\{ \prod_{n=0}^{N-1} \exp\Big\{ -\frac{1}{2\sigma_v^2}\big(x(n) - y_n^{\sf T} w\big)^2 \Big\} \Big\} \times \prod_{m=1}^{M} \exp\Big\{ -\sqrt{2}\,|w_m|/\sigma_w \Big\}
      \;=\; \exp\Big\{ -\frac{\sqrt{2}}{\sigma_w}\|w\|_1 - \frac{1}{2\sigma_v^2}\sum_{n=0}^{N-1}\big(x(n) - y_n^{\sf T} w\big)^2 \Big\}                    (51.41)

where, in the first and third lines, the equality sign is replaced by a proportionality sign since constant scaling factors are dropped. Consequently, we can now formulate a MAP estimation problem to recover w by maximizing the above conditional density function as follows:


    w^\star_{\rm reg} \;\stackrel{\Delta}{=}\; \underset{w\in{\rm I\!R}^M}{\rm argmax}\; f_{\boldsymbol{w}|\boldsymbol{x},\boldsymbol{y}}\big(w \,|\, \{x(n), y_n\}\big)
      \;=\; \underset{w\in{\rm I\!R}^M}{\rm argmin}\; \Big\{ \frac{\sqrt{2}}{\sigma_w}\|w\|_1 + \frac{1}{2\sigma_v^2}\sum_{n=0}^{N-1}\big(x(n) - y_n^{\sf T} w\big)^2 \Big\}
      \;=\; \underset{w\in{\rm I\!R}^M}{\rm argmin}\; \Big\{ \frac{2\sqrt{2}\,\sigma_v^2}{N\sigma_w}\|w\|_1 + \frac{1}{N}\sum_{n=0}^{N-1}\big(x(n) - y_n^{\sf T} w\big)^2 \Big\}
      \;=\; \underset{w\in{\rm I\!R}^M}{\rm argmin}\; \Big\{ \alpha\|w\|_1 + \frac{1}{N}\sum_{n=0}^{N-1}\big(x(n) - y_n^{\sf T} w\big)^2 \Big\}                    (51.42)

where we introduced α = 2√2 σ_v²/(Nσ_w). We therefore recover the regularized empirical risk (51.38) with q(w) = α‖w‖₁. This argument shows that ℓ1-regularization helps ensure that the solution w^⋆_reg is consistent with a prior Laplacian model on the distribution of w.

We now explain how ℓ1-regularization (or its extension in terms of elastic-net regularization) promotes solutions with smaller norm and alleviates the challenges of ill-conditioning, overfitting, and nonuniqueness of solutions.

Resolving nonuniqueness
The penalty term α‖w‖₁ is only convex. The regularized risk function will have a unique minimizer w^⋆_reg if the unregularized risk P(w) happens to be strictly or strongly convex. For the least-squares case, the unregularized risk is strongly convex when H^T H > 0. More generally, if this condition does not hold, then elastic-net regularization can be used and it will ensure a unique minimizer w^⋆_reg because the resulting regularized risk in that case will become strongly convex regardless of whether H is rank-deficient or not.

Promoting smaller solutions
It is seen from the regularized risks in (51.38)-(51.39) that larger values for α or ρ favor solutions w^⋆_reg with smaller norms than would result when α = ρ = 0. This is because the objective is to minimize the aggregate risk, and the regularization terms α‖w‖₁ and ρ‖w‖² penalize large w. This conclusion can be established more formally. If we set q(w) = α‖w‖₁ for ℓ1-regularization or q(w) = α‖w‖₁ + ρ‖w‖² for elastic-net regularization, then it follows from the general result in Appendix 51.A that the following shrinkage property holds:

    q(w^\star_{\rm reg}) \;\le\; q(w^\star)                    (51.43)

The result in the appendix holds for more general convex risks, P(w), and is not limited to least-squares risks. It also holds for general convex regularization factors, q(w), and not just ℓ1- or elastic-net regularization. In other words, result (51.43) extends (51.24) to general convex risks and penalty terms.


Countering overfitting
Both ℓ1- and elastic-net regularization help alleviate the danger of overfitting because they too can be shown to search for their solutions over reduced regions in space. This can be established more formally by verifying that minimizing a regularized least-squares problem of either form (51.38)-(51.39) is equivalent to solving a constrained optimization problem of the following form:

    w^\star_{\rm reg} \;\stackrel{\Delta}{=}\; \underset{w\in{\rm I\!R}^M}{\rm argmin}\; \frac{1}{N}\sum_{n=0}^{N-1}\big(x(n) - y_n^{\sf T} w\big)^2, \qquad \text{subject to } q(w) \le \tau                    (51.44)

for some τ > 0 and using the appropriate regularization factor: q(w) = α‖w‖₁ for ℓ1-regularization and q(w) = α‖w‖₁ + ρ‖w‖² for elastic-net regularization. The equivalence between problems (51.38)-(51.39) and (51.44) is again established algebraically in Appendix 51.A by using the KKT multiplier arguments from Section 9.1. Property (51.44) provides some intuition on how the choice of the penalty factor q(w) defines the solution space. Figure 51.1 plots three contour curves in two-dimensional space corresponding to the level sets:

    \|w\|^2 = 1, \qquad \|w\|_1 = 1, \qquad \|w\|_1 + \|w\|^2 = 1                    (51.45)

It is seen from the figure that for ℓ2-regularization, the search space for w is limited to a region delineated by a circular boundary. In comparison, the search space for ℓ1-regularization is delineated by a rotated square boundary with sharp edges, while the search space for elastic-net regularization is midway between these two options. All three regions are obviously convex.

Figure 51.1 The figure illustrates the boundary curves corresponding to conditions (51.45) in two-dimensional space. The search space for the parameter, w, is limited to the inside of the regions delineated by these curves. Observe that in all three cases, the search domain is convex.

The particular shape of the boundary of the ℓ1-region helps promote sparsity, i.e., it helps lead to solutions w^⋆_reg with many zero entries. This is illustrated schematically in Fig. 51.2, which shows boundary curves corresponding to the regions ‖w‖₁ ≤ τ and ‖w‖² ≤ τ, along with contour curves for the unregularized risk function, P(w). The solution w^⋆_reg occurs at the location where the contour


curves meet the boundary regions. It is seen, due to the corners that are present in the region ‖w‖₁ ≤ τ, that the contour curves are more likely to touch this region at a corner point where some of the coordinates are zero. We will establish this conclusion more formally in the next section.

Figure 51.2 Boundary curves corresponding to the regions ‖w‖₁ ≤ τ and ‖w‖² ≤ τ, along with contour curves for the unregularized risk function, P(w).

51.4 SOFT THRESHOLDING

We are ready to examine the ability of ℓ1-regularization to find sparse solution vectors, w^⋆_reg. A sparse solution helps avoid overfitting, especially for large-dimensional data (i.e., when M is large). This is because, when each observation vector y_t has many entries, a sparse w^⋆_reg assigns zero weights to those entries of y_t that are deemed "irrelevant." For this reason, we say that ℓ1-regularization embodies an automatic selection capability into the solution by picking only entries from y_t that are most significant to the task of inferring x(t).

For the benefit of the reader, we first review a useful result established earlier in Section 11.1.2, which relies on the soft-thresholding function ŵ = T_{β/2}(z). This function operates on the individual entries of its vector argument z to generate the corresponding entries of ŵ. For each scalar z, the transformation T_{β/2}(z), with parameter β ≥ 0, is defined as follows:

    T_{\beta/2}(z) \;\stackrel{\Delta}{=}\; \begin{cases} z - \frac{\beta}{2}, & \text{if } z \ge \frac{\beta}{2} \\ 0, & \text{if } -\frac{\beta}{2} < z < \frac{\beta}{2} \\ z + \frac{\beta}{2}, & \text{if } z \le -\frac{\beta}{2} \end{cases}                    (51.46)


Lemma 51.1. (Soft-thresholding operation) Given z ∈ IR^M, a constant β ≥ 0, and a scalar φ, the solution to the optimization problem

    \widehat{w} \;\stackrel{\Delta}{=}\; \underset{w\in{\rm I\!R}^M}{\rm argmin}\; \Big\{ \beta\|w\|_1 + \|w - z\|^2 + \phi \Big\}                    (51.47)

is unique and given by

    \widehat{w} = T_{\beta/2}(z)                    (51.48)

The soft-thresholding transformation T_{β/2}(z) helps promote sparse solutions ŵ (i.e., solutions with a few nonzero entries). This property is achieved in a measured manner since soft-thresholding sets to zero all entries of z whose magnitude is below the threshold value β/2, and reduces the size of the larger values by β/2. Figure 51.3 plots the function T_{β/2}(z) defined by (51.46). In summary, using the ℓ1-penalty term in (51.47) results in a sparse solution ŵ that is "close" to the vector z.

Figure 51.3 The soft-thresholding function, T_{β/2}(z), reduces the value of z gradually. Small values of z within the interval [−β/2, β/2] are set to zero, while values of z outside this interval have their size reduced by an amount equal to β/2. The dotted segment represents the line y = z.


51.4.1 Orthogonal Data

Before studying the general case of arbitrary data matrices H, we consider first the special case when the "squared matrix" H^T H happens to be "orthogonal," namely, when H satisfies

    H^{\sf T} H = \kappa^2 I_M, \quad \text{for some } \kappa^2 > 0                    (51.49)

Using this condition, and the vector and matrix notation {d, H} defined in (51.1c), we rewrite the unregularized and regularized risks in the form

    P(w) \;\stackrel{\Delta}{=}\; \frac{1}{N}\sum_{n=0}^{N-1}\big(x(n) - y_n^{\sf T} w\big)^2
         \;=\; \frac{1}{N}\|d - Hw\|^2
         \;=\; \frac{1}{N}\Big\{ \|d\|^2 - 2d^{\sf T} Hw + \kappa^2\|w\|^2 \Big\}                    (51.50)

and

    P_{\rm reg}(w) = \alpha\|w\|_1 + \frac{1}{N}\Big\{ \|d\|^2 - 2d^{\sf T} Hw + \kappa^2\|w\|^2 \Big\}                    (51.51)

Note that both risks are strongly convex since κ² > 0. Therefore, they each have a unique global minimizer, denoted by w^⋆ and w^⋆_reg.

Lemma 51.2. (ℓ1-regularized solution for orthogonal data) Consider the ℓ1-regularized problem (51.51) under the orthogonality condition (51.49). The solution is unique and given by

    w^\star_{\rm reg} = T_{\alpha N/2\kappa^2}(w^\star)                    (51.52)

where w^⋆ = (1/κ²) H^T d is the minimizer of the unregularized risk (51.50).

Proof: We employ a completion-of-squares argument to write (51.51) as

    P_{\rm reg}(w) = \alpha\|w\|_1 + \frac{\kappa^2}{N}\Big\{ \|w\|^2 - \frac{2}{\kappa^2} d^{\sf T} Hw + \frac{1}{\kappa^2}\|d\|^2 \Big\}
                   = \alpha\|w\|_1 + \frac{\kappa^2}{N}\Big\{ \Big\|w - \frac{1}{\kappa^2}H^{\sf T} d\Big\|^2 + \frac{1}{\kappa^2}\|d\|^2 - \frac{1}{\kappa^4}\|H^{\sf T} d\|^2 \Big\}                    (51.53)

That is,

    P_{\rm reg}(w) \;\propto\; \beta\|w\|_1 + \|w - z\|^2 + \phi                    (51.54)

where ∝ is the proportionality symbol, while the scalars {β, φ} and the column vector z ∈ IR^M are defined by

    \beta \;\stackrel{\Delta}{=}\; \frac{\alpha N}{\kappa^2} > 0                    (51.55a)
    z \;\stackrel{\Delta}{=}\; \frac{1}{\kappa^2} H^{\sf T} d = w^\star                    (51.55b)
    \phi \;\stackrel{\Delta}{=}\; \frac{1}{\kappa^2}\|d\|^2 - \frac{1}{\kappa^4}\|H^{\sf T} d\|^2                    (51.55c)

Observe that z agrees with the minimizer, w^⋆, of the unregularized problem under the condition H^T H = κ² I. Minimization of the empirical risk (51.54) is now of the same form as problem (51.47). Therefore, we deduce that the minimizer of (51.51) under the orthogonality condition (51.49) is given by (51.52).

51.4.2 LASSO or Basis Pursuit Denoising

Observe how construction (51.52) applies soft-thresholding to w^⋆ with the threshold defined by αN/2κ²; this value (and, hence, sparsity) increases with α. In Prob. 51.10 we relax condition (51.49) and replace it by H^T H = D², where D is diagonal with positive entries. More generally, for arbitrary data matrices H and under elastic-net regularization, we can derive a similar expression for w^⋆_reg involving a soft-thresholding operation, albeit one where w^⋆ is replaced by a dual variable – see expression (51.61). Thus, consider the regularized problem:

    w^\star_{\rm reg} = \underset{w\in{\rm I\!R}^M}{\rm argmin}\; \Big\{ P_{\rm reg}(w) \stackrel{\Delta}{=} q(w) + \frac{1}{N}\sum_{n=0}^{N-1}\big(x(n) - y_n^{\sf T} w\big)^2 \Big\}                    (51.56a)

where the regularization factor has the form

    q(w) = \alpha\|w\|_1 + \rho\|w\|^2, \qquad \alpha > 0,\; \rho \ge 0                    (51.56b)

When ρ = 0 we have pure ℓ1-regularization. Problem (51.56a)-(51.56b) is known as LASSO, where the acronym stands for "least absolute shrinkage and selection operator." The shrinkage feature is because the solution w^⋆_reg will satisfy property (51.43). The selection feature is because w^⋆_reg will tend to be sparse. Problem (51.56a) is also known as the basis pursuit denoising problem; this is because it seeks a sparse representation for the vector d in terms of the columns of H.

For arbitrary H, the LASSO problem (51.56a) is usually solved iteratively by means of subgradient or proximal gradient iterations, with or without stochastic sampling of data, as was already shown earlier in Examples 14.1, 15.3, and 16.12; the latter example describes a stochastic proximal gradient implementation, reproduced here for illustration purposes.


Stochastic proximal gradient algorithm for LASSO problem (51.56a).
    given dataset {x(m), y_m}, m = 0, 1, ..., N−1;
    start from an arbitrary initial condition, w_{−1};
    repeat until convergence over n ≥ 0:
        select at random a sample (x(n), y_n) at iteration n;
        z_n = (1 − 2µρ) w_{n−1} + 2µ y_n (x(n) − y_n^T w_{n−1})
        w_n = T_{µα}(z_n)
    end
    return w^⋆ ← w_n.                    (51.57)

Other implementations are of course possible. For instance, Example 15.3 describes a full-batch implementation leading to the iterated soft-thresholding algorithm (ISTA):

    z_n = (1 - 2\mu\rho) w_{n-1} + \frac{2\mu}{N}\sum_{m=0}^{N-1} y_m\big(x(m) - y_m^{\sf T} w_{n-1}\big)
    w_n = T_{\mu\alpha}(z_n)                    (51.58)
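A minimal sketch of the full-batch iteration (51.58) is shown below; the step size, regularization parameters, and iteration count are illustrative choices, not prescriptions.

    # Sketch of ISTA (51.58) for the LASSO risk (51.56a)-(51.56b).
    import numpy as np

    def soft_threshold(z, t):
        # entrywise T_t(.): threshold value t (so w_n = T_{mu*alpha}(z_n) uses t = mu*alpha)
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def ista(X, Y, mu=1e-3, alpha=0.1, rho=0.0, num_iters=1000):
        """Targets x(n) are stacked in X (length N); observation vectors y_n are the rows of Y (N x M)."""
        N, M = Y.shape
        w = np.zeros(M)
        for _ in range(num_iters):
            residuals = X - Y @ w                       # x(m) - y_m^T w_{n-1}
            z = (1 - 2 * mu * rho) * w + (2 * mu / N) * (Y.T @ residuals)
            w = soft_threshold(z, mu * alpha)
        return w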

Numerical solutions of the LASSO optimization problem based on the use of convex optimization packages are also possible. The derivation in this section is meant to highlight some properties of the exact solution, such as showing that it continues to have a soft-thresholding form. To do so, we will follow a duality argument.

Expression for LASSO solution
Using the vector notation {d, H}, problem (51.56a) can be recast as

    w^\star_{\rm reg} = \underset{w\in{\rm I\!R}^M}{\rm argmin}\; \Big\{ q(w) + \frac{1}{N}\|d - Hw\|^2 \Big\}                    (51.59)

or, equivalently, in terms of an auxiliary variable transformation that introduces a constraint:

    (w^\star_{\rm reg}, z^\star) = \underset{w,z}{\rm argmin}\; \Big\{ q(w) + \frac{1}{N}\|d - z\|^2 \Big\}, \qquad \text{subject to } z = Hw                    (51.60)

where we introduced z ∈ IR^N; it depends linearly on w. The risk function in statement (51.60) is convex over w and z. We therefore have a convex optimization problem with a linear equality constraint. This type of formulation is a special case of problem (9.1), involving convex costs subject to convex inequality and equality constraints, which we studied in Section 9.1. The results from that section show that strong duality holds for problem (51.60). This means that we can learn about the solution w^⋆_reg by using duality arguments to establish the next theorem for both cases of ρ ≠ 0 and ρ = 0; the proof appears in Appendix 51.B.

Theorem 51.1. (Expression for LASSO solution) Consider the regularized problem (51.56a)-(51.56b). The solution satisfies the following constructions:

(a) (elastic-net regularization, ρ ≠ 0). The solution is unique and given by

    w^\star_{\rm reg} = \frac{1}{2\rho}\, T_\alpha\big(H^{\sf T} \lambda^o\big)                    (51.61)

where λ^o is the unique maximizer of the strongly concave function:

    \lambda^o = \underset{\lambda\in{\rm I\!R}^N}{\rm argmax}\; \Big\{ \lambda^{\sf T} d - \frac{N}{4}\|\lambda\|^2 - \frac{1}{4\rho}\big\| T_\alpha(H^{\sf T}\lambda) \big\|^2 \Big\}                    (51.62)

(b) (ℓ1-regularization, ρ = 0). The solution satisfies either of the relations:

    H w^\star_{\rm reg} = d - \frac{N}{2}\lambda^o                    (51.63a)

    w^\star_{\rm reg} = \underset{w\in{\rm I\!R}^M}{\rm argmin}\; \Big\{ \alpha\|w\|_1 - (\lambda^o)^{\sf T} H w \Big\}                    (51.63b)

where λ^o is the unique projection of the vector (2/N)d onto the set of vectors λ satisfying ‖H^T λ‖_∞ ≤ α, namely, it solves problem (51.112).

Comparing expression (51.61) with (51.52) for "orthogonal" data matrices, we note that the soft-thresholding function is now applied to a dual variable λ^o and not to the unregularized solution, w^⋆. Moreover, the threshold in T_α(·) increases with α so that sparser models are expected for larger α. Clearly, solving the LASSO problem via (51.61) or (51.63a)-(51.63b) is not simpler than solving the original optimization problem (51.56a) because we still need to determine λ^o in (51.62) or (51.112). The usefulness of (51.61) is that it provides a representation for the solution in a manner similar to (51.52) and helps illustrate the sparsity property of the resulting w^⋆_reg. The parameters α and ρ define the degree of regularization: larger values tend to promote smaller (in norm) and more sparse solutions. One useful way to select these parameters is the cross validation technique described later in Section 61.3.

Example 51.5 (Comparing different regularized solutions) In this example we compare numerically the behavior of ℓ2-, ℓ1-, and elastic-net regularization solutions. First, however, we need to show how to approximate the regularized solution to (51.56a)-(51.56b). We already know that we can employ a stochastic subgradient algorithm for this purpose to arrive at good approximations for w^⋆_reg. Under elastic-net regularization, the recursion would start from some random initial guess, denoted by w_{−1}, and then iterate as follows:

    w_n = (1 - 2\mu\rho) w_{n-1} - \mu\alpha\, {\rm sign}(w_{n-1}) + 2\mu y_n\big(x(n) - y_n^{\sf T} w_{n-1}\big), \qquad n \ge 0                    (51.64)

where µ is a small step-size parameter and the notation w_n denotes the approximation for the regularized solution at iteration n. The sign function, when applied to a vector argument, returns a vector with entries equal to ±1 depending on the signs of the


individual entries of w_{n−1}: +1 for nonnegative entries and −1 for negative entries. The algorithm is run multiple times over the training data {x(n), y_n}, with the data being randomly reshuffled at the beginning of each epoch, namely:
(a) At the start of each epoch, the data {x(n), y_n} is randomly reshuffled so that each epoch runs over the same dataset, albeit in a different random order.
(b) The initial condition for the epoch of index k is the iterate value that was obtained at the end of the previous epoch.
The iterate that is obtained at the end of the last epoch is taken to be the approximation for w^⋆_reg. Iteration (51.64) applies to both cases of ℓ1-regularization (by setting ρ = 0) and elastic-net regularization when both α and ρ are positive. Although we already have a closed-form solution for the ℓ2-regularized solution via expression (51.21), or can even arrive at it by means of the recursive least-squares (RLS) algorithm (50.123), the same stochastic recursion (51.64) can be used to approximate the ℓ2-regularized solution as well by setting α = 0; the recursion leads to a computationally simpler algorithm than RLS, albeit at a slower convergence rate.

-regularized solution

1 0.8

-

0.6 0.4 0.2 0 -0.2

0

2

4

6

8

10

12

14

16

18

20

6

8

10

12

14

16

18

20

1

0.8 0.6 0.4 0.2 0 -0.2

0

2

4

Figure 51.4 The top plot shows the true model w o with three nonzero entries at value

1 while all other entries are at 0. The top plot also shows the `2 -regularized solution, ? wreg , that is obtained by using the least-squares expression (51.21). The bottom plot compares the solutions that are obtained from the least-squares expression (51.21) and from the stochastic recursion (51.64) using 20 runs over the data, α = 0, and µ = 0.0001. It is seen that recursion (51.64) is able to learn the `2 -regularized solution well.


We use the stochastic construction (51.64) to illustrate the behavior of the different regularization modes by considering the following numerical example. We generate N = 4000 random data points {x(n), y_n} related through the linear model:

    x(n) = y_n^{\sf T} w^o + v(n)                    (51.65)

where v(n) is white Gaussian noise with variance σ_v² = 0.01, and each observation vector has dimension M = 20. We generate a sparse true model w^o consisting of three randomly chosen entries set to 1, while all other entries of w^o are set to 0. Figures 51.4 and 51.5 illustrate the results that follow from using 20 runs over the data with µ = 0.0001, α = 5, and ρ = 2. It is seen in the lower plot of Fig. 51.4 that the stochastic recursion (51.64) converges to a good approximation of the actual least-squares solution from (51.21). The middle plot of Fig. 51.5 illustrates the sparsity property of the ℓ1-regularized solution.

Figure 51.5 All three plots show the true model w^o with three nonzero entries at value 1 while all other entries are at 0. In each case, the true model is compared against the ℓ2-regularized solution (top plot), the ℓ1-regularized solution using α = 5 (middle plot), and the elastic-net regularized solution using α = 5 and ρ = 2 (bottom plot). All these regularized solutions are obtained by using the stochastic (sub)gradient recursion (51.64) with 20 runs over the data and µ = 0.0001. The middle plot illustrates how ℓ1-regularization leads to a sparse solution, while the elastic-net regularized solution has slightly more nonzero entries.


51.5 COMMENTARIES AND DISCUSSION

Tikhonov regularization. The regularized least-squares problem (51.16) and its solution (51.21) were proposed by the Russian mathematician Andrey Tikhonov (1906-1993) in the publication by Tikhonov (1963) on ill-posed problems – see also the text by Tikhonov and Arsenin (1977). This form of regularization is nowadays very popular and is known as Tikhonov regularization. Tikhonov's formulation was general and applicable to infinite-dimensional operators and not only to finite-dimensional least-squares problems. His work was aimed at solving integral equations of the first kind, also known as Fredholm integral equations, which deal with the problem of determining a function solution x(t) to an integral equation of the following form:

    \int_a^b A(s, t)\, x(t)\, dt = b(s)                    (51.66)

for a given kernel function, A(s, t), and another function b(s). These integral equations can be ill-conditioned and can admit multiple solutions. The analogy with linear systems of equations of the form Ax = b becomes apparent if we employ the operator notation to rewrite the integral equation in the form Ax = b, in terms of some infinite-dimensional operator A. It turns out that both Phillips (1962) and Tikhonov (1963) proposed using ℓ2-regularization to counter ill-conditioning for Fredholm integral equations, which is why this type of regularization is also referred to as the Phillips-Tikhonov or Tikhonov-Phillips regularization. The same technique also appeared in Hoerl (1962), albeit for finite-dimensional operators (i.e., for matrices) in the context of least-squares problems. This latter work was motivated by the earlier contribution on ridge analysis from Hoerl (1959) – see also Hoerl and Kennard (1970) and the review by Hoerl (1985). It is for this reason that ℓ2-regularization is also known as ridge regression in the statistics literature. Useful overviews on the role of Tikhonov regularization in the solution of linear systems of equations and least-squares problems appear in the survey article by Neumaier (1998) and in the texts by Golub and Van Loan (1996), Bjorck (1996), and Hansen (1997). More information on regularization in general can be found in the texts by Wahba (1990) and Engl, Hanke, and Neubauer (1996).

LASSO and basis pursuit denoising. In Examples 51.2 and 51.4 we showed that regularization in the least-squares case corresponds to associating a prior distribution with the sought-after parameter, w (now treated as a random quantity). A Gaussian prior leads to ℓ2-regularization, while a Laplacian prior leads to ℓ1-regularization, as noted by Tibshirani (1996b). We showed in the body of the chapter that ℓ1-regularization leads to sparse solutions. However, it has been observed in practice that it tends to retain more nonzero entries than necessary in the solution vector and, moreover, if several entries in the observation space are strongly correlated, the solution vector will tend to keep one of them and discard the others – see Zou and Hastie (2005). Elastic-net regularization, on the other hand, combines ℓ1- and ℓ2-penalty terms and inherits some of their advantages: It promotes sparsity without totally discarding highly correlated observations. This form of regularization was proposed by Zou and Hastie (2005); examination of some of its properties appears in this reference as well as in the text by Hastie, Tibshirani, and Friedman (2009) and in De Mol, De Vito, and Rosasco (2009). Given data {x(n), y_n ∈ IR^M}, the pure ℓ1-regularization formulation solves

    w^\star_{\rm reg} = \underset{w\in{\rm I\!R}^M}{\rm argmin}\; \Big\{ \alpha\|w\|_1 + \frac{1}{N}\sum_{n=0}^{N-1}\big(x(n) - y_n^{\sf T} w\big)^2 \Big\}                    (51.67)

where α > 0 is the regularization parameter. We explained in the chapter that this problem is equivalent to solving

    w^\star_{\rm reg} = \underset{w\in{\rm I\!R}^M}{\rm argmin}\; \|d - Hw\|^2, \qquad \text{subject to } \alpha\|w\|_1 \le \tau                    (51.68)

for some τ > 0. Problems of this type were first proposed by Santosa and Symes (1986) and later by Tibshirani (1996b); the latter reference uses the acronym LASSO for such problems. A similar problem was studied by Chen, Donoho, and Saunders (1998, 2001) under the name basis pursuit denoising. They examined instead the reverse formulation:

    w^\star_{\rm reg} = \underset{w\in{\rm I\!R}^M}{\rm argmin}\; \|w\|_1, \qquad \text{subject to } \|d - Hw\|^2 \le \epsilon                    (51.69)

for some small ε > 0. This formulation was motivated by the earlier work in Chen and Donoho (1994) on standard basis pursuit. In this latter problem, the objective is to seek a sparse representation for a signal vector d from an overcomplete basis H, namely, to solve (see Prob. 51.7):

    \underset{w\in{\rm I\!R}^M}{\min}\; \|w\|_1, \qquad \text{subject to } d = Hw                    (51.70)

All three formulations (51.67), (51.68), and (51.69) are equivalent to each other for suitable choices of the parameters {α, τ, ε} – see Prob. 51.7. The contributions of Tibshirani (1996b) and Chen, Donoho, and Saunders (1998, 2001) generated renewed interest in ℓ1-regularized problems in the statistics, machine learning, and signal processing literature. These types of problems have an older history, especially in the field of geophysics. For example, a problem of the same form as (51.67) was used in the deconvolution of seismic signals by Santosa and Symes (1986). Their work was motivated by the earlier contributions of Claerbout and Muir (1973) and Taylor, Banks, and McCoy (1979). Using our notation, these last two references consider optimization problems of the following form (compare with (51.67)):

    w^\star_{\rm reg} = \underset{w\in{\rm I\!R}^M}{\rm argmin}\; \Big\{ \alpha\|w\|_1 + \frac{1}{N}\sum_{n=0}^{N-1}\big|x(n) - y_n^{\sf T} w\big| \Big\}                    (51.71)

where the sum of the absolute residuals (rightmost term) is used in place of the sum of their squared values, as is the case in (51.67). Both formulations employ an ℓ1-penalty term. One of the earliest recognitions that ℓ1-regularization promotes sparsity appears in the article by Santosa and Symes (1986, p. 1308), where it is stated that the use of the ℓ1-penalty term "has the effect of constructing a solution which has the least number of nonzero components." Arguments and derivations in support of the sparsity-promoting property of the ℓ1-penalty appear in Levy and Fullagar (1981), Oldenburg, Scheuer, and Levy (1983), and also in Santosa and Symes (1986, sec. 2). In their formulation of the deconvolution problem, Santosa and Symes (1986) proposed replacing (51.71) by the same problem (51.67) using the sum of squared residuals – see their expressions (1.16) and (5.1). It is useful to note that design problems involving ℓ1-measures of performance have also been pursued in the control field, starting from the mid-1980s. The primary motivation there for the use of the ℓ1-norm has been to design control laws that minimize the effect of persistent bounded disturbances on the output of the system. Among the earliest references that promoted this approach are the works by Vidyasagar (1986) and Dahleh and Pearson (1986, 1987). A thorough treatment of the subject matter, along with an extensive bibliography, appears in the text by Dahleh and Diaz-Bobillo (1995).

Robust least-squares designs. Given an N × M data matrix H, an N × 1 target vector d, an N × N positive-definite weighting matrix R, and an M × M positive-definite regularization matrix Π, the solution to the following regularized weighted least-squares problem:




    w^\star = \underset{w\in{\rm I\!R}^M}{\rm argmin}\; \Big\{ w^{\sf T}\Pi w + (d - Hw)^{\sf T} R (d - Hw) \Big\}                    (51.72)

is unique and given by

    w^\star = (\Pi + H^{\sf T} R H)^{-1} H^{\sf T} R\, d                    (51.73)

When the data {d, H} are subject to uncertainties, the performance of this solution can deteriorate appreciably. Assume that the actual data matrix that generated the target signal d is H + δH and not H, for some small perturbation δH. Then, the above solution w^⋆, which is designed based on knowledge of the nominal value H, does not take into account the presence of the perturbations in the data. One way to address this problem is to formulate a robust version of the least-squares problem as follows:

    w^{\rm rob} \;\stackrel{\Delta}{=}\; \underset{w\in{\rm I\!R}^M}{\rm argmin}\; \max_{\{\delta H, \delta d\}}\; \Big\{ w^{\sf T}\Pi w + \big[(d+\delta d) - (H+\delta H)w\big]^{\sf T} R \big[(d+\delta d) - (H+\delta H)w\big] \Big\}                    (51.74a)

where {δd, δH} represent the unknown perturbations that are assumed to be modeled as follows:

    \begin{bmatrix} \delta d & \delta H \end{bmatrix} = P\, \Delta\, \begin{bmatrix} e_d & E_H \end{bmatrix}                    (51.74b)

where Δ is an arbitrary contraction matrix satisfying ‖Δ‖ ≤ 1 and {P, e_d, E_H} are known quantities of appropriate dimensions, e.g., e_d is a column vector. The matrix P is meant to constrain the perturbations to its range space. Problem (51.74a) can be interpreted as a constrained two-player game, with the designer trying to select an estimate w^rob that minimizes the cost while the opponent {δd, δH} tries to maximize the same cost. It turns out that the solution to (51.74a) has the form of a regularized least-squares solution, albeit one with modified {Π, R} matrices, as indicated by the following result.

Robust regularized least-squares (Sayed, Nascimento, and Cipparrone (2002)). Problem (51.74a)-(51.74b) has a unique solution given by

    w^{\rm rob} = \big( \widehat{\Pi} + H^{\sf T} \widehat{R} H \big)^{-1} \big( H^{\sf T} \widehat{R}\, d + \widehat{\beta}\, E_H^{\sf T} e_d \big)                    (51.75a)

where {Π̂, R̂} are obtained from {Π, R} as follows:

    \widehat{\Pi} = \Pi + \widehat{\beta}\, E_H^{\sf T} E_H                    (51.75b)
    \widehat{R} = R + R P \big( \widehat{\beta} I - P^{\sf T} R P \big)^{\dagger} P^{\sf T} R                    (51.75c)

where the notation † refers to the pseudo-inverse of its matrix argument, and the scalar β̂ is determined by solving

    \widehat{\beta} = \underset{\beta \ge \|P^{\sf T} R P\|}{\rm argmin}\; G(\beta)                    (51.75d)

where the function G(β) is defined as follows:

    G(\beta) \;\stackrel{\Delta}{=}\; \|w(\beta)\|^2_{\Pi(\beta)} + \|d - Hw(\beta)\|^2_{R(\beta)} + \beta\, \|e_d - E_H w(\beta)\|^2                    (51.76)


where the notation ‖a‖²_X stands for a^T X a, and

    R(\beta) = R + R P \big( \beta I - P^{\sf T} R P \big)^{\dagger} P^{\sf T} R                    (51.77a)
    \Pi(\beta) = \Pi + \beta\, E_H^{\sf T} E_H                    (51.77b)
    w(\beta) = \big( \Pi(\beta) + H^{\sf T} R(\beta) H \big)^{-1} \big( H^{\sf T} R(\beta)\, d + \beta\, E_H^{\sf T} e_d \big)                    (51.77c)

We denote the lower bound on β by β_ℓ = ‖P^T R P‖. Compared with the solution (51.73) to the original regularized least-squares problem, we observe that the expression for w^rob is distinct in some important ways:
(a) First, the weighting matrices {Π, R} are replaced by corrected versions {Π̂, R̂}. These corrections are defined in terms of a scalar β̂, which is obtained by minimizing G(β) over the semi-open interval [β_ℓ, ∞).
(b) It was shown by Sayed and Chen (2002) and Sayed, Nascimento, and Cipparrone (2002) that the function G(β) has a unique global minimum (and no local minima) over the interval [β_ℓ, ∞). This means that the determination of β̂ can be pursued by standard search procedures without worrying about convergence to undesired local minima. Extensive experiments suggest that setting β̂ = λβ_ℓ (a scaled multiple of the lower bound for some positive λ chosen by the designer) is generally sufficient.
(c) The right-hand side of (51.75a) contains an additional term β̂ E_H^T e_d. The expression for w^rob can be viewed as the solution to the following extended problem:

    w^{\rm rob} = \underset{w\in{\rm I\!R}^M}{\rm argmin}\; \Big\{ \begin{bmatrix} 1 & w^{\sf T} \end{bmatrix} \begin{bmatrix} \widehat{\beta}\|e_d\|^2 & -\widehat{\beta}\, e_d^{\sf T} E_H \\ -\widehat{\beta}\, E_H^{\sf T} e_d & \widehat{\Pi} \end{bmatrix} \begin{bmatrix} 1 \\ w \end{bmatrix} + \|d - Hw\|^2_{\widehat{R}} \Big\}                    (51.78)

(d) For values β̂ > β_ℓ, the pseudo-inverse operation can be replaced by standard matrix inversion and it holds that

    \widehat{R}^{-1} = R^{-1} - \widehat{\beta}^{-1} P P^{\sf T}                    (51.79)

Other robust variations of least-squares are possible. For example, model (51.74b) for the perturbations can be replaced by one of the form

    \|\delta H\| \le \eta, \qquad \|\delta d\| \le \eta_d                    (51.80)

where the uncertainties are instead assumed to lie within bounded regions determined by the positive scalars {η, η_d}. The solution has a similar structure and is described in Chandrasekaran et al. (1997, 1998) and Sayed, Nascimento, and Cipparrone (2002). A convex optimization approach is described in El Ghaoui and Lebret (1997). Other variations and geometric arguments are described in Sayed, Nascimento, and Chandrasekaran (1998) – see also Probs. 51.19-51.21.

PROBLEMS

51.1 Consider the least-squares problem (51.1a) with a rank-deficient H:

    H = \begin{bmatrix} 1 & 2 \\ 1 & 2 \\ 0 & 0 \end{bmatrix}, \qquad d = \begin{bmatrix} +1 \\ +1 \\ -1 \end{bmatrix}

(a) Verify that all solutions to the normal equations take the form w^⋆ = col{1 − 2b, b} for any b ∈ IR.
(b) Verify that all vectors in the nullspace of H^T H take the form p = col{−2b, b}.
(c) Verify that the following are two valid solutions:

    w_1^\star = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \qquad w_2^\star = \begin{bmatrix} -3 \\ 2 \end{bmatrix}

Consider the test vector yt = col{2, 2}. Compute the x b(t) that result from both solutions. Remark. Observe that the predictions have opposite signs, which is undesirable in applications where the sign of x b(t) is used to perform classification. 51.2 Consider the least-squares problem (51.1a) with an ill-conditioned matrix H:     1 +1 −1 √  , d =  −1  (51.81) H= +1 

(d)

where ε > 0 is a small number, and the entries of d are binary variables of the type x(n) = ±1.
(a) What is the condition number of H^T H?
(b) Determine the solution w^⋆ to the normal equations.
(c) Consider the two observation vectors y_1 = col{10, 10, 10^{−6}}, y_2 = col{10, 10, −10^{−6}}, where their trailing entries have small size and differ in sign. Predict their target signals x̂(1) and x̂(2). Remark. Observe how x̂(2) can become negative for small enough ε while x̂(1) is always positive. If the sign of x̂ is used to classify the observation vector y, then the vectors {y_1, y_2}, despite being very close to each other in Euclidean space, will end up being assigned to different classes.
51.3 Let w^⋆ and w^⋆_reg denote solutions to the unregularized and regularized least-squares risks (51.1a) and (51.16), respectively.
(a) Show that w^⋆_reg = (ρN I + H^T H)^{−1} H^T H w^⋆.
(b) Introduce the eigen-decomposition H^T H = UΛU^T, where U is M × M orthogonal and Λ is diagonal with nonnegative entries {λ(m)}. Let w̄_reg = U^T w^⋆_reg and w̄ = U^T w^⋆ and denote their individual entries by {w̄_reg(m), w̄(m)}. Verify that

    \bar{w}_{\rm reg}(m) = \Big( \frac{\lambda(m)}{\rho N + \lambda(m)} \Big)\, \bar{w}(m), \qquad m = 1, 2, \ldots, M

? Preg (wreg ) − P (w? ) =

M  X m=1

ρλ(m) ρN + λ(m)



2 |w(m)| ¯

? Since generally at least one λ(m) 6= 0 and w? 6= 0, conclude that Preg (wreg )> ? P (w ). (d) Verify that the function f (ρ) = ρλ/(ρN + λ) is nondecreasing in ρ. Conclude that the bias increases with ρ. 51.5 We re-examine the result of Prob. 50.11 for the case of `2 -regularized leastsquares (or ridge regression). Thus, refer again to the stochastic model (50.88) where v has covariance matrix σv2 IN but is not necessarily Gaussian. Introduce the mean-square error (MSE) risk, P (w) = E kd − Hwk2 , where the expectation is over the source of

(c)

Problems

2247

randomness in d. Verify that the `2 -regularized least-squares solution w?reg given by (51.21) leads to the following average excess risk expression: E P (w?reg ) − P (wo ) =  −2 h 2 i 1 (wo )T H T I + HH T Hwo + σv2 Tr H(ρN I + H T H)−1 H T ρN Verify that the expression reduces to the result of Prob. 50.11 as ρ → 0. 51.6 The expression in Prob. 51.5 consists of two terms: The first one depends on 1/ρ while the second one varies with ρ. Show that the average excess risk is bounded by E P (w?reg ) − P (wo ) ≤

ρN σ2 kwo k2 + v Tr(H T H) 2 2ρN

Minimize the bound over ρ and conclude that E P (w?reg )−P (wo ) ≤ σv kwo k For which value of ρ is this bound attained? 51.7 Consider the `1 -regularized problem with α > 0: ( ) N −1 2 1 X T argmin αkwk1 + x(n) − yn w N n=0 w∈IRM

p

Tr(H T H).

Using the vector notation (51.1c), show that the problem is equivalent to solving: argmin kwk1 ,

subject to kd − Hwk2 ≤ 

w∈IRM

for some  ≥ 0. Show that as α → 0, the formulation reduces to the so-called basis pursuit problem (which involves an equality constraint): argmin kwk1 ,

subject to Hw = d

w∈IRM

51.8 Establish the validity of expression (51.102) for Sα (x). 51.9 In this problem, we follow the approach described in the earlier Example 14.10 to express the `1 -regularized least-squares (LASSO) solution in an alternative form. Consider the regularized problem: ) ( N −1 2 1 X ? x(n) − ynT w wreg = argmin Preg (w) = αkwk1 + N n=0 w∈IRM We denote the individual entries of w and yn by w = col{wm } and yn = col{yn,m }, respectively, for m = 1, 2, . . . , M . We also use the notation w−m and yn,−m to refer to the vectors w and yn with their mth entries excluded. (a) Verify that, as a function of wm , the regularized risk can be written as: ( )  α cm 2 |wm | + wm − + terms indep. of wm Preg (w) = am am am where ∆

am = (b)

N −1 1 X 2 yn,m , N n=0



cm =

N −1   1 X T yn,m x(n) − yn,−m w−m N n=0

Conclude that the minimizer over wm is given by w bm = Tα/2am (b cm /am ), for m = 1, 2, . . . , M, and where b cm is given by the same expression as cm with w−m replaced by w b−m .

2248

Regularization

51.10 Replace condition (51.49) by H T H = D2 > 0, where D is diagonal with positive entries. Use result (11.35) to show that expression (51.52) is replaced by  αN −1  ? wreg = sign(w? ) D|w? | − D 1 2 + where the operations sign(x), |x|, and (a)+ are applied elementwise. 51.11 Consider the `2 -regularized risk function Preg (w) = ρkwk2 +P (w), where ρ > 0 and P (w) is some convex risk in w. Show that Preg (w) is strongly convex and, therefore, has a unique global minimum. 51.12 Refer to the equivalent problems (51.94). (a) Assume q(w) = ρkwk2 . Show that τ decreases as ρ increases. (b) Assume q(w) = αkwk1 + ρkwk2 , where α > 0 and ρ > 0. Show that τ decreases as either α or ρ increases. 51.13 Consider the following `2 -regularized stochastic risk: n o o wreg = argmin ρkwk2 + E (x − y T w)2 w∈IRM

o Show that wreg = Ry (ρIM + Ry )−1 wo , where wo is the minimizer of the unregularized component, E (x − y T w)2 . 51.14 Consider the following `1 -regularized stochastic risk: n o o wreg = argmin αkwk1 + E (x − y T w)2 w∈IRM

o Assume Ry = σy2 IM . Show that wreg = Tα/2σy2 (wo ), where wo is the minimizer of the

unregularized component, E (x − y T w)2 . 51.15 Refer to the `1 -regularized problem (51.38). Verify first that for any scalar x ∈ IR, it holds ) ( 1 x2 +z |x| = min z>0 2 z Let wm denote the individual entries of w ∈ IRM . Conclude that (51.38) can be transformed into ( ! ) N −1 M 2 2 X 1 wm 1 X ∆ T min Preg (w) = α + zm + x(n) − yn w 2 zm N n=0 w∈IRM ,{zm >0} m=1 Remark. The idea of replacing the regularization factor by smoother forms has been exploited in several works, especially in the context of optimization and image processing – see, for example, Geman and Yang (1995), Bach et al. (2012), Chan and Liang (2014), and Lanza et al. (2015). 51.16 Consider a vector w ∈ IRM with individual entries {wm }. For any p ≥ 1 and δ ≥ 0, the bridge regression problem in statistics refers to min

w∈IRM

N −1 2 1 X x(n) − ynT w , N n=0

subject to

M X

|wm |p ≤ δ

m=1

Show that this problem is equivalent to solving ( M ) N −1 2 X 1 X p T min x(n) − yn w ρ |wm | + N n=0 w∈IRM m=1 for some ρ ≥ 0. That is, show that for any δ ≥ 0 there exists a ρ ≥ 0 that makes both problems equivalent to each other (i.e., have the same solution). Remark. See the works by Frank and Friedman (1993) and Fu (1998) for a related discussion.

Problems

2249

51.17 Refer to the `1 -regularized problem (51.38) and define the quantities {d, H} shown in (51.1c), where H ∈ IRN ×M . Let {um } denote the individual columns of H for m = 1, 2, . . . , M , where each um has size N × 1. Let wm denote the individual entries ? of w ∈ IRM . Show that w? is a solution of (51.38) if, and only if, for every entry wm it holds that  T ? ? when wm =0   |um (d − Hw )| ≤ N α/2,  ?  uTm (d − Hw? ) = N α sign(wm ), 2

? when wm 6= 0

Remark. See Bach et al. (2012) for a related discussion. 51.18 Derive expressions (51.117a)–(51.117b) for the conjugate functions. 51.19 Assume H is full rank and has dimensions N × M with N > M . Consider the regularized least-squares problem: n o ∆ w? = argmin ρkwk2 + kd − Hwk2 , ρ > 0 w∈IRM

and assume d ∈ / R(H) and H T d 6= 0. Let de = d − Hw? and introduce the scalar ? e Verify that η < kH T dk/kdk. Remark. See Sayed, Nascimento, and η = ρkw k/kdk. Chandrasekaran (1998) for a related discussion. 51.20 The next two problems are extracted from Sayed (2003, 2008). Consider an N × M full rank matrix H with N ≥ M , and an N × 1 vector d that does not belong to the column span of H. Let η be a positive real number and consider the set of all matrices δH whose 2-induced norms do not exceed η, kδHk ≤ η. Now consider the following optimization problem whose solution we denote by w? : ( ) ∆

w? = argmin w∈IRM

max

kδHk≤η

kd − (H + δH)wk

That is, we seek to minimize the maximum residual over the set {kδHk ≤ η}. (a) Argue from the conditions of the problem that we must have N > M . (b) Show that the uncertainty set {kδHk ≤ η} contains a perturbation δH o such that d is orthogonal to (H + δH o ) if, and only if, η ≥ kH T dk/kdk. (c) Show that the above optimization problem has a unique solution at w? = 0 if, and only if, the condition on η in part (b) holds. Remark. For more details on such robust formulations, see Chandrasekaran et al. (1997, 1998), Sayed, Nascimento, and Chandrasekaran (1998), and Sayed, Nascimento, and Cipparrone (2002). 51.21 Consider an N × M full rank matrix H with N ≥ M , and an N × 1 vector d that does not belong to the column span of H. (a) For any nonzero M × 1 column vector w, show that the following rank-one modification of H continues to have full rank for any positive real number η: ∆

H(w) = H − η (b)

(c) (d)

d − Hw wT kd − Hwk kwk

Verify that kd − H(w)wk = ky − Hwk + ηkwk, and that the vectors d − H(w)w and d−Hw are collinear and point in the same direction (that is, one is a positive multiple of the other). Show that kd − H(w)wk = maxkδHk≤η kd − (H + δH)wk. Show that the optimization problem min

max

w∈IRM kδHk≤η

kd − (H + δH)wk

has a nonzero solution w? if, and only if, η < kH T dk/kdk.

2250

Regularization

Show that w? is a nonzero solution of the optimization problem in part (d) if, and only, if H T (w? )(d − Hw? ) = 0. That is, the residual vector d − Hw? should be orthogonal to the perturbed matrix H(w? ). Show further that this condition is equivalent to H T (w? )(d − H(w? )w? ) = 0. (f) Assume two nonzero solutions w1? and w2? exist that satisfy the orthogonality condition of part (e). Argue that H T (w2? )(d − H(w2? )w1? ) = 0, and conclude that w1? = w2? so that the solution is unique. Remark. For further details, see Sayed, Nascimento, and Chandrasekaran (1998). (e)

51.A

CONSTRAINED FORMULATIONS FOR REGULARIZATION In this appendix we first establish the equivalence between (51.16) and (51.26) for `2 regularized least-squares, and between (51.38)–(51.39) and (51.44) for `1 - and elasticnet regularized least-squares. Then we extend the conclusion to other regularized convex risks, besides least-squares. Although it would have been sufficient to treat the general case right away, we prefer to explain the equivalence in a gradual manner for the benefit of the reader, starting with quadratic risks. To establish the equivalence, we will appeal to the Lagrange and KKT multiplier arguments from Section 9.1.

51.A.1

Quadratic Risks We start with the `2 -regularized least-squares risk.

`2 -regularization To begin with, we identify the smallest value for τ . We already know that the solution to the `2 -regularized problem (51.16) is given by ? wreg = (ρN IM + H T H)−1 H T d

(51.82)

Now, consider the constrained problem (51.26) for some τ > 0. The unregularized risk P (w) is quadratic in w and is therefore convex and continuously differentiable. The constraint ρkwk2 ≤ τ defines a convex set in IRM . We are therefore faced with the problem of minimizing a convex function over a convex domain. It is straightforward to verify that problems of this type can only have global minima – see the argument after ? (9.10). For the solution wreg defined by (51.82) to be included in the search domain 2 ? ρkwk ≤ τ , it is necessary for the value of τ to satisfy τ ≥ ρkwreg k2 . This argument shows that the smallest value for τ is ? τ = ρkwreg k2 = ρk(ρN IM + H T H)−1 H T dk2

(51.83)

? in which case the regularized solution, wreg , will lie on the boundary of the set kwk2 ≤ τ /ρ. Moreover, the constraint set will exclude any of the solutions, w? , to the original ? unregularized solution from (51.1b). This is because kw? k > kwreg k, as already revealed by (51.24). ? Let wcons denote a solution to the constrained problem (51.26) for the above value ? of τ . We want to verify that this solution agrees with wreg . We appeal to the KKT conditions from Section 9.1. Note first that problem (51.26) does not involve any equality constraints and has only one inequality constraint of the form ∆

g(w) = ρkwk2 − τ ≤ 0

(51.84)

51.A Constrained Formulations for Regularization

2251

We introduce the Lagrangian function L(w, λ) = P (w) + λ(ρkwk2 − τ ), λ ≥ 0

(51.85)

? wreg (λ)

and let denote a minimizer for it. Strong duality holds because the Slater condition (9.58a) is satisfied, i.e., there exists a w ¯ such that g(w) ¯ < 0 (e.g., w ¯ = 0). ? ? The KKT conditions (9.28a)–(9.28e) then state that wreg (λ) agrees with wcons if, and only if, the following conditions hold for some scalar λ: ? ρkwreg (λ)k2 − τ ≤ 0 (feasibility of primal problem) λ≥0 (feasibility of dual problem)  ? λ ρkwreg (λ)k2 − τ = 0 (complementary condition) n o ∇w λ(ρkwk2 − τ ) + P (w) =0 ? (λ) w=wreg

(51.86a) (51.86b) (51.86c) (51.86d)

? ? If we select λ = 1, then wreg (λ) = wreg and the KKT conditions are satisfied at these ? ? ? values for τ = ρkwreg k2 . It follows that wcons = wreg .

$\ell_1$- and elastic-net regularization
We can extend the argument to other regularization factors, such as $q(w) = \alpha\|w\|_1$ or $q(w) = \alpha\|w\|_1 + \rho\|w\|^2$. Let $w_{\rm reg}^\star$ denote the minimizer for either regularized risk (51.38) or (51.39); the argument applies to both cases. It follows that the smallest value for $\tau$ should be:

$$\tau = q(w_{\rm reg}^\star) \qquad (51.87)$$

Let $w_{\rm cons}^\star$ denote a solution to the constrained problem (51.44) for the above value of $\tau$. We want to verify that this solution agrees with $w_{\rm reg}^\star$. We again appeal to the KKT conditions from Section 9.1. Note first that neither problem (51.38) nor (51.39) involves any equality constraints, and each has only one inequality constraint of the form

$$g(w) \stackrel{\Delta}{=} q(w) - \tau \le 0 \qquad (51.88)$$

We introduce the Lagrangian function

$$L(w,\lambda) = P(w) + \lambda(q(w) - \tau), \quad \lambda \ge 0 \qquad (51.89)$$

and let $w_{\rm reg}^\star(\lambda)$ denote a minimizer for it. Strong duality holds because the Slater condition (9.58a) is satisfied, i.e., there exists a $\bar{w}$ such that $g(\bar{w}) < 0$ (e.g., $\bar{w} = 0$). The KKT conditions (9.28a)–(9.28e) then state that $w_{\rm reg}^\star(\lambda)$ agrees with $w_{\rm cons}^\star$ if, and only if, the following conditions hold for some scalar $\lambda$:

$$q(w_{\rm reg}^\star(\lambda)) - \tau \le 0 \quad \text{(feasibility of primal problem)} \qquad (51.90a)$$
$$\lambda \ge 0 \quad \text{(feasibility of dual problem)} \qquad (51.90b)$$
$$\lambda\big(q(w_{\rm reg}^\star(\lambda)) - \tau\big) = 0 \quad \text{(complementary condition)} \qquad (51.90c)$$
$$0 \in \partial_w\Big\{\lambda(q(w) - \tau) + P(w)\Big\}\Big|_{w = w_{\rm reg}^\star(\lambda)} \qquad (51.90d)$$

If we select $\lambda = 1$, then $w_{\rm reg}^\star(\lambda) = w_{\rm reg}^\star$ and the KKT conditions are satisfied at these values for $\tau = q(w_{\rm reg}^\star)$. It follows that $w_{\rm cons}^\star = w_{\rm reg}^\star$.

51.A.2 Other Convex Risks

The discussion in the body of the chapter reveals that regularization has several benefits: It resolves ambiguities by ensuring unique solutions and counters ill-conditioning and overfitting. Naturally, these favorable conditions come at the expense of introducing bias: The achievable minimum risk (or training error) is higher under regularization than it would be in the absence of regularization. These various properties have been established so far for the case of least-squares risks. We argue now that regularization ensures similar properties for other convex risk functions besides quadratic risks. Thus, more generally, we let $P(w)$ denote any convex risk function, differentiable or not, and introduce its regularized version:

$$P_{\rm reg}(w) \stackrel{\Delta}{=} q(w) + P(w) \qquad (51.91)$$

where the penalty term, $q(w)$, is also assumed to be convex in $w$, such as the choices introduced earlier in (51.15). In this chapter we considered one choice for $P(w)$, namely, the quadratic risk (51.92a). Later, when we study learning algorithms, other convex empirical risks will arise (such as logistic risks, exponential risks, hinge risks, and others), in which case the results of the current appendix will be applicable; some of the risks will also be nondifferentiable. Examples include empirical risks of the form:

$$P(w) = \frac{1}{N}\sum_{n=0}^{N-1}\big(x(n) - y_n^{\sf T} w\big)^2 \quad \text{(quadratic risk)} \qquad (51.92a)$$

$$P(w) = \frac{1}{N}\sum_{n=0}^{N-1}\ln\big(1 + e^{-x(n)\,y_n^{\sf T} w}\big) \quad \text{(logistic risk)} \qquad (51.92b)$$

$$P(w) = \frac{1}{N}\sum_{n=0}^{N-1}\max\big\{0,\, -x(n)\,y_n^{\sf T} w\big\} \quad \text{(perceptron risk)} \qquad (51.92c)$$

$$P(w) = \frac{1}{N}\sum_{n=0}^{N-1}\max\big\{0,\, 1 - x(n)\,y_n^{\sf T} w\big\} \quad \text{(hinge risk)} \qquad (51.92d)$$

Uniqueness of solution. We focus on general optimization problems of the form (51.91), where $P(w)$ is convex in $w$ (but need not be differentiable) and $q(w)$ is one of the convex penalty terms considered before in (51.15). The first property to note is that whenever $P(w)$ is convex in $w$, the $\ell_2$-regularized version (i.e., when $q(w) = \rho\|w\|^2$) will be strongly convex for any $\rho > 0$ and, therefore, $P_{\rm reg}(w)$ will have a unique global minimum, $w_{\rm reg}^\star$. The strong convexity of $P_{\rm reg}(w)$ in this case follows from the fact that $\rho\|w\|^2$ is itself strongly convex – see Prob. 51.11. Therefore, ridge regression ensures a unique global minimizer; a similar conclusion can be established when elastic-net regularization is applied for any convex empirical risk $P(w)$. For $\ell_1$-regularization, a unique global minimizer will be guaranteed when $P(w)$ happens to be strictly or strongly convex; in this case, convexity of $P(w)$ alone is not sufficient because the penalty $\alpha\|w\|_1$ is convex but not strictly convex. In the following, we assume that the regularized risk $P_{\rm reg}(w)$ has a unique global minimizer.

Promoting smaller solutions. Let $w^\star$ denote a global minimizer for the unregularized convex risk, $P(w)$. This minimizer need not be unique since $P(w)$ is only assumed to be convex but not necessarily strongly convex. Let $w_{\rm reg}^\star$ denote the global minimizer for the regularized risk, $P_{\rm reg}(w)$. This minimizer is assumed to be unique. Now, since $w_{\rm reg}^\star$ minimizes $P_{\rm reg}(w)$, we have

$$q(w_{\rm reg}^\star) + P(w_{\rm reg}^\star) \le q(w^\star) + P(w^\star)$$
$$\Longrightarrow\ q(w_{\rm reg}^\star) - q(w^\star) \le P(w^\star) - P(w_{\rm reg}^\star)$$
$$\stackrel{(a)}{\Longrightarrow}\ q(w_{\rm reg}^\star) - q(w^\star) \le 0 \ \Longleftrightarrow\ q(w_{\rm reg}^\star) \le q(w^\star) \qquad (51.93)$$

where step (a) is because $w^\star$ minimizes the unregularized risk, $P(w)$.

Constrained formulation. We assume the regularized risk has a unique global minimizer. We also assume that the Slater condition (9.58a) holds, i.e., there exists a $\bar{w}$ such that


$g(\bar{w}) < 0$, which is equivalent to $q(\bar{w}) < \tau$ where $\tau = q(w_{\rm reg}^\star)$. This can be satisfied, for example, at $\bar{w} = 0$ for the penalty terms considered before in (51.15). Then, the same KKT argument used earlier in this appendix under $\ell_1$- and elastic-net regularization shows that the following two problems are equivalent (meaning they have the same solution vectors):

$$w_{\rm reg}^\star = \mathop{\rm argmin}_{w\in\mathbb{R}^M}\ \Big\{ q(w) + P(w) \Big\} \ \Longleftrightarrow\ w_{\rm cons}^\star = \mathop{\rm argmin}_{w\in\mathbb{R}^M}\ P(w)\ \ \text{subject to}\ \ q(w) \le q(w_{\rm reg}^\star) \qquad (51.94)$$

51.B EXPRESSION FOR LASSO SOLUTION

In this appendix we establish Theorem 51.1 for the solution of the LASSO problem under $\ell_1$- and elastic-net regularization using a duality argument patterned after the derivation in Chen, Towfic, and Sayed (2015); other related arguments appear in Mota et al. (2012, 2013). We assume first that $\rho \neq 0$. We start by introducing the Lagrangian function:

$$L(w,z,\lambda) \stackrel{\Delta}{=} \frac{1}{N}\|d - z\|^2 + q(w) + \lambda^{\sf T}(z - Hw) \qquad (51.95)$$

where $\lambda \in \mathbb{R}^N$ is the dual variable (or Lagrange multiplier). The dual function is defined by minimizing $L(w,z,\lambda)$ over $\{w,z\}$:

$$D(\lambda) \stackrel{\Delta}{=} \min_{w,z}\ L(w,z,\lambda) \quad \text{(dual function)}$$
$$= \min_{z}\Big\{\frac{1}{N}\|d - z\|^2 + \lambda^{\sf T} z\Big\} + \min_{w}\Big\{q(w) - \lambda^{\sf T} Hw\Big\} \qquad (51.96)$$

where we are grouping separately the terms that depend on $z$ and $w$. Once this dual function is determined, as shown by expression (51.107), maximizing it leads to the optimal value for $\lambda$ – see (51.62):

$$\lambda^o = \mathop{\rm argmax}_{\lambda\in\mathbb{R}^N}\ D(\lambda) \qquad (51.97)$$

Strong duality will then imply that we can determine the optimal solutions for $\{w,z\}$ for formulation (51.60) by using this value for $\lambda^o$, namely, by solving:

$$z^o = \mathop{\rm argmin}_{z\in\mathbb{R}^N}\Big\{\frac{1}{N}\|d - z\|^2 + (\lambda^o)^{\sf T} z\Big\} \ \Longrightarrow\ z^o = d - \frac{N}{2}\lambda^o \qquad (51.98a)$$
$$w_{\rm reg}^\star = \mathop{\rm argmin}_{w\in\mathbb{R}^M}\Big\{q(w) - (\lambda^o)^{\sf T} Hw\Big\} \qquad (51.98b)$$

Expression (51.98a) shows how $z^o$ is determined from $\lambda^o$. We still need to show how to solve (51.98b) and determine the regularized solution in terms of $\lambda^o$. We can pursue this task by appealing to result (51.47). Indeed, note that


$$\begin{aligned}
w_{\rm reg}^\star &= \mathop{\rm argmin}_{w\in\mathbb{R}^M}\ \Big\{ q(w) - (\lambda^o)^{\sf T} Hw \Big\} \\
&= \mathop{\rm argmin}_{w\in\mathbb{R}^M}\ \Big\{ \alpha\|w\|_1 + \rho\|w\|^2 - (\lambda^o)^{\sf T} Hw \Big\} \\
&= \mathop{\rm argmin}_{w\in\mathbb{R}^M}\ \Big\{ \frac{\alpha}{\rho}\|w\|_1 + \|w\|^2 - \frac{1}{\rho}(\lambda^o)^{\sf T} Hw \Big\} \\
&= \mathop{\rm argmin}_{w\in\mathbb{R}^M}\ \Big\{ \frac{\alpha}{\rho}\|w\|_1 + \Big\|w - \frac{1}{2\rho}H^{\sf T}\lambda^o\Big\|^2 - \frac{1}{(2\rho)^2}\big\|H^{\sf T}\lambda^o\big\|^2 \Big\}
\end{aligned} \qquad (51.99)$$

Using result (51.47) we conclude that (51.61) holds.

Determining the dual variable $\lambda^o$. To complete the argument, we still need to determine $\lambda^o$, which is the maximizer for the dual function $D(\lambda)$ defined by (51.96). We first determine $D(\lambda)$. From (51.96) we observe that we need to minimize two separate terms, one over $z$ and one over $w$. We already know from the above argument that for any $\lambda$:

$$\mathop{\rm argmin}_{w\in\mathbb{R}^M}\ \Big\{ q(w) - \lambda^{\sf T} Hw \Big\} \ \Longrightarrow\ w_\lambda^\star = \frac{1}{2\rho}\,T_\alpha(H^{\sf T}\lambda) \qquad (51.100)$$

where we are denoting the minimizer for a generic $\lambda$ by the notation $w_\lambda^\star$. Consequently, the minimum value for this first minimization is given by

$$\begin{aligned}
q(w_\lambda^\star) - \lambda^{\sf T} Hw_\lambda^\star &= \alpha\|w_\lambda^\star\|_1 + \rho\|w_\lambda^\star\|^2 - \lambda^{\sf T} Hw_\lambda^\star \\
&\stackrel{(51.100)}{=} \frac{1}{2\rho}\Big( \alpha\|T_\alpha(H^{\sf T}\lambda)\|_1 + \frac{1}{2}\|T_\alpha(H^{\sf T}\lambda)\|^2 - \lambda^{\sf T} H\,T_\alpha(H^{\sf T}\lambda) \Big)
\end{aligned} \qquad (51.101)$$

To simplify the notation, we let, for any vector $x$:

$$S_\alpha(x) \stackrel{\Delta}{=} -\alpha\|T_\alpha(x)\|_1 - \frac{1}{2}\|T_\alpha(x)\|^2 + x^{\sf T} T_\alpha(x) \qquad (51.102)$$

Then, it is verified in Prob. 51.8 that

$$S_\alpha(x) = \frac{1}{2}\|T_\alpha(x)\|^2 \qquad (51.103)$$

In this way, we can rewrite the minimum value (51.101) more compactly as

$$q(w_\lambda^\star) - \lambda^{\sf T} Hw_\lambda^\star = -\frac{1}{2\rho}S_\alpha(H^{\sf T}\lambda) = -\frac{1}{4\rho}\|T_\alpha(H^{\sf T}\lambda)\|^2 \qquad (51.104)$$

For illustration purposes, Fig. 51.6 plots the soft-thresholding functions $T_\alpha(x)$ and $S_\alpha(x)$ for $\alpha = 1$ and a scalar argument, $x$. Let us now consider the first minimization in (51.96) for any $\lambda$:

$$\min_{z\in\mathbb{R}^N}\Big\{\frac{1}{N}\|d - z\|^2 + \lambda^{\sf T} z\Big\} \ \Longrightarrow\ \widehat{z}_\lambda = d - \frac{N}{2}\lambda \qquad (51.105)$$

where we are denoting the minimizer for a generic $\lambda$ by the notation $\widehat{z}_\lambda$. Consequently, the minimum value for this minimization is given by

$$\frac{1}{N}\|d - \widehat{z}_\lambda\|^2 + \lambda^{\sf T}\widehat{z}_\lambda = \lambda^{\sf T} d - \frac{N}{4}\|\lambda\|^2 \qquad (51.106)$$



Figure 51.6 Plots of the soft-thresholding functions Tα (x) and Sα (x) for α = 1.
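The operator $T_\alpha$ is the entrywise soft-thresholding map, $T_\alpha(x) = {\rm sign}(x)\max\{|x|-\alpha,\,0\}$, and the identity (51.103) is easy to confirm numerically. The following sketch is illustrative only; the function names are my own.

```python
# Numerical check of S_alpha(x) = 0.5*||T_alpha(x)||^2 from (51.103).
import numpy as np

def T(x, alpha):
    """Entrywise soft-thresholding: shrink each entry of x toward zero by alpha."""
    return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

def S(x, alpha):
    """S_alpha(x) as defined in (51.102)."""
    t = T(x, alpha)
    return -alpha * np.sum(np.abs(t)) - 0.5 * np.dot(t, t) + np.dot(x, t)

rng = np.random.default_rng(0)
x, alpha = rng.standard_normal(10), 1.0
print(np.isclose(S(x, alpha), 0.5 * np.dot(T(x, alpha), T(x, alpha))))   # True
```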

Adding this result to (51.104) we find that the dual function is given by

$$D(\lambda) = \lambda^{\sf T} d - \frac{N}{4}\|\lambda\|^2 - \frac{1}{4\rho}\big\|T_\alpha\big(H^{\sf T}\lambda\big)\big\|^2 \qquad (51.107)$$

It can be verified that this function is strongly concave and, therefore, has a unique maximum (see the next example). The desired dual variable, $\lambda^o$, is therefore given by (51.62). The proof technique used so far requires $\rho > 0$. This condition was used to complete the squares in step (51.99). We now explain how to handle the situation $\rho = 0$, which corresponds to pure $\ell_1$-regularization. The dual variable $\lambda^o$ will now be found by solving the projection problem (51.112). The details are as follows. We revisit step (51.99) when $\rho = 0$ and note that it reduces to solving a problem of the form:

$$\min_{w\in\mathbb{R}^M}\ \Big\{ \alpha\|w\|_1 - \lambda^{\sf T} Hw \Big\} \qquad (51.108)$$

Let $\mathcal{C}$ denote the convex set of vectors satisfying $\|x\|_\infty \le 1$. We established earlier in Table 8.4 and Prob. 8.55 the following conjugate pair:

$$r(w) = \|w\|_1 \ \Longrightarrow\ r^\star(x) = \mathbb{I}_{\mathcal{C},\infty}[x] \qquad (51.109)$$

where the notation $\mathbb{I}_{\mathcal{C},\infty}[x]$ represents the indicator function relative to the set $\mathcal{C}$: It assumes the value zero if $x \in \mathcal{C}$ and $+\infty$ otherwise. In light of definition (51.114) for the conjugate function, we find that the minimum value of problem (51.108) is given by

$$\min_{w\in\mathbb{R}^M}\ \Big\{ \alpha\|w\|_1 - \lambda^{\sf T} Hw \Big\} = -\mathbb{I}_{\mathcal{C},\infty}\big[H^{\sf T}\lambda/\alpha\big] \qquad (51.110)$$

Adding this value to (51.106) we find that the dual function is now given by

$$D(\lambda) = \lambda^{\sf T} d - \frac{N}{4}\|\lambda\|^2 - \mathbb{I}_{\mathcal{C},\infty}\big[H^{\sf T}\lambda/\alpha\big] \qquad (51.111)$$

Maximizing $D(\lambda)$ over $\lambda$ results in $\lambda^o$. To do so, we complete the squares over $\lambda$ to find that the maximization of $D(\lambda)$ is equivalent to solving:

$$\lambda^o = \mathop{\rm argmin}_{\lambda\in\mathbb{R}^N}\ \Big\|\lambda - \frac{2}{N}d\Big\|^2, \quad \text{subject to}\ \ \|H^{\sf T}\lambda\|_\infty \le \alpha \qquad (51.112)$$

The minimizer $\lambda^o$ is obtained by projecting $\frac{2}{N}d$ onto the set of all vectors $\lambda$ satisfying $\|H^{\sf T}\lambda\|_\infty \le \alpha$. Using $z = Hw$ and $z^o = d - \frac{N}{2}\lambda^o$, we conclude that the optimal solution $w_{\rm reg}^\star$ also satisfies the equation

$$Hw_{\rm reg}^\star = d - \frac{N}{2}\lambda^o = \frac{N}{2}\underbrace{\Big(\frac{2}{N}d - \lambda^o\Big)}_{\rm residual} \qquad (51.113)$$

in terms of the residual resulting from projecting $\frac{2}{N}d$ onto the set $\|H^{\sf T}\lambda\|_\infty \le \alpha$.
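For the elastic-net case $\rho > 0$, the entire construction can be checked numerically: maximize the strongly concave dual (51.107) by gradient ascent, recover $w_{\rm reg}^\star = \frac{1}{2\rho}T_\alpha(H^{\sf T}\lambda^o)$ as in (51.100), and compare against a direct proximal-gradient (ISTA) solution of the primal problem. The sketch below is an illustration under these assumptions; the data, step sizes, and iteration counts are arbitrary choices.

```python
# Illustrative check of the elastic-net duality argument (rho > 0). Not from the text.
import numpy as np

def T(x, a):
    return np.sign(x) * np.maximum(np.abs(x) - a, 0.0)   # soft-thresholding

rng = np.random.default_rng(1)
N, M, alpha, rho = 50, 8, 0.1, 0.5
H = rng.standard_normal((N, M))
d = rng.standard_normal(N)

# gradient ascent on D(lam) = lam^T d - (N/4)||lam||^2 - (1/(4*rho))||T_alpha(H^T lam)||^2
lam = np.zeros(N)
for _ in range(5000):
    grad = d - (N / 2) * lam - (1 / (2 * rho)) * H @ T(H.T @ lam, alpha)
    lam = lam + 0.005 * grad
w_dual = (1 / (2 * rho)) * T(H.T @ lam, alpha)           # recovered solution

# direct primal solution of (1/N)||d - Hw||^2 + alpha*||w||_1 + rho*||w||^2 via ISTA
w, mu = np.zeros(M), 0.1
for _ in range(5000):
    grad = -(2 / N) * H.T @ (d - H @ w) + 2 * rho * w
    w = T(w - mu * grad, mu * alpha)

print(np.linalg.norm(w - w_dual))                        # should be close to zero
```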

Example 51.6 (Duality and conjugate functions) There is an alternative way to arrive at the same expression (51.62) under elastic-net regularization by calling upon the concept of conjugate functions, also called Fenchel conjugate functions. This alternative argument is useful for situations (other than least-squares) when explicit expressions for the individual minimum values (51.104) and (51.106) may not be directly available but can be expressed in terms of conjugate functions. We first recall the definition of conjugate functions from (8.83). Consider an arbitrary function $r(w): \mathbb{R}^M \to \mathbb{R}$ with domain ${\rm dom}(r)$; the function $r(w)$ need not be convex. Its conjugate function is denoted by $r^\star(\lambda): \mathbb{R}^M \to \mathbb{R}$ and is defined as:

$$r^\star(\lambda) \stackrel{\Delta}{=} \sup_{w\in\mathbb{R}^M}\ \Big\{ \lambda^{\sf T} w - r(w) \Big\}, \quad \lambda \in \mathcal{Y} \qquad (51.114)$$

where $\mathcal{Y}$ denotes the set of all $\lambda$ where the supremum operation is finite. It can be verified that $r^\star(\lambda)$ is convex regardless of whether $r(w)$ is convex or not. Likewise, the set $\mathcal{Y}$ is a convex set – recall Prob. 8.47 and Table 8.4. If $r(w)$ happens to be strongly convex, then $\mathcal{Y} = \mathbb{R}^M$ (i.e., the sup is finite for all $\lambda$). Now, consider the quadratic function $f(w) = \|w\|^2$ and observe that the dual function $D(\lambda)$ in (51.96) can be written as:

$$\begin{aligned}
D(\lambda) &= -\sup_{z\in\mathbb{R}^N}\Big\{-\lambda^{\sf T} z - \frac{1}{N}\|d - z\|^2\Big\} - \sup_{w\in\mathbb{R}^M}\Big\{\lambda^{\sf T} Hw - q(w)\Big\} \\
&= -\sup_{z\in\mathbb{R}^N}\Big\{\lambda^{\sf T}(d - z) - \frac{1}{N}\|d - z\|^2\Big\} + \lambda^{\sf T} d - q^\star(H^{\sf T}\lambda) \\
&= -\sup_{s\in\mathbb{R}^N}\Big\{\lambda^{\sf T} s - \frac{1}{N}\|s\|^2\Big\} + \lambda^{\sf T} d - q^\star(H^{\sf T}\lambda), \quad s \stackrel{\Delta}{=} d - z \\
&= -\frac{1}{N}\sup_{s\in\mathbb{R}^N}\Big\{N\lambda^{\sf T} s - \|s\|^2\Big\} + \lambda^{\sf T} d - q^\star(H^{\sf T}\lambda) \\
&= -\frac{1}{N}f^\star(N\lambda) + \lambda^{\sf T} d - q^\star(H^{\sf T}\lambda)
\end{aligned} \qquad (51.115)$$

where $f^\star(\lambda)$ and $q^\star(\lambda)$ denote the conjugate functions of

$$f(w) = \|w\|^2, \qquad q(w) = \alpha\|w\|_1 + \rho\|w\|^2 \qquad (51.116)$$

Both functions, $f(w)$ and $q(w)$, are strongly convex and, therefore, the domains of their conjugate functions are the entire space, $\mathbb{R}^M$. Moreover, since $f(w)$ and $q(w)$ are strongly convex, it follows from the properties of conjugate functions that $f^\star(\lambda)$ and $q^\star(\lambda)$ are differentiable – recall Table 8.4. This implies that $D(\lambda)$ is strongly concave (i.e., its negative is strongly convex), differentiable, and has a unique maximizer, $\lambda^o$. It can be verified that the conjugate functions for $f(w)$ and $q(w)$ are given by (see Prob. 51.18):

$$f(w) = \|w\|^2 \ \Longrightarrow\ f^\star(\lambda) = \frac{1}{4}\|\lambda\|^2 \qquad (51.117a)$$
$$q(w) = \alpha\|w\|_1 + \rho\|w\|^2 \ \Longrightarrow\ q^\star(\lambda) = \frac{1}{4\rho}\|T_\alpha(\lambda)\|^2 \qquad (51.117b)$$

Substituting into (51.115), we find that the dual function is given by (51.107).

REFERENCES Bach, F., R. Jenatton, J. Mairal, and G. Obozinski (2012), “Optimization with sparsityinducing penalties,” Found. Trends Mach. Learn., vol. 4, no. 1, pp. 1–106. Bjorck, A. (1996), Numerical Methods for Least Squares Problems, SIAM. Chan, R. H. and H. X. Liang (2014), “Half-quadratic algorithm for `p − `q problems with applications to TV-1 image restoration and compressive sensing,” in Efficient Algorithms for Global Optimization Methods in Computer Vision, A. Bruhn, T. Pock, and X.-C. Tai, editors, pp. 78–103, Springer. Chandrasekaran, S., G. Golub, M. Gu, and A. H. Sayed (1997), “Parameter estimation in the presence of bounded modeling errors,” IEEE Signal Process. Lett., vol. 4, no. 7, pp. 195–197. Chandrasekaran, S., G. Golub, M. Gu, and A. H. Sayed (1998), “Parameter estimation in the presence of bounded data uncertainties,” SIAM. J. Matrix Anal. Appl., vol. 19, no. 1, pp. 235–252. Chen, S. and D. Donoho (1994), “Basis pursuit,” Proc. Asilomar Conf. Signals, Systems and Computers, pp. 41–44, Pacific Grove, CA. Chen, S. S., D. L. Donoho, and M. A. Saunders (1998), “Atomic decomposition by basis pursuit,” SIAM J. Sci. Comput., vol. 20, no. 1 pp. 33–61. Republished in SIAM Rev., vol. 43, no. 1, pp. 129–159, 2001. An earlier draft has been available since 1995 as a technical report, Department of Statistics, Stanford University. Chen, S., D. Donoho, and M. Saunders (2001), “Atomic decomposition by basis pursuit,” SIAM Rev., vol. 43, no. 1, pp. 129–159. Chen, J., Z. J. Towfic, and A. H. Sayed (2015), “Dictionary learning over distributed models,” IEEE Trans. Signal Process., vol. 63, no. 4, pp. 1001–1016. Claerbout, J. F. and F. Muir (1973), “Robust modeling with erratic data,’ Geophysics, vol. 38, no. 5, pp. 826–844. Dahleh, M. A. and I. Diaz-Bobillo (1995), Control of Uncertain Systems: A Linear Programming Approach, Prentice Hall. Dahleh, M. A. and J. B. Pearson (1986), “`1 -optimal feedback controllers for discretetime systems,” Proc. American Control Conf. (ACC), pp. 1964–1968, Seattle, WA. Dahleh, M. A. and J. B. Pearson (1987), “`1 -optimal feedback controllers for MIMO discrete-time systems,” IEEE Trans. Aut. Control, vol. 32, pp. 314–322. De Mol, C., E. De Vito, and L. Rosasco (2009), “Elastic-net regularization in learning theory,” J. Complexity, vol. 25, no. 2, pp. 201–230. El Ghaoui, L. and H. Lebret (1997), “Robust solutions to least-squares problems with uncertain data,” SIAM. J. Matrix Anal. Appl., vol. 18, no. 4, pp. 1035–1064. Engl, H. W., M. Hanke, and A. Neubauer (1996), Regularization of Inverse Problems, Kluwer.


Frank, I. E. and J. H. Friedman (1993), “A statistical view of some chemometrics regression tools,” Technometrics, vol. 35, pp. 109–148. Fu, W. J. (1998), “Penalized regressions: The bridge versus the Lasso,” J. Comput. Graphical Statist., vol. 7, no. 3, pp. 397–416. Geman, D. and C. Yang (1995), “Nonlinear image recovery with half-quadratic regularization,” IEEE Trans. Image Process., vol. 4, no. 7, pp. 932–946. Golub, G. H. and C. F. Van Loan (1996), Matrix Computations, 3rd ed., John Hopkins University Press. Hansen, P. C. (1997), Rank-Deficient and Discrete Ill-Posed Problems, SIAM. Hastie, T., R. Tibshirani, and J. Friedman (2009), The Elements of Statistical Learning, 2nd ed., Springer. Hoerl, A. E. (1959), “Optimum solution of many variables equations,” Chem. Eng. Prog., vol. 55, pp. 69–78. Hoerl, A. E. (1962), “Application of ridge analysis to regression problems,” Chem. Eng. Prog., vol. 58, pp. 54–59. Hoerl, A. E. and R. W. Kennard (1970), “Ridge regression: Biased estimation for nonorthogonal problems,” Technometrics, vol. 12, no. 1, pp. 55–67. Hoerl, R. W. (1985), “Ridge analysis 25 years later,” Amer. Statist., vol. 39, pp. 186–192. Lanza, A., S. Morigi, L. Reichel, and F. Sgallari (2015), “A generalized Krylov subspace method for `p − `q minimization,” SIAM J. Sci. Comput., vol. 37, no. 5, pp. 30–50. Levy, S. and P. K. Fullagar (1981), “Reconstruction of a sparse spike train from a portion of its spectrum and application to high-resolution deconvolution,” Geophysics, vol. 46, no. 9, pp. 1235–1243. Mota, J., J. Xavier, P. Aguiar, and M. Puschel (2012), “Distributed basis pursuit,” IEEE Trans. Signal Process., vol. 60, no. 4, pp. 1942–1956. Mota, J., J. Xavier, P. Aguiar, and M. Puschel (2013), “D-ADMM: A communicationefficient distributed algorithm for separable optimization,” IEEE Trans. Signal Process., vol. 61, no. 10, pp. 2718–2723. Neumaier, A. (1998), “Solving ill-conditioned and singular linear systems: A tutorial on regularization,” SIAM Rev., vol. 40, no. 3, pp. 636–666. Oldenburg, D., W. T. Scheuer, and S. Levy (1983), “Recovery of the acoustic impedance from reflection seismograms”, Geophysics, vol. 48, no. 10, pp. 1318–1337. Phillips, D. L. (1962), “A technique for the numerical solution of certain integral equations of the first kind,” J. ACM, vol. 9, no. 1, pp. 84–97. Santosa, F. and W. W. Symes (1986), “Linear inversion of band-limited reflection seismograms,” SIAM J. Sci. Statist. Comput., vol. 7, no. 4, pp. 1307–1330. Sayed, A. H. (2003), Fundamentals of Adaptive Filtering, Wiley. Sayed, A. H. (2008), Adaptive Filters, Wiley. Sayed, A. H. and H. Chen (2002), “A uniqueness result concerning a robust regularized least-squares solution,” Syst. Control Lett., vol. 46, pp. 361–369. Sayed, A. H., V. H. Nascimento, and S. Chandrasekaran (1998), “Estimation and control with bounded data uncertainties,” Linear Algebra Appl., vol. 284, pp. 259–306. Sayed, A. H., V. Nascimento, and F. A. M. Cipparrone (2002), “A regularized robust design criterion for uncertain data,” SIAM J. Matrix Anal. Appl., vol. 23, no. 4, pp. 1120–1142. Taylor, H. L., S. C. Banks, and J. F. McCoy (1979), “Deconvolution with the `1 norm,” Geophysics, vol. 44, no. 1, pp. 39–52. Tibshirani, R. (1996b), “Regression shrinkage and selection via the Lasso,” J. Roy. Statist. Soc. Ser. B, vol. 58, no. 1, pp. 267–288. Tikhonov, A. N. (1963), “Solution of incorrectly formulated problems and the regularization method,” Soviet Math. Dokl., vol. 4, pp. 1035–1038. Tikhonov, A. N. and V. Y. 
Arsenin (1977), Solutions of Ill-Posed Problems, Winston. Vidyasagar, M. (1986), “Optimal rejection of persistent bounded disturbances,” IEEE Trans. Aut. Control, vol. 31, no. 6, pp. 527–534. Wahba, G. (1990), Spline Models for Observational Data, SIAM.


Zou, H. and T. Hastie (2005), “Regularization and variable selection via the elastic net,” J. Roy. Statist. Soc. Ser. B, vol. 67, no. 2, pp. 301–320.

52 Nearest-Neighbor Rule

We encountered one instance of Bayesian inference in Chapter 50, based on the quadratic loss in the context of mean-square-error (MSE) estimation. We explained there that the optimal solution for inferring a hidden zero-mean random variable $x$ from observations of another zero-mean random variable $y$ is given by the conditional estimator, $E(x|y)$, whose computation requires knowledge of the conditional distribution, $f_{x|y}(x|y)$. Even when the estimator $\widehat{x} = c(y)$ is restricted to affine functions of $y$, the solution continues to require knowledge of some statistical moments of $\{x, y\}$ in the form of their variances or covariances, $\{\sigma_x^2, r_{xy}, R_y\}$. We addressed this challenge in the previous two chapters by using a collection of training data measurements $\{x(n), y_n\}$ arising from the joint distribution $f_{x,y}(x,y)$ to replace the stochastic risk, $E(x - y^{\sf T} w)^2$, by an empirical least-squares risk, with and without regularization such as:

$$w^\star \stackrel{\Delta}{=} \mathop{\rm argmin}_{w\in\mathbb{R}^M}\Big\{\alpha\|w\|_1 + \rho\|w\|^2 + \frac{1}{N}\sum_{n=0}^{N-1}\big(x(n) - y_n^{\sf T} w\big)^2\Big\} \qquad (52.1)$$

where $\alpha$ and $\rho$ are nonnegative regularization factors. Moving forward, we will consider more general Bayesian inference problems involving other types of loss functions $Q(x,\widehat{x})$, besides the quadratic loss:

$$\widehat{x}_Q \stackrel{\Delta}{=} \mathop{\rm argmin}_{\widehat{x}=c(y)}\ E\,Q(x,\widehat{x}) \qquad (52.2)$$

We already know from result (28.5) that here too the optimal solution $\widehat{x}_Q$ requires knowledge of the conditional probability density function (pdf), $f_{x|y}(x|y)$, since

$$\widehat{x}_Q = \mathop{\rm argmin}_{\widehat{x}=c(y)}\ E_{x|y}\Big\{Q(x,\widehat{x})\,\big|\,y = y\Big\} \qquad (52.3)$$

where the expectation of the loss function is evaluated relative to $f_{x|y}(x|y)$. We will follow two paths. One path is similar to what we did in the previous two chapters for MSE estimation. We will replace the stochastic risk in (52.2) by an empirical risk, add regularization, use an affine model for $c(y)$, and then apply some stochastic approximation algorithm to learn the solution. This construction would amount to solving problems of the form:

$$w^\star \stackrel{\Delta}{=} \mathop{\rm argmin}_{w\in\mathbb{R}^M}\Big\{q(w) + \frac{1}{N}\sum_{n=0}^{N-1} Q\big(x(n), \widehat{x}(n)\big)\Big\}, \quad \widehat{x}(n) = y_n^{\sf T} w \qquad (52.4)$$


where $q(w)$ denotes the regularization factor. This first approach will be studied at great length in later chapters in the context of the perceptron algorithm, support vector machines, kernel methods, neural networks, and other related methods. The main difference between these methods will be the choice of the loss function $Q(\cdot,\cdot)$ and the way by which the predictor $\widehat{x}$ is constructed from $y$. While most methods will employ affine constructions, kernel methods and the neural network structure will allow for some nonlinear mappings from $y$ to $\widehat{x}$. In the current and next few chapters, however, we will follow a second, more direct path to solving (52.2). We will introduce data-based methods that infer either the conditional pdf $f_{x|y}(x|y)$ or the joint pdf $f_{x,y}(x,y)$ directly from the data, rather than minimize an empirical risk. In these investigations, we will focus on the important case of predicting the label of a random variable $x$ from observations $y$. Specifically, we will focus on the classification problem where $x$ is discrete and assumes one of two binary values, $+1$ or $-1$. We will also consider multiclass problems where $x$ can assume one of a multitude of discrete levels.

NOTATION: Regression vs. Classification
Before proceeding, we motivate a change in notation. From this point onward in our presentation, we will be dealing mainly with classification problems where the unknown $x$ assumes discrete values. The variable $x$ can be either binary-valued, such as $x \in \{-1,+1\}$ or $x \in \{0,1\}$, or multivalued, such as assuming integer values $x \in \{1,2,\ldots,R\}$. In order to emphasize the fact that the hidden variable is discrete, we will henceforth use the Greek symbol $\gamma$ to refer to a binary discrete variable and the normal symbol $r$ to refer to a multilevel discrete variable:

(notation for discrete hidden variables)
$$\gamma \in \{-1, 1\} \ \text{ or } \ \gamma \in \{0, 1\} \quad \text{(binary values)} \qquad (52.5a)$$
$$r \in \{1, 2, 3, \ldots, R\} \quad \text{(integer values)} \qquad (52.5b)$$

Both $\gamma$ and $r$ are random variables, just like the notation $x$. By introducing these symbols, it becomes easier for the reader to recognize whether a statement is dealing with discrete or continuous variables, a classification or regression problem, and whether the discrete variable itself is binary or multilevel. We will refer to $\{\gamma, r\}$ as the class or label variable. For similar reasons, we will replace the observation variable $y$ by the letter $h$ and refer to it as the feature vector. In this way, regression problems deal with variables $(x, y)$ while classification problems deal with variables $(\gamma, h)$ or $(r, h)$:

$$\left\{\begin{array}{l} \text{notation } (x,y) \text{ reserved for regression/estimation problems} \\ \text{notation } (\gamma,h) \text{ or } (r,h) \text{ reserved for classification problems} \end{array}\right. \qquad (52.6)$$

In the context of classification, each entry of $h$ is called an attribute. These entries will generally assume numerical values, but they can also be categorical, such as when an attribute refers to the color of an object (say, red, blue, or yellow) or its size (say, small, medium, or large). It is customary to transform categorical


entries into numerical values, as explained in a later chapter when we discuss decision trees, so that it is sufficient for our purposes to treat h as a vector with numerical entries.

52.1 BAYES CLASSIFIER

We review briefly the Bayes classifier solution from Section 28.3 in view of the new notation for classification problems. Given a feature vector $h \in \mathbb{R}^M$, we are interested in deducing its label $r \in \{1,2,\ldots,R\}$ by seeking a mapping $c(h): \mathbb{R}^M \to \{1,2,\ldots,R\}$ that minimizes the probability of error, namely,

$$\widehat{r}_{\rm bayes} = \mathop{\rm argmin}_{c(h)}\ P\big(c(h) \neq r\big) \qquad (52.7)$$

We know from (28.67) that the optimal solution is given by the maximum a-posteriori (MAP) estimator:

$$\widehat{r}_{\rm bayes} = \mathop{\rm argmax}_{r\in\{1,2,\ldots,R\}}\ P(r = r\,|\,h = h) \qquad (52.8)$$

We denote the optimal mapping that corresponds to this construction by $c^\bullet(h)$ using the bullet superscript:

$$\widehat{r}_{\rm bayes} = c^\bullet(h) \quad \text{(Bayes classifier)} \qquad (52.9)$$

In our notation, the $\bullet$ superscript will refer to the ideal solution that we are aiming to achieve. As seen from (52.8), this solution requires knowledge of the conditional probability distribution $P(r = r\,|\,h = h)$, which is generally unavailable. The $\star$ superscript, such as writing $c^\star(h)$, will refer to approximations obtained by solving more tractable formulations:

$$\widehat{r} = c^\star(h) \quad \text{(approximate classifier)} \qquad (52.10)$$

In this and the next few chapters, we will describe data-based methods that infer either the conditional pdf P(r = r|h = h) or the joint probability distribution fr,h (r, h) from the data and lead to approximate classifiers c? (h). Among these methods we list the nearest-neighbor (NN) rule of this chapter, the naïve Bayes classifier, the linear and Fisher discriminant analysis methods (LDA, FDA), and the logistic regression method. Methods that approximate the conditional probabilities P(r = r|h = h) are referred to as discriminative, whereas methods that approximate the joint pdf fr,h (r, h), or its components P(r = r) and the reverse conditional fh|r (h|r), are referred to as generative. This is because discriminative techniques allow us to discriminate between the classes, while generative techniques allow us to generate additional data {r, h} that mimic the distribution of the training data. Before explaining the steps involved in the NN construction, it is useful to comment on how performance is assessed in general for classification algorithms that rely on training data. These comments are valid for all learning methods described here and in future chapters.


Classification errors
In classification problems, we make a distinction between training data and test data. The data $\{r(n), h_n\}$ used to train the classifier are referred to as training data. Once a classifier $\widehat{r} = c^\star(h)$ is learned, we can evaluate its performance on the training data by counting the number of erroneous decisions that would result if the classifier were applied to that data. This measure results in the training error, also called the empirical error rate, and is evaluated as follows:

$$R_{\rm emp}(c^\star) = \frac{1}{N}\sum_{n=0}^{N-1}\mathbb{I}\big[c^\star(h_n) \neq r(n)\big] \quad \text{(empirical error on training data)} \qquad (52.11)$$

where the notation $\mathbb{I}[x]$ denotes the indicator function defined by:

$$\mathbb{I}[x] \stackrel{\Delta}{=} \left\{\begin{array}{ll} 1, & \text{when argument } x \text{ is true} \\ 0, & \text{otherwise} \end{array}\right. \qquad (52.12)$$

The argument of the indicator function in (52.11) is comparing the predicted label $\widehat{r}(n) = c^\star(h_n)$ to the true label $r(n)$ for the sample of index $n$. Note that $R_{\rm emp}(c^\star)$ is a number in the range $[0,1]$ and it measures the empirical probability of error on the training data for the classifier $c^\star(h)$. In general, the empirical error will be small because $c^\star(h)$ will be determined with the aim of minimizing it. In most classification applications, however, the main purpose for learning $c^\star(h)$ is to employ it to perform inference on future data that were not part of the training phase. For this reason, it is customary to assess performance on a separate collection of $T$ test data points denoted by $\{r(t), h_t\}$, and which were not part of the training phase but are assumed to arise from the same underlying distribution $f_{r,h}(r,h)$ as the training data. The empirical error rate on the test data is given by

$$R_{\rm emp}(c^\star) \stackrel{\Delta}{=} \frac{1}{T}\sum_{t=0}^{T-1}\mathbb{I}\big[c^\star(h_t) \neq r(t)\big] \quad \text{(empirical error on test data)} \qquad (52.13)$$

where $\widehat{r}(t) = c^\star(h_t)$ denotes the prediction for each test label $r(t)$. We use the same symbol $R_{\rm emp}(\cdot)$ to refer to empirical errors (whether measured on training or test data); it will be clear from the context whether we are referring to one case or the other. In general, the empirical error on test data will be larger than the empirical error on training data, but we desire the gap to be small. Learning algorithms that lead to small error gaps are said to generalize well, namely, they are able to extend their good performance on training data to arbitrary test data as well.
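As a small illustration (not part of the text), the empirical error rates (52.11) and (52.13) amount to counting label mismatches; the helper below accepts any classifier passed as a function handle and can be applied to a training set or to a held-out test set.

```python
# Illustrative helper for the empirical error rates in (52.11) and (52.13).
import numpy as np

def empirical_error(classify, features, labels):
    """Fraction of samples whose predicted label differs from the true label."""
    predictions = np.array([classify(h) for h in features])
    return np.mean(predictions != np.asarray(labels))
```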


classifiers that perform well on a sufficient amount of test data can be expected to generalize well for other data. We formally measure the generalization ability of a classifier $c^\star(h)$ by defining its generalization error as the following expected value (which we also denote by $P_e$ since, as explained by (52.15), it amounts to the probability of error by the classifier):

$$R(c^\star) \stackrel{\Delta}{=} E\,\mathbb{I}\big[c^\star(h) \neq r\big] \quad \text{(generalization error)} \qquad (52.14)$$

where the expectation is over the joint probability distribution of the random data $\{r, h\}$. This risk can be expressed in an equivalent form involving the probability of erroneous decisions since

$$R(c^\star) = 1 \times P\big(c^\star(h) \neq r\big) + 0 \times P\big(c^\star(h) = r\big) = P\big(c^\star(h) \neq r\big) \stackrel{\Delta}{=} P_e \qquad (52.15)$$

That is,

$$\text{probability of error} = \text{generalization error} \qquad (52.16)$$

We will establish in Chapter 64 that, under some reasonable conditions on the structure of a classifier (namely, not too simple and not too complex) and on the amount of training data available (which needs to be large enough), the empirical error on a test dataset provides a good approximation for the generalization error, i.e., $R_{\rm emp}(c^\star) \approx R(c^\star)$. This means that classifiers that perform well on test data will be expected to perform well more broadly on the entire population.

Example 52.1 (The need to generalize) Consider a collection of $N$ feature vectors $\{h_n \in \mathbb{R}^M\}$ and the corresponding labels $r(n) \in \{1,2,\ldots,R\}$, for $n = 0,1,\ldots,N-1$. We can construct a classifier that memorizes the behavior of the training samples perfectly as follows:

$$c(h) = \left\{\begin{array}{ll} r(h), & \text{if } h \in \{h_0, h_1, \ldots, h_{N-1}\} \\ r, & \text{selected randomly from } \{1,2,\ldots,R\}, \text{ for any other } h \end{array}\right. \qquad (52.17)$$

That is, the classifier assigns the label r(n) to each vector h coinciding with one hn from the training set, and assigns a random label r to any other feature vector. Then, the empirical error on the training data for this classifier will be zero (i.e., the smallest it can be), while its empirical error on arbitrary test data can be unacceptably large. We therefore have an example of a classifier that performs exceedingly well on the training data but delivers poor performance on test data. This is an example of overfitting.

52.2 k-NN CLASSIFIER

We consider first binary classification problems with label $\gamma \in \{\pm 1\}$. Assume we have access to $N$ pairs of data points $\{\gamma(n), h_n\}$ where $h_n \in \mathbb{R}^M$ is the $n$th feature vector and $\gamma(n)$ the corresponding label. This collection plays the role of the training dataset. Now, given a new feature $h$, the objective is to determine its most likely label. The NN rule predicts $\gamma$ as follows.

Neighborhoods
The vectors $\{h, h_n\}$ are points in $M$-dimensional space. We define a neighborhood around $h$ consisting of the $k$ nearest feature vectors from the training set to $h$, where closeness is measured in terms of the Euclidean distance (or some other distance metric, if desired). Let the notation $\mathcal{N}_k(h)$ refer to the indices of the $k$ closest neighbors to $h$ from within the training set $\{h_n\}$:

$$\mathcal{N}_k(h) \stackrel{\Delta}{=} \big\{\text{index set of } k \text{ closest neighbors to } h \text{ from training set } \{h_n\}\big\} \qquad (52.18)$$

If we envision a hypersphere centered at location $h$ and engulfing the neighborhood of $h$, then the radius of the sphere should be large enough to include only $k$ neighboring points within $\mathcal{N}_k(h)$. Figure 52.1 illustrates this construction in the plane for $M = 2$ and $k = 5$. In the figure, training features from class $\gamma = +1$ are represented by circles, while training features from class $\gamma = -1$ are represented by squares. The location of $h$ is represented by a triangle. The figure draws a circle around $h$ that encompasses its five closest neighbors from the training set. The class of $h$ is then declared to be the one corresponding to the majority class among its neighbors. In this case, the feature $h$ is declared to belong to class $+1$ since four of its neighbors belong to this class. In the case of a tie, one can select the class randomly between $+1$ and $-1$ or set it, by convention, to $+1$. The $k$-NN decision rule can be expressed analytically as follows. We first use the neighborhood around $h$ to count the number of neighbors of $h$ that belong to class $+1$:

$$p(h) \stackrel{\Delta}{=} \frac{1}{k}\sum_{n\in\mathcal{N}_k(h)}\mathbb{I}\big[\gamma(n) = +1\big] \qquad (52.19)$$

where $\mathbb{I}[x]$ is the indicator function: Its value is 1 when its argument is true and 0 otherwise. The division by $k$ transforms $p(h)$ into a measure of the fraction of $+1$ neighbors that exist within $\mathcal{N}_k(h)$. The majority vote then translates into applying the following rule to determine the label of $h$:

$$\gamma^\star(h) = \left\{\begin{array}{rl} +1, & \text{if } p(h) \ge 1/2 \\ -1, & \text{otherwise} \end{array}\right. \qquad (52.20)$$

We refer to this mapping from $h$ to its predicted label $\gamma^\star(h)$ by writing $c^\star(h)$. Note that $p(h)$ is approximating the conditional probability $P(\gamma = +1\,|\,h = h)$ that is needed in the implementation of the Bayes classifier:


Figure 52.1 The set of five nearest neighbors around the feature vector $h$ (represented by the triangle) consists of four circular features (belonging to class $+1$) and one square feature (belonging to class $-1$). Accordingly, based on a majority vote, the feature $h$ is declared to belong to class $+1$.

$$p(h) = \widehat{P}(\gamma = +1\,|\,h = h) \qquad (52.21)$$

In other words, the $k$-NN rule uses the training data and the $k$-size neighborhoods to estimate the conditional probabilities $P(\gamma = +1\,|\,h = h)$ locally in order to carry out the classification task.

Example 52.2 (Weighted k-NN) The traditional $k$-NN rule assigns equal weights to all neighbors of a feature vector $h$ before deciding on its label. It is sometimes natural to expect that neighbors that are closer to $h$ are more likely to belong to the same class as $h$ than are neighbors that are farther away from it. In weighted $k$-NN, each neighbor $\ell \in \mathcal{N}_h$ is assigned a nonnegative weight $w_\ell$; for convenience, we normalize the weights to add up to 1. For example, one (but not the only) way to compute the weights is to determine the distances between $h$ and each of its neighbors in $\mathcal{N}_h$ and to normalize the distances by their sum:

$$d_\ell = \|h - h_\ell\|, \quad h_\ell \in \mathcal{N}_h \qquad (52.22a)$$
$$w_\ell = \frac{d_\ell}{\sum_{\ell'\in\mathcal{N}_h} d_{\ell'}} \qquad (52.22b)$$


The resulting decision rule is expressed analytically as follows. We count the weighted number of neighbors of $h$ that belong to class $+1$:

$$p(h) \stackrel{\Delta}{=} \sum_{\ell\in\mathcal{N}_k(h)} w_\ell\,\mathbb{I}\big[\gamma(\ell) = +1\big] \qquad (52.23)$$

and use a majority vote to decide on the label for $h$:

$$\gamma^\star(h) = \left\{\begin{array}{rl} +1, & \text{if } p(h) \ge 1/2 \\ -1, & \text{otherwise} \end{array}\right. \qquad (52.24)$$
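For concreteness, the following sketch (illustrative only; the function and variable names are my own, and ties or zero distances are not handled) implements the rules (52.19)–(52.20) along with the weighted variant (52.22)–(52.24).

```python
# Minimal (weighted) k-NN sketch for binary labels gamma in {-1, +1}.
import numpy as np

def knn_predict(h, H_train, gamma_train, k=5, weighted=False):
    """Predict the label of feature vector h via the (weighted) k-NN rule."""
    dists = np.linalg.norm(H_train - h, axis=1)       # Euclidean distances to all training points
    nbrs = np.argsort(dists)[:k]                      # indices in N_k(h)
    if weighted:
        w = dists[nbrs] / np.sum(dists[nbrs])         # weights as in (52.22b)
    else:
        w = np.full(k, 1.0 / k)                       # equal weights, as in (52.19)
    p = np.sum(w * (gamma_train[nbrs] == +1))         # (weighted) fraction of +1 neighbors
    return +1 if p >= 0.5 else -1                     # majority vote (52.20)/(52.24)
```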

Multiclass classification
The NN rule can be extended to multiclass classification problems with $R$ classes. In this case, we declare $h$ to belong to the class $r$ that receives the majority of votes within its neighborhood.

Figure 52.2 The separation regions generated by applying a 5-NN rule over 150 randomly generated feature vectors $h_n \in \mathbb{R}^2$ arising from three classes: green (class $r = 1$), red (class $r = 2$), and yellow (class $r = 3$).

Figure 52.2 illustrates the separating regions in the plane that would result for k = 5 neighbors and R = 3 classes. The training data is represented by the colored circles (green for r = 1, red for r = 2, and yellow for r = 3). The colored


regions represent the class that would be assigned to any feature vector falling into the region. For example, if a location in the plane is colored red, the color indicates that the majority of the five neighbors to this location will belong to class r = 2. Therefore, any feature h falling into the red region will be assigned to class r = 2, and similarly for the two other colored regions. This figure was generated using a total of N = 150 random training points within the region [0, 1] × [0, 1]. The NN rule is a discriminative method that approximates P(r = r|h = h) directly from the training data, and does not make any assumption about the form of these probabilities. It is an example of a nonparametric learning method, which does not involve learning the parameters of separating surfaces, as will happen with other learning methods discussed in future chapters, such as the perceptron, support vector machines, and neural networks. The NN rule operates directly on the available data and does not even involve a training phase. While the NN construction is straightforward, we will find that it suffers from several important challenges.

Voronoi diagrams
When $k = 1$ (i.e., when classification is decided by considering only the label of the closest neighbor), we can partition the feature space into a Voronoi diagram consisting of cells. Each cell $n$ is characterized by a seed point $h_n$, which is one of the points from the training set. The boundaries of each cell $n$ define a region in space consisting of all $M$-dimensional vectors, $h$, that are closer to $h_n$ than to any other seed. These boundaries can be determined as follows. If we draw line segments connecting any particular seed point, $h_n$, to its neighboring seeds, then the boundaries of cell $n$ would be determined from the bisecting lines that cut these segments in half. Figure 52.3 illustrates this construction for $M = 2$. A total of $N = 100$ random feature vectors are generated in the region $[0,1] \times [0,1]$ and the resulting Voronoi diagram is shown. The lines specify the equidistant boundaries between adjacent feature points; points from class $+1$ are denoted in green while points from class $-1$ are denoted in red. Once the Voronoi diagram is generated, classification by means of the 1-NN rule is achieved automatically as follows. Given a new feature vector, $h$, we determine the cell that it falls into. Then, the class of $h$ is selected to match the class of the seed for that cell.

52.3 PERFORMANCE GUARANTEE

There is a fundamental and reassuring result on the performance of the NN classifier. It is sufficient to describe the result for the 1-NN rule; a similar conclusion applies to $k$-NN and is described in the comments at the end of the chapter. Let $c^\bullet(h)$ denote the optimal Bayes classifier (52.8). This classifier minimizes the probability of misclassification and delivers the label $r^\bullet(h)$. We denote the smallest probability of error that is attained by the classifier by $P_e^{\rm bayes}$.


Figure 52.3 Voronoi diagram for 100 randomly generated feature vectors $h_n \in \mathbb{R}^2$. Points in green belong to class $+1$ while points in red belong to class $-1$.

Let $c^\star(h)$ denote the 1-NN classifier that results from the training data $\{r(n), h_n\}$ of size $N$. This is the classifier that is defined by the Voronoi diagram corresponding to this data. The generalization error (or probability of misclassification) for this classifier over the entire distribution of the data is given by expression (52.15), which we denote by:

$$P_e \stackrel{\Delta}{=} P\big(c^\star(h) \neq r\big) = R(c^\star) \qquad (52.25)$$

The following classical result now holds; its proof appears in Appendix 52.A – see Prob. 52.1 for an alternative argument.

Theorem 52.1. (Generalization error of 1-NN classifiers) Consider a multiclass classification problem with $R$ labels, $r \in \{1,2,\ldots,R\}$. Let $P_e^{\rm bayes}$ denote the smallest probability of error attained by the Bayes classifier (52.8). Let $P_e$ denote the probability of error attained by the 1-NN classifier (i.e., its generalization error). Then, for independent realizations $\{r(n), h_n\}$ and for large sample sizes $N \to \infty$, it holds that

$$P_e^{\rm bayes} \ \le\ P_e \ \le\ P_e^{\rm bayes}\Big(2 - \frac{R}{R-1}P_e^{\rm bayes}\Big) \ \le\ 2P_e^{\rm bayes} \qquad (52.26)$$


Result (52.26) means that the probability of error of the 1-NN classifier is at most twice as bad as the best possible performance given by the optimal Bayes classifier. The result also means that any other classifier structure can at most reduce the probability of error of 1-NN by one-half.

Challenges
While the $k$-NN rule is appealing, it faces some important challenges that limit its performance. The classifier is sensitive to noise and outliers, and requires that the training data be stored and processed continuously. Moreover, the following properties are evident:

(C1) The classifier treats equally all entries (i.e., attributes) of the feature vector. If, for example, some attributes are more relevant to the classification task than the remaining attributes, this aspect is ignored by the $k$-NN implementation because all entries in the feature vector contribute similarly to the calculation of Euclidean distances and the determination of neighborhoods.

(C2) The $k$-NN classifier does not perform well in high-dimensional feature spaces when $M$ is large for at least two reasons:
(a) First, for each new feature $h$, the classifier needs to perform a search over the entire training set to determine the neighborhood of $h$. This step is demanding for large $M$ and $N$.
(b) Second, and more importantly, in high-dimensional spaces, the training samples $\{h_n\}$ only provide a sparse representation for the behavior of the data distribution $f_{r,h}(r,h)$. The available training examples need not be enough for effective learning.

We comment on these issues in the following, and in the next chapter, and explain how clustering helps ameliorate some of these difficulties.

52.4 k-MEANS ALGORITHM

One way to address challenge (C2a) and reduce the complexity of the search step is to cluster the training data into a small number of clusters, and to base the classification decision on comparisons against the clusters rather than against the entirety of the training dataset. Clustering is a procedure that partitions the $N$ feature vectors $\{h_n \in \mathbb{R}^M\}$ into a small collection of $K$ groups (called clusters), with the expectation that vectors within the same group share similar properties. One popular method to perform clustering is the Lloyd algorithm, also known as the k-means algorithm, which operates as follows.

Algorithm
We select the desired number of clusters, $K$, and assign to each cluster $k$ an initial mean vector denoted by $\mu_k \in \mathbb{R}^M$. There are several ways by which these


initial vectors can be chosen (and their choice influences the performance of the clustering algorithm) – we describe three methods further ahead. Once the initial vectors have been chosen, the k-means algorithm applies repeatedly the operations shown in listing (52.29) and continually updates the mean vectors $\{\mu_k\}$ as follows:

(1) Each feature vector $h_n$ in the training set is assigned to the cluster whose mean $\mu_k$ is the closest to $h_n$ (if there are multiple possible clusters, we select one of them at random):

$$\text{cluster for } h_n \stackrel{\Delta}{=} \mathop{\rm argmin}_{1\le k\le K}\ \|h_n - \mu_k\| \qquad (52.27)$$

Let the notation $\mathcal{C}_k$ represent a generic cluster $k$ (i.e., the collection of the indices of all feature vectors in it).

(2) Following the assignments of the $\{h_n\}$ to clusters, the mean vector $\mu_k$ for each cluster is updated by averaging the feature vectors that ended up within that cluster:

$$\mu_k = \frac{1}{|\mathcal{C}_k|}\sum_{n\in\mathcal{C}_k} h_n \qquad (52.28)$$

where |Ck | denotes the cardinality of the set (the number of its elements). Observe that this algorithm performs clustering in an unsupervised manner; it acts directly on the feature vectors and does not require any class information. The performance of the algorithm is, however, sensitive to the choice of K, the presence of outliers, and the selection of the initial mean vectors.

k-means algorithm (also known as the Lloyd algorithm).
given $N$ feature vectors $\{h_n\}$, of size $M \times 1$ each;
given the number of clusters, $K$;
select $K$ initial mean vectors $\{\mu_k\}$, one for each cluster.
repeat until convergence:
    assign each $h_n$ to the cluster with the closest mean $\mu_k$;
    for each cluster $k$, replace $\mu_k$ by the average of all vectors in it;
end
return clusters $\{\mathcal{C}_k\}$ and their means $\{\mu_k\}$.
(52.29)
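A compact implementation of listing (52.29) could look as follows; this is an illustrative sketch only (it uses Forgy initialization by sampling $K$ training vectors, and empty clusters are simply left unchanged).

```python
# Illustrative k-means (Lloyd) sketch following listing (52.29).
import numpy as np

def kmeans(H, K, num_iter=100, seed=0):
    """Cluster the rows of H (N x M) into K groups; returns assignments and means."""
    rng = np.random.default_rng(seed)
    mu = H[rng.choice(len(H), size=K, replace=False)].astype(float)   # Forgy initialization
    for _ in range(num_iter):
        # assignment step (52.27): index of the closest mean for each feature vector
        dists = np.linalg.norm(H[:, None, :] - mu[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # update step (52.28): replace each mean by the average of its cluster
        for k in range(K):
            if np.any(labels == k):
                mu[k] = H[labels == k].mean(axis=0)
    return labels, mu
```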

Interpretation and derivation
We explain in the comments at the end of the chapter that the k-means clustering algorithm is related to the expectation-maximization (EM) algorithm described earlier in Chapter 32 for Gaussian mixture models. Both algorithms perform clustering; the main difference is that the k-means method performs hard assignments of samples $h_n$ by assigning them to the cluster of the closest mean vector, whereas the EM method performs soft assignments by computing likelihood


values and assigning a feature $h_n$ to the most likely cluster – see listings (52.37) and (52.38). One way to motivate the k-means algorithm is pursued in Prob. 52.7 and is based on the following argument. Consider a collection of $N$ feature vectors $\{h_n\}$ that we wish to distribute among $K$ nonoverlapping clusters denoted by $\{\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_K\}$. Each cluster $\mathcal{C}_k$ is characterized by a mean vector $\mu_k$ corresponding to the average of all features within it, as shown by (52.28). We can seek an optimal assignment of feature vectors by formulating the following optimization problem:

$$\min_{\mathcal{C}_1,\mathcal{C}_2,\ldots,\mathcal{C}_K}\ \Big\{\sum_{k=1}^{K}\sum_{n\in\mathcal{C}_k}\|h_n - \mu_k(\mathcal{C}_k)\|^2\Big\} \qquad (52.30)$$

where the means $\{\mu_k\}$ are dependent on $\{\mathcal{C}_k\}$. The unknowns are the clusters $\{\mathcal{C}_k\}$, i.e., their constituent feature vectors. This is generally a hard nonconvex problem to solve. An approximate solution can be pursued by employing an alternating minimization approach. For each cluster $k$ and feature $h_n$, we introduce the scalar

$$a_{nk} = \left\{\begin{array}{ll} 1, & \text{if } h_n \in \mathcal{C}_k \\ 0, & \text{otherwise} \end{array}\right. \qquad (52.31)$$

which reveals whether $h_n$ lies in $\mathcal{C}_k$. There are $NK$ such scalars since the integer subscripts $n$ and $k$ run over $0 \le n \le N-1$ and $1 \le k \le K$. Using the binary-valued scalars $\{a_{nk}\}$, we can rewrite the optimization problem (52.30) in the equivalent form:

$$\min_{\{a_{nk}\}}\ \Big\{\sum_{k=1}^{K}\sum_{n=0}^{N-1} a_{nk}\,\|h_n - \mu_k(\mathcal{C}_k)\|^2\Big\} \qquad (52.32)$$

If we now alternate between minimizing over the $\{a_{nk}\}$ for a fixed set of means $\{\mu_k\}$, and minimizing over the $\{\mu_k\}$ for a fixed set of assignments $\{a_{nk}\}$, we arrive at the k-means algorithm:

$$a_{nk}^{(\ell)} = \mathop{\rm argmin}_{\{a_{nk}\}}\ \sum_{k=1}^{K}\sum_{n=0}^{N-1} a_{nk}\,\big\|h_n - \mu_k^{(\ell-1)}\big\|^2 \qquad (52.33a)$$
$$\mu_k^{(\ell)} = \mathop{\rm argmin}_{\{\mu_k\}}\ \sum_{k=1}^{K}\sum_{n=0}^{N-1} a_{nk}^{(\ell)}\,\|h_n - \mu_k\|^2 \qquad (52.33b)$$

where $\ell$ is an iteration index. The reader is asked to carry out this derivation in Prob. 52.7.

Selection of initial means
Three popular methods for selecting the initial mean vectors $\{\mu_k\}$ are the following:

(1) (Forgy initialization). This selects the $K$ mean vectors by sampling randomly without replacement from the $N$ training vectors $\{h_n\}$.


(2) (Random partitioning). This assigns the $N$ feature vectors $\{h_n\}$ at random to $K$ clusters and then computes the means of these clusters and uses them as the initial mean vectors.

(3) (k-means++ initialization). This spreads out the selection of the mean vectors as follows. It starts by selecting one mean vector uniformly at random from the $N$ training feature vectors $\{h_n\}$. We denote this first selection by $\mu_1$. Subsequently, the squared distances from $\mu_1$ to all feature vectors are computed and denoted by

$$d(n) \stackrel{\Delta}{=} \|\mu_1 - h_n\|^2, \quad n = 0,1,\ldots,N-1 \qquad (52.34)$$

These squared distances are normalized to add up to 1 and used to define a probability measure:

$$p(n) \stackrel{\Delta}{=} \frac{d(n)}{\sum_{n=0}^{N-1} d(n)}, \quad n = 0,1,\ldots,N-1 \qquad (52.35)$$

In this way, feature vectors that are farthest away from $\mu_1$ receive higher probability values. The method subsequently selects a second mean vector, $\mu_2$, randomly from the data according to this probability distribution. By construction, feature vectors $\{h_n\}$ that are farther away from $\mu_1$ will have a higher likelihood of being selected. We end up with two mean vectors $\{\mu_1, \mu_2\}$. The process continues as follows:

(a) For each feature vector, $h_n$, compute the squared distance, $d(n)$, from $h_n$ to the closest mean vector.
(b) Normalize the distances, $d(n)$, according to (52.35) and use the normalized values as probability measures.
(c) Select randomly a new mean vector from the $\{h_n\}$ according to this probability distribution and add it to the collection of previously selected means.

Repeat steps (a)–(c) until all $K$ means have been selected.
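Steps (a)–(c) can be expressed compactly as in the following sketch (illustrative only; it relies on NumPy's categorical sampler for step (c)).

```python
# Illustrative k-means++ seeding following steps (a)-(c) and (52.34)-(52.35).
import numpy as np

def kmeanspp_init(H, K, seed=0):
    """Return K initial means chosen from the rows of H by k-means++ sampling."""
    rng = np.random.default_rng(seed)
    means = [H[rng.integers(len(H))]]                                  # first mean: uniform pick
    while len(means) < K:
        # (a) squared distance from each h_n to its closest mean selected so far
        d = np.min([np.sum((H - m) ** 2, axis=1) for m in means], axis=0)
        p = d / np.sum(d)                                              # (b) probabilities, as in (52.35)
        means.append(H[rng.choice(len(H), p=p)])                       # (c) sample the next mean
    return np.array(means)
```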

Use for clustering
The two plots in the first row of Fig. 52.4 show $N = 250$ random feature vectors $h_n \in \mathbb{R}^2$ belonging to five different classes; the classes are colored in the plot on the left in order to identify them to the reader. The k-means algorithm is blind to the class information and operates on the unlabeled data in the plot on the right. The three plots in the bottom row show the result of applying the k-means algorithm for each of the three initialization procedures (Forgy, random, and k-means++). The location of the mean vector for each cluster is marked by a large × symbol. It is seen in this simulation that the location of the mean vectors is largely unaffected by the type of initialization. The plots in the bottom also show the Voronoi diagrams (separation regions) that result from using the mean vectors. Figure 52.5 illustrates a second situation where the Voronoi regions are sensitive to the initialization procedure. The figure shows the result of applying the same k-means clustering algorithm to a second collection of $N = 250$ randomly generated feature vectors in the square region $[0,1] \times [0,1]$.



Figure 52.4 The plots in the top row show N = 250 feature vectors hn ∈ IR2 belonging to five different classes; in the plot on the left, the classes are colored. The plots in the bottom row show the result of applying k-means clustering to the data using the three initialization methods, Forgy, random, and k-means++. In this case, all methods perform similarly. The × marks show the location of the mean vectors for the clusters.

Use for classification
We can exploit the result of the clustering operation to perform classification. We first associate a class $r(k)$ with each cluster $k$. The class value is determined by considering a majority vote among the members of the cluster. For example, if the majority of the members in the cluster belong to class $r = 1$, then the cluster will be assigned this label. In this way, we end up associating a label $r(k)$ with each cluster mean $\mu_k$. During classification, when a new feature vector $h$ arrives, we determine the closest mean vector to it and declare the class of $h$ to be that of this mean vector. In other words, we carry out a 1-NN classification scheme by relying solely on the $K$ cluster means. Since $K \ll N$, we end up with a computationally more efficient implementation than the traditional 1-NN solution that relies on comparing against the entire dataset. Another useful feature of the k-means algorithm is that it can also be used to perform classification in a semi-supervised setting when we have available labels for only a subset of the feature vectors $\{h_n\}$ but not for all of them. In this case, we start by clustering the $N$ feature vectors $\{h_n\}$ using the k-means construction; this amounts to an unsupervised step since labels are not necessary to carry out this step. We then label the clusters by using the limited labels that are available. We only consider those feature vectors within each cluster for which the label information is known, and associate the majority label to


Figure 52.5 The leftmost top plot shows $N = 250$ feature vectors $h_n \in \mathbb{R}^2$ randomly generated in the square region $[0,1] \times [0,1]$. The other three plots show the result of applying k-means clustering on this data using the three initialization methods, Forgy, random, and k-means++. In this case, the clustering results differ. The × symbol marks show the location of the mean vectors for the clusters.

the cluster; this amounts to a semi-supervised step. Subsequently, when a new feature vector $h$ arrives, we determine the closest cluster mean to it and declare the class of $h$ to be of that cluster.

Example 52.3 (Clustering MNIST dataset) We apply the k-means clustering algorithm to the MNIST dataset. The classification results obtained here will not be as reliable as the ones we will obtain by using other more elaborate classification schemes in future chapters. The example here is only meant to illustrate the operation of the clustering algorithm. The MNIST dataset is useful for classifying handwritten digits. It contains 60,000 labeled training examples and 10,000 labeled test examples. Each entry in the dataset is a 28 × 28 grayscale image, which we transform into an $M = 784$-long feature vector, $h_n$. Each pixel in the image and, therefore, each entry in $h_n$, assumes integer values in the range $[0, 255]$. Every feature vector (or image) is assigned an integer label in the range 0 to 9 depending on which digit the image corresponds to. Figure 52.6 shows randomly selected images from the training dataset rendered using a color-themed colormap.



Figure 52.6 Randomly selected images from the MNIST dataset for handwritten digits. Each image is 28×28 grayscale with pixels assuming integer values in the range [0, 255]. The MNIST dataset can be downloaded from http://yann.lecun.com/exdb/mnist/ and https://github.com/daniel-e/mnist_octave/blob/master/mnist.mat.

We preprocess the images {hn } by scaling their entries by 255 (so that they assume values in the range [0, 1]). We subsequently compute the mean feature vectors for the training and test sets. We center the scaled feature vectors around their respective means in both sets. Figure 52.7 shows randomly selected images for the digits 0 and 1 before and after processing using the same colormap as before.


Figure 52.7 Randomly selected images for the digits 0 and 1 from the MNIST dataset for handwritten digits. The top row shows original images and the bottom row shows the processed images, whose pixels are scaled down to the interval [0, 1] and centered around the mean feature vectors for training and testing.


Figure 52.8 The mean image for each cluster, obtained by averaging the images

assigned to the cluster. The images are rendered using a blue color scale against a dark background for emphasis. On top of each image, we assign a class label to the cluster. This label is obtained by a majority vote, namely, by determining the digit that is most repeated within the images in the cluster.

We apply the k-means++ algorithm to identify K = 10 clusters in the normalized training samples. We run the algorithm for 1000 iterations. At the end of these iterations, we obtain the mean vectors (centroids) for each of the clusters and plot them in Fig. 52.8. The figure shows K = 10 clusters labeled k = 1 through k = 10; note that the number we are assigning to refer to each cluster is different from the actual digit numbering from 0 to 9. We further assign a class label to each cluster using a majority vote. The digit that is most repeated within a cluster determines its label. Table 52.1 lists some statistics about the clusters: It shows the number of images that end up in each cluster, and the number of times that the most frequent digit appeared within the cluster. For example, a total of 9420 training images are assigned to cluster 1 and 6593 of these images happen to correspond to digit 1. This class label is assigned to the first cluster and it is written on top of the mean image corresponding to the cluster. Likewise, among the 8891 images in cluster 2, the most represented digit is 4 and it occurs 3180 times. We therefore assign the label 4 to cluster 2, and so forth. In the table, the first column lists the cluster number and the second column lists the class label that is assigned to the cluster. The last column shows the relative frequency of the most represented digit within each cluster.
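The majority-vote labeling of the clusters can be computed directly from the cluster assignments and the training labels. The following sketch (our own illustration, with hypothetical variable names) builds the cluster-by-digit counts used in Table 52.1 and in the listing further below, and derives the labels and relative frequencies from them.

```python
import numpy as np

def label_clusters(assign, digits, K, R=10):
    # counts[k, r] = number of training images of digit r that landed in cluster k.
    counts = np.zeros((K, R), dtype=int)
    for k, r in zip(assign, digits):
        counts[k, r] += 1
    cluster_label = counts.argmax(axis=1)                  # majority vote per cluster
    frequency = counts.max(axis=1) / counts.sum(axis=1)    # relative frequency (clusters assumed nonempty)
    return counts, cluster_label, frequency

# Toy usage with random assignments standing in for the k-means output.
rng = np.random.default_rng(0)
assign = rng.integers(10, size=5000)    # cluster index of each training image
digits = rng.integers(10, size=5000)    # true digit of each training image
counts, cluster_label, frequency = label_clusters(assign, digits, K=10)
```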

Table 52.1 The table lists the clusters, their assigned labels, the total number of images in each cluster, the number of occurrences of the most frequent digit in the cluster, and its relative frequency within that cluster.

Cluster number | Cluster label | Total images | Occurrences of most frequent digit | Percentage
      1        |       1       |     9420     |                6593                |   70.0%
      2        |       4       |     8891     |                3180                |   35.8%
      3        |       2       |     4455     |                4105                |   92.1%
      4        |       3       |     5076     |                2117                |   41.7%
      5        |       0       |     4540     |                4289                |   94.5%
      6        |       7       |     8488     |                3840                |   45.2%
      7        |       6       |     5329     |                1737                |   32.6%
      8        |       6       |     4291     |                3766                |   87.8%
      9        |       3       |     4832     |                2833                |   58.6%
     10        |       8       |     4678     |                3373                |   72.1%


Observe from Fig. 52.8 and also from the data in Table 52.1 that clusters 7 and 8 are labeled as corresponding to the same digit, 6. There is no label corresponding to digit 5 in the figure and table. We can examine more closely the frequency of digit occurrences within each cluster, as shown in the following listing:

cluster      0      1      2      3      4      5      6      7      8      9
   1         5   6593*   676    301    220    282    173    442    508    220
   2        31     16    167    171   3180*   356     53   1831    149   2937
   3         6     30   4105*   125     33      4     56     45     33     18
   4       233      4    243   2117*     6   1314     29      8   1058     64
   5      4289*     0     44     22      8     44     55     19     20     39
   6         7     10     65     38   1776    145      3   3840*   126   2478
   7       888     10    253    154    455   1479   1737*    22    260     71
   8       147      7    116     27    136     53   3766*     4     26      9
   9       295     14    117   2833*     1   1178     37      1    298     58
  10        22     58    172    343     27    566      9     53   3373*    55

The top row contains the digits 0 through 9. The first column contains the cluster numbers 1 through 10. Each row in the listing relates to one cluster. The numbers in the row show how many images corresponding to each digit appear within the cluster. We mark the most repeated digit in each row with an asterisk. For example, for cluster 7, the most repeated digit is 6, with 1737 images; the second most repeated digit is 5, with 1479 images. Compare these frequencies with the occurrences of digits 5 and 6 within cluster 8: There are 3766 images for digit 6 and only 53 images for digit 5. These results suggest that, if desired, we may label cluster 7 as corresponding to digit 5. Actually, during the testing/classification phase discussed next, we will find out that the algorithm will end up assigning images for digit 5 to cluster 7.

Table 52.2 Number of occurrences for each digit in the test data, along with the cluster it is assigned to and the number of images for that digit that were assigned to this cluster.

Digit | Occurrences in test data | Assignments to same cluster | Percentage | Assigned to cluster
  0   |           980            |             718             |   73.3%    |         5
  1   |          1135            |            1105             |   97.4%    |         1
  2   |          1032            |             700             |   67.8%    |         3
  3   |          1010            |             523             |   51.8%    |         9
  4   |           982            |             556             |   56.6%    |         2
  5   |           892            |             233             |   26.2%    |         7
  6   |           958            |             656             |   68.5%    |         8
  7   |          1028            |             629             |   61.2%    |         6
  8   |           974            |             555             |   60.0%    |        10
  9   |          1009            |             538             |   53.3%    |         2
TOTAL |         10,000           |            6213             |   62.1%    |

Once clustering is completed, and a label is assigned to each cluster, we can use the cluster structure to perform classification. For this purpose, we assign each of the 10,000 testing samples to the cluster with the closest centroid to it, and set the label for this test sample to that of its closest cluster. We assess performance as follows. Table 52.2 lists the number of occurrences of each digit in the test data. For example, there are


980 images corresponding to digit 0, 1135 images corresponding to digit 1, and so forth. During testing, we find that 718 of the images corresponding to digit 0 are found to be closest to the centroid of cluster 5, whose label is “digit 0.” We therefore say that 718 test images corresponding to digit 0 are correctly classified, which amounts to a 73.3% success rate for digit 0. These numbers are listed in the columns of Table 52.2. We also place on top of cluster 5 in Fig. 52.9 the label “digit 0” to indicate that, during testing, this cluster accounts for the largest proportion of classifications in favor of “digit 0.” Consider next digit 5. There are 892 occurrences of test images corresponding to digit 5 in the test data. Of these, 233 are assigned to cluster 7; this is the highest number of images for digit 5 that are assigned to a single cluster (the numbers in the third column of the table show the largest number of same-cluster assignments for each digit). We therefore find that the success rate for digit 5 is 26.2% under this construction. We place on top of cluster 7 in Fig. 52.9 the label “digit 5” to indicate that, during testing, this cluster accounts for the largest proportion of classifications in favor of “digit 5.” It follows from the numbers in the table that the misclassification rate over the MNIST test data is close to 38%. We will be able to attain significantly better performance in later chapters by using other classification methods.
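A minimal sketch of this testing step: each test image is assigned to its closest centroid and inherits that cluster's label, after which the error rate is measured. All names and the toy shapes are our own assumptions.

```python
import numpy as np

def classify_by_centroid(H_test, centroids, cluster_label):
    # Squared Euclidean distance from every test feature to every centroid.
    d2 = ((H_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)           # index of the closest centroid
    return cluster_label[nearest]         # predicted digit = label of that cluster

# Toy usage: 10 centroids in 784 dimensions with made-up labels.
rng = np.random.default_rng(0)
centroids = rng.standard_normal((10, 784))
cluster_label = rng.integers(10, size=10)
H_test = rng.standard_normal((100, 784))
true_digits = rng.integers(10, size=100)
pred = classify_by_centroid(H_test, centroids, cluster_label)
print("error rate:", np.mean(pred != true_digits))
```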


Figure 52.9 The label on top of each cluster shows the digit label from the testing set

that is most often assigned to that cluster. The images are rendered using the same blue color scale as before for emphasis.

52.5 COMMENTARIES AND DISCUSSION

Nearest-neighbor rule. The earliest formulation of the NN rule appears to be the work by Fix and Hodges (1951) in an unpublished report from the 1951 USAF School of Aviation Medicine. Some of the earliest applications in the context of pattern classification appear in the publications by Johns (1961), Sebestyen (1962), Kanal (1962), Kanal et al. (1962), Harley et al. (1963), and Nilsson (1965). One fundamental and surprising result on the performance of the 1-NN classifier is expression (52.26), due to Cover and Hart (1967). The result states that for large sample sizes, the probability of error of the classifier is at most twice as bad as the best possible performance by the optimal Bayes classifier. The result also means, as stated in the aforementioned reference, that “any other decision rule based on the infinite data set can cut the probability of error


by at most one half.” An extension to the k-NN rule was given by Devroye (1981) in the following form for binary classifiers (R = 2):

P_e \leq (1 + a) P_e^{\textrm{bayes}}        (52.36a)

a \;\stackrel{\Delta}{=}\; \frac{\alpha \sqrt{k}}{k - 3.25}\left(1 + \frac{\beta}{\sqrt{k - 3}}\right), \quad k \textrm{ odd}, \; k \geq 5        (52.36b)

\alpha \approx 0.3340, \quad \beta \approx 0.9750        (52.36c)

Note that the factor a converges to zero at the rate O(1/√k). While these statements are reassuring, unfortunately, the conclusion only holds in the limit of large data sizes with N → ∞. Since the seminal result by Cover and Hart (1967), there have been many other studies on NN rules and variations. Representative examples of these efforts include the works by Cover (1968), Peterson (1970), Hellman (1970), Wilson (1972), Fukunaga and Hostetler (1975), Dudani (1976), and Altman (1992) – see also the texts by Tukey (1977), Devroye, Gyorfi, and Lugosi (1996), Duda, Hart, and Stork (2000), Chávez et al. (2001), Shakhnarovich, Darrell, and Indyk (2006), Chaudhuri and Dasgupta (2014), Biau and Devroye (2015), and Chen and Shah (2018).

Voronoi diagrams. We illustrated in Fig. 52.3 the use of Voronoi diagrams in the context of NN rules. These diagrams divide the plane into a collection of convex regions consisting of one seed point each, along with all points that are closest to the seed. The diagrams are also referred to as tessellations since they tessellate the space and divide it into polygons without gaps. Such diagrams have found applications in many other areas, including in the arts, geometry, geography, sciences, and engineering. One early notable application of Voronoi diagrams was by the English physician John Snow (1813–1858), who used them to locate the source of the 1854 cholera outbreak in the Soho area in central London. He concluded that most of the individuals infected by the disease lived closer to the Broad Street public water pump than any other water pump in the area. His investigation was reported in the publication by Snow (1854); today, he is considered the father of modern epidemiology. Although the designation “Voronoi diagram” is after the Russian mathematician Georgy Voronoi (1868–1908), who formally defined the concept in Voronoi (1908), there have been informal instances of such diagrams as far back as three centuries earlier by the German astronomer Johannes Kepler (1571–1630) and the French mathematician René Descartes (1596–1650); Kepler used tessellations in his studies of snowflakes and the sphere-packing problem in Kepler (1611), while Descartes used them to identify clusters of stars in Descartes (1644) – see the accounts by Aurenhammer and Klein (2000), Okabe, Boots, and Sugihara (2000), and Liebling and Pournin (2012). Prior to Voronoi (1908), the diagrams were also used by Snow in 1854 and more formally by the German mathematician Gustav Dirichlet (1805–1859) in the work by Dirichlet (1850) on quadratic forms. Useful overviews on Voronoi diagrams appear in the article by Aurenhammer and Klein (2000) and the text by Okabe, Boots, and Sugihara (2000).

k-means clustering. There are several variations of the clustering problem in statistical analysis, i.e., the problem of partitioning data into clusters. Some of the earliest formulations appear in the works by Dalenius (1950), Dalenius and Gurney (1951), Marschak (1954), Cox (1957), Fisher (1958), and Ward (1963).
For example, Fisher (1958) motivates the article by posing the following question in the abstract: “Given a set of arbitrary numbers, what is a practical procedure for grouping them so that the variance within groups is minimized?” Fisher focused on the one-dimensional case, M = 1. Since solving the clustering formulation (52.30) in its generality is an NP-hard problem, it is necessary to resort to approximate solutions. One of the most popular algorithms is the k-means procedure described in the body of the chapter. The original idea for the k-means algorithm appears to be the works by Steinhaus (1957), Lloyd (1957), and Sebestyen (1962), although Lloyd published his work only 25 years later in 1982. The designation “k-means” was proposed by MacQueen (1965, 1967); for


example, the author states in the abstract of MacQueen (1967) that the objective is “to describe a process for partitioning an N-dimensional population into k sets on the basis of a sample. The process, which is called ‘k-means,’ appears to give partitions which are reasonably efficient in the sense of within-class variance.” The same algorithm was independently developed by Forgy (1965). The k-means++ variant for selecting the initial mean (seed) vectors is more recent and was proposed independently by Ostrovsky et al. (2006) and Arthur and Vassilvitskii (2007); the latter reference contains several results on the behavior of the k-means++ procedure. Useful studies on the convergence properties of the k-means algorithm (also called the Lloyd algorithm) appear in Abaya and Wise (1984), Sabin and Gray (1986), Har-Peled and Sadri (2005), Arthur and Vassilvitskii (2006, 2007), and Du, Emelianenko, and Ju (2006). Accessible overviews on clustering algorithms in classification and data quantization/compression are given by Hartigan (1975), Gray and Neuhoff (1998), Du, Faber, and Gunzburger (1999), MacKay (2003), Tan, Steinbach, and Kumar (2005), and Witten, Frank, and Hall (2011). To facilitate comparison with the EM algorithm described next, we list the k-means clustering method in the form shown in (52.37).

k-means clustering algorithm.
given feature vectors {h_n ∈ IR^M}, for n = 0, 1, . . . , N − 1;
given number of clusters, K;
given initial conditions for the mean vectors: \mu_k^{(0)}, k = 1, 2, . . . , K;
repeat until convergence over m ≥ 1:
    (determine clusters): for each n = 0, 1, . . . , N − 1 and k = 1, . . . , K:
        r^{(m)}(k, h_n) = 1 if h_n is closest to \mu_k^{(m-1)}, and 0 otherwise
        N_k^{(m)} = \sum_{n=0}^{N-1} r^{(m)}(k, h_n)
    (update means): for each k = 1, . . . , K:
        \mu_k^{(m)} = \frac{1}{N_k^{(m)}} \sum_{n=0}^{N-1} r^{(m)}(k, h_n)\, h_n
end
return \{\widehat{\mu}_k\} \leftarrow \{\mu_k^{(m)}\}.        (52.37)
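For concreteness, here is a direct NumPy transcription of listing (52.37). It is a bare-bones sketch (a fixed iteration count stands in for the convergence test, and there is no k-means++ seeding or empty-cluster handling), not a production implementation.

```python
import numpy as np

def kmeans(H, mu0, num_iter=100):
    # H: N x M feature matrix; mu0: K x M initial mean vectors.
    mu = mu0.astype(float).copy()
    for _ in range(num_iter):
        # (determine clusters): hard assignment of each h_n to its closest mean.
        d2 = ((H[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # N x K distances
        r = d2.argmin(axis=1)
        # (update means): average the samples assigned to each cluster.
        for k in range(len(mu)):
            members = H[r == k]
            if len(members) > 0:        # the listing assumes nonempty clusters
                mu[k] = members.mean(axis=0)
    return mu, r

rng = np.random.default_rng(0)
H = rng.random((250, 2))
mu, r = kmeans(H, mu0=H[rng.choice(250, size=4, replace=False)])
```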

Connection to the EM algorithm. There is a useful connection between the k-means clustering algorithm (52.29) and the EM algorithm (32.67) for Gaussian mixture models studied in an earlier chapter. If we assume the covariance matrices of the Gaussian components are preset to the identity matrix (i.e., if we assume spherical clusters), and focus exclusively on estimating the mean vectors, then the EM algorithm (32.67) reduces to listing (52.38), where m denotes the iteration index and hn denotes the nth feature. For comparison purposes, we have rewritten the k-means clustering algorithm in the form shown in (52.37). Observe that there is a hard assignment of the sample hn to one of the clusters (the one determined by the closest mean vector to hn ). In contrast, the EM implementation performs a soft assignment of hn based on the responsibility factor r(m) (k, hn ): It measures the likelihood that sample hn belongs to cluster k. The k-means algorithm sets these factors to 1 or 0, depending on whether hn is closest to µk or not.
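To make this hard-versus-soft contrast concrete before presenting listing (52.38) below, the following minimal sketch computes one E-step of the identity-covariance mixture model; replacing each row of the responsibility matrix by an indicator of its largest entry recovers the hard assignments of (52.37). The function name and toy data are our own assumptions.

```python
import numpy as np

def soft_responsibilities(H, mu, pi):
    # r[n, k] proportional to pi_k * exp(-0.5 * ||h_n - mu_k||^2), normalized over k.
    d2 = ((H[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)       # N x K
    logw = np.log(pi)[None, :] - 0.5 * d2
    logw -= logw.max(axis=1, keepdims=True)                        # for numerical stability
    w = np.exp(logw)
    return w / w.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
H = rng.random((250, 2))
mu = rng.random((4, 2))
pi = np.full(4, 1 / 4)
r = soft_responsibilities(H, mu, pi)
# Hard (k-means style) assignment keeps only the largest responsibility per row:
hard_r = (r == r.max(axis=1, keepdims=True)).astype(float)
```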


Special case of the EM algorithm (32.67) for K clusters.
given feature vectors {h_n ∈ IR^M}, for n = 0, 1, . . . , N − 1;
assumed K Gaussian mixture components;
given initial conditions: \pi_k^{(0)}, \mu_k^{(0)}, k = 1, 2, . . . , K;
repeat until convergence over m ≥ 1:
    (E-step): for each n = 0, 1, . . . , N − 1 and k = 1, . . . , K:
        r^{(m)}(k, h_n) = \frac{\pi_k^{(m-1)} \exp\left(-\frac{1}{2}\|h_n - \mu_k^{(m-1)}\|^2\right)}{\sum_{j=1}^{K} \pi_j^{(m-1)} \exp\left(-\frac{1}{2}\|h_n - \mu_j^{(m-1)}\|^2\right)}
        N_k^{(m)} = \sum_{n=0}^{N-1} r^{(m)}(k, h_n)
    (M-step): for each k = 1, . . . , K:
        \mu_k^{(m)} = \frac{1}{N_k^{(m)}} \sum_{n=0}^{N-1} r^{(m)}(k, h_n)\, h_n
        \pi_k^{(m)} = N_k^{(m)} / N
end
return \{\widehat{\pi}_k, \widehat{\mu}_k\} \leftarrow \{\pi_k^{(m)}, \mu_k^{(m)}\}.        (52.38)

MNIST dataset. Example 52.3 applies the k-means clustering algorithm to the MNIST dataset. It contains 60,000 labeled training examples and 10,000 labeled test examples. This popular dataset was used by LeCun et al. (1998) to perform classification of handwritten digits. It can be downloaded from http://yann.lecun.com/exdb/mnist/ and also https://github.com/daniel-e/mnist_octave.

PROBLEMS

52.1 Consider the 1-NN decision rule applied to a binary classification problem and introduce the random variable t(h) = P(γ = +1|h). Assume N → ∞, where N denotes the sample size. Let P_e^∞ denote the asymptotic misclassification error as N → ∞ for the 1-NN classifier.
(a) Show that P_e^\infty = \mathbb{E}\left\{2\, t(h)\big(1 - t(h)\big)\right\}.
(b) Conclude the validity of property (52.26) for the 1-NN classification rule, namely, that the asymptotic probability of error is bounded by twice the probability of error by the Bayes classifier regardless of the underlying distribution.
52.2 Refer to the bias–variance relation of Prob. 27.16. We use the result here to examine the bias–variance trade-off for the k-NN strategy. Consider scalar and real-valued variables {γ, h, v} satisfying a model of the form γ = f(h) + v, for some known function f(·). The variable v is zero-mean noise with variance σ_v^2 and is independent of h. Consider a collection of independent data realizations {γ(n), hn}. Upon the arrival of a new feature h, we estimate the corresponding γ as follows:

\widehat{\gamma} = \frac{1}{k} \sum_{\ell \in N_h} \gamma(\ell)

where the average is computed over the k-nearest neighbors to h, denoted by the set N_h. Show that, conditioned on the feature data {hn}:


\mathbb{E}\left[(\gamma - \widehat{\gamma})^2 \,\big|\, h = h\right] = \sigma_v^2 + \frac{\sigma_v^2}{k} + \left(\frac{1}{k}\sum_{\ell \in N_h}\big(f(h) - f(h_\ell)\big)\right)^2

where the second term on the right-hand side denotes the variance factor (it decays with k), and the last term denotes the squared bias factor.
52.3 We continue with Prob. 52.2. Let γ• denote the optimal MSE estimator for γ given h. Show that γ• = f(h) with estimation error variance equal to \mathbb{E}\,(\widetilde{\gamma}^{\bullet})^2 = \sigma_v^2. Let \widetilde{\gamma} = \gamma - \widehat{\gamma} for the k-NN estimator from Prob. 52.2. Use the result of that problem to conclude that

\mathbb{E}\,\widetilde{\gamma}^2 - \mathbb{E}\,(\widetilde{\gamma}^{\bullet})^2 = \frac{\sigma_v^2}{k} + \mathbb{E}\left(f(h) - \frac{1}{k}\sum_{\ell \in N_h}\gamma(\ell)\right)^2

52.4 Consider two distinct points a, b ∈ IR^M. Show that the bisector of the segment joining them is a hyperplane in IR^M.
52.5 Consider the collection of M-dimensional points F = {h1, h2, . . . , hN}. For any ha from this set, we define its Voronoi cell as the set of all points h that satisfy

\textrm{Voronoi}(h_a) = \left\{ h \in \textrm{IR}^M \;\big|\; \|h - h_a\| \leq \|h - h_n\|, \; \forall\, h_n \in F \right\}

Show that the Voronoi cell is a convex set.
52.6 Consider a Voronoi diagram similar to the one shown in Fig. 52.3. Let N be the number of seed points {hn}. Let Ne denote the total number of edges in the diagram, and let Nv denote the total number of vertices. Verify that Nv − Ne + N = 1.
52.7 Explain that the k-means algorithm solves problem (52.32) by alternating between minimizing over the {ank} for a fixed set of means {µk}, and minimizing over the {µk} for a fixed set of assignments {ank}.
52.8 Consider a cluster C consisting of a collection of M-dimensional feature vectors, denoted generically by h ∈ C. Let µ denote the mean of the cluster, i.e., the mean of the vectors in C. For any vector x ∈ IR^M, show that

\sum_{h \in C} \|h - x\|^2 = \sum_{h \in C} \|h - \mu\|^2 + |C|\, \|\mu - x\|^2

where |C| denotes the cardinality of C.
52.9 Argue that problem (52.30) is equivalent to solving:

\min_{C_1, C_2, \ldots, C_K} \left\{ \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{n, m \in C_k} \|h_n - h_m\|^2 \right\}

52.10 We assumed in (52.42) that P_\epsilon > 0 for any \epsilon. That is, we assumed that all features h are well behaved in the sense that if we encircle each one of them by a small sphere of radius \epsilon, then there is a positive probability that other feature vectors will be present inside the sphere. Let us assume, to the contrary, that there exists some subset of feature vectors, denoted by \bar{H}, that is not well behaved, meaning for any \bar{h} \in \bar{H} in this set there will exist some \bar{\epsilon} > 0 such that P_\epsilon = 0 for any \epsilon < \bar{\epsilon}. In other words, no feature vectors will exist in spheres surrounding \bar{h} of radius smaller than \bar{\epsilon}. Prove that this is an impossibility, i.e., that \bar{H} is a set of probability zero.
52.11 Let {h1, . . . , hN} denote iid random variables selected according to a distribution h ∼ fh(h) with compact support H in IR^M. For each hn, let h′n denote its nearest neighbor from among the remaining vectors and define the expected squared ℓ∞-distance:

d^2 \;\stackrel{\Delta}{=}\; \frac{1}{N}\sum_{n=1}^{N} \|h_n - h_n'\|_\infty^2


Let D denote the diameter of the set H, meaning that the ℓ∞-distance between any two points in H cannot exceed D. Show that

d^2 \leq \begin{cases} 16 D^2 / N^{2/M}, & M \geq 2 \\ 4 D^2 / N, & M = 1 \end{cases}

Remark. See the book by Biau and Devroye (2015, ch. 2) for a related discussion.

52.A PERFORMANCE OF THE NN CLASSIFIER

In this appendix we establish Theorem 52.1 on the generalization error of the 1-NN classifier. The proof follows arguments similar to Cover and Hart (1967). The lower bound in (52.26) is obvious since the Bayes classifier minimizes the probability of error by construction. Let us focus on the upper bound. Let h denote an arbitrary feature vector arising from the probability distribution fh(h). We denote its actual class by r(h). Let x_{h,N} denote the NN to h from among the N given feature vectors {h_n}:

x_{h,N} = \mathop{\textrm{argmin}}_{x \in \{h_n\}_{n=0}^{N-1}} \; \|h - x\|^2        (52.39)

We denote the class of x_{h,N} by r(x). Note that the location of x_{h,N} depends on both h and the data size N. For simplicity, we will drop the subscripts h and N from x_{h,N} and refer to the variable by x. Let f_{x|h}(x|h) denote the conditional pdf of the closest neighbor variable x given h = h. This pdf is also dependent on N since x is dependent on N. It is reasonable to assume that, as the sample size increases to N → ∞, the pdf f_{x|h}(x|h) tends to a Dirac impulse function concentrated at h, i.e.,

\lim_{N \to \infty} f_{x|h}(x|h) = \delta(x - h)        (52.40)

which means that the pdf becomes concentrated at location h; recall that such impulse functions satisfy the sifting property

\int_{x \in X} g(x)\, \delta(x - h)\, dx = g(h)        (52.41)

for any function g(x) defined at location h, and where the integration is over the domain of x. Assumption (52.40) can be motivated as follows. Choose an arbitrary ε > 0 and let S(ε) denote a sphere of radius ε > 0 centered at h. The probability that some feature vector h′ falls within the sphere is given by (see Prob. 52.10):

P_\epsilon = \int_{h \in S(\epsilon)} f_h(h)\, dh > 0        (52.42)

The probability that the N feature vectors, which are assumed to be chosen independently of each other, fall outside the sphere is given by

P\big(N \textrm{ features outside } S(\epsilon)\big) = (1 - P_\epsilon)^N \;\longrightarrow\; 0, \quad N \to \infty        (52.43)

This result holds regardless of the radius ε of the sphere. Therefore, by shrinking the size of the sphere around h, and as the sample size N tends to infinity, we find that the nearest neighbor to h converges to h with probability 1, and assumption (52.40) is justified.


Now given a feature vector h, whose closest neighbor is x, the 1-NN classifier assigns to h the same label as x. Therefore, the probability of error by this classifier is given by

P(\textrm{error}|h, x) = P\big(r(x) \neq r(h)\,|\,h, x\big)
  = 1 - P\big(r(x) = r(h)\,|\,h, x\big)
  \overset{(a)}{=} 1 - \sum_{r=1}^{R} P\big(r = r(h)\,|\,h\big)\, P\big(r = r(x)\,|\,x\big)        (52.44)

where the rightmost term in (a) is a sum over the probabilities of the classes for h and x being the same; this is because there are R possibilities for r(x) given by r ∈ {1, 2, . . . , R}. If we integrate the above error over the conditional pdf of x given h, and let N → ∞, we obtain the average probability of error for a given h:

P(\textrm{error}|h) = \int_{x \in H} P(\textrm{error}|h, x)\, f_{x|h}(x|h)\, dx
  \overset{(52.40)}{=} \int_{x \in H} \left\{ 1 - \sum_{r=1}^{R} P\big(r = r(h)\,|\,h\big)\, P\big(r = r(x)\,|\,x\big) \right\} \delta(x - h)\, dx, \quad N \to \infty
  = 1 - \sum_{r=1}^{R} P^2\big(r = r(h)\,|\,h = h\big)        (52.45)

If we further integrate over the pdf of h, we obtain the probability of error for the 1-NN classifier:

P_e = \int_{h \in H} \left\{ 1 - \sum_{r=1}^{R} P^2\big(r = r(h)\,|\,h = h\big) \right\} f_h(h)\, dh        (52.46)

We want to compare this expression to P_e^{\textrm{bayes}}, which we know from (28.69) is given by

P_e^{\textrm{bayes}} = \int_{h \in H} \left\{ 1 - P\big(r^{\bullet}(h) = r(h)\,|\,h = h\big) \right\} f_h(h)\, dh        (52.47)

Let us examine the sum that appears inside (52.46). We split it into two terms:

\sum_{r=1}^{R} P^2\big(r = r(h)\,|\,h = h\big) = P^2\big(r^{\bullet}(h) = r(h)\,|\,h = h\big) + \underbrace{\sum_{r \neq r^{\bullet}(h)} P^2\big(r = r(h)\,|\,h = h\big)}_{=A}
  \overset{(28.68)}{=} \big(1 - P^{\textrm{bayes}}(\textrm{error}|h)\big)^2 + A        (52.48)

where the first term depends on the probability of error of the Bayes classifier at h, and the second term is a sum we are denoting by the letter A. If we minimize A over its terms we can determine a lower bound for the sum of squared probabilities on the left. Hence, we formulate the optimization problem:


\min \;\; \sum_{r \neq r^{\bullet}(h)} P^2\big(r = r(h)\,|\,h\big)
subject to \; P\big(r = r(h)\,|\,h\big) \geq 0
and \; \sum_{r \neq r^{\bullet}(h)} P\big(r = r(h)\,|\,h\big) = P^{\textrm{bayes}}(\textrm{error}|h)        (52.49)

where the minimization is over the individual terms P^2(r = r(h)|h). We are therefore minimizing a sum of nonnegative terms subject to a constraint on what their sum should be. A straightforward Lagrange multiplier argument will show that the solution is obtained when all probabilities are equal to each other, i.e., when

P\big(r = r(h)\,|\,h = h\big) = \frac{P^{\textrm{bayes}}(\textrm{error}|h)}{R - 1}, \quad \textrm{for any } r \neq r^{\bullet}        (52.50)

Substituting into (52.48) we determine a lower bound as follows:

\sum_{r=1}^{R} P^2\big(r = r(h)\,|\,h\big) = \big(1 - P^{\textrm{bayes}}(\textrm{error}|h)\big)^2 + \sum_{r \neq r^{\bullet}(h)} P^2\big(r = r(h)\,|\,h = h\big)
  \geq 1 - 2P^{\textrm{bayes}}(\textrm{error}|h) + \big(P^{\textrm{bayes}}(\textrm{error}|h)\big)^2 + \frac{R-1}{(R-1)^2}\big(P^{\textrm{bayes}}(\textrm{error}|h)\big)^2
  \geq 1 - 2P^{\textrm{bayes}}(\textrm{error}|h) + \big(P^{\textrm{bayes}}(\textrm{error}|h)\big)^2 + \frac{1}{R-1}\big(P^{\textrm{bayes}}(\textrm{error}|h)\big)^2
  \geq 1 - 2P^{\textrm{bayes}}(\textrm{error}|h) + \frac{R}{R-1}\big(P^{\textrm{bayes}}(\textrm{error}|h)\big)^2        (52.51)

which implies that

1 - \sum_{r=1}^{R} P^2\big(r = r(h)\,|\,h = h\big) \leq 2P^{\textrm{bayes}}(\textrm{error}|h) - \frac{R}{R-1}\big(P^{\textrm{bayes}}(\textrm{error}|h)\big)^2        (52.52)

Substituting this bound into (52.46) and integrating over the distribution of h we obtain

P_e \leq \int_{h \in H} \left\{ 2P^{\textrm{bayes}}(\textrm{error}|h) - \frac{R}{R-1}\big(P^{\textrm{bayes}}(\textrm{error}|h)\big)^2 \right\} f_h(h)\, dh
  = 2P_e^{\textrm{bayes}} - \frac{R}{R-1}\int_{h \in H} \big(P^{\textrm{bayes}}(\textrm{error}|h)\big)^2 f_h(h)\, dh
  \leq 2P_e^{\textrm{bayes}} - \frac{R}{R-1}\big(P_e^{\textrm{bayes}}\big)^2        (52.53)

where in the last step we used the fact that for any scalar random variable x, it holds that (E x)^2 ≤ E x^2 and, hence,

\big(P_e^{\textrm{bayes}}\big)^2 \;\stackrel{\Delta}{=}\; \big(\mathbb{E}\, P^{\textrm{bayes}}(\textrm{error}|h)\big)^2
  \leq \mathbb{E}\,\big(P^{\textrm{bayes}}(\textrm{error}|h)\big)^2
  = \int_{h \in H} \big(P^{\textrm{bayes}}(\textrm{error}|h)\big)^2 f_h(h)\, dh        (52.54)


REFERENCES Abaya, E. and F. Wise (1984), “Convergence of vector quantizers with applications to optimal quantization,” SIAM J. Appl. Math., vol. 44, pp. 183–189. Altman, N. S. (1992), “An introduction to kernel and nearest-neighbor nonparametric regression,” Amer. Statist., vol. 46, no. 3, pp. 175–185. Arthur, D. and S. Vassilvitskii (2006), “How slow is the k-means method?” Proc. Ann. Symp. Computational Geometry (SCG), pp. 144–153, Sedona, AZ. Arthur, D. and S. Vassilvitskii (2007), “k-means++: The advantages of careful seeding,” Proc. Ann. ACM-SIAM Symp. Discrete Algorithms, pp. 1027–1035, New Orleans, LA. Aurenhammer, F. and R. Klein (2000) “Voronoi diagrams,” in Handbook of Computational Geometry, J.-R. Sack and J. Urrutia, editors, pp. 201–290, North-Holland. Biau, G. and L. Devroye (2015), Lectures on the Nearest Neighbor Method, Springer. Chaudhuri, K. and S. Dasgupta (2014), “Rates of convergence for nearest neighbor classification,” Proc. Advances Neural Information Process. Systems (NIPS), pp. 1– 9, Montreal. Chávez, E., G. Navarro, R. Baeza-Yates, and J. L. Marroquin (2001), “Searching in metric spaces,” ACM Comput. Surv., vol. 33, no. 3, pp. 273–321. Chen, G. H. and D. Shah (2018), “Explaining the success of nearest neighbor methods in prediction,” Found. Trends Mach. Learn., vol. 10, nos. 5–6, pp. 337–588. Cover, T. M. (1968), “Estimation by the nearest neighbor rule,” IEEE Trans. Inf. Theory, vol. 14, no. 1, pp. 21–27. Cover, T. M. and P. E. Hart (1967), “Nearest neighbor pattern classification,” IEEE Trans. Inf. Theory, vol. 13, no. 1, pp. 21–27. Cox, D. R. (1957), “Note on grouping,” J. Amer. Statist. Assoc., vol. 52, pp. 543–547. Dalenius, T. (1950), “The problem of optimum stratification,” Skandinavisk Aktuarietidskrift, vol. 34, pp. 203–213. Dalenius, T. and M. Gurney (1951), “The problem of optimum stratification II,” Skandinavisk Aktuarietidskrift, vol. 34, pp. 133–148. Descartes, R. (1644), Principia Philosophiae, Amstelodami, apud Ludovicum Elzevirium. Devroye, L. (1981), “On the asymptotic probability of error in nonparametric discrimination,” Ann. Statist., vol. 9, no. 6, pp. 1320–1327. Devroye, L., L. Gyorfi, and G. Lugosi (1996), A Probabilistic Theory of Pattern Recognition, Springer. Dirichlet, G. L. (1850), “Über die Reduktion der positiven quadratischen Formen mit drei unbestimmten ganzen Zahlen,” Journal für die reine und angewandte Mathematik, vol. 40, pp. 209–227. Du, Q., M. Emelianenko, and L. Ju (2006), “Convergence of the Lloyd algorithm for computing centroidal Voronoi tessellations,” SIAM J. Numer. Anal., vol. 44, pp. 102– 119. Du, Q., V. Faber, and M. Gunzburger (1999), “Centroidal Voronoi tessellations: Applications and algorithms,” SIAM Rev., vol. 41, no. 4, pp. 637–676. Duda, R. O., P. E. Hart, and D. G. Stork (2000), Pattern Classification, 2nd ed., Wiley. Dudani, S. A. (1976), “The distance-weighted k-nearest-neighbor rule,” IEEE Trans. Syst. Man Cybern., vol. 6, pp. 325–327. Fisher, W. D. (1958), “On grouping for maximum homogeneity,” J. Amer. Statist. Assoc., vol. 53, pp. 789–798. Fix, E. and J. L. Hodges, Jr. (1951), “Discriminatory analysis, nonparametric discrimination,” Project 21-49-004, Report 4, Contract AF41(128)-31, USAF School of Aviation Medicine, Randolph Field, TX. Forgy, E. W. (1965), “Cluster analysis of multivariate data: Efficiency versus interpretability of classifications,” Biometrics, vol. 21, pp. 768–769.


Fukunaga, K. and L. Hostetler (1975), “k-nearest-neighbor Bayes-risk estimation,” IEEE Trans. Inf. Theory, vol. 21, no. 3, pp. 285–293. Gray, R. M. and D. L. Neuhoff (1998), “Quantization,” IEEE Trans. Inf. Theory, vol. 44, no. 6, pp. 2325–2373. Harley, T., J. Bryan, L. Kanal, D. Taylor, and J. Grayum (1963), “Semi-automatic imagery screening research study and experimental investigation,” Philco Rep., vol. I, report nos. 2–3. Har-Peled, S. and B. Sadri (2005), “How fast is the k-means method?” Algorithmica, vol. 41, pp. 185–202. Hartigan, J. A. (1975), Clustering Algorithms, Wiley. Hellman, M. E. (1970), “The nearest neighbor classification rule with a reject option,” IEEE Trans. Syst. Man Cybern., vol. 6, no. 3, pp. 179–185. Johns, M. V. (1961), “An empirical Bayes approach to non-parametric two-way classification,” in Studies in Item Analysis and Prediction, H. Solomon, editor, Stanford University Press. Kanal, L. (1962), “Evaluation of a class of pattern recognition networks,” in Biological Prototypes and Synthetic Systems, vol. 1, pp. 261–269, Plenum Press. Kanal, L., F. Slymaker, D. Smith, and W. Walker (1962), “Basic principles of some pattern recognition systems,” Proc. National Electronics Conference, vol. 18, pp. 279–295, Chicago, IL. Kepler, J. (1611), The Six-Sided Snowflake, Oxford University Press, 2014 edition. LeCun, Y., L. Bottou, Y. Bengio, and P. Haffner (1998), “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324. Liebling, T. M. and L. Pournin (2012), “Voronoi diagrams and Delaunay triangulations: Ubiquitous Siamese twins,” Documenta Mathematica, extra volume ISMP, pp. 419– 431. Lloyd, S. P. (1957), “Least square quantization in PCM,” internal Bell Telephone Labs. report. The material was presented at the Institute of Mathematical Statistics Meeting, Atlantic City, NJ, Sep. 10–13, 1957. Published 25 years later as Lloyd, S. P. (1982), “Least squares quantization in PCM,” IEEE Trans. Inf. Theory, vol. 28, no. 2, pp. 129–137. MacKay, D. J. C. (2003), Information Theory, Inference, and Learning Algorithms, Cambridge University Press. MacQueen, J. B. (1965), “On convergence of k-means and partitions with minimum average variance,” abstract, Ann. Math. Statist., vol. 36, p. 1084. MacQueen, J. B. (1967), “Some methods for classification and analysis of multivariate observations,” Proc. 5th Berkeley Symp. Mathematical Statistics and Probability, vol. 1, pp. 281–297, Berkeley, CA. Marschak, J. (1954), “Towards an economic theory of organization and information,” in Decision Processes, R. M. Thrall, C. H. Coombs, and R. C. Davis, editors, Wiley. Nilsson, N. (1965), Learning Machines, McGraw-Hill. Okabe, A., B. Boots, and K. Sugihara (2000), Spatial Tessellations: Concepts and Applications of Voronoi Diagrams, 2nd ed., Wiley. Ostrovsky, R., Y. Rabani, L. J. Schulman, and C. Swamy (2006), “The effectiveness of Lloyd-type methods for the k-means problem,” Proc. 47th Ann. IEEE Symp. Foundations of Computer Science (FOCS), pp. 165–174, Berkeley, CA. Peterson, D. W. (1970), “Some convergence properties of a nearest neighbor decision rule,” IEEE Trans. Inf. Theory, vol. 16, pp. 26–31. Sabin, M. J. and R. M. Gray (1986), “Global convergence and empirical consistency of the generalized Lloyd algorithm,” IEEE Trans. Inf. Theory, vol. 32, no. 2, pp. 148–155. Sebestyen, G. (1962), Decision Making Processes in Pattern Recognition, MacMillan. Shakhnarovich, G., T. Darrell, and P. 
Indyk (2006), Nearest-Neighbor Methods in Learning and Vision, MIT Press. Snow, J. (1854), On the Mode of Communication of Cholera, 2nd ed., John Churchill.


Steinhaus, H. (1957), “Sur la division des corps matériels en parties,” Bull. Acad. Polon. Sci., vol. 4, no. 12, pp. 801–804. Tan, P.-N., M. Steinbach, and V. Kumar (2005), An Introduction to Data Mining, Addison-Wesley. Tukey, J. (1977), Exploratory Data Analysis, Addison-Wesley. Voronoi, G. F. (1908), “Nouvelles applications des paramètres continus à la théorie de formes quadratiques,” Journal für die reine und angewandte Mathematik, vol. 134, pp. 198–287. Ward, J. (1963), “Hierarchical grouping to optimize an objective function,” J. Amer. Statist. Assoc., vol. 58, pp. 236–244. Wilson, D. L. (1972), “Asymptotic properties of nearest neighbor rules using edited data,” IEEE Trans. Syst., Man, Cybern., vol. 2, no. 3, pp. 408–421. Witten, I. H., E. Frank, and M. A. Hall (2011), Data Mining: Practical Machine Learning Tools and Techniques, 3rd ed., Morgan Kaufmann.

53 Self-Organizing Maps

The k-nearest neighbor (k-NN) rule is appealing. However, each new feature

h ∈ IRM requires searching over the entire training set of size N to determine the neighborhood around h. This step is computationally demanding, especially for large M and N . We described the k-means algorithm in the previous chapter as one useful clustering method to ameliorate this challenge. In this chapter, we describe a second method to assist with the same challenge based on the use of self-organizing maps (SOMs). One main difference is that SOMs will cluster the data into regions in a lower-dimensional space than M . This property is particularly useful in visualizing high-dimensional data. SOMs, also known as Kohonen maps, are unsupervised clustering procedures; they are blind to the class label and operate directly on the feature data to perform clustering, as was the case with the k-means algorithm. SOMs can also be used for classification purposes, as we explain in this chapter.

53.1 GRID ARRANGEMENTS

To describe SOMs, we will be using two subscripts, n and k: The subscript n will index the feature samples, {hn}, while the subscript k will index certain locations in space, denoted by {wk}.

Data preprocessing

Consider a collection of N feature vectors, {hn ∈ IR^M}, that arise from some unknown underlying probability distribution, fh(h). SOMs will allow us to estimate this distribution from the data realizations – see later in Fig. 53.8. It is common practice in learning solutions to avoid situations where the entries within each feature vector are not normalized properly and some entries are disproportionately larger than other entries. Such discrepancies distort the operation of learning algorithms, as was already advanced in Example 51.1 when we discussed ill-conditioning in the context of least-squares solutions. For this reason, it is customary to normalize the feature data prior to the launch of a learning procedure. Let h denote a generic feature vector with entries {h(m)}. One way to achieve the desired normalization is to scale the “variances” of the individual entries of


h to the value 1. For this purpose, we first evaluate the sample mean of the N feature vectors:

\bar{h} \;\stackrel{\Delta}{=}\; \frac{1}{N}\sum_{n=0}^{N-1} h_n        (53.1)

and use the result to compute the sample variance of each entry of h by using, for the mth entry:

\sigma^2(m) \;\stackrel{\Delta}{=}\; \frac{1}{N-1}\sum_{n=0}^{N-1} \big(h_n(m) - \bar{h}(m)\big)^2        (53.2)

Subsequently, we normalize the feature vectors {hn} and replace their individual entries by

h_n(m) \leftarrow \big(h_n(m) - \bar{h}(m)\big)/\sigma(m), \quad m = 1, 2, \ldots, M        (53.3)

This transformation can be expressed in vector form. We introduce the following diagonal matrix containing the positive square-roots of the {σ^2(m)}:

S \;\stackrel{\Delta}{=}\; \textrm{diag}\big\{\sigma(1), \sigma(2), \ldots, \sigma(M)\big\}        (53.4)

Then, the normalization amounts to (see Prob. 53.1):

h_n \leftarrow S^{-1}(h_n - \bar{h}), \quad n = 0, 1, \ldots, N-1        (53.5)

We will assume henceforth that this preprocessing step has already been applied to the feature data and continue to use the notation {hn } to refer to the normalized features.
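A minimal sketch of the normalization (53.1)–(53.5): subtract the sample mean and divide each entry by its sample standard deviation (the diagonal scaling S). The function name is ours, and the 1/(N − 1) convention in (53.2) is assumed as reconstructed above.

```python
import numpy as np

def normalize_features(H):
    # H: N x M matrix whose rows are the feature vectors h_n.
    h_bar = H.mean(axis=0)                 # sample mean, eq. (53.1)
    sigma = H.std(axis=0, ddof=1)          # per-entry standard deviation, eq. (53.2)
    return (H - h_bar) / sigma             # h_n <- S^{-1}(h_n - h_bar), eq. (53.5)

rng = np.random.default_rng(0)
H = rng.standard_normal((500, 8)) * np.arange(1, 9)   # entries with unequal scales
Hn = normalize_features(H)
print(Hn.std(axis=0, ddof=1))                          # close to all ones
```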

Rectangular and hexagonal grids

Now, given the N feature vectors {hn}, it is generally difficult to recognize visually the existence of patterns or clusters in the data for dimensions M ≥ 4, and yet clusters may exist in the data. Self-organizing maps provide one useful way to discover these clusters and represent them in a visually recognizable manner. SOMs achieve this objective by constructing topology-preserving mappings. These are mappings from the M-dimensional feature space to a lower-dimensional space (usually, two- or three-dimensional) where feature vectors {ha, hb} that are close to each other in the original space will be mapped to vectors {wa, wb} that will continue to be close to each other in the lower-dimensional space. To construct these mappings, a SOM consists of a collection of M input nodes, one for each entry h(m) of h, and a total of K units, called neurons. The neurons are arranged in a particular geometric pattern, such as lying on a line, or distributed over a rectangular or hexagonal grid, or even organized in a three-dimensional grid formation. Two-dimensional grids are common and we focus on them. Figure 53.1 illustrates two planar arrangements, one showing neurons


arranged in rectangular form and another showing them arranged in hexagonal form.


Figure 53.1 Planar arrangements of neurons in rectangular (left) or hexagonal (right)

grids. The highlighted areas around the central neuron are meant to illustrate neighborhoods of different sizes around this neuron. Any two neurons connected by an edge are referred to as direct neighbors.

The highlighted rectangular and hexagonal areas around the central neuron illustrate neighborhoods of different sizes around that neuron, and neurons that are connected by edges are direct neighbors to each other. We index the neurons by the integers k = 1, 2, . . . , K. The value of K may be smaller, larger, or equal to N. For example, in a 10 × 10 rectangular grid, there are a total of K = 100 neurons. We number the neurons in the first column 1 through 10 from bottom to top, then 11 through 20 for the second column also from bottom to top, and so forth. We associate with each kth neuron an M-dimensional vector, wk. The entries of the feature vector hn play the role of the input nodes to the grid. It is assumed that each of these input nodes is connected to every neuron in the grid; the edges for these connections are not shown to avoid cluttering the figure. In other words, each neuron in the grid has access to the feature vector hn at every iteration n. We will be using two indices for the weight wk, and write wk,n to indicate the value of the neural location wk at time n. This is because the neural weights {wk} at all grid locations will be adjusted continually as part of the learning process. We initialize these weights to small random values at n = −1:

entries of each w_{k,-1} are set to small random values        (53.6)


Distance measures

We assume the Euclidean distances between linked neurons are normalized to a unit value, so that two neighboring neurons are at a nominal distance of one unit length from each other. More generally, if we denote by (xk, yk) the coordinates of unit k in a 2D planar arrangement, then the Euclidean distance between two arbitrary units of indices k and ℓ is given by

d(k, \ell) \;\stackrel{\Delta}{=}\; \sqrt{(x_k - x_\ell)^2 + (y_k - y_\ell)^2}        (53.7)

Although the Euclidean distance is commonly used, there are other choices as well, including the ℓ∞-norm defined by

d_\infty(k, \ell) \;\stackrel{\Delta}{=}\; \max\big\{|x_k - x_\ell|, |y_k - y_\ell|\big\}        (53.8)

as well as the following hexagonal distance for hexagonal grids, which is equal to the smallest number of edges linking neurons k and ℓ:

d_H(k, \ell) \;\stackrel{\Delta}{=}\; \textrm{smallest number of edges linking } k \textrm{ to } \ell        (53.9)
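For a rectangular grid with unit spacing, the Euclidean and ℓ∞ distances (53.7)–(53.8) between units reduce to a few lines of code. The column-major numbering of a K1 × K1 grid below is an assumption of ours based on the grid description above; the hexagonal distance (53.9) would instead require a shortest-path computation on the grid graph and is not shown.

```python
import numpy as np

def unit_coords(k, K1):
    # Unit k (1-based, numbered column by column) -> planar coordinates (x, y).
    return ((k - 1) // K1, (k - 1) % K1)

def d_euclid(k, l, K1):
    (xk, yk), (xl, yl) = unit_coords(k, K1), unit_coords(l, K1)
    return np.sqrt((xk - xl) ** 2 + (yk - yl) ** 2)      # eq. (53.7)

def d_inf(k, l, K1):
    (xk, yk), (xl, yl) = unit_coords(k, K1), unit_coords(l, K1)
    return max(abs(xk - xl), abs(yk - yl))               # eq. (53.8)

print(d_euclid(1, 12, K1=10), d_inf(1, 12, K1=10))
```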

53.2 TRAINING ALGORITHM

Once a geometric grid structure is chosen for the SOM, we use the N available feature vectors {hn} to train it. Training amounts to adjusting the neuron weights {wk} continually until sufficient convergence is attained according to the construction below.

Winning unit and neighborhoods

Let {w_{k,n−1}} denote the weight vectors for the various neurons that are available at iteration n − 1. When a feature vector hn is presented to the SOM at iteration n, the first step is to search over all K neurons to determine the location of the neuron whose weight vector is closest to hn, i.e., we search for

k^o \;\stackrel{\Delta}{=}\; \mathop{\textrm{argmin}}_{1 \leq k \leq K} \|h_n - w_{k,n-1}\|^2 \qquad \textrm{(closest distance criterion)}        (53.10)

in terms of the smallest Euclidean distance between the vectors {hn, w_{k,n−1}}. The resulting index identifies the “winning neuron” or the “best matching unit.” In the case of ties, an index k^o can be selected at random from among the winning neurons. A second option is to search for the neuron that exhibits the largest correlation with hn, i.e.,

k^o \;\stackrel{\Delta}{=}\; \mathop{\textrm{argmax}}_{1 \leq k \leq K} h_n^{\sf T} w_{k,n-1} \qquad \textrm{(largest correlation criterion)}        (53.11)

We view the process of selecting k o as a form of soft competitive learning, where neurons “compete” with each other for the privilege of being selected. The competition is “soft” because, as the description in the following explains, besides the


weight at k^o, several other neurons around k^o will have their weights updated in response to the excitation by hn. Once k^o is chosen, the next step is to select a neighborhood around it. The set of neurons belonging to this neighborhood is denoted by N_{k^o}. There are many ways by which the neighborhood can be defined, some of which are illustrated in Fig. 53.2:

(a) (Circular neighborhood) In this case, the neighborhood around k^o consists of all neurons that lie within a circular region of some radius R and centered at k^o:

N_{k^o} \;\stackrel{\Delta}{=}\; \big\{ k \;|\; d(k^o, k) \leq R \big\}        (53.12)

using the Euclidean distance. It is customary to perform multiple passes over the feature data during the training of the SOM. We index these passes by the letter p = 1, 2, . . . , P. According to this notation, the variable R(p) designates the value of the radius used during the pth pass. The value of R starts from some large initial value, R(1), typically on the order of \sqrt{K/2} or larger in order to approximate the distance from the center of the grid to one of its extreme endpoints. The value of R then decreases exponentially for each pass over the data according to the recursion:

R(p + 1) = \alpha\, R(p), \quad p \geq 1        (53.13)

where α ∈ (0, 1) defines the rate; its value is typically close to 1, such as α = 0.995. It follows that

R(p) = R(1)\,\alpha^{p-1}, \quad p \geq 1        (53.14)

In this way, the radius of the circular neighborhood shrinks with time. Another equivalent way to describe the evolution of the radius parameter R(p) is to express it in terms of a time constant, τ > 0, as follows:

R(p) = R(1)\, e^{-(p-1)/\tau}, \quad p \geq 1        (53.15)

Comparing with (53.14), we see that (α, τ) are related as follows:

\alpha = e^{-1/\tau}        (53.16)

(b) (Rectangular neighborhood) In this case, a rectangular region of width R(p) in the pth pass is used to define the neighborhood of interest. All units that lie within this region are chosen as members of N_{k^o}, i.e.,

N_{k^o} \;\stackrel{\Delta}{=}\; \big\{ k \;|\; d_\infty(k^o, k) \leq R(p) \big\}        (53.17)

using the ℓ∞-distance measure. The radius R(p) starts again from some large initial value and subsequently decreases exponentially for each pass over the data, according to (53.13) or (53.15).


(c) (Hexagonal neighborhood) Similarly, a hexagonal region of width R(p) can be used to identify the neighborhood of interest. All units that lie within the hexagonal region are chosen as members of N_{k^o}, i.e.,

N_{k^o} \;\stackrel{\Delta}{=}\; \big\{ k \;|\; d_H(k^o, k) \leq R(p) \big\}        (53.18)

using the hexagonal distance measure. Neighborhoods of this type are suitable for hexagonal grids, just like rectangular regions are suitable for rectangular grids.

Figure 53.2 Once the winning neuron is identified (marked in blue in the figures), the

neighboring neurons lying within a rectangular region (left) or a circular region of radius R (right) are determined and their weights updated to move them closer to hn .

Neighborhood function

Once a winning neuron k^o is identified, along with a neighborhood N_{k^o} around it, we associate a neighborhood function with the units within this set. The purpose of this function is to assign a nonnegative scaling weight, denoted by sk, to each of the units. There are many ways by which these scaling weights can be assigned, such as using uniform weights during the pth pass over the data:

s_k(p) = \begin{cases} 1, & k \in N_{k^o} \\ 0, & \textrm{otherwise} \end{cases} \qquad \textrm{(uniform weighting)}        (53.19)

This uniform choice is sometimes referred to as hard weighting. Other choices are possible. For example, again during the pth pass, Gaussian weights can be constructed as follows:

s_k(p) = \begin{cases} \dfrac{1}{\sqrt{2\pi R^2(p)}}\; e^{-\frac{d^2(k^o, k)}{2R^2(p)}}, & k \in N_{k^o} \\ 0, & \textrm{otherwise} \end{cases} \qquad \textrm{(Gaussian kernel)}        (53.20)


where d(k^o, k) refers to the distance between neurons k and k^o on the grid, measured in terms of the Euclidean distance, the ℓ∞-distance, or the hexagonal distance, depending on the type of grid or neighborhood arrangements that are being used. In this case, the neighborhood function performs soft weighting. The Gaussian kernel is centered at the winning neuron k^o, with variance R^2(p). In this way, the scaling weight that is assigned to neighbor k will decay exponentially according to the square of the distance from k^o, so that neighbors that are closer to the winning unit will receive larger weights than neighbors that are farther away. It is common to simplify (53.20) and ignore the scaling factor, in which case

s_k(p) = \exp\left\{ -\frac{1}{2R^2(p)}\, d^2(k^o, k) \right\}, \quad k \in N_{k^o}        (53.21)

It is also common to use the same Gaussian kernel to scale all nodes in the grid (and not only the neighboring nodes to k^o, i.e., the condition “k ∈ N_{k^o}” in (53.21) is replaced by “∀ k”). This situation corresponds to defining the neighborhood of any winning unit k^o as consisting of all neurons in the grid, i.e., N_{k^o} = {1, 2, . . . , K}.

Training using distance criterion

We are now ready to describe the training of a SOM using the N given feature vectors {hn}. Training will involve multiple passes over the data. We index the passes by the letter p = 1, 2, . . . , P. During each pass, the radius parameter, R, and the step-size parameter, µ, introduced below remain fixed. Their values are updated as we move from one pass to another. Assume we are at the pth pass, with radius R(p) and step size µ(p); usually, µ(p) decays with the pass index and is chosen to be of the form µ(p) = β/p for some scaling β > 0. It is also common to employ an exponentially decaying step size during the initial Po passes and then switch to a fixed step size µ̄ to enable continuous learning, i.e., to start from some initial value µ(1) and to update µ(p) as follows:

\mu(p+1) = \begin{cases} \lambda\, \mu(p), & 1 \leq p \leq P_o \\ \bar{\mu}, & p > P_o \end{cases}        (53.22)

where λ ∈ (0, 1) defines the rate of decay; its value is close to 1, such as λ = 0.995. The size of Po can be selected close to 10% of the total number of passes P. It follows that

\mu(p) = \mu(1)\, \lambda^{p-1}, \quad 1 \leq p \leq P_o        (53.23)

Another equivalent way of describing the evolution of µ(p) is to express it in terms of a time constant, τ′ > 0, as follows:

\mu(p) = \begin{cases} \mu(1)\, e^{-(p-1)/\tau'}, & 1 \leq p \leq P_o \\ \bar{\mu}, & p > P_o \end{cases}        (53.24)


Comparing the first line with (53.23) we see that we can make the identification

\lambda = e^{-1/\tau'}        (53.25)

Let us now consider the case in which, at every iteration n, the winning neuron k^o is selected based on the closest distance criterion (53.10). Then, the weights in the neighborhood of k^o are adjusted as follows:

w_{k,n} = w_{k,n-1} + \mu(p)\, s_k(p)\, (h_n - w_{k,n-1}), \quad \forall\, k \in N_{k^o}        (53.26)

For small enough step sizes, the iterate w_{k,n} is seen to have the form of a convex combination of the vectors {h_n, w_{k,n−1}} since

w_{k,n} = \big(1 - \mu(p)\, s_k(p)\big)\, w_{k,n-1} + \mu(p)\, s_k(p)\, h_n, \quad \forall\, k \in N_{k^o}        (53.27)

while all other units in the grid remain with their weight vectors unchanged. By doing so, each updated weight wk,n is brought closer to hn than wk,n−1 , as shown in the plot on the left of Fig. 53.3.


Figure 53.3 (Left) The iterate wk,n that results from (53.27) is a convex combination

of {wk,n−1 , hn } and, therefore, it is closer to hn than wk,n−1 . (Right) The iterate wk,n that results from (53.29a) has a larger correlation with hn than wk,n−1 .

The net effect of the update procedure is that neurons that are physically close to each other in the grid arrangement will tend toward similar values. By the same token, feature vectors that are close to each other, and are likely to belong to the same class, will be mapped into the same neighborhood in the grid space. We list the resulting algorithm in (53.28), assuming the Gaussian scaling described by (53.21) is applied to all neurons and similarly for the weight updates (if desired, the scaling and updates can be limited to the neighborhood


Nko ). The sampling of the feature vector for each iteration can be done with or without replacement.

Training SOMs using distance criterion and Gaussian smoothing.
given N feature vectors {h_n ∈ IR^M};
start with small initial conditions w_{k,−1}, k = 1, . . . , K;
set initial values for µ(1) and R(1) = O(K);
select parameters α and λ close to 1;
given number of passes P and a value P_o < P.
repeat for p = 1, 2, . . . , P:
    repeat for n = 0, 1, 2, . . . , N − 1:
        sample a feature vector h_n from the training set
        find winning node k^o for h_n using (53.10)
        set s_k(p) = \exp\left\{-\frac{1}{2R^2(p)}\, d^2(k^o, k)\right\}, \forall\, k
        w_{k,n} = w_{k,n-1} + \mu(p)\, s_k(p)\, (h_n - w_{k,n-1}), \forall\, k
    end
    update µ(p) to µ(p + 1) using (53.22)
    update R(p + 1) = α R(p)
end
return {w_k^\star} ← {w_{k,n}}.        (53.28)
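The following is a compact NumPy rendering of listing (53.28) for a rectangular grid, using the Gaussian smoothing (53.21) applied to all units and the decay schedules (53.13) and (53.22). The grid-coordinate convention and the default parameter values are our own choices for illustration, not prescriptions from the text.

```python
import numpy as np

def train_som(H, K1=10, P=50, Po=5, mu1=1.0, mu_bar=0.1, lam=0.99, alpha=0.995, seed=0):
    rng = np.random.default_rng(seed)
    N, M = H.shape
    K = K1 * K1
    # Planar coordinates of the K units (column-major numbering, unit spacing).
    coords = np.array([(k // K1, k % K1) for k in range(K)], dtype=float)
    W = 0.01 * rng.standard_normal((K, M))   # small random initial weights w_{k,-1}
    mu, R = mu1, float(K)                    # step size mu(1) and radius R(1) = O(K)
    for p in range(1, P + 1):
        for n in rng.permutation(N):                               # sample without replacement
            h = H[n]
            ko = np.argmin(((W - h) ** 2).sum(axis=1))             # winning unit, eq. (53.10)
            d2 = ((coords - coords[ko]) ** 2).sum(axis=1)          # squared grid distances
            s = np.exp(-d2 / (2 * R ** 2))                         # Gaussian weights, eq. (53.21)
            W += mu * s[:, None] * (h - W)                         # update (53.26), all units
        mu = lam * mu if p <= Po else mu_bar                       # step-size schedule (53.22)
        R = alpha * R                                              # radius schedule (53.13)
    return W

rng = np.random.default_rng(1)
W = train_som(rng.random((200, 3)), K1=5, P=10)
```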

Training using correlation criterion

Implementation (53.28) assumes the winning neuron is selected according to the closest distance criterion (53.10). Let us consider now the second situation in which the winning neuron is chosen according to the correlation criterion (53.11). In this case, the weights in the neighborhood of k^o are adjusted as follows:

w'_{k,n} = w_{k,n-1} + \mu(p)\, s_k(p)\, h_n, \quad \forall\, k \in N_{k^o}        (53.29a)

w_{k,n} = \frac{w'_{k,n}}{\|w'_{k,n}\|}        (53.29b)

In this construction, which is illustrated in the plot on the right of Fig. 53.3, the iterate w_{k,n−1} is first updated along a direction that is parallel to h_n to the point w'_{k,n}, which is more correlated with h_n than the starting point w_{k,n−1}. The norm of w'_{k,n} is subsequently adjusted to 1. We list the resulting algorithm in (53.30), assuming the Gaussian scaling described by (53.21) is applied to all neurons and similarly for the weight updates (if desired, the scaling and updates can be limited to the neighborhood N_{k^o}). The sampling of the feature vector for each iteration can be done with or without replacement. Observe that algorithm (53.28) updates the weights in the neighborhood of k^o to be closer to h_n, while algorithm (53.30) updates the weights to be more correlated with h_n.


Training SOMs using correlation and Gaussian smoothing.
given N feature vectors {h_n ∈ IR^M};
start with small initial conditions w_{k,−1}, k = 1, . . . , K;
set initial values for µ(1) and R(1) = O(K);
select parameters α and λ close to 1;
given number of passes P and a value P_o < P.
repeat for p = 1, 2, . . . , P:
    repeat for n = 0, 1, 2, . . . , N − 1:
        sample a feature vector h_n from the training set
        find winning node k^o for h_n using (53.11)
        set s_k(p) = \exp\left\{-\frac{1}{2R^2(p)}\, d^2(k^o, k)\right\}, \forall\, k
        w'_{k,n} = w_{k,n-1} + \mu(p)\, s_k(p)\, h_n, \forall\, k
        w_{k,n} = \frac{w'_{k,n}}{\|w'_{k,n}\|}, \forall\, k
    end
    update µ(p) to µ(p + 1) using (53.22)
    update R(p + 1) = α R(p)
end
return {w_k^\star} ← {w_{k,n}}.        (53.30)

Example 53.1 (Color matching) We illustrate the operation of a SOM by considering a classical example that involves mapping three-dimensional RGB color coordinates (corresponding to the colors red, green, and blue) into locations in the plane. We consider a square grid consisting of K1 = 20 neurons along each direction (vertical and horizontal) for a total of K = 400 neurons. The training data consists of only N = 8 feature vectors, each of size 3 × 1 (i.e., M = 3). Figure 53.4 lists the feature vectors, whose entries represent RGB coordinates in the range 0 to 255. Under each column, we place a colored circle to illustrate the color that the RGB entries in that column represent. For example, the first column has coordinates (255, 0, 0) and represents the color red. The SOM iterates over these feature vectors repeatedly and in a randomized manner. The feature vectors are not normalized in this example in order to retain the color mapping. A total of P = 1000 passes are performed over the training data. The entries of the 3 × 1 weight vectors for each neuron in the grid are initialized to random integer values in the interval [0, 255]. The parameters used to run recursion (53.28) are set to

\mu = 1, \quad \bar{\mu} = 0.1, \quad P_o = 100, \quad \lambda = 0.99, \quad \alpha = 0.995        (53.31)

with the initial value for the radius chosen as R(1) = K = 400. We plot in Fig. 53.5 two representations for the state of the SOM after 1000 passes. We focus on the map shown on the left side. This map transforms the weights of the neurons into colors. We also place on top of the map the initial feature vectors represented by the colored circles. Observe how the neuron locations around each feature vector have a similar color to the input vector. Observe also how the SOM generates clusters of color in the plane: It identifies regions where features are more red-like, and regions where the features are


Figure 53.4 A collection of eight feature vectors of size 3 × 1, each corresponding to the RGB representation of eight input colors.

more green-like, and so forth. The other map on the right in the figure will be discussed in the next section. It is generated by constructing an enlarged unified distance matrix, U, which is a matrix containing information about the distances between the weights of the neurons, i.e., quantities of the form \|w_k - w_\ell\|. The lighter color represents regions where neurons are close to each other in w-space, while the darker color represents transitions or barriers between these regions. Figure 53.6 shows a third representation, known as a terrain plot, which essentially plots the distance values in the U-matrix. The bottom mesh plot in the figure illustrates how similar neurons are pulled together toward close weight values (i.e., small distances between them).


Figure 53.5 (Left) Color map representation of the weight vectors across the rectangular arrangement for the SOM after P = 1000 passes over the data. (Right) Distance map (U -matrix) generated according to the description in the next section.

Once the SOM is trained, we can employ it for classification as follows. Assume a feature vector h is received and we desire to identify what color it corresponds to. In the simulation we used h = col{164, 168, 250}. This test vector is mapped to the location of the neuron whose weight vector is closest to h. This is indicated by the square marker in the map on the left of Fig. 53.5. The SOM ends up “recognizing” the color of h, or at least the cluster that is most representative of its color. Observe how h is mapped to a location in the grid of similar color. Example 53.2 (Batch SOM) Algorithms (53.28) and (53.30) are sequential in nature, with one feature vector hn presented to the SOM at every iteration. Only the weights


Figure 53.6 Terrain maps generated according to the description in the next section.

The bottom mesh illustrates how similar neurons are pulled together towards close weight values (i.e., small distances between them).

of the winning unit and its neighbors are updated at that iteration, and the process repeats. Besides the large computational cost involved in searching for the best matching neuron at every iteration, the performance of the algorithms is affected by the order in which the data is presented to the SOM. We now describe an alternative training algorithm, known as batch SOM, where, at every iteration, the entire training set {h_n} is presented to the SOM, after which the weights of all neurons are updated. We index the batch iterations by the letter b. Let {w_{k,b−1}} denote the weight vectors of the SOM grid at iteration b − 1. For the next batch, we update the {w_{k,b−1}} to {w_{k,b}} for all neurons as follows. First, we determine the winning unit for every feature vector h_n (i.e., we determine the unit whose weight is closest to h_n). We denote the index of this winning unit by k_n, with a subscript n:

k_n ≜ argmin_{1 ≤ k ≤ K} ‖h_n − w_{k,b−1}‖²    (53.32)

Second, we use the Gaussian kernel to associate a smoothing factor between the winning unit kn and any other unit in the grid of index k. We use the following notation for this factor:


s(k_n, k) = exp{ −(1/(2R²(b))) d²(k_n, k) }    (53.33)

where R(b) is a radius parameter that decreases with the batch index b, and d(k_n, k) refers generically to the distance between neurons k and k_n on the grid, measured in terms of their Euclidean, ℓ_∞, or hexagonal distances depending on the geometry of the grid. Finally, every weight vector in the grid is updated by means of the following weighted combination:

w_{k,b} = ( Σ_{n=0}^{N−1} s(k_n, k) h_n ) / ( Σ_{n=0}^{N−1} s(k_n, k) ),   for all k    (53.34)

There are variations of this construction. For example, assume we count for every neuron k the number of times it is selected as a matching unit:

n_k ≜ number of times neuron k is selected as winning unit    (53.35)

We average all n_k feature vectors {h_n} that ended up having k as the matching unit and denote the average by h̄_k. An alternative to (53.34) is to use the following expression over the neighborhood of neuron k:

w_{k,b} = ( Σ_{ℓ∈N_k} n_ℓ s(ℓ, k) h̄_ℓ ) / ( Σ_{ℓ∈N_k} n_ℓ s(ℓ, k) ),   for all k    (53.36)

The resulting algorithm is listed in (53.37) using (53.34).

Batch SOM using distance criterion and Gaussian smoothing.
given N feature vectors {h_n ∈ IR^M};
start with small initial conditions w_{k,0}, k = 1, . . . , K;
set initial value R(1) = O(K); select parameter α close to 1;
given number of passes B.
repeat for b = 1, 2, . . . , B:
    determine winning unit k_n for every h_n using the {w_{k,b−1}}
    set s(k_n, k) = exp{ −(1/(2R²(b))) d²(k_n, k) }, for all k
    update w_{k,b} = ( Σ_{n=0}^{N−1} s(k_n, k) h_n ) / ( Σ_{n=0}^{N−1} s(k_n, k) ), for all k;
    update R(b + 1) = α R(b);
end
return {w_k⋆} ← {w_{k,B}}.                                    (53.37)
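For concreteness, the following is a minimal NumPy sketch of listing (53.37) on a rectangular grid. It is only an illustration: the grid size, radius schedule, number of passes, and the function name batch_som are our own choices rather than part of the listing.

```python
import numpy as np

def batch_som(H, K1=10, B=50, alpha=0.95, seed=0):
    """Minimal batch SOM on a K1 x K1 rectangular grid.

    H: N x M array of feature vectors {h_n}.
    Returns the K x M array of trained weight vectors {w_k}.
    """
    rng = np.random.default_rng(seed)
    N, M = H.shape
    K = K1 * K1
    # grid coordinates of the K neurons, used for the grid distance d(k_n, k)
    coords = np.array([(i, j) for i in range(K1) for j in range(K1)], dtype=float)
    W = 0.01 * rng.standard_normal((K, M))      # small initial weights w_{k,0}
    R = float(K1)                               # initial radius R(1) = O(K1)

    for b in range(1, B + 1):
        # winning unit k_n for every feature vector (distance criterion, (53.32))
        dists = ((H[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)   # N x K
        kn = dists.argmin(axis=1)
        # Gaussian smoothing factors s(k_n, k) from (53.33)
        grid_d2 = ((coords[kn][:, None, :] - coords[None, :, :]) ** 2).sum(axis=2)
        S = np.exp(-grid_d2 / (2.0 * R ** 2))                        # N x K
        # weighted averages (53.34); small constant guards empty neighborhoods
        W = (S.T @ H) / (S.sum(axis=0)[:, None] + 1e-12)
        R = alpha * R                                                # shrink radius
    return W
```

For the color-matching example, H would be the 8 × 3 matrix of RGB feature vectors and the returned rows of W can be rendered directly as colors at the neuron locations.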

53.3 VISUALIZATION

Once a SOM is trained, we end up with a collection of K weight vectors, {w_k}, one for each neuron. These vectors play a role similar to the mean vectors {µ_k} in the k-means clustering procedure from Section 52.4. Since the grid arrangement lives in a lower-dimensional space (e.g., the plane or 3D), as opposed to the feature vectors {h_n} that are M-dimensional, we will explain next how to use


the weights {wk } to generate a colored map, or an elevated terrain map, for the grid in order to facilitate visualization of clusters that are present in the data. For example, in the color matching example just discussed, the visualization of the final result was facilitated by the fact that the entries of the weight vectors already correspond to RGB coordinates. Therefore, in that case, it was only a matter of transforming the weights to colors at the neural locations to arrive at the colored map. More generally, when this interpretation is not applicable (e.g., when the weights {wk } are M -dimensional), we need to explain how to generate a visual map for the grid. One way is to construct a “unified distance matrix” as follows.

Unified distance matrix
For each neuron k, we compute the average (or median or minimum) distance of its weight vector w_k to the weight vectors {w_ℓ} of the neurons that are directly connected to it via edges. We denote the set of direct neighbors to neuron k by D_k (this set excludes neuron k itself), so that this average distance is given by

d_w(k) ≜ (1/|D_k|) Σ_{ℓ∈D_k} ‖w_k − w_ℓ‖    (53.38)

where the notation |D_k| denotes the cardinality of D_k. We also compute the individual distances from every neuron k to its direct neighbors, denoted by

d_w(k, ℓ) = ‖w_k − w_ℓ‖ if ℓ ∈ D_k, and d_w(k, ℓ) = 0 otherwise    (53.39)

If neurons k and ℓ are not connected by an edge, we set d_w(k, ℓ) = 0. Using these distances, we can generate a visual representation for the SOM in at least two ways. One way is to use the distances to generate altitude levels for a terrain plot, as was shown in Fig. 53.6. A second way is to use the distances to generate a "color" map. Specifically, we normalize all distances by dividing them by their maximum value so that their values lie within the interval [0, 1]. Neurons whose weights are close to each other will have normalized distances close to 0, while neurons whose weights are farther apart from each other will have distances close to 1. We will be representing these distances in color by mapping small values to light colors and larger values to dark colors. In this way, nodes that are close to each other will be in a region of light colors, while boundaries between clusters will appear in darker color. We explain next how the distance measures are transformed into a colored visualization for the SOM. Assume, for illustration purposes, that we are dealing with a rectangular grid that has K1 × K1 neurons, so that the total number of neurons is K = K1². In Fig. 53.7 we use K1 = 3 neurons per dimension, represented by the circles. There are at least two ways to represent the distances {d_w(k), d_w(k, ℓ)} in matrix form. In the first simpler option, we construct a matrix U of the same size K1 × K1 as
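As a rough sketch of how the average distance (53.38) translates into code on a rectangular grid, assuming 4-neighbor connectivity for D_k (a hexagonal grid would use a different neighbor set), one may write:

```python
import numpy as np

def u_matrix(W, K1):
    """Average-distance map (53.38) for a K1 x K1 rectangular SOM.

    W: (K1*K1) x M weight matrix; neuron k sits at grid row k // K1, column k % K1.
    Returns a K1 x K1 array whose (i, j) entry is d_w(k) for that neuron,
    normalized to [0, 1] so it can be displayed as a grayscale map.
    """
    U = np.zeros((K1, K1))
    for i in range(K1):
        for j in range(K1):
            k = i * K1 + j
            dists = []
            for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]:   # direct neighbors D_k
                ii, jj = i + di, j + dj
                if 0 <= ii < K1 and 0 <= jj < K1:
                    dists.append(np.linalg.norm(W[k] - W[ii * K1 + jj]))
            U[i, j] = np.mean(dists)
    return U / U.max()
```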


[Figure 53.7: a 3 × 3 rectangular grid of neurons numbered 1–9, arranged with 3, 6, 9 along the top row, 2, 5, 8 along the middle row, and 1, 4, 7 along the bottom row, together with the two ways of arranging the distance measures into a unified distance matrix.]

Option 1 (3 × 3 unified distance matrix, one entry per neuron):

U = [ d_w(3)  d_w(6)  d_w(9)
      d_w(2)  d_w(5)  d_w(8)
      d_w(1)  d_w(4)  d_w(7) ]

Option 2 (5 × 5 unified distance matrix, with additional rows and columns inserted between the neurons; the entries marked × are discussed below):

U = [ d_w(3)    d_w(3,6)   d_w(6)    d_w(6,9)   d_w(9)
      d_w(3,2)  ×          d_w(6,5)  ×          d_w(9,8)
      d_w(2)    d_w(2,5)   d_w(5)    d_w(5,8)   d_w(8)
      d_w(2,1)  ×          d_w(5,4)  ×          d_w(8,7)
      d_w(1)    d_w(1,4)   d_w(4)    d_w(4,7)   d_w(7) ]

Figure 53.7 Construction of the unified distance matrix U in one of two ways for an

example involving a 3 × 3 rectangular grid. In the first option, the U matrix has the same K1 × K1 dimensions as the grid, while in the second option, empty rows and columns are inserted between the neurons and U has size (2K1 − 1) × (2K1 − 1).

the grid. Each entry of this matrix is set to the average distance d_w(k) for the neuron that corresponds to that entry, as shown in the top row of Fig. 53.7.

Figure 53.8 Grayscale color representation of the SOM in terms of the distance

measures (53.38)–(53.39). The hexagons are colored in accordance with the distance values, dw (k, `). Dark color between neurons indicates large distances between them, while lighter color indicates smaller distances. The darker regions correspond to boundaries between clusters.

In the second option we construct a larger matrix U of size (2K1 − 1) × (2K1 − 1). We do so by separating the neurons by additional rows and columns, as illustrated in the bottom part of the same figure. Each neuron (represented by a circle) is surrounded by a small square. We insert virtual squares between the neurons;


these are the squares without circles in the figure. We assign a numerical value to each location in this enlarged grid. Locations with neurons will be assigned the average distance values d_w(k) for the respective neurons, while locations between linked neurons will be assigned the distances between them, namely, d_w(k, ℓ). One challenge arises for the case of rectangular grids that is not present for enlarged hexagonal grids (shown in the right plot of Fig. 53.7). The challenge is in relation to the entries marked by an × in the U-matrix. For example, neither neurons (3, 5) nor neurons (2, 6) are linked to each other in the SOM grid. One convention is to replace × by the average of the squares surrounding it, which are highlighted in the 5 × 5 matrix in Fig. 53.7, e.g.,

× ← (1/4)[ d_w(3, 6) + d_w(3, 2) + d_w(6, 5) + d_w(2, 5) ]    (53.40)

A similar assignment is performed for the other × entries. We refer to U as the U-matrix, with the letter "U" standing for the "unified distance matrix." Once the U-matrix is constructed, we generate a grayscale representation for it where smaller entries appear in light color and larger entries appear in dark color – see Fig. 53.8. The result is a color map that helps visualize clusters in the data. This construction was illustrated in the plot on the right in Fig. 53.5 for the earlier example using a different color scheme for illustration purposes. A comparison between both constructions for the U-matrix for the same example is shown in Fig. 53.9, again using a different color scheme than grayscale.


Figure 53.9 Comparison between the two constructions for the U -matrix for the simulation under Example 53.1.

Clustering The resulting grayscale representation of the U -matrix provides a clustering representation for the original feature vectors and helps reveal topological characteristics in the feature space. This is because, by constructing a topology-preserving map, we are able to identify how close feature vectors are to each other in the


original space by checking how close their winning neurons are to each other in the grid space. We can also generate a clustering map by applying the K-means algorithm (52.29) to the weight vectors {wk }.
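A bare-bones sketch of this clustering step, using plain Lloyd iterations in place of the specific listing (52.29) and a user-chosen number of clusters, could look as follows (W is assumed to be the K × M floating-point matrix of trained weight vectors):

```python
import numpy as np

def kmeans(W, n_clusters, iters=100, seed=0):
    """Plain Lloyd iterations applied to the SOM weight vectors."""
    rng = np.random.default_rng(seed)
    centers = W[rng.choice(len(W), n_clusters, replace=False)]
    for _ in range(iters):
        # assign each neuron weight to its closest center
        assign = ((W[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        # recompute each center as the mean of its assigned weights
        for c in range(n_clusters):
            if np.any(assign == c):
                centers[c] = W[assign == c].mean(axis=0)
    return assign, centers

# assign[k] can then be drawn as a color at neuron k's grid location to display the clusters
```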

Probability distribution of feature data The SOM map also helps reveal the underlying probability distribution for the data, namely, fh (h). This is illustrated in Fig. 53.10. We insert at each neuron location an integer value that indicates the number of times the neuron has been selected as the winning unit during training. The result is a hit histogram, which approximates the desired pdf.
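A sketch of this count, assuming the trained weights are stored as the rows of a matrix W and the data as the rows of H, is:

```python
import numpy as np

def hit_histogram(W, H):
    """Count how often each neuron is the winning unit for the feature vectors in H."""
    dists = ((H[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)   # N x K squared distances
    winners = dists.argmin(axis=1)                               # winning neuron per sample
    return np.bincount(winners, minlength=W.shape[0])            # hits per neuron
```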


Figure 53.10 Histogram representation where an integer is placed at each neuron

location to indicate the number of times the neuron has been selected as a matching unit during training.

Classification The SOM map can be used for classification purposes as follows. The SOM is first trained on labeled feature vectors {hn } with known classes {r(n)}. The neurons are subsequently labeled according to these classes; for example, neuron wk is assigned the label of the majority of the training features that are closest to it. This step leads to a class map, where different regions of the SOM grid are assigned different labels; the construction is illustrated in the next example – see the plot on the left in the bottom row of Fig. 53.12, where the SOM grid is divided into two regions corresponding to classes γ ∈ {±1} in that example. Now, given a new feature vector h, one that the SOM has not been exposed to during training, we determine the neuron that is closest to it and assign the corresponding label to h. This construction amounts to a form of vector quantization, where the M -dimensional feature space is represented by the K weight vectors in the grid; any feature vector from the original space is represented by one of the weight vectors in the grid space.
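The following sketch illustrates one simple variant of this procedure, where each neuron is labeled by the majority vote of the training vectors mapped to it (the example in the next section instead labels each neuron with the majority label of its five closest training vectors); the function names are ours:

```python
import numpy as np

def label_neurons(W, H_train, y_train):
    """Assign to each neuron the majority label (+1/-1) of the training vectors it wins."""
    winners = ((H_train[:, None, :] - W[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    labels = np.zeros(W.shape[0], dtype=int)
    for k in range(W.shape[0]):
        votes = y_train[winners == k]
        # np.sign gives 0 for unused neurons or exact ties; a tie-break rule could be added
        labels[k] = np.sign(votes.sum()) if votes.size else 0
    return labels

def classify(W, neuron_labels, h):
    """Map a new feature vector to its closest neuron and return that neuron's label."""
    k = ((W - h) ** 2).sum(axis=1).argmin()
    return neuron_labels[k]
```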


Challenges SOMs are expensive to construct, requiring extensive distance comparisons with all neurons in the grid at each iteration. This cost can be prohibitive for large feature spaces. SOMs also require the training data {hn } to be representative enough of the feature space to attain meaningful clustering. This is a challenging assumption, especially for large dimensional spaces due to the curse of dimensionality, as we are going to explain later in a broader context in Section 64.1. SOMs further assume that all feature entries are present (i.e., no entries are missing), otherwise searching for the winning neuron is not possible. Another inconvenience is that SOM results are not generally consistent: (a) different SOMs on the same data can lead to different clustering results; (b) similar clusters may appear in different regions of the SOM; and (c) clusters may appear divided into smaller groups rather than blended together into a larger region. Even more importantly, the self-organizing map is built on the premise that feature vectors that are close to each other should belong to the same or similar clusters; this property is not always true. Example 53.3 (Application to breast cancer dataset) We apply the SOM formulation to a breast cancer dataset. The data consists of N = 569 samples {hn }, with each sample corresponding to a benign or malignant cancer classification. We use γ(n) = −1 for benign samples and γ(n) = +1 for malignant samples. Each feature vector in the data contains M = 30 attributes corresponding to measurements extracted from a digitized image of a fine needle aspirate (FNA) of a breast mass. The attributes describe characteristics of the cell nuclei present in the image, such as those listed in Table 53.1. The feature vectors are normalized by following construction (53.3).

Table 53.1 Attributes for the breast cancer dataset.

Attribute   Explanation
1    Radius of the cell, measured in terms of the mean of the distances from the center of the cell to points on the perimeter.
2    Texture of the cell, measured in terms of the standard deviation of the grayscale values.
3    Perimeter of the cell.
4    Area of the cell.
5    Smoothness of the cell, measured in terms of local variation in radius lengths.
6    Compactness of the cell, measured in terms of perimeter²/area − 1.0.
7    Concavity, measured in terms of the severity of the concave portions of the cell contour.
8    Number of concave portions of the cell contour.
9    Cell symmetry.
10   Fractal dimension, measured in terms of a "coastline approximation" minus one.

We select 456 samples (80%) for training the SOM and retain the other 113 samples (20%) for testing the classification performance. A total of P = 1000 passes are performed over the training data. The entries of the weight vectors for each neuron in the



Figure 53.11 Each feature vector hn from the training set is represented by a filled red

or green circle (depending on its label: green for γ(n) = −1 and red for γ(n) = +1) at the location of the neuron with the closest weight vector. Observe how feature vectors corresponding to benign samples appear on one side of the plot, while all other feature vectors corresponding to the malignant samples appear on the other side of the plot. The original breast cancer dataset is available from https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic) and https://github.com/kostasdiamantaras/Machine-Learning-Example-MATLAB.

grid are initialized to random Gaussian values with zero mean and unit variance. The parameters used to run recursion (53.28) over a 50 × 50 square grid are set to

µ = 1,  µ̄ = 0.1,  P_o = 100,  λ = 0.99,  α = 0.99    (53.41)

with the initial value for the radius variable chosen as R(1) = K = 2500 = (50)². Once the SOM is trained, we end up with weight vectors {w_k} for each neuron in the grid. For each training feature vector h_n, we find the closest neuron and place either a green or red circle at its location, depending on whether h_n belongs to class γ(n) = −1 (benign sample) or γ(n) = +1 (malignant sample); if multiple training feature vectors {h_n} end up mapping to the same location in the grid, we use a majority vote to decide on the label for that location. The resulting plot is shown in Fig. 53.11, which illustrates the clustering ability of the SOM.


We use the SOM for classification as follows. Figure 53.12 shows four plots. The top leftmost plot repeats the mapping of the training feature vectors from Fig. 53.11 for comparison with the other three plots. In this plot, the colored symbols are only placed at the locations corresponding to the features {hn }. In the plot below it, we extend the label assignments to all neural locations in the grid to generate a class map as follows. For each neuron, we determine the closest five training feature vectors and assign their majority label to the neuron. By doing so, we obtain two colored regions. During testing, feature vectors h falling into one region or the other will be classified accordingly as belonging to class γ = +1 or γ = −1. The rightmost plot in the bottom row displays the locations of the test samples on the grid (they are placed at the locations of the closest neurons). The rightmost plot in the first row shows the result of classifying these test samples by using the SOM, where each test sample is assigned the label of the closest neuron. This construction leads to 5 errors out of 113 test samples, which corresponds to an empirical error rate on the order of 4.42%.


Figure 53.12 The top leftmost plot maps the training samples to the closest neural

locations in the SOM grid. The plot on the left in the bottom row assigns labels to each neural location by using the majority label of the five closest feature vectors in the training set. In the rightmost plot in the bottom row, each test sample is mapped to the closest neuron. The rightmost plot in the first row shows the result of classifying these test samples by using the SOM grid.


53.4

COMMENTARIES AND DISCUSSION Self-organizing maps. These maps have found applications in a wide range of areas, including in process control, robotics, material science and chemistry, statistical analysis, and pattern recognition – see, e.g., Kohonen (1996, 2001), Oja and Kaski (1999), Brereton (2012), and Qian et al. (2019). SOMs were introduced by Kohonen (1982, 1984), whose work was motivated by the earlier studies by von der Malsburg and Willshaw (1973) and Willshaw and von der Malsburg (1976). These earlier works developed a self-organizing topologically preserving model for the interaction between the retina receptive fields and the cerebral cortex. Kohonen (1982) generated a simpler model that was capable of exhibiting self-organized behavior in more general settings. The resulting SOM algorithm was an outgrowth of earlier investigations on associate memories by the same author in Kohonen (1972, 1973, 1974). The sequential and batch versions of these algorithms for training SOMs are described in Kohonen (2001). The U -matrix representation was introduced by Ultsch and Siemon (1990). Despite its simplicity, the motivation for the various steps involved in the operation of SOMs, while based on ingenious insights, remains largely ad-hoc. A satisfying analysis for the convergence and performance properties of SOMs continues to be an open question. In particular, there are no well-defined optimality criteria that motivate the derivation of the training algorithms (53.26) or (53.29a)–(53.29b) – see, though, Probs. 53.6–53.8. Some analyses exist in the one-dimensional case, e.g., in Fort (2006), but more is needed to understand the behavior of SOMs more clearly in higher dimensions. For more discussion on these and related issues, the reader may consult the works by Ritter and Schulten (1988), Erwin, Obermayer, and Schulten (1992a,b), Bishop, Svensén, and Williams (1998), Lau, Yin, and Hubbard (2006), and Yin (2008), as well as the texts by Ritter, Martinetz, and Schulten (1992), Fausett (1994), Mehotra, Mohan, and Ranka (1997), Haykin (1999, 2009), Oja and Kaski (1999), Kohonen (1996, 2001), Van Hulle (2000), Hammer et al. (2004), Principe and Miikkulainen (2009), Brereton (2012), and Qian et al. (2019). Breast cancer Wisconsin dataset. Example 53.3 uses the breast cancer Wisconsin dataset, which can be downloaded from the UCI Machine Learning Repository at https: //archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29, and from https://github.com/kostasdiamantaras/Machine-Learning-ExampleMATLAB. For information on how the data was generated, the reader may consult the work by Mangasarian, Street, and Wolberg (1995).

PROBLEMS

53.1 Let h ∈ IR^M denote a random vector with mean h̄ and positive-definite covariance matrix R_h. Let R_h^{1/2} denote a square-root factor for R_h and define the normalized vector h′ = R_h^{−1/2}(h − h̄). Verify that the mean of h′ is zero and its covariance is the identity matrix.
53.2 Refer to Fig. 53.1. How many connections exist in a square rectangular grid with K neurons? What about a hexagonal grid?
53.3 Verify whether definition (53.9) for the "hexagonal distance" is a valid distance measure.
53.4 Refer to the batch SOM algorithm (53.37). Assume the neighborhood function is replaced by uniform weighting over the neighborhood of k, i.e., s(k_n, k) = 1 if k_n ∈ N_k and zero otherwise. What does the expression for w_{k,b} reduce to? What about expression (53.36)?


53.5 Refer to the batch SOM expression (53.34). Assume we limit the {h_n} used in this expression only to those whose winning units k_n belong to the neighborhood of k, i.e., k_n ∈ N_k. How would this expression relate to (53.36)?
53.6 Consider N training feature vectors {h_n ∈ IR^M}, and a grid with K neurons whose weights are denoted by {w_k}. For each h_n, let k_n denote the best matching unit to h_n based on the distance criterion and define the scalars

a_nk = s(k_n, k) if k ∈ N_{k_n}, and a_nk = 0 otherwise

That is, a_nk is zero for all neurons k that are not in the neighborhood of the winning unit. Moreover, the term s(k_n, k) denotes a smoothing factor and it can be defined, for example, in terms of the Gaussian kernel:

s(k_n, k) = exp{ −(1/(2R²)) d²(k_n, k) }

in terms of a distance measure between the coordinates of k_n and k on the grid, and where R > 0 is some variance parameter. Consider the optimization problem with a weighted quadratic loss:

{w_k⋆} = argmin_{w_k ∈ IR^M} (1/N) Σ_{n=0}^{N−1} a_nk (h_n − w_k)²

Write down a stochastic gradient recursion for determining the {w_k⋆}. How does the recursion compare with (53.26)?
53.7 Continuing with Prob. 53.6, fix a_nk = s(k_n, k) for all k. What is the least-squares solution of the following problem?

{w_k⋆} = argmin_{w_k ∈ IR^M} (1/N) Σ_{n=0}^{N−1} s(k_n, k)(h_n − w_k)²

Compare with the batch SOM expression (53.34).
53.8 Consider a similar setting to Prob. 53.6, except that now k_n denotes the best matching unit to h_n based on the correlation criterion. Define the scalars {a_nk, s(k_n, k)} similarly and consider the optimization problem:

{w_k⋆} = argmax_{w_k ∈ IR^M} (1/N) Σ_{n=0}^{N−1} a_nk h_n^T w_k,   subject to ‖w_k‖ = 1

Write down a stochastic projection gradient recursion for determining the {w_k⋆}. How does the recursion compare with (53.29a)–(53.29b)?

REFERENCES Bishop, C., M. Svensén, and C. Williams (1998), “GTM: The generative topographic mapping,” Neural Comput., vol. 10, no. 1, pp. 215–234. Brereton, R. G. (2012), “Self organising maps for visualising and modelling,” Chem. Cent. J., vol. 6 (suppl. 2): S1. Erwin, E., K. Obermayer, and K. Schulten (1992a), “Self-organising maps: Stationary states, metastability and convergence rate,” Biologic. Cybern.,vol. 67, pp. 35–45. Erwin, E., K. Obermayer, and K. Schulten (1992b), “Self-organising maps: Ordering, convergence properties and energy functions,” Biologic. Cybern., vol. 67, pp. 47–55. Fausett, L. (1994), Fundamentals of Neural Networks, Prentice Hall.

2312

Self-Organizing Maps

Fort, J. C. (2006), “SOM’s mathematics,” Neural Netw., vol. 19, pp. 6–7, pp. 812–816. Hammer, B., A. Micheli, A. Sperduti, and M. Strickert (2004), “Recursive selforganizing network models,” Neural Netw., vol. 17, nos. 8–9, pp. 1061–1086. Haykin, S. (1999), Neural Networks: A Comprehensive Foundation, Prentice Hall. Haykin, S. (2009), Neural Networks and Learning Machines, 3rd ed., Pearson. Kohonen, T. (1972), “Correlation matrix memory,” IEEE Trans. Comput., vol. 21, pp. 353–359. Kohonen, T. (1973), “A new model for randomly organised associative memory,” Intl. J. Neurosci., vol. 5, pp. 27–29. Kohonen, T. (1974), “An adaptive associative memory principle,” IEEE Trans. Comput., vol. 23, pp. 444–445. Kohonen, T. (1982), “Self-organised formation of topologically-correct feature map,” Biologic. Cybern., vol. 43, pp. 56–69. Kohonen, T. (1984), Self-Organization and Associative Memory, Springer. Kohonen, T. (1996), “Engineering applications of the self-organizing map,” Proc. IEEE, vol. 84, no. 10, pp. 1358–1384. Kohonen, T. (2001), Self-Organising Maps, 3rd ed., Springer. Lau, K. W., H. Yin, and S. Hubbard (2006), “Kernel self-organizing maps for classification,” Neurocomput., vol. 69, pp. 2033–2040. Mangasarian, O. L., W. N. Street, and W. H. Wolberg (1995), “Breast cancer diagnosis and prognosis via linear programming,” Operations Research, vol. 43, no. 4, pp. 570– 577. Mehotra, K., C. K. Mohan, and S. Ranka (1997), Elements of Artificial Neural Networks, MIT Press. Oja, E. and S. Kaski, editors (1999), Kohonen Maps, Elsevier. Principe, J. C. and R. Miikkulainen, editors (2009), Advances in Self-Organizing Maps, Springer. Qian, J., N. P. Nguyen, Y. Oya, G. Kikugawa, T. Okabe, Y. Huang, and F. S. Ohuchi (2019), “Introducing self-organized maps (SOM) as a visualization tool for materials research and education,” Results Mater., vol. 4, art. 100020. Ritter, H., T. Martinetz, and K. Schulten (1992), Neural Computation and SelfOrganising Maps: An Introduction, Addison-Wesley. Ritter, H. and K. Schulten (1988), “Convergence properties of Kohonen’s topology conserving maps: Fluctuations, stability, and dimension selection,” Biologic. Cybern., vol. 60, pp. 59–71. Ultsch, A. and H. P. Siemon (1990), “Kohonen’s self-organizing feature maps for exploratory data analysis,” Proc. Int. Neural Network Conf., pp. 305–308, Paris. Van Hulle, M. M. (2000), Faithful Representations and Topographic Maps: From Distortion to Information-Based Self-Organization, Wiley. von der Malsburg, C. and D. J. Willshaw (1973), “Self-organization of orientation sensitive cells in the striate cortex,” Kybernetik, vol. 4, pp. 85–100. Willshaw, D. J. and C. von der Malsburg (1976), “How patterned neural connections can be setup by self-organization,” Proc. Roy. Soc. Lond. Ser. B, vol. 194, pp. 431–445. Yin, H. (2008), “The self-organizing maps: Background, theories, extensions and applications,” Studies Comput. Intell., vol. 115, pp. 715–762.

54 Decision Trees

We mentioned earlier in Section 52.3 that the nearest-neighbor (NN) rule for classification and clustering treats equally all attributes within each feature vector, hn ∈ IRM . If, for example, some attributes are more relevant to the classification task than other attributes, then this aspect is ignored by the NN classifier because all entries of the feature vector will contribute similarly to the calculation of Euclidean distances and the determination of neighborhoods. This property is generally undesirable. In this chapter, we discuss decision tree classifiers, which are able to discriminate among the attributes and decide on their relative importance to the task at hand. Decision trees, as the name suggests, have a tree structure with branches and leaves. Each tree will consist of a root node, internal nodes, and end or terminal nodes (also called leaves). The nodes will be selected according to metrics that measure how informed an attribute is in relation to other attributes. Once a tree is constructed, and when a feature vector is presented to it for classification purposes, the attributes will be subjected to various queries at the nodes. Depending on the answer to each query, the feature vector will propagate through the branches until it reaches an end node, which will then determine the class label for the feature vector. A decision tree effectively divides the feature space into subregions, one for each leaf, and attaches class labels to these regions. Each subregion is defined by a sequence of responses to query questions leading to its leaf. In this way, decision trees lead to discriminative classification structures that end up approximating the conditional probability distribution, P(r = r|h = h).

54.1 TREES AND ATTRIBUTES

The process of classifying data by means of decision trees is illustrated schematically in Fig. 54.1. The plot on the left shows two-dimensional feature data, h ∈ IR², whose individual entries are denoted by x_1 and x_2 for ease of reference:

h = col{x_1, x_2}    (54.1)

The points represented by circles belong to class +1, while the points represented by squares belong to class −1. The plot on the right shows a decision tree that


is motivated by this data; we will explain in the following how to construct such trees in a formal manner. Here, we simply construct the tree by inspecting the distribution of the training data in the plane, as shown in the plot on the left. If we examine the proposed tree, we find that it has one root node performing the query "is x_1 ≥ 1?" and two internal nodes performing the queries "is x_2 ≥ 2?" and "is x_2 ≥ 1?" Observe that the tests at the nodes are in the form of Boolean queries with yes/no answers. In general, the order by which the attributes are queried matters. For instance, in this example, the root node could have employed a different test by using the second entry, x_2, instead of x_1. The details on how to select which attribute to test first and in what order are spelled out in future sections. One key feature that will emerge from the presentation in this chapter is that the construction of decision trees will not treat all attributes equally but will be able to identify and exploit the most informative attributes in a guided manner. Although decision trees can be applied to multiclass classification as well, as will become evident from the discussion, we will simplify the presentation by focusing on binary classification problems. We will assume henceforth that we have N training points {γ(n), h_n}, where γ(n) ∈ {+1, −1} denotes the class variable and h_n is the nth feature vector.


Figure 54.1 One example of a decision tree motivated by the training data {γ(n), hn }

shown on the left, where γ(n) ∈ {+1, −1} and hn ∈ IR2 . The tree consists of one root node, two internal nodes, and four leaves. One class label is associated with each leaf.

Types of attributes
One strength of decision tree classifiers is that they can handle different types of attributes. Let us denote the individual entries of h_n by the generic letter x:

h_n = col{x_1, x_2, x_3, . . . , x_M} ∈ IR^M    (54.2)


Each x_m can be binary-valued, i.e., a Boolean variable assuming values such as {0, 1}, {true, false}, {yes, no}, {small, big}. It can also be discrete, assuming a multitude of levels such as {red, blue, black, white}, {large, medium, small}, {0, 1, 2, 3, 4}. The entry x_m can also be real-valued and correspond to measurements of some continuous variable such as temperature, humidity, weight, length, and so forth. Some of these examples and possibilities are listed in Table 54.1.

Table 54.1 Examples of Boolean, discrete, and real-valued attributes.

Type          Example
Boolean       x_m ∈ {(1, 0), (true, false), (yes, no)}
discrete      x_m ∈ {(red, blue, white), (large, medium, small)}
real-valued   x_m ∈ {temperature, humidity, blood pressure}

We will motivate decision trees by assuming that all attributes are Boolean. In general, even if h_n contains attributes that are not Boolean, it is possible to transform them to be of this type. For example, assume an attribute x_m happens to be real-valued. We can transform it into a Boolean attribute by comparing it against some threshold, t_m. If we use the letter x′_m to refer to an attribute x_m after transformation, then one way to map x_m into the Boolean variable x′_m is to use:

if x_m ≥ t_m ⟹ set x′_m = 1    (54.3a)
if x_m < t_m ⟹ set x′_m = 0    (54.3b)

The direction of the inequalities can be reversed if desired. We can also replace x_m by multiple Boolean entries. For instance, we can transform x_m by using finer intervals to define the values for four Boolean variables as follows:

check whether x_m ∈ [a_1, b_1] or not ⟹ set x′_{m,1} to 1 or 0    (54.4a)
check whether x_m ∈ [a_2, b_2] or not ⟹ set x′_{m,2} to 1 or 0    (54.4b)
check whether x_m ∈ [a_3, b_3] or not ⟹ set x′_{m,3} to 1 or 0    (54.4c)
check whether x_m ∈ [a_4, b_4] or not ⟹ set x′_{m,4} to 1 or 0    (54.4d)

In this case, the entry x_m is replaced by four entries {x′_{m,1}, x′_{m,2}, x′_{m,3}, x′_{m,4}} and the original feature vector is enlarged. Each added x′_{m,j} will indicate whether the original x_m belongs to some particular interval. Note that in the process of transforming all attributes into Boolean variables, the size of the original feature vector is generally enlarged. The choice of suitable threshold values t_m, or interval ranges [a_m, b_m], is problem-dependent. It is also one of the most challenging aspects in designing decision trees; their performance will be dependent on these choices. For the encoding represented by (54.4a)–(54.4d), if the intervals are disjoint and chosen such that only one of the conditions is met, while the other conditions are violated, then only one transformed value x′_{m,j} will be 1. This type of encoding is known as one-hot encoding. For the case described by (54.4a)–(54.4d), four one-hot encoding choices are possible:


(1, 0, 0, 0),  (0, 1, 0, 0),  (0, 0, 1, 0),  (0, 0, 0, 1)    (54.5)
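The transformations (54.3a)–(54.5) are straightforward to code. The short sketch below illustrates them; the interval cut-points in the example are illustrative choices, not values prescribed by the text:

```python
def threshold_to_boolean(x, t):
    """Transform a real-valued attribute into a Boolean one as in (54.3a)-(54.3b)."""
    return 1 if x >= t else 0

def one_hot_intervals(x, intervals):
    """Replace a real-valued attribute by one Boolean entry per interval, as in (54.4a)-(54.4d).

    intervals: list of (a, b) pairs; with disjoint intervals covering the range of x,
    exactly one output entry equals 1, which yields the one-hot encoding in (54.5).
    """
    return [1 if a <= x <= b else 0 for (a, b) in intervals]

# example: encode a temperature reading into four bands (hypothetical cut-points)
print(one_hot_intervals(37.8, [(35, 36.4), (36.5, 37.4), (37.5, 38.9), (39, 42)]))
# -> [0, 0, 1, 0]
```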

We can also transform discrete attributes to Boolean type through similar queries. Assume, for example, that an attribute x_m is a color variable. Then, we can transform x_m into a Boolean attribute x′_m depending on whether x_m corresponds to a light color or a dark color, or whether it corresponds to the blue color versus non-blue colors. These situations are illustrated by the following examples, where we also show in the last case how x_m can be transformed into multiple Boolean variables if finer tuning is desirable:

x_m ∈ colors ⟹ x′_m = 1 if x ∈ {light colors}, x′_m = 0 if x ∈ {dark colors}    (54.6a)
x_m ∈ colors ⟹ x′_m = 1 if x ∈ {blue}, x′_m = 0 if x ∈ {non-blue}    (54.6b)
x_m ∈ colors ⟹ x′_{m,1} = 1 if x ∈ {blue, green}, x′_{m,1} = 0 otherwise;
               x′_{m,2} = 1 if x ∈ {yellow, orange}, x′_{m,2} = 0 otherwise;
               x′_{m,3} = 1 if x ∈ {red, pink}, x′_{m,3} = 0 otherwise    (54.6c)

Another possibility is to add one new Boolean attribute for every possible state for x_m such as (blue, non-blue), (red, non-red), (yellow, non-yellow), etc. We will assume in the following that all Boolean transformations have already been performed and that the dimension M already refers to the size of the enlarged feature vector, which we will continue to denote by h ∈ IR^M to avoid an explosion of notation. We will use the data from Table 54.2 as a driving example for our discussions, where an entry "No" is treated as corresponding to x_m = 0 and an entry "Yes" is treated as corresponding to x_m = 1. The table lists the symptoms for N = 10 patients and indicates whether they had the flu or not. This contrived example is meant for illustration purposes only and is not based on real patient or medical data. The number of classes in this example is R = 2 with:

γ = +1 : patient has the flu    (54.7a)
γ = −1 : patient does not have the flu    (54.7b)

The last column in the table indicates the class that each patient belongs to. Excluding this last column, each row in the table corresponds to a feature vector with M = 6 attributes corresponding to information regarding:

h = col{headache, fever, sore throat, vomiting, chills, runny nose}    (54.8)

Each entry of h assumes a binary value (Yes/No); i.e., it is Bernoulli-distributed. For example, the first entry of h indicates whether the patient had a headache or not.


Table 54.2 Symptoms felt by 10 patients and whether they had the flu or not.

Patient   Headache   Fever   Sore throat   Vomiting   Chills   Runny nose   Flu
0         Yes        No      No            Yes        No       No           NO
1         Yes        Yes     No            No         Yes      Yes          YES
2         No         Yes     Yes           No         Yes      Yes          YES
3         No         No      No            Yes        No       No           NO
4         No         Yes     No            Yes        Yes      No           NO
5         Yes        No      Yes           No         Yes      Yes          YES
6         Yes        No      No            No         No       No           NO
7         No         Yes     No            Yes        No       No           NO
8         Yes        Yes     No            No         No       Yes          YES
9         Yes        No      No            No         Yes      Yes          NO

54.2 SELECTING ATTRIBUTES

We explain next how to construct a decision tree by growing it one stump at a time. The first step in this process is to decide which attribute to choose as the root of the tree. This is achieved by examining each attribute individually. Assume we pick generically the mth attribute, x_m, and want to decide on whether it should be set as the root of the tree. This attribute appears boxed in the following representation:

h = col{×, ×, x_m, ×, ×, ×}    (54.9)

All other attributes are denoted by ×. Since the attributes are assumed Boolean, we partition the N training data points {γ(n), h_n} into two sets: One set has x_m = 0 and the second set has x_m = 1. We denote these sets by the following notation:

S_0(m) = { (γ(n), h_n) | x_m = 0 }    (54.10a)
S_1(m) = { (γ(n), h_n) | x_m = 1 }    (54.10b)

The set S_0(m) contains all data points {γ(n), h_n} for which the mth attribute in each h_n is 0. Likewise, S_1(m) contains all data points {γ(n), h_n} for which the mth attribute in each h_n is 1. Obviously, the partitioning depends on which attribute x_m is being examined. Different attributes will lead to different sets, S_0(m) and S_1(m). That is why we are using m as an argument for these sets to emphasize that they are dependent on m. The cardinality of the two sets can be used to estimate the probability that the attribute under consideration, now treated as a random variable and denoted in boldface notation by x_m, assumes the values x_m = 0 or x_m = 1. If we let

N_0(m) ≜ |S_0(m)|,   N_1(m) ≜ |S_1(m)|    (54.11)


denote these cardinalities, then the desired probability estimates are given by

P̂(x_m = 0) = N_0(m)/N,   P̂(x_m = 1) = N_1(m)/N    (54.12)

where N is the total number of training samples. These probabilities inform us how frequent the events x_m = 0 or x_m = 1 are in the given training data. If we refer to the data in Table 54.2 and select x_m = "sore throat," we find that

S_0(m) = patients {0, 1, 3, 4, 6, 7, 8, 9}    (54.13a)
S_1(m) = patients {2, 5}    (54.13b)
N_0(m) = 8    (54.13c)
N_1(m) = 2    (54.13d)
P̂(x_m = 0) = 0.8 (without sore throat)    (54.13e)
P̂(x_m = 1) = 0.2 (with sore throat)    (54.13f)

Next we determine the most common class label in each of {S_0(m), S_1(m)} and count the number of samples in the minority class, i.e., we let

γ_0(m) = most common class variable in the set S_0(m)    (54.14a)
γ_1(m) = most common class variable in the set S_1(m)    (54.14b)
n_0(m) = number of samples in S_0(m) that are not in class γ_0(m)    (54.14c)
n_1(m) = number of samples in S_1(m) that are not in class γ_1(m)    (54.14d)

Observe that n_0(m) counts the number of samples in the training subset S_0(m) that belong to its minority class. Likewise, n_1(m) counts the number of samples in the training subset S_1(m) that belong to its minority class. Returning to the example from Table 54.2 with x_m = "sore throat," if we examine the set S_0(m), we find that two of these patients had the flu while six patients did not have the flu. Therefore, for this set we have

γ_0(m) = −1,   n_0(m) = 2    (54.15a)

Likewise, if we examine the set S_1(m), we find that both patients had the flu. It follows that

γ_1(m) = +1,   n_1(m) = 0    (54.15b)

Using these numbers, we estimate probabilities of conditional events of the form P(γ = γ | x_m = x_m). Referring to (54.15a)–(54.15b) we find

P̂(γ = +1 | x_m = 0) = 2/8 = 0.25    (54.16a)
P̂(γ = −1 | x_m = 0) = 6/8 = 0.75    (54.16b)
P̂(γ = +1 | x_m = 1) = 1    (54.16c)
P̂(γ = −1 | x_m = 1) = 0    (54.16d)
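These counts and probability estimates can be reproduced directly from Table 54.2. In the sketch below the table is encoded with Yes = 1 and No = 0 and the flu label with +1/−1; the array names are ours:

```python
import numpy as np

# Table 54.2: columns = (headache, fever, sore throat, vomiting, chills, runny nose)
H = np.array([
    [1, 0, 0, 1, 0, 0],   # patient 0, no flu
    [1, 1, 0, 0, 1, 1],   # patient 1, flu
    [0, 1, 1, 0, 1, 1],   # patient 2, flu
    [0, 0, 0, 1, 0, 0],   # patient 3, no flu
    [0, 1, 0, 1, 1, 0],   # patient 4, no flu
    [1, 0, 1, 0, 1, 1],   # patient 5, flu
    [1, 0, 0, 0, 0, 0],   # patient 6, no flu
    [0, 1, 0, 1, 0, 0],   # patient 7, no flu
    [1, 1, 0, 0, 0, 1],   # patient 8, flu
    [1, 0, 0, 0, 1, 1],   # patient 9, no flu
])
gamma = np.array([-1, +1, +1, -1, -1, +1, -1, -1, +1, -1])

m = 2                                             # attribute "sore throat"
S0, S1 = gamma[H[:, m] == 0], gamma[H[:, m] == 1]
print(len(S0), len(S1))                           # 8 2      -> (54.13c)-(54.13d)
print((S0 == +1).mean(), (S1 == +1).mean())       # 0.25 1.0 -> (54.16a), (54.16c)
```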


Figure 54.2 aggregates the information we extracted so far for the attribute xm = “sore throat” and presents it in the form of a decision stump with one node and two branches. The numbers {0.25, 0.75, 1, 0} that appear horizontally in the lower part of the figure refer to the probabilities computed in (54.16a)– (54.16d).

Figure 54.2 A tree stump based on the attribute “sore throat.”

54.2.1 Counting Errors

Assume we stop here and rely solely on this stump to classify future feature vectors, h. Note that all training examples leading to the branch on the right had the flu (100%), while the majority of the training examples leading to the branch on the left did not have the flu (75%). For this reason, by following the branch on the right, we will be led to declare new patients with "sore throats" as having the flu (i.e., to assign them the label +1), while the branch on the left will lead us to declare new patients without "sore throats" as not having the flu (i.e., to assign them the label −1). In particular, if we were to run all the training data from Table 54.2 over this stump, we will find that the decision will be erroneous in two instances; these are the two cases that appear on the leftmost branch with flu symptoms even though the patients do not have a "sore throat." Accordingly, the total number of errors on the training data for this single-stump decision tree will be two, written as

n_e(m) = 2    (54.17)

The natural question at this stage is whether we could have chosen a different attribute (other than xm = “sore throat”) as the root for the tree and obtain a smaller error count on the training data. This motivates us to consider the following optimal selection for the root node:

m⋆ = argmin_{1 ≤ m ≤ M} n_e(m)   ⟹ (select m⋆ as root node)    (54.18)

54.2.2

Mutual Information A more informed method to select the root node is to rely on the mutual information measure. To explain this construction, we review briefly the concept of entropy of a random variable. Recall from the presentation in Chapter 6 that for any discrete random variable, x, its entropy is denoted by H(x) and is defined as the nonnegative quantity: ∆

H(x) = NE log2 P(x = x)

(54.19)

where the expectation is over the distribution of x, and where the logarithm is relative to base 2. Although the choice of the logarithm base is generally irrelevant, when the base is chosen as 2, the unit of measure for the entropy is bits. We can rewrite the entropy of x more explicitly as H(x) = N

K X

P(x = xk ) log2 P(x = xk )

(54.20)

k=1

where we are assuming that the domain of x involves K discrete states, denoted by {xk }. For simplicity, we will write the above expression in the form H(x) = N

X

P(x = x) log2 P(x = x)

(54.21)

x

where the sum over k = 1, . . . , K is replaced by a simpler notation involving a sum over the possible discrete realizations for x. For Boolean random variables, when x is either x = 0 or x = 1, expression (54.21) gives: H(x) = NP(x = 0) log2 P(x = 0) N P(x = 1) log2 P(x = 1)

(54.22)

or, more compactly, if we let p = P(x = 1): H(x) = Np log2 p N (1 N p) log2 (1 N p),

p ∈ [0, 1]

(54.23)

In this case, the entropy measure is a function of p alone and it is customary to denote it by H(p).

54.2 Selecting Attributes

2321

We explained in Example 6.1 that the entropy of a random variable reveals the amount of uncertainty we have about it. For example, for a Boolean variable x, if it happens that p = 1, then we would expect to observe the event x = 1 each time an experiment is performed on x. The entropy in this case is H(x) = 0. A similar situation occurs when p = 0, for which we would expect to observe the event x = 0 each time an experiment is performed on x. The entropy again evaluates to H(x) = 0. The case of most uncertainty about x arises when p = 1/2. In this situation, the events x = 1 and x = 0 are equally likely and the entropy evaluates to H(x) = 1. Figure 54.3 plots H(p) versus p; it is seen that the function is concave, attains the value zero at locations p = 0, 1, and attains the maximum value of 1 at p = 1/2. 1

function

0.8

0.6

0.4

0.2

0 0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Figure 54.3 Plot of the entropy function (54.23) for a Boolean random variable as a

function of p ∈ [0, 1], along with a plot of the Gini impurity defined later in (54.43) for comparison purposes.

In a similar vein, we define the conditional entropy of x given an observation of another random variable, γ = γ, also assumed discrete in this exposition, as follows:

H(x | γ = γ) ≜ −E log₂ P(x = x | γ = γ) = − Σ_x P(x = x | γ = γ) log₂ P(x = x | γ = γ)    (54.24)

This conditional entropy reflects the amount of uncertainty that remains in x after observing γ = γ. If we average this quantity over all possible realizations for γ, we determine the average amount of uncertainty that remains in x if γ is observable. We denote the resulting conditional entropy measure by:

H(x | γ) ≜ Σ_{γ∈{+1,−1}} P(γ = γ) H(x | γ = γ)    (54.25)

where the sum is over the possible realizations for γ. If we subtract H(x | γ) from H(x), we then find by how much the initial uncertainty in x is reduced


given observations of γ. This nonnegative quantity is called mutual information and is symmetric (recall Prob. 6.4):

I(x; γ) ≜ H(x) − H(x|γ)    (54.26a)
        = H(γ) − H(γ|x)    (54.26b)
        = I(γ; x)    (54.26c)

where, by definition,

H(γ) = −P(γ = +1) log₂ P(γ = +1) − P(γ = −1) log₂ P(γ = −1)    (54.27a)
H(γ | x = x) = − Σ_{γ∈{+1,−1}} P(γ = γ | x = x) log₂ P(γ = γ | x = x)    (54.27b)
H(γ | x) = Σ_x P(x = x) H(γ | x = x)    (54.27c)

The reverse equality (54.26b) indicates how much the entropy of (or our uncertainty about) γ is reduced if x is observable. This interpretation is the basis for the second method for selecting the most informative attribute x_m from a feature vector: We pick the attribute that reduces the uncertainty about γ by the largest amount, i.e.,

m⋆ = argmax_{1 ≤ m ≤ M} I(γ; x_m)   ⟹ (select m⋆ as root node)    (54.28)

The solution m? that results from this optimization problem need not agree with the solution that results from the earlier method (54.18), and which is based on counting the number of erroneous decisions. We will illustrate this difference in a numerical example in the following.

54.2.3 Multiclass Classification

Although the description has focused so far on binary classification problems, with label γ ∈ {±1}, the expressions for the mutual and conditional entropy measures are applicable more generally to multiclass classification problems. If we denote the class variable by r and let it assume one of R possible values, r ∈ {1, 2, . . . , R}, then the entropy measures needed to evaluate (54.26a) or (54.26b) become:

H(r) = − Σ_r P(r = r) log₂ P(r = r)    (54.29a)
H(x) = − Σ_x P(x = x) log₂ P(x = x)    (54.29b)
H(x | r = r) = − Σ_x P(x = x | r = r) log₂ P(x = x | r = r)    (54.29c)
H(r | x = x) = − Σ_r P(r = r | x = x) log₂ P(r = r | x = x)    (54.29d)
H(x | r) = Σ_r P(r = r) H(x | r = r)    (54.29e)
H(r | x) = Σ_x P(x = x) H(r | x = x)    (54.29f)

so that (54.26a)–(54.26c) are replaced by:

I(x; r) ≜ H(x) − H(x|r)    (54.30a)
        = H(r) − H(r|x)    (54.30b)
        = I(r; x)    (54.30c)

These expressions are mathematically equivalent. However, expression (54.30b) can be more efficient when x ∈ {0, 1} is Boolean. In this case, expressions (54.30a) and (54.30b) reduce to:

I(r; x) = − Σ_{x∈{0,1}} P(x = x) log₂ P(x = x) − Σ_r P(r = r) H(x | r = r)    (54.31a)
        = − Σ_r P(r = r) log₂ P(r = r) − Σ_{x∈{0,1}} P(x = x) H(r | x = x)    (54.31b)

Observe from the rightmost term in the second expression that the term with the conditional entropy involves a sum of two terms only, over x ∈ {0, 1}. In contrast, in the rightmost term in the first expression, the term with the conditional entropy involves a sum of R terms, over r ∈ {1, 2, . . . , R}.

54.2.4 Normalized Mutual Information

We continue our exposition by treating the binary case, with γ ∈ {±1}, without much loss in generality. Note that the choice (54.28) selects the attribute that results in the largest absolute reduction in the uncertainty about the class label, γ. There is an alternative way to select m⋆ by measuring the size of the reduction relative to the original uncertainty or entropy of x. Specifically, the mutual information I(γ; x) is first normalized by H(x), leading to the "information gain ratio":

I_norm(γ; x) ≜ I(γ; x) / H(x)    (54.32)

which is bounded by 1. The optimal attribute is then selected by solving

m⋆ = argmax_{1 ≤ m ≤ M} I_norm(γ; x_m)   ⟹ (select m⋆ as root node)    (54.33)

One advantage of this normalization procedure is that it avoids biasing the selection of m⋆ toward attributes that have many possible state values.

Example 54.1 (Mutual information calculation) We illustrate the computation of the mutual information measure between the attribute x_m = "sore throat" and the class label, γ, by reconsidering the example from Table 54.2. Using the estimated probabilities (54.13e)–(54.13f), we approximate the entropy of this particular attribute as follows:

Ĥ(x_m) = −0.8 log₂(0.8) − 0.2 log₂(0.2) ≈ 0.7219 bits    (54.34)

Moreover, from the last column in Table 54.2 we estimate the probability distribution for the class variable as follows:

P̂(γ = +1) = 4/10 = 0.4,   P̂(γ = −1) = 6/10 = 0.6    (54.35)

and from the "sore throat" column in the same table we estimate the conditional probabilities:

P̂(x_m = 1 | γ = +1) = 2/4 = 0.5    (54.36a)
P̂(x_m = 0 | γ = +1) = 2/4 = 0.5    (54.36b)
P̂(x_m = 1 | γ = −1) = 0/6 = 0.0    (54.36c)
P̂(x_m = 0 | γ = −1) = 6/6 = 1.0    (54.36d)

It follows from (54.24) that the conditional entropies of x_m given the label variable are given by

Ĥ(x_m | γ = +1) = −0.5 log₂(0.5) − 0.5 log₂(0.5) = 1    (54.37)
Ĥ(x_m | γ = −1) = −0 × log₂(0) − 1 × log₂(1) = 0    (54.38)

so that, using (54.25),

Ĥ(x_m | γ) = (0.4 × 1) + (0.6 × 0) = 0.4    (54.39)

We conclude from this result, (54.34), and definitions (54.26a) and (54.32) that

Î(γ; x_m) = 0.7219 − 0.4 = 0.3219    (54.40)
Î_norm(γ; x_m) = 0.3219/0.7219 ≈ 0.4459    (54.41)

54.2.5

2325

Gini Impurity There is an alternative to the mutual information measure that relies on the concept of “impurity.” For any discrete random variable x with realizations x, its Gini impurity (also called Gini index) is defined as: ∆

G(x) = 1 N

X

P2 (x = x)

(54.42)

x

In comparison with the entropy expression (54.21), the Gini index does not require the computation of logarithms. We explain in Prob. 54.11 that G(x) is equal to the likelihood of misclassifying x. Specifically, it is equal to the probability of selecting a realization for x at random and then assigning it erroneously to another realization (we encountered this quantity earlier in expression (52.45) while examining the performance of the NN classifier). The reason why G(x) is referred to as a measure of impurity is the following. If it happens that P(x = x) = 1, for some x, while all other states x 6= x have zero probability, then G(x) = 0. In this case, we say that x has a pure distribution; there is an implicit analogy between a pure distribution and zero entropy (where we know that the uncertainty about x is zero). On the other hand, when x has K equally probable levels, then we say that the level of impurity in x is the largest it can be (this is also the most uncertain situation with the largest entropy value). These facts can be illustrated by considering Boolean variables again. Using p = P(x = 1), the expression for Gini impurity simplifies to (compare with (54.23)): G(x) = 1 N p2 N (1 N p)2 = 2p(1 N p)

(54.43)

In this case, the Gini measure is a function of p alone and it is customary to denote it by G(p). It was shown in Fig. 54.3 that this function behaves similarly to the entropy H(p), which explains why it can be used in place of the entropy measure for selecting informative nodes. Note that G(p) = 0 when p = 0 or p = 1 and G(p) is at its maximum value of 1/2 when p = 1/2. Similarly, we define the conditional Gini impurity as X ∆ G(x | γ = γ) = 1 N P2 (x = x | γ = γ) (54.44a) ∆

G(x | γ) =

X γ

x

P(γ = γ)G(x | γ = γ)

(54.44b)

as well as the reduction in Gini impurity as ∆

∆G(γ; x) = G(x) N G(x|γ)

(54.45)

and its normalized version, also called the “Gini gain ratio”: ∆

∆Gnorm (γ; x) =

∆G(γ; x) G(x)

(54.46)

2326

Decision Trees

Subsequently, the most informative attribute is selected by solving m? = argmax ∆Gnorm (γ; xm )

=⇒ (select m? as root node)

(54.47)

1≤m≤M

Example 54.2 (Gini impurity calculation) We illustrate the computation of Gini impurity for the attribute xm = “sore throat” and the class label, γ, by reconsidering the example from Table 54.2. Using the estimated probabilities (54.13e)–(54.13f), we approximate the Gini impurity for this particular attribute as follows: b m ) = 1 − (0.8)2 − (0.2)2 = 0.32 G(x

(54.48)

Moreover, using the estimated conditional probabilities (54.36a)–(54.36d), we get b m |γ = +1) (54.44a) G(x = 1 − (0.5)2 − (0.5)2 = 0.5 b m |γ = −1) G(x

(54.44a)

2

2

1−0 −1 =0

=

(54.49) (54.50)

so that b m |γ) G(x

(54.44b)

=

(0.4 × 0.5) + (0.6 × 0) = 0.2

(54.51)

It follows from this result and (54.48) that d m |γ; x) (54.45) ∆G(x = 0.32 − 0.2 = 0.12 d norm (xm |γ; x) ∆G

54.2.6

(54.46)

=

0.12/0.32 = 0.375

(54.52) (54.53)

Informative Attributes We collect in Table 54.3 the five measures we have described so far for selecting the most informative attribute: Each row in the table corresponds to one measure. We already illustrated in the last two examples how to compute these measures for one particular attribute. We can repeat these calculations for each of the attributes that appear in the feature vector h given by (54.8). The resulting values are listed in Table 54.4. Table 54.3 Information measures for selecting the most informative attribute. Measure

Expression

Most informative attribute

number of errors

ne (m)

m? = argmin ne (m)

mutual information

I(γ; xm )

m? = argmax I(γ; xm )

norm. mutual information

I(γ; xm )/H(xm )

m? = argmax Inorm (γ; xm )

Gini impurity

∆G(γ; xm )

m? = argmax ∆G(γ; xm )

normalized Gini impurity

∆G(γ; xm )/G(xm )

m? = argmax ∆Gnorm (γ; xm )

m m m m m

Two observations follow from the numerical data in Table 54.4. First, if we compare the rows corresponding to the attributes x = “sore throat” and x =

54.3 Constructing a Tree

2327

“vomiting,” we find that both of them result in two total errors under the metric ne (m). However, the attribute x = “vomiting” is superior under most of the other metrics. This example reveals one weakness of the error-counting criterion: It does not have sufficient discrimination power to recognize that some attributes may be better than others even when their error counts match. Second, we recognize from the highlighted values in the last row that, in this example, regardless of the criterion we select, the attribute x = “runny nose” appears to be the most informative. Therefore, we select this attribute to be the root of the decision tree, as shown in Fig. 54.4. Table 54.4 Information measures for various attributes.

54.3

Attribute, m

ne (m)

I(γ; xm )

I(γ ;xm ) H(xm )

∆G(γ; xm )

∆G(γ ;xm ) G(xm )

headache fever sore throat vomiting chills runny nose

4 3 2 2 3 1

0.0464 0.1245 0.3219 0.4200 0.1245 0.6100

0.0478 0.1245 0.4459 0.4325 0.1245 0.6100

0.0300 0.0834 0.1200 0.2134 0.0834 0.3333

0.0625 0.1668 0.3750 0.4446 0.1668 0.6666

54.3 CONSTRUCTING A TREE

If we are satisfied with this single stump and decide to stop the construction of the tree at this stage, then the end nodes on the left and right branches become leaf nodes. We associate the class label γ = +1 with the leaf on the right because the majority of the training points at that location have the flu (4 versus 1). Similarly, we associate the class label γ = −1 with the leaf on the left because the majority of the training points at that location do not have the flu (5 versus 0). We may also decide to continue growing the tree by adding branches to the leaf on the right in order to improve the discrimination power of that path, since we still have one training point misclassified.

54.3.1 Adding Stumps

To add a subtree to the right branch, we simply repeat the procedure for determining a “root” by restricting ourselves to the subset of the training data that has the attribute “runny nose” set to “yes.” This subset of samples becomes the new training set for determining the new root for the subtree we are seeking to add at that location. Table 54.5 extracts the relevant data from the earlier Table 54.2; it only retains the symptoms by those patients who suffered from a “runny nose.” We now have a total of N = 5 training data points, and we are faced with the task of selecting the most informative attribute from among the first five attributes (i.e., excluding “runny nose”). In the last row of the table we show the values of the mutual information measure for these attributes. The results indicate that the most informative attribute at this stage, according to this metric, is x = “fever.” Selecting this attribute as the new root, we are led to the tree structure shown in Fig. 54.5, where a new subtree has been added with a root at “fever.” Observe that on the left branch for this subtree there is an equal number of training points in each class: one point with the flu and one point without the flu. If we were to stop growing the tree at this stage, then we would select randomly one of the classes and assign it to that leaf node. Alternatively, we can continue to add a subtree under that node.

Figure 54.4 The root of the tree is selected to be the attribute “runny nose.”

Table 54.5 Data extracted from Table 54.2. Only symptoms for patients that have a “runny nose” are maintained.

Patient     Headache   Fever    Sore throat   Vomiting   Chills   Runny nose   Flu
1           Yes        Yes      No            No         Yes      Yes          YES
2           No         Yes      Yes           No         Yes      Yes          YES
5           Yes        No       Yes           No         Yes      Yes          YES
8           Yes        Yes      No            No         No       Yes          YES
9           Yes        No       No            No         Yes      Yes          NO
I(γ; x_m)   0.0729     0.3220   0.1710        0.0000     0.0729   —            —

To do so, we now focus only on the patients that have the attributes “runny nose = yes” and “fever = no.” This data is listed in Table 54.6. We now have a total of N = 2 training points. We focus on selecting a new root from among the four remaining attributes (excluding “runny nose” and “fever”). The last row in the table shows the values for the mutual information metric. The results indicate that the most informative attribute at this stage is x = “sore throat.”

Figure 54.5 A decision tree consisting of two stumps determined by the attributes “runny nose” and “fever.”

Selecting this attribute as the new root, we are led to the tree structure shown in Fig. 54.6, where a new subtree has been added with a root at “sore throat.”

Table 54.6 Data extracted from Table 54.2. Only symptoms by patients that have a “runny nose” and no “fever” are maintained.

Patient     Headache   Fever   Sore throat   Vomiting   Chills   Runny nose   Flu
5           Yes        No      Yes           No         Yes      Yes          YES
9           Yes        No      No            No         Yes      Yes          NO
I(γ; x_m)   0          —       1             0          0        —            —

Example 54.3 (Constructing a decision tree for a heart disease dataset) We apply the procedure for constructing a decision tree to a heart disease dataset. It consists of 297 samples with feature vectors that contain 13 attributes each. These attributes are described in Table 54.7. There are four classes: class 0 (patient has no heart disease) and classes 1, 2, 3 (patient has heart disease). We group the last three classes into a single class and relabel the data into only two classes: γ = +1 (heart disease is present) and γ = −1 (heart disease is absent).

The first step in processing the data is to redefine the attributes and replace them by binary-valued variables. Some of the attributes are already binary in nature, such as attribute 6 (blood sugar level) and attribute 9 (exercise-induced angina). Other attributes assume real values. There are many ways by which they can be transformed into binary variables; the following procedure is one possibility and is only meant for illustration purposes. Consider, for example, attribute 1 (patient’s age). We compute the average age of all patients in the given dataset. Then, for each patient, we set the age variable to 1 if the patient’s age is above the average and to 0 otherwise:

Figure 54.6 A decision tree consisting of three stumps determined by the attributes “runny nose,” “fever,” and “sore throat.”

x = { 1, if patient’s age is above average;  0, otherwise }              (54.54)

We perform the same transformation for attribute 4 (resting blood pressure), attribute 5 (serum cholesterol level), attribute 8 (heart rate), and attribute 10 (size of ST depression). The remaining attributes are discrete in nature and we can transform them into binary variables as follows. Consider attribute 3 (chest pain type). There are four types. We therefore introduce four binary variables:

x1 = { 1, if chest pain is typical angina;   0, otherwise }              (54.55a)
x2 = { 1, if chest pain is atypical angina;  0, otherwise }              (54.55b)
x3 = { 1, if chest pain is nonanginal pain;  0, otherwise }              (54.55c)
x4 = { 1, if chest pain is asymptomatic;     0, otherwise }              (54.55d)

Likewise, for attribute 7 we have three levels and introduce three binary variables as follows:

Table 54.7 Original attributes for the heart disease dataset. This dataset is derived from the site https://archive.ics.uci.edu/ml/datasets/heart+Disease.

Attribute   Explanation
1           Patient’s age measured in years.
2           Patient’s sex (male or female).
3           Chest pain type: typical angina (value 1), atypical angina (2), nonanginal pain (3), and asymptomatic (4).
4           Resting blood pressure measured in mm Hg.
5           Serum cholesterol level measured in mg/dl.
6           Fasting blood sugar level above 120 mg/dl (1 = true; 0 = false).
7           Resting electrocardiographic result: normal (value 0), having ST–T wave abnormality, such as T wave inversions and/or ST elevation or depression larger than 0.05 mV (value 1), or showing probable or definite left ventricular hypertrophy by Estes’ criteria (value 2).
8           Maximum heart rate measured in beats per minute (bpm).
9           Exercise-induced angina (1 = yes; 0 = no).
10          Size of ST depression induced by exercise relative to rest.
11          Slope of the peak exercise ST segment: upsloping (value 1), flat (value 2), or downsloping (value 3).
12          Number of major vessels colored by fluoroscopy (0, 1, 2, 3).
13          Thal: 3 = normal; 6 = fixed defect; 7 = reversible defect.

x5 = { 1, if electrocardiographic result is normal;           0, otherwise }     (54.56a)
x6 = { 1, if electrocardiographic result is abnormal;         0, otherwise }     (54.56b)
x7 = { 1, if electrocardiographic result shows hypertrophy;   0, otherwise }     (54.56c)
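To make this preprocessing concrete, the following Python sketch applies the two transformations just described: thresholding real-valued attributes at their sample mean and expanding each discrete attribute into one indicator per level. The array name X and the column ordering are assumptions for illustration; they follow the attribute numbering of Table 54.7 and assume X is a NumPy array of floats with one patient per row.

import numpy as np

def binarize_real(column):
    """Map a real-valued attribute to 1 (above its mean) or 0 (otherwise)."""
    return (column > column.mean()).astype(int)

def one_hot(column, levels):
    """Expand a discrete attribute into one binary indicator per level."""
    return np.stack([(column == v).astype(int) for v in levels], axis=1)

def expand_features(X):
    cols = []
    cols.append(binarize_real(X[:, 0]))            # 1: age above average
    cols.append(X[:, 1].astype(int))               # 2: sex (already binary)
    cols.extend(one_hot(X[:, 2], [1, 2, 3, 4]).T)  # 3-6: chest pain type
    cols.append(binarize_real(X[:, 3]))            # 7: resting blood pressure
    cols.append(binarize_real(X[:, 4]))            # 8: serum cholesterol
    cols.append(X[:, 5].astype(int))               # 9: blood sugar (already binary)
    cols.extend(one_hot(X[:, 6], [0, 1, 2]).T)     # 10-12: electrocardiographic result
    cols.append(binarize_real(X[:, 7]))            # 13: maximum heart rate
    cols.append(X[:, 8].astype(int))               # 14: exercise-induced angina
    cols.append(binarize_real(X[:, 9]))            # 15: ST depression
    cols.extend(one_hot(X[:, 10], [1, 2, 3]).T)    # 16-18: slope of ST segment
    cols.extend(one_hot(X[:, 11], [0, 1, 2, 3]).T) # 19-22: number of colored vessels
    cols.extend(one_hot(X[:, 12], [3, 6, 7]).T)    # 23-25: thal condition
    return np.column_stack(cols)                   # N x 25 binary feature matrix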

We proceed similarly for attributes 11 (slope of ST segment), 12 (number of colored vessels), and 13 (thal condition). In this way, we end up with an expanded feature vector h′ with M′ = 25 binary attributes; these binary attributes are listed in Table 54.8 and are the features we use to construct the decision tree. We select 238 samples (80%) for training and use the remaining 59 samples (20%) for testing. In the simulation, we employ the mutual information measure to identify the root nodes at the various stages of the tree construction. For example, according to this criterion, the most informative attribute is found to be attribute 6:

is the chest pain asymptomatic?                                          (54.57)

If the answer is in the affirmative, and after removing this attribute, the next most informative attribute is found to be attribute 19:

are zero vessels colored by fluoroscopy?                                 (54.58)

On the other hand, if the chest pain is not asymptomatic, the next most informative attribute is found to be attribute 23:

is the thal condition normal?                                            (54.59)

Figure 54.7 A couple of layers of a decision tree generated from the binary-valued attributes shown in Table 54.8 for the heart disease dataset using the mutual information measure.

Figure 54.7 shows several layers of the resulting decision tree, along with the labels at the leaves. The tree is not intended to be a reliable predictor of the presence or absence of heart disease; in this example, the amount of available data is too small to enable an accurate classifier. This particular tree results in 16 errors over the 59 test samples, which corresponds to an empirical error rate of 27.12% over the test data. It also results in 32 errors over the entire 238 training samples, which corresponds to an empirical error rate of 13.45% over the training data.

Table 54.8 Binary attributes for the heart disease dataset.

Attribute   Explanation
1           1 (patient’s age above average), 0 (otherwise).
2           1 (male), 0 (female).
3           1 (chest pain is typical angina), 0 (otherwise).
4           1 (chest pain is atypical angina), 0 (otherwise).
5           1 (chest pain is nonanginal), 0 (otherwise).
6           1 (chest pain is asymptomatic), 0 (otherwise).
7           1 (blood pressure above average), 0 (otherwise).
8           1 (cholesterol level above average), 0 (otherwise).
9           1 (blood sugar level above average), 0 (otherwise).
10          1 (electrocardiographic result is normal), 0 (otherwise).
11          1 (electrocardiographic result is abnormal), 0 (otherwise).
12          1 (electrocardiographic result shows hypertrophy), 0 (otherwise).
13          1 (heart rate above average), 0 (otherwise).
14          1 (angina is exercise-induced), 0 (otherwise).
15          1 (size of ST depression is above average), 0 (otherwise).
16          1 (ST segment is upsloping), 0 (otherwise).
17          1 (ST segment is flat), 0 (otherwise).
18          1 (ST segment is downsloping), 0 (otherwise).
19          1 (no vessels are colored by fluoroscopy), 0 (otherwise).
20          1 (one vessel is colored by fluoroscopy), 0 (otherwise).
21          1 (two vessels are colored by fluoroscopy), 0 (otherwise).
22          1 (three vessels are colored by fluoroscopy), 0 (otherwise).
23          1 (thal condition is normal), 0 (otherwise).
24          1 (thal has a fixed defect), 0 (otherwise).
25          1 (thal has a reversible defect), 0 (otherwise).

54.3.2 Selection of Leaves and Nodes

In the derivation that led to the decision tree of Fig. 54.6 we were guided in our selection of branches by moving in the direction where it was clear that improvements in performance would be expected (at least over the training data). For example, in the two-stump situation of Fig. 54.5 we only had the choice of moving either to the left or to the right at the “fever” node. The decision to move to the left branch, rather than to the right, was motivated by the fact that all decisions were correct in the box at the end of the “yes” branch in Fig. 54.5. More generally, however, in more elaborate scenarios, with multiple leaves and branches already in existence, we may have a multitude of options to branch from. In these situations, the selection of the next “optimal root node” can be carried out as follows (we use the mutual information measure for illustration purposes but other measures can also be used).

Suppose we have already constructed a tree with L intermediate leaves – see Fig. 54.8. The figure shows a tree with nodes labeled {A, B, C, D, E, F, G}, two terminal leaves colored in green with classes already assigned to them, and several intermediate leaves indexed by ℓ = 1, 2, . . . , L. We would like to select one of these L leaves as a root node for a new stump in order to grow the tree further. As was done for the derivation that led to Fig. 54.6, for each leaf ℓ, we let {h_n^(ℓ)} denote the set of all training vectors that lead to that leaf location. We only retain in these feature vectors the attributes that have not been fixed by moving along the path leading to that leaf. We denote the individual attributes of these feature vectors by {x_m^(ℓ)}. Then, the problem of selecting a new node to grow the tree corresponds to selecting both the leaf and the root node at that leaf:

m⋆ = argmax_ℓ { argmax_m I(γ; x_m^(ℓ)) }                                 (54.60)

m

Pruning In summary, the algorithm for constructing a decision tree takes the following general form; in the statement of the algorithm we are using the mutual information metric to measure relevance but any of the other metrics we described before can be used (e.g., error count, normalized mutual information, Gini impurity, or normalized Gini impurity). Let xm denote an arbitrary attribute (i.e., an entry from the feature vector).

Figure 54.8 A tree with multiple nodes labeled {A, B, C, D, E, F, G}, two terminal leaves colored in green with classes already assigned to them, and multiple intermediate leaves indexed by ℓ ∈ {1, 2, . . . , L}. The objective is to decide which intermediate leaf to select to grow the tree further by adding a stump at that location.

Construction of decision trees.
  given N training data samples {γ(n), h_n};
  repeat until the desired tree depth is attained:
    for every node:
      identify the data available for training at that location;
      identify the set of attributes to be tested for that location;
      for every attribute x_m, compute I(γ; x_m) using the training data;
      pick the attribute with largest I(γ; x_m) as root for this location;
    end
    for every leaf node:
      count how many training points belong to one class or another;
      assign the majority class to the leaf node;
    end
  end                                                                    (54.61)

Decision trees are generally trained until they classify all training data correctly. However, insisting on classifying all training data accurately can cause overfitting; this term refers to the problem that arises when the classifier performs perfectly well on the training data but can perform poorly on new test data. One way to counter this effect is to stop growing the tree when, for example, it is observed that the mutual information metric is not significant anymore. A second way is to train a collection of trees randomly and to use a majority vote to arrive at the final classification; this approach is at the core of the random forest algorithm, which combines elements of bagging and decision trees and will be described later in Section 62.1. A third way to counter overfitting is to perform pruning.

The following method is known as reduced-error pruning and involves splitting the training data into two groups: one group has, for example, 70% of the training data and the other group has the remaining 30%. The data in both groups are exclusive of each other (i.e., disjoint). We employ the 70% training group to construct an initial decision tree, which we denote generically by the capital letter T. The objective is to prune the tree by removing some of its subtrees. As summarized in (54.62), this can be performed as follows. We start by testing the tree on the 30% test data. We let R_emp(T) denote the empirical error rate that results for tree T from this test. Subsequently, we examine all internal nodes of the tree to decide which subtree(s) can be removed. For each node, we remove the subtree that lies under it and compute the empirical error rate for the trimmed tree over the test data again. We then select from among all trimmed trees the one that results in the smallest empirical error. This process can be repeated a few times to attain the desired pruning level.

Pruning method for decision trees.
  repeat:
    for every internal node j in tree T:
      remove the entire subtree that resides under it;
      transform this node into a leaf node; call this new tree T_j;
      run the 70% training data through T_j;
      assign to the leaf node the majority class at this location;
      compute R_emp(T_j) using the 30% test data;
    end
    select the pruned T_j with the lowest empirical risk, R_emp(T_j);
    let T ← T_j and repeat the procedure.                                (54.62)
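To make the recursion in (54.61) concrete, here is a minimal Python sketch that grows a tree over binary attributes using the mutual information criterion; the function names, the dictionary-based tree representation, and the depth limit are our own choices for illustration, not part of the listing.

import numpy as np

def entropy(y):
    """Empirical entropy H(y) of a discrete label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mutual_information(x, y):
    """I(y; x) = H(y) - H(y | x) for a discrete attribute x."""
    cond = 0.0
    for v in np.unique(x):
        mask = (x == v)
        cond += mask.mean() * entropy(y[mask])
    return entropy(y) - cond

def grow_tree(H, y, attributes, max_depth=3):
    """Recursively grow a decision tree; H is N x M with binary entries."""
    # stop if the node is pure, no attributes remain, or the depth budget is used up
    if max_depth == 0 or len(attributes) == 0 or len(np.unique(y)) == 1:
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[np.argmax(counts)]}            # majority class
    # pick the attribute with the largest mutual information with the label
    scores = [mutual_information(H[:, m], y) for m in attributes]
    m_star = attributes[int(np.argmax(scores))]
    remaining = [m for m in attributes if m != m_star]
    node = {"attribute": m_star, "children": {}}
    for v in (0, 1):
        mask = (H[:, m_star] == v)
        if not mask.any():                                    # empty branch -> majority leaf
            values, counts = np.unique(y, return_counts=True)
            node["children"][v] = {"leaf": values[np.argmax(counts)]}
        else:
            node["children"][v] = grow_tree(H[mask], y[mask], remaining, max_depth - 1)
    return node

A reduced-error pruning pass in the spirit of (54.62) can then be applied by repeatedly converting internal nodes of the grown tree into majority-class leaves and retaining whichever trimmed tree attains the lowest error on held-out data.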

54.4 COMMENTARIES AND DISCUSSION

Decision trees. The use of decision trees in decision-making appeared in the 1950s in works on optical character recognition (OCR) and in the 1960s in work by Hunt (1962) and Hunt, Marin, and Stone (1966) on modeling aspects of human learning and intelligence. The manuscript by Stevens (1961, sec. 6.2) provides a useful overview of systems and references from the 1950s and 1960s employing decision trees in OCR systems. Modern interest in decision trees for classification and regression purposes was driven by the contributions from Breiman et al. (1984) and Quinlan (1987), who

developed powerful techniques for training trees. The work by Breiman et al. (1984) focuses on the CART (classification and regression trees) algorithm for constructing trees, which relies on the Gini impurity measure. The algorithm we described in the chapter is among the most widely used in practice and is due to Quinlan (1986), with variations appearing in Quinlan (1983, 1987, 1993); these algorithms are an outgrowth of the basic algorithm by Hunt (1962). Quinlan’s algorithms are known by the acronym ID3 (iterative dichotomizer) and its successor C4.5 (for classification). These procedures identify the most informative attributes in a recursive manner by using the normalized mutual information measure, and grow the tree until sufficient depth is attained. The use of information-theoretic measures in designing decision trees has also been proposed in other works, including by Sethi and Sarvarayndu (1981) and Casey and Nagy (1984). The reduced-error pruning procedure (54.62) is due to Quinlan (1987).

It is clear from the discussion in the chapter that a decision tree is capable of learning any Boolean function of the attributes (though not necessarily in an efficient manner). This is because for each attribute we can branch into its two Boolean states and then follow from there with branches for another attribute and so forth. However, decision trees are not able to learn separating lines or separating hyperplanes when the attributes are real-valued. This is because decision trees transform the real attributes into Boolean counterparts by resorting to thresholding. When this is done, the feature space is sliced by cuts parallel to the axes and some information is lost. For these and other reasons, decision trees are sensitive to changes in the input data; small variations can lead to different splits and branches and very different trees. Decision trees also tend to exhibit poorer performance than other classification structures – see, e.g., Hastie, Tibshirani, and Friedman (2009). The use of bagging, boosting, and random forests helps improve their prediction accuracy, as will be discussed in Chapter 62. For further information on decision trees, the reader may consult Utgoff (1989), Safavian and Landgrebe (1991), Mitchell (1997), Rokach and Maimon (2010), and Shalev-Shwartz and Ben-David (2014). Some results on performance guarantees for decision trees appear in Kearns, Li, and Valiant (1994), Kearns and Mansour (1999), Fiat and Pechyony (2004), Kalai et al. (2008), Lee (2009), Brutzkus, Daniely, and Malach (2019), and Blanc, Lange, and Tan (2020).

Entropy and impurity. The concepts of entropy and mutual information, which are used to quantify the amount of unpredictability (or uncertainty) in random realizations, are due to the American engineer Claude Shannon (1916–2001). He is regarded as the founder of the field of information theory, which was launched with the publication of the seminal papers by Shannon (1948a,b). The use of the Gini impurity measure was introduced by Breiman et al. (1984) in their description of the CART algorithm. The entropy and Gini measures are both examples of impurity functions, which are defined more generally as follows. Consider a collection of K probability values satisfying

Σ_{k=1}^{K} p_k = 1,   p_k ≥ 0                                           (54.63)

An impurity function defined on the tuple (p_1, p_2, . . . , p_K) is a function, denoted by i : [0, 1]^K → IR_+, that satisfies the following properties: (a) i(p_1, . . . , p_K) is symmetric with regard to its arguments; (b) i(p_1, . . . , p_K) is maximized when p_1 = p_2 = . . . = p_K = 1/K; (c) i(p_1, . . . , p_K) is minimum at the vertices (1, 0, . . . , 0), (0, 1, 0, . . . , 0), . . . , (0, . . . , 0, 1). It can be verified that the entropy and Gini measures defined by (54.20) and (54.42), respectively, satisfy these conditions – see Prob. 54.13.

Gini coefficient. The Gini impurity metric has similarities with, but is not identical to, what is known as the Gini coefficient in the economic and social sciences. The coefficient is used as a measure of statistical dispersion in the study of wealth distribution and

income inequality in societies. It was introduced by the Italian sociologist Corrado Gini (1884–1965) in the works by Gini (1909, 1912, 1914) – see, e.g., the treatment by Kleiber and Kotz (2002). The coefficient is defined for both continuous and discrete random distributions. We state its definition only for the case of a binary random variable to facilitate comparison with the Gini impurity measure (54.43). Thus, assume that x is a random variable with discrete levels x_1 < x_2 and probabilities:

p = P(x = x_1),   1 − p = P(x = x_2)                                     (54.64)

Let x̄ denote the mean value of x:

x̄ = p x_1 + (1 − p) x_2                                                 (54.65)

The Gini coefficient is defined as

Gc(x) ≜ 1 − (1/x̄) [ x_1 p² + x_2 (1 − p)² + 2 x_1 p(1 − p) ]            (54.66)

while the Gini impurity measure (54.43) is defined by

G(x) = 1 − p² − (1 − p)²                                                 (54.67)

Some comparisons are apparent: (a) The Gini coefficient depends on the values of the discrete levels, {x_1, x_2}. (b) The Gini impurity and the Gini coefficient are both zero when p = 0 or p = 1. In the context of wealth distribution, the values p = 0 or p = 1 indicate that all individuals in the society have the same wealth, x_1 or x_2. (c) When p = 1/2, we get G(x) = 1/2 while Gc(x) = (1/2)(1 − x_1/x̄). The smaller the value of x_1 is, the closer the Gini coefficient is to 1/2. In particular, when x_1 = 0 (which corresponds to a state of zero wealth), then Gc(x) = 1/2. In other words, just like the Gini impurity is a measure of how pure the distribution of x is (with a pdf concentrated at one level value leading to zero impurity), the Gini coefficient is a measure of how well distributed wealth is (with a zero value corresponding to all individuals in the society having the same wealth).

Heart disease Cleveland dataset. Example 54.3 illustrates the construction of a decision tree by relying on the heart-disease Cleveland dataset. This dataset consists of 297 samples that belong to patients with and without heart disease. It is available on the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/heart+Disease. The investigators responsible for the collection of the data are the leading four co-authors of the article by Detrano et al. (1989).

PROBLEMS

54.1 Carry out the calculations that lead to the last row of Table 54.4.

54.2 Run the pruning method (54.62) on the decision tree shown in Fig. 54.6.

54.3 Refer to the data in Table 54.2. Ignore the column corresponding to “runny nose” and keep all other columns unchanged. Construct a decision tree using the remaining data.

54.4 Refer to the data in Table 54.2. Ignore the rows corresponding to patients 8 and 9. Construct a decision tree using the remaining data.

54.5 Refer to the data in Table 54.2 and to the feature attributes defined by (54.8). We modify these attributes by removing “fever” and replacing it by “flu.” That is, we now treat the last column of the table as an attribute. The objective is to classify whether the patient with the attributes in h has a “fever” or not. Construct a decision tree for this purpose.

54.6 Refer to the data in Table 54.2. We split the training data into two groups: patients {0, 1, 2, 3, 4} and patients {5, 6, 7, 8, 9}. We use the first group to construct one decision tree, and we use the second group to construct a separate decision tree. Consider the feature vector defined by (54.8). We feed the test vector h = col{YES, NO, YES, NO, YES, NO} into each of the trees. What is the decision by the first tree? What is the decision by the second tree? How do these decisions compare to the classification from the tree in Fig. 54.6, which was constructed from the entire set of data?

54.7 Refer to the data in Table 54.2 and the corresponding decision tree from Fig. 54.6. Consider the feature vector defined by (54.8). We feed the test vector h = col{YES, ?, YES, NO, YES, NO} into the tree, where ? denotes a missing attribute. One way to handle a missing attribute is to replace it by the most likely value from the training data. Doing so, what would be the classification result?

54.8 Continuing with Prob. 54.7, another method to handle missing data is to ignore that attribute altogether in the training data. Assume we do so and construct the decision tree accordingly. What would be the classification result for the test vector from Prob. 54.7?

54.9 Explain that expression (54.42) for the Gini impurity can be rewritten in the equivalent form

G(x) = Σ_x P(x = x) × (1 − P(x = x))

54.10 Refer to expressions (54.21) and (54.42) for the entropy and Gini impurity of a discrete random variable, x, with K discrete levels, say, x ∈ {a_1, a_2, . . . , a_K}. Show formally that these measures are maximized when x is uniformly distributed, i.e., when P(x = a_k) = 1/K. Determine also the respective maximum values.

54.11 Consider a collection of R classes, r ∈ {1, 2, . . . , R}. Let π_r denote the probability of each class occurring, π_r = P(r = r). Show that the probability of selecting a feature vector at random from one class and assigning it erroneously to another class is equal to the Gini impurity measure G(r).

54.12 Continuing with Prob. 54.11, assume we select two feature vectors independently of each other. (a) What is the likelihood that both feature vectors belong to the same specific class, r_o? (b) What is the likelihood that both feature vectors belong to the same class? (c) What is the likelihood that both feature vectors belong to two different classes? How does this result relate to the Gini impurity measure, G(r)?

54.13 Show that the entropy and Gini measures defined by (54.20) and (54.42), respectively, satisfy the conditions of an impurity function as stated after (54.63).

54.14 Show that the following function:

i(p_1, . . . , p_K) ≜ 1 − max_{1≤k≤K} p_k

satisfies the conditions of an impurity function as stated after (54.63).

REFERENCES

Blanc, G., J. Lange, and L.-Y. Tan (2020), “Provable guarantees for decision tree induction: The agnostic setting,” Proc. Int. Conf. Machine Learning (ICML), pp. 1–20. Also available at arXiv:2006.00743.

Breiman, L., J. H. Friedman, R. A. Olshen, and C. J. Stone (1984), Classification and Regression Trees, Wadsworth International Group.
Brutzkus, A., A. Daniely, and E. Malach (2019), “On the optimality of trees generated by ID3,” available at arXiv:1907.05444.
Casey, R. G. and G. Nagy (1984), “Decision tree design using a probabilistic model,” IEEE Trans. Inf. Theory, vol. 30, pp. 93–99.
Detrano, R., A. Janosi, W. Steinbrunn, M. Pfisterer, J. Schmid, S. Sandhu, K. Guppy, S. Lee, and V. Froelicher (1989), “International application of a new probability algorithm for the diagnosis of coronary artery disease,” Amer. J. Cardiol., vol. 64, pp. 304–310.
Fiat, A. and D. Pechyony (2004), “Decision trees: More theoretical justification for practical algorithms,” Proc. Int. Conf. Algorithmic Learning Theory (ALT), pp. 156–170, Padova.
Gini, C. (1909), “Concentration and dependency ratios,” English translation in Rivista di Politica Economica, vol. 87, pp. 769–789, 1997.
Gini, C. (1912), “Variabilita e mutabilita: contributo allo studio delle distribuzioni e delle relazioni statistiche,” Universita de Cagliari, part 2, Cuppini. Reprinted in Memorie di Metodologica Statistica, E. Pizetti and T. Salvemini, editors, Libreria Eredi Virgilio Veschi, 1955.
Gini, C. (1914), “Sulla misura della concentrazione e della variabilita dei caratteri,” Atti del Reale Istituto Veneto di Scienze, Lettere ed Arti, vol. 73, pp. 1203–1248.
Hastie, T., R. Tibshirani, and J. Friedman (2009), The Elements of Statistical Learning, 2nd ed., Springer.
Hunt, E. B. (1962), Concept Learning: An Information Processing Problem, Wiley.
Hunt, E. B., J. Marin, and P. J. Stone (1966), Experiments in Induction, Academic Press.
Kalai, A., A. Klivans, Y. Mansour, and R. A. Servedio (2008), “Agnostically learning halfspaces,” SIAM J. Comput., vol. 37, no. 6, pp. 1777–1805.
Kearns, M., M. Li, and L. Valiant (1994), “Learning Boolean formulas,” J. ACM, vol. 41, no. 6, pp. 1298–1328.
Kearns, M. and Y. Mansour (1999), “On the boosting ability of top-down decision tree learning algorithms,” J. Comput. Syst. Sci., vol. 58, no. 1, pp. 109–128.
Kleiber, C. and S. Kotz (2002), Statistical Size Distribution in Economics and Actuarial Sciences, Wiley.
Lee, H. (2009), On the Learnability of Monotone Functions, Ph.D. dissertation, Columbia University, USA.
Mitchell, T. (1997), Machine Learning, McGraw Hill.
Quinlan, J. R. (1983), “Learning efficient classification procedures and their application to chess endgames,” in Machine Learning: An Artificial Intelligence Approach, R. S. Michalski, J. G. Carbonell, and T. M. Mitchell, editors, Tioga Publishing Company, CA.
Quinlan, J. R. (1986), “Induction of decision trees,” Mach. Learn., vol. 1, no. 1, pp. 81–106.
Quinlan, J. R. (1987), “Rule induction with statistical data: A comparison with multiple regression,” J. Oper. Res. Soc., vol. 38, pp. 347–352.
Quinlan, J. R. (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann.
Rokach, L. and O. Maimon (2010), “Decision trees,” in Data Mining and Knowledge Discovery Handbook, Springer.
Safavian, S. R. and D. Landgrebe (1991), “A survey of decision tree classifier methodology,” IEEE Trans. Syst., Man Cybern., vol. 21, no. 3, pp. 660–674.
Sethi, I. K. and G. P. R. Sarvarayndu (1981), “Hierarchical classifier design using mutual information,” IEEE Trans. Patt. Ann. Mach. Intell., vol. 4, no. 4, pp. 441–445.
Shalev-Shwartz, S. and S. Ben-David (2014), Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press.
Shannon, C. E. (1948a), “A mathematical theory of communication,” Bell Syst. Tech. J., vol. 27, pp. 379–423.

Shannon, C. E. (1948b), “A mathematical theory of communication,” Bell Syst. Tech. J., vol. 27, pp. 623–656.
Stevens, M. A. (1961), “Automatic character recognition: A state-of-the-art report,” Tech. Note 112, PB 161613, National Bureau of Standards, U.S. Department of Commerce.
Utgoff, P. E. (1989), “Incremental induction of decision trees,” Mach. Learn., vol. 4, no. 2, pp. 161–186.

55 Naïve Bayes Classifier

The optimal Bayes classifier (52.8) requires knowledge of the conditional probability distribution P(r = r|h = h), which is generally unavailable. In this and the next few chapters, we describe data-based generative methods that approximate the joint probability distribution f_{r,h}(r, h), or its components P(r = r) and f_{h|r}(h|r), directly from the data. Once these components are estimated, they can then be used to learn the desired probabilities P(r = r|h = h) by means of the Bayes rule and to perform classification. Among these methods we list the naïve Bayes classifier of this chapter, the linear and Fisher discriminant analysis (LDA, FDA) methods of the next chapter, and the logistic regression method of Chapter 59.

The naïve classifier is a suboptimal construction that relies on a certain independence assumption. Although the assumption rarely holds in practice, the resulting classifier has become popular and leads to competitive performance in many applications involving text segmentation, document classification, spam filtering, or medical diagnosis. The naïve Bayes classifier is an example of a supervised learning procedure because its training requires access to a collection of feature vectors and their respective labels. The training data is used to estimate the priors P(r = r) and to fit Bernoulli or multinomial distributions to model the conditional f_{h|r}(h|r).

55.1 INDEPENDENCE CONDITION

We start by describing the independence assumption that will facilitate the evaluation of the Bayes classifier and lead to its naïve implementation. Specifically, we will assume the following:

(a) (Discrete attributes) The individual entries (or attributes) of the feature vector h ∈ IR^M, denoted by {h(1), h(2), . . . , h(M)}, assume discrete values (i.e., they are not continuous random variables). Later, in Sections 55.4 and 56.2, we will consider the situation in which the entries of the feature vector are continuously distributed.

(b) (Conditionally independent attributes) The individual entries {h(m)} are conditionally independent of each other given the class variable r, so that the joint probability of any two entries decouples into the product of the individual probabilities:

P(h(k) = a, h(ℓ) = b | r = r) = P(h(k) = a | r = r) × P(h(ℓ) = b | r = r)      (55.1)

for any k ≠ ℓ.

Let π_r represent the prior probability for each class r = r, namely,

π_r ≜ P(r = r),   r = 1, 2, . . . , R                                    (55.2)

Now, given a feature vector h, we would like to determine its most likely label according to the Bayes classifier construction. Using the Bayes rule (3.42c) for discrete random variables, we can express the desired conditional probability in the form:

P(r = r | h = h) = P(r = r) P(h = h | r = r) / P(h = h)                  (55.3)

Since the quantity in the denominator, P(h = h), is independent of r, we can ignore its presence and note that, in order to maximize P(r = r|h = h) over r, it is sufficient to maximize the numerator, so that the label for h can be found by solving (where we are using the bullet superscript notation to refer to this optimal construction):

r•(h) ≜ argmax_{1≤r≤R} { π_r P(h = h | r = r) }                          (55.4)

We therefore transformed the problem of determining the label for h into one that requires evaluation of the reverse conditional probability P(h = h|r = r). It is at this stage that the independence assumption becomes useful. This is because it allows us to write the factorization:

P(h = h | r = r) = ∏_{m=1}^{M} P(h(m) = h(m) | r = r)                    (55.5)

Substituting into (55.4) and transforming the right-hand side into the logarithmic scale to avoid working with small numbers, we arrive at:

(Bayes classifier under independence assumption)
r•(h) ≜ argmax_{1≤r≤R} { log(π_r) + Σ_{m=1}^{M} log P(h(m) = h(m) | r = r) }      (55.6)

Example 55.1 (Document classification) Consider an application in which we are interested in classifying a newspaper document into one of three classes defined as follows:

r = 1 −→ article discusses sports
r = 2 −→ article discusses politics                                      (55.7)
r = 3 −→ article discusses movies


Assume further, for this contrived example, that we extract four attributes from each document and collect them into a four-dimensional feature vector, h ∈ IR^4, where each entry of h counts the total number of times that the words below appear in the article:

h(1) : {football, basketball, baseball}                                  (55.8a)
h(2) : {President, Congress, election}                                   (55.8b)
h(3) : {actor, actress, theater}                                         (55.8c)
h(4) : {inflation, market, consumer}                                     (55.8d)

In this case, we have R = 3 classes and M = 4 attributes. Obviously, in actual text classification systems the construction of the feature space is more comprehensive than shown here and will take into account several other aspects of the document. Given r = 2 (the document discusses politics), the independence assumption amounts to saying that the number of times that the words {President, Congress, election} appear in the document is, for example, conditionally independent of the number of times that the words {football, basketball, baseball} appear in the same document.

55.2 MODELING THE CONDITIONAL DISTRIBUTION

Determination of the Bayes classifier by means of (55.6) still requires knowledge of the reverse conditional probability P(h = h|r = r). Since we are assuming h to be discrete, we can consider two distributions that are particularly useful to model such probabilities.

Bernoulli distribution
In one model, we assume each attribute h(m) ∈ {0, 1} follows a Bernoulli distribution and is binary-valued. Situations like this arise, for example, when h(m) is declaring the presence of a certain attribute or not (such as whether an object is hot or cold, blue or yellow, and so forth). Let p_rm denote the success probability, i.e., the likelihood that h(m) assumes the value 1 under class r = r:

p_rm ≜ P(h(m) = 1 | r = r)                                               (55.9)

Note that we are attaching two subscripts to p_rm: the subscript r indicates that the value of p_rm depends on the class variable, and the subscript m is the index of the attribute. Thus, the value of p_rm is referring to the likelihood that the mth attribute is active given that the feature vector belongs to class r. For this same attribute, but under another class r′, the value p_r′m can be different. Using (55.9), we can write

P(h(m) = h(m) | r = r) = p_rm^{h(m)} (1 − p_rm)^{1−h(m)},   h(m) ∈ {0, 1}      (55.10)

In this way, we can determine the probabilities P(h = h|r = r) from knowledge of the {p_rm}.

Multinomial distribution
In a second model, we assume there are M separate events (such as the occurrence of colors red, blue, green, and so forth) and each attribute h(m) counts how many times event m has occurred (e.g., how many times the color red has occurred, the color blue, and the color green). In this case, the variables {h(1), h(2), . . . , h(M)} follow a multinomial distribution. Let p_rm denote the likelihood of observing attribute m under class r. These probabilities satisfy

Σ_{m=1}^{M} p_rm = 1,   ∀ r ∈ {1, 2, . . . , R}                          (55.11)

and, using expression (5.34), we have

P(h = h | r = r) = [ (Σ_{m=1}^{M} h(m))! / (h(1)! h(2)! · · · h(M)!) ] p_r1^{h(1)} p_r2^{h(2)} · · · p_rM^{h(M)}      (55.12)

Observe that this expression provides the conditional probability of h directly rather than of its individual entries, as was the case with (55.10). This is of course sufficient for use in (55.4).
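When the counts are large, it is safer to evaluate (55.12) in the logarithmic domain; a brief sketch using only the Python standard library (the function name is ours) is:

from math import lgamma, log

def multinomial_log_pmf(h, p):
    """log P(h = h | r = r) from (55.12), for counts h and class probabilities p."""
    n = sum(h)
    log_coeff = lgamma(n + 1) - sum(lgamma(k + 1) for k in h)   # log of the multinomial coefficient
    return log_coeff + sum(k * log(q) for k, q in zip(h, p))    # assumes every q > 0 (e.g., after smoothing)

# example: word-group counts h for one document under hypothetical class probabilities p
print(multinomial_log_pmf([3, 1, 0, 2], [0.5, 0.2, 0.1, 0.2]))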

55.3 ESTIMATING THE PRIORS

We are now ready to derive the naïve Bayes classifier. One of the main difficulties in implementing the optimal Bayes solution (55.6) is that it requires knowledge of the probabilities π_r and P(h = h|r = r). The latter probabilities are determined once we know the parameters {p_rm} under either the Bernoulli or multinomial model. The parameters {π_r, p_rm} are rarely known beforehand and need to be estimated. We now assume that we have access to a collection of N training data points, {r(n), h_n, n = 0, 1, . . . , N − 1}. In this notation, r(n) is the class for feature h_n.

Estimating the class priors
Assume that within the N data samples there are N_r examples that belong to class r. Then, the derivation in Prob. 55.3 shows that π_r can, in principle, be estimated as follows:

π̂_r = N_r / N                                                           (55.13)

which is the fraction of data points that belong to class r within the training set. However, an adjustment is needed to avoid situations where a particular class may not be represented in the training data, in which case we will end up with π̂_r = 0 for that r. To avoid this situation, it is customary to modify the above expression for estimating π_r by incorporating a form of smoothing known as Laplace smoothing – see Probs. 55.4 and 55.5. We extend N to N + sR, where we assume the presence of sR additional fictitious training samples. Here,


the parameter s is positive and controls the amount of smoothing. The choice s = 1 is common and referred to as Laplace smoothing. Choices of s < 1 are referred to as Lidstone smoothing. Now, assuming the labels r ∈ {1, 2, . . . , R} are uniformly distributed within the sR virtual samples, then s of these samples will be expected to belong to each class. We then replace expression (55.13) for π̂_r by

π̂_r = (N_r + s) / (N + sR),   r = 1, 2, . . . , R       (Laplace smoothing)      (55.14)

Observe that when s = 0 we get π br = Nr /N , and when s → ∞ we get π br → 1/R. Therefore, the smoothing operation ensures that the estimate for πr lies between the sample average (Nr /N ) and the uniform probability (1/R).
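As a quick numerical illustration with made-up counts, the smoothed estimate (55.14) indeed moves from the sample fraction toward the uniform value 1/R as s grows:

Nr, N, R = 4, 10, 2                        # hypothetical counts: Nr samples in the class, N total, R classes
for s in [0, 1, 10, 1000]:
    print(s, (Nr + s) / (N + s * R))       # 0.4, 0.4167, 0.4667, 0.4995: from Nr/N toward 1/R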

Estimating the reverse conditional probabilities
Similarly, we can use the training data to estimate the parameters {p_rm}. Consider first the multinomial case, where p_rm denotes the likelihood that attribute m occurs under class r. Given the N training feature vectors {h_n}, we isolate the vectors that belong to class r and count how many times attribute m occurs in them:

N_rm ≜ Σ_{h_n ∈ class r} h_n(m)                                          (55.15a)

Note that m is fixed in this sum and we are adding over all feature vectors from class r in the training set. If we add the {N_rm} over m, we arrive at the total number of all attributes observed in the training set under class r:

N_rT ≜ Σ_{m=1}^{M} N_rm                                                  (55.15b)

Then, p_rm is estimated by using the smoothed formula:

p̂_rm = (N_rm + s) / (N_rT + sM),   m = 1, . . . , M,   r = 1, . . . , R      (multinomial parameters)      (55.15c)

for some s > 0, since there are M possible attributes. This calculation assumes that the training data is dense enough so that all classes are observed. For the Bernoulli model, we again isolate the vectors that belong to class r and count how many times attribute m is active at the value 1 within these vectors:

N_rm ≜ Σ_{h_n ∈ class r} h_n(m)                                          (55.16a)

We also let N_r denote the total number of feature vectors in class r:

N_r = number of features h_n in class r                                  (55.16b)

Then, p_rm is estimated by using the smoothed formula

p̂_rm = (N_rm + s) / (N_r + 2s),   m = 1, . . . , M,   r = 1, . . . , R      (Bernoulli parameters)      (55.16c)

The following listing summarizes the steps involved in the training and classification phases of the naïve Bayes classifier for multinomial-distributed feature data using (55.15c); for Bernoulli-distributed attributes we use (55.16c) instead. The construction is relatively simple to train. Note that we are denoting the resulting classifier in the last line of the algorithm by the notation r? (h) (as opposed to r• (h)) because it is learned directly from the training data.

Naïve Bayes classifier for discrete multinomial feature data.
  given N training data points {r(n), h_n}, n = 0, 1, 2, . . . , N − 1;
  given R classes, r(n) ∈ {1, 2, . . . , R};
  each feature vector, h_n, is M-dimensional with entries {h_n(m)};
  h_n(m) counts how many times attribute m occurs in the nth sample;
  select a Laplace smoothing factor s > 0, e.g., s = 1.

  (training)
  repeat for r = 1, 2, . . . , R:
    N_r = number of training samples in class r;
    π̂_r = (N_r + s)/(N + sR);
    repeat for m = 1, 2, . . . , M:
      N_rm = number of times attribute m occurs in class r;      (55.15a)
      N_rT = total number of attributes observed in class r;     (55.15b)
      p̂_rm = (N_rm + s)/(N_rT + sM);
    end
  end

  (classification)
  given a new feature vector, h, with entries {h(m)}:
    compute P̂(h = h | r = r) using (55.12), for r = 1, 2, . . . , R;
    determine r⋆(h) = argmax_{1≤r≤R} { π̂_r P̂(h = h | r = r) };
  end                                                             (55.17)
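For concreteness, a compact Python sketch of listing (55.17) is given below. The class name and method signatures are our own; classification is carried out in the log domain and drops the factorial term of (55.12), since it does not depend on r.

import numpy as np

class MultinomialNaiveBayes:
    """Sketch of the multinomial naive Bayes classifier of listing (55.17)."""

    def fit(self, H, r, R, s=1.0):
        # H: N x M matrix of counts, r: length-N vector of labels in {1,...,R}
        N, M = H.shape
        self.log_prior = np.zeros(R)
        self.log_p = np.zeros((R, M))
        for c in range(1, R + 1):
            Hc = H[r == c]
            Nc = Hc.shape[0]
            self.log_prior[c - 1] = np.log((Nc + s) / (N + s * R))       # (55.14)
            Ncm = Hc.sum(axis=0)                                          # (55.15a)
            NcT = Ncm.sum()                                               # (55.15b)
            self.log_p[c - 1] = np.log((Ncm + s) / (NcT + s * M))         # (55.15c)
        return self

    def predict(self, h):
        # log pi_r + sum_m h(m) log p_rm, maximized over r (cf. (55.6))
        scores = self.log_prior + self.log_p @ h
        return int(np.argmax(scores)) + 1

Training amounts to a single pass over the data; classifying a new count vector h then reduces to the R inner products in the predict step.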

Example 55.2 (Application to medical diagnosis) We reconsider the earlier Table 54.2, repeated here as Table 55.1, which lists the symptoms for N = 10 patients and whether they had the flu or not. The number of classes in this example is R = 2 with:

γ = +1 : patient has the flu                                             (55.18a)
γ = −1 : patient does not have the flu                                   (55.18b)


The last column in the table indicates the class that each patient belongs to. Excluding this last column, each row in the table corresponds to a feature vector with M = 6 attributes. Each entry of h assumes a binary value (Yes/No); i.e., it is Bernoulli-distributed. For example, the first entry of h indicates whether the patient had a headache or not. Figure 55.1 provides a graphical illustration of the data from Table 55.1, where the blue color indicates the presence of the relevant symptom. The top row in the figure lists patients without the flu, while the bottom row lists patients with the flu.

Figure 55.1 Graphical illustration of the data from Table 55.1, where the blue color indicates the presence of the relevant symptom. The top row lists patients without the flu, while the bottom row lists patients with the flu.

Table 55.1 Symptoms felt by 10 patients and whether they had the flu or not.

Patient   Headache   Fever   Sore throat   Vomiting   Chills   Runny nose   Flu
0         Yes        No      No            Yes        No       No           NO
1         Yes        Yes     No            No         Yes      Yes          YES
2         No         Yes     Yes           No         Yes      Yes          YES
3         No         No      No            Yes        No       No           NO
4         No         Yes     No            Yes        Yes      No           NO
5         Yes        No      Yes           No         Yes      Yes          YES
6         Yes        No      No            No         No       No           NO
7         No         Yes     No            Yes        No       No           NO
8         Yes        Yes     No            No         No       Yes          YES
9         Yes        No      No            No         Yes      Yes          NO

We set the Laplace smoothing factor to s = 1 and use the data in the table to estimate the prior probabilities as follows:


N_{+1} = 4                                                               (55.19a)
N_{−1} = 6                                                               (55.19b)
π̂_{+1} = (4 + 1)/(10 + 2) ≈ 0.4167                                      (55.19c)
π̂_{−1} = (6 + 1)/(10 + 2) ≈ 0.5833                                      (55.19d)

where N_{+1} denotes the number of samples in the training set that belong to class γ = +1 (has the flu); similarly for N_{−1}. We also use the data from the table to estimate the conditional probabilities, first for the patients that had the flu:

P̂(headache=yes | patient has flu) = (3 + 1)/(4 + 2) = 2/3               (55.20a)
P̂(headache=no | patient has flu) = (1 + 1)/(4 + 2) = 1/3                (55.20b)
P̂(fever=yes | patient has flu) = (3 + 1)/(4 + 2) = 2/3                  (55.20c)
P̂(fever=no | patient has flu) = (1 + 1)/(4 + 2) = 1/3                   (55.20d)
P̂(sore throat=yes | patient has flu) = (2 + 1)/(4 + 2) = 1/2            (55.20e)
P̂(sore throat=no | patient has flu) = (2 + 1)/(4 + 2) = 1/2             (55.20f)
P̂(vomiting=yes | patient has flu) = (0 + 1)/(4 + 2) = 1/6               (55.20g)
P̂(vomiting=no | patient has flu) = (4 + 1)/(4 + 2) = 5/6                (55.20h)
P̂(chills=yes | patient has flu) = (3 + 1)/(4 + 2) = 2/3                 (55.20i)
P̂(chills=no | patient has flu) = (1 + 1)/(4 + 2) = 1/3                  (55.20j)
P̂(runny nose=yes | patient has flu) = (4 + 1)/(4 + 2) = 5/6             (55.20k)
P̂(runny nose=no | patient has flu) = (0 + 1)/(4 + 2) = 1/6              (55.20l)

and similarly for the patients that did not have the flu:

P̂(headache=yes | patient does not have flu) = (3 + 1)/(6 + 2) = 1/2     (55.21a)
P̂(headache=no | patient does not have flu) = (3 + 1)/(6 + 2) = 1/2      (55.21b)
P̂(fever=yes | patient does not have flu) = (2 + 1)/(6 + 2) = 3/8        (55.21c)
P̂(fever=no | patient does not have flu) = (4 + 1)/(6 + 2) = 5/8         (55.21d)
P̂(sore throat=yes | patient does not have flu) = (0 + 1)/(6 + 2) = 1/8  (55.21e)
P̂(sore throat=no | patient does not have flu) = (6 + 1)/(6 + 2) = 7/8   (55.21f)
P̂(vomiting=yes | patient does not have flu) = (4 + 1)/(6 + 2) = 5/8     (55.21g)
P̂(vomiting=no | patient does not have flu) = (2 + 1)/(6 + 2) = 3/8      (55.21h)
P̂(chills=yes | patient does not have flu) = (2 + 1)/(6 + 2) = 3/8       (55.21i)
P̂(chills=no | patient does not have flu) = (4 + 1)/(6 + 2) = 5/8        (55.21j)
P̂(runny nose=yes | patient does not have flu) = (1 + 1)/(6 + 2) = 1/4   (55.21k)
P̂(runny nose=no | patient does not have flu) = (5 + 1)/(6 + 2) = 3/4    (55.21l)

In this example, we would like to employ the naïve Bayes classifier to decide whether a new patient with the following symptoms has the flu or not:

h = {headache=NO, fever=NO, sore throat=YES, vomiting=NO, chills=NO, runny nose=YES}      (55.22)


Figure 55.2 Graphical illustration of the symptoms for the new patient. Does the patient have the flu?

The symptoms for the new patient are represented graphically in Fig. 55.2. To begin with, using (55.5), we evaluate the following conditional probabilities:

P̂(h = h | patient has flu)
  = ∏_{m=1}^{6} P̂(h(m) = h(m) | patient has flu)
  = P̂(headache=no | patient has flu) × P̂(fever=no | patient has flu)
    × P̂(sore throat=yes | patient has flu) × P̂(vomiting=no | patient has flu)
    × P̂(chills=no | patient has flu) × P̂(runny nose=yes | patient has flu)
  = 1/3 × 1/3 × 1/2 × 5/6 × 1/3 × 5/6 ≈ 0.01286                          (55.23)

and

P̂(h = h | patient does not have flu)
  = ∏_{m=1}^{6} P̂(h(m) = h(m) | patient does not have flu)
  = P̂(headache=no | patient does not have flu) × P̂(fever=no | patient does not have flu)
    × P̂(sore throat=yes | patient does not have flu) × P̂(vomiting=no | patient does not have flu)
    × P̂(chills=no | patient does not have flu) × P̂(runny nose=yes | patient does not have flu)
  = 1/2 × 5/8 × 1/8 × 3/8 × 5/8 × 1/4 ≈ 0.002289                         (55.24)

Consequently,

π̂_{+1} P̂(h = h | patient has flu) ≈ 0.4167 × 0.01286 ≈ 0.005359         (55.25)

and

π̂_{−1} P̂(h = h | patient does not have flu) ≈ 0.5833 × 0.002289 ≈ 0.001335      (55.26)

Since 0.005359 > 0.001335, we conclude that the patient is likely to have the flu. This example helps illustrate one main limitation of naïve Bayes classifiers, namely, the assumption that the entries of the feature vector (i.e., the attributes) are conditionally independent of each other. For example, given that a patient has the flu, it is likely that having a fever and feeling a chill are dependent (rather than independent) events. Still, naïve Bayes classification is a popular learning scheme due to its computational simplicity and the fact that it performs surprisingly well (although it can be outperformed by other more elaborate learning methods).

Example 55.3 (Application to spam filtering) We apply the naïve Bayes classifier to another situation with R = 2 classes γ ∈ {±1} (such as checking whether an email message is spam or not), with γ = +1 corresponding to spam messages. In this example, each entry of the feature vector, h ∈ IR^M, is binary-valued and its value is either 1 or 0 depending on whether a particular word is present in the message or not. Using Laplace smoothing, with s = 1, we first estimate the probabilities for the two classes:

π̂_{+1} = (N_{+1} + 1)/(N + 2),   π̂_{−1} = (N_{−1} + 1)/(N + 2)          (55.27)

where N_{+1} (similarly, N_{−1}) denotes the number of samples in the training set that belong to class γ = +1 (similarly, γ = −1). Likewise, for each m = 1, 2, . . . , M, we use Laplace smoothing again to estimate the Bernoulli parameters:

p̂_{+1,m} ≜ P̂(h(m) = 1 | γ = +1) = (N_{+1,m} + 1)/(N_{+1} + 2)           (55.28a)
p̂_{−1,m} ≜ P̂(h(m) = 1 | γ = −1) = (N_{−1,m} + 1)/(N_{−1} + 2)           (55.28b)

where

N_{+1,m} ≜ number of observations in class +1 having h(m) = 1            (55.29a)
N_{−1,m} ≜ number of observations in class −1 having h(m) = 1            (55.29b)

Using the above parameters we can write, for any γ ∈ {±1}:

P̂(h(m) = h(m) | γ = γ) = p̂_{γ,m}^{h(m)} (1 − p̂_{γ,m})^{1−h(m)}          (55.30a)

Accordingly, given a new message with feature vector h, we can decide its class (whether spam or not) by seeking the value of γ ∈ {±1} that maximizes:

γ⋆(h) = argmax_{γ∈{±1}} { π̂_γ P̂(h = h | γ = γ) }                        (55.31)

where

P̂(h = h | γ = γ) = ∏_{m=1}^{M} P̂(h(m) = h(m) | γ = γ)                   (55.32)
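A vectorized Python sketch of this Bernoulli model follows; the array names are illustrative (H is an N × M binary matrix of word-presence indicators and y holds the labels ±1), and the classification step evaluates (55.31) in the log domain.

import numpy as np

def train_bernoulli_nb(H, y, s=1.0):
    """Estimate smoothed priors (55.27) and Bernoulli parameters (55.28a)-(55.28b)."""
    N = len(y)
    params = {}
    for c in (+1, -1):
        Hc = H[y == c]
        Nc = Hc.shape[0]
        params[c] = {
            "log_prior": np.log((Nc + s) / (N + 2 * s)),
            "p": (Hc.sum(axis=0) + s) / (Nc + 2 * s),
        }
    return params

def classify(params, h):
    """Pick the label maximizing (55.31), evaluated in the log domain via (55.30a) and (55.32)."""
    def score(c):
        p = params[c]["p"]
        return params[c]["log_prior"] + np.sum(h * np.log(p) + (1 - h) * np.log(1 - p))
    return +1 if score(+1) >= score(-1) else -1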

55.4 GAUSSIAN NAÏVE CLASSIFIER

We have restricted so far the entries of the feature vector h to discrete values. The naïve Bayes construction can be extended to the case in which the entries of h are continuous in IR. In this section, we describe the situation in which these entries continue to be conditionally independent of each other. Later, in Section 56.2, we will consider the more general scenario where the entries of h can be correlated and derive linear discriminant methods for approximating the Bayes classifier.

Let {h(m)} denote the individual entries of h ∈ IR^M. Assume that, conditioned on the class variable r = r, each h(m) is Gaussian-distributed with mean µ_rm and variance σ²_rm, written as

f_{h(m)|r}(h(m)|r) ∼ N_m(µ_rm, σ²_rm) = (1/√(2πσ²_rm)) exp{ −(1/(2σ²_rm)) (h(m) − µ_rm)² }      (55.33)

Note that we are using two subscripts to characterize the mean and variance parameters of the Gaussian distribution: The subscript r indicates that these parameters depend on the class label, and the subscript m refers to the mth attribute. We are also denoting the Gaussian distribution for h(m) by the compact notation Nm , with a subscript m. The independence assumption on the entries of h implies that

f_{h|r}(h|r) = \prod_{m=1}^{M} \mathcal{N}_m(\mu_{rm}, \sigma_{rm}^2)
            = \prod_{m=1}^{M} \frac{1}{\sqrt{2\pi\sigma_{rm}^2}} \exp\Big\{ -\frac{1}{2\sigma_{rm}^2}\big(h(m) - \mu_{rm}\big)^2 \Big\}        (55.34)

Repeating the argument that led to (55.4) using the Bayes rule, we find that, given a feature vector h, the class selection r^\bullet(h) can be determined by solving:

r^\bullet(h) = \underset{1 \le r \le R}{\operatorname{argmax}} \; \pi_r f_{h|r}(h|r)
            = \underset{1 \le r \le R}{\operatorname{argmax}} \Big\{ \ln(\pi_r) - \frac{1}{2}\sum_{m=1}^{M} \Big[ \ln(2\pi\sigma_{rm}^2) + \frac{1}{\sigma_{rm}^2}\big(h(m) - \mu_{rm}\big)^2 \Big] \Big\}        (55.35)

The mean and variance parameters {µ_{rm}, σ²_{rm}} can be estimated from the training data {r(n), h_n}. If we let N_r denote the number of feature vectors that belong to class r, then we set


\widehat{\mu}_{rm} = \frac{1}{N_r} \sum_{r(n)=r} h_n(m)        (55.36a)

\widehat{\sigma}_{rm}^2 = \frac{1}{N_r - 1} \sum_{r(n)=r} \big(h_n(m) - \widehat{\mu}_{rm}\big)^2        (55.36b)

The resulting algorithm is listed in (55.37).

Naïve Bayes classifier for Gaussian feature data.

given N training data points {r(n), h_n}, n = 0, 1, 2, . . . , N − 1;
given R classes, r(n) ∈ {1, 2, . . . , R};
each feature vector, h_n, is M-dimensional with entries {h_n(m)};
h_n(m) is Gaussian-distributed;
select a Laplace smoothing factor s > 0, e.g., s = 1.

(training)
repeat for r = 1, 2, . . . , R:
    N_r = number of training samples in class r
    \widehat{\pi}_r = (N_r + s)/(N + sR)
    repeat for m = 1, 2, . . . , M:
        \widehat{\mu}_{rm} = \frac{1}{N_r} \sum_{r(n)=r} h_n(m)
        \widehat{\sigma}_{rm}^2 = \frac{1}{N_r - 1} \sum_{r(n)=r} \big(h_n(m) - \widehat{\mu}_{rm}\big)^2
    end
end

(classification)
given a new feature vector, h, with entries {h(m)}:
    r^\star(h) = \underset{1 \le r \le R}{\operatorname{argmax}} \Big\{ \ln(\widehat{\pi}_r) - \frac{1}{2}\sum_{m=1}^{M} \Big[ \ln(2\pi\widehat{\sigma}_{rm}^2) + \frac{1}{\widehat{\sigma}_{rm}^2}\big(h(m) - \widehat{\mu}_{rm}\big)^2 \Big] \Big\}
end
                                                                        (55.37)
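As a rough illustration of listing (55.37), here is a minimal NumPy sketch; the array layout, the small variance floor, and the function names are assumptions made for this example rather than part of the listing.

import numpy as np

def train_gaussian_nb(H, r, R, s=1.0, eps=1e-9):
    # H: (N, M) real-valued features; r: (N,) labels in {1,...,R}; s: Laplace factor.
    N, M = H.shape
    pi = np.zeros(R); mu = np.zeros((R, M)); var = np.zeros((R, M))
    for k in range(1, R + 1):
        Hk = H[r == k]
        Nk = Hk.shape[0]
        pi[k - 1] = (Nk + s) / (N + s * R)            # smoothed prior
        mu[k - 1] = Hk.mean(axis=0)                   # per-attribute sample means
        var[k - 1] = Hk.var(axis=0, ddof=1) + eps     # unbiased sample variances (floored)
    return pi, mu, var

def classify_gaussian_nb(h, pi, mu, var):
    # Log-domain version of the classification rule in (55.37).
    scores = np.log(pi) - 0.5 * np.sum(np.log(2 * np.pi * var) + (h - mu) ** 2 / var, axis=1)
    return int(np.argmax(scores)) + 1   # classes are labeled 1,...,R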

55.5 COMMENTARIES AND DISCUSSION

Laplace smoothing. The Laplace smoothing formula (55.14) is attributed to the French mathematician Pierre-Simon Laplace (1749–1827). He derived it in the work by Laplace (1814) in his study of the rule of succession, which deals with the following question. Assume an experiment with only two possible outcomes (success or failure) is repeated a total of N independent times, and that N_s successes have been observed during these trials. Assume we only know that the experiment has two possible outcomes but have no information about the likelihood of each outcome. Consider now the question of determining the probability that the outcome will be a success in the (N + 1)th trial. This probability is given by – see, e.g., the textbooks by Doob (1953) and Jaynes (2003) and Probs. 55.4 and 55.5:

P(\text{success in trial } N+1 \mid \text{given } N_s \text{ successes so far}) = \frac{N_s + 1}{N + 2}        (55.38)

This result can be extended to the case in which each trial has a total of R possible outcomes, say, r ∈ {1, 2, . . . , R}. In this case, the probability that the outcome is in class r in the (N + 1)th trial will be given by:

P(\text{outcome is class } r \text{ in trial } N+1 \mid \text{given } N_r \text{ observations of } r \text{ so far}) = \frac{N_r + 1}{N + R}        (55.39)

More generally, we can resort to expression (55.14) where s > 0. The choice s = 1 leads to Laplace smoothing, while choices s < 1 lead to Lidstone smoothing. Some of the earlier references on smoothing techniques include the works by Lidstone (1920), Johnson (1932), and Jeffreys (1948).

Naïve Bayes classifier. According to Duda, Hart, and Stork (2000) and Russell and Norvig (2009), some of the earliest applications of the algorithm were in the context of pattern recognition, text classification, and medical diagnosis in the late 1950s and early 1960s. For example, the early work by Maron (1961) examines the task of automatically classifying documents into various categories; the author motivates the work in the abstract of the article by writing that “the task, in essence, is to have a computing machine read a document and on the basis of the occurrence of selected clue words decide to which of many subject categories the document in question belongs.” The author motivates the naïve Bayes construction by using the Shannon entropy measure to quantify the uncertainty about which category a document belongs to.

We indicated in the body of the chapter that although the naïve Bayes classifier assumes the entries of the feature vector to be conditionally independent of each other, the classifier still performs competitively in practice even when the independence condition is violated. There have been several studies in the literature to illustrate and explain this behavior, most notably by Clark and Niblett (1989), Langley, Iba, and Thompson (1992), Kononenko (1993), Pazzani (1996), Domingos and Pazzani (1996, 1997), Frank et al. (2000), Garg and Roth (2001), Hand and Yu (2001), and Zhang (2004). The main conclusion from these works is that while the estimates of the conditional probabilities, \widehat{P}(r = r|h = h), can generally be poor (i.e., not close enough to their true values), the naïve classifier is still able to deliver good performance because the predicted class, r^\star, is decided not based on the estimated values of the probabilities but rather on comparing these values against each other, i.e., on selecting the class r^\star that leads to the largest value for \widehat{P}(r = r|h = h).

Naïve Bayes classifiers can be outperformed by other learners as shown, for example, in the works by Ng and Jordan (2001) and Caruana and Niculescu-Mizil (2006). The first work compared logistic regression and naïve Bayes, while the second work compared several learning algorithms against each other, including logistic regression, support vector machines, and naïve Bayes. Nevertheless, motivated by the extensive empirical and analytical evidence in support of the good performance of naïve Bayes classifiers in many situations of interest, these classifiers continue to serve as good starting points for the design of more elaborate learning machines.


PROBLEMS

55.1 Repeat the derivation of Example 55.2 to verify whether a patient with the following symptoms has the flu:
h = {headache=YES, fever=YES, sore throat=NO, vomiting=NO, chills=NO, runny nose=NO}

55.2 Continuing with the same patient from the previous problem, assume the feature vector is missing information about whether the patient has a sore throat or not (marked by the question mark below):
h = {headache=YES, fever=YES, sore throat=?, vomiting=NO, chills=NO, runny nose=NO}
How would you apply the naïve Bayes classifier to decide on whether the patient has the flu or not? Assuming the patient had a 60% chance of having a sore throat, how likely is it that the decision based on ignoring this information will be different from the decision that takes this additional piece of information into consideration?

55.3 Consider a multiclass classification problem consisting of R classes, say, r ∈ {1, 2, . . . , R}. The prior probability of observing features from class r is denoted by π_r. A collection of N independent realizations {r(n), h_n} are observed, with r(n) denoting the class variable and h_n the corresponding feature vector for the nth sample. It is observed that each class r occurs N_r times in the sample of N data points.
(a) Determine the likelihood probability P(N_1, N_2, . . . , N_R | π_1, π_2, . . . , π_R), where the {π_r} are treated as deterministic parameters.
(b) Show that the optimal estimate for π_r that is obtained by maximizing the logarithm of the above probability expression is given by \widehat{\pi}_r = N_r/N.

55.4 One way to motivate expression (55.14) for Laplace smoothing is as follows. We continue with the setting of Prob. 55.3 except that we now model the unknown priors {π_r} as random variables whose individual pdfs follow a symmetric Dirichlet distribution with parameter s > 0. Since the {π_r} should add up to 1, this means that one of the variables is fully determined from knowledge of the remaining R − 1 variables. A joint Dirichlet pdf with positive parameters {s_1, s_2, . . . , s_R} has the form:

f_{\pi_1,\pi_2,\ldots,\pi_R}(\pi_1, \pi_2, \ldots, \pi_R) \propto \prod_{r=1}^{R} \pi_r^{s_r - 1}

where the symbol ∝ denotes proportionality. It is known that the mean of each entry π_r under this distribution is given by E\,\pi_r = s_r / \sum_{r=1}^{R} s_r. When s_r = s, for all r ∈ {1, 2, . . . , R}, the distribution is said to be symmetric.
(a) Verify that P(N_1, \ldots, N_R | \pi = \pi) \propto \prod_{r=1}^{R} \pi_r^{N_r}.
(b) Assuming a symmetric distribution, verify that

f_{\pi_1,\pi_2,\ldots,\pi_R}(\pi_1, \pi_2, \ldots, \pi_R | N_1, \ldots, N_R) \propto \prod_{r=1}^{R} \pi_r^{N_r + s - 1}

(c) Conclude that the optimal mean-square-error estimate for π_r given the observations {N_1, N_2, . . . , N_R}, which is equal to the expectation of the conditional pdf of part (b), is given by:

\widehat{\pi}_r = \frac{N_r + s}{N + sR}


55.5 Derive the Laplace formula (55.38). Using this formula, what would the probability be of the sun rising tomorrow? Any controversy in the answer? Remark. This sun problem was used by Laplace (1814) to illustrate his calculation.

55.6 Refer to expression (55.12) when the entries of h follow a multinomial distribution. Show that the Bayes classifier (55.6) reduces to the following equivalent problem involving an affine function of the feature data:

r^\bullet(h) = \underset{r \in \{1,2,\ldots,R\}}{\operatorname{argmax}} \Big\{ \log(\pi_r) + h^T w_r \Big\}

where w_r ∈ IR^M collects the log values of the attribute probabilities:

w_r \;\stackrel{\Delta}{=}\; \operatorname{col}\big\{ \log(p_{r1}), \log(p_{r2}), \ldots, \log(p_{rM}) \big\}

55.7 Refer again to expression (55.12) when the entries of h follow a multinomial distribution. Assume there are two classes, R = 2, denoted by γ ∈ {±1}. Show that the Bayes classifier (55.6) reduces to checking the sign of an affine function of the feature data as follows:

r^\bullet(h) = \operatorname{sign}(h^T w^\bullet - \theta^\bullet)

where the parameters are given by

\theta^\bullet = \ln(\pi_{-1}/\pi_{+1}), \qquad w^\bullet = \operatorname{col}\big\{ \ln(p_{+1,m}/p_{-1,m}) \big\} \in IR^M

55.8 For the naïve Bayes classifier, how many conditional probabilities of the form (55.15c) need to be estimated from the training data?

55.9 Refer to expression (52.8) for the Bayes classifier. Assume the jth entry of the feature vector h is missing at random, denoted by h_j. Let h_{-j} denote the remaining entries of h; it is a vector of size M − 1. Let r^\bullet(h_{-j}) denote the optimal class label based on knowledge of h_{-j} alone. Under the independence assumption, argue that

r^\bullet(h_{-j}) = \underset{1 \le r \le R}{\operatorname{argmax}} \Big\{ \pi_r \times \prod_{i \ne j} P(h_i = h_i | r = r) \Big\}

so that classification can proceed by ignoring h_j.

55.10 Refer to expressions (55.36a)–(55.36b) for estimating the parameters of a Gaussian naïve classifier. How many parameters {µ_{rm}, σ²_{rm}} need to be estimated in total?

55.11 Assume the variances {σ²_{rm}} in the Gaussian naïve implementation are independent of the class label r and can be replaced by the notation {σ²_m}. How would you estimate the {σ²_m}?

55.12 Refer to expression (55.34) for the conditional pdf of h given the class variable in the Gaussian naïve classifier. Assume there are two classes denoted by γ ∈ {±1} with priors {π_{+1}, π_{-1}}. Assume further that the variances σ²_{rm} are independent of r and denote them by σ²_m. Show that the conditional probability P(γ = γ|h = h) can be written in the following sigmoidal form:

P(\gamma = \gamma | h = h) = \frac{1}{1 + e^{-\gamma(h^T w - \theta)}}

for some parameters (w, θ). Determine expressions for these parameters in terms of {µ_{+1,m}, µ_{-1,m}, σ²_m, π_{+1}, π_{-1}}.


REFERENCES

Caruana, R. and A. Niculescu-Mizil (2006), “An empirical comparison of supervised learning algorithms,” Proc. Int. Conf. Machine Learning (ICML), pp. 161–168, Pittsburgh, PA.
Clark, P. and T. Niblett (1989), “The CN2 induction algorithm,” Mach. Learn., vol. 3, no. 4, pp. 261–283.
Domingos, P. and M. Pazzani (1996), “Beyond independence: Conditions for the optimality of the simple Bayesian classifier,” Proc. Int. Conf. Machine Learning (ICML), pp. 1–8, Bari.
Domingos, P. and M. Pazzani (1997), “On the optimality of the simple Bayesian classifier under zero-one loss,” Mach. Learn., vol. 29, pp. 103–130.
Doob, J. L. (1953), Stochastic Processes, Wiley.
Duda, R. O., P. E. Hart, and D. G. Stork (2000), Pattern Classification, 2nd ed., Wiley.
Frank, E., L. Trigg, G. Holmes, and I. H. Witten (2000), “Naïve Bayes for regression,” Mach. Learn., vol. 41, no. 1, pp. 5–15.
Garg, A. and D. Roth (2001), “Understanding probabilistic classifiers,” Proc. European Conf. Machine Learning, pp. 179–191, Freiburg.
Hand, D. J. and Y. Yu (2001), “Idiot’s Bayes: Not so stupid after all?” Int. Statist. Rev., vol. 69, pp. 385–389.
Jaynes, E. T. (2003), Probability Theory: The Logic of Science, Cambridge University Press.
Jeffreys, H. (1948), Theory of Probability, 2nd ed., Clarendon Press.
Johnson, W. E. (1932), “Probability: Deductive and inductive problems,” Mind, vol. 41, pp. 421–423.
Kononenko, I. (1993), “Inductive and Bayesian learning in medical diagnosis,” App. Artif. Intell., vol. 7, pp. 317–337.
Langley, P., W. Iba, and K. Thompson (1992), “An analysis of Bayesian classifiers,” Proc. Nat. Conf. Artificial Intelligence (AAAI), pp. 223–228, San Jose, CA.
Laplace, P. S. (1814), Essai Philosophique sur les Probabilités, Paris, published 1840.
Lidstone, G. J. (1920), “Note on the general case of the Bayes–Laplace formula for inductive or a posteriori probabilities,” Trans. Faculty Actuaries, vol. 8, pp. 182–192.
Maron, M. E. (1961), “Automatic indexing: An experimental inquiry,” J. ACM, vol. 8, no. 3, pp. 404–417.
Ng, A. Y. and M. I. Jordan (2001), “On discriminative vs. generative classifiers: A comparison of logistic regression and naïve Bayes,” Proc. Advances Neural Information Systems (NIPS), pp. 1–8, Vancouver.
Pazzani, M. (1996), “Searching for dependencies in Bayesian classifiers,” in Learning from Data, D. Fisher and H. J. Lenz, editors, Springer.
Russell, S. and P. Norvig (2009), Artificial Intelligence: A Modern Approach, 3rd ed., Prentice Hall.
Zhang, R. (2004), “The optimality of naïve Bayes,” Proc. Int. Florida Artificial Intelligence Research Society Conf. (FLAIRS), pp. 562–567, Miami Beach, FL.

56 Linear Discriminant Analysis

In this chapter, we describe three other data-based generative methods that approximate the solution to the optimal Bayes classifier (52.8) in the absence of knowledge of the conditional probabilities P(r = r|h = h). The methods estimate the prior probabilities P(r = r) for the classes and, in some cases, assume a Gaussian form for the reverse conditional distribution, f_{h|r}(h|r). The training data is used to estimate the priors and the first- and second-order moments of f_{h|r}(h|r). Using the Bayes rule, the desired conditional probabilities, P(r = r|h = h), are obtained and the results lead to linear discriminant structures illustrated further ahead in Fig. 56.1. Examples of discriminative models of this type include linear discriminant analysis (LDA), Fisher discriminant analysis (FDA), the minimum distance classifier (MDC), and the logistic regression method. We discuss logistic regression in the next chapter. These techniques are examples of supervised learning procedures because they use training data to learn the structure of the classifier.

56.1 DISCRIMINANT FUNCTIONS

The first method is LDA, which can be applied directly to multiclass classification problems involving a total of R classes. For this reason, we will consider this setting directly. LDA assumes that the features h are Gaussian-distributed conditioned on the class variable r. That is, the method assumes a Gaussian distribution for the reverse conditional, f_{h|r}(h|r).

Estimating the prior
Let r denote a discrete class variable that assumes values in the range r ∈ {1, 2, . . . , R}, and let π_r represent the prior probability for value r = r, namely,

\pi_r \;\stackrel{\Delta}{=}\; P(r = r), \qquad r = 1, 2, . . . , R        (56.1)

We assume we have a collection of N training data points {r(n), h_n}, for n = 0, 1, . . . , N − 1, where h_n ∈ IR^M denotes the nth feature vector and r(n) denotes its class variable. The prior probabilities {π_r} are generally unknown and need to be estimated from the training data. This can be done by appealing to the same Laplace


smoothing technique (55.14). Assume that within the N data samples there are N_r examples that belong to class r. Then, π_r can be estimated as follows:

\widehat{\pi}_r = \frac{N_r + s}{N + sR}, \qquad r = 1, 2, . . . , R \qquad (\text{Laplace smoothing})        (56.2)

where s > 0 controls the amount of smoothing. The choice s = 1 is common and referred to as Laplace smoothing. Choices of s < 1 are referred to as Lidstone smoothing.

Fitting the Gaussian models
Linear discriminant analysis assumes that, conditioned on the class variable r, the feature data h is Gaussian-distributed with some positive-definite covariance matrix Σ > 0 that is independent of r, namely,

f_{h|r}(h|r) = \frac{1}{(2\pi)^{M/2} (\det(\Sigma))^{1/2}} \exp\Big\{ -\frac{1}{2}(h - m_r)^T \Sigma^{-1} (h - m_r) \Big\}        (56.3)

Observe that the mean value m_r is allowed to depend on r, while the value of Σ is fixed; in Prob. 56.1 we study the case where Σ depends on r. The values of the parameters {m_r, Σ} are estimated from the training data {r(n), h_n} in order to fit a Gaussian distribution of the form (56.3) by using the sample averages:

\widehat{m}_r = \frac{1}{N_r} \sum_{r(n)=r} h_n        (56.4a)

\widehat{\Sigma}_r = \frac{1}{N_r - 1} \sum_{r(n)=r} (h_n - \widehat{m}_r)(h_n - \widehat{m}_r)^T        (56.4b)

\widehat{\Sigma} = \frac{1}{N - R} \sum_{r=1}^{R} (N_r - 1)\,\widehat{\Sigma}_r        (56.4c)

where R is the number of classes (R = 2 in the binary case), and r(n) denotes the class variable for the nth feature vector, h_n. The sum in (56.4a) is over all feature vectors that belong to class r; their number is N_r. Similarly for the individual covariances in (56.4b). Expression (56.4c) for \widehat{\Sigma} is called the pooled covariance matrix, where the individual covariances from each class are weighted by the number of samples in that class. The Gaussian naïve classifier of Section 55.4 corresponds to the situation in which Σ is assumed to be diagonal (but dependent on r, namely, Σ = diag{σ²_{rm}}) so that the individual entries of h are conditionally independent of each other.

Estimating the conditional probabilities
Using an approximation for the Gaussian distribution (56.3), we can now estimate the desired conditional probability P(r = r|h = h). Using the Bayes rule we have

\widehat{P}(r = r|h = h) = \frac{\widehat{f}_{r,h}(r, h)}{\widehat{f}_h(h)} = \frac{\widehat{\pi}_r\,\widehat{f}_{h|r}(h|r)}{\sum_{r' \in R} \widehat{\pi}_{r'}\,\widehat{f}_{h|r}(h|r')}        (56.5)


The denominator is a normalization factor and its value is invariant under r; it will be irrelevant in the calculations. The odds of a feature vector, h, belonging to some class r over another class ℓ is defined as the ratio:

\operatorname{odds}(h; r, \ell) \;\stackrel{\Delta}{=}\; \frac{\widehat{P}(r = r|h = h)}{\widehat{P}(r = \ell|h = h)}        (56.6)

For example, in a scenario where the likelihood of class r occurring is 0.2 while the likelihood for class ℓ is 0.1, then the odds for class r relative to class ℓ are 2 to 1, while the odds for class ℓ relative to class r are 1 to 2. If we use (56.5) and compute the natural logarithm of the odds ratio, we end up with the logit function (or the logistic transformation function):

\operatorname{logit}(h; r, \ell) \;\stackrel{\Delta}{=}\; \ln\Big( \frac{\widehat{P}(r = r|h = h)}{\widehat{P}(r = \ell|h = h)} \Big)
 = \ln\Big( \frac{\widehat{\pi}_r\,\widehat{f}_{h|r}(h|r)}{\widehat{\pi}_\ell\,\widehat{f}_{h|r}(h|\ell)} \Big)
 = \ln\Big( \frac{\widehat{\pi}_r}{\widehat{\pi}_\ell} \Big) - \frac{1}{2}(h - \widehat{m}_r)^T \widehat{\Sigma}^{-1}(h - \widehat{m}_r) + \frac{1}{2}(h - \widehat{m}_\ell)^T \widehat{\Sigma}^{-1}(h - \widehat{m}_\ell)
 = h^T \underbrace{\widehat{\Sigma}^{-1}(\widehat{m}_r - \widehat{m}_\ell)}_{\stackrel{\Delta}{=}\, w_{r\ell}} \;-\; \underbrace{\Big[ \frac{1}{2}(\widehat{m}_r + \widehat{m}_\ell)^T \widehat{\Sigma}^{-1}(\widehat{m}_r - \widehat{m}_\ell) - \ln\Big( \frac{\widehat{\pi}_r}{\widehat{\pi}_\ell} \Big) \Big]}_{\stackrel{\Delta}{=}\, \theta_{r\ell}}
 = h^T w_{r\ell} - \theta_{r\ell}        (56.7)

where we introduced the vector w_{rℓ} ∈ IR^M and the scalar θ_{rℓ}. There are two advantages to using logit representations. First, in the logarithmic scale, the odds for class r over class ℓ will always have the opposite value (i.e., reverse sign) for the odds of ℓ over r. In this way, the logit value can be used to decide whether a feature vector h is more likely to belong to one class or the other by examining its sign; a positive logit value corresponds to one class while a negative logit value corresponds to the other class. Moreover, and more importantly, the last line of (56.7) shows that the logit is affine over h. This property is a consequence of the assumption of a uniform Σ across all labels. Motivated by these considerations and by expression (56.7), we will associate a linear discriminant function, denoted by d_r(h), with each class r as follows:

d_r(h) \;\stackrel{\Delta}{=}\; h^T \widehat{\Sigma}^{-1} \widehat{m}_r - \frac{1}{2} \widehat{m}_r^T \widehat{\Sigma}^{-1} \widehat{m}_r + \ln \widehat{\pi}_r        (56.8)

in which case the logit function (56.7) associated with any two classes r and ℓ reduces to the difference of their discriminant functions:

\operatorname{logit}(h; r, \ell) = d_r(h) - d_\ell(h)        (56.9)


The discriminant function dr (h) can now be used to motivate some important classification algorithms, such as LDA and MDC.

56.2 LINEAR DISCRIMINANT ALGORITHM

The linear discriminant algorithm uses (56.9) to note that the class that should be assigned to a feature vector h is the one that results in the largest discriminant value, i.e.,

r^\star = \underset{1 \le r \le R}{\operatorname{argmax}} \; d_r(h)        (56.10)

This solution has the same discriminant analysis structure shown earlier in Fig. 28.5, with the discriminant functions now given by (56.8). The diagram is repeated in Fig. 56.1 for ease of reference. The resulting algorithm is summarized in listing (56.11). The algorithm consists of two stages. In the first stage, the parameters {\widehat{\pi}_r, \widehat{m}_r, \widehat{\Sigma}} are learned from the training data, and in the second stage these parameters help define the discriminant functions that drive the classification decisions. The predicted label for feature h is denoted by r^\star(h) in the last line of the algorithm.

Figure 56.1 Classifier structure in the form of a collection of discriminant functions, d_r(h), one for each label r. Each feature vector h is fed to the discriminant functions d_1(h), d_2(h), . . . , d_R(h), and the optimal class is obtained by selecting the label with the largest discriminant value, r^\star(h) = argmax_{1≤r≤R} d_r(h).


LDA for multiclass classification (under uniform variance, Σ).

given N training points {r(n), h_n} and R classes;
given Laplace smoothing factor s > 0.
compute for each class r = 1, 2, . . . , R:
    \widehat{\pi}_r = (N_r + s)/(N + sR)
    \widehat{m}_r = \frac{1}{N_r} \sum_{r(n)=r} h_n
    \widehat{\Sigma}_r = \frac{1}{N_r - 1} \sum_{r(n)=r} (h_n - \widehat{m}_r)(h_n - \widehat{m}_r)^T
end
\widehat{\Sigma} = \frac{1}{N - R} \sum_{r=1}^{R} (N_r - 1)\,\widehat{\Sigma}_r
classify new features h using:
    d_r(h) = h^T \widehat{\Sigma}^{-1} \widehat{m}_r - \frac{1}{2} \widehat{m}_r^T \widehat{\Sigma}^{-1} \widehat{m}_r + \ln \widehat{\pi}_r, \quad r = 1, . . . , R
    r^\star(h) = \underset{1 \le r \le R}{\operatorname{argmax}} \; d_r(h)
end
                                                                        (56.11)
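For concreteness, the following is a minimal NumPy sketch of listing (56.11); the function names, the use of a matrix solve instead of an explicit inverse, and the 1-based class labels are implementation choices made here and are not part of the listing.

import numpy as np

def lda_fit(H, r, R, s=1.0):
    # H: (N, M) features; r: (N,) labels in {1,...,R}; s: Laplace smoothing factor.
    N, M = H.shape
    pi_hat = np.zeros(R); m_hat = np.zeros((R, M)); Sigma = np.zeros((M, M))
    for k in range(1, R + 1):
        Hk = H[r == k]; Nk = Hk.shape[0]
        pi_hat[k - 1] = (Nk + s) / (N + s * R)
        m_hat[k - 1] = Hk.mean(axis=0)
        Sigma += (Hk - m_hat[k - 1]).T @ (Hk - m_hat[k - 1])   # accumulates (N_k - 1) * Sigma_k
    Sigma /= (N - R)                                           # pooled covariance, cf. (56.4c)
    return pi_hat, m_hat, Sigma

def lda_classify(h, pi_hat, m_hat, Sigma):
    # Evaluate the linear discriminants d_r(h) from (56.8) and return the best class label.
    W = np.linalg.solve(Sigma, m_hat.T)                        # columns are Sigma^{-1} m_r
    d = h @ W - 0.5 * np.sum(m_hat.T * W, axis=0) + np.log(pi_hat)
    return int(np.argmax(d)) + 1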

Nonuniform covariance matrices
When the covariance matrix Σ in the model (56.3) is replaced by Σ_r and allowed to vary with r, the same argument leading to (56.8) can be repeated to show that the logit function is now given by

\operatorname{logit}(h; r, \ell) = -\frac{1}{2}\ln\det(\widehat{\Sigma}_r) + \frac{1}{2}\ln\det(\widehat{\Sigma}_\ell) + \ln\Big(\frac{\widehat{\pi}_r}{\widehat{\pi}_\ell}\Big) - \frac{1}{2}(h - \widehat{m}_r)^T \widehat{\Sigma}_r^{-1}(h - \widehat{m}_r) + \frac{1}{2}(h - \widehat{m}_\ell)^T \widehat{\Sigma}_\ell^{-1}(h - \widehat{m}_\ell)

which is quadratic (as opposed to linear) in the feature vector h. It follows that the discriminant function (56.8) should be replaced by (see Prob. 56.1):

d_r(h) = -\frac{1}{2}\ln\det(\widehat{\Sigma}_r) - \frac{1}{2}(h - \widehat{m}_r)^T \widehat{\Sigma}_r^{-1}(h - \widehat{m}_r) + \ln \widehat{\pi}_r        (56.12)

In this case, the LDA algorithm (56.11) would be adjusted to (56.13).


LDA for multiclass classification (non-uniform variances, Σ_r).

given N training points {r(n), h_n} and R classes;
given Laplace smoothing factor s > 0.
compute for each class r = 1, 2, . . . , R:
    \widehat{\pi}_r = (N_r + s)/(N + sR)
    \widehat{m}_r = \frac{1}{N_r} \sum_{r(n)=r} h_n
    \widehat{\Sigma}_r = \frac{1}{N_r - 1} \sum_{r(n)=r} (h_n - \widehat{m}_r)(h_n - \widehat{m}_r)^T
end
classify new features h using:
    d_r(h) = -\frac{1}{2}\ln\det(\widehat{\Sigma}_r) - \frac{1}{2}(h - \widehat{m}_r)^T \widehat{\Sigma}_r^{-1}(h - \widehat{m}_r) + \ln \widehat{\pi}_r
    r^\star(h) = \underset{1 \le r \le R}{\operatorname{argmax}} \; d_r(h)
end
                                                                        (56.13)
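A short sketch of the quadratic discriminant in (56.12)–(56.13), reusing per-class estimates of the kind computed in the earlier LDA sketch; the helper name and inputs are hypothetical and shown only to highlight the extra log-determinant and per-class inverse.

import numpy as np

def qda_classify(h, pi_hat, m_hat, Sigmas):
    # Sigmas: list of per-class covariance estimates; evaluates d_r(h) from (56.12).
    scores = []
    for k in range(len(pi_hat)):
        diff = h - m_hat[k]
        _, logdet = np.linalg.slogdet(Sigmas[k])
        quad = diff @ np.linalg.solve(Sigmas[k], diff)
        scores.append(-0.5 * logdet - 0.5 * quad + np.log(pi_hat[k]))
    return int(np.argmax(scores)) + 1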

56.3 MINIMUM DISTANCE CLASSIFIER

A special case of the linear discriminant solution (56.11) is to assume that all classes are equally probable. In this case, there is no need to estimate the prior probabilities {π_r}, and the third equality in expression (56.7) gives

\operatorname{logit}(h; r, \ell) = -\frac{1}{2}(h - \widehat{m}_r)^T \widehat{\Sigma}^{-1}(h - \widehat{m}_r) + \frac{1}{2}(h - \widehat{m}_\ell)^T \widehat{\Sigma}^{-1}(h - \widehat{m}_\ell)        (56.14)

This result suggests that we should now associate the following discriminant function with each class r:

d_r(h) \;\stackrel{\Delta}{=}\; -(h - \widehat{m}_r)^T \widehat{\Sigma}^{-1}(h - \widehat{m}_r)        (56.15)

Note that d_r(h) is equal to the (negative) weighted distance from h to the mean \widehat{m}_r. This measure is the (negative of the) squared Mahalanobis distance of h to the Gaussian distribution \mathcal{N}_{h|r}(\widehat{m}_r, \widehat{\Sigma}). In this way, the decision rule (56.10) ends up assigning the feature vector h to the class r^\star of the closest mean vector:

r^\star(h) = \underset{1 \le r \le R}{\operatorname{argmin}} \; (h - \widehat{m}_r)^T \widehat{\Sigma}^{-1}(h - \widehat{m}_r)        (56.16)

For this reason, the resulting algorithm is known as the minimum distance classifier. The implementation continues to have the same discriminant analysis structure shown in Fig. 56.1, with the discriminant functions now given by (56.15).


Listing (56.11) is still applicable with dr (h) replaced by (56.15), as shown in (56.17).

Minimum distance classifier for multiclass classification.

given N training data points {r(n), h_n} and R classes;
assume equally probable classes, r ∈ {1, 2, . . . , R}.
compute for each class r = 1, 2, . . . , R:
    \widehat{m}_r = \frac{1}{N_r} \sum_{r(n)=r} h_n
    \widehat{\Sigma}_r = \frac{1}{N_r - 1} \sum_{r(n)=r} (h_n - \widehat{m}_r)(h_n - \widehat{m}_r)^T
end
\widehat{\Sigma} = \frac{1}{N - R} \sum_{r=1}^{R} (N_r - 1)\,\widehat{\Sigma}_r
classify new features h using:
    r^\star(h) = \underset{1 \le r \le R}{\operatorname{argmin}} \; (h - \widehat{m}_r)^T \widehat{\Sigma}^{-1}(h - \widehat{m}_r)
end
                                                                        (56.17)
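A minimal sketch of the rule in (56.17), assuming the class means m_hat and pooled covariance Sigma have already been estimated as in the earlier LDA sketch (the helper name is ours):

import numpy as np

def mdc_classify(h, m_hat, Sigma):
    # Pick the class whose mean is closest to h in the Mahalanobis distance induced by Sigma.
    diffs = m_hat - h                                  # (R, M) differences between h and each class mean
    dists = np.sum(diffs * np.linalg.solve(Sigma, diffs.T).T, axis=1)
    return int(np.argmin(dists)) + 1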

Example 56.1 (Application to the iris dataset) We illustrate the operation of the LDA classifier by applying it to the iris dataset encountered earlier in Examples 27.4 and 32.7. The dataset consists of three types of flowers: setosa, versicolor, and virginica. We denote the three classes by r = 1, 2, 3, respectively. There are 50 measurements for each flower type, and each measurement consists of four attributes: petal length, petal width, sepal length, and sepal width (all measured in centimeters). The flowers are shown in Fig. 56.2.

Figure 56.2 Illustration of three types of iris flowers: (left) setosa, (middle) versicolor, and (right) virginica. The figure also indicates a petal and a sepal within the flower in the middle. The source for the individual flower images is Wikimedia commons, where the images are available for use under the Creative Commons Attribution Share-Alike License. The labels and arrows in white have been added by the author.


We select N = 120 measurements randomly, along with their corresponding labels, for training and keep the remaining T = 30 measurements and their labels for testing. Figure 56.3 displays the scatter diagrams for sepal width × sepal length and petal width × petal length for the training and test data points for the three classes of flowers.

Figure 56.3 Scatter diagrams showing sepal width × sepal length and petal width × petal length for the training (top row) and test (bottom row) data points. The iris dataset is available from https://archive.ics.uci.edu/ml/datasets/iris.

We apply the LDA procedure (56.11) with R = 3 to the training data and estimate the following prior probabilities:

\widehat{\pi}_1 = 0.3089, \qquad \widehat{\pi}_2 = 0.3252, \qquad \widehat{\pi}_3 = 0.3659        (56.18)

We also determine the mean vectors

\widehat{m}_1 = \operatorname{col}\{5.0216, 3.4324, 1.4946, 0.2405\}, \quad \widehat{m}_2 = \operatorname{col}\{5.8718, 2.7538, 4.2359, 1.3051\}, \quad \widehat{m}_3 = \operatorname{col}\{6.6341, 3.0091, 5.6159, 2.0477\}        (56.19)




and covariance matrices

\widehat{\Sigma}_1 = \begin{bmatrix} 0.1195 & 0.0940 & 0.0154 & 0.0124 \\ 0.0940 & 0.1278 & 0.0113 & 0.0128 \\ 0.0154 & 0.0113 & 0.0255 & 0.0061 \\ 0.0124 & 0.0128 & 0.0061 & 0.0130 \end{bmatrix}        (56.20a)

\widehat{\Sigma}_2 = \begin{bmatrix} 0.2531 & 0.0805 & 0.1758 & 0.0517 \\ 0.0805 & 0.0989 & 0.0928 & 0.0468 \\ 0.1758 & 0.0928 & 0.2150 & 0.0735 \\ 0.0517 & 0.0468 & 0.0735 & 0.0410 \end{bmatrix}        (56.20b)

\widehat{\Sigma}_3 = \begin{bmatrix} 0.4153 & 0.0895 & 0.3080 & 0.0411 \\ 0.0895 & 0.0953 & 0.0668 & 0.0393 \\ 0.3080 & 0.0668 & 0.2967 & 0.0439 \\ 0.0411 & 0.0393 & 0.0439 & 0.0737 \end{bmatrix}        (56.20c)

\widehat{\Sigma} = \begin{bmatrix} 0.2716 & 0.0879 & 0.1750 & 0.0357 \\ 0.0879 & 0.1064 & 0.0582 & 0.0336 \\ 0.1750 & 0.0582 & 0.1867 & 0.0419 \\ 0.0357 & 0.0336 & 0.0419 & 0.0444 \end{bmatrix}        (56.20d)

Using these quantities, we estimate the class variables for all test data points. The results are shown in Fig. 56.4. We obtained one misclassification over T = 30 test points, resulting in an error rate of 3.33%. The misclassified data point from the versicolor class is indicated by an arrow in the lower plots of the figure; it is classified erroneously as belonging to the virginica class. We further apply the MDC procedure (56.17) to the same iris dataset and estimate the class variables for the same test points. The results are shown in Fig. 56.5. We again obtain the same misclassification resulting in an error rate of 3.33%.

56.4 FISHER DISCRIMINANT ANALYSIS

We describe next another popular linear discriminant structure. We motivate it by examining first the simplifications that would occur when we specialize the LDA solution (56.11) to the case of two equally probable classes, i.e., R = 2 with π_1 = π_2 = 1/2. Since we are now considering a binary classification problem, we revert to our standard notation for the class variable and write γ instead of r. Thus, the two classes are γ = +1 and γ = −1. According to the LDA construction (56.7), given a feature vector h, it will be assigned to class +1 or −1 based on the following discrimination rule:

h ∈ class +1, if h^T w^\star - \theta^\star \ge 0; \qquad h ∈ class −1, if h^T w^\star - \theta^\star < 0        (56.21)

where the parameters {w^\star, θ^\star} are defined as follows:

w^\star = \widehat{\Sigma}^{-1}(\widehat{m}_{+1} - \widehat{m}_{-1})        (56.22a)
\theta^\star = \frac{1}{2}(\widehat{m}_{+1} + \widehat{m}_{-1})^T \widehat{\Sigma}^{-1}(\widehat{m}_{+1} - \widehat{m}_{-1}) = \frac{1}{2}(\widehat{m}_{+1} + \widehat{m}_{-1})^T w^\star        (56.22b)


Figure 56.4 (Top row) Scatter diagrams showing sepal width × sepal length and petal width × petal length for the test data points for the three classes of flowers. (Bottom row) Classification results for these same test points using the LDA classifier. It is observed that a single point is misclassified.

In these expressions, the notation {\widehat{m}_{\pm 1}} refers to the sample means for classes +1 and −1. These means and the pooled covariance matrix \widehat{\Sigma} are computed as follows:

\widehat{m}_{+1} = \frac{1}{N_{+1}} \sum_{\gamma(n)=+1} h_n        (56.23a)

\widehat{m}_{-1} = \frac{1}{N_{-1}} \sum_{\gamma(n)=-1} h_n        (56.23b)

\widehat{\Sigma}_{+1} = \frac{1}{N_{+1} - 1} \sum_{\gamma(n)=+1} (h_n - \widehat{m}_{+1})(h_n - \widehat{m}_{+1})^T        (56.23c)

\widehat{\Sigma}_{-1} = \frac{1}{N_{-1} - 1} \sum_{\gamma(n)=-1} (h_n - \widehat{m}_{-1})(h_n - \widehat{m}_{-1})^T        (56.23d)

\widehat{\Sigma} = \frac{1}{N - 2} \Big\{ (N_{+1} - 1)\widehat{\Sigma}_{+1} + (N_{-1} - 1)\widehat{\Sigma}_{-1} \Big\}        (56.23e)


Figure 56.5 (Top row) Scatter diagrams showing sepal width × sepal length and petal width × petal length for the test data points for the three classes of flowers. (Bottom row) Classification results for these same test points using the MDC classifier. It is observed that a single point is misclassified.

where {N_{\pm 1}} denote the number of training samples {h_n} in each of the classes {±1}. Note that \widehat{\Sigma} combines the individual estimates {\widehat{\Sigma}_{+1}, \widehat{\Sigma}_{-1}} in proportion to the number of samples in their respective classes. We also refer to \widehat{\Sigma} as the variance within.

56.4.1 Separating Hyperplanes

Observe that, in effect, the LDA construction (56.21) is employing the following classifier:

c^\star(h) = \operatorname{sign}(h^T w^\star - \theta^\star)        (56.24)

That is, it determines the label of h by using the sign of the affine function h^T w^\star - \theta^\star. This classification form will arise frequently in our future treatment of other classification schemes. What will differ among these schemes is the manner by which they compute the parameters (w^\star, θ^\star). For LDA, these parameters are computed using (56.22a)–(56.22b). The FDA method of this section will


determine them in a different manner by maximizing a certain Fisher ratio. Other classifiers will use other criteria. In the meantime, it is useful to recognize that equations of the form

h^T w^\star - \theta^\star = 0        (56.25)

describe hyperplanes in M-dimensional space; they consist of all points h ∈ IR^M that satisfy the equality. When θ^\star = 0, the hyperplane passes through the origin since h = 0 will lie on it. We refer to θ^\star as the offset parameter, while w^\star represents the normal direction to the hyperplane – see Fig. 56.6. This is because for any two feature vectors h_a and h_b lying in the hyperplane, it holds that

(h_a - h_b)^T w^\star = 0        (56.26)

Figure 56.6 Illustration of the hyperplane h^T w^\star − θ^\star = 0 in IR^3 where h and w are two-dimensional. The vertical axis represents the value h^T w^\star − θ^\star at any location h. Points h for which h^T w^\star − θ^\star > 0 are said to lie on one side of the hyperplane, while points h for which h^T w^\star − θ^\star < 0 are said to lie on the other side of the hyperplane. The vector w^\star represents the normal direction to the hyperplane.

56.4.2 Fisher Construction

The LDA solution (56.21) assumes the conditional pdf f_{h|γ}(h|γ) is Gaussian-distributed according to (56.3) and that the covariance matrix Σ is the same for both classes. The Fisher construction, which we describe next, relaxes both of these conditions: the Gaussian assumption is dropped and the covariance matrix can be different for both classes. Assume that the feature data h ∈ IR^M from class γ = +1 arises from some generic (not necessarily Gaussian) distribution with mean m_{+1} and covariance


matrix Σ_{+1} > 0, while the feature data from class γ = −1 arises from some other distribution with mean m_{-1} and covariance matrix Σ_{-1} > 0. The (pooled) covariance matrix of the feature distribution is then given by

\Sigma = \pi_{+1} \Sigma_{+1} + \pi_{-1} \Sigma_{-1}        (56.27)

in terms of the prior probabilities for each class:

\pi_{+1} = P(\gamma = +1), \qquad \pi_{-1} = P(\gamma = -1)        (56.28)

The FDA solution is based on the following argument. Let h denote an arbitrary feature vector, whose first- and second-order moments are either {m_{+1}, Σ_{+1}} or {m_{-1}, Σ_{-1}} depending on the class γ for h. For any parameter vector w ∈ IR^M, we introduce the scalar random variable

z \;\stackrel{\Delta}{=}\; h^T w        (56.29a)

which results from computing the inner product of h with w. The variable z either has mean m_{+1}^T w and variance w^T Σ_{+1} w (when h belongs to class γ = +1) or mean m_{-1}^T w and variance w^T Σ_{-1} w (when h belongs to class γ = −1):

E\,z = m_{+1}^T w, \quad \sigma_z^2 = w^T \Sigma_{+1} w \quad (\text{when } h \text{ belongs to class } +1)
E\,z = m_{-1}^T w, \quad \sigma_z^2 = w^T \Sigma_{-1} w \quad (\text{when } h \text{ belongs to class } -1)        (56.29b)

The overall variance of z is therefore given by

\sigma_z^2 = \pi_{+1} w^T \Sigma_{+1} w + \pi_{-1} w^T \Sigma_{-1} w \overset{(56.27)}{=} w^T \Sigma w        (56.30)

The Fisher criterion determines the optimal parameter w^o that maximizes the so-called Fisher ratio:

w^o \;\stackrel{\Delta}{=}\; \underset{w \in IR^M}{\operatorname{argmax}} \Big\{ \frac{(m_{+1}^T w - m_{-1}^T w)^2}{w^T \Sigma w} = \frac{w^T A w}{w^T \Sigma w} \Big\}        (56.31a)

where the matrix A has rank one and is given by

A \;\stackrel{\Delta}{=}\; (m_{+1} - m_{-1})(m_{+1} - m_{-1})^T        (56.31b)

Observe that the numerator w^T A w is the square of the distance between the two possible means for z, while the denominator, w^T Σ w, is the variance of z. By maximizing the Fisher ratio over w, we are in effect determining a weight vector w^o such that the distribution of the transformed variable, z = h^T w^o, will have the following two useful properties:

(a) First, variables z generated by feature vectors from class γ = +1 will cluster around m_{+1}^T w^o, while variables z generated by feature vectors from class γ = −1 will cluster around m_{-1}^T w^o.

(b) Second, the two mean values m_{+1}^T w^o and m_{-1}^T w^o will generally be well separated from each other with a reduced value for the variance (w^o)^T Σ w^o; this reduced variance helps separate and concentrate the transformed variables z around their respective means.


In this way, under class γ = +1, the variable z will be well concentrated around its mean m_{+1}^T w^o, while under class γ = −1 it will be well concentrated around the other mean m_{-1}^T w^o. Moreover, both means will generally (but not always) be well separated from each other.

Figure 56.7 Data corresponding to labels γ = +1 and γ = −1 are projected onto two arbitrary directions, denoted by the vectors w_a and w_b. Observe how projections onto w_a are well separated for both classes; their means are given by m_{-1}^T w_a and m_{+1}^T w_a and their variance spreads are given by w_a^T Σ_{-1} w_a and w_a^T Σ_{+1} w_a. Observe also how the projections onto w_b from both classes get mixed together.

These properties are illustrated in Fig. 56.7 for two generic choices of the parameter w denoted by w_a and w_b. The figure shows feature vectors from two distributions corresponding to γ ∈ {±1} projected onto the directions of w_a and w_b along the axes directions – recall that, for any h, the inner product h^T w is related to the size of the projection of h onto w, as illustrated in Fig. 56.8. This is because the projection is given by (recall definition (50.34)):

\widehat{h} = \Big( \frac{h^T w}{\|w\|^2} \Big) w = \Big( h^T \frac{w}{\|w\|} \Big) \frac{w}{\|w\|}        (56.32)

where w/‖w‖ is a unit-length vector along the direction of w. Therefore, the inner product of h with this unit-length vector gives the size of the projection along that direction. Returning to Fig. 56.7, we note that the projections of the feature vectors are well separated in one case but not the other. The Fisher construction described next seeks a good choice for the direction w^o as follows.


Figure 56.8 The projection of a generic vector h onto a generic direction w is related to the inner product h^T w.

We assume Σ is invertible. Since the matrix A has rank one, one solution vector that maximizes (56.31a) is given by (see Prob. 56.7):

\max_{w} \frac{w^T A w}{w^T \Sigma w} \;\Longrightarrow\; w^o = \Sigma^{-1}(m_{+1} - m_{-1})        (56.33a)

The average value of the resulting means for z is denoted by

\theta^o \;\stackrel{\Delta}{=}\; \frac{1}{2}(m_{+1} + m_{-1})^T w^o        (56.33b)

In general, when the means are well separated, this value can be used as a threshold against which to compare h^T w^o in order to assign h to one class or the other:

h ∈ class +1, if h^T w^o \ge \theta^o; \qquad h ∈ class −1, if h^T w^o < \theta^o        (56.33c)

In an actual implementation, the quantities {Σ, m+1 , m−1 } that are needed in (56.33a)–(56.33b) are estimated using the same expressions (56.23a)–(56.23e). The resulting parameters for the hyperplane, which are based on these estimates, are denoted by (w? , θ? ) rather than (wo , θo ) in line with our convention to use the star notation for parameters evaluated directly from the training data. The resulting algorithm is listed in (56.34). Observe that the expressions for (w? , θ? ) in the listing below agree with the earlier expressions (56.22a)–(56.22b) for LDA in the case of two equally probable classes. It is perhaps for this reason that FDA


is sometimes described as belonging to the class of discriminative methods, even though FDA does not actually impose any assumption (such as Gaussianity) on the conditional distribution of the data, fh|r (h|r).

FDA for binary classification.

given N training data points {γ(n), h_n};
compute for the two classes:
    \widehat{m}_{+1} = \frac{1}{N_{+1}} \sum_{\gamma(n)=+1} h_n
    \widehat{m}_{-1} = \frac{1}{N_{-1}} \sum_{\gamma(n)=-1} h_n
    \widehat{\Sigma}_{+1} = \frac{1}{N_{+1} - 1} \sum_{\gamma(n)=+1} (h_n - \widehat{m}_{+1})(h_n - \widehat{m}_{+1})^T
    \widehat{\Sigma}_{-1} = \frac{1}{N_{-1} - 1} \sum_{\gamma(n)=-1} (h_n - \widehat{m}_{-1})(h_n - \widehat{m}_{-1})^T
    \widehat{\Sigma} = \frac{1}{N - 2}\Big[ (N_{+1} - 1)\widehat{\Sigma}_{+1} + (N_{-1} - 1)\widehat{\Sigma}_{-1} \Big]
    w^\star = \widehat{\Sigma}^{-1}(\widehat{m}_{+1} - \widehat{m}_{-1})
    \theta^\star = \frac{1}{2}(\widehat{m}_{+1} + \widehat{m}_{-1})^T w^\star, or use (56.36) or (56.40) below
end
classify new features h using:
    h ∈ class +1, if h^T w^\star \ge \theta^\star
    h ∈ class −1, if h^T w^\star < \theta^\star
end
                                                                        (56.34)
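The following is a minimal NumPy sketch of listing (56.34) for the binary case, using the simple midpoint threshold; the function names and return convention are ours and not part of the listing.

import numpy as np

def fda_fit(H, y):
    # H: (N, M) features; y: (N,) labels in {+1, -1}. Returns the Fisher direction and threshold.
    Hp, Hm = H[y == +1], H[y == -1]
    mp, mm = Hp.mean(axis=0), Hm.mean(axis=0)
    Sp = np.cov(Hp, rowvar=False)                      # unbiased per-class covariances
    Sm = np.cov(Hm, rowvar=False)
    N = H.shape[0]
    Sigma = ((Hp.shape[0] - 1) * Sp + (Hm.shape[0] - 1) * Sm) / (N - 2)   # pooled estimate
    w = np.linalg.solve(Sigma, mp - mm)                # Fisher direction, cf. (56.34)
    theta = 0.5 * (mp + mm) @ w                        # midpoint threshold, cf. (56.33b)
    return w, theta

def fda_classify(h, w, theta):
    return +1 if h @ w >= theta else -1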

Example 56.2 (Two other estimates for the offset parameter) It has been observed in practice that the choice

\theta^\star = \frac{1}{2}(\widehat{m}_{+1} + \widehat{m}_{-1})^T w^\star        (56.35)

works well when the distributions of the feature data h under classes γ ∈ {±1} are Gaussian with equal covariance matrices, i.e., Σ_{+1} = Σ_{-1}. A second choice for θ^\star is motivated in Prob. 56.10 and takes the form

\theta^\star = \Big( \frac{N_{+1}}{N}\widehat{m}_{+1} + \frac{N_{-1}}{N}\widehat{m}_{-1} \Big)^T w^\star        (56.36)

by following a least-squares argument, where the estimated means are weighted based on the number of samples from each class. A third choice for θ^\star that performs well more generally is determined as follows. We compute the inner products {z(n) = h_n^T w^\star} for all n = 0, 1, . . . , N − 1, and order them in increasing order. We denote the resulting sequence by

\big\{ z'(0), z'(1), z'(2), \ldots, z'(N-1) \big\}        (56.37)

with z'(0) denoting the smallest value and z'(N − 1) denoting the largest value. For any n, we let θ^\star(n) denote the average of two successive values z'(n) and z'(n + 1):

\theta^\star(n) \;\stackrel{\Delta}{=}\; \frac{1}{2}\big( z'(n) + z'(n+1) \big)        (56.38)

The number θ^\star(n) is in the middle of the interval [z'(n), z'(n + 1)]. We end up with several such values {θ^\star(n)}, one for each interval. For each θ^\star(n), we evaluate its empirical error rate on the training data, i.e., we compute:

R_{\text{emp}}(n) = \frac{1}{N} \sum_{m=0}^{N-1} \mathbb{I}\Big[ \gamma(m)\big( h_m^T w^\star - \theta^\star(n) \big) < 0 \Big]        (56.39)

Note that the sum is counting how many times the signs of the label γ(m) and its affine predictor \widehat{\gamma}(m) = h_m^T w^\star - \theta^\star(n) differ from each other. We subsequently select as θ^\star that value among all {θ^\star(n)} that results in the smallest error rate:

\theta^\star \;\stackrel{\Delta}{=}\; \underset{0 \le n \le N-1}{\operatorname{argmin}} \; R_{\text{emp}}(n)        (56.40)

This choice for θ^\star is obviously more demanding to determine than (56.33b), but it generally leads to better results.

Example 56.3 (Numerical example in IR^2) In Fig. 56.9 we show a collection of 120 feature points h_n ∈ IR^2 whose classes γ(n) ∈ {±1} are known beforehand. We use the data to compute the parameters {w^\star, θ^\star} for the separating line h^T w^\star − θ^\star = 0 using the FDA procedure (56.34). For illustration purposes, in the figure we show the feature vectors h_n from classes ±1 projected onto the Fisher direction w^\star. As explained before, these projections are given by

\widehat{h}_n = \Big( \frac{h_n^T w^\star}{\|w^\star\|^2} \Big) w^\star        (56.41)

We also show the separating line (which is normal to the Fisher direction w? ).

56.4.3 Dimensionality Reduction

One useful interpretation of the FDA construction is that it reduces the dimension of the feature space from M down to one. Specifically, it replaces the M-dimensional feature vectors {h_n} by scalars {z(n) = h_n^T w^\star}, which are subsequently compared against the threshold θ^\star to arrive at the classification decisions. Dimensionality reduction is an important tool in inference and learning because it helps reduce the complexity of higher-dimensional problems. We will introduce in the next chapter one popular technique for dimensionality reduction, known as principal component analysis (PCA); it is an unsupervised procedure that performs dimensionality reduction without knowledge of the label information {γ(n)}. In comparison, FDA requires knowledge of the {γ(n)} and, for this reason, it is said to be a supervised method.

FDA can be extended to reduce the feature dimension from M down to some value M' < M, where M' is larger than 1. The extension can be motivated

Figure 56.9 The figure shows the projected features onto the Fisher direction w^\star obtained according to the FDA construction (56.34). It is seen that the projected values separate well into two clusters.

by considering multiclass classification problems. Thus, assume that the feature vectors h ∈ IR^M can now belong to one of R classes, denoted by r ∈ {1, 2, . . . , R}. The mean and covariance matrices of the unknown distribution for each class r are denoted by m_r and Σ_r, respectively. The probability of each class r occurring is denoted by π_r. It follows that the (pooled) mean and covariance matrix of the feature distribution are given by

m_W \;\stackrel{\Delta}{=}\; \sum_{r=1}^{R} \pi_r m_r, \qquad \Sigma_W \;\stackrel{\Delta}{=}\; \sum_{r=1}^{R} \pi_r \Sigma_r        (56.42a)

The matrix Σ_W reflects the amount of variability within classes and is assumed to be invertible. We also measure the variability between classes by introducing

\Sigma_B = \sum_{r=1}^{R} \pi_r (m_r - m_W)(m_r - m_W)^T        (56.42b)

Note that ΣB consists of the sum of terms that measure how far each class mean is from the global mean, which provides a measure of variability between the classes. Although ΣB is the sum of R rank-one terms, it can be verified that


its rank is actually at most R − 1. This is because the variables {m_r, m_W} are related through (56.42a) – see Prob. 56.8.

We would like to project h onto a subspace of dimension M', represented by a matrix W ∈ IR^{M×M'}. Let z = W^T h denote the reduction of h to the subspace of dimension M'. The random vector z has mean W^T m_r and covariance matrix W^T Σ_r W when h belongs to class r. The overall mean and covariance matrix of z are then given by E z = W^T m_W and R_z = W^T Σ_W W. The Fisher criterion for determining W is modified to maximize the following ratio of determinants:

W^o \;\stackrel{\Delta}{=}\; \underset{W \in IR^{M \times M'}}{\operatorname{argmax}} \Big\{ \frac{\det(W^T \Sigma_B W)}{\det(W^T \Sigma_W W)} \Big\}        (56.43)

The derivation in Prob. 56.11 shows that the columns of W^o can be chosen as the eigenvectors of the M × M matrix Σ_W^{-1} Σ_B that correspond to its largest R − 1 eigenvalues. We arrive at listing (56.44), where we are again denoting the optimal matrix by W^\star to reflect that it is computed from the training data and not from the actual covariance matrices.

FDA for dimensionality reduction.

given N training data points {r(n), h_n} with R ≥ 2 classes;
given a reduced dimension M' ≤ R − 1.
compute for each class r:
    N_r = number of training features h_n in class r
    \widehat{m}_r = \frac{1}{N_r} \sum_{r(n)=r} h_n
    \widehat{\Sigma}_r = \frac{1}{N_r - 1} \sum_{r(n)=r} (h_n - \widehat{m}_r)(h_n - \widehat{m}_r)^T
end
\widehat{m}_W = \frac{1}{N} \sum_{r=1}^{R} N_r\,\widehat{m}_r
\widehat{\Sigma}_W = \frac{1}{N - R} \sum_{r=1}^{R} (N_r - 1)\,\widehat{\Sigma}_r
\widehat{\Sigma}_B = \frac{1}{N - R} \sum_{r=1}^{R} (N_r - 1)(\widehat{m}_r - \widehat{m}_W)(\widehat{m}_r - \widehat{m}_W)^T
W^\star = eigenvectors corresponding to the largest M' eigenvalues of the M × M matrix \widehat{\Sigma}_W^{-1} \widehat{\Sigma}_B
compute the reduced features of size M' × 1:
    z_n = (W^\star)^T h_n, \quad n = 0, 1, . . . , N − 1
end
                                                                        (56.44)
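A rough NumPy sketch of listing (56.44); the use of scipy.linalg.eigh for the generalized eigenvalue problem Σ_B v = λ Σ_W v (equivalent to taking eigenvectors of Σ_W^{-1} Σ_B) and the function name are implementation choices of ours.

import numpy as np
from scipy.linalg import eigh

def fda_reduce(H, r, R, Mp):
    # H: (N, M) features; r: (N,) labels in {1,...,R}; Mp: reduced dimension (at most R-1).
    N, M = H.shape
    means = np.array([H[r == k].mean(axis=0) for k in range(1, R + 1)])
    Nr = np.array([(r == k).sum() for k in range(1, R + 1)])
    mW = (Nr[:, None] * means).sum(axis=0) / N
    SW = np.zeros((M, M)); SB = np.zeros((M, M))
    for k in range(R):
        Dk = H[r == k + 1] - means[k]
        SW += Dk.T @ Dk                                        # (N_k - 1) * Sigma_k
        SB += (Nr[k] - 1) * np.outer(means[k] - mW, means[k] - mW)
    SW /= (N - R); SB /= (N - R)
    # Generalized symmetric eigenproblem; eigenvalues are returned in ascending order.
    eigvals, eigvecs = eigh(SB, SW)
    W = eigvecs[:, ::-1][:, :Mp]                               # keep the top-Mp directions
    return H @ W, W                                            # reduced features z_n and projection matrix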


One of the limitations of this approach is that it generates a matrix W^\star with at most R − 1 columns (since the rank of Σ_B does not exceed R − 1). This means that we can only reduce the dimension of the feature space down to M' = R − 1. For example, if M = 100 (feature space) and R = 2 (binary classification), then the FDA procedure will only allow us to reduce the feature dimension to one (but not to two or three, for instance).

Example 56.4 (Application to iris dataset) We apply the FDA procedure (56.44) to the same iris dataset from Example 56.1. We use the procedure to reduce the dimension of the feature vectors from M = 4 down to M' = 2. We obtain the following values for the mean vector parameters:

\widehat{m}_1 = \operatorname{col}\{5.0060, 3.4180, 1.4640, 0.2440\}, \quad \widehat{m}_2 = \operatorname{col}\{5.9360, 2.7700, 4.2600, 1.3260\}, \quad \widehat{m}_3 = \operatorname{col}\{6.5880, 2.9740, 5.5520, 2.0260\}, \quad \widehat{m}_W = \operatorname{col}\{5.8433, 3.0540, 3.7587, 1.1987\}        (56.45)

and the covariance matrix parameters:

\widehat{\Sigma}_1 = \begin{bmatrix} 0.1242 & 0.1003 & 0.0161 & 0.0105 \\ 0.1003 & 0.1452 & 0.0117 & 0.0114 \\ 0.0161 & 0.0117 & 0.0301 & 0.0057 \\ 0.0105 & 0.0114 & 0.0057 & 0.0115 \end{bmatrix}        (56.46a)

\widehat{\Sigma}_2 = \begin{bmatrix} 0.2664 & 0.0852 & 0.1829 & 0.0558 \\ 0.0852 & 0.0985 & 0.0827 & 0.0412 \\ 0.1829 & 0.0827 & 0.2208 & 0.0731 \\ 0.0558 & 0.0412 & 0.0731 & 0.0391 \end{bmatrix}        (56.46b)

\widehat{\Sigma}_3 = \begin{bmatrix} 0.4043 & 0.0938 & 0.3033 & 0.0491 \\ 0.0938 & 0.1040 & 0.0714 & 0.0476 \\ 0.3033 & 0.0714 & 0.3046 & 0.0488 \\ 0.0491 & 0.0476 & 0.0488 & 0.0754 \end{bmatrix}        (56.46c)

\widehat{\Sigma}_W = \begin{bmatrix} 0.2650 & 0.0931 & 0.1674 & 0.0385 \\ 0.0931 & 0.1159 & 0.0552 & 0.0334 \\ 0.1674 & 0.0552 & 0.1852 & 0.0425 \\ 0.0385 & 0.0334 & 0.0425 & 0.0420 \end{bmatrix}        (56.46d)

\widehat{\Sigma}_B = \begin{bmatrix} 0.4214 & -0.1302 & 1.1011 & 0.4758 \\ -0.1302 & 0.0732 & -0.3737 & -0.1499 \\ 1.1011 & -0.3737 & 2.9110 & 1.2461 \\ 0.4758 & -0.1499 & 1.2461 & 0.5374 \end{bmatrix}        (56.46e)

These parameters lead to the 4 × 2 weight matrix

W^\star = \begin{bmatrix} -0.2049 & 0.0090 \\ -0.3871 & 0.5890 \\ 0.5465 & -0.2543 \\ 0.7138 & 0.7670 \end{bmatrix}        (56.47)

We apply this matrix to the four-dimensional feature vectors {h_n} and transform them into the two-dimensional features {z_n}. The scatter diagram for these reduced features is shown in the top plot of Fig. 56.10. In the bottom plot of the same figure we apply the FDA procedure (56.34) to the reduced data to discriminate between classes r = 1 (setosa) and r ∈ {2, 3} (versicolor or virginica). We denote these two classes by a and b. For this classification problem, we determine the following mean vector values from the reduced features {z_n}:

Figure 56.10 (Top plot) Scatter diagram for the two-dimensional features {z_n} obtained after applying the FDA reduction procedure (56.44). (Bottom plot) The reduced data {z_n} is divided into two classes (setosa vs. versicolor+virginica) and the FDA procedure (56.34) is used to determine the parameters for the separating hyperplane.

\widehat{m}_a = \operatorname{col}\{-1.3748, 1.8730\}, \qquad \widehat{m}_b = \operatorname{col}\{1.4823, 1.7859\}        (56.48)

as well as the following covariance matrix estimates:

\widehat{\Sigma}_a = \begin{bmatrix} 0.0442 & -0.0367 \\ -0.0367 & 0.0648 \end{bmatrix}        (56.49a)

\widehat{\Sigma}_b = \begin{bmatrix} 0.3201 & 0.1020 \\ 0.1020 & 0.1071 \end{bmatrix}        (56.49b)

\widehat{\Sigma} = \begin{bmatrix} 0.2287 & 0.0561 \\ 0.0561 & 0.0931 \end{bmatrix}        (56.49c)

Subsequently, we determine the parameters for the separating line and find that

w^\star = \operatorname{col}\{-14.9261, 9.9294\}, \qquad \theta^\star = -17.3632        (56.50)

The lower plot in the figure shows the separating line

h^T w^\star - \theta^\star = 0        (56.51)


and illustrates how the FDA construction (56.34) is able to separate the data into the two classes without errors in this case.

56.5 COMMENTARIES AND DISCUSSION

Linear discriminant analysis. Comparing expressions (56.23a)–(56.23e) with the top expressions in listing (56.34), it becomes clear that the LDA technique is related to the FDA method. The latter was developed by the English statistician Ronald Fisher (1890–1962) for discriminating between binary classes in the work by Fisher (1936), and it is recognized by many as the first ever algorithm introduced for pattern classification. We explained in Section 56.4 that the main difference between LDA and FDA is that the former assumes Gaussian distributions for the feature data while FDA does not assume any specific distribution and relies on maximizing the Fisher ratio. Given the commonalities that exist in their structures, it is not uncommon in the literature to employ the terms LDA and FDA interchangeably. In our presentation, we have opted to make the distinction explicit for clarity. For further discussion on linear discriminant analysis methods and their application, the reader may refer to Huberty (1975), Fukunaga (1990), Duda, Hart, and Stork (2000), Anderson (2003), McLachlan (2004), Bishop (2007), Hardle and Simar (2007), and Hastie, Tibshirani, and Friedman (1989). Some variations of linear discriminant analysis, including the use of regularization, adjustments for high-dimensional data, and applications in finance and medical diagnostics, are discussed in Altman (1968), Hoerl and Kennard (1970), Friedman (1989), Hastie, Buja, and Tibshirani (1995), Chatterjee and Roychowdhury (1997), Dudoit, Fridlyand, and Speed (2002), Bickel and Levina (2004), Demir and Ozmehmet (2005), Guo, Hastie, and Tibshirani (2007), Witten and Tibshirani (2011), and Clemmensen, Hastie, and Ersboll (2011). In Prob. 63.22 we discuss one implementation of the FDA procedure using kernels based on the approaches proposed independently by Mika et al. (1999a) and Baudat and Anouar (2000); the latter reference uses the terminology of generalized discriminant analysis – see also the work by Park and Park (2005).

FDA and dimensionality reduction. The FDA construction is not limited to binary classification problems and can be extended to multiclass problems as well, as was explained in the body of the chapter. In that case, a multicolumn matrix W^\star, rather than a single column w^\star, needs to be determined. This generalization is due to Rao (1948), a Ph.D. student of Fisher (the developer of FDA). This form of FDA is useful for performing dimensionality reduction. It nevertheless requires knowledge of the label information and, as such, is a supervised dimensionality reduction technique.

Iris dataset. This dataset was originally used by Fisher (1936) and is available at the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/iris – see Dua and Graff (2019). Figure 56.2 displays three types of iris flowers: virginica (photo by Frank Mayfield), setosa (photo by Radomil Binek), and versicolor (photo by Danielle Langlois). The source of the images is Wikimedia commons, where they are available for use under the Creative Commons Attribution Share-Alike License. The relevant links are:
(a) https://commons.wikimedia.org/wiki/File:Iris_virginica.jpg
(b) https://commons.wikimedia.org/wiki/File:Kosaciec_szczecinkowaty_Iris_setosa.jpg
(c) https://commons.wikimedia.org/wiki/File:Iris_versicolor_3.jpg


PROBLEMS

56.1 Refer to the discussion on linear discriminant analysis in Section 56.1, where it was assumed that Σ is uniform across all classes. Assume instead that Σ is class-dependent and denote it by Σ_r in (56.3). Repeat the derivation in that section to conclude that the form of the discriminant function that is associated with each class becomes quadratic in h and is given by:

d_r(h) = -\frac{1}{2} \ln \det(\Sigma_r) - \frac{1}{2} (h - m_r)^T \Sigma_r^{-1} (h - m_r) + \ln \pi_r

56.2 True or false:
(a) The LDA classifier (56.11) is a Bayes classifier.
(b) The minimum distance classifier (56.17) is a Bayes classifier.
(c) The minimum distance classifier (56.17) is an affine classifier.
(d) The Fisher classifier (56.34) is a Bayes classifier.
(e) The Fisher classifier (56.34) has a discriminant function structure.

56.3 Refer to the Gaussian model (56.3) assumed by the LDA construction. Assume the jth entry of the feature vector h is missing at random, denoted by h_j. Let h_{-j} denote the remaining entries of h; it is a vector of size M − 1. What is the conditional distribution of h_{-j} given r? Show that the LDA solution will continue to hold by working with the features {h_{n,-j}} for n = 0, 1, . . . , N − 1.

56.4 Consider a Fisher discriminant classifier applied to a binary classification problem. Explain why this analysis fails if the discriminatory information that is present in the training data {γ(n), h_n} is reflected in the variance of the data and not in their mean.

56.5 Explain how you would employ the Fisher classifier (56.34) to perform multiclass classification.

56.6 Consider a vector h_a ∈ IR^M. We wish to determine the projection of h_a onto the hyperplane h^T w − θ = 0 by solving

\hat{h}_a = \arg\min_{h \in \mathbb{R}^M} \|h_a - h\|^2, \quad \text{subject to } h^T w = \theta

Verify that the projection is unique and given by

\hat{h}_a = h_a - (h_a^T w - \theta)\,\frac{w}{\|w\|^2}

56.7 Use the result of Prob. 1.14 to establish (56.33a).

56.8 Show that the rank of the matrix Σ_B defined by (56.42b) is at most R − 1.

56.9 Set R = 2. Show that the maximization problem (56.43) reduces to (56.31a).

56.10 This problem is motivated by a discussion from Duda and Hart (1973) and Bishop (2007). Its purpose is to formulate a least-squares problem that leads to the same expression for w^\star in the Fisher classifier listing (56.34). Consider N training data points {γ(n), h_n} and construct a sequence b(n) as follows:

b(n) = \begin{cases} N/N_{+1}, & \text{if } \gamma(n) = +1 \\ -N/N_{-1}, & \text{if } \gamma(n) = -1 \end{cases}

Consider the following least-squares problem that determines an affine model of the form h^T w − θ in order to match the sequence {b(n)}, i.e.,

(w_{LS}^\star, \theta_{LS}^\star) = \arg\min_{w \in \mathbb{R}^M,\, \theta} \left\{ \frac{1}{N} \sum_{n=0}^{N-1} \big( b(n) - (h_n^T w - \theta) \big)^2 \right\}

(a) Verify that \theta_{LS}^\star = \left( \frac{N_{+1}}{N} \hat{m}_{+1} + \frac{N_{-1}}{N} \hat{m}_{-1} \right)^T w_{LS}^\star.

(b) Introduce the rank-one matrix \hat{A} = (\hat{m}_{+1} - \hat{m}_{-1})(\hat{m}_{+1} - \hat{m}_{-1})^T. Verify that w_{LS}^\star satisfies the normal equations

\left( \frac{N-2}{N}\, \hat{\Sigma} + \frac{N_{+1} N_{-1}}{N^2}\, \hat{A} \right) w_{LS}^\star = \hat{m}_{+1} - \hat{m}_{-1}

(c) Conclude that w_{LS}^\star and the Fisher classifier w^\star given in (56.34) are parallel to each other, i.e., w_{LS}^\star \propto w^\star.

56.11 The purpose of this problem is to derive a multiclass version of the Fisher classifier (56.34) – see, e.g., Johnson and Wichern (1988) and Fukunaga (1990). Thus, assume that each feature vector h_n ∈ IR^M could belong to one of R classes, denoted by r ∈ {1, 2, . . . , R}. As was the case with the LDA implementation (56.4a)–(56.4c), we use the training data to evaluate the following sample means, sample covariance matrices, and pooled covariance matrix:

\hat{m}_r = \frac{1}{N_r} \sum_{r(n)=r} h_n

\hat{\Sigma}_r = \frac{1}{N_r - 1} \sum_{r(n)=r} (h_n - \hat{m}_r)(h_n - \hat{m}_r)^T

\hat{\Sigma}_W = \frac{1}{N - R} \sum_{r=1}^{R} (N_r - 1)\, \hat{\Sigma}_r

where N_r denotes the number of training features that belong to class r. We are using the subscript W since the matrix \hat{\Sigma}_W calculated in this manner reflects the amount of variability within the various classes. Also, the sums in the first two expressions are over all samples for which the class variable r(n) coincides with r. We further introduce the overall mean vector:

\hat{m}_W = \frac{1}{N} \sum_{n=0}^{N-1} h_n = \frac{1}{N} \sum_{r=1}^{R} N_r \hat{m}_r

and use it to compute the following variance quantity to reflect the amount of variability between classes:

\hat{\Sigma}_B = \frac{1}{N - R} \sum_{r=1}^{R} (N_r - 1)(\hat{m}_r - \hat{m}_W)(\hat{m}_r - \hat{m}_W)^T

Let h ∈ IR^M denote an arbitrary feature vector, whose first- and second-order moments are denoted by {m_r, Σ_r} depending on the class r. Let z = W^T h denote the reduction of h to a subspace of dimension M', where W ∈ IR^{M×M'}. The Fisher criterion determines an optimal W^\star by maximizing the Fisher ratio:

W^\star = \arg\max_{W \in \mathbb{R}^{M \times M'}} \; \frac{\det(W^T \hat{\Sigma}_B W)}{\det(W^T \hat{\Sigma}_W W)}

(a) Show that the rank of the M × M matrix \hat{\Sigma}_B is at most R − 1.
(b) Explain why the Fisher ratio can be interpreted as the ratio of the between-variance measure to the within-variance measure for the transformed quantities {z_n = W^T h_n}.
(c) Assume \hat{\Sigma}_W is invertible. Show that the columns of W^\star that maximize the Fisher ratio can be chosen as the eigenvectors of \hat{\Sigma}_W^{-1} \hat{\Sigma}_B corresponding to the largest R − 1 eigenvalues.
(d) How are the values of M' and R related to each other?
(e) Explain why the Fisher construction can be viewed as a dimensionality reduction procedure.

REFERENCES Altman, E. I. (1968), “Financial ratios, discriminant analysis and the prediction of corporate bankruptcy,” J. Finance, vol. 23, no. 4, pp. 589–609. Anderson, T. W. (2003), An Introduction to Multivariate Statistical Analysis, 3rd ed., Wiley. Baudat, G. and F. Anouar (2000), “Generalized discriminant analysis using a kernel approach,” Neural Comput., vol. 12, no. 10, pp. 2385–2404. Bickel, P. J. and E. Levina (2004), “Some theory for Fisher’s linear discriminant function, naïve Bayes, and some alternatives when there are many more variables than observations,” Bernoulli, vol. 10, pp. 989–1010. Bishop, C. (2007), Pattern Recognition and Machine Learning, Springer. Chatterjee, C. and V. P. Roychowdhury (1997), “On self-organizing algorithms and networks for class-separability features,” IEEE Trans. Neural Netw., vol. 8, no. 3, pp. 663–678. Clemmensen, L., T. Hastie, and B. Ersboll (2011), “Sparse discriminant analysis,” Technometrics, vol. 53, pp. 406–413. Demir, G. K. and K. Ozmehmet (2005), “Online local learning algorithms for linear discriminant analysis,” Pattern Recogn. Lett., vol. 26, no. 4, pp. 421–431. Dua, D. and C. Graff (2019), UCI Machine Learning Repository, available at http: //archive.ics.uci.edu/ml. Duda, R. O., P. E. Hart, and D. G. Stork (2000), Pattern Classification, 2nd ed., Wiley. Dudoit, S., J. Fridlyand, and T. P. Speed (2002), “Comparison of discrimination methods for the classification of tumors using gene expression data,” J. Amer. Statist. Assoc., vol. 97, pp. 77–87. Fisher, R. A. (1936), “The use of multiple measurements in taxonomic problems,” Ann. Eugenics, vol. 7, no. 2, pp. 179–188. Friedman, J. H. (1989), “Regularized discriminant analysis,” J. Amer. Statist. Assoc., vol. 84, pp. 165–175. Fukunaga, K. (1990), Introduction to Statistical Pattern Recognition, 2nd ed., Academic Press. Guo, Y., T. Hastie, and R. Tibshirani (2007), “Regularized linear discriminant analysis and its application in microarrays,” Biostat, vol. 8, no. 1, pp. 86–100. Hardle, W. and L. Simar (2007), Applied Multivariate Statistical Analysis, Springer. Hastie, T., A. Buja, and R. Tibshirani (1995), “Penalized discriminant analysis,” Ann. Statist., vol. 23, no. 1, pp. 73–102. Hastie, T., R. Tibshirani, and J. Friedman (2009), The Elements of Statistical Learning, 2nd ed., Springer. Hoerl, A. E. and R. W. Kennard (1970), “Ridge regression: Biased estimation for nonorthogonal problems,” Technometrics, vol. 12, no. 1, pp. 55–67. Huberty, C. J. (1975), “Discriminant analysis,” Rev. Edu. Res., vol. 45, no. 4, pp. 543– 598. Johnson, R. A. and D. W. Wichern (1988), Applied Multivariate Statistical Analysis, Prentice Hall. McLachlan, G. J. (2004), Discriminant Analysis and Statistical Pattern Recognition, Wiley. Mika, S., G. Ratsch, J. Weston, B. Scholkopf, and K. R. Muller (1999a), “Fisher discriminant analysis with kernels,” Proc. IEEE Workshop on Neural Networks for Signal Processing, pp. 41–48, Madison, WI.


Park, C. H. and H. Park (2005), “Nonlinear discriminant analysis using kernel functions and the generalized singular value decomposition,” SIAM J. Mat. Anal. Appl., vol. 27, no. 1, pp. 87–102. Rao, C. R. (1948), “The utilization of multiple measurements in problems of biological classification,” J. Roy. Statist. Soc. Ser. B, vol. 10, no. 2, pp. 159–203. Witten, D. and R. Tibshirani (2011), “Penalized classification using Fisher’s linear discriminant,” J. Roy. Statist. Soc. Ser. B, vol. 73, pp. 753–772.

57 Principal Component Analysis

Oftentimes, the dimension of the feature space, h_n ∈ IR^M, is prohibitively large either for computational or visualization purposes. In these situations, it becomes necessary to perform an initial dimensionality reduction step where each h_n is replaced by a lower-dimensional vector h_n' ∈ IR^{M'} with M' ≪ M. We have encountered one specific technique for dimensionality reduction in the previous chapter in the form of the Fisher discriminant analysis (FDA) algorithm. In this chapter, we describe one of the most popular methods for this purpose, known as principal component analysis (PCA), and explain how it differs from FDA. PCA is an unsupervised technique in that it operates directly on feature vectors and does not require information about their labels. It helps reduce the dimension M to a smaller value M' by keeping the most significant constituents of the data (also called principal components). In this way, PCA can also be viewed as a technique for feature selection. The unsupervised nature of PCA is in direct contrast with the FDA procedure (56.44), which requires label information. For this reason, PCA can sometimes lead to poorer classification accuracy because the dimensions it retains need not have the most discriminative power.

57.1 DATA PREPROCESSING

We explained earlier in Section 51.1 that it is generally desirable for the entries of a feature vector h to be properly normalized in order to avoid situations where some entries are disproportionately larger than others. Such large discrepancies give rise to ill-conditioned data and can distort the operation of learning algorithms to a great extent. In particular, they can bias the search for the principal components in the data, which correspond to directions in the feature space where the dispersion in data values is most significant. One way to avoid these possibilities is to normalize the feature vectors before applying the PCA procedure. Thus, consider a collection of N feature vectors {h_n ∈ IR^M}. Our first step involves transforming them into a second collection of vectors {h_{n,p} ∈ IR^M}, with an added subscript p, where the individual entries of these transformed vectors


have "zero mean" and "unit variance." This is achieved as follows. We compute the sample mean vector

\bar{h} = \frac{1}{N} \sum_{n=0}^{N-1} h_n    (57.1)

and center all feature vectors by replacing them by

h_{n,c} = h_n - \bar{h}    (57.2)

where the subscript c refers to centered variables. We denote the entries of each {h_n, h_{n,c}} by {h_n(m), h_{n,c}(m)} and compute the sample variances:

\hat{\sigma}_m^2 = \frac{1}{N-1} \sum_{n=0}^{N-1} h_{n,c}^2(m), \quad m = 1, 2, \ldots, M    (57.3)

The scaling by 1/(N − 1) rather than 1/N is because this calculation employs the estimated sample mean \bar{h} and leads to unbiased variance estimates, as was already advanced in Section 31.2 when we discussed sample estimates for means and variances. We next use the standard deviations (i.e., the square roots of the variances) to scale the centered entries:

h_{n,p}(m) = \frac{1}{\hat{\sigma}_m}\, h_{n,c}(m), \quad m = 1, 2, \ldots, M    (57.4)

We can express the two-step transformations (57.2) and (57.4) more compactly in vector form as follows:

S = \mathrm{diag}\{ \hat{\sigma}_1, \hat{\sigma}_2, \ldots, \hat{\sigma}_M \}    (57.5a)

h_{n,p} = S^{-1}(h_n - \bar{h})    (57.5b)

In this way, we end up replacing the original feature vectors {hn } by the normalized vectors {hn,p }. If only centering is desired, then we would replace the {hn } by {hn,c } and apply the PCA procedure to these vectors instead of {hn,p }. For generality, we will describe PCA by using the {hn,p } vectors. The preprocessing steps for PCA are summarized in (57.6).


Preprocessing steps for PCA.

given N feature vectors {h_n ∈ IR^M}.
compute:
    \bar{h} = \frac{1}{N} \sum_{n=0}^{N-1} h_n
    h_{n,c} = h_n - \bar{h}, \; \forall n
    \hat{\sigma}_m^2 = \frac{1}{N-1} \sum_{n=0}^{N-1} h_{n,c}^2(m), \quad m = 1, 2, \ldots, M
    S = \mathrm{diag}\{ \hat{\sigma}_1, \hat{\sigma}_2, \ldots, \hat{\sigma}_M \}
    h_{n,p} = S^{-1}(h_n - \bar{h}), \; \forall n
end
return {h_{n,p} ∈ IR^M}.    (57.6)
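For readers who wish to experiment with these preprocessing steps, the following is a minimal NumPy sketch of listing (57.6); the function name and the use of NumPy are our own choices and not part of the text.

```python
import numpy as np

def preprocess(H):
    """Standardize features as in listing (57.6).

    H is an N x M array whose rows are the feature vectors h_n.
    Returns the normalized rows h_{n,p}, the sample mean h_bar, and
    the vector of standard deviations (the diagonal of S).
    """
    h_bar = H.mean(axis=0)                   # sample mean vector (57.1)
    H_c = H - h_bar                          # centered features (57.2)
    sigma = H_c.std(axis=0, ddof=1)          # unbiased std devs, 1/(N-1) scaling (57.3)
    H_p = H_c / sigma                        # scaled features (57.4)-(57.5b)
    return H_p, h_bar, sigma

# example usage with random data
H = np.random.randn(100, 3) * np.array([1.0, 5.0, 0.1]) + np.array([2.0, -1.0, 0.5])
H_p, h_bar, sigma = preprocess(H)
print(H_p.mean(axis=0))          # approximately zero
print(H_p.std(axis=0, ddof=1))   # approximately one
```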

57.2 DIMENSIONALITY REDUCTION

Following the preprocessing stage, we proceed to reduce the dimension of the vectors {h_{n,p} ∈ IR^M} and replace them by lower-dimensional vectors {h_n' ∈ IR^{M'}} by following the steps described next. We start by evaluating the M × M sample covariance matrix for the normalized vectors {h_{n,p}}:

R_p = \frac{1}{N-1} \sum_{n=0}^{N-1} h_{n,p} h_{n,p}^T    (57.7a)

The matrix R_p is symmetric and nonnegative-definite and, therefore, it admits an eigen-decomposition of the form:

R_p = U \Lambda U^T = \sum_{m=1}^{M} \lambda_m u_m u_m^T    (57.7b)

where U is orthogonal of size M × M with columns {u_m}, while Λ is diagonal with nonnegative entries {λ_m}. The columns of U have unit norms and they correspond to the eigenvectors of R_p, namely,

R_p u_m = \lambda_m u_m, \quad m = 1, 2, \ldots, M    (57.8)

We assume the eigenvalues are ordered from largest to smallest with

\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_M \geq 0    (57.9)
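As a concrete illustration of (57.7a)–(57.9), here is a small NumPy sketch (our own, not from the text) that forms R_p from the normalized features and extracts its ordered eigen-decomposition.

```python
import numpy as np

def covariance_and_eig(H_p):
    """Return R_p and its eigen-decomposition with eigenvalues sorted
    from largest to smallest, as assumed in (57.9).

    H_p is an N x M array whose rows are the normalized vectors h_{n,p}.
    """
    N = H_p.shape[0]
    R_p = (H_p.T @ H_p) / (N - 1)           # sample covariance (57.7a)
    lam, U = np.linalg.eigh(R_p)            # eigh returns ascending eigenvalues
    order = np.argsort(lam)[::-1]           # reorder from largest to smallest
    return R_p, lam[order], U[:, order]

# quick check of (57.8): R_p u_m = lambda_m u_m
H_p = np.random.randn(200, 4)
R_p, lam, U = covariance_and_eig(H_p)
print(np.allclose(R_p @ U[:, 0], lam[0] * U[:, 0]))
```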

We provide two derivations for PCA: One is based on algebraic arguments and the other on geometric arguments.


57.2.1 Algebraic Derivation

It is generally the case (though not always) that most of the eigenvalues {λ_m} will have negligible size. PCA exploits this fact to reduce the dimension of the feature space by ignoring the smallest eigenvalues. Assume we keep the largest M' eigenvalues from Λ and zero out the remaining eigenvalues, say,

\Lambda \approx \begin{bmatrix} \Lambda_1 & 0 \\ 0 & 0 \end{bmatrix}, \qquad \Lambda_1 = \mathrm{diag}\{\lambda_1, \ldots, \lambda_{M'}\}    (57.10a)

where Λ_1 is M' × M', and partition the columns of U accordingly:

U = \begin{bmatrix} U_1 & U_2 \end{bmatrix}    (57.10b)

where U_1 is M × M'. It then follows that

U \Lambda U^T \approx U_1 \Lambda_1 U_1^T    (57.11)

This construction retains in U_1 the eigenvector directions (or principal components) that correspond to the largest eigenvalues. Approximation (57.11) suggests that we define the reduced feature vectors in the following manner:

h_n' = U_1^T h_{n,p} \qquad (M' \times 1)    (57.12)

These vectors have one useful property: They are "uncorrelated" with each other in that their sample covariance matrix is diagonal and coincides with Λ_1. This is because the sample covariance, denoted by R_{h'}, is given by:

R_{h'} = \frac{1}{N-1} \sum_{n=0}^{N-1} h_n' (h_n')^T
       = U_1^T \underbrace{\left( \frac{1}{N-1} \sum_{n=0}^{N-1} h_{n,p} h_{n,p}^T \right)}_{R_p \,=\, U\Lambda U^T} U_1
       = U_1^T \begin{bmatrix} U_1 & U_2 \end{bmatrix} \begin{bmatrix} \Lambda_1 & 0 \\ 0 & \times \end{bmatrix} \begin{bmatrix} U_1^T \\ U_2^T \end{bmatrix} U_1
       \stackrel{(a)}{=} \begin{bmatrix} I_{M'} & 0 \end{bmatrix} \begin{bmatrix} \Lambda_1 & 0 \\ 0 & \times \end{bmatrix} \begin{bmatrix} I_{M'} \\ 0 \end{bmatrix}
       = \Lambda_1 \qquad (M' \times M')    (57.13)

where in step (a) we used the fact that U^T U = I_M so that U_1^T U_1 = I_{M'} and U_1^T U_2 = 0:

U^T U = I_M \;\Longleftrightarrow\; \begin{bmatrix} U_1^T \\ U_2^T \end{bmatrix} \begin{bmatrix} U_1 & U_2 \end{bmatrix} = \begin{bmatrix} I_{M'} & 0 \\ 0 & I_{M-M'} \end{bmatrix}    (57.14)
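The uncorrelatedness property (57.13) is easy to verify numerically; the short NumPy sketch below (our own illustration, with hypothetical variable names) computes the reduced features of (57.12) and checks that their sample covariance is diagonal and equal to Λ_1.

```python
import numpy as np

N, M, M_prime = 500, 5, 2
H_p = np.random.randn(N, M)            # stand-in for the preprocessed features h_{n,p}

R_p = (H_p.T @ H_p) / (N - 1)          # sample covariance (57.7a)
lam, U = np.linalg.eigh(R_p)
order = np.argsort(lam)[::-1]
lam, U = lam[order], U[:, order]

U1 = U[:, :M_prime]                    # leading M' eigenvectors
H_prime = H_p @ U1                     # rows are the reduced vectors h_n' of (57.12)

R_hprime = (H_prime.T @ H_prime) / (N - 1)
print(np.allclose(R_hprime, np.diag(lam[:M_prime])))   # True: matches Lambda_1 as in (57.13)
```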

A second useful way to motivate construction (57.12) is explained next, where it will be seen that the approximation U1 Λ1 U1T for the covariance matrix Rp retains information about the main sources of variability within the data.

57.2.2 Geometric Interpretation

Let u ∈ IR^M refer to a generic unit-norm column vector (u^T u = \|u\|^2 = 1), and denote the inner product between u and any feature vector h_{n,p} by

z(n) = h_{n,p}^T u    (57.15)

The projection of h_{n,p} onto the direction specified by u is given by \hat{h}_{n,p} = \alpha^\star u, where the scalar \alpha^\star is the solution to the least-squares problem:

\alpha^\star = \arg\min_{\alpha \in \mathbb{R}} \|h_{n,p} - \alpha u\|^2 = \frac{u^T h_{n,p}}{u^T u} = z(n)    (57.16)

That is, the scalar z(n) defines the projection:

\hat{h}_{n,p} = z(n)\, u    (57.17)

The result is illustrated in Fig. 57.1 for two-dimensional vectors hn,p . The size of z(n) reflects the amount of variability that hn,p has along the direction u, with some feature vectors having larger-size projections than other vectors. We exploit this preliminary result as follows.

Figure 57.1 The projection of h_{n,p} onto the unit-norm vector u is given by \hat{h}_{n,p} = z(n)u.

First principal component
Assume we formulate the optimization problem:

u_1 = \arg\max_{\|u\|=1} \left\{ \frac{1}{N-1} \sum_{n=0}^{N-1} z^2(n) \right\}    (57.18)

which amounts to seeking a direction vector u that maximizes the sum of squares of the projections from all feature vectors onto it. This cost can be interpreted as seeking the direction u along which the sample variance of the projections is the largest possible. Substituting the expression for z(n) into (57.18), the above problem becomes

\max_{\|u\|=1} \; u^T \left( \frac{1}{N-1} \sum_{n=0}^{N-1} h_{n,p} h_{n,p}^T \right) u = \max_{\|u\|=1} \; u^T R_p u    (57.19)

The solution to this latter problem follows from the well-known Rayleigh–Ritz characterization of the maximum eigenvalue of a matrix – recall (1.17b). Specifically, the maximum value of (1.17b) is attained when u is chosen as the unit-norm eigenvector of R_p that is associated with its largest eigenvalue, λ_1 = λ_max(R_p). It follows that the optimal u is given by

u_1 = \text{first column of } U    (57.20)

and the resulting maximum value for (57.19) is

(u_1)^T R_p\, u_1 = \lambda_1    (57.21)
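To make the statement in (57.18)–(57.21) concrete, the following NumPy sketch (ours, not from the text) checks that the sample variance of the projections z(n) = h_{n,p}^T u is maximized, over unit-norm u, by the leading eigenvector of R_p and that the maximal value equals λ_1.

```python
import numpy as np

rng = np.random.default_rng(0)
H_p = rng.standard_normal((300, 3)) @ np.diag([3.0, 1.0, 0.3])  # anisotropic data
N = H_p.shape[0]
R_p = (H_p.T @ H_p) / (N - 1)

lam, U = np.linalg.eigh(R_p)
u1 = U[:, -1]                                   # eigenvector of the largest eigenvalue

def proj_variance(u):
    z = H_p @ u                                 # projections z(n) = h_{n,p}^T u
    return np.sum(z**2) / (N - 1)               # cost in (57.18)

print(np.isclose(proj_variance(u1), lam[-1]))   # True: attains lambda_1 as in (57.21)

# any other unit-norm direction gives a smaller value
u_rand = rng.standard_normal(3)
u_rand /= np.linalg.norm(u_rand)
print(proj_variance(u_rand) <= lam[-1] + 1e-12)
```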

Example 57.1 (An alternative interpretation) Using z(n) = u^T h_{n,p}, the projection (57.17) can be rewritten in terms of the direction u as follows:

\hat{h}_{n,p} = u u^T h_{n,p}    (57.22)

where the rank-one matrix uu^T transforms h_{n,p} into its projection. Motivated by this observation, we now verify that problem (57.18) for determining the first principal direction u_1 can be posed in the following equivalent manner:

u_1 = \arg\min_{\|u\|=1} \left\{ \frac{1}{N-1} \sum_{n=0}^{N-1} \| h_{n,p} - u u^T h_{n,p} \|^2 \right\}    (57.23)

In this formulation, we are seeking a rank-one matrix, uu^T, with unit-norm u, that minimizes the squared distance between h_{n,p} and its projection onto the direction of u. That is, we are minimizing the sum of squared residual norms. We can verify that the above formulation is equivalent to (57.19) as follows. First, note that

\sum_{n=0}^{N-1} \| h_{n,p} - u u^T h_{n,p} \|^2
= \sum_{n=0}^{N-1} \Big\{ \|h_{n,p}\|^2 + h_{n,p}^T u \underbrace{u^T u}_{=1}\, u^T h_{n,p} - 2\, h_{n,p}^T u u^T h_{n,p} \Big\}
= \sum_{n=0}^{N-1} \Big\{ \|h_{n,p}\|^2 - h_{n,p}^T u u^T h_{n,p} \Big\}    (57.24)

Consequently, as claimed,

\arg\min_{\|u\|=1} \left\{ \frac{1}{N-1} \sum_{n=0}^{N-1} \| h_{n,p} - u u^T h_{n,p} \|^2 \right\}
= \arg\max_{\|u\|=1} \left\{ \frac{1}{N-1} \sum_{n=0}^{N-1} h_{n,p}^T u u^T h_{n,p} \right\}
= \arg\max_{\|u\|=1} \left\{ \frac{1}{N-1} \sum_{n=0}^{N-1} u^T h_{n,p} h_{n,p}^T u \right\}
= \arg\max_{\|u\|=1} \; u^T \left( \frac{1}{N-1} \sum_{n=0}^{N-1} h_{n,p} h_{n,p}^T \right) u
= \arg\max_{\|u\|=1} \; u^T R_p u    (57.25)

Other principal components
The argument can be continued to determine additional principal directions besides (57.19). For example, we can seek next a unit-norm vector u_2 that is orthogonal to u_1 and maximizes the same cost:

u_2 = \arg\max_{\|u\|=1,\; u^T u_1 = 0} \; u^T R_p u    (57.26)

The solution is obtained by selecting u_2 as the unit-norm eigenvector of R_p that is associated with its second largest eigenvalue, λ_2, i.e., u_2 will now be the second column of U and the maximum value of the above cost will be λ_2. One way to justify this conclusion is as follows. Using the eigen-decomposition (57.7b) we have

u^T R_p u = u^T \left( \sum_{m=1}^{M} \lambda_m u_m u_m^T \right) u = u^T \underbrace{\left( \sum_{m=2}^{M} \lambda_m u_m u_m^T \right)}_{\triangleq\; R_{p,2}} u, \qquad \text{since } u \perp u_1    (57.27)

so that problem (57.26) is equivalent to

u_2 = \arg\max_{\|u\|=1} \; u^T R_{p,2}\, u    (57.28)

where u_1 is eliminated and R_p is replaced by R_{p,2}. Note that R_{p,2} is obtained from R_p by deflation since

R_{p,2} = R_p - \lambda_1 u_1 u_1^T    (57.29)

It now follows from the Rayleigh–Ritz characterization of the maximum eigenvalue of a matrix that the maximum of (57.28) is λ_2 (which is the largest eigenvalue for R_{p,2}) and is attained when u = u_2. We can continue in this manner, by successively deflating R_p and retaining its largest eigenvalues and the corresponding eigenvectors, until M' principal directions are determined. In this way, construction (57.12) ends up projecting the given vectors {h_{n,p}} onto the M' principal directions along which the variability of the feature data is the most significant.
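The deflation step (57.29) suggests a simple iterative way of extracting principal directions one at a time; the sketch below (our own illustration) uses the power method on successively deflated covariance matrices, which is one possible realization under the stated assumptions.

```python
import numpy as np

def principal_directions_by_deflation(R_p, M_prime, num_iter=500):
    """Extract M' principal directions of R_p by repeatedly finding the
    leading eigenvector (power iterations) and deflating as in (57.29)."""
    R = R_p.copy()
    directions = []
    for _ in range(M_prime):
        u = np.random.randn(R.shape[0])
        u /= np.linalg.norm(u)
        for _ in range(num_iter):               # power method for the top eigenvector
            u = R @ u
            u /= np.linalg.norm(u)
        lam = u @ R @ u                         # Rayleigh quotient gives the eigenvalue
        directions.append(u)
        R = R - lam * np.outer(u, u)            # deflation step (57.29)
    return np.column_stack(directions)

# compare against a direct eigen-decomposition
R_p = np.cov(np.random.randn(400, 4), rowvar=False)
U1 = principal_directions_by_deflation(R_p, 2)
lam, U = np.linalg.eigh(R_p)
top2 = U[:, np.argsort(lam)[::-1][:2]]
print(np.abs(U1.T @ top2))    # approximately the identity: same directions up to sign
```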

57.2.3 Encoding and Decoding

A third way to interpret construction (57.12) is to note that the vectors {h_n'} are solutions to the least-squares problem:

h_n' = \arg\min_{x \in \mathbb{R}^{M'}} \; \| h_{n,p} - U_1 x \|^2    (57.30)

That is, h_n' is the representation of the closest vector to h_{n,p} from R(U_1). Indeed, using the fact that U_1^T U_1 = I_{M'}, the solution h_n' is given by

h_n' = (U_1^T U_1)^{-1} U_1^T h_{n,p} = U_1^T h_{n,p}    (57.31)

which agrees with (57.12). In this way, expression (57.12) is replacing each h_{n,p} by its basis representation h_n' in the range space of U_1. The projection of h_{n,p} onto R(U_1), on the other hand, is given by

\hat{h}_{n,p} = U_1 h_n'    (57.32)

Obviously, there is some loss of information in replacing {h_{n,p}} by the reduced vectors {h_n'}. For one, the latter vectors lie in a lower M'-dimensional space than the {h_{n,p}}. Nevertheless, we can use the projections (57.32) to "reverse" the reduction procedure (57.12) and "recover" or estimate the vectors {h_{n,p}} from knowledge of {h_n'}. If we further undo the earlier mean and variance normalizations, and substitute (57.32) into (57.5b), we can "recover" the original feature data {h_n} from the reduced vectors {h_n'} as follows:

\hat{h}_n = \bar{h} + S U_1 h_n'    (57.33)

Using these results, we can represent the PCA operation as consisting of two stages: an encoder stage that compresses feature vectors h_n ∈ IR^M to lower-dimensional vectors h_n' ∈ IR^{M'}, followed by a decoder stage that "recovers" or estimates the original feature h_n from h_n'. This procedure is illustrated in Fig. 57.2.

Figure 57.2 Illustration of a two-stage procedure for PCA involving compression of h_n ∈ IR^M down to h_n' ∈ IR^{M'} by means of an encoder stage, followed by decoding to \hat{h}_n ∈ IR^M.

The reduced representation h_n' is also called a latent variable or a factor


since it embodies important information about the original data. In summary, we arrive at the following listing for the PCA procedure.

PCA algorithm.

given N feature vectors {h_n ∈ IR^M};
given a desired lower dimension M' ≪ M.
compute:
    \bar{h} = \frac{1}{N} \sum_{n=0}^{N-1} h_n
    h_{n,c} = h_n - \bar{h}
    \hat{\sigma}_m^2 = \frac{1}{N-1} \sum_{n=0}^{N-1} h_{n,c}^2(m), \quad m = 1, 2, \ldots, M
    S = \mathrm{diag}\{ \hat{\sigma}_1, \hat{\sigma}_2, \ldots, \hat{\sigma}_M \}
    h_{n,p} = S^{-1}(h_n - \bar{h})
    R_p = \frac{1}{N-1} \sum_{n=0}^{N-1} h_{n,p} h_{n,p}^T
    R_p = U \Lambda U^T \quad \text{(eigen-decomposition)}
    U = [\, U_1 \;\; \times \,], \quad U_1 : M \times M'
    \Lambda = \mathrm{diag}\{ \Lambda_1, \times \}, \quad \Lambda_1 : M' \times M'
    h_n' = U_1^T h_{n,p}, \; \forall n \quad \text{(encoding)}
    \hat{h}_n = \bar{h} + S U_1 h_n', \; \forall n \quad \text{(decoding)}
end
return {h_n'}.    (57.34)
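The listing (57.34) translates almost line by line into NumPy. The sketch below is our own minimal implementation under the same assumptions (data standardized, eigenvalues sorted); the function names are hypothetical.

```python
import numpy as np

def pca_fit(H, M_prime):
    """Return the quantities needed by listing (57.34) from the N x M data matrix H."""
    h_bar = H.mean(axis=0)
    sigma = (H - h_bar).std(axis=0, ddof=1)
    H_p = (H - h_bar) / sigma
    R_p = (H_p.T @ H_p) / (H.shape[0] - 1)
    lam, U = np.linalg.eigh(R_p)
    order = np.argsort(lam)[::-1]
    U1 = U[:, order[:M_prime]]                 # leading M' principal directions
    return h_bar, sigma, U1

def pca_encode(H, h_bar, sigma, U1):
    return ((H - h_bar) / sigma) @ U1          # h_n' = U1^T h_{n,p}   (encoding)

def pca_decode(H_prime, h_bar, sigma, U1):
    return h_bar + (H_prime @ U1.T) * sigma    # h_n_hat = h_bar + S U1 h_n'   (decoding)

# round trip on synthetic data
H = np.random.randn(200, 4) @ np.random.randn(4, 4)
h_bar, sigma, U1 = pca_fit(H, M_prime=2)
H_prime = pca_encode(H, h_bar, sigma, U1)
H_hat = pca_decode(H_prime, h_bar, sigma, U1)
print(H_prime.shape, np.linalg.norm(H - H_hat) / np.linalg.norm(H))
```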


Matrix representation
We collect the transformed feature vectors {h_{n,p}}, their reduced versions {h_n'}, and the predictions {\hat{h}_{n,p} = U_1 h_n'} into three matrices of dimensions N × M, N × M', and N × M, respectively:

H_p = \begin{bmatrix} h_{0,p}^T \\ h_{1,p}^T \\ \vdots \\ h_{N-1,p}^T \end{bmatrix}, \qquad
H' = \begin{bmatrix} (h_0')^T \\ (h_1')^T \\ \vdots \\ (h_{N-1}')^T \end{bmatrix}, \qquad
\hat{H}_p = \begin{bmatrix} \hat{h}_{0,p}^T \\ \hat{h}_{1,p}^T \\ \vdots \\ \hat{h}_{N-1,p}^T \end{bmatrix}    (57.35)

Using these quantities, we can express the mappings (57.31) and (57.32) from h_{n,p} to h_n' and from h_n' back to \hat{h}_{n,p} in matrix form as follows:

H' = H_p U_1, \qquad \hat{H}_p = H' U_1^T    (57.36)

Likewise, if we collect the predictions {\hat{h}_n} of the original feature vectors into a similar matrix \hat{H}:

\hat{H} = \begin{bmatrix} \hat{h}_0^T \\ \hat{h}_1^T \\ \vdots \\ \hat{h}_{N-1}^T \end{bmatrix}    (57.37)

then relation (57.33) implies that

\hat{H} = \mathbb{1} \otimes \bar{h}^T + \hat{H}_p S    (57.38)
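In code, the matrix relations (57.36) and (57.38) reduce to a pair of matrix products; here is a brief NumPy check (our own) that they reproduce the per-vector encoding and decoding.

```python
import numpy as np

N, M, M_prime = 50, 4, 2
H_p = np.random.randn(N, M)
U1, _ = np.linalg.qr(np.random.randn(M, M_prime))    # any matrix with orthonormal columns
h_bar = np.random.randn(M)
sigma = np.random.rand(M) + 0.5

H_prime = H_p @ U1                                   # H' = H_p U1           (57.36)
H_p_hat = H_prime @ U1.T                             # H_p_hat = H' U1^T
H_hat = np.ones((N, 1)) * h_bar + H_p_hat * sigma    # H_hat = 1 h_bar^T + H_p_hat S   (57.38)

# agrees with the per-vector relations (57.31)-(57.33)
n = 7
print(np.allclose(H_prime[n], U1.T @ H_p[n]))
print(np.allclose(H_hat[n], h_bar + sigma * (U1 @ H_prime[n])))
```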

Remark 57.1. (Using the singular value decomposition) It is common in practice to construct the matrix U_1 ∈ IR^{M×M'} directly from the SVD of the N × M matrix H_p. This step is desirable because it avoids the need to "square" the vectors {h_{n,p}} while forming R_p; recall that the latter consists of a sum of outer products of the form h_{n,p} h_{n,p}^T. Let us denote the SVD of H_p by

H_p = V \begin{bmatrix} \Sigma \\ 0 \end{bmatrix} U^T    (57.39)

where Σ contains the singular values in decreasing order, while V is N × N orthogonal and U is M × M orthogonal. Now, note that

H_p^T H_p = \sum_{n=0}^{N-1} h_{n,p} h_{n,p}^T = (N-1) R_p    (57.40)

and from the above SVD we deduce that

H_p^T H_p = U \Sigma^2 U^T    (57.41)

Comparing (57.40) and (57.41), we observe that the matrix U from the SVD (57.39) coincides with the orthonormal eigenvectors of R_p, while Σ^2 = (N−1)Λ. Therefore, we can determine U_1 by retaining the leading M' columns of U obtained from (57.39).
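A short NumPy check (our own) of Remark 57.1: the right singular vectors of H_p match the eigenvectors of R_p, and the squared singular values equal (N − 1) times its eigenvalues.

```python
import numpy as np

N = 150
H_p = np.random.randn(N, 4)
R_p = (H_p.T @ H_p) / (N - 1)

# SVD route: H_p = V diag(s) U^T with singular values in decreasing order
_, s, Ut = np.linalg.svd(H_p, full_matrices=False)
U_svd = Ut.T

# eigen-decomposition route, reordered from largest to smallest
lam, U_eig = np.linalg.eigh(R_p)
lam, U_eig = lam[::-1], U_eig[:, ::-1]

print(np.allclose(s**2, (N - 1) * lam))                   # Sigma^2 = (N-1) Lambda, as in (57.40)-(57.41)
print(np.allclose(np.abs(U_svd.T @ U_eig), np.eye(4)))    # same columns up to sign
```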


Remark 57.2. (Detecting outliers) PCA is useful in detecting outliers in the data. Assume a feature vector h_n is reduced to h_n', which is then mapped back to \hat{h}_n. If \hat{h}_n is determined to be a rather poor estimate for h_n, then this fact can be used to flag h_n as a potential outlier.
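One simple way to act on Remark 57.2 is to rank samples by their reconstruction error. The snippet below is our own minimal illustration (synthetic data, with M' = 2 chosen arbitrarily); samples with unusually large errors are candidate outliers.

```python
import numpy as np

rng = np.random.default_rng(1)
H = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))   # stand-in feature matrix

h_bar, sigma = H.mean(axis=0), H.std(axis=0, ddof=1)
H_p = (H - h_bar) / sigma
lam, U = np.linalg.eigh((H_p.T @ H_p) / (len(H) - 1))
U1 = U[:, np.argsort(lam)[::-1][:2]]             # keep M' = 2 principal directions

H_hat = h_bar + ((H_p @ U1) @ U1.T) * sigma      # encode then decode, as in (57.31)-(57.33)
errors = np.linalg.norm(H - H_hat, axis=1)       # reconstruction error per sample
suspects = np.argsort(errors)[-5:]               # the five samples reconstructed worst
print(suspects, errors[suspects])
```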

Limitations
The PCA construction can lead to poor results in some cases for two main reasons. First, the procedure can fail to perform sufficient dimensionality reduction because most (or all) eigenvalues of R_p may be significant and reduction is not well justified. Second, PCA can lead to poor results when applied to classification problems because it is an unsupervised technique that operates solely on the feature data and does not require knowledge of their labels. This means that while transforming from {h_n} to {h_n'}, the PCA construction is not concerned with generating features {h_n'} for which the class information (or label) is retained or easily recognized. Contrast this situation with the Fisher construction (56.44), where the class information is used during dimensionality reduction and the procedure aims at preserving the discrimination information that is present in the feature data. This is not the case for PCA.

Example 57.2 (Reducing the feature space to two dimensions) We illustrate PCA by generating N = 100 random features h_n ∈ IR^3 and reducing them to two-dimensional vectors. The top row in Fig. 57.3 shows a three-dimensional scatter diagram of the data (on the left). Although the label information is unnecessary, the plot shows the two classes in color for convenience. The plot on the right shows the recovered feature vectors, \hat{h}_n ∈ IR^3, using relation (57.33). The bottom row shows the reduced vectors, h_n' ∈ IR^2, obtained after discarding the smallest eigenvalue, λ_3, and retaining the columns of U corresponding to λ_1 and λ_2. For this example, the eigenvalues of R_p were found to be:

\lambda_1 = 2.0098, \quad \lambda_2 = 0.5458, \quad \lambda_3 = 0.4444    (57.42)

It is clear from these values that we could have ignored the last two eigenvalues and retained only the first column of U corresponding to λ_1; that is, we could have projected down to a subspace of dimension 1 without considerable loss in information relative to the two-dimensional case. This situation is evident from examining the projected data in the bottom row of Fig. 57.3; observe that the data is well discriminated if we simply consider their x-coordinates. For illustration purposes, Fig. 57.4 shows a three-dimensional scatter diagram of the normalized feature vectors, h_{n,p} ∈ IR^3, on the left. The plot on the right shows the same vectors {h_{n,p}} along with the hyperplane corresponding to the column span of U_1. The vectors {h_{n,p}} are projected onto this range space to generate the reduced vectors {h_n'}, already shown in the bottom row of Fig. 57.3.

Figure 57.3 The plot in the top row (left) shows a three-dimensional scatter diagram of the original N = 100 feature vectors. The plot in the same top row (right) shows the recovered feature vectors using approximation (57.33). The bottom row shows the reduced vectors {h_n'} in two dimensions, which result from projecting the preprocessed vectors {h_{n,p}} onto R(U_1).

Figure 57.4 The plot on the left shows a three-dimensional scatter diagram of the normalized vectors, {h_{n,p}}. The plot on the right shows the same scatter diagram along with the hyperplane representing the column span of U_1.

Example 57.3 (Reducing the feature space of the iris data) We illustrate PCA by applying it to the iris dataset, which we encountered earlier in Example 56.1. The dataset consists of three types of flowers: setosa, versicolor, and virginica. There are 50 measurements for each flower type, with a total of N = 150 samples. Each feature vector h_n consists of four attributes: petal length, petal width, sepal length, and sepal width (all measured in centimeters). We follow the procedure outlined in Remark 57.1. We center the feature vectors {h_n} and scale their variances to transform them into the normalized features {h_{n,p}} according to the preprocessing steps outlined in (57.6). We subsequently perform the SVD of the 150 × 4 matrix:

H_p = V \begin{bmatrix} \Sigma \\ 0 \end{bmatrix} U^T    (57.43)

The four nonzero singular values are found to be (the σ-notation here refers to singular values and not to standard deviations):

\sigma_1 = 20.8256, \quad \sigma_2 = 11.7159, \quad \sigma_3 = 4.6861, \quad \sigma_4 = 1.7529    (57.44)

We retain the first two dimensions and use the first two columns of U, given by

U_1 = \begin{bmatrix} -0.5224 & -0.3723 \\ \phantom{-}0.2634 & -0.9256 \\ -0.5813 & -0.0211 \\ -0.5656 & -0.0654 \end{bmatrix}    (57.45)

to compute the two-dimensional reduced feature vectors h_n' = U_1^T h_{n,p}. Figure 57.5 shows scatter diagrams for these reduced vectors.

Example 57.4 (Reducing the feature space of the heart disease data) We apply PCA to the heart disease dataset we encountered earlier in Example 54.3. The set consists


of N = 297 samples with M = 13 attributes in each feature vector h_n. There are R = 4 classes denoted by r = 0 (no heart disease) and r = 1, 2, 3 (heart disease). The 13 attributes in the feature vectors are listed in Table 54.7. We again follow the procedure outlined in Remark 57.1. We center the feature vectors {h_n} and scale their variances to transform them into the normalized features {h_{n,p}} according to the preprocessing steps outlined in (57.6). We subsequently perform the SVD of the 297 × 13 matrix H_p as shown by (57.39). The 13 nonzero singular values are found to be

\{\sigma_m\} = \sqrt{296} \times \big\{ 1.7483,\; 1.2653,\; 1.1192,\; 1.0515,\; 0.9925,\; 0.9337,\; 0.9198,\; 0.8873,\; 0.8247,\; 0.7506,\; 0.6769,\; 0.6624,\; 0.5946 \big\}    (57.46)

It is seen in this example that the singular values decay slowly. Nevertheless, we reduce the dimension of the feature vectors down to three by keeping only the three principal components, leading to

U_1 = \begin{bmatrix}
-0.2683 & \phantom{-}0.4234 & -0.0648 \\
-0.1174 & -0.4562 & \phantom{-}0.4356 \\
-0.2923 & -0.1216 & -0.4141 \\
-0.1650 & \phantom{-}0.3802 & \phantom{-}0.3602 \\
-0.0831 & \phantom{-}0.4444 & -0.2022 \\
-0.0695 & \phantom{-}0.2010 & \phantom{-}0.5345 \\
-0.1451 & \phantom{-}0.2660 & \phantom{-}0.1241 \\
\phantom{-}0.3927 & \phantom{-}0.0463 & \phantom{-}0.2308 \\
-0.3400 & -0.1829 & -0.1978 \\
-0.3997 & -0.0566 & \phantom{-}0.0692 \\
-0.3554 & -0.0660 & \phantom{-}0.0278 \\
-0.3050 & \phantom{-}0.1527 & \phantom{-}0.0899 \\
-0.3461 & -0.2797 & \phantom{-}0.2381
\end{bmatrix}    (57.47)

We use these columns to compute the three-dimensional reduced feature vectors h_n' = U_1^T h_{n,p}. Figure 57.6 shows scatter diagrams for these reduced vectors.

Figure 57.5 The top scatter plot shows the resulting two-dimensional feature vectors {h_n'} for the three classes of iris flowers: setosa, virginica, and versicolor. The bottom scatter plot aggregates virginica and versicolor into a single class.

57.3 SUBSPACE INTERPRETATIONS

We motivated the PCA construction in the earlier sections by showing that it seeks the principal directions {u_1, u_2, \ldots, u_{M'}} along which the "variance" of the projections of the feature data is maximized. There are other useful interpretations for PCA as a procedure that seeks low-rank approximations for data and covariance matrices.

Low-rank approximation of data matrix
Assume all feature vectors {h_n} have been preprocessed by (57.6) and replaced by {h_{n,p}}. We collect the normalized vectors into the same N × M matrix H_p from (57.35), where N ≥ M, and pose the problem of approximating H_p by a matrix of lower rank M' < M:

\hat{H}_p = \arg\min_{X \in \mathbb{R}^{N \times M}} \| H_p - X \|_F^2, \quad \text{subject to } \mathrm{rank}(X) = M'    (57.48)


Figure 57.6 The top scatter plot shows the resulting three-dimensional feature vectors {h_n'} for four classes of heart disease conditions. The bottom scatter plot aggregates all data points into two classes depending on the presence or absence of a heart disease condition. The heart disease data is derived from the processed Cleveland dataset from the site https://archive.ics.uci.edu/ml/datasets/heart+Disease.

where the notation kAkF denotes the Frobenius norm of its matrix argument. It turns out that the principal component matrix U1 found by PCA helps determine b p as follows: the solution H b p = H 0 U1T H

(57.49)

where H 0 is the same N ×M 0 matrix from (57.35) containing the reduced feature vectors {h0n }. Proof of (57.49): To solve problem (57.48), we introduce the SVD for Hp :  Hp = V

Σ 0



UT

(57.50)

where V is N × N orthogonal, U is M × M orthogonal, and Σ is M × M with ordered singular values appearing on the diagonal entries denoted by σ1 ≥ σ2 ≥ . . . ≥ σM . We let (Σ1 , U1 , V1 ) denote the leading M 0 × M 0 , M × M 0 , and N × M 0 submatrices of (Σ, U, V ):

2398

Principal Component Analysis

 Σ=

Σ1 0

0 ×



, UT =



U1T ×

 , V =



V1

×



(57.51)

Then, we know from the Eckart–Young theorem (1.224) that the best approximation for Hp of rank M 0 is given by b p = V1 Σ1 U1T H

(57.52)

0

We can relate this expression to H as follows. Note that   Σ U T U1 Hp U1 = V 0   T    Σ1 0 U1 V1 × U1 = 0 × ×     Σ1 0 (57.14)  IM 0 V1 × = 0 × 0 =

V1 Σ1

(57.53)

b p , we conclude that Substituting into expression (57.52) for H b p = Hp U1 U1T H | {z }

(57.36)

=

H 0 U1T

(57.54)



= H0

as claimed.



Low-rank approximation of covariance matrix We can also motivate PCA as a method that seeks directly a low-rank approximation for the sample covariance matrix Rp defined by (57.7a), namely, ∆

Rp =

N −1 1 X hn,p hT n,p N N 1 n=0

(57.55)

To see this, we consider the problem of approximating Rp by a matrix of lower rank M 0 < M : ∆ bp = R argmin kRp N Rk2F , subject to rank(R) = M 0

(57.56)

R∈IRM ×M

Again, the principal component matrix U1 found by PCA, along with the diagonal matrix Λ1 , help determine the solution as follows: bp = U1 Λ1 U1T R

(57.57)

This matrix is actually the sample covariance of the projections {b hn,p = U1 h0n } since ! N −1 N −1 1 X b bT 1 X 0 0 T hn,p hn,p = U1 h (h ) U1T N N 1 n=0 N N 1 n=0 n n (57.13)

=

U1 Λ1 U1T

(57.58)

57.4 Sparse PCA

2399

Proof of (57.57): We introduce the eigen-decomposition Rp = U ΛU T , with the eigenvalues ordered as λ1 ≥ λ2 ≥ . . . ≥ λM ≥ 0. We let U1 denote the leading M 0 columns of U and Λ1 the leading M 0 × M 0 submatrix of Λ. Then, we know from the Eckart–Young theorem (1.224) that the best approximation for Rp of rank M 0 is given by (57.57).



57.4

SPARSE PCA The traditional PCA solution discussed in the earlier part of this chapter replaces every preprocessed M -dimensional feature vector hn,p by an M 0 -dimensional vector h0n given by (57.31), namely, h0n = U1T hn,p

(57.59)

In this construction, the entries of h0n depend on all entries of hn,p , which often makes it difficult to interpret h0n in some meaningful way. Sparse PCA is a variation that requires the rows of the transformation matrix U1 to be sparse so that the entries of h0n will depend only on a sparse collection of the entries of hn,p . Sparse PCA can be formulated as follows. We already know under PCA that the first column of U1 is determined by solving (57.19), i.e., n o ∆ u1 = argmax uT Rp u (57.60) kuk=1

where the Euclidean norm of u is constrained to 1. Under sparse PCA, it is desired to replace this problem by some formulation of the form ∆

u1 = argmax kuk=1

n o uT Rp u , subject to u being sparse

(57.61)

where the sparsity of u can be measured, for example, by imposing a bound on the number of its nonzero elements, such as kuk0 ≤ T . Once u1 is determined, we deflate Rp to ∆

T Rp,2 = Rp N λ1 u1 uT 1 , where λ1 = u1 Rp u1

(57.62)

We then determine the next principal vector by solving: n o ∆ u2 = argmax uT Rp,2 u , subject to u being sparse

(57.63)

kuk=1

and so forth. We describe next one explicit problem formulation that approximates this desired behavior and leads to sparse principal vectors.

2400

Principal Component Analysis

Alternative optimization First, we recall from the analysis in Example 57.1 that traditional PCA finds the first principal vector by solving: ( ) N 1 X ∆ T 2 u1 = argmin khn,p N uu hn,p k (57.64) N n=1 kuk=1 We can rewrite this problem in the following equivalent form by introducing the constraint u = z, where both u and z are M -dimensional vectors: ) ( N −1 1 X ∆ T 2 khn,p N uz hn,p k , subject to u = z (u1 , z1 ) = argmin N N 1 n=0 kuk=1,z

(57.65)

It turns out that we can ignore the equality constraint and focus instead on an optimization problem over the two separate variables {u, z}: ) ( N −1 1 X ∆ T 2 khn,p N uz hn,p k (57.66) (ux , zx ) = argmin N N 1 n=0 kuk=1,z The solution to this problem continues to lead to ux = u1 and zx = u1 , where u1 is the first column of U . The proof appears in Appendix 57.B. An extension with `2 -regularization over z appears in Prob. 57.12. One can follow a similar argument to explain that multiple principal components can be determined simultaneously by optimizing a related two-variable problem – see Prob. 57.2. Indeed, the same argument used in Example 57.1 can be repeated to show that the matrix U1 (which consists of the leading M 0 columns of U in the eigen-decomposition Rp = U ΛU T ) is the solution to: ) ( N −1 1 X ∆ T 2 khn,p N AA hn,p k (57.67) U1 = argmin N N 1 n=0 AT A=IM 0 where A is M × M 0 and where we are denoting the optimal A by U1 . We can again rewrite this problem in the following form by introducing a second M ×M 0 variable Z: ( ) N −1 1 X ∆ T 2 (U1 , Z1 ) = argmin khn,p N AZ hn,p k (57.68) N N 1 n=0 AT A=IM 0 ,Z The solution again satisfies U1 = Z1 , where U1 consists of the leading M 0 columns of U . The proof is similar to the one used in Appendix 57.B.

Sparse formulation Now, let {zm } denote the columns of Z. Motivated by the above considerations, one formulation of sparse PCA is the following optimization problem with elasticnet regularization imposed on the columns of Z to induce sparsity:

57.4 Sparse PCA

2401



(U1 , Z1 ) = (57.69) ( N −1 ) 0 0 M M X X X T 2 2 argmin khn,p N AZ hn,p k + α kzm k1 + ρ kzm k AT A=IM 0 ,Z

n=0

m=1

m=1

where A is M × M 0 , α > 0, and ρ ≥ 0. Once the columns {zm } of Z1 are determined, we set U1 = Z1 and construct the reduced features by using h0n = U1T hn,p ,

n = 0, 1, . . . , N N 1

(57.70)

To solve (57.69), we denote the cost function by P (A, Z) =

N −1 X n=0

0

T

2

khn,p N AZ hn,p k + α

M X

m=1

0

kzm k1 + ρ

M X

m=1

kzm k2

M0 M0

2 X X

T = Hp N Hp ZA + α kzm k1 + ρ kzm k2 F

m=1

m=1

(57.71)

where the {hn,p } have been collected into the matrix Hp , as was defined earlier in (57.35). This function is not convex over the combined variables {A, Z}; it is, however, convex over one variable if the other remains fixed. For this reason, we will seek the solution {U1 , Z1 } by alternating between minimizing over Z with A fixed, and minimizing over A with Z fixed: (a) (Minimizing over Z) Assume that at iteration k N 1 the value of A is fixed at Ak−1 . We denote the columns of Ak−1 by {am,k−1 } for m = 1, 2, . . . , M 0 . Let A⊥ denote the M × (M N M 0 ) matrix with orthonormal columns such that  ∆  Θ = Ak−1 A⊥ is M × M orthogonal (57.72) The matrix A⊥ is not needed to find the solution, but is only introduced for the convenience of the argument. Since orthogonal transformations do not affect the Frobenius norm, we have P (Ak−1 , Z) M0 M0

2 X X

T kzm k1 + ρ kzm k2 = Hp Θ N Hp ZAk−1 Θ + α F

= kHp Ak−1 N =

M0 X

m=1

(

Hp Zk2F

+

m=1

kHp A⊥ k2F

(57.73)

m=1

+ α

M0 X

m=1

kzm k1 + ρ

αkzm k1 + ρkzm k2 + kym N Hp zm k2

)

+ cte

0

M X

m=1

kzm k2

2402

Principal Component Analysis



where we defined the N × 1 columns ym = Hp am,k−1 , and where terms independent of {zm } are grouped into the constant factor. We observe that the optimization over {zm } decouples into M 0 separate optimization problems, one for each column: ( ) ∆

zbm = argmin z∈IRM

αkzk1 + ρkzk2 + kym N Hp zk2

m = 1, 2, . . . , M 0

(57.74)

Each one of these problems is a LASSO (least absolute shrinkage and selection operator) problem, which we already know how to solve. For instance, earlier in Examples 14.1 and 15.3 we described algorithms for determining the solution zbm based on subgradient and proximal gradient iterations. In particular, repeating the argument from the latter example, we find that we can learn zbm by applying repeatedly the following construction over some iteration index i until sufficient convergence is attained. Start from r−1 arbitrary, and repeat using a step size µ > 0 the iterated soft-thresholding algorithm (ISTA):  yi = (1 N 2µρ)ri−1 + 2µHpT (ym N Hp ri−1 ), i ≥ 0 (57.75) ri = Tµα (yi ) where Tβ (·) is the soft-thresholding operator defined by (11.18), namely,   x N β, if x ≥ β ∆ Tβ (x) = (57.76) 0, if N β < x < β  x + β, if x ≤ Nβ

When x is vector-valued, the operation Tβ (x) is applied to the individual entries of x and the result is a vector of soft-thresholded values. Once iteration (57.75) approaches convergence, we set zbm ← ri . We repeat the same construction for all columns of Z. The result is the iterate Zk with columns {b zm }. (b) (Minimizing over A) Now assume Z is fixed at Zk and let us explain how to update A from Ak−1 to Ak . Remember that A is constrained to satisfy AT A = IM 0 . The cost is now given by (excluding all terms that are independent of A):

2

P (A, Zk ) = Hp N Hp Zk AT (57.77) F n T  o = Tr Hp N Hp Zk AT H p N H p Z k AT = Tr(HpT Hp ) N 2 Tr(AZkT HpT Hp ) + Tr(AZkT HpT Hp Zk AT )

T = Tr(HpT Hp ) N 2 Tr(AZkT HpT Hp ) + Tr(ZkT HpT Hp Zk A | {zA}) =IM 0

=

Tr(HpT Hp )

N

2 Tr(AZkT HpT Hp )

+

Tr(ZkT HpT Hp Zk )

57.4 Sparse PCA

2403

and we are reduced to solving Ak = argmax Tr(AZkT HpT Hp )

(57.78)

AT A=IM 0

We introduce the SVD: ∆

ZkT HpT Hp = Ua



Σa



0

(M 0 × M )

VaT ,

(57.79)

where Ua is M 0 ×M 0 orthogonal, Va is M ×M orthogonal, and Σa is M 0 ×M 0 diagonal. Then, we have   Ak = argmax Tr AUa Σa

0

AT A=IM 0

  = argmax Tr VaT AUa Σa AT A=IM 0

  = argmax Tr XUa Σa

(a)

0

X=VaT A



VaT

0







(57.80)

where in step (a) we introduced the matrix X = VaT A. Let us ignore for now the requirement that A needs to satisfy the normalization AT A = IM 0 . 0 Let {xT m } denote the rows of the M × M matrix X and let {um } denote 0 0 the columns of the M × M matrix Ua . Let also {σm } denote the diagonal entries of Σa , which are nonnegative. Then, it holds that 0

Ak = argmax {xm }

M X

σm xT m um

(57.81)

m=1

By the Cauchy–Schwarz inequality, each term on the right-hand side is maximized when xm and um are aligned and satisfy xT m = um , in which case UaT = VaT A and, hence, Ak = Va UaT is orthogonal

(57.82)

In summary, we arrive at listing (57.83) for the sparse PCA procedure. In this construction, the “optional” matrix Z is approximated iteratively. At the end of the procedure, we scale its columns to unit norm to ensure proper normalization.

2404

Principal Component Analysis

Sparse PCA for solving (57.69). given N feature vectors {hn ∈ IRM }; transform them to {hn,p ∈ IRM } according to procedure (57.6); collect the {hn,p } into the N × M matrix Hp defined by (57.35); objective: determine an M × M 0 principal component matrix U1 0 to reduce the {hn,p } into {h0n = U1T hn,p ∈ IRM }; start from arbitrary M × M 0 initial conditions A−1 and Z−1 ; denote the columns of {A, Z} by {am , zm }, for m = 1, . . . , M 0 repeat until convergence over k = 0, 1, 2, . . .: update each column zm,k by solving the LASSO problem (57.74), e.g., by running (57.75) for sufficient iterations

(57.83)

set columns of Zk to {zm,k }, for m = 1, 2, . . . , M 0   introduce the SVD: ZkT HpT Hp = Ua Σa 0 VaT set Ak = Va UaT end set U1 ← Zk after convergence; normalize each column of U1 to unit norm; (encoding) h0n = U1T hn,p , ∀ n 0 b ¯ (decoding) hn = h + SU1 hn , ∀ n

57.5

PROBABILISTIC PCA The PCA solution is a deterministic procedure that operates on the feature data 0 {hn ∈ IRM } and transforms them into the reduced vectors {h0n ∈ IRM }, as was described by listing (57.34). If desired, the algorithm allows us to “recover” the original features from the {h0n } by using relation (57.33), namely, b ¯ + SU1 h0 hn = h n

(57.84)

¯ is the ensemble average, S is an M × M diagonal matrix of standard where h deviations and, more importantly, U1 has dimensions M × M 0 .

57.5.1

Low-Rank Approximation We already know from (57.13) that the M 0 × M 0 sample covariance matrix of the reduced features {h0n } is equal to Λ1 . For convenience, we denote the M × M

57.5 Probabilistic PCA

2405

sample covariance matrices of the centered features {hn } and their predictions {b hn } by N −1 1 X ¯ n N h) ¯ T (hn N h)(h N N 1 n=0



Rh

=

bh R

=



(57.84)

=

N −1 1 X b ¯ b ¯ T (hn N h)( hn N h) N N 1 n=0

SU1 Λ1 U1T S T

(57.85a)

(57.85b)

bh (this conclusion is The PCA solution is effectively approximating Rh by R consistent with our earlier explanation based on (57.56) and using Rp ): bh = SU1 Λ1 U1T S T Rh ≈ R

(57.86)

where the product on the right-hand side has rank M 0 , while the rank of Rh is generally M . Another way to express this result is to introduce the tall matrix ∆

1/2

W = SU1 Λ1

(M × M 0 )

(57.87)

of size M × M 0 and rank M 0 . Then, relation (57.86) amounts to a low-rank factorization of the form: Rh ≈ W W T ,

57.5.2

W : M × M 0 , rank(W ) = M 0

(57.88)

Latent Model Motivated by this observation, we now introduce an alternative formulation for dimensionality reduction that relies on learning a stochastic generative model for the feature data in terms of lower-dimensional latent variables. The resulting algorithm is known as probabilistic PCA. We start by assuming that the feature vectors {hn } arise from some unknown generative Gaussian model of the following form with an M 0 × 1 Gaussian latent variable z (see Prob. 57.6): z ∼ Nz (0, IM 0 )

(57.89a)

h = Wz + µ + v

(57.89b)

v ∼ Nv (0,

(57.89c)

σh2 IM )

for some M ×M 0 matrix W , scalar σh2 , and M ×1 vector µ. Under this model, for each realization hn , a latent variable zn is sampled from Nz (0, IM 0 ) and a noise perturbation vn is sampled from Nv (0, σh2 IM ). The sampled values are then used to generate hn according to (57.89b). The latent variable zn in this stochastic model will play a role similar to the lower-dimensional feature vector h0n in the deterministic treatment. Model (57.89a)–(57.89c) is a special case of latent factor models. The main difference is that the covariance matrix of the noise component is set to a multiple

2406

Principal Component Analysis

of the identity matrix in (57.89c), whereas in factor analysis it is set to an arbitrary positive-definite diagonal matrix, say, v ∼ Nv (0, Dv ). The special noise covariance used in (57.89c) allows for a closed-form solution for the probabilistic PCA problem studied in this section, as shown in future statement (57.105). In contrast, under the more general factor analysis model, we would need to resort to iterative solutions such as the expectation-maximization (EM) procedure described further below in Section 57.5.4 (see Prob. 57.9). One important observation that follows directly from the latent model (57.89a)– (57.89c) is that the model implicitly assumes a low-rank representation for the actual covariance matrix of h. Indeed, note that Eh = WEz + µ + Ev = µ

(57.90a)

and ∆

R = E (h N µ)(h N µ)T

= E (W z + v)(W z + v)T

= W W T + σh2 IM

(57.90b)

This last expression shows that R is a low-rank modification of σh2 IM since W W T has rank M 0 ; the result is similar to (57.88) when σh2 → 0. For this reason, we will be able to recover the deterministic PCA solution from probabilistic PCA by setting σh2 → 0. The objective of probabilistic PCA is twofold: to identify the parameters (µ, σh2 , W ) of the latent model that explains the feature data {hn }, and to perform dimensionality reduction by recovering the latent variables {b zn } and replacing each hn by the corresponding zbn . The estimate for each latent variable z will be obtained by solving a least-squares problem of the form: c zk2 zb = argmin kh N µ bNW

(57.91)

cTW c )−1 W c T (h N µ zb = (W b)

(57.92a)

b c zb + µ h=W b

(57.92b)

z∈IRM

0

c }. The solution which is written in terms of estimated model parameters {b µ, W is given by

Conversely, given zb, we can “recover” or estimate the original feature vector by using The last two relations allow us to move back and forth between the latent domain {b zn } and the original domain {b hn }, but only after the model parameters (µ, W ) have been estimated. We explain in the next section how to estimate these parameters.

57.5 Probabilistic PCA

57.5.3

2407

Estimating Model Parameters We show in expression (57.127) in the appendix that under model (57.89a)– (57.89c) the distribution for h is Gaussian, h ∼ Nh (µ, R). Now, assume we are given N independent feature vectors {hn } arising from this distribution. We already know from Prob. 31.4 that we can estimate the mean and covariance matrix of fh (h) by using the sample calculations: N −1 X ∆ 1 ¯ = hn µ b = h N n=0 ∆

Rh =

(57.93)

N −1 1 X (hn N µ b)(hn N µ b) N N 1 n=0

(57.94)

Let {λ1 ≥ λ2 ≥ . . . ≥ λM } denote the ordered eigenvalues for the actual covariance matrix R from (57.90b), factored into its eigen-decomposition: R = U ΛU T

(57.95)

where U is the M × M orthonormal matrix of eigenvectors. Similarly, let {σ1 ≥ σ2 ≥ . . . ≥ σM 0 > 0} denote the ordered nonzero singular values of the M × M 0 sought-after matrix W of rank M 0 . It follows that the eigenvalues of W W T are 2 0 {σ12 ≥ σ22 ≥ . . . ≥ σM 0 , 0, . . . , 0} with M N M trailing zeros. It is also clear from T 2 R = W W + σh IM that the eigenvalues of R and W W T satisfy the relations 2 λ m = σm + σh2 , m = 1, 2, . . . , M 0

λm =

σh2 ,

0

m = M + 1, . . . , M

(57.96) (57.97)

That is, the last M N M 0 eigenvalues of R must coincide with each other and be equal to σh2 . Therefore, one way to estimate σh2 is to average the smallest M NM 0 eigenvalues of the sample covariance matrix Rh : σ bh2 =

1 M N M0

M X

λm (Rh )

(57.98)

m=M 0 +1

where the notation λm (Rh ) refers to the ordered eigenvalues of Rh . The reason for using Rh instead of R is because the latter is not known, while Rh serves as an estimate for it. Furthermore, using R N σh2 IM = W W T and the eigen-decomposition R = U ΛU T we have U (Λ N σh2 IM )U T = W W T

(57.99)

The trailing M N M 0 entries of the diagonal matrix Λ N σh2 IM are zero. We therefore introduce the partitions     Λ1 0 U = U1 U2 , Λ = (57.100) 0 0

2408

Principal Component Analysis

where U1 contains the leading M 0 columns of U and Λ1 is M 0 × M 0 . Then, relation (57.99) is equivalent to U1 (Λ1 N σh2 IM 0 )U1T = W W T

(57.101)

or, upon splitting the diagonal matrix into two square-root factors: U1 (Λ1 N σh2 IM 0 )1/2 (Λ1 N σh2 IM 0 )1/2 U1T = W W T {z } | {z } | ∆

(57.102)



= AT

=A

We conclude from the result of Prob. 1.46 that the matrices A and W are related as follows: W = U1 (Λ1 N σh2 IM 0 )1/2 V

(57.103)

for any M 0 × M 0 orthogonal matrix, V . In practice, we replace σh2 by its approximation σ bh2 and compute (U1 , Λ1 ) from the eigen-decomposition of Rh . Thus, let Rh = U ΛU T (using the same symbols to avoid an explosion of notation). Then, c = U1 (Λ1 N σ W bh2 IM 0 )1/2 V

(57.104)

where we are free to select V , e.g., V = IM 0 . We therefore arrive at solution (57.105), which is referred to as the probabilistic PCA problem.

Probabilistic PCA. assumed latent model (57.89a)–(57.89c); given N feature realizations {hn } arising from this model; objective: estimate {µ, σh2 , W } and latent variables {zn }. compute: N −1 1 X µ b= hn N n=0 N −1 1 X Rh = (hn N µ b)(hn N µ b) N N 1 n=0 Rh = U ΛU T , with ordered eigenvalues λ1 ≥ λ2 ≥ . . . ≥ λM Λ1 = leading M 0 × M 0 submatrix of Λ with largest eigenvalues U1 = leading M 0 columns of U M X 1 2 σ bh = λm M N M0 0 m=M +1

select any orthogonal M 0 × M 0 matrix V , such as V = IM 0

c = U1 (Λ1 N σ W bh2 IM 0 )1/2 V T zbn = V (Λ1 N σ bh2 IM 0 )−1/2 U1T (hn N µ b), ∀ n b c zbn + µ hn = W b, ∀ n end

(encoding) (decoding) (57.105)

57.5 Probabilistic PCA

2409

We explain in Appendix 57.A that procedure (57.105) for estimating (µ, σh2 , W ) corresponds to a maximum-likelihood (ML) solution.

57.5.4

Expectation-Maximization Solution An alternative method to estimate the parameters (µ, σh2 , W ) of the latent model (57.89a)–(57.89c) is to appeal to the EM algorithm (32.38). Although we will develop the procedure by focusing on the latent model (57.89a)–(57.89c), where the noise covariance has the special form σh2 IM , the EM approach can be applied to more general latent factor models where the noise covariance can be set to arbitrary positive-definite diagonal matrices. To apply the EM procedure, we first need to determine the conditional pdf of the latent variable z given the observation h. Using the joint Gaussian pdf shown in future expression (57.126) and the expressions from the first column of Table 4.1, we find that, after some algebra involving the matrix inversion lemma (see Prob. 57.8): fz|h (z|h) = Nz (¯ z , σh2 C −1 ) ∆

z¯ = C ∆

−1

(57.106a)

T

W (h N µ)

(57.106b)

C = σh2 IM 0 + W T W

(57.106c)

Moreover, in view of the Bayes rule, the joint pdf of (z, h) can be expressed as fz,h (z, h) = fh|z (h|z) fz (z)

so that

∝p

(57.107)

o n 1 o n 1 1 2 2 kzk kh N W z N µk exp N × exp N n 2σh2 2 (2πσh2 )M

ln fz,h (z, h) = N

1 1 M ln(2πσh2 ) N 2 kh N µk2 N 2 z T W T W z + 2 2σh 2σh 1 T T 1 (57.108) z W (h N µ) N kzk2 + cte σh2 2

Expectation step Computing the expectation of the terms involving z in ln fz,h (z, h) using the 2 conditional pdf (57.106a) evaluated at the estimated parameters (µm−1 , σh,m−1 , Wm−1 ) available at iteration m N 1 we get 2 T Cm−1 = σh,m−1 IM 0 + Wm−1 Wm−1

(57.109a)



−1 T E z = Cm−1 Wm−1 (h N µm−1 ) = z¯m−1 ∆

−1 2 E kzk2 = σh,m−1 Tr(Cm−1 ) + k¯ zm−1 k2 = a2m−1

(57.109b) (57.109c)

2410

Principal Component Analysis

and o n  −1 2 T E z T W T W z = Tr W T W σh,m−1 Cm−1 + z¯m−1 z¯m−1 | {z } ∆

= Bm−1

= Tr(W Bm−1 W T )

(57.109d)

Observe that the value of z¯m−1 and, consequently, the values of a2m−1 and Bm−1 , depend on h. For this reason, we will write {¯ zn,m−1 , a2n,m−1 , Bn,m−1 } with an added subscript n when h = hn . Substituting into (57.108) gives, for a generic vector h: n o ∆ Q(h) = E z|h ln fz,h (z, h) = N

(57.110)

1 1 M ln(2πσh2 ) N 2 kh N µk2 N 2 Tr(W Bm−1 W T ) + 2 2σh 2σh 1 T 1 z¯m−1 W T (h N µ) N a2m−1 + cte 2 σh 2

Maximization step In the maximization step, we update the parameters by solving

2 (µm , σh,m , Wm )

= argmax 2 ,W ) (µ,σh

= argmax

( N −1 X n=0

2 ,W ) (µ,σh

( N −1 X

)

Q(hn )

n=0

(57.111)

1 1 M ln(2πσh2 ) N 2 khn N µk2 N 2 Tr(W Bn,m−1 W T ) 2 2σh 2σh !) 1 T 1 2 T + 2 z¯n,m−1 W (hn N µ) N an,m−1 σh 2

N

Differentiating (57.111) relative to µ, σh2 , and W and setting the gradients to 2 zero at (µm , σh,m , Wm ) gives the coupled EM equations: µm =

Wm =

N −1  1 X hn N Wm z¯n,m−1 N n=0 N −1 X n=0

2 σh,m =

(hn N

T µm )¯ zn,m−1

!

(57.112a) N −1 X n=0

Bn,m−1

!−1

(57.112b)

N −1  1 X T T T khn N µm k2 N 2¯ zn,m−1 Wm (hn N µm ) + Tr(Wm Bn,m−1 Wm ) N M n=0

(57.112c)

57.6 Commentaries and Discussion

2411

One way to solve these coupled equations approximately is to replace µm by the ¯ for all m. A second way is to replace Wm in the expression for sample mean h µm by Wm−1 . In summary, we arrive at listing (57.113).

EM-based solution for probabilistic PCA.

assumed latent model (57.89a)–(57.89c);
given N feature realizations \{h_n\} arising from this model;
objective: estimate parameters \{\mu, \sigma_h^2, W\} and latent variables \{z_n\};
given initial conditions \{\sigma_{h,0}^2, W_0\}.

compute: \bar{h} = \frac{1}{N}\sum_{n=0}^{N-1} h_n; set \mu_0 = \bar{h}
repeat until convergence for m \geq 1:
    (expectation step)
    C_{m-1} = \sigma_{h,m-1}^2 I_{M'} + W_{m-1}^T W_{m-1}
    \bar{z}_{n,m-1} = C_{m-1}^{-1} W_{m-1}^T (h_n - \mu_{m-1}), \forall n
    B_{n,m-1} = \sigma_{h,m-1}^2 C_{m-1}^{-1} + \bar{z}_{n,m-1}\bar{z}_{n,m-1}^T, \forall n
    (maximization step)
    \mu_m = \bar{h}
    W_m = \Big( \sum_{n=0}^{N-1} (h_n - \mu_m)\, \bar{z}_{n,m-1}^T \Big) \Big( \sum_{n=0}^{N-1} B_{n,m-1} \Big)^{-1}
    \sigma_{h,m}^2 = \frac{1}{NM} \sum_{n=0}^{N-1} \Big( \|h_n - \mu_m\|^2 - 2\bar{z}_{n,m-1}^T W_m^T (h_n - \mu_m) + \text{Tr}(W_m B_{n,m-1} W_m^T) \Big)
end
return \{\hat{\mu}, \hat{\sigma}_h^2, \hat{W}\} \leftarrow \{\mu_m, \sigma_{h,m}^2, W_m\}
\hat{z}_n = (\hat{W}^T \hat{W})^{-1} \hat{W}^T (h_n - \hat{\mu}), \forall n   (encoding)
\hat{h}_n = \hat{W} \hat{z}_n + \hat{\mu}, \forall n   (decoding)
(57.113)
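The listing translates directly into code. The following is a minimal NumPy sketch of the EM iterations in (57.113); the function name, array layout (rows of H are the h_n), and the fixed iteration count are illustrative choices and not part of the original listing.

```python
import numpy as np

def ppca_em(H, Mp, num_iters=200, sigma2_init=1.0, seed=0):
    """EM iterations for probabilistic PCA, following listing (57.113).
    H: N x M data matrix with rows h_n; Mp: latent dimension M'."""
    rng = np.random.default_rng(seed)
    N, M = H.shape
    mu = H.mean(axis=0)                      # sample mean, used as mu_m at every iteration
    Hc = H - mu                              # centered features h_n - mu
    W = rng.standard_normal((M, Mp))         # W_0
    sigma2 = sigma2_init                     # sigma_{h,0}^2
    for _ in range(num_iters):
        # expectation step
        C = sigma2 * np.eye(Mp) + W.T @ W                # C_{m-1}
        Cinv = np.linalg.inv(C)
        Zbar = Hc @ W @ Cinv                             # rows are zbar_{n,m-1}
        B_sum = N * sigma2 * Cinv + Zbar.T @ Zbar        # sum_n B_{n,m-1}
        # maximization step
        W = (Hc.T @ Zbar) @ np.linalg.inv(B_sum)         # W_m
        sigma2 = (np.sum(Hc**2) - 2 * np.sum((Hc @ W) * Zbar)
                  + np.trace(W.T @ W @ B_sum)) / (N * M)  # sigma_{h,m}^2
    return mu, sigma2, W
```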

57.6 COMMENTARIES AND DISCUSSION

Principal component analysis. PCA was originally developed by the English statistician Karl Pearson (1857–1936), who, along with Ronald Fisher (1890–1962), is regarded as one of the founders of modern-day mathematical and applied statistics – see the exposition by Tankard (1984). In the original work by Pearson (1901), he employed a


different terminology and referred instead to "principal axes of correlation" and "best plane fits." The notion of principal axes also appears briefly in the text by Galton (1889, fig. 11, p. 101) authored by the English statistician Francis Galton (1822–1911), who is credited with the introduction of the statistical notions of "correlation" and "regression lines." The actual terminology of "principal components" was introduced later by Hotelling (1933, 1935), who was apparently unaware of the work by Pearson (1901) and devised independently the PCA procedure. Useful overviews of the history of PCA and several subsequent developments are given in the articles by Burt (1949) and Abdi and Williams (2010), and the text by Jolliffe (2002). For further treatment of PCA and several of its aspects, readers may consult, for example, Anderson (1963), Rao (1964), Gabriel (1971), Jackson and Mudholkar (1979), Moore (1981), Oja (1982, 1983, 1992), Diamantaras and Kung (1996), Johnstone (2001), and Yeung and Ruzzo (2001). We explained in the body of the chapter that one important difference between the FDA solution (56.44) and the PCA procedure is that FDA takes the class variables \{\gamma(n)\} into account while PCA relies solely on the feature data \{h_n\}. This does not necessarily mean that the performance of FDA will always be superior to PCA – examples and experimental results in the context of image recognition problems appear in the article by Martinez and Kak (2001), where it is shown that PCA tends to outperform FDA when the number of available training samples per class is small.

Sparse, probabilistic, and robust PCA. Sparse PCA is introduced in Zou, Hastie, and Tibshirani (2006), and also discussed in Shen and Huang (2008) and in the overview article by Zou and Xue (2018). The probabilistic interpretation of PCA discussed in Section 57.5 is from Tipping and Bishop (1999). This formulation has close connections to factor analysis, which is a popular method in statistics where the unobserved/latent variables are referred to as factors. Factor analysis has encountered wide applications in the social and life sciences – see, e.g., Harman (1976), Gorsuch (1983), Child (2006), and Mulaik (2009). Robust versions of PCA are described in Chandrasekaran et al. (2009) and Candes et al. (2009). In robust PCA, the data matrix H_p in (57.48) is expressed as the sum of two components, H_p = L + S, where L is low rank and S is sparse. One then seeks to learn L and S simultaneously by solving a regularized problem of the form

\hat{H}_p \stackrel{\Delta}{=} \arg\min_{L,S} \Big\{ \alpha \|L\|_* + \|S\|_1 \Big\}, subject to H_p = L + S    (57.114)

in terms of the nuclear norm of L and the entry-wise \ell_1-norm of S, and where \alpha > 0 is a regularization parameter. If noise is present and H_p = L + S + noise, then one can consider replacing the above problem by (see Prob. 57.11):

\hat{H}_p \stackrel{\Delta}{=} \arg\min_{L,S} \Big\{ \alpha \|L\|_* + \lambda \|S\|_1 + \frac{1}{2}\|H_p - L - S\|_F^2 \Big\}    (57.115)

where \{\alpha, \lambda\} are regularization parameters. Nonlinear and kernel versions of PCA are proposed in Scholkopf, Smola, and Muller (1998, 1999) and Mika et al. (1999b); a nonlinear version of PCA was motivated briefly in Example 40.7, while a kernel-based PCA solution will be described in Section 63.8. We will also explain in the comments after expression (60.56) that the Oja rule (60.56) for Hebbian learning, proposed by Oja (1982, 1983), converges toward an estimate for the unit-norm eigenvector of the feature covariance matrix corresponding to its largest eigenvalue. In other words, there is a strong connection between the Oja rule and PCA. In effect, the Oja rule can be used to approximate the first column of U in (57.20) by solving a problem similar to (57.19).

Karhunen–Loève transform. PCA is a special case of a more general theory known as the Karhunen–Loève theory for the representation of stochastic processes as linear


combinations of orthogonal basis functions. By retaining the most prominent basis functions, one ends up revealing the most expressive components of the underlying random process in a manner similar to the principal components in PCA. The Karhunen–Loève theory was developed independently by Kosambi (1943), Karhunen (1946), and Loève (1946, 1955). The theory became widely known as the Karhunen–Loève theory mainly because the contribution of Kosambi (1943) was discovered only years later – see the account by Narasimha (2011). It is not unusual to find the PCA procedure referred to in some fields as the Karhunen–Loève transform, or as the Hotelling transform due to the work by Hotelling (1933, 1935). We provide, for the benefit of the reader, a brief review of the Karhunen–Loève transform. Consider a zero-mean mean-square-integrable random process x(t) defined over some interval [a, b], namely,

E\Big\{ \int_a^b x^2(t)\,dt \Big\} < \infty    (57.116)

Assume the process has zero mean and denote its symmetric and continuous covariance function by:

R_x(t,s) \stackrel{\Delta}{=} E\, x(t) x(s) = R_x(s,t)    (57.117)

We can associate a linear operator with this covariance function, one that maps square-integrable functions f(t) to square-integrable functions g(t) through the integral transformation:

g(t) \stackrel{\Delta}{=} \int_a^b R_x(t,s) f(s)\,ds    (57.118)

Since R_x(t,s) is symmetric, the above operator is self-adjoint. This means the following. Let first

f'(t) \stackrel{\Delta}{=} \int_a^b R_x(t,s) f(s)\,ds, \qquad h'(t) \stackrel{\Delta}{=} \int_a^b R_x(t,s) h(s)\,ds    (57.119)

where, for convenience, we are using the prime notation to denote the result of the linear operator. Then, the property of a self-adjoint operator means that:

\int_a^b f'(t) h(t)\,dt = \int_a^b f(t) h'(t)\,dt    (57.120)

It follows from the spectral theory of self-adjoint operators that the linear mapping (57.118) has real eigenvalues and a countable set of orthonormal eigenfunctions – see, e.g., Naylor and Sell (1982), Kreyszig (1989), and Bachman and Narici (1998). We denote the eigenvalues and eigenfunctions by the symbols \lambda_n and e_n(t); they are defined by the relations:

\lambda_n e_n(t) = \int_a^b R_x(t,s) e_n(s)\,ds    (57.121a)
1 = \int_a^b e_n^2(t)\,dt    (57.121b)
0 = \int_a^b e_n(t) e_m(t)\,dt, \quad n \neq m    (57.121c)

so that e_n(t) is mapped into a multiple of itself. The Karhunen–Loève theorem asserts that the stochastic process x(t) can be represented as the following convergent series (see Prob. 57.14):

x(t) = \sum_{n=1}^{\infty} A_n e_n(t)    (57.122)


where the Karhunen–Loève coefficients, denoted by A_n, are real random variables that satisfy

A_n \stackrel{\Delta}{=} \int_a^b x(t) e_n(t)\,dt    (57.123a)
E\, A_n = 0    (57.123b)
E\, A_n^2 = \lambda_n    (57.123c)
E\, A_n A_m = 0, \quad for n \neq m    (57.123d)

By retaining only the most expressive Karhunen–Loève coefficients, say, the ones with the largest variance values, \lambda_n, we obtain the Karhunen–Loève approximation for x(t). For example, if we assume the eigenvalues \{\lambda_n\} are ordered from largest to smallest, then retaining the leading L eigenvalues leads to the approximation

x(t) \approx \sum_{n=1}^{L} A_n e_n(t)    (57.124)

When specialized to the case of a zero-mean finite-dimensional discrete random variable, xn , with covariance matrix Rx , the above construction reduces to the PCA.
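In the finite-dimensional case, the construction amounts to an eigendecomposition of the covariance matrix. The snippet below is a small illustrative sketch (not from the text, with arbitrary sizes and variable names) showing how the discrete Karhunen–Loève coefficients of zero-mean vectors reduce to PCA projections, and how the truncated expansion in the spirit of (57.124) behaves.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, L = 2000, 5, 2                          # samples, dimension, retained terms
X = rng.standard_normal((N, M)) @ rng.standard_normal((M, M))
X -= X.mean(axis=0)                           # enforce zero mean

Rx = (X.T @ X) / (N - 1)                      # sample covariance matrix
lam, E = np.linalg.eigh(Rx)                   # eigenvalues/eigenvectors (ascending)
idx = np.argsort(lam)[::-1]                   # reorder from largest to smallest
lam, E = lam[idx], E[:, idx]

A = X @ E                                     # discrete Karhunen-Loeve coefficients per sample
X_approx = A[:, :L] @ E[:, :L].T              # truncated expansion, analog of (57.124)
print("mean-square reconstruction error:", np.mean((X - X_approx) ** 2))
print("sum of discarded eigenvalues / M :", lam[L:].sum() / M)   # should be comparable
```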

PROBLEMS

57.1 What is the rank of \hat{H}_p in (57.36)?

57.2 Consider a collection of N feature vectors \{h_{n,p} \in \mathbb{R}^M\}, which are assumed to have already been preprocessed according to (57.6). We wish to replace them by lower-dimensional vectors \{h'_n \in \mathbb{R}^{M'}\} by seeking an M \times M' matrix transformation U_1 such that h'_n = U_1^T h_{n,p} and, moreover, h_{n,p} can be recovered from h'_n by estimating \hat{h}_{n,p} = U_1 h'_n. We require the columns of U_1 to be orthonormal, i.e., U_1^T U_1 = I_{M'}. We explained in the body of the chapter how to determine U_1. Show that U_1 is also the solution to the following optimization problem:

U_1 = \arg\min_{A \in \mathbb{R}^{M \times M'}} \Big\{ \frac{1}{N-1} \sum_{n=0}^{N-1} \|h_{n,p} - A A^T h_{n,p}\|^2 \Big\}, subject to A^T A = I_{M'}

57.3 Refer to the derivation of the PCA procedure. Consider M-dimensional feature vectors, h_n \in \mathbb{R}^M, and assume we reduce the feature space to dimension M' = M - 1. Accordingly, we partition the orthogonal matrix U into U = [U_1 \; x], where x \in \mathbb{R}^M denotes the last column of U and is discarded. We explained in (57.32), following a least-squares argument, that the projection of the feature vector, h_{n,p} \in \mathbb{R}^M, onto R(U_1) is given by \hat{h}_{n,p} = U_1 U_1^T h_{n,p}. In this problem, we want to reconcile this expression with the result from Prob. 56.6.
(a) Determine an equation for the hyperplane that corresponds to the column span of U_1 in the form h^T w - \theta = 0. That is, determine its normal vector, w \in \mathbb{R}^M, and its threshold \theta.
(b) Use the result of Prob. 56.6 to write down the projection of any vector h_{n,p} onto the hyperplane described by the parameters \{w, \theta\} from part (a).
(c) Argue that the expression in part (b) coincides with \hat{h}_{n,p} = U_1 U_1^T h_{n,p}.


57.4 Let h \in \mathbb{R}^M represent random feature data with zero mean and covariance matrix R_h = E\, h h^T. We pose the problem of seeking M' < M unit-norm and orthogonal directions \{u_m\} for m = 1, 2, \ldots, M' to solve

\min_{\{u_m\}} E\, \Big\| h - \sum_{m=1}^{M'} (h^T u_m) u_m \Big\|^2, subject to \|u_m\|^2 = 1, \; u_i^T u_j = 0

Verify that the cost function that is being minimized can be rewritten in the form

P(\{u_m\}) = \text{Tr}(R_h) - \sum_{m=1}^{M'} u_m^T R_h u_m

Determine expressions for the optimal \{u_m\} and show that the cost value ends up being P^o = \sum_{m=M'+1}^{M} \lambda_m(R_h) in terms of the sum of the smallest eigenvalues of R_h. Remark. See the text by Diamantaras and Kung (1996) for a related discussion.

57.5 Let h \in \mathbb{R}^M represent random feature data with zero mean and covariance matrix R_h = E\, h h^T. We pose the problem of seeking a unit-norm vector u to solve

\min_{u} E\, \big\| h - (h^T u) u \big\|^2, subject to \|u\|^2 = 1

Apply a stochastic projection gradient algorithm to justify the following online recursion for learning u recursively from streaming vectors \{h_n\} (the successive iterates for u are denoted by u_n):

z(n) = h_n^T u_{n-1}
u_n = u_{n-1} + \mu z(n) (h_n - z(n) u_{n-1})
u_n \leftarrow u_n / \|u_n\|

Remark. The second recursion is known as the Oja rule from Oja (1982, 1983); we will comment on it in the concluding remarks of Chapter 60 when we discuss its relation to Hebbian learning – see expression (60.56).

57.6 Refer to the latent model (57.89a)–(57.89c) and assume we replace the model for z by z \sim N_z(\bar{z}, R_z) with some possibly nonzero mean \bar{z} and positive-definite covariance matrix R_z. Show that the resulting latent model can be reduced to the normalized form (57.89a)–(57.89c) by redefining W \leftarrow W R_z^{1/2} and \mu \leftarrow W \bar{z} + \mu.

57.7 Show that the cost function that appears in (57.157) is always nonnegative in view of the Jensen inequality.

57.8 Establish the mean and covariance expressions that characterize the conditional pdf (57.106a)–(57.106c) of the latent variable z given the observation h.

57.9 We extend the latent model (57.89a)–(57.89c) by allowing a more general diagonal covariance matrix for the noise component, i.e.,

z \sim N_z(0, I_{M'}), \quad h = W z + \mu + v, \quad v \sim N_v(0, D_v), \; D_v > 0

Extend the derivation of the EM algorithm from Section 57.5.4 to this case.


57.10 Refer to the coupled EM equations (57.112a)–(57.112c) for probabilistic PCA. Assume the mean parameter \mu is known. Show that the EM expressions become:

\bar{z}_{n,m-1} = C_{m-1}^{-1} W_{m-1}^T (h_n - \mu)
B_{n,m-1} = \sigma_{h,m-1}^2 C_{m-1}^{-1} + \bar{z}_{n,m-1}\bar{z}_{n,m-1}^T
W_m = \Big( \sum_{n=0}^{N-1} (h_n - \mu)\, \bar{z}_{n,m-1}^T \Big) \Big( \sum_{n=0}^{N-1} B_{n,m-1} \Big)^{-1}
\sigma_{h,m}^2 = \frac{1}{NM} \sum_{n=0}^{N-1} \Big( \|h_n - \mu\|^2 - 2\bar{z}_{n,m-1}^T W_m^T (h_n - \mu) + \text{Tr}(W_m B_{n,m-1} W_m^T) \Big)

Eliminate \bar{z}_{n,m-1} and B_{n,m-1} from the last two expressions and show that they can be rewritten in the form:

W_m = P W_{m-1} \big( \sigma_h^2 I_{M'} + C_{m-1}^{-1} W_{m-1}^T P W_{m-1} \big)^{-1}
\sigma_{h,m}^2 = \frac{1}{M}\, \text{Tr}\big( P - P W_{m-1} C_{m-1}^{-1} W_m^T \big)

where P is the sample covariance matrix (57.129). Conclude that the data enters into the PCA procedure through P. Remark. For a related discussion, see Tipping and Bishop (1999).

57.11 Refer to formulation (57.115) for the robust PCA problem. Use the result of Prob. 11.12 to show that one recursive implementation for learning its solution is:

L_n = S_\alpha(H_p - S_{n-1})
S_n = T_\lambda(H_p - L_n)

where T_\beta(x) is the standard soft-thresholding operator (57.76) applied to the individual entries of its matrix argument, while S_\beta(X) is the singular value thresholding operator defined earlier in Prob. 11.12, namely, we introduce the SVD X = U\Sigma V^T and replace \Sigma by \Sigma_\beta where entries of \Sigma larger than \beta are retained while entries smaller than \beta are set to zero. Then, S_\beta(X) = U \Sigma_\beta V^T.

57.12 Consider the following \ell_2-regularized variation of problem (57.66):

(u_x, z_x) \stackrel{\Delta}{=} \arg\min_{\|u\|=1,\, z} \Big\{ \rho \|z\|^2 + \frac{1}{N-1} \sum_{n=0}^{N-1} \|h_{n,p} - u z^T h_{n,p}\|^2 \Big\}

Follow arguments similar to those used in Appendix 57.B to solve this problem.

57.13 Consider the following \ell_2-regularized variation of problem (57.68):

(U_x, Z_x) \stackrel{\Delta}{=} \arg\min_{A^T A = I_{M'},\, Z} \Big\{ \frac{1}{N-1} \sum_{n=0}^{N-1} \|h_{n,p} - A Z^T h_{n,p}\|^2 + \rho \|Z\|_F^2 \Big\}

Follow arguments similar to those used in Appendix 57.B to solve this problem.

57.14 Assume the validity of the Karhunen–Loève representation (57.122). Derive expressions (57.123a)–(57.123d) for the Karhunen–Loève coefficient, A_n. Introduce the partial sum s_N(t) = \sum_{n=1}^{N} A_n e_n(t) and show that

\lim_{N \to \infty} E\, |x(t) - s_N(t)|^2 = 0

Conclude that the series (57.122) converges in the mean-square-error sense.

57.15 Establish the validity of property (57.120).

57.16 A zero-mean random Wiener process is one with covariance function defined by R_x(t,s) = \min(t,s). Refer to the integral equation (57.121a) and assume a = 0 and b = 1. Show that the eigenvalue/eigenfunction pairs in this case are given by

\frac{1}{\lambda_n} = (n - 1/2)^2 \pi^2, \qquad e_n(t) = \sqrt{2}\, \sin\big( (n - 1/2)\pi t \big)


57.17 A zero-mean random Brownian process is one with covariance function defined by Rx (t, s) = min(t, s) − ts. Refer to the integral equation (57.121a) and assume a = 0 and b = 1. Determine expressions for λn and en (t).

57.A MAXIMUM-LIKELIHOOD SOLUTION

We derived in Section 57.5.3 a procedure for estimating the latent model parameters \{\mu, \sigma_h^2, W\}. We verify in this appendix that the procedure coincides with the ML solution for estimating the same parameters following the arguments from Tipping and Bishop (1999). To begin with, the latent model (57.89a)–(57.89c) assumes the Gaussian distributions:

z \sim N_z(0, I_{M'}), \qquad h|z \sim N_h(Wz + \mu, \sigma_h^2 I_M)    (57.125)

where the second distribution refers to the conditional pdf of h given z. Using the earlier result (4.93) we conclude that the joint pdf of (z, h) is Gaussian and given by

f_{z,h}(z,h) \sim N_{z,h}\left( \begin{bmatrix} 0 \\ \mu \end{bmatrix}, \begin{bmatrix} I & W^T \\ W & \sigma_h^2 I_M + W W^T \end{bmatrix} \right)    (57.126)

so that the marginal pdf for h is also Gaussian:

f_h(h) = N_h(\mu, R), \quad where R \stackrel{\Delta}{=} \sigma_h^2 I_M + W W^T    (57.127)

Now, given N iid realizations \{h_n\} arising from the assumed latent model, their log-likelihood function is given by (where we are adding the scaling by 1/N for convenience):

\ell(\mu, \sigma_h^2, W) = \frac{1}{N} \ln \left\{ \prod_{n=0}^{N-1} \frac{1}{\sqrt{(2\pi)^M}} \frac{1}{\sqrt{\det R}} \exp\Big\{ -\frac{1}{2}(h_n - \mu)^T R^{-1} (h_n - \mu) \Big\} \right\}
= -\frac{1}{2} \ln|\det R| - \frac{1}{2N} \sum_{n=0}^{N-1} (h_n - \mu)^T R^{-1} (h_n - \mu) + \text{cte}
= -\frac{1}{2} \ln|\det R| - \frac{1}{2} \text{Tr}(R^{-1} P) + \text{cte}    (57.128)

where we are using the letter P to refer to the sample covariance matrix

P \stackrel{\Delta}{=} \frac{1}{N} \sum_{n=0}^{N-1} (h_n - \mu)(h_n - \mu)^T    (57.129)

Compared with our earlier expression (57.85a) for R_h, we find that the latter employs the sample mean \bar{h} and is scaled by 1/(N-1) while P employs the actual mean \mu and is scaled by 1/N. As the derivation will reveal, we will end up using R_h to approximate P since \mu is unknown. Returning to (57.128), we formulate the ML optimization problem:

(\hat{\mu}, \hat{\sigma}_h^2, \hat{W}) = \arg\min_{\mu, \sigma_h^2, W} \Big\{ \ln|\det R| + \text{Tr}(R^{-1} P) \Big\}    (57.130)

where R and P are symmetric matrices.


Solution

Differentiating the cost in (57.130) relative to \mu and setting the gradient to zero at \mu = \hat{\mu} gives

\hat{\mu} = \frac{1}{N} \sum_{n=0}^{N-1} h_n = \bar{h}    (57.131)

That is, the parameter \mu is estimated by means of the sample mean of the feature vectors. Therefore, we can compute R_h and use it as an unbiased estimate for P when necessary. We know from (57.127) that R depends on W. Differentiating the cost function relative to W, using the results from parts (c) and (e) of Prob. 2.11 and part (e) of Prob. 2.12, and setting the gradient to zero at W = \hat{W}, gives

2\hat{R}^{-1}\hat{W} - 2\hat{R}^{-1} P \hat{R}^{-1} \hat{W} = 0    (57.132)

where \hat{R} = \sigma_h^2 I_M + \hat{W}\hat{W}^T, so that the desired \hat{W} should satisfy

\hat{W} = P \hat{R}^{-1} \hat{W}    (57.133)

Substituting the expression for \hat{R}, we are effectively reduced to a nonlinear equation in the sought-after variable \hat{W}:

\hat{W} = P \big( \sigma_h^2 I_M + \hat{W}\hat{W}^T \big)^{-1} \hat{W}    (57.134)

Obviously, \hat{W} = 0 is one trivial solution. However, we are interested in nontrivial solutions. To solve the equation, we represent the M \times M' matrix \hat{W} in terms of its reduced SVD, i.e., we write

\hat{W} = U_1 \Sigma_1 V^T    (57.135)

where

U_1: (M \times M'), \quad \Sigma_1: (M' \times M'), \quad V: (M' \times M')    (57.136)

with U_1 having orthonormal columns satisfying

U_1^T U_1 = I_{M'}    (57.137)

while V is orthogonal satisfying V^T V = V V^T = I_{M'}, and the diagonal entries of \Sigma_1 are denoted by \{\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_{M'} > 0\}. We will solve (57.134) by determining the factors (U_1, \Sigma_1, V) that satisfy the equation. The SVD representation facilitates finding a solution to (57.134). We start by substituting (57.135) into (57.134) to get

U_1 \Sigma_1 V^T = P \big( \sigma_h^2 I_M + U_1 \Sigma_1^2 U_1^T \big)^{-1} U_1 \Sigma_1 V^T    (57.138)

which, after canceling V and \Sigma_1 from both sides, leads to

P \big( \sigma_h^2 I_M + U_1 \Sigma_1^2 U_1^T \big)^{-1} U_1 = U_1    (57.139)

Applying the matrix inversion lemma (1.81) gives

\big( \sigma_h^2 I_M + U_1 \Sigma_1^2 U_1^T \big)^{-1} U_1 = \Big\{ \frac{1}{\sigma_h^2} I_M - \frac{1}{\sigma_h^2} U_1 \Big( \Sigma_1^{-2} + \frac{1}{\sigma_h^2} I_{M'} \Big)^{-1} U_1^T \frac{1}{\sigma_h^2} \Big\} U_1
= U_1 \Big\{ \frac{1}{\sigma_h^2} I_{M'} - \frac{1}{\sigma_h^2} \Big( \Sigma_1^{-2} + \frac{1}{\sigma_h^2} I_{M'} \Big)^{-1} \frac{1}{\sigma_h^2} \Big\}
= U_1 \big( \sigma_h^2 I_{M'} + \Sigma_1^2 \big)^{-1}    (57.140)


and, consequently,

P U_1 = U_1 \big( \sigma_h^2 I_{M'} + \Sigma_1^2 \big)    (57.141)

This equality allows us to conclude that the columns of U_1 are orthonormal eigenvectors for P, with the corresponding eigenvalues given by \{\sigma_h^2 + \sigma_m^2\}. This conclusion is not sufficient to identify U_1 because we still do not know which eigenvectors (or which eigenvalues) of P should be picked; although the algebraic argument leading to (57.103) already suggests that we should pick the M' columns of U_1 corresponding to the largest M' eigenvalues of P. Ignoring this fact for now, let us examine (57.141) more closely. Let \{\lambda_m, m = 1, 2, \ldots, M\} denote the eigenvalues of P without any particular ordering at this stage. We know from (57.141) that a collection of M' of these eigenvalues should satisfy

\lambda_m = \sigma_h^2 + \sigma_m^2, \quad m = 1, \ldots, M'    (57.142)

We index these eigenvalues \{\lambda_m\} starting from m = 1 up to m = M'. Observe that these eigenvalues are strictly larger than \sigma_h^2:

\lambda_m > \sigma_h^2, \quad m = 1, 2, \ldots, M'    (57.143)

The argument will show that these should coincide with the largest eigenvalues of P. Thus, if we collect the largest eigenvalues into the M' \times M' matrix \Lambda_1,

\Lambda_1 = \text{diag}\{ \lambda_1, \lambda_2, \ldots, \lambda_{M'} \}    (57.144)

then relation (57.141) implies that

\Sigma_1 = (\Lambda_1 - \sigma_h^2 I_{M'})^{1/2}    (57.145)

Likewise,

U_1 = leading M' columns of U corresponding to \Lambda_1    (57.146)

To establish that the retained eigenvalues of P are the largest ones based on the current argument, we return to the cost function in (57.130) and re-express it in terms of the eigenvalues. Note first that the M \times M matrix \hat{W}\hat{W}^T has rank M' with eigenvalues

\text{eig}(\hat{W}\hat{W}^T) \in \{ \sigma_1^2, \sigma_2^2, \ldots, \sigma_{M'}^2, 0, \ldots, 0 \}    (57.147)

Therefore, using the representation \hat{R} = \sigma_h^2 I_M + \hat{W}\hat{W}^T, the eigenvalues of \hat{R} are given by

\text{eig}(\hat{R}) \in \sigma_h^2 + \text{eig}(\hat{W}\hat{W}^T) = \{ \sigma_h^2 + \sigma_1^2, \ldots, \sigma_h^2 + \sigma_{M'}^2, \sigma_h^2, \ldots, \sigma_h^2 \} = \{ \lambda_1, \ldots, \lambda_{M'}, \sigma_h^2, \ldots, \sigma_h^2 \}    (57.148)

and, hence,

\det \hat{R} = \Big( \prod_{m=1}^{M'} \lambda_m \Big) \Big( \prod_{m=M'+1}^{M} \sigma_h^2 \Big)    (57.149)

It follows that

\ln|\det \hat{R}| = \sum_{m=1}^{M'} \ln(\lambda_m) + (M - M') \ln(\sigma_h^2)    (57.150)


Likewise, we have

\hat{R}^{-1} P = \big( \sigma_h^2 I_M + \hat{W}\hat{W}^T \big)^{-1} P
= \big( \sigma_h^2 I_M + U_1 \Sigma_1^2 U_1^T \big)^{-1} P
= \Big\{ \frac{1}{\sigma_h^2} I_M - \frac{1}{\sigma_h^2} U_1 \Big( \Sigma_1^{-2} + \frac{1}{\sigma_h^2} I_{M'} \Big)^{-1} U_1^T \frac{1}{\sigma_h^2} \Big\} P
\overset{(57.141)}{=} \frac{1}{\sigma_h^2} P - \frac{1}{\sigma_h^2} U_1 \Big( \Sigma_1^{-2} + \frac{1}{\sigma_h^2} I_{M'} \Big)^{-1} \big( \sigma_h^2 I_{M'} + \Sigma_1^2 \big) U_1^T \frac{1}{\sigma_h^2}
= \frac{1}{\sigma_h^2} P - \frac{1}{\sigma_h^2} U_1 \Sigma_1^2 U_1^T
= \frac{1}{\sigma_h^2} \big( P - \hat{W}\hat{W}^T \big)    (57.151)

so that

\text{Tr}(\hat{R}^{-1} P) = \frac{1}{\sigma_h^2} \Big\{ \text{Tr}(P) - \text{Tr}(\hat{W}\hat{W}^T) \Big\}    (57.152)

and, consequently,

\text{Tr}(\hat{R}^{-1} P) = \frac{1}{\sigma_h^2} \sum_{m=1}^{M} \lambda_m - \frac{1}{\sigma_h^2} \sum_{m=1}^{M'} \sigma_m^2
\overset{(57.142)}{=} \frac{1}{\sigma_h^2} \sum_{m=1}^{M} \lambda_m - \frac{1}{\sigma_h^2} \sum_{m=1}^{M'} (\lambda_m - \sigma_h^2)
= M' + \frac{1}{\sigma_h^2} \sum_{m=M'+1}^{M} \lambda_m    (57.153)

Combining this result with (57.150), we can rewrite the cost function appearing in (57.130) in the form

J(\sigma_h^2, \{\lambda_m\}) \stackrel{\Delta}{=} M' + \sum_{m=1}^{M'} \ln(\lambda_m) + (M - M') \ln(\sigma_h^2) + \frac{1}{\sigma_h^2} \sum_{m=M'+1}^{M} \lambda_m    (57.154)

Differentiating this cost relative to \sigma_h^2 and setting the gradient to zero at \hat{\sigma}_h^2 gives

\hat{\sigma}_h^2 = \frac{1}{M - M'} \sum_{m=M'+1}^{M} \lambda_m    (57.155)

Observe that the expression for \hat{\sigma}_h^2 involves the discarded eigenvalues of P. This discarded set must include the smallest eigenvalue of P. Otherwise, if the smallest eigenvalue belongs to the retained set \{\lambda_m, m = 1, \ldots, M'\}, then it will need to be larger than \sigma_h^2 because all retained eigenvalues are larger than \sigma_h^2. In that case, all eigenvalues of P will be larger than \sigma_h^2 and expression (57.155) would not provide a suitable estimate for \sigma_h^2. We substitute (57.155) into (57.154) to eliminate \sigma_h^2 and rewrite the objective function solely in terms of the eigenvalues:

J(\{\lambda_m\}) = M' + \sum_{m=1}^{M'} \ln(\lambda_m) + (M - M') \ln\Big( \frac{1}{M - M'} \sum_{m=M'+1}^{M} \lambda_m \Big)    (57.156)


Using the fact that the sum of all \{\ln(\lambda_m)\} is constant and equal to the trace of \ln(P), the above minimization is equivalent to solving (see Prob. 57.7):

\{\hat{\lambda}_m\} = \arg\min_{\{\lambda_m\}} \left\{ \ln\Big( \frac{1}{M - M'} \sum_{m=M'+1}^{M} \lambda_m \Big) - \frac{1}{M - M'} \sum_{m=M'+1}^{M} \ln(\lambda_m) \right\}    (57.157)

where only the discarded eigenvalues \{\lambda_m\} for m > M' appear. We want to minimize this objective relative to the choice of these eigenvalues. Since we already know that the discarded set must include the smallest eigenvalue of P, and since \ln(\cdot) is a monotonically increasing function, we deduce that the discarded eigenvalues should be the M - M' smallest eigenvalues of P.
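The conclusions (57.131), (57.145), (57.146), and (57.155) can be packaged into a closed-form estimator. The snippet below is a small illustrative NumPy sketch (not part of the text) of these ML expressions, taking V = I in (57.135); the function name and inputs are arbitrary.

```python
import numpy as np

def ppca_ml(H, Mp):
    """Closed-form ML estimates for probabilistic PCA:
    mu from (57.131), sigma_h^2 from (57.155), and W from (57.135)
    with Sigma_1 as in (57.145), U_1 as in (57.146), and V = I."""
    N, M = H.shape
    mu = H.mean(axis=0)                                    # (57.131)
    P = (H - mu).T @ (H - mu) / N                          # sample covariance (57.129) with mu_hat
    lam, U = np.linalg.eigh(P)                             # ascending eigenvalues
    lam, U = lam[::-1], U[:, ::-1]                         # reorder from largest to smallest
    sigma2 = lam[Mp:].mean()                               # (57.155): mean of discarded eigenvalues
    Sigma1 = np.sqrt(np.maximum(lam[:Mp] - sigma2, 0.0))   # (57.145)
    W = U[:, :Mp] * Sigma1                                 # U_1 Sigma_1, i.e., (57.135) with V = I
    return mu, sigma2, W
```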

57.B ALTERNATIVE OPTIMIZATION PROBLEM

In this appendix we establish that the solution to (57.66) is given by u_x = u_1 and z_x = u_1, where u_1 is the first column of U. We adjust the argument from Zou, Hastie, and Tibshirani (2006); an extension of the result is studied in Prob. 57.12. Let us ignore the scaling by 1/(N-1). Using the N \times M matrix H_p defined in (57.35), the cost function from (57.66) can be written in matrix form as follows:

P(u, z) = \sum_{n=1}^{N} \|h_{n,p} - u z^T h_{n,p}\|^2 = \|H_p - H_p z u^T\|_F^2    (57.158)

Introduce the M \times (M-1) matrix U_\perp with orthonormal columns such that \Theta = [u \; U_\perp] becomes M \times M orthogonal. Since orthogonal transformations do not alter the Frobenius norm we have

P(u, z) = \|(H_p - H_p z u^T)\Theta\|_F^2 = \|H_p [u \; U_\perp] - H_p z u^T [u \; U_\perp]\|_F^2 = \|H_p u - H_p z\|^2 + \|H_p U_\perp\|_F^2    (57.159)

Only the first term depends on z. Given u, the minimizer over z is given by (assuming full rank H_p):

z_x = u    (57.160)

Substituting into the cost expression gives:

P(u, z_x) = \|H_p U_\perp\|_F^2
= \text{Tr}(U_\perp^T H_p^T H_p U_\perp)
= \text{Tr}\left( \begin{bmatrix} u^T \\ U_\perp^T \end{bmatrix} H_p^T H_p \begin{bmatrix} u & U_\perp \end{bmatrix} \right) - u^T H_p^T H_p u
= \text{Tr}(\Theta^T H_p^T H_p \Theta) - u^T H_p^T H_p u
= \text{Tr}(H_p^T H_p) - u^T H_p^T H_p u    (57.161)

Only the second term depends on the unknown u. We are therefore reduced to solving

u_x = \arg\max_{\|u\|=1} \Big\{ u^T H_p^T H_p u \Big\}    (57.162)


The solution is the unit-norm eigenvector of the matrix HpT Hp that corresponds to its largest eigenvalue, which is the vector u1 since HpT Hp = (N − 1)Rp . We conclude that ux = u1 . Substituting into zx gives zx = u1 as well.
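A quick numerical check of this conclusion (not from the text, with arbitrary problem sizes) is to compare the cost in (57.158) at u = z = u_1 against random unit-norm choices:

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 500, 6
Hp = rng.standard_normal((N, M))
Hp -= Hp.mean(axis=0)                          # centered data, as in (57.35)

def cost(u, z):                                # P(u, z) from (57.158)
    return np.linalg.norm(Hp - Hp @ np.outer(z, u), 'fro') ** 2

_, U = np.linalg.eigh(Hp.T @ Hp)               # ascending eigenvalues
u1 = U[:, -1]                                  # eigenvector for the largest eigenvalue

random_costs = []
for _ in range(1000):
    u = rng.standard_normal(M); u /= np.linalg.norm(u)
    random_costs.append(cost(u, u))
print(cost(u1, u1) <= min(random_costs))       # expected: True
```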

REFERENCES

Abdi, H. and L. J. Williams (2010), "Principal component analysis," Wiley Interdiscip. Rev.: Comput. Statist., vol. 2, no. 4, pp. 433–459.
Anderson, T. W. (1963), "Asymptotic theory for principal component analysis," Ann. Math. Statist., vol. 34, pp. 122–148.
Bachman, G. and L. Narici (1998), Functional Analysis, Dover Publications.
Burt, C. (1949), "Alternative methods of factor analysis and their relations to Pearson's method of Principle Axes," Br. J. Psychol., Statistical Section, vol. 2, pp. 98–121.
Candes, E. J., X. Li, Y. Ma, and J. Wright (2009), "Robust principal component analysis," J. ACM, vol. 58, no. 3, pp. 1–37.
Chandrasekaran, V., S. Sanghavi, P. Parrilo, and A. Willsky (2009), "Rank-sparsity incoherence for matrix decomposition," SIAM J. Optim., vol. 21, pp. 572–596.
Child, D. (2006), The Essentials of Factor Analysis, 3rd ed., Bloomsbury Academic Press.
Diamantaras, K. I. and S. Y. Kung (1996), Principal Component Neural Networks: Theory and Applications, Wiley.
Gabriel, K. R. (1971), "The biplot graphic display of matrices with application to principal component analysis," Biometrika, vol. 58, no. 3, pp. 453–467.
Galton, F. (1889), Natural Inheritance, MacMillan and Co.
Gorsuch, R. L. (1983), Factor Analysis, 2nd ed., Erlbaum.
Harman, H. H. (1976), Modern Factor Analysis, University of Chicago Press.
Hotelling, H. (1933), "Analysis of a complex of statistical variables into principal components," J. Edu. Psychol., vol. 24, pp. 417–441, 498–520.
Hotelling, H. (1935), "Simplified calculation of principal components," Psychometrica, vol. 1, pp. 27–35.
Jackson, J. E. and G. S. Mudholkar (1979), "Control procedures for residuals associated with principal component analysis," Technometrics, vol. 21, no. 3, pp. 341–349.
Johnstone, I. M. (2001), "On the distribution of the largest eigenvalue in principal components analysis," Ann. Statist., vol. 29, no. 2, pp. 295–327.
Jolliffe, I. T. (2002), Principal Component Analysis, 2nd ed., Springer.
Karhunen, K. (1946), "Uber lineare Methoden in der Wahrscheinlichkeitsrechnung," Ann. Acad. Sci. Fennicae, Series A 1, Math. Phys., vol. 37, pp. 3–79.
Kosambi, D. D. (1943), "Statistics in function space," J. Ind. Math. Soc., vol. 7, pp. 76–88.
Kreyszig, E. (1989), Introductory Functional Analysis with Applications, Wiley.
Loève, M. (1946), "Fonctions aléatoires de second ordre," Rev. Sci., vol. 84, no. 4, pp. 195–206.
Loève, M. (1955), Probability Theory: Foundations, Random Sequences, Van Nostrand.
Martinez, A. M. and A. C. Kak (2001), "PCA versus LDA," IEEE Trans. Patt. Anal. Mach. Intell., vol. 23, no. 2, pp. 228–233.
Mika, S., B. Scholkopf, A. Smola, K. R. Muller, M. Scholz, and G. Ratsch (1999b), "Kernel PCA and de-noising in feature spaces," Proc. Advances Neural Information Processing Systems (NIPS), pp. 536–542, Cambridge, MA.
Moore, B. C. (1981), "Principal component analysis in linear systems: Controllability, observability, and model reduction," IEEE Trans. Aut. Control, vol. 26, no. 1, pp. 17–32.
Mulaik, S. A. (2009), Foundations of Factor Analysis, 2nd ed., Chapman & Hall.
Narasimha, R. (2011), "Kosambi and proper orthogonal decomposition," Resonance, June, pp. 574–581.
Naylor, A. W. and G. Sell (1982), Linear Operator Theory in Engineering and Science, Springer.
Oja, E. (1982), "Simplified neuron model as a principal component analyzer," J. Math. Biol., vol. 15, no. 3, pp. 267–273.
Oja, E. (1983), Subspace Methods of Pattern Recognition, Research Studies Press.
Oja, E. (1992), "Principal components, minor components, and linear neural networks," Neural Netw., vol. 5, pp. 927–935.
Pearson, K. (1901), "On lines and planes of closest fit to systems of points in space," Philos. Mag., vol. 2, no. 11, pp. 559–572.
Rao, C. R. (1964), "The use and interpretation of principal component analysis in applied research," Ind. J. Statist., Ser. A, vol. 26, no. 4, pp. 329–358.
Scholkopf, B., A. Smola, and K. R. Muller (1998), "Nonlinear component analysis as a kernel eigenvalue problem," Neural Comput., vol. 10, no. 5, pp. 1299–1319.
Scholkopf, B., A. Smola, and K. R. Muller (1999), "Kernel principal component analysis," in Advances in Kernel Methods: Support Vector Learning, C. J. C. Burges, B. Schölkopf, and A. J. Smola, editors, pp. 327–352, MIT Press.
Shen, H. and J. Z. Huang (2008), "Sparse principal component analysis via regularized low rank matrix approximation," J. Multivariate Anal., vol. 6, no. 99, pp. 1015–1034.
Tankard, J. W. (1984), The Statistical Pioneers, Schenkman Books.
Tipping, M. E. and C. M. Bishop (1999), "Probabilistic principal component analysis," J. Roy. Statist. Soc. Ser. B, vol. 61, no. 3, pp. 611–622.
Yeung, K. Y. and W. L. Ruzzo (2001), "Principal component analysis for clustering gene expression data," Bioinformatics, vol. 17, no. 9, pp. 763–774.
Zou, H., T. Hastie, and R. Tibshirani (2006), "Sparse principal component analysis," J. Comput. Graph. Statist., vol. 15, no. 2, pp. 262–286.
Zou, H. and L. Xue (2018), "A selective overview of sparse principal component analysis," Proc. IEEE, vol. 106, no. 8, pp. 1311–1320.

58 Dictionary Learning

Principal component analysis (PCA) is a formidable tool for dimensionality reduction. Given feature vectors \{h_n\} in M-dimensional space, PCA replaces them by lower-dimensional vectors \{h'_n\} of size M' \ll M each. This is achieved by first preprocessing the \{h_n\} and replacing them by centered and normalized features \{h_{n,p}\}. The sample covariance matrix of these latter vectors is subsequently computed along with its eigen-decomposition:

R_p \stackrel{\Delta}{=} \frac{1}{N-1} \sum_{n=0}^{N-1} h_{n,p} h_{n,p}^T = U \Lambda U^T    (58.1)

The leading M' columns of U are retained into the M \times M' matrix U_1 and used to perform the reductions:

h'_n = U_1^T h_{n,p}, \quad n = 0, 1, \ldots, N-1    (58.2)

By collecting the vectors \{h_{n,p}, h'_n\} into the N \times M and N \times M' matrices \{H_p, H'\} defined by (57.35), we explained through (57.36) that the PCA solution essentially amounts to a low-rank factorization for the data matrix H_p in the form:

H_p \approx H' U_1^T \Longleftrightarrow H_p^T \approx U_1 (H')^T    (58.3)

where the columns of the transposed matrices are given by the feature vectors:

H_p^T = [\, h_{0,p} \;\; h_{1,p} \;\; h_{2,p} \;\; \ldots \;\; h_{N-1,p} \,]    (58.4a)
(H')^T = [\, h'_0 \;\; h'_1 \;\; h'_2 \;\; \ldots \;\; h'_{N-1} \,]    (58.4b)

Expression (58.3) shows that each column h_{n,p} is effectively approximated by a linear combination of the M' columns of U_1, as was already advanced by (57.32):

h_{n,p} \approx U_1 h'_n \quad (PCA: U_1 is "tall")    (58.5)

The matrix U1 serves as a basis for a subspace of dimension M 0 in IRM . It has fewer columns than rows and is therefore a “tall” matrix. This result is illustrated schematically in Fig. 58.1. In this chapter, we examine another important method for data representation, known as dictionary learning. This technique also deals with the representation of data columns, such as hn,p , as linear combinations of some basis vectors (now


Figure 58.1 PCA construction. Each column hn,p in M -dimensional space is

approximated by a linear combination of the M 0 columns in U1 , with the coefficients of the linear combination given by the entries of h0n .

called atoms). However, two main differences arise in relation to PCA. First, the "tall" basis matrix U_1 is replaced by a "fat" matrix having more columns than rows, which we denote by the letter W, of dimensions M \times K where K \geq M. Second, the linear combination vector h'_n will be required to be sparse with few nonzero entries:

h_{n,p} \approx W h'_n \quad (dictionary learning: W is "fat" and h'_n is sparse)    (58.6)

This situation is illustrated in Fig. 58.2 and compared with the PCA construction shown inside the box in the lower left corner. In dictionary learning, given the feature vectors \{h_{n,p}\}, the objective is to determine a "fat" matrix W (also called the dictionary) and a sparse matrix H' such that

H_p^T \approx W (H')^T    (58.7)

By doing so, we arrive at a procedure that enables us to represent feature vectors as sparse representations of some fundamental atoms stored in the dictionary W . In a later section in this chapter, we will extend the discussion to nonnegative matrix factorizations to deal with applications where the matrices W and H 0 are further required to have nonnegative entries.

58.1 LEARNING UNDER REGULARIZATION

We will describe three techniques for dictionary learning. One technique allows online learning with regularization, a second technique allows online learning with constraints, and the third technique is the K-SVD method. We consider the first technique in this section. To accommodate more general situations, we redefine the notation in (58.7). We will denote the given data matrix H_p^T generically by X, the desired basis by W, and the representation matrix (H')^T by Z. Thus, our intention is to start from some given M \times N matrix X and to determine a factorization for it in the form:


Figure 58.2 Dictionary learning. Each column hn,p in M -dimensional space is approximated by a linear combination of the K columns in W , with the coefficients of the linear combination given by the entries of the sparse vector h0n .

X \approx W Z    (58.8a)

where

W: (M \times K), \quad Z: (K \times N), \quad Z sparse    (58.8b)

The dimensions (M, N, K) and their relative sizes are kept generic. In Fig. 58.2 we illustrate the situation in which W is a “fat” matrix with more columns than rows (i.e., with K > M ). In this case, the dictionary is said to be overcomplete, and this is the standard situation of most interest to us in this chapter; it arises when one seeks sparse representations Z. However, there are applications where one seeks a “tall” W with fewer columns than rows. In this case, the dictionary is said to be undercomplete. This scenario arises, for instance, in PCA constructions where the feature vectors are mapped into a lower-dimensional space. It also arises in gene expression applications where the row dimension M for X corresponds to the number of available samples while the column dimension N becomes the feature size; in this application, there is usually a small number of samples (small M ), while the feature space is high-dimensional (large N ). We


observe from these examples that we can relax the interpretation for the quantity X. Depending on the application at hand, it is the rows rather than the columns of X that play the role of “feature” vectors. Also, depending on the size of K, dictionary learning may be performing sparse coding (K > M ) or dimensionality reduction (K < M ). If we denote the columns of X by {xn } and the columns of Z by {zn }, for n = 1, 2, . . . , N , then expression (58.8a) leads to the representation: xn ≈ W zn

(58.9)

We refer to z_n as the latent variable corresponding to x_n; it helps identify which atoms (columns) of W should be combined to approximate x_n. For later reference, we will denote the columns of W by \{w_k\} for k = 1, 2, \ldots, K:

\{x_n\}: columns of X (M \times N)
\{z_n\}: columns of Z (K \times N)
\{w_k\}: columns of W (M \times K)    (58.10)

Solution methods to attain the factorization (58.8a) are generally iterative in nature and will consist of two steps: an inference step (also called a sparse coding step) that updates Z, and a dictionary learning step that updates W. The iterations continue until sufficient convergence is attained. There are many variations of risk functions that can be chosen to design W and Z; they all involve some form of \ell_1-regularization on the columns of Z to induce sparsity and some form of \ell_2-regularization on the columns of W to avoid large values. One common choice is to consider:

(W^\star, Z^\star) = \arg\min_{W,Z} \left\{ \sum_{n=0}^{N-1} \Big( \|x_n - W z_n\|^2 + \alpha \|z_n\|_1 \Big) + \sum_{k=0}^{K-1} \rho \|w_k\|^2 \right\}    (58.11)

where \alpha > 0 and \rho > 0 are regularization parameters. The risk function in the above formulation is nonconvex in both (W, Z). However, if we fix one of the variables, then the optimization problem becomes convex over the other variable. This fact can be exploited to devise a recursive solution that alternates between updating W and Z as follows. Let \{W_{m-1}, Z_{m-1}, w_{k,m-1}, z_{n,m-1}\} denote the estimates for \{W, Z, w_k, z_n\} at iteration m-1:

(a) (Solving for Z) We fix W at W_{m-1} and rewrite the risk as the following function over z_n by ignoring terms that are independent of the \{z_n\}:

P(Z) \stackrel{\Delta}{=} \sum_{n=0}^{N-1} \Big\{ \|x_n - W_{m-1} z_n\|^2 + \alpha \|z_n\|_1 \Big\}    (58.12)

Minimization of P(Z) over Z decouples into N separate optimization problems over each of the columns z_n:

z_{n,m} \stackrel{\Delta}{=} \arg\min_{z \in \mathbb{R}^K} \Big\{ \|x_n - W_{m-1} z\|^2 + \alpha \|z\|_1 \Big\}, \quad n = 0, 1, \ldots, N-1    (58.13)


Each one of these problems is a LASSO (least absolute shrinkage and selection operator) problem, which we already know how to solve. For instance, earlier in Examples 14.1 and 15.3 we described algorithms for determining the solution z_{n,m} based on subgradient and proximal gradient iterations. In particular, repeating the argument from the latter example, we find that we can learn z_{n,m} (the estimate for z_n at iteration m) by applying repeatedly the following construction over some iteration index i until sufficient convergence is attained. Start from r_{-1} arbitrary, and repeat using a step size \mu > 0 the iterated soft-thresholding algorithm (ISTA):

y_i = r_{i-1} + 2\mu W_{m-1}^T (x_n - W_{m-1} r_{i-1}), \quad i \geq 0
r_i = T_{\mu\alpha}(y_i)    (58.14)

where T_\beta(\cdot) is the soft-thresholding operator defined by (57.76). Once iteration (58.14) approaches convergence, we set z_{n,m} = r_i. We repeat the same construction for all columns of Z. The result is the iterate Z_m with columns \{z_{n,m}\}.

=

N −1 X n=0

kxn N W zn,m k2 +

K−1 X k=0

ρkwk k2

n o n o = Tr (X N W Zm )T (X N W Zm ) + ρ Tr W T W n o n o (a) T T = Tr X T X + Tr NW Zm X T N XZm W T + W (ρIK + Zm Zm )W T n o n o (b) T = Tr X T X + Tr NW Bm N Bm W T + W Am W T n   T o −1 = Tr W N Bm A−1 + cte (58.15) m Am W N B m Am

where in step (a) we used the trace property Tr(CD) = Tr(DC) for any matrices of compatible dimensions, and in step (b) we introduced the matrices: ∆

T Bm = XZm ,



T Am = ρIK + Zm Zm

(58.16)

The constant term in the last equality (58.15) includes terms that are independent of the unknown W . We conclude from (58.15) that P (W ) is minimized at Wm = Bm A−1 m

(58.17)

In this way, we arrive at the online algorithm for dictionary learning listed in (58.18). The reason for the designation “online” is that the algorithm is able to

58.1 Learning Under Regularization

2429

respond to streaming columns xn and to continually update the dictionary in response to the streaming data.

Online dictionary learning for the regularized problem (58.11). given M × N data matrix X; objective: determine the factorization X ≈ W Z, where W is M × K and Z is K × N and sparse; start from arbitrary initial conditions W−1 and Z−1 ; denote the columns of {X, Z} by {xn , zn } for n = 0, 1, . . . , N N 1. repeat until convergence over m = 0, 1, 2, . . .: (sparse coding) update each column zn,m by solving the LASSO problem (58.13), e.g., by running (58.14) for sufficient iterations

(58.18)

set columns of Zm = {zn,m } (dictionary update) T Bm = XZm T Am = ρIK + Zm Zm −1 Wm = Bm Am end return (W, Z) ← (Wm , Zm ).

Example 58.1 (Sparse coding by duality) We can also solve the sparse coding problem (58.13) (i.e., the problem of determining the columns {zn }) by following the same duality arguments used earlier in Section 51.4.2. We illustrate the procedure by assuming elastic-net regularization is applied to each zn . In this case, we replace (58.13) by ∆

zn = argmin

n o αkzk1 + ρ2 kzk2 + kxn − Wm−1 zk2

(58.19)

z∈IRK

where ρ2 > 0. This formulation has a form similar to (51.59), with the identifications: 1 1 √ d ← xn , √ H ← Wm−1 , w ← z, ρ ← ρ2 , N ← M N N

(58.20)

Therefore, using expression (51.61), the solution is given by the soft-thresholding expression:   1 T zn = Tα Wm−1 λ0 (58.21) 2ρ2 where the vector λ0 is the unique maximum of the following strongly concave function: 

 2  1 1  T

λ0 = argmax λT xn − kλk2 − (58.22)

Tα Wm−1 λ 4 4ρ M 2 λ∈IR

2430

Dictionary Learning

58.2

LEARNING UNDER CONSTRAINTS A second method to determine the factors (W, Z) is to replace the regularized formulation (58.11) by one where the norms of the columns of W are directly bounded, namely, by:

?

?

(W , Z ) = argmin W,Z

N −1 X n=0

(

2

kxn N W zn k + αkzn k1

)

(58.23)

subject to kwk k2 ≤ 1, k = 0, 1, . . . , K N 1 Here, the `2 -norms of the columns of W are constrained to avoid large values. In this case, we continue to construct the columns of Z by solving the same individual LASSO problems, as explained in listing (58.18). However, the dictionary update for W will need to be adjusted because the last equality in (58.15) is not valid anymore (without the presence of ρ, the matrix Am becomes singular). Thus, repeating the argument that led to (58.15), and ignoring all terms that are independent of W , we now have ∆

P (W ) =

N −1 X n=0

kxn N W zn,m k2 + cte

(58.24)

n o = Tr (X N W Zm )T (X N W Zm ) + cte n o n o T T = Tr X T X + Tr NW Zm X T N XZm W T + W Zm Z m W T + cte We leave out the constant term that is independent of W and write: n o T P (W ) = Tr NW Bm N Bm W T + W Am W T

(58.25)

where Bm continues to have the same expression as before but Am is adjusted to: ∆

T Bm = XZm ,



T Am = Zm Zm

(58.26)

We denote the individual columns of {W, Bm , Am } by {wk , bk,m , ak,m } for k = 0, 1, . . . , K N 1. We also denote the individual entries of the K × K symmetric matrix Am by {akk0 ,m } so that P (W ) =

K−1 X k=0

(

N

2bT k,m wk

+

K−1 X k0 =0

akk0 ,m wkT wk0

)

(58.27)

58.2 Learning Under Constraints

2431

where the last term follows from the equalities ( K−1 ) X T T Tr(W Am W ) = Tr W ak,m wk k=0

=

K−1 X

Tr(W ak,m wkT )

k=0

=

K−1 X

wkT W ak,m

k=0

=

K−1 X

wkT

k=0

=

K−1 X K−1 X

K−1 X

akk0 ,m wk0

k0 =0

akk0 ,m wkT wk0

! (58.28)

k=0 k0 =0

Returning to (58.27), we can separate out all terms that depend solely on wk and write X  a`k,m w`T wk + akk,m kwk k2 + cte P (wk ) = N2bT (58.29) k,m wk + 2 `6=k

where we used the fact that Am is symmetric. Let Wm−1 denote the iterate that is available for the dictionary W at iteration m N 1. We exclude column k from (−k) (−k) it and denote the resulting matrix by Wm−1 . Let also ak,m denote the column vector ak,m with its kth entry excluded. Then, the minimum of P (wk ) subject to kwk k2 ≤ 1 can be sought by applying projection gradient steps as follows. First, the unconstrained minimizer of (58.29) occurs at the location (denoted by yk ):  1  (−k) (−k) (58.30) bk,m N Wm−1 ak,m yk = akk,m where akk,m is the kth diagonal element of Am . The vector yk should be subsequently projected onto the set kwk k2 ≤ 1 to get (recall (15.51)):  yk , if kyk k ≤ 1 wk = (58.31) yk /kyk k, if kyk k > 1

Since the kth column of Wm−1 is wk,m−1 , we can rewrite the first step as follows:   1    bk,m N Wm−1 ak,m  yk ← wk,m−1 + a kk,m  (58.32) yk , if kyk k ≤ 1    wk ← yk /kyk k, if kyk k > 1

This computation can be written in matrix form and appears in the listing of the algorithm in (58.33). Compared with the earlier listing (58.18), only diag(Am ) is now being inverted as opposed to Am itself.

2432

Dictionary Learning

Online dictionary learning for the constrained problem (58.23). given M × N data matrix X; objective: determine the factorization X ≈ W Z, where W is M × K and Z is K × N and sparse; start from arbitrary initial conditions W−1 and Z−1 with columns of W−1 having norms bounded by 1; denote the columns of {X, Z} by {xn , zn } for n = 0, 1, . . . , N N 1. repeat until convergence over m = 0, 1, 2, . . .: (sparse coding) update each column zn,m by solving the LASSO problem (58.13), e.g., by running (58.14) for sufficient iterations (58.33) set columns of Zm = {zn.m } (dictionary update) T Am = Z m Z m T Bm = XZm Y = Wm−1 + (Bm N Wm−1 Am )(diag(Am ))−1 for each column yk of Y perform the projection:  yk , if kyk k ≤ 1 yk ← yk /kyk k, if kyk k > 1 set Wm = Y end return (W, Z) ← (Wm , Zm ).

58.3

K-SVD APPROACH A third approach for the solution of the dictionary learning problem is the K-SVD method, which again alternates between updating the dictionary and performing sparse coding. The method, however, considers a variation of formulation (58.11) where, for each column xn , the two design steps are modified as follows: ( N −1 ) X given W , solve Z ? = argmin kzn k0 , subject to X ≈ W Z (58.34a) Z

n=0

?

given Z, solve W = argmin kX N W Zk2F

(58.34b)

W

where kak0 counts the number of nonzero elements in vector a.

Orthogonal matching pursuit The first step (58.34a) is minimizing the number of nonzero elements in the vectors {zn }, thus enforcing the sparse requirement. This step decouples into N independent problems, one for each column zn :

58.3 K-SVD Approach

n o zno = argmin kzn k0 , subject to xn ≈ W zn

2433

(58.35)

zn

This problem has the same form as the sparse signal recovery problem (58.90) we study in Appendix 58.A, with W playing the role of the matrix A, zn playing the role of x, and xn playing the role of b. Specifically, in the appendix we derive the orthogonal matching pursuit (OMP) algorithm, in either form (58.101) or (58.103), for the solution of problems of the following generic form, which matches (58.35): n o xo = argmin kxk0 , subject to b ≈ Ax where A is a “fat” matrix (58.36) x

We refer to the algorithm that maps (W, xn ) to a solution zno by writing compactly zno = OMP(W, xn , T )

(58.37)

where T is some predefined bound on the sparsity level that is desired for zno (i.e., the number of nonzero entries in it). The OMP algorithm takes W and xn as input and generates the sparse vector zno . We therefore know how to solve the sparse coding step (58.34a). The resulting Z is denoted by Z ? .

Rank-one approximations Let us now consider the dictionary update step (58.34b). We denote the individual columns of W by {wk } and the individual rows of Z ? by {rnT } (these are not the sparse columns of Z ? but rather its rows). In the K-SVD solution, we update the columns of W one column at a time in the following manner (in this process, the rows of Z ? are also updated). Consider the kth column, wk . Expressing W Z ? as the sum of K outer products we have K−1

2 X

kX N W Z ? k2F = X N wk0 rkT0 F

k0 =0

K−1

2

  X

= XN wk0 rkT0 Nwk rkT

|

F

k0 6=k

{z



= Ek

2 ∆

= Ek N wk rkT

F

}

(58.38)

where we introduced the matrix Ek ; it is equal to the data matrix X adjusted by the combination of all outer products from {W, Z} that are not being updated at this kth step. Thus, the matrix Ek corresponds to the error in approximating X when the kth atom is removed from W . This matrix is known and the only design variable in the above expression is the kth column wk , which we would like to update to some new value wk0 . Expression (58.38) is suggesting that the outer product wk rkT should serve as a rank-one approximation for Ek . In principle,

2434

Dictionary Learning

we could resort to the Eckart–Young theorem (1.224) to determine a rank-one approximation for Ek from its singular value decomposition (SVD) and then use the resulting outer product approximation to deduce from it what the updated vector wk0 should be. There is one problem with this approach, however. The row vector rkT may have zero locations in it, which would in turn mean that the outer product wk rkT should have some zero columns. Therefore, the only degrees of freedom we have in solving (58.38) are those that correspond to the nonzero columns of wk rkT . This observation suggests that we should reformulate problem (58.38) in the following manner by removing the zero columns from the outer product wk rkT . Let Sk denote the indices of the nonzero entries in the vector rkT (whose size is 1 × N ). For example, if rkT =



0

0

0 ×

0

0

× 0

0

0

0

×



(58.39)

then Sk = {3, 6, 11}, where we are indexing the entries by counting from zero. The set Sk identifies the input sample vectors {xn } that employ the kth atom, wk . We introduce the restriction of the N × N identity matrix to Sk and denote it by ISk ; this matrix has dimensions N × |Sk | and its columns are the basis vectors corresponding to the indices in Sk . That is, ISk =



e3

e6

e11



,

(N × 3)

(58.40)

Motivated by (58.38), we then consider the problem of updating both {wk , rk } to {wk0 , rk0 } by solving

  2

{wk0 , rk0 } = argmin Ek N wk rkT ISk

F

wk ,rk

(58.41)

We know from (1.224) that the solution is obtained by introducing the SVD of the N × |Sk | matrix Ek ISk , namely, ∆

Ek ISk =

|Sk | X

σ` u` v`T

(58.42)

`=1

and to set wk0 and the nonzero entries of rk0 to the following values: wk0 = u1 (rk0 )T ISk

=

σ1 v1T

(first right-singular vector)

(58.43a)

(scaled first left-singular vector)

(58.43b)

We therefore arrive at listing (58.44) for the K-SVD algorithm. The matrices {W, Z} generated at iteration m are denoted by {Wm , Zm } and their columns by {wk,m , zn,m } with a subscript m added. Likewise, the rows of Zm are denoted T by {rk,m }.


K-SVD algorithm for solving (58.34a)–(58.34b).

given M \times N data matrix X;
given bound T on the sparsity level for the columns of Z;
objective: determine the factorization X \approx W Z, where W is M \times K and Z is K \times N and sparse;
start from arbitrary initial conditions W_{-1} and Z_{-1};
denote columns of \{X, Z, W\} by \{x_n, z_n, w_k\}; denote rows of Z by \{r_k^T\}.

repeat until convergence over m = 0, 1, 2, \ldots:
    (sparse coding)
    update each column z_{n,m} by orthogonal matching pursuit:
        z_{n,m} = \text{OMP}(W_{m-1}, x_n, T) using (58.101) or (58.103)
    set columns of Z_m to \{z_{n,m}\}
    denote rows of Z_m by \{r_{k,m}^T\}
    (dictionary update)
    repeat for each dictionary column k = 0, 1, \ldots, K-1:
        S_k = \{indices of nonzero entries in r_{k,m}^T\}
        E_k = X - \sum_{k' \neq k} w_{k',m-1} r_{k',m}^T
        E_k I_{S_k} = \sum_{\ell=1}^{|S_k|} \sigma_\ell u_\ell v_\ell^T   (SVD)
        set w_{k,m} = u_1
        update nonzero entries of r_{k,m} to \sigma_1 v_1
    end
    set columns of W_m to \{w_{k,m}\}
    set rows of Z_m to \{r_{k,m}^T\}
end
return (W, Z) \leftarrow (W_m, Z_m).
(58.44)
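A compact NumPy sketch of the K-SVD iterations in listing (58.44) is shown below. The OMP routine here is a plain greedy implementation written only for illustration (Appendix 58.A develops the algorithm itself), and all names, iteration counts, and the initialization are arbitrary assumptions.

```python
import numpy as np

def omp(W, x, T):
    """Greedy orthogonal matching pursuit sketch: select at most T atoms of W
    to approximate x, refitting the coefficients by least-squares each time."""
    K = W.shape[1]
    z = np.zeros(K)
    support, residual = [], x.copy()
    for _ in range(T):
        k = int(np.argmax(np.abs(W.T @ residual)))       # most correlated atom
        if k not in support:
            support.append(k)
        coeffs, *_ = np.linalg.lstsq(W[:, support], x, rcond=None)
        residual = x - W[:, support] @ coeffs
    z[support] = coeffs
    return z

def ksvd(X, K, T, num_iters=20, seed=0):
    """Sketch of listing (58.44): alternate OMP sparse coding and
    rank-one (SVD-based) updates of each dictionary atom."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    W = rng.standard_normal((M, K))
    W /= np.linalg.norm(W, axis=0)
    Z = np.zeros((K, N))
    for _ in range(num_iters):
        # sparse coding step
        for n in range(N):
            Z[:, n] = omp(W, X[:, n], T)
        # dictionary update step
        for k in range(K):
            Sk = np.nonzero(Z[k, :])[0]                   # samples that use atom k
            if Sk.size == 0:
                continue
            Ek = X - W @ Z + np.outer(W[:, k], Z[k, :])   # error with atom k removed
            U, s, Vt = np.linalg.svd(Ek[:, Sk], full_matrices=False)
            W[:, k] = U[:, 0]                             # w_k <- first left-singular vector
            Z[k, Sk] = s[0] * Vt[0, :]                    # nonzero entries of r_k
    return W, Z
```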

58.4 NONNEGATIVE MATRIX FACTORIZATION

In some important applications, such as face recognition and topic modeling, the entries of X are nonnegative and it is desirable for the entries of the factors W and Z to be nonnegative as well. For example, in a face recognition setting, each image representing a face would correspond to a column in the data matrix X. The columns of W would then provide atoms that represent different constituents of the face, such as nose, eyes, lips, and mouth. The atoms are combined by the elements in Z to reproduce the faces in X (or to generate other


face examples). In this application, each face image is expressed as a nonnegative combination of elementary components saved into the dictionary W. A second similar example arises in topic modeling, where each column in X would convey information about a particular document (such as the number of times certain words appear). The columns of W would represent different topics, and the elements of Z would help identify the topics discussed in each document. In these types of applications, the entries of Z are positive weights that help scale the contribution of the various atoms in reproducing X. One way to formulate the nonnegative matrix factorization (NMF) problem is to replace (58.23) by the following. Given X \succeq 0, we wish to determine W and Z by solving

(W^\star, Z^\star) = \arg\min_{W,Z} \Big\{ \sum_{n=0}^{N-1} \|x_n - W z_n\|^2 \Big\}, subject to W \succeq 0, \; Z \succeq 0    (58.45a)

where the notation A \succeq 0 signifies that all entries of A are nonnegative. Moreover, X is M \times N, W is M \times K, and Z is K \times N. The problem is equivalent to

(W^\star, Z^\star) = \arg\min_{W,Z} \text{Tr}\big\{ (X - W Z)^T (X - W Z) \big\}, subject to W \succeq 0, \; Z \succeq 0    (58.45b)

which is further equivalent to (see Prob. 58.10):

(W^\star, Z^\star) = \arg\min_{W,Z} \|X - W Z\|_F^2, subject to W \succeq 0, \; Z \succeq 0    (58.45c)

in terms of the squared Frobenius norm. Observe from this last formulation that the objective function is symmetric over W and Z since the problem of minimizing kX N W Zk2F over W , with Z fixed, is similar to the problem of minimizing kX T N Z T W T k2F over Z T with W T fixed. There are many techniques that can be used to solve the NMF problem. Also, several variations are possible, such as including `1 - and `2 -regularization – see Prob. 58.8.

58.4.1 Alternating Least-Squares Method

One solution method is to alternate between finding Z and W. Assume W is fixed at W_{m-1}. Then, problem (58.45a) decouples into N independent least-squares problems over the \{z_n\}:

z_{n,m} \stackrel{\Delta}{=} \arg\min_{z \in \mathbb{R}^K} \|x_n - W_{m-1} z\|^2, subject to z \succeq 0    (58.46)

One way to determine z_{n,m} is to solve the least-squares problem first and then project the result onto the nonnegative orthant z \succeq 0. The least-squares solution


will depend on whether W_{m-1} is a "tall" or "fat" matrix. Hence, we will employ the pseudo-inverse notation:

(least-squares solution): let y = W_{m-1}^\dagger x_n
(projection): set negative entries of y to zero
set z_{n,m} = y    (58.47)

This process should be repeated for every column z_n of Z. We can group these steps into a single matrix statement as follows:

(updating Z_m with W_{m-1} fixed)
Y = W_{m-1}^\dagger X
set negative entries of Y to zero
set Z_m = Y    (58.48)

Next, we fix Z at Z_m and solve for

W_m \stackrel{\Delta}{=} \arg\min_{W} \sum_{n=0}^{N-1} \|x_n - W z_{n,m}\|^2, subject to W \succeq 0    (58.49)

or, equivalently,

W_m \stackrel{\Delta}{=} \arg\min_{W} \text{Tr}\big\{ (X - W Z_m)^T (X - W Z_m) \big\}, subject to W \succeq 0    (58.50)

Let

B_m \stackrel{\Delta}{=} X Z_m^T, \qquad A_m \stackrel{\Delta}{=} Z_m Z_m^T    (58.51)

Adjusting the argument that led to (58.15) and ignoring terms that are independent of W, the objective function that we wish to minimize over W can be expressed as

P(W) = \text{Tr}\big\{ -W B_m^T - B_m W^T + W A_m W^T \big\}    (58.52)

Differentiating relative to W and using the gradient properties from Table 2.1 we find that P(W) is minimized at any solution of the normal equations Y A_m = B_m. Consequently, after incorporating projection onto the nonnegative orthant, we find that W_m can be found as follows:

(updating W_m with Z_m fixed)
B_m = X Z_m^T, \quad A_m = Z_m Z_m^T
Y = B_m A_m^\dagger
set negative entries of Y to zero
set W_m = Y    (58.53)

The alternating least-squares solution method consists of repeating steps (58.48) and (58.53). Due to the projections onto the nonnegative orthant, some degradation occurs in attaining the approximation X ≈ W Z. For this reason, it is


customary to scale the resulting product W Z by some positive constant λ? determined by solving ∆

λ? = argmin kX N λW Zk2F

(58.54)

λ≥0

It is straightforward to check that the solution is given by (see Prob. 58.9): λ? =

Tr(X T W Z) Tr(Z T W T W Z)

(58.55)

so that X ≈ λ? W Z. The alternating least-squares solution just described faces convergence difficulties and it is not the method of choice in current practice.
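For concreteness, a minimal NumPy sketch of the alternating least-squares procedure is given next; it is not part of the text, the function and variable names are illustrative, and the initialization and iteration count are arbitrary choices. It implements the two projected least-squares steps (58.48) and (58.53), followed by the scaling (58.54)–(58.55):

import numpy as np

def als_nmf(X, K, num_iters=100, eps=1e-12):
    # ALS sketch for X (M x N) ~ W (M x K) @ Z (K x N), all entries nonnegative
    M, N = X.shape
    rng = np.random.default_rng(0)
    W = np.abs(rng.standard_normal((M, K)))
    Z = np.abs(rng.standard_normal((K, N)))
    for _ in range(num_iters):
        # sparse-coding step (58.48): pseudo-inverse solution, then projection
        Z = np.linalg.pinv(W) @ X
        Z[Z < 0] = 0.0
        # dictionary update (58.53): solve the normal equations, then projection
        B = X @ Z.T
        A = Z @ Z.T
        W = B @ np.linalg.pinv(A)
        W[W < 0] = 0.0
    # optional scaling (58.54)-(58.55)
    WZ = W @ Z
    lam = np.trace(X.T @ WZ) / max(np.trace(WZ.T @ WZ), eps)
    return W, Z, lam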

58.4.2 Alternating Coordinate-Descent Method

A second, more reliable approach for the solution of the NMF problem (58.45b) is to alternate between two coordinate-descent steps: one for estimating Z_m and the other for estimating W_m. Note that the cost function over W and Z is

P(W, Z) = Tr{(X − WZ)^T (X − WZ)}    (58.56)

As explained before, the structure of this function is symmetric over W and Z; once a method is found for determining W with Z fixed, the same method can be used for determining Z with W fixed, since we can also write

P(W, Z) = Tr{(X^T − Z^T W^T)^T (X^T − Z^T W^T)}    (58.57)

which has the same form as (58.56) with X replaced by X^T, Z by W^T, and W by Z^T. The variable Z^T plays the role of W in (58.56). For this reason, we focus on minimizing (58.56) over W given Z = Z_m and later consider minimizing it over Z given W = W_{m−1}. We fix Z = Z_m. It is straightforward to verify that by repeating the argument that led to (58.32) we can determine W_m as follows:

A_m = Z_m Z_m^T
B_m = X Z_m^T
W_{m−1}^{(0)} = W_{m−1}
Y = W_{m−1} + (B_m − W_{m−1} A_m)(diag(A_m))^{−1}
set negative entries of Y to zero
set W_m = Y    (58.58)

Next, we fix W = Wm−1 and follow a similar argument to determine Zm due to the symmetry in the problem. The resulting procedure listed in (58.59) is known as the hierarchical alternating least-squares (HALS) method.

HALS algorithm for solving the NMF problem (58.45c).

given M × N data matrix X ⪰ 0;
objective: determine the factorization X ≈ WZ, where W ⪰ 0 is M × K and Z ⪰ 0 is K × N;
start from arbitrary initial conditions W_{−1} ⪰ 0 and Z_{−1} ⪰ 0.
repeat until convergence for m = 0, 1, 2, . . .:

  (sparse coding)
  C_{m−1} = W_{m−1}^T W_{m−1}
  D_{m−1} = W_{m−1}^T X
  Y = Z_{m−1} + (diag(C_{m−1}))^{−1} (D_{m−1} − C_{m−1} Z_{m−1})
  set negative entries of Y to zero
  set Z_m = Y

  (dictionary update)
  A_m = Z_m Z_m^T
  B_m = X Z_m^T
  Y = W_{m−1} + (B_m − W_{m−1} A_m)(diag(A_m))^{−1}
  set negative entries of Y to zero
  set W_m = Y

end
return (W, Z) ← (W_m, Z_m).    (58.59)
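A minimal NumPy sketch of one HALS iteration is shown below, assuming strictly positive diagonal entries in C_{m−1} and A_m (a small eps guards against division by zero); the function name and defaults are illustrative rather than from the text:

import numpy as np

def hals_step(X, W, Z, eps=1e-12):
    # sparse-coding step of (58.59): update Z with W fixed
    C = W.T @ W
    D = W.T @ X
    Z = Z + (D - C @ Z) / (np.diag(C)[:, None] + eps)
    Z[Z < 0] = 0.0
    # dictionary-update step of (58.59): update W with Z fixed
    A = Z @ Z.T
    B = X @ Z.T
    W = W + (B - W @ A) / (np.diag(A)[None, :] + eps)
    W[W < 0] = 0.0
    return W, Z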

58.4.3 Multiplicative Update Algorithm

One of the most popular methods for solving the NMF problem (58.45c) is the multiplicative update (MU) algorithm, which is characterized by its simplicity. We start from the risk function

P(W, Z) = Tr{(X − WZ)(X − WZ)^T}
        = Tr{ −WZX^T − XZ^T W^T + WZZ^T W^T } + cte
        = Tr{ −ZX^T W − W^T XZ^T + Z^T W^T WZ } + cte    (58.60)

where the constant term is independent of both W and Z. Referring to the matrix differentiation results from Table 2.1 we have

∇_{Z^T} P(W, Z) = −2W^T X + 2W^T WZ    (58.61a)
∇_{W^T} P(W, Z) = −2XZ^T + 2WZZ^T    (58.61b)

which motivates us to write down two gradient-descent recursions for estimating {W, Z} using matrix step sizes in the following manner:

Z_m = Z_{m−1} + z_{z,m} ⊙ ( W_{m−1}^T X − W_{m−1}^T W_{m−1} Z_{m−1} )    (58.62a)
W_m = W_{m−1} + z_{w,m} ⊙ ( X Z_m^T − W_{m−1} Z_m Z_m^T )    (58.62b)

The notation ⊙ refers to the elementwise Hadamard product, and {z_{z,m}, z_{w,m}} are step-size matrices with positive entries and whose values change with the iteration index, m. Note that we are in effect assigning separate step sizes to the individual entries of the gradient matrices. The multiplicative update algorithm follows for specific choices of {z_{z,m}, z_{w,m}} as the following elementwise divisions:

[z_{z,m}]_{ij} = [Z_{m−1}]_{ij} / [W_{m−1}^T W_{m−1} Z_{m−1}]_{ij},  i.e.,  z_{z,m} = Z_{m−1} ⊘ ( W_{m−1}^T W_{m−1} Z_{m−1} )    (58.63a)
[z_{w,m}]_{ij} = [W_{m−1}]_{ij} / [W_{m−1} Z_m Z_m^T]_{ij},  i.e.,  z_{w,m} = W_{m−1} ⊘ ( W_{m−1} Z_m Z_m^T )    (58.63b)

where we are introducing the symbol ⊘ to refer to elementwise division. It is straightforward to verify that with these choices:

z_{z,m} ⊙ ( W_{m−1}^T X − W_{m−1}^T W_{m−1} Z_{m−1} ) = z_{z,m} ⊙ W_{m−1}^T X − Z_{m−1}    (58.64a)
z_{w,m} ⊙ ( X Z_m^T − W_{m−1} Z_m Z_m^T ) = z_{w,m} ⊙ X Z_m^T − W_{m−1}    (58.64b)

Substituting into the update relations (58.62a)–(58.62b), we arrive at listing (58.65).

Multiplicative update algorithm for NMF problem (58.45c).

given M × N data matrix X ⪰ 0;
objective: determine the factorization X ≈ WZ, where W ⪰ 0 is M × K and Z ⪰ 0 is K × N;
start from arbitrary initial conditions W_{−1} ⪰ 0 and Z_{−1} ⪰ 0.
repeat until convergence for m = 0, 1, 2, . . .:

  Z_m = Z_{m−1} ⊙ ( W_{m−1}^T X ) ⊘ ( W_{m−1}^T W_{m−1} Z_{m−1} )
  W_m = W_{m−1} ⊙ ( X Z_m^T ) ⊘ ( W_{m−1} Z_m Z_m^T )    (58.65)

end
return (W, Z) ← (W_m, Z_m).
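Because the updates in (58.65) are purely elementwise products and divisions, the algorithm is only a few lines of NumPy. The sketch below is illustrative (the initialization, iteration count, and the small eps added to the denominators are assumptions, not part of the listing):

import numpy as np

def mu_nmf(X, K, num_iters=500, eps=1e-12):
    # multiplicative-update sketch following listing (58.65)
    M, N = X.shape
    rng = np.random.default_rng(0)
    W = np.abs(rng.standard_normal((M, K))) + eps
    Z = np.abs(rng.standard_normal((K, N))) + eps
    for _ in range(num_iters):
        Z *= (W.T @ X) / (W.T @ W @ Z + eps)   # sparse-coding update
        W *= (X @ Z.T) / (W @ Z @ Z.T + eps)   # dictionary update
    return W, Z

# usage sketch (X must have nonnegative entries), e.g., W, Z = mu_nmf(X, K=9)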

One observation regarding the behavior of the algorithm is the following. Consider the update for Wm . Comparing (58.65) with (58.62b), we find that the following relation holds in terms of the gradient matrix of P (W, Z) relative to the matrix W :

W_{m−1} ⊙ ( X Z_m^T ) ⊘ ( W_{m−1} Z_m Z_m^T ) = W_{m−1} − z_{w,m} ⊙ ∇_{W^T} P(W, Z) |_{W = W_{m−1}}    (58.66)

Consider an arbitrary entry of index (i, j) in W_{m−1} and the corresponding scaling ratio appearing on the left-hand side:

r_{ij} = [X Z_m^T]_{ij} / [W_{m−1} Z_m Z_m^T]_{ij}    (58.67)

There are three possibilities for its value: r_{ij} > 1, r_{ij} = 1, and r_{ij} < 1. In the first case, when r_{ij} > 1, the value of [W_m]_{ij} will become larger than [W_{m−1}]_{ij}. Therefore, the gradient matrix at the (i, j) location must be negative for the expression on the right-hand side to increase [W_{m−1}]_{ij} to [W_m]_{ij}:

[ ∇_{W^T} P(W_{m−1}, Z) ]_{ij} < 0    (58.68)

On the other hand, the gradient matrix at the same location will be positive if r_{ij} < 1, so that [W_{m−1}]_{ij} is decreased to [W_m]_{ij}; its value will remain unchanged when r_{ij} = 1 since the gradient matrix will be zero at that location.

Example 58.2 (Non-increasing risk values) One useful property of the multiplicative update algorithm (58.65) is that the successive iterates {W_m, Z_m} lead to nonincreasing risk values (though they need not converge to a stationary point). For any generic {W, Z}, the risk value is given by

P(W, Z) = ‖X − WZ‖²_F = Σ_{n=0}^{N−1} ‖x_n − W z_n‖²    (58.69)

in terms of the individual columns of X and Z. Let us focus on one of the terms in this sum (the argument applies similarly to the other terms). Thus, consider the generic quadratic term

P(z) = ‖x − Wz‖²    (58.70)

and let z_{m−1} denote the previous iterate for z. Let us verify that P(z_m) ≤ P(z_{m−1}); a similar argument can be repeated for the columns of W. A second-order Taylor series expansion around z_{m−1} gives

P(z) = P(z_{m−1}) + ∇_z P(z_{m−1}) (z − z_{m−1}) + (z − z_{m−1})^T W^T W (z − z_{m−1})    (58.71)

where

∇_z P(z_{m−1}) = 2 W^T (W z_{m−1} − x)    (58.72)

Note that 2W^T W is the K × K Hessian matrix of P(z) relative to z. We construct next a majorization for P(z), i.e., a function that bounds it from above. Specifically, we seek a quadratic function similar to the above second-order Taylor series expansion except for the Hessian matrix, namely, a function of the form:

Q(z) = P(z_{m−1}) + ∇_z P(z_{m−1}) (z − z_{m−1}) + (z − z_{m−1})^T H (z − z_{m−1})    (58.73)

with the K × K matrix H chosen to satisfy H ≥ W^T W. Once this is done, we end up with Q(z) ≥ P(z) for any z. We choose H as a diagonal matrix with the following entries (assuming all entries of z_{m−1} are strictly positive):

H = diag{ [W^T W z_{m−1}]_k / [z_{m−1}]_k }    (58.74)

where the notation [a]_k denotes the kth entry of vector a. Note that the matrix H has nonnegative entries (since W and z_{m−1} have nonnegative entries). We now verify that H ≥ W^T W, as required.

Proof that H ≥ W^T W: For compactness of notation, let A = W^T W and note that A is K × K symmetric with nonnegative entries. We denote the individual entries of A by A_{ab}. Let also y_{m−1} = W^T W z_{m−1}. We denote the individual entries of z_{m−1} by z_a with subscript a; these entries are also nonnegative. Next, we introduce the diagonal matrices

D_z = diag{z_{m−1}},   D_y = diag{y_{m−1}}    (58.75)

Then, H = D_z^{−1} D_y. Moreover, the product D_z (H − W^T W) D_z simplifies to

D_z (H − W^T W) D_z = D_z (D_z^{−1} D_y − W^T W) D_z = D_y D_z − D_z A D_z    (58.76)

For any arbitrary K × 1 vector x with entries {x_a} we have

x^T D_z (H − W^T W) D_z x = x^T D_y D_z x − x^T D_z A D_z x
                          = Σ_{a,b=0}^{K−1} A_{ab} z_a z_b x_a² − Σ_{a,b=0}^{K−1} x_a z_a A_{ab} z_b x_b    (58.77)

where in the second line we spelled out the products in terms of the individual entries of the matrices and vectors involved. Using the fact that A is symmetric so that A_{ab} = A_{ba}, the first sum can be expressed in the form

Σ_{a,b=0}^{K−1} A_{ab} z_a z_b x_a² = (1/2) Σ_{a,b=0}^{K−1} A_{ab} z_a z_b x_a² + (1/2) Σ_{a,b=0}^{K−1} A_{ab} z_a z_b x_b²    (58.78)

and, consequently,

x^T D_z (H − W^T W) D_z x = Σ_{a,b=0}^{K−1} A_{ab} z_a z_b ( (1/2) x_a² + (1/2) x_b² − x_a x_b )
                          = (1/2) Σ_{a,b=0}^{K−1} A_{ab} z_a z_b (x_a − x_b)²
                          ≥ 0    (58.79)

It follows that D_z (H − W^T W) D_z ≥ 0. By congruence we conclude that H − W^T W ≥ 0 so that H ≥ W^T W. Therefore, we determined a choice for H that ensures Q(z) ≥ P(z) for all z. Next, using Q(z), we update z_{m−1} to z_m by selecting z_m to correspond to the location of the minimizer of Q(z), so that Q(z_m) ≤ Q(z_{m−1}); in effect, we are following a majorization–minimization argument. Setting the gradient vector of Q(z) to zero at z = z_m, i.e.,

∇_{z^T} Q(z) |_{z=z_m} = ∇_{z^T} P(z_{m−1}) + 2H(z_m − z_{m−1}) = 0    (58.80)

leads to

z_m = z_{m−1} − (1/2) H^{−1} ∇_{z^T} P(z_{m−1})    (58.81)

This expression agrees with the update for a generic column of Z_m from (58.62a), as used by the multiplicative update algorithm. Now note that

P(z_m) ≤ Q(z_m) ≤ Q(z_{m−1}) = P(z_{m−1})    (58.82)

so that the risk function is nonincreasing over the successive iterates for a generic column z. A similar argument applies for the successive iterates of W .  Example 58.3 (Application to MNIST dataset) We illustrate the operation of the multiplicative update algorithm (58.65) by applying it to the MNIST “handwritten digits” dataset encountered earlier in Example 52.3. The dataset consists of 60,000 labeled training samples. Each entry in the dataset is a 28 × 28 grayscale image, which we transform into an M = 784-long feature vector, hn . Each pixel in the image and, therefore, each entry in hn assumes nonnegative integer values in the range [0, 255]. Every feature vector (or image) is assigned an integer label in the range 0–9 depending on which digit the image corresponds to. The earlier Fig. 52.6 showed randomly selected images from the training dataset. In this example, we do not center or normalize the feature vectors and keep their entries in the range [0, 255]. We employ N = 10,000 samples and construct a data matrix X of size M × N , where M = 784 is the size of each feature vector. In this way, each column of X corresponds to the vectorized image of a handwritten digit. We construct two separate dictionaries: one with K = 9 atoms and another with K = 64 atoms for illustration purposes. The dictionary W has size M × K. This allows us to determine dictionaries with 9 and 64 elementary images that can be composed together to approximate the original images (or to generate new sample images). We run the multiplicative update algorithm (58.65) for 1000 iterations and plot the resulting dictionaries in Fig. 58.3. The figure also shows several original handwritten digits selected at random from the MNIST dataset and their approximations that follow from the dictionary representation X ≈ W Z.

58.5

COMMENTARIES AND DISCUSSION Dictionary learning. Dictionary learning is a powerful tool that leads to a sparse representation of dependencies in the feature data in terms of basis vectors (or atoms) in the form X ≈ W Z, where Z is sparse, W is the dictionary, and X is the data. Two key references that motivated this paradigm are the works by Mallat and Zhang (1993) and Chen, Donoho, and Saunders (1998); the latter work was made available as a technical report in 1995. One of the earliest applications of dictionary learning was in the context of image representation in terms of overcomplete dictionaries by Olshausen and Field (1996, 1997). This work led to a spike of interest in the field and motivated many subsequent works. As explained in the body of the chapter, most solution methods to the dictionary learning problem involve alternating between two steps: updating the dictionary (W ) and updating the sparse coding representation (Z). Several methods have been devised for this purpose, including the K-SVD method of Aharon, Elad, and Bruckstein (2006), the FOCUSS method of Gorodnitsky and Rao (1997) and Kreutz-Delgado et al. (2003), and the online learning method of Mairal et al. (2010); we discussed this latter method in the text leading to listing (58.33): It performs sparse coding by solving a standard LASSO problem and updates the dictionary

Figure 58.3 The plots on the left show the 9 and 64 atoms in the 784 × K dictionary W obtained by means of the multiplicative update algorithm (58.65) applied to the N = 10,000 samples from the MNIST dataset. The plots on the right show several original handwritten digits from the MNIST dataset and their approximations that follow from the dictionary representation X ≈ WZ.

by means of a gradient-descent iteration followed by projection. We also discussed the K-SVD method in Section 58.3; the reason for the name is due to some similarity with the nearest-neighbor and k-means procedures from Chapter 52. If we refer to (58.34a)–(58.34b), we observe that K-SVD is seeking sparse columns {z_n} that essentially solve a problem of the form:

(W^⋆, Z^⋆) = argmin_{W,Z} ‖X − WZ‖²_F,  subject to ‖z_n‖_0 ≤ T    (58.83)

for some upper bound T on the sparsity of the coding vectors. In this case, each feature vector will be represented by a sparse combination of atoms from the dictionary. The k-means procedure, on the other hand, corresponds to the special problem:

(W^⋆, Z^⋆) = argmin_{W} ‖X − WZ‖²_F,  subject to z_n = e_j for some j    (58.84)

where the coding vectors will match some basis vectors by choosing the closest atoms from the dictionary. If these atoms happen to be computed as cluster means, then the analogy with k-means becomes evident. Useful overviews of dictionary learning are given by Elad (2010), Rubinstein, Bruckstein, and Elad (2010), Tosic and Frossard (2011), and Dumitrescu and Irofti (2018). Dictionary learning has found applications in many domains, including in image denoising by Elad and Aharon (2006) and Mairal et al. (2010), novel document detection

by Kasiviswanathan et al. (2012), Aiello et al. (2013), and Takahashi, Tomioka, and Yamanishi (2014), feature extraction and classification by Mairal et al. (2008), biclustering by Lee et al. (2010), and decentralized learning by Chen, Towfic, and Sayed (2015). Nonnegative matrix factorization. NMF focuses on the important special case where all entries of {X, W, Z} are required to be nonnegative. In this case, the sparse factorization leads to more interpretable results. For example, in face recognition problems, atoms in W would correspond to different parts of the face, such as the nose, mouth, or ears. Matrices with nonnegative entries, and their factorization into the product of lower rank matrices, has been a subject of intense interest in the mathematical community for many years, especially in the domain of linear algebra – see, e.g., Thomas (1974), Campbell and Poole (1981), Chen (1984), and Berman and Plemmons (1994). Interest in NMF for applications peaked following the work by Paatero and Tapper (1994), which aimed at discovering the chemical components present in a chemical sample. Soon thereafter, a simple and efficient algorithm for the solution of NMF problems, known as the multiplicative update algorithm, was proposed by Lee and Seung (1999, 2001); the derivation in Example 58.2 is based on the arguments used in Lee and Seung (2001). The MU algorithm turned out to be a generalization of an earlier method originally developed by Daube-Witherspoon and Muehllehner (1986) in their work on image reconstruction. The article by Lee and Seung (1999) led to an explosion of interest in the topic, and to the development of many other algorithms for the solution of the NMF problem. In particular, the HALS method described in (58.59), which alternates between two coordinate-descent steps, was developed by Cichocki, Zdunek, and Amari (2007). Today, NMF has found applications in a wide range of fields, including face recognition, topic modeling, computer vision, genetics, bioinformatics, and recommender systems, to name a few. Good overview articles on NMF algorithms and applications are given by Wang and Zhang (2013), Gillis (2014), Zhou et al. (2014), and Fu et al. (2019). Orthogonal matching pursuit. We describe in Appendix 58.A the OMP method for determining sparse solutions to underdetermined linear systems of equations, i.e., for solving problems of the form xo = argmin kxk0 , subject to Ax = b, A ∈ IRM ×N , N ≥ M

(58.85)

x∈IRN

where the notation kxk0 refers to the number of nonzero entries in x. Problems of this type arise frequently in compressive sensing and sparse signal recovery from a limited amount of noisy measurements. Consider a situation where a vector b ∈ IRM is observed under noise, say, b = Ax + v, where v represents small perturbations, and x is a long sparse vector. The objective is to learn/recover a sparse solution xo such that b ≈ Axo . In the appendix, we describe one popular greedy method for solving problem (58.85) from knowledge of (A, b), known as the OMP algorithm. The qualification “greedy” means that the algorithm constructs the solution through a sequence of locally optimal steps in the hope that the overall construction will ultimately approximate well the desired global solution. One of the earliest references on the OMP algorithm (58.101) in signal processing appears to be the work by Pati, Rezaiifar, and Krishnaprasad (1993). Their article motivated OMP by modifying the traditional matching pursuit (MP) algorithm of Mallat and Zhang (1993) and Davis, Mallat, and Zhang (1994). This latter algorithm is listed in (58.86), where it is seen that the main difference in relation to OMP is that OMP incorporates a least-squares projection step so that the best approximation for b at step k is based on all atoms that have already been selected up to that iteration. The matching pursuit (MP) method (58.86) can be viewed as a special case of the projection pursuit algorithm in statistics by Friedman and Stuetzle (1981) and Huber (1985), and which we discussed in Section 40.5. Another earlier reference

related to OMP, albeit in the context of system identification and less directly related to sparse signal recovery, is the work by Chen, Billings, and Luo (1989).

MP algorithm for solving (58.85).

given A ∈ IR^{M×N} and b ∈ IR^M; usually N ≥ M;
denote columns of A by {a_n}, n = 1, 2, . . . , N;
given upper bound T < N on desired number of nonzero entries in x^o;
objective: find T-sparse vector x^o satisfying A x^o = b;
set initial residual vector r_0 = b; set initial solution vector to x_0 = 0.
repeat for k = 1, 2, . . . , T:
  n_k = argmax_{1≤n≤N} |r_{k−1}^T a_n|
  update the n_k-th entry of the solution: [x_k]_{n_k} = [x_{k−1}]_{n_k} + r_{k−1}^T a_{n_k}
  r_k = r_{k−1} − (r_{k−1}^T a_{n_k}) a_{n_k}
end
return x^o ← x_T.    (58.86)

We also provide in the appendix several conditions that guarantee recovery by the OMP method of the true sparse x^o that matches Ax^o to b in the noiseless case. The first condition (58.105) is in terms of the mutual incoherence of A, a concept introduced by Mallat and Zhang (1993) and used by Donoho and Huo (2001) and Tropp (2004). The condition is sufficient and requires the mutual incoherence of A, denoted by µ(A), to be sufficiently small; the precise bound appears in (58.105).

x^o is the unique k-sparse solution of Ax = b  ⟺  spark(A) > 2k    (58.91)

where spark(A) denotes the smallest number of columns of A that are linearly dependent.

Proof: Note that x^o is the only k-sparse solution if, and only if, there does not exist another k-sparse vector x_1 such that Ax^o = Ax_1 or, equivalently, A(x^o − x_1) ≠ 0 for any two k-sparse vectors x^o and x_1. Let S_0 and S_1 denote the supports of x^o and x_1, respectively (i.e., the sets of indices where they have nonzero entries). Then, if we restrict the matrix A to these support locations (i.e., consider only these columns) we have

A_{S_0 ∪ S_1} z ≠ 0    (58.92)

for any vector z of size |S0 ∪S1 |. This means that all submatrices with at most 2k columns have no nonzero vectors in their nullspace and are therefore full rank. It follows that spark(A) > 2k.  Problems of the type (58.90) arise frequently in compressive sensing and sparse signal recovery from a limited amount of noisy measurements. Consider a situation where a vector b ∈ IRM is observed under noise according to the model: b = Ax + v

(58.93)

where v represents small perturbations, and x is a long sparse vector. The objective is to learn/recover a sparse solution xo such that b ≈ Axo . In this appendix, we describe one popular greedy method for solving problem (58.90) from knowledge of (A, b), known as the OMP algorithm. The qualification “greedy” means that the algorithm constructs the solution through a sequence of locally optimal steps in the hope that the overall construction will ultimately approximate well the desired global solution.

Derivation of algorithm

Without loss of generality, we assume that the columns of A are normalized to unit norm. We denote these columns by {a_1, a_2, . . . , a_N} so that ‖a_n‖ = 1 for all n = 1, 2, . . . , N. If the solution x^o happens to be 1-sparse (i.e., k = 1), then b will be proportional to one of the columns of A. This column can be determined by searching for the column a_n that has the largest absolute correlation with b:

n_1 = argmax_{1≤n≤N} |b^T a_n|    (58.94)

We denote the resulting index by n_1 and add this index to the set C_1 = {n_1} of columns from A that contribute to b. The value n_1 identifies which entry of x^o should be nonzero, namely, its n_1-th entry. Using C_1, which identifies the column in A of index n_1 that contributes the most to b, we approximate the amount of contribution by solving the least-squares problem:

x̂_1 = argmin_{x∈IR} ‖A_{C_1} x − b‖²    (58.95)

where the notation A_{C_1} extracts the column of index n_1 from A; in this case, it amounts to column a_{n_1}. The solution is given by

x̂_1 = (A_{C_1}^T A_{C_1})^{−1} A_{C_1}^T b = (1/‖a_{n_1}‖²) a_{n_1}^T b = a_{n_1}^T b    (58.96)

If, on the other hand, x^o happens to be 2-sparse so that k = 2, we can determine the next column of A that contributes the most to b as follows. We first subtract the effect of a_{n_1} from b to get a residual vector

r_1 = b − a_{n_1} x̂_1    (58.97)

We then search for the second column in A (excluding a_{n_1}) that is most correlated with r_1. We denote its index by n_2 and add it to C_2 = {n_1, n_2}. Using C_2, which identifies the two columns in A of indices {n_1, n_2} that contribute the most to b, we approximate the amount of contribution by solving the least-squares problem:

x̂_2 = argmin_{x∈IR²} ‖A_{C_2} x − b‖²    (58.98)

where the notation A_{C_2} extracts the columns of indices n_1 and n_2 from A. The solution is given by

x̂_2 = (A_{C_2}^T A_{C_2})^{−1} A_{C_2}^T b    (58.99)

and the new residual that results from removing the effect of the new added column from A is

r_2 = r_1 − a_{n_2} x̂_2(2)    (58.100)

in terms of column a_{n_2} and the second entry in the solution vector x̂_2 = {x̂_2(1), x̂_2(2)}. The process continues in this manner, leading to listing (58.101).

OMP algorithm for solving (58.90).

given A ∈ IR^{M×N} and b ∈ IR^M; usually N ≥ M;
denote columns of A by {a_n}, n = 1, 2, . . . , N;
given upper bound T < N on desired number of nonzero entries in x^o;
objective: find T-sparse vector x^o satisfying A x^o = b;
set initial residual vector r_0 = b; set initial index set to C_0 = ∅ (empty set).
repeat for k = 1, 2, . . . , T:
  n_k = argmax_{1≤n≤N} |r_{k−1}^T a_n|
  C_k = C_{k−1} ∪ {n_k}
  A_{C_k} = restriction of A to columns in C_k
  x̂_k = (A_{C_k}^T A_{C_k})^{−1} A_{C_k}^T b
  r_k = r_{k−1} − a_{n_k} x̂_k(k)
end
return C_T (contains indices of nonzero entries in x^o);
return x̂_T (contains nonzero entries of x^o).    (58.101)

In the above listing, we can replace the explicit least-squares solution by writing instead

x̂_k = argmin_{x∈IR^k} ‖A_{C_k} x − b‖²    (58.102)

We can also replace the update for the residual vector using the original b instead of r_{k−1}. In this way, the loop in (58.101) can be rewritten in the following alternative form:

repeat for k = 1, 2, . . . , T:
  n_k = argmax_{1≤n≤N} |r_{k−1}^T a_n|
  C_k = C_{k−1} ∪ {n_k}
  A_{C_k} = restriction of A to columns in C_k
  x̂_k = argmin_{x∈IR^k} ‖A_{C_k} x − b‖²
  r_k = b − A_{C_k} x̂_k
end    (58.103)
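The listing translates directly into a few lines of NumPy; the sketch below follows the alternative form (58.103) and is illustrative (the function name and the use of numpy.linalg.lstsq for the least-squares step are my choices, and the columns of A are assumed to have unit norm):

import numpy as np

def omp(A, b, T):
    # OMP sketch for the T-sparse recovery problem (58.90)
    M, N = A.shape
    r = b.copy()          # initial residual r_0 = b
    C = []                # selected column indices
    x_hat = np.zeros(0)
    for _ in range(T):
        n_k = int(np.argmax(np.abs(A.T @ r)))   # most correlated column
        if n_k not in C:
            C.append(n_k)
        A_C = A[:, C]
        x_hat, *_ = np.linalg.lstsq(A_C, b, rcond=None)  # least-squares over selected atoms
        r = b - A_C @ x_hat                              # residual update, as in (58.103)
    x_o = np.zeros(N)
    x_o[C] = x_hat        # place the recovered nonzero entries
    return x_o, C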

Sparse recovery guarantees

It turns out that the OMP method can ensure guaranteed recovery of the true sparse x^o that matches Ax^o to b in the noiseless case under certain conditions on (A, x^o). To see this, we continue to assume that the columns of A are normalized to unit norm. We define the mutual incoherence of the N columns of A as the largest absolute correlation among its columns:

µ(A) = max_{1≤n≠n′≤N} |a_n^T a_{n′}|    (58.104)

This measure reflects how the atoms are “related” to each other or how much they “look alike.” A dictionary A with small µ(A) is said to be incoherent because its columns have minimal correlations among themselves. Since, by the Cauchy–Schwarz inequality, we know that |a^T b| ≤ ‖a‖ ‖b‖ for any vectors {a, b}, we conclude that µ(A) ∈ [0, 1]. Although the equation Ax = b has infinitely many solutions, we will establish in the following that any k-sparse solution x can be recovered by OMP if the mutual incoherence of A and the sparsity level of x satisfy the following sufficient condition:

µ(A) < 1/(2k − 1)    (58.105)
This suggests that we can also select the class label by solving instead

r^⋆(h) = argmax_{1≤r≤R} { h^T w_r^⋆ − θ_r^⋆ }    (59.28)
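As a small illustration of the OvA decision rule, the following sketch stacks the R trained classifiers into a matrix and applies (59.28); the stacking convention and the names are assumptions for illustration, not from the text:

import numpy as np

def ova_predict(h, W, theta):
    # W: R x M matrix whose r-th row is w_r^T; theta: length-R vector of offsets
    scores = W @ h - theta                       # h^T w_r - theta_r for each class
    likelihoods = 1.0 / (1.0 + np.exp(-scores))  # per-class confidences, as in (59.24)
    return int(np.argmax(scores)), likelihoods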

Example 59.2 (Applying OvA to the iris dataset) We consider the dimensionally reduced iris dataset from the top plot of Fig. 57.5. There are three classes denoted by r ∈ {0, 1, 2} corresponding to the setosa (r = 0), versicolor (r = 1), and virginica (r = 2) flower types. There are also a total of N = 150 samples. The plots in Fig. 59.5 show all data samples, along with the groupings that result from considering samples from one class against the combined samples from the other two classes. The feature vectors are extended according to (59.16). A collection of 120 samples are selected for training while the remaining 30 samples are used for testing. We use the 120 samples to train a logistic classifier using the stochastic gradient recursion (59.15) with µ = 0.01 and ρ = 0.1. Five passes of the algorithm with random reshuffling are applied to the data, resulting in (where, for completeness, we are highlighting the offset and weight vector parameters):

col{ −θ^⋆_{0,12}, w^⋆_{0,12} } = col{ −0.4813, 1.1256, −0.3421 }   (to separate class 0 from classes (1, 2))    (59.29a)

and

col{ −θ^⋆_{1,02}, w^⋆_{1,02} } = col{ −0.4134, −0.1498, 0.5381 }   (to separate class 1 from classes (0, 2))    (59.29b)

Figure 59.5 The top left plot shows all data samples from the three classes r = 0, 1, 2. The other plots show the groupings that result from considering samples from one class against the combined samples from the other two classes.

and

col{ −θ^⋆_{2,01}, w^⋆_{2,01} } = col{ −0.4931, −0.8530, −0.2329 }   (to separate class 2 from classes (0, 1))    (59.29c)

Figure 59.6 shows the training data and the test data. It also shows the resulting separating lines. It is clear from the plot on the right in the top row of the figure that it is not possible to separate class r = 1 from the combined classes r ∈ {0, 2} by means of a linear classifier. The same is true, albeit to a lesser extent, for separating class r = 2 from the combined classes r ∈ {0, 1}. The empirical error rates obtained over the training data in each of the three cases shown in the figure are 0% for separating r = 0 from r ∈ {1, 2}, 26.67% for separating r = 1 from r ∈ {0, 2}, and 13.33% for separating r = 2 from r ∈ {0, 1}. Next, for each test vector h, we use expression (59.24) to determine the likelihood that it belongs to class r ∈ {0, 1, 2}. The bottom plot in Fig. 59.7 shows the largest likelihood value for each test sample. The top plot on the right shows the predicted labels over the test data. The samples that are misclassified are marked in this plot by red. It is observed that five samples are misclassified, resulting in an empirical error rate of 16.67% over the test data (or 5 errors out of 30 samples).

Figure 59.6 The bottom right plot shows the test data. The other plots show the logistic regression classifier that is obtained in each case. It is clear that it is possible to classify without errors the training data in the top-left plot related to separating class r = 0 from r ∈ {1, 2}. The same is not true for the other two cases. In particular, it is not possible to separate class r = 1 from the combined classes r ∈ {0, 2} by means of a linear classifier.

59.3.2 OvO Strategy

The second technique for multiclass classification is the OvO strategy. Starting with a total of R classes, there are R(R − 1)/2 pairwise combinations of individual classes that are possible. For example, if R = 3 so that we have three classes, r ∈ {1, 2, 3}, then we can group the training data according to the following pairs of classes: (1, 2), (1, 3), and (2, 3). In the grouping (1, 2), all data points belonging to classes r = 1 and r = 2 will be used to train a binary classifier to separate between these classes. In the second grouping (1, 3), all data points belonging to classes r = 1 and r = 3 will be used to train a binary classifier to separate between these classes. And likewise for the data corresponding to the grouping (2, 3) – see Fig. 59.8. At the end of this training process, we end up with R(R − 1)/2 classifiers, one for each pairing of classes. During testing, when the classification machine receives a new feature vector h and wishes to classify it, the procedure is as

Figure 59.7 The bottom plot shows the largest likelihood value for each test sample. The top right plot shows the resulting predicted labels over the test data. It is observed that five test samples are misclassified.

follows. Each classifier generates a classification decision for h (whether it belongs to one of its classes or the other). For example, for the case R = 3 described above, assume that h belongs to class r = 2. Then, classifier (1, 2) will decide that h belongs to class 2, classifier (2, 3) will also decide that it belongs to class 2, while classifier (1, 3) will issue some wrong decision. The final decision is to select the class that received the largest number of votes from the R(R − 1)/2 classifiers. One inconvenience of this procedure is that some classes may receive an equal number of votes (which can, for example, be resolved by randomly selecting one choice from among the possibilities); a code sketch of this voting step appears after Example 59.3.

Example 59.3 (Applying OvO to the iris dataset) We consider the same setting from Example 59.2 except that we now apply the OvO procedure. There are three classes denoted by r ∈ {0, 1, 2} corresponding to the setosa (r = 0), versicolor (r = 1), and virginica (r = 2) flower types. There are also a total of N = 150 samples: 120 of them are selected for training and the remaining 30 samples are used for testing. The plots in Fig. 59.9 show all training samples, along with the pairings that result from considering samples from one class against the samples from another class.

Figure 59.8 In the OvO strategy, a collection of R(R − 1)/2 binary classifiers are designed for a multiclass classification problem involving R cases.

We again extend the feature vectors according to (59.16) and apply five passes of the ℓ2-regularized logistic regression algorithm (59.15) using µ = 0.01 and ρ = 0.1. Using random reshuffling, the simulation leads to (where we are showing both the offset and the weight parameters for completeness):

col{ −θ^⋆_{01}, w^⋆_{01} } = col{ −0.2800, 0.9675, −0.4310 }   (to separate class 0 from class 1)    (59.30a)

and

col{ −θ^⋆_{02}, w^⋆_{02} } = col{ −0.1095, 1.1166, −0.0912 }   (to separate class 0 from class 2)    (59.30b)

and

col{ −θ^⋆_{12}, w^⋆_{12} } = col{ 0.1993, 0.5390, 0.3971 }   (to separate class 1 from class 2)    (59.30c)

Figure 59.10 shows the training data and the test data. It also shows the resulting separating lines. It is clear from the plot on the right in the top row that classes r = 1 and r = 2 are not separable by a linear classifier. The middle plots in the figure show the original test data and the predicted labels. It is seen that there are 5 misclassifications (out of 30 test samples), resulting in an empirical error rate of 13.33%.
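The voting step of the OvO strategy described before Example 59.3 can be sketched as follows (the dictionary-based representation of the R(R − 1)/2 binary classifiers and the random tie-breaking are illustrative choices):

import numpy as np

def ovo_predict(h, classifiers, R, rng=None):
    # classifiers maps a pair (r, s) with r < s to the parameters (w, theta) of the
    # binary classifier trained on classes r and s; it votes for r when h^T w - theta >= 0
    votes = np.zeros(R, dtype=int)
    for (r, s), (w, theta) in classifiers.items():
        votes[r if (h @ w - theta) >= 0 else s] += 1
    winners = np.flatnonzero(votes == votes.max())
    rng = np.random.default_rng() if rng is None else rng
    return int(rng.choice(winners)), votes       # ties are broken at random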

Figure 59.9 The top left plot shows all training samples from the three classes r = 0, 1, 2. The other plots show the pairings that result from considering samples from one class against samples from another class.

59.4 ACTIVE LEARNING

In this section and the next we discuss two important problems in learning where logistic regression plays a supporting role. The concepts discussed here can be applied to other supervised classification algorithms as well. We start with the problem of active learning. The main objective of active learning is to endow a learning algorithm with the ability to select which training samples to use and in what order. The expectation is that by doing so, the classifier will be able to deliver improved performance with a smaller number of labeled samples. This is particularly helpful in applications where it is costly to collect labeled data.

59.4.1 Labeled Data

Assume we have a collection of N labeled data pairs {γ(n), h_n}. For simplicity, we assume two classes, γ(n) ∈ {±1}, although the discussion can be easily extended to multiclass problems – see Example 59.5. Under active learning, the learner

Figure 59.10 The top row shows the pairings of classes and the separating lines that result from logistic regression. It is clear that it is not possible to classify without errors the training data in the top rightmost plot related to separating class r = 1 from r = 2. The middle plots show the original test data and the predicted labels. It is seen that there are 5 misclassifications (out of 30 test samples).

selects initially a random subset of N_1 training samples, called the seed. We refer to this set by the letter S, and denote the unused samples by U. The learner uses the samples in S to train an initial classifier and arrive at its parameters (w^⋆, θ^⋆). This step can be carried out, for example, by using the stochastic gradient logistic regression algorithm (59.15) with or without regularization. To continue, the classifier will now query the other set, U, repeatedly to decide which of its samples to choose in order to continue to update the classifier (w^⋆, θ^⋆). This procedure is known as pool-based sampling, which is one of the more popular schemes in active learning. The learner follows these steps to carry out the query process:

(a) (Compute confidence levels and uncertainties) For each feature h_n ∈ U (i.e., for each sample in the pool of unused samples), the classifier evaluates the confidence level that it would have in assigning it to class γ = +1. We denote this confidence level by

p(n) = 1 / ( 1 + e^{−(h_n^T w^⋆ − θ^⋆)} )    (confidence level)    (59.31)

Obviously, the confidence that the classifier has in assigning the same sample to the other class γ = −1 is 1 − p(n). We use the probabilities {p(n)} to assess the level of uncertainty that we will have about the true label for h_n. The uncertainty is computed by using the following entropy measure for the nth sample:

H(n) = −p(n) log₂ p(n) − (1 − p(n)) log₂ (1 − p(n))    (uncertainty)    (59.32)

If the set U is very large, it may become computationally demanding to evaluate these uncertainties for all feature vectors in it. Alternatively, we can sample a random subset of U and only evaluate the confidence levels and entropy values for the features in this subset. The following are two popular strategies to select the next sample from U (or its subset) for use in training. They are referred to as uncertainty sampling procedures:

(a1) (Least confidence) One strategy is to select the sample h_{n^o} for which the learner is least confident about its class, i.e., the one for which p(n) is closest to 0.5:

n^o = argmin_{n∈U} | 0.5 − p(n) |    (59.33)

Note that we are not selecting the sample with the smallest likelihood value, but rather the sample whose likelihood is closest to 0.5.

(a2) (Most uncertainty) A related strategy is to select the sample h_{n^o} with the highest entropy value (i.e., the sample for which the algorithm is least certain about its class):

n^o = argmax_{n∈U} H(n)    (59.34)

Under binary classification, this criterion is similar to (59.33) because the entropy measure attains its maximum when p(n) = 1/2. For both cases of (a1) and (a2), and under mini-batch implementations, the classifier would query U to select B samples at once by choosing the B samples with the smallest confidence or largest entropy values. (b) (Update the classifier) Once a new training sample (γ(no ), hno ) has been selected, the learner updates its (w? , θ? ) and repeats the procedure by seeking a new point from the unused set U, updating the classifier, and so forth. Observe that under active learning, the classifier updates its parameters by repeatedly using data that it is least confident about (i.e., samples that are most challenging to classify correctly under the current parameter values).
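The two uncertainty-sampling rules can be sketched as follows for a logistic classifier (w^⋆, θ^⋆); the function names and the small eps inside the entropy are illustrative choices:

import numpy as np

def confidence(H_pool, w, theta):
    # p(n) from (59.31) for every feature vector (row) in H_pool
    return 1.0 / (1.0 + np.exp(-(H_pool @ w - theta)))

def query_least_confident(H_pool, w, theta):
    # rule (59.33): likelihood closest to 0.5
    p = confidence(H_pool, w, theta)
    return int(np.argmin(np.abs(0.5 - p)))

def query_max_entropy(H_pool, w, theta, eps=1e-12):
    # rule (59.34): largest binary entropy (59.32)
    p = confidence(H_pool, w, theta)
    H = -(p * np.log2(p + eps) + (1 - p) * np.log2(1 - p + eps))
    return int(np.argmax(H))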

59.4.2 Unlabeled Data

When all training samples are labeled, active learning helps attain higher accuracy levels with fewer training samples. However, active learning can also be applied to situations where there is a limited amount of labeled data. We denote the smaller set of labeled data by S and use it to train an initial classifier (w^⋆, θ^⋆) as before. The remaining unlabeled data are collected into the

second set U. For each of the feature vectors in U, the active learner again uses expression (59.31) to evaluate how confident it will be about its classification. It then selects the feature vector h_{n^o} with the lowest confidence level, according to steps (a1) or (a2), and requests that its true label γ(n^o) be provided by an oracle. The oracle is usually a human annotator. This situation provides an example of a design system involving a human-in-the-loop. Once (h_{n^o}, γ(n^o)) are known, the active learner uses this data point to update its classifier (w^⋆, θ^⋆), and the process repeats. Observe that in this implementation, the learner is only requesting labels for what it believes are the most informative feature vectors within the unlabeled dataset. By doing so, the learner prioritizes the features and it becomes unnecessary to collect labels for all features in U at once, but only on demand.

Example 59.4 (Active learning applied to a logistic regression model) We consider the ℓ2-regularized logistic regression algorithm (59.15) with the offset parameter set to zero, namely,

w_n = (1 − 2µρ) w_{n−1} + µ γ(n) h_n / ( 1 + e^{γ(n) h_n^T w_{n−1}} ),   n ≥ 0    (59.35)

We generate N = 2000 random pairs of data {γ(n), h_n} according to a logistic model. First, a random parameter w^a ∈ IR^{10} is selected, and a random collection of feature vectors {h_n} are generated with zero-mean unit-variance Gaussian entries. Then, for each h_n, the label γ(n) is set to either +1 or −1 according to the following construction:

γ(n) = +1 if 1 / ( 1 + e^{−h_n^T w^a} ) ≥ 1/2,   otherwise γ(n) = −1    (59.36)

A total of K = 300 epochs are run over the data, with the data randomly reshuffled prior to each run. We determine the value of the risk function P (w) at the beginning k of each epoch, denoted by P (w−1 ). This results in a learning curve showing how the risk value diminishes with the epoch index. We repeat the experiment 10 times and average the learning curves to obtain a smoother curve. The learning curves are plotted in normalized logarithmic scale, namely, ! k P (w−1 ) − P (w? ) ln (59.37) k maxk {P (w−1 ) − P (w? )} where the minimizer w? for P (w) is “obtained” by applying a batch gradient-descent algorithm on the entire set of data points. The simulation uses ρ = 1, µ = 0.0001, and M = 10. We assume we know the labels for only 40 data points, while the labels for the remaining 1960 feature vectors will be requested on demand. We run algorithm (59.35) on the available labeled data points and obtain an initial classifier model, w40 . Subsequently, we follow an active learning approach. We select 20 random samples from among the remaining 1960 samples. For each of the samples in this batch, we compute the confidence level p(n) and retain the sample of least confidence, indexed by no according to (59.33). We request the label for this feature vector and use the data point (γ(no ), hno ) to update w40 to w41 . We repeat the procedure 200 times. Figure 59.11 shows the learning curves for this construction, as well as the resulting weight for the classifier. It is seen that the learner is able to estimate well the entries of the classifier.

Figure 59.11 (Top left) A sample learning curve P(w_{−1}^k) relative to the minimum risk value P(w^⋆) in normalized logarithmic scale for the stochastic gradient implementation (59.35) under random reshuffling. (Top right) Smoothed learning curve obtained by averaging over 10 experiments. (Bottom) Actual logistic regression model w^a and the estimate for it obtained through active learning.

Example 59.5 (Multiclass learning) The description of the active learning procedure focused on the binary case. When there are multiple classes, say r ∈ {1, 2, . . . , R}, we first apply the OvA procedure to design the binary classifiers {(w_1^⋆, θ_1^⋆), . . . , (w_R^⋆, θ_R^⋆)}. Then, for each feature h_n ∈ U, expression (59.24) provides the likelihood that it belongs to class r according to classifier w_r^⋆:

p_r^0(n) = P(r = r | h_n = h_n; w_r^⋆, θ_r^⋆) = 1 / ( 1 + e^{−(h_n^T w_r^⋆ − θ_r^⋆)} )    (59.38)

We normalize these likelihoods to add up to 1 and transform them into a probability distribution as follows:

p_r(n) = p_r^0(n) / Σ_{ℓ=1}^{R} p_ℓ^0(n)    (59.39)

Using these values, we can assess the uncertainty about the class label for h_n by computing its entropy:

H(n) = − Σ_{r=1}^{R} p_r(n) log₂ p_r(n)    (59.40)

Then, we select the sample n^o as follows. For each sample h_n, we first determine which label appears to be the most likely for it, denoted by:

r̂(n) = argmax_{1≤r≤R} p_r(n)    (59.41)

Let p̂(n) denote the corresponding largest likelihood:

p̂(n) = p_{r̂(n)}(n)    (59.42)

Subsequently, we choose n^o by selecting the sample with the least confidence:

n^o = argmin_{n∈U} | 0.5 − p̂(n) |    (59.43)

Table 59.1 provides an example with four samples from U assuming R = 3 classes. From the entries in the last column we deduce that n^o = 1 for this example.

Table 59.1 Example showing the likelihood values for four samples from U.

Sample | p_1(n) | p_2(n) | p_3(n) | r̂(n) | p̂(n) | |0.5 − p̂(n)|
   1   |  0.5   |  0.3   |  0.2   |   1   |  0.5  |     0.0
   2   |  0.1   |  0.2   |  0.7   |   3   |  0.7  |     0.2
   3   |  0.2   |  0.2   |  0.6   |   3   |  0.6  |     0.1
   4   |  0.1   |  0.8   |  0.1   |   2   |  0.8  |     0.3
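The selection rule (59.41)–(59.43) applied to the likelihoods in Table 59.1 can be reproduced with a few NumPy lines (the array layout is an assumption for illustration):

import numpy as np

p = np.array([[0.5, 0.3, 0.2],    # rows: samples 1..4, columns: classes r = 1..3
              [0.1, 0.2, 0.7],
              [0.2, 0.2, 0.6],
              [0.1, 0.8, 0.1]])
r_hat = p.argmax(axis=1) + 1                    # most likely label per sample, (59.41)
p_hat = p.max(axis=1)                           # its likelihood, (59.42)
n_o = int(np.argmin(np.abs(0.5 - p_hat))) + 1   # least-confident sample, (59.43)
# n_o evaluates to 1, in agreement with the last column of Table 59.1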

Example 59.6 (Other classifier structures) The description of the active learning procedure prior to the examples relied on the training of a binary logistic regression classifier, which allowed us to assess the confidence levels using expression (59.31). Active learning can be applied to other types of classifiers, which may not have an explicit expression for evaluating the confidence level associated with their decisions. Two examples of alternative policies that can be used to select the “least confident” sample include:

(a) Selecting the sample n^o from the unused set U (or a subset of it) that is closest to the separating hyperplane (w^⋆, θ^⋆). We explain in future expression (60.10) that the distance of a generic feature vector h to the hyperplane is given by

distance from h to hyperplane (w^⋆, θ^⋆) = | h^T w^⋆ − θ^⋆ | / ‖w^⋆‖    (59.44)

(b) For each h_n ∈ U (or in a subset of U), we predict its label using sign(h_n^T w^⋆ − θ^⋆). We subsequently select a neighborhood of the closest K features to h from the seed set S and find out how many of them belong to the same class +1, say, N_h. Then, the ratio N_h/K is an approximation for the confidence level in assigning h to that class, from which we can estimate p(n) (the confidence in assigning h to class +1) and continue from here as before.

59.5 DOMAIN ADAPTATION

We examine next the concept of domain adaptation, where logistic regression again plays a useful supporting role. Domain adaptation deals with the situation where a learning algorithm is trained on data arising from a particular joint distribution (e.g., height and weight of female and male individuals from a certain geographic region, A), and then it is desired to adjust the classifier so that it can operate reliably on data arising from a perturbed distribution (such as

height and weight measurements for female and male individuals from another geographic region, B) without the need to carry out a new retraining stage. This is particularly useful in situations where collecting new training data may be prohibitive. To formulate the domain adaptation problem, we need to distinguish between two domains: the source domain and the target domain.

59.5.1 Source Domain

Let γ and h ∈ IR^M refer to labels and feature vectors, and consider a stochastic risk optimization problem of the general form:

w^o = argmin_{w∈IR^M} E Q(w; γ, h)    (59.45)

where Q(·; ·) represents some loss function (such as the logistic loss) and the expectation is over the joint distribution of the data. We have used the notation f_{γ,h}(γ, h) before to refer to this distribution. For reasons that will become clear soon, we will instead use the notation f_S(γ, h), with a subscript S, and refer to it as the source distribution. We already know how to apply stochastic optimization procedures to minimize risk functions of the form (59.45). For instance, assume we have streaming data {γ(n), h_n} arising from the distribution f_S(γ, h). Then, we can iterate, for example, a stochastic gradient recursion to get successive estimates for w^o:

w_n = w_{n−1} − µ ∇_{w^T} Q(w_{n−1}; γ(n), h_n),   n ≥ 0    (59.46)

The same algorithm is useful for seeking the minimizer of an empirical risk approximation that solves instead:

w^⋆ = argmin_{w∈IR^M} (1/N_S) Σ_{n=0}^{N_S−1} Q(w; γ(n), h_n)    (59.47)

assuming that we have access to a collection of N_S data realizations {γ(n), h_n}. We can also incorporate regularization into the empirical risk and solve instead

w^⋆ = argmin_{w∈IR^M} { ρ‖w‖² + (1/N_S) Σ_{n=0}^{N_S−1} Q(w; γ(n), h_n) }    (59.48)

in which case we would apply the following stochastic gradient recursion:

w_n = (1 − 2µρ) w_{n−1} − µ ∇_{w^T} Q(w_{n−1}; γ(n), h_n),   n ≥ 0    (59.49)

Once trained, the resulting classifier w? is expected to perform well on test data arising from the same source distribution as the training data.

59.5.2 Target Domain

There are, however, important situations in practice where the test data arise from a different distribution than the training data. For example, it may be

easy to collect feature data from some geographic region A and label the data according to whether an individual has one medical condition or another. And yet we are interested in employing the classifier to discriminate among features collected from another geographic region B where the distribution of the data follows a different pattern, and where it is either difficult or expensive to collect labels for the data. If we simply train the classifier using the available labeled data from region A, and use it to classify the feature vectors from region B without proper adjustment, then the accuracy of the classification task will generally be low. This situation gives rise to a scenario where a classifier is trained by data arising from a source distribution and needs to be tested on data arising from a different target distribution. Domain adaptation provides one solution to this problem; later in future Example 65.11 we will illustrate another solution method based on transfer learning in the context of neural networks. The terms “domain adaptation” and “transfer learning” are often used interchangeably even though the former refers to a narrower situation where the label and feature spaces are the same but only their probability distributions can change. Domain adaptation can be motivated as follows. We denote the joint distribution for the data from the target domain by fT (γ, h), with a subscript T . We use the Bayes rule and factor it as fT (γ, h) = fT (h) fγ |h (γ|h)

(59.50)

where the first component on the right-hand side is the distribution of the feature data under the target distribution, while the second component continues to be the conditional distribution of the label over the data. Observe that we are writing fT (h) instead of fh (h) to emphasize that this data distribution is specific to the target domain. Likewise, we will write fS (h) to refer to the distribution of the feature data in the source domain. One of the common situations for domain adaptation is the case where the feature distribution is different between the source and target domains, i.e., fS (h) 6= fT (h)

(59.51)

The difference between these distributions needs to be small for better results. At the same time, it is assumed that the conditional distributions remain invariant across both domains: fγ |h (γ|h) is invariant in the source and target domains

(59.52)

This case is referred to as domain adaptation under covariate shift. Now, if we had access to labeled training data from the target distribution, we could consider minimizing directly the risk function over the target domain, which would be defined by

P_T(w) = E_T Q(w; γ, h)
       = E_h { E_{γ|h} Q(w; γ, h) }
       = ∫_{h∈H} f_T(h) { E_{γ|h} Q(w; γ, h) } dh    (59.53)

where in the first line the expectation is over the target distribution fT (γ, h). Unfortunately, we only have access to labeled data from the source domain and not from the target domain. Therefore, we cannot employ an iterative procedure, such as a stochastic gradient algorithm, to minimize PT (w) because the risk values cannot be evaluated; only realizations for h are available from the target distribution but not for their labels γ.

59.5.3 Training under Covariate Shift

The main question is whether it is possible to use the available labeled training data from the source domain to seek the minimizer of (59.53). The answer is in the affirmative. To show how this can be done, we first rework the risk expression (59.53) as follows:

P_T(w) = ∫_{h∈H} f_T(h) (f_S(h)/f_S(h)) { E_{γ|h} Q(w; γ, h) } dh    (a)
       = ∫_{h∈H} f_S(h) { E_{γ|h} (f_T(h)/f_S(h)) Q(w; γ, h) } dh
       = E_S { (f_T(h)/f_S(h)) Q(w; γ, h) }    (59.54)

where in step (a) we multiplied and divided by the same quantity f_S(h). In the last equality we used the fact that, under covariate shift, the conditional pdf of γ given h is the same for both source and target domains. The last expectation is over the joint distribution of the source domain. This derivation assumed that both the source and target distributions for the feature data share the same range space, written as h ∈ H. The scalar α(h) = f_T(h)/f_S(h) in (59.54) is referred to as the importance weight. The key observation is that by scaling by α(h) we are able to transform the risk P_T(w) from (59.53), which involves averaging over the joint target distribution, f_T(γ, h), to an equivalent expression (59.54) that involves averaging over the joint source distribution, f_S(γ, h). Motivated by expression (59.54), and using the labeled training data {γ(n), h_n} from the source domain, we introduce a regularized empirical risk of the form (say, an ℓ2-regularized risk for illustration purposes):

P(w) = ρ‖w‖² + (1/N_S) Σ_{n=0}^{N_S−1} (f_T(h_n)/f_S(h_n)) Q(w; γ(n), h_n)    (59.55)

where the ratio f_T(h_n)/f_S(h_n) is evaluated at the source feature vector, h_n. We still cannot minimize this empirical risk because the pdfs, f_T(h) and f_S(h), are not known. All we have is a collection of N_S labeled feature vectors {h_0, . . . , h_{N_S−1}} from the source domain and a second collection of N_T unlabeled feature vectors {h_{N_S+1}, . . . , h_{N_S+N_T−1}} from the target domain. There are solution methods that rely on estimating the pdfs f_S(h) and f_T(h) from the data, which requires modeling these distributions explicitly, and then applying a stochastic gradient recursion to seek the minimizer of P(w). We describe another solution that relies on the use of a logistic regression classifier. To begin with, let us introduce a second label, denoted by σ, such that σ = +1 if feature h arises from the source domain and σ = −1 if h arises from the target domain. It can be easily verified that (see Prob. 59.16):

α(h) = f_T(h)/f_S(h) = ( P(σ = +1) / P(σ = −1) ) × ( P(σ = −1 | h) / P(σ = +1 | h) )    (59.56)

where the first ratio, P(σ = +1)/P(σ = −1), is a measure of the relative frequency of samples arising from the source and target domains. These probabilities can be estimated as follows:

P̂(σ = +1) = N_S / (N_S + N_T),   P̂(σ = −1) = N_T / (N_S + N_T)    (59.57)

The rightmost ratio in (59.56), given by P(σ = −1|h)/P(σ = +1|h), can be estimated by introducing a separate classifier to distinguish between features arising from one domain or the other. For example, we can train a logistic regression classifier and determine its parameters (w^x, θ^x) to separate between classes σ = +1 and σ = −1; it should be noted that these two classes may not generally be linearly separable (in which case we can use other classifier structures, such as kernel-based learning, as discussed in a future chapter). We have available a total of N_S + N_T feature vectors for this classification task, with N_S of them belonging to class σ = +1 and N_T of them belonging to class σ = −1. Let {h_n, σ(n)} refer to all available feature vectors and their labels (source or target features). Then, we can learn (w^x, θ^x) by iterating the logistic regression algorithm:

σ̂(n) = h_n^T w_{n−1}^x − θ^x(n − 1)    (59.58a)
θ^x(n) = θ^x(n − 1) − µ σ(n) / ( 1 + e^{σ(n) σ̂(n)} )    (59.58b)
w_n^x = (1 − 2µρ) w_{n−1}^x + µ σ(n) h_n / ( 1 + e^{σ(n) σ̂(n)} )    (59.58c)

Once this classifier is trained, we use its limiting parameters (wx , θx ) to estimate the probabilities that are needed in (59.56). Using expression (59.6), we can write 1 1 + e−(hT wx −θx ) 1 P(σ = N1|h) = (h 1 + e T wx −θx ) P(σ = +1|h) =

(59.59) (59.60)


Substituting into (59.56) we get

α(h) = f_T(h)/f_S(h)
     ≈ (N_S/N_T) × (1 + e^{−(h^T w^x − θ^x)})/(1 + e^{(h^T w^x − θ^x)})
     = (N_S/N_T) e^{−(h^T w^x − θ^x)}
     = (N_S/N_T) e^{−σ̂}     (59.61)

where we are defining the predicted label:

σ̂ ≜ h^T w^x − θ^x     (59.62)

A second method to estimate the ratio P(σ = −1|h)/P(σ = +1|h) is to employ a k-nearest neighbor (k-NN) rule instead of the logistic regression classifier. In this case, a majority vote from the k nearest neighbors to h from among all samples {h_n, σ(n)} would determine its predicted label and allow us to estimate P(σ = σ|h) by counting the number of votes for the label σ divided by k. In this case, the importance weight α(h) would be given by

α(h) ≈ (N_S N_{−1})/(N_T N_{+1})     (59.63)

where N_{−1} is the number of neighbors from class −1 and N_{+1} is the number of neighbors from class +1 within the k-size neighborhood around h. Clearly, N_{−1} + N_{+1} = k. Once the probabilities needed in (59.56) have been estimated, we can apply a stochastic gradient algorithm to minimize (59.55) using the labeled source data {γ(n), h_n} as follows:

w_n = (1 − 2μρ)w_{n−1} − μ α(n) ∇_{w^T} Q(w_{n−1}; γ(n), h_n)     (59.64)

where α(n) is the scaling factor evaluated at the nth feature vector from the source domain:

α(n) ≜ α(h_n) = (N_S/N_T) e^{−σ̂(n)},     σ̂(n) = h_n^T w^x − θ^x     (59.65)

In summary, we arrive at procedure (59.66) for solving the domain adaptation problem under covariate shift. This procedure is based on instance-weighting since each data point (γ(n), hn ) from the source domain is weighted by the scalar α(n).


Domain adaptation algorithm for minimizing (59.55) under covariate shift.

given: N_S labeled data pairs from the source domain: {γ(n), h_n}, n = 0, 1, . . . , N_S − 1;
       N_T unlabeled features from the target domain: {h_{n+N_S}}, n = 0, 1, . . . , N_T − 1.

logistic classifier
  • train a logistic classifier to distinguish between source features (σ = +1) and target features (σ = −1). Let (w^x, θ^x) denote the resulting classifier.
  • for each source vector h_n, determine α(n) = (N_S/N_T) e^{−(h_n^T w^x − θ^x)}.

stochastic gradient algorithm
  • use the labeled source data {γ(n), h_n} to train:
        w_n = (1 − 2μρ)w_{n−1} − μ α(n) ∇_{w^T} Q(w_{n−1}; γ(n), h_n)
  • use the resulting classifier w⋆ to determine the labels for the target domain features {h_t}, t = 0, 1, . . . , N_T − 1:
        γ(t) = +1, if h_t^T w⋆ ≥ 0
        γ(t) = −1, if h_t^T w⋆ < 0
     (59.66)
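For readers who wish to experiment with this procedure, a minimal NumPy sketch is given below. It is only illustrative: the logistic loss plays the role of Q, the final classifier is trained without an offset (as in Example 59.7 further ahead), and the step sizes, number of epochs, and function names are placeholder choices rather than part of the algorithm statement.

import numpy as np

def train_logistic(H, y, mu=0.01, rho=0.0, epochs=100, seed=0):
    # Stochastic-gradient logistic classifier with l2-regularization, in the
    # spirit of the recursion (59.58); returns (w, theta).
    rng = np.random.default_rng(seed)
    N, M = H.shape
    w, theta = np.zeros(M), 0.0
    for _ in range(epochs):
        for n in rng.permutation(N):
            h, s = H[n], y[n]
            shat = h @ w - theta                   # predicted score
            scale = s / (1.0 + np.exp(s * shat))   # logistic gradient factor
            theta = theta - mu * scale
            w = (1 - 2 * mu * rho) * w + mu * scale * h
    return w, theta

def covariate_shift_adaptation(Hs, ys, Ht, mu=0.001, rho=1.0, epochs=100):
    # Instance-weighting procedure in the spirit of (59.66): importance weights
    # from a domain classifier, followed by weighted logistic training on the source.
    Ns, Nt = len(Hs), len(Ht)
    H_all = np.vstack([Hs, Ht])
    sigma = np.concatenate([np.ones(Ns), -np.ones(Nt)])   # +1 source, -1 target
    wx, thx = train_logistic(H_all, sigma, mu=0.01)
    alpha = (Ns / Nt) * np.exp(-(Hs @ wx - thx))          # alpha(n) as in (59.65)
    rng = np.random.default_rng(0)
    w = np.zeros(Hs.shape[1])
    for _ in range(epochs):
        for n in rng.permutation(Ns):
            h, g = Hs[n], ys[n]
            grad = -g * h / (1.0 + np.exp(g * (h @ w)))   # gradient of logistic loss Q
            w = (1 - 2 * mu * rho) * w - mu * alpha[n] * grad
    labels = np.where(Ht @ w >= 0, 1, -1)                 # label the target features
    return w, labels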

Example 59.7 (Domain adaptation applied to logistic data) We generate NS = 100 random pairs of source data points {γ(n), hn } according to a logistic model. First, a random parameter model wS ∈ IR2 is selected, and a random collection of feature vectors {hn } are generated with zero-mean unit-variance Gaussian entries. Then, for each hn , the label γ(n) is set to either +1 or −1 according to the following construction:

γ(n) = +1 if 1/(1 + e^{−h_n^T w_S}) ≥ 1/2,   otherwise γ(n) = −1     (59.67)

We generate a second set of NT = 100 random pairs of target data points {γ(n), hn } by using a similar logistic construction albeit with a different parameter model wT ∈ IR2 . A total of K = 100 epochs are run over the data, with the data randomly reshuffled prior to each run. Although the source and target features are not linearly separable in this example, we still train a logistic classifier wx to separate between source and target data, as explained prior to the example. The result is (where, due to the extension, the top entry is the offset θx for the classifier):

[ −θ^x ; w^x ] = [ 0.0049 ; −0.0089 ; 0.0069 ]     (59.68)

We use the result to determine the scalars {α(n)} for the labeled source samples, and apply the logistic regression recursion to the source data. We again run K = 100 epochs

and perform random reshuffling at the start of each run, leading to the estimate (no offset was used in this case):

w⋆ = [ −0.2108 ; −0.0667 ]     (59.69)

We use w⋆ to classify the target samples into classes +1 or −1. The simulation uses ρ = 1, µ = 0.0001, and M = 2. Figure 59.12 shows the scatter diagrams for the labeled source and target data in the top row for comparison purposes. In the implementation, we are actually assuming that the labels for the target data are not known, as shown in the bottom row of the figure, and employ the above construction to predict their labels. For this example, the classification error is 19%. It is not difficult to observe from repeating this experiment that the domain adaptation procedure can fail more often than desirable. This is because the source and target samples are not generally linearly separable, which results in poor estimates for {α(n)}.

Figure 59.12 (Top left) Scatter diagram for the labeled source samples. (Top right) Scatter diagram for the labeled target samples. (Bottom left) Unlabeled target samples. (Bottom right) Predicted labels for the target samples.


59.6  COMMENTARIES AND DISCUSSION

Logistic function, logit, and probit. In general, for any x ∈ IR, the logistic function σ(x) is defined as the transformation

σ(x) ≜ 1/(1 + e^{−x})     (logistic function)     (59.70)

We showed a plot of this function in Fig. 59.1, where it is seen that σ(x) satisfies the useful properties:

σ(0) = 1/2,     lim_{x→+∞} σ(x) = 1,     lim_{x→−∞} σ(x) = 0     (59.71)

In other words, the function σ(x) assumes increasing values in the range (0, 1) as x varies from −∞ to +∞. Accordingly, σ(x) can be interpreted as a cumulative distribution of some underlying probability density function (pdf), which turns out to be the logistic distribution. To see this, let x denote a random variable with mean x̄ and variance σ_x². Let a² = 3σ_x²/π². Then, we say that x has a logistic distribution when its pdf has the form

f_x(x) ≜ (1/4a) sech²( (x − x̄)/2a )     (logistic pdf)     (59.72)

where sech(·) refers to the hyperbolic secant function defined by

sech(t) ≜ 2/(e^t + e^{−t})     (59.73)

The cumulative distribution, which measures P(x ≤ x), is given by

F_x(x) = 1/(1 + e^{−(x − x̄)/a})     (59.74)

so that σ(x) corresponds to the cumulative distribution of a logistic pdf with mean zero and variance σ_x² = π²/3. Furthermore, for a generic y ∈ (0, 1), we define the logit function

g(y) ≜ logit(y) ≜ ln( y/(1 − y) )     (logit function)     (59.75)

This function is closely related to the logistic function (59.70). Indeed, one function is the inverse of the other. That is,

g(y) = ln( y/(1 − y) )  ⟺  y = σ(g(y))     (59.76)

This observation provides one useful interpretation for the logistic regression model used in (59.6). Indeed, if we make the identifications:

y ← P(γ = γ | h = h; w^o),     x ← γ h^T w^o     (59.77)

then relation (59.6) is simply stating that y = σ(x). Consequently, according to (59.76), it must hold that x = logit(y). The logistic function (59.70) was introduced by the Belgian mathematician Pierre Verhulst (1804–1849) in his studies of models for population growth. Verhulst's work was motivated by his observation that the rate of population growth should be dependent on the population size, which led him to propose a continuous-time differential equation in the paper by Verhulst (1845) of the following form:

dy(t)/dt = α y(t)( 1 − y(t)/P ),     y(0) = P_o,  t > 0     (59.78)


Here, the symbol y(t) denotes the population size at time t, α denotes the growth rate, and P is the maximum population size. The solution that corresponds to the special case P_o = 1/2, P = 1, and α = 1 leads to the logistic function:

y(t) = 1/(1 + e^{−t})     (59.79)
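As a quick numerical check of this special case, the following sketch integrates (59.78) with P_o = 1/2, P = 1, and α = 1 using scipy.integrate.solve_ivp and compares the trajectory with the closed form (59.79); it is included purely as an illustration.

import numpy as np
from scipy.integrate import solve_ivp

# Verhulst model dy/dt = alpha*y*(1 - y/P) with Po = 1/2, P = 1, alpha = 1
sol = solve_ivp(lambda t, y: y * (1.0 - y), t_span=(0.0, 6.0),
                y0=[0.5], t_eval=np.linspace(0.0, 6.0, 61))

logistic = 1.0 / (1.0 + np.exp(-sol.t))       # closed-form solution (59.79)
print(np.max(np.abs(sol.y[0] - logistic)))    # small integration error is expected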

The designation "logit" for the function (59.75) was introduced by Berkson (1944, 1951), whose work was motivated by the earlier introduction of the "probit" function by Gaddum (1933) and Bliss (1934a,b); the term "probit" was used in the references by Bliss (1934a,b). For any y ∈ (0, 1), the probit function is defined as (compare with (59.75)):

g(y) ≜ probit(y) ≜ Φ^{−1}(y)     (probit function)     (59.80)

where Φ(x) denotes the cumulative function of the standard Gaussian distribution with zero mean and unit variance, i.e.,

Φ(x) ≜ (1/√(2π)) ∫_{−∞}^{x} e^{−τ²/2} dτ     (59.81)

and Φ^{−1}(y) is the inverse transformation. The probit is closely related to the Q-function for Gaussian distributions since one function is the inverse of the other:

g(y) = probit(y)  ⟺  y = Φ(g(y))     (59.82)
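The inverse relations (59.76) and (59.82) are easy to verify numerically. The brief sketch below does so with NumPy and scipy.stats.norm; it is included only as an illustration.

import numpy as np
from scipy.stats import norm

sigma = lambda x: 1.0 / (1.0 + np.exp(-x))    # logistic function (59.70)
logit = lambda y: np.log(y / (1.0 - y))       # logit function (59.75)
probit = lambda y: norm.ppf(y)                # probit function (59.80)

y = np.linspace(0.01, 0.99, 99)
print(np.allclose(sigma(logit(y)), y))        # True: y = sigma(g(y)) as in (59.76)
print(np.allclose(norm.cdf(probit(y)), y))    # True: y = Phi(g(y)) as in (59.82)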

For further details on the history of the logistic function and the logit and probit functions, the reader may refer to the treatment given by Cramer (2003). Logistic regression. The technique is of broad appeal and has found applications in a wide range of fields, besides machine learning and pattern classification, such as in the life and social sciences. The driving force behind its appeal is that the independent variable γ is not continuous but discrete and can only assume a finite number of possibilities. For example, in the life sciences where logistic regression formulations are popular, it is customary for the variable γ to represent the state of a patient (e.g., whether the patient has a certain condition or not), while h collects measurements of biological variables. Likewise, in the social and political sciences, the variable γ may represent whether an individual leans toward one political affiliation or another based on observations of certain attributes. The concept of logistic regression in statistical analysis was originally introduced by Cox (1958), who focused on the binary case – see also the texts by Cox (1969, 2006). There is a strong connection with the probit regression concept introduced earlier by Bliss (1934a,b). In the logistic formulation, the conditional probability was modeled in (59.6) by the cumulative function of a logistic pdf, written here more compactly in terms of σ(x) = 1/(1 + e−x ) as P(γ = +1 | h = h; wo , θo ) = σ(hT wo − θo )

(59.83)

In comparison, the probit model employs the cumulative distribution of the Gaussian distribution:

P(γ = +1 | h = h; w^o, θ^o) = Φ(h^T w^o − θ^o)     (59.84)

where Φ(x) is defined by (59.81). The logit and probit models for binary classification lead generally to similar results, although the logit formulation is more popular – see, e.g., the discussion in Chapter 33. Multinomial logistic regression. We focused mostly on the case of binary classification problems in the body of the chapter, where γ assumes one of two possible values, γ ∈ {±1}. Variations of logistic regression that handle more than two states are of course possible; these formulations are referred to as multinomial or multiclass logistic


regression problems, and also as softmax regression problems – see Prob. 59.14. In this case, it is assumed that there are R classes, labeled r ∈ {1, 2, . . . , R}. Separate parameters (wr , θr ) are associated with each class. The conditional distribution of r given the feature h is modeled as the following softmax function:

P(r = r | h = h) = e^{h^T w_r − θ_r} ( Σ_{r'=1}^{R} e^{h^T w_{r'} − θ_{r'}} )^{−1},     1 ≤ r ≤ R     (59.85)
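A small numerical sketch of this softmax model may be helpful; the class parameters are stacked as the columns of a matrix W, the usual max-subtraction is used to stabilize the exponentials, and all variable names and numbers are illustrative.

import numpy as np

def softmax_probabilities(h, W, theta):
    # Conditional class probabilities P(r|h) from (59.85); W holds one column w_r
    # per class and theta collects the offsets theta_r.
    scores = h @ W - theta          # h^T w_r - theta_r for every class r
    scores -= scores.max()          # stabilize the exponentials
    e = np.exp(scores)
    return e / e.sum()

# example with R = 3 classes and M = 2 features (illustrative numbers)
h = np.array([1.0, -0.5])
W = np.array([[0.2, -0.1, 0.4],
              [0.7,  0.3, -0.2]])
theta = np.array([0.0, 0.1, -0.1])
print(softmax_probabilities(h, W, theta))    # nonnegative entries summing to 1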

An application of multinomial logistic regression in the context of classification problems appears in Bohning (1992). More detailed treatments on logistic regression, along with examples of applications in several fields, can be found in the texts by Harrell (2001), Cramer (2003), Cox (2006), Hale, Yin, and Zhang (2008), Freedman (2009), Hilbe (2009), Bolstad (2010), Shi et al. (2010), and Hosmer and Lemeshow (2013). Further discussions in the context of machine learning, involving the consideration of regularized versions of logistic regression and other methods of solution, appear in several works, e.g., Figueiredo (2003), Ng (2004), Krishnapuram et al. (2005), Koh, Kim, and Boyd (2007), and in the texts by McCullagh and Nelder (1989), Bishop (2007), and Theodoridis and Koutroumbas (2008). Multiclass classification problems. We have already encountered, and will continue to encounter, in our treatment several classification algorithms that can handle multiclass classification problems, such as naïve Bayes classifiers, k-NN classifiers, self-organizing maps, decision trees, random forests, LDAs, and neural networks. There have also been works in the literature on extending the support vector machine (SVM) formulation of Chapter 61 to multiclass problems – see, e.g., Vapnik (1998) and the articles by Joachims (1998), Platt (1998), Weston and Watkins (1999), Bredensteiner and Bennett (1999), Crammer and Singer (2001), and Lee, Lin, and Wahba (2004). In Sections 59.3.1 and 59.3.2 we discussed a different approach to multiclass classification, in the form of the OvO and OvA techniques, which rely on transforming the problem into a sequence of binary classification problems. While OvO involves training more classifiers than OvA (O(R2 ) vs. O(R)), it nevertheless uses less training data. It has been observed in the literature, based on extensive experimentation, that if the underlying binary classifiers are tuned well, then using the OvO or OvA strategies works rather well, even in comparison to more sophisticated multiclass classification techniques. The work by Rifkin and Klautau (2004) provides arguments in support of this performance, especially for the OvA method. The OvO and OvA strategies are intuitive and simple and perhaps, for this reason, they have been rediscovered multiple times. For further discussion, the reader may refer to the texts by Bishop (2007) and Hastie, Tibshirani, and Friedman (2009), as well as the articles by Hastie and Tibshirani (1998), Allwein, Shapire, and Singer (2000), Hsu and Lin (2002), Aly (2005), Garcia-Pedrajas and Ortiz-Boyer (2006), and Rocha and Goldenstein (2013). A third method for transforming a multiclass classification problem into the solution of a sequence of binary classifiers is the error-correcting output code (ECOC) method, proposed in the works by Sejnowski and Rosenberg (1987) and Dietterich and Bakiri (1995) – see also the treatments in Allwein, Shapire, and Singer (2000), Hsu and Lin (2002), and Furnkranz (2002). The method relies on using ideas from coding theory to select between a collection of binary classifiers as follows. Assume we are faced with a multiclass classification problem involving R classes. Let B denote the number of base classifiers that we are going to use to attain multiclass classification. We introduce a coding matrix, denoted by C, of size R × B and whose entries are ±1. Each row of C is associated with one class, r = 1, 2, . . . 
, R, and the ±1 entries on the rth row of C constitute a “codeword” that we are using to represent that particular class. For example, assume R = 4 and B = 6. Then, one choice for the coding matrix C could be:

        b1   b2   b3   b4   b5   b6
 r=1    +1   −1   +1   +1   −1   −1
 r=2    +1   +1   −1   +1   +1   +1
 r=3    −1   −1   −1   −1   +1   +1
 r=4    −1   −1   +1   −1   −1   −1
     (59.86)

We are labeling the columns by {b_ℓ} and these columns have a meaningful interpretation. For example, assume we are classifying images into four classes: cars, fruits, flowers, and airplanes. The value of b1 could be indicating whether an image has wheels in it (b1 = +1) or not (b1 = −1). If you examine the values appearing in the b1 column in the above example for C, we find that classes r = 1 and r = 2 have wheels in them while classes r = 3 and r = 4 do not. We say that the matrix C provides R codewords (rows); one for each class r. Moreover, each column of C (i.e., each base classifier) divides the training data into two groupings, regardless of their class r. For example, under column b1, all training data that belong to classes r = 1 or r = 2 are assigned to class γ = +1, while the remaining training data that belong to classes r = 3 or r = 4 are assigned to class γ = −1. A binary classifier can then be trained on this grouping; this step results in a classifier with parameter vector w_1⋆. We repeat for column b2. In this case, all training data that belong to class r = 2 are assigned to γ = +1, while the remaining training data that belong to classes r = 1, 3, 4 are assigned to class γ = −1. A binary classifier can then be trained on this grouping; this step results in a classifier with parameter vector w_2⋆. We repeat for columns {b3, b4, . . . , b6}. By the end of this training process, we end up with six trained binary classifiers, {w_ℓ⋆}. Next, during normal operation, when a new feature vector h is received, we employ the classifiers {w_ℓ⋆} to determine a codeword representation for h. For example, assume we find that the codeword corresponding to a particular h is

codeword(h) = [ +1  −1  −1  +1  −1  −1 ]     (59.87)

We then determine the "closest" row in C to this codeword, where closeness can be measured either in terms of the Euclidean norm or in terms of the Hamming distance; the Hamming distance between two vectors is the number of entries at which the vectors differ from each other:

r̂ ≜ argmin_{1≤r≤R}  Hamming( codeword(h), C(r, :) )     (59.88)
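To make the decoding step concrete, the following sketch applies the Hamming-distance rule (59.88) to the coding matrix (59.86); the codeword matches the example in the text and the function name is illustrative.

import numpy as np

# coding matrix C from (59.86): one row (codeword) per class, one column per base classifier
C = np.array([[+1, -1, +1, +1, -1, -1],
              [+1, +1, -1, +1, +1, +1],
              [-1, -1, -1, -1, +1, +1],
              [-1, -1, +1, -1, -1, -1]])

def decode(codeword, C):
    # return the class (1-based) whose row of C is closest in Hamming distance
    distances = np.sum(C != codeword, axis=1)   # number of disagreeing entries per row
    return int(np.argmin(distances)) + 1

codeword_h = np.array([+1, -1, -1, +1, -1, -1]) # codeword produced by the six classifiers
print(decode(codeword_h, C))                    # prints 1, so h is assigned to class r = 1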

For the above example, we find that rb = 1 so that feature h is assigned to class r = 1. One of the weaknesses of this approach is that it disregards the correlations that may exist between different classes. Active learning. We described some features of active learning in Section 59.4. Under this approach, the learner seeks to improve its performance by being proactive about which training samples to use. Active sampling can be employed for both cases of labeled and unlabeled data. In the latter case, it reduces the amount of labeling that needs to be provided by prioritizing samples for training. Although we illustrated the operation of active learning under the logistic regression classifier, the same methodology can be applied to other classifiers, including neural network structures, using other variations to identify the “least confident” samples, as illustrated in Example 59.6 – see, e.g., Cortes and Vapnik (1995), Fujii et al. (1998), Tong and Koller (2000), Lindenbaum, Markovitch, and Rusakov (2004), and Settles (2010). The last reference provides a useful survey on active learning. Further useful reading includes the works by MacKay (1992), Cohn, Atlas, and Ladner (1994), Cohn, Ghahramani, and Jordan (1996), Dasgupta (2004), Baram, El-Yaniv, and Luz (2004), Schein and Ungar (2007), Dasgupta and Hsu (2008), and Dasgupta, Hsu, and Monteleoni (2008). Domain adaptation. The weighted solution (59.66) for the domain adaptation problem was proposed by Bickel, Brueckner, and Scheffer (2007); in the same article they propose


a second variant that determines {w^x, w⋆} simultaneously by means of a Newton-type algorithm. The conclusion in (59.54) that weighting by the ratio of pdfs, f_T(h)/f_S(h), helps transform expectation over the target distribution f_T(h) into expectation over the source distribution f_S(h) is due to Shimodaira (2000). A similar construction arises in the study of off-policy reinforcement learning algorithms – see Section 46.7. There are many other variations of domain adaptation, which differ in the manner by which they estimate the importance weight f_T(h)/f_S(h). In the chapter we discussed two solutions: one based on training a logistic regressor to discriminate between source and target samples, and the other based on using the k-NN rule. Other approaches that rely on parametric and nonparametric methods for estimating the distributions {f_T(h), f_S(h)} or their ratio are also possible. For example, following Shimodaira (2000), one could assume Gaussian forms for these distributions and estimate their sample means and covariances from the data:

μ̂_T = (1/N_T) Σ_{t=0}^{N_T−1} h_t,     μ̂_S = (1/N_S) Σ_{s=0}^{N_S−1} h_s
R̂_T = (1/(N_T − 1)) Σ_{t=0}^{N_T−1} (h_t − μ̂_T)(h_t − μ̂_T)^T
R̂_S = (1/(N_S − 1)) Σ_{s=0}^{N_S−1} (h_s − μ̂_S)(h_s − μ̂_S)^T
f_T(h) ∼ N_h(μ̂_T, R̂_T),     f_S(h) ∼ N_h(μ̂_S, R̂_S)
     (59.89)

Another approach to domain adaptation is based on the methodology of optimal transport – see, e.g., Courty et al. (2016, 2017) and Redko, Habrard, and Sebban (2017). In this case, the conditional distributions f_{γ|h}(γ|h) are allowed to be different over the source and target domains. We denote them by f_T(γ|h) and f_S(γ|h). One then seeks a mapping t(h) operating on the source feature vectors such that, after the transformation, the distributions match each other:

f_S(γ|h) = f_T(γ|t(h)),     ∀ h ∈ source domain     (59.90)

In this approach, after the mapping t(·) is determined, one applies it to the source data and subsequently trains the classifier directly in the target domain using these transformed vectors. The main intuition is that after the transformation, the source data will behave similarly to the target data. Good surveys on domain adaptation and transfer learning, including discussions on other approaches, are given by Weiss, Khoshgoftaar, and Wang (2016) and Kouw and Loog (2019). Useful performance results are given by Crammer, Kearns, and Wortman (2008), Mansour, Mohri, and Rostamizadeh (2009), Ben-David et al. (2010a,b), Cortes, Mansour, and Mohri (2010), and Germain et al. (2013).
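Returning to the parametric route in (59.89), the sketch below estimates the importance weights by fitting Gaussian models to the source and target features; it relies on scipy.stats.multivariate_normal and synthetic data, and is meant only as an illustration of the idea.

import numpy as np
from scipy.stats import multivariate_normal

def gaussian_importance_weights(Hs, Ht):
    # Estimate alpha(h) = f_T(h)/f_S(h) at the source features Hs by fitting
    # Gaussian models to target features Ht and source features Hs, as in (59.89).
    mu_T, mu_S = Ht.mean(axis=0), Hs.mean(axis=0)
    R_T = np.cov(Ht, rowvar=False)     # unbiased sample covariance (divides by N_T - 1)
    R_S = np.cov(Hs, rowvar=False)
    f_T = multivariate_normal(mean=mu_T, cov=R_T)
    f_S = multivariate_normal(mean=mu_S, cov=R_S)
    return f_T.pdf(Hs) / f_S.pdf(Hs)

# illustrative data: source and target features drawn from shifted Gaussians
rng = np.random.default_rng(0)
Hs = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
Ht = rng.normal(loc=0.5, scale=1.0, size=(200, 2))
alpha = gaussian_importance_weights(Hs, Ht)      # one weight per source sample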

PROBLEMS

59.1  Consider the logistic function σ(x) = 1/(1 + e^{−x}). Verify that
(a) σ(−x) = 1 − σ(x).
(b) dσ/dx = σ(x)σ(−x).
(c) dσ/dx = σ(x)(1 − σ(x)).
59.2  Consider the ℓ2-regularized logistic risk:

P(w) ≜ ρ‖w‖² + E ln( 1 + e^{−γ γ̂(w)} ),     γ̂(w) = h^T w

and denote its minimizer by w^o. Prove that


(a) ‖w^o‖ ≤ E ‖h‖ / 2ρ.
(b) ‖w^o‖² ≤ Tr(R_h)/4ρ², where h is zero-mean and R_h = E hh^T.
59.3  Consider the logistic regression algorithm (59.15) without the offset parameter θ (i.e., set it to zero). Introduce the auxiliary variable d(n) = +1 if γ(n) = +1 and d(n) = 0 if γ(n) = −1. Let further σ(x) = 1/(1 + e^{−x}) denote the logistic function. Show that the logistic regression algorithm can be equivalently reworked into the following form:

e(n) = d(n) − σ(h_n^T w_{n−1})
w_n = (1 − 2μρ)w_{n−1} + μ h_n e(n)

59.4  Consider the ℓ2-regularized empirical logistic risk problem:

w⋆ ≜ argmin_{w∈IR^M}  { (1/2) w^T R_w^{−1} w + (1/N) Σ_{n=0}^{N−1} ln( 1 + e^{−γ(n)h_n^T w} ) }

where R_w > 0. Let σ(z) = 1/(1 + e^{−z}). Show that w⋆ can be written in the form

w⋆ = (1/N) Σ_{n=0}^{N−1} λ(n) γ(n) R_w h_n

where the coefficients {λ(n)} are the derivatives of σ(z) evaluated at z = γ(n)γ̂(n), i.e.,

λ ≜ dσ(z)/dz evaluated at z = γγ̂,     γ̂ = h^T w⋆

59.5  We continue with Prob. 59.4. Using model w⋆, show that the conditional probability of the label variable given the feature vector can be written in the form

P(γ = γ | h; w⋆) = 1/(1 + e^{−γγ̂})

for the following function of the feature vector h:

γ̂(h) ≜ (1/N) Σ_{n=0}^{N−1} λ(n) γ(n) h^T R_w h_n

Conclude that the label γ that maximizes P(γ = γ | h; w⋆) is the one that matches sign(γ̂(h)). Remark. For additional discussion on the material in this problem and the previous one, the reader may refer to the work by Jaakkola and Haussler (1999).
59.6  Consider N iid observations {γ(n), h_n}. For each individual feature {γ, h}, the conditional probability of the label given the feature is modeled according to (59.6). Assume zero offsets in this problem. Verify that the log-likelihood function for the observations, denoted by ℓ(w), can be written in the form (compare with (59.11)):

ℓ(w) = Σ_{n=0}^{N−1} { ((1 + γ(n))/2) ln( 1/(1 + e^{−h_n^T w}) ) + ((1 − γ(n))/2) ln( 1/(1 + e^{h_n^T w}) ) }

Redefine the labels from {−1, +1} to {0, 1} by using the transformation γ ← (1 + γ)/2. Using the new labels, verify that the same log-likelihood function can be written in the following form (which is the negative of the so-called cross-entropy risk function encountered later in Section 65.7 in the context of neural networks):

ℓ(w) = Σ_{n=0}^{N−1} { γ(n) ln P(γ(n) = 1) + (1 − γ(n)) ln P(γ(n) = 0) }

where P(γ = 1) = 1/(1 + e^{−h^T w}) and P(γ = 0) = 1/(1 + e^{h^T w}).


59.7  Show that the log-likelihood function ℓ(w) in Prob. 59.6 is concave. In particular, verify that its Hessian matrix relative to w is nonpositive-definite.
59.8  Consider the same setting of Prob. 59.6 with γ(n) ∈ {±1} but assume now that we attach a Gaussian prior to the model w, say, w ∼ N_w(0, σ_w² I_M). Verify that the maximum a-posteriori (MAP) estimator that maximizes the joint pdf of {w, γ(0), . . . , γ(N − 1)} given the feature vectors {h_n} leads to the ℓ2-regularized logistic regression solution.
59.9  Consider the same setting of Prob. 59.6 with γ(n) ∈ {±1} but assume now that we attach a Laplacian prior to the model w. Specifically, the entries {w_m} of w ∈ IR^M are assumed to be independent of each other and follow a Laplacian distribution with zero mean and variance σ_w²:

f_{w_m}(w_m) = (1/(√2 σ_w)) exp{ −√2 |w_m| / σ_w }

Verify that the MAP estimator that maximizes the joint pdf of {w, γ(1), . . . , γ(N − 1)} given the feature vectors {h_n} leads to an ℓ1-regularized logistic regression solution.
59.10  Let σ(x) = 1/(1 + e^{−x}) refer to the logistic (or sigmoid) function. Consider the second log-likelihood function defined in Prob. 59.6 for labels {0, 1}, namely,

ℓ(w) = Σ_{n=0}^{N−1} { γ(n) ln σ(h_n^T w) + (1 − γ(n)) ln( 1 − σ(h_n^T w) ) }

Construct the data matrix H whose rows are {h_n^T}, the column vector d whose entries are {γ(n)}, and the column vector s(w) whose entries are {σ(h_n^T w)}, i.e.,

H ≜ [ h_0 h_1 · · · h_{N−1} ]^T,     d ≜ col{ γ(0), γ(1), . . . , γ(N − 1) },     s(w) ≜ col{ σ(h_0^T w), σ(h_1^T w), . . . , σ(h_{N−1}^T w) }

Construct also the N × N diagonal matrix

D(w) ≜ diag{ σ(h_n^T w)( 1 − σ(h_n^T w) ) }

Verify that
(a) ∇_{w^T} ℓ(w) = H^T ( d − s(w) ).
(b) ∇_w² ℓ(w) = −H^T D(w) H.
59.11  Continuing with Prob. 59.10, we wish to write down Newton recursion (12.197) for maximizing the log-likelihood function ℓ(w) using a unit-value step size. Verify that the recursion in this case reduces to the following form over m ≥ 0:

D_{m−1} ≜ D(w_{m−1})
z_{m−1} ≜ H w_{m−1} + D_{m−1}^{−1} ( d − s(w_{m−1}) )
w_m = (H^T D_{m−1} H)^{−1} H^T D_{m−1} z_{m−1}

59.12  Conclude from Prob. 59.11 that the mth iterate is the solution of the weighted least-squares problem

w⋆ ≜ argmin_{w∈IR^M} (z_{m−1} − Hw)^T D_{m−1} (z_{m−1} − Hw)

How is this conclusion related to the iterative reweighted least-squares problem (50.167)?


59.13  Consider a binary classification problem where γ = ±1 and the following risk functions:

P(c) = E (γ − γ̂)²     (mean-square-error risk)
P(c) = E max{ 0, 1 − γγ̂ }     (hinge risk)

where γ̂ = c(h) denotes the prediction that is generated by the classifier, c(h); we are not limiting the problem statement to linear classifiers. Show that the minima of the above risks over γ̂ are given by:

γ̂ = 2 P(γ = +1|h = h) − 1     (mean-square-error risk)
γ̂ = sign( P(γ = +1|h = h) − 1/2 )     (hinge risk)

Derive in each case expressions for the confidence level P(γ = +1|h = h).
59.14  Refer to the multiclass logistic regression model (59.85) and assume zero offset parameters. Consider a collection of N independent data realizations {h_n, r(n)}, where h_n ∈ IR^M is a feature vector and r(n) is its class. Let W collect all models {w_r} into its columns and introduce the notation

σ_{nr} ≜ P(r = r | h_n; W) = e^{h_n^T w_r} ( Σ_{r'=1}^{R} e^{h_n^T w_{r'}} )^{−1},     1 ≤ r ≤ R

Introduce further the R-dimensional vectors:

σ_n ≜ col{ σ_{n1}, σ_{n2}, . . . , σ_{nR} },     γ_n ≜ col{ I[r(n) = 1], I[r(n) = 2], . . . , I[r(n) = R] }

where I[x] is the indicator function assuming the value 1 when the statement x is true and zero otherwise.
(a) Argue that the log-likelihood function is given by

ℓ(W) = Σ_{n=0}^{N−1} Σ_{r=1}^{R} I[r(n) = r] h_n^T w_r − Σ_{n=0}^{N−1} ln( Σ_{r'=1}^{R} exp{ h_n^T w_{r'} } )

(b) For any model w_r, show that

∇_{w_r^T} ℓ(W) = Σ_{n=0}^{N−1} ( I[r(n) = r] − σ_{nr} ) h_n

(c) Collect the column gradient vectors from part (b) into a matrix and conclude that

∇_W ℓ(W) = Σ_{n=0}^{N−1} (γ_n − σ_n)^T ⊗ h_n

(d) Write down a gradient-ascent iteration for maximizing ℓ(W). Remark. For a related discussion, the reader can refer to Murphy (2012).
59.15  Refer to the probit regression formulation (59.84) for binary classification. Formulate an ML estimation problem for recovering the weight vector and use the ML formulation to motivate an ℓ2-regularized stochastic gradient probit solution.
59.16  Establish relation (59.56). Remark. For additional motivation, see Bickel, Brueckner, and Scheffer (2007).


59.17  Refer to the Poisson distribution (5.47). Argue that the canonical link function is given by g(μ) = ln(μ). Show that the empirical risk optimization problem (59.118) reduces to the following Poisson regression problem:

w⋆ = argmax_{w∈IR^M}  { (1/N) Σ_{n=0}^{N−1} ( γ(n) h_n^T w − exp{ h_n^T w } ) }

59.A  GENERALIZED LINEAR MODELS

Linear and logistic regression problems are special cases of the family of generalized linear models (GLMs). We revisit these two cases and introduce the generalization. For more details on such models, the reader may refer to the text by McCullagh and Nelder (1989). GLMs were introduced earlier in the work by Nelder and Wedderburn (1972) as a generalization of various regression models. The main intuition is that linear predictors are used to estimate transformations of the conditional mean.

Linear regression
Assume a random variable γ ∈ IR arises from a linear model of the form:

γ = h^T w + v     (59.91)

where v is a zero-mean Gaussian random variable that is independent of the random variable h ∈ IRM . Then, clearly, given h, the conditional pdf of γ is Gaussian as well: γ|h ∼ Nγ (hT w, σv2 )

(59.92)

We denote the mean of this conditional distribution by µ = E (γ|h). In this case, µ depends linearly on the observation h through the model parameter w ∈ IRM : µ = hT w

(59.93)

In linear regression, we estimate γ from h by using a similar structure for the linear predictor: γ b = hT w

(59.94)

The way we estimate the unknown w is by maximizing the log-likelihood function over a collection of N independent data pairs {γ(n), h_n}:

w⋆ = argmax_{w∈IR^M} ln{ Π_{n=0}^{N−1} (1/√(2πσ_v²)) exp{ −(γ(n) − h_n^T w)²/(2σ_v²) } }     (59.95)

which in this case reduces to solving the least-squares problem:

w⋆ = argmin_{w∈IR^M} { (1/N) Σ_{n=0}^{N−1} (γ(n) − h_n^T w)² }     (59.96)


Logistic regression
Consider next a situation where the random variable γ is binary-valued, as in γ ∈ {±1}, where γ assumes its values according to a Bernoulli distribution:

γ|h ∼ Bernoulli(p)     (59.97)

The success probability (i.e., the probability of getting γ = +1) is modeled by means of a logistic function:

p = P(γ = +1|h = h) = 1/(1 + e^{−h^T w})     (59.98)

The mean of the conditional distribution, µ = E (γ|h), is now given by µ = 2p − 1

(59.99)

In logistic regression, we estimate γ from h by again using a linear predictor of the form γ b = hT w

(59.100)

In this case, the predictor does not have the same form as µ. However, they can be related to each other. Using (59.98) and the expressions for {µ, γ̂}, it is easy to verify that

γ̂ = ln( (1 + µ)/(1 − µ) )     (59.101)

That is, γ̂ is obtained by means of some logarithmic transformation applied to the conditional mean. The way we estimate w is by maximizing the log-likelihood function over a collection of N independent data pairs {γ(n), h_n}:

w⋆ = argmax_{w∈IR^M} ln{ Π_{n=0}^{N−1} 1/(1 + e^{−γ(n)h_n^T w}) }     (59.102)

which reduces to minimizing the logistic empirical risk:

w⋆ = argmin_{w∈IR^M} { (1/N) Σ_{n=0}^{N−1} ln( 1 + e^{−γ(n)h_n^T w} ) }     (59.103)

Generalization
The previous two examples share some common elements:
(a) Each case assumes a particular model for the conditional pdf of the target variable γ given the observation h, namely, for the distribution of γ|h. The nature of the variables can be different. For example, in one case, γ is real-valued but in the other case it is discrete and binary-valued.
(b) Each case assumes a linear predictor for the target variable in the form γ̂ = h^T w, for some model parameter w. This linear construction will be common for all GLMs (which explains the qualification "linear" in the name).
(c) The predictor γ̂ is estimating some transformation of the conditional mean µ = E(γ|h). In the linear regression case, the predictor is estimating µ itself, while in the logistic regression case the predictor is estimating ln((1 + µ)/(1 − µ)). We refer to the function that maps the mean to the predictor as the link function and denote it by the notation γ̂ = g(µ). Thus, we have

g(µ) = µ     (for linear regression)     (59.104a)
g(µ) = ln( (1 + µ)/(1 − µ) )     (for logistic regression)     (59.104b)


Clearly, under item (a), there are many possible choices for the conditional pdf model of γ|h. We will allow the model to belong to the family of exponential distributions. But first, we introduce canonical exponential distributions, which take the following form for scalar random variables y ∈ IR:

f_y(y) = exp{ (1/d(φ))( θy − b(θ) ) + c(y, φ) }     (59.105)

where θ is a scalar parameter and φ is the dispersion parameter. Several of the exponential distributions we considered in Chapter 5 can be written in this alternative form. Three examples are as follows.
Gaussian case. For Gaussian random variables, we showed in (5.12) that

f_y(y) = exp{ (1/σ²)( µy − µ²/2 ) − (1/2)ln(2πσ²) − y²/(2σ²) }     (59.106)

We can therefore make the identifications

θ = µ     (59.107a)
φ = σ²     (59.107b)
d(φ) = φ     (59.107c)
b(θ) = µ²/2     (59.107d)
c(y, φ) = −(1/2)( ln(2πφ) + y²/φ )     (59.107e)

Observe that the mean of the distribution is represented by the parameter θ.
Bernoulli case. For Bernoulli random variables assuming values {0, 1}, we showed in (5.19) that

f_y(y) = exp{ y ln( p/(1 − p) ) + ln(1 − p) }     (59.108)

and we can make the identifications

θ = ln( p/(1 − p) )     (59.109a)
φ = 1     (59.109b)
d(φ) = 1     (59.109c)
b(θ) = −ln(1 − p)     (59.109d)
c(y, φ) = 0     (59.109e)

Observe that the mean of the distribution is related to the parameter θ since E y = p and, therefore,

θ = ln( µ/(1 − µ) )     (59.110)

Canonical exponential case. For distributions of the general form (59.105), it is customary to select the parameter θ to play the role of the predictor, which ends up defining a canonical choice for the link function. Let us illustrate this construction by reconsidering the logistic regression problem, albeit with the classes set to {0, 1} for illustration purposes. We therefore have that the conditional pdf of γ given h is described by the Bernoulli distribution:

γ|h ∼ Bernoulli(p)     (59.111)


where the success probability is modeled as

p = P(γ = +1|h = h) = 1/(1 + e^{−h^T w})     (59.112)

The mean of the conditional distribution, µ = E(γ|h), is now given by

µ = p     (59.113)

and it is easy to verify that

γ̂ = ln( µ/(1 − µ) )     (59.114)

which is the same conclusion that would have resulted from setting γ̂ = θ directly in view of (59.110). Thus, we will assume canonical exponential distributions of the form (59.105) for the conditional distribution γ|h, namely,

f_{γ|h}(γ|h) ∝ exp{ (1/d(φ))( θγ − b(θ) ) }     (59.115)

and use the linear predictor γ̂ = h^T w to replace θ, i.e., we parameterize θ in the linear form θ = h^T w. This step implicitly defines a link function that maps µ = E(γ|h) to γ̂. The qualification "generalized" in GLM refers to the use of the more general exponential distribution (59.115) in modeling the conditional pdf γ|h, while the qualification "linear" in GLM refers to the linear model used for θ = h^T w. In this way, the conditional pdf takes the form:

f_{γ|h}(γ|h) ∝ exp{ (1/d(φ))( γh^T w − b(h^T w) ) }     (59.116)

Once w is estimated, we end up with an approximation for the conditional pdf of γ|h, from which inference about γ can be performed. The way we estimate w is by maximizing the log-likelihood function over a collection of N independent realizations {γ(n), h_n}:

w⋆ = argmax_{w∈IR^M} ln{ Π_{n=0}^{N−1} exp{ (1/d(φ))( γ(n)h_n^T w − b(h_n^T w) ) } }     (59.117)

which reduces to maximizing the following empirical risk (see Prob. 59.17):

w⋆ = argmax_{w∈IR^M} { (1/N) Σ_{n=0}^{N−1} ( γ(n)h_n^T w − b(h_n^T w) ) }     (59.118)

By doing so, and assuming ergodicity, we are in effect seeking predictions γ̂ = h^T w that solve the Bayesian inference problem

w^o = argmin_{w∈IR^M}  E{ b(γ̂) − γγ̂ },     s.t. γ̂ = h^T w     (59.119)
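As a concrete instance of (59.118), the following sketch fits the Poisson GLM of Prob. 59.17 (the case b(θ) = e^θ with canonical link g(µ) = ln µ) by plain gradient ascent on the empirical risk; the data, step size, and number of iterations are illustrative.

import numpy as np

def fit_poisson_glm(H, y, step=0.01, iters=2000):
    # Maximize (1/N) sum_n [ y(n) h_n^T w - exp(h_n^T w) ], i.e., the empirical
    # GLM risk (59.118) with b(theta) = exp(theta), by gradient ascent.
    N, M = H.shape
    w = np.zeros(M)
    for _ in range(iters):
        theta = H @ w                          # linear predictor theta_n = h_n^T w
        grad = H.T @ (y - np.exp(theta)) / N   # gradient of the empirical risk
        w += step * grad
    return w

# illustrative data generated from a Poisson model with true parameter w0
rng = np.random.default_rng(0)
H = rng.normal(size=(500, 3))
w0 = np.array([0.5, -0.3, 0.2])
y = rng.poisson(np.exp(H @ w0))
print(fit_poisson_glm(H, y))                   # should be close to w0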


REFERENCES Allwein, E., R. Shapire, and Y. Singer (2000), “Reducing multiclass to binary: A unifying approach for margin classifiers,” J. Mach. Learn. Res., vol. 1, pp. 113–141. Aly, M. (2005), “Survey of multiclass classification methods,” Neural Netw., pp. 1–9. Baram, Y., R. El-Yaniv, and K. Luz (2004), “Online choice of active learning algorithms,” J. Mach. Learn. Res., vol. 5, pp. 255–291. Ben-David, S., J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan (2010a), “A theory of learning from different domains,” Mach. Learn., vol. 79, nos. 1–2, pp. 151–175. Ben-David, S., T. Luu, T. Lu, and D. Pál (2010b), “Impossibility theorems for domain adaptation,” Proc. Artificial Intelligence and Statistics Conf. (AISTATS), pp. 129– 136, Sardinia. Berkson, J. (1944), “Application of the logistic function to bio-assay,” J. Amer. Statist. Assoc., vol. 39, no. 227, pp. 357–365. Berkson, J. (1951), “Why I prefer logits to probits?” Biometrics, vol. 7, pp. 327–339. Bickel, S., M. Brueckner, and T. Scheffer (2007), “Discriminative learning for differing training and test distributions,” in Proc. Int. Conf. Machine Learning (ICML), pp. 81–88, Corvallis, OR. Bishop, C. (2007), Pattern Recognition and Machine Learning, Springer. Bliss, C. I. (1934a), “The method of probits,” Science, vol. 79, pp. 38–39. Bliss, C. I. (1934b), “The method of probits,” Science, vol. 79, pp. 409–410. Bohning, D. (1992), “Multinomial logistic regression algorithm,” Ann. Inst. Stat. Math., vol. 44, pp. 197–200. Bolstad, W. M. (2010), Understanding Computational Bayesian Statistics, Wiley. Bredensteiner, E. J. and K. P. Bennett (1999), “Multicategory classification by support vector machines,” Comput. Optim. Appl., vol. 12, pp. 53–79. Cohn, D., L. Atlas, and R. Ladner (1994), “Improving generalization with active learning,” Mach. Learn., vol. 15, no., pp. 201–221. Cohn, D., Z. Ghahramani, and M. I. Jordan (1996), “Active learning with statistical models,” J. Artif. Intell. Res., vol. 4, pp. 129–145. Cortes, C., Y. Mansour, and M. Mohri (2010), “Learning bounds for importance weighting,” Proc. Advances Neural Information Processing Systems (NIPS), pp. 442–450, Vancouver. Cortes, C. and V. N. Vapnik (1995), “Support-vector networks,” Mach. Learn., vol. 20, pp. 273–297. Courty, N., R. Flamary, A. Habrard, and A. Rakotomamonjy (2017), “Joint distribution optimal transportation for domain adaptation,” Proc. Advances Neural Information Processing Systems (NIPS), pp. 3730–3739, Long Beach, CA. Courty, N., R. Flamary, D. Tuia, and A. Rakotomamonjy (2016), “Optimal transport for domain adaptation,” IEEE Trans. Patt. Anal. Mach. Intell., vol. 39, no. 9, pp. 1853–1865. Cox, D. R. (1958), “The regression analysis of binary sequences (with discussion),” J. Roy. Statist. Soc. B, vol. 20, pp. 215–242. Cox, D. R. (1969), Analysis of Binary Data, Chapman & Hall. Cox, D. R. (2006), Principles of Statistical Inference, Cambridge University Press. Cramer, J. S. (2003), Logit Models from Economics and Other Fields, Cambridge University Press. Crammer, K., M. Kearns, and J. Wortman (2008), “Learning from multiple sources,” J. Mach. Learn. Res., vol. 9, pp. 1757–1774. Crammer, K. and Y. Singer (2001), “On the algorithmic implementation of multiclass kernel-based vector machines,” J. Mach. Learn. Res., vol. 2, pp. 265–292. Dasgupta, S. (2004), “Analysis of a greedy active learning strategy.” in Proc. Advances Neural Information Processing Systems (NIPS), pp. 337–344, Vancouver.


Dasgupta, S., and D. Hsu (2008), “Hierarchical sampling for active learning,” in Proc. Int. Conf. Machine Learning (ICML), pp. 208–215, Helsinki. Dasgupta, S., D. Hsu, and C. Monteleoni (2008), “A general agnostic active learning algorithm,” in Proc. Advances Neural Information Processing Systems (NIPS), pp. 353–360, Vancouver. Dietterich, T. G. and G. Bakiri (1995), “Solving multiclass learning problems via error correcting output codes,” J. Artif. Intell. Res., vol. 2, no. 263–286. Figueiredo, M. (2003), “Adaptive sparseness for supervised learning,” IEEE Trans. Patt. Anal. Mach. Intell., vol. 25, pp. 1150–1159. Freedman, D. A. (2009), Statistical Models: Theory and Practice, Cambridge University Press. Fujii, A. T., Tokunaga, K. Inui, and H. Tanaka (1998), “Selective sampling for example based word sense disambiguation,” Comput. Linguist., vol. 24, no. 4, pp. 573–597. Furnkranz, J. (2002), “Round robin classification,” J. Mach. Learn. Res., vol. 2, pp. 721–747. Gaddum, J. H. (1933), “Reports on biological standard III. Methods of biological assay depending on a quantal response,” Special Report Series of the Medical Research Council, no. 183. Garcia-Pedrajas, N. and D. Ortiz-Boyer (2006), “Improving multi-class pattern recognition by the combination of two strategies,” IEEE Trans. Patt. Anal. Mach. Intell., vol. 28, no. 6, pp. 1001–1006. Germain, P., A. Habrard, F. Laviolette, and E. Morvant (2013), “A PAC-Bayesian approach for domain adaptation with specialization to linear classifiers,” Proc. Int. Conf. Machine Learning (ICML), pp. 738–746, Atlanta, GA. Hale, E. T., M. Yin, and Y. Zhang (2008), “Fixed-point continuation for `1 minimization: Methodology and convergence,” SIAM J. Optim., vol. 19, pp. 1107– 1130. Harrell, F. E. (2001), Regression Modeling Strategies, Springer. Hastie, T. and R. Tibshirani (1998), “Classification by pairwise coupling,” Ann. Statist., vol. 26, no. 2, pp. 451–471. Hastie, T., R. Tibshirani, and J. Friedman (2009), The Elements of Statistical Learning, 2nd ed., Springer. Hilbe, J. M. (2009), Logistic Regression Models, Chapman & Hall. Hosmer, D. W. and S. Lemeshow (2013), Applied Logistic Regression, 3rd ed., Wiley. Hsu, C.-W. and C.-J. Lin (2002), “A comparison of methods for multiclass support vector machines,” IEEE Trans. Neural Netw., vol. 13, no. 2, pp. 415–425. Jaakkola, T. and D. Haussler (1999), “Exploiting generative models in discriminative classifiers,” Proc. Advances Neural Information Processing Systems (NIPS), pp. 1–7, Denver, CO. Joachims, T. (1998), “Making large-scale support vector machine learning practical,” in Advances in Kernel Methods: Support Vector Learning, B. Scholkopf, C. Burges, and A. Smola, editors, MIT Press. Koh, K., S. Kim, and S. Boyd (2007), “An interior-point method for large scale `1 regularized logistic regression, J. Mach. Learn. Res., vol. 8, pp. 1519–1555. Kouw, W. M. and M. Loog (2019), “An introduction to domain adaptation and transfer learning,” available at arXiv:1812.11806v2. Krishnapuram, B., L. Carin, M. Figueiredo, and A. Hartemink (2005), “Sparse multinomial logistic regression: Fast algorithms and generalization bounds,” IEEE Trans. Patt. Anal. Mach. Intell., vol. 27, no. 6, pp. 957–968. Lee, Y., Y. Lin, and G. Wahba (2004), “Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data,” J. Amer. Statist. Assoc., vol. 99, no. 465, pp. 67–81. Lindenbaum, M., S. Markovitch, and D. 
Rusakov (2004), “Selective sampling for nearest neighbor classifiers,” Mach. Learn., vol. 54, no. 2, pp. 125–152. MacKay, D. J. C. (1992), “Information-based objective functions for active data selection,” Neural Comput., vol. 4, no. 4, pp. 590–604.


Mansour, Y., M. Mohri, and A. Rostamizadeh (2009), “Domain adaptation: Learning bounds and algorithms,” Proc. Conf. Learning Theory (COLT), pp. 19–30, Montreal. McCullagh, P. and J. A. Nelder (1989), Generalized Linear Models, 2nd ed., Chapman & Hall. Murphy, K. P. (2012), Machine Learning: A Probabilistic Perspective, MIT Press. Nelder, J. and R. Wedderburn (1972), “Generalized linear models,” J. Roy. Statist. Soc. Ser. A, vol. 135, no. 3, pp. 370–384. Ng, A. Y. (2004), “Feature selection, `1 vs. `2 regularization, and rotational invariance,” Proc. Int. Conf. Machine Learning (ICML), pp. 78–86, Banff. Platt, J. C. (1998), “Fast training of support vector machines using sequential minimal optimization,” in Advances in Kernel Methods: Support Vector Learning, B. Scholkopf, C. Burges, and A. Smola, editors, MIT Press. Redko, I., A. Habrard, and M. Sebban (2017), “Theoretical analysis of domain adaptation with optimal transport,” Proc. Joint European Conf. on Machine Learning and Knowledge Discovery in Databases, pp. 737–753, Skopje. Rifkin, R. and A. Klautau (2004), “In defense of one-vs-all classification,” J. Mach. Learn. Res., vol. 5, pp. 101–141. Rocha, A. and S. K. Goldenstein (2013), “Multiclass from binary: Expanding one-vs-all, one-vs-one and ECOC-based approaches,” IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 2, pp. 289–302. Schein, A. I. and L. H. Ungar (2007), “Active learning for logistic regression: An evaluation,” Mach. Learn., vol. 68, no. 3, pp. 235–265. Sejnowski, T. J., and C. R. Rosenberg (1987), “Parallel networks that learn to pronounce English text,” J. Complex Syst., vol. 1, no. 1, pp. 145–168. Settles, B. (2010), “Active learning literature survey,” Computer Sciences Technical Report 1648, University of Wisconsin–Madison. Shi, J., W. Yin, S. Osher, and P. Sajda (2010),“A fast hybrid algorithm for large scale `1 -regularized logistic regression,” J. Mach. Learn. Res., vol. 11, pp. 713–741. Shimodaira, H. (2000), “Improving predictive inference under covariate shift by weighting the log-likelihood function,” J. Statist. Plann. Infer., vol. 90, no. 2, pp. 227–244. Theodoridis, S. and K. Koutroumbas (2008), Pattern Recognition, 4th ed., Academic Press. Tong, S. and D. Koller (2000), “Support vector machine active learning with applications to text classification,” Proc. Int. Conf. on Machine Learning (ICML), pp. 999–1006. Vapnik, V. N. (1998), Statistical Learning Theory, Wiley. Verhulst, P. F. (1845), “Recherches mathématiques sur la loi dáccroissement de la population,” Nouveaux Mémoires de l’Académie Royale des Sciences et Belles-Lettres de Bruxelles, vol. 18, pp. 1–42. Weiss, K., T. M. Khoshgoftaar, and D. Wang (2016), “A survey of transfer learning,” J. Big Data, vol. 3, no. 1. doi:10.1186/s40537-016-0043-6. Weston, J. and C. Watkins (1999), “Support vector machines for multiclass pattern recognition,” Proc. European Symp. Artificial Neural Networks, pp. 219–224, Bruges.

60 Perceptron

In this and the next chapter we discuss two binary classification schemes known as perceptron and support vector machines. In contrast to logistic regression, these methods approximate neither the conditional pdf, fγ |h (γ|h) nor the joint pdf, fγ ,h (γ, h). Instead, both schemes are examples of deterministic methods that operate directly on data realizations {γ(n), hn } and learn from the training data how to discriminate between classes. As the derivations will show, these methods rely on geometric arguments to construct hyperplanes that separate the data into classes. The perceptron algorithm, discussed in this chapter, is one of the earliest iterative solutions devised for binary classification problems. Its development led to a flurry of interest in learning methods, culminating with various techniques for cascading elementary units into the form of neural networks for more sophisticated solutions. We will motivate perceptron from first principles for linearly separable data and comment on its convergence properties and limitations. In comparison to logistic regression, which continually updates its weight iterate wn in response to data, the perceptron algorithm limits its updates only to data points that are misclassified. This results in a simpler implementation, albeit at a cost. For instance, we will find that perceptron is not able to complement its classification decision with a confidence level, as was the case with logistic regression.

60.1  LINEAR SEPARABILITY

Assume we are given N realizations {γ(n), h_n}, where γ(n) ∈ {±1} is the binary label corresponding to the nth feature vector h_n ∈ IR^M. The objective is to construct a classifier c(h) : IR^M → {±1} that maps feature vectors h into their labels. One popular classification structure is the set of "affine-based" classifiers defined by

c(h) ≜ sign( h^T w − θ )     (60.1)

where each classifier is parameterized by a vector w ∈ IR^M and an offset parameter θ ∈ IR. The qualification "affine" or, more simply, "linear" refers to the relation h^T w − θ that appears inside the sign operation. Any point h that lies on the hyperplane defined by (w, θ) satisfies h^T w − θ = 0. On the other hand,


points lying on one side of the hyperplane satisfy h^T w − θ > 0, while points lying on the other side satisfy h^T w − θ < 0 (see Fig. 60.1). The sign operation therefore allows us to identify where a given h lies in relation to the hyperplane h^T w − θ = 0. This class of "linear" classifiers is very useful in practice even in situations when the data cannot be well separated by "linear" structures. This is because they will serve as building blocks for more elaborate classifiers. We will illustrate this situation later in Example 63.1 and also in Section 63.2 when we discuss kernel methods. We will say that a given dataset {γ(n), h_n} is linearly separable when linear classifiers of the form (60.1) exist that are able to separate the data into its two classes, with one class lying on one side of the hyperplane and the other class lying on the other side of the hyperplane. This situation is illustrated in Fig. 60.1 for the case M = 2. In two-dimensional spaces, a hyperplane is simply a line (it will be a plane in IR^3 when M = 3). The figure shows two situations depending on whether the separating line passes through the origin or not. It is clear from the figure that separating lines are not unique, especially since the slopes of the lines can be altered in many ways and still succeed in separating the data into two classes. Once a separating hyperplane is chosen, with parameters denoted by (w⋆, θ⋆), then the classifier can be used to assign feature vectors h into one class or the other by performing the following check:

if h^T w⋆ < θ⋆, assign h to class −1
if h^T w⋆ > θ⋆, assign h to class +1
     (60.2)

In the following we derive the perceptron algorithm, which will provide one way to determine a separating hyperplane (w? , θ? ) from the training data {γ(n), hn }.


Figure 60.1 Illustration of linearly separable data in IR2 . The separating line on the

left passes through the origin (i.e., it has a zero offset parameter), while the separating line on the right does not pass through the origin. The vector w represents the normal direction to the line.

60.2  PERCEPTRON EMPIRICAL RISK

Assuming the N data points {γ(n), h_n} are linearly separable, our objective is to construct a hyperplane (w⋆, θ⋆) that separates the data into its two classes. We pursue a geometric argument. Let (w, θ) denote the parameters of some generic hyperplane. By definition, all vectors h ∈ IR^M that lie on this hyperplane satisfy the equation

h^T w − θ = 0     (60.3)

Moreover, the vector w is called the normal direction to the hyperplane. This is because if we consider any two vectors (h_a, h_b) on the hyperplane, i.e.,

h_a^T w − θ = 0,     h_b^T w − θ = 0     (60.4)

then by subtracting we find that

(h_a − h_b)^T w = 0     (60.5)

so that w is orthogonal to the difference of any two vectors lying in the hyperplane – recall Fig. 56.6. Accordingly, with every hyperplane defined by the parameters (w, θ), we associate the unit-norm normal direction:

unit-norm normal direction = w/‖w‖     (60.6)

We wish to determine parameters (w, θ) such that any data pair (γ(n), h_n) in the training set will be correctly classified by this hyperplane, namely, such that

h_n^T w − θ > 0, when γ(n) = +1
h_n^T w − θ < 0, when γ(n) = −1
     (60.7)

In the first case, h_n will lie on one side of the hyperplane, while in the second case it will lie on the other side. We can combine these two conditions into a single relation by writing that the choice for (w, θ) should enforce the following condition for all training data points, n = 0, 1, . . . , N − 1:

γ(n)( h_n^T w − θ ) > 0     (correct classification)     (60.8)

Geometric construction
Pick an arbitrary vector h_a that belongs to the hyperplane (w, θ), i.e.,

h_a^T w − θ = 0     (60.9)

The distance from any training feature h_n to the hyperplane w can be determined by projecting the vector difference (h_n − h_a) onto the unit-norm direction w/‖w‖ and retaining the absolute value of this projection (see Fig. 60.2):


Figure 60.2 Distance from hn to the separating hyperplane can be obtained by

computing, for any ha , the inner product of the difference (hn − ha ) and the unit-norm vector, w/kwk.

distance from h_n to hyperplane = | (h_n − h_a)^T w/‖w‖ |
                                = (1/‖w‖) | h_n^T w − h_a^T w |
                         (60.9) = (1/‖w‖) | h_n^T w − θ |
                            (a) = (1/‖w‖) | γ(n)( h_n^T w − θ ) |     (60.10)

where we added γ(n) in step (a) because |γ(n)| = 1. We know from (60.8) that if (γ(n), h_n) is misclassified by (w, θ), then γ(n)( h_n^T w − θ ) < 0. When this occurs, the distance expression becomes:

distance from a misclassified point h_n to (w, θ) = −(1/‖w‖) γ(n)( h_n^T w − θ )     (60.11)

If we add the distances of all misclassified points to the hyperplane (their index set is denoted by M) we get:

−(1/‖w‖) Σ_{n∈M} γ(n)( h_n^T w − θ )     (60.12)

The scaling by 1/kwk is irrelevant since we can always re-normalize the separating hyperplane (w, θ) by scaling its w to have unit norm and by scaling θ


similarly, i.e., we can always replace any (w, θ) in (60.3) by (w/‖w‖, θ/‖w‖). As such, we will remove the scaling by 1/‖w‖ from (60.12) and consider instead the sum:

S = −Σ_{n∈M} γ(n)( h_n^T w − θ )     (60.13)

In order to reduce classification errors on the N training data points, we would like to keep this sum small. We can rewrite the above expression in an equivalent manner that incorporates all training points as follows:

S = Σ_{n=0}^{N−1} max{ 0, −γ(n)( h_n^T w − θ ) }     (60.14)

where, by comparing against zero, we are in effect only keeping the contributions arising from the misclassified points (γ(n), h_n). If we scale by 1/N we arrive at the empirical risk function that is associated with the perceptron construction, namely,

(w⋆, θ⋆) = argmin_{w∈IR^M, θ∈IR} { P(w) ≜ (1/N) Σ_{n=0}^{N−1} max{ 0, −γ(n)( h_n^T w − θ ) } }     (60.15)

Clearly, when the data is linearly separable, a hyperplane (w⋆, θ⋆) exists that separates the data correctly and reduces the sum of misclassified distances in (60.13) to zero. If we invoke ergodicity, we find that P(w) motivates the following stochastic risk function:

(1/N) Σ_{n=0}^{N−1} max{ 0, −γ(n)( h_n^T w − θ ) }  →  E max{ 0, −γ( h^T w − θ ) },   as N → ∞     (60.16)

so that the perceptron construction can also be interpreted as solving the following Bayesian inference problem:

w^o = argmin_{w∈IR^M, θ∈IR}  E max{ 0, −γ( h^T w − θ ) }     (60.17)

where the expectation is over the joint distribution of (γ, h).

Online recursion
Problem (60.15) can now be solved by a variety of stochastic optimization methods, already discussed in previous chapters, such as using stochastic subgradient algorithms and variations thereof. It is sufficient to illustrate the construction by considering one solution method. We will therefore focus on stochastic subgradient implementations, with and without regularization, that rely on instantaneous subgradient approximations. The sampling of the data in the stochastic implementation can also be done with or without replacement.


In practice, the optimization problem (60.15) is modified to incorporate regularization for the reasons already explained in Chapter 51, such as reducing ill-conditioning, reducing the possibility of overfitting, and endowing w⋆ with desirable properties such as having a small norm or sparse structure. For illustration, we will consider perceptron risks under ℓ2-regularization and replace (60.15) by

(w⋆, θ⋆) = argmin_{w∈IR^M, θ∈IR} { ρ‖w‖² + (1/N) Σ_{n=0}^{N−1} max{0, −γ(n)(h_n^T w − θ)} }    (60.18)

where ρ is a nonnegative scalar. Using the result of Example 16.9, we show in (60.19) a listing for a regularized perceptron algorithm for solving (60.18). The notation I[x] refers to the indicator function that is equal to 1 when condition x is true and 0 otherwise.

Regularized perceptron for minimizing (60.18).
  given dataset {γ(m), h_m}_{m=0}^{N−1} or streaming data (γ(n), h_n);
  start from an arbitrary initial condition, w_{−1}.
  repeat until convergence over n ≥ 0:
    select at random or receive a sample (γ(n), h_n) at iteration n;
    γ̂(n) = h_n^T w_{n−1} − θ(n−1)
    θ(n) = θ(n−1) − µ γ(n) I[γ(n) γ̂(n) ≤ 0]
    w_n = (1 − 2µρ) w_{n−1} + µ γ(n) h_n I[γ(n) γ̂(n) ≤ 0]
  end
  return w⋆ ← w_n, θ⋆ ← θ(n);
  classify a feature h by using the sign of γ̂ = h^T w⋆ − θ⋆.    (60.19)
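Listing (60.19) translates almost directly into code. Below is a minimal Python sketch under the assumption that the dataset is stored as a NumPy array H of shape (N, M) with labels gamma taking values ±1; the function name, default step size, and regularization value are illustrative choices, not prescribed by the text.

import numpy as np

def regularized_perceptron(H, gamma, mu=0.01, rho=0.1, passes=10, seed=0):
    """Stochastic-subgradient perceptron with l2-regularization, following listing (60.19)."""
    rng = np.random.default_rng(seed)
    N, M = H.shape
    w = np.zeros(M)                               # w_{-1}
    theta = 0.0                                   # theta(-1)
    for _ in range(passes):
        for n in rng.permutation(N):              # sample the data without replacement
            gamma_hat = H[n] @ w - theta          # hat{gamma}(n) = h_n^T w_{n-1} - theta(n-1)
            miss = 1.0 if gamma[n] * gamma_hat <= 0 else 0.0     # indicator I[.]
            theta = theta - mu * gamma[n] * miss
            w = (1 - 2 * mu * rho) * w + mu * gamma[n] * H[n] * miss
    return w, theta                               # classify h via the sign of h @ w - theta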

We can simplify the notation by extending the feature and weight vectors as follows:

h ← [1; h],   w ← [−θ; w]    (60.20)

so that the recursions in (60.19) can be rewritten more compactly in the following manner, where the offset parameter is now implicit:

γ̂(n) = h_n^T w_{n−1}
w_n = A w_{n−1} + µ γ(n) h_n I[γ(n) γ̂(n) ≤ 0],   n ≥ 0    (60.21)

and the diagonal matrix A depends on the regularization parameter:

A ≜ diag{ 1, (1 − 2µρ) I_M }    (60.22)


When a mini-batch of size B is used, the perceptron recursion is replaced by

  select B data samples {γ(b), h_b} at random
  γ̂(b) = h_b^T w_{n−1},   b = 0, 1, . . . , B − 1
  w_n = A w_{n−1} + µ × (1/B) Σ_{b=0}^{B−1} γ(b) h_b I[γ(b) γ̂(b) ≤ 0],   n ≥ 0    (60.23)
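For completeness, here is a sketch of a single mini-batch step in the spirit of (60.23); it assumes the extended representation (60.20), so that the first entry of the weight vector carries the (unregularized) offset as in (60.22). Variable names are illustrative.

import numpy as np

def minibatch_update(w, Hb, gb, mu, rho):
    """One mini-batch step as in (60.23); Hb is (B, M+1) extended features, gb holds labels +-1."""
    gamma_hat = Hb @ w                                 # predictions for the B samples
    miss = (gb * gamma_hat <= 0).astype(float)         # indicator per sample
    grad = (gb * miss) @ Hb / len(gb)                  # (1/B) sum_b gamma(b) h_b I[gamma(b) gamma_hat(b) <= 0]
    A = np.ones_like(w)
    A[1:] = 1 - 2 * mu * rho                           # matrix A of (60.22): offset entry left unregularized
    return A * w + mu * grad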

On the other hand, in the absence of regularization (ρ = 0), we obtain the classical perceptron update:

w_n = w_{n−1} + µ γ(n) h_n I[γ(n) γ̂(n) ≤ 0],   n ≥ 0    (60.24)

which implies that

w_n = w_{n−1} + µ γ(n) h_n,   if γ(n) γ̂(n) ≤ 0    (60.25)

Otherwise, we keep w_n = w_{n−1}. That is, the weight iterate is updated from w_{n−1} to w_n only when the data point (γ(n), h_n) is misclassified, i.e., when the signs of γ(n) and γ̂(n) do not match, in which case the vectors γ(n)h_n and w_{n−1} will have a nonpositive inner product. This situation is illustrated in Fig. 60.3. We therefore find that the perceptron iteration (60.25) perturbs w_{n−1} to w_n in order to obtain a vector w_n that is more correlated with γ(n)h_n.


Figure 60.3 Iteration (60.25) updates w_{n−1} to w_n in order to obtain a vector w_n that is more correlated with γ(n)h_n.

It is useful to note that the inequality condition γ(n)γ̂(n) ≤ 0 in (60.25) cannot be replaced by the strict inequality γ(n)γ̂(n) < 0. This is because if we start from the initial condition w_{−1} = 0, as is typical in many implementations, then γ̂(0) = h_0^T w_{−1} = 0 and the recursion will never update the weight iterate. Moreover, in most implementations of the perceptron algorithm, the step-size parameter is set to µ = 1, which leads to (see the explanation after (63.31) and also Prob. 60.1):

w_n = w_{n−1} + γ(n) h_n,   if γ(n) γ̂(n) ≤ 0    (60.26)

Example 60.1 (Binary classification using perceptron) Figure 60.4 shows a collection of 150 feature samples h_n ∈ IR² whose classes ±1 are known beforehand: 120 samples are selected for training and 30 samples are selected for testing. The data arises from the dimensionally reduced iris dataset from Example 57.4; we denoted the two-dimensional reduced feature vectors by the notation h′_n in that example. We denote them by h_n here. We employ the two classes shown in the bottom plot of Fig. 57.5 and denote them by γ(n) ∈ {±1}. We extend the feature data and weight vector according to (60.20).


Figure 60.4 The plots show 120 data points used for training (left) and 30 data points used for testing (right). The separating line is obtained by running the perceptron algorithm (60.27a)–(60.27b) five times over the training data.

We use 120 samples to train the perceptron classifier by running five passes over the data:

γ̂(n) = h_n^T w_{n−1}    (60.27a)
w_n = w_{n−1} + γ(n) h_n,   if γ(n) γ̂(n) ≤ 0    (60.27b)

During each pass of the algorithm, the data {γ(n), h_n} is randomly reshuffled and the algorithm is rerun starting from the weight iterate obtained at the end of the previous pass. The line in the figure shows the separating curve obtained in this manner with parameters (after undoing the extension (60.20)):

w⋆ = [2.3494, −0.4372]^T,   θ⋆ = 2.0    (60.28)

It is seen that the separation curve is able to classify all test vectors and leads to a 0% empirical error rate.
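The multi-pass procedure used in this example (random reshuffling at the start of each pass, warm-starting from the previous iterate, and µ = 1 as in (60.27a)–(60.27b)) can be sketched as follows. Loading of the iris data is omitted, so the arrays below are placeholders.

import numpy as np

def train_perceptron_passes(H_ext, gamma, passes=5, seed=0):
    """Run (60.27a)-(60.27b) over several reshuffled passes; H_ext has shape (N, M+1)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(H_ext.shape[1])
    for _ in range(passes):
        for n in rng.permutation(len(gamma)):     # reshuffle the data at the start of each pass
            if gamma[n] * (H_ext[n] @ w) <= 0:    # update only when the sample is misclassified
                w = w + gamma[n] * H_ext[n]
    return w      # by (60.20), the first entry holds -theta and the remaining entries hold w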

60.3 TERMINATION IN FINITE STEPS

One useful property of the perceptron algorithm (60.25) is that it terminates in a finite number of steps for linearly separable data {γ(n), h_n}. To see this, recall first that linear separability means that there exists at least one vector w⋆ that is able to separate the data into two classes and satisfy

∃ w⋆ such that γ(n) h_n^T w⋆ > 0,  for n = 0, 1, . . . , N − 1    (60.29)

where we are assuming that the feature data and the weight vector have been extended according to (60.20). When this happens, there will exist at least one point h in the training data that will be closest to the hyperplane, w⋆. The distance from this closest point to the hyperplane is called the margin. This situation is illustrated in Fig. 60.5. Using expression (60.10) with w = w⋆ and setting θ = 0 (since h_n and w⋆ are assumed to have been extended), we find that the margin can be evaluated by computing:

m(w⋆) ≜ min_{0≤n≤N−1} { γ(n) h_n^T w⋆ / ‖w⋆‖ } = min_{0≤n≤N−1} { |h_n^T w⋆| / ‖w⋆‖ }    (60.30)

where the second equality follows from (60.29). Observe that the margin is dependent on the choice of w⋆. The next result shows that the performance of the perceptron algorithm is sensitive to the margin. Specifically, the number of misclassifications encountered by the algorithm is inversely proportional to m²(w⋆) so that larger margins are preferable. Although the result guarantees convergence in a finite number of steps, the number of steps required can still be large because the margin can be small.

Lemma 60.1. (Finite number of errors) Assume the N-size dataset {γ(m), h_m} is linearly separable, i.e., there exists at least one vector w⋆ satisfying (60.29), and denote its margin by m(w⋆). Assume further that the feature vectors are bounded, say, ‖h_n‖ ≤ H for all n. The perceptron algorithm (60.25) is applied continuously over the data, including possibly multiple passes, as needed. At any iteration t, the total number of erroneous misclassifications encountered by the algorithm until that point in time is bounded by

|M_t| ≤ H² / m²(w⋆)    (60.31)

Since the perceptron algorithm updates only when misclassifications occur, it follows from this result that the algorithm will only perform a finite number of updates. Proof: We refer to (60.25) and assume, without loss of generality, that the algorithm starts from the initial condition w−1 = 0; if w−1 is nonzero, then we should incorporate its value into the derivation below. Let Mt denote the collection of all iteration indices for which the algorithm encounters a misclassification until time t (i.e., when it performs updates). Iterating (60.25), we find that at any iteration t:


Figure 60.5 The closest feature vector to the separating hyperplane w⋆ is highlighted inside a circle, with its distance to w⋆ representing the margin, m(w⋆).

w_t = µ Σ_{n∈M_t} γ(n) h_n    (60.32)

where the sum is over the set of misclassified data up to time t, i.e., points for which

γ(n) h_n^T w_{n−1} ≤ 0,   for n ∈ M_t    (60.33)

and (γ(n), h_n) is the sample pair selected at the nth iteration. Computing the inner product of w_t with w⋆, and using (60.29) and (60.30), we get

w_t^T w⋆ = µ Σ_{n∈M_t} γ(n) h_n^T w⋆ = µ Σ_{n∈M_t} |h_n^T w⋆| ≥ µ m(w⋆) ‖w⋆‖ |M_t|    (60.34)

in terms of the cardinality of the set M_t. It follows that

‖w_t‖² ‖w⋆‖² ≥ |w_t^T w⋆|² ≥ µ² m²(w⋆) ‖w⋆‖² |M_t|²    (60.35)

where we applied the Cauchy–Schwarz inequality for the inner product of two vectors, which states that |a^T b| ≤ ‖a‖ ‖b‖. We then arrive at the lower bound:

‖w_t‖² ≥ µ² m²(w⋆) |M_t|²    (60.36)

We can similarly derive an upper bound for ‖w_t‖² as follows. We return to the perceptron recursion (60.25) and note that, for any step t where an update occurs:

‖w_t‖² = ‖w_{t−1} + µ γ(t) h_t‖²
       = ‖w_{t−1}‖² + µ² γ²(t) ‖h_t‖² + 2µ γ(t) h_t^T w_{t−1}
       ≤ ‖w_{t−1}‖² + µ² γ²(t) ‖h_t‖²    (because of (60.33))
       = ‖w_{t−1}‖² + µ² ‖h_t‖²          (since γ²(t) = 1)    (60.37)

Iterating starting from w_{−1}, we find that

‖w_t‖² ≤ µ² Σ_{n∈M_t} ‖h_n‖²    (60.38)

and we arrive at the upper bound

‖w_t‖² ≤ µ² H² |M_t|    (60.39)

Combining this result with (60.36), we conclude that

µ² m²(w⋆) |M_t|² ≤ ‖w_t‖² ≤ µ² H² |M_t|    (60.40)

These bounds are valid as long as the lower bound is smaller than the upper bound, i.e.,

µ² m²(w⋆) |M_t|² ≤ µ² H² |M_t|    (60.41)

which is only satisfied if the number of updates (and, hence, the number of erroneous decisions) is bounded according to (60.31). This result holds irrespective of the value of t and is also independent of the feature dimension, M. The bound in (60.31) confirms termination of the perceptron algorithm after a finite number of iterations for linearly separable data. ∎
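The bound (60.31) is easy to probe numerically. The sketch below (with illustrative constants) generates linearly separable synthetic data, runs the perceptron until an error-free pass is achieved, and compares the number of updates against H²/m²(w⋆) evaluated at the separating vector used to generate the labels.

import numpy as np

rng = np.random.default_rng(1)
M = 5
w_true = rng.standard_normal(M)
H = rng.standard_normal((1000, M))
# keep only points with a clear margin so that m(w_true) is not vanishingly small
keep = np.abs(H @ w_true) / np.linalg.norm(w_true) > 0.1
H = H[keep]
gamma = np.sign(H @ w_true)
N = len(gamma)

margin = np.min(np.abs(H @ w_true)) / np.linalg.norm(w_true)   # m(w_true), a valid separating vector
H_max = np.max(np.linalg.norm(H, axis=1))                      # bound on the feature norms

w, mistakes = np.zeros(M), 0
while True:                                   # cycle over the data until an error-free pass
    clean_pass = True
    for n in range(N):
        if gamma[n] * (H[n] @ w) <= 0:        # misclassification: update and count it
            w = w + gamma[n] * H[n]
            mistakes += 1
            clean_pass = False
    if clean_pass:
        break

print(mistakes, "<=", H_max ** 2 / margin ** 2)   # the mistake count should respect (60.31)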

60.4 POCKET PERCEPTRON

When the training data {γ(m), h_m} is not linearly separable, the perceptron iteration (60.25) will not terminate and the weight vector will continue to update and possibly move from a “good” to a “bad” solution, i.e., from a hyperplane that separates well a large fraction of the data to another hyperplane that performs poorly on the same data. One variation that improves the behavior of the perceptron under these circumstances is to introduce a pocket variable to keep track of the best iterate. At the conclusion of the training phase, the pocket variable provides the desired weight estimate. The pocket perceptron algorithm operates as follows. Let w_p ∈ IR^M denote the weight iterate that is saved in the “pocket.” We set its value initially to some random vector (e.g., the zero vector) and evaluate its empirical error rate over the N training data points, {γ(m), h_m} (i.e., we compute the fraction of incorrect classifications by w_p). We denote this value by

R_p ≜ (1/N) Σ_{m=0}^{N−1} I[γ(m) h_m^T w_p ≤ 0]    (60.42)


At any subsequent iteration of index n, the perceptron recursion (60.26) updates w_{n−1} to a new value w_n only when w_{n−1} misclassifies h_n. Each time an update occurs, from w_{n−1} to w_n, we compute the empirical error rate of the new iterate over the entire training dataset:

R(w_n) ≜ (1/N) Σ_{m=0}^{N−1} I[γ(m) h_m^T w_n ≤ 0]    (60.43)

and compare it against R_p in order to decide whether to replace the pocket variable by the new value, w_n:

if R(w_n) < R_p then w_p ← w_n and R_p ← R(w_n)    (60.44)

At the end of the training phase, the hyperplane that is selected as the final classifier is the one that has been saved in the pocket, i.e., w_p. One inconvenience of this implementation is that it assumes, at every iteration, that the algorithm has access to the entire training data to assess the empirical error rates, R(w_n).

Pocket perceptron algorithm for binary classification.
  given dataset {γ(m), h_m}_{m=0}^{N−1}; assume vectors are extended according to (60.20);
  start from initial conditions, w_{−1} = w_p = 0_M, R_p = 1.
  repeat until convergence over n ≥ 0:
    select at random (γ(n), h_n) at iteration n;
    γ̂(n) = h_n^T w_{n−1}
    if γ(n) γ̂(n) ≤ 0:
      w_n = w_{n−1} + γ(n) h_n
      R(w_n) = (1/N) Σ_{m=0}^{N−1} I[γ(m) h_m^T w_n ≤ 0]
      if R(w_n) < R_p: w_p ← w_n, R_p ← R(w_n)
      end
    else
      w_n = w_{n−1}
    end
  end
  return w⋆ ← w_p.    (60.45)
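Listing (60.45) can be transcribed directly; the Python sketch below is illustrative and carries the same caveat noted above, namely that every accepted update triggers a full pass over the training set in order to evaluate R(w_n).

import numpy as np

def pocket_perceptron(H_ext, gamma, iterations=1000, seed=0):
    """Pocket perceptron (60.45); H_ext holds extended features (N, M+1), gamma holds labels +-1."""
    rng = np.random.default_rng(seed)
    N = len(gamma)
    error_rate = lambda v: np.mean(gamma * (H_ext @ v) <= 0)   # empirical error as in (60.42)-(60.43)
    w = np.zeros(H_ext.shape[1])
    w_pocket, R_pocket = w.copy(), 1.0
    for _ in range(iterations):
        n = rng.integers(N)                      # select a sample at random
        if gamma[n] * (H_ext[n] @ w) <= 0:       # misclassified: perform the perceptron update
            w = w + gamma[n] * H_ext[n]
            R = error_rate(w)                    # error rate of the new iterate over all training data
            if R < R_pocket:                     # keep the best iterate in the pocket
                w_pocket, R_pocket = w.copy(), R
    return w_pocket, R_pocket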

Example 60.2 (Binary classification using the pocket perceptron) Figure 60.6 illustrates the behavior of the pocket algorithm on training samples that are not linearly separable. The data arises from the dimensionally reduced iris dataset from Example 57.4; we denoted the two-dimensional reduced feature vectors by the notation h′_n in that example. We denote them by h_n here. We consider the situation involving three classes shown in the top plot of Fig. 57.5, and extract the data corresponding to classes r = 1 (versicolor) and r = 2 (virginica) – see, for example, the bottom rightmost plot in Fig. 59.9. We denote these two classes by γ(n) ∈ {±1}. There are a total of 100 data samples; we select 80 samples for training and 20 samples for testing. We use the 80 samples to train the traditional perceptron classifier (60.26) and the pocket perceptron classifier (60.45), both under extensions (60.20). In each case, we run five passes of the algorithm over the training data using random reshuffling. The lines in Fig. 60.6 show the separating curves obtained in this manner with parameters

w⋆ = [4.2001, −0.4662]^T,   θ⋆ = −5.0    (traditional perceptron)    (60.46)
w⋆ = [3.9984, −1.6366]^T,   θ⋆ = −5.0    (pocket perceptron)    (60.47)

The resulting empirical error rates on the test data are 20% for perceptron (4 misclassifications in 20 test samples) and 10% for pocket perceptron (2 misclassifications in 20 test samples). The empirical error rates over the training data are 16.25% and 10%, respectively.


Figure 60.6 The plots show 80 training samples and 20 test samples (top row), and the resulting separation lines by means of the perceptron classifier (60.26) and the pocket perceptron classifier (60.45).

Example 60.3 (Application to the heart disease data) We reconsider the dimensionally reduced heart disease dataset from Example 57.4. In particular, we consider the data samples shown in the bottom scatter plot of Fig. 57.6, where the feature vectors have been reduced to dimension 3. We denote these feature vectors by the notation {h_n} in this example (as opposed to {h′_n} used in Example 57.4); we also denote their dimension by M = 3. The data in that figure have been aggregated into two classes: presence of heart disease (which we now assign the label +1) and absence of heart disease (which we now assign the label −1).

Figure 60.7 The plots show 238 training samples and 59 test samples in three-dimensional space (top row), and the resulting separation curves by means of the perceptron classifier (60.26) and the pocket perceptron classifier (60.45).

There are a total of 297 data samples; we select 238 samples for training and 59 samples for testing (that amounts to 20% of the total number of samples). We use the data to train the traditional perceptron classifier (60.26) and the pocket perceptron classifier (60.45), both under extensions (60.20). In each case, we run 50 passes of the algorithms over the training data using random reshuffling. The results are shown in Fig. 60.7. The hyperplanes in the figure show the separating curves obtained in this manner with parameters

w⋆ = [5.2289, 2.6399, 1.0637]^T,   θ⋆ = −1.0    (traditional perceptron)    (60.48)

and

w⋆ = [3.8486, 0.1409, 2.5030]^T,   θ⋆ = −1.0    (pocket perceptron)    (60.49)


The resulting empirical error rates on the test data are 33.90% for perceptron (20 misclassifications in 59 test samples) and 22.03% for pocket perceptron (13 misclassifications in 59 test samples). The empirical error rates over the training data are 20.59% and 13.45%, respectively.

Table 60.1 Empirical error rates over test and training data for both cases of thirteen- and three-dimensional feature vectors.

Algorithm            M    N     Ntrain   Ntest   Training error   Testing error
perceptron           13   297   238      59      17.23%           27.12%
pocket perceptron    13   297   238      59      11.34%           17.23%
perceptron            3   297   238      59      20.59%           33.90%
pocket perceptron     3   297   238      59      13.45%           22.03%

We repeat the same procedure and apply the perceptron and pocket perceptron to the heart disease dataset without reducing the dimension of the feature space. Recall that originally each feature vector consists of M = 13 attributes. We center the feature vectors around their mean and scale their variance to 1, as was described earlier in the preprocessing steps for principal component analysis (PCA) in (57.6). We subsequently apply the perceptron and pocket perceptron to 238 training samples from this set and test the performance on 59 other samples. We also test the performance on the training samples. Table 60.1 summarizes the empirical error rates obtained for both the reduced and full-feature vectors. The symbols Ntrain and Ntest refer to the number of samples used for training and testing.
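As an aside, the preprocessing step mentioned above (centering around the training mean and scaling each entry to unit variance, as in (57.6)) can be sketched as follows; the split into training and test arrays is assumed to have been performed already, and the function name is illustrative.

import numpy as np

def standardize(H_train, H_test):
    """Center around the training mean and scale each feature entry to unit variance."""
    mean = H_train.mean(axis=0)
    std = H_train.std(axis=0)
    std[std == 0] = 1.0                          # guard against constant features
    return (H_train - mean) / std, (H_test - mean) / std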

60.5 COMMENTARIES AND DISCUSSION

The perceptron. The word “perceptron” appears to be a shorthand for the combination “perception automaton” and is nowadays commonly used to refer to the perceptron structure (60.25). This algorithm corresponds to a linear classification rule that is guaranteed to converge in a finite number of iterations for linearly separable data, as explained in Section 60.3. The first works establishing bounds similar to (60.31) are by Block (1961, 1962) and Novikoff (1962). More discussion on linear separability is included in Appendices 60.A and 60.B. The perceptron rule was introduced and implemented into a hardware unit in 1957 by the American psychologist Frank Rosenblatt (1928–1971). Rosenblatt (1957, 1958) was interested in pattern classification problems while working at the Cornell Aeronautical Laboratory. He was motivated by the work performed about a decade earlier in 1949 by the Canadian neuroscientist Donald Hebb (1904–1985) on a model for the neural activity in the human brain. For additional information on the perceptron and its history, the reader may refer to Rosenblatt (1962), Minsky and Papert (1969), Duda and Hart (1973), Widrow and Lehr (1990), Peretto (1992), Haykin (1999), Siu, Roychowdhury, and Kailath (1995), and Theodoridis (2015).

Hebbian model. In his influential text, Hebb (1949) postulated on how neurons in the brain adjust their connection strength. He argued that when a neural cell A is repeatedly involved in firing another neural cell B, then the strength of the synaptic weight linking A to B should increase so that the role of A in firing B is enhanced. This postulate motivated the following algorithmic construction – see the diagram on the left-hand side of Fig. 60.8.


Figure 60.8 A diagram representation of the Hebbian neural model (left) and McCulloch–Pitts neural model (right) with binary input signals and a binary output.

Assume there are several neurons connected to B. We assign one scaling weight to the link between B and each of these neurons. This results in a linear combination output at B of the form γ(n) = h_n^T w. Here, the vector w contains the synaptic weights and the vector h_n ∈ IR^M contains the incoming signals at each of the feeding neurons into B. The variable γ(n) denotes the output signal by neuron B at instant n. The Hebbian learning rule for adjusting the synaptic weights takes the following form:

γ̂(n) = h_n^T w_{n−1}
w_n = w_{n−1} + µ γ̂(n) h_n    (60.50)

with a plus sign in the second equation. In this expression, each entry of w_{n−1} is adjusted in proportion to the corresponding entry in h_n. Observe that the Hebbian rule (60.50) is an unsupervised learning rule; it only relies on knowledge of the feature data, {h_n}. It is instructive to compare this form with the perceptron update (60.25), namely,

w_n = w_{n−1} + µ γ(n) h_n,   if γ(n) γ̂(n) ≤ 0    (60.51)

The perceptron update is a supervised rule; its structure is similar to the Hebbian rule except that γ̂(n) is replaced by the true class variable, γ(n), and the update is only performed when misclassifications occur. The fact that the Hebbian rule relies on γ̂(n) makes it an unstable algorithm since its weights grow unbounded. The Hebbian rule was motivated heuristically by Hebb (1949), using intuition from biological data. The history of this rule serves as a good example of how closer bridges between the biological and mathematical sciences can help avoid unreasonable models. The instability pitfall in the Hebbian update can be seen from several perspectives. First, note that we can rewrite the Hebbian rule (60.50) in the equivalent form:

w_n = (I_M + µ h_n h_n^T) w_{n−1}    (60.52)

This is a first-order recursion. Assuming iid feature vectors {h_n} and letting R_h = E h_n h_n^T ≥ 0, it follows under expectation that

E w_n = (I_M + µ R_h) E w_{n−1}    (60.53)

This is an unstable recursion since the spectral radius of I_M + µR_h is larger than 1. A second way to explain the instability problem is to observe that the Hebbian rule (60.50) can be interpreted as a stochastic-gradient iteration for maximizing (rather than minimizing) the variance P(w) = E (h^T w)², which is convex over w. Yet another way to highlight the instability problem is to note that the maximization of P(w) amounts to determining a vector w that solves (assuming h has zero mean):

w^o ≜ argmax_{w∈IR^M}  w^T R_h w    (60.54)

This is an ill-posed problem, since we know from the Rayleigh–Ritz characterization (1.16) for the largest eigenvalue of R_h that

w^T R_h w ≤ λ_max(R_h) ‖w‖²    (60.55)

and that equality is achieved when w is an eigenvector for R_h corresponding to λ_max. However, there are infinitely many such eigenvectors since any eigenvector can be scaled up or down and it continues to be an eigenvector. Therefore, without any constraint on the norm of w, the bound on the right-hand side of (60.55) can be made arbitrarily large.

Oja rule. It took over three decades until a viable stable variant to the Hebbian rule was proposed by Oja (1982, 1983). The resulting recursion is nowadays known as the Oja rule, and it takes the following form:

w_n = w_{n−1} + µ γ̂(n)(h_n − γ̂(n) w_{n−1})    (60.56)

For comparison purposes with the Hebbian rule (60.52), we can rewrite (60.56) in the equivalent form:

w_n = (I_M + µ h_n h_n^T) w_{n−1} − µ (γ̂(n))² w_{n−1}    (60.57)

which shows that we now have an additional decay term that is proportional to (γ̂(n))². For a fixed γ̂, the update (60.56) can be “motivated” as a stochastic-gradient iteration for minimizing P(w) = E ‖h − γ̂ w‖², which is convex over w. This is not how Oja (1982) motivated the rule. However, by considering this risk function, the rule can be explained as follows. As n → ∞, we expect the product γ̂ w to approach h in order to minimize the mean-square error:

h_n ≈ γ̂(n) w_{n−1}  ⟹  w_{n−1}^T h_n ≈ γ̂(n) ‖w_{n−1}‖²  ⟹  ‖w_{n−1}‖² ≈ 1    (60.58)

where the last implication uses the fact that w_{n−1}^T h_n = γ̂(n).

In this way, the norm of the weight vector will approach the value 1 (and remain bounded). Actually, it can be verified that the Oja rule seeks a solution to (60.54) subject to the constraint ‖w‖² = 1 – see Prob. 60.10 and also Oja (1992). In this way, the Oja rule converges toward an estimate for the unit-norm eigenvector of R_h corresponding to its largest eigenvalue. As a result, there is a strong connection between the Oja rule and the PCA method studied in Chapter 57. In particular, the Oja rule is in effect approximating the first column of U in (57.20) by solving a problem similar to (57.19) – recall Prob. 57.5.

McCulloch–Pitts model. Hebb’s (1949) rule was a generalization of an earlier model for neural activity introduced by McCulloch and Pitts (1943) in a famous article. As indicated in the diagram on the right-hand side of Fig. 60.8, they considered a simpler neural model consisting of two neurons feeding into a threshold unit. They limited the input signals to binary values 0 and 1. If the threshold value is set to 1, then the input–output mapping of this neural model emulates the behavior of the OR logical function, as illustrated in Table 60.2. If, on the other hand, the threshold value is set to 2, then the input–output mapping emulates the behavior of the AND logical function. The McCulloch–Pitts model is limited in its capability since it restricts the input signals to binary values and does not associate weights with the links. Hebb’s (1949) model, as well as the subsequent work by Rosenblatt (1957, 1958) on the perceptron, allowed for extensions in both of these domains as well as for the critical insight of adjusting the weights over time.
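The contrast between the unstable Hebbian update (60.50) and the Oja rule (60.56) is easy to observe numerically: the Hebbian weight norm grows without bound, while the Oja iterate settles near the unit-norm leading eigenvector of R_h. The following is a minimal sketch with an illustrative diagonal covariance and step size.

import numpy as np

rng = np.random.default_rng(0)
M, T, mu = 3, 2000, 0.001
scales = np.array([3.0, 1.0, 0.5])         # R_h = diag(scales**2); its leading eigenvector is e_1
w_hebb = 0.1 * rng.standard_normal(M)
w_oja = w_hebb.copy()

for _ in range(T):
    h = scales * rng.standard_normal(M)    # zero-mean feature with diagonal covariance
    # Hebbian rule (60.50): the weight vector grows without bound
    w_hebb = w_hebb + mu * (h @ w_hebb) * h
    # Oja rule (60.56): the decay term stabilizes the norm
    g = h @ w_oja
    w_oja = w_oja + mu * g * (h - g * w_oja)

print(np.linalg.norm(w_hebb))   # grows by orders of magnitude (instability of the Hebbian rule)
print(np.linalg.norm(w_oja))    # remains close to 1
print(np.abs(w_oja))            # dominated by the first entry, i.e., the leading eigenvector of R_h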


Table 60.2 Input–output mapping of the McCulloch–Pitts neural model with binary input signals and a threshold value set at 1.

Neuron #1   Neuron #2   Output
0           0           0
0           1           1
1           0           1
1           1           1

Pocket perceptron. We indicated in Section 60.3 that the perceptron recursion terminates in a finite number of steps for linearly separable data – see, e.g., Block (1961, 1962) and Novikoff (1962). When the data is not linearly separable, the perceptron recursion will not terminate and the algorithm may move from a good solution to a bad one as it updates. The pocket perceptron algorithm improves performance under these circumstances; it was proposed by Gallant (1986, 1990).

Separation theorem. There are important results in geometry due to Minkowski (1911) that ensure the existence of hyperplanes that separate disjoint convex sets in IR^M. These results are relevant to the concept of linear separability. Consider first a nonempty convex set C ⊂ IR^M and an arbitrary point z_o ∉ C. The so-called supporting hyperplane theorem affirms the existence of a hyperplane passing through z_o with the set C belonging to one of its halfspaces, i.e., there exists w ∈ IR^M and θ ∈ IR such that

z_o^T w − θ = 0   and   sup_{c∈C} { c^T w − θ } ≤ 0    (60.59)

This is also equivalent to stating that there exists a vector w such that

sup_{c∈C} c^T w ≤ z_o^T w    (60.60)

In the case when C is closed and z_o is a point on its boundary, then the hyperplane would correspond to the tangent at z_o – see the illustration in Fig. 60.9. Next consider two disjoint nonempty convex sets X and Y in IR^M. Then, the separating hyperplane theorem states that there exists a vector w ∈ IR^M and a scalar θ such that:

x^T w − θ ≤ 0,   ∀ x ∈ X    (60.61a)
y^T w − θ ≥ 0,   ∀ y ∈ Y    (60.61b)

This is also equivalent to stating that there exists a vector w such that

sup_{x∈X} x^T w ≤ inf_{y∈Y} y^T w    (60.62)

The inequalities in the above expressions cannot be made strict. Figure 60.10 illustrates a situation where two disjoint convex sets cannot be strictly separated. However, when at least one of the sets happens to be closed and bounded (also called compact), then there exist (w, θ) such that

x^T w − θ > 0,   ∀ x ∈ X    (60.63a)
y^T w − θ < 0,   ∀ y ∈ Y    (60.63b)


Figure 60.9 A supporting hyperplane on the left where the convex set appears on one side of it, and a separating hyperplane on the right where the convex sets are separated by it.

Proof of (60.62): Assuming the validity of the supporting hyperplane theorem, we can establish the separating hyperplane theorem as follows. Introduce the convex set D = X − Y where d ∈ D if, and only if, d = x − y for some x ∈ X and y ∈ Y. The origin z = 0 does not belong to D because otherwise it will require x = y and we know that the sets X and Y do not share elements. We conclude from (60.60) that a vector w ∈ IR^M should exist such that

sup_{d∈D} d^T w ≤ 0

which means that x^T w ≤ y^T w for any (x, y) ∈ X × Y. It follows that (60.62) holds. ∎

For further discussion and proofs, the reader is referred to Pettis (1956), Luenberger (1969), and Boyd and Vandenberghe (2004).

PROBLEMS

60.1 Refer to the perceptron recursion (60.25). Is the performance of the algorithm affected if we set µ = 1?
60.2 Can the perceptron algorithm learn to implement the AND function? And what about the XOR function? We define these functions over four feature vectors in IR² as follows:

AND:  h = [−1, −1]^T ∈ class −1,  h = [−1, +1]^T ∈ class −1,  h = [+1, −1]^T ∈ class −1,  h = [+1, +1]^T ∈ class +1
XOR:  h = [−1, −1]^T ∈ class −1,  h = [−1, +1]^T ∈ class +1,  h = [+1, −1]^T ∈ class +1,  h = [+1, +1]^T ∈ class −1

60.3 Show that the perceptron algorithm can learn to implement a NAND function. Consider then the other logical operations represented by NOT, AND, OR, NOR, XOR (exclusive OR), and XNOR (exclusive NOR). Show how each of these logical operations can be implemented by using solely NAND gates.


Figure 60.10 An example of two disjoint convex sets that cannot be strictly separated.

60.4 Consider a collection of N linearly separable data pairs {γ(n), h_n}. Assume the offset parameter is zero. Show that linear separability is equivalent to the existence of a vector w ∈ IR^M that satisfies Hw ⪰ 1_N, where H is the N × M data matrix whose rows are γ(n)h_n^T and ⪰ denotes elementwise comparison.
60.5 Consider two collections of vectors in IR^M denoted by H = {h_1, h_2, . . . , h_N} and X = {x_1, x_2, . . . , x_L}. We say that these sets are linearly separable if there exist w⋆ ∈ IR^M and θ⋆ ∈ IR such that h_n^T w⋆ > θ⋆ for all vectors in set H and x_n^T w⋆ < θ⋆ for all vectors in set X. Show more strongly that the two sets H and X are linearly separable if, and only if, there exist z⋆ ∈ IR^M and α⋆ ∈ IR such that h_n^T z⋆ − α⋆ ≥ 1 for all vectors in H and x_n^T z⋆ − α⋆ ≤ −1 for all vectors in X.
60.6 Continuing with Prob. 60.5, we show that checking linear separability of two sets can be reduced to solving a linear program. Show that two sets H and X are linearly separable if, and only if, the optimal value for the following linear program is zero:

min_{z, α, a, b}  (1/N) 1_N^T a + (1/L) 1_L^T b,   where z ∈ IR^M, α ∈ IR, a ∈ IR^N, b ∈ IR^L
subject to
  a(n) ≥ −h_n^T z + α + 1,   n = 1, 2, . . . , N
  b(ℓ) ≥ x_ℓ^T z − α + 1,    ℓ = 1, 2, . . . , L
  a(n) ≥ 0,   n = 1, 2, . . . , N
  b(ℓ) ≥ 0,   ℓ = 1, 2, . . . , L

Show further that if {z⋆, α⋆, a⋆, b⋆} is an optimal solution, then g(f) = f^T z⋆ − α⋆ is a separating hyperplane for the two sets, where f denotes a generic feature vector. Remark. The reader may refer to Smith (1968) and Bennett and Mangasarian (1992) for a related discussion.
60.7 Consider a collection of N linearly separable data pairs {γ(n), h_n} where γ(n) ∈ {−1, +1} denotes the label and h_n ∈ IR^M is the corresponding feature vector.


We assume feature vectors have already been extended according to (60.20). We wish to determine a separating hyperplane w such that γ(n)h_n^T w > 0. We motivated the perceptron recursion in the body of the chapter as one solution method. Here, we motivate a second relaxation method based on using the alternating projection algorithm from Section 12.6. Introduce the N halfspaces H_n = {w | −γ(n)h_n^T w < 0}, one for each data pair (γ(n), h_n). We are then faced with the problem of solving N linear inequalities and finding a point w⋆ in the intersection of these halfspaces. Use the result of Prob. 9.5 to show that the alternating projection method motivates the following recursion:

w_n = w_{n−1} + (γ(n)h_n / ‖h_n‖²) max{0, −γ(n)h_n^T w_{n−1}}

How is this method different from the unregularized perceptron recursion? Remark. The above recursion is known as a relaxation method for solving a set of linear inequalities; it was introduced by Agmon (1954) and Motzkin and Schoenberg (1954) in back-to-back papers in the same journal issue – see also Eremin (1965). A footnote on the first page of Agmon (1954) acknowledges that the idea of the algorithm was communicated to the author by the first author of Motzkin and Schoenberg (1954).
60.8 Consider a collection of N data points {γ(n), h_n} and refer to the perceptron recursion (60.25). Let w⋆_1 denote the separating hyperplane that is obtained by running the recursion over this data. Now assume we replace each γ(n) by −γ(n); that is, we switch the labeling of the classes: class +1 becomes −1 and vice-versa. We run the perceptron again on this modified data and obtain w⋆_2. How are w⋆_1 and w⋆_2 related to each other? Is the perceptron sensitive to how we label the classes?
60.9 Refer to the perceptron recursion (60.25). Introduce the variable d(n) defined as follows: d(n) = +1 if γ(n) = +1 and d(n) = 0 if γ(n) = −1. Introduce also the hard-threshold function:

g(x) ≜ 1 if x ≥ 0, and g(x) ≜ 0 if x < 0

Show that recursion (60.25) can be re-worked into the following form:

e(n) = d(n) − g(h_n^T w_{n−1})
w_n = w_{n−1} + µ h_n e(n)

60.10 Refer to the Oja rule (60.56). Explain that this rule is maximizing E (h_n^T w)² subject to ‖w‖² = 1.
60.11 Let γ denote a generic binary random variable that assumes the values ±1, and let h denote an M × 1 random (feature) vector. Consider the following regularized exponential risk function:

P(w) ≜ ρ‖w‖² + E e^{−γ h^T w}

where ρ > 0 is a regularization parameter. Derive a stochastic-gradient algorithm for the minimization of P(w). How does the algorithm compare to perceptron learning?
60.12 Establish the equality to 2^N in (60.77).
60.13 Conclude from (60.65) that when N = 2(M + 1), then the number of linearly separable dichotomies of the feature vectors {h_n ∈ IR^M} is given by S(M, N) = 2^{2M+1}. In other words, show that in this case only half of the 2^N possible dichotomies are linearly separable.
60.14 Assume the number of feature vectors is fixed at N while their dimension is allowed to increase from m = 1 up to m = M − 1. Conclude from (60.65) that the number of linearly separable dichotomies, S(m, N), increases monotonically with m from the value S(1, N) = 2N up to the value S(M − 1, N) = 2^N.
60.15 Refer to result (60.65).


(a) Use it to bound the number of linearly separable Boolean functions in M dimensions as follows (writing C(n, k) for the binomial coefficient “n choose k”):

S(M, 2^M) ≤ 2 Σ_{m=0}^{M} C(2^M − 1, m)

Is this bound consistent with the Sauer lemma (64.86)?
(b) Use an argument similar to (64.102) to establish that for any M ≥ L:

Σ_{m=0}^{L} C(M, m) < (Me/L)^L

(c) Combine the results of parts (a) and (b) to establish (60.85).
60.16 Refer to the probability expression (60.78), which evaluates the likelihood that a randomly selected dichotomy of N feature vectors in IR^M is linearly separable.
(a) Plot P(M, N) against the ratio N/(M + 1).
(b) What are the values of P(M, N) when N = 2(M + 1) and N ≤ M + 1?
(c) Establish the limits (60.79)–(60.80).
(d) Establish (60.81).
60.17 Consider a unit-edge hypercube in M dimensions, with one vertex lying at the origin. The hypercube has 2^M vertices. Let the feature vectors {h_m} correspond to the locations of these vertices. Show that no 2M vertices in general position exist.

60.A COUNTING THEOREM

The concept of linearly separable data is paramount in the study of binary classification problems. We indicated in (60.2) that a collection of feature vectors {h_n ∈ IR^M} is linearly separable if at least one hyperplane, w⋆ ∈ IR^M, can be determined that separates the data into two classes, with one class lying on one side of the plane and the other class lying on the other side of the plane, namely,

h_n^T w⋆ < 0, whenever h_n ∈ class −1
h_n^T w⋆ > 0, whenever h_n ∈ class +1    (60.64)

In our discussion in this appendix we will assume that the feature data and the weight vector for the separating hyperplane have been extended according to (60.20) so that there is no need to account for the offset parameter separately. Now, given an arbitrary collection of N feature vectors h_n ∈ IR^M, there are 2^N possibilities for assigning each one of them to a class γ(n) ∈ {±1}. Each of these possibilities is referred to as a dichotomy. In general, not all dichotomies will be linearly separable. We refer to Fig. 60.11 to illustrate this concept. The figure shows N = 3 feature vectors in IR² and all eight possible dichotomies (i.e., label assignments). For example, in the leftmost box in the top row, we show two feature vectors assigned to +1 (represented by the plus sign) and one feature vector assigned to −1 (represented by the minus sign). In this same box, we show a line that can be used to separate both classes. Similarly for the remaining seven boxes in the figure. It is seen in this example that all eight dichotomies are linearly separable. In the rightmost box, we consider another situation involving N = 4 feature vectors in IR² and show one particular dichotomy with two features assigned to +1 and two other features assigned to −1. In this case, the dichotomy is not linearly separable. We expand on this situation in Fig. 60.12, which lists all 16 possible dichotomies for N = 4 feature vectors. Each circle represents an assignment to class +1 and each square represents an assignment to class −1. The two boxes marked with background color correspond


to dichotomies that are not linearly separable. It is seen from the figure that there are 14 out of 16 dichotomies that are linearly separable.

Figure 60.11 The eight squares on the left show all possible assignments of the same three feature vectors in IR². In each case, a line exists that separates the classes ±1 from each other. We therefore say that the three feature vectors in this example are separable by linear classifiers. In contrast, the figure on the right shows four feature vectors in the same space IR² and an assignment of classes that cannot be separated by a linear classifier.

Figure 60.12 Given four (N = 4) feature vectors, there are 2^4 = 16 possible dichotomies shown in the figure. Each circle represents assignment to class +1, while each square represents assignment to class −1. The marked boxes with background color correspond to the two dichotomies that are not linearly separable.

One useful question in the study of binary classification problems is the following. Given N feature vectors {h_n} in M-dimensional space, how many of the 2^N dichotomies can be expected to be linearly separable? This is a classical problem in combinatorial


geometry and has been answered elegantly by Cover (1965); a couple of other works from the early 1950s and 1960s with similar conclusions are mentioned in Cover (1965), including an earlier proof technique by Schlafli (1950, pp. 209–212). The counting theorem that we describe below can be viewed as an early precursor to a famous inequality known as the Sauer lemma, which we establish later in Appendix 64.B – see (64.86). In the terminology of that appendix, the number of linearly separable dichotomies is also called the shatter coefficient. We denote this number by the notation S(M, N ), where N is the number of feature vectors and M is the dimension of the feature space (and also the size of the parameter space that defines the classifier). Although we are focusing here on linear classifiers, we hasten to add that the shatter coefficient can be defined for other classes of classifiers as well; for this reason, in Appendix 64.B we will use instead the more general notation S(C, N ) to refer to the shatter coefficient, where the symbol C refers to the class of classifiers under consideration (linear or otherwise). The counting theorem stated further ahead is specific to linear classifiers, in which case it is justifiable to replace C by the dimension M of the parameter space, w. The statement of the counting theorem requires the notion of points in general position. (Definition of points in general position). Consider N column vectors {hn } in M dimensional space, hn ∈ IRM . The N points are said to be in general position if no subset of M + 1 vectors lies in an (M − 1)-dimensional hyperplane. We also say that the points are in general position if every subset of M or fewer vectors is linearly independent. This situation is illustrated in Fig. 60.13. The plot on the left shows N = 5 feature vectors in IR2 (for which M = 2). These vectors are not in general position because three vectors happen to lie on the same line. This example shows that four points in IR3 are in general position if no three of them lie on the same line.


Figure 60.13 The plot on the left shows N = 5 feature vectors in IR² (for which M = 2). These vectors are not in general position because three vectors happen to lie on the same line.

We are now ready to state the counting theorem and prove it following Cover (1965). Observe that the theorem provides an exact count for the number of linearly separable dichotomies. It is not generally possible to provide such an exact count for other classes

of classifiers. For these more general cases, the theorem will be replaced by the Sauer lemma (64.86), which provides an upper bound (rather than an equality) for the number of separable dichotomies – see Appendix 64.C on the Vapnik–Chervonenkis bound.

Counting theorem (Cover (1965)). Consider the class of linear classifiers defined by C = {sign(h^T w)}, where h ∈ IR^M denotes feature vectors and the free parameter w ∈ IR^M defines the hyperplane. Feature vectors are assigned to classes +1 or −1 depending on the sign of the inner product h^T w. Consider a collection of N feature vectors, {h_n}, in general position in IR^M. It holds that the number of linearly separable dichotomies, from among the 2^N possible dichotomies, is given by

S(M, N) = 2 Σ_{m=0}^{M} C(N−1, m),   when N > M + 1
S(M, N) = 2^N,                        when N ≤ M + 1    (60.65)

where C(n, k) denotes the binomial coefficient “n choose k.”

Proof: Starting with the N feature vectors {hn } in IRM , we let S(M, N ) denote the

number of linearly separable dichotomies for this set of generally positioned points. Next, we enlarge the set to N + 1 points by adding a new feature vector, hN +1 , such that the new expanded feature set continues to have general position. We similarly let S(M, N + 1) denote the number of linearly separable dichotomies for this new set. The argument that follows determines a relation between S(M, N ) and S(M, N + 1). Let w be one of the linear classifiers that generates one of the dichotomies for the initial feature set {hn } of size N . Under this classifier, a feature vector hn would be mapped to the label: γ(n) = sign(hTn w)

(60.66)

The value of γ(n) is either +1 or −1. Thus, the hyperplane w generates the following dichotomy for the N feature vectors: [ γ(1), γ(2), . . . , γ(N ) ] ,

γ(n) ∈ {+1, −1}

(60.67)

When this same hyperplane is applied to the additional feature hN +1 , it will generate some label denoted by γ(N + 1) = sign(hTN +1 w)

(60.68)

The value of this label is again either +1 or −1. In this way, the hyperplane w leads to the following dichotomy over the expanded set: h i γ(1), γ(2), . . . , γ(N ), γ(N + 1) (60.69) We therefore find that for every linear dichotomy defined over the original N feature vectors {hn }, we can associate at least one dichotomy over the expanded feature set of size N + 1. The analysis so far shows that S(M, N + 1) is at least as large as S(M, N ): S(M, N + 1) ≥ S(M, N )

(60.70)

Let us verify next that it is actually possible to generate more dichotomies over the N + 1 feature vectors than the S(M, N ) dichotomies generated over the smaller set. The argument depends on whether we can find a separating hyperplane w from the original set that passes through hN +1 or not: (a) Assume first that there exists a hyperplane w from the set that generates the dichotomies for the original N feature vectors with the following property: The hyperplane passes through the added point hN +1 . In this case, we can perturb


Figure 60.14 The plot shows one dichotomy for the N feature vectors {h_n} with a separating hyperplane w that passes through the new feature h_{N+1}. By perturbing this hyperplane slightly to one side or the other, the feature h_{N+1} can end up with label +1 or −1.

this hyperplane by an infinitesimal amount and have hN +1 appear on one side or the other of the plane, with the plane still separating the original N feature vectors – see Fig. 60.14. It follows in this case that, for each separating w for the original feature vectors, we are able to generate two dichotomies for the expanded (N + 1)-long set (and not just one as above), with the label for hN +1 being either +1 or −1: [ γ(1), γ(2), . . . , γ(N ), +1 ] ,

[ γ(1), γ(2), . . . , γ(N ), −1 ]

(60.71)

This argument indicates that S(M, N +1) will be larger than S(M, N ) and we write S(M, N + 1) = S(M, N ) + ∆

(60.72)

for some positive number ∆ to be determined. (b) Assume, on the other hand, that there is no hyperplane from the S(M, N ) dichotomies for the original N feature vectors that passes through hN +1 . Then, in this case, the point hN +1 would always lie on one side of all the hyperplanes for the original dichotomies. As a result, only one dichotomy over the N + 1 features is possible, as explained earlier, and not two dichotomies as in part (a). We therefore need to determine ∆. By definition, its value is equal to the number of dichotomies of the original N feature vectors with the constraint that the separating hyperplanes should pass through hN +1 . By restricting the separating hyperplanes to pass through a particular point, we are in effect reducing the dimension (or degrees of freedom) of the problem from M down to M − 1. Therefore, it holds that ∆ = S(M − 1, N ) and we arrive at the relation S(M, N + 1) = S(M, N ) + S(M − 1, N )

(60.73)

We can now use this relation to establish (60.65) by induction. We assume result (60.65) holds for (M, N ) and establish a similar form for (M, N +1). To begin with, note that the


relation holds for N = 1 since it gives S(M, 1) = 2 – see (60.77), and we know that for a single feature vector in M-dimensional space there are only two possible dichotomies. Note also that relation (60.65) holds for M = 1 since it gives S(1, N) = 2N, and we know that there are 2N dichotomies for N generally positioned points on a line. Next, using (60.73) and the assumed induction form (60.65) we have

S(M, N+1) = 2 Σ_{m=0}^{M} C(N−1, m) + 2 Σ_{m=0}^{M−1} C(N−1, m)
          = 2 Σ_{m=0}^{M} C(N−1, m) + 2 Σ_{m'=1}^{M} C(N−1, m'−1)      (m ← m' − 1)
          = 2 Σ_{m=0}^{M} C(N−1, m) + 2 Σ_{m=1}^{M} C(N−1, m−1)        (m' ← m)
      (a) = 2 Σ_{m=0}^{M} C(N−1, m) + 2 Σ_{m=0}^{M} C(N−1, m−1)
          = 2 Σ_{m=0}^{M} { C(N−1, m) + C(N−1, m−1) }
      (b) = 2 Σ_{m=0}^{M} C(N, m)    (60.74)

as expected, where step (a) uses the property

C(N−1, k) = 0,   when k < 0    (60.75)

and step (b) uses the equality

C(N, m) = C(N−1, m) + C(N−1, m−1)    (60.76)

Relation (60.65) is valid as long as the value of m within the combinatorial expression does not exceed N − 1. This is satisfied whenever M < N − 1 or, equivalently, N > M + 1. On the other hand, when N ≤ M + 1, we can replace the upper limit M in the summation by N − 1 and write instead

S(M, N) = 2 Σ_{m=0}^{N−1} C(N−1, m) = 2^N,   when N ≤ M + 1    (60.77)

This concludes the proof. ∎

Now, given a collection of N feature vectors h_n ∈ IR^M and assuming each of the 2^N possible dichotomies is equally likely to occur, we readily conclude from result (60.65) that the probability that a randomly selected dichotomy is linearly separable is captured by the expression:

P(M, N) = S(M, N)/2^N    (60.78)

This is a revealing expression and brings forth some useful properties. Problems 60.12– 60.16 explore these properties and are motivated by the exposition and results from Cover (1965). In particular, the following useful conclusions are established in these problems:


(a) When N ≤ M + 1, each one of the 2^N possible dichotomies of the feature vectors h_n ∈ IR^M is linearly separable.
(b) When N = 2(M + 1), only half of the 2^N possible dichotomies of the feature vectors h_n ∈ IR^M is linearly separable.
(c) The value N = 2(M + 1) corresponds to a critical turning point for large-dimensional problems. In particular, it holds for any small ε > 0 that

lim_{M→∞} P(M, (1 + ε) 2(M + 1)) = 0    (60.79)
lim_{M→∞} P(M, (1 − ε) 2(M + 1)) = 1    (60.80)

Observe how at the cut-off point N = 2(M + 1) (i.e., for this many feature vectors), the probability of linear separation transitions sharply from 1 down to 0.
(d) These limiting results motivate introducing the notion of the capacity of the class of linear classifiers in M-dimensional space. The capacity is defined as the largest number C such that for any N < (1 − ε)C, a random dichotomy of size N in IR^M is linearly separable with probability larger than 1 − δ, for some small δ > 0. It can be shown that, for M large enough,

C = 2(M + 1)    (60.81)

That is, the capacity corresponds roughly to two random feature vectors per weight dimension.
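Expression (60.65) and the probability (60.78) are simple to evaluate. The short Python sketch below reproduces the count of 14 separable dichotomies from Fig. 60.12 and illustrates the behavior of P(M, N) around N = 2(M + 1); the choice M = 100 and the sampled ratios are illustrative.

from math import comb

def S(M, N):
    """Number of linearly separable dichotomies of N points in general position in IR^M, per (60.65)."""
    if N <= M + 1:
        return 2 ** N
    return 2 * sum(comb(N - 1, m) for m in range(M + 1))

def P(M, N):
    """Probability (60.78) that a random dichotomy of N points in IR^M is linearly separable."""
    return S(M, N) / 2 ** N

print(S(2, 4))                        # 14, in agreement with Fig. 60.12
M = 100
for ratio in (1.0, 1.9, 2.0, 2.1, 3.0):
    N = int(ratio * (M + 1))
    print(ratio, P(M, N))             # 1 for N <= M+1, exactly 0.5 at N = 2(M+1), and decaying toward 0 beyond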

60.B BOOLEAN FUNCTIONS

It is useful to comment on how the results from the previous appendix on linear separability relate to the (more complex) problem of counting the number of linearly separable dichotomies generated by Boolean functions. One key difference in relation to what we have discussed so far is that the entries of each h_n will now be restricted to assuming only the binary values 0 or 1. In this case, the feature vectors {h_n} will generally violate the general position requirement, as illustrated by Prob. 60.17. Consequently, result (60.65) will not be applicable anymore. However, building on arguments from Furedi (1986), the work by Budinich (1991) shows that the probability expression (60.78) will continue to hold for M → ∞. The expression would then provide the probability that a collection of N vertices in a large M-dimensional hypercube are linearly separable, as we proceed to clarify. Let f(a_1, a_2, . . . , a_M) : {0, 1}^M → {0, 1} denote a Boolean function defined over M binary arguments denoted by {a_m}. Each a_m can assume one of only two possible values, 0 or 1, and the function itself can only assume the values 0 or 1. We can interpret each realization of the M-dimensional vector (a_1, a_2, . . . , a_M) as representing the coordinates of some vertex of a hypercube in M dimensions. This coordinate vector plays the role of a feature vector h_n in our previous notation. The class that this feature vector belongs to will be the value of the function f(a_1, . . . , a_M), written more compactly as f(h_n):

f(h_n) : {0, 1}^M → {0, 1}    (60.82)

Note that we are denoting the classes by {0, 1}. Now, an M-dimensional hypercube will have 2^M vertices. Each of these vertices can be assigned to class 0 or 1. There are a total of 2^{2^M} possible binary assignments for all vertices of the hypercube, i.e., there are a total of 2^{2^M} Boolean functions over M arguments. For any particular choice of the Boolean function, we let V_0 denote the collection of vertices it assigns to class 0 and V_1 the collection of vertices it assigns to class 1. The Boolean function will then be said to


be linearly separable if there exists at least one hyperplane in IR^M that separates the vertices V_0 and V_1 from each other: One set of vertices would appear on one side of the hyperplane and the other set would appear on the other side, i.e., if there exists some w⋆ ∈ IR^M such that

f(h_n) = 1,  if h_n^T w⋆ > 0, i.e., h_n ∈ V_1
f(h_n) = 0,  if h_n^T w⋆ < 0, i.e., h_n ∈ V_0    (60.83)

This situation is illustrated in Fig. 60.15. The figure shows one realization of a Boolean function; vertices marked in blue are assigned the binary value 1 and vertices marked in yellow are assigned the binary value 0. It is seen in this example that the sets V_0 and V_1 are linearly separable. One example of a Boolean function that is not linearly separable is the XOR function, defined as follows (where M = 2):

f(h_n) = a_1 XOR a_2:
  (a_1, a_2) = (0, 0)  →  f(a_1, a_2) = 0
  (a_1, a_2) = (0, 1)  →  f(a_1, a_2) = 1
  (a_1, a_2) = (1, 0)  →  f(a_1, a_2) = 1
  (a_1, a_2) = (1, 1)  →  f(a_1, a_2) = 0    (60.84)

Figure 60.15 The figure shows one realization of a Boolean function; vertices marked in blue are assigned the binary value 1 and vertices marked in yellow are assigned the binary value 0. It is seen in this example that the sets V_0 and V_1 are linearly separable.

The question we would like to examine is to determine how many of the 2^{2^M} possible Boolean functions are linearly separable. We already know from the XOR example and from Fig. 60.12 that not all Boolean functions are linearly separable. For instance, consider the situation corresponding to M = 2 (Boolean functions with two arguments). Hypercubes in this space are squares with 2² = 4 vertices. There are a total of 2^4 = 16 possible assignments for these vertices. We know from the representation in Fig. 60.12 that there are only 14 linearly separable Boolean functions in this case. More generally, there is no closed-form expression for the number of linearly separable Boolean functions for arbitrary values of M; this is in contrast to result (60.65). In the Boolean context, we have N = 2^M (the number of feature vectors is the number


of vertices). Therefore, we will denote the number of linearly separable Boolean functions by S(M, 2^M). Although a closed-form expression for S(M, 2^M) does not exist, the work by Muroga (1971) provides a useful upper bound – see also the text by Peretto (1992), the volume edited by Smolensky, Mozer, and Rumelhart (1996), and the proof in Anthony (2001, pp. 37–38), as well as Prob. 60.15:

S(M, 2^M) ≤ 2^{M²}    (60.85)

Table 60.3 provides some known values for the number of linearly separable Boolean functions up to M = 8 – see Muroga (1971).

Table 60.3 Number of linearly separable Boolean functions in M-dimensional space (up to M = 8) and the probability that the vertices of the unit-edge hypercube are linearly separable.

M   # Boolean functions (2^{2^M})      # Linearly separable Boolean functions   Probability of separation
1   4                                  4                                        1
2   16                                 14                                       0.875
3   256                                104                                      0.40625
4   65,536                             1882                                     0.02880
5   4,294,967,296                      94,572                                   ≈ 2.2 × 10^{−5}
6   18,446,744,073,709,551,616         15,028,134                               ≈ 8.1 × 10^{−13}
7   ≈ 3.4028 × 10^{38}                 8,378,070,864                            ≈ 2.5 × 10^{−29}
8   ≈ 1.1579 × 10^{77}                 17,561,539,552,946                       ≈ 1.5 × 10^{−64}
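The M = 2 entry of Table 60.3 (14 separable functions out of 16) can also be verified by brute force: for each of the 16 labelings of the four vertices of the unit square, one can test feasibility of γ_n(h_n^T w − θ) ≥ 1 with a linear program, in the spirit of Probs. 60.5–60.6. The sketch below uses scipy.optimize.linprog for this feasibility check; this is an implementation choice, not something prescribed by the text.

import itertools
import numpy as np
from scipy.optimize import linprog

vertices = np.array(list(itertools.product([0, 1], repeat=2)), dtype=float)   # the 4 vertices of the square

def separable(labels):
    """Check whether labels (+-1) over the vertices admit some (w, theta) with gamma_n(h_n^T w - theta) >= 1."""
    # variables x = [w_1, w_2, theta]; constraints: -gamma_n*(h_n^T w - theta) <= -1
    A_ub = -np.column_stack([labels[:, None] * vertices, -labels])
    b_ub = -np.ones(len(labels))
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * 3)
    return res.success

count = sum(separable(np.array(lab)) for lab in itertools.product([-1, 1], repeat=4))
print(count)   # 14, matching Table 60.3 and Fig. 60.12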

REFERENCES

Agmon, S. (1954), “The relaxation method for linear inequalities,” Can. J. Math., vol. 6, no. 3, pp. 382–392.
Anthony, M. (2001), Discrete Mathematics of Neural Networks, SIAM.
Bennett, K. P. and O. L. Mangasarian (1992), “Robust linear programming discrimination of two linearly inseparable sets,” Optim. Meth. Softw., vol. 1, pp. 23–34.
Block, H. D. (1961), “Analysis of perceptrons,” Proc. West Joint. Computer Conf., vol. 19, pp. 281–289.
Block, H. D. (1962), “The perceptron: A model for brain functioning I,” Rev. Mod. Phys., vol. 34, no. 1, pp. 123–135.
Boyd, S. and L. Vandenberghe (2004), Convex Optimization, Cambridge University Press.
Budinich, M. (1991), “On linear separability of random subsets of hypercube vertices,” J. Phys. A: Math. Gen., vol. 24, pp. L211–L213.
Cover, T. M. (1965), “Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition,” IEEE Trans. Electron. Comput., vol. 14, pp. 326–334.
Duda, R. O. and P. E. Hart (1973), Pattern Classification and Scene Analysis, Wiley.
Eremin, I. (1965), “The relaxation method of solving systems of inequalities with convex functions on the left sides,” Soviet Math. Dokl., vol. 6, pp. 219–222.
Furedi, Z. (1986), “Random polytopes in the d-dimensional cube,” Discrete Comput. Geomet., vol. 1, pp. 315–319.
Gallant, S. I. (1986), “Optimal linear discriminants,” Proc. Int. Conf. Pattern Recognition, pp. 849–852, Paris.


Gallant, S. I. (1990), "Perceptron-based learning algorithms," IEEE Trans. Neural Netw., vol. 1, no. 2, pp. 179–191.
Haykin, S. (1999), Neural Networks: A Comprehensive Foundation, Prentice Hall.
Hebb, D. O. (1949), The Organization of Behavior, Wiley.
Luenberger, D. G. (1969), Optimization by Vector Space Methods, Wiley.
McCulloch, W. and W. Pitts (1943), "A logical calculus of ideas immanent in nervous activity," Bull. Math. Biophys., vol. 5, no. 4, pp. 115–133.
Minkowski, H. (1911), Gesammelte Abhandlungen, Leipzig and Berlin.
Minsky, M. and S. Papert (1969), Perceptrons, MIT Press. Expanded edition published in 1987.
Motzkin, T. and I. J. Schoenberg (1954), "The relaxation method for linear inequalities," Can. J. Math., vol. 6, no. 3, pp. 393–404.
Muroga, S. (1971), Threshold Logic and Its Applications, Wiley.
Novikoff, A. (1962), "On convergence proofs on perceptrons," Proc. Symp. Mathematical Theory Automata, pp. 615–622, Brooklyn, NY.
Oja, E. (1982), "Simplified neuron model as a principal component analyzer," J. Math. Biol., vol. 15, no. 3, pp. 267–273.
Oja, E. (1983), Subspace Methods of Pattern Recognition, Research Studies Press.
Oja, E. (1992), "Principal components, minor components, and linear neural networks," Neural Netw., vol. 5, pp. 927–935.
Peretto, P. (1992), An Introduction to the Modeling of Neural Networks, Cambridge University Press.
Pettis, B. J. (1956), "Separation theorems for convex sets," Math. Mag., vol. 29, no. 5, pp. 233–247.
Rosenblatt, F. (1957), The Perceptron: A Perceiving and Recognizing Automaton, Technical Report 85-460-1, Project PARA, Cornell Aeronautical Lab.
Rosenblatt, F. (1958), "The perceptron: A probabilistic model for information storage and organization in the brain," Psychol. Rev., vol. 65, no. 6, pp. 386–408.
Rosenblatt, F. (1962), Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms, Spartan Books.
Schlafli, L. (1950), Gesammelte Mathematische Abhandlungen I, Springer.
Siu, K.-Y., V. P. Roychowdhury, and T. Kailath (1995), Discrete Neural Computation: A Theoretical Foundation, Prentice Hall.
Smith, F. W. (1968), "Pattern classifier design by linear programming," IEEE Trans. Comput., vol. 17, no. 4, pp. 367–372.
Smolensky, P., M. C. Mozer, and D. E. Rumelhart, editors (1996), Mathematical Perspectives on Neural Networks, Lawrence Erlbaum Publishers.
Theodoridis, S. (2015), Machine Learning: A Bayesian and Optimization Perspective, Academic Press.
Widrow, B. and M. A. Lehr (1990), "30 years of adaptive neural networks: Perceptron, Madaline, and backpropagation," Proc. IEEE, vol. 78, no. 9, pp. 1415–1442.

61 Support Vector Machines

When the training data {γ(n), h_n} is linearly separable, there will exist many separating hyperplanes that can discriminate the data into two classes. Some of the techniques we described in the previous chapters, such as logistic regression and perceptron, are able to find such separating hyperplanes. However, in general, there are many others. For example, if we refer to the earlier Fig. 60.1, we observe that the slopes of the separating lines in the figure can be adjusted, with the lines tilted further in one direction or the other, and we would still obtain correct classification for the same training data. For each valid choice of a separating hyperplane, w^⋆, there will exist some feature vector h in the training set that is closest to the hyperplane. We indicated in the previous chapter that the distance of this closest point to the hyperplane is called the margin and was denoted by m(w^⋆). We illustrate this situation again in Fig. 61.1. In this chapter, we describe the support vector machine (SVM) technique, whose purpose is to find the hyperplane w^⋆ with the largest possible margin, so that the training data will be the farthest away from it compared to other separating hyperplanes. Doing so adds a degree of robustness, as well as a desirable safety margin, to the operation of the classifier. We will consider two formulations of SVM: one is referred to as hard-margin SVM and the other as soft-margin SVM. Both techniques are again examples of deterministic methods, which operate directly on data realizations {γ(n), h_n} without assuming explicitly any form for the underlying conditional or joint pdfs of the random variables (γ, h), as was the case, for example, with logistic regression and linear discriminant analysis (LDA).

61.1 SVM EMPIRICAL RISK

The SVM formulation can be motivated by the following geometric arguments. Let (w^⋆, θ^⋆) denote the parameters (weight vector and scalar offset) of some generic separating hyperplane for a collection of N linearly separable training points {γ(n), h_n}, where γ(n) ∈ {±1} is the label associated with feature vector h_n ∈ IR^M. Some feature vectors in this set will be closer to the hyperplane (w^⋆, θ^⋆) than other feature vectors. Let (γ(n^⋆), h_{n^⋆}), with index n^⋆, denote one of the data points in the set that is closest to (w^⋆, θ^⋆). This situation is illustrated in


Figure 61.1 The figure shows one separating hyperplane and the two closest points from the training data to it; the points are highlighted inside a circle. The distance from these points to the hyperplane is called the margin. Other separating hyperplanes will have their own margins.

Fig. 61.1; the figure further illustrates the possibility that there can also exist points in the other class at the same closest distance from the hyperplane. Since all points are correctly classified by (w^⋆, θ^⋆), then using expression (60.10) we conclude that the margin is given by:

m(w^⋆) = γ(n^⋆) (h_{n^⋆}^T w^⋆ − θ^⋆) × (1/‖w^⋆‖)        (61.1)

We are free to scale (w^⋆, θ^⋆) without altering the hyperplane h^T w^⋆ − θ^⋆ = 0. Thus, assume the parameters (w^⋆, θ^⋆) are scaled by the same value to attain the normalization:

γ(n^⋆)(h_{n^⋆}^T w^⋆ − θ^⋆) = 1        (61.2)

In this case, the margin associated with the scaled (w^⋆, θ^⋆) becomes

m(w^⋆) = 1/‖w^⋆‖        (61.3)

which is inversely proportional to ‖w^⋆‖. It follows that maximizing m(w^⋆) is equivalent to minimizing (1/2)‖w^⋆‖^2 (the scaling by 1/2 is added for convenience).
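As a small numerical aside (a sketch with our own function and variable names), the margin of any candidate hyperplane (w, θ) over a dataset can be evaluated directly from (61.1)–(61.3) as the smallest normalized value of γ(n)(h_n^T w − θ):

```python
import numpy as np

def margin(w, theta, H, gamma):
    """Margin m(w) of the hyperplane (w, theta), cf. (61.1): the smallest value of
    gamma(n) * (h_n^T w - theta) / ||w|| over all rows h_n of H. A nonpositive
    value indicates that at least one training point is misclassified."""
    return float(np.min(gamma * (H @ w - theta)) / np.linalg.norm(w))
```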


Hard-margin version

Motivated by these considerations, we formulate the design problem:

(w^⋆, θ^⋆) = argmin_{w ∈ IR^M, θ ∈ IR}  (1/2)‖w‖^2        (61.4a)
subject to γ(n)(h_n^T w − θ) ≥ 1,  n = 0, 1, . . . , N − 1        (61.4b)

This formulation helps enforce three properties:

(a) (Correct classifications) First, it enforces that all training data points are correctly classified by the resulting classifier (w^⋆, θ^⋆). This is because the predictor γ̂(n) = h_n^T w^⋆ − θ^⋆ and the true label γ(n) will have the same sign by (61.4b).

(b) (Sufficient distance away from hyperplane) Second, all training points will be sufficiently away from the separating hyperplane (w^⋆, θ^⋆), at a distance that is at least equal to 1/‖w^⋆‖. This is because, using expression (60.10), the distance from any training feature vector h_n to the separating hyperplane will satisfy

distance = γ(n)(h_n^T w^⋆ − θ^⋆) × (1/‖w^⋆‖) ≥ 1/‖w^⋆‖        (61.5)

where the inequality follows from (61.4b).

(c) (Margin attained) Third, there should exist an index n^⋆ that satisfies (61.4b) with equality. This conclusion can be verified by contradiction. Assume the solution (w^⋆, θ^⋆) leads to a strict inequality for all training points, namely, γ(n)(h_n^T w^⋆ − θ^⋆) > 1 for all 0 ≤ n ≤ N − 1. Let n^⋆ denote the index with the smallest value for the product γ(n)(h_n^T w^⋆ − θ^⋆), i.e.,

n^⋆ = argmin_{0 ≤ n ≤ N−1}  { γ(n)(h_n^T w^⋆ − θ^⋆) }        (61.6)

and denote the corresponding value by

δ^⋆ ≜ γ(n^⋆)(h_{n^⋆}^T w^⋆ − θ^⋆)        (61.7)

By assumption, we have δ^⋆ > 1. We scale (w^⋆, θ^⋆) down by δ^⋆ and replace them by

w^⋆ ← w^⋆/δ^⋆,    θ^⋆ ← θ^⋆/δ^⋆        (61.8)

The scaled (w^⋆, θ^⋆) continues to be a separating hyperplane that satisfies the constraint (61.4b) for all n. However, the scaled w^⋆ has a smaller norm than the original w^⋆ since δ^⋆ > 1, which contradicts (61.4a). We conclude that there must exist an index n^⋆ that satisfies γ(n^⋆)(h_{n^⋆}^T w^⋆ − θ^⋆) = 1. In view of expression (61.1), the feature vector h_{n^⋆} attains the margin m(w^⋆) = 1/‖w^⋆‖.
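Problem (61.4a)–(61.4b) is a small convex quadratic program and can be handed directly to a generic solver. The sketch below is our own minimal illustration (it assumes the cvxpy package is available and is not the text's prescribed method); it solves the primal hard-margin problem for linearly separable data.

```python
import numpy as np
import cvxpy as cp

def hard_margin_svm(H, gamma):
    """Solve (61.4a)-(61.4b): minimize 0.5*||w||^2 subject to
    gamma(n)*(h_n^T w - theta) >= 1 for every row h_n of H.
    gamma holds the labels in {+1, -1}."""
    N, M = H.shape
    w = cp.Variable(M)
    theta = cp.Variable()
    constraints = [cp.multiply(gamma, H @ w - theta) >= 1]
    cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints).solve()
    return w.value, theta.value

# toy usage on two separable clusters (illustrative values only)
H = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
gamma = np.array([1.0, 1.0, -1.0, -1.0])
w_star, theta_star = hard_margin_svm(H, gamma)
attained_margin = 1.0 / np.linalg.norm(w_star)   # cf. (61.3)
```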


Once a separating hyperplane (w^⋆, θ^⋆) is determined by solving problem (61.4a)–(61.4b), we may encounter three situations depending on how a training point (γ(n), h_n) is positioned relative to the hyperplane:

γ(n)(h_n^T w^⋆ − θ^⋆) > 1  →  point (γ(n), h_n) exceeds the margin
γ(n)(h_n^T w^⋆ − θ^⋆) = 1  →  point (γ(n), h_n) meets the margin        (61.9)
γ(n)(h_n^T w^⋆ − θ^⋆) < 1  →  point (γ(n), h_n) violates the margin

In the first case, the distance from h_n to the separating hyperplane will be larger than 1/‖w^⋆‖ and, therefore, the point (γ(n), h_n) will be farther away from the separating hyperplane than the margin. In the second case, we say that the training point (γ(n), h_n) meets the margin since the distance from h_n to the separating hyperplane (w^⋆, θ^⋆) will be 1/‖w^⋆‖, which is the value of the margin. In the third case, the point h_n will be closer to the hyperplane than the margin. Obviously, as was just proven under items (a)–(c), when problem (61.4a)–(61.4b) admits a solution (w^⋆, θ^⋆), all points {γ(n), h_n} will either meet the margin or exceed it, and the violation in the third case will not occur; this scenario will only arise when we study the soft-margin SVM further ahead. The solution (w^⋆, θ^⋆) to (61.4a)–(61.4b) is called the hard-margin SVM solution because we are requiring the training data to be linearly separable and to lie at a distance of at least 1/‖w^⋆‖ from the separating hyperplane (i.e., to meet or exceed the margin). We will refer to all points (γ(n), h_n) that meet or violate the margin as support vectors:

(γ(n), h_n) is a support vector  ⟺  γ(n)(h_n^T w^⋆ − θ^⋆) ≤ 1        (61.10)

The presence of these vectors is the reason for the name “support vector machine.” We will explain in a later section, using duality arguments, that the solution to the SVM problem is exclusively defined by these support vectors – see expression (61.45). In the hard-margin SVM formulation under discussion, support vectors will only consist of points (γ(n), hn ) that meet the margin with equality sign in (61.10). However, as we will see in the sequel, support vectors (γ(n), hn ) will exist under soft-margin SVM for which strict inequality holds in (61.10).

Soft-margin version

We formulate next a more relaxed version of problem (61.4a)–(61.4b), leading to soft-margin SVM, in order to accommodate situations where the data points are not perfectly linearly separable or when outliers may be present. Outliers can perturb the choice of the separating hyperplane in a significant manner and push it closer to one class or the other if one insists on a hard-margin design. This scenario is illustrated in Fig. 61.2. An outlier feature vector is highlighted by the surrounding circle; its presence results in a separating hyperplane (the solid line) with a smaller margin compared to the original hyperplane (in dashed line) obtained in the absence of the outlier.


Figure 61.2 An outlier is indicated by the surrounding circle; its presence results in a separating hyperplane with a smaller margin compared to the original (dashed) hyperplane in the absence of the outlier. Soft-margin SVM reduces the influence of outliers on the selection of the separating hyperplane and leads to solutions that approach the dashed line.

Soft-margin SVM helps reduce the influence of outliers on the selection of the separating hyperplane. It continues to seek a hyperplane with the largest possible margin but will allow a small number of the data points to violate the margin (i.e., to be either closer to the separating hyperplane than the margin or even misclassified altogether). This relaxation is achieved by replacing the original formulation (61.4a)–(61.4b) by the following optimization problem:

(w^⋆, θ^⋆, {s^⋆(n)}) = argmin_{w, θ, s(n)}  { (1/2)‖w‖^2 + η (1/N) Σ_{n=0}^{N−1} s(n) }        (61.11a)
subject to γ(n)(h_n^T w − θ) ≥ 1 − s(n)        (61.11b)
s(n) ≥ 0,  n = 0, 1, 2, . . . , N − 1        (61.11c)

where w ∈ IR^M, θ ∈ IR, η > 0 is a scaling parameter, and the {s(n) ≥ 0} are newly introduced nonnegative variables, called the slack variables. There is one slack variable for each data point in the training set. From expression (61.11b), we see that each slack variable s(n) introduces some tolerance and allows the quantity γ(n)(h_n^T w − θ) to be smaller than 1. That is, it allows the point (γ(n), h_n) to violate the margin since 1 − s(n) can be smaller than 1, in which case h_n ends up being closer to the hyperplane than desired, or perhaps even on the wrong side of it. Two types of violations are possible:


(a) (Margin violation) Values of s(n) in the range 0 ≤ s(n) ≤ 1 will correspond to points (γ(n), h_n) that fall on the correct side of the separating hyperplane but are closer to the hyperplane than the margin.

(b) (Misclassification) Values s(n) > 1 will correspond to points (γ(n), h_n) that fall on the wrong side of the separating hyperplane and are therefore misclassified.

Compared with (61.4a), the cost function in (61.11a) incorporates an additional term that penalizes the contribution from the slack variables; the size of this penalty is controlled by the parameter η. By minimizing the augmented cost, we are in effect attempting to reduce the contribution from the slack deviations. Note that large values for η favor solutions (w^⋆, θ^⋆) with a small slack contribution and, hence, with a smaller number of misclassification errors. In particular, as η → ∞, problem (61.11a)–(61.11c) reduces to the hard-margin SVM formulation (61.4a)–(61.4b) since this situation will force all s(n) → 0. On the other hand, smaller values for η accommodate some violations of the margin, including more misclassifications.

Empirical risk

By examining the structure of problem (61.11a)–(61.11c) we can readily deduce the values of the slack variables s(n) for all data points:

(a) (Zero slack variables) To begin with, whenever some data point (γ(n_o), h_{n_o}) satisfies γ(n_o)(h_{n_o}^T w − θ) ≥ 1, then the corresponding slack variable, s(n_o), should be 0. That is, data points that are on the correct side of the hyperplane and are farther away from it than its margin will necessarily have zero slack variables. This is because the objective is to reduce the cost (61.11a) and, therefore, we can set s(n_o) to zero to reduce the sum of the slack variables without violating (61.11b)–(61.11c).

(b) (Positive slack variables) On the other hand, whenever γ(n_1)(h_{n_1}^T w − θ) < 1 for some data point (γ(n_1), h_{n_1}), then the smallest value that can be chosen for the corresponding slack variable is

s(n_1) = 1 − γ(n_1)(h_{n_1}^T w − θ) > 0        (61.12)

in order to satisfy the nonnegativity constraint (61.11c). We select the smallest value for s(n_1) because the cost (61.11a) penalizes the sum of the slack variables. Based on these observations, we are motivated to consider the following alternative formulation of the optimization problem (61.11a)–(61.11c):

(w^⋆, θ^⋆) = argmin_{w ∈ IR^M, θ ∈ IR}  { P(w) ≜ ρ‖w‖^2 + (1/N) Σ_{n=0}^{N−1} max{0, 1 − γ(n)(h_n^T w − θ)} }        (61.13)

where ρ = 1/(2η). Note that large values for η correspond to small values for ρ.


Accordingly, small ρ will favor solutions with a small number of margin violations or misclassifications (i.e., solutions with mostly small slack variables). This means that small values for ρ are recommended for data that are more or less separable, with ρ → 0 corresponding to the hard-margin solution. On the other hand, larger values for ρ tolerate a higher level of margin violations and/or misclassifications. This case is better suited for data that are more challenging to separate. If we invoke ergodicity on the data {γ(n), h_n}, we find that P(w) motivates the following stochastic risk function

(1/N) Σ_{n=0}^{N−1} max{0, 1 − γ(n)(h_n^T w − θ)}  →  E max{0, 1 − γ(h^T w − θ)},  as N → ∞        (61.14)

so that the soft-margin SVM construction can also be interpreted as solving the following Bayesian inference problem:

(w^o, θ^o) = argmin_{w ∈ IR^M, θ ∈ IR}  { ρ‖w‖^2 + E max{0, 1 − γ(h^T w − θ)} }        (61.15)

where the expectation is over the joint distribution of (γ, h).
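For concreteness, the regularized empirical risk P(w) in (61.13) can be evaluated directly; the short NumPy sketch below (with our own function and variable names) computes ρ‖w‖^2 plus the average hinge loss over the training set.

```python
import numpy as np

def svm_empirical_risk(w, theta, H, gamma, rho):
    """Evaluate P(w) in (61.13): rho*||w||^2 plus the average of the hinge terms
    max{0, 1 - gamma(n)*(h_n^T w - theta)} over the rows h_n of H."""
    hinge = np.maximum(0.0, 1.0 - gamma * (H @ w - theta))
    return float(rho * np.dot(w, w) + hinge.mean())
```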

Online recursion

Problem (61.13) can be solved by a variety of stochastic optimization methods, already discussed in previous chapters, such as using stochastic subgradient algorithms and variations thereof. It is sufficient to illustrate the construction by considering one solution method. We will therefore focus on stochastic subgradient implementations, with or without regularization, that rely on instantaneous subgradient approximations. The sampling of the data in the stochastic implementation can also be done with or without replacement. Using the result of Example 16.8, we list the SVM algorithm for solving (61.13) in (61.22), where the notation I[x] refers to the indicator function that is equal to 1 when condition x is true and 0 otherwise. Comparing (61.22) with the perceptron listing (60.19), we find that the condition I[γ(n)γ̂(n) ≤ 0] is now replaced by I[γ(n)γ̂(n) ≤ 1]. We can simplify the notation in listing (61.22) by extending the feature and weight vectors as follows:

h ← [1; h],    w ← [−θ; w]        (61.16)

so that the recursions can be rewritten more compactly in the following manner, where the offset parameter is now implicit:

γ̂(n) = h_n^T w_{n−1}
w_n = A w_{n−1} + µ γ(n) h_n I[γ(n) γ̂(n) ≤ 1],  n ≥ 0        (61.17)

and the diagonal matrix A depends on the regularization parameter:

A ≜ diag{1, (1 − 2µρ) I_M}        (61.18)

When a mini-batch of size B is used, the SVM recursion is replaced by

select B data samples {γ(b), h_b} at random
γ̂(b) = h_b^T w_{n−1},  b = 0, 1, . . . , B − 1        (61.19)
w_n = A w_{n−1} + µ × ( (1/B) Σ_{b=0}^{B−1} γ(b) h_b I[γ(b) γ̂(b) ≤ 1] ),  n ≥ 0

On the other hand, in the absence of regularization (ρ = 0), we obtain:

w_n = w_{n−1} + µ γ(n) h_n I[γ(n) γ̂(n) ≤ 1],  n ≥ 0        (61.20)

which implies that

w_n = w_{n−1} + µ γ(n) h_n,  if γ(n) γ̂(n) ≤ 1        (61.21)

Soft-margin SVM algorithm for minimizing (61.13).

given dataset {γ(m), h_m}, m = 0, 1, . . . , N − 1, or streaming data (γ(n), h_n);
start from an arbitrary initial condition, w_{−1}.
repeat until convergence over n ≥ 0:
    select at random or receive a sample (γ(n), h_n) at iteration n;
    γ̂(n) = h_n^T w_{n−1} − θ(n−1)
    θ(n) = θ(n−1) − µ γ(n) I[γ(n) γ̂(n) ≤ 1]
    w_n = (1 − 2µρ) w_{n−1} + µ γ(n) h_n I[γ(n) γ̂(n) ≤ 1]
end
return w^⋆ ← w_n, θ^⋆ ← θ(n); classify a feature h by using the sign of γ̂ = h^T w^⋆ − θ^⋆.        (61.22)
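Listing (61.22) translates almost line for line into code. The sketch below is our own minimal NumPy rendition of the stochastic subgradient recursion; the mini-batch variant (61.19) is obtained by averaging the update term over B sampled points instead of using a single sample.

```python
import numpy as np

def soft_svm_train(H, gamma, mu=0.1, rho=0.01, passes=5, rng=None):
    """Stochastic subgradient training of soft-margin SVM, following (61.22).
    H is N x M (one feature vector per row); gamma holds labels in {+1, -1}."""
    rng = np.random.default_rng() if rng is None else rng
    N, M = H.shape
    w = np.zeros(M)       # w_{-1}
    theta = 0.0           # theta(-1)
    for _ in range(passes):
        for n in rng.permutation(N):            # random reshuffling in each pass
            gamma_hat = H[n] @ w - theta
            ind = 1.0 if gamma[n] * gamma_hat <= 1 else 0.0   # I[gamma*gamma_hat <= 1]
            theta = theta - mu * gamma[n] * ind
            w = (1 - 2 * mu * rho) * w + mu * gamma[n] * ind * H[n]
    return w, theta

def soft_svm_classify(h, w, theta):
    """Assign class +1 or -1 using the sign of h^T w - theta."""
    return 1 if h @ w - theta >= 0 else -1
```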

Example 61.1 (Binary classification using soft-SVM) We show in Fig. 61.3 a collection of 150 feature samples h_n ∈ IR^2 whose classes ±1 are known beforehand: 120 samples are selected for training and 30 samples are selected for testing. The data arises from the dimensionally reduced iris dataset from Example 57.4; we denoted the two-dimensional reduced feature vectors by the notation h′_n in that example. We denote them here by h_n. We employ the two classes shown in the bottom plot of Fig. 57.5 and denote them by γ(n) ∈ {±1}. We will use the data to compare the performance of the perceptron and SVM algorithms. We first use the data to train the perceptron classifier (60.26), under extensions (60.20), by running five passes over the training data:

γ̂(n) = h_n^T w_{n−1}        (61.23a)
w_n = w_{n−1} + γ(n) h_n,  if γ(n) γ̂(n) ≤ 0        (61.23b)

[Figure 61.3 shows four scatter plots: training and test samples for the perceptron (top row) and for soft SVM (bottom row), with classes +1 and −1 and the resulting separating lines.]

Figure 61.3 The first row shows the training and test data for the perceptron algorithm without regularization and µ = 1, while the second row shows the same data for the soft-margin SVM algorithm under ℓ2-regularization with ρ = 0.01 and µ = 0.1. The lines show the resulting classifiers.

During each pass, the data {γ(n), h_n} is randomly reshuffled and the algorithm is rerun over the data starting from the weight iterate obtained at the end of the previous pass. The line in the figure shows the separating curve obtained in this manner with parameters (where we now undo the extension (60.20)):

w^⋆ = [ 3.4184, −1.5104 ]^T,   θ^⋆ = 1.0   (perceptron)        (61.24)

It is seen that the separation curve is able to classify all test vectors and leads to 0% empirical error rate. We also use the same data to run five passes of the soft-SVM classifier (61.22) by using ρ = 0.01 and µ = 0.1. The data is randomly reshuffled at the start of each pass. The line in the figure shows the separating curve obtained in this manner with parameters

w^⋆ = [ 1.2253, −0.3855 ]^T,   θ^⋆ = 1.0   (soft SVM)        (61.25)

It is also seen that the separation curve is able to classify all test vectors and leads to 0% empirical error rate.


Example 61.2 (Application to breast cancer dataset) We apply the soft-SVM classifier (61.22) to the breast cancer dataset encountered earlier in Example 53.3. The data consists of N = 569 samples, with each sample corresponding to a benign or malignant cancer classification. We use γ(n) = −1 for benign samples and γ(n) = +1 for malignant samples. Each feature vector in the data contains M = 30 attributes corresponding to measurements extracted from a digitized image of a fine needle aspirate of a breast mass. The attributes describe characteristics of the cell nuclei present in the image; examples of these attributes were listed earlier in Table 53.1. All feature vectors are centered around the sample mean and their variances scaled to 1 according to the preprocessing step described earlier under principal component analysis (PCA) in (57.6). We select 456 samples (80%) randomly from these processed vectors for training and keep the remaining 113 samples (20%) for testing. We use ρ = 0.01 and µ = 0.01. We run the algorithm for 20 passes over the training data using random reshuffling. The resulting empirical error rate on the test data is 12.39%, resulting from 14 misclassified samples out of 113 test samples. For comparison purposes, we use the PCA procedure (57.34) to reduce the dimension of the feature space down to M = 2 and run the same soft-SVM procedure again over this reduced data. Figure 61.4 shows the 456 training samples and 113 test samples, along with the resulting classifier whose parameters are determined to be

w^⋆ = [ −1.1022, 0.6507 ]^T,   θ^⋆ = −0.07   (soft SVM)        (61.26)

The resulting empirical error on the test data is found to be 5.31%, which amounts to 6 misclassified decisions out of 113 test samples.

[Figure 61.4 shows two scatter plots of the PCA-reduced breast cancer data: training points (left) and test points (right), with classes −1 (benign) and +1 (malignant) and the separating line.]

Figure 61.4 The plots show the training and test samples for two-dimensional reduced feature vectors from a breast cancer dataset, along with the separating line that arises from training a soft-SVM classifier.

Example 61.3 (Support vectors and misclassification errors) The number of support vectors in an SVM implementation conveys useful information about the learning ability of the SVM solution. Specifically, consider repeated experiments involving training data {γ(n), h_n} of size N each. Then, it holds that the average number of support vectors over these experiments provides an indication of the expected empirical error rate over the training data for the SVM classifier, denoted generically by c^⋆, namely,

E R_emp(c^⋆) ≤ (1/N) E [# support vectors]        (61.27)

where the expectation is over experiments (or over the distribution of the data (γ, h)). The empirical error rate is denoted in boldface because it is treated as a random variable whose value varies from one experiment to another; recall from definition (52.11) that R_emp(c^⋆) counts the fraction of errors over the training data. Observe that the bound on the right-hand side is independent of the dimension M of the feature space h ∈ IR^M, which is a useful property of SVM solutions. Observe also that an SVM solution is expected to yield very few support vectors; otherwise, the SVM classifier would not be effective.

Proof of (61.27): Note that, for any set of training data of size N, the number of support vectors satisfies:

[# support vectors] = [# training data that meet or violate the margin] ≥ [# misclassified training data]        (61.28)

Therefore, the empirical error rate over the training data in each experiment satisfies:

R_emp(c^⋆) ≜ (1/N) [# misclassified data] ≤ (1/N) [# support vectors]        (61.29)

Taking expectations of both sides, we arrive at (61.27).

Example 61.4 (SVM for regression problems) We refer to the empirical risk (61.13) used by SVM for binary classification, namely,

γ̂(n) = h_n^T w − θ        (61.30a)
(w^⋆, θ^⋆) = argmin_{w ∈ IR^M, θ ∈ IR}  { ρ‖w‖^2 + (1/N) Σ_{n=0}^{N−1} max{0, 1 − γ(n) γ̂(n)} }        (61.30b)

This formulation relies on the nondifferentiable hinge function g(x) = max{0, 1 − x}, which ignores all values x > 1. We can motivate a similar construction for the solution of regression (as opposed to classification) problems, where the purpose is to estimate the target variables γ(n) (rather than their signs). For this purpose, we consider the following regularized formulation:

γ̂(n) = h_n^T w − θ        (61.31a)
(w^⋆, θ^⋆) = argmin_{w ∈ IR^M, θ ∈ IR}  { ρ‖w‖^2 + (1/N) Σ_{n=0}^{N−1} max{0, |γ(n) − γ̂(n)| − ε} }        (61.31b)

for some small ε > 0. This description continues to rely on a nondifferentiable function, albeit one of the form g(x) = max{0, |x| − ε}, so that only values x ∈ (−ε, ε) are ignored. This is illustrated schematically in the diagram of Fig. 61.5, where the vertical axis is denoted by y. The function has two points of nondifferentiability at x = ±ε. The slope of the function is +1 for x > ε, −1 for x < −ε, and 0 for x ∈ (−ε, ε). At x = ε we select the subgradient as +1 and at x = −ε as −1. We therefore construct a subgradient for g(x) as follows:

s_g(x) = I[x ≥ ε] − I[x ≤ −ε]        (61.32)

Figure 61.5 Plot of the function y = max{0, |x| − ε}.

Applying this construction to (61.31b), we can write down the following stochastic subgradient implementation:

γ̂(n) = h_n^T w_{n−1} − θ(n−1)        (61.33a)
α(n) = I[γ̂(n) ≤ γ(n) − ε] − I[γ̂(n) ≥ γ(n) + ε]        (61.33b)
θ(n) = θ(n−1) − µ α(n)        (61.33c)
w_n = (1 − 2µρ) w_{n−1} + µ α(n) h_n        (61.33d)
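These recursions can be coded in a few lines; the following is our own NumPy sketch of the ε-insensitive SVM regression updates (61.33a)–(61.33d), with ε, µ, and ρ treated as design parameters.

```python
import numpy as np

def svm_regression_train(H, gamma, mu=0.01, rho=0.01, eps=0.1, passes=10, rng=None):
    """Stochastic subgradient sketch of the epsilon-insensitive SVM regression
    recursions (61.33a)-(61.33d). H is N x M; gamma holds real-valued targets."""
    rng = np.random.default_rng() if rng is None else rng
    N, M = H.shape
    w = np.zeros(M)
    theta = 0.0
    for _ in range(passes):
        for n in rng.permutation(N):
            gamma_hat = H[n] @ w - theta                         # (61.33a)
            alpha = float(gamma_hat <= gamma[n] - eps) \
                    - float(gamma_hat >= gamma[n] + eps)         # (61.33b)
            theta = theta - mu * alpha                           # (61.33c)
            w = (1 - 2 * mu * rho) * w + mu * alpha * H[n]       # (61.33d)
    return w, theta
```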

61.2 CONVEX QUADRATIC PROGRAM

There are several ways by which the hard- and soft-margin SVM formulations can be solved. In listing (61.22) we pursued an online solution based on a stochastic subgradient implementation, which is one of the simplest and most commonly used methods for solving SVM problems. In this section, we describe another solution method that is based on transforming the SVM problem into a convex quadratic program (i.e., into an optimization problem with a quadratic cost function subject to a convex constraint – see (61.42a)–(61.42b)). Such quadratic programs can be solved efficiently by means of convex optimization packages. The main motivation for the derivation that follows is to highlight the role played by support vectors; the derivation will also be useful later when we develop a kernel-based SVM version for classifying data that are not necessarily linearly separable. The details of the convex program formulation are as follows.


Optimization by duality

We focus initially on the hard-margin SVM problem (61.4a)–(61.4b). We call upon the Karush–Kuhn–Tucker (KKT) conditions (9.28a)–(9.28e) to transform the constrained problem into an unconstrained version. Specifically, we start by introducing the Lagrangian function:

L(w, θ, λ(n)) ≜ (1/2)‖w‖^2 − Σ_{n=0}^{N−1} λ(n) [ γ(n)(h_n^T w − θ) − 1 ]        (61.34)

where the {λ(n) ≥ 0} denote Lagrange multipliers; they are nonnegative because of the direction of the inequalities in the constraints (61.4b). To determine the solution (w^⋆, θ^⋆), we need to perform two tasks. First, we minimize L(w, θ, λ(n)) over (w, θ) and determine the minimum value, which we denote by the dual function D(λ(n)); it is a function of the multipliers {λ(n)} alone. Second, we maximize the dual function over the {λ(n)}. From the solutions to these two steps, and in view of the KKT conditions, we will be able to recover the desired (w^⋆, θ^⋆), as we proceed to explain. Computing the gradients of L(w, θ, λ(n)) relative to w and θ we get

∇_{w^T} L(w, θ, λ(n)) = w − Σ_{n=0}^{N−1} λ(n) γ(n) h_n        (61.35a)
∂L(w, θ, λ(n))/∂θ = Σ_{n=0}^{N−1} λ(n) γ(n)        (61.35b)

Setting these gradients to zero at (w^⋆, θ^⋆), we find that the variables {w^⋆, λ(n)} must satisfy:

w^⋆ = Σ_{n=0}^{N−1} λ(n) γ(n) h_n,    Σ_{n=0}^{N−1} λ(n) γ(n) = 0        (61.36)

Using these conditions, we substitute into the Lagrangian function and determine the dual function as follows:

D(λ(n)) = L(w^⋆, θ^⋆, λ(n))
        = (1/2)‖w^⋆‖^2 − Σ_{n=0}^{N−1} λ(n) [ γ(n)(h_n^T w^⋆ − θ^⋆) − 1 ]
        = (1/2)‖w^⋆‖^2 + Σ_{n=0}^{N−1} λ(n) − ( Σ_{n=0}^{N−1} λ(n) γ(n) h_n^T ) w^⋆ + ( Σ_{n=0}^{N−1} λ(n) γ(n) ) θ^⋆
        = (1/2)‖w^⋆‖^2 + Σ_{n=0}^{N−1} λ(n) − ‖w^⋆‖^2        (using (61.36))
        = Σ_{n=0}^{N−1} λ(n) − (1/2) Σ_{n=0}^{N−1} Σ_{m=0}^{N−1} γ(n) γ(m) λ(n) λ(m) h_n^T h_m        (using (61.36))        (61.37)


The resulting dual function is dependent on the {λ(n)} alone and is given by:

D(λ(n)) ≜ Σ_{n=0}^{N−1} λ(n) − (1/2) Σ_{n=0}^{N−1} Σ_{m=0}^{N−1} γ(n) γ(m) λ(n) λ(m) h_n^T h_m        (61.38)

which we now need to maximize subject to the constraints

λ(n) ≥ 0,    Σ_{n=0}^{N−1} λ(n) γ(n) = 0        (61.39)

We can express the dual function in vector form by introducing the vector and matrix quantities:

λ ≜ col{λ(0), λ(1), . . . , λ(N−1)},    γ ≜ col{γ(0), γ(1), . . . , γ(N−1)},    [A]_{n,m} = γ(n) γ(m) h_n^T h_m        (61.40)

The vector λ is N × 1 and the matrix A (also called the Gramian matrix) is N × N. Then, we can rewrite (61.38) as:

D(λ) = 1^T λ − (1/2) λ^T A λ        (61.41)

We wish to maximize D(λ), which can be achieved by minimizing −D(λ). Hence, the problem of determining the {λ(n)} is formulated as follows:

λ^⋆ = argmin_{λ ∈ IR^N}  { (1/2) λ^T A λ − 1^T λ }        (61.42a)
subject to λ ⪰ 0,  λ^T γ = 0        (61.42b)

where the notation a ⪰ b means elementwise comparison. The above problem is a convex quadratic programming problem: it involves a cost (61.42a) that is quadratic in λ, with coefficients {(1/2)A, −1}. It also involves the linear constraint λ^T γ = 0 and the condition λ ⪰ 0. A quadratic program solver can be used to return a vector λ^⋆.

Support vectors

The solution λ^⋆ will exhibit a useful property, namely, most of its entries will be zero. This is because of the KKT complementary condition (9.28d), which needs to hold. That condition translates into the requirement:

λ^⋆(n) [ γ(n)(h_n^T w^⋆ − θ^⋆) − 1 ] = 0,  n = 0, 1, 2, . . . , N − 1        (61.43)

Now, consider any data point (γ(n), h_n) that exceeds the margin, i.e., for which γ(n)(h_n^T w^⋆ − θ^⋆) > 1; these points are correctly classified. Then, from (61.43), it must hold for these points that λ^⋆(n) = 0. On the other hand, if λ^⋆(n) ≠ 0, then it must hold that γ(n)(h_n^T w^⋆ − θ^⋆) = 1, so that nonzero values for λ^⋆(n) will only occur for data points that meet the margin. There are generally only a few of these points and they are examples of support vectors. More generally, support vectors were defined in (61.10) as any points (γ(n), h_n) that meet or violate the margin. In the hard-margin SVM formulation under discussion, support vectors will only consist of points (γ(n), h_n) that meet the margin with equality sign in (61.10). Observe further from condition (61.39) that there should exist at least one support vector from each class {±1}. This is because if λ^⋆(n_1) is some nonzero entry of the vector λ^⋆ corresponding to label γ(n_1), then there should exist another nonzero entry λ^⋆(n_2) in the vector λ^⋆, albeit with label γ(n_2) = −γ(n_1). When this happens, the two terms λ^⋆(n_1)γ(n_1) and λ^⋆(n_2)γ(n_2) can cancel each other and it becomes possible for the sum in (61.39) to evaluate to zero, as required. Using the solution λ^⋆ we can determine w^⋆ by using relation (61.36):

w^⋆ = Σ_{n=0}^{N−1} λ^⋆(n) γ(n) h_n        (61.44)

But since most of the {λ^⋆(n)} will be zero, this expression actually provides a sparse representation for w^⋆ in terms of the support vectors; it shows that w^⋆ is a linear combination of the support vectors. We can therefore write

w^⋆ = Σ_{s∈S} λ^⋆(s) γ(s) h_s        (61.45)

where the sum is limited to the set S of support vectors. We still need to determine θ^⋆. For that purpose, we pick any point (γ(n), h_n) that meets the margin (i.e., any support vector in the hard-margin SVM implementation) and use it to solve for θ^⋆:

γ(n)(h_n^T w^⋆ − θ^⋆) = 1  ⟹  θ^⋆ = h_n^T w^⋆ − 1/γ(n)        (61.46)

We can enhance the accuracy of this construction for θ^⋆ by averaging estimates over all support vectors that meet the margin (or several of them), say, as:

θ^⋆ = (1/|S|) Σ_{s∈S} ( h_s^T w^⋆ − 1/γ(s) )        (61.47)

where |S| denotes the cardinality of S. Obviously, under hard-margin SVM, all points in S meet the margin. Combining (61.45) and (61.47), we estimate the label of a test vector h by using the following expression (which is written in terms of the support vectors):

γ̂ = h^T w^⋆ − θ^⋆
  = Σ_{s∈S} λ^⋆(s) γ(s) h^T h_s − (1/|S|) Σ_{s∈S} ( Σ_{s′∈S} λ^⋆(s′) γ(s′) h_s^T h_{s′} − 1/γ(s) )        (61.48)

and making the classification decision:

if γ̂ ≥ 0, assign h to class +1
if γ̂ < 0, assign h to class −1        (61.49)

We summarize the solution method of this section in the following listing.

Convex program solution of hard-margin SVM (61.4a)–(61.4b).

(training)
given N data points {γ(n), h_n}, n = 0, 1, . . . , N − 1;
form the vector γ and matrix A defined by (61.40);
solve (61.42a)–(61.42b) and determine λ^⋆;
S ≜ set of support vectors defined by (61.10); these are the points (γ(s), h_s) with λ^⋆(s) ≠ 0;
w^⋆ = Σ_{s∈S} λ^⋆(s) γ(s) h_s
θ^⋆ = (1/|S|) Σ_{s∈S} ( h_s^T w^⋆ − 1/γ(s) )
end
return (w^⋆, θ^⋆).        (61.50)

(classification) classify feature vector h using (61.49), where γ̂ = h^T w^⋆ − θ^⋆.
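As an illustration of listing (61.50), the following sketch (our own, assuming the cvxpy package is available) solves the dual problem (61.42a)–(61.42b) and recovers (w^⋆, θ^⋆) from the support vectors; the threshold used to declare an entry of λ^⋆ nonzero is an arbitrary numerical tolerance.

```python
import numpy as np
import cvxpy as cp

def hard_margin_svm_dual(H, gamma, tol=1e-6):
    """Solve the dual (61.42a)-(61.42b) and recover (w, theta) as in (61.50).
    The quadratic term lambda^T A lambda equals || sum_n lambda(n) gamma(n) h_n ||^2,
    which is the form used in the objective below. H is N x M; gamma is in {+1,-1}."""
    N = H.shape[0]
    lam = cp.Variable(N)
    objective = cp.Minimize(0.5 * cp.sum_squares(H.T @ cp.multiply(gamma, lam))
                            - cp.sum(lam))
    constraints = [lam >= 0, cp.sum(cp.multiply(gamma, lam)) == 0]
    cp.Problem(objective, constraints).solve()
    lam_star = lam.value
    support = lam_star > tol                                  # support vectors, (61.10)
    w = H.T @ (lam_star * gamma * support)                    # (61.44)-(61.45)
    theta = np.mean(H[support] @ w - 1.0 / gamma[support])    # (61.47)
    return w, theta, support
```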

Soft-margin adjustment

Following similar arguments, we can verify that the soft-margin SVM problem (61.11a)–(61.11c) reduces to a convex quadratic programming problem of the following form, where the main modification is the upper bound on the entries of λ (see Prob. 61.9):

λ^⋆ = argmin_{λ ∈ IR^N}  { (1/2) λ^T A λ − 1^T λ }        (61.51a)
subject to 0 ⪯ λ ⪯ (η/N) 1,  λ^T γ = 0        (61.51b)

In this case, it turns out that the solution vector λ^⋆ will have nonzero entries at data points that meet the margin and also at data points that violate the margin. As explained earlier in (61.10), these points constitute the support vectors. The same listing (61.50) will continue to hold with one adjustment to the expression for θ^⋆. Let S_1 ⊂ S denote the subset of support vectors that meet the margin. Then, we estimate θ^⋆ by averaging over these vectors (or a subset of them), say, as:

θ^⋆ = (1/|S_1|) Σ_{s∈S_1} ( h_s^T w^⋆ − 1/γ(s) )        (61.52)


and, therefore, the expression for γ̂ becomes:

γ̂ = h^T w^⋆ − θ^⋆
  = Σ_{s∈S} λ^⋆(s) γ(s) h^T h_s − (1/|S_1|) Σ_{s∈S_1} ( Σ_{s′∈S} λ^⋆(s′) γ(s′) h_s^T h_{s′} − 1/γ(s) )        (61.53)
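In code, the only change relative to the hard-margin dual sketch given after listing (61.50) is the box constraint on the multipliers; a minimal modification (again our own illustration, assuming cvxpy, with η a design parameter) is:

```python
import numpy as np
import cvxpy as cp

def soft_margin_svm_dual(H, gamma, eta):
    """Soft-margin dual (61.51a)-(61.51b): same objective as the hard-margin dual,
    with the added upper bound eta/N on each multiplier lambda(n)."""
    N = H.shape[0]
    lam = cp.Variable(N)
    objective = cp.Minimize(0.5 * cp.sum_squares(H.T @ cp.multiply(gamma, lam))
                            - cp.sum(lam))
    constraints = [lam >= 0, lam <= eta / N, cp.sum(cp.multiply(gamma, lam)) == 0]
    cp.Problem(objective, constraints).solve()
    return lam.value
```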

61.3 CROSS VALIDATION

The material in this section is not specific to SVMs, but is applicable more broadly. We present it here because at this stage of our development we are in a good position to motivate the useful technique of cross validation for selecting hyperparameters for learning algorithms. We have encountered several such algorithms so far, such as the nearest-neighbor rule, the k-means algorithm, logistic regression, perceptron, SVMs, recursive least-squares, and various other stochastic optimization methods with and without regularization. We will encounter additional algorithms in subsequent chapters, such as AdaBoost, kernel methods, neural networks, and so forth. In most of these implementations, certain parameters, also called hyperparameters, need to be set by the designer, such as regularization parameters, forgetting factors, step sizes, number of clusters, etc. Two important questions arise:

(a) How do we pick a good learning algorithm for an application from among multiple possibilities? And how do we set the hyperparameters for the algorithm in a guided manner?

(b) How do we assess the performance of the algorithm, such as its empirical error rate, in order to estimate its generalization ability?

One useful technique to answer these questions is cross validation. While there are several variations of cross validation, we describe one construction that is common in practice. We denote the learning algorithm under study generically by the notation A_p, where the letter A refers to the algorithm and the letter p refers to some hyperparameter that influences its performance. For example, A could be the logistic regression algorithm and p could be the regularization parameter, ρ. We start with a total of N_TOTAL = N + T data points, {γ(n), h_n}, where γ(n) is the label corresponding to feature vector h_n ∈ IR^M. The set is split into two disjoint groups: a training group consisting of N data points and a test group consisting of T data points:

{γ(n), h_n},  n = 0, 1, 2, . . . , N − 1   (training data)        (61.54a)
{γ(t), h_t},  t = 0, 1, 2, . . . , T − 1   (test data)        (61.54b)
N + T = N_TOTAL        (61.54c)

Usually, the split is about 70–80% of N_TOTAL used for training and 20–30% of N_TOTAL used for testing. If N_TOTAL = 1000, then we will have N = 800 training data points and T = 200 test data points. The test data should be separated completely from the training data and only used for testing purposes later, after the classifier has been trained. Training is performed as follows. We start from the N training data points and split them K-fold, where K is some integer normally between 5 and 10, though the value K = 10 is common. Let us select K = 5 for illustration purposes. Then, the N training points are split into K segments, with N/K data points in each segment. For the example with N = 800 and K = 5, we end up with 5 segments with N_s = 160 samples per segment. We index these segments by s = 1, 2, 3, . . . , K – see Fig. 61.6. During each iteration of the cross validation procedure described below, one of the segments (also called a validation set) is left out and used for cross validation purposes, while the remaining K − 1 segments are used for training. This procedure, known as K-fold cross validation, operates as follows:

repeat for s = 1, 2, . . . , K:

(1) Exclude the data from the segment indexed by s, and use all data from the remaining K − 1 segments to train the learning algorithm. For example, when s = 1 and K = 5, we use the data from segments 2, 3, 4, and 5 for training. In the N = 800 example, this would amount to using a total of 4 × 160 = 640 data points for training. We can run multiple passes of the algorithm over the training data. Once training is completed, we test the performance of the resulting classifier using the data from the cross validation segment, s, that was left out to measure its empirical error rate. If we let the set N_s denote the indices of the data points within the cross validation segment, then this error is given by

R_emp(s) = (1/N_s) Σ_{n∈N_s} I[A_p(h_n) ≠ γ(n)]        (61.55)

This calculation counts the average number of erroneous classifications over the cross validation segment.

(2) We repeat the construction in step (1) for each of the segments: use one segment for cross validation and the remaining segments for training. In each run, we compute the resulting empirical error rate. By the time we have scanned over all K segments, we would have available K error values, R_emp(s), one for each segment s = 1, 2, . . . , K. We average these values to obtain an estimate for the error rate of the algorithm:

R_emp(A_p) = (1/K) Σ_{s=1}^{K} R_emp(s)        (61.56)

[Figure 61.6 depicts the data split: a training part with N = 0.8 N_TOTAL points and a test part with T = 0.2 N_TOTAL points; the training part is divided into K segments, and in each row of the diagram one segment acts as the cross validation segment while the remaining segments are used for training.]

Figure 61.6 The data is divided into two parts, N and T, with about 80% of the data points in the first set for training and 20% in the second set for testing. This first set is subsequently divided into K segments of width N_s each. During each iteration of the cross validation procedure, one of the segments is used for cross validation while the remaining segments are used for training.

(3) The important fact to recognize is that R_emp(A_p) estimates the performance of algorithm A for a particular parameter value p. We repeat steps (1)–(2) for different values of p, which would then allow us to arrive at a curve that shows how the error, R_emp(A_p), varies with p, and subsequently pick the value of p that leads to the smallest error value.

(4) In another scenario, we may be interested in repeating steps (1)–(2) for different algorithms, while keeping the parameters fixed, in order to select the algorithm that results in the smallest value for R_emp(A_p).

end
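The procedure above maps directly into a short routine. The sketch below is our own minimal NumPy illustration of K-fold cross validation for a generic train/classify pair (for instance, the soft-SVM functions sketched earlier in this chapter); the interface and names are assumptions made for the illustration.

```python
import numpy as np

def kfold_error_rate(H, gamma, train_fn, classify_fn, K=10, rng=None):
    """K-fold cross validation estimate (61.56) of the empirical error rate.
    train_fn(H, gamma) returns a model; classify_fn(model, h) returns +1 or -1."""
    rng = np.random.default_rng() if rng is None else rng
    folds = np.array_split(rng.permutation(H.shape[0]), K)   # K validation segments
    errors = []
    for s in range(K):
        val_idx = folds[s]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != s])
        model = train_fn(H[train_idx], gamma[train_idx])
        mistakes = sum(classify_fn(model, H[n]) != gamma[n] for n in val_idx)
        errors.append(mistakes / len(val_idx))                # R_emp(s), cf. (61.55)
    return float(np.mean(errors))                             # R_emp(A_p), cf. (61.56)

# Example of sweeping a hyperparameter (here rho) and keeping the best value,
# assuming soft_svm_train / soft_svm_classify from the earlier sketch:
# best_rho = min(candidate_rhos, key=lambda rho: kfold_error_rate(
#     H_train, gamma_train,
#     lambda Ht, gt: soft_svm_train(Ht, gt, rho=rho),
#     lambda model, h: soft_svm_classify(h, *model)))
```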


Sometimes, the size of N may not be large enough for meaningful training. An alternative implementation is to employ a variation known as leave-one-out cross validation. In this case, we set K = N so that each segment consists of a single data point. During cross validation, training will be performed by using N − 1 points and the empirical error will be evaluated on the single point that is left out. At the end of the cross validation phase, we arrive at an answer to our first question about how to select the "best" algorithm or how to set "hyperparameters" in a guided manner. Once the algorithm and/or its hyperparameter(s) have been selected, we return to the full collection of N training data points, without excluding any segment for cross validation, and retrain the selected algorithm on this entire dataset of N points, i.e., on the 800 points in our example. The resulting classifier is denoted by A^⋆. We still need to answer the second question about how to test the performance of the "optimized" algorithm, A^⋆. To do so, we resort to the testing data (the T points) that we set aside and did not use during the cross validation procedure or training. We measure the empirical error rate on this test data:

R_emp(A⋆) = (1/T) ∑_{t=0}^{T−1} I[ A⋆(h_t) ≠ γ(t) ]    (61.57)

This calculation computes the average number of erroneous classifications over the test data for the learning algorithm and serves as its performance measure.

Example 61.5 (Selecting the regularization parameter) We apply the cross validation procedure to the selection of the regularization parameter ρ and the step-size parameter µ in an ℓ2-regularized logistic regression implementation. We consider the same data from Example 59.2, except that we now examine the problem of separating class r = 1 from class r = 2. There are a total of NTOTAL = 100 samples, with 50 samples from each class. We separate T = 20 samples for testing (that is, 20% of the total number of samples) and use the remaining N = 80 samples for training. We extend the feature vectors according to (59.16) and apply 100 passes of the ℓ2-regularized logistic regression algorithm (59.15).

We generate two plots for the empirical error rate of the logistic learner. In one case, we fix the step-size parameter at µ = 0.01 and vary the regularization parameter ρ in steps of one in the range ρ ∈ [0, 20]. In the second case, we fix the regularization parameter at ρ = 5 and vary the step-size µ in steps of 0.005 in the range µ ∈ [0.001, 0.1].

We implement a 10-fold cross validation scheme. That is, we set K = 10 and divide the training data into 10 segments of 8 samples each. We fix ρ at a particular value, and run the logistic regression on 9 segments while keeping the 10th segment for testing; this 10th segment generates an empirical error value. While running the algorithm on the 9 segments we run it multiple times over the data using 100 passes. We repeat this procedure 10 times, using 9 segments for training and 1 segment for testing, and subsequently average the empirical errors to determine the error rate that corresponds to the fixed value of ρ. We repeat the construction for other values of ρ and arrive at the curve shown on the left in Fig. 61.7. From this figure, it is evident that smaller values of ρ are preferred.
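The following listing sketches this 10-fold selection of ρ in Python. It is only an illustration: the data is a synthetic stand-in for the samples of Example 59.2 (which are not reproduced here), and the logistic classifier is trained by plain batch gradient descent rather than by the exact recursion (59.15); all names and parameter values are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the two classes (assumption: the actual dataset of
# Example 59.2 is not reproduced here), 50 samples per class in IR^2.
h_pos = rng.normal(loc=+1.0, scale=1.0, size=(50, 2))
h_neg = rng.normal(loc=-1.0, scale=1.0, size=(50, 2))
H = np.vstack([h_pos, h_neg])
gamma = np.hstack([np.ones(50), -np.ones(50)])

# Hold out 20% for testing; the remaining N = 80 samples are used for K-fold CV.
perm = rng.permutation(len(gamma))
test_idx, train_idx = perm[:20], perm[20:]
H_train, g_train = H[train_idx], gamma[train_idx]

def train_logistic(Htr, gtr, rho, mu=0.01, passes=100):
    """l2-regularized logistic regression trained by batch gradient descent."""
    X = np.hstack([Htr, np.ones((len(gtr), 1))])   # extended feature vectors
    w = np.zeros(X.shape[1])
    N = len(gtr)
    for _ in range(passes):
        margins = gtr * (X @ w)
        grad = -(X.T @ (gtr / (1.0 + np.exp(margins)))) / N + 2 * rho * w
        w = w - mu * grad
    return w

def cv_error(rho, K=10):
    """Average misclassification rate over K folds for a given rho."""
    folds = np.array_split(rng.permutation(len(g_train)), K)
    errs = []
    for k in range(K):
        val = folds[k]
        trn = np.hstack([folds[j] for j in range(K) if j != k])
        w = train_logistic(H_train[trn], g_train[trn], rho)
        Xval = np.hstack([H_train[val], np.ones((len(val), 1))])
        errs.append(np.mean(np.sign(Xval @ w) != g_train[val]))
    return np.mean(errs)

for rho in [0.0, 1.0, 5.0, 10.0, 20.0]:
    print(f"rho = {rho:5.1f}   CV error = {cv_error(rho):.3f}")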

[Figure 61.7: one panel shows the training samples from classes 1 and 2; the other two panels plot the empirical error rate (%) as a function of ρ and of µ, respectively.]

Figure 61.7 The plot on the left shows how the empirical error rate for the ℓ2-regularized logistic regression algorithm varies with the selection of ρ. A 10-fold cross validation implementation is used to generate this curve. The plot on the right shows the same curve as a function of the step-size parameter.

We repeat the same construction for the step-size parameter. We fix µ at one particular value, and run (59.15) on 9 of the segments while keeping the 10th segment for testing; this 10th segment generates an empirical error value. While running the algorithm on the 9 segments we run it multiple times over the data using 100 passes. We repeat the procedure 10 times, using 9 segments for training and 1 segment for testing, and subsequently average the empirical errors to determine the error rate that corresponds to the fixed value of µ. We repeat the construction for other values of µ and arrive at the curve shown on the right in Fig. 61.7.

Example 61.6 (Structural risk minimization) The cross validation approach of this section can be seen as a form of structural risk minimization. Consider, for instance, the ℓ2-regularized empirical risk formulation:

w⋆ ≜ argmin_{w ∈ IR^M} { P(w) = q(w) + P_unreg(w) }    (61.58)

where we are expressing the risk P (w) as the sum of two components: q(w) denotes the convex regularization factor, and Punreg (w) denotes the remaining unregularized component. For example, for the logistic regression problem

w⋆ = argmin_{w ∈ IR^M} { ρ‖w‖² + (1/N) ∑_{n=0}^{N−1} ln(1 + e^{−γ(n) h_n^T w}) }    (61.59)

we would have

q(w) = ρ‖w‖²,    P_unreg(w) = (1/N) ∑_{n=0}^{N−1} ln(1 + e^{−γ(n) h_n^T w})    (61.60)

Now, we know from the earlier result (51.94) that, under some reasonable technical conditions that are usually satisfied for our problems of interest, solving a regularized problem of the form (61.58) is equivalent to solving

w⋆ ≜ argmin_{w ∈ IR^M} P_unreg(w), subject to q(w) ≤ τ    (61.61)

for some τ ≥ 0 dependent on ρ, written as τ(ρ). In other words, problem (61.61) is effectively searching for the classifier w⋆ within the set:

W_ρ ≜ { w ∈ IR^M | q(w) ≤ τ(ρ) }    (61.62)

which is parameterized by ρ. By solving (61.58) for different values of ρ, as happens during a cross validation procedure to select an optimal ρ, we are then searching for the solution w? over successive sets {Wρ1 , Wρ2 , . . .} defined by successive values for the hyperparameter ρ. This sequence of nested optimization problems to determine an optimal classifier (i.e., the w? corresponding to the optimal choice of ρ) is an example of “structural risk minimization.”
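The equivalence between the regularized form (61.58) and the constrained form (61.61) can be visualized numerically. The sketch below assumes, purely for illustration, a quadratic unregularized risk (a least-squares fit) so that the regularized minimizer has a closed form; it shows that q(w⋆) = ‖w⋆‖² shrinks as ρ grows, i.e., each ρ corresponds to a smaller constraint level τ(ρ) and the sets W_ρ are nested.

import numpy as np

rng = np.random.default_rng(1)

# Quadratic stand-in for P_unreg(w): a least-squares fit (an assumption made
# only so that the regularized minimizer has a closed form).
H = rng.normal(size=(50, 5))
d = H @ rng.normal(size=5) + 0.1 * rng.normal(size=50)

def ridge_solution(rho):
    # minimizes rho*||w||^2 + (1/N)*||d - H w||^2
    N = len(d)
    A = (H.T @ H) / N + rho * np.eye(H.shape[1])
    return np.linalg.solve(A, (H.T @ d) / N)

# As rho increases, q(w*) = ||w*||^2 decreases: each rho corresponds to a
# smaller constraint level tau(rho), giving nested sets W_rho1 ⊃ W_rho2 ⊃ ...
for rho in [0.01, 0.1, 1.0, 10.0]:
    w = ridge_solution(rho)
    print(f"rho = {rho:6.2f}   q(w*) = ||w*||^2 = {w @ w:.4f}")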

61.4 COMMENTARIES AND DISCUSSION

Support vector machines. It is mentioned in the text by Vapnik (1979), and also in the article by Cortes and Vapnik (1995, p. 275), that the original idea of the hard-margin SVM formulation (61.4a)–(61.4b) was developed by Vapnik and Chervonenkis back in 1965, although the modern form of SVM and its kernel version first appeared in the publication by Boser, Guyon, and Vapnik (1992). The soft-margin formulation (61.11a)–(61.11c) appeared in Cortes and Vapnik (1995). Hard-margin SVM can be viewed as a nonlinear extension of the generalized portrait algorithm introduced by Vapnik and Lerner (1963) and further developed by Vapnik and Chervonenkis (1964). All these algorithms are based on the idea of seeking separating surfaces that maximize the margin from the training data, and have found applications in a range of areas, including bioinformatics, image recognition, face detection, text processing, and others – see the overview by Burges (1998). Mentions of classifier designs that make use of large-margin hyperplanes also appear in the works by Cover (1965) and Duda and Hart (1973). For more information on SVM classifiers, their history, properties, and variations, the reader may refer to the texts by Vapnik (1995, 1998), Scholkopf (1997), Cristianini and Shawe-Taylor (2000), Scholkopf and Smola (2001), Herbrich (2002), and Steinwart and Christmann (2008), as well as the articles by Burges (1998), Lin (2002), Lin, Lee, and Wahba (2002), and Smola and Scholkopf (2004).

Quadratic program. We explained in Section 61.2 that SVM problems can be recast as convex quadratic programs whose solutions can be pursued by duality arguments. These quadratic programs are extensions of a body of work from the early and mid-1960s by Minnick (1961), Singleton (1962), Charnes (1964), and more broadly by Mangasarian (1965, 1968), who posed the binary classification problem as the solution to linear


(as opposed to quadratic) programming problems. In linear programs, the objective function and the constraint function are all linear (affine) in the unknown w.

Slack variables. The soft-margin framework relies on introducing slack variables to enhance the robustness of the SVM solution. The idea of using slack variables is due to Smith (1968), whose work was motivated by the linear programming approach of Mangasarian (1965). The application of slack variables to separating hyperplanes appears in the article by Bennett and Mangasarian (1992). Result (61.27) relating the average number of support vectors to the expected empirical error rate for SVM classifiers appears in Boser, Guyon, and Vapnik (1992) and Cortes and Vapnik (1995).

Cross validation. One of the advantages of the cross validation procedure is that, by alternating over training and validation segments, it becomes possible to investigate the generalization performance of a learning algorithm without the need to collect additional training data. The technique performs generally well in practice, although some difficulties may arise. For example, we discussed two versions of cross validation: the leave-one-out model and the K-fold model. In the leave-one-out procedure, one sample is left aside while training is performed on the remaining N − 1 samples. When this is repeated a second time, a second sample is set aside and training is performed on the other N − 1 samples, and so on. Note that the training sets used during the successive training steps share N − 2 data points. This means that the models that result from these training steps are highly correlated, which affects the quality of the estimate for the empirical risk in (61.56) since it is obtained by averaging strongly correlated quantities. This is one reason why it is preferred to employ the K-fold construction to reduce the correlation between successive runs of the procedure. The K-fold construction is also computationally less demanding than leave-one-out, since it requires only K training runs rather than N.

The idea of setting aside some random subset of the data for subsequent testing is widely used in statistical analysis and correlation studies. Some of the earlier works involve, for example, contributions by Larson (1931) and Quenouille (1949, 1957). According to Stone (1974), the method of cross validation in the form described in this chapter appears to have been originally developed by Lachenbruch (1965), who was motivated by the work of Mosteller and Wallace (1963). Useful early accounts, including discussion of K-fold cross validation, appear in Lachenbruch and Mickey (1968), Mosteller and Tukey (1968), and Luntz and Brailovsky (1969). Other earlier works dealing with cross validation techniques and their properties appear in Hills (1966), Cochran (1968), Allen (1974), Stone (1974, 1977, 1978), and Cox (1975). Further treatment on the subject, including more modern accounts and analysis of bias and variance properties, can be found in Devijver and Kittler (1982), Picard and Cook (1984), Breiman et al. (1984), Geisser (1993), Breiman (1996c), Holden (1996), Efron and Tibshirani (1997), Anthony and Holden (1998), Dietterich (1999), Nadeau and Bengio (2003), McLachlan (2004), Bengio and Grandvalet (2005), and Hastie, Tibshirani, and Friedman (2009). The results by Holden (1996) and Anthony and Holden (1998), in particular, provide a useful characterization of the quality of the empirical error rate estimated according to (61.56) in a K-fold implementation.
They derived a bound on the probability that this empirical estimate is close enough to the true error rate of the classifier c by showing that, for any 0 < δ < 1, N ≥ K ≥ 3, and Nδ² > 2K:

P( sup_{c∈C} |R_emp(Ap) − R(c)| > δ ) ≤ 2K ( eN(1 + 1/K) / (2VC) )^{VC} 2^{−Nδ²/2K}    (61.63)

where Remp(Ap) refers to the estimated empirical error rate of the algorithm under consideration, R(c) is the actual probability of error of classifier c, C is the class of classifiers over which the design is performed (such as limiting c to affine classifiers), and VC is a constant that measures the complexity of the set C; for example, it is M + 1 for affine classifiers in IR^M. We will define the VC dimension in a future chapter. The bound on the right-hand side depends on δ, the size of the training data, N, the number of segments, K, and the VC dimension. The result is similar in form to the


Vapnik–Chervonenkis bound, which we will derive in expression (64.111).

PROBLEMS

61.1 Consider two feature vectors {h_a, h_b} where h_a belongs to class +1 and h_b belongs to class −1. Assume that these two vectors meet the margin in an SVM implementation, that is, they satisfy h_a^T w⋆ − θ⋆ = +1 and h_b^T w⋆ − θ⋆ = −1. The parameters (w⋆, θ⋆) describe the separating hyperplane with maximal margin. Project the vector difference h_a − h_b along the unit-norm normal to the separating hyperplane and determine the size of the margin, m(w⋆), associated with w⋆ from this calculation.

61.2 Is the solution to the hard-margin SVM problem (61.4a)–(61.4b) unique?

61.3 Is the solution to the soft-margin SVM problem (61.11a)–(61.11c) unique?

61.4 Justify recursions (61.33a)–(61.33d) for the solution of an ℓ2-regularized SVM risk for regression purposes. Remark. For more discussion on the use of the ε-insensitive loss function max{0, |x| − ε} in (61.31b), the reader may refer to Vapnik (1995, 1998).

61.5 How would recursions (61.33a)–(61.33d) be modified if the empirical risk is ℓ1-regularized and changed to

(w⋆, θ⋆) = argmin_{w ∈ IR^M, θ ∈ IR} { α‖w‖₁ + (1/N) ∑_{n=0}^{N−1} max{0, (γ(n) − γ̂(n))² − ε} }

where γ̂(n) = h_n^T w − θ?

61.6 Refer to the statement of Prob. 60.7, except that now we wish to determine a separating hyperplane w such that γ(n) h_n^T w > 1. We motivated the SVM recursion in the body of the chapter as one solution method. Here, we motivate a second relaxation method based on using the alternating projection algorithm from Section 12.6. Introduce the N halfspaces H_n = {w | 1 − γ(n) h_n^T w < 0}, one for each data pair (γ(n), h_n). We are then faced with the problem of solving N linear inequalities and finding a point w⋆ in the intersection of these halfspaces. Use the result of Prob. 9.5 to show that the alternating projection method motivates the following recursion:

w_n = w_{n−1} + (γ(n) h_n / ‖h_n‖²) max{0, 1 − γ(n) h_n^T w_{n−1}}

How is this method different from the hard-margin SVM recursion?

61.7 Consider a collection of N data points {γ(m), h_m} where γ(m) ∈ {±1} and h_m ∈ IR^M. Assume the data is linearly separable with zero offset, meaning that there exists some vector w such that h_m^T w > 0 for features in class +1 and h_m^T w < 0 for features in class −1. We know that such separating hyperplanes are highly nonunique. Consider the logistic regression formulation:

min_{w ∈ IR^M} { P(w) ≜ (1/N) ∑_{m=0}^{N−1} ln(1 + e^{−γ(m) h_m^T w}) }

Assume we apply the gradient-descent recursion repeatedly to minimize P(w), namely,

w_n = w_{n−1} − µ ∇_{w^T} P(w_{n−1}), n ≥ 0

Show that, for small µ, the iterate w_n converges to a limit satisfying

lim_{n→∞} w_n/‖w_n‖ = w_svm/‖w_svm‖

where w_svm is the solution to the hard-margin SVM problem:

w_svm = argmin_{w ∈ IR^M} (1/2)‖w‖², subject to γ(m) h_m^T w ≥ 1, m = 0, 1, 2, . . . , N − 1


Remark. The result of this problem provides another example of the implicit bias problem discussed in the comments of Chapter 29. The data is linearly separable and there exist infinitely many choices for the separating hyperplane. The gradient-descent algorithm chooses one particular solution from among these; namely, the one with the largest margin. See Soudry et al. (2018) for more discussion.

61.8 Verify that the Gramian matrix A defined by (61.40) is nonnegative definite and conclude that the cost function in the minimization problem (61.42a) is convex.

61.9 Repeat the derivation given in Section 61.2 to show that the soft-margin SVM problem (61.11a)–(61.11c) can be rewritten as in (61.51a)–(61.51b). Explain that the solution λ(n) will be nonzero at data points that meet or violate the margin. In particular, verify that:

λ⋆(n) = 0, when γ(n)(h_n^T w⋆ − θ⋆) > 1
λ⋆(n) = η/N, when γ(n)(h_n^T w⋆ − θ⋆) < 1
0 ≤ λ⋆(n) ≤ η/N, when γ(n)(h_n^T w⋆ − θ⋆) = 1

REFERENCES

Allen, D. M. (1974), "The relationship between variable selection and data augmentation and a method of prediction," Technometrics, vol. 16, pp. 125–127.
Anthony, M. and S. B. Holden (1998), "Cross-validation for binary classification by real-valued functions: Theoretical analysis," Proc. Ann. Conf. Computational Learning Theory (COLT), pp. 218–229, Madison, WI.
Bengio, Y. and Y. Grandvalet (2005), "Bias in estimating the variance of K-fold cross-validation," in Statistical Modeling and Analysis for Complex Data Problems, P. Duchesne and B. Remillard, editors, pp. 75–95, Springer.
Bennett, K. P. and O. L. Mangasarian (1992), "Robust linear programming discrimination of two linearly inseparable sets," Optim. Meth. Softw., vol. 1, pp. 23–34.
Boser, B. E., I. Guyon, and V. N. Vapnik (1992), "A training algorithm for optimal margin classifiers," Proc. Annual Conf. Computational Learning Theory (COLT), pp. 144–152, Pittsburgh, PA.
Breiman, L. (1996c), "Bias, variance and arcing classifiers," Technical Report 460, Statistics Department, University of California at Berkeley.
Breiman, L., J. H. Friedman, R. A. Olshen, and C. J. Stone (1984), Classification and Regression Trees, Wadsworth International Group.
Burges, C. (1998), "A tutorial on support vector machines for pattern recognition," Data Mining Knowl. Discov., vol. 2, no. 2, pp. 121–167.
Charnes, A. (1964), "Some fundamental theorems of Perceptron theory and their geometry," in Computer and Information Sciences, J. T. Tou and R. H. Wilcox, editors, Spartan Books.
Cochran, W. G. (1968), "Commentary on estimation of error rates in discriminant analysis," Technometrics, vol. 10, pp. 204–205.
Cortes, C. and V. N. Vapnik (1995), "Support-vector networks," Mach. Learn., vol. 20, pp. 273–297.
Cover, T. M. (1965), "Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition," IEEE Trans. Electron. Comput., vol. 14, pp. 326–334.
Cox, D. R. (1975), "A note on data-splitting for the evaluation of significance levels," Biometrika, vol. 62, no. 2, pp. 441–445.
Cristianini, N. and J. Shawe-Taylor (2000), An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press.
Devijver, P. A. and J. Kittler (1982), Pattern Recognition: A Statistical Approach, Prentice Hall.


Dietterich, T. G. (1999), "Approximate statistical tests for comparing supervised classification learning algorithms," Neural Comput., vol. 10, pp. 1895–1924.
Duda, R. O. and P. E. Hart (1973), Pattern Classification and Scene Analysis, Wiley.
Efron, B. and R. Tibshirani (1997), "Improvements on cross-validation: The .632+ bootstrap method," J. Amer. Statist. Assoc., vol. 92, no. 438, pp. 548–560.
Geisser, S. (1993), Predictive Inference, Chapman & Hall.
Hastie, T., R. Tibshirani, and J. Friedman (2009), The Elements of Statistical Learning, 2nd ed., Springer.
Herbrich, R. (2002), Learning Kernel Classifiers: Theory and Algorithms, MIT Press.
Hills, M. (1966), "Allocation rules and their error rates," J. Roy. Statist. Soc. Ser. B, vol. 28, pp. 1–31.
Holden, S. B. (1996), "Cross-validation and the PAC learning model," Research Note RN/96/64, Department of Computer Science, University College London.
Lachenbruch, P. (1965), Estimation of Error Rates in Discriminant Analysis, Ph.D. dissertation, University of California at Los Angeles.
Lachenbruch, P. and M. Mickey (1968), "Estimation of error rates in discriminant analysis," Technometrics, vol. 10, pp. 1–11.
Larson, S. (1931), "The shrinkage of the coefficient of multiple correlation," J. Edu. Psychol., vol. 22, pp. 45–55.
Lin, Y. (2002), "Support vector machines and the Bayes rule in classification," Data Mining Knowl. Discov., vol. 6, pp. 259–275.
Lin, Y., Y. Lee, and G. Wahba (2002), "Support vector machines for classification in nonstandard situations," Mach. Learn., vol. 46, pp. 191–202.
Luntz, A. and V. Brailovsky (1969), "On estimation of characters obtained in statistical procedure of recognition," Techicheskaya Kibernetica, vol. 3, pp. 6–12 (in Russian).
Mangasarian, O. L. (1965), "Linear and nonlinear separation of patterns by linear programming," Oper. Res., vol. 13, pp. 444–452.
Mangasarian, O. L. (1968), "Multi-surface method of pattern separation," IEEE Trans. Inf. Theory, vol. 14, no. 6, pp. 801–807.
McLachlan, G. J. (2004), Discriminant Analysis and Statistical Pattern Recognition, Wiley.
Minnick, R. C. (1961), "Linear-input logic," IRE Trans. Electron. Comput., vol. 10, pp. 6–16.
Mosteller, F. and J. W. Tukey (1968), "Data analysis, including statistics," in Handbook of Social Psychology, G. Lindzey and E. Aronson, editors, Addison-Wesley.
Mosteller, F. and D. L. Wallace (1963), "Inference in an authorship problem," J. Amer. Statist. Assoc., vol. 58, pp. 275–309.
Nadeau, C. and Y. Bengio (2003), "Inference for the generalization error," Mach. Learn., vol. 52, pp. 239–281.
Picard, R. and D. Cook (1984), "Cross-validation of regression models," J. Amer. Statist. Assoc., vol. 79, no. 387, pp. 575–583.
Quenouille, M. (1949), "Approximate tests of correlation in time series," J. Roy. Statist. Soc. Ser. B, vol. 11, pp. 18–84.
Quenouille, M. (1957), The Analysis of Multiple Time Series, Griffin.
Scholkopf, B. (1997), Support Vector Learning, Oldenbourg Verlag.
Scholkopf, B. and A. J. Smola (2001), Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press.
Singleton, R. C. (1962), "A test for linear separability as applied to self-organizing machines," Proc. Conf. Self-Organizing Systems, M. C. Yovits, G. T. Jacobi, and G. D. Goldstein, editors, pp. 503–524, Spartan Books.
Smith, F. W. (1968), "Pattern classifier design by linear programming," IEEE Trans. Comput., vol. 17, no. 4, pp. 367–372.
Smola, A. J. and B. Scholkopf (2004), "A tutorial on support vector regression," Statist. Comput., vol. 14, no. 3, pp. 199–222.
Soudry, D., E. Hoffer, M. S. Nacson, S. Gunasekar, and N. Srebro (2018), "The implicit bias of gradient descent on separable data," J. Mach. Learn. Res., vol. 19, pp. 1–57.


Steinwart, I. and A. Christmann (2008), Support Vector Machines, Springer.
Stone, M. (1974), "Cross-validatory choice and assessment of statistical predictions," J. Roy. Statist. Soc. Ser. B, vol. 36, pp. 111–147.
Stone, M. (1977), "Asymptotics for and against cross-validation," Biometrika, vol. 64, no. 1, pp. 29–35.
Stone, M. (1978), "Cross-validation: A review," Math. Operationsforsch. Statist., Ser. Statistics, vol. 9, no. 1, pp. 127–139.
Vapnik, V. N. (1979), Estimation of Dependences Based on Empirical Data, Nauka (in Russian). English translation published in 1982 by Springer. Reprinted in 2006.
Vapnik, V. N. (1995), The Nature of Statistical Learning Theory, Springer.
Vapnik, V. N. (1998), Statistical Learning Theory, Wiley.
Vapnik, V. N. and A. Y. Chervonenkis (1964), "A note on one class of perceptrons," Aut. Remote Control, vol. 25, no. 1.
Vapnik, V. N. and A. Lerner (1963), "Pattern recognition using generalized portrait method," Aut. Remote Control, vol. 24, no. 6, pp. 774–780.

62 Bagging and Boosting

In this chapter we describe two ensemble learning techniques, known as bagging and boosting, which aggregate the decisions of a mixture of learners to enable enhanced classification performance. In particular, they help transform a collection of "weak" learners into a more robust learning machine. The two methods differ by the manner in which the individual learners are trained and by how their decisions are fused together. For example, the classifiers operate in parallel under bagging and sequentially under boosting. A majority vote is used to fuse the decisions of bagging classifiers, while an adaptive weighting procedure is used under boosting. The two techniques also differ in the manner in which they address overfitting and underfitting. Bagging helps us move from a state of overfitting to better-fitting by smoothing the decisions of various classifiers. In comparison, boosting helps us move in the other direction from a state of underfitting to better-fitting by aggregating the decisions of weak classifiers to yield a better-performing classifier. We describe the bagging procedure first, followed by boosting.

62.1 BAGGING CLASSIFIERS

Consider a collection of N training data points {γ(n), hn}, where γ(n) denotes the class variable and hn ∈ IR^M the corresponding feature vector. Although unnecessary, for simplicity we consider a binary classification problem where γ(n) ∈ {±1}. Bagging is based on the idea of training multiple classifiers, using data from the same training set, and on combining their classification decisions. In order to ensure variability across the classifiers, bootstrap sampling is employed to generate training samples randomly for each classifier; hence the name "bagging," which is a shorthand reference obtained from the words "bootstrap aggregating." Specifically, consider a collection of L classifiers, also called base classifiers. The classifiers need not have a homogeneous structure: While all of them can be of the same type (say, logistic classifiers), they can also be different. For each classifier, a training set of size N is generated by sampling from the original training data {γ(n), hn} with replacement. The operation of sampling with replacement is known as bootstrap sampling in the statistics literature. As such, some data


points may be repeated, even within the training data for the same classifier. All classifiers are subsequently trained with the corresponding training datasets. We denote the trained classifiers by {c_ℓ⋆(h)}, for ℓ = 1, 2, . . . , L.

62.1.1 Classification and Regression

During testing, when a new feature vector h arrives, each of the individual classifiers c_ℓ⋆(h) generates a prediction, denoted by γ̂_ℓ(h), for the label corresponding to h. For example, if c_ℓ⋆(h) is an affine classifier with parameters (w_ℓ⋆, θ_ℓ⋆), then

γ̂_ℓ(h) = h^T w_ℓ⋆ − θ_ℓ⋆    (62.1)

and the class variable is set to the sign of the prediction:

c_ℓ⋆(h) = sign(γ̂_ℓ(h))    (62.2)

These individual decisions can be combined in different ways. For example, a majority vote can be used to arrive at the final label, c⋆(h): If the majority of the classifiers decides in favor of class +1, then we set c⋆(h) = +1; otherwise, we set c⋆(h) = −1:

γ⋆ = c⋆(h) = majority vote{ c_1⋆(h), c_2⋆(h), . . . , c_L⋆(h) }    (62.3)

In other instances, a weighted majority vote can be used, especially in situations when classifiers provide confidence levels for their decisions, as happens with logistic regression solutions. Figure 62.1 illustrates the structure of the bagging classifier. It consists of the bank of classifiers, {c_ℓ⋆(h)}, with each one of them producing a ±1 binary decision. A majority vote is then taken to arrive at the ultimate decision c⋆(h). It is clear that the same construction can be used for multiclass classification problems where the binary label γ ∈ {±1} is replaced by r ∈ {1, 2, . . . , R}. The listing below describes the bagging procedure for classification problems.

Bagging algorithm.

given N training points {γ(n), hn}, n = 0, 1, . . . , N − 1;
given L base classifiers c_ℓ(h), ℓ = 1, 2, . . . , L;
(training)
repeat for each classifier ℓ:
    choose N training samples at random with replacement;
    train c_ℓ(h) and denote the trained classifier by c_ℓ⋆(h).
end
(testing)
apply the feature vector h to each classifier c_ℓ⋆(h);
determine its class γ⋆ by taking a majority vote according to (62.3).
    (62.4)
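A minimal Python sketch of listing (62.4) follows. The base classifier used here is a decision stump, chosen only to keep the example self-contained; the synthetic data and all parameter values are illustrative assumptions rather than part of the text.

import numpy as np

rng = np.random.default_rng(0)

def train_stump(H, gamma):
    """Train a simple base classifier: a decision stump that picks the single
    feature/threshold/sign combination with the smallest training error."""
    best = None
    for j in range(H.shape[1]):
        for thr in np.unique(H[:, j]):
            for sgn in (+1, -1):
                pred = sgn * np.sign(H[:, j] - thr + 1e-12)
                err = np.mean(pred != gamma)
                if best is None or err < best[0]:
                    best = (err, j, thr, sgn)
    _, j, thr, sgn = best
    return lambda X: sgn * np.sign(X[:, j] - thr + 1e-12)

def bagging_train(H, gamma, L=25):
    """Listing (62.4): train L base classifiers on bootstrap samples."""
    N = len(gamma)
    classifiers = []
    for _ in range(L):
        idx = rng.integers(0, N, size=N)        # sampling with replacement
        classifiers.append(train_stump(H[idx], gamma[idx]))
    return classifiers

def bagging_predict(classifiers, H):
    """Majority vote (62.3) over the individual +/-1 decisions."""
    votes = np.sum([c(H) for c in classifiers], axis=0)
    return np.where(votes >= 0, 1.0, -1.0)

# toy data (assumption: two synthetic Gaussian clouds)
H = np.vstack([rng.normal(+1, 1, (60, 2)), rng.normal(-1, 1, (60, 2))])
gamma = np.hstack([np.ones(60), -np.ones(60)])
ensemble = bagging_train(H, gamma)
print("training error:", np.mean(bagging_predict(ensemble, H) != gamma))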


Figure 62.1 Structure of the bagging classifier; it consists of a bank of L classifiers followed by a majority vote decision.

If we are instead interested in solving regression (as opposed to classification) problems where the emphasis is on predicting the target variable γ(h), then the individual predictors can be fused by averaging their values to get

γ̂(h) = (1/L) ∑_{ℓ=1}^{L} γ̂_ℓ(h)    (62.5)

62.1.2 Variance Reduction

In general, but not always, bagging has a smoothing effect and helps reduce the variance of the final classifier in comparison to the variances of the individual classifiers due to averaging; this effect is examined in Prob. 62.3. As such, bagging is useful in countering overfitting. This can be illustrated as follows. Each predictor γ̂_ℓ serves as an estimator for the true target variable γ. The squared bias and variance of the predictor are denoted by

b_ℓ² ≜ (γ − E γ̂_ℓ)²,    σ_ℓ² ≜ E γ̂_ℓ² − (E γ̂_ℓ)²    (62.6)

We saw earlier in Table 27.3 in the context of inference problems, and we are going to see in Section 64.5 in the context of classification problems, that there is a


fundamental bias–variance trade-off where small bias in a predictor is associated with a larger variance and vice-versa. When overfitting occurs, the bias will be small, while the variance will be large. Through bagging, we will generally be able to reduce the size of the error variance. To see this, consider the following simplified example. Let x denote a random variable with mean µ and variance σ_x². Assume a single observation, denoted by x₁, is available for x. Then, we can employ this observation to construct an estimator for the mean as follows:

µ̂ = x₁    (62.7)

This estimator is unbiased and its error variance is equal to σ_x² since

E µ̂ = µ,    E (µ̂ − µ)² = E (x₁ − µ)² = σ_x²    (62.8)

If we happen to have L independent observations, denoted by {x_ℓ}, then we can average them and employ instead the following estimator for the mean:

µ̂_L = (1/L) ∑_{ℓ=1}^{L} x_ℓ    (62.9)

This estimator continues to be unbiased but its variance is now reduced by a factor L since (see Prob. 62.2):

E (µ̂_L − µ)² = σ_x²/L    (62.10)

The bagging construction is based on a similar principle by averaging several estimators {γ̂_ℓ(h)} for γ(h) – see, e.g., the fusion expression (62.5) used in the regression context. One would then expect the error variance in estimating γ(h) to improve relative to the error variances of the individual predictors {γ̂_ℓ(h)}. This is generally true for regression solutions that rely on bagging. One difficulty, however, is that the estimators {γ̂_ℓ(h)} used in bagging are highly correlated (and not independent) since they are constructed from the same underlying training data. As such, situations can arise where the error variance is not necessarily reduced, as indicated in some of the references in the comments at the end of the chapter.
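The influence of this correlation can be checked numerically. The sketch below compares the empirical variance of the average of L independent estimators with that of L positively correlated ones (an equicorrelated Gaussian model is assumed purely for illustration): the independent average attains the σ_x²/L reduction of (62.10), while the correlated average does not.

import numpy as np

rng = np.random.default_rng(2)
L, trials, sigma_x, corr = 10, 20000, 1.0, 0.8

# L independent observations per trial
x_indep = rng.normal(0.0, sigma_x, size=(trials, L))

# L equicorrelated observations per trial (pairwise correlation = corr)
common = rng.normal(0.0, sigma_x, size=(trials, 1))
private = rng.normal(0.0, sigma_x, size=(trials, L))
x_corr = np.sqrt(corr) * common + np.sqrt(1 - corr) * private

print("var of a single estimator    :", x_indep[:, 0].var())
print("var of average (independent) :", x_indep.mean(axis=1).var())  # ~ sigma^2/L
print("var of average (correlated)  :", x_corr.mean(axis=1).var())   # ~ corr*sigma^2 + (1-corr)*sigma^2/L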

62.1.3 Random Forests

The correlation among the trained classifiers {c_ℓ⋆(h)} is a major drawback of the bagging construction. The random forest algorithm, which is a bagging procedure applied to decision trees, addresses this limitation and reduces the correlation by relying on random subsets of the feature data. The algorithm operates as follows. Using bootstrap samples from the available training data, a collection of decision trees are trained separately and the final classification label is decided based again on a majority vote from the individual trees. Now, however, the feature space is also sampled randomly. Specifically, during the construction of each individual tree, and at every node, a random subset of the attributes is selected to


decide on the root at that location in the tree, e.g., as described earlier by algorithm (54.61). Feature randomization helps counter the possibility of overfitting. The following listing summarizes the main steps of the random forest algorithm, starting with a collection of N training data points, {γ(n), hn}. The size of the bootstrap sample, denoted by N′ ≤ N, can be smaller (but not much smaller) than N, say, N′ ≈ (2/3)N.

Random forests algorithm.

given a collection of N data points {γ(n), hn};
it is desired to build L decision trees to serve as base classifiers.
(training)
repeat for each tree:
    choose a bootstrap sample of size N′ ≤ N from the training data;
    use this data to construct a decision tree, e.g., by using (54.61),
    except that the construction at each node is based on a random
    subset of the attributes.
end
(testing)
apply the feature vector h through all trees;
select the class label using a majority vote.
    (62.11)
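A compact sketch of listing (62.11) is shown next. It assumes scikit-learn's decision trees as the base learners (any tree construction, such as (54.61), could be substituted); the bootstrap size N′ ≈ (2/3)N, the number of trees, and the synthetic data are illustrative choices.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def random_forest_train(H, gamma, L=50, n_prime=None, max_features="sqrt"):
    """Listing (62.11): L trees, each grown on a bootstrap sample of size
    N' <= N, with a random subset of attributes examined at every node."""
    N = len(gamma)
    n_prime = n_prime or int(round(2 * N / 3))
    trees = []
    for _ in range(L):
        idx = rng.integers(0, N, size=n_prime)            # bootstrap sample
        tree = DecisionTreeClassifier(max_features=max_features,
                                      random_state=int(rng.integers(1 << 30)))
        tree.fit(H[idx], gamma[idx])
        trees.append(tree)
    return trees

def random_forest_predict(trees, H):
    votes = np.sum([t.predict(H) for t in trees], axis=0)   # labels are +/-1
    return np.where(votes >= 0, 1.0, -1.0)

# toy data (assumption: two synthetic Gaussian clouds)
H = np.vstack([rng.normal(+1, 1, (100, 5)), rng.normal(-1, 1, (100, 5))])
gamma = np.hstack([np.ones(100), -np.ones(100)])
forest = random_forest_train(H, gamma)
print("training error:", np.mean(random_forest_predict(forest, H) != gamma))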

62.2 ADABOOST CLASSIFIER

We discuss next the boosting technique, which is a more sophisticated procedure than bagging and involves combining a collection of individual classifiers in an adaptive and weighted manner. By doing so, boosting is able to transform a collection of weak learners into a more reliable classification machine. One of the most widely used boosting methods is the AdaBoost algorithm, where "AdaBoost" stands for "adaptive boosting." The algorithm boosts the performance of the individual classifiers through an iterative procedure where certain combination weights are adjusted from one iteration to another. To motivate the algorithm for binary classification problems, we will first define, for the purposes of this section, weak learners, c(h), as any classifier whose empirical error rate or misclassification error on the training data (as defined by (52.11)) is slightly better than random guessing, say,

R_emp(c) = 0.5 − ε, for small ε > 0    =⇒    weak classifier    (62.12)

How do such classifiers arise? Consider, for example, a situation dealing with large-size feature vectors, hn ∈ IRM , where M can run into the hundreds or thousands. Training a classifier on such large-dimensional feature spaces can be


computationally demanding. One can instead consider training multiple learners on smaller subsets of the feature entries (or even on individual feature entries); these learners will generally exhibit weak performance because the smaller feature subspaces may not have sufficient discrimination power in them. The boosting technique is then used to aggregate the decisions of the weak learners and transform them into a more reliable decision.

62.2.1 Algorithm Statement

We start by describing the boosting algorithm and its operation, before formally deriving it. Consider the collection of N training data points, {γ(n), hn}. Initially, all points in this training set are assigned equal weights, denoted by

d₁(n) = 1/N,    n = 0, 1, 2, . . . , N − 1    (62.13)

The subscript "1" stands for iteration 1, and the index n refers to the data index. These weights add up to 1 and they will be updated, from one iteration to another, as the boosting procedure progresses, in order to assign smaller weights to data that is being well classified and larger weights to data that is resisting classification. We assume that we have a collection of L weak classifiers denoted by {c₁(h), c₂(h), . . . , c_L(h)}; these classifiers can be of different types, say, perceptron, logistic regression, nearest neighbor, or any other classifier structure. We train each of the L weak classifiers with the N data points {γ(n), hn}. Then, each of them will provide a binary decision, denoted by c_ℓ(h) ∈ {±1}. The AdaBoost procedure operates by iterating over the weak classifiers; the iteration variable is denoted by t in the description below. In principle, the variable t can be incremented until it reaches the value L, although the iteration can be stopped before L (especially when L is large):

repeat for t = 1, 2, . . . , L (number of weak learners):

(1) Associate with each classifier, c_ℓ(h) for ℓ = 1, 2, . . . , L, an empirical error rate that is computed by counting, in a weighted manner, the number of its erroneous decisions on the training data as follows:

E(ℓ) ≜ ∑_{n=0}^{N−1} d_t(n) I[ c_ℓ(hn) ≠ γ(n) ] = ∑_{n∈M_ℓ} d_t(n)    (62.14)

where M_ℓ denotes the set of indices of misclassified feature vectors by classifier c_ℓ(h). Observe that each erroneous decision is weighted by the corresponding scalar d_t(n) from iteration t.

(2) At least one of the classifiers should have E(ℓ) < 1/2; otherwise, the procedure stops. From among the L classifiers {c_ℓ(h)}, select the one that results in the smallest empirical error during this tth iteration (i.e., the one with the best performance). We denote it by:

c_t⋆(h) ≜ argmin_{1≤ℓ≤L} E(ℓ)    (62.15)

where the superscript ⋆ is used to designate the selected classifier at this iteration. The corresponding error is denoted by E⋆(t), which is smaller than 1/2. We use this error to define the following relevance factor for c_t⋆(h) (this expression is justified further ahead in Section 62.2.2):

α⋆(t) = (1/2) ln( (1 − E⋆(t)) / E⋆(t) )    (62.16)

Note that the argument of the logarithm is larger than 1 since E⋆(t) < 1/2. It follows that α⋆(t) is positive. Note further that the smaller the value of E⋆(t) is, the larger the value of α⋆(t) will be. Hence, classifiers with better performance will be weighted more heavily in the final construction.

(3) Update the nonnegative weighting factors {d_t(n)} for the training data for the next iteration t + 1 as follows (this expression is again justified further ahead in Section 62.2.2):

d_{t+1}(n) = (1/β(t)) d_t(n) exp{ −α⋆(t) γ(n) c_t⋆(hn) },    n = 0, 1, 2, . . . , N − 1    (62.17a)

The scalar β(t) is used to normalize the sum of the weighting factors {d_{t+1}(n)} to 1, i.e.,

β(t) ≜ ∑_{n=0}^{N−1} d_t(n) exp{ −α⋆(t) γ(n) c_t⋆(hn) }    (62.17b)

Observe from (62.17a) that training points that are misclassified by c_t⋆(h) will result in γ(n)c_t⋆(hn) < 0, and the exponential factor will be larger than 1. This means that weights for points that are misclassified by c_t⋆(h) are increased while weights for the remaining points are decreased. This helps alert the classifiers to "difficult" data points for the next iteration.

(4) Move to the next iteration t + 1.
end

At the end of the L iterations, and after we have identified the best classifiers {c_t⋆, α⋆(t)} over these iterations, we combine them to arrive at the final classifier:

γ̂(h) ≜ ∑_{t=1}^{L} α⋆(t) c_t⋆(h)    (prediction)    (62.18a)

c⋆(h) ≜ sign(γ̂(h))    (decision)    (62.18b)


where γ̂(h) denotes the prediction for the class variable so that the test feature vector h can be classified as follows:

h ∈ class +1 if γ̂(h) ≥ 0;    h ∈ class −1 if γ̂(h) < 0    (62.19)

Observe from (62.18a)–(62.18b) that the ultimate classifier c⋆(h) is expressed in the form of (the sign of) an additive model; hence the name "ensemble learning." Each step of the AdaBoost procedure selects one weak classifier to add to the ensemble. Figure 62.2 illustrates the structure of the AdaBoost classifier. It consists of a bank of weak classifiers, {c_t⋆(h)}, with each one of them resulting in a ±1 binary decision. These binary decisions are combined using the scalars {α⋆(t)} to arrive at the ultimate decision, c⋆(h). In summary, we arrive at the following listing for the AdaBoost algorithm.

AdaBoost algorithm.

given N training data points {γ(n), hn};
(training)
start with L weak classifiers, {c_ℓ(h)};
set d₁(n) = 1/N, n = 0, 1, . . . , N − 1.
repeat t = 1, 2, . . . , L:
    M_ℓ = set of indices misclassified by c_ℓ(h)
    E(ℓ) = ∑_{n∈M_ℓ} d_t(n), ℓ = 1, 2, . . . , L
    c_t⋆ = argmin_{1≤ℓ≤L} E(ℓ)
    α⋆(t) = (1/2) ln( (1 − E⋆(t)) / E⋆(t) )
    d_{t+1}(n) = d_t(n) e^{−α⋆(t) γ(n) c_t⋆(hn)}, n = 0, 1, 2, . . . , N − 1
    β(t) = ∑_{n=0}^{N−1} d_{t+1}(n)
    d_{t+1}(n) = d_{t+1}(n)/β(t)
end
(classification)
γ̂(h) = ∑_{t=1}^{L} α⋆(t) c_t⋆(h)
c⋆(h) = sign(γ̂(h)).
    (62.20)
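A direct Python transcription of listing (62.20) is sketched below. It assumes the weak classifiers are supplied as callables returning ±1 decisions and that they have already been trained on the full data, as in the text; the small guard on E⋆(t) is an implementation detail added to avoid division by zero for a perfect weak classifier.

import numpy as np

def adaboost_train(H, gamma, weak_classifiers, T=None):
    """Listing (62.20). `weak_classifiers` is a list of callables c(H) -> +/-1
    decisions; T iterations (defaults to the number of weak classifiers)."""
    N = len(gamma)
    T = T or len(weak_classifiers)
    d = np.full(N, 1.0 / N)                                  # (62.13)
    preds = np.array([c(H) for c in weak_classifiers])       # L x N decisions
    selected, alphas = [], []
    for _ in range(T):
        errs = np.array([(d * (p != gamma)).sum() for p in preds])   # (62.14)
        best = int(np.argmin(errs))
        E = errs[best]
        if E >= 0.5:                                         # no usable weak classifier left
            break
        alpha = 0.5 * np.log((1.0 - E) / max(E, 1e-12))      # (62.16)
        d = d * np.exp(-alpha * gamma * preds[best])
        d = d / d.sum()                                      # (62.17a)-(62.17b)
        selected.append(weak_classifiers[best])
        alphas.append(alpha)
    return selected, alphas

def adaboost_predict(selected, alphas, H):
    agg = sum(a * c(H) for a, c in zip(alphas, selected))    # (62.18a)
    return np.sign(agg)                                      # (62.18b)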

We establish in Prob. 62.8 one useful property for the AdaBoost construction, namely, that its empirical error rate over the training data improves exponentially fast with the number of weak classifiers. If we let the variable

ν⋆(t) = 1/2 − E⋆(t) > 0    (62.21)


Figure 62.2 Structure of the AdaBoost classifier; it consists of a bank of L weak classifiers combined through the scalars {α⋆(t)}.

denote the margin of each individual classifier, c_t⋆(h), from 1/2, then we show in the problem that the empirical error rate for the AdaBoost classifier c⋆(h) is bounded by:

R_emp(c⋆) ≤ exp( −2 ∑_{t=1}^{L} (ν⋆(t))² )    (62.22)

The bound in this expression decreases exponentially with L. Moreover, weak classifiers with larger ν⋆(t) (i.e., smaller errors, E⋆(t)) cause the bound to decrease faster than other classifiers.

Example 62.1 (Boosting a collection of threshold-based classifiers) We illustrate the AdaBoost construction by considering the example shown in Fig. 62.3. The figure shows N = 100 randomly generated feature vectors hn ∈ IR² in the region [−0.5, 0.5] × [−0.5, 0.5] belonging to one of two classes, γ ∈ {±1} (the discs belong to class +1). We denote the (x, y)-coordinates of a generic feature vector h by h = [h₁ h₂]. Our objective is to build an aggregate classifier that enhances the performance of a collection of L = 14 weak classifiers. These latter classifiers are chosen to be threshold-based, i.e., they focus on individual x or y coordinates of the feature vector and decide on whether the sample belongs to one class or another by comparing the coordinate against some threshold values.

Figure 62.3 The plot shows the N = 100 training feature vectors used to construct an AdaBoost classifier.

The thresholds used by the x- and y-domain classifiers are given by:

θ_x ∈ {−0.375, −0.25, −0.125, 0, 0.125, 0.25, 0.375}    (62.23a)
θ_y ∈ {−0.375, −0.25, −0.125, 0, 0.125, 0.25, 0.375}    (62.23b)

We therefore have seven x-domain classifiers (denoted by {c₁(h), . . . , c₇(h)}) and seven y-domain classifiers (denoted by {c₈(h), . . . , c₁₄(h)}). The classifiers discriminate the data according to the following rules:

x-axis thresholds:
    c₁(h) = −sign(h₁ + 0.375)
    c₂(h) = sign(h₁ + 0.25)
    c₃(h) = sign(h₁ + 0.125)
    c₄(h) = sign(h₁)
    c₅(h) = sign(h₁ − 0.125)
    c₆(h) = sign(h₁ − 0.25)
    c₇(h) = sign(h₁ − 0.375)    (62.24a)

= = = = = = =

−sign(h2 + 0.375) −sign(h2 + 0.25) −sign(h2 + 0.125) −sign(h2 ) −sign(h2 − 0.125) sign(h2 − 0.25) sign(h2 − 0.375)

(62.24b)

For example, classifier c1 (h) decides in favor of γ(h) = −1 if h1 ≥ −0.375 and in favor of γ(h) = +1 if h1 < −0.375. Note the minus sign in the definition of c1 (h); the minus signs are included for various classifiers to ensure that they result in weak classifiers for this example – see the second column of Table 62.1. The first step of the boosting implementation involves running these classifiers on the training data. We evaluate both their empirical error rates, Remp (c` ), and their weighted empirical errors, E(`). The results for this first step are shown in Table 62.1. It is seen that classifier c6 (h) results in the smallest weighted error, E(`), and, therefore, the first selection is: c?1 (h) = c6 (h),

α? (1) = 1.099

(62.25)

62.2 AdaBoost Classifier

2567

Table 62.1 Empirical error rates and weighted errors after the first iteration, t = 1. Classifier, c` (h)

Empirical error rate (%), Remp (c)

Weighted error, E(`)

c1 (h) c2 (h) c3 (h) c4 (h) c5 (h) c6 (h) c7 (h) c8 (h) c9 (h) c10 (h) c11 (h) c12 (h) c13 (h) c14 (h)

45% 41% 40% 34% 18% 10% 21% 26% 24% 32% 42% 49% 48% 41%

0.4500 0.4100 0.4000 0.3400 0.1800 0.1000 0.2100 0.2600 0.2400 0.3200 0.4200 0.4900 0.4800 0.4100

The weighting factors, {dn (t)}, are now adjusted according to (62.17a) and the procedure is repeated. During the second step, we evaluate the weighted empirical errors, E(`). The results for this second step are shown in Table 62.2. It is seen that classifier c9 (h) now results in the smallest weighted error, E(`), and, therefore, the second selection is: c?2 (h) = c9 (h), α? (2) = 0.9359 (62.26)

Table 62.2 Weighted empirical errors after the second iteration, t = 2. Classifier, c` (h)

Weighted error, E(`)

c1 (h) c2 (h) c3 (h) c4 (h) c5 (h) c6 (h) c7 (h) c8 (h) c9 (h) c10 (h) c11 (h) c12 (h) c13 (h) c14 (h)

0.6944 0.2278 0.4000 0.5444 0.4556 0.5000 0.5611 0.2333 0.1333 0.1778 0.2333 0.2722 0.7111 0.6722

Continuing in this manner, we arrive at the list of optimal selections {c?t (h)} shown in Table 62.3 along with their weighting factors. The aggregate classifier is subsequently constructed as ! L X ∆ ? ? ? c (h) = sign α (t)ct (h) (62.27) t=1

The table also lists the original weak classifiers that correspond to the various optimal choices, {c?t (h)}. It is seen in this example that the solution (62.27) depends exclusively


on classifiers {c₂(h), c₆(h), c₉(h)}. If we add the corresponding weighting coefficients, we can simplify expression (62.27) to the following form:

c⋆(h) ≜ sign( 3.1523 c₂(h) + 4.0976 c₆(h) + 3.9012 c₉(h) )    (62.28)

If we now apply this classifier to a fine grid over the region [−0.5, 0.5] × [−0.5, 0.5], we obtain the colored regions shown in Fig. 62.4: the light color corresponds to class γ = +1, and the darker color corresponds to class γ = −1. The resulting empirical error for this classifier on the training data is found to be zero. Figure 62.5 shows the discrimination regions for the three classifiers {c₂(h), c₆(h), c₉(h)} that were selected by the boosting procedure. The blue areas indicate the regions that correspond to class +1 for each classifier; the regions corresponding to class −1 are left uncolored.

Table 62.3 Optimal classifier selections at the successive t = 1, 2, . . . , 14 steps with the corresponding weight factors, α⋆(t).

Classifier, c_t⋆(h) | Weighting factor, α⋆(t) | Original classifier
c₁⋆(h)   | 1.0986 | c₆(h)
c₂⋆(h)   | 0.9395 | c₉(h)
c₃⋆(h)   | 0.9443 | c₂(h)
c₄⋆(h)   | 0.8069 | c₆(h)
c₅⋆(h)   | 0.7838 | c₉(h)
c₆⋆(h)   | 0.7543 | c₂(h)
c₇⋆(h)   | 0.7431 | c₆(h)
c₈⋆(h)   | 0.7342 | c₉(h)
c₉⋆(h)   | 0.7296 | c₂(h)
c₁₀⋆(h)  | 0.7265 | c₆(h)
c₁₁⋆(h)  | 0.7248 | c₉(h)
c₁₂⋆(h)  | 0.7236 | c₂(h)
c₁₃⋆(h)  | 0.7229 | c₆(h)
c₁₄⋆(h)  | 0.7225 | c₉(h)

Figure 62.4 The plot shows the resulting discrimination region for the AdaBoost implementation that is obtained from examining L = 14 weak classifiers chosen to be threshold-based. These classifiers focus on individual coordinates of the feature vectors and decide on whether a feature vector belongs to one class or another by examining whether the x- or y-coordinate exceeds some threshold value or not.


Figure 62.5 The top plot shows the N = 100 training feature vectors. The bottom three plots show the discrimination regions for the three classifiers {c2 (h), c6 (h), c9 (h)} that were selected by the boosting procedure. The colored areas indicate the regions that correspond to class +1 for each classifier; the regions corresponding to class −1 are left uncolored.

62.2.2 Derivation of AdaBoost

The AdaBoost algorithm can be motivated in many ways. We follow one derivation by induction as follows. Assume by the end of iteration t − 1 we have identified the optimal classifiers {c₁⋆(h), c₂⋆(h), . . . , c_{t−1}⋆(h)} and their scaling weights {α⋆(1), α⋆(2), . . . , α⋆(t − 1)}. If we were to stop at this iteration, then these classifiers would result in the aggregate construction:

γ̂^{(t−1)}(h) ≜ ∑_{s=1}^{t−1} α⋆(s) c_s⋆(h)    (62.29)

where we are using the superscript (t − 1) to refer to the aggregate result; the corresponding classifier is denoted by c^{(t−1)}(h) and will be given by

c^{(t−1)}(h) ≜ sign( γ̂^{(t−1)}(h) )    (62.30)


We associate with c^{(t−1)}(h) the following exponential risk:

P(c^{(t−1)}) ≜ E e^{−γ γ̂^{(t−1)}(h)}    (62.31)

over the joint distribution of the data (γ, h). Under an ergodicity assumption on the data, we can approximate the risk by its empirical value computed from the training data as (we continue to use the same notation for the risk for convenience):

P(c^{(t−1)}) = (1/N) ∑_{n=0}^{N−1} e^{−γ(n) γ̂^{(t−1)}(hn)}    (62.32)

We continue to iteration t. We would like to select the next classifier, denoted generically by c_t(h), and its associated weight α(t). The updated aggregate classifier will become:

γ̂^{(t)}(h) = ∑_{s=1}^{t−1} α⋆(s) c_s⋆(h) + α(t) c_t(h)
           = γ̂^{(t−1)}(h) + α(t) c_t(h)    (62.33)

We formulate a design problem to identify the optimal selections for α(t) and c_t(h). For this purpose, we consider

P(c^{(t)}) = (1/N) ∑_{n=0}^{N−1} e^{−γ(n) γ̂^{(t)}(hn)}
           = (1/N) ∑_{n=0}^{N−1} ( e^{−γ(n) γ̂^{(t−1)}(hn)} ) e^{−γ(n) α(t) c_t(hn)}
           = (1/N) ∑_{n=0}^{N−1} τ_t(n) e^{−γ(n) α(t) c_t(hn)}    (62.34)

where we introduced the weighting coefficients:

τ_t(n) ≜ e^{−γ(n) γ̂^{(t−1)}(hn)},    n = 0, 1, . . . , N − 1    (62.35)

with boundary conditions at t = 1:

τ₁(n) = 1,    n = 0, 1, . . . , N − 1    (62.36)

To continue, we split the sum in (62.34) into two components: one involving misclassified feature vectors for which γ(n)c_t(hn) = −1 and the other involving correctly classified feature vectors for which γ(n)c_t(hn) = +1. We denote the index set of misclassified data by M_t. Then,

P(c^{(t)}) = (1/N) { ∑_{n∉M_t} τ_t(n) e^{−α(t)} + ∑_{n∈M_t} τ_t(n) e^{α(t)} }
           = (1/N) { ∑_{n=0}^{N−1} τ_t(n) e^{−α(t)} + ( e^{α(t)} − e^{−α(t)} ) ∑_{n∈M_t} τ_t(n) }
           = (e^{−α(t)}/N) { ∑_{n=0}^{N−1} τ_t(n) + ( e^{2α(t)} − 1 ) ∑_{n∈M_t} τ_t(n) }    (62.37)

It is clear that only the rightmost term depends on the unknown classifier c_t(h) through the set M_t. We conclude that, at iteration t, the optimal choice for c_t(h) from among the weak classifiers is the one that minimizes the sum:

c_t⋆(h) = argmin_{1≤ℓ≤L} { ∑_{n∈M_ℓ} τ_t(n) }    (62.38)

It can be verified that this optimization problem is equivalent to criterion (62.15) – see Prob. 62.9. To determine the optimal value for α(t), we differentiate the first line in expression (62.37) with respect to α and set the derivative to zero to find that

(1/N) ( −∑_{n∉M_t} τ_t(n) e^{−α⋆(t)} + ∑_{n∈M_t} τ_t(n) e^{α⋆(t)} ) = 0    (62.39)

which leads to

α⋆(t) = (1/2) ln( ∑_{n∉M_t} τ_t(n) / ∑_{n∈M_t} τ_t(n) )    (62.40)

Now let

E⋆(t) ≜ ∑_{n∈M_t} τ_t(n) / ∑_{n=0}^{N−1} τ_t(n)    (62.41)

It can again be verified that this expression agrees with definition (62.14) – see Prob. 62.9. Then, we can express α⋆(t) in terms of E⋆(t), as was the case in (62.16). Note finally from expression (62.35) that, once the optimal values α⋆(t) and c_t⋆(h) have been determined, we get

τ_{t+1}(n) ≜ e^{−γ(n) γ̂^{(t)}(hn)} = τ_t(n) e^{−α⋆(t) γ(n) c_t⋆(hn)}    (using (62.33))    (62.42)

Without loss in generality, we can normalize the scaling factors τ_t(n) in (62.34) to add up to 1. In order to ensure that this normalization property propagates to iteration t + 1 we scale the above expression for τ_{t+1}(n) and use instead

τ_{t+1}(n) = (1/β(t)) τ_t(n) e^{−γ(n) α⋆(t) c_t⋆(hn)}    (62.43a)


where

β(t) ≜ ∑_{n=0}^{N−1} τ_t(n) e^{−α⋆(t) γ(n) c_t⋆(hn)}    (62.43b)

Comparing (62.43a) with (62.17a), we see that the variable τt (n) agrees with dt (n).

62.3 GRADIENT BOOSTING

The AdaBoost construction is a special case of a more general formulation that includes several variants, as we proceed to explain. Consider again a classification context where we are given N data pairs {γ(n), hn}, where hn ∈ IR^M are feature vectors and γ(n) are the class variables. We assume there are two classes so that γ(n) ∈ {±1}. A generic classifier, denoted by c(h), is a transformation that maps a feature vector h into a class value:

c(h) : IR^M → {±1}    (62.44)

One empirical way to assess the performance of any classifier is to count how many erroneous decisions it generates on the training data:

R_emp(c) ≜ (1/N) ∑_{n=0}^{N−1} I[ γ(n) ≠ c(hn) ]    (62.45)

We explained earlier that a classifier c(h) is weak if its empirical error rate is only slightly better than random guessing, i.e., R_emp(c) = 0.5 − ε. Gradient boosting will help aggregate the decisions of multiple weak learners and construct from them a more reliable decision structure. Thus, assume we have a collection of L weak classifiers

C = { c₁(h), c₂(h), . . . , c_L(h) }    (62.46)

where the value of L can be large, even larger than the data size, N. We would like to combine these learners into a more powerful classifier with stronger predictive abilities, i.e., we would like to determine combination coefficients {α(ℓ)} to construct a prediction for the class variable γ(h) as follows:

γ̂(h) ≜ ∑_{ℓ=1}^{L} α(ℓ) c_ℓ(h)    (62.47)

Note again that we are writing γ(h) or γ̂(h), with argument h, to refer to the class variable that corresponds to a generic feature vector h. When h happens to be a feature vector from the training data, say h = hn for some n, we will instead write γ(n) or γ̂(n) with argument n. The coefficients {α(ℓ)} in (62.47) will be determined according to some optimality criterion; and many of these


coefficients can be zero so that some classifiers may end up being excluded from the combination. We collect the coefficients {α(ℓ)} into a column vector:

w ≜ col{ α(1), α(2), . . . , α(L) } ∈ IR^L    (62.48)

One tractable way to assess the performance of construction (62.47) is to use some loss function and add its value over the entire training data. This amounts to using some surrogate risk in place of the 0/1 risk defined by (62.45), say,

P(w) ≜ (1/N) ∑_{n=0}^{N−1} Q( γ(n), γ̂(n) )    (62.49)

where γ̂(n) is a function of w, and Q(·, ·) refers to some loss function, assumed convex. Popular choices include:

Q(γ, γ̂) =
    (1/2)(γ − γ̂)²       (quadratic loss)
    e^{−γ γ̂}             (exponential loss)
    max(0, 1 − γ γ̂)      (hinge loss)
    ln(1 + e^{−γ γ̂})     (logistic loss)
    |γ − γ̂|              (absolute loss)    (62.50)

as well as the following Huber loss, which has useful robustness properties to outliers in the data:

Q(γ, γ̂) = (1/2)(γ − γ̂)², when |γ − γ̂| ≤ δ;    δ( |γ − γ̂| − δ/2 ), when |γ − γ̂| > δ    (62.51)

for some threshold parameter δ > 0.
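For reference, the losses in (62.50)–(62.51) and their derivatives with respect to the prediction γ̂ (the quantities needed, with a minus sign, for the functional gradients (62.60) further ahead) can be written compactly as follows; the values chosen at the kinks of the hinge, absolute, and Huber losses follow an arbitrary subgradient convention.

import numpy as np

def quadratic(g, gh):    return 0.5 * (g - gh) ** 2
def exponential(g, gh):  return np.exp(-g * gh)
def hinge(g, gh):        return np.maximum(0.0, 1.0 - g * gh)
def logistic(g, gh):     return np.log1p(np.exp(-g * gh))
def absolute(g, gh):     return np.abs(g - gh)

def huber(g, gh, delta=1.0):
    r = np.abs(g - gh)
    return np.where(r <= delta, 0.5 * (g - gh) ** 2, delta * (r - delta / 2.0))

# derivatives with respect to the prediction gh (one arbitrary subgradient is
# used at the nondifferentiable points)
def d_quadratic(g, gh):   return gh - g
def d_exponential(g, gh): return -g * np.exp(-g * gh)
def d_hinge(g, gh):       return np.where(g * gh < 1.0, -g, 0.0)
def d_logistic(g, gh):    return -g / (1.0 + np.exp(g * gh))
def d_absolute(g, gh):    return -np.sign(g - gh)

def d_huber(g, gh, delta=1.0):
    r = gh - g
    return np.where(np.abs(r) <= delta, r, delta * np.sign(r))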

First attempt based on gradient-descent

One way to determine the combination coefficients {α(ℓ)} is to minimize (62.49) over w. This step amounts to solving a parametric optimization problem, one that is parameterized by w. A (batch) gradient-descent implementation would start from an initial value w₋₁ and move repeatedly along the direction of the negative gradient of P(w). If, for every feature vector hn, we collect the classifier outputs into a column vector, say,

c_n ≜ col{ c₁(hn), c₂(hn), . . . , c_L(hn) }    (62.52)

then this recursive construction would involve updates of the following form for t ≥ 0 (assuming, for simplicity, differentiable loss functions):

γ̂(n) = c_n^T w_{t−1},    n = 0, 1, 2, . . . , N − 1
w_t = w_{t−1} − µ ( (1/N) ∑_{n=0}^{N−1} ∇_w Q( γ(n), γ̂(n) ) )    (62.53)

in terms of the gradient vector of Q(·, ·) with respect to w. Here, µ > 0 is some small step-size parameter; if desired, its value can be optimized via a line search procedure or via cross validation. The gradient-descent solution (62.53) is


computationally demanding, especially for large L, since it involves continually updating an iterate wt of size L. The solution can also lead to overfitting.

Alternative approach based on a greedy strategy

Gradient boosting provides an alternative approach that reduces the computational complexity dramatically. This is achieved by applying a greedy strategy, where the problem of updating the coefficients {α(ℓ)} simultaneously, as happens when we update w_t via (62.53), is replaced by updating one α(ℓ) at a time. The idea is to replace the parametric gradient-descent construction (62.53) by a functional gradient-descent implementation. In this alternative viewpoint, we do not treat the L coefficients {α(ℓ)} or their weight vector w as the unknown. Instead, we treat the function γ̂(h) as the unknown and assume an ensemble model for it. Recall that γ̂(h) is a function of the feature space, h ∈ IR^M:

γ̂(h) : IR^M → IR    (62.54)

The functional gradient-descent construction can be motivated as follows. Assume that by the end of iteration t − 1 we have already succeeded in selecting the first t − 1 classifiers, denoted by {c₁⋆(h), c₂⋆(h), . . . , c_{t−1}⋆(h)}, and their nonnegative scaling weights {α⋆(1), α⋆(2), . . . , α⋆(t − 1)}. If we were to stop at this iteration, then these classifiers would result in the following prediction for γ(h):

γ̂^{(t−1)}(h) = ∑_{s=1}^{t−1} α⋆(s) c_s⋆(h)    (62.55)

where we are using the superscript (t − 1) to refer to the aggregate result at stage t − 1. The empirical risk value that is associated with this aggregate prediction, and which is evaluated over the training data {γ(n), hn}, is given by

P( γ̂^{(t−1)}(h) ) ≜ (1/N) ∑_{n=0}^{N−1} Q( γ(n), γ̂^{(t−1)}(n) )    (62.56)

Note that, for emphasis, we are denoting the argument of the empirical risk P(·) by γ̂^{(t−1)}(h). We continue to iteration t. Motivated by the desired form (62.47), we would like to select the next classifier, denoted generically by c_t(h), and its associated nonnegative weight, α(t), in order to enlarge the aggregate prediction from iteration t − 1 by adding one more term to it as follows:

γ̂^{(t)}(h) = ∑_{s=1}^{t−1} α⋆(s) c_s⋆(h) + α(t) c_t(h)    (62.57)

That is, we would like the update for γ b(t) (h) to be of the form: γ b(t) (h) = γ b(t−1) (h) + α(t)ct (h)

(62.58)


which involves correcting the previous construction by adding α(t)c_t(h). In order to determine the optimal choices {c⋆_t(h), α⋆(t)}, we examine the form of the empirical risk at iteration t, namely,

    P(γ̂^{(t)}(h)) ≜ (1/N) Σ_{n=0}^{N−1} Q( γ(n), γ̂^{(t)}(n) )        (62.59)

We introduce, for each data point n = 0, 1, . . . , N − 1, the gradients of the loss function relative to its second argument:

    g_t(n) ≜ − ∂Q(γ(n), γ̂(n)) / ∂γ̂(n) |_{γ̂(n) = γ̂^{(t−1)}(n)}   (a scalar)        (62.60)

and evaluate them at the prior prediction, γ̂^{(t−1)}(n). Gradient boosting selects c_t(h) optimally by setting it to c⋆_t(h) = c_{ℓ⋆}(h), where the optimal index ℓ⋆ is obtained by solving the least-squares problem:

    ℓ⋆ = argmin_{ℓ, β} (1/N) Σ_{n=0}^{N−1} ( g_t(n) − β c_ℓ(h_n) )^2,   β ∈ IR        (62.61)

The scalar β is an auxiliary parameter and its optimal value is easily seen to be

    β⋆ = ( Σ_{n=0}^{N−1} g_t(n) c_ℓ(h_n) ) / ( Σ_{n=0}^{N−1} c_ℓ^2(h_n) )        (62.62)

If desired, the value of β in (62.61) can be set to 1. Once c⋆_t(h) is selected, we then choose α(t) in order to result in the steepest decline in the value of the empirical risk, namely,

    α⋆(t) = argmin_{α ≥ 0} (1/N) Σ_{n=0}^{N−1} Q( γ(n), γ̂^{(t−1)}(n) + α c⋆_t(h_n) )        (62.63)

where α is now playing the role of a step-size parameter in the functional gradient-descent domain. The reason for using α c⋆_t(h_n) as the correction term to γ̂^{(t−1)}(n) in the above expression, instead of α g_t(n), is that construction (62.58) requires the correction to be based on the classifiers. Step (62.61) therefore determines the "closest" approximation for the g_t(n) in terms of the c_ℓ(h_n). With {c⋆_t(h), α⋆(t)} so determined, we return to the update (62.58) and write it in terms of these optimal choices:

    γ̂^{(t)}(h) = γ̂^{(t−1)}(h) + α⋆(t) c⋆_t(h)        (62.64)

In summary, we arrive at the listing below for the gradient boosting algorithm. The initial prediction values over the training data, which we denote by γ̂^{(0)}(n), can be chosen in several ways, such as selecting some arbitrary small initial weights {α^{(0)}(ℓ)} and then setting

    γ̂^{(0)}(n) = Σ_{ℓ=1}^{L} α^{(0)}(ℓ) c_ℓ(h_n),   n = 0, 1, . . . , N − 1        (62.65)


A second way is to fix γ̂^{(0)}(n) = γ_c for all n, i.e., to fix the initial prediction values to some constant label for all training data. Then, select the value of this constant by solving

    γ_c = argmin_{γ̂} (1/N) Σ_{n=0}^{N−1} Q(γ(n), γ̂)        (62.66)

Gradient boosting algorithm for generic loss functions.

    given N data pairs {γ(n), h_n};
    given L weak classifiers {c_ℓ(h)};
    choose initial values γ̂^{(0)}(n), n = 0, 1, . . . , N − 1.
    repeat t = 1, 2, . . . , L:
        g_t(n) = − ∂Q(γ(n), γ̂(n)) / ∂γ̂(n) |_{γ̂(n) = γ̂^{(t−1)}(n)}
        {ℓ⋆, β⋆} = argmin_{ℓ, β} (1/N) Σ_{n=0}^{N−1} ( g_t(n) − β c_ℓ(h_n) )^2
        c⋆_t(h) = c_{ℓ⋆}(h)
        α⋆(t) = argmin_{α ≥ 0} (1/N) Σ_{n=0}^{N−1} Q( γ(n), γ̂^{(t−1)}(n) + α c⋆_t(h_n) )
        γ̂^{(t)}(h) = γ̂^{(t−1)}(h) + α⋆(t) c⋆_t(h)
    end
    (classification)
    γ̂(h) = Σ_{t=1}^{L} α⋆(t) c⋆_t(h)
    c⋆(h) = sign(γ̂(h)).        (62.67)
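The listing translates almost directly into code. The following is a minimal Python sketch of the generic loop (62.67), added for illustration only: it assumes the weak classifiers are given as a list of functions returning ±1 labels, uses the quadratic loss for concreteness, and approximates the line search for α⋆(t) by a grid search rather than an exact minimization.

```python
import numpy as np

def gradient_boost(X, y, weak_clfs, T=None, alphas_grid=None):
    """Sketch of the generic gradient boosting loop (62.67).
    X: (N, M) features; y: (N,) labels in {-1, +1};
    weak_clfs: list of callables c(X) -> (N,) array of +/-1 predictions."""
    N = len(y)
    T = T or len(weak_clfs)
    alphas_grid = alphas_grid if alphas_grid is not None else np.linspace(0.0, 2.0, 201)
    yhat = np.zeros(N)                       # initial predictions gamma_hat^{(0)}(n) = 0
    selected, weights = [], []
    for t in range(T):
        g = y - yhat                         # negative gradient of quadratic loss (assumption)
        # pick the weak classifier (and scaling beta) closest to the gradients, cf. (62.61)-(62.62)
        best = None
        for c in weak_clfs:
            cn = c(X)
            beta = (g @ cn) / (cn @ cn)      # closed-form beta* from (62.62)
            err = np.mean((g - beta * cn) ** 2)
            if best is None or err < best[0]:
                best = (err, c, cn)
        _, c_star, cn = best
        # grid search over alpha, approximating (62.63)
        curve = [np.mean(0.5 * (y - (yhat + a * cn)) ** 2) for a in alphas_grid]
        alpha = alphas_grid[int(np.argmin(curve))]
        yhat = yhat + alpha * cn             # functional update (62.64)
        selected.append(c_star)
        weights.append(alpha)
    return lambda Xnew: np.sign(sum(a * c(Xnew) for a, c in zip(weights, selected)))
```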

We consider next different choices for the loss function Q(γ, γ̂) and show how the gradient boosting algorithm reduces to special cases such as AdaBoost (for the exponential loss) or L2Boost (for the square loss).

Exponential loss: AdaBoost algorithm
Consider first the exponential loss

    Q(γ, γ̂) = e^{−γ γ̂}        (62.68)

which is associated with AdaBoost learning. To explain the simplifications that occur in the gradient boosting algorithm (62.67), we start by noting that the negative gradients are given by

    g_t(n) = − ∂Q(γ(n), γ̂(n)) / ∂γ̂(n) |_{γ̂(n) = γ̂^{(t−1)}(n)} = γ(n) τ_t(n)        (62.69)


where we are introducing the scalar variable

    τ_t(n) ≜ e^{−γ(n) γ̂^{(t−1)}(n)}        (62.70)

Let M_ℓ denote the set of indices of the training data that are misclassified by classifier c_ℓ(·), i.e., M_ℓ contains all indices n for which

    γ(n) c_ℓ(h_n) = −1,   n ∈ M_ℓ        (62.71)

Then, the choice of the optimal classifier at stage t is given by (the result is independent of the value of the scaling variable β so we set β = 1):

    ℓ⋆ = argmin_{1≤ℓ≤L} Σ_{n=0}^{N−1} ( γ(n) τ_t(n) − c_ℓ(h_n) )^2
    (a)= argmin_{1≤ℓ≤L} Σ_{n=0}^{N−1} ( τ_t(n) − γ(n) c_ℓ(h_n) )^2
    (b)= argmin_{1≤ℓ≤L} Σ_{n=0}^{N−1} ( − τ_t(n) γ(n) c_ℓ(h_n) )
       = argmin_{1≤ℓ≤L} { Σ_{n∈M_ℓ} τ_t(n) − Σ_{n∉M_ℓ} τ_t(n) }
       = argmin_{1≤ℓ≤L} Σ_{n∈M_ℓ} τ_t(n)        (62.72)

where step (a) is because γ(n) ∈ {±1}, step (b) is because the discarded sum is independent of ℓ since c_ℓ(h_n) ∈ {±1}, and the last step is because Σ_n τ_t(n) is a constant independent of ℓ. Result (62.72) indicates that the optimal classifier ℓ⋆ is selected as the one that results in the smallest sum of weights τ_t(n) over the misclassified data, i.e.,

    ℓ⋆ = argmin_{1≤ℓ≤L} Σ_{n=0}^{N−1} τ_t(n) I[ c_ℓ(h_n) ≠ γ(n) ]        (62.73)

which agrees with (62.38). Likewise, if we differentiate the right-hand side of (62.63) over α and set the gradient to zero, we arrive at the same relation (62.39) – see Prob. 62.11.
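As a quick illustration of (62.70)–(62.73), the snippet below (a sketch I added, with assumed array names) computes the weights τ_t(n) from the current aggregate prediction and picks the weak classifier with the smallest weighted misclassification sum.

```python
import numpy as np

def adaboost_select(y, yhat_prev, clf_preds):
    """y: (N,) labels in {-1,+1}; yhat_prev: (N,) aggregate prediction gamma_hat^{(t-1)}(n);
    clf_preds: (L, N) array whose row ell holds c_ell(h_n) in {-1,+1}."""
    tau = np.exp(-y * yhat_prev)                 # weights tau_t(n), (62.70)
    miscl = (clf_preds != y)                     # indicator c_ell(h_n) != gamma(n)
    weighted_err = (miscl * tau).sum(axis=1)     # sum of tau over misclassified data, (62.73)
    return int(np.argmin(weighted_err)), tau
```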

Quadratic loss: L2Boost algorithm
Consider next the quadratic loss

    Q(γ, γ̂) = (1/2)(γ − γ̂)^2        (62.74)

Then, the negative gradient relative to γ̂ for each data point is given by

    g_t(n) = γ(n) − γ̂^{(t−1)}(n)        (62.75)

Moreover, the parameter α⋆(t) in (62.67) is given by

    α⋆(t) = argmin_α (1/N) Σ_{n=0}^{N−1} ( γ(n) − γ̂^{(t−1)}(n) − α c⋆_t(h_n) )^2
          = argmin_α (1/N) Σ_{n=0}^{N−1} ( g_t(n) − α c⋆_t(h_n) )^2        (62.76)

It follows that α⋆(t) = β⋆. We therefore arrive at listing (62.77) for the L2Boost algorithm.

L2Boost algorithm for quadratic losses.

    given N data points {γ(n), h_n};
    given L weak classifiers {c_ℓ(h)};
    choose initial values γ̂^{(0)}(n), n = 0, 1, . . . , N − 1.
    repeat t = 1, 2, . . . , L:
        g_t(n) = γ(n) − γ̂^{(t−1)}(n)
        {ℓ⋆, β⋆} = argmin_{ℓ, β} (1/N) Σ_{n=0}^{N−1} ( g_t(n) − β c_ℓ(h_n) )^2
        c⋆_t(h) = c_{ℓ⋆}(h)
        α⋆(t) = β⋆
        γ̂^{(t)}(h) = γ̂^{(t−1)}(h) + α⋆(t) c⋆_t(h)
    end
    (classification)
    γ̂(h) = Σ_{t=1}^{L} α⋆(t) c⋆_t(h)
    c⋆(h) = sign(γ̂(h)).        (62.77)
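For the quadratic loss the line search disappears: the step size equals the least-squares coefficient β⋆ from (62.62). A minimal sketch of one L2Boost iteration (my own illustration, with assumed array names) is:

```python
import numpy as np

def l2boost_step(y, yhat_prev, clf_preds):
    """One iteration of listing (62.77). clf_preds: (L, N) array of weak-classifier outputs."""
    g = y - yhat_prev                                        # residuals g_t(n), (62.75)
    betas = (clf_preds @ g) / (clf_preds ** 2).sum(axis=1)   # beta* per classifier, (62.62)
    errs = ((g - betas[:, None] * clf_preds) ** 2).mean(axis=1)
    ell = int(np.argmin(errs))
    alpha = betas[ell]                                       # alpha*(t) = beta*, (62.76)
    return ell, alpha, yhat_prev + alpha * clf_preds[ell]
```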

Logistic loss: LogitBoost algorithm
Consider now the logistic loss function

    Q(γ, γ̂) = ln(1 + e^{−γ γ̂})        (62.78)

In this case, the negative gradients are given by

    g_t(n) = − ∂Q(γ, γ̂) / ∂γ̂ |_{γ̂ = γ̂^{(t−1)}(n)} = γ(n) τ_t(n) / (1 + τ_t(n))        (62.79)

where again

    τ_t(n) = e^{−γ(n) γ̂^{(t−1)}(n)}        (62.80)

We know from the theory of logistic regression that the confidence level in the prediction is

    P( γ(n) = γ^{(t−1)}(n) | h_n = h_n ) = 1 / (1 + τ_t(n))        (62.81)


Therefore, the negative gradients can be expressed as

    g_t(n) = γ(n) τ_t(n) × P( γ(n) = γ^{(t−1)}(n) | h_n = h_n )        (62.82)

Comparing with expression (62.69) in the exponential loss case, we find that the logistic loss takes the confidence level into account in the computation of the gradient information. This is one reason why LogitBoost solutions tend to have slightly superior performance to AdaBoost, especially when the data is not well separated. Next, finding the best weak classifier is equivalent to solving (we set β = 1):

    ℓ⋆ = argmin_{1≤ℓ≤L} Σ_{n=0}^{N−1} ( τ_t(n) / (1 + τ_t(n)) ) I[ c_ℓ(h_n) ≠ γ(n) ]        (62.83)

which, in comparison to (62.72), replaces the weight τ_t by τ_t/(1 + τ_t), while α⋆(t) is obtained by solving:

    α⋆(t) = argmin_α (1/N) Σ_{n=0}^{N−1} ln( 1 + exp{ −γ(n) ( γ̂^{(t−1)}(n) + α c⋆_t(h_n) ) } )
          = argmin_α (1/N) Σ_{n=0}^{N−1} ln( 1 + τ_t(n) exp{ −α γ(n) c⋆_t(h_n) } )
          = argmin_α (1/N) { Σ_{n∈M_t} ln( 1 + τ_t(n) e^{α} ) + Σ_{n∉M_t} ln( 1 + τ_t(n) e^{−α} ) }        (62.84)

where M_t contains the indices of the samples that are misclassified by c⋆_t(h). Differentiating relative to α and setting the derivative to zero at α⋆(t), we find that the latter is the solution to the equation:

    Σ_{n∉M_t} τ_t(n) / ( τ_t(n) + e^{α⋆(t)} ) = Σ_{n∈M_t} τ_t(n) / ( τ_t(n) + e^{−α⋆(t)} )        (62.85)

We arrive at listing (62.86), where β can be set to 1. We should note that the original LogitBoost algorithm has a different form because it employs both first- and second-order gradient information (similar to a Newton-type setting).


LogitBoost algorithm for logistic losses.

    given N data points {γ(n), h_n};
    given L weak classifiers {c_ℓ(h)};
    choose initial values γ̂^{(0)}(n), n = 0, 1, . . . , N − 1.
    repeat t = 1, 2, . . . , L:
        τ_t(n) = e^{−γ(n) γ̂^{(t−1)}(n)}
        g_t(n) = γ(n) τ_t(n) / (1 + τ_t(n))
        ℓ⋆ = argmin_ℓ (1/N) Σ_{n=0}^{N−1} ( τ_t(n) / (1 + τ_t(n)) ) I[ c_ℓ(h_n) ≠ γ(n) ]
        c⋆_t(h) = c_{ℓ⋆}(h)
        solve (62.85) for α⋆(t)
        γ̂^{(t)}(h) = γ̂^{(t−1)}(h) + α⋆(t) c⋆_t(h)
    end
    (classification)
    γ̂(h) = Σ_{t=1}^{L} α⋆(t) c⋆_t(h)
    c⋆(h) = sign(γ̂(h)).        (62.86)
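Equation (62.85) has no closed form in general, but the difference between its left- and right-hand sides is monotonically decreasing in α, so a scalar root-finder suffices. The sketch below (my own illustration, not from the text) solves it by bisection, assuming the bracket [0, alpha_max] contains a sign change.

```python
import numpy as np

def logitboost_alpha(tau, misclassified, alpha_max=10.0, iters=60):
    """Solve (62.85) for alpha*(t) by bisection.
    tau: (N,) weights tau_t(n); misclassified: (N,) boolean mask for n in M_t."""
    def residual(a):
        lhs = np.sum(tau[~misclassified] / (tau[~misclassified] + np.exp(a)))
        rhs = np.sum(tau[misclassified] / (tau[misclassified] + np.exp(-a)))
        return lhs - rhs                     # decreasing in a
    lo, hi = 0.0, alpha_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if residual(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```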

62.4

COMMENTARIES AND DISCUSSION

Bagging and boosting. These techniques are examples of ensemble learning methods where classification decisions by a mixture of learners are fused together to enhance performance. In bagging, the training data is sampled randomly, using bootstrap sampling, and used to train the classifiers. During testing, the classifiers operate in parallel and a majority vote is used to arrive at the final decision. In boosting, the classifiers are weighted and combined with the objective of boosting performance at the misclassified training points. Bagging is characterized by its simplicity, while boosting attempts to combine classifiers in some optimal manner. It cannot be stated categorically that one method is superior to the other, although it has been shown that boosting leads to better classification accuracy in some instances. Accessible surveys on bagging and boosting methods are given in the articles by Dietterich (2000), Schapire (2003), Polikar (2006), Buhlmann and Hothorn (2007), Rokach (2010), and Buhlmann (2012), and in the texts by Hastie, Tibshirani, and Friedman (2009), Schapire and Freund (2012), and Zhou (2012). For overviews on bootstrap methods, the reader may refer to the texts by Efron and Tibshirani (1993) and Zoubir and Iskander (2007). Bagging was first proposed by Breiman (1994, 1996a, 1996b). Bagging classifiers are sometimes also referred to as "arcing classifiers," where the name stands for classifiers with "adaptive reweighting and combining." It was shown by Buhlmann and Yu (2002) that bagging usually reduces the error variance of the classifier – see also Buhlmann (2012) and Prob. 62.3, which is motivated by the analysis from these articles and shows that bagging performs some sort of data smoothing. Examples to the contrary showing that the error variance in bagging may not be reduced appear in Buja and Stuetzle (2006).


The original works on boosting are the articles by Schapire (1990) and Freund and Schapire (1996, 1997). According to Buhlmann and Yu (2003) and Rokach (2010), there have been some earlier forms of ensemble methods before these contributions, dealing mainly with special instances of combinations of decision sources such as the works by Tukey (1977), Dasarathy and Sheela (1979), and Hansen and Salamon (1990). However, the contributions by Breiman (1994, 1996a, 1996b) and Freund and Schapire (1996, 1997) moved ensemble learning to a new level and generated great interest in the topic. The observation that boosting can be viewed as an optimization problem in function space is due to Breiman (1998, 1999); this fact was exploited to great effect in the development of gradient boosting techniques by Mason et al. (1999), Friedman (2001, 2002), and Hastie, Tibshirani, and Friedman (2009). The presentation in Section 62.3 is motivated by the discussion in Ying and Sayed (2017).

Random forest algorithm. We described one instance of bagging in the body of the chapter in the form of the random forest algorithm. The idea of using decision forests was first proposed by Ho (1995, 1998) and motivated by an application to handwritten digit recognition. In these works, randomization was achieved through the random selection of feature attributes during training; bagging was not involved. A related algorithm that employs a collection of random trees was independently proposed by Amit and Geman (1997). The random forests algorithm that we described in the chapter was proposed by Breiman (2001), who suggested incorporating bagging in addition to random feature selection. Additional information on random forests can be found in Hastie, Tibshirani, and Friedman (2009) and Criminisi, Shotton, and Konukoglu (2011).

AdaBoost. The idea of boosting was introduced by Schapire (1990), who was motivated by the earlier work of Kearns and Valiant (1988, 1989), where the concept of weak learners was introduced; these are learners whose decisions are only slightly better than pure chance or random guessing. One fundamental problem posed in Kearns and Valiant (1988, 1989) is whether the existence of weak learners implies the existence of strong learners. In other words, their work inquired whether it is possible to transform a collection of weak learners into a learning machine with a higher decision accuracy most of the time. Schapire (1990) proposed the use of boosting techniques to answer this question, where a collection of weak learners are combined in a weighted manner to boost performance over misclassified data. Subsequently, several works exploited boosting in designing strong classifiers, including the works by Drucker, Schapire, and Simard (1993), Drucker et al. (1994), Drucker and Cortes (1994), and Drucker (1997). However, the constructions in these works were implicit and it was not until the work by Freund and Schapire (1996, 1997) that an explicit algorithm, known as AdaBoost, was developed to help transform a mixture of weak learners into a strong learning algorithm. Subsequent work by Schapire and Singer (1999) provided a derivation for AdaBoost by relating it to the minimization of the exponential loss function, as was carried out after expression (62.31) in the body of the chapter motivated by the streamlined presentation given by Rojas (2009).
For further information on the boosting method and its applications, the reader may refer to the articles by Opitz and Maclin (1999), Friedman (2001), Buhlmann and Yu (2003), Breiman (2004), Zenko (2004), Brown et al. (2005), and Bartlett and Traskin (2007), and to the texts by Hastie, Tibshirani, and Friedman (2009), Schapire and Freund (2012), and Zhou (2012).

PROBLEMS

62.1 Assume a classifier c(h) has an empirical error rate larger than 50% on a training set of size N. Is this a weak classifier? How do you generate a weak classifier using c(h)?
62.2 Assume we have L estimators for a random variable x, denoted by {x̂_ℓ} for ℓ = 1, 2, . . . , L. Assume the errors x̃_ℓ = x − x̂_ℓ have zero mean with uniform variance σ_e^2 and correlation r = E x̃_ℓ x̃_k for all ℓ ≠ k. Consider the ensemble estimator x̂ = (1/L) Σ_{ℓ=1}^{L} x̂_ℓ. Show that its error variance is given by

    E (x − x̂)^2 = (1/L) σ_e^2 + ((L − 1)/L) r

62.3 Consider a collection of N iid random variables, {y(n)}, with mean ȳ and variance σ_y^2. Using bootstrap sampling (i.e., sampling with replacement), a set of N training points is selected randomly from this collection, denoted by {y^{(ℓ)}(n)}, and used to compute a prediction variable as follows:

    γ̂_ℓ ≜ (1/N) Σ_{n=1}^{N} y^{(ℓ)}(n)

Subsequently, a classifier c⋆_ℓ(h) makes the decision c⋆_ℓ = I[γ̂_ℓ ≤ a] for some constant a. This process is repeated for L classifiers and a bagging classifier is constructed as

    c⋆(h) = (1/L) Σ_{ℓ=1}^{L} c⋆_ℓ(h)

We assume L is large enough so that the above ensemble average can be replaced by c⋆ = E c⋆_ℓ.
(a) Use the central limit theorem to conclude that the normalized random variable √N (γ̂_ℓ − ȳ) tends to the Gaussian distribution N_{γ̂_ℓ}(0, σ_y^2) as N → ∞.
(b) Assume a is selected close enough to ȳ, say, a − ȳ = β σ_y / √N, for some constant β. Conclude that for N large enough, the classifier c⋆_ℓ(h) tends to the hard-decision rule I[z ≤ β], where z ∼ N_z(0, 1).
(c) Conclude that c⋆(h) tends to the soft-decision rule c⋆(h) = Φ(β), where Φ(x) denotes the cumulative distribution of the standard normal distribution, N_x(0, 1). That is,

    Φ(x) ≜ (1/√(2π)) ∫_{−∞}^{x} e^{−τ^2/2} dτ

62.4 Refer to expression (62.16) for the combination weights α⋆(t) in the AdaBoost implementation (62.18b). Show that for two optimal classifiers s and t whose error rates satisfy E⋆(s) < E⋆(t), the corresponding combination weights will satisfy α⋆(s) > α⋆(t). Conclude that the smaller the error of a weak classifier, the more its contribution will be to the final decision by the AdaBoost classifier.
62.5 Refer to expression (62.16) for the trust or confidence factors in the AdaBoost implementation. Show that if at any particular iteration t we encounter E⋆(t) = 0.5, then the AdaBoost iteration should stop.
62.6 Refer to expression (62.17a) for the weighting factor in the AdaBoost implementation.
(a) Show that

    d_{t+1}(n) = (1/N) ( Π_{s=1}^{t} 1/β(s) ) exp( − Σ_{s=1}^{t} α⋆(s) γ(n) c⋆_s(h_n) )

(b) Introduce the optimal aggregate classifier at iteration t as

    γ̂^{(t)}(h) ≜ Σ_{s=1}^{t} α⋆(s) c⋆_s(h)

Conclude from this expression and from part (a) that the normalization factors {β(1), . . . , β(t)} must satisfy

    Π_{s=1}^{t} β(s) = (1/N) Σ_{n=0}^{N−1} e^{−γ(n) γ̂^{(t)}(h_n)}

62.7 Refer to the result of part (b) from Prob. 62.6. Let R_emp(c⋆) denote the empirical error rate for the AdaBoost solution, i.e., the error rate that is associated with the aggregate solution (62.18a)–(62.18b), i.e.,

    R_emp(c⋆) = (1/N) Σ_{n=0}^{N−1} I[ c⋆(h_n) ≠ γ(n) ]

Show that

    R_emp(c⋆) ≤ (1/N) Σ_{n=0}^{N−1} e^{−γ(n) γ̂(n)}

62.8 Refer to the scaling factor β(t) in (62.17a). Let M_t denote the set of indices of misclassified data by c⋆_t(h).
(a) Verify that

    β(t) = e^{α⋆(t)} Σ_{n∈M_t} d_t(n) + e^{−α⋆(t)} Σ_{n∉M_t} d_t(n)

(b) Show that the value of α⋆(t) given by (62.16) minimizes the above expression for β(t) over α⋆(t). Verify that the resulting minimum value for β(t) is given by

    β_min(t) = 2 √( E⋆(t)(1 − E⋆(t)) )

(c) Let ν⋆(t) = 1/2 − E⋆(t) denote the margin from 1/2. Show that β_min(t) ≤ e^{−2(ν⋆(t))^2}, and conclude from part (b) in Prob. 62.6 that the empirical error rate for the AdaBoost classifier is bounded by:

    R_emp(c) ≤ exp( −2 Σ_{t=1}^{L} (ν⋆(t))^2 )

Remark. See also Shalev-Shwartz and Ben-David (2014, ch. 10) for a related discussion.
62.9 Refer to the derivation in Section 62.2 for AdaBoost.
(a) Establish the equivalence of the optimization problems (62.15) and (62.38) over misclassified data.
(b) Establish the equivalence of expressions (62.14) and (62.41).
62.10 Refer to the AdaBoost classifier (62.18a)–(62.18b). We explained in Section 62.2 that the classifier minimizes the exponential risk, P(c) = E e^{−γ γ̂}, where γ̂ is the predictor generated by the algorithm for the classification label.
(a) Show that the minimum of P(c) over γ̂ is given by

    γ̂ = (1/2) ln( P(γ = +1 | h = h) / P(γ = −1 | h = h) )

(b) We explained earlier in (59.5a) that the logistic regression implementation provides a measure of confidence about its classification decision. Conclude from part (a) that, in a similar manner, a confidence level can be associated with the AdaBoost implementation as follows:

    P(γ = +1 | h = h) = 1 / ( 1 + e^{−2γ̂(h)} )


where γ̂(h) is the prediction for the label of feature vector h.
62.11 Consider the exponential loss Q(γ, γ̂) = e^{−γ γ̂}. Differentiate the right-hand side of (62.63) over α and set the gradient to zero. Show that this calculation leads to the same relation (62.39).
62.12 Refer to the gradient boosting algorithm (62.67) and let us specialize it to the case of the absolute error loss, Q(γ, γ̂) = |γ − γ̂|. Verify that g_t(n) = sign(γ(n) − γ̂(n)) and

    α⋆(t) = argmin_α Σ_{n=0}^{N−1} | y(n) − α |,   where   y(n) ≜ ( γ(n) − γ̂^{(t−1)}(n) ) / c⋆_t(h_n)

Conclude that α⋆(t) is the median of the processed samples {y(n)}.

REFERENCES Amit, Y. and D. Geman (1997), “Shape quantization and recognition with randomized trees,” Neural Comput., vol. 9, no. 7, pp. 1545–1588. Bartlett, P. L. and M. Traskin (2007), “AdaBoost is consistent,” J. Mach. Learn. Res., vol. 8, pp. 2347–2368. Breiman, L. (1994), “Heuristics of instability in model selection,” Ann. Statist., vol. 24, no. 6, pp. 2350–2383. Breiman, L. (1996a), “Stacked regressions,” Mach. Learn., vol. 24, no. 1, pp. 41—64. Breiman, L. (1996b), “Bagging predictors,” Mach. Learn., vol. 24, no. 2, pp. 123–140. Breiman, L. (1998), “Arcing classifiers,” Ann. Statist., vol. 26, no. 3, pp. 801–824. Breiman, L. (1999), “Prediction games and arcing algorithms,” Neural Comput., vol. 11, pp. 1493–1517. Breiman, L. (2001), “Random forests,” Mach. Learn., vol. 45, no. 1, pp. 5–32. Breiman, L. (2004), “Population theory for boosting ensembles,” Ann. Statist., vol. 32, no. 1, pp. 1–11. Brown, G., J. Wyatt, R. Harris, and X. Yao (2005), “Diversity creation methods: A survey and categorisation,” Inf. Fusion, vol. 6, no. 1, pp. 5–20. Buhlmann, P. (2012), “Bagging, boosting and ensemble methods,” in Handbook of Computational Statistics, J. E. Gentle, W. K. Hardle, and Y. Mori, editors, pp. 985–1022, Springer. Buhlmann, P. and T. Hothorn (2007), “Boosting algorithms: Regularization, prediction and model fitting,” Statist. Sci., vol. 22, pp. 477–505. Buhlmann, P. and B. Yu (2002), “Analyzing bagging,” Ann. Statist., vol. 30, pp. 927– 961. Buhlmann, P. and B. Yu (2003), “Boosting with L2 loss: Regression and classification,” J. Amer. Statist Assoc., vol. 98, pp. 324–338. Buja, A. and W. Stuetzle (2006), “Observations on bagging,” Statistica Sinica, vol. 16, pp. 323–351. Criminisi, A., J. Shotton, and E. Konukoglu (2011), “Decision forests for classification, regression, density estimation, manifold learning and semi-supervised learning,” Microsoft Technical Report, MSR-TR-2011-114, Microsoft Research. Dasarathy, B. V. and B. V. Sheela (1979), “Composite classifier system design: Concepts and methodology,” Proc. IEEE, vol. 67, no. 5, pp. 708–713. Dietterich, T. G. (2000), “Ensemble methods in machine learning,” Proc. Int. Workshop Multiple Classifier Systems, pp. 1–15, London. Drucker, H. (1997), “Improving regressors using boosting techniques,” Proc. Int. Conf. Machine Learning (ICML), pp. 107–115, Nashville, TN.


Drucker, H. and C. Cortes (1994), “Boosting and other machine learning algorithms,” Neural Comput., vol. 6, no. 6, pp. 1289–1301. Drucker, H., C. Cortes, L. D. Jackel, Y. LeCun, and V. N. Vapnik (1994), “Boosting and other ensemble methods,” Neural Comput., vol. 6, no. 6, pp. 1289–1301. Drucker, H., R. Schapire, and P. Simard (1993), “Improving performance in neural networks using a boosting algorithm,” Proc. Advances Neural Information Processing Systems (NIPS), pp. 42–49, Denver, CO. Efron, B. and R. Tibshirani (1993), An Introduction to the Bootstrap, Chapman & Hall. Freund, Y. and R. E. Schapire (1996), “Experiments with a new boosting algorithm,” Proc. Int. Conf. Machine Learning (ICML), pp. 325–332, Bari. Freund, Y. and R. E. Schapire (1997), “A decision-theoretic generalization of on-line learning and an application to boosting,” J. Comput. Sys. Sci., vol. 55, pp. 119–139. A conference version was published earlier in Proc. European Conf. Computational Learning Theory, Barcelona, 1995. Friedman, J. H. (2001), “Greedy function approximation: A gradient boosting machine,” Ann. Statistics, vol. 29, no. 5, pp. 1189–1232. Friedman, J. H. (2002),“Stochastic gradient boosting,” Comput. Statist. Data Anal., vol. 38, no. 4, pp. 367–378. Hansen, L. K. and P. Salamon (1990), “Neural network ensembles,” IEEE Trans. Patt. Anal. Mach. Intell., vol. 12, no. 10, pp. 993–1001. Hastie, T., R. Tibshirani, and J. Friedman (2009), The Elements of Statistical Learning, 2nd ed., Springer. Ho, T. K. (1995), “Random decision forest,” Proc. Int. Conf. Document Analyis and Recognition, pp. 278–282, Montreal. Ho, T. K. (1998), “The random subspace method for constructing decision forests,” IEEE Trans. Patt. Anal. Mach. Intell., vol. 20, no. 8, pp. 832–844. Kearns, M. and L. G. Valiant (1988), “Learning Boolean formulae or finite automata is as hard as factoring,” Technical Report TR-14-88, Harvard University. Kearns, M. and L. G. Valiant (1989), “Cryptographic limitations on learning Boolean formulae and finite automata,” Proc. Ann. ACM Symp. Theory in Computing, pp. 433–444, New York. Mason, L., J. Baxter, P. Bartlett, and M. Frean (1999), “Boosting algorithms as gradient descent in function space,” Proc. Advances Neural Information Processing Systems (NIPS), pp. 512–518, Denver, CO. Opitz, D. and R. Maclin (1999), “Popular ensemble methods: An empirical study,” J. Artif. Intell. Res., vol. 11, pp. 169–198. Polikar, R. (2006), “Ensemble based systems in decision making,” IEEE Circ. Syst. Mag., vol. 6, no. 3, pp. 21–45. Rojas, R. (2009), “AdaBoost and the super bowl of classifiers: A tutorial introduction to adaptive boosting,” Technical Report, Freie University. Rokach, L. (2010), “Ensemble-based classifiers,” Artif. Intell. Rev., vol. 33, pp. 1–39. Schapire, R. E. (1990), “The strength of weak learnability,” Mach. Learn., vol. 5, no. 2, pp. 197–227. Schapire, R. E. (2003), “The boosting approach to machine learning: An overview,” in Proc. Workshop on Nonlinear Estimation and Classification, pp. 149–171, New York. Schapire, R. E. and Y. Freund (2012), Boosting: Foundations and Algorithms, MIT Press. Schapire, R. E. and Y. Singer (1999), “Improved boosting algorithms using confidencerated predictions,” Mach. Learn., vol. 37, pp. 297–336. Shalev-Shwartz, S. and S. Ben-David (2014), Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press. Tukey, J. (1977), Exploratory Data Analysis, Addison-Wesley. Ying, B. and A. H. 
Sayed (2017), “Diffusion gradient boosting for networked learning,” Proc. IEEE ICASSP, pp. 2512–2516, New Orleans, LA. Zenko, B. (2004), “Is combining classifiers better than selecting the best one,” Mach. Learn., vol. 54, pp. 255–273.


Zhou, Z. (2012), Ensemble Methods: Foundations and Algorithms, Chapman & Hall. Zoubir, A. M. and D. R. Iskander (2007), Bootstrap Techniques for Signal Processing, Cambridge University Press.

63 Kernel Methods

In the immediate past chapters we developed several techniques for the design of linear classifiers, such as logistic regression, perceptron, and support vector machines (SVM). These algorithms are suitable for data that are linearly separable; otherwise, their performance degrades significantly. In this chapter we explain how the methods can be adjusted to determine nonlinear separation surfaces. The solution will rely on the use of kernel methods. Kernels are functions that map two vector arguments into the inner product of two transformed versions of these vectors. The main idea is that if the original data {γ(n), hn } is not linearly separable, then we would map each hn ∈ IRM into a higher-dimensional space and replace it by the transformed feature hφn ∈ IRMφ such that the adjusted data {γ(n), hφn } is more likely to be linearly separable in the expanded space. What is particularly interesting about kernel-based methods is that the mapping from hn to hφn will be implicit without the need to actually compute the long vector hφn , and even without the need to know what type of transformation needs to be applied to the original data. These elements are automatically accounted for by the solution method. There are many types of kernel functions that define the nature of the transformation from one space to another. This chapter defines kernels, introduces several of their properties, and explains how the concept applies to the design of learning algorithms.

63.1

MOTIVATION

We encountered several binary classification schemes in the previous chapters, which use the training data {γ(n), h_n} to determine classifiers c(h) : IR^M → {±1} that map feature vectors into labels. Although we have focused primarily on the design of classifiers that lead to separating hyperplanes, many situations of interest in practice require the use of nonlinear separation surfaces. Two instances are illustrated in Fig. 63.1 for data in IR^2. The plot on the left shows data that are separated by an elliptical curve: All training data from class −1 lie inside the curve, while data from class +1 lie outside the curve. In general, separation curves need not be so regular; more complex forms are possible, as shown on the right side of the same figure.


Figure 63.1 In the example on the left, an elliptic curve is sufficient to separate the training data into two classes over IR^2. In the example on the right, a more elaborate separation curve is necessary.

When separation of the training data into its two classes is possible, we say that the data is separable. Once a separation surface is determined and some new feature vector, h, is received, the classifier will assign it to one label γ or another by determining whether it lies on one side of the separation surface or the other. All separation surfaces will be described by equations of the form g(h) = 0, for some function g(·) of the feature space. For example, for affine surfaces we have g(h) = h^T w − θ for some parameters (w, θ). Using g(h), classification would then make assignments according to the tests:

    assign h to class +1 if g(h) > 0
    assign h to class −1 if g(h) < 0        (63.1)

Overfitting and underfitting One useful question is whether it is preferable to seek elaborate separation curves that are able to weave through the training data and separate them into their correct classes, as opposed to simpler curves that may misclassify some training samples. The answer is negative! We provide an intuitive explanation here and leave the analytical details to the next chapter. The answer to the question relates to two phenomena that arise frequently in classification problems concerning the notions of overfitting and underfitting. Consider the situation shown in the rightmost plot of Fig. 63.2. The data is almost fully linearly separable, with a few misclassified samples lying on the opposite sides of the separation line. One may wonder whether in situations like this it is preferable to design classifiers that lead to more complex separation surfaces so that fewer points are misclassified. The plot on the left in the same figure shows a separation curve that weaves through the training data points and makes sure that they fall on the right side of the curve. The plot in the middle shows a midterm solution where the curve is smoother and leaves only


Figure 63.2 When a more complex model is used, the separation curve can succeed in separating better the training data (leftmost plot) but will generally perform poorly on new test data. On the other hand, when a simple model is used (rightmost plot), the separation curve need not fit the training data well but can perform better on test data.

a couple of the points misclassified in comparison with the linear case. We will establish in Chapter 64 that it is not necessarily the case that more complex separation surfaces lead to higher classification accuracy. While these models fit the data better and lead to minimal misclassification errors over the training samples, their performance will nevertheless be generally poor on test samples. This behavior is the result of the phenomenon of overfitting, which occurs when the complex classifier overreaches in trying to match almost perfectly the training feature vectors to their classes. By doing so, the classifier ends up erroneously fitting spurious effects and outlier data. This conclusion is in line with what is generally referred to as the Occam razor principle. The principle categorically favors simpler explanations or hypotheses over more complex ones. On the other hand, when simplistic models are used, the separation curve need not fit the training data well, thus resulting in poor performance over both the training and test data. In this case, we say that underfitting occurs. The middle plot in the figure shows a midterm solution using a separation model of moderate complexity: The model neither overreaches in fitting the data nor oversimplifies in explaining the same data. In the previous chapters, we discussed a variety of techniques to avoid the perils of overfitting (and which allow classifiers to move away from a state of overfitting to a state of better-fitting). Some of these techniques include the use of regularization, the use of dimensionality reduction techniques (such as Fisher discriminant analysis (FDA) or principal component analysis (PCA)), and boosting. In the current chapter, we will introduce an alternative powerful tool that allows us to move in the other direction: from a state of underfitting toward better-fitting. This tool relies on the use of kernels, which will help enlarge the class of classifiers to include nonlinear separation surfaces as well.


63.2

NONLINEAR MAPPINGS

Before introducing kernels and their properties, we consider two examples to illustrate how nonlinear decision boundaries can be handled by transforming feature vectors, h_n, into higher dimensions and by carrying out the learning task in the transformed domain. We will denote such transformations generically by the notation:

    h_n ∈ IR^M  →  φ(h_n) ∈ IR^{M_φ}        (63.2)

where the dimension of the new enlarged space is denoted by M_φ. The function φ(·) acts on the individual entries of h_n to generate a new vector of size M_φ. Sometimes, we will denote the transformed feature vector by the notation h^φ_n, with a superscript φ, i.e.,

    h^φ_n ≜ φ(h_n)        (63.3)

Example 63.1 (Nonlinear separation curve) Consider the situation illustrated in Fig. 63.1, where the feature data are separated by an elliptic curve. One way to transform this problem into a scenario that involves determining a separating hyperplane in higher dimensions (rather than an ellipse in two dimensions) is by extending the feature vectors. We denote the entries of the feature vector h generically by:

    h ≜ col{1, x, y}        (63.4)

where the number 1 has been added in view of the usual extension shown in (61.16); moreover, the scalars (x, y) denote the coordinates of the point corresponding to h in IR^2. If we were to seek a separating line with normal vector

    w⋆ ≜ col{−θ⋆, a⋆, b⋆}        (63.5)

then all points on this line would satisfy the equation h^T w⋆ = 0 or, equivalently,

    a⋆ x + b⋆ y = θ⋆        (63.6)

We already know that a separation line of this type does not exist for the data shown in Fig. 63.1 (left). The data suggests that we should seek an elliptic curve in IR^2. In order to transform the task of determining such a nonlinear curve into an equivalent problem involving the determination of a hyperplane in higher dimensions, we extend the feature vector by adding second-order terms and defining:

    h^φ = φ(h) ≜ col{1, x, y, xy, x^2, y^2}        (63.7)

This transformation maps the original data (x, y) ∈ IR^2 into the enlarged five-dimensional space (x, y, xy, x^2, y^2) ∈ IR^5. If we now solve the binary classification problem in this enlarged domain, and determine a separating hyperplane, w^{φ,⋆}, say, with parameters:

    w^{φ,⋆} ≜ col{ −θ^{φ,⋆}, a^{φ,⋆}, b^{φ,⋆}, c^{φ,⋆}, d^{φ,⋆}, e^{φ,⋆} }        (63.8)

then points lying on this hyperplane would satisfy the equation:

    a^{φ,⋆} x + b^{φ,⋆} y + c^{φ,⋆} xy + d^{φ,⋆} x^2 + e^{φ,⋆} y^2 = θ^{φ,⋆}        (63.9)

which is the general equation of an ellipse in IR^2. Classification results can then be obtained by carrying out the comparisons:

    if (h^φ)^T w^{φ,⋆} < 0, assign h to class −1
    if (h^φ)^T w^{φ,⋆} ≥ 0, assign h to class +1        (63.10)

In this way, by extending the dimension of the feature space, we are able to transform the classification problem into one involving a linear classifier of the form (63.10). One key challenge with this solution method is that, in general, we do not know beforehand the analytical form of the nonlinear curve that separates the training data, which makes it difficult to decide in advance which extension of the form (63.8) should be used. In the current example, we are able to plot the training data and visualize that an elliptic curve is needed. However, such visualizations are not possible in general, and the separation curve may be more irregular. It is therefore useful to "automate" the step of extending the feature vectors so that the learning algorithm is able to select autonomously a suitable extension. This will be possible by relying on the kernel-based methods of this chapter.

Example 63.2 (XOR function) A second example to illustrate the power of nonlinear transformations is the following. Assume the feature space is two-dimensional, h ∈ IR^2, and that the entries of each feature vector are ±1 so that there are four possibilities for h. Their labels are defined as follows, which corresponds to the classical XOR digital operation applied to the individual entries of h:

    XOR →   h_1 = col{−1, −1} ∈ class −1
            h_2 = col{−1, +1} ∈ class +1
            h_3 = col{+1, −1} ∈ class +1
            h_4 = col{+1, +1} ∈ class −1        (63.11)

This feature space is not linearly separable, as shown in the left plot of Fig. 63.3, where we are denoting the individual entries of h by h = col{a, b}.


Figure 63.3 The plot on the left shows the four possibilities for the XOR function in IR^2; these points are not linearly separable. The plot on the right shows the four possibilities for (b, ab); these points are linearly separable.

Assume now we map each h to the three-dimensional vector:

    φ(h) ≜ col{a, b, ab}        (63.12)


Then, the feature vectors in (63.11) will be mapped into:

    XOR →   φ(h_1) = col{−1, −1, +1} ∈ class −1
            φ(h_2) = col{−1, +1, −1} ∈ class +1
            φ(h_3) = col{+1, −1, −1} ∈ class +1
            φ(h_4) = col{+1, +1, +1} ∈ class −1        (63.13)

If we examine the transformed feature vectors and focus on their second and third coordinates, namely, {b, ab}, we find that they are now linearly separable, as shown in the right plot of Fig. 63.3.
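The effect of the mapping (63.12) is easy to check numerically. The short sketch below is my own illustration; the use of a least-squares fit as the linear classifier is an assumption, chosen only to show that the mapped points are linearly separable.

```python
import numpy as np

# The four XOR points and their labels from (63.11)
H = np.array([[-1, -1], [-1, +1], [+1, -1], [+1, +1]], dtype=float)
y = np.array([-1, +1, +1, -1], dtype=float)

# Transformed features phi(h) = col{a, b, ab} from (63.12)
Phi = np.column_stack([H[:, 0], H[:, 1], H[:, 0] * H[:, 1]])

# A linear classifier in the transformed space: plain least-squares fit (my choice of solver)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
pred = np.sign(Phi @ w)
print(w)                    # weight concentrates on the 'ab' coordinate (approximately [0, 0, -1])
print(np.all(pred == y))    # True: the mapped points are linearly separable
```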

63.3

POLYNOMIAL AND GAUSSIAN KERNELS

There are three challenges with transformations of the type (63.2). First, the designer does not know beforehand which transformation φ(·) to select. Second, most of the online and batch learning algorithms we described earlier (such as stochastic gradient, subgradient, and proximal methods) require the computation of inner products between feature vectors and weight iterates. This computation becomes problematic for higher-dimensional spaces with large M_φ. And third, once the transformation φ(·) is selected, one still needs to form the transformed vectors, φ(h_n). It turns out that there is an elegant way to circumvent these difficulties by resorting to kernel methods. These methods provide an "automated" way to extend the feature space and to carry out the necessary calculations and transformations in an efficient and flexible manner. We start by defining kernels. A kernel is a function that maps two vector arguments into the inner product of similarly transformed versions of these vectors, namely,

    K(h_k, h_ℓ) : IR^M × IR^M → IR        (63.14a)

where

    K(h_k, h_ℓ) = (φ(h_k))^T φ(h_ℓ) = (h^φ_k)^T h^φ_ℓ        (63.14b)

for some function φ(·). Note from definition (63.14b) that kernels are symmetric functions since

    K(h_k, h_ℓ) = K(h_ℓ, h_k)        (63.14c)

We say that kernel functions perform inner product operations in the transformed domain. Obviously, not every function, K(h_k, h_ℓ), can be expressed in the inner product form (63.14b) and, therefore, not every function is a kernel. A fundamental theorem in functional analysis, known as the Mercer theorem, clarifies which functions K(h_k, h_ℓ) can be expressed in the form (63.14b). For any integer N, we introduce the following N × N Gramian matrix, A_N, which is symmetric:

    [A_N]_{k,ℓ} ≜ K(h_k, h_ℓ),   k, ℓ = 0, 1, 2, . . . , N − 1        (63.15)

Mercer theorem: A symmetric and square-integrable function K(h_k, h_ℓ) is a kernel if, and only if, the Gramian matrix A_N defined by (63.15) is positive semi-definite for any size N and any feature data {h_n}.

Proof: We motivate the argument briefly as follows; more details are given in the comments at the end of the chapter leading to (63.201). Assume first that K(·, ·) is a kernel. We collect the transformed feature vectors into the matrix

    H_φ ≜ [ φ(h_0)  φ(h_1)  . . .  φ(h_{N−1}) ],   (M_φ × N)        (63.16)

Then, it holds that A_N = H_φ^T H_φ so that A_N ≥ 0. Conversely, assume A_N ≥ 0 for any N and {h_n}. Then, A_N admits an eigen-decomposition of the form A_N = V_N^T V_N. Let V_N = [ v_0  v_1  v_2  . . .  v_{N−1} ] denote the columns of V_N and set φ(h_n) = v_n. Then, the (k, ℓ)th entry of A_N is equal to the inner product (φ(h_k))^T φ(h_ℓ), which by the definition (63.15) is also equal to K(h_k, h_ℓ). Therefore, it holds that

    K(h_k, h_ℓ) = (φ(h_k))^T φ(h_ℓ)        (63.17)

for some function φ(·). ∎
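As a quick numerical illustration of the Mercer condition (a sketch I added, with the Gaussian kernel of (63.25) and random features as assumptions), one can build the Gramian (63.15) for a batch of feature vectors and confirm that its eigenvalues are nonnegative up to round-off:

```python
import numpy as np

def gaussian_kernel(hk, hl, sigma2=1.0):
    # Gaussian (RBF) kernel, cf. (63.25)
    return np.exp(-np.sum((hk - hl) ** 2) / (2.0 * sigma2))

rng = np.random.default_rng(0)
H = rng.standard_normal((20, 3))          # N = 20 random feature vectors in IR^3

# Gramian matrix A_N from (63.15)
A = np.array([[gaussian_kernel(hk, hl) for hl in H] for hk in H])

eigs = np.linalg.eigvalsh(A)
print(eigs.min() >= -1e-10)               # True: A_N is positive semi-definite
```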

We are going to see that kernel-based implementations of learning algorithms require knowledge of the kernel function itself, K(h_k, h_ℓ), and not of the specific transformation φ(·) that defines it. Before continuing, we illustrate this fact by means of the following example.

Example 63.3 (Second-order polynomial kernel) Consider the following function:

    K(h_k, h_ℓ) ≜ (1 + h_k^T h_ℓ)^2        (63.18)

which maps two feature vectors (h_k, h_ℓ) into a nonnegative scalar. We want to verify that the function so defined is a kernel. To do so, and according to definition (63.14b), we need to identify a transformation φ(h) that allows us to express K(h_k, h_ℓ) as the inner product (φ(h_k))^T φ(h_ℓ). We verify that this is indeed the case by focusing on the case M = 2, for convenience. We denote the individual entries of (h_k, h_ℓ) by

    h_k = col{h_{k,1}, h_{k,2}}        (63.19a)
    h_ℓ = col{h_{ℓ,1}, h_{ℓ,2}}        (63.19b)

Then, using (63.18), we have

    K(h_k, h_ℓ) = (1 + h_{k,1} h_{ℓ,1} + h_{k,2} h_{ℓ,2})^2
                = (1 + h_{k,1} h_{ℓ,1})^2 + h_{k,2}^2 h_{ℓ,2}^2 + 2(1 + h_{k,1} h_{ℓ,1}) h_{k,2} h_{ℓ,2}
                = 1 + h_{k,1}^2 h_{ℓ,1}^2 + 2 h_{k,1} h_{ℓ,1} + h_{k,2}^2 h_{ℓ,2}^2 + 2 h_{k,2} h_{ℓ,2} + 2 h_{k,1} h_{ℓ,1} h_{k,2} h_{ℓ,2}        (63.20)

which we can express more compactly as follows. We introduce the transformed vectors:

    φ(h_k) = col{ 1, √2 h_{k,1}, √2 h_{k,2}, √2 h_{k,1} h_{k,2}, h_{k,1}^2, h_{k,2}^2 }        (63.21)
    φ(h_ℓ) = col{ 1, √2 h_{ℓ,1}, √2 h_{ℓ,2}, √2 h_{ℓ,1} h_{ℓ,2}, h_{ℓ,1}^2, h_{ℓ,2}^2 }        (63.22)

and then note from (63.20) that

    K(h_k, h_ℓ) = (φ(h_k))^T φ(h_ℓ)        (63.23)

In other words, the function (63.18) can be expressed as the inner product between the two transformed vectors (φ(h_k), φ(h_ℓ)), both of dimension 6 × 1 when M = 2. Observe the important fact that the vectors, h_k and h_ℓ, have both been transformed in an identical manner.
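The identity (63.23) is easy to confirm numerically. The sketch below was added for illustration, with random two-dimensional vectors as an assumption; it evaluates the kernel (63.18) directly and compares it against the inner product of the transformed vectors (63.21)–(63.22).

```python
import numpy as np

def phi(h):
    # transformation (63.21)-(63.22) for M = 2
    return np.array([1.0, np.sqrt(2) * h[0], np.sqrt(2) * h[1],
                     np.sqrt(2) * h[0] * h[1], h[0] ** 2, h[1] ** 2])

rng = np.random.default_rng(1)
hk, hl = rng.standard_normal(2), rng.standard_normal(2)

direct = (1.0 + hk @ hl) ** 2          # kernel evaluated in the original space, (63.18)
lifted = phi(hk) @ phi(hl)             # inner product in the transformed space, (63.23)
print(np.isclose(direct, lifted))      # True
```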

The above example illustrates why the evaluation of K(h_k, h_ℓ) does not require that we form first the transformed vectors, (φ(h_k), φ(h_ℓ)). Instead, the kernel function can be evaluated directly in the original feature domain using (63.18). This fact is one of the key properties that makes kernel methods useful. More generally, a pth-order polynomial kernel takes the form:

    K(h_k, h_ℓ) ≜ (1 + h_k^T h_ℓ)^p,   p = 1, 2, 3, . . .        (63.24)

for any integer p ≥ 1. In the example we considered p = 2 and M = 2. One can similarly verify that functions of the form (63.24) are kernels for other integer values of p and M. While these verifications can be carried out from first principles, as was done in the example, it is more convenient to rely on certain properties of kernels. Let K(h_a, h_b), K_1(h_a, h_b), and K_2(h_a, h_b) denote any kernel functions. Then, the following properties hold (the proofs of which are left as exercises in Probs. 63.3–63.8):

(a) αK_1(h_a, h_b) + βK_2(h_a, h_b) is a kernel for any α, β ≥ 0.
(b) K_1(h_a, h_b) K_2(h_a, h_b) is a kernel.
(c) K(g(h_a), g(h_b)) is a kernel for any vector-valued function g(h) : IR^M → IR^M.
(d) f(h_a) f(h_b) is a kernel for any scalar-valued function f(h) : IR^M → IR.
(e) p(K(h_a, h_b)) is a kernel for any polynomial p(x) with positive coefficients.
(f) e^{K(h_a, h_b)} is a kernel.
(g) If kernels K_n(h_a, h_b) are convergent, then lim_{n→∞} K_n(h_a, h_b) is a kernel.

Note that these properties allow us to construct new kernels from other known kernel functions. Another example of a popular kernel is the Gaussian kernel (also called the radial basis function (RBF) or the squared exponential kernel). It is defined as

    K(h_k, h_ℓ) ≜ exp{ − (1/(2σ^2)) ‖h_k − h_ℓ‖^2 }        (63.25)

for some parameter σ 2 > 0. This parameter controls the width of the Gaussian pulse.


Example 63.4


(Gaussian kernel) We rewrite the Gaussian kernel (63.25) in the form

    K(h_k, h_ℓ) = e^{−‖h_k‖^2/(2σ^2)} e^{−‖h_ℓ‖^2/(2σ^2)} e^{h_k^T h_ℓ / σ^2}        (63.26)

which expresses K(h_k, h_ℓ) as the product of three terms. Let

    K_1(h_k, h_ℓ) ≜ e^{−‖h_k‖^2/(2σ^2)} e^{−‖h_ℓ‖^2/(2σ^2)}        (63.27a)
    K_2(h_k, h_ℓ) ≜ e^{h_k^T h_ℓ / σ^2}        (63.27b)

Then, both of these functions are kernels. The function K1 (hk , h` ) is a kernel because it is expressed as the product of two single-argument functions, as required by property (d) listed above. Likewise, K2 (hk , h` ) is a kernel because of property (f). It follows that K(hk , h` ) is a kernel by property (b). An alternative way to establish that Gaussian functions of the form (63.25) are kernel functions is to resort to a second result from functional analysis, known as the Bochner theorem; this line of reasoning is pursued in Probs. 63.12 and 63.13. In Prob. 63.10 we pursue the Gaussian example further and show that the transformed vectors, φ(h), are now infinite-dimensional.

63.4

KERNEL-BASED PERCEPTRON

In general, given a kernel function K(h_k, h_ℓ), it can be expensive or even impossible to determine the transformation φ(h), for any feature vector h. Fortunately, several learning algorithms can be reworked into a form that does not require knowledge of φ(h) but only of kernel evaluations of the form K(h, h_ℓ). We illustrate this fact in this and the following sections and explain how to embed kernel functions into the operation of learning methods. We start with the perceptron algorithm. In the kernel-based implementation of perceptron, the algorithm operates (implicitly) on the transformed data (γ, h^φ); obviously, the label variable γ is the same for both h and its transformed version h^φ. We will discuss two implementations for perceptron: with and without regularization.

63.4.1

Unregularized Kernel Perceptron

We are given N data samples {γ(m), h_m}, where γ(m) ∈ {±1} is the label of the mth feature vector h_m ∈ IR^M. We denote the transformed samples by {γ(m), h^φ_m}, where h^φ_m ∈ IR^{M_φ}. As the argument will show, we will not need to know the transformed features {h^φ_m}; the final kernel-based implementation will operate directly on the original features {h_m}. Now, assuming the transformed data samples {γ(m), h^φ_m} are available, the traditional perceptron recursion would seek a hyperplane with parameters w^φ ∈ IR^{M_φ} and θ^φ ∈ IR by iterating the following recursions over n ≥ 0:


    select a random data pair (γ(n), h^φ_n) at iteration n        (63.28a)
    γ̂(n) = (h^φ_n)^T w^φ_{n−1} − θ^φ(n − 1)        (63.28b)
    α(n) = I[ γ(n) γ̂(n) ≤ 0 ]        (63.28c)
    θ^φ(n) = θ^φ(n − 1) − μ α(n) γ(n)        (63.28d)
    w^φ_n = w^φ_{n−1} + μ α(n) γ(n) h^φ_n        (63.28e)

where we introduced the shorthand notation α(n) for the indicator function (its value is either 0 or 1). Note that this implementation requires that we evaluate the inner product between the M_φ-dimensional vectors h^φ_n and w^φ_{n−1} in (63.28b) to compute γ̂(n). By appealing to a kernel-based construction, we will be able to carry out this calculation in a more efficient manner as follows.


Figure 63.4 The first row shows the running index n for the perceptron implementation (63.28a)–(63.28e). The second row shows the index of the randomly selected samples. The circles highlight the samples where misclassifications occur. The indices of these samples are included in the misclassification set Mn , and the number of times that errors occur at these samples are aggregated into their respective counters.

Recursions (63.28a)–(63.28e) run continually over the data. During each iteration n, the algorithm samples a random data pair from the N -point dataset {γ(m), hφm }. This sample is either classified correctly or misclassified. Moreover, it may have been sampled before. We illustrate the operation of the algorithm in Fig. 63.4 for the first few iterations over n. The top row in the figure shows the running index n assuming the values n = 0, 1, 2, 3, 4, 5, 6. At each iteration, a sample of index m is chosen at random from the dataset {γ(m), hφm }. The selected samples are indicated in the second row. For example, at iteration n = 0, the sample of index m = 3 is selected, while at iteration n = 4, the sample of index m = 6 is selected. The circles refer to samples that are misclassified by the algorithm; the misclassifications occur at instants n = 0, 3, 4. The sample of index m = 3 is misclassified twice, while the sample of index 6 is misclassified once. We collect the indices of misclassified samples into the set Mn = {3, 6}.


We also count how many times each of these samples has been misclassified until time n. We do so by associating a counter with each data sample. The counters are entries of an N-dimensional vector a. In Fig. 63.4 we have a(3) = 2, which means that sample m = 3 has been misclassified twice. We also have a(6) = 1, which means that sample m = 6 has been misclassified once. Thus, let M_n collect the indices of misclassified samples up to time n. Let also a(m) count the number of times that the mth data sample (γ(m), h^φ_m) has been misclassified. Starting from the initial conditions w^φ_{−1} = 0 and θ^φ(−1) = 0, and iterating (63.28a)–(63.28e), we get

    θ^φ(n) = − μ Σ_{m∈M_n} a(m) γ(m)        (63.29a)
    w^φ_n = μ Σ_{m∈M_n} a(m) γ(m) h^φ_m        (63.29b)

It is important to note that each data point (γ(m), h^φ_m) appears only once in the sums (63.29a)–(63.29b). The value of a(m) can grow with time because, during learning, the same training data point may be presented to the algorithm multiple times during repeated passes over the data. Note further that there are a total of N counters, {a(m), m = 0, 1, . . . , N − 1}, one for each data sample. Using representation (63.29b), we can express the inner product that is needed in (63.28b) in the form:

    γ̂(n) = (h^φ_n)^T w^φ_{n−1} − θ^φ(n − 1)
          = Σ_{m∈M_{n−1}} μ a(m) γ(m) (h^φ_n)^T h^φ_m + Σ_{m∈M_{n−1}} μ a(m) γ(m)
          = Σ_{m∈M_{n−1}} μ a(m) γ(m) ( 1 + K(h_n, h_m) )        (63.30)

where the sum is over the set of misclassified data up to time n − 1, i.e., over M_{n−1}, since we are computing the inner product of h^φ_n with w^φ_{n−1}. The last equality shows that γ̂(n) can be computed by relying solely on kernel evaluations; there is no need to know the transformed vectors h^φ_n. Once γ̂(n) is computed, we can subsequently check whether the product γ(n)γ̂(n) is positive or not. Obviously, the value of the positive step-size parameter, μ, that appears in (63.30) is unnecessary for this verification, which explains why we can set μ to 1 in perceptron implementations. We will therefore use the following expression:

    γ̂(n) = Σ_{m=0}^{N−1} γ(m) a(m) ( 1 + K(h_n, h_m) )        (63.31)

where the sum is now over the entire dataset (since some of the a(m) will be 0); moreover, the (γ(m), h_m) inside the sum do not need to be in boldface anymore since we are adding over all training data. As such, we arrive at listing (63.32) for the perceptron learning rule in the kernel domain. We start with the training data {γ(m), h_m, m = 0, 1, 2, . . . , N − 1}. The algorithm may pass over this data multiple times. In the listing it is assumed that the feature data and the weight vector have been extended according to (61.16) to accommodate the presence of an offset in the separating hyperplane.

Unregularized kernel-based perceptron algorithm.

    given N data points {γ(m), h_m}, m = 0, 1, . . . , N − 1;
    choose a kernel function, K(h, h′);
    start with integer counters a(m) = 0, m = 0, 1, . . . , N − 1;
    (training)
    repeat until convergence over n = 0, 1, 2, . . . :
        select a random data sample (γ(n), h_n) from within the dataset {γ(m), h_m};
        denote the selected index by m0;
        γ̂(n) = Σ_{m=0}^{N−1} a(m) γ(m) ( 1 + K(h_n, h_m) )   (predict label)
        a(m0) ← a(m0) + I[ γ(n) γ̂(n) ≤ 0 ]   (increment counter)
    end
    (classification)
    given a feature vector h, compute γ̂(h) using (63.33); classify h using (63.34).        (63.32)

Once training is completed, we will have available {γ(m), h_m, a(m)} for m = 0, 1, . . . , N − 1. Observe, in particular, that the main purpose of running the perceptron iteration in the kernel domain is to generate the counters {a(m)}. During normal operation, when a new feature vector h arrives, the kernel-based perceptron solution predicts its class by using the following expression where the {a(m)} are used:

    γ̂(h) = Σ_{m=0}^{N−1} γ(m) a(m) ( 1 + K(h, h_m) )        (63.33)

and assigning h to class ±1 as follows:

    if γ̂(h) ≥ 0, assign h to class +1
    if γ̂(h) < 0, assign h to class −1        (63.34)
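Listing (63.32) maps almost line-for-line onto code. The following Python sketch is my own illustration; the Gaussian kernel, the number of passes, and the random-sampling scheme are assumptions. It trains the counters {a(m)} and then classifies new vectors via (63.33)–(63.34).

```python
import numpy as np

def train_kernel_perceptron(H, y, kernel, passes=20, seed=0):
    """H: (N, M) training features, y: (N,) labels in {-1, +1}.
    Returns the counters a(m) of listing (63.32)."""
    N = len(y)
    K = np.array([[kernel(hi, hj) for hj in H] for hi in H])   # precompute K(h_n, h_m)
    a = np.zeros(N)
    rng = np.random.default_rng(seed)
    for n in rng.integers(0, N, size=passes * N):               # random passes over the data
        gamma_hat = np.sum(y * a * (1.0 + K[n, :]))             # prediction (63.31)
        if y[n] * gamma_hat <= 0:                               # misclassified: increment counter
            a[n] += 1
    return a

def classify(h, H, y, a, kernel):
    # prediction (63.33) and decision rule (63.34)
    gamma_hat = np.sum(y * a * np.array([1.0 + kernel(h, hm) for hm in H]))
    return +1 if gamma_hat >= 0 else -1

# example usage with a Gaussian kernel (sigma^2 = 0.1 is an arbitrary choice)
gauss = lambda u, v: np.exp(-np.sum((u - v) ** 2) / (2 * 0.1))
```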

Note further from expression (63.29b) that the hyperplane parameters are well approximated by (setting μ = 1):

    w^{φ,⋆} ≈ Σ_{m=0}^{N−1} γ(m) a(m) h^φ_m,   θ^{φ,⋆} ≈ − Σ_{m=0}^{N−1} γ(m) a(m)        (63.35)

Substituting these expressions into the equation for the separating surface, which is given by (h^φ)^T w^{φ,⋆} − θ^{φ,⋆} = 0, we find that the latter can be rewritten in the kernel domain as follows (where the counters {a(m)} are again prominent):

    Σ_{m=0}^{N−1} γ(m) a(m) ( 1 + K(h, h_m) ) = 0   (separation surface)        (63.36)

The resulting algorithm structure is shown in Fig. 63.5. One of the inconveniences of this implementation for online learning is the evaluation of the variable γ̂(h) in (63.33). This is because the number of feature vectors {h_m} that are included in the summation increases as the number of nonzero entries a(m) increases.


Figure 63.5 Structure of the kernel-based perceptron classifier described by equations (63.32)–(63.34).

Example 63.5 (Relation to nearest-neighbor rule) Expression (63.36) for the separation surface has a useful interpretation. Assume we are using the Gaussian kernel (63.25) with variance σ 2 . Then, the function 1 + K(h, hm ) describes a raised Gaussian “bump” that is centered at feature vector hm with spread dictated by the parameter σ 2 – see Fig. 63.6. In this way, expression (63.36) amounts to the following construction. We center a Gaussian “bump” at each training feature vector, hm , with some of these “bumps” scaled by +1 and others by −1 depending on the value of γ(m). Subsequently, we combine all bumps with weights {a(m)}. The resulting curve will be the desired separation surface. For any test feature vector h, we determine its class variable by deciding whether it lies on one side of this surface or the other by combining the contributions of the Gaussian bumps according to (63.33) and using (63.34). This construction can be viewed as a kernel-based form of the nearest-neighbor (NN) rule. For instance, if the parameter σ 2 happens to be very small, then the Gaussian bumps will be highly concentrated around their centers and their influence will decay rapidly in space. In this extreme situation, given an arbitrary feature point h ∈ IRM , the value of the sum (63.33) will be mainly dictated by the Gaussian kernel centered


Figure 63.6 Gaussian bumps scaled by ±a(m) are centered at the training feature vectors, {h_m}. The classification decision for each test vector h is obtained by combining the contributions of these Gaussian curves according to (63.33).

Figure 63.7 Classification regions obtained by combining Gaussian kernels centered at all training vectors according to construction (63.36) for two cases: σ^2 = 0.01 (left) and σ^2 = 0.001 (right).

at the vector hm that is closest to h (a type of behavior that approaches that of 1-NN solutions). We illustrate this behavior in Fig. 63.7 for the two choices σ 2 = 0.01 (left) and σ 2 = 0.001 (right). In both cases, a total of N = 100 random points are generated in the region [0, 1] × [0, 1]. In these plots, it is assumed for illustration purposes that a(m) = 1 for all training data so that all training vectors contribute to the classification

63.4 Kernel-Based Perceptron

2601

decision. For any point h in the plane, the estimated class variable is computed via (63.33) under the condition a(m) = 1, and subsequently assigned the red color (for class −1) or the green color (for class +1) according to (63.34). Obviously, in an actual perceptron implementation (63.32), only the misclassified data contributes to (63.33) and its contribution is weighted by the respective scalars {a(m)}. We will explain later in the argument leading to (63.144) how this interpretation is related to the solution of regression problems using Gaussian processes.

63.4.2

Regularized Kernel Perceptron

We consider next the ℓ2-regularized version of the perceptron, which replaces (63.28a)–(63.28e) by

    select a random data pair (γ(n), h^φ_n) at iteration n                                            (63.37a)
    γ̂(n) = (h^φ_n)^T w^φ_{n−1} − θ^φ(n−1)                                                             (63.37b)
    α(n) = I[γ(n) γ̂(n) ≤ 0]                                                                           (63.37c)
    θ^φ(n) = θ^φ(n−1) − µ α(n) γ(n)                                                                    (63.37d)
    w^φ_n = (1 − 2µρ) w^φ_{n−1} + µ α(n) γ(n) h^φ_n                                                    (63.37e)

where ρ ≥ 0 is a regularization parameter. Due to the presence of the scaling factor (1 − 2µρ), the derivation of the kernel-based version of the algorithm requires some closer examination. To begin with, starting from the initial condition θ^φ(−1) = 0 and iterating (63.37d), we again find that

    θ^φ(n) = −µ \sum_{m ∈ M_n} a(m) γ(m)                                                               (63.38)

in terms of the same counter variable a(m) as before, which counts the number of times that the data point (γ(m), h_m) has been misclassified up to time n. Each label γ(m) appears only once in the sum (63.38). Note again that there are a total of N counters {a(m), m = 0, 1, . . . , N − 1}, one for each data point (γ(m), h_m) from the training dataset. Next, starting from the initial condition w^φ_{−1} = 0 and iterating (63.37e), we get

    w^φ_n = \sum_{n'=0}^{n} (1 − 2µρ)^{n−n'} µ γ(n') h^φ_{n'} I[γ(n') γ̂(n') ≤ 0]                        (63.39)

where the notation (γ(n'), h^φ_{n'}) refers to the sample selected at iteration n'. Some samples may appear repeatedly in (63.39), and some effort is needed to rewrite (63.39) in a form similar to (63.29b), where each h_m from the original dataset {γ(m), h_m} appears only once under the sum. We can do so by introducing a new “counter” variable denoted by b(m), one for each data point m = 0, 1, . . . , N − 1, and showing that (63.39) can be rewritten in the form

    w^φ_n = \sum_{m ∈ M_n} µ b(m) γ(m) h^φ_m                                                            (63.40)


where now each sample (γ(m), h^φ_m) appears only once inside the sum. As the derivation will reveal, the variable b(m) will relate to (but is not exactly equal to) how many times h_m has been misclassified up to time n. Specifically, assume the sample of index m0 has been selected at iteration n. Then, b(m0) will need to be updated as follows at the nth iteration:

    b(m0) ← (1 − 2µρ) b(m0) + I[γ(m0) γ̂(m0) ≤ 0]                                                       (63.41)

That is, the value of b(m0) is scaled by (1 − 2µρ) before being incremented when misclassifications occur. Note that while a(m) assumes integer values, the variable b(m) assumes nonnegative real values.

Proof of (63.41): Assume, for the sake of argument and without loss of generality, that the data point (γ(m0), h_{m0}) encountered the violation γ(m0) γ̂(m0) ≤ 0 three times: at some time n_o < n, at a subsequent time n_1 < n, and at time n itself. Then, we know from (63.39) that, at time n, the transformed feature vectors h^φ_{n'} will correspond to the sampled features {h^φ_{n_o}, h^φ_{n_1}, h^φ_n} and they will appear multiplied in the sum by the coefficients:

    c_o = (1 − 2µρ)^{n−n_o},   c_1 = (1 − 2µρ)^{n−n_1},   c_2 = 1                                       (63.42)

In other words, for this example, we will have

    b(m0) = 1 + (1 − 2µρ)^{n−n_1} + (1 − 2µρ)^{n−n_o}                                                   (63.43)

It is clear that this coefficient can be generated by construction (63.41) if we start from a zero boundary condition. The first time h_{m0} is misclassified, at time n_o, the value of b(m0) is updated to 1. The second time h_{m0} is misclassified, at time n_1, the unit value is scaled by (1 − 2µρ)^{n_1−n_o} and the result is incremented by 1, i.e., it becomes 1 + (1 − 2µρ)^{n_1−n_o}. The third time h_{m0} is misclassified, at time n, this new value for b(m0) is scaled by (1 − 2µρ)^{n−n_1} and again incremented by 1. The result is (63.43).

Motivated by these considerations, we find that the coefficient b(m0) that corresponds to feature vector h_{m0} should be updated in two steps:

    b(m0) ← (1 − 2µρ) b(m0)        (always, and followed by)
    b(m0) ← b(m0) + 1              (whenever a violation occurs at (γ(m0), h_{m0}))                      (63.44)

Using (63.38) and (63.40), we can now write

    γ̂(n) = (h^φ_n)^T w^φ_{n−1} − θ^φ(n−1)
         = \sum_{m=0}^{N-1} µ γ(m) ( a(m) + b(m) K(h_n, h_m) )                                           (63.45)

where we can remove the scaling by µ > 0 (since we are only interested in the sign of γ̂(n)). In summary, we arrive at implementation (63.46) for the ℓ2-regularized kernel-based perceptron rule. We start with the training data {(γ(m), h_m), m = 0, 1, 2, . . . , N − 1}, with the understanding that the algorithm may pass over the data multiple times. Note that this implementation reduces to (63.32) when ρ = 0, in which case the variables a(m) and b(m) coincide.


ℓ2-regularized kernel-based perceptron algorithm.

given N data points {γ(m), h_m}, m = 0, 1, . . . , N − 1;
choose a kernel function K(h, h');
start with counters a(m) = 0 = b(m), m = 0, 1, . . . , N − 1.
repeat until convergence over n = 0, 1, 2, . . . :        (training)
    select a random data sample (γ(n), h_n) from within the dataset {γ(m), h_m};
    denote the selected index by m0;
    γ̂(n) = \sum_{m=0}^{N-1} γ(m) ( a(m) + b(m) K(h_n, h_m) )
    b(m0) ← (1 − 2µρ) b(m0)
    if γ(n) γ̂(n) ≤ 0, update:
        a(m0) ← a(m0) + 1
        b(m0) ← b(m0) + 1
    end
end
(classification) given a feature vector h, compute γ̂(h) using (63.47); classify h using (63.48).
                                                                                                          (63.46)
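A minimal Python sketch of the training loop in listing (63.46) follows (illustrative only; the function and array names, the random sampling scheme, and the fixed number of passes are assumptions):

import numpy as np

def train_kernel_perceptron(H, gamma, kernel, mu=0.1, rho=0.0, passes=10, seed=0):
    rng = np.random.default_rng(seed)
    N = len(gamma)
    K = np.array([[kernel(H[i], H[j]) for j in range(N)] for i in range(N)])  # Gram matrix
    a = np.zeros(N)   # integer-valued counters a(m)
    b = np.zeros(N)   # real-valued, scaled counters b(m)
    for _ in range(passes * N):
        m0 = rng.integers(N)                       # pick a random training index
        score = np.sum(gamma * (a + b * K[m0]))    # gamma_hat(n) from (63.46)
        b[m0] *= (1 - 2 * mu * rho)                # always scale b(m0)
        if gamma[m0] * score <= 0:                 # misclassification check
            a[m0] += 1
            b[m0] += 1
    return a, b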

At the end of the training phase, we will have available {γ(m), h_m, a(m), b(m)} for m = 0, 1, . . . , N − 1. During normal operation, when a new test vector h arrives, the kernel-based perceptron solution predicts its class by computing

    γ̂(h) = \sum_{m=0}^{N-1} γ(m) ( a(m) + b(m) K(h, h_m) )                                               (63.47)

and assigning h to class ±1 as follows:

    if γ̂(h) ≥ 0, assign h to class +1;   if γ̂(h) < 0, assign h to class −1                               (63.48)

Moreover, from expressions (63.38)–(63.39), the parameters for the resulting separating surface are approximated by (using µ = 1):

    w^{φ,⋆} ≈ \sum_{m=0}^{N-1} b(m) γ(m) h^φ_m ,     θ^{φ,⋆} ≈ − \sum_{m=0}^{N-1} a(m) γ(m)                (63.49)

Substituting these expressions into the equation for the separating surface, which is given by (h^φ)^T w^{φ,⋆} − θ^{φ,⋆} = 0, we find that it can be described in the kernel domain as follows:

    \sum_{m=0}^{N-1} γ(m) ( a(m) + b(m) K(h, h_m) ) = 0      (separation surface)                          (63.50)


63.5

KERNEL-BASED SVM

We consider next the soft-SVM formulation and derive two kernel-based implementations for it: one is recursive in nature and the other is in terms of a convex quadratic program. We discuss the recursive solution in this section and leave the quadratic program to Example 63.9. Since the arguments are similar to the ones used for the regularized perceptron algorithm, we will be brief. We again start from a collection of N data points {γ(m), h_m}. To begin with, and according to listing (61.22), the soft-SVM recursion applied to the transformed data (γ(n), h^φ_n) is given by

    select a random data pair (γ(n), h^φ_n) at iteration n                                                 (63.51a)
    γ̂(n) = (h^φ_n)^T w^φ_{n−1} − θ^φ(n−1)                                                                  (63.51b)
    α(n) = I[γ(n) γ̂(n) ≤ 1]                                                                                (63.51c)
    θ^φ(n) = θ^φ(n−1) − µ α(n) γ(n)                                                                         (63.51d)
    w^φ_n = (1 − 2µρ) w^φ_{n−1} + µ α(n) γ(n) h^φ_n                                                         (63.51e)

where ρ ≥ 0 is a regularization parameter. Comparing these recursions with the perceptron iterations (63.37a)–(63.37e), we observe that the only difference is in the argument of the indicator function: the comparison is now against the value 1 rather than 0. Therefore, the same derivation that we carried out for the ℓ2-regularized kernel-based perceptron recursion leads to listing (63.55). Observe that the scaling by µ is now maintained in the expression for γ̂(n), since we are comparing the product γ(n)γ̂(n) against the value 1 (and not against 0 anymore). Again, note that a(m) = b(m) when ρ = 0, so that only the counter a(m) would be needed in the hard-SVM case.

At the end of the training phase, we will have available {γ(m), h_m, a(m), b(m)} for m = 0, 1, . . . , N − 1. During normal operation, when a test vector h arrives, the kernel-based SVM solution predicts its class by computing

    γ̂(h) = \sum_{m=0}^{N-1} µ γ(m) ( a(m) + b(m) K(h, h_m) )                                                (63.52)

and assigning h to class ±1 as follows:

    if γ̂(h) ≥ 0, assign h to class +1;   if γ̂(h) < 0, assign h to class −1                                  (63.53)

The resulting structure is shown in Fig. 63.8, with the separation surface now given by

    \sum_{m=0}^{N-1} γ(m) ( a(m) + b(m) K(h, h_m) ) = 0      (separation surface)                            (63.54)


ℓ2-regularized kernel-based SVM algorithm.

given N data points {γ(m), h_m}, m = 0, 1, . . . , N − 1;
choose a kernel function K(h, h');
start with counters a(m) = 0 = b(m), m = 0, 1, . . . , N − 1.
repeat until convergence over n = 0, 1, 2, . . . :        (training)
    select a random data sample (γ(n), h_n) from within the dataset {γ(m), h_m};
    denote the selected index by m0;
    γ̂(n) = \sum_{m=0}^{N-1} µ γ(m) ( a(m) + b(m) K(h_n, h_m) )
    b(m0) ← (1 − 2µρ) b(m0)
    if γ(n) γ̂(n) ≤ 1, update:
        a(m0) ← a(m0) + 1
        b(m0) ← b(m0) + 1
    end
end
(classification) given a feature vector h, compute γ̂(h) using (63.52); classify h using (63.53).
                                                                                                          (63.55)
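The following Python sketch of listing (63.55) is illustrative only (names and defaults are assumptions); it differs from the kernel perceptron sketch only in the margin test (≤ 1 instead of ≤ 0) and in keeping the scaling by µ inside the score:

import numpy as np

def train_kernel_svm(H, gamma, kernel, mu=0.1, rho=1.0, passes=40, seed=0):
    rng = np.random.default_rng(seed)
    N = len(gamma)
    K = np.array([[kernel(H[i], H[j]) for j in range(N)] for i in range(N)])
    a, b = np.zeros(N), np.zeros(N)
    for _ in range(passes * N):
        m0 = rng.integers(N)
        score = mu * np.sum(gamma * (a + b * K[m0]))   # gamma_hat(n) from (63.55)
        b[m0] *= (1 - 2 * mu * rho)
        if gamma[m0] * score <= 1:                     # soft-margin violation
            a[m0] += 1
            b[m0] += 1
    return a, b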

Figure 63.8 Structure of the ℓ2-regularized kernel-based SVM classifier described by equations (63.52)–(63.53).


Example 63.6 (Binary classification using kernel-based SVM) We show in the two rows of Fig. 63.9 a collection of 400 feature points h_n ∈ IR² whose classes ±1 are known beforehand. We divide this data into a training set (with 320 points) and a test set (with 80 points). In the top row, the classes are separated by a circular curve, while in the bottom row the classes are separated by another nonlinear curve. We use the training data to construct two SVM classifiers and to assess their empirical error rates on the test data. We use a Gaussian kernel with σ² = 1 and train the kernel-based SVM classifier (63.55) using ρ = 5, µ = 0.1, and 40 passes over the data. The curves in the figure show the resulting separating curves from (63.54). The resulting empirical error rates are 5% in the top row and 10% in the bottom row.

Figure 63.9 The first row shows the training and test data for a kernel-based SVM classifier used to determine a circular separation curve. The second row shows training and test data for a second kernel-based SVM classifier used to determine a nonlinear separation curve. Both classifiers employ a Gaussian kernel with σ² = 1.

Example 63.7 (Selecting the regularization parameter via cross validation) We illustrate how to apply the cross-validation procedure from Section 61.3 to the selection of the regularization parameter ρ in the kernel-based implementation of the SVM learner from the previous example. In the lower plot of Fig. 63.9 we employed SVM to determine a nonlinear separation curve using a Gaussian kernel and ρ = 5. We reconsider this case and generate a plot of the empirical error rate for the SVM learner against the regularization parameter ρ.

We start with N = 320 training data points and implement a five-fold cross-validation scheme. That is, we set K = 5 and divide this training data into 5 segments of 64 samples each. We fix ρ at one particular value and run SVM on four of the segments while keeping the fifth segment for testing; this fifth segment generates an empirical error value. While running the algorithm on the four segments, we run it multiple times over the data (using 40 passes). We repeat this procedure five times, using four segments for training and one segment for testing, and subsequently average the empirical errors to determine the error rate that corresponds to the fixed value of ρ. We repeat the construction for other values of ρ and arrive at the curve shown in Fig. 63.10. From this figure, it is seen that, for the values of ρ tested, the performance of the kernel-based SVM learner starts to degrade appreciably for ρ > 17. A code sketch of this selection loop appears after Fig. 63.10.

Figure 63.10 The plot shows how the empirical error rate for the Gaussian-kernel-based SVM solution (63.55) varies with the selection of the regularization parameter, ρ. A five-fold cross-validation implementation is used to generate this curve.
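A minimal Python sketch of the five-fold selection loop follows (illustrative only; it assumes the train_kernel_svm helper sketched after listing (63.55), a kernel function, and the prediction rule (63.52)–(63.53)):

import numpy as np

def cv_error_for_rho(H, gamma, kernel, rho, folds=5, mu=0.1, passes=40):
    # five-fold cross validation: train on 4 folds, test on the held-out fold, average errors
    N = len(gamma)
    idx = np.array_split(np.random.permutation(N), folds)
    errors = []
    for k in range(folds):
        test = idx[k]
        train = np.concatenate([idx[j] for j in range(folds) if j != k])
        a, b = train_kernel_svm(H[train], gamma[train], kernel, mu=mu, rho=rho, passes=passes)
        wrong = 0
        for i in test:
            score = sum(gamma[train][m] * (a[m] + b[m] * kernel(H[i], H[train][m]))
                        for m in range(len(train)))
            wrong += (+1 if score >= 0 else -1) != gamma[i]
        errors.append(wrong / len(test))
    return np.mean(errors)

# sweep rho over a grid and keep the value with the smallest cross-validation error, e.g.
# best_rho = min([1, 2, 5, 10, 15, 20], key=lambda r: cv_error_for_rho(H, gamma, kernel, r))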

Example 63.8 (Application to breast cancer dataset) We reconsider the breast cancer dataset from Example 61.2. The data consist of N = 569 samples, with each sample corresponding to a benign or malignant cancer classification. We use γ(n) = −1 for benign samples and γ(n) = +1 for malignant samples. Each feature vector in the data contains M = 30 attributes corresponding to measurements extracted from a digitized image of a fine needle aspirate of a breast mass. The attributes describe characteristics of the cell nuclei present in the image; examples of these attributes were listed earlier in Table 53.1. All feature vectors are centered around the sample mean and their variances scaled to unity according to the preprocessing step described earlier under PCA in (57.6). We apply the PCA procedure (57.34) to reduce the dimension of the feature space down to M = 2. The resulting data samples are shown in Fig. 63.11. We select 456 samples for training and use the remaining 113 samples for testing. We run a kernel-based SVM implementation with ρ = 1 and µ = 0.05 using 100 runs over the data. Figure 63.12 shows the separation curves that result from using three different values for the variance of the Gaussian kernel, σ² ∈ {0.1, 1, 25}. Each row in the figure shows training samples on the left and test samples on the right, along with the separation curve. Observe how smaller values for σ² lead to overfitting. For example, the choice σ² = 0.1 leads to a separation curve that tightly encircles class γ = −1 in the training data. Table 63.1 lists the resulting empirical errors on the test data for the three kernel variances. As expected, overfitting degrades the empirical error rate on test data.

Figure 63.11 The plot shows N = 569 data samples for two-dimensional reduced feature vectors from a breast cancer dataset.

Table 63.1 Empirical error rates over test data for three different values of the variance parameter σ² for the Gaussian kernel.

    σ²      M     N      Ntrain    Ntest    Testing error
    0.1     2     569    456       113      7.96%
    1       2     569    456       113      5.31%
    25      2     569    456       113      4.42%

Example 63.9 (Quadratic program implementation) We present in this example an alternative kernel-based implementation for the soft-SVM problem in the form of a convex quadratic program, using arguments similar to those from Section 61.2. Motivated by the soft-margin SVM formulation (61.11a)–(61.11c) in the original domain {γ(m), h_m}, we formulate the following similar problem in the transformed domain {γ(m), h^φ_m}:

    ( w^{φ,⋆}, θ^{φ,⋆}, {s^⋆(m)} ) = argmin_{w^φ, θ^φ, s(m)}  { (1/2) ||w^φ||² + (η/N) \sum_{m=0}^{N-1} s(m) }        (63.56a)
        subject to  γ(m) ( (h^φ_m)^T w^φ − θ^φ ) ≥ 1 − s(m)                                                           (63.56b)
                    s(m) ≥ 0,  m = 0, 1, . . . , N − 1                                                                (63.56c)

where w^φ ∈ IR^{M_φ} and θ^φ ∈ IR. Repeating the duality arguments from Section 61.2, we can similarly verify that problem (63.56a)–(63.56c) can be solved by considering the dual optimization problem:

    λ^⋆ = argmin_{λ ∈ IR^N}  { (1/2) λ^T A λ − 1^T λ }                                                                (63.57a)
        subject to  0 ⪯ λ ⪯ (η/N) 1,   λ^T γ = 0                                                                      (63.57b)

Figure 63.12 Each row in the figure shows training samples on the left and test samples on the right, along with the separation curves that result from applying a soft-SVM kernel implementation for different values of the variance parameter for the Gaussian kernel, namely, σ² ∈ {0.1, 1, 25}.

where the notation a ⪯ b stands for elementwise comparison, and the N × 1 vector γ is defined by

    γ = col{ γ(0), γ(1), . . . , γ(N − 1) }                                                                           (63.58)

The entries of λ are denoted by {λ(m)} and they play the role of dual variables. Moreover, the matrix A is N × N with entries now defined in terms of kernel evaluations:

    [A]_{n,m} = γ(n) γ(m) (h^φ_n)^T h^φ_m = γ(n) γ(m) K(h_n, h_m)                                                     (63.59)

Problem (63.57a)–(63.57b) is a convex quadratic program: it involves a cost in (63.57a) that is quadratic in λ, with coefficients { (1/2)A, −1 }. It also involves the linear constraint λ^T γ = 0 and the condition 0 ⪯ λ ⪯ (η/N)1. A quadratic program solver can be used to determine a solution λ^⋆. This solution exhibits a useful property, namely, that most of the entries of λ^⋆ will be zero, while the nonzero values λ^⋆(m) ≠ 0 will only occur for data points (γ(m), h^φ_m) that correspond to support vectors. These are points for which

    (γ(m), h^φ_m) is a support vector  ⟺  γ(m) ( (h^φ_m)^T w^{φ,⋆} − θ^{φ,⋆} ) ≤ 1                                      (63.60)

Furthermore, in a manner similar to (61.53), the prediction for the class variable is given by

    γ̂(h) = \sum_{s ∈ S} λ^⋆(s) γ(s) K(h, h_s) − (1/|S_1|) \sum_{s ∈ S_1} ( \sum_{s' ∈ S} λ^⋆(s') γ(s') K(h_s, h_{s'}) − γ(s) )        (63.61)

where S is the set of support vectors and S_1 ⊂ S is the subset of support vectors that meet the margin. Subsequently, the test vector h is assigned to class ±1 by using:

    if γ̂(h) ≥ 0, assign h to class +1;   if γ̂(h) < 0, assign h to class −1                                             (63.62)
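As an illustration (not from the text), the following Python sketch builds the matrix A in (63.59) and evaluates the prediction (63.61)–(63.62); the dual vector lam is assumed to have been obtained from any quadratic-program solver under the constraints (63.57b), and the tolerance and the rule used to identify on-margin support vectors are assumptions:

import numpy as np

def build_dual_matrix(H, gamma, kernel):
    # [A]_{n,m} = gamma(n) gamma(m) K(h_n, h_m), as in (63.59)
    N = len(gamma)
    K = np.array([[kernel(H[n], H[m]) for m in range(N)] for n in range(N)])
    return (gamma[:, None] * gamma[None, :]) * K, K

def svm_predict(h, H, gamma, lam, kernel, eta_over_N, tol=1e-8):
    S = np.where(lam > tol)[0]                      # support vectors (nonzero duals)
    S1 = S[lam[S] < eta_over_N - tol]               # support vectors on the margin
    score = sum(lam[s] * gamma[s] * kernel(h, H[s]) for s in S)
    theta = np.mean([sum(lam[sp] * gamma[sp] * kernel(H[s], H[sp]) for sp in S) - gamma[s]
                     for s in S1])                  # offset estimate from (63.61)
    return +1 if score - theta >= 0 else -1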

63.6 KERNEL-BASED RIDGE REGRESSION

The kernel-based formulation can be extended to other inference problems. In the last two sections, we devised kernel-based constructions for the perceptron and SVM classifiers and used them to solve classification problems with nonlinear separation surfaces. In this and the following sections, we apply kernel-based arguments to other problems, including ridge regression, PCA, more general loss functions in learning, and inference with Gaussian processes. We start with ridge regression.

Consider again N data points {γ(m), h_m} where γ(m) is now real-valued and not necessarily limited to the binary choices ±1. The feature vectors h_m continue to be M-dimensional. We pose the following ℓ2-regularized least-squares problem (already studied earlier in (50.114)):

    w_{N−1} = argmin_{w ∈ IR^M}  { ρ′ λ^N ||w||² + \sum_{m=0}^{N-1} λ^{N−1−m} ( γ(m) − h_m^T w )² }                      (63.63)

where ρ′ > 0 is a regularization parameter and 0 ≪ λ ≤ 1 denotes a scalar forgetting factor. The notation w_{N−1}, with a subscript N − 1, indicates that the solution w_{N−1} is based on the data {γ(m), h_m} up to time N − 1. Once computed, the solution w_{N−1} allows us to determine predictions for the target variable γ by means of the linear regression model:

    γ̂ = h^T w_{N−1}                                                                                                      (63.64)

There is an implicit assumption in this construction that a linear mapping from the feature space to the target space is justified. This need not be the case in many situations. We will explain that a kernel-based formulation allows us to employ more general nonlinear models to map the features h to their predictions γ̂. This will be achieved by transforming the {h_m} into higher-dimensional vectors {h^φ_m} and solving a ridge regression problem in the transformed domain. Once the solution w^φ_{N−1} is computed in that domain, the target variables can then be predicted by using

    γ̂ = (h^φ)^T w^φ_{N−1}                                                                                                 (63.65)

This solution would be effective if it can be carried out without the need to explicitly know or form the transformed vectors {hφm }. We will explain that this is indeed possible. But first, we examine more closely the solution to problem (63.63) in the original domain in order to highlight some properties that will enable the kernel-based solution.

63.6.1

Using the Gramian Matrix

We already know from (50.121b) that the solution w_{N−1} ∈ IR^M satisfies the linear equations:

    ( ρ′ λ^N I_M + H_{N−1}^T Λ_{N−1} H_{N−1} ) w_{N−1} = H_{N−1}^T Λ_{N−1} d_{N−1}                                         (63.66)

where H_{N−1} ∈ IR^{N×M} is the data matrix and d_{N−1} ∈ IR^{N×1} is the target vector with entries:

    H_{N−1} = col{ h_0^T, h_1^T, h_2^T, . . . , h_{N−1}^T },     d_{N−1} = col{ γ(0), γ(1), γ(2), . . . , γ(N − 1) }       (63.67)

Moreover, the weighting matrix Λ_{N−1} is diagonal and given by

    Λ_{N−1} = diag{ λ^{N−1}, λ^{N−2}, . . . , λ, 1 }                                                                        (63.68)

The resulting prediction for the target vector is denoted by

    d̂_{N−1} = H_{N−1} w_{N−1}                                                                                               (63.69)

We can rewrite this expression in an alternative form that will facilitate the solution of the regularized least-squares problem in the transformed feature space. Note first from (63.66) that

    ρ′ λ^N w_{N−1} = H_{N−1}^T Λ_{N−1} ( d_{N−1} − H_{N−1} w_{N−1} )                                                        (63.70)

If we introduce the N × 1 column vector

    α_{N−1} = (1 / ρ′ λ^N) ( d_{N−1} − H_{N−1} w_{N−1} )                                                                    (63.71)

then we can deduce from (63.70) that

    w_{N−1} = H_{N−1}^T Λ_{N−1} α_{N−1}                                                                                     (63.72)


Substituting this relation into (63.69) and (63.71) leads to the following useful expressions for d̂_{N−1} and α_{N−1}, which involve the N × N Gramian matrix B_{N−1} = H_{N−1} H_{N−1}^T :

    d̂_{N−1} = B_{N−1} Λ_{N−1} α_{N−1}                                                                                       (63.73a)
    α_{N−1} = ( ρ′ λ^N I_N + B_{N−1} Λ_{N−1} )^{−1} d_{N−1}                                                                  (63.73b)

The entries of B_{N−1} consist of inner products between the individual feature vectors, i.e.,

    [B_{N−1}]_{m,m′} = h_m^T h_{m′}                                                                                          (63.74)

The key point to note is that the above expression for d̂_{N−1} allows us to compute the predictions {γ̂(m)} by relying solely on knowledge of the Gramian matrix B_{N−1}; there is no need to use the actual feature vectors, as is the case with expression (63.69). This observation suggests that it should be possible to devise a kernel-based implementation for ridge regression problems in the transformed space. This generalization is useful because, while problem (63.63) assumes that a linear regression model can fit the data {γ(m), h_m}, a regularized problem in the kernel space allows us to handle situations where nonlinear models are more suitable. The idea is to transform the feature vectors {h_m} into longer transformed vectors {h^φ_m} and to fit a linear regression model to the transformed data {γ(m), h^φ_m}.

63.6.2

Kernel-Based Least-Squares

Thus, consider now a regularized least-squares problem in the transformed domain {h^φ_n}, namely,

    w^φ_{N−1} = argmin_{w^φ ∈ IR^{M_φ}}  { ρ′ λ^N ||w^φ||² + \sum_{m=0}^{N-1} λ^{N−1−m} ( γ(m) − (h^φ_m)^T w^φ )² }            (63.75)

Then, repeating the above derivation leads to the same expressions (63.73a)–(63.73b) for {d̂_{N−1}, α_{N−1}}, with the only difference being that the matrix B_{N−1} is now replaced by the kernel-based Gramian matrix A_{N−1} as follows:

    [A_{N−1}]_{m,m′} = (h^φ_m)^T h^φ_{m′} = K(h_m, h_{m′})                                                                     (63.76a)
    d̂_{N−1} = A_{N−1} Λ_{N−1} α_{N−1}                                                                                          (63.76b)
    α_{N−1} = ( ρ′ λ^N I_N + A_{N−1} Λ_{N−1} )^{−1} d_{N−1}                                                                     (63.76c)

Observe that the entries of AN −1 are fully determined by kernel computations involving the original vectors and not their transformed versions. Expressions (63.76a)–(63.76c) show how to solve ridge regression problems in kernel space.


Now, given a test vector h, with transformed version h^φ, we can estimate its target variable as follows:

    γ̂(h) = (h^φ)^T w^φ_{N−1}
          = (h^φ)^T (H^φ_{N−1})^T Λ_{N−1} α_{N−1}                                   (step (a))
          = [ K(h, h_0)  K(h, h_1)  · · ·  K(h, h_{N−1}) ] Λ_{N−1} α_{N−1}
          = \sum_{m=0}^{N-1} λ^{N−1−m} α(m) K(h, h_m)                               (step (b))              (63.77)

where in step (a) we used an expression analogous to (63.72) for w^φ_{N−1} in terms of the data matrix H^φ_{N−1} whose rows are the transposed vectors {(h^φ_m)^T}. In step (b) we are denoting the individual entries of α_{N−1} by {α(m)}. In summary, we arrive at listing (63.78).

Kernel-based ridge regression.

given N data points {γ(m), h_m};
given forgetting factor 0 ≪ λ ≤ 1; given regularization parameter ρ′ ≥ 0;
choose a kernel function K(h, h′).
(training) compute:
    d_{N−1} = col{ γ(0), γ(1), . . . , γ(N − 1) }
    Λ_{N−1} = diag{ λ^{N−1}, . . . , λ, 1 }
    [A_{N−1}]_{m,m′} = K(h_m, h_{m′}),  m, m′ = 0, 1, . . . , N − 1
    α_{N−1} = ( ρ′ λ^N I_N + A_{N−1} Λ_{N−1} )^{−1} d_{N−1}
    denote the entries of α_{N−1} by {α(m)}, m = 0, 1, . . . , N − 1
end
(prediction) for any test vector h, predict its target using:
    γ̂(h) = \sum_{m=0}^{N-1} λ^{N−1−m} α(m) K(h, h_m)
                                                                                                            (63.78)
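The following Python sketch of listing (63.78) is illustrative only (the function and variable names are assumptions):

import numpy as np

def kernel_ridge_fit(H, d, kernel, rho=1.0, lam=1.0):
    # training step of listing (63.78)
    N = len(d)
    A = np.array([[kernel(H[m], H[mp]) for mp in range(N)] for m in range(N)])
    Lam = np.diag(lam ** np.arange(N - 1, -1, -1))        # diag{lam^{N-1}, ..., lam, 1}
    alpha = np.linalg.solve(rho * lam ** N * np.eye(N) + A @ Lam, d)
    return alpha

def kernel_ridge_predict(h, H, alpha, kernel, lam=1.0):
    # prediction step: gamma_hat(h) = sum_m lam^{N-1-m} alpha(m) K(h, h_m)
    N = len(alpha)
    weights = lam ** np.arange(N - 1, -1, -1)
    return sum(weights[m] * alpha[m] * kernel(h, H[m]) for m in range(N))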

63.7

KERNEL-BASED LEARNING The main insight in the previous sections has been to lift the original feature vectors {hn } into a higher-dimensional space and replace them by {hφn }. Subsequently, classification or regression problems are solved in the enlarged space. Through lifting, we transformed problems that require nonlinear classifier or inference mappings in the lower-dimensional space to problems that rely on linear mappings in the higher-dimensional space. One may wonder about the reason


that enables us to “linearize” inference problems by lifting the data to higherdimensional spaces. We clarify this question in this section. The kernel-based algorithms derived in the earlier sections are special cases of a more general formalism. In particular, observe from expressions (63.33), (63.52), and (63.77) for perceptron, SVM, and ridge regression that the estimated target variable, γ b(h), has the form of a linear combination of kernel function evaluations. This form turns out to be significant to our question – see statement (63.82) further ahead. In the comments at the end of the chapter we explain how, given a kernel function K(h, h0 ), we can always associate a so-called reproducing kernel Hilbert space (RKHS) with it. We will denote this vector space by the letter H. This is a space of functions of the feature variable, h. For example, the prediction γ b(h), which is a function of h, lives in this space. It is not necessary for the purposes of this section to define what an RKHS space is or how it is constructed; its existence is a consequence of a key result in functional analysis known as the Moore–Aronszajn theorem. We defer these discussions and the statement of the theorem to the comments. What is sufficient for our presentation here is a fundamental property for these spaces captured by the Representer theorem.

63.7.1

Representer Theorem

Assume we are given a collection of N training points {γ(m), h_m}, where γ(m) denotes the target variable for h_m ∈ IR^M. For classification problems, the variable γ(m) is binary-valued, such as γ(m) ∈ {±1}, whereas for regression problems it is real-valued. We let γ̂(h) : IR^M → IR denote an arbitrary element in the RKHS space H; this is simply some function that maps feature vectors h into predictions for their targets, γ(h). With each training point (γ(m), h_m), we associate a loss value denoted by Q(γ(m), γ̂(h_m)), such as the ones studied before involving the quadratic, logistic, perceptron, and hinge losses:

    Q(γ, γ̂) = (γ − γ̂)²                        (quadratic)           (63.79a)
    Q(γ, γ̂) = ln( 1 + e^{−γ γ̂} )              (logistic)            (63.79b)
    Q(γ, γ̂) = max{ 0, −γ γ̂ }                  (perceptron)          (63.79c)
    Q(γ, γ̂) = max{ 0, 1 − γ γ̂ }               (hinge)               (63.79d)

We then consider the problem of seeking the function γ̂(h) ∈ H that minimizes the following ℓ2-regularized empirical risk:

    γ̂^⋆(h) = argmin_{γ̂(h) ∈ H}  { ρ ||γ̂(h)||²_H + (1/N) \sum_{m=0}^{N-1} Q( γ(m), γ̂(h_m) ) }                                 (63.80)

where ρ > 0 is a regularization parameter. This problem involves minimizing over a function, denoted by γ̂(h), and this function can depend in some nonlinear manner on h. Also, the regularization on γ̂(h) is defined in terms of its squared norm in the Hilbert space H, i.e., in terms of some inner product operation denoted generically by the notation ⟨·,·⟩:

    ||γ̂(h)||²_H = ⟨ γ̂(h), γ̂(h) ⟩_H                                                                                            (63.81)

The exact form of the inner product is not necessary at this point; we will provide an expression for it soon – see expression (63.90) and also the explanation at the end of the chapter. The following result asserts that the optimal solution γ̂^⋆(h) is a combination of a finite number of kernel evaluations centered at the given feature vectors – the proof is given in the comments section after (63.213).

Representer theorem: For any convex loss function Q(γ, γ̂) : IR² → IR, it holds that the optimal solution of (63.80) has the form

    γ̂^⋆(h) = \sum_{m=0}^{N-1} α^⋆(m) K(h, h_m)                                                                                 (63.82)

for some real coefficients {α^⋆(m)}.

It follows from this result and from the properties of the RKHS that (see the explanation in the comments at the end of the chapter):

    ||γ̂^⋆(h)||²_H = \sum_{m=0}^{N-1} \sum_{m'=0}^{N-1} α^⋆(m) α^⋆(m′) K(h_m, h_{m′})                                            (63.83)

We can use this expression, and the result of the Representer theorem, to transform the optimization problem (63.80) over the infinite-dimensional space H into a more tractable optimization problem over IRMφ .

63.7.2

Optimization Problem

To do so, we let α ∈ IR^{N×1} denote a column vector of size N. For any feature vector h, we introduce the following column vector involving kernel evaluations of h with the training data:

    u_h = col{ K(h, h_0), K(h, h_1), . . . , K(h, h_{N−1}) } ∈ IR^{N×1}                                                          (63.84)

We also introduce the N × N Gramian matrix

    [A]_{m,m′} = K(h_m, h_{m′}),  m, m′ = 0, 1, . . . , N − 1                                                                    (63.85)

and the matrix of transformed feature vectors

    Φ = [ h^φ_0  h^φ_1  . . .  h^φ_{N−1} ] ∈ IR^{M_φ × N}                                                                        (63.86)

so that

    A = Φ^T Φ                                                                                                                    (63.87)

The nth column of A corresponds to the kernel evaluations of h_n with all other training vectors. We denote this column by u_n so that

    u_n = u_{h_n} = nth column of A = col{ K(h_n, h_0), K(h_n, h_1), . . . , K(h_n, h_{N−1}) } = Φ^T h^φ_n                       (63.88)

Result (63.82) shows that we can parameterize the sought-after function γ̂(h) in the following linear form in the expanded domain:

    γ̂(h) = u_h^T α = (h^φ)^T Φ α                                                                                                 (63.89)

for some vector α ∈ IR^N. Form (63.83) also suggests that we can replace the regularization factor ||γ̂(h)||²_H in (63.80) by the quadratic form

    ||γ̂(h)||²_H = α^T A α = α^T Φ^T Φ α                                                                                          (63.90)

Let further

    w^φ = Φ α ∈ IR^{M_φ}                                                                                                          (63.91)

in which case the predictor (63.89) is given by

    γ̂(h) = (h^φ)^T w^φ                                                                                                            (63.92)

Observe that this representation does not include an offset parameter; it is therefore implicitly assumed that the feature data has been centered around its sample mean. Substituting (63.90)–(63.91) into (63.80), we conclude that we can replace the original optimization problem over the nonlinear functional γ̂(h) by an alternative optimization problem over w^φ ∈ IR^{M_φ}:

    w^{φ,⋆} = argmin_{w^φ ∈ IR^{M_φ}}  { ρ ||w^φ||² + (1/N) \sum_{m=0}^{N-1} Q( γ(m), (φ(h_m))^T w^φ ) }                           (63.93)

It turns out that solutions w^{φ,⋆} to (63.93) will belong to the column span of Φ, as required by (63.91):

    w^{φ,⋆} ∈ R(Φ)                                                                                                                 (63.94)

For this reason, constraint (63.91) is removed from (63.93) and the minimization is over all w^φ ∈ IR^{M_φ}. The validity of this statement is established next.

Proof of (63.94): Any generic vector w^φ ∈ IR^{M_φ} can be decomposed orthogonally as

    w^φ = a^φ + b^φ                                                                                                                (63.95)

for some column vectors (a^φ, b^φ), where a^φ ∈ R(Φ) and b^φ ∈ N(Φ^T), i.e., Φ^T b^φ = 0 and (a^φ)^T b^φ = 0. Substituting this decomposition for w^φ into the cost in (63.93), we find that

    ρ ||w^φ||² + (1/N) \sum_{m=0}^{N-1} Q( γ(m), (φ(h_m))^T w^φ )
        = ρ ||a^φ||² + ρ ||b^φ||² + (1/N) \sum_{m=0}^{N-1} Q( γ(m), (φ(h_m))^T a^φ )                                               (63.96)

The minimum over b^φ is attained at b^φ = 0, so that problem (63.93) is equivalent to

    w^{φ,⋆} = a^{φ,⋆} = argmin_{a^φ ∈ R(Φ)}  { ρ ||a^φ||² + (1/N) \sum_{m=0}^{N-1} Q( γ(m), (φ(h_m))^T a^φ ) }                      (63.97)

In other words, any solution to problem (63.93) will be of the form (63.91).



63.7.3

Stochastic Learning

We therefore arrived at the ℓ2-regularized empirical minimization problem (63.93), which is similar in form to the many problems we examined in earlier chapters while discussing stochastic gradient, subgradient, and proximal algorithms. Consider, for example, the case of the quadratic loss (63.79a). In this case, the stochastic gradient algorithm for solving (63.93) takes the form:

    select a data pair (γ(n), h^φ_n) at random at iteration n                                                                       (63.98a)
    γ̂(n) = (h^φ_n)^T w^φ_{n−1}                                                                                                      (63.98b)
    w^φ_n = (1 − 2µρ) w^φ_{n−1} + 2µ h^φ_n ( γ(n) − γ̂(n) )                                                                          (63.98c)

If we consider instead the perceptron loss (63.79c), then the stochastic subgradient algorithm for solving (63.93) will be of the form

    select a data pair (γ(n), h^φ_n) at random at iteration n                                                                       (63.99a)
    γ̂(n) = (h^φ_n)^T w^φ_{n−1}                                                                                                      (63.99b)
    w^φ_n = (1 − 2µρ) w^φ_{n−1} + µ γ(n) h^φ_n I[γ(n) γ̂(n) ≤ 0]                                                                     (63.99c)

From this point, we can repeat the argument that led to (63.46) to arrive at the same kernel-based perceptron algorithm (albeit with the variable a(m) set to 0, since the derivation in this section ignores the scalar offset parameter θ^φ). In order to account for offsets, we can modify (63.93) as follows:

    ( w^{φ,⋆}, θ^{φ,⋆} ) = argmin_{w^φ ∈ IR^{M_φ}, θ^φ ∈ IR}  { ρ ||w^φ||² + (1/N) \sum_{n=0}^{N-1} Q( γ(n), (φ(h_n))^T w^φ − θ^φ ) }        (63.100)


where regularization is applied to w^φ only. This formulation effectively amounts to modifying the original optimization problem (63.80) by searching over predictors with offset of the form γ̂_θ(h) = γ̂(h) − θ^φ, where θ^φ ∈ IR, γ̂(h) ∈ H, and the minimization is carried out over both {γ̂(h), θ^φ}. Starting from (63.100), we can then follow arguments similar to what we did before for the kernel-based perceptron and SVM algorithms in (63.46) and (63.55). Indeed, for arbitrary convex losses Q(γ(n), γ̂(n)), a stochastic gradient (or subgradient) implementation would take the form:

    select a data pair (γ(n), h^φ_n) at random at iteration n                                                                        (63.101a)
    γ̂(n) = (h^φ_n)^T w^φ_{n−1} − θ^φ(n−1)                                                                                            (63.101b)
    θ^φ(n) = θ^φ(n−1) − µ ∂_{θ^φ} Q( γ(n), γ̂(n) )                                                                                     (63.101c)
    w^φ_n = (1 − 2µρ) w^φ_{n−1} − µ ∂_{(w^φ)^T} Q( γ(n), γ̂(n) )                                                                       (63.101d)

in terms of subgradient constructions for the loss function relative to θ^φ and w^φ, evaluated at (γ(n), γ̂(n)). We can develop these recursions further as follows. Let

    π(γ, γ̂) = ∂_{γ̂} Q(γ, γ̂)                                                                                                          (63.102)

denote a scalar-valued subgradient construction for Q(γ, γ̂) relative to γ̂. Noting that, by construction, γ̂ = (h^φ)^T w^φ − θ^φ, and using the chain rule (see Probs. 8.29 and 8.30), we can determine subgradients for Q(·,·) as follows:

    ∂_{θ^φ} Q(γ, γ̂) = −π(γ, γ̂)                                                                                                        (63.103a)
    ∂_{(w^φ)^T} Q(γ, γ̂) = h^φ π(γ, γ̂)                                                                                                 (63.103b)

Substituting into (63.101b)–(63.101d) we get

    select a data pair (γ(n), h^φ_n) at random at iteration n                                                                         (63.104a)
    γ̂(n) = (h^φ_n)^T w^φ_{n−1} − θ^φ(n−1)                                                                                             (63.104b)
    θ^φ(n) = θ^φ(n−1) + µ π( γ(n), γ̂(n) )                                                                                              (63.104c)
    w^φ_n = (1 − 2µρ) w^φ_{n−1} − µ h^φ_n π( γ(n), γ̂(n) )                                                                              (63.104d)

It is clear that updates will occur whenever

    π( γ(n), γ̂(n) ) ≠ 0                                                                                                               (63.105)

This condition plays the same role as updating the perceptron or SVM recursions whenever I[γ(n) γ̂(n) ≤ 0] or I[γ(n) γ̂(n) ≤ 1].
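The recursions (63.104a)–(63.104d) live in the transformed domain, but they can be carried out using kernels only, by tracking coefficients of the training features in the same way as the counter argument that led to (63.46). The following Python sketch illustrates this (names, defaults, and the coefficient-tracking formulation are assumptions; loss_subgrad should return π(γ, γ̂) from (63.102)):

import numpy as np

def kernel_stochastic_learn(H, gamma, kernel, loss_subgrad, mu=0.05, rho=0.1, passes=20, seed=0):
    # w_n^phi is never formed; instead we track c(m) with w_n^phi = sum_m c(m) h_m^phi
    rng = np.random.default_rng(seed)
    N = len(gamma)
    K = np.array([[kernel(H[i], H[j]) for j in range(N)] for i in range(N)])
    c = np.zeros(N)      # combination coefficients of the training features
    theta = 0.0          # offset parameter
    for _ in range(passes * N):
        n = rng.integers(N)
        score = K[n] @ c - theta                    # gamma_hat(n), cf. (63.104b)
        pi = loss_subgrad(gamma[n], score)
        theta += mu * pi                            # (63.104c)
        c *= (1 - 2 * mu * rho)                     # scaling from (63.104d)
        c[n] -= mu * pi                             # kernel-domain counterpart of -mu h_n^phi pi
    return c, theta

# example with the hinge loss: pi = -gamma if 1 - gamma*score > 0, else 0
# c, th = kernel_stochastic_learn(H, gamma, k, lambda g, s: -g if 1 - g * s > 0 else 0.0)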

63.8

KERNEL PCA

We devise next a kernel implementation for the PCA procedure from Chapter 57. The objective is to perform dimensionality reduction in the higher-dimensional space by working directly with kernel evaluations and without the need for the extended features {h^φ_n}. Thus, consider feature vectors {h_n ∈ IR^M} and a kernel function

    K(h_n, h_m) = (h^φ_n)^T h^φ_m ,     h^φ_n = φ(h_n)                                                                                  (63.106)

for some mapping φ(·) and where each h^φ_n ∈ IR^{M_φ}. The corresponding N × N Gramian matrix is denoted by

    [A]_{n,m} = K(h_n, h_m)                                                                                                             (63.107)

whose entries are the pairwise kernel values of all training feature vectors. The matrix A is symmetric and nonnegative definite. We desire to replace each extended vector h^φ_n ∈ IR^{M_φ} by a reduced vector h^{φ′}_n ∈ IR^{M′_φ} where M′_φ ≪ M_φ. We start by computing the sample mean

    h̄^φ = (1/N) \sum_{n=0}^{N-1} h^φ_n                                                                                                   (63.108)

and centering the transformed vectors around it:

    h^φ_{n,c} = h^φ_n − h̄^φ,    n = 0, 1, 2, . . . , N − 1                                                                                (63.109)

with a subscript c. In this section, for simplicity, we only perform mean centering and do not perform variance normalization, as was the case with step (57.4) in our description of the traditional PCA procedure. The Gramian matrix of the centered variables (63.109), which we denote by A_c, can be related to the Gramian matrix A of the original feature vectors. To see this, note that the (n, m)th entry of A_c is given by

    [A_c]_{n,m} = (h^φ_{n,c})^T h^φ_{m,c}
                = ( h^φ_n − h̄^φ )^T ( h^φ_m − h̄^φ )
                = (h^φ_n)^T h^φ_m − (h^φ_n)^T h̄^φ − (h̄^φ)^T h^φ_m + (h̄^φ)^T h̄^φ
                = [A]_{n,m} − (h^φ_n)^T h̄^φ − (h̄^φ)^T h^φ_m + (h̄^φ)^T h̄^φ                                                                (63.110)

Each of the last three terms in the above expression can be evaluated as follows:

    (h^φ_n)^T h̄^φ = (h^φ_n)^T ( (1/N) \sum_{k=0}^{N-1} h^φ_k )
                  = (1/N) \sum_{k=0}^{N-1} (h^φ_n)^T h^φ_k
                  = (1/N) \sum_{k=0}^{N-1} K(h_n, h_k)
                  = (1/N) [A]_{n,:} 1_N                                                                                                    (63.111)

in terms of the inner product between the nth row of A and the vector of all unit entries. Likewise,

    (h̄^φ)^T h^φ_m = (1/N) 1_N^T [A]_{:,m}                                                                                                  (63.112)

in terms of the inner product between the mth column of A and the vector of all unit entries. Moreover,

    (h̄^φ)^T h̄^φ = ( (1/N) \sum_{n=0}^{N-1} h^φ_n )^T ( (1/N) \sum_{k=0}^{N-1} h^φ_k )
                = (1/N²) \sum_{n=0}^{N-1} \sum_{k=0}^{N-1} (h^φ_n)^T h^φ_k
                = (1/N²) 1_N^T A 1_N                                                                                                        (63.113)

in terms of the sum of all entries of A. Collecting these expressions, it follows that the Gramian matrices {A_c, A} are related via

    A_c = A − (1/N) ( 1_N 1_N^T A + A 1_N 1_N^T ) + (1/N²) ( 1_N^T A 1_N ) 1_N 1_N^T                                                        (63.114)

so that knowledge of A is sufficient to determine A_c, and both of these matrices involve only kernel-type calculations. We can now proceed with dimensionality reduction in the kernel domain. To begin with, we introduce the M_φ × M_φ sample covariance matrix of the centered and transformed feature vectors:

    R̂_φ = (1/(N − 1)) \sum_{n=0}^{N-1} h^φ_{n,c} (h^φ_{n,c})^T                                                                              (63.115)

and consider its eigen-decomposition:

    R̂_φ = U_φ Λ_φ U_φ^T                                                                                                                     (63.116)

where U_φ is M_φ × M_φ orthogonal and Λ_φ is diagonal with nonnegative entries. We retain the M′_φ eigenvectors of R̂_φ that correspond to its largest eigenvalues and discard the rest. The challenge we face here is that we cannot carry out the above eigen-decomposition because the matrix R̂_φ is not available. This is because its calculation relies on knowing the variables {h^φ_{n,c}}, which in turn requires knowledge of the transformation φ(·); only the kernel function K(·,·) is known. We follow an alternate route to determine the dominant eigenvectors of R̂_φ.

Let u denote a generic column of U_φ. Then, this eigenvector satisfies

    R̂_φ u = λ u                                                                                                                              (63.117)

where λ is the corresponding eigenvalue. The vector u has unit norm and is orthogonal to all other eigenvectors in U_φ. Using (63.115), we have

    ( (1/(N − 1)) \sum_{n=0}^{N-1} h^φ_{n,c} (h^φ_{n,c})^T ) u = λ u                                                                          (63.118)


from which we conclude that

    u = (1/((N − 1)λ)) \sum_{n=0}^{N-1} ( (h^φ_{n,c})^T u ) h^φ_{n,c}
      = \sum_{n=0}^{N-1} a(n) h^φ_{n,c}                                                                                                       (63.119)

for some coefficients {a(n)} to be determined. This relation expresses u as a combination of the centered and transformed feature vectors. Multiplying (63.118) from the left by (h^φ_{k,c})^T, for any k = 0, 1, . . . , N − 1, and using the above representation for u, we obtain:

    (h^φ_{k,c})^T ( (1/(N − 1)) \sum_{n=0}^{N-1} h^φ_{n,c} (h^φ_{n,c})^T ) ( \sum_{m=0}^{N-1} a(m) h^φ_{m,c} ) = λ \sum_{m=0}^{N-1} a(m) (h^φ_{k,c})^T h^φ_{m,c}        (63.120)

or, equivalently, in terms of the entries of the Gramian matrix A_c:

    (1/(N − 1)) \sum_{n=0}^{N-1} \sum_{m=0}^{N-1} a(m) [A_c]_{k,n} [A_c]_{n,m} = λ \sum_{m=0}^{N-1} a(m) [A_c]_{k,m}                                                     (63.121)

We collect the coefficients {a(m)} into a column vector a. Writing relation (63.121) for all k = 0, 1, . . . , N − 1 in matrix form, we obtain

    (1/(N − 1)) A_c² a = λ A_c a                                                                                                              (63.122)

Assuming A_c is invertible, we find that the problem of determining the vector a reduces to solving the eigenvalue problem:

    A_c a = ν a                                                                                                                               (63.123)

Since A_c is symmetric (and also nonnegative), there will be M_φ orthonormal eigenvectors {a_ℓ} with nonnegative eigenvalues {ν_ℓ} – clearly, ν_ℓ = (N − 1)λ_ℓ for ℓ = 1, 2, . . . , M_φ. Now, for illustration purposes, assume we follow this procedure and determine two eigenvectors for A_c, say, a_1 and a_2, with eigenvalues ν_1 and ν_2. The eigenvectors {a_1, a_2} will both have unit norm and be orthogonal to each other. From (63.119), these values for {a_1, a_2} will lead to the following eigenvectors {u_1, u_2} for R̂_φ:

    u_1 = \sum_{n=0}^{N-1} a_1(n) h^φ_{n,c}                                                                                                   (63.124a)
    u_2 = \sum_{n=0}^{N-1} a_2(n) h^φ_{n,c}                                                                                                   (63.124b)

so that

    ||u_1||² = a_1^T A_c a_1 = ν_1 ||a_1||² = ν_1                                                                                             (63.125a)
    ||u_2||² = a_2^T A_c a_2 = ν_2 ||a_2||² = ν_2                                                                                             (63.125b)
    u_1^T u_2 = a_1^T A_c a_2 = 0                                                                                                             (63.125c)

Therefore, the u-eigenvectors that result from this construction will be orthogonal to each other, as desired. However, their norms will not be unity. For this reason, once an eigenvector a is determined with eigenvalue ν, we need to scale it by √ν. Using the scaled vector a/√ν, the projection of any new feature vector h along the direction of the eigenvector u determined from a is given by

    (h^φ_c)^T u = (h^φ_c)^T ( (1/√ν) \sum_{n=0}^{N-1} a(n) h^φ_{n,c} ) = (1/√ν) \sum_{n=0}^{N-1} a(n) K_c(h, h_n)                             (63.126)

where, in view of the calculation (63.110) or (63.114), the “centered” kernel value that appears in the above expression is given by

    K_c(h, h_n) = (h^φ_c)^T h^φ_{n,c}
                = K(h, h_n) − (1/N) \sum_{k=0}^{N-1} K(h, h_k) − (1/N) \sum_{k=0}^{N-1} K(h_k, h_n) + (1/N²) \sum_{k=0}^{N-1} \sum_{m=0}^{N-1} K(h_k, h_m)              (63.127)

In summary, dimensionality reduction in the kernel domain, from dimension h^φ ∈ IR^{M_φ} to dimension h^{φ′} ∈ IR^{M′_φ}, can be accomplished by listing (63.128).

Kernel-based PCA algorithm.

given N feature vectors {h_n ∈ IR^M}, n = 0, 1, . . . , N − 1;
given a kernel function K(h, h′);
objective: obtain reduced features {h^{φ′}_n ∈ IR^{M′_φ}};
compute:
    let [A]_{n,m} = K(h_n, h_m), which has size N × N;
    let A_c = A − (1/N) ( 1_N 1_N^T A + A 1_N 1_N^T ) + (1/N²) ( 1_N^T A 1_N ) 1_N 1_N^T
    determine the M′_φ largest eigenvalues and corresponding eigenvectors of A_c,
        denoted by {ν_k, a_k}, k = 1, 2, . . . , M′_φ;
    for any h_n ∈ IR^M, compute the entries of h^{φ′}_n ∈ IR^{M′_φ} using:
        kth entry of h^{φ′}_n = (1/√ν_k) \sum_{m=0}^{N-1} a_k(m) K_c(h_n, h_m)
        using expression (63.127) for K_c(h_n, h_m)
end
                                                                                                                                               (63.128)
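A minimal Python sketch of listing (63.128) for the training points follows (illustrative only; the function name and the use of numpy's symmetric eigensolver are assumptions). For training features, K_c(h_n, h_m) coincides with [A_c]_{n,m}, which the sketch exploits:

import numpy as np

def kernel_pca(H, kernel, n_components=2):
    N = len(H)
    A = np.array([[kernel(H[n], H[m]) for m in range(N)] for n in range(N)])
    ones = np.ones((N, N)) / N
    Ac = A - ones @ A - A @ ones + ones @ A @ ones           # centered Gramian, eq. (63.114)
    nu, a = np.linalg.eigh(Ac)                                # eigenvalues in ascending order
    idx = np.argsort(nu)[::-1][:n_components]                 # keep the largest ones
    nu, a = nu[idx], a[:, idx]
    # kth reduced coordinate of h_n: (1/sqrt(nu_k)) sum_m a_k(m) Kc(h_n, h_m) = (Ac a_k)_n / sqrt(nu_k)
    reduced = (Ac @ a) / np.sqrt(nu)
    return reduced, (nu, a)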

63.9

INFERENCE UNDER GAUSSIAN PROCESSES

We end the chapter by explaining how the formalism of Gaussian processes is intrinsically related to kernel computations. Recall that we defined Gaussian processes earlier in Section 4.5. Given features h ∈ IR^M, a Gaussian process is a transformation g(h) that satisfies some useful properties. First, we associate with every such process a mean function and a covariance function defined as follows:

    m(h) = E g(h)                                                                                                                              (63.129a)
    K(h, h′) = E ( g(h) − m(h) ) ( g(h′) − m(h′) )                                                                                             (63.129b)

Using these functions, we can evaluate the mean of g(h) for any h, and the cross-covariance between g(h) and g(h′) for any h, h′. Moreover, if we consider any sub-collection of feature vectors, say, four of them, then the corresponding transformations {g(h_a), g(h_b), g(h_c), g(h_d)} will be jointly Gaussian-distributed, with moments constructed from the mean and covariance functions:

    m̄ = col{ m(h_a), m(h_b), m(h_c), m(h_d) },     [R_g]_{i,j} = K(h_i, h_j),   i, j ∈ {a, b, c, d}                                             (63.130)

We denote a Gaussian process by the notation

    g ∼ GP_g( 0, K(h, h′) )                                                                                                                    (63.131)

The mapping g(h) plays the role of the nonlinear transformation φ(h) from the earlier sections in this chapter. We will continue with the notation g here instead of φ to remain consistent with our treatment of Gaussian processes in earlier chapters. We now consider two inference problems related to regression and classification, and show how the machinery of Gaussian processes can be used to solve them by exploiting kernel calculations.

63.9.1

Bayesian Regression

Consider first a collection of N independent data points {γ(n), h_n}, where each γ(n) is a realization that arises from a noisy perturbation of some Gaussian process, namely,

    γ(n) = g(h_n) + v(n),     g ∼ GP_g( 0, K(h, h′) )                                                                                           (63.132)

The noise v(n) ∼ N_v(0, σ_v²) is assumed to be a white Gaussian random process with variance σ_v² and independent of g(h). The above model allows for nonlinear mappings from the feature space, h, to the target space, γ. Different choices for the kernel K(h, h′) correspond to different assumptions on the nature of the nonlinear mapping from the features {h_n} to the target signals {γ(n)}. We do not observe or know the function g(·), but only have access to the data {γ(n), h_n}. We wish to use this data to design a predictor function, γ̂(h), for the target variable for any new feature vector h.

We collect the measurements {γ(n)}, the Gaussian process values of g(·), and the perturbations {v(n)} into the vector quantities

    γ_{N−1} = col{ γ(0), . . . , γ(N − 1) },   g_{N−1} = col{ g(h_0), . . . , g(h_{N−1}) },   v_{N−1} = col{ v(0), . . . , v(N − 1) }            (63.133)

so that

    γ_{N−1} = g_{N−1} + v_{N−1}                                                                                                                  (63.134)

Let R_{N−1} denote the covariance matrix of the vector g_{N−1} evaluated at the given feature data:

    R_{N−1} = [ K(h_n, h_m) ]_{n,m=0}^{N−1}                                                                                                       (63.135)

Then, γ_{N−1} is Gaussian-distributed with mean and covariance matrix given by

    γ_{N−1} ∼ N_{γ_{N−1}}( 0, σ_v² I_N + R_{N−1} )                                                                                                (63.136)

Now assume we receive a new feature vector h and wish to predict its target signal γ. If we incorporate this data into the model (63.133) we have

    col{ γ_{N−1}, γ } = col{ g_{N−1}, g(h) } + col{ v_{N−1}, v }                                                                                  (63.137)

The extended vector on the left-hand side continues to be Gaussian:

    col{ γ_{N−1}, γ } ∼ N( 0, [ σ_v² I_N + R_{N−1}    K(H_{N−1}, h)
                                K(h, H_{N−1})          σ_v² + K(h, h) ] )                                                                          (63.138)

where H_{N−1} collects the feature vectors

    H_{N−1} = col{ h_0^T, h_1^T, . . . , h_{N−1}^T }                                                                                              (63.139)

and the notation K(H_{N−1}, h) refers to a column vector containing the kernel evaluations of all feature vectors with h:

    K(H_{N−1}, h) = col{ K(h_0, h), K(h_1, h), . . . , K(h_{N−1}, h) } = K(h, H_{N−1})^T                                                           (63.140)

We therefore find that {γ_{N−1}, γ} are jointly Gaussian-distributed. We can then rely on the result of Lemma 4.3 to conclude that the conditional probability density function (pdf) of γ given γ_{N−1} is also Gaussian:

    f_{γ|γ_{N−1}}(γ|γ_{N−1}) ∼ N_γ( γ̂, σ̂_γ² )                                                                                                       (63.141a)
    γ̂ = K(h, H_{N−1}) ( σ_v² I_N + R_{N−1} )^{−1} γ_{N−1}                                                                                           (63.141b)
    σ̂_γ² = σ_v² + K(h, h) − K(h, H_{N−1}) ( σ_v² I_N + R_{N−1} )^{−1} K(H_{N−1}, h)                                                                 (63.141c)

The mean γ̂ can be used as a maximum a-posteriori (MAP) predictor for the target variable γ associated with feature h. In particular, observe that γ̂ can be written in the form

    γ̂(h) = \sum_{n=0}^{N-1} a(n) K(h, h_n)                                                                                                          (63.142)

where the combination weights a(n) are the entries of the vector

    a = ( σ_v² I_N + R_{N−1} )^{−1} γ_{N−1}                                                                                                          (63.143)

We observe from (63.142) that γ b is obtained by placing kernel “bumps” at the given feature locations {hn } and combining them; this is similar to the construction described in Example 63.5 where we commented on the relation to the NN rule. Note that the solution (63.142) does not require knowledge of the underlying nonlinear transformation g(·); it only depends on the kernel function K(h, h0 ) and the training data {γ(n), hn }. We list the resulting algorithm in (63.144). Observe that, in a manner similar to the NN solution, the algorithm operates on all training data {γ(n), hn } each time a new test sample h arrives.


Bayesian regression using Gaussian processes.

given N data samples {γ(n), h_n}, n = 0, 1, . . . , N − 1;
given a kernel function K(h, h′) and noise variance σ_v²;
objective: predict the target γ for a new feature vector h ∈ IR^M.
compute:
    construct the covariance matrix R_{N−1} = [ K(h_n, h_m) ]_{n,m=0}^{N−1}
    construct γ_{N−1} = col{ γ(0), γ(1), . . . , γ(N − 1) }
end
given h, construct K(h, H_{N−1}) = row{ K(h, h_0), . . . , K(h, h_{N−1}) }
set γ̂(h) = K(h, H_{N−1}) ( σ_v² I_N + R_{N−1} )^{−1} γ_{N−1}.
                                                                                                                                                      (63.144)
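The following Python sketch of listing (63.144) is illustrative only (the function name and the extra returned predictive variance, taken from (63.141c), are assumptions):

import numpy as np

def gp_regression_predict(h, H, gamma, kernel, sigma_v2):
    N = len(gamma)
    R = np.array([[kernel(H[n], H[m]) for m in range(N)] for n in range(N)])
    k = np.array([kernel(h, H[n]) for n in range(N)])         # K(h, H_{N-1}) as a vector
    S = sigma_v2 * np.eye(N) + R
    alpha = np.linalg.solve(S, gamma)
    mean = k @ alpha                                          # gamma_hat(h), eq. (63.141b)
    var = sigma_v2 + kernel(h, h) - k @ np.linalg.solve(S, k) # eq. (63.141c)
    return mean, var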

Example 63.10 (Optimizing the parameters: I) The performance of algorithm (63.144) depends on two factors: the noise parameter σ_v² and the kernel function. Moreover, the kernel function usually depends on some parameters of its own (such as the variance parameter for the Gaussian kernel). These parameters may be chosen by means of a cross-validation step or by solving an optimization problem that maximizes a likelihood function over them, as illustrated next.

Let θ denote the parameters of the kernel function. For example, if we consider the Gaussian kernel

    K(h_n, h_m) = exp{ −(1/2σ_h²) ||h_n − h_m||² }                                                                                                     (63.145)

then its parameter is θ = σ_h². If, on the other hand, we consider a weighted kernel function of the form

    K(h_n, h_m) = exp{ −(1/2) (h_n − h_m)^T D (h_n − h_m) }                                                                                            (63.146)

with a diagonal matrix D with positive entries, then the entries of D would constitute the parameters in θ. We can also include the noise variance σ_v² into θ.

We can learn θ by maximizing the log-likelihood function. To begin with, we know from (63.136) that the vector γ_{N−1} is Gaussian-distributed. Let ℓ(θ) = ln f_{γ_{N−1}}(γ_{N−1}; θ) denote the log-likelihood function, which is given by

    ℓ(θ) = −(N/2) ln(2π) − (1/2) ln det S_{N−1} − (1/2) γ_{N−1}^T S_{N−1}^{−1} γ_{N−1}                                                                  (63.147)

where we introduced

    S_{N−1} = σ_v² I_N + R_{N−1}                                                                                                                        (63.148)

The expression for `(θ) can then be maximized by using, for example, a gradient ascent recursion. For that purpose, we would need to evaluate the gradient of `(θ) relative to each entry of θ. Using the results of parts (a) and (b) from Prob. 2.10:

    ∂ℓ(θ)/∂θ_m = −(1/2) Tr( S_{N−1}^{−1} ∂S_{N−1}/∂θ_m ) + (1/2) γ_{N−1}^T S_{N−1}^{−1} (∂S_{N−1}/∂θ_m) S_{N−1}^{−1} γ_{N−1}
               = (1/2) Tr( ( x_{N−1} x_{N−1}^T − S_{N−1}^{−1} ) ∂S_{N−1}/∂θ_m )                                                                          (63.149)

where we introduced

    x_{N−1} = S_{N−1}^{−1} γ_{N−1}                                                                                                                       (63.150)

The parameter vector θ can be estimated by using a recursion of the following form, with a small step size µ > 0:

    θ_new ← θ_old + µ ∇_{θ^T} ℓ(θ) |_{θ = θ_old}                                                                                                          (63.151)

where the column gradient vector is formed from the entries (63.149):

    [ ∇_{θ^T} ℓ(θ) ]_m = ∂ℓ(θ)/∂θ_m                                                                                                                       (63.152)
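The following Python fragment sketches one gradient-ascent step of (63.149)–(63.151) for a Gaussian kernel whose single parameter is σ_h² (an illustration; the function name and the derivative formula for this particular kernel are assumptions consistent with (63.145)):

import numpy as np

def log_likelihood_grad_step(H, gamma, sigma_h2, sigma_v2, mu=1e-3):
    N = len(gamma)
    D2 = np.array([[np.sum((H[n] - H[m]) ** 2) for m in range(N)] for n in range(N)])
    R = np.exp(-D2 / (2 * sigma_h2))                     # Gaussian kernel matrix, eq. (63.145)
    S = sigma_v2 * np.eye(N) + R                         # S_{N-1} from (63.148)
    dS = R * D2 / (2 * sigma_h2 ** 2)                    # derivative of S w.r.t. sigma_h^2
    Sinv = np.linalg.inv(S)
    x = Sinv @ gamma                                     # x_{N-1} from (63.150)
    grad = 0.5 * np.trace((np.outer(x, x) - Sinv) @ dS)  # eq. (63.149)
    return sigma_h2 + mu * grad                          # one step of (63.151)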

63.9.2 Bayesian Classification

We can similarly apply the machinery of Gaussian processes to generalize the logit and probit models from Chapter 33 to the kernel domain and allow for nonlinear transformations of the feature space. The arguments in this case are more challenging because certain integration steps in the derivation cannot be carried out in closed form and will need to be approximated. Thus, consider N independent data samples {γ(n), h_n} where γ(n) ∈ {±1} is the label corresponding to the nth feature vector h_n ∈ IR^M. Each γ(n) is now modeled as a realization arising from the following logistic model:

    P(γ = γ | h) = 1 / ( 1 + e^{−γ g(h)} ),     γ ∈ {±1}                                                                                                   (63.153)

defined in terms of a Gaussian process:

    g ∼ GP_g( 0, K(h, h′) )                                                                                                                                (63.154)

Expression (63.153) generalizes the traditional logit model (33.18) and replaces the inner product h^T w by the Gaussian process g(h); this modification allows for nonlinear mappings of the feature space. Again, we do not observe g(h) itself but only the feature vectors {h_n}. For ease of reference, we collect the data {γ(n), h_n} into the same vector and matrix quantities as before:

    H_{N−1} = col{ h_0^T, h_1^T, . . . , h_{N−1}^T },     γ_{N−1} = col{ γ(0), γ(1), . . . , γ(N − 1) }                                                      (63.155)

as well as their corresponding Gaussian (latent) factors:

    g_{N−1} = col{ g(h_0), g(h_1), . . . , g(h_{N−1}) }                                                                                                      (63.156)


We denote the N × N covariance matrix of g_{N−1} by R_{N−1} with entries

    [R_{N−1}]_{m,n} = K(h_m, h_n)                                                                                                                            (63.157)

We are interested in devising a predictor γ̂(h) for the label variable. To do so, we will evaluate the prediction distribution for γ given h and the training data, namely,

    prediction distribution = P(γ = γ | γ_{N−1}; h, H_{N−1})                                                                                                  (63.158)

Once this quantity is evaluated, we can then compare it against 1/2 to decide whether we set γ(h) to +1 or −1. Let g(h) be the Gaussian term associated with h; we will write g without the argument h for compactness. We know from the Bayes rule that evaluation of the prediction probability requires knowledge of the posterior f_{g|γ_{N−1}}(g|γ_{N−1}) since:

    P(γ = γ | γ_{N−1}; h, H_{N−1}) = ∫_g f_{γ,g|γ_{N−1}}(γ, g | γ_{N−1}; h, H_{N−1}) dg           (step (a))
                                   = ∫_g f_{g|γ_{N−1}}(g | γ_{N−1}; h, H_{N−1}) [posterior] · f_{γ|g}(γ|g) [model (63.153)] dg                                  (63.159)

where in step (a) we marginalized over g, and in the last step we used the fact that the assumed model for generating γ from g does not depend on γ N −1 .

Approximating the posterior

Unfortunately, evaluation of the posterior distribution in this case is difficult and does not admit a closed-form expression. This is because computation of the posterior involves marginalizing over all prior Gaussian factors in g_{N−1} since

    f_{g|γ_{N−1}}(g | γ_{N−1}; h, H_{N−1})
        = ∫_{g_{N−1}} f_{g,g_{N−1}|γ_{N−1}}(g, g_{N−1} | γ_{N−1}; h, H_{N−1}) dg_{N−1}
        = ∫_{g_{N−1}} f_{g_{N−1}|γ_{N−1}}(g_{N−1} | γ_{N−1}; H_{N−1}) [posterior] · f_{g|g_{N−1}}(g | g_{N−1}; h, H_{N−1}) [Gaussian] dg_{N−1}                   (63.160)

where in the last line we used the fact that the distribution of g N −1 does not depend on h and the distribution of g depends only on the feature data. We will now explain how to approximate the above integral expression. We will do so in two steps: First, we explain that the rightmost conditional inside the integral is Gaussian and, second, we approximate the leftmost conditional by another Gaussian. Then, we will be faced with integrating the product of two Gaussian distributions, which we already know how to do. The details are as follows.


The rightmost conditional of g given g_{N−1} inside the integral is Gaussian since, by assumption, these variables arise from a Gaussian process. Note in particular that

    col{ g_{N−1}, g(h) } ∼ N( 0, [ R_{N−1}           K(H_{N−1}, h)
                                   K(h, H_{N−1})     K(h, h)       ] )                                                                                           (63.161)

It follows from the result of Lemma 4.3 that the conditional pdf of g given g_{N−1} is Gaussian and is given by

    f_{g|g_{N−1}}(g | g_{N−1}; h, H_{N−1}) ∼ N_g( ğ, σ̆_g² )                                                                                                      (63.162a)
    ğ = K(h, H_{N−1}) R_{N−1}^{−1} g_{N−1}                                                                                                                        (63.162b)
    σ̆_g² = K(h, h) − K(h, H_{N−1}) R_{N−1}^{−1} K(H_{N−1}, h)                                                                                                     (63.162c)

Let us focus next on the posterior term f_{g_{N−1}|γ_{N−1}}(g_{N−1} | γ_{N−1}; H_{N−1}) that appears inside the integral (63.160). This term cannot be evaluated in closed form and we resort to the Laplace method from Section 33.2 to carry out an approximation. First, we know from the Bayes rule that

    f_{g_{N−1}|γ_{N−1}}(g_{N−1} | γ_{N−1}; H_{N−1}) ∝ f_{g_{N−1},γ_{N−1}}(g_{N−1}, γ_{N−1}; H_{N−1})
        = f_{g_{N−1}}(g_{N−1}; H_{N−1}) [prior] · f_{γ_{N−1}|g_{N−1}}(γ_{N−1} | g_{N−1}; H_{N−1}) [likelihood]                                                    (63.163)

and, hence,

    f_{g_{N−1}|γ_{N−1}}(g_{N−1} | γ_{N−1}; H_{N−1}) ∝ exp{ −(1/2) g_{N−1}^T R_{N−1}^{−1} g_{N−1} } × \prod_{n=0}^{N-1} 1 / ( 1 + e^{−γ(n) g(h_n)} )                (63.164)

so that, apart from a constant factor,

    ln f_{g_{N−1}|γ_{N−1}}(g_{N−1} | γ_{N−1}; H_{N−1}) = −(1/2) g_{N−1}^T R_{N−1}^{−1} g_{N−1} − \sum_{n=0}^{N-1} ln( 1 + e^{−γ(n) g(h_n)} )                        (63.165)

We conclude that the gradient is given by

    ∇_{g_{N−1}^T} ln f_{g_{N−1}|γ_{N−1}}(g_{N−1} | γ_{N−1}; H_{N−1}) = −R_{N−1}^{−1} g_{N−1} + D_{N−1}^{−1} γ_{N−1}                                                  (63.166)

where D_{N−1} is a diagonal matrix with entries

    D_{N−1} = diag{ 1 + e^{γ(n) g(h_n)} }_{n=0}^{N−1}                                                                                                                (63.167)

Note that D_{N−1} is a function of the entries of g_{N−1}; we denote this dependency explicitly by writing D_{N−1}(g_{N−1}). The Hessian matrix is given by

    ∇²_{g_{N−1}} ln f_{g_{N−1}|γ_{N−1}}(g_{N−1} | γ_{N−1}; H_{N−1}) = −R_{N−1}^{−1} − B_{N−1}^{−1} = −R̂_g^{−1}                                                        (63.168)

where the entries of the diagonal matrix B_{N−1} are given by

    B_{N−1} = D²_{N−1} diag{ e^{−γ(n) g(h_n)} }_{n=0}^{N−1}                                                                                                           (63.169)

Note that B_{N−1} depends on g_{N−1} and we also write B_{N−1}(g_{N−1}) when we need to be explicit about this relation. We can now write down a gradient-ascent recursion for estimating g_{N−1} (one could also write a Newton-type recursion if desired):

    g_{N−1}^{(m)} = g_{N−1}^{(m−1)} + µ(m) [ −R_{N−1}^{−1} g_{N−1}^{(m−1)} + D_{N−1}^{−1}( g_{N−1}^{(m−1)} ) γ_{N−1} ],    m ≥ 0                                        (63.170)

where µ(m) is a decaying step-size sequence, such as µ(m) = τ/(m + 1). The recursion approaches the MAP estimate for g_{N−1}, which we denote by ĝ_{N−1} and use to approximate the conditional pdf by

    f_{g_{N−1}|γ_{N−1}}(g_{N−1} | γ_{N−1}; H_{N−1}) ≈ N_{g_{N−1}}( ĝ_{N−1}, R̂_g )                                                                                      (63.171)

Returning to (63.160), we find that the integral involves the product of two Gaussian distributions. We can then refer to result (27.63) for such integrals and make the identifications

    W ← K(h, H_{N−1}) R_{N−1}^{−1},   θ ← 0,   z ← σ̆_g²,   x̄ ← ĝ_{N−1},   R_x ← R̂_g                                                                                    (63.172)

to conclude that

    f_{g|γ_{N−1}}(g | γ_{N−1}; h, H_{N−1}) ∼ N_g( ĝ, σ̂_g² )                                                                                                              (63.173)

where

    ĝ = K(h, H_{N−1}) R_{N−1}^{−1} ĝ_{N−1}                                                                                                                                (63.174a)

and

    σ̂_g² = σ̆_g² + K(h, H_{N−1}) R_{N−1}^{−1} R̂_g R_{N−1}^{−1} K(H_{N−1}, h)
         = K(h, h) − K(h, H_{N−1}) ( R_{N−1} + B_{N−1} )^{−1} K(H_{N−1}, h)                                                                                                (63.174b)

The second equality is established in Prob. 63.18. The parameters (ĝ, σ̂_g²) are used to determine the predictive distribution below in (63.180). Observe that ĝ can be written in the form

N −1 X

a(n)K(h, hn )

(63.175)

n=0

where the combination weights a(n) are the entries of the vector ∆

−1 a = RN bN −1 −1 g

(63.176)

We observe that gb is obtained by placing kernel “bumps” at the given feature locations {hn } and combining them. Note again that expression (63.175) does not require knowledge of the nonlinear transformation g(·), but only of the kernel function K(h, h0 ) and the training data {γ(n), hn }.

63.9 Inference under Gaussian Processes

2631

Determining the predictor We now have sufficient information to approximate the predictive distribution (63.159) for γ given a new feature h and the training data. For this purpose, we introduce the auxiliary scalar variable:



x = γg(h)

(63.177)

Conditioned on γ, and in view of the Gaussian distribution (63.173) given all data, the variable x is Gaussian-distributed with mean and variance given by:

fx|γ ,γ N −1 (x|γ, γN −1 ; h, HN −1 ) ≈ Nx (γ gb, σ bg2 )

(63.178)

Using this observation, we can evaluate the predictive distribution as follows:

P(γ = γ|γN −1 ; h, HN −1 ) ˆ



= −∞ ˆ ∞

= −∞

ˆ ≈ =

fγ ,x|γ N −1 (γ, x|γN −1 ; h, HN −1 )dx P(γ = γ|h) fx|γ ,γ N −1 (x|γ, γN −1 ; h, HN −1 )dx



−∞ ∞

ˆ

−∞

1 Nx (γ gb, σ bg2 )dx 1 + e−x   1 1 1 2 exp N 2 (x N γ gb) dx 1 + e−x (2πb 2b σg σg2 )1/2

(63.179)

The last integral is difficult to evaluate in closed form. However, it can be approximated by using the same results (33.49)–(33.48) used before for the logit model to get:

P(γ = γ|γN −1 ; h, HN −1 ) ≈

n .q o 1 + exp Nγb g 1 + πb σg2 /8

!−1

(63.180)

We then set γ = +1 if this probability value is larger than 1/2; otherwise, we set γ = N1. We list the resulting algorithm in (63.181).

2632

Kernel Methods

Binary classification using Gaussian processes. given N data samples {γ(n), hn }, n = 0, 1, . . . , N N 1; given a kernel function K(h, h0 ) and noise variance σv2 ; given a small step-size parameter, µ > 0; objective: predict label γ for a new feature vector h ∈ IRM . (−1)

start with gN −1 = 0N ×1 . compute: N −1 construct covariance matrix RN −1 = [K(hn , hm n o)]n,m=0 construct γN −1 = col γ(0), γ(1), . . . , γ(N N 1)

repeat until convergence over m = 0, 1, 2, . . .: (m−1)

{g (m−1) (n)} = entries of vector gN −1 n oN −1 (m−1) (n) DN −1 = diag 1 + eγ(n)g n=0 h i (m) (m−1) (m−1) −1 −1 gN −1 = gN −1 + µ RN N DN −1 gN −1 γN −1

(63.181)

end (m) set gbN −1 = gN −1 {b gN −1 (n)} = entries of vector gbN −1 n  2 oN −1 BN −1 = diag e−γ(n)bgN −1 (n) 1 + eγ(n)bgN −1 (n) n=0

end

n o given h, construct K(h, HN ) = row K(h, h0 ), . . . , K(h, hN −1 )

−1 gb = K(h, HN −1 )RN bN −1 −1 g 2 σ bg = K(h, h) N K(h, HN −1 )(RN −1 + BN −1 )−1 K(HN −1 , h) o−1  n q P(γ = +1 | data) ≈ 1 + exp Nb g / 1 + πb σg2 /8

set γ = +1 if probability larger than 1/2; otherwise γ = N1.

Example 63.11 (Optimizing the parameters: II) As was the case with Bayesian regression, we can optimize the selection of the parameters that influence the operation of algorithm (63.181), such as σv2 and any parameters that define the kernel function. Let θ denote the aggregate parameters. We know from the derivation of the Laplace method in an earlier chapter, and in particular from expression (33.33), that the joint distribution of γ N −1 and g N −1 can be approximated by: ln fγ N −1 ,gN −1 (γN −1 , gN −1 )

(63.182)

1 bg−1 (gN −1 − gbN −1 ) ≈ ln fγ N −1 ,gN −1 (γN −1 , gbN −1 ) − (gN −1 − gbN −1 )T R 2

bN −1 found in (63.171). in terms of the MAP estimate gbN −1 and the covariance matrix R It follows that the joint distribution can be approximated by:

63.9 Inference under Gaussian Processes

2633

fγ N −1 ,gN −1 (γN −1 , gN −1 )

(63.183) o n 1 T b −1 ≈ fγ N −1 ,gN −1 (γN −1 , gbN −1 ) exp − (gN −1 − gbN −1 ) Rg (gN −1 − gbN −1 ) 2 q p bg ) b g × Ng (b gN −1 , R = fγ N −1 ,gN −1 (γN −1 , gbN −1 ) × (2π)N × det R N −1 We can recover the distribution for γ N −1 by marginalizing over g N −1 : fγ N −1 (γN −1 ; θ) ˆ = fγ N −1 ,gN −1 (γ N −1 , gN −1 ; θ) dgN −1 gN −1

= fγ N −1 ,gN −1 (γN −1 , gbN −1 ) ×

(63.184)

ˆ q bg × (2π)N det R |

=

p

(2π)N ×

q

gN −1

bg ) dgN −1 NgN −1 (b gN −1 , R {z

}

=1

b g × fg det R (b gN −1 ) × fγ N −1 |gN −1 (γN −1 |b gN −1 ) N −1

We know that fgN −1 (gN −1 ) = NgN −1 (0, RN −1 ) fγ N −1 |gN −1 (γN −1 |gN −1 ) =

N −1 Y n=0

1 1 + e−γ(n)g(hn )

(63.185) (63.186)

and, hence, o n 1 1 1 T −1 √ fgN −1 (b gN −1 ) = p bN −1 exp − gbN −1 RN −1 g 2 (2π)N det RN −1

(63.187)

and fγ N −1 |gN −1 (γN −1 |b gN −1 ) =

N −1 Y n=0

1 1 + e−γ(n)bgN −1 (n)

(63.188)

where the individual entries of gbN −1 are denoted by {b gN −1 (n)}. If we now introduce the log-likelihood function `(θ) = ln fγ N −1 (γN −1 ; θ) and substitute the above expressions into (63.183) we find N −1     1 X 1 T −1 −1 b `(θ) = − gbN R g b − ln 1 + e−γ(n)bgN −1 (n) + ln det RN −1 N −1 N −1 −1 Rg 2 2 n=0

(63.189) bg we have Using expression (63.168) for R −1 −1 −1 −1 −1 −1 RN = (IN + BN −1 (RN −1 + BN −1 ) −1 RN −1 )

(63.190)

so that the log-likelihood is given by 1 T −1 `(θ) = − gbN bN −1 −1 RN −1 g 2 N −1   1 X −1 − ln 1 + e−γ(n)bgN −1 (n) − ln det(IN + BN −1 RN −1 ) 2 n=0

(63.191)

We can maximize this expression by means of a gradient-ascent recursion, which would require that we evaluate the partial derivatives relative to the individual entries of

2634

Kernel Methods

θ, denoted by {θm }. In the above expression, the quantities RN −1 , BN −1 , and gbN −1 depend on θ; the dependency of RN −1 is explicit while BN −1 and gN −1 change with the updates for θ. The computation of the required partial derivatives is tedious and is left as a guided exercise – see Prob. 63.24.

63.10

COMMENTARIES AND DISCUSSION Kernel-based perceptron, SVM, and PCA. Kernel-based methods in learning date back to the mid-1960s with the introduction of the kernel perceptron algorithm for nonlinear classification in the work by Aizerman, Braverman, and Rozoner (1964). We indicated following (63.34) that one main difficulty encountered by the kernel-based perceptron algorithm is that the size of the dataset used in the implementation grows with time, which results in an increasing demand for memory storage. Variations to ameliorate this difficulty appear in the works by Crammer, Kandola, and Singer (2004), Weston, Bordes, and Bottou (2005), Cesa-Bianchi and Gentile (2007), and Dekel, Shalev-Shwartz, and Singer (2008). These works propose algorithms that operate on a budget by limiting the amount of data used to arrive at the classification decision. For further discussion on kernel-based methods in learning, the reader may refer to Cristianini and Shawe-Taylor (2000), Scholkopf and Smola (2001), Herbrich (2002), Kivinen, Smola, and Williamson (2004), and Slavakis, Bouboulis, and Theodoridis (2014). Nonlinear and kernel versions of PCA and ridge regression appear in Scholkopf, Smola, and Muller (1998, 1999), Mika et al. (1999b), Cristianini and Shawe-Taylor (2000), Hastie, Tibshirani, and Friedman (2009), Scholkopf, Luo, and Vovk (2013), and Vovk (2013). Gaussian processes. We illustrated in Section 63.9 how the formalism of kernel methods can be combined with Gaussian processes for the solution of Bayesian regression and classification problems. One notable difference between the Gaussian process approach and earlier Bayesian methods from Chapter 33 is that the Gaussian assumption is now imposed on the (implicit) function g(hn ) rather than on any parameter model. One does not observe the function g(·) itself but rather the feature vectors {hn }; the Gaussian process is used as a convenient modeling tool to account for the (possibly nonlinear) dynamics that maps the feature vectors hn to the labels or targets γ(n). The material in Section 63.9 is motivated by the discussion in Williams and Barber (1998), Rasmussen and Williams (2006), and Bishop (2007). The results in the section show that the Gaussian process approach arrives at inference decisions by relying on the behavior of the closest points in the training data to the new datum, which generalizes the NN construction. We have given several references on the use of Gaussian processes in inference and learning in the concluding remarks of Chapter 3, including the works by O’Hagan (1978), Poggio and Girosi (1990), Neal (1995), and Williams and Rasmussen (1995). Similar techniques have also been used in geostatistics under the name of “krigging” – see, e.g., Journel and Huijbregts (1978), Ripley (1981, 1996), and Fedorov (1987). Mercer theorem. Kernel-based learning exploits powerful properties of kernel functions and reproducing kernel Hilbert spaces, which we explain in greater detail here. Thus, consider a continuous function that maps two vector arguments into the set of real numbers: K(hk , h` ) : IRM × IRM → IR (63.192) We assume the function is symmetric so that K(hk , h` ) = K(h` , hk ). We say that K(·, ·) is positive semi-definite if it satisfies the property:

63.10 Commentaries and Discussion

N −1 N −1 X X

K(hk , h` )α(k)α(`) ≥ 0

2635

(63.193)

k=0 `=0

for any finite number N of vectors {hn } and real scalars {α(n)}. We say that it is positive-definite if, assuming the feature vectors are mutually distinct, equality holds in (63.193) only when all the {α(k)} are zero. We can restate definition (63.193) in the following matrix form. If we introduce the N × N Gramian matrix ∆

k, ` = 0, 1, . . . , N − 1

[AN ]k,` = K(hk , h` ),

(63.194)

then definition (63.193) is equivalent to saying that AN is symmetric and positive semi-definite for any choice of the feature vectors {hn } and finite N . Now, since AN is symmetric, it has a full set of orthonormal eigenvectors {un } satisfying the conditions: Aun = λn un , kun k2 = 1,

uTn um = 0 for n 6= m

(63.195)

where the N eigenvalues {λn } are nonnegative. We can rewrite this decomposition in the form N −1 X AN = λn un uTn (63.196) n=0

This fundamental decomposition can be stated in the function domain as well if we assume that K(·, ·) is square-integrable, i.e., ˆ (K(h1 , h2 ))2 dh1 dh2 < ∞ (63.197) h1 ,h2 ∈Dh

where Dh ⊂ IRM is the domain of the feature space, assumed compact. A key result in functional analysis, known as the Mercer theorem, guarantees the existence of a countable sequence of orthonormal functions, denoted by un (h) : IRM → IR, and a sequence of nonnegative real numbers, denoted by λn ≥ 0, such that we can write in a manner similar to (63.196): K(hk , h` ) =

∞ X

λn un (hk )un (h` )

(63.198)

n=0

where the orthonormality of the un (·) means that  ˆ 1, un (h)um (h)dh = 0, h∈D h

n=m otherwise

(63.199)

The series in (63.198) is guaranteed to converge absolutely and uniformly under (63.197). One important consequence of the Mercer theorem is that there exists a mapping φ(h) : IRM → IRMφ that allows us to express K(hk , h` ) in an inner product form, namely, as K(hk , h` ) = (φ(hk ))T φ(h` ) = (hφk )T hφ`

(63.200)

where we are using the shorthand notation hφ to refer to the transformed vector φ(h). The result can be deduced from (63.198) if we define n√ o √ √ ∆ φ(h) = col λ0 u0 (h), λ1 u1 (h), λ2 u2 (h), . . . (63.201) Therefore, the Mercer theorem ensures that symmetric positive semi-definite functions, K(·, ·), correspond to kernels. In summary, we have the following statement – see also the texts by Dunford and Schwartz (1963), Werner (1995), Berlinet and Thomas-Agnan (2004).

2636

Kernel Methods

Mercer theorem (Mercer (1909)): Every square-integrable symmetric and positive semidefinite function K(hk , h` ) : IRM × IRM → IR admits a representation of the form (63.198) for orthonormal functions un (h) : IRM → IR and nonnegative scalars λn . An equivalent statement is to affirm that K(hk , h` ) is a kernel if, and only if, the Gramian matrix AN defined by (63.194) is positive semi-definite for any size N and feature data {hn }. We listed right after (63.24) several properties of kernel functions. We call upon these properties in the problems at the end of the chapter to check whether certain functions are valid kernels or not. Another useful result for checking whether a function is a valid kernel is Bochner theorem due to Bochner (1932) – see Prob. 63.12 and Loomis (1953), Riesz and Nagy (1965), Reed and Simon (1975), Rudin (1990), Gunning (1992), and Katznelson (2004). Kernel trick. The kernel-based framework allows us to transform a classification problem from an M -dimensional space where the feature vectors {hn } may not be linearly separable to a higher Mφ -dimensional space where the transformed feature vectors {hφn = φ(hn )} are more likely to be linearly separable. The value of Mφ is generally much larger than M and can even be unbounded, as happens in the case of Gaussian kernels. What makes kernel-based implementations attractive is the fact that we do not need to know the transformation φ(·) that maps h ∈ IRM to hφ = φ(h) ∈ IRMφ . We can rely solely on evaluating kernel values. This is because of the following key observation, already explored in the body of the chapter and which we would like to generalize here. In the Mφ -dimensional space, we can seek an estimate for the label γ by means of a linear regression (or inner product) model of the form γ b(h) = (hφ )T wφ,? − θφ,?

(63.202)

for some parameters wφ,? ∈ IRMφ and θφ,? ∈ IR. These parameters define some “optimal” separating hyperplane in IRMφ (optimal in the sense of minimizing some empirical risk function). For example, in the body of the chapter we estimated this hyperplane recursively by means of kernel-based implementations of the perceptron and SVM classifiers. The successive iterates (wnφ , θφ (n)) that are computed by these algorithms provide estimates that approach (wφ,? , θφ,? ). The main “trick,” which is often referred to in the literature as the “kernel trick,” is that the vector wφ,? , or its approximations, can be represented as a linear combination of the transformed vectors, {hφn }. We encountered this property while deriving the kernel-based perceptron, SVM, and ridge regression implementations in Sections 63.4, 63.5, and 63.6. For example, we observe from expressions (63.35) and (63.49) that, after convergence, the resulting hyperplane, denoted by (wφ,? , θφ,? ), in the kernel domain can be expressed in the general form: wφ,? ≈

N −1 X

β(m)hφm ,

θφ,? ≈ −

m=0

N −1 X

α(m)γ(m)

(63.203)

m=0

for some coefficients {α(m), β(m)}, and where N is the number of training data points. Once this fact is noted, it then becomes clear that the estimate for the class variable, γ b(h), can be evaluated by using the kernel function directly since now: γ b(h)

= (63.203)

=

(hφ )T wφ,θ − θφ,? N −1 X

 α(m) + β(m)K(h, hm )

(63.204)

m=0

This calculation involves only the evaluation of kernel values between h and the N given feature vectors, {hm }. As such, knowledge of the mapping φ(·) is unnecessary

63.10 Commentaries and Discussion

2637

and the kernel function, K(·, ·) is sufficient for these computations. This “kernel trick” is what enables the development of kernel-based implementations. The justification we are giving here for this important conclusion is based on assuming the validity of representation (63.203) for wφ,? , which we know holds for perceptron and SVM updates. But how general is this conclusion and does a representation of the form (63.203) hold more generally? The answer is in the affirmative, as already explained in Section 63.7. We provide further motivation for the conclusion here by reviewing two fundamental results in functional analysis known as the Moore–Aronszajn theorem and the Representer theorem. We start with the first theorem. We will not delve into technical details but will rather highlight the main points in a guided manner. Reproducing kernel Hilbert spaces. Recall from the discussion so far that we moved from an M -dimensional space of vectors, {hn }, to an Mφ -dimensional space of transformed vectors, {hφn }. We did so by introducing a kernel function, K(hk , h` ), and explaining that this function implicitly induces a transformation hφ = φ(h). It turns out that the kernel function induces several other deeper properties. In particular, it helps define a useful RKHS as follows. Given a kernel function K(·, ·), and any collection of N training feature vectors {hn ∈ IRM }, we first construct the space that consists of all functions f (h) : IRM → IR defined as follows: N −1 X ∆ f (h) = α(n)K(h, hn ) (63.205) n=0

for any α(n) ∈ IR. This is a vector space over the field of real numbers because it satisfies the properties listed in Prob. 63.16 – see also the texts by Luenberger (1969), Halmos (1974), Bachman and Narici (1998), and Treves (2006) for broader treatments of vector spaces. For simplicity, we will denote the space by the letter H and write f ∈ H. Next, we associate an inner product with this vector space. To do so, we consider any other element from H, say, N −1 X



g(h) =

β(m)K(h, hm )

(63.206)

m=0

for some scalars β(m) ∈ IR. The inner product of f, g ∈ H is denoted by hf (h), g(h)iH and defined as N −1 N −1 X X ∆ hf (h), g(h)iH = α(n)β(m)K(hn , hm ) (63.207) n=0 m=0

It can be verified that this definition satisfies the properties listed in Prob. 63.17 for inner products. It follows that H is an inner product space (i.e., a vector space with an inner product). Therefore, we can assign norms to its elements defined by ∆

kf k2H = hf, f iH =

N −1 N −1 X X

α(n)α(m)K(hn , hm )

(63.208)

n=0 m=0

The space H can be transformed into a Hilbert space as follows. Recall first that a sequence of functions, fp (h) ∈ H, indexed by p, is said to be Cauchy if for every  > 0, there exists some P large enough such that for all p, q > P , it holds that kfp (h) − fq (h)kH < 

(63.209)

That is, the functions become closer to each other for larger values of p and q. An inner product space is said to be complete if every Cauchy sequence in it converges pointwise

2638

Kernel Methods

to an element that lies in the space. Now, consider every possible Cauchy sequence in H. Its limit point may or may not belong to H (i.e., they may not be of the form (63.205)). However, if we enlarge H to include all these limit points, then we end up with a Hilbert space (which is defined as an inner product space that is complete). We will continue to denote the enlarged space by H. The functions f (h) defined by (63.205) lie within this Hilbert space. For more details on Hilbert spaces and their applications, see the texts by Young (1988), Kreyszig (1989), Debnath and Mikusinski (2005), Small and McLeish (2011), Halmos (2013), and Kennedy and Sadeghi (2013). To proceed, observe the following important property that follows from (63.207). Assume we select g(h) = K(h, hk ), for some index k, which corresponds to the choice β(k) = 1 and β(m) = 0 for m 6= k. In that case, expression (63.207) gives hf (h), K(h, hk )iH =

N −1 X n=0

α(n)K(hn , hk ) =

N −1 X

α(n)K(hk , hn )

(63.205)

=

f (hk )

n=0

(63.210) This result shows that computing the inner product of any function, f (h), with a kernel centered at hk , reproduces the value of f (·) at h = hk : f (hk ) = hf (h), K(h, hk )iH

(63.211)

We then say that the space H has a reproducing property (not every Hilbert space has this property). For this reason, the space H is called a reproducing kernel Hilbert space. More formally, an RKHS is a Hilbert space with a kernel function that satisfies the following two properties: (a) For any h and fixed hk , it holds that K(h, hk ) ∈ H. (b) The reproducing property (63.211) holds. Both of these conditions are satisfied by the space H that we have constructed. More details on RKHS and their properties can be found, for example, in the texts by Saitoh (1988) and Berlinet and Thomas-Agnan (2004) and the articles by Aronszajn (1950) and Hille (1972). In summary, starting from a kernel function, K(·, ·), we showed how to construct an RKHS space consisting of functions of the form (63.205) for any N feature vectors {hn }. Can there be other RKHS spaces that are associated with the same kernel K(·, ·)? The Moore–Aronszajn theorem guarantees that this is not the case: The kernel and the RKHS we constructed define each other uniquely. The theorem appears in Aronszajn (1950), who attributes it to Moore (1935, 1939) and, hence, the name. Moore–Aronszajn theorem (Moore (1935, 1939) and Aronszajn (1950)): Every RKHS has a unique kernel and vice-versa. An equivalent statement is to consider a kernel function K(·, ·), to construct the RKHS consisting of all functions of the form (63.205) for any N feature vectors {h` }, and to endow the space with the inner product operation (63.207). Then, the kernel function and this RKHS define each other uniquely. Representer theorem. So far we have only shown that an RKHS H can be associated with a kernel function. Let us now show how the RKHS is relevant in the context of inference problems and, in particular, how it can be used to construct optimal classifiers in the higher-dimensional feature space, hφ ∈ IRMφ . The answer to this part is given by the Representer theorem, which we motivate next. Assume that we are given a collection of N training points {γ(n), hn }, where each γ(n) ∈ {±1} denotes the label of the nth feature vector hn ∈ IRM . We let γ b(h) denote an arbitrary element in the Hilbert space H; this function will map feature vectors h

63.10 Commentaries and Discussion

2639

into predictors for the true label γ(h). With each measurement (γ(n), hn ), we associate a loss value denoted by Q(γ(n), γ b(hn )), such as the ones studied in the body of the chapter involving the hinge loss, the perceptron loss, the logistic loss, or the quadratic loss – see expressions (63.79a)–(63.79d). We then consider the problem of seeking the function γ b(h) ∈ H that minimizes the following `2 -regularized empirical risk: ( ) N −1  1 X  ∆ γ b? (h) = argmin ρkb γ k2H + Q γ(n), γ b(hn ) (63.212) N n=0 γ b (h)∈H

where ρ > 0 is a regularization parameter. Observe that, for generality, we are allowing γ b(h) to refer to an arbitrary element in the Hilbert space H. Also, the regularization on γ b(h) is defined in terms of its squared-norm in this space. The following important result asserts that the optimal solution γ b? (h) is a combination of a finite number of kernel evaluations centered at the given feature vectors, i.e., it has the form suggested by construction (63.205). The result below was derived by Kimeldorf and Wahba (1970) for the special case of the quadratic loss (63.79a) – see also the text by Wahba (1990). A similar result was derived independently by Larkin (1970), who stated that the “importance of Hilbert spaces possessing reproducing kernel functions stems from the desirability of estimating a function (and hence, functionals) from values of its ordinates at given abscissae.” The result was subsequently generalized by Poggio and Girosi (1990) and Cox and O’Sullivan (1990) to allow for arbitrary loss functions, and by Scholkopf, Herbrich, and Smola (2001) to allow for more general regularization terms where the γ kH k), for any strictly factor ρkb γ k2H in (63.212) is replaced by a factor of the form g(kb monotonically increasing function g(·) : [0, ∞) → IR – see Prob. 63.20. Representer theorem: For any convex loss function Q(·, ·) : IR2 → IR, it holds that the optimal solution of (63.212) can be expressed in the form γ b? (h) =

N −1 X

α? (n)K(h, hn )

(63.213)

n=0

for some real coefficients {α? (n)}.

Proof: The argument is motivated by the derivation in Scholkopf, Herbrich, and Smola (2001). We split H into two complementary subspaces denoted by S and S⊥ . The subspace S consists of all functions, fs (h), of the form (cf. (63.205)): ( ) N −1 X S = fs (h) = α(n)K(h, hn ) (63.214) n=0

That is, the functions in this space are combinations of kernels centered at the N given feature vectors. The subspace S⊥ consists of all functions, f⊥ (h), that are orthogonal to the elements of S: n o S⊥ = f⊥ (h) | hfs (h), f⊥ (h)iH = 0 for any fs (h) ∈ S (63.215) Since H is a Hilbert space, any arbitrary element γ b(h) ∈ H can be decomposed as the sum of two components: one component γ bs (h) ∈ S and the other component γ b⊥ (h) ∈ S⊥ , i.e., γ b(h) = γ bs (h) + γ b⊥ (h)

(63.216)

kb γ (h)k2H = kb γs (h)k2H + kb γ⊥ (h)k2H ≥ kb γs (h)k2H

(63.217)

It follows that

2640

Kernel Methods

and equality is attained whenever γ b(h) ∈ S. Moreover, in view of the reproducing property (63.211) and for any given feature vector hn : γ b(hn )

=

hb γ (h), K(h, hn )iH

= =

 hn )iH hb γs (h), K(h, hn )iH + hb γ⊥ (h), K(h,  hb γs (h), K(h, hn )iH

(63.211)

=

:0  

γ bs (hn )

(63.218)

where the third equality is because K(h, hn ) ∈ S so that hb γ⊥ (h), K(h, hn )iH = 0. Consequently, in view of (63.217) and (63.218), the empirical risk (63.212) that we are attempting to minimize over γ b(h) satisfies: ρkb γ (h)k2H +

N −1 N −1 1 X 1 X Q(γ(n), γ b(hn )) ≥ ρkb γs (h)k2H + Q(γ(n), γ bs (hn )) (63.219) N n=0 N n=0

The lower bound, which is attained whenever γ b(h) ∈ S, is minimized by solving ( ) N −1  1 X  2 Q γ(n), γ bs (hn ) (63.220) argmin ρkb γs (h)kH + N n=0 γ bs (h)∈S

We therefore transformed the original problem (63.212), where minimization is over γ b(h) ∈ H, into the equivalent problem (63.220) where minimization is over γ bs (h) ∈ S. Given the definition of the subspace S in (63.214), we conclude that the solution γ b(h) has the form (63.213).  For further discussion on kernel methods in the context of classification, estimation, and learning, readers may consult the texts by Wahba (1990), Scholkopf (1997), Cristianini and Shawe-Taylor (2000), Scholkopf and Smola (2001), Herbrich (2002), and ShaweTaylor and Cristianini (2004), and the articles by Parzen (1962, 1963), Larkin (1970), Kailath (1971), Kailath and Duttweiler (1972), Duttweiler and Kailath (1973a,b), Kailath and Weinert (1975), Freund and Schapire (1999), Muller et al. (2001), Kivinen, Smola, and Williamson (2004), Bordes et al. (2005), and Slavakis, Bouboulis, and Theodoridis (2014).

PROBLEMS

63.1 Refer to the polynomial kernel (63.24). Assume hk , h` ∈ IR2 . What would the dimension Mφ be when p = 3? 63.2 Show that K(ha , hb ) = hTa Bhb is a kernel function for any symmetric B > 0. 63.3 Let K1 (ha , hb ) and K2 (ha , hb ) be two kernel functions defined over the same feature space. Show that the following constructions are also valid kernels: (a) αK1 (ha , hb ), for any α > 0 (scaling property). (b) K1 (ha , hb ) + K2 (ha , hb ) (sum of kernels property). (c) K1 (ha , hb )K2 (ha , hb ) (product of kernels property). 63.4 Assume Kn (ha , hb ) is a convergent sequence of kernels. Show that the limit, as n → ∞, is also a kernel. 63.5 Let K(ha , hb ) be a kernel function. Use the results of Probs. 63.3 and 63.4 to show that eK(ha ,hb ) is a kernel. 63.6 Let K(ha , hb ) be a kernel function.p Establish the following generalization of the Cauchy–Schwarz inequality: K(ha , hb ) ≤ K(ha , ha ) K(hb , hb ).

Problems

2641

63.7 Let f (h) : IRM → IR denote a scalar-valued function. Show that K(ha , hb ) = f (ha )f (hb ) is a kernel function. 63.8 Let K(ha , hb ) be a kernel function and let f (h) : IRM → IRM . Show that K(f (ha ), f (hb )) is a kernel function. 63.9 Consider the function  a, if khk − h` k ≤ b ∆ K(hk , h` ) = 0, otherwise where a, b > 0. Is K(hk , h` ) a kernel function? 63.10 Refer to the Gaussian kernel (63.25) and assume M = 2. Assume also that the feature vectors have the form h = {1, x} with a leading unit entry and x ∈ IR. Recall further the Taylor series expansion of the exponential function around x = 0: ex = 1 + x +

∞ X x2 x3 xm + + ... = 2! 3! m! m=0

and the binomial formula: (a + b)

m

! m X m q m−q = a b , q q=0

where

! m ∆ m! = q q!(m − q)!

where the notation q! denotes the factorial of the integer q. Show that the Gaussian kernel in this case can be written as K(hk , h` ) = (φ(hk ))T φ(h` ) in terms of the infinitedimensional vector transformation: s s ( ) r 1 1 1 − 12 (1+x2 ) φ(h) = e 2σ col xk,0 , xk,1 , xk,2 , xk,3 , . . . 1! σ 2 2!(σ 2 )2 3!(σ 2 )3 where each xk,m denotes the following vector xk,m =

h q

m 0



q

m 1



x

...

q

m m



xm

iT

63.11 Let fh (h; θ) denote the marginal pdf of the feature space parameterized by θ. Introduce the score function S(θ, h) = ∇θT ln fh (h; θ), as well as the Fisher information matrix F (θ) = E h S(θ, h)S(θ, h)T where the expectation is relative to the distribution of h. The Fisher kernel was defined by Jaakkola and Haussler (1999) as K(ha , hb ) = S(θ, ha )F −1 (θ)S(θ, hb ). Verify that this is a valid kernel function. Remark. The definition is motivated by an observation similar to (6.130) that the quantity F −1 (θ)S(θ; h) points in the direction of gradient ascent on the manifold of pdfs parameterized by θ, and by result (6.100) showing how the Fisher information matrix helps measure “distances” between pdfs. 63.12 Consider a square-integrable function f (h) : IRM → IR and define its Fourier transform as follows ˆ T ∆ F (jΩ) = f (h)e−jΩ h dh h M

where Ω ∈ IR . Observe that both f (h) and F (jΩ) are functions of M -dimensional arguments. When M = 1, the above expression reduces to the traditional definition of the Fourier transform. (a) Establish the Parseval relation for this domain, namely, ˆ ˆ 1 |f (h)|2 dh = |F (jΩ)|2 dΩ M (2π) h Ω

2642

Kernel Methods

(b)

Establish further that ˆ f (h)g(h)dh = h

(c)

1 (2π)M

where ∗ denotes complex conjugation. Define the convolution operation ∆



ˆ F (jΩ)G∗ (jΩ)dΩ Ω

ˆ f (h0 )g(h − h0 )dh0

k(h) = f (h) ? g(h) = h0

(d)

Show that K(jΩ) = F (jΩ)G(jΩ). Now refer to condition (63.193) for positive semi-definite kernels. Show that it can also be stated equivalently by requiring ˆ K(h, h0 )f (h)f (h0 )dhdh0 ≥ 0 h,h0

(e)

for any square-integrable function f (·). Assume K(h, h0 ) is of the form K(h, h0 ) = s(kh − h0 k) for some function s(·); i.e., the kernel is a function of the distance between the feature arguments. Using the results of parts (a)–(d) show that, in this case, ˆ ˆ 1 S(jΩ)|F (jΩ)|2 dΩ K(h, h0 )f (h)f (h0 )dhdh0 = (2π)M Ω h,h0

Conclude that an equivalent condition for positive semi-definiteness is to require S(jΩ) ≥ 0. In other words, this problem establishes that the function K(h, h0 ) = s(kh − h0 k) is a kernel if, and only if, the function s(h) has a nonnegative Fourier transform. This result is known as the Bochner theorem. 63.13 Use the Bochner theorem from Prob. 63.12 to establish that the Gaussian choice (63.25) is a kernel. 63.14 Consider the Laplacian function K(ha , hb ) = exp(−kh − h0 k). Is it a kernel? 63.15 Refer to the discussion in Section 63.9.2 on Bayesian classification. Given knowledge of the feature vectors {hn }, determine an approximate expression for the joint probability   P γ(0) = γ(0), γ(1) = γ(1), . . . , γ(N − 1) = γ(N − 1) | HN −1 63.16 Refer to the space denoted by H and consisting of all functions defined by (63.205). Show that this space is a vector space over the field of real numbers. Recall that a vector space is one that satisfies the following 10 properties for any f, f1 , f2 , f3 ∈ H and β, β1 , β2 ∈ IR: (a) f1 + f2 ∈ H. (b) βf ∈ H. (c) f1 + f2 = f2 + f1 . (d) f1 + (f2 + f3 ) = (f1 + f2 ) + f3 . (e) β(f1 + f2 ) = βf1 + βf2 . (f) (β1 + β2 )f = β1 f + β2 f . (g) β1 (β2 f ) = (β1 β2 )f . (h) 1f = f . (i) There exists an element 0 ∈ H such that 0 + f = f . (j) There exists an element −f ∈ H such that −f + f = 0. 63.17 Refer to the operation defined by (63.207) over the vector space H. Show that it satisfies the properties required of a valid inner product operation, namely, for any f, g, g1 , g2 ∈ H, λ ∈ IR: (a) hf, giH = hg, f iH . (b) hf, g1 + g2 >H = hf, g1 iH + hf, g2 iH . (c) hλf, giH = λhf, giH .

Problems

2643

(d) hf, f iH ≥ 0. (e) hf, f iH = 0 ⇐⇒ f = 0. 63.18 Use expression (63.162c) and the matrix inversion formula to establish the validity of the variance expression (63.174b). 63.19 Expression (63.174a) estimates the latent variable g(h) by using the approximate MAP estimate gbN for g N . Show without approximations that the optimal meansquare-error (MSE) estimator for g given the data can be expressed in terms of the optimal MSE estimator for g N as follows:     −1 E g|γ N −1 ; h, HN −1 = K(h, HN −1 )RN −1 E g N −1 |γ N −1 ; HN −1 63.20 Refer to the statement of the Representer theorem. Show that the same conγ kH ), for clusion holds if the regularization term ρkb γ k2H in (63.212) is replaced by g(kb any strictly monotonically increasing function g(·). 63.21 Motivated by the results of Probs. 59.4 and 59.5, we attempt here a kernelbased implementation for logistic regression. Consider the `2 -regularized empirical logistic risk function: ( ) N −1  1 X  ? ∆ 2 −γ(n)(φ(hn ))T w w = argmin ρkwk + ln 1 + e N n=0 M w∈IR φ where ρ > 0. Let σ(z) = 1/(1 + e−z ). (a) Show that w? can be written in the form: w? =

N −1 1 X λ(n)γ(n)φ(hn ) 2ρN n=0

where the coefficients λ(n) are the derivatives of ln σ(z) at z = γ(n)b γ (n): ∆

λ = (b)

d ln σ(z) , dz z=γb γ

γ b = (φ(h))T w?

Using model w? , show that the conditional probability of the label variable given the feature vector can be written in the form P(γ | h; w? ) = 1/(1 + e−γbγ ), for the following function of h: ∆

γ b(h) =

N −1 1 X λ(m)γ(m)K(h, hm ), 2ρN m=0

K(h, hm ) = (φ(h))T φ(hm )

Conclude that the label γ that maximizes P(γ | h; w? ) is the one that matches sign(b γ (h)). (c) Is this a valid kernel-based implementation? 63.22 The purpose of this problem is to derive a kernel-based implementation for the Fisher solution from Section 56.4. For a related discussion, see Mika et al. (1999a). Refer to the presentation in that section and let φ(h) : IRM → IRMφ denote the transformation that maps feature vectors, h, into their transformed values hφ . We express the hyperplane wφ using a combination of the transformed feature vectors as follows: wφ =

N −1 X

α(n)hφn

n=0

for some coefficients {α(n)} to be determined. Consider a generic kernel function of the form K(hk , h` ) = (hφk )T h` .

2644

Kernel Methods

(a)

Starting from the Fisher value defined by (56.31a), motivate the following form in the transformed domain in terms of kernel calculations: ∆

f (x) =

xT Aφ x xT Bφ x

where x = col{α(0), α(1), . . . , α(N − 1)} and the entries of the matrix quantities {Aφ , Bφ } are constructed in the following manner. We first construct two transformed mean vectors of size N × 1 each, denoted by {m b φ,+1 , m b φ,−1 }, and whose pth entries are given by:   1  X [m b φ,+1 ]p = K(hp , h` ) N+1 `:γ(`)=+1   1  X [m b φ,−1 ]p = K(hp , h` ) N−1 `:γ(`)=−1

Aφ = (m b φ,+1 − m b φ,−1 )(m b φ,+1 − m b φ,−1 )T [Cφ,+1 ]n,m = K(hn , hm ), 0 ≤ n ≤ N − 1, hm ∈ class +1 [Cφ,−1 ]n,m = K(hn , hm ), 0 ≤ n ≤ N − 1, hm ∈ class −1     1 1 T b φ,+1 = Cφ,+1 I − 1N+1 1TN+1 Cφ,+1 Σ N+1 − 1 N+1     1 1 T b φ,−1 = Cφ,−1 I − 1N−1 1TN−1 Cφ,−1 Σ N−1 − 1 N−1  1  b φ,+1 + (N−1 − 1)Σ b φ,−1 Bφ = (N+1 − 1)Σ N −2 (b)

(c)

To avoid ill-conditioning of Bφ , we may consider solving the following problem:   xT Aφ x max x xT (ρI + Bφ )x for small ρ > 0. Show that an optimal x? can be chosen to be parallel to the eigenvector corresponding to the largest eigenvalue of (ρI+Bφ )−1 (m b φ,+1 −m b φ,−1 ). PN −1 ? Define γ b(h) = α (n)K(h, h ). Conclude that one way to assign a given n n=0 feature vector h to classes ±1 is as follows:  h ∈ class +1, if γ b(h) ≥ θ? h ∈ class −1, if γ b(h) < θ? where ∆

θ? =

1 (m b φ,+1 + m b φ,−1 )T x? 2

63.23 Consider a collection of N training points {γ(n), hn }, where γ(n) ∈ {±1} and hn ∈ IRM . Many classification problems involve minimizing a regularized empirical risk of the following form: n  o min ρkwk2 + P hT0 w, hT1 w, . . . , hTN −1 w w∈IRM

where ρ > 0, and P (·) is a function of the inner products {hTn w}. The logistic regression formulation provides one example where

Problems

2645

N −1    T 1 X  ln 1 + e−γ(m)hm w P hT0 w, hT1 w, . . . , hTN −1 w = N m=0

Similarly, the least-squares criterion (63.63) is another example where N −1   X P hT0 w, hT1 w, . . . , hTN −1 w = λN −1−m (γ(m) − hTm w)2 m=0

(a)

Introduce the data matrix H = col{hT0 , hT1 , . . . , hTN −1 } ∈ IRN ×M ; its rows consist of the transposed feature vectors. Rewrite the optimization problem in the compact form: n o min ρkwk2 + P (Hw) w∈IRM

(b)

Show that any solution w to this problem can be expressed in the form w = H T α, for some α ∈ IRN . That is, show that any solution w needs to be a linear combination of the feature vectors. Conclude that the optimization problem of part (a) is equivalent to solving min

n

α∈IRN

(c)

for some matrix A , col{aT0 , aT1 , . . . , aTN −1 }. Identify A. Conclude that, in kernel space, the solution of ( )   φ 2 φ T φ φ T φ min ρkw k + P (h0 ) w , . . . , (hN −1 ) w wφ ∈IR

(d)

 o ρkαk2A + P aT0 α, aT1 α, . . . , aTN −1 α



can be reduced to the same equivalent problem as in part (b) where now [A]m,n = K(hm , hn ). In other words, conclude that knowledge of the kernel function, K(·, ·), is sufficient to determine α? ; there is no need for the transformation function, φ(·). Let α? (n) denote the individual entries of α? . Show that the estimated class variable for a new feature vector h can be determined from the expression: γ b(h) =

N −1 X

α? (n)K(h, hn )

n=0

63.24

Refer to expression (63.191) for the log-likelihood function, written as 1 T −1 `(θ) = − gbN bN −1 −1 RN −1 g 2 | {z } =X

N −1 X

  1 −1 − ln 1 + e−γ(n)bgN −1 (n) − ln det(IN + BN −1 RN −1 ) |2 {z } n=0 | {z } =Z =Y

= −X − Y − Z The individual entries of gbN −1 are denoted by gbN −1 (n). We want to evaluate ∂`(θ)/∂θm , for any of the entries of θ. For this purpose, we will rely on the results of Prob. 2.10 for differentiating matrix functions relative to scalar parameters.

2646

Kernel Methods

(a)

Verify that !  ∂b ∂X gN −1 T −1 1 T −1 ∂RN −1 −1 = RN −1 gbN −1 − gbN −1 RN −1 RN −1 gbN −1 ∂θm ∂θm 2 ∂θm

(b)

Verify that Z = − 12 ln det BN −1 +

1 2

ln det(BN −1 + RN −1 ) and conclude that ) (  ∂Z 1  −1 ∂BN −1  1 ∂RN −1  −1 ∂BN −1 = − Tr BN −1 + Tr (BN −1 + RN −1 ) + ∂θm 2 ∂θm 2 ∂θm ∂θm

and ∂BN −1 = diag ∂θm (c)

∂ ∂b gN −1 (n)

 2 e−γ(n)bgN −1 (n) 1 + eγ(n)bgN −1 (n)

!

∂b gN −1 (n) ∂θm

)N −1 n=0

Verify that ∂Y = diag ∂θm

(d)

(

(

∂ ∂b gN −1 (n)

  ln 1 + e−γ(n)bgN −1 (n)

!

∂b gN −1 (n) ∂θm

)N −1 (diagonal) n=0

The expressions in parts (a)–(c) depend on finding the entries of ∂b gN −1 /∂θm . Recall that gbN −1 is the approximation to the MAP estimator that follows from maximizing (63.165). Therefore, by definition, the gradient vector in (63.166) at −1 −1 gbN −1 is expected to evaluate to zero so that RN bN −1 = DN −1 g −1 γN −1 . Deduce from this relation that ) ( −1 ∂DN ∂RN −1 ∂b gN −1 −1 −1 = RN −1 γN −1 − gbN −1 ∂θm ∂θm ∂θm ( ! )N −1 −1 ∂DN ∂b gN −1 (n) ∂ 1 −1 = diag (diagonal) ∂θm ∂b gN −1 (n) 1 + eγ(n)bgN −1 (n) ∂θm n=0

Remark. The reader may refer to Williams and Barber (1998) and Rasmussen and Williams (2006, ch. 5) for a related discussion.

REFERENCES Aizerman, M. A., E. M. Braverman, and L. I. Rozoner (1964), “Theoretical foundations of the potential function method in pattern recognition learning,” Aut. Remote Control, vol. 25, pp. 821–837. Aronszajn, N. (1950), “Theory of reproducing kernels,” Trans. Amer. Math. Soc., vol. 68, no. 3, pp. 337–404. Bachman, G. and L. Narici (1998), Functional Analysis, Dover Publications. Berlinet, A. and C. Thomas-Agnan (2004), Reproducing Kernel Hilbert Spaces in Probability and Statistics, Kluwer Academic Publishers. Bishop, C. (2007), Pattern Recognition and Machine Learning, Springer. Bochner, S. (1932), Vorlesungen uber Fouriersche Integrale, Akademische Verlagsgesellschaft. Translated by M. Tenenbaum and H. Pollard as Lectures on Fourier Integrals, Princeton University Press, 1959. Bordes, A., S. Ertekin, J. Weston, and L. Bottou (2005), “Fast kernel classifiers with online and active learning,” J. Mach. Learn. Res., vol. 6, pp. 1579–1619.

References

2647

Cesa-Bianchi, N. and C. Gentile (2007), “Tracking the best hyperplane with a simple budget perceptron,” Mach. Learn., vol. 69, pp. 143–167. Cox, D. D. and O’Sullivan, F. (1990), “Asymptotic analysis of penalized likelihood and related estimators,” Ann. Statist., vol. 18, pp. 1676–1695. Crammer, K., J. Kandola, and Y. Singer (2004), “Online classification on a budget,” Proc. Advances Neural Information Processing Systems (NIPS), vol. 16, pp. 225–232, Vancouver. Cristianini, N. and J. Shawe-Taylor (2000), An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press. Debnath, L. and P. Mikusinski (2005), Introduction to Hilbert Spaces with Applications, 3rd ed., Academic Press. Dekel, O., S. Shalev-Shwartz, and Y. Singer (2008), “The forgetron: A kernel-based perceptron on a budget,” SIAM J. Comput., vol. 37, no. 5, pp. 1342–1372. Dunford, N. and L. Schwartz (1963), Linear Operators, Part II, Spectral Theory: Self Adjoint Operators in Hilbert Space, Wiley. Duttweiler, D. L. and T. Kailath (1973a), “RKHS approach to detection and estimation problems IV: Non-Gaussian detection,” IEEE Trans. Inf. Theory, vol. 19, no. 1, pp. 19–28. Duttweiler, D. L. and T. Kailath (1973b), “RKHS approach to detection and estimation problems V: Parameter estimation,” IEEE Trans. Inf. Theory, vol. 19, no. 1, pp. 29– 36. Fedorov, V. (1987), “Kriging and other estimators of spatial field characteristics,” working paper WP-87-99, International Institute for Applied Systems Analysis (IIASA), Austria. Freund, Y. and R. E. Schapire (1999), “Large margin classification using the perceptron algorithm,” Mach. Learn., vol. 37, no. 3, pp. 277–296. Gunning, R. C., editor (1992), Collected Papers of Salomon Bochner, Parts 1–4, American Mathematical Society. Halmos, P. R. (1974), Finite-Dimensional Vector Spaces, Springer. Halmos, P. R. (2013), Introduction to Hilbert Space and the Theory of Spectral Multiplicity, Martino Fine Books. Hastie, T., R. Tibshirani, and J. Friedman (2009), The Elements of Statistical Learning, 2nd ed., Springer. Herbrich, R. (2002), Learning Kernel Classifiers: Theory and Algorithms, MIT Press. Hille, E. (1972), “Introduction to general theory of reproducing kernels,” Rocky Mountain J. Math., vol. 2, no. 3, pp. 321–368. Jaakkola, T. and D. Haussler (1999), “Exploiting generative models in discriminative classifiers,” Proc. Advances Neural Information Processing Systems (NIPS), pp. 1–7, Denver, CO. Journel, A. G. and C. J. Huijbregts (1978), Mining Geostatistics, Academic Press. Kailath, T. (1971), “RKHS approach to detection and estimation problems I: Deterministic signals in Gaussian noise,” IEEE Trans. Inf. Theory, vol. 17, no. 5, pp. 530–549. Kailath, T. and D. Duttweiler (1972), “An RKHS approach to detection and estimation problems III: Generalized innovations representations and a likelihood-ratio formula,” IEEE Trans. Inf. Theory, vol. 18, no. 6, pp. 730–745. Kailath, T. and H. L. Weinert (1975), “An RKHS approach to detection and estimation problems II: Gaussian signal detection,” IEEE Trans. Inf. Theory, vol. 21, no. 1, pp. 15–23. Katznelson, Y. (2004), An Introduction to Harmonic Analysis, 3rd ed., Cambridge University Press. Kennedy, R. A. and P. Sadeghi (2013), Hilbert Space Methods in Signal Processing, Cambridge University Press. Kimeldorf, G. S. and G. Wahba (1970), “A correspondence between Bayesian estimation on stochastic processes and smoothing by splines,” Ann. Math. Statist., vol. 41, no. 2, pp. 495–502.

2648

Kernel Methods

Kivinen, J., A. J. Smola, and R. C. Williamson (2004), “Online learning with kernels,” IEEE Trans. Signal Process., vol. 52, no. 8, pp. 2165–2176. Kreyszig, E. (1989), Introductory Functional Analysis with Applications, Wiley. Larkin, F. M. (1970), “Optimal approximation in Hilbert spaces with reproducing kernel functions,” Math. Comput., vol. 24, no. 112, pp. 911–921. Loomis, L. H. (1953), An Introduction to Abstract Harmonic Analysis, Van Nostrand. Luenberger, D. G. (1969), Optimization by Vector Space Methods, Wiley. Mercer, J. (1909), “Functions of positive and negative type and their connection with the theory of integral equations,” Philos. Trans. Royal Soc. A, vol. 209 pp. 415–446. Mika, S., G. Ratsch, J. Weston, B. Scholkopf, and K. R. Muller (1999a), “Fisher discriminant analysis with kernels,” Proc. IEEE Workshop on Neural Networks for Signal Processing, pp. 41–48, Madison, WI. Mika, S., B. Scholkopf, A. J. Smola, K. R. Muller, M. Scholz, and G. Ratsch (1999b), “Kernel PCA and de-noising in feature spaces,” Proc. Advances Neural Information Processing Systems (NIPS), pp. 536–542, Denver, CO. Moore, E. H. (1935), General Analysis, Part I, Memoirs of the American Philosophical Society. Moore, E. H. (1939), General Analysis, Part II, Memoirs of the American Philosophical Society. Muller, K. R., S. Mika, G. Ratsch, K. Tsuda, and B. Scholkopf (2001), “An introduction to kernel-based learning algorithms,” IEEE Trans. Neural Netw., vol. 12, no. 2, pp. 181–201. Neal, R. (1995), Bayesian Learning for Neural Networks, Ph.D. dissertation, Department of Computer Science, University of Toronto, Canada. O’Hagan, A. (1978), “Curve fitting and optimal design for prediction,” J. Roy. Statist. Soc. Ser. B, vol. 40, no. 1, pp. 1–42. Parzen, E. (1962), “Extraction and detection problems and reproducing kernel Hilbert spaces,” J. Soc. Indus. Appl. Math. Ser. A: Control, vol. 1, no. 1, pp. 35–62. Parzen, E. (1963), “Probability density functionals and reproducing kernel Hilbert spaces,” Proc. Symp. Time Series Analysis, pp. 155–169, New York. Poggio, T. and F. Girosi (1990), “Networks for approximation and learning,” Proc. IEEE, vol. 78, no. 9, pp. 1481–1497. Rasmussen, C. E. and C. K. I. Williams (2006), Gaussian Processes for Machine Learning, MIT Press. Reed, M. and B. Simon (1975), Methods of Modern Mathematical Physics, vol. II, Academic Press. Riesz, F. and B. S. Nagy (1965), Functional Analysis, 2nd ed., Frederick Ungar Publishing. Reprinted by Dover Publications, 1990. Ripley, B. D. (1981), Spatial Statistics, Wiley. Ripley, B. D. (1996), Pattern Recognition and Neural Networks, Cambridge University Press. Rudin, W. (1990), Fourier Analysis on Groups, Wiley. Saitoh, S. (1988), Theory of Reproducing Kernels and its Applications, Longman Scientific & Technical. Scholkopf, B. (1997), Support Vector Learning, Oldenbourg Verlag. Scholkopf, B., R. Herbrich, and A. J. Smola (2001), “A generalized representer theorem,” Proc. Ann. Conf. Computational Learning Theory (COLT/EuroCOLT), pp. 416–426, Berlin. Scholkopf, B. and A. J. Smola (2001), Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press. Scholkopf, B., A. Smola, and K. R. Muller (1998), “Nonlinear component analysis as a kernel eigenvalue problem,” Neural Comput., vol. 10, no. 5, pp. 1299–1319. Scholkopf, B., A. Smola, and K. R. Muller (1999), “Kernel principal component analysis,” in Advances in Kernel Methods: Support Vector Learning, C. J. C. Burges, B. Schölkopf, and A. J. Smola, editors, pp. 
327–352, MIT Press. Scholkopf, B., Z. Luo, and V. Vovk, editors, (2013), Empirical Inference, Springer.

References

2649

Shawe-Taylor, J. and N. Cristianini (2004), Kernel Methods for Pattern Analysis, Cambridge University Press. Slavakis, K., P. Bouboulis, and S. Theodoridis (2014), “Online learning in reproducing kernel Hilbert spaces,” in Academic Press Library in Signal Processing, R. Chellapa and S. Theodoridis, editors, vol. 1, pp. 883–987, Elsevier. Small, C. G. and D. L. McLeish (2011), Hilbert Space Methods in Probability and Statistical Inference, Wiley. Treves, F. (2006), Topological Vector Spaces, Distributions and Kernels, Dover Publications. Vovk, V. (2013), “Kernel ridge regression,” in Empirical Inference, B. Scholkopf, Z. Luo, and V. Vovk, editors, pp. 105–116, Springer. Wahba, G. (1990), Spline Models for Observational Data, SIAM. Werner, D. (1995), Functional Analysis, Springer. Weston, J., A. Bordes, and L. Bottou (2005), “Online (and offline) on an even tighter budget,” Proc. Int. Workshop on Artificial Intelligence and Statistics, pp. 413–420, Barbados. Williams, C. and D. Barber (1998), “Bayesian classification with Gaussian Processes,” IEEE Trans. Patt. Anal. Mach. Intell., vol. 20, no. 12, pp. 1342–1351. Williams, C. and C. E. Rasmussen (1995), “Gaussian processes for regression,” Proc. Advances Neural Information Processing Systems (NIPS), pp. 514–520, Denver, CO. Young, N. (1988), An Introduction to Hilbert Space, Cambridge University Press.

64 Generalization Theory

We described several data-based methods for inference and learning in the previous chapters. These methods operate directly on the data to arrive at classification or inference decisions. One key challenge these methods face is that the available training data need not provide sufficient representation for the sample space. For example, the training data that may be available in the neighborhood of any feature location, h ∈ IRM , will generally provide only a sparse representation (i.e., a few examples) of the sought-after classifier behavior within this volume of space. It is for this reason that the design of reliable inference methods in higher-dimensional spaces is more challenging than normal. In particular, algorithms that work well in lower-dimensional feature spaces need not perform well in higher-dimensional spaces. This property is a reflection of the phenomenon known as the curse of dimensionality. We examine these difficulties in this chapter and arrive at some important conditions for reliable learning from a finite amount of training data.

64.1

CURSE OF DIMENSIONALITY To illustrate the curse of dimensionality effect, we refer to Fig. 64.1. Consider initially a one-dimensional space, with M = 1, and assume all N training points {hn } (which are now scalars) are randomly distributed within the interval [0, 1]. In this case, we say that we have a sample density of d = N samples/dimension. Let us now consider the two-dimensional case, with M = 2, and let us assume, similarly, that the N training points are randomly distributed within the square region [0, 1] × [0, 1]. In this case, the resulting sample density will be d = N 1/2 samples/dimension. This can be seen as follows. Referring to the diagram in the left part of Fig. 64.2, we partition the horizontal and vertical dimensions of the square region [0, 1] × [0, 1] into N 1/2 sub-intervals in each direction. This division results in a total of N smaller squares. Since the total number of training samples is N , and since these samples are assumed to be uniformly distributed within the region [0, 1] × [0, 1], we conclude that the expected number of samples per small square is equal to 1. Consequently, if we consider any horizontal (or vertical) stripe, the average number of samples in that stripe will be N 1/2 , from which we infer that the sample density is d = N 1/2 samples/dimension. Likewise, for

64.1 Curse of Dimensionality

2651

density = N 1/M samples/dimension

Figure 64.1 The plots illustrate how sample density varies with the dimension values

M = 1, 2, 3. For a generic M −dimensional space, the density is equal to N 1/M samples per dimension.

M = 3, the density will be d = N 1/3 and, more generally, for M -dimensional spaces, the density will be d = N 1/M samples/dimension

(64.1)

If we were to consider a density value of d = 100 samples/dimension to be reasonable for one-dimensional problems within the interval [0, 1], then to attain this same density in M -dimensions, we will need a total number N of training samples that satisfies N 1/M = 100 or N = 100M samples

(64.2)

For example, for M = 20, which is a relatively small feature dimension, we would need to collect 1040 samples (which is a huge number of samples). For M = 40, we would need 1080 samples (which is equal to the estimated number of atoms in the universe!). In other words, as the dimension of the feature space increases, we will be needing substantially more training data to maintain the sampling density uniform. Conversely, if we keep N fixed and increase M , then the higherdimensional space will become more sparsely populated by the training data. One other way to visualize this effect is to consider a small hypercube of edge length ` < 1 embedded within the larger [0, 1]M hypercube in M -dimensional space, whose volume is equal to 1. The volume of the smaller hypercube is `M , which is a fraction of the larger volume – see the right plot in Fig. 64.2. If the larger [0, 1]M hypercube has N samples distributed randomly within it, then the smaller hypercube will contain, on average, a fraction of these samples and their number will be `M N . Observe that as M increases, this fraction of samples will decrease in number since ` < 1 and the smaller hypercube will become less populated.

2652

Generalization Theory

Figure 64.2 The plot on the left illustrates the density expression of N 1/2 samples per

dimension for the case M = 2. The plot on the right illustrates that the fraction of training samples inside the smaller cube is equal to `3 N on average.

Example 64.1 (Numerical example) We illustrate the curse of dimensionality effect by means of an example. A collection of N = 2000 feature vectors hn ∈ IRM are generated randomly for increasing values of M . The entries of each hn are uniformly distributed within the range [−0.5, 0.5] so that the feature vectors lie inside a hypercube of unit edge centered at the origin. For each fixed M , we determine the distance to the closest neighbor for each feature vector and average these distances over all N = 2000 vectors. The numerical values listed in Table 64.1 are obtained in this manner. Table 64.1 Average minimum distance to nearest neighbor for different M , obtained by averaging over N = 2000 random feature vectors. Dimension, M

Average minimum distance

1 10 50 100 500 1000 5000 10,000

0.00026 0.46 2.07 3.28 8.34 12.13 28.10 40.06

The values in the table indicate that the minimum distance between uniformly distributed feature vectors increases quickly with the feature dimension, M , so that the feature vectors become more dispersed in higher dimensions. Actually, as the dimension M increases, the feature vectors tend to concentrate at the corners of the unit hypercube. To see this, assume we insert a sphere of radius 1/2 inside this hypercube; it is centered at the center of the cube – see Fig. 64.3. Its volume is given by the expression volume =

 M 1 π M/2   2 Γ M +1 2

(64.3)

64.1 Curse of Dimensionality

2653

in terms of the gamma function, Γ(x), defined earlier in Prob. 4.3. Since the feature vectors are uniformly distributed in space, we find that the ratio of points that lie inside the sphere relative to the points that lie inside the hypercube is equal to the above volume expression. Taking the limit as M → ∞, the volume expression approaches zero (see Prob. 64.8), which confirms that most of the volume of the hypercube is at its 2M corners and not in the center. Consequently, the feature vectors become more spread out as M increases.

y-axis

x-axis

z-axis

Figure 64.3 A cube centered at the origin with unit edge length, along with a sphere

of radius 1/2 inserted inside the cube and touching its surfaces. Feature vectors are randomly distributed inside the cube.

Implication for classification The curse of dimensionality is problematic when one is searching for classification mappings, γ b(h) : IRM → IR, over the set of all possible classifiers. This is because the problem of determining a classifier is essentially one of fitting a function γ b(h) to the training data {γ(n), hn } and using it to classify test features, h, for example, by examining the sign of γ b(h) when γ ∈ {±1}. As the feature dimension increases, a significantly larger amount of training data will be necessary for a better fit. The larger amount of data allows sampling the feature space more

2654

Generalization Theory

densely so that the behavior of the training data {γ(n), hn } is informative enough to obtain a classifier that performs well over the entire feature space. One important question, then, is whether it is possible to design a good classifier in high-dimensional spaces. We address this question in the next section and answer it in the affirmative under some conditions. Specifically, it will turn out that as long as the size of the training data is large enough and the complexity of the classification mapping that we are seeking is moderate, then we will be able to learn reasonably well. We have two main tools at our disposal to deal with the curse of dimensionality: (a) (Moderate classifier complexity) One approach is to limit the complexity of the classification model by restricting the class of classifiers, as will be done further ahead in (64.11). This is one reason why we often rely on affine or linear classifiers (and not arbitrary classifier structures). (b) (Dimensionality reduction) A second approach is to reduce the dimension of the feature space. We already encountered two dimensionality-reduction procedures in the earlier chapters in the form of the Fisher discriminant analysis (FDA) method of Section 56.4 and the principal component analysis (PCA) method of Chapter 57. In this chapter, we focus on the first approach, which relies on reducing the classifier complexity. In particular, we will examine the feasibility of the learning problem and explain how it is affected by the size of the training data, N , and by the complexity of the classifier model.

64.2 EMPIRICAL RISK MINIMIZATION

The available information for learning is limited to the training data:

{γ(n), h_n, n = 0, 1, . . . , N − 1}        (64.4)

where n is the running variable and γ(n) ∈ {±1} is the binary label associated with the nth feature vector h_n ∈ IR^M. There will be no prior information about the underlying joint data distribution, f_{γ,h}(γ, h). As such, we will rarely be able to solve directly the problem of minimizing the actual risk, R(c), defined as the probability of erroneous classifications:

c•(h) ≜ argmin_{c(h)} { R(c) ≜ P(c(h) ≠ γ) = E I[c(h) ≠ γ] }        (64.5)

where c(h) : IR^M → {±1} is a classifier mapping from h to the label space. In (64.5) the minimization is over all possible choices for c(h) and the optimal classifier is denoted by the bullet superscript, c•(h). Observe that we are writing the risk R(c) in two equivalent forms: as the probability of misclassification and as the expected value of the indicator function. The second form is valid because


the indicator function is either 1 or 0, and it assumes the value of 1 when an error occurs. In the notation used in (64.5), the variable γ refers to the true label associated with the feature vector h. We already know that the solution to the above problem is given by the Bayes classifier (28.28):

c•(h) = { +1, when P(γ = +1 | h = h) ≥ 1/2;  −1, otherwise }        (64.6)

This solution requires knowledge of the conditional probability distribution of γ given h, which is rarely available beforehand. For this reason, as we already saw in several examples in previous chapters, we will need to deviate from seeking the optimal Bayes solution and settle on approximating it from the training data {γ(n), h_n}. One first approximation to consider is to minimize the empirical error rate over the training data, i.e., to replace (64.5) by

c▲(h) ≜ argmin_{c(h)} { R_emp(c) ≜ (1/N) Σ_{n=0}^{N−1} I[c(h_n) ≠ γ(n)] }        (64.7)

where we are now counting only the misclassification errors that occur over the training data. We are denoting the solution to this problem by c▲(h), with a filled triangle. We reserve the filled circle and triangle superscripts for minimizations over all classifiers without limitation.
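As a small illustration (ours, not from the text), the following Python sketch evaluates the empirical error rate (64.7) for one candidate affine classifier on synthetic training data; the data-generation choices and the candidate (w, θ) are arbitrary assumptions made only for this example.

```python
# Minimal sketch: empirical error rate (64.7) of a candidate affine classifier
# c(h) = sign(h'w - theta) on synthetic training data {gamma(n), h_n}.
import numpy as np

rng = np.random.default_rng(0)
N, M = 200, 2
H = rng.normal(size=(N, M))                                           # feature vectors h_n
gamma = np.sign(H[:, 0] - 0.5 * H[:, 1] + 0.1 * rng.normal(size=N))   # noisy labels in {+1, -1}

w, theta = np.array([1.0, -0.5]), 0.0                                 # a candidate classifier (w, theta)
pred = np.sign(H @ w - theta)
R_emp = np.mean(pred != gamma)                                        # fraction of misclassified training points
print(f"empirical error rate R_emp(c) = {R_emp:.3f}")
```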

Four optimal classifiers
Problem (64.7) continues to be challenging because it does not restrict the class of classifiers over which the minimization of R_emp(c) is performed. We have seen in several of the learning algorithms we studied before that it is customary to limit the search to some restricted set of classifiers, denoted by c ∈ C. This classifier space C is sometimes called the hypothesis space in learning theory, where it is denoted by the letter H. We will refer to it instead as the classifier space and use the notation C. One popular classifier class C, which we have employed extensively before, is the class of "linear" or affine classifiers of the form:

C = { c(h) = sign(h^T w − θ),  w ∈ IR^M, θ ∈ IR }        (64.8)

where the sign function is defined by

sign(x) ≜ +1, if x ≥ 0;  −1, if x < 0        (64.9)

This class of classifiers is parameterized by (w, θ); each choice for (w, θ) results in one particular classifier. Once optimal values (w⋆, θ⋆) are selected (based on some design criterion), for any test feature vector, h, the classification decision is based on examining the sign of h^T w⋆ − θ⋆. Other families of classifiers are of course possible, such as nonlinear models that are based on kernel representations or neural network models (which are studied in future chapters).


Figure 64.4 Four interrelated optimization problems. Two classifiers {c•(h), c∘(h)} minimize the actual risk, while two other classifiers {c▲(h), c⋆(h)} minimize the empirical error rate. Moreover, two classifiers {c∘(h), c⋆(h)} restrict the search class to c ∈ C, while two other classifiers {c•(h), c▲(h)} do not. The smaller circles are meant to indicate that the respective optimal classifiers attain smaller risk values because they are optimizing over a larger pool of classifiers.

Whether or not we restrict the class of classifiers, and whether we minimize the actual or empirical risk, we end up with four interrelated optimization problems that we can compare against each other – see Fig. 64.4. We denote the minimizer for the actual risk (64.5) by c•(h), which uses the bullet superscript notation. This is the optimal Bayes classifier (the ideal solution that we aim for but is generally unattainable). This solution results from minimizing the risk (or error rate) R(c) over all possible classifier mappings and not only over any restricted set c ∈ C, i.e.,

c•(h) ≜ argmin_{c(h)} R(c)        (64.10)

Once we limit the minimization to some classifier set, say, c ∈ C, the resulting minimizer need not agree with c•(h) anymore, and we will denote it instead by c∘(h) using the circle superscript notation to refer to optimality over a restricted search space:

c∘(h) ≜ argmin_{c(h)∈C} R(c)        (64.11)

The larger the space C is, the closer we expect the solution c∘(h) to get to the optimal Bayes classifier, c•(h). We say that the restriction c(h) ∈ C introduces a form of inductive bias by moving the solution away from c•(h). Problem (64.11) continues to require knowledge of the underlying joint data distribution to evaluate the risk R(c). In data-based learning methods, we move away from this requirement by relying solely on the training data. In that case, we replace R(c)


in (64.11) by the empirical error rate over the training data and denote the solution by c⋆(h):

c⋆(h) ≜ argmin_{c(h)∈C} R_emp(c)        (64.12)

We reserve the star superscript notation for solutions that result from using the training data. Thus, note that we use the circle (∘) superscript to refer to optimality over the entire distribution of the data, and the star (⋆) superscript to refer to optimality relative to the training data. In contrast to c▲(h) from (64.7), the sought-after classifiers in (64.12) are limited to the set c ∈ C. Problem (64.12) is referred to as the empirical risk minimization (ERM) problem, and its solution is solely dependent on the training data. This is the problem that the various learning procedures we have been studying focus on, and its performance should generally be compared against c∘(h) in (64.11). Table 64.2 lists the four classifiers discussed in this section and indicates whether they minimize the actual or empirical risk and whether they restrict the class of classifiers.

Table 64.2 Four optimal classification problems and their respective classifiers.

                              Actual risk, R(c)      Empirical risk, R_emp(c)
  minimization over all c     c•(h)                  c▲(h)
  minimization over c ∈ C     c∘(h)                  c⋆(h)

64.3 GENERALIZATION ABILITY

Our main focus will be on designing c⋆(h), namely, classifiers that minimize the empirical error rate over some classifier set c ∈ C, and on examining how close their actual error performance, R(c⋆), gets to the performance of the solution c∘(h) that minimizes the actual risk. Ideally, we would like the performance of c⋆(h) to approximate the performance of the optimal Bayes solution, c•(h), as N → ∞. However, this objective is generally impossible to meet. This is because the determination of c⋆(h) is limited to the restricted set c ∈ C, while the determination of c•(h) is over all possible classifier mappings. We therefore need to formulate a more realistic expectation. Since we are limiting the search space to some set c ∈ C (such as the space of affine classifiers), it is the two classifiers {c⋆(h), c∘(h)} that matter the most in our discussions. For this reason, it is the risk value of c∘(h) that we would like the empirical solution c⋆(h) to approach, and not that of c•(h). This is an attainable objective. We will show below in (64.20) that, under some reasonable conditions, the risk value of c⋆(h) can be made to approach asymptotically, as N → ∞ and with high probability 1 − ε, the risk value of c∘(h). This is a remarkable conclusion, especially since it will hold irrespective of the joint distribution of the data (γ, h).


64.3.1 Vapnik–Chervonenkis Bound

To arrive at this important conclusion, we let VC denote the so-called Vapnik–Chervonenkis dimension of the classifier set, C (we will not be limiting this set to linear classifiers in the current discussion). We will define the VC dimension later in Section 64.4. Here, it is sufficient to know that this nonnegative number serves as a measure of the complexity of the classification set: More complex classifier models will have a larger VC dimension than simpler models. For example, for the case of affine classifiers, we will find that VC = M + 1. The argument that leads to the forthcoming conclusion (64.20) relies on a fundamental result in statistical learning theory, known as the Vapnik–Chervonenkis bound. The result is motivated in Probs. 64.24–64.25 under some simplifying conditions, and is proven more generally in Appendix 64.C. To state the result, we introduce an auxiliary parameter. Given any small ε > 0, we introduce a positive factor δ > 0 that depends on N, VC, and ε as follows:

δ(N, VC, ε) ≜ √{ (32/N) [ VC ln(Ne/VC) + ln(8/ε) ] }        (64.13)

where the letter "e" refers to the base number for natural logarithms, e ≈ 2.7183. Observe that the value of δ is independent of the distribution of the data, f_{γ,h}(γ, h), and that δ is small when N is large (i.e., under sufficient training data) and VC is small (i.e., for moderately complex classification models).

Figure 64.5 The plot illustrates the behavior of the bound δ(N, VC, ε) in (64.13) as a function of the VC dimension for various values of N and ε = 0.01.

Figure 64.5 illustrates the behavior of δ as a function of the VC dimension for several values of N and ε = 0.01. Observe from the plot that, for example for N = 10^4, increasing the complexity of the model (i.e., increasing its VC dimension) enlarges the value of δ. Observe further from (64.13) that if, on the other hand, we fix the values of VC and ε and let the size of the training set increase, we obtain:

lim_{N→∞} δ(N) = 0,    for fixed VC and ε        (64.14)
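As a quick numerical check (not part of the text), the following Python sketch evaluates the bound parameter (64.13) for a few sample sizes and VC dimensions; it reproduces the qualitative behavior shown in Fig. 64.5 and the limit (64.14).

```python
# Minimal sketch: evaluate the VC bound parameter delta(N, VC, eps) from (64.13).
import math

def delta(N, VC, eps):
    return math.sqrt((32.0 / N) * (VC * math.log(N * math.e / VC) + math.log(8.0 / eps)))

eps = 0.01
for N in [10**3, 10**4, 10**5, 10**6]:
    vals = [f"VC={vc}: {delta(N, vc, eps):.2f}" for vc in (5, 10, 25, 50)]
    print(f"N = {N:>7d}   " + "   ".join(vals))
# delta grows with the VC dimension for fixed N, and shrinks toward zero as N grows.
```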

Now, using the value of δ defined by (64.13), it can be shown that, regardless of the distribution of the data and for any c ∈ C, it holds with high probability of at least (1 − ε) that:

|R_emp(c) − R(c)| ≤ δ,    for any c ∈ C        (64.15)

That is, the empirical error rate of classifier c evaluated on the training data is δ-close to its actual error rate over the entire data distribution with high probability 1 − ε. We restate result (64.15) in another equivalent form as follows.

Theorem 64.1. (VC bound) Consider a collection of N training data points {γ(n), h_n} and let C denote the classifier space. Let R_emp(c) denote the empirical risk for any classifier c ∈ C over the training data, and let R(c) denote its actual risk (i.e., its probability of misclassification) over the entire data distribution:

R_emp(c) = (1/N) Σ_{n=0}^{N−1} I[c(h_n) ≠ γ(n)],    R(c) = P(c(h) ≠ γ)        (64.16)

Introduce the parameter δ defined by (64.13) in terms of the VC dimension for C. Then, for any small ε > 0, it holds that

P( sup_{c∈C} |R_emp(c) − R(c)| ≤ δ ) ≥ 1 − ε        (64.17)

where the supremum is over the classifier set. This useful result is known as the Vapnik–Chervonenkis bound.

Proof: See Appendix 64.C. ∎

The result of the theorem provides a bound on the size of the difference between the empirical and actual risks, R_emp(c) and R(c), for any finite N and for any classifier c ∈ C. Loosely, it states that the difference between these risk values is relatively small (when δ is small) with high probability. The result implies roughly that for any classifier c ∈ C:

P( [error rate on training data] ≈ [error rate on test data] ) ≥ 1 − ε        (64.18)

where we are using the symbol a ≈ b to indicate that the values a and b are similar up to a small difference of magnitude δ. Obviously, since the risk values R_emp(c) and R(c) amount to misclassification error rates (and are therefore probability measures), their individual values must lie within the interval [0, 1]. This means that the bound (64.15) is meaningful only for parameter values (N, VC, ε) that


result in small δ; this typically requires large sample size, N , as already illustrated by Fig. 64.5.

64.3.2 PAC Learning

The VC bound (64.17) is important because it implies, as we now explain, that learning is feasible when N is large and the VC dimension is relatively small (so that δ is small). Indeed, note that for the classifiers {c⋆(h), c∘(h)} that we are interested in, it holds with probability at least 1 − ε that:

R(c⋆) ≤ R_emp(c⋆) + δ        (by (64.15) applied to c⋆)
      ≤ R_emp(c∘) + δ        (since c⋆ minimizes R_emp(c))
      ≤ R(c∘) + 2δ           (by (64.15) applied to c∘)        (64.19)

That is,

P( R(c⋆) − R(c∘) ≤ 2δ ) ≥ 1 − ε        (64.20)

Recall that, by design, R(c⋆) ≥ R(c∘) since c∘(h) minimizes the actual risk R(c). The above result is known as the PAC bound, where the letters PAC stand for "probably approximately correct" learning. When δ is small (e.g., when N is large and VC is small), the result shows that a classifier c⋆(h) determined from the training data is able to produce misclassification errors over the distribution of the data that are comparable to the best possible value, R(c∘), i.e., R(c⋆) ≈ R(c∘). However, we still do not know how small R(c∘) is. This value can be assessed from the empirical risk, R_emp(c⋆). Using (64.15) and (64.19), we can verify, again with high probability 1 − ε, that

|R_emp(c⋆) − R(c∘)| = |(R_emp(c⋆) − R(c⋆)) + (R(c⋆) − R(c∘))|
                    ≤ |R_emp(c⋆) − R(c⋆)| + |R(c⋆) − R(c∘)|
                (a) ≤ |R_emp(c⋆) − R(c⋆)| + (R(c⋆) − R(c∘))
                    ≤ δ + 2δ = 3δ        (using (64.15) and (64.19))        (64.21)

where step (a) is because c∘ minimizes R(c) over C and hence R(c⋆) ≥ R(c∘). Result (64.21) provides one useful way to assess R(c∘) (and R(c⋆)) through R_emp(c⋆); this latter value is readily obtained from the training data. In summary, we conclude from results (64.15), (64.19), and (64.21) that, with high probability of at least 1 − ε and for small δ, the empirical and actual risk values (or empirical and actual error rates) for the classifiers c⋆(h) and c∘(h) are clustered together and satisfy the relations:

R(c∘) ≤ R(c⋆) ≤ R(c∘) + 2δ        (64.22a)
|R(c⋆) − R_emp(c⋆)| ≤ δ        (64.22b)
|R(c∘) − R_emp(c⋆)| ≤ 3δ        (64.22c)

Figure 64.6 illustrates these relations graphically. The first relation states that the risk of the empirical classifier, R(c⋆), does not exceed the optimal risk value R(c∘) by more than 2δ. The second and third relations state that the empirical risk, R_emp(c⋆), provides a good indication of the actual risks R(c⋆) and R(c∘).

Figure 64.6 The figure illustrates relations (64.22a)–(64.22c). The first relation states that the risk of the empirical classifier, R(c⋆), does not exceed the optimal risk value R(c∘) by more than 2δ. The second and third relations state that the empirical risk, R_emp(c⋆), provides a good indication of the actual risks R(c⋆) and R(c∘).

The main conclusion from the above analysis is the following. Assume the size of the training data, N, is large enough and the complexity of the classification model, VC, is moderate enough such that the corresponding δ parameter from (64.13) is sufficiently small. Assume further that we use the training data to determine a classifier c⋆(h) that minimizes the empirical risk R_emp(c) in (64.12) over the set of classifiers, c ∈ C. If R_emp(c⋆) is small, then the actual risk, R(c⋆), that corresponds to this classifier (i.e., its generalization ability, which corresponds to the probability of misclassification on test data apart from the training data), will also be small. We refer to the error on the test data as the generalization error. Moreover, the value of the empirical risk, R_emp(c⋆), will be close to the optimal value R(c∘). These results hold irrespective of the distribution of the data, f_{γ,h}(γ, h). In other words, learning from data is feasible under these conditions. By feasible learning we therefore mean any learning procedure that


is able to satisfy the PAC property (64.20) with sufficiently small δ. The size of δ can be made small by choosing the sample size, N , large enough.

64.4 VC DIMENSION

We are ready to explain the meaning of the VC parameter. This so-called Vapnik–Chervonenkis dimension of the class of classifiers C, also referred to as the modeling capacity of C, is a measure of the complexity of C. We will use the set of linear classifiers to illustrate this concept and subsequently extend it more generally. Consider a collection of K feature vectors h_n in M-dimensional space. In a binary classification setting, each of these feature vectors can be assigned to class +1 or −1. There are 2^K possibilities (also called dichotomies) for assigning the K feature vectors over the two classes. We say that a class of classifiers C is able to shatter the K feature vectors if every possible assignment among the 2^K possibilities can be separated by a classifier from the set.

We illustrate this definition in Fig. 64.7. The figure considers K = 3 feature vectors in IR^2 (i.e., M = 2 in this case). There are 2^3 = 8 possibilities for assigning these feature vectors to the classes ±1. All eight possibilities are shown in the figure on the left. Observe that in each of the eight assignments, we can find at least one line that is able to separate the feature vectors into the classes ±1. We therefore say that the three feature vectors in this example can be shattered by linear classifiers. In contrast, the figure on the right shows four feature vectors and one particular assignment for them that cannot be separated by linear classifiers.

Motivated by this example, we define the VC dimension for a general class of classifiers, C, as the largest value of K for which at least one set of K feature vectors can be found that can be shattered by C. For the class of linear classifiers over IR^2, the above example shows that K = 3. Therefore, VC = 3 when M = 2. It is important to observe that the definition of the VC dimension is not stating that the value of K should be such that every set of K feature vectors can be shattered. The definition only requires that at least one set of K feature vectors should exist that can be shattered.

Example 64.2 (VC dimension for a finite number of classifiers) Assume the set of classifiers (linear or otherwise) consists of a finite number, L, of possibilities denoted by {c_1, c_2, . . . , c_L}. In this case, the solution of the binary classification problem amounts to selecting one classifier from this collection. Then, it is easy to verify that the VC dimension for this set of classifiers is bounded by:

VC(L classifiers) ≤ log_2(L)        (64.23)

Observe how (64.23) illustrates that the VC dimension of a set C provides an indication of how complex that set is. Proof: If the VC dimension of the set of classifiers is denoted by VC, then this means that we can find a set of VC feature vectors that can be shattered by the L classifiers.


Figure 64.7 The eight squares on the left show all possible assignments of the same three feature vectors in IR^2. In each case, a line exists that separates the classes ±1 from each other. We therefore say that the three feature vectors in this example can be shattered by linear classifiers. In contrast, the figure on the right shows four feature vectors in the same space IR^2 and an assignment of classes that cannot be separated by a linear classifier.

This set of VC feature vectors admits 2^VC possible labeling assignments. Therefore, the size L should be at least equal to this value, i.e., L ≥ 2^VC, from which we obtain (64.23). ∎

The next statement identifies the VC dimension of the class of affine classifiers of the form c(h) = sign(h^T w − θ), for some parameters (w, θ).

Lemma 64.1. (Affine classifiers) The VC dimension for the class of affine classifiers over IR^M is equal to M + 1, i.e.,

VC(affine classifiers over IR^M) = M + 1        (64.24)

Proof: See Appendix 64.A. ∎
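To make the shattering notion concrete, the following Python sketch (our own construction, not from the text) enumerates all dichotomies of a small point set in IR^2 and tests each one for strict linear separability by solving a feasibility linear program; the chosen point sets and the use of scipy.optimize.linprog are assumptions of this example. Three points in general position are shattered by affine classifiers, while the four points of the XOR configuration are not.

```python
# Minimal sketch: brute-force shattering check for affine classifiers in IR^2.
# A dichotomy {gamma_n} of points {h_n} is strictly linearly separable iff the
# feasibility LP  gamma_n * (h_n . w - theta) >= 1  has a solution (w, theta).
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(H, labels):
    # unknowns x = [w1, w2, theta]; constraints -gamma*(h . w) + gamma*theta <= -1
    A = np.column_stack([-labels[:, None] * H, labels])
    b = -np.ones(len(labels))
    res = linprog(c=np.zeros(3), A_ub=A, b_ub=b,
                  bounds=[(None, None)] * 3, method="highs")
    return res.success

def shattered(H):
    return all(separable(H, np.array(lbl, dtype=float))
               for lbl in itertools.product([1.0, -1.0], repeat=len(H)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])                  # general position
four_xor = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])   # XOR layout
print("3 points shattered by affine classifiers:", shattered(three))        # True
print("4 XOR points shattered by affine classifiers:", shattered(four_xor)) # False
```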

64.5 BIAS–VARIANCE TRADE-OFF

The size of δ in (64.13) depends on the VC dimension of the classification set, C. The particular situation illustrated in Fig. 64.5 indicates that the value of δ becomes worse (i.e., larger) for larger VC values. This behavior seems to be


counterintuitive in that it suggests that using more complex models is not necessarily beneficial for learning and can degrade performance (since it can increase the probability of misclassification and lead to poor generalization). There are at least two ways to explain this apparent dilemma. One explanation is more intuitive and relies on the Occam razor principle, which we already encountered in Section 63.2. As was indicated in Fig. 63.2, more complex models can succeed in weaving through the training points and separating them into their respective classes almost flawlessly. However, this "perfect" fitting that happens during the training phase ends up modeling spurious effects and causes poor performance over test data. By the same token, simplistic models need not fit the training data well and can similarly lead to poor classification performance.

64.5.1 Bias–Variance Curve

The second explanation for the dilemma is more formal and relies on an important bias–variance trade-off that occurs in the design of optimal classifiers. As explained earlier, we desire the optimal Bayes classifier, c•(h), but can only work with c⋆(h) ∈ C, which is obtained from the training data. The risk associated with c•(h) is denoted by R(c•), and the risk associated with c⋆(h) is denoted by R(c⋆). This latter risk is data-dependent and, therefore, it can be viewed as a realization of a random variable: Each training dataset leads to one value for R(c⋆). We use the boldface notation R(c⋆) to emphasize this random nature. Computing the expectation of R(c⋆) over the distribution of the data allows us to evaluate the expected risk value for c⋆(h). It is instructive to compare the difference between the optimal risk, R(c•), and the expected risk from the training data, E R(c⋆). For this purpose, we note first that we can write, by adding and subtracting R(c∘):

E R(c⋆) − R(c•) = [ E R(c⋆) − R(c∘) ] + [ R(c∘) − R(c•) ]        (64.25)
                  estimation error (variance)    approximation error (bias)

This relation expresses the difference on the left as the sum of two components, referred to as the estimation error (also called variance) and the approximation error (also called bias) – see Fig. 64.8:

(a) (Bias) The bias error is independent of the training data; it measures the discrepancy in the risk value that results from restricting the classifier models to the set C and using c∘ instead of c•. The richer the set C is, the smaller the bias is expected to be.

(b) (Variance) On the other hand, each training dataset results in a realization of the risk value, R(c⋆). These realizations are represented by the red circles in Fig. 64.8, and they are dispersed around R(c∘); the dispersion arises from the random nature of the training data. The estimation or variance error therefore measures how far the values of R(c⋆) are spread around R(c∘).


Figure 64.8 The bias quantity relates to the distance from R(c∘) to the optimal Bayes risk value, R(c•). The variance quantity relates to the spread of R(c⋆) around R(c∘) due to randomness in the data.

The bias and variance terms behave differently as the complexity of the classification set, C, increases. Assume, for instance, that we enlarge the class of classifiers to C′ ⊇ C. Then, seeking the optimal classifier c∘ over the larger set C′ can only reduce the bias component on the right-hand side of (64.25) since

min_{c∈C′} R(c) ≤ min_{c∈C} R(c)        (64.26)

Therefore, R(c∘) will get closer to R(c•) and the bias term will get smaller. On the other hand, enlarging the classifier set generally increases the variance component because the realizations R(c⋆) will get dispersed farther away from R(c∘), which is now smaller. Indeed, note that for a fixed N, as the complexity of the class C increases, its VC dimension, and subsequently the value of δ in (64.13), also increase. This behavior is observed in Fig. 64.5. It follows from (64.19) that the empirical solution will tend to have risk values, R(c⋆), spread farther away from R(c∘).
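The following Python sketch (an illustration of ours, with arbitrary modeling choices) mimics this trade-off numerically: for a one-dimensional problem it estimates the average test error of the empirically designed classifier by Monte Carlo over many random training sets, with polynomial score functions of increasing degree standing in for richer classifier classes C. Low degrees underfit (bias), while the variance contribution grows with the degree.

```python
# Minimal sketch: Monte Carlo estimate of the average test error as the model
# class gets richer; polynomial score functions of degree d play the role of
# classifier classes of increasing complexity (all choices are illustrative).
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    h = rng.uniform(-1.0, 1.0, size=n)
    p = 1.0 / (1.0 + np.exp(-8.0 * (h**2 - 0.3)))        # P(gamma = +1 | h)
    gamma = np.where(rng.uniform(size=n) < p, 1.0, -1.0)
    return h, gamma

N_train, trials = 30, 200
h_test, g_test = sample(20000)                            # proxy for the data distribution

for degree in [1, 2, 4, 6, 9]:
    errs = []
    for _ in range(trials):
        h_tr, g_tr = sample(N_train)
        coeffs = np.polyfit(h_tr, g_tr, deg=degree)       # quadratic-loss fit of the score
        pred = np.sign(np.polyval(coeffs, h_test))
        errs.append(np.mean(pred != g_test))
    print(f"degree {degree:2d}: average test error ~ {np.mean(errs):.3f}")
```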

64.5.2 Overfitting and Underfitting

We conclude from the bias–variance analysis that there exists a compromise between bias and variance. A simple model set C may result in large bias but smaller variance. We refer to this scenario as underfitting since we would be fitting the data rather poorly by using simple models. In contrast, a more elaborate model set C may result in smaller bias but larger variance. We refer to this scenario as overfitting since we are likely to be overreaching by fitting the data more than is necessary. Combining these facts we arrive at the bias–variance trade-off curve shown in Fig. 64.9 in solid color. The curve captures the behavior of the bias and variance components as a function of the model complexity (i.e., its VC dimension). In general, good classifiers, c⋆(h), would be ones that are


close to the minimum of the curve; these are classifiers for which the sum of both components on the right-hand side of (64.25) is the least possible.


Figure 64.9 Increasing the complexity of the classifier class (i.e., increasing its VC

dimension) reduces the bias but increases the variance. The behavior of the bound in (64.19) as a function of VC is illustrated by the solid curve. The figure indicates that there is generally an optimal VC value at which the bound (red curve) is minimized.

64.5.3 Requirements for Feasible Learning

Based on the discussion thus far on the bias–variance trade-off in (64.25) and on the VC bound in (64.17), we conclude that a learning algorithm is effective and able to learn well if it meets three general conditions:

(a) (Moderate classifier complexity) The classifier structure should be moderately complex with a reasonable VC dimension in order to limit overfitting and reduce the size of the variance component in (64.25).

(b) (Sufficient training data) The algorithm should be trained on a sufficient number of data points. Usually, the value of N is chosen to be some multiple of the VC dimension of the classifier set.

(c) (Small empirical error rate) The algorithm should result in a small empirical error rate, R_emp(c⋆), on the training data (i.e., it should have a relatively small number of misclassifications).


When these conditions are met, learning becomes feasible irrespective of the probability distribution of the data. This means that the classifier c⋆(h), determined from the training data, will be able to generalize and lead to small misclassification errors on test data arising from the same underlying distribution.

64.6 SURROGATE RISK FUNCTIONS

The previous discussion establishes that learning from data is feasible for a sufficient amount of training data and for moderately complex classifier models (such as affine classifiers). Specifically, if we determine a classifier c⋆(h) with a small empirical error rate (misclassification error) over the training data {γ(n), h_n}, then it is likely that this classifier will perform equally well on test data and its performance will approach that of c∘(h) (which minimizes the probability of error over the distribution of the data). Thus, consider again the empirical risk minimization problem (64.12) and select the set C to be the class of affine classifiers, c(h) = sign(h^T w − θ). For convenience, we extend the feature and weight vectors using

h_n ← col{1, h_n},    w ← col{−θ, w}        (64.27)

in which case c(h) = sign(h^T w) and the offset parameter is represented implicitly within w. The optimal w⋆ that determines c⋆(h) is found by solving

w⋆ ≜ argmin_{w∈IR^M} { (1/N) Σ_{n=0}^{N−1} I[ h_n^T w ≠ γ(n) ] }        (64.28)

where we continue to denote the size of w by M. The difficulty we face now is that this problem is not only challenging to solve but is also ill-conditioned, meaning that decisions based on its solution are sensitive (and can change drastically) to minor variations in the data. To see this, we rewrite (64.28) in the equivalent form

w⋆ ≜ argmin_{w∈IR^M} { (1/N) Σ_{n=0}^{N−1} I[ γ(n) γ̂(h_n) ≤ 0 ] }        (64.29)

where

γ̂(h_n) ≜ h_n^T w        (64.30)

This alternative rewriting is based on the observation that a classification error occurs whenever the signs of γ(n) and h_n^T w do not match each other. It is generally difficult to minimize the empirical risk in (64.29) for at least two main reasons. First, a closed-form expression for w⋆ is rarely possible except in some special cases. Second, and more importantly, the 0/1-loss function,

Q(w; γ, h) ≜ I[ γ γ̂(h) ≤ 0 ],    γ̂ = h^T w        (64.31)


is nonsmooth over w and its value changes abruptly from 0 to 1. For example, if w is some classifier for which I[γ γ̂(h) ≤ 0] = 1 for a particular feature vector h, then a slight perturbation to this w can transition the indicator function to zero and lead to I[γ γ̂(h) ≤ 0] = 0. This behavior occurs because of the discontinuity of the indicator function I[y ≤ 0] at location y = 0, which causes problem (64.29) to be ill-conditioned – see Fig. 64.10. The term "ill-conditioning" refers to the phenomenon in which slight variations in the input data to a problem can lead to significant variations in the outcome.


Figure 64.10 The indicator function I[y ≤ 0] is discontinuous at y = 0.

To illustrate this undesirable property numerically, assume we succeed in determining a solution, w⋆, for (64.29). Consider further a particular training data point h in class γ = −1 and assume the value of h is such that

γ̂ = h^T w⋆ = −10^{−6}        (64.32)

Since γ̂ is negative, the classifier w⋆ will classify this point correctly:

sign(γ̂) = −1 = γ        (64.33)

Assume next that in the process of determining w⋆ we end up with a slightly perturbed version of it (e.g., due to numerical errors in the optimization process or due to minor perturbations in the training data). We denote this perturbed classifier by w×. It is not difficult to envision situations in which the perturbed w× would lead to a positive value for γ̂, say,

γ̂ = h^T w× = 10^{−6}        (64.34)

The two values {−10^{−6}, 10^{−6}} are very close to each other, and yet the new value will cause h to be misclassified and assigned to class +1.
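A tiny numerical sketch of this sensitivity (our own construction, with made-up numbers) shows how a negligible perturbation of w flips the 0/1-loss on a single point from 0 to 1:

```python
# Minimal sketch: sensitivity of the 0/1-loss to tiny perturbations of w.
import numpy as np

h = np.array([1.0, 0.5])                   # extended feature vector (leading 1 absorbs the offset)
gamma = -1.0                               # true label of this point
w_star = np.array([-0.5, 0.999998])        # chosen so that h'w is barely negative
w_pert = w_star + np.array([0.0, 4e-6])    # a tiny perturbation of the classifier

for name, w in [("w_star", w_star), ("w_pert", w_pert)]:
    score = h @ w
    loss = 1.0 if gamma * score <= 0 else 0.0    # 0/1-loss I[gamma * score <= 0]
    print(f"{name}: h'w = {score:+.1e}   0/1-loss = {loss}")
```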

Alternate risk functions
Due to the difficulty in dealing with 0/1-losses, it is customary to rely on surrogate loss functions that are easier to minimize and better behaved. We have


encountered several choices for alternative loss functions in the earlier chapters, such as the logistic loss, hinge loss, quadratic loss, and so forth. For example, since γ² = 1, we have

(γ − γ̂)² = (γ(1 − γγ̂))² = (1 − γγ̂)²        (64.35)

so that the quadratic loss (γ − γ̂)² will in effect be seeking values w that force the product γγ̂ to stay close to 1. We refer to the product γγ̂ as the margin variable:

y ≜ γ γ̂(h)        (margin variable)        (64.36)

The margin y is a function of w since γ̂ = h^T w. We can consider several surrogate loss functions defined as follows in terms of the margin variable:

Q(y) = (1 − y)²           (quadratic loss)        (64.37a)
Q(y) = ln(1 + e^{−y})     (logistic loss)         (64.37b)
Q(y) = max{0, −y}         (perceptron loss)       (64.37c)
Q(y) = max{0, 1 − y}      (hinge loss)            (64.37d)
Q(y) = e^{−y}             (exponential loss)      (64.37e)
Q(y) = I[y ≤ 0]           (ideal 0/1-loss)        (64.37f)

In each of these cases, the loss function can be interpreted as the "cost" or "price" we incur in using γ̂(h) to predict γ. Figure 64.11 plots these various loss functions. Several observations stand out:

(a) Observe that the ideal 0/1-loss function returns a value of 0 for correct decisions and a value of 1 for mismatches in the signs of γ and γ̂ (i.e., whenever y ≤ 0); this latter situation corresponds to misclassification.

(b) In comparison, the perceptron loss (64.37c) also returns zero for correct decisions but penalizes misclassifications close to the boundary y = 0 less severely than misclassifications farther away from the boundary; the penalty value varies linearly in the argument y.

(c) The hinge loss (64.37d) shows similar behavior with a linear penalty component; however, this component adds some margin away from the boundary y = 0 and penalizes arguments y that are smaller than 1 (rather than smaller than 0). We already know from our study of support vector machines (SVMs) that this feature adds robustness to the operation of the learning algorithm.

(d) Ideally, under perfect operation, the value of γ̂(h) should match γ and their product should evaluate to 1. That is why the quadratic loss penalizes deviations away from 1, both to the left and right. However, we know that requiring the product γγ̂(h) to be exactly 1 is unnecessary; it is sufficient to require the variables γ and γ̂(h) to have the same sign (i.e., to require the margin to be sufficiently positive). For this reason, several of the other loss functions assign more penalty to values of y smaller than 1 than to values of y larger than 1.


Figure 64.11 The dashed curve shows the plot of the ideal 0/1-loss I[y ≤ 0]. The other

plots show the loss functions Q(y) for quadratic, exponential, logistic, hinge, and perceptron designs – see expressions (64.37a)–(64.37f) for the definitions. It is seen from the graphs that, with the exception of the perceptron loss, all other loss functions bound the 0/1-loss from above. Although not seen in the figure, this fact is also true for the logistic loss if we rescale it by 1/ ln 2 to ensure that its value becomes 1 at y = 0.

(e) It is further seen from the figure that, with the exception of the perceptron loss, all other loss functions bound the 0/1-loss from above. Although not seen in the figure, this fact is also true for the logistic loss if we rescale it by 1/ln 2 to ensure that its value becomes 1 at y = 0. This scaling by a constant value does not affect the solution of the corresponding optimization problem. For this reason, it is customary to list the logistic loss without the scaling by 1/ln 2.

(f) The five surrogate loss functions (64.37a)–(64.37e), and their corresponding empirical risk functions defined below, are convex functions in w. This is a useful property because it helps ensure that optimization problems that seek to minimize the surrogate risks P(w) will only have global minima.
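As a quick check (ours, not from the text), the following Python sketch evaluates the losses (64.37a)–(64.37e) on a grid of margin values and verifies numerically that, apart from the perceptron loss, each of them (with the logistic loss rescaled by 1/ln 2) upper-bounds the ideal 0/1-loss.

```python
# Minimal sketch: compare the surrogate losses (64.37a)-(64.37e) with the 0/1-loss.
import numpy as np

y = np.linspace(-2.0, 2.0, 2001)           # grid of margin values
zero_one = (y <= 0).astype(float)          # ideal 0/1-loss I[y <= 0]

losses = {
    "quadratic":    (1.0 - y) ** 2,
    "logistic/ln2": np.log1p(np.exp(-y)) / np.log(2.0),   # rescaled so that Q(0) = 1
    "perceptron":   np.maximum(0.0, -y),
    "hinge":        np.maximum(0.0, 1.0 - y),
    "exponential":  np.exp(-y),
}
for name, q in losses.items():
    print(f"{name:>12s}: upper-bounds the 0/1-loss on the grid? {bool(np.all(q >= zero_one))}")
```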


Using the aforementioned losses, we can replace the empirical 0/1-risk in (64.29) by any of the following expressions and continue to denote the minimizer by w⋆:

P(w) = (1/N) Σ_{n=0}^{N−1} (γ(n) − h_n^T w)²           (quadratic risk)      (64.38a)
P(w) = (1/N) Σ_{n=0}^{N−1} ln(1 + e^{−γ(n) h_n^T w})    (logistic risk)       (64.38b)
P(w) = (1/N) Σ_{n=0}^{N−1} max{0, −γ(n) h_n^T w}        (perceptron risk)     (64.38c)
P(w) = (1/N) Σ_{n=0}^{N−1} max{0, 1 − γ(n) h_n^T w}     (hinge risk)          (64.38d)
P(w) = (1/N) Σ_{n=0}^{N−1} e^{−γ(n) h_n^T w}            (exponential risk)    (64.38e)
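As one concrete instance (a sketch of ours, with arbitrary synthetic data and step size), the logistic empirical risk (64.38b) can be minimized by plain gradient descent; the resulting classifier is then assessed through its training and test 0/1 error rates, in the spirit of the generalization discussion above.

```python
# Minimal sketch: minimize the logistic empirical risk (64.38b) by gradient descent
# and compare the empirical (training) and test 0/1 error rates.
import numpy as np

rng = np.random.default_rng(2)
M, N_train, N_test = 5, 300, 5000

w_true = rng.normal(size=M)
def sample(n):
    H = np.column_stack([np.ones(n), rng.normal(size=(n, M - 1))])   # leading 1 absorbs the offset
    gamma = np.where(H @ w_true + 0.5 * rng.normal(size=n) > 0, 1.0, -1.0)
    return H, gamma

H_tr, g_tr = sample(N_train)
H_te, g_te = sample(N_test)

w = np.zeros(M)
mu = 0.5                                    # step size (an assumption, not from the text)
for _ in range(2000):
    margins = g_tr * (H_tr @ w)
    # gradient of (1/N) sum ln(1 + exp(-gamma * h'w)) with respect to w
    grad = -(H_tr * (g_tr / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
    w -= mu * grad

err_tr = np.mean(np.sign(H_tr @ w) != g_tr)
err_te = np.mean(np.sign(H_te @ w) != g_te)
print(f"training 0/1 error = {err_tr:.3f}   test 0/1 error = {err_te:.3f}")
```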

Example 64.3 (Probability of misclassification) A classifier that minimizes a surrogate empirical risk with small misclassification errors over the training data will still generalize well and deliver small misclassification errors over test data. To see this, let w⋆ denote the solution to one of the problems listed above, excluding the perceptron risk. Its actual error rate is denoted by

R(w⋆) ≜ P(h^T w⋆ ≠ γ) = E I[h^T w⋆ ≠ γ]        (64.39)

whereas its empirical risk value is P(w⋆) and its empirical error rate (misclassifications over the training data) is

R_emp(w⋆) = (1/N) Σ_{n=0}^{N−1} I[γ(n) γ̂(n) ≤ 0]        (64.40)

Now, observe from Fig. 64.11 that it is generally the case that the new loss functions bound the ideal 0/1-loss function from above (with the exception of the perceptron loss function, which we are excluding from this discussion), i.e., it holds that

I[y ≤ 0] ≤ Q(y)        (64.41)

In this case, we get

R_emp(w⋆) ≤ P(w⋆)        (64.42)

and it follows that

R(w⋆) ≤ R_emp(w⋆) + δ        (by result (64.15))
       ≤ P(w⋆) + δ           (by (64.42))        (64.43)

so that a small empirical risk value, P(w⋆), translates into a small probability of misclassification, R(w⋆), over the entire data distribution. A similar conclusion holds for more general classifier spaces, C (other than affine classifiers – see the discussion leading to (64.64) in the comments at the end of the chapter).


64.7 COMMENTARIES AND DISCUSSION

Curse of dimensionality. The designation "curse of dimensionality" is attributed to the American mathematician Richard Bellman (1920–1984), who coined the term in his development of dynamic programming in Bellman (1957a); dynamic programming refers to a widely used class of mathematical optimization problems – discussed in Chapter 44. We explained in Section 64.1 how the curse of dimensionality degrades the performance of learning strategies. This is because in higher dimensions the available training data can only provide a sparse representation of the space. Moreover, as shown in Prob. 64.9, most of the training samples will concentrate close to the boundaries of the space. And it is common to encounter high-dimensional data in practice. For example, when DNA microarrays are used to measure the expression levels of a large number of genes, the dimension for this problem is on the order of M ~ 10^4. A useful theoretical study by Hughes (1968) illuminated how the curse of dimensionality degrades the performance of the Bayes classifier when a finite number, N, of training data points is used to estimate conditional probabilities by using relative frequencies. It was shown in that work that, for a fixed N, the classification accuracy increases initially but then degrades as the dimensionality of the feature space, M, increases beyond some threshold value – see Prob. 64.30. From (64.1), we note that in order to design classifiers that perform well in higher-dimensional spaces, the number of training data points, N, will need to increase exponentially fast with the dimension, M. In acknowledgment of Hughes' work, the curse of dimensionality problem is sometimes referred to as the Hughes effect.

Bias–variance trade-off. The bias–variance relation (64.25) reflects an important trade-off in the design of effective learning algorithms from training data. The relation expresses the difference between the optimal risk R(c•) and the average performance E R(c⋆) as the sum of two components. Ideally, a designer would like to keep both the bias and variance terms small. One degree of freedom that the designer has is the choice of the model set, C. As explained in the text, a simple model set generally underfits the data and leads to large bias but small variance. In contrast, a more complex model set generally overfits the data and leads to small bias but large variance. A compromise needs to be struck by selecting classifier sets of moderate complexity – as illustrated in Fig. 64.9. This is one reason why it is often observed in practice that moderately complex classifiers perform better than more sophisticated classifiers. Some useful references that deal with the bias–variance trade-off in the learning context and other related issues include the works by Geman, Bienenstock, and Doursat (1992), Kong and Dietterich (1995), Breiman (1994, 1996a,b), Tibshirani (1996a), James and Hastie (1997), Kohavi and Wolpert (1996), Friedman (1997), Domingos (2000), James (2003), and Geurts (2005), as well as the text by Hastie, Tibshirani, and Friedman (2009).

Generalization theory. The Vapnik–Chervonenkis bound (64.17) is a reassuring statistical result; it asserts that, given a sufficient amount of training data, learning is feasible for moderately complex classifier models. This means that classifiers that perform well on the training data are able to generalize and deliver reliable classifications on test data.
This result is one of the cornerstones of statistical learning theory and it resulted from the landmark work by Vapnik and Chervonenkis (1968, 1971); its strength lies in the fact that the bound is distribution-free. It is common to list the VC bound (64.17) in an alternative form where δ is fixed at some small constant value and the right-hand side bound is made to depend on N, δ, and the VC dimension, namely, as:

P( sup_{c∈C} |R_emp(c) − R(c)| > δ ) ≤ 8 (Ne/VC)^VC e^{−Nδ²/32}        (64.44)

In comparison, the earlier form (64.17) fixes the right-hand side probability at some constant level ε and then specifies the attainable δ by means of relation (64.13) in


terms of ε, N, and the VC dimension. This earlier form motivates the PAC designation introduced by Valiant (1984). More expansive treatments of the VC bound(s) appear in the monographs by Vapnik (1995, 1998) and the textbooks by Fukunaga (1990), Kearns and Vazirani (1994), Devroye, Gyorfi, and Lugosi (1996), Vidyasagar (1997), Cherkassky and Mulier (2007), and Hastie, Tibshirani, and Friedman (2009). Accessible overviews on learning theory appear in Kulkarni, Lugosi, and Venkatesh (1998) and Vapnik (1999). An extension that applies to other bounded loss functions, besides the 0/1-loss function, appears in Vapnik (1998) – see also Prob. 64.28. We provide a derivation of the VC inequality (64.44) in Appendices 64.B and 64.C; the argument is nontrivial and relies on several steps. We adapt in these appendices the presentation given by Devroye, Gyorfi, and Lugosi (1996, ch. 12). In their presentation, the coefficient appearing in the exponential factor in (64.44) is Nδ²/32, while the coefficient appearing in the original bound given by Vapnik and Chervonenkis (1971) is Nδ²/8 and corresponds to a tighter bound – see also the works by Blumer et al. (1989) and Cherkassky and Mulier (2007). This difference is not significant for the conclusions and arguments presented in our treatment; it is sufficient for our purposes to know that a bound exists and that this bound decays to zero as N → ∞ at a uniform rate that is independent of the data distribution. The derivation used in Appendix 64.C relies on two famous inequalities. The first result is the Hoeffding inequality, which we encountered earlier in Appendix 3.B; it provides a bound on the probability of the sum of a collection of random variables deviating from their mean. This inequality is due to the Finnish-born statistician Wassily Hoeffding (1914–1991) and appeared in the work by Hoeffding (1963). Earlier related investigations appear in Chernoff (1952) and Okamoto (1958). The second inequality is known as the Sauer lemma (also the Sauer–Shelah lemma) in combinatorial analysis and is derived in Appendix 64.B. The result was derived independently by Sauer (1972) and Shelah (1972); a similar result also appeared in the work by Vapnik and Chervonenkis (1971).

Universally consistent classifiers. The significance of the distribution-free property of the VC bound can be highlighted by commenting on the notion of universal consistency. Recalling the definitions introduced in Section 64.2, we let c▲(h) denote the classifier that minimizes the empirical risk (64.7), while c•(h) refers to the Bayes classifier and minimizes the actual risk (64.10). Neither solution imposes any restriction on the classifier set, which is indicated by the filled triangle and circle superscripts. The classifier c▲(h) is determined from the training data and its structure depends on the sample size, N. This decision rule is said to be consistent if it satisfies the property:

lim_{N→∞} R(c▲) = R(c•),    almost surely        (64.45)

In other words, the risk value that is attained by the empirical classifier should approach the optimal risk value for increasingly large datasets. If the consistency property holds for all data distributions f_{γ,h}(γ, h), then the empirical decision rule, c▲(h), is said to be universally consistent. Such decision rules would be desirable because the implication is that, regardless of the data distribution, sufficient training samples can make learning feasible. A remarkable result by Stone (1977) established that universally consistent classifiers exist. One notable example from this work is the asymptotic k-NN classifier when the value of k is selected to depend on N and satisfy the two conditions k(N) → ∞ and k/N → 0 as N → ∞. However, and unfortunately, although R(c▲) can approach R(c•) asymptotically for any data distribution, it turns out that the convergence rate can be extremely slow; moreover, the performance for finite sample size can also be disappointing. For example, a result by Devroye (1982), strengthening an earlier conclusion by Cover (1968), shows that for any classification rule c▲(h) and any ε > 0 and finite integer N, there exists a data distribution f_{γ,h}(γ, h) with R(c•) = 0 and such that – see Prob. 64.31 and also Devroye, Gyorfi, and Lugosi (1996, p. 112):


R(c▲) ≥ 0.5 − ε,    for any finite N        (64.46)

This conclusion shows that the finite-sample performance can be very bad for some distributions (in this case, the optimal Bayes risk is equal to zero and, yet, the risk of the empirical classifier is close to 1/2). It is also shown in Cover (1968) and Devroye (1982) that the convergence rate of R(c▲) toward R(c•) can be arbitrarily slow. Specifically, if a(n) > 0 denotes any monotonically decreasing sequence of positive numbers converging to zero, then for any classification rule c▲(h) and any ε > 0 and finite integer N, there exists a data distribution f_{γ,h}(γ, h) with R(c•) = 0 and such that:

R(c▲) ≥ a(N),    for any finite N        (64.47)

As indicated by Devroye, Gyorfi, and Lugosi (1996, p. 114), statements (64.46)–(64.47) combined imply that "good universally consistent classifiers do not exist." In light of this conclusion, which also relates to the concept of "no free lunch theorems" discussed further ahead, we can now re-examine the VC bound (64.44). Similar to (64.20), this result implies that, for a fixed constant δ,

P( R(c▲) − R(c•) ≥ 2δ ) ≤ 8 (Ne/VC)^VC e^{−Nδ²/32}        (64.48)

and the bound holds for all finite N and for all data distributions. Recall that this result is obtained by restricting the search for c▲(h) and c•(h) to a set c ∈ C with a finite VC dimension (in which case c▲(h) becomes c⋆(h) and c•(h) becomes c∘(h)). It is clear from (64.48) that R(c▲) can be made sufficiently close to R(c•) by selecting N large enough; moreover, with high probability, the convergence rate of R(c▲) toward R(c•) is O(ln(N)/N).

It is worth noting that the Vapnik–Chervonenkis bound (64.44) is a generalization of a famous result derived by the Russian mathematician Valery Glivenko (1896–1940) and the Italian mathematician Francesco Cantelli (1875–1966) in two separate publications by Glivenko (1933) and Cantelli (1933). The result is known as the Glivenko–Cantelli theorem and it describes the asymptotic behavior of the ensemble cumulative distribution function. Proofs appear in the works by Dudley (1978, 1999), Pollard (1984), Devroye, Gyorfi, and Lugosi (1996), and van der Vaart and Wellner (1996).

Glivenko–Cantelli theorem (Glivenko (1933), Cantelli (1933)): Consider a collection of N iid realizations, {x_n}, of a random variable x with a cumulative distribution function, F(x). Introduce the ensemble average construction for F(x):

F_N(x) ≜ (1/N) Σ_{n=0}^{N−1} I[x_n ≤ x]        (64.49)

where the indicator function on the right-hand side counts the number of sample values observed within the interval (−∞, x]. It then holds that

P( sup_{x∈IR} |F_N(x) − F(x)| > δ ) ≤ 8(N + 1) e^{−Nδ²/32}        (64.50)
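A small simulation (ours, not from the text) illustrates the Glivenko–Cantelli behavior by measuring the maximum deviation between the construction (64.49) and the true distribution function for growing N; the Gaussian choice is an arbitrary assumption for the example.

```python
# Minimal sketch: sup-norm deviation between the empirical CDF (64.49) and the true CDF.
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(3)
true_cdf = lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0)))   # standard Gaussian CDF

for N in [100, 1000, 10000, 100000]:
    x = np.sort(rng.normal(size=N))
    F_N = np.arange(1, N + 1) / N                        # empirical CDF at the sorted samples
    F = np.vectorize(true_cdf)(x)
    # the supremum over x is attained just before or at each sample point (jump)
    sup_dev = max(np.max(np.abs(F_N - F)), np.max(np.abs(F_N - 1.0 / N - F)))
    print(f"N = {N:>6d}   sup |F_N - F| ~ {sup_dev:.4f}")
```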

No free lunch theorem. The results by Cover (1968) and Devroye (1982) revealing that the finite-sample performance of a classifier can be very bad for some distributions can also be explained from the perspective of the “no free lunch theorem,” which we motivate as follows. We have devised several learning algorithms in our treatment so far, such as logistic regression, SVMs, kernel methods, and decision trees. We will introduce additional


learning algorithms in future chapters based on neural network architectures. But is there one "best" algorithm? It is observed in practice that some algorithms perform better on some data distributions and worse on other distributions. However, this does not mean that some algorithms are better than other algorithms. This conclusion is captured by a famous result known as the "no free lunch theorem" by Wolpert (1992, 1996); see also Schaffer (1994) and Wolpert and Macready (1997). In broad terms, the theorem asserts that, averaged over all possible data distributions, the performance of every classification algorithm will be the same as other algorithms on test data! This means that no classifier can be proven to be universally better than all other classifiers and, as such, there is no "best" learning method.

Specifically, consider two learning algorithms, say, two binary classifiers A and B. The first classifier could be based on logistic regression while the second classifier could be an SVM or a neural network. Both classifiers are trained to decide whether feature vectors belong to one class (γ = +1) or the other (γ = −1). The training data arises from some distribution f_{γ,h}(γ, h). Assume we assess the performance of the algorithms on test data generated from this same distribution and write down the classification error that each algorithm generates during this assessment phase. We may find that one of the classifiers performs better than the other, say, classifier A outperforms B in this assessment exercise (i.e., it yields a smaller classification error). Now, assume we repeat the experiment but change the data distribution this time. We train the classifiers and test their performance on a new distribution and write down the resulting classification errors for each. It may be the case for this new distribution that the same better-performing classifier A from the first assessment continues to outperform B in this second test. It may also be the case that classifier B outperforms A. We could continue in this manner and compare the performance of both classifiers over all possible choices of data distributions. The "no free lunch theorem" states that, averaged over all choices of data distributions, the performance of the two classifiers will match! This means that better performance by one algorithm in some data situations will be offset by worse performance in other situations. This also means that no single learning algorithm can be expected to work best for all data distributions (i.e., for all types of problems). We encounter one manifestation of this property in Prob. 64.31 where we show that for any finite sample-size optimal classifier, there always exists a data distribution for which the empirical risk of the classifier is bad. The following is an alternative justification for this fact, and can be viewed as one form of the "no free lunch theorem."

Consider a finite number of feature vectors, H = {h ∈ IR^M}. Each feature vector has label γ = +1 or γ = −1. Let Γ be the collection of all possible mappings γ(h) : H → {+1, −1}. That is, every γ(h) ∈ Γ assigns ±1 labels to features in H. There are a total of 2^{|H|} possible mappings, γ(h), in terms of the cardinality of the set H. We will verify next that there exists some probability distribution over the feature vectors h ∈ H (which determines how they are selected or sampled from H) and a choice of mapping γ(h) ∈ Γ for which a trained classifier c⋆(h) will perform poorly.
For more details, the reader may refer to the useful discussion in Shalev-Shwartz and Ben-David (2014, ch. 5).

Variation of the no free lunch theorem (Wolpert (1992, 1996)): Consider an arbitrary learning algorithm that is trained on at most N ≤ |H|/2 data points {γ(n), h_n} from H. We denote the output generated by the algorithm by c⋆(h) : H → {+1, −1}. Then, there will exist a label mapping γ(h) : H → {+1, −1} and a distribution f_h(h) over H such that

P( c⋆(h) ≠ γ(h) ) ≥ 1/8    holds with probability of at least 1/7        (64.51)

In other words, there exists a mapping γ(h) and a data distribution leading to bad performance.


Proof: We adapt the argument from Shalev-Shwartz and Ben-David (2014, sec. 5.1). We select 2N ≤ |H| iid feature vectors at random according to some distribution h ∼ f_h(h) from the set H. We place N of these samples at random into a set S and use them to train a classification algorithm, say, by minimizing some empirical risk function. We keep the remaining samples for testing. The algorithm will generate some mapping c⋆(h) : H → {+1, −1}. For each feature vector h ∈ H, the trained classifier will assign the label c⋆(h). We have several elements of randomness involved in this setting: the distribution h ∼ f_h(h), the samples that end up in S, and also the choice of the mapping γ(h) from Γ that sets the labels of the feature vectors. We wish to examine the size of the probability of error, denoted by P_e(γ) = P(c⋆(h) ≠ γ(h)); this error depends on the mapping γ(h). Obviously, the error will also depend on the distribution f_h(h) used to select the 2N feature vectors and on the randomness in defining the test set. For this reason, we will be interested in examining the average probability of error over these sources of randomness, namely, the quantity P_{e,av}(γ) = E_{S,h} P_e(γ). The worst value for the average error over the mappings γ(h) satisfies

max_{γ(h)∈Γ} { P_{e,av}(γ) } ≥(a) E_γ P_{e,av}(γ) =(b) E_S E_{γ,h} P_e(γ)        (64.52)

where step (a) is because the worst performance on the left is larger than the average performance on the right, and step (b) changes the order of the expectations. Now note that:

E_{γ,h} P_e(γ) = E_{γ,h} P(c⋆(h) ≠ γ(h))
             = E_γ { P(h ∈ S) P(c⋆(h) ≠ γ(h) | h ∈ S) + P(h ∉ S) P(c⋆(h) ≠ γ(h) | h ∉ S) }
        ≥(c)  (1/2) E_γ P(c⋆(h) ≠ γ(h) | h ∉ S)        (64.53)

where step (c) ignores the first term from the second line and uses the fact that only half of the selected features are used for training so that P(h ∉ S) = 1/2. We still need to evaluate the last expectation, which averages over the choice of the mapping γ(h). Recall that c⋆(h) is determined without knowledge of any of the features from outside S. Moreover, since we are free to choose γ(h), there are mappings that could result in γ(h) = +1 and others that could result in γ(h) = −1. Therefore, c⋆(h) will be wrong half of the time:

P( c⋆(h) ≠ γ(h) | h ∉ S ) = 1/2        (64.54)

Substituting into (64.52) we conclude that max_γ P_{e,av}(γ) ≥ 1/4. This means that there exists a mapping γ(h) and a distribution f_h(h) such that E_{S,h} P_e(γ) ≥ 1/4. Now, recall that P_e(γ) is a probability measure and it assumes values in the interval [0, 1]. Therefore, using the result of part (a) from Prob. 3.19 we conclude that

P( P_e(γ) ≥ 1/8 ) ≥ (1/4 − 1/8) / (1 − 1/8) = 1/7        (64.55)

as desired. ■

We conclude that a classifier that performs well on certain data distributions need not deliver similar performance on other distributions. This is more or less in line with intuition. An architecture that distinguishes well between images of cats and dogs need not perform well in distinguishing between poetry and prose. For this reason, when one learning algorithm is said to outperform another, this statement should be qualified to mean that one algorithm outperforms the other for the particular data distribution under consideration.
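The point is easy to visualize with a small simulation. The following sketch is purely illustrative (the two synthetic distributions, the nearest-class-mean rule, and the 1-NN rule are choices made here for illustration, not constructions from the chapter): the linear-type rule wins on overlapping Gaussian blobs, while the 1-NN rule wins on an XOR-type distribution, so neither dominates across distributions.

```python
# Illustrative sketch: the ranking of two classifiers can flip when the
# data distribution changes, in the spirit of the no-free-lunch discussion.
import numpy as np

rng = np.random.default_rng(0)

def nearest_mean_fit(H, y):
    # linear-type rule: classify by the closer class mean
    return H[y == +1].mean(axis=0), H[y == -1].mean(axis=0)

def nearest_mean_predict(model, H):
    mp, mm = model
    dp = ((H - mp) ** 2).sum(axis=1)
    dm = ((H - mm) ** 2).sum(axis=1)
    return np.where(dp <= dm, +1, -1)

def one_nn_predict(Htr, ytr, H):
    # 1-nearest-neighbor rule
    d = ((H[:, None, :] - Htr[None, :, :]) ** 2).sum(axis=2)
    return ytr[d.argmin(axis=1)]

def blobs(n):   # distribution A: two overlapping Gaussian blobs
    y = rng.choice([-1, 1], size=n)
    H = rng.normal(size=(n, 2)) + 1.0 * y[:, None]
    return H, y

def xor(n):     # distribution B: XOR-type labels
    H = rng.uniform(-1, 1, size=(n, 2))
    y = np.sign(H[:, 0] * H[:, 1]).astype(int)
    y[y == 0] = 1
    return H, y

for name, gen in [("blobs", blobs), ("xor", xor)]:
    Htr, ytr = gen(200)
    Hte, yte = gen(2000)
    model = nearest_mean_fit(Htr, ytr)
    err_lin = np.mean(nearest_mean_predict(model, Hte) != yte)
    err_nn = np.mean(one_nn_predict(Htr, ytr, Hte) != yte)
    print(f"{name}: nearest-mean error = {err_lin:.3f}, 1-NN error = {err_nn:.3f}")
```

On the blob data the nearest-mean rule attains the smaller error, while on the XOR data the ordering reverses, which is consistent with the discussion above.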

64.7 Commentaries and Discussion


It is important to note that some criticism has been leveled at the "no free lunch theorem" and its implication for practical learning algorithms. This is because the statement of the theorem averages performance over all possible data distributions; these include distributions over which the classifier was not trained and, moreover, many of these distributions need not be reflective of how real-world data behave. For example, the work by Fernandez-Delgado et al. (2014) has shown that some learning algorithms consistently outperform other algorithms in real-data scenarios. Moreover, even if an algorithm A performs badly on some distributions, it may be the case that these distributions are not relevant for the problem at hand. For all practical purposes, a designer should seek learning algorithms that perform best on the problems (or distributions) of interest.

Surrogate loss functions. The VC bound (64.44) is established in Appendix 64.C under the assumption that the risk values are computed relative to the ideal 0/1-loss function. That is, the classifiers {co(h), c⋆(h)} correspond to the minimizers of the actual and empirical risks defined by (64.11) and (64.12):

  co(h) ≜ argmin_{c∈C} R(c),   c⋆(h) ≜ argmin_{c∈C} Remp(c)    (64.56)

where

  R(c) ≜ E I[c(h) ≠ γ],   Remp(c) ≜ (1/N) Σ_{n=0}^{N−1} I[c(hn) ≠ γ(n)]    (64.57)

These expressions rely on the use of the ideal 0/1-loss defined by:

  Q(c; γ, h) ≜ I[c(h) ≠ γ]    (64.58)

In this way, the value of R(c) is a measure of the probability of misclassification over the entire data distribution, while the value of Remp(c) is a measure of the fraction of erroneous classifications over the N training data points, {γ(n), hn}. Given the difficulty in solving the optimal design problems (64.56) due to the nonsmooth nature of the indicator function, we motivated in Section 64.6 several surrogate convex losses (such as quadratic, logistic, hinge, exponential, and perceptron functions). A natural question is to inquire about the generalization ability of classifiers designed under these alternative choices. Thus, let, more generally, Q(c; γ, h) denote an arbitrary nonnegative convex loss function. For the purposes of the discussion in this section, we denote the surrogate risk by the notation:

  P(c) ≜ E Q(c; γ, h)    (64.59)

and the corresponding empirical risk by

  Pemp(c) = (1/N) Σ_{n=0}^{N−1} Q(c; γ(n), hn)    (64.60)

The quantities {P(c), Pemp(c)} play the role of {R(c), Remp(c)} when the 0/1-loss is used. Now, however, we are using more general loss functions, Q(c; ·). We then replace problems (64.56) by

  co(h) ≜ argmin_{c∈C} P(c),   c⋆(h) ≜ argmin_{c∈C} Pemp(c)    (64.61)

where we continue to use the notation (co, c⋆) in order to avoid an explosion of symbols. It turns out that an inequality of the VC type continues to hold in this more general case if we assume that, for any c ∈ C, the loss function Q(c; γ, h) is bounded, say, its values lie within some interval [a, b] for nonnegative scalars a < b. If we examine the derivation of inequality (64.111) in Appendix 64.C, we will be able to recognize that, under this boundedness condition, a similar bound continues to hold for more general loss functions with the exponent −Nδ²/32 now replaced by −Nδ²/32b²; see Prob. 64.28:

  P( sup_{c∈C} |Pemp(c) − P(c)| > δ ) ≤ K e^{−Nδ²/32b²}    (64.62)

for some constant K that is independent of the data distribution. The ultimate conclusion is that the bound continues to decay to zero as N → ∞ at a uniform rate that is also independent of the data distribution. Further discussion on this result can be found in Vapnik and Chervonenkis (1968, 1971), Dudley, Gine, and Zinn (1991), Alon et al. (1997), Vapnik (1998), Cucker and Smale (2002), and Rosasco et al. (2004). Next, following steps similar to the ones that led to (64.21), we can then conclude that with high probability, |Pemp(c⋆) − P(co)| ≤ 3δ. If the loss function further satisfies I[c(h) ≠ γ] ≤ Q(c; γ, h) for any c ∈ C, then it will hold that

  Remp(c⋆) ≤ Pemp(c⋆)    (64.63)

Applying these inequalities to the optimal classifier c⋆(h) from (64.61), we conclude that

  R(c⋆) ≤ Remp(c⋆) + δ ≤ Pemp(c⋆) + δ    (64.64)

where the first inequality is due to (64.15) and the second to (64.63),

so that a small Pemp(c⋆) translates into a small probability of misclassification for c⋆(h). In other words, learning from data for general loss functions is still feasible. The main limitation in the argument leading to this conclusion is the requirement that the loss function Q(c; γ, h) be bounded for any c ∈ C.

Rademacher complexity. There is an alternative method to examine the generalization ability of learning algorithms for more general loss functions, by relying on the concept of the Rademacher complexity. We pursue this approach in Appendix 64.D. Recall that the analysis in the body of the chapter has shown that classification structures with medium VC dimensions are able to learn well with high likelihood for any data distribution. In a sense, this conclusion amounts to a generalization guarantee under a worst-case scenario since it holds irrespective of the data distribution. It is reasonable to expect that some data distributions will be more favorable than others and, therefore, it is desirable to seek generalization results that have some dependence on the data distribution. The framework that is based on the Rademacher complexity allows for this possibility and leads to tighter generalization error bounds. The approach also applies to multiclass classification problems and to other loss functions, and is not restricted to binary classification or 0/1-losses. The analysis carried out in Appendix 64.D continues to lead to similar reassuring conclusions about the ability of learning methods to generalize for mild VC dimensions. However, the conclusions are now dependent on the data distribution and will not correspond to worst-case statements that hold for any distribution. The main results in the appendix are the one- and two-sided generalization bounds (64.182a)–(64.182b) and (64.197a)–(64.197b). The derivation of these results relies on two critical tools known as the McDiarmid inequality, which we encountered earlier in (3.259a) and is due to McDiarmid (1989), and the Massart lemma (64.145), due to Massart (2000, 2007). The first works to use the Rademacher complexity to study the generalization ability of learning algorithms are by Koltchinskii (2001), Koltchinskii and Panchenko (2000, 2002), Bartlett, Boucheron, and Lugosi (2001), Bartlett and Mendelson (2002), Mendelson (2002), Antos et al. (2002), and Bartlett, Bousquet, and Mendelson (2005). Overviews and further treatments appear in Boucheron, Bousquet, and Lugosi (2005), Shalev-Shwartz and Ben-David (2014), Mohri, Rostamizadeh, and Talwalkar (2018), and Wainwright (2019). The designation Rademacher complexity is motivated by the connection to the discrete Rademacher distribution, named after the German-American mathematician Hans Rademacher (1892–1969), which refers to


random variables that assume the values ±1 with equal probability. A sum of such variables leads to a random walk with symmetry, where it is equally likely to move in one direction or the other. The Rademacher distribution is related to the standard Bernoulli distribution: the former deals with values {+1, −1} chosen with probability 1/2 each, while the latter deals with values {1, 0} chosen with probabilities {p, 1 − p}.
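To get a feel for the quantity involved, the following illustrative sketch estimates the empirical Rademacher complexity of a small finite class of threshold classifiers on a fixed sample by Monte Carlo averaging; the class, the sample, and the grid of thresholds are assumptions made here for illustration, and the Massart-type value sqrt(2 ln|C|/N) is printed only as a reference scale.

```python
# Illustrative sketch: Monte Carlo estimate of the empirical Rademacher
# complexity  E_sigma[ sup_{c in C} (1/N) sum_n sigma_n c(h_n) ]
# for a finite class of threshold classifiers, with sigma_n = +/-1 equally likely.
import numpy as np

rng = np.random.default_rng(1)
N = 50
h = np.sort(rng.uniform(-1, 1, size=N))           # fixed scalar features

# finite classifier class: c_t(h) = sign(h - t) over a grid of thresholds t
thresholds = np.linspace(-1.2, 1.2, 25)
C = np.sign(h[None, :] - thresholds[:, None])      # |C| x N matrix of predictions
C[C == 0] = 1

num_trials = 5000
vals = np.empty(num_trials)
for k in range(num_trials):
    sigma = rng.choice([-1.0, 1.0], size=N)        # Rademacher draws
    vals[k] = np.max(C @ sigma) / N                # sup over the finite class
print("estimated empirical Rademacher complexity:", vals.mean())
print("Massart-type reference value sqrt(2 ln|C|/N):",
      np.sqrt(2 * np.log(len(thresholds)) / N))
```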

PROBLEMS

64.1 Let t(h) = P(γ = +1 | h = h).
(a) For any classifier c, derive the following expression for the excess risk:

  R(c) − R(c•) = E_h { |2t(h) − 1| I[c•(h) ≠ c(h)] }

where the expectation is over the distribution of the feature data.
(b) Show that the optimal Bayes risk is given by R(c•) = E_h min( t(h), 1 − t(h) ).
(c) Show also that R(c•) = (1/2)( 1 − E_h |2t(h) − 1| ).
64.2 Continuing with Prob. 64.1, let π±1 denote the prior probabilities of classes γ ∈ {±1}. That is, π+1 = P(γ = +1) and likewise for π−1. Assume the feature data, h, has a continuous conditional probability distribution, fh|γ(h|γ).
(a) Verify that

  R(c•) = ∫_{h∈H} min( π+1 fh|γ(h|γ = +1), π−1 fh|γ(h|γ = −1) ) dh

where the integration is over the feature space, h ∈ H.
(b) Assume π+1 = π−1 = 1/2. Conclude that in this case:

  R(c•) = (1/2) { 1 − (1/2) ∫_{h∈H} | fh|γ(h|γ = +1) − fh|γ(h|γ = −1) | dh }

In other words, the Bayes risk is related to the L1-distance between the two conditional distributions of the feature data.
64.3 Refer to expression (64.7) for the empirical risk. Assume {γ(n), hn} are iid realizations of {γ, h}.
(a) Argue that each term of the form I[c(h) ≠ γ] is a binomial random variable with probability parameter p = R(c).
(b) Conclude that the mean and variance of Remp(c) are given by p and p(1 − p)/N, respectively.
(c) Use the Chebyshev bound (3.28) to conclude that, for any scalar δ > 0,

  P( |Remp(c) − R(c)| ≥ δ ) ≤ p(1 − p)/(Nδ²)

64.4 Let {xn, n = 1, . . . , N} denote N independent random variables, with each variable satisfying an ≤ xn ≤ bn. Let S_N = Σ_{n=1}^{N} xn denote the sum of these random variables. Let Δ = Σ_{n=1}^{N} (bn − an)² denote the sum of the squared lengths of the respective intervals. A famous inequality known as the Hoeffding inequality was derived in Appendix 3.B; it asserts that for any δ > 0:

  P( |S_N − E S_N| ≥ δ ) ≤ 2 e^{−2δ²/Δ}

Now, refer to expression (64.7) for the empirical risk. Use the Hoeffding inequality to establish that, for any particular classifier c and δ > 0, it holds:

  P( |Remp(c) − R(c)| ≥ δ ) ≤ 2 e^{−2Nδ²}

In comparison with the bound obtained in part (c) of Prob. 64.3, observe that the bound on the right-hand side of the above expression decays exponentially with the size of the training data. Let ε = 2 e^{−2Nδ²}. Conclude that the above bound asserts that P( |Remp(c) − R(c)| ≥ δ ) ≤ ε for any small ε > 0 and where δ and ε are related via:

  δ = sqrt( (1/2N) ln(2/ε) )

Remark. This result shows that the true and empirical risk values get closer to each other as the number of training samples, N, increases. However, this conclusion assumes a fixed classifier, c. See the extensions studied in Probs. 64.24 and 64.25.
64.5 We reconsider the discussion on surrogate risk functions from Section 64.6. Consider an arbitrary predictor function γ̂(h) : IR^M → IR, which maps feature vectors h into real-valued predictions γ̂ for their labels. For each h, let y = γ γ̂ denote the corresponding margin variable with surrogate loss denoted by Q(y) : IR → IR, for some loss function Q(·) to be selected. In the body of the chapter we listed several choices for Q(·) in (64.37a)–(64.37f). We associate with each Q(·) the stochastic risk function P(γ̂) = E Q(y), where the expectation is over the distribution of the data {γ, h}.
(a) Let t(h) = P(γ = +1 | h = h). By conditioning on h = h, verify that

  E( Q(y) | h = h ) = t(h) Q(γ̂(h)) + (1 − t(h)) Q(−γ̂(h))

The right-hand side is a function of γ̂ and we denote it more compactly by P(γ̂ | h) = t Q(γ̂) + (1 − t) Q(−γ̂).
(b) We know that the optimal Bayes classifier assigns γ̂_Bayes(h) = +1 when t(h) > 1/2 and γ̂_Bayes(h) = −1 when t(h) < 1/2. We wish to select convex loss functions Q(y) such that P(γ̂ | h) ends up having a negative minimizer γ̂ when t < 1/2 and a positive minimizer γ̂ when t > 1/2. When this happens, the sign of the minimizer γ̂ will match the optimal Bayes decision. Show that this occurs if, and only if, the convex loss Q(y) is differentiable at y = 0 with a negative derivative value at that location (i.e., Q'(0) < 0). Remark. The reader may refer to Bartlett, Jordan, and McAuliffe (2006) for a related discussion. In the language of this reference, a convex loss function Q(·) that satisfies these two conditions is said to be classification-calibrated.
64.6 Consider a hypercube in M dimensions with edge length equal to 1. Let ho represent a particular feature vector located somewhere inside this hypercube. Assume there are a total of N feature vectors distributed uniformly inside the hypercube. We center a smaller hypercube around ho with edge size ℓ.
(a) Assume M = 3. Determine the value of ℓ such that the volume of the smaller hypercube around ho captures 10% of the N training samples.
(b) Assume now M = 20. Determine the value of ℓ such that the volume of the smaller hypercube around ho captures the same fraction, 10%, of the N training samples. Compare the result with part (a).
64.7 Consider a hypercube in M dimensions with edge size equal to 1. Consider a smaller cube with edge size ℓ. What should the length ℓ be for the volume of the smaller cube to correspond to 1% of the volume of the larger cube? Determine ℓ for both cases of M = 10 and M = 100. What do you observe?
64.8 Refer to the volume expression (64.3).
(a) Assume M = 2K is even. Show that the expression reduces to (1/4)^K π^K / K!.
(b) Show that it tends to zero as M → ∞.
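The effects explored in Probs. 64.6–64.8 are easy to tabulate numerically. The following illustrative sketch evaluates the required edge length ℓ = f^{1/M} for a captured fraction f, and (assuming expression (64.3) refers to the volume of the ball of radius 1/2 inscribed in the unit hypercube, consistent with the reduction stated in part (a) of Prob. 64.8) the volume of that ball as the dimension grows.

```python
# Illustrative sketch: curse-of-dimensionality quantities from Probs. 64.6-64.8.
import math

for frac in (0.10, 0.01):
    for M in (3, 10, 20, 100):
        ell = frac ** (1.0 / M)          # solves ell**M = frac for a unit hypercube
        print(f"fraction {frac}, M = {M:3d}: edge length = {ell:.3f}")

# volume of the ball of radius 1/2 inscribed in the unit hypercube
for M in (2, 10, 20, 50):
    vol = (0.5 ** M) * math.pi ** (M / 2) / math.gamma(M / 2 + 1)
    print(f"M = {M:3d}: volume of inscribed ball = {vol:.2e}")
```

The edge length approaches 1 even for tiny fractions, and the inscribed-ball volume collapses to zero, which is the behavior the problems ask the reader to quantify.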


64.9 Assume N feature vectors are distributed uniformly inside a hypersphere in M dimensions centered at the origin and of radius equal to 1. Let d denote the distance from the origin to the closest training point; this variable is random in nature. Show that the median value of d is given by

  median(d) = ( 1 − 1/2^{1/N} )^{1/M}

Assume M = 20 and N = 1000. What is the median of d? What do you conclude?
64.10 Refer to definition (64.11) for co(h), where R(c) = P(c(h) ≠ γ). Show that any solution co that results in R(co) = 0 also leads to Remp(co) = 0, where the empirical risk is defined by (64.7). Conclude that if a solution co exists such that R(co) = 0, then the Bayes classifier generates zero classification errors.
64.11 Refer to the alternate loss functions (64.37a)–(64.37e), and their corresponding risks. Show that all these functions are convex in w. Is the ideal 0/1-loss function convex in w?
64.12 Consider feature vectors h ∈ IR^2. Give an example of three feature vectors that cannot be shattered by the class of linear classifiers. Does this fact contradict the conclusion that the VC dimension is three?
64.13 True or false: the VC dimension of a class of classifiers is the value d for which any number N > d of training samples cannot be shattered by this class of classifiers.
64.14 Show that the VC dimension of the class of linear classifiers c(h) = sign(hᵀw) over IR^M is equal to M.
64.15 Consider a collection of M + 2 vectors in IR^M denoted by X = {x1, x2, . . . , xM+2}. The Radon theorem states that every such set can be split into two disjoint subsets, denoted by X1 and X2, such that the convex hulls of X1 and X2 intersect with each other.
(a) Establish the validity of the Radon theorem.
(b) Use the Radon theorem to conclude that the VC dimension of the class of affine classifiers c(h) = sign(hᵀw − θ) over IR^M is equal to M + 1.
Remark. The theorem is due to Radon (1921). See Mohri, Rostamizadeh, and Talwalkar (2018) for a related discussion.
64.16 Consider the class of classifiers that consists of circles centered at the origin in IR^2, where feature vectors inside the circle belong to class −1 and feature vectors outside the circle belong to class +1. What is the VC dimension of this class of classifiers over IR^2?
64.17 Consider a class of classifiers defined by two scalar parameters a ≤ b; the parameters define an interval [a, b] on the real line. A scalar feature value h is declared to belong to class +1 if h ∈ [a, b] (i.e., h lies inside the interval); otherwise, h is declared to belong to class −1. Show that the VC dimension of this class of classifiers is equal to 2. What is the shatter coefficient for this class of classifiers?
64.18 Consider a class of classifiers defined by four scalar parameters a ≤ b < c ≤ d; the parameters define two disjoint intervals [a, b] and [c, d] on the real line. A scalar feature value h is declared to belong to class +1 if h ∈ [a, b] or h ∈ [c, d] (i.e., h lies inside one of the intervals); otherwise, h is declared to belong to class −1. Show that the VC dimension of this class of classifiers is equal to 4.
64.19 Consider the class of classifiers that consists of two separate concentric circles centered at the origin in IR^2, where feature vectors that lie in the ring between both circles belong to class −1 and feature vectors outside this area belong to class +1.
(a) What is the VC dimension of this class of classifiers over IR^2?
(b) If we replace the circles by concentric spheres centered at the origin in IR^3, what would the VC dimension be?
64.20 Consider feature vectors h ∈ IR^2, which represent points in the plane. The classifier class consists of rectangles with vertical or horizontal edges. Points that fall inside a rectangle are declared to belong to class +1 and points that fall outside the rectangle are declared to belong to class −1. Show that the VC dimension for this class of classifiers is equal to 4.
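For one-dimensional classes such as the intervals of Prob. 64.17, the shattering property can be verified by brute force: it suffices to try interval endpoints taken from the sample points themselves plus values outside their range. The sketch below is illustrative only and confirms that two points can be shattered while three cannot.

```python
# Illustrative sketch: brute-force shattering check for the class of
# interval classifiers c_{a,b}(h) = +1 if a <= h <= b, else -1.
import numpy as np

def realizable_labelings(points):
    """Return the set of labelings of `points` achievable by some interval [a, b]."""
    pts = np.asarray(points, dtype=float)
    # candidate endpoints: the points themselves plus values below/above all of them
    cands = np.concatenate(([pts.min() - 1.0], pts, [pts.max() + 1.0]))
    out = set()
    for a in cands:
        for b in cands:
            if a <= b:
                out.add(tuple(+1 if a <= h <= b else -1 for h in pts))
    return out

def is_shattered(points):
    return len(realizable_labelings(points)) == 2 ** len(points)

print(is_shattered([0.0, 1.0]))        # True: two points can be shattered
print(is_shattered([0.0, 1.0, 2.0]))   # False: the labeling (+1, -1, +1) is unreachable
```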


64.21 Consider feature vectors h ∈ IR^2, which represent points in the plane. The classifier class consists of squares with vertical edges. Points that fall inside a square are declared to belong to class +1 and points that fall outside the square are declared to belong to class −1. Show that the VC dimension for this class of classifiers is equal to 3.
64.22 Refer to the VC bound in (64.17). How many training samples, N, do we need in order to ensure that the error between the actual and empirical risks is no larger than a prescribed value δ with probability of at least 1 − ε? Compute the numerical value for N when δ = 5% = ε and VC = 20.
64.23 Refer again to the VC bound in (64.17). At what rate does the error between the actual and empirical risks decay as a function of the sample size, N?
64.24 The bound derived in Prob. 64.4 is applicable to a single classifier, c. We can derive a uniform bound over all classifiers as follows. Assume first that the number of classifiers in the set C is finite, i.e., |C| < ∞.
(a) Argue that

  P( sup_{c∈C} |Remp(c) − R(c)| ≥ δ ) ≤ Σ_{c∈C} P( |Remp(c) − R(c)| ≥ δ )

(b) Conclude that, for any δ > 0:

  P( sup_{c∈C} |Remp(c) − R(c)| ≥ δ ) ≤ 2|C| e^{−2Nδ²}

(c) Conclude that, for any small ε > 0:

  P( sup_{c∈C} |Remp(c) − R(c)| ≥ δ ) ≤ ε

where δ and ε are related via (compare with (64.13)):

  δ = sqrt( (1/2N)( ln|C| + ln(2/ε) ) )

(d) Conclude further that, for given (δ, ε) values, the amount of training samples that is necessary to ensure the bound from part (c) is

  N ≥ (1/2δ²) ln( 2|C|/ε )

so that more complex models require more data for training.
64.25 We continue with Prob. 64.24.
(a) When the number of classifiers in C is not necessarily finite, but the set has a finite VC dimension, it can be shown that the quantity |C| that appears on the right-hand side in the bound in part (b) should be replaced by 4(Ne/VC)^VC, and the scalar 2Nδ² in the exponent should be replaced by Nδ²/32 – see (64.111) and (64.88) in Appendix 64.C. Use this fact to conclude that for any small ε > 0, it holds that

  P( sup_{c∈C} |Remp(c) − R(c)| ≥ δ ) ≤ ε

where δ and ε are now related via (compare with (64.13)):

  δ = sqrt( (8/N)( VC ln(Ne/VC) + ln(4/ε) ) )

(b) An alternative bound can be obtained as follows for finite VC dimensions. It can also be shown that the quantity |C| that appears on the right-hand side in the bound in part (b) can be replaced by 4(N + 1)^VC, and the scalar 2Nδ² in the exponent can be replaced by Nδ²/32 – see (64.111) and (64.87) in Appendix 64.C. Use this fact to conclude that for any small ε > 0, it also holds that

  P( sup_{c∈C} |Remp(c) − R(c)| ≥ δ ) ≤ ε

where δ and ε are related via:

  δ = sqrt( (32/N)( VC ln(N + 1) + ln(8/ε) ) )

64.26 Refer to the result of Prob. 64.25. Show that during the training phase with N data points, it holds that

  P( |Remp(c) − R(c)| ≥ δ ) ≤ 4 (2Ne/VC)^VC e^{−Nδ²/8}

while during the testing phase, also using a total of N test data points, and after the classifier c⋆(h) has been selected, it holds that

  P( |Remp(c⋆) − R(c⋆)| ≥ δ ) ≤ 2 e^{−2Nδ²}

Explain the difference.
64.27 Follow arguments similar to those employed in the derivation of the VC inequality (64.111) in Appendix 64.C to establish the Glivenko–Cantelli inequality (64.50).
64.28 Let Q(c; γ, h) denote an arbitrary nonnegative convex loss function that is assumed to be bounded, say, Q(c; γ, h) ∈ (a, b) for some nonnegative scalars a, b and for any c ∈ C (i.e., for any choice in the classifier set under consideration). Define the corresponding surrogate risk function P(c) = E Q(c; γ, h). In the text, we used the indicator function I[c(h) ≠ γ] in expression (64.5) instead of Q(c; γ, h). Likewise, define the empirical risk over a set of N training points {γ(n), hn} as

  Pemp(c) ≜ (1/N) Σ_{n=0}^{N−1} Q(c; γ(n), hn)

Follow arguments similar to those employed in the derivation of the VC inequality (64.111) in Appendix 64.C to establish that a similar bound holds for these more general loss and risk functions with the exponent −Nδ²/32 replaced by −Nδ²/32b².
64.29 Refer to the Sauer inequality (64.86). Several useful bounds on the shatter coefficient are given in the text by Devroye, Gyorfi, and Lugosi (1996). In particular, verify that the following bounds hold for the shatter coefficient of a class C of classifiers applied to N feature vectors:

  S(C, N) ≤ N^VC + 1,   for all VC
  S(C, N) ≤ N^VC,   for VC > 2
  S(C, N) ≤ e^{N H(VC/N)},   for N ≥ 1 and VC < N/2

where H(p) denotes the entropy measure for a binomial random variable with parameter p ∈ (0, 1), i.e., H(p) = −p log₂ p − (1 − p) log₂(1 − p).
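The various shatter-coefficient bounds are straightforward to compare numerically. The following illustrative sketch simply evaluates the binomial sum appearing in the Sauer inequality (64.86) against the polynomial bounds (64.87)–(64.88) for a few values of N and VC; the particular grid of values is an arbitrary choice for illustration.

```python
# Illustrative sketch: compare sum_{n=0}^{VC} C(N, n) with the polynomial
# bounds (1 + N)^VC and (N e / VC)^VC from the Sauer lemma.
import math

def sauer_sum(N, VC):
    return sum(math.comb(N, n) for n in range(min(VC, N) + 1))

for VC in (1, 2, 5, 10):
    for N in (10, 100, 1000):
        s = sauer_sum(N, VC)
        b1 = (1 + N) ** VC
        b2 = (N * math.e / VC) ** VC
        print(f"VC={VC:2d}, N={N:4d}: sum={s:.3e}, (1+N)^VC={b1:.3e}, (Ne/VC)^VC={b2:.3e}")
```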


64.30 Consider a binary classification problem with classes γ ∈ {±1} having known prior probabilities denoted by π+1 and π−1 = 1 − π+1. Let h(m) denote the mth entry of the feature vector h ∈ IR^M and assume it is a discrete random variable. Refer to the optimal Bayes classifier.
(a) Argue that the probability of correct decisions is given by

  P(correct decisions) = Σ_{m=1}^{M} max_{γ=±1} { πγ P( h(m) = h(m) | γ = γ ) }

(b)

Assume first that π+1 < π−1. The expression derived in part (a) is dependent on the data, {h(m)}. Averaging over all possible distributions for the data, show that the average accuracy of the Bayes classifier is given by:

  Pav(correct decisions) ≜ π+1 + π−1 (M − 1) Δ (π+1/π−1)^M

where

  Δ = Σ_{m=0}^{M} M! / [ m! (M − m)! (2M − m − 1) ] · [ π+1/(1 − 2π+1) ]^m

(c) Let M → ∞ and conclude that Pav → (1 − π−1 π+1). What does this result mean? When π+1 = π−1 = 1/2, show that

  Pav(correct decisions) = (3M − 2)/(4M − 2) → 0.75 as M → ∞

(d) What is the value of Pav when M = 1? Is this expected?
64.31 In this problem, we establish result (64.46), namely, that for any classifier designed from a finite sample, cN(h), there always exists a data distribution for which the risk is bad. Here, cN(h) denotes the classifier that minimizes (64.7). Assume the data (h, γ) is constructed as follows. The feature variable h is a discrete scalar random variable satisfying:

  P(h = s) = 1/K,   for s = 0, . . . , K − 1

Consider a real number b ∈ [0, 1) and introduce its binary expansion written in the form b = 0.b0 b1 b2 · · · bK−1, where each bj is either 0 or 1. The label γ corresponding to h = s is set to γ = bs. Observe that in this description, and without any loss in generality, we are setting the binary label to the values {0, 1} rather than {−1, +1} used in the body of the text.
(a) Argue that the risk of the optimal Bayes classifier is zero.
(b) Using the training dataset DN = {(h0, γ(0)), . . . , (hN−1, γ(N − 1))}, we estimate the label γ for a feature vector h by employing the empirical classifier cN(·):

  γ̂ = cN(h)

Assume the training dataset {(γ(n), hn)}_{n=0}^{N−1} and the test data (γ, h) are generated by the same process described previously. Let us denote the actual risk of cN(·), parameterized by b, by the notation:

  R(cN; b) ≜ P( cN(h) ≠ γ )

We next model b as a random variable that is uniformly distributed in [0, 1) and has binary expansion b = 0.b0 b1 b2 · · · bK−1. What is the value of P(bj = 0) for any j? Prove that for any empirical classifier cN(h) we have

  sup_{b∈[0,1)} R(cN; b) ≥ E_b R(cN; b)

where the expectation is over the distribution of b.

(c) Assume that b is independent of the test vector h and the training vectors {hn}_{n=0}^{N−1}. Prove that

  E_b R(cN; b) ≥ (1/2) ( 1 − 1/K )^N

What can we conclude about the lower bound on sup_{b∈[0,1)} R(cN; b) as K → ∞? Comment on the result.
64.32 Verify that the supremum function is convex, i.e., for any two sequences {xn, x′n} and α ∈ [0, 1]:

  sup_n { α xn + (1 − α) x′n } ≤ α sup_n xn + (1 − α) sup_n x′n

64.33 Consider a subset A ⊂ IR^N, with finite cardinality, and refer to its Rademacher complexity defined by (64.142). Introduce the convex hull of A, denoted by conv(A), which is the set of all convex combinations of elements in A. Show that the sets A and conv(A) have the same Rademacher complexity. Remark. See Bartlett and Mendelson (2002, sec. 3) and Shalev-Shwartz and Ben-David (2014, ch. 26).
64.34 Consider a subset A ⊂ IR^N and refer to its Rademacher complexity defined by (64.142). Let φ(x) : IR → IR denote a δ-Lipschitz function satisfying |φ(x) − φ(y)| ≤ δ|x − y|, for all x, y ∈ dom(φ) and some δ > 0. We denote the entries of each a ∈ A by a = col{an}, for n = 1, 2, . . . , N. We define the transformation φ(a), with vector argument a, as the vector that results from applying φ(·) to each individual entry of a, i.e., φ(a) = col{φ(an)}. Consider the set Aφ = {φ(a), a ∈ A}. In other words, Aφ is obtained by applying the Lipschitz continuous function φ(·) to the elements of A. Show that the Rademacher complexity is modified as follows:

  R_N(Aφ) ≤ δ R_N(A)

Remark. See Ledoux and Talagrand (1991), Kakade, Sridharan, and Tewari (2008), and Shalev-Shwartz and Ben-David (2014, ch. 26) for related discussion.
64.35 Consider a collection of N feature vectors {h1, . . . , hN} where each hn ∈ IR^M. Introduce two sets A, B ⊂ IR^N consisting of N-dimensional vectors each defined as follows:

  A = { a = col{an} ∈ IR^N | an = hnᵀw, ‖w‖₂ ≤ 1 }
  B = { b = col{bn} ∈ IR^N | bn = hnᵀw, ‖w‖₁ ≤ 1 }

where the only difference is the bound on the parameter w: in the first case, we bound its Euclidean norm and in the second case we bound its ℓ₁-norm. Show that the Rademacher complexities of these two sets satisfy

  R_N(A) ≤ (1/√N) × max_{1≤n≤N} ‖hn‖₂
  R_N(B) ≤ sqrt( 2 ln(2M)/N ) × max_{1≤n≤N} ‖hn‖∞

Remark. See Shalev-Shwartz and Ben-David (2014, ch. 26) and Mohri, Rostamizadeh, and Talwalkar (2018, ch. 10) for a related discussion.
64.36 Derive the two-sided generalization bounds (64.197a)–(64.197b) by extending the argument used to derive their one-sided counterparts in Appendix 64.D.
64.37 Refer to the binary classification context described in Example 64.9. Verify that the empirical risk admits the representation

  Remp(c) = (1/2) { 1 − (1/N) Σ_{n=1}^{N} γ(n) c(hn) }


Conclude that one can alternatively select an optimal classifier by solving

  co = argsup_{c∈C} { (1/N) Σ_{n=1}^{N} γ(n) c(hn) }

How does this formulation relate to the Rademacher complexity of the class of binary classifiers C?
64.38 Refer to definitions (64.162) and (64.163) for the Rademacher complexity and its empirical version. The Gaussian complexity of a set of functions Q ∈ Q is defined similarly, with each variable σn now selected from the standard Gaussian distribution:

  Ĝ_N(Q) = E_σ { sup_{Q∈Q} (1/N) Σ_{n=1}^{N} σn Q(yn) },   σn ∼ N_σ(0, 1)
  G_N(Q) = E_y Ĝ_N(Q)

Show that the Rademacher and Gaussian complexities are related as follows:

  α R_N(Q) ≤ G_N(Q) ≤ β ln N · R_N(Q)

for some nonnegative constants α and β. Remark. See Tomczak-Jaegermann (1989) for a related discussion.
64.39 Consider a collection of vectors a ∈ A ⊂ IR^N, with individual entries a = col{an}. Consider also Rademacher variables {σn}, which take values {±1} with equal probability. Establish the Khintchine–Kahane inequality:

  (1/2) E_σ ‖ Σ_{n=1}^{N} σn an ‖²  ≤  ( E_σ ‖ Σ_{n=1}^{N} σn an ‖ )²  ≤  E_σ ‖ Σ_{n=1}^{N} σn an ‖²

Remark. The inequality is originally due to Khintchine (1923) and was extended by Kahane (1964). Proofs and discussion appear in Latala and Oleszkiewicz (1994), Wolff (2003), and Mohri, Rostamizadeh, and Talwalkar (2018).
64.40 Consider a collection of N feature vectors {h1, h2, . . . , hN} from the set {h ∈ IR^M | K(h, h) ≤ r²}, where K(ha, hb) denotes the kernel function. Let φ(h) represent the mapping that is implicitly defined by the choice of kernel: it maps vectors from the original feature space h ∈ IR^M to a transformed space hφ ∈ IR^{Mφ}. Introduce the set:

  A = { a = col{an} ∈ IR^N | an = (hnφ)ᵀ wφ, ‖wφ‖ ≤ 1 }

Extend the derivation from Prob. 64.35 to show that the Rademacher complexity of this set satisfies

  R_N(A) ≤ r/√N

Remark. See Mohri, Rostamizadeh, and Talwalkar (2018, ch. 5) for a related discussion.
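Bounds of the type stated in Probs. 64.35 and 64.40 can also be probed numerically. The following illustrative sketch estimates the empirical Rademacher complexity of the linear class with ‖w‖₂ ≤ 1 by Monte Carlo (the random feature matrix and the number of trials are arbitrary choices, and the supremum over the unit ball is evaluated in closed form as a Euclidean norm) and compares it with the bound (1/√N) max_n ‖hn‖₂.

```python
# Illustrative sketch: estimate the empirical Rademacher complexity of
# A = { (h_n^T w)_n : ||w||_2 <= 1 } and compare with (1/sqrt(N)) max_n ||h_n||_2.
import numpy as np

rng = np.random.default_rng(2)
N, M = 200, 5
H = rng.normal(size=(N, M))                      # fixed feature vectors h_n

# for this class, sup_{||w||<=1} (1/N) sum_n sigma_n h_n^T w = (1/N) ||sum_n sigma_n h_n||_2
num_trials = 2000
vals = np.empty(num_trials)
for k in range(num_trials):
    sigma = rng.choice([-1.0, 1.0], size=N)
    vals[k] = np.linalg.norm(H.T @ sigma) / N
estimate = vals.mean()
bound = np.max(np.linalg.norm(H, axis=1)) / np.sqrt(N)
print(f"Monte Carlo estimate of R_N(A): {estimate:.4f}")
print(f"bound (1/sqrt(N)) max_n ||h_n||: {bound:.4f}")
```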

64.A VC DIMENSION FOR LINEAR CLASSIFIERS

We establish in this appendix the result of Lemma 64.1 following an argument similar to Abu-Mostafa, Magdon-Ismail, and Lin (2012), Shalev-Shwartz and Ben-David (2014, ch. 9), and Mohri, Rostamizadeh, and Talwalkar (2018, ch. 3). Thus, consider the class of affine classifiers defined by c(h) = sign(hᵀw − θ), with parameters w ∈ IR^M and


θ ∈ IR. If we assume the feature vectors are extended with a top unit entry, and the weight vector w is extended with −θ as leading entry, namely,

  w ← col{ −θ, w },   h ← col{ 1, h }    (64.65)

then it is sufficient to focus on linear classifiers of the form c(h) = sign(hᵀw). Assuming this extension, let us first establish that VC ≥ M + 1. We can do so by constructing a collection of M + 1 feature vectors that can be shattered by linear classifiers. Indeed, consider the following M + 1 feature vectors collected as rows into the matrix H below:

  H ≜ col{ h1ᵀ, h2ᵀ, . . . , h_{M+1}ᵀ },  with  h1ᵀ = [ 1  0_Mᵀ ],  h2ᵀ = [ 1  e1ᵀ ], . . . , h_{M+1}ᵀ = [ 1  e_Mᵀ ],  so that H ∈ IR^{(M+1)×(M+1)}    (64.66)

Each feature vector starts with the unit entry, with the remaining entries corresponding to the zero vector for h1 and to the basis vectors {em} in IR^M for the remaining features. It is easy to verify that the square matrix H is invertible. Now, let γvec ∈ IR^{M+1} denote any label vector of size M + 1: the individual entries of γvec can be +1 or −1 at will, so that all labeling possibilities for the M + 1 feature vectors in H are covered. Now, for any choice of γvec, there exists a classifier w that maps H to γvec and it can be chosen as w = H^{−1} γvec. Therefore, the above set of M + 1 feature vectors can be shattered and we conclude that

  VC(linear classifiers) ≥ M + 1    (64.67)

Let us verify next that VC ≤ M + 1 so that equality must hold. To prove this second statement, it is sufficient to exhibit an example with M + 2 feature vectors and the corresponding labels for which no linear classifier exists. Thus, consider a collection of M + 2 nonzero feature vectors in IR^{M+1}. These vectors are clearly linearly dependent, which means there exists some feature vector among them, denoted by hn, such that hn is a linear combination of the remaining feature vectors. Specifically, we write

  hn = Σ_{m≠n} α(m) hm    (64.68)

for some coefficients {α(m)}, some of which are nonzero. We now assign the following labels to the M + 2 feature vectors {h1, h2, . . . , h_{M+2}}:

  γ(m) = sign(α(m)), for all m ≠ n;   γ(m) = −1, for m = n    (64.69)

That is, we label hn as −1 and label all other feature vectors by the sign of the corresponding coefficient α(m); if α(m) = 0, it does not matter whether we label the corresponding feature vector with +1 or −1. Now, note the following. For any classifier w that is able to classify the M + 1 features {hm, m ≠ n} so that

  sign(hmᵀw) = γ(m) = sign(α(m))    (64.70)

this classifier will not be able to classify hn correctly because

  sign(hnᵀw) = sign( Σ_{m≠n} α(m) hmᵀw ) > 0    (64.71)

The positive sign contradicts the fact that the label for hn is negative. Therefore, we have a collection of M + 2 feature vectors that cannot be shattered by the class of linear classifiers and we conclude that

  VC(linear classifiers) ≤ M + 1    (64.72)

Combining (64.67) and (64.72) we arrive at the desired conclusion.
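The constructive half of the argument is easy to check numerically: for the specific matrix H in (64.66), the weight vector w = H^{−1} γvec reproduces any desired labeling. The sketch below is illustrative only and runs this check, for a small M, over all 2^{M+1} labelings.

```python
# Illustrative sketch: verify that the M+1 feature vectors assembled into H
# in (64.66) can be shattered by linear classifiers via w = H^{-1} gamma_vec.
import itertools
import numpy as np

M = 4
# rows: [1, 0_M], [1, e_1], ..., [1, e_M]  -> an (M+1) x (M+1) invertible matrix
H = np.hstack([np.ones((M + 1, 1)), np.vstack([np.zeros((1, M)), np.eye(M)])])

all_realized = True
for labels in itertools.product([-1.0, 1.0], repeat=M + 1):
    gamma_vec = np.array(labels)
    w = np.linalg.solve(H, gamma_vec)        # w = H^{-1} gamma_vec
    realized = np.sign(H @ w)
    all_realized &= np.array_equal(realized, gamma_vec)
print("all 2^(M+1) labelings realized:", all_realized)
```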

64.B SAUER LEMMA

In this appendix, we establish a useful lemma that deals with a fundamental combinatorial bound and use it to establish the VC inequality (64.44) in Appendix 64.C. The arguments in these two appendices are adapted from the derivation given by Devroye, Gyorfi, and Lugosi (1996) adjusted to our notation and conventions. Thus, let {hn ∈ IR^M} denote N feature vectors and let C denote a set of classifier models; this set may have a finite or infinite number of elements. Each c ∈ C maps a feature vector hn into one of two binary classes, i.e., c(hn) : IR^M → {±1}.

Shatter coefficient or growth function

There are 2^N possibilities for assigning the N feature vectors to the two classes ±1. For each choice of a classifier c ∈ C, we obtain one possible labeling vector (also called a dichotomy), denoted by ℓc, for the given feature vectors:

  ℓc ≜ col{ c(h0), c(h1), c(h2), . . . , c(hN−1) } ∈ {±1}^N    (64.73)

This is a vector of size N × 1 with entries ±1.

Example 64.4 (Illustrating dichotomies) Figure 64.12 illustrates the construction. In this example, we assume the feature data are scalars, hn ∈ IR, and that each classifier c in the set C is defined by some threshold parameter t ∈ IR. Based on the value of t, the classifier c assigns a feature vector to class +1 or −1 according to the following decision rule:

  if hn ≥ t, then c(hn) = +1;   if hn < t, then c(hn) = −1    (64.74)

The first row on the left-hand side of the figure shows three feature values, denoted by {h0, h1, h2}; they occur at coordinate locations {0, 2, 3} on the real axis. The subsequent rows in the figure indicate the classes that these feature entries are assigned to, depending on where the threshold value t (denoted by the red circle) is located. In particular,

  if t ≤ 0, then {h0, h1, h2} ∈ {+1, +1, +1}    (64.75)
  if 0 < t ≤ 2, then {h0, h1, h2} ∈ {−1, +1, +1}    (64.76)
  if 2 < t ≤ 3, then {h0, h1, h2} ∈ {−1, −1, +1}    (64.77)
  if t > 3, then {h0, h1, h2} ∈ {−1, −1, −1}    (64.78)

Therefore, in this example, the classifier set C is only able to generate four possible labeling vectors, {ℓc}, which we collect into the rows of an assignment matrix AC:

        h0   h1   h2
       +1   +1   +1
  AC = −1   +1   +1      (64.79)
       −1   −1   +1
       −1   −1   −1


There are clearly assignments that are not possible to generate by this set of threshold classifiers, such as the assignment {+1, −1, +1}. Observe that even though the classifier set C may have an infinite number of models, the number of dichotomies in AC (i.e., the number of its rows) is always finite and bounded by 2^N.

Figure 64.12 The rows on the left show three feature values on the real line and the four possible class assignments for them. The red circle represents the location of the threshold value in each case. The rows on the right show the same construction for the case in which two feature values coincide, h1 = h2. In this second case, only three assignments are possible.

As was already explained in Section 64.4, we say that the set of classifiers C is able to shatter the N feature vectors if every possible assignment among the 2^N possibilities can be generated by C. We illustrated this definition in Fig. 64.7. We observe from the above example, with three scalar feature values, that it is not always possible to generate all 2^N valid assignments (or labeling vectors, ℓc) by the classifiers in C. We let AC denote the collection of all assignments that can be generated by C:

  AC(h0, h1, . . . , hN−1) ≜ { ℓc, c ∈ C }    (64.80)

so that each choice c ∈ C generates one assignment vector, ℓc, and the aggregation of all these row vectors is the matrix AC. Observe that the assignment set AC is a function of both the classifier space, C, and the feature vectors, {hn}. A different collection of feature vectors {hn} would generally lead to a different assignment set AC. For instance, as shown in the second column of Fig. 64.12, if two of the feature values happen to occur at the same location, say, h0 = 0 while h1 = h2 = 2, then, in this case, the threshold classifier set can only generate three possible labeling vectors, namely,

        h0   h1   h2
  AC = +1   +1   +1      (64.81)
       −1   +1   +1
       −1   −1   −1
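A brute-force enumeration reproduces these assignment sets directly. The short sketch below is illustrative only; it lists the dichotomies realizable by the threshold classifiers on the two point configurations used in the example.

```python
# Illustrative sketch: enumerate the dichotomies that threshold classifiers
# c_t(h) = +1 iff h >= t can realize on a given set of scalar points.
def dichotomies(points):
    pts = list(points)
    # thresholds below, between, and above the points cover all distinct rules
    cands = [min(pts) - 1.0] + sorted(set(pts)) + [max(pts) + 1.0]
    return sorted({tuple(1 if p >= t else -1 for p in pts) for t in cands})

print(dichotomies([0.0, 2.0, 3.0]))   # four dichotomies, matching (64.79)
print(dichotomies([0.0, 2.0, 2.0]))   # only three, matching (64.81)
```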


To remove ambiguity due to the choice of the feature data, we introduce the shatter coefficient of the classifier set, C, and denote it by S(C, N). This coefficient, which is also called the growth function, is an integer value that corresponds to the largest number of assignments that can be generated by C over all possible choices for the feature vectors {hn}, i.e.,

  S(C, N) ≜ max_{hn} | AC(h0, h1, . . . , hN−1) |    (64.82)

where the notation |AC| denotes the cardinality of the set AC; in this case, it is the number of rows in the assignment matrix. Thus, the shatter coefficient S(C, N) corresponds to the largest possible cardinality for AC. For the example of Fig. 64.12, it is clear that S(C, 3) = 4. For this same example, if we instead had a total of N features (rather than only 3), then it is easy to see that the shatter coefficient will be S(C, N) = N + 1. Obviously, for any set of classifiers and feature vectors, it holds that

  S(C, N) ≤ 2^N    (64.83)

since 2^N is the maximum number of possible assignments for binary classification scenarios. Observe that this bound grows exponentially with the size of the training data. One fundamental result, derived further ahead under the designation of the Sauer lemma, is that for classifier sets with finite VC dimension, their shatter coefficients are bounded by polynomial (rather than exponential) functions of N – see (64.87) and (64.88).

VC dimension

We defined in Section 64.4 the VC dimension of a class of classifiers C as the largest integer K for which at least one set of K feature vectors can be shattered by C. In other words, the VC dimension of C is the largest K for which S(C, K) = 2^K or, equivalently,

  S(C, VC) = 2^VC    (64.84)

It turns out that when VC < ∞, the growth function (or shatter coefficient) of C grows polynomially in N. This property is established in the following statement, where we employ the following definition for the combinatorial function:

  \binom{N}{n} ≜ N!/( n!(N − n)! ) for 0 ≤ n ≤ N,  and  \binom{N}{n} ≜ 0 otherwise    (64.85)

Sauer lemma (Sauer (1972), Shelah (1972)): The shatter coefficient (or growth function) of a set of classifiers C applied to N feature vectors is bounded by the following value in terms of the VC dimension:

  S(C, N) ≤ Σ_{n=0}^{VC} \binom{N}{n}    (64.86)

Two other useful bounds that follow from (64.86) when 1 ≤ VC ≤ N are:

  S(C, N) ≤ (1 + N)^VC    (64.87)
  S(C, N) ≤ (Ne/VC)^VC    (64.88)

where the letter "e" refers to the basis of the natural logarithm (e ≈ 2.7183).


Proof: The argument is lengthy and involves several steps. We employ a traditional inductive argument. Let us first verify that the lemma holds for a couple of useful boundary conditions.

(Boundary conditions). For N = 0 and any VC, we have

  Σ_{n=0}^{VC} \binom{0}{n} = 1,  and  S(C, 0) ≤ 1    (64.89)

where the second equality is because there are no feature data to label (therefore, we can bound the number of label possibilities by 1). Likewise, for VC = 0 and any N ≥ 1, we have

  Σ_{n=0}^{0} \binom{N}{n} = 1,  and  S(C, N) = 1    (64.90)

where the second equality is because the VC dimension is 0 and, therefore, the set of classifiers can only assign the same label to all feature vectors. Similarly, for N = 1 and any VC ≥ 1, we have

  Σ_{n=0}^{VC} \binom{1}{n} = \binom{1}{0} + \binom{1}{1} + · · · + \binom{1}{VC} = 2    (64.91)

while S(C, 1) ≤ 2. This latter inequality is because, at best, the set of classifiers may be able to assign the single feature vector into either class. We therefore assume VC ≥ 1.

(Induction argument). We now assume that (64.86) holds up to N − 1 and show that it also holds for N. To do so, and in order to simplify the notation, we introduce the shorthand symbol HN to refer to the collection of N feature vectors, say,

  HN ≜ {h0, h1, . . . , hN−1}    (64.92)

Let S(C, N) denote the shatter coefficient for the set C over N feature vectors. We already know that this value is the maximal number of different ways by which the N vectors can be labeled. Let Cs ⊂ C denote the smallest subset of the classifier set that attains this shatter value. That is, the number of classifiers in Cs is equal to the number of distinct labelings/dichotomies that can be generated on HN. Likewise, we write HN−1 to refer to the collection formed by excluding the last feature vector:

  HN = HN−1 ∪ {hN−1}    (64.93)

We also let S(C, N − 1) denote the shatter coefficient for the same set C over N − 1 feature vectors. This value is the maximal number of different ways by which N − 1 vectors can be labeled. We further let C1 ⊂ Cs denote the smallest subset of the classifier set that attains this shatter value. Again, the number of classifiers in C1 is equal to the number of distinct labelings/dichotomies that can be generated on HN−1. Moreover, since C1 ⊂ Cs, it holds that

  VC(C1) ≤ VC(Cs) ≤ VC(C)    (64.94)

This is because any set of feature vectors that can be shattered by C1 can also be shattered by Cs. We subsequently decompose the set Cs into

  Cs = C1 ∪ (Cs\C1) ≜ C1 ∪ C1c    (64.95)

It is clear that each classifier in the complementary set C1c generates a labeling for the feature vectors in HN−1 that is already generated by some classifier in C1; otherwise,


this classifier from C1c would need to be in C1. This also means that for every classifier in C1c there exists a classifier in C1 such that both classifiers agree on their labeling of HN−1 but disagree on their labeling of hN−1; they need to disagree on hN−1 otherwise they would be identical classifiers. Another property of the set C1c is the following. Assume two classifiers, say c1 and c2, exist in the set Cs that classify the N − 1 feature vectors in HN−1 in the same manner. If this happens, then only one of these classifiers, say, c1, must belong to the set C1 because otherwise C1 would not be the smallest classifier set that attains the shatter value for HN−1. The other classifier, say, c2, will be in C1c. Moreover, and importantly, this second classifier will label hN−1 differently from the classifier c1 added to C1 (otherwise, both classifiers c1 and c2 would be identical). The above properties are illustrated in the assignment matrix shown below for the threshold-based classifier of Fig. 64.12 for the case N = 4:

        h0   h1   h2   h3
       +1   +1   +1   +1
       −1   +1   +1   +1    }  C1
  AC = −1   −1   +1   +1                (64.96)
       −1   −1   −1   +1
       −1   −1   −1   −1    }  C1c

In this case, the shatter coefficient is S(C, 4) = 5, so that there are at most five dichotomies that can be generated by C. These dichotomies are listed as the rows of AC shown above. These rows represent the smallest classifier set, denoted by Cs. Observe that the first four rows correspond to the classifiers in the set C1: they attain the maximal shatter value of S(C, 3) = 4 on the first three feature vectors. Observe further that the last row in AC represents the classifier set C1c; it consists of a single classifier that generates the same labels on the features {h0, h1, h2} as the fourth classifier, but nevertheless leads to a different label for h3. We now verify that, more generally, and in view of the above observations regarding the sets {C1, C1c, Cs}, it should hold that:

  VC(C1c) ≤ VC(Cs) − 1 ≤ VC(C) − 1    (64.97)

Indeed, assume C1c shatters completely some set of feature vectors H′ ⊂ HN−1. Then, it necessarily holds that Cs should shatter the expanded collection H′ ∪ {hN−1}. It is obvious that Cs shatters H′ since C1c ⊂ Cs. With regards to hN−1, we simply observe that Cs = C1 ∪ C1c and each of these sets contains a classifier that labels hN−1 differently than the other (e.g., the classifiers c1 and c2 mentioned before). Now note that, by the induction assumption,

  S(C1, N − 1) ≤ Σ_{n=0}^{VC} \binom{N−1}{n}    (64.98)
  S(C1c, N − 1) ≤ Σ_{n=0}^{VC−1} \binom{N−1}{n}    (64.99)

Moreover, it holds that

  S(C, N) ≤ S(C1, N − 1) + S(C1c, N − 1)
         ≤ Σ_{n=0}^{VC} \binom{N−1}{n} + Σ_{n=0}^{VC−1} \binom{N−1}{n}
         = Σ_{n=0}^{VC} \binom{N−1}{n} + Σ_{n=0}^{VC} \binom{N−1}{n−1}
         = Σ_{n=0}^{VC} \binom{N}{n}    (64.100)

where in the last equality we used the property

  \binom{N}{n} = \binom{N−1}{n} + \binom{N−1}{n−1}    (64.101)

The bound (64.100) establishes result (64.86). Now, assume 1 ≤ VC ≤ N. With regards to the bound (64.88), we note that since VC/N ≤ 1:

  (VC/N)^VC Σ_{n=0}^{VC} \binom{N}{n} ≤ Σ_{n=0}^{VC} \binom{N}{n} (VC/N)^n
      ≤(a) Σ_{n=0}^{N} \binom{N}{n} (VC/N)^n
      = Σ_{n=0}^{N} \binom{N}{n} (VC/N)^n 1^{N−n}
      =(b) ( 1 + VC/N )^N
      ≤(c) e^VC    (64.102)

where in step (a) we replaced the upper limit on the sum by N, in step (b) we used the binomial theorem, namely,

  (x + y)^m = Σ_{ℓ=0}^{m} \binom{m}{ℓ} x^ℓ y^{m−ℓ}    (64.103)

and in step (c) we used the fact that, for any x ≥ 0:

  ( 1 + x/N )^N ≤ e^x    (64.104)

Using (64.102) in (64.86) gives (64.88). With regards to bound (64.87), we first note that, for any integer n ≥ 0, it holds that

  \binom{N}{n} ≜ N!/( n!(N − n)! ) ≤ N^n/n!    (64.105)

Consequently, since VC ≥ 1 and 0 ≤ n ≤ VC,

  \binom{N}{n} ≤ N^n/n! ≤ ( VC!/(VC − n)! ) N^n/n! = \binom{VC}{n} N^n    (64.106)


Using this result in (64.100) gives

  S(C, N) ≤ Σ_{n=0}^{VC} \binom{N}{n} ≤ Σ_{n=0}^{VC} \binom{VC}{n} N^n × 1^{VC−n} =(a) (1 + N)^VC    (64.107)

where in step (a) we applied the binomial theorem (64.103) again. ■
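To see the lemma in action, the next sketch (illustrative only; it uses the class of interval classifiers from Prob. 64.17, whose VC dimension is 2) counts the dichotomies realizable on N distinct points by brute force and compares the count with the Sauer bound.

```python
# Illustrative sketch: growth function of the interval classifiers
# c_{a,b}(h) = +1 iff a <= h <= b, compared with the Sauer bound for VC = 2.
import math

def interval_dichotomies(N):
    pts = list(range(N))                      # N distinct points on the line
    labelings = set()
    # intervals with endpoints at the points themselves cover all nonempty cases
    for i in range(N):
        for j in range(i, N):
            labelings.add(tuple(+1 if i <= k <= j else -1 for k in pts))
    labelings.add(tuple([-1] * N))            # the all-negative labeling
    return len(labelings)

VC = 2
for N in range(1, 9):
    sauer = sum(math.comb(N, n) for n in range(VC + 1))
    print(f"N={N}: dichotomies={interval_dichotomies(N)}, Sauer bound={sauer}")
```

For this particular class the count equals N(N+1)/2 + 1, which coincides with the Sauer bound for VC = 2, so the lemma is tight in this example.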

64.C VAPNIK–CHERVONENKIS BOUND

In this appendix, we establish the validity of the VC bound (64.44) for binary classification problems with classes γ ∈ {±1}. The argument is adapted from the derivation given by Devroye, Gyorfi, and Lugosi (1996) adjusted to our notation and conventions. Let {γ(n), hn ∈ IR^M} denote N independent realizations arising from a joint (unknown) distribution fγ,h(γ, h). Let c⋆(h) denote a solution to the empirical risk minimization problem over some set c ∈ C:

  c⋆(h) ≜ argmin_{c∈C} { Remp(c) ≜ (1/N) Σ_{n=0}^{N−1} I[ c(hn) ≠ γ(n) ] }    (64.108)

Likewise, let co(h) denote the optimal solution that minimizes the probability of misclassification over the same set:

  co(h) ≜ argmin_{c∈C} { R(c) ≜ P[ c(h) ≠ γ ] }    (64.109)

Classifier set with finite cardinality

Assume initially that the set C has finite cardinality (i.e., a finite number of elements), denoted by |C|. Using straightforward arguments, and the Hoeffding inequality (3.233), we are able to establish in Probs. 64.4 and 64.24 the following useful bound:

  P( sup_{c∈C} |Remp(c) − R(c)| ≥ δ ) ≤ 2|C| e^{−2Nδ²}   (when |C| is finite)    (64.110)

The difficulty arises when the set C has uncountably infinite elements. In that case, the term on the right-hand side of (64.110) is not useful because it degenerates to an unbounded value. It turns out, though, that what matters is not the cardinality of C, but rather the largest number of dichotomies that the set C can generate on the training data. This number is equal to the shatter coefficient of C, which we introduced in the previous appendix and denoted by S(C, N ). We showed in (64.86), and also (64.87)– (64.88), that the shatter coefficient is bounded polynomially in N even when |C| is infinite.
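Before turning to the infinite case, the finite-class bound (64.110) can be probed numerically. The sketch below is illustrative only: it assumes a small finite class of threshold classifiers and a simple noisy-label data model chosen here for convenience, and it uses Monte Carlo frequencies in place of exact probabilities to compare the observed frequency of a large uniform deviation with the bound 2|C|e^{−2Nδ²}.

```python
# Illustrative sketch: empirical check of the finite-class uniform deviation
# bound  P( sup_c |Remp(c) - R(c)| >= delta ) <= 2 |C| exp(-2 N delta^2).
import numpy as np

rng = np.random.default_rng(3)
N, delta = 200, 0.1
thresholds = np.linspace(-1, 1, 11)              # finite class: c_t(h) = sign(h - t)

def true_risk(t):
    # data model: h ~ Uniform(-1, 1), label gamma = sign(h) flipped with prob 0.1;
    # the risk of c_t has a closed form for this model
    flip = 0.1
    mass_wrong_region = abs(t) / 2.0             # region where sign(h - t) != sign(h)
    return flip + (1 - 2 * flip) * mass_wrong_region

R = np.array([true_risk(t) for t in thresholds])

num_trials = 2000
exceed = 0
for _ in range(num_trials):
    h = rng.uniform(-1, 1, size=N)
    gamma = np.sign(h)
    gamma[rng.random(N) < 0.1] *= -1             # label noise
    preds = np.sign(h[None, :] - thresholds[:, None])
    preds[preds == 0] = 1
    Remp = (preds != gamma[None, :]).mean(axis=1)
    exceed += np.max(np.abs(Remp - R)) >= delta
print("observed frequency:", exceed / num_trials)
print("bound 2|C|exp(-2N delta^2):", 2 * len(thresholds) * np.exp(-2 * N * delta**2))
```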

Derivation of VC bound

We now establish the following fundamental result, the proof of which is nontrivial and relies again on several steps. We follow largely the presentation given by Devroye, Gyorfi, and Lugosi (1996, ch. 12). As indicated in the concluding remarks, the coefficient appearing in the exponential factor in (64.111) below ends up being Nδ²/32, while the coefficient appearing in the original bound given by Vapnik and Chervonenkis (1971) is Nδ²/8 and corresponds to a tighter bound. This difference is not significant since it


is sufficient for our purposes to know that a bound exists and that this bound decays to zero as N → ∞ at a uniform rate that is independent of the data distribution.

Vapnik–Chervonenkis inequality (Vapnik and Chervonenkis (1971)): For any given small constant δ > 0 and Nδ² ≥ 2, it holds that

  P( sup_{c∈C} |Remp(c) − R(c)| > δ ) ≤ 8 (Ne/VC)^VC e^{−Nδ²/32}    (64.111)

independent of the data distribution, fγ,h(γ, h), and in terms of the VC dimension of the classifier set C.

Proof: The argument is demanding and involves several steps. We remark that the condition Nδ² ≥ 2 in the statement of the inequality is not a restriction. This is because for Nδ² < 2, the bound in (64.111) becomes trivial because it will be larger than 7.5(Ne/VC), which in turn is generally larger than 1 (especially since we often have VC ≤ N).

(Symmetrization step: adding fictitious samples). The first step in the argument involves replacing the difference Remp(c) − R(c), which involves the unknown R(c), by one that involves only empirical risks – see (64.113). By doing so, we will be able to bound the probability expression that appears in (64.111) by a term that depends symmetrically and solely on empirical data. To achieve this task, we start by introducing a collection of N fictitious data samples, denoted by {γ′(n), h′n}, and which are assumed to arise from the same data distribution as the original samples, {γ(n), hn}. These fictitious samples are added merely for the sake of argument and will not affect the final result. For any classifier c ∈ C, we denote its empirical risk on the fictitious data by using the prime notation:

  R′emp(c) ≜ (1/N) Σ_{n=0}^{N−1} I[ c(h′n) ≠ γ′(n) ]    (64.112)

We now verify that

  P( sup_{c∈C} |Remp(c) − R(c)| > δ ) ≤ 2 P( sup_{c∈C} |Remp(c) − R′emp(c)| > δ/2 )    (64.113)

where, as desired, the term on the right-hand side involves only empirical risks in a symmetrical manner. Once established, this result relates the distance between two empirical risks to the desired distance from the empirical risk to the optimal risk – see Fig. 64.13. Let c̄ ∈ C be an element in the classifier set that satisfies the bound

  |Remp(c̄) − R(c̄)| > δ    (64.114)

If such an element does not exist, we simply let c̄ be an arbitrary element from C. This classifier therefore satisfies:

  P( |Remp(c̄) − R(c̄)| > δ ) ≥ P( sup_{c∈C} |Remp(c) − R(c)| > δ )    (64.115)

This is because if a classifier c̄ ∈ C exists satisfying (64.114), then both probabilities in the above relation are equal to 1 and the inequality holds. If, on the other hand, such a c̄ does not exist, then the probabilities will be zero and the inequality again holds.


Figure 64.13 The red circles represent empirical risk values for different realizations of

the data. Expression (64.113) relates the probability of the distance between Remp (c) and R(c) being larger than δ to the probability of the distance between two empirical risks being larger than δ/2. The subsequent analysis will bound this latter difference, which is a useful step since R(c) is unknown.

To arrive at (64.113), we first note the following sequence of inequalities:

  P( sup_{c∈C} |Remp(c) − R′emp(c)| > δ/2 )
     ≥ P( |Remp(c̄) − R′emp(c̄)| > δ/2 )
     = P( |Remp(c̄) − R(c̄) + R(c̄) − R′emp(c̄)| > δ/2 )
     ≥(a) P( |Remp(c̄) − R(c̄)| > δ  and  |R′emp(c̄) − R(c̄)| < δ/2 )    (64.116)

where step (a) follows from the property that for any two real numbers:

  |a − b| ≥ |a| − |b|    (64.117)

Indeed, assume that the following two conditions hold:

  ( |Remp(c̄) − R(c̄)| > δ )  and  ( |R′emp(c̄) − R(c̄)| < δ/2 )    (64.118)

Then, property (64.117) implies that

  |Remp(c̄) − R′emp(c̄)| = | (Remp(c̄) − R(c̄)) − (R′emp(c̄) − R(c̄)) |
     ≥ |Remp(c̄) − R(c̄)| − |R′emp(c̄) − R(c̄)|
     > δ − δ/2 = δ/2    (64.119)

Consequently, conditions (64.118) combined imply result (64.119), which justifies step (a). Continuing, we have from (64.116) that


  P( sup_{c∈C} |Remp(c) − R′emp(c)| > δ/2 )    (64.120)
     ≥ E { I[ |Remp(c̄) − R(c̄)| > δ ] I[ |R′emp(c̄) − R(c̄)| < δ/2 ] }
     =(b) E { E ( I[ |Remp(c̄) − R(c̄)| > δ ] I[ |R′emp(c̄) − R(c̄)| < δ/2 ] | {γ(n), hn} ) }
     = E { I[ |Remp(c̄) − R(c̄)| > δ ] E ( I[ |R′emp(c̄) − R(c̄)| < δ/2 ] | {γ(n), hn} ) }
     = E { I[ |Remp(c̄) − R(c̄)| > δ ] P( |R′emp(c̄) − R(c̄)| < δ/2 | {γ(n), hn} ) }

where step (b) introduces conditioning on the original training data {γ(n), hn}. We now examine the rightmost probability term. For this purpose, we introduce the zero-mean iid random variables:

  z(n) ≜ I[ c̄(h′n) ≠ γ′(n) ] − E { I[ c̄(h′n) ≠ γ′(n) ] | {γ(n), hn} }    (64.121)

where we will be using the boldface notation for the variables {γ(n), γ′(n), hn, h′n} whenever it is necessary to emphasize their stochastic nature; we will use the normal font notation to refer to their realizations. Using the fact that, by definition, R(c̄) = E I[c̄(h′) ≠ γ′], it is straightforward to verify that the variance of each z(n) is given by

  σz² ≜ E (z(n))² = R(c̄) − R²(c̄)    (64.122)

But since the risk value, R(c̄), is a probability measure, it assumes values in the range R(c̄) ∈ [0, 1]. It can then be verified that the quadratic expression in R(c̄) on the right-hand side of (64.122) satisfies:

  0 ≤ R(c̄) − R²(c̄) ≤ 1/4    (64.123)

so that σz² ≤ 1/4. It follows from the definition of the empirical and actual risks that:

  P( |R′emp(c̄) − R(c̄)| < δ/2 | {γ(n), hn} )
     = P( | (1/N) Σ_{n=0}^{N−1} z(n) | < δ/2 | {γ(n), hn} )
     = P( | Σ_{n=0}^{N−1} z(n) | < Nδ/2 | {γ(n), hn} )
     ≥(a) 1 − (4/N²δ²) N σz²
     = 1 − (4/Nδ²) σz²
     ≥(b) 1 − (4/Nδ²)(1/4)
     = 1 − 1/(Nδ²)
     ≥ 1/2    (64.124)


where step (a) uses the Chebyshev inequality (3.28), step (b) uses σz² ≤ 1/4, and the last step uses the condition Nδ² ≥ 2. Combining results (64.120) and (64.124) we arrive at

  P( sup_{c∈C} |Remp(c) − R′emp(c)| > δ/2 ) ≥ (1/2) P( |Remp(c̄) − R(c̄)| > δ )
     ≥ (1/2) P( sup_{c∈C} |Remp(c) − R(c)| > δ )    (64.125)

where the second inequality is by (64.115), and this leads to the desired result (64.113).

(Symmetrization step: randomizing the signs). Now we work on bounding the right-hand side of (64.113) since it only involves empirical risks. Expressing these empirical risks directly in terms of the corresponding data, we can write (where we are again emphasizing the random nature of the training and fictitious data):

  P( sup_{c∈C} |Remp(c) − R′emp(c)| > δ/2 )
     = P( sup_{c∈C} | (1/N) Σ_{n=0}^{N−1} ( I[c(hn) ≠ γ(n)] − I[c(h′n) ≠ γ′(n)] ) | > δ/2 )
     = P( sup_{c∈C} | (1/N) Σ_{n=0}^{N−1} y(n) | > δ/2 )    (64.126)

where we introduced the independent random variables:

  y(n) ≜ I[ c(hn) ≠ γ(n) ] − I[ c(h′n) ≠ γ′(n) ]    (64.127)

Since the random variables I[c(hn) ≠ γ(n)] and I[c(h′n) ≠ γ′(n)] have identical probability distributions, we conclude that y(n) has zero mean and, more importantly, the distribution of y(n) is symmetric (meaning that both y(n) and −y(n) have the same distribution). This property implies that if we randomly switch the signs of the y(n) terms appearing inside the sum in (64.126), then the sum variable will continue to have the same distribution and, therefore, the value of the probability measure (64.126) will not change. This useful observation can be exploited as follows. We introduce N random sign variables, {s(n)}, independently of {γ(n), γ′(n), hn, h′n}, such that:

  P(s(n) = +1) = P(s(n) = −1) = 1/2,   n = 0, 1, . . . , N − 1    (64.128)

Then, in view of the symmetry of the distribution of the y(n) random variables, we have

  P( sup_{c∈C} | (1/N) Σ_{n=0}^{N−1} y(n) | > δ/2 ) = P( sup_{c∈C} | (1/N) Σ_{n=0}^{N−1} s(n) y(n) | > δ/2 )    (64.129)

Now note from the definition of y(n) in (64.127) that the event

  | (1/N) Σ_{n=0}^{N−1} s(n) y(n) | > δ/2    (64.130)

implies that either one of the following two events is true:

  | (1/N) Σ_{n=0}^{N−1} s(n) I[c(hn) ≠ γ(n)] | > δ/4  or  | (1/N) Σ_{n=0}^{N−1} s(n) I[c(h′n) ≠ γ′(n)] | > δ/4    (64.131)


This is because if both events are false and the two terms in the above expression are less than or equal to δ/4 then, from the triangle inequality of norms, we would get:

  |(1/N) Σ_{n=0}^{N−1} s(n) y(n)| ≤ δ/2        (64.132)

which contradicts (64.130). Therefore, for event (64.130) to hold, it must be the case that event (64.131) also holds. It follows that

  P( sup_{c∈C} |(1/N) Σ_{n=0}^{N−1} s(n) y(n)| > δ/2 )
      ≤ P( sup_{c∈C} |(1/N) Σ_{n=0}^{N−1} s(n) I[c(h_n) ≠ γ(n)]| > δ/4  or  sup_{c∈C} |(1/N) Σ_{n=0}^{N−1} s(n) I[c(h'_n) ≠ γ'(n)]| > δ/4 )
      ≤ 2 P( sup_{c∈C} |(1/N) Σ_{n=0}^{N−1} s(n) I[c(h_n) ≠ γ(n)]| > δ/4 )        (64.133)

where in the last inequality we used the union bound for probabilities to eliminate the fictitious data and arrive at a bound that depends only on the original training data. Indeed, combining with (64.126) and (64.113), we conclude that:

  P( sup_{c∈C} |R_emp(c) − R(c)| > δ ) ≤ 4 P( sup_{c∈C} |(1/N) Σ_{n=0}^{N−1} s(n) I[c(h_n) ≠ γ(n)]| > δ/4 )        (64.134)

Note further that the term on the right-hand side involves the sum of a collection of independent random variables. This property will facilitate the last step given further ahead, which will rely on the Hoeffding inequality. In order to prepare for that step, we need to explain how to move the sup operation on the right-hand side outside of the probability expression – see (64.136).

(Union bound step). Given N feature vectors, {h_n}, there exist at most S(C, N) distinct dichotomies (or labelings) that can be generated by the set of classifiers, where S(C, N) denotes the corresponding shatter coefficient. Let C_s denote the smallest subset of C that is able to generate all these dichotomies. Then, obviously,

  |C_s| ≤ S(C, N)        (64.135)


Using the probability union bound, we now write:

  P( sup_{c∈C} |(1/N) Σ_{n=0}^{N−1} s(n) I[c(h_n) ≠ γ(n)]| > δ/4 )
      = P( sup_{c∈C_s} |(1/N) Σ_{n=0}^{N−1} s(n) I[c(h_n) ≠ γ(n)]| > δ/4 )
      = P( ∪_{c∈C_s} { |(1/N) Σ_{n=0}^{N−1} s(n) I[c(h_n) ≠ γ(n)]| > δ/4 } )
      ≤ Σ_{c∈C_s} P( |(1/N) Σ_{n=0}^{N−1} s(n) I[c(h_n) ≠ γ(n)]| > δ/4 )
      ≤ |C_s| sup_{c∈C_s} P( |(1/N) Σ_{n=0}^{N−1} s(n) I[c(h_n) ≠ γ(n)]| > δ/4 )
      ≤ S(C, N) sup_{c∈C_s} P( |(1/N) Σ_{n=0}^{N−1} s(n) I[c(h_n) ≠ γ(n)]| > δ/4 )
      ≤ S(C, N) sup_{c∈C} P( |(1/N) Σ_{n=0}^{N−1} s(n) I[c(h_n) ≠ γ(n)]| > δ/4 )        (64.136)

Observe that, as claimed earlier, the sup operation is now outside the probability calculation. Observe also that the bound involves the class size, S(C, N), as well as the independent random variables s(n) I[c(h_n) ≠ γ(n)].

(Hoeffding inequality). The final step is to exploit this independence along with the Hoeffding inequality to bound the right-hand side of (64.136). Thus, let

  b(n) ≜ s(n) I[c(h_n) ≠ γ(n)]        (64.137)

Each of these random variables has zero mean and its value is +1, 0, or −1. In particular, the value of each b(n) is bounded between −1 and 1. It then follows from the Hoeffding inequality (3.231b), by using ∆ = 4N, that

  P( |(1/N) Σ_{n=0}^{N−1} b(n)| > δ/4 ) = P( |Σ_{n=0}^{N−1} b(n)| > Nδ/4 )
      ≤ 2 e^{−2(Nδ/4)^2/(4N)}
      = 2 e^{−Nδ^2/32}        (64.138)

The bound on the right-hand side is independent of the classifier set, C, so that

  sup_{c∈C} P( |(1/N) Σ_{n=0}^{N−1} s(n) I[c(h_n) ≠ γ(n)]| > δ/4 ) ≤ 2 e^{−Nδ^2/32}        (64.139)

Substituting into (64.136) and (64.134) we obtain

  P( sup_{c∈C} |(1/N) Σ_{n=0}^{N−1} s(n) I[c(h_n) ≠ γ(n)]| > δ/4 ) ≤ 2 S(C, N) e^{−Nδ^2/32}        (64.140)

as well as

  P( sup_{c∈C} |R_emp(c) − R(c)| > δ ) ≤ 8 S(C, N) e^{−Nδ^2/32}        (64.141)


We finally arrive at the desired result (64.111) by using the bound (64.88) for the shatter coefficient, S(C, N ). 
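To get a feel for the sample sizes implied by (64.141), the following minimal Python sketch (our own illustration, not part of the text) combines the bound with the Sauer estimate S(C, N) ≤ (Ne/VC)^VC and searches for the smallest N that drives the right-hand side below a target confidence level; the names vc_dim, delta, and eps are placeholder parameters.

    import math

    def vc_bound(N, vc_dim, delta):
        # Right-hand side of (64.141) with the Sauer bound S(C, N) <= (N e / VC)^VC
        shatter = (N * math.e / vc_dim) ** vc_dim
        return 8.0 * shatter * math.exp(-N * delta**2 / 32.0)

    def smallest_N(vc_dim, delta, eps, N_max=10**9):
        # Doubling search for a sample size N with bound <= eps
        N = 1
        while N <= N_max and vc_bound(N, vc_dim, delta) > eps:
            N *= 2
        return N

    # Example: VC dimension 3, deviation delta = 0.1, confidence 95%
    print(smallest_N(vc_dim=3, delta=0.1, eps=0.05))

The numbers produced this way are conservative, which is consistent with the worst-case (distribution-free) nature of the VC analysis.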

64.D RADEMACHER COMPLEXITY

There is an alternative method to examine the generalization ability of learning algorithms by relying on the concept of the Rademacher complexity. Useful overviews appear in Boucheron, Bousquet, and Lugosi (2005), Shalev-Shwartz and Ben-David (2014), Mohri, Rostamizadeh, and Talwalkar (2018), and Wainwright (2019). Recall that the analysis in the body of the chapter, and the derivations in the last appendix, focused on binary classification problems where γ ∈ {±1} and on the 0/1-loss function. The analysis showed that classification structures with medium VC dimensions are able to learn well with high likelihood for any data distribution. In a sense, this conclusion amounts to a generalization guarantee under a worst-case scenario since it holds irrespective of the data distribution. It is reasonable to expect that some data distributions are more favorable than others and, therefore, it would be desirable to seek generalization results that have some dependence on the data distribution. The framework that is based on the Rademacher complexity will allow for this possibility and will lead to tighter error bounds. The approach will also apply to multiclass classification problems and to other loss functions, and is not restricted to binary classification or 0/1-losses. The analysis will continue to lead to similar reassuring conclusions about the ability of learning methods to generalize for mild VC dimensions. However, the conclusions will now be dependent on the data distribution and will not correspond to worst-case statements that hold for any distribution. Before formally introducing the concept, we remark that we have already encountered some elements of Rademacher complexity in the last appendix, for example, when we introduced the sign variables {s(n)} in (64.128) and incorporated them into the probability expression (64.129).

Definition over a set

Consider initially a subset A ⊂ IR^N, with cardinality |A|. Select an arbitrary vector a ∈ A, which is N-dimensional, and denote its individual scalar entries by a = col{a_n}, for n = 1, 2, . . . , N. The Rademacher complexity of the set of vectors A is a scalar denoted by R_N(A) and defined as the following expectation:

  R_N(A) ≜ E_σ { sup_{a∈A} ( (1/N) Σ_{n=1}^{N} σ_n a_n ) }        (64.142)

where the {σ_n} are called the Rademacher variables: they are random variables chosen independently of each other with

  P(σ_n = +1) = P(σ_n = −1) = 1/2        (64.143)

The expectation in (64.142) is relative to the randomness in the Rademacher variables. In the definition, the entries of each a ∈ A are first modulated by (or correlated with) the binary variables {σ_n} before computing the sample average. The expected largest value for this sample average is taken as the Rademacher complexity of the set. Observe that R_N(A) depends on N. One famous result concerning R_N(A) is the Massart lemma. Let ∆ denote the largest Euclidean norm within A:

  ∆ ≜ sup_{a∈A} ‖a‖        (64.144)


Massart lemma (Massart (2000)): The Rademacher complexity of a set of vectors A is bounded by

  R_N(A) ≤ (∆/N) √(2 ln |A|)        (64.145)

Proof: We follow steps similar to Shalev-Shwartz and Ben-David (2014, ch. 26) and Mohri, Rostamizadeh, and Talwalkar (2018, ch. 3). The argument uses the Hoeffding lemma, which we encountered earlier in (3.233). Thus, for any positive scalar t, we consider the following sequence of calculations:

  e^{t R_N(A)}
    (64.142)= exp( t · E_σ sup_{a∈A} (1/N) Σ_{n=1}^{N} σ_n a_n )
    (a)≤ E_σ exp( t · sup_{a∈A} (1/N) Σ_{n=1}^{N} σ_n a_n )
    (b)= E_σ sup_{a∈A} exp( (t/N) Σ_{n=1}^{N} σ_n a_n )
    (c)≤ Σ_{a∈A} E_σ exp( (t/N) Σ_{n=1}^{N} σ_n a_n )
    (d)= Σ_{a∈A} Π_{n=1}^{N} E_σ exp( t σ_n a_n / N )        (64.146)

where step (a) uses the Jensen inequality (8.77) and the fact that the function e^x is convex, step (b) switches the order of the sup and exponentiation operations since t > 0, step (c) bounds the sup by the sum over the elements of A, and step (d) uses the fact that the Rademacher variables {σ_n} are independent of each other. We are now ready to apply the Hoeffding bound (3.233). Let

  y_n ≜ σ_n a_n        (64.147)

and note that E y_n = 0 since E σ_n = 0. Moreover, the value of the variable y_n is either −a_n or a_n depending on the polarity of σ_n. It follows from (3.233) that

  E e^{t y_n / N} ≤ e^{t^2 (2a_n)^2 / (8N^2)} = e^{t^2 a_n^2 / (2N^2)}        (64.148)


Substituting into (64.146) gives

  e^{t R_N(A)} ≤ Σ_{a∈A} Π_{n=1}^{N} e^{t^2 a_n^2 / (2N^2)}
      = Σ_{a∈A} exp( (t^2/(2N^2)) Σ_{n=1}^{N} a_n^2 )
      ≤ |A| exp( t^2 ∆^2 / (2N^2) )
      = exp( ln |A| + t^2 ∆^2 / (2N^2) )        (64.149)

or, equivalently,

  R_N(A) ≤ ln|A|/t + t ∆^2 / (2N^2)        (64.150)

We are free to select the parameter t. We therefore minimize the upper bound over t to get

  t = (N/∆) √(2 ln |A|)        (64.151)

so that, upon substitution into the right-hand side of (64.150), we conclude that

  R_N(A) ≤ (∆/N) √(2 ln |A|)        (64.152)  □

Example 64.5 (Finite set of classifiers) Consider a finite set of binary classifiers

  C = { c(h) : IR^M → {±1} }        (64.153)

and a collection of N feature vectors {h_n ∈ IR^M}, for n = 1, 2, . . . , N. Each classifier c ∈ C provides one possible labeling for the feature vectors, which we denote by

  a ≜ col{ c(h_1), c(h_2), . . . , c(h_N) } ∈ {±1}^N        (64.154)

This is a vector of size N × 1 with entries ±1. By constructing the label vectors {a} for each of the classifiers c ∈ C, we end up with a finite collection of vectors:

  A = { a | a = col{c(h_n)}, c ∈ C }        (64.155)

In this example, the cardinality of A is equal to the cardinality of C:

  |A| = |C|        (64.156)

Moreover, the bound ∆ is easily seen to be ∆ = √N. Using the Massart bound (64.145) we conclude that the Rademacher complexity that is associated with the set of classifiers C is bounded by

  R_N(A) ≤ √( 2 ln |C| / N )        (64.157)
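As a quick numerical check of Example 64.5 and the Massart bound (64.157), the following sketch (an illustration of ours, not taken from the text) draws a small random collection of ±1 label vectors to play the role of A, estimates the expectation in (64.142) by Monte Carlo over the Rademacher variables, and compares the estimate with √(2 ln|C|/N); the sample sizes and variable names are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)
    N, num_classifiers, num_draws = 50, 20, 5000

    # Rows of A play the role of the label vectors a = col{c(h_n)} in (64.154)
    A = rng.choice([-1.0, 1.0], size=(num_classifiers, N))

    # Monte Carlo estimate of R_N(A) = E_sigma sup_a (1/N) sum_n sigma_n a_n
    sigma = rng.choice([-1.0, 1.0], size=(num_draws, N))
    correlations = sigma @ A.T / N              # shape (num_draws, num_classifiers)
    rademacher_est = correlations.max(axis=1).mean()

    massart_bound = np.sqrt(2.0 * np.log(num_classifiers) / N)
    print(rademacher_est, massart_bound)        # estimate should not exceed the bound (up to MC noise)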


Example 64.6 (Some intuition on the Rademacher complexity) We use the previous example to gain some intuition into the definition of R_N(A), which we rewrite in terms of the binary classifiers:

  R_N(A) = E_σ { sup_{c∈C} ( (1/N) Σ_{n=1}^{N} σ_n c(h_n) ) }        (64.158)

Observe that the summation on the right-hand side is computing the correlation between the random vector of Rademacher parameters {σ_1, . . . , σ_N} and the label vector {c(h_1), . . . , c(h_N)} that results from applying the classifier c(h). A high correlation value means that this label vector is able to match relatively well the particular choice of Rademacher labels {σ_n}. The Rademacher complexity is therefore assessing the largest possible correlation that the class of classifiers C is able to attain on average. The larger this value is, the more likely the class C will be able to fit randomly chosen label vectors – Prob. 64.37 provides additional motivation. It follows from this explanation that the Rademacher complexity provides an assessment of the representation power (or richness or expressiveness) of a class of classifiers, C. In this sense, it plays a role similar to the VC dimension. However, unlike the VC concept, the Rademacher complexity is not limited to binary classification problems.

Example 64.7 (Rademacher complexity and VC dimension) The Massart lemma helps link the two important concepts of Rademacher complexity and VC dimension. To see this, we continue with Example 64.5 but consider now the situation in which the set of classifiers C has infinitely many elements. We already know how many different labeling vectors this set can generate for the N feature vectors {h_n}. This number is given by the shatter coefficient S(C, N), which, in view of the Sauer lemma, we showed in (64.88) to be bounded by

  S(C, N) ≤ (Ne/VC)^{VC}        (64.159)

Therefore, if we again generate the set of vectors A that corresponds to this class of classifiers C, its cardinality will be bounded by this same value:

  |A| ≤ (Ne/VC)^{VC}        (64.160)

Using the Massart bound (64.145) we conclude that the Rademacher complexity that is associated with the class of classifiers C is now bounded by

  R_N(C) ≤ √( (2 VC / N) ln(Ne/VC) )        (64.161)
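To see how quickly the VC-based bound (64.161) shrinks with the sample size, here is a small sketch (our own illustration; vc_dim is a placeholder value) that evaluates the right-hand side for a few values of N; the decay is roughly of order √(ln N / N).

    import math

    def rademacher_vc_bound(N, vc_dim):
        # Right-hand side of (64.161): sqrt( (2*VC/N) * ln(N*e/VC) )
        return math.sqrt(2.0 * vc_dim / N * math.log(N * math.e / vc_dim))

    for N in (100, 1000, 10000, 100000):
        print(N, round(rademacher_vc_bound(N, vc_dim=10), 4))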

Definition over functions

We can extend the definition of the Rademacher complexity to sets of scalar real-valued functions Q ∈ Q, where each Q(y) : IR → IR. We use the letter Q because it will often correspond to the loss function in the context of learning algorithms. We also use the letter y because it will correspond to the margin variable y = γ γ̂. For now, we treat Q and y generically and later specialize them to the learning context. We consider a collection of N scalar variables {y_n} and define the empirical Rademacher complexity of the set Q as follows using the hat notation:

  R̂_N(Q) = E_σ { sup_{Q∈Q} ( (1/N) Σ_{n=1}^{N} σ_n Q(y_n) ) }        (64.162)

where the {σ_n} continue to be the Rademacher variables, which assume the values {±1} uniformly and independently of each other. In definition (64.162), the function Q(·) is evaluated at each y_n and modulated by the binary variable σ_n before computing the sample average.


The expected largest value for this sample average, over the set of functions, is taken as the empirical Rademacher complexity for the set Q. The reason for the designation "empirical" is because the variables {y_n} will usually correspond to independent observations of some random variable y ∼ f_y(y). If we then compute the expectation relative to the distribution of y we obtain the Rademacher complexity without the hat notation:

  R_N(Q) = E_y { R̂_N(Q) }        (64.163)

where we are now treating the empirical complexity as a random variable due to its dependence on the random observations {y_n}. Note that by computing the expectation relative to the distribution of y, the Rademacher complexity becomes dependent on this distribution. This line of reasoning is unlike the analysis carried out in the previous appendix where, for example, bounds on the shatter coefficient (or growth function) were derived independently of any distribution.

Example 64.8 (Useful property) Consider a class of functions c ∈ C, where each c can be expressed in the form c(y) = aQ(y) + b for some constants a, b ∈ IR and function Q(y) from another set Q. We can relate the Rademacher complexities of both sets {C, Q} as follows:

  R_N(C) = |a| R_N(Q)        (64.164a)
  R̂_N(C) = |a| R̂_N(Q)        (64.164b)

Proof: It is sufficient to establish the result for the empirical Rademacher complexity. Thus, note from the definition that

  R̂_N(C) = E_σ { sup_{Q∈Q} ( (1/N) Σ_{n=1}^{N} σ_n c(y_n) ) }
      = E_σ { sup_{Q∈Q} ( (1/N) Σ_{n=1}^{N} σ_n a Q(y_n) + (1/N) Σ_{n=1}^{N} σ_n b ) }
      = E_σ { sup_{Q∈Q} ( (1/N) Σ_{n=1}^{N} σ_n a Q(y_n) ) } + E_σ { (1/N) Σ_{n=1}^{N} σ_n b }    (the last term is zero)
      = |a| E_σ { sup_{Q∈Q} ( (1/N) Σ_{n=1}^{N} σ_n Q(y_n) ) }
      = |a| R̂_N(Q)        (64.165)

where |a| is used since the polarities of the {σ n } can be switched between +1 to −1. Any value for the sample average that is achieved using a can also be achieved using −a with the polarities of σ n switched. Thus, for all practical purposes, we can work with |a|. 
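For readers who prefer to see definition (64.162) in computational form, here is a minimal sketch (our own illustration; the loss family, data, and names are arbitrary placeholders) that estimates the empirical Rademacher complexity of a small finite family of loss functions evaluated on N margin values, again by Monte Carlo over the Rademacher signs.

    import numpy as np

    rng = np.random.default_rng(1)
    N, num_draws = 40, 4000

    # Margin values y_n and a small finite family Q of losses Q(y) (hinge losses at a few scales)
    y = rng.normal(size=N)
    loss_family = [lambda y, s=s: np.maximum(0.0, 1.0 - s * y) for s in (0.5, 1.0, 2.0)]

    # Precompute Q(y_n) for each loss; rows index the losses
    Q_vals = np.stack([Q(y) for Q in loss_family])      # shape (|Q|, N)

    # Empirical Rademacher complexity (64.162): E_sigma sup_Q (1/N) sum_n sigma_n Q(y_n)
    sigma = rng.choice([-1.0, 1.0], size=(num_draws, N))
    empirical_complexity = (sigma @ Q_vals.T / N).max(axis=1).mean()
    print(empirical_complexity)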

In preparation for the main result of this appendix showing how the Rademacher complexity leads to generalization bounds, we introduce some intermediate concepts and results.


Empirical and stochastic risks. With each loss function Q ∈ Q we associate two risk functions:

  E_y Q(y)                          (stochastic risk)        (64.166a)
  (1/N) Σ_{n=1}^{N} Q(y_n)          (empirical risk)         (64.166b)

where the first expression is the average loss value over the distribution of the data y, while the second expression is a sample average value obtained from a collection of N realizations {y_1, y_2, . . . , y_N}. Although under ergodicity these two quantities are expected to approach each other as N → ∞, they are nevertheless generally different for finite N. The difference between the two risks also varies with the choice of Q. We denote the worst-case difference by the notation:

  φ(y_1, . . . , y_N) ≜ sup_{Q∈Q} { E_y Q(y) − (1/N) Σ_{n=1}^{N} Q(y_n) }        (worst excess risk function)  (64.167)

The function φ(·) is dependent on the N variables {yn }. Bounded variations. We will assume that the loss functions Q(y) assume values in some bounded interval, namely, Q(y) : IR → [a, b] with a < b. We denote the width of this interval by ∆

d = b−a

(64.168)

Alternatively, we can set d = sup_y |Q(y)|. It then follows that the excess risk function φ(·) will have bounded variations (i.e., if any of its entries changes, the function will change by a bounded amount). Specifically, if y_m changes to y'_m for any entry of index m, it will hold that

  | φ(y_{n≠m}, y_m) − φ(y_{n≠m}, y'_m) | ≤ d/N        (64.169)

for all {y_n, n ≠ m}.

Proof of (64.169): To simplify the notation, we let

  Y   = {y_1, . . . , y_{m−1}, y_m, y_{m+1}, . . . , y_N}        (64.170)
  Y_m = {y_1, . . . , y_{m−1}, y'_m, y_{m+1}, . . . , y_N}        (64.171)

where Y_m denotes the collection of observations with y_m replaced by y'_m, while all other entries remain unchanged. Let Q*(·) be the function that attains the supremum in (64.167) with the observations {y_1, . . . , y_N} so that

  φ(Y) = E_y Q*(y) − (1/N) Σ_{n∈Y} Q*(y_n)        (64.172)


Then, the desired result follows from the following sequence of inequalities:

  |φ(Y) − φ(Y_m)|
      = | E_y Q*(y) − (1/N) Σ_{n∈Y} Q*(y_n) − sup_{Q∈Q} { E_y Q(y) − (1/N) Σ_{n∈Y_m} Q(y_n) } |
    (a)
      ≤ | E_y Q*(y) − (1/N) Σ_{n∈Y} Q*(y_n) − ( E_y Q*(y) − (1/N) Σ_{n∈Y_m} Q*(y_n) ) |
    (b)
      = (1/N) | Q*(y_m) − Q*(y'_m) |
      ≤ d/N

where step (a) is because we employed the suboptimal Q*(·) in the rightmost supremum operation, and step (b) is because the sets Y and Y_m differ by a single entry. □

One useful consequence of the bounded variation property (64.169) is that we can bound how close the risk difference φ(Y) gets to its mean value. For this purpose, we appeal to the McDiarmid inequality (3.259a) and note that for any given δ > 0:

  P( φ(Y) − E_y φ(Y) ≥ δ ) ≤ e^{ −2δ^2 / Σ_{n=1}^{N} (d/N)^2 } = e^{ −2Nδ^2/d^2 }        (64.173)

Thus, assume that we wish to determine the value of δ such that φ(Y) is δ-close to its mean E_y φ(Y) with probability 1 − ε. Then, setting

  e^{−2Nδ^2/d^2} ≤ ε        (64.174)

we can solve for δ:

  δ ≥ d √( (1/(2N)) ln(1/ε) )        (64.175)

Substituting into (64.173) we conclude that with high probability of at least 1 − ε:

  φ(Y) ≤ E_y φ(Y) + d √( (1/(2N)) ln(1/ε) )        (64.176)
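As a small arithmetic check of the confidence radius (64.175), the sketch below (our own illustration; d and eps are placeholder values) evaluates d √(ln(1/ε)/(2N)) for a few sample sizes, showing the familiar 1/√N shrinkage of the term that will reappear in the generalization bounds further ahead.

    import math

    def mcdiarmid_radius(d, N, eps):
        # Confidence radius in (64.175): d * sqrt( ln(1/eps) / (2N) )
        return d * math.sqrt(math.log(1.0 / eps) / (2.0 * N))

    for N in (100, 1000, 10000):
        print(N, round(mcdiarmid_radius(d=1.0, N=N, eps=0.05), 4))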

Bounding the average risk. It turns out that the mean quantity E_y φ(Y) in the above expression can be bounded by the Rademacher complexity of the set Q as follows:

  E_y φ(Y) ≤ 2 R_N(Q)        (64.177)

Proof: We follow steps similar to the proof of theorem 8 in Bartlett and Mendelson (2002); see also Shalev-Shwartz and Ben-David (2014, ch. 26) and Mohri, Rostamizadeh, and Talwalkar (2018, ch. 3). We introduce a fictitious collection of samples {y'_1, y'_2, . . . , y'_N} and denote it by Y'. This set consists of realizations of a random variable y' with the same distribution as y, except that the realizations {y'_n} are chosen independently of the original realizations {y_n}. Then, it is clear that

  E_{y'} { (1/N) Σ_{n=1}^{N} Q(y'_n) } = E_y Q(y)        (64.178)


so that

  E_y φ(Y)
    (64.167)= E_y { sup_{Q∈Q} ( E_y Q(y) − (1/N) Σ_{n=1}^{N} Q(y_n) ) }
    (64.178)= E_y { sup_{Q∈Q} E_{y'} [ (1/N) Σ_{n=1}^{N} Q(y'_n) − (1/N) Σ_{n=1}^{N} Q(y_n) ] }
    (a)≤ E_{y,y'} { sup_{Q∈Q} ( (1/N) Σ_{n=1}^{N} ( Q(y'_n) − Q(y_n) ) ) }        (64.179)

where step (a) uses the Jensen inequality (8.77) and the fact that the sup function is convex (see Prob. 64.32). Now note that since {y_n, y'_n} are equally distributed and independent of each other, the value of the last expectation will not change if we switch the roles of y_n and y'_n for any n. In other words, it will hold that

  E_{y,y'} { sup_{Q∈Q} ( (1/N) Σ_{n=1}^{N} ( Q(y'_n) − Q(y_n) ) ) }
      = E_{σ,y,y'} { sup_{Q∈Q} ( (1/N) Σ_{n=1}^{N} σ_n ( Q(y'_n) − Q(y_n) ) ) }        (64.180)

where we have incorporated the Rademacher parameters {σ_n} on the right-hand side; recall that they have zero mean and are chosen uniformly from {±1}. We can therefore write

  E_y φ(Y) ≤ E_{σ,y,y'} { sup_{Q∈Q} ( (1/N) Σ_{n=1}^{N} σ_n ( Q(y'_n) − Q(y_n) ) ) }
      ≤ E_{σ,y'} { sup_{Q∈Q} ( (1/N) Σ_{n=1}^{N} σ_n Q(y'_n) ) } + E_{σ,y} { sup_{Q∈Q} ( (1/N) Σ_{n=1}^{N} σ_n Q(y_n) ) }
      = R_N(Q) + R_N(Q)
      = 2 R_N(Q)        (64.181)  □

Main generalization theorem

We are now ready to establish the main result, which relates the Rademacher measure of complexity to the generalization ability of learning algorithms. The bounds below, which are due to Koltchinskii and Panchenko (2000, 2002) and Bartlett and Mendelson (2002), show that, with high probability, the stochastic risk of a learning algorithm will be close to its empirical risk by an amount that depends on the Rademacher complexity. One main difference between the two bounds shown in the statement is that the second result (64.182b) is data-dependent; it is stated in terms of the empirical complexity R̂_N(Q), which in principle can be estimated from the data observations {y_1, . . . , y_N}. This is in contrast to the first bound, which employs the actual complexity R_N(Q); its computation requires averaging over the data distribution, y ∼ f_y(y).

One-sided generalization bounds (Koltchinskii and Panchenko (2000, 2002), Bartlett and Mendelson (2002)): Consider a set of loss functions Q ∈ Q with each Q(y) : IR → [a, b]. Let d = b − a. Then, for every Q ∈ Q and with high probability of at least 1 − ε,


either of the following bounds holds in terms of the regular or empirical Rademacher complexity:

  E_y Q(y) ≤ (1/N) Σ_{n=1}^{N} Q(y_n) + 2 R_N(Q) + d √( (1/(2N)) ln(1/ε) )        (64.182a)
  E_y Q(y) ≤ (1/N) Σ_{n=1}^{N} Q(y_n) + 2 R̂_N(Q) + 3d √( (1/(2N)) ln(2/ε) )        (64.182b)
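Before turning to the proof, it may help to see how the data-dependent bound (64.182b) is assembled in practice. The sketch below (our own illustration; the numbers and names are placeholders) combines an empirical-risk value, a Monte-Carlo estimate of R̂_N(Q) such as the one shown earlier, and the confidence term into the right-hand side of (64.182b).

    import math

    def one_sided_bound(empirical_risk, empirical_rademacher, d, N, eps):
        # Right-hand side of (64.182b): emp. risk + 2*hat{R}_N(Q) + 3d*sqrt(ln(2/eps)/(2N))
        confidence_term = 3.0 * d * math.sqrt(math.log(2.0 / eps) / (2.0 * N))
        return empirical_risk + 2.0 * empirical_rademacher + confidence_term

    # Example with placeholder values: losses in [0,1] (d=1), N=2000 samples, 95% confidence
    print(one_sided_bound(empirical_risk=0.12, empirical_rademacher=0.05, d=1.0, N=2000, eps=0.05))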

Proof: We put together several of the results derived so far to note that

  E_y Q(y)
      = (1/N) Σ_{n=1}^{N} Q(y_n) + ( E_y Q(y) − (1/N) Σ_{n=1}^{N} Q(y_n) )
    (64.167)≤ (1/N) Σ_{n=1}^{N} Q(y_n) + φ(y_1, . . . , y_N)
    (64.176)≤ (1/N) Σ_{n=1}^{N} Q(y_n) + E_y φ(Y) + d √( (1/(2N)) ln(1/ε) )
    (64.177)≤ (1/N) Σ_{n=1}^{N} Q(y_n) + 2 R_N(Q) + d √( (1/(2N)) ln(1/ε) )        (64.183)

which establishes (64.182a). To establish the second inequality, we first note that it is straightforward to verify that the empirical Rademacher complexity R̂_N(Q) satisfies the bounded-variations property with the same bound d/N as φ(·). It then follows from the second McDiarmid inequality (3.259b) that, for any δ > 0:

  P( | R̂_N(Q) − R_N(Q) | ≥ δ ) ≤ 2 e^{−2Nδ^2/d^2}        (64.184)

We can determine the value of δ that ensures R̂_N(Q) is δ-close to its mean R_N(Q) with probability 1 − ε. Then, setting

  2 e^{−2Nδ^2/d^2} ≤ ε        (64.185)

1 ln(2/) 2N

(64.186)

Substituting into (64.184) we find that with high probability of at least 1 − : r 1 b N (Q) + d RN (Q) ≤ R ln(2/) (64.187) 2N Using this bound in (64.183) leads to (64.182b).  Example 64.9 (Application to the 0/1-loss and VC dimension) Consider N feature vectors {h1 , . . . , hN } with binary labels γ(n) ∈ {±1}. Assume we choose Q(y) as the 0/1-loss defined by  1, y ≤ 0 Q(y) = I[y ≤ 0] = (64.188) 0, y > 0

2710

Generalization Theory

where, in this example, y is the margin variable defined as y = γc(h). Note that we can relate the classifier and the loss function more explicitly as follows: Q(y) =

 1 1 − γ c(h) 2

(64.189)

In this way, we have one loss function Q(y) associated with each binary classifier c(h). It then follows from property (64.164a) that RN (Q) =

1 RN (C) 2

(64.190)

In particular, using (64.161), we find that, with high likelihood, the error probability (which is equal to E Q(y)) for any classifier c ∈ C designed under the 0/1-loss is bounded by ! r N  Ne  r 1 1 2 1 X I[γ(n)b γ (n)] ≤ VC ln + ln (64.191) sup Pe − N n=1 N VC 2N  c∈C We √ can √ the last two terms by using the easily verifiable algebraic inequality √ group x + y ≤ 2 x + y for any x, y ≥ 0. Therefore, we find that N 1 X sup Pe − I[γ(n)b γ (n)] N n=1 c∈C

!

v ( ) u  Ne  1 1 u8 t ≤ VC ln + ln N VC 4 

A similar argument using the two-sided inequality (64.197a) would lead to v ( ) N u  Ne  1 2 X 1 8 u t sup Pe − I[γ(n)b γ (n)] ≤ VC ln + ln N n=1 N VC 4  c∈C

(64.192)

(64.193)

where the rightmost term resembles the form we encountered earlier in (64.13) but provides a tighter bound. We further remark that the quantities appearing on the left-hand side play a role similar to the risks defined in the body of the chapter: R(c) = Pe , Example 64.10

Remp (c) =

N 1 X I[γ(n)b γ (n)] N n=1

(64.194)

(Application to the hinge loss) Consider next the hinge loss function Q(w; γ, h) = max{0, 1 − γb γ }, γ b = hT w

(64.195)

and the class of prediction functions used to generate γ b with vectors chosen from the set W = {w | kwk2 ≤ 1}. Assume the feature data lies within khk2 ≤ R. With γ ∈ {±1} fixed, the loss function Q(γ, γ b) is seen to be 1-Lipschitz with respect to the argument γ b. Thus, using (64.182a) and the result of Probs. 64.34 and 64.35, we conclude that N n o 2δR 1 X E y Q(w; γ, h) ≤ max 0, 1 − γ(n)b γ (n) + √ + d N n=1 N

r

1 1 ln 2N 

(64.196)

Additional examples are given in Bartlett and Mendelson (2002, sec. 4), Boucheron, Bousquet, and Lugosi (2005, sec. 4), and Shalev-Shwartz and Ben-David (2014, ch. 26).

References

2711

Similar arguments can be repeated to establish two-sided versions of the generalization bounds listed before. We leave the details to Prob. 64.36. Two-sided generalization bounds (Koltchinskii and Panchenko (2000, 2002), Bartlett and Mendelson (2002)): Consider a set Q ∈ Q of loss functions with each Q(y) : IR → [a, b]. Let d = b − a. Then, with probability of at least 1 − , either of the following bounds holds in terms of the empirical or regular Rademacher complexity: r N 1 X 1 sup E y Q(y) − Q(yn ) ≤ 2 RN (Q) + d ln(2/) N 2N q∈Q n=1 r N 1 X 1 b sup E y Q(y) − ln(4/) Q(yn ) ≤ 2 RN (Q) + 3d N n=1 2N q∈Q

(64.197a)

(64.197b)

REFERENCES Abu-Mostafa, Y. S., M. Magdon-Ismail, and H.-T. Lin (2012), Learning from Data, AMLBook.com. Alon, N., S. Ben-David, N. Cesa-Bianchi, and D. Haussler (1997), “Scale-sensitive dimensions, uniform convergence, and learnability,” J. ACM, vol. 44, no. 4, pp. 615–631. Antos, A., B. Kégl, T. Linder, and G. Lugosi (2002), “Data-dependent margin-based generalization bounds for classification,” J. Mach. Learn. Res., vol. 3, pp. 73–98. Bartlett, P. L., S. Boucheron, and G. Lugosi (2001), “Model selection and error estimation,” Mach. Learn., vol. 48, pp. 85–113. Bartlett, P., O. Bousquet, and S. Mendelson (2005), “Local Rademacher complexities,” Ann. Statist., vol. 33, no. 4, pp. 1497–1537. Bartlett, P. L., M. I. Jordan, and J. D. McAuliffe (2006), “Convexity, classification, and risk functions,” J. Amer. Statist. Assoc., vol. 101, no. 473, pp. 138–156. Bartlett, P. L. and S. Mendelson (2002), “Rademacher and Gaussian complexities: Risk bounds and structural results,” J. Mach. Learn. Res., vol. 3, pp. 463–482. Bellman, R. E. (1957a), Dynamic Programming, Princeton University Press. Also published in 2003 by Dover Publications. Blumer, A., A. Ehrenfeucht, D. Haussler, and M. K. Warmuth (1989), “Learnability and the Vapnik–Chervonenkis dimension,” J. ACM, vol. 36, no. 4, pp. 929–965. Boucheron, S., O. Bousquet, and G. Lugosi (2005), “Theory of classification: A survey of recent advances,” ESAIM: Probab. Statist., vol. 9, pp. 323–375. Breiman, L. (1994), “Heuristics of instability in model selection,” Ann. Statist., vol. 24, no. 6, pp. 2350–2383. Breiman, L. (1996a), “Stacked regressions,” Mach. Learn., vol. 24, no. 1, pp. 41—64. Breiman, L. (1996b), “Bagging predictors,” Mach. Learn., vol. 24, no. 2, pp. 123–140. Cantelli, F. P. (1933), “Sulla determinazione empirica delle leggi di probabilita,” Giorn. Ist. Ital. Attuari, vol. 4, pp. 221–424. Cherkassky, V. and F. M. Mulier (2007), Learning from Data: Concepts, Theory, and Methods, 2nd ed., Wiley. Chernoff, H. (1952), “A measure of asymptotic efficiency of tests of a hypothesis based on the sum of observations, ” Ann. Math. Statist., vol. 23, pp. 493–507. Cover, T. M. (1968), “Estimation by the nearest neighbor rule,” IEEE Trans. Inf. Theory, vol. 14, pp. 21–27. Cucker, F. and S. Smale (2002), “On the mathematical foundation of learning,” Bull. Amer. Math. Soc., vol. 39, pp. 1–49.

2712

Generalization Theory

Devroye, L. (1982), “Necessary and sufficient conditions for the almost everywhere convergence of nearest neighbor regression function estimates,” Z. Wahrscheinlichkeitstheorie verw. Gebiete, vol. 61, pp. 467–481. Devroye, L., L. Gyorfi, and G. Lugosi (1996), A Probabilistic Theory of Pattern Recognition, Springer. Domingos, P. (2000), “A unified bias–variance decomposition,” Proc. Int. Conf. Machine Learning (ICML), pp. 231–238, Stanford, CA. Dudley, R. M. (1978), “Central limit theorems for empirical measures,” Ann. Probab., vol. 6, pp. 899–929. Dudley, R. M. (1999), Uniform Central Limit Theorems, Cambridge University Press. Dudley, R. M., E. Gine, and J. Zinn (1991), “Uniform and universal Glivenko–Cantelli classes,” J. Theoret. Probab., vol. 4, no. 3, pp. 485–510. Fernandez-Delgado, M., E. Cernadas, S. Barro, and D. Amorim (2014), “Do we need hundreds of classifiers to solve real world classification problems?,” J. Mach. Learn. Res., vol. 15, pp. 3133–3181. Friedman, J. H. (1997), “On bias, variance, 0/1 loss, and the curse-of-dimensionality,” Data Mining Knowl. Discov., vol. 1, pp. 55–77. Fukunaga, K. (1990), Introduction to Statistical Pattern Recognition, 2nd ed., Academic Press. German, S., E. Bienenstock and R. Doursat (1992), “Neural networks and the bias variance dilemma,” Neural Comput., vol. 4, pp. 1–58. Geurts, P. (2005), “Bias vs variance decomposition for regression and classification,” in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, editors, pp. 749–763, Springer. Glivenko, V. (1933), “Sulla determinazione empirica della legge di probabilita,” Giorn. Ist. Ital. Attuari, vol. 4, pp. 92–99. Hastie, T., R. Tibshirani, and J. Friedman (2009), The Elements of Statistical Learning, 2nd ed., Springer. Hoeffding, W. (1963), “Probability inequalities for sums of bounded random variables,” J. Amer. Statist. Assoc., vol. 58, pp. 13–30. Hughes, G. F. (1968), “On the mean accuracy of statistical pattern recognizers,” IEEE Trans. Inf. Theory, vol. 14, no. 1, pp. 55–63. James, G. (2003), “Variance and bias for general loss functions,” Mach. Learn., vol. 51, pp. 115–135. James, G. and T. Hastie (1997), “Generalizations of the bias/variance decomposition for prediction error,” Technical Report, Department of P Statistics, Stanford University. Kahane, J.-P. (1964), “Sur les sommes vectorielles ±un ,” Comptes Rendus Hebdomadaires des S’eances de l’Académie des Sciences, vol. 259, pp. 2577–2580. Kakade, S., K. Sridharan, and A. Tewari (2008), “On the complexity of linear prediction: Risk bounds, margin bounds, and regularization,” Proc. Advances Neural Information Processing Systems (NIPS), pp. 1–11, Vancouver. Kearns, M. and U. Vazirani (1994), An Introduction to Computational Learning Theory, MIT Press. Khintchine, A. (1923), “Über dyadische brüche,” Mathematische Zeitschrift, vol. 18, no. 1, pp. 109–116. Kohavi, R. and D. H. Wolpert (1996), “Bias plus variance decomposition for zero-one loss functions,” Proc. Int. Conf. Machine Learning (ICML), pp. 275–283, Tahoe City, CA. Koltchinskii, V. (2001), “Rademacher penalties and structural risk minimization,” IEEE Trans. Inf. Theory, vol. 47, pp. 1902–1914. Koltchinskii, V. and D. Panchenko (2000), “Rademacher processes and bounding the risk of function learning,” in High Dimensional Probability II, pp. 443–457, Springer. Koltchinskii, V. and D. Panchenko (2002), “Empirical margin distributions and bounding the generalization error of combined classifiers,” Ann. Statist., vol. 30, no. 1, pp. 1–50.

References

2713

Kong, E. B. and T. G. Dietterich (1995), “Error-correcting output coding corrects bias and variance,” Proc. Int. Conf. Machine Learning (ICML), pp. 313–321, Tahoe City, CA. Kulkarni, S., G. Lugosi, and S. Venkatesh (1998), “Learning pattern classification: A survey,” IEEE Trans. Inf. Theory, vol. 44, no. 6, pp. 2178–2206. Latala, R. and K. Oleszkiewicz (1994), “On the best constant in the Khintchine–Kahane inequality,” Studia Math, vol. 109, no. 1, pp. 101–104. Ledoux, M. and M. Talagrand (1991), Probability in Banach Spaces, Springer. Massart, P. (2000), “Some applications of concentration inequalities to statistics,” Annales de la Faculté des Sciences de Toulouse, vol. 9, no. 2, pp. 245–303. Massart, P. (2007), Concentration Inequalities and Model Selection, Springer. McDiarmid, C. (1989), “On the method of bounded differences,” in Surveys in Combinatorics, J. Siemons, editor, pp. 148–188, Cambridge University Press. Mendelson, S. (2002), “Improving the sample complexity using global data,” IEEE Trans. Inf. Theory, vol. 48, pp. 1977–1991. Mohri, M., A. Rostamizadeh, and A. Talwalkar (2018), Foundations of Machine Learning, 2nd ed., MIT Press. Okamoto, M. (1958), “Some inequalities relating to the partial sum of binomial probabilities,” Ann. Inst. Statist. Math., vol. 10, pp. 29–35. Pollard, D. (1984), Convergence of Stochastic Processes, Springer. Radon, J. (1921), “Mengen konvexer Körper, die einen gemeinsamen Punkt enthalten,” Mathematische Annalen, vol. 83, no. 1–2, pp. 113–115. Rosasco, L., E. De Vito, A. Caponnetto, M. Piana, and A. Verri (2004), “Are loss functions all the same?” Neural Comput., vol. 16, no. 5, pp. 1063–1076. Sauer, N. (1972), “On the density of families of sets,” J. Combinat. Theory Ser. A, vol. 13, pp. 145–147. Schaffer, C. (1994), “A conservation law for generalization performance,” in Proc. Int. Conf. Machine Learning (ICML), pp. 259–265, New Brunswick, NJ. Shalev-Shwartz, S. and S. Ben-David (2014), Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press. Shelah, S. (1972), “A combinatorial problem: Stability and order for models and theories in infinitary languages,” Pacific J. Math., vol. 41, pp. 247–261. Stone, C. (1977), “Consistent nonparametric regression,” Ann. Statist., vol. 5, pp. 595– 645. Tibshirani, R. (1996a), “Bias, variance, and prediction error for classification rules,” Technical Report, Department of Preventive Medicine and Biostatistics and Department of Statistics, University of Toronto. Tomczak-Jaegermann, N. (1989), Banach-Mazur Distance and Finite-Dimensional Operator Ideals, Pitman. Valiant, L. (1984), “A theory of the learnable,” Commun. ACM, vol. 27, pp. 1134–1142. van der Vaart, A. W. and J. A. Wellner (1996), Glivenko–Cantelli Theorems, Springer. Vapnik, V. N. (1995), The Nature of Statistical Learning Theory, Springer. Vapnik, V. N. (1998), Statistical Learning Theory, Wiley. Vapnik, V. N. (1999), “An overview of statistical learning theory,” IEEE Trans. Neural Netw., vol. 10, pp. 988–999. Vapnik, V. N. and A. Y. Chervonenkis (1968), “On the uniform convergence of relative frequencies of events to their probabilities,” Doklady Akademii Nauk. SSSR vol. 181, no. 4, pp. 781–783 (in Russian). Vapnik, V. N. and A. Y. Chervonenkis (1971), “On the uniform convergence of relative frequencies of events to their probabilities,” Theory Probab. Appl., vol. 16, no. 2, pp. 264–280. Vidyasagar, M. (1997), A Theory of Learning and Generalization, Springer. Wainwright, M. J. 
(2019), High-Dimensional Statistics: A Non-Asymptotic Viewpoint, Cambridge University Press. Wolff, T. H. (2003), Lectures on Harmonic Analysis, vol. 29, American Mathematical Society.

2714

Generalization Theory

Wolpert, D. H. (1992), “On the connection between in-sample testing and generalization error,” Complex Syst., vol. 6, pp. 47–94. Wolpert, D. H. (1996), “The lack of a priori distinctions between learning algorithms,” Neural Comput., vol. 8, no. 7, pp. 1341–1390. Wolpert, D. H. and W. G. Macready (1997), “No free lunch theorems for optimization,” IEEE Trans. Evol. Comput., vol. 1, no. 1, pp. 67–82.

65 Feedforward Neural Networks

We illustrated in Example 63.2 one limitation of linear separation surfaces by considering the XOR mapping (63.11). The example showed that certain feature spaces are not linearly separable and cannot be resolved by the perceptron algorithm. The result in the example was used to motivate one powerful approach to nonlinear separation surfaces by means of kernel methods. In this chapter we describe a second powerful and popular method, based on training feedforward neural networks. These are also called multilayer perceptrons and even deep networks, depending on the size and number of their layers. We revisit the XOR mapping in Prob. 65.1 and show how a simple network of this type can separate the features. Feedforward neural networks are layered structures of interconnected units called neurons, each of which will be a modified version of the perceptron. A neural network will consist of: (a) one input layer, where the feature vector, h ∈ IRM , is applied; (b) one output layer, where the predicted label, now represented by a vector γ b ∈ IRQ , is read from; and (c) several hidden layers in between the input and output layers. The net effect is a nonlinear mapping from the input space, h, to the output space γ b: h ∈ IRM N→ γ b ∈ IRQ

(65.1)

We will describe several iterative procedures for learning the internal parameters of this mapping from training data {γn , hn }. In future chapters, we will describe alternative neural network architectures and their respective training algorithms, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), and long short-term memory (LSTMs) networks, where some parameters are shared across nodes and/or layers. Besides the ability of feedforward neural networks to model nonlinear mappings from h to γ b, one other notable difference in relation to the learning algorithms considered so far in the text is that neural networks can deal with classification problems where the scalar binary class variable γ ∈ {+1, N1} is replaced by a vector class variable γ ∈ {+1, N1}Q whose entries are again binary. This level of generality allows us to solve multiclass and multilabel classifica-

2716

Feedforward Neural Networks

tion problems directly without the need to resort to one-versus-all (OvA) or one-versus-one (OvO) constructions. For instance, in multiclass problems, the feature vector h can belong to one of a collection of Q classes (such as deciding whether h represents cats, dogs, or elephants), and the label vector γ will contain +1 in the location corresponding to the correct class and N1 in the remaining entries. In multilabel classification problems, on the other hand, the feature vector h can reveal several properties simultaneously (such as representing a male individual with high-school education who likes mystery movies). In this case, the entries of γ corresponding to these properties will be +1 while the remaining entries will be N1, for example: 

  γ=  

N1 N1 N1 +1 N1



   (multiclass problem),  



  γ=  

+1 N1 N1 +1 +1



   (multilabel problem)  

(65.2) In one of the most common implementations of neural networks, information flows forward from the input layer into the successive hidden layers until it reaches the output layer. This type of implementation is known as a feedforward structure. There are other implementations, known as feedback or recursive structures, where signals from later layers feed back into neurons in earlier layers. We will encounter examples of these in the form of RNN and LSTM networks in a future chapter. We focus here on feedforward structures, which are widely used in applications; they also exhibit a universal approximation ability, as explained in the comments at the end of the chapter – see expression (65.189).

65.1

ACTIVATION FUNCTIONS The basic unit in a neural network is the neuron shown in Fig. 65.1 (left). It consists of a collection of multipliers, one adder, and a nonlinearity. The input to the first multiplier is fixed at +1 and its coefficient is denoted by Nθ, which represents an offset parameter. The coefficients for the remaining multipliers are denoted by w(m) and their respective inputs by h(m). If we collect the input and scaling coefficients into column vectors: n o h = col h(1), h(2), . . . , h(M ) (65.3a) n o w = col w(1), w(2), . . . , w(M ) (65.3b) then the output of the adder is the affine relation: ∆

z = hT w N θ

(65.4)

65.1 Activation Functions

2717

where we are using the letter “z” to refer to the result of this calculation. This signal is subsequently fed into a nonlinearity, called the activation function, to generate the output signal y: ∆

y = f (z) = f (hT w N θ)

(65.5)

On the right-hand side of the same figure, we show two compact representations for neurons. The only difference is the additional arc that appears inside the circle in the top right corner. This arc is used to indicate the presence of a nontrivial activation function. This is because, sometimes, the neuron may appear without the activation function (i.e., with f (z) = z), in which case it will simply operate as a pure linear combiner.

y = f (hT w



+

z = hT w



✓)

y = f (z)

z = hT w



Figure 65.1 (Left) Structure of a neuron consisting of an offset parameter −θ, and M

multipliers with weights {w(m)} and input signals {h(m)}, followed by an adder with output z and a nonlinearity y = f (z). (Right) Compact representations for the neuron in terms of a circle with multiple input lines and one output line. Two circle representations are used to distinguish between the cases when the nonlinearity is present or not (i.e., whether f (z) = z or not). When a nonlinearity is present, we will indicate its presence by an arc inside the circular representation, as shown in the top right corner.

Sigmoid and tanh functions There are several common choices for the activation function f (z), listed in Table 65.1, with some of them illustrated in Fig. 65.2. We encountered the sigmoid function earlier in (59.5a) while discussing the logistic regression problem. One useful property of the sigmoid function is that its derivative admits the representation f 0 (z) = f (z)(1 N f (z))

(sigmoid function)

(65.6)

We also encountered the hyperbolic tangent function earlier in (27.33) while studying the optimal mean-square-error (MSE) inference problem. Its derivative is given by any of the forms:

2718

Feedforward Neural Networks

f 0 (z) = 1/ cosh2 (z) = 4/(ez + e−z )2 = 1 N (tanh(z))2

(tanh function)

(65.7)

The sigmoid and tanh functions are related via the translation 1 1 = (tanh (z/2) + 1) ⇐⇒ tanh(z/2) = 2 sigmoid(z) N 1 1 + e−z 2

(65.8)

and satisfy (sigmoid function) (tanh function)

lim f (z) = 1,

z→+∞

lim f (z) = 1,

z→+∞

lim f (z) = 0

(65.9a)

lim f (z) = N1

(65.9b)

z→−∞ z→−∞

That is, both functions saturate for large |z|. This means that when |z| is large, the derivatives of the sigmoid and tanh functions will assume small values close to 0. We will explain in Section 65.8 that this property is problematic and is responsible for a slowdown in the speed of learning by neural networks. This is because small derivative values at any neuron will end up limiting the learning ability of the neurons in the preceding layers. The scaled hyperbolic tangent function, f (z) = a tanh(bz), maps the real axis to the interval [Na, a] and, therefore, it saturates at ±a for large values of z. Typical choices for the parameters (a, b) are b=

2 , 3

a=

1 ≈ 1.7159 tanh(2/3)

(65.10)

With these values, one finds that f (±1) = ±1. In other words, when the input value z approaches ±1 (which are common values in binary classification problems), then the scaled hyperbolic tangent will assume the same values ±1, which are sufficiently away from the saturation levels of ±1.7159.

ReLU and leaky ReLU functions In the rectifier (or hinge) function listed in Table 65.1 (also called a “rectified linear unit” or ReLU function), nonnegative values of z remain unaltered while negative values of z are set to zero. In this case, we set the derivative value at z = 0 to f 0 (0) = 0 by convention (we could also set it to 1, if desired):   0, z < 0 0 f (z) = ReLU(z) =⇒ f (z) = (65.11) 0, z = 0  1, z > 0

Compared with the sigmoid function, we observe that the derivative of ReLU is constant and equal to 1 for all positive values of z; this property will help speed up the training of neural networks and is one of the main reasons why ReLU activation functions are generally preferred over sigmoid functions. ReLU

65.1 Activation Functions

1

sigmoid

1

tanh

6

2719

rectifier+softplus

5

0.8

0.5 4

0.6 0

3

0.4 2

softplus

-0.5

0.2

1

rectifier 0 -5

-1 0

5

-5

derivative of sigmoid

0

5

0 -5

0

5

derivatives

derivative of tanh

0.25

1

1

0.2

0.8

0.8

0.15

0.6

0.6

0.1

0.4

0.4

0.05

0.2

0.2

softplus

0 -5

0

5

0 -5

0

5

0 -5

rectifier 0

5

Figure 65.2 Examples of activation functions and their derivatives. (Left) Sigmoid

function. (Center) Hyperbolic tangent function. (Right) Rectifier and softplus functions.

functions are also easy to implement and do not require the exponentiation operation that appears in the sigmoid implementation. Unfortunately, the derivative of the ReLU function is zero for negative values of z, which will affect training when internal values in the network drop below zero. These nodes will not be able to continue learning and recover from their state of negative z-values. This challenge is referred to as the “dying ReLU” problem. The softplus function provides a smooth approximation for the rectifier function, tending to zero gracefully as z → N∞. The leaky ReLU version, on the other hand, incorporates a small positive gradient for z < 0:   0.01, z < 0 f (z) = leaky ReLU(z) =⇒ f 0 (z) = (65.12) 0, z=0  1, z>0

The exponential linear unit (ELU) also addresses the problem over negative z by incorporating an exponential decay term such that as z → N∞, the function ELU(z) will tend to Nα where α > 0. The value of α can be selected through a (cross validation) training process by simulating the performance of the neural network with different choices for α. The rectifier functions are widely used within

2720

Feedforward Neural Networks

Table 65.1 Typical choices for the activation function f (z) used in (65.5). Activation function

f (z)

sigmoid or logistic

f (z) =

softplus

1 1 + e−z f (z) = ln(1 + ez )

hyperbolic tangent (tanh)

f (z) = tanh(z) =

scaled tanh

ez − e−z ez + e−z f (z) = a tanh(bz), a, b > 0 ∆

f (z) = max {0, z}  z, z≥0 f (z) = 0.01z, z < 0  z, z≥0 f (z) = α(ez − 1), z < 0 f (z) = z

ReLU (hinge) leaky ReLU ELU no activation

hidden layers in the training of deep and convolutional neural networks. ELU activation functions have been observed to lead to neural networks with higher classification performance than ReLUs.

Softmax activation The activation functions shown in Table 65.1 act on scalar arguments z; these arguments are the internal signals within the various neurons. We will encounter another popular activation function known as softmax activation, and written compactly as y = softmax(z). This activation will be used at the output layer of the neural network, which has Q neurons, and it will operate simultaneously on all z-values from these neurons denoted by {z(q)}, for q = 1, 2, . . . , Q. Softmax transforms these values into another set of Q values constructed as:  −1 Q X ∆ z(q)  z(q 0 )  , q = 1, 2, . . . , Q (65.13) y(q) = e e q 0 =1

Observe that each y(q) is influenced by all {z(q)}. The exponentiation and normalization in the denominator ensure that the output variables {y(q)} are all nonnegative and add up to 1. In this way, the softmax transformation generates a Gibbs probability distribution – recall (3.168). We explained earlier in Remark 36.2 that some care is needed in the numerical implementation of the softmax procedure due to the exponentiation in (65.13) and the possibility of overflow or underflow. For example, if the number in the exponent has a large value, then y(q) can saturate in finite-precision implementations. One way to avoid this difficulty is to subtract from all the {z(q)} their largest value and introduce the centered variables:   ∆ 0 s(q) = z(q) N max z(q ) , q = 1, 2, . . . , Q (65.14) 0 1≤q ≤Q

65.2 Feedforward Networks

2721

By doing so, the largest value of the {s(q)} will be zero. It is easy to see that using the {s(q)} instead of the {z(q)} does not change the values of the {y(q)}:  −1 Q X 0 y(q) = es(q)  es(q )  (65.15) q 0 =1

Comparing with perceptron

In Fig. 65.3 we compare the structure of the sigmoidal neuron with perceptron, which uses the sign function for activation with its sharp discontinuous transition. Recall from (60.25) that perceptron predicts the label for a feature vector h by using γ b = hT w N θ

(65.16)

which agrees with the expression for z in the figure. Subsequently, the class for h is decided based on the sign of γ b. In other words, the perceptron unit operates on z by means of an implicit sign function. One of the key advantages of using continuous (smooth) activation functions in constructing neural networks, such as the sigmoid or tanh functions, over the discontinuous sign function, is that the resulting networks will respond more gracefully to small changes in their internal signals. For example, networks consisting solely of interconnected perceptron neurons can find their output signals change dramatically in response to small internal signal variations; this is because the outputs of the sign functions can change suddenly from N1 to +1 for slight changes at their inputs. Smooth activation functions limit this sensitivity.

+

+

plays role of b

perceptron neuron AAAB+nicbVDLSgMxFM3UV62vVpdugkXoqswUUZcFNy4r2Ae0Q8mkd9rQTDIkGaWM/RQ3LhRx65e482/MtLPQ1gMXDufcm9x7gpgzbVz32ylsbG5t7xR3S3v7B4dH5cpxR8tEUWhTyaXqBUQDZwLahhkOvVgBiQIO3WB6k/ndB1CaSXFvZjH4ERkLFjJKjJWG5UoM9qHYKCmwgERlWtWtuwvgdeLlpIpytIblr8FI0iQCYSgnWvc9NzZ+SpRhlMO8NEg0xIROyRj6lgoSgfbTxepzfG6VEQ6lsiUMXqi/J1ISaT2LAtsZETPRq14m/uf1ExNe+ykTcWJA0OVHYcKxkTjLAY+YAmr4zBJCFbO7YjohilBj0yrZELzVk9dJp1H3LusXd41qs5bHUUSn6AzVkIeuUBPdohZqI4oe0TN6RW/Ok/PivDsfy9aCk8+coD9wPn8AvAmUOw==

sigmoidal neuron

Figure 65.3 (Left) Neuron where the output of the linear combiner is smoothed

through a sigmoid activation function. (Right) Perceptron neuron where the output of the linear combiner is applied to a hard-thresholding sign function.

65.2

FEEDFORWARD NETWORKS We explain next how to combine several neurons to form a feedforward multilayer neural network, which we will subsequently train to solve classification

2722

Feedforward Neural Networks

and regression problems. In the feedforward implementation, information flows forward in the network and there is no loop to feed signals from future layers back to earlier layers. Figure 65.4 illustrates this structure for a network consisting of an input layer, three hidden layers, and an output layer. In this example, there are two output nodes in the output layer, denoted by γ b(1) and γ b(2), and three input nodes in the input layer, denoted by h(1), h(2), and h(3). Note that we are excluding the bias source +1 from the number of input nodes. There are also successively three, two, and three neurons in the hidden layers, again excluding the bias sources. The neurons in each layer are numbered, with the numbers placed inside the symbol for the neuron. We will be using the terminology “node” to refer to any arbitrary element in the network, whether it is a neuron or an input node. In this example, the nodes in the output layer employ activation functions. There are situations where output nodes may be simple combiners without activation functions (for example, when the network is applied to the solution of regression problems).

layer 1

input layer

layer 2

layer 3

3 hidden layers

layer 4

layer 5

output layer

Figure 65.4 A feedforward neural network consisting of an input layer, three hidden

layers, and an output layer. There are three input nodes in the input layer (excluding the bias source denoted by +1) and two output nodes in the output layer denoted by γ b(1) and γ b(2).

For convenience, we will employ the vector and matrix notation to examine how signals flow through the network. We let L denote the number of layers in the network, including the input and output layers. In the example of Fig. 65.4 we have L = 5 layers, three of which are hidden. Usually, large networks with many hidden layers are referred to as deep networks. For every layer ` = 1, 2, . . . , L, we

65.2 Feedforward Networks

2723

let n` denote the number of nodes in that layer (again, our convention excludes the bias sources from this count). For our example, we have n1 = 3, n2 = 3, n3 = 2, n4 = 3, n5 = 2

(65.17)

Now, between any two layers, there will be a collection of combination coefficients that scale the signals arriving from the nodes in the prior layer. For example, if we focus on layers 2 and 3, as shown in Fig. 65.5, these coefficients can be collected into a matrix W2 of size n2 × n3 with individual entries:  (2) (2)  w11 w12   ∆  (2) (2)  W2 =  w21 (65.18) w22  , (n2 × n3 )   (2)

w31

(2)

w32

layer 2

layer 3

T

Figure 65.5 Combination and bias weights between layers 2 and 3 for the network

shown in Fig. 65.4. The combination weights between nodes are collected into a matrix W2 of size n2 × n3 , while the bias weights are collected into a vector θ2 of size n3 × 1. (`)

In the notation (65.18), the scalar wij has the following interpretation: (`)

wij = weight from node i in layer ` to node j in layer ` + 1

(65.19)

In the expressions used to describe the operation of a neural network, and its training algorithms, we will also be dealing with the transpose of the weight matrix, which has size n3 × n2 and is given by

2724

Feedforward Neural Networks

∆ W2T =

"

(2)

w21

(2)

w22

w11

w12

(2)

w31

(2)

w32

(2) (2)

#

,

(n3 × n2 )

(65.20)

We also associate with layer 2 a bias vector of size n3 , containing the coefficients that scale the bias arriving into layer 3 from layer 2, with entries denoted by ∆

θ2 =

"

#

θ2 (1) θ2 (2)

, (n3 × 1)

(65.21)

where the notation θ` (j) has the following interpretation: N θ` (j) = weight from +1 bias source in layer ` to node j in layer ` + 1 (65.22) Figure 65.5 illustrates these definitions for the parameters linking layers 2 and 3. In the figure we are illustrating the common situation when every node from a preceding layer feeds into every node in the succeeding layer, thus leading to a fully connected interface between the layers. One can consider situations where only a subset of these connections are active (selected either at random or according to some policy); we will focus for now on the fully connected case and discuss later the dropout strategy where some of the links will be deactivated. Later, in Chapter 67, when we discuss CNNs, we will encounter other strategies for connecting nodes between successive layers. Figure 65.6 illustrates the four weight matrices, {W1T , W2T , W3T , W4T }, and four bias vectors, {θ1 , θ2 , θ3 , θ4 }, that are associated with the network of Fig. 65.4. More generally, the combination weights between two layers ` and ` + 1 will be collected into a matrix W`T of size n`+1 × n` and the bias weights into layer ` + 1 will be collected in a column vector θ` of size n`+1 × 1. Continuing with Fig. 65.5, we denote the outputs of the nodes in layer 3 by y3 (1) and y3 (2) and collect them into the vector: " # y3 (1) ∆ y3 = , (n3 × 1) (65.23) y3 (2) where the notation y` (j) has the following interpretation: y` (j) = output of node j in layer `

(65.24)

Likewise, the output vector for layer 2 is 

y2 (1)



 ∆   y2 =   y2 (2)  , y2 (3)

(n2 × 1)

(65.25)

65.2 Feedforward Networks

layer 1

layer 2

T

layer 3

T

layer 4

T

2725

layer 5

T

Figure 65.6 The combination weights between successive layers are collected into

matrices, W`T , for ` = 1, 2, . . . , L − 1. Likewise, the bias coefficients for each hidden layer are collected into vectors, θ` , for ` = 1, 2, . . . , L − 1.

Using the vector and matrix quantities so defined, we can now examine the flow of signals through the network. For instance, it is clear that the output vector for layer ℓ = 3 is given by

y_3 = f(W_2^T y_2 − θ_2)       (65.26)

in terms of the output vector for layer 2, and where the notation f(z) for a vector argument, z, means that the activation function is applied to each entry of z individually. For later use, we similarly collect the signals prior to the activation function at layer 3 into a column vector:

z_3 ≜ col{z_3(1), z_3(2)},   (n_3 × 1)       (65.27)

where the notation z_ℓ(j) has the following interpretation:

z_ℓ(j) = signal at node j of layer ℓ prior to the activation function       (65.28)

That is,

y_3(1) = f(z_3(1)),   y_3(2) = f(z_3(2))       (65.29)

If we now let γ̂ = col{γ̂(1), γ̂(2)} denote the column vector that collects the outputs of the neural network, we then arrive at the following description for the flow of signals through a feedforward network – this flow is depicted schematically in Fig. 65.7. In this description, the vectors {z_ℓ, y_ℓ} denote the pre- and post-activation signals for the internal layer of index ℓ. For the output layer, we will interchangeably use either the notation {z_L, y_L} or the notation {z, γ̂} for the pre- and post-activation signals. We will also employ the following compact notation to refer to the forward recursions (65.31), which feed a feature vector h into a network with parameters {W_ℓ, θ_ℓ} and generate the output signals (z, γ̂) along with the intermediate signals {z_ℓ, y_ℓ} at the internal layers:

(γ̂, z, {y_ℓ, z_ℓ}) = forward(h, {W_ℓ, θ_ℓ})       (65.30)

Feedforward propagation through a neural network with L layers.
given feedforward network with L layers (input+output+hidden);
start with y_1 = h;
repeat for ℓ = 1, ..., L − 1:
   z_{ℓ+1} = W_ℓ^T y_ℓ − θ_ℓ
   y_{ℓ+1} = f(z_{ℓ+1})
end
z = z_L
γ̂ = y_L.
(65.31)
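To make the recursion concrete, here is a minimal NumPy sketch of the forward pass (65.31); the function name, the list layout of the parameters, and the choice of tanh as the activation are illustrative assumptions of this sketch, not part of the text.

import numpy as np

def forward(h, W, theta, f=np.tanh):
    # W[l] and theta[l] hold the parameters feeding layer l+2 (Python index l),
    # i.e., z_{l+1} = W_l^T y_l - theta_l as in (65.31).
    y = h
    ys, zs = [y], []
    for Wl, thl in zip(W, theta):
        z = Wl.T @ y - thl      # pre-activation signal of the next layer
        y = f(z)                # post-activation signal of the next layer
        zs.append(z)
        ys.append(y)
    return y, zs[-1], ys, zs    # (gamma_hat, z, {y_l}, {z_l})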

Figure 65.7 A block diagram representation of the feedforward recursions (65.31) through a succession of four layers, where the notation f denotes the activation function.

Column and row partitioning
Continuing with layers 2 and 3 from Fig. 65.5, and the corresponding combination matrix W_2^T, we remark for later use that we can partition W_2^T either in column form or in row form as follows:

W_2^T = [ w_{11}^{(2)}  w_{21}^{(2)}  w_{31}^{(2)} ]
        [ w_{12}^{(2)}  w_{22}^{(2)}  w_{32}^{(2)} ]
      ≜ [ w_1^{(2)}   w_2^{(2)}   w_3^{(2)} ]       (65.32)

In the first case, when the partitioning is row-wise, we observe that W_2^T consists of two rows; one for each of the nodes in the subsequent layer 3. The entries of the first row are the weighting coefficients on the edges arriving at the first node in layer 3. The entries of the second row are the weighting coefficients on the edges arriving at the second node in the same layer 3. In other words, each row of W_2^T consists of the weighting coefficients on the edges arriving at the corresponding node in the subsequent layer 3.

Figure 65.8 The weight matrix W_ℓ^T can be partitioned either in column or row form. The columns of W_ℓ^T represent the weights emanating from the nodes in layer ℓ. The rows of W_ℓ^T represent the weights arriving at the nodes in layer ℓ+1.

In the second case, when the partitioning is column-wise, we observe that W_2^T consists of three columns; one column for each of the nodes in the originating layer 2. The entries of the first column are the weighting coefficients on the edges emanating from the first node in layer 2 to all nodes in layer 3. The entries of the second column are the weighting coefficients on the edges emanating from the second node in layer 2 to the nodes in layer 3, and likewise for the third column of W_2^T. In other words, each column of W_2^T consists of the weighting coefficients on the edges emanating from the nodes in layer 2. Figure 65.8 illustrates this partitioning for a generic combination matrix between two arbitrary layers ℓ and ℓ+1. For later use, we find it convenient to denote the columns of W_ℓ^T by the notation – as depicted in the second line in (65.32):

w_i^{(ℓ)} = weight vector emanating from node i in layer ℓ       (65.33)

For example, from (65.32) we have

w_1^{(2)} = col{w_{11}^{(2)}, w_{12}^{(2)}} = weights from node 1 in layer 2       (65.34a)
w_2^{(2)} = col{w_{21}^{(2)}, w_{22}^{(2)}} = weights from node 2 in layer 2       (65.34b)
w_3^{(2)} = col{w_{31}^{(2)}, w_{32}^{(2)}} = weights from node 3 in layer 2       (65.34c)

65.3 REGRESSION AND CLASSIFICATION

Feedforward neural networks can be viewed as systems that map multidimensional input vectors {h ∈ IR^M} into multidimensional output vectors {γ̂ ∈ IR^Q}, so that the network is effectively a multi-input multi-output system. We can exploit this level of generality to solve regression problems as well as multiclass and multilabel classification problems.

Regression
In regression problems, the nodes at the output layer of the network do not employ activation functions. They consist of linear combiners so that γ̂ = z, where γ̂ denotes the output vector of the network and z denotes the pre-activation signal at the output layer. The individual entries of γ̂ assume real values and, hence, γ̂ ∈ IR^Q.

Classification
In classification problems, the output layer of the network will employ activation functions; this layer can also be modified to replace the individual activation functions by the softmax construction described earlier. Regardless of how the output vector γ̂ is generated, there will generally be an additional step that transforms it into a second vector γ⋆ with Q discrete entries, just like we used sign(γ̂) in previous chapters to transform a scalar real-valued γ̂ into +1 or −1 by examining its sign. In the neural network context, we will perform similar operations, described below, to generate the discrete vector γ⋆ whose entries will belong, out of convenience, either to the choices {+1, −1} or {1, 0}, e.g.,

γ̂(h) ∈ IR^Q (real output)   =⇒   γ⋆(h) ∈ {+1, −1}^Q (discrete output)       (65.35)

The choice of which discrete values to use, {+1, −1} or {1, 0}, is usually dictated by the type of activation function used. For instance, note that the sigmoid function in Table 65.1 generates values in the range (0, 1), while the hyperbolic tangent function generates values in the range (−1, 1). Therefore, for classification problems, we would:

(1) Employ the discrete values {1, 0} when the sigmoid function is used. We can, for instance, obtain these discrete values by performing the following transformation on the entries of the output vector γ̂ ∈ IR^Q:

if γ̂(q) ≥ 1/2, set γ⋆(q) = 1;   if γ̂(q) < 1/2, set γ⋆(q) = 0       (65.36)

for q = 1, 2, ..., Q.

(2) Employ the discrete values {+1, −1} when the hyperbolic tangent function is used. We can obtain these discrete values by performing the following transformation on the entries of the output vector γ̂ ∈ IR^Q:

γ⋆(q) = sign(γ̂(q)),   q = 1, 2, ..., Q       (65.37)

We can exploit the vector nature of {γ̂, γ⋆} to solve two types of classification problems:

(a) (Multilabel classification) In some applications, one is interested in determining whether a feature vector, h, implies the presence of certain conditions A or B or more, for example, such as checking whether an individual with feature h is overweight or not (condition A) and smokes or not (condition B). In cases like these, we would train a neural network to generate two labels, i.e., a two-dimensional vector γ⋆ with individual discrete entries denoted by γ⋆(A) and γ⋆(B):

γ̂(h) ∈ IR^2   =⇒   γ⋆(h) = col{γ⋆(A), γ⋆(B)}       (65.38)

One of the labels relates to condition A and indicates whether the condition is present or not by assuming binary state values such as {+1, −1} or {1, 0}; similarly for γ⋆(B). This example corresponds to a multilabel classification problem. Such problems arise, for example, when training a neural network to solve a multitask problem such as deciding whether an image contains instances of streets, cars, and traffic signs.

(b) (Multiclass classification) In other applications, one is interested in classifying a feature vector, h, into only one of a collection of classes. For example, given a feature vector extracted from the image of a handwritten digit, one would want to identify the digit (i.e., classify h into 1 of 10 classes: 0, 1, 2, ..., 9). We encountered multiclass classification problems of this type earlier in Sections 59.3.1 and 59.3.2 while discussing the OvA and OvO strategies. These solution methods focused on reducing the multiclass classification problem into a collection of binary classifiers. In comparison, in the neural network approach, the multiclass classification problem will be solved directly by generating a vector-valued class variable, γ⋆ ∈ IR^Q – see Example 65.9. Each entry of this vector corresponds to one class. In particular, when h belongs to some class r, the rth entry of γ⋆ will be activated at +1, while all other entries will be −1 (or 0, depending on which convention is used, {+1, −1} or {1, 0}).

Softmax formulation
In multiclass classification problems, the output of the feedforward network will generally be a softmax layer where the entries γ̂(q) are generated by employing the softmax function described before:

γ̂(q) ≜ e^{z(q)} / Σ_{q'=1}^{Q} e^{z(q')}       (65.39)

Here, the variable z(q) denotes the qth entry of the output vector z prior to activation. Observe that the computation of γ̂(q) is now influenced by the other signals {z(q')} from the other output neurons, and is not solely dependent on z(q). For convenience, we will express the transformation (65.39) more succinctly by writing

γ̂ ≜ softmax(z)       (65.40)

The exponentiation and normalization in (65.39) ensure that the output variables {γ̂(q)} are all nonnegative and add up to 1. As a result, in multiclass classification problems, each γ̂(q) can be interpreted as corresponding to the likelihood that the feature vector h belongs to class q. The label for vector h is then selected as the class r⋆ corresponding to the highest likelihood value:

r⋆ = argmax_{1≤q≤Q} γ̂(q)       (65.41)

In other words, the predicted label γ⋆ can be set to the r⋆-basis vector in IR^Q:

γ⋆ ≜ e_{r⋆} = col{0, ..., 0, 1, 0, ..., 0}   (basis vector)       (65.42)

where the notation e_r refers to the basis vector with the value 1 at location r and 0s elsewhere. When the output of the network structure includes the softmax transformation, the last line of the feedforward propagation recursion (65.31) is modified as indicated below, with γ̂ = y_L replaced by (65.40); the vector y_L does not need to be generated anymore. We continue to use the compact description (65.30) to refer to this implementation.

Feedforward propagation through L layers with softmax output.
given feedforward network with L layers (input+output+hidden);
output layer is softmax; start with y_1 = h;
repeat for ℓ = 1, ..., L − 1:
   z_{ℓ+1} = W_ℓ^T y_ℓ − θ_ℓ
   y_{ℓ+1} = f(z_{ℓ+1})
end
z = z_L
γ̂ = softmax(z).
(65.43)
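As an illustration of the softmax output stage (65.39)–(65.42), the following sketch converts a pre-activation vector z into class likelihoods, the predicted class r⋆, and the corresponding basis vector; subtracting the maximum entry before exponentiation is a common numerical safeguard added here for convenience and is not prescribed by the text.

import numpy as np

def softmax_predict(z):
    # softmax of the pre-activation vector z, as in (65.39)
    e = np.exp(z - np.max(z))            # subtract max for numerical stability
    gamma_hat = e / e.sum()              # entries are nonnegative and add up to 1
    r_star = int(np.argmax(gamma_hat))   # class with the highest likelihood, (65.41)
    gamma_star = np.zeros_like(gamma_hat)
    gamma_star[r_star] = 1.0             # basis vector e_{r*}, as in (65.42)
    return gamma_hat, gamma_star, r_star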

65.4 CALCULATION OF GRADIENT VECTORS

Now that we have described the structure of feedforward neural networks, we can proceed to explain how to train them, i.e., how to determine their weight matrices {W_ℓ} and bias vectors {θ_ℓ} to solve classification or regression problems. We will do so in two steps. First, we will derive the famed backpropagation algorithm, which is a popular procedure for evaluating gradient vectors. Subsequently, we will combine this procedure with stochastic gradient approximation to arrive at an effective training method. Due to the interconnected nature of the network, with signals from one layer feeding into a subsequent layer, in addition to the presence of nonlinear activation functions, it is necessary to pursue a systematic presentation to facilitate the derivation. We focus initially on the case in which all layers, including the output layer, employ activation functions; later, we explain the adjustments that are needed when the output layer relies on a softmax construction.

To begin with, since we will now be dealing with feature vectors h ∈ IR^M that are indexed by a subscript n, say, h_n ∈ IR^M (e.g., h_n can refer to vectors selected from a training set or to vectors streaming in over time), we will similarly denote the output vector γ̂ ∈ IR^Q corresponding to h_n by γ̂_n, with the same subscript n. We will also let z_n ∈ IR^Q denote the output vector prior to activation so that

γ̂_n = f(z_n)       (65.44)

Thus, consider a collection of N data pairs {γ_n, h_n} for n = 0, 1, ..., N − 1, where γ_n denotes the actual discrete label vector corresponding to the nth feature vector h_n. The objective is to train the neural network to result in an input–output mapping h → γ̂ that matches reasonably well the mapping h → γ that is reflected by the training data.

65.4.1 Regularized Least-Squares Risk

We formulate initially a regularized empirical risk optimization problem of the following form:

{W_ℓ⋆, θ_ℓ⋆} ≜ argmin_{W_ℓ, θ_ℓ} { Σ_{ℓ=1}^{L−1} ρ‖W_ℓ‖²_F + (1/N) Σ_{n=0}^{N−1} ‖γ_n − γ̂_n‖² }       (65.45)

where the first term applies regularization to the sum of the squared Frobenius norms of the weight matrices between successive layers. Recall that the squared Frobenius norm of a matrix is the sum of the squares of its entries (i.e., it is the squared Euclidean norm of the vectorized form of the matrix):

‖W_ℓ‖²_F = Σ_{i=1}^{n_ℓ} Σ_{j=1}^{n_{ℓ+1}} (w_{ij}^{(ℓ)})²       (65.46)

Therefore, the regularization term is in effect adding the squares of all combination weights in the network. As already explained in Section 51.2, regularization helps avoid overfitting and improves the generalization ability of the network. Other forms of regularization are possible, including ℓ1-regularization (see Prob. 65.15), as well as other risk functions (see Probs. 65.15–65.18). We will motivate some of these alternative costs later in Section 65.8; their main purpose is to avoid the slowdown in learning that arises from using (65.45). Continuing with (65.45), we denote the regularized empirical risk by

P(W, θ) ≜ Σ_{ℓ=1}^{L−1} ρ‖W_ℓ‖²_F + (1/N) Σ_{n=0}^{N−1} ‖γ_n − γ̂_n‖²       (65.47)

where we are denoting the arguments of P(·) generically by {W, θ}; these refer to the collection of all parameters {W_ℓ, θ_ℓ} across all layers. The loss function that is associated with each data point in (65.47) is seen to be (we used the letter Q to refer to loss functions in earlier chapters; we use the calligraphic letter Q here to avoid confusion with the number of classes Q at the output of the network):

Q(W, θ; γ, h) ≜ ‖γ − γ̂‖²       (65.48)

This loss value depends on all weight and bias parameters of the network. For simplicity, we will often drop the parameters (W, θ) and write only Q(γ, h). It is useful to note from (65.47) that regularization is not applied to the bias vectors {θ_ℓ}; these entries are embedded into the output signals {γ̂_n}, as is evident from (65.31). Observe further that the risk (65.47) is not quadratic in the unknown variables {W_ℓ, θ_ℓ}. Actually, the dependency of γ̂_n on the variables {W_ℓ, θ_ℓ} is highly nonlinear due to the activation functions at the successive nodes. As a result, the risk function P(W, θ) is nonconvex over its parameters and will generally have multiple local minima. We will still apply stochastic-gradient constructions to seek a local minimizer, especially since it has been observed in practice, through extensive experimentation, that the training algorithm works well despite the nonlinear and nonconvex nature of the risk function.

In order to implement iterative procedures for “minimizing” P(W, θ), we need to evaluate the gradients of P(W, θ) relative to the individual entries of {W_ℓ, θ_ℓ}, i.e., we need to evaluate quantities of the form:

∂P(W, θ)/∂w_{ij}^{(ℓ)}   and   ∂P(W, θ)/∂θ_ℓ(j)       (65.49)

for each layer ℓ and entries w_{ij}^{(ℓ)} and θ_ℓ(j). The backpropagation algorithm is the procedure that enables us to compute these gradients in an effective manner, as we explain in the remainder of this section.

65.4.2 Sensitivity Factors

To simplify the notation in this section, we drop the subscript n and reinstate it later when it is time to list the final algorithm. The subscript is not necessary for the gradient calculations. We thus consider a generic feedforward neural network consisting of L layers, including the input and output layers. We denote the vector signals at the output layer by {z, γ̂}, with the letter z representing the signal prior to the activation function, i.e.,

γ̂ = f(z)       (65.50)

We denote the pre- and post-activation signals at the ℓth hidden layer by {z_ℓ, y_ℓ}, which satisfy

y_ℓ = f(z_ℓ)       (65.51)

with individual entries indexed by {z_ℓ(j), y_ℓ(j)} for j = 1, 2, ..., n_ℓ. The number of nodes in the layer is n_ℓ (which excludes the bias source). We associate with each layer ℓ a sensitivity vector of size n_ℓ, denoted by δ_ℓ ∈ IR^{n_ℓ}, whose individual entries are defined as follows:

δ_ℓ(j) ≜ ∂Q(γ, h)/∂z_ℓ(j) = ∂‖γ − γ̂‖²/∂z_ℓ(j),   j = 1, 2, ..., n_ℓ       (65.52)

These factors measure how the unregularized term ‖γ − γ̂‖² (i.e., the loss value Q(γ, h)) varies in response to changes in the pre-activation signals, z_ℓ(j); we show later in (65.79) that this same quantity also measures the sensitivity to changes in the bias coefficients. It turns out that knowledge of the sensitivity variables facilitates evaluation of the partial derivatives (65.49). We therefore examine how to compute the {δ_ℓ} for all layers. We will show that these variables satisfy a backward recursive relation that tells us how to construct δ_ℓ from knowledge of δ_{ℓ+1}.

We start with the output layer, for which ℓ = L. We denote its individual entries by {γ̂(1), ..., γ̂(Q)}. Likewise, we denote the pre-activation entries by {z(1), ..., z(Q)}. In this way, the chain rule for differentiation gives

δ_L(j) = ∂‖γ − γ̂‖²/∂z(j)
       =(a) Σ_{k=1}^{Q} ( ∂‖γ − γ̂‖²/∂γ̂(k) )( ∂γ̂(k)/∂z(j) )
       = Σ_{k=1}^{Q} 2(γ̂(k) − γ(k)) ∂γ̂(k)/∂z(j)
       =(b) 2(γ̂(j) − γ(j)) f'(z(j))       (65.53)

where step (a) applies the following general chain-rule property. Assume y is a function of several variables, {x_1, x_2, ..., x_Q}, which in turn are themselves functions of some variable z, i.e.,

y = f(x_1, x_2, ..., x_Q),   x_q = g_q(z)       (65.54)

Then, it holds that

∂y/∂z = Σ_{q=1}^{Q} ( ∂y/∂x_q )( ∂x_q/∂z )   (chain rule)       (65.55)

Step (b) in (65.53) is because only γ̂(j) depends on z(j), through the relation γ̂(j) = f(z(j)). Consequently, using the Hadamard product notation we can write

δ_L = 2(γ̂ − γ) ⊙ f'(z)   (terminal sensitivity value)       (65.56)

where a ⊙ b denotes elementwise multiplication for two vectors a and b. It is important to note that the activation function whose derivative appears in (65.56) is the one associated with the output layer of the network. If desired, we can express the above relation in matrix-vector form by writing

δ_L = 2J(γ̂ − γ)       (65.57a)

where J is the diagonal matrix

J ≜ diag{f'(z(1)), f'(z(2)), ..., f'(z(Q))}       (65.57b)

Next, we evaluate δ_ℓ for the earlier layers. This calculation can be carried out recursively by relating δ_ℓ to δ_{ℓ+1}. Indeed, note that

δ_ℓ(j) ≜ ∂‖γ − γ̂‖²/∂z_ℓ(j)
       = Σ_{k=1}^{n_{ℓ+1}} ( ∂‖γ − γ̂‖²/∂z_{ℓ+1}(k) )( ∂z_{ℓ+1}(k)/∂z_ℓ(j) )
       = Σ_{k=1}^{n_{ℓ+1}} δ_{ℓ+1}(k) ∂z_{ℓ+1}(k)/∂z_ℓ(j)       (65.58)

where the rightmost term involves differentiating the output of node k in layer ℓ+1 relative to the (pre-activation) output of node j in the previous layer, ℓ. The summation in the second equality results from the chain rule of differentiation since the entries of γ̂ are generally dependent on the {z_{ℓ+1}(k)}. The two signals z_ℓ(j) and z_{ℓ+1}(k) are related by the combination coefficient w_{jk}^{(ℓ)} since

z_{ℓ+1}(k) = f(z_ℓ(j)) w_{jk}^{(ℓ)} + terms independent of z_ℓ(j)       (65.59)

It follows that

δ_ℓ(j) = ( Σ_{k=1}^{n_{ℓ+1}} δ_{ℓ+1}(k) w_{jk}^{(ℓ)} ) f'(z_ℓ(j)) = f'(z_ℓ(j)) (w_j^{(ℓ)})^T δ_{ℓ+1}       (65.60)

where we used the inner product notation in the last line by using the column vector w_j^{(ℓ)}, which collects the combination weights emanating from node j in layer ℓ – recall the earlier definition (65.33). In vector form, we arrive at the following recursion for the sensitivity vector δ_ℓ, which runs backward from ℓ = L − 1 down to ℓ = 2 with the boundary condition δ_L given by (65.56):

δ_ℓ = f'(z_ℓ) ⊙ (W_ℓ δ_{ℓ+1})       (65.61)

It is again useful to note that the activation function whose derivative appears in (65.61) is the one associated with the ℓth layer of the network. If desired, we can also express this relation in matrix-vector form by writing

δ_ℓ = J_ℓ W_ℓ δ_{ℓ+1}       (65.62a)

where J_ℓ is now the diagonal matrix of size n_ℓ × n_ℓ:

J_ℓ ≜ diag{f'(z_ℓ(1)), f'(z_ℓ(2)), ..., f'(z_ℓ(n_ℓ))}       (65.62b)

We therefore arrive at the following description for the flow of sensitivity signals backward through the network – this flow is depicted schematically in Fig. 65.9.

Backward propagation through a network with L layers.
given a feedforward network with L layers (input+output+hidden);
pre- and post-activation signals at the output layer are (z, γ̂);
internal pre-activation signals are {z_ℓ};
given feature vector h ∈ IR^M with label vector γ ∈ IR^Q;
start from δ_L = 2(γ̂ − γ) ⊙ f'(z)
repeat for ℓ = L − 1, ..., 3, 2:
   δ_ℓ = f'(z_ℓ) ⊙ (W_ℓ δ_{ℓ+1})
end
(65.63)

For compactness of notation, we will sometimes express the backward recursion (65.63), which starts from a terminal condition δ_L and feeds it through a network with L layers, activation functions f(·), and combination matrices {W_ℓ}, to generate the sensitivity factors {δ_ℓ, ℓ = 2, ..., L − 1}, by writing

{δ_ℓ} = backward(δ_L, f, {z_ℓ, W_ℓ})       (65.64)
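A minimal sketch of the backward recursion (65.63), under the same illustrative conventions as the earlier forward sketch; f_prime denotes the elementwise derivative of the activation, and the list layout of {z_ℓ} and {W_ℓ} is an assumption of this sketch.

import numpy as np

def backward(delta_L, f_prime, zs, W):
    # zs[k] holds z_{k+2} (pre-activations of layers 2..L) and W[k] holds W_{k+1},
    # so W[k] maps layer k+1 to layer k+2, consistent with the forward sketch.
    deltas = [delta_L]                         # delta_L from (65.56) or (65.70)
    for l in range(len(W) - 1, 0, -1):         # layer index runs L-1 down to 2
        delta = f_prime(zs[l - 1]) * (W[l] @ deltas[0])   # recursion (65.61)
        deltas.insert(0, delta)
    return deltas                              # [delta_2, ..., delta_L]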

Figure 65.9 A block diagram representation of the backward recursion (65.63) for the same scenario shown earlier in Fig. 65.7, which involves three hidden layers.

Example 65.1 (Terminal sensitivity factor for softmax layer) The backward recursion (65.61) continues to be valid when the output layer of the neural network is modified to be the softmax construction (65.39), in which case γ̂ = softmax(z). The only change will be in the terminal or boundary value, δ_L, as we now explain. If we refer to expressions (65.52) and (65.53), where the subscript n was dropped for convenience, we have

δ_L(j) = Σ_{k=1}^{Q} 2(γ̂(k) − γ(k)) ∂γ̂(k)/∂z(j)       (65.65)

where now, in view of the normalization (65.39):

∂γ̂(k)/∂z(j) = −γ̂(j)γ̂(k)  for k ≠ j,   ∂γ̂(k)/∂z(j) = (1 − γ̂(k))γ̂(k)  for k = j       (65.66)

Substituting into (65.65) gives

δ_L(j) = 2(γ̂(j) − γ(j))γ̂(j) − ( Σ_{k=1}^{Q} 2(γ̂(k) − γ(k))γ̂(k) ) γ̂(j)       (65.67)

We can express this relation in a more compact form by collecting the partial derivatives (65.66) into a Q × Q symmetric matrix:

[J]_{jk} ≜ ∂γ̂(k)/∂z(j),   j, k = 1, 2, ..., Q       (65.68)

That is, for Q = 3:

J = [ (1 − γ̂(1))γ̂(1)    −γ̂(1)γ̂(2)       −γ̂(1)γ̂(3)    ]
    [ −γ̂(2)γ̂(1)        (1 − γ̂(2))γ̂(2)   −γ̂(2)γ̂(3)    ]  =  diag(γ̂) − γ̂γ̂^T       (65.69)
    [ −γ̂(3)γ̂(1)        −γ̂(3)γ̂(2)       (1 − γ̂(3))γ̂(3) ]

Then, we can rewrite (65.67) in the following matrix-vector product form:

δ_L = 2J(γ̂ − γ)       (65.70)

Comparing this expression with (65.56), we see that (65.70) does not depend on the terminal derivative term f'(z).
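For the softmax case, the terminal sensitivity can therefore be computed directly from γ̂ and γ; the following lines are a small illustrative sketch of (65.69)–(65.70).

import numpy as np

def softmax_terminal_sensitivity(gamma_hat, gamma):
    # J = diag(gamma_hat) - gamma_hat gamma_hat^T, as in (65.69)
    J = np.diag(gamma_hat) - np.outer(gamma_hat, gamma_hat)
    return 2.0 * J @ (gamma_hat - gamma)      # delta_L from (65.70)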

65.4.3 Expressions for the Gradients

We are ready to evaluate the partial derivatives in (65.49). Following arguments similar to the above, we note that

∂‖γ − γ̂‖²/∂w_{ij}^{(ℓ)} = Σ_{k=1}^{n_{ℓ+1}} ( ∂‖γ − γ̂‖²/∂z_{ℓ+1}(k) )( ∂z_{ℓ+1}(k)/∂w_{ij}^{(ℓ)} )
                        = ( ∂‖γ − γ̂‖²/∂z_{ℓ+1}(j) )( ∂z_{ℓ+1}(j)/∂w_{ij}^{(ℓ)} )
                        = δ_{ℓ+1}(j) y_ℓ(i)       (65.71)

where the second equality is because only z_{ℓ+1}(j) depends on w_{ij}^{(ℓ)}. If we apply result (65.71) to the combination matrix W_2 defined earlier in (65.18), we find that these gradient calculations lead to (where we are writing ∂‖γ − γ̂‖²/∂W_2 to refer to the resulting matrix):

∂‖γ − γ̂‖²/∂W_2 = [ δ_3(1)y_2(1)   δ_3(2)y_2(1) ]
                  [ δ_3(1)y_2(2)   δ_3(2)y_2(2) ]  =  y_2 δ_3^T       (65.72)
                  [ δ_3(1)y_2(3)   δ_3(2)y_2(3) ]

in terms of the outer product between the output vector, y_ℓ, for layer ℓ and the sensitivity vector, δ_{ℓ+1}, for layer ℓ+1. Therefore, we can write for a generic ℓ:

∂‖γ − γ̂‖²/∂W_ℓ = y_ℓ δ_{ℓ+1}^T       (65.73)

so that from (65.47), and after restoring the subscript n,

∂P(W, θ)/∂W_ℓ = 2ρW_ℓ + (1/N) Σ_{n=0}^{N−1} y_{ℓ,n} δ_{ℓ+1,n}^T   (a matrix)       (65.74)

In this notation, y_{ℓ,n} is the output vector for layer ℓ at time n. A similar argument can be employed to compute the gradients of ‖γ − γ̂‖² relative to the bias weights, θ_ℓ(i), across the layers. Thus, note that

∂P(W, θ)/∂θ_ℓ(i) = (1/N) Σ_{n=0}^{N−1} ∂‖γ_n − γ̂_n‖²/∂θ_ℓ(i)       (65.75)

so that, in a manner similar to the calculation (65.71),

∂‖γ − γ̂‖²/∂θ_ℓ(i) = Σ_{k=1}^{n_{ℓ+1}} ( ∂‖γ − γ̂‖²/∂z_{ℓ+1}(k) )( ∂z_{ℓ+1}(k)/∂θ_ℓ(i) )
                   = ( ∂‖γ − γ̂‖²/∂z_{ℓ+1}(i) )( ∂z_{ℓ+1}(i)/∂θ_ℓ(i) )
                   = −δ_{ℓ+1}(i)       (65.76)

where the second equality is because only z_{ℓ+1}(i) depends on θ_ℓ(i), namely,

z_{ℓ+1}(i) = −θ_ℓ(i) + terms independent of θ_ℓ(i)       (65.77)

If we apply result (65.76) to the bias vector θ_2 defined earlier in (65.21), we find that these gradient calculations lead to (where we are writing ∂‖γ − γ̂‖²/∂θ_2 to refer to the resulting gradient vector):

∂‖γ − γ̂‖²/∂θ_2 = col{ −δ_3(1), −δ_3(2) }       (65.78)

More generally, we have

∂‖γ − γ̂‖²/∂θ_ℓ = −δ_{ℓ+1}       (65.79)

so that from (65.47), and after restoring the subscript n,

∂P(W, θ)/∂θ_ℓ = −(1/N) Σ_{n=0}^{N−1} δ_{ℓ+1,n}   (a column vector)       (65.80)

In summary, we arrive at the following listing for the main steps involved in computing the partial derivatives in (65.49) relative to all combination matrices and bias vectors in a feedforward network consisting of L layers. In the description below, we reinstate the subscript n to refer to the sample index. Moreover, the quantities {y_{ℓ,n}, δ_{ℓ,n}, z_{ℓ,n}} are all vectors associated with layer ℓ, while {z_n, γ̂_n} are the pre- and post-activation vectors at the output layer of the network. When the softmax construction (65.39) is employed at the output layer, we simply replace the boundary condition for δ_{L,n} by (65.70).

Computation of partial derivatives for empirical risk (65.47).
given a feedforward network with L layers (input+output+hidden);
pre- and post-activation signals at the output layer are (z_n, γ̂_n);
internal pre- and post-activation signals are {z_{ℓ,n}, y_{ℓ,n}};
given N training data samples {γ_n, h_n};
repeat for n = 0, 1, ..., N − 1:
   (γ̂_n, z_n, {y_{ℓ,n}, z_{ℓ,n}}) = forward(h_n, {W_ℓ, θ_ℓ})
   δ_{L,n} = 2(γ̂_n − γ_n) ⊙ f'(z_n)
   {δ_{ℓ,n}} = backward(δ_{L,n}, f, {z_{ℓ,n}, W_ℓ})
end
compute for ℓ = 1, 2, ..., L − 1:
   ∂P(W, θ)/∂W_ℓ = 2ρW_ℓ + (1/N) Σ_{n=0}^{N−1} y_{ℓ,n} δ_{ℓ+1,n}^T,   (n_ℓ × n_{ℓ+1})
   ∂P(W, θ)/∂θ_ℓ = −(1/N) Σ_{n=0}^{N−1} δ_{ℓ+1,n},   (n_{ℓ+1} × 1)
(65.81)
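Listing (65.81) can be sketched in a few lines of NumPy by reusing the illustrative forward and backward helpers shown earlier; the data layout (a list of (γ, h) pairs) is an assumption of this sketch.

import numpy as np

def risk_gradients(data, W, theta, rho, f, f_prime):
    # data is a list of (gamma, h) pairs; W, theta follow the earlier sketches
    N = len(data)
    dW = [2 * rho * Wl for Wl in W]                 # regularization term in (65.74)
    dtheta = [np.zeros_like(th) for th in theta]
    for gamma, h in data:
        gamma_hat, z, ys, zs = forward(h, W, theta, f)
        delta_L = 2 * (gamma_hat - gamma) * f_prime(z)     # terminal value (65.56)
        deltas = backward(delta_L, f_prime, zs, W)         # [delta_2, ..., delta_L]
        for l in range(len(W)):
            dW[l] += np.outer(ys[l], deltas[l]) / N        # (65.74)
            dtheta[l] -= deltas[l] / N                     # (65.80)
    return dW, dtheta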

65.5 BACKPROPAGATION ALGORITHM

We can now use the forward and backward recursions from the previous section to train the neural network by writing down a stochastic-gradient implementation with step size µ > 0. In this implementation, either one randomly selected data point (γ_n, h_n) may be used per iteration or a mini-batch block of size B. We describe the implementation in the mini-batch mode. By setting B = 1, we recover a stochastic-gradient version with one data point per iteration. Moving forward, we will need to introduce an iteration index, m, and attach it to the combination matrices and bias vectors since they will now be adjusted from one iteration to the other. We will therefore write W_{ℓ,m} and θ_{ℓ,m} for the parameter values at iteration m.

Mini-batch backpropagation algorithm for solving (65.45).
given a feedforward network with L layers (input+output+hidden);
pre- and post-activation signals at the output layer are (z_n, γ̂_n);
internal pre- and post-activation signals are {z_{ℓ,n}, y_{ℓ,n}};
given N training data samples {γ_n, h_n}, n = 0, 1, ..., N − 1;
given small step size µ > 0 and regularization parameter ρ ≥ 0;
start from random initial parameters {W_{ℓ,−1}, θ_{ℓ,−1}}.
repeat until convergence over m = 0, 1, 2, ...:
   select B random data pairs {γ_b, h_b}
   (forward processing)
   repeat for b = 0, 1, ..., B − 1:
      y_{1,b} = h_b
      repeat for ℓ = 1, 2, ..., L − 1:
         z_{ℓ+1,b} = W_{ℓ,m−1}^T y_{ℓ,b} − θ_{ℓ,m−1}
         y_{ℓ+1,b} = f(z_{ℓ+1,b})
      end
      γ̂_b = y_{L,b}
      z_b = z_{L,b}
      δ_{L,b} = 2(γ̂_b − γ_b) ⊙ f'(z_b)
   end
   (backward processing)
   repeat for ℓ = L − 1, ..., 2, 1:
      W_{ℓ,m} = (1 − 2µρ)W_{ℓ,m−1} − (µ/B) Σ_{b=0}^{B−1} y_{ℓ,b} δ_{ℓ+1,b}^T
      θ_{ℓ,m} = θ_{ℓ,m−1} + (µ/B) Σ_{b=0}^{B−1} δ_{ℓ+1,b}
      δ_{ℓ,b} = f'(z_{ℓ,b}) ⊙ (W_{ℓ,m−1} δ_{ℓ+1,b}),   ℓ ≥ 2,   b = 0, 1, ..., B − 1
   end
end
{W_ℓ⋆, θ_ℓ⋆} ← {W_{ℓ,m}, θ_{ℓ,m}}.
(65.82)

In listing (65.82) we are denoting the training data and the network parameters in boldface to highlight their random nature. In contrast to (65.81), we are also blending the backward update for the sensitivity factors into the same loop that updates the parameters of the network, for improved computational efficiency; this is because the sensitivity factors and the network parameters are updated one layer at a time. Clearly, if desired, we can perform multiple passes over the data and repeat (65.82) for several epochs. The training of the algorithm is performed until sufficient convergence is attained, which can be met either by training until a certain maximum number of iterations has been reached (i.e., for m ≤ M_iter), or until the improvement in the risk function P(W, θ) is negligible, say,

P({W_{ℓ,m}, θ_{ℓ,m}}) − P({W_{ℓ,m−1}, θ_{ℓ,m−1}}) ≤ ε       (65.83)

over two successive iterations and for some small enough ε > 0. When the softmax construction (65.39) is employed at the output layer, the only adjustment that is needed to the algorithm is to replace the boundary condition for δ_{L,b} by (65.70), namely,

δ_{L,b} = 2J(γ̂_b − γ_b),   J = diag(γ̂_b) − γ̂_b γ̂_b^T       (65.84)

Stochastic-gradient implementation
When the batch size is B = 1, the above mini-batch recursions simplify to the listing shown in (65.87). Again, when the softmax construction is employed in the last layer, the expression for the boundary condition δ_{L,m} in (65.87) would be replaced by:

δ_{L,m} = 2J(γ̂_m − γ_m),   J = diag(γ̂_m) − γ̂_m γ̂_m^T       (65.85)

In both implementations, the parameter values {W_ℓ⋆, θ_ℓ⋆} at the end of the training phase are used for testing purposes. For example, assuming a softmax output layer, the predicted class r⋆ corresponding to a feature vector h would be the index of the highest value within γ̂ (whose entries have the interpretation of a probability measure):

γ̂ = softmax(z)       (65.86a)
γ̂(q) ≈ P(h ∈ class q)       (65.86b)
r⋆ = argmax_{1≤q≤Q} γ̂(q)       (65.86c)

where z is the vector prior to the activation at the last layer of the neural network, and which results from feeding the test feature h through the feedforward layers after training.

Stochastic-gradient backpropagation for solving (65.45).
given a feedforward network with L layers (input+output+hidden);
pre- and post-activation signals at the output layer are (z_n, γ̂_n);
internal pre- and post-activation signals are {z_{ℓ,n}, y_{ℓ,n}};
given N training data samples {γ_n, h_n}, n = 0, 1, ..., N − 1;
given small step size µ > 0 and regularization parameter ρ ≥ 0;
start from random initial parameters {W_{ℓ,−1}, θ_{ℓ,−1}}.
repeat until convergence over m = 0, 1, 2, ...:
   select one random data pair (h_m, γ_m)
   (forward processing)
   y_{1,m} = h_m
   repeat for ℓ = 1, 2, ..., L − 1:
      z_{ℓ+1,m} = W_{ℓ,m−1}^T y_{ℓ,m} − θ_{ℓ,m−1}
      y_{ℓ+1,m} = f(z_{ℓ+1,m})
   end
   γ̂_m = y_{L,m}
   z_m = z_{L,m}
   δ_{L,m} = 2(γ̂_m − γ_m) ⊙ f'(z_m)
   (backward processing)
   repeat for ℓ = L − 1, ..., 2, 1:
      W_{ℓ,m} = (1 − 2µρ)W_{ℓ,m−1} − µ y_{ℓ,m} δ_{ℓ+1,m}^T
      θ_{ℓ,m} = θ_{ℓ,m−1} + µ δ_{ℓ+1,m}
      δ_{ℓ,m} = f'(z_{ℓ,m}) ⊙ (W_{ℓ,m−1} δ_{ℓ+1,m}),   ℓ ≥ 2
   end
end
{W_ℓ⋆, θ_ℓ⋆} ← {W_{ℓ,m}, θ_{ℓ,m}}.
(65.87)
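A single iteration of the stochastic-gradient recursion (65.87) can be sketched as follows, again reusing the illustrative helpers from the earlier sketches. All sensitivities are computed with the parameters available at the start of the iteration, which is equivalent to the interleaved form in the listing since the δ-update there also uses W_{ℓ,m−1}.

import numpy as np

def sgd_step(gamma, h, W, theta, mu, rho, f, f_prime):
    # one pass of the stochastic-gradient backpropagation update (65.87)
    gamma_hat, z, ys, zs = forward(h, W, theta, f)
    delta_L = 2 * (gamma_hat - gamma) * f_prime(z)
    deltas = backward(delta_L, f_prime, zs, W)       # [delta_2, ..., delta_L]
    for l in range(len(W)):
        W[l] = (1 - 2 * mu * rho) * W[l] - mu * np.outer(ys[l], deltas[l])
        theta[l] = theta[l] + mu * deltas[l]
    return W, theta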

Initialization
At iteration m = −1, it is customary to select the entries of the initial bias vector {θ_{ℓ,−1}} randomly by following a Gaussian distribution with zero mean and unit variance:

θ_{ℓ,−1} ∼ N_{θ_ℓ}(0, I_{n_{ℓ+1}})       (65.88)

The combination weights {w_{ij,−1}^{(ℓ)}} across the network are also selected randomly according to a Gaussian distribution, but one whose variance is adjusted in accordance with the number of nodes in layer ℓ, which we denoted earlier by n_ℓ. It is customary to select the variance of the Gaussian distribution as

w_{ij,−1}^{(ℓ)} ∼ N_{w_{ij}^{(ℓ)}}(0, 1/n_ℓ)       (65.89)

The reason for this normalization is to limit the variation of the signals in the subsequent layer ℓ+1. Note, in particular, that for an arbitrary node j in layer ℓ+1, its initial pre-activation signal would be given by

z_{ℓ+1,−1}(j) = Σ_{i=1}^{n_ℓ} w_{ij,−1}^{(ℓ)} y_{ℓ,−1}(i) − θ_{ℓ,−1}(j)       (65.90)

If we assume, for illustration purposes, that the output signals, y_{ℓ,−1}(i), from layer ℓ are independent and have uniform variance, σ_y², it follows that the variance of z_{ℓ+1,−1}(j), denoted by σ_z², will be given by

σ_z² = 1 + Σ_{i=1}^{n_ℓ} (1/n_ℓ) σ_y² = 1 + σ_y²   (with normalization)       (65.91)

Without the variance scaling by 1/n_ℓ in (65.89), the above variance would instead be given by

σ_z² = 1 + n_ℓ σ_y²   (without normalization)       (65.92)

which grows linearly with n_ℓ. In this case, the pre-activation signals in layer ℓ+1 will be more likely to assume larger (negative or positive) values when n_ℓ is large, which in turn means that the activation functions will saturate. This fact slows down the learning process in the network because small changes in internal signals (or weights) will have little effect on the output of a saturated node and on subsequent layers. Other forms of initialization include selecting the weight variables by uniformly sampling within the following ranges, where the second and third choices are recommended when tanh and sigmoidal activation functions are used:

w_{ij,−1}^{(ℓ)} ∈ U[ −1/√n_ℓ, 1/√n_ℓ ]       (65.93a)
w_{ij,−1}^{(ℓ)} ∈ U[ −√6/√(n_ℓ + n_{ℓ+1}), √6/√(n_ℓ + n_{ℓ+1}) ]   (tanh activation)       (65.93b)
w_{ij,−1}^{(ℓ)} ∈ U[ −4√6/√(n_ℓ + n_{ℓ+1}), 4√6/√(n_ℓ + n_{ℓ+1}) ]   (sigmoid activation)       (65.93c)

These choices are meant to facilitate the reliable flow of information in the forward and backward directions in the network away from saturation, to avoid the difficulties caused by the vanishing gradient problem (discussed later in Section 65.8).
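A small sketch of the Gaussian initialization (65.88)–(65.89); the list of layer sizes and the seed argument are conventions of this sketch only.

import numpy as np

def init_parameters(layer_sizes, seed=0):
    # layer_sizes = [n_1, ..., n_L]; returns W_l of size n_l x n_{l+1} and theta_l
    rng = np.random.default_rng(seed)
    W, theta = [], []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W.append(rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out)))  # (65.89)
        theta.append(rng.normal(0.0, 1.0, size=n_out))                      # (65.88)
    return W, theta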

Example 65.2 (Early stopping procedure) Regularization reduces overfitting and improves generalization. In the least-squares risk formulation described so far, and in the cross-entropy formulation described further ahead, regularization is attained by adding a penalty term to the risk function. There are other more implicit ways by which regularization (i.e., improvement of the generalization ability) can be attained in order to reduce the gap between the training error and the test error. Early stopping is one such procedure; it relies on the use of pocket variables and a validation test set. During training, we split the original training data of size N into two groups: one larger group of size N_T used exclusively for the standard training procedure, say, by means of the stochastic-gradient backpropagation algorithm, and a smaller disjoint collection of data points of size N_V used for validation purposes (with N = N_T + N_V).

Early stopping operates as follows. Assume we continually train the parameters {W_ℓ, θ_ℓ} of a neural network and test the learned parameters against the validation data after each iteration. It has been observed in practice that, while the training error improves with training and continues to decrease, the same is not true for the validation error. The latter may improve initially but will start deteriorating after some time. This suggests that, rather than train the network continually, we should stop and freeze the network parameters at those values where the validation error was the smallest. These parameter values are likely to lead to better generalization. We use pocket variables to keep track of these specific parameters. The details of the implementation are shown in (65.94).

Early stopping procedure for training neural networks.
split the N training samples into two disjoint sets of size (N_T, N_V);
denote pocket variables by {W_{ℓ,p}, θ_{ℓ,p}};
denote initial conditions by {W_{ℓ,−1}, θ_{ℓ,−1}};
set pocket variables to the initial conditions;
set initial error on validation set to a very large value, denoted by R_p.
repeat for sufficient time:
   run N_T iterations of the stochastic-gradient algorithm to update the
      network parameters to {W_{ℓ,N_T}, θ_{ℓ,N_T}}
   evaluate the network error, R_v, on the validation set
   if R_v < R_p:
      update R_p ← R_v
      set the pocket variables to {W_{ℓ,p}, θ_{ℓ,p}} ← {W_{ℓ,N_T}, θ_{ℓ,N_T}}
   end
end
return {W_{ℓ,p}, θ_{ℓ,p}}.
(65.94)
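The pocket-variable logic of (65.94) can be sketched as follows; train_epoch and validation_error are caller-supplied placeholders (hypothetical names) standing for N_T stochastic-gradient iterations and for the evaluation of the risk on the validation set.

import copy

def early_stopping(W, theta, train_epoch, validation_error, max_rounds=100):
    # sketch of the early stopping procedure (65.94)
    W_p, theta_p = copy.deepcopy(W), copy.deepcopy(theta)   # pocket variables
    R_p = float("inf")                                       # best validation error so far
    for _ in range(max_rounds):
        W, theta = train_epoch(W, theta)                     # N_T stochastic-gradient steps
        R_v = validation_error(W, theta)
        if R_v < R_p:                                        # keep the best parameters seen
            R_p = R_v
            W_p, theta_p = copy.deepcopy(W), copy.deepcopy(theta)
    return W_p, theta_p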

Example 65.3 (Training using ADAM) Algorithm (65.87) relies on the use of stochastic-gradient updates for {W_{ℓ,m}, θ_{ℓ,m}}. There are of course other methods to update the network parameters, including the use of adaptive gradients. Here, we describe how the updates for {W_{ℓ,m}, θ_{ℓ,m}} should be modified if we were to employ instead the ADAM recursions from Section 17.4. Note first that the gradient matrix for the update of W_{ℓ,m} at iteration m is given by

G_{ℓ,m} ≜ 2ρW_{ℓ,m−1} + y_{ℓ,m} δ_{ℓ+1,m}^T       (65.95)

while the gradient vector for updating θ_{ℓ,m} is given by −δ_{ℓ+1,m}. Now, let W̄_{ℓ,m} and θ̄_{ℓ,m} denote smoothed quantities that are updated as follows, starting from zero initial conditions at m = −1:

W̄_{ℓ,m} = β_{w,1} W̄_{ℓ,m−1} + (1 − β_{w,1}) G_{ℓ,m}       (65.96a)
θ̄_{ℓ,m} = β_{θ,1} θ̄_{ℓ,m−1} − (1 − β_{θ,1}) δ_{ℓ+1,m}       (65.96b)

A typical value for the forgetting factors β_{w,1}, β_{θ,1} ∈ (0, 1) is 0.9. Let further S_{ℓ,m} and s_{ℓ,m} denote variance quantities associated with {W_ℓ, θ_ℓ} and updated as follows, starting again from zero initial conditions at m = −1:

S_{ℓ,m} = β_{w,2} S_{ℓ,m−1} + (1 − β_{w,2}) (G_{ℓ,m} ⊙ G_{ℓ,m})       (65.97a)
s_{ℓ,m} = β_{θ,2} s_{ℓ,m−1} + (1 − β_{θ,2}) (δ_{ℓ+1,m} ⊙ δ_{ℓ+1,m})       (65.97b)

where ⊙ denotes the Hadamard (elementwise) product of two matrices or vectors. A typical value for the forgetting factors β_{w,2}, β_{θ,2} ∈ (0, 1) is 0.999.

Let the notation A^{1/2} refer to the elementwise computation of the square roots of the entries of a matrix or vector argument. The ADAM updates for {W_{ℓ,m}, θ_{ℓ,m}} then take the form:

W_{ℓ,m} = W_{ℓ,m−1} − µ × ( √(1 − β_{w,2}^{m+1}) / (1 − β_{w,1}^{m+1}) ) × ( W̄_{ℓ,m} ⊘ ( ε 1_{n_ℓ} 1_{n_{ℓ+1}}^T + S_{ℓ,m}^{1/2} ) )       (65.98a)
θ_{ℓ,m} = θ_{ℓ,m−1} − µ × ( √(1 − β_{θ,2}^{m+1}) / (1 − β_{θ,1}^{m+1}) ) × ( θ̄_{ℓ,m} ⊘ ( ε 1_{n_{ℓ+1}} + s_{ℓ,m}^{1/2} ) )       (65.98b)

where ε is a small positive number to avoid division by zero, and where we are using the symbol ⊘ to refer to elementwise division.
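For a single weight matrix, one ADAM step combining (65.96a), (65.97a), and (65.98a) can be sketched as follows; the default values of the β's and ε follow the typical choices mentioned above, and the function layout is illustrative.

import numpy as np

def adam_update_W(W, W_bar, S, G, m, mu, beta1=0.9, beta2=0.999, eps=1e-8):
    # one ADAM step for a single weight matrix; m is the iteration index starting at 0
    W_bar = beta1 * W_bar + (1 - beta1) * G          # smoothed gradient, (65.96a)
    S = beta2 * S + (1 - beta2) * (G * G)            # smoothed squared gradient, (65.97a)
    scale = np.sqrt(1 - beta2 ** (m + 1)) / (1 - beta1 ** (m + 1))
    W = W - mu * scale * W_bar / (eps + np.sqrt(S))  # elementwise division, (65.98a)
    return W, W_bar, S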

Example 65.4 (Use of neural networks as autoencoders) We can use three-layer neural networks to act as “autoencoders.” The network consists of an input layer, one hidden layer, and an output layer, and its objective would be to map the input space back onto itself. This is achieved by constructing a feedforward network with the same number of output nodes as input nodes, and by setting γ = h:

Q = M,   γ = h       (65.99)

If the individual entries of h happen to lie within the interval (0, 1), then the output layer will include sigmoidal nonlinearities so that the output signals will lie within the same interval. On the other hand, if the entries of h lie within the interval (−1, 1), then the neurons in the output layer will include hyperbolic-tangent nonlinearities. If, however, the individual entries of h are arbitrary real numbers, then the neurons in the output layer should not include nonlinearities. For illustration purposes, we assume the entries of h lie within (0, 1) and consider an autoencoder structure of the form shown in Fig. 65.10, where the output neurons contain sigmoidal nonlinearities.

Now, by applying the training algorithm in any of its forms, e.g., in the stochastic-gradient form (65.87) or in the mini-batch form (65.82), to a collection of N feature vectors h_n using γ_n = h_n, we end up with an unsupervised learning procedure that trains the network to learn how to map h_n onto itself (i.e., it learns how to recreate the input data). This situation is illustrated in Fig. 65.10, which shows an autoencoder fed by h. The hidden layer in this example has three nodes. If we denote their outputs by

{y_2(1), y_2(2), y_2(3)}       (65.100)

then the network will be mapping these three signals through the output layer back into a good approximation for the five entries of the input data, h. The step of generating the outputs in the hidden layer amounts to a form of data compression since it generates three hidden signals that are sufficient to reproduce the five input signals.

Figure 65.10 An autoencoder with a single hidden layer. The network is trained to map the feature vectors back to themselves using, for example, either the stochastic-gradient algorithm (65.87) or the mini-batch implementation (65.82).

Recall that we are denoting the post-activation output vector of the hidden layer by y_2 ∈ IR^{n_2}, the weight matrix between the input layer and the hidden layer by W_1 ∈ IR^{M×n_2}, and the bias vector feeding into the hidden layer by θ_1. Then, using these symbols, the autoencoder effectively determines a representation y_2 for h of the form:

y_2 = f(W_1^T h − θ_1)   (encoding)       (65.101)

where f(·) denotes the activation function. Moreover, since the autoencoder is trained to recreate h, then γ̂ = ĥ and we find that the representation y_2 is mapped back to the original feature vector by using

ĥ = f(W_2^T y_2 − θ_2)   (decoding)       (65.102)

in terms of the bias vector, θ_2, for the output layer and the weight matrix W_2 ∈ IR^{n_2×M} between the hidden layer and the output layer. We refer to these two transformations as “encoding” and “decoding” steps, respectively. In some implementations, the weight matrices of the encoder and decoder sections are tied together by setting W_2 = W_1^T – see Prob. 65.16.

Autoencoders provide a useful structure to encode data, i.e., to learn a compressed representation for the data and to perform dimensionality reduction. The reduced representation will generally extract the most significant features present in the data. For example, if the input data happens to be a raw signal, then the autoencoder can function as a feature extraction/detection module. Compact representations of this type are useful in reducing the possibility of overfitting in learning algorithms. This is because the reduced features can be used to drive learning algorithms to perform classification based on less complex models.
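Assuming sigmoidal activations, the encoding and decoding maps (65.101)–(65.102) amount to the following sketch (variable names are illustrative).

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def encode(h, W1, theta1):
    return sigmoid(W1.T @ h - theta1)        # hidden representation y_2, as in (65.101)

def decode(y2, W2, theta2):
    return sigmoid(W2.T @ y2 - theta2)       # reconstruction of h, as in (65.102)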

One can also consider designing autoencoders where the number of hidden units is larger than the dimension of the input vector, i.e., for which n_2 ≥ M. Obviously, this case will not lead to dimensionality reduction. However, it has been observed in practice that such “overcomplete” representations are useful in that the features y_2 that they produce can lead to reduced classification errors. In order to prevent the autoencoder from mapping the input vector back to itself at the hidden layer (i.e., in order to prevent the autoencoder from learning the identity mapping), a variation of the autoencoder structure is often used, known as a “denoising” autoencoder. Here, a fraction of the input entries in h are randomly set to zero (possibly as many as 50% of them). The perturbed input vector, denoted by h', is then applied to the input of the autoencoder while the original vector h continues to be used as the reference signal, γ. By doing so, the autoencoder ends up performing two tasks: encoding the input data and predicting the missing entries (i.e., countering the effect of the corruption). In order to succeed at the second task, the autoencoder ends up learning the correlations that may exist among the entries of the input vector in order to be able to “recover” the missing data. One can further consider autoencoder implementations that involve multiple hidden layers, thus leading to deep autoencoder architectures. Obviously, the training of these layered structures becomes more challenging due to the vanishing gradient problem, which we will discuss in a future section.

Example 65.5 (Linear autoencoders and principal component analysis) Consider a collection of N feature vectors {h_n ∈ IR^M} and assume they have already been preprocessed according to procedure (57.6), i.e., each vector is centered around the ensemble mean and each entry is scaled by the ensemble standard deviation. We continue to denote the preprocessed features by {h_n}. We introduce the sample covariance matrix

R̂ = (1/(N − 1)) Σ_{n=0}^{N−1} h_n h_n^T       (65.103)

and define its eigen-decomposition R̂ = UΛU^T, where U is M × M orthogonal and Λ is diagonal with nonnegative ordered entries

λ_1 ≥ λ_2 ≥ ... ≥ λ_M ≥ 0       (65.104)

Assume we wish to reduce the dimension of the feature data to M' < M. We retain the leading M' × M' block of Λ and the leading M' columns of U, denoted by U_1 (which is M × M'). We showed in listing (57.34) and Fig. 57.2 that the principal component analysis (PCA) implementation admits an encoder–decoder structure where the transformations from the original feature space h to the reduced space h' and back to the original space ĥ are given by

h'_n = U_1^T h_n,   ĥ_n = U_1 h'_n       (65.105)

If we make the identification y_2 ← h'_n, W_1 = U_1, and W_2 = U_1^T, we conclude that PCA is a special case of the structure shown in Fig. 65.10 with the nonlinearities removed and with the weight matrices tied together (since we now have W_2 = W_1^T). It should be noted that referring to PCA as a “linear” autoencoder, which is common in the literature, is technically inaccurate because the weight matrix U_1 depends on the feature data in a rather nonlinear fashion. The “linear” qualification is meant to refer to the fact that there are no nonlinearities when PCA is represented according to Fig. 65.10.

Interestingly, it turns out that the modeling capability of the PCA construction is comparable to autoencoders that employ nonlinearities. To see this, we collect all feature vectors into the N × M matrix (where N ≥ M):

H = [ h_0  h_1  ...  h_{N−1} ]^T,   (N × M)       (65.106)

and formulate the problem of seeking an approximation for H of rank M' < M that is optimal in the following sense:

Ĥ ≜ argmin_X ‖H − X‖²_F,   subject to rank(X) = M'       (65.107)

Then, we know from (57.49) that the solution is constructed as follows. We introduce the singular value decomposition (SVD):

H = V [ Σ ; 0 ] U^T       (65.108)

where V is N × N orthogonal, U is M × M orthogonal, and Σ is M × M. We let U_1 denote the leading M × M' submatrix of U. Then, it holds that

Ĥ = H U_1 U_1^T       (65.109)

In other words, the optimal approximation for H is given by the PCA construction. It involves applying U_1 followed by U_1^T to the input matrix H, as already described by (65.105).

Example 65.6 (Graph neural network) A standing assumption in our treatment of feedforward neural networks will be that all nodes from one layer are connected to all nodes in the subsequent layer. There are variations, however, in the form of graph neural networks (GNNs) where sparse connections are allowed. We describe one simple example here to illustrate the main idea. Consider, for instance, the relations:

z_{ℓ+1} = W_ℓ^T y_ℓ − θ_ℓ,   y_{ℓ+1} = f(z_{ℓ+1})       (65.110a)

which map the output vector y_ℓ for the ℓth layer to the output vector y_{ℓ+1} for the subsequent layer. If we use j to index the individual entries, then we can rewrite the above relations more explicitly as

z_{ℓ+1}(j) = Σ_{i=1}^{n_ℓ} w_{ij}^{(ℓ)} y_ℓ(i) − θ_ℓ(j),   y_{ℓ+1}(j) = f(z_{ℓ+1}(j))       (65.111a)

in terms of the weights {w_{ij}^{(ℓ)}} that appear on the jth row of W_ℓ^T. It is seen that all n_ℓ outputs {y_ℓ(i)} from layer ℓ contribute to the formation of each y_{ℓ+1}(j). We may consider instead a sparse structure where only a subset of the nodes from the previous layer contribute to y_{ℓ+1}(j). For example, let N_j^{(ℓ+1)} denote the subset of nodes from layer ℓ that contribute to y_{ℓ+1}(j). Then, we can replace the above expressions by writing

z_{ℓ+1}(j) = Σ_{i ∈ N_j^{(ℓ+1)}} w_{ij}^{(ℓ)} y_ℓ(i) − θ_ℓ(j)       (65.112a)
y_{ℓ+1}(j) = f(z_{ℓ+1}(j))       (65.112b)

In this way, procedure (65.31) for propagating signals forward through the network will need to be adjusted to (65.114). For the backward recursions, we start from relation (65.61), which shows how to update the sensitivity factors for two consecutive layers:

δ_ℓ = f'(z_ℓ) ⊙ (W_ℓ δ_{ℓ+1})       (65.113)

Feedforward propagation: graph neural network with L layers.
given a feedforward network with L layers (input+output+hidden);
given neighborhoods {N_j^{(ℓ+1)}} for every node j in every layer ℓ > 1;
start with y_1 = h;
repeat for ℓ = 1, ..., L − 1:
   for j = 1, 2, ..., n_{ℓ+1}:
      z_{ℓ+1}(j) = Σ_{i ∈ N_j^{(ℓ+1)}} w_{ij}^{(ℓ)} y_ℓ(i) − θ_ℓ(j)
      y_{ℓ+1}(j) = f(z_{ℓ+1}(j))
   end
end
z = z_L
γ̂ = y_L.
(65.114)
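One sparse layer of the forward recursion (65.114) can be sketched as follows; representing each neighborhood N_j^{(ℓ+1)} as a list of indices into the previous layer is an assumption of this sketch, not a prescription of the text.

import numpy as np

def gnn_forward_layer(y, W, theta, neighbors, f=np.tanh):
    # one sparse layer of (65.114): neighbors[j] lists the nodes of the previous
    # layer that feed node j of the next layer
    n_out = len(theta)
    z = np.empty(n_out)
    for j in range(n_out):
        idx = neighbors[j]
        z[j] = W[idx, j] @ y[idx] - theta[j]   # only the active connections contribute
    return f(z), z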

If we again use j to index individual entries, we have

δ_ℓ(j) = f'(z_ℓ(j)) ( Σ_{i=1}^{n_{ℓ+1}} w_{ji}^{(ℓ)} δ_{ℓ+1}(i) )       (65.115)

in terms of the weights {w_{ji}^{(ℓ)}} that appear on the jth row of W_ℓ. It is seen once more that all entries of δ_{ℓ+1} contribute to the formation of δ_ℓ(j). Motivated by the derivation (65.58)–(65.60), we consider a similar sparse construction and use instead

δ_ℓ(j) = f'(z_ℓ(j)) ( Σ_{i=1}^{n_{ℓ+1}} w_{ji}^{(ℓ)} δ_{ℓ+1}(i) I[ j ∈ N_i^{(ℓ+1)} ] )       (65.116)

In this way, the backward recursions appearing in procedure (65.87) to propagate signals back through the network will need to be adjusted to (65.119). If the neighborhoods happen to satisfy the symmetry condition:

j ∈ N_i^{(ℓ+1)}   ⇐⇒   i ∈ N_j^{(ℓ+1)}       (65.117)

then we can simplify (65.116) to

δ_ℓ(j) = f'(z_ℓ(j)) ( Σ_{i ∈ N_j^{(ℓ+1)}} w_{ji}^{(ℓ)} δ_{ℓ+1}(i) )       (65.118)

Other variations are possible.

Backward propagation: graph neural network with L layers.
for every pair (γ, h), feed h forward through the network using (65.114);
determine {z_ℓ, y_ℓ} across the layers and the output γ̂;
δ_L = 2(γ̂ − γ) ⊙ f'(z).
repeat for ℓ = L − 1, ..., 2, 1:
   θ_ℓ ← θ_ℓ + µδ_{ℓ+1}
   for j = 1, 2, ..., n_ℓ:
      δ_ℓ(j) ← f'(z_ℓ(j)) ( Σ_{i=1}^{n_{ℓ+1}} w_{ji}^{(ℓ)} δ_{ℓ+1}(i) I[ j ∈ N_i^{(ℓ+1)} ] ),   ℓ ≥ 2
   end
   for j = 1, 2, ..., n_{ℓ+1}:
      for i ∈ N_j^{(ℓ+1)}:
         w_{ij}^{(ℓ)} ← (1 − 2µρ)w_{ij}^{(ℓ)} − µ y_ℓ(i) δ_{ℓ+1}(j)
      end
   end
end
(65.119)

65.6 DROPOUT STRATEGY

We described in Section 62.1 the bagging technique for enhancing the performance of classifiers and combating overfitting. The same technique can in principle be applied to neural networks. As explained in that section, bagging is based on the idea of training multiple networks by sampling from the same dataset, and on combining the classification decisions, e.g., by taking a majority vote. In order to ensure variability across the networks, the training data for each network is obtained by sampling from the original dataset with replacement. While the bagging procedure is justified for simple network architectures, it can nevertheless become expensive for networks with many hidden layers that require training a large number of combination weights and offset coefficients.

One useful alternative to reduce overfitting and provide a form of regularization is to employ the dropout method. We refer to the mini-batch implementation (65.82), which trains a network with L layers. Each iteration m involves feeding a batch of B randomly selected data points {γ_b, h_b} through the network and adjusting its parameters from {W_{ℓ,m−1}, θ_{ℓ,m−1}} to {W_{ℓ,m}, θ_{ℓ,m}}. Dropout is based on the idea that during each iteration m, only a random fraction of the connections in the network are retained and their parameters adjusted. The thinning operation is achieved by switching a good portion of the nodes into sleeping mode, where nodes are turned off with probability p (usually, p is close to 1/2 for internal nodes and close to 0.1 for the input nodes). When a node is turned off, it does not feed any signals into subsequent layers, and its incoming and outgoing combination weights and bias coefficient are frozen and not adjusted during the iteration. The other active nodes in the network participate in the training and operate as if the sleeping nodes do not exist. The ultimate effect is that the size of the original network is scaled down by p (e.g., halved when p = 1/2). Only the combination weights and bias coefficients of the active nodes are adjusted by the backpropagation algorithm (65.82). This situation is illustrated in Fig. 65.11.

Figure 65.11 A succession of three layers where node 2 in layer ℓ is turned off at random. The dashed lines arriving at this node from the preceding layer ℓ−1, and also leaving from it toward layer ℓ+1, represent the weighting and bias coefficients that will be frozen and not adjusted by the training algorithm. These coefficients correspond to one column in W_ℓ^T, one row in W_{ℓ−1}^T, and one entry in θ_{ℓ−1}.

Forward propagation The training of the network proceeds as follows. During each iteration m, we associate a Bernoulli variable with each node in a generic layer `: it is equal to 0 with probability p` and 1 with probability 1 N p` . The variable will be zero when the node is turned off. We collect the Bernoulli variables for layer ` into a vector a` ∈ {0, 1}n` and denote it by writing ∆

a` = Bernoulli(p` , n` )

(65.120)

This notation means that a` is a vector of dimension n` × 1 and its entries are Bernoulli variables, each with success probability 1 N p` . Then, during the forward propagation of signals through the feedforward network by means of algorithm (65.31), the expression for z`+1 will be modified to:

z_{ℓ+1} = W_ℓ^T (y_ℓ ⊙ a_ℓ) − θ_ℓ        (65.121)

where the Hadamard product is used to annihilate the output signals from sleeping nodes. During normal operation of the forward step, the quantities {y_ℓ, z_{ℓ+1}} in this expression will be computed for every index b within a mini-batch, while the parameters {W_ℓ, θ_ℓ, a_ℓ} will be the ones available at the start of iteration m, so that we should write more explicitly:

z_{ℓ+1,b} = W_{ℓ,m−1}^T (y_{ℓ,b} ⊙ a_{ℓ,m}) − θ_{ℓ,m−1},  b = 0, 1, . . . , B − 1        (65.122)

The Bernoulli vectors {a`,m } only vary with m and, therefore, remain fixed for all samples within the B-size mini-batch. In other words, the thinned network structure remains invariant during the processing of each mini-batch of samples.
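As an illustration only, the mask generation (65.120) and the masked forward step (65.122) can be transcribed into a few lines of NumPy. The layer sizes, drop probability, activation function, and data below are hypothetical choices, not values prescribed in the text.

import numpy as np

rng = np.random.default_rng(0)

def bernoulli(p, n):
    # a = Bernoulli(p, n): entries equal 0 with probability p (node asleep) and 1 with probability 1 - p
    return (rng.random(n) > p).astype(float)

def dropout_forward_step(W, theta, y, a):
    # z_{l+1} = W_l^T (y_l o a_l) - theta_l, followed by the activation (tanh chosen here for illustration)
    return np.tanh(W.T @ (y * a) - theta)

# hypothetical sizes: n_l = 5 input nodes, n_{l+1} = 3 output nodes, mini-batch of B = 4
W = rng.standard_normal((5, 3))
theta = rng.standard_normal(3)
a = bernoulli(0.5, 5)               # one mask per iteration, shared by the whole mini-batch
Y = rng.standard_normal((4, 5))     # rows are the vectors y_{l,b} in the mini-batch
Y_next = np.vstack([dropout_forward_step(W, theta, y_b, a) for y_b in Y])
print(Y_next.shape)                 # (4, 3): the same thinned network is used for all B samples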

Backward propagation
During the backward step of the training algorithm, at iteration m, only the combination weights and bias coefficients of active nodes are updated, while the combination weights and bias coefficients of sleeping nodes remain intact. For example, in the context of Fig. 65.11, the second column of W_{ℓ,m−1}^T, the second row of W_{ℓ−1,m−1}^T, and the second entry of θ_{ℓ−1,m−1} are not updated; they remain frozen at their existing values within the new W_{ℓ,m}, W_{ℓ−1,m}, and θ_{ℓ−1,m}. At the next iteration, m + 1, involving a new mini-batch of samples, the process is repeated. A new collection of Bernoulli vectors {a_{ℓ,m+1}} is generated, resulting in a new thinned network. The batch of B samples is fed forward through this network and its thinned parameters adjusted to {W_{ℓ,m+1}, θ_{ℓ,m+1}}, and so on. If we repeat the derivation of the backpropagation algorithm under the dropout condition, we arrive at listing (65.124). Again, if the softmax construction (65.39) is employed at the output layer, then the expression for the boundary condition δ_{L,b} would be replaced by

δ_{L,b} = 2J (γ̂_b − γ_b),  J = diag(γ̂_b) − γ̂_b γ̂_b^T        (65.123)

Observe that the recursions in the forward and backward passes are similar to the implementation without dropout, with the main difference being the incorporation of the Bernoulli vector a_{ℓ,m} in three locations: during the generation of z_{ℓ+1,b} in the forward pass, and during the generation of W_{ℓ,m} and δ_{ℓ,b} in the backward pass. The net effect of the dropout step is the following:

(a) During the backward pass, at the first iteration corresponding to ℓ = L − 1, the columns of W_{L−1,m}^T corresponding to sleeping nodes in a_{L−1,m} (i.e., to its zero entries) are not updated and stay at the values in W_{L−1,m−1}^T.

(b) For the subsequent stages ℓ = L − 2, L − 3, . . . , 1, the columns of W_{ℓ,m}^T corresponding to sleeping nodes in a_{ℓ,m} are not updated and stay at the values in W_{ℓ,m−1}^T. Likewise, the rows of W_{ℓ,m}^T corresponding to sleeping nodes in a_{ℓ+1,m} are not updated and stay at the values in W_{ℓ,m−1}^T. By the same token, all entries in θ_{ℓ,m} corresponding to the sleeping nodes in a_{ℓ+1,m} are not updated and stay at the values in θ_{ℓ,m−1}.

Mini-batch backpropagation for solving (65.45) with dropout.

given a feedforward network with L layers (input + output + hidden);
pre- and post-activation signals at the output layer are (z_n, γ̂_n);
internal pre- and post-activation signals are {z_{ℓ,n}, y_{ℓ,n}};
given N training data samples {γ_n, h_n}, n = 0, 1, . . . , N − 1;
given Bernoulli probabilities {p_ℓ}, ℓ = 1, 2, . . . , L − 1;
given small step size µ > 0 and regularization parameter ρ ≥ 0;
start from random initial parameters {W_{ℓ,−1}, θ_{ℓ,−1}}.

repeat until convergence over m = 0, 1, 2, . . .:
    select B random data pairs {γ_b, h_b}
    generate a_{ℓ,m} = Bernoulli(p_ℓ, n_ℓ), ℓ = 1, 2, . . . , L − 1

    (forward processing)
    repeat for b = 0, 1, . . . , B − 1:
        y_{1,b} = h_b
        repeat for ℓ = 1, 2, . . . , L − 1:
            z_{ℓ+1,b} = W_{ℓ,m−1}^T (y_{ℓ,b} ⊙ a_{ℓ,m}) − θ_{ℓ,m−1}
            y_{ℓ+1,b} = f(z_{ℓ+1,b})
        end
        γ̂_b = y_{L,b}
        z_b = z_{L,b}
        δ_{L,b} = 2(γ̂_b − γ_b) ⊙ f′(z_b)
    end

    (backward processing)
    repeat for ℓ = L − 1, . . . , 2, 1:
        W_{ℓ,m} = (1 − 2µρ) W_{ℓ,m−1} − (µ/B) Σ_{b=0}^{B−1} (y_{ℓ,b} ⊙ a_{ℓ,m}) δ_{ℓ+1,b}^T
        θ_{ℓ,m} = θ_{ℓ,m−1} + (µ/B) Σ_{b=0}^{B−1} δ_{ℓ+1,b}
        δ_{ℓ,b} = f′(z_{ℓ,b}) ⊙ (W_{ℓ,m−1} δ_{ℓ+1,b}) ⊙ a_{ℓ,m},  ℓ ≥ 2, ∀b
    end
end

{W_ℓ^⋆, θ_ℓ^⋆} ← { (1 − p_ℓ) W_{ℓ,m}, (1 − p_ℓ) θ_{ℓ,m} }.
                                                        (65.124)
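The backward half of listing (65.124) differs from the standard mini-batch update only through the presence of the mask a_{ℓ,m}, which zeroes the contributions of sleeping nodes. The following NumPy sketch of one backward step for a single internal layer is illustrative only: the dimensions, step size, regularization value, activation (tanh), and the sensitivity vectors δ_{ℓ+1,b} are all placeholder values assumed for the sketch.

import numpy as np

rng = np.random.default_rng(1)
mu, rho, B = 0.01, 1e-4, 4                      # hypothetical step size, regularization, batch size
n_l, n_next = 5, 3                              # widths of layers l and l+1 (arbitrary)

W_prev = rng.standard_normal((n_l, n_next))     # W_{l,m-1}
theta_prev = rng.standard_normal(n_next)        # theta_{l,m-1}
a = (rng.random(n_l) > 0.5).astype(float)       # a_{l,m}: zero entries mark sleeping nodes
Y = rng.standard_normal((B, n_l))               # rows are y_{l,b}
Z = rng.standard_normal((B, n_l))               # rows are z_{l,b}
D_next = rng.standard_normal((B, n_next))       # rows are delta_{l+1,b}

# W_{l,m} = (1 - 2*mu*rho) W_{l,m-1} - (mu/B) sum_b (y_{l,b} o a_{l,m}) delta_{l+1,b}^T
W_new = (1 - 2 * mu * rho) * W_prev - (mu / B) * (Y * a).T @ D_next
# theta_{l,m} = theta_{l,m-1} + (mu/B) sum_b delta_{l+1,b}
theta_new = theta_prev + (mu / B) * D_next.sum(axis=0)
# delta_{l,b} = f'(z_{l,b}) o (W_{l,m-1} delta_{l+1,b}) o a_{l,m}, with f = tanh in this sketch
D_l = (1 - np.tanh(Z) ** 2) * (D_next @ W_prev.T) * a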

The convergence time of the dropout implementation is expected to be longer than that of an implementation without dropout. Once training is completed, all nodes in the network are activated again for testing purposes without any dropout. As indicated in the last line of the algorithm, the combination weights and bias coefficients for the network will be set to scaled versions of the parameter values obtained at the end of training. If a particular node was dropped with probability p during training, then its outgoing combination weights will need to be scaled by 1 − p (e.g., they should be scaled by 2/3 if p = 1/3) in order to account for the fact that these weights were determined with only a fraction of the nodes active. By doing so, we are in effect averaging the performance of a collection of randomly thinned networks, in a manner that mimics the bagging technique, except that the training was performed on successive networks with a reduced number of parameters. As a result, the possibility of overfitting is reduced since each iteration involves training a sparse version of the network.
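The final rescaling step in (65.124) amounts to multiplying the learned parameters by the retention probabilities. A minimal sketch, with made-up parameter values and hypothetical per-layer drop probabilities, is:

import numpy as np

W_train = {1: np.ones((4, 3)), 2: np.ones((3, 2))}      # toy stand-ins for the trained weights
theta_train = {1: np.zeros(3), 2: np.zeros(2)}
p = {1: 0.1, 2: 0.5}                                     # hypothetical drop probabilities per layer

# last line of (65.124): scale by the retention probability 1 - p_l before testing
W_test = {l: (1 - p[l]) * W for l, W in W_train.items()}
theta_test = {l: (1 - p[l]) * th for l, th in theta_train.items()}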

65.7 REGULARIZED CROSS-ENTROPY RISK

The derivations in the earlier sections illustrated the training of feedforward neural networks by minimizing the regularized least-squares empirical risk (65.47). Other risk functions are of course possible. In this section we examine one popular scheme that is suitable for multiclass classification problems where the label vector γ is assumed to be one-hot encoded. That is, the entries of γ assume binary values in {0, 1} with only one entry equal to 1 while all other entries are 0:

γ ∈ {0, 1}^Q        (65.125)

In this setting, there are Q classes and a feature vector h ∈ IR^M can belong to one of the classes. The corresponding label γ ∈ IR^Q will have the form of a basis vector and the location of its unit entry will identify the class of h. The output layer of the neural network will now be a softmax layer where the entries of the output vector γ̂ are computed as follows:

γ̂(q) ≜ e^{z(q)} ( Σ_{q′=1}^{Q} e^{z(q′)} )^{−1}        (65.126)

where the symbol z refers to the pre-activation signal at the last layer. The parameters {W_ℓ, θ_ℓ} of the network will be determined by minimizing the following regularized cross-entropy empirical risk:

{W_ℓ^⋆, θ_ℓ^⋆} ≜ argmin_{W_ℓ, θ_ℓ} { P(W, θ) ≜ Σ_{ℓ=1}^{L−1} ρ‖W_ℓ‖²_F − (1/N) Σ_{n=0}^{N−1} Σ_{q=1}^{Q} γ_n(q) ln(γ̂_n(q)) }        (65.127)

where γ_n is the label vector associated with the nth feature vector h_n and γ̂_n is the corresponding output vector. The loss function that is associated with each data point in (65.127) is seen to be:

Q(W, θ; γ, h) ≜ − Σ_{q=1}^{Q} γ(q) ln(γ̂(q))        (65.128)

This loss value depends on all weight and bias parameters. For simplicity, we will drop the parameters and write Q(γ, h).

Example 65.7 (Binary labels) Consider the special case in which Q = 2 so that the network has only two outputs, denoted by

γ̂ = col{ γ̂(1), γ̂(2) }        (65.129)

These outputs add up to 1 in view of the softmax calculation (65.126). The feature vectors belong to one of two classes so that

γ = col{1, 0}  or  γ = col{0, 1}        (65.130)

In this case, expression (65.127) for the empirical risk simplifies to

P(W, θ) ≜ Σ_{ℓ=1}^{L−1} ρ‖W_ℓ‖²_F − (1/N) Σ_{n=0}^{N−1} { γ_n(1) ln(γ̂_n(1)) + (1 − γ_n(1)) ln(1 − γ̂_n(1)) }        (65.131)

where, for each n, only one of the terms inside the rightmost summation is nonzero since either γ_n(1) = 0 or γ_n(1) = 1.

65.7.1 Motivation for Cross-Entropy

We already know from the result of the earlier Example 31.5 that cross-entropy minimization is related to maximum-likelihood (ML) inference and to the minimization of the Kullback–Leibler (KL) divergence between a true distribution and an empirical approximation for it. We provide further motivation for this observation here and explain how it applies to the choice of risk function in (65.127). Consider two probability mass functions (pmfs), denoted generically by p_x(x) and s_x(x), for a random variable x. Their cross-entropy is denoted by H(p, s) and defined as the value:

H(p, s) ≜ − E_p log_2 s_x(x)
        = − Σ_x p_x(x) log_2 s_x(x)
        = − Σ_x P_p(x = x) log_2 P_s(x = x)        (65.132)

where the sum is over the discrete realizations for x, and the notation P_p(x = x) refers to the probability that event x = x occurs under the discrete distribution

p_x(x). Similarly for P_s(x = x) under the distribution s_x(x). It is straightforward to verify that the cross-entropy is, apart from an offset value, equal to the KL divergence measure between the two distributions, namely,

H(p, s) = H(p) + D_KL(p, s)        (65.133)

in terms of the entropy of the distribution p_x(x):

H(p) ≜ − Σ_x P_p(x = x) log_2 P_p(x = x)        (65.134)

In this way, the cross-entropy between two distributions is effectively a measure of how close the distributions are to each other. To illustrate the relevance of this conclusion in the context of classification problems, let us consider a multiclass classification scenario with Q classes where the label vectors γ ∈ IR^Q are one-hot encoded. For example, if a feature vector h belongs to some class r, then the rth entry of its label vector γ will be equal to 1 while all other entries will be 0. Let γ̂ denote the output vector that is generated by the neural network for this feature vector. Recall that the output layer is based on a softmax calculation and, hence, we can interpret γ̂ as defining a probability distribution over the class variable, denoted by

P_s(r = q) = γ̂(q),  q = 1, 2, . . . , Q        (65.135)

Likewise, if γ is the true label vector, we can use its entries to define a second probability distribution on the same class variable, denoted by

P_p(r = q) = γ(q),  q = 1, 2, . . . , Q        (65.136)

Since only one entry of the vector γ is equal to 1, this second distribution will be zero everywhere except at location q = r. Ideally, we would like the distribution that results from γ̂ to match the distribution from γ. The cross-entropy between the two distributions is given by

H(p, s) = − Σ_{q=1}^{Q} γ(q) log_2 γ̂(q)        (65.137)

This result is the reason for the form of the rightmost term in (65.127), where the outer sum is over all training samples.

Example 65.8 (Log loss function) The rightmost term in (65.127) is often referred to as the log-loss function. We denote it by

LogLoss ≜ − (1/N) Σ_{n=0}^{N−1} Σ_{q=1}^{Q} γ_n(q) ln(γ̂_n(q))        (65.138)

which in view of the above discussion has the following interpretation in terms of the class variables:

LogLoss ≜ − (1/N) Σ_{n=0}^{N−1} Σ_{q=1}^{Q} I[ r(h_n) = q ] ln P(r^⋆(h_n) = q)        (65.139)

Here, the indicator term I[r(h_n) = q] is equal to 1 or 0, depending on whether the true label for feature h_n is q, while the term P(r^⋆(h_n) = q) represents the probability that the classifier will assign label q to h_n. The notation r(h_n) and r^⋆(h_n) refers to the true and assigned labels for feature h_n, respectively; they both assume integer values in the set {1, 2, . . . , Q}.
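For concreteness, the softmax output (65.126) and the log-loss (65.138) can be evaluated directly; the logits and one-hot labels below are made-up values used only to exercise the formulas.

import numpy as np

def softmax(z):
    # gamma_hat(q) = exp(z(q)) / sum_q' exp(z(q'))  -- cf. (65.126)
    e = np.exp(z - z.max(axis=-1, keepdims=True))   # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def log_loss(Gamma, Gamma_hat):
    # LogLoss = -(1/N) sum_n sum_q gamma_n(q) ln gamma_hat_n(q)  -- cf. (65.138)
    return -np.mean(np.sum(Gamma * np.log(Gamma_hat), axis=1))

Z = np.array([[2.0, 0.5, -1.0],     # pre-activation vectors z_n (made-up)
              [0.1, 0.2, 0.3]])
Gamma = np.array([[1, 0, 0],        # one-hot label vectors gamma_n
                  [0, 0, 1]])
print(log_loss(Gamma, softmax(Z)))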

65.7.2 Sensitivity Factors

We return to problem (65.127), where the objective is to minimize the risk function over the parameters {W_ℓ, θ_ℓ}. The derivation that follows is meant to show that the same backpropagation algorithm from before will continue to apply, where the only adjustment that is needed is in the value of the boundary sensitivity vector. We drop the subscript n for convenience of the derivation; we restore it in the listing of the algorithm. As before, we associate with each layer ℓ a sensitivity vector of size n_ℓ denoted by δ_ℓ ∈ IR^{n_ℓ} and with entries {δ_ℓ(j)} defined by

δ_ℓ(j) ≜ ∂Q(γ, h) / ∂z_ℓ(j),  j = 1, 2, . . . , n_ℓ        (65.140)

in terms of the partial derivative of the loss function (65.128). We can derive a recursive update for the vector δ_ℓ. We consider first the output layer, for which ℓ = L. The chain rule for differentiation gives

δ_L(j) ≜ ∂Q(γ, h)/∂z(j) = ∂/∂z(j) ( − Σ_{q=1}^{Q} γ(q) ln(γ̂(q)) )        (65.141)

Using the fact that

∂ ln(γ̂(q))/∂z(j) = 1 − γ̂(j) if q = j,  and  −γ̂(j) if q ≠ j        (65.142)

we find that

δ_L(j) = γ̂(j) Σ_{q=1}^{Q} γ(q) − γ(j) = γ̂(j) − γ(j)        (65.143)

since the entries of γ add up to 1.

In other words, the boundary sensitivity vector is now given by

δ_L = γ̂ − γ        (65.144)

Next, repeating the same arguments after (65.57b), we can similarly derive a backward recursion for updating the sensitivity vectors as

δ_ℓ = f′(z_ℓ) ⊙ (W_ℓ δ_{ℓ+1})        (65.145)

along with the same gradient expressions:

∂P(W, θ)/∂W_ℓ = 2ρ W_ℓ + (1/N) Σ_{n=0}^{N−1} y_{ℓ,n} δ_{ℓ+1,n}^T        (65.146a)

∂P(W, θ)/∂θ_ℓ = − (1/N) Σ_{n=0}^{N−1} δ_{ℓ+1,n}        (65.146b)

In summary, the same backpropagation algorithm (65.82), and its dropout version (65.124), continue to hold, with the main difference being that the boundary condition for δ_{L,b} should be replaced by

δ_{L,b} = γ̂_b − γ_b        (65.147)

Recall that this conclusion assumes a neural network with a softmax construction at the output layer, and one-hot encoded label vectors of the form γ ∈ {0, 1}^Q. When the batch size is B = 1, the mini-batch recursions (65.149) simplify to the listing shown in (65.150). In both implementations, the parameters {W_ℓ^⋆, θ_ℓ^⋆} at the end of the training phase are used for testing purposes. For example, assume a new feature vector h is fed into the network and let γ̂ denote the corresponding output vector. We declare the class of h to be the index that corresponds to the largest value within γ̂:

r^⋆ = argmax_{1≤q≤Q} γ̂(q)        (65.148)

For ease of reference, we collect in Table 65.2 the boundary values for the sensitivity factor under different scenarios.

Table 65.2 List of terminal sensitivity factors for different risk functions and/or network structure.

Risk function            | Output layer      | Boundary sensitivity factor, δ_{L,m}
least-squares (65.47)    | activation, f(·)  | 2(γ̂_m − γ_m) ⊙ f′(z_m)
least-squares (65.47)    | softmax (65.39)   | 2J(γ̂_m − γ_m),  J = diag(γ̂_m) − γ̂_m γ̂_m^T
cross-entropy (65.127)   | softmax (65.39)   | γ̂_m − γ_m
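The last row of Table 65.2 is easy to check numerically: perturb one entry of z, recompute the cross-entropy loss (65.128) with a softmax output, and compare the finite-difference slope against γ̂ − γ. The values below are arbitrary and serve only as a sanity check.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(z, gamma):
    # Q(gamma, h) = -sum_q gamma(q) ln gamma_hat(q), with gamma_hat = softmax(z)
    return -np.sum(gamma * np.log(softmax(z)))

z = np.array([0.3, -1.2, 2.0, 0.7])        # arbitrary pre-activation vector
gamma = np.array([0.0, 0.0, 1.0, 0.0])     # one-hot label
delta_L = softmax(z) - gamma               # claimed boundary sensitivity (65.144)

eps = 1e-6
numeric = np.array([(loss(z + eps * np.eye(4)[j], gamma)
                     - loss(z - eps * np.eye(4)[j], gamma)) / (2 * eps) for j in range(4)])
print(np.allclose(numeric, delta_L, atol=1e-6))   # True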

Mini-batch backpropagation algorithm for solving (65.127).

given a feedforward network with L layers (input + output + hidden);
pre- and post-activation signals at the output layer are (z_n, γ̂_n);
internal pre- and post-activation signals are {z_{ℓ,n}, y_{ℓ,n}};
given N training data samples {γ_n, h_n}, n = 0, 1, . . . , N − 1;
given small step size µ > 0 and regularization parameter ρ ≥ 0;
start from random initial parameters {W_{ℓ,−1}, θ_{ℓ,−1}}.

repeat until convergence over m = 0, 1, 2, . . .:
    select B random data pairs {γ_b, h_b}

    (forward processing)
    repeat for b = 0, 1, . . . , B − 1:
        y_{1,b} = h_b
        repeat for ℓ = 1, 2, . . . , L − 1:
            z_{ℓ+1,b} = W_{ℓ,m−1}^T y_{ℓ,b} − θ_{ℓ,m−1}
            y_{ℓ+1,b} = f(z_{ℓ+1,b})
        end
        z_b = z_{L,b}
        γ̂_b = softmax(z_b)
        δ_{L,b} = γ̂_b − γ_b
    end

    (backward processing)
    repeat for ℓ = L − 1, . . . , 2, 1:
        W_{ℓ,m} = (1 − 2µρ) W_{ℓ,m−1} − (µ/B) Σ_{b=0}^{B−1} y_{ℓ,b} δ_{ℓ+1,b}^T
        θ_{ℓ,m} = θ_{ℓ,m−1} + (µ/B) Σ_{b=0}^{B−1} δ_{ℓ+1,b}
        δ_{ℓ,b} = f′(z_{ℓ,b}) ⊙ (W_{ℓ,m−1} δ_{ℓ+1,b}),  ℓ ≥ 2, b = 0, 1, . . . , B − 1
    end
end

{W_ℓ^⋆, θ_ℓ^⋆} ← {W_{ℓ,m}, θ_{ℓ,m}}.
                                                        (65.149)

Stochastic-gradient backpropagation for solving (65.127).

given a feedforward network with L layers (input + output + hidden);
pre- and post-activation signals at the output layer are (z_n, γ̂_n);
internal pre- and post-activation signals are {z_{ℓ,n}, y_{ℓ,n}};
given N training data samples {γ_n, h_n}, n = 0, 1, . . . , N − 1;
given small step size µ > 0 and regularization parameter ρ ≥ 0;
start from random initial parameters {W_{ℓ,−1}, θ_{ℓ,−1}}.

repeat until convergence over m = 0, 1, 2, . . .:
    select one random data pair (h_m, γ_m)
    y_{1,m} = h_m

    (forward processing)
    repeat for ℓ = 1, 2, . . . , L − 1:
        z_{ℓ+1,m} = W_{ℓ,m−1}^T y_{ℓ,m} − θ_{ℓ,m−1}
        y_{ℓ+1,m} = f(z_{ℓ+1,m})
    end
    z_m = z_{L,m}
    γ̂_m = softmax(z_m)
    δ_{L,m} = γ̂_m − γ_m

    (backward processing)
    repeat for ℓ = L − 1, . . . , 2, 1:
        W_{ℓ,m} = (1 − 2µρ) W_{ℓ,m−1} − µ y_{ℓ,m} δ_{ℓ+1,m}^T
        θ_{ℓ,m} = θ_{ℓ,m−1} + µ δ_{ℓ+1,m}
        δ_{ℓ,m} = f′(z_{ℓ,m}) ⊙ (W_{ℓ,m−1} δ_{ℓ+1,m}),  ℓ ≥ 2
    end
end
                                                        (65.150)
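A compact NumPy transcription of listing (65.150) is given below for a small network with tanh internal activations and a softmax output. The layer sizes, step size, regularization value, and data are placeholders; the code follows the indexing conventions of the listing rather than any particular library.

import numpy as np

rng = np.random.default_rng(0)
sizes = [6, 4, 3]                 # n_1, n_2, n_3 (hypothetical); L = 3 layers
mu, rho = 1e-3, 1e-4

# W[l], theta[l] map layer l to layer l+1, for l = 1, ..., L-1
W = {l: 0.1 * rng.standard_normal((sizes[l - 1], sizes[l])) for l in (1, 2)}
theta = {l: np.zeros(sizes[l]) for l in (1, 2)}

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sgd_step(h, gamma):
    # forward processing (tanh internal activations, softmax output)
    y = {1: h}; z = {}
    for l in (1, 2):
        z[l + 1] = W[l].T @ y[l] - theta[l]
        y[l + 1] = np.tanh(z[l + 1])
    gamma_hat = softmax(z[3])
    delta = {3: gamma_hat - gamma}                 # boundary condition (65.147)
    # backward processing
    for l in (2, 1):
        W_prev = W[l].copy()                       # W_{l,m-1} is used to propagate delta
        W[l] = (1 - 2 * mu * rho) * W[l] - mu * np.outer(y[l], delta[l + 1])
        theta[l] = theta[l] + mu * delta[l + 1]
        if l >= 2:
            delta[l] = (1 - np.tanh(z[l]) ** 2) * (W_prev @ delta[l + 1])

h = rng.standard_normal(6)
gamma = np.eye(3)[1]              # one-hot label for class 2
sgd_step(h, gamma)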

Example 65.9 (Classification of handwritten digits) We illustrate the operation of a neural network by applying it to the problem of identifying handwritten digits using the same MNIST dataset considered earlier in Example 52.3. The dataset consists of 60,000 labeled training samples and 10,000 labeled testing samples. Each entry in the dataset is a 28 × 28 grayscale image, which we transform into an M = 784-long feature vector, h_n. Each pixel in the image and, therefore, each entry in h_n, assumes integer values in the range [0, 255]. Every feature vector (or image) is assigned an integer label in the range 0–9, depending on which digit the image corresponds to. The earlier Fig. 52.6, which is repeated here as Fig. 65.12, shows randomly selected images from the training dataset.

Figure 65.12 Randomly selected images from the MNIST dataset for handwritten digits. Each image is 28 × 28 grayscale with pixels assuming integer values in the range [0, 255].

We preprocess the images (or the corresponding feature vectors {h_n}) by scaling their entries by 255 (so that they assume values in the range [0, 1]). We subsequently compute the mean feature vectors for both the training and test sets. We center the scaled feature vectors around these means in both sets. The earlier Fig. 52.7 showed randomly selected images for the digits {0, 1} before and after processing. We construct a neural network with a total of four layers: one input layer, one output layer, and two hidden layers. The size of the input layer is n_1 = 784 (which agrees with the size of the feature vectors), while the size of the output layer is n_4 = 10 (which agrees with the number of classes). The size of the hidden layers is set to n_2 = n_3 = 512 neurons. We employ a softmax layer at the output and train the network using a regularized cross-entropy criterion with parameters

µ = 0.001,  ρ = 0.0001        (65.151)
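The preprocessing described above (scaling by 255 and mean-centering, with the means computed separately over the training and test sets) can be sketched as follows. Loading of the raw MNIST arrays is assumed to have happened elsewhere, so the arrays below are random stand-ins with the correct feature dimension.

import numpy as np

rng = np.random.default_rng(0)
# stand-ins for the raw data (in the example the sets have 60,000 and 10,000 rows)
X_train = rng.integers(0, 256, size=(1000, 784)).astype(float)
X_test = rng.integers(0, 256, size=(200, 784)).astype(float)

# scale the entries to the range [0, 1]
X_train /= 255.0
X_test /= 255.0

# center each set around its own mean feature vector, as described in the example
X_train -= X_train.mean(axis=0)
X_test -= X_test.mean(axis=0)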

We run P = 200 passes of the stochastic-gradient algorithm (65.150) over the training data, with the data being randomly reshuffled at the start of each pass. At the end of the training phase, we evaluate the empirical error rate over the 10,000 test samples, as well as over the 60,000 training samples. We simulate three different scenarios where we vary the nonlinearity at the output of the internal neurons: sigmoid, rectifier, and tanh. We also simulate a dropout implementation with p_1 = 0.1 for the input layer and p_2 = p_3 = 0.5 for the two hidden layers; in this last simulation, we use the sigmoid activation function for the internal nodes and perform the same number of passes, 200, over the data. The results are summarized in Table 65.3. The performance under dropout can be improved by using a larger number of passes due to the slower convergence in this case.

Table 65.3 Empirical error rates over the test and training samples from the MNIST dataset for three types of internal nonlinearities: sigmoid, rectifier, and tanh. The last row corresponds to a dropout implementation using 200 passes over the data, the sigmoid activation function, and putting 50% of the neurons in the hidden layers to sleep at each iteration.

Nonlinearity | Empirical test error (%) | Number of test errors | Empirical training error (%) | Number of training errors
sigmoid      | 2.18%                    | 218                   | 1.02%                        | 613
tanh         | 1.84%                    | 184                   | 0.00167%                     | 1
rectifier    | 1.82%                    | 182                   | 0.00167%                     | 1
dropout      | 6.22%                    | 622                   | 6.25%                        | 3752

Example 65.10 (Classification of tiny color images) We again illustrate the operation of neural networks by applying them to the problem of classifying color images into 1 of

10 classes using the CIFAR-10 dataset. This dataset consists of color images that can belong to 1 of 10 classes: airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks. Figure 65.13 shows random selections of images from the dataset. The images in the dataset have low resolution and that is why they appear blurred.

Figure 65.13 Randomly selected color images from the CIFAR-10 dataset. Each image has three channels (red, green, blue) of size 32 × 32 each. The pixels in each channel assume integer values in the range [0, 255].

The CIFAR-10 dataset is found at www.cs.toronto.edu/~kriz/cifar.html. There are 6000 images per class for a total of 60,000 images in the dataset. There are 50,000 training images and 10,000 test images. There are 1000 random images from each class in the test collection of 10,000 images. The training images are divided into 5 batches of 10,000 images each. Each training batch may contain more images from one class than another. Each image has size 32 × 32 in the red, green, and blue color channels, which we transform into an M = 32 × 32 × 3 = 3072-long feature vector, h_n. Each pixel in the image assumes integer values in the range [0, 255]. Each feature vector (or image) is assigned

an integer class label in the range 0–9. We preprocess the images (or the corresponding feature vectors {h_n}) by scaling their entries by 255 (so that they assume values in the range [0, 1]). We subsequently compute the mean feature vectors for the training and test sets and center the scaled feature vectors in both sets around these means. We construct a neural network with a total of four layers: one input layer, one output layer, and two hidden layers. The size of the input layer is n_1 = 3072 (which agrees with the size of the feature vectors), while the size of the output layer is n_4 = 10 (which agrees with the number of classes). The size of the hidden layers is set to n_2 = n_3 = 2048 neurons. We employ a softmax layer at the output of the network, and rectifier units at the internal neurons. We train the network using a regularized cross-entropy criterion with parameters

µ = 0.001,  ρ = 0.0001        (65.152)

We run a stochastic-gradient version of the backpropagation algorithm (65.82) with mini-batches of size equal to one sample, and adjusted to the cross-entropy scenario where the boundary condition δ_{L,b} is replaced by

δ_{L,b} = γ̂_b − γ_b        (65.153)

We run P = 200 passes of the stochastic-gradient algorithm (65.150) over the training data, with the data being randomly reshuffled at the start of each pass. At the end of the training phase, we evaluate the empirical error rate over the 10,000 test samples and also over the 50,000 training samples. We also simulate a dropout implementation with p_1 = 0.1 for the input layer and p_2 = p_3 = 0.5 for the two hidden layers using now P = 300 passes over the data. The results are summarized in Table 65.4. It is seen from the results on the test data that this is a more challenging classification problem.

Table 65.4 Empirical error rates over 10,000 test samples and 50,000 training samples from the CIFAR-10 dataset, with and without dropout.

Setting      | Empirical test error (%) | Number of test errors | Empirical training error (%) | Number of training errors
w/o dropout  | 42.28%                   | 4228                  | 0.02%                        | 10
w/ dropout   | 42.92%                   | 4292                  | 5.62%                        | 2810

Example 65.11 (Transfer learning) Assume a neural network has been trained to perform a certain task A, such as detecting images of cars (“car” versus “no car”). This assumes that a large amount of training data is available so that the network can be trained well (say, by minimizing a regularized cross-entropy empirical risk) to perform its intended task with minimal classification error. Now assume we wish to train a second neural network to perform another task B, which also involves classifying images, say, detecting whether an image shows a bird or not. This objective is different from detecting the presence of cars. If we happen to have a sufficient amount of training data under task B, then we could similarly train this second network to perform its task well. However, it may be the case that while we had a large amount of data to train network A, we may only have a limited amount of data to train network B. Transfer learning provides one useful method to transfer the knowledge acquired from training the first network and apply it to assist in training the second network. The approach exploits the fact that the input data to both tasks, A and B (i.e., to both neural networks), are of similar type: they are images of the same size. There are of course other methods to transfer learning/knowledge from one situation to another, and we will describe one such method later in Chapter 72 when we study meta learning. Here, we continue with transfer learning.


Figure 65.14 The top part shows the neural network for solving task A; it consists of three hidden layers. The last layer is replaced, with new parameters (W_4, θ_4), as shown in the lower part of the figure. These parameters are trained using the data {γ(n), y_{4,n}} for task B.

The main idea is to replace the last output layer of network A by a new weight matrix and a new bias vector, and to retrain only these last-layer parameters, while keeping the weights and biases from all prior layers fixed at the values obtained during the training of network A. The main reason why this approach works reasonably well is that the earlier layers from network A have been well trained to identify many low-level features (such as edges and textures) that continue to be useful for task B. It is generally the last layer that is responsible for performing the final step of prediction or classification in a network. We can therefore limit our training for network B to retraining the weight matrix and bias vector for the last layer using the data {γ(n), y_{4,n}} from network B. Some variations are possible:

(a) We can use the training data available for task B to train the last layer only, while keeping the weights and bias vectors of all prior layers fixed at the values obtained from training network A. This is the approach described above (see the sketch following this list).
(b) Once the training from step (a) is concluded, we can consider fine-tuning all weight and bias vector parameters across all layers by using the training data from task B. That is, we can retrain all parameters starting from their current values as initial conditions.
(c) Under step (a), we can consider replacing the last layer of network A by two or more layers and retrain these using the data from task B.
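A minimal sketch of option (a) is shown below, assuming the frozen layers of network A use tanh activations and the new last layer uses a softmax with cross-entropy. All shapes, step sizes, and data are hypothetical stand-ins, and the ridge regularization term is omitted for brevity.

import numpy as np

rng = np.random.default_rng(0)

def forward(h, W_list, theta_list):
    # feed h through the frozen leading layers (tanh activations, illustrative choice)
    y = h
    for W_l, th_l in zip(W_list, theta_list):
        y = np.tanh(W_l.T @ y - th_l)
    return y

# frozen parameters of network A (toy shapes standing in for the trained values)
W_A = [rng.standard_normal(s) for s in [(10, 8), (8, 6), (6, 4)]]
theta_A = [np.zeros(s[1]) for s in [(10, 8), (8, 6), (6, 4)]]

# new last-layer parameters (W4, theta4), to be trained with the task-B data
W4 = 0.1 * rng.standard_normal((4, 2))
theta4 = np.zeros(2)

def train_last_layer(data, mu=1e-2, passes=5):
    global W4, theta4
    for _ in range(passes):
        for h, gamma in data:                       # gamma is a 2-dim one-hot label for task B
            y4 = forward(h, W_A, theta_A)           # frozen leading layers of network A
            z5 = W4.T @ y4 - theta4
            e = np.exp(z5 - z5.max()); gamma_hat = e / e.sum()
            delta = gamma_hat - gamma               # softmax/cross-entropy sensitivity
            W4 -= mu * np.outer(y4, delta)
            theta4 += mu * delta

data_B = [(rng.standard_normal(10), np.eye(2)[rng.integers(2)]) for _ in range(20)]
train_last_layer(data_B)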

The situation under option (a) is illustrated in Fig. 65.14 for a network A with three hidden layers shown in the top part of the figure. The new weight matrix and bias vector of the last layer are denoted by (W_4, θ_4) and shown in the lower part of the figure. The hidden layers of network B continue to be the same as those trained under network A. Fixing the parameters of these earlier layers, we can feed the training data {γ(n), h_n} under task B and generate realizations {y_{4,n}} for the vector y_4 shown in network B. We are then faced with the problem of training a single-layer neural network: Its training data are {γ(n), y_{4,n}} and its outputs are {γ̂(1), γ̂(2)} in the lower part of the figure. The parameters to be trained are (W_4, θ_4).

Example 65.12 (Multitask learning) The objective in multitask learning is to design a single neural network to identify the presence or absence of several labels at once, as is the case with multilabel classification. For example, the purpose may be to examine an image and to indicate which of the following objects appear in the image: a car, a street, a traffic signal, a pedestrian, or a bicycle. The network should be able to detect the presence of several of these objects simultaneously, such as indicating that the image shows a pedestrian, a stop sign, and a bicycle. In principle, if we have a sufficient amount of training data for each situation, then we could design five separate neural networks: one for detecting cars in images, a second one for detecting streets, a third one for detecting traffic signals, a fourth one for detecting pedestrians, and a fifth one for detecting bicycles. Once trained, these networks would operate separately. Multitask learning provides an alternative approach to the problem; it relies on training a single network to detect the presence of any of the objects of interest simultaneously. This is possible when the multiple tasks benefit from some shared low-level features (such as edges or textures), and the amount of training data available for each task is more or less uniform. Assume there are T separate tasks. Motivated by expression (65.131), one way to design a multitask neural network is to minimize an empirical risk function of the form:

P(W, θ) ≜ Σ_{ℓ=1}^{L−1} ρ‖W_ℓ‖²_F − (1/N) Σ_{n=0}^{N−1} Σ_{t=1}^{T} { γ_{n,t}(1) ln(γ̂_{n,t}(1)) + (1 − γ_{n,t}(1)) ln(1 − γ̂_{n,t}(1)) }        (65.154)

where we associate a pair of outputs {γ̂_{n,t}(1), γ̂_{n,t}(2)} with each task t; these outputs continue to add up to 1 because they are defined by applying the softmax construction to their respective pre-activation signals, denoted by {z_{n,t}(1), z_{n,t}(2)}:

γ̂_{n,t}(q) = e^{z_{n,t}(q)} / ( e^{z_{n,t}(1)} + e^{z_{n,t}(2)} ),  q = 1, 2        (65.155)

Thus, the value of γ̂_{n,t}(1) represents the likelihood that the nth feature vector contains the attribute that is present under task t. If desired, it is sufficient to have a single output γ̂_{n,t}(1) associated with each task t. We continue with two-dimensional label vectors to remain consistent with the assumed one-hot encoding formulation from the earlier sections. Figure 65.15 shows a network structure for a two-task learning problem (T = 2); we are dropping the iteration index n from the variables in the figure. The network consists of three hidden layers and one output layer. The top part shows the network with output vectors γ̂_1 ∈ IR² for task t = 1 and γ̂_2 ∈ IR² for task t = 2 (i.e., the subscripts here refer to the task number). The lower part of the figure shows in greater detail the output layer, whose pre-activation signal is partitioned into one block per task:

z_{5,1} = W_{4,1}^T y_4 − θ_{4,1},  z_{5,2} = W_{4,2}^T y_4 − θ_{4,2}

so that col{z_{5,1}, z_{5,2}} = W_4^T y_4 − θ_4 with W_4 = [W_{4,1}  W_{4,2}] and θ_4 = col{θ_{4,1}, θ_{4,2}}, and

γ̂_1 = softmax(z_{5,1}) (task t = 1),  γ̂_2 = softmax(z_{5,2}) (task t = 2)
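The two-task output computation above amounts to splitting the last pre-activation vector into per-task blocks and applying a softmax to each. A small NumPy sketch (with made-up dimensions and labels) follows, including one sample's contribution to the multitask risk (65.154).

import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

T, Q_t = 2, 2                      # two tasks, two outputs per task
y4 = rng.standard_normal(6)        # output of the shared hidden layers (made-up size)
W4 = 0.1 * rng.standard_normal((6, T * Q_t))
theta4 = np.zeros(T * Q_t)

z5 = W4.T @ y4 - theta4            # stacked pre-activations [z_{5,1}; z_{5,2}]
gamma_hat = [softmax(z5[t * Q_t:(t + 1) * Q_t]) for t in range(T)]   # one softmax per task

# per-sample contribution to (65.154), given a one-hot label per task
labels = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
loss = -sum(g[0] * np.log(gh[0]) + (1 - g[0]) * np.log(1 - gh[0])
            for g, gh in zip(labels, gamma_hat))
print(loss)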

PROBLEMS

… > 0. The Huber loss is linear in ‖x‖ over the range ‖x‖ > ∆ and, therefore, it penalizes less drastically large values of ‖x‖ in comparison with the quadratic loss ‖x‖². Repeat the derivation that led to the backpropagation algorithm and determine the necessary adjustments.

65.18 Consider a feedforward neural network with a single output node; its output signals are denoted by {z(n), γ̂(n)}. Set all activation functions to the hyperbolic tangent function from Table 65.1. Replace the regularized empirical risk (65.47) by the following logistic risk:

P(W, θ) ≜ Σ_{ℓ=1}^{L−1} ρ‖W_ℓ‖²_F + (1/N) Σ_{n=0}^{N−1} ln( 1 + e^{−γ(n) γ̂(n)} )

where γ(n) = ±1. Repeat the derivation that led to the backpropagation algorithm (65.81) and determine the necessary adjustments. 65.19 Let P(W, θ, a) denote a differentiable empirical risk function that is dependent on three sets of parameters {W` , θ` , a` } similar to the least-squares risk (65.186) introduced during our treatment of the batch normalization procedure in Section 65.9. Using the notation of that section, establish the validity of the following expressions:

∂P/∂z′_{ℓ+1,b}(k) = a_ℓ(k) ∂P/∂z″_{ℓ+1,b}(k)

∂P/∂σ²_{ℓ+1}(k) = − (1/2) (σ²_{ℓ+1}(k) + ε)^{−3/2} Σ_{b=0}^{B−1} [ ∂P/∂z′_{ℓ+1,b}(k) ] ( z_{ℓ+1,b}(k) − z̄_{ℓ+1}(k) )

∂P/∂z̄_{ℓ+1}(k) = − (σ²_{ℓ+1}(k) + ε)^{−1/2} Σ_{b=0}^{B−1} ∂P/∂z′_{ℓ+1,b}(k) − (2/B) [ ∂P/∂σ²_{ℓ+1}(k) ] Σ_{b=0}^{B−1} ( z_{ℓ+1,b}(k) − z̄_{ℓ+1}(k) )

∂P/∂z_{ℓ+1,b}(k) = (σ²_{ℓ+1}(k) + ε)^{−1/2} ∂P/∂z′_{ℓ+1,b}(k) + (2/B) [ ∂P/∂σ²_{ℓ+1}(k) ] ( z_{ℓ+1,b}(k) − z̄_{ℓ+1}(k) ) + (1/B) ∂P/∂z̄_{ℓ+1}(k)

∂P/∂a_ℓ(k) = Σ_{b=0}^{B−1} [ ∂P/∂z″_{ℓ+1,b}(k) ] z′_{ℓ+1,b}(k)

∂P/∂θ_ℓ(k) = − Σ_{b=0}^{B−1} ∂P/∂z″_{ℓ+1,b}(k)

65.20 Refer to expression (65.212b) for σ²_{ℓ+1}(j) and note that z̄_{ℓ+1}(j) also depends on z_{ℓ+1,b}(j). Verify that, for any batch index b:

∂σ²_{ℓ+1}(j) / ∂z_{ℓ+1,b}(j) = (2/B) ( z_{ℓ+1,b}(j) − z̄_{ℓ+1}(j) )

65.21 Establish the validity of expressions (65.157) and (65.164) for the boundary sensitivity factors under multitask learning.
65.22 Consider two identical feedforward neural networks. The input to one network consists of feature vectors {h_n^{(1)}} ∈ IR^M, while the input to the second network consists of feature vectors {h_n^{(2)}} ∈ IR^M. For example, the inputs to each of the networks could correspond to images of signatures written by individuals on a device. The Q-dimensional outputs of the networks are similarly denoted by {γ̂_n^{(1)}} and {γ̂_n^{(2)}}. They are assumed to be generated by sigmoidal activation functions in the last layer. The cosine of the angle (denoted by ∠) between the output vectors is used as a measure of similarity between them:

γ̂(n) = cos ∠( γ̂_n^{(1)}, γ̂_n^{(2)} ) = (γ̂_n^{(1)})^T γ̂_n^{(2)} / ( ‖γ̂_n^{(1)}‖ ‖γ̂_n^{(2)}‖ )

If the angle is small then the cosine value is close to 1, and if the angle is large then the cosine value is close to −1. In this way, class +1 would correspond to a situation in which the two signatures are more or less matching, while class −1 would correspond to a situation in which one of the signatures is a forgery. We refer to the output vectors {γ̂_n^{(1)}, γ̂_n^{(2)}} as embeddings for the original input images. We impose the condition that the weight matrices and bias vectors of both networks should be the same. The resulting architecture is related to the "Siamese" network studied later in Section 72.2. Develop a backpropagation algorithm to train the parameters of the network, say, by minimizing an empirical risk of the form

min_{W,θ} { P(W, θ) = Σ_{ℓ=0}^{L−1} ρ‖W_ℓ‖²_F + (1/N) Σ_{n=0}^{N−1} ( γ(n) − γ̂(n) )² }

where γ(n) ∈ {+1, −1} are the true labels and γ̂(n) is the predicted label (i.e., the cosine output of the network). Remark. For more motivation, see the work by Bromley et al. (1994), where Siamese networks are introduced and applied to the verification of signatures written on pen-input tablets.
65.23 Consider the same setting of the Siamese network described in Prob. 65.22. We incorporate two modifications. First, we compute the absolute elementwise difference between the entries of the embedding vectors {γ̂_n^{(1)}, γ̂_n^{(2)}} and determine the vector r_n with entries

r_n ≜ col{ | γ̂_n^{(1)}(q) − γ̂_n^{(2)}(q) | }

The vector r_n ∈ IR^Q is then fed into a neural layer with a single output node employing the sigmoidal activation function and generating the scalar output γ̂(n):

γ̂(n) = sigmoid( w_r^T r_n − θ_r ),  w_r ∈ IR^Q,  θ_r ∈ IR

where (w_r, θ_r) are the weight and bias parameters for the output layer. The value of γ̂ can be interpreted as a probability measure indicating which class is more likely (matching signatures or forgery). Note that, for all practical purposes, the output layer operates as an affine classifier, with sigmoid outputs larger than 1/2 (equivalently, positive values of w_r^T r_n − θ_r) corresponding to one class and smaller values corresponding to the other class. We continue to impose the condition that the weight matrices and bias vectors of both networks should be the same. Develop a backpropagation algorithm to train the parameters of the Siamese network by minimizing now a cross-entropy empirical risk of the form

min_{W,θ,w_r,θ_r} { ρ‖w_r‖² + Σ_{ℓ=0}^{L−1} ρ‖W_ℓ‖²_F − (1/N) Σ_{n=0}^{N−1} [ γ(n) ln(γ̂(n)) + (1 − γ(n)) ln(1 − γ̂(n)) ] }

where γ(n) ∈ {1, 0} are the true labels with γ(n) = 1 corresponding to matching signatures and γ(n) = 0 corresponding to forgery. Remark. For more motivation, see the work by Koch et al. (2015), which applies this structure to one-shot image recognition problems.
65.24 Consider the same setting of the Siamese network described in Prob. 65.23 with the following modification. We concatenate the outputs of the two Siamese branches into

d_n ≜ col{ γ̂_n^{(1)}, γ̂_n^{(2)} } ∈ IR^{2Q}

and feed d_n into another feedforward network with L′ layers and a scalar softmax output node denoted by γ̂(n). This node is referred to as the "relation score" between the two inputs {h^{(1)}, h^{(2)}}. We denote the parameters of the third network by {W′_ℓ, θ′_ℓ}. We continue to impose the condition that the weight matrices and bias vectors of the first two Siamese branches are the same. Develop a backpropagation algorithm to train the parameters of this new architecture by minimizing:

W,θ,W 0 ,θ 0

( L−1 X `=0

ρkW` k2F

+

0 L −1 X

`=0

ρkW`0 k2F

N −1 1 X (γ(n) − γ b(n))2 + N n=0

)

where γ(n) ∈ {1, 0} are the true labels with γ(n) = 1 corresponding to matching signatures and γ(n) = 0 corresponding to forgery. Remark. For more motivation, see the work by Sung et al. (2018) on relation networks in meta learning.

65.25 Refer to expression (65.197) for the variance of the Gaussian process at the output of the neural network. Assume Σ is diagonal with Σ = diag{σ²_α, σ²_a I_K}, where the variance of α is different from the variance of the entries of r. Let λ denote the angle between the vectors h and h′. Argue that when the squared norms ‖h‖² and ‖h′‖² are much larger than (1 + 2σ²_α)/2σ²_a, it holds that

E g(h) g(h′) ≈ σ²_θ + σ²_w (1 − 2λ/π)

65.A DERIVATION OF BATCH NORMALIZATION ALGORITHM

We derive in this appendix algorithm (65.187) for learning the parameters of a batch-normalized neural network. We illustrate the learning procedure by reconsidering the regularized least-squares risk (65.45):

{W_ℓ^⋆, θ_ℓ^⋆, a_ℓ^⋆} ≜ argmin_{W_ℓ, θ_ℓ, a_ℓ} { P(W, θ, a) ≜ Σ_{ℓ=1}^{L−1} ρ‖W_ℓ‖²_F + (1/N) Σ_{n=0}^{N−1} ‖γ_n − γ̂_n‖² }        (65.198)

where the arguments of P(·) are augmented to include the vectors {a_ℓ}. Here, the notation {W, θ, a} is referring to the collection of all parameters {W_ℓ, θ_ℓ, a_ℓ} from across all layers. In order to implement iterative procedures for minimizing P(W, θ, a), we need to know how to evaluate (or approximate) the gradients of P(W, θ, a) relative to the individual entries of {W_ℓ, θ_ℓ, a_ℓ}, namely, the quantities

∂P(W, θ, a)/∂w_ij^{(ℓ)},  ∂P(W, θ, a)/∂θ_ℓ(i),  ∂P(W, θ, a)/∂a_ℓ(i)        (65.199)

for each layer ℓ and entries {w_ij^{(ℓ)}, θ_ℓ(i), a_ℓ(i)}; the notation {θ_ℓ(i), a_ℓ(i)} refers to the ith entries in the vectors θ_ℓ and a_ℓ defined earlier in (65.179)–(65.180). To compute the gradients in (65.199), we repeat the arguments from Section 65.4.1.

Expressions for the gradients
Since, under batch normalization, it is the signal vector denoted generically by v that is fed into the nonlinear activation functions (rather than the original vector z), we replace the earlier definition (65.52) for the sensitivity factor by

δ_ℓ(j) ≜ ∂‖γ − γ̂‖² / ∂v_ℓ(j)        (65.200)

where the differentiation is now relative to the jth entry of the vector v_ℓ feeding into layer ℓ; the output of this layer is given by

y_ℓ = f(v_ℓ)        (65.201)

The above two expressions are written for a generic internal vector v_ℓ, regardless of its batch index (i.e., we are simply writing v_ℓ instead of v_{ℓ,b}). However, we will soon restore the index within the batch for clarity and completeness. We are ready to evaluate the partial derivatives in (65.199). To begin with, note that

∂P(W, θ, a)/∂θ_ℓ(i) = (1/N) Σ_{n=0}^{N−1} ∂‖γ_n − γ̂_n‖² / ∂θ_ℓ(i)        (65.202)

Using a mini-batch of size B to approximate the gradient, we replace (65.202) by

    (1/B) Σ_{b=0}^{B−1} ∂‖γ_b − γ̂_b‖² / ∂θ_ℓ(i)
      = (1/B) Σ_{b=0}^{B−1} ( ∂‖γ_b − γ̂_b‖² / ∂v_{ℓ+1,b}(i) ) ( ∂v_{ℓ+1,b}(i) / ∂θ_ℓ(i) )
      = −(1/B) Σ_{b=0}^{B−1} ∂‖γ_b − γ̂_b‖² / ∂v_{ℓ+1,b}(i)        (by (65.183))
      = −(1/B) Σ_{b=0}^{B−1} δ_{ℓ+1,b}(i)                          (by (65.200))        (65.203)

where the notation δ_{ℓ+1,b}(i) denotes the sensitivity factor relative to the bth signal v_{ℓ+1,b}(i) in the batch. It follows that, in terms of the gradients relative to the vectors θ_ℓ and not only relative to their individual entries, we have

    (1/B) Σ_{b=0}^{B−1} ∂‖γ_b − γ̂_b‖² / ∂θ_ℓ = −(1/B) Σ_{b=0}^{B−1} δ_{ℓ+1,b}        (65.204)

where δ_{ℓ+1,b} is the sensitivity vector that collects the factors {δ_{ℓ+1,b}(i)}. Next, we consider the partial derivative relative to the individual entries of the parameter vector a_ℓ. Thus, note that

    ∂P(W, θ, a)/∂a_ℓ(i) = (1/N) Σ_{n=0}^{N−1} ∂‖γ_n − γ̂_n‖² / ∂a_ℓ(i)        (65.205)

Using a mini-batch of size B to approximate the gradient we have

    (1/B) Σ_{b=0}^{B−1} ∂‖γ_b − γ̂_b‖² / ∂a_ℓ(i)
      = (1/B) Σ_{b=0}^{B−1} ( ∂‖γ_b − γ̂_b‖² / ∂v_{ℓ+1,b}(i) ) ( ∂v_{ℓ+1,b}(i) / ∂a_ℓ(i) )
      = (1/B) Σ_{b=0}^{B−1} δ_{ℓ+1,b}(i) z′_{ℓ+1,b}(i)        (by (65.182))        (65.206)

so that, in terms of the gradient relative to the vector a_ℓ and not only relative to its individual entries, we can write

    (1/B) Σ_{b=0}^{B−1} ∂‖γ_b − γ̂_b‖² / ∂a_ℓ = (1/B) diag{ Σ_{b=0}^{B−1} δ_{ℓ+1,b} ( z′_{ℓ+1,b} )^T }        (65.207)

where the diag{·} operation applied to a matrix argument returns a column vector with the diagonal entries of the matrix. Moreover,

    z′_{ℓ+1,b} = S_{ℓ+1}^{−1} ( z_{ℓ+1,b} − z̄_{ℓ+1} )        (65.208)

We now compute the partial derivatives of the risk function relative to the individual entries of the weight matrices. Thus, note that for the regularized least-squares risk:

    ∂P(W, θ, a)/∂w_{ij}^{(ℓ)} = 2ρ w_{ij}^{(ℓ)} + (1/N) Σ_{n=0}^{N−1} ∂‖γ_n − γ̂_n‖² / ∂w_{ij}^{(ℓ)}        (65.209)

Using a mini-batch of size B to approximate the rightmost term we have

    (1/B) Σ_{b=0}^{B−1} ∂‖γ_b − γ̂_b‖² / ∂w_{ij}^{(ℓ)}
      = (1/B) Σ_{b=0}^{B−1} ( ∂‖γ_b − γ̂_b‖² / ∂v_{ℓ+1,b}(j) ) ( ∂v_{ℓ+1,b}(j) / ∂z′_{ℓ+1,b}(j) ) ( Σ_{p=0}^{B−1} ( ∂z′_{ℓ+1,b}(j) / ∂z_{ℓ+1,p}(j) ) ( ∂z_{ℓ+1,p}(j) / ∂w_{ij}^{(ℓ)} ) )        (65.210)

where the rightmost sum over the batch index p = 0, 1, ..., B − 1 appears in view of the chain rule of differentiation and the fact that the variable z′_{ℓ+1,b}(j) depends on all the vectors z_{ℓ+1,p} within the mini-batch of size B (this dependence is through the mean and variance variables z̄_{ℓ+1} and S_{ℓ+1} – see the second equation below). To evaluate the partial derivatives in (65.210) we first recall the relations:

    v_{ℓ+1,b}(j) = a_ℓ(j) z′_{ℓ+1,b}(j) − θ_ℓ(j)                                       (65.211a)
    z′_{ℓ+1,b}(j) = ( z_{ℓ+1,b}(j) − z̄_{ℓ+1}(j) ) / ( σ²_{ℓ+1}(j) + ε )^{1/2}          (65.211b)
    z_{ℓ+1,p}(j) = f(v_{ℓ,p}(i)) w_{ij}^{(ℓ)} + terms independent of w_{ij}^{(ℓ)}       (65.211c)
    y_{ℓ,p}(i) ≜ f(v_{ℓ,p}(i))                                                         (65.211d)

as well as the definitions

    z̄_{ℓ+1}(j) = (1/B) Σ_{q=0}^{B−1} z_{ℓ+1,q}(j)                                      (65.212a)
    σ²_{ℓ+1}(j) = (1/B) Σ_{q=0}^{B−1} ( z_{ℓ+1,q}(j) − z̄_{ℓ+1}(j) )²                   (65.212b)

We conclude from the expression for σ²_{ℓ+1}(j) that, for any batch index p = 0, 1, ..., B − 1 (see Prob. 65.20):

    ∂( σ²_{ℓ+1}(j) + ε )^{−1/2} / ∂z_{ℓ+1,p}(j)
      = −(1/2) ( σ²_{ℓ+1}(j) + ε )^{−3/2} ∂( σ²_{ℓ+1}(j) + ε ) / ∂z_{ℓ+1,p}(j)
      = −(1/2) ( σ²_{ℓ+1}(j) + ε )^{−3/2} ∂σ²_{ℓ+1}(j) / ∂z_{ℓ+1,p}(j)
      = −(1/B) ( σ²_{ℓ+1}(j) + ε )^{−3/2} ( z_{ℓ+1,p}(j) − z̄_{ℓ+1}(j) )        (65.213)

Returning to the partial derivatives in (65.210) we deduce from (65.200), (65.211a), and (65.211c) that

    ∂‖γ_b − γ̂_b‖² / ∂v_{ℓ+1,b}(j) = δ_{ℓ+1,b}(j)        (65.214)
    ∂v_{ℓ+1,b}(j) / ∂z′_{ℓ+1,b}(j) = a_ℓ(j)              (65.215)
    ∂z_{ℓ+1,b′}(j) / ∂w_{ij}^{(ℓ)} = y_{ℓ,b′}(i)         (65.216)

while (65.211b) gives

    ∂z′_{ℓ+1,b}(j) / ∂z_{ℓ+1,p}(j) = ( σ²_{ℓ+1}(j) + ε )^{−1/2} ∂( z_{ℓ+1,b}(j) − z̄_{ℓ+1}(j) ) / ∂z_{ℓ+1,p}(j) + ( z_{ℓ+1,b}(j) − z̄_{ℓ+1}(j) ) ∂( σ²_{ℓ+1}(j) + ε )^{−1/2} / ∂z_{ℓ+1,p}(j)        (65.217)

Note from the definition of the mean value z̄_{ℓ+1}(j) that

    ∂( z_{ℓ+1,b}(j) − z̄_{ℓ+1}(j) ) / ∂z_{ℓ+1,p}(j) = { −1/B, when p ≠ b;  (1 − 1/B), when p = b }        (65.218)

Combining with (65.213) we get

    ∂z′_{ℓ+1,b}(j) / ∂z_{ℓ+1,p}(j) ≜ c_{b,p}^{(ℓ+1)}(j)        (65.219)

where the scalar c_{b,p}^{(ℓ+1)}(j), whose value depends on the layer index, is computed as follows. Let

    ξ_ℓ(j) ≜ 1 / ( σ²_{ℓ+1}(j) + ε )^{1/2}        (65.220)

Then

    c_{b,p}^{(ℓ+1)}(j) =
      ξ_ℓ(j) ( 1 − 1/B − (1/B) ξ_ℓ²(j) ( z_{ℓ+1,b}(j) − z̄_{ℓ+1}(j) )² ),   when p = b
      −(1/B) ξ_ℓ(j) ( 1 + ξ_ℓ²(j) ( z_{ℓ+1,b}(j) − z̄_{ℓ+1}(j) )( z_{ℓ+1,p}(j) − z̄_{ℓ+1}(j) ) ),   when p ≠ b        (65.221)

Substituting into (65.209)–(65.210) we find that a batch approximation for the gradient of the risk function relative to the individual entries w_{ij}^{(ℓ)} of the weight matrices can be computed as follows:

    2ρ w_{ij}^{(ℓ)} + (1/B) Σ_{b=0}^{B−1} ∂‖γ_b − γ̂_b‖² / ∂w_{ij}^{(ℓ)}
      = 2ρ w_{ij}^{(ℓ)} + (1/B) Σ_{b=0}^{B−1} δ_{ℓ+1,b}(j) a_ℓ(j) ( Σ_{p=0}^{B−1} c_{b,p}^{(ℓ+1)}(j) y_{ℓ,p}(i) )        (65.222)

Sensitivity factors
It remains to show how to propagate the sensitivity vectors δ_{ℓ,b} defined by (65.200). We start with the output layer for which ℓ = L. We denote the vector signals at the output layer in the bth sample of the batch by {v_{L,b}, γ̂_b}, using the subscript b, with the letter v representing the signal prior to the activation function, i.e.,

    γ̂_b = f(v_{L,b})        (65.223)

We denote the individual entries at the output layer by {γ̂_b(1), ..., γ̂_b(Q)}. Likewise, we denote the pre-activation entries by {v_{L,b}(1), ..., v_{L,b}(Q)}. We also denote the pre- and post-activation signals at a generic ℓth hidden layer by {v_{ℓ,b}, y_{ℓ,b}} with

    y_{ℓ,b} = f(v_{ℓ,b})        (65.224)

with individual entries indexed by {v_{ℓ,b}(i), y_{ℓ,b}(i)}. The number of nodes within hidden layer ℓ is denoted by n_ℓ. In this way, the chain rule for differentiation gives

    δ_{L,b}(j) ≜ ∂‖γ_b − γ̂_b‖² / ∂v_{L,b}(j)
      = Σ_{k=1}^{Q} ( ∂‖γ_b − γ̂_b‖² / ∂γ̂_b(k) ) ( ∂γ̂_b(k) / ∂v_{L,b}(j) )
      = Σ_{k=1}^{Q} 2( γ̂_b(k) − γ_b(k) ) ∂γ̂_b(k) / ∂v_{L,b}(j)
      = 2( γ̂_b(j) − γ_b(j) ) f′(v_{L,b}(j))        (65.225)

since only γ̂_b(j) depends on v_{L,b}(j) through the relation γ̂_b(j) = f(v_{L,b}(j)). Consequently, using the Hadamard product notation we get

    δ_{L,b} = 2( γ̂_b − γ_b ) ⊙ f′(v_{L,b})        (65.226)

Next we evaluate δ_{ℓ,b} for the earlier layers. This calculation can be carried out recursively by relating δ_{ℓ,b} to δ_{ℓ+1,b}. Indeed, note that

    δ_{ℓ,b}(j) ≜ ∂‖γ_b − γ̂_b‖² / ∂v_{ℓ,b}(j)        (65.227)
      = Σ_{k=1}^{n_{ℓ+1}} Σ_{p=0}^{B−1} ( ∂‖γ_b − γ̂_b‖² / ∂v_{ℓ+1,b}(k) ) ( ∂v_{ℓ+1,b}(k) / ∂z′_{ℓ+1,b}(k) ) ( ∂z′_{ℓ+1,b}(k) / ∂z_{ℓ+1,p}(k) ) ( ∂z_{ℓ+1,p}(k) / ∂v_{ℓ,b}(j) )

where the signals z_{ℓ+1,p}(k) and v_{ℓ,b}(j) are related via

    z_{ℓ+1,p}(k) = f(v_{ℓ,p}(j)) w_{jk}^{(ℓ)}        (65.228)

Therefore, when p = b, we have

    ∂z_{ℓ+1,p}(k) / ∂v_{ℓ,b}(j) = f′(v_{ℓ,b}(j)) w_{jk}^{(ℓ)}        (65.229)

Otherwise, the above partial derivative is zero when p ≠ b. Using this result and expression (65.219) we find that the partial derivatives in (65.227) evaluate to

    δ_{ℓ,b}(j) = Σ_{k=1}^{n_{ℓ+1}} δ_{ℓ+1,b}(k) a_ℓ(k) c_{b,b}^{(ℓ+1)}(k) f′(v_{ℓ,b}(j)) w_{jk}^{(ℓ)}        (65.230)

If we introduce the diagonal scaling matrix

    D_{ℓ,b} ≜ diag{ a_ℓ(1) c_{b,b}^{(ℓ+1)}(1), ..., a_ℓ(n_{ℓ+1}) c_{b,b}^{(ℓ+1)}(n_{ℓ+1}) }        (65.231)

then, in vector form, we arrive at the following recursion for the sensitivity vector δ_{ℓ,b}, which runs backward from ℓ = L − 1 down to ℓ = 2, with the boundary condition δ_{L,b} given by (65.226):

    δ_{ℓ,b} = f′(v_{ℓ,b}) ⊙ ( W_ℓ D_{ℓ,b} δ_{ℓ+1,b} )        (65.232)

The resulting algorithm with batch normalization for minimizing the regularized least-squares risk (65.198) is listed in (65.187).
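For readers who wish to experiment with the above recursions, the following minimal NumPy sketch (not part of the listing in the text) evaluates the boundary condition (65.226) and the backward recursion (65.231)–(65.232) for one sample b of a mini-batch. It assumes tanh activations and assumes the forward pass has already produced the quantities passed as arguments; all names are illustrative.

import numpy as np

def f_prime(v):
    # derivative of the tanh activation, used here purely for illustration
    return 1.0 - np.tanh(v) ** 2

def sensitivities(gamma, gamma_hat, v, z, zbar, sigma2, a, W, L, B, eps=1e-5):
    """Backward sensitivity recursion for one sample of a mini-batch.
    v[l], z[l], zbar[l], sigma2[l], a[l], W[l] are dicts keyed by layer index,
    holding the forward-pass quantities used in (65.220)-(65.221) and (65.232)."""
    delta = {}
    # boundary condition (65.226) at the output layer l = L
    delta[L] = 2.0 * (gamma_hat - gamma) * f_prime(v[L])
    # backward recursion (65.232) for l = L-1 down to 2
    for l in range(L - 1, 1, -1):
        xi = 1.0 / np.sqrt(sigma2[l + 1] + eps)                 # (65.220)
        d = z[l + 1] - zbar[l + 1]
        c_bb = xi * (1.0 - 1.0 / B - (xi ** 2) * (d ** 2) / B)  # (65.221), case p = b
        D = a[l] * c_bb                                         # diagonal of D_{l,b} in (65.231)
        delta[l] = f_prime(v[l]) * (W[l] @ (D * delta[l + 1]))  # (65.232)
    return delta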

REFERENCES Bengio, Y. (2009), “Learning deep architectures for AI,” Found. Trends Mach. Learn., vol. 2, no. 1, pp. 1–127. Bengio, Y. (2012), “Practical recommendations for gradient-based training of deep architectures,” in Neural Networks: Tricks of the Trade, G. Montavon, G. B. Orr, K. Muller, editors, 2nd ed., pp. 437–478, Springer. Also available at arXiv:1206.5533. Bengio, Y., A. Courville, and P. Vincent (2013), “Representation learning: A review and new perspectives,” IEEE Trans. Patt. Anal. Mach. Intell., vol. 35, no. 8, pp. 1798–1828. Bengio, Y., P. Lamblin, D. Popovici, and H. Larochelle (2006), “Greedy layer-wise training of deep networks,” Proc. Advances Neural Information Processing Systems (NIPS), pp. 153–160, Vancouver. Bishop, C. (1995), Neural Networks for Pattern Recognition, Clarendon Press. Bjorck, J., C. Gomes, B. Selman, and K. Q. Weinberger (2018), “Understanding batch normalization,” Proc. Advances Neural Information Processing Systems (NIPS), pp. 7694–7705, Montreal. Bourlard, H. and Y. Kamp (1988), “Auto-association by multilayer perceptrons and singular value decomposition,” Biol. Cybern., vol. 59, pp. 291–294. Bourlard, H. A. and N. Morgan (1993), Connectionist Speech Recognition: A Hybrid Approach, Kluwer. Bourlard, H. and C. J. Wellekens (1989), “Links between Markov models and multilayer perceptrons,” Proc. Advances Neural Information Processing Systems (NIPS), pp. 502–507, Denver, CO. Bridle, J. S. (1990a), “Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters,” Proc. Advances Neural Information Processing Systems (NIPS), pp. 211–217, Denver, CO. Bridle, J. S. (1990b), “Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition,” in Neurocomputing: Algorithms, Architectures and Applications, F. F. Soulie and J. Herault, editors, pp. 227–236, Springer. Bromley, J., I. Guyon, Y. LeCun, E. Sickinger, and R. Shah (1994), “Signature verification using a Siamese time delay neural network,” Proc. Advances Neural Information Processing Systems (NIPS), pp. 737–744, Denver, CO. Bui, T. D., S. Ravi, and V. Ramavajjala (2018), “Neural graph learning: Training neural networks using graphs,” Proc. ACM Int. Conf. on Web Search and Data Mining, pp. 64–71, Los Angeles, CA. Caruana, R. (1993), “Multitask learning: A knowledge-based source of inductive bias,” Proc. Int. Conf. Machine Learning (ICML), pp. 41–48, Amherst, MA. Caruana, R. (1997), “Multitask learning,” Mach. Learn., vol. 28, pp. 41–75. Cybenko, G. (1989), “Approximations by superpositions of sigmoidal functions,” Math. Control Signals Syst., vol. 2, no. 4, pp. 303–314. Dai, A. M. and Q. V. Le (2015), “Semi-supervised sequence learning,” Proc. Advances Neural Information Processing Systems (NIPS), pp. 1–9, Montreal. De Boer, P.-T., D. P. Kroese, S. Mannor, and R. Y. Rubinstein (2005), “A tutorial of the cross-entropy method,” Ann. Oper. Res., vol. 134, pp. 19–67. Deng, J., W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009), “ImageNet: A large-scale hierarchical image database,” Proc. Conf. Computer Vision and Pattern Recognition (CVPR), pp. 248–255, Miami, FL. Dietterich, T. G., H. Hild, and G. Bakiri (1990), “A comparative study of ID3 and backpropagation for English text-to-speech mapping,” Proc. Int. Conf. Artificial Intelligence, pp. 24–31, Boston, MA. Donahue, J., Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. 
Darrell (2014), “DeCAF: A deep convolutional activation feature for generic visual recognition,” Proc. Int. Conf. Machine Learning (ICML), pp. 647–655, Beijing.

Duda, R. O. and P. E. Hart (1973), Pattern Classification and Scene Analysis, Wiley. Duda, R. O., P. E. Hart, and D. G. Stork (2000), Pattern Classification, 2nd ed., Wiley. Erdogmus D. and J. Principe (2002), “An error-entropy minimization algorithm for supervised training of nonlinear adaptive systems,” IEEE Trans. Signal Process., vol. 50, no. 7, pp. 1780–1786. Funahashi, K. (1989), “On the approximate realization of continuous mappings by neural networks,” Neural Netw., vol. 2, pp. 183–192. Glorot, X. and Y. Bengio (2010), “Understanding the difficulty of training deep feedforward neural networks,” Proc. Mach. Learn. Res., vol. 9, pp. 249–256. Glorot, X., A. Bordes, and Y. Bengio (2011), “Deep sparse rectifier neural networks,” J. Mach. Learn. Res., vol. 15, pp. 315–323. Goodfellow, I., Y. Bengio, and A. Courville (2016), Deep Learning, MIT Press. Gori, M., G. Monfardini and F. Scarselli (2005), “A new model for learning in graph domains,” Proc. IEEE Int. Joint Conf. Neural Networks, pp. 729–734, Montreal. Hassibi, B., A. H. Sayed and T. Kailath (1994a), “H∞ -optimality criteria for LMS and backpropagation,” Proc. Advances Neural Information Processing Systems (NIPS), vol. 6, pp. 351–358, Denver, CO. Hassibi, B., A. H. Sayed and T. Kailath (1994b), “LMS and backpropagation are minimax filters,” in Theoretical Advances in Neural Computation and Learning, V. Roychowdhury, K. Siu and A. Orlitsky, editors, pp. 425–447, Kluwer Academic Publishers. Haykin, S. (1999), Neural Networks: A Comprehensive Foundation, Prentice Hall. Haykin, S. (2009), Neural Networks and Learning Machines, 3rd ed., Pearson Press. Hecht-Nielsen, R. (1989), “Theory of the back-propagation neural network,” Proc. Int. Joint Conf. Neural Networks, pp. 593–606, New York. Hinton, G. (1987), “Connectionist learning procedures,” Technical Report CMU-CS-87115, Carnegie Mellon University, Dept. Computer Science. Also published in Artif. Intell., vol. 40, pp. 185–234, 1989. Hinton, G., S. Osindero, and Y.-W. Teh (2006), “A fast learning algorithm for deep belief nets,” Neural Comput., vol. 18, no. 7, pp. 1527–1554. Hinton, G. and R. Salakhutdinov (2006), “Reducing the dimensionality of data with neural networks,” Science, vol. 313, pp. 504–507. Hinton, G., N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2012b), “Improving neural networks by preventing co-adaptation of feature detectors,” available at arXiv:1207.0580. Hinton, G. and R. S. Zemel (1994), “Autoencoders, minimum description length, and Helmholtz free energy,” Proc. Advances Neural Information Processing (NIPS), pp. 3–10, Denver, CO. Hochreiter, S. (1991), Untersuchungen zu Dynamischen Neuronalen Netzen, Diploma thesis, Institut f. Informatik, Technische Univ. Munich. Hochreiter, S., Y. Bengio, P. Frasconi, and J. Schmidhuber (2001), “Gradient flow in recurrent nets: The difficulty of learning long-term dependencies,” in A Field Guide to Dynamical Recurrent Neural Networks, S. C. Kremer and J. F. Kolen, editors, IEEE Press. Hornik, K. (1991), “Approximation capabilities of multilayer feedforward networks,” Neural Netw., vol. 4, no. 2, pp. 251–257. Hornik, K., M. Stinchcombe, H. White (1989), “Multilayer feedforward networks are universal approximators,” Neural Netw., vol. 2, pp. 359–366. Ioffe, S. and C. Szegedy (2015), “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” Proc. Int. Conf. Machine Learning (ICML), vol. 37, pp. 448–456, Lille. Also available at arXiv:1502.03167v3. Jarrett, K., K. 
Kavukcuoglu, M. Ranzato, and Y. LeCun (2009), “What is the best multi-stage architecture for object recognition,” Proc. IEEE Int. Conf. Computer Vision, pp. 2146–2153, Kyoto. Koch, G., R. Zemel, and R. Salakhutdinov (2015), “Siamese neural networks for oneshot image recognition,” Proc. Int. Conf. Machine Learning (ICML), pp. 1–8, Lille.

Kohler, J., H. Daneshmand, A. Lucchi, M. Zhou, K. Neymeyr, and T. Hofmann (2018), “Exponential convergence rates for batch normalization: The power of length– direction decoupling in non-convex optimization,” available at arXiv:1805.10694. Krizhevsky, A. (2009), Learning Multiple Layers of Features from Tiny Images, MS dissertation, Computer Science Department, University of Toronto, Canada. Krizhevsky, A., I. Sutskever, and G. Hinton (2012), “ImageNet classification with deep convolutional neural networks,” Proc. Advances Neural Information Processing Systems (NIPS), vol. 25, pp. 1097–1105, Lake Tahoe, NV. LeCun, Y., Y. Bengio, and G. Hinton (2015), “Deep learning,” Nature, vol. 521, pp. 436–444. LeCun, Y., L. Bottou, Y. Bengio, and P. Haffner (1998), “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324. LeCun, Y., L. Bottou, G. B. Orr, and K. Muller (2012), “Efficient backprop,” in Neural Networks, Tricks of the Trade, G. Montavon, G. B. Orr, and K. Muller, editors, 2nd ed., pp. 9–48, Springer. Leshno, M., V. Y. Lin, A. Pinkus, and S. Schocken (1993), “Multilayer feedforward networks with a non-polynomial activation function can approximate any function,” Neural Netw., vol. 6, pp. 861–867. Linsker, R. (1998), “Self-organization in a perceptual network,” IEEE Comput., vol. 21, pp. 105–117. Liu, Z. and J. Zhou (2020), Introduction to Graph Neural Networks, Morgan & Claypool. Minsky, M. and S. Papert (1969), Perceptrons, MIT Press. Expanded edition published in 1987. Neal, R. (1995), Bayesian Learning for Neural Networks, PhD dissertation, Dept. of Computer Science, University of Toronto, Canada. Neal, R. (1996), Bayesian Learning for Neural Networks, Springer. Ney, H. (1995), “On the probabilistic interpretation of neural network classifiers and discriminative training criteria,” IEEE Trans. Patt. Anal. Mach. Intell., vol. 17, no. 2, pp. 107–119. Nilsson, N. (1965), Learning Machines, McGraw-Hill. Principe, J. C., D. Xu, and J. Fisher (2000), “Information theoretic learning,” in Unsupervised Adaptive Filtering: Blind Source Separation, S. Haykin, editor, pp. 265–319, Wiley. Radford, A., K. Narasimhan, T. Salimans, and I. Sutskever (2018), “Improving language understanding by generative pre-training,” available at https://s3-us-west2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_ understanding_paper.pdf. Rosenblatt, F. (1957), The Perceptron: A Perceiving and Recognizing Automaton, Technical Report 85-460-1, Project PARA, Cornell Aeronautical Lab. Rosenblatt, F. (1958), “The perceptron: A probabilistic model for information storage and organization in the brain,” Psychol. Rev., vol. 65, no. 6, pp. 386–408. Rosenblatt, F. (1962), Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms, Spartan Press. Rubinstein, R. Y. and D. P. Kroese (2004), The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation, and Machine Learning, Springer. Rumelhart, D. E., G. E. Hinton, and R. J. Williams (1985), “Learning internal representations by error propagation,” in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, J. L. McClelland and D. E. Rumelhart, editors, vol. 1, pp. 318–362, MIT Press. Rumelhart, D. E., G. Hinton, and R. J. Williams (1986), “Learning representations by back-propagating errors,” Nature, vol. 323, pp. 533–536. Russakovsky, O., J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. 
Bernstein, A. C. Berg, and L. Fei-Fei (2015), “ImageNet large scale visual recognition challenge,” Int. J. Comput. Vision, vol. 115, no. 3, pp. 211–252. Also available at arXiv:1409.0575.

Santurkar, S., D. Tsipras, A. Ilyas, and A. Madry (2018), “How does batch normalization help optimization?,” available at arXiv:1805.11604. Scarselli, F., M. Gori, A. Tsoi, M. Hagenbuchner, and G. Monfardini (2008), “The graph neural network model,” IEEE Trans. Neural Netw., vol. 20, no. 1, pp. 61–80. Scarselli, F., A. C. Tsoi, M. Gori, and M. Hagenbuchner (2004), “Graphical-based learning environments for pattern recognition,” Proc. Joint IAPR Int. Workshops on SPR and SSPR, pp. 42–56, Lisbon. Schmidhuber, J. (2015), “Deep learning in neural networks: An overview,” Neural Netw., vol. 61, pp. 85–117. Schwenk, H. and M. Milgram (1995), “Transformation invariant autoassociation with application to handwritten character recognition,” Proc. Advances Neural Information Processing Systems (NIPS), pp. 991–998, Denver, CO. Siu, K.-Y., V. P. Roychowdhury, and T. Kailath (1995), Discrete Neural Computation: A Theoretical Foundation, Prentice Hall. Solla, S. A., E. Levin, and M. Fleisher (1988), “Accelerated learning in layered neural networks,” Complex Syst., vol. 2, pp. 625–640. Srivastava, N., G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014), “Dropout: A simple way to prevent neural networks from overfitting,” J. Mach. Learn. Res., vol. 15, pp. 1929–1958. Stinchcombe, M. and H. White (1989), “Universal approximation using feedforward networks with non-sigmoid hidden layer activation functions,” Proc. Int. Joint Conf. Neural Networks, pp. 613–617, Washington, DC. Suddarth, S. C. and A. D. C. Holden (1991), “Symbolic-neural systems and the use of hints for developing complex systems,” Int. J. Man-Mach. Stud., vol. 35, no. 3, pp. 291–311. Suddarth, S. C. and Y. L. Kergosien (1990), “Rule-injection hints as a means of improving network performance and learning time,” Proc. EURASIP Workshop on Neural Networks, pp. 120–129, Sesimbra. Sung, F., Y. Yang, L. Zhang, T. Xiang, P. H. S. Torr, and T. M. Hospedales (2018), “Learning to compare: Relation network for few-shot learning,” Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), pp. 1199–1208, Salt Lake City, UT. Theodoridis, S. (2015), Machine Learning: A Bayesian and Optimization Perspective, Academic Press. Theodoridis, S. and K. Koutroumbas (2008), Pattern Recognition, 4th ed., Academic Press. Vincent, P., H. Larochelle, Y. Bengio, and P. A. Manzagol (2008), “Extracting and composing robust features with denoising autoencoders,” Proc. Int. Conf. Machine Learning (ICML), pp. 1096–1103, Helsinki. Ward, I. R., J. Joyner, C. Lickfold, S. Rowe, Y. Guo, and M. Bennamoun (2020), “A practical guide to graph neural networks: How do graph neural networks work, and where can they be applied?” available at arXiv:2010.05234. Werbos, P. J. (1974), Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences, Ph.D. dissertation, Harvard University, Cambridge, MA, USA. Werbos, P. J. (1988), “Generalization of backpropagation with application to a recurrent gas market model,” Neural Netw., vol. 1, no. 4, pp. 339–356. Werbos, P. J. (1990), “Backpropagation through time: What it does and how to do it,” Proc. IEEE, vol. 78, no. 10, pp. 1550–1560. Werbos, P. J. (1994), The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting, Wiley. Widrow, B. and M. A. Lehr (1990), “30 years of adaptive neural networks: Perceptron, Madaline, and backpropagation,” Proc. IEEE, vol. 78, no. 9, pp. 1415–1442. Williams, C. (1996), “Computing with infinite networks,” in Proc. 
Advances Neural Information Processing Systems (NIPS), pp. 295–301, Denver, CO.

Wu, Z., S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu (2020), “A comprehensive survey on graph neural networks,” IEEE Trans. Neural Netw. Learn. Syst., vol. 32, pp. 4–24 Xu, D. and J. Principe (1999), “Training MLPs layer-by-layer with the information potential,” Proc. IEEE ICASSP, pp. 1045–1048, Phoenix, AZ. Yosinski, J., J. Clune, Y. Bengio, and H. Lipson (2014), “How transferable are features in deep neural networks?” Proc. Advances Neural Information Processing Systems (NIPS), pp. 1–9, Montreal. Zhou, P. and J. Austin (1989), “Learning criteria for training neural network classifiers,” Neural Comput. Appl., vol. 7, no. 4, pp. 334–342. Zhou, J., G. Cui, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun (2019), “Graph neural networks: A review of methods and applications,” available at arXiv:1812.08434 Zhou, Y., H. Zheng, and X. Huang (2020), “Graph neural networks: Taxonomy, advances and trends,” available at arXiv:2012.08752.

66 Deep Belief Networks

We indicated in the concluding remarks of the previous chapter that feedforward neural networks have powerful modeling capabilities, as reflected by the universal approximation theorem. In one of its versions, the theorem asserts that networks with a single hidden layer are rich enough to model almost any arbitrary function. The number of neurons in the hidden layer can still be large. One would expect the modeling capabilities of the network, as well as its performance in solving inference problems, to be equally powerful if one could employ multiple hidden layers, albeit with fewer neurons per layer. Unfortunately, the vanishing gradient problem discussed in Section 65.8 hinders the successful training of such networks. This is because, as the discussion in that section has revealed, the derivatives of the activation functions in the saturation region have small values, and these values scale down signals flowing back during the backpropagation procedure. We described in the previous chapter one popular method to alleviate this problem based on batch normalization. In this chapter we describe two other earlier mechanisms, which have an independent value of their own. One method is based on cascading layers of autoencoders, while the other method is based on cascading layers of restricted Boltzmann machines (RBMs). The main contribution of these two methods is to provide proper initialization for the combination matrices and bias vectors, {W`,−1 , θ`,−1 }, from which training can be launched using the backpropagation procedure. As the presentation will reveal, the autoencoder and RBM cascades will be trained separately from the neural network. Their training is unsupervised and relies solely on the feature vectors {hn } without their labels. There is no formal theory to explain why these strategies provide “good” initial conditions except to say that they help ensure that the network will be operating away from the saturation regime.

66.1 PRE-TRAINING USING STACKED AUTOENCODERS

Consider a multilayer feedforward network consisting of L layers: an input layer, an output layer, and L − 2 hidden layers – see Fig. 66.1 with L = 5 layers (an input layer, an output layer, and three hidden layers). Consider further a collection of N training data points {γ_n, h_n}, where γ_n ∈ IR^Q are the class vectors

and hn ∈ IRM are the feature vectors. We already know how to train the network by means of the backpropagation algorithm, e.g., by using a stochastic-gradient or mini-batch implementation and by using any of several possible design criteria such as least-squares or cross-entropy. However, as already noted in the previous chapter, training can suffer from slowdown caused by the vanishing gradient problem. The difficulty is pronounced for deep networks, which usually contain many hidden layers and many neural units per layer.

Figure 66.1 Pre-training of a multilayered feedforward network by means of a stacked autoencoder structure, where one layer is trained at a time followed by a full-blown training of the network by backpropagation.

The stacked autoencoder approach helps alleviate this problem. The approach is based on the idea of training a supporting network greedily, one layer at a time, which helps counter overfitting by training fewer parameters at each stage. The method operates as follows:

(1) We train the hidden layers, one layer at a time. We start with layer 2, which is the first hidden layer. The input vector to this layer is h_n. We ignore all layers succeeding layer 2 and envision instead a fictitious output layer following it with the same number of units as h_n. In other words, we envision an autoencoder built around layer 2, with layer 2 serving as its hidden layer and h_n serving as the input and desired output signals for this autoencoder. If the individual entries of h_n are assumed to lie within the intervals [0, 1] or [−1, 1], then the neurons in the output layer for this autoencoder step will need to include sigmoidal or hyperbolic-tangent nonlinearities so that the output signals will also lie within the same intervals. If, on the other hand, the individual entries of h_n are arbitrary real numbers, then the neurons in the output layer will not include nonlinearities. Using the given features {h_n}, we train the autoencoder by means of the backpropagation algorithm and determine the combination matrix and bias vector (W_1, θ_1) feeding from h_n into layer 2 in the autoencoder. The same algorithm also ends up determining the combination matrix and bias vector, denoted by (W_x, θ_x), which feed from layer 2 to the fictitious output layer following it. We ignore (W_x, θ_x). Since the autoencoder stage involves only one hidden layer, we can be more explicit and write down the training expressions to illustrate the operation of this first stage. Similar descriptions would apply to the subsequent stages. We assume in the description that follows that the output of the autoencoder does not employ a nonlinearity. We refer to the lower leftmost plot in Fig. 66.1 and consider the three-layer network with input h_n and parameters {W_1, θ_1, W_x, θ_x}, which we wish to learn. The training involves three stages: (a) initialization, (b) forward propagation, and (c) backward propagation.

(1a) Initialization. First, we set the initial values for these parameters according to the procedure explained earlier in Section 65.5. Specifically,

    entries of W_1 ∼ N_{w_{ij,1}}(0, 1/n_1),   (n_1 × n_2)        (66.1a)
    entries of W_x ∼ N_{w_{ji,x}}(0, 1/n_2),   (n_2 × n_1)        (66.1b)
    θ_1 ∼ N_{θ_1}(0, I_{n_2}),                 (n_2 × 1)          (66.1c)
    θ_x ∼ N_{θ_x}(0, I_{n_1}),                 (n_1 × 1)          (66.1d)

(1b) Feedforward propagation. Second, for every feature vector hn , we feed it through the two (hidden and output) layers in the autoencoder and generate the following signals, which are internal to the first autoencoder stage:

    y_1 = h_n                                    (66.2a)
    z_2 = W_1^T y_1 − θ_1                        (66.2b)
    y_2 = f(z_2)        (sigmoidal function)     (66.2c)
    z_3 = W_x^T y_2 − θ_x                        (66.2d)
    y_3 = z_3           (linear mapping)         (66.2e)
    ĥ_n = y_3                                    (66.2f)

Observe that, in this implementation, the hidden layer in the autoencoder employs a nonlinear activation function, while the output layer does not employ a nonlinearity. The above steps end up mapping each input vector h_n into ĥ_n at the output of the first autoencoding stage. We are denoting the parameters and the feature vectors in boldface because, as we explain shortly in the following, we will be running these iterations repeatedly on randomly selected feature vectors using multiple passes over the data. The boldface notation is used to highlight the random nature of the data.

(1c) Backward propagation. Once h_n is mapped to ĥ_n, we perform two backpropagation steps to generate the following signals, which are again internal to the first autoencoder stage:

    δ_3 = 2( ĥ_n − h_n )        (boundary condition)        (66.3a)
    δ_2 = f′(z_2) ⊙ ( W_x δ_3 )                              (66.3b)
    W_x ← (1 − 2µ_ac ρ_ac) W_x − µ_ac y_2 δ_3^T              (66.3c)
    W_1 ← (1 − 2µ_ac ρ_ac) W_1 − µ_ac y_1 δ_2^T              (66.3d)
    θ_x ← θ_x + µ_ac δ_3                                     (66.3e)
    θ_1 ← θ_1 + µ_ac δ_2                                     (66.3f)

Note that we are denoting the step size and regularization parameters used during the training of the autoencoder stages by (µ_ac, ρ_ac). At the end of the above steps we would have updated {W_1, θ_1, W_x, θ_x} to new values in response to the feature vector h_n. We repeat the three steps (1a)–(1c) multiple times over the training data using, for example, uniform sampling or random reshuffling:

    initialize {W_1, θ_1, W_x, θ_x} as in step (1a)
    repeat over multiple passes:
        select a random feature vector, h_n
        feed it forward through {W_1, θ_1, W_x, θ_x} to get ĥ_n as in step (1b)
        feed ĥ_n backward and update {W_1, θ_1, W_x, θ_x} as in step (1c)
    end                                                            (66.4)

Once these steps are concluded, the parameters (W_1, θ_1, W_x, θ_x) for the first autoencoder stage are learned. We discard (W_x, θ_x) and move on to the second autoencoder stage.
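As an illustration only, the following minimal NumPy sketch implements one autoencoder stage along the lines of (66.1)–(66.4), assuming a sigmoidal hidden layer and a linear output layer; the function name, arguments, and hyperparameter values are illustrative assumptions and not from the text.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_autoencoder_stage(H, n2, mu_ac=0.01, rho_ac=1e-4, passes=10, seed=0):
    """Train one autoencoder stage on the rows of H (N x n1); return (W1, theta1)."""
    rng = np.random.default_rng(seed)
    N, n1 = H.shape
    # initialization (66.1a)-(66.1d)
    W1 = rng.normal(0.0, np.sqrt(1.0 / n1), size=(n1, n2))
    Wx = rng.normal(0.0, np.sqrt(1.0 / n2), size=(n2, n1))
    theta1 = rng.normal(0.0, 1.0, size=n2)
    thetax = rng.normal(0.0, 1.0, size=n1)
    for _ in range(passes):
        for n in rng.permutation(N):             # random reshuffling, as in (66.4)
            y1 = H[n]
            z2 = W1.T @ y1 - theta1              # (66.2b)
            y2 = sigmoid(z2)                     # (66.2c)
            h_hat = Wx.T @ y2 - thetax           # (66.2d)-(66.2f), linear output
            delta3 = 2.0 * (h_hat - y1)          # (66.3a)
            delta2 = y2 * (1.0 - y2) * (Wx @ delta3)                            # (66.3b)
            Wx = (1 - 2 * mu_ac * rho_ac) * Wx - mu_ac * np.outer(y2, delta3)   # (66.3c)
            W1 = (1 - 2 * mu_ac * rho_ac) * W1 - mu_ac * np.outer(y1, delta2)   # (66.3d)
            thetax = thetax + mu_ac * delta3     # (66.3e)
            theta1 = theta1 + mu_ac * delta2     # (66.3f)
    return W1, theta1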

(2) Using (W_1, θ_1), we feed each feature vector h_n through (W_1, θ_1) to determine the post-activation signals at the output of layer 2. We denote these signals by

    y_{2,n} = f( W_1^T h_n − θ_1 )        (66.5)

(3) We now consider the next hidden layer in the feedforward network, which is layer 3. The input vector to this layer is y_{2,n}. We ignore all layers succeeding layer 3 and envision again a fictitious output layer following it with the same number of units as y_{2,n}. In other words, we envision an autoencoder built around layer 3, with layer 3 serving as its hidden layer and y_{2,n} serving as the input and desired output signals for this autoencoder. Using the given features {y_{2,n}}, we train the autoencoder by means of the backpropagation algorithm in a manner similar to steps (1a)–(1c) used for the first autoencoder stage, and determine the combination matrix and bias vector (W_2, θ_2) feeding from y_{2,n} into layer 3 in the autoencoder. The same algorithm also ends up determining the combination matrix and bias vector, denoted by (W_x, θ_x), which feed from layer 3 to the fictitious output layer following it. We ignore (W_x, θ_x).

(4) Using (W_2, θ_2), we next feed each feature vector y_{2,n} through (W_2, θ_2) to determine the post-activation signals at the output of layer 3. We denote these signals by

    y_{3,n} = f( W_2^T y_{2,n} − θ_2 )        (66.6)

(5) We repeat the same steps as before for each of the remaining hidden layers. We extract a hidden layer ℓ + 1 and build an autoencoder around it: Its input is the signal y_{ℓ,n} arriving from the previously trained layer. We train this autoencoder to determine (W_ℓ, θ_ℓ).

(6) Once all hidden layers have been covered in this manner, we end up with a collection of parameters {W_ℓ, θ_ℓ} for the layers ℓ = 1, 2, ..., L − 1. Observe that we are still left with (W_L, θ_L) for the output layer. The weights that were determined from training the successive autoencoders are not the final weights for the network. Instead, they are used as initial conditions for a full-blown training of the network. Starting from the values {W_ℓ, θ_ℓ} and selecting some random initial conditions for (W_L, θ_L) (as was explained before in Section 65.5), we now train the entire multilayer network afresh using the full-blown backpropagation algorithm. The algorithm is summarized in listing (66.7), where we are denoting the variables in boldface to highlight their random nature due to the stochastic-gradient implementation involving random selections of data.

Stacked autoencoder training of feedforward neural networks.

    given a feedforward network with L layers (input + output + hidden);
    given N data pairs {γ_n, h_n}, n = 0, 1, ..., N − 1;
    set y_{1,n} = h_n
    repeat over layers ℓ = 1, 2, ..., L − 1:
        select {W_{ℓ,−1}, θ_{ℓ,−1}, W_{x,−1}, θ_{x,−1}} randomly (Section 65.5);
        set input and output signals for this stage to {y_{ℓ,n}, y_{ℓ,n}};
        train the ℓth stage of the autoencoder using backpropagation,
            with step size µ_ac and regularization parameter ρ_ac;
            this results in the parameters (W_ℓ, θ_ℓ)
        set y_{ℓ+1,n} = f( W_ℓ^T y_{ℓ,n} − θ_ℓ )
    end
    select (W_{L,−1}, θ_{L,−1}) randomly according to Section 65.5;
    set initial conditions {W_{ℓ,−1}, θ_{ℓ,−1}} ← {W_ℓ, θ_ℓ}, ℓ = 1, ..., L − 1;
    train the neural network using step size µ, regularization parameter ρ,
        and the given training data {γ_n, h_n};
    return {W_ℓ^⋆, θ_ℓ^⋆}.                                                     (66.7)
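As a rough usage sketch of the layer-by-layer loop in listing (66.7) (again, not part of the text), the snippet below reuses the hypothetical train_autoencoder_stage routine from the earlier sketch to produce initial conditions for the hidden layers; the final full-blown backpropagation step is omitted.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def stacked_pretraining(H, hidden_sizes, **kwargs):
    """H: N x M feature matrix; hidden_sizes: number of units per hidden layer."""
    params, Y = [], H
    for n_units in hidden_sizes:
        # one stage of (66.4), via the train_autoencoder_stage sketch given earlier
        W, theta = train_autoencoder_stage(Y, n_units, **kwargs)
        params.append((W, theta))
        Y = sigmoid(Y @ W - theta)   # propagate features to the next stage, as in (66.5)-(66.6)
    return params                    # initial conditions for the full backpropagation run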

There is still no clear understanding or formal theory to justify the performance of the stacked autoencoder construction in practice. The evidence is mostly experimental. It is believed that one of the advantages of training neural networks in this manner is that the initial values for the combination and bias coefficients help ensure that the network will be operating away from the saturation regime, which allows information to flow more readily through the network during training.

66.2 RESTRICTED BOLTZMANN MACHINES

We describe a second approach to determining initial conditions for the parameters of a neural network, one layer at a time. The approach is based on cascading a sequence of RBMs. In this section, we describe RBMs without formal derivations and defer the justifications to the next section. We illustrate in Fig. 66.2 that an RBM consists of two layers: an input layer and a hidden layer, with every input unit connected to every hidden unit. The activation function in each hidden unit is required to be the sigmoid function. The qualification "restricted" in RBM refers to the fact that there are no connections among the input units or among the hidden units. The connections are limited to linking input units to hidden units only. If we drop the qualification "restricted," then the resulting structure will correspond to a Boltzmann machine, where every unit is allowed to connect to any other unit (even within the same layer).

Figure 66.2 An RBM consists of two layers: an input layer and a hidden layer. The activation function in each hidden unit is the sigmoid function. Only connections among units from different layers are allowed.

We denote the vector at the input by h ∈ IR^M and the vector at the output of the RBM by y ∈ IR^Q. The integers M and Q need not be identical so that the number of input and hidden units are not necessarily equal. Although unnecessary, it is often assumed that the entries of h are binary-valued, i.e., each entry is either 0 or 1:

    h(m) ∈ {0, 1}        (66.8)

In order to highlight this fact, we will write h_b instead of h, with the subscript b indicating the "binary" nature of the entries of h. Later, at the end of this section, following listing (66.17), we explain how the training algorithm can be modified to handle nonbinary input data. The combination matrix and bias coefficient that map h_b to y from left to right in the figure are denoted by (W^T, θ), i.e.,

    y = f( W^T h_b − θ )        (66.9)

where the activation function is the sigmoid function:

    f(x) = 1/(1 + e^{−x}) ≜ sigmoid(x)        (66.10)

The matrix W has dimensions M × Q. In the RBM implementation, we also need to map the entries of y to binary values. This can be achieved as follows. Because of the sigmoid function, each entry of y will have the interpretation of

a probability measure, assuming real values between 0 and 1. We transform this vector into a second vector consisting solely of binary values 0 or 1. We denote this second vector by y_b and set its entries randomly to the value 1 according to the probability distribution:

    P( y_b(q) = 1 | h ) = y(q),   q = 1, 2, ..., Q        (66.11)

where {y(q), y b (q)} denote the qth entries of the vectors {y, y b }, respectively. This amounts to one step of a Gibbs sampling procedure; the motivation for the terminology is explained in the comments at the end of the chapter. Note that we are using the boldface notation for y b to indicate that we are treating it as a random variable. The conditioning on h in (66.11) is meant to indicate that the hidden vector y is computed based on knowledge (or observation) of hb . Expression (66.11) means that, given y(q), we run a Bernoulli experiment with probability of success p = y(q) to decide on whether to set y b (q) to 1 or 0 (i.e., we flip a biased coin with the probability of heads equal to p and assign y b (q) = 1 if heads is observed). When this happens, we say that the qth entry is turned on. Otherwise, it is turned off. For compactness of notation, we refer to the transformation from y to yb by the notation: yb = Bernoulli(y)

(66.12)

where y_b, in normal font, denotes a binary realization for y_b. Once the binary vector y_b is computed, we proceed by mapping it back to the input layer. In other words, in RBM implementations, and unlike layers in a traditional feedforward network, we also have a return mapping running backward, to the input layer using the same combination weights from the forward map; we encountered this structure earlier in Example 43.6 in the context of undirected graphs. We denote the bias coefficient associated with the return map by θ_r, while the weight matrix is W. The backward mapping is given by

    h′ = f( W y_b − θ_r )        (66.13)

where we are again employing an activation function so that the entries of h′ can be interpreted as probability measures between 0 and 1. In a manner similar to y and y_b in (66.11), these entries can be transformed to binary values using

    P( h′_b(m) = 1 | y_b ) = h′(m)        (66.14)

That is,

    h′_b = Bernoulli(h′)        (66.15)

Given h′_b, we feed it forward through (W, θ) and determine

    y′ = f( W^T h′_b − θ ),   y′_b = Bernoulli(y′)        (66.16)

In this way, we end up carrying out the sequence of transformations shown in Fig. 66.3 to generate the signals {y_b, h′_b, y′_b}.

Figure 66.3 One forward pass through the network results in the vectors (h_b, y_b), while a cycle involving a backward pass and a forward pass results in the vectors (h′_b, y′_b). These vectors are used by the contrastive divergence algorithm (66.17).

The RBM will determine the coefficients (W, θ, θ_r) iteratively from the training feature vectors by applying the contrastive divergence algorithm listed in (66.17). The algorithm is "almost" a special case of a stochastic-gradient recursion for minimizing a log-likelihood cost function, as we explain in the next section. It starts from some random Gaussian-distributed initial conditions, {W_{−1}, θ_{−1}, θ_{r,−1}}, with small variances.

Contrastive divergence algorithm with binary features.

    given binary-valued feature vectors {h_{b,n}}, n = 0, 1, ..., N − 1;
    start from random initial conditions {W_{−1}, θ_{−1}, θ_{r,−1}};
    repeat until sufficient convergence over m = 0, 1, ...:
        select a random sample h_{b,m}
        y_m = f( W_{m−1}^T h_{b,m} − θ_{m−1} )
        y_{b,m} = Bernoulli(y_m)
        h′_m = f( W_{m−1} y_{b,m} − θ_{r,m−1} )
        h′_{b,m} = Bernoulli(h′_m)
        y′_m = f( W_{m−1}^T h′_{b,m} − θ_{m−1} )
        y′_{b,m} = Bernoulli(y′_m)
        W_m = W_{m−1} + µ_cd ( h_{b,m} y_{b,m}^T − h′_{b,m} (y′_{b,m})^T )
        θ_m = θ_{m−1} + µ_cd ( y′_{b,m} − y_{b,m} )
        θ_{r,m} = θ_{r,m−1} + µ_cd ( h′_{b,m} − h_{b,m} )
    end
    return {W^⋆, θ^⋆, θ_r^⋆} ← {W_m, θ_m, θ_{r,m}}.                              (66.17)

The algorithm operates as follows:

(1) We feed h_{b,n} forward through the RBM and find the corresponding binary output y_{b,n}. The outer product, h_{b,n} y_{b,n}^T, is referred to as a positive contribution term.

(2) We feed y_{b,n} backward through the RBM and find the corresponding binary input h′_{b,n}. We feed h′_{b,n} forward through the RBM and find the corresponding binary output y′_{b,n}. The outer product generated by this cycle, namely, h′_{b,n} (y′_{b,n})^T, is referred to as a negative contribution term.

(3) The difference h_{b,n} y_{b,n}^T − h′_{b,n} (y′_{b,n})^T drives the learning process. It provides a rough measure of how well the RBM models the distribution of the input data.

(4) At the end of the training procedure, the resulting estimated values for {W, θ, θ_r} are denoted by {W^⋆, θ^⋆, θ_r^⋆}. As is customary when dealing with stochastic learning algorithms, we can perform multiple passes over the data using either uniform sampling or random reshuffling. We denote the step-size parameter used by the contrastive divergence algorithm by µ_cd.

The algorithm can be modified for feature data that are not necessarily binary. We explain in the next section that it is more critical for the hidden states (y_{b,n}, y′_{b,n}) to be binary-valued. We simply modify recursions (66.17) and replace {h_{b,n}, h′_{b,n}} by {h_n, h′_n}; there is no need to generate the binary features {h′_{b,n}} – see (66.72a)–(66.72c). For ease of reference, and also for completeness, we list the algorithm again in (66.18), assuming now a mini-batch implementation of size B samples per batch. We also denote the variables in boldface to highlight their random nature due to the randomness in the selection of the feature data.
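For illustration, a minimal NumPy sketch of the single-sample contrastive divergence recursion (66.17) could look as follows; the function name, arguments, and default values are illustrative assumptions, not part of the text.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def contrastive_divergence(Hb, Q, mu_cd=0.05, passes=10, seed=0):
    """Hb: N x M matrix of binary feature vectors; Q: number of hidden units."""
    rng = np.random.default_rng(seed)
    N, M = Hb.shape
    W = rng.normal(0.0, 0.01, size=(M, Q))      # small random initial conditions
    theta = np.zeros(Q)
    theta_r = np.zeros(M)
    for _ in range(passes):
        for n in rng.permutation(N):
            hb = Hb[n]
            y = sigmoid(W.T @ hb - theta)                   # forward pass
            yb = (rng.random(Q) < y).astype(float)          # y_b = Bernoulli(y)
            hp = sigmoid(W @ yb - theta_r)                  # backward pass
            hpb = (rng.random(M) < hp).astype(float)        # h'_b = Bernoulli(h')
            yp = sigmoid(W.T @ hpb - theta)                 # forward pass again
            ypb = (rng.random(Q) < yp).astype(float)        # y'_b = Bernoulli(y')
            # positive minus negative contribution terms
            W += mu_cd * (np.outer(hb, yb) - np.outer(hpb, ypb))
            theta += mu_cd * (ypb - yb)
            theta_r += mu_cd * (hpb - hb)
    return W, theta, theta_r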

Mini-batch contrastive divergence algorithm with real features.

    given real-valued feature vectors {h_n}, n = 0, 1, ..., N − 1;
    start from random initial conditions {W_{−1}, θ_{−1}, θ_{r,−1}};
    repeat until sufficient convergence over m = 0, 1, ...:
        select B random feature vectors {h_p}, p = 0, 1, ..., B − 1
        repeat for p = 0, 1, ..., B − 1:
            y_p = f( W_{m−1}^T h_p − θ_{m−1} )
            y_{b,p} = Bernoulli(y_p)
            h′_p = f( W_{m−1} y_{b,p} − θ_{r,m−1} )
            y′_p = f( W_{m−1}^T h′_p − θ_{m−1} )
            y′_{b,p} = Bernoulli(y′_p)
        end
        W_m = W_{m−1} + (µ_cd/B) Σ_{p=0}^{B−1} ( h_p y_{b,p}^T − h′_p (y′_{b,p})^T )
        θ_m = θ_{m−1} + (µ_cd/B) Σ_{p=0}^{B−1} ( y′_{b,p} − y_{b,p} )
        θ_{r,m} = θ_{r,m−1} + (µ_cd/B) Σ_{p=0}^{B−1} ( h′_p − h_p )
    end
    return {W^⋆, θ^⋆, θ_r^⋆} ← {W_m, θ_m, θ_{r,m}}.                              (66.18)

Generative models
We redraw Fig. 66.2 in the form shown in Fig. 66.4. In this latter figure, we view the RBM as a mapping from a binary-valued input vector h_b ∈ IR^M to another binary-valued output vector y_b ∈ IR^Q. For reasons explained in the next section, we will refer to h_b as the visible/observable component and to y_b as the latent/hidden component. The derivation in that section will show that, given realizations for h_b, the contrastive divergence algorithm ends up determining model parameters {W, θ, θ_r} that model the conditional probability density functions (pdfs) of the variables {h_b, y_b} in terms of logistic distributions in the following manner. Let (W^T)_{q,:} and W_{m,:} denote the qth and mth rows of W^T and W, respectively. Let {h_b(m), y_b(q)} denote the individual entries of the vectors {h_b, y_b}. Then, it will hold that:

    y′(q) ≜ 1 / ( 1 + e^{−( (W^T)_{q,:} h_b − θ(q) )} )        (66.19a)
    P( y_b(q) = 1 | h_b = h_b ) = y′(q)                        (66.19b)
    P( y_b(q) = 0 | h_b = h_b ) = 1 − y′(q)                    (66.19c)

and

    h′(m) ≜ 1 / ( 1 + e^{−( W_{m,:} y_b − θ_r(m) )} )          (66.20a)
    P( h_b(m) = 1 | y_b = y_b ) = h′(m)                        (66.20b)
    P( h_b(m) = 0 | y_b = y_b ) = 1 − h′(m)                    (66.20c)

while the joint pdf of (h_b, y_b) is

    P( h_b = h_b, y_b = y_b ) ∝ exp{ −θ^T y_b − θ_r^T h_b + y_b^T W^T h_b }        (66.21)

Figure 66.4 An RBM maps a visible binary-valued component h_b to a hidden binary-valued component y_b, and vice versa. It learns the generative models that map one component to the other.

The contrastive divergence algorithm also ends up generating samples {h′_b, y′_b} that follow these distributions. In this way, we say that an RBM is able to learn from the given realizations for h_b at least five important elements:

(a) It extracts information from h_b in the form of the hidden variable y_b. The explanation in the following will reveal that this hidden information can correspond to label information about h_b or to some other useful feature information.
(b) It identifies the underlying generative models for generating y_b given h_b and for generating h_b given y_b in the form of the logistic distributions described above.
(c) It identifies the joint distribution for (h_b, y_b).
(d) It identifies the marginal distribution for the visible part, h_b, as explained in the next example.
(e) It generates samples {h′_b} that follow a similar distribution to h_b.

Example 66.1 (Marginal pdf of visible component) The marginal pdf of the visible component h_b can be obtained by marginalizing the joint pdf (66.21) over all 2^Q possible choices for the vector y_b, leading to

    P( h_b = h_b ) ∝ exp{ −θ_r^T h_b + Σ_{q=1}^{Q} softplus( −θ(q) + (W^T)_{q,:} h_b ) }        (66.22)

in terms of the softplus function f(x) ≜ ln(1 + e^x).

Proof: The result follows from the sequence of calculations:

    P( h_b = h_b )
      ∝ Σ_{y_b ∈ {0,1}^Q} exp{ −θ^T y_b − θ_r^T h_b + y_b^T W^T h_b }
      = exp{−θ_r^T h_b} × ( Σ_{y_b(1) ∈ {0,1}} exp{ −θ(1) y_b(1) + y_b(1) (W^T)_{1,:} h_b } ) × ...
          × ( Σ_{y_b(Q) ∈ {0,1}} exp{ −θ(Q) y_b(Q) + y_b(Q) (W^T)_{Q,:} h_b } )
      = exp{−θ_r^T h_b} × Π_{q=1}^{Q} ( 1 + exp{ −θ(q) + (W^T)_{q,:} h_b } )
      = exp{ −θ_r^T h_b + Σ_{q=1}^{Q} ln( 1 + exp{ −θ(q) + (W^T)_{q,:} h_b } ) }        (66.23)

∎
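As a quick numerical sanity check (not part of the text), the following NumPy snippet compares the softplus expression in (66.22) against a brute-force sum of the unnormalized joint pdf (66.21) over all 2^Q hidden configurations for a small, randomly generated example.

import itertools
import numpy as np

rng = np.random.default_rng(0)
M, Q = 4, 3
W = rng.normal(size=(M, Q))
theta, theta_r = rng.normal(size=Q), rng.normal(size=M)
hb = rng.integers(0, 2, size=M).astype(float)

# brute-force marginal: sum of exp{-theta'yb - theta_r'hb + yb'W'hb} over all yb
brute = 0.0
for cfg in itertools.product([0.0, 1.0], repeat=Q):
    yb = np.array(cfg)
    brute += np.exp(-theta @ yb - theta_r @ hb + yb @ W.T @ hb)

# closed form (66.22): exp{-theta_r'hb + sum_q softplus(-theta(q) + (W^T)_{q,:} hb)}
closed = np.exp(-theta_r @ hb + np.sum(np.logaddexp(0.0, -theta + W.T @ hb)))
print(np.isclose(brute, closed))   # expected: True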

66.3 CONTRASTIVE DIVERGENCE

We now explain the origin of the contrastive divergence algorithm (66.17), as well as the origin of the structure described in Fig. 66.2 for the RBM. The derivation relies on the Boltzmann distribution, which we encountered earlier in Section 3.6.3.

66.3.1 Boltzmann Distribution

The Boltzmann distribution is useful in modeling the behavior of complex systems. It models the system as having a number of discrete states, with each state i corresponding to an energy level E_i. The Boltzmann distribution states that the probability that the system is at state i is proportional to e^{−βE_i}:

    P(system is at state i) = e^{−βE_i} ( Σ_k e^{−βE_k} )^{−1}        (66.24)

for some constant β > 0 and where the sum is over all possible state levels. For our purposes in this discussion, it is sufficient to assume β = 1. Observe that states with low energy levels will have a higher probability of occurring than states with high energy levels. Consider next a discrete random vector x arising from some unknown probability mass function (pmf), which we denote by

    P(x = x) : probability that x assumes the value x        (66.25)

(66.25)

2810

Deep Belief Networks

We partition x into two blocks, denoted by x = col{h, z}, where z ∈ IR^Q denotes the latent or hidden part of x and h ∈ IR^M denotes the visible part of x:

    x = col{ h, z },   h ← visible part,   z ← hidden part        (66.26)

That is, realizations are available for h but not for z. The dimensions of z and h need not be the same. The variable h will play the role of the feature data, which we are able to observe. The variable z will play the role of the label vector γ, which we need not know (hence, RBMs operate in an unsupervised manner). We assume the entries of z have binary values, say, {−1, +1} or {0, 1}. Since we are assuming x to be discrete, then its components {h, z} will also be discrete. We denote the possible discrete values for {x, h, z} by {x_k, h_k, z_k}, with the subscript k varying over the range of possibilities for x. To avoid confusion with the notation, we recall that we refer to feature vectors used for training by {h_n}, with the subscript n denoting the sample index. In the discrete case, each h_n will assume one of several possible discrete values from the set h_n ∈ {h_k}. Although the actual distribution for x is unknown, the RBM approach will allow us to model it by means of a Boltzmann distribution and, more importantly, to generate samples {z_n, z′_n, h′_n} for the hidden component z and the feature space h that are distributed according to this assumed Boltzmann model. Specifically, the {z_n} will arise from the conditional pmf P(z | h = h_n) that is implied by the Boltzmann distribution model for x, and the x′_n = col{h′_n, z′_n} will follow the same Boltzmann model. To do so, we start by associating with each realization x = x_k an energy level, denoted generically by E(x_k; Θ), which we also write more explicitly as E(h_k, z_k; Θ) in terms of the components of x_k. Here, the symbol Θ refers generically to the collection of all parameters that define the energy function, and which we wish to learn. We will exhibit later in (66.44) one choice for E(h, z; Θ) and the corresponding parameters. We proceed by assuming a generic energy function. Using E(x; Θ) and the assumed Boltzmann distribution, we write

    P(x = x; Θ) = e^{−E(h,z;Θ)} ( Σ_k e^{−E(h_k,z_k;Θ)} )^{−1}        (66.27)

where the sum is over all possible states for x, and x_k = col{h_k, z_k}. To simplify the representation, we introduce the normalization factor:

    Z(Θ) ≜ Σ_k e^{−E(h_k,z_k;Θ)}        (66.28)

and refer to it as the partition function. In this way we get the equivalent expressions:

    P(x = x; Θ) = (1/Z(Θ)) e^{−E(x;Θ)}              (66.29a)
    P(h = h, z = z; Θ) = (1/Z(Θ)) e^{−E(h,z;Θ)}     (66.29b)

We can extract the marginal probability distribution for the visible component, as well as its energy level, by adding over all levels of the hidden component z, i.e.,

    P(h = h; Θ) = (1/Z(Θ)) Σ_ℓ e^{−E(h,z_ℓ;Θ)} ≜ (1/Z(Θ)) e^{−E(h;Θ)}        (66.30)

where we introduced the following notation for the energy of state h = h (sometimes referred to as the free energy term):

    E(h; Θ) ≜ −ln( Σ_ℓ e^{−E(h,z_ℓ;Θ)} )        (66.31)

Here, the sum is over all states of the hidden variable, z.

66.3.2 Log-Likelihood Function

If the Boltzmann distribution (66.29a)–(66.29b) happens to be a good fit for the distribution of x, then one useful step to consider is to show how to estimate the parameter Θ. For this purpose, assume we observe N iid realizations for h ∈ IR^M, denoted by {h_0, h_1, ..., h_{N−1}}. Then, the probability of observing this set of realizations is given by

    P( h_0 = h_0, ..., h_{N−1} = h_{N−1}; Θ ) = (1/Z(Θ))^N Π_{n=0}^{N−1} e^{−E(h_n;Θ)}        (66.32)

and the corresponding log-likelihood function is

    ℓ( {h_n}; Θ ) = −N ln(Z(Θ)) − Σ_{n=0}^{N−1} E(h_n; Θ)        (66.33)

We can formulate the problem of estimating Θ by maximizing this function, which is equivalent to

    Θ^⋆ ≜ argmin_Θ C(Θ)        (66.34)

where we introduced

    C(Θ) ≜ ln(Z(Θ)) + (1/N) Σ_{n=0}^{N−1} E(h_n; Θ)        (66.35)

Let λ denote any of the scalar parameters in Θ, and let us examine the partial derivatives of C(Θ) relative to λ. These derivatives are important when we devise (stochastic) gradient algorithms for minimizing C(Θ). It will turn out, however, that this is a challenging task to pursue since we only have access to realizations {hn } for the visible component; there are no realizations available for the corresponding hidden part. The (contrastive divergence) algorithm we derive in this section will enable us to generate samples {zn } for the hidden component that

arises from the conditional distribution P(z | h = h_n). The derivation that follows clarifies these points. To begin with, note that for any of the entries λ of Θ:

    ∂C(Θ)/∂λ = ∂ln(Z(Θ))/∂λ + (1/N) Σ_{n=0}^{N−1} ∂E(h_n;Θ)/∂λ        (66.36)

where the first term on the right-hand side is given by:

    ∂ln(Z(Θ))/∂λ
      = (1/Z(Θ)) ∂Z(Θ)/∂λ
      = (1/Z(Θ)) ∂( Σ_k e^{−E(h_k,z_k;Θ)} )/∂λ
      = (1/Z(Θ)) Σ_k ∂e^{−E(h_k,z_k;Θ)}/∂λ
      = −(1/Z(Θ)) Σ_k e^{−E(h_k,z_k;Θ)} ∂E(h_k,z_k;Θ)/∂λ
      = −Σ_k P(x = x_k) ∂E(h_k,z_k;Θ)/∂λ            (by (66.29a))
      = −E_{BD(x)} [ ∂E(x;Θ)/∂λ ]        (66.37)

The expectation in the last expression is over the assumed Boltzmann distribution for P(x) defined by (66.29a) and, hence, the use of the subscript "BD(x)." Therefore, we arrive at the following intermediate result for the partial derivative of the log-likelihood function:

    ∂C(Θ)/∂λ = (1/N) Σ_{n=0}^{N−1} ∂E(h_n;Θ)/∂λ − E_{BD(x)} [ ∂E(x;Θ)/∂λ ]        (66.38)

where the first term is a sample mean based on the visible realizations and the second term is the actual mean over the assumed Boltzmann distribution. Observe that the first term on the right-hand side has the form of a sample average using measurements {h_n} arising from the actual (yet unknown) distribution. For a large number of samples, N, we can appeal to the law of large numbers to deduce that

    (1/N) Σ_{n=0}^{N−1} ∂E(h_n;Θ)/∂λ → E_h [ ∂E(h;Θ)/∂λ ],   N → ∞        (66.39)

where the expectation is relative to the actual (not assumed) distribution of the visible component, h. We may also refer to the sample average on the left-hand side of the above equation as the sample mean of the gradient term ∂E(h;Θ)/∂λ

relative to the empirical data distribution. If we substitute into (66.38) we arrive at the following expression for large N:

    ∂C(Θ)/∂λ = E_h [ ∂E(h;Θ)/∂λ ] − E_{BD(x)} [ ∂E(x;Θ)/∂λ ],   N → ∞        (66.40)

This expression involves two expectations: The first expectation, E_h, is relative to an actual distribution, while the second expectation, E_{BD(x)}, is relative to an assumed distribution. Both expectations are generally unavailable. We can still seek to minimize C(Θ) by employing a stochastic-gradient procedure where the expectations are replaced by instantaneous approximations. This step is straightforward to perform for the expectation over the actual distribution of h since we are given samples {h_n} that arise from this distribution. For instance, we can use the instantaneous approximation:

    E_h [ ∂E(h;Θ)/∂λ ] ≈ ∂E(h_n;Θ)/∂λ        (66.41)

The more challenging task is to determine a sample approximation for the expectation over the assumed Boltzmann distribution for x for the rightmost term in (66.40). This is because we would need to have access to realizations for both h and z that arise from this assumed Boltzmann distribution. If we are able to obtain such realizations, and if we denote them by {h′_n, z′_n}, then we could similarly set

    E_{BD(x)} [ ∂E(x;Θ)/∂λ ] ≈ ∂E(h′_n, z′_n;Θ)/∂λ        (66.42)

so that one approximation for ∂C(Θ)/∂λ can be computed as

    ∂E(h_n;Θ)/∂λ − ∂E(h′_n, z′_n;Θ)/∂λ        (66.43)

To complete the argument we need to address two issues. First, even if {hn , h0n , zn0 } were all known, we still need to evaluate the above partial derivatives, which in turn requires us to specify the form of the energy function, E(·; ·). Second, we need to generate samples {h0n , zn0 } that arise from the assumed Boltzmann distribution for {h, z}. We consider the first issue next.

66.3.3 Bilinear Energy Form

We assume a useful bilinear form for the energy of a state x = col{h, z} as follows:¹

E(x;\Theta) \;\overset{\Delta}{=}\; E(h,z;\Theta) = \theta^{\sf T} z + \theta_r^{\sf T} h - z^{\sf T} W^{\sf T} h, \qquad \Theta = \{\theta, \theta_r, W\}    (66.44)

¹ This is the same energy model we assumed in (43.33) while examining RBMs from the perspective of undirected graphs. The only difference is in the notation for the weight matrix W. Here we are using Wᵀ to be consistent with the notation used for the weight matrix in neural network implementations; while there, we used W.


where (θ, θr) are column vectors of appropriate dimensions, and W is a matrix matching the dimensions of h ∈ IR^M and z ∈ IR^Q, i.e., W ∈ IR^{M×Q}. Observe that this expression is not a full-blown quadratic form in (h, z); for example, it does not contain separate quadratic terms in h and z. Observe also that the set Θ consists of all entries of the parameters {θ, θr, W}. In this way, depending on whether λ is an entry of W, θ, or θr, the rightmost partial derivative in (66.43) would be given by any of the following forms:

\frac{\partial E(h'_n,z'_n;\Theta)}{\partial W_{mq}} = -z'_n(q)\,h'_n(m)    (66.45a)
\frac{\partial E(h'_n,z'_n;\Theta)}{\partial \theta(q)} = z'_n(q)    (66.45b)
\frac{\partial E(h'_n,z'_n;\Theta)}{\partial \theta_r(m)} = h'_n(m)    (66.45c)

in terms of the mth and qth entries of the visible and hidden vectors (h′n, z′n), with m = 1, 2, ..., M and q = 1, 2, ..., Q. In the above expressions, the symbol Wmq denotes the (m, q)th entry of W, while θ(q) and θr(m) denote the qth and mth entries of θ and θr, respectively. If we collect these partial derivatives into matrix and vector quantities, denoted for convenience here by ∂/∂W, ∂/∂θ, and ∂/∂θr, we obtain

\frac{\partial E(h'_n,z'_n;\Theta)}{\partial W} = -h'_n\,(z'_n)^{\sf T}    (66.46a)
\frac{\partial E(h'_n,z'_n;\Theta)}{\partial \theta} = z'_n    (66.46b)
\frac{\partial E(h'_n,z'_n;\Theta)}{\partial \theta_r} = h'_n    (66.46c)
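For a quick numerical check of the bilinear energy (66.44) and the collected partial derivatives (66.46a)–(66.46c), the following minimal sketch can be used (Python with NumPy; the function and variable names are illustrative and not part of any particular library):

import numpy as np

def energy(h, z, W, theta, theta_r):
    # bilinear energy (66.44): E(h, z; Theta) = theta' z + theta_r' h - z' W' h
    return theta @ z + theta_r @ h - z @ (W.T @ h)

def energy_grads(h, z):
    # partial derivatives of E(h, z; Theta) collected as in (66.46a)-(66.46c)
    dW = -np.outer(h, z)    # dE/dW       = -h (z)'
    dtheta = z              # dE/dtheta   = z
    dtheta_r = h            # dE/dtheta_r = h
    return dW, dtheta, dtheta_r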

Thus, once we have available realizations {h′n, z′n}, the rightmost term in (66.43) is computable as above. Let us now examine the first partial derivative of E(hn; Θ) in (66.43), which is more demanding:

\begin{aligned}
\frac{\partial E(h_n;\Theta)}{\partial \lambda}
&\overset{(66.31)}{=} -\frac{\partial}{\partial \lambda}\left\{\ln\left(\sum_{\ell} e^{-E(h_n,z_\ell;\Theta)}\right)\right\}\\
&= -\frac{1}{e^{-E(h_n;\Theta)}}\sum_{\ell}\frac{\partial}{\partial \lambda}\, e^{-E(h_n,z_\ell;\Theta)}\\
&= \frac{1}{e^{-E(h_n;\Theta)}}\sum_{\ell} e^{-E(h_n,z_\ell;\Theta)}\,\frac{\partial E(h_n,z_\ell;\Theta)}{\partial \lambda}\\
&= \sum_{\ell}\frac{e^{-E(h_n,z_\ell;\Theta)}}{e^{-E(h_n;\Theta)}}\,\frac{\partial E(h_n,z_\ell;\Theta)}{\partial \lambda}
\end{aligned}    (66.47)


Using (66.29a)–(66.29b) and (66.30), we rewrite the above as

\begin{aligned}
\frac{\partial E(h_n;\Theta)}{\partial \lambda}
&= \sum_{\ell}\frac{P(h = h_n, z = z_\ell)}{P(h = h_n)}\,\frac{Z(\Theta)}{Z(\Theta)}\,\frac{\partial E(h_n,z_\ell;\Theta)}{\partial \lambda}\\
&= \sum_{\ell} P(z = z_\ell \,|\, h = h_n)\,\frac{\partial E(h_n,z_\ell;\Theta)}{\partial \lambda}\\
&= E_{BD(z|h)}\left[\frac{\partial E(h_n,z;\Theta)}{\partial \lambda}\right]
\end{aligned}    (66.48)

where the expectation in the last expression is over the conditional distribution that follows for z from the assumed Boltzmann distribution for x in (66.29a). We indicate this fact by using the subscript “BD(z|h).” Noting that λ refers to any of the entries in {W, θ, θr}, it follows from (66.45a)–(66.45c) that:

\frac{\partial E(h_n;\Theta)}{\partial W_{mq}} = -E_{BD(z|h)}\big(z(q)\,h_n(m)\big)    (66.49a)
\frac{\partial E(h_n;\Theta)}{\partial \theta(q)} = E_{BD(z|h)}\big(z(q)\big)    (66.49b)
\frac{\partial E(h_n;\Theta)}{\partial \theta_r(m)} = E_{BD(z|h)}\big(h_n(m)\big) = h_n(m)    (66.49c)

Note that the term hn(m) in the first and last expressions does not appear in boldface because this value is known (i.e., not random) given the conditioning on h. If we again collect the partial derivatives into matrix and vector quantities, we obtain

\frac{\partial E(h_n;\Theta)}{\partial W} = -E_{BD(z|h)}\big(h_n z^{\sf T}\big)    (66.50a)
\frac{\partial E(h_n;\Theta)}{\partial \theta} = E_{BD(z|h)}(z)    (66.50b)
\frac{\partial E(h_n;\Theta)}{\partial \theta_r} = h_n    (66.50c)

In these expressions, the observed variable hn arises from realizations from the actual (yet unknown) distribution for the visible component. In contrast, the hidden variable z is assumed to arise from the postulated Boltzmann distribution for z conditioned on hn. We can now appeal to sample approximations to estimate the above derivatives once a realization zn is computed according to the assumed conditional distribution. When this is done, we can employ

\frac{\partial E(h_n;\Theta)}{\partial W} \approx -h_n\,(z_n)^{\sf T}    (66.51a)
\frac{\partial E(h_n;\Theta)}{\partial \theta} \approx z_n    (66.51b)
\frac{\partial E(h_n;\Theta)}{\partial \theta_r} = h_n    (66.51c)


Therefore, we are left with the problem of generating realizations zn that follow the assumed conditional distribution BD(z|h), and joint realizations (h′n, z′n) that follow the assumed Boltzmann distribution BD(x) for (h, z). It turns out that the conditional distribution BD(z|h) has a logistic form, which can be exploited to facilitate the generation of the samples, {zn, h′n, z′n}.

66.3.4 Binary RBM

The matrix W has dimensions M × Q. Let the notation (Wᵀ)q,: denote the qth row of Wᵀ. Then, from the assumed form (66.44) we have that

\begin{aligned}
E(h_n,z_\ell;\Theta) &= \theta^{\sf T} z_\ell + \theta_r^{\sf T} h_n - z_\ell^{\sf T} W^{\sf T} h_n\\
&= \theta_r^{\sf T} h_n + \sum_{q=1}^{Q}\Big(\theta(q)\,z_\ell(q) - z_\ell(q)\,(W^{\sf T})_{q,:}\,h_n\Big)
\end{aligned}    (66.52)

in terms of the individual entries of zℓ ∈ IR^Q and the individual entries of the bias vector θ. The result shows that the energy level, E(hn, zℓ; Θ), for the Boltzmann machine can be expressed as the sum of Q individual levels, each associated with one entry of zℓ. It follows from the assumed Boltzmann model (66.27) and energy function (66.52) that

\begin{aligned}
P(z = z_\ell \,|\, h = h_n)
&\overset{(66.31)}{=} \frac{e^{-E(h_n,z_\ell;\Theta)}}{\sum_{\ell'} e^{-E(h_n,z_{\ell'};\Theta)}}\\
&= \frac{e^{-E(h_n,z_\ell;\Theta)}}{e^{-E(h_n;\Theta)}}\\
&\propto e^{-E(h_n,z_\ell;\Theta)}\\
&\propto \prod_{q=1}^{Q} e^{-\theta(q)z_\ell(q) + z_\ell(q)(W^{\sf T})_{q,:} h_n}\\
&= \prod_{q=1}^{Q} e^{z_\ell(q)\left((W^{\sf T})_{q,:} h_n - \theta(q)\right)}
\end{aligned}    (66.53)

This last expression shows that we can represent the conditional probability, P(z = z` |h = hn ), as a product of individual distributions, one for each entry of z. Therefore, conditioned on h = hn , the entries of z are independent of each other. This property is a manifestation of the fact that the assumed form (66.44) for the energy function does not involve any connections among the individual entries of z and among the individual entries of h since only the mixed term z T W T h appears in (66.44). As the analysis in the sequel will show, a similar conclusion holds for the reversed conditional probability, P(h = hn |z = z` ) – see (66.58) further ahead. We therefore say that the RBM is effectively modeling the conditional distribution of the hidden state given the visible state as the product of elementary distributions, also called experts. This is an example of a product of experts model;


this model is in contrast to other approaches where the conditional probability is modeled as the sum (rather than the product) of elementary distributions (as was the case, for example, with mixture models studied in Chapter 32). Let us further assume that the entries of the hidden variable, z, are binary, i.e., they assume values {0, 1} – we could also consider values {−1, +1} with minor adjustments to the argument, as pursued in Prob. 66.1:

z(q) \in \{0,1\}, \qquad q = 1, 2, \ldots, Q    (66.54)

We then conclude from the product expression (66.53) that, for each individual entry of z, its conditional probability distribution is of the form

P(z(q) = 1 \,|\, h = h_n) = \alpha_q\, e^{(W^{\sf T})_{q,:} h_n - \theta(q)}    (66.55a)
P(z(q) = 0 \,|\, h = h_n) = \alpha_q    (66.55b)

where αq > 0 is a proportionality constant used to ensure that the two probability values add up to 1. It is easy to see that

\alpha_q = \frac{1}{1 + e^{(W^{\sf T})_{q,:} h_n - \theta(q)}}    (66.56)

and, hence,

P(z(q) = 1 \,|\, h = h_n) = \frac{1}{1 + e^{-\left((W^{\sf T})_{q,:} h_n - \theta(q)\right)}}    (66.57a)
P(z(q) = 0 \,|\, h = h_n) = \frac{e^{-\left((W^{\sf T})_{q,:} h_n - \theta(q)\right)}}{1 + e^{-\left((W^{\sf T})_{q,:} h_n - \theta(q)\right)}}    (66.57b)

This means that, conditioned on h = hn, the variable z(q) follows a logistic distribution – recall (59.5a). In a similar manner, we can verify that

P(h = h_n \,|\, z = z_\ell) \;\propto\; \prod_{m=1}^{M} e^{h_n(m)\left(W_{m,:}\, z_\ell - \theta_r(m)\right)}    (66.58)

where Wm,: denotes the mth row of W. If we also assume that the entries of h are binary:

h(m) \in \{0,1\}, \qquad m = 1, 2, \ldots, M    (66.59)

then we conclude from the above product expression that, for each individual entry of h, its probability distribution is of the form:

P(h(m) = 1 \,|\, z = z_\ell) = \frac{1}{1 + e^{-\left(W_{m,:}\, z_\ell - \theta_r(m)\right)}}    (66.60a)
P(h(m) = 0 \,|\, z = z_\ell) = \frac{e^{-\left(W_{m,:}\, z_\ell - \theta_r(m)\right)}}{1 + e^{-\left(W_{m,:}\, z_\ell - \theta_r(m)\right)}}    (66.60b)

This means that, conditioned on z = z` , the variable h(m) follows a logistic distribution as well.
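Both logistic conditionals are simple to evaluate numerically. A minimal sketch, written in terms of the sigmoid function 1/(1 + e^{−x}) and with illustrative names only, is:

import numpy as np

def sigmoid(x):
    # logistic function 1/(1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

def p_z_given_h(h, W, theta):
    # entrywise P(z(q) = 1 | h = h), cf. (66.57a)
    return sigmoid(W.T @ h - theta)

def p_h_given_z(z, W, theta_r):
    # entrywise P(h(m) = 1 | z = z), cf. (66.60a)
    return sigmoid(W @ z - theta_r)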


66.3.5 Sampling the Distributions

We are now in a position to explain how to generate the desired realizations {zn, z′n, h′n}. Let us denote the sigmoid function by

f(x) = \frac{1}{1 + e^{-x}}    (66.61)

and let us propagate the visible vector, hn, through this activation function and denote the result by

y_n = f\big(W^{\sf T} h_n - \theta\big)    (66.62)

The entries of yn assume real values between 0 and 1. We can transform each of these entries into a binary value, 0 or 1, as follows. We denote the transformed vector by y_{b,n}. We set the entries of y_{b,n} randomly to the value 1 according to the logistic probability distribution:

P\big(y_{b,n}(q) = 1 \,|\, h = h_n\big) = y_n(q), \qquad q = 1, 2, \ldots, Q    (66.63)

where {y_{b,n}(q), y_n(q)} denote the qth entries of the vectors {y_{b,n}, y_n}, respectively. We denote this transformation more compactly by writing

y_{b,n} = \text{Bernoulli}(y_n)    (66.64)

where y_{b,n}, in normal font, denotes the resulting binary-valued realization. Because of the forms (66.61)–(66.62), through this (forward-pass) construction, we are able to construct a realization y_{b,n} whose entries are distributed according to the logistic model (66.57a). The variable y_{b,n} therefore plays the role of the desired realization zn since, conditioned on the given observation hn, it is distributed according to BD(z|h), which in this case agrees with the logistic model (66.57a). Let us now propagate the binary vector, y_{b,n}, backward through the sigmoidal activation function and denote the result by

h'_n = f\big(W y_{b,n} - \theta_r\big)    (66.65)

The entries of h′n assume real values between 0 and 1. We can transform each of these entries into a binary value, 0 or 1, as follows. We denote the transformed vector by h′_{b,n} and set its entries randomly to the value 1 according to the logistic probability distribution:

P\big(h'_{b,n}(m) = 1 \,|\, z = z_\ell\big) = h'_n(m), \qquad m = 1, 2, \ldots, M    (66.66)

where {h′_{b,n}(m), h′_n(m)} denote the mth entries of the vectors {h′_{b,n}, h′_n}, respectively. We denote this transformation more compactly by writing

h'_{b,n} = \text{Bernoulli}(h'_n)    (66.67)

where h′_{b,n}, in normal font, denotes the resulting binary-valued realization. Again, through this construction, we are able to observe a realization h′_{b,n} whose entries are now distributed according to the logistic model (66.60a).
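The Bernoulli steps (66.64) and (66.67) amount to comparing each probability against a uniform random number. A small sketch (illustrative names; a NumPy random generator is assumed) is:

import numpy as np

def bernoulli(p, rng):
    # return a binary vector whose entries equal 1 with the probabilities in p,
    # as in (66.63)-(66.64) and (66.66)-(66.67)
    return (rng.random(p.shape) < p).astype(float)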


One more time, we propagate the binary realization h′_{b,n} forward through the activation function and denote the result by

y'_n = f\big(W^{\sf T} h'_{b,n} - \theta\big)    (66.68)

We further transform the entries of y′n into binary values and define

y'_{b,n} = \text{Bernoulli}(y'_n)    (66.69)

The variables (h′_{b,n}, y′_{b,n}) play the role of the desired realizations (h′n, z′n) since they are both generated according to the conditional distributions that follow from the assumed Boltzmann model for the data. Thus, observe that we carried out the sequence of transformations illustrated earlier in Fig. 66.3 and arrived at a procedure that consists of two phases: a positive phase that samples visible realizations {hn} from the actual data distribution P(h, z), and a negative phase that samples realizations from the assumed conditional distributions P(z|h) and P(h|z). This construction is a special case of the so-called Gibbs sampling procedure, which is explained in the concluding remarks of the chapter; it is a special case since it relies on a single backward pass. In summary, we motivated the following instantaneous approximations for the gradient quantities of interest:

\widehat{\frac{\partial C(\Theta)}{\partial W}} = h'_{b,n}\,(y'_{b,n})^{\sf T} - h_{b,n}\,y_{b,n}^{\sf T}    (66.70a)
\widehat{\frac{\partial C(\Theta)}{\partial \theta}} = y_{b,n} - y'_{b,n}    (66.70b)
\widehat{\frac{\partial C(\Theta)}{\partial \theta_r}} = h_{b,n} - h'_{b,n}    (66.70c)

The RBM determines the coefficients (W, θ, θr) by employing a stochastic gradient iteration that relies on the above approximations, namely,

W_m = W_{m-1} + \mu_{cd}\left[h_{b,m}\,y_{b,m}^{\sf T} - h'_{b,m}\,(y'_{b,m})^{\sf T}\right]    (66.71a)
\theta_m = \theta_{m-1} + \mu_{cd}\left(y'_{b,m} - y_{b,m}\right)    (66.71b)
\theta_{r,m} = \theta_{r,m-1} + \mu_{cd}\left(h'_{b,m} - h_{b,m}\right)    (66.71c)
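Combining the sampling steps (66.62)–(66.69) with the corrections above, one contrastive divergence update for a binary visible vector h can be sketched as follows. It reuses the sigmoid and bernoulli helpers from the earlier sketches, and all names are illustrative rather than part of the chapter's formal listings:

def cd1_update(h, W, theta, theta_r, mu_cd, rng):
    # positive phase: sample a binary hidden vector from P(z|h), cf. (66.62)-(66.64)
    y_b = bernoulli(sigmoid(W.T @ h - theta), rng)                  # plays the role of y_{b,n}
    # negative phase: reconstruct the visible vector and resample the hidden one,
    # cf. (66.65)-(66.69)
    h_b_prime = bernoulli(sigmoid(W @ y_b - theta_r), rng)          # plays the role of h'_{b,n}
    y_b_prime = bernoulli(sigmoid(W.T @ h_b_prime - theta), rng)    # plays the role of y'_{b,n}
    # stochastic-gradient corrections (66.71a)-(66.71c)
    W_new = W + mu_cd * (np.outer(h, y_b) - np.outer(h_b_prime, y_b_prime))
    theta_new = theta + mu_cd * (y_b_prime - y_b)
    theta_r_new = theta_r + mu_cd * (h_b_prime - h)
    return W_new, theta_new, theta_r_new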

where m ≥ 0 denotes the iteration index. These are the same recursions as the contrastive divergence algorithm (66.17). Note that an RBM is effectively determining parameters (W, θ, θr) that model the conditional probabilities P(z|h) and P(h|z) linking the visible and hidden variables. We therefore say that an RBM is a special case of a generative model: it learns a probability distribution and helps generate samples according to this distribution. We further remark that the algorithm can be modified to apply to visible data, (h, h′), that are not necessarily binary; it is more critical that the hidden states, (z, z′), be binary-valued because it is the conditional probability P(z = zℓ | h =


hn) that appears in the partial derivatives (66.49a)–(66.49c). For this case, we replace the above recursions by

W_m = W_{m-1} + \mu_{cd}\left[h_{b,m}\,y_m^{\sf T} - h'_{b,m}\,(y'_m)^{\sf T}\right]    (66.72a)
\theta_m = \theta_{m-1} + \mu_{cd}\left(y'_{b,m} - y_{b,m}\right)    (66.72b)
\theta_{r,m} = \theta_{r,m-1} + \mu_{cd}\left(h'_m - h_m\right)    (66.72c)

66.4 PRE-TRAINING USING STACKED RBMs

We can now explain how to train feedforward networks by using a cascade of RBMs (as opposed to autoencoders). We refer to Fig. 66.5. Thus, consider again a multilayer feedforward network consisting of L layers: an input layer, an output layer, and L − 2 hidden layers. Consider further a collection of N training data points {γn, hn}, where γn ∈ IR^Q are the class vectors and hn ∈ IR^M are the feature vectors. This second pre-training approach follows closely the stacked autoencoder approach, except that the autoencoders are replaced by RBMs:

(1) We first train the hidden layers, one layer at a time. We start with layer 2, which is the first hidden layer. The input vector to this layer is hn. We ignore all layers succeeding layer 2 and envision an RBM built around layer 2, with layer 2 serving as its hidden layer and hn serving as the input. Using the given features {hn}, we train the RBM by means of the contrastive divergence algorithm, and determine the combination matrix and bias vector (W1ᵀ, θ1) feeding from hn into layer 2 in the RBM. The same algorithm also ends up determining a bias vector, denoted by θr,1, which feeds from layer 2 back to hn.

(2) We feed each feature vector hn through (W1ᵀ, θ1) to determine the postactivation signals at the output of layer 2. We denote these signals by

y_{2,n} = f\big(W_1^{\sf T} h_n - \theta_1\big)    (66.73)

(3) We consider the next hidden layer in the feedforward network, which is layer 3. The input vector to this layer is y2,n. We ignore all layers succeeding layer 3 and envision again an RBM built around layer 3, with layer 3 serving as its hidden layer and y2,n serving as the input. Using the given features {y2,n}, we train the RBM by means of the contrastive divergence algorithm and determine the combination matrix and bias vector (W2ᵀ, θ2) feeding from y2,n into layer 3 in the RBM. The same algorithm also ends up determining a bias vector, denoted by θr,2, which feeds from layer 3 back to y2,n.

Figure 66.5 Pre-training of a multilayered feedforward network by means of a sequence of RBMs, where one RBM is trained at a time to generate initial conditions for a subsequent full-blown training by backpropagation for the entire network.

(4) We feed each feature vector y2,n through (W2ᵀ, θ2) to determine the postactivation signals at the output of layer 3. We denote these signals by

y_{3,n} = f\big(W_2^{\sf T} y_{2,n} - \theta_2\big)    (66.74)

(5) We repeat the same steps as before for each of the remaining hidden layers. We extract a hidden layer ℓ + 1 and build an RBM around it: Its input is the signal yℓ,n arriving from the previously trained layer. We train this RBM to determine (Wℓᵀ, θℓ).

(6) Once all hidden layers have been covered in this manner, we end up with a collection of parameters {Wℓ, θℓ} for the layers ℓ = 1, 2, ..., L − 1. Observe that we are still left with (WL, θL) for the output layer. The weights that were determined from training the successive RBMs are not the final weights


for the network. Instead, they are used as initial conditions. Starting from these values, and selecting random initial conditions for (WL , θL ), we now train the entire multilayer network afresh using a full-blown backpropagation algorithm. The algorithm is summarized in listing (66.75). We are also denoting the variables in boldface to highlight their random nature due to the stochastic gradient implementation involving random selections of data.

RBM training of feedforward neural networks.
given a feedforward network with L layers (input + output + hidden);
given N data pairs {γn, hn}, n = 0, 1, ..., N − 1;
set y1,n = hn.
repeat for ℓ = 1, 2, ..., L − 1:
    select {Wℓ,−1, θℓ,−1, θr,ℓ,−1} randomly (Section 65.5);
    train the ℓth stage using contrastive divergence with step size µcd.
        This generates (Wℓ, θℓ);
    set yℓ+1,n = f(Wℓᵀ yℓ,n − θℓ)
end
select (WL,−1, θL,−1) randomly according to Section 65.5;
set initial conditions {Wℓ,−1, θℓ,−1} ← {Wℓ, θℓ};
train network using step size µ and regularization parameter ρ;
return {Wℓ⋆, θℓ⋆}.
(66.75)
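For concreteness, the greedy loop in listing (66.75) can be sketched as below, building on the cd1_update step shown earlier. The sketch stops after the layer-by-layer stage; the final full-network backpropagation training is not included, and the names are illustrative:

def pretrain_stack(features, layer_sizes, mu_cd, passes, rng):
    # features: list of binary feature vectors h_n (the layer-1 signals y_{1,n})
    # layer_sizes: [n_1, n_2, ..., n_{L-1}], sizes of the input and hidden layers
    params = []
    signals = [np.asarray(h, dtype=float) for h in features]
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = 0.01 * rng.standard_normal((n_in, n_out))   # random initial conditions
        theta, theta_r = np.zeros(n_out), np.zeros(n_in)
        for _ in range(passes):                          # contrastive divergence training
            for h in signals:
                W, theta, theta_r = cd1_update(h, W, theta, theta_r, mu_cd, rng)
        params.append((W, theta, theta_r))
        # propagate the signals through the trained stage, cf. (66.73)-(66.74)
        signals = [sigmoid(W.T @ h - theta) for h in signals]
    return params

The returned parameters then serve as the initial conditions for the subsequent full-blown backpropagation training of the network.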

Again, there is still no clear understanding or formal theory to justify the performance of this greedy construction in practice. The evidence is mostly experimental. It is believed that one of the advantages is that the initial values for the combination and bias coefficients help ensure that the network will be operating away from the saturation regime. Example 66.2 (Classification of handwritten digits) We illustrate the operation of the autoencoder and contrastive divergence schemes by using them to determine initial parameter conditions for the training of a multilayer neural network. The network will be used to identify handwritten digits using the same MNIST dataset from Example 65.9. Recall that the MNIST dataset consists of 60,000 labeled training samples and 10,000 labeled testing samples. Each entry in the dataset is a 28 × 28 grayscale image, which we transform into an M = 784-long feature vector, hn . Each pixel in the image and, therefore, each entry in hn , assumes integer values in the range [0, 255]. Every feature vector (or image) is assigned an integer label in the range 0−9, depending on which digit the image corresponds to. The earlier Fig. 65.12 shows randomly selected images from the training dataset. In the simulations in this example, we select only 5000 random samples from the training dataset and 1000 random samples from the test dataset. The objective is to illustrate


the enhancement that is provided by the autoencoder and contrastive divergence constructions. For this purpose, it is not necessary to run a full-blown simulation using the entire dataset. We use the 5000 training samples to train a neural network with L = 7 layers (including the input layer, the output layer, and five hidden layers). The size of the input layer is n1 = 784 (which agrees with the size of the feature vectors), while the size of the output layer is n7 = 10 (which agrees with the number of classes). The size of the hidden layers is set to nℓ = 128 neurons for ℓ = 2, 3, 4, 5, 6 with sigmoidal activation functions. We employ a softmax layer at the output and train the network using a regularized cross-entropy criterion with parameters

µ = 0.01, \qquad ρ = 0.0001    (66.76)

We preprocess the images (or the corresponding feature vectors {hn}) by scaling their entries by 255 (so that they assume values in the range [0, 1]). We subsequently compute the sample mean feature vectors for the training and test sets. We center the scaled feature vectors around these means in both sets. We train the network using 200 passes over the data with random reshuffling.

Table 66.1 Empirical error rates over 1000 test samples and 5000 training samples chosen at random from the MNIST dataset for three types of initialization procedures: Gaussian, autoencoders, and contrastive divergence.

Initialization            Empirical test error (%)    Empirical training error (%)
Gaussian                  87.4%                        88.7%
autoencoder               11.5%                        0.00%
contrastive divergence    16.6%                        1.86%

We use three different initialization procedures for setting the parameters {Wℓ, θℓ} before applying the backpropagation procedure (65.150) for cross-entropy risk minimization with a softmax output layer:

(a) random Gaussian initial conditions as explained in Section 65.5;

(b) initial conditions obtained by means of the autoencoder structure using

µac = 0.001, \qquad ρac = 0 \qquad (autoencoder training)    (66.77)

for the backpropagation iterations during the training of the individual autoencoder stages; and

(c) initial conditions obtained by means of the contrastive divergence structure using

µcd = 0.0001 \qquad (contrastive divergence training)    (66.78)

for the backpropagation iterations during the training of the individual contrastive divergence stages.

The training within the autoencoder and contrastive divergence stages employs 100 passes over the data with random reshuffling. The empirical error rates are summarized in Table 66.1. These errors are computed over the 1000 test samples and 5000 training samples.

66.5 DEEP GENERATIVE MODEL

We explained in Section 66.3 how the contrastive divergence algorithm can be used to estimate the parameters {W, θ, θr} of the RBM shown in Fig. 66.2 in


order to learn a generative distribution for the latent component via expressions (66.57a)–(66.57b), given the observation h, namely,

P\big(z(q) = 1 \,|\, h = h_n\big) = \frac{1}{1 + e^{-\left((W^{\sf T})_{q,:} h_n - \theta(q)\right)}}    (66.79a)
P\big(z(q) = 0 \,|\, h = h_n\big) = \frac{e^{-\left((W^{\sf T})_{q,:} h_n - \theta(q)\right)}}{1 + e^{-\left((W^{\sf T})_{q,:} h_n - \theta(q)\right)}}    (66.79b)

Likewise, when h has binary entries, its generative model given the latent information can be described by:

P\big(h(m) = 1 \,|\, z = z_n\big) = \frac{1}{1 + e^{-\left(W_{m,:}\, z_n - \theta_r(m)\right)}}    (66.80a)
P\big(h(m) = 0 \,|\, z = z_n\big) = \frac{e^{-\left(W_{m,:}\, z_n - \theta_r(m)\right)}}{1 + e^{-\left(W_{m,:}\, z_n - \theta_r(m)\right)}}    (66.80b)

We now exploit these results to construct and train a deep belief network. These are generative graphical models that allow us to learn latent information and/or generate feature data by using more elaborate network structures than a single RBM. Before describing deep belief networks, we refer to the auxiliary diagram shown in Fig. 66.6, which will be used to pre-train the deep network. We assume henceforth that the input h has binary entries in {0, 1} (so that h and hb from our earlier notation would coincide); likewise, the variable z coincides with the earlier notation y b . In the figure, we show a cascade of RBMs from left to right. The input h appears on the left. The first RBM models the mapping between (h, z 2 ), while the second RBM models the mapping between (z 2 , z 3 ), and so forth. All signals {h, z 2 , z 3 , z 4 , . . .} shown in the figure have binary entries. We will again learn the parameters (W` , θ` , θr,` ) for the layers in this model by following a multilayered approach where each RBM is trained separately, as was the case with stacked autoencoders:

Figure 66.6 A sequence of RBMs mapping binary-valued inputs to binary-valued outputs. (All vectors in the figure have binary entries in {0, 1}.)


(1) We train the first RBM on the left relating (h, z2) and ignore all subsequent RBMs. Using the given features {hn}, we train this first RBM by means of the contrastive divergence algorithm, and determine the combination matrix and bias vector (W1ᵀ, θ1) that feed from hn into z2,n. The same algorithm also determines the model (W1, θr,1) that feeds from z2,n back to hn.

(2) We feed each feature vector hn through (W1ᵀ, θ1) and perform Gibbs sampling (i.e., the Bernoulli sampling step) to determine the samples {z2,n}.

(3) We consider the second RBM relating (z2, z3). The input vector is z2,n. We ignore all other RBMs. Using the given samples {z2,n}, we train this RBM by means of the contrastive divergence algorithm. Once converged, we arrive at the combination matrix and bias vectors (W2, θ2, θr,2).

(4) We feed the realizations z2,n through (W2ᵀ, θ2) and perform Gibbs sampling to determine the samples {z3,n}.

(5) We repeat the same steps as before for each of the remaining RBMs. For a generic ℓth RBM, we launch the contrastive divergence algorithm using the input signals {zℓ,n} to arrive at {Wℓ, θℓ, θr,ℓ}.

(6) Once all RBMs have been trained in this manner, we end up with a collection of parameters {Wℓ, θℓ, θr,ℓ} for the layers ℓ = 1, 2, ..., L.

(7) Usually, a fine-tuning procedure is applied to the layers to enhance performance; we forgo the details here, which can be found in the references at the end of the chapter. This procedure is slow and involves two processing steps from left to right and from right to left. The simulation in Example 66.3 does not perform fine-tuning for simplicity.

Justification

We can justify the multilayered greedy training procedure for deep belief networks as follows by using some generic notation to refer to visible and hidden variables; the association with the notation used to describe the above greedy procedure will be evident from the context. Using the same notation introduced earlier while deriving the contrastive divergence algorithm, we consider a generic random vector x that is split into visible and hidden components, h and z, respectively, say, x = col{h, z}. We assume their entries are binary-valued with {0, 1} values; in particular, the h and z vectors correspond to the hb and yb variables in the first RBM. We recall that we established earlier in (6.164) a lower bound on the pdf of any random variable h, known as the evidence lower bound (ELBO), namely,

\begin{aligned}
\ln f_{h}(h) &\geq E_q\big[\ln f_{h,z}(h,z)\big] - E_q\big[\ln q_{z|h}(z|h)\big]\\
&= \sum_z q_{z|h}(z|h)\ln f_{h,z}(h,z) - \sum_z q_{z|h}(z|h)\ln q_{z|h}(z|h)
\end{aligned}    (66.81)


where the notation E_q refers to expectation relative to some conditional distribution q_{z|h}(z|h) chosen by the designer. This distribution is selected to be an approximation for the true conditional pdf f_{z|h}(z|h):

q_{z|h}(z|h) \approx f_{z|h}(z|h)    (66.82)

Equality holds in (66.81) when q_{z|h}(z|h) = f_{z|h}(z|h) – see Prob. 66.4. We have used the ELBO in earlier chapters to design variational inference algorithms: rather than maximize the log-likelihood of f_h(h) over its model parameters, we maximize the lower bound because, as was already shown in Chapter 33, the latter problem is more tractable. Here, we will not be using the ELBO to learn parameters. Instead, we will be using it to motivate the greedy layered procedure for training deep belief networks. The argument is as follows. Using the Bayes rule, we rewrite the lower bound in (66.81) more explicitly in the form:

\begin{aligned}
\ln f_{h}(h) &\geq E_q\big[\ln f_{z}(z) f_{h|z}(h|z)\big] - E_q\big[\ln q_{z|h}(z|h)\big]\\
&= E_q\big[\ln f_{h|z}(h|z)\big] - E_q\big[\ln q_{z|h}(z|h)\big] + E_q\big[\ln f_{z}(z)\big]
\end{aligned}    (66.83)

Now, assume we train a first generative model (such as an RBM) to learn the distributions f_{h|z}(h|z), f_{z|h}(z|h), and f_z(z) – see Fig. 66.7. This amounts to learning the parameters {W1, θ1, θr,1} in the first step of the greedy procedure described above by using the contrastive divergence procedure. In this case, the parameters (W1ᵀ, θ1) would define the pdf f_{z|h}(z|h), which maps the visible part to the latent part, while the parameters (W1, θr,1) would define the pdf f_{h|z}(h|z), which performs the reverse mapping. The estimated pdf f_z(z) would also depend on these parameters – as shown in Example 66.1 and Prob. 66.3. Remember that the contrastive divergence algorithm for learning these parameters was derived by maximizing the log-likelihood function resulting from observations hn ∼ f_h(h). Once the conditional pdfs for (h, z) are learned, we fix them at their estimated values, including setting q_{z|h}(z|h) = f_{z|h}(z|h). However, we decouple f_z(z) from the parameters (W1, θ1, θx) and keep its description open-ended. We then seek to determine a "better" model for it in order to increase the lower bound in (66.83). This can be achieved by maximizing the last term or, equivalently, by minimizing the following weighted log-likelihood function over some additional modeling parameters:

-\sum_z q_{z|h}(z|h)\ln f_{z}(z)    (66.84)

If q_{z|h}(z|h) were uniform, then this sum amounts to the negative of the log-likelihood function of f_z(z). More generally, the sum can be interpreted as the ensemble average that results from using samples z generated by the distribution q_{z|h}(z|h). These are the same samples generated by the first RBM since we set q_{z|h}(z|h) = f_{z|h}(z|h). We therefore face a second maximum-likelihood (ML) estimation problem using the output samples {zn} from the first RBM as input


Figure 66.7 A cascade of two RBMs. The leftmost RBM learns the conditional pdfs for (h, z) while the second, rightmost RBM learns the conditional pdfs for (z, r).

signals and feeding them into the second RBM. We then associate a new hidden variable r with z and rewrite the corresponding ELBO for f_z(z):

\begin{aligned}
\ln f_{z}(z) &\geq E_p\big[\ln f_{z,r}(z,r)\big] - E_p\big[\ln p_{r|z}(r|z)\big]\\
&= E_p\big[\ln f_{z|r}(z|r)\big] - E_p\big[\ln p_{r|z}(r|z)\big] + E_p\big[\ln f_{r}(r)\big]
\end{aligned}    (66.85)

where pr|z (r|z) is, as before, some approximation for the conditional pdf fr|z (r|z). We can now train a second generative model (such as the second RBM) to learn the distributions fz|r (z|r), fr|z (r|z), and fr (r). This amounts to learning new parameters {W2 , θ2 , θr,2 } in the second step of the greedy procedure described above. In this way, the parameters (W2T , θ2 ) would define the pdf fr|z (r|z) while the parameters (W2 , θr,2 ) would define the pdf fz|r (z|r). We fix these pdfs at their estimated values and set pr|z (r|z) = fr|z (r|z). Another way to justify this factorization is to assume the number of units in the layers is such that W1 and W2T have the same dimensions (this is not always guaranteed in practice and this case is only being discussed for illustration purposes). Thus, in this situation, when running the contrastive divergence algorithm to learn (W2 , θ2 , θx ), we could initialize W2T to W1 and the bias θr,2 to θ1 . By doing so, the generative behavior of the two combined RBMs mapping r


to z to x becomes equivalent at that point to the generative behavior of the first RBM mapping z to x. This is because the mapping from r to z will be defined by (W1T , θ1 ) and will therefore generate samples for z that are consistent with what would have been generated by the first RBM mapping x to z. We continue with the recursive process by introducing a model for r, involving a new hidden variable, and seeking to maximize the last term in (66.85) over new model parameters in order to further increase the lower bound. Repeating these arguments leads to the greedy procedure described above.

Network structure

We are now ready to describe the structure of a deep belief network. The purpose of the network is to serve as a generative model, i.e., to generate feature samples that arise from the "same" distribution as the input features h. We refer to Fig. 66.8.

Figure 66.8 A deep belief network consists of a single RBM layer on the far right and a sequence of sigmoidal neural units preceding it. The latent information is provided as input to the RBM unit on the right, and the signals propagate to the left to generate a feature sample hn on the far left.

The network consists of a single RBM layer on the far right and a sequence of sigmoidal neural units preceding it (in particular, observe that a deep belief network is not a feedforward neural network). The RBM has bidirectional connections between its nodes, with information flowing back and forth between them, whereas information flows only in one direction within the sigmoidal layers. The latent information is provided as input to the RBM unit on the right, and the signals propagate to the left to generate a feature sample hn on the far left. The parameters (W` , θ` , θr,` ) are the ones generated during the multilayer pre-training procedure described before. We feed some latent information into the deep belief network, represented in the figure by the variable z5,n , which is assumed to have binary inputs. This variable represents some class of input feature vectors. Once


z5,n is applied to the network, the RBM unit performs several iterations of Gibbs sampling, i.e., it alternates between z5,n → z4,n → z5,n → z4,n → ..., namely,

(performing several Gibbs sampling steps)
y_{4,n} \;\overset{\Delta}{=}\; \text{sigmoid}\big(W_4 z_{5,n} - \theta_{r,4}\big)    (66.86a)
z_{4,n} = \text{Bernoulli}(y_{4,n})    (66.86b)
x' \;\overset{\Delta}{=}\; \text{sigmoid}\big(W_4^{\sf T} z_{4,n} - \theta_4\big)    (66.86c)
z'_{5,n} = \text{Bernoulli}(x')    (66.86d)
\text{set } z_{5,n} = z'_{5,n} \text{ and repeat}    (66.86e)

At the end of these iterations, we propagate z4,n through the sigmoidal layers, for example,

(propagating through the sigmoidal layers)
y_{3,n} \;\overset{\Delta}{=}\; \text{sigmoid}\big(W_3 z_{4,n} - \theta_{r,3}\big)    (66.87a)
y_{2,n} = \text{sigmoid}\big(W_2 y_{3,n} - \theta_{r,2}\big)    (66.87b)
h_n = \text{sigmoid}\big(W_1 y_{2,n} - \theta_{r,1}\big)    (66.87c)

until the feature hn is generated on the far left. Example 66.3 (Generation of handwritten digits using a deep belief network) We illustrate the operation of the deep belief network by applying it to the problem of learning to generate “handwritten digits” that are similar to the ones arising from the same MNIST dataset considered earlier in Examples 52.3 and 66.2. Recall that the MNIST dataset consists of 60,000 labeled training samples. Each entry in the dataset is a 28 × 28 grayscale image, which we transform into an M = 784-long feature vector, hn . Each pixel in the image and, therefore, each entry in hn , assumes integer values in the range [0, 255]. We transform these entries into binary values by thresholding all pixel values above 127 to 1 and all values less than or equal to 127 to 0. For comparison purposes, Fig. 66.9 shows a random selection of eight images before and after transformation into the binary representation. We construct a deep belief network with seven total layers with sigmoidal units, including the last RBM layer, in a manner similar to Fig. 66.8. The signals at the successive layers are denoted by {h, z2 , z3 , . . . , z7 }. The number of hidden units in each layer from left to right is n1 = 784, n2 = 128, n3 = 128, n4 = 128, n5 = 128, n6 = 128, n7 = 10

(66.88)

We set the step size for the contrastive divergence procedure to µcd = 0.0001. Initially, we train each RBM separately, as explained before, using the binary-valued feature vectors from the MNIST dataset. Each RBM is trained for 1000 iterations using binary data. Once training is complete, we generate 16 random vectors {z7 } at the far right of the deep belief network, with binary entries. We propagate each vector z7 through the network as follows. First, we perform 100 Gibbs sampling steps at the far-right RBM layer to determine z6 . Then, we propagate z6 through the earlier sigmoidal sections, as described prior to the example. During these steps, we propagate the real-valued signals through the layers (i.e., the outputs of the sigmoidal units). In this way, we arrive at a feature vector h for each z7 . Figure 66.10 shows 16 random images generated in this manner. The purpose of this example is to illustrate the general principle behind the operation of the network without focusing on perfecting its parameters or performance.
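A rough sketch of the generation pass (66.86)–(66.87) used in this example is given below. It assumes the list of parameters (Wℓ, θℓ, θr,ℓ) produced by the greedy pre-training, reuses the sigmoid and bernoulli helpers from the earlier sketches, and propagates real-valued signals through the sigmoidal layers as done in the example; all names are illustrative:

def generate_feature(z_top, params, num_gibbs, rng):
    # params = [(W_1, th_1, thr_1), ..., (W_{L-1}, th_{L-1}, thr_{L-1})];
    # the last entry corresponds to the RBM layer on the far right
    W, th, thr = params[-1]
    z = np.asarray(z_top, dtype=float)            # plays the role of z_{5,n}
    y = bernoulli(sigmoid(W @ z - thr), rng)      # initial z_{4,n}
    for _ in range(num_gibbs):                    # Gibbs iterations, cf. (66.86a)-(66.86e)
        z = bernoulli(sigmoid(W.T @ y - th), rng)     # refreshed top state
        y = bernoulli(sigmoid(W @ z - thr), rng)      # refreshed z_{4,n}
    h = y
    for W, th, thr in reversed(params[:-1]):      # sigmoidal layers, cf. (66.87a)-(66.87c)
        h = sigmoid(W @ h - thr)
    return h                                      # generated feature vector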


Figure 66.9 The top row shows four randomly selected images after pixel transformation to binary values along with the original images in the second row with pixel values in the range [0, 255]. The same display is repeated in the last two rows for another collection of four random images.

66.6 COMMENTARIES AND DISCUSSION

Deep belief networks. Deep belief networks and stacked autoencoder structures are two classes of deep architectures where the vanishing gradient problem is ameliorated by applying a greedy training strategy. Although deep belief networks appear to have fallen out of favor, their emergence has sparked an immense interest in deep learning algorithms and has motivated many subsequent developments and advances in the field. The presentation in Section 66.5 on deep generative networks is based on Hinton, Osindero, and Teh (2006). It was this work that propelled the interest in deep learning techniques to new intensity. In this publication, the authors introduced an efficient method for training deep belief networks by pre-training one layer at a time to counter the vanishing gradient problem, followed by a fine-tuning procedure. Their approach relied on the use of RBMs. These RBMs depended on an energy-based learning concept apparently first introduced by Smolensky (1986), where an RBM was called a Harmonium – see the overviews by LeCun et al. (2006), Bengio (2009), and Fischer and Igel (2014). We explained following (66.71c) that RBMs are important instances of generative models: they are effective in learning a probability distribution and can be used to generate samples according to this distribution. Using a sufficient number of hidden units, it can be shown that an RBM structure is able to represent any discrete probability distribution and that, generally, adding more hidden units helps improve the log-likelihood function (66.33) – see, e.g., Freund and Haussler (1994) and Le Roux and Bengio (2008). We also explained after (66.53) a useful feature of RBM structures in


Figure 66.10 Random images generated by a deep belief network with 7 layers. The

last RBM layer has 10 output units and 128 hidden units.

the case of quadratic energy functions and binary states, namely, that RBMs lead to a product-of-experts (rather than sum-of-experts) model for the probability distribution. The RBM approach determines the parameters that characterize the elementary distributions by using the contrastive divergence algorithm, first developed in Hinton (1999, 2002). The presentation in Section 66.3 is based on Bengio et al. (2006), Bengio (2009), and Bengio, Courville, and Vincent (2013). As illustrated in Fig. 66.3, the argument shows that the contrastive divergence algorithm relies on Gibbs sampling to generate realizations from a desired conditional distribution through a two-phase procedure: a positive phase that feeds the visible state forward through the network, and a negative phase that cycles the hidden state through the network. Several other studies revealed useful properties of deep networks and their potential. For example, Salakhutdinov and Hinton (2009) considered Boltzmann machines with many layers of hidden variables, while Glorot, Bordes, and Bengio (2011) showed how the training of deep neural networks is considerably faster when the hidden layers employ the linear rectifier, f(x) = max{0, x}, as an activation function. The work by Krizhevsky, Sutskever, and Hinton (2012) was also successful in halving the error rate for object recognition problems and generated great interest around deep learning architectures; it also motivated the quick adoption of convolutional neural networks (studied in the next chapter) in computer vision applications. Motivated by these advances, deep networks have since been applied to a variety of problems in computer vision, image analysis, and speech and language processing. The paper by Hinton et al. (2012a) provides an overview of advances in the use of deep learning techniques to the problem of phonetic classification for automatic speech recognition.

Gibbs sampling. The derivation of the contrastive divergence algorithm in Section 66.3 relies on the use of Gibbs sampling, which is a powerful technique to generate samples from some underlying probability distribution without needing to compute the distri-


bution explicitly. The technique is named after the American physicist Josiah Gibbs (1839–1903) to honor his contributions to the field of statistical mechanics. Gibbs was also a discoverer of the Gibbs phenomenon in Fourier series representations and the first to provide an analytical explanation for it. Although named after Gibbs, the technique of Gibbs sampling is actually due to Metropolis et al. (1953) and Hastings (1970); the procedure from these works is nowadays known as the Metropolis–Hastings algorithm and it includes Gibbs sampling as a special case – recall the discussion in Section 33.3.2.

We motivate Gibbs samplers by considering the case of two jointly distributed random variables, say, h and z, where both variables are initially assumed to be scalar-valued. We denote their joint probability distribution by f_{h,z}(h, z). If we are interested in evaluating E h then, in principle, we would need to evaluate first the marginal distribution of h, denoted by f_h(h). This step need not be trivial in general. For example, in the case of continuous random variables, this step would involve computing an integral of the form:

f_{h}(h) = \int_{z\in\mathcal{Z}} f_{h,z}(h,z)\,dz    (66.89)

where the integration is over all possible realizations for z. Likewise, for discrete random variables, the same step would involve computing sums of the form:

P(h = h) = \sum_{z} P(h = h, z = z)    (66.90)

where the sum is over all possible states for z. Once the distribution for h is determined, we can find its mean by computing either of the following expressions depending on the nature of h (continuous or discrete):

E\,h = \int_{h\in\mathcal{H}} h\, f_{h}(h)\,dh    (66.91a)
E\,h = \sum_{h} h\, P(h = h)    (66.91b)

These calculations assume that the required integral and sum expressions are easy to evaluate. If not, then we need an alternative method to estimate the mean E h without assuming availability of the probability distribution for h. Gibbs sampling provides one elegant solution under the assumption that the conditional probabilities are easy to compute, i.e.,

f_{h|z}(h|z) \;\text{ and }\; f_{z|h}(z|h) \qquad \text{(continuous distributions)}    (66.92a)
P(h = h|z = z) \;\text{ and }\; P(z = z|h = h) \qquad \text{(discrete distributions)}    (66.92b)

Starting from an initial condition z^{(0)}, Gibbs sampling generates realizations from the unavailable f_h(h) by sequentially sampling from the conditional distributions in the following manner (written for continuous distributions for convenience):

h^{(0)} \sim f_{h|z}\big(h \,|\, z = z^{(0)}\big)    (66.93a)
z^{(1)} \sim f_{z|h}\big(z \,|\, h = h^{(0)}\big)    (66.93b)
h^{(1)} \sim f_{h|z}\big(h \,|\, z = z^{(1)}\big)    (66.93c)
z^{(2)} \sim f_{z|h}\big(z \,|\, h = h^{(1)}\big)    (66.93d)
h^{(2)} \sim f_{h|z}\big(h \,|\, z = z^{(2)}\big)    (66.93e)
z^{(3)} \sim f_{z|h}\big(z \,|\, h = h^{(2)}\big)    (66.93f)
\vdots


where the notation x ∼ f_x(x) means that a sample x is generated from the distribution f_x(x). In other words, Gibbs sampling performs the following steps in succession:

h^{(i)} \sim f_{h|z}\big(h \,|\, z = z^{(i)}\big)    (66.94a)
z^{(i+1)} \sim f_{z|h}\big(z \,|\, h = h^{(i)}\big)    (66.94b)

It can be shown that, under some reasonable technical conditions, the realizations {h^{(i)}} generated under this construction follow the marginal distribution f_h(h) as i → ∞ – see Prob. 66.6 for one example. This means that the histogram representation for the realizations tends toward f_h(h). It is customary to ignore the realizations generated during the initial convergence period (called burn-in period) to avoid initial-condition effects. We can subsequently estimate the desired mean by averaging N samples for i large enough:

\widehat{E\,h} = \frac{1}{N}\sum_{i} h^{(i)}    (66.95)

The ergodic theorem then ensures that \widehat{E\,h} → E h as N → ∞ – recall (7.80). Furthermore, assume we repeat Gibbs sampling L times, with each time ℓ starting from some randomly chosen initial condition z^{(0),ℓ}. We run each experiment long enough (say, for N iterations) and keep the last sample, h^{(N),ℓ}. Then, these samples can be viewed as approximating independent and identically distributed realizations from f_h(h) and, if desired, we can again estimate E h by using their sample mean:

\widehat{E\,h} = \frac{1}{L}\sum_{\ell=1}^{L} h^{(N),\ell}    (66.96)
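To make the two-variable procedure (66.94a)–(66.94b) and the estimate (66.95) concrete, here is a toy Gibbs sampler for a zero-mean bivariate Gaussian with correlation ρ; this particular joint distribution is only an example, chosen because both conditionals are easy to sample:

import numpy as np

def gibbs_bivariate_gaussian(rho, num_samples, burn_in, rng):
    # conditionals of a standard bivariate Gaussian with correlation rho:
    #   h | z ~ N(rho*z, 1 - rho^2),   z | h ~ N(rho*h, 1 - rho^2)
    std = np.sqrt(1.0 - rho**2)
    z, samples = 0.0, []
    for i in range(burn_in + num_samples):
        h = rho * z + std * rng.standard_normal()   # cf. (66.94a)
        z = rho * h + std * rng.standard_normal()   # cf. (66.94b)
        if i >= burn_in:                            # discard the burn-in period
            samples.append(h)
    return np.mean(samples)                         # sample mean, cf. (66.95)

rng = np.random.default_rng(0)
print(gibbs_bivariate_gaussian(rho=0.8, num_samples=5000, burn_in=500, rng=rng))  # close to E h = 0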

Gibbs sampling can be extended to multiple variables. For convenience, we illustrate the notation for the case of four random variables, denoted by (h1, h2, z1, z2); the same construction can be extended to a larger number of variables with minimal changes. The steps involved in the sampling procedure are now given by

h_1^{(0)} \sim f_{h|z}\big(h_1 \,|\, h_2 = h_2^{(0)},\, z_1 = z_1^{(0)},\, z_2 = z_2^{(0)}\big)    (66.97a)
h_2^{(1)} \sim f_{h|z}\big(h_2 \,|\, h_1 = h_1^{(0)},\, z_1 = z_1^{(0)},\, z_2 = z_2^{(0)}\big)    (66.97b)
z_1^{(1)} \sim f_{z|h}\big(z_1 \,|\, h_1 = h_1^{(0)},\, h_2 = h_2^{(1)},\, z_2 = z_2^{(0)}\big)    (66.97c)
z_2^{(1)} \sim f_{z|h}\big(z_2 \,|\, h_1 = h_1^{(0)},\, h_2 = h_2^{(1)},\, z_1 = z_1^{(1)}\big)    (66.97d)
h_1^{(1)} \sim f_{h|z}\big(h_1 \,|\, h_2 = h_2^{(1)},\, z_1 = z_1^{(1)},\, z_2 = z_2^{(1)}\big)    (66.97e)
h_2^{(2)} \sim f_{h|z}\big(h_2 \,|\, h_1 = h_1^{(1)},\, z_1 = z_1^{(1)},\, z_2 = z_2^{(1)}\big)    (66.97f)
z_1^{(2)} \sim f_{z|h}\big(z_1 \,|\, h_1 = h_1^{(1)},\, h_2 = h_2^{(2)},\, z_2 = z_2^{(1)}\big)    (66.97g)
z_2^{(2)} \sim f_{z|h}\big(z_2 \,|\, h_1 = h_1^{(1)},\, h_2 = h_2^{(2)},\, z_1 = z_1^{(2)}\big)    (66.97h)
\vdots

That is,

h_1^{(i)} \sim f_{h|z}\big(h_1 \,|\, h_2 = h_2^{(i)},\, z_1 = z_1^{(i)},\, z_2 = z_2^{(i)}\big)    (66.98a)
h_2^{(i+1)} \sim f_{h|z}\big(h_2 \,|\, h_1 = h_1^{(i)},\, z_1 = z_1^{(i)},\, z_2 = z_2^{(i)}\big)    (66.98b)
z_1^{(i+1)} \sim f_{z|h}\big(z_1 \,|\, h_1 = h_1^{(i)},\, h_2 = h_2^{(i+1)},\, z_2 = z_2^{(i)}\big)    (66.98c)
z_2^{(i+1)} \sim f_{z|h}\big(z_2 \,|\, h_1 = h_1^{(i)},\, h_2 = h_2^{(i+1)},\, z_1 = z_1^{(i+1)}\big)    (66.98d)


Observe that the conditioning is performed on the most recent sample values. We encountered similar steps in (66.64), (66.67), and (66.69) while deriving the contrastive divergence algorithm in order to generate samples {zn, h′n, z′n}:

z_n \overset{(66.64)}{\sim} P\big(z_n \,|\, h = h_n\big) \qquad \text{(using (66.57a)–(66.57b))}    (66.99a)
h'_n \overset{(66.67)}{\sim} P\big(h'_n \,|\, z = z_n\big) \qquad \text{(using (66.60a)–(66.60b))}    (66.99b)
z'_n \overset{(66.69)}{\sim} P\big(z'_n \,|\, h = h'_n\big)    (66.99c)

Constructions of the type (66.98a)–(66.98d) are also referred to as realizations generated by a Markov chain Monte Carlo (MCMC) simulation. The qualification “Markov chain” refers to the fact that each new realization is independent of past state values and is only dependent on the most recent state value. The qualification “Monte Carlo” refers to the fact that this construction is obtained by means of simulations. For more details on Gibbs sampling and applications, readers may consult, for example, the works by Geman and Geman (1984), Gelfand and Smith (1990), Casella and George (1992), Gilks, Richardson, and Spiegelhalter (1996), Andrieu et al. (2003), Robert and Casella (2004), Lynch (2007), and Liu (2008).
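To make the two-step recursion (66.94a)–(66.94b) and the estimator (66.95) concrete, the following minimal Python sketch (our own illustration, not an algorithm from the text) runs a Gibbs sampler for a zero-mean bivariate Gaussian pair (h, z) with unit variances and correlation ρ, for which both conditional distributions are Gaussian and easy to sample from; the correlation value, burn-in length, and sample count are arbitrary choices.

    import numpy as np

    # Minimal Gibbs sampler for a zero-mean bivariate Gaussian (h, z) with unit
    # variances and correlation rho (illustrative value). Both conditionals are
    # Gaussian: h | z ~ N(rho*z, 1 - rho^2) and z | h ~ N(rho*h, 1 - rho^2).
    rho = 0.8
    N, burn_in = 20000, 1000        # illustrative sample count and burn-in period
    rng = np.random.default_rng(0)

    z = 0.0                         # initial condition z^(0)
    samples = []
    for i in range(N + burn_in):
        h = rng.normal(rho * z, np.sqrt(1 - rho**2))   # step (66.94a)
        z = rng.normal(rho * h, np.sqrt(1 - rho**2))   # step (66.94b)
        if i >= burn_in:            # discard realizations from the burn-in period
            samples.append(h)

    print("estimated E h =", np.mean(samples))         # sample mean, as in (66.95)

Here the true mean E h is zero, so the printed estimate should be close to zero for large N.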

PROBLEMS

66.1 How would the arguments after (66.54) change if the hidden variable assumes instead the binary values {+1, −1}?

66.2 Let x = col{h, z} with binary entries {0, 1}, and define θ = col{θ_b, θ_h}. Extending definition (66.44), we consider a quadratic energy function of the form

E(h, z) ≜ θ^T x − (1/2) x^T [ A  W^T ; W  B ] x

where A and B are symmetric matrices with zero entries on their diagonals. Let x(k) denote the kth entry of x and let x_{−k} denote the column vector that excludes x(k). Determine an expression for the conditional probability P(x(k) = 1 | x_{−k} = x_{−k}).

66.3 Use an argument similar to Example 66.1 to determine the marginal pdf of the hidden component, P(y_b = y_b).

66.4 Verify that equality holds in (66.81) when q_{z|h}(z|h) is chosen to match the conditional pdf f_{z|h}(z|h).

66.5 We studied binary RBMs in the body of the chapter and assumed the bilinear form (66.44) for the energy function. Under the assumption that the visible and hidden components {h, z} are binary-valued, we derived the logistic distributions (66.57a)–(66.57b) and (66.60a)–(66.60b). Assume instead that the energy function is chosen as follows:

E(h_n, z_ℓ) = θ^T z_ℓ + (1/2) ‖h_n − θ_r‖² − z_ℓ^T W^T h_n

The resulting structure is called a Gaussian RBM. Show that the conditional distributions for z given h continue to be (66.57a)–(66.57b) while the conditional pdf for the entries of h given z becomes the Gaussian distribution:

f_{h|z}(h | z = z_ℓ) = N_h(W z_ℓ + θ_r, I_M)

66.6 Consider two Bernoulli random variables {x, y} with joint pmf values:

[ P(x = 0, y = 0)  P(x = 0, y = 1) ; P(x = 1, y = 0)  P(x = 1, y = 1) ] ≜ [ p_00  p_01 ; p_10  p_11 ]

(a) Verify that the marginals of x and y are given by:

P(x = 0) = p_00 + p_01,   P(x = 1) = p_10 + p_11
P(y = 0) = p_00 + p_10,   P(y = 1) = p_01 + p_11

(b) Verify that the transition probability matrices, denoted by A_{x|y} and A_{y|x}, and which represent the conditional pmfs P(x|y) and P(y|x), are given by:

A_{x|y} ≜ [ P(x = 0|y = 0)  P(x = 0|y = 1) ; P(x = 1|y = 0)  P(x = 1|y = 1) ]
        = [ p_00/(p_00 + p_10)  p_01/(p_01 + p_11) ; p_10/(p_00 + p_10)  p_11/(p_01 + p_11) ]

A_{y|x} ≜ [ P(y = 0|x = 0)  P(y = 0|x = 1) ; P(y = 1|x = 0)  P(y = 1|x = 1) ]
        = [ p_00/(p_00 + p_01)  p_10/(p_10 + p_11) ; p_01/(p_00 + p_01)  p_11/(p_10 + p_11) ]

(c) Assume we employ a Gibbs sampler to generate a sequence of samples based on these conditional distributions, starting from some initial condition x_0, say,

x_0 −(P(y|x))→ y_0 −(P(x|y))→ x_1 −(P(y|x))→ y_1 −(P(x|y))→ x_2 . . .

We denote the marginal pmf of the kth realization x_k by the column vector f_k = col{P(x_k = 0), P(x_k = 1)}. Verify that the conditional transition matrix from x_{k−1} to x_k is given by

A_{x|x} ≜ [ P(x_k = 0 | x_{k−1} = 0)  P(x_k = 0 | x_{k−1} = 1) ; P(x_k = 1 | x_{k−1} = 0)  P(x_k = 1 | x_{k−1} = 1) ] = A_{x|y} × A_{y|x}

Conclude that f_k = A_{x|x} f_{k−1}. Argue that since A_{x|x} is left stochastic, the marginal f_k converges to the Perron vector f that satisfies f = Af. Remark. For further details on this example, the reader can refer to Casella and George (1992).

66.7 Consider a Poisson distribution with rate λ ≥ 0; this scalar refers to the average number of events expected to occur per unit time (or space):

f_y(y) = P(y events occur) = λ^y e^{−λ} / y!,   y = 0, 1, 2, . . .

We collect a total of N measurements {y_n} arising from two Poisson distributions with rates {λ_1, λ_2}. The rate switches from λ_1 to λ_2 at some unknown instant n_o, which is selected uniformly at random from within the interval 0 ≤ n_o ≤ N − 1. Realizations {y_n} over 0 ≤ n ≤ n_o arise from y_n ∼ Poisson(λ_1), while realizations {y_n} over n_o < n ≤ N − 1 arise from y_n ∼ Poisson(λ_2). This situation is sometimes referred to as a change-point model. We model each of λ_1, λ_2 as Gamma-distributed with parameters (α, β), as defined earlier in Prob. 5.2:

λ_1, λ_2 ∼ Gamma(α, β) = (β^α / Γ(α)) λ^{α−1} e^{−βλ},   λ > 0

We wish to employ the Gibbs sampler to estimate the parameters {λ_1, λ_2, n_o}.
(a) Verify that the pdfs of λ_1 and λ_2 conditioned on the observations and the other parameters are given by (we ignore the subscripts in the notation of the pdfs for simplicity and write f(x) instead of f_x(x)):

f(λ_1 | λ_2, n_o, y_{0:N−1}) = Gamma(α + Σ_{n=0}^{n_o} y_n, β + n_o)
f(λ_2 | λ_1, n_o, y_{0:N−1}) = Gamma(α + Σ_{n=n_o+1}^{N−1} y_n, N − n_o − 1 + β)


(b) Verify that the pdf of n_o conditioned on the observations and the λ-parameters is given by

f(n_o | λ_1, λ_2, y_{0:N−1}) ∝ exp{ Σ_{n=0}^{n_o} y_n ln λ_1 − n_o λ_1 } × exp{ Σ_{n=n_o+1}^{N−1} y_n ln λ_2 − (N − n_o − 1) λ_2 }

Referring to (5.37), relate this expression to multinomial distributions.
(c) Write down the Gibbs sampler to estimate {λ_1, λ_2, n_o}.
Remark. The reader may refer to Gilks, Richardson, and Spiegelhalter (1996), Lynch (2007), and Liu (2008) for more details on this and related application topics. The Gibbs sampling solution for this problem builds on Bayesian approaches to change detection, such as the works by Barry and Hartigan (1993), Lavielle and Lebarbier (2001), Adams and MacKay (2007), and Qian, Wu, and Xu (2019) – see also the overview by Li et al. (2009).

REFERENCES

Adams, R. P. and D. J. C. MacKay (2007), "Bayesian online change-point detection," available at arXiv:0710.3742.
Andrieu, C., N. de Freitas, A. Doucet, and M. Jordan (2003), "An introduction to MCMC for machine learning," Mach. Learn., vol. 50, pp. 5–43.
Barry, D. and J. A. Hartigan (1993), "A Bayesian analysis for change point problems," J. Amer. Statist. Assoc., vol. 88, pp. 309–319.
Bengio, Y. (2009), "Learning deep architectures for AI," Found. Trends Mach. Learn., vol. 2, no. 1, pp. 1–127.
Bengio, Y., A. Courville, and P. Vincent (2013), "Representation learning: A review and new perspectives," IEEE Trans. Patt. Anal. Mach. Intell., vol. 35, no. 8, pp. 1798–1828.
Bengio, Y., P. Lamblin, D. Popovici, and H. Larochelle (2006), "Greedy layer-wise training of deep networks," Proc. Advances Neural Information Processing Systems (NIPS), pp. 153–160, Vancouver.
Casella, G. and E. I. George (1992), "Explaining the Gibbs sampler," Amer. Statist., vol. 46, no. 3, pp. 167–174.
Fischer, A. and C. Igel (2014), "Training restricted Boltzmann machines: An introduction," Pat. Recogn., vol. 47, no. 1, pp. 25–39.
Freund, Y. and D. Haussler (1994), "Unsupervised learning of distributions on binary vectors using two layer networks," Technical Report CRL-94-25, University of California, Santa Cruz.
Gelfand, A. E. and A. F. M. Smith (1990), "Sampling-based approaches to calculating marginal densities," J. Amer. Statist. Assoc., vol. 85, pp. 398–409.
Geman, S. and D. Geman (1984), "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images," IEEE Trans. Patt. Anal. Mach. Intel., vol. 6, pp. 721–741.
Gilks, W. R., S. Richardson, and D. J. Spiegelhalter (1996), Markov Chain Monte Carlo in Practice, Chapman & Hall.
Glorot, X., A. Bordes, and Y. Bengio (2011), "Deep sparse rectifier neural networks," J. Mach. Learn. Res., vol. 15, pp. 315–323.
Hastings, W. K. (1970), "Monte Carlo sampling methods using Markov chains and their applications," Biometrika, vol. 57, pp. 97–109.
Hinton, G. (1999), "Products of experts," Proc. Int. Conf. Artificial Neural Networks (ICANN), vol. 1, pp. 1–6, Edinburgh.


Hinton, G. (2002), "Training products of experts by minimizing contrastive divergence," Neural Comput., vol. 14, pp. 1771–1800.
Hinton, G., L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury (2012a), "Deep neural networks for acoustic modeling in speech recognition," IEEE Signal Process. Mag., vol. 29, pp. 82–97.
Hinton, G., S. Osindero, and Y.-W. Teh (2006), "A fast learning algorithm for deep belief nets," Neural Comput., vol. 18, no. 7, pp. 1527–1554.
Krizhevsky, A., I. Sutskever, and G. Hinton (2012), "ImageNet classification with deep convolutional neural networks," Proc. Advances Neural Information Processing Systems (NIPS), vol. 25, pp. 1097–1105, Lake Tahoe, NV.
Lavielle, M. and E. Lebarbier (2001), "An application of MCMC methods for the multiple change-points problem," Signal Process., vol. 81, pp. 39–53.
LeCun, Y., S. Chopra, R. M. Hadsell, M.-A. Ranzato, and F.-J. Huang (2006), "A tutorial on energy-based learning," in Predicting Structured Data, G. Bakir, T. Hofmann, B. Schölkopf, A. J. Smola, B. Taskar, and S. V. N. Vishwanathan, editors, pp. 191–246, MIT Press.
Le Roux, N. and Y. Bengio (2008), "Representational power of restricted Boltzmann machines and deep belief networks," Neural Comput., vol. 20, no. 6, pp. 1631–1649.
Li, Y., G. Lin, T. Lau, and R. Zeng (2009), "A review of changepoint detection models," available at https://arxiv.org/pdf/1908.07136.pdf.
Liu, J. S. (2008), Monte Carlo Strategies in Scientific Computing, Springer.
Lynch, S. M. (2007), Introduction to Applied Bayesian Statistics and Estimation for Social Scientists, Springer.
Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller (1953), "Equations of state calculations by fast computing machines," J. Chem. Phys., vol. 21, pp. 1087–1091.
Qian, G., Y. Wu, and M. Xu (2019), "Multiple change-points detection by empirical Bayesian information criteria and Gibbs sampling induced stochastic search," App. Math. Model., vol. 72, pp. 202–216.
Robert, C. P. and G. Casella (2004), Monte Carlo Statistical Methods, Springer.
Salakhutdinov, R. and G. Hinton (2009), "Deep Boltzmann machines," Proc. Int. Conf. Artificial Intelligence and Statistics (PMLR), vol. 5, pp. 448–455.
Smolensky, P. (1986), "Information processing in dynamical systems: Foundations of harmony theory," in Parallel Distributed Processing, D. E. Rumelhart and J. L. McClelland, editors, pp. 194–281, MIT Press.

67 Convolutional Networks

Convolutional neural networks (CNNs) are prevalent in computer vision, image, speech, and language processing applications, where they have been successfully applied to perform classification tasks at high accuracy rates. One of their main attractions is the ability to operate directly on raw input signals, such as images, and to extract salient features automatically from the raw data. The designer does not need to worry about which features to select to drive the classification process. The extraction of features is achieved by a succession of correlation layers placed prior to the input to the network. Each layer performs convolution and pooling operations. The cascade of layers is followed by a fully connected feedforward neural network, which is trained by means of the backpropagation procedure, as illustrated in Fig. 67.1.

Figure 67.1 Convolutional neural networks automate the feature extraction process. The raw input signal is fed into a cascade of correlation layers (with weight and offset coefficients trained), whose purpose is to filter the input data (through a combination of linear and nonlinear operations) and generate an extracted feature vector that feeds into a fully connected feedforward neural network (also with weight and offset coefficients trained).

The correlation layers in a CNN will be parameterized by weight and bias coefficients of their own, just like the layers of the feedforward network. Therefore, the training of a CNN involves learning all of these coefficients (for both the feedforward network and the correlation layers). Once this is done, the feature extraction process becomes automated and the CNN learns how to extract the "best" features. Convolutional networks will generally involve many more connections than tuning parameters. For this reason, it is easier to train deep convolutional networks than deep feedforward networks. Convolutional networks are particularly well suited for image processing applications because they exploit more effectively correlations that exist among neighboring pixels. They are also less sensitive to translation or scaling transformations in the image. For example, if the position of a handwritten digit is displaced within the frame of an image, or its size is scaled up or down, the convolutional network will still be able to provide accurate classification in most cases. The translation invariance property is one of the main attractive features of CNNs, in addition to their ability to extract features directly from raw data.

67.1 CORRELATION LAYERS

We start by explaining how convolution (or correlation) is involved in the operation of CNNs. We illustrate the operation by considering an image processing example.

67.1.1 Correlation Masks

One common step in image analysis is that of correlating patches of an image with a mask, also called a filter or a kernel. Consider a patch of size K × K, which corresponds to a region extracted from a 2D grayscale image. We represent the patch by a matrix H, with the row index running over k = 1, 2, . . . , K and the column index running over ℓ = 1, 2, . . . , K:

H = [ × × × ; × × × ; × × × ],   k, ℓ = 1, 2, . . . , K   (67.1)

Each entry of H corresponds to a pixel from the image. The pixels may assume integer values in the range [0, 255] representing different shades of gray, varying from black (at 0) to white (at 255); they may also assume normalized values obtained through centering and scaling. We choose the letter H to represent the patch because the entries of the patch will end up playing the role of feature entries (and, by analogy, we have been using the lower-case letter h to denote features in learning algorithms). We also consider a 2D mask or kernel, represented by a matrix W of the same size K × K. Each entry in this convolution (or correlation) mask is real-valued. We choose the letter W because the entries of this mask will end up playing the role of combination weights (which we denoted earlier by the lower-case letter w in the context of learning algorithms). We similarly index the entries of W by


(k, ℓ), with the row index running over k = 1, 2, . . . , K and the column index running over ℓ = 1, 2, . . . , K:

W = [ × × × ; × × × ; × × × ],   k, ℓ = 1, 2, . . . , K   (67.2)

This situation is illustrated in Fig. 67.2, where a patch H of size 3 × 3 is highlighted in the top left corner of the image within a marked rectangular area, while the mask W is shown on the right side of the figure.


Figure 67.2 (Left) An original image of size Lo × Lo with rows and columns indexed by the letters r and c, respectively. These indices assume values in the range r, c = 1, 2, . . . , Lo . (Right) A mask of size 3 × 3 (K = 3) with rows and columns indexed by the letters k and `. These indices assume values in the range k, ` = 1, 2, 3. A patch H of size 3 × 3 is highlighted in the top left corner of the image. The source of the image is the site www.pexels.com, where all photos are free to use.

We write H(k, ℓ) and W(k, ℓ) to refer to the (k, ℓ)th entries of H and W, respectively. The 2D correlation of H with W is defined as the transformation that generates the scalar value:

corr(H, W) = Σ_{k=1}^{K} Σ_{ℓ=1}^{K} H(k, ℓ) W(k, ℓ)   (67.3a)
           = (vec(H))^T vec(W)   (67.3b)
           ≜ h^T w   (67.3c)

In other words, we compute a weighted sum of the entries of H with the weights given by the entries of W. The result of the correlation is written in the second and third lines in the form of inner products between the vector representations for H and W, where

h ≜ vec(H) ∈ IR^{K²}   (patch)   (67.4a)
w ≜ vec(W) ∈ IR^{K²}   (mask)   (67.4b)

Recall that the vec operation stacks the columns of a matrix on top of each other. This operation is illustrated in Fig. 67.3, where a matrix X of size A × A is transformed into a vector x of size A² × 1. Although we are considering square patches and masks, the same discussion is applicable to rectangular patches and masks, in which case the vec operation will be applied to nonsquare matrices.

Figure 67.3 The vec operation stacks the columns of an A × A matrix X on top of each other and generates a vector x of size A² × 1.
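As a quick numerical check of (67.3a)–(67.3c), the short NumPy sketch below (an illustration of ours, with arbitrary 3 × 3 values) evaluates the correlation of a patch with a mask both as a double sum over the entries and as an inner product of their vec representations.

    import numpy as np

    # Illustrative 3x3 patch H and mask W (arbitrary values, K = 3)
    H = np.array([[1.0, 2.0, 0.0],
                  [0.0, 1.0, 3.0],
                  [2.0, 1.0, 1.0]])
    W = np.array([[0.5, -1.0, 0.0],
                  [1.0,  0.0, 0.5],
                  [0.0,  1.0, -0.5]])

    # (67.3a): weighted sum of the entries of H with weights from W
    corr_sum = np.sum(H * W)

    # (67.3b)-(67.3c): inner product of the vec representations; vec stacks the
    # columns of a matrix, i.e., column-major (Fortran-order) flattening
    h = H.flatten(order="F")
    w = W.flatten(order="F")
    corr_vec = h @ w

    print(corr_sum, corr_vec)       # both evaluate to the same scalar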


67.1.2 Image Filtering

Now, consider a 2D image, denoted by the symbol I, consisting of a grid of Lo × Lo pixels on the left side of Fig. 67.2. The operation of filtering the image by a mask W involves sliding the mask across the image and generating successive correlation values. Starting from the top left corner, the mask is slid one column at a time to the right until the right side of the image is reached. The mask is then moved one row down and slid again from left to right; then moved again one row down and slid again from left to right, and so on until the entire image is covered. This process is illustrated in Fig. 67.4.

Figure 67.4 A mask of size 3 × 3 is slid across the image, one column to the right at each step. The first row of images shows the mask slid across the top part of the image until the far right end is reached. The mask is then moved one row down and slid again across the image until the far right end, as shown in the second row of images. This process continues until the entire image is covered, as shown in the bottom row of images.

Oftentimes, zero boundary values are added around the image (in the grayscale convention, this would correspond to adding black pixels). This situation is illustrated in Fig. 67.5, where two patches of size 3 × 3 each are highlighted. In the figure, K − 1 zero columns and K − 1 zero rows are added around the original image; its size becomes L × L where

L = Lo + 2(K − 1)   (67.5)
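The size relation (67.5) can be verified with a couple of lines of NumPy; the values Lo = 6 and K = 3 below are illustrative choices, with K − 1 zero rows and columns appended on every side of the image.

    import numpy as np

    Lo, K = 6, 3                    # illustrative sizes (not from the text)
    I0 = np.ones((Lo, Lo))          # stand-in for the original Lo x Lo image

    # add K - 1 rows/columns of zeros (black pixels) around the image
    I = np.pad(I0, pad_width=K - 1, mode="constant", constant_values=0)

    print(I.shape)                  # (10, 10), i.e., L = Lo + 2*(K - 1)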

Irrespective of whether boundary conditions are added to the image or not, we will use the symbol I to denote the image that needs to be filtered by the mask


(this could be the original image or the extended image) and we will denote its size by L × L (the value of L is either the original size Lo or the modified size after the addition of boundary columns and rows):

L ≜ notation used to denote the size of the input image   (67.6)

In our description of the correlation operation, we assumed the mask is shifted by one column or by one row at a time. In this case, we say that the stride value is equal to 1 in both the horizontal and vertical directions. We can of course consider higher values for the stride variable, and these values can differ along the horizontal and vertical dimensions.

Figure 67.5 Sometimes it is convenient to add boundary layers of zero values around an image. In that case, the filtering by the mask W would treat the extended image as the input signal to be filtered.

Now, given an input image I of size L × L, for every location of the mask over the image, we evaluate the correlation between the mask and the covered patch of the image. The collection of all correlations will constitute the entries of a filtered image of size L′ × L′, which we denote by F:

input image I  −−(W)−→  output image F   (67.7)

This filtering step operates as follows. We start by placing the mask on the top left corner of the input image I – see the plot on the left side in Fig. 67.6. The value of the (1, 1) entry in the filtered image is given by the correlation:


F(1, 1) = Σ_{k=1}^{K} Σ_{ℓ=1}^{K} I(k, ℓ) W(k, ℓ) = corr(H_{1,1}, W) = h_{1,1}^T w   (67.8)

where we are now attaching subscripts to the patch matrix and denoting the K × K patch involved in the evaluation of F(1, 1) by H_{1,1}; its vector representation is given by

h_{1,1} ≜ vec(H_{1,1}) ∈ IR^{K²},   H_{1,1} = [ I(k, ℓ) ]_{k,ℓ=1}^{K}   (K × K)   (67.9)

Figure 67.6 (Left) The mask is moved across the image I and correlations are computed. (Right) The values obtained from the correlation calculations lead to a new transformed image, F.

Next, we shift the mask by one column to the right. Its leftmost top corner will now be located on top of entry (1, 2) in the input image. The result of the correlation with the underlying patch is assigned to entry (1, 2) in the filtered image:

F(1, 2) = Σ_{k=1}^{K} Σ_{ℓ=1}^{K} I(k, ℓ + 1) W(k, ℓ) = corr(H_{1,2}, W) = h_{1,2}^T w   (67.10)

where we are, similarly, denoting the vector representation for the patch of the image involved in the evaluation of F(1, 2) by

h_{1,2} ≜ vec(H_{1,2}) ∈ IR^{K²},   H_{1,2} = [ I(k, ℓ + 1) ]_{k,ℓ=1}^{K}   (K × K)   (67.11)

Observe that the successive patches {H_{1,1}, H_{1,2}} or vectors {h_{1,1}, h_{1,2}} share several entries in common. The filtering process continues in this manner by sliding the mask over the input image and computing correlations with the patches. More generally, the (r, c)th entry of F is generated from the correlation:

F(r, c) = h_{r,c}^T w   (67.12)

where now

h_{r,c} ≜ vec(H_{r,c}) ∈ IR^{K²}   (67.13a)
H_{r,c} = [ I(k + r − 1, ℓ + c − 1) ]_{k,ℓ=1}^{K}   (K × K)   (67.13b)

Note that for stride steps of unit value along the horizontal and vertical directions, the dimensions of the input and output images are related as follows (see Prob. 67.1):

L′ = L − K + 1   (67.14)

For example, if the input image is 64 × 64 and K = 3, then the filtered image will be 62 × 62:

L′ ≜ notation used to denote the size of the filtered image   (67.15)

The value of L′ will be adjusted later according to the explanation in Section 67.2.3. In the next example on edge detection we provide instances for typical masks W used in traditional image processing applications, such as the Sobel and Prewitt masks, and use them to reveal how masks can be used to discover useful features in images. Another example is the Gaussian mask, whose entries are defined by:

W(k, ℓ) = (1/√(2πσ²)) e^{−(k² + ℓ²)/(2σ²)},   k, ℓ = −K_g, . . . , −1, 0, 1, . . . , K_g   (67.16)

for a mask of size K = 2K_g + 1 and where the parameter σ² > 0 controls the width of the Gaussian shape. In the context of convolutional networks, however, the masks will not be fixed as is the case here with the Sobel, Prewitt, or Gaussian masks. Instead, the entries of the masks will be learned and adjusted by a training algorithm to be derived later in our presentation.

Example 67.1 (Edge detection) We consider one fundamental problem in image analysis, which relates to edge detection. Edges characterize object boundaries and are useful in many applications requiring segmentation, registration, and identification of objects. One typical application where edge detection techniques are involved is automatic character recognition. Edges in an image can be determined by identifying pixel locations where abrupt changes in the gray levels occur. If we assume, for the sake of argument, that the intensity of the pixels varies continuously in the (x, y) coordinates, where x, y are now


real-valued, then the derivative of the image I(x, y) at an edge location (x, y) is expected to assume some local maximum value. This property motivates one technique for detecting edges by examining the gradient of I(x, y) by using gradient operators along the vertical and horizontal directions. These operators, which are masks, compute finite-difference approximations for the gradient vectors ∂I(x, y)/∂x and ∂I(x, y)/∂y. Two common gradient operators are the Sobel and Prewitt operators, defined by

W_x ≜ [ −1 0 1 ; −2 0 2 ; −1 0 1 ],   W_y ≜ [ −1 −2 −1 ; 0 0 0 ; 1 2 1 ]   (Sobel)   (67.17a)
W_x ≜ [ −1 0 1 ; −1 0 1 ; −1 0 1 ],   W_y ≜ [ −1 −1 −1 ; 0 0 0 ; 1 1 1 ]   (Prewitt)   (67.17b)

These operators can be used to estimate gradients of an image at every pixel location. We center the operator at the pixel location (with the centers of the masks identified in (67.17a)–(67.17b) by the small squares around them) and evaluate the correlation of the image with the gradient operators:

I  −−(W_x)−→  F_x,   I  −−(W_y)−→  F_y   (67.18)

Figure 67.7 Illustration of the edge detection procedure using the Sobel operator. The original image on the left is from the site www.pexels.com, where photos are free to use.

Here, the notation (F_x, F_y) denotes the result of computing the gradients along the x (horizontal) and y (vertical) directions. Note that the Prewitt and Sobel operators compute local horizontal and vertical differences or sums. This helps reduce the effect of noise in the data. Note further that these gradient operators have the desirable property of yielding zeros for uniform regions in an image. Once we have computed the gradient values, we can combine their components along the row and column directions and determine the gradient vector magnitudes:

F(r, c) = √( F_x²(r, c) + F_y²(r, c) )   (67.19)

We then decide that a pixel location (r, c) is an edge location in the original image if F(r, c) is sufficiently large (i.e., larger than some threshold value) – see Fig. 67.7. If we are interested in detecting separately the presence of horizontal and vertical edges, then the decision can be based solely on the values of F_x(r, c) and F_y(r, c) exceeding appropriate thresholds. This example illustrates that different choices for a mask can be used to highlight different properties of the underlying image.
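A compact NumPy sketch of this edge detection procedure is shown below; the correlate2d helper, the small synthetic image with a vertical step edge, and the threshold of half the maximum gradient magnitude are illustrative choices of ours and not part of the text.

    import numpy as np

    def correlate2d(I, W):
        # Slide the KxK mask W over the LxL image I (unit stride, no padding)
        L, K = I.shape[0], W.shape[0]
        Lp = L - K + 1                       # output size, as in (67.14)
        F = np.zeros((Lp, Lp))
        for r in range(Lp):
            for c in range(Lp):
                F[r, c] = np.sum(I[r:r + K, c:c + K] * W)
        return F

    # Sobel masks from (67.17a)
    Wx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    Wy = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float)

    # Illustrative 8x8 grayscale image with a vertical step edge
    I = np.zeros((8, 8))
    I[:, 4:] = 255.0

    Fx = correlate2d(I, Wx)                  # gradient along the horizontal direction
    Fy = correlate2d(I, Wy)                  # gradient along the vertical direction
    F = np.sqrt(Fx**2 + Fy**2)               # gradient magnitude, as in (67.19)

    edges = F > 0.5 * F.max()                # illustrative threshold
    print(edges.astype(int))                 # 1s mark the detected vertical edge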

67.1.3 Feature Maps

Now, given an input image I of size L × L, the first step in a CNN is to construct a correlation layer consisting of identical neural units, one for each patch H_{r,c}. The correlation layer is characterized by a mask W. All neural units in the layer will have the same weight vector corresponding to this mask, i.e., w = vec(W). All neural units will also have the same offset parameter θ and the same activation function. This construction is illustrated in Fig. 67.8, which shows an input image I, a filtered image F, and two neurons for illustration purposes. One neuron acts on the entries of patch H_{1,1} to generate the output pixel F(1, 1), and a second neuron acts on the entries of patch H_{r,c} to generate pixel F(r, c). There will of course be many more neurons in the correlation layer lying between I and F; there will be as many neurons as patches in the input image. We refer to the output image F as a feature map.

Figure 67.8 Each patch H_{r,c} from the input image is fed into a neuron whose weight vector is determined by the mask, w = vec(W). All neurons have the same weight vector and the same bias coefficient, θ. The result is a filtered image, F, also called a feature map. The pre-activation image is denoted by Z.

The activation function for the neurons can be any of the possibilities listed earlier in Table 65.1, although the tanh and rectifier functions have been observed to provide enhanced performance in the case of convolutional networks (with the rectifier function often being the preferred choice):

f(x) = max{0, x},   f(x) = tanh(x)   (67.20)

We denote the scalar output of each neuron in this initial correlation layer by F(r, c), for r, c = 1, 2, . . . , L′ (where L′ × L′ is the size of the output image), so that

F(r, c) = f( h_{r,c}^T w − θ )   (67.21)

where h_{r,c} is the vector representation of patch H_{r,c} and w is the vector representation of mask W. We also denote the scalar value prior to the activation function by

Z(r, c) = h_{r,c}^T w − θ   =⇒   F(r, c) = f(Z(r, c))   (67.22)

We refer to the matrix Z as the pre-activation image. Observe that each patch H_{r,c} is not only correlated with the mask W, as was the case in the image filtering example from the previous section, but the result is further adjusted by an offset value and processed by the nonlinearity f(·). The addition of nonlinear processing enriches the dynamics of the transformation from the input signal to its filtered version. We summarize the process of filtering an image by sliding a mask over it in listing (67.25). Each patch contains the entries of the input image that are responsible for generating the (r, c) entry of the filtered maps Z and F. We will denote the mapping from the input image I to the feature map F generically by the notation

F = f(I, W, θ)   (67.23)

which consists of the activation function, the mask, the offset parameter, and the input image. The calculations involved in this transformation are the steps in (67.25). Sometimes, we will need to be more explicit and separate the operation of generating the pre-activation map Z from the post-activation map F. In that case, we replace (67.23) by the description:

Z = mask(I, W, θ)
F = f(Z)   (67.24)

where the first line refers to the operation of generating the entries of Z through the linear operation in (67.22) involving the mask W and the offset parameter θ. This operation acts on the entries of the input image.

Correlating an image I with a mask W to generate image F.
  input is an image I of size L × L;
  given a mask W of size K × K;
  given an offset scalar θ and an activation function f(·);
  output: maps Z and F of size L′ × L′, where L′ = L − K + 1;
  let w = vec(W).
  repeat for r = 1, 2, . . . , L′ (row index):
      repeat for c = 1, 2, . . . , L′ (column index):
          extract the K × K input patch H_{r,c} for location F(r, c)
          h_{r,c} = vec(H_{r,c})
          Z(r, c) = h_{r,c}^T w − θ   (pre-activation)
          F(r, c) = f(Z(r, c))   (post-activation)
      end
  end
                                                            (67.25)
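As a rough translation of listing (67.25) into code, the following NumPy sketch computes the pre- and post-activation maps of a single correlation layer with unit stride; the function name correlation_layer, the rectifier default for f(·) from (67.20), and the random test data are illustrative assumptions of ours.

    import numpy as np

    def correlation_layer(I, W, theta, f=lambda x: np.maximum(0.0, x)):
        # Listing (67.25): unit-stride correlation of image I with mask W,
        # followed by the offset theta and the activation function f (ReLU here)
        L, K = I.shape[0], W.shape[0]
        Lp = L - K + 1                               # L' = L - K + 1, as in (67.14)
        w = W.flatten(order="F")                     # w = vec(W)
        Z = np.zeros((Lp, Lp))
        for r in range(Lp):
            for c in range(Lp):
                h_rc = I[r:r + K, c:c + K].flatten(order="F")   # vec(H_{r,c})
                Z[r, c] = h_rc @ w - theta           # pre-activation, as in (67.22)
        return Z, f(Z)                               # pre- and post-activation maps

    # Illustrative usage with random data
    rng = np.random.default_rng(0)
    I = rng.standard_normal((8, 8))                  # 8x8 input "image"
    W = rng.standard_normal((3, 3))                  # 3x3 mask
    Z, F = correlation_layer(I, W, theta=0.1)
    print(Z.shape, F.shape)                          # both are (6, 6)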

The successive patches {H_{r,c}}_{r,c=1}^{L′} that are generated by the correlation procedure (67.25) can be saved into a block matrix, which we will denote by H; from now on, we reserve the notation H_{r,c} with subscripts for the smaller K × K patches and the notation H without subscripts for the aggregation of all these smaller patches as indicated by expression (67.27c). This aggregate matrix, which we will refer to as the filtering gradient map and whose block structure is illustrated in Fig. 67.9 for L′ = 4, will be needed later in the description of the algorithm for training CNNs. The designation of gradient map is motivated in the next examples, which illustrate certain gradient calculations relative to both the mask entries (first example) and image pixels (second example). These types of calculations will be useful in the derivation of the training algorithm for CNNs.

Example 67.2 (Filtering gradient map) Expression (67.22) shows how each (r, c) entry in the pre-activation map Z is computed from knowledge of the mask vector w = vec(W) and the patch vector h_{r,c} = vec(H_{r,c}). Assume, for illustration purposes, that K = 3 so that W and H_{r,c} are 3 × 3. Introduce the following notation for the K × K gradient matrix of Z(r, c) relative to the entries of the mask:

∂Z(r, c)/∂W ≜ [ ∂Z(r, c)/∂W(1, 1)   ∂Z(r, c)/∂W(1, 2)   ∂Z(r, c)/∂W(1, 3) ;
                ∂Z(r, c)/∂W(2, 1)   ∂Z(r, c)/∂W(2, 2)   ∂Z(r, c)/∂W(2, 3) ;
                ∂Z(r, c)/∂W(3, 1)   ∂Z(r, c)/∂W(3, 2)   ∂Z(r, c)/∂W(3, 3) ]   (67.26)

where W(a, b) denotes the (a, b) entry of W. It follows from (67.22) that

∂Z(r, c)/∂W = H_{r,c},   (K × K)   (67.27a)

For this reason, we refer to H_{r,c} as the local gradient matrix at location (r, c) of Z. If we collect all these K × K local gradient matrices {H_{r,c}} from across all locations


Figure 67.9 The filtering gradient map H is a block matrix with L′ × L′ blocks (shown for L′ = 4, with blocks H_{1,1}, H_{1,2}, . . . , H_{2,4}, . . . visible); each block H_{r,c} is of size K × K.