English Pages 560 Year 2019
Variational Bayesian Learning Theory

Variational Bayesian learning is one of the most popular methods in machine learning. Designed for researchers and graduate students in machine learning, this book summarizes recent developments in the nonasymptotic and asymptotic theory of variational Bayesian learning and suggests how this theory can be applied in practice. The authors begin by developing a basic framework with a focus on conjugacy, which enables the reader to derive tractable algorithms. Next, the book summarizes nonasymptotic theory, which, although limited in application to bilinear models, precisely describes the behavior of the variational Bayesian solution and reveals its sparsity-inducing mechanism. Finally, the text summarizes asymptotic theory, which reveals phase transition phenomena depending on the prior setting, thus providing suggestions on how to set hyperparameters for particular purposes. Detailed derivations allow readers to follow along without prior knowledge of the mathematical techniques specific to Bayesian learning.

Shinichi Nakajima is a senior researcher at Technische Universität Berlin. His research interests include the theory and applications of machine learning, and he has published papers at numerous conferences and in journals such as The Journal of Machine Learning Research, The Machine Learning Journal, Neural Computation, and IEEE Transactions on Signal Processing. He currently serves as an area chair for Neural Information Processing Systems (NIPS) and an action editor for Digital Signal Processing.

Kazuho Watanabe is an associate professor at Toyohashi University of Technology. His research interests include statistical machine learning and information theory, and he has published papers at numerous conferences and in journals such as The Journal of Machine Learning Research, The Machine Learning Journal, IEEE Transactions on Information Theory, and IEEE Transactions on Neural Networks and Learning Systems.
Masashi Sugiyama is the director of the RIKEN Center for Advanced Intelligence Project and professor of Complexity Science and Engineering at the University of Tokyo. His research interests include the theory, algorithms, and applications of machine learning. He has written several books on machine learning, including Density Ratio Estimation in Machine Learning. He served as program cochair and general cochair of the NIPS conference in 2015 and 2016, respectively, and received the Japan Academy Medal in 2017.
Variational Bayesian Learning Theory

SHINICHI NAKAJIMA
Technische Universität Berlin

KAZUHO WATANABE
Toyohashi University of Technology

MASASHI SUGIYAMA
University of Tokyo
University Printing House, Cambridge CB2 8BS, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
314–321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre, New Delhi – 110025, India
79 Anson Road, #06–04/06, Singapore 079906

Cambridge University Press is part of the University of Cambridge. It furthers the University’s mission by disseminating knowledge in the pursuit of education, learning, and research at the highest international levels of excellence.

www.cambridge.org
Information on this title: www.cambridge.org/9781107076150
DOI: 10.1017/9781139879354

© Shinichi Nakajima, Kazuho Watanabe, and Masashi Sugiyama 2019

This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published 2019
Printed and bound in Great Britain by Clays Ltd, Elcograf S.p.A.

A catalogue record for this publication is available from the British Library.

Library of Congress Cataloging-in-Publication Data
Names: Nakajima, Shinichi, author. | Watanabe, Kazuho, author. | Sugiyama, Masashi, 1974- author.
Title: Variational Bayesian learning theory / Shinichi Nakajima (Technische Universität Berlin), Kazuho Watanabe (Toyohashi University of Technology), Masashi Sugiyama (University of Tokyo).
Description: Cambridge ; New York, NY : Cambridge University Press, 2019. | Includes bibliographical references and index.
Identifiers: LCCN 2019005983 | ISBN 9781107076150 (hardback : alk. paper) | ISBN 9781107430761 (pbk. : alk. paper)
Subjects: LCSH: Bayesian field theory. | Probabilities.
Classification: LCC QC174.85.B38 N35 2019 | DDC 519.2/33–dc23 LC record available at https://lccn.loc.gov/2019005983 ISBN 978-1-107-07615-0 Hardback Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this publication and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.
Contents

Preface
Nomenclature

Part I Formulation

1 Bayesian Learning
1.1 Framework
1.2 Computation

2 Variational Bayesian Learning
2.1 Framework
2.2 Other Approximation Methods

Part II Algorithm

3 VB Algorithm for Multilinear Models
3.1 Matrix Factorization
3.2 Matrix Factorization with Missing Entries
3.3 Tensor Factorization
3.4 Low-Rank Subspace Clustering
3.5 Sparse Additive Matrix Factorization

4 VB Algorithm for Latent Variable Models
4.1 Finite Mixture Models
4.2 Other Latent Variable Models

5 VB Algorithm under No Conjugacy
5.1 Logistic Regression
5.2 Sparsity-Inducing Prior
5.3 Unified Approach by Local VB Bounds

Part III Nonasymptotic Theory

6 Global VB Solution of Fully Observed Matrix Factorization
6.1 Problem Description
6.2 Conditions for VB Solutions
6.3 Irrelevant Degrees of Freedom
6.4 Proof of Theorem 6.4
6.5 Problem Decomposition
6.6 Analytic Form of Global VB Solution
6.7 Proofs of Theorem 6.7 and Corollary 6.8
6.8 Analytic Form of Global Empirical VB Solution
6.9 Proof of Theorem 6.13
6.10 Summary of Intermediate Results

7 Model-Induced Regularization and Sparsity Inducing Mechanism
7.1 VB Solutions for Special Cases
7.2 Posteriors and Estimators in a One-Dimensional Case
7.3 Model-Induced Regularization
7.4 Phase Transition in VB Learning
7.5 Factorization as ARD Model

8 Performance Analysis of VB Matrix Factorization
8.1 Objective Function for Noise Variance Estimation
8.2 Bounds of Noise Variance Estimator
8.3 Proofs of Theorem 8.2 and Corollary 8.3
8.4 Performance Analysis
8.5 Numerical Verification
8.6 Comparison with Laplace Approximation
8.7 Optimality in Large-Scale Limit

9 Global Solver for Matrix Factorization
9.1 Global VB Solver for Fully Observed MF
9.2 Global EVB Solver for Fully Observed MF
9.3 Empirical Comparison with the Standard VB Algorithm
9.4 Extension to Nonconjugate MF with Missing Entries

10 Global Solver for Low-Rank Subspace Clustering
10.1 Problem Description
10.2 Conditions for VB Solutions
10.3 Irrelevant Degrees of Freedom
10.4 Proof of Theorem 10.2
10.5 Exact Global VB Solver (EGVBS)
10.6 Approximate Global VB Solver (AGVBS)
10.7 Proof of Theorem 10.7
10.8 Empirical Evaluation

11 Efficient Solver for Sparse Additive Matrix Factorization
11.1 Problem Description
11.2 Efficient Algorithm for SAMF
11.3 Experimental Results

12 MAP and Partially Bayesian Learning
12.1 Theoretical Analysis in Fully Observed MF
12.2 More General Cases
12.3 Experimental Results

Part IV Asymptotic Theory

13 Asymptotic Learning Theory
13.1 Statistical Learning Machines
13.2 Basic Tools for Asymptotic Analysis
13.3 Target Quantities
13.4 Asymptotic Learning Theory for Regular Models
13.5 Asymptotic Learning Theory for Singular Models
13.6 Asymptotic Learning Theory for VB Learning

14 Asymptotic VB Theory of Reduced Rank Regression
14.1 Reduced Rank Regression
14.2 Generalization Properties
14.3 Insights into VB Learning

15 Asymptotic VB Theory of Mixture Models
15.1 Basic Lemmas
15.2 Mixture of Gaussians
15.3 Mixture of Exponential Family Distributions
15.4 Mixture of Bernoulli with Deterministic Components

16 Asymptotic VB Theory of Other Latent Variable Models
16.1 Bayesian Networks
16.2 Hidden Markov Models
16.3 Probabilistic Context-Free Grammar
16.4 Latent Dirichlet Allocation

17 Unified Theory for Latent Variable Models
17.1 Local Latent Variable Model
17.2 Asymptotic Upper-Bound for VB Free Energy
17.3 Example: Average VB Free Energy of Gaussian Mixture Model
17.4 Free Energy and Generalization Error
17.5 Relation to Other Analyses

Appendix A James–Stein Estimator
Appendix B Metric in Parameter Space
Appendix C Detailed Description of Overlap Method
Appendix D Optimality of Bayesian Learning

Bibliography
Subject Index
Preface
Bayesian learning is a statistical inference method that provides estimators and other quantities computed from the posterior distribution, that is, the conditional distribution of unknown variables given observed variables. Compared with point estimation methods such as maximum likelihood (ML) estimation and maximum a posteriori (MAP) learning, Bayesian learning has the following advantages:

• Theoretically optimal. The posterior distribution is the best we can obtain about the unknown variables from observation. Therefore, Bayesian learning provides the most accurate predictions, provided that the assumed model is appropriate.
• Uncertainty information is available. Sharpness of the posterior distribution indicates the reliability of estimators. The credible interval, which can be computed from the posterior distribution, provides probabilistic bounds of unknown variables.
• Model selection and hyperparameter estimation can be performed in a single framework. The marginal likelihood can be used as a criterion to evaluate how well a statistical model (which is typically a combination of model and prior distributions) fits the observed data, taking account of the flexibility of the model as a penalty.
• Less prone to overfitting. It was theoretically proven that Bayesian learning overfits the observation noise less than MAP learning.

On the other hand, Bayesian learning has a critical drawback: computing the posterior distribution is computationally hard in many practical models. This is because Bayesian learning requires expectation operations or integral computations, which cannot be analytically performed except for simple cases.
Accordingly, various approximation methods, including deterministic and sampling methods, have been proposed. Variational Bayesian (VB) learning is one of the most popular deterministic approximation methods to Bayesian learning. VB learning aims to find the closest distribution to the Bayes posterior under some constraints, which are designed so that the expectation operation is tractable. The simplest and most popular approach is the mean field approximation, where the approximate posterior is sought in the space of decomposable distributions, i.e., groups of unknown variables are forced to be independent of each other. In many practical models, Bayesian learning is intractable jointly for all unknown parameters, while it is tractable if the dependence between groups of parameters is ignored. Such a case often happens because many practical models have been constructed by combining simple models in which Bayesian learning is analytically tractable. This property is called conditional conjugacy, and makes VB learning computationally tractable.

Since its development, VB learning has shown good performance in many applications. Its good aspects and downsides have been empirically observed and qualitatively discussed. Some of those aspects seem to be inherited from full Bayesian learning, while others seem to be artifacts of the forced independence constraints. We have dedicated ourselves to theoretically clarifying the behavior of VB learning quantitatively, which is the main topic of this book.

This book starts from the formulation of Bayesian learning methods. In Part I, we introduce Bayesian learning and VB learning, emphasizing how conjugacy and conditional conjugacy make the computation tractable. We also briefly introduce other approximation methods and relate them to VB learning. In Part II, we derive algorithms of VB learning for popular statistical models, on which theoretical analysis will be conducted in the subsequent parts.
We categorize the theory of VB learning into two parts, and exhibit them separately. Part III focuses on nonasymptotic theory, where we do not assume the availability of a large number of samples. This analysis so far has been applied only to a class of bilinear models, but we can make detailed discussions including analytic forms of global solutions and theoretical performance guarantees. On the other hand, Part IV focuses on asymptotic theory, where the number of observed samples is assumed to be large. This approach has been applied to a broad range of statistical models, and successfully elucidated the phase transition phenomenon of VB learning. As a practical outcome, this analysis provides a guideline on how to set hyperparameters for different purposes.
Recently, many variations of VB learning have been proposed, e.g., more accurate inference methods beyond the mean field approximation, stochastic gradient optimization for big data analysis, and sampling-based update rules for automatic (black-box) inference to cope with general nonconjugate likelihoods including deep neural networks. Although we briefly introduce some of those recent works in Part I, they are not in the central scope of this book. We rather focus on the simplest mean field approximation, whose behavior has been clarified quantitatively by theory.

This book was completed with the support of many people. Shinichi Nakajima deeply thanks Professor Klaus-Robert Müller and the members of the Machine Learning Group at Technische Universität Berlin for their direct and indirect support during the period of book writing. Special thanks go to Sergej Dogadov, Hannah Marienwald, Ludwig Winkler, Dr. Nico Gönitz, and Dr. Pan Kessel, who reviewed chapters of earlier versions, found errors and typos, provided suggestions to improve the presentation, and kept encouraging him to continue writing the book. The authors also thank Lauren Cowles and her team at Cambridge University Press, as well as all other staff members who contributed to the book production process, for their help and for their patience with the delays in our manuscript preparation. Lauren Cowles, Clare Dennison, Adam Kratoska, and Amy He have coordinated the project since its proposal, and Harsha Vardhanan at SPi Global has managed the copy-editing process with Andy Saff.
The book writing project was partially supported by the following organizations: the German Research Foundation (GRK 1589/1) by the Federal Ministry of Education and Research (BMBF) under the Berlin Big Data Center project (Phase 1: FKZ 01IS14013A and Phase 2: FKz 01IS18025A), the Japan Society for the Promotion of Science (15K16050), and the International Research Center for Neurointelligence (WPI-IRCN) at The University of Tokyo Institutes for Advanced Study.
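Nomenclature

$a, b, c, \ldots, A, B, C, \ldots$ : Scalars.
$\boldsymbol{a}, \boldsymbol{b}, \boldsymbol{c}, \ldots$ (bold-faced small letters) : Vectors.
$\boldsymbol{A}, \boldsymbol{B}, \boldsymbol{C}, \ldots$ (bold-faced capital letters) : Matrices.
$\mathcal{A}, \mathcal{B}, \mathcal{C}, \ldots$ (calligraphic capital letters) : Tensors or sets.
$(\cdot)_{l,m}$ : $(l, m)$th element of a matrix.
$\top$ : Transpose of a matrix or vector.
$\mathrm{tr}(\cdot)$ : Trace of a matrix.
$\det(\cdot)$ : Determinant of a matrix.
$\odot$ : Hadamard (elementwise) product.
$\otimes$ : Kronecker product.
$\times_n$ : $n$-mode tensor product.
$|\cdot|$ : Absolute value of a scalar. It applies elementwise for a vector or matrix.
$\mathrm{sign}(\cdot)$ : Sign operator such that $\mathrm{sign}(x) = 1$ if $x \ge 0$ and $\mathrm{sign}(x) = -1$ otherwise. It applies elementwise for a vector or matrix.

$\{\cdots\}$ : Set consisting of specified entities.
$\{\cdots\}^D$ : $D$-fold Cartesian product, i.e., $\mathcal{X}^D \equiv \{(x_1, \ldots, x_D)^\top ;\, x_d \in \mathcal{X} \text{ for } d = 1, \ldots, D\}$.
$\#(\cdot)$ : Cardinality (the number of entities) of a set.
$\mathbb{R}$ : The set of all real numbers.
$\mathbb{R}_{+}$ : The set of all nonnegative real numbers.
$\mathbb{R}_{++}$ : The set of all positive real numbers.
$\mathbb{R}^D$ : The set of all $D$-dimensional real (column) vectors.
$[\cdot, \cdot]$ : The set of real numbers in a range, i.e., $[l, u] = \{x \in \mathbb{R};\, l \le x \le u\}$.
$[\cdot, \cdot]^D$ : The set of $D$-dimensional real vectors whose entries are in a range, i.e., $[l, u]^D \equiv \{\boldsymbol{x} \in \mathbb{R}^D;\, l \le x_d \le u \text{ for } d = 1, \ldots, D\}$.
$\mathbb{R}^{L \times M}$ : The set of all $L \times M$ real matrices.
$\mathbb{R}^{M_1 \times M_2 \times \cdots \times M_N}$ : The set of all $M_1 \times M_2 \times \cdots \times M_N$ real tensors.
$\mathbb{I}$ : The set of all integers.
$\mathbb{I}_{++}$ : The set of all positive integers.
$\mathbb{C}$ : The set of all complex numbers.
$\mathbb{S}^D$ : The set of all $D \times D$ symmetric matrices.
$\mathbb{S}^D_{+}$ : The set of all $D \times D$ positive semidefinite matrices.
$\mathbb{S}^D_{++}$ : The set of all $D \times D$ positive definite matrices.
$\mathbb{D}^D$ : The set of all $D \times D$ diagonal matrices.
$\mathbb{D}^D_{+}$ : The set of all $D \times D$ positive semidefinite diagonal matrices.
$\mathbb{D}^D_{++}$ : The set of all $D \times D$ positive definite diagonal matrices.
$\mathbb{H}^{K-1}_N$ : The set of all possible histograms for $N$ samples and $K$ categories, i.e., $\mathbb{H}^{K-1}_N \equiv \{\boldsymbol{x} \in \{0, \ldots, N\}^K;\, \sum_{k=1}^K x_k = N\}$.
$\Delta^{K-1}$ : The standard $(K-1)$-simplex, i.e., $\Delta^{K-1} \equiv \{\boldsymbol{\theta} \in [0, 1]^K;\, \sum_{k=1}^K \theta_k = 1\}$.

$(\boldsymbol{a}_1, \ldots, \boldsymbol{a}_M)$ : Column vectors of $\boldsymbol{A}$, i.e., $\boldsymbol{A} = (\boldsymbol{a}_1, \ldots, \boldsymbol{a}_M) \in \mathbb{R}^{L \times M}$.
$(\widetilde{\boldsymbol{a}}_1, \ldots, \widetilde{\boldsymbol{a}}_L)$ : Row vectors of $\boldsymbol{A}$, i.e., $\boldsymbol{A} = (\widetilde{\boldsymbol{a}}_1, \ldots, \widetilde{\boldsymbol{a}}_L)^\top \in \mathbb{R}^{L \times M}$.
$\mathrm{Diag}(\cdot)$ : Diagonal matrix with specified diagonal elements, i.e., $(\mathrm{Diag}(\boldsymbol{x}))_{l,m} = x_l$ if $l = m$ and $0$ otherwise.
$\mathrm{diag}(\cdot)$ : Column vector consisting of the diagonal entries of a matrix, i.e., $(\mathrm{diag}(\boldsymbol{X}))_l = X_{l,l}$.
$\mathrm{vec}(\cdot)$ : Vectorization operator concatenating all column vectors of a matrix into a long column vector, i.e., $\mathrm{vec}(\boldsymbol{A}) = (\boldsymbol{a}_1^\top, \ldots, \boldsymbol{a}_M^\top)^\top \in \mathbb{R}^{LM}$ for a matrix $\boldsymbol{A} = (\boldsymbol{a}_1, \ldots, \boldsymbol{a}_M) \in \mathbb{R}^{L \times M}$.
$\boldsymbol{I}_D$ : $D$-dimensional ($D \times D$) identity matrix.
$\boldsymbol{\Gamma}$ : A diagonal matrix.
$\boldsymbol{\Omega}$ : An orthogonal matrix.
$\boldsymbol{e}_k$ : One-of-$K$ expression, i.e., the $K$-dimensional vector $\boldsymbol{e}_k = (0, \ldots, 0, 1, 0, \ldots, 0)^\top \in \{0, 1\}^K$ with one at the $k$th entry.
$\boldsymbol{1}_K$ : $K$-dimensional vector with all elements equal to one, i.e., $\boldsymbol{1}_K = (1, \ldots, 1)^\top$.

$\mathrm{Gauss}_D(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ : $D$-dimensional Gaussian distribution with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$.
$\mathrm{MGauss}_{D_1, D_2}(\boldsymbol{M}, \boldsymbol{\Sigma} \otimes \boldsymbol{\Psi})$ : $D_1 \times D_2$ dimensional matrix variate Gaussian distribution with mean $\boldsymbol{M}$ and covariance $\boldsymbol{\Sigma} \otimes \boldsymbol{\Psi}$.
$\mathrm{Gamma}(\alpha, \beta)$ : Gamma distribution with shape parameter $\alpha$ and rate (inverse scale) parameter $\beta$; the density is proportional to $x^{\alpha - 1} \exp(-\beta x)$.
$\mathrm{InvGamma}(\alpha, \beta)$ : Inverse-Gamma distribution with shape parameter $\alpha$ and scale parameter $\beta$.
$\mathrm{Wishart}_D(\boldsymbol{V}, \nu)$ : $D$-dimensional Wishart distribution with scale matrix $\boldsymbol{V}$ and degree of freedom $\nu$.
$\mathrm{InvWishart}_D(\boldsymbol{V}, \nu)$ : $D$-dimensional inverse-Wishart distribution with scale matrix $\boldsymbol{V}$ and degree of freedom $\nu$.
$\mathrm{Multinomial}(\boldsymbol{\theta}, N)$ : Multinomial distribution with event probabilities $\boldsymbol{\theta}$ and number of trials $N$.
$\mathrm{Dirichlet}(\boldsymbol{\phi})$ : Dirichlet distribution with concentration parameters $\boldsymbol{\phi}$.

$\mathrm{Prob}(\cdot)$ : Probability of an event.
$p(\cdot), q(\cdot)$ : Probability distribution (probability mass function for discrete random variables, and probability density function for continuous random variables). Typically $p$ is used for a model distribution and $q$ is used for the true distribution.
$r(\cdot)$ : A trial distribution (a variable of a functional) for approximation.
$\langle f(\boldsymbol{x}) \rangle_{p(\boldsymbol{x})}$ : Expectation value of $f(\boldsymbol{x})$ over distribution $p(\boldsymbol{x})$, i.e., $\langle f(\boldsymbol{x}) \rangle_{p(\boldsymbol{x})} \equiv \int f(\boldsymbol{x}) p(\boldsymbol{x}) \, d\boldsymbol{x}$.
$\widehat{\cdot}$ : Estimator for an unknown variable, e.g., $\widehat{\boldsymbol{x}}$ and $\widehat{\boldsymbol{A}}$ are estimators for a vector $\boldsymbol{x}$ and a matrix $\boldsymbol{A}$, respectively.
$\mathrm{Mean}(\cdot)$ : Mean of a random variable.
$\mathrm{Var}(\cdot)$ : Variance of a random variable.
$\mathrm{Cov}(\cdot)$ : Covariance of a random variable.
$\mathrm{KL}(\cdot \| \cdot)$ : Kullback–Leibler divergence between distributions, i.e., $\mathrm{KL}(p \| q) \equiv \left\langle \log \frac{p(\boldsymbol{x})}{q(\boldsymbol{x})} \right\rangle_{p(\boldsymbol{x})}$.
$\delta(\mu; \widehat{\mu})$ : Dirac delta function located at $\widehat{\mu}$. It also denotes its approximation (called Pseudo-delta function) with its entropy finite.
$\mathrm{GE}$ : Generalization error.
$\mathrm{TE}$ : Training error.
$F$ : Free energy.

$O(f(N))$ : A function such that $\limsup_{N \to \infty} |O(f(N)) / f(N)| < \infty$.
$o(f(N))$ : A function such that $\lim_{N \to \infty} o(f(N)) / f(N) = 0$.
$\Omega(f(N))$ : A function such that $\liminf_{N \to \infty} |\Omega(f(N)) / f(N)| > 0$.
$\omega(f(N))$ : A function such that $\lim_{N \to \infty} |\omega(f(N)) / f(N)| = \infty$.
$\Theta(f(N))$ : A function such that $\limsup_{N \to \infty} |\Theta(f(N)) / f(N)| < \infty$ and $\liminf_{N \to \infty} |\Theta(f(N)) / f(N)| > 0$.
$O_p(f(N))$ : A function such that $\limsup_{N \to \infty} O_p(f(N)) / f(N) < \infty$ in probability.
$o_p(f(N))$ : A function such that $\lim_{N \to \infty} o_p(f(N)) / f(N) = 0$ in probability.
$\Omega_p(f(N))$ : A function such that $\liminf_{N \to \infty} \Omega_p(f(N)) / f(N) > 0$ in probability.
$\omega_p(f(N))$ : A function such that $\lim_{N \to \infty} \omega_p(f(N)) / f(N) = \infty$ in probability.
$\Theta_p(f(N))$ : A function such that $\limsup_{N \to \infty} \Theta_p(f(N)) / f(N) < \infty$ and $\liminf_{N \to \infty} \Theta_p(f(N)) / f(N) > 0$ in probability.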
Part I Formulation
1 Bayesian Learning
Bayesian learning is an inference method based on the fundamental law of probability, called the Bayes theorem. In this first chapter, we introduce the framework of Bayesian learning with simple examples where Bayesian learning can be performed analytically.
1.1 Framework

Bayesian learning considers the following situation. We have observed a set $\mathcal{D}$ of data, which are subject to a conditional distribution $p(\mathcal{D}|\boldsymbol{w})$, called the model distribution, of the data given an unknown model parameter $\boldsymbol{w}$. Although the value of $\boldsymbol{w}$ is unknown, vague information on $\boldsymbol{w}$ is provided as a prior distribution $p(\boldsymbol{w})$. The conditional distribution $p(\mathcal{D}|\boldsymbol{w})$ is also called the model likelihood when it is seen as a function of the unknown parameter $\boldsymbol{w}$.
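1.1.1 Bayes Theorem and Bayes Posterior

Bayesian learning is based on the following basic factorization property of the joint distribution $p(\mathcal{D}, \boldsymbol{w})$:

$$\underbrace{p(\boldsymbol{w}|\mathcal{D})}_{\text{posterior}} \, \underbrace{p(\mathcal{D})}_{\text{marginal}} = \underbrace{p(\mathcal{D}, \boldsymbol{w})}_{\text{joint}} = \underbrace{p(\mathcal{D}|\boldsymbol{w})}_{\text{likelihood}} \, \underbrace{p(\boldsymbol{w})}_{\text{prior}}, \tag{1.1}$$

where the marginal distribution is given by

$$p(\mathcal{D}) = \int_{\mathcal{W}} p(\mathcal{D}, \boldsymbol{w}) \, d\boldsymbol{w} = \int_{\mathcal{W}} p(\mathcal{D}|\boldsymbol{w}) p(\boldsymbol{w}) \, d\boldsymbol{w}. \tag{1.2}$$

Here, the integration is performed in the domain $\mathcal{W}$ of the parameter $\boldsymbol{w}$. Note that, if the domain $\mathcal{W}$ is discrete, integration should be replaced with summation, i.e., for any function $f(\boldsymbol{w})$,

$$\int_{\mathcal{W}} f(\boldsymbol{w}) \, d\boldsymbol{w} \;\to\; \sum_{\boldsymbol{w}' \in \mathcal{W}} f(\boldsymbol{w}').$$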
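The posterior distribution, the distribution of the unknown parameter $\boldsymbol{w}$ given the observed data set $\mathcal{D}$, is derived by dividing both sides of Eq. (1.1) by the marginal distribution $p(\mathcal{D})$:

$$p(\boldsymbol{w}|\mathcal{D}) = \frac{p(\mathcal{D}, \boldsymbol{w})}{p(\mathcal{D})} \propto p(\mathcal{D}, \boldsymbol{w}). \tag{1.3}$$

Here, we emphasized that the posterior distribution is proportional to the joint distribution $p(\mathcal{D}, \boldsymbol{w})$ because the marginal distribution $p(\mathcal{D})$ is a constant (as a function of $\boldsymbol{w}$). In other words, the joint distribution is an unnormalized posterior distribution. Eq. (1.3) is called the Bayes theorem, and the posterior distribution computed exactly by Eq. (1.3) is called the Bayes posterior when we distinguish it from its approximations.

Example 1.1 (Parametric density estimation) Assume that the observed data $\mathcal{D} = \{\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(N)}\}$ consist of $N$ independent and identically distributed (i.i.d.) samples from the model distribution $p(\boldsymbol{x}|\boldsymbol{w})$. Then, the model likelihood is given by $p(\mathcal{D}|\boldsymbol{w}) = \prod_{n=1}^N p(\boldsymbol{x}^{(n)}|\boldsymbol{w})$, and therefore, the posterior distribution is given by

$$p(\boldsymbol{w}|\mathcal{D}) = \frac{\prod_{n=1}^N p(\boldsymbol{x}^{(n)}|\boldsymbol{w}) \, p(\boldsymbol{w})}{\int \prod_{n=1}^N p(\boldsymbol{x}^{(n)}|\boldsymbol{w}) \, p(\boldsymbol{w}) \, d\boldsymbol{w}} \propto \prod_{n=1}^N p(\boldsymbol{x}^{(n)}|\boldsymbol{w}) \, p(\boldsymbol{w}).$$

Example 1.2 (Parametric regression) Assume that the observed data $\mathcal{D} = \{(\boldsymbol{x}^{(1)}, y^{(1)}), \ldots, (\boldsymbol{x}^{(N)}, y^{(N)})\}$ consist of $N$ i.i.d. input–output pairs from the model distribution $p(\boldsymbol{x}, y|\boldsymbol{w}) = p(y|\boldsymbol{x}, \boldsymbol{w}) p(\boldsymbol{x})$. Then, the likelihood function is given by $p(\mathcal{D}|\boldsymbol{w}) = \prod_{n=1}^N p(y^{(n)}|\boldsymbol{x}^{(n)}, \boldsymbol{w}) p(\boldsymbol{x}^{(n)})$, and therefore, the posterior distribution is given by

$$p(\boldsymbol{w}|\mathcal{D}) = \frac{\prod_{n=1}^N p(y^{(n)}|\boldsymbol{x}^{(n)}, \boldsymbol{w}) \, p(\boldsymbol{w})}{\int \prod_{n=1}^N p(y^{(n)}|\boldsymbol{x}^{(n)}, \boldsymbol{w}) \, p(\boldsymbol{w}) \, d\boldsymbol{w}} \propto \prod_{n=1}^N p(y^{(n)}|\boldsymbol{x}^{(n)}, \boldsymbol{w}) \, p(\boldsymbol{w}).$$

Note that the input distribution $p(\boldsymbol{x})$ does not affect the posterior, because the factors $p(\boldsymbol{x}^{(n)})$ cancel between the numerator and the denominator, and accordingly it is often ignored in practice.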
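The Bayes theorem for a discrete parameter domain can be illustrated with a few lines of Python. The following is only a minimal sketch under assumptions that are not part of the text above: the model is a Bernoulli distribution with success probability $w$, the domain of $w$ is a five-point grid with a uniform prior, and the function name `grid_posterior` and all numerical values are illustrative choices.

```python
def grid_posterior(samples, grid, prior):
    """Posterior p(w|D) on a discrete grid for i.i.d. Bernoulli samples.

    Assumed model (illustrative only): p(x=1|w) = w, p(x=0|w) = 1 - w.
    """
    joint = []
    for w, pw in zip(grid, prior):
        likelihood = 1.0
        for x in samples:
            likelihood *= w if x == 1 else (1.0 - w)
        # Joint p(D, w) = p(D|w) p(w), i.e., the unnormalized posterior
        joint.append(likelihood * pw)
    # Marginal p(D): the integral in Eq. (1.2) becomes a sum on a discrete domain
    marginal = sum(joint)
    # Bayes theorem, Eq. (1.3): divide the joint by the marginal
    posterior = [j / marginal for j in joint]
    return posterior, marginal


grid = [0.1, 0.3, 0.5, 0.7, 0.9]   # hypothetical discrete domain of w
prior = [0.2] * 5                  # uniform prior over the grid
samples = [1, 1, 0, 1]             # hypothetical observations
posterior, marginal = grid_posterior(samples, grid, prior)
for w, p in zip(grid, posterior):
    print(f"w = {w:.1f}: p(w|D) = {p:.4f}")
```

Because the marginal does not depend on $w$, the ranking of the grid points is already determined by the joint distribution; the division only rescales the values so that they sum to one.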
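1.1.2 Maximum A Posteriori Learning

Since the joint distribution $p(\mathcal{D}, \boldsymbol{w})$ is just the product of the likelihood function and the prior distribution (see Eq. (1.1)), it is usually easy to compute. Therefore, it is relatively easy to perform maximum a posteriori (MAP) learning, where the parameters are point-estimated so that the posterior probability is maximized, i.e.,

$$\widehat{\boldsymbol{w}}^{\mathrm{MAP}} = \operatorname*{argmax}_{\boldsymbol{w}} \, p(\boldsymbol{w}|\mathcal{D}) = \operatorname*{argmax}_{\boldsymbol{w}} \, p(\mathcal{D}, \boldsymbol{w}). \tag{1.4}$$

MAP learning includes maximum likelihood (ML) learning,

$$\widehat{\boldsymbol{w}}^{\mathrm{ML}} = \operatorname*{argmax}_{\boldsymbol{w}} \, p(\mathcal{D}|\boldsymbol{w}), \tag{1.5}$$

as a special case with the flat prior $p(\boldsymbol{w}) \propto 1$.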
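The relation between Eqs. (1.4) and (1.5) can be checked numerically. The sketch below is an illustration under assumptions that do not come from the text: a Bernoulli likelihood with $k$ successes in $n$ trials, a Beta$(a, b)$ prior on $w$, and a simple grid search in place of an analytic maximization. The names `log_joint` and `argmax_on_grid` are hypothetical helpers.

```python
import math


def log_joint(w, k, n, a, b):
    """log p(D, w) up to an additive constant.

    Assumed model (illustrative only): Bernoulli likelihood with k successes
    in n trials, and a Beta(a, b) prior whose kernel is w^(a-1) (1-w)^(b-1).
    """
    return (k + a - 1.0) * math.log(w) + (n - k + b - 1.0) * math.log(1.0 - w)


def argmax_on_grid(f, num=9999):
    """Maximize f over an evenly spaced grid in the open interval (0, 1)."""
    grid = [(i + 1) / (num + 1) for i in range(num)]
    return max(grid, key=f)


k, n = 7, 10  # hypothetical data: 7 successes out of 10 trials

# MAP estimate, Eq. (1.4), with an informative Beta(3, 3) prior
w_map = argmax_on_grid(lambda w: log_joint(w, k, n, a=3.0, b=3.0))

# Beta(1, 1) is the flat prior, so the same search returns the ML estimate, Eq. (1.5)
w_ml = argmax_on_grid(lambda w: log_joint(w, k, n, a=1.0, b=1.0))

print(f"MAP estimate: {w_map:.4f}, ML estimate: {w_ml:.4f}")
```

In this example the prior pulls the MAP estimate from the ML value $k/n$ toward the prior mode $1/2$, while the flat prior leaves the maximizer of the likelihood unchanged.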
1.1.3 Bayesian Learning

On the other hand, Bayesian learning requires integration of the joint distribution with respect to the parameter $\boldsymbol{w}$, which is often computationally hard. More specifically, performing Bayesian learning means computing at least one of the following quantities:

Marginal likelihood (zeroth moment)

$$p(\mathcal{D}) = \int p(\mathcal{D}, \boldsymbol{w}) \, d\boldsymbol{w}. \tag{1.6}$$

This quantity has been already introduced in Eq. (1.2) as the normalization factor of the posterior distribution. As seen in Section 1.1.5 and subsequent sections, the marginal likelihood plays an important role in model selection and hyperparameter estimation.

Posterior mean (first moment)

$$\widehat{\boldsymbol{w}} = \langle \boldsymbol{w} \rangle_{p(\boldsymbol{w}|\mathcal{D})} = \frac{1}{p(\mathcal{D})} \int \boldsymbol{w} \cdot p(\mathcal{D}, \boldsymbol{w}) \, d\boldsymbol{w}, \tag{1.7}$$

where $\langle \cdot \rangle_p$ denotes the expectation value over the distribution $p$, i.e., $\langle \cdot \rangle_{p(\boldsymbol{w})} = \int \cdot \; p(\boldsymbol{w}) \, d\boldsymbol{w}$. This quantity is also called the Bayesian estimator. The Bayesian estimator or the model distribution with the Bayesian estimator plugged in (see the plug-in predictive distribution (1.10)) can be the final output of Bayesian learning.

Posterior covariance (second moment)

$$\widehat{\boldsymbol{\Sigma}}_{\boldsymbol{w}} = \left\langle (\boldsymbol{w} - \widehat{\boldsymbol{w}})(\boldsymbol{w} - \widehat{\boldsymbol{w}})^\top \right\rangle_{p(\boldsymbol{w}|\mathcal{D})} = \frac{1}{p(\mathcal{D})} \int (\boldsymbol{w} - \widehat{\boldsymbol{w}})(\boldsymbol{w} - \widehat{\boldsymbol{w}})^\top p(\mathcal{D}, \boldsymbol{w}) \, d\boldsymbol{w}, \tag{1.8}$$

where $\top$ denotes the transpose of a matrix or vector. This quantity provides the credibility information, and is used to assess the confidence level of the Bayesian estimator.

Predictive distribution (expectation of model distribution)

$$p(\mathcal{D}^{\mathrm{new}}|\mathcal{D}) = \left\langle p(\mathcal{D}^{\mathrm{new}}|\boldsymbol{w}) \right\rangle_{p(\boldsymbol{w}|\mathcal{D})} = \frac{1}{p(\mathcal{D})} \int p(\mathcal{D}^{\mathrm{new}}|\boldsymbol{w}) \, p(\mathcal{D}, \boldsymbol{w}) \, d\boldsymbol{w}, \tag{1.9}$$

where $p(\mathcal{D}^{\mathrm{new}}|\boldsymbol{w})$ denotes the model distribution on unobserved new data $\mathcal{D}^{\mathrm{new}}$. In the i.i.d. case such as Examples 1.1 and 1.2, it is sufficient to compute the predictive distribution for a single new sample $\mathcal{D}^{\mathrm{new}} = \{\boldsymbol{x}\}$.

Note that each of the four quantities (1.6) through (1.9) requires computing the expectation of some function $f(\boldsymbol{w})$ over the unnormalized posterior distribution $p(\mathcal{D}, \boldsymbol{w})$ on $\boldsymbol{w}$, i.e., $\int f(\boldsymbol{w}) p(\mathcal{D}, \boldsymbol{w}) \, d\boldsymbol{w}$. Specifically, the marginal likelihood, the posterior mean, and the posterior covariance are the zeroth, the first, and the second moments of the unnormalized posterior distribution, respectively. The expectation is analytically intractable except for some simple cases, and numerical computation is also hard when the dimensionality of the unknown parameter $\boldsymbol{w}$ is high. This is the main bottleneck of Bayesian learning, and many approximation methods have been developed to cope with it.

It rarely happens that the first moment (1.7) or the second moment (1.8) is computationally tractable while the zeroth moment (1.6) is not. Accordingly, we can say that performing Bayesian learning on the parameter $\boldsymbol{w}$ amounts to obtaining the normalized posterior distribution $p(\boldsymbol{w}|\mathcal{D})$. It sometimes happens that computing the predictive distribution (1.9) is still intractable even if the zeroth, the first, and the second moments can be computed based on some approximation. In such a case, the model distribution with the Bayesian estimator plugged in, called the plug-in predictive distribution,

$$p(\mathcal{D}^{\mathrm{new}}|\widehat{\boldsymbol{w}}), \tag{1.10}$$

is used for prediction in practice.
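In one dimension, the integrals in Eqs. (1.6) through (1.9) can be approximated by a plain Riemann sum, which makes the role of the unnormalized posterior explicit. The sketch below assumes a model that is not specified in the text above: samples $x^{(n)} \sim \mathrm{Gauss}(w, 1)$ with prior $w \sim \mathrm{Gauss}(0, 1)$. The function names, the integration range, and the data are illustrative choices.

```python
import math


def gauss(x, mean, var):
    """One-dimensional Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def joint(w, data):
    """Unnormalized posterior p(D, w) = prod_n Gauss(x_n; w, 1) * Gauss(w; 0, 1)."""
    p = gauss(w, 0.0, 1.0)
    for x in data:
        p *= gauss(x, w, 1.0)
    return p


def posterior_summaries(data, x_new, lo=-10.0, hi=10.0, num=20001):
    """Approximate Eqs. (1.6)-(1.10) by a Riemann sum over a grid of w."""
    h = (hi - lo) / (num - 1)
    grid = [lo + i * h for i in range(num)]
    vals = [joint(w, data) for w in grid]
    z = sum(vals) * h                                                    # Eq. (1.6)
    mean = sum(w * v for w, v in zip(grid, vals)) * h / z                # Eq. (1.7)
    var = sum((w - mean) ** 2 * v for w, v in zip(grid, vals)) * h / z   # Eq. (1.8)
    pred = sum(gauss(x_new, w, 1.0) * v for w, v in zip(grid, vals)) * h / z  # Eq. (1.9)
    plug_in = gauss(x_new, mean, 1.0)                                    # Eq. (1.10)
    return z, mean, var, pred, plug_in


data = [0.8, 1.3, 0.4]  # hypothetical observations
z, mean, var, pred, plug_in = posterior_summaries(data, x_new=2.0)
print(f"p(D) = {z:.6f}, mean = {mean:.4f}, variance = {var:.4f}")
print(f"predictive = {pred:.4f}, plug-in predictive = {plug_in:.4f}")
```

For this conjugate Gaussian model the posterior is known to be Gaussian with mean $\sum_n x^{(n)} / (N + 1)$ and variance $1 / (N + 1)$, which gives a reference for the numerical values. The predictive distribution has variance $1 + 1/(N+1)$ and is therefore broader than the plug-in predictive distribution, which ignores the posterior uncertainty. Grid-based integration of this kind does not scale to high-dimensional $\boldsymbol{w}$, which is the bottleneck discussed above.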
1.1.4 Latent Variables

So far, we introduced the observed data set $\mathcal{D}$ as a known variable, and the model parameter $\boldsymbol{w}$ as an unknown variable. In practice, more varieties of known and unknown variables can be involved.

Some probabilistic models have latent variables (or hidden variables) $\boldsymbol{z}$, which can be involved in the original model, or additionally introduced for computational reasons. They are typically attributed to each of the observed samples, and therefore have large degrees of freedom. However, they are just additional unknown variables, and there is no reason in inference to distinguish them from the model parameters $\boldsymbol{w}$.¹ The joint posterior over the parameters and the latent variables is given by Eq. (1.3) with $\boldsymbol{w}$ and $p(\boldsymbol{w})$ replaced with $\bar{\boldsymbol{w}} = (\boldsymbol{w}, \boldsymbol{z})$ and $p(\bar{\boldsymbol{w}}) = p(\boldsymbol{z}|\boldsymbol{w}) p(\boldsymbol{w})$, respectively.

¹ For this reason, the latent variables $\boldsymbol{z}$ and the model parameters $\boldsymbol{w}$ are also called local latent variables and global latent variables, respectively.

Example 1.3 (Mixture models) A mixture model is often used for parametric density estimation (Example 1.1). The model distribution is given by

$$p(\boldsymbol{x}|\boldsymbol{w}) = \sum_{k=1}^K \alpha_k \, p(\boldsymbol{x}|\boldsymbol{\tau}_k), \tag{1.11}$$

where $\boldsymbol{w} = \{\alpha_k, \boldsymbol{\tau}_k;\, \alpha_k \ge 0, \sum_{k=1}^K \alpha_k = 1\}_{k=1}^K$ are the unknown parameters. The mixture model (1.11) is the weighted sum of $K$ distributions, each of which is parameterized by the component parameter $\boldsymbol{\tau}_k$. The domain of the mixing weights $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_K)^\top$, also called the mixture coefficients, forms the standard $(K-1)$-simplex, denoted by $\Delta^{K-1} \equiv \{\boldsymbol{\alpha} \in \mathbb{R}_{+}^K;\, \sum_{k=1}^K \alpha_k = 1\}$ (see Figure 1.1). Figure 1.2 shows an example of the mixture model with three one-dimensional Gaussian components.

[Figure 1.1: $(K-1)$-simplex, $\Delta^{K-1}$, for $K = 3$, i.e., the set $\alpha_1 + \alpha_2 + \alpha_3 = 1$ with axes $\alpha_1$, $\alpha_2$, $\alpha_3$.]

[Figure 1.2: Gaussian mixture.]

The likelihood,

$$p(\mathcal{D}|\boldsymbol{w}) = \prod_{n=1}^N p(\boldsymbol{x}^{(n)}|\boldsymbol{w}) = \prod_{n=1}^N \left( \sum_{k=1}^K \alpha_k \, p(\boldsymbol{x}^{(n)}|\boldsymbol{\tau}_k) \right), \tag{1.12}$$

for $N$ observed i.i.d. samples $\mathcal{D} = \{\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(N)}\}$ has $O(K^N)$ terms, which makes even ML learning intractable. This intractability arises from the summation inside the multiplication in Eq. (1.12). By introducing latent variables, we can turn this summation into a multiplication, and make Eq. (1.12) tractable.

Assume that each sample $\boldsymbol{x}$ belongs to a single component $k$, and is drawn from $p(\boldsymbol{x}|\boldsymbol{\tau}_k)$. To describe the assignment, we introduce a latent variable $\boldsymbol{z} \in \mathcal{Z} \equiv \{\boldsymbol{e}_k\}_{k=1}^K$ associated with each observed sample $\boldsymbol{x}$, where $\boldsymbol{e}_k \in \{0, 1\}^K$ is the $K$-dimensional binary vector, called the one-of-$K$ representation, with one at the $k$th entry and zeros at the other entries:

$$\boldsymbol{e}_k = (\underbrace{0, \ldots, 0, \overset{k\text{th}}{1}, 0, \ldots, 0}_{K})^\top.$$
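Then, we have the following model:

$$p(\boldsymbol{x}, \boldsymbol{z}|\boldsymbol{w}) = p(\boldsymbol{x}|\boldsymbol{z}, \boldsymbol{w}) \, p(\boldsymbol{z}|\boldsymbol{w}), \tag{1.13}$$

where

$$p(\boldsymbol{x}|\boldsymbol{z}, \boldsymbol{w}) = \prod_{k=1}^K \{p(\boldsymbol{x}|\boldsymbol{\tau}_k)\}^{z_k}, \qquad p(\boldsymbol{z}|\boldsymbol{w}) = \prod_{k=1}^K \alpha_k^{z_k}.$$

The conditional distribution (1.13) on the observed variable $\boldsymbol{x}$ and the latent variable $\boldsymbol{z}$ given the parameter $\boldsymbol{w}$ is called the complete likelihood. Note that marginalizing the complete likelihood over the latent variable recovers the original mixture model:

$$p(\boldsymbol{x}|\boldsymbol{w}) = \int_{\mathcal{Z}} p(\boldsymbol{x}, \boldsymbol{z}|\boldsymbol{w}) \, d\boldsymbol{z} = \sum_{\boldsymbol{z} \in \{\boldsymbol{e}_k\}_{k=1}^K} \prod_{k=1}^K \{\alpha_k \, p(\boldsymbol{x}|\boldsymbol{\tau}_k)\}^{z_k} = \sum_{k=1}^K \alpha_k \, p(\boldsymbol{x}|\boldsymbol{\tau}_k).$$

This means that, if samples are generated from the model distribution (1.13), and only $\boldsymbol{x}$ is recorded, the observed data follow the original mixture model (1.11).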
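The marginalization of the complete likelihood can be verified directly for a small example. The sketch below assumes one-dimensional Gaussian components with a common unit variance, so that each component parameter $\tau_k$ reduces to a mean; this choice, the function names, and the numerical values are illustrative and are not taken from the text.

```python
import math


def gauss(x, mean, var):
    """One-dimensional Gaussian density, used here as the component p(x|tau_k)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_density(x, alphas, means, var=1.0):
    """Mixture model, Eq. (1.11): weighted sum of the K component densities."""
    return sum(a * gauss(x, m, var) for a, m in zip(alphas, means))


def complete_likelihood(x, z, alphas, means, var=1.0):
    """Complete likelihood, Eq. (1.13): prod_k {alpha_k p(x|tau_k)}^{z_k}."""
    p = 1.0
    for zk, a, m in zip(z, alphas, means):
        p *= (a * gauss(x, m, var)) ** zk
    return p


def one_of_k(k, num_components):
    """One-of-K vector e_k (0-based index k)."""
    return [1 if j == k else 0 for j in range(num_components)]


alphas = [0.5, 0.3, 0.2]    # hypothetical mixing weights on the 2-simplex
means = [-1.0, 0.0, 1.5]    # hypothetical component means
x = 0.4

# Summing the complete likelihood over all one-of-K assignments of z
marginal = sum(
    complete_likelihood(x, one_of_k(k, len(alphas)), alphas, means)
    for k in range(len(alphas))
)
print(marginal, mixture_density(x, alphas, means))
```

Since exactly one entry of $\boldsymbol{z}$ equals one, each product over $k$ keeps a single factor $\alpha_k p(x|\tau_k)$, and the sum over the $K$ possible assignments reproduces Eq. (1.11).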
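In the literature, latent variables tend to be marginalized out even in MAP learning. For example, the expectation-maximization (EM) algorithm (Dempster et al., 1977), a popular MAP solver for latent variable models, seeks a (local) maximizer of the posterior distribution with the latent variables marginalized out, i.e.,

$$\widehat{\boldsymbol{w}}^{\mathrm{EM}} = \operatorname*{argmax}_{\boldsymbol{w}} \, p(\boldsymbol{w}|\mathcal{D}) = \operatorname*{argmax}_{\boldsymbol{w}} \int_{\mathcal{Z}} p(\mathcal{D}, \boldsymbol{w}, \boldsymbol{z}) \, d\boldsymbol{z}. \tag{1.14}$$

However, we can also maximize the posterior jointly over the parameters and the latent variables, i.e.,

$$(\widehat{\boldsymbol{w}}^{\mathrm{MAP-hard}}, \widehat{\boldsymbol{z}}^{\mathrm{MAP-hard}}) = \operatorname*{argmax}_{\boldsymbol{w}, \boldsymbol{z}} \, p(\boldsymbol{w}, \boldsymbol{z}|\mathcal{D}) = \operatorname*{argmax}_{\boldsymbol{w}, \boldsymbol{z}} \, p(\mathcal{D}, \boldsymbol{w}, \boldsymbol{z}). \tag{1.15}$$

For clustering based on the mixture model in Example 1.3, the EM algorithm (1.14) gives a soft assignment, where the expectation value $\widehat{\boldsymbol{z}}^{\mathrm{EM}} \in \Delta^{K-1} \subset [0, 1]^K$ is substituted into the joint distribution $p(\mathcal{D}, \boldsymbol{w}, \boldsymbol{z})$, while the joint maximization (1.15) gives the hard assignment, where the optimal assignment $\widehat{\boldsymbol{z}}^{\mathrm{MAP-hard}} \in \{\boldsymbol{e}_k\}_{k=1}^K \subset \{0, 1\}^K$ is looked for in the binary domain.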
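The difference between the soft and the hard assignment can be seen for a single sample with the parameters held fixed. The sketch below is not an implementation of the EM algorithm or of the joint maximization (1.15); it only computes, for assumed one-dimensional Gaussian components with unit variance, the expectation of $\boldsymbol{z}$ under $p(\boldsymbol{z}|x, \boldsymbol{w}) \propto \prod_k \{\alpha_k p(x|\tau_k)\}^{z_k}$ and its maximizer. All names and values are hypothetical.

```python
import math


def gauss(x, mean, var):
    """One-dimensional Gaussian density, used here as the component p(x|tau_k)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def soft_assignment(x, alphas, means, var=1.0):
    """Expectation of z under p(z|x, w): a point in the simplex."""
    weights = [a * gauss(x, m, var) for a, m in zip(alphas, means)]
    total = sum(weights)
    return [wt / total for wt in weights]


def hard_assignment(x, alphas, means, var=1.0):
    """Maximizer of p(z|x, w) over the one-of-K vectors."""
    weights = [a * gauss(x, m, var) for a, m in zip(alphas, means)]
    k_best = max(range(len(weights)), key=lambda k: weights[k])
    return [1 if k == k_best else 0 for k in range(len(weights))]


alphas = [0.5, 0.3, 0.2]    # hypothetical mixing weights
means = [-1.0, 0.0, 1.5]    # hypothetical component means
x = 0.4

z_soft = soft_assignment(x, alphas, means)
z_hard = hard_assignment(x, alphas, means)
print("soft:", [round(v, 3) for v in z_soft])
print("hard:", z_hard)
```

The soft assignment keeps the relative plausibility of every component, whereas the hard assignment retains only the most probable one.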
1.1.5 Empirical Bayesian Learning
In many practical cases, it is reasonable to use a prior distribution parameterized by hyperparameters κ. The hyperparameters can be tuned by hand or based on some criterion outside the Bayesian framework. A popular method of the latter kind is cross-validation, where the hyperparameters are tuned so that a (preferably unbiased) estimator of the performance criterion is optimized. In such cases, the hyperparameters should be treated as known variables when Bayesian learning is performed. On the other hand, the hyperparameters can be estimated within the Bayesian framework. In this case, there is again no reason to distinguish the hyperparameters from the other unknown variables (w, z). The joint posterior over all unknown variables is given by Eq. (1.3) with w and p(w) replaced with $\bar{w} = (w, \kappa, z)$ and $p(\bar{w}) = p(z|w)p(w|\kappa)p(\kappa)$, respectively, where p(κ) is called a hyperprior. A popular approach, called empirical Bayesian (EBayes) learning (Efron and Morris, 1973), applies Bayesian learning on w (and z) and point-estimates κ, i.e.,
$$\hat{\kappa}^{\mathrm{EBayes}} = \operatorname*{argmax}_{\kappa} p(D, \kappa) = \operatorname*{argmax}_{\kappa} p(D|\kappa)p(\kappa), \quad \text{where } p(D|\kappa) = \int p(D, w, z|\kappa)\, dw\, dz.$$
1 Bayesian Learning
Here the marginal likelihood p(D|κ) is seen as the likelihood of the hyperparameter κ, and MAP learning is performed by maximizing the joint distribution p(D, κ) of the observed data D and the hyperparameter κ, which can be seen as an unnormalized posterior distribution of the hyperparameter. The hyperprior is often assumed to be flat: p(κ) ∝ 1. With an appropriate design of priors, empirical Bayesian learning combined with approximate Bayesian learning is often used for automatic relevance determination (ARD), where irrelevant degrees of freedom of the statistical model are automatically pruned out. Explaining the ARD property of approximate Bayesian learning is one of the main topics of theoretical analysis in Parts III and IV.
1.2 Computation
Now, let us explain how Bayesian learning is performed in simple cases. We start by introducing conjugacy, an important notion in performing Bayesian learning.
1.2.1 Popular Distributions
Table 1.1 summarizes several distributions that are frequently used as a model distribution (or likelihood function) p(D|w) or a prior distribution p(w) in Bayesian learning. The domain $\mathcal{X}$ of the random variable x and the domain $\mathcal{W}$ of the parameters w are shown in the table. Some of the distributions in Table 1.1 have complicated function forms, involving Beta or Gamma functions. However, such complications are mostly in the normalization constant, and can often be ignored when it is sufficient to find the shape of a function. In Table 1.1, the normalization constant is separated by a dot, so that one can find the simple main part. As will be seen shortly, we often refer to the normalization constant when we need to integrate a function that has the same form as the main part of a popular distribution. Below we summarize abbreviations of distributions:
$$\mathrm{Gauss}_M(x; \mu, \Sigma) \equiv \frac{1}{(2\pi)^{M/2} \det(\Sigma)^{1/2}} \cdot \exp\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right), \tag{1.16}$$
$$\mathrm{Gamma}(x; \alpha, \beta) \equiv \frac{\beta^{\alpha}}{\Gamma(\alpha)} \cdot x^{\alpha-1} \exp(-\beta x), \tag{1.17}$$
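Table 1.1 Popular distributions. The following notation is used: $\mathbb{R}$: the set of all real numbers; $\mathbb{R}_{++}$: the set of all positive real numbers; $\mathbb{I}_{++}$: the set of all positive integers; $\mathbb{S}^M_{++}$: the set of all M × M positive definite matrices; $\mathbb{H}^{K-1}_N \equiv \{x \in \{0, \ldots, N\}^K; \sum_{k=1}^K x_k = N\}$: the set of all possible histograms for N samples and K categories; $\Delta^{K-1} \equiv \{\theta \in [0, 1]^K; \sum_{k=1}^K \theta_k = 1\}$: the standard (K − 1)-simplex; $\det(\cdot)$: determinant of a matrix; $B(y, z) \equiv \int_0^1 t^{y-1}(1 - t)^{z-1}\, dt$: Beta function; $\Gamma(y) \equiv \int_0^\infty t^{y-1} \exp(-t)\, dt$: Gamma function; and $\Gamma_M(y) \equiv \int_{T \in \mathbb{S}^M_{++}} \det(T)^{y-(M+1)/2} \exp(-\mathrm{tr}(T))\, dT$: multivariate Gamma function.

Probability distribution | $p(x \mid w)$ | $x \in \mathcal{X}$ | $w \in \mathcal{W}$
Isotropic Gaussian | $\mathrm{Gauss}_M(x; \mu, \sigma^2 I_M) \equiv \frac{1}{(2\pi\sigma^2)^{M/2}} \cdot \exp\left(-\frac{1}{2\sigma^2}\|x - \mu\|^2\right)$ | $x \in \mathbb{R}^M$ | $\mu \in \mathbb{R}^M, \sigma^2 > 0$
Gaussian | $\mathrm{Gauss}_M(x; \mu, \Sigma) \equiv \frac{1}{(2\pi)^{M/2}\det(\Sigma)^{1/2}} \cdot \exp\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1}(x - \mu)\right)$ | $x \in \mathbb{R}^M$ | $\mu \in \mathbb{R}^M, \Sigma \in \mathbb{S}^M_{++}$
- - - - -
Gamma | $\mathrm{Gamma}(x; \alpha, \beta) \equiv \frac{\beta^\alpha}{\Gamma(\alpha)} \cdot x^{\alpha-1}\exp(-\beta x)$ | $x > 0$ | $\alpha > 0, \beta > 0$
Wishart | $\mathrm{Wishart}_M(X; V, \nu) \equiv \frac{1}{(2^M \det(V))^{\nu/2}\,\Gamma_M(\nu/2)} \cdot \det(X)^{\frac{\nu-M-1}{2}} \exp\left(-\frac{\mathrm{tr}(V^{-1}X)}{2}\right)$ | $X \in \mathbb{S}^M_{++}$ | $V \in \mathbb{S}^M_{++}, \nu > M - 1$
- - - - -
Bernoulli | $\mathrm{Binomial}_1(x; \theta) \equiv \theta^x(1 - \theta)^{1-x}$ | $x \in \{0, 1\}$ | $\theta \in [0, 1]$
Binomial | $\mathrm{Binomial}_N(x; \theta) \equiv \binom{N}{x} \cdot \theta^x(1 - \theta)^{N-x}$ | $x \in \{0, \ldots, N\}$ | $\theta \in [0, 1]$
Multinomial | $\mathrm{Multinomial}_{K,N}(x; \theta) \equiv N! \cdot \prod_{k=1}^K (x_k!)^{-1}\theta_k^{x_k}$ | $x \in \mathbb{H}^{K-1}_N$ | $\theta \in \Delta^{K-1}$
- - - - -
Beta | $\mathrm{Beta}(x; a, b) \equiv \frac{1}{B(a, b)} \cdot x^{a-1}(1 - x)^{b-1}$ | $x \in [0, 1]$ | $a > 0, b > 0$
Dirichlet | $\mathrm{Dirichlet}_K(x; \phi) \equiv \frac{\Gamma(\sum_{k=1}^K \phi_k)}{\prod_{k=1}^K \Gamma(\phi_k)} \cdot \prod_{k=1}^K x_k^{\phi_k - 1}$ | $x \in \Delta^{K-1}$ | $\phi \in \mathbb{R}^K_{++}$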
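$$\mathrm{Wishart}_M(X; V, \nu) \equiv \frac{1}{(2^M \det(V))^{\nu/2}\,\Gamma_M\left(\frac{\nu}{2}\right)} \cdot \det(X)^{\frac{\nu-M-1}{2}} \exp\left(-\frac{\mathrm{tr}(V^{-1}X)}{2}\right), \tag{1.18}$$
$$\mathrm{Binomial}_N(x; \theta) \equiv \binom{N}{x} \cdot \theta^x(1 - \theta)^{N-x}, \tag{1.19}$$
$$\mathrm{Multinomial}_{K,N}(x; \theta) \equiv N! \cdot \prod_{k=1}^K (x_k!)^{-1}\theta_k^{x_k}, \tag{1.20}$$
$$\mathrm{Beta}(x; a, b) \equiv \frac{1}{B(a, b)} \cdot x^{a-1}(1 - x)^{b-1}, \tag{1.21}$$
$$\mathrm{Dirichlet}_K(x; \phi) \equiv \frac{\Gamma\left(\sum_{k=1}^K \phi_k\right)}{\prod_{k=1}^K \Gamma(\phi_k)} \cdot \prod_{k=1}^K x_k^{\phi_k - 1}. \tag{1.22}$$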
The distributions in Table 1.1 are categorized into four groups, which are separated by dashed lines. In each group, an upper distribution family is a special case of a lower distribution family. Note that the following hold:
$$\mathrm{Gamma}(x; \alpha, \beta) = \mathrm{Wishart}_1\left(x; \frac{1}{2\beta}, 2\alpha\right),$$
$$\mathrm{Binomial}_N(x; \theta) = \mathrm{Multinomial}_{2,N}\left((x, N - x)^\top; (\theta, 1 - \theta)^\top\right),$$
$$\mathrm{Beta}(x; a, b) = \mathrm{Dirichlet}_2\left((x, 1 - x)^\top; (a, b)^\top\right).$$
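The last two identities can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names and the test values are illustrative assumptions, not part of the text.

```python
import math

def binomial_pmf(x, n, theta):
    """Binomial_N(x; theta) as in Eq. (1.19)."""
    return math.comb(n, x) * theta**x * (1.0 - theta)**(n - x)

def multinomial_pmf(x, theta):
    """Multinomial_{K,N}(x; theta) as in Eq. (1.20), with N = sum(x)."""
    coef = math.factorial(sum(x))
    for xk in x:
        coef //= math.factorial(xk)  # stays an exact integer at every step
    prob = float(coef)
    for xk, tk in zip(x, theta):
        prob *= tk**xk
    return prob

def beta_pdf(x, a, b):
    """Beta(x; a, b) as in Eq. (1.21)."""
    inv_beta_fn = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))  # 1 / B(a, b)
    return inv_beta_fn * x**(a - 1.0) * (1.0 - x)**(b - 1.0)

def dirichlet_pdf(x, phi):
    """Dirichlet_K(x; phi) as in Eq. (1.22)."""
    dens = math.gamma(sum(phi))
    for pk in phi:
        dens /= math.gamma(pk)
    for xk, pk in zip(x, phi):
        dens *= xk**(pk - 1.0)
    return dens

# Hypothetical test values: the two columns should agree up to rounding.
n, x, theta = 10, 3, 0.3
print(binomial_pmf(x, n, theta), multinomial_pmf((x, n - x), (theta, 1.0 - theta)))
print(beta_pdf(0.2, 2.0, 3.5), dirichlet_pdf((0.2, 0.8), (2.0, 3.5)))
```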
1.2.2 Conjugacy
Let us think about the function form of the posterior (1.3):
$$p(w|D) = \frac{p(D|w)p(w)}{p(D)} \propto p(D|w)p(w),$$
which is determined by the function form of the product of the model likelihood p(D|w) and the prior p(w). Note that we here call the conditional p(D|w) NOT the model distribution but the model likelihood, since we are interested in the function form of the posterior, a distribution of the parameter w. Conjugacy is defined as the relation between the likelihood p(D|w) and the prior p(w).

Definition 1.4 (Conjugate prior) A prior p(w) is called conjugate with a likelihood p(D|w), if the posterior p(w|D) is in the same distribution family as the prior.
1.2.3 Posterior Distribution
Here, we introduce computation of the posterior distribution in simple cases where a conjugate prior exists and is adopted.

Isotropic Gaussian Model
Let us compute the posterior distribution for the isotropic Gaussian model:
$$p(x|w) = \mathrm{Gauss}_M(x; \mu, \sigma^2 I_M) = \frac{1}{(2\pi\sigma^2)^{M/2}} \cdot \exp\left(-\frac{1}{2\sigma^2}\|x - \mu\|^2\right). \tag{1.23}$$
The likelihood for N i.i.d. samples $D = \{x^{(1)}, \ldots, x^{(N)}\}$ is written as
$$p(D|w) = \prod_{n=1}^N p(x^{(n)}|w) = \frac{\exp\left(-\frac{1}{2\sigma^2}\sum_{n=1}^N \|x^{(n)} - \mu\|^2\right)}{(2\pi\sigma^2)^{MN/2}}. \tag{1.24}$$
Gaussian Likelihood
As noted in Section 1.2.2, we should see Eq. (1.24), which is the distribution of observed data D, as a function of the parameter w. Naturally, the function form depends on which parameters are estimated in the Bayesian way. The isotropic Gaussian has two parameters $w = (\mu, \sigma^2)$, and we first consider the case where the variance parameter $\sigma^2$ is known, and the posterior of the mean parameter μ is estimated, i.e., we set w = μ. This case contains the case where $\sigma^2$ is unknown but point-estimated in the empirical Bayesian procedure or tuned outside the Bayesian framework, e.g., by performing cross-validation (we set w = μ, $\kappa = \sigma^2$ in the latter case). Omitting the constant (with respect to μ), the likelihood (1.24) can be written as
$$\begin{aligned}
p(D|\mu) &\propto \exp\left(-\frac{1}{2\sigma^2}\sum_{n=1}^N \|x^{(n)} - \mu\|^2\right) \\
&\propto \exp\left(-\frac{1}{2\sigma^2}\sum_{n=1}^N \|(x^{(n)} - \bar{x}) + (\bar{x} - \mu)\|^2\right) \\
&= \exp\left(-\frac{1}{2\sigma^2}\left(\sum_{n=1}^N \|x^{(n)} - \bar{x}\|^2 + N\|\bar{x} - \mu\|^2\right)\right) \\
&\propto \exp\left(-\frac{N}{2\sigma^2}\|\mu - \bar{x}\|^2\right) \\
&\propto \mathrm{Gauss}_M\left(\mu; \bar{x}, \frac{\sigma^2}{N} I_M\right),
\end{aligned} \tag{1.25}$$
where $\bar{x} = \frac{1}{N}\sum_{n=1}^N x^{(n)}$ is the sample mean. Note that we omitted the factor $\exp\left(-\frac{1}{2\sigma^2}\sum_{n=1}^N \|x^{(n)} - \bar{x}\|^2\right)$ as a constant in the fourth equation.
The last equation (1.25) implies that, as a function of the mean parameter μ, the model likelihood p(D|μ) has the same form as the isotropic Gaussian with mean $\bar{x}$ and variance $\frac{\sigma^2}{N}$. Eq. (1.25) also implies that the ML estimator for the mean parameter is given by
$$\hat{\mu}^{\mathrm{ML}} = \bar{x}.$$
Thus, we found that the likelihood function for the mean parameter of the isotropic Gaussian is in the Gaussian form. This comes from the following facts:
• The isotropic Gaussian model for a single sample x is in the Gaussian form also as a function of the mean parameter, i.e., $\mathrm{Gauss}_M(x; \mu, \sigma^2 I_M) \propto \mathrm{Gauss}_M(\mu; x, \sigma^2 I_M)$.
• The isotropic Gaussians are multiplicatively closed, i.e., the product of isotropic Gaussians with different means is a Gaussian: $p(D|\mu) \propto \prod_{n=1}^N \mathrm{Gauss}_M(\mu; x^{(n)}, \sigma^2 I_M) \propto \mathrm{Gauss}_M\left(\mu; \bar{x}, \frac{\sigma^2}{N} I_M\right)$.
Since the isotropic Gaussian is multiplicatively closed and the likelihood (1.25) is in the Gaussian form, the isotropic Gaussian prior must be conjugate. Let us choose the isotropic Gaussian prior,
$$p(\mu|\mu_0, \sigma_0^2) = \mathrm{Gauss}_M(\mu; \mu_0, \sigma_0^2 I_M) \propto \exp\left(-\frac{1}{2\sigma_0^2}\|\mu - \mu_0\|^2\right),$$
for hyperparameters $\kappa = (\mu_0, \sigma_0^2)$. Then, the function form of the posterior is given by
$$\begin{aligned}
p(\mu|D, \mu_0, \sigma_0^2) &\propto p(D|\mu)\, p(\mu|\mu_0, \sigma_0^2) \\
&\propto \mathrm{Gauss}_M\left(\mu; \bar{x}, \frac{\sigma^2}{N} I_M\right) \mathrm{Gauss}_M(\mu; \mu_0, \sigma_0^2 I_M) \\
&\propto \exp\left(-\frac{N}{2\sigma^2}\|\mu - \bar{x}\|^2 - \frac{1}{2\sigma_0^2}\|\mu - \mu_0\|^2\right) \\
&\propto \exp\left(-\frac{N\sigma^{-2} + \sigma_0^{-2}}{2}\left\|\mu - \frac{N\sigma^{-2}\bar{x} + \sigma_0^{-2}\mu_0}{N\sigma^{-2} + \sigma_0^{-2}}\right\|^2\right) \\
&\propto \mathrm{Gauss}_M\left(\mu; \frac{N\sigma^{-2}\bar{x} + \sigma_0^{-2}\mu_0}{N\sigma^{-2} + \sigma_0^{-2}}, \frac{1}{N\sigma^{-2} + \sigma_0^{-2}} I_M\right).
\end{aligned}$$
Therefore, the posterior is
$$p(\mu|D, \mu_0, \sigma_0^2) = \mathrm{Gauss}_M\left(\mu; \frac{N\sigma^{-2}\bar{x} + \sigma_0^{-2}\mu_0}{N\sigma^{-2} + \sigma_0^{-2}}, \frac{1}{N\sigma^{-2} + \sigma_0^{-2}} I_M\right). \tag{1.26}$$
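The update in Eq. (1.26) is a precision-weighted average of the sample mean and the prior mean. A minimal sketch in plain Python follows; the function name, the list-based data layout, and the example values are assumptions made for illustration.

```python
def isotropic_gaussian_mean_posterior(samples, sigma2, mu0, sigma0_2):
    """Posterior mean vector and scalar variance of Eq. (1.26).

    samples  : list of N vectors (lists of length M)
    sigma2   : known variance of the isotropic Gaussian model
    mu0      : prior mean vector (length M)
    sigma0_2 : prior variance
    """
    n = len(samples)
    dim = len(mu0)
    # Sample mean x-bar, computed per dimension.
    xbar = [sum(x[d] for x in samples) / n for d in range(dim)]
    # Posterior precision N * sigma^{-2} + sigma_0^{-2}.
    precision = n / sigma2 + 1.0 / sigma0_2
    mean = [(n / sigma2 * xbar[d] + mu0[d] / sigma0_2) / precision
            for d in range(dim)]
    return mean, 1.0 / precision

# Hypothetical data: two 2-dimensional samples, unit model and prior variances.
post_mean, post_var = isotropic_gaussian_mean_posterior(
    [[1.0, 2.0], [3.0, 4.0]], sigma2=1.0, mu0=[0.0, 0.0], sigma0_2=1.0)
print(post_mean, post_var)
```

As N grows, the posterior mean approaches the ML estimator (the sample mean) and the posterior variance shrinks at the rate 1/N.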
Note that the equality holds in Eq. (1.26). We omitted constant factors in the preceding derivation. But once the function form of the posterior is found, the normalization factor is unique. If the function form coincides with one of the well-known distributions (e.g., ones given in Table 1.1), one can find the normalization constant (from the table) without any further computation. Multiplicative closedness of a function family of the model likelihood is essential in performing Bayesian learning. Such families are called the exponential family:

Definition 1.5 (Exponential families) A family of distributions is called the exponential family if it is written as
$$p(x|w) = p(t|\eta) = \exp\left(\eta^\top t - A(\eta) + B(t)\right), \tag{1.27}$$
where t = t(x) is a function, called sufficient statistics, of the random variable x, and η = η(w) is a function, called natural parameters, of the parameter w.

The essential property of the exponential family is that the interaction between the random variable and the parameter occurs only in the log linear form, i.e., $\exp(\eta^\top t)$. Note that, although A(·) and B(·) are arbitrary functions, A(·) does not depend on t, and B(·) does not depend on η. Assume that N observed samples $D = (t^{(1)}, \ldots, t^{(N)}) = (t(x^{(1)}), \ldots, t(x^{(N)}))$ are drawn from the exponential family distribution (1.27). If we use the exponential family prior
$$p(\eta) = \exp\left(\eta^\top t^{(0)} - A_0(\eta) + B_0(t^{(0)})\right),$$
then the posterior is given as an exponential family distribution with the same set of natural parameters η:
$$p(\eta|D) = \exp\left(\eta^\top \sum_{n=0}^N t^{(n)} - A'(\eta) + B'(D)\right),$$
where $A'(\eta)$ and $B'(D)$ are a function of η and a function of D, respectively. Therefore, the conjugate prior for the exponential family distribution is the exponential family with the same natural parameters η. All distributions given in Table 1.1 are exponential families. For example, the natural parameters and the sufficient statistics for the univariate Gaussian are given by $\eta = \left(\frac{\mu}{\sigma^2}, -\frac{1}{2\sigma^2}\right)^\top$ and $t = (x, x^2)^\top$, respectively. The mixture model (1.11) is a common nonexponential family distribution.

Gamma Likelihood
Next we consider the posterior distribution of the variance parameter $\sigma^2$ with the mean parameter regarded as a constant, i.e., $w = \sigma^2$.
Omitting the constants (with respect to $\sigma^2$) of the model likelihood (1.24), we have
$$p(D|\sigma^2) \propto (\sigma^2)^{-MN/2} \exp\left(-\frac{1}{2\sigma^2}\sum_{n=1}^N \|x^{(n)} - \mu\|^2\right).$$
If we see the likelihood as a function of the inverse of $\sigma^2$, we find that it is proportional to the Gamma distribution:
$$\begin{aligned}
p(D|\sigma^{-2}) &\propto (\sigma^{-2})^{MN/2} \exp\left(-\left(\frac{1}{2}\sum_{n=1}^N \|x^{(n)} - \mu\|^2\right)\sigma^{-2}\right) \\
&\propto \mathrm{Gamma}\left(\sigma^{-2}; \frac{MN}{2} + 1, \frac{1}{2}\sum_{n=1}^N \|x^{(n)} - \mu\|^2\right).
\end{aligned} \tag{1.28}$$
Since the mode of the Gamma distribution is known as $\operatorname*{argmax}_x \mathrm{Gamma}(x; \alpha, \beta) = \frac{\alpha - 1}{\beta}$, Eq. (1.28) implies that the ML estimator for the variance parameter is given by
$$\hat{\sigma}^{2\,\mathrm{ML}} = \frac{1}{\hat{\sigma}^{-2\,\mathrm{ML}}} = \frac{\frac{1}{2}\sum_{n=1}^N \|x^{(n)} - \mu\|^2}{\frac{MN}{2} + 1 - 1} = \frac{1}{MN}\sum_{n=1}^N \|x^{(n)} - \mu\|^2.$$
Now we found that the model likelihood of the isotropic Gaussian is in the Gamma form as a function of the inverse variance $\sigma^{-2}$. Since the Gamma distribution is in the exponential family and multiplicatively closed, the Gamma prior is conjugate. If we use the Gamma prior
$$p(\sigma^{-2}|\alpha_0, \beta_0) = \mathrm{Gamma}(\sigma^{-2}; \alpha_0, \beta_0) \propto (\sigma^{-2})^{\alpha_0 - 1} \exp(-\beta_0 \sigma^{-2})$$
with hyperparameters $\kappa = (\alpha_0, \beta_0)$, the posterior can be written as
$$\begin{aligned}
p(\sigma^{-2}|D, \alpha_0, \beta_0) &\propto p(D|\sigma^{-2})\, p(\sigma^{-2}|\alpha_0, \beta_0) \\
&\propto \mathrm{Gamma}\left(\sigma^{-2}; \frac{MN}{2} + 1, \frac{1}{2}\sum_{n=1}^N \|x^{(n)} - \mu\|^2\right) \mathrm{Gamma}(\sigma^{-2}; \alpha_0, \beta_0) \\
&\propto (\sigma^{-2})^{MN/2 + \alpha_0 - 1} \exp\left(-\left(\frac{1}{2}\sum_{n=1}^N \|x^{(n)} - \mu\|^2 + \beta_0\right)\sigma^{-2}\right),
\end{aligned}$$
and therefore
$$p(\sigma^{-2}|D, \alpha_0, \beta_0) = \mathrm{Gamma}\left(\sigma^{-2}; \frac{MN}{2} + \alpha_0, \frac{1}{2}\sum_{n=1}^N \|x^{(n)} - \mu\|^2 + \beta_0\right). \tag{1.29}$$
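The ML estimator and the Gamma posterior (1.29) only require the summed squared distance to the known mean. A minimal sketch in plain Python, with hypothetical function and variable names:

```python
def isotropic_variance_inference(samples, mu, alpha0, beta0):
    """ML estimate of sigma^2 and the Gamma posterior parameters of Eq. (1.29).

    samples : list of N vectors (lists of length M)
    mu      : known mean vector (length M)
    alpha0, beta0 : shape and rate hyperparameters of the Gamma prior on sigma^{-2}
    """
    n = len(samples)
    dim = len(mu)
    # Sum over samples of the squared Euclidean distance to the known mean.
    sq_dist = sum((x[d] - mu[d]) ** 2 for x in samples for d in range(dim))
    sigma2_ml = sq_dist / (dim * n)
    alpha_post = dim * n / 2.0 + alpha0   # shape: MN/2 + alpha_0
    beta_post = sq_dist / 2.0 + beta0     # rate: (1/2) sum ||x - mu||^2 + beta_0
    return sigma2_ml, alpha_post, beta_post

# Hypothetical data: three 2-dimensional samples around a zero mean.
print(isotropic_variance_inference(
    [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]], mu=[0.0, 0.0], alpha0=1.0, beta0=2.0))
```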
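Isotropic Gauss-Gamma Likelihood
Finally, we consider the general case where both the mean and variance parameters are unknown, i.e., $w = (\mu, \sigma^{-2})$. The likelihood is written as
$$\begin{aligned}
p(D|\mu, \sigma^{-2}) &\propto (\sigma^{-2})^{MN/2} \exp\left(-\left(\frac{1}{2}\sum_{n=1}^N \|x^{(n)} - \mu\|^2\right)\sigma^{-2}\right) \\
&= (\sigma^{-2})^{MN/2} \exp\left(-\left(\frac{N\|\mu - \bar{x}\|^2}{2} + \frac{\sum_{n=1}^N \|x^{(n)} - \bar{x}\|^2}{2}\right)\sigma^{-2}\right) \\
&\propto \mathrm{GaussGamma}_M\left(\mu, \sigma^{-2} \,\Big|\, \bar{x}, N I_M, \frac{M(N-1)}{2} + 1, \frac{\sum_{n=1}^N \|x^{(n)} - \bar{x}\|^2}{2}\right),
\end{aligned}$$
where
$$\begin{aligned}
\mathrm{GaussGamma}_M(x, \tau|\mu, \lambda I_M, \alpha, \beta) &\equiv \mathrm{Gauss}_M(x|\mu, (\tau\lambda)^{-1} I_M) \cdot \mathrm{Gamma}(\tau|\alpha, \beta) \\
&= \frac{\exp\left(-\frac{\tau\lambda}{2}\|x - \mu\|^2\right)}{(2\pi(\tau\lambda)^{-1})^{M/2}} \cdot \frac{\beta^\alpha}{\Gamma(\alpha)} \tau^{\alpha-1} \exp(-\beta\tau) \\
&= \frac{\beta^\alpha}{(2\pi/\lambda)^{M/2}\,\Gamma(\alpha)} \tau^{\alpha + \frac{M}{2} - 1} \exp\left(-\left(\frac{\lambda\|x - \mu\|^2}{2} + \beta\right)\tau\right)
\end{aligned}$$
is the isotropic Gauss-Gamma distribution on the random variables $x \in \mathbb{R}^M$, τ > 0 with parameters $\mu \in \mathbb{R}^M$, λ > 0, α > 0, β > 0. Note that, although the isotropic Gauss-Gamma distribution is the product of an isotropic Gaussian distribution and a Gamma distribution, the random variables x and τ are not independent of each other. This is because the isotropic Gauss-Gamma distribution is a hierarchical model p(x|τ)p(τ), where the variance parameter $\sigma^2 = (\tau\lambda)^{-1}$ for the isotropic Gaussian depends on the random variable τ of the Gamma distribution. Since the isotropic Gauss-Gamma distribution is multiplicatively closed, it is a conjugate prior. Choosing the isotropic Gauss-Gamma prior
$$p(\mu, \sigma^{-2}|\mu_0, \lambda_0, \alpha_0, \beta_0) = \mathrm{GaussGamma}_M(\mu, \sigma^{-2}|\mu_0, \lambda_0 I_M, \alpha_0, \beta_0) \propto (\sigma^{-2})^{\alpha_0 + \frac{M}{2} - 1} \exp\left(-\left(\frac{\lambda_0\|\mu - \mu_0\|^2}{2} + \beta_0\right)\sigma^{-2}\right)$$
with hyperparameters $\kappa = (\mu_0, \lambda_0, \alpha_0, \beta_0)$, the posterior is given by
$$\begin{aligned}
p(\mu, \sigma^{-2}|D, \kappa) &\propto p(D|\mu, \sigma^{-2})\, p(\mu, \sigma^{-2}|\kappa) \\
&\propto \mathrm{GaussGamma}_M\left(\mu, \sigma^{-2} \,\Big|\, \bar{x}, N I_M, \frac{M(N-1)}{2} + 1, \frac{\sum_{n=1}^N \|x^{(n)} - \bar{x}\|^2}{2}\right) \cdot \mathrm{GaussGamma}_M(\mu, \sigma^{-2}|\mu_0, \lambda_0 I_M, \alpha_0, \beta_0) \\
&\propto (\sigma^{-2})^{MN/2} \exp\left(-\left(\frac{N\|\mu - \bar{x}\|^2}{2} + \frac{\sum_{n=1}^N \|x^{(n)} - \bar{x}\|^2}{2}\right)\sigma^{-2}\right) \cdot (\sigma^{-2})^{\alpha_0 + \frac{M}{2} - 1} \exp\left(-\left(\frac{\lambda_0\|\mu - \mu_0\|^2}{2} + \beta_0\right)\sigma^{-2}\right) \\
&\propto (\sigma^{-2})^{M(N+1)/2 + \alpha_0 - 1} \exp\left(-\left(\frac{N\|\mu - \bar{x}\|^2 + \lambda_0\|\mu - \mu_0\|^2}{2} + \frac{\sum_{n=1}^N \|x^{(n)} - \bar{x}\|^2}{2} + \beta_0\right)\sigma^{-2}\right) \\
&\propto (\sigma^{-2})^{\hat{\alpha} + \frac{M}{2} - 1} \exp\left(-\left(\frac{\hat{\lambda}\|\mu - \hat{\mu}\|^2}{2} + \hat{\beta}\right)\sigma^{-2}\right),
\end{aligned}$$
where
$$\hat{\mu} = \frac{N\bar{x} + \lambda_0\mu_0}{N + \lambda_0}, \qquad \hat{\lambda} = N + \lambda_0, \qquad \hat{\alpha} = \frac{MN}{2} + \alpha_0, \qquad \hat{\beta} = \frac{\sum_{n=1}^N \|x^{(n)} - \bar{x}\|^2}{2} + \frac{N\lambda_0\|\bar{x} - \mu_0\|^2}{2(N + \lambda_0)} + \beta_0.$$
Thus, the posterior is obtained as
$$p(\mu, \sigma^{-2}|D, \kappa) = \mathrm{GaussGamma}_M(\mu, \sigma^{-2}|\hat{\mu}, \hat{\lambda} I_M, \hat{\alpha}, \hat{\beta}). \tag{1.30}$$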
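The four posterior parameters of Eq. (1.30) can be computed in a single pass over the data. The following plain-Python sketch is illustrative only; the function name and the example values are assumptions.

```python
def isotropic_gauss_gamma_posterior(samples, mu0, lambda0, alpha0, beta0):
    """Posterior parameters (mu_hat, lambda_hat, alpha_hat, beta_hat) of Eq. (1.30)."""
    n = len(samples)
    dim = len(mu0)
    xbar = [sum(x[d] for x in samples) / n for d in range(dim)]
    # Scatter of the samples around the sample mean.
    scatter = sum((x[d] - xbar[d]) ** 2 for x in samples for d in range(dim))
    # Squared distance between the sample mean and the prior mean.
    shift = sum((xbar[d] - mu0[d]) ** 2 for d in range(dim))
    mu_hat = [(n * xbar[d] + lambda0 * mu0[d]) / (n + lambda0) for d in range(dim)]
    lambda_hat = n + lambda0
    alpha_hat = dim * n / 2.0 + alpha0
    beta_hat = scatter / 2.0 + n * lambda0 * shift / (2.0 * (n + lambda0)) + beta0
    return mu_hat, lambda_hat, alpha_hat, beta_hat

# Hypothetical one-dimensional example with two samples.
print(isotropic_gauss_gamma_posterior(
    [[0.0], [2.0]], mu0=[0.0], lambda0=2.0, alpha0=1.0, beta0=1.0))
```

The second term of `beta_hat` grows when the sample mean disagrees with the prior mean, which inflates the posterior estimate of the noise level.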
Although the Gauss-Gamma distribution seems a bit more complicated than the ones in Table 1.1, its moments are known. Therefore, Bayesian learning with a conjugate prior can be analytically performed also when both parameters $w = (\mu, \sigma^{-2})$ are estimated.

Gaussian Model
Bayesian learning can be performed for a general Gaussian model in a similar fashion to the isotropic case. Consider the M-dimensional Gaussian distribution,
$$p(x|w) = \mathrm{Gauss}_M(x; \mu, \Sigma) \equiv \frac{1}{(2\pi)^{M/2}\det(\Sigma)^{1/2}} \cdot \exp\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1}(x - \mu)\right) \tag{1.31}$$
with mean and covariance parameters w = (μ, Σ). The likelihood for N i.i.d. samples $D = \{x^{(1)}, \ldots, x^{(N)}\}$ is written as
$$p(D|w) = \prod_{n=1}^N p(x^{(n)}|w) = \frac{\exp\left(-\frac{1}{2}\sum_{n=1}^N (x^{(n)} - \mu)^\top \Sigma^{-1}(x^{(n)} - \mu)\right)}{(2\pi)^{NM/2}\det(\Sigma)^{N/2}}. \tag{1.32}$$
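Gaussian Likelihood
Let us first compute the posterior distribution on the mean parameter μ, with the covariance parameter regarded as a known constant. In this case, the likelihood can be written as
$$\begin{aligned}
p(D|\mu) &\propto \exp\left(-\frac{1}{2}\sum_{n=1}^N (x^{(n)} - \mu)^\top \Sigma^{-1}(x^{(n)} - \mu)\right) \\
&\propto \exp\left(-\frac{1}{2}\sum_{n=1}^N \left((x^{(n)} - \bar{x}) + (\bar{x} - \mu)\right)^\top \Sigma^{-1}\left((x^{(n)} - \bar{x}) + (\bar{x} - \mu)\right)\right) \\
&= \exp\left(-\frac{1}{2}\left(\sum_{n=1}^N (x^{(n)} - \bar{x})^\top \Sigma^{-1}(x^{(n)} - \bar{x}) + N(\bar{x} - \mu)^\top \Sigma^{-1}(\bar{x} - \mu)\right)\right) \\
&\propto \exp\left(-\frac{N}{2}(\mu - \bar{x})^\top \Sigma^{-1}(\mu - \bar{x})\right) \\
&\propto \mathrm{Gauss}_M\left(\mu; \bar{x}, \frac{1}{N}\Sigma\right).
\end{aligned} \tag{1.33}$$
Therefore, with the conjugate Gaussian prior
$$p(\mu|\mu_0, \Sigma_0) = \mathrm{Gauss}_M(\mu; \mu_0, \Sigma_0) \propto \exp\left(-\frac{1}{2}(\mu - \mu_0)^\top \Sigma_0^{-1}(\mu - \mu_0)\right),$$
with hyperparameters $\kappa = (\mu_0, \Sigma_0)$, the posterior is written as
$$\begin{aligned}
p(\mu|D, \mu_0, \Sigma_0) &\propto p(D|\mu)\, p(\mu|\mu_0, \Sigma_0) \\
&\propto \mathrm{Gauss}_M\left(\mu; \bar{x}, \frac{1}{N}\Sigma\right) \mathrm{Gauss}_M(\mu; \mu_0, \Sigma_0) \\
&\propto \exp\left(-\frac{N(\mu - \bar{x})^\top \Sigma^{-1}(\mu - \bar{x}) + (\mu - \mu_0)^\top \Sigma_0^{-1}(\mu - \mu_0)}{2}\right) \\
&\propto \exp\left(-\frac{(\mu - \hat{\mu})^\top \hat{\Sigma}^{-1}(\mu - \hat{\mu})}{2}\right),
\end{aligned}$$
where
$$\hat{\mu} = \left(N\Sigma^{-1} + \Sigma_0^{-1}\right)^{-1}\left(N\Sigma^{-1}\bar{x} + \Sigma_0^{-1}\mu_0\right), \qquad \hat{\Sigma} = \left(N\Sigma^{-1} + \Sigma_0^{-1}\right)^{-1}.$$
Thus, we have
$$p(\mu|D, \mu_0, \Sigma_0) = \mathrm{Gauss}_M(\mu; \hat{\mu}, \hat{\Sigma}). \tag{1.34}$$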
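Eq. (1.34) translates directly into a few matrix operations. The sketch below assumes NumPy is available; the function name and the example data are illustrative, and explicit inverses are used for readability rather than numerical robustness.

```python
import numpy as np

def gaussian_mean_posterior(X, Sigma, mu0, Sigma0):
    """Posterior mean and covariance of Eq. (1.34).

    X      : (N, M) array whose rows are the samples
    Sigma  : (M, M) known covariance of the model
    mu0    : (M,) prior mean
    Sigma0 : (M, M) prior covariance
    """
    N = X.shape[0]
    xbar = X.mean(axis=0)
    Sigma_inv = np.linalg.inv(Sigma)
    Sigma0_inv = np.linalg.inv(Sigma0)
    # Posterior covariance (N Sigma^{-1} + Sigma_0^{-1})^{-1}.
    Sigma_hat = np.linalg.inv(N * Sigma_inv + Sigma0_inv)
    mu_hat = Sigma_hat @ (N * Sigma_inv @ xbar + Sigma0_inv @ mu0)
    return mu_hat, Sigma_hat

# Hypothetical data; with Sigma and Sigma0 proportional to the identity
# this reduces to the isotropic result (1.26).
X = np.array([[1.0, 2.0], [3.0, 4.0]])
mu_hat, Sigma_hat = gaussian_mean_posterior(X, np.eye(2), np.zeros(2), np.eye(2))
print(mu_hat)
print(Sigma_hat)
```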
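Wishart Likelihood
If we see the mean parameter μ as a given constant, the model likelihood (1.32) can be written as follows, as a function of the covariance parameter Σ:
$$\begin{aligned}
p(D|\Sigma^{-1}) &\propto \det(\Sigma^{-1})^{N/2} \exp\left(-\frac{\sum_{n=1}^N (x^{(n)} - \mu)^\top \Sigma^{-1}(x^{(n)} - \mu)}{2}\right) \\
&\propto \det(\Sigma^{-1})^{N/2} \exp\left(-\frac{\mathrm{tr}\left(\sum_{n=1}^N (x^{(n)} - \mu)(x^{(n)} - \mu)^\top \Sigma^{-1}\right)}{2}\right) \\
&\propto \mathrm{Wishart}_M\left(\Sigma^{-1}; \left(\sum_{n=1}^N (x^{(n)} - \mu)(x^{(n)} - \mu)^\top\right)^{-1}, M + N + 1\right).
\end{aligned}$$
Here, as in the isotropic Gaussian case, we computed the distribution on the inverse $\Sigma^{-1}$ of the covariance parameter. With the Wishart distribution
$$p(\Sigma^{-1}|V_0, \nu_0) = \mathrm{Wishart}_M(\Sigma^{-1}; V_0, \nu_0) = \frac{1}{(2^M \det(V_0))^{\nu_0/2}\,\Gamma_M\left(\frac{\nu_0}{2}\right)} \cdot \det(\Sigma^{-1})^{\frac{\nu_0 - M - 1}{2}} \exp\left(-\frac{\mathrm{tr}(V_0^{-1}\Sigma^{-1})}{2}\right)$$
for hyperparameters $\kappa = (V_0, \nu_0)$ as a conjugate prior, the posterior is computed as
$$\begin{aligned}
p(\Sigma^{-1}|D, V_0, \nu_0) &\propto p(D|\Sigma^{-1})\, p(\Sigma^{-1}|V_0, \nu_0) \\
&\propto \mathrm{Wishart}_M\left(\Sigma^{-1}; \left(\sum_{n=1}^N (x^{(n)} - \mu)(x^{(n)} - \mu)^\top\right)^{-1}, M + N + 1\right) \cdot \mathrm{Wishart}_M(\Sigma^{-1}; V_0, \nu_0) \\
&\propto \det(\Sigma^{-1})^{\frac{N}{2}} \exp\left(-\frac{\mathrm{tr}\left(\sum_{n=1}^N (x^{(n)} - \mu)(x^{(n)} - \mu)^\top \Sigma^{-1}\right)}{2}\right) \cdot \det(\Sigma^{-1})^{\frac{\nu_0 - M - 1}{2}} \exp\left(-\frac{\mathrm{tr}(V_0^{-1}\Sigma^{-1})}{2}\right) \\
&\propto \det(\Sigma^{-1})^{\frac{\nu_0 - M + N - 1}{2}} \exp\left(-\frac{\mathrm{tr}\left(\left(\sum_{n=1}^N (x^{(n)} - \mu)(x^{(n)} - \mu)^\top + V_0^{-1}\right)\Sigma^{-1}\right)}{2}\right).
\end{aligned}$$
Thus we have
$$p(\Sigma^{-1}|D, V_0, \nu_0) = \mathrm{Wishart}_M\left(\Sigma^{-1}; \left(\sum_{n=1}^N (x^{(n)} - \mu)(x^{(n)} - \mu)^\top + V_0^{-1}\right)^{-1}, N + \nu_0\right). \tag{1.35}$$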
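The Wishart update (1.35) adds the scatter matrix around the known mean to the inverse prior scale and increments the degrees of freedom by N. A minimal NumPy sketch with hypothetical names and data:

```python
import numpy as np

def wishart_precision_posterior(X, mu, V0, nu0):
    """Scale matrix and degrees of freedom of the Wishart posterior (1.35).

    X   : (N, M) array whose rows are the samples
    mu  : (M,) known mean
    V0  : (M, M) prior scale matrix
    nu0 : prior degrees of freedom
    """
    diffs = X - mu
    scatter = diffs.T @ diffs  # sum_n (x_n - mu)(x_n - mu)^T
    V_hat = np.linalg.inv(scatter + np.linalg.inv(V0))
    nu_hat = X.shape[0] + nu0
    return V_hat, nu_hat

# Hypothetical data: two samples in two dimensions.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
V_hat, nu_hat = wishart_precision_posterior(X, np.zeros(2), np.eye(2), 3.0)
print(V_hat, nu_hat)
print(nu_hat * V_hat)  # posterior mean of the precision, Mean(X) = nu * V
```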
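Note that the Wishart distribution can be seen as a multivariate extension of the Gamma distribution and is reduced to the Gamma distribution for M = 1:
$$\mathrm{Wishart}_1(x; V, \nu) = \mathrm{Gamma}(x; \nu/2, 1/(2V)).$$

Gauss-Wishart Likelihood
When both parameters $w = (\mu, \Sigma^{-1})$ are unknown, the model likelihood (1.32) is seen as
$$\begin{aligned}
p(D|\mu, \Sigma^{-1}) &\propto \det(\Sigma^{-1})^{N/2} \exp\left(-\frac{\sum_{n=1}^N (x^{(n)} - \mu)^\top \Sigma^{-1}(x^{(n)} - \mu)}{2}\right) \\
&\propto \det(\Sigma^{-1})^{N/2} \exp\left(-\frac{\mathrm{tr}\left(\sum_{n=1}^N (x^{(n)} - \mu)(x^{(n)} - \mu)^\top \Sigma^{-1}\right)}{2}\right) \\
&\propto \det(\Sigma^{-1})^{N/2} \exp\left(-\frac{\mathrm{tr}\left(\sum_{n=1}^N \left((x^{(n)} - \bar{x}) + (\bar{x} - \mu)\right)\left((x^{(n)} - \bar{x}) + (\bar{x} - \mu)\right)^\top \Sigma^{-1}\right)}{2}\right) \\
&\propto \det(\Sigma^{-1})^{N/2} \exp\left(-\frac{\mathrm{tr}\left(\left(N(\mu - \bar{x})(\mu - \bar{x})^\top + \sum_{n=1}^N (x^{(n)} - \bar{x})(x^{(n)} - \bar{x})^\top\right)\Sigma^{-1}\right)}{2}\right) \\
&\propto \mathrm{GaussWishart}_M\left(\mu, \Sigma^{-1}; \bar{x}, N, \left(\sum_{n=1}^N (x^{(n)} - \bar{x})(x^{(n)} - \bar{x})^\top\right)^{-1}, M + N\right),
\end{aligned}$$
where
$$\begin{aligned}
\mathrm{GaussWishart}_M(x, \Lambda|\mu, \lambda, V, \nu) &\equiv \mathrm{Gauss}_M(x|\mu, (\lambda\Lambda)^{-1})\, \mathrm{Wishart}_M(\Lambda|V, \nu) \\
&= \frac{\exp\left(-\frac{\lambda}{2}(x - \mu)^\top \Lambda (x - \mu)\right)}{(2\pi)^{M/2}\det(\lambda\Lambda)^{-1/2}} \cdot \frac{\det(\Lambda)^{\frac{\nu - M - 1}{2}} \exp\left(-\frac{\mathrm{tr}(V^{-1}\Lambda)}{2}\right)}{(2^M \det(V))^{\nu/2}\,\Gamma_M\left(\frac{\nu}{2}\right)} \\
&= \frac{\lambda^{M/2}}{(2\pi)^{M/2}(2^M \det(V))^{\nu/2}\,\Gamma_M\left(\frac{\nu}{2}\right)} \det(\Lambda)^{\frac{\nu - M}{2}} \exp\left(-\frac{\mathrm{tr}\left(\left(\lambda(x - \mu)(x - \mu)^\top + V^{-1}\right)\Lambda\right)}{2}\right)
\end{aligned}$$
is the Gauss–Wishart distribution on the random variables $x \in \mathbb{R}^M$, $\Lambda \in \mathbb{S}^M_{++}$ with parameters $\mu \in \mathbb{R}^M$, λ > 0, $V \in \mathbb{S}^M_{++}$, ν > M − 1. With the conjugate Gauss–Wishart prior,
$$p(\mu, \Sigma^{-1}|\mu_0, \lambda_0, V_0, \nu_0) = \mathrm{GaussWishart}_M(\mu, \Sigma^{-1}|\mu_0, \lambda_0, V_0, \nu_0) \propto \det(\Sigma^{-1})^{\frac{\nu_0 - M}{2}} \exp\left(-\frac{\mathrm{tr}\left(\left(\lambda_0(\mu - \mu_0)(\mu - \mu_0)^\top + V_0^{-1}\right)\Sigma^{-1}\right)}{2}\right)$$
with hyperparameters $\kappa = (\mu_0, \lambda_0, V_0, \nu_0)$, the posterior is written as
$$\begin{aligned}
p(\mu, \Sigma^{-1}|D, \kappa) &\propto p(D|\mu, \Sigma^{-1})\, p(\mu, \Sigma^{-1}|\kappa) \\
&\propto \mathrm{GaussWishart}_M\left(\mu, \Sigma^{-1}; \bar{x}, N, \left(\sum_{n=1}^N (x^{(n)} - \bar{x})(x^{(n)} - \bar{x})^\top\right)^{-1}, M + N\right) \cdot \mathrm{GaussWishart}_M(\mu, \Sigma^{-1}|\mu_0, \lambda_0, V_0, \nu_0) \\
&\propto \det(\Sigma^{-1})^{N/2} \exp\left(-\frac{\mathrm{tr}\left(\left(N(\mu - \bar{x})(\mu - \bar{x})^\top + \sum_{n=1}^N (x^{(n)} - \bar{x})(x^{(n)} - \bar{x})^\top\right)\Sigma^{-1}\right)}{2}\right) \\
&\qquad \cdot \det(\Sigma^{-1})^{\frac{\nu_0 - M}{2}} \exp\left(-\frac{\mathrm{tr}\left(\left(\lambda_0(\mu - \mu_0)(\mu - \mu_0)^\top + V_0^{-1}\right)\Sigma^{-1}\right)}{2}\right) \\
&\propto \det(\Sigma^{-1})^{\frac{\hat{\nu} - M}{2}} \exp\left(-\frac{\mathrm{tr}\left(\left(\hat{\lambda}(\mu - \hat{\mu})(\mu - \hat{\mu})^\top + \hat{V}^{-1}\right)\Sigma^{-1}\right)}{2}\right),
\end{aligned}$$
where
$$\hat{\mu} = \frac{N\bar{x} + \lambda_0\mu_0}{N + \lambda_0}, \qquad \hat{\lambda} = N + \lambda_0, \qquad \hat{V} = \left(\sum_{n=1}^N (x^{(n)} - \bar{x})(x^{(n)} - \bar{x})^\top + \frac{N\lambda_0}{N + \lambda_0}(\bar{x} - \mu_0)(\bar{x} - \mu_0)^\top + V_0^{-1}\right)^{-1}, \qquad \hat{\nu} = N + \nu_0.$$
Thus, we have the posterior distribution as the Gauss–Wishart distribution:
$$p(\mu, \Sigma^{-1}|D, \kappa) = \mathrm{GaussWishart}_M\left(\mu, \Sigma^{-1}|\hat{\mu}, \hat{\lambda}, \hat{V}, \hat{\nu}\right). \tag{1.36}$$

Linear Regression Model
Consider the linear regression model, where an input variable $x \in \mathbb{R}^M$ and an output variable $y \in \mathbb{R}$ are assumed to satisfy the following probabilistic relation:
$$y = a^\top x + \varepsilon, \tag{1.37}$$
$$p(\varepsilon|\sigma^2) = \mathrm{Gauss}_1(\varepsilon; 0, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \cdot \exp\left(-\frac{\varepsilon^2}{2\sigma^2}\right). \tag{1.38}$$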
Here a and $\sigma^2$ are called the regression parameter and the noise variance parameter, respectively. By substituting $\varepsilon = y - a^\top x$, which is obtained from Eq. (1.37), into Eq. (1.38), we have
$$p(y|x, w) = \mathrm{Gauss}_1(y; a^\top x, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \cdot \exp\left(-\frac{(y - a^\top x)^2}{2\sigma^2}\right).$$
The likelihood function for N observed i.i.d.[2] samples, D = (y, X), is given by
$$p(D|w) = \frac{1}{(2\pi\sigma^2)^{N/2}} \cdot \exp\left(-\frac{\|y - Xa\|^2}{2\sigma^2}\right), \tag{1.39}$$
where we defined
$$y = (y^{(1)}, \ldots, y^{(N)})^\top \in \mathbb{R}^N, \qquad X = (x^{(1)}, \ldots, x^{(N)})^\top \in \mathbb{R}^{N \times M}.$$

Footnote 2: In the context of regression, i.i.d. usually means that the observation noise $\varepsilon^{(n)} = y^{(n)} - a^\top x^{(n)}$ is independent for different samples, i.e., $p(\{\varepsilon^{(n)}\}_{n=1}^N) = \prod_{n=1}^N p(\varepsilon^{(n)})$, and the independence between the inputs $(x^{(1)}, \ldots, x^{(N)})$, i.e., $p(\{x^{(n)}\}_{n=1}^N) = \prod_{n=1}^N p(x^{(n)})$, is not required.
Gaussian Likelihood
The computation of the posterior is similar to the isotropic Gaussian case. As in Section 1.2.3, we first consider the case where only the regression parameter a is estimated, with the noise variance parameter $\sigma^2$ regarded as a known constant. One can guess that the likelihood (1.39) is Gaussian as a function of a, since it is an exponential of a concave quadratic function. Indeed, by expanding the exponent and completing the square for a, we obtain
$$\begin{aligned}
p(D|a) &\propto \exp\left(-\frac{\|y - Xa\|^2}{2\sigma^2}\right) \\
&\propto \exp\left(-\frac{\left(a - (X^\top X)^{-1}X^\top y\right)^\top X^\top X \left(a - (X^\top X)^{-1}X^\top y\right)}{2\sigma^2}\right) \\
&\propto \mathrm{Gauss}_M\left(a; (X^\top X)^{-1}X^\top y, \sigma^2 (X^\top X)^{-1}\right).
\end{aligned} \tag{1.40}$$
Eq. (1.40) implies that, when $X^\top X$ is nonsingular (i.e., its inverse exists), the ML estimator for a is given by
$$\hat{a}^{\mathrm{ML}} = (X^\top X)^{-1}X^\top y. \tag{1.41}$$
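Therefore, with the conjugate Gaussian prior
$$p(a|a_0, \Sigma_0) = \mathrm{Gauss}_M(a; a_0, \Sigma_0) \propto \exp\left(-\frac{1}{2}(a - a_0)^\top \Sigma_0^{-1}(a - a_0)\right)$$
for hyperparameters $\kappa = (a_0, \Sigma_0)$, the posterior is Gaussian:
$$\begin{aligned}
p(a|D, a_0, \Sigma_0) &\propto p(D|a)\, p(a|a_0, \Sigma_0) \\
&\propto \mathrm{Gauss}_M\left(a; (X^\top X)^{-1}X^\top y, \sigma^2 (X^\top X)^{-1}\right) \mathrm{Gauss}_M(a; a_0, \Sigma_0) \\
&\propto \exp\left(-\frac{\frac{1}{\sigma^2}\left(a - (X^\top X)^{-1}X^\top y\right)^\top X^\top X \left(a - (X^\top X)^{-1}X^\top y\right) + (a - a_0)^\top \Sigma_0^{-1}(a - a_0)}{2}\right) \\
&\propto \exp\left(-\frac{(a - \hat{a})^\top \hat{\Sigma}_a^{-1}(a - \hat{a})}{2}\right),
\end{aligned}$$
where
$$\hat{a} = \left(\frac{X^\top X}{\sigma^2} + \Sigma_0^{-1}\right)^{-1}\left(\frac{X^\top y}{\sigma^2} + \Sigma_0^{-1}a_0\right), \qquad \hat{\Sigma}_a = \left(\frac{X^\top X}{\sigma^2} + \Sigma_0^{-1}\right)^{-1}.$$
Thus we have
$$p(a|D, a_0, \Sigma_0) = \mathrm{Gauss}_M\left(a; \hat{a}, \hat{\Sigma}_a\right). \tag{1.42}$$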
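The posterior (1.42) for the regression parameter can be sketched with NumPy as follows. The function name and the toy data are assumptions for illustration; a production implementation would typically prefer a linear solve or a Cholesky factorization over explicit inverses.

```python
import numpy as np

def regression_posterior(X, y, sigma2, a0, Sigma0):
    """Posterior mean a_hat and covariance Sigma_a of Eq. (1.42).

    X      : (N, M) design matrix whose rows are the inputs
    y      : (N,) outputs
    sigma2 : known noise variance
    a0     : (M,) prior mean
    Sigma0 : (M, M) prior covariance
    """
    Sigma0_inv = np.linalg.inv(Sigma0)
    precision = X.T @ X / sigma2 + Sigma0_inv
    Sigma_a = np.linalg.inv(precision)
    a_hat = Sigma_a @ (X.T @ y / sigma2 + Sigma0_inv @ a0)
    return a_hat, Sigma_a

# Hypothetical toy data: three observations, two regressors.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
a_hat, Sigma_a = regression_posterior(X, y, 1.0, np.zeros(2), np.eye(2))
print(a_hat)
print(Sigma_a)
```

With a zero prior mean and an isotropic prior covariance, `a_hat` coincides with the ridge regression estimate, which shrinks the ML estimator (1.41) toward zero.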
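Gamma Likelihood
When only the noise variance parameter $\sigma^2$ is unknown, the model likelihood (1.39) is in the Gamma form, as a function of the inverse $\sigma^{-2}$:
$$\begin{aligned}
p(D|\sigma^{-2}) &\propto (\sigma^{-2})^{N/2} \exp\left(-\frac{\|y - Xa\|^2}{2}\sigma^{-2}\right) \\
&\propto \mathrm{Gamma}\left(\sigma^{-2}; \frac{N}{2} + 1, \frac{\|y - Xa\|^2}{2}\right),
\end{aligned} \tag{1.43}$$
which implies that the ML estimator is
$$\hat{\sigma}^{2\,\mathrm{ML}} = \frac{1}{\hat{\sigma}^{-2\,\mathrm{ML}}} = \frac{1}{N}\|y - Xa\|^2.$$
With the conjugate Gamma prior
$$p(\sigma^{-2}|\alpha_0, \beta_0) = \mathrm{Gamma}(\sigma^{-2}; \alpha_0, \beta_0) \propto (\sigma^{-2})^{\alpha_0 - 1}\exp(-\beta_0\sigma^{-2})$$
with hyperparameters $\kappa = (\alpha_0, \beta_0)$, the posterior is computed as
$$\begin{aligned}
p(\sigma^{-2}|D, \alpha_0, \beta_0) &\propto p(D|\sigma^{-2})\, p(\sigma^{-2}|\alpha_0, \beta_0) \\
&\propto \mathrm{Gamma}\left(\sigma^{-2}; \frac{N}{2} + 1, \frac{1}{2}\|y - Xa\|^2\right)\mathrm{Gamma}(\sigma^{-2}; \alpha_0, \beta_0) \\
&\propto (\sigma^{-2})^{N/2 + \alpha_0 - 1}\exp\left(-\left(\frac{1}{2}\|y - Xa\|^2 + \beta_0\right)\sigma^{-2}\right).
\end{aligned}$$
Therefore,
$$p(\sigma^{-2}|D, \alpha_0, \beta_0) = \mathrm{Gamma}\left(\sigma^{-2}; \frac{N}{2} + \alpha_0, \frac{1}{2}\|y - Xa\|^2 + \beta_0\right). \tag{1.44}$$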
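The noise-precision update can be sketched in plain Python. The exponent N/2 used below is read off the normalization $(2\pi\sigma^2)^{N/2}$ of the likelihood (1.39), where the output y is scalar; the function name and toy data are assumptions.

```python
def noise_precision_posterior(X, y, a, alpha0, beta0):
    """ML noise variance and Gamma posterior parameters for sigma^{-2}.

    X : list of N input vectors (lists of length M)
    y : list of N scalar outputs
    a : known regression parameter (length M)
    """
    n = len(y)
    # Squared residual norm ||y - X a||^2.
    residual_sq = sum(
        (yn - sum(ad * xd for ad, xd in zip(a, xn))) ** 2
        for xn, yn in zip(X, y))
    sigma2_ml = residual_sq / n
    alpha_hat = n / 2.0 + alpha0          # shape
    beta_hat = residual_sq / 2.0 + beta0  # rate
    return sigma2_ml, alpha_hat, beta_hat

# Hypothetical toy data: only the third output deviates from a^T x.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 4.0]
print(noise_precision_posterior(X, y, a=[1.0, 2.0], alpha0=2.0, beta0=1.0))
```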
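Gauss-Gamma Likelihood
When we estimate both parameters $w = (a, \sigma^{-2})$, the likelihood (1.39) is written as
$$\begin{aligned}
p(D|a, \sigma^{-2}) &\propto (\sigma^{-2})^{N/2}\exp\left(-\frac{\|y - Xa\|^2}{2}\sigma^{-2}\right) \\
&\propto (\sigma^{-2})^{N/2}\exp\left(-\frac{(a - \hat{a}^{\mathrm{ML}})^\top X^\top X (a - \hat{a}^{\mathrm{ML}}) + \|y - X\hat{a}^{\mathrm{ML}}\|^2}{2}\sigma^{-2}\right) \\
&\propto \mathrm{GaussGamma}_M\left(a, \sigma^{-2}; \hat{a}^{\mathrm{ML}}, X^\top X, \frac{N - M}{2} + 1, \frac{\|y - X\hat{a}^{\mathrm{ML}}\|^2}{2}\right),
\end{aligned}$$
where $\hat{a}^{\mathrm{ML}}$ is the ML estimator, given by Eq. (1.41), for the regression parameter, and
$$\begin{aligned}
\mathrm{GaussGamma}_M(x, \tau|\mu, \Lambda, \alpha, \beta) &\equiv \mathrm{Gauss}_M(x|\mu, (\tau\Lambda)^{-1}) \cdot \mathrm{Gamma}(\tau|\alpha, \beta) \\
&= \frac{\exp\left(-\frac{\tau}{2}(x - \mu)^\top \Lambda (x - \mu)\right)}{(2\pi\tau^{-1})^{M/2}\det(\Lambda)^{-1/2}} \cdot \frac{\beta^\alpha}{\Gamma(\alpha)}\tau^{\alpha - 1}\exp(-\beta\tau) \\
&= \frac{\beta^\alpha}{(2\pi)^{M/2}\det(\Lambda)^{-1/2}\,\Gamma(\alpha)}\tau^{\alpha + \frac{M}{2} - 1}\exp\left(-\left(\frac{(x - \mu)^\top \Lambda (x - \mu)}{2} + \beta\right)\tau\right)
\end{aligned}$$
is the (general) Gauss-Gamma distribution on the random variables $x \in \mathbb{R}^M$, τ > 0 with parameters $\mu \in \mathbb{R}^M$, $\Lambda \in \mathbb{S}^M_{++}$, α > 0, β > 0. With the conjugate Gauss-Gamma prior
$$p(a, \sigma^{-2}|\kappa) = \mathrm{GaussGamma}_M(a, \sigma^{-2}|\mu_0, \Lambda_0, \alpha_0, \beta_0) \propto (\sigma^{-2})^{\alpha_0 + \frac{M}{2} - 1}\exp\left(-\left(\frac{(a - \mu_0)^\top \Lambda_0 (a - \mu_0)}{2} + \beta_0\right)\sigma^{-2}\right)$$
for hyperparameters $\kappa = (\mu_0, \Lambda_0, \alpha_0, \beta_0)$, the posterior is computed as
$$\begin{aligned}
p(a, \sigma^{-2}|D, \kappa) &\propto p(D|a, \sigma^{-2})\, p(a, \sigma^{-2}|\kappa) \\
&\propto \mathrm{GaussGamma}_M\left(a, \sigma^{-2}; \hat{a}^{\mathrm{ML}}, X^\top X, \frac{N - M}{2} + 1, \frac{\|y - X\hat{a}^{\mathrm{ML}}\|^2}{2}\right) \cdot \mathrm{GaussGamma}_M(a, \sigma^{-2}|\mu_0, \Lambda_0, \alpha_0, \beta_0) \\
&\propto (\sigma^{-2})^{N/2}\exp\left(-\frac{(a - \hat{a}^{\mathrm{ML}})^\top X^\top X (a - \hat{a}^{\mathrm{ML}}) + \|y - X\hat{a}^{\mathrm{ML}}\|^2}{2}\sigma^{-2}\right) \\
&\qquad \cdot (\sigma^{-2})^{\alpha_0 + \frac{M}{2} - 1}\exp\left(-\left(\frac{(a - \mu_0)^\top \Lambda_0 (a - \mu_0)}{2} + \beta_0\right)\sigma^{-2}\right) \\
&\propto (\sigma^{-2})^{\hat{\alpha} + \frac{M}{2} - 1}\exp\left(-\left(\frac{(a - \hat{\mu})^\top \hat{\Lambda}(a - \hat{\mu})}{2} + \hat{\beta}\right)\sigma^{-2}\right),
\end{aligned}$$
where
$$\hat{\mu} = (X^\top X + \Lambda_0)^{-1}\left(X^\top X \hat{a}^{\mathrm{ML}} + \Lambda_0\mu_0\right), \qquad \hat{\Lambda} = X^\top X + \Lambda_0, \qquad \hat{\alpha} = \frac{N}{2} + \alpha_0,$$
$$\hat{\beta} = \frac{\|y - X\hat{a}^{\mathrm{ML}}\|^2}{2} + \frac{(\hat{a}^{\mathrm{ML}} - \mu_0)^\top \Lambda_0 (X^\top X + \Lambda_0)^{-1} X^\top X (\hat{a}^{\mathrm{ML}} - \mu_0)}{2} + \beta_0.$$
Thus, we obtain
$$p(a, \sigma^{-2}|D, \kappa) = \mathrm{GaussGamma}_M(a, \sigma^{-2}|\hat{\mu}, \hat{\Lambda}, \hat{\alpha}, \hat{\beta}). \tag{1.45}$$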
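A NumPy sketch of the Gauss-Gamma update follows. It takes the shape parameter as N/2 + α0, which follows from the factor $(\sigma^{-2})^{N/2}$ implied by the likelihood (1.39) for scalar outputs; the function name and toy data are assumptions.

```python
import numpy as np

def regression_gauss_gamma_posterior(X, y, mu0, Lambda0, alpha0, beta0):
    """Posterior parameters (mu_hat, Lambda_hat, alpha_hat, beta_hat) of the
    Gauss-Gamma posterior for the regression parameter and noise precision."""
    N = X.shape[0]
    XtX = X.T @ X
    a_ml = np.linalg.solve(XtX, X.T @ y)  # ML estimator, Eq. (1.41)
    Lambda_hat = XtX + Lambda0
    mu_hat = np.linalg.solve(Lambda_hat, XtX @ a_ml + Lambda0 @ mu0)
    resid = y - X @ a_ml
    diff = a_ml - mu0
    alpha_hat = N / 2.0 + alpha0
    beta_hat = (resid @ resid / 2.0
                + diff @ Lambda0 @ np.linalg.solve(Lambda_hat, XtX @ diff) / 2.0
                + beta0)
    return mu_hat, Lambda_hat, alpha_hat, beta_hat

# Hypothetical toy data.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 4.0])
mu0 = np.array([0.5, -0.5])
Lambda0 = 2.0 * np.eye(2)
print(regression_gauss_gamma_posterior(X, y, mu0, Lambda0, 1.0, 1.0))
```

The rate `beta_hat` can be cross-checked against the equivalent form $\frac{1}{2}\left(\|y - X\hat{\mu}\|^2 + (\hat{\mu} - \mu_0)^\top \Lambda_0 (\hat{\mu} - \mu_0)\right) + \beta_0$, which is obtained by evaluating the combined quadratic form at its minimizer.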
Multinomial Model
The multinomial distribution, which expresses a distribution over the histograms of independent events, is another frequently used basic component in Bayesian modeling. For example, it appears in mixture models and latent Dirichlet allocation. Assume that K mutually exclusive events occur with the probability
$$\theta = (\theta_1, \ldots, \theta_K)^\top \in \Delta^{K-1} \equiv \left\{\theta \in \mathbb{R}^K; 0 \le \theta_k \le 1, \sum_{k=1}^K \theta_k = 1\right\}.$$
Then, the histogram
$$x = (x_1, \ldots, x_K)^\top \in \mathbb{H}^{K-1}_N \equiv \left\{x \in \mathbb{I}^K; 0 \le x_k \le N; \sum_{k=1}^K x_k = N\right\}$$
of events after N iterations follows the multinomial distribution, defined as
$$p(x|\theta) = \mathrm{Multinomial}_{K,N}(x; \theta) \equiv N! \cdot \prod_{k=1}^K \frac{\theta_k^{x_k}}{x_k!}. \tag{1.46}$$
θ is called the multinomial parameter. As seen shortly, calculation of the posterior with its conjugate prior is surprisingly easy.

Dirichlet Likelihood
As a function of the multinomial parameter w = θ, it is easy to find that the likelihood (1.46) is in the form of the Dirichlet distribution:
$$p(x|\theta) \propto \mathrm{Dirichlet}_K(\theta; x + 1_K),$$
where $1_K$ is the K-dimensional vector with all elements equal to 1. Since the Dirichlet distribution is an exponential family and hence multiplicatively closed, it is conjugate for the multinomial parameter. With the conjugate Dirichlet prior
$$p(\theta|\phi) = \mathrm{Dirichlet}_K(\theta; \phi) \propto \prod_{k=1}^K \theta_k^{\phi_k - 1}$$
with hyperparameters κ = φ, the posterior is computed as
$$\begin{aligned}
p(\theta|x, \phi) &\propto p(x|\theta)\, p(\theta|\phi) \\
&\propto \mathrm{Dirichlet}_K(\theta; x + 1_K) \cdot \mathrm{Dirichlet}_K(\theta; \phi) \\
&\propto \prod_{k=1}^K \theta_k^{x_k} \cdot \theta_k^{\phi_k - 1} \\
&\propto \prod_{k=1}^K \theta_k^{x_k + \phi_k - 1}.
\end{aligned}$$
Thus we have
$$p(\theta|x, \phi) = \mathrm{Dirichlet}_K(\theta; x + \phi). \tag{1.47}$$
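The Dirichlet update (1.47) simply adds the observed counts to the prior parameters. A minimal plain-Python sketch, with hypothetical function names and counts:

```python
def dirichlet_posterior(counts, phi):
    """Parameters x + phi of the Dirichlet posterior in Eq. (1.47)."""
    return [xk + pk for xk, pk in zip(counts, phi)]

def dirichlet_mean(phi):
    """Mean of Dirichlet_K(.; phi), i.e., phi normalized by its sum."""
    total = sum(phi)
    return [pk / total for pk in phi]

# Hypothetical histogram over K = 3 categories with a flat prior phi = (1, 1, 1).
posterior_phi = dirichlet_posterior([3, 1, 0], [1.0, 1.0, 1.0])
print(posterior_phi)
print(dirichlet_mean(posterior_phi))
```

Because the prior parameters act as pseudo-counts, a category that was never observed still receives nonzero posterior mean probability.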
Special Cases
For K = 2, the multinomial distribution is reduced to the binomial distribution:
$$p(x_1|\theta_1) = \mathrm{Multinomial}_{2,N}\left((x_1, N - x_1)^\top; (\theta_1, 1 - \theta_1)^\top\right) = \mathrm{Binomial}_N(x_1; \theta_1) = \binom{N}{x_1} \cdot \theta_1^{x_1}(1 - \theta_1)^{N - x_1}.$$
Furthermore, it is reduced to the Bernoulli distribution for K = 2 and N = 1:
$$p(x_1|\theta_1) = \mathrm{Binomial}_1(x_1; \theta_1) = \theta_1^{x_1}(1 - \theta_1)^{1 - x_1}.$$
Similarly, its conjugate Dirichlet distribution for K = 2 is reduced to the Beta distribution:
$$p(\theta_1|\phi_1, \phi_2) = \mathrm{Dirichlet}_2\left((\theta_1, 1 - \theta_1)^\top; (\phi_1, \phi_2)^\top\right) = \mathrm{Beta}(\theta_1; \phi_1, \phi_2) = \frac{1}{B(\phi_1, \phi_2)} \cdot \theta_1^{\phi_1 - 1}(1 - \theta_1)^{\phi_2 - 1},$$
where $B(\phi_1, \phi_2) = \frac{\Gamma(\phi_1)\Gamma(\phi_2)}{\Gamma(\phi_1 + \phi_2)}$ is the Beta function. Naturally, the Beta distribution is conjugate to the binomial and the Bernoulli distributions, and the posterior can be computed as easily as for the multinomial case.

With a conjugate prior in the form of a popular distribution, the four quantities introduced in Section 1.1.3, i.e., the marginal likelihood, the posterior mean, the posterior covariance, and the predictive distribution, can be obtained analytically. In the following subsections, we show how they are obtained.
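Table 1.2 First and second moments of common distributions. $\mathrm{Mean}(x) = \langle x \rangle_{p(x|w)}$, $\mathrm{Var}(x) = \langle (x - \mathrm{Mean}(x))^2 \rangle_{p(x|w)}$, $\mathrm{Cov}(x) = \langle (x - \mathrm{Mean}(x))(x - \mathrm{Mean}(x))^\top \rangle_{p(x|w)}$, $\Psi(z) \equiv \frac{d}{dz}\log\Gamma(z)$: digamma function, and $\Psi_m(z) \equiv \frac{d^m}{dz^m}\Psi(z)$: polygamma function of order m.

$p(x \mid w)$ | First moment | Second moment
$\mathrm{Gauss}_M(x; \mu, \Sigma)$ | $\mathrm{Mean}(x) = \mu$ | $\mathrm{Cov}(x) = \Sigma$
$\mathrm{Gamma}(x; \alpha, \beta)$ | $\mathrm{Mean}(x) = \frac{\alpha}{\beta}$; $\mathrm{Mean}(\log x) = \Psi(\alpha) - \log\beta$ | $\mathrm{Var}(x) = \frac{\alpha}{\beta^2}$; $\mathrm{Var}(\log x) = \Psi_1(\alpha)$
$\mathrm{Wishart}_M(X; V, \nu)$ | $\mathrm{Mean}(X) = \nu V$ | $\mathrm{Var}(x_{m,m'}) = \nu\left(V_{m,m'}^2 + V_{m,m}V_{m',m'}\right)$
$\mathrm{Multinomial}_{K,N}(x; \theta)$ | $\mathrm{Mean}(x) = N\theta$ | $(\mathrm{Cov}(x))_{k,k'} = N\theta_k(1 - \theta_k)$ if $k = k'$; $-N\theta_k\theta_{k'}$ if $k \ne k'$
$\mathrm{Dirichlet}_K(x; \phi)$ | $\mathrm{Mean}(x) = \frac{1}{\sum_{k=1}^K \phi_k}\phi$; $\mathrm{Mean}(\log x_k) = \Psi(\phi_k) - \Psi\left(\sum_{k'=1}^K \phi_{k'}\right)$ | $(\mathrm{Cov}(x))_{k,k'} = \frac{\phi_k(\tau - \phi_k)}{\tau^2(\tau + 1)}$ if $k = k'$; $-\frac{\phi_k\phi_{k'}}{\tau^2(\tau + 1)}$ if $k \ne k'$, where $\tau = \sum_{k=1}^K \phi_k$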
1.2.4 Posterior Mean and Covariance
As seen in Section 1.2.3, by adopting a conjugate prior having a form of one of the common family distributions, such as those in Table 1.1, we can have the posterior distribution in the same common family.[3] In such cases, we can simply use the known form of moments, which are summarized in Table 1.2. For example, the posterior (1.42) for the regression parameter a (when the noise variance $\sigma^2$ is treated as a known constant) is the following Gaussian distribution:
$$p(a|D, a_0, \Sigma_0) = \mathrm{Gauss}_M\left(a; \hat{a}, \hat{\Sigma}_a\right),$$
$$\text{where} \quad \hat{a} = \left(\frac{X^\top X}{\sigma^2} + \Sigma_0^{-1}\right)^{-1}\left(\frac{X^\top y}{\sigma^2} + \Sigma_0^{-1}a_0\right), \qquad \hat{\Sigma}_a = \left(\frac{X^\top X}{\sigma^2} + \Sigma_0^{-1}\right)^{-1}.$$

Footnote 3: If we would say that the prior is in the family that contains all possible distributions, this family would be the conjugate prior for any likelihood function, which is however useless. Usually, the notion of the conjugate prior implicitly requires that moments (at least the normalization constant and the first moment) of any family member can be computed analytically.
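Therefore, the posterior mean and the posterior covariance are simply given by
$$\langle a \rangle_{p(a|D, a_0, \Sigma_0)} = \hat{a}, \qquad \left\langle (a - \hat{a})(a - \hat{a})^\top \right\rangle_{p(a|D, a_0, \Sigma_0)} = \hat{\Sigma}_a,$$
respectively. The posterior (1.29) of the (inverse) variance parameter $\sigma^{-2}$ of the isotropic Gaussian distribution (when the mean parameter μ is treated as a known constant) is the following Gamma distribution:
$$p(\sigma^{-2}|D, \alpha_0, \beta_0) = \mathrm{Gamma}\left(\sigma^{-2}; \frac{MN}{2} + \alpha_0, \frac{1}{2}\sum_{n=1}^N \|x^{(n)} - \mu\|^2 + \beta_0\right).$$
Therefore, the posterior mean and the posterior variance are given by
$$\left\langle \sigma^{-2} \right\rangle_{p(\sigma^{-2}|D, \alpha_0, \beta_0)} = \frac{\frac{MN}{2} + \alpha_0}{\frac{1}{2}\sum_{n=1}^N \|x^{(n)} - \mu\|^2 + \beta_0},$$
$$\left\langle \left(\sigma^{-2} - \left\langle \sigma^{-2} \right\rangle\right)^2 \right\rangle_{p(\sigma^{-2}|D, \alpha_0, \beta_0)} = \frac{\frac{MN}{2} + \alpha_0}{\left(\frac{1}{2}\sum_{n=1}^N \|x^{(n)} - \mu\|^2 + \beta_0\right)^2},$$
respectively. Also in other cases, the posterior mean and the posterior covariances can be easily computed by using Table 1.2, if the form of the posterior distribution is in the table.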
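Reading the moments off Table 1.2 amounts to plugging the posterior parameters into the Gamma mean α/β and variance α/β². A small plain-Python sketch, with hypothetical function names and data:

```python
def gamma_mean_and_variance(alpha, beta):
    """First and second moments of Gamma(x; alpha, beta) from Table 1.2."""
    return alpha / beta, alpha / beta ** 2

def precision_posterior_moments(samples, mu, alpha0, beta0):
    """Posterior mean and variance of sigma^{-2} under the Gamma posterior (1.29)."""
    n = len(samples)
    dim = len(mu)
    sq_dist = sum((x[d] - mu[d]) ** 2 for x in samples for d in range(dim))
    alpha_post = dim * n / 2.0 + alpha0
    beta_post = sq_dist / 2.0 + beta0
    return gamma_mean_and_variance(alpha_post, beta_post)

# Hypothetical data: three 2-dimensional samples around a zero mean.
print(precision_posterior_moments(
    [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]], mu=[0.0, 0.0], alpha0=1.0, beta0=2.0))
```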
1.2.5 Predictive Distribution
The predictive distribution (1.9) for a new data set $D^{\mathrm{new}}$ can be computed analytically, if the posterior distribution is in the exponential family, and hence multiplicatively closed. In this section, we show two exemplary cases, the linear regression model and the multinomial model.

Linear Regression Model
Consider the linear regression model:
$$p(y|x, a) = \mathrm{Gauss}_1(y; a^\top x, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \cdot \exp\left(-\frac{(y - a^\top x)^2}{2\sigma^2}\right), \tag{1.48}$$
where only the regression parameter is unknown, i.e., $w = a \in \mathbb{R}^M$, and the noise variance parameter $\sigma^2$ is treated as a known constant. We choose the zero-mean Gaussian as a conjugate prior:
$$p(a|C) = \mathrm{Gauss}_M(a; 0, C) = \frac{\exp\left(-\frac{1}{2}a^\top C^{-1}a\right)}{(2\pi)^{M/2}\det(C)^{1/2}}, \tag{1.49}$$
where C is the prior covariance.
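When N i.i.d. samples D = (X, y), where
$$y = (y^{(1)}, \ldots, y^{(N)})^\top \in \mathbb{R}^N, \qquad X = (x^{(1)}, \ldots, x^{(N)})^\top \in \mathbb{R}^{N \times M},$$
are observed, the posterior is given by
$$p(a|y, X, C) = \mathrm{Gauss}_M\left(a; \hat{a}, \hat{\Sigma}_a\right) = \frac{1}{(2\pi)^{M/2}\det(\hat{\Sigma}_a)^{1/2}} \cdot \exp\left(-\frac{(a - \hat{a})^\top \hat{\Sigma}_a^{-1}(a - \hat{a})}{2}\right), \tag{1.50}$$
where
$$\hat{a} = \left(\frac{X^\top X}{\sigma^2} + C^{-1}\right)^{-1}\frac{X^\top y}{\sigma^2} = \hat{\Sigma}_a\frac{X^\top y}{\sigma^2}, \tag{1.51}$$
$$\hat{\Sigma}_a = \left(\frac{X^\top X}{\sigma^2} + C^{-1}\right)^{-1}. \tag{1.52}$$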
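This is just a special case of the posterior (1.42) for the linear regression model with the most general Gaussian prior. Now, let us compute the predictive distribution on the output $y^*$ for a new given input $x^*$. As defined in Eq. (1.9), the predictive distribution is the expectation value of the model distribution (1.48) (for a new input–output pair) over the posterior distribution (1.50):
$$\begin{aligned}
p(y^*|x^*, y, X, C) &= \left\langle p(y^*|x^*, a) \right\rangle_{p(a|y, X, C)} \\
&= \int p(y^*|x^*, a)\, p(a|y, X, C)\, da \\
&= \int \mathrm{Gauss}_1(y^*; a^\top x^*, \sigma^2)\, \mathrm{Gauss}_M\left(a; \hat{a}, \hat{\Sigma}_a\right) da \\
&\propto \int \exp\left(-\frac{(y^* - a^\top x^*)^2}{2\sigma^2} - \frac{(a - \hat{a})^\top \hat{\Sigma}_a^{-1}(a - \hat{a})}{2}\right) da \\
&\propto \exp\left(-\frac{y^{*2}}{2\sigma^2}\right) \int \exp\left(-\frac{a^\top\left(\hat{\Sigma}_a^{-1} + \frac{x^* x^{*\top}}{\sigma^2}\right)a - 2a^\top\left(\hat{\Sigma}_a^{-1}\hat{a} + \frac{x^* y^*}{\sigma^2}\right)}{2}\right) da \\
&\propto \exp\left(-\frac{\sigma^{-2}y^{*2} - \left(\hat{\Sigma}_a^{-1}\hat{a} + \frac{x^* y^*}{\sigma^2}\right)^\top\left(\hat{\Sigma}_a^{-1} + \frac{x^* x^{*\top}}{\sigma^2}\right)^{-1}\left(\hat{\Sigma}_a^{-1}\hat{a} + \frac{x^* y^*}{\sigma^2}\right)}{2}\right) \\
&\qquad \cdot \int \exp\left(-\frac{(a - \breve{a})^\top\left(\hat{\Sigma}_a^{-1} + \frac{x^* x^{*\top}}{\sigma^2}\right)(a - \breve{a})}{2}\right) da,
\end{aligned} \tag{1.53}$$
where
$$\breve{a} = \left(\hat{\Sigma}_a^{-1} + \frac{x^* x^{*\top}}{\sigma^2}\right)^{-1}\left(\hat{\Sigma}_a^{-1}\hat{a} + \frac{x^* y^*}{\sigma^2}\right).$$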
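Note that, although the preceding computation is similar to the one for the posterior distribution in Section 1.2.3, any factor that depends on $y^*$ cannot be ignored even if it does not depend on a, since the goal is to obtain the distribution on $y^*$. The integrand in Eq. (1.53) coincides with the main part of
$$\mathrm{Gauss}_M\left(a; \breve{a}, \left(\hat{\Sigma}_a^{-1} + \frac{x^* x^{*\top}}{\sigma^2}\right)^{-1}\right)$$
without the normalization factor. Therefore, the integral is the inverse of the normalization factor, i.e.,
$$\int \exp\left(-\frac{(a - \breve{a})^\top\left(\hat{\Sigma}_a^{-1} + \frac{x^* x^{*\top}}{\sigma^2}\right)(a - \breve{a})}{2}\right) da = (2\pi)^{M/2}\det\left(\hat{\Sigma}_a^{-1} + \frac{x^* x^{*\top}}{\sigma^2}\right)^{-1/2},$$
which is a constant with respect to $y^*$. Therefore, by using Eqs. (1.51) and (1.52), we have
$$\begin{aligned}
p(y^*|x^*, y, X, C) &\propto \exp\left(-\frac{\sigma^{-2}y^{*2} - \left(\hat{\Sigma}_a^{-1}\hat{a} + \frac{x^* y^*}{\sigma^2}\right)^\top\left(\hat{\Sigma}_a^{-1} + \frac{x^* x^{*\top}}{\sigma^2}\right)^{-1}\left(\hat{\Sigma}_a^{-1}\hat{a} + \frac{x^* y^*}{\sigma^2}\right)}{2}\right) \\
&\propto \exp\left(-\frac{y^{*2} - \left(X^\top y + x^* y^*\right)^\top\left(X^\top X + x^* x^{*\top} + \sigma^2 C^{-1}\right)^{-1}\left(X^\top y + x^* y^*\right)}{2\sigma^2}\right) \\
&\propto \exp\left(-\frac{1}{2\sigma^2}\left\{y^{*2}\left(1 - x^{*\top}\left(X^\top X + x^* x^{*\top} + \sigma^2 C^{-1}\right)^{-1}x^*\right) - 2y^* x^{*\top}\left(X^\top X + x^* x^{*\top} + \sigma^2 C^{-1}\right)^{-1}X^\top y\right\}\right) \\
&\propto \exp\left(-\frac{1 - x^{*\top}\left(X^\top X + x^* x^{*\top} + \sigma^2 C^{-1}\right)^{-1}x^*}{2\sigma^2}\left(y^* - \frac{x^{*\top}\left(X^\top X + x^* x^{*\top} + \sigma^2 C^{-1}\right)^{-1}X^\top y}{1 - x^{*\top}\left(X^\top X + x^* x^{*\top} + \sigma^2 C^{-1}\right)^{-1}x^*}\right)^2\right) \\
&\propto \exp\left(-\frac{(y^* - \hat{y})^2}{2\hat{\sigma}_y^2}\right),
\end{aligned}$$
where
$$\hat{y} = \frac{x^{*\top}\left(X^\top X + x^* x^{*\top} + \sigma^2 C^{-1}\right)^{-1}X^\top y}{1 - x^{*\top}\left(X^\top X + x^* x^{*\top} + \sigma^2 C^{-1}\right)^{-1}x^*}, \qquad \hat{\sigma}_y^2 = \frac{\sigma^2}{1 - x^{*\top}\left(X^\top X + x^* x^{*\top} + \sigma^2 C^{-1}\right)^{-1}x^*}.$$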
Figure 1.3 Predictive distribution of the linear regression model. (Legend: True, Estimated, Credible interval.)
Thus, the predictive distribution has been analytically obtained:
\[
p(y^*|x^*, y, X, C) = \mathrm{Gauss}_1\left(y^*; \hat{y}, \hat{\sigma}_y^2\right). \tag{1.54}
\]
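The closed form (1.54) can be checked numerically. The following is a minimal sketch, assuming $M = 2$, an isotropic prior $C = c^2 I_M$, and synthetic data; all function and variable names are hypothetical. It evaluates $\hat{y}$ and $\hat{\sigma}_y^2$ exactly as written above and compares them with the shorter expressions $x^{*\top} \hat{a}$ and $\sigma^2 + x^{*\top} \hat{\Sigma}_a x^*$. These shorter forms are not stated in the text; they follow from the Sherman–Morrison identity and are used here only as a cross-check.

```python
import random

def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def gram(X):
    """X^T X for an N x 2 design matrix."""
    return [[sum(r[i] * r[j] for r in X) for j in range(2)] for i in range(2)]

def xty(X, y):
    """X^T y for an N x 2 design matrix."""
    return [sum(r[i] * t for r, t in zip(X, y)) for i in range(2)]

def posterior(X, y, sigma2, c2):
    """Posterior mean and covariance, Eqs. (1.51)-(1.52), with C = c2 * I."""
    g = gram(X)
    precision = [[g[i][j] / sigma2 + (1.0 / c2 if i == j else 0.0)
                  for j in range(2)] for i in range(2)]
    cov = inv2(precision)
    mean = matvec(cov, [v / sigma2 for v in xty(X, y)])
    return mean, cov

def predictive_direct(X, y, x_new, sigma2, c2):
    """Predictive mean and variance in the form given just before Eq. (1.54)."""
    g = gram(X)
    A = [[g[i][j] + x_new[i] * x_new[j] + (sigma2 / c2 if i == j else 0.0)
          for j in range(2)] for i in range(2)]
    A_inv = inv2(A)
    denom = 1.0 - dot(x_new, matvec(A_inv, x_new))
    y_hat = dot(x_new, matvec(A_inv, xty(X, y))) / denom
    return y_hat, sigma2 / denom

# Synthetic data (assumed for illustration): y = -2 + 0.4 t + noise.
rng = random.Random(0)
X = [[1.0, rng.uniform(-2.4, 1.6)] for _ in range(30)]
y = [-2.0 * r[0] + 0.4 * r[1] + rng.gauss(0.0, 1.0) for r in X]
x_new = [1.0, 3.0]          # a test input outside the sampled region
sigma2, c2 = 1.0, 10000.0

mean, cov = posterior(X, y, sigma2, c2)
y_hat, var_y = predictive_direct(X, y, x_new, sigma2, c2)
y_short = dot(x_new, mean)                              # x*^T a_hat
var_short = sigma2 + dot(x_new, matvec(cov, x_new))     # sigma^2 + x*^T Sigma_a x*
print(y_hat, y_short, var_y, var_short)
```

The two pairs of numbers should agree up to rounding, and the predictive variance exceeds $\sigma^2$ by the posterior uncertainty of $a$ projected onto $x^*$.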
Figure 1.3 shows an example of the predictive distribution of the linear regression model. The curve labeled as “True” indicates the mean $y = a^{*\top} x$ of the true regression model $y = a^{*\top} x + \varepsilon$, where $a^* = (-2, 0.4, 0.3, -0.1)^\top$, $x = (1, t, t^2, t^3)^\top$, and $\varepsilon \sim \mathrm{Gauss}_1(0, 1^2)$. The crosses are $N = 30$ i.i.d. observed samples generated from the true regression model and the input distribution $t \sim \mathrm{Uniform}(-2.4, 1.6)$, where $\mathrm{Uniform}(l, u)$ denotes the uniform distribution on $[l, u]$. The regression model (1.48) with the prior (1.49) for the hyperparameters $C = 10000 \cdot I_M$, $\sigma^2 = 1$ was trained with the observed samples. The curve labeled as “Estimated” and the pair of curves labeled as “Credible interval” show the mean $\hat{y}$ and the credible interval $\hat{y} \pm \hat{\sigma}_y$ of the predictive distribution (1.54), respectively. Reflecting the fact that the samples are observed only in the middle region ($t \in [-2.4, 1.6]$), the credible interval is large in outer regions. The larger interval implies that the “Estimated” function is less reliable, and we see that the gap from the “True” function is indeed large. Since the true function is unknown in practical situations, the variance of the predictive distribution is important information on the reliability of the estimated result.

Multinomial Model
Let us compute the predictive distribution of the multinomial model:
\[
p(x|\theta) = \mathrm{Multinomial}_{K,N}(x; \theta) \propto \prod_{k=1}^{K} \frac{\theta_k^{x_k}}{x_k!}, \qquad
p(\theta|\phi) = \mathrm{Dirichlet}_K(\theta; \phi) \propto \prod_{k=1}^{K} \theta_k^{\phi_k - 1},
\]
with the observed data $D = x = (x_1, \ldots, x_K)^\top \in \mathbb{H}_N^{K-1}$ and the unknown parameter $w = \theta = (\theta_1, \ldots, \theta_K)^\top \in \Delta^{K-1}$. The posterior was derived in Eq. (1.47):
\[
p(\theta|x, \phi) = \mathrm{Dirichlet}_K(\theta; x + \phi) \propto \prod_{k=1}^{K} \theta_k^{x_k + \phi_k - 1}.
\]
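Therefore, the predictive distribution for a new single sample $x^* \in \mathbb{H}_1^{K-1}$ is given by
\[
\begin{aligned}
p(x^*|x, \phi) &= \left\langle p(x^*|\theta) \right\rangle_{p(\theta|x,\phi)} \\
&= \int p(x^*|\theta)\, p(\theta|x, \phi)\, d\theta \\
&= \int \mathrm{Multinomial}_{K,1}(x^*; \theta)\, \mathrm{Dirichlet}_K(\theta; x + \phi)\, d\theta \\
&\propto \int \prod_{k=1}^{K} \theta_k^{x_k^*} \cdot \theta_k^{x_k + \phi_k - 1}\, d\theta \\
&= \int \prod_{k=1}^{K} \theta_k^{x_k^* + x_k + \phi_k - 1}\, d\theta.
\end{aligned} \tag{1.55}
\]
In the fourth equation, we ignored the factors that depend neither on $x^*$ nor on $\theta$. The integrand in Eq. (1.55) is the main part of $\mathrm{Dirichlet}_K(\theta; x^* + x + \phi)$, and therefore, the integral is equal to the inverse of its normalization factor:
\[
\int \prod_{k=1}^{K} \theta_k^{x_k^* + x_k + \phi_k - 1}\, d\theta
= \frac{\prod_{k=1}^{K} \Gamma(x_k^* + x_k + \phi_k)}{\Gamma\left(\sum_{k=1}^{K} (x_k^* + x_k + \phi_k)\right)}
= \frac{\prod_{k=1}^{K} \Gamma(x_k^* + x_k + \phi_k)}{\Gamma\left(N + \sum_{k=1}^{K} \phi_k + 1\right)}.
\]
Thus, by using the identity $\Gamma(x + 1) = x \Gamma(x)$ for the Gamma function, we have
\[
\begin{aligned}
p(x^*|x, \phi) &\propto \prod_{k=1}^{K} \Gamma(x_k^* + x_k + \phi_k) \\
&\propto \prod_{k=1}^{K} (x_k + \phi_k)^{x_k^*}\, \Gamma(x_k + \phi_k) \\
&\propto \prod_{k=1}^{K} (x_k + \phi_k)^{x_k^*} \\
&\propto \prod_{k=1}^{K} \left(\frac{x_k + \phi_k}{\sum_{k'=1}^{K} (x_{k'} + \phi_{k'})}\right)^{x_k^*} \\
&= \mathrm{Multinomial}_{K,1}(x^*; \hat{\theta}),
\end{aligned} \tag{1.56}
\]
where
\[
\hat{\theta}_k = \frac{x_k + \phi_k}{\sum_{k'=1}^{K} (x_{k'} + \phi_{k'})}. \tag{1.57}
\]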
From Eq. (1.47) and Table 1.2, we can easily see that the predictive mean $\hat{\theta}$, specified by Eq. (1.57), coincides with the posterior mean, i.e., the Bayesian estimator:
\[
\hat{\theta} = \left\langle \theta \right\rangle_{\mathrm{Dirichlet}_K(\theta; x + \phi)}.
\]
Therefore, in the multinomial model, the predictive distribution coincides with the model distribution with the Bayesian estimator plugged in. In the preceding derivation, we performed the integral computation and derived the form of the predictive distribution. However, the necessary information to determine the predictive distribution is the probability table on the events $x^* \in \mathbb{H}_1^{K-1} = \{e_k\}_{k=1}^{K}$, of which the degree of freedom is only $K$. Therefore, the following simple calculation gives the same result:
\[
\mathrm{Prob}(x^* = e_k | x, \phi) = \left\langle \mathrm{Multinomial}_{K,1}(e_k; \theta) \right\rangle_{\mathrm{Dirichlet}_K(\theta; x + \phi)} = \left\langle \theta_k \right\rangle_{\mathrm{Dirichlet}_K(\theta; x + \phi)} = \hat{\theta}_k,
\]
which specifies the function form of the predictive distribution, given by Eq. (1.56).
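As a small numerical illustration of Eqs. (1.55) through (1.57), the sketch below computes the predictive probabilities in two ways: from the ratio of Dirichlet normalization factors (the integral route) and from the plug-in formula (1.57). The counts and the Dirichlet hyperparameters are made-up values, and the function names are hypothetical.

```python
import math

def predictive_by_integration(counts, phi):
    """Prob(x* = e_k | x, phi) as a ratio of Dirichlet normalization factors."""
    total = sum(counts) + sum(phi)
    # log of the integral of prod_k theta_k^(x_k + phi_k - 1) over the simplex
    log_norm_post = sum(math.lgamma(c + p) for c, p in zip(counts, phi)) - math.lgamma(total)
    probs = []
    for k in range(len(counts)):
        # Same integral with one extra observation in category k, as in Eq. (1.55).
        log_norm_new = sum(
            math.lgamma(c + p + (1.0 if j == k else 0.0))
            for j, (c, p) in enumerate(zip(counts, phi))
        ) - math.lgamma(total + 1.0)
        probs.append(math.exp(log_norm_new - log_norm_post))
    return probs

def predictive_plug_in(counts, phi):
    """Eq. (1.57): the posterior mean plugged into the model distribution."""
    total = sum(counts) + sum(phi)
    return [(c + p) / total for c, p in zip(counts, phi)]

counts = [3, 0, 7, 1]            # observed histogram x (illustrative values)
phi = [0.5, 0.5, 0.5, 0.5]       # Dirichlet hyperparameters (illustrative values)
by_integration = predictive_by_integration(counts, phi)
by_plug_in = predictive_plug_in(counts, phi)
print(by_integration)
print(by_plug_in)
```

Both routes should give the same probability table, which reflects the statement that only $K$ numbers are needed to specify the predictive distribution.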
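1.2.6 Marginal Likelihood
Let us compute the marginal likelihood of the linear regression model, defined by Eqs. (1.48) and (1.49):
\[
\begin{aligned}
p(D|C) = p(y|X, C) &= \left\langle p(y|X, a) \right\rangle_{p(a|C)} \\
&= \int p(y|X, a)\, p(a|C)\, da \\
&= \int \mathrm{Gauss}_N(y; Xa, \sigma^2 I_N)\, \mathrm{Gauss}_M(a; 0, C)\, da \\
&= \int \frac{\exp\left(-\frac{\|y - Xa\|^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{N/2}} \cdot \frac{\exp\left(-\frac{1}{2} a^\top C^{-1} a\right)}{(2\pi)^{M/2} \det(C)^{1/2}}\, da \\
&= \frac{\exp\left(-\frac{\|y\|^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{N/2} (2\pi)^{M/2} \det(C)^{1/2}} \int \exp\left(-\frac{-2 a^\top \frac{X^\top y}{\sigma^2} + a^\top \left(\frac{X^\top X}{\sigma^2} + C^{-1}\right) a}{2}\right) da \\
&= \frac{\exp\left(-\frac{1}{2} \left(\frac{\|y\|^2}{\sigma^2} - \hat{a}^\top \hat{\Sigma}_a^{-1} \hat{a}\right)\right)}{(2\pi\sigma^2)^{N/2} (2\pi)^{M/2} \det(C)^{1/2}} \int \exp\left(-\frac{(a - \hat{a})^\top \hat{\Sigma}_a^{-1} (a - \hat{a})}{2}\right) da,
\end{aligned} \tag{1.58}
\]
where $\hat{a}$ and $\hat{\Sigma}_a$ are, respectively, the posterior mean and the posterior covariance, given by Eqs. (1.51) and (1.52). By using
\[
\int \exp\left(-\frac{(a - \hat{a})^\top \hat{\Sigma}_a^{-1} (a - \hat{a})}{2}\right) da = \sqrt{(2\pi)^M \det(\hat{\Sigma}_a)},
\]
and Eq. (1.58), we have
\[
\begin{aligned}
p(y|X, C) &= \frac{\exp\left(-\frac{1}{2} \left(\frac{\|y\|^2}{\sigma^2} - \frac{y^\top X \hat{\Sigma}_a X^\top y}{\sigma^4}\right)\right) \sqrt{(2\pi)^M \det(\hat{\Sigma}_a)}}{(2\pi\sigma^2)^{N/2} (2\pi)^{M/2} \det(C)^{1/2}} \\
&= \frac{\exp\left(-\frac{\|y\|^2 - y^\top X \left(X^\top X + \sigma^2 C^{-1}\right)^{-1} X^\top y}{2\sigma^2}\right)}{(2\pi\sigma^2)^{N/2} \det(C X^\top X + \sigma^2 I_M)^{1/2}},
\end{aligned} \tag{1.59}
\]
where we also used Eqs. (1.51) and (1.52). Eq. (1.59) is an explicit expression of the marginal likelihood as a function of the hyperparameter $\kappa = C$. Based on it, we perform EBayes learning in Section 1.2.7.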
1.2.7 Empirical Bayesian Learning
In empirical Bayesian (EBayes) learning, the hyperparameter $\kappa$ is estimated by maximizing the marginal likelihood $p(D|\kappa)$. The negative logarithm of the marginal likelihood,
\[
F^{\mathrm{Bayes}} = -\log p(D|\kappa), \tag{1.60}
\]
is called the Bayes free energy or stochastic complexity.4 Since $\log(\cdot)$ is a monotonic function, maximizing the marginal likelihood is equivalent to minimizing the Bayes free energy. Eq. (1.59) implies that the Bayes free energy of the linear regression model is given by
\[
2F^{\mathrm{Bayes}} = -2 \log p(y|X, C) = N \log(2\pi\sigma^2) + \log\det(C X^\top X + \sigma^2 I_M) + \frac{\|y\|^2 - y^\top X \left(X^\top X + \sigma^2 C^{-1}\right)^{-1} X^\top y}{\sigma^2}. \tag{1.61}
\]
Let us restrict the prior covariance to be diagonal:
\[
C = \mathrm{Diag}(c_1^2, \ldots, c_M^2) \in \mathbb{D}^M. \tag{1.62}
\]
The prior (1.49) with diagonal covariance (1.62) is called the automatic relevance determination (ARD) prior, which is known to make the EBayes estimator sparse (Neal, 1996). In the following example, we see this effect by setting the design matrix to identity, $X = I_M$, which enables us to derive the EBayes solution analytically. Under the identity design matrix, the Bayes free energy (1.61) can be decomposed as
\[
\begin{aligned}
2F^{\mathrm{Bayes}} &= N \log(2\pi\sigma^2) + \log\det(C + \sigma^2 I_M) + \frac{\|y\|^2 - y^\top \left(I_M + \sigma^2 C^{-1}\right)^{-1} y}{\sigma^2} \\
&= N \log(2\pi\sigma^2) + \frac{\|y\|^2}{\sigma^2} + \sum_{m=1}^{M} \left(\log(c_m^2 + \sigma^2) - \frac{y_m^2}{\sigma^2 \left(1 + \sigma^2 c_m^{-2}\right)}\right) \\
&= \sum_{m=1}^{M} 2F_m^* + \mathrm{const.},
\end{aligned} \tag{1.63}
\]
where
\[
2F_m^* = \log\left(1 + \frac{c_m^2}{\sigma^2}\right) - \frac{y_m^2}{\sigma^2} \left(1 + \frac{\sigma^2}{c_m^2}\right)^{-1}. \tag{1.64}
\]
In Eq. (1.63), we omitted the terms that are constant with respect to the hyperparameter $C$. As the remaining terms are decomposed into each component $m$, we can independently minimize $F_m^*$ with respect to $c_m^2$.
4 The logarithm of the marginal likelihood $\log p(D|\kappa)$ is called the log marginal likelihood or evidence.
Figure 1.4 The (componentwise) Bayes free energy (1.64) of the linear regression model with the ARD prior. The minimizer is shown as a cross if it lies in the positive region of $c_m^2/\sigma^2$.
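The derivative of Eq. (1.64) with respect to $c_m^2$ is
\[
2 \frac{\partial F_m^*}{\partial c_m^2} = \frac{1}{c_m^2 + \sigma^2} - \frac{y_m^2}{c_m^4 \left(1 + \sigma^2 c_m^{-2}\right)^2}
= \frac{1}{c_m^2 + \sigma^2} - \frac{y_m^2}{\left(c_m^2 + \sigma^2\right)^2}
= \frac{c_m^2 - (y_m^2 - \sigma^2)}{(c_m^2 + \sigma^2)^2}. \tag{1.65}
\]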
Eq. (1.65) implies that $F_m^*$ is monotonically increasing over the whole domain $c_m^2 > 0$ when $y_m^2 \le \sigma^2$, and has a unique minimizer in the region $c_m^2 > 0$ when $y_m^2 > \sigma^2$. Specifically, the minimizer is given by
\[
\hat{c}_m^2 = \begin{cases} y_m^2 - \sigma^2 & \text{if } y_m^2 > \sigma^2, \\ +0 & \text{otherwise}. \end{cases} \tag{1.66}
\]
Figure 1.4 shows the (componentwise) Bayes free energy (1.64) for different observations, $y_m^2 = 0, \sigma^2, 1.5\sigma^2, 2\sigma^2$. The minimizer is in the positive region of $c_m^2$ if and only if $y_m^2 > \sigma^2$. If the EBayes estimator is given by $\hat{c}_m^2 \to +0$, it means that the prior distribution for the $m$th component $a_m$ of the regression parameter is the Dirac delta function located at the origin.5 This formally means that we a priori knew that $a_m = 0$, i.e., we choose a model that does not contain the $m$th component. By substituting Eq. (1.66) into the Bayes posterior mean (1.51), we obtain the EBayes estimator:
\[
\hat{a}_m^{\mathrm{EBayes}} = \hat{c}_m^2 \left(\hat{c}_m^2 + \sigma^2\right)^{-1} y_m
= \begin{cases} \left(1 - \frac{\sigma^2}{y_m^2}\right) y_m & \text{if } y_m^2 > \sigma^2, \\ 0 & \text{otherwise}. \end{cases} \tag{1.67}
\]
The form of the estimator (1.67) is called the James–Stein (JS) estimator, which has interesting properties including the domination over the ML estimator (Stein, 1956; James and Stein, 1961; Efron and Morris, 1973) (see Appendix A). Note that the assumption that $X = I_M$ is not practical. For a general design matrix $X$, the Bayes free energy is not decomposable into each component. Consequently, the prior variances $\{c_m^2\}_{m=1}^{M}$ that minimize the Bayes free energy (1.61) interact with each other. Therefore, the preceding simple mechanism does not apply. However, it is empirically observed that many prior variances tend to go to $c_m^2 \to +0$, so that the EBayes estimator $\hat{a}^{\mathrm{EBayes}}$ is sparse.
5 When $y_m^2 \le \sigma^2$, the Bayes free energy (1.64) decreases as $c_m^2$ approaches 0. However, the domain of $c_m^2$ is restricted to be positive, and therefore, $c_m^2 = 0$ is not the solution. We express this solution as $c_m^2 \to +0$.
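The thresholding behavior of Eqs. (1.66) and (1.67) can be reproduced with a few lines of code. The sketch below, with hypothetical function names and illustrative values of $y_m$, evaluates the componentwise free energy (1.64), compares the analytic minimizer (1.66) with a brute-force grid search, and computes the resulting estimator (1.67). The grid search is only a sanity check and is not part of the derivation.

```python
import math

def free_energy_component(c2, y, sigma2):
    """2 F*_m of Eq. (1.64), defined for c2 > 0."""
    return math.log(1.0 + c2 / sigma2) - (y * y / sigma2) / (1.0 + sigma2 / c2)

def ebayes_prior_variance(y, sigma2):
    """Eq. (1.66); the value 0.0 stands for the limit c2 -> +0."""
    return max(y * y - sigma2, 0.0)

def ebayes_estimator(y, sigma2):
    """Eq. (1.67): shrink y toward zero, or prune it when y^2 <= sigma2."""
    c2 = ebayes_prior_variance(y, sigma2)
    return c2 / (c2 + sigma2) * y

def grid_minimizer(y, sigma2, upper=10.0, steps=100000):
    """Brute-force minimizer of Eq. (1.64) over a grid on (0, upper]."""
    grid = [upper * (i + 1) / steps for i in range(steps)]
    return min(grid, key=lambda c2: free_energy_component(c2, y, sigma2))

sigma2 = 1.0
y_large = math.sqrt(2.0)     # y^2 = 2 sigma^2: positive minimizer expected
y_small = 0.5                # y^2 < sigma^2: minimizer at the boundary
c2_large = grid_minimizer(y_large, sigma2)
c2_small = grid_minimizer(y_small, sigma2)
print(c2_large, ebayes_prior_variance(y_large, sigma2))
print(c2_small, ebayes_prior_variance(y_small, sigma2))
print(ebayes_estimator(2.0, sigma2), ebayes_estimator(0.5, sigma2))
```

For $y_m^2 \le \sigma^2$ the grid search returns the smallest grid point, which is the discrete counterpart of $c_m^2 \to +0$.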
2 Variational Bayesian Learning
In Chapter 1, we saw examples where the model likelihood has a conjugate prior, with which Bayesian learning can be performed analytically. However, many practical models do not have conjugate priors. Even in such cases, the notion of conjugacy is still useful. Specifically, we can make use of the conditional conjugacy, which comes from the fact that many practical models are built by combining basic distributions. In this chapter, we introduce variational Bayesian (VB) learning, which makes use of the conditional conjugacy, and approximates the Bayes posterior by solving a constrained minimization problem.
2.1 Framework VB learning is derived by casting Bayesian learning as an optimization problem with respect to the posterior distribution (Hinton and van Camp, 1993; MacKay, 1995; Opper and Winther, 1996; Attias, 1999; Jordan et al., 1999; Jaakkola and Jordan, 2000; Ghahramani and Beal, 2001; Bishop, 2006; Wainwright and Jordan, 2008).
2.1.1 Free Energy Minimization
Let $r(w)$, or $r$ for short, be an arbitrary distribution, which we call a trial distribution, on the parameter $w$, and consider the Kullback–Leibler (KL) divergence from the trial distribution $r(w)$ to the Bayes posterior $p(w|D)$:
\[
\mathrm{KL}\left(r(w) \| p(w|D)\right) = \int r(w) \log \frac{r(w)}{p(w|D)}\, dw = \left\langle \log \frac{r(w)}{p(w|D)} \right\rangle_{r(w)}. \tag{2.1}
\]
Since the KL divergence is equal to zero if and only if the two distributions coincide with each other, the minimizer of Eq. (2.1) is the Bayes posterior, i.e.,
\[
p(w|D) = \operatorname*{argmin}_{r} \mathrm{KL}\left(r(w) \| p(w|D)\right). \tag{2.2}
\]
The problem (2.2) is equivalent to the following problem:
\[
p(w|D) = \operatorname*{argmin}_{r} F(r), \tag{2.3}
\]
where the functional of $r$,
\[
F(r) = \int r(w) \log \frac{r(w)}{p(w, D)}\, dw = \left\langle \log \frac{r(w)}{p(w, D)} \right\rangle_{r(w)} \tag{2.4}
\]
\[
\phantom{F(r)} = \mathrm{KL}\left(r(w) \| p(w|D)\right) - \log p(D), \tag{2.5}
\]
is called the free energy. Intuitively, we replaced the posterior distribution $p(w|D)$ in the KL divergence (2.1) with its unnormalized version, the joint distribution $p(D, w) = p(w|D)p(D)$, in the free energy (2.4). The equivalence holds because the normalization factor $p(D)$ does not depend on $w$, and therefore $\langle \log p(D) \rangle_{r(w)} = \log p(D)$ does not depend on $r$. Note that the free energy (2.4) is a generalization of the Bayes free energy, defined by Eq. (1.60): The free energy (2.4) is a functional of an arbitrary distribution $r$, and equal to the Bayes free energy (1.60) for the Bayes posterior $r(w) = p(w|D)$. Since the KL divergence is nonnegative, Eq. (2.5) implies that the free energy $F(r)$ is an upper-bound of the Bayes free energy $-\log p(D)$ for any distribution $r$. Since the log marginal likelihood $\log p(D)$ is called the evidence, $-F(r)$ is also called the evidence lower-bound (ELBO). As mentioned in Section 1.1.2, the joint distribution is easy to compute in general. However, the minimization problem in Eq. (2.3) can still be computationally intractable, because the objective functional (2.4) involves the expectation over the distribution $r(w)$. Actually, it can be hard to even evaluate the objective functional for most of the possible distributions. To make the evaluation of the objective functional tractable for the optimal $r(w)$, we restrict the search space to $G$. Namely, we solve the following problem:
\[
\min_{r} F(r) \quad \text{s.t.} \quad r \in G, \tag{2.6}
\]
where s.t. is an abbreviation for “subject to.” We can choose a tractable distribution class directly for G, e.g., Gaussian, such that the expectation for evaluating the free energy is tractable for any r ∈ G. However, in many practical models, a weaker constraint restricts the optimal distribution to be in a tractable class, thanks to conditional conjugacy.
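The relation between Eqs. (2.4) and (2.5) can be made concrete on a toy problem in which the parameter $w$ takes only three values, so that all integrals become finite sums. The prior and likelihood values below are made up for illustration, and the helper names are hypothetical.

```python
import math

# Toy model: w takes three values; numbers are illustrative only.
prior = [0.5, 0.3, 0.2]            # p(w)
likelihood = [0.10, 0.40, 0.70]    # p(D | w) for the single observed data set D

joint = [p * l for p, l in zip(prior, likelihood)]   # p(w, D)
evidence = sum(joint)                                # p(D)
posterior = [j / evidence for j in joint]            # p(w | D)

def free_energy(r):
    """Eq. (2.4): F(r) = sum_w r(w) log( r(w) / p(w, D) )."""
    return sum(ri * math.log(ri / ji) for ri, ji in zip(r, joint) if ri > 0.0)

def kl(r, p):
    """KL(r || p) for discrete distributions."""
    return sum(ri * math.log(ri / pi) for ri, pi in zip(r, p) if ri > 0.0)

trial = [0.6, 0.3, 0.1]            # an arbitrary trial distribution r(w)
bayes_free_energy = -math.log(evidence)

print(free_energy(trial), kl(trial, posterior) + bayes_free_energy)   # Eq. (2.5)
print(free_energy(posterior), bayes_free_energy)                      # equality at r = p(w|D)
```

The first line prints the two sides of Eq. (2.5) for the trial distribution; the second shows that the bound is attained at the Bayes posterior.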
2.1.2 Conditional Conjugacy
Let us consider a few examples where the model likelihood has no conjugate prior. The likelihood of the matrix factorization model (which will be discussed in detail in Section 3.1) is given by
\[
p(V|A, B) = \frac{\exp\left(-\frac{1}{2\sigma^2} \left\| V - B A^\top \right\|_{\mathrm{Fro}}^2\right)}{(2\pi\sigma^2)^{LM/2}}, \tag{2.7}
\]
where $V \in \mathbb{R}^{L \times M}$ is an observed random variable, and $A \in \mathbb{R}^{M \times H}$ and $B \in \mathbb{R}^{L \times H}$ are the parameters to be estimated. Although $\sigma^2 \in \mathbb{R}_{++}$ can also be unknown, let us treat it as a hyperparameter, i.e., a constant when computing the posterior distribution. If we see Eq. (2.7) as a function of the parameters $w = (A, B)$, its function form is the exponential of a polynomial including a fourth-order term $\| B A^\top \|_{\mathrm{Fro}}^2 = \mathrm{tr}(B A^\top A B^\top)$. Therefore, no conjugate prior exists for this likelihood with respect to the parameters $w = (A, B)$.1 The next example is a mixture of Gaussians (which will be discussed in detail in Section 4.1.1):
\[
p(D, H|w) = \prod_{n=1}^{N} \prod_{k=1}^{K} \left\{ \alpha_k \frac{\exp\left(-\frac{\| x^{(n)} - \mu_k \|^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{M/2}} \right\}^{z_k^{(n)}}, \tag{2.8}
\]
where $D = \{x^{(n)}\}_{n=1}^{N}$ are observed data, $H = \{z^{(n)}\}_{n=1}^{N}$ are hidden variables, and $w = (\alpha, \{\mu_k\}_{k=1}^{K})$ are parameters. For simplicity, we here assume that all Gaussian components have the same variance $\sigma^2$, which is treated as a hyperparameter, i.e., we compute the joint posterior distribution of the hidden variables $\{z^{(n)}\}_{n=1}^{N}$ and the parameters $w$, regarding the hyperparameter $\sigma^2$ as a constant. If we see Eq. (2.8) as a function of $(\{z^{(n)}\}_{n=1}^{N}, \alpha, \{\mu_k\}_{k=1}^{K})$, no conjugate prior exists. More specifically, it has a factor $\prod_{n=1}^{N} \prod_{k=1}^{K} \alpha_k^{z_k^{(n)}}$, and we cannot compute
\[
\sum_{z^{(n)} \in \{e_k\}_{k=1}^{K}} \int \prod_{n=1}^{N} \prod_{k=1}^{K} \alpha_k^{z_k^{(n)}}\, d\alpha_k
\]
analytically for general $N$, which is required when evaluating moments.
1 Here, “no conjugate prior” means that there is no useful and nonconditional conjugate prior, such that the posterior is in the same distribution family with computable moments. We might say that the exponential function of fourth-order polynomials is conjugate to the likelihood (2.7), since the posterior is within the same family. However, this statement is useless in practice because we cannot compute moments of the distribution analytically.
The same difficulty happens in the latent Dirichlet allocation model (which will be discussed in detail in Section 4.2.4). The likelihood is written as
\[
p(D, H|w) = \prod_{m=1}^{M} \prod_{n=1}^{N^{(m)}} \prod_{h=1}^{H} \left\{ \Theta_{m,h} \prod_{l=1}^{L} B_{l,h}^{w_l^{(n,m)}} \right\}^{z_h^{(n,m)}}, \tag{2.9}
\]
where $D = \{\{w^{(n,m)}\}_{n=1}^{N^{(m)}}\}_{m=1}^{M}$ are observed data, $H = \{\{z^{(n,m)}\}_{n=1}^{N^{(m)}}\}_{m=1}^{M}$ are hidden variables, and $w = (\Theta, B)$ are parameters to be estimated. Computing the sum (over the hidden variables $H$) of the integral (over the parameters $w$) is intractable for practical problem sizes.

Readers might find that Eqs. (2.7), (2.8), and (2.9) are not much more complicated than the conjugate cases: Eq. (2.7) is similar to the Gaussian form, and Eqs. (2.8) and (2.9) are in the form of the multinomial or Dirichlet distribution, where we have unknowns both in the base and in the exponent. Indeed, they are in a known form if we regard a part of unknowns as fixed constants. The likelihood (2.7) of the matrix factorization model is in the Gaussian form of $A$ if we see $B$ as a constant, or vice versa. The likelihood (2.8) of a mixture of Gaussians is in the multinomial form of the hidden variables $H = \{z^{(n)}\}_{n=1}^{N}$ if we see the parameters $w = (\alpha, \{\mu_k\}_{k=1}^{K})$ as constants, and it is the (independent) product of the Dirichlet form of $\alpha$ and the Gaussian form of $\{\mu_k\}_{k=1}^{K}$ if we see the hidden variables $H = \{z^{(n)}\}_{n=1}^{N}$ as constants. Similarly, the likelihood (2.9) of the latent Dirichlet allocation model is in the multinomial form of the hidden variables $H = \{\{z^{(n,m)}\}_{n=1}^{N^{(m)}}\}_{m=1}^{M}$ if we see the parameters $w = (\Theta, B)$ as constants, and it is the product of the Dirichlet form of the row vectors $\{\tilde{\theta}_m\}_{m=1}^{M}$ of $\Theta$ and the Dirichlet form of the column vectors $\{\beta_h\}_{h=1}^{H}$ of $B$ if we see the hidden variables $H = \{\{z^{(n,m)}\}_{n=1}^{N^{(m)}}\}_{m=1}^{M}$ as constants. Since the likelihoods in the Gaussian, multinomial, and Dirichlet forms have conjugate priors, the aforementioned properties can be described with the notion of conditional conjugacy, which is defined as follows:
Definition 2.1 (Conditionally conjugate prior) Let us divide the unknown parameters $w$ (or more generally all unknown variables including hidden variables) into two parts $w = (w_1, w_2)$. If the posterior of $w_1$,
\[
p(w_1|w_2, D) \propto p(D|w_1, w_2)\, p(w_1), \tag{2.10}
\]
is in the same distribution family as the prior $p(w_1)$ (where $w_2$ is regarded as a given constant or condition), the prior $p(w_1)$ is called a conditionally conjugate prior of the model likelihood $p(D|w)$ with respect to the parameter $w_1$, given the fixed parameter $w_2$.
2.1.3 Constraint Design
Once conditional conjugacy for all unknowns is found, designing tractable VB learning is straightforward. Let us divide the unknown parameters $w$ into $S$ groups, i.e., $w = (w_1, \ldots, w_S)$, such that, for each $s = 1, \ldots, S$, the model likelihood $p(D|w) = p(D|w_s, \{w_{s'}\}_{s' \neq s})$ has a conditionally conjugate prior $p(w_s)$ with respect to $w_s$, given $\{w_{s'}\}_{s' \neq s}$ as fixed constants. Then, if we use the prior
\[
p(w) = \prod_{s=1}^{S} p(w_s), \tag{2.11}
\]
the posterior distribution $p(w|D) \propto p(D|w)p(w)$ is, as a function of $w_s$, in the same distribution family as the prior $p(w_s)$. Therefore, moments of the posterior distribution are tractable, if the other parameters $\{w_{s'}\}_{s' \neq s}$ are given. To make use of this property, we impose on the approximate posterior the independence constraint between the parameter groups,
\[
r(w) = \prod_{s=1}^{S} r_s(w_s), \tag{2.12}
\]
which allows us to compute moments with respect to $w_s$ independently from the other parameters $\{w_{s'}\}_{s' \neq s}$. In VB learning, we solve the minimization problem (2.6) under the constraint (2.12). This makes the expectation computation, which is required in evaluating the free energy (2.4), tractable (on any stationary points for $r$). Namely, we define the VB posterior as
\[
\hat{r} = \operatorname*{argmin}_{r} F(r) \quad \text{s.t.} \quad r(w) = \prod_{s=1}^{S} r_s(w_s). \tag{2.13}
\]
Note that it is not guaranteed that the free energy $F(r) = \left\langle \log \frac{r(w)}{p(D|w)p(w)} \right\rangle_{r(w)}$ is tractable for any $r$ satisfying the constraint (2.12). However, the constraint allows us to optimize each factor $\{r_s\}_{s=1}^{S}$ separately. To optimize each factor, we rely on calculus of variations, which will be explained in Section 2.1.4. By applying calculus of variations, the free energy is expressed as an explicit function with a finite number of unknown parameters.
2.1.4 Calculus of Variations
Calculus of variations is a method, developed in physics, to derive conditions that any optimal function minimizing a (smooth) functional should satisfy (Courant and Hilbert, 1953). Specifically, it gives (infinitely many) stationary conditions of the functional with respect to the variable. The change of the functional $F(r)$ with respect to an infinitesimal change of the variable $r$ (which is a function of $w$) is called a variation and written as $\delta I$. For $r$ to be a stationary point of the functional, the variation must be equal to zero for all possible values of $w$. Since the free energy (2.4) does not depend on the derivatives of $r(w)$, the variation $\delta I$ is simply the derivative with respect to $r$. Therefore, the stationary conditions are given by
\[
\delta I = \frac{\partial F}{\partial r} = 0, \quad \forall w \in \mathcal{W}, \tag{2.14}
\]
which is a special case of the Euler–Lagrange equation. If we see the function $r(w)$ as a (possibly) infinite-dimensional vector with the parameter value $w$ as its index, the variation $\delta I = \delta I(w)$ can be interpreted as the gradient of the functional $F(r)$ in the $|\mathcal{W}|$-dimensional space. As the stationary conditions in a finite-dimensional space require that all entries of the gradient be equal to zero, the optimal function $r(w)$ should satisfy Eq. (2.14) for any parameter values $w \in \mathcal{W}$. In Section 2.1.5, we see that, by applying the stationary conditions (2.14) to the free energy minimization problem (2.13) with the independence constraint taken into account, we can find that each factor $r_s(w_s)$ of the approximate posterior is in the same distribution family as the corresponding prior $p_s(w_s)$, thanks to the conditional conjugacy.
2.1.5 Variational Bayesian Learning
Let us solve the problem (2.13) to get the VB posterior
\[
\hat{r} = \operatorname*{argmin}_{r} F(r) \quad \text{s.t.} \quad r(w) = \prod_{s=1}^{S} r_s(w_s).
\]
We use the decomposable conditionally conjugate prior (2.11):
\[
p(w) = \prod_{s=1}^{S} p(w_s),
\]
which means that, for each $s = 1, \ldots, S$, the posterior $p(w_s|\{w_{s'}\}_{s' \neq s}, D)$ for $w_s$ is in the same form as the corresponding prior $p(w_s)$, given $\{w_{s'}\}_{s' \neq s}$ as fixed constants.
Now we apply the calculus of variations, and compute the stationary conditions (2.14). The free energy can be written as
\[
F(r) = \int \left(\prod_{s=1}^{S} r_s(w_s)\right) \left(\log \frac{\prod_{s=1}^{S} r_s(w_s)}{p(D|w) \prod_{s=1}^{S} p(w_s)}\right) dw. \tag{2.15}
\]
Taking the derivative of Eq. (2.15) with respect to $r_s(w_s)$ for any $s = 1, \ldots, S$ and $w_s \in \mathcal{W}$, we obtain the following stationary conditions:
\[
\begin{aligned}
0 = \frac{\partial F}{\partial r_s} &= \int \left(\prod_{s' \neq s} r_{s'}(w_{s'})\right) \left(\log \frac{\prod_{s'=1}^{S} r_{s'}(w_{s'})}{p(D|w) \prod_{s'=1}^{S} p(w_{s'})} + 1\right) \prod_{s' \neq s} dw_{s'} \\
&= \left\langle \log \frac{\prod_{s'=1}^{S} r_{s'}(w_{s'})}{p(D|w) \prod_{s'=1}^{S} p(w_{s'})} + 1 \right\rangle_{\prod_{s' \neq s} r_{s'}(w_{s'})} \\
&= \left\langle \log \frac{\prod_{s' \neq s} r_{s'}(w_{s'})}{p(D|w) \prod_{s' \neq s} p(w_{s'})} \right\rangle_{\prod_{s' \neq s} r_{s'}(w_{s'})} + \log \frac{r_s(w_s)}{p(w_s)} + 1 \\
&= \left\langle \log \frac{1}{p(D|w)} \right\rangle_{\prod_{s' \neq s} r_{s'}(w_{s'})} + \log \frac{r_s(w_s)}{p(w_s)} + \mathrm{const.}
\end{aligned} \tag{2.16}
\]
Note the following on Eq. (2.16):
• The right-hand side is a function of $w_s$ ($w_{s'}$ for $s' \neq s$ are integrated out).
• For each $s$, Eq. (2.16) must hold for any possible value of $w_s$, which can fully specify the function form of the posterior $r_s(w_s)$.
• To make Eq. (2.16) satisfied for any $w_s$, it is necessary that $-\left\langle \log p(D|w) \right\rangle_{\prod_{s' \neq s} r_{s'}(w_{s'})} + \log \frac{r_s(w_s)}{p(w_s)}$ is a constant.
The last note leads to the following relation:
\[
r_s(w_s) \propto p(w_s) \exp \left\langle \log p(D|w) \right\rangle_{\prod_{s' \neq s} r_{s'}(w_{s'})}. \tag{2.17}
\]
As a function of $w_s$, Eq. (2.17) can be written as
\[
\begin{aligned}
r_s(w_s) &\propto \exp \left\langle \log p(D|w)\, p(w_s) \right\rangle_{\prod_{s' \neq s} r_{s'}(w_{s'})} \\
&\propto \exp \left\langle \log p(w_s|\{w_{s'}\}_{s' \neq s}, D) \right\rangle_{\prod_{s' \neq s} r_{s'}(w_{s'})} \\
&\propto \exp \int \log p(w_s|\{w_{s'}\}_{s' \neq s}, D) \prod_{s' \neq s} r_{s'}(w_{s'})\, dw_{s'}.
\end{aligned} \tag{2.18}
\]
Due to the conditional conjugacy, $p(w_s|\{w_{s'}\}_{s' \neq s}, D)$ is in the same form as the prior $p(w_s)$. As the integral operator $g(x) = \int f(x; \alpha)\, d\alpha$ can be interpreted as an infinite number of additions of parametric functions $f(x; \alpha)$ over all possible values of $\alpha$, the operator $h(x) = \exp \int \log f(x; \alpha)\, d\alpha$ corresponds to an infinite number of multiplications of $f(x; \alpha)$ over all possible values of $\alpha$. Therefore, Eq. (2.18) implies that the VB posterior $r_s(w_s)$ is in the same form as the prior $p(w_s)$, if the distribution family is multiplicatively closed. Assume that the prior $p(w_s)$ for each group of parameters is in a multiplicatively closed distribution family. Then, we may express the corresponding VB posterior $r_s(w_s)$ in a parametric form, of which the parameters are called variational parameters, without any loss of accuracy or optimality. The last question is whether we can compute the expectation value of the log-likelihood $\log p(D|w)$ for each factor $r_s(w_s)$ of the approximate posterior. In many cases, this expectation can be computed analytically, which allows us to express the stationary conditions (2.17) as a finite number of equations in explicit forms of the variational parameters. Typically, the obtained stationary conditions are used to update the variational parameters in an iterative algorithm, which gives a local minimizer $\hat{r}$ of the free energy (2.4). We call the minimizer $\hat{r}$ the VB posterior, and its mean
\[
\hat{w} = \left\langle w \right\rangle_{\hat{r}(w)} \tag{2.19}
\]
the VB estimator. The computation of the predictive distribution $p(D^{\mathrm{new}}|D) = \left\langle p(D^{\mathrm{new}}|w) \right\rangle_{\hat{r}(w)}$ can be hard even after finding the VB posterior $\hat{r}(w)$. This is natural because we needed approximation for the function form of the likelihood $p(D|w)$, and now we need to compute the integral with the integrand involving the same function form. In many practical cases, the plug-in predictive distribution $p(D^{\mathrm{new}}|\hat{w})$, i.e., the model distribution with the VB estimator plugged in, is substituted for the predictive distribution.
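To see how the stationary condition (2.17) turns into an iterative algorithm, consider the smallest instance of the matrix factorization likelihood (2.7): a scalar observation $V$ with $p(V|a, b) = \mathrm{Gauss}_1(V; ab, \sigma^2)$ and priors $a \sim \mathrm{Gauss}_1(0, c_a^2)$, $b \sim \mathrm{Gauss}_1(0, c_b^2)$. This scalar setup, the hyperparameter values, and all names below are assumptions made for illustration. Applying Eq. (2.17) to $r(a)r(b)$ gives Gaussian factors whose means and variances are updated alternately. The sketch also evaluates the free energy (2.4) in closed form and compares it with $-\log p(V)$ obtained by one-dimensional numerical integration.

```python
import math

def free_energy(V, a_hat, sa2, b_hat, sb2, sigma2, ca2, cb2):
    """Free energy (2.4) for Gaussian factors r(a) = N(a_hat, sa2), r(b) = N(b_hat, sb2)."""
    ea2 = a_hat ** 2 + sa2                     # <a^2> under r(a)
    eb2 = b_hat ** 2 + sb2                     # <b^2> under r(b)
    kl_a = 0.5 * (math.log(ca2 / sa2) - 1.0 + ea2 / ca2)   # KL(r(a) || p(a))
    kl_b = 0.5 * (math.log(cb2 / sb2) - 1.0 + eb2 / cb2)   # KL(r(b) || p(b))
    fit = 0.5 * math.log(2.0 * math.pi * sigma2) \
        + (V * V - 2.0 * V * a_hat * b_hat + ea2 * eb2) / (2.0 * sigma2)  # -<log p(V|a,b)>
    return kl_a + kl_b + fit

def vb_scalar_factorization(V, sigma2=1.0, ca2=1.0, cb2=1.0, iters=50):
    """Alternate the updates implied by Eq. (2.17) for r(a) and r(b)."""
    a_hat, b_hat, sa2, sb2 = 1.0, 1.0, 1.0, 1.0   # arbitrary nonzero initialization
    history = []
    for _ in range(iters):
        # r(a) is proportional to p(a) exp <log p(V|a,b)>_{r(b)}
        sa2 = 1.0 / ((b_hat ** 2 + sb2) / sigma2 + 1.0 / ca2)
        a_hat = sa2 * V * b_hat / sigma2
        # r(b) is proportional to p(b) exp <log p(V|a,b)>_{r(a)}
        sb2 = 1.0 / ((a_hat ** 2 + sa2) / sigma2 + 1.0 / cb2)
        b_hat = sb2 * V * a_hat / sigma2
        history.append(free_energy(V, a_hat, sa2, b_hat, sb2, sigma2, ca2, cb2))
    return a_hat, sa2, b_hat, sb2, history

def neg_log_evidence(V, sigma2=1.0, ca2=1.0, cb2=1.0, n=20001):
    """-log p(V) by the trapezoidal rule; a is integrated out analytically."""
    half_width = 10.0 * math.sqrt(cb2)
    h = 2.0 * half_width / (n - 1)
    total = 0.0
    for i in range(n):
        b = -half_width + i * h
        var = ca2 * b * b + sigma2             # V | b ~ N(0, ca2 b^2 + sigma2)
        f = math.exp(-V * V / (2.0 * var)) / math.sqrt(2.0 * math.pi * var) \
            * math.exp(-b * b / (2.0 * cb2)) / math.sqrt(2.0 * math.pi * cb2)
        total += (0.5 if i in (0, n - 1) else 1.0) * f * h
    return -math.log(total)

V = 3.0
a_hat, sa2, b_hat, sb2, history = vb_scalar_factorization(V)
bayes_free_energy = neg_log_evidence(V)
print(history[0], history[-1], bayes_free_energy)
```

Each update minimizes the free energy with respect to one factor while the other is fixed, so the recorded values should never increase, and the final value should stay above the Bayes free energy, as Eq. (2.5) requires.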
2.1.6 Empirical Variational Bayesian Learning
When the model involves hyperparameters $\kappa$ in the likelihood and/or in the prior, the joint distribution is dependent on $\kappa$, i.e., $p(D, w|\kappa) = p(w|\kappa)\, p(D|w, \kappa)$, and so is the free energy:
\[
F(r, \kappa) = \int r(w) \log \frac{r(w)}{p(D, w|\kappa)}\, dw = \left\langle \log \frac{r(w)}{p(w, D|\kappa)} \right\rangle_{r(w)} \tag{2.20}
\]
\[
\phantom{F(r, \kappa)} = \mathrm{KL}\left(r(w) \| p(w|D, \kappa)\right) - \log p(D|\kappa). \tag{2.21}
\]
Similarly to the empirical Bayesian learning, the hyperparameters can be estimated from observation by minimizing the free energy simultaneously with respect to $r$ and $\kappa$:
\[
(\hat{r}, \hat{\kappa}) = \operatorname*{argmin}_{r, \kappa} F(r, \kappa).
\]
This approach is called the empirical VB (EVB) learning. EVB learning amounts to minimizing the sum of the KL divergence to the Bayes posterior and the negative log marginal likelihood (see Eq. (2.21)). Conceptually, minimizing any weighted sum of those two terms is reasonable to find the VB posterior and the hyperparameters at the same time. But only the unweighted sum makes the objective tractable: under this choice, the objective is written with the joint distribution as in Eq. (2.20), while any other choice requires explicitly accessing the Bayes posterior and the marginal likelihood separately.
2.1.7 Techniques for Nonconjugate Models
In Sections 2.1.2 through 2.1.5, we saw how to design tractable VB learning by making use of the conditional conjugacy. However, there are also many cases where a reasonable model does not have a conditionally conjugate prior. A frequent and important example is the case where the likelihood involves the sigmoid function,
\[
\sigma(x; w) = \frac{1}{1 + e^{-w^\top x}}, \tag{2.22}
\]
or a function with a similar shape, e.g., the error function, the hyperbolic tangent, and the rectified linear unit (ReLU). We face such cases, for example, in solving classification problems and in adopting neural network structure with a nonlinear activation function. To maintain the tractability in such cases, we need to explicitly restrict the function form of the approximate posterior $r(w; \lambda)$, and optimize its variational parameters $\lambda$ by free energy minimization. Namely, we solve the VB learning problem (2.6) with the search space $G$ set to the function space of a simple distribution family, e.g., the Gaussian distribution $r(w; \lambda) = \mathrm{Gauss}_D(w; \hat{w}, \hat{\Sigma})$ parameterized with the variational parameters $\lambda = (\hat{w}, \hat{\Sigma})$ consisting of the mean and the covariance parameters. Then, the VB learning problem (2.6) is reduced to the following unconstrained minimization problem,
\[
\min_{\lambda} F(\lambda), \tag{2.23}
\]
of the free energy / 0 r(w; λ) r(w; λ) dw = log F( λ) = r(w; λ) log , p(w, D) p(w, D) r(w; λ)
(2.24)
which is a function of the variational parameters λ. It is often the case that the free energy (2.24) is still intractable in computing the expectation value of the log joint probability, log p(w, D) = log p(D|w)p(w), over the approximate posterior r(w; λ) (because of the intractable function form of the likelihood p(D|w) or the prior p(w)). In this section, we introduce a few techniques developed for coping with such intractable functions. Local Variational Approximation The first method is to bound the joint distribution p(w, D) with a simple function, of which the expectation value over the approximate distribution r(w; λ) is tractable. As seen in Section 2.1.1, the free energy (2.24) is an upper-bound of the λ. Consider further upperBayes free energy, F Bayes ≡ − log p(D), for any bounding the free energy as r(w; λ)
dw (2.25) r(w; λ) log F(λ) ≤ F(λ, ξ) ≡ p(w; ξ) by replacing the joint distribution p(w, D) with its parametric lower-bound p(w; ξ) such that 0 ≤ p(w; ξ) ≤ p(w, D)
(2.26)
for any w ∈ W and ξ ∈ Ξ. Here, we introduced another set of variational parameters ξ with its domain Ξ. Let us choose a lower-bound p(w; ξ) such that its function form with respect to w is the same as the approximate posterior r(w; λ). More specifically, we assume that, for any given ξ, there exists λ such that λ) p(w; ξ) ∝ r(w;
(2.27)
as a function of w.2 Since the direct minimization of F( λ) is intractable, we λ, ξ) jointly over λ and ξ. Namely, we instead minimize its upper-bound F( solve the problem λ, ξ), min F(
λ,ξ
2
(2.28)
The parameterization, i.e., the function form with respect to the variational parameters, can be λ). different between p(w; ξ) and r(w;
2.1 Framework
49
to find the approximate posterior r(w; λ) such that F( λ, ξ) (≥ F( λ)) is closest Bayes (when ξ is also optimized). to the Bayes free energy F Let p(w; ξ) (2.29) q(w; ξ) = Z(ξ) be the distribution created by normalizing the lower-bound with its normalization factor Z(ξ) = p(w; ξ)dw. (2.30) Note that the normalization factor (2.30) is trivially a lower-bound of the marginal likelihood, i.e., p(w, D)dw = p(D), Z(ξ) ≤ and is tractable because of the assumption (2.27) that p is in the same simple function form as r. With Eq. (2.29), the upper-bound (2.25) is expressed as r(w; λ) dw − log Z(ξ) F( λ, ξ) = r(w; λ) log q(w; ξ) = KL r(w; λ)q(w; ξ) − log Z(ξ), (2.31) which implies that the optimal λ is attained when r(w; λ) = q(w; ξ)
(2.32)
for any ξ ∈ Ξ (the assumption (2.27) guarantees the attainability). Thus, by putting this back into Eq. (2.31), the problem (2.28) is reduced to max Z(ξ), ξ
(2.33)
which amounts to maximizing the lower-bound (2.30) of the marginal likelihood p(D). Once the maximizer ξ is obtained, Eq. (2.32) gives the optimal approximate posterior. Such an approximation scheme for nonconjugate models is called local variational approximation or direct site bounding (Jaakkola and Jordan, 2000; Girolami, 2001; Bishop, 2006; Seeger, 2008, 2009), which will be discussed further with concrete examples in Chapter 5. Existing nonconjugate models applied with the local variational approximation form the bound in Eq. (2.26) based on the convexity of a function. In such a case, the gap between F(ξ) and F turns out to be the expected Bregman divergence associated with the convex function (see Section 5.3.1). A similar approach can be
50
2 Variational Bayesian Learning
applied to expectation propagation, another approximation method introduced in Section 2.2.3. There, by upper-bounding the joint probability $p(w, D)$, we minimize an upper-bound of $\mathrm{KL}(p(w|D) \,\|\, r(w; \lambda))$ (see Section 2.2.3).

Black Box Variational Inference

As the available data size increases, and the benefit of using big data has been proven, for example, by the breakthrough in deep learning (Krizhevsky et al., 2012), scalable training algorithms have been intensively developed to enable big data analysis on billions of data samples. Stochastic gradient descent (Robbins and Monro, 1951; Spall, 2003), where a noisy gradient of the objective function is cheaply computed from a subset of the whole data in each iteration, has become popular, and has been adopted for VB learning (Hoffman et al., 2013; Khan et al., 2016). Black-box variational inference was proposed as a general method to compute a noisy gradient of the free energy in nonconjugate models (Ranganath et al., 2013; Wingate and Weber, 2013; Kingma and Welling, 2014). As a function of the variational parameters $\lambda$, the gradient of the free energy (2.24) can be evaluated by

$$
\begin{aligned}
\frac{\partial F}{\partial \lambda}
&= \frac{\partial}{\partial \lambda} \int r(w; \lambda) \log \frac{r(w; \lambda)}{p(D, w)} \, dw \\
&= \int \frac{\partial r(w; \lambda)}{\partial \lambda} \log \frac{r(w; \lambda)}{p(D, w)} \, dw + \int r(w; \lambda) \frac{\partial \log r(w; \lambda)}{\partial \lambda} \, dw \\
&= \int r(w; \lambda) \frac{\partial \log r(w; \lambda)}{\partial \lambda} \log \frac{r(w; \lambda)}{p(D, w)} \, dw + \frac{\partial}{\partial \lambda} \int r(w; \lambda) \, dw \\
&= \left\langle \frac{\partial \log r(w; \lambda)}{\partial \lambda} \log \frac{r(w; \lambda)}{p(D, w)} \right\rangle_{r(w; \lambda)},
\end{aligned}
\tag{2.34}
$$

where the last term in the third line vanishes because $\int r(w; \lambda) \, dw = 1$ for any $\lambda$. Assume that we restrict the approximate posterior $r(w; \lambda)$ to be in a simple distribution family, from which samples can be easily drawn, and whose score function, the gradient of the log probability, is easily computed, e.g., an analytic form is available. Then, Eq. (2.34) can be easily computed by drawing samples from $r(w; \lambda)$ and computing the sample average. With some variance reduction techniques, the stochastic gradient with the black box gradient estimator (2.34) has been shown to be useful for VB learning in general nonconjugate models.

A notable advantage is that it does not require any model-specific analysis to implement the gradient estimation, since Eq. (2.34) can be evaluated as long as the joint probability $p(D, w) = p(D|w)p(w)$ of the model can be evaluated for drawn samples of $w$.
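The estimator (2.34) can be sketched in a few lines. The following is a minimal illustration, not the implementation used in the cited works: it assumes a toy model $x_n \sim \mathrm{Gauss}_1(w, 1)$ with prior $w \sim \mathrm{Gauss}_1(0, 1)$, a Gaussian $r(w; \lambda)$ with $\lambda = (m, \log s)$, and a simple sample-mean baseline as the variance reduction step. The function names `log_joint` and `bbvi_gradient` are hypothetical.

```python
import numpy as np

def log_joint(w, x):
    """log p(D, w) for the assumed toy model x_n ~ N(w, 1), w ~ N(0, 1)."""
    w = np.asarray(w, dtype=float)
    log_lik = -0.5 * np.sum((x[None, :] - w[:, None]) ** 2, axis=1) \
              - 0.5 * x.size * np.log(2 * np.pi)
    log_prior = -0.5 * w ** 2 - 0.5 * np.log(2 * np.pi)
    return log_lik + log_prior

def bbvi_gradient(m, log_s, x, n_samples, rng):
    """Monte Carlo estimate of Eq. (2.34) for r(w; m, s) = N(w; m, s^2)."""
    s = np.exp(log_s)
    w = m + s * rng.standard_normal(n_samples)          # samples from r(w; lambda)
    log_r = -0.5 * np.log(2 * np.pi) - log_s - 0.5 * ((w - m) / s) ** 2
    f = log_r - log_joint(w, x)                         # log r(w) / p(D, w)
    f = f - f.mean()                                    # baseline (variance reduction)
    score_m = (w - m) / s ** 2                          # d log r / d m
    score_log_s = ((w - m) / s) ** 2 - 1.0              # d log r / d log s
    return np.array([np.mean(score_m * f), np.mean(score_log_s * f)])

x = np.array([0.8, 1.2, 1.0, 0.9, 1.1])
grad = bbvi_gradient(0.5, np.log(0.5), x, 200_000, np.random.default_rng(0))
```

Only `log_joint` depends on the model, which is the point made above: swapping in another model requires no new derivation. Subtracting the sample mean of `f` introduces a small bias of order one over the number of samples; a baseline computed from independent samples would avoid it.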
2.2 Other Approximation Methods

There are several other methods for approximate Bayesian learning, which are briefly introduced in this section.

2.2.1 Laplace Approximation

In the Laplace approximation, the posterior is approximated by a Gaussian: $r(w) = \mathrm{Gauss}_D(w; \hat{w}, \hat{\Sigma})$. VB learning finds the variational parameters $\hat{\lambda} = (\hat{w}, \hat{\Sigma})$ by minimizing the free energy (2.4), i.e., solving the problem (2.6) with the search space $\mathcal{G}$ restricted to the Gaussian distributions. Instead, the Laplace approximation estimates the mean and the covariance by

$$\hat{w}^{\mathrm{LA}} \,(= \hat{w}^{\mathrm{MAP}}) = \operatorname*{argmax}_{w} \; p(D|w)p(w), \tag{2.35}$$

$$\hat{\Sigma}^{\mathrm{LA}} = F^{-1}, \tag{2.36}$$

where the entries of $F \in \mathbb{S}_{++}^{D}$ are given by

$$F_{i,j} = - \left. \frac{\partial^2 \log p(D|w)p(w)}{\partial w_i \, \partial w_j} \right|_{w = \hat{w}^{\mathrm{LA}}}. \tag{2.37}$$

Namely, the Laplace approximation first finds the MAP estimator for the mean, and then computes Eq. (2.37) at $w = \hat{w}^{\mathrm{LA}}$ to estimate the inverse covariance, which corresponds to the second-order Taylor approximation to $\log p(D|w)p(w)$. Note that, for the flat prior $p(w) \propto 1$, Eq. (2.37) is reduced to the Fisher information:

$$F_{i,j} = \left\langle \frac{\partial \log p(D|w)}{\partial w_i} \frac{\partial \log p(D|w)}{\partial w_j} \right\rangle_{p(D|w)} = - \left\langle \frac{\partial^2 \log p(D|w)}{\partial w_i \, \partial w_j} \right\rangle_{p(D|w)}.$$

In general, the Laplace approximation is computationally less demanding than VB learning, since no integral computation is involved, and the inverse covariance estimation (2.36) is performed only once after the MAP mean estimator (2.35) is found.
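The two steps, Eqs. (2.35) and (2.36), can be sketched as follows. This is an illustration under assumptions not taken from the text: the model is Bayesian logistic regression with an isotropic Gaussian prior, the MAP estimator is found by Newton's method with step halving, and the names `neg_log_joint` and `laplace_approximation` are hypothetical.

```python
import numpy as np

def neg_log_joint(w, X, t, prior_var):
    """-log p(D|w)p(w) up to a constant, for labels t in {0, 1}."""
    z = X @ w
    return np.sum(np.logaddexp(0.0, z) - t * z) + 0.5 * (w @ w) / prior_var

def laplace_approximation(X, t, prior_var=1.0, n_iter=50):
    """Return the MAP mean (2.35), the covariance (2.36), and the matrix F (2.37)."""
    D = X.shape[1]
    w = np.zeros(D)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        grad = X.T @ (p - t) + w / prior_var                  # gradient of -log joint
        F = (X.T * (p * (1.0 - p))) @ X + np.eye(D) / prior_var
        step = np.linalg.solve(F, grad)
        eta, f0 = 1.0, neg_log_joint(w, X, t, prior_var)
        # Step halving keeps the Newton iteration from increasing the objective.
        while neg_log_joint(w - eta * step, X, t, prior_var) > f0 and eta > 1e-8:
            eta *= 0.5
        w = w - eta * step
    # Negative Hessian of the log joint, evaluated once at the MAP estimate.
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    F = (X.T * (p * (1.0 - p))) @ X + np.eye(D) / prior_var
    return w, np.linalg.inv(F), F
```

As noted above, the covariance is computed only once, after the mode has been located; no integral over $w$ appears anywhere.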
2.2.2 Partially Bayesian Learning

Partially Bayesian (PB) learning is MAP learning after some of the unknown parameters are integrated out. This approach can be described in the free energy minimization framework (2.6) with a stronger constraint than VB learning.
Let us split the unknown parameters $w$ into two parts $w = (w_1, w_2)$, and assume that we integrate $w_1$ out and point-estimate $w_2$. Integrating $w_1$ out means that we consider the exact posterior on $w_1$, and MAP estimating $w_2$ means that we approximate the posterior of $w_2$ with the delta function. Namely, PB learning solves the following problem:

$$\min_{r} F(r) \quad \text{s.t.} \quad r(w) = r_1(w_1) \cdot \delta(w_2; \hat{w}_2), \tag{2.38}$$

where the free energy $F(r)$ is defined by Eq. (2.4), and $\delta(w; \hat{w})$ is the Dirac delta function located at $\hat{w}$. Using the constraint in Eq. (2.38), under which the variables to be optimized are $r_1$ and $\hat{w}_2$, we can express the free energy as

$$
\begin{aligned}
F(r_1, \hat{w}_2)
&= \left\langle \log \frac{r_1(w_1) \cdot \delta(w_2; \hat{w}_2)}{p(D|w_1, w_2)p(w_1)p(w_2)} \right\rangle_{r_1(w_1) \cdot \delta(w_2; \hat{w}_2)} \\
&= \left\langle \log \frac{r_1(w_1)}{p(D|w_1, \hat{w}_2)p(w_1)p(\hat{w}_2)} \right\rangle_{r_1(w_1)} + \left\langle \log \delta(w_2; \hat{w}_2) \right\rangle_{\delta(w_2; \hat{w}_2)} \\
&= \left\langle \log \frac{r_1(w_1)}{p(w_1|\hat{w}_2, D)} \right\rangle_{r_1(w_1)} - \log p(D|\hat{w}_2)p(\hat{w}_2) + \left\langle \log \delta(w_2; \hat{w}_2) \right\rangle_{\delta(w_2; \hat{w}_2)},
\end{aligned}
\tag{2.39}
$$

where

$$p(D|\hat{w}_2) = \left\langle p(D|w_1, \hat{w}_2) \right\rangle_{p(w_1)} = \int p(D|w_1, \hat{w}_2)p(w_1) \, dw_1. \tag{2.40}$$

The free energy (2.39) depends on $r_1$ only through the first term, which is the KL divergence, $\mathrm{KL}\left(r_1(w_1) \,\|\, p(w_1|\hat{w}_2, D)\right)$, from the trial distribution to the Bayes posterior (conditioned on $\hat{w}_2$). Therefore, the minimizer for $r_1$ is trivially the conditional Bayes posterior

$$\hat{r}_1(w_1) = p(w_1|\hat{w}_2, D), \tag{2.41}$$

with which the first term in Eq. (2.39) vanishes. The third term in Eq. (2.39) is the entropy of the delta function, which diverges to infinity but is independent of $\hat{w}_2$. By regarding the delta function as a distribution with its width narrow enough to express a point estimate, while its entropy is finite (although it is very large), we can ignore the third term. Thus, the free energy minimization problem (2.38) can be written as

$$\min_{\hat{w}_2} \; - \log p(D|\hat{w}_2)p(\hat{w}_2), \tag{2.42}$$

which amounts to MAP learning for $w_2$ after $w_1$ is marginalized out.
This method is computationally beneficial when the likelihood $p(D|w) = p(D|w_1, w_2)$ is conditionally conjugate to the prior $p(w_1)$ with respect to $w_1$, given $w_2$. Thanks to the conditional conjugacy, the posterior (2.41) of $w_1$ is in a known form, and its normalization factor (2.40), which is required when evaluating the objective in Eq. (2.42), can be obtained analytically.

PB learning was applied in many previous works. For example, in the expectation-maximization (EM) algorithm (Dempster et al., 1977), latent variables are integrated out and parameters are point-estimated. In the first probabilistic interpretation of principal component analysis (PCA) (Tipping and Bishop, 1999), one factor of the matrix factorization was called a latent variable and integrated out, while the other factor was called a parameter and point-estimated. The same idea has been adopted for Gibbs sampling and VB learning, where some of the unknown parameters are integrated out based on the conditional conjugacy, and the other parameters are estimated by the corresponding learning method. Those methods are called collapsed Gibbs sampling (Griffiths and Steyvers, 2004) and collapsed VB learning (Kurihara et al., 2007; Teh et al., 2007; Sato et al., 2012), respectively. Following this terminology, PB learning may also be called collapsed MAP learning. The collapsed version is in general more accurate and more computationally efficient than the uncollapsed counterpart, since it imposes a weaker constraint and applies a nonexact numerical estimation to a smaller number of unknowns.
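A small sketch may make the collapsed MAP objective (2.42) concrete. It assumes a scalar analogue of probabilistic PCA that is not in the text: $v_n = b \, a_n + \varepsilon_n$ with latent $a_n \sim \mathrm{Gauss}_1(0, 1)$ playing the role of $w_1$, loading $b \sim \mathrm{Gauss}_1(0, c^2)$ playing the role of $w_2$, and known noise variance $\sigma^2$. Integrating $a_n$ out gives $v_n \sim \mathrm{Gauss}_1(0, b^2 + \sigma^2)$ analytically. The function names and the grid search are illustrative choices.

```python
import numpy as np

def neg_log_marginal_joint(b, v, sigma2, c2):
    """-log p(D|b)p(b), Eq. (2.42), with the latent a_n integrated out."""
    b = np.asarray(b, dtype=float)
    s2 = b ** 2 + sigma2                       # marginal variance of v_n given b
    nll = 0.5 * v.size * np.log(2 * np.pi * s2) + 0.5 * np.sum(v ** 2) / s2
    neg_log_prior = 0.5 * np.log(2 * np.pi * c2) + 0.5 * b ** 2 / c2
    return nll + neg_log_prior

def pb_estimate(v, sigma2, c2, grid=None):
    """Point estimate of b by a grid search over b >= 0 (the objective is even in b)."""
    if grid is None:
        grid = np.linspace(0.0, 5.0, 5001)
    return grid[np.argmin(neg_log_marginal_joint(grid, v, sigma2, c2))]

def conditional_posterior(v, b, sigma2):
    """Mean and variance of p(a_n | b, v_n), the conditional posterior of Eq. (2.41)."""
    var = sigma2 / (b ** 2 + sigma2)
    return b * v / (b ** 2 + sigma2), var
```

Because the marginal likelihood is available in closed form, only the single unknown $b$ is handled numerically, which is the computational benefit described above.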
2.2.3 Expectation Propagation

As explained in Section 2.1.1, VB learning amounts to minimizing the KL divergence $\mathrm{KL}(r(w) \,\|\, p(w|D))$ from the approximate posterior to the Bayes posterior. Expectation propagation (EP) is an alternative deterministic approximation scheme, which minimizes the KL divergence from the Bayes posterior to the approximate posterior (Minka, 2001b), i.e.,

$$\min_{r} \; \mathrm{KL}(p(w|D) \,\|\, r(w)) \quad \text{s.t.} \quad r \in \mathcal{G}. \tag{2.43}$$

Clearly from its definition, the KL divergence,

$$\mathrm{KL}(q(x) \,\|\, p(x)) = \int q(x) \log \frac{q(x)}{p(x)} \, dx,$$

diverges to $+\infty$ if the support of $q(x)$ is not covered by the support of $p(x)$, while it remains finite if the support of $p(x)$ is not covered by the support of $q(x)$. Due to this asymmetric property of the KL divergence, VB learning and EP can provide drastically different approximate posteriors: VB learning, minimizing $\mathrm{KL}(r(w) \,\|\, p(w|D))$, tends to provide a posterior that approximates a single mode of the Bayes posterior, while EP, minimizing $\mathrm{KL}(p(w|D) \,\|\, r(w))$, tends to provide a posterior with a broad support covering all modes of the Bayes posterior (see the illustration in Figure 2.1).

Figure 2.1 Bayes posterior, VB posterior, and EP posterior.

Moment Matching Algorithm

The EP problem (2.43) is typically solved by moment matching. It starts with expressing the posterior distribution by the product of factors,

$$p(w|D) = \frac{1}{Z} \prod_{n} t_n(w),$$

where $Z = p(D)$ is the marginal likelihood. For example, in the parametric density estimation (Example 1.1) with i.i.d. samples $D = \{x^{(1)}, \ldots, x^{(N)}\}$, the factor can be set to $t_n(w) = p(x^{(n)}|w)$ and $t_0(w) = p(w)$. In EP, the approximating posterior is also assumed to have the same form,

$$r(w) = \frac{1}{\tilde{Z}} \prod_{n} \tilde{t}_n(w), \tag{2.44}$$

where $\tilde{Z}$ is the normalization constant and becomes an approximation of the marginal likelihood $Z$. Note that the factorization is not over the elements of $w$. EP tries to minimize the KL divergence,

$$\mathrm{KL}(p \,\|\, r) = \mathrm{KL}\left( \frac{1}{Z} \prod_{n} t_n(w) \,\Big\|\, \frac{1}{\tilde{Z}} \prod_{n} \tilde{t}_n(w) \right),$$

which is approximately carried out by refining each factor while the other factors are fixed, and cycling through all the factors. To refine the factor $\tilde{t}_n(w)$, we define the unnormalized distribution,

$$r_{\neg n}(w) = \frac{r(w)}{\tilde{t}_n(w)},$$
and the following distribution is used as an estimator of the true posterior:

$$\hat{p}_n(w) = \frac{t_n(w) \, r_{\neg n}(w)}{\hat{Z}_n},$$

where $\hat{Z}_n = \int t_n(w) \, r_{\neg n}(w) \, dw$ is the normalization constant. That is, the new approximating posterior $r^{\mathrm{new}}(w)$ is computed so that it minimizes $\mathrm{KL}(\hat{p}_n \,\|\, r^{\mathrm{new}})$. Usually, the approximating posterior is assumed to be a member of the exponential family. In that case, the minimization of $\mathrm{KL}(\hat{p}_n \,\|\, r^{\mathrm{new}})$ is reduced to the moment matching between $\hat{p}_n$ and $r^{\mathrm{new}}$. Namely, the parameter of $r^{\mathrm{new}}$ is determined so that its moments are matched with those of $\hat{p}_n$. The new approximating posterior $r^{\mathrm{new}}$ yields the refinement of the factor $\tilde{t}_n(w)$,

$$\tilde{t}_n(w) = \hat{Z}_n \frac{r^{\mathrm{new}}(w)}{r_{\neg n}(w)},$$

where the multiplication of $\hat{Z}_n$ is derived from the zeroth-order moment matching between $\hat{p}_n$ and $r^{\mathrm{new}}$, $\int \tilde{t}_n(w) \, r_{\neg n}(w) \, dw = \int t_n(w) \, r_{\neg n}(w) \, dw$. After several passes through all the factors, if the factors converge, then the posterior is approximated by Eq. (2.44), and the marginal likelihood is approximated by $\tilde{Z} = \int \prod_{n} \tilde{t}_n(w) \, dw$ or alternatively by updating it as $\tilde{Z} \leftarrow \tilde{Z} \hat{Z}_n$ whenever the factor $\tilde{t}_n(w)$ is refined. Although the convergence of EP is not guaranteed, it is known that if EP converges, the resulting approximating posterior is a stationary point of a certain energy function (Minka, 2001b).

Local Variational Approximation for EP

In Section 2.1.1, we saw that VB learning minimizes an upper-bound (the free energy (2.4)) of the Bayes free energy $F^{\mathrm{Bayes}} \equiv -\log p(D)$ (or equivalently maximizes the ELBO). We can say that EP does the opposite. Namely, the EP problem (2.43) maximizes a lower-bound of the Bayes free energy:

$$\max_{r} E(r) \quad \text{s.t.} \quad r \in \mathcal{G}, \tag{2.45}$$

where

$$
\begin{aligned}
E(r) &= - \int \frac{p(w, D)}{p(D)} \log \frac{p(w, D)}{r(w)} \, dw && (2.46) \\
&= - \int p(w|D) \log \frac{p(w|D)}{r(w)} \, dw - \log p(D) \\
&= - \mathrm{KL}(p(w|D) \,\|\, r(w)) - \log p(D). && (2.47)
\end{aligned}
$$
The maximization form (2.45) of the EP problem can be solved by local variational approximation, which is akin to the local variational approximation for VB learning (Section 2.1.7). Let us restrict the search space $\mathcal{G}$ for the approximate posterior $r(w; \nu)$ to the function space of a simple distribution family, e.g., Gaussian, parameterized with variational parameters $\nu$. Then, the EP problem (2.45) is reduced to the following unconstrained maximization problem,

$$\max_{\nu} E(\nu) \tag{2.48}$$

of the objective function written as

$$E(\nu) = - \int \frac{p(w, D)}{p(D)} \log \frac{p(w, D)}{p(D) \, r(w; \nu)} \, dw - \log p(D). \tag{2.49}$$

Consider lower-bounding the objective (2.49) as

$$E(\nu) \geq \underline{E}(\nu, \eta) \equiv - \int \frac{\overline{p}(w; \eta)}{p(D)} \max\left\{ 0, \log \frac{\overline{p}(w; \eta)}{p(D) \, r(w; \nu)} \right\} dw - \log p(D) \tag{2.50}$$

by using a parametric upper-bound $\overline{p}(w; \eta)$ of the joint distribution such that

$$\overline{p}(w; \eta) \geq p(w, D) \tag{2.51}$$

for any $w \in \mathcal{W}$ and $\eta \in \mathcal{H}$, where $\eta$ is another set of variational parameters with its domain $\mathcal{H}$.³ Let us choose an upper-bound $\overline{p}(w; \eta)$ such that its function form with respect to $w$ is the same as the approximate posterior $r(w; \nu)$. More specifically, we assume that, for any given $\eta$, there exists $\nu$ such that

$$\overline{p}(w; \eta) \propto r(w; \nu) \tag{2.52}$$

as a function of $w$. Since the direct maximization of $E(\nu)$ is intractable, we instead maximize its lower-bound $\underline{E}(\nu, \eta)$ jointly over $\nu$ and $\eta$. Namely, we solve the problem,

$$\max_{\nu, \eta} \underline{E}(\nu, \eta), \tag{2.53}$$

to find the approximate posterior $r(w; \nu)$ such that $\underline{E}(\nu, \eta) \; (\leq E(\nu))$ is closest to the Bayes free energy $F^{\mathrm{Bayes}}$ (when $\eta$ is also optimized). Let

$$q(w; \eta) = \frac{\overline{p}(w; \eta)}{Z(\eta)} \tag{2.54}$$

be the distribution created by normalizing the upper-bound with its normalization factor

$$Z(\eta) = \int \overline{p}(w; \eta) \, dw. \tag{2.55}$$

³ The two sets, $\nu$ and $\eta$, of variational parameters play the same roles as $\lambda$ and $\xi$, respectively, in the local variational approximation for VB learning.
Note that the normalization factor (2.55) is trivially an upper-bound of the marginal likelihood, i.e., $Z(\eta) \geq \int p(w, D) \, dw = p(D)$, and is tractable because of the assumption (2.52) that $\overline{p}$ is in the same simple function form as $r$. With Eq. (2.54), the lower-bound (2.50) is expressed as

$$
\begin{aligned}
\underline{E}(\nu, \eta)
&= - \int \frac{Z(\eta) q(w; \eta)}{p(D)} \max\left\{ 0, \log \frac{Z(\eta) q(w; \eta)}{p(D) \, r(w; \nu)} \right\} dw - \log p(D) \\
&= - \frac{Z(\eta)}{p(D)} \int q(w; \eta) \max\left\{ 0, \log \frac{q(w; \eta)}{r(w; \nu)} + \log \frac{Z(\eta)}{p(D)} \right\} dw - \log p(D).
\end{aligned}
\tag{2.56}
$$

Eq. (2.56) is upper-bounded by

$$
\begin{aligned}
&- \frac{Z(\eta)}{p(D)} \int q(w; \eta) \left( \log \frac{q(w; \eta)}{r(w; \nu)} + \log \frac{Z(\eta)}{p(D)} \right) dw - \log p(D) \\
&\qquad = - \frac{Z(\eta)}{p(D)} \mathrm{KL}\left( q(w; \eta) \,\|\, r(w; \nu) \right) - \frac{Z(\eta)}{p(D)} \log \frac{Z(\eta)}{p(D)} - \log p(D),
\end{aligned}
\tag{2.57}
$$

which, for any $\eta \in \mathcal{H}$, is maximized when $\nu$ is such that

$$r(w; \nu) = q(w; \eta) \tag{2.58}$$

(the assumption (2.52) guarantees the attainability). With this optimal $\nu$, Eq. (2.57) coincides with Eq. (2.56). Thus, after optimization with respect to $\nu$, the lower-bound (2.56) is given as

$$\max_{\nu} \underline{E}(\nu, \eta) = - \frac{Z(\eta)}{p(D)} \log \frac{Z(\eta)}{p(D)} - \log p(D). \tag{2.59}$$

Since $x \log x$ for $x \geq 1$ is monotonically increasing, maximizing the lower-bound (2.59) is achieved by solving

$$\min_{\eta} Z(\eta). \tag{2.60}$$
Once the minimizer η is obtained, Eq. (2.58) gives the optimal approximate posterior. The problem (2.60) amounts to minimizing an upper-bound of the marginal likelihood. This is in contrast to the local variational approximation for VB learning, where a lower-bound of the marginal likelihood is maximized in the end (compare Eq. (2.33) and Eq. (2.60)).
Figure 2.2 Bayes posterior and its tightest lower- and upper-bounds, formed by a Gaussian.
Remembering that the joint distribution is proportional to the Bayes posterior, i.e., p(w|D) = p(w, D)/p(D), we can say that the VB posterior is the normalized version of the tightest (in terms of the total mass) lower-bound of the Bayes posterior, while the EP posterior is the normalized version of the tightest upper-bound of the Bayes posterior. Figure 2.2 illustrates the tightest lower-bound and the tightest upper-bound of the Bayes posterior, which correspond to unnormalized versions of the VB posterior and the EP posterior, respectively (compare Figures 2.1 and 2.2). This view also explains the tendency of VB learning and EP: a lower-bound (the VB posterior) must be zero wherever the Bayes posterior is zero, while an upper-bound (the EP posterior) must be positive wherever the Bayes posterior is positive.
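The contrasting tendency of the two KL directions can be checked numerically on a grid. The sketch below is not EP itself: it assumes a one-dimensional bimodal "posterior" (an equal mixture of two Gaussians, chosen only for illustration), obtains the Gaussian minimizing $\mathrm{KL}(p \,\|\, r)$ directly by matching the first two moments, and finds the Gaussian minimizing $\mathrm{KL}(r \,\|\, p)$ by a brute-force search over its mean and standard deviation.

```python
import numpy as np

w = np.linspace(-8.0, 8.0, 3201)
dw = w[1] - w[0]

def log_gauss(w, m, s):
    return -0.5 * np.log(2 * np.pi) - np.log(s) - 0.5 * ((w - m) / s) ** 2

# Assumed bimodal target: 0.5 N(-2, 0.5^2) + 0.5 N(2, 0.5^2)
log_p = np.logaddexp(np.log(0.5) + log_gauss(w, -2.0, 0.5),
                     np.log(0.5) + log_gauss(w, 2.0, 0.5))
p = np.exp(log_p)

# Minimizer of KL(p || r) over Gaussians: match mean and variance of p.
m_ep = np.sum(w * p) * dw
s_ep = np.sqrt(np.sum((w - m_ep) ** 2 * p) * dw)

def reverse_kl(m, s):
    """KL(r || p) for r = N(m, s^2), by a Riemann sum on the grid."""
    log_r = log_gauss(w, m, s)
    return np.sum(np.exp(log_r) * (log_r - log_p)) * dw

# Minimizer of KL(r || p) over Gaussians: brute-force grid search.
grid_m = np.linspace(-3.0, 3.0, 61)
grid_s = np.linspace(0.2, 3.0, 57)
kl = np.array([[reverse_kl(m, s) for s in grid_s] for m in grid_m])
i, j = np.unravel_index(np.argmin(kl), kl.shape)
m_vb, s_vb = grid_m[i], grid_s[j]
```

Under these assumptions, the moment-matched Gaussian sits between the modes with a broad width, while the minimizer of $\mathrm{KL}(r \,\|\, p)$ locks onto one of the two modes with roughly the width of a single component.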
2.2.4 Metropolis–Hastings Sampling

If a sufficient number of samples $\{w^{(1)}, \ldots, w^{(L)}\}$ from the posterior distribution (1.3) are obtained, the expectation $\int f(w) p(w|D) \, dw$ required for computing the quantities such as Eqs. (1.6) through (1.9) can be approximated by

$$\frac{1}{L} \sum_{l=1}^{L} f(w^{(l)}).$$

The Metropolis–Hastings sampling and the Gibbs sampling are the most popular methods to sample from the (unnormalized) posterior distribution in the framework of Markov chain Monte Carlo (MCMC).

In the Metropolis–Hastings sampling, we draw samples from a simple distribution $q(w|w^{(t)})$ called a proposal distribution, which is conditioned on the current state $w^{(t)}$ of the parameter (or latent variables) $w$. The proposal distribution is chosen to be a simple distribution such as a Gaussian centered at $w^{(t)}$ if $w$ is continuous or the uniform distribution in a certain neighborhood of $w^{(t)}$ if $w$ is discrete. At each cycle of the algorithm, we draw a candidate sample $w^{*}$ from the proposal distribution $q(w|w^{(t)})$, and we accept it with probability

$$\min\left( 1, \frac{p(w^{*}, D)}{p(w^{(t)}, D)} \frac{q(w^{(t)}|w^{*})}{q(w^{*}|w^{(t)})} \right).$$

If $w^{*}$ is accepted, then the next state $w^{(t+1)}$ is moved to $w^{*}$, $w^{(t+1)} = w^{*}$; otherwise, it stays at the current state, $w^{(t+1)} = w^{(t)}$. We repeat this procedure until a sufficiently long sequence of states is obtained. Note that if the proposal distribution is symmetric, i.e., $q(w|w') = q(w'|w)$ for any $w$ and $w'$, in which case the algorithm is called the Metropolis algorithm, the probability of acceptance depends on the ratio of the posteriors,

$$\frac{p(w^{*}, D)}{p(w^{(t)}, D)} = \frac{p(w^{*}, D)/Z}{p(w^{(t)}, D)/Z} = \frac{p(w^{*}|D)}{p(w^{(t)}|D)},$$

and if $w^{*}$ has higher posterior probability (density) than $w^{(t)}$, it is accepted with probability 1. To guarantee that the distribution of the sampled sequence converges to the posterior distribution, we discard a first part of the sequence, which is called burn-in. Usually, after the burn-in period, we retain only every $M$th sample and discard the other samples so that the retained samples can be considered as independent if $M$ is sufficiently large.
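The procedure above can be sketched for the symmetric (Metropolis) case as follows. The target is an assumed toy example, an unnormalized Gaussian log joint with mean 1 and standard deviation 0.5, and the function name, step size, burn-in length, and thinning interval are illustrative choices rather than recommendations.

```python
import numpy as np

def metropolis(log_joint, w0, n_steps, step_size, rng):
    """Random-walk Metropolis sampler; only the unnormalized log joint is needed."""
    w, lp = w0, log_joint(w0)
    chain = np.empty(n_steps)
    for t in range(n_steps):
        w_star = w + step_size * rng.standard_normal()   # Gaussian proposal centered at w
        lp_star = log_joint(w_star)
        # Symmetric proposal: the q-ratio is 1, so accept with prob min(1, p*/p).
        if np.log(rng.uniform()) < lp_star - lp:
            w, lp = w_star, lp_star
        chain[t] = w                                     # on rejection, the state is repeated
    return chain

# Assumed toy target: log p(w, D) = -(w - 1)^2 / (2 * 0.5^2) + const.
chain = metropolis(lambda w: -0.5 * ((w - 1.0) / 0.5) ** 2, 0.0, 20_000, 1.0,
                   np.random.default_rng(0))
samples = chain[1000::10]        # discard burn-in, then keep every 10th state
```

The normalization constant never appears: the acceptance test uses only the difference of log joint values.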
2.2.5 Gibbs Sampling

Another popular MCMC method is Gibbs sampling, which makes use of the conditional conjugacy. More specifically, it is applicable when we can compute and draw samples from the conditional distribution of a variable of $w \in \mathbb{R}^{J}$,

$$p(w_j | w_1, \ldots, w_{j-1}, w_{j+1}, \ldots, w_J, D) \equiv p(w_j | w_{\neg j}, D),$$

conditioned on the rest of the variables of $w$. Assuming that $w^{(t)}$ is obtained at the $t$th cycle of the Gibbs sampling algorithm, the next sample of each variable is drawn from the conditional distribution,

$$p(w_j^{(t+1)} | w_{\neg j}^{(t)}, D), \quad \text{where} \quad w_{\neg j}^{(t)} = (w_1^{(t+1)}, \ldots, w_{j-1}^{(t+1)}, w_{j+1}^{(t)}, \ldots, w_J^{(t)}),$$

from $j = 1$ to $J$ in turn.

This sampling procedure can be viewed as a special case of the Metropolis–Hastings algorithm. If the proposal distribution $q(w|w^{(t)})$ is chosen to be

$$p(w_j | w_{\neg j}^{(t)}, D) \, \delta(w_{\neg j} - w_{\neg j}^{(t)}),$$

then the probability that the candidate $w^{*}$ is accepted is 1 since $w_{\neg j}^{*} = w_{\neg j}^{(t)}$ implies that

$$
\begin{aligned}
\frac{p(w^{*}, D)}{p(w^{(t)}, D)} \frac{q(w^{(t)}|w^{*})}{q(w^{*}|w^{(t)})}
&= \frac{p(w^{*}|D)}{p(w^{(t)}|D)} \frac{p(w_j^{(t)} | w_{\neg j}^{(t)}, D)}{p(w_j^{*} | w_{\neg j}^{(t)}, D)} \\
&= \frac{p(w_{\neg j}^{*}|D) \, p(w_j^{*} | w_{\neg j}^{*}, D)}{p(w_{\neg j}^{(t)}|D) \, p(w_j^{(t)} | w_{\neg j}^{(t)}, D)} \frac{p(w_j^{(t)} | w_{\neg j}^{(t)}, D)}{p(w_j^{*} | w_{\neg j}^{(t)}, D)} \\
&= 1.
\end{aligned}
$$

As we have seen in the Metropolis–Hastings and Gibbs sampling algorithms, MCMC methods do not require the knowledge of the normalization constant $Z = \int p(w, D) \, dw$. Note, however, that even if we have samples from the posterior, we need additional steps to compute $Z$ with the samples. A simple way is to calculate the expectation of the inverse of the likelihood by the sample average,

$$\left\langle \frac{1}{p(D|w)} \right\rangle_{p(w|D)} \approx \frac{1}{L} \sum_{l=1}^{L} \frac{1}{p(D|w^{(l)})}.$$

It provides an estimate of the inverse of $Z$ because

$$\left\langle \frac{1}{p(D|w)} \right\rangle_{p(w|D)} = \int \frac{1}{p(D|w)} \frac{p(D|w)p(w)}{Z} \, dw = \frac{1}{Z} \int p(w) \, dw = \frac{1}{Z}.$$

However, this estimator is known to have high variance. A more sophisticated sampling method to compute $Z$ was developed by Chib (1995), while it requires multiple runs of MCMC sampling. A new efficient method to compute the marginal likelihood was recently proposed and named a widely applicable Bayesian information criterion (WBIC), which requires only a single run of MCMC sampling from a generalized posterior distribution (Watanabe, 2013). This method computes the expectation of the negative log-likelihood, $\left\langle -\log p(D|w) \right\rangle_{p^{(\beta)}(w|D)}$, over the $\beta$-generalized posterior distribution defined as $p^{(\beta)}(w|D) \propto p(D|w)^{\beta} p(w)$, with $\beta = 1/\log N$, where $N$ is the number of i.i.d. samples. The computed (approximated) expectation is proved to have the same leading terms as those of the asymptotic expansion of $-\log Z$ as $N \to \infty$.
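The coordinate-wise sampling scheme of Section 2.2.5 can be sketched on a case where the conditionals are known exactly. The example assumes, purely for illustration, that the "posterior" is a zero-mean bivariate Gaussian with unit variances and correlation $\rho$, for which $p(w_1|w_2) = \mathrm{Gauss}_1(w_1; \rho w_2, 1 - \rho^2)$ and symmetrically for $w_2$. The function name `gibbs_bivariate_gauss` is hypothetical.

```python
import numpy as np

def gibbs_bivariate_gauss(rho, n_sweeps, rng, w_init=(0.0, 0.0)):
    """Gibbs sampler for N(0, [[1, rho], [rho, 1]]) using its exact conditionals."""
    w1, w2 = w_init
    cond_std = np.sqrt(1.0 - rho ** 2)
    chain = np.empty((n_sweeps, 2))
    for t in range(n_sweeps):
        # j = 1: draw w1 given the current w2.
        w1 = rho * w2 + cond_std * rng.standard_normal()
        # j = 2: draw w2 given the freshly updated w1.
        w2 = rho * w1 + cond_std * rng.standard_normal()
        chain[t] = (w1, w2)
    return chain

chain = gibbs_bivariate_gauss(0.8, 20_000, np.random.default_rng(0))
```

Every draw is kept, in line with the acceptance probability of 1 derived above. The second conditional uses the already updated first coordinate, matching the definition of $w_{\neg j}^{(t)}$.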
Part II Algorithm
3 VB Algorithm for Multilinear Models
In this chapter, we derive iterative VB algorithms for multilinear models with Gaussian noise, where we can rely on the conditional conjugacy with respect to each linear factor. The models introduced in this chapter will be further analyzed in Part III, where the global solution or its approximation is analytically derived, and the behavior of the VB solution is investigated in detail.
3.1 Matrix Factorization

Assume that we observe a matrix $V \in \mathbb{R}^{L \times M}$, which is the sum of a target matrix $U \in \mathbb{R}^{L \times M}$ and a noise matrix $E \in \mathbb{R}^{L \times M}$:

$$V = U + E.$$

In the matrix factorization (MF) model (Srebro and Jaakkola, 2003; Srebro et al., 2005; Lim and Teh, 2007; Salakhutdinov and Mnih, 2008; Ilin and Raiko, 2010) or the probabilistic principal component analysis (probabilistic PCA) (Tipping and Bishop, 1999; Bishop, 1999b), the target matrix is assumed to be low rank, and therefore can be factorized as

$$U = B A^{\top},$$

where $A \in \mathbb{R}^{M \times H}$, $B \in \mathbb{R}^{L \times H}$ for $H \leq \min(L, M)$ are unknown parameters to be estimated, and $\top$ denotes the transpose of a matrix or vector. Here, the rank of $U$ is upper-bounded by $H$. We denote a column vector of a matrix by a bold lowercase letter, and a row vector by a bold lowercase letter with a tilde, namely,

$$A = (\boldsymbol{a}_1, \ldots, \boldsymbol{a}_H) = (\tilde{\boldsymbol{a}}_1, \ldots, \tilde{\boldsymbol{a}}_M)^{\top} \in \mathbb{R}^{M \times H},$$

$$B = (\boldsymbol{b}_1, \ldots, \boldsymbol{b}_H) = (\tilde{\boldsymbol{b}}_1, \ldots, \tilde{\boldsymbol{b}}_L)^{\top} \in \mathbb{R}^{L \times H}.$$
3.1.1 VB Learning for MF

Assume that the observation noise $E$ is independent Gaussian:

$$p(V|A, B) \propto \exp\left( -\frac{1}{2\sigma^2} \left\| V - B A^{\top} \right\|_{\mathrm{Fro}}^2 \right), \tag{3.1}$$

where $\|\cdot\|_{\mathrm{Fro}}$ denotes the Frobenius norm.

Conditional Conjugacy

If we treat $B$ as a constant, the likelihood (3.1) is in the Gaussian form of $A$. Similarly, if we treat $A$ as a constant, the likelihood (3.1) is in the Gaussian form of $B$. Therefore, conditional conjugacy with respect to $A$ given $B$, as well as with respect to $B$ given $A$, holds if we adopt Gaussian priors:

$$p(A) \propto \exp\left( -\frac{1}{2} \mathrm{tr}\left( A C_A^{-1} A^{\top} \right) \right), \tag{3.2}$$

$$p(B) \propto \exp\left( -\frac{1}{2} \mathrm{tr}\left( B C_B^{-1} B^{\top} \right) \right), \tag{3.3}$$

where $\mathrm{tr}(\cdot)$ denotes the trace of a matrix. Typically, the prior covariance matrices $C_A$ and $C_B$ are restricted to be diagonal, which induces low-rankness (we discuss this mechanism in Chapter 7):

$$C_A = \mathrm{Diag}(c_{a_1}^2, \ldots, c_{a_H}^2), \qquad C_B = \mathrm{Diag}(c_{b_1}^2, \ldots, c_{b_H}^2),$$

for $c_{a_h}, c_{b_h} > 0$, $h = 1, \ldots, H$.

Variational Bayesian Algorithm

Thanks to the conditional conjugacy, the following independence constraint makes the approximate posterior Gaussian:

$$r(A, B) = r_A(A) \, r_B(B). \tag{3.4}$$

The VB learning problem (2.13) is then reduced to

$$\hat{r} = \operatorname*{argmin}_{r} F(r) \quad \text{s.t.} \quad r(A, B) = r_A(A) \, r_B(B). \tag{3.5}$$

Under the constraint (3.4), the free energy is written as

$$
\begin{aligned}
F(r) &= \left\langle \log \frac{r_A(A) \, r_B(B)}{p(V|A, B) p(A) p(B)} \right\rangle_{r_A(A) r_B(B)} \\
&= \int r_A(A) \, r_B(B) \log \frac{r_A(A) \, r_B(B)}{p(V|A, B) p(A) p(B)} \, dA \, dB.
\end{aligned}
\tag{3.6}
$$
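Following the recipe described in Section 2.1.5, we take the derivatives of the free energy (3.6) with respect to $r_A(A)$ and $r_B(B)$, respectively. Thus, we obtain the following stationary conditions:

$$r_A(A) \propto p(A) \exp \left\langle \log p(V|A, B) \right\rangle_{r_B(B)}, \tag{3.7}$$

$$r_B(B) \propto p(B) \exp \left\langle \log p(V|A, B) \right\rangle_{r_A(A)}. \tag{3.8}$$

By substituting the likelihood (3.1) and the prior (3.2) into Eq. (3.7), we obtain

$$
\begin{aligned}
r_A(A) &\propto \exp\left( -\frac{1}{2} \mathrm{tr}\left( A C_A^{-1} A^{\top} \right) - \frac{1}{2\sigma^2} \left\langle \left\| V - B A^{\top} \right\|_{\mathrm{Fro}}^2 \right\rangle_{r_B(B)} \right) \\
&\propto \exp\left( -\frac{1}{2} \mathrm{tr}\left( A C_A^{-1} A^{\top} + \sigma^{-2} \left\langle -2 V^{\top} B A^{\top} + A B^{\top} B A^{\top} \right\rangle_{r_B(B)} \right) \right) \\
&\propto \exp\left( -\frac{\mathrm{tr}\left( (A - \hat{A}) \hat{\Sigma}_A^{-1} (A - \hat{A})^{\top} \right)}{2} \right),
\end{aligned}
\tag{3.9}
$$

where

$$\hat{A} = \sigma^{-2} V^{\top} \left\langle B \right\rangle_{r_B(B)} \hat{\Sigma}_A, \tag{3.10}$$

$$\hat{\Sigma}_A = \sigma^2 \left( \left\langle B^{\top} B \right\rangle_{r_B(B)} + \sigma^2 C_A^{-1} \right)^{-1}. \tag{3.11}$$

Similarly, by substituting the likelihood (3.1) and the prior (3.3) into Eq. (3.8), we obtain

$$
\begin{aligned}
r_B(B) &\propto \exp\left( -\frac{1}{2} \mathrm{tr}\left( B C_B^{-1} B^{\top} \right) - \frac{1}{2\sigma^2} \left\langle \left\| V - B A^{\top} \right\|_{\mathrm{Fro}}^2 \right\rangle_{r_A(A)} \right) \\
&\propto \exp\left( -\frac{1}{2} \mathrm{tr}\left( B C_B^{-1} B^{\top} + \sigma^{-2} \left\langle -2 V A B^{\top} + B A^{\top} A B^{\top} \right\rangle_{r_A(A)} \right) \right) \\
&\propto \exp\left( -\frac{\mathrm{tr}\left( (B - \hat{B}) \hat{\Sigma}_B^{-1} (B - \hat{B})^{\top} \right)}{2} \right),
\end{aligned}
\tag{3.12}
$$

where

$$\hat{B} = \sigma^{-2} V \left\langle A \right\rangle_{r_A(A)} \hat{\Sigma}_B, \tag{3.13}$$

$$\hat{\Sigma}_B = \sigma^2 \left( \left\langle A^{\top} A \right\rangle_{r_A(A)} + \sigma^2 C_B^{-1} \right)^{-1}. \tag{3.14}$$

Eqs. (3.9) and (3.12) imply that the posteriors are Gaussian. More specifically, they can be written as

$$r_A(A) = \mathrm{MGauss}_{M,H}(A; \hat{A}, I_M \otimes \hat{\Sigma}_A), \tag{3.15}$$

$$r_B(B) = \mathrm{MGauss}_{L,H}(B; \hat{B}, I_L \otimes \hat{\Sigma}_B), \tag{3.16}$$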
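where $\otimes$ denotes the Kronecker product, and

$$\mathrm{MGauss}_{D_1, D_2}(X; M, \breve{\Sigma}) \equiv \mathrm{Gauss}_{D_1 \cdot D_2}\left( \mathrm{vec}(X^{\top}); \mathrm{vec}(M^{\top}), \breve{\Sigma} \right) \tag{3.17}$$

denotes the matrix variate Gaussian distribution (Gupta and Nagar, 1999). Here, $\mathrm{vec} : \mathbb{R}^{D_2 \times D_1} \to \mathbb{R}^{D_2 D_1}$ is the vectorization operator, which concatenates all column vectors of a matrix into a long column vector. Note that, if the covariance has a specific structure expressed as $\breve{\Sigma} = \Sigma \otimes \Psi \in \mathbb{R}^{D_2 D_1 \times D_2 D_1}$, such as Eqs. (3.15) and (3.16), the matrix variate Gaussian distribution can be written as

$$
\begin{aligned}
\mathrm{MGauss}_{D_1, D_2}(X; M, \Sigma \otimes \Psi) \equiv{} & \frac{1}{(2\pi)^{D_1 D_2 / 2} \det(\Sigma)^{D_2 / 2} \det(\Psi)^{D_1 / 2}} \\
& \cdot \exp\left( -\frac{1}{2} \mathrm{tr}\left( \Sigma^{-1} (X - M) \Psi^{-1} (X - M)^{\top} \right) \right).
\end{aligned}
\tag{3.18}
$$

The fact that the posterior is Gaussian is a consequence of the forced independence between $A$ and $B$ and conditional conjugacy. The parameters, $(\hat{A}, \hat{B}, \hat{\Sigma}_A, \hat{\Sigma}_B)$, defining the VB posterior (3.15) and (3.16), are the variational parameters. Since $r_A(A)$ and $r_B(B)$ are Gaussian, the first and the (noncentered) second moments can be expressed with variational parameters as follows:

$$\left\langle A \right\rangle_{r_A(A)} = \hat{A}, \qquad \left\langle A^{\top} A \right\rangle_{r_A(A)} = \hat{A}^{\top} \hat{A} + M \hat{\Sigma}_A,$$

$$\left\langle B \right\rangle_{r_B(B)} = \hat{B}, \qquad \left\langle B^{\top} B \right\rangle_{r_B(B)} = \hat{B}^{\top} \hat{B} + L \hat{\Sigma}_B.$$

By substituting the preceding into Eqs. (3.10), (3.11), (3.13), and (3.14), we have the following relations among the variational parameters:

$$\hat{A} = \sigma^{-2} V^{\top} \hat{B} \hat{\Sigma}_A, \tag{3.19}$$

$$\hat{\Sigma}_A = \sigma^2 \left( \hat{B}^{\top} \hat{B} + L \hat{\Sigma}_B + \sigma^2 C_A^{-1} \right)^{-1}, \tag{3.20}$$

$$\hat{B} = \sigma^{-2} V \hat{A} \hat{\Sigma}_B, \tag{3.21}$$

$$\hat{\Sigma}_B = \sigma^2 \left( \hat{A}^{\top} \hat{A} + M \hat{\Sigma}_A + \sigma^2 C_B^{-1} \right)^{-1}. \tag{3.22}$$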
As we see shortly, Eqs. (3.19) through (3.22) are stationary conditions for variational parameters, which can be used as update rules for coordinate descent local search (Bishop, 1999b).
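Free Energy as a Function of Variational Parameters

By substituting Eqs. (3.15) and (3.16) into Eq. (3.6), we can explicitly write down the free energy as (not a functional but) a function of the unknown variational parameters $(\hat{A}, \hat{B}, \hat{\Sigma}_A, \hat{\Sigma}_B)$:

$$
\begin{aligned}
2F ={}& 2 \left\langle \log \frac{r_A(A) \, r_B(B)}{p(V|A, B) p(A) p(B)} \right\rangle_{r_A(A) r_B(B)} \\
={}& 2 \left\langle \log \frac{r_A(A) \, r_B(B)}{p(A) p(B)} \right\rangle_{r_A(A) r_B(B)} - 2 \left\langle \log p(V|A, B) \right\rangle_{r_A(A) r_B(B)} \\
={}& M \log \frac{\det(C_A)}{\det(\hat{\Sigma}_A)} + L \log \frac{\det(C_B)}{\det(\hat{\Sigma}_B)} \\
& + \Big\langle \mathrm{tr}\left( C_A^{-1} A^{\top} A + C_B^{-1} B^{\top} B \right) - \mathrm{tr}\left( \hat{\Sigma}_A^{-1} (A - \hat{A})^{\top} (A - \hat{A}) + \hat{\Sigma}_B^{-1} (B - \hat{B})^{\top} (B - \hat{B}) \right) \\
& \qquad + LM \log(2\pi\sigma^2) + \frac{\| V - B A^{\top} \|_{\mathrm{Fro}}^2}{\sigma^2} \Big\rangle_{r_A(A) r_B(B)} \\
={}& M \log \frac{\det(C_A)}{\det(\hat{\Sigma}_A)} + L \log \frac{\det(C_B)}{\det(\hat{\Sigma}_B)} - \mathrm{tr}\left( M \hat{\Sigma}_A^{-1} \hat{\Sigma}_A + L \hat{\Sigma}_B^{-1} \hat{\Sigma}_B \right) \\
& + \mathrm{tr}\left( C_A^{-1} ( \hat{A}^{\top} \hat{A} + M \hat{\Sigma}_A ) + C_B^{-1} ( \hat{B}^{\top} \hat{B} + L \hat{\Sigma}_B ) \right) \\
& + LM \log(2\pi\sigma^2) + \left\langle \frac{\| (V - \hat{B} \hat{A}^{\top}) + (\hat{B} \hat{A}^{\top} - B A^{\top}) \|_{\mathrm{Fro}}^2}{\sigma^2} \right\rangle_{r_A(A) r_B(B)} \\
={}& M \log \frac{\det(C_A)}{\det(\hat{\Sigma}_A)} + L \log \frac{\det(C_B)}{\det(\hat{\Sigma}_B)} - (L + M) H \\
& + \mathrm{tr}\left( C_A^{-1} ( \hat{A}^{\top} \hat{A} + M \hat{\Sigma}_A ) + C_B^{-1} ( \hat{B}^{\top} \hat{B} + L \hat{\Sigma}_B ) \right) \\
& + LM \log(2\pi\sigma^2) + \frac{\| V - \hat{B} \hat{A}^{\top} \|_{\mathrm{Fro}}^2}{\sigma^2} + \left\langle \frac{\| \hat{B} \hat{A}^{\top} - B A^{\top} \|_{\mathrm{Fro}}^2}{\sigma^2} \right\rangle_{r_A(A) r_B(B)} \\
={}& LM \log(2\pi\sigma^2) + \frac{\| V - \hat{B} \hat{A}^{\top} \|_{\mathrm{Fro}}^2}{\sigma^2} + M \log \frac{\det(C_A)}{\det(\hat{\Sigma}_A)} + L \log \frac{\det(C_B)}{\det(\hat{\Sigma}_B)} \\
& - (L + M) H + \mathrm{tr}\Big\{ C_A^{-1} ( \hat{A}^{\top} \hat{A} + M \hat{\Sigma}_A ) + C_B^{-1} ( \hat{B}^{\top} \hat{B} + L \hat{\Sigma}_B ) \\
& \qquad + \sigma^{-2} \left( - \hat{A}^{\top} \hat{A} \hat{B}^{\top} \hat{B} + ( \hat{A}^{\top} \hat{A} + M \hat{\Sigma}_A ) ( \hat{B}^{\top} \hat{B} + L \hat{\Sigma}_B ) \right) \Big\}.
\end{aligned}
\tag{3.23}
$$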
Now, the VB learning problem is reduced from the function optimization (3.5) to the following variable optimization:

$$
\begin{aligned}
\text{Given} \quad & C_A, C_B \in \mathbb{D}_{++}^{H}, \quad \sigma^2 \in \mathbb{R}_{++}, \\
\min_{\hat{A}, \hat{B}, \hat{\Sigma}_A, \hat{\Sigma}_B} \quad & F, \\
\text{s.t.} \quad & \hat{A} \in \mathbb{R}^{M \times H}, \quad \hat{B} \in \mathbb{R}^{L \times H}, \quad \hat{\Sigma}_A, \hat{\Sigma}_B \in \mathbb{S}_{++}^{H},
\end{aligned}
\tag{3.24}
$$

where $\mathbb{R}_{++}$ is the set of positive real numbers, $\mathbb{S}_{++}^{D}$ is the set of $D \times D$ (symmetric) positive definite matrices, and $\mathbb{D}_{++}^{D}$ is the set of $D \times D$ positive definite diagonal matrices. We note the following:

• Once the solution $(\hat{A}, \hat{B}, \hat{\Sigma}_A, \hat{\Sigma}_B)$ of the problem (3.24) is obtained, Eqs. (3.15) and (3.16) specify the VB posterior $r(A, B) = r_A(A) \, r_B(B)$.
• We treated the prior covariances $C_A$ and $C_B$ and the noise variance $\sigma^2$ as hyperparameters, and therefore assumed them to be given when the VB problem was solved. However, they can be estimated through the empirical Bayesian procedure, which is explained shortly. They can also be treated as random variables, and their VB posterior can be computed by adopting conjugate Gamma priors and minimizing the free energy under an appropriate independence constraint.
• Eqs. (3.19) through (3.22) coincide with the stationary conditions of the free energy (3.23), which are derived from the derivatives with respect to $\hat{A}$, $\hat{\Sigma}_A$, $\hat{B}$, and $\hat{\Sigma}_B$, respectively. Therefore, iterating Eqs. (3.19) through (3.22) gives a local solution to the problem (3.24).
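The iteration of Eqs. (3.19) through (3.22) with fixed hyperparameters, together with the free energy (3.23), can be sketched as follows. This is a minimal NumPy illustration rather than a reference implementation: the function names, the random initialization, and the fixed iteration count are assumptions. The covariance is updated before the mean in each block so that each pair of updates is an exact minimization over $r_A$ or $r_B$.

```python
import numpy as np

def logdet(C):
    """log det of a symmetric positive definite matrix via its Cholesky factor."""
    return 2.0 * np.sum(np.log(np.diag(np.linalg.cholesky(C))))

def free_energy(V, A, B, SA, SB, CA, CB, s2):
    """Free energy F of Eq. (3.23); A, B are posterior means, SA, SB covariances."""
    L, M = V.shape
    H = A.shape[1]
    AA = A.T @ A + M * SA          # second moment <A^T A>
    BB = B.T @ B + L * SB          # second moment <B^T B>
    two_F = (L * M * np.log(2 * np.pi * s2)
             + np.sum((V - B @ A.T) ** 2) / s2
             + M * (logdet(CA) - logdet(SA))
             + L * (logdet(CB) - logdet(SB))
             - (L + M) * H
             + np.trace(np.linalg.solve(CA, AA))
             + np.trace(np.linalg.solve(CB, BB))
             + np.trace(AA @ BB - (A.T @ A) @ (B.T @ B)) / s2)
    return 0.5 * two_F

def vb_mf(V, H, CA, CB, s2, n_iter=50, seed=0):
    """Coordinate descent on (3.24) with given hyperparameters CA, CB, s2."""
    rng = np.random.default_rng(seed)
    L, M = V.shape
    A = rng.standard_normal((M, H))
    B = rng.standard_normal((L, H))
    SA, SB = np.eye(H), np.eye(H)
    CA_inv, CB_inv = np.linalg.inv(CA), np.linalg.inv(CB)
    history = []
    for _ in range(n_iter):
        SA = s2 * np.linalg.inv(B.T @ B + L * SB + s2 * CA_inv)   # Eq. (3.20)
        A = V.T @ B @ SA / s2                                      # Eq. (3.19)
        SB = s2 * np.linalg.inv(A.T @ A + M * SA + s2 * CB_inv)   # Eq. (3.22)
        B = V @ A @ SB / s2                                        # Eq. (3.21)
        history.append(free_energy(V, A, B, SA, SB, CA, CB, s2))
    return A, B, SA, SB, np.array(history)
```

Recording the free energy after every sweep gives a simple sanity check: under the update order used here, the recorded values should never increase.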
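Empirical Variational Bayesian Algorithm

The empirical variational Bayesian (EVB) procedure can be performed by minimizing the free energy also with respect to the hyperparameters:

$$
\begin{aligned}
\min_{\hat{A}, \hat{B}, \hat{\Sigma}_A, \hat{\Sigma}_B, C_A, C_B, \sigma^2} \quad & F, \\
\text{s.t.} \quad & \hat{A} \in \mathbb{R}^{M \times H}, \quad \hat{B} \in \mathbb{R}^{L \times H}, \quad \hat{\Sigma}_A, \hat{\Sigma}_B \in \mathbb{S}_{++}^{H}, \\
& C_A, C_B \in \mathbb{D}_{++}^{H}, \quad \sigma^2 \in \mathbb{R}_{++}.
\end{aligned}
\tag{3.25}
$$

By differentiating the free energy (3.23) with respect to each entry of $C_A$ and $C_B$, we have, for $h = 1, \ldots, H$,

$$c_{a_h}^2 = \left\| \hat{\boldsymbol{a}}_h \right\|^2 / M + \left( \hat{\Sigma}_A \right)_{h,h}, \tag{3.26}$$

$$c_{b_h}^2 = \left\| \hat{\boldsymbol{b}}_h \right\|^2 / L + \left( \hat{\Sigma}_B \right)_{h,h}. \tag{3.27}$$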
Algorithm 1 EVB learning for matrix factorization.
1: Initialize the variational parameters $(\hat{A}, \hat{\Sigma}_A, \hat{B}, \hat{\Sigma}_B)$, and the hyperparameters $(C_A, C_B, \sigma^2)$, for example, $\hat{A}_{m,h}, \hat{B}_{l,h} \sim \mathrm{Gauss}_1(0, \tau)$, $\hat{\Sigma}_A = \hat{\Sigma}_B = C_A = C_B = \tau I_H$, and $\sigma^2 = \tau^2$ for $\tau^2 = \|V\|_{\mathrm{Fro}}^2 / (LM)$.
2: Apply (substitute the right-hand side into the left-hand side) Eqs. (3.20), (3.19), (3.22), and (3.21) to update $\hat{\Sigma}_A$, $\hat{A}$, $\hat{\Sigma}_B$, and $\hat{B}$, respectively.
3: Apply Eqs. (3.26) and (3.27) for all $h = 1, \ldots, H$, and Eq. (3.28) to update $C_A$, $C_B$, and $\sigma^2$, respectively.
4: Prune the $h$th component if $c_{a_h}^2 c_{b_h}^2 < \varepsilon$, where $\varepsilon > 0$ is a small threshold, e.g., set to $\varepsilon = 10^{-4}$.
5: Evaluate the free energy (3.23).
6: Iterate Steps 2 through 5 until convergence (until the energy decrease becomes smaller than a threshold).
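Similarly, by differentiating the free energy (3.23) with respect to $\sigma^2$, we have

$$\sigma^2 = \frac{\|V\|_{\mathrm{Fro}}^2 - \mathrm{tr}\left( 2 V^{\top} \hat{B} \hat{A}^{\top} \right) + \mathrm{tr}\left( ( \hat{A}^{\top} \hat{A} + M \hat{\Sigma}_A ) ( \hat{B}^{\top} \hat{B} + L \hat{\Sigma}_B ) \right)}{LM}. \tag{3.28}$$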
Eqs. (3.26)–(3.28) are used as update rules for the prior covariances $C_A$, $C_B$, and the noise variance $\sigma^2$, respectively. Starting from some initial value, iterating Eqs. (3.19) through (3.22) and Eqs. (3.26) through (3.28) gives a local solution for EVB learning. Algorithm 1 summarizes this iterative procedure. If we appropriately set the hyperparameters $(C_A, C_B, \sigma^2)$ in Step 1 and skip Steps 3 and 4, Algorithm 1 is reduced to (nonempirical) VB learning.

We note the following for implementation:

• Due to the automatic relevance determination (ARD) effect in EVB learning (see Chapter 7), $c_{a_h}^2 c_{b_h}^2$ converges to zero for some $h$. For this reason, "pruning" in Step 4 is necessary for numerical stability ($\log \det(C)$ diverges if $C$ is singular). If the $h$th component is pruned, the corresponding $h$th column of $\hat{A}$ and $\hat{B}$ and the $h$th column and row of $\hat{\Sigma}_A$, $\hat{\Sigma}_B$, $C_A$, $C_B$ should be removed, and the rank $H$ should be reduced accordingly.
• In principle, the update rules never increase the free energy. However, pruning can slightly increase it.
• When computing the free energy by Eq. (3.23), $\log \det(\cdot)$ should be computed as twice the sum of the log of the diagonals of the Cholesky decomposition, i.e.,

$$\log \det(C) = 2 \sum_{h=1}^{H} \log \left( \mathrm{Chol}(C) \right)_{h,h}.$$

Otherwise, $\det(\cdot)$ can be huge for practical size of $H$, causing numerical instability.

Simple Variational Bayesian Learning (with Columnwise Independence)

The updates (3.19) through (3.22) require inversion of an $H \times H$ matrix. One can derive a faster VB learning algorithm by using a stronger constraint for the VB learning. More specifically, instead of the matrixwise independence (3.4), we assume the independence between the column vectors of $A = (\boldsymbol{a}_1, \ldots, \boldsymbol{a}_H)$ and $B = (\boldsymbol{b}_1, \ldots, \boldsymbol{b}_H)$ (Ilin and Raiko, 2010; Nakajima and Sugiyama, 2011; Kim and Choi, 2014):

$$r(A, B) = \prod_{h=1}^{H} r_{a_h}(\boldsymbol{a}_h) \prod_{h=1}^{H} r_{b_h}(\boldsymbol{b}_h). \tag{3.29}$$

By applying the same procedure as that with the matrixwise independence constraint, we can derive the solution to

$$\hat{r} = \operatorname*{argmin}_{r} F(r) \quad \text{s.t.} \quad r(A, B) = \prod_{h=1}^{H} r_{a_h}(\boldsymbol{a}_h) \prod_{h=1}^{H} r_{b_h}(\boldsymbol{b}_h), \tag{3.30}$$
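which is in the form of the matrix variate Gaussian:

$$r_A(A) = \mathrm{MGauss}_{M,H}(A; \hat{A}, I_M \otimes \hat{\Sigma}_A) = \prod_{h=1}^{H} \mathrm{Gauss}_M(\boldsymbol{a}_h; \hat{\boldsymbol{a}}_h, \hat{\sigma}_{a_h}^2 I_M),$$

$$r_B(B) = \mathrm{MGauss}_{L,H}(B; \hat{B}, I_L \otimes \hat{\Sigma}_B) = \prod_{h=1}^{H} \mathrm{Gauss}_L(\boldsymbol{b}_h; \hat{\boldsymbol{b}}_h, \hat{\sigma}_{b_h}^2 I_L),$$

with the variational parameters,

$$\hat{A} = (\hat{\boldsymbol{a}}_1, \ldots, \hat{\boldsymbol{a}}_H), \qquad \hat{B} = (\hat{\boldsymbol{b}}_1, \ldots, \hat{\boldsymbol{b}}_H),$$

$$\hat{\Sigma}_A = \mathrm{Diag}(\hat{\sigma}_{a_1}^2, \ldots, \hat{\sigma}_{a_H}^2), \qquad \hat{\Sigma}_B = \mathrm{Diag}(\hat{\sigma}_{b_1}^2, \ldots, \hat{\sigma}_{b_H}^2).$$

Here $\mathrm{Diag}(\cdots)$ denotes the diagonal matrix with the specified diagonal entries. The stationary conditions are given as follows: for all $h = 1, \ldots, H$,

$$\hat{\boldsymbol{a}}_h = \frac{\hat{\sigma}_{a_h}^2}{\sigma^2} \left( V - \sum_{h' \neq h} \hat{\boldsymbol{b}}_{h'} \hat{\boldsymbol{a}}_{h'}^{\top} \right)^{\top} \hat{\boldsymbol{b}}_h, \tag{3.31}$$

$$\hat{\sigma}_{a_h}^2 = \sigma^2 \left( \left\| \hat{\boldsymbol{b}}_h \right\|^2 + L \hat{\sigma}_{b_h}^2 + \frac{\sigma^2}{c_{a_h}^2} \right)^{-1}, \tag{3.32}$$

$$\hat{\boldsymbol{b}}_h = \frac{\hat{\sigma}_{b_h}^2}{\sigma^2} \left( V - \sum_{h' \neq h} \hat{\boldsymbol{b}}_{h'} \hat{\boldsymbol{a}}_{h'}^{\top} \right) \hat{\boldsymbol{a}}_h, \tag{3.33}$$

$$\hat{\sigma}_{b_h}^2 = \sigma^2 \left( \left\| \hat{\boldsymbol{a}}_h \right\|^2 + M \hat{\sigma}_{a_h}^2 + \frac{\sigma^2}{c_{b_h}^2} \right)^{-1}. \tag{3.34}$$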
The free energy is given by Eq. (3.23) with the posterior covariances $\hat{\Sigma}_A$ and $\hat{\Sigma}_B$ restricted to be diagonal. The stationary conditions for the hyperparameters are unchanged, and given by Eqs. (3.26) through (3.28). Therefore, Algorithm 1 with Eqs. (3.31) through (3.34), substituted for Eqs. (3.19) through (3.22), gives a local solution to the VB problem (3.30) with the columnwise independence constraint. We call this variant simple VB (SimpleVB) learning. In Chapter 6, it will be shown that, in the fully observed MF model, the SimpleVB problem (3.30) with columnwise independence and the original VB problem (3.5) with matrixwise independence actually give equivalent solutions.
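One sweep of the SimpleVB updates (3.31) through (3.34) can be sketched as follows, with the free energy (3.23) specialized to diagonal posterior covariances. The hyperparameters are held fixed here, so this covers only Step 2 of Algorithm 1. The function names, the array layout (per-component variances stored in one-dimensional arrays `sa2`, `sb2`, `ca2`, `cb2`), and the in-place updates are illustrative assumptions.

```python
import numpy as np

def simple_vb_sweep(V, A, B, sa2, sb2, ca2, cb2, s2):
    """One pass of Eqs. (3.31)-(3.34) over all components h (updates in place)."""
    L, M = V.shape
    for h in range(A.shape[1]):
        # Residual with the h-th component excluded: V - sum_{h' != h} b_h' a_h'^T
        R = V - B @ A.T + np.outer(B[:, h], A[:, h])
        sa2[h] = s2 / (B[:, h] @ B[:, h] + L * sb2[h] + s2 / ca2[h])   # Eq. (3.32)
        A[:, h] = sa2[h] / s2 * (R.T @ B[:, h])                         # Eq. (3.31)
        sb2[h] = s2 / (A[:, h] @ A[:, h] + M * sa2[h] + s2 / cb2[h])   # Eq. (3.34)
        B[:, h] = sb2[h] / s2 * (R @ A[:, h])                           # Eq. (3.33)
    return A, B, sa2, sb2

def free_energy_diag(V, A, B, sa2, sb2, ca2, cb2, s2):
    """Free energy (3.23) with diagonal posterior and prior covariances."""
    L, M = V.shape
    H = A.shape[1]
    AA = A.T @ A + M * np.diag(sa2)
    BB = B.T @ B + L * np.diag(sb2)
    two_F = (L * M * np.log(2 * np.pi * s2)
             + np.sum((V - B @ A.T) ** 2) / s2
             + M * np.sum(np.log(ca2 / sa2)) + L * np.sum(np.log(cb2 / sb2))
             - (L + M) * H
             + np.sum(np.diag(AA) / ca2) + np.sum(np.diag(BB) / cb2)
             + np.trace(AA @ BB - (A.T @ A) @ (B.T @ B)) / s2)
    return 0.5 * two_F
```

Only scalar divisions appear in the sweep, in contrast to the $H \times H$ inversions needed by Eqs. (3.20) and (3.22).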
3.1.2 Special Cases Probabilistic principal component analysis and reduced rank regression are special cases of matrix factorization. Therefore, they can be trained by Algorithm 1 with or without small modifications. Probabilistic Principal Component Analysis Probabilistic principal component analysis (Tipping and Bishop, 1999; Bishop, 1999b) is a probabilisitic model of which the ML estimation corresponds to the classical principal component analysis (PCA) (Hotelling, 1933). The observation v ∈ RL is assumed to be driven by a latent vector a ∈ RH in the following form: v = B a + ε. a and v, and ε ∈ RL Here, B ∈ RL×H specifies the linear relationship between 2 is a Gaussian noise subject to GaussL (0, σ I L ). Suppose that we are given M observed samples V = (v1 , . . . , v M ) generated a1 , . . . , a M ), and each latent vector is subject to from the latent vectors A = ( a ∼ GaussH (0, I H ). Then, the probabilistic PCA model is written as Eqs. (3.1) and (3.2) with CA = I H . Having the prior (3.2) on B, it is equivalent to the MF model.
Figure 3.1 Reduced rank regression model.
If we apply VB or EVB learning, the intrinsic dimension H is automatically selected without any additional procedure (Bishop, 1999b). This useful property is caused by the ARD (Neal, 1996), which makes the estimators for the irrelevant column vectors of A and B zero. In Chapter 7, this phenomenon is explained in terms of model-induced regularization (MIR), while in Chapter 8, a theoretical guarantee of the dimensionality estimation is given.

Reduced Rank Regression
Reduced rank regression (RRR) (Baldi and Hornik, 1995; Reinsel and Velu, 1998) is aimed at learning a relation between an input vector $x \in \mathbb{R}^M$ and an output vector $y \in \mathbb{R}^L$ by using the following linear model:
$$y = BA^\top x + \varepsilon, \tag{3.35}$$
where $A \in \mathbb{R}^{M \times H}$ and $B \in \mathbb{R}^{L \times H}$ are parameters to be estimated, and $\varepsilon \sim \mathrm{Gauss}_L(0, \sigma'^2 I_L)$ is a Gaussian noise. RRR can be seen as a linear neural network (Figure 3.1), of which the model distribution is given by
$$p(y|x, A, B) = (2\pi\sigma'^2)^{-L/2}\exp\Big(-\frac{1}{2\sigma'^2}\big\|y - BA^\top x\big\|^2\Big). \tag{3.36}$$
Thus, we can interpret this model as first projecting the input vector x onto a lower-dimensional latent subspace by $A^\top$ and then performing linear prediction by B. Suppose we are given N pairs of input and output vectors:
$$\mathcal{D} = \big\{(x^{(n)}, y^{(n)}) \,\big|\, x^{(n)} \in \mathbb{R}^M,\ y^{(n)} \in \mathbb{R}^L,\ n = 1, \ldots, N\big\}. \tag{3.37}$$
Then, the likelihood of the RRR model (3.36) is expressed as
$$p(\mathcal{D}|A, B) = \prod_{n=1}^N p(y^{(n)}|x^{(n)}, A, B)\,p(x^{(n)}) \propto \exp\Big(-\frac{1}{2\sigma'^2}\sum_{n=1}^N\big\|y^{(n)} - BA^\top x^{(n)}\big\|^2\Big). \tag{3.38}$$
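Note that we here ignored the input distributions $\prod_{n=1}^N p(x^{(n)})$ as constants (see Example 1.2 in Section 1.1.1). Let us assume that the samples are centered:
$$\frac{1}{N}\sum_{n=1}^N x^{(n)} = 0 \quad\text{and}\quad \frac{1}{N}\sum_{n=1}^N y^{(n)} = 0.$$
Furthermore, let us assume that the input samples are prewhitened (Hyvärinen et al., 2001), i.e., they satisfy
$$\frac{1}{N}\sum_{n=1}^N x^{(n)}x^{(n)\top} = I_M.$$
Let
$$V = \Sigma_{XY} = \frac{1}{N}\sum_{n=1}^N y^{(n)}x^{(n)\top} \tag{3.39}$$
be the sample cross-covariance matrix, and
$$\sigma^2 = \frac{\sigma'^2}{N} \tag{3.40}$$
be a rescaled noise variance. Then the exponent of the likelihood (3.38) can be written as
$$\begin{aligned}
-\frac{1}{2\sigma'^2}\sum_{n=1}^N\big\|y^{(n)} - BA^\top x^{(n)}\big\|^2
&= -\frac{1}{2\sigma'^2}\sum_{n=1}^N\Big\{\big\|y^{(n)}\big\|^2 - 2\,\mathrm{tr}\big(y^{(n)}x^{(n)\top}AB^\top\big) + \mathrm{tr}\big(x^{(n)}x^{(n)\top}AB^\top BA^\top\big)\Big\}\\
&= -\frac{1}{2\sigma'^2}\Big\{\sum_{n=1}^N\big\|y^{(n)}\big\|^2 - 2N\,\mathrm{tr}\big(VAB^\top\big) + N\,\mathrm{tr}\big(AB^\top BA^\top\big)\Big\}\\
&= -\frac{1}{2\sigma'^2}\Big(N\big\|V - BA^\top\big\|_{\mathrm{Fro}}^2 + \sum_{n=1}^N\big\|y^{(n)}\big\|^2 - N\|V\|_{\mathrm{Fro}}^2\Big)\\
&= -\frac{1}{2\sigma^2}\big\|V - BA^\top\big\|_{\mathrm{Fro}}^2 - \frac{1}{2\sigma^2}\Big(\frac{1}{N}\sum_{n=1}^N\big\|y^{(n)}\big\|^2 - \|V\|_{\mathrm{Fro}}^2\Big).
\end{aligned} \tag{3.41}$$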
The first term in Eq. (3.41) coincides with the log-likelihood of the MF model (3.1), and the second term is constant with respect to A and B. Thus, RRR is reduced to MF, as far as the posteriors for A and B are concerned. However, the second term depends on the rescaled noise variance σ2 , and therefore, should be considered when σ2 is estimated based on the free energy minimization principle. Furthermore, the normalization constant of the
likelihood (3.38) differs from that of the MF model. Taking these differences into account, the VB free energy of the RRR model (3.38) with the priors (3.2) and (3.3) is given by
$$\begin{aligned}
2F^{\mathrm{RRR}} ={}& NL\log(2\pi N\sigma^2) + \frac{\frac{1}{N}\sum_{n=1}^N\|y^{(n)}\|^2 - \|V\|_{\mathrm{Fro}}^2}{\sigma^2} + \frac{\|V - \hat{B}\hat{A}^\top\|_{\mathrm{Fro}}^2}{\sigma^2}\\
&+ M\log\frac{\det(C_A)}{\det(\hat{\Sigma}_A)} + L\log\frac{\det(C_B)}{\det(\hat{\Sigma}_B)} - (L + M)H\\
&+ \mathrm{tr}\Big\{C_A^{-1}\big(\hat{A}^\top\hat{A} + M\hat{\Sigma}_A\big) + C_B^{-1}\big(\hat{B}^\top\hat{B} + L\hat{\Sigma}_B\big)\\
&\qquad + \sigma^{-2}\Big(-\hat{A}^\top\hat{A}\hat{B}^\top\hat{B} + \big(\hat{A}^\top\hat{A} + M\hat{\Sigma}_A\big)\big(\hat{B}^\top\hat{B} + L\hat{\Sigma}_B\big)\Big)\Big\}.
\end{aligned} \tag{3.42}$$
Note that the difference from Eq. (3.23) is only in the first two terms. Accordingly, the stationary conditions for the variational parameters $\hat{A}$, $\hat{B}$, $\hat{\Sigma}_A$, and $\hat{\Sigma}_B$, and those for the prior covariances $C_A$ and $C_B$ (in EVB learning) are the same, i.e., the update rules given by Eqs. (3.19) through (3.22), (3.26), and (3.27) are valid for the RRR model. The update rule for the rescaled noise variance is different from Eq. (3.28), and given by
$$(\hat{\sigma}^2)^{\mathrm{RRR}} = \frac{\frac{1}{N}\sum_{n=1}^N\|y^{(n)}\|^2 - \mathrm{tr}\big(2V^\top\hat{B}\hat{A}^\top\big) + \mathrm{tr}\big((\hat{A}^\top\hat{A} + M\hat{\Sigma}_A)(\hat{B}^\top\hat{B} + L\hat{\Sigma}_B)\big)}{NL}, \tag{3.43}$$
which was obtained from the derivative of Eq. (3.42), instead of Eq. (3.23), with respect to $\sigma^2$. Once the rescaled noise variance $\sigma^2$ is estimated, Eq. (3.40) gives the original noise variance $\sigma'^2$ of the RRR model (3.38).
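The reduction of RRR to MF in Eq. (3.41) can be checked numerically. The following sketch, with arbitrary toy sizes and a hypothetical noise variance, prewhitens random inputs, forms V and the rescaled variance as in Eqs. (3.39) and (3.40), and compares both sides of Eq. (3.41). The variable names are choices made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
L, M, H, N = 3, 4, 2, 200

# Centered and prewhitened inputs: (1/N) X X^T = I_M
X = rng.standard_normal((M, N))
X -= X.mean(axis=1, keepdims=True)
evals, evecs = np.linalg.eigh(X @ X.T / N)
X = evecs @ np.diag(evals ** -0.5) @ evecs.T @ X

Y = rng.standard_normal((L, N))
Y -= Y.mean(axis=1, keepdims=True)
A = rng.standard_normal((M, H))
B = rng.standard_normal((L, H))

sigma_prime2 = 0.7               # original noise variance (hypothetical value)
V = Y @ X.T / N                  # sample cross-covariance, Eq. (3.39)
sigma2 = sigma_prime2 / N        # rescaled noise variance, Eq. (3.40)

# Left-hand side: exponent of the RRR likelihood (3.38)
lhs = -np.sum((Y - B @ A.T @ X) ** 2) / (2 * sigma_prime2)
# Right-hand side: MF term plus the term constant in A and B, Eq. (3.41)
mf_term = -np.linalg.norm(V - B @ A.T, "fro") ** 2 / (2 * sigma2)
const_term = -(np.sum(Y ** 2) / N - np.linalg.norm(V, "fro") ** 2) / (2 * sigma2)
rhs = mf_term + const_term
```

The two values agree up to rounding error only because the inputs are prewhitened; without that step the quadratic term in A and B does not collapse to a Frobenius norm.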
3.2 Matrix Factorization with Missing Entries
One of the major applications of MF is collaborative filtering (CF), where only a part of the entries in V are observed, and the task is to predict missing entries. We can derive a VB algorithm for this scenario, similarly to the fully observed case.
3.2.1 VB Learning for MF with Missing Entries
To express missing entries, the likelihood (3.1) should be replaced with
$$p(V|A, B) \propto \exp\Big(-\frac{1}{2\sigma^2}\big\|P_\Lambda(V) - P_\Lambda(BA^\top)\big\|_{\mathrm{Fro}}^2\Big), \tag{3.44}$$
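where Λ denotes the set of observed indices, and $P_\Lambda(V)$ denotes the matrix of the same size as V with its entries given by
$$(P_\Lambda(V))_{l,m} = \begin{cases} V_{l,m} & \text{if } (l, m) \in \Lambda,\\ 0 & \text{otherwise.}\end{cases}$$

Conditional Conjugacy
Since the likelihood (3.44) is still in a Gaussian form of A if B is regarded as a constant, or vice versa, the conditional conjugacy with respect to A and B, respectively, still holds if we adopt the Gaussian priors (3.2) and (3.3):
$$p(A) \propto \exp\Big(-\frac{1}{2}\mathrm{tr}\big(AC_A^{-1}A^\top\big)\Big), \qquad p(B) \propto \exp\Big(-\frac{1}{2}\mathrm{tr}\big(BC_B^{-1}B^\top\big)\Big).$$
The posterior will still be Gaussian, but in a broader class than in the fully observed case, as will be seen shortly.

Variational Bayesian Algorithm
With the missing entries, the stationary condition (3.7) becomes
$$\begin{aligned}
r_A(A) &\propto \exp\Big(-\frac{1}{2}\mathrm{tr}\big(AC_A^{-1}A^\top\big) - \frac{1}{2\sigma^2}\Big\langle\big\|P_\Lambda(V) - P_\Lambda(BA^\top)\big\|_{\mathrm{Fro}}^2\Big\rangle_{r_B(B)}\Big)\\
&\propto \exp\Big(-\frac{1}{2}\Big\{\mathrm{tr}\big(AC_A^{-1}A^\top\big) + \sigma^{-2}\sum_{(l,m)\in\Lambda}\Big\langle -2V_{l,m}\sum_{h=1}^H B_{l,h}A_{m,h} + \sum_{h=1}^H\sum_{h'=1}^H B_{l,h}B_{l,h'}A_{m,h}A_{m,h'}\Big\rangle_{r_B(B)}\Big\}\Big)\\
&\propto \exp\Big(-\frac{1}{2}\sum_{m=1}^M(\tilde{a}_m - \hat{\tilde{a}}_m)^\top\hat{\Sigma}_{A,m}^{-1}(\tilde{a}_m - \hat{\tilde{a}}_m)\Big),
\end{aligned} \tag{3.45}$$
where
$$\hat{\tilde{a}}_m = \sigma^{-2}\hat{\Sigma}_{A,m}\sum_{l:(l,m)\in\Lambda}V_{l,m}\big\langle\tilde{b}_l\big\rangle_{r_B(B)}, \tag{3.46}$$
$$\hat{\Sigma}_{A,m} = \sigma^2\Big(\sum_{l:(l,m)\in\Lambda}\big\langle\tilde{b}_l\tilde{b}_l^\top\big\rangle_{r_B(B)} + \sigma^2C_A^{-1}\Big)^{-1}. \tag{3.47}$$
Here, $\sum_{(l,m)\in\Lambda}$ denotes the sum over l and m such that $(l, m) \in \Lambda$, and $\sum_{l:(l,m)\in\Lambda}$ denotes the sum over l such that $(l, m) \in \Lambda$ for given m.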
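Similarly, we have
$$\begin{aligned}
r_B(B) &\propto \exp\Big(-\frac{1}{2}\mathrm{tr}\big(BC_B^{-1}B^\top\big) - \frac{1}{2\sigma^2}\Big\langle\big\|P_\Lambda(V) - P_\Lambda(BA^\top)\big\|_{\mathrm{Fro}}^2\Big\rangle_{r_A(A)}\Big)\\
&\propto \exp\Big(-\frac{1}{2}\sum_{l=1}^L(\tilde{b}_l - \hat{\tilde{b}}_l)^\top\hat{\Sigma}_{B,l}^{-1}(\tilde{b}_l - \hat{\tilde{b}}_l)\Big),
\end{aligned} \tag{3.48}$$
where
$$\hat{\tilde{b}}_l = \sigma^{-2}\hat{\Sigma}_{B,l}\sum_{m:(l,m)\in\Lambda}V_{l,m}\big\langle\tilde{a}_m\big\rangle_{r_A(A)}, \tag{3.49}$$
$$\hat{\Sigma}_{B,l} = \sigma^2\Big(\sum_{m:(l,m)\in\Lambda}\big\langle\tilde{a}_m\tilde{a}_m^\top\big\rangle_{r_A(A)} + \sigma^2C_B^{-1}\Big)^{-1}. \tag{3.50}$$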
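Eqs. (3.45) and (3.48) imply that the posteriors of A and B are Gaussian in the following form:
$$r_A(A) = \mathrm{MGauss}_{M,H}(A; \hat{A}, \hat{\breve{\Sigma}}_A) = \prod_{m=1}^M\mathrm{Gauss}_H(\tilde{a}_m; \hat{\tilde{a}}_m, \hat{\Sigma}_{A,m}),$$
$$r_B(B) = \mathrm{MGauss}_{L,H}(B; \hat{B}, \hat{\breve{\Sigma}}_B) = \prod_{l=1}^L\mathrm{Gauss}_H(\tilde{b}_l; \hat{\tilde{b}}_l, \hat{\Sigma}_{B,l}),$$
where
$$\hat{\breve{\Sigma}}_A = \begin{pmatrix}\hat{\Sigma}_{A,1} & 0 & \cdots & 0\\ 0 & \hat{\Sigma}_{A,2} & & \vdots\\ \vdots & & \ddots & 0\\ 0 & \cdots & 0 & \hat{\Sigma}_{A,M}\end{pmatrix}, \qquad
\hat{\breve{\Sigma}}_B = \begin{pmatrix}\hat{\Sigma}_{B,1} & 0 & \cdots & 0\\ 0 & \hat{\Sigma}_{B,2} & & \vdots\\ \vdots & & \ddots & 0\\ 0 & \cdots & 0 & \hat{\Sigma}_{B,L}\end{pmatrix}.$$
Note that the posterior covariances cannot be expressed with a Kronecker product, unlike the fully observed case. However, the posteriors are Gaussian, and the moments are given by
$$\big\langle\tilde{a}_m\big\rangle_{r_A(A)} = \hat{\tilde{a}}_m, \qquad \big\langle\tilde{a}_m\tilde{a}_m^\top\big\rangle_{r_A(A)} = \hat{\tilde{a}}_m\hat{\tilde{a}}_m^\top + \hat{\Sigma}_{A,m},$$
$$\big\langle\tilde{b}_l\big\rangle_{r_B(B)} = \hat{\tilde{b}}_l, \qquad \big\langle\tilde{b}_l\tilde{b}_l^\top\big\rangle_{r_B(B)} = \hat{\tilde{b}}_l\hat{\tilde{b}}_l^\top + \hat{\Sigma}_{B,l}.$$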
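Thus, Eqs. (3.46), (3.47), (3.49), and (3.50) lead to
$$\hat{\tilde{a}}_m = \sigma^{-2}\hat{\Sigma}_{A,m}\sum_{l:(l,m)\in\Lambda}V_{l,m}\hat{\tilde{b}}_l, \tag{3.51}$$
$$\hat{\Sigma}_{A,m} = \sigma^2\Big(\sum_{l:(l,m)\in\Lambda}\big(\hat{\tilde{b}}_l\hat{\tilde{b}}_l^\top + \hat{\Sigma}_{B,l}\big) + \sigma^2C_A^{-1}\Big)^{-1}, \tag{3.52}$$
$$\hat{\tilde{b}}_l = \sigma^{-2}\hat{\Sigma}_{B,l}\sum_{m:(l,m)\in\Lambda}V_{l,m}\hat{\tilde{a}}_m, \tag{3.53}$$
$$\hat{\Sigma}_{B,l} = \sigma^2\Big(\sum_{m:(l,m)\in\Lambda}\big(\hat{\tilde{a}}_m\hat{\tilde{a}}_m^\top + \hat{\Sigma}_{A,m}\big) + \sigma^2C_B^{-1}\Big)^{-1}, \tag{3.54}$$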
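which are used as update rules for local search (Lim and Teh, 2007).

Free Energy as a Function of Variational Parameters
An explicit form of the free energy can be obtained in a similar fashion to the fully observed case:
$$\begin{aligned}
2F ={}& \#(\Lambda)\cdot\log(2\pi\sigma^2) + M\log\det(C_A) + L\log\det(C_B)\\
&- \sum_{m=1}^M\log\det\big(\hat{\Sigma}_{A,m}\big) - \sum_{l=1}^L\log\det\big(\hat{\Sigma}_{B,l}\big) - (L + M)H\\
&+ \mathrm{tr}\Big\{C_A^{-1}\Big(\hat{A}^\top\hat{A} + \sum_{m=1}^M\hat{\Sigma}_{A,m}\Big) + C_B^{-1}\Big(\hat{B}^\top\hat{B} + \sum_{l=1}^L\hat{\Sigma}_{B,l}\Big)\Big\}\\
&+ \sigma^{-2}\sum_{(l,m)\in\Lambda}\Big\{V_{l,m}^2 - 2V_{l,m}\hat{\tilde{a}}_m^\top\hat{\tilde{b}}_l + \mathrm{tr}\Big(\big(\hat{\tilde{a}}_m\hat{\tilde{a}}_m^\top + \hat{\Sigma}_{A,m}\big)\big(\hat{\tilde{b}}_l\hat{\tilde{b}}_l^\top + \hat{\Sigma}_{B,l}\big)\Big)\Big\},
\end{aligned} \tag{3.55}$$
where $\#(\Lambda)$ denotes the number of observed entries.

Empirical Variational Bayesian Algorithm
By taking derivatives of the free energy (3.55), we can derive update rules for the hyperparameters:
$$c_{a_h}^2 = \frac{\|\hat{a}_h\|^2 + \sum_{m=1}^M\big(\hat{\Sigma}_{A,m}\big)_{h,h}}{M}, \tag{3.56}$$
$$c_{b_h}^2 = \frac{\|\hat{b}_h\|^2 + \sum_{l=1}^L\big(\hat{\Sigma}_{B,l}\big)_{h,h}}{L}, \tag{3.57}$$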
Algorithm 2 EVB learning for matrix factorization with missing entries.
1: Initialize the variational parameters $(\hat{A}, \{\hat{\Sigma}_{A,m}\}_{m=1}^M, \hat{B}, \{\hat{\Sigma}_{B,l}\}_{l=1}^L)$, and the hyperparameters $(C_A, C_B, \sigma^2)$, for example, $\hat{A}_{m,h}, \hat{B}_{l,h} \sim \mathrm{Gauss}_1(0, \tau)$, $\hat{\Sigma}_{A,m} = \hat{\Sigma}_{B,l} = C_A = C_B = \tau I_H$, and $\sigma^2 = \tau^2$ for $\tau^2 = \sum_{(l,m)\in\Lambda}V_{l,m}^2/\#(\Lambda)$.
2: Apply (substitute the right-hand side into the left-hand side) Eqs. (3.52), (3.51), (3.54), and (3.53) to update $\{\hat{\Sigma}_{A,m}\}_{m=1}^M$, $\hat{A}$, $\{\hat{\Sigma}_{B,l}\}_{l=1}^L$, and $\hat{B}$, respectively.
3: Apply Eqs. (3.56) and (3.57) for all h = 1, . . . , H, and Eq. (3.58) to update $C_A$, $C_B$, and $\sigma^2$, respectively.
4: Prune the hth component if $c_{a_h}^2c_{b_h}^2 < \varepsilon$, where $\varepsilon > 0$ is a threshold, e.g., set to $\varepsilon = 10^{-4}$.
5: Evaluate the free energy (3.55).
6: Iterate Steps 2 through 5 until convergence (until the energy decrease becomes smaller than a threshold).
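$$\hat{\sigma}^2 = \frac{1}{\#(\Lambda)}\sum_{(l,m)\in\Lambda}\Big\{V_{l,m}^2 - 2V_{l,m}\hat{\tilde{a}}_m^\top\hat{\tilde{b}}_l + \mathrm{tr}\Big(\big(\hat{\tilde{a}}_m\hat{\tilde{a}}_m^\top + \hat{\Sigma}_{A,m}\big)\big(\hat{\tilde{b}}_l\hat{\tilde{b}}_l^\top + \hat{\Sigma}_{B,l}\big)\Big)\Big\}. \tag{3.58}$$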
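Step 2 of Algorithm 2, i.e., Eqs. (3.52), (3.51), (3.54), and (3.53), might look as follows in NumPy. This is only a sketch under assumptions: the observed set Λ is encoded as a Boolean mask, the covariances are stored as stacked H × H arrays, the hyperparameters are fixed, and the function name is hypothetical.

```python
import numpy as np

def vb_sweep_missing(V, mask, A, SA, B, SB, CA_inv, CB_inv, sigma2):
    """One in-place sweep of Eqs. (3.51)-(3.54).

    mask[l, m] is True where V[l, m] is observed.
    A: (M, H), B: (L, H) posterior means (one row per a_m, b_l).
    SA: (M, H, H), SB: (L, H, H) per-row posterior covariances.
    """
    L, M = V.shape
    for m in range(M):
        obs = np.flatnonzero(mask[:, m])          # l with (l, m) in Lambda
        Bo = B[obs]
        second = Bo.T @ Bo + SB[obs].sum(axis=0)  # sum_l (b_l b_l^T + Sigma_{B,l})
        SA[m] = sigma2 * np.linalg.inv(second + sigma2 * CA_inv)  # Eq. (3.52)
        A[m] = SA[m] @ (Bo.T @ V[obs, m]) / sigma2                # Eq. (3.51)
    for l in range(L):
        obs = np.flatnonzero(mask[l, :])          # m with (l, m) in Lambda
        Ao = A[obs]
        second = Ao.T @ Ao + SA[obs].sum(axis=0)  # sum_m (a_m a_m^T + Sigma_{A,m})
        SB[l] = sigma2 * np.linalg.inv(second + sigma2 * CB_inv)  # Eq. (3.54)
        B[l] = SB[l] @ (Ao.T @ V[l, obs]) / sigma2                # Eq. (3.53)
    return A, SA, B, SB

# Toy usage: rank-2 matrix with roughly 60% of the entries observed.
rng = np.random.default_rng(1)
L, M, H = 30, 20, 3
V = rng.standard_normal((L, 2)) @ rng.standard_normal((2, M))
mask = rng.random((L, M)) < 0.6
A = rng.standard_normal((M, H))
B = rng.standard_normal((L, H))
SA = np.tile(np.eye(H), (M, 1, 1))
SB = np.tile(np.eye(H), (L, 1, 1))
I_H = np.eye(H)

def observed_rmse():
    return np.sqrt(np.mean((V - B @ A.T)[mask] ** 2))

rmse_before = observed_rmse()
for _ in range(30):
    vb_sweep_missing(V, mask, A, SA, B, SB, I_H, I_H, sigma2=0.01)
rmse_after = observed_rmse()
```

Each row needs its own H × H inversion because the set of observed partners differs from row to row, which is why the posterior covariance no longer has Kronecker structure.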
Algorithm 2 summarizes the EVB algorithm for MF with missing entries. Again, if we appropriately set the hyperparameters $(C_A, C_B, \sigma^2)$ in Step 1 and skip Steps 3 and 4, Algorithm 2 is reduced to (nonempirical) VB learning.

Simple Variational Bayesian Learning (with Columnwise Independence)
Similarly to the fully observed case, we can reduce the computational burden and the memory requirement of VB learning by adopting the columnwise independence (Ilin and Raiko, 2010):
$$r(A, B) = \prod_{h=1}^H r_{a_h}(a_h)\prod_{h=1}^H r_{b_h}(b_h). \tag{3.59}$$
By applying the same procedure as the matrixwise independence case, we can derive the solution to
$$\hat{r} = \operatorname*{argmin}_r F(r) \quad\text{s.t.}\quad r(A, B) = \prod_{h=1}^H r_{a_h}(a_h)\prod_{h=1}^H r_{b_h}(b_h), \tag{3.60}$$
which is in the form of the matrix variate Gaussian,
$$r_A(A) = \mathrm{MGauss}_{M,H}(A; \hat{A}, \hat{\breve{\Sigma}}_A) = \prod_{m=1}^M\prod_{h=1}^H\mathrm{Gauss}_1(A_{m,h}; \hat{A}_{m,h}, \hat{\sigma}_{A_{m,h}}^2),$$
$$r_B(B) = \mathrm{MGauss}_{L,H}(B; \hat{B}, \hat{\breve{\Sigma}}_B) = \prod_{l=1}^L\prod_{h=1}^H\mathrm{Gauss}_1(B_{l,h}; \hat{B}_{l,h}, \hat{\sigma}_{B_{l,h}}^2),$$
with diagonal posterior covariances, i.e.,
$$\hat{\breve{\Sigma}}_A = \begin{pmatrix}\hat{\Sigma}_{A,1} & 0 & \cdots & 0\\ 0 & \hat{\Sigma}_{A,2} & & \vdots\\ \vdots & & \ddots & 0\\ 0 & \cdots & 0 & \hat{\Sigma}_{A,M}\end{pmatrix}, \qquad
\hat{\breve{\Sigma}}_B = \begin{pmatrix}\hat{\Sigma}_{B,1} & 0 & \cdots & 0\\ 0 & \hat{\Sigma}_{B,2} & & \vdots\\ \vdots & & \ddots & 0\\ 0 & \cdots & 0 & \hat{\Sigma}_{B,L}\end{pmatrix},$$
for
$$\hat{\Sigma}_{A,m} = \mathrm{Diag}\big(\hat{\sigma}_{A_{m,1}}^2, \ldots, \hat{\sigma}_{A_{m,H}}^2\big), \qquad \hat{\Sigma}_{B,l} = \mathrm{Diag}\big(\hat{\sigma}_{B_{l,1}}^2, \ldots, \hat{\sigma}_{B_{l,H}}^2\big).$$
The stationary conditions are given as follows: for all l = 1, . . . , L, m = 1, . . . , M, and h = 1, . . . , H,
$$\hat{A}_{m,h} = \frac{\hat{\sigma}_{A_{m,h}}^2}{\sigma^2}\sum_{l:(l,m)\in\Lambda}\Big(V_{l,m} - \sum_{h' \neq h}\hat{B}_{l,h'}\hat{A}_{m,h'}\Big)\hat{B}_{l,h}, \tag{3.61}$$
$$\hat{\sigma}_{A_{m,h}}^2 = \sigma^2\Big(\sum_{l:(l,m)\in\Lambda}\big(\hat{B}_{l,h}^2 + \hat{\sigma}_{B_{l,h}}^2\big) + \frac{\sigma^2}{c_{a_h}^2}\Big)^{-1}, \tag{3.62}$$
$$\hat{B}_{l,h} = \frac{\hat{\sigma}_{B_{l,h}}^2}{\sigma^2}\sum_{m:(l,m)\in\Lambda}\Big(V_{l,m} - \sum_{h' \neq h}\hat{A}_{m,h'}\hat{B}_{l,h'}\Big)\hat{A}_{m,h}, \tag{3.63}$$
$$\hat{\sigma}_{B_{l,h}}^2 = \sigma^2\Big(\sum_{m:(l,m)\in\Lambda}\big(\hat{A}_{m,h}^2 + \hat{\sigma}_{A_{m,h}}^2\big) + \frac{\sigma^2}{c_{b_h}^2}\Big)^{-1}. \tag{3.64}$$
The free energy is given by Eq. (3.55) with the posterior covariances $\{\hat{\Sigma}_{A,m}, \hat{\Sigma}_{B,l}\}$ restricted to be diagonal. The stationary conditions for the hyperparameters are unchanged, and given by Eqs. (3.56) through (3.58). Therefore, Algorithm 2 with Eqs. (3.61) through (3.64) substituted for Eqs. (3.51) through (3.54) gives a local solution to the VB problem (3.60) with the columnwise independence constraint.
SimpleVB learning is much more practical when missing entries exist. In the fully observed case, the posterior covariances $\hat{\Sigma}_A$ and $\hat{\Sigma}_B$ are common to all rows of A and to all rows of B, respectively, while in the partially
observed case, we need to store the posterior covariances Σ A,m and Σ B,l for all m = 1, . . . , M and l = 1, . . . , L. Since L and M can be huge, e.g., in collaborative filtering applications, the required memory size is significantly reduced by restricting the covariances to be diagonal.
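To give a feel for the saving, the following short sketch counts the stored covariance entries with and without the diagonal restriction. The problem sizes are hypothetical and are not taken from any data set discussed in the text.

```python
def covariance_entries(L, M, H, diagonal):
    """Stored entries for {Sigma_A,m} (M of them) and {Sigma_B,l} (L of them)."""
    # A full H x H matrix per row versus H diagonal entries per row.
    # Exploiting symmetry would roughly halve the full count.
    per_row = H if diagonal else H * H
    return (L + M) * per_row

# Hypothetical collaborative-filtering sizes: 100,000 users, 10,000 items, H = 50
full = covariance_entries(100_000, 10_000, 50, diagonal=False)
diag = covariance_entries(100_000, 10_000, 50, diagonal=True)
```

With these sizes the full covariances need 275 million entries and the diagonal ones 5.5 million, a factor of H = 50.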
3.3 Tensor Factorization
A matrix is a two-dimensional array of numbers. We can extend this notion to an N-dimensional array, which is called an N-mode tensor. Namely, a tensor $\mathcal{V} \in \mathbb{R}^{M^{(1)} \times \cdots \times M^{(N)}}$ consists of $\prod_{n=1}^N M^{(n)}$ entries lying in an N-dimensional array, where $M^{(n)}$ denotes the length in mode n. In this section, we derive VB learning for tensor factorization.
3.3.1 Tucker Factorization
Similarly to the rank of a matrix, we can control the degree of freedom of a tensor by controlling its tensor rank. Although there are a few different definitions of the tensor rank and corresponding ways of factorization, we here focus on Tucker factorization (TF) (Tucker, 1996; Kolda and Bader, 2009) defined as follows:
$$\mathcal{V} = \mathcal{G} \times_1 A^{(1)} \cdots \times_N A^{(N)} + \mathcal{E},$$
where $\mathcal{V} \in \mathbb{R}^{M^{(1)} \times \cdots \times M^{(N)}}$, $\mathcal{G} \in \mathbb{R}^{H^{(1)} \times \cdots \times H^{(N)}}$, and $\{A^{(n)} \in \mathbb{R}^{M^{(n)} \times H^{(n)}}\}$ are an observed tensor, a core tensor, and factor matrices, respectively. $\mathcal{E} \in \mathbb{R}^{M^{(1)} \times \cdots \times M^{(N)}}$ is noise and $\times_n$ denotes the n-mode tensor product. Parafac (Harshman, 1970), another popular way of tensor factorization, can be seen as a special case of Tucker factorization where the core tensor is superdiagonal, i.e., only the entries $\mathcal{G}_{h^{(1)},\ldots,h^{(N)}}$ for $h^{(1)} = h^{(2)} = \cdots = h^{(N)}$ are nonzero.
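The n-mode products, and the vectorized and unfolded forms in which the Tucker model is usually manipulated, can be sketched in NumPy as follows. The helper names are hypothetical. The sketch assumes column-major vectorization and the unfolding convention of Kolda and Bader (2009), under which the Kronecker factors appear in the order $A^{(N)} \otimes \cdots \otimes A^{(1)}$.

```python
import numpy as np

def tucker_product(G, factors):
    """Noise-free Tucker model: G x_1 A^(1) ... x_N A^(N)."""
    V = G
    for n, A in enumerate(factors):
        # Contract mode n of V (length H^(n)) with the columns of A (M^(n) x H^(n)),
        # then move the new axis back to position n.
        V = np.moveaxis(np.tensordot(A, V, axes=(1, n)), 0, n)
    return V

def unfold(T, n):
    """Mode-n unfolding; the remaining modes are flattened with the first index fastest."""
    return np.reshape(np.moveaxis(T, n, 0), (T.shape[n], -1), order="F")

rng = np.random.default_rng(3)
G = rng.standard_normal((2, 3, 2))                       # core tensor, H = (2, 3, 2)
factors = [rng.standard_normal((m, h)) for m, h in zip((4, 5, 3), G.shape)]
A1, A2, A3 = factors
V = tucker_product(G, factors)                           # shape (4, 5, 3)

# vec(V) = (A^(3) kron A^(2) kron A^(1)) vec(G), with column-major vec
vec_ok = np.allclose(V.flatten(order="F"),
                     np.kron(A3, np.kron(A2, A1)) @ G.flatten(order="F"))
# V_(2) = A^(2) G_(2) (A^(3) kron A^(1))^T for the second mode (index 1 here)
unfold_ok = np.allclose(unfold(V, 1), A2 @ unfold(G, 1) @ np.kron(A3, A1).T)
```

Both checks pass for this convention; a row-major `reshape` would reverse the order of the Kronecker factors.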
3.3.2 VB Learning for TF
The probabilistic model for MF is straightforwardly extended to TF (Chu and Ghahramani, 2009; Mørup and Hansen, 2009). Assume Gaussian noise and Gaussian priors:
$$p(\mathcal{V}|\mathcal{G}, \{A^{(n)}\}) \propto \exp\Big(-\frac{\big\|\mathcal{V} - \mathcal{G} \times_1 A^{(1)} \cdots \times_N A^{(N)}\big\|^2}{2\sigma^2}\Big), \tag{3.65}$$
$$p(\mathcal{G}) \propto \exp\Big(-\frac{\mathrm{vec}(\mathcal{G})^\top\big(C_{G^{(N)}} \otimes \cdots \otimes C_{G^{(1)}}\big)^{-1}\mathrm{vec}(\mathcal{G})}{2}\Big), \tag{3.66}$$
$$p(\{A^{(n)}\}) \propto \exp\Big(-\frac{\sum_{n=1}^N\mathrm{tr}\big(A^{(n)}C_{A^{(n)}}^{-1}A^{(n)\top}\big)}{2}\Big), \tag{3.67}$$
where $\otimes$ and $\mathrm{vec}(\cdot)$ denote the Kronecker product and the vectorization operator, respectively. $\{C_{G^{(n)}}\}$ and $\{C_{A^{(n)}}\}$ are the prior covariances restricted to be diagonal, i.e.,
$$C_{G^{(n)}} = \mathrm{Diag}\Big(c_{g_1^{(n)}}^2, \ldots, c_{g_{H^{(n)}}^{(n)}}^2\Big), \qquad C_{A^{(n)}} = \mathrm{Diag}\Big(c_{a_1^{(n)}}^2, \ldots, c_{a_{H^{(n)}}^{(n)}}^2\Big).$$
We denote $\breve{C}_G = C_{G^{(N)}} \otimes \cdots \otimes C_{G^{(1)}}$.

Conditional Conjugacy
Since the TF model is multilinear, the likelihood (3.65) is in the Gaussian form of the core tensor $\mathcal{G}$ and of each of the factor matrices $\{A^{(n)}\}$, if the others are fixed as constants. Therefore, the Gaussian priors (3.66) and (3.67) are conditionally conjugate for each parameter, and the posterior will be Gaussian.

Variational Bayesian Algorithm
Based on the conditional conjugacy, we impose the following constraint on the VB posterior:
$$r(\mathcal{G}, \{A^{(n)}\}) = r(\mathcal{G})\prod_{n=1}^N r(A^{(n)}).$$
Then, the free energy can be written as
$$F(r) = \int r(\mathcal{G})\Big(\prod_{n=1}^N r(A^{(n)})\Big)\Big(\log r(\mathcal{G}) + \sum_{n=1}^N\log r(A^{(n)}) - \log p(\mathcal{V}, \mathcal{G}, \{A^{(n)}\})\Big)\,d\mathcal{G}\Big(\prod_{n=1}^N dA^{(n)}\Big). \tag{3.68}$$
Using the variational method, we obtain
$$0 = \int\Big(\prod_{n=1}^N r(A^{(n)})\Big)\Big(\log p(\mathcal{V}, \mathcal{G}, \{A^{(n)}\}) - \log r(\mathcal{G}) - \sum_{n=1}^N\log r(A^{(n)}) - 1\Big)\Big(\prod_{n=1}^N dA^{(n)}\Big),$$
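and therefore
$$r(\mathcal{G}) \propto \exp\big\langle\log p(\mathcal{V}, \mathcal{G}, \{A^{(n)}\})\big\rangle_{r(\{A^{(n)}\})} \propto p(\mathcal{G})\exp\big\langle\log p(\mathcal{V}|\mathcal{G}, \{A^{(n)}\})\big\rangle_{r(\{A^{(n)}\})}. \tag{3.69}$$
Similarly, we can also obtain
$$0 = \int r(\mathcal{G})\Big(\prod_{n' \neq n} r(A^{(n')})\Big)\Big(\log p(\mathcal{V}, \mathcal{G}, \{A^{(n)}\}) - \log r(\mathcal{G}) - \sum_{n'=1}^N\log r(A^{(n')}) - 1\Big)\,d\mathcal{G}\Big(\prod_{n' \neq n} dA^{(n')}\Big),$$
and therefore
$$r(A^{(n)}) \propto \exp\big\langle\log p(\mathcal{V}, \mathcal{G}, \{A^{(n)}\})\big\rangle_{r(\mathcal{G})r(\{A^{(n')}\}_{n' \neq n})} \propto p(A^{(n)})\exp\big\langle\log p(\mathcal{V}|\mathcal{G}, \{A^{(n)}\})\big\rangle_{r(\mathcal{G})r(\{A^{(n')}\}_{n' \neq n})}. \tag{3.70}$$
Eqs. (3.69) and (3.70) imply that the VB posteriors are Gaussian. The expectation in Eq. (3.69) can be calculated as follows:
$$\begin{aligned}
\big\langle\log p(\mathcal{V}|\mathcal{G}, \{A^{(n)}\})\big\rangle_{r(\{A^{(n)}\})}
&= -\frac{1}{2\sigma^2}\Big\langle\big\|\mathcal{V} - \mathcal{G} \times_1 A^{(1)} \cdots \times_N A^{(N)}\big\|^2\Big\rangle_{r(\{A^{(n)}\})} + \mathrm{const.}\\
&= -\frac{1}{2\sigma^2}\Big\langle -2\,\mathrm{vec}(\mathcal{V})^\top\big(A^{(N)} \otimes \cdots \otimes A^{(1)}\big)\mathrm{vec}(\mathcal{G})\\
&\qquad\quad + \mathrm{vec}(\mathcal{G})^\top\big(A^{(N)\top}A^{(N)} \otimes \cdots \otimes A^{(1)\top}A^{(1)}\big)\mathrm{vec}(\mathcal{G})\Big\rangle_{r(\{A^{(n)}\})} + \mathrm{const.}
\end{aligned}$$
Substituting the preceding calculation and the prior (3.66) into Eq. (3.69) gives
$$\log r(\mathcal{G}) = \log p(\mathcal{G}) + \big\langle\log p(\mathcal{V}|\mathcal{G}, \{A^{(n)}\})\big\rangle_{r(\{A^{(n)}\})} + \mathrm{const.} = -\frac{1}{2}\big(\breve{g} - \hat{\breve{g}}\big)^\top\hat{\breve{\Sigma}}_G^{-1}\big(\breve{g} - \hat{\breve{g}}\big) + \mathrm{const.},$$
where $\breve{g} = \mathrm{vec}(\mathcal{G})$, $\breve{v} = \mathrm{vec}(\mathcal{V})$,
$$\hat{\breve{g}} = \frac{\hat{\breve{\Sigma}}_G}{\sigma^2}\big(\hat{A}^{(N)} \otimes \cdots \otimes \hat{A}^{(1)}\big)^\top\breve{v}, \tag{3.71}$$
$$\hat{\breve{\Sigma}}_G = \sigma^2\Big(\big\langle A^{(N)\top}A^{(N)} \otimes \cdots \otimes A^{(1)\top}A^{(1)}\big\rangle_{r(\{A^{(n)}\})} + \sigma^2\breve{C}_G^{-1}\Big)^{-1}. \tag{3.72}$$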
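Similarly, the expectation in Eq. (3.70) can be calculated as follows:
$$\begin{aligned}
&\big\langle\log p(\mathcal{V}|\mathcal{G}, \{A^{(n)}\})\big\rangle_{r(\mathcal{G})r(\{A^{(n')}\}_{n' \neq n})}\\
&\quad= -\frac{1}{2\sigma^2}\Big\langle\big\|\mathcal{V} - \mathcal{G} \times_1 A^{(1)} \cdots \times_N A^{(N)}\big\|^2\Big\rangle_{r(\mathcal{G})r(\{A^{(n')}\}_{n' \neq n})} + \mathrm{const.}\\
&\quad= -\frac{1}{2\sigma^2}\Big\langle\mathrm{tr}\Big(-2V_{(n)}^\top A^{(n)}G_{(n)}\big(A^{(N)} \otimes \cdots \otimes A^{(n+1)} \otimes A^{(n-1)} \otimes \cdots \otimes A^{(1)}\big)^\top\Big)\\
&\qquad\quad + \mathrm{tr}\Big(A^{(n)}G_{(n)}\big(A^{(N)\top}A^{(N)} \otimes \cdots \otimes A^{(n+1)\top}A^{(n+1)} \otimes A^{(n-1)\top}A^{(n-1)} \otimes \cdots \otimes A^{(1)\top}A^{(1)}\big)G_{(n)}^\top A^{(n)\top}\Big)\Big\rangle_{r(\mathcal{G})r(\{A^{(n')}\}_{n' \neq n})} + \mathrm{const.}
\end{aligned}$$
Substituting the preceding calculation and the prior (3.67) into Eq. (3.70) gives
$$\log r(A^{(n)}) = \log p(A^{(n)}) + \big\langle\log p(\mathcal{V}|\mathcal{G}, \{A^{(n)}\})\big\rangle_{r(\mathcal{G})r(\{A^{(n')}\}_{n' \neq n})} + \mathrm{const.} = -\frac{1}{2}\mathrm{tr}\Big(\big(A^{(n)} - \hat{A}^{(n)}\big)\hat{\Sigma}_{A^{(n)}}^{-1}\big(A^{(n)} - \hat{A}^{(n)}\big)^\top\Big) + \mathrm{const.},$$
where
$$\hat{A}^{(n)} = \frac{1}{\sigma^2}V_{(n)}\big(\hat{A}^{(N)} \otimes \cdots \otimes \hat{A}^{(n+1)} \otimes \hat{A}^{(n-1)} \otimes \cdots \otimes \hat{A}^{(1)}\big)\hat{G}_{(n)}^\top\hat{\Sigma}_{A^{(n)}}, \tag{3.73}$$
$$\hat{\Sigma}_{A^{(n)}} = \sigma^2\Big(\Big\langle G_{(n)}\big(A^{(N)\top}A^{(N)} \otimes \cdots \otimes A^{(n+1)\top}A^{(n+1)} \otimes A^{(n-1)\top}A^{(n-1)} \otimes \cdots \otimes A^{(1)\top}A^{(1)}\big)G_{(n)}^\top\Big\rangle_{r(\mathcal{G})r(\{A^{(n')}\}_{n' \neq n})} + \sigma^2C_{A^{(n)}}^{-1}\Big)^{-1}. \tag{3.74}$$
Thus, the VB posterior is given by
$$r(\mathcal{G}, \{A^{(n)}\}) = \mathrm{Gauss}_{\prod_{n=1}^N H^{(n)}}\big(\breve{g}; \hat{\breve{g}}, \hat{\breve{\Sigma}}_G\big)\cdot\prod_{n=1}^N\mathrm{MGauss}_{M^{(n)},H^{(n)}}\big(A^{(n)}; \hat{A}^{(n)}, I_{M^{(n)}} \otimes \hat{\Sigma}_{A^{(n)}}\big), \tag{3.75}$$
where the means and the covariances satisfy
$$\hat{\breve{g}} = \frac{\hat{\breve{\Sigma}}_G}{\sigma^2}\big(\hat{A}^{(N)} \otimes \cdots \otimes \hat{A}^{(1)}\big)^\top\breve{v}, \tag{3.76}$$
$$\hat{\breve{\Sigma}}_G = \sigma^2\Big(\big(\hat{A}^{(N)\top}\hat{A}^{(N)} + M^{(N)}\hat{\Sigma}_{A^{(N)}}\big) \otimes \cdots \otimes \big(\hat{A}^{(1)\top}\hat{A}^{(1)} + M^{(1)}\hat{\Sigma}_{A^{(1)}}\big) + \sigma^2\breve{C}_G^{-1}\Big)^{-1}, \tag{3.77}$$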
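$$\hat{A}^{(n)} = \frac{1}{\sigma^2}V_{(n)}\big(\hat{A}^{(N)} \otimes \cdots \otimes \hat{A}^{(n+1)} \otimes \hat{A}^{(n-1)} \otimes \cdots \otimes \hat{A}^{(1)}\big)\hat{G}_{(n)}^\top\hat{\Sigma}_{A^{(n)}}, \tag{3.78}$$
$$\begin{aligned}
\hat{\Sigma}_{A^{(n)}} = \sigma^2\Big(\Big\langle G_{(n)}\Big(&\big(\hat{A}^{(N)\top}\hat{A}^{(N)} + M^{(N)}\hat{\Sigma}_{A^{(N)}}\big) \otimes \cdots \otimes \big(\hat{A}^{(n+1)\top}\hat{A}^{(n+1)} + M^{(n+1)}\hat{\Sigma}_{A^{(n+1)}}\big)\\
&\otimes \big(\hat{A}^{(n-1)\top}\hat{A}^{(n-1)} + M^{(n-1)}\hat{\Sigma}_{A^{(n-1)}}\big) \otimes \cdots \otimes \big(\hat{A}^{(1)\top}\hat{A}^{(1)} + M^{(1)}\hat{\Sigma}_{A^{(1)}}\big)\Big)G_{(n)}^\top\Big\rangle_{r(\mathcal{G})} + \sigma^2C_{A^{(n)}}^{-1}\Big)^{-1}.
\end{aligned} \tag{3.79}$$
The expectation in Eq. (3.79) is explicitly given by
$$\begin{aligned}
&\Big\langle G_{(n)}\Big(\big(\hat{A}^{(N)\top}\hat{A}^{(N)} + M^{(N)}\hat{\Sigma}_{A^{(N)}}\big) \otimes \cdots \otimes \big(\hat{A}^{(n+1)\top}\hat{A}^{(n+1)} + M^{(n+1)}\hat{\Sigma}_{A^{(n+1)}}\big)\\
&\qquad \otimes \big(\hat{A}^{(n-1)\top}\hat{A}^{(n-1)} + M^{(n-1)}\hat{\Sigma}_{A^{(n-1)}}\big) \otimes \cdots \otimes \big(\hat{A}^{(1)\top}\hat{A}^{(1)} + M^{(1)}\hat{\Sigma}_{A^{(1)}}\big)\Big)G_{(n)}^\top\Big\rangle_{r(\mathcal{G})}\\
&\quad= \hat{G}_{(n)}\Big(\big(\hat{A}^{(N)\top}\hat{A}^{(N)} + M^{(N)}\hat{\Sigma}_{A^{(N)}}\big) \otimes \cdots \otimes \big(\hat{A}^{(n+1)\top}\hat{A}^{(n+1)} + M^{(n+1)}\hat{\Sigma}_{A^{(n+1)}}\big)\\
&\qquad \otimes \big(\hat{A}^{(n-1)\top}\hat{A}^{(n-1)} + M^{(n-1)}\hat{\Sigma}_{A^{(n-1)}}\big) \otimes \cdots \otimes \big(\hat{A}^{(1)\top}\hat{A}^{(1)} + M^{(1)}\hat{\Sigma}_{A^{(1)}}\big)\Big)\hat{G}_{(n)}^\top + \Xi^{(n)},
\end{aligned} \tag{3.80}$$
where the entries of $\Xi^{(n)} \in \mathbb{R}^{H^{(n)} \times H^{(n)}}$ are specified as
$$\begin{aligned}
\Xi^{(n)}_{h^{(n)},h'^{(n)}} = \sum_{(h^{(1)},h'^{(1)}),\ldots,(h^{(n-1)},h'^{(n-1)}),(h^{(n+1)},h'^{(n+1)}),\ldots,(h^{(N)},h'^{(N)})}
&\big(\hat{A}^{(N)\top}\hat{A}^{(N)} + M^{(N)}\hat{\Sigma}_{A^{(N)}}\big)_{h^{(N)},h'^{(N)}}\cdots\big(\hat{A}^{(n+1)\top}\hat{A}^{(n+1)} + M^{(n+1)}\hat{\Sigma}_{A^{(n+1)}}\big)_{h^{(n+1)},h'^{(n+1)}}\\
&\cdot\big(\hat{A}^{(n-1)\top}\hat{A}^{(n-1)} + M^{(n-1)}\hat{\Sigma}_{A^{(n-1)}}\big)_{h^{(n-1)},h'^{(n-1)}}\cdots\big(\hat{A}^{(1)\top}\hat{A}^{(1)} + M^{(1)}\hat{\Sigma}_{A^{(1)}}\big)_{h^{(1)},h'^{(1)}}\\
&\cdot\big(\hat{\Sigma}_G\big)_{(h^{(1)},h'^{(1)}),\ldots,(h^{(N)},h'^{(N)})}.
\end{aligned}$$
Here, we used the tensor expression of $\hat{\Sigma}_G \in \mathbb{R}^{\prod_{n=1}^N 2H^{(n)}}$ for the core posterior covariance $\hat{\breve{\Sigma}}_G$.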
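Free Energy as a Function of Variational Parameters
Substituting Eq. (3.75) into Eq. (3.68), we have
$$\begin{aligned}
2F ={}& 2\Big\langle\log r(\mathcal{G}) + \sum_{n=1}^N\log r(A^{(n)}) - \log\Big(p(\mathcal{V}|\mathcal{G}, \{A^{(n)}\})\,p(\mathcal{G})\prod_{n=1}^N p(A^{(n)})\Big)\Big\rangle_{r(\mathcal{G})(\prod_{n=1}^N r(A^{(n)}))}\\
={}& \Big(\prod_{n=1}^N M^{(n)}\Big)\log(2\pi\sigma^2) + \log\frac{\det(\breve{C}_G)}{\det(\hat{\breve{\Sigma}}_G)} + \sum_{n=1}^N M^{(n)}\log\frac{\det(C_{A^{(n)}})}{\det(\hat{\Sigma}_{A^{(n)}})}\\
&+ \frac{\|\mathcal{V}\|^2}{\sigma^2} - \prod_{n=1}^N H^{(n)} - \sum_{n=1}^N\big(M^{(n)}H^{(n)}\big)\\
&+ \mathrm{tr}\Big(\breve{C}_G^{-1}\big(\hat{\breve{g}}\hat{\breve{g}}^\top + \hat{\breve{\Sigma}}_G\big)\Big) + \sum_{n=1}^N\mathrm{tr}\Big(C_{A^{(n)}}^{-1}\big(\hat{A}^{(n)\top}\hat{A}^{(n)} + M^{(n)}\hat{\Sigma}_{A^{(n)}}\big)\Big)\\
&- \frac{2}{\sigma^2}\breve{v}^\top\big(\hat{A}^{(N)} \otimes \cdots \otimes \hat{A}^{(1)}\big)\hat{\breve{g}}\\
&+ \frac{1}{\sigma^2}\mathrm{tr}\Big\{\Big(\big(\hat{A}^{(N)\top}\hat{A}^{(N)} + M^{(N)}\hat{\Sigma}_{A^{(N)}}\big) \otimes \cdots \otimes \big(\hat{A}^{(1)\top}\hat{A}^{(1)} + M^{(1)}\hat{\Sigma}_{A^{(1)}}\big)\Big)\big(\hat{\breve{g}}\hat{\breve{g}}^\top + \hat{\breve{\Sigma}}_G\big)\Big\}.
\end{aligned} \tag{3.81}$$

Empirical Variational Bayesian Algorithm
The derivative of the free energy (3.81) with respect to $\breve{C}_G$ gives
$$2\frac{\partial F}{\partial\breve{C}_G} = M^{(n)}\Big(\breve{C}_G^{-1} - \breve{C}_G^{-2}\big(\hat{\breve{g}}\hat{\breve{g}}^\top + \hat{\breve{\Sigma}}_G\big)\Big).$$
Since it holds that
$$\frac{\partial\breve{C}_G}{\partial(C_{G^{(n)}})_{h,h}} = C_{G^{(N)}} \otimes \cdots \otimes C_{G^{(n+1)}} \otimes E_{(H^{(n)},h,h)} \otimes C_{G^{(n-1)}} \otimes \cdots \otimes C_{G^{(1)}},$$
where $E_{(H,h,h')} \in \mathbb{R}^{H \times H}$ is the matrix with the $(h, h')$th entry equal to one and the others equal to zero, we have
$$\begin{aligned}
2\frac{\partial F}{\partial(C_{G^{(n)}})_{h,h}} ={}& 2\,\mathrm{tr}\Big(\frac{\partial F}{\partial\breve{C}_G}\frac{\partial\breve{C}_G}{\partial(C_{G^{(n)}})_{h,h}}\Big)\\
={}& M^{(n)}\Big\|\mathrm{vec}\Big(I_{H^{(N)}} \otimes \cdots \otimes I_{H^{(n+1)}} \otimes (C_{G^{(n)}})^{-1}_{h,h}E_{(H^{(n)},h,h)} \otimes I_{H^{(n-1)}} \otimes \cdots \otimes I_{H^{(1)}}\\
&\quad - \big(C_{G^{(N)}}^{-1} \otimes \cdots \otimes C_{G^{(n+1)}}^{-1} \otimes (C_{G^{(n)}})^{-2}_{h,h}E_{(H^{(n)},h,h)} \otimes C_{G^{(n-1)}}^{-1} \otimes \cdots \otimes C_{G^{(1)}}^{-1}\big)\big(\hat{\breve{g}}\hat{\breve{g}}^\top + \hat{\breve{\Sigma}}_G\big)\Big)\Big\|_1\\
={}& M^{(n)}(C_{G^{(n)}})^{-2}_{h,h}\Big\|\mathrm{vec}\Big(I_{H^{(N)}} \otimes \cdots \otimes I_{H^{(n+1)}} \otimes (C_{G^{(n)}})_{h,h}E_{(H^{(n)},h,h)} \otimes I_{H^{(n-1)}} \otimes \cdots \otimes I_{H^{(1)}}\\
&\quad - \big(C_{G^{(N)}}^{-1} \otimes \cdots \otimes C_{G^{(n+1)}}^{-1} \otimes E_{(H^{(n)},h,h)} \otimes C_{G^{(n-1)}}^{-1} \otimes \cdots \otimes C_{G^{(1)}}^{-1}\big)\big(\hat{\breve{g}}\hat{\breve{g}}^\top + \hat{\breve{\Sigma}}_G\big)\Big)\Big\|_1\\
={}& M^{(n)}(C_{G^{(n)}})^{-2}_{h,h}\Big((C_{G^{(n)}})_{h,h}\prod_{n' \neq n}H^{(n')}\\
&\quad - \mathrm{diag}\big(\hat{\breve{g}}\hat{\breve{g}}^\top + \hat{\breve{\Sigma}}_G\big)^\top\mathrm{diag}\big(C_{G^{(N)}}^{-1} \otimes \cdots \otimes C_{G^{(n+1)}}^{-1} \otimes E_{(H^{(n)},h,h)} \otimes C_{G^{(n-1)}}^{-1} \otimes \cdots \otimes C_{G^{(1)}}^{-1}\big)\Big),
\end{aligned} \tag{3.82}$$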
Algorithm 3 EVB learning for Tucker factorization.
1: Initialize the variational parameters $(\hat{\breve{g}}, \hat{\breve{\Sigma}}_G, \{\hat{A}^{(n)}\}, \{\hat{\Sigma}_{A^{(n)}}\})$, and the hyperparameters $(\{C_{G^{(n)}}\}, \{C_{A^{(n)}}\}, \sigma^2)$, for example, $\hat{\breve{g}}_h \sim \mathrm{Gauss}_1(0, \tau)$, $\hat{A}^{(n)}_{m,h} \sim \mathrm{Gauss}_1(0, \tau^{1/N})$, $\hat{\breve{\Sigma}}_G = \tau I_{\prod_{n=1}^N H^{(n)}}$, $C_{G^{(n)}} = \tau I_{H^{(n)}}$, $\hat{\Sigma}_{A^{(n)}} = C_{A^{(n)}} = \tau^{1/N}I_{H^{(n)}}$, and $\sigma^2 = \tau^2$ for $\tau^2 = \|\mathcal{V}\|^2/\prod_{n=1}^N M^{(n)}$.
2: Apply (substitute the right-hand side into the left-hand side) Eqs. (3.77), (3.76), (3.79), and (3.78) to update $\hat{\breve{\Sigma}}_G$, $\hat{\breve{g}}$, $\{\hat{\Sigma}_{A^{(n)}}\}$, and $\{\hat{A}^{(n)}\}$, respectively.
3: Apply Eqs. (3.83) through (3.85) to update $\{C_{G^{(n)}}\}$, $\{C_{A^{(n)}}\}$, and $\sigma^2$, respectively.
4: Prune the hth component if $c_{g_h^{(n)}}^2c_{a_h^{(n)}}^2 < \varepsilon$, where $\varepsilon > 0$ is a threshold, e.g., set to $\varepsilon = 10^{-4}$.
5: Evaluate the free energy (3.81).
6: Iterate Steps 2 through 5 until convergence (until the energy decrease becomes smaller than a threshold).
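where $\|\cdot\|_1$ denotes the $\ell_1$-norm, and $\mathrm{diag}(\cdot)$ denotes the column vector consisting of the diagonal entries of a matrix. Thus, the prior covariance for the core tensor can be updated by
$$c_{g_h^{(n)}}^2 = \frac{\mathrm{diag}\big(\hat{\breve{g}}\hat{\breve{g}}^\top + \hat{\breve{\Sigma}}_G\big)^\top\mathrm{diag}\big(C_{G^{(N)}}^{-1} \otimes \cdots \otimes C_{G^{(n+1)}}^{-1} \otimes E_{(H^{(n)},h,h)} \otimes C_{G^{(n-1)}}^{-1} \otimes \cdots \otimes C_{G^{(1)}}^{-1}\big)}{\prod_{n' \neq n}H^{(n')}}. \tag{3.83}$$
The derivative of the free energy (3.81) with respect to $C_{A^{(n)}}$ gives
$$2\frac{\partial F}{\partial c_{a_h^{(n)}}^2} = M^{(n)}\Big(c_{a_h^{(n)}}^{-2} - c_{a_h^{(n)}}^{-4}\Big(\frac{\|\hat{a}_h^{(n)}\|^2}{M^{(n)}} + \big(\hat{\Sigma}_{A^{(n)}}\big)_{h,h}\Big)\Big),$$
which leads to the following update rule for the prior covariance for the factor matrices:
$$c_{a_h^{(n)}}^2 = \frac{\|\hat{a}_h^{(n)}\|^2}{M^{(n)}} + \big(\hat{\Sigma}_{A^{(n)}}\big)_{h,h}. \tag{3.84}$$
Finally, the derivative of the free energy (3.81) with respect to $\sigma^2$ gives
$$\begin{aligned}
2\frac{\partial F}{\partial\sigma^2} = \frac{\prod_{n=1}^N M^{(n)}}{\sigma^2} - \frac{1}{\sigma^4}\Big\{&\|\mathcal{V}\|^2 - 2\breve{v}^\top\big(\hat{A}^{(N)} \otimes \cdots \otimes \hat{A}^{(1)}\big)\hat{\breve{g}}\\
&+ \mathrm{tr}\Big(\Big(\big(\hat{A}^{(N)\top}\hat{A}^{(N)} + M^{(N)}\hat{\Sigma}_{A^{(N)}}\big) \otimes \cdots \otimes \big(\hat{A}^{(1)\top}\hat{A}^{(1)} + M^{(1)}\hat{\Sigma}_{A^{(1)}}\big)\Big)\big(\hat{\breve{g}}\hat{\breve{g}}^\top + \hat{\breve{\Sigma}}_G\big)\Big)\Big\},
\end{aligned}$$
which leads to the update rule for the noise variance as follows:
$$\begin{aligned}
\hat{\sigma}^2 = \frac{1}{\prod_{n=1}^N M^{(n)}}\Big\{&\|\mathcal{V}\|^2 - 2\breve{v}^\top\big(\hat{A}^{(N)} \otimes \cdots \otimes \hat{A}^{(1)}\big)\hat{\breve{g}}\\
&+ \mathrm{tr}\Big(\Big(\big(\hat{A}^{(N)\top}\hat{A}^{(N)} + M^{(N)}\hat{\Sigma}_{A^{(N)}}\big) \otimes \cdots \otimes \big(\hat{A}^{(1)\top}\hat{A}^{(1)} + M^{(1)}\hat{\Sigma}_{A^{(1)}}\big)\Big)\big(\hat{\breve{g}}\hat{\breve{g}}^\top + \hat{\breve{\Sigma}}_G\big)\Big)\Big\}.
\end{aligned} \tag{3.85}$$
Algorithm 3 summarizes the EVB algorithm for Tucker factorization. If we appropriately set the hyperparameters $(\{C_{G^{(n)}}\}, \{C_{A^{(n)}}\}, \sigma^2)$ in Step 1 and skip Steps 3 and 4, Algorithm 3 is reduced to (nonempirical) VB learning.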
3.4 Low-Rank Subspace Clustering
PCA globally embeds data points into a low-dimensional subspace. As more flexible tools, subspace clustering methods, which locally embed the data into the union of subspaces, have been developed. In this section, we derive VB learning for subspace clustering.

3.4.1 Subspace Clustering Methods
Most clustering methods, such as k-means (MacQueen, 1967; Lloyd, 1982) and spectral clustering (Shi and Malik, 2000), (explicitly or implicitly) assume that there are sparse areas between dense areas, and separate the dense areas as clusters (Figure 3.2 left). On the other hand, there are some data, e.g., projected trajectories of points on a rigid body in 3D space, where data points can be assumed to lie in a union of low-dimensional subspaces (Figure 3.2 right). Note that a point lying in a subspace is not necessarily far from a point lying in another subspace if those subspaces intersect each other. Subspace clustering methods have been developed to analyze this kind of data.
Figure 3.2 Clustering (left) and subspace clustering (right).
Let $V = (v_1, \ldots, v_M) \in \mathbb{R}^{L \times M}$ be L-dimensional observed samples of size M. We assume that each $v_m$ is approximately expressed as a linear combination of $M'$ words in a dictionary, $D = (d_1, \ldots, d_{M'}) \in \mathbb{R}^{L \times M'}$, i.e.,
$$V = DU + E,$$
where $U \in \mathbb{R}^{M' \times M}$ is a matrix of unknown coefficients, and $E \in \mathbb{R}^{L \times M}$ is noise. In subspace clustering, the observed matrix V itself is often used as a dictionary D. Then, a convex formulation of the sparse subspace clustering (SSC) (Soltanolkotabi and Candès, 2011; Elhamifar and Vidal, 2013) is given by
$$\min_U\ \|V - VU\|_{\mathrm{Fro}}^2 + \lambda\|U\|_1, \quad\text{s.t.}\quad \mathrm{diag}(U) = 0, \tag{3.86}$$
where $U \in \mathbb{R}^{M \times M}$ is a parameter to be estimated, $\lambda > 0$ is a regularization coefficient, and $\|\cdot\|_1$ is the $\ell_1$-norm of a matrix. The first term in Eq. (3.86) together with the constraint requires that each data point $v_m$ is accurately expressed as a linear combination of other data points, $\{v_{m'}\}$ for $m' \neq m$. The second term, which is the $\ell_1$-regularizer, enforces that the number of samples contributing to the linear combination should be small, which leads to low dimensionality of each obtained subspace. After the solution $\hat{U}$ to the problem (3.86) is obtained, the matrix $\mathrm{abs}(\hat{U}) + \mathrm{abs}(\hat{U}^\top)$, where $\mathrm{abs}(\cdot)$ takes the absolute value elementwise, is regarded as an affinity matrix, and a spectral clustering algorithm, such as the normalized cuts (Shi and Malik, 2000), is applied to obtain clusters.
Another popular method for subspace clustering is low-rank subspace clustering (LRSC) or low-rank representation (Liu et al., 2010; Liu and Yan, 2011; Liu et al., 2012; Vidal and Favaro, 2014), where low-dimensional subspaces are sought by enforcing the low-rankness of U with the trace norm:
$$\min_U\ \|V - VU\|_{\mathrm{Fro}}^2 + \lambda\|U\|_{\mathrm{tr}}. \tag{3.87}$$
Since LRSC enforces the low-rankness of U, the constraint diag(U) = 0 is not necessary, which makes its optimization problem (3.87) significantly simpler than the optimization problem (3.86) for SSC. Thanks to this simplicity, the global solution of Eq. (3.87) has been analytically obtained (Vidal and Favaro, 2014).
Good properties of SSC and LRSC have been theoretically shown (Liu et al., 2010, 2012; Soltanolkotabi and Candès, 2011; Elhamifar and Vidal, 2013; Vidal and Favaro, 2014). It is observed that they behave differently in different situations, and each of SSC and LRSC shows advantages and disadvantages over the other, i.e., neither SSC nor LRSC is known to dominate the other in general. In the rest of this section, we focus on LRSC and derive its VB learning algorithm.
3.4.2 VB Learning for LRSC
We start with the following probabilistic model, of which the maximum a posteriori (MAP) estimator coincides with the solution to the convex formulation (3.87) under a certain hyperparameter setting:
$$p(V|A, B) \propto \exp\Big(-\frac{1}{2\sigma^2}\big\|V - VBA^\top\big\|_{\mathrm{Fro}}^2\Big), \tag{3.88}$$
$$p(A) \propto \exp\Big(-\frac{1}{2}\mathrm{tr}\big(AC_A^{-1}A^\top\big)\Big), \tag{3.89}$$
$$p(B) \propto \exp\Big(-\frac{1}{2}\mathrm{tr}\big(BC_B^{-1}B^\top\big)\Big). \tag{3.90}$$
Here, we factorized U as $U = BA^\top$, where $A \in \mathbb{R}^{M \times H}$ and $B \in \mathbb{R}^{M \times H}$ for $H \leq \min(L, M)$ are the parameters to be estimated (Babacan et al., 2012a). This factorization is known to induce low-rankness through the MIR mechanism, which will be discussed in Chapter 7. We assume that the prior covariances are diagonal:
$$C_A = \mathrm{Diag}(c_{a_1}^2, \ldots, c_{a_H}^2), \qquad C_B = \mathrm{Diag}(c_{b_1}^2, \ldots, c_{b_H}^2).$$

Conditional Conjugacy
The model likelihood (3.88) of LRSC is similar to the model likelihood (3.1) of MF, and it is in the Gaussian form with respect to A if B is regarded as a constant, or vice versa. Therefore, the priors (3.89) and (3.90) are conditionally conjugate for A and B, respectively.

Variational Bayesian Algorithm
The conditional conjugacy implies that the following independence constraint on the approximate posterior leads to a tractable algorithm:
$$r(A, B) = r(A)r(B).$$
In the same way as MF, we can show that the VB posterior is Gaussian in the following form:
$$r(A) \propto \exp\Big(-\frac{\mathrm{tr}\big((A - \hat{A})\hat{\Sigma}_A^{-1}(A - \hat{A})^\top\big)}{2}\Big), \qquad r(B) \propto \exp\Big(-\frac{(\breve{b} - \hat{\breve{b}})^\top\hat{\breve{\Sigma}}_B^{-1}(\breve{b} - \hat{\breve{b}})}{2}\Big), \tag{3.91}$$
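where $\breve{b} = \mathrm{vec}(B) \in \mathbb{R}^{MH}$. The variational parameters satisfy the following stationary conditions:
$$\hat{A} = \frac{1}{\sigma^2}V^\top V\hat{B}\hat{\Sigma}_A, \tag{3.92}$$
$$\hat{\Sigma}_A = \sigma^2\Big(\big\langle B^\top V^\top VB\big\rangle_{r(B)} + \sigma^2C_A^{-1}\Big)^{-1}, \tag{3.93}$$
$$\hat{\breve{b}} = \frac{\hat{\breve{\Sigma}}_B}{\sigma^2}\mathrm{vec}\big(V^\top V\hat{A}\big), \tag{3.94}$$
$$\hat{\breve{\Sigma}}_B = \sigma^2\Big(\big(\hat{A}^\top\hat{A} + M\hat{\Sigma}_A\big) \otimes V^\top V + \sigma^2\big(C_B^{-1} \otimes I_M\big)\Big)^{-1}, \tag{3.95}$$
where the entries of $\langle B^\top V^\top VB\rangle_{r(B)}$ in Eq. (3.93) are explicitly given by
$$\Big(\big\langle B^\top V^\top VB\big\rangle_{r(B)}\Big)_{h,h'} = \big(\hat{B}^\top V^\top V\hat{B}\big)_{h,h'} + \mathrm{tr}\big(V^\top V\hat{\Sigma}_B^{(h,h')}\big). \tag{3.96}$$
Here $\hat{\Sigma}_B^{(h,h')} \in \mathbb{R}^{M \times M}$ is the $(h, h')$th block matrix of $\hat{\breve{\Sigma}}_B \in \mathbb{R}^{MH \times MH}$, i.e.,
$$\hat{\breve{\Sigma}}_B = \begin{pmatrix}\hat{\Sigma}_B^{(1,1)} & \cdots & \hat{\Sigma}_B^{(1,H)}\\ \vdots & \ddots & \vdots\\ \hat{\Sigma}_B^{(H,1)} & \cdots & \hat{\Sigma}_B^{(H,H)}\end{pmatrix}.$$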
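Free Energy as a Function of Variational Parameters
The free energy can be explicitly written as
$$\begin{aligned}
2F ={}& LM\log(2\pi\sigma^2) + \frac{\big\|V - V\hat{B}\hat{A}^\top\big\|_{\mathrm{Fro}}^2}{\sigma^2} + M\log\frac{\det(C_A)}{\det(\hat{\Sigma}_A)} + \log\frac{\det(C_B \otimes I_M)}{\det(\hat{\breve{\Sigma}}_B)} - 2MH\\
&+ \mathrm{tr}\Big\{C_A^{-1}\big(\hat{A}^\top\hat{A} + M\hat{\Sigma}_A\big)\Big\} + \mathrm{tr}\Big\{C_B^{-1}\hat{B}^\top\hat{B}\Big\} + \mathrm{tr}\Big\{\big(C_B^{-1} \otimes I_M\big)\hat{\breve{\Sigma}}_B\Big\}\\
&+ \mathrm{tr}\Big\{\sigma^{-2}V^\top V\Big(-\hat{B}\hat{A}^\top\hat{A}\hat{B}^\top + \big\langle B\big(\hat{A}^\top\hat{A} + M\hat{\Sigma}_A\big)B^\top\big\rangle_{r(B)}\Big)\Big\},
\end{aligned} \tag{3.97}$$
where the expectation in the last term is given by
$$\Big(\big\langle B\big(\hat{A}^\top\hat{A} + M\hat{\Sigma}_A\big)B^\top\big\rangle_{r(B)}\Big)_{m,m'} = \Big(\hat{B}\big(\hat{A}^\top\hat{A} + M\hat{\Sigma}_A\big)\hat{B}^\top\Big)_{m,m'} + \mathrm{tr}\Big(\big(\hat{A}^\top\hat{A} + M\hat{\Sigma}_A\big)\hat{\acute{\Sigma}}_B^{(m,m')}\Big). \tag{3.98}$$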
Algorithm 4 EVB learning for low-rank subspace clustering.
1: Initialize the variational parameters $(\hat{A}, \hat{\Sigma}_A, \hat{B}, \hat{\breve{\Sigma}}_B)$, and the hyperparameters $(C_A, C_B, \sigma^2)$, for example, $\hat{A}_{m,h}, \hat{B}_{m,h} \sim \mathrm{Gauss}_1(0, 1^2)$, $\hat{\Sigma}_A = C_A = C_B = I_H$, $\hat{\breve{\Sigma}}_B = I_{MH}$, and $\sigma^2 = \|V\|^2/(LM)$.
2: Apply (substitute the right-hand side into the left-hand side) Eqs. (3.93), (3.92), (3.95), and (3.94) to update $\hat{\Sigma}_A$, $\hat{A}$, $\hat{\breve{\Sigma}}_B$, and $\hat{B}$, respectively.
3: Apply Eqs. (3.99) through (3.101) to update $C_A$, $C_B$, and $\sigma^2$, respectively.
4: Prune the hth component if $c_{a_h}^2c_{b_h}^2 < \varepsilon$, where $\varepsilon > 0$ is a threshold, e.g., set to $\varepsilon = 10^{-4}$.
5: Evaluate the free energy (3.97).
6: Iterate Steps 2 through 5 until convergence (until the energy decrease becomes smaller than a threshold).
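Here, $\hat{\acute{\Sigma}}_B^{(m,m')} \in \mathbb{R}^{H \times H}$ is defined as
$$\hat{\acute{\Sigma}}_B^{(m,m')} = \begin{pmatrix}\big(\hat{\Sigma}_B^{(1,1)}\big)_{m,m'} & \cdots & \big(\hat{\Sigma}_B^{(1,H)}\big)_{m,m'}\\ \vdots & \ddots & \vdots\\ \big(\hat{\Sigma}_B^{(H,1)}\big)_{m,m'} & \cdots & \big(\hat{\Sigma}_B^{(H,H)}\big)_{m,m'}\end{pmatrix}.$$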
Empirical Variational Bayesian Algorithm
By differentiating the free energy (3.97) with respect to the hyperparameters, we can obtain the stationary conditions for the hyperparameters:
$$c_{a_h}^2 = \|\hat{a}_h\|^2/M + \big(\hat{\Sigma}_A\big)_{h,h}, \tag{3.99}$$
$$c_{b_h}^2 = \Big(\|\hat{b}_h\|^2 + \mathrm{tr}\big(\hat{\Sigma}_B^{(h,h)}\big)\Big)/M, \tag{3.100}$$
$$\hat{\sigma}^2 = \frac{\mathrm{tr}\Big(V^\top V\Big(I_M - 2\hat{B}\hat{A}^\top + \big\langle B\big(\hat{A}^\top\hat{A} + M\hat{\Sigma}_A\big)B^\top\big\rangle_{r(B)}\Big)\Big)}{LM}. \tag{3.101}$$
Algorithm 4 summarizes the EVB algorithm. If we appropriately set the hyperparameters $(C_A, C_B, \sigma^2)$ in Step 1 and skip Steps 3 and 4, Algorithm 4 is reduced to (nonempirical) VB learning.
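The update of the posterior of B in Algorithm 4, Eqs. (3.95) and (3.94), can be sketched with an explicit Kronecker product as follows. The function name and the toy sizes are assumptions of this sketch, and vec(·) is taken to be column-major. The explicit MH × MH inverse is affordable only for very small problems.

```python
import numpy as np

def lrsc_update_b(V, A, SA, CB_inv, sigma2):
    """Posterior mean and full covariance of B, Eqs. (3.94) and (3.95)."""
    M = V.shape[1]
    H = A.shape[1]
    K = V.T @ V                                   # M x M
    S = A.T @ A + M * SA                          # <A^T A> under r(A)
    # Eq. (3.95): MH x MH posterior covariance of vec(B)
    SB_full = sigma2 * np.linalg.inv(
        np.kron(S, K) + sigma2 * np.kron(CB_inv, np.eye(M)))
    # Eq. (3.94): posterior mean of vec(B), reshaped back to M x H
    b = SB_full @ (K @ A).flatten(order="F") / sigma2
    return b.reshape((M, H), order="F"), SB_full

rng = np.random.default_rng(4)
L, M, H = 4, 6, 2
V = rng.standard_normal((L, M))
A = rng.standard_normal((M, H))
SA = 0.1 * np.eye(H)
CB_inv = np.eye(H)
sigma2 = 0.1
B, SB_full = lrsc_update_b(V, A, SA, CB_inv, sigma2)

# The mean should satisfy B C_B^{-1} + V^T V (B S - A) / sigma^2 = 0
S = A.T @ A + M * SA
residual = B @ CB_inv + V.T @ V @ (B @ S - A) / sigma2
```

The residual is the gradient of the B-dependent part of the free energy, so it vanishes at the exact update up to rounding error.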
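Variational Bayesian Algorithm under the Kronecker Product Covariance Constraint
The standard VB algorithm, given in Algorithm 4, for LRSC requires the inversion of an MH × MH matrix, which is prohibitively expensive in practical applications. As a remedy, Babacan et al. (2012a) proposed to restrict the posterior r(B) for B to be the matrix variate Gaussian with the Kronecker product covariance (KPC) structure, as Eq. (3.18). Namely, we restrict the approximate posterior to be in the following form:
$$r(B) = \mathrm{MGauss}_{M,H}\big(B; \hat{B}, \hat{\Psi}_B \otimes \hat{\Sigma}_B\big) \propto \exp\Big(-\frac{1}{2}\mathrm{tr}\Big(\hat{\Psi}_B^{-1}(B - \hat{B})\hat{\Sigma}_B^{-1}(B - \hat{B})^\top\Big)\Big). \tag{3.102}$$
Under this additional constraint, the free energy is written as
$$\begin{aligned}
2F^{\mathrm{KPC}} ={}& LM\log(2\pi\sigma^2) + \frac{\big\|V - V\hat{B}\hat{A}^\top\big\|_{\mathrm{Fro}}^2}{\sigma^2} + M\log\frac{\det(C_A)}{\det(\hat{\Sigma}_A)} + M\log\frac{\det(C_B)}{\det(\hat{\Sigma}_B)} + H\log\frac{1}{\det(\hat{\Psi}_B)} - 2MH\\
&+ \mathrm{tr}\Big\{C_A^{-1}\big(\hat{A}^\top\hat{A} + M\hat{\Sigma}_A\big) + C_B^{-1}\big(\hat{B}^\top\hat{B} + \mathrm{tr}(\hat{\Psi}_B)\hat{\Sigma}_B\big)\Big\}\\
&+ \mathrm{tr}\Big\{\sigma^{-2}V^\top V\Big(M\hat{B}\hat{\Sigma}_A\hat{B}^\top + \mathrm{tr}\big((\hat{A}^\top\hat{A} + M\hat{\Sigma}_A)\hat{\Sigma}_B\big)\hat{\Psi}_B\Big)\Big\}.
\end{aligned} \tag{3.103}$$
By differentiating the free energy (3.103) with respect to each variational parameter, we obtain the following update rules:
$$\hat{A} = \frac{1}{\sigma^2}V^\top V\hat{B}\hat{\Sigma}_A, \tag{3.104}$$
$$\hat{\Sigma}_A = \sigma^2\Big(\hat{B}^\top V^\top V\hat{B} + \mathrm{tr}\big(V^\top V\hat{\Psi}_B\big)\hat{\Sigma}_B + \sigma^2C_A^{-1}\Big)^{-1}, \tag{3.105}$$
$$\hat{B}^{\mathrm{new}} = \hat{B}^{\mathrm{old}} - \alpha\Big(\hat{B}^{\mathrm{old}}C_B^{-1} + \frac{1}{\sigma^2}V^\top V\Big(-\hat{A} + \hat{B}^{\mathrm{old}}\big(\hat{A}^\top\hat{A} + M\hat{\Sigma}_A\big)\Big)\Big), \tag{3.106}$$
$$\hat{\Sigma}_B = \sigma^2\Big(\frac{\mathrm{tr}\big(V^\top V\hat{\Psi}_B\big)}{M}\big(\hat{A}^\top\hat{A} + M\hat{\Sigma}_A\big) + \frac{\sigma^2\mathrm{tr}(\hat{\Psi}_B)}{M}C_B^{-1}\Big)^{-1}, \tag{3.107}$$
$$\hat{\Psi}_B = \sigma^2\Big(\frac{\mathrm{tr}\big((\hat{A}^\top\hat{A} + M\hat{\Sigma}_A)\hat{\Sigma}_B\big)}{H}V^\top V + \frac{\sigma^2\mathrm{tr}\big(C_B^{-1}\hat{\Sigma}_B\big)}{H}I_M\Big)^{-1}, \tag{3.108}$$
$$c_{a_h}^2 = \big(\hat{A}^\top\hat{A} + M\hat{\Sigma}_A\big)_{h,h}/M, \tag{3.109}$$
$$c_{b_h}^2 = \big(\hat{B}^\top\hat{B} + \mathrm{tr}(\hat{\Psi}_B)\hat{\Sigma}_B\big)_{h,h}/M, \tag{3.110}$$
$$\hat{\sigma}^2 = \frac{1}{LM}\mathrm{tr}\Big(V^\top V\Big(I_M - 2\hat{B}\hat{A}^\top + \mathrm{tr}\big((\hat{A}^\top\hat{A} + M\hat{\Sigma}_A)\hat{\Sigma}_B\big)\hat{\Psi}_B + \hat{B}\big(\hat{A}^\top\hat{A} + M\hat{\Sigma}_A\big)\hat{B}^\top\Big)\Big). \tag{3.111}$$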
Algorithm 5 EVB learning for low-rank subspace clustering under the Kronecker product covariance constraint (3.102).
1: Initialize the variational parameters $(\hat{A}, \hat{\Sigma}_A, \hat{B}, \hat{\Sigma}_B, \hat{\Psi}_B)$, and the hyperparameters $(C_A, C_B, \sigma^2)$, for example, $\hat{A}_{m,h}, \hat{B}_{m,h} \sim \mathrm{Gauss}_1(0, 1^2)$, $\hat{\Sigma}_A = \hat{\Sigma}_B = C_A = C_B = I_H$, $\hat{\Psi}_B = I_M$, and $\sigma^2 = \|V\|_{\mathrm{Fro}}^2 / (LM)$.
2: Apply (substitute the right-hand side into the left-hand side) Eqs. (3.105), (3.104), (3.107), and (3.108) to update $\hat{\Sigma}_A$, $\hat{A}$, $\hat{\Sigma}_B$, and $\hat{\Psi}_B$, respectively.
3: Apply Eq. (3.106) $T$ times (e.g., $T = 20$) to update $\hat{B}$.
4: Apply Eqs. (3.109) through (3.111) to update $C_A$, $C_B$, and $\sigma^2$, respectively.
5: Prune the $h$th component if $c_{a_h}^2 c_{b_h}^2 < \varepsilon$, where $\varepsilon > 0$ is a threshold, e.g., set to $\varepsilon = 10^{-4}$.
6: Evaluate the free energy (3.103).
7: Iterate Steps 2 through 6 until convergence (until the energy decrease becomes smaller than a threshold).
Note that Eq. (3.106) is the gradient descent algorithm for B with the step size α > 0. Algorithm 5 summarizes the EVB algorithm under the KPC constraint, which we call KPC approximation (KPCA). If we appropriately set the hyperparameters (CA , C B , σ2 ) in Step 1 and skip Steps 4 and 5, Algorithm 5 is reduced to (nonempirical) VB learning.
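Steps 4 and 5 of Algorithm 5 can be illustrated with a minimal sketch. It assumes that matrices are stored as nested Python lists and that the trace of the row covariance is passed in as a number; the function names `evb_prior_variances` and `components_to_keep` are hypothetical and chosen only for this illustration.

```python
def evb_prior_variances(A_hat, Sigma_A, B_hat, Sigma_B, tr_Psi_B):
    """Diagonal prior variances from Eqs. (3.109) and (3.110).

    A_hat, B_hat: M x H nested lists; Sigma_A, Sigma_B: H x H nested lists;
    tr_Psi_B: trace of the M x M row covariance (a float).
    """
    M, H = len(A_hat), len(A_hat[0])
    c2a, c2b = [], []
    for h in range(H):
        # (A^T A)_{h,h} and (B^T B)_{h,h} are squared column norms.
        a_sq = sum(A_hat[m][h] ** 2 for m in range(M))
        b_sq = sum(B_hat[m][h] ** 2 for m in range(M))
        c2a.append((a_sq + M * Sigma_A[h][h]) / M)
        c2b.append((b_sq + tr_Psi_B * Sigma_B[h][h]) / M)
    return c2a, c2b


def components_to_keep(c2a, c2b, eps=1e-4):
    """Indices h that survive the pruning rule c2a[h] * c2b[h] < eps."""
    return [h for h in range(len(c2a)) if c2a[h] * c2b[h] >= eps]
```

A component whose posterior mean and variance have both shrunk toward zero gets a tiny product of prior variances and is dropped, which is how the rank is reduced during the iterations.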
3.5 Sparse Additive Matrix Factorization
PCA is known to be sensitive to outliers in data and generally fails in their presence. To cope with outliers, robust PCA, where spiky noise is captured by an elementwise sparse term, was proposed (Candès et al., 2011). In this section, we introduce a generalization of robust PCA, called sparse additive matrix factorization (SAMF) (Nakajima et al., 2013b) and derive its VB learning algorithm.

3.5.1 Robust PCA and Matrix Factorization
In robust PCA, the observed matrix $V \in \mathbb{R}^{L \times M}$ is modeled as follows:

$$V = U^{\text{low-rank}} + U^{\text{element}} + E, \tag{3.112}$$

where $U^{\text{low-rank}} \in \mathbb{R}^{L \times M}$ is a low-rank matrix, $U^{\text{element}} \in \mathbb{R}^{L \times M}$ is an elementwise sparse matrix, and $E \in \mathbb{R}^{L \times M}$ is a (typically dense) noise matrix.
Given the observed matrix $V$, one can infer each term in the right-hand side of Eq. (3.112) by solving the following convex problem (Candès et al., 2011):

$$\min_{U^{\text{low-rank}},\, U^{\text{element}}} \left\| V - U^{\text{low-rank}} - U^{\text{element}} \right\|_{\mathrm{Fro}}^2 + \lambda_1 \left\| U^{\text{low-rank}} \right\|_{\mathrm{tr}} + \lambda_2 \left\| U^{\text{element}} \right\|_1,$$

where the trace norm $\|\cdot\|_{\mathrm{tr}}$ induces low-rank sparsity, and the $\ell_1$-norm $\|\cdot\|_1$ induces elementwise sparsity. The regularization coefficients $\lambda_1$ and $\lambda_2$ control the strength of sparsity.

In Bayesian modeling, the low-rank matrix is commonly expressed as the product of two matrices, $A \in \mathbb{R}^{M \times H}$ and $B \in \mathbb{R}^{L \times H}$:

$$U^{\text{low-rank}} = B A^\top = \sum_{h=1}^{H} b_h a_h^\top. \tag{3.113}$$
Trivially, low-rankness is forced if H is set to a small value. However, when VB learning is applied, the estimator can be low-rank even if we adopt the fullrank model, i.e., H = min(L, M). This phenomenon is caused by MIR, which will be discussed in Chapter 7.
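The sum-of-rank-one form in Eq. (3.113) can be made concrete with a small sketch. It assumes nested Python lists for matrices; `outer` and `low_rank_product` are hypothetical helper names, not part of any library.

```python
def outer(b, a):
    """Outer product b a^T of two vectors given as lists."""
    return [[b_l * a_m for a_m in a] for b_l in b]


def low_rank_product(B, A):
    """U = B A^T accumulated as sum_h b_h a_h^T, for B (L x H) and A (M x H)."""
    L, M, H = len(B), len(A), len(B[0])
    U = [[0.0] * M for _ in range(L)]
    for h in range(H):
        b_h = [B[l][h] for l in range(L)]  # h-th column of B
        a_h = [A[m][h] for m in range(M)]  # h-th column of A
        term = outer(b_h, a_h)
        for l in range(L):
            for m in range(M):
                U[l][m] += term[l][m]
    return U
```

A component whose column $b_h$ or $a_h$ is zero adds nothing to the sum, so the rank of $U$ is at most the number of nonzero components, even when $H$ is set to $\min(L, M)$.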
3.5.2 Sparse Matrix Factorization Terms
SAMF (Nakajima et al., 2013b) was proposed as a generalization of robust PCA, where various types of sparsity are induced by combining different types of factorization. For example, the following factorizations implicitly induce rowwise, columnwise, and elementwise sparsity, respectively:

$$U^{\text{row}} = \Gamma_E D = (\gamma_1^e \tilde{d}_1, \ldots, \gamma_L^e \tilde{d}_L)^\top, \tag{3.114}$$

$$U^{\text{column}} = E \Gamma_D = (\gamma_1^d e_1, \ldots, \gamma_M^d e_M), \tag{3.115}$$

$$U^{\text{element}} = E \odot D, \tag{3.116}$$

where $\Gamma_D = \mathrm{Diag}(\gamma_1^d, \ldots, \gamma_M^d) \in \mathbb{R}^{M \times M}$ and $\Gamma_E = \mathrm{Diag}(\gamma_1^e, \ldots, \gamma_L^e) \in \mathbb{R}^{L \times L}$ are diagonal matrices, and $D, E \in \mathbb{R}^{L \times M}$. $\odot$ denotes the Hadamard product, i.e., $(E \odot D)_{l,m} = E_{l,m} D_{l,m}$. The reason why the factorizations (3.114) through (3.116) induce the corresponding types of sparsity is explained in Section 7.5.

As a general expression of sparsity inducing factorizations, we define a sparse matrix factorization (SMF) term with a mapping $G$ consisting of partitioning, rearrangement, and factorization:

$$U = G(\{U'^{(k)}\}_{k=1}^{K}; X), \quad \text{where} \quad U'^{(k)} = B^{(k)} A^{(k)\top}. \tag{3.117}$$

Here, $\{A^{(k)}, B^{(k)}\}_{k=1}^{K}$ are parameters to be estimated, and $G(\cdot; X): \prod_{k=1}^{K} \mathbb{R}^{L'^{(k)} \times M'^{(k)}} \mapsto \mathbb{R}^{L \times M}$ is a designed function associated with an index mapping $X$ (explained shortly).
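The three factorizations (3.114) through (3.116) can be written out as a short sketch. It assumes nested Python lists for matrices and plain lists for the diagonal entries of the scaling matrices; the function names are hypothetical.

```python
def rowwise_term(gamma_e, D):
    """U^row = Gamma_E D: the l-th row of D is scaled by gamma_e[l]."""
    return [[gamma_e[l] * D[l][m] for m in range(len(D[0]))]
            for l in range(len(D))]


def columnwise_term(E, gamma_d):
    """U^column = E Gamma_D: the m-th column of E is scaled by gamma_d[m]."""
    return [[E[l][m] * gamma_d[m] for m in range(len(E[0]))]
            for l in range(len(E))]


def elementwise_term(E, D):
    """U^element = E (Hadamard product) D."""
    return [[E[l][m] * D[l][m] for m in range(len(E[0]))]
            for l in range(len(E))]
```

In this sketch, a zero scale in `gamma_e` zeroes a whole row and a zero scale in `gamma_d` zeroes a whole column, which is the sense in which each factorization is tied to one sparsity pattern.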
Figure 3.3 An example of SMF-term construction. $G(\cdot; X)$ with $X: (k, l', m') \mapsto (l, m)$ maps the set $\{U'^{(k)}\}_{k=1}^{K}$ of the PR matrices to the target matrix $U$, so that $U'^{(k)}_{l',m'} = U_{X(k,l',m')} = U_{l,m}$.
Figure 3.3 illustrates how to construct an SMF term. First, we partition the entries of $U$ into $K$ parts. Then, by rearranging the entries in each part, we form partitioned-and-rearranged (PR) matrices $U'^{(k)} \in \mathbb{R}^{L'^{(k)} \times M'^{(k)}}$ for $k = 1, \ldots, K$. Finally, each of $U'^{(k)}$ is decomposed into the product of $A^{(k)} \in \mathbb{R}^{M'^{(k)} \times H^{(k)}}$ and $B^{(k)} \in \mathbb{R}^{L'^{(k)} \times H^{(k)}}$, where $H^{(k)} \leq \min(L'^{(k)}, M'^{(k)})$.

In Eq. (3.117), the function $G(\cdot; X)$ is responsible for partitioning and rearrangement: it maps the set $\{U'^{(k)}\}_{k=1}^{K}$ of the PR matrices to the target matrix $U \in \mathbb{R}^{L \times M}$, based on the one-to-one map $X: (k, l', m') \mapsto (l, m)$ from the indices of the entries in $\{U'^{(k)}\}_{k=1}^{K}$ to the indices of the entries in $U$ such that

$$\left( G(\{U'^{(k)}\}_{k=1}^{K}; X) \right)_{l,m} = U_{l,m} = U_{X(k,l',m')} = U'^{(k)}_{l',m'}. \tag{3.118}$$

When VB learning is applied, the SMF-term expression (3.117) induces partitionwise sparsity and low-rank sparsity in each partition. Accordingly, partitioning, rearrangement, and factorization should be designed in the following way. Suppose that we are given a required sparsity structure on a matrix (examples of possible side information that suggests particular sparsity structures are given in Section 3.5.3). We first partition the matrix, according to the required sparsity. Some partitions can be submatrices. We rearrange each of the submatrices on which we do not want to impose low-rank sparsity into a long vector ($U'^{(3)}$ in the example in Figure 3.3). We leave the other submatrices which we want to be low-rank ($U'^{(2)}$) and the original vectors ($U'^{(1)}$ and $U'^{(4)}$) and scalars ($U'^{(5)}$) as they are. Finally, we factorize each of the PR matrices to induce sparsity.

Let us, for example, assume that rowwise sparsity is required. We first make the rowwise partition, i.e., separate $U \in \mathbb{R}^{L \times M}$ into $L$ pieces of $M$-dimensional row vectors $U'^{(l)} = \tilde{u}_l^\top \in \mathbb{R}^{1 \times M}$. Then, we factorize each partition as $U'^{(l)} = B^{(l)} A^{(l)\top}$ (see the top illustration in Figure 3.4). Thus, we obtain the rowwise sparse term (3.114). Here, $X(k, 1, m') = (k, m')$ makes the following connection between Eqs. (3.114) and (3.117): $\gamma_l^e = B^{(k)} \in \mathbb{R}$, $\tilde{d}_l = A^{(k)} \in \mathbb{R}^{M \times 1}$ for $k = l$. Similarly, requiring columnwise and elementwise sparsity leads to
Table 3.1 Examples of SMF terms.

| Factorization | Induced sparsity | $K$ | $(L'^{(k)}, M'^{(k)})$ | $X: (k, l', m') \mapsto (l, m)$ |
|---|---|---|---|---|
| $U = B A^\top$ | low-rank | $1$ | $(L, M)$ | $X(1, l', m') = (l', m')$ |
| $U = \Gamma_E D$ | rowwise | $L$ | $(1, M)$ | $X(k, 1, m') = (k, m')$ |
| $U = E \Gamma_D$ | columnwise | $M$ | $(L, 1)$ | $X(k, l', 1) = (l', k)$ |
| $U = E \odot D$ | elementwise | $L \times M$ | $(1, 1)$ | $X(k, 1, 1) = \text{vec-order}(k)$ |
Figure 3.4 SMF-term construction for the rowwise (top), the columnwise (middle), and the elementwise (bottom) sparse terms. [Panels: a $2 \times 3$ matrix $U$ is split by $G$ into its two rows (rowwise), its three columns (columnwise), or its six entries taken in column order, $U'^{(1)} = U_{1,1}$, $U'^{(2)} = U_{2,1}$, $U'^{(3)} = U_{1,2}$, $U'^{(4)} = U_{2,2}$, $U'^{(5)} = U_{1,3}$, $U'^{(6)} = U_{2,3}$ (elementwise); each PR matrix is factorized as $U'^{(k)} = B^{(k)} A^{(k)\top}$.]
Eqs. (3.115) and (3.116), respectively (see the bottom two illustrations in Figure 3.4). Table 3.1 summarizes how to design these SMF terms, where $\text{vec-order}(k) = (1 + ((k - 1) \bmod L), \lceil k/L \rceil)$ goes along the columns one after another in the same way as the vec operator forming a vector by stacking the columns of a matrix (in other words, $(U'^{(1)}, \ldots, U'^{(K)})^\top = \mathrm{vec}(U)$).

Now we define the SAMF model as the sum of SMF terms (3.117):

$$V = \sum_{s=1}^{S} U^{(s)} + E, \tag{3.119}$$

where

$$U^{(s)} = G(\{B^{(k,s)} A^{(k,s)\top}\}_{k=1}^{K^{(s)}}; X^{(s)}). \tag{3.120}$$
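The role of the index mapping $X$ can be illustrated with a minimal sketch of the assembling function $G$. It assumes 1-based indices as in the text, nested Python lists for matrices, and a Python callable for $X$; the names `vec_order` and `assemble` are hypothetical.

```python
def vec_order(k, L):
    """Map k = 1, ..., L*M to (l, m) in column-major (vec) order, 1-based."""
    # (k + L - 1) // L equals ceil(k / L) for positive integers.
    return (1 + ((k - 1) % L), (k + L - 1) // L)


def assemble(pr_matrices, X, L, M):
    """G({U'^(k)}; X): put entry (l', m') of the k-th PR matrix at X(k, l', m')."""
    U = [[None] * M for _ in range(L)]
    for k, U_k in enumerate(pr_matrices, start=1):
        for lp, row in enumerate(U_k, start=1):
            for mp, value in enumerate(row, start=1):
                l, m = X(k, lp, mp)
                U[l - 1][m - 1] = value
    return U
```

For example, the rowwise design in Table 3.1 corresponds to `X = lambda k, lp, mp: (k, mp)` with $L$ PR matrices of size $1 \times M$, and the elementwise design corresponds to `X = lambda k, lp, mp: vec_order(k, L)` with $L \times M$ PR matrices of size $1 \times 1$.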
3.5.3 Examples of SMF Terms
In practice, SMF terms should be designed based on side information. Suppose that $V \in \mathbb{R}^{L \times M}$ consists of $M$ samples of $L$-dimensional sensor outputs. In robust PCA (3.112), we add an elementwise sparse term (3.116) to the low-rank term (3.113), assuming that the low-rank signal is expected to be contaminated with spiky noise when observed. Here, we can say that the existence of spiky noise is used as side information. Similarly, if we expect that a small number of sensors can be broken, and their outputs are unreliable over all $M$ samples, we should add the rowwise sparse term (3.114) to separate the low-rank signal from rowwise noise: $V = U^{\text{low-rank}} + U^{\text{row}} + E$. If we expect that some accidental disturbances occurred during the observation, but do not know their exact locations (i.e., which samples are affected), the columnwise sparse term (3.115) can effectively capture such disturbances.

Figure 3.5 Foreground/background video separation task.

The SMF expression (3.117) enables us to use side information in a more flexible way, and its advantage has been shown in a foreground/background video separation problem (Nakajima et al., 2013b). The top image in Figure 3.5 is a frame of a video available from the Caviar Project website (the European Commission (EC)-funded CAVIAR project/IST 2001 37540, found at URL: http://homepages.inf.ed.ac.uk/rbf/CAVIAR/), and the task is to separate moving objects (bottom-right) from the background (bottom-left). Previous approaches (Candès et al., 2011; Ding et al., 2011; Babacan et al., 2012b) first constructed the observation matrix $V$ by stacking all pixels in each frame into each column (Figure 3.6), and then fitted it by the robust PCA model (3.112). Here, the low-rank term and the elementwise
Figure 3.6 The observation matrix $V$ is constructed by stacking all pixels in each frame into each column.
sparse term are expected to capture the static background and the moving foreground, respectively. However, we can also rely on the natural assumption that the pixels in a segment sharing similar intensities tend to belong to the same object. Under this assumption as side information, we can adopt a segmentwise sparse term, for which the PR matrix is constructed based on a precomputed oversegmented image (Figure 3.7). The segmentwise sparse term has been shown to capture the foreground more accurately than the elementwise sparse term in this application. Details will be discussed in Chapter 11.
3.5.4 VB Learning for SAMF
Let us summarize the parameters of the SAMF model (3.119) as follows:

$$\Theta = \{\Theta_A^{(s)}, \Theta_B^{(s)}\}_{s=1}^{S}, \quad \text{where} \quad \Theta_A^{(s)} = \{A^{(k,s)}\}_{k=1}^{K^{(s)}}, \quad \Theta_B^{(s)} = \{B^{(k,s)}\}_{k=1}^{K^{(s)}}.$$

As in the MF model, we assume independent Gaussian noise and priors. Then, the likelihood and the priors are given by

$$p(V|\Theta) \propto \exp\left( -\frac{1}{2\sigma^2} \left\| V - \sum_{s=1}^{S} U^{(s)} \right\|_{\mathrm{Fro}}^2 \right), \tag{3.121}$$
Figure 3.7 Construction of a segmentwise sparse term. The original frame is presegmented, based on which the segmentwise sparse term is constructed as an SMF term.
$$p(\{\Theta_A^{(s)}\}_{s=1}^{S}) \propto \exp\left( -\frac{1}{2} \sum_{s=1}^{S} \sum_{k=1}^{K^{(s)}} \mathrm{tr}\left( A^{(k,s)} C_A^{(k,s)-1} A^{(k,s)\top} \right) \right), \tag{3.122}$$

$$p(\{\Theta_B^{(s)}\}_{s=1}^{S}) \propto \exp\left( -\frac{1}{2} \sum_{s=1}^{S} \sum_{k=1}^{K^{(s)}} \mathrm{tr}\left( B^{(k,s)} C_B^{(k,s)-1} B^{(k,s)\top} \right) \right). \tag{3.123}$$

We assume that the prior covariances of $A^{(k,s)}$ and $B^{(k,s)}$ are diagonal:

$$C_A^{(k,s)} = \mathrm{Diag}(c_{a_1}^{(k,s)2}, \ldots, c_{a_H}^{(k,s)2}), \qquad C_B^{(k,s)} = \mathrm{Diag}(c_{b_1}^{(k,s)2}, \ldots, c_{b_H}^{(k,s)2}).$$

Conditional Conjugacy
As seen in Eq. (3.121), the SAMF model is the MF model for the parameters $(\Theta_A^{(s)}, \Theta_B^{(s)})$ in the $s$th SMF term, if the other parameters $\{\Theta_A^{(s')}, \Theta_B^{(s')}\}_{s' \neq s}$ are regarded as constants. Therefore, the Gaussian priors (3.122) and (3.123) are conditionally conjugate for each of $\Theta_A^{(s)}$ and $\Theta_B^{(s)}$ in each of the SMF terms.

Variational Bayesian Algorithm
Based on the conditional conjugacy, we solve the VB learning problem under the following independence constraint (Babacan et al., 2012b):

$$r(\Theta) = \prod_{s=1}^{S} r_A^{(s)}(\Theta_A^{(s)}) \, r_B^{(s)}(\Theta_B^{(s)}). \tag{3.124}$$
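Following the standard procedure described in Section 2.1.5, we can find that the VB posterior, which minimizes the free energy (2.15), is in the following form:

$$\begin{aligned}
r(\Theta) &= \prod_{s=1}^{S} \prod_{k=1}^{K^{(s)}} \left( \mathrm{MGauss}_{M'^{(k,s)}, H^{(k,s)}}(A^{(k,s)}; \hat{A}^{(k,s)}, \hat{\Sigma}_A^{(k,s)}) \cdot \mathrm{MGauss}_{L'^{(k,s)}, H^{(k,s)}}(B^{(k,s)}; \hat{B}^{(k,s)}, \hat{\Sigma}_B^{(k,s)}) \right) \\
&= \prod_{s=1}^{S} \prod_{k=1}^{K^{(s)}} \left( \prod_{m'=1}^{M'^{(k,s)}} \mathrm{Gauss}_{H^{(k,s)}}(\tilde{a}_{m'}^{(k,s)}; \widehat{\tilde{a}}_{m'}^{(k,s)}, \hat{\Sigma}_A^{(k,s)}) \cdot \prod_{l'=1}^{L'^{(k,s)}} \mathrm{Gauss}_{H^{(k,s)}}(\tilde{b}_{l'}^{(k,s)}; \widehat{\tilde{b}}_{l'}^{(k,s)}, \hat{\Sigma}_B^{(k,s)}) \right)
\end{aligned} \tag{3.125}$$

with the variational parameters satisfying the stationary conditions given by

$$\hat{A}^{(k,s)} = \sigma^{-2} Z'^{(k,s)\top} \hat{B}^{(k,s)} \hat{\Sigma}_A^{(k,s)}, \tag{3.126}$$

$$\hat{\Sigma}_A^{(k,s)} = \sigma^2 \left( \hat{B}^{(k,s)\top} \hat{B}^{(k,s)} + L'^{(k,s)} \hat{\Sigma}_B^{(k,s)} + \sigma^2 C_A^{(k,s)-1} \right)^{-1}, \tag{3.127}$$

$$\hat{B}^{(k,s)} = \sigma^{-2} Z'^{(k,s)} \hat{A}^{(k,s)} \hat{\Sigma}_B^{(k,s)}, \tag{3.128}$$

$$\hat{\Sigma}_B^{(k,s)} = \sigma^2 \left( \hat{A}^{(k,s)\top} \hat{A}^{(k,s)} + M'^{(k,s)} \hat{\Sigma}_A^{(k,s)} + \sigma^2 C_B^{(k,s)-1} \right)^{-1}. \tag{3.129}$$

Here, $Z'^{(k,s)} \in \mathbb{R}^{L'^{(k,s)} \times M'^{(k,s)}}$ is defined as

$$Z'^{(k,s)}_{l',m'} = Z^{(s)}_{X^{(s)}(k,l',m')}, \quad \text{where} \quad Z^{(s)} = V - \sum_{s' \neq s} \hat{U}^{(s')}. \tag{3.130}$$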
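To make the stationary conditions concrete, the following sketch computes the residual $Z^{(s)}$ and then applies Eqs. (3.127), (3.126), (3.129), and (3.128) in the special case of a $1 \times 1$ PR matrix with $H^{(k,s)} = 1$, where all quantities are scalars. This special case, the nested-list representation, and the function names `residual` and `scalar_term_update` are assumptions made only for illustration.

```python
def residual(V, U_hats, s):
    """Z^(s) = V - sum_{s' != s} U_hat^(s') for nested-list matrices."""
    Z = [row[:] for row in V]
    for sp, U in enumerate(U_hats):
        if sp == s:
            continue
        for l in range(len(V)):
            for m in range(len(V[0])):
                Z[l][m] -= U[l][m]
    return Z


def scalar_term_update(z, b_hat, s2_b, c2_a, c2_b, sigma2):
    """One sweep of (3.127), (3.126), (3.129), (3.128) with L' = M' = H = 1."""
    s2_a = sigma2 / (b_hat ** 2 + s2_b + sigma2 / c2_a)   # (3.127)
    a_hat = z * b_hat * s2_a / sigma2                     # (3.126)
    s2_b = sigma2 / (a_hat ** 2 + s2_a + sigma2 / c2_b)   # (3.129)
    b_hat = z * a_hat * s2_b / sigma2                     # (3.128)
    return a_hat, s2_a, b_hat, s2_b
```

Each SMF term sees only the residual left by the other terms, so the sweep over $s$ behaves like a block coordinate descent in which every block is an ordinary MF update.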
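Free Energy as a Function of Variational Parameters
The free energy can be explicitly written as

$$\begin{aligned}
2F = {} & LM \log(2\pi\sigma^2) + \frac{\|V\|_{\mathrm{Fro}}^2}{\sigma^2} + \sum_{s=1}^{S} \sum_{k=1}^{K^{(s)}} \left( M'^{(k,s)} \log \frac{\det(C_A^{(k,s)})}{\det(\hat{\Sigma}_A^{(k,s)})} + L'^{(k,s)} \log \frac{\det(C_B^{(k,s)})}{\det(\hat{\Sigma}_B^{(k,s)})} \right) \\
& + \sum_{s=1}^{S} \sum_{k=1}^{K^{(s)}} \mathrm{tr}\left\{ C_A^{(k,s)-1} \left( \hat{A}^{(k,s)\top} \hat{A}^{(k,s)} + M'^{(k,s)} \hat{\Sigma}_A^{(k,s)} \right) + C_B^{(k,s)-1} \left( \hat{B}^{(k,s)\top} \hat{B}^{(k,s)} + L'^{(k,s)} \hat{\Sigma}_B^{(k,s)} \right) \right\} \\
& + \frac{1}{\sigma^2} \mathrm{tr}\Bigg\{ -2 V^\top \sum_{s=1}^{S} G(\{\hat{B}^{(k,s)} \hat{A}^{(k,s)\top}\}_{k=1}^{K^{(s)}}; X^{(s)}) \\
& \qquad\qquad + 2 \sum_{s=1}^{S} \sum_{s'=s+1}^{S} G^\top(\{\hat{B}^{(k,s)} \hat{A}^{(k,s)\top}\}_{k=1}^{K^{(s)}}; X^{(s)}) \, G(\{\hat{B}^{(k,s')} \hat{A}^{(k,s')\top}\}_{k=1}^{K^{(s')}}; X^{(s')}) \Bigg\} \\
& + \frac{1}{\sigma^2} \sum_{s=1}^{S} \sum_{k=1}^{K^{(s)}} \mathrm{tr}\left\{ \left( \hat{A}^{(k,s)\top} \hat{A}^{(k,s)} + M'^{(k,s)} \hat{\Sigma}_A^{(k,s)} \right) \left( \hat{B}^{(k,s)\top} \hat{B}^{(k,s)} + L'^{(k,s)} \hat{\Sigma}_B^{(k,s)} \right) \right\} \\
& - \sum_{s=1}^{S} \sum_{k=1}^{K^{(s)}} (L'^{(k,s)} + M'^{(k,s)}) H^{(k,s)}.
\end{aligned} \tag{3.131}$$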
Empirical Variational Bayesian Algorithm
The following stationary conditions for the hyperparameters can be obtained from the derivatives of the free energy (3.131):

$$c_{a_h}^{(k,s)2} = \|\hat{a}_h^{(k,s)}\|^2 / M'^{(k,s)} + (\hat{\Sigma}_A^{(k,s)})_{hh}, \tag{3.132}$$

$$c_{b_h}^{(k,s)2} = \|\hat{b}_h^{(k,s)}\|^2 / L'^{(k,s)} + (\hat{\Sigma}_B^{(k,s)})_{hh}, \tag{3.133}$$

$$\begin{aligned}
\sigma^2 = \frac{1}{LM} \Bigg\{ & \|V\|_{\mathrm{Fro}}^2 - 2 \sum_{s=1}^{S} \mathrm{tr}\left( \hat{U}^{(s)\top} \left( V - \sum_{s'=s+1}^{S} \hat{U}^{(s')} \right) \right) \\
& + \sum_{s=1}^{S} \sum_{k=1}^{K^{(s)}} \mathrm{tr}\left( \left( \hat{B}^{(k,s)\top} \hat{B}^{(k,s)} + L'^{(k,s)} \hat{\Sigma}_B^{(k,s)} \right) \cdot \left( \hat{A}^{(k,s)\top} \hat{A}^{(k,s)} + M'^{(k,s)} \hat{\Sigma}_A^{(k,s)} \right) \right) \Bigg\}.
\end{aligned} \tag{3.134}$$

Algorithm 6 EVB learning for sparse additive matrix factorization.
1: Initialize the variational parameters $\{\hat{A}^{(k,s)}, \hat{\Sigma}_A^{(k,s)}, \hat{B}^{(k,s)}, \hat{\Sigma}_B^{(k,s)}\}_{k=1,\, s=1}^{K^{(s)},\, S}$, and the hyperparameters $\{C_A^{(k,s)}, C_B^{(k,s)}\}_{k=1,\, s=1}^{K^{(s)},\, S}$, $\sigma^2$, for example, $\hat{A}^{(k,s)}_{m',h}, \hat{B}^{(k,s)}_{l',h} \sim \mathrm{Gauss}_1(0, \tau)$, $\hat{\Sigma}_A^{(k,s)} = \hat{\Sigma}_B^{(k,s)} = C_A^{(k,s)} = C_B^{(k,s)} = \tau I_{H^{(k,s)}}$, and $\sigma^2 = \tau^2$ for $\tau^2 = \|V\|_{\mathrm{Fro}}^2 / (LM)$.
2: Apply (substitute the right-hand side into the left-hand side) Eqs. (3.127), (3.126), (3.129), and (3.128) for each $k$ and $s$ to update $\hat{\Sigma}_A^{(k,s)}$, $\hat{A}^{(k,s)}$, $\hat{\Sigma}_B^{(k,s)}$, and $\hat{B}^{(k,s)}$, respectively.
3: Apply Eqs. (3.132) and (3.133) for all $h = 1, \ldots, H^{(k,s)}$, $k$ and $s$, and Eq. (3.134) to update $C_A^{(k,s)}$, $C_B^{(k,s)}$, and $\sigma^2$, respectively.
4: Prune the $h$th component if $c_{a_h}^{(k,s)2} c_{b_h}^{(k,s)2} < \varepsilon$, where $\varepsilon > 0$ is a threshold, e.g., set to $\varepsilon = 10^{-4}$.
5: Evaluate the free energy (3.131).
6: Iterate Steps 2 through 5 until convergence (until the energy decrease becomes smaller than a threshold).

Algorithm 6 summarizes the EVB algorithm for SAMF. If we appropriately set the hyperparameters $\{C_A^{(k,s)}, C_B^{(k,s)}\}_{k=1,\, s=1}^{K^{(s)},\, S}$, $\sigma^2$ in Step 1 and skip Steps 3 and 4, Algorithm 6 is reduced to (nonempirical) VB learning.
4 VB Algorithm for Latent Variable Models

In this chapter, we discuss VB learning for latent variable models. Starting with finite mixture models as the simplest example, we overview the VB learning algorithms for more complex latent variable models such as Bayesian networks and hidden Markov models.

Let $\mathcal{H}$ denote the set of (local) latent variables and $w$ denote a model parameter vector (or the set of global latent variables). In this chapter, we consider the latent variable model for training data $\mathcal{D}$:

$$p(\mathcal{D}|w) = \sum_{\mathcal{H}} p(\mathcal{D}, \mathcal{H}|w).$$

Let us employ the following factorized model to approximate the posterior distribution for $w$ and $\mathcal{H}$:

$$r(w, \mathcal{H}) = r_w(w) \, r_{\mathcal{H}}(\mathcal{H}). \tag{4.1}$$

Applying the general VB framework explained in Section 2.1.5 to the preceding model leads to the following update rules for $w$ and $\mathcal{H}$:

$$r_w(w) = \frac{1}{C_w} p(w) \exp \left\langle \log p(\mathcal{D}, \mathcal{H}|w) \right\rangle_{r_{\mathcal{H}}(\mathcal{H})}, \tag{4.2}$$

$$r_{\mathcal{H}}(\mathcal{H}) = \frac{1}{C_{\mathcal{H}}} \exp \left\langle \log p(\mathcal{D}, \mathcal{H}|w) \right\rangle_{r_w(w)}. \tag{4.3}$$
In the following sections, we discuss these update rules for some specific examples of latent variable models.
4.1 Finite Mixture Models
A finite mixture model $p(x|w)$ of an $L$-dimensional input $x \in \mathbb{R}^L$ with a parameter vector $w \in \mathbb{R}^M$ is defined by

$$p(x|w) = \sum_{k=1}^{K} \alpha_k p(x|\tau_k), \tag{4.4}$$

where integer $K$ is the number of components and $\alpha = (\alpha_1, \ldots, \alpha_K)^\top \in \Delta^{K-1}$ is the set of mixing weights (Example 1.3). The parameter $w$ of the model is $w = \{\alpha_k, \tau_k\}_{k=1}^{K}$.

The finite mixture model can be rewritten as follows by using a hidden variable $z = (z_1, \ldots, z_K)^\top \in \{e_1, \ldots, e_K\}$,

$$p(x, z|w) = \prod_{k=1}^{K} \left\{ \alpha_k p(x|\tau_k) \right\}^{z_k}. \tag{4.5}$$

Here $e_k \in \{0, 1\}^K$ is the $K$-dimensional binary vector, called the one-of-$K$ representation, with one at the $k$th entry and zeros at the other entries:

$$e_k = (\underbrace{0, \ldots, 0, \overbrace{1}^{k\text{-th}}, 0, \ldots, 0}_{K})^\top.$$

The hidden variable $z$ is not observed and represents the component from which the data sample $x$ is generated. If the data sample $x$ is from the $k$th component, then $z_k = 1$; otherwise, $z_k = 0$. Then

$$\sum_{z} p(x, z|w) = p(x|w)$$

holds, where the sum over $z$ goes through all possible values of the hidden variable.
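The marginalization over the one-of-$K$ variable can be checked numerically with a small sketch. It assumes one-dimensional unit-variance Gaussian components purely for illustration; the names `unit_gauss`, `one_of_K`, `joint`, and `marginal` are hypothetical.

```python
import math


def unit_gauss(mu):
    """Density of a one-dimensional Gaussian with mean mu and unit variance."""
    return lambda x: math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)


def one_of_K(k, K):
    """One-of-K vector e_k (0-based k) as a tuple of zeros and a single one."""
    return tuple(1 if j == k else 0 for j in range(K))


def joint(x, z, alphas, pdfs):
    """p(x, z | w) = prod_k (alpha_k p(x | tau_k))^{z_k}, as in Eq. (4.5)."""
    p = 1.0
    for z_k, alpha_k, pdf in zip(z, alphas, pdfs):
        p *= (alpha_k * pdf(x)) ** z_k
    return p


def marginal(x, alphas, pdfs):
    """Sum of the joint over all K possible values of z."""
    K = len(alphas)
    return sum(joint(x, one_of_K(k, K), alphas, pdfs) for k in range(K))
```

Because exactly one entry of $z$ equals one, each term of the sum picks out a single factor $\alpha_k p(x|\tau_k)$, and the total reproduces the mixture density (4.4).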
4.1.1 Mixture of Gaussians
If the component distribution in Eq. (4.4) is chosen to be a Gaussian distribution, $p(x|\tau) = \mathrm{Gauss}_L(x; \mu, \Sigma)$, the finite mixture model is called the mixture of Gaussians or the Gaussian mixture model (GMM). In some applications, the parameters are restricted to the means of each component, and it is assumed that there is no correlation between each input dimension. In this case, since $L = M$, the model is written as

$$p(x|w) = \sum_{k=1}^{K} \frac{\alpha_k}{(2\pi\sigma^2)^{M/2}} \exp\left( -\frac{\|x - \mu_k\|^2}{2\sigma^2} \right), \tag{4.6}$$

where $\sigma > 0$ is a constant.

In this subsection, the uncorrelated GMM (4.6) is considered in the VB framework by further assuming that $\sigma^2 = 1$ for simplicity. The joint model for the observed and hidden variables (4.5) is given by the product of the following two distributions:

$$p(z|\alpha) = \mathrm{Multinomial}_{K,1}(z; \alpha), \tag{4.7}$$

$$p(x|z, \{\mu_k\}_{k=1}^{K}) = \prod_{k=1}^{K} \left\{ \mathrm{Gauss}_M(x; \mu_k, I_M) \right\}^{z_k}. \tag{4.8}$$

Thus, for the set of hidden variables $\mathcal{H} = \{z^{(n)}\}_{n=1}^{N}$ and the complete data set $\{\mathcal{D}, \mathcal{H}\} = \{x^{(n)}, z^{(n)}\}_{n=1}^{N}$, the complete likelihood is given by

$$p(\mathcal{D}, \mathcal{H}|\alpha, \{\mu_k\}_{k=1}^{K}) = \prod_{n=1}^{N} \prod_{k=1}^{K} \left\{ \alpha_k \mathrm{Gauss}_M(x^{(n)}; \mu_k, I_M) \right\}^{z_k^{(n)}}. \tag{4.9}$$
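ML learning of the GMM is carried out by the expectation-maximization (EM) algorithm (Dempster et al., 1977), which corresponds to a clustering algorithm called the soft K-means (MacKay, 2003, ch. 22).

Because of the conditional conjugacy (Section 2.1.2) of the parameters $\alpha = (\alpha_1, \ldots, \alpha_K)^\top \in \Delta^{K-1}$ and $\{\mu_k\}_{k=1}^{K}$, we assume that the prior of the parameters is the product of the following two distributions:

$$p(\alpha|\phi) = \mathrm{Dirichlet}_K(\alpha; (\phi, \ldots, \phi)^\top), \tag{4.10}$$

$$p(\{\mu_k\}_{k=1}^{K}|\mu_0, \xi) = \prod_{k=1}^{K} \mathrm{Gauss}_M(\mu_k|\mu_0, (1/\xi) I_M), \tag{4.11}$$

where $\xi > 0$, $\mu_0 \in \mathbb{R}^M$ and $\phi > 0$ are the hyperparameters.

VB Posterior for the Gaussian Mixture Model
Let

$$\overline{N}_k = \sum_{n=1}^{N} \left\langle z_k^{(n)} \right\rangle_{r_{\mathcal{H}}(\mathcal{H})} \tag{4.12}$$

and

$$\overline{x}_k = \frac{1}{\overline{N}_k} \sum_{n=1}^{N} \left\langle z_k^{(n)} \right\rangle_{r_{\mathcal{H}}(\mathcal{H})} x^{(n)}, \tag{4.13}$$

where $z_k^{(n)} = 1$ if the $n$th data sample $x^{(n)}$ is from the $k$th component; otherwise, $z_k^{(n)} = 0$. The variable $\overline{N}_k$ is the expected number of times data come from the $k$th component, and $\overline{x}_k$ is the mean of them. Note that the variables $\overline{N}_k$ and $\overline{x}_k$ satisfy the constraints $\sum_{k=1}^{K} \overline{N}_k = N$ and $\sum_{k=1}^{K} \overline{N}_k \overline{x}_k = \sum_{n=1}^{N} x^{(n)}$. From (4.2) and the respective priors (4.10) and (4.11), the VB posterior $r_w(w) = r_\alpha(\alpha) r_\mu(\{\mu_k\}_{k=1}^{K})$ is obtained as the product of the following two distributions:

$$r_\alpha(\alpha) = \mathrm{Dirichlet}_K\left( \alpha; (\hat{\phi}_1, \ldots, \hat{\phi}_K)^\top \right), \tag{4.14}$$

$$r_\mu(\{\mu_k\}_{k=1}^{K}) = \prod_{k=1}^{K} \mathrm{Gauss}_M\left( \mu_k; \hat{\mu}_k, \hat{\sigma}_k^2 I_M \right), \tag{4.15}$$

where

$$\hat{\phi}_k = \overline{N}_k + \phi, \tag{4.16}$$

$$\hat{\sigma}_k^2 = \frac{1}{\overline{N}_k + \xi}, \tag{4.17}$$

$$\hat{\mu}_k = \frac{\overline{N}_k \overline{x}_k + \xi \mu_0}{\overline{N}_k + \xi}. \tag{4.18}$$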
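The parameter update can be sketched in a few lines. The sketch assumes that the expected assignments $\langle z_k^{(n)} \rangle$ are given as an $N \times K$ nested list `R`, that samples are lists of length $M$, and that every component has $\overline{N}_k > 0$; the function name `gmm_posterior_params` is hypothetical.

```python
def gmm_posterior_params(X, R, phi, mu0, xi):
    """Posterior parameters from Eqs. (4.12), (4.13), and (4.16)-(4.18).

    X: N samples, each a list of length M.
    R: N x K nested list of expected assignments <z_k^(n)>.
    Assumes every component has a positive expected count.
    """
    N, K, M = len(X), len(R[0]), len(X[0])
    N_bar = [sum(R[n][k] for n in range(N)) for k in range(K)]            # (4.12)
    x_bar = [[sum(R[n][k] * X[n][d] for n in range(N)) / N_bar[k]
              for d in range(M)] for k in range(K)]                       # (4.13)
    phi_hat = [N_bar[k] + phi for k in range(K)]                          # (4.16)
    sigma2_hat = [1.0 / (N_bar[k] + xi) for k in range(K)]                # (4.17)
    mu_hat = [[(N_bar[k] * x_bar[k][d] + xi * mu0[d]) / (N_bar[k] + xi)
               for d in range(M)] for k in range(K)]                      # (4.18)
    return N_bar, x_bar, phi_hat, sigma2_hat, mu_hat
```

The posterior mean (4.18) is a weighted average of the soft cluster mean and the prior mean, with the prior acting like $\xi$ pseudo-observations located at $\mu_0$.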
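From Eq. (4.3), the VB posterior $r_{\mathcal{H}}(\mathcal{H})$ is given by

$$r_{\mathcal{H}}(\mathcal{H}) = \frac{1}{C_{\mathcal{H}}} \prod_{n=1}^{N} \exp\left( \sum_{k=1}^{K} z_k^{(n)} \left\{ \Psi(\hat{\phi}_k) - \Psi\left( \sum_{k'=1}^{K} \hat{\phi}_{k'} \right) - \frac{1}{2} \left( M \log 2\pi + \|x^{(n)} - \hat{\mu}_k\|^2 \right) - \frac{M}{2} \frac{1}{\overline{N}_k + \xi} \right\} \right),$$

where $\Psi$ is the di-gamma (psi) function, and we used

$$\left\langle \log \alpha_k \right\rangle_{r_\alpha(\alpha)} = \Psi(\hat{\phi}_k) - \Psi\left( \sum_{k'=1}^{K} \hat{\phi}_{k'} \right). \tag{4.19}$$

That is, $r_{\mathcal{H}}(\mathcal{H}) = r_z(\{z^{(n)}\}_{n=1}^{N})$ is the multinomial distribution:

$$r_z(\{z^{(n)}\}_{n=1}^{N}) = \prod_{n=1}^{N} r_z(z^{(n)}) = \prod_{n=1}^{N} \mathrm{Multinomial}_{K,1}\left( z^{(n)}; \hat{z}^{(n)} \right),$$

where $\hat{z}^{(n)} \in \Delta^{K-1}$ is

$$\hat{z}_k^{(n)} = \left\langle z_k^{(n)} \right\rangle_{r_{\mathcal{H}}(\mathcal{H})} = \frac{\overline{z}_k^{(n)}}{\sum_{k'=1}^{K} \overline{z}_{k'}^{(n)}}, \tag{4.20}$$

for

$$\overline{z}_k^{(n)} = \exp\left( \Psi(\hat{\phi}_k) - \Psi\left( \sum_{k'=1}^{K} \hat{\phi}_{k'} \right) - \frac{1}{2} \left( \|x^{(n)} - \hat{\mu}_k\|^2 + M \hat{\sigma}_k^2 \right) \right). \tag{4.21}$$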
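The assignment update in Eqs. (4.20) and (4.21) can be sketched as follows. The di-gamma helper `digamma` is a hand-written approximation (recurrence plus a standard asymptotic series) included only to keep the sketch self-contained; in practice a library routine would be used. The function name `responsibilities` and the shift by the maximum exponent for numerical stability are choices made for this illustration.

```python
import math


def digamma(x):
    """Approximate Psi(x) for x > 0 (illustrative helper, not a library call)."""
    result = 0.0
    while x < 6.0:          # recurrence Psi(x) = Psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)    # asymptotic series for large x
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0)))


def responsibilities(x, phi_hat, mu_hat, sigma2_hat):
    """Normalized assignments (4.20) for one sample x, using (4.21)."""
    M = len(x)
    psi_total = digamma(sum(phi_hat))
    log_z = []
    for phi_k, mu_k, s2_k in zip(phi_hat, mu_hat, sigma2_hat):
        sq = sum((x_d - m_d) ** 2 for x_d, m_d in zip(x, mu_k))
        log_z.append(digamma(phi_k) - psi_total - 0.5 * (sq + M * s2_k))
    shift = max(log_z)      # a common factor cancels in the ratio (4.20)
    z_bar = [math.exp(v - shift) for v in log_z]
    total = sum(z_bar)
    return [v / total for v in z_bar]
```

Compared with the EM responsibilities, the posterior variance $\hat{\sigma}_k^2$ enters as an extra penalty, so components supported by few samples are down-weighted.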
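The free energy as a function of variational parameters is expressed as follows:

$$\begin{aligned}
F &= \left\langle \log \frac{r_{\mathcal{H}}(\mathcal{H}) r_w(w)}{p(w)} \right\rangle_{r_{\mathcal{H}}(\mathcal{H}) r_w(w)} - \left\langle \log p(\mathcal{D}, \mathcal{H}|w) \right\rangle_{r_{\mathcal{H}}(\mathcal{H}) r_w(w)} \\
&= \left\langle \log \frac{ \prod_{n=1}^{N} \prod_{k=1}^{K} (\hat{z}_k^{(n)})^{z_k^{(n)}} \cdot \frac{\Gamma(\sum_{k=1}^{K} \hat{\phi}_k)}{\prod_{k=1}^{K} \Gamma(\hat{\phi}_k)} \prod_{k=1}^{K} \alpha_k^{\hat{\phi}_k - 1} \cdot \prod_{k=1}^{K} \frac{\exp\left( -\frac{\|\mu_k - \hat{\mu}_k\|^2}{2 \hat{\sigma}_k^2} \right)}{(2\pi \hat{\sigma}_k^2)^{M/2}} }{ \frac{\Gamma(K\phi)}{(\Gamma(\phi))^K} \prod_{k=1}^{K} \alpha_k^{\phi - 1} \cdot \prod_{k=1}^{K} \left( \frac{\xi}{2\pi} \right)^{M/2} \exp\left( -\frac{\xi \|\mu_k - \mu_0\|^2}{2} \right) } \right\rangle_{r_{\mathcal{H}}(\mathcal{H}) r_w(w)} \\
&\quad - \left\langle \log \prod_{n=1}^{N} \prod_{k=1}^{K} \left\{ \alpha_k \frac{\exp\left( -\frac{\|x^{(n)} - \mu_k\|^2}{2} \right)}{(2\pi)^{M/2}} \right\}^{z_k^{(n)}} \right\rangle_{r_{\mathcal{H}}(\mathcal{H}) r_w(w)} \\
&= \log\left( \frac{\Gamma(\sum_{k=1}^{K} \hat{\phi}_k)}{\prod_{k=1}^{K} \Gamma(\hat{\phi}_k)} \right) - \log\left( \frac{\Gamma(K\phi)}{(\Gamma(\phi))^K} \right) - \frac{M}{2} \sum_{k=1}^{K} \log\left( \xi \hat{\sigma}_k^2 \right) - \frac{KM}{2} \\
&\quad + \sum_{n=1}^{N} \sum_{k=1}^{K} \hat{z}_k^{(n)} \log \hat{z}_k^{(n)} + \sum_{k=1}^{K} \left( \hat{\phi}_k - \phi - \overline{N}_k \right) \left( \Psi(\hat{\phi}_k) - \Psi\left( \textstyle\sum_{k'=1}^{K} \hat{\phi}_{k'} \right) \right) \\
&\quad + \sum_{k=1}^{K} \frac{\xi \left( \|\hat{\mu}_k - \mu_0\|^2 + M \hat{\sigma}_k^2 \right)}{2} + \sum_{k=1}^{K} \frac{\overline{N}_k \left( M \log(2\pi) + M \hat{\sigma}_k^2 \right)}{2} \\
&\quad + \sum_{k=1}^{K} \frac{\overline{N}_k \|\overline{x}_k - \hat{\mu}_k\|^2 + \sum_{n=1}^{N} \hat{z}_k^{(n)} \|x^{(n)} - \overline{x}_k\|^2}{2}.
\end{aligned} \tag{4.22}$$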
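The prior hyperparameters, $(\phi, \mu_0, \xi)$, can be estimated by the EVB learning (Section 2.1.6). Computing the partial derivatives, we have

$$\frac{\partial F}{\partial \phi} = K \left( \Psi(\phi) - \Psi(K\phi) \right) - \sum_{k=1}^{K} \left( \Psi(\hat{\phi}_k) - \Psi\left( \textstyle\sum_{k'=1}^{K} \hat{\phi}_{k'} \right) \right), \tag{4.23}$$

$$\frac{\partial F}{\partial \mu_0} = \xi \sum_{k=1}^{K} \left( \mu_0 - \hat{\mu}_k \right), \tag{4.24}$$

$$\frac{\partial F}{\partial \xi} = -\frac{M}{2} \left( \frac{K}{\xi} - \sum_{k=1}^{K} \left( \frac{\|\hat{\mu}_k - \mu_0\|^2}{M} + \hat{\sigma}_k^2 \right) \right). \tag{4.25}$$

The stationary conditions $\frac{\partial F}{\partial \mu_0} = 0$ and $\frac{\partial F}{\partial \xi} = 0$ yield the following update rules:

$$\mu_0 = \frac{1}{K} \sum_{k=1}^{K} \hat{\mu}_k, \tag{4.26}$$

$$\xi = \left\{ \frac{1}{K} \sum_{k=1}^{K} \left( \frac{\|\hat{\mu}_k - \mu_0\|^2}{M} + \hat{\sigma}_k^2 \right) \right\}^{-1}. \tag{4.27}$$

Since the stationary condition $\frac{\partial F}{\partial \phi} = 0$ is not explicitly solved for $\phi$, the Newton–Raphson step is usually used for updating $\phi$. With the second derivative,

$$\frac{\partial^2 F}{\partial \phi^2} = K \left( \Psi^{(1)}(\phi) - K \Psi^{(1)}(K\phi) \right), \tag{4.28}$$

the update rule is obtained as follows:

$$\begin{aligned}
\phi^{\mathrm{new}} &= \max\left( 0, \; \phi^{\mathrm{old}} - \left( \frac{\partial^2 F}{\partial \phi^2} \right)^{-1} \frac{\partial F}{\partial \phi} \right) \\
&= \max\left( 0, \; \phi^{\mathrm{old}} - \frac{K \left( \Psi(\phi) - \Psi(K\phi) \right) - \sum_{k=1}^{K} \left( \Psi(\hat{\phi}_k) - \Psi\left( \sum_{k'=1}^{K} \hat{\phi}_{k'} \right) \right)}{K \left( \Psi^{(1)}(\phi) - K \Psi^{(1)}(K\phi) \right)} \right),
\end{aligned} \tag{4.29}$$

where $\Psi^{(m)}(z) \equiv \frac{d^m}{dz^m} \Psi(z)$ is the polygamma function of order $m$.

The EVB learning for the GMM is summarized in Algorithm 7. If the prior hyperparameters are fixed and Step 3 in the algorithm is omitted, the algorithm reduces to the (nonempirical) VB learning algorithm.

Algorithm 7 EVB learning for the Gaussian mixture model.
1: Initialize the variational parameters $(\{\hat{z}^{(n)}\}_{n=1}^{N}, \{\hat{\phi}_k\}_{k=1}^{K}, \{\hat{\mu}_k, \hat{\sigma}_k^2\}_{k=1}^{K})$, and the hyperparameters $(\phi, \mu_0, \xi)$.
2: Apply (substitute the right-hand side into the left-hand side) Eqs. (4.21), (4.20), (4.12), (4.13), (4.16), (4.17), and (4.18) to update $\{\hat{z}^{(n)}\}_{n=1}^{N}$, $\{\hat{\phi}_k\}_{k=1}^{K}$, and $\{\hat{\mu}_k, \hat{\sigma}_k^2\}_{k=1}^{K}$.
3: Apply Eqs. (4.29), (4.26), and (4.27) to update $\phi$, $\mu_0$ and $\xi$, respectively.
4: Evaluate the free energy (4.22).
5: Iterate Steps 2 through 4 until convergence (until the energy decrease becomes smaller than a threshold).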
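The hyperparameter updates (4.26), (4.27), and (4.29) can be sketched as follows. The helpers `digamma` and `trigamma` are hand-written approximations (recurrence plus standard asymptotic series) included only so that the sketch is self-contained; `newton_step_phi` and `update_mu0_xi` are hypothetical names, and the sketch assumes that the freshly updated $\mu_0$ is plugged into Eq. (4.27).

```python
import math


def digamma(x):
    """Approximate Psi(x) for x > 0 (illustrative helper)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0)))


def trigamma(x):
    """Approximate Psi^(1)(x) for x > 0 (illustrative helper)."""
    result = 0.0
    while x < 6.0:          # recurrence Psi1(x) = Psi1(x + 1) + 1/x^2
        result += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return (result + inv + 0.5 * inv2
            + inv * inv2 * (1.0 / 6.0 - inv2 * (1.0 / 30.0 - inv2 / 42.0)))


def newton_step_phi(phi, phi_hat):
    """One Newton-Raphson step (4.29), built from (4.23) and (4.28)."""
    K = len(phi_hat)
    psi_total = digamma(sum(phi_hat))
    grad = (K * (digamma(phi) - digamma(K * phi))
            - sum(digamma(p) - psi_total for p in phi_hat))
    hess = K * (trigamma(phi) - K * trigamma(K * phi))
    return max(0.0, phi - grad / hess)


def update_mu0_xi(mu_hat, sigma2_hat):
    """Closed-form updates (4.26) and (4.27)."""
    K, M = len(mu_hat), len(mu_hat[0])
    mu0 = [sum(mu_hat[k][d] for k in range(K)) / K for d in range(M)]
    avg = sum(sum((mu_hat[k][d] - mu0[d]) ** 2 for d in range(M)) / M
              + sigma2_hat[k] for k in range(K)) / K
    return mu0, 1.0 / avg
```

The prior mean moves to the average of the posterior means, and the prior precision $\xi$ becomes the inverse of the average per-dimension spread of the posterior means around it.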
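4.1.2 Mixture of Exponential Families
It is well known that the Gaussian distribution is an example of the exponential family distribution:

$$p(x|\tau) = p(t|\eta) = \exp\left( \eta^\top t - A(\eta) + B(t) \right), \tag{4.30}$$

where $\eta \in H$ is the natural parameter, $\eta^\top t$ is its inner product with the vector $t = t(x) = (t_1(x), \ldots, t_M(x))^\top$, and $A(\eta)$ and $B(t)$ are real-valued functions of the parameter $\eta$ and the sufficient statistics $t$, respectively (Brown, 1986) (see Eq. (1.27) in Section 1.2.3). Suppose functions $t_1, \ldots, t_M$ and the constant function, 1, are linearly independent and the number of parameters in a single component distribution, $p(t|\eta)$, is $M$.

The VB framework for GMMs in Section 4.1.1 is generalized to a mixture of exponential family distributions as follows. The conditional conjugate prior distributions of $\alpha \in \Delta^{K-1}$ and $\{\eta_k\}_{k=1}^{K}$ are given by

$$p(\alpha|\phi) = \mathrm{Dirichlet}_K(\alpha; (\phi, \ldots, \phi)^\top), \tag{4.31}$$

$$p(\{\eta_k\}_{k=1}^{K}|\nu_0, \xi) = \prod_{k=1}^{K} \frac{1}{C(\xi, \nu_0)} \exp\left( \xi (\nu_0^\top \eta_k - A(\eta_k)) \right), \tag{4.32}$$

where the function $C(\xi, \nu)$ of $\xi \in \mathbb{R}$ and $\nu \in \mathbb{R}^M$ is defined by

$$C(\xi, \nu) = \int_{H} \exp\left( \xi (\nu^\top \eta - A(\eta)) \right) d\eta. \tag{4.33}$$

Constants $\xi > 0$, $\nu_0 \in \mathbb{R}^M$, and $\phi > 0$ are the hyperparameters.

VB Posterior for Mixture-of-Exponential-Family Models
Here, we derive the VB posterior $r_w(w)$ for the mixture-of-exponential-family model using Eq. (4.2). Using the complete data $\{x^{(n)}, z^{(n)}\}_{n=1}^{N}$, we put

$$\overline{N}_k = \sum_{n=1}^{N} \left\langle z_k^{(n)} \right\rangle_{r_{\mathcal{H}}(\mathcal{H})}, \tag{4.34}$$

$$\overline{t}_k = \frac{1}{\overline{N}_k} \sum_{n=1}^{N} \left\langle z_k^{(n)} \right\rangle_{r_{\mathcal{H}}(\mathcal{H})} t^{(n)}, \tag{4.35}$$

where $t^{(n)} = t(x^{(n)})$. Note that the variables $\overline{N}_k$ and $\overline{t}_k$ satisfy the constraints $\sum_{k=1}^{K} \overline{N}_k = N$ and $\sum_{k=1}^{K} \overline{N}_k \overline{t}_k = \sum_{n=1}^{N} t^{(n)}$. From Eq. (4.2) and the respective prior distributions, Eqs. (4.10) and (4.32), the VB posterior $r_w(w) = r_\alpha(\alpha) r_\eta(\{\eta_k\}_{k=1}^{K})$ is obtained as the product of the following two distributions:

$$r_\alpha(\alpha) = \mathrm{Dirichlet}_K\left( \alpha; (\hat{\phi}_1, \ldots, \hat{\phi}_K)^\top \right),$$

$$r_\eta(\{\eta_k\}_{k=1}^{K}) = \prod_{k=1}^{K} \frac{1}{C(\hat{\xi}_k, \hat{\nu}_k)} \exp\left( \hat{\xi}_k (\hat{\nu}_k^\top \eta_k - A(\eta_k)) \right), \tag{4.36}$$

where

$$\hat{\phi}_k = \overline{N}_k + \phi, \tag{4.37}$$

$$\hat{\nu}_k = \frac{\overline{N}_k \overline{t}_k + \xi \nu_0}{\overline{N}_k + \xi}, \tag{4.38}$$

$$\hat{\xi}_k = \overline{N}_k + \xi. \tag{4.39}$$

Let

$$\hat{\eta}_k = \left\langle \eta_k \right\rangle_{r_\eta(\eta_k)} = \frac{1}{\hat{\xi}_k} \frac{\partial \log C(\hat{\xi}_k, \hat{\nu}_k)}{\partial \hat{\nu}_k}. \tag{4.40}$$

It follows that

$$\left\langle A(\eta_k) \right\rangle_{r_\eta(\eta_k)} = \hat{\nu}_k^\top \hat{\eta}_k - \frac{\partial \log C(\hat{\xi}_k, \hat{\nu}_k)}{\partial \hat{\xi}_k}. \tag{4.41}$$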
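As a consistency check, the general expressions can be specialized to the unit-variance Gaussian component of Section 4.1.1. The identification below is an assumption made for this worked example: $t(x) = x$, $\eta = \mu$, $A(\eta) = \|\eta\|^2/2$, $B(t) = -\|t\|^2/2 - (M/2)\log 2\pi$, and $H = \mathbb{R}^M$.

```latex
% Normalization constant (4.33) for A(\eta) = \|\eta\|^2 / 2:
C(\xi, \nu) = \int_{\mathbb{R}^M} \exp\left( \xi \left( \nu^\top \eta - \tfrac{1}{2}\|\eta\|^2 \right) \right) d\eta
            = \left( \frac{2\pi}{\xi} \right)^{M/2} \exp\left( \frac{\xi \|\nu\|^2}{2} \right),
\qquad
\log C(\xi, \nu) = \frac{M}{2} \log \frac{2\pi}{\xi} + \frac{\xi \|\nu\|^2}{2}.

% Posterior mean (4.40):
\hat{\eta}_k = \frac{1}{\hat{\xi}_k} \frac{\partial \log C(\hat{\xi}_k, \hat{\nu}_k)}{\partial \hat{\nu}_k}
             = \hat{\nu}_k
             = \frac{\overline{N}_k \overline{x}_k + \xi \nu_0}{\overline{N}_k + \xi}.

% Expected log-partition function (4.41):
\left\langle A(\eta_k) \right\rangle_{r_\eta(\eta_k)}
  = \|\hat{\nu}_k\|^2 - \left( -\frac{M}{2\hat{\xi}_k} + \frac{\|\hat{\nu}_k\|^2}{2} \right)
  = \frac{1}{2} \left( \|\hat{\nu}_k\|^2 + \frac{M}{\hat{\xi}_k} \right).
```

With $\nu_0 = \mu_0$, the first line shows that the prior (4.32) is $\mathrm{Gauss}_M(\eta_k; \nu_0, (1/\xi) I_M)$, the second recovers $\hat{\mu}_k$ in Eq. (4.18), and $1/\hat{\xi}_k$ coincides with $\hat{\sigma}_k^2$ in Eq. (4.17), so the third line equals $(\|\hat{\mu}_k\|^2 + M \hat{\sigma}_k^2)/2$, as expected for a Gaussian posterior.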
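From Eq. (4.3), the VB posterior $r_{\mathcal{H}}(\mathcal{H})$ is given by

$$r_{\mathcal{H}}(\mathcal{H}) = \prod_{n=1}^{N} r_z(z^{(n)}) = \prod_{n=1}^{N} \mathrm{Multinomial}_{K,1}\left( z^{(n)}; \hat{z}^{(n)} \right),$$

where $\hat{z}^{(n)} \in \Delta^{K-1}$ is

$$\hat{z}_k^{(n)} = \left\langle z_k^{(n)} \right\rangle_{r_{\mathcal{H}}(\mathcal{H})} = \frac{\overline{z}_k^{(n)}}{\sum_{k'=1}^{K} \overline{z}_{k'}^{(n)}}, \tag{4.42}$$

for

$$\overline{z}_k^{(n)} = \exp\left( \Psi(\hat{\phi}_k) - \Psi\left( \sum_{k'=1}^{K} \hat{\phi}_{k'} \right) + \hat{\eta}_k^\top t^{(n)} - \left\langle A(\eta_k) \right\rangle_{r_\eta(\eta_k)} + B(t^{(n)}) \right). \tag{4.43}$$

To obtain the preceding expression of $\overline{z}_k^{(n)}$, we used Eq. (4.19).

The free energy as a function of variational parameters is expressed as

$$\begin{aligned}
F = {} & \log\left( \frac{\Gamma(\sum_{k=1}^{K} \hat{\phi}_k)}{\prod_{k=1}^{K} \Gamma(\hat{\phi}_k)} \right) - \log\left( \frac{\Gamma(K\phi)}{(\Gamma(\phi))^K} \right) - \sum_{k=1}^{K} \log C(\hat{\xi}_k, \hat{\nu}_k) + K \log C(\xi, \nu_0) \\
& + \sum_{n=1}^{N} \sum_{k=1}^{K} \hat{z}_k^{(n)} \log \hat{z}_k^{(n)} + \sum_{k=1}^{K} \left( \hat{\phi}_k - \phi - \overline{N}_k \right) \left( \Psi(\hat{\phi}_k) - \Psi\left( \textstyle\sum_{k'=1}^{K} \hat{\phi}_{k'} \right) \right) \\
& + \sum_{k=1}^{K} \left[ \hat{\eta}_k^\top \left\{ \xi \left( \hat{\nu}_k - \nu_0 \right) + \overline{N}_k \left( \hat{\nu}_k - \overline{t}_k \right) \right\} + \left( \hat{\xi}_k - \xi - \overline{N}_k \right) \frac{\partial \log C(\hat{\xi}_k, \hat{\nu}_k)}{\partial \hat{\xi}_k} \right] \\
& - \sum_{n=1}^{N} B(t^{(n)}).
\end{aligned} \tag{4.44}$$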
The update rule of $\phi$ for the EVB learning is obtained by Eq. (4.29) as in the GMM. The partial derivatives of $F$ with respect to the hyperparameters $(\nu_0, \xi)$ are

$$\frac{\partial F}{\partial \nu_0} = K \frac{\partial \log C(\xi, \nu_0)}{\partial \nu_0} - \xi \sum_{k=1}^{K} \hat{\eta}_k, \tag{4.45}$$

$$\frac{\partial F}{\partial \xi} = \sum_{k=1}^{K} \left\{ \hat{\eta}_k^\top \left( \hat{\nu}_k - \nu_0 \right) - \frac{\partial \log C(\hat{\xi}_k, \hat{\nu}_k)}{\partial \hat{\xi}_k} \right\} + K \frac{\partial \log C(\xi, \nu_0)}{\partial \xi}. \tag{4.46}$$

Equating these derivatives to zeros, we have the following stationary conditions:

$$\frac{1}{\xi} \frac{\partial \log C(\xi, \nu_0)}{\partial \nu_0} = \frac{1}{K} \sum_{k=1}^{K} \hat{\eta}_k, \tag{4.47}$$

$$\frac{\partial \log C(\xi, \nu_0)}{\partial \xi} = \nu_0^\top \left( \frac{1}{K} \sum_{k=1}^{K} \hat{\eta}_k \right) - \frac{1}{K} \sum_{k=1}^{K} \left\langle A(\eta_k) \right\rangle_{r_\eta(\eta_k)}, \tag{4.48}$$

where we have used Eq. (4.41). If these equations are solved for $\nu_0$ and $\xi$, respectively, we obtain their update rules as in the case of the GMM. Otherwise, we need the Newton–Raphson steps to update them. The EVB learning for the mixture of exponential families is summarized in Algorithm 8. If the prior hyperparameters are fixed and Step 3 in the algorithm is omitted, the algorithm reduces to the (nonempirical) VB learning algorithm.

Algorithm 8 EVB learning for the mixture-of-exponential-family model.
1: Initialize the variational parameters $(\{\hat{z}^{(n)}\}_{n=1}^{N}, \{\hat{\phi}_k\}_{k=1}^{K}, \{\hat{\nu}_k, \hat{\xi}_k\}_{k=1}^{K})$, and the hyperparameters $(\phi, \nu_0, \xi)$.
2: Apply (substitute the right-hand side into the left-hand side) Eqs. (4.43), (4.42), (4.34), (4.35), (4.37), (4.38), and (4.39) to update $\{\hat{z}^{(n)}\}_{n=1}^{N}$, $\{\hat{\phi}_k\}_{k=1}^{K}$, and $\{\hat{\nu}_k, \hat{\xi}_k\}_{k=1}^{K}$. Transform $\{\hat{\nu}_k, \hat{\xi}_k\}_{k=1}^{K}$ to $\{\hat{\eta}_k, \langle A(\eta_k) \rangle_{r_\eta(\eta_k)}\}_{k=1}^{K}$ by Eqs. (4.40) and (4.41).
3: Apply Eqs. (4.29), (4.47), and (4.48) to update $\phi$, $\nu_0$ and $\xi$, respectively.
4: Evaluate the free energy (4.44).
5: Iterate Steps 2 through 4 until convergence (until the energy decrease becomes smaller than a threshold).
4.1.3 Infinite Mixture Models
In the 2000s, there was a revival of Bayesian nonparametric models to estimate the model complexity, e.g., the number of components in mixture models, by using a prior distribution for probability measures such as the Dirichlet process (DP) prior. The Bayesian nonparametric approach fits a single model adapting its complexity to the data. The VB framework plays an important role in achieving tractable inference for Bayesian nonparametric models. Here, we introduce the VB learning for the stick-breaking construction of the DP prior by instantiating the estimation of the number of components of the mixture model.

For the finite mixture model with $K$ components, we had the discrete latent variable, $z \in \{e_1, e_2, \ldots, e_K\}$, indicating the label of the component. We also assumed the multinomial distribution,

$$p(z|\alpha) = \mathrm{Multinomial}_{K,1}(z; \alpha) = \prod_{k=1}^{K} \alpha_k^{z_k}.$$

In the nonparametric Bayesian approach, we consider possibly an infinite number of components,

$$p(z|\alpha) = \lim_{K \to \infty} \mathrm{Multinomial}_{K,1}(z; \alpha),$$

and the following generation process of $\alpha_k$, called the stick-breaking process (Blei and Jordan, 2005; Gershman and Blei, 2012):

$$\alpha_k = v_k \prod_{l=1}^{k-1} (1 - v_l), \qquad v_k \sim \mathrm{Beta}(1, \gamma),$$

where $\mathrm{Beta}(\alpha, \beta)$ denotes the beta distribution with parameters $\alpha$ and $\beta$, and $\gamma$ is the scaling parameter. To derive a tractable VB learning algorithm, the truncation level $T$ is usually introduced to the preceding process, which enforces $v_T = 1$. If the truncation level $T$ is sufficiently large, some components are left unused, and hence $T$ does not directly specify the number of components.

Then, the VB posterior $r(\mathcal{H}, v)$ for the latent variables and $v = \{v_k\}_{k=1}^{T-1}$ is assumed to be factorized, $r_{\mathcal{H}}(\mathcal{H}) r_v(v)$, for which the free energy minimization implies further factorization:

$$r(\mathcal{H}, v) = \prod_{n=1}^{N} r_z(z^{(n)}) \prod_{k=1}^{T-1} r_v(v_k),$$
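where $r_z(z^{(n)})$ is the multinomial distribution as in the case of the finite mixture model, and $r_v(v_k)$ is the beta distribution because of the conditional conjugacy. To see this and how the VB learning algorithm is derived, we instantiate the GMM discussed in Section 4.1.1. The free energy is decomposed as

$$\begin{aligned}
F = {} & \left\langle \log \frac{r_z(\{z^{(n)}\}_{n=1}^{N}) \, r_v(v) \, r_\mu(\{\mu_k\}_{k=1}^{K})}{p(v) \, p(\{\mu_k\}_{k=1}^{T})} \right\rangle_{r_z(\{z^{(n)}\}_{n=1}^{N}) r_v(v) r_\mu(\{\mu_k\}_{k=1}^{K})} \\
& - \left\langle \log p(\mathcal{D}|\{z^{(n)}\}_{n=1}^{N}, w) \right\rangle_{r_z(\{z^{(n)}\}_{n=1}^{N}) r_v(v) r_\mu(\{\mu_k\}_{k=1}^{K})} \\
& - \left\langle \log p(\{z^{(n)}\}_{n=1}^{N}|v) \right\rangle_{r_z(\{z^{(n)}\}_{n=1}^{N}) r_v(v)}.
\end{aligned}$$

These terms are computed in the same way as in Section 4.1.1 except for the last term,

$$\left\langle \log p(\{z^{(n)}\}_{n=1}^{N}|v) \right\rangle_{r_z(\{z^{(n)}\}_{n=1}^{N}) r_v(v)} = \sum_{n=1}^{N} \left\langle \log p(z^{(n)}|v) \right\rangle_{r_z(z^{(n)}) r_v(v)}.$$

Let $c^{(n)}$ be the index $k$ such that $z_k^{(n)} = 1$ and $\theta$ be the indicator function. Then, we have

$$\begin{aligned}
\left\langle \log p(z^{(n)}|v) \right\rangle_{r_z(z^{(n)}) r_v(v)} &= \left\langle \log \prod_{k=1}^{\infty} (1 - v_k)^{\theta(c^{(n)} > k)} v_k^{\theta(c^{(n)} = k)} \right\rangle_{r_z(z^{(n)}) r_v(v)} \\
&= \sum_{k=1}^{\infty} \left\{ r_z(c^{(n)} > k) \left\langle \log(1 - v_k) \right\rangle_{r_v(v)} + r_z(c^{(n)} = k) \left\langle \log v_k \right\rangle_{r_v(v)} \right\} \\
&= \sum_{k=1}^{T-1} \left\{ r_z(c^{(n)} > k) \left\langle \log(1 - v_k) \right\rangle_{r_v(v)} + r_z(c^{(n)} = k) \left\langle \log v_k \right\rangle_{r_v(v)} \right\},
\end{aligned}$$

where we have used $\log v_T = 0$ and $r_z(c^{(n)} > T) = 0$. Since the probabilities $r_z(c^{(n)} = k)$ and $r_z(c^{(n)} > k)$ are given by

$$r_z(c^{(n)} = k) = \hat{z}_k^{(n)}, \qquad r_z(c^{(n)} > k) = \sum_{l=k+1}^{T} \hat{z}_l^{(n)},$$

it follows from Eq. (4.12) that

$$\sum_{n=1}^{N} r_z(c^{(n)} = k) = \overline{N}_k, \qquad \sum_{n=1}^{N} r_z(c^{(n)} > k) = \sum_{l=k+1}^{T} \overline{N}_l = N - \sum_{l=1}^{k} \overline{N}_l.$$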
Now Eq. (4.3) in this case yields that

$$r_v(v) \propto \exp\left( \sum_{n=1}^{N} \left\langle \log p(z^{(n)}|v) \right\rangle_{r_z(z^{(n)})} \right) p(v).$$

It follows from similar manipulations to those just mentioned and the conditional conjugacy that

$$r_v(v) = \prod_{k=1}^{T-1} \mathrm{Beta}(v_k; \hat{\kappa}_k, \hat{\lambda}_k), \tag{4.49}$$

i.e., for a fixed $r_{\mathcal{H}}(\mathcal{H})$, the optimal $r_v(v)$ is the beta distribution with the parameters,

$$\hat{\kappa}_k = 1 + \overline{N}_k, \tag{4.50}$$

$$\hat{\lambda}_k = \gamma + N - \sum_{l=1}^{k} \overline{N}_l. \tag{4.51}$$
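The VB posterior $r_H(H)$ is computed similarly except that the expectation $\langle \log \alpha_k \rangle_{r_\alpha(\alpha)} = \Psi(\overline{N}_k + \phi) - \Psi(N + K\phi)$ in Eq. (4.19) for the finite mixture model is replaced by
\[
\langle \log v_k \rangle_{r_v(v_k)} + \sum_{l=1}^{k-1} \langle \log(1 - v_l) \rangle_{r_v(v_l)}
= \Psi(\hat{\kappa}_k) - \Psi(\hat{\kappa}_k + \hat{\lambda}_k) + \sum_{l=1}^{k-1} \left\{ \Psi(\hat{\lambda}_l) - \Psi(\hat{\kappa}_l + \hat{\lambda}_l) \right\},
\]
since
\[
\langle \log v_k \rangle_{r_v(v_k)} = \Psi(\hat{\kappa}_k) - \Psi(\hat{\kappa}_k + \hat{\lambda}_k), \qquad
\langle \log(1 - v_k) \rangle_{r_v(v_k)} = \Psi(\hat{\lambda}_k) - \Psi(\hat{\kappa}_k + \hat{\lambda}_k),
\]
for $r_v(v_k) = \mathrm{Beta}(v_k; \hat{\kappa}_k, \hat{\lambda}_k)$. In the case of the GMM, $\overline{z}_k^{(n)}$ in Eq. (4.21) is replaced with
\[
\overline{z}_k^{(n)} = \exp\left( \Psi(\hat{\kappa}_k) - \Psi(\hat{\kappa}_k + \hat{\lambda}_k) + \sum_{l=1}^{k-1} \left\{ \Psi(\hat{\lambda}_l) - \Psi(\hat{\kappa}_l + \hat{\lambda}_l) \right\} - \frac{1}{2}\left( \|x^{(n)} - \hat{\mu}_k\|^2 + M\hat{\sigma}_k^2 \right) \right). \tag{4.52}
\]
The free energy is given by
\[
\begin{aligned}
F ={}& \sum_{k=1}^{T-1} \log\left( \frac{\Gamma(\hat{\kappa}_k + \hat{\lambda}_k)}{\Gamma(\hat{\kappa}_k)\Gamma(\hat{\lambda}_k)} \right) - (T-1)\log\gamma - \frac{M}{2}\sum_{k=1}^{T} \log\left(\xi\hat{\sigma}_k^2\right) - \frac{TM}{2}
+ \sum_{n=1}^{N}\sum_{k=1}^{T} \hat{z}_k^{(n)} \log \hat{z}_k^{(n)} \\
&+ \sum_{k=1}^{T-1} \left\{ \left(\hat{\kappa}_k - 1 - \overline{N}_k\right)\left(\Psi(\hat{\kappa}_k) - \Psi(\hat{\kappa}_k + \hat{\lambda}_k)\right)
+ \left(\hat{\lambda}_k - \gamma - \left(N - \sum_{l=1}^{k} \overline{N}_l\right)\right)\left(\Psi(\hat{\lambda}_k) - \Psi(\hat{\kappa}_k + \hat{\lambda}_k)\right) \right\} \\
&+ \sum_{k=1}^{T} \frac{\xi\left(\|\hat{\mu}_k - \mu_0\|^2 + M\hat{\sigma}_k^2\right)}{2}
+ \sum_{k=1}^{T} \frac{\overline{N}_k\left(M\log(2\pi) + M\hat{\sigma}_k^2\right)}{2} \\
&+ \sum_{k=1}^{T} \frac{\overline{N}_k \|\overline{x}_k - \hat{\mu}_k\|^2 + \sum_{n=1}^{N} \hat{z}_k^{(n)} \|x^{(n)} - \overline{x}_k\|^2}{2}.
\end{aligned}
\tag{4.53}
\]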
The VB learning algorithm is similar to Algorithm 7 for the finite GMM while the number of components $K$ is replaced with the truncation level $T$ throughout, the update rule (4.21) is replaced with Eq. (4.52), and $\{\hat{\kappa}_k, \hat{\lambda}_k\}_{k=1}^{T-1}$ are updated by Eqs. (4.50) and (4.51) instead of $\{\hat{\phi}_k\}_{k=1}^{K}$. By computing $\partial F / \partial \gamma$ and equating it to zero, the EVB learning for the hyperparameter $\gamma$ updates it as follows:
\[
\gamma = \left[ \frac{-1}{T-1} \sum_{k=1}^{T-1} \left\{ \Psi(\hat{\lambda}_k) - \Psi(\hat{\kappa}_k + \hat{\lambda}_k) \right\} \right]^{-1}, \tag{4.54}
\]
which can replace the update rule of $\phi$ in Step 3 of Algorithm 7.
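The stick-breaking updates in Eqs. (4.50), (4.51), and (4.54) can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the function names, the example counts, and the hand-rolled `digamma` approximation (recurrence plus an asymptotic series, standing in for a library routine) are all assumptions made for this sketch.

```python
import math


def digamma(x):
    """Digamma function for x > 0 (recurrence plus asymptotic series; an
    illustrative stand-in for a library routine)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 * inv - series


def update_beta_params(n_bar, gamma):
    """Eqs. (4.50)-(4.51): Beta parameters of v_1, ..., v_{T-1}."""
    total = sum(n_bar)
    kappa, lam, cumulative = [], [], 0.0
    for count in n_bar[:-1]:          # v_T = 1 needs no parameters
        cumulative += count
        kappa.append(1.0 + count)
        lam.append(gamma + total - cumulative)
    return kappa, lam


def expected_log_weights(kappa, lam):
    """<log v_k> + sum_{l<k} <log(1 - v_l)> for k = 1, ..., T (log v_T = 0)."""
    weights, tail = [], 0.0
    for k_hat, l_hat in zip(kappa, lam):
        norm = digamma(k_hat + l_hat)
        weights.append(digamma(k_hat) - norm + tail)
        tail += digamma(l_hat) - norm
    weights.append(tail)              # last component: only (1 - v_l) factors
    return weights


def update_gamma(kappa, lam):
    """Eq. (4.54): empirical Bayes update of the scaling parameter."""
    mean = sum(digamma(l) - digamma(k + l) for k, l in zip(kappa, lam)) / len(kappa)
    return -1.0 / mean


n_bar = [5.0, 3.0, 2.0, 0.0]          # hypothetical expected counts, T = 4
kappa, lam = update_beta_params(n_bar, gamma=1.0)
log_w = expected_log_weights(kappa, lam)
new_gamma = update_gamma(kappa, lam)
```

The values in `log_w` are the quantities that enter the exponent of Eq. (4.52) in place of the Dirichlet expectation of the finite mixture model.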
4.2 Other Latent Variable Models
In this section, we discuss more complex latent variable models than mixture models and derive VB learning algorithms for them. Although we focus on the models where the multinomial distribution is assumed on the observed data given latent variables, it is straightforward to replace it with other members of the exponential family.
4.2.1 Bayesian Networks
A Bayesian network is a probabilistic model defined by a graphical model expressing the relations among random variables by a graph and the conditional probabilities associated with them (Jensen, 2001). In this subsection, we focus on a Bayesian network whose states of all hidden nodes influence those of all observation nodes, and assume that it has $M$ observation nodes and $K$ hidden nodes. The graphical structure of this Bayesian network is called bipartite and presented in Figure 4.1.

Figure 4.1 Graphical structure of the Bayesian network.

The observation nodes are denoted by $x = (x_1, \ldots, x_M)$, and the set of states of observation node $x_j = (x_{j,1}, \ldots, x_{j,Y_j})^\top \in \{e_l\}_{l=1}^{Y_j}$ is $\{1, \ldots, Y_j\}$. The hidden nodes are denoted by $z = (z_1, \ldots, z_K)$, and the set of states of hidden node $z_k = (z_{k,1}, \ldots, z_{k,T_k})^\top \in \{e_i\}_{i=1}^{T_k}$ is $\{1, \ldots, T_k\}$. The probability that the state of the hidden node $z_k$ is $i$ $(1 \le i \le T_k)$ is expressed as
\[
a_{(k,i)} = \mathrm{Prob}(z_k = e_i).
\]
Then, $a_k = (a_{(k,1)}, \ldots, a_{(k,T_k)})^\top \in \Delta^{T_k - 1}$ for $k = 1, \ldots, K$. The conditional probability that the $j$th observation node $x_j$ is $l$ $(1 \le l \le Y_j)$, given the condition that the states of hidden nodes are $z = (z_1, \ldots, z_K)$, is denoted by
\[
b_{(j,l|z)} = \mathrm{Prob}(x_j = e_l | z).
\]
Then, $b_{j|z} = (b_{(j,1|z)}, \ldots, b_{(j,Y_j|z)})^\top \in \Delta^{Y_j - 1}$ for $j = 1, \ldots, M$. Define $b_z = \{b_{j|z}\}_{j=1}^{M}$ for $z \in Z = \{z;\ z_k \in \{e_i\}_{i=1}^{T_k},\ k = 1, \ldots, K\}$. Let $w = \{\{a_k\}_{k=1}^{K}, \{b_z\}_{z \in Z}\}$ be the set of all parameters. Then, the joint probability that the states of observation nodes are $x = (x_1, \ldots, x_M)$ and the states of hidden nodes are $z = (z_1, \ldots, z_K)$ is
\[
p(x, z|w) = p(x|b_z) \prod_{k=1}^{K} \prod_{i=1}^{T_k} a_{(k,i)}^{z_{k,i}},
\]
where
\[
p(x|b_z) = \prod_{j=1}^{M} \prod_{l=1}^{Y_j} b_{(j,l|z)}^{x_{j,l}}.
\]
Therefore, the marginal probability that the states of observation nodes are $x$ is
\[
p(x|w) = \sum_{z \in Z} p(x, z|w) = \sum_{z \in Z} p(x|b_z) \prod_{k=1}^{K} \prod_{i=1}^{T_k} a_{(k,i)}^{z_{k,i}}, \tag{4.55}
\]
where we used the notation $\sum_{z \in Z}$ for the summation over all states of hidden nodes. Let
\[
M_{\mathrm{obs}} = \sum_{j=1}^{M} (Y_j - 1),
\]
which is the number of parameters to specify the conditional probability $p(x|b_z)$ of the states of all the observation nodes given the states of the hidden nodes. Then, the number of the parameters of the model, $D$, is
\[
D = M_{\mathrm{obs}} \prod_{k=1}^{K} T_k + \sum_{k=1}^{K} (T_k - 1). \tag{4.56}
\]
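As a quick check of Eq. (4.56), the parameter count can be computed directly from the numbers of states. This is only an illustrative sketch; the function name and the example sizes are assumptions, not taken from the text.

```python
def num_parameters(obs_states, hidden_states):
    """Eq. (4.56): D = M_obs * prod_k T_k + sum_k (T_k - 1).

    obs_states    -- list of Y_j, the number of states of each observation node
    hidden_states -- list of T_k, the number of states of each hidden node
    """
    m_obs = sum(y - 1 for y in obs_states)       # free parameters per hidden configuration
    num_configs = 1
    for t in hidden_states:                      # |Z| = prod_k T_k hidden configurations
        num_configs *= t
    return m_obs * num_configs + sum(t - 1 for t in hidden_states)


# Hypothetical network: two observation nodes (2 and 3 states), two binary hidden nodes.
d = num_parameters([2, 3], [2, 2])
```

Here $M_{\mathrm{obs}} = 1 + 2 = 3$, there are $2 \times 2 = 4$ hidden configurations, and the hidden-node probabilities contribute $1 + 1 = 2$ free parameters, so `d` equals 14. The product over $T_k$ shows why the table $\{b_z\}_{z \in Z}$ grows exponentially with the number of hidden nodes.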
We assume that the prior distribution $p(w)$ of the parameters $w = \{\{a_k\}_{k=1}^{K}, \{b_z\}_{z \in Z}\}$ is the conditional conjugate prior distribution. Then, $p(w)$ is given by $\left\{\prod_{k=1}^{K} p(a_k|\phi)\right\}\left\{\prod_{z \in Z} \prod_{j=1}^{M} p(b_{j|z}|\xi)\right\}$, where
\[
p(a_k|\phi) = \mathrm{Dirichlet}_{T_k}\left(a_k; (\phi, \ldots, \phi)^\top\right), \tag{4.57}
\]
\[
p(b_{j|z}|\xi) = \mathrm{Dirichlet}_{Y_j}\left(b_{j|z}; (\xi, \ldots, \xi)^\top\right), \tag{4.58}
\]
are the Dirichlet distributions with hyperparameters $\phi > 0$ and $\xi > 0$.

VB Posterior for Bayesian Networks
Let $\{D, H\}$ be the complete data with the observed data set $D = \{x^{(n)}\}_{n=1}^{N}$ and the corresponding hidden variables $H = \{z^{(n)}\}_{n=1}^{N}$. Define the expected sufficient statistics:
\[
\overline{N}^{z}_{(k,i_k)} = \sum_{n=1}^{N} \left\langle z^{(n)}_{k,i_k} \right\rangle_{r_H(H)},
\]
\[
\overline{N}^{x}_{(j,l_j|z)} = \sum_{n=1}^{N} x^{(n)}_{j,l_j}\, r_z(z^{(n)} = z), \tag{4.59}
\]
where
\[
r_z(z^{(n)} = z) = \left\langle \prod_{k=1}^{K} z^{(n)}_{k,i_k} \right\rangle_{r_H(H)} \tag{4.60}
\]
is the estimated probability that $z^{(n)} = z = (e_{i_1}, \ldots, e_{i_K})$. Here $x^{(n)}_j$ indicates the state of the $j$th observation node and $z^{(n)}_k$ indicates the state of the $k$th hidden node when the $n$th training sample is observed. From Eq. (4.2), the VB posterior distribution of parameters $w = \{\{a_k\}_{k=1}^{K}, \{b_z\}_{z \in Z}\}$ is given by
\[
r_w(w) = \left\{ \prod_{k=1}^{K} r_a(a_k) \right\} \left\{ \prod_{z \in Z} \prod_{j=1}^{M} r_b(b_{j|z}) \right\},
\]
\[
r_a(a_k) = \mathrm{Dirichlet}_{T_k}\left(a_k; \hat{\phi}_k\right), \tag{4.61}
\]
\[
r_b(b_{j|z}) = \mathrm{Dirichlet}_{Y_j}\left(b_{j|z}; \hat{\xi}_{j|z}\right), \tag{4.62}
\]
where
\[
\hat{\phi}_k = (\hat{\phi}_{(k,1)}, \ldots, \hat{\phi}_{(k,T_k)})^\top \quad (k = 1, \ldots, K), \qquad
\hat{\phi}_{(k,i)} = \overline{N}^{z}_{(k,i)} + \phi \quad (i = 1, \ldots, T_k), \tag{4.63}
\]
\[
\hat{\xi}_{j|z} = (\hat{\xi}_{(j,1|z)}, \ldots, \hat{\xi}_{(j,Y_j|z)})^\top \quad (j = 1, \ldots, M,\ z \in Z), \qquad
\hat{\xi}_{(j,l|z)} = \overline{N}^{x}_{(j,l|z)} + \xi \quad (l = 1, \ldots, Y_j). \tag{4.64}
\]
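Note that if we define
\[
\overline{N}^{x}_{z} = \sum_{n=1}^{N} r_z(z^{(n)} = z) = \sum_{n=1}^{N} \left\langle \prod_{k=1}^{K} z^{(n)}_{k,i_k} \right\rangle_{r_H(H)},
\]
for $z = (e_{i_1}, \ldots, e_{i_K}) \in Z$, we have
\[
\overline{N}^{x}_{z} = \sum_{l=1}^{Y_j} \overline{N}^{x}_{(j,l|z)}, \tag{4.65}
\]
for $j = 1, \ldots, M$, and
\[
\overline{N}^{z}_{(k,i)} = \sum_{z_{-k}} \overline{N}^{x}_{z}, \tag{4.66}
\]
where $\sum_{z_{-k}}$ denotes the summation over $z_{k'}$ $(k' \neq k)$ other than $z_k = e_i$. It follows from Eqs. (4.61) and (4.62) that
\[
\left\langle \log a_{(k,i)} \right\rangle_{r_a(a_k)} = \Psi(\hat{\phi}_{(k,i)}) - \Psi\left( \sum_{i'=1}^{T_k} \hat{\phi}_{(k,i')} \right) \quad (i = 1, \ldots, T_k),
\]
for $k = 1, \ldots, K$ and
\[
\left\langle \log b_{(j,l|z)} \right\rangle_{r_b(b_{j|z})} = \Psi(\hat{\xi}_{(j,l|z)}) - \Psi\left( \sum_{l'=1}^{Y_j} \hat{\xi}_{(j,l'|z)} \right) \quad (l = 1, \ldots, Y_j),
\]
for $j = 1, \ldots, M$. From Eq. (4.3), the VB posterior distribution of the hidden variables is given by $r_H(H) = \prod_{n=1}^{N} r_z(z^{(n)})$, where for $z = (e_{i_1}, \ldots, e_{i_K})$,
\[
\begin{aligned}
r_z(z^{(n)} = z) &= \sum_{z^{(n)} \in Z} r_z(z^{(n)}) \prod_{k=1}^{K} z^{(n)}_{k,i_k} \\
&\propto \exp\left( \sum_{k=1}^{K} \left\{ \Psi(\hat{\phi}_{(k,i_k)}) - \Psi\left( \sum_{i'_k=1}^{T_k} \hat{\phi}_{(k,i'_k)} \right) \right\}
+ \sum_{j=1}^{M} \left\{ \Psi(\hat{\xi}_{(j,l_j^{(n)}|z)}) - \Psi\left( \sum_{l'=1}^{Y_j} \hat{\xi}_{(j,l'|z)} \right) \right\} \right),
\end{aligned}
\tag{4.67}
\]
if $x^{(n)}_j = e_{l_j^{(n)}}$.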
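The VB algorithm updates $\{\overline{N}^{x}_{(j,l_j|z)}\}$ using Eqs. (4.59) and (4.67) iteratively. The other expected sufficient statistics and variational parameters are computed by Eqs. (4.65), (4.66) and Eqs. (4.63), (4.64), respectively. The free energy as a function of the variational parameters is given by
\[
\begin{aligned}
F ={}& \sum_{k=1}^{K} \left\{ \log\left( \frac{\Gamma\left(\sum_{i=1}^{T_k} \hat{\phi}_{(k,i)}\right)}{\prod_{i=1}^{T_k} \Gamma(\hat{\phi}_{(k,i)})} \right) - \log\left( \frac{\Gamma(T_k\phi)}{(\Gamma(\phi))^{T_k}} \right)
+ \sum_{i=1}^{T_k} \left( \hat{\phi}_{(k,i)} - \phi - \overline{N}^{z}_{(k,i)} \right) \left( \Psi(\hat{\phi}_{(k,i)}) - \Psi\left( \sum_{i'=1}^{T_k} \hat{\phi}_{(k,i')} \right) \right) \right\} \\
&+ \sum_{z \in Z} \sum_{j=1}^{M} \left\{ \log\left( \frac{\Gamma\left(\sum_{l=1}^{Y_j} \hat{\xi}_{(j,l|z)}\right)}{\prod_{l=1}^{Y_j} \Gamma(\hat{\xi}_{(j,l|z)})} \right) - \log\left( \frac{\Gamma(Y_j\xi)}{(\Gamma(\xi))^{Y_j}} \right)
+ \sum_{l=1}^{Y_j} \left( \hat{\xi}_{(j,l|z)} - \xi - \overline{N}^{x}_{(j,l|z)} \right) \left( \Psi(\hat{\xi}_{(j,l|z)}) - \Psi\left( \sum_{l'=1}^{Y_j} \hat{\xi}_{(j,l'|z)} \right) \right) \right\} \\
&+ \sum_{n=1}^{N} \sum_{z \in Z} r_z(z^{(n)} = z) \log r_z(z^{(n)} = z).
\end{aligned}
\tag{4.68}
\]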
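The following update rule for the EVB learning of the hyperparameter $\phi$ is obtained in the same way as the update rule (4.29) for the GMM:
\[
\phi^{\mathrm{new}} = \max\left( 0,\ \phi^{\mathrm{old}} - \frac{\sum_{k=1}^{K} \left\{ T_k\left(\Psi(\phi) - \Psi(T_k\phi)\right) - \sum_{i=1}^{T_k} \left( \Psi(\hat{\phi}_{(k,i)}) - \Psi\left( \sum_{i'=1}^{T_k} \hat{\phi}_{(k,i')} \right) \right) \right\}}{\sum_{k=1}^{K} T_k \left( \Psi^{(1)}(\phi) - T_k \Psi^{(1)}(T_k\phi) \right)} \right). \tag{4.69}
\]
Similarly, we obtain the following update rule of the hyperparameter $\xi$:
\[
\xi^{\mathrm{new}} = \max\left( 0,\ \xi^{\mathrm{old}} - \frac{\sum_{z \in Z} \sum_{j=1}^{M} \left\{ Y_j\left(\Psi(\xi) - \Psi(Y_j\xi)\right) - \sum_{l=1}^{Y_j} \left( \Psi(\hat{\xi}_{(j,l|z)}) - \Psi\left( \sum_{l'=1}^{Y_j} \hat{\xi}_{(j,l'|z)} \right) \right) \right\}}{\left( \prod_{k=1}^{K} T_k \right) \sum_{j=1}^{M} Y_j \left( \Psi^{(1)}(\xi) - Y_j \Psi^{(1)}(Y_j\xi) \right)} \right). \tag{4.70}
\]
Let
\[
\hat{S} = \left\{ r_z(z^{(n)} = z) \right\}_{n=1, z \in Z}^{N} = \left\{ \left\langle \prod_{k=1}^{K} z^{(n)}_{k,i_k} \right\rangle_{r_H(H)} \right\}_{n=1, z \in Z}^{N},
\]
$\hat{\Phi} = \{\hat{\phi}_k\}_{k=1}^{K}$, and $\hat{\Xi} = \{\hat{\xi}_{j|z}\}_{j=1, z \in Z}^{M}$ be the sets of variational parameters. The EVB learning for the Bayesian network is summarized in Algorithm 9. If the prior hyperparameters are fixed and Step 3 in the algorithm is omitted, the algorithm reduces to the (nonempirical) VB learning algorithm.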
Algorithm 9 EVB learning for the Bayesian network.
1: Initialize the variational parameters $(\hat{S}, \hat{\Phi}, \hat{\Xi})$ and the hyperparameters $(\phi, \xi)$.
2: Apply (substitute the right-hand side into the left-hand side) Eqs. (4.67), (4.59), (4.65), (4.66), (4.63), and (4.64) to update $\hat{S}$, $\hat{\Phi}$, and $\hat{\Xi}$.
3: Apply Eqs. (4.69) and (4.70) to update $\phi$ and $\xi$, respectively.
4: Evaluate the free energy (4.68).
5: Iterate Steps 2 through 4 until convergence (until the energy decrease becomes smaller than a threshold).

4.2.2 Hidden Markov Models
Hidden Markov models (HMMs) have been widely used for sequence modeling in speech recognition, natural language processing, and so on (Rabiner, 1989). In this subsection, we consider discrete HMMs. Suppose a sequence $D = (x^{(1)}, \ldots, x^{(T)})$ was observed. Each $x^{(t)}$ is an $M$-dimensional binary vector ($M$-valued finite alphabet):
\[
x^{(t)} = (x_1^{(t)}, \ldots, x_M^{(t)})^\top \in \{e_1, \ldots, e_M\},
\]
where if the output symbol at time $t$ is $m$, then $x_m^{(t)} = 1$, and otherwise 0. Moreover, $x^{(t)}$ is produced in a $K$-valued discrete hidden state $z^{(t)}$. The sequence of hidden states $H = (z^{(1)}, \ldots, z^{(T)})$ is generated by a first-order Markov process. Similarly, $z^{(t)}$ is represented by a $K$-dimensional binary vector
\[
z^{(t)} = (z_1^{(t)}, \ldots, z_K^{(t)})^\top \in \{e_1, \ldots, e_K\},
\]
where if the hidden state at time $t$ is $k$, then $z_k^{(t)} = 1$, and otherwise 0. Without loss of generality, we assume that the initial state ($t = 1$) is the first one, namely $z_1^{(1)} = 1$. Then, the probability of a sequence is given by
\[
p(D|w) = \sum_{H} \prod_{m=1}^{M} b_{1,m}^{x_m^{(1)}} \prod_{t=2}^{T} \left\{ \prod_{k=1}^{K} \prod_{l=1}^{K} a_{k,l}^{z_l^{(t)} z_k^{(t-1)}} \prod_{k=1}^{K} \prod_{m=1}^{M} b_{k,m}^{z_k^{(t)} x_m^{(t)}} \right\}, \tag{4.71}
\]
where $\sum_{H}$ is taken over all possible values of the hidden variables, and the model parameters, $w = (A, B)$, consist of the state transition probability matrix $A = (\tilde{a}_1, \ldots, \tilde{a}_K)^\top$ and the emission probability matrix $B = (\tilde{b}_1, \ldots, \tilde{b}_K)^\top$ satisfying $\tilde{a}_k = (a_{k,1}, \ldots, a_{k,K})^\top \in \Delta^{K-1}$ and $\tilde{b}_k = (b_{k,1}, \ldots, b_{k,M})^\top \in \Delta^{M-1}$ for $1 \le k \le K$, respectively. $a_{k,l}$ represents the transition probability from the $k$th hidden state to the $l$th hidden state and $b_{k,m}$ is the emission probability that alphabet $m$ is produced in the $k$th hidden state. Figure 4.2 illustrates an example of the state transition diagram of an HMM.
Figure 4.2 State transition diagram of an HMM.
The log-likelihood of the HMM for a sequence of complete data $\{D, H\}$ is defined by
\[
\log p(D, H|w) = \sum_{t=2}^{T} \sum_{k=1}^{K} \sum_{l=1}^{K} z_l^{(t)} z_k^{(t-1)} \log a_{k,l} + \sum_{t=1}^{T} \sum_{k=1}^{K} \sum_{m=1}^{M} z_k^{(t)} x_m^{(t)} \log b_{k,m}.
\]
We assume that the prior distributions of the transition probability matrix $A$ and the emission probability matrix $B$ are the Dirichlet distributions with hyperparameters $\phi > 0$ and $\xi > 0$:
\[
p(A|\phi) = \prod_{k=1}^{K} \mathrm{Dirichlet}_K\left(\tilde{a}_k; (\phi, \ldots, \phi)^\top\right), \tag{4.72}
\]
\[
p(B|\xi) = \prod_{k=1}^{K} \mathrm{Dirichlet}_M\left(\tilde{b}_k; (\xi, \ldots, \xi)^\top\right). \tag{4.73}
\]
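VB Posterior for HMMs
We define the expected sufficient statistics by
\[
\overline{N}_k = \sum_{t=1}^{T} \left\langle z_k^{(t)} \right\rangle_{r_H(H)}, \tag{4.74}
\]
\[
\overline{N}^{[z]}_{k,l} = \sum_{t=2}^{T} \left\langle z_l^{(t)} z_k^{(t-1)} \right\rangle_{r_H(H)}, \tag{4.75}
\]
\[
\overline{N}^{[x]}_{k,m} = \sum_{t=1}^{T} \left\langle z_k^{(t)} \right\rangle_{r_H(H)} x_m^{(t)}, \tag{4.76}
\]
where the expected count $\overline{N}_k$ is constrained by $\overline{N}_k = \sum_{l} \overline{N}^{[z]}_{k,l}$. Then, the VB posterior distribution of parameters $r_w(w)$ is given by
\[
r_A(A) = \prod_{k=1}^{K} \mathrm{Dirichlet}_K\left(\tilde{a}_k; (\hat{\phi}_{k,1}, \ldots, \hat{\phi}_{k,K})^\top\right),
\]
\[
r_B(B) = \prod_{k=1}^{K} \mathrm{Dirichlet}_M\left(\tilde{b}_k; (\hat{\xi}_{k,1}, \ldots, \hat{\xi}_{k,M})^\top\right),
\]
where
\[
\hat{\phi}_{k,l} = \overline{N}^{[z]}_{k,l} + \phi, \tag{4.77}
\]
\[
\hat{\xi}_{k,m} = \overline{N}^{[x]}_{k,m} + \xi. \tag{4.78}
\]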
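The posterior distribution of hidden variables $r_H(H)$ is given by
\[
r_H(H) = \frac{1}{C_H} \exp\left( \sum_{t=2}^{T} \sum_{k=1}^{K} \sum_{l=1}^{K} z_l^{(t)} z_k^{(t-1)} \left\langle \log a_{k,l} \right\rangle_{r_A(A)} + \sum_{t=1}^{T} \sum_{k=1}^{K} \sum_{m=1}^{M} z_k^{(t)} x_m^{(t)} \left\langle \log b_{k,m} \right\rangle_{r_B(B)} \right), \tag{4.79}
\]
where $C_H$ is the normalizing constant and
\[
\left\langle \log a_{k,l} \right\rangle_{r_A(A)} = \Psi(\hat{\phi}_{k,l}) - \Psi\left( \sum_{l'=1}^{K} \hat{\phi}_{k,l'} \right),
\]
\[
\left\langle \log b_{k,m} \right\rangle_{r_B(B)} = \Psi(\hat{\xi}_{k,m}) - \Psi\left( \sum_{m'=1}^{M} \hat{\xi}_{k,m'} \right).
\]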
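The expected sufficient statistics $\langle z_k^{(t)} \rangle_{r_H(H)}$ and $\langle z_l^{(t)} z_k^{(t-1)} \rangle_{r_H(H)}$ in Eqs. (4.74) through (4.76) can be efficiently computed in the order of $O(T)$ by the forward–backward algorithm (Beal, 2003). This algorithm can also compute $C_H$. Thus, after the substitution of Eq. (4.3), the free energy is given by
\[
\begin{aligned}
F ={}& \sum_{k=1}^{K} \left\{ \log\left( \frac{\Gamma\left(\sum_{l=1}^{K} \hat{\phi}_{k,l}\right)}{\prod_{l=1}^{K} \Gamma(\hat{\phi}_{k,l})} \right) + \sum_{l=1}^{K} \left( \hat{\phi}_{k,l} - \phi \right) \left( \Psi(\hat{\phi}_{k,l}) - \Psi\left( \sum_{l'=1}^{K} \hat{\phi}_{k,l'} \right) \right) \right. \\
&\left. \qquad + \log\left( \frac{\Gamma\left(\sum_{m=1}^{M} \hat{\xi}_{k,m}\right)}{\prod_{m=1}^{M} \Gamma(\hat{\xi}_{k,m})} \right) + \sum_{m=1}^{M} \left( \hat{\xi}_{k,m} - \xi \right) \left( \Psi(\hat{\xi}_{k,m}) - \Psi\left( \sum_{m'=1}^{M} \hat{\xi}_{k,m'} \right) \right) \right\} \\
&- K \log\left( \frac{\Gamma(K\phi)}{(\Gamma(\phi))^{K}} \right) - K \log\left( \frac{\Gamma(M\xi)}{(\Gamma(\xi))^{M}} \right) - \log C_H.
\end{aligned}
\tag{4.80}
\]
The following update rule for the EVB learning of the hyperparameter $\phi$ is obtained in the same way as the update rule (4.29) for the GMM:
\[
\phi^{\mathrm{new}} = \max\left( 0,\ \phi^{\mathrm{old}} - \frac{K^2\left(\Psi(\phi) - \Psi(K\phi)\right) - \sum_{k=1}^{K} \sum_{l=1}^{K} \left( \Psi(\hat{\phi}_{k,l}) - \Psi\left( \sum_{l'=1}^{K} \hat{\phi}_{k,l'} \right) \right)}{K^2\left( \Psi^{(1)}(\phi) - K\Psi^{(1)}(K\phi) \right)} \right). \tag{4.81}
\]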
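The hyperparameter update (4.81) is a single Newton–Raphson step on the free energy: the numerator is $\partial F / \partial \phi$ and the denominator is $\partial^2 F / \partial \phi^2$. A minimal sketch follows; the function names and the example values are illustrative assumptions, and `digamma` and `trigamma` are hand-rolled approximations (recurrence plus asymptotic series) standing in for library routines.

```python
import math


def digamma(x):
    """Digamma function for x > 0 (illustrative approximation)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 * inv - series


def trigamma(x):
    """Trigamma function (derivative of digamma) for x > 0."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return result + inv + 0.5 * inv2 + inv * inv2 * (1.0 / 6.0 - inv2 * (1.0 / 30.0 - inv2 / 42.0))


def evb_phi_step(phi, phi_hat):
    """One Newton-Raphson step of Eq. (4.81); phi_hat is the K x K table of
    Dirichlet parameters from Eq. (4.77)."""
    k = len(phi_hat)
    grad = k * k * (digamma(phi) - digamma(k * phi))
    for row in phi_hat:
        row_norm = digamma(sum(row))
        grad -= sum(digamma(value) - row_norm for value in row)
    hess = k * k * (trigamma(phi) - k * trigamma(k * phi))
    return max(0.0, phi - grad / hess)


# With zero expected counts the posterior equals the prior, so phi is a fixed point.
unchanged = evb_phi_step(0.5, [[0.5, 0.5], [0.5, 0.5]])
# Hypothetical concentrated transition counts pull phi toward a sparser prior.
smaller = evb_phi_step(0.5, [[10.5, 0.5], [0.5, 10.5]])
```

The same pattern applies to the other scalar hyperparameter updates of this section, with the multiplicities $K^2$ and $K$ replaced by the ones appearing in the corresponding equation.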
Algorithm 10 EVB learning for the hidden Markov model.
1: Initialize the variational parameters $(\hat{S}, \hat{\Phi}, \hat{\Xi})$, and the hyperparameters $(\phi, \xi)$.
2: Apply the forward–backward algorithm to $r_H(H)$ in Eq. (4.79) and compute $C_H$.
3: Apply (substitute the right-hand side into the left-hand side) Eqs. (4.74), (4.75), (4.76), (4.77), and (4.78) to update $\hat{\Phi}$ and $\hat{\Xi}$.
4: Apply Eqs. (4.81) and (4.82) to update $\phi$ and $\xi$, respectively.
5: Evaluate the free energy (4.80).
6: Iterate Steps 2 through 5 until convergence (until the energy decrease becomes smaller than a threshold).
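Similarly, we obtain the following update rule of the hyperparameter $\xi$:
\[
\xi^{\mathrm{new}} = \max\left( 0,\ \xi^{\mathrm{old}} - \frac{KM\left(\Psi(\xi) - \Psi(M\xi)\right) - \sum_{k=1}^{K} \sum_{m=1}^{M} \left( \Psi(\hat{\xi}_{k,m}) - \Psi\left( \sum_{m'=1}^{M} \hat{\xi}_{k,m'} \right) \right)}{KM\left( \Psi^{(1)}(\xi) - M\Psi^{(1)}(M\xi) \right)} \right). \tag{4.82}
\]
Let
\[
\hat{S} = \left\{ \left\{ \left\langle z_k^{(t)} \right\rangle_{r_H(H)} \right\}_{k=1}^{K},\ \left\{ \left\langle z_l^{(t)} z_k^{(t-1)} \right\rangle_{r_H(H)} \right\}_{k,l=1}^{K} \right\}_{t=1}^{T},
\]
$\hat{\Phi} = \{\hat{\phi}_{k,l}\}_{k,l=1}^{K}$, and $\hat{\Xi} = \{\hat{\xi}_{k,m}\}_{k,m=1}^{K,M}$ be the sets of variational parameters. The EVB learning for the HMM is summarized in Algorithm 10. If the prior hyperparameters are fixed, and Step 4 in the algorithm is omitted, the algorithm reduces to the (nonempirical) VB learning algorithm.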
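Steps 2 and 3 of Algorithm 10 can be sketched as a scaled forward–backward pass in which the transition and emission probabilities are replaced by $\exp\langle \log a_{k,l} \rangle$ and $\exp\langle \log b_{k,m} \rangle$. The code below is an illustrative sketch under stated assumptions, not the authors' implementation: observations are given as symbol indices, the initial state is fixed to the first state as in the text, all function names and example values are hypothetical, and `digamma` is a hand-rolled approximation.

```python
import math


def digamma(x):
    """Digamma function for x > 0 (illustrative approximation)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 * inv - series


def exp_expected_log(rows):
    """exp<log p> under a Dirichlet posterior for each row of parameters."""
    return [[math.exp(digamma(v) - digamma(sum(row))) for v in row] for row in rows]


def vb_forward_backward(obs, phi_hat, xi_hat):
    """Return (N^[z], N^[x], log C_H) for r_H(H) in Eq. (4.79)."""
    a, b = exp_expected_log(phi_hat), exp_expected_log(xi_hat)
    T, K, M = len(obs), len(phi_hat), len(xi_hat[0])
    alpha, scale = [], []
    for t, x in enumerate(obs):                       # scaled forward pass
        if t == 0:                                    # initial state is state 0
            cur = [b[0][x] if k == 0 else 0.0 for k in range(K)]
        else:
            cur = [sum(alpha[t - 1][k] * a[k][l] for k in range(K)) * b[l][x]
                   for l in range(K)]
        c = sum(cur)
        scale.append(c)
        alpha.append([v / c for v in cur])
    beta = [[1.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):                    # scaled backward pass
        x_next = obs[t + 1]
        beta[t] = [sum(a[k][l] * b[l][x_next] * beta[t + 1][l] for l in range(K))
                   / scale[t + 1] for k in range(K)]
    n_z = [[0.0] * K for _ in range(K)]
    n_x = [[0.0] * M for _ in range(K)]
    for t, x in enumerate(obs):
        for k in range(K):
            n_x[k][x] += alpha[t][k] * beta[t][k]     # Eq. (4.76)
            if t > 0:
                for l in range(K):                    # Eq. (4.75): k at t-1, l at t
                    n_z[l][k] += (alpha[t - 1][l] * a[l][k] * b[k][x]
                                  * beta[t][k] / scale[t])
    return n_z, n_x, sum(math.log(c) for c in scale)


obs = [0, 0, 1, 1, 0]                                 # hypothetical symbol sequence
n_z, n_x, log_c = vb_forward_backward(obs, [[2.0, 1.0], [1.0, 2.0]],
                                      [[3.0, 1.0], [1.0, 3.0]])
```

Adding $\phi$ and $\xi$ to `n_z` and `n_x` gives the updated Dirichlet parameters of Eqs. (4.77) and (4.78), and `log_c` is the $\log C_H$ term of the free energy (4.80). Because the exponentiated expectations sum to less than one in each row, `log_c` is negative.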
4.2.3 Probabilistic Context-Free Grammars
In this subsection, we discuss probabilistic context-free grammars (PCFGs), which have been used for more complex sequence modeling applications than those with the Markov assumption in natural language processing, bioinformatics, and so on (Durbin et al., 1998). Without loss of generality, we can assume that the grammar is written in Chomsky normal form. Let the model have $K$ nonterminal symbols and $M$ terminal symbols. The observation sequence of length $L$ is written as $X = (x^{(1)}, \ldots, x^{(L)}) \in \{e_1, \ldots, e_M\}^{L}$. Then, the statistical model is defined by
\[
p(X|w) = \sum_{Z \in T(X)} p(X, Z|w), \tag{4.83}
\]
\[
p(X, Z|w) = \prod_{i,j,k=1}^{K} \left( a_{i \to jk} \right)^{c^{Z}_{i \to jk}} \prod_{l=1}^{L} \prod_{i=1}^{K} \prod_{m=1}^{M} \left( b_{i \to m} \right)^{\tilde{z}_i^{(l)} x_m^{(l)}},
\]
\[
w = \{\{a_i\}_{i=1}^{K}, \{b_i\}_{i=1}^{K}\}, \qquad
a_i = \{a_{i \to jk}\}_{j,k=1}^{K} \ (1 \le i \le K), \qquad
b_i = \{b_{i \to m}\}_{m=1}^{M} \ (1 \le i \le K),
\]
where $T(X)$ is the set of derivation sequences that generate $X$, and $Z$ corresponds to a tree structure representing a derivation sequence. Figure 4.3 illustrates an example of the derivation tree of a PCFG model.

Figure 4.3 Derivation tree of PCFG. A and B are nonterminal symbols and a and b are terminal symbols.

The derivation sequence is summarized by $\{c^{Z}_{i \to jk}\}_{i,j,k=1}^{K}$ and $\{\tilde{z}^{(l)}\}_{l=1}^{L}$, where $c^{Z}_{i \to jk}$ denotes the count of the transition rule from the nonterminal symbol $i$ to the pair of nonterminal symbols $(j, k)$ appearing in the derivation sequence $Z$ and $\tilde{z}^{(l)} = (\tilde{z}_1^{(l)}, \ldots, \tilde{z}_K^{(l)})$ is the indicator of the (nonterminal) symbol generating the $l$th output symbol of $X$. Moreover, the parameter $a_{i \to jk}$ represents the probability that the nonterminal symbol $i$ emits the pair of nonterminal symbols $(j, k)$ and $b_{i \to m}$ represents the probability that the nonterminal symbol $i$ emits the terminal symbol $m$. The parameters, $\{\{a_i\}_{i=1}^{K}, \{b_i\}_{i=1}^{K}\}$, have constraints
\[
a_{i \to ii} = 1 - \sum_{(j,k) \neq (i,i)} a_{i \to jk}, \qquad b_{i \to M} = 1 - \sum_{m=1}^{M-1} b_{i \to m},
\]
i.e., $a_i \in \Delta^{K^2 - 1}$ and $b_i \in \Delta^{M-1}$, respectively. Let $D = \{X^{(1)}, \ldots, X^{(N)}\}$ be a given training corpus and $H = \{Z^{(1)}, \ldots, Z^{(N)}\}$ be the corresponding hidden derivation sequences. The log-likelihood for the complete sample $\{D, H\}$ is given by
\[
\log p(D, H|w) = \sum_{n=1}^{N} \left[ \sum_{i,j,k=1}^{K} c^{Z^{(n)}}_{i \to jk} \log a_{i \to jk} + \sum_{l=1}^{L} \sum_{i=1}^{K} \sum_{m=1}^{M} \tilde{z}_i^{(n,l)} x_m^{(n,l)} \log b_{i \to m} \right],
\]
where $x^{(n,l)}$ and $\tilde{z}^{(n,l)}$ are the indicators of the $l$th output symbol and the nonterminal symbol generating the $l$th output in the $n$th sequences, $X^{(n)}$ and $Z^{(n)}$, respectively. We now turn to the VB learning for PCFGs (Kurihara and Sato, 2004). We assume that the prior distributions of parameters $\{a_i\}_{i=1}^{K}$ and $\{b_i\}_{i=1}^{K}$ are the Dirichlet distributions with hyperparameters $\phi$ and $\xi$:
\[
p(\{a_i\}_{i=1}^{K}|\phi) = \prod_{i=1}^{K} \mathrm{Dirichlet}_{K^2}\left(a_i; (\phi, \ldots, \phi)^\top\right), \tag{4.84}
\]
\[
p(\{b_i\}_{i=1}^{K}|\xi) = \prod_{i=1}^{K} \mathrm{Dirichlet}_{M}\left(b_i; (\xi, \ldots, \xi)^\top\right). \tag{4.85}
\]
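VB Posterior for PCFGs
We define the expected sufficient statistics as follows:
\[
\overline{N}^{z}_{i \to jk} = \sum_{n=1}^{N} \left\langle c^{Z^{(n)}}_{i \to jk} \right\rangle_{r_z(Z^{(n)})}, \tag{4.86}
\]
\[
\overline{N}^{z}_{i} = \sum_{j,k=1}^{K} \overline{N}^{z}_{i \to jk},
\]
\[
\overline{N}^{x}_{i \to m} = \sum_{n=1}^{N} \sum_{l=1}^{L} \left\langle \tilde{z}_i^{(n,l)} \right\rangle_{r_z(Z^{(n)})} x_m^{(n,l)}, \tag{4.87}
\]
\[
\overline{N}^{x}_{i} = \sum_{m=1}^{M} \overline{N}^{x}_{i \to m}.
\]
Here the count $c^{Z^{(n)}}_{i \to jk}$ already refers to the whole derivation $Z^{(n)}$, so Eq. (4.86) sums over the sequences $n$ only. Then the VB posteriors of the parameters are given by
\[
r_w(w) = r_a(\{a_i\}_{i=1}^{K})\, r_b(\{b_i\}_{i=1}^{K}),
\]
\[
r_a(\{a_i\}_{i=1}^{K}) = \prod_{i=1}^{K} \mathrm{Dirichlet}_{K^2}\left(a_i; (\hat{\phi}_{i \to 11}, \ldots, \hat{\phi}_{i \to KK})^\top\right), \tag{4.88}
\]
\[
r_b(\{b_i\}_{i=1}^{K}) = \prod_{i=1}^{K} \mathrm{Dirichlet}_{M}\left(b_i; (\hat{\xi}_{i \to 1}, \ldots, \hat{\xi}_{i \to M})^\top\right), \tag{4.89}
\]
where
\[
\hat{\phi}_{i \to jk} = \overline{N}^{z}_{i \to jk} + \phi, \tag{4.90}
\]
\[
\hat{\xi}_{i \to m} = \overline{N}^{x}_{i \to m} + \xi. \tag{4.91}
\]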
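The VB posteriors of the hidden variables are given by
\[
r_H(H) = \prod_{n=1}^{N} r_z(Z^{(n)}), \qquad
r_z(Z^{(n)}) = \frac{1}{C_{Z^{(n)}}} \exp\left( \gamma_{Z^{(n)}} \right), \tag{4.92}
\]
\[
\gamma_{Z^{(n)}} = \sum_{i,j,k=1}^{K} c^{Z^{(n)}}_{i \to jk} \left\langle \log a_{i \to jk} \right\rangle_{r_a(\{a_i\}_{i=1}^{K})}
+ \sum_{l=1}^{L} \sum_{i=1}^{K} \sum_{m=1}^{M} \tilde{z}_i^{(n,l)} x_m^{(n,l)} \left\langle \log b_{i \to m} \right\rangle_{r_b(\{b_i\}_{i=1}^{K})},
\]
where $C_{Z^{(n)}} = \sum_{Z \in T(X^{(n)})} \exp(\gamma_Z)$ is the normalizing constant and
\[
\left\langle \log a_{i \to jk} \right\rangle_{r_a(\{a_i\}_{i=1}^{K})} = \Psi(\hat{\phi}_{i \to jk}) - \Psi\left( \sum_{j'=1}^{K} \sum_{k'=1}^{K} \hat{\phi}_{i \to j'k'} \right),
\]
\[
\left\langle \log b_{i \to m} \right\rangle_{r_b(\{b_i\}_{i=1}^{K})} = \Psi(\hat{\xi}_{i \to m}) - \Psi\left( \sum_{m'=1}^{M} \hat{\xi}_{i \to m'} \right).
\]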
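All the expected sufficient statistics and $C_{Z^{(n)}}$ can be efficiently computed by the inside–outside algorithm (Kurihara and Sato, 2004). The free energy, after the substitution of Eq. (4.3), is given by
\[
\begin{aligned}
F ={}& \sum_{i=1}^{K} \left\{ \log\left( \frac{\Gamma\left(\sum_{j,k=1}^{K} \hat{\phi}_{i \to jk}\right)}{\prod_{j,k=1}^{K} \Gamma(\hat{\phi}_{i \to jk})} \right) + \sum_{j,k=1}^{K} \left( \hat{\phi}_{i \to jk} - \phi \right) \left( \Psi(\hat{\phi}_{i \to jk}) - \Psi\left( \sum_{j',k'=1}^{K} \hat{\phi}_{i \to j'k'} \right) \right) \right. \\
&\left. \qquad + \log\left( \frac{\Gamma\left(\sum_{m=1}^{M} \hat{\xi}_{i \to m}\right)}{\prod_{m=1}^{M} \Gamma(\hat{\xi}_{i \to m})} \right) + \sum_{m=1}^{M} \left( \hat{\xi}_{i \to m} - \xi \right) \left( \Psi(\hat{\xi}_{i \to m}) - \Psi\left( \sum_{m'=1}^{M} \hat{\xi}_{i \to m'} \right) \right) \right\} \\
&- K \log\left( \frac{\Gamma(K^2\phi)}{(\Gamma(\phi))^{K^2}} \right) - K \log\left( \frac{\Gamma(M\xi)}{(\Gamma(\xi))^{M}} \right) - \sum_{n=1}^{N} \log C_{Z^{(n)}}.
\end{aligned}
\tag{4.93}
\]
The following update rules for the EVB learning of the hyperparameters $\phi$ and $\xi$ are obtained similarly to the HMM in Eqs. (4.81) and (4.82):
\[
\phi^{\mathrm{new}} = \max\left( 0,\ \phi^{\mathrm{old}} - \frac{K^3\left(\Psi(\phi) - \Psi(K^2\phi)\right) - \sum_{i=1}^{K} \sum_{j,k=1}^{K} \left( \Psi(\hat{\phi}_{i \to jk}) - \Psi\left( \sum_{j',k'=1}^{K} \hat{\phi}_{i \to j'k'} \right) \right)}{K^3\left( \Psi^{(1)}(\phi) - K^2\Psi^{(1)}(K^2\phi) \right)} \right), \tag{4.94}
\]
\[
\xi^{\mathrm{new}} = \max\left( 0,\ \xi^{\mathrm{old}} - \frac{KM\left(\Psi(\xi) - \Psi(M\xi)\right) - \sum_{i=1}^{K} \sum_{m=1}^{M} \left( \Psi(\hat{\xi}_{i \to m}) - \Psi\left( \sum_{m'=1}^{M} \hat{\xi}_{i \to m'} \right) \right)}{KM\left( \Psi^{(1)}(\xi) - M\Psi^{(1)}(M\xi) \right)} \right). \tag{4.95}
\]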
Algorithm 11 EVB learning for probabilistic context-free grammar.
1: Initialize the variational parameters $(\hat{S}, \hat{\Phi}, \hat{\Xi})$ and the hyperparameters $(\phi, \xi)$.
2: Apply the inside–outside algorithm to $r_z(Z^{(n)})$ in Eq. (4.92) and compute $C_{Z^{(n)}}$ for $n = 1, \ldots, N$.
3: Apply (substitute the right-hand side into the left-hand side) Eqs. (4.86), (4.87), (4.90), and (4.91) to update $\hat{\Phi}$ and $\hat{\Xi}$.
4: Apply Eqs. (4.94) and (4.95) to update $\phi$ and $\xi$, respectively.
5: Evaluate the free energy (4.93).
6: Iterate Steps 2 through 5 until convergence (until the energy decrease becomes smaller than a threshold).

Let
\[
\hat{S} = \left\{ \left\{ \left\langle c^{Z^{(n)}}_{i \to jk} \right\rangle_{r_z(Z^{(n)})} \right\}_{i,j,k=1}^{K},\ \left\{ \left\langle \tilde{z}_i^{(n,l)} \right\rangle_{r_z(Z^{(n)})} \right\}_{l=1}^{L} \right\}_{n=1}^{N},
\]
$\hat{\Phi} = \{\hat{\phi}_{i \to jk}\}_{i,j,k=1}^{K}$, and $\hat{\Xi} = \{\hat{\xi}_{i \to m}\}_{i,m=1}^{K,M}$ be the sets of variational parameters. The EVB learning for the PCFG model is summarized in Algorithm 11. If the prior hyperparameters are fixed and Step 4 in the algorithm is omitted, the algorithm reduces to the (nonempirical) VB learning algorithm.
4.2.4 Latent Dirichlet Allocation
Latent Dirichlet allocation (LDA) (Blei et al., 2003) is a generative model successfully used in various applications such as text analysis (Blei et al., 2003), image analysis (Li and Perona, 2005), genomics (Bicego et al., 2010; Chen et al., 2010), human activity analysis (Huynh et al., 2008), and collaborative filtering (Krestel et al., 2009; Purushotham et al., 2012). Given word occurrences of documents in a corpus, LDA expresses each document as a mixture of multinomial distributions, each of which is expected to capture a topic. The extracted topics provide bases in a low-dimensional feature space, in which each document is compactly represented. This topic expression was shown to be useful for solving various tasks, including classification (Li and Perona, 2005), retrieval (Wei and Croft, 2006), and recommendation (Krestel et al., 2009). In this subsection, we introduce the VB learning for tractable inference in the LDA model. Suppose that we observe $M$ documents, each of which consists of $N^{(m)}$ words. Each word is included in a vocabulary with size $L$. We assume that each word is associated with one of the $H$ topics, which is not observed.
We express the word occurrence by an $L$-dimensional indicator vector $w$, where one of the entries is equal to one and the others are equal to zero. Similarly, we express the topic occurrence as an $H$-dimensional indicator vector $z$. We define the following functions that give the item numbers chosen by $w$ and $z$, respectively:
\[
\acute{l}(w) = l \ \text{ if } w_l = 1 \text{ and } w_{l'} = 0 \text{ for } l' \neq l, \qquad
\acute{h}(z) = h \ \text{ if } z_h = 1 \text{ and } z_{h'} = 0 \text{ for } h' \neq h.
\]
In the LDA model (Blei et al., 2003), the word occurrence $w^{(n,m)}$ of the $n$th position in the $m$th document is assumed to follow the multinomial distribution:
\[
p(w^{(n,m)}|\Theta, B) = \prod_{l=1}^{L} \left( (B\Theta^\top)_{l,m} \right)^{w_l^{(n,m)}} = (B\Theta^\top)_{\acute{l}(w^{(n,m)}),m}, \tag{4.96}
\]
where $\Theta \in [0, 1]^{M \times H}$ and $B \in [0, 1]^{L \times H}$ are parameter matrices to be estimated. The rows of $\Theta = (\tilde{\theta}_1, \ldots, \tilde{\theta}_M)^\top$ and the columns of $B = (\beta_1, \ldots, \beta_H)$ are probability mass vectors that sum up to one. That is, $\tilde{\theta}_m \in \Delta^{H-1}$ is the topic distribution of the $m$th document, and $\beta_h \in \Delta^{L-1}$ is the word distribution of the $h$th topic.

Suppose that we observe the data $D = \{\{w^{(n,m)}\}_{n=1}^{N^{(m)}}\}_{m=1}^{M}$. Given the topic occurrence latent variable $z^{(n,m)}$, the complete likelihood for each word is written as
\[
p(w^{(n,m)}, z^{(n,m)}|\Theta, B) = p(w^{(n,m)}|z^{(n,m)}, B)\, p(z^{(n,m)}|\Theta), \tag{4.97}
\]
where
\[
p(w^{(n,m)}|z^{(n,m)}, B) = \prod_{l=1}^{L} \prod_{h=1}^{H} (B_{l,h})^{w_l^{(n,m)} z_h^{(n,m)}}, \qquad
p(z^{(n,m)}|\Theta) = \prod_{h=1}^{H} (\theta_{m,h})^{z_h^{(n,m)}}.
\]
We assume the Dirichlet priors on $\Theta$ and $B$:
\[
p(\Theta|\alpha) = \prod_{m=1}^{M} \mathrm{Dirichlet}_H\left(\tilde{\theta}_m; (\alpha_1, \ldots, \alpha_H)^\top\right), \tag{4.98}
\]
\[
p(B|\eta) = \prod_{h=1}^{H} \mathrm{Dirichlet}_L\left(\beta_h; (\eta_1, \ldots, \eta_L)^\top\right), \tag{4.99}
\]
where $\alpha$ and $\eta$ are hyperparameters that control the prior sparsity.

VB Posterior for LDA
For the set of all hidden variables $H = \{\{z^{(n,m)}\}_{n=1}^{N^{(m)}}\}_{m=1}^{M}$ and the parameter $w = (\Theta, B)$, we assume that our approximate posterior is factorized as Eq. (4.1). Thus, the update rule (4.2) yields the further factorization $r_{\Theta,B}(\Theta, B) = r_\Theta(\Theta) r_B(B)$ and the following update rules:
\[
r_\Theta(\Theta) \propto p(\Theta|\alpha) \exp \left\langle \log p(D, H|\Theta, B) \right\rangle_{r_B(B) r_H(H)}, \tag{4.100}
\]
\[
r_B(B) \propto p(B|\eta) \exp \left\langle \log p(D, H|\Theta, B) \right\rangle_{r_\Theta(\Theta) r_H(H)}. \tag{4.101}
\]
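Define the expected sufficient statistics as
\[
\overline{N}_h^{(m)} = \sum_{n=1}^{N^{(m)}} \left\langle z_h^{(n,m)} \right\rangle_{r_z(\{\{z^{(n,m)}\}_{n=1}^{N^{(m)}}\}_{m=1}^{M})}, \tag{4.102}
\]
\[
\overline{W}_{l,h} = \sum_{m=1}^{M} \sum_{n=1}^{N^{(m)}} w_l^{(n,m)} \left\langle z_h^{(n,m)} \right\rangle_{r_z(\{\{z^{(n,m)}\}_{n=1}^{N^{(m)}}\}_{m=1}^{M})}. \tag{4.103}
\]
Then, the VB posterior distribution is given by the Dirichlet distributions:
\[
r_\Theta(\Theta) = \prod_{m=1}^{M} \mathrm{Dirichlet}\left(\tilde{\theta}_m; \hat{\alpha}_m\right), \tag{4.104}
\]
\[
r_B(B) = \prod_{h=1}^{H} \mathrm{Dirichlet}\left(\beta_h; \hat{\eta}_h\right), \tag{4.105}
\]
where the variational parameters satisfy
\[
\hat{\alpha}_{m,h} = (\hat{\alpha}_m)_h = \overline{N}_h^{(m)} + \alpha_h, \tag{4.106}
\]
\[
\hat{\eta}_{l,h} = (\hat{\eta}_h)_l = \overline{W}_{l,h} + \eta_l. \tag{4.107}
\]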
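From the update rule (4.3), the VB posterior distribution of latent variables is given by the multinomial distribution:
\[
r_z\left(\{\{z^{(n,m)}\}_{n=1}^{N^{(m)}}\}_{m=1}^{M}\right) = \prod_{m=1}^{M} \prod_{n=1}^{N^{(m)}} \mathrm{Multinomial}_{H,1}\left(z^{(n,m)}; \hat{z}^{(n,m)}\right), \tag{4.108}
\]
where the variational parameter $\hat{z}^{(n,m)} \in \Delta^{H-1}$ is
\[
\hat{z}_h^{(n,m)} = \frac{\overline{z}_h^{(n,m)}}{\sum_{h'=1}^{H} \overline{z}_{h'}^{(n,m)}}
\quad \text{for} \quad
\overline{z}_h^{(n,m)} = \exp\left( \left\langle \log \Theta_{m,h} \right\rangle_{r_\Theta(\Theta)} + \sum_{l=1}^{L} w_l^{(n,m)} \left\langle \log B_{l,h} \right\rangle_{r_B(B)} \right). \tag{4.109}
\]
We also have
\[
\begin{aligned}
\left\langle z_h^{(n,m)} \right\rangle_{r_z(\{\{z^{(n,m)}\}_{n=1}^{N^{(m)}}\}_{m=1}^{M})} &= \hat{z}_h^{(n,m)}, \\
\left\langle \log \Theta_{m,h} \right\rangle_{r_\Theta(\Theta)} &= \Psi(\hat{\alpha}_{m,h}) - \Psi\left( \sum_{h'=1}^{H} \hat{\alpha}_{m,h'} \right), \\
\left\langle \log B_{l,h} \right\rangle_{r_B(B)} &= \Psi(\hat{\eta}_{l,h}) - \Psi\left( \sum_{l'=1}^{L} \hat{\eta}_{l',h} \right).
\end{aligned}
\tag{4.110}
\]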
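Iterating Eqs. (4.106), (4.107), and (4.110) provides a local minimum of the free energy, which is given as a function of the variational parameters by
\[
\begin{aligned}
F ={}& \sum_{m=1}^{M} \left( \log\left( \frac{\Gamma\left(\sum_{h=1}^{H} \hat{\alpha}_{m,h}\right)}{\prod_{h=1}^{H} \Gamma(\hat{\alpha}_{m,h})} \right) - \log\left( \frac{\Gamma\left(\sum_{h=1}^{H} \alpha_h\right)}{\prod_{h=1}^{H} \Gamma(\alpha_h)} \right) \right)
+ \sum_{h=1}^{H} \left( \log\left( \frac{\Gamma\left(\sum_{l=1}^{L} \hat{\eta}_{l,h}\right)}{\prod_{l=1}^{L} \Gamma(\hat{\eta}_{l,h})} \right) - \log\left( \frac{\Gamma\left(\sum_{l=1}^{L} \eta_l\right)}{\prod_{l=1}^{L} \Gamma(\eta_l)} \right) \right) \\
&+ \sum_{m=1}^{M} \sum_{h=1}^{H} \left( \hat{\alpha}_{m,h} - \left(\overline{N}_h^{(m)} + \alpha_h\right) \right) \left( \Psi(\hat{\alpha}_{m,h}) - \Psi\left( \sum_{h'=1}^{H} \hat{\alpha}_{m,h'} \right) \right) \\
&+ \sum_{h=1}^{H} \sum_{l=1}^{L} \left( \hat{\eta}_{l,h} - \left(\overline{W}_{l,h} + \eta_l\right) \right) \left( \Psi(\hat{\eta}_{l,h}) - \Psi\left( \sum_{l'=1}^{L} \hat{\eta}_{l',h} \right) \right) \\
&+ \sum_{m=1}^{M} \sum_{n=1}^{N^{(m)}} \sum_{h=1}^{H} \hat{z}_h^{(n,m)} \log \hat{z}_h^{(n,m)}.
\end{aligned}
\tag{4.111}
\]
The partial derivatives of the free energy with respect to $(\alpha, \eta)$ are computed as follows:
\[
\frac{\partial F}{\partial \alpha_h} = M\left( \Psi(\alpha_h) - \Psi\left( \sum_{h'=1}^{H} \alpha_{h'} \right) \right) - \sum_{m=1}^{M} \left( \Psi(\hat{\alpha}_{m,h}) - \Psi\left( \sum_{h'=1}^{H} \hat{\alpha}_{m,h'} \right) \right), \tag{4.112}
\]
\[
\frac{\partial^2 F}{\partial \alpha_h \partial \alpha_{h'}} = M\left( \delta_{h,h'} \Psi^{(1)}(\alpha_h) - \Psi^{(1)}\left( \sum_{h''=1}^{H} \alpha_{h''} \right) \right), \tag{4.113}
\]
\[
\frac{\partial F}{\partial \eta_l} = H\left( \Psi(\eta_l) - \Psi\left( \sum_{l'=1}^{L} \eta_{l'} \right) \right) - \sum_{h=1}^{H} \left( \Psi(\hat{\eta}_{l,h}) - \Psi\left( \sum_{l'=1}^{L} \hat{\eta}_{l',h} \right) \right), \tag{4.114}
\]
\[
\frac{\partial^2 F}{\partial \eta_l \partial \eta_{l'}} = H\left( \delta_{l,l'} \Psi^{(1)}(\eta_l) - \Psi^{(1)}\left( \sum_{l''=1}^{L} \eta_{l''} \right) \right), \tag{4.115}
\]
where $\delta_{n,n'}$ is the Kronecker delta. Thus, we have the following Newton–Raphson steps to update the hyperparameters:
\[
\alpha^{\mathrm{new}} = \max\left( 0,\ \alpha^{\mathrm{old}} - \left( \frac{\partial^2 F}{\partial \alpha \partial \alpha^\top} \right)^{-1} \frac{\partial F}{\partial \alpha} \right), \tag{4.116}
\]
\[
\eta^{\mathrm{new}} = \max\left( 0,\ \eta^{\mathrm{old}} - \left( \frac{\partial^2 F}{\partial \eta \partial \eta^\top} \right)^{-1} \frac{\partial F}{\partial \eta} \right), \tag{4.117}
\]
where $\max(\cdot)$ is the max operator applied elementwise.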
Algorithm 12 EVB learning for latent Dirichlet allocation.
1: Initialize the variational parameters $(\{\{\hat{z}^{(n,m)}\}_{n=1}^{N^{(m)}}\}_{m=1}^{M}, \{\hat{\alpha}_m\}_{m=1}^{M}, \{\hat{\eta}_h\}_{h=1}^{H})$, and the hyperparameters $(\alpha, \eta)$.
2: Apply (substitute the right-hand side into the left-hand side) Eqs. (4.110), (4.109), (4.102), (4.103), (4.106), and (4.107) to update $\{\{\hat{z}^{(n,m)}\}_{n=1}^{N^{(m)}}\}_{m=1}^{M}$, $\{\hat{\alpha}_m\}_{m=1}^{M}$, and $\{\hat{\eta}_h\}_{h=1}^{H}$.
3: Apply Eqs. (4.116) and (4.117) to update $\alpha$ and $\eta$, respectively.
4: Evaluate the free energy (4.111).
5: Iterate Steps 2 through 4 until convergence (until the energy decrease becomes smaller than a threshold).
The EVB learning for LDA is summarized in Algorithm 12. If the prior hyperparameters are fixed and Step 3 in the algorithm is omitted, the algorithm reduces to the (nonempirical) VB learning algorithm. We can also apply partially Bayesian (PB) learning by approximating the posterior of $\Theta$ or $B$ by the delta function (see Section 2.2.2). We call it PB-A learning if $\Theta$ is marginalized and $B$ is point-estimated, and PB-B learning if $B$ is marginalized and $\Theta$ is point-estimated. Note that the original VB algorithm for LDA proposed by Blei et al. (2003) corresponds to PB-A learning in our terminology. MAP learning, where both $\Theta$ and $B$ are point-estimated, corresponds to the probabilistic latent semantic analysis (pLSA) (Hofmann, 2001), if we assume the flat prior $\alpha_h = \eta_l = 1$ (Girolami and Kaban, 2003).
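Step 2 of Algorithm 12 can be sketched as one sweep over all word positions. The code below is an illustrative sketch with fixed hyperparameters (plain VB, Step 3 omitted), not the authors' implementation: documents are assumed to be lists of word indices, the toy corpus, function names, and random initialization are hypothetical, and `digamma` is a hand-rolled approximation.

```python
import math
import random


def digamma(x):
    """Digamma function for x > 0 (illustrative approximation)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 * inv - series


def lda_vb_sweep(docs, alpha_hat, eta_hat, alpha, eta):
    """One pass of Step 2: expectations, responsibilities, statistics, parameters."""
    H, L = len(alpha), len(eta)
    # Dirichlet expectations <log Theta_{m,h}> and <log B_{l,h}>, Eq. (4.110)
    log_theta = [[digamma(v) - digamma(sum(row)) for v in row] for row in alpha_hat]
    col_norm = [digamma(sum(eta_hat[l][h] for l in range(L))) for h in range(H)]
    log_beta = [[digamma(eta_hat[l][h]) - col_norm[h] for h in range(H)]
                for l in range(L)]
    n_bar = [[0.0] * H for _ in docs]
    w_bar = [[0.0] * H for _ in range(L)]
    for m, doc in enumerate(docs):
        for word in doc:
            z_bar = [math.exp(log_theta[m][h] + log_beta[word][h]) for h in range(H)]
            norm = sum(z_bar)
            for h in range(H):
                z_hat = z_bar[h] / norm               # Eq. (4.109)
                n_bar[m][h] += z_hat                  # Eq. (4.102)
                w_bar[word][h] += z_hat               # Eq. (4.103)
    new_alpha_hat = [[n_bar[m][h] + alpha[h] for h in range(H)]
                     for m in range(len(docs))]       # Eq. (4.106)
    new_eta_hat = [[w_bar[l][h] + eta[l] for h in range(H)]
                   for l in range(L)]                 # Eq. (4.107)
    return new_alpha_hat, new_eta_hat


docs = [[0, 0, 1, 2], [3, 4, 4, 3], [0, 1, 3, 4]]     # toy corpus, L = 5 words
alpha, eta = [0.1, 0.1], [0.1] * 5                    # H = 2 topics
rng = random.Random(0)                                # random start breaks topic symmetry
alpha_hat = [[a + rng.random() for a in alpha] for _ in docs]
eta_hat = [[e + rng.random() for _ in alpha] for e in eta]
for _ in range(20):
    alpha_hat, eta_hat = lda_vb_sweep(docs, alpha_hat, eta_hat, alpha, eta)
```

After every sweep, each row of `alpha_hat` sums to $N^{(m)} + \sum_h \alpha_h$, because the responsibilities of each word sum to one. A convergence check would evaluate the free energy (4.111) after each sweep, as in Step 4.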
5 VB Algorithm under No Conjugacy
As discussed in Section 2.1.7, there are practical combinations of a model and a prior where conjugacy is no longer available. In this chapter, as a method for addressing nonconjugacy, we demonstrate local variational approximation (LVA), also known as direct site bounding, for logistic regression and a sparsity-inducing prior (Jaakkola and Jordan, 2000; Girolami, 2001; Bishop, 2006; Seeger, 2008, 2009). Then we describe a general framework of LVA based on convex functions by using the associated Bregman divergence (Watanabe et al., 2011).
5.1 Logistic Regression
Let $D = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(N)}, y^{(N)})\}$ be the $N$ observations of the binary response variable $y^{(n)} \in \{0, 1\}$ and the input vector $x^{(n)} \in \mathbb{R}^{M}$. The logistic regression model assumes the following Bernoulli model over $y = (y^{(1)}, y^{(2)}, \ldots, y^{(N)})^\top$ given $X = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$:
\[
p(y|X, w) = \prod_{n=1}^{N} \exp\left( y^{(n)} (w^\top x^{(n)}) - \log\left( 1 + e^{w^\top x^{(n)}} \right) \right). \tag{5.1}
\]
Let us consider the Bayesian learning of the parameter $w$ assuming the Gaussian prior distribution:
\[
p(w) = \mathrm{Gauss}_M(w; w_0, S_0^{-1}),
\]
where $S_0$ and $w_0$ are the hyperparameters. Gaussian approximations for the posterior distribution $p(w|D) \propto p(w, y|X)$ are obtained by LVA based on the facts that $-\log(e^{\sqrt{h}/2} + e^{-\sqrt{h}/2})$ is a convex function of $h$ and that $\log(1 + e^{g})$ is a convex function of $g$. More specifically, because $\phi(h(w)) = -\log(e^{\sqrt{w^2}/2} + e^{-\sqrt{w^2}/2})$ is a convex function of $h(w) = w^2$ and $\psi(g(w)) = \log(1 + e^{w})$ is a convex function of $g(w) = w$, they are bounded from below by their tangents at $h(\xi) = \xi^2$ and $g(\eta) = \eta$, respectively:
\[
-\log\left( e^{\sqrt{w^2}/2} + e^{-\sqrt{w^2}/2} \right) \ge -\log\left( e^{\sqrt{\xi^2}/2} + e^{-\sqrt{\xi^2}/2} \right) - (w^2 - \xi^2) \frac{\tanh(\xi/2)}{4\xi},
\]
\[
\log(1 + e^{w}) \ge \log(1 + e^{\eta}) + (w - \eta) \frac{e^{\eta}}{1 + e^{\eta}}.
\]
By substituting these bounds into the likelihood (5.1), we obtain the following bounds on $p(w, y|X)$:
\[
\underline{p}(w; \xi) \le p(w, y|X) \le \overline{p}(w; \eta),
\]
where
\[
\underline{p}(w; \xi) \equiv p(w) \prod_{n=1}^{N} \exp\left( \left( y^{(n)} - \frac{1}{2} \right) w^\top x^{(n)} - \theta_n \left( (w^\top x^{(n)})^2 - h_n \right) - \log\left( e^{\sqrt{h_n}/2} + e^{-\sqrt{h_n}/2} \right) \right),
\]
\[
\overline{p}(w; \eta) \equiv p(w) \prod_{n=1}^{N} \exp\left( (y^{(n)} - \kappa_n) w^\top x^{(n)} - b(\kappa_n) \right).
\]
Here we have put
\[
\theta_n = \frac{\tanh(\sqrt{h_n}/2)}{4\sqrt{h_n}}, \tag{5.2}
\]
\[
\kappa_n = \frac{e^{g_n}}{1 + e^{g_n}}, \tag{5.3}
\]
and $\{h_n\}_{n=1}^{N}$ and $\{g_n\}_{n=1}^{N}$ are the sets of variational parameters defined from $\xi = (\xi_1, \xi_2, \ldots, \xi_M)^\top$ and $\eta = (\eta_1, \eta_2, \ldots, \eta_M)^\top$ by the transformations $h_n = (\xi^\top x^{(n)})^2$ and $g_n = \eta^\top x^{(n)}$, respectively. We also defined the binary entropy function by $b(\kappa) = -\kappa \log \kappa - (1 - \kappa) \log(1 - \kappa)$ for $\kappa \in [0, 1]$. Normalizing these bounds with respect to $w$, we approximate the posterior by the Gaussian distributions as
\[
q_\xi(w; \xi) = \mathrm{Gauss}_M(w; \underline{m}, \underline{S}^{-1}), \qquad
q_\eta(w; \eta) = \mathrm{Gauss}_M(w; \overline{m}, \overline{S}^{-1}),
\]
whose mean and precision (inverse-covariance) matrix are respectively given by
\[
\underline{m} = \underline{S}^{-1} \left( S_0 w_0 + \sum_{n=1}^{N} \left( y^{(n)} - \frac{1}{2} \right) x^{(n)} \right), \qquad
\underline{S} = S_0 + 2 \sum_{n=1}^{N} \theta_n x^{(n)} x^{(n)\top}, \tag{5.4}
\]
and
\[
\overline{m} = w_0 + S_0^{-1} \sum_{n=1}^{N} (y^{(n)} - \kappa_n) x^{(n)}, \qquad
\overline{S} = S_0. \tag{5.5}
\]
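We also obtain the bounds for the marginal likelihood, $\underline{Z}(\xi) \equiv \int \underline{p}(w; \xi)\, dw$ and $\overline{Z}(\eta) \equiv \int \overline{p}(w; \eta)\, dw$. These are respectively given in the forms of free energy bounds as follows:
\[
\overline{F}(\xi) \equiv -\log \underline{Z}(\xi) = \frac{1}{2} \log|\underline{S}| - \frac{1}{2} \log|S_0| + \frac{w_0^\top S_0 w_0}{2} - \frac{\underline{m}^\top \underline{S}\, \underline{m}}{2} - \sum_{n=1}^{N} \left\{ h_n \theta_n - \log\left( 2\cosh\frac{\sqrt{h_n}}{2} \right) \right\}, \tag{5.6}
\]
and
\[
\underline{F}(\eta) \equiv -\log \overline{Z}(\eta) = \frac{w_0^\top S_0 w_0}{2} - \frac{\overline{m}^\top S_0 \overline{m}}{2} + \sum_{n=1}^{N} b(\kappa_n).
\]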
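We optimize the free energy bounds to determine the variational parameters. As will be discussed generally in Section 5.3.2, to decrease the upper-bound $\overline{F}(\xi)$, the EM algorithm is available, which instead maximizes
\[
\left\langle \log \underline{p}(w; \xi) \right\rangle_{q_\xi(w; \xi_o)},
\]
where the expectation is taken with respect to the approximate posterior before updating with the variational parameters given by $\xi_o$. The update rule of the variational parameters is specifically given by
\[
h_n = \left\langle (w^\top x^{(n)})^2 \right\rangle_{q_\xi(w; \xi_o)} = x^{(n)\top} \left( \underline{S}^{-1} + \underline{m}\, \underline{m}^\top \right) x^{(n)}, \tag{5.7}
\]
where $\underline{m}$ and $\underline{S}^{-1}$ are the mean and covariance matrix of $q_\xi(w; \xi_o)$. We can use the following gradient for the maximization of the lower-bound $\underline{F}(\eta)$:
\[
\frac{\partial \underline{F}(\eta)}{\partial \kappa_n} = \left\langle w^\top x^{(n)} - \eta^\top x^{(n)} \right\rangle_{q_\eta(w; \eta)} = \overline{m}^\top x^{(n)} - g_n. \tag{5.8}
\]
The Newton–Raphson step to update $\kappa = (\kappa_1, \ldots, \kappa_N)^\top$ is given by
\[
\kappa^{\mathrm{new}} = \kappa^{\mathrm{old}} - \left( \frac{\partial^2 \underline{F}}{\partial \kappa \partial \kappa^\top} \right)^{-1} \frac{\partial \underline{F}}{\partial \kappa}, \tag{5.9}
\]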
Algorithm 13 LVA algorithm for logistic regression.
1: Initialize the variational parameters $\{h_n\}_{n=1}^{N}$ and transform them to $\{\theta_n\}_{n=1}^{N}$ by Eq. (5.2).
2: Compute the approximate posterior mean and covariance matrix $(\underline{m}, \underline{S}^{-1})$ by Eq. (5.4).
3: Apply Eq. (5.7) to update $\{h_n\}_{n=1}^{N}$ and transform them to $\{\theta_n\}_{n=1}^{N}$ by Eq. (5.2).
4: Evaluate the free energy bound (5.6).
5: Iterate Steps 2 through 4 until convergence (until the decrease of the bound becomes smaller than a threshold).
where the $(n, n')$th entry of the Hessian matrix is given as follows:
$$\frac{\partial^2 F(\eta)}{\partial\kappa_n\,\partial\kappa_{n'}} = -x^{(n)\top}S_0^{-1}x^{(n')} - \delta_{n,n'}\left(\frac{1}{\kappa_n} + \frac{1}{1-\kappa_n}\right).$$
The learning algorithm for logistic regression with LVA is summarized in Algorithm 13 in the case of $F(\xi)$ minimization. To obtain the algorithm for $F(\eta)$ maximization, the updated variables are replaced with $\{g_n\}_{n=1}^N$, $\{\kappa_n\}_{n=1}^N$, and $(\bar m, \bar S^{-1})$, and the update rule (5.7) in Step 3 is replaced with the Newton–Raphson step (5.9).

Recall the arguments in Section 2.1.7 that the VB posterior $r(w;\lambda) = q(w;\xi)$ in Eq. (2.32) minimizes the upper-bound of the free energy (2.25) and the approximate posterior $r(w;\nu) = q(w;\eta)$ in Eq. (2.58) maximizes the lower-bound of the objective function of EP (2.50). This means that the variational parameters are given by $\lambda = (m, S^{-1})$ and $\nu = (\bar m, \bar S^{-1})$ in the LVAs for VB and EP, respectively.
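The steps of Algorithm 13 can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation. It assumes binary labels $y^{(n)} \in \{0, 1\}$ and the Jaakkola–Jordan form $\theta_n = \tanh(\sqrt{h_n}/2)/(4\sqrt{h_n})$ for Eq. (5.2), which is not shown in this part of the text. The function names `theta_from_h`, `free_energy_xi`, and `lva_logistic` are hypothetical.

```python
import numpy as np

def theta_from_h(h):
    """Assumed form of Eq. (5.2): theta_n = tanh(sqrt(h_n)/2) / (4 sqrt(h_n))."""
    s = np.sqrt(h)
    return np.tanh(s / 2.0) / (4.0 * s)

def free_energy_xi(h, theta, m, S, w0, S0):
    """Free energy bound F(xi) of Eq. (5.6)."""
    _, logdet_S = np.linalg.slogdet(S)
    _, logdet_S0 = np.linalg.slogdet(S0)
    local = h * theta - np.log(2.0 * np.cosh(np.sqrt(h) / 2.0))
    return (0.5 * logdet_S - 0.5 * logdet_S0
            + 0.5 * w0 @ S0 @ w0 - 0.5 * m @ S @ m - local.sum())

def lva_logistic(X, y, w0, S0, n_iter=100, tol=1e-10):
    """F(xi) minimization for logistic regression (sketch of Algorithm 13)."""
    N = X.shape[0]
    h = np.ones(N)                                   # Step 1: initialize h_n
    history = []
    for _ in range(n_iter):
        theta = theta_from_h(h)
        S = S0 + 2.0 * (X.T * theta) @ X             # Step 2: precision, Eq. (5.4)
        m = np.linalg.solve(S, S0 @ w0 + X.T @ (y - 0.5))
        history.append(free_energy_xi(h, theta, m, S, w0, S0))  # Step 4
        if len(history) > 1 and history[-2] - history[-1] < tol:
            break
        second_moment = np.linalg.inv(S) + np.outer(m, m)
        h = np.einsum("ni,ij,nj->n", X, second_moment, X)       # Step 3: Eq. (5.7)
    return m, S, np.array(history)
```

The bound is evaluated with $h$, $\theta$, $m$, and $S$ all computed from the same variational parameters, so the recorded sequence is the quantity that the EM argument of Section 5.3.2 guarantees to be nonincreasing.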
5.2 Sparsity-Inducing Prior

Another representative example where a nonconjugate prior is used is the linear regression model with a sparsity-inducing prior distribution (Girolami, 2001; Seeger, 2008, 2009). We discuss the linear regression model for i.i.d. data $D = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(N)}, y^{(N)})\}$, where for each observation, $x^{(n)} \in \mathbb{R}^M$ is the input vector and $y^{(n)} \in \mathbb{R}$ is the response. Denoting $y = (y^{(1)}, \dots, y^{(N)})^\top$ and $X = (x^{(1)}, \dots, x^{(N)})^\top$, we assume the model,
$$p(y|X, w) = \mathrm{Gauss}_N(y; Xw, \sigma^2 I_N),$$
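and the following sparsity-inducing prior with the $L_\beta$-norm:
$$p(w) = \prod_{m=1}^M \frac{1}{C_{\beta,\gamma}}\exp\left(-\gamma|w_m|^\beta\right), \tag{5.10}$$
where $\gamma > 0$ and $0 < \beta \le 2$ are the hyperparameters, and $C_{\beta,\gamma} = \frac{2}{\beta}\gamma^{-1/\beta}\Gamma(1/\beta)$ is the normalizing constant. For $0 < \beta < 2$, the prior has a heavier tail than the Gaussian distribution ($\beta = 2$) and induces sparsity of the coefficients $w$. We apply the following inequality for $w, \xi \in \mathbb{R}$:
$$\left(w^2\right)^{\beta/2} \le \left(\xi^2\right)^{\beta/2} + \frac{\beta}{2}\left(\xi^2\right)^{\frac{\beta}{2}-1}\left(w^2 - \xi^2\right),$$
which is the tangent-line bound obtained from the concavity of the function $f(x) = x^{\beta/2}$ for $x > 0$ and $0 < \beta < 2$. Introducing the variational parameter to each dimension, $\xi = (\xi_1, \dots, \xi_M)^\top$, and bounding the nonconjugate prior (5.10) by this inequality, we have
$$p(y, w|X) = p(y|X, w)\,p(w) \ge \frac{1}{(2\pi\sigma^2)^{N/2}C_{\beta,\gamma}^M}\exp\left(-\frac{1}{2\sigma^2}\sum_{n=1}^N\left(y^{(n)} - w^\top x^{(n)}\right)^2 - \gamma\sum_{m=1}^M\left\{\left(\xi_m^2\right)^{\beta/2} + \frac{\beta}{2}\left(\xi_m^2\right)^{\frac{\beta}{2}-1}\left(w_m^2 - \xi_m^2\right)\right\}\right) \equiv p(w;\xi).$$
Normalizing the lower-bound $p(w;\xi)$, we obtain a Gaussian approximation to the posterior. This is in effect equivalent to assuming the Gaussian prior for $w$:
$$\mathrm{Gauss}_M(w; 0, S_\xi^{-1}),$$
where $S_\xi = \gamma\beta\,\mathrm{Diag}\left(\left(\xi_1^2\right)^{\frac{\beta}{2}-1}, \dots, \left(\xi_M^2\right)^{\frac{\beta}{2}-1}\right)$.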
The resulting Gaussian approximation to the posterior is $q_\xi(w;\xi) = \mathrm{Gauss}_M(w; m, S^{-1})$, where
$$S = S_\xi + \frac{1}{\sigma^2}X^\top X, \tag{5.11}$$
$$m = \frac{1}{\sigma^2}S^{-1}X^\top y. \tag{5.12}$$
Algorithm 14 LVA algorithm for sparse linear regression.
1: Initialize the variational parameters $\{\xi_m^2\}_{m=1}^M$.
2: Compute the approximate posterior mean and covariance matrix $(m, S^{-1})$ by Eqs. (5.11) and (5.12).
3: Apply Eq. (5.14) to update $\{\xi_m^2\}_{m=1}^M$.
4: Evaluate the free energy bound (5.13).
5: Iterate Steps 2 through 4 until convergence (until the decrease of the bound becomes smaller than a threshold).
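We also obtain the upper bound for the free energy by carrying out the Gaussian integral of $p(w;\xi)$:
$$F(\xi) = -\log\int p(w;\xi)\,dw = \frac{N-M}{2}\log(2\pi) + \frac{N}{2}\log\sigma^2 + \frac{\log|S|}{2} + M\log C_{\beta,\gamma} - \frac{m^\top S m}{2} + \frac{1}{2\sigma^2}\sum_{n=1}^N\left(y^{(n)}\right)^2 + \gamma\left(1 - \frac{\beta}{2}\right)\sum_{m=1}^M\left(\xi_m^2\right)^{\beta/2}, \tag{5.13}$$
which is optimized with respect to the variational parameter. The general framework in Section 5.3.2, which corresponds to the EM algorithm, provides the following update rule:
$$\xi_m^2 = \langle w_m^2\rangle_{q_\xi(w;\xi_o)} = (S^{-1})_{mm} + m_m^2, \tag{5.14}$$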
where $m$ and $S^{-1}$ are the mean and covariance matrix of $q_\xi(w;\xi_o)$. The learning algorithm for sparse linear regression with this approximation is summarized in Algorithm 14. This approximation has been applied to the Laplace prior ($\beta = 1$) in Seeger (2008, 2009). LVA for another heavy-tailed distribution, $p(w) \propto \cosh^{-1/\beta}(\beta w)$, is discussed in Girolami (2001), which also bridges the Gaussian distribution ($\beta \to 0$) and the Laplace distribution ($\beta \to \infty$).
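Algorithm 14 can be sketched in NumPy along the following lines. This is an illustrative sketch, not the authors' code. The function name `lva_sparse_regression` and its defaults are hypothetical, and the free energy is evaluated from the Gaussian integral of the tangent-line bound on $|w_m|^\beta$ derived above, which is an assumption about the exact constant terms of Eq. (5.13).

```python
import numpy as np
from math import lgamma, log, pi

def lva_sparse_regression(X, y, sigma2=1.0, beta=1.0, gamma=1.0,
                          n_iter=200, tol=1e-10):
    """F(xi) minimization for sparse linear regression (sketch of Algorithm 14)."""
    N, M = X.shape
    xi2 = np.ones(M)                                   # Step 1: initialize xi_m^2
    # log of the normalizing constant C = (2/beta) gamma^(-1/beta) Gamma(1/beta)
    log_C = log(2.0 / beta) - log(gamma) / beta + lgamma(1.0 / beta)
    history = []
    for _ in range(n_iter):
        s_xi = gamma * beta * xi2 ** (beta / 2.0 - 1.0)  # diagonal of S_xi
        S = np.diag(s_xi) + X.T @ X / sigma2             # Step 2: Eq. (5.11)
        cov = np.linalg.inv(S)
        m = cov @ X.T @ y / sigma2                       # Step 2: Eq. (5.12)
        _, logdet_S = np.linalg.slogdet(S)
        F = (0.5 * (N - M) * log(2.0 * pi) + 0.5 * N * log(sigma2)
             + 0.5 * logdet_S + M * log_C - 0.5 * m @ S @ m
             + y @ y / (2.0 * sigma2)
             + gamma * (1.0 - beta / 2.0) * np.sum(xi2 ** (beta / 2.0)))
        history.append(F)                                # Step 4: bound (5.13)
        if len(history) > 1 and history[-2] - history[-1] < tol:
            break
        xi2 = np.diag(cov) + m ** 2                      # Step 3: Eq. (5.14)
    return m, cov, np.array(history)
```

With $\beta = 1$ this corresponds to the Laplace prior case mentioned in the text. Coefficients with little support in the data receive a large prior precision $\gamma\beta(\xi_m^2)^{\beta/2-1}$ and are shrunk toward zero.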
5.3 Unified Approach by Local VB Bounds

As discussed in Section 2.1.7, LVA for VB and LVA for EP form lower- and upper-bounds of the joint distribution $p(w, D)$, denoted by $p(w;\xi)$ and $p(w;\eta)$, respectively. If the bounds satisfying
$$p(w;\xi) \le p(w, D), \tag{5.15}$$
$$p(w;\eta) \ge p(w, D), \tag{5.16}$$
for all $w$ and $D$ are analytically integrable, then by normalizing the bounds instead of $p(w, D)$, LVAs approximate the posterior distribution by
$$q_\xi(w;\xi) = \frac{p(w;\xi)}{Z(\xi)}, \tag{5.17}$$
$$q_\eta(w;\eta) = \frac{p(w;\eta)}{Z(\eta)}, \tag{5.18}$$
respectively, where $Z(\xi)$ and $Z(\eta)$ are the normalization constants defined by
$$Z(\xi) = \int p(w;\xi)\,dw, \qquad Z(\eta) = \int p(w;\eta)\,dw.$$
Here $\xi$ and $\eta$ are called the variational parameters. The respective approximations are optimized by estimating the variational parameters $\xi$ and $\eta$ so that $Z(\xi)$ is maximized and $Z(\eta)$ is minimized, since the inequalities
$$Z(\xi) \le Z \le Z(\eta) \tag{5.19}$$
hold by definition, where $Z = p(D)$ is the marginal likelihood. To consider the respective LVAs in terms of information divergences in later sections, let us introduce the Bayes free energy, $F^{\mathrm{Bayes}} \equiv -\log Z$, and its lower- and upper-bounds, $F(\eta) = -\log Z(\eta)$ and $F(\xi) = -\log Z(\xi)$. By taking the negative logarithms on both sides of Eq. (5.19), we have
$$F(\eta) \le F^{\mathrm{Bayes}} \le F(\xi). \tag{5.20}$$
Hereafter, we follow the measure of the free energy and adopt the following terminology to refer to the respective LVAs (5.18) and (5.17): the lower-bound maximization ($F(\eta)$ maximization) and the upper-bound minimization ($F(\xi)$ minimization).
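The ordering (5.19) can be checked numerically on a one-dimensional toy problem. The sketch below assumes a standard Gaussian prior on a scalar $w$, a single logistic observation with $x = 1$ and $y = 1$, the Jaakkola–Jordan tangent bound for the lower bound, and the convex-conjugate bound with the binary entropy as $b(\kappa)$ for the upper bound. These specific forms come from Section 5.1, which is not reproduced here, so they should be read as assumptions.

```python
import numpy as np

# Grid for simple numerical integration over the scalar parameter w.
w = np.linspace(-20.0, 20.0, 40001)
dw = w[1] - w[0]
y = 1.0

log_prior = -0.5 * w ** 2 - 0.5 * np.log(2.0 * np.pi)
# log p(w, D) for one logistic observation with x = 1.
log_joint = log_prior + y * w - np.logaddexp(0.0, w)

def log_lower(h):
    """Lower bound log p(w; xi), parameterized by h = xi^2 (assumed form)."""
    s = np.sqrt(h)
    theta = np.tanh(s / 2.0) / (4.0 * s)
    return (log_prior + (y - 0.5) * w - theta * (w ** 2 - h)
            - np.log(2.0 * np.cosh(s / 2.0)))

def log_upper(kappa):
    """Upper bound log p(w; eta), parameterized by kappa in (0, 1) (assumed form)."""
    b = -kappa * np.log(kappa) - (1.0 - kappa) * np.log(1.0 - kappa)
    return log_prior + (y - kappa) * w - b

def integrate(log_f):
    return np.sum(np.exp(log_f)) * dw

Z = integrate(log_joint)
Z_lower = [integrate(log_lower(h)) for h in (0.1, 1.0, 4.0)]
Z_upper = [integrate(log_upper(k)) for k in (0.2, 0.5, 0.8)]
print(Z_lower, Z, Z_upper)
```

Every value in `Z_lower` stays below `Z` and every value in `Z_upper` stays above it, whatever variational parameter is chosen. Optimizing the parameters only tightens the gaps.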
5.3.1 Divergence Measures in LVA

Most of the existing LVA techniques are based on the convexity of the log-likelihood function or the log-prior (Jaakkola and Jordan, 2000; Bishop, 2006; Seeger, 2008, 2009). We describe these cases by using general convex functions, $\phi$ and $\psi$, and show that the objective functions,
$$F(\xi) - F^{\mathrm{Bayes}} = \log\frac{Z}{Z(\xi)} \ge 0,$$
$$F^{\mathrm{Bayes}} - F(\eta) = \log\frac{Z(\eta)}{Z} \ge 0,$$
to be minimized in the approximations (5.17) and (5.18), are decomposable into the sum of the KL divergence and the expected Bregman divergence.

Let $\phi$ and $\psi$ be twice differentiable real-valued strictly convex functions and denote by $d_\phi$ the Bregman divergence associated with the function $\phi$ (Banerjee et al., 2005):
$$d_\phi(v_1, v_2) = \phi(v_1) - \phi(v_2) - (v_1 - v_2)^\top\nabla\phi(v_2) \ge 0, \tag{5.21}$$
where $\nabla\phi(v_2)$ denotes the gradient vector of $\phi$ at $v_2$. Let us consider the case where $\phi$ and $\psi$ are respectively used to form the following bounds of the joint distribution $p(w, D)$:
$$p(w;\xi) = p(w, D)\exp\{-d_\phi(h(w), h(\xi))\}, \tag{5.22}$$
$$p(w;\eta) = p(w, D)\exp\{d_\psi(g(w), g(\eta))\}, \tag{5.23}$$
where $h$ and $g$ are vector-valued functions of $w$.$^1$ Eq. (5.22) is interpreted as follows. $\log p(w, D)$ includes a term that prevents analytic integration of $p(w, D)$ with respect to $w$. If such a term is expressed by the convex function $\phi$ of some function $h$ transforming $w$, it is replaced by the tangent hyperplane, $\phi(h(\xi)) + (h(w) - h(\xi))^\top\nabla\phi(h(\xi))$, so that $\log p(w;\xi)$ becomes a simpler function of $w$, such as a quadratic function. Remember that if $\log p(w;\xi)$ is quadratic with respect to $w$, $p(w;\xi)$ is analytically integrable by the Gaussian integral. Rephrased in terms of the convex duality theory (Jordan et al., 1999; Bishop, 2006), $\phi(h(w))$ is replaced by its lower-bound,
$$\phi(h(w)) \ge \phi(h(\xi)) + (h(w) - h(\xi))^\top\nabla\phi(h(\xi)) \tag{5.24}$$
$$= \theta(\xi)^\top h(w) - \tilde\phi(\theta(\xi)), \tag{5.25}$$
where we have put $\theta(\xi) = \nabla\phi(h(\xi))$ and
$$\tilde\phi(\theta(\xi)) = \theta(\xi)^\top h(\xi) - \phi(h(\xi)) = \max_h\{\theta(\xi)^\top h - \phi(h)\}$$
is the conjugate function of $\phi$. The inequality (5.24) indicates the fact that the convex function $\phi$ is bounded globally by its tangent at $h(\xi)$, which is equivalent to the nonnegativity of the Bregman divergence. In Eq. (5.25), the tangent is reparameterized by $\theta(\xi)$, its gradient, instead of the contact point $h(\xi)$, and its offset is given by $-\tilde\phi(\theta(\xi))$. Figure 5.1 illustrates the relationship among the convex function $\phi$, its lower-bound, and the Bregman divergence.

$^1$ The functions $g$ and $h$ (also $\psi$ and $\phi$) can be dependent on $D$ in this discussion. However, we denote them as if they were independent of $D$ for simplicity. They are actually independent of $D$ in the examples in Sections 5.1 and 5.2 and in most applications (Bishop, 2006; Seeger, 2008, 2009).

Figure 5.1 Convex function $\phi$ (solid curve), its tangent (dashed line), and the Bregman divergence (arrow). [Plot of $\phi(h(w))$ against $h(w)$, with the tangent $\theta(\xi)h(w) - \tilde\phi(\theta(\xi))$ touching the curve at $h(\xi)$, its intercept $-\tilde\phi(\theta(\xi))$, and the gap $d_\phi(h(w), h(\xi))$.]

We now describe the free energy bounds $F(\xi)$ and $F(\eta)$ in terms of information divergences. It follows from the definition (5.17) of the approximate posterior distribution that
$$\mathrm{KL}(q_\xi(w;\xi)\|p(w|D)) = \int q_\xi(w;\xi)\log\frac{Z\,p(w;\xi)}{Z(\xi)\,p(w, D)}\,dw = \log\frac{Z}{Z(\xi)} - \langle d_\phi(h(w), h(\xi))\rangle_{q_\xi(w;\xi)}.$$
We have a similar decomposition for $\mathrm{KL}(p(w|D)\|q_\eta(w;\eta))$ as well. Finally, we obtain the following expressions:$^2$
$$F(\xi) - F^{\mathrm{Bayes}} = \langle d_\phi(h(w), h(\xi))\rangle_{q_\xi} + \mathrm{KL}(q_\xi\|p), \tag{5.26}$$
$$F^{\mathrm{Bayes}} - F(\eta) = \langle d_\psi(g(w), g(\eta))\rangle_{p} + \mathrm{KL}(p\|q_\eta). \tag{5.27}$$
Recall that $F^{\mathrm{Bayes}} + \mathrm{KL}(q_\xi\|p) = F$ is the free energy, which is further bounded by $F(\lambda, \xi)$ in Eq. (2.25). The expression (5.26) shows that the gap between $\min_\lambda F(\lambda, \xi)$ and $F$ is the expected Bregman divergence $\langle d_\phi(h(w), h(\xi))\rangle_{q_\xi}$. Recall also that $F^{\mathrm{Bayes}} - \mathrm{KL}(p\|q_\eta) = E$ is the objective function of the EP problem (2.46) and that $F(\eta) = -\log Z(\eta)$ is obtained as the maximum of its lower-bound, $\max_\nu E(\nu, \eta)$ in Eq. (2.59), under a monotonic transformation. The expression (5.27) shows that the gap between $E$ and $\max_\nu E(\nu, \eta)$ is expressed by the expected Bregman divergence $\langle d_\psi(g(w), g(\eta))\rangle_{p}$, where the expectation is taken with respect to the true posterior.

$^2$ Hereafter in this section, we omit the notation "$(w|D)$" if no confusion is likely.

Similarly, we also have the following decompositions:
$$F(\xi) - F^{\mathrm{Bayes}} = \langle d_\phi(h(w), h(\xi))\rangle_{p} - \mathrm{KL}(p\|q_\xi), \tag{5.28}$$
and
$$F^{\mathrm{Bayes}} - F(\eta) = \langle d_\psi(g(w), g(\eta))\rangle_{q_\eta} - \mathrm{KL}(q_\eta\|p).$$
Unlike Eqs. (5.26) and (5.27), the KL divergence is subtracted in these expressions. This again implies the affinities of LVAs by $F(\xi)$ minimization and $F(\eta)$ maximization to VB and EP, respectively.
5.3.2 Optimization of Approximations

In this subsection, we show that the conditions for the optimal variational parameters are generally given by the moment matching with respect to $h(\xi)$ and $g(\eta)$.

Optimal Variational Parameters

From Eqs. (5.22) and (5.23), we can see that the approximate posteriors,
$$q_\xi(w;\xi) \propto p(w, D)\exp\{h(w)^\top\nabla\phi(h(\xi)) - \phi(h(w))\}, \tag{5.29}$$
$$q_\eta(w;\eta) \propto p(w, D)\exp\{-g(w)^\top\nabla\psi(g(\eta)) + \psi(g(w))\},$$
are members of the exponential family with natural parameters $\nabla\phi(h(\xi))$ and $\nabla\psi(g(\eta))$ (Section 1.2.3). Let $\theta(\xi) = \nabla\phi(h(\xi))$ and $\kappa(\eta) = \nabla\psi(g(\eta))$. The variational parameters are optimized so that $F(\xi)$ is minimized and $F(\eta)$ is maximized, respectively. In practice, however, they can be optimized directly with respect to $h(\xi)$ and $g(\eta)$ instead of $\xi$ and $\eta$. Applications of LVA, storing $h(\xi)$ and $g(\eta)$ as parameters, do not require $\xi$ and $\eta$ explicitly. Furthermore, we consider the gradient vectors of the free energy bounds with respect to $\theta(\xi)$ and $\kappa(\eta)$, which have one-to-one correspondence with $h(\xi)$ and $g(\eta)$, because they provide simple expressions of the gradient vectors. For notational simplicity, we drop the dependencies on $\xi$ and $\eta$ and write $\theta$ and $\kappa$.
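The gradient of the upper bound with respect to $\theta$ is given by$^3$
$$\nabla_\theta F(\xi) = \nabla_\theta\left(-\log\int p(w;\xi)\,dw\right) = -\frac{1}{Z(\xi)}\int\frac{\partial p(w;\xi)}{\partial\theta}\,dw = -\frac{1}{Z(\xi)}\int\frac{\partial h(\xi)}{\partial\theta}\frac{\partial p(w;\xi)}{\partial h(\xi)}\,dw$$
$$= -\frac{\partial h(\xi)}{\partial\theta}\frac{\partial^2\phi(h(\xi))}{\partial h\,\partial h^\top}\int(h(w) - h(\xi))\,q_\xi(w;\xi)\,dw = -\left(\langle h(w)\rangle_{q_\xi} - h(\xi)\right), \tag{5.30}$$
where we have used Eq. (5.22) and the fact that the matrix $\frac{\partial h(\xi)}{\partial\theta}$, whose $(i, j)$th entry is $\frac{\partial h_i(\xi)}{\partial\theta_j}$, is the inverse of the Hessian matrix $\frac{\partial^2\phi(h(\xi))}{\partial h\,\partial h^\top}$.

Similarly, we obtain
$$\nabla_\kappa F(\eta) = \langle g(w)\rangle_{q_\eta} - g(\eta). \tag{5.31}$$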
Hence, we can utilize gradient methods to minimize $F(\xi)$ and maximize $F(\eta)$. We can see that when $\xi$ and $\eta$ are optimized, $h(\xi) = \langle h(w)\rangle_{q_\xi}$ and $g(\eta) = \langle g(w)\rangle_{q_\eta}$ hold.

In practice, the variational parameter $h(\xi)$ is iteratively updated so that $F(\xi)$ is monotonically decreased. Recall the argument in Section 2.1.7 where the LVA for VB was formulated as the joint minimization of $F(\lambda, \xi)$ over the approximate posterior $r(w;\lambda)$ and $\xi$. The free energy bound $F(\xi) = -\log Z(\xi)$ was obtained as the minimum of $F(\lambda, \xi)$, which is attained by $r(w;\lambda) = q(w;\xi)$ (see Eq. (2.32)).

Let $h(\xi_o)$ be a current estimate of $h(\xi)$ and $\lambda_o$ be the variational parameter such that $r(w;\lambda_o) = q(w;\xi_o)$. Then, updating $h(\xi)$ to $\mathrm{argmin}_{h(\xi)} F(\lambda_o, \xi)$ decreases $F(\xi)$ because $F(\lambda_o, \xi) \ge F(\xi)$ for all $\xi$ and the equality holds when $\xi = \xi_o$. More specifically, it follows for $h(\hat\xi) = \mathrm{argmin}_{h(\xi)} F(\lambda_o, \xi)$ that
$$F(\hat\xi) \le F(\lambda_o, \hat\xi) \le F(\lambda_o, \xi_o) = F(\xi_o), \tag{5.32}$$
which means that the bound is improved. This corresponds to the EM algorithm to decrease $F(\xi)$ and yields the following specific update rule of $h(\xi)$:
$$h(\hat\xi) = \mathop{\mathrm{argmin}}_{h(\xi)} F(\lambda_o, \xi)$$
$$= \mathop{\mathrm{argmin}}_{h(\xi)} \langle -\log p(w;\xi)\rangle_{q_\xi(w;\xi_o)} \tag{5.33}$$
$$= \mathop{\mathrm{argmin}}_{h(\xi)} \langle d_\phi(h(w), h(\xi))\rangle_{q_\xi(w;\xi_o)} \tag{5.34}$$
$$= \mathop{\mathrm{argmin}}_{h(\xi)} d_\phi\left(\langle h(w)\rangle_{q_\xi(w;\xi_o)}, h(\xi)\right) \tag{5.35}$$
$$= \langle h(w)\rangle_{q_\xi(w;\xi_o)}, \tag{5.36}$$
which is summarized as
$$h(\hat\xi) = \langle h(w)\rangle_{q_\xi(w;\xi_o)}. \tag{5.37}$$

$^3$ We henceforth use the operator $\nabla$ with the subscript expressing for which variable the gradient is taken. That is, for a function $f(\theta)$, $\nabla_\theta f(\theta) = \frac{\partial f(\theta)}{\partial\theta}$ denotes the vector whose $i$th element is $\frac{\partial f(\theta)}{\partial\theta_i}$.
The preceding lines of equations are basically derived by focusing on the terms depending on $h(\xi)$. Eq. (5.33) follows from the definition of $F(\lambda_o, \xi)$ by Eq. (2.25). Eq. (5.34) follows from the definition of $p(w;\xi)$ by Eq. (5.15). Eq. (5.35) follows from the definition of the Bregman divergence (5.21) and the linearity of expectation. Eq. (5.36) follows from the nonnegativity of the Bregman divergence. Eqs. (5.34) through (5.36) are equivalent to the fact that the expected Bregman divergence is minimized by the mean (Banerjee et al., 2005).

The update rule (5.37) means that $h(\xi)$ is updated to the expectation of $h(w)$ with respect to the approximate posterior. Note here again that if we store $h(\xi)$, $\xi$ is not explicitly required. The update rule (5.37) is an iterative substitution of $h(\xi)$.

To maximize the lower-bound $F(\eta)$ in LVA for EP, such a simple update rule is not applicable in general. Thus, gradient-based optimization methods with the gradient (5.31) are usually used. The Newton–Raphson step to update $\kappa$ is given by
$$\kappa^{\mathrm{new}} = \kappa^{\mathrm{old}} - \left(\nabla_\kappa^2 F(\eta^{\mathrm{old}})\right)^{-1}\nabla_\kappa F(\eta^{\mathrm{old}}), \tag{5.38}$$
where the Hessian matrix is given as follows:
$$\nabla_\kappa^2 F(\eta) = \frac{\partial^2 F(\eta)}{\partial\kappa\,\partial\kappa^\top} = -\mathrm{Cov}(g(w)) - \frac{\partial g(\eta)}{\partial\kappa} \tag{5.39}$$
for the covariance matrix of $g(w)$,
$$\mathrm{Cov}(g(w)) = \langle g(w)g(w)^\top\rangle_{q_\eta} - \langle g(w)\rangle_{q_\eta}\langle g(w)\rangle_{q_\eta}^\top,$$
and
$$\frac{\partial g(\eta)}{\partial\kappa} = \left(\frac{\partial^2\psi(g(\eta))}{\partial g\,\partial g^\top}\right)^{-1}$$
holds in Eq. (5.39).
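The step from Eq. (5.34) to Eq. (5.36), namely that the expected Bregman divergence is minimized by the mean, can be illustrated numerically. The sketch below uses the strictly convex function $\phi(v) = \sum_i \exp(v_i)$ and Gaussian samples standing in for $h(w)$. Both choices are arbitrary illustrations and are not taken from the text.

```python
import numpy as np

def bregman(phi, grad_phi, v1, v2):
    """Bregman divergence d_phi(v1, v2) of Eq. (5.21)."""
    return phi(v1) - phi(v2) - (v1 - v2) @ grad_phi(v2)

# An arbitrary strictly convex function and its gradient (illustrative choice).
phi = lambda v: np.sum(np.exp(v))
grad_phi = lambda v: np.exp(v)

rng = np.random.default_rng(0)
samples = rng.normal(size=(500, 3))      # stand-ins for draws of h(w)

def expected_divergence(center):
    """Sample average of d_phi(h(w), center) over the draws."""
    return np.mean([bregman(phi, grad_phi, s, center) for s in samples])

mean = samples.mean(axis=0)
at_mean = expected_divergence(mean)
# Objective values at randomly perturbed centers, for comparison.
perturbed = [expected_divergence(mean + 0.1 * rng.normal(size=3))
             for _ in range(20)]
print(at_mean, min(perturbed))
```

The value at the sample mean is smaller than at any perturbed center, in line with the update rule (5.37). The excess at a center $c$ equals $d_\phi(\text{mean}, c)$, which is the identity behind Eq. (5.35).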
5.3.3 An Alternative View of VB for Latent Variable Models

In this subsection, we show that the VB learning for latent variable models can be viewed as a special case of LVA, where the log-sum-exp function is used to form the lower-bound of the log-likelihood (Jordan et al., 1999).

Let $H$ be a vector of latent (unobserved) variables and consider the latent variable model,
$$p(D, w) = \sum_H p(D, H, w),$$
where $\sum_H$ denotes the summation over all possible realizations of the latent variables. We have used the notation as if $H$ were discrete in order to include examples such as GMMs and HMMs, where the likelihood function is given by $p(D|w) = \sum_H p(D, H|w)$. In the case of a model with continuous latent variables, the summation $\sum_H$ is simply replaced by the integration $\int dH$. This includes, for example, the hierarchical prior distribution presented in Tipping (2001), where the prior distribution is defined by $p(w) = \int p(w|H)p(H)\,dH$ with the hyperprior $p(H)$. The Bayesian posterior distribution of the latent variables and the parameter $w$ is
$$p(H, w|D) = \frac{p(D, H, w)}{\sum_H\int p(D, H, w)\,dw},$$
which is intractable when $Z = \sum_H\int p(D, H, w)\,dw$ requires summation over exponentially many terms, as in GMMs and HMMs, or analytically intractable integration. So is the posterior of the parameter $p(w|D)$.

Let us consider an application of the local variational method for approximating $p(w|D)$. By the convexity of the function $\log\sum_H\exp(\cdot)$, the log-joint distribution is lower-bounded as follows:
$$\log p(D, w) = \log\sum_H\exp\{\log p(D, H, w)\}$$
$$\ge \log p(D, \xi) + \sum_H p(H|D, \xi)\log\frac{p(D, H, w)}{p(D, H, \xi)}$$
$$= \log p(D, w) - \sum_H p(H|D, \xi)\log\frac{p(H|D, \xi)}{p(H|D, w)}, \tag{5.40}$$
where $p(H|D, \xi) = \frac{p(D, H, \xi)}{\sum_H p(D, H, \xi)}$. This corresponds to the case where $\phi(h) = \log\sum_n\exp(h_n)$ and $h(w)$ is the vector-valued function that consists of the elements $\log p(D, H, w)$ for all possible $H$. The vector $h$ is infinite dimensional when $H$ is continuous. Taking exponentials of the rightmost and leftmost sides of Inequality (5.40) leads to Eqs. (5.22) and (5.15) with the Bregman divergence,
$$d_\phi(h(w), h(\xi)) = \sum_H p(H|D, \xi)\log\frac{p(H|D, \xi)}{p(H|D, w)} = \mathrm{KL}(p(H|D, \xi)\|p(H|D, w)).$$
From Eq. (5.26), we have
$$F(\xi) = F^{\mathrm{Bayes}} + \mathrm{KL}(q_\xi(w;\xi)\|p(w|D)) + \langle\mathrm{KL}(p(H|D, \xi)\|p(H|D, w))\rangle_{q_\xi(w;\xi)}$$
$$= F^{\mathrm{Bayes}} + \mathrm{KL}(q_\xi(w;\xi)\,p(H|D, \xi)\|p(w, H|D)),$$
which is exactly the free energy of the factorized distribution $q_\xi(w;\xi)\,p(H|D, \xi)$. In fact, from Eqs. (5.29) and (5.40), the approximating posterior is given by
$$q_\xi(w;\xi) \propto \exp\left\{\sum_H\log p(D, H, w)\,p(H|D, \xi)\right\} = \exp\langle\log p(D, H, w)\rangle_{p(H|D, \xi)}. \tag{5.41}$$
From Eq. (5.37), the EM update for the variational parameters $\xi$ yields
$$\log p(D, H, \xi) = \langle\log p(D, H, w)\rangle_{q_\xi(w;\xi_o)} \;\Rightarrow\; p(H|D, \xi) \propto \exp\langle\log p(D, H, w)\rangle_{q_\xi(w;\xi_o)}. \tag{5.42}$$
Eqs. (5.41) and (5.42) are exactly the same as the VB algorithm for minimizing the free energy over the factorized distributions, Eqs. (4.2) and (4.3). In this example, we no longer have $\xi$ satisfying Eq. (5.42) in general. However, if the model $p(H, D|w)$ and the prior $p(w)$ are included in the exponential family, $h(\xi)$ as well as $p(H|D, \xi)$ and $q_\xi(w;\xi)$ are expressed by expected sufficient statistics, the number of which is equal to the dimensionality of $w$ (Beal, 2003). In that case, it is not necessary to obtain $\xi$ explicitly but only to store and update the expected sufficient statistics instead.
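The identification of the Bregman divergence of the log-sum-exp function with a KL divergence can be verified numerically for a discrete latent variable. In the sketch below, the vectors `h_w` and `h_xi` are random stand-ins for the values $\log p(D, H, w)$ and $\log p(D, H, \xi)$ over five latent states. The helper names are hypothetical.

```python
import numpy as np

def logsumexp(h):
    """phi(h) = log sum_n exp(h_n), computed stably."""
    top = np.max(h)
    return top + np.log(np.sum(np.exp(h - top)))

def softmax(h):
    """Gradient of log-sum-exp, i.e., the normalized distribution over H."""
    return np.exp(h - logsumexp(h))

def bregman_lse(h_w, h_xi):
    """d_phi(h(w), h(xi)) of Eq. (5.21) for phi = log-sum-exp."""
    return logsumexp(h_w) - logsumexp(h_xi) - (h_w - h_xi) @ softmax(h_xi)

def kl(p, q):
    return np.sum(p * (np.log(p) - np.log(q)))

rng = np.random.default_rng(0)
h_w = rng.normal(size=5)     # stand-in for log p(D, H, w) over 5 latent states
h_xi = rng.normal(size=5)    # stand-in for log p(D, H, xi)

d = bregman_lse(h_w, h_xi)
k = kl(softmax(h_xi), softmax(h_w))   # KL(p(H|D, xi) || p(H|D, w))
print(d, k)
```

The two printed numbers agree up to rounding, and both are nonnegative, which is the content of Inequality (5.40).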
Part III Nonasymptotic Theory
6 Global VB Solution of Fully Observed Matrix Factorization
Variational Bayesian (VB) learning has shown good performance in many applications. However, VB learning sometimes gives a seemingly different posterior and exhibits different sparsity behavior from full Bayesian learning. For example, Figure 6.1 compares the Bayes posterior (left) and the VB posterior (right) of 1 × 1 matrix factorization. The VB posterior tries to approximate a two-mode Bayes posterior with a single-mode Gaussian, which results in the zero-mean Gaussian posterior with the VB estimator $\widehat{BA^\top} = 0$. This behavior makes the VB estimator exactly sparse as shown in Figure 6.2: thresholding is observed for the VB estimator, while no thresholding is observed for the full Bayesian estimator. MacKay (2001) discussed the sparsity of VB learning as an artifact by showing inappropriate model pruning in mixture models. These facts might undermine the justification of VB learning that rests solely on its being a tractable approximation to Bayesian learning. Can we clarify the behavior of VB learning, and directly justify its use as an inference method? The nonasymptotic theory, introduced in Part III, gives some answer to this question.

In this chapter, we derive an analytic form of the global VB solution of fully observed matrix factorization (MF). The analytic-form solution allows us to discuss the behavior of VB learning intuitively (Chapter 7), and further analysis gives theoretical guarantees of the performance of VB learning (Chapter 8). The analytic-form global solution naturally leads to efficient and reliable algorithms (Chapter 9), which are extended to other similar models (Chapters 10 and 11). Relation to MAP learning and partially Bayesian learning is also theoretically investigated (Chapter 12).
Figure 6.1 The Bayes posterior (left) and the VB posterior (right) of the 1 × 1 MF model $V = BA + E$ with almost flat prior, when $V = 1$ is observed ($E$ is the standard Gaussian noise). VB approximates the Bayes posterior having two modes by an origin-centered Gaussian, which induces sparsity. [Contour plots over $A$ (horizontal axis) and $B$ (vertical axis), each ranging from −3 to 3; panel titles "Bayes posterior (V = 1)" and "VB posterior (V = 1)"; the left panel marks the MAP estimators and the right panel marks the VB estimator $(A, B) = (0, 0)$.]

Figure 6.2 Behavior of the estimators of $\widehat{U} = \widehat{BA^\top}$ as a function of the observed value $V$. The VB estimator is zero when $V \le 1$, which indicates exact sparsity. On the other hand, the Bayesian estimator shows no sign of sparsity. The maximum likelihood estimator, i.e., $\widehat{U} = V$, is shown as a reference. [Line plot with both axes ranging from 0 to 3.]
6.1 Problem Description

We first summarize the MF model and its VB learning algorithm, which was derived in Section 3.1. The likelihood and priors are given as
$$p(V|A, B) \propto \exp\left(-\frac{1}{2\sigma^2}\left\|V - BA^\top\right\|_{\mathrm{Fro}}^2\right), \tag{6.1}$$
$$p(A) \propto \exp\left(-\frac{1}{2}\mathrm{tr}\left(AC_A^{-1}A^\top\right)\right), \tag{6.2}$$
$$p(B) \propto \exp\left(-\frac{1}{2}\mathrm{tr}\left(BC_B^{-1}B^\top\right)\right), \tag{6.3}$$
where the prior covariances are restricted to be diagonal:
$$C_A = \mathrm{Diag}(c_{a_1}^2, \dots, c_{a_H}^2), \qquad C_B = \mathrm{Diag}(c_{b_1}^2, \dots, c_{b_H}^2),$$
for $c_{a_h}, c_{b_h} > 0$, $h = 1, \dots, H$. Without loss of generality, we assume that the diagonal entries of the product $C_A C_B$ are arranged in nonincreasing order, i.e., $c_{a_h}c_{b_h} \ge c_{a_{h'}}c_{b_{h'}}$ for any pair $h < h'$. We assume that
$$L \le M. \tag{6.4}$$
If $L > M$, we may simply redefine the transpose $V^\top$ as $V$ so that $L \le M$ holds. Therefore, the assumption (6.4) does not impose any restriction. We solve the following VB learning problem:
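$$\hat r = \mathop{\mathrm{argmin}}_r F(r) \quad \text{s.t.} \quad r(A, B) = r_A(A)\,r_B(B), \tag{6.5}$$
where the objective function to be minimized is the free energy:
$$F = \left\langle\log\frac{r_A(A)\,r_B(B)}{p(V|A, B)\,p(A)\,p(B)}\right\rangle_{r_A(A)r_B(B)}.$$
The solution to the problem (6.5) is in the following form:
$$r_A(A) = \mathrm{MGauss}_{M,H}(A; \hat A, I_M \otimes \hat\Sigma_A) \propto \exp\left(-\frac{\mathrm{tr}\left((A - \hat A)\hat\Sigma_A^{-1}(A - \hat A)^\top\right)}{2}\right), \tag{6.6}$$
$$r_B(B) = \mathrm{MGauss}_{L,H}(B; \hat B, I_L \otimes \hat\Sigma_B) \propto \exp\left(-\frac{\mathrm{tr}\left((B - \hat B)\hat\Sigma_B^{-1}(B - \hat B)^\top\right)}{2}\right). \tag{6.7}$$
With the variational parameters $\hat A$, $\hat\Sigma_A$, $\hat B$, $\hat\Sigma_B$, the free energy can be explicitly written as
$$2F = LM\log(2\pi\sigma^2) + \frac{\left\|V - \hat B\hat A^\top\right\|_{\mathrm{Fro}}^2}{\sigma^2} + M\log\frac{\det(C_A)}{\det(\hat\Sigma_A)} + L\log\frac{\det(C_B)}{\det(\hat\Sigma_B)} - (L + M)H$$
$$+ \mathrm{tr}\left\{C_A^{-1}\left(\hat A^\top\hat A + M\hat\Sigma_A\right)\right\} + \mathrm{tr}\left\{C_B^{-1}\left(\hat B^\top\hat B + L\hat\Sigma_B\right)\right\}$$
$$+ \sigma^{-2}\,\mathrm{tr}\left\{-\hat A^\top\hat A\hat B^\top\hat B + \left(\hat A^\top\hat A + M\hat\Sigma_A\right)\left(\hat B^\top\hat B + L\hat\Sigma_B\right)\right\}. \tag{6.8}$$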
The stationary conditions for the variational parameters are given by
$$\hat A = \sigma^{-2}V^\top\hat B\hat\Sigma_A, \tag{6.9}$$
$$\hat\Sigma_A = \sigma^2\left(\hat B^\top\hat B + L\hat\Sigma_B + \sigma^2 C_A^{-1}\right)^{-1}, \tag{6.10}$$
$$\hat B = \sigma^{-2}V\hat A\hat\Sigma_B, \tag{6.11}$$
$$\hat\Sigma_B = \sigma^2\left(\hat A^\top\hat A + M\hat\Sigma_A + \sigma^2 C_B^{-1}\right)^{-1}. \tag{6.12}$$
In the subsequent sections, we derive the global solution to the problem (6.5) in an analytic form, which was obtained in Nakajima et al. (2013a, 2015).
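Before turning to the analytic solution, it can help to see the stationary conditions used as an iterative local search. The sketch below alternately substitutes Eqs. (6.10), (6.9), (6.12), and (6.11) and tracks the free energy (6.8). It is an illustration of the conditions only and is not the global solver derived in this chapter. The function names `free_energy` and `vb_mf`, the random initialization, and the update order are assumptions.

```python
import numpy as np

def free_energy(V, A, B, SA, SB, ca2, cb2, sigma2):
    """Free energy F of Eq. (6.8); ca2 and cb2 hold the diagonals of C_A and C_B."""
    L, M = V.shape
    H = A.shape[1]
    AA = A.T @ A + M * SA
    BB = B.T @ B + L * SB
    two_F = (L * M * np.log(2.0 * np.pi * sigma2)
             + np.sum((V - B @ A.T) ** 2) / sigma2
             + M * (np.sum(np.log(ca2)) - np.linalg.slogdet(SA)[1])
             + L * (np.sum(np.log(cb2)) - np.linalg.slogdet(SB)[1])
             - (L + M) * H
             + np.sum(np.diag(AA) / ca2) + np.sum(np.diag(BB) / cb2)
             + np.trace(-(A.T @ A) @ (B.T @ B) + AA @ BB) / sigma2)
    return 0.5 * two_F

def vb_mf(V, H, ca2, cb2, sigma2, n_iter=200, seed=0):
    """Alternate the stationary conditions (6.9)-(6.12) as coordinate updates."""
    rng = np.random.default_rng(seed)
    L, M = V.shape
    A = rng.normal(size=(M, H))
    B = rng.normal(size=(L, H))
    SA, SB = np.eye(H), np.eye(H)
    history = []
    for _ in range(n_iter):
        SA = sigma2 * np.linalg.inv(B.T @ B + L * SB + sigma2 * np.diag(1.0 / ca2))
        A = V.T @ B @ SA / sigma2
        SB = sigma2 * np.linalg.inv(A.T @ A + M * SA + sigma2 * np.diag(1.0 / cb2))
        B = V @ A @ SB / sigma2
        history.append(free_energy(V, A, B, SA, SB, ca2, cb2, sigma2))
    return A, B, SA, SB, np.array(history)
```

Each pair of updates minimizes the free energy over one factor with the other held fixed, so the recorded values do not increase. Such an iteration only reaches a stationary point, however, which is why the analytic global solution is of interest.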
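6.2 Conditions for VB Solutions

With the explicit form (6.8) of the free energy, the VB learning problem (6.5) can be written as a minimization problem with respect to a finite number of variables:
$$\text{Given} \quad C_A, C_B \in \mathbb{D}_{++}^H, \quad \sigma^2 \in \mathbb{R}_{++},$$
$$\min_{\hat A, \hat B, \hat\Sigma_A, \hat\Sigma_B} F \tag{6.13}$$
$$\text{s.t.} \quad \hat A \in \mathbb{R}^{M\times H}, \quad \hat B \in \mathbb{R}^{L\times H}, \quad \hat\Sigma_A, \hat\Sigma_B \in \mathbb{S}_{++}^H. \tag{6.14}$$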
We can easily show that the solution is a stationary point of the free energy.

Lemma 6.1 Any local solution of the problem (6.13) is a stationary point of the free energy (6.8).

Proof Since
$$\left\|V - \hat B\hat A^\top\right\|_{\mathrm{Fro}}^2 \ge 0,$$
and
$$\mathrm{tr}\left\{-\hat A^\top\hat A\hat B^\top\hat B + \left(\hat A^\top\hat A + M\hat\Sigma_A\right)\left(\hat B^\top\hat B + L\hat\Sigma_B\right)\right\} = \mathrm{tr}\left\{L\hat A^\top\hat A\hat\Sigma_B + M\hat\Sigma_A\hat B^\top\hat B + LM\hat\Sigma_A\hat\Sigma_B\right\} \ge 0,$$
the free energy (6.8) is lower-bounded as
$$2F \ge -M\log\det(\hat\Sigma_A) - L\log\det(\hat\Sigma_B) + \mathrm{tr}\left\{C_A^{-1}\left(\hat A^\top\hat A + M\hat\Sigma_A\right)\right\} + \mathrm{tr}\left\{C_B^{-1}\left(\hat B^\top\hat B + L\hat\Sigma_B\right)\right\} + \tau, \tag{6.15}$$
where $\tau$ is a finite constant. The right-hand side of Eq. (6.15) diverges to $+\infty$ if any entry of $\hat A$ or $\hat B$ goes to $+\infty$ or $-\infty$. It also diverges if any eigenvalue of $\hat\Sigma_A$ or $\hat\Sigma_B$ goes to $+0$ or $\infty$. This implies that no local solution exists on the boundary of (the closure of) the domain (6.14). Since the free energy is differentiable in the domain (6.14), any local minimizer is a stationary point. For any observed matrix $V$, the free energy (6.8) can be finite, for example, at $\hat A = 0_{M,H}$, $\hat B = 0_{L,H}$, and $\hat\Sigma_A = \hat\Sigma_B = I_H$, where $0_{D_1,D_2}$ denotes the $D_1 \times D_2$ matrix with all the entries equal to zero. Therefore, at least one minimizer always exists, which completes the proof of Lemma 6.1.

Lemma 6.1 implies that the stationary conditions (6.9) through (6.12) are satisfied for any solution. Accordingly, we can obtain the global solution by finding all points that satisfy the stationary conditions. However, the condition involves $O(MH)$ unknowns, and therefore finding all such candidate points seems hard. The first step to tackle this problem is to find hidden separability, which enables us to decompose the problem so that each problem involves only $O(1)$ unknowns.
6.3 Irrelevant Degrees of Freedom

Most of the terms in the free energy (6.8) have symmetry, i.e., they are invariant with respect to the coordinate change shown in Eqs. (6.16) and (6.17). Assume that $(\hat A^*, \hat B^*, \hat\Sigma_A^*, \hat\Sigma_B^*)$ is a global solution of the VB problem (6.13), and let
$$F^* = F(\hat A^*, \hat B^*, \hat\Sigma_A^*, \hat\Sigma_B^*)$$
be the minimum free energy. Consider the following rotation of the coordinate system for an arbitrary orthogonal matrix $\Omega \in \mathbb{R}^{H\times H}$:
$$\hat A = \hat A^*\Omega^\top, \qquad \hat\Sigma_A = \Omega\hat\Sigma_A^*\Omega^\top, \tag{6.16}$$
$$\hat B = \hat B^*\Omega^\top, \qquad \hat\Sigma_B = \Omega\hat\Sigma_B^*\Omega^\top. \tag{6.17}$$
We can easily confirm that the terms in Eq. (6.8) except the sixth and the seventh terms are invariant with respect to the rotation, and the free energy can be written as a function of $\Omega$ as follows:
$$2F(\Omega) = \mathrm{tr}\left\{C_A^{-1}\Omega\left(\hat A^{*\top}\hat A^* + M\hat\Sigma_A^*\right)\Omega^\top\right\} + \mathrm{tr}\left\{C_B^{-1}\Omega\left(\hat B^{*\top}\hat B^* + L\hat\Sigma_B^*\right)\Omega^\top\right\} + \mathrm{const.}$$
To find the irrelevant degrees of freedom, we consider skewed rotations that only affect a single term in Eq. (6.8). Consider the following transform:
$$\hat A = \hat A^* C_A^{-1/2}\Omega^\top C_A^{1/2}, \qquad \hat\Sigma_A = C_A^{1/2}\Omega C_A^{-1/2}\hat\Sigma_A^* C_A^{-1/2}\Omega^\top C_A^{1/2}, \tag{6.18}$$
$$\hat B = \hat B^* C_A^{1/2}\Omega^\top C_A^{-1/2}, \qquad \hat\Sigma_B = C_A^{-1/2}\Omega C_A^{1/2}\hat\Sigma_B^* C_A^{1/2}\Omega^\top C_A^{-1/2}. \tag{6.19}$$
Then, the free energy can be written as
$$2F(\Omega) = \mathrm{tr}\left\{\Gamma\Omega\Phi\Omega^\top\right\} + \mathrm{const.}, \tag{6.20}$$
where
$$\Gamma = C_A^{-1}C_B^{-1}, \qquad \Phi = C_A^{1/2}\left(\hat B^{*\top}\hat B^* + L\hat\Sigma_B^*\right)C_A^{1/2}.$$
By assumption, $\Omega = I_H$ is a minimizer of Eq. (6.20), i.e., $F(I_H) = F^*$. Now we can use the following lemma:

Lemma 6.2 Let $\Gamma, \Omega, \Phi \in \mathbb{R}^{H\times H}$ be a nondegenerate diagonal matrix, an orthogonal matrix, and a symmetric matrix, respectively. Let $\{\Lambda^{(k)}, \Lambda'^{(k)} \in \mathbb{R}^{H\times H}; k = 1, \dots, K\}$ be arbitrary diagonal matrices. If a function
$$G(\Omega) = \mathrm{tr}\left\{\Gamma\Omega\Phi\Omega^\top + \sum_{k=1}^K\Lambda^{(k)}\Omega\Lambda'^{(k)}\Omega^\top\right\} \tag{6.21}$$
is minimized (as a function of $\Omega$, given $\Gamma$, $\Phi$, $\{\Lambda^{(k)}, \Lambda'^{(k)}\}$) at $\Omega = I_H$, then $\Phi$ is diagonal. Here, $K$ can be any natural number including $K = 0$ (when only the first term exists).

Proof Let
$$\Phi = \Omega'\Gamma'\Omega'^\top \tag{6.22}$$
be the eigenvalue decomposition of $\Phi$. Let $\gamma$, $\gamma'$, $\{\lambda^{(k)}\}$, $\{\lambda'^{(k)}\}$ be the vectors consisting of the diagonal entries of $\Gamma$, $\Gamma'$, $\{\Lambda^{(k)}\}$, $\{\Lambda'^{(k)}\}$, respectively, i.e.,
$$\Gamma = \mathrm{Diag}(\gamma), \quad \Gamma' = \mathrm{Diag}(\gamma'), \quad \Lambda^{(k)} = \mathrm{Diag}(\lambda^{(k)}), \quad \Lambda'^{(k)} = \mathrm{Diag}(\lambda'^{(k)}).$$
Then, Eq. (6.21) can be written as
$$G(\Omega) = \mathrm{tr}\left\{\Gamma\Omega\Phi\Omega^\top + \sum_{k=1}^K\Lambda^{(k)}\Omega\Lambda'^{(k)}\Omega^\top\right\} = \gamma^\top Q\gamma' + \sum_{k=1}^K\lambda^{(k)\top}R\lambda'^{(k)}, \tag{6.23}$$
where
$$Q = (\Omega\Omega')\odot(\Omega\Omega'), \qquad R = \Omega\odot\Omega.$$
Here, $\odot$ denotes the Hadamard product. Since $Q$ as well as $R$ is the Hadamard square of an orthogonal matrix, it is doubly stochastic (i.e., each of its columns and rows sums to one) (Marshall et al., 2009). Therefore, it can be seen that $Q$ reassigns the components of $\gamma$ to those of $\gamma'$ when calculating the elementwise product in the first term of Eq. (6.23). The same applies to $R$ and $\{\lambda^{(k)}, \lambda'^{(k)}\}$ in the second term. Naturally, rearranging the components of $\gamma$ in nondecreasing order and the components of $\gamma'$ in nonincreasing order minimizes $\gamma^\top Q\gamma'$ (Ruhe, 1970; Marshall et al., 2009).

Using the expression (6.23) with $Q$ and $R$, we will prove that $\Phi$ is diagonal if $\Omega = I_H$ minimizes Eq. (6.23). Let us consider a bilateral perturbation $\Omega = \Delta$ such that the 2 × 2 matrix $\Delta_{(h,h')}$ consisting of the $h$th and the $h'$th columns and rows forms a 2 × 2 orthogonal matrix
$$\Delta_{(h,h')} = \begin{pmatrix}\cos\theta & -\sin\theta\\ \sin\theta & \cos\theta\end{pmatrix},$$
and the remaining entries coincide with those of the identity matrix. Then, the elements of $Q$ become
$$Q_{i,j} = \begin{cases}(\Omega'_{h,j}\cos\theta - \Omega'_{h',j}\sin\theta)^2 & \text{if } i = h,\\ (\Omega'_{h,j}\sin\theta + \Omega'_{h',j}\cos\theta)^2 & \text{if } i = h',\\ \Omega'^2_{i,j} & \text{otherwise},\end{cases}$$
and Eq. (6.23) can be written as a function of $\theta$:
$$G(\theta) = \sum_{j=1}^H\left\{\gamma_h(\Omega'_{h,j}\cos\theta - \Omega'_{h',j}\sin\theta)^2 + \gamma_{h'}(\Omega'_{h,j}\sin\theta + \Omega'_{h',j}\cos\theta)^2\right\}\gamma'_j$$
$$+ \sum_{k=1}^K\begin{pmatrix}\lambda_h^{(k)} & \lambda_{h'}^{(k)}\end{pmatrix}\begin{pmatrix}\cos^2\theta & \sin^2\theta\\ \sin^2\theta & \cos^2\theta\end{pmatrix}\begin{pmatrix}\lambda'^{(k)}_h\\ \lambda'^{(k)}_{h'}\end{pmatrix} + \mathrm{const.} \tag{6.24}$$
Since Eq. (6.24) is differentiable at $\theta = 0$, our assumption that Eq. (6.23) is minimized when $\Omega = I_H$ requires that $\theta = 0$ is a stationary point of Eq. (6.24) for any $h \ne h'$. Therefore, it holds that
$$0 = \left.\frac{\partial G}{\partial\theta}\right|_{\theta=0} = 2\sum_j\Big\{\gamma_h(\Omega'_{h,j}\cos\theta - \Omega'_{h',j}\sin\theta)(-\Omega'_{h,j}\sin\theta - \Omega'_{h',j}\cos\theta)$$
$$+ \gamma_{h'}(\Omega'_{h,j}\sin\theta + \Omega'_{h',j}\cos\theta)(\Omega'_{h,j}\cos\theta - \Omega'_{h',j}\sin\theta)\Big\}\gamma'_j\Big|_{\theta=0}$$
$$= 2(\gamma_{h'} - \gamma_h)\sum_j\Omega'_{h,j}\gamma'_j\Omega'_{h',j} = 2(\gamma_{h'} - \gamma_h)\Phi_{h,h'}. \tag{6.25}$$
In the last equality, we used Eq. (6.22). Since we assumed that $\Gamma$ is nondegenerate ($\gamma_h \ne \gamma_{h'}$ for $h \ne h'$), Eq. (6.25) implies that $\Phi$ is diagonal, which completes the proof of Lemma 6.2.

Assume for simplicity that $\Gamma = C_A^{-1}C_B^{-1}$ is nondegenerate, i.e., no pair of diagonal entries coincide, in Eq. (6.20). Then, since Eq. (6.20) is minimized at $\Omega = I_H$, Lemma 6.2 implies that $\Phi = C_A^{1/2}\left(\hat B^{*\top}\hat B^* + L\hat\Sigma_B^*\right)C_A^{1/2}$ is diagonal. This means that $\hat B^{*\top}\hat B^* + L\hat\Sigma_B^*$ is diagonal. Thus, the stationary condition (6.10) implies that $\hat\Sigma_A^*$ is diagonal. Similarly, we can find that $\hat\Sigma_B^*$ is diagonal, if $\Gamma = C_A^{-1}C_B^{-1}$ is nondegenerate.

To generalize the preceding discussion to degenerate cases, we need to consider an equivalent solution, defined as follows:

Definition 6.3 (Equivalent solutions) We say that two points $(\hat A, \hat B, \hat\Sigma_A, \hat\Sigma_B)$ and $(\hat A', \hat B', \hat\Sigma'_A, \hat\Sigma'_B)$ are equivalent if both give the same free energy and the same mean prediction, i.e.,
$$F(\hat A, \hat B, \hat\Sigma_A, \hat\Sigma_B) = F(\hat A', \hat B', \hat\Sigma'_A, \hat\Sigma'_B) \quad \text{and} \quad \hat B\hat A^\top = \hat B'\hat A'^\top.$$

With this definition, we can obtain the following theorem (its proof is given in the next section):

Theorem 6.4 When $C_A C_B$ is nondegenerate (i.e., $c_{a_h}c_{b_h} > c_{a_{h'}}c_{b_{h'}}$ for any pair $h < h'$), any solution of the problem (6.13) has diagonal $\hat\Sigma_A$ and $\hat\Sigma_B$. When $C_A C_B$ is degenerate, any solution has an equivalent solution with diagonal $\hat\Sigma_A$ and $\hat\Sigma_B$.

The result that the solution has diagonal $\hat\Sigma_A$ and $\hat\Sigma_B$ would be natural because we assumed the independent Gaussian priors on $A$ and $B$: the fact that any $V$ can be decomposed into orthogonal singular components may imply that the observation $V$ cannot convey any preference for singular-componentwise correlation. Note, however, that Theorem 6.4 does not necessarily hold when the observed matrix has missing entries.

Obviously, any VB solution (a solution of the problem (6.13)) with diagonal covariances can be found by solving the following problem:
$$\text{Given} \quad C_A, C_B \in \mathbb{D}_{++}^H, \quad \sigma^2 \in \mathbb{R}_{++},$$
$$\min_{\hat A, \hat B, \hat\Sigma_A, \hat\Sigma_B} F \tag{6.26}$$
$$\text{s.t.} \quad \hat A \in \mathbb{R}^{M\times H}, \quad \hat B \in \mathbb{R}^{L\times H}, \quad \hat\Sigma_A, \hat\Sigma_B \in \mathbb{D}_{++}^H, \tag{6.27}$$
which is equivalent to solving the SimpleVB learning problem (3.30) with columnwise independence, introduced in Section 3.1.1. Theorem 6.4 states that, if $C_A C_B$ is nondegenerate, the set of VB solutions and the set of SimpleVB solutions are identical. When $C_A C_B$ is degenerate, the set of VB solutions is the union of the set of SimpleVB solutions and the set of their equivalent solutions with nondiagonal covariances. Actually, any VB solution can be obtained by rotating its equivalent SimpleVB solution (VB solution with diagonal covariances) (see Section 6.4.4). In practice, however, it is sufficient to focus on the SimpleVB solutions, since equivalent solutions share the same free energy $F$ and the same mean prediction $\hat B\hat A^\top$. In this sense, we can conclude that the stronger columnwise independence constraint (3.29) does not degrade approximation accuracy, and the VB solution under the matrixwise independence (3.4) essentially agrees with the SimpleVB solution.
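The identity used in the proof of Lemma 6.2, $\mathrm{tr}\{\Gamma\Omega\Phi\Omega^\top\} = \gamma^\top Q\gamma'$ with a doubly stochastic $Q$, is easy to check numerically. The sketch below draws random orthogonal matrices via a QR decomposition. The variable names and the dimension $H = 4$ are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 4

def random_orthogonal(size):
    """Orthogonal factor of the QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.normal(size=(size, size)))
    return q

gamma = rng.uniform(0.5, 2.0, size=H)        # diagonal of Gamma
gamma_p = rng.uniform(0.5, 2.0, size=H)      # eigenvalues of Phi (gamma')
Omega = random_orthogonal(H)
Omega_p = random_orthogonal(H)               # eigenvectors of Phi (Omega')
Phi = Omega_p @ np.diag(gamma_p) @ Omega_p.T # Eq. (6.22)

# Hadamard square of the orthogonal matrix Omega Omega'.
Q = (Omega @ Omega_p) ** 2

lhs = np.trace(np.diag(gamma) @ Omega @ Phi @ Omega.T)
rhs = gamma @ Q @ gamma_p                    # first term of Eq. (6.23)
print(lhs, rhs, Q.sum(axis=0), Q.sum(axis=1))
```

The two scalars coincide, and all row and column sums of `Q` equal one. The trace is therefore a weighted average of products $\gamma_i\gamma'_j$, which is what allows the rearrangement argument in the proof.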
6.4 Proof of Theorem 6.4

In this section, we prove Theorem 6.4 by considering the following three cases separately:

Case 1 When no pair of diagonal entries of $C_A C_B$ coincide.
Case 2 When all diagonal entries of $C_A C_B$ coincide.
Case 3 When (not all but) some pairs of diagonal entries of $C_A C_B$ coincide.

We will prove that, in Case 1, $\hat\Sigma_A$ and $\hat\Sigma_B$ are diagonal for any solution $(\hat A, \hat B, \hat\Sigma_A, \hat\Sigma_B)$, and that, in the other cases, any solution has its equivalent solution with diagonal $\hat\Sigma_A$ and $\hat\Sigma_B$. Remember our assumption that the diagonal entries $\{c_{a_h} c_{b_h}\}$ of $C_A C_B$ are arranged in nonincreasing order.
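6.4.1 Proof for Case 1

Here, we consider the case where $c_{a_h} c_{b_h} > c_{a_{h'}} c_{b_{h'}}$ for any pair $h < h'$. Assume that $(A^*, B^*, \Sigma_A^*, \Sigma_B^*)$ is a minimizer of the free energy (6.8), and consider the following variation defined with an arbitrary $H \times H$ orthogonal matrix $\Omega$:

$$\hat A = A^* C_A^{-1/2}\, \Omega^\top C_A^{1/2}, \tag{6.28}$$
$$\hat B = B^* C_A^{1/2}\, \Omega^\top C_A^{-1/2}, \tag{6.29}$$
$$\hat\Sigma_A = C_A^{1/2}\, \Omega\, C_A^{-1/2}\, \Sigma_A^*\, C_A^{-1/2}\, \Omega^\top C_A^{1/2}, \tag{6.30}$$
$$\hat\Sigma_B = C_A^{-1/2}\, \Omega\, C_A^{1/2}\, \Sigma_B^*\, C_A^{1/2}\, \Omega^\top C_A^{-1/2}. \tag{6.31}$$

Note that this variation does not change $\hat B \hat A^\top$, and it holds that $(\hat A, \hat B, \hat\Sigma_A, \hat\Sigma_B) = (A^*, B^*, \Sigma_A^*, \Sigma_B^*)$ for $\Omega = I_H$. Then, the free energy (6.8) can be written as a function of $\Omega$:

$$F(\Omega) = \frac{1}{2} \operatorname{tr}\left\{ C_A^{-1} C_B^{-1}\, \Omega\, C_A^{1/2} \left( B^{*\top} B^* + L \Sigma_B^* \right) C_A^{1/2}\, \Omega^\top \right\} + \text{const.} \tag{6.32}$$

We define

$$\Phi = C_A^{1/2} \left( B^{*\top} B^* + L \Sigma_B^* \right) C_A^{1/2},$$

and rewrite Eq. (6.32) as

$$F(\Omega) = \frac{1}{2} \operatorname{tr}\left\{ C_A^{-1} C_B^{-1}\, \Omega\, \Phi\, \Omega^\top \right\} + \text{const.} \tag{6.33}$$

The assumption that $(A^*, B^*, \Sigma_A^*, \Sigma_B^*)$ is a minimizer requires that Eq. (6.33) is minimized when $\Omega = I_H$. Then, Lemma 6.2 (for K = 0) implies that $\Phi$ is diagonal. Therefore,

$$C_A^{-1/2}\, \Phi\, C_A^{-1/2} \;(= \Phi\, C_A^{-1}) = B^{*\top} B^* + L \Sigma_B^*$$

is also diagonal. Consequently, Eq. (6.10) implies that $\Sigma_A^*$ is diagonal.

Next, consider the following variation defined with an arbitrary $H \times H$ orthogonal matrix $\Omega'$:

$$\hat A = A^* C_B^{1/2}\, \Omega'^\top C_B^{-1/2},$$
$$\hat B = B^* C_B^{-1/2}\, \Omega'^\top C_B^{1/2},$$
$$\hat\Sigma_A = C_B^{-1/2}\, \Omega'\, C_B^{1/2}\, \Sigma_A^*\, C_B^{1/2}\, \Omega'^\top C_B^{-1/2},$$
$$\hat\Sigma_B = C_B^{1/2}\, \Omega'\, C_B^{-1/2}\, \Sigma_B^*\, C_B^{-1/2}\, \Omega'^\top C_B^{1/2}.$$

Then, the free energy as a function of $\Omega'$ is given by

$$F(\Omega') = \frac{1}{2} \operatorname{tr}\left\{ C_A^{-1} C_B^{-1}\, \Omega'\, C_B^{1/2} \left( A^{*\top} A^* + M \Sigma_A^* \right) C_B^{1/2}\, \Omega'^\top \right\} + \text{const.}$$

From this, we can similarly prove that $\Sigma_B^*$ is also diagonal, which completes the proof for Case 1.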
6.4.2 Proof for Case 2

Here, we consider the case where $C_A C_B = c^2 I_H$ for some $c^2 \in \mathbb{R}_{++}$. In this case, there exist solutions with nondiagonal covariances. However, for each of those nondiagonal solutions, the equivalence class to which the (nondiagonal) solution belongs contains a solution with diagonal covariances.

We can easily show that the free energy (6.8) is invariant with respect to $\Omega$ under the transformation (6.28) through (6.31). This arbitrariness forms an equivalence class of solutions. Since there exists $\Omega$ that diagonalizes any given $\Sigma_A^*$ through Eq. (6.30), each equivalence class contains a solution with diagonal $\hat\Sigma_A$. In the following, we will prove that any solution with diagonal $\hat\Sigma_A$ has diagonal $\hat\Sigma_B$.

Assume that $(A^*, B^*, \Sigma_A^*, \Sigma_B^*)$ is a solution with diagonal $\Sigma_A^*$, and consider the following variation defined with an arbitrary $H \times H$ orthogonal matrix $\Omega$:

$$\hat A = A^* C_A^{-1/2}\, \Gamma^{-1/2}\, \Omega^\top\, \Gamma^{1/2}\, C_A^{1/2},$$
$$\hat B = B^* C_A^{1/2}\, \Gamma^{1/2}\, \Omega^\top\, \Gamma^{-1/2}\, C_A^{-1/2},$$
$$\hat\Sigma_A = C_A^{1/2}\, \Gamma^{1/2}\, \Omega\, \Gamma^{-1/2}\, C_A^{-1/2}\, \Sigma_A^*\, C_A^{-1/2}\, \Gamma^{-1/2}\, \Omega^\top\, \Gamma^{1/2}\, C_A^{1/2},$$
$$\hat\Sigma_B = C_A^{-1/2}\, \Gamma^{-1/2}\, \Omega\, \Gamma^{1/2}\, C_A^{1/2}\, \Sigma_B^*\, C_A^{1/2}\, \Gamma^{1/2}\, \Omega^\top\, \Gamma^{-1/2}\, C_A^{-1/2}.$$

Here, $\Gamma = \operatorname{Diag}(\gamma_1, \ldots, \gamma_H)$ is an arbitrary nondegenerate ($\gamma_h \ne \gamma_{h'}$ for $h \ne h'$) positive-definite diagonal matrix. Then, the free energy can be written as a function of $\Omega$:

$$F(\Omega) = \frac{1}{2} \operatorname{tr}\Big\{ \Gamma\, \Omega\, \Gamma^{-1/2} C_A^{-1/2} \left( A^{*\top} A^* + M \Sigma_A^* \right) C_A^{-1/2} \Gamma^{-1/2}\, \Omega^\top + c^{-2}\, \Gamma^{-1}\, \Omega\, \Gamma^{1/2} C_A^{1/2} \left( B^{*\top} B^* + L \Sigma_B^* \right) C_A^{1/2} \Gamma^{1/2}\, \Omega^\top \Big\}. \tag{6.34}$$

We define

$$\Phi_A = \Gamma^{-1/2} C_A^{-1/2} \left( A^{*\top} A^* + M \Sigma_A^* \right) C_A^{-1/2} \Gamma^{-1/2},$$
$$\Phi_B = c^{-2}\, \Gamma^{1/2} C_A^{1/2} \left( B^{*\top} B^* + L \Sigma_B^* \right) C_A^{1/2} \Gamma^{1/2},$$

and rewrite Eq. (6.34) as

$$F(\Omega) = \frac{1}{2} \operatorname{tr}\left\{ \Gamma\, \Omega\, \Phi_A\, \Omega^\top + \Gamma^{-1}\, \Omega\, \Phi_B\, \Omega^\top \right\}. \tag{6.35}$$

Since $\Sigma_A^*$ is diagonal, Eq. (6.10) implies that $\Phi_B$ is diagonal. The assumption that $(A^*, B^*, \Sigma_A^*, \Sigma_B^*)$ is a solution requires that Eq. (6.35) is minimized when $\Omega = I_H$. Accordingly, Lemma 6.2 implies that $\Phi_A$ is diagonal. Consequently, Eq. (6.12) implies that $\Sigma_B^*$ is diagonal. Thus, we have proved that any solution has its equivalent solution with diagonal covariances, which completes the proof for Case 2.
6.4.3 Proof for Case 3

Finally, we consider the case where $c_{a_h} c_{b_h} = c_{a_{h'}} c_{b_{h'}}$ for (not all but) some pairs $h \ne h'$. First, in the same way as Case 1, we can prove that $\hat\Sigma_A$ and $\hat\Sigma_B$ are block diagonal, where the blocks correspond to the groups sharing the same $c_{a_h} c_{b_h}$. Next, we can apply the proof for Case 2 to each block, and show that any solution has its equivalent solution with diagonal $\hat\Sigma_A$ and $\hat\Sigma_B$. Combining these results completes the proof of Theorem 6.4.

6.4.4 General Expression

In summary, for any minimizer of Eq. (6.8), the covariances can be written in the following form:

$$\hat\Sigma_A = C_A^{1/2}\, \Theta\, C_A^{-1/2}\, \Gamma_{\hat\Sigma_A}\, C_A^{-1/2}\, \Theta^\top C_A^{1/2} \;\left(= C_B^{-1/2}\, \Theta\, C_B^{1/2}\, \Gamma_{\hat\Sigma_A}\, C_B^{1/2}\, \Theta^\top C_B^{-1/2}\right), \tag{6.36}$$
$$\hat\Sigma_B = C_A^{-1/2}\, \Theta\, C_A^{1/2}\, \Gamma_{\hat\Sigma_B}\, C_A^{1/2}\, \Theta^\top C_A^{-1/2} \;\left(= C_B^{1/2}\, \Theta\, C_B^{-1/2}\, \Gamma_{\hat\Sigma_B}\, C_B^{-1/2}\, \Theta^\top C_B^{1/2}\right). \tag{6.37}$$

Here, $\Gamma_{\hat\Sigma_A}$ and $\Gamma_{\hat\Sigma_B}$ are positive-definite diagonal matrices, and $\Theta$ is a block diagonal matrix such that the blocks correspond to the groups sharing the same $c_{a_h} c_{b_h}$, and each block consists of an orthogonal matrix. Furthermore, if there exists a solution with $(\hat\Sigma_A, \hat\Sigma_B)$ written in the form of Eqs. (6.36) and (6.37) with a certain set of $(\Gamma_{\hat\Sigma_A}, \Gamma_{\hat\Sigma_B}, \Theta)$, then there also exist its equivalent solutions with the same $(\Gamma_{\hat\Sigma_A}, \Gamma_{\hat\Sigma_B})$ for any $\Theta$. Focusing on the solution with $\Theta = I_H$ as the representative of each equivalence class, we can assume that $\hat\Sigma_A$ and $\hat\Sigma_B$ are diagonal without loss of generality.
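6.5 Problem Decomposition

As discussed in Section 6.3, Theorem 6.4 allows us to focus on the solutions that have diagonal posterior covariances, i.e., $\hat\Sigma_A, \hat\Sigma_B \in \mathbb{D}_{++}^H$. For any solution with diagonal covariances, the stationary conditions (6.10) and (6.12) (with Lemma 6.1) imply that $\hat A^\top \hat A$ and $\hat B^\top \hat B$ are also diagonal, which means that the column vectors of $\hat A$, as well as those of $\hat B$, are orthogonal to each other. In such a case, the free energy (6.8) depends on the column vectors of $\hat A$ and $\hat B$ only through the second term

$$\sigma^{-2} \left\| V - \hat B \hat A^\top \right\|_{\mathrm{Fro}}^2,$$

which coincides with the objective function for the singular value decomposition (SVD). This leads to the following lemma:

Lemma 6.5 Let

$$V = \sum_{h=1}^{L} \gamma_h\, \omega_{b_h} \omega_{a_h}^\top \tag{6.38}$$

be the SVD of $V$, where $\gamma_h\ (\ge 0)$ is the $h$th largest singular value, and $\omega_{a_h}$ and $\omega_{b_h}$ are the associated right and left singular vectors. Then, any VB solution (with diagonal posterior covariances) can be written as

$$\hat B \hat A^\top = \sum_{h=1}^{H} \hat\gamma_h^{\mathrm{VB}}\, \omega_{b_h} \omega_{a_h}^\top \tag{6.39}$$

for some $\{\hat\gamma_h^{\mathrm{VB}} \ge 0\}$.

Thanks to Theorem 6.4 and Lemma 6.5, the variational parameters $\hat A = (\hat{\boldsymbol a}_1, \ldots, \hat{\boldsymbol a}_H)$, $\hat B = (\hat{\boldsymbol b}_1, \ldots, \hat{\boldsymbol b}_H)$, $\hat\Sigma_A$, $\hat\Sigma_B$ can be expressed as

$$\hat{\boldsymbol a}_h = \hat a_h\, \omega_{a_h}, \qquad \hat{\boldsymbol b}_h = \hat b_h\, \omega_{b_h},$$
$$\hat\Sigma_A = \operatorname{Diag}\left(\hat\sigma_{a_1}^2, \ldots, \hat\sigma_{a_H}^2\right), \qquad \hat\Sigma_B = \operatorname{Diag}\left(\hat\sigma_{b_1}^2, \ldots, \hat\sigma_{b_H}^2\right),$$

with a new set of unknowns $\{\hat a_h, \hat b_h \in \mathbb{R},\ \hat\sigma_{a_h}^2, \hat\sigma_{b_h}^2 \in \mathbb{R}_{++}\}_{h=1}^H$. Thus, the following holds: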
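Corollary 6.6 The VB posterior can be written as

$$r(A, B) = \prod_{h=1}^{H} \mathrm{Gauss}_M\left(\boldsymbol a_h;\ \hat a_h \omega_{a_h},\ \hat\sigma_{a_h}^2 I_M\right) \prod_{h=1}^{H} \mathrm{Gauss}_L\left(\boldsymbol b_h;\ \hat b_h \omega_{b_h},\ \hat\sigma_{b_h}^2 I_L\right), \tag{6.40}$$

where $\{\hat a_h, \hat b_h, \hat\sigma_{a_h}^2, \hat\sigma_{b_h}^2\}_{h=1}^H$ are the solution of the following minimization problem:

$$\text{Given}\quad \sigma^2 \in \mathbb{R}_{++},\quad \{c_{a_h}^2, c_{b_h}^2 \in \mathbb{R}_{++}\}_{h=1}^H,$$
$$\min_{\{\hat a_h,\, \hat b_h,\, \hat\sigma_{a_h}^2,\, \hat\sigma_{b_h}^2\}_{h=1}^H} F \tag{6.41}$$
$$\text{s.t.}\quad \{\hat a_h, \hat b_h \in \mathbb{R},\ \hat\sigma_{a_h}^2, \hat\sigma_{b_h}^2 \in \mathbb{R}_{++}\}_{h=1}^H.$$

Here, $F$ is the free energy (6.8), which can be written as

$$2F = LM \log(2\pi\sigma^2) + \frac{\sum_{h=1}^{L} \gamma_h^2}{\sigma^2} + \sum_{h=1}^{H} 2F_h, \tag{6.42}$$

where

$$2F_h = M \log\frac{c_{a_h}^2}{\hat\sigma_{a_h}^2} + L \log\frac{c_{b_h}^2}{\hat\sigma_{b_h}^2} + \frac{\hat a_h^2 + M\hat\sigma_{a_h}^2}{c_{a_h}^2} + \frac{\hat b_h^2 + L\hat\sigma_{b_h}^2}{c_{b_h}^2} - (L + M) + \frac{-2\hat a_h \hat b_h \gamma_h + \left(\hat a_h^2 + M\hat\sigma_{a_h}^2\right)\left(\hat b_h^2 + L\hat\sigma_{b_h}^2\right)}{\sigma^2}. \tag{6.43}$$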
Importantly, the free energy (6.42) depends on the variational parameters $\{\hat a_h, \hat b_h, \hat\sigma_{a_h}^2, \hat\sigma_{b_h}^2\}_{h=1}^H$ only through the third term, and the third term is decomposed into $H$ terms, each of which depends only on the variational parameters $(\hat a_h, \hat b_h, \hat\sigma_{a_h}^2, \hat\sigma_{b_h}^2)$ for the $h$th singular component. Accordingly, given the noise variance $\sigma^2$, we can separately minimize the free energy (6.43), which involves only four unknowns, for each singular component.
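The separability described above can be checked numerically. The following is a minimal sketch in Python, not taken from the book: the function names `two_F_h` and `two_F` and the argument names (`s2a`, `s2b` for the posterior variances, `c2a`, `c2b` for the prior variances) are our own illustrative choices, and the formulas are transcriptions of Eqs. (6.42) and (6.43).

```python
import math

def two_F_h(a, b, s2a, s2b, gamma, L, M, sigma2, c2a, c2b):
    """Twice the componentwise free energy, 2F_h of Eq. (6.43)."""
    ea = a * a + M * s2a   # a_h^2 + M * sigma_{a_h}^2
    eb = b * b + L * s2b   # b_h^2 + L * sigma_{b_h}^2
    return (M * math.log(c2a / s2a) + L * math.log(c2b / s2b)
            + ea / c2a + eb / c2b - (L + M)
            + (-2.0 * a * b * gamma + ea * eb) / sigma2)

def two_F(gammas, params, L, M, sigma2, c2a, c2b):
    """Total 2F of Eq. (6.42).

    gammas: the L singular values of V (largest first).
    params: H tuples (a_h, b_h, s2a_h, s2b_h).
    c2a, c2b: sequences of the H prior variances.
    """
    total = L * M * math.log(2.0 * math.pi * sigma2)
    total += sum(g * g for g in gammas) / sigma2
    for h, (a, b, s2a, s2b) in enumerate(params):
        total += two_F_h(a, b, s2a, s2b, gammas[h], L, M, sigma2, c2a[h], c2b[h])
    return total
```

Changing the parameters of one component changes `two_F` by exactly the change in that component's `two_F_h`, which is what allows each singular component to be minimized on its own.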
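6.6 Analytic Form of Global VB Solution

The stationary conditions of Eq. (6.43) are given by

$$\hat a_h = \frac{\hat\sigma_{a_h}^2}{\sigma^2}\, \gamma_h \hat b_h, \tag{6.44}$$
$$\hat\sigma_{a_h}^2 = \sigma^2 \left( \hat b_h^2 + L\hat\sigma_{b_h}^2 + \frac{\sigma^2}{c_{a_h}^2} \right)^{-1}, \tag{6.45}$$
$$\hat b_h = \frac{\hat\sigma_{b_h}^2}{\sigma^2}\, \gamma_h \hat a_h, \tag{6.46}$$
$$\hat\sigma_{b_h}^2 = \sigma^2 \left( \hat a_h^2 + M\hat\sigma_{a_h}^2 + \frac{\sigma^2}{c_{b_h}^2} \right)^{-1}, \tag{6.47}$$

which form a polynomial system, i.e., a set of polynomial equations, in the four unknowns $(\hat a_h, \hat b_h, \hat\sigma_{a_h}^2, \hat\sigma_{b_h}^2)$. Since Lemma 6.1 guarantees that any minimizer is a stationary point, we can obtain the global solution by finding all points that satisfy the stationary conditions (6.44) through (6.47) and comparing the free energy (6.43) at those points. This leads to the following theorem and corollary:

Theorem 6.7 The VB solution is given by

$$\hat U^{\mathrm{VB}} = \sum_{h=1}^{H} \hat\gamma_h^{\mathrm{VB}}\, \omega_{b_h} \omega_{a_h}^\top, \qquad \text{where} \qquad \hat\gamma_h^{\mathrm{VB}} = \begin{cases} \breve\gamma_h^{\mathrm{VB}} & \text{if } \gamma_h \ge \underline\gamma_h^{\mathrm{VB}}, \\ 0 & \text{otherwise}, \end{cases} \tag{6.48}$$

for

$$\underline\gamma_h^{\mathrm{VB}} = \sigma \sqrt{ \frac{L + M}{2} + \frac{\sigma^2}{2 c_{a_h}^2 c_{b_h}^2} + \sqrt{ \left( \frac{L + M}{2} + \frac{\sigma^2}{2 c_{a_h}^2 c_{b_h}^2} \right)^2 - LM } }, \tag{6.49}$$
$$\breve\gamma_h^{\mathrm{VB}} = \gamma_h \left( 1 - \frac{\sigma^2}{2\gamma_h^2} \left( M + L + \sqrt{ (M - L)^2 + \frac{4\gamma_h^2}{c_{a_h}^2 c_{b_h}^2} } \right) \right). \tag{6.50}$$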
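Corollary 6.8 The VB posterior is given by Eq. (6.40) with the following variational parameters: if $\gamma_h > \underline\gamma_h^{\mathrm{VB}}$,

$$\hat a_h = \pm\sqrt{\breve\gamma_h^{\mathrm{VB}}\, \hat\delta_h^{\mathrm{VB}}}, \qquad \hat b_h = \pm\sqrt{\frac{\breve\gamma_h^{\mathrm{VB}}}{\hat\delta_h^{\mathrm{VB}}}}, \qquad \hat\sigma_{a_h}^2 = \frac{\sigma^2 \hat\delta_h^{\mathrm{VB}}}{\gamma_h}, \qquad \hat\sigma_{b_h}^2 = \frac{\sigma^2}{\gamma_h \hat\delta_h^{\mathrm{VB}}}, \tag{6.51}$$

$$\text{where} \qquad \hat\delta_h^{\mathrm{VB}} \equiv \frac{\hat a_h}{\hat b_h} = \frac{c_{a_h}^2}{\sigma^2} \left( \gamma_h - \breve\gamma_h^{\mathrm{VB}} - \frac{L\sigma^2}{\gamma_h} \right), \tag{6.52}$$

and otherwise,

$$\hat a_h = 0, \qquad \hat b_h = 0, \qquad \hat\sigma_{a_h}^2 = c_{a_h}^2 \left( 1 - \frac{L \hat\zeta_h^{\mathrm{VB}}}{\sigma^2} \right), \qquad \hat\sigma_{b_h}^2 = c_{b_h}^2 \left( 1 - \frac{M \hat\zeta_h^{\mathrm{VB}}}{\sigma^2} \right), \tag{6.53}$$

$$\text{where} \qquad \hat\zeta_h^{\mathrm{VB}} \equiv \hat\sigma_{a_h}^2 \hat\sigma_{b_h}^2 = \frac{\sigma^2}{2LM} \left( L + M + \frac{\sigma^2}{c_{a_h}^2 c_{b_h}^2} - \sqrt{ \left( L + M + \frac{\sigma^2}{c_{a_h}^2 c_{b_h}^2} \right)^2 - 4LM } \right). \tag{6.54}$$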
Theorem 6.7 states that the VB solution for fully observed MF is a truncated shrinkage SVD with the truncation threshold and the shrinkage estimator given by Eqs. (6.49) and (6.50), respectively. Corollary 6.8 completely specifies the VB posterior.¹ These results give insights into the behavior of VB learning; for example, they explain why a sparse solution is obtained, and what the similarities and differences between the Bayes posterior and the VB posterior are, which will be discussed in Chapter 7. The results also form the basis of further analysis of the global empirical VB solution (Section 6.8), which will be used for performance guarantees (Chapter 8), and of global (or efficient local) solvers for multilinear models (Chapters 9, 10, and 11). Before moving on, we give the proofs of the theorem and the corollary in the next section.
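To make the truncated shrinkage concrete, the following is a minimal Python sketch of Eqs. (6.48) through (6.50) for a single singular component. It is an illustration under our own naming, not code from the book: `c2ab` stands for the product $c_{a_h}^2 c_{b_h}^2$, and `sigma2` for $\sigma^2$.

```python
import math

def vb_threshold(L, M, sigma2, c2ab):
    """Truncation threshold of Eq. (6.49); c2ab = c_{a_h}^2 * c_{b_h}^2."""
    t = (L + M) / 2.0 + sigma2 / (2.0 * c2ab)
    return math.sqrt(sigma2 * (t + math.sqrt(t * t - L * M)))

def vb_shrinkage(gamma, L, M, sigma2, c2ab):
    """Shrinkage estimator of Eq. (6.50) for an observed singular value gamma > 0."""
    root = math.sqrt((M - L) ** 2 + 4.0 * gamma * gamma / c2ab)
    return gamma * (1.0 - sigma2 / (2.0 * gamma * gamma) * (M + L + root))

def vb_singular_value(gamma, L, M, sigma2, c2ab):
    """Truncated shrinkage of Eq. (6.48): shrink above the threshold, else zero."""
    if gamma >= vb_threshold(L, M, sigma2, c2ab):
        return vb_shrinkage(gamma, L, M, sigma2, c2ab)
    return 0.0
```

Applying `vb_singular_value` to each singular value of $V$ and recombining with the singular vectors as in Eq. (6.48) gives the VB estimator; components below the threshold are discarded, which is the source of sparsity.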
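¹ The similarity between $(\underline\gamma_h^{\mathrm{VB}})^2$ and $LM\hat\zeta_h^{\mathrm{VB}}$ comes from the fact that they are the two different solutions of the same quadratic equation, i.e., Eq. (6.79) with respect to $(\underline\gamma_h^{\mathrm{VB}})^2$ and Eq. (6.77) with respect to $LM\hat\zeta_h^{\mathrm{VB}}$.

6.7 Proofs of Theorem 6.7 and Corollary 6.8

We will find all stationary points that satisfy Eqs. (6.44) through (6.47), and compare the free energy (6.43) at those points.

By using Eqs. (6.45) and (6.47), the free energy (6.43) can be simplified as

$$F_h = M \log\frac{c_{a_h}^2}{\hat\sigma_{a_h}^2} + L \log\frac{c_{b_h}^2}{\hat\sigma_{b_h}^2} + \frac{1}{\sigma^2} \left( \hat a_h^2 + M\hat\sigma_{a_h}^2 + \frac{\sigma^2}{c_{b_h}^2} \right) \left( \hat b_h^2 + L\hat\sigma_{b_h}^2 + \frac{\sigma^2}{c_{a_h}^2} \right) - (L + M) + \frac{-2\hat a_h \hat b_h \gamma_h}{\sigma^2} - \frac{\sigma^2}{c_{a_h}^2 c_{b_h}^2}$$
$$= M \log\frac{c_{a_h}^2}{\hat\sigma_{a_h}^2} + L \log\frac{c_{b_h}^2}{\hat\sigma_{b_h}^2} + \frac{\sigma^2}{\hat\sigma_{a_h}^2 \hat\sigma_{b_h}^2} - \frac{2\hat a_h \hat b_h \gamma_h}{\sigma^2} - \left( L + M + \frac{\sigma^2}{c_{a_h}^2 c_{b_h}^2} \right). \tag{6.55}$$

The stationary conditions (6.44) through (6.47) imply two possibilities of stationary points.

6.7.1 Null Stationary Point

If $\hat a_h = 0$ or $\hat b_h = 0$, Eqs. (6.44) and (6.46) require that $\hat a_h = 0$ and $\hat b_h = 0$. In this case, Eqs. (6.45) and (6.47) lead to

$$\hat\sigma_{a_h}^2 = c_{a_h}^2 \left( 1 - \frac{L \hat\sigma_{a_h}^2 \hat\sigma_{b_h}^2}{\sigma^2} \right), \tag{6.56}$$
$$\hat\sigma_{b_h}^2 = c_{b_h}^2 \left( 1 - \frac{M \hat\sigma_{a_h}^2 \hat\sigma_{b_h}^2}{\sigma^2} \right). \tag{6.57}$$

Multiplying Eqs. (6.56) and (6.57), we have

$$\left( 1 - \frac{L \hat\sigma_{a_h}^2 \hat\sigma_{b_h}^2}{\sigma^2} \right) \left( 1 - \frac{M \hat\sigma_{a_h}^2 \hat\sigma_{b_h}^2}{\sigma^2} \right) = \frac{\hat\sigma_{a_h}^2 \hat\sigma_{b_h}^2}{c_{a_h}^2 c_{b_h}^2}, \tag{6.58}$$

and therefore

$$\frac{LM}{\sigma^2}\, \hat\sigma_{a_h}^4 \hat\sigma_{b_h}^4 - \left( L + M + \frac{\sigma^2}{c_{a_h}^2 c_{b_h}^2} \right) \hat\sigma_{a_h}^2 \hat\sigma_{b_h}^2 + \sigma^2 = 0. \tag{6.59}$$

Solving the quadratic equation (6.59) with respect to $\hat\sigma_{a_h}^2 \hat\sigma_{b_h}^2$ and checking the signs of $\hat\sigma_{a_h}^2$ and $\hat\sigma_{b_h}^2$, we have the following lemma:

Lemma 6.9 For any $\gamma_h \ge 0$ and $c_{a_h}^2, c_{b_h}^2, \sigma^2 \in \mathbb{R}_{++}$, the null stationary point given by Eq. (6.53) exists with the following free energy:

$$F_h^{\mathrm{VB-Null}} = -M \log\left( 1 - \frac{L}{\sigma^2}\hat\zeta_h^{\mathrm{VB}} \right) - L \log\left( 1 - \frac{M}{\sigma^2}\hat\zeta_h^{\mathrm{VB}} \right) - \frac{LM}{\sigma^2}\hat\zeta_h^{\mathrm{VB}}, \tag{6.60}$$

where $\hat\zeta_h^{\mathrm{VB}}$ is defined by Eq. (6.54).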
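Proof Eq. (6.59) has two positive real solutions:

$$\hat\sigma_{a_h}^2 \hat\sigma_{b_h}^2 = \frac{\sigma^2}{2LM} \left( L + M + \frac{\sigma^2}{c_{a_h}^2 c_{b_h}^2} \pm \sqrt{ \left( L + M + \frac{\sigma^2}{c_{a_h}^2 c_{b_h}^2} \right)^2 - 4LM } \right).$$

The larger solution (with the plus sign) is decreasing with respect to $c_{a_h}^2 c_{b_h}^2$, and lower-bounded as $\hat\sigma_{a_h}^2 \hat\sigma_{b_h}^2 > \sigma^2 / L$. The smaller solution (with the minus sign) is increasing with respect to $c_{a_h}^2 c_{b_h}^2$, and upper-bounded as $\hat\sigma_{a_h}^2 \hat\sigma_{b_h}^2 < \sigma^2 / M$. For $\hat\sigma_{a_h}^2$ and $\hat\sigma_{b_h}^2$ to be positive, Eqs. (6.56) and (6.57) require that

$$\hat\sigma_{a_h}^2 \hat\sigma_{b_h}^2 < \frac{\sigma^2}{M}, \tag{6.61}$$

which is violated by the larger solution and satisfied by the smaller solution. With the smaller solution (6.54), Eqs. (6.56) and (6.57) give the null stationary point (6.53), and substituting it into Eq. (6.55) with the help of Eq. (6.59) gives the free energy (6.60). This completes the proof of Lemma 6.9.

6.7.2 Positive Stationary Point

Assume that $\hat a_h \ne 0$ and $\hat b_h \ne 0$. In this case, Eqs. (6.44) and (6.46) imply that $\hat a_h$ and $\hat b_h$ have the same sign. Define

$$\hat\gamma_h = \hat a_h \hat b_h > 0, \tag{6.62}$$
$$\hat\delta_h = \frac{\hat a_h}{\hat b_h} > 0. \tag{6.63}$$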
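From Eqs. (6.44) and (6.46), we have

$$\hat\sigma_{a_h}^2 = \frac{\sigma^2}{\gamma_h}\, \hat\delta_h, \tag{6.64}$$
$$\hat\sigma_{b_h}^2 = \frac{\sigma^2}{\gamma_h}\, \hat\delta_h^{-1}. \tag{6.65}$$

Substituting Eqs. (6.64) and (6.65) into Eqs. (6.45) and (6.47) gives

$$\gamma_h \hat\delta_h^{-1} = \hat\gamma_h \hat\delta_h^{-1} + L \frac{\sigma^2}{\gamma_h} \hat\delta_h^{-1} + \frac{\sigma^2}{c_{a_h}^2}, \tag{6.66}$$
$$\gamma_h \hat\delta_h = \hat\gamma_h \hat\delta_h + M \frac{\sigma^2}{\gamma_h} \hat\delta_h + \frac{\sigma^2}{c_{b_h}^2}, \tag{6.67}$$

and therefore,

$$\hat\delta_h = \frac{c_{a_h}^2}{\sigma^2} \left( \gamma_h - \hat\gamma_h - \frac{L\sigma^2}{\gamma_h} \right), \tag{6.68}$$
$$\hat\delta_h^{-1} = \frac{c_{b_h}^2}{\sigma^2} \left( \gamma_h - \hat\gamma_h - \frac{M\sigma^2}{\gamma_h} \right). \tag{6.69}$$

Multiplying Eqs. (6.68) and (6.69), we have

$$\left( \gamma_h - \hat\gamma_h - \frac{L\sigma^2}{\gamma_h} \right) \left( \gamma_h - \hat\gamma_h - \frac{M\sigma^2}{\gamma_h} \right) = \frac{\sigma^4}{c_{a_h}^2 c_{b_h}^2}, \tag{6.70}$$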
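and therefore

$$\hat\gamma_h^2 - \left( 2\gamma_h - \frac{(L + M)\sigma^2}{\gamma_h} \right) \hat\gamma_h + \left( \gamma_h - \frac{L\sigma^2}{\gamma_h} \right) \left( \gamma_h - \frac{M\sigma^2}{\gamma_h} \right) - \frac{\sigma^4}{c_{a_h}^2 c_{b_h}^2} = 0. \tag{6.71}$$

By solving the quadratic equation (6.71) with respect to $\hat\gamma_h$, and checking the signs of $\hat\gamma_h$, $\hat\delta_h$, $\hat\sigma_{a_h}^2$, and $\hat\sigma_{b_h}^2$, we have the following lemma:

Lemma 6.10 If and only if $\gamma_h > \underline\gamma_h^{\mathrm{VB}}$, where $\underline\gamma_h^{\mathrm{VB}}$ is defined by Eq. (6.49), the positive stationary point given by Eq. (6.51) exists with the following free energy:

$$F_h^{\mathrm{VB-Posi}} = -M \log\left( 1 - \left( \frac{\breve\gamma_h^{\mathrm{VB}}}{\gamma_h} + \frac{L\sigma^2}{\gamma_h^2} \right) \right) - L \log\left( 1 - \left( \frac{\breve\gamma_h^{\mathrm{VB}}}{\gamma_h} + \frac{M\sigma^2}{\gamma_h^2} \right) \right) - \frac{\gamma_h^2}{\sigma^2} \left( \frac{\breve\gamma_h^{\mathrm{VB}}}{\gamma_h} + \frac{L\sigma^2}{\gamma_h^2} \right) \left( \frac{\breve\gamma_h^{\mathrm{VB}}}{\gamma_h} + \frac{M\sigma^2}{\gamma_h^2} \right), \tag{6.72}$$

where $\breve\gamma_h^{\mathrm{VB}}$ is defined by Eq. (6.50).

Proof Since $\hat\delta_h > 0$, Eqs. (6.68) and (6.69) require that

$$\hat\gamma_h < \gamma_h - \frac{L\sigma^2}{\gamma_h}, \tag{6.73}$$

and therefore, the positive stationary point exists only when

$$\gamma_h > \sqrt{M}\sigma. \tag{6.74}$$

Let us assume that Eq. (6.74) holds. Eq. (6.71) has two solutions:

$$\hat\gamma_h = \frac{1}{2} \left( 2\gamma_h - \frac{(L + M)\sigma^2}{\gamma_h} \pm \sqrt{ \left( \frac{(M - L)\sigma^2}{\gamma_h} \right)^2 + \frac{4\sigma^4}{c_{a_h}^2 c_{b_h}^2} } \right).$$

The larger solution with the plus sign is positive, decreasing with respect to $c_{a_h}^2 c_{b_h}^2$, and lower-bounded as $\hat\gamma_h > \gamma_h - L\sigma^2/\gamma_h$, which violates the condition (6.73). The smaller solution, Eq. (6.50), with the minus sign is positive if the intercept of the left-hand side in Eq. (6.71) is positive, i.e.,

$$\left( \gamma_h - \frac{L\sigma^2}{\gamma_h} \right) \left( \gamma_h - \frac{M\sigma^2}{\gamma_h} \right) - \frac{\sigma^4}{c_{a_h}^2 c_{b_h}^2} > 0. \tag{6.75}$$

From the condition (6.75), we obtain the threshold (6.49) for the existence of the positive stationary point. Note that $\underline\gamma_h^{\mathrm{VB}} > \sqrt{M}\sigma$, and therefore, Eq. (6.74) holds whenever $\gamma_h > \underline\gamma_h^{\mathrm{VB}}$.

Assume that $\gamma_h > \underline\gamma_h^{\mathrm{VB}}$. Then, with the solution (6.50), $\hat\delta_h$, given by Eq. (6.68), and $\hat\sigma_{a_h}^2$ and $\hat\sigma_{b_h}^2$, given by Eqs. (6.64) and (6.65), are all positive. Thus, we obtain the positive stationary point (6.51). Substituting Eqs. (6.64) and (6.65), and then Eqs. (6.68) and (6.69), into the free energy (6.55), we have

$$F_h^{\mathrm{VB-Posi}} = -M \log\left( 1 - \frac{\breve\gamma_h^{\mathrm{VB}}}{\gamma_h} - \frac{L\sigma^2}{\gamma_h^2} \right) - L \log\left( 1 - \frac{\breve\gamma_h^{\mathrm{VB}}}{\gamma_h} - \frac{M\sigma^2}{\gamma_h^2} \right) + \frac{-2\gamma_h \breve\gamma_h^{\mathrm{VB}}}{\sigma^2} + \frac{\gamma_h^2}{\sigma^2} - \left( L + M + \frac{\sigma^2}{c_{a_h}^2 c_{b_h}^2} \right). \tag{6.76}$$

Using Eq. (6.70), we can eliminate the direct dependency on $c_{a_h}^2 c_{b_h}^2$, and express the free energy (6.76) as a function of $\breve\gamma_h^{\mathrm{VB}}$. This results in Eq. (6.72), and completes the proof of Lemma 6.10.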
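As a numerical sanity check of the derivation above, the following Python sketch builds the positive stationary point from Eqs. (6.50) through (6.52) and evaluates the residuals of the stationary conditions (6.44) through (6.47). It is only an illustration under our own naming (`positive_stationary_point`, `stationarity_residuals`, `s2a`, `s2b`, `c2a`, `c2b`), it takes the plus signs in Eq. (6.51), and it assumes that $\gamma_h$ exceeds the threshold (6.49).

```python
import math

def positive_stationary_point(gamma, L, M, sigma2, c2a, c2b):
    """Positive stationary point of Eqs. (6.50)-(6.52), taking the + signs in (6.51).

    Assumes gamma is above the threshold (6.49), so that all quantities are positive.
    """
    root = math.sqrt((M - L) ** 2 + 4.0 * gamma ** 2 / (c2a * c2b))
    g_breve = gamma * (1.0 - sigma2 / (2.0 * gamma ** 2) * (M + L + root))  # Eq. (6.50)
    delta = c2a / sigma2 * (gamma - g_breve - L * sigma2 / gamma)           # Eq. (6.52)
    a = math.sqrt(g_breve * delta)
    b = math.sqrt(g_breve / delta)
    s2a = sigma2 * delta / gamma
    s2b = sigma2 / (gamma * delta)
    return a, b, s2a, s2b

def stationarity_residuals(a, b, s2a, s2b, gamma, L, M, sigma2, c2a, c2b):
    """Left-hand side minus right-hand side of Eqs. (6.44)-(6.47)."""
    r44 = a - s2a / sigma2 * gamma * b
    r45 = s2a - sigma2 / (b * b + L * s2b + sigma2 / c2a)
    r46 = b - s2b / sigma2 * gamma * a
    r47 = s2b - sigma2 / (a * a + M * s2a + sigma2 / c2b)
    return r44, r45, r46, r47
```

All four residuals should vanish up to rounding error whenever the observed singular value lies above the threshold.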
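6.7.3 Useful Relations

Let us summarize some useful relations between variables, which are used in the subsequent sections. $\hat\zeta_h^{\mathrm{VB}}$, $\breve\gamma_h^{\mathrm{VB}}$, and $\underline\gamma_h^{\mathrm{VB}}$, derived from Eqs. (6.58), (6.70), and the constant part of Eq. (6.71), respectively, satisfy the following:

$$\left( 1 - \frac{L\hat\zeta_h^{\mathrm{VB}}}{\sigma^2} \right) \left( 1 - \frac{M\hat\zeta_h^{\mathrm{VB}}}{\sigma^2} \right) - \frac{\hat\zeta_h^{\mathrm{VB}}}{c_{a_h}^2 c_{b_h}^2} = 0, \tag{6.77}$$
$$\left( \gamma_h - \breve\gamma_h^{\mathrm{VB}} - \frac{L\sigma^2}{\gamma_h} \right) \left( \gamma_h - \breve\gamma_h^{\mathrm{VB}} - \frac{M\sigma^2}{\gamma_h} \right) - \frac{\sigma^4}{c_{a_h}^2 c_{b_h}^2} = 0, \tag{6.78}$$
$$\left( \underline\gamma_h^{\mathrm{VB}} - \frac{L\sigma^2}{\underline\gamma_h^{\mathrm{VB}}} \right) \left( \underline\gamma_h^{\mathrm{VB}} - \frac{M\sigma^2}{\underline\gamma_h^{\mathrm{VB}}} \right) - \frac{\sigma^4}{c_{a_h}^2 c_{b_h}^2} = 0. \tag{6.79}$$

From Eqs. (6.54) and (6.49), we find that

$$\underline\gamma_h^{\mathrm{VB}} = \sqrt{ \left( (L + M)\sigma^2 + \frac{\sigma^4}{c_{a_h}^2 c_{b_h}^2} \right) - LM\hat\zeta_h^{\mathrm{VB}} }, \tag{6.80}$$

which is useful when comparing the free energies of the null and the positive stationary points.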
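6.7.4 Free Energy Comparison

Lemmas 6.9 and 6.10 imply that, when $\gamma_h \le \underline\gamma_h^{\mathrm{VB}}$, the null stationary point is the only stationary point, and therefore the global solution. When $\gamma_h > \underline\gamma_h^{\mathrm{VB}}$, both the null and the positive stationary points exist, and therefore identifying the global solution requires us to compare their free energies, given by Eqs. (6.60) and (6.72).

Given the observed singular value $\gamma_h \ge 0$, we view the free energy as a function of $c_{a_h}^2 c_{b_h}^2$. We also view the threshold $\underline\gamma_h^{\mathrm{VB}}$ as a function of $c_{a_h}^2 c_{b_h}^2$. We find from Eq. (6.49) that $\underline\gamma_h^{\mathrm{VB}}$ is decreasing and lower-bounded by $\underline\gamma_h^{\mathrm{VB}} > \sqrt{M}\sigma$. Therefore, when $\gamma_h \le \sqrt{M}\sigma$, $\underline\gamma_h^{\mathrm{VB}}$ never gets smaller than $\gamma_h$ for any $c_{a_h}^2 c_{b_h}^2 > 0$. When $\gamma_h > \sqrt{M}\sigma$, on the other hand, there is a threshold $\underline{c_{a_h} c_{b_h}}$ such that $\gamma_h > \underline\gamma_h^{\mathrm{VB}}$ if $c_{a_h}^2 c_{b_h}^2 > \underline{c_{a_h} c_{b_h}}$. Eq. (6.79) implies that the threshold is given by

$$\underline{c_{a_h} c_{b_h}} = \frac{\sigma^4}{\gamma_h^2 \left( 1 - \frac{L\sigma^2}{\gamma_h^2} \right) \left( 1 - \frac{M\sigma^2}{\gamma_h^2} \right)}. \tag{6.81}$$

We have the following lemma:

Lemma 6.11 For any $\gamma_h \ge 0$ and $c_{a_h}^2 c_{b_h}^2 > 0$, the derivative of the free energy (6.60) at the null stationary point with respect to $c_{a_h}^2 c_{b_h}^2$ is given by

$$\frac{\partial F_h^{\mathrm{VB-Null}}}{\partial (c_{a_h}^2 c_{b_h}^2)} = \frac{LM\hat\zeta_h^{\mathrm{VB}}}{\sigma^2 c_{a_h}^2 c_{b_h}^2}. \tag{6.82}$$

For $\gamma_h > \sqrt{M}\sigma$ and $c_{a_h}^2 c_{b_h}^2 > \underline{c_{a_h} c_{b_h}}$, the derivative of the free energy (6.72) at the positive stationary point with respect to $c_{a_h}^2 c_{b_h}^2$ is given by

$$\frac{\partial F_h^{\mathrm{VB-Posi}}}{\partial (c_{a_h}^2 c_{b_h}^2)} = \frac{\gamma_h^2}{\sigma^2 c_{a_h}^2 c_{b_h}^2} \left( \frac{(\breve\gamma_h^{\mathrm{VB}})^2}{\gamma_h^2} - \left( 1 - \frac{(L + M)\sigma^2}{\gamma_h^2} \right) \frac{\breve\gamma_h^{\mathrm{VB}}}{\gamma_h} + \frac{LM\sigma^4}{\gamma_h^4} \right). \tag{6.83}$$

The derivative of the difference is negative, i.e.,

$$\frac{\partial \left( F_h^{\mathrm{VB-Posi}} - F_h^{\mathrm{VB-Null}} \right)}{\partial (c_{a_h}^2 c_{b_h}^2)} = -\frac{1}{\sigma^2 c_{a_h}^2 c_{b_h}^2} \left\{ \gamma_h \left( \gamma_h - \breve\gamma_h^{\mathrm{VB}} \right) - (\underline\gamma_h^{\mathrm{VB}})^2 \right\} < 0. \tag{6.84}$$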
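Proof By differentiating Eqs. (6.60), (6.54), (6.72), and (6.50), we have

$$\frac{\partial F_h^{\mathrm{VB-Null}}}{\partial \hat\zeta_h^{\mathrm{VB}}} = \frac{LM}{\sigma^2 \left( 1 - \frac{L}{\sigma^2}\hat\zeta_h^{\mathrm{VB}} \right)} + \frac{LM}{\sigma^2 \left( 1 - \frac{M}{\sigma^2}\hat\zeta_h^{\mathrm{VB}} \right)} - \frac{LM}{\sigma^2} = \frac{LM c_{a_h}^2 c_{b_h}^2 \left( 1 + \frac{\sqrt{LM}}{\sigma^2}\hat\zeta_h^{\mathrm{VB}} \right) \left( 1 - \frac{\sqrt{LM}}{\sigma^2}\hat\zeta_h^{\mathrm{VB}} \right)}{\sigma^2 \hat\zeta_h^{\mathrm{VB}}}, \tag{6.85}$$

$$\frac{\partial \hat\zeta_h^{\mathrm{VB}}}{\partial (c_{a_h}^2 c_{b_h}^2)} = \frac{\sigma^2}{2LM} \left( -\frac{\sigma^2}{c_{a_h}^4 c_{b_h}^4} + \frac{2\sigma^2 \left( L + M + \frac{\sigma^2}{c_{a_h}^2 c_{b_h}^2} \right)}{2 c_{a_h}^4 c_{b_h}^4 \sqrt{ \left( L + M + \frac{\sigma^2}{c_{a_h}^2 c_{b_h}^2} \right)^2 - 4LM }} \right) = \frac{1}{c_{a_h}^4 c_{b_h}^4} \cdot \frac{(\hat\zeta_h^{\mathrm{VB}})^2}{\left( 1 - \frac{\sqrt{LM}}{\sigma^2}\hat\zeta_h^{\mathrm{VB}} \right) \left( 1 + \frac{\sqrt{LM}}{\sigma^2}\hat\zeta_h^{\mathrm{VB}} \right)}, \tag{6.86}$$

$$\frac{\partial F_h^{\mathrm{VB-Posi}}}{\partial \breve\gamma_h^{\mathrm{VB}}} = \frac{M}{\gamma_h \left( 1 - \left( \frac{\breve\gamma_h^{\mathrm{VB}}}{\gamma_h} + \frac{L\sigma^2}{\gamma_h^2} \right) \right)} + \frac{L}{\gamma_h \left( 1 - \left( \frac{\breve\gamma_h^{\mathrm{VB}}}{\gamma_h} + \frac{M\sigma^2}{\gamma_h^2} \right) \right)} - \frac{\gamma_h}{\sigma^2} \left( \frac{2\breve\gamma_h^{\mathrm{VB}}}{\gamma_h} + \frac{(L + M)\sigma^2}{\gamma_h^2} \right)$$
$$= \frac{2 c_{a_h}^2 c_{b_h}^2 \gamma_h^3 \left( 1 - \left( \frac{\breve\gamma_h^{\mathrm{VB}}}{\gamma_h} + \frac{(L + M)\sigma^2}{2\gamma_h^2} \right) \right) \left( \frac{(\breve\gamma_h^{\mathrm{VB}})^2}{\gamma_h^2} - \left( 1 - \frac{(L + M)\sigma^2}{\gamma_h^2} \right) \frac{\breve\gamma_h^{\mathrm{VB}}}{\gamma_h} + \frac{LM\sigma^4}{\gamma_h^4} \right)}{\sigma^6}, \tag{6.87}$$

$$\frac{\partial \breve\gamma_h^{\mathrm{VB}}}{\partial (c_{a_h}^2 c_{b_h}^2)} = \frac{4\gamma_h^2 \sigma^2}{4\gamma_h c_{a_h}^4 c_{b_h}^4 \sqrt{ (M - L)^2 + \frac{4\gamma_h^2}{c_{a_h}^2 c_{b_h}^2} }} = \frac{\sigma^4}{2\gamma_h c_{a_h}^4 c_{b_h}^4 \left( 1 - \left( \frac{\breve\gamma_h^{\mathrm{VB}}}{\gamma_h} + \frac{(L + M)\sigma^2}{2\gamma_h^2} \right) \right)}. \tag{6.88}$$

Here, we used Eqs. (6.54) and (6.77) to obtain Eqs. (6.85) and (6.86), and Eqs. (6.50) and (6.78) to obtain Eqs. (6.87) and (6.88), respectively. Eq. (6.82) is obtained by multiplying Eqs. (6.85) and (6.86), while Eq. (6.83) is obtained by multiplying Eqs. (6.87) and (6.88). Taking the difference between the derivatives (6.82) and (6.83), and then using Eqs. (6.78) and (6.80), we have

$$\frac{\partial \left( F_h^{\mathrm{VB-Posi}} - F_h^{\mathrm{VB-Null}} \right)}{\partial (c_{a_h}^2 c_{b_h}^2)} = \frac{\partial F_h^{\mathrm{VB-Posi}}}{\partial (c_{a_h}^2 c_{b_h}^2)} - \frac{\partial F_h^{\mathrm{VB-Null}}}{\partial (c_{a_h}^2 c_{b_h}^2)} = -\frac{1}{\sigma^2 c_{a_h}^2 c_{b_h}^2} \left\{ \gamma_h \left( \gamma_h - \breve\gamma_h^{\mathrm{VB}} \right) - (\underline\gamma_h^{\mathrm{VB}})^2 \right\}. \tag{6.89}$$

The following can be obtained from Eqs. (6.78) and (6.79), respectively:

$$\left( \gamma_h (\gamma_h - \breve\gamma_h^{\mathrm{VB}}) - \frac{(L + M)\sigma^2}{2} \right)^2 = \frac{(L + M)^2 \sigma^4}{4} - LM\sigma^4 + \frac{\sigma^4}{c_{a_h}^2 c_{b_h}^2}\, \gamma_h^2, \tag{6.90}$$
$$\left( (\underline\gamma_h^{\mathrm{VB}})^2 - \frac{(L + M)\sigma^2}{2} \right)^2 = \frac{(L + M)^2 \sigma^4}{4} - LM\sigma^4 + \frac{\sigma^4}{c_{a_h}^2 c_{b_h}^2}\, (\underline\gamma_h^{\mathrm{VB}})^2. \tag{6.91}$$

Eqs. (6.90) and (6.91) imply that

$$\gamma_h (\gamma_h - \breve\gamma_h^{\mathrm{VB}}) > (\underline\gamma_h^{\mathrm{VB}})^2 \qquad \text{when} \qquad \gamma_h > \underline\gamma_h^{\mathrm{VB}}.$$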
Therefore, Eq. (6.89) is negative, which completes the proof of Lemma 6.11.

It is easy to show that the null stationary point (6.53) and the positive stationary point (6.51) coincide with each other at $c_{a_h}^2 c_{b_h}^2 \to \underline{c_{a_h} c_{b_h}} + 0$. Here, $+0$ means that the limit is approached from the positive side. Therefore,

$$\lim_{c_{a_h}^2 c_{b_h}^2 \to \underline{c_{a_h} c_{b_h}} + 0} \left( F_h^{\mathrm{VB-Posi}} - F_h^{\mathrm{VB-Null}} \right) = 0. \tag{6.92}$$

Eqs. (6.84) and (6.92) together imply that

$$F_h^{\mathrm{VB-Posi}} - F_h^{\mathrm{VB-Null}} < 0 \qquad \text{for} \qquad c_{a_h}^2 c_{b_h}^2 > \underline{c_{a_h} c_{b_h}}, \tag{6.93}$$

which results in the following lemma:

Lemma 6.12 The positive stationary point is the global solution (the global minimizer of the free energy (6.43) for fixed $c_{a_h}$ and $c_{b_h}$) whenever it exists.

Combining Lemmas 6.9, 6.10, and 6.12 completes the proof of Theorem 6.7 and Corollary 6.8. Figure 6.3 illustrates the behavior of the free energies.
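The comparison in Lemma 6.12 can also be observed numerically. The sketch below evaluates Eqs. (6.60) and (6.72) as functions of the product $c_{a_h}^2 c_{b_h}^2$ (called `c2ab` here) together with the threshold (6.81). The function names are hypothetical and chosen only for this illustration, and `free_energy_posi` assumes that `c2ab` lies above the threshold so that the logarithms are defined.

```python
import math

def vb_shrinkage(gamma, L, M, sigma2, c2ab):
    """Shrinkage estimator of Eq. (6.50)."""
    root = math.sqrt((M - L) ** 2 + 4.0 * gamma ** 2 / c2ab)
    return gamma * (1.0 - sigma2 / (2.0 * gamma ** 2) * (M + L + root))

def c2ab_threshold(gamma, L, M, sigma2):
    """Threshold of Eq. (6.81) on c2ab; requires gamma > sqrt(M * sigma2)."""
    return sigma2 ** 2 / (gamma ** 2
                          * (1.0 - L * sigma2 / gamma ** 2)
                          * (1.0 - M * sigma2 / gamma ** 2))

def free_energy_null(L, M, sigma2, c2ab):
    """Free energy (6.60) at the null stationary point, with zeta from Eq. (6.54)."""
    u = L + M + sigma2 / c2ab
    zeta = sigma2 / (2.0 * L * M) * (u - math.sqrt(u * u - 4.0 * L * M))
    return (-M * math.log(1.0 - L * zeta / sigma2)
            - L * math.log(1.0 - M * zeta / sigma2)
            - L * M * zeta / sigma2)

def free_energy_posi(gamma, L, M, sigma2, c2ab):
    """Free energy (6.72) at the positive stationary point (c2ab above the threshold)."""
    g = vb_shrinkage(gamma, L, M, sigma2, c2ab)
    p = g / gamma + L * sigma2 / gamma ** 2
    q = g / gamma + M * sigma2 / gamma ** 2
    return (-M * math.log(1.0 - p) - L * math.log(1.0 - q)
            - gamma ** 2 / sigma2 * p * q)
```

Evaluating both functions on a grid of `c2ab` values above `c2ab_threshold(...)` reproduces the qualitative picture of Figure 6.3: the two curves meet at the threshold, and the positive branch lies below the null branch beyond it.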
[Figure 6.3: four panels plotting the free energy against $c_{a_h} c_{b_h}$, with curves labeled VB, Null, and Posi. Panels: (a) $\gamma_h \le \sqrt{M}\sigma$ (curves VB and Null only); (b) $\sqrt{M}\sigma < \gamma_h \le (\sqrt{L} + \sqrt{M})\sigma$; (c) $(\sqrt{L} + \sqrt{M})\sigma < \gamma_h < \underline\gamma^{\mathrm{EVB}}$; (d) $\gamma_h \ge \underline\gamma^{\mathrm{EVB}}$.]

Figure 6.3 Behavior of the free energies (6.60) and (6.72) at the null and the positive stationary points as functions of $c_{a_h} c_{b_h}$, when $L = M = H = 1$ and $\sigma^2 = 1$. The curve labeled as “VB” shows the VB free energy $F_h = \min(F_h^{\mathrm{VB-Null}}, F_h^{\mathrm{VB-Posi}})$ at the global solution, given $c_{a_h} c_{b_h}$. If $\gamma_h \le \sqrt{M}\sigma$, only the null stationary point exists for any $c_{a_h} c_{b_h} > 0$. Otherwise, the positive stationary point exists for $c_{a_h} c_{b_h} > \underline{c_{a_h} c_{b_h}}$, and it is the global minimum whenever it exists. In empirical VB learning where $c_{a_h} c_{b_h}$ is also optimized, $c_{a_h} c_{b_h} \to 0$ (indicated by a diamond) is the unique local minimum if $\gamma_h \le (\sqrt{L} + \sqrt{M})\sigma$. Otherwise, a positive local minimum also exists (indicated by a cross), and it is the global minimum if and only if $\gamma_h \ge \underline\gamma^{\mathrm{EVB}}$ (see Section 6.9 for detailed discussion).
6.8 Analytic Form of Global Empirical VB Solution

In this section, we will solve the empirical VB (EVB) problem where the prior covariances are also estimated from observation, i.e.,

$$\hat r = \operatorname*{argmin}_{r,\, C_A,\, C_B} F(r) \qquad \text{s.t.} \qquad r(A, B) = r_A(A)\, r_B(B). \tag{6.94}$$

Since the solution of the EVB problem is a VB solution with some values for the prior covariances $C_A, C_B$, the empirical VB posterior is in the same form as the VB posterior (6.40). Accordingly, the problem (6.94) can be written with the variational parameters $\{\hat a_h, \hat b_h, \hat\sigma_{a_h}^2, \hat\sigma_{b_h}^2\}_{h=1}^H$ as follows:

$$\text{Given}\quad \sigma^2 \in \mathbb{R}_{++},$$
$$\min_{\{\hat a_h,\, \hat b_h,\, \hat\sigma_{a_h}^2,\, \hat\sigma_{b_h}^2,\, c_{a_h}^2,\, c_{b_h}^2\}_{h=1}^H} F \tag{6.95}$$
$$\text{s.t.}\quad \{\hat a_h, \hat b_h \in \mathbb{R},\ \hat\sigma_{a_h}^2, \hat\sigma_{b_h}^2, c_{a_h}^2, c_{b_h}^2 \in \mathbb{R}_{++}\}_{h=1}^H, \tag{6.96}$$

where the free energy $F$ is given by Eq. (6.42). Solving the empirical VB problem (6.95) is not much harder than the VB problem (6.41) because the objective is still separable into $H$ singular components when the prior variances $\{c_{a_h}^2, c_{b_h}^2\}$ are also optimized. More specifically, we can obtain the empirical VB solution by minimizing the componentwise free energy (6.43) with respect to only six unknowns $(\hat a_h, \hat b_h, \hat\sigma_{a_h}^2, \hat\sigma_{b_h}^2, c_{a_h}^2, c_{b_h}^2)$ for each $h$th component. On the other hand, analyzing the VB estimator for the noise variance $\sigma^2$ is hard, since $F_h$ for all $h = 1, \ldots, H$ depend on $\sigma^2$ and therefore the free energy (6.42) is not separable. We postpone the analysis of this full empirical VB learning to Chapter 8, where the theoretical performance guarantee is derived.

For the problem (6.95), the stationary points of the free energy (6.43) satisfy Eqs. (6.44) through (6.47) along with Eqs. (3.26) and (3.27), which are written with the new set of variational parameters as

$$c_{a_h}^2 = \hat a_h^2 / M + \hat\sigma_{a_h}^2, \tag{6.97}$$
$$c_{b_h}^2 = \hat b_h^2 / L + \hat\sigma_{b_h}^2. \tag{6.98}$$
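However, unlike the VB solution, for which Lemma 6.1 holds, we cannot assume that the EVB solution is a stationary point, since the free energy $F_h$ given by Eq. (6.43) does not necessarily diverge to $+\infty$ when approaching the domain boundary (6.96). More specifically, $F_h$ can converge to a finite value, for example, for $\hat a_h = \hat b_h = 0$ and $\hat\sigma_{a_h}^2, \hat\sigma_{b_h}^2, c_{a_h}^2, c_{b_h}^2 \to +0$. Taking this into account, we can obtain the following theorem:

Theorem 6.13 Let

$$\alpha = \frac{L}{M} \qquad (0 < \alpha \le 1), \tag{6.99}$$

and let $\underline\tau = \underline\tau(\alpha)$ be the unique zero-cross point of the following decreasing function:

$$\Xi(\tau; \alpha) = \Phi(\tau) + \Phi\left( \frac{\tau}{\alpha} \right), \qquad \text{where} \qquad \Phi(z) = \frac{\log(z + 1)}{z} - \frac{1}{2}. \tag{6.100}$$

Then, the EVB solution is given by

$$\hat U^{\mathrm{EVB}} = \sum_{h=1}^{H} \hat\gamma_h^{\mathrm{EVB}}\, \omega_{b_h} \omega_{a_h}^\top, \qquad \text{where} \qquad \hat\gamma_h^{\mathrm{EVB}} = \begin{cases} \breve\gamma_h^{\mathrm{EVB}} & \text{if } \gamma_h \ge \underline\gamma^{\mathrm{EVB}}, \\ 0 & \text{otherwise}, \end{cases} \tag{6.101}$$

for

$$\underline\gamma^{\mathrm{EVB}} = \sigma \sqrt{ M (1 + \underline\tau) \left( 1 + \frac{\alpha}{\underline\tau} \right) }, \tag{6.102}$$
$$\breve\gamma_h^{\mathrm{EVB}} = \frac{\gamma_h}{2} \left( 1 - \frac{(M + L)\sigma^2}{\gamma_h^2} + \sqrt{ \left( 1 - \frac{(M + L)\sigma^2}{\gamma_h^2} \right)^2 - \frac{4LM\sigma^4}{\gamma_h^4} } \right). \tag{6.103}$$

[Figure 6.4: curves plotted against $\alpha \in (0, 1]$.]

Figure 6.4 Values of $\underline\tau(\alpha)$, $\sqrt{\alpha}$, and $\underline z \sqrt{\alpha}$.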
The EVB threshold (6.102) involves $\underline\tau$, which needs to be numerically computed. However, we can easily prepare a table of the values for $0 < \alpha \le 1$ beforehand, like the cumulative Gaussian probability used in statistical tests (Pearson, 1914). Alternatively, $\underline\tau \approx \underline z \sqrt{\alpha}$ is a good approximation, where $\underline z \approx 2.5129$ is the unique zero-cross point of $\Phi(z)$, as seen in Figure 6.4. We can show that $\underline\tau$ lies in the following range (Lemma 6.18 in Section 6.9):

$$\sqrt{\alpha} < \underline\tau < \underline z. \tag{6.104}$$

We will see in Chapter 8 that $\underline\tau$ is an important quantity in describing the behavior of the full empirical VB solution where the noise variance $\sigma^2$ is also estimated from observation. In Section 6.9, we give the proof of Theorem 6.13. Then, in Section 6.10, some corollaries obtained and variables defined in the proof are summarized, which will be used in Chapter 8.
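One simple way to compute $\underline\tau$ numerically is bisection on $\Xi(\tau; \alpha)$, using the bracket suggested by Eq. (6.104). The following Python sketch is an illustration only: bisection is our own choice of root finder rather than a method prescribed by the text, the function names are hypothetical, and the code assumes $L \le M$ so that $0 < \alpha \le 1$.

```python
import math

def Phi(z):
    """Phi(z) = log(z + 1) / z - 1/2 from Eq. (6.100), for z > 0."""
    return math.log1p(z) / z - 0.5

def Xi(tau, alpha):
    """Xi(tau; alpha) = Phi(tau) + Phi(tau / alpha) from Eq. (6.100)."""
    return Phi(tau) + Phi(tau / alpha)

def tau_bar(alpha, iters=200):
    """Zero-cross point of the decreasing function Xi(.; alpha), found by bisection.

    The bracket [sqrt(alpha), 3] relies on Eq. (6.104) and on 3 > 2.5129.
    """
    lo, hi = math.sqrt(alpha), 3.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if Xi(mid, alpha) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def evb_threshold(L, M, sigma2):
    """EVB truncation threshold of Eq. (6.102); assumes L <= M."""
    alpha = L / M
    t = tau_bar(alpha)
    return math.sqrt(sigma2 * M * (1.0 + t) * (1.0 + alpha / t))

def evb_singular_value(gamma, L, M, sigma2):
    """Truncated shrinkage of Eqs. (6.101) and (6.103)."""
    if gamma < evb_threshold(L, M, sigma2):
        return 0.0
    u = 1.0 - (M + L) * sigma2 / gamma ** 2
    disc = u * u - 4.0 * L * M * sigma2 ** 2 / gamma ** 4
    return 0.5 * gamma * (u + math.sqrt(disc))
```

For $\alpha = 1$, $\Xi(\tau; 1) = 2\Phi(\tau)$, so `tau_bar(1.0)` returns the zero-cross point $\underline z$ of $\Phi$. Unlike the VB estimator, `evb_singular_value` needs no prior variances as input, since they are optimized away.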
6.9 Proof of Theorem 6.13

In this section, we prove Theorem 6.13, which provides explicit forms, Eqs. (6.102) and (6.103), of the EVB threshold $\underline\gamma^{\mathrm{EVB}}$ and the EVB shrinkage estimator $\breve\gamma_h^{\mathrm{EVB}}$. In fact, we can easily obtain Eq. (6.103) in an intuitive way, by using some of the results obtained in Section 6.7. After that, by expressing the free energy $F_h$ with normalized versions of the observation and the estimator, we derive Eq. (6.102).

6.9.1 EVB Shrinkage Estimator

Eqs. (6.60) and (6.72) imply that the free energy does not depend on the ratio $c_{a_h} / c_{b_h}$ between the hyperparameters. Accordingly, we fix the ratio to $c_{a_h} / c_{b_h} = 1$ without loss of generality. Lemma 6.11 allows us to minimize the free energy with respect to $c_{a_h} c_{b_h}$ in a straightforward way.

Let us regard the free energies (6.60) and (6.72) at the null and the positive stationary points as functions of $c_{a_h} c_{b_h}$ (see Figure 6.3). Then, we find from Eq. (6.82) that

$$\frac{\partial F_h^{\mathrm{VB-Null}}}{\partial (c_{a_h}^2 c_{b_h}^2)} > 0,$$

which implies that the free energy (6.60) at the null stationary point is increasing. Using Lemma 6.9, we thus have the following lemma:

Lemma 6.14 For any given $\gamma_h \ge 0$ and $\sigma^2 > 0$, the null EVB local solution, given by

$$\hat a_h = 0, \qquad \hat b_h = 0, \qquad \hat\sigma_{a_h}^2 = \sqrt{\hat\zeta^{\mathrm{EVB}}}, \qquad \hat\sigma_{b_h}^2 = \sqrt{\hat\zeta^{\mathrm{EVB}}}, \qquad c_{a_h} c_{b_h} = \sqrt{\hat\zeta^{\mathrm{EVB}}}, \qquad \text{where} \qquad \hat\zeta^{\mathrm{EVB}} \to +0,$$

exists, and its free energy is given by

$$F_h^{\mathrm{EVB-Null}} \to +0. \tag{6.105}$$
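When $\gamma_h \ge (\sqrt{L} + \sqrt{M})\sigma$, the derivative (6.83) of the free energy (6.72) at the positive stationary point can be further factorized as

$$\frac{\partial F_h^{\mathrm{VB-Posi}}}{\partial (c_{a_h}^2 c_{b_h}^2)} = \frac{1}{\sigma^2 c_{a_h}^2 c_{b_h}^2} \left( \breve\gamma_h^{\mathrm{VB}} - \acute\gamma_h \right) \left( \breve\gamma_h^{\mathrm{VB}} - \breve\gamma_h^{\mathrm{EVB}} \right), \tag{6.106}$$

where

$$\acute\gamma_h = \frac{\gamma_h}{2} \left( 1 - \frac{(L + M)\sigma^2}{\gamma_h^2} - \sqrt{ \left( 1 - \frac{(L + M)\sigma^2}{\gamma_h^2} \right)^2 - \frac{4LM\sigma^4}{\gamma_h^4} } \right), \tag{6.107}$$

and $\breve\gamma_h^{\mathrm{EVB}}$ is given by Eq. (6.103). The VB shrinkage estimator (6.50) is an increasing function of $c_{a_h} c_{b_h}$ ranging over

$$0 < \breve\gamma_h^{\mathrm{VB}} < \gamma_h - \frac{M\sigma^2}{\gamma_h},$$

and both of Eqs. (6.107) and (6.103) are in this range, i.e.,

$$0 < \acute\gamma_h \le \breve\gamma_h^{\mathrm{EVB}} < \gamma_h - \frac{M\sigma^2}{\gamma_h}.$$

Therefore Eq. (6.106) leads to the following lemma:

Lemma 6.15 If $\gamma_h \le (\sqrt{L} + \sqrt{M})\sigma$, the free energy $F_h^{\mathrm{VB-Posi}}$ at the positive stationary point is monotonically increasing. Otherwise,

$$F_h^{\mathrm{VB-Posi}} \text{ is } \begin{cases} \text{increasing} & \text{for } \breve\gamma_h^{\mathrm{VB}} < \acute\gamma_h, \\ \text{decreasing} & \text{for } \acute\gamma_h < \breve\gamma_h^{\mathrm{VB}} < \breve\gamma_h^{\mathrm{EVB}}, \\ \text{increasing} & \text{for } \breve\gamma_h^{\mathrm{VB}} > \breve\gamma_h^{\mathrm{EVB}}, \end{cases}$$

and therefore minimized at $\breve\gamma_h^{\mathrm{VB}} = \breve\gamma_h^{\mathrm{EVB}}$.

We can see this behavior of the free energy in Figure 6.3. The derivative (6.83) is zero when $\breve\gamma_h^{\mathrm{VB}} = \breve\gamma_h^{\mathrm{EVB}}$, which leads to

$$\left( \breve\gamma_h^{\mathrm{EVB}} + \frac{L\sigma^2}{\gamma_h} \right) \left( \breve\gamma_h^{\mathrm{EVB}} + \frac{M\sigma^2}{\gamma_h} \right) = \gamma_h \breve\gamma_h^{\mathrm{EVB}}. \tag{6.108}$$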
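Using Eq. (6.108), we obtain the following lemma:

Lemma 6.16 If and only if

$$\gamma_h \ge \underline\gamma^{\mathrm{local-EVB}} \equiv (\sqrt{L} + \sqrt{M})\sigma, \tag{6.109}$$

the positive EVB local solution given by

$$\hat a_h = \pm\sqrt{\breve\gamma_h^{\mathrm{EVB}}\, \hat\delta_h^{\mathrm{EVB}}}, \qquad \hat b_h = \pm\sqrt{\frac{\breve\gamma_h^{\mathrm{EVB}}}{\hat\delta_h^{\mathrm{EVB}}}}, \tag{6.110}$$
$$\hat\sigma_{a_h}^2 = \frac{\sigma^2 \hat\delta_h^{\mathrm{EVB}}}{\gamma_h}, \qquad \hat\sigma_{b_h}^2 = \frac{\sigma^2}{\gamma_h \hat\delta_h^{\mathrm{EVB}}}, \qquad c_{a_h} c_{b_h} = \sqrt{\frac{\gamma_h \breve\gamma_h^{\mathrm{EVB}}}{LM}}, \tag{6.111}$$

exists with the following free energy:

$$F_h^{\mathrm{EVB-Posi}} = M \log\left( \frac{\gamma_h \breve\gamma_h^{\mathrm{EVB}}}{M\sigma^2} + 1 \right) + L \log\left( \frac{\gamma_h \breve\gamma_h^{\mathrm{EVB}}}{L\sigma^2} + 1 \right) - \frac{\gamma_h \breve\gamma_h^{\mathrm{EVB}}}{\sigma^2}. \tag{6.112}$$

Here,

$$\hat\delta_h^{\mathrm{EVB}} = \sqrt{\frac{M \breve\gamma_h^{\mathrm{EVB}}}{L \gamma_h}} \left( 1 + \frac{L\sigma^2}{\gamma_h \breve\gamma_h^{\mathrm{EVB}}} \right), \tag{6.113}$$

and $\breve\gamma_h^{\mathrm{EVB}}$ is given by Eq. (6.103).

Proof Lemma 6.15 immediately leads to the EVB shrinkage estimator (6.103). We can find the value of $c_{a_h} c_{b_h}$ at the positive EVB local solution by combining the condition (6.78) for the VB estimator and the condition (6.108) for the EVB estimator. Specifically, by using the condition (6.108), the condition (6.78) for $\breve\gamma_h^{\mathrm{VB}}$ replaced with $\breve\gamma_h^{\mathrm{EVB}}$ can be written as

$$\left( \gamma_h - \frac{\gamma_h \breve\gamma_h^{\mathrm{EVB}}}{\breve\gamma_h^{\mathrm{EVB}} + \frac{M\sigma^2}{\gamma_h}} \right) \left( \gamma_h - \frac{\gamma_h \breve\gamma_h^{\mathrm{EVB}}}{\breve\gamma_h^{\mathrm{EVB}} + \frac{L\sigma^2}{\gamma_h}} \right) = \frac{\sigma^4}{c_{a_h}^2 c_{b_h}^2},$$

and therefore

$$\left( \frac{M\sigma^2}{\breve\gamma_h^{\mathrm{EVB}} + \frac{M\sigma^2}{\gamma_h}} \right) \left( \frac{L\sigma^2}{\breve\gamma_h^{\mathrm{EVB}} + \frac{L\sigma^2}{\gamma_h}} \right) = \frac{\sigma^4}{c_{a_h}^2 c_{b_h}^2}.$$

Applying the condition (6.108) again gives

$$\frac{LM\sigma^4}{\gamma_h \breve\gamma_h^{\mathrm{EVB}}} = \frac{\sigma^4}{c_{a_h}^2 c_{b_h}^2},$$

which leads to the last equation in Eq. (6.111). Similarly, using the condition (6.108), Eq. (6.52) for $\breve\gamma_h^{\mathrm{VB}}$ replaced with $\breve\gamma_h^{\mathrm{EVB}}$ is written as

$$\hat\delta_h = \frac{c_{a_h}^2}{\sigma^2} \left( \gamma_h - \frac{\gamma_h \breve\gamma_h^{\mathrm{EVB}}}{\breve\gamma_h^{\mathrm{EVB}} + \frac{M\sigma^2}{\gamma_h}} \right) = \frac{c_{a_h}^2}{\sigma^2} \left( \frac{M\sigma^2}{\breve\gamma_h^{\mathrm{EVB}} + \frac{M\sigma^2}{\gamma_h}} \right) = \frac{c_{a_h}^2 M}{\gamma_h} \left( \frac{\gamma_h \breve\gamma_h^{\mathrm{EVB}} + L\sigma^2}{\gamma_h \breve\gamma_h^{\mathrm{EVB}}} \right) = \frac{c_{a_h}^2 M}{\gamma_h} \left( 1 + \frac{L\sigma^2}{\gamma_h \breve\gamma_h^{\mathrm{EVB}}} \right).$$

Using the assumption that $c_{a_h} = c_{b_h}$ and therefore $c_{a_h}^2 = c_{a_h} c_{b_h}$, we obtain Eq. (6.113). Eq. (6.110) and the first two equations in Eq. (6.111) are simply obtained from Lemma 6.10.

Finally, applying Eq. (6.108) to the free energy (6.72), we have

$$F_h^{\mathrm{EVB-Posi}} = -M \log\left( 1 - \frac{\gamma_h \breve\gamma_h^{\mathrm{EVB}}}{\gamma_h \breve\gamma_h^{\mathrm{EVB}} + M\sigma^2} \right) - L \log\left( 1 - \frac{\gamma_h \breve\gamma_h^{\mathrm{EVB}}}{\gamma_h \breve\gamma_h^{\mathrm{EVB}} + L\sigma^2} \right) - \frac{\gamma_h \breve\gamma_h^{\mathrm{EVB}}}{\sigma^2},$$

which leads to Eq. (6.112). This completes the proof of Lemma 6.16.

In Figure 6.3, the positive EVB local solution at $c_{a_h} c_{b_h} = \sqrt{\gamma_h \breve\gamma_h^{\mathrm{EVB}} / (LM)}$ is indicated by a cross if it exists.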
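The relations used in this proof are easy to verify numerically. The sketch below, again with hypothetical function names of our own, checks that the EVB shrinkage estimator (6.103) satisfies Eq. (6.108), that it coincides with the VB shrinkage estimator (6.50) evaluated at $c_{a_h}^2 c_{b_h}^2 = \gamma_h \breve\gamma_h^{\mathrm{EVB}} / (LM)$ from Eq. (6.111), and that the free energies (6.72) and (6.112) agree there. It assumes $\gamma_h \ge (\sqrt{L} + \sqrt{M})\sigma$.

```python
import math

def evb_shrinkage(gamma, L, M, sigma2):
    """Eq. (6.103); assumes gamma >= (sqrt(L) + sqrt(M)) * sigma."""
    u = 1.0 - (M + L) * sigma2 / gamma ** 2
    return 0.5 * gamma * (u + math.sqrt(u * u - 4.0 * L * M * sigma2 ** 2 / gamma ** 4))

def vb_shrinkage(gamma, L, M, sigma2, c2ab):
    """Eq. (6.50); c2ab = c_{a_h}^2 * c_{b_h}^2."""
    root = math.sqrt((M - L) ** 2 + 4.0 * gamma ** 2 / c2ab)
    return gamma * (1.0 - sigma2 / (2.0 * gamma ** 2) * (M + L + root))

def free_energy_posi_vb(gamma, g_breve, L, M, sigma2):
    """Eq. (6.72), written as a function of the shrinkage estimate g_breve."""
    p = g_breve / gamma + L * sigma2 / gamma ** 2
    q = g_breve / gamma + M * sigma2 / gamma ** 2
    return (-M * math.log(1.0 - p) - L * math.log(1.0 - q)
            - gamma ** 2 / sigma2 * p * q)

def free_energy_posi_evb(gamma, L, M, sigma2):
    """Eq. (6.112)."""
    t = gamma * evb_shrinkage(gamma, L, M, sigma2)
    return (M * math.log(t / (M * sigma2) + 1.0)
            + L * math.log(t / (L * sigma2) + 1.0) - t / sigma2)

def check_positive_evb(gamma, L, M, sigma2):
    """Return the three residuals described in the lead-in (all should be near zero)."""
    g = evb_shrinkage(gamma, L, M, sigma2)
    r108 = (g + L * sigma2 / gamma) * (g + M * sigma2 / gamma) - gamma * g
    c2ab = gamma * g / (L * M)                      # square of c_a c_b in Eq. (6.111)
    r_shrink = vb_shrinkage(gamma, L, M, sigma2, c2ab) - g
    r_energy = free_energy_posi_vb(gamma, g, L, M, sigma2) - free_energy_posi_evb(gamma, L, M, sigma2)
    return r108, r_shrink, r_energy
```

For example, `check_positive_evb(5.0, 2, 3, 1.0)` returns three values that are zero up to rounding error.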
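6.9.2 EVB Threshold

Lemmas 6.14 and 6.16 state that, if $\gamma_h \le \underline\gamma^{\mathrm{local-EVB}}$, only the null EVB local solution exists, and therefore it is the global EVB solution. In this section, assuming that $\gamma_h \ge \underline\gamma^{\mathrm{local-EVB}}$, we compare the free energy (6.105) at the null EVB local solution and the free energy (6.112) at the positive EVB local solution. Since $F_h^{\mathrm{EVB-Null}} \to +0$, we simply consider the situation where $F_h^{\mathrm{EVB-Posi}} \le 0$. Eq. (6.108) gives

$$\left( \gamma_h \breve\gamma_h^{\mathrm{EVB}} + L\sigma^2 \right) \left( 1 + \frac{M\sigma^2}{\gamma_h \breve\gamma_h^{\mathrm{EVB}}} \right) = \gamma_h^2. \tag{6.114}$$

By using Eqs. (6.103) and (6.109), we have

$$\gamma_h \breve\gamma_h^{\mathrm{EVB}} = \frac{1}{2} \left( \gamma_h^2 - (\underline\gamma^{\mathrm{local-EVB}})^2 + 2\sqrt{LM}\sigma^2 + \sqrt{ \left( \gamma_h^2 - (\underline\gamma^{\mathrm{local-EVB}})^2 \right) \left( \gamma_h^2 - (\underline\gamma^{\mathrm{local-EVB}})^2 + 4\sqrt{LM}\sigma^2 \right) } \right) \ge \sqrt{LM}\sigma^2. \tag{6.115}$$

Remember the definition of $\alpha$ (Eq. (6.99)),

$$\alpha = \frac{L}{M} \qquad (0 < \alpha \le 1),$$

and let

$$x_h = \frac{\gamma_h^2}{M\sigma^2}, \tag{6.116}$$
$$\tau_h = \frac{\gamma_h \breve\gamma_h^{\mathrm{EVB}}}{M\sigma^2}. \tag{6.117}$$

[Figure 6.5: plot of $\Phi(z)$ for $z > 0$.]

Figure 6.5 $\Phi(z) = \frac{\log(z + 1)}{z} - \frac{1}{2}$.

Eqs. (6.114) and (6.103) imply the following mutual relations between $x_h$ and $\tau_h$:

$$x_h \equiv x(\tau_h; \alpha) = (1 + \tau_h) \left( 1 + \frac{\alpha}{\tau_h} \right), \tag{6.118}$$
$$\tau_h \equiv \tau(x_h; \alpha) = \frac{1}{2} \left( x_h - (1 + \alpha) + \sqrt{ (x_h - (1 + \alpha))^2 - 4\alpha } \right). \tag{6.119}$$

Eqs. (6.109) and (6.115) lead to

$$x_h \ge \underline x_{\mathrm{local}} = \frac{(\underline\gamma^{\mathrm{local-EVB}})^2}{M\sigma^2} = x(\sqrt{\alpha}; \alpha) = (1 + \sqrt{\alpha})^2, \tag{6.120}$$
$$\tau_h \ge \underline\tau_{\mathrm{local}} = \sqrt{\alpha}. \tag{6.121}$$

Then, using $\Xi(\tau; \alpha)$ defined by Eq. (6.100), we can rewrite Eq. (6.112) as

$$F_h^{\mathrm{EVB-Posi}} = M \log(\tau_h + 1) + L \log\left( \frac{\tau_h}{\alpha} + 1 \right) - M\tau_h = M\tau_h\, \Xi(\tau_h; \alpha). \tag{6.122}$$

The following holds for $\Phi(z)$ (which is also defined in Eq. (6.100)):

Lemma 6.17 $\Phi(z)$ is decreasing for $z > 0$.

Proof The derivative is

$$\frac{\partial \Phi}{\partial z} = \frac{1 - \frac{1}{z + 1} - \log(z + 1)}{z^2},$$

which is negative for $z > 0$ because

$$\frac{1}{z + 1} + \log(z + 1) > 1.$$

This completes the proof of Lemma 6.17.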
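Figure 6.5 shows the profile of $\Phi(z)$. Since $\Phi(z)$ is decreasing, $\Xi(\tau; \alpha)$ is also decreasing with respect to $\tau$. It holds that, for any $0 < \alpha \le 1$,
\[
\lim_{\tau \to 0} \Xi(\tau; \alpha) = 1, \qquad \lim_{\tau \to \infty} \Xi(\tau; \alpha) = -1.
\]
Therefore, $\Xi(\tau; \alpha)$ has a unique zero-cross point $\underline{\tau}$, such that
\[
\Xi(\tau; \alpha) \le 0 \quad \text{if and only if} \quad \tau \ge \underline{\tau}. \tag{6.123}
\]
Then, we can prove the following lemma:

Lemma 6.18 The unique zero-cross point $\underline{\tau}$ of $\Xi(\tau; \alpha)$ lies in the following range:
\[
\sqrt{\alpha} < \underline{\tau} < \underline{z},
\]
where $\underline{z} \approx 2.5129$ is the unique zero-cross point of $\Phi(z)$.

Proof Since $\Phi(z)$ is decreasing, $\Xi(\tau; \alpha)$ is upper-bounded by
\[
\Xi(\tau; \alpha) = \Phi(\tau) + \Phi\left(\frac{\tau}{\alpha}\right) \le 2\Phi(\tau) = \Xi(\tau; 1).
\]
Therefore, the unique zero-cross point $\underline{\tau}$ of $\Xi(\tau; \alpha)$ is no greater than the unique zero-cross point $\underline{z}$ of $\Phi(z)$, i.e., $\underline{\tau} \le \underline{z}$.

For obtaining the lower-bound $\underline{\tau} > \sqrt{\alpha}$, it suffices to show that $\Xi(\sqrt{\alpha}; \alpha) > 0$. Let us prove that the following function is decreasing and positive for $0 < \alpha \le 1$:
\[
g(\alpha) \equiv \frac{\Xi(\sqrt{\alpha}; \alpha)}{\sqrt{\alpha}}.
\]
From the definition (6.100) of $\Xi(\tau; \alpha)$, we have
\[
g(\alpha) = \left(1 + \frac{1}{\alpha}\right) \log(\sqrt{\alpha} + 1) - \log\sqrt{\alpha} - \frac{1}{\sqrt{\alpha}}.
\]
The derivative is given by
\begin{align*}
\frac{\partial g}{\partial \sqrt{\alpha}} &= -\frac{2}{\alpha^{3/2}} \log(\sqrt{\alpha} + 1) + \frac{1 + \frac{1}{\alpha}}{\sqrt{\alpha} + 1} - \frac{1}{\sqrt{\alpha}} + \frac{1}{\alpha} \\
&= -\frac{2}{\alpha^{3/2}}\left(\log(\sqrt{\alpha} + 1) + \frac{1}{\sqrt{\alpha} + 1} - 1\right) \\
&< 0,
\end{align*}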
which implies that $g(\alpha)$ is decreasing. Since $g(1) = 2\log 2 - 1 \approx 0.3863 > 0$, $g(\alpha)$ is positive for $0 < \alpha \le 1$, which completes the proof of Lemma 6.18.

Since Eq. (6.118) is increasing with respect to $\tau_h$ ($> \sqrt{\alpha}$), the thresholding condition $\tau \ge \underline{\tau}$ in Eq. (6.123) can be expressed in terms of $x$:
\[
\Xi(\tau(x); \alpha) \le 0 \quad \text{if and only if} \quad x \ge \underline{x}, \tag{6.124}
\]
where
\[
\underline{x} \equiv x(\underline{\tau}; \alpha) = (1 + \underline{\tau})\left(1 + \frac{\alpha}{\underline{\tau}}\right). \tag{6.125}
\]
Using Eqs. (6.116) and (6.122), we have
\[
F_h^{\mathrm{EVB\text{-}Posi}} \le 0 \quad \text{if and only if} \quad \gamma_h \ge \underline{\gamma}^{\mathrm{EVB}}, \tag{6.126}
\]
where $\underline{\gamma}^{\mathrm{EVB}}$ is defined by Eq. (6.102). Thus, we have the following lemma:

Lemma 6.19 The positive EVB local solution is the global EVB solution if and only if $\gamma_h \ge \underline{\gamma}^{\mathrm{EVB}}$.

Combining Lemmas 6.14, 6.16, and 6.19 completes the proof of Theorem 6.13.

Figure 6.6 shows estimators and thresholds for $L = M = H = 1$ and $\sigma = 1$. The curves indicate the VB solution $\widehat{\gamma}_h^{\mathrm{VB}}$ given by Eq. (6.48), the EVB solution $\widehat{\gamma}_h^{\mathrm{EVB}}$ given by Eq. (6.101), the EVB positive local minimizer $\breve{\gamma}_h^{\mathrm{EVB}}$ given by Eq. (6.103), and the EVB positive local maximizer $\acute{\gamma}_h$ given by Eq. (6.107), respectively. The arrows indicate the VB threshold $\underline{\gamma}_h^{\mathrm{VB}}$ given by Eq. (6.49), the local EVB threshold $\underline{\gamma}^{\mathrm{local\text{-}EVB}}$ given by Eq. (6.109), and the EVB threshold $\underline{\gamma}^{\mathrm{EVB}}$ given by Eq. (6.102), respectively.
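The threshold derived in this subsection can be checked numerically. The following is a minimal sketch, assuming $\Phi(z) = \log(z+1)/z - 1/2$ (Figure 6.5) and $\Xi(\tau;\alpha) = \Phi(\tau) + \Phi(\tau/\alpha)$ (as used in the proof of Lemma 6.18); the function names and the bisection bracket $[10^{-9}, 10]$ are illustrative choices, not part of the text.

```python
import math

def phi(z):
    """Phi(z) = log(z + 1)/z - 1/2, decreasing for z > 0 (Lemma 6.17)."""
    return math.log1p(z) / z - 0.5

def xi(tau, alpha):
    """Xi(tau; alpha) = Phi(tau) + Phi(tau / alpha)."""
    return phi(tau) + phi(tau / alpha)

def zero_cross(f, lo, hi, iters=200):
    """Bisection for a decreasing function f with f(lo) > 0 > f(hi)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def evb_threshold(L, M, sigma):
    """EVB threshold sigma * sqrt(M * x_bar), with x_bar from Eq. (6.125).

    Assumes L <= M so that alpha = L / M lies in (0, 1].
    """
    alpha = L / M
    tau_bar = zero_cross(lambda t: xi(t, alpha), 1e-9, 10.0)
    x_bar = (1 + tau_bar) * (1 + alpha / tau_bar)
    return sigma * math.sqrt(M * x_bar)

# Zero-cross point of Phi; Lemma 6.18 quotes roughly 2.5129.
z_bar = zero_cross(phi, 1e-9, 10.0)
```

Because $\Xi$ is monotone in $\tau$, plain bisection is sufficient here; a faster root finder would also work but is not needed for a scalar problem.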
6.10 Summary of Intermediate Results

In the rest of this section, we summarize some intermediate results obtained in Section 6.9, which are useful for further analysis (mainly in Chapter 8). Summarizing Eqs. (6.109), (6.114), and (6.115) leads to the following corollary:

Corollary 6.20 The EVB shrinkage estimator (6.103) is a stationary point of the free energy (6.43), which exists if and only if
\[
\gamma_h \ge \underline{\gamma}^{\mathrm{local\text{-}EVB}} \equiv (\sqrt{L} + \sqrt{M})\sigma, \tag{6.127}
\]
and satisfies the following equation:
\[
\left(\gamma_h \breve{\gamma}_h^{\mathrm{EVB}} + L\sigma^2\right)\left(1 + \frac{M\sigma^2}{\gamma_h \breve{\gamma}_h^{\mathrm{EVB}}}\right) = \gamma_h^2. \tag{6.128}
\]
It holds that
\[
\gamma_h \breve{\gamma}_h^{\mathrm{EVB}} \ge \sqrt{LM}\sigma^2. \tag{6.129}
\]

Figure 6.6 Estimators and thresholds for $L = M = H = 1$ and $\sigma^2 = 1$.
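Combining Lemmas 6.14, 6.16, and 6.19 leads to the following corollary:

Corollary 6.21 The minimum free energy achieved under EVB learning is given by Eq. (6.42) with
\[
2F_h = \begin{cases}
M \log\left(\dfrac{\gamma_h \breve{\gamma}_h^{\mathrm{EVB}}}{M\sigma^2} + 1\right) + L \log\left(\dfrac{\gamma_h \breve{\gamma}_h^{\mathrm{EVB}}}{L\sigma^2} + 1\right) - \dfrac{\gamma_h \breve{\gamma}_h^{\mathrm{EVB}}}{\sigma^2} & \text{if } \gamma_h \ge \underline{\gamma}^{\mathrm{EVB}}, \\
+0 & \text{otherwise.}
\end{cases} \tag{6.130}
\]

Corollary 6.20 together with Theorem 6.13 implies that when $\underline{\gamma}^{\mathrm{local\text{-}EVB}} \le \gamma_h < \underline{\gamma}^{\mathrm{EVB}}$, a stationary point (called the positive EVB local solution and specified by Lemma 6.16) exists at Eq. (6.103), but it is not the global minimum. Actually, a local minimum (called the null EVB local solution and specified by Lemma 6.14) with $F_h = +0$ always exists. The stationary point at Eq. (6.103) is a nonglobal local minimum when $\underline{\gamma}^{\mathrm{local\text{-}EVB}} \le \gamma_h < \underline{\gamma}^{\mathrm{EVB}}$ and the global minimum when $\gamma_h \ge \underline{\gamma}^{\mathrm{EVB}}$ (see Figure 6.3 with its caption). This phase transition induces the free energy thresholding observed in Corollary 6.21. We define a local-EVB estimator by
\[
\widehat{U}^{\mathrm{local\text{-}EVB}} = \sum_{h=1}^{H} \widehat{\gamma}_h^{\mathrm{local\text{-}EVB}} \omega_{b_h} \omega_{a_h}^\top,
\qquad \text{where} \qquad
\widehat{\gamma}_h^{\mathrm{local\text{-}EVB}} = \begin{cases}
\breve{\gamma}_h^{\mathrm{EVB}} & \text{if } \gamma_h \ge \underline{\gamma}^{\mathrm{local\text{-}EVB}}, \\
0 & \text{otherwise,}
\end{cases} \tag{6.131}
\]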
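and call $\underline{\gamma}^{\mathrm{local\text{-}EVB}}$ a local-EVB threshold. This estimator gives the positive EVB local solution, whenever it exists, for each singular component. There is an interesting relation between the local-EVB solution and an alternative dimensionality selection method (Hoyle, 2008), which will be discussed in Section 8.6.

Rescaling the quantities related to the squared singular value by $M\sigma^2$, to which the contribution from noise (each eigenvalue of $E^\top E$) scales linearly, simplifies expressions. Assume that the condition (6.127) holds, and define
\begin{align*}
x_h &= \frac{\gamma_h^2}{M\sigma^2}, \tag{6.132} \\
\tau_h &= \frac{\gamma_h \breve{\gamma}_h^{\mathrm{EVB}}}{M\sigma^2}, \tag{6.133}
\end{align*}
which are used as a rescaled observation and a rescaled EVB estimator, respectively. Eqs. (6.128) and (6.103) specify the mutual relations between them:
\begin{align*}
x_h &\equiv x(\tau_h; \alpha) = (1 + \tau_h)\left(1 + \frac{\alpha}{\tau_h}\right), \tag{6.134} \\
\tau_h &\equiv \tau(x_h; \alpha) = \frac{1}{2}\left(x_h - (1 + \alpha) + \sqrt{(x_h - (1 + \alpha))^2 - 4\alpha}\right). \tag{6.135}
\end{align*}
With these rescaled variables, the condition (6.127), as well as (6.129), for the existence of the positive local-EVB solution $\breve{\gamma}_h^{\mathrm{EVB}}$ is expressed as
\begin{align*}
x_h &\ge x_{\mathrm{local}} = \frac{\left(\underline{\gamma}^{\mathrm{local\text{-}EVB}}\right)^2}{M\sigma^2} = x(\sqrt{\alpha}; \alpha) = (1 + \sqrt{\alpha})^2, \tag{6.136} \\
\tau_h &\ge \tau_{\mathrm{local}} = \sqrt{\alpha}. \tag{6.137}
\end{align*}
The EVB threshold (6.102) is expressed as
\[
\underline{x} = \frac{\left(\underline{\gamma}^{\mathrm{EVB}}\right)^2}{M\sigma^2} = x(\underline{\tau}; \alpha) = (1 + \underline{\tau})\left(1 + \frac{\alpha}{\underline{\tau}}\right), \tag{6.138}
\]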
and the free energy (6.130) is expressed as
\[
F_h = M\tau_h \cdot \min\left(0, \Xi(\tau_h; \alpha)\right), \tag{6.139}
\]
where $\Xi(\tau; \alpha)$ is defined by Eq. (6.100). The preceding rescaled expressions give an intuition of Theorem 6.13: the EVB solution $\widehat{\gamma}_h^{\mathrm{EVB}}$ is positive if and only if the positive local-EVB solution $\breve{\gamma}_h^{\mathrm{EVB}}$ exists (i.e., $x_h \ge x_{\mathrm{local}}$), and the free energy $\Xi(\tau(x_h; \alpha); \alpha)$ at the local-EVB solution is nonpositive (i.e., $\tau(x_h; \alpha) \ge \underline{\tau}$ or equivalently $x_h \ge \underline{x}$).
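The rescaled relations (6.134) through (6.137) translate directly into code. The sketch below is an illustration under the assumption $L \le M$; the function names are hypothetical, and the clamp of the discriminant at zero is only a floating-point safeguard at the threshold.

```python
import math

def x_of_tau(tau, alpha):
    """Rescaled observation x(tau; alpha), Eq. (6.134)."""
    return (1 + tau) * (1 + alpha / tau)

def tau_of_x(x, alpha):
    """Rescaled estimator tau(x; alpha), Eq. (6.135); needs x >= (1 + sqrt(alpha))^2."""
    disc = max((x - (1 + alpha)) ** 2 - 4 * alpha, 0.0)
    return 0.5 * (x - (1 + alpha) + math.sqrt(disc))

def local_evb_estimate(gamma, L, M, sigma):
    """Local-EVB estimator (6.131) for one observed singular value gamma > 0."""
    alpha = L / M
    x = gamma ** 2 / (M * sigma ** 2)
    if x < (1 + math.sqrt(alpha)) ** 2:      # below the local-EVB threshold (6.136)
        return 0.0
    tau = tau_of_x(x, alpha)                 # tau = gamma * gamma_breve / (M sigma^2)
    return tau * M * sigma ** 2 / gamma

# Example with L = M = 1 and sigma = 1, the setting of Figure 6.6.
g = 3.0
g_breve = local_evb_estimate(g, 1, 1, 1.0)
# Left-hand side of the stationarity condition (6.128); it should equal g**2.
lhs_6128 = (g * g_breve + 1.0) * (1.0 + 1.0 / (g * g_breve))
```

Note that this returns the positive local solution whenever it exists; the global EVB solution additionally requires the observation to exceed the EVB threshold (6.138).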
7 Model-Induced Regularization and Sparsity Inducing Mechanism
Variational Bayesian (VB) learning often shows the automatic relevance determination (ARD) property—the solution is sparse with unnecessary components eliminated automatically. In this chapter, we try to elucidate the sparsity inducing mechanism of VB learning, based on the global analytic solutions derived in Chapter 6. We argue that the ARD property is induced by the model-induced regularization (MIR), which all Bayesian learning methods possess when unidentifiable models are involved, and that MIR is enhanced by the independence constraint (imposed for computational tractability), which induces phase transitions making the solution (exactly) sparse (Nakajima and Sugiyama, 2011). We first show the VB solution for special cases where the MIR effect is visible in the solution form. Then we illustrate the behavior of the posteriors and estimators in the one-dimensional case, comparing VB learning with maximum a posteriori (MAP) learning and Bayesian learning. After that, we explain MIR, and how it is enhanced in VB learning through phase transitions.
7.1 VB Solutions for Special Cases

Here we discuss two special cases of fully observed matrix factorization (MF), in which the VB solution is simple and intuitive.

Almost Flat Prior

When $c_{a_h} c_{b_h} \to \infty$ (i.e., the prior is almost flat), the VB solution given by Theorem 6.7 in Chapter 6 has a simple form.
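Corollary 7.1 The VB solution of the fully observed matrix factorization model (6.1) through (6.3) is given by
\[
\widehat{U}^{\mathrm{VB}} = \sum_{h=1}^{H} \widehat{\gamma}_h^{\mathrm{VB}} \omega_{b_h} \omega_{a_h}^\top, \tag{7.1}
\]
where the estimator $\widehat{\gamma}_h^{\mathrm{VB}}$ corresponding to the $h$th largest singular value is upper-bounded as
\[
\widehat{\gamma}_h^{\mathrm{VB}} < \max\left\{0, \left(1 - \frac{\max(L, M)\sigma^2}{\gamma_h^2}\right)\gamma_h\right\}. \tag{7.2}
\]
For the almost flat prior (i.e., $c_{a_h} c_{b_h} \to \infty$), the equality holds, i.e.,
\[
\lim_{c_{a_h} c_{b_h} \to \infty} \widehat{\gamma}_h^{\mathrm{VB}} = \max\left\{0, \left(1 - \frac{\max(L, M)\sigma^2}{\gamma_h^2}\right)\gamma_h\right\}. \tag{7.3}
\]

Proof It is clear that the threshold (6.49) is decreasing and the shrinkage factor (6.50) is increasing with respect to $c_{a_h} c_{b_h}$. Therefore, $\widehat{\gamma}_h^{\mathrm{VB}}$ is largest for $c_{a_h} c_{b_h} \to \infty$. In this limit, Eqs. (6.49) and (6.50) are reduced to
\begin{align*}
\lim_{c_{a_h} c_{b_h} \to \infty} \underline{\gamma}_h^{\mathrm{VB}} &= \sigma \sqrt{\frac{L + M}{2} + \sqrt{\left(\frac{L + M}{2}\right)^2 - LM}} = \sigma\sqrt{\max(L, M)}, \\
\lim_{c_{a_h} c_{b_h} \to \infty} \breve{\gamma}_h^{\mathrm{VB}} &= \gamma_h\left(1 - \frac{\sigma^2}{2\gamma_h^2}\left(M + L + \sqrt{(M - L)^2}\right)\right) = \left(1 - \frac{\max(L, M)\sigma^2}{\gamma_h^2}\right)\gamma_h,
\end{align*}
which prove the corollary.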
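As an illustration of the flat-prior limit (7.3), the shrinkage can be applied to a list of observed singular values. This is a minimal sketch; the function name and the example values are hypothetical, and the singular values are assumed to have been computed elsewhere.

```python
def flat_prior_vb_shrink(gammas, L, M, sigma2):
    """Apply the limit (7.3) to each observed singular value gamma_h > 0.

    gammas : observed singular values
    L, M   : matrix dimensions
    sigma2 : noise variance sigma^2
    """
    k = max(L, M) * sigma2
    # Components with gamma_h^2 <= max(L, M) * sigma^2 are set exactly to zero.
    return [max(0.0, (1.0 - k / g ** 2) * g) for g in gammas]

# Hypothetical singular values of a 2 x 3 observation with unit noise variance.
shrunk = flat_prior_vb_shrink([4.0, 2.0, 1.0], L=2, M=3, sigma2=1.0)
```

In this example the smallest component falls below $\sigma\sqrt{\max(L, M)} = \sqrt{3}$ and is eliminated, while the larger ones are shrunk by a factor that approaches one as $\gamma_h$ grows.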
The form of the VB solution (7.3) in the limit is known as the positive-part James–Stein (PJS) estimator (James and Stein, 1961), operated on each singular component separately (see Appendix A for its interesting property and the relation to Bayesian learning). A counterintuitive fact, namely that shrinkage is observed even in the limit of the flat prior, will be explained in terms of MIR in Section 7.3.

Square Matrix

When $L = M$ (i.e., the observed matrix $V$ is square), the VB solution is intuitive, so that the shrinkage caused by MIR and the shrinkage caused by the prior are separately visible in its formula.

Corollary 7.2 When $L = M$, the VB solution is given by Eq. (7.1) with
\[
\widehat{\gamma}_h^{\mathrm{VB}} = \max\left\{0, \left(1 - \frac{M\sigma^2}{\gamma_h^2}\right)\gamma_h - \frac{\sigma^2}{c_{a_h} c_{b_h}}\right\}. \tag{7.4}
\]

Proof When $L = M$, Eqs. (6.49) and (6.50) can be written as
\begin{align*}
\underline{\gamma}_h^{\mathrm{VB}} &= \sigma\sqrt{M + \frac{\sigma^2}{2c_{a_h}^2 c_{b_h}^2} + \sqrt{\left(M + \frac{\sigma^2}{2c_{a_h}^2 c_{b_h}^2}\right)^2 - M^2}}, \\
\breve{\gamma}_h^{\mathrm{VB}} &= \gamma_h\left(1 - \frac{\sigma^2}{2\gamma_h^2}\left(2M + \sqrt{\frac{4\gamma_h^2}{c_{a_h}^2 c_{b_h}^2}}\right)\right) = \gamma_h\left(1 - \frac{\sigma^2}{\gamma_h^2}\left(M + \frac{\gamma_h}{c_{a_h} c_{b_h}}\right)\right).
\end{align*}
We can confirm that $\breve{\gamma}_h^{\mathrm{VB}} \le 0$ when $\gamma_h \le \underline{\gamma}_h^{\mathrm{VB}}$, which proves the corollary. Actually, we can confirm that $\breve{\gamma}_h^{\mathrm{VB}} = 0$ when $\gamma_h = \underline{\gamma}_h^{\mathrm{VB}}$, and $\breve{\gamma}_h^{\mathrm{VB}} < 0$ when $\gamma_h < \underline{\gamma}_h^{\mathrm{VB}}$ for any $L$, $M$, $c_{a_h}^2$, and $c_{b_h}^2$.
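In the VB solution (7.4), we can identify the PJS shrinkage and a constant shrinkage. The PJS shrinkage can be considered to be caused by MIR since it appears even with the flat prior, while the constant shrinkage $-\sigma^2/(c_{a_h} c_{b_h})$ is considered to be caused by the prior since it appears in MAP learning (see Theorem 12.1 in Chapter 12). The empirical VB (EVB) solution is also simple for square matrices. The following corollary is obtained from Theorem 6.13 in Chapter 6:

Corollary 7.3 When $L = M$, the global EVB solution is given by
\[
\widehat{\gamma}_h^{\mathrm{EVB}} = \begin{cases}
\breve{\gamma}_h^{\mathrm{EVB}} & \text{if } \gamma_h > \underline{\gamma}^{\mathrm{EVB}}, \\
0 & \text{otherwise,}
\end{cases}
\]
where
\begin{align*}
\underline{\gamma}^{\mathrm{EVB}} &= \sigma\sqrt{M\left(2 + \underline{\tau}(1) + \frac{1}{\underline{\tau}(1)}\right)}, \\
\breve{\gamma}_h^{\mathrm{EVB}} &= \frac{\gamma_h}{2}\left(1 - \frac{2M\sigma^2}{\gamma_h^2} + \sqrt{1 - \frac{4M\sigma^2}{\gamma_h^2}}\right).
\end{align*}

Proof When $L = M$, Eqs. (6.102) and (6.103) can be written as
\begin{align*}
\underline{\gamma}^{\mathrm{EVB}} &= \sigma\sqrt{M\left(2 + \underline{\tau}(1) + \frac{1}{\underline{\tau}(1)}\right)}, \\
\breve{\gamma}_h^{\mathrm{EVB}} &= \frac{\gamma_h}{2}\left(1 - \frac{2M\sigma^2}{\gamma_h^2} + \sqrt{\left(1 - \frac{2M\sigma^2}{\gamma_h^2}\right)^2 - \frac{4M^2\sigma^4}{\gamma_h^4}}\right) \\
&= \frac{\gamma_h}{2}\left(1 - \frac{2M\sigma^2}{\gamma_h^2} + \sqrt{1 - \frac{4M\sigma^2}{\gamma_h^2}}\right),
\end{align*}
which completes the proof.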
7.2 Posteriors and Estimators in a One-Dimensional Case

In order to illustrate how strongly Bayesian learning and its approximation methods are regularized, we depict posteriors and estimators in the MF model for $L = M = H = 1$ (i.e., $U$, $V$, $A$, and $B$ are merely scalars):
\[
p(V|A, B) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(V - BA)^2}{2\sigma^2}\right). \tag{7.5}
\]
In this model, we can visualize the unidentifiability of the MF model as equivalence classes: a set of points $(A, B)$ on which the product is unchanged, i.e., $U = BA$, represents the same distribution (see Figure 7.1). When $U = 0$, the equivalence class has a "cross-shape" profile on the $A$- and $B$-axes; otherwise, it forms a pair of hyperbolic curves. This redundant structure in the parameter space is the origin of MIR, and highly influences the phase transition phenomenon in VB learning, as we will see shortly.

Figure 7.1 Equivalence class structure of the one-dimensional MF model. Any $A$ and $B$ such that their product is unchanged give the same $U$. [Curves for $U = 2, 1, 0, -1, -2$ in the $(A, B)$ plane, both axes ranging from $-3$ to $3$.]

With Gaussian priors,
\begin{align*}
p(A) &= \frac{1}{\sqrt{2\pi c_a^2}} \exp\left(-\frac{A^2}{2c_a^2}\right), \tag{7.6} \\
p(B) &= \frac{1}{\sqrt{2\pi c_b^2}} \exp\left(-\frac{B^2}{2c_b^2}\right), \tag{7.7}
\end{align*}
the Bayes posterior is proportional to
\begin{align*}
p(A, B|V) &\propto p(V|A, B)\,p(A)\,p(B) \\
&\propto \exp\left(-\frac{1}{2\sigma^2}(V - BA)^2 - \frac{A^2}{2c_a^2} - \frac{B^2}{2c_b^2}\right). \tag{7.8}
\end{align*}
Figure 7.2 shows the contour of the unnormalized Bayes posterior (7.8) when $V = 0, 1, 2$ are observed, the noise variance is $\sigma^2 = 1$, and the prior covariances are set to $c_a = c_b = 100$ (i.e., almost flat priors). We can see that the equivalence class structure is reflected in the Bayes posterior: when $V = 0$, the surface of the Bayes posterior has a cross-shaped profile and its maximum is at the origin; when $V > 0$, the surface is divided into the positive orthant (i.e., $A, B > 0$) and the negative orthant (i.e., $A, B < 0$), and the two "modes" get farther as $V$ increases.

MAP Solution

Let us first investigate the behavior of the MAP estimator, which coincides with the maximum likelihood (ML) estimator when the priors are flat. For finite $c_a$ and $c_b$, the MAP solution can be expressed as
\begin{align*}
\widehat{A}^{\mathrm{MAP}} &= \pm\sqrt{\frac{c_a}{c_b} \max\left\{0, |V| - \frac{\sigma^2}{c_a c_b}\right\}}, \\
\widehat{B}^{\mathrm{MAP}} &= \pm\,\mathrm{sign}(V)\sqrt{\frac{c_b}{c_a} \max\left\{0, |V| - \frac{\sigma^2}{c_a c_b}\right\}},
\end{align*}
where $\mathrm{sign}(\cdot)$ denotes the sign of a scalar (see Corollary 12.2 in Chapter 12 for derivation). In Figure 7.2, the asterisks indicate the MAP estimators, and the dashed curves indicate the ML estimators (the modes of the contour of Eq. (7.8) when $c_a = c_b \to \infty$). When $V = 0$, the Bayes posterior takes the maximum value on the $A$- and $B$-axes, which results in the MAP estimator equal to $\widehat{U}^{\mathrm{MAP}}\ (= \widehat{B}^{\mathrm{MAP}} \widehat{A}^{\mathrm{MAP}}) = 0$. When $V = 1$, the profile of the Bayes posterior is hyperbolic and the maximum value is achieved on the hyperbolic curves in the positive orthant (i.e., $A, B > 0$) and the negative orthant (i.e., $A, B < 0$); in either case, $\widehat{U}^{\mathrm{MAP}} \approx 1$ ($\lim_{c_a, c_b \to \infty} \widehat{U}^{\mathrm{MAP}} = 1$). When $V = 2$, a similar multimodal structure is observed and the MAP estimator is $\widehat{U}^{\mathrm{MAP}} \approx 2$ ($\lim_{c_a, c_b \to \infty} \widehat{U}^{\mathrm{MAP}} = 2$). From these plots, we can visually confirm that the MAP estimator with almost flat priors ($c_a = c_b = 100$) approximately agrees with the ML estimator: $\widehat{U}^{\mathrm{MAP}} \approx \widehat{U}^{\mathrm{ML}} = V$ ($\lim_{c_a, c_b \to \infty} \widehat{U}^{\mathrm{MAP}} = \widehat{U}^{\mathrm{ML}}$). We will use the ML estimator as an unregularized reference in the following discussion.

Figure 7.2 (Unnormalized) Bayes posteriors for $c_a = c_b = 100$ (i.e., almost flat priors). The asterisks are the MAP estimators, and the dashed curves indicate the ML estimators (the modes of the contour when $c_a = c_b = c \to \infty$). [Panels in the $(A, B)$ plane: $V = 0$, MAP estimator $(A, B) = (0, 0)$; $V = 1$, MAP estimators $(A, B) \approx (\pm 1, \pm 1)$; $V = 2$, MAP estimators $(A, B) \approx (\pm\sqrt{2}, \pm\sqrt{2})$.]

Figure 7.3 shows the contour of the Bayes posterior when $c_a = c_b = 2$. The MAP estimators shift from the ML solutions (dashed curves) toward the origin, and they are more clearly contoured as peaks.

VB Solution

Next we depict the VB posterior, given by Corollary 6.8 in Chapter 6. When $L = M = H = 1$, the VB solution is given by
\[
r(A, B) = \begin{cases}
\mathrm{Gauss}_1\!\left(A;\ \pm\sqrt{\breve{\gamma}^{\mathrm{VB}} \dfrac{c_a}{c_b}},\ \dfrac{\sigma^2 c_a}{|V| c_b}\right) \mathrm{Gauss}_1\!\left(B;\ \pm\,\mathrm{sign}(V)\sqrt{\breve{\gamma}^{\mathrm{VB}} \dfrac{c_b}{c_a}},\ \dfrac{\sigma^2 c_b}{|V| c_a}\right) & \text{if } |V| \ge \underline{\gamma}^{\mathrm{VB}}, \\
\mathrm{Gauss}_1\!\left(A;\ 0,\ c_a^2 \kappa^{\mathrm{VB}}\right) \mathrm{Gauss}_1\!\left(B;\ 0,\ c_b^2 \kappa^{\mathrm{VB}}\right) & \text{otherwise,}
\end{cases} \tag{7.9}
\]
where
\begin{align*}
\underline{\gamma}^{\mathrm{VB}} &= \sigma\sqrt{1 + \frac{\sigma^2}{2c_a^2 c_b^2} + \sqrt{\left(1 + \frac{\sigma^2}{2c_a^2 c_b^2}\right)^2 - 1}}, \\
\breve{\gamma}^{\mathrm{VB}} &= \left(1 - \frac{\sigma^2}{V^2}\right)|V| - \frac{\sigma^2}{c_a c_b}, \\
\kappa^{\mathrm{VB}} &= -\frac{\sigma^2}{2c_a^2 c_b^2} + \sqrt{\left(1 + \frac{\sigma^2}{2c_a^2 c_b^2}\right)^2 - 1}.
\end{align*}

Figure 7.3 (Unnormalized) Bayes posteriors for $c_a = c_b = 2$. The dashed curves indicating the ML estimators are identical to those in Figure 7.2. [Panels in the $(A, B)$ plane for $V = 0$, $V = 1$, and $V = 2$.]
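To make the difference between the MAP and VB point estimates of $U = BA$ concrete, the scalar formulas above can be evaluated directly. The sketch below is illustrative only; the function names are hypothetical, and the estimate of $U$ is taken as the product of the estimated (or posterior mean) values of $A$ and $B$.

```python
import math

def map_estimate(V, ca, cb, sigma):
    """Product B_MAP * A_MAP: soft thresholding of V by sigma^2 / (ca * cb)."""
    return math.copysign(max(0.0, abs(V) - sigma ** 2 / (ca * cb)), V)

def vb_threshold(ca, cb, sigma):
    """Scalar VB threshold below which the VB posterior stays at the origin."""
    t = sigma ** 2 / (2 * ca ** 2 * cb ** 2)
    return sigma * math.sqrt(1 + t + math.sqrt((1 + t) ** 2 - 1))

def vb_estimate(V, ca, cb, sigma):
    """Product of the VB posterior means of A and B, following Eq. (7.9)."""
    if abs(V) < vb_threshold(ca, cb, sigma):
        return 0.0
    shrunk = (1 - sigma ** 2 / V ** 2) * abs(V) - sigma ** 2 / (ca * cb)
    return math.copysign(max(0.0, shrunk), V)

# Almost flat priors (ca = cb = 100) and unit noise variance.
for V in (0.0, 1.0, 2.0):
    print(V, map_estimate(V, 100.0, 100.0, 1.0), vb_estimate(V, 100.0, 100.0, 1.0))
```

With almost flat priors the MAP estimate stays close to $V$, whereas the VB estimate is zero up to the threshold (slightly above 1 here) and close to $3/2$ at $V = 2$.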
Figure 7.4 shows the contour of the VB posterior (7.9) when $V = 0, 1, 2$ are observed, the noise variance is $\sigma^2 = 1$, and the prior covariances are set to $c_a = c_b = 100$ (i.e., almost flat priors). When $V = 0$, the cross-shaped contour of the Bayes posterior (see Figure 7.2) is approximated by a spherical Gaussian distribution located at the origin. Thus, the VB estimator is $\widehat{U}^{\mathrm{VB}} = 0$, which coincides with the MAP estimator. When $V = 1$, two hyperbolic "modes" of the Bayes posterior are approximated again by a spherical Gaussian distribution located at the origin. Thus, the VB estimator is still $\widehat{U}^{\mathrm{VB}} = 0$, which differs from the MAP estimator $\widehat{U}^{\mathrm{MAP}} \approx 1$.

Figure 7.4 VB solutions for $c_a = c_b = 100$ (i.e., almost flat priors). When $V = 2$, VB learning gives either one of the two solutions shown in the bottom row. [Panels in the $(A, B)$ plane: $V = 0$, VB estimator $(A, B) = (0, 0)$; $V = 1$, VB estimator $(A, B) = (0, 0)$; $V = 2$, VB estimator $(A, B) \approx (\sqrt{1.5}, \sqrt{1.5})$; $V = 2$, VB estimator $(A, B) \approx (-\sqrt{1.5}, -\sqrt{1.5})$.]

$V = \underline{\gamma}^{\mathrm{VB}} \approx \sqrt{M\sigma^2} = 1$ ($\lim_{c_a, c_b \to \infty} \underline{\gamma}^{\mathrm{VB}} = \sqrt{M\sigma^2}$) is actually a transition point of the VB solution. When $V$ is not larger than the threshold $\underline{\gamma}^{\mathrm{VB}} \approx 1$, VB learning tries to approximate the two "modes" of the Bayes posterior by the origin-centered Gaussian distribution. When $V$ goes beyond the threshold $\underline{\gamma}^{\mathrm{VB}} \approx 1$, the "distance" between two hyperbolic modes of the Bayes posterior becomes so large that VB learning chooses to approximate one of those two modes in the positive and the negative orthants. As such, the symmetry is broken spontaneously and the VB estimator is detached from the origin. The bottom row of Figure 7.4 shows the contour of the two possible VB posteriors when $V = 2$. Note that the VB estimator, $\widehat{U}^{\mathrm{VB}} \approx 3/2$, is the same for both cases, and differs from the MAP estimator $\widehat{U}^{\mathrm{MAP}} \approx 2$. In general, the VB estimator is closer to the origin than the MAP estimator, and the relative difference between them tends to shrink as $V$ increases.

Bayesian Estimator

The full Bayesian estimator is defined as the mean of the Bayes posterior (see Eq. (1.7)). In the MF model with $L = M = H = 1$, the Bayesian estimator is expressed as
\[
\widehat{U}^{\mathrm{Bayes}} = \langle BA \rangle_{p(V|A,B)p(A)p(B)/p(V)}. \tag{7.10}
\]
If $V = 0, 1, 2, 3$ are observed, the Bayesian estimators with almost flat priors are $\widehat{U}^{\mathrm{Bayes}} = 0, 0.92, 1.93, 2.95$, respectively, which were numerically computed (see footnote 1). Compared with the MAP estimator (with almost flat priors), which gives $\widehat{U}^{\mathrm{MAP}} = 0, 1, 2, 3$, respectively, the Bayesian estimator is slightly shrunken.

EVB Solution

Next we consider the empirical Bayesian solutions, where the hyperparameters $c_a$, $c_b$ are also estimated from observation (the noise variance $\sigma^2$ is still treated as a given constant). We fix the ratio between the prior variances to $c_a/c_b = 1$. From Corollary 7.3 and Eq. (7.9), we obtain the EVB posterior for $L = M = H = 1$ as follows:
\[
r(A, B) = \begin{cases}
\mathrm{Gauss}_1\!\left(A;\ \pm\sqrt{\breve{\gamma}^{\mathrm{EVB}}},\ \dfrac{\sigma^2}{|V|}\right) \mathrm{Gauss}_1\!\left(B;\ \pm\,\mathrm{sign}(V)\sqrt{\breve{\gamma}^{\mathrm{EVB}}},\ \dfrac{\sigma^2}{|V|}\right) & \text{if } |V| \ge \underline{\gamma}^{\mathrm{EVB}}, \\
\mathrm{Gauss}_1(A;\ 0, +0)\ \mathrm{Gauss}_1(B;\ 0, +0) & \text{otherwise,}
\end{cases} \tag{7.11}
\]
where
\begin{align*}
\underline{\gamma}^{\mathrm{EVB}} &= \sigma\sqrt{2 + \underline{\tau}(1) + \frac{1}{\underline{\tau}(1)}} \approx \sigma\sqrt{2 + 2.5129 + \frac{1}{2.5129}} \approx 2.216\sigma, \\
\breve{\gamma}^{\mathrm{EVB}} &= \frac{|V|}{2}\left(1 - \frac{2\sigma^2}{V^2} + \sqrt{1 - \frac{4\sigma^2}{V^2}}\right).
\end{align*}

Footnote 1: More precisely, we numerically calculated the Bayesian estimator (7.10) by sampling $A$ and $B$ from the almost flat priors $p(A)p(B)$ for $c_a = c_b = 100$ and computing the ratio between the sample averages of $BA \cdot p(V|A, B)$ and $p(V|A, B)$.
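The Bayesian estimator (7.10) can also be approximated without sampling. The sketch below does not reproduce the sampling procedure of the footnote; it relies on an assumption introduced here: for fixed $A$, the Gaussian prior on $B$ can be integrated out in closed form, so that $V | A \sim \mathrm{Gauss}_1(0, \sigma^2 + A^2 c_b^2)$ and $\langle BA \rangle$ given $A$ and $V$ equals $V A^2 c_b^2 / (\sigma^2 + A^2 c_b^2)$. The remaining one-dimensional integral over $A$ is evaluated with the trapezoidal rule on a logarithmic grid. The function name, grid limits, and grid size are illustrative choices.

```python
import math

def bayes_estimate(V, ca=100.0, cb=100.0, sigma=1.0, n=4000):
    """Posterior mean of U = B*A in the scalar MF model via 1-D quadrature.

    Integration variable: t = A * cb > 0 (the integrand is even in A),
    discretized uniformly in u = log(t).
    """
    u_lo = -15.0                        # t = exp(-15) is negligible next to sigma = 1
    u_hi = math.log(8.0 * ca * cb)      # the prior on A is negligible beyond 8 std devs
    h = (u_hi - u_lo) / n
    num = den = 0.0
    for i in range(n + 1):
        t = math.exp(u_lo + i * h)
        s = sigma ** 2 + t ** 2         # marginal variance of V given A
        # prior on A  x  marginal likelihood of V  x  Jacobian dt/du = t
        w = math.exp(-t ** 2 / (2 * (ca * cb) ** 2) - V ** 2 / (2 * s)) / math.sqrt(s) * t
        if i == 0 or i == n:
            w *= 0.5                    # trapezoidal end-point weights
        num += w * t ** 2 / s           # conditional shrinkage factor of V
        den += w
    return V * num / den

estimates = [bayes_estimate(V) for V in (0.0, 1.0, 2.0, 3.0)]
```

Under these assumptions the values come out close to those quoted in the text for $V = 1$ and $V = 2$ (about 0.92 and 1.93), which illustrates that the shrinkage is a property of the posterior mean itself rather than an artifact of the sampling scheme.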
Figure 7.5 EVB solutions. Top-left: When $V = 2$, the EVB posterior is reduced to the Dirac delta function located at the origin. Top-right and bottom: When $V = 3$, the EVB posterior is detached from the origin, and located at $(A, B) \approx (\sqrt{2.28}, \sqrt{2.28})$ or $(A, B) \approx (-\sqrt{2.28}, -\sqrt{2.28})$, both of which yield the same EVB estimator $\widehat{U}^{\mathrm{EVB}} \approx 2.28$.
Figure 7.5 shows the EVB posterior when $V = 2, 3$ are observed, and the noise variance is $\sigma^2 = 1$. When $V = 2 < \underline{\gamma}^{\mathrm{EVB}}$, the EVB posterior is given by the Dirac delta function located at the origin, resulting in the EVB estimator equal to $\widehat{U}^{\mathrm{EVB}} = 0$ (top-left graph). On the other hand, when $V = 3 > \underline{\gamma}^{\mathrm{EVB}}$, the EVB posterior is a Gaussian located in the top-right region or bottom-left region, and the EVB estimator is $\widehat{U}^{\mathrm{EVB}} \approx 2.28$ for both solutions (top-right and bottom graphs).

Empirical Bayesian Estimator

The empirical Bayesian (EBayes) estimator (introduced in Section 1.2.7) is the Bayesian estimator,
\[
\widehat{U}^{\mathrm{EBayes}} = \langle BA \rangle_{p(V|A,B)p(A; \widehat{c}_a)p(B; \widehat{c}_b)/p(V; \widehat{c}_a, \widehat{c}_b)},
\]
with the hyperparameters estimated by minimizing the Bayes free energy $F^{\mathrm{Bayes}}(V; c_a, c_b) \equiv -\log p(V; c_a, c_b)$, i.e.,
\[
(\widehat{c}_a, \widehat{c}_b) = \operatorname*{argmin}_{(c_a, c_b)} F^{\mathrm{Bayes}}(V; c_a, c_b).
\]
When $V = 0, 1, 2, 3$ are observed, the EBayes estimators are $0.00, 0.00, 1.25, 2.58$ (with the prior variance estimators given by $\widehat{c}_a = \widehat{c}_b \approx 0.0, 0.0, 1.4, 2.1$), respectively, which were numerically computed (see footnote 2).

Footnote 2: For $c_a c_b = 10^{-2.00}, 10^{-1.99}, \ldots, 10^{1.00}$, we numerically computed the Bayes free energy, and chose its minimizer $\widehat{c}_a \widehat{c}_b$, with which the Bayesian estimator was computed.

Behavior of Estimators

Figure 7.6 shows the behavior of estimators, including the MAP estimator $\widehat{U}^{\mathrm{MAP}}$, the VB estimator $\widehat{U}^{\mathrm{VB}}$, the Bayesian estimator $\widehat{U}^{\mathrm{Bayes}}$, the EVB estimator $\widehat{U}^{\mathrm{EVB}}$, and the EBayes estimator $\widehat{U}^{\mathrm{EBayes}}$, when the noise variance is $\sigma^2 = 1$. For nonempirical Bayesian estimators, i.e., the MAP, the VB, and the Bayesian estimators, the hyperparameters are set to $c_a = c_b = 100$ (i.e., almost flat priors). Overall, the solutions satisfy
\[
\widehat{U}^{\mathrm{EVB}} < \widehat{U}^{\mathrm{EBayes}} < \widehat{U}^{\mathrm{VB}} < \widehat{U}^{\mathrm{Bayes}} < \widehat{U}^{\mathrm{MAP}}\ (\approx \widehat{U}^{\mathrm{ML}}),
\]
which shows the strength of the regularization effect of each method. Naturally, the empirical Bayesian variants are more regularized than their nonempirical Bayesian counterparts with almost flat priors.

Figure 7.6 Behavior of the MAP estimator $\widehat{U}^{\mathrm{MAP}}$, the VB estimator $\widehat{U}^{\mathrm{VB}}$, the Bayesian estimator $\widehat{U}^{\mathrm{Bayes}}$, the EVB estimator $\widehat{U}^{\mathrm{EVB}}$, and the EBayes estimator $\widehat{U}^{\mathrm{EBayes}}$, when the noise variance is $\sigma^2 = 1$. For the MAP, the VB, and the Bayesian estimators, the hyperparameters are set to $c_a = c_b = 100$ (i.e., almost flat priors).

With almost flat priors, the MAP estimator is almost identical to the ML estimator, $\widehat{U}^{\mathrm{MAP}} \approx \widehat{U}^{\mathrm{ML}} = V$, meaning that it is unregularized. We see in Figure 7.6 that the Bayesian estimator $\widehat{U}^{\mathrm{Bayes}}$ is regularized even with almost flat priors. Furthermore, the VB estimator $\widehat{U}^{\mathrm{VB}}$ shows thresholding behavior, which leads to exact sparsity in multidimensional cases. Exact sparsity also appears in EVB learning and EBayes learning. In the subsequent sections, we explain those observations in terms of model-induced regularization and phase transitions.
7.3 Model-Induced Regularization

In this section, we explain the origin of the shrinkage of the Bayesian estimator, observed in Section 7.2. The shrinkage is caused by an implicit regularization effect, called model-induced regularization (MIR), which is strongly related to unidentifiability of statistical models.

7.3.1 Unidentifiable Models

Identifiability is formally defined as follows:

Definition 7.4 (Identifiability of statistical models) A statistical model $p(\cdot|w)$ parameterized by $w \in \mathcal{W}$ is said to be identifiable, if the mapping $w \mapsto p(\cdot|w)$ is one-to-one, i.e.,
\[
p(\cdot|w_1) = p(\cdot|w_2) \iff w_1 = w_2 \qquad \text{for any } w_1, w_2 \in \mathcal{W}.
\]
Otherwise, it is said to be unidentifiable (see footnote 3).

Footnote 3: Distributions are identified in weak topology in distribution, i.e., $p(x|w_1)$ is identified with $p(x|w_2)$ if $\int f(x)p(x|w_1)dx = \int f(x)p(x|w_2)dx$ for any bounded continuous function $f(x)$.

Many popular statistical models are unidentifiable.

Example 7.5 The MF model (introduced in Section 3.1) is unidentifiable, because the model distribution
\[
p(V|A, B) \propto \exp\left(-\frac{1}{2\sigma^2}\left\|V - BA^\top\right\|_{\mathrm{Fro}}^2\right) \tag{7.12}
\]
is invariant to the following transformation
\[
(A, B) \to (AT^\top, BT^{-1})
\]
for any nonsingular matrix $T \in \mathbb{R}^{H \times H}$.

Example 7.6 The multilayer neural network model is unidentifiable. Consider a three-layer neural network with $H$ hidden units:
\[
p(y|x, A, B) \propto \exp\left(-\frac{1}{2\sigma^2}\left\|y - f(x; A, B)\right\|^2\right), \qquad f(x; A, B) = \sum_{h=1}^{H} b_h \cdot \psi\left(a_h^\top x\right), \tag{7.13}
\]
where $x \in \mathbb{R}^M$ is an input vector, $y \in \mathbb{R}^L$ is an output vector, $A = (a_1, \ldots, a_H) \in \mathbb{R}^{M \times H}$ and $B = (b_1, \ldots, b_H) \in \mathbb{R}^{L \times H}$ are the weight parameters to be estimated, and $\psi(\cdot)$ is an antisymmetric nonlinear activation function such as $\tanh(\cdot)$. This model expresses the identical distribution on each of the following sets of points in the parameter space:
\begin{align*}
&\{a_h \in \mathbb{R}^M, b_h = 0\} \cup \{a_h = 0, b_h \in \mathbb{R}^L\} && \text{for any } h, \\
&\{a_h = a_{h'},\ b_h, b_{h'} \in \mathbb{R}^L,\ b_h + b_{h'} = \mathrm{const.}\} && \text{for any pair } h, h'.
\end{align*}
In other words, the model is invariant for any $a_h \in \mathbb{R}^M$ if $b_h = 0$, for any $b_h \in \mathbb{R}^L$ if $a_h = 0$, and for any $b_h, b_{h'} \in \mathbb{R}^L$ as long as $b_h + b_{h'}$ is unchanged and $a_h = a_{h'}$.

Example 7.7 (Mixture models) The mixture model (introduced as Example 1.3 in Section 1.1.4) is generally unidentifiable. The model distribution is given as
\[
p(x|\alpha, \{\tau_k\}) = \sum_{k=1}^{K} \alpha_k\, p(x|\tau_k), \tag{7.14}
\]
where $x \in \mathcal{X}$ is an observed random variable, and $\alpha = (\alpha_1, \ldots, \alpha_K)^\top \in \Delta^{K-1}$ and $\{\tau_k \in \mathcal{T}\}_{k=1}^{K}$ are the parameters to be estimated. This model expresses the identical distribution on each of the following sets of points in the parameter space:
\begin{align*}
&\{\alpha_k = 0,\ \tau_k \in \mathcal{T}\} && \text{for any } k, \\
&\{\alpha_k, \alpha_{k'} \in [0, 1],\ \alpha_k + \alpha_{k'} = \mathrm{const.},\ \tau_k = \tau_{k'}\} && \text{for any pair } k, k'.
\end{align*}
Namely, if the mixing weight $\alpha_k$ is zero for the $k$th mixture component, the corresponding component parameter $\tau_k$ does not affect the model distribution, and if there are two identical components $\tau_k = \tau_{k'}$, the balance between the corresponding mixture weights, $\alpha_k$ and $\alpha_{k'}$, is arbitrary.
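The invariance in Example 7.5 is easy to verify numerically. The following sketch uses small hypothetical matrices with $L = 2$, $M = 3$, and $H = 2$, plain nested lists instead of a linear algebra library, and a transformation matrix whose inverse is written out by hand.

```python
def matmul(X, Y):
    """Matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

# Hypothetical parameters: A is M x H, B is L x H.
A = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
B = [[2.0, 1.0], [-1.0, 4.0]]

# A nonsingular H x H matrix T (determinant 1) and its inverse.
T = [[2.0, 1.0], [1.0, 1.0]]
T_inv = [[1.0, -1.0], [-1.0, 2.0]]

U = matmul(B, transpose(A))                 # U = B A^T, the only quantity in Eq. (7.12)

A_new = matmul(A, transpose(T))             # A -> A T^T
B_new = matmul(B, T_inv)                    # B -> B T^{-1}
U_new = matmul(B_new, transpose(A_new))     # B T^{-1} T A^T = B A^T
```

Since the likelihood (7.12) depends on $(A, B)$ only through $BA^\top$, every such $T$ yields a different parameter point with the same distribution.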
Readers might have noticed that, in the multilayer neural network (Example 7.6) and the mixture model (Example 7.7), the model expressed by the unidentifiable sets of points corresponds to the model with fewer components or smaller degrees of freedom. For example, if $a_h = 0$ or $b_h = 0$ in the neural network with $H$ hidden units, the model is reduced to the neural network with $H - 1$ hidden units. If two hidden units receive the identical input, i.e., $\psi(a_h^\top x) = \psi(a_{h'}^\top x)$ for any $x \in \mathbb{R}^M$, they can be combined into a single unit with its output weight equal to the sum of the original output weights, i.e., $b_h + b_{h'} \to b_h$. Thus, the model is again reduced to the neural network with $H - 1$ hidden units. The same applies to the mixture models and many other popular statistical models, including Bayesian networks, hidden Markov models, and latent Dirichlet allocation, which were introduced in Chapter 4. As will be explained shortly, this nesting structure, in which simpler models correspond to unidentifiable sets of points in the parameter space of more complex models, is essential for MIR.
7.3.2 Singularities

Continuous points denoting the same distribution are called singularities, on which the Fisher information,
\[
\mathbb{S}_+^D \ni F = \int \frac{\partial \log p(x|w)}{\partial w} \frac{\partial \log p(x|w)}{\partial w}^\top p(x|w)\,dx, \tag{7.15}
\]
is singular, i.e., it has at least one zero eigenvalue. This is a natural consequence from the fact that the Fisher information corresponds to the metric when the distance between two points in the parameter space is measured by the KL divergence (Jeffreys, 1946), i.e., it holds that
\[
\mathrm{KL}\left(p(x|w)\,\|\,p(x|w + \Delta w)\right) = \frac{1}{2}\Delta w^\top F \Delta w + O(\|\Delta w\|^3)
\]
for a small change $\Delta w$ of the parameter. On the singularities, there is at least one direction in which the small change $\Delta w$ does not affect the distribution, implying that the Fisher metric $F$ is singular. This means that the volume element, proportional to the determinant of the Fisher metric, is zero on the singularities, while it is positive on the regular points (see Appendix B for more details on the Fisher metric and the volume element in the parameter space). This strong nonuniformity of (the density of) the volume element affects the behavior of Bayesian learning. For this reason, statistical models having singularities in their parameter space are called singular models and distinguished from the regular models in statistical learning theory (Watanabe, 2009). There are two aspects of how singularities affect the learning properties. In this chapter, we focus on one aspect that leads to MIR. The other aspect will be discussed in Chapter 13.

Figure 7.7 Singularities of a neural network model.

Figure 7.7 illustrates the singularities in the parameter space of the three-layer neural network (7.13) with $H = 1$ hidden unit (see Example 7.6). The horizontal axis corresponds to an arbitrary direction of $a_h \in \mathbb{R}^M$, while the vertical axis corresponds to an arbitrary direction of $b_h \in \mathbb{R}^L$. The shadowed locations correspond to the singularities. Importantly, all points on the singularities express the identical neural network model with no ($H = 0$) hidden unit, while each regular point expresses a different neural network model with $H = 1$ hidden unit. This illustration gives an intuition that the neighborhood of the smaller model ($H = 0$) is broader than the neighborhood of the larger model ($H = 1$) in the parameter space.

Consider the Jeffreys prior,
\[
p^{\mathrm{Jef}}(w) \propto \sqrt{\det(F)}, \tag{7.16}
\]
which is the uniform prior in the space of distributions when the distance is measured by the KL divergence (see Appendix B). As discussed previously, the Fisher information is singular on the singularities, giving $p^{\mathrm{Jef}}(w) = 0$ for the smaller model (with $H = 0$), while the Fisher information is regular on the other points, giving $p^{\mathrm{Jef}}(w) > 0$ for the larger model (with $H = 1$). Also in the neighborhood of the singularities, the Fisher information has similar values and it holds that $p^{\mathrm{Jef}}(w) \ll 1$. This means that, in comparison with the Jeffreys prior, the flat priors on $a_h$ and $b_h$ (the uniform prior in the parameter space) put much more mass to the smaller model and its neighborhood. A consequence is that, if we apply Bayesian learning with the flat prior, the overweighted singularities and their neighborhood pull the estimator to the smaller model through the integral computation, which induces implicit regularization, i.e., MIR. The same argument holds for mixture models (Example 7.7), and other popular models, including Bayesian networks, hidden Markov models, and latent Dirichlet allocation.

In summary, MIR occurs in general singular models for the following reasons:

• There is strong nonuniformity in (the density of) the volume element around the singularities.
• Singularities correspond to the model with fewer degrees of freedom than the regular points.

This structure in the parameter space makes the flat prior favor smaller models in Bayesian learning, which appears as MIR. Note that MIR does not occur in point-estimation methods, including ML estimation and MAP learning, since the nonuniformity of the volume element affects the estimator only through integral computations. MIR also occurs in the MF model (Example 7.5), which will be investigated in the next subsection with a generalization of the Jeffreys prior.
7.3.3 MIR in One-Dimensional Matrix Factorization
In Section 7.2, we numerically observed MIR: the Bayesian estimator is shrunken even with the almost flat prior in the one-dimensional MF model. However, in the MF model, the original definition (7.16) of the Jeffreys prior is zero everywhere in the parameter space because of the equivalence class structure (Figure 7.1), and therefore, it provides no information on MIR. To evaluate the nonuniformity of the volume element, we redefine the (generalized) Jeffreys prior by ignoring the common zero eigenvalues, i.e.,
\[ p^{\mathrm{Jef}}(w) \propto \sqrt{\prod_{d=1}^{D} \lambda_d}, \tag{7.17} \]
where $\lambda_d$ is the $d$th largest eigenvalue of the Fisher metric $F$, and $D$ is the maximum number of positive eigenvalues over the whole parameter space. Let us consider the nonfactorizing model,
\[ p(V|U) = \mathrm{Gauss}_1(V; U, \sigma^2) \propto \exp\left( -\frac{1}{2\sigma^2} (V - U)^2 \right), \tag{7.18} \]
Figure 7.8 The (unnormalized) Jeffreys noninformative prior (7.20) of the one-dimensional MF model (7.5), shown as contours over A (horizontal axis) and B (vertical axis), each ranging from −3 to 3.
where $U$ itself is the parameter to be estimated. The Jeffreys prior for this model is uniform (see Example B.1 in Appendix B for derivation):
\[ p^{\mathrm{Jef}}(U) \propto 1. \tag{7.19} \]
On the other hand, the Jeffreys prior for the MF model (7.5) is given as follows (see Example B.2 in Appendix B for derivation):
\[ p^{\mathrm{Jef}}(A, B) \propto \sqrt{A^2 + B^2}, \tag{7.20} \]
which is illustrated in Figure 7.8. Note that the Jeffreys priors (7.19) and (7.20) for both cases are improper, meaning that they cannot be normalized since their integrals diverge. Jeffreys (1946) stated that both combinations, the nonfactorizing model (7.18) with its Jeffreys prior (7.19) and the MF model (7.5) with its Jeffreys prior (7.20), give the equivalent Bayesian estimator. We can easily show that the former combination, Eqs. (7.18) and (7.19), gives an unregularized solution. Thus, the Bayesian estimator in the MF model (7.5) with its Jeffreys prior (7.20) is also unregularized. Since the flat prior on (A, B) has more probability mass around the origin than the Jeffreys prior (7.20) (see Figure 7.8), it favors smaller |U| and regularizes the Bayesian estimator. Although MIR appears also in regular models unless the Jeffreys prior is flat in the parameter space, its effect is prominent in singular models with unidentifiability, since the difference between the flat prior and the Jeffreys prior is large.
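The form of Eq. (7.20) can be checked numerically. The following is a minimal sketch, assuming the one-dimensional MF likelihood is $V \sim \mathrm{Gauss}_1(BA, \sigma^2)$ (the model (7.5) is not restated in this section, so this form is an assumption). The function names are illustrative only. The Fisher information is a rank-one matrix, so its determinant vanishes everywhere, while the square root of its single positive eigenvalue equals $\sqrt{A^2 + B^2}/\sigma$.

```python
import numpy as np

def fisher_information_mf(a, b, sigma2=1.0):
    """Fisher information of V ~ N(b * a, sigma2) with respect to (a, b)."""
    # The score is (V - b*a) / sigma2 * (b, a) and E[(V - b*a)^2] = sigma2,
    # so F = (1 / sigma2) * g g^T with g = (b, a): a rank-one matrix.
    g = np.array([b, a], dtype=float)
    return np.outer(g, g) / sigma2

def generalized_jeffreys(a, b, sigma2=1.0):
    """Unnormalized generalized Jeffreys prior (7.17) with D = 1."""
    # eigvalsh returns eigenvalues in ascending order; with D = 1 only the
    # largest eigenvalue enters the product in Eq. (7.17).
    largest = np.linalg.eigvalsh(fisher_information_mf(a, b, sigma2))[-1]
    return float(np.sqrt(max(float(largest), 0.0)))

if __name__ == "__main__":
    for a, b in [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0)]:
        print(a, b, generalized_jeffreys(a, b), np.hypot(a, b))
```

With $\sigma^2 = 1$ the two printed columns agree, and the value at the origin is zero, which is the nonuniformity that the flat prior ignores.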
7.3.4 Evidence View of Unidentifiable Models
MIR works as Occam's razor in general. MacKay (1992) explained, with the illustration shown in the left panel of Figure 7.9, that evidence-based (i.e., free-energy-minimization-based) model selection is naturally equipped with Occam's razor. In the figure, the horizontal axis denotes the space of the observed data set $D$. $H_1$ and $H_2$ denote a simple hypothesis and a more complex hypothesis, respectively. For example, in the MF model, the observed data set corresponds to the observed matrix, i.e., $D = V$, $H_1$ corresponds to a lower-rank model, and $H_2$ corresponds to a higher-rank model. The vertical axis indicates the evidence or marginal likelihood,
\[ p(D|H_t) = \left\langle p(D|w_{H_t}, H_t) \right\rangle_{p(w_{H_t})} \quad \text{for } t = 1, 2, \tag{7.21} \]
where $w_{H_t}$ denotes the unknown parameters that the hypothesis $H_t$ has. Since $H_1$ is simple, it covers a limited area of the space of $D$ (meaning that it can explain only a simple phenomenon), while $H_2$ covers a broader area. The illustration implies that, because of the normalization, it holds that
\[ p(D|H_1) > p(D|H_2) \quad \text{for } D \in C_1, \]
where $C_1$ denotes the observed data region where $H_1$ can explain the data well. This view gives an intuition on why evidence-based model selection prefers simpler models when the observed data can be well explained by them. However, this view does not explain MIR, which is observed even without any model selection procedure. In fact, the illustration in the left panel of Figure 7.9 is not accurate for unidentifiable models unless the Jeffreys prior is adopted (note that a hypothesis consists of a model and a prior). The right illustration of Figure 7.9 is a more accurate view for unidentifiable models. When $H_2$ is a complex unidentifiable model nesting $H_1$ as a simpler model in
Figure 7.9 Left: The evidence view by MacKay (1992), which gives an intuition on why evidence-based model selection prefers simpler models. Right: A more accurate view for unidentifiable models. Simpler models are preferred even without explicit model selection.
its parameter space, its evidence p(D|H2 ) has a bump covering the region C1 if the flat prior is adopted. This is because the flat prior typically places large weights on the singularities representing the simpler model H1 .
7.4 Phase Transition in VB Learning
In Section 7.3, we explained MIR, which shrinks the Bayesian estimator. We can expect that VB learning, which involves integral computations, inherits this property. However, we observe in Figure 7.6 that VB learning behaves differently from Bayesian learning. Actually, the Bayesian estimator behaves more similarly to the ML estimator (the MAP estimator with almost flat priors) than to the VB estimator. A remarkable difference is that the VB estimator, which is upper-bounded by the PJS estimator (7.2), shows exact sparsity, i.e., the estimator can be zero for nonzero observation |V|. In this section, we explain that this gap is caused by a phase transition phenomenon in VB learning.
The middle graph in Figure 7.2 shows the Bayes posterior when V = 1. The probability mass in the first and the third quadrants pulls the product U = BA toward the positive direction, and the mass in the second and the fourth quadrants toward the negative direction. Since the Bayes posterior is skewed and more mass is placed in the first and the third quadrants, the Bayesian estimator $\hat{U}^{\mathrm{Bayes}} = \langle BA \rangle_{p(A,B|V)}$ is positive. This is true even if V > 0 is very small, and therefore, no thresholding occurs in Bayesian learning: the Bayesian estimator is not sparse. On the other hand, the VB posterior (the top-right graph of Figure 7.4) is prohibited from being skewed because of the independence constraint, which causes the following phase transition phenomenon. As seen in Figure 7.2, the Bayes posterior has two modes unless V = 0, and the distance between the two modes increases as |V| increases. Since the VB posterior tries to approximate the Bayes posterior with a single uncorrelated distribution, it stays at the origin if the two modes are close to each other so that covering both modes minimizes the free energy. The VB posterior detaches from the origin if the two modes get far apart so that approximating either one of the modes minimizes the free energy. This phase transition mechanism makes the VB estimator exactly sparse. The profile of the Bayes posterior (the middle graph of Figure 7.2) implies that, if we restrict the posterior to be Gaussian, but allow it to have correlation between A and B, exact sparsity will not appear. In this sense, we can say that MIR is enhanced by the independence constraint, which was imposed for computational tractability.
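The absence of thresholding in full Bayesian learning can be illustrated with a one-dimensional quadrature. The sketch below assumes the likelihood $V \sim \mathrm{Gauss}_1(BA, \sigma^2)$ with independent Gaussian priors $A \sim \mathrm{Gauss}_1(0, c_a^2)$ and $B \sim \mathrm{Gauss}_1(0, c_b^2)$. These forms, the default prior variances, and the function name are assumptions made for illustration, not values taken from the text. Under these assumptions $B$ can be integrated out analytically, which gives $\langle BA \rangle = V \, \langle c_b^2 A^2 / (\sigma^2 + c_b^2 A^2) \rangle_{p(A|V)}$, so the estimator is $V$ times a factor strictly between 0 and 1.

```python
import numpy as np

def bayes_estimator_1d_mf(V, sigma2=1.0, ca2=100.0, cb2=100.0, n_grid=200001):
    """Posterior mean of U = B*A by quadrature over A (B integrated out).

    Assumed model (hypothetical defaults): V ~ N(B*A, sigma2),
    A ~ N(0, ca2), B ~ N(0, cb2).
    """
    half_width = 8.0 * np.sqrt(ca2)          # covers the prior mass of A
    A = np.linspace(-half_width, half_width, n_grid)
    s2 = sigma2 + cb2 * A**2                 # Var[V | A] after integrating out B
    # log of p(V | A) p(A) up to a constant; this is the unnormalized p(A | V)
    log_w = -0.5 * np.log(s2) - 0.5 * V**2 / s2 - 0.5 * A**2 / ca2
    w = np.exp(log_w - log_w.max())
    # E[B | A, V] = cb2 * A * V / s2, hence A * E[B | A, V] = V * shrink(A)
    shrink = cb2 * A**2 / s2
    # The grid is uniform, so the spacing cancels in the ratio of sums.
    return V * np.sum(w * shrink) / np.sum(w)

if __name__ == "__main__":
    for V in (0.01, 0.1, 1.0, 3.0):
        print(V, bayes_estimator_1d_mf(V))
```

Each printed estimate has the sign of V and a smaller magnitude, however small V is, which is consistent with shrinkage without exact sparsity.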
MacKay (2001) pointed out that there are cases where VB learning prunes model components inappropriately, by giving a toy example of a mixture of Gaussians. Note that appropriateness was measured in terms of the similarity to full Bayesian learning. He plotted the free energy of the mixture of Gaussians as a function of hidden responsibility variables (the probabilities that each sample belongs to each Gaussian component) and argued that VB learning sometimes favors simpler models too much. In this case, degrees of freedom are pruned when spontaneous symmetry breaking (a phase transition) occurs. Interestingly, in the MF model, degrees of freedom are pruned when spontaneous symmetry breaking does not occur, as explained earlier. Eq. (7.3) implies that the symmetry breaking occurs when $V > \underline{\gamma}_h^{\mathrm{VB}} \approx \sqrt{M\sigma^2} = 1$, which coincides with the average contribution of noise to the observed singular values over all singular components; more accurately, $\sqrt{M\sigma^2}$ is the square root of the average eigenvalue of the Wishart matrix $EE^\top \sim \mathrm{Wishart}_L(\sigma^2 I_L, M)$.⁴ In this way, VB learning discards singular components dominated by noise.
Given that the full Bayesian estimator in MF is not sparse (see Figure 7.6), one might argue that the sparsity of VB learning is an inappropriate artifact. On the other hand, given that automatic model pruning by VB learning has been acknowledged as a practically useful property (Bishop, 1999b; Bishop and Tipping, 2000; Sato et al., 2004; Babacan et al., 2012b), one might also argue that appropriateness should be measured in terms of performance. Motivated by the latter idea, performance analysis has been carried out (Nakajima et al., 2015), which will be detailed in Chapter 8.
In the empirical Bayesian scenario, where the prior variances $c_a$, $c_b$ are also estimated from observation, Bayesian learning also gives a sparse solution, which is shown as diamonds (labeled as "EBayes") in Figure 7.6.
This is somewhat natural since, in empirical Bayesian learning, the dependency between $A$ and $c_a^{-2}$ (as well as $B$ and $c_b^{-2}$) in the prior (7.6) (in the prior (7.7)) and hence in the Bayes posterior is broken: the point-estimation of $c_a^2$ (as well as $c_b^2$) forces it to be independent of all other parameters. This forced independence causes a similar phase transition phenomenon to the one caused by the independence constraint between A and B in the (nonempirical) VB learning, and results in exact sparsity of the EBayes estimator. EVB learning has a different transition point, and tends to give a sparser solution than VB learning. A notable difference from the VB estimator is that the EVB estimator is no longer continuous as a function of the observation V. This comes from the fact that, when $|V| > \underline{\gamma}^{\mathrm{local-EVB}}$, there exist two local solutions (see Figure 6.3), but the global solution is $\hat{U}^{\mathrm{EVB}} = 0$ until the observed amplitude $|V|$ exceeds $\underline{\gamma}^{\mathrm{EVB}}$ ($> \underline{\gamma}^{\mathrm{local-EVB}}$). When the positive local solution $\breve{\gamma}^{\mathrm{EVB}}$ becomes the global solution, it is already distant from the origin, which makes the estimator noncontinuous at the thresholding point (see the dashed curve labeled as "EVB" in Figure 7.6).
⁴ It holds that $EE^\top \sim \mathrm{Wishart}_L(\sigma^2 I_L, M)$ if the columns of $E$ are independently drawn from $\mathrm{Gauss}_L(0, \sigma^2 I_L)$.
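The statement in this section that $\sqrt{M\sigma^2}$ is the square root of the average eigenvalue of $EE^\top$ can be checked by simulation. The following is a minimal sketch; the matrix sizes, the noise variance, the seed, and the number of repetitions are arbitrary choices for illustration.

```python
import numpy as np

def mean_wishart_eigenvalue(L, M, sigma2, rng):
    """Average eigenvalue of E E^T for an L x M matrix E with i.i.d. N(0, sigma2) entries."""
    E = rng.normal(0.0, np.sqrt(sigma2), size=(L, M))
    return np.linalg.eigvalsh(E @ E.T).mean()

rng = np.random.default_rng(0)
L, M, sigma2 = 50, 200, 0.5
avg = np.mean([mean_wishart_eigenvalue(L, M, sigma2, rng) for _ in range(20)])

if __name__ == "__main__":
    # The Monte Carlo average should be close to M * sigma2, so its square
    # root should be close to sqrt(M * sigma2).
    print(np.sqrt(avg), np.sqrt(M * sigma2))
```

The agreement follows from $\mathrm{tr}(EE^\top) = \|E\|_{\mathrm{Fro}}^2$, whose expectation is $LM\sigma^2$.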
7.5 Factorization as ARD Model
As shown in Section 7.1, MIR in VB learning for the MF model appears as PJS shrinkage. We can see this as a natural consequence of the equivalence between the MF model and the ARD model (Neal, 1996). Assume that $C_A = I_H$ in the MF model (6.1) through (6.3), and consider the following transformation: $BA^\top \mapsto U \in \mathbb{R}^{L \times M}$. Then, the likelihood (6.1) and the prior (6.2) on $A$ become
\[ p(V|U) \propto \exp\left( -\frac{1}{2\sigma^2} \| V - U \|_{\mathrm{Fro}}^2 \right), \tag{7.22} \]
\[ p(U|B) \propto \exp\left( -\frac{1}{2} \mathrm{tr}\left( U^\top (BB^\top)^{\dagger} U \right) \right), \tag{7.23} \]
where $\dagger$ denotes the Moore–Penrose generalized inverse of a matrix. The prior (6.3) on $B$ is kept unchanged. $p(U|B)$ in Eq. (7.23) is the so-called ARD prior with the covariance hyperparameter $BB^\top \in \mathbb{R}^{L \times L}$. It is known that this prior induces the ARD property: empirical Bayesian learning, where the prior covariance hyperparameter $BB^\top$ is estimated from observation by maximizing the marginal likelihood (or minimizing the free energy), induces strong regularization and sparsity (Neal, 1996). Efron and Morris (1973) showed that this particular model gives the JS shrinkage estimator as an empirical Bayesian estimator (see Appendix A).
This equivalence can explain the sparsity-inducing terms (3.113) through (3.116), introduced for sparse additive matrix factorization (SAMF) in Section 3.5. The ARD prior (7.23) induces low-rankness on $U$ if no restriction on $BB^\top$ is imposed. We can similarly show that $(\gamma_l^e)^2$ in Eq. (3.114) corresponds to the prior variance shared by the entries in $\tilde{u}_l \equiv \gamma_l^e \tilde{d}_l \in \mathbb{R}^M$, that $(\gamma_m^d)^2$ in Eq. (3.115) corresponds to the prior variance shared by the entries in $u_m \equiv \gamma_m^d e_m \in \mathbb{R}^L$, and that $E_{l,m}^2$ in Eq. (3.116) corresponds to the prior variance on $U_{l,m} \equiv E_{l,m} D_{l,m} \in \mathbb{R}$, respectively. This explains why the factorization forms in Eqs. (3.113) through (3.116) induce low-rank, rowwise, columnwise, and elementwise sparsity, respectively.
If we employ the sparse matrix factorization (SMF) term (3.117), ARD occurs in each partition, which induces partitionwise sparsity and low-rank sparsity within each partition.
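The change of variables behind the ARD prior in Section 7.5 can be verified numerically. The sketch below assumes $A \in \mathbb{R}^{M \times H}$ and $B \in \mathbb{R}^{L \times H}$ with $B$ of full column rank (shapes inferred from $U = BA^\top \in \mathbb{R}^{L \times M}$; the sizes and seed are arbitrary). It checks that $\mathrm{tr}(U^\top (BB^\top)^{\dagger} U) = \|A\|_{\mathrm{Fro}}^2$, so the exponent of Eq. (7.23) coincides with that of a standard Gaussian prior on $A$ with $C_A = I_H$.

```python
import numpy as np

rng = np.random.default_rng(1)
L, H, M = 4, 2, 6
B = rng.normal(size=(L, H))      # full column rank with probability one
A = rng.normal(size=(M, H))      # rows play the role of a_m with C_A = I_H
U = B @ A.T                      # L x M product matrix

S = B @ B.T                      # rank-H covariance hyperparameter (L x L)
# Moore-Penrose inverse; the explicit cutoff discards numerically zero
# singular values of the rank-deficient matrix S.
S_pinv = np.linalg.pinv(S, rcond=1e-10)

quad_U = np.trace(U.T @ S_pinv @ U)   # exponent of Eq. (7.23), times -2
quad_A = np.sum(A**2)                 # exponent of the prior on A, times -2

if __name__ == "__main__":
    print(np.linalg.matrix_rank(S), quad_U, quad_A)
```

Because $BB^\top$ has rank $H < L$, the ordinary inverse does not exist, which is why Eq. (7.23) is written with the generalized inverse.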
8 Performance Analysis of VB Matrix Factorization
In this chapter, we further analyze the behavior of VB learning in the fully observed MF model, introduced in Section 3.1. Then, we derive a theoretical guarantee for rank estimation (Nakajima et al., 2015), which corresponds to the hidden dimensionality selection in principal component analysis (PCA). In Chapter 6, we derived an analytic-form solution (Theorem 6.13) of EVB learning, where the prior variances are also estimated from observation. When discussing the dimensionality selection performance in PCA, it is more practical to assume that the noise variance $\sigma^2$ is estimated, since it is unknown in many situations. To this end, we first analyze the behavior of the noise variance estimator. After that, based on random matrix theory, we derive a theoretical guarantee of dimensionality selection performance, and show numerical results validating the theory. We also discuss the relation to an alternative dimensionality selection method (Hoyle, 2008) based on the Laplace approximation. In the following analysis, we use some results in Chapter 6. Specifically, we mostly rely on Theorem 6.13 along with the corollaries and the equations summarized in Section 6.10.
8.1 Objective Function for Noise Variance Estimation
Let us consider the complete empirical VB problem, where all the variational parameters and the hyperparameters are estimated in the free energy minimization framework:
\[ \min_{\{\hat{a}_h, \hat{b}_h, \sigma_{a_h}^2, \sigma_{b_h}^2, c_{a_h}^2, c_{b_h}^2\}_{h=1}^{H},\ \sigma^2} F \tag{8.1} \]
\[ \text{s.t.} \quad \{\hat{a}_h, \hat{b}_h \in \mathbb{R},\ \sigma_{a_h}^2, \sigma_{b_h}^2, c_{a_h}^2, c_{b_h}^2 \in \mathbb{R}_{++}\}_{h=1}^{H},\quad \sigma^2 \in \mathbb{R}_{++}. \]
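Here, the free energy is given by
\[ 2F = LM \log(2\pi\sigma^2) + \frac{\sum_{h=1}^{L} \gamma_h^2}{\sigma^2} + \sum_{h=1}^{H} 2F_h, \tag{8.2} \]
where
\[ 2F_h = M \log\frac{c_{a_h}^2}{\sigma_{a_h}^2} + L \log\frac{c_{b_h}^2}{\sigma_{b_h}^2} + \frac{\hat{a}_h^2 + M\sigma_{a_h}^2}{c_{a_h}^2} + \frac{\hat{b}_h^2 + L\sigma_{b_h}^2}{c_{b_h}^2} - (L + M) + \frac{-2\hat{a}_h\hat{b}_h\gamma_h + \left(\hat{a}_h^2 + M\sigma_{a_h}^2\right)\left(\hat{b}_h^2 + L\sigma_{b_h}^2\right)}{\sigma^2}. \tag{8.3} \]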
Note that we are focusing on the solution with diagonal posterior covariances without loss of generality (see Theorem 6.4). We have already obtained the empirical VB estimator (Theorem 6.13) and the minimum free energy (Corollary 6.21) for given $\sigma^2$. By using those results, we can express the free energy (8.2) as a function of $\sigma^2$. With the rescaled expressions (6.132) through (6.138), the free energy can be written in a simple form, which leads to the following theorem:
Theorem 8.1 The noise variance estimator, denoted by $\hat{\sigma}^{2\,\mathrm{EVB}}$, is the global minimizer of
\[ \Omega(\sigma^{-2}) \equiv \frac{2F(\sigma^{-2})}{LM} + \mathrm{const.} = \frac{1}{L}\left( \sum_{h=1}^{H} \psi\left(\frac{\gamma_h^2}{M\sigma^2}\right) + \sum_{h=H+1}^{L} \psi_0\left(\frac{\gamma_h^2}{M\sigma^2}\right) \right), \tag{8.4} \]
where
\[ \psi(x) = \psi_0(x) + \theta(x > \underline{x})\, \psi_1(x), \tag{8.5} \]
\[ \psi_0(x) = x - \log x, \tag{8.6} \]
\[ \psi_1(x) = \log\left(\tau(x; \alpha) + 1\right) + \alpha \log\left(\frac{\tau(x; \alpha)}{\alpha} + 1\right) - \tau(x; \alpha). \tag{8.7} \]
Here, $\underline{x}$ is given by
\[ \underline{x} = \left(1 + \underline{\tau}\right)\left(1 + \frac{\alpha}{\underline{\tau}}\right), \tag{8.8} \]
where $\underline{\tau}$ is defined in Theorem 6.13, $\tau(x; \alpha)$ is a function of $x$ ($> \underline{x}$) defined by
\[ \tau(x; \alpha) = \frac{1}{2}\left( x - (1 + \alpha) + \sqrt{(x - (1 + \alpha))^2 - 4\alpha} \right), \tag{8.9} \]
and $\theta(\cdot)$ denotes the indicator function such that $\theta(\text{condition}) = 1$ if the condition is true and $\theta(\text{condition}) = 0$ otherwise.
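The objective of Theorem 8.1 is simple to evaluate once the threshold $\underline{\tau}$ is available. The following is a minimal sketch. Theorem 6.13 is not reproduced in this section, so the sketch takes $\underline{\tau}$ to be the positive root of $\log(\tau + 1) + \alpha \log(\tau/\alpha + 1) - \tau$, which is an assumption chosen to be consistent with the property $\psi_1(\underline{x}) = 0$ used in the proof of Lemma 8.4. The bisection bracket assumes $0 < \alpha \le 1$, and all function names are illustrative.

```python
import numpy as np

def tau_threshold(alpha, n_iter=200):
    """Assumed form of underline-tau: positive root of psi_1 seen as a function of tau."""
    def g(t):
        return np.log1p(t) + alpha * np.log1p(t / alpha) - t
    # g is maximal at sqrt(alpha) (where it is positive) and negative at 10
    # for 0 < alpha <= 1, so the root is bracketed.
    lo, hi = np.sqrt(alpha), 10.0
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def x_threshold(alpha):
    t = tau_threshold(alpha)
    return (1.0 + t) * (1.0 + alpha / t)                 # Eq. (8.8)

def tau_of_x(x, alpha):
    d = x - (1.0 + alpha)
    return 0.5 * (d + np.sqrt(d * d - 4.0 * alpha))      # Eq. (8.9)

def psi0(x):
    return x - np.log(x)                                 # Eq. (8.6)

def psi1(x, alpha):
    t = tau_of_x(x, alpha)
    return np.log1p(t) + alpha * np.log1p(t / alpha) - t  # Eq. (8.7)

def psi(x, alpha):
    x = np.atleast_1d(np.asarray(x, dtype=float))
    out = psi0(x)
    mask = x > x_threshold(alpha)        # indicator in Eq. (8.5)
    out[mask] += psi1(x[mask], alpha)
    return out

def omega(inv_sigma2, gammas, M, H):
    """Objective (8.4) as a function of the inverse noise variance."""
    g = np.sort(np.asarray(gammas, dtype=float))[::-1]   # nonincreasing, positive
    L = g.size
    x = g**2 * inv_sigma2 / M
    return (psi(x[:H], L / M).sum() + psi0(x[H:]).sum()) / L
```

A simple way to use this sketch is to evaluate `omega` on a grid of $\sigma^{-2}$ values and take the minimizer; a grid is used here only for illustration and is not the solver described in the text.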
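Proof By using Lemma 6.14 and Lemma 6.16, the free energy (8.2) can be written as a function of $\sigma^2$ as follows:
\[ 2F = LM \log(2\pi\sigma^2) + \frac{\sum_{h=1}^{L} \gamma_h^2}{\sigma^2} + \sum_{h=1}^{H} \theta\left(\gamma_h > \underline{\gamma}^{\mathrm{EVB}}\right) F_h^{\mathrm{EVB-Posi}}, \tag{8.10} \]
where $F_h^{\mathrm{EVB-Posi}}$ is given by Eq. (6.112). By using Eqs. (6.133) and (6.135), Eq. (6.112) can be written as
\[ F_h^{\mathrm{EVB-Posi}} = M \log(\tau_h + 1) + L \log\left(\frac{\tau_h}{\alpha} + 1\right) - M\tau_h = M\psi_1(x_h). \tag{8.11} \]
Therefore, Eq. (8.10) is written as
\[ 2F = M \left\{ \sum_{h=1}^{L} \log\left(\frac{2\pi\gamma_h^2}{M}\right) + \sum_{h=1}^{L} \left( \log\left(\frac{M\sigma^2}{\gamma_h^2}\right) + \frac{\gamma_h^2}{M\sigma^2} \right) + \sum_{h=1}^{H} \theta\left(\gamma_h > \underline{\gamma}^{\mathrm{EVB}}\right) \frac{F_h^{\mathrm{EVB-Posi}}}{M} \right\} \]
\[ \phantom{2F} = M \left\{ \sum_{h=1}^{L} \log\left(\frac{2\pi\gamma_h^2}{M}\right) + \sum_{h=1}^{L} \psi_0(x_h) + \sum_{h=1}^{H} \theta\left(x_h > \underline{x}\right) \psi_1(x_h) \right\}. \]
Note that the first term in the curly braces is constant with respect to $\sigma^2$. By defining
\[ \Omega = \frac{2F}{LM} - \frac{1}{L} \sum_{h=1}^{L} \log\left(\frac{2\pi\gamma_h^2}{M}\right), \]
we obtain Eq. (8.4), which completes the proof of Theorem 8.1.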
The functions $\psi_0(x)$ and $\psi(x)$ are depicted in Figure 8.1. We can confirm the convexity of $\psi_0(x)$ and the quasiconvexity of $\psi(x)$ (Lemma 8.4 in Section 8.3), which are useful properties in the subsequent analysis.¹
¹ A function $f: X \to \mathbb{R}$ on the domain $X$ being a convex subset of a real vector space is said to be quasiconvex if $f(\lambda x_1 + (1 - \lambda)x_2) \le \max(f(x_1), f(x_2))$ for all $x_1, x_2 \in X$ and $\lambda \in [0, 1]$. It is furthermore said to be strictly quasiconvex if $f(\lambda x_1 + (1 - \lambda)x_2) < \max(f(x_1), f(x_2))$ for all $x_1 \ne x_2$ and $\lambda \in (0, 1)$. Intuitively, a strictly quasiconvex function does not have more than one local minimum.
Figure 8.1 $\psi_0(x)$ and $\psi(x)$.
8.2 Bounds of Noise Variance Estimator
Let $\hat{H}^{\mathrm{EVB}}$ be the estimated rank by EVB learning, i.e., the rank of the EVB estimator $\hat{U}^{\mathrm{EVB}}$, such that $\hat{\gamma}_h^{\mathrm{EVB}} > 0$ for $h = 1, \ldots, \hat{H}^{\mathrm{EVB}}$, and $\hat{\gamma}_h^{\mathrm{EVB}} = 0$ for $h = \hat{H}^{\mathrm{EVB}} + 1, \ldots, H$. By further analyzing the objective (8.4), we can derive bounds of the estimated rank and the noise variance estimator:
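Theorem 8.2 $\hat{H}^{\mathrm{EVB}}$ is upper-bounded as
\[ \hat{H}^{\mathrm{EVB}} \le \overline{H} = \min\left( \left\lceil \frac{L}{1 + \alpha} \right\rceil - 1,\ H \right), \tag{8.12} \]
and the noise variance estimator $\hat{\sigma}^{2\,\mathrm{EVB}}$ is bounded as follows:
\[ \max\left( \sigma_{\overline{H}+1}^2,\ \frac{\sum_{h=\overline{H}+1}^{L} \gamma_h^2}{M\left(L - \overline{H}\right)} \right) \le \hat{\sigma}^{2\,\mathrm{EVB}} \le \frac{1}{LM} \sum_{h=1}^{L} \gamma_h^2, \tag{8.13} \]
\[ \text{where} \quad \sigma_h^2 = \begin{cases} \infty & \text{for } h = 0, \\ \dfrac{\gamma_h^2}{M\underline{x}} & \text{for } h = 1, \ldots, L, \\ 0 & \text{for } h = L + 1. \end{cases} \tag{8.14} \]
Theorem 8.2 states that EVB learning discards the $(L - \lceil L/(1 + \alpha) \rceil + 1)$ smallest components, regardless of the observed singular values $\{\gamma_h\}_{h=1}^{L}$. For example, half of the components are always discarded when the matrix is square (i.e., $\alpha = L/M = 1$). The smallest singular value $\gamma_L$ is always discarded, and $\hat{\sigma}^{2\,\mathrm{EVB}} \ge \gamma_L^2/M$ always holds. Given the EVB estimators $\{\hat{\gamma}_h^{\mathrm{EVB}}\}_{h=1}^{H}$ for the singular values, the noise variance estimator $\hat{\sigma}^{2\,\mathrm{EVB}}$ is specified by the following corollary:
Corollary 8.3 The EVB estimator for the noise variance satisfies the following equality:
\[ \hat{\sigma}^{2\,\mathrm{EVB}} = \frac{1}{LM} \left( \sum_{l=1}^{L} \gamma_l^2 - \sum_{h=1}^{H} \gamma_h \hat{\gamma}_h^{\mathrm{EVB}} \right). \tag{8.15} \]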
This corollary can be used for implementing a global EVB solver (see Chapter 9). In the next section we give the proofs of the theorem and the corollary.
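The bounds of Theorem 8.2 depend only on the observed singular values and the matrix size, so they can be computed before any optimization, for example to restrict a one-dimensional search over $\sigma^2$. The sketch below is illustrative: the function names are hypothetical, the bracket in the bisection assumes $L \le M$, and $\underline{\tau}$ is again taken as the positive root of $\log(\tau + 1) + \alpha\log(\tau/\alpha + 1) - \tau$ (an assumption, since Theorem 6.13 is not restated here).

```python
import numpy as np

def x_threshold(alpha, n_iter=200):
    """underline-x of Eq. (8.8), with underline-tau found by bisection (assumed form)."""
    def g(t):
        return np.log1p(t) + alpha * np.log1p(t / alpha) - t
    lo, hi = np.sqrt(alpha), 10.0        # bracket valid for 0 < alpha <= 1
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)
    return (1.0 + t) * (1.0 + alpha / t)

def evb_rank_and_noise_bounds(gammas, M, H):
    """Return (H_bar, lower, upper) following Eqs. (8.12)-(8.14)."""
    g = np.sort(np.asarray(gammas, dtype=float))[::-1]    # nonincreasing order
    L = g.size
    # ceil(L / (1 + alpha)) = ceil(L * M / (L + M)), computed in integers
    ceil_term = -(-(L * M) // (L + M))
    H_bar = min(ceil_term - 1, H)                          # Eq. (8.12)
    x_th = x_threshold(L / M)
    # g[H_bar] is gamma_{H_bar + 1} in the 1-based indexing of the text;
    # H_bar <= L - 1 always holds, so the index is valid.
    sigma2_next = g[H_bar]**2 / (M * x_th)                 # Eq. (8.14)
    tail_mean = np.sum(g[H_bar:]**2) / (M * (L - H_bar))
    lower = max(sigma2_next, tail_mean)                    # left side of Eq. (8.13)
    upper = np.sum(g**2) / (L * M)                         # right side of Eq. (8.13)
    return H_bar, lower, upper

if __name__ == "__main__":
    print(evb_rank_and_noise_bounds(np.linspace(10.0, 1.0, 10), M=10, H=10))
```

For a square matrix with $L = M = 10$ and $H = L$, the rank bound is $\lceil 10/2 \rceil - 1 = 4$, so at least six components are discarded whatever the singular values are.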
8.3 Proofs of Theorem 8.2 and Corollary 8.3
First, we show useful properties of the functions $\psi(x)$ and $\psi_0(x)$, which are defined by Eqs. (8.5) and (8.6), respectively, and depicted in Figure 8.1:
Lemma 8.4 The following hold for $x > 0$: $\psi_0(x)$ is differentiable and strictly convex; $\psi(x)$ is continuous and strictly quasiconvex; $\psi(x)$ is differentiable except at $x = \underline{x}$, at which $\psi(x)$ has a discontinuously decreasing derivative, i.e., $\lim_{x \to \underline{x} - 0} \partial\psi/\partial x > \lim_{x \to \underline{x} + 0} \partial\psi/\partial x$; both $\psi_0(x)$ and $\psi(x)$ are minimized at $x = 1$. For $x > \underline{x}$, $\psi_1(x)$ is negative and decreasing.
Proof Since
\[ \frac{\partial\psi_0}{\partial x} = 1 - \frac{1}{x}, \qquad \frac{\partial^2\psi_0}{\partial x^2} = \frac{1}{x^2} > 0, \tag{8.16} \]
$\psi_0(x)$ is differentiable and strictly convex for $x > 0$ with its minimizer at $x = 1$. $\psi_1(x)$ is continuous for $x \ge \underline{x}$, and Eq. (8.11) implies that $\psi_1(x_h) \propto F_h^{\mathrm{EVB-Posi}}$. Accordingly, $\psi_1(x) \le 0$ for $x \ge \underline{x}$, where the equality holds when $x = \underline{x}$. This equality implies that $\psi(x)$ is continuous. Since $\underline{x} > 1$, $\psi(x)$ shares the same minimizer as $\psi_0(x)$ at $x = 1$ (see Figure 8.1). Hereafter, we investigate $\psi_1(x)$ and $\psi(x)$ for $x \ge \underline{x}$. By differentiating Eqs. (8.7) and (6.135), respectively, we have
\[ \frac{\partial\psi_1}{\partial\tau} = -\frac{\frac{\tau^2}{\alpha} - 1}{(\tau + 1)\left(\frac{\tau}{\alpha} + 1\right)} < 0, \tag{8.17} \]
\[ \frac{\partial\tau}{\partial x} = \frac{1}{2}\left( 1 + \frac{x - (1 + \alpha)}{\sqrt{(x - (1 + \alpha))^2 - 4\alpha}} \right) > 0. \tag{8.18} \]
Substituting Eq. (6.134) into Eq. (8.18), we have
\[ \frac{\partial\tau}{\partial x} = \frac{\tau^2}{\alpha\left(\frac{\tau^2}{\alpha} - 1\right)}. \tag{8.19} \]
Multiplying Eqs. (8.17) and (8.19) gives
\[ \frac{\partial\psi_1}{\partial x} = \frac{\partial\psi_1}{\partial\tau} \frac{\partial\tau}{\partial x} = -\frac{\tau^2}{\alpha(\tau + 1)\left(\frac{\tau}{\alpha} + 1\right)} = -\frac{\tau}{x} < 0, \tag{8.20} \]
which implies that $\psi_1(x)$ is decreasing for $x > \underline{x}$. Let us focus on the thresholding point of $\psi(x)$ at $x = \underline{x}$. Eq. (8.20) does not converge to zero for $x \to \underline{x} + 0$ but stays negative. On the other hand, $\psi_0(x)$ is differentiable at $x = \underline{x}$. Consequently, $\psi(x)$ has a discontinuously decreasing derivative, i.e., $\lim_{x \to \underline{x} - 0} \partial\psi/\partial x > \lim_{x \to \underline{x} + 0} \partial\psi/\partial x$, at $x = \underline{x}$. Finally, we prove the strict quasiconvexity of $\psi(x)$. Taking the sum of Eqs. (8.16) and (8.20) gives
\[ \frac{\partial\psi}{\partial x} = \frac{\partial\psi_0}{\partial x} + \frac{\partial\psi_1}{\partial x} = 1 - \frac{1 + \tau}{x} = 1 - \frac{1 + \tau}{1 + \tau + \alpha + \alpha\tau^{-1}} > 0. \]
This means that $\psi(x)$ is increasing for $x > \underline{x}$. Since $\psi_0(x)$ is strictly convex and increasing at $x = \underline{x}$, and $\psi(x)$ is continuous, $\psi(x)$ is strictly quasiconvex. This completes the proof of Lemma 8.4.
Lemma 8.4 implies that our objective (8.4) is a sum of quasiconvex functions with respect to $\sigma^{-2}$. Therefore, its minimizer can be bounded by the smallest one and the largest one among the set collecting the minimizer from each quasiconvex function:
Lemma 8.5 $\Omega(\sigma^{-2})$ has at least one global minimizer, and any of its local minimizers is bounded as
\[ \frac{M}{\gamma_1^2} \le \sigma^{-2} \le \frac{M}{\gamma_L^2}. \tag{8.21} \]
Proof The strict convexity of $\psi_0(x)$ and the strict quasiconvexity of $\psi(x)$ also hold for $\psi_0(\gamma_h^2\sigma^{-2}/M)$ and $\psi(\gamma_h^2\sigma^{-2}/M)$ as functions of $\sigma^{-2}$ (for $\gamma_h > 0$). Because of the different scale factor $\gamma_h^2/M$ for each $h = 1, \ldots, L$, each of $\psi_0(\gamma_h^2\sigma^{-2}/M)$ and $\psi(\gamma_h^2\sigma^{-2}/M)$ has a minimizer at a different position:
\[ \sigma^{-2} = \frac{M}{\gamma_h^2}. \]
The strict quasiconvexity of $\psi_0$ and $\psi$ guarantees that $\Omega(\sigma^{-2})$ is decreasing for
\[ 0 < \sigma^{-2} < \frac{M}{\gamma_1^2}, \tag{8.22} \]
and increasing for
\[ \frac{M}{\gamma_L^2} < \sigma^{-2} < \infty. \tag{8.23} \]
This proves Lemma 8.5.
$\Omega(\sigma^{-2})$ has at most $H$ nondifferentiable points, which come from the nondifferentiable point $x = \underline{x}$ of $\psi(x)$. The values
\[ \sigma_h^{-2} = \begin{cases} 0 & \text{for } h = 0, \\ \dfrac{M\underline{x}}{\gamma_h^2} & \text{for } h = 1, \ldots, L, \\ \infty & \text{for } h = L + 1, \end{cases} \tag{8.24} \]
defined in Eq. (8.14) for $h = 1, \ldots, H$ actually correspond to those points. Lemma 8.4 states that, at $x = \underline{x}$, $\psi(x)$ has a discontinuously decreasing derivative and neither $\psi_0(x)$ nor $\psi(x)$ has a discontinuously increasing derivative at any point. Therefore, none of those nondifferentiable points can be a local minimum. Consequently, we have the following lemma:
Lemma 8.6 $\Omega(\sigma^{-2})$ has no local minimizer at $\sigma^{-2} = \sigma_h^{-2}$ for $h = 1, \ldots, H$, and therefore any of its local minimizers is a stationary point.
Then, Theorem 6.13 leads to the following lemma:
Lemma 8.7 The estimated rank is $\hat{H} = h$ if and only if the inverse noise variance estimator lies in the range
\[ \hat{\sigma}^{-2} \in B_h \equiv \left\{ \sigma^{-2};\ \sigma_h^{-2} < \sigma^{-2} < \sigma_{h+1}^{-2} \right\}. \tag{8.25} \]
[...]
[...] 0, Eq. (8.26) is upper-bounded by
\[ \Theta \le -\sigma^2 + \frac{\sum_{h=1}^{L} \gamma_h^2}{LM}, \]
which leads to the upper-bound given in Eq. (8.13). Actually, if
\[ \left( \frac{\sum_{h=1}^{L} \gamma_h^2}{LM} \right)^{-1} \in B_0, \]
then
\[ \hat{H} = 0, \qquad \hat{\sigma}^2 = \frac{\sum_{h=1}^{L} \gamma_h^2}{LM}, \]
is a local minimum. The following lemma is easily obtained from Eq. (6.103) by using the inequalities $z_1$ [...] $z_2 > 0$:
Lemma 8.9 For $\gamma_h \ge \underline{\gamma}^{\mathrm{EVB}}$, the EVB shrinkage estimator (6.103) can be bounded as follows:
\[ \gamma_h - \frac{(\sqrt{M} + \sqrt{L})^2\sigma^2}{\gamma_h} < \breve{\gamma}_h^{\mathrm{EVB}} < \gamma_h - \frac{(M + L)\sigma^2}{\gamma_h}. \tag{8.32} \]
This lemma is important for our analysis, because it allows us to bound the most complicated part of Eq. (8.26) by quantities independent of $\gamma_h$, i.e.,
\[ (M + L)\sigma^2 < \gamma_h\left(\gamma_h - \breve{\gamma}_h^{\mathrm{EVB}}\right) < (\sqrt{M} + \sqrt{L})^2\sigma^2. \tag{8.33} \]
Using Eq. (8.33), we obtain the following lemma:
Lemma 8.10 Any local minimizer exists in $\sigma^{-2} \in B_{\hat{H}}$ such that
\[ \hat{H} < \frac{L}{1 + \alpha}, \]
and the following holds for any local minimizer lying in $\sigma^{-2} \in B_{\hat{H}}$:
\[ \hat{\sigma}^2 \ge \frac{\sum_{h=\hat{H}+1}^{L} \gamma_h^2}{LM - \hat{H}(M + L)}. \]
Proof By substituting the lower-bound in Eq. (8.33) into Eq. (8.26), we obtain
\[ \Theta \ge -\sigma^2 + \frac{\hat{H}(L + M)\sigma^2 + \sum_{h=\hat{H}+1}^{L} \gamma_h^2}{LM}. \]
This implies that $\Theta > 0$ unless the following hold:
\[ \hat{H} < \frac{LM}{L + M} = \frac{L}{1 + \alpha}, \qquad \sigma^2 \ge \frac{\sum_{h=\hat{H}+1}^{L} \gamma_h^2}{LM - \hat{H}(L + M)}. \]
Therefore, no local minimum exists if either of these conditions is violated. This completes the proof of Lemma 8.10.
It holds that
\[ \frac{\sum_{h=\hat{H}+1}^{L} \gamma_h^2}{LM - \hat{H}(M + L)} \ge \frac{\sum_{h=\hat{H}+1}^{L} \gamma_h^2}{M(L - \hat{H})}, \tag{8.34} \]
of which the right-hand side is decreasing with respect to $\hat{H}$. Combining Lemmas 8.5, 8.6, 8.7, and 8.10 and Eq. (8.34) completes the proof of Theorem 8.2. Corollary 8.3 is easily obtained from Lemmas 8.6 and 8.8.
2 It is easy to derive a closed-form solution for $\hat{H} = 0, 1$.
8.4 Performance Analysis
To analyze the behavior of the EVB solution in the fully observed MF model, we rely on random matrix theory (Marčenko and Pastur, 1967; Wachter, 1978; Johnstone, 2001; Bouchaud and Potters, 2003; Hoyle and Rattray, 2004; Baik and Silverstein, 2006), which describes the distribution of the singular values of random matrices in the limit when the matrix size goes to infinity. We first introduce some results obtained in random matrix theory and then apply them to our analysis.
8.4.1 Random Matrix Theory
Assume that the observed matrix $V$ is generated from the spiked covariance model (Johnstone, 2001):
\[ V = U^* + E, \tag{8.35} \]
where $U^* \in \mathbb{R}^{L \times M}$ is a true signal matrix with rank $H^*$ and singular values $\{\gamma_h^*\}_{h=1}^{H^*}$, and $E \in \mathbb{R}^{L \times M}$ is a random matrix such that each element is independently drawn from a distribution with mean zero and variance $\sigma^{*2}$ (not necessarily Gaussian). As the observed singular values $\{\gamma_h\}_{h=1}^{L}$ of $V$, the true singular values $\{\gamma_h^*\}_{h=1}^{H^*}$ are also assumed to be arranged in the nonincreasing order. We define normalized versions of the observed and the true singular values:
\[ y_h = \frac{\gamma_h^2}{M\sigma^{*2}} \quad \text{for } h = 1, \ldots, L, \tag{8.36} \]
\[ \nu_h^* = \frac{\gamma_h^{*2}}{M\sigma^{*2}} \quad \text{for } h = 1, \ldots, H^*. \tag{8.37} \]
In other words, $\{y_h\}_{h=1}^{L}$ are the eigenvalues of $VV^\top/(M\sigma^{*2})$, and $\{\nu_h^*\}_{h=1}^{H^*}$ are the eigenvalues of $U^*U^{*\top}/(M\sigma^{*2})$. Note the difference between $x_h$, defined by Eq. (6.132), and $y_h$: $x_h$ is the squared observed singular value normalized with the model noise variance $\sigma^2$, which is to be estimated, while $y_h$ is the one normalized with the true noise variance $\sigma^{*2}$. Define the empirical distribution of the observed eigenvalues $\{y_h\}_{h=1}^{L}$ by
\[ p(y) = \frac{1}{L} \sum_{h=1}^{L} \delta(y - y_h), \tag{8.38} \]
where $\delta(y)$ denotes the Dirac delta function. When $H^* = 0$, the observed matrix $V = E$ consists only of noise, and its singular value distribution in the large-scale limit is specified by the following proposition:
Proposition 8.11 (Marčenko and Pastur, 1967; Wachter, 1978) In the large-scale limit when $L$ and $M$ go to infinity with its ratio $\alpha = L/M$ fixed, the empirical distribution of the eigenvalue $y$ of $EE^\top/(M\sigma^{*2})$ almost surely converges to
\[ p(y) \to p^{\mathrm{MP}}(y) \equiv \frac{\sqrt{(y - \underline{y})(\overline{y} - y)}}{2\pi\alpha y}\, \theta(\underline{y} < y < \overline{y}), \tag{8.39} \]
\[ \text{where} \quad \overline{y} = (1 + \sqrt{\alpha})^2, \qquad \underline{y} = (1 - \sqrt{\alpha})^2, \tag{8.40} \]
and $\theta(\cdot)$ is the indicator function, defined in Theorem 8.1.³
Figure 8.3 shows Eq. (8.39), which we call the Marčenko–Pastur (MP) distribution, for $\alpha = 0.1, 1$. The mean $\langle y \rangle_{p^{\mathrm{MP}}(y)} = 1$ (which is constant for any $0 < \alpha \le 1$) and the upper-limits $\overline{y} = \overline{y}(\alpha)$ of the support for $\alpha = 0.1, 1$ are indicated by arrows. Proposition 8.11 states that the probability mass is concentrated in the range $\underline{y} \le y \le \overline{y}$. Note that the MP distribution appears for a single sample matrix; differently from standard "large-sample" theories, Proposition 8.11 does not require one to average over many sample matrices.⁴ This single-sample property of the MP distribution is highly useful in our analysis because the MF model usually assumes a single observed matrix $V$. We call the (unnormalized) singular value corresponding to the upper-limit $\overline{y}$, i.e.,
\[ \overline{\gamma}^{\mathrm{MPUL}} = \sqrt{M\sigma^{*2} \cdot \overline{y}} = (\sqrt{L} + \sqrt{M})\sigma^*, \tag{8.41} \]
the Marčenko–Pastur upper limit (MPUL).
When $H^* > 0$, the true signal matrix $U^*$ affects the singular value distribution of $V$. However, if $H^* \ll L$, the distribution can be approximated by a mixture of spikes (delta functions) and the MP distribution $p^{\mathrm{MP}}(y)$. Let $H^{**}$ ($\le H^*$) be the number of singular values of $U^*$ satisfying $\gamma_h^* > \alpha^{1/4}\sqrt{M}\sigma^*$, i.e.,
\[ \nu_{H^{**}}^* > \sqrt{\alpha} \quad \text{and} \quad \nu_{H^{**}+1}^* \le \sqrt{\alpha}. \tag{8.42} \]
Then, the following proposition holds:
Proposition 8.12 (Baik and Silverstein, 2006) In the large-scale limit when $L$ and $M$ go to infinity with finite $\alpha$ and $H^*$, it almost surely holds that
\[ y_h = y_h^{\mathrm{Sig}} \equiv \left(1 + \nu_h^*\right)\left(1 + \frac{\alpha}{\nu_h^*}\right) \quad \text{for } h = 1, \ldots, H^{**}, \tag{8.43} \]
\[ y_{H^{**}+1} = \overline{y}, \quad \text{and} \quad y_L = \underline{y}. \]
Figure 8.3 Marčenko–Pastur distribution.
³ Convergence is in weak topology in distribution, i.e., $p(y)$ almost surely converges to $p^{\mathrm{MP}}(y)$ so that $\int f(y)p(y)dy = \int f(y)p^{\mathrm{MP}}(y)dy$ for any bounded continuous function $f(y)$.
⁴ This property is called self-averaging (Bouchaud and Potters, 2003).
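Propositions 8.11 and 8.12 can be illustrated with a single simulated matrix. The following is a minimal sketch with Gaussian noise and one planted signal component; the sizes, the seed, and the value $\nu^* = 1.5$ are arbitrary choices for illustration, and finite-size fluctuations mean that the agreement is only approximate.

```python
import numpy as np

def mp_density(y, alpha):
    """Marchenko-Pastur density of Eq. (8.39) for 0 < alpha <= 1."""
    y = np.atleast_1d(np.asarray(y, dtype=float))
    y_lo, y_hi = (1 - np.sqrt(alpha))**2, (1 + np.sqrt(alpha))**2
    dens = np.zeros_like(y)
    inside = (y > y_lo) & (y < y_hi)
    yi = y[inside]
    dens[inside] = np.sqrt((yi - y_lo) * (y_hi - yi)) / (2 * np.pi * alpha * yi)
    return dens

def normalized_eigenvalues(V, sigma2):
    """Eigenvalues of V V^T / (M sigma2) in nonincreasing order, as in Eq. (8.36)."""
    M = V.shape[1]
    return np.sort(np.linalg.eigvalsh(V @ V.T / (M * sigma2)))[::-1]

rng = np.random.default_rng(0)
L, M, sigma2 = 200, 2000, 1.0
alpha = L / M
nu = 1.5                                   # normalized signal, above sqrt(alpha)

# Rank-one signal with singular value gamma* = sqrt(nu * M * sigma2), Eq. (8.37)
u = rng.normal(size=L)
u /= np.linalg.norm(u)
v = rng.normal(size=M)
v /= np.linalg.norm(v)
U_true = np.sqrt(nu * M * sigma2) * np.outer(u, v)
E = rng.normal(0.0, np.sqrt(sigma2), size=(L, M))

y = normalized_eigenvalues(U_true + E, sigma2)
y_sig = (1 + nu) * (1 + alpha / nu)        # spike position, Eq. (8.43)
y_lo, y_hi = (1 - np.sqrt(alpha))**2, (1 + np.sqrt(alpha))**2

if __name__ == "__main__":
    print("largest eigenvalue vs. y_sig:", y[0], y_sig)
    print("second largest vs. upper limit:", y[1], y_hi)
    print("smallest vs. lower limit:", y[-1], y_lo)
```

The largest eigenvalue sits near $y^{\mathrm{Sig}}$, well outside the bulk, while the remaining eigenvalues stay close to the interval $[\underline{y}, \overline{y}]$ even though only one sample matrix is used.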
Figure 8.4 Spiked covariance distribution when $\{\nu_h^*\}_{h=1}^{H^{**}} = \{1.5, 1.0, 0.5\}$.
Furthermore, Hoyle and Rattray (2004) argued that, when $L$ and $M$ are large (but finite) and $H^* \ll L$, the empirical distribution of the eigenvalue $y$ of $VV^\top/(M\sigma^{*2})$ is accurately approximated by
\[ p(y) \approx p^{\mathrm{SC}}(y) \equiv \frac{1}{L} \sum_{h=1}^{H^{**}} \delta\left(y - y_h^{\mathrm{Sig}}\right) + \frac{L - H^{**}}{L}\, p^{\mathrm{MP}}(y). \tag{8.44} \]
Figure 8.4 shows Eq. (8.44), which we call the spiked covariance (SC) distribution, for $\alpha = 0.1$, $H^{**} = 3$, and $\{\nu_h^*\}_{h=1}^{H^{**}} = \{1.5, 1.0, 0.5\}$. The SC distribution is irrespective of $\{\nu_h^*\}_{h=H^{**}+1}^{H^*}$, which satisfy $0 < \nu_h^* \le \sqrt{\alpha}$ (see the definition (8.42) of $H^{**}$).
Proposition 8.12 states that in the large-scale limit, the large signal components such that $\nu_h^* > \sqrt{\alpha}$ appear outside the support of the MP distribution as spikes, while the other small signals are indistinguishable from the MP distribution (note that Eq. (8.43) implies that $y_h^{\mathrm{Sig}} > \overline{y}$ for $\nu_h^* > \sqrt{\alpha}$). This implies that any PCA method fails to recover the true dimensionality, unless
\[ \nu_{H^*}^* > \sqrt{\alpha}. \tag{8.45} \]
The condition (8.45) requires that $U^*$ has no small positive singular value such that $0 < \nu_h^* \le \sqrt{\alpha}$, and therefore $H^{**} = H^*$. The approximation (8.44) allows us to investigate more practical situations where the matrix size is finite. In Sections 8.4.2 and 8.4.4, respectively, we provide two theorems: one is based on Proposition 8.12 and guarantees perfect rank (PCA dimensionality) recovery of EVB learning in the large-scale limit, and the other one assumes that the approximation (8.44) exactly holds and provides a more realistic condition for perfect recovery.
8.4.2 Perfect Rank Recovery Condition in Large-Scale Limit
Now, we are almost ready for clarifying the behavior of the EVB solution. We assume that the model rank is set to be large enough, i.e., $H^* \le H \le L$, and all model parameters including the noise variance are estimated from observation (i.e., complete EVB learning). The last proposition on which our analysis relies is related to the property, called the strong unimodality, of the log-concave distributions:
Proposition 8.13 (Ibragimov, 1956; Dharmadhikari and Joag-Dev, 1988) The convolution
\[ g(s) = \langle f(s + t) \rangle_{p(t)} = \int f(s + t)\, p(t)\, dt \]
is quasiconvex, if $p(t)$ is a log-concave distribution, and $f(t)$ is a quasiconvex function.
In the large-scale limit, the summation over $h = 1, \ldots, L$ in the objective $\Omega(\sigma^{-2})$, given by Eq. (8.4), for noise variance estimation can be replaced with the expectation over the MP distribution $p^{\mathrm{MP}}(y)$. By scaling variables, the objective can be written as a convolution with a scaled version of the MP distribution, which turns out to be log-concave. Accordingly, we can use Proposition 8.13 to show that $\Omega(\sigma^{-2})$ is quasiconvex, which means that the noise variance estimation by EVB learning can be accurately performed by a local search algorithm. Combining this result with Proposition 8.12, we obtain the following theorem:
Theorem 8.14 In the large-scale limit when $L$ and $M$ go to infinity with finite $\alpha$ and $H^*$, EVB learning almost surely recovers the true rank, i.e., $\hat{H}^{\mathrm{EVB}} = H^*$, if and only if
\[ \nu_{H^*}^* \ge \underline{\tau}, \tag{8.46} \]
where $\underline{\tau}$ is defined in Theorem 6.13.
Furthermore, the following corollary completely describes the behavior of the EVB solution in the large-scale limit:
Corollary 8.15 In the large-scale limit, the objective $\Omega(\sigma^{-2})$, defined by Eq. (8.4), for the noise variance estimation converges to a quasiconvex function, and it almost surely holds that
\[ \hat{\tau}_h^{\mathrm{EVB}} \left( \equiv \frac{\gamma_h \hat{\gamma}_h^{\mathrm{EVB}}}{M \hat{\sigma}^{2\,\mathrm{EVB}}} \right) = \begin{cases} \nu_h^* & \text{if } \nu_h^* \ge \underline{\tau}, \\ 0 & \text{otherwise}, \end{cases} \tag{8.47} \]
\[ \hat{\sigma}^{2\,\mathrm{EVB}} = \sigma^{*2}. \]
8.4 Performance Analysis
219
One may get intuition of Eqs. (8.46) and (8.47) by comparing Eqs. (8.8) and (6.134) with Eq. (8.43): The estimator τh has the same relation to the observation xh as the true signal νh∗ , and hence is an unbiased estimator of the signal. However, Theorem 8.14 does not even approximately hold in practical situations with moderate-sized matrices (see the numerical validation in Section 8.5). After proving Theorem 8.14 and Corollary 8.15, we will derive a more practical condition for perfect recovery in Section 8.4.4.
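8.4.3 Proofs of Theorem 8.14 and Corollary 8.15
In the large-scale limit, we can substitute the expectation $\langle f(y) \rangle_{p(y)}$ for the summation $L^{-1} \sum_{h=1}^{L} f(y_h)$. We can also substitute the MP distribution $p^{\mathrm{MP}}(y)$ for $p(y)$ in the expectation, since the contribution from the H∗ signal components converges to zero. Accordingly, our objective (8.4) converges to
$$ \Omega(\sigma^{-2}) \to \Omega^{\mathrm{LSL}}(\sigma^{-2}) \equiv \int_{\kappa}^{\overline{y}} \psi\left(\sigma^{*2}\sigma^{-2} y\right) p^{\mathrm{MP}}(y)\,dy + \int_{\underline{y}}^{\kappa} \psi_0\left(\sigma^{*2}\sigma^{-2} y\right) p^{\mathrm{MP}}(y)\,dy $$
$$ = \Omega^{\mathrm{LSL\text{-}Full}}(\sigma^{-2}) - \int_{\max(\underline{x}\sigma^2/\sigma^{*2},\, \underline{y})}^{\kappa} \psi_1\left(\sigma^{*2}\sigma^{-2} y\right) p^{\mathrm{MP}}(y)\,dy, \tag{8.48} $$
where
$$ \Omega^{\mathrm{LSL\text{-}Full}}(\sigma^{-2}) \equiv \int_{\underline{y}}^{\overline{y}} \psi\left(\sigma^{*2}\sigma^{-2} y\right) p^{\mathrm{MP}}(y)\,dy, \tag{8.49} $$
and κ is a constant satisfying
$$ \frac{H}{L} = \int_{\kappa}^{\overline{y}} p^{\mathrm{MP}}(y)\,dy \qquad (\underline{y} \le \kappa \le \overline{y}). \tag{8.50} $$
Note that $\underline{x}$, $\underline{y}$, and $\overline{y}$ are defined by Eqs. (8.8) and (8.40), and it holds that
$$ \underline{x} > \overline{y}. \tag{8.51} $$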
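We first investigate Eq. (8.49), which corresponds to the objective for the full-rank model (i.e., H = L). Let
$$ s = \log(\sigma^{-2}), \qquad t = \log y \qquad \left( dt = \tfrac{1}{y}\,dy \right). $$
Then Eq. (8.49) is written as a convolution:
$$ \tilde{\Omega}^{\mathrm{LSL\text{-}Full}}(s) \equiv \Omega^{\mathrm{LSL\text{-}Full}}(e^{s}) = \int \psi\left(\sigma^{*2} e^{s+t}\right) e^{t} p^{\mathrm{MP}}(e^{t})\,dt = \int \tilde{\psi}(s+t)\, p^{\mathrm{LSMP}}(t)\,dt, \tag{8.52} $$
where
$$ \tilde{\psi}(s) = \psi(\sigma^{*2} e^{s}), \qquad p^{\mathrm{LSMP}}(t) = e^{t} p^{\mathrm{MP}}(e^{t}) = \frac{\sqrt{(e^{t} - \underline{y})(\overline{y} - e^{t})}}{2\pi\alpha}\, \theta(\underline{y} < e^{t} < \overline{y}). \tag{8.53} $$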
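Since Lemma 8.4 states that ψ(x) is quasiconvex, its composition $\tilde{\psi}(s)$ with the nondecreasing function $\sigma^{*2} e^{s}$ is also quasiconvex. The following holds for $p^{\mathrm{LSMP}}(t)$, which we call a log-scaled MP (LSMP) distribution:
Lemma 8.16 The LSMP distribution (8.53) is log-concave.
Proof Focusing on the support, $\log \underline{y} < t < \log \overline{y}$, of the LSMP distribution (8.53), we define
$$ f(t) \equiv 2 \log p^{\mathrm{LSMP}}(t) = 2 \log \frac{\sqrt{(e^{t} - \underline{y})(\overline{y} - e^{t})}}{2\pi\alpha} = \log\left(-e^{2t} + (\underline{y} + \overline{y}) e^{t} - \underline{y}\,\overline{y}\right) + \mathrm{const}. $$
Let
$$ u(t) \equiv (e^{t} - \underline{y})(\overline{y} - e^{t}) = -e^{2t} + (\underline{y} + \overline{y}) e^{t} - \underline{y}\,\overline{y} > 0, \tag{8.54} $$
and let
$$ v(t) \equiv \frac{\partial u}{\partial t} = -2e^{2t} + (\underline{y} + \overline{y}) e^{t} = u - e^{2t} + \underline{y}\,\overline{y}, $$
$$ w(t) \equiv \frac{\partial^2 u}{\partial t^2} = -4e^{2t} + (\underline{y} + \overline{y}) e^{t} = v - 2e^{2t}, $$
be the first and the second derivatives of u. Then, the first and the second derivatives of f(t) are given by
$$ \frac{\partial f}{\partial t} = \frac{v}{u}, $$
$$ \frac{\partial^2 f}{\partial t^2} = \frac{uw - v^2}{u^2} = -\frac{e^{t}\left((\underline{y} + \overline{y}) e^{2t} - 4\underline{y}\,\overline{y}\, e^{t} + (\underline{y} + \overline{y})\,\underline{y}\,\overline{y}\right)}{u^2} $$
$$ = -\frac{e^{t} (\underline{y} + \overline{y})}{u^2} \left( \left( e^{t} - \frac{2\underline{y}\,\overline{y}}{\underline{y} + \overline{y}} \right)^{2} + \frac{\underline{y}\,\overline{y}\,(\overline{y} - \underline{y})^{2}}{(\underline{y} + \overline{y})^{2}} \right) \le 0. $$
This proves the log-concavity of the LSMP distribution $p^{\mathrm{LSMP}}(t)$, and completes the proof of Lemma 8.16.
Lemma 8.16 and Proposition 8.13 imply that $\tilde{\Omega}^{\mathrm{LSL\text{-}Full}}(s)$ is quasiconvex, and therefore its composition $\Omega^{\mathrm{LSL\text{-}Full}}(\sigma^{-2})$ with the nondecreasing function $\log(\sigma^{-2})$ is quasiconvex. The minimizer of $\Omega^{\mathrm{LSL\text{-}Full}}(\sigma^{-2})$ can be found by evaluating the derivative Θ, given by Eq. (8.26), in the large-scale limit:
$$ \Theta \to \Theta^{\mathrm{LSL\text{-}Full}} = -\sigma^{2} + \sigma^{*2} \int_{\underline{y}}^{\overline{y}} y \cdot p^{\mathrm{MP}}(y)\,dy - \int_{\underline{x}\sigma^{2}/\sigma^{*2}}^{\overline{y}} \tau(\sigma^{*2}\sigma^{-2} y; \alpha)\, p^{\mathrm{MP}}(y)\,dy. \tag{8.55} $$
Here, we used Eqs. (6.133) and (8.9). In the range
$$ 0 < \sigma^{-2} < \frac{\underline{x}\,\sigma^{*-2}}{\overline{y}} \qquad \left( \text{i.e., } \frac{\underline{x}\sigma^{2}}{\sigma^{*2}} > \overline{y} \right), \tag{8.56} $$
the third term in Eq. (8.55) is zero. Therefore, Eq. (8.55) is increasing with respect to $\sigma^{-2}$, and zero when
$$ \sigma^{2} = \sigma^{*2} \int_{\underline{y}}^{\overline{y}} y \cdot p^{\mathrm{MP}}(y)\,dy = \sigma^{*2}. $$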
Accordingly, $\Omega^{\mathrm{LSL\text{-}Full}}(\sigma^{-2})$ is strictly convex in the range (8.56). Eq. (8.51) implies that the point $\sigma^{-2} = \sigma^{*-2}$ is contained in the region (8.56), and therefore it is a local minimum of $\Omega^{\mathrm{LSL\text{-}Full}}(\sigma^{-2})$. Combined with the quasiconvexity of $\Omega^{\mathrm{LSL\text{-}Full}}(\sigma^{-2})$, we have the following lemma:
Lemma 8.17 The objective $\Omega^{\mathrm{LSL\text{-}Full}}(\sigma^{-2})$ for the full-rank model H = L in the large-scale limit is quasiconvex with its minimizer at $\sigma^{-2} = \sigma^{*-2}$. It is strictly convex in the range (8.56).
For any κ ($\underline{y} < \kappa < \overline{y}$), the second term in Eq. (8.48) is zero in the range (8.56), which includes the minimizer at $\sigma^{-2} = \sigma^{*-2}$. Since Lemma 8.4 states that $\psi_1(x)$ is decreasing for $x > \underline{x}$, the second term in Eq. (8.48) is nondecreasing in the region where
$$ \sigma^{*-2} < \frac{\underline{x}\,\sigma^{*-2}}{\overline{y}} \le \sigma^{-2} < \infty. $$
Therefore, the quasiconvexity of $\Omega^{\mathrm{LSL\text{-}Full}}$ is inherited by $\Omega^{\mathrm{LSL}}$:
Lemma 8.18 The objective $\Omega^{\mathrm{LSL}}(\sigma^{-2})$ for noise variance estimation in the large-scale limit is quasiconvex with its minimizer at $\sigma^{-2} = \sigma^{*-2}$. $\Omega^{\mathrm{LSL}}(\sigma^{-2})$ is strictly convex in the range (8.56).
Thus we have proved that EVB learning accurately estimates the noise variance in the large-scale limit:
$$ \hat{\sigma}^{2\,\mathrm{EVB}} = \sigma^{*2}. \tag{8.57} $$
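Lemma 8.16 can also be checked numerically. The following is a minimal sketch, assuming the standard MP support edges $\underline{y} = (1 - \sqrt{\alpha})^2$ and $\overline{y} = (1 + \sqrt{\alpha})^2$ (Eq. (8.40) is not reproduced in this section) and a hypothetical value α = 0.5. It evaluates the LSMP density (8.53) on a uniform grid inside its support and inspects the second differences of its logarithm, which should all be negative for a log-concave density. The function and variable names are illustrative only.

```python
import numpy as np

def lsmp_pdf(t, alpha):
    """Log-scaled MP density p^LSMP(t) of Eq. (8.53); zero outside the support."""
    y_lo = (1.0 - np.sqrt(alpha)) ** 2   # assumed lower edge of the MP support
    y_hi = (1.0 + np.sqrt(alpha)) ** 2   # upper edge, cf. Eq. (8.59)
    u = (np.exp(t) - y_lo) * (y_hi - np.exp(t))
    return np.sqrt(np.clip(u, 0.0, None)) / (2.0 * np.pi * alpha)

alpha = 0.5  # hypothetical value of alpha = L / M
y_lo = (1.0 - np.sqrt(alpha)) ** 2
y_hi = (1.0 + np.sqrt(alpha)) ** 2

# Uniform grid strictly inside the support (log y_lo, log y_hi).
t = np.linspace(np.log(y_lo), np.log(y_hi), 20001)[1:-1]
p = lsmp_pdf(t, alpha)

# Second differences of log p: negative everywhere if p is log-concave.
max_second_diff = np.diff(np.log(p), n=2).max()

# Trapezoidal rule for the total mass and for the mean of y = e^t.
dt = t[1] - t[0]
mass = np.sum((p[1:] + p[:-1]) * dt / 2.0)
py = np.exp(t) * p
mean_y = np.sum((py[1:] + py[:-1]) * dt / 2.0)
```

The mean of $y$ under the MP distribution being one is what makes Eq. (8.55) vanish at $\sigma^2 = \sigma^{*2}$, so `mean_y` doubles as a sanity check on that step.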
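Assume that Eq. (8.45) holds. Then Proposition 8.12 guarantees that, in the large-scale limit, the following hold:
$$ \frac{\gamma_{H^*}^{2}}{M\sigma^{*2}} \equiv y_{H^*} = \left(1 + \nu^*_{H^*}\right)\left(1 + \frac{\alpha}{\nu^*_{H^*}}\right), \tag{8.58} $$
$$ \frac{\gamma_{H^*+1}^{2}}{M\sigma^{*2}} \equiv y_{H^*+1} = \overline{y} = (1 + \sqrt{\alpha})^{2}. \tag{8.59} $$
Remember that the EVB threshold is given by Eq. (8.8), i.e.,
$$ \frac{(\underline{\gamma}^{\mathrm{EVB}})^{2}}{M \hat{\sigma}^{2\,\mathrm{EVB}}} \equiv \underline{x} = \left(1 + \underline{\tau}\right)\left(1 + \frac{\alpha}{\underline{\tau}}\right). \tag{8.60} $$
Since Lemma 8.18 states that $\hat{\sigma}^{2\,\mathrm{EVB}} = \sigma^{*2}$, comparing Eqs. (8.58) and (8.59) with Eq. (8.60) results in the following lemma:
Lemma 8.19 It almost surely holds that
$$ \gamma_{H^*} \ge \underline{\gamma}^{\mathrm{EVB}} \ \text{ if and only if } \ \nu^*_{H^*} \ge \underline{\tau}, \qquad \gamma_{H^*+1} < \underline{\gamma}^{\mathrm{EVB}} \ \text{ for any } \{\nu_h^*\}. \tag{8.61} $$
This completes the proof of Theorem 8.14. Comparing Eqs. (6.134) and (8.43) under Lemmas 8.18 and 8.19 proves Corollary 8.15.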
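The comparison behind Lemma 8.19 rests on the map $\nu \mapsto (1 + \nu)(1 + \alpha/\nu)$ in Eq. (8.58) being increasing for $\nu > \sqrt{\alpha}$, so that $y_{H^*} \ge \underline{x}$ is equivalent to $\nu^*_{H^*} \ge \underline{\tau}$. The short sketch below illustrates this with a hypothetical α = 0.5; the function name `observed_y` is not from the text.

```python
import numpy as np

def observed_y(nu, alpha):
    """Large-scale-limit value of gamma^2 / (M sigma*^2) for signal nu, Eq. (8.58)."""
    return (1.0 + nu) * (1.0 + alpha / nu)

alpha = 0.5                                   # hypothetical alpha = L / M
nu = np.linspace(np.sqrt(alpha), 5.0, 1000)   # signals at or above sqrt(alpha)
y = observed_y(nu, alpha)
y_bar = (1.0 + np.sqrt(alpha)) ** 2           # upper edge of the noise bulk, Eq. (8.59)
```

At $\nu = \sqrt{\alpha}$ the map returns exactly $\overline{y}$, which is why signals at or below $\sqrt{\alpha}$ cannot be separated from the noise bulk.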
8.4.4 Practical Condition for Perfect Rank Recovery
Theorem 8.14 rigorously holds in the large-scale limit. However, it does not describe the behavior of the EVB solution very accurately in practical finite matrix-size cases. We can obtain a more practical condition for perfect recovery by relying on the approximation (8.44). We can prove the following theorem:
Theorem 8.20 Let
$$ \xi = \frac{H^*}{L} $$
be the relevant rank ratio, and assume that
$$ p(y) = p^{\mathrm{SC}}(y). \tag{8.62} $$
Then, EVB learning recovers the true rank, i.e., $\hat{H}^{\mathrm{EVB}} = H^*$, if the following two inequalities hold:
$$ \xi < \frac{1}{\underline{x}}, \tag{8.63} $$
$$ \nu^*_{H^*} > \frac{\frac{\underline{x} - 1}{1 - \underline{x}\xi} - \alpha + \sqrt{\left( \frac{\underline{x} - 1}{1 - \underline{x}\xi} - \alpha \right)^{2} - 4\alpha}}{2}, \tag{8.64} $$
where $\underline{x}$ is defined by Eq. (8.8).
Note that, in the large-scale limit, ξ converges to zero, and the sufficient condition, Eqs. (8.63) and (8.64), in Theorem 8.20 is equivalent to the necessary and sufficient condition (8.46) in Theorem 8.14. Theorem 8.20 only requires that the SC distribution (8.44) well approximates the observed singular value distribution. Accordingly, it well describes the dependency of the EVB solution on ξ, which will be shown in numerical validation in Section 8.5. Theorem 8.20 states that, if the true rank H∗ is small enough compared with L and the smallest signal $\nu^*_{H^*}$ is large enough, EVB learning perfectly recovers the true rank. The following corollary also supports EVB learning:
Corollary 8.21 Under the assumption (8.62) and the conditions (8.63) and (8.64), the objective $\Omega(\sigma^{-2})$ for the noise variance estimation has no local minimum (no stationary point if ξ > 0) that results in a wrong estimated rank $\hat{H}^{\mathrm{EVB}} \ne H^*$.
This corollary states that, although the objective function (8.4) is nonconvex and possibly multimodal in general, any local minimum leads to the correct estimated rank. Therefore, perfect recovery does not require global search, but only local search, for noise variance estimation, if L and M are sufficiently large so that we can warrant Eq. (8.62). In the next section, we give the proofs of Theorem 8.20 and Corollary 8.21, and then show numerical experiments that support the theory.
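The two inequalities of Theorem 8.20 are easy to evaluate for given ξ and α. The sketch below is an illustration only: it takes $\underline{\tau}$ as an input (here the approximate value 2.5129 for α = 1, following the approximation $\underline{\tau} \approx 2.5129\sqrt{\alpha}$ quoted in Chapter 9) and returns the right-hand side of (8.64), or `None` when (8.63) is violated. The function names are hypothetical.

```python
import math

def evb_threshold_x(tau_under, alpha):
    """x_underline = (1 + tau)(1 + alpha / tau), cf. Eq. (8.60)."""
    return (1.0 + tau_under) * (1.0 + alpha / tau_under)

def min_signal_for_recovery(xi, alpha, tau_under):
    """Right-hand side of (8.64); None if condition (8.63) on xi is violated."""
    x = evb_threshold_x(tau_under, alpha)
    if xi >= 1.0 / x:
        return None
    c = (x - 1.0) / (1.0 - x * xi) - alpha
    return 0.5 * (c + math.sqrt(c * c - 4.0 * alpha))

alpha, tau_under = 1.0, 2.5129      # approximate tau_underline for alpha = 1
bound_limit = min_signal_for_recovery(0.0, alpha, tau_under)   # xi -> 0
bound_small = min_signal_for_recovery(0.1, alpha, tau_under)   # xi = 0.1
bound_large = min_signal_for_recovery(0.9, alpha, tau_under)   # violates (8.63)
```

For ξ = 0 the bound reduces to $\underline{\tau}$ itself, which is the large-scale-limit condition (8.46); for positive ξ the required signal strength grows.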
8.4.5 Proofs of Theorem 8.20 and Corollary 8.21
We regroup the terms in Eq. (8.4) as follows:
$$ \Omega(\sigma^{-2}) = \Omega_1(\sigma^{-2}) + \Omega_0(\sigma^{-2}), \tag{8.65} $$
where
$$ \Omega_1(\sigma^{-2}) = \frac{1}{H^*} \sum_{h=1}^{H^*} \psi\left( \frac{\gamma_h^{2}}{M} \sigma^{-2} \right), \tag{8.66} $$
$$ \Omega_0(\sigma^{-2}) = \frac{1}{L - H^*} \left( \sum_{h=H^*+1}^{H} \psi\left( \frac{\gamma_h^{2}}{M} \sigma^{-2} \right) + \sum_{h=H+1}^{L} \psi_0\left( \frac{\gamma_h^{2}}{M} \sigma^{-2} \right) \right). \tag{8.67} $$
In the following, assuming that Eq. (8.62) holds and
$$ y_{H^*} > \overline{y}, \tag{8.68} $$
we derive a sufficient condition for any local minimizer to lie only in $\sigma^{-2} \in \mathcal{B}_{H^*}$, with which Lemma 8.7 proves Theorem 8.20. Under the assumption (8.62) and the condition (8.68), $\Omega_0(\sigma^{-2})$, defined by Eq. (8.67), is equivalent to the objective $\Omega^{\mathrm{LSL}}(\sigma^{-2})$ in the large-scale limit. Using Lemma 8.18, and noting that
$$ \sigma^{-2}_{H^*+1} = \frac{M \underline{x}}{\gamma^{2}_{H^*+1}} = \frac{\underline{x}\,\sigma^{*-2}}{\overline{y}} > \sigma^{*-2}, \tag{8.69} $$
we have the following lemma:
Lemma 8.22 $\Omega_0(\sigma^{-2})$ is quasiconvex with its minimizer at $\sigma^{-2} = \sigma^{*-2}$. $\Omega_0(\sigma^{-2})$ is strictly convex in the range $0 < \sigma^{-2} < \sigma^{-2}_{H^*+1}$.
Using Lemma 8.22 and the strict quasiconvexity of ψ(x), we can deduce the following lemma:
Lemma 8.23 $\Omega(\sigma^{-2})$ is nondecreasing (increasing if ξ > 0) in the range $\sigma^{-2}_{H^*+1} < \sigma^{-2} < \infty$.
Proof Lemma 8.22 states that $\Omega_0(\sigma^{-2})$, defined by Eq. (8.67), is quasiconvex with its minimizer at
$$ \sigma^{-2} = \left( \frac{\sum_{h=H^*+1}^{L} \gamma_h^{2}}{(L - H^*) M} \right)^{-1} = \sigma^{*-2}. $$
Since $\Omega_1(\sigma^{-2})$, defined by Eq. (8.66), is the sum of strictly quasiconvex functions with their minimizers at $\sigma^{-2} = M/\gamma_h^{2} < \sigma^{*-2}$ for h = 1, . . . , H∗, our objective $\Omega(\sigma^{-2})$, given by Eq. (8.65), is nondecreasing (increasing if H∗ > 0) for $\sigma^{-2} \ge \sigma^{*-2}$.
Since Eq. (8.69) implies that $\sigma^{-2}_{H^*+1} > \sigma^{*-2}$, $\Omega(\sigma^{-2})$ is nondecreasing (increasing if ξ > 0) for $\sigma^{-2} > \sigma^{-2}_{H^*+1}$, which completes the proof of Lemma 8.23.
Using the bounds given by Eq. (8.33) and Lemma 8.22, we also obtain the following lemma:
Lemma 8.24 $\Omega(\sigma^{-2})$ is increasing at $\sigma^{-2} = \sigma^{-2}_{H^*+1} - 0$.5 It is decreasing at $\sigma^{-2} = \sigma^{-2}_{H^*} + 0$ if the following hold:
$$ \xi < \frac{1}{(1 + \sqrt{\alpha})^{2}}, \tag{8.70} $$
$$ y_{H^*} > \frac{\underline{x}(1 - \xi)}{1 - \xi(1 + \sqrt{\alpha})^{2}}. \tag{8.71} $$
5 By “−0” we denote a negative value arbitrarily close to zero.
Proof Lemma 8.22 states that $\Omega_0(\sigma^{-2})$ is strictly convex in the range $0 < \sigma^{-2} < \sigma^{-2}_{H^*+1}$, and minimized at $\sigma^{-2} = \sigma^{*-2}$. Since Eq. (8.69) implies that $\sigma^{*-2} < \sigma^{-2}_{H^*+1}$, $\Omega_0(\sigma^{-2})$ is increasing at $\sigma^{-2} = \sigma^{-2}_{H^*+1} - 0$. Since $\Omega_1(\sigma^{-2})$ is the sum of strictly quasiconvex functions with their minimizers at $\sigma^{-2} = M/\gamma_h^{2} < \sigma^{*-2}$ for h = 1, . . . , H∗, $\Omega(\sigma^{-2})$ is also increasing at $\sigma^{-2} = \sigma^{-2}_{H^*+1} - 0$.
Let us investigate the sign of the derivative Θ of $\Omega(\sigma^{-2})$ at $\sigma^{-2} = \sigma^{-2}_{H^*} + 0 \in \mathcal{B}_{H^*}$. Substituting the upper-bound in Eq. (8.33) into Eq. (8.26), we have
$$ \Theta < -\sigma^{2} + \frac{H^* (\sqrt{L} + \sqrt{M})^{2} \sigma^{2} + \sum_{h=H^*+1}^{L} \gamma_h^{2}}{LM} = -\sigma^{2} + \frac{H^* (\sqrt{L} + \sqrt{M})^{2} \sigma^{2} + (L - H^*) M \sigma^{*2}}{LM}. \tag{8.72} $$
The right-hand side of Eq. (8.72) is negative if the following hold:
$$ \xi = \frac{H^*}{L} < \frac{M}{(\sqrt{L} + \sqrt{M})^{2}} = \frac{1}{(1 + \sqrt{\alpha})^{2}}, \tag{8.73} $$
$$ \sigma^{2} > \frac{(L - H^*) M \sigma^{*2}}{LM - H^* (\sqrt{L} + \sqrt{M})^{2}} = \frac{(1 - \xi)\sigma^{*2}}{1 - \xi(1 + \sqrt{\alpha})^{2}}. \tag{8.74} $$
Assume that the first condition (8.73) holds. Then the second condition (8.74) holds at $\sigma^{-2} = \sigma^{-2}_{H^*} + 0$, if
$$ \sigma^{-2}_{H^*} < \frac{1 - \xi(1 + \sqrt{\alpha})^{2}}{1 - \xi}\, \sigma^{*-2}, $$
or equivalently,
$$ y_{H^*} = \frac{\gamma^{2}_{H^*}}{M\sigma^{*2}} = \underline{x} \cdot \frac{\sigma^{2}_{H^*}}{\sigma^{*2}} > \frac{\underline{x}(1 - \xi)}{1 - \xi(1 + \sqrt{\alpha})^{2}}, $$
which completes the proof of Lemma 8.24. Finally, we obtain the following lemma:
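Lemma 8.25 $\Omega(\sigma^{-2})$ is decreasing in the range $0 < \sigma^{-2} < \sigma^{-2}_{H^*}$ if the following hold:
$$ \xi < \frac{1}{\underline{x}}, \tag{8.75} $$
$$ y_{H^*} > \frac{\underline{x}(1 - \xi)}{1 - \underline{x}\xi}. \tag{8.76} $$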
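Proof In the range $0 < \sigma^{-2} < \sigma^{-2}_{H^*}$, the estimated rank (8.27) is bounded as
$$ 0 \le \hat{H} \le H^* - 1. $$
Substituting the upper-bound in Eq. (8.33) into Eq. (8.26), we have
$$ \Theta < -\sigma^{2} + \frac{\hat{H} (\sqrt{L} + \sqrt{M})^{2} \sigma^{2} + \sum_{h=\hat{H}+1}^{H^*} \gamma_h^{2} + \sum_{h=H^*+1}^{L} \gamma_h^{2}}{LM} \tag{8.77} $$
$$ = -\sigma^{2} + \frac{\hat{H} (\sqrt{L} + \sqrt{M})^{2} \sigma^{2} + \sum_{h=\hat{H}+1}^{H^*} \gamma_h^{2} + (L - H^*) M \sigma^{*2}}{LM}. \tag{8.78} $$
The right-hand side of Eq. (8.78) is negative, if the following hold:
$$ \frac{\hat{H}}{L} < \frac{M}{(\sqrt{L} + \sqrt{M})^{2}} = \frac{1}{(1 + \sqrt{\alpha})^{2}}, \tag{8.79} $$
$$ \sigma^{2} > \frac{\sum_{h=\hat{H}+1}^{H^*} \gamma_h^{2} + (L - H^*) M \sigma^{*2}}{LM - \hat{H} (\sqrt{L} + \sqrt{M})^{2}}. \tag{8.80} $$
Assume that
$$ \xi = \frac{H^*}{L} < \frac{1}{(1 + \sqrt{\alpha})^{2}}. $$
Then both of the conditions (8.79) and (8.80) hold for $\sigma^{-2} \in (0, \sigma^{-2}_{H^*})$, if the following holds:
$$ \sigma^{-2}_{\hat{H}+1} < \frac{LM - \hat{H} (\sqrt{L} + \sqrt{M})^{2}}{\sum_{h=\hat{H}+1}^{H^*} \gamma_h^{2} + (L - H^*) M \sigma^{*2}} \qquad \text{for } \hat{H} = 0, \ldots, H^* - 1. \tag{8.81} $$
Since the sum $\sum_{h=\hat{H}+1}^{H^*} \gamma_h^{2}$ in the right-hand side of Eq. (8.81) is upper-bounded as
$$ \sum_{h=\hat{H}+1}^{H^*} \gamma_h^{2} \le (H^* - \hat{H})\, \gamma^{2}_{\hat{H}+1}, $$
Eq. (8.81) holds if
$$ \sigma^{-2}_{\hat{H}+1} < \frac{LM - \hat{H} (\sqrt{L} + \sqrt{M})^{2}}{(H^* - \hat{H})\, \gamma^{2}_{\hat{H}+1} + (L - H^*) M \sigma^{*2}} = \frac{1 - \frac{\hat{H}}{L}(1 + \sqrt{\alpha})^{2}}{\left(\xi - \frac{\hat{H}}{L}\right) \frac{\gamma^{2}_{\hat{H}+1}}{M} + (1 - \xi)\sigma^{*2}}, $$
that is, if
$$ \left( 1 - \frac{\hat{H}}{L}(1 + \sqrt{\alpha})^{2} \right) \frac{\gamma^{2}_{\hat{H}+1}}{M\sigma^{*2}} > \left( \xi\underline{x} - \frac{\hat{H}}{L}\underline{x} \right) \frac{\gamma^{2}_{\hat{H}+1}}{M\sigma^{*2}} + (1 - \xi)\underline{x}, $$
or equivalently
$$ y_{\hat{H}+1} = \frac{\gamma^{2}_{\hat{H}+1}}{M\sigma^{*2}} > \frac{(1 - \xi)\underline{x}}{1 - \xi\underline{x} + \frac{\hat{H}}{L}\left( \underline{x} - (1 + \sqrt{\alpha})^{2} \right)} \qquad \text{for } \hat{H} = 0, \ldots, H^* - 1. \tag{8.83} $$
Note that $\underline{x} > \overline{y} = (1 + \sqrt{\alpha})^{2}$. Further bounding both sides, we have the following sufficient condition for Eq. (8.83) to hold:
$$ y_{H^*} > \frac{(1 - \xi)\underline{x}}{\max\left(0,\, 1 - \xi\underline{x}\right)}. \tag{8.84} $$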
Thus we obtain the conditions (8.75) and (8.76) for Θ to be negative for $\sigma^{-2} \in (0, \sigma^{-2}_{H^*})$, which completes the proof of Lemma 8.25.
Lemmas 8.23, 8.24, and 8.25 together state that, if all the conditions (8.68) and (8.70) through (8.76) hold, at least one local minimum exists in the correct range $\sigma^{-2} \in \mathcal{B}_{H^*}$, and no local minimum (no stationary point if ξ > 0) exists outside the correct range. Therefore, we can estimate the correct rank $\hat{H}^{\mathrm{EVB}} = H^*$ by using a local search algorithm for noise variance estimation. Choosing the tightest conditions, we have the following lemma:
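Lemma 8.26 $\Omega(\sigma^{-2})$ has a global minimum in $\sigma^{-2} \in \mathcal{B}_{H^*}$, and no local minimum (no stationary point if ξ > 0) outside $\mathcal{B}_{H^*}$, if the following hold:
$$ \xi < \frac{1}{\underline{x}}, \qquad y_{H^*} = \frac{\gamma^{2}_{H^*}}{M\sigma^{*2}} > \frac{\underline{x}(1 - \xi)}{1 - \underline{x}\xi}. \tag{8.85} $$
Using Eq. (8.43), Eq. (8.85) can be written with the true signal amplitude as follows:
$$ \left(1 + \nu^*_{H^*}\right)\left(1 + \frac{\alpha}{\nu^*_{H^*}}\right) - \frac{\underline{x}(1 - \xi)}{1 - \underline{x}\xi} > 0. \tag{8.86} $$
The left-hand side of Eq. (8.86) can be factorized as follows:
$$ \frac{1}{\nu^*_{H^*}} \left( \nu^*_{H^*} - \frac{\frac{\underline{x}(1-\xi)}{1-\underline{x}\xi} - (1+\alpha) + \sqrt{\left( \frac{\underline{x}(1-\xi)}{1-\underline{x}\xi} - (1+\alpha) \right)^{2} - 4\alpha}}{2} \right) \cdot \left( \nu^*_{H^*} - \frac{\frac{\underline{x}(1-\xi)}{1-\underline{x}\xi} - (1+\alpha) - \sqrt{\left( \frac{\underline{x}(1-\xi)}{1-\underline{x}\xi} - (1+\alpha) \right)^{2} - 4\alpha}}{2} \right) > 0. \tag{8.87} $$
When Eq. (8.45) holds, the last factor in the left-hand side in Eq. (8.87) is positive. Therefore, we have the following condition:
$$ \nu^*_{H^*} > \frac{\frac{\underline{x}(1-\xi)}{1-\underline{x}\xi} - (1+\alpha) + \sqrt{\left( \frac{\underline{x}(1-\xi)}{1-\underline{x}\xi} - (1+\alpha) \right)^{2} - 4\alpha}}{2} = \frac{\frac{\underline{x}-1}{1-\underline{x}\xi} - \alpha + \sqrt{\left( \frac{\underline{x}-1}{1-\underline{x}\xi} - \alpha \right)^{2} - 4\alpha}}{2}. \tag{8.88} $$
Lemma 8.26 with the condition (8.85) replaced with the condition (8.88) leads to Theorem 8.20 and Corollary 8.21.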
8.5 Numerical Verification
Figure 8.5 shows numerical simulation results for M = 200 and L = 20, 100, 200. E was drawn from the independent Gaussian distribution with mean 0 and variance $\sigma^{*2} = 1$, and true signal singular values $\{\gamma_h^*\}_{h=1}^{H^*}$ were drawn from the uniform distribution on $[z\sqrt{M}\sigma^*, 10\sqrt{M}\sigma^*]$ for different z, which is indicated by the horizontal axis. We used Algorithm 16, which will be introduced in Chapter 9, to compute the global EVB solution.
[Figure 8.5: three panels, (a) L = 20, (b) L = 100, (c) L = 200; vertical axis: success rate (0 to 1); horizontal axis: z (0 to 5).]
Figure 8.5 Success rate of rank recovery in numerical simulation for M = 200. The horizontal axis indicates the lower limit of the support of the simulated true signal distribution, i.e., $z \approx \sqrt{\nu^*_{H^*}}$. The recovery condition (8.64) for finite-sized matrices is indicated by a vertical bar with the same line style for each ξ. The leftmost vertical bar, which corresponds to the condition (8.64) for ξ = 0, coincides with the recovery condition (8.46) for infinite-sized matrices.
The vertical axis indicates the success rate of rank recovery over 100 trials, i.e., the proportion of the trials giving $\hat{H}^{\mathrm{EVB}} = H^*$. If the condition (8.63) on ξ is violated, the corresponding curve is depicted with markers. Otherwise, the condition (8.64) on $\nu^*_{H^*}$ $(= \gamma^{*2}_{H^*}/(M\sigma^{*2}))$ is indicated by a vertical bar with the same line style for each ξ. In other words, Theorem 8.20 states that the success rate should be equal to one if z $(\le \gamma^*_{H^*}/\sqrt{M\sigma^{*2}})$ is larger than the value indicated by the vertical bar. The leftmost vertical bar, which corresponds to the condition (8.64) for ξ = 0, coincides with the recovery condition (8.46), given by Theorem 8.14, for infinite-sized matrices. We see that Theorem 8.20 with the condition (8.64) approximately holds for these moderate-sized matrices, while Theorem 8.14 with the condition (8.46), which does not depend on the relevant rank ratio ξ, immediately breaks for positive ξ.
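The data-generating process described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' simulation code: the text specifies the noise distribution and the range of the true singular values, but not how the singular vectors were chosen, so the sketch assumes random orthonormal vectors obtained from QR decompositions. The function name `simulate_observation` and its parameters are hypothetical.

```python
import numpy as np

def simulate_observation(L, M, H_true, z, sigma=1.0, seed=0):
    """Return (V, U_true, gamma_true) with V = U_true + E as in Section 8.5."""
    rng = np.random.default_rng(seed)
    # True singular values: uniform on [z sqrt(M) sigma, 10 sqrt(M) sigma].
    lo, hi = z * np.sqrt(M) * sigma, 10.0 * np.sqrt(M) * sigma
    gamma_true = np.sort(rng.uniform(lo, hi, size=H_true))[::-1]
    # Assumption: singular vectors are random orthonormal columns.
    Qb, _ = np.linalg.qr(rng.standard_normal((L, H_true)))
    Qa, _ = np.linalg.qr(rng.standard_normal((M, H_true)))
    U_true = (Qb * gamma_true) @ Qa.T
    # Independent Gaussian noise with mean 0 and variance sigma^2.
    E = sigma * rng.standard_normal((L, M))
    return U_true + E, U_true, gamma_true

V, U_true, gamma_true = simulate_observation(L=20, M=200, H_true=2, z=2.0)
```

A success-rate curve like those in Figure 8.5 would repeat this for many seeds and values of z and count how often an EVB solver returns rank `H_true`; the solver itself is not part of this sketch.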
8.6 Comparison with Laplace Approximation
Here, we compare EVB learning with an alternative dimensionality selection method (Hoyle, 2008) based on the Laplace approximation (LA). Consider the PCA application, where D denotes the dimensionality of the observation space, and N denotes the number of samples, i.e., in our MF notation to keep L ≤ M,
$$ \begin{aligned} L = D,\ M = N \quad & \text{if } D \le N, \\ L = N,\ M = D \quad & \text{if } D > N. \end{aligned} \tag{8.89} $$
Right after Tipping and Bishop (1999) proposed the probabilistic PCA, Bishop (1999a) proposed to select the PCA dimensionality by maximizing the marginal likelihood:
$$ p(V) = \langle p(V|A, B) \rangle_{p(A)p(B)}. \tag{8.90} $$
Since the marginal likelihood (8.90) is computationally intractable, he approximated it by LA, and suggested Gibbs sampling and VB learning as alternatives. The VB variant, of which the model is almost the same as the MF defined by Eqs. (6.1) through (6.3), was also proposed by Bishop (1999b) along with a standard local solver similar to Algorithm 1 in Chapter 3. The LA-based approach was polished in Minka (2001a), by introducing a conjugate prior6 on B to $p(V|B) = \langle p(V|A, B) \rangle_{p(A)}$, and ignoring the nonleading terms that do not grow fast as the number N of samples goes to infinity. Hoyle (2008) pointed out that Minka’s method is inaccurate when D ≫ N, and proposed the overlap (OL) method, a further polished variant of the LA-based approach. A notable difference of the OL method from most of the LA-based methods is that the OL method applies LA around a more accurate estimator than the MAP estimator.7 Thanks to the use of the accurate estimator, the OL method behaves optimally in the large-scale limit when D and N go to infinity, while Minka’s method does not. We will clarify the meaning of the optimality and discuss it in more detail in Section 8.7.
6 This conjugate prior does not satisfy the implicit requirement, footnoted in Section 1.2.4, that the moments of the family member can be computed analytically.
7 As explained in Section 2.2.1, LA is usually applied around the MAP estimator.
The OL method minimizes an approximation to the negative logarithm of the marginal likelihood (8.90), which depends on estimators for $\lambda_h = b_h^{2} + \sigma^{2}$ and $\sigma^{2}$ computed by an iterative algorithm, over the hypothetical model rank H = 0, . . . , L (see Appendix C for the detailed computational procedure). Figure 8.6 shows numerical simulation results that compare EVB learning and the OL method: Figure 8.6(a) shows the success rate for the no-signal case ξ = 0 (H∗ = 0), while Figures 8.6(b) through 8.6(f) show the success rate for ξ = 0.05 and D = 20, 100, 200, 400, and 1,000, respectively. We also show the performance of the local-EVB estimator (6.131), which was computed by a local solver (Algorithm 18 introduced in Chapter 9). For the OL method and the local-EVB estimator, we initialized the noise variance estimator to $10^{-4} \cdot \sum_{h=1}^{L} \gamma_h^{2}/(LM)$.
[Figure 8.6: six panels, (a) ξ = 0, (b) ξ = 0.05, D = 20, (c) ξ = 0.05, D = 100, (d) ξ = 0.05, D = 200, (e) ξ = 0.05, D = 400, (f) ξ = 0.05, D = 1,000; curves: EVB, OL, Local-EVB; vertical axis: success rate; horizontal axis: z (0 to 5).]
Figure 8.6 Success rate of PCA dimensionality recovery by (global) EVB learning, the OL method, and the local-EVB estimator for N = 200. Vertical bars indicate the recovery conditions, Eq. (8.46) for EVB learning, and Eq. (8.95) for the OL method and the local-EVB estimator, in the large-scale limit.
In comparison with the OL method, EVB learning shows its conservative nature: It exhibits almost zero false positive rate (Figure 8.6(a)) at the expense of low sensitivity (Figures 8.6(c) through 8.6(f)). Actually, because of its low sensitivity, EVB learning does not behave optimally in the large-scale limit. The local-EVB estimator, on the other hand, behaves similarly to the OL method, for which the reason will be elucidated in the next section.
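8.7 Optimality in Large-Scale Limit
Consider the large-scale limit, and assume that the model rank H is set to be large enough but finite so that H ≥ H∗ and H/L → 0. Then the rank estimation procedure, detailed in Appendix C, by the OL method is reduced to counting the number of components such that $\hat{\lambda}_h^{\mathrm{OL\text{-}LSL}} > \hat{\sigma}^{2\,\mathrm{OL\text{-}LSL}}$, i.e.,
$$ \hat{H}^{\mathrm{OL\text{-}LSL}} = \sum_{h=1}^{L} \theta\left( \hat{\lambda}_h^{\mathrm{OL\text{-}LSL}} > \hat{\sigma}^{2\,\mathrm{OL\text{-}LSL}} \right), \tag{8.91} $$
where θ(·) is the indicator function defined in Theorem 8.1. Here $\hat{\lambda}_h^{\mathrm{OL\text{-}LSL}}$ and $\hat{\sigma}^{2\,\mathrm{OL\text{-}LSL}}$ are computed by iterating the following updates until convergence:
$$ \hat{\lambda}_h^{\mathrm{OL\text{-}LSL}} = \begin{cases} \breve{\lambda}_h^{\mathrm{OL\text{-}LSL}} & \text{if } \gamma_h \ge \underline{\gamma}^{\mathrm{local\text{-}EVB}}, \\ \hat{\sigma}^{2\,\mathrm{OL\text{-}LSL}} & \text{otherwise}, \end{cases} \tag{8.92} $$
$$ \hat{\sigma}^{2\,\mathrm{OL\text{-}LSL}} = \frac{1}{M - H} \left( \sum_{l=1}^{L} \frac{\gamma_l^{2}}{L} - \sum_{h=1}^{H} \hat{\lambda}_h^{\mathrm{OL\text{-}LSL}} \right), \tag{8.93} $$
where
$$ \breve{\lambda}_h^{\mathrm{OL\text{-}LSL}} = \frac{\gamma_h^{2}}{2L} \left( 1 - \frac{(M - L)\hat{\sigma}^{2\,\mathrm{OL\text{-}LSL}}}{\gamma_h^{2}} + \sqrt{ \left( 1 - \frac{(M - L)\hat{\sigma}^{2\,\mathrm{OL\text{-}LSL}}}{\gamma_h^{2}} \right)^{2} - \frac{4L\hat{\sigma}^{2\,\mathrm{OL\text{-}LSL}}}{\gamma_h^{2}} } \right). \tag{8.94} $$
The OL method evaluates its objective, which approximates the negative logarithm of the marginal likelihood (8.90), after the updates (8.92) and (8.93) converge for each hypothetical H, and adopts the minimizer $\hat{H}^{\mathrm{OL\text{-}LSL}}$ as the rank estimator. However, Hoyle (2008) proved that, in the large-scale limit, the objective decreases as H increases, as long as Eq. (8.94) is a real number (or equivalently $\gamma_h \ge \underline{\gamma}^{\mathrm{local\text{-}EVB}}$ holds) for all h = 1, . . . , H at the convergence. Accordingly, Eq. (8.91) holds. Interestingly, the threshold in Eq. (8.92) coincides with the local-EVB threshold (6.127). Moreover, the updates (8.92) and (8.93) for the OL method are equivalent to the updates (9.29) and (9.30) for the local-EVB estimator (Algorithm 18) with the following correspondence:
$$ \hat{\lambda}_h^{\mathrm{OL\text{-}LSL}} = \frac{\gamma_h \hat{\gamma}_h^{\mathrm{local\text{-}EVB}}}{L} + \hat{\sigma}^{2\,\mathrm{local\text{-}EVB}}, \qquad \hat{\sigma}^{2\,\mathrm{OL\text{-}LSL}} = \hat{\sigma}^{2\,\mathrm{local\text{-}EVB}}. $$
Thus, the rank estimation procedure by the OL method and that by the local-EVB estimator are equivalent, and therefore $\hat{H}^{\mathrm{OL\text{-}LSL}} = \hat{H}^{\mathrm{local\text{-}EVB}}$ in the large-scale limit.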
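The correspondence above can be checked numerically for a single component. The sketch below assumes that the positive local-EVB solution has the same closed form as the shrinkage estimator $\breve{\gamma}_h^{\mathrm{EVB}}$ stated with Eq. (9.18); this is an assumption of the sketch, since Eqs. (6.127) through (6.131) are not reproduced here. The values of L, M, σ², and γ are arbitrary illustrations, and the function names are hypothetical.

```python
import math

def lambda_breve_ol(gamma, L, M, sigma2):
    """Eq. (8.94); returns None when the square root is not real."""
    r = sigma2 / gamma ** 2
    disc = (1.0 - (M - L) * r) ** 2 - 4.0 * L * r
    if disc < 0:
        return None
    return gamma ** 2 / (2.0 * L) * (1.0 - (M - L) * r + math.sqrt(disc))

def gamma_breve(gamma, L, M, sigma2):
    """Shrinkage form given with Eq. (9.18), assumed here for local-EVB."""
    r = sigma2 / gamma ** 2
    disc = (1.0 - (M + L) * r) ** 2 - 4.0 * L * M * r ** 2
    if disc < 0:
        return None
    return gamma / 2.0 * (1.0 - (M + L) * r + math.sqrt(disc))

L, M, sigma2, gamma = 2, 4, 1.0, 5.0   # arbitrary illustrative values
lam_ol = lambda_breve_ol(gamma, L, M, sigma2)
lam_from_evb = gamma * gamma_breve(gamma, L, M, sigma2) / L + sigma2
```

Both square roots share the same radicand after expansion, so the two expressions become real at the same singular value, which is consistent with the statement that the two thresholds coincide.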
If the noise variance is accurately estimated, i.e., $\hat{\sigma}^{2} = \sigma^{*2}$, the threshold $\underline{\gamma}^{\mathrm{local\text{-}EVB}}$ both for the OL method and the local-EVB estimator coincides with the MPUL (8.41), which corresponds to the minimum detectable observed singular value. By using this fact, the optimality of the OL method in the large-scale limit was shown:
Proposition 8.27 (Hoyle, 2008) In the large-scale limit, when L and M go to infinity with finite α, H∗, and H (≥ H∗)8, the OL method almost surely recovers the true rank, i.e., $\hat{H}^{\mathrm{OL\text{-}LSL}} = H^*$, if and only if
$$ \nu^*_{H^*} > \sqrt{\alpha}. \tag{8.95} $$
It almost surely holds that
$$ \frac{\hat{\lambda}_h^{\mathrm{OL\text{-}LSL}}}{\hat{\sigma}^{2\,\mathrm{OL\text{-}LSL}}} - 1 = \nu_h^*, \qquad \hat{\sigma}^{2\,\mathrm{OL\text{-}LSL}} = \sigma^{*2}. $$
8 Unlike our analysis in Section 8.4, Hoyle (2008) assumed H/L → 0 to prove that the OL method accurately estimates the noise variance.
The condition (8.95) coincides with the condition (8.45), which any PCA method requires for perfect dimensionality recovery. In this sense, the OL method, as well as the local-EVB estimator, is optimal in the large-scale limit. On the other hand, Theorem 8.14 implies that (global) EVB learning is not optimal in the large-scale limit but more conservative (see the difference between $\underline{\tau}$ and $\sqrt{\alpha}$ in Figure 6.4). In Figure 8.6, the conditions for perfect dimensionality recovery in the large-scale limit are indicated by vertical bars: $z = \sqrt{\underline{\tau}}$ for EVB, and $z = \sqrt{\underline{\tau}^{\mathrm{local}}} = \alpha^{1/4}$ for OL and local-EVB. All methods accurately estimate the noise variance in the large-scale limit, i.e.,
$$ \hat{\sigma}^{2\,\mathrm{EVB}} = \hat{\sigma}^{2\,\mathrm{OL\text{-}LSL}} = \hat{\sigma}^{2\,\mathrm{local\text{-}EVB}} = \sigma^{*2}. $$
Taking this into account, we indicate the recovery conditions in Figure 8.4 by arrows at $y = \underline{x}$ for EVB, and $y = \underline{x}^{\mathrm{local}}$ $(= \overline{y})$ for OL and local-EVB, respectively. Figure 8.4 implies that, in this particular case, EVB learning discards the third spike coming from the third true signal $\nu_3^* = 0.5$, while the OL method and the local-EVB estimator successfully capture it as a signal. When the matrix size is finite, the conservative nature of EVB learning is not always bad, since it offers almost zero false positive rate, which makes Theorem 8.20 approximately hold for finite cases, as seen in Figures 8.5 and 8.6. However, the fact that not (global) EVB learning but the local-EVB estimator is optimal in the large-scale limit might come from inaccurate approximation to the Bayes posterior by the VB posterior. Having this in mind, we discuss the difference between VB learning and full Bayesian learning in the remainder of this section.
[Figure 8.7: four panels, (a) V = 1.0, (b) V = 1.05, (c) V = 1.5, (d) V = 2.1.]
Figure 8.7 The VB free energy contribution (6.55) from the (first) component and its counterpart (8.96) of Bayesian learning for L = M = H = 1 and $\sigma^{2} = 1$. Markers indicate the local minima.
Figure 8.7 shows the VB free energy contribution (6.55) from the (first) component as a function of $c_a c_b$, and its counterpart of Bayesian learning:
$$ 2F_1^{\mathrm{Bayes}} = -2 \log \langle p(V|A, B) \rangle_{p(A)p(B)} - \log(2\pi\sigma^{2}) + \frac{V^{2}}{\sigma^{2}}, \tag{8.96} $$
which was numerically computed. We see that the minimizer (shown as a diamond) of the Bayes free energy is at $c_a c_b \to +0$ until V exceeds 1. The difference in behavior between EVB learning and the local-EVB estimator appears in the nonempty range of the observed value V where the positive local solution exists but gives positive free energy. Figure 8.7(d) shows this case, where a bump exists between two local minima (indicated by crosses). On the other hand, such multimodality is not observed in empirical full Bayesian learning (see the dashed curves in Figures 8.7(a) through 8.7(d)). We can say that this multimodality in EVB learning with a bump between two local minima is induced by the independence constraint for VB learning. We further guess that it is this bump that pushes the EVB threshold from the optimal point (at the local-EVB threshold) to a larger value. Further investigation is necessary to fully understand this phenomenon.
9 Global Solver for Matrix Factorization
The analytic-form solutions, derived in Chapter 6, for VB learning and EVB learning in fully observed MF can naturally be used to develop efficient and reliable VB solvers. Some properties, shown in Chapter 8, can also be incorporated when the noise variance is unknown and to be estimated. In this chapter, we introduce global solvers for VB learning and EVB learning (Nakajima et al., 2013a, 2015), and how to extend them to more general cases with missing entries and nonconjugate likelihoods (Seeger and Bouchard, 2012).
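9.1 Global VB Solver for Fully Observed MF
We consider the MF model, introduced in Section 3.1:
$$ p(V|A, B) \propto \exp\left( -\frac{1}{2\sigma^{2}} \left\| V - BA^{\top} \right\|_{\mathrm{Fro}}^{2} \right), \tag{9.1} $$
$$ p(A) \propto \exp\left( -\frac{1}{2} \mathrm{tr}\left( A C_A^{-1} A^{\top} \right) \right), \tag{9.2} $$
$$ p(B) \propto \exp\left( -\frac{1}{2} \mathrm{tr}\left( B C_B^{-1} B^{\top} \right) \right), \tag{9.3} $$
where $V \in \mathbb{R}^{L \times M}$ is an observed matrix;
$$ A = (a_1, \ldots, a_H) = (\tilde{a}_1, \ldots, \tilde{a}_M)^{\top} \in \mathbb{R}^{M \times H}, \qquad B = (b_1, \ldots, b_H) = (\tilde{b}_1, \ldots, \tilde{b}_L)^{\top} \in \mathbb{R}^{L \times H}, $$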
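are parameter matrices; and $C_A$, $C_B$, and $\sigma^{2}$ are hyperparameters. The prior covariance hyperparameters are restricted to be diagonal:
$$ C_A = \mathrm{Diag}(c_{a_1}^{2}, \ldots, c_{a_H}^{2}), \qquad C_B = \mathrm{Diag}(c_{b_1}^{2}, \ldots, c_{b_H}^{2}). $$
Our VB solver gives the global solution to the following minimization problem,
$$ \hat{r} = \mathop{\mathrm{argmin}}_{r} F(r) \quad \text{s.t.} \quad r(A, B) = r_A(A)\, r_B(B), \tag{9.4} $$
of the free energy
$$ F(r) = \left\langle \log \frac{r_A(A)\, r_B(B)}{p(V|A, B)\, p(A)\, p(B)} \right\rangle_{r_A(A) r_B(B)}. \tag{9.5} $$
Assume that L ≤ M without loss of generality, and let
$$ V = \sum_{h=1}^{L} \gamma_h \omega_{b_h} \omega_{a_h}^{\top} \tag{9.6} $$
be the singular value decomposition (SVD) of the observed matrix $V \in \mathbb{R}^{L \times M}$. According to Theorem 6.7, the VB solution is given by
$$ \hat{U}^{\mathrm{VB}} = \hat{B}\hat{A}^{\top} = \sum_{h=1}^{H} \hat{\gamma}_h^{\mathrm{VB}} \omega_{b_h} \omega_{a_h}^{\top}, \quad \text{where} \quad \hat{\gamma}_h^{\mathrm{VB}} = \begin{cases} \breve{\gamma}_h^{\mathrm{VB}} & \text{if } \gamma_h \ge \underline{\gamma}_h^{\mathrm{VB}}, \\ 0 & \text{otherwise}, \end{cases} \tag{9.7} $$
for
$$ \underline{\gamma}_h^{\mathrm{VB}} = \sigma \sqrt{ \frac{L + M}{2} + \frac{\sigma^{2}}{2 c_{a_h}^{2} c_{b_h}^{2}} + \sqrt{ \left( \frac{L + M}{2} + \frac{\sigma^{2}}{2 c_{a_h}^{2} c_{b_h}^{2}} \right)^{2} - LM } }, \tag{9.8} $$
$$ \breve{\gamma}_h^{\mathrm{VB}} = \gamma_h \left( 1 - \frac{\sigma^{2}}{2\gamma_h^{2}} \left( M + L + \sqrt{ (M - L)^{2} + \frac{4\gamma_h^{2}}{c_{a_h}^{2} c_{b_h}^{2}} } \right) \right). \tag{9.9} $$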
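Corollary 6.8 completely specifies the VB posterior, which is written as
$$ r(A, B) = \prod_{h=1}^{H} \mathrm{Gauss}_M\left( a_h; \hat{a}_h \omega_{a_h}, \hat{\sigma}_{a_h}^{2} I_M \right) \prod_{h=1}^{H} \mathrm{Gauss}_L\left( b_h; \hat{b}_h \omega_{b_h}, \hat{\sigma}_{b_h}^{2} I_L \right) \tag{9.10} $$
with the following variational parameters: if $\gamma_h > \underline{\gamma}_h^{\mathrm{VB}}$,
$$ \hat{a}_h = \pm\sqrt{ \breve{\gamma}_h^{\mathrm{VB}} \hat{\delta}_h^{\mathrm{VB}} }, \quad \hat{b}_h = \pm\sqrt{ \frac{\breve{\gamma}_h^{\mathrm{VB}}}{\hat{\delta}_h^{\mathrm{VB}}} }, \quad \hat{\sigma}_{a_h}^{2} = \frac{\sigma^{2} \hat{\delta}_h^{\mathrm{VB}}}{\gamma_h}, \quad \hat{\sigma}_{b_h}^{2} = \frac{\sigma^{2}}{\gamma_h \hat{\delta}_h^{\mathrm{VB}}}, \tag{9.11} $$
where
$$ \hat{\delta}_h^{\mathrm{VB}} \left( \equiv \frac{\hat{a}_h}{\hat{b}_h} \right) = \frac{c_{a_h}}{\sigma^{2}} \left( \gamma_h - \breve{\gamma}_h^{\mathrm{VB}} - \frac{L\sigma^{2}}{\gamma_h} \right), \tag{9.12} $$
and otherwise,
$$ \hat{a}_h = 0, \quad \hat{b}_h = 0, \quad \hat{\sigma}_{a_h}^{2} = c_{a_h}^{2} \left( 1 - \frac{L \hat{\zeta}_h^{\mathrm{VB}}}{\sigma^{2}} \right), \quad \hat{\sigma}_{b_h}^{2} = c_{b_h}^{2} \left( 1 - \frac{M \hat{\zeta}_h^{\mathrm{VB}}}{\sigma^{2}} \right), \tag{9.13} $$
where
$$ \hat{\zeta}_h^{\mathrm{VB}} \left( \equiv \hat{\sigma}_{a_h}^{2} \hat{\sigma}_{b_h}^{2} \right) = \frac{\sigma^{2}}{2LM} \left( L + M + \frac{\sigma^{2}}{c_{a_h}^{2} c_{b_h}^{2}} - \sqrt{ \left( L + M + \frac{\sigma^{2}}{c_{a_h}^{2} c_{b_h}^{2}} \right)^{2} - 4LM } \right). \tag{9.14} $$
The free energy can be written as
$$ 2F = LM \log(2\pi\sigma^{2}) + \frac{\sum_{h=1}^{L} \gamma_h^{2}}{\sigma^{2}} + \sum_{h=1}^{H} 2F_h, \tag{9.15} $$
where
$$ 2F_h = M \log \frac{c_{a_h}^{2}}{\hat{\sigma}_{a_h}^{2}} + L \log \frac{c_{b_h}^{2}}{\hat{\sigma}_{b_h}^{2}} + \frac{\hat{a}_h^{2} + M\hat{\sigma}_{a_h}^{2}}{c_{a_h}^{2}} + \frac{\hat{b}_h^{2} + L\hat{\sigma}_{b_h}^{2}}{c_{b_h}^{2}} - (L + M) + \frac{ -2\hat{a}_h \hat{b}_h \gamma_h + \left( \hat{a}_h^{2} + M\hat{\sigma}_{a_h}^{2} \right)\left( \hat{b}_h^{2} + L\hat{\sigma}_{b_h}^{2} \right) }{\sigma^{2}}. \tag{9.16} $$
Based on these results, we can straightforwardly construct a global solver for VB learning, which is given in Algorithm 15.
Algorithm 15 Global VB solver for fully observed matrix factorization.
1: Transpose $V \to V^{\top}$ if L > M, and set H (≤ L) to a sufficiently large value.
2: Compute the SVD (9.6) of V.
3: Apply Eqs. (9.7) through (9.9) to get the VB estimator.
4: If necessary, compute the variational parameters by using Eqs. (9.11) through (9.14), which specify the VB posterior (9.10). We can also evaluate the free energy by using Eqs. (9.15) and (9.16).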
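Steps 1 through 3 of Algorithm 15 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the same product $c_{a_h} c_{b_h}$ for every component (passed as the hypothetical argument `ca_cb`), a known noise variance `sigma2`, and it omits Step 4 (the variational parameters and the free energy).

```python
import numpy as np

def global_vb_mf(V, ca_cb, sigma2, H=None):
    """Analytic VB estimator via Eqs. (9.7)-(9.9); returns (U_hat, gamma_vb, threshold)."""
    V = np.asarray(V, dtype=float)
    transposed = V.shape[0] > V.shape[1]
    if transposed:                                   # Step 1: ensure L <= M
        V = V.T
    L, M = V.shape
    H = L if H is None else min(H, L)
    # Step 2: SVD, keeping the H leading components.
    Ub, gamma, Vat = np.linalg.svd(V, full_matrices=False)
    Ub, gamma, Vat = Ub[:, :H], gamma[:H], Vat[:H]
    # Step 3a: truncation threshold, Eq. (9.8).
    c = (L + M) / 2.0 + sigma2 / (2.0 * ca_cb ** 2)
    threshold = np.sqrt(sigma2 * (c + np.sqrt(c ** 2 - L * M)))
    # Step 3b: shrinkage of the components above the threshold, Eq. (9.9).
    keep = gamma >= threshold
    g = gamma[keep]
    shrunk = g * (1.0 - sigma2 / (2.0 * g ** 2)
                  * (M + L + np.sqrt((M - L) ** 2 + 4.0 * g ** 2 / ca_cb ** 2)))
    gamma_vb = np.zeros_like(gamma)
    gamma_vb[keep] = shrunk
    U_hat = (Ub * gamma_vb) @ Vat
    return (U_hat.T if transposed else U_hat), gamma_vb, threshold

# Tiny worked example: one singular value of 10 in a 2 x 2 matrix.
U_hat, gamma_vb, threshold = global_vb_mf(np.array([[10.0, 0.0], [0.0, 0.0]]),
                                          ca_cb=1.0, sigma2=1.0)
```

With L = M = 2, σ² = 1, and $c_a c_b = 1$, the threshold (9.8) evaluates to 2 and the singular value 10 is shrunk to 8.8 by (9.9), so the example can be followed by hand.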
9.2 Global EVB Solver for Fully Observed MF
EVB learning, where the hyperparameters $C_A$, $C_B$, and $\sigma^{2}$ are also estimated from observation, solves the following minimization problem,
$$ \hat{r} = \mathop{\mathrm{argmin}}_{r,\, C_A,\, C_B,\, \sigma^{2}} F \quad \text{s.t.} \quad r(A, B) = r_A(A)\, r_B(B), \tag{9.17} $$
of the free energy (9.5).
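According to Theorem 6.13, given the noise variance $\sigma^{2}$, the EVB solution can be written as
$$ \hat{U}^{\mathrm{EVB}} = \sum_{h=1}^{H} \hat{\gamma}_h^{\mathrm{EVB}} \omega_{b_h} \omega_{a_h}^{\top}, \quad \text{where} \quad \hat{\gamma}_h^{\mathrm{EVB}} = \begin{cases} \breve{\gamma}_h^{\mathrm{EVB}} & \text{if } \gamma_h \ge \underline{\gamma}^{\mathrm{EVB}}, \\ 0 & \text{otherwise}, \end{cases} \tag{9.18} $$
for
$$ \underline{\gamma}^{\mathrm{EVB}} = \sigma \sqrt{ M \left( 1 + \underline{\tau} \right) \left( 1 + \frac{\alpha}{\underline{\tau}} \right) }, $$
$$ \breve{\gamma}_h^{\mathrm{EVB}} = \frac{\gamma_h}{2} \left( 1 - \frac{(M + L)\sigma^{2}}{\gamma_h^{2}} + \sqrt{ \left( 1 - \frac{(M + L)\sigma^{2}}{\gamma_h^{2}} \right)^{2} - \frac{4LM\sigma^{4}}{\gamma_h^{4}} } \right). $$
Here
$$ \alpha = \frac{L}{M} \qquad (0 < \alpha \le 1), $$
is the “squaredness” of the observed matrix V, and $\underline{\tau} = \underline{\tau}(\alpha)$ is the unique zero-cross point of the following function:
$$ \Xi(\tau; \alpha) = \Phi(\tau) + \Phi\left( \frac{\tau}{\alpha} \right), \quad \text{where} \quad \Phi(z) = \frac{\log(z + 1)}{z} - \frac{1}{2}. \tag{9.19} $$
Summarizing Lemmas 6.14, 6.16, and 6.19, the EVB posterior is completely specified by Eq. (9.10) with the variational parameters given as follows: If $\gamma_h \ge \underline{\gamma}^{\mathrm{EVB}}$,
$$ \hat{a}_h = \pm\sqrt{ \breve{\gamma}_h^{\mathrm{EVB}} \hat{\delta}_h^{\mathrm{EVB}} }, \quad \hat{b}_h = \pm\sqrt{ \frac{\breve{\gamma}_h^{\mathrm{EVB}}}{\hat{\delta}_h^{\mathrm{EVB}}} }, \tag{9.20} $$
$$ \hat{\sigma}_{a_h}^{2} = \frac{\sigma^{2} \hat{\delta}_h^{\mathrm{EVB}}}{\gamma_h}, \quad \hat{\sigma}_{b_h}^{2} = \frac{\sigma^{2}}{\gamma_h \hat{\delta}_h^{\mathrm{EVB}}}, \quad \hat{c}_{a_h} \hat{c}_{b_h} = \sqrt{ \frac{\gamma_h \breve{\gamma}_h^{\mathrm{EVB}}}{LM} }, \tag{9.21} $$
$$ \text{where} \quad \hat{\delta}_h^{\mathrm{EVB}} = \sqrt{ \frac{M \breve{\gamma}_h^{\mathrm{EVB}}}{L \gamma_h} } \left( 1 + \frac{L\sigma^{2}}{\gamma_h \breve{\gamma}_h^{\mathrm{EVB}}} \right), \tag{9.22} $$
and otherwise
$$ \hat{a}_h = 0, \quad \hat{b}_h = 0, \quad \hat{\sigma}_{a_h}^{2} = \sqrt{\hat{\zeta}^{\mathrm{EVB}}}, \quad \hat{\sigma}_{b_h}^{2} = \sqrt{\hat{\zeta}^{\mathrm{EVB}}}, \quad \hat{c}_{a_h} \hat{c}_{b_h} = \sqrt{\hat{\zeta}^{\mathrm{EVB}}}, \tag{9.23} $$
$$ \text{where} \quad \hat{\zeta}^{\mathrm{EVB}} \to +0. \tag{9.24} $$
To use the preceding result, we need to prepare a table of $\underline{\tau}$ by computing the zero-cross point of Eq. (9.19) as a function of α. A simple approximation $\underline{\tau} \approx \underline{z}\sqrt{\alpha} \approx 2.5129\sqrt{\alpha}$ is a reasonable alternative (see Figure 6.4).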
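Instead of a precomputed table, the zero-cross point of Eq. (9.19) can be found on demand. The sketch below uses plain bisection, relying on $\Xi(\tau; \alpha)$ being positive for small τ and negative for large τ; the bracketing interval, the tolerance, and the function names are choices of this sketch, not of the text.

```python
import math

def Phi(z):
    """Phi(z) = log(z + 1) / z - 1/2, Eq. (9.19)."""
    return math.log1p(z) / z - 0.5

def Xi(tau, alpha):
    """Xi(tau; alpha) = Phi(tau) + Phi(tau / alpha), Eq. (9.19)."""
    return Phi(tau) + Phi(tau / alpha)

def tau_underline(alpha, tol=1e-12):
    """Zero-cross point of Xi(.; alpha) for 0 < alpha <= 1, by bisection."""
    lo, hi = 1e-9, 1e3          # Xi(lo) > 0 > Xi(hi) on this bracket
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if Xi(mid, alpha) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

tau_square = tau_underline(1.0)    # square matrix, alpha = 1
tau_half = tau_underline(0.5)      # alpha = 0.5
approx_half = 2.5129 * math.sqrt(0.5)
```

For α = 1 the root is the constant 2.5129 quoted above; for α = 0.5 the exact root and the approximation $2.5129\sqrt{\alpha}$ differ only slightly.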
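For noise variance estimation, we can use Theorems 8.1 and 8.2, derived in Chapter 8. Specifically, after performing the SVD (9.6), we first estimate the noise variance by solving the following problem:
$$ \hat{\sigma}^{2\,\mathrm{EVB}} = \mathop{\mathrm{argmin}}_{\sigma^{2}} \Omega(\sigma^{-2}), \tag{9.25} $$
$$ \text{s.t.} \quad \max\left( \sigma^{2}_{\overline{H}+1},\ \frac{\sum_{h=\overline{H}+1}^{L} \gamma_h^{2}}{M (L - \overline{H})} \right) \le \sigma^{2} \le \frac{1}{LM} \sum_{h=1}^{L} \gamma_h^{2}, \tag{9.26} $$
where
$$ \Omega(\sigma^{-2}) = \frac{1}{L} \left( \sum_{h=1}^{H} \psi\left( \frac{\gamma_h^{2}}{M\sigma^{2}} \right) + \sum_{h=H+1}^{L} \psi_0\left( \frac{\gamma_h^{2}}{M\sigma^{2}} \right) \right), \tag{9.27} $$
$$ \psi(x) = \psi_0(x) + \theta(x > \underline{x})\, \psi_1(x), $$
$$ \psi_0(x) = x - \log x, $$
$$ \psi_1(x) = \log\left( \tau(x; \alpha) + 1 \right) + \alpha \log\left( \frac{\tau(x; \alpha)}{\alpha} + 1 \right) - \tau(x; \alpha), $$
$$ \underline{x} = \left( 1 + \underline{\tau} \right)\left( 1 + \frac{\alpha}{\underline{\tau}} \right), $$
$$ \tau(x; \alpha) = \frac{1}{2} \left( x - (1 + \alpha) + \sqrt{ (x - (1 + \alpha))^{2} - 4\alpha } \right), $$
$$ \sigma_h^{2} = \begin{cases} \infty & \text{for } h = 0, \\ \dfrac{\gamma_h^{2}}{M\underline{x}} & \text{for } h = 1, \ldots, L, \\ 0 & \text{for } h = L + 1, \end{cases} $$
$$ \overline{H} = \min\left( \left\lceil \frac{L}{1 + \alpha} \right\rceil - 1,\ H \right). $$
Problem (9.25) is simply a one-dimensional search for the minimizer of the function $\Omega(\sigma^{-2})$, which is typically smooth. Note also that, if the matrix size is large enough, Corollary 8.21 states that any local minimizer is accurate enough to estimate the correct rank. Given the estimated noise variance $\sigma^{2} = \hat{\sigma}^{2\,\mathrm{EVB}}$, Eq. (9.18) gives the EVB solution. Algorithm 16 summarizes the procedure explained in the preceding discussion. This algorithm gives the global solution, provided that the global solution to the one-dimensional search problem (9.25) is attained. If the noise variance $\sigma^{2}$ is known, we should simply skip Step 4.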
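The one-dimensional search (9.25) can be sketched as a grid search over the interval (9.26). This is an illustration under stated assumptions rather than the authors' solver: the grid search (with a hypothetical resolution `n_grid`) stands in for whatever one-dimensional optimizer is actually used, $\underline{\tau}$ is obtained by bisection on Eq. (9.19), all singular values are assumed strictly positive, and all function names are hypothetical.

```python
import numpy as np

def tau_underline(alpha):
    """Zero-cross point of Xi(tau; alpha), Eq. (9.19), found by bisection."""
    def xi(t):
        return np.log1p(t) / t + np.log1p(t / alpha) / (t / alpha) - 1.0
    lo, hi = 1e-9, 1e3                # xi(lo) > 0 > xi(hi) for 0 < alpha <= 1
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if xi(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def tau_of_x(x, alpha):
    """tau(x; alpha); clipped at zero where x lies below the threshold region."""
    d = x - (1.0 + alpha)
    root = np.sqrt(np.maximum(d * d - 4.0 * alpha, 0.0))
    return np.maximum(0.5 * (d + root), 0.0)

def psi1(x, alpha):
    t = tau_of_x(x, alpha)
    return np.log(t + 1.0) + alpha * np.log(t / alpha + 1.0) - t

def psi(x, alpha, x_under):
    """psi(x) = psi0(x) + theta(x > x_under) psi1(x), with psi0(x) = x - log x."""
    return x - np.log(x) + np.where(x > x_under, psi1(x, alpha), 0.0)

def estimate_noise_variance(gamma, M, H, n_grid=2000):
    """Grid search for problem (9.25) under the constraint (9.26)."""
    gamma = np.sort(np.asarray(gamma, dtype=float))[::-1]
    L = gamma.size
    alpha = L / M
    t_u = tau_underline(alpha)
    x_under = (1.0 + t_u) * (1.0 + alpha / t_u)
    g2 = gamma ** 2
    H_bar = min(int(np.ceil(L / (1.0 + alpha))) - 1, H)
    lower = max(g2[H_bar] / (M * x_under), g2[H_bar:].sum() / (M * (L - H_bar)))
    upper = g2.sum() / (L * M)
    grid = np.exp(np.linspace(np.log(lower), np.log(upper), n_grid))

    def omega(s2):                    # objective (9.27)
        x = g2 / (M * s2)
        return (psi(x[:H], alpha, x_under).sum()
                + (x[H:] - np.log(x[H:])).sum()) / L

    values = np.array([omega(s2) for s2 in grid])
    return grid[values.argmin()], lower, upper
```

As a usage example, one could pass the singular values of an observed matrix, e.g. `estimate_noise_variance(np.linalg.svd(V, compute_uv=False), M=V.shape[1], H=V.shape[0])` for a matrix with L ≤ M. A useful consistency check on the pieces is that $\tau(\underline{x}; \alpha) = \underline{\tau}$ and $\psi_1(\underline{x}) = 0$, i.e., the extra term switches on continuously at the threshold.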
Algorithm 16 Global EVB solver for fully observed matrix factorization.
1: Transpose $V \to V^\top$ if $L > M$, and set $H$ ($\le L$) to a sufficiently large value.
2: Refer to the table of $\underline{\tau}(\alpha)$ at $\alpha = L/M$ (or use a simple approximation $\underline{\tau} \approx 2.5129\sqrt{\alpha}$).
3: Compute the SVD (9.6) of $V$.
4: Solve the one-dimensional search problem (9.25) to get $\hat{\sigma}^{2\,\mathrm{EVB}}$.
5: Apply Eq. (9.18) to get the EVB estimator $\{\hat{\gamma}_h^{\mathrm{EVB}}\}_{h=1}^{H}$ for $\sigma^2 = \hat{\sigma}^{2\,\mathrm{EVB}}$.
6: If necessary, compute the variational parameters and the hyperparameters by using Eqs. (9.20) through (9.24), which specify the EVB posterior (9.10). We can also evaluate the free energy by using Eqs. (9.15) and (9.16), noting that $F_h \to +0$ for $h$ such that $\gamma_h < \underline{\gamma}^{\mathrm{EVB}}$.

Algorithm 17 Iterative EVB solver for fully observed matrix factorization.
1: Transpose $V \to V^\top$ if $L > M$, and set $H$ ($\le L$) to a sufficiently large value.
2: Refer to the table of $\underline{\tau}(\alpha)$ at $\alpha = L/M$ (or use a simple approximation $\underline{\tau} \approx 2.5129\sqrt{\alpha}$).
3: Compute the SVD (9.6) of $V$.
4: Initialize the noise variance $\hat{\sigma}^{2\,\mathrm{EVB}}$ to the lower bound in Eq. (9.26).
5: Apply Eq. (9.18) to update the EVB estimator $\{\hat{\gamma}_h^{\mathrm{EVB}}\}_{h=1}^{H}$.
6: Apply Eq. (9.28) to update the noise variance estimator $\hat{\sigma}^{2\,\mathrm{EVB}}$.
7: Compute the variational parameters and the hyperparameters by using Eqs. (9.20) through (9.24).
8: Evaluate the free energy (9.15), noting that $F_h \to +0$ for $h$ such that $\gamma_h < \underline{\gamma}^{\mathrm{EVB}}$.
9: Iterate Steps 5 through 8 until convergence (until the energy decrease becomes smaller than a threshold).
Another implementation is to iterate Eq. (9.18) and

$$\hat{\sigma}^{2\,\mathrm{EVB}} = \frac{1}{LM}\left(\sum_{l=1}^{L} \gamma_l^2 - \sum_{h=1}^{H} \gamma_h \hat{\gamma}_h^{\mathrm{EVB}}\right) \tag{9.28}$$

in turn. Note that Eq. (9.28) was derived in Corollary 8.3 and can be used as an update rule for the noise variance estimator, given the current EVB estimators $\{\hat{\gamma}_h^{\mathrm{EVB}}\}_{h=1}^{H}$. Although it is not guaranteed, this iterative algorithm (Algorithm 17) tends to converge to the global solution if we initialize the noise variance $\hat{\sigma}^{2\,\mathrm{EVB}}$ to be sufficiently small (Nakajima et al., 2015). We recommend initializing it to the lower bound given in Eq. (9.26).
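The alternation between the shrinkage step and the noise variance update (9.28) can be sketched as below. Two assumptions are made and should be checked against Eq. (9.18), which is not reproduced in this section: the EVB estimator is taken to be $\breve{\gamma}_h^{\mathrm{EVB}} = M\sigma^2\,\tau(x_h; \alpha)/\gamma_h$ with $x_h = \gamma_h^2/(M\sigma^2)$, and a component is pruned when $x_h < \underline{x}$. Convergence is monitored through the change in $\sigma^2$ rather than the free energy, purely to keep the sketch short; all names are hypothetical.

```python
import math

def tau_underline(alpha, lo=1e-9, hi=1e6, iters=200):
    """Zero-cross point of Phi(t) + Phi(t / alpha), Phi(z) = log(z+1)/z - 1/2."""
    phi = lambda z: math.log1p(z) / z - 0.5
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if phi(mid) + phi(mid / alpha) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def evb_shrinkage(gamma, sigma2, L, M, x_u):
    """EVB estimate for one singular value (assumed form of Eq. (9.18))."""
    alpha = L / M
    x = gamma * gamma / (M * sigma2)
    if x < x_u:                      # below the EVB threshold: pruned
        return 0.0
    d = x - (1.0 + alpha)
    tau = 0.5 * (d + math.sqrt(d * d - 4.0 * alpha))
    return M * sigma2 * tau / gamma

def iterate_evb(gammas, M, H, sigma2_init, n_iter=500, tol=1e-12):
    """Alternate the shrinkage step and the update (9.28)."""
    L = len(gammas)
    alpha = L / M
    t_u = tau_underline(alpha)
    x_u = (1.0 + t_u) * (1.0 + alpha / t_u)
    total = sum(g * g for g in gammas)
    sigma2 = sigma2_init
    for _ in range(n_iter):
        est = [evb_shrinkage(g, sigma2, L, M, x_u) for g in gammas[:H]]
        new_sigma2 = (total - sum(g * e for g, e in zip(gammas, est))) / (L * M)
        converged = abs(new_sigma2 - sigma2) < tol
        sigma2 = new_sigma2
        if converged:
            break
    est = [evb_shrinkage(g, sigma2, L, M, x_u) for g in gammas[:H]]
    return est, sigma2
```

Starting from a small `sigma2_init`, such as the lower bound of Eq. (9.26), follows the recommendation in the text.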
Algorithm 18 Local-EVB solver for fully observed matrix factorization.
1: Transpose $V \to V^\top$ if $L > M$, and set $H$ ($\le L$) to a sufficiently large value.
2: Refer to the table of $\underline{\tau}(\alpha)$ at $\alpha = L/M$ (or use a simple approximation $\underline{\tau} \approx 2.5129\sqrt{\alpha}$).
3: Compute the SVD (9.6) of $V$.
4: Initialize the noise variance $\hat{\sigma}^{2\,\mathrm{local\text{-}EVB}}$ to the lower bound in Eq. (9.26).
5: Apply Eq. (9.29) to update the local-EVB estimator $\{\hat{\gamma}_h^{\mathrm{local\text{-}EVB}}\}_{h=1}^{H}$.
6: Apply Eq. (9.30) to update the noise variance estimator $\hat{\sigma}^{2\,\mathrm{local\text{-}EVB}}$.
7: Compute the variational parameters and the hyperparameters by using Eqs. (9.20) through (9.24).
8: Evaluate the free energy (9.15), noting that $F_h \to +0$ for $h$ such that $\gamma_h < \underline{\gamma}^{\mathrm{local\text{-}EVB}}$.
9: Iterate Steps 5 through 8 until convergence (until the energy decrease becomes smaller than a threshold).
Finally, we introduce an iterative solver, in Algorithm 18, for the local-EVB estimator (6.131), which iterates the following updates:

$$\hat{\gamma}_h^{\mathrm{local\text{-}EVB}} = \begin{cases} \breve{\gamma}_h^{\mathrm{EVB}} & \text{if } \gamma_h \ge \underline{\gamma}^{\mathrm{local\text{-}EVB}}, \\ 0 & \text{otherwise,} \end{cases} \tag{9.29}$$

$$\hat{\sigma}^{2\,\mathrm{local\text{-}EVB}} = \frac{1}{LM}\left(\sum_{l=1}^{L} \gamma_l^2 - \sum_{h=1}^{H} \gamma_h \hat{\gamma}_h^{\mathrm{local\text{-}EVB}}\right), \tag{9.30}$$

where

$$\underline{\gamma}^{\mathrm{local\text{-}EVB}} \equiv \left(\sqrt{L} + \sqrt{M}\right)\sigma \tag{9.31}$$

is the local-EVB threshold, defined by Eq. (6.127). If we initialize the noise variance $\hat{\sigma}^{2\,\mathrm{local\text{-}EVB}}$ to be sufficiently small, this algorithm tends to retain the positive local-EVB solution for each $h$ if it exists, and therefore does not necessarily converge to the global EVB solution. The interesting relation between the local-EVB estimator and the overlap (OL) method (Hoyle, 2008), an alternative dimensionality selection method based on the Laplace approximation, was discussed in Section 8.7.
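To see why the local-EVB solver can retain components that the global EVB solver discards, it helps to compare the two thresholds numerically. The sketch below assumes that the global EVB threshold is $\underline{\gamma}^{\mathrm{EVB}} = \sigma\sqrt{M\underline{x}}$ with $\underline{x} = (1 + \underline{\tau})(1 + \alpha/\underline{\tau})$, which is consistent with the definition of $\sigma_h^2$ used in Eq. (9.26) but is stated elsewhere in the chapter; the function names are hypothetical.

```python
import math

def tau_underline(alpha, lo=1e-9, hi=1e6, iters=200):
    """Zero-cross point of Phi(t) + Phi(t / alpha), Phi(z) = log(z+1)/z - 1/2."""
    phi = lambda z: math.log1p(z) / z - 0.5
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if phi(mid) + phi(mid / alpha) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def evb_threshold(L, M, sigma2):
    """Assumed global EVB threshold: sigma * sqrt(M * x_underline)."""
    alpha = L / M
    t_u = tau_underline(alpha)
    x_u = (1.0 + t_u) * (1.0 + alpha / t_u)
    return math.sqrt(sigma2 * M * x_u)

def local_evb_threshold(L, M, sigma2):
    """Local-EVB threshold of Eq. (9.31): (sqrt(L) + sqrt(M)) * sigma."""
    return (math.sqrt(L) + math.sqrt(M)) * math.sqrt(sigma2)
```

Because $(1 + \tau)(1 + \alpha/\tau) \ge (1 + \sqrt{\alpha})^2$ for any $\tau > 0$, the local-EVB threshold never exceeds the global one under this assumption, so singular values lying between the two thresholds are kept by Algorithm 18 but pruned by Algorithm 16.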
9.3 Empirical Comparison with the Standard VB Algorithm

Here we see how efficient the global solver (Algorithm 16) is in comparison with the standard VB algorithm (Algorithm 1 in Section 3.1) on artificial and benchmark data.
9.3.1 Experiment on Artificial Data

We first created an artificial data set (Artificial1) with the data matrix size $L = 100$ and $M = 300$, and the true rank $H^* = 20$. We randomly drew true matrices $A^* \in \mathbb{R}^{M \times H^*}$ and $B^* \in \mathbb{R}^{L \times H^*}$ so that each entry of $A^*$ and $B^*$ follows $\mathrm{Gauss}_1(0, 1)$, where $\mathrm{Gauss}_1(\mu, \sigma^2)$ denotes the one-dimensional Gaussian distribution with mean $\mu$ and variance $\sigma^2$. An observed matrix $V$ was created by adding noise subject to $\mathrm{Gauss}_1(0, 1)$ to each entry of $B^* A^{*\top}$. We evaluated the performance under the complete empirical Bayesian scenario, where all variational parameters and hyperparameters are estimated from observation. We used the full-rank model (i.e., $H = \min(L, M)$), expecting that irrelevant $H - H^*$ components will be automatically trimmed out by the automatic relevance determination (ARD) effect (see Chapters 7 and 8).

We compare the global solver (Algorithm 16) and the standard VB algorithm (Algorithm 1 in Section 3.1), and show the free energy, the computation time, and the estimated rank over iterations in Figure 9.1. For the standard VB algorithm, initial values were set in the following way: $\hat{A}$ and $\hat{B}$ are randomly created so that each entry follows $\mathrm{Gauss}_1(0, 1)$. Other variables are set to $\hat{\Sigma}_A = \hat{\Sigma}_B = C_A = C_B = I_H$ and $\sigma^2 = 1$. Note that we rescale $V$ so that $\|V\|_{\mathrm{Fro}}^2/(LM) = 1$ before starting iterations. We ran the standard algorithm 10 times, starting from different initial points, and each trial is plotted by a solid curve labeled as “Standard(iniRan)” in Figure 9.1. The global solver has no iteration loop, and therefore the corresponding dashed line labeled as “Global” is constant over iterations. We see that the global solver finds the true rank $\hat{H} = H^* = 20$ immediately (about 0.1 sec on average over 10 trials), while the standard iterative algorithm does not converge in 60 sec.

Figure 9.1 Experimental results on the Artificial1 data, where the data matrix size is $L = 100$ and $M = 300$, and the true rank is $H^* = 20$. Panels: (a) free energy, (b) computation time (sec), and (c) estimated rank, each plotted against the iteration number for Global, Standard(iniML), Standard(iniMLSS), and Standard(iniRan).

Figure 9.2 shows experimental results on another artificial data set (Artificial2) where $L = 70$, $M = 300$, and $H^* = 40$. In this case, all the 10 trials of the standard algorithm are trapped at local minima. We empirically observed that the local minimum problem tends to be more critical when $H^*$ is large (close to $H$).

Figure 9.2 Experimental results on the Artificial2 data set ($L = 70$, $M = 300$, and $H^* = 40$). Panels: (a) free energy, (b) computation time (sec), and (c) estimated rank, each plotted against the iteration number for Global, Standard(iniML), Standard(iniMLSS), and Standard(iniRan).

We also evaluated the standard algorithm with different initialization schemes. The curve labeled as “Standard(iniML)” indicates the standard algorithm starting from the maximum likelihood (ML) solution: $(\hat{a}_h, \hat{b}_h) = (\sqrt{\gamma_h}\,\omega_{a_h}, \sqrt{\gamma_h}\,\omega_{b_h})$. The initial values for other variables are the same as the random initialization. Figures 9.1 and 9.2 show that the ML initialization generally makes convergence faster than the random initialization, but suffers from the local minimum problem more severely: it tends to converge to a worse local minimum.
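The construction of the Artificial1 data described in this subsection can be reproduced along the following lines. This is a sketch assuming NumPy is available; the function name `make_artificial` and the seed handling are illustrative, and the original experiments may have used a different random number generator.

```python
import numpy as np

def make_artificial(L=100, M=300, true_rank=20, seed=0):
    """Draw A*, B* with standard Gaussian entries, form V = B* A*^T + noise,
    and rescale V so that ||V||_Fro^2 / (L M) = 1."""
    rng = np.random.default_rng(seed)
    A_true = rng.standard_normal((M, true_rank))
    B_true = rng.standard_normal((L, true_rank))
    V = B_true @ A_true.T + rng.standard_normal((L, M))   # unit-variance noise
    V /= np.sqrt(np.sum(V ** 2) / (L * M))                # rescaling step
    return V
```

Calling `make_artificial(L=70, M=300, true_rank=40)` gives a matrix with the dimensions of Artificial2. The singular values returned by `np.linalg.svd(V, compute_uv=False)` are the only input required by the global solver.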
Figure 9.3 Experimental results on the Glass data set ($L = 9$, $M = 214$). Panels: (a) free energy, (b) computation time (sec), and (c) estimated rank, each plotted against the iteration number for Global, Standard(iniML), Standard(iniMLSS), and Standard(iniRan).
We observed that starting from a small noise variance tends to alleviate the local minimum problem at the expense of slightly slower convergence. The curve labeled as “Standard(iniMLSS)” indicates the standard algorithm starting from the ML solution with a small noise variance σ2 = 0.0001. We see in Figures 9.1 and 9.2 that this initialization improves the quality of solutions, and successfully finds the true rank for these artificial data sets. However, we will show in Section 9.3.2 that this scheme still suffers from the local minimum problem on benchmark datasets.
9.3.2 Experiment on Benchmark Data

Figures 9.3 through 9.5 show the experimental results on the Glass, the Satimage, and the Spectf data sets available from the University of California, Irvine (UCI) repository (Asuncion and Newman, 2007). A similar tendency to the artificial data experiment (Figures 9.1 and 9.2) is observed: “Standard(iniRan)” converges slowly, and is often trapped at a local minimum with a wrong estimated rank;¹ “Standard(iniML)” converges slightly faster but to a worse local minimum; and “Standard(iniMLSS)” tends to give a better solution. Unlike the artificial data experiment, “Standard(iniMLSS)” fails to find the correct rank in these benchmark data sets. We also conducted experiments on other benchmark data sets and found that the standard VB algorithm generally converges slowly, and sometimes suffers from the local minimum problem, while the global solver gives the global solution immediately.

¹ Since the true ranks of the benchmark data sets are unknown, we mean by a wrong rank a rank different from the one giving the lowest free energy.

Figure 9.4 Experimental results on the Satimage data set ($L = 36$, $M = 6435$). Panels: (a) free energy, (b) computation time (sec), and (c) estimated rank, each plotted against the iteration number for Global, Standard(iniML), Standard(iniMLSS), and Standard(iniRan).

Figure 9.5 Experimental results on the Spectf data set ($L = 44$, $M = 267$). Panels: (a) free energy, (b) computation time (sec), and (c) estimated rank, each plotted against the iteration number for Global, Standard(iniML), Standard(iniMLSS), and Standard(iniRan).
Figure 9.6 Experimental results on the Concrete Slump Test data set (an RRR task with $L = 3$, $M = 7$). Panels: (a) free energy, (b) computation time (sec), and (c) estimated rank, each plotted against the iteration number for Global, Standard(iniML), Standard(iniMLSS), and Standard(iniRan).
Finally, we applied EVB learning to the reduced rank regression (RRR) model (see Section 3.1.2), whose model likelihood is given by Eq. (3.36). Figure 9.6 shows the results on the Concrete Slump Test data set, where we centered the $L = 3$-dimensional outputs and prewhitened the $M = 7$-dimensional inputs. We also standardized the outputs so that the variance of each element is equal to one. Note that we cannot directly apply Algorithm 16 to the RRR model. Instead, we use Algorithm 16 with a fixed noise variance (skipping Step 4) and apply a one-dimensional search to minimize the free energy (3.42), in order to estimate the rescaled noise variance $\sigma^2$. For the standard VB algorithm, the rescaled noise variance should be updated by Eq. (3.43), instead of Eq. (3.28). The original noise variance $\sigma'^2$ is recovered by Eq. (3.40) in both cases. Overall, the global solver showed excellent performance compared with the standard VB algorithm.
9.4 Extension to Nonconjugate MF with Missing Entries

The global solvers introduced in Section 9.1 can be directly applied only for the fully observed isotropic Gaussian likelihood (9.1). However, the global solver can be used as a subroutine to develop efficient algorithms for more general cases. In this section, we introduce the approach by Seeger and Bouchard (2012), where an iterative singular value shrinkage algorithm was proposed, based on the global VB solver (Algorithm 15) and local variational approximation (Section 2.1.7).
9.4.1 Nonconjugate MF Model

Consider the following model:

$$p(V|A, B) = \prod_{l=1}^{L}\prod_{m=1}^{M} \phi_{l,m}\!\left(V_{l,m}, \widetilde{b}_l^\top \widetilde{a}_m\right), \tag{9.32}$$

$$p(A) \propto \exp\left(-\frac{1}{2}\mathrm{tr}\left(A C_A^{-1} A^\top\right)\right), \tag{9.33}$$

$$p(B) \propto \exp\left(-\frac{1}{2}\mathrm{tr}\left(B C_B^{-1} B^\top\right)\right), \tag{9.34}$$

where $\phi_{l,m}(v, u)$ is a function of $v$ and $u$, and satisfies

$$-\frac{\partial^2 \log \phi_{l,m}(v, u)}{\partial u^2} \le \frac{1}{\sigma^2} \tag{9.35}$$

for any $l$, $m$, $v$, and $u$. The function $\phi_{l,m}(v, u)$ corresponds to the model distribution of the $(l, m)$th entry $v$ of $V$ parameterized by $u$. If

$$\phi_{l,m}(v, u) = \mathrm{Gauss}_1(v; u, \sigma^2) \tag{9.36}$$

for all $l$ and $m$, the model (9.32) through (9.34) is reduced to the fully observed isotropic Gaussian MF model (9.1) through (9.3), and the Hessian,

$$-\frac{\partial^2 \log \phi_{l,m}(v, u)}{\partial u^2} = \frac{1}{\sigma^2},$$

of the negative log-likelihood is a constant with respect to $v$ and $u$. The model (9.32) through (9.34) can cover the case with missing entries by setting the noise variance in Eq. (9.36) to $\sigma^2 \to \infty$ for the unobserved entries (the condition (9.35) is tight for the smallest $\sigma^2$). Other one-dimensional distributions, including the Bernoulli distribution with sigmoid parameterization and the Poisson distribution, satisfy the condition (9.35) for a certain $\sigma^2$, which will be introduced in Section 9.4.3.
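9.4.2 Local Variational Approximation for Nonconjugate MF

The VB learning problem (9.4) minimizes the free energy, which can be written as

$$F(r) = \left\langle \log \frac{r_A(A)\, r_B(B)}{\prod_{l=1}^{L}\prod_{m=1}^{M} \phi_{l,m}\!\left(V_{l,m}, \widetilde{b}_l^\top \widetilde{a}_m\right) p(A)\, p(B)} \right\rangle_{r_A(A) r_B(B)}. \tag{9.37}$$

In order to make the global VB solver applicable as a subroutine, we instead solve the following joint minimization problem,

$$\hat{r} = \mathop{\mathrm{argmin}}_{r, \Xi} \overline{F}(r, \Xi) \quad \text{s.t.} \quad r(A, B) = r_A(A)\, r_B(B), \tag{9.38}$$

of an upper-bound of the free energy,

$$F \le \overline{F}(r, \Xi) \equiv \left\langle \log \frac{r_A(A)\, r_B(B)}{\prod_{l=1}^{L}\prod_{m=1}^{M} \underline{\phi}_{l,m}\!\left(V_{l,m}, \widetilde{b}_l^\top \widetilde{a}_m, \Xi_{l,m}\right) p(A)\, p(B)} \right\rangle_{r_A(A) r_B(B)}, \tag{9.39}$$

where

$$\underline{\phi}_{l,m}(v, u, \xi) \le \phi_{l,m}(v, u) \tag{9.40}$$

is a lower-bound of the likelihood parameterized with variational parameters $\Xi \in \mathbb{R}^{L \times M}$.

The condition (9.35) allows us to form a parametric lower-bound in the (unnormalized) isotropic Gaussian form, which we derive as follows. Any function $f(x)$ with bounded curvature $\frac{\partial^2 f}{\partial x^2} \le \kappa$ can be upper-bounded by the following quadratic function:

$$f(x) \le \frac{\kappa}{2}(x - \xi)^2 + \left.\frac{\partial f}{\partial x}\right|_{x=\xi}(x - \xi) + f(\xi) \quad \text{for any } \xi \in \mathbb{R}. \tag{9.41}$$

Therefore, with $\kappa = 1/\sigma^2$, it holds that, for any $\xi \in \mathbb{R}$,

$$-\log \phi_{l,m}(v, u) \le \frac{(u - \xi)^2}{2\sigma^2} + g(v, \xi)(u - \xi) - \log \phi_{l,m}(v, \xi), \tag{9.42}$$

where

$$g(v, \xi) = -\left.\frac{\partial \log \phi_{l,m}(v, u)}{\partial u}\right|_{u=\xi}.$$

The left graph in Figure 9.7 shows the parametric quadratic upper-bounds (9.42) for the Bernoulli likelihood with sigmoid parameterization.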
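The quadratic bound (9.42) can be checked numerically for the Bernoulli likelihood with sigmoid parameterization, $\phi(v, u) = e^{vu}/(1 + e^u)$, for which $\sigma^2 = 4$ (see Section 9.4.3). The following is a small sketch with hypothetical function names; it only evaluates both sides of the inequality and is not part of the original algorithm.

```python
import math

SIGMA2 = 4.0  # 1 / sigma^2 = 1/4 bounds the curvature of the sigmoid Bernoulli model

def neg_log_phi(v, u):
    """-log phi(v, u) for phi(v, u) = exp(v * u) / (1 + exp(u))."""
    return math.log1p(math.exp(u)) - v * u

def g(v, xi):
    """g(v, xi) = -d log phi / du evaluated at u = xi, i.e., sigmoid(xi) - v."""
    return 1.0 / (1.0 + math.exp(-xi)) - v

def quadratic_bound(v, u, xi, sigma2=SIGMA2):
    """Right-hand side of Eq. (9.42)."""
    return (u - xi) ** 2 / (2.0 * sigma2) + g(v, xi) * (u - xi) + neg_log_phi(v, xi)
```

Evaluating `quadratic_bound(1, u, xi)` for `xi` in `(-2.5, 0.0, 2.5)` over a range of `u` gives the kind of curves described for the left panel of Figure 9.7; each curve touches the negative log-likelihood at `u == xi`.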
Figure 9.7 Parametric quadratic (Gaussian-form) bounds for the Bernoulli likelihood with sigmoid parameterization, $\phi(v, u) = e^{vu}/(1 + e^u)$, for $v = 1$. Left: the negative log-likelihood (the left-hand side of Eq. (9.42)) and its quadratic upper-bounds (the right-hand side of Eq. (9.42)) for $\xi = -2.5, 0.0, 2.5$. Right: the likelihood function $\phi(1, \xi)$ and its Gaussian-form lower-bounds (9.43) for $\xi = -2.5, 0.0, 2.5$.
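Since $\log(\cdot)$ is a monotonic function, we can adopt the following parametric lower-bound of $\phi_{l,m}(v, u)$:

$$\begin{aligned}
\underline{\phi}_{l,m}(v, u, \xi) &= \exp\left(-\left(\frac{(u - \xi)^2}{2\sigma^2} + g(v, \xi)(u - \xi) - \log \phi_{l,m}(v, \xi)\right)\right) \\
&= \phi_{l,m}(v, \xi) \exp\left(-\frac{1}{2\sigma^2}\left((u - \xi)^2 + 2\sigma^2 g(v, \xi)(u - \xi)\right)\right) \\
&= \phi_{l,m}(v, \xi) \exp\left(-\frac{1}{2\sigma^2}\left(\left(u - \xi + \sigma^2 g(v, \xi)\right)^2 - \left(\sigma^2 g(v, \xi)\right)^2\right)\right) \\
&= \phi_{l,m}(v, \xi) \exp\left(\frac{\sigma^2}{2} g^2(v, \xi)\right) \exp\left(-\frac{1}{2\sigma^2}\left(\left(\xi - \sigma^2 g(v, \xi)\right) - u\right)^2\right) \\
&= \sqrt{2\pi\sigma^2}\, \phi_{l,m}(v, \xi) \exp\left(\frac{\sigma^2}{2} g^2(v, \xi)\right) \mathrm{Gauss}_1\!\left(\xi - \sigma^2 g(v, \xi); u, \sigma^2\right).
\end{aligned} \tag{9.43}$$

The right graph in Figure 9.7 shows the parametric Gaussian-form lower-bounds (9.43) for the Bernoulli likelihood with sigmoid parameterization. Substituting Eq. (9.43) into Eq. (9.39) gives

$$\begin{aligned}
\overline{F}(r, \Xi) &= -\sum_{l=1}^{L}\sum_{m=1}^{M}\left(\frac{1}{2}\log(2\pi\sigma^2) + \log \phi_{l,m}(V_{l,m}, \Xi_{l,m}) + \frac{\sigma^2}{2} g^2(V_{l,m}, \Xi_{l,m})\right) \\
&\quad + \left\langle \log \frac{r_A(A)\, r_B(B)}{\prod_{l=1}^{L}\prod_{m=1}^{M} \mathrm{Gauss}_1\!\left(\Xi_{l,m} - \sigma^2 g(V_{l,m}, \Xi_{l,m}); \widetilde{b}_l^\top \widetilde{a}_m, \sigma^2\right) p(A)\, p(B)} \right\rangle_{r_A(A) r_B(B)} \\
&= -\sum_{l=1}^{L}\sum_{m=1}^{M}\left(\frac{1}{2}\log(2\pi\sigma^2) + \log \phi_{l,m}(V_{l,m}, \Xi_{l,m}) + \frac{\sigma^2}{2} g^2(V_{l,m}, \Xi_{l,m})\right) \\
&\quad + \left\langle \log \frac{r_A(A)\, r_B(B)}{(2\pi\sigma^2)^{-LM/2} \exp\left(-\frac{1}{2\sigma^2}\left\|\breve{V} - BA^\top\right\|_{\mathrm{Fro}}^2\right) p(A)\, p(B)} \right\rangle_{r_A(A) r_B(B)},
\end{aligned} \tag{9.44}$$

where $\breve{V} \in \mathbb{R}^{L \times M}$ is a matrix such that

$$\breve{V}_{l,m} = \Xi_{l,m} - \sigma^2 g(V_{l,m}, \Xi_{l,m}). \tag{9.45}$$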
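The first term in Eq. (9.44) does not depend on $r$, and the second term is equal to the free energy of the fully observed isotropic Gaussian MF model with the observed matrix $V$ replaced with $\breve{V}$. Therefore, given the variational parameter $\Xi$, we can partially solve the minimization problem (9.38) with respect to $r$ by applying the global VB solver (Algorithm 15). The solution is Gaussian in the following form:

$$\hat{r}(A, B) = \hat{r}_A(A)\, \hat{r}_B(B), \tag{9.46}$$

where

$$\hat{r}_A(A) = \mathrm{MGauss}_{M,H}\!\left(A; \hat{A}, I_M \otimes \hat{\Sigma}_A\right) \propto \exp\left(-\frac{\mathrm{tr}\left((A - \hat{A})\, \hat{\Sigma}_A^{-1} (A - \hat{A})^\top\right)}{2}\right), \tag{9.47}$$

$$\hat{r}_B(B) = \mathrm{MGauss}_{L,H}\!\left(B; \hat{B}, I_L \otimes \hat{\Sigma}_B\right) \propto \exp\left(-\frac{\mathrm{tr}\left((B - \hat{B})\, \hat{\Sigma}_B^{-1} (B - \hat{B})^\top\right)}{2}\right). \tag{9.48}$$

Here the mean and the covariance parameters $\hat{A}, \hat{\Sigma}_A, \hat{B}, \hat{\Sigma}_B$ are another set of variational parameters. Given the optimal $\hat{r}$ specified by Eq. (9.46), the free energy bound (9.44) is written (as a function of $\Xi$) as follows:

$$\begin{aligned}
\min_r \overline{F}(r, \Xi) &= -\sum_{l=1}^{L}\sum_{m=1}^{M}\left(\log \phi_{l,m}(V_{l,m}, \Xi_{l,m}) + \frac{\sigma^2}{2} g^2(V_{l,m}, \Xi_{l,m})\right) + \frac{1}{2\sigma^2}\left\langle \left\|\breve{V} - BA^\top\right\|_{\mathrm{Fro}}^2 \right\rangle_{\hat{r}_A(A)\hat{r}_B(B)} + \mathrm{const.} \\
&= -\sum_{l=1}^{L}\sum_{m=1}^{M}\left(\log \phi_{l,m}(V_{l,m}, \Xi_{l,m}) + \frac{\sigma^2}{2} g^2(V_{l,m}, \Xi_{l,m})\right) + \frac{1}{2\sigma^2}\left\|\breve{V} - \hat{B}\hat{A}^\top\right\|_{\mathrm{Fro}}^2 + \mathrm{const.} \\
&= -\sum_{l=1}^{L}\sum_{m=1}^{M}\left(\log \phi_{l,m}(V_{l,m}, \Xi_{l,m}) + \frac{\sigma^2}{2} g^2(V_{l,m}, \Xi_{l,m}) - \frac{1}{2\sigma^2}\left(\breve{V}_{l,m} - \hat{U}_{l,m}\right)^2\right) + \mathrm{const.},
\end{aligned} \tag{9.49}$$

where

$$\hat{U} = \hat{B}\hat{A}^\top.$$

The second-to-last equation in Eq. (9.43), together with Eq. (9.45), implies that

$$\log \underline{\phi}_{l,m}\!\left(V_{l,m}, \hat{U}_{l,m}, \Xi_{l,m}\right) = \log \phi_{l,m}(V_{l,m}, \Xi_{l,m}) + \frac{\sigma^2}{2} g^2(V_{l,m}, \Xi_{l,m}) - \frac{1}{2\sigma^2}\left(\breve{V}_{l,m} - \hat{U}_{l,m}\right)^2,$$

with which Eq. (9.49) is written as

$$\min_r \overline{F}(r, \Xi) = -\sum_{l=1}^{L}\sum_{m=1}^{M} \log \underline{\phi}_{l,m}\!\left(V_{l,m}, \hat{U}_{l,m}, \Xi_{l,m}\right) + \mathrm{const.} \tag{9.50}$$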
Algorithm 19 Iterative singular value shrinkage algorithm for nonconjugate MF (with missing entries).
1: Set the noise variance $\sigma^2$ with which the condition (9.35) tightly holds, and initialize the variational parameters to $\Xi = 0_{(L,M)}$.
2: Compute $\breve{V}$ by Eq. (9.45).
3: Compute the VB posterior (9.46) by applying the global solver (Algorithm 15) with $\breve{V}$ substituted for $V$.
4: Update $\Xi$ by Eq. (9.51).
5: Iterate Steps 2 through 4 until convergence.
Since $-\log \underline{\phi}_{l,m}(v, u, \xi)$ is the quadratic upper-bound (the right-hand side in Eq. (9.42)) of $-\log \phi_{l,m}(v, u)$, which is tight at $u = \hat{U}_{l,m}$ when $\xi = \hat{U}_{l,m}$, the minimizer of Eq. (9.50) with respect to $\Xi$ is given by

$$\hat{\Xi} \equiv \mathop{\mathrm{argmin}}_{\Xi} \min_r \overline{F}(r, \Xi) = \hat{U}. \tag{9.51}$$

In summary, to solve the joint minimization problem (9.38), we can iteratively update $r$ and $\Xi$. The update of $r$ can be performed by the global solver (Algorithm 15) with the observed matrix $V$ replaced with $\breve{V}$, defined by Eq. (9.45). The update of $\Xi$ is simply performed by Eq. (9.51). Algorithm 19 summarizes this procedure, where $0_{(d_1, d_2)}$ denotes the $d_1 \times d_2$ matrix with all entries equal to zero. Seeger and Bouchard (2012) empirically showed that this iterative singular value shrinkage algorithm significantly outperforms the MAP solution at comparable computational costs. They also proposed an efficient way to perform SVD when $V$ is huge but sparsely observed, based on the techniques proposed by Tomioka et al. (2010).
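The control flow of Algorithm 19 can be sketched as follows for a fully observed binary matrix with the Bernoulli likelihood. The global VB solver (Algorithm 15) is not reproduced here, so the sketch takes the solver as an argument and, for demonstration only, plugs in a plain rank-truncated SVD named `placeholder_solver`; this stand-in is an assumption and does not implement VB shrinkage. A fixed iteration count replaces the convergence test.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def placeholder_solver(V_breve, rank=2):
    """Stand-in for the global VB solver: rank-truncated SVD (NOT Algorithm 15).

    Returns an estimate U_hat of the low-rank matrix B A^T.
    """
    U, s, Vt = np.linalg.svd(V_breve, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

def iterative_shrinkage(V, solver, sigma2=4.0, n_iter=20):
    """Alternate Eq. (9.45) and Eq. (9.51) for a fully observed binary V."""
    Xi = np.zeros_like(V, dtype=float)        # Step 1: Xi = 0
    for _ in range(n_iter):
        g = sigmoid(Xi) - V                   # g(V, Xi) for the Bernoulli model
        V_breve = Xi - sigma2 * g             # Step 2: Eq. (9.45)
        U_hat = solver(V_breve)               # Step 3: solve with V_breve in place of V
        Xi = U_hat                            # Step 4: Eq. (9.51)
    return Xi
```

Passing the solver as a callable keeps the outer loop independent of how the inner Gaussian MF problem is solved, which mirrors the role of the global solver as a subroutine in the text.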
9.4.3 Examples of Nonconjugate MF

In this subsection, we introduce a few examples of model likelihood $\phi_{l,m}(v, u)$, which satisfy Condition (9.35), and give the corresponding derivatives of the negative log likelihood.

Isotropic Gaussian MF with Missing Entries

If we let

$$\phi_{l,m}(v, u) = \begin{cases} \mathrm{Gauss}_1(v; u, \sigma^2) & \text{if } (l, m) \in \Lambda, \\ 1 & \text{otherwise,} \end{cases} \tag{9.52}$$

where $\Lambda$ denotes the set of observed entries, the model distribution (9.32) corresponds to the model distribution (3.44) of MF with missing entries. The first and the second derivatives of the negative log likelihood are given as follows:

$$-\frac{\partial \log \phi_{l,m}(v, u)}{\partial u} = \begin{cases} \frac{1}{\sigma^2}(u - v) & \text{if } (l, m) \in \Lambda, \\ 0 & \text{otherwise,} \end{cases}$$

$$-\frac{\partial^2 \log \phi_{l,m}(v, u)}{\partial u^2} = \begin{cases} \frac{1}{\sigma^2} & \text{if } (l, m) \in \Lambda, \\ 0 & \text{otherwise.} \end{cases} \tag{9.53}$$

Bernoulli MF with Sigmoid Parameterization

The Bernoulli distribution with sigmoid parameterization is suitable for binary observations, i.e., $V \in \{0, 1\}^{L \times M}$:

$$\phi_{l,m}(v, u) = \begin{cases} \dfrac{e^{vu}}{1 + e^u} & \text{if } (l, m) \in \Lambda, \\ 1 & \text{otherwise.} \end{cases} \tag{9.54}$$

The first and the second derivatives are given as follows:

$$-\frac{\partial \log \phi_{l,m}(v, u)}{\partial u} = \begin{cases} \frac{1}{1 + e^{-u}} - v & \text{if } (l, m) \in \Lambda, \\ 0 & \text{otherwise,} \end{cases}$$

$$-\frac{\partial^2 \log \phi_{l,m}(v, u)}{\partial u^2} = \begin{cases} \frac{1}{(1 + e^{-u})(1 + e^u)} & \text{if } (l, m) \in \Lambda, \\ 0 & \text{otherwise.} \end{cases} \tag{9.55}$$

It holds that

$$-\frac{\partial^2 \log \phi_{l,m}(v, u)}{\partial u^2} \le \frac{1}{4},$$

and therefore, the noise variance should be set to $\sigma^2 = 4$, which satisfies the condition (9.35). Figure 9.7 was depicted for this model.

Poisson MF

The Poisson distribution is suitable for count data, i.e., $V \in \{0, 1, 2, \ldots\}^{L \times M}$:

$$\phi_{l,m}(v, u) = \begin{cases} \lambda^v(u)\, e^{-\lambda(u)} & \text{if } (l, m) \in \Lambda, \\ 1 & \text{otherwise,} \end{cases} \tag{9.56}$$

where $\lambda(u)$ is the link function. Since a common choice $\lambda(u) = e^u$ for the link function gives unbounded curvature for large $u$, Seeger and Bouchard (2012) proposed to use another link function $\lambda(u) = \log(1 + e^u)$. The first derivative is given as follows:

$$-\frac{\partial \log \phi_{l,m}(v, u)}{\partial u} = \begin{cases} \frac{1}{1 + e^{-u}}\left(1 - \frac{v}{\lambda(u)}\right) & \text{if } (l, m) \in \Lambda, \\ 0 & \text{otherwise.} \end{cases} \tag{9.57}$$

It was confirmed that the second derivative is upper-bounded as

$$-\frac{\partial^2 \log \phi_{l,m}(v, u)}{\partial u^2} \le \frac{1}{4} + 0.17v,$$

and therefore, the noise variance should be set to

$$\sigma^2 = \frac{1}{1/4 + 0.17 \max_{l,m} V_{l,m}}.$$

Since the bound can be loose if some of the entries $V_{l,m}$ of the observed matrix are huge compared to the others, overly large counts should be clipped.
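The curvature bounds quoted above for the Bernoulli and Poisson likelihoods can be checked numerically. The sketch below approximates the second derivative of the negative log likelihood with central differences on a grid of $u$ values; the grid, the step size, and the function names are illustrative assumptions, and the check is a numerical spot test rather than a proof.

```python
import math

def softplus(u):
    """log(1 + exp(u)), the link function lambda(u) used for the Poisson model."""
    return math.log1p(math.exp(u))

def nll_bernoulli(v, u):
    """-log phi for phi(v, u) = exp(v * u) / (1 + exp(u))."""
    return softplus(u) - v * u

def nll_poisson(v, u):
    """-log phi for phi(v, u) = lambda(u)^v * exp(-lambda(u))."""
    lam = softplus(u)
    return lam - v * math.log(lam)

def second_derivative(f, u, h=1e-3):
    """Central-difference approximation of f''(u)."""
    return (f(u + h) - 2.0 * f(u) + f(u - h)) / (h * h)

def sigma2_poisson(v_max):
    """Noise variance 1 / (1/4 + 0.17 * max V) suggested for the Poisson model."""
    return 1.0 / (0.25 + 0.17 * v_max)
```

For instance, `sigma2_poisson(10)` evaluates `1 / (0.25 + 1.7)`, which illustrates how quickly a single large count shrinks $\sigma^2$ and why clipping overly large counts is advisable.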
10 Global Solver for Low-Rank Subspace Clustering
The nonasymptotic theory, described in Chapter 6, for fully observed matrix factorization (MF) has been extended to other bilinear models. In this chapter, we introduce exact and approximate global variational Bayesian (VB) solvers (Nakajima et al., 2013c) for low-rank subspace clustering (LRSC).
10.1 Problem Description

The LRSC model, introduced in Section 3.4, is defined as

$$p(V|A', B') \propto \exp\left(-\frac{1}{2\sigma^2}\left\|V - VB'A'^\top\right\|_{\mathrm{Fro}}^2\right), \tag{10.1}$$

$$p(A') \propto \exp\left(-\frac{1}{2}\mathrm{tr}\left(A' C_A^{-1} A'^\top\right)\right), \tag{10.2}$$

$$p(B') \propto \exp\left(-\frac{1}{2}\mathrm{tr}\left(B' C_B^{-1} B'^\top\right)\right), \tag{10.3}$$

where $V \in \mathbb{R}^{L \times M}$ is an observation matrix, and $A' \in \mathbb{R}^{M \times H}$ and $B' \in \mathbb{R}^{M \times H}$ for $H \le \min(L, M)$ are the parameters to be estimated. Note that in this chapter we denote the original parameters $A'$ and $B'$ with primes for convenience. We assume that hyperparameters

$$C_A = \mathrm{Diag}(c_{a_1}^2, \ldots, c_{a_H}^2), \qquad C_B = \mathrm{Diag}(c_{b_1}^2, \ldots, c_{b_H}^2),$$

are diagonal and positive definite. The LRSC model is similar to MF. The only difference is that the product $B'A'^\top$ of the parameters is further multiplied by $V$ in Eq. (10.1). Accordingly, we can hope that similar analysis could be applied to LRSC, providing a global solver for LRSC.
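We first transform the parameters as

$$A \leftarrow \Omega_V^{\mathrm{right}\top} A', \qquad B \leftarrow \Omega_V^{\mathrm{right}\top} B', \qquad \text{where} \quad V = \Omega_V^{\mathrm{left}}\, \Gamma_V\, \Omega_V^{\mathrm{right}\top} \tag{10.4}$$

is the singular value decomposition (SVD) of $V$. Here, $\Omega_V^{\mathrm{left}} \in \mathbb{R}^{L \times L}$ and $\Omega_V^{\mathrm{right}} \in \mathbb{R}^{M \times M}$ are orthogonal matrices, and $\Gamma_V \in \mathbb{R}^{L \times M}$ is a (possibly nonsquare) diagonal matrix with nonnegative diagonal entries aligned in nonincreasing order, i.e., $\gamma_1 \ge \gamma_2 \ge \cdots \ge \gamma_{\min(L,M)}$. After this transformation, the LRSC model (10.1) through (10.3) is rewritten as

$$p(\Gamma_V|A, B) \propto \exp\left(-\frac{1}{2\sigma^2}\left\|\Gamma_V - \Gamma_V BA^\top\right\|_{\mathrm{Fro}}^2\right), \tag{10.5}$$

$$p(A) \propto \exp\left(-\frac{1}{2}\mathrm{tr}\left(A C_A^{-1} A^\top\right)\right), \tag{10.6}$$

$$p(B) \propto \exp\left(-\frac{1}{2}\mathrm{tr}\left(B C_B^{-1} B^\top\right)\right). \tag{10.7}$$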
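The invariance behind the transformation (10.4) can be verified numerically: rotating $A'$ and $B'$ by $\Omega_V^{\mathrm{right}\top}$ leaves both the residual norm in Eq. (10.1) and the prior terms unchanged. The following sketch assumes NumPy, whose `np.linalg.svd` returns the factor `Vt` corresponding to $\Omega_V^{\mathrm{right}\top}$; the function name `transform_lrsc` is hypothetical.

```python
import numpy as np

def transform_lrsc(V, A_prime, B_prime):
    """Apply Eq. (10.4): return (Gamma_V, A, B) with A = Omega_right^T A', etc."""
    L, M = V.shape
    U, s, Vt = np.linalg.svd(V, full_matrices=True)   # V = U @ Gamma @ Vt
    Gamma = np.zeros((L, M))
    Gamma[:len(s), :len(s)] = np.diag(s)              # possibly nonsquare diagonal
    return Gamma, Vt @ A_prime, Vt @ B_prime
```

Since $V - VB'A'^\top = \Omega_V^{\mathrm{left}}(\Gamma_V - \Gamma_V BA^\top)\Omega_V^{\mathrm{right}\top}$ and the Frobenius norm is invariant under orthogonal transformations, the two likelihoods (10.1) and (10.5) agree up to normalization.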
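The transformation (10.4) does not much affect the derivation of the VB learning algorithm. The following summarizes the result obtained in Section 3.4 with the transformed parameters $A$ and $B$. The solution of the VB learning problem,

$$\hat{r} = \mathop{\mathrm{argmin}}_{r} F(r) \quad \text{s.t.} \quad r(A, B) = r_A(A)\, r_B(B), \quad \text{where} \quad F = \left\langle \log \frac{r_A(A)\, r_B(B)}{p(\Gamma_V|A, B)\, p(A)\, p(B)} \right\rangle_{r_A(A) r_B(B)}, \tag{10.8}$$

has the following form:

$$r(A) \propto \exp\left(-\frac{\mathrm{tr}\left((A - \hat{A})\, \hat{\Sigma}_A^{-1} (A - \hat{A})^\top\right)}{2}\right), \qquad r(B) \propto \exp\left(-\frac{(\breve{b} - \hat{\breve{b}})^\top \breve{\Sigma}_B^{-1} (\breve{b} - \hat{\breve{b}})}{2}\right), \tag{10.9}$$

for $\breve{b} = \mathrm{vec}(B) \in \mathbb{R}^{MH}$, and the free energy can be explicitly written as

$$\begin{aligned}
2F &= LM \log(2\pi\sigma^2) + \frac{\left\|\Gamma_V - \Gamma_V \hat{B}\hat{A}^\top\right\|_{\mathrm{Fro}}^2}{\sigma^2} + M \log \frac{\det(C_A)}{\det(\hat{\Sigma}_A)} + \log \frac{\det(C_B \otimes I_M)}{\det(\breve{\Sigma}_B)} \\
&\quad - 2MH + \mathrm{tr}\left\{C_A^{-1}\left(\hat{A}^\top \hat{A} + M\hat{\Sigma}_A\right)\right\} + \mathrm{tr}\left\{C_B^{-1} \hat{B}^\top \hat{B}\right\} + \mathrm{tr}\left\{(C_B^{-1} \otimes I_M)\, \breve{\Sigma}_B\right\} \\
&\quad + \mathrm{tr}\left\{\sigma^{-2}\, \Gamma_V^\top \Gamma_V \left(-\hat{B}\hat{A}^\top \hat{A}\hat{B}^\top + \left\langle B\left(\hat{A}^\top \hat{A} + M\hat{\Sigma}_A\right)B^\top \right\rangle_{r(B)}\right)\right\}.
\end{aligned} \tag{10.10}$$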
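Therefore, the variational parameters $(\hat{A}, \hat{B}, \hat{\Sigma}_A, \breve{\Sigma}_B)$ can be obtained by solving the following problem:

$$\text{Given} \quad C_A, C_B \in \mathbb{D}_{++}^{H}, \; \sigma^2 \in \mathbb{R}_{++}, \qquad \min_{(\hat{A}, \hat{B}, \hat{\Sigma}_A, \breve{\Sigma}_B)} F, \tag{10.11}$$

$$\text{s.t.} \quad \hat{A}, \hat{B} \in \mathbb{R}^{M \times H}, \quad \hat{\Sigma}_A \in \mathbb{S}_{++}^{H}, \quad \breve{\Sigma}_B \in \mathbb{S}_{++}^{MH}. \tag{10.12}$$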
The stationary conditions with respect to the variational parameters are given by

$$\hat{A} = \frac{1}{\sigma^2}\, \Gamma_V^\top \Gamma_V \hat{B} \hat{\Sigma}_A, \tag{10.13}$$

$$\hat{\Sigma}_A = \sigma^2 \left(\left\langle B^\top \Gamma_V^\top \Gamma_V B \right\rangle_{r(B)} + \sigma^2 C_A^{-1}\right)^{-1}, \tag{10.14}$$

$$\hat{\breve{b}} = \frac{\breve{\Sigma}_B}{\sigma^2}\, \mathrm{vec}\left(\Gamma_V^\top \Gamma_V \hat{A}\right), \tag{10.15}$$

$$\breve{\Sigma}_B = \sigma^2 \left(\left(\hat{A}^\top \hat{A} + M\hat{\Sigma}_A\right) \otimes \Gamma_V^\top \Gamma_V + \sigma^2 \left(C_B^{-1} \otimes I_M\right)\right)^{-1}. \tag{10.16}$$

For empirical VB (EVB) learning, we solve the problem,

$$\text{Given} \quad \sigma^2 \in \mathbb{R}_{++}, \qquad \min_{(\hat{A}, \hat{B}, \hat{\Sigma}_A, \breve{\Sigma}_B, C_A, C_B)} F \tag{10.17}$$

$$\text{subject to} \quad \hat{A}, \hat{B} \in \mathbb{R}^{M \times H}, \quad \hat{\Sigma}_A \in \mathbb{S}_{++}^{H}, \quad \breve{\Sigma}_B \in \mathbb{S}_{++}^{MH}, \quad C_A, C_B \in \mathbb{D}_{++}^{H}, \tag{10.18}$$

for which the stationary conditions with respect to the hyperparameters are given by

$$c_{a_h}^2 = \left\|\hat{a}_h\right\|^2 / M + \left(\hat{\Sigma}_A\right)_{h,h}, \tag{10.19}$$

$$c_{b_h}^2 = \left(\left\|\hat{b}_h\right\|^2 + \mathrm{tr}\left(\hat{\Sigma}_B^{(h,h)}\right)\right) / M, \tag{10.20}$$

$$\hat{\sigma}^2 = \frac{\mathrm{tr}\left\{\Gamma_V^\top \Gamma_V \left(I_M - 2\hat{B}\hat{A}^\top + \left\langle B\left(\hat{A}^\top \hat{A} + M\hat{\Sigma}_A\right)B^\top \right\rangle_{r(B)}\right)\right\}}{LM}. \tag{10.21}$$

In deriving the global VB solution of fully observed MF in Chapter 6, the following two facts were essential. First, a large portion of the degrees of freedom of the original variational parameters are irrelevant (see Section 6.3), and the optimization problem can be decomposed into subproblems, each of which has only a small number of unknown variables. Second, the stationary conditions of each subproblem are written as a polynomial system (a set of polynomial equations). These two facts also apply to the LRSC model, which allows us to derive an exact global VB solver (EGVBS). However, each of the decomposed subproblems still has too many unknowns, whose number is proportional to the problem size, and therefore EGVBS is still computationally demanding for typical problem sizes. As an alternative, we also derive an approximate global VB solver (AGVBS) by imposing an additional constraint, which allows further decomposition of the problem into subproblems with a constant number of unknowns. In this chapter, we first find irrelevant degrees of freedom of the variational parameters and decompose the VB learning problem. Then we derive EGVBS and AGVBS and empirically show their usefulness.
10.2 Conditions for VB Solutions

Let $J$ ($\le \min(L, M)$) be the rank of the observed matrix $V$. For simplicity, we assume that no pair of positive singular values of $V$ coincide with each other, i.e., $\gamma_1 > \gamma_2 > \cdots > \gamma_J > 0$. This holds with probability 1 if $V$ is contaminated with Gaussian noise, as the LRSC model (10.1) assumes. Since $(\Gamma_V^\top \Gamma_V)_{m,m'}$ is zero for $m > J$ or $m' > J$, Eqs. (10.13) and (10.15) imply that

$$\hat{A}_{m,h} = \hat{B}_{m,h} = 0 \quad \text{for} \quad m > J. \tag{10.22}$$

Similarly to Lemma 6.1 for the fully observed MF, we can prove the following lemma:

Lemma 10.1 Any local solution of the problem (10.11) is a stationary point of the free energy (10.10).

Proof Since

$$\left\|\Gamma_V - \Gamma_V \hat{B}\hat{A}^\top\right\|_{\mathrm{Fro}}^2 \ge 0,$$

and

$$\mathrm{tr}\left\{\Gamma_V^\top \Gamma_V \left(-\hat{B}\hat{A}^\top \hat{A}\hat{B}^\top + \left\langle B\left(\hat{A}^\top \hat{A} + M\hat{\Sigma}_A\right)B^\top \right\rangle_{r(B)}\right)\right\} \ge M \cdot \mathrm{tr}\left\{\Gamma_V^\top \Gamma_V \left\langle B \hat{\Sigma}_A B^\top \right\rangle_{r(B)}\right\} \ge 0,$$

the free energy (10.10) is lower-bounded as

$$2F \ge -M \log \det\left(\hat{\Sigma}_A\right) - \log \det\left(\breve{\Sigma}_B\right) + \mathrm{tr}\left\{C_A^{-1}\left(\hat{A}^\top \hat{A} + M\hat{\Sigma}_A\right)\right\} + \mathrm{tr}\left\{C_B^{-1} \hat{B}^\top \hat{B}\right\} + \mathrm{tr}\left\{(C_B^{-1} \otimes I_M)\, \breve{\Sigma}_B\right\} + \tau, \tag{10.23}$$

where $\tau$ is a finite constant. The right-hand side of Eq. (10.23) diverges to $+\infty$ if any entry of $\hat{A}$ or $\hat{B}$ goes to $+\infty$ or $-\infty$. Also it diverges if any eigenvalue of $\hat{\Sigma}_A$ or $\breve{\Sigma}_B$ goes to $+0$ or $\infty$. This implies that no local solution exists on the boundary of (the closure of) the domain (10.12). Since the free energy is differentiable in the domain (10.12), any local minimizer is a stationary point. For any (diagonalized) observed matrix $\Gamma_V$, the free energy (10.10) can be finite, for example, at $\hat{A} = 0_{M,H}$, $\hat{B} = 0_{M,H}$, $\hat{\Sigma}_A = I_H$, and $\breve{\Sigma}_B = I_{MH}$. Therefore, at least one minimizer always exists, which completes the proof of Lemma 10.1.

Lemma 10.1 implies that Eqs. (10.13) through (10.16) hold at any local solution.
10.3 Irrelevant Degrees of Freedom

Also similarly to Theorem 6.4, we have the following theorem:

Theorem 10.2 When $C_A C_B$ is nondegenerate (i.e., $c_{a_h} c_{b_h} > c_{a_{h'}} c_{b_{h'}}$ for any pair $h < h'$), $(\hat{A}, \hat{B}, \hat{\Sigma}_A, \breve{\Sigma}_B)$ are diagonal for any solution of the problem (10.11). When $C_A C_B$ is degenerate, any solution has an equivalent solution with diagonal $(\hat{A}, \hat{B}, \hat{\Sigma}_A, \breve{\Sigma}_B)$.

Theorem 10.2 significantly reduces the complexity of the optimization problem, and furthermore makes the problem separable, as seen in Section 10.5.
10.4 Proof of Theorem 10.2

Similarly to Section 6.4, we separately consider the following three cases:

Case 1 When no pair of diagonal entries of $C_A C_B$ coincide.
Case 2 When all diagonal entries of $C_A C_B$ coincide.
Case 3 When (not all but) some pairs of diagonal entries of $C_A C_B$ coincide.
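10.4.1 Diagonality Implied by Optimality

We can prove the following lemma, which is an extension of Lemma 6.2.

Lemma 10.3 Let $\Gamma, \Omega, \Phi \in \mathbb{R}^{H \times H}$ be a nondegenerate diagonal matrix, an orthogonal matrix, and a symmetric matrix, respectively. Let $\{\Lambda^{(k)}, \Lambda'^{(k)} \in \mathbb{R}^{H \times H}; k = 1, \ldots, K\}$ be arbitrary diagonal matrices, and $\{\Psi^{(k')} \in \mathbb{R}^{H \times H}; k' = 1, \ldots, K'\}$ be arbitrary symmetric matrices. If
\[
G(\Omega) = \mathrm{tr}\left\{ \Gamma \Omega \Phi \Omega^\top + \sum_{k=1}^{K} \Lambda^{(k)} \Omega \Lambda'^{(k)} \Omega^\top + \sum_{k'=1}^{K'} \Omega \Psi^{(k')} \right\} \tag{10.24}
\]
is minimized or maximized (as a function of $\Omega$, given $\Gamma$, $\Phi$, $\{\Lambda^{(k)}, \Lambda'^{(k)}\}$, $\{\Psi^{(k')}\}$) when $\Omega = I_H$, then $\Phi$ is diagonal. Here, $K$ and $K'$ can be any natural numbers including $K = 0$ and $K' = 0$ (when the second and the third terms, respectively, do not exist).

Proof Let
\[
\Phi = \Omega' \Gamma' \Omega'^\top \tag{10.25}
\]
be the eigenvalue decomposition of $\Phi$. Let $\gamma, \gamma', \{\lambda^{(k)}\}, \{\lambda'^{(k)}\}$ be the vectors consisting of the diagonal entries of $\Gamma, \Gamma', \{\Lambda^{(k)}\}, \{\Lambda'^{(k)}\}$, respectively, i.e.,
\[
\Gamma = \mathrm{Diag}(\gamma), \quad \Gamma' = \mathrm{Diag}(\gamma'), \quad \Lambda^{(k)} = \mathrm{Diag}(\lambda^{(k)}), \quad \Lambda'^{(k)} = \mathrm{Diag}(\lambda'^{(k)}).
\]
Then, Eq. (10.24) can be written as
\[
G(\Omega) = \mathrm{tr}\left\{ \Gamma \Omega \Phi \Omega^\top + \sum_{k=1}^{K} \Lambda^{(k)} \Omega \Lambda'^{(k)} \Omega^\top + \sum_{k'=1}^{K'} \Omega \Psi^{(k')} \right\}
= \gamma^\top Q \gamma' + \sum_{k=1}^{K} \lambda^{(k)\top} R \lambda'^{(k)} + \sum_{k'=1}^{K'} \mathrm{tr}\left\{ \Omega \Psi^{(k')} \right\}, \tag{10.26}
\]
where
\[
Q = (\Omega \Omega') \odot (\Omega \Omega'), \qquad R = \Omega \odot \Omega.
\]
Here, $\odot$ denotes the Hadamard product. Using this expression, we will prove that $\Phi$ is diagonal if $\Omega = I_H$ minimizes or maximizes Eq. (10.26).

Let us consider a bilateral perturbation $\Omega = \Delta$ such that the $2 \times 2$ matrix $\Delta_{(h,h')}$ for $h \neq h'$ consisting of the $h$th and the $h'$th columns and rows form a $2 \times 2$ orthogonal matrix,
\[
\Delta_{(h,h')} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix},
\]
and the remaining entries coincide with those of the identity matrix. Then, the elements of $Q$ become
\[
Q_{i,j} = \begin{cases}
(\Omega'_{h,j} \cos\theta - \Omega'_{h',j} \sin\theta)^2 & \text{if } i = h, \\
(\Omega'_{h,j} \sin\theta + \Omega'_{h',j} \cos\theta)^2 & \text{if } i = h', \\
\Omega'^2_{i,j} & \text{otherwise},
\end{cases}
\]
and Eq. (10.26) can be written as a function of $\theta$ as follows:
\[
\begin{aligned}
G(\theta) ={}& \sum_{j=1}^{H} \left\{ \gamma_h (\Omega'_{h,j} \cos\theta - \Omega'_{h',j} \sin\theta)^2 + \gamma_{h'} (\Omega'_{h,j} \sin\theta + \Omega'_{h',j} \cos\theta)^2 \right\} \gamma'_j \\
&+ \sum_{k=1}^{K} \begin{pmatrix} \lambda^{(k)}_h & \lambda^{(k)}_{h'} \end{pmatrix}
\begin{pmatrix} \cos^2\theta & \sin^2\theta \\ \sin^2\theta & \cos^2\theta \end{pmatrix}
\begin{pmatrix} \lambda'^{(k)}_h \\ \lambda'^{(k)}_{h'} \end{pmatrix} \\
&+ \sum_{k'=1}^{K'} \left( \Psi^{(k')}_{h,h} \cos\theta - \Psi^{(k')}_{h',h} \sin\theta + \Psi^{(k')}_{h,h'} \sin\theta + \Psi^{(k')}_{h',h'} \cos\theta \right) + \mathrm{const}.
\end{aligned} \tag{10.27}
\]
Since Eq. (10.27) is differentiable at $\theta = 0$, our assumption that Eq. (10.26) is minimized or maximized when $\Omega = I_H$ requires that $\theta = 0$ is a stationary point of Eq. (10.27) for any $h \neq h'$. Therefore, it holds that
\[
\begin{aligned}
0 = \left. \frac{\partial G}{\partial \theta} \right|_{\theta=0}
={}& \Bigg[ 2 \sum_{j} \Big\{ \gamma_h (\Omega'_{h,j} \cos\theta - \Omega'_{h',j} \sin\theta)(-\Omega'_{h,j} \sin\theta - \Omega'_{h',j} \cos\theta) \\
&\qquad + \gamma_{h'} (\Omega'_{h,j} \sin\theta + \Omega'_{h',j} \cos\theta)(\Omega'_{h,j} \cos\theta - \Omega'_{h',j} \sin\theta) \Big\} \gamma'_j \\
&\quad + \sum_{k'=1}^{K'} \left( -\Psi^{(k')}_{h,h} \sin\theta - \Psi^{(k')}_{h',h} \cos\theta + \Psi^{(k')}_{h,h'} \cos\theta - \Psi^{(k')}_{h',h'} \sin\theta \right) \Bigg]_{\theta=0} \\
={}& 2 (\gamma_{h'} - \gamma_h) \sum_{j} \Omega'_{h,j} \gamma'_j \Omega'_{h',j} + \sum_{k'=1}^{K'} \left( \Psi^{(k')}_{h,h'} - \Psi^{(k')}_{h',h} \right) \\
={}& 2 (\gamma_{h'} - \gamma_h) \Phi_{h,h'}.
\end{aligned} \tag{10.28}
\]
In the last equation, we used Eq. (10.25) and the assumption that $\{\Psi^{(k')}\}$ are symmetric. Since we assume that $\Gamma$ is nondegenerate ($\gamma_h \neq \gamma_{h'}$ for $h \neq h'$), Eq. (10.28) implies that $\Phi$ is diagonal, which completes the proof of Lemma 10.3.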
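The stationarity argument of Eq. (10.28) can be checked numerically. The following is a minimal sketch, assuming NumPy and the case $K = K' = 1$; the helper names `plane_rotation` and `G` and all numerical values are illustrative choices, not part of the text. It compares a central finite difference of $G$ along the rotation $\Delta$ with the closed form $2(\gamma_{h'} - \gamma_h)\Phi_{h,h'}$.

```python
import numpy as np

def plane_rotation(H, h, hp, theta):
    """Identity matrix whose (h, hp) 2x2 block is a rotation by theta."""
    delta = np.eye(H)
    c, s = np.cos(theta), np.sin(theta)
    delta[h, h], delta[h, hp] = c, -s
    delta[hp, h], delta[hp, hp] = s, c
    return delta

def G(omega, gamma, phi, lam, lam_p, psi):
    """Objective of Eq. (10.24) with one Lambda term and one Psi term."""
    return np.trace(np.diag(gamma) @ omega @ phi @ omega.T
                    + np.diag(lam) @ omega @ np.diag(lam_p) @ omega.T
                    + omega @ psi)

rng = np.random.default_rng(0)
H, h, hp = 4, 0, 2                       # 0-based indices of the rotated pair
gamma = np.array([4.0, 3.0, 2.0, 1.0])   # nondegenerate diagonal of Gamma
lam, lam_p = rng.standard_normal(H), rng.standard_normal(H)
S = rng.standard_normal((H, H))
phi = S + S.T                            # symmetric, generally non-diagonal
T = rng.standard_normal((H, H))
psi = T + T.T                            # symmetric Psi

# Central finite difference of G along the rotation angle at theta = 0.
eps = 1e-6
g_plus = G(plane_rotation(H, h, hp, eps), gamma, phi, lam, lam_p, psi)
g_minus = G(plane_rotation(H, h, hp, -eps), gamma, phi, lam, lam_p, psi)
numeric = (g_plus - g_minus) / (2 * eps)

# Closed form from the last line of Eq. (10.28).
analytic = 2 * (gamma[hp] - gamma[h]) * phi[h, hp]
print(numeric, analytic)
```

The two numbers agree up to finite-difference error, and both vanish only when the off-diagonal entry of $\Phi$ is zero, which is the content of the lemma.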
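10.4.2 Proof for Case 1

Assume that $(\hat{A}^*, \hat{B}^*, \hat{\Sigma}_A^*, \breve{\Sigma}_B^*)$ is a minimizer, and consider the following variation defined with an arbitrary $H \times H$ orthogonal matrix $\Omega_1$:
\[
\hat{A} = \hat{A}^* C_B^{1/2} \Omega_1^\top C_B^{-1/2}, \tag{10.29}
\]
\[
\hat{B} = \hat{B}^* C_B^{-1/2} \Omega_1^\top C_B^{1/2}, \tag{10.30}
\]
\[
\hat{\Sigma}_A = C_B^{-1/2} \Omega_1 C_B^{1/2} \hat{\Sigma}_A^* C_B^{1/2} \Omega_1^\top C_B^{-1/2}, \tag{10.31}
\]
\[
\breve{\Sigma}_B = (C_B^{1/2} \Omega_1 C_B^{-1/2} \otimes I_M) \breve{\Sigma}_B^* (C_B^{-1/2} \Omega_1^\top C_B^{1/2} \otimes I_M). \tag{10.32}
\]
Then the free energy (10.10) can be written as a function of $\Omega_1$:
\[
2F(\Omega_1) = \mathrm{tr}\left\{ C_A^{-1} C_B^{-1} \Omega_1 C_B^{1/2} \left( \hat{A}^{*\top} \hat{A}^* + M \hat{\Sigma}_A^* \right) C_B^{1/2} \Omega_1^\top \right\} + \mathrm{const}. \tag{10.33}
\]
Since Eq. (10.33) is minimized when $\Omega_1 = I_H$ by assumption, Lemma 10.3 implies that $C_B^{1/2} ( \hat{A}^{*\top} \hat{A}^* + M \hat{\Sigma}_A^* ) C_B^{1/2}$ is diagonal. Therefore,
\[
\Phi_1 = \hat{A}^{*\top} \hat{A}^* + M \hat{\Sigma}_A^* \tag{10.34}
\]
is diagonal, with which Eq. (10.16) implies that $\breve{\Sigma}_B^*$ is diagonal.

Since we have proved the diagonality of $\breve{\Sigma}_B^*$, the expectations in Eqs. (10.10) and (10.14), respectively, can be expressed in the following simple forms at the solution $(\hat{A}, \hat{B}, \hat{\Sigma}_A, \breve{\Sigma}_B) = (\hat{A}^*, \hat{B}^*, \hat{\Sigma}_A^*, \breve{\Sigma}_B^*)$:
\[
\left\langle B \left( \hat{A}^\top \hat{A} + M \hat{\Sigma}_A \right) B^\top \right\rangle_{r_B(B)} = \hat{B} \left( \hat{A}^\top \hat{A} + M \hat{\Sigma}_A \right) \hat{B}^\top + \Xi_{\Phi_1}, \tag{10.35}
\]
\[
\left\langle B^\top \Gamma_V^\top \Gamma_V B \right\rangle_{r_B(B)} = \hat{B}^\top \Gamma_V^\top \Gamma_V \hat{B} + \Xi_{\Gamma_V}, \tag{10.36}
\]
where $\Xi_{\Gamma_V} \in \mathbb{R}^{H \times H}$ and $\Xi_{\Phi_1} \in \mathbb{R}^{M \times M}$ are diagonal matrices with their entries given by
\[
(\Xi_{\Phi_1})_{m,m} = \sum_{h=1}^{H} \left( \hat{A}^\top \hat{A} + M \hat{\Sigma}_A \right)_{h,h} \hat{\sigma}^2_{B_{m,h}}, \qquad
(\Xi_{\Gamma_V})_{h,h} = \sum_{m=1}^{M} \gamma_m^2 \hat{\sigma}^2_{B_{m,h}}.
\]
Here $\{\hat{\sigma}^2_{B_{m,h}}\}$ are the diagonal entries of $\breve{\Sigma}_B$ such that
\[
\breve{\Sigma}_B = \mathrm{Diag}\left( (\hat{\sigma}^2_{B_{1,1}}, \ldots, \hat{\sigma}^2_{B_{M,1}}), (\hat{\sigma}^2_{B_{1,2}}, \ldots, \hat{\sigma}^2_{B_{M,2}}), \ldots, (\hat{\sigma}^2_{B_{1,H}}, \ldots, \hat{\sigma}^2_{B_{M,H}}) \right).
\]
Next consider the following variation defined with an $M \times M$ matrix $\Omega_2$ such that the upper-left $J \times J$ submatrix is an arbitrary orthogonal matrix and the other entries are zero:
\[
\hat{A} = \Omega_2 \hat{A}^*, \qquad \hat{B} = \Omega_2 \hat{B}^*.
\]
Then, by using Eq. (10.35), the free energy (10.10) is written as
\[
2F(\Omega_2) = \frac{1}{\sigma^2} \mathrm{tr}\left\{ \Gamma_V^\top \Gamma_V \Omega_2 \left( -2 \hat{B}^* \hat{A}^{*\top} + \hat{B}^* \left( \hat{A}^{*\top} \hat{A}^* + M \hat{\Sigma}_A^* \right) \hat{B}^{*\top} \right) \Omega_2^\top \right\} + \mathrm{const}. \tag{10.37}
\]
Applying Lemma 10.3 to the upper-left $J \times J$ submatrix in the trace, and then using Eq. (10.22), we find that
\[
\Phi_2 = -2 \hat{B}^* \hat{A}^{*\top} + \hat{B}^* \left( \hat{A}^{*\top} \hat{A}^* + M \hat{\Sigma}_A^* \right) \hat{B}^{*\top} \tag{10.38}
\]
is diagonal. Eq. (10.38) also implies that $\hat{B}^* \hat{A}^{*\top}$ is symmetric.

Consider the following variation defined with an $M \times M$ matrix $\Omega_3$ such that the upper-left $J \times J$ submatrix is an arbitrary orthogonal matrix and the other entries are zero:
\[
\hat{B} = \Omega_3 \hat{B}^*.
\]
Then the free energy is written as
\[
2F(\Omega_3) = \frac{1}{\sigma^2} \mathrm{tr}\left\{ \Gamma_V^\top \Gamma_V \Omega_3 \left( -2 \hat{B}^* \hat{A}^{*\top} \right) + \Gamma_V^\top \Gamma_V \Omega_3 \hat{B}^* \left( \hat{A}^{*\top} \hat{A}^* + M \hat{\Sigma}_A^* \right) \hat{B}^{*\top} \Omega_3^\top \right\} + \mathrm{const}. \tag{10.39}
\]
Applying Lemma 10.3 to the upper-left $J \times J$ submatrix in the trace, we find that
\[
\Phi_3 = \hat{B}^* \left( \hat{A}^{*\top} \hat{A}^* + M \hat{\Sigma}_A^* \right) \hat{B}^{*\top} \tag{10.40}
\]
is diagonal. Since Eqs. (10.34) and (10.40) are diagonal, $\hat{B}^*$ is diagonal. Consequently, Eq. (10.14) combined with Eq. (10.36) implies that $\hat{A}^*$ and $\hat{\Sigma}_A^*$ are diagonal. Thus we proved that the solution for $(\hat{A}, \hat{B}, \hat{\Sigma}_A, \breve{\Sigma}_B)$ is diagonal, provided that $C_A C_B$ is nondegenerate.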
10.4.3 Proof for Case 2

When $C_A C_B$ is degenerate, there are multiple equivalent solutions giving the same free energy (10.10) and the output $\hat{B} \hat{A}^\top$. In the following, we show that one of the equivalent solutions has diagonal $(\hat{A}^*, \hat{B}^*, \hat{\Sigma}_A^*, \breve{\Sigma}_B^*)$.

Assume that $C_A C_B = c^2 I_H$ for some $c^2 \in \mathbb{R}_{++}$. In this case, the free energy (10.10) is invariant with respect to $\Omega_1$ under the transformation (10.29) through (10.32). Let us focus on the solution with diagonal $\breve{\Sigma}_B$, which can be obtained by the transform (10.29) through (10.32) with a certain $\Omega_1$ from any solution satisfying Eq. (10.16). Then we can show, in the same way as in the nondegenerate case, that Eqs. (10.34), (10.38), and (10.40) are diagonal. This proves the existence of a solution such that $(\hat{A}^*, \hat{B}^*, \hat{\Sigma}_A^*, \breve{\Sigma}_B^*)$ are diagonal.

10.4.4 Proof for Case 3

When $c_{a_h} c_{b_h} = c_{a_{h'}} c_{b_{h'}}$ for (not all but) some pairs $h \neq h'$, we can show that $\hat{\Sigma}_A$ and $\breve{\Sigma}_B$ are block diagonal where the blocks correspond to the groups sharing the same $c_{a_h} c_{b_h}$. In each block, multiple equivalent solutions exist, one of which is a solution such that $(\hat{A}^*, \hat{B}^*, \hat{\Sigma}_A^*, \breve{\Sigma}_B^*)$ are diagonal. This completes the proof of Theorem 10.2.
10.5 Exact Global VB Solver (EGVBS)

Theorem 10.2 allows us to focus on the solutions such that $(\hat{A}, \hat{B}, \hat{\Sigma}_A, \breve{\Sigma}_B)$ are diagonal. Accordingly, we express the solution of the VB learning problem (10.11) with diagonal entries, i.e.,
\[
\hat{A} = \mathrm{Diag}_{M,H}(\hat{a}_1, \ldots, \hat{a}_H), \tag{10.41}
\]
\[
\hat{B} = \mathrm{Diag}_{M,H}(\hat{b}_1, \ldots, \hat{b}_H), \tag{10.42}
\]
\[
\hat{\Sigma}_A = \mathrm{Diag}(\hat{\sigma}^2_{a_1}, \ldots, \hat{\sigma}^2_{a_H}), \tag{10.43}
\]
\[
\breve{\Sigma}_B = \mathrm{Diag}\left( (\hat{\sigma}^2_{B_{1,1}}, \ldots, \hat{\sigma}^2_{B_{M,1}}), (\hat{\sigma}^2_{B_{1,2}}, \ldots, \hat{\sigma}^2_{B_{M,2}}), \ldots, (\hat{\sigma}^2_{B_{1,H}}, \ldots, \hat{\sigma}^2_{B_{M,H}}) \right), \tag{10.44}
\]
where $\mathrm{Diag}_{D_1,D_2}(\cdot)$ denotes the $D_1 \times D_2$ diagonal matrix with the specified diagonal entries. Remember that $J$ ($\leq \min(L, M)$) is the rank of the observed matrix $V$, and $\{\gamma_m\}$ are the singular values arranged in nonincreasing order. Without loss of generality, we assume that $\hat{a}_h, \hat{b}_h \in \mathbb{R}_+$ for all $h = 1, \ldots, H$. We can easily obtain the following theorem:

Theorem 10.4 Any local solution of the VB learning problem (10.11) satisfies, for all $h = 1, \ldots, H$,
\[
\hat{a}_h = \frac{\gamma_h^2}{\sigma^2} \hat{b}_h \hat{\sigma}^2_{a_h}, \tag{10.45}
\]
\[
\hat{\sigma}^2_{a_h} = \sigma^2 \left( \gamma_h^2 \hat{b}_h^2 + \sum_{m=1}^{J} \gamma_m^2 \hat{\sigma}^2_{B_{m,h}} + \frac{\sigma^2}{c^2_{a_h}} \right)^{-1}, \tag{10.46}
\]
\[
\hat{b}_h = \frac{\gamma_h^2}{\sigma^2} \hat{a}_h \hat{\sigma}^2_{B_{h,h}}, \tag{10.47}
\]
\[
\hat{\sigma}^2_{B_{m,h}} = \begin{cases}
\sigma^2 \left( \gamma_m^2 \left( \hat{a}_h^2 + M \hat{\sigma}^2_{a_h} \right) + \dfrac{\sigma^2}{c^2_{b_h}} \right)^{-1} & (\text{for } m = 1, \ldots, J), \\
c^2_{b_h} & (\text{for } m = J + 1, \ldots, M),
\end{cases} \tag{10.48}
\]
and has the free energy given by
\[
2F = LM \log(2\pi\sigma^2) + \frac{\sum_{h=1}^{J} \gamma_h^2}{\sigma^2} + \sum_{h=1}^{H} 2F_h, \tag{10.49}
\]
where
\[
\begin{aligned}
2F_h ={}& M \log \frac{c^2_{a_h}}{\hat{\sigma}^2_{a_h}} + \sum_{m=1}^{J} \log \frac{c^2_{b_h}}{\hat{\sigma}^2_{B_{m,h}}} - (M + J) + \frac{\hat{a}_h^2 + M \hat{\sigma}^2_{a_h}}{c^2_{a_h}} + \frac{\hat{b}_h^2 + \sum_{m=1}^{J} \hat{\sigma}^2_{B_{m,h}}}{c^2_{b_h}} \\
&+ \frac{1}{\sigma^2} \left\{ \gamma_h^2 \left( -2 \hat{a}_h \hat{b}_h + \hat{b}_h^2 ( \hat{a}_h^2 + M \hat{\sigma}^2_{a_h} ) \right) + \sum_{m=1}^{J} \gamma_m^2 \hat{\sigma}^2_{B_{m,h}} ( \hat{a}_h^2 + M \hat{\sigma}^2_{a_h} ) \right\}.
\end{aligned} \tag{10.50}
\]

Proof By substituting the diagonal expression, Eqs. (10.41) through (10.44), into the free energy (10.10), we have
\[
\begin{aligned}
2F ={}& LM \log(2\pi\sigma^2) + \frac{\sum_{h=1}^{J} \gamma_h^2}{\sigma^2} + M \sum_{h=1}^{H} \log \frac{c^2_{a_h}}{\hat{\sigma}^2_{a_h}} + \sum_{m=1}^{M} \sum_{h=1}^{H} \log \frac{c^2_{b_h}}{\hat{\sigma}^2_{B_{m,h}}} - 2MH \\
&+ \sum_{h=1}^{H} \left\{ \frac{1}{c^2_{a_h}} \left( \hat{a}_h^2 + M \hat{\sigma}^2_{a_h} \right) + \frac{1}{c^2_{b_h}} \left( \hat{b}_h^2 + \sum_{m=1}^{M} \hat{\sigma}^2_{B_{m,h}} \right) \right\} \\
&+ \frac{1}{\sigma^2} \sum_{h=1}^{H} \left\{ \gamma_h^2 \left( -2 \hat{a}_h \hat{b}_h + \hat{b}_h^2 ( \hat{a}_h^2 + M \hat{\sigma}^2_{a_h} ) \right) + \sum_{m=1}^{J} \gamma_m^2 \hat{\sigma}^2_{B_{m,h}} ( \hat{a}_h^2 + M \hat{\sigma}^2_{a_h} ) \right\}.
\end{aligned} \tag{10.51}
\]
Eqs. (10.45) through (10.48) are obtained as the stationary conditions of Eq. (10.51) that any solution satisfies, according to Lemma 10.1. By substituting Eq. (10.48) for $m = J + 1, \ldots, M$ into Eq. (10.51), we obtain Eq. (10.49).

For EVB learning, where the prior covariances $C_A, C_B$ are also estimated, we have the following theorem:

Theorem 10.5 Any local solution of the EVB learning problem (10.17) satisfies the following. For each $h = 1, \ldots, H$, $(\hat{a}_h, \hat{b}_h, \hat{\sigma}^2_{a_h}, \{\hat{\sigma}^2_{B_{m,h}}\}_{m=1}^{M}, c^2_{a_h}, c^2_{b_h})$ is either a (positive) stationary point that satisfies Eqs. (10.45) through (10.48) and
\[
c^2_{a_h} = \hat{a}_h^2 / M + \hat{\sigma}^2_{a_h}, \tag{10.52}
\]
\[
c^2_{b_h} = \left( \hat{b}_h^2 + \sum_{m=1}^{J} \hat{\sigma}^2_{B_{m,h}} \right) / J, \tag{10.53}
\]
or the null local solution defined by
\[
\hat{a}_h = \hat{b}_h = 0, \qquad \hat{\sigma}^2_{a_h} = c^2_{a_h} \to +0, \qquad \hat{\sigma}^2_{B_{m,h}} = c^2_{b_h} \to +0 \quad (\text{for } m = 1, \ldots, M), \tag{10.54}
\]
of which the contribution (10.50) to the free energy is
\[
F_h \to +0. \tag{10.55}
\]
The total free energy is given by Eq. (10.49).

Proof Considering the derivatives of Eq. (10.51) with respect to $c^2_{a_h}$ and $c^2_{b_h}$, we have
\[
M c^2_{a_h} = \hat{a}_h^2 + M \hat{\sigma}^2_{a_h}, \tag{10.56}
\]
\[
M c^2_{b_h} = \hat{b}_h^2 + \sum_{m=1}^{M} \hat{\sigma}^2_{B_{m,h}}, \tag{10.57}
\]
as stationary conditions. By using Eq. (10.48), we can easily obtain Eqs. (10.52) and (10.53). Unlike in VB learning, where Lemma 10.1 guarantees that any local solution is a stationary point, there exist nonstationary local solutions in EVB learning. We can confirm that, along any path such that
\[
\hat{a}_h = \hat{b}_h = 0, \qquad \hat{\sigma}^2_{a_h}, \hat{\sigma}^2_{B_{m,h}}, c^2_{a_h}, c^2_{b_h} \to +0 \quad \text{with } \beta_a = \frac{\hat{\sigma}^2_{a_h}}{c^2_{a_h}} \text{ and } \beta_b = \frac{\hat{\sigma}^2_{B_{m,h}}}{c^2_{b_h}} \text{ kept constant}, \tag{10.58}
\]
the free energy contribution (10.50) from the $h$th component decreases monotonically. Among the possible paths, $\beta_a = \beta_b = 1$ gives the lowest free energy (10.55).

Based on Theorem 10.5, we can obtain the following corollary for the global solution.

Corollary 10.6 The global solution of the EVB learning problem (10.17) can be found in the following way. For each $h = 1, \ldots, H$, find all stationary points that satisfy Eqs. (10.45) through (10.48), (10.52), and (10.53), and choose the one giving the minimum free energy contribution $F_h$. The chosen stationary point is the global solution if $F_h < 0$. Otherwise (including the case where no stationary point exists), the null local solution (10.54) is global.

Proof For each $h = 1, \ldots, H$, any candidate for a local solution is a stationary point or the null local solution. Therefore, if the minimum free energy contribution over all stationary points is negative, i.e., $F_h < 0$, the corresponding stationary point is the global minimizer. With this fact, Corollary 10.6 is a straightforward deduction from Theorem 10.5.

Taking account of the trivial relations $c^2_{b_h} = \hat{\sigma}^2_{B_{m,h}}$ for $m > J$, the stationary conditions consisting of Eqs. (10.45) through (10.48), (10.52), and (10.53) for each $h$ can be seen as a polynomial system, a set of polynomial equations, with $5 + J$ unknown variables, $(\hat{a}_h, \hat{b}_h, \hat{\sigma}^2_{a_h}, \{\hat{\sigma}^2_{B_{m,h}}\}_{m=1}^{J}, c^2_{a_h}, c^2_{b_h})$. Thus, Theorem 10.5 has decomposed the original problem with $O(M^2 H^2)$ unknown variables, for which the stationary conditions are given by Eqs. (10.13) through (10.16), (10.19), and (10.20), into $H$ subproblems with $O(J)$ unknown variables each.

Fortunately, there is a reliable numerical method to solve a polynomial system, called the homotopy method or continuation method (Drexler, 1978; Garcia and Zangwill, 1979; Gunji et al., 2004; Lee et al., 2008). It provides all isolated solutions to a system of $n$ polynomials $f(x) \equiv (f_1(x), \ldots, f_n(x)) = 0$ by defining a smooth set of homotopy systems with a parameter $t \in [0, 1]$, i.e., $g(x, t) \equiv (g_1(x, t), g_2(x, t), \ldots, g_n(x, t)) = 0$ such that one can continuously trace the solution path from the easiest ($t = 0$) to the target ($t = 1$). For empirical evaluation, which will be given in Section 10.8, we use HOM4PS2.0 (Lee et al., 2008), one of the most successful polynomial system solvers.

With the homotopy method in hand, Corollary 10.6 allows us to solve the EVB learning problem (10.17) in the following way, which we call the exact global VB solver (EGVBS). For each $h = 1, \ldots, H$, we first find all stationary points that satisfy the polynomial system, Eqs. (10.45) through (10.48), (10.52), and (10.53). After that, we discard the prohibitive solutions with complex numbers or negative variances, and then select the stationary point giving the minimum free energy contribution $F_h$, defined by Eq. (10.50).
The global solution is the selected stationary point if it satisfies Fh < 0; otherwise, the null local solution (10.54) is the global solution. Algorithm 20 summarizes the procedure of EGVBS. When the noise variance σ2 is unknown, we conduct a naive one-dimensional search to minimize the total free energy (10.49), with EGVBS applied for every candidate value of σ2 . It is straightforward to modify Algorithm 20 to solve the VB learning problem (10.11), where the prior covariances CA , C B are given. In this case, we should solve the polynomial system (10.45) through (10.48) in Step 3, and skip Step 6 since all local solutions are stationary points.
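The comparison against the null local solution can be illustrated by evaluating the free energy contribution of Eq. (10.50) directly. The following is a minimal sketch, assuming NumPy; the function name `free_energy_contribution`, the 0-based component index `h`, and the numerical values are illustrative assumptions. It evaluates $F_h$ along the path (10.58) with $\beta_a = \beta_b = 1$ and shows that the value shrinks toward zero from above.

```python
import numpy as np

def free_energy_contribution(a, b, s2a, s2B, c2a, c2b, gammas, h, M, sigma2):
    """F_h of Eq. (10.50).

    gammas: the J nonzero singular values; s2B: the J posterior variances
    of B for component h; h: 0-based index of the component (assumption).
    """
    gammas = np.asarray(gammas, dtype=float)
    s2B = np.asarray(s2B, dtype=float)
    J = gammas.size
    second_moment_a = a**2 + M * s2a          # a_h^2 + M * sigma_{a_h}^2
    two_F = (M * np.log(c2a / s2a) + np.sum(np.log(c2b / s2B)) - (M + J)
             + second_moment_a / c2a + (b**2 + np.sum(s2B)) / c2b
             + (gammas[h]**2 * (-2 * a * b + b**2 * second_moment_a)
                + np.sum(gammas**2 * s2B) * second_moment_a) / sigma2)
    return 0.5 * two_F

# Path (10.58) with beta_a = beta_b = 1: all variances equal eps, a = b = 0.
gammas, M, sigma2 = [3.0, 2.0, 1.0], 5, 1.0
path_values = []
for eps in (1e-1, 1e-2, 1e-3):
    F_h = free_energy_contribution(0.0, 0.0, eps, np.full(3, eps), eps, eps,
                                   gammas, 0, M, sigma2)
    path_values.append(F_h)
print(path_values)
```

A stationary point returned by the polynomial solver would be accepted as global only if the same function returns a negative value for it.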
Algorithm 20 Exact global VB solver (EGVBS) for LRSC.
1: Compute the SVD of $V = \Omega_V^{\mathrm{left}} \Gamma_V \Omega_V^{\mathrm{right}\top}$.
2: for $h = 1$ to $H$ do
3:   Find all solutions of the polynomial system, Eqs. (10.45) through (10.48), (10.52), and (10.53) by the homotopy method.
4:   Discard prohibitive solutions with complex numbers or negative variances.
5:   Select the stationary point giving the smallest $F_h$ (defined by Eq. (10.50)).
6:   The global solution for the $h$th component is the selected stationary point if it satisfies $F_h < 0$; otherwise, the null local solution (10.54) is the global solution.
7: end for
8: Compute $\hat{U} = \Omega_V^{\mathrm{right}} \hat{B} \hat{A}^\top \Omega_V^{\mathrm{right}\top}$.
9: Apply spectral clustering with the affinity matrix equal to $\mathrm{abs}(\hat{U}) + \mathrm{abs}(\hat{U}^\top)$.

10.6 Approximate Global VB Solver (AGVBS)

Theorems 10.4 and 10.5 significantly reduced the complexity of the optimization problem. However, EGVBS is still not applicable to data with typical problem sizes. This is because the homotopy method is not guaranteed to find all solutions in polynomial time in $J$, when the polynomial system involves $O(J)$ unknown variables. The following simple trick further reduces the complexity and leads to an efficient approximate solver. Let us impose an additional constraint that $\gamma_m^2 \hat{\sigma}^2_{B_{m,h}}$ are constant over $m = 1, \ldots, J$, i.e.,
\[
\gamma_m^2 \hat{\sigma}^2_{B_{m,h}} = \tilde{\sigma}^2_{b_h} \quad \text{for } m = 1, \ldots, J. \tag{10.59}
\]
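Under this constraint, the stationary conditions for the six unknowns $(\hat{a}_h, \hat{b}_h, \hat{\sigma}^2_{a_h}, \tilde{\sigma}^2_{b_h}, c^2_{a_h}, c^2_{b_h})$ (for each $h$) become similar to the stationary conditions for fully observed MF, which allows us to obtain the following theorem:

Theorem 10.7 Under the constraint (10.59), any stationary point of the free energy (10.50) for each $h$ satisfies the following polynomial equation for a single variable $\breve{\gamma}_h \in \mathbb{R}$:
\[
\xi_6 \breve{\gamma}_h^6 + \xi_5 \breve{\gamma}_h^5 + \xi_4 \breve{\gamma}_h^4 + \xi_3 \breve{\gamma}_h^3 + \xi_2 \breve{\gamma}_h^2 + \xi_1 \breve{\gamma}_h + \xi_0 = 0, \tag{10.60}
\]
where
\[
\xi_6 = \frac{\phi_h^2}{\gamma_h^2}, \tag{10.61}
\]
\[
\xi_5 = -2 \frac{\phi_h^2 M \sigma^2}{\gamma_h^3} + \frac{2 \phi_h}{\gamma_h}, \tag{10.62}
\]
\[
\xi_4 = \frac{\phi_h^2 M^2 \sigma^4}{\gamma_h^4} - \frac{2 \phi_h (2M - J) \sigma^2}{\gamma_h^2} + 1 + \frac{\phi_h^2 (M \sigma^2 - \gamma_h^2)}{\gamma_h^2}, \tag{10.63}
\]
\[
\xi_3 = \frac{2 \phi_h M (M - J) \sigma^4}{\gamma_h^3} - \frac{2 (M - J) \sigma^2}{\gamma_h} + \frac{\phi_h ((M + J) \sigma^2 - \gamma_h^2)}{\gamma_h} - \frac{\phi_h^2 M \sigma^2 (M \sigma^2 - \gamma_h^2)}{\gamma_h^3} + \frac{\phi_h (M \sigma^2 - \gamma_h^2)}{\gamma_h}, \tag{10.64}
\]
\[
\xi_2 = \frac{(M - J)^2 \sigma^4}{\gamma_h^2} - \frac{\phi_h M \sigma^2 ((M + J) \sigma^2 - \gamma_h^2)}{\gamma_h^2} + ((M + J) \sigma^2 - \gamma_h^2) - \frac{\phi_h (M - J) \sigma^2 (M \sigma^2 - \gamma_h^2)}{\gamma_h^2}, \tag{10.65}
\]
\[
\xi_1 = - \frac{(M - J) \sigma^2 ((M + J) \sigma^2 - \gamma_h^2)}{\gamma_h} + \frac{\phi_h M J \sigma^4}{\gamma_h}, \tag{10.66}
\]
\[
\xi_0 = M J \sigma^4. \tag{10.67}
\]
Here $\phi_h = 1 - \gamma_h^2 / \bar{\gamma}^2$ for $\bar{\gamma}^2 = \left( \sum_{m=1}^{J} \gamma_m^{-2} / J \right)^{-1}$. For each real solution $\breve{\gamma}_h$ such that
\[
\hat{\gamma}_h = \breve{\gamma}_h + \gamma_h - \frac{M \sigma^2}{\gamma_h}, \tag{10.68}
\]
\[
\kappa_h = \gamma_h^2 - (M + J) \sigma^2 - \left( M \sigma^2 - \gamma_h^2 \right) \phi_h \frac{\breve{\gamma}_h}{\gamma_h}, \tag{10.69}
\]
\[
\hat{\tau}_h = \frac{1}{2MJ} \left( \kappa_h + \sqrt{ \kappa_h^2 - 4 M J \sigma^4 \left( 1 + \phi_h \frac{\breve{\gamma}_h}{\gamma_h} \right) } \right), \tag{10.70}
\]
\[
\hat{\delta}_h = \frac{\sigma^2}{\sqrt{\hat{\tau}_h}} \left( \gamma_h - \frac{M \sigma^2}{\gamma_h} - \hat{\gamma}_h \right)^{-1}, \tag{10.71}
\]
are real and positive, there exists the corresponding stationary point given by
\[
\left( \hat{a}_h, \hat{b}_h, \hat{\sigma}^2_{a_h}, \tilde{\sigma}^2_{b_h}, c^2_{a_h}, c^2_{b_h} \right)
= \left( \sqrt{\hat{\gamma}_h \hat{\delta}_h}, \; \frac{\sqrt{\hat{\gamma}_h / \hat{\delta}_h}}{\gamma_h}, \; \frac{\hat{\delta}_h \sigma^2}{\gamma_h}, \; \frac{\sigma^2}{\gamma_h \hat{\delta}_h - \phi_h \sigma^2 / \sqrt{\hat{\tau}_h}}, \; \sqrt{\hat{\tau}_h}, \; \frac{\sqrt{\hat{\tau}_h}}{\gamma_h^2} \right). \tag{10.72}
\]

Given the noise variance $\sigma^2$, computing the coefficients (10.61) through (10.67) is straightforward. Theorem 10.7 implies that the following algorithm, which we call the AGVBS, provides the global solution of the EVB learning problem (10.17) under the additional constraint (10.59). After computing the SVD of the observed matrix $V$, AGVBS first finds all real solutions of the sixth-order polynomial equation (10.60) by using, e.g., the “roots” command in MATLAB®, for each $h$. Then, it discards the prohibitive solutions such that any of Eqs. (10.68) through (10.71) gives a complex or negative number. For each of the retained solutions, AGVBS computes the corresponding stationary point by Eq. (10.72), along with the free energy contribution $F_h$ by Eq. (10.50). Here, Eq. (10.59) is used for retrieving the original posterior variances $\{\hat{\sigma}^2_{B_{m,h}}\}_{m=1}^{J}$ for $B$. Finally, AGVBS selects the stationary point giving the minimum free energy contribution $F_h$. The global solution is the selected stationary point if it satisfies $F_h < 0$; otherwise, the null local solution (10.54) is the global solution. Algorithm 21 summarizes the procedure of AGVBS.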
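The root-finding and filtering steps of AGVBS can be sketched for a single component as follows. This is a minimal illustration assuming NumPy, with `numpy.roots` standing in for the MATLAB command mentioned above; the function names `agvbs_coefficients` and `agvbs_candidates`, the tolerance `tol`, the extra positivity check on the variance denominator, and the numerical inputs are assumptions made for this sketch. The free energy evaluation and the comparison with the null solution are omitted.

```python
import numpy as np

def agvbs_coefficients(g, gbar2, M, J, s2):
    """Coefficients (xi_6, ..., xi_0) of Eq. (10.60).

    g: singular value gamma_h; gbar2: bar-gamma^2; s2: noise variance sigma^2.
    """
    phi = 1.0 - g**2 / gbar2
    a = (M + J) * s2 - g**2        # (M + J) sigma^2 - gamma_h^2
    b = M * s2 - g**2              # M sigma^2 - gamma_h^2
    xi6 = phi**2 / g**2
    xi5 = -2 * phi**2 * M * s2 / g**3 + 2 * phi / g
    xi4 = (phi**2 * M**2 * s2**2 / g**4 - 2 * phi * (2 * M - J) * s2 / g**2
           + 1 + phi**2 * b / g**2)
    xi3 = (2 * phi * M * (M - J) * s2**2 / g**3 - 2 * (M - J) * s2 / g
           + phi * a / g - phi**2 * M * s2 * b / g**3 + phi * b / g)
    xi2 = ((M - J)**2 * s2**2 / g**2 - phi * M * s2 * a / g**2
           + a - phi * (M - J) * s2 * b / g**2)
    xi1 = -(M - J) * s2 * a / g + phi * M * J * s2**2 / g
    xi0 = M * J * s2**2
    return np.array([xi6, xi5, xi4, xi3, xi2, xi1, xi0])

def agvbs_candidates(g, gbar2, M, J, s2, tol=1e-8):
    """Candidate stationary points of Eq. (10.72) for one component."""
    phi = 1.0 - g**2 / gbar2
    roots = np.roots(agvbs_coefficients(g, gbar2, M, J, s2))
    is_real = np.abs(roots.imag) < tol * (1.0 + np.abs(roots))
    candidates = []
    for x in roots[is_real].real:                  # x plays the role of breve-gamma_h
        g_hat = x + g - M * s2 / g                 # Eq. (10.68)
        w = 1.0 + phi * x / g
        kappa = g**2 - (M + J) * s2 - (M * s2 - g**2) * phi * x / g  # Eq. (10.69)
        disc = kappa**2 - 4 * M * J * s2**2 * w
        if g_hat <= 0 or kappa <= 0 or disc < 0:
            continue
        tau = (kappa + np.sqrt(disc)) / (2 * M * J)                  # Eq. (10.70)
        denom = g - M * s2 / g - g_hat
        if tau <= 0 or denom <= 0:
            continue
        delta = s2 / (np.sqrt(tau) * denom)                          # Eq. (10.71)
        var_denom = g * delta - phi * s2 / np.sqrt(tau)
        if var_denom <= 0:                         # would give a negative variance
            continue
        candidates.append({
            "a": np.sqrt(g_hat * delta), "b": np.sqrt(g_hat / delta) / g,
            "s2a": delta * s2 / g, "s2b_tilde": s2 / var_denom,
            "c2a": np.sqrt(tau), "c2b": np.sqrt(tau) / g**2,
        })
    return candidates

# Illustrative inputs: J = 3 singular values, M = 6 samples, sigma^2 = 0.5.
gam = np.array([5.0, 3.0, 2.0])
gbar2 = gam.size / np.sum(gam**-2.0)
print(agvbs_candidates(gam[0], gbar2, M=6, J=3, s2=0.5))
```

Because each component requires only the roots of one sixth-order polynomial, the cost per component does not grow with the number of unknowns the way the homotopy method does.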
Algorithm 21 Approximate global VB solver (AGVBS) for LRSC.
1: Compute the SVD of $V = \Omega_V^{\mathrm{left}} \Gamma_V \Omega_V^{\mathrm{right}\top}$.
2: for $h = 1$ to $H$ do
3:   Find all real solutions of the sixth-order polynomial equation (10.60).
4:   Discard prohibitive solutions such that any of Eqs. (10.68) through (10.71) gives a complex or negative number.
5:   Compute the corresponding stationary point by Eq. (10.72) and its free energy contribution $F_h$ by Eq. (10.50) for each of the retained solutions.
6:   Select the stationary point giving the minimum free energy contribution $F_h$.
7:   The global solution for the $h$th component is the selected stationary point if it satisfies $F_h < 0$; otherwise, the null local solution (10.54) is the global solution.
8: end for
9: Compute $\hat{U} = \Omega_V^{\mathrm{right}} \hat{B} \hat{A}^\top \Omega_V^{\mathrm{right}\top}$.
10: Apply spectral clustering with the affinity matrix equal to $\mathrm{abs}(\hat{U}) + \mathrm{abs}(\hat{U}^\top)$.
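The last two steps of the algorithm, which are shared by EGVBS and AGVBS, can be sketched as follows. This is a minimal illustration assuming NumPy; the function name `lrsc_affinity` and the placeholder diagonal values are hypothetical, and the diagonal estimates would in practice come from the per-component solver.

```python
import numpy as np

def lrsc_affinity(omega_right, A_hat, B_hat):
    """Affinity abs(U) + abs(U^T) with U = Omega_right B A^T Omega_right^T."""
    U = omega_right @ B_hat @ A_hat.T @ omega_right.T
    return np.abs(U) + np.abs(U.T)

rng = np.random.default_rng(0)
L, M, H = 5, 8, 5
V = rng.standard_normal((L, M))

# Full SVD: the rows of Vt are right singular vectors, so Omega_right = Vt^T.
_, _, Vt = np.linalg.svd(V, full_matrices=True)
omega_right = Vt.T

# Placeholder diagonal estimates (M x H), standing in for the solver output.
A_hat = np.zeros((M, H))
B_hat = np.zeros((M, H))
np.fill_diagonal(A_hat, [0.9, 0.8, 0.0, 0.0, 0.0])
np.fill_diagonal(B_hat, [1.0, 0.7, 0.0, 0.0, 0.0])

W = lrsc_affinity(omega_right, A_hat, B_hat)
print(W.shape)
```

The resulting matrix `W` is symmetric and nonnegative, so it could be passed to any spectral clustering routine that accepts a precomputed affinity.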
As in EGVBS, a naive one-dimensional search is conducted when the noise variance σ2 is unknown. In Section 10.8, we show that AGVBS is practically a good alternative to the Kronecker product covariance approximation (KPCA), an approximate EVB algorithm for LRSC under the Kronecker product covariance constraint (see Section 3.4.2), in terms of accuracy and computation time.
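10.7 Proof of Theorem 10.7

Let us rescale $\hat{b}_h$ and $c^2_{b_h}$ as follows:
\[
\tilde{b}_h = \gamma_h \hat{b}_h, \qquad \tilde{c}^2_{b_h} = \gamma_h^2 c^2_{b_h}. \tag{10.73}
\]
By substituting Eqs. (10.59) and (10.73) into Eq. (10.50), we have
\[
\begin{aligned}
2F_h ={}& M \log \frac{c^2_{a_h}}{\hat{\sigma}^2_{a_h}} + J \log \frac{\tilde{c}^2_{b_h}}{\tilde{\sigma}^2_{b_h}} + \frac{1}{c^2_{a_h}} \left( \hat{a}_h^2 + M \hat{\sigma}^2_{a_h} \right) + \frac{1}{\tilde{c}^2_{b_h}} \left( \tilde{b}_h^2 + J \frac{\gamma_h^2}{\bar{\gamma}^2} \tilde{\sigma}^2_{b_h} \right) \\
&+ \frac{1}{\sigma^2} \left\{ -2 \gamma_h \hat{a}_h \tilde{b}_h + ( \hat{a}_h^2 + M \hat{\sigma}^2_{a_h} )( \tilde{b}_h^2 + J \tilde{\sigma}^2_{b_h} ) \right\} - (M + J) + \sum_{m=1}^{J} \log \frac{\gamma_m^2}{\gamma_h^2},
\end{aligned} \tag{10.74}
\]
where
\[
\bar{\gamma}^2 = \left( \sum_{m=1}^{J} \gamma_m^{-2} / J \right)^{-1}.
\]
Ignoring the last two constant terms, we find that Eq. (10.74) is in almost the same form as the free energy of fully observed MF for a $J \times M$ observed matrix (see Eq. (6.43)). The only difference is in the fourth term: $J \tilde{\sigma}^2_{b_h}$ is multiplied by $\gamma_h^2 / \bar{\gamma}^2$. Note that, as in MF, the free energy (10.74) is invariant under the following transformation:
\[
\left( \hat{a}_h, \hat{\sigma}^2_{a_h}, \tilde{b}_h, \tilde{\sigma}^2_{b_h}, c^2_{a_h}, \tilde{c}^2_{b_h} \right) \to \left( s_h \hat{a}_h, s_h^2 \hat{\sigma}^2_{a_h}, s_h^{-1} \tilde{b}_h, s_h^{-2} \tilde{\sigma}^2_{b_h}, s_h^2 c^2_{a_h}, s_h^{-2} \tilde{c}^2_{b_h} \right)
\]
for any $\{s_h \neq 0; h = 1, \ldots, H\}$. Accordingly, we fix the ratio between $c_{a_h}$ and $\tilde{c}_{b_h}$ to $c_{a_h} / \tilde{c}_{b_h} = 1$ without loss of generality.

By differentiating the free energy (10.74) with respect to $\hat{a}_h$, $\hat{\sigma}^2_{a_h}$, $\tilde{b}_h$, $\tilde{\sigma}^2_{b_h}$, $c^2_{a_h}$, and $\tilde{c}^2_{b_h}$, respectively, we obtain the following stationary conditions:
\[
\hat{a}_h = \frac{1}{\sigma^2} \gamma_h \tilde{b}_h \hat{\sigma}^2_{a_h}, \tag{10.75}
\]
\[
\hat{\sigma}^2_{a_h} = \sigma^2 \left( \tilde{b}_h^2 + J \tilde{\sigma}^2_{b_h} + \frac{\sigma^2}{c^2_{a_h}} \right)^{-1}, \tag{10.76}
\]
\[
\tilde{b}_h = \gamma_h \hat{a}_h \left( \hat{a}_h^2 + M \hat{\sigma}^2_{a_h} + \frac{\sigma^2}{\tilde{c}^2_{b_h}} \right)^{-1}, \tag{10.77}
\]
\[
\tilde{\sigma}^2_{b_h} = \sigma^2 \left( \hat{a}_h^2 + M \hat{\sigma}^2_{a_h} + \frac{\sigma^2 \gamma_h^2}{\tilde{c}^2_{b_h} \bar{\gamma}^2} \right)^{-1}, \tag{10.78}
\]
\[
c^2_{a_h} = \hat{a}_h^2 / M + \hat{\sigma}^2_{a_h}, \tag{10.79}
\]
\[
\tilde{c}^2_{b_h} = \tilde{b}_h^2 / J + \frac{\gamma_h^2}{\bar{\gamma}^2} \tilde{\sigma}^2_{b_h}. \tag{10.80}
\]
Note that, unlike the case of fully observed MF, $A$ and $B$ are not symmetric, which makes analysis more involved. Apparently, if $\hat{a}_h = 0$ or $\tilde{b}_h = 0$, the null solution (10.54) gives the minimum $F_h \to +0$ of the free energy (10.74). In the following, we identify the positive stationary points such that $\hat{a}_h, \tilde{b}_h > 0$. To this end, we derive a polynomial equation with a single unknown variable from the stationary conditions (10.75) through (10.80). Let
\[
\hat{\gamma}_h = \hat{a}_h \tilde{b}_h, \tag{10.81}
\]
\[
\hat{\delta}_h = \hat{a}_h / \tilde{b}_h. \tag{10.82}
\]
From Eqs. (10.75) through (10.78), we obtain
\[
\gamma_h^2 = \left( \hat{a}_h^2 + M \hat{\sigma}^2_{a_h} + \frac{\sigma^2}{\tilde{c}^2_{b_h}} \right) \left( \tilde{b}_h^2 + J \tilde{\sigma}^2_{b_h} + \frac{\sigma^2}{c^2_{a_h}} \right), \tag{10.83}
\]
\[
\gamma_h \hat{\delta}_h^{-1} = \tilde{b}_h^2 + J \tilde{\sigma}^2_{b_h} + \frac{\sigma^2}{c^2_{a_h}}, \tag{10.84}
\]
\[
\gamma_h \hat{\delta}_h = \hat{a}_h^2 + M \hat{\sigma}^2_{a_h} + \frac{\sigma^2}{\tilde{c}^2_{b_h}}. \tag{10.85}
\]
Substituting Eq. (10.84) into Eq. (10.76) gives
\[
\hat{\sigma}^2_{a_h} = \frac{\hat{\delta}_h \sigma^2}{\gamma_h}. \tag{10.86}
\]
Substituting Eq. (10.85) into Eq. (10.78) gives
\[
\tilde{\sigma}^2_{b_h} = \frac{\sigma^2}{\gamma_h \hat{\delta}_h - \phi_h \dfrac{\sigma^2}{\tilde{c}^2_{b_h}}}, \tag{10.87}
\]
where
\[
\phi_h = 1 - \frac{\gamma_h^2}{\bar{\gamma}^2}.
\]
Thus, the variances $\hat{\sigma}^2_{a_h}$ and $\tilde{\sigma}^2_{b_h}$ have been written as functions of $\hat{\delta}_h$ and $\tilde{c}^2_{b_h}$. Substituting Eqs. (10.86) and (10.87) into Eq. (10.78) gives
\[
\left( \hat{a}_h^2 + M \frac{\hat{\delta}_h \sigma^2}{\gamma_h} + \frac{\sigma^2 \gamma_h^2}{\tilde{c}^2_{b_h} \bar{\gamma}^2} \right) \frac{\sigma^2}{\gamma_h \hat{\delta}_h - \phi_h \dfrac{\sigma^2}{\tilde{c}^2_{b_h}}} = \sigma^2,
\]
and therefore, using $\hat{a}_h^2 = \hat{\gamma}_h \hat{\delta}_h$,
\[
\hat{\gamma}_h + \frac{M \sigma^2}{\gamma_h} - \gamma_h + \frac{\sigma^2}{\tilde{c}^2_{b_h}} \hat{\delta}_h^{-1} = 0.
\]
Solving the preceding equation with respect to $\hat{\delta}_h^{-1}$ gives
\[
\hat{\delta}_h^{-1} = \frac{\tilde{c}^2_{b_h}}{\sigma^2} \left( \gamma_h - \frac{M \sigma^2}{\gamma_h} - \hat{\gamma}_h \right). \tag{10.88}
\]
Thus, we have obtained an expression of $\hat{\delta}_h$ as a function of $\hat{\gamma}_h$ and $\tilde{c}^2_{b_h}$. Substituting Eqs. (10.86) and (10.87) into Eq. (10.76) gives
\[
\frac{\hat{\delta}_h \sigma^2}{\gamma_h} \left( \tilde{b}_h^2 + J \frac{\sigma^2}{\gamma_h \hat{\delta}_h - \phi_h \dfrac{\sigma^2}{\tilde{c}^2_{b_h}}} + \frac{\sigma^2}{c^2_{a_h}} \right) = \sigma^2.
\]
Rearranging the previous equation with respect to $\hat{\delta}_h^{-1}$, using $\tilde{b}_h^2 = \hat{\gamma}_h \hat{\delta}_h^{-1}$, gives
\[
\frac{\phi_h \sigma^2}{\gamma_h \tilde{c}^2_{b_h}} \left( \gamma_h - \hat{\gamma}_h \right) \hat{\delta}_h^{-2} + \left( \hat{\gamma}_h - \gamma_h + \frac{J \sigma^2}{\gamma_h} - \frac{\phi_h \sigma^4}{\gamma_h c^2_{a_h} \tilde{c}^2_{b_h}} \right) \hat{\delta}_h^{-1} + \frac{\sigma^2}{c^2_{a_h}} = 0. \tag{10.89}
\]
Substituting Eq. (10.88) into Eq. (10.89), we have
\[
\frac{\phi_h}{\gamma_h} \left( \gamma_h - \hat{\gamma}_h \right) \left( \gamma_h - \frac{M \sigma^2}{\gamma_h} - \hat{\gamma}_h \right)^2 + \left( \hat{\gamma}_h - \gamma_h + \frac{J \sigma^2}{\gamma_h} - \frac{\phi_h \sigma^4}{\gamma_h c^2_{a_h} \tilde{c}^2_{b_h}} \right) \left( \gamma_h - \frac{M \sigma^2}{\gamma_h} - \hat{\gamma}_h \right) + \frac{\sigma^4}{c^2_{a_h} \tilde{c}^2_{b_h}} = 0. \tag{10.90}
\]
Thus we have derived an equation that includes only two unknown variables, $\hat{\gamma}_h$ and $c^2_{a_h} \tilde{c}^2_{b_h}$.

Next we will obtain another equation that includes only $\hat{\gamma}_h$ and $c^2_{a_h} \tilde{c}^2_{b_h}$. Substituting Eqs. (10.79) and (10.80) into Eq. (10.83), we have
\[
\gamma_h^2 = \left( M c^2_{a_h} + \frac{\sigma^2}{\tilde{c}^2_{b_h}} \right) \left( J \tilde{c}^2_{b_h} + J \phi_h \tilde{\sigma}^2_{b_h} + \frac{\sigma^2}{c^2_{a_h}} \right). \tag{10.91}
\]
Substituting Eq. (10.88) into Eq. (10.87) gives
\[
\tilde{\sigma}^2_{b_h} = \frac{ \tilde{c}^2_{b_h} \left( \gamma_h - \frac{M \sigma^2}{\gamma_h} - \hat{\gamma}_h \right) }{ \gamma_h - \phi_h \left( \gamma_h - \frac{M \sigma^2}{\gamma_h} - \hat{\gamma}_h \right) }. \tag{10.92}
\]
Substituting Eq. (10.92) into Eq. (10.91) gives
\[
\gamma_h^2 = M J c^2_{a_h} \tilde{c}^2_{b_h} + (M + J) \sigma^2 + \frac{\sigma^4}{c^2_{a_h} \tilde{c}^2_{b_h}} + J \phi_h \frac{ \left( M c^2_{a_h} \tilde{c}^2_{b_h} + \sigma^2 \right) \left( \gamma_h - \frac{M \sigma^2}{\gamma_h} - \hat{\gamma}_h \right) }{ \gamma_h - \phi_h \left( \gamma_h - \frac{M \sigma^2}{\gamma_h} - \hat{\gamma}_h \right) }.
\]
Rearranging the preceding equation with respect to $c^2_{a_h} \tilde{c}^2_{b_h}$, we have
\[
M J c^4_{a_h} \tilde{c}^4_{b_h} + \left( (M + J) \sigma^2 - \gamma_h^2 + \left( M \sigma^2 - \gamma_h^2 \right) \phi_h \frac{\breve{\gamma}_h}{\gamma_h} \right) c^2_{a_h} \tilde{c}^2_{b_h} + \sigma^4 \left( 1 + \phi_h \frac{\breve{\gamma}_h}{\gamma_h} \right) = 0, \tag{10.93}
\]
where
\[
\breve{\gamma}_h = \hat{\gamma}_h - \left( \gamma_h - \frac{M \sigma^2}{\gamma_h} \right). \tag{10.94}
\]
The solution of Eq. (10.93) with respect to $c^2_{a_h} \tilde{c}^2_{b_h}$ is given by
\[
c^2_{a_h} \tilde{c}^2_{b_h} = \frac{ \kappa_h + \sqrt{ \kappa_h^2 - 4 M J \sigma^4 \left( 1 + \phi_h \frac{\breve{\gamma}_h}{\gamma_h} \right) } }{2MJ}, \tag{10.95}
\]
where
\[
\kappa_h = \gamma_h^2 - (M + J) \sigma^2 - \left( M \sigma^2 - \gamma_h^2 \right) \phi_h \frac{\breve{\gamma}_h}{\gamma_h}. \tag{10.96}
\]
By using Eq. (10.94), Eq. (10.90) can be rewritten as
\[
\left( \frac{1}{\gamma_h} \phi_h \breve{\gamma}_h + 1 \right) \frac{\sigma^4}{c^2_{a_h} \tilde{c}^2_{b_h}} - \left( \frac{1}{\gamma_h} \phi_h \left( \breve{\gamma}_h - \frac{M \sigma^2}{\gamma_h} \right) + 1 \right) \breve{\gamma}_h^2 + \frac{(M - J) \sigma^2}{\gamma_h} \breve{\gamma}_h = 0. \tag{10.97}
\]
Thus, we have obtained two equations, Eqs. (10.95) and (10.97), that relate two unknown variables, $\hat{\gamma}_h$ (or $\breve{\gamma}_h$) and $c^2_{a_h} \tilde{c}^2_{b_h}$. Substituting Eq. (10.95) into Eq. (10.97) gives a polynomial equation involving only a single unknown variable $\breve{\gamma}_h$. With some algebra, we obtain Eq. (10.60).

Let
\[
\hat{\tau}_h = c^2_{a_h} \tilde{c}^2_{b_h}. \tag{10.98}
\]
Since we fixed the arbitrary ratio to $c^2_{a_h} / \tilde{c}^2_{b_h} = 1$, we have
\[
c^2_{a_h} = \sqrt{\hat{\tau}_h}, \tag{10.99}
\]
\[
\tilde{c}^2_{b_h} = \sqrt{\hat{\tau}_h}. \tag{10.100}
\]
Some solutions of Eq. (10.60) have no corresponding points in the problem domain (10.18). Assume that a solution $\breve{\gamma}_h$ is associated with a point in the domain. Then $\hat{\gamma}_h$ is given by Eq. (10.94), which is real and positive by its definition (10.81). $\hat{\tau}_h$ is defined and given, respectively, by Eqs. (10.98) and (10.95), which is real and positive. $\kappa_h$, defined by Eq. (10.96), is also real and positive, since $\hat{\tau}_h$ cannot be real and positive otherwise. $\hat{\delta}_h$ is given by Eq. (10.88), which is real and positive by its definition (10.82). Finally, remembering the variable change (10.73), we can obtain Eq. (10.72) from Eqs. (10.81), (10.82), (10.86), (10.87), (10.99), and (10.100), which completes the proof of Theorem 10.7.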
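The elimination that leads from Eqs. (10.93) and (10.97) to the sixth-order polynomial (10.60) can be sketched as follows. The shorthand $w$, $P$, and $\tau = c^2_{a_h} \tilde{c}^2_{b_h}$ is introduced here only for illustration and does not appear in the text.

```latex
\begin{align*}
w &= 1 + \frac{\phi_h \breve{\gamma}_h}{\gamma_h}, \qquad
P = \frac{\phi_h}{\gamma_h} \breve{\gamma}_h^3
  + \left( 1 - \frac{\phi_h M \sigma^2}{\gamma_h^2} \right) \breve{\gamma}_h^2
  - \frac{(M - J) \sigma^2}{\gamma_h} \breve{\gamma}_h, \\
\text{Eq. (10.97):}\quad & \frac{w \sigma^4}{\tau} = P
  \;\Longrightarrow\; \tau = \frac{w \sigma^4}{P}, \\
\text{Eq. (10.93):}\quad & M J \tau^2 - \kappa_h \tau + \sigma^4 w = 0, \\
\text{substituting } \tau:\quad &
  \frac{M J w^2 \sigma^8}{P^2} - \frac{\kappa_h w \sigma^4}{P} + \sigma^4 w = 0
  \;\Longrightarrow\; P^2 - \kappa_h P + M J \sigma^4 w = 0.
\end{align*}
```

Since $P$ is cubic and $\kappa_h$ and $w$ are linear in $\breve{\gamma}_h$, the last equation is of sixth order. Its leading coefficient is $\phi_h^2 / \gamma_h^2$ and its constant term is $M J \sigma^4$, in agreement with $\xi_6$ and $\xi_0$; the remaining coefficients follow by collecting powers of $\breve{\gamma}_h$.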
10.8 Empirical Evaluation

In this section, we empirically compare the global solvers, EGVBS (Algorithm 20) and AGVBS (Algorithm 21), with the standard iterative algorithm (Algorithm 4 in Section 3.4.2) and its approximation (Algorithm 5 in Section 3.4.2), which we here call the standard VB (SVB) iteration and the KPCA iteration, respectively. We assume that the prior covariances $(C_A, C_B)$ and the noise variance $\sigma^2$ are unknown and estimated from observation. We use the full-rank model (i.e., $H = \min(L, M)$), and expect EVB learning to automatically find the true rank without any parameter tuning.
Artificial Data Experiment

We first conducted an experiment with a small artificial data set (“artificial small”), on which the exact algorithms, i.e., EGVBS and the SVB iteration, are computationally tractable. Through this experiment, we can assess the accuracy of the efficient approximate solvers, i.e., AGVBS and the KPCA iteration. We randomly created $M = 75$ samples in the $L = 10$ dimensional space. We assumed $K = 2$ clusters: $M^{(1)*} = 50$ samples lie in a $H^{(1)*} = 3$-dimensional subspace, and the other $M^{(2)*} = 25$ samples lie in a $H^{(2)*} = 1$-dimensional subspace. For each cluster $k$, we independently drew $M^{(k)*}$ samples from $\mathrm{Gauss}_{H^{(k)*}}(0, 10 \cdot I_{H^{(k)*}})$, and projected them onto the observed $L$-dimensional space by $R^{(k)} \in \mathbb{R}^{L \times H^{(k)*}}$, each entry of which follows $R^{(k)}_{l,h} \sim \mathrm{Gauss}_1(0, 1)$. Thus, we obtained a noiseless matrix $V^{(k)*} \in \mathbb{R}^{L \times M^{(k)*}}$ for the $k$th cluster. Concatenating all clusters, $V^* = (V^{(1)*}, \ldots, V^{(K)*})$, and adding random noise subject to $\mathrm{Gauss}_1(0, 1)$ to each entry gave an artificial observed matrix $V \in \mathbb{R}^{L \times M}$, where $M = \sum_{k=1}^{K} M^{(k)*} = 75$. The true rank of $V^*$ is given by $H^* = \min(\sum_{k=1}^{K} H^{(k)*}, L, M) = 4$. Note that $H^*$ is different from the rank of the observed matrix $V$, which is almost surely equal to $J = \min(L, M)$ ($= 10$) under the Gaussian noise.

Figure 10.1 shows the free energy, the computation time, and the estimated rank of $\hat{U} = \hat{B} \hat{A}^\top$ over iterations. For the iterative methods, we show the results of 10 trials starting from different random initializations. We can see that AGVBS gives almost the same free energy as the exact methods (EGVBS and the SVB iteration). The exact methods require large computation costs: EGVBS took 621 sec to obtain the global solution, and the SVB iteration took ∼100 sec to achieve almost the same free energy. On the other hand, the approximate methods are much faster: AGVBS took less than 1 sec, and the KPCA iteration took ∼10 sec. Since the KPCA iteration had not converged after 250 iterations, we continued its computation until 2,500 iterations, and found that it sometimes converges to a local solution with a significantly higher free energy than the other methods. EGVBS, AGVBS, and the SVB iteration successfully found the true rank $H^* = 4$, while the KPCA iteration sometimes failed to find it. This difference is actually reflected in the clustering error, i.e., the misclassification rate with all possible cluster correspondences taken into account, after spectral clustering (Shi and Malik, 2000) is performed: 1.3% for EGVBS, AGVBS, and the SVB iteration, and 2.4% for the KPCA iteration.
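The construction of the “artificial small” data set can be sketched as follows. This is a minimal illustration assuming NumPy; the function name `make_subspace_data`, its parameters, and the random seed are hypothetical, so the resulting matrix is not the one used in the reported experiment.

```python
import numpy as np

def make_subspace_data(L, sizes, dims, latent_var=10.0, noise_var=1.0, seed=0):
    """Union-of-subspaces data: returns noisy V, noiseless V*, and labels."""
    rng = np.random.default_rng(seed)
    blocks, labels = [], []
    for k, (M_k, H_k) in enumerate(zip(sizes, dims)):
        # Latent samples from Gauss(0, latent_var * I) in H_k dimensions.
        latent = rng.normal(0.0, np.sqrt(latent_var), size=(H_k, M_k))
        # Random projection R^(k) with standard Gaussian entries.
        R = rng.normal(0.0, 1.0, size=(L, H_k))
        blocks.append(R @ latent)
        labels += [k] * M_k
    V_clean = np.hstack(blocks)
    V = V_clean + rng.normal(0.0, np.sqrt(noise_var), size=V_clean.shape)
    return V, V_clean, np.array(labels)

# Settings of the "artificial small" data set: L = 10, M = 50 + 25 = 75.
V, V_clean, labels = make_subspace_data(L=10, sizes=(50, 25), dims=(3, 1))
print(V.shape, np.linalg.matrix_rank(V_clean), np.linalg.matrix_rank(V))
```

The noiseless matrix has rank 4, the sum of the subspace dimensions, whereas the noisy observation has full row rank, which is why the rank has to be inferred rather than read off from $V$.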
[Figure 10.1: three panels, (a) Free energy, (b) Computation time (Time (sec)), and (c) Estimated rank, each plotted against Iteration (0 to 250) for EGVBS, AGVBS, the SVB iteration, and the KPCA iteration.]
Figure 10.1 Results on the “artificial small” data set ($L = 10$, $M = 75$, $H^* = 4$). The clustering errors were 1.3% for EGVBS, AGVBS, and the SVB iteration, and 2.4% for the KPCA iteration.
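The clustering error reported here, the misclassification rate minimized over all cluster correspondences, can be computed as in the following minimal sketch. It assumes NumPy and integer labels in `0..K-1`; the function name `clustering_error` is illustrative, and the brute-force search over permutations is only practical for a small number of clusters.

```python
from itertools import permutations

import numpy as np

def clustering_error(true_labels, pred_labels, K):
    """Misclassification rate minimized over all relabelings of K clusters."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    best = 1.0
    for perm in permutations(range(K)):
        # Relabel predicted cluster j as perm[j] and count disagreements.
        mapped = np.array(perm)[pred_labels]
        best = min(best, float(np.mean(mapped != true_labels)))
    return best

print(clustering_error([0, 0, 1, 1], [1, 1, 0, 1], K=2))
```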
Next we conducted the same experiment with a larger artificial data set (“artificial large”) ($L = 50$, $K = 4$, $(M^{(1)*}, \ldots, M^{(K)*}) = (100, 50, 50, 25)$, $(H^{(1)*}, \ldots, H^{(K)*}) = (2, 1, 1, 1)$), on which EGVBS and the SVB iteration are computationally intractable. Figure 10.2 shows the results with AGVBS and the KPCA iteration. The advantage in computation time is clear: AGVBS only took ∼0.1 sec, while the KPCA iteration took more than 100 sec. The clustering errors were 4.0% for AGVBS and 11.2% for the KPCA iteration.

[Figure 10.2: three panels, (a) Free energy, (b) Computation time (Time (sec)), and (c) Estimated rank, each plotted against Iteration (0 to 2,500) for AGVBS and the KPCA iteration.]
Figure 10.2 Results on the “artificial large” data set ($L = 50$, $M = 225$, $H^* = 5$). The clustering errors were 4.0% for AGVBS and 11.2% for the KPCA iteration.

Benchmark Data Experiment

Finally, we applied AGVBS and the KPCA iteration to the Hopkins 155 motion database (Tron and Vidal, 2007). In this data set, each sample corresponds to the trajectory of a point in a video, and clustering the trajectories amounts to finding a set of rigid bodies. Figure 10.3 shows the results on the “1R2RC” ($L = 59$, $M = 459$) sequence.1 We see that AGVBS gave a lower free energy with much less computation time than the KPCA iteration. Figure 10.4 shows the clustering errors on the first 20 sequences, which implies that AGVBS generally outperforms the KPCA iteration. Figure 10.4 also shows the results by MAP learning (Eq. (3.87) in Section 3.4) with the tuning parameter $\lambda$ optimized over the 20 sequences (i.e., we performed MAP learning with different values for $\lambda$, and selected the one giving the lowest average clustering error). We see that AGVBS performs comparably to MAP learning with optimized $\lambda$, which implies that EVB learning estimates the hyperparameters and the noise variance reasonably well.

1 Peaks in the free energy curves are due to pruning. As noted in Section 3.1.1, the free energy can increase right after pruning happens, but immediately gets lower than the free energy before pruning.

[Figure 10.3: three panels, (a) Free energy, (b) Computation time (Time (sec)), and (c) Estimated rank, each plotted against Iteration (0 to 2,500) for AGVBS and the KPCA iteration.]
Figure 10.3 Results on the “1R2RC” sequence ($L = 59$, $M = 459$) of the Hopkins 155 motion database. Peaks in the free energy curves are due to pruning. The clustering errors are shown in Figure 10.4.

[Figure 10.4: per-sequence clustering errors (vertical axis from 0 to 0.03) for MAP (with optimized lambda), AGVBS, and the KPCA iteration; the horizontal axis lists the sequences from 1R2RC to 1R2TCRT_g13.]
Figure 10.4 Clustering errors on the first 20 sequences of the Hopkins 155 data set.
11 Efficient Solver for Sparse Additive Matrix Factorization
In this chapter, we introduce an efficient variational Bayesian (VB) solver (Nakajima et al., 2013b) for sparse additive matrix factorization (SAMF), where the global VB solver, derived in Chapter 9, for fully observed MF is used as a subroutine.
11.1 Problem Description The SAMF model, introduced in Section 3.5, is defined as ⎛ # ##2 ⎞ S ⎜⎜⎜ 1 ## # ⎟⎟⎟ # ⎜ (s) p(V|Θ) ∝ exp ⎜⎜⎜⎝− 2 ##V − U ### ⎟⎟⎟⎟⎠ , 2σ # # s=1 Fro (s) S K 1 (k,s) (k,s)−1 (k,s) S tr A CA A p({Θ(s) A } s=1 ) ∝ exp − 2 s=1 k=1 S K (s) 1 (k,s) (k,s)−1 (k,s) (s) S p({ΘB } s=1 ) ∝ exp − tr B C B B 2 s=1 k=1
(11.1) ,
(11.2)
,
(11.3)
where (s)
K ; X(s) ) U(s) = G({B(k,s) A(k,s) }k=1
(11.4)
is the sth sparse matrix factorization (SMF) term. Here G(·; X) : K
(k)
(k) R k=1 (L ×M ) → RL×M maps the partitioned-and-rearranged (PR) matrices
(k) K {U }k=1 to the target matrix U ∈ RL×M , based on the one-to-one map K to the X : (k, l , m ) → (l, m) from the indices of the entries in {U (k) }k=1 indices of the entries in U such that K ; X) = Ul,m = UX(k,l ,m ) = Ul (k) (11.5) G({U (k) }k=1
,m . l,m
279
280
11 Efficient Solver for Sparse Additive Matrix Factorization
The prior covariances of A(k,s) and B(k,s) are assumed to be diagonal and positive-definite: C(k,s) = Diag(c(k,s)2 , . . . , c(k,s)2 a1 aH ), A C(k,s) = Diag(c(k,s)2 , . . . , c(k,s)2 B b1 bH ), and Θ summarizes the parameters as follows: (s)
(s)
(s) S (s) (k,s) K (k,s) K }k=1 , Θ(s) }k=1 . Θ = {Θ(s) A , Θ B } s=1 , where ΘA = { A B = {B
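Under the independence constraint,

$$r(\Theta) = \prod_{s=1}^{S} r_A^{(s)}(\Theta_A^{(s)})\, r_B^{(s)}(\Theta_B^{(s)}), \tag{11.6}$$

the VB posterior minimizing the free energy can be written as

$$r(\Theta) = \prod_{s=1}^{S} \prod_{k=1}^{K^{(s)}} \left( \mathrm{MGauss}_{M'^{(k,s)},H^{(k,s)}}(A^{(k,s)}; \widehat{A}^{(k,s)}, \widehat{\Sigma}_A^{(k,s)}) \cdot \mathrm{MGauss}_{L'^{(k,s)},H^{(k,s)}}(B^{(k,s)}; \widehat{B}^{(k,s)}, \widehat{\Sigma}_B^{(k,s)}) \right)$$

$$= \prod_{s=1}^{S} \prod_{k=1}^{K^{(s)}} \left( \prod_{m'=1}^{M'^{(k,s)}} \mathrm{Gauss}_{H^{(k,s)}}(\widetilde{\boldsymbol{a}}_{m'}^{(k,s)}; \widehat{\widetilde{\boldsymbol{a}}}_{m'}^{(k,s)}, \widehat{\Sigma}_A^{(k,s)}) \cdot \prod_{l'=1}^{L'^{(k,s)}} \mathrm{Gauss}_{H^{(k,s)}}(\widetilde{\boldsymbol{b}}_{l'}^{(k,s)}; \widehat{\widetilde{\boldsymbol{b}}}_{l'}^{(k,s)}, \widehat{\Sigma}_B^{(k,s)}) \right). \tag{11.7}$$

The free energy can be explicitly written as

$$2F = LM \log(2\pi\sigma^2) + \frac{\|V\|_{\mathrm{Fro}}^2}{\sigma^2} + \sum_{s=1}^{S} \sum_{k=1}^{K^{(s)}} \left( M'^{(k,s)} \log \frac{\det C_A^{(k,s)}}{\det \widehat{\Sigma}_A^{(k,s)}} + L'^{(k,s)} \log \frac{\det C_B^{(k,s)}}{\det \widehat{\Sigma}_B^{(k,s)}} \right)$$

$$+ \sum_{s=1}^{S} \sum_{k=1}^{K^{(s)}} \mathrm{tr}\left\{ C_A^{(k,s)-1} \left( \widehat{A}^{(k,s)\top} \widehat{A}^{(k,s)} + M'^{(k,s)} \widehat{\Sigma}_A^{(k,s)} \right) + C_B^{(k,s)-1} \left( \widehat{B}^{(k,s)\top} \widehat{B}^{(k,s)} + L'^{(k,s)} \widehat{\Sigma}_B^{(k,s)} \right) \right\}$$

$$+ \frac{1}{\sigma^2} \mathrm{tr}\left\{ -2 V^\top \left( \sum_{s=1}^{S} G(\{\widehat{B}^{(k,s)} \widehat{A}^{(k,s)\top}\}_{k=1}^{K^{(s)}}; \mathcal{X}^{(s)}) \right) + 2 \sum_{s=1}^{S} \sum_{s'=s+1}^{S} G^\top(\{\widehat{B}^{(k,s)} \widehat{A}^{(k,s)\top}\}_{k=1}^{K^{(s)}}; \mathcal{X}^{(s)})\, G(\{\widehat{B}^{(k,s')} \widehat{A}^{(k,s')\top}\}_{k=1}^{K^{(s')}}; \mathcal{X}^{(s')}) \right\}$$

$$+ \frac{1}{\sigma^2} \sum_{s=1}^{S} \sum_{k=1}^{K^{(s)}} \mathrm{tr}\left\{ \left( \widehat{A}^{(k,s)\top} \widehat{A}^{(k,s)} + M'^{(k,s)} \widehat{\Sigma}_A^{(k,s)} \right) \left( \widehat{B}^{(k,s)\top} \widehat{B}^{(k,s)} + L'^{(k,s)} \widehat{\Sigma}_B^{(k,s)} \right) \right\} - \sum_{s=1}^{S} \sum_{k=1}^{K^{(s)}} (L'^{(k,s)} + M'^{(k,s)}) H^{(k,s)}, \tag{11.8}$$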
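of which the stationary conditions are given by

$$\widehat{A}^{(k,s)} = \sigma^{-2} Z'^{(k,s)\top} \widehat{B}^{(k,s)} \widehat{\Sigma}_A^{(k,s)}, \tag{11.9}$$

$$\widehat{\Sigma}_A^{(k,s)} = \sigma^2 \left( \widehat{B}^{(k,s)\top} \widehat{B}^{(k,s)} + L'^{(k,s)} \widehat{\Sigma}_B^{(k,s)} + \sigma^2 C_A^{(k,s)-1} \right)^{-1}, \tag{11.10}$$

$$\widehat{B}^{(k,s)} = \sigma^{-2} Z'^{(k,s)} \widehat{A}^{(k,s)} \widehat{\Sigma}_B^{(k,s)}, \tag{11.11}$$

$$\widehat{\Sigma}_B^{(k,s)} = \sigma^2 \left( \widehat{A}^{(k,s)\top} \widehat{A}^{(k,s)} + M'^{(k,s)} \widehat{\Sigma}_A^{(k,s)} + \sigma^2 C_B^{(k,s)-1} \right)^{-1}. \tag{11.12}$$

Here $Z'^{(k,s)} \in \mathbb{R}^{L'^{(k,s)} \times M'^{(k,s)}}$ is defined as

$$Z'^{(k,s)}_{l',m'} = Z^{(s)}_{\mathcal{X}^{(s)}(k,l',m')}, \quad \text{where} \quad Z^{(s)} = V - \sum_{s' \neq s} \widehat{U}^{(s')}. \tag{11.13}$$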
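The stationary conditions for the hyperparameters $\{C_A^{(k,s)}, C_B^{(k,s)}\}_{k=1,\,s=1}^{K^{(s)},\,S}$ and $\sigma^2$ are given as

$$c_{a_h}^{(k,s)2} = \left\| \widehat{\boldsymbol{a}}_h^{(k,s)} \right\|^2 / M'^{(k,s)} + (\widehat{\Sigma}_A^{(k,s)})_{hh}, \tag{11.14}$$

$$c_{b_h}^{(k,s)2} = \left\| \widehat{\boldsymbol{b}}_h^{(k,s)} \right\|^2 / L'^{(k,s)} + (\widehat{\Sigma}_B^{(k,s)})_{hh}, \tag{11.15}$$

$$\sigma^2 = \frac{1}{LM} \left\{ \|V\|_{\mathrm{Fro}}^2 - 2 \sum_{s=1}^{S} \mathrm{tr}\left( \widehat{U}^{(s)\top} \left( V - \sum_{s'=s+1}^{S} \widehat{U}^{(s')} \right) \right) + \sum_{s=1}^{S} \sum_{k=1}^{K^{(s)}} \mathrm{tr}\left( \left( \widehat{A}^{(k,s)\top} \widehat{A}^{(k,s)} + M'^{(k,s)} \widehat{\Sigma}_A^{(k,s)} \right) \cdot \left( \widehat{B}^{(k,s)\top} \widehat{B}^{(k,s)} + L'^{(k,s)} \widehat{\Sigma}_B^{(k,s)} \right) \right) \right\}. \tag{11.16}$$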
The standard VB algorithm (Algorithm 6 in Section 3.5) iteratively applies Eqs. (11.9) through (11.12) and (11.14) through (11.16) until convergence.
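As an illustration of the iteration described above, the following is a minimal sketch (not Algorithm 6 itself) of one sweep of Eqs. (11.9) through (11.12) for a single block, under simplifying assumptions that are ours: the block has rank H = 1, so the covariances and prior variances are scalars, and the hyperparameters and the noise variance are kept fixed instead of being updated by Eqs. (11.14) through (11.16). The names `vb_sweep_rank1`, `block_free_energy`, and `run_demo` are hypothetical, and `Z` stands for the dummy target of the block.

```python
import math
import random


def vb_sweep_rank1(Z, a, b, s2a, s2b, c2a, c2b, sigma2):
    """One sweep of Eqs. (11.9)-(11.12) for a single block with H = 1."""
    Lp, Mp = len(Z), len(Z[0])
    # Eq. (11.10): posterior variance of a, then Eq. (11.9): posterior mean of a.
    s2a = sigma2 / (sum(x * x for x in b) + Lp * s2b + sigma2 / c2a)
    a = [s2a / sigma2 * sum(Z[l][m] * b[l] for l in range(Lp)) for m in range(Mp)]
    # Eq. (11.12): posterior variance of b, then Eq. (11.11): posterior mean of b.
    s2b = sigma2 / (sum(x * x for x in a) + Mp * s2a + sigma2 / c2b)
    b = [s2b / sigma2 * sum(Z[l][m] * a[m] for m in range(Mp)) for l in range(Lp)]
    return a, b, s2a, s2b


def block_free_energy(Z, a, b, s2a, s2b, c2a, c2b, sigma2):
    """Terms of 2F in Eq. (11.8) that depend on one H = 1 block."""
    Lp, Mp = len(Z), len(Z[0])
    ea = sum(x * x for x in a) + Mp * s2a  # posterior second moment of a
    eb = sum(x * x for x in b) + Lp * s2b  # posterior second moment of b
    cross = sum(b[l] * Z[l][m] * a[m] for l in range(Lp) for m in range(Mp))
    return (Mp * math.log(c2a / s2a) + Lp * math.log(c2b / s2b)
            + ea / c2a + eb / c2b + (-2.0 * cross + ea * eb) / sigma2)


def run_demo(n_sweeps=30, seed=0):
    """Run the sweeps on a random 4 x 6 target and record the free energy."""
    rng = random.Random(seed)
    Lp, Mp = 4, 6
    Z = [[rng.gauss(0.0, 1.0) for _ in range(Mp)] for _ in range(Lp)]
    state = ([rng.gauss(0.0, 1.0) for _ in range(Mp)],
             [rng.gauss(0.0, 1.0) for _ in range(Lp)], 1.0, 1.0)
    c2a, c2b, sigma2 = 1.0, 1.0, 1.0
    energies = [block_free_energy(Z, *state, c2a, c2b, sigma2)]
    for _ in range(n_sweeps):
        state = vb_sweep_rank1(Z, *state, c2a, c2b, sigma2)
        energies.append(block_free_energy(Z, *state, c2a, c2b, sigma2))
    return energies
```

Each half of the sweep minimizes the block free energy exactly with respect to one factor while the other is fixed, so the recorded values should never increase; which stationary point is reached still depends on the starting values.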
11.2 Efficient Algorithm for SAMF
In this section, we derive an algorithm that is more efficient than the standard VB algorithm. We first present a theorem that reduces a partial SAMF problem to the (fully observed) MF problem, which can be solved analytically. Then we describe the algorithm that solves the entire SAMF problem.
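11.2.1 Reduction of the Partial SAMF Problem to the MF Problem
Let us denote the mean of $U^{(s)}$, defined in Eq. (11.4), over the VB posterior by

$$\widehat{U}^{(s)} = \left\langle U^{(s)} \right\rangle_{r_A^{(s)}(\Theta_A^{(s)})\, r_B^{(s)}(\Theta_B^{(s)})} = G\left( \{\widehat{B}^{(k,s)} \widehat{A}^{(k,s)\top}\}_{k=1}^{K^{(s)}}; \mathcal{X}^{(s)} \right). \tag{11.17}$$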
Then we obtain the following theorem:
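Theorem 11.1 Given $\{\widehat{U}^{(s')}\}_{s' \neq s}$ and the noise variance $\sigma^2$, the VB posterior of $(\Theta_A^{(s)}, \Theta_B^{(s)}) = \{A^{(k,s)}, B^{(k,s)}\}_{k=1}^{K^{(s)}}$ coincides with the VB posterior of the following MF model:

$$p(Z'^{(k,s)} \mid A^{(k,s)}, B^{(k,s)}) \propto \exp\left( -\frac{1}{2\sigma^2} \left\| Z'^{(k,s)} - B^{(k,s)} A^{(k,s)\top} \right\|_{\mathrm{Fro}}^2 \right), \tag{11.18}$$

$$p(A^{(k,s)}) \propto \exp\left( -\frac{1}{2} \mathrm{tr}\left( A^{(k,s)} C_A^{(k,s)-1} A^{(k,s)\top} \right) \right), \tag{11.19}$$

$$p(B^{(k,s)}) \propto \exp\left( -\frac{1}{2} \mathrm{tr}\left( B^{(k,s)} C_B^{(k,s)-1} B^{(k,s)\top} \right) \right), \tag{11.20}$$

for each $k = 1, \ldots, K^{(s)}$. Here, $Z'^{(k,s)} \in \mathbb{R}^{L'^{(k,s)} \times M'^{(k,s)}}$ is defined by Eq. (11.13).

Proof Given $\{\widehat{U}^{(s')}\}_{s' \neq s} = \{\{\widehat{B}^{(k,s')} \widehat{A}^{(k,s')\top}\}_{k=1}^{K^{(s')}}\}_{s' \neq s}$ and $\sigma^2$ as fixed constants, the free energy (11.8) can be written as a function of $\{\widehat{A}^{(k,s)}, \widehat{B}^{(k,s)}, \widehat{\Sigma}_A^{(k,s)}, \widehat{\Sigma}_B^{(k,s)}, C_A^{(k,s)}, C_B^{(k,s)}\}_{k=1}^{K^{(s)}}$ as follows:

$$2F^{(s)}\left( \{\widehat{A}^{(k,s)}, \widehat{B}^{(k,s)}, \widehat{\Sigma}_A^{(k,s)}, \widehat{\Sigma}_B^{(k,s)}, C_A^{(k,s)}, C_B^{(k,s)}\}_{k=1}^{K^{(s)}} \right) = \sum_{k=1}^{K^{(s)}} 2F^{(k,s)} + \mathrm{const.}, \tag{11.21}$$

where

$$2F^{(k,s)} = M'^{(k,s)} \log \frac{\det C_A^{(k,s)}}{\det \widehat{\Sigma}_A^{(k,s)}} + L'^{(k,s)} \log \frac{\det C_B^{(k,s)}}{\det \widehat{\Sigma}_B^{(k,s)}}$$

$$+ \mathrm{tr}\left\{ C_A^{(k,s)-1} \left( \widehat{A}^{(k,s)\top} \widehat{A}^{(k,s)} + M'^{(k,s)} \widehat{\Sigma}_A^{(k,s)} \right) + C_B^{(k,s)-1} \left( \widehat{B}^{(k,s)\top} \widehat{B}^{(k,s)} + L'^{(k,s)} \widehat{\Sigma}_B^{(k,s)} \right) \right\}$$

$$+ \frac{1}{\sigma^2} \mathrm{tr}\left\{ -2 \widehat{A}^{(k,s)\top} Z'^{(k,s)\top} \widehat{B}^{(k,s)} + \left( \widehat{A}^{(k,s)\top} \widehat{A}^{(k,s)} + M'^{(k,s)} \widehat{\Sigma}_A^{(k,s)} \right) \left( \widehat{B}^{(k,s)\top} \widehat{B}^{(k,s)} + L'^{(k,s)} \widehat{\Sigma}_B^{(k,s)} \right) \right\}. \tag{11.22}$$

Eq. (11.22) coincides with the free energy of the fully observed matrix factorization model (11.18) through (11.20) up to a constant (see Eq. (3.23) with Z substituted for V). Therefore, the VB solution is the same.

Eq. (11.13) relates the entries of $Z^{(s)} \in \mathbb{R}^{L \times M}$ to the entries of $\{Z'^{(k,s)} \in \mathbb{R}^{L'^{(k,s)} \times M'^{(k,s)}}\}_{k=1}^{K^{(s)}}$ by using the map $\mathcal{X}^{(s)} : (k, l', m') \mapsto (l, m)$ (see Eq. (11.5) and Figure 3.3).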
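The index bookkeeping behind Eqs. (11.5) and (11.13) can be sketched as follows. This is only an illustration under our own assumptions: the function names are hypothetical, the term index `s` is 0-based, the map is passed as a Python callable, and `rowwise_map` assumes a rowwise partition in which the kth PR matrix is simply the kth row of the target.

```python
def residual(V, U_hats, s):
    """Dummy target Z^(s) = V - sum over s' != s of U_hat^(s'), as in Eq. (11.13)."""
    L, M = len(V), len(V[0])
    return [[V[l][m] - sum(U[l][m] for j, U in enumerate(U_hats) if j != s)
             for m in range(M)] for l in range(L)]


def gather(Z, index_map, shapes):
    """Build the PR matrices Z'^(k) with Z'^(k)[l'][m'] = Z[index_map(k, l', m')]."""
    blocks = []
    for k, (Lp, Mp) in enumerate(shapes):
        block = []
        for lp in range(Lp):
            row = []
            for mp in range(Mp):
                l, m = index_map(k, lp, mp)
                row.append(Z[l][m])
            block.append(row)
        blocks.append(block)
    return blocks


def scatter(blocks, index_map, L, M):
    """The map G of Eq. (11.5): place the PR-matrix entries back into an L x M matrix."""
    U = [[0.0] * M for _ in range(L)]
    for k, block in enumerate(blocks):
        for lp, row in enumerate(block):
            for mp, value in enumerate(row):
                l, m = index_map(k, lp, mp)
                U[l][m] = value
    return U


def rowwise_map(k, lp, mp):
    """Assumed rowwise partition: block k is row k, so (k, 0, m') maps to (k, m')."""
    return (k, mp)
```

Because the map is one-to-one, `scatter` undoes `gather`, which is what allows each SMF term to be estimated blockwise and then reassembled.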
11.2.2 Mean Update Algorithm
Theorem 11.1 states that a partial problem of SAMF, namely finding the posterior of $(A^{(k,s)}, B^{(k,s)})$ for each $k = 1, \ldots, K^{(s)}$ given $\{\widehat{U}^{(s')}\}_{s' \neq s}$ and $\sigma^2$, can be solved by the global solver for the fully observed MF model. Specifically, we use Algorithm 16, introduced in Chapter 9, for estimating each SMF term $\widehat{U}^{(s)}$ in turn. We use Eq. (11.16) for updating the noise variance $\sigma^2$. The whole procedure, called the mean update (MU) algorithm (Nakajima et al., 2013b), is summarized in Algorithm 22, where $0_{(d_1,d_2)}$ denotes the $d_1 \times d_2$ matrix with all entries equal to zero.

Algorithm 22 Mean update (MU) algorithm for VB SAMF.
1: Initialize: $\widehat{U}^{(s)} \leftarrow 0_{(L,M)}$ for $s = 1, \ldots, S$, $\sigma^2 \leftarrow \|V\|_{\mathrm{Fro}}^2 / (LM)$.
2: for s = 1 to S do
3: Compute $Z'^{(k,s)} \in \mathbb{R}^{L'^{(k,s)} \times M'^{(k,s)}}$ by Eq. (11.13).
4: For each partition $k = 1, \ldots, K^{(s)}$, compute the solution $\widehat{U}'^{(k,s)} = \widehat{B}^{(k,s)} \widehat{A}^{(k,s)\top}$ for the fully observed MF by Algorithm 16 with $Z'^{(k,s)}$ as the observed matrix.
5: $\widehat{U}^{(s)} \leftarrow G(\{\widehat{B}^{(k,s)} \widehat{A}^{(k,s)\top}\}_{k=1}^{K^{(s)}}; \mathcal{X}^{(s)})$.
6: end for
7: Update $\sigma^2$ by Eq. (11.16).
8: Repeat 2 to 7 until convergence.

The MU algorithm is similar in spirit to the backfitting algorithm (Hastie and Tibshirani, 1986; D’Souza et al., 2004), where each additive term is updated to fit a dummy target. In the MU algorithm, $Z^{(s)}$ defined in Eq. (11.13) corresponds to the dummy target. Although the MU algorithm globally solves a partial problem in each step, its joint global optimality over the entire parameter space is not guaranteed. Nevertheless, experimental results in Section 11.3 show that the MU algorithm performs well in practice.

When Algorithm 16 is applied to the dummy target matrix $Z'^{(k,s)} \in \mathbb{R}^{L'^{(k,s)} \times M'^{(k,s)}}$ in Step 4, singular value decomposition is required, which dominates the computation time. However, for many practical SMF terms, including the rowwise (3.114), the columnwise (3.115), and the elementwise (3.116) terms (as well as the segmentwise term, which will be defined for a video application in Section 11.3.2), $Z'^{(k,s)}$ is a vector or scalar, i.e., $L'^{(k,s)} = 1$ or $M'^{(k,s)} = 1$ holds. In such cases, the singular value and the singular vectors are given simply by

$$\gamma_1^{(k,s)} = \|Z'^{(k,s)}\|, \quad \boldsymbol{\omega}_{a_1}^{(k,s)} = Z'^{(k,s)\top} / \|Z'^{(k,s)}\|, \quad \omega_{b_1}^{(k,s)} = 1 \qquad \text{if } L'^{(k,s)} = 1,$$

$$\gamma_1^{(k,s)} = \|Z'^{(k,s)}\|, \quad \omega_{a_1}^{(k,s)} = 1, \quad \boldsymbol{\omega}_{b_1}^{(k,s)} = Z'^{(k,s)} / \|Z'^{(k,s)}\| \qquad \text{if } M'^{(k,s)} = 1.$$
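The shortcut for vector-shaped dummy targets can be sketched in a few lines. The function name `vector_svd` is hypothetical, and the sketch assumes a nonzero vector; the scalar 1 on the other side of the decomposition is left implicit.

```python
import math


def vector_svd(z):
    """Singular value and unit singular vector of a nonzero vector, without an SVD routine."""
    gamma = math.sqrt(sum(x * x for x in z))  # the only singular value is the Euclidean norm
    if gamma == 0.0:
        raise ValueError("the zero vector has no unique singular vector")
    return gamma, [x / gamma for x in z]      # normalized vector; the other singular vector is 1
```

Multiplying the returned singular value by the unit vector recovers the input, so the cost per partition is linear in the vector length rather than that of a general SVD.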
11.3 Experimental Results
In this section, we show experimentally that the MU algorithm (Algorithm 22) performs better than the standard VB algorithm (Algorithm 6 in Section 3.5). We also demonstrate the advantage of the flexibility of SAMF in a real-world application.
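11.3.1 Mean Update vs. Standard VB
We compare the algorithms under the following model:

$$V = U^{\mathrm{LRCE}} + E, \quad \text{where} \quad U^{\mathrm{LRCE}} = \sum_{s=1}^{4} U^{(s)} = U^{\text{low-rank}} + U^{\mathrm{row}} + U^{\mathrm{column}} + U^{\mathrm{element}}. \tag{11.23}$$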
Here, “LRCE” stands for the sum of the low-rank, rowwise, columnwise, and elementwise terms, each of which is defined in Eqs. (3.113) through (3.116). We call this model “LRCE”-SAMF. As explained in Section 3.5, “LRCE”-SAMF may be used to separate the clean signal $U^{\text{low-rank}}$ from a possible rowwise sparse component (constantly broken sensors), a columnwise sparse component (accidental disturbances affecting all sensors), and an elementwise sparse component (randomly distributed spiky noise). We also evaluate “LCE”-SAMF, “LRE”-SAMF, and “LE”-SAMF, which can be regarded as generalizations of robust PCA (Candès et al., 2011; Ding et al., 2011; Babacan et al., 2012b). Note that “LE”-SAMF corresponds to an SAMF counterpart of robust PCA.

First, we conducted an experiment with artificial data. We assume the empirical VB scenario with unknown noise variance, i.e., all hyperparameters, $\{C_A^{(k,s)}, C_B^{(k,s)}\}_{k=1,\,s=1}^{K^{(s)},\,S}$ and $\sigma^2$, are estimated from observations. We use the full-rank model ($H = \min(L, M)$) for the low-rank term $U^{\text{low-rank}}$, and expect the model-induced regularization (MIR) effect (see Chapter 7) to find the true rank of $U^{\text{low-rank}}$, as well as the nonzero entries in $U^{\mathrm{row}}$, $U^{\mathrm{column}}$, and $U^{\mathrm{element}}$.

We created an artificial data set with the data matrix size $L = 40$ and $M = 100$, and the rank $H^* = 10$ for a true low-rank matrix $U^{\text{low-rank}*} = B^* A^{*\top}$. Each entry in $A^* \in \mathbb{R}^{M \times H^*}$ and $B^* \in \mathbb{R}^{L \times H^*}$ was drawn from $\mathrm{Gauss}_1(0, 1)$. A true rowwise (columnwise) part $U^{\mathrm{row}*}$ ($U^{\mathrm{column}*}$) was created by first randomly selecting $\rho L$ rows ($\rho M$ columns) for $\rho = 0.05$, and then adding a noise subject to $\mathrm{Gauss}_M(0, \zeta I_M)$ ($\mathrm{Gauss}_L(0, \zeta I_L)$) for $\zeta = 100$ to each of the selected rows (columns). A true elementwise part $U^{\mathrm{element}*}$ was similarly created by first selecting $\rho L M$ entries and then adding a noise subject to $\mathrm{Gauss}_1(0, \zeta)$ to each of the selected entries. Finally, an observed matrix $V$ was created by adding a noise subject to $\mathrm{Gauss}_1(0, 1)$ to each entry of the sum $U^{\mathrm{LRCE}*}$ of the aforementioned four true matrices.

For the standard VB algorithm, we initialized the variational parameters and the hyperparameters in the following way: the mean parameters, $\{\widehat{A}^{(k,s)}, \widehat{B}^{(k,s)}\}_{k=1,\,s=1}^{K^{(s)},\,S}$, were randomly created so that each entry follows $\mathrm{Gauss}_1(0, 1)$; the covariances, $\{\widehat{\Sigma}_A^{(k,s)}, \widehat{\Sigma}_B^{(k,s)}\}_{k=1,\,s=1}^{K^{(s)},\,S}$ and $\{C_A^{(k,s)}, C_B^{(k,s)}\}_{k=1,\,s=1}^{K^{(s)},\,S}$, were set to be identity; and the noise variance was set to $\sigma^2 = 1$. Note that we rescaled $V$ so that $\|V\|_{\mathrm{Fro}}^2 / (LM) = 1$ before starting iteration. We ran the standard VB algorithm 10 times, starting from different initial points, and each trial is plotted by a solid line (labeled as “Standard(iniRan)”) in Figure 11.1.
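The construction of the artificial data set described above can be sketched as follows. This is not the authors' code: the function names and the seed handling are hypothetical, ζ is read as a variance (so the spike standard deviation is its square root), and the counts ρL, ρM, and ρLM are rounded to the nearest integer.

```python
import math
import random


def make_lrce_data(L=40, M=100, H_true=10, rho=0.05, zeta=100.0, seed=0):
    """Return an observed matrix V and its four true components as nested lists."""
    rng = random.Random(seed)
    spike_std = math.sqrt(zeta)  # assumption: zeta is the variance of the sparse entries

    def zeros():
        return [[0.0] * M for _ in range(L)]

    # Low-rank part B* A*^T with standard Gaussian factors.
    A = [[rng.gauss(0.0, 1.0) for _ in range(H_true)] for _ in range(M)]
    B = [[rng.gauss(0.0, 1.0) for _ in range(H_true)] for _ in range(L)]
    low_rank = [[sum(B[l][h] * A[m][h] for h in range(H_true)) for m in range(M)]
                for l in range(L)]

    row = zeros()  # a few entirely corrupted rows
    for l in rng.sample(range(L), round(rho * L)):
        row[l] = [rng.gauss(0.0, spike_std) for _ in range(M)]

    column = zeros()  # a few entirely corrupted columns
    for m in rng.sample(range(M), round(rho * M)):
        for l in range(L):
            column[l][m] = rng.gauss(0.0, spike_std)

    element = zeros()  # isolated spiky entries
    for idx in rng.sample(range(L * M), round(rho * L * M)):
        element[idx // M][idx % M] = rng.gauss(0.0, spike_std)

    parts = {"low_rank": low_rank, "row": row, "column": column, "element": element}
    V = [[sum(p[l][m] for p in parts.values()) + rng.gauss(0.0, 1.0)
          for m in range(M)] for l in range(L)]
    return V, parts


def rescale_to_unit_power(V):
    """Rescale V so that its squared Frobenius norm divided by LM equals 1."""
    L, M = len(V), len(V[0])
    scale = math.sqrt(sum(x * x for r in V for x in r) / (L * M))
    return [[x / scale for x in r] for r in V]
```

With the default arguments this yields 2 corrupted rows, 5 corrupted columns, and 200 spiky entries in a 40 × 100 matrix.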
Initialization for the MU algorithm is simple: we set $\widehat{U}^{(s)} = 0_{(L,M)}$ for $s = 1, \ldots, S$, and $\sigma^2 = 1$. Initialization of all other variables is not needed. Furthermore, we empirically observed that the initial value for $\sigma^2$ does not affect the result much, unless it is too small. In fact, initializing $\sigma^2$ to a large value is not harmful in the MU algorithm, because it is set to an adequate value after the first iteration with the mean parameters kept $\widehat{U}^{(s)} = 0_{(L,M)}$. The performance of the MU algorithm is plotted by the dashed line in Figure 11.1.

Figures 11.1(a) through 11.1(c) show the free energy, the computation time, and the estimated rank, respectively, over iterations, and Figure 11.1(d) shows the reconstruction errors after 250 iterations. The reconstruction errors consist of the overall error $\|\widehat{U}^{\mathrm{LRCE}} - U^{\mathrm{LRCE}*}\|_{\mathrm{Fro}}^2 / (LM)$, and the four componentwise errors $\|\widehat{U}^{(s)} - U^{(s)*}\|_{\mathrm{Fro}}^2 / (LM)$. The graphs show that the MU algorithm, whose iteration is computationally slightly more expensive than that of the standard VB algorithm, immediately converges to a local minimum with the free energy substantially lower than the standard VB algorithm. The estimated rank agrees with the true rank $\widehat{H} = H^* = 10$, while all 10 trials of the standard VB algorithm failed to estimate the true rank. It is also observed that the MU algorithm reconstructs each of the four terms well.

[Figure 11.1: panels (a) Free energy, (b) Computation time (sec), and (c) Estimated rank, each plotted against the iteration (0 to 250), and (d) Reconstruction error (Overall, Low-rank, Row, Column, Element). Methods: MeanUpdate, Standard(iniML), Standard(iniMLSS), Standard(iniRan).]
Figure 11.1 Experimental results of “LRCE”-SAMF on an artificial data set ($L = 40$, $M = 100$, $H^* = 10$, $\rho = 0.05$).

We can slightly improve the performance of the standard VB algorithm by adopting different initialization schemes. The line labeled as “Standard(iniML)” in Figure 11.1 indicates the maximum likelihood (ML) initialization, i.e., $(\widehat{\boldsymbol{a}}_h^{(k,s)}, \widehat{\boldsymbol{b}}_h^{(k,s)}) = (\gamma_h^{(k,s)1/2} \boldsymbol{\omega}_{a_h}^{(k,s)}, \gamma_h^{(k,s)1/2} \boldsymbol{\omega}_{b_h}^{(k,s)})$. Here, $\gamma_h^{(k,s)}$ is the $h$th largest singular value of the $(k, s)$th PR matrix $V'^{(k,s)}$ of $V$ (such that $V'^{(k,s)}_{l',m'} = V_{\mathcal{X}^{(s)}(k,l',m')}$), and $\boldsymbol{\omega}_{a_h}^{(k,s)}$ and $\boldsymbol{\omega}_{b_h}^{(k,s)}$ are the associated right and left singular vectors. Also, we empirically found that starting from small $\sigma^2$ alleviates the local minimum problem. The line labeled as “Standard(iniMLSS)” indicates the ML initialization with $\sigma^2 = 0.0001$. We can see that this scheme successfully recovered the true rank. However, it still performs substantially worse than the MU algorithm in terms of the free energy and the reconstruction error.
[Figure 11.2: panels (a) Free energy, (b) Computation time (sec), and (c) Estimated rank, each plotted against the iteration (0 to 250), and (d) Reconstruction error (Overall, Low-rank, Row, Column, Element). Methods: MeanUpdate, Standard(iniML), Standard(iniMLSS), Standard(iniRan).]
Figure 11.2 Experimental results of “LE”-SAMF on an artificial data set ($L = 100$, $M = 300$, $H^* = 20$, $\rho = 0.1$).
Figure 11.2 shows results of “LE”-SAMF on another artificial data set with $L = 100$, $M = 300$, $H^* = 20$, and $\rho = 0.1$. We see that the MU algorithm performs better than the standard VB algorithm. We also tested various SAMF models including “LCE”-SAMF, “LRE”-SAMF, and “LE”-SAMF under different settings for $M$, $L$, $H^*$, and $\rho$, and empirically found that the MU algorithm generally gives a better solution with lower free energy and smaller reconstruction errors than the standard VB algorithm.

Next, we conducted experiments on several data sets from the UCI repository (Asuncion and Newman, 2007). Since we do not know the true model of those data sets, we only focus on the achieved free energy. Figure 11.3 shows the free energy after convergence in “LRCE”-SAMF, “LCE”-SAMF, “LRE”-SAMF, and “LE”-SAMF. For better comparison, a constant is added so that the free energy achieved by the MU algorithm is zero. We can see a clear advantage of the MU algorithm over the standard VB algorithm.
[Figure 11.3: bar charts of the free energy after convergence on the data sets chart, forestfires, glass, iris, spectf, and wine; panels (a) “LRCE”-SAMF, (b) “LCE”-SAMF, (c) “LRE”-SAMF, (d) “LE”-SAMF. Methods: MeanUpdate, Standard(iniML), Standard(iniMLSS), Standard(iniRan).]
Figure 11.3 Experimental results on benchmark data sets. For better comparison, a constant is added so that the free energy achieved by the MU algorithm is zero (therefore, the bar for “MeanUpdate” is invisible).

11.3.2 Real-World Application
Finally, we demonstrate the usefulness of the flexibility of SAMF in a foreground (FG)/background (BG) video separation problem (Figure 3.5 in Section 3.5). Candès et al. (2011) formed the observed matrix $V$ by stacking all pixels in each frame into each column (Figure 3.6), and applied the robust PCA (with “LE”-terms): the low-rank term captures the static BG and the elementwise (or pixelwise) term captures the moving FG, e.g., people walking through. SAMF can be seen as an extension of the VB variant of robust PCA (Babacan et al., 2012b). Accordingly, we use “LE”-SAMF,

$$V = U^{\text{low-rank}} + U^{\mathrm{element}} + E,$$

as a baseline method for comparison.

The SAMF framework enables a fine-tuned design for the FG term. Assuming that pixels in an image segment with similar intensity values tend to share the same label (i.e., FG or BG), we formed a segmentwise sparse SMF term: $U'^{(k)}$ for each $k$ is a column vector consisting of all pixels in each segment. We produced an oversegmented image from each frame by using the efficient graph-based segmentation (EGS) algorithm (Felzenszwalb and Huttenlocher, 2004), and substituted the segmentwise sparse term for the FG term (see Figure 3.7):

$$V = U^{\text{low-rank}} + U^{\mathrm{segment}} + E.$$
We call this model segmentation-based SAMF (sSAMF). Note that EGS is computationally very efficient: it takes less than 0.05 sec on an ordinary laptop to segment a 192 × 144 gray image. EGS has several tuning parameters, and the obtained segmentation is sensitive to some of them. However, we confirmed that sSAMF performs similarly with visually different segmentations obtained over a wide range of tuning parameters (see the detailed information in the section “Segmentation Algorithm”). Therefore, careful parameter tuning of EGS is not necessary for our purpose.

We compared sSAMF with “LE”-SAMF on the “WalkByShop1front” video from the Caviar data set.¹ Thanks to the Bayesian framework, all unknown parameters (except the ones for segmentation) are estimated from the data, and therefore no manual parameter tuning is required. For both models (“LE”-SAMF and sSAMF), we used the MU algorithm, which was shown in Section 11.3.1 to be practically more reliable than the standard VB algorithm. The original video consists of 2,360 frames, each of which is a color image with 384 × 288 pixels. We resized each image into 192 × 144 pixels, averaged over the color channels, and subsampled every 15 frames (the frame IDs are 0, 15, 30, . . . , 2355). Thus, V is of the size of 27,648 [pixels] × 158 [frames]. We evaluated “LE”-SAMF and sSAMF on this video, and found that both models perform well (although “LE”-SAMF failed in a few frames).

In order to contrast the two models more clearly, we created a more difficult video by subsampling every five frames from 1,501 to 2,000 (the frame IDs are 1501, 1506, . . . , 1996 and V is of the size of 27,648 [pixels] × 100 [frames]). Since more people walked through in this period, FG/BG separation is more challenging. Figure 11.4(a) shows one of the original frames. This is a difficult snapshot, because a person stayed at the same position for a while, which confuses separation: any object in the FG pixels is assumed to be moving.
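The dimensions of the two video matrices follow directly from the resizing and subsampling choices, and can be recomputed with a short sketch. The helper name `video_matrix_shape` is hypothetical; the frame ranges encode the frame IDs listed in the text.

```python
def video_matrix_shape(width, height, frame_ids):
    """Rows are pixels per resized frame, columns are the retained frames."""
    return width * height, len(list(frame_ids))


# Every 15th frame of the full video (frame IDs 0, 15, ..., 2355).
full_video = video_matrix_shape(192, 144, range(0, 2360, 15))
# Every 5th frame of the harder segment (frame IDs 1501, 1506, ..., 1996).
hard_video = video_matrix_shape(192, 144, range(1501, 2001, 5))
```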
¹ The EC Funded CAVIAR project/IST 2001 37540, found at URL: http://homepages.inf.ed.ac.uk/rbf/CAVIAR/.

Figures 11.4(c) and 11.4(d) show the BG and the FG terms, respectively, obtained by “LE”-SAMF. We can see that “LE”-SAMF failed to separate the person from BG (the person is partly captured in the BG term). On the other hand, Figures 11.4(e) and 11.4(f) show the BG and the FG terms obtained by sSAMF based on the segmented image shown in Figure 11.4(b). We can see that sSAMF successfully separated the person from BG in this difficult frame. A careful look at the legs of the person reveals how segmentation helps separation: the legs form a single segment in Figure 11.4(b), and the segmentwise sparse term (Figure 11.4(f)) captured all pixels on the legs, while the pixelwise sparse term (Figure 11.4(d)) captured only a part of those pixels. We observed that, in all frames of the difficult video, as well as the easier one, sSAMF gave good separation, while “LE”-SAMF failed in several frames.

[Figure 11.4: panels (a) Original, (b) Segmented, (c) BG (“LE”-SAMF), (d) FG (“LE”-SAMF), (e) BG (sSAMF), (f) FG (sSAMF).]
Figure 11.4 “LE”-SAMF vs. segmentation-based SAMF in FG/BG video separation.

For reference, we applied the convex formulation of robust PCA (Candès et al., 2011), which solves the following minimization problem by the inexact augmented Lagrange multiplier (ALM) algorithm (Lin et al., 2009):

$$\min_{U^{\mathrm{BG}},\, U^{\mathrm{FG}}} \|U^{\mathrm{BG}}\|_{\mathrm{tr}} + \lambda \|U^{\mathrm{FG}}\|_{1} \quad \text{s.t.} \quad V = U^{\mathrm{BG}} + U^{\mathrm{FG}}, \tag{11.24}$$
where $\|\cdot\|_{\mathrm{tr}}$ and $\|\cdot\|_{1}$ denote the trace norm and the $\ell_1$-norm of a matrix, respectively. Figure 11.5 shows the obtained BG and FG terms of the same frame as that in Figure 11.4 with λ = 0.001, 0.005, 0.025. We see that the performance strongly depends on the parameter value of λ, and that sSAMF gives an almost identical result (bottom row in Figure 11.4) to the best ALM result with λ = 0.005 (middle row in Figure 11.5) without any manual parameter tuning. In the following subsections, we give detailed information on the segmentation algorithm and the computation time.

[Figure 11.5: panels (a) BG and (b) FG for ALM with λ = 0.001, (c) BG and (d) FG for ALM with λ = 0.005, (e) BG and (f) FG for ALM with λ = 0.025.]
Figure 11.5 FG/BG video separation by the convex formulation of robust PCA (11.24) for λ = 0.001 (top row), λ = 0.005 (middle row), and λ = 0.025 (bottom row).
[Figure 11.6: panels (a) Original image, (b) Segmented (k = 1), (c) Segmented (k = 10), (d) Segmented (k = 100).]
Figure 11.6 Segmented images by the efficient graph-based segmentation (EGS) algorithm with different k values. They are visually different, but with all these segmentations, sSAMF gave almost identical FG/BG separations. The original image (a) is the same frame as the one in Figure 11.4.
Segmentation Algorithm
For the EGS algorithm (Felzenszwalb and Huttenlocher, 2004), we used the code publicly available from the authors’ homepage.² EGS has three tuning parameters: sigma, the smoothing parameter; k, the threshold parameter; and minc, the minimum segment size. Among them, k dominantly determines the typical size of segments (larger k leads to larger segments). To obtain oversegmented images for sSAMF in our experiment, we chose k = 50, and the other parameters were set to sigma = 0.5 and minc = 20 as recommended by the authors. We also tested other parameter settings, and observed that FG/BG separation by sSAMF performed almost equally for 1 ≤ k ≤ 100, despite the visual variation of segmented images (see Figure 11.6). Overall, we empirically observed that the performance of sSAMF is not very sensitive to the selection of segmented images, unless the image is highly undersegmented.

² www.cs.brown.edu/~pff/
Computation Time
The computation time for segmentation by EGS was less than 10 sec (for 100 frames). Forming the one-to-one map X took more than 80 sec (which is expected to be improved). In total, sSAMF took 600 sec on a Linux machine with Xeon X5570 (2.93 GHz), while “LE”-SAMF took 700 sec. This slight reduction in computation time comes from the reduction in the number K of partitions for the FG term, and hence the number of calculations of partial analytic solutions.
12 MAP and Partially Bayesian Learning
Variational Bayesian (VB) learning generally offers a tractable approximation of Bayesian learning, and efficient iterative local search algorithms were derived for many practical models based on the conditional conjugacy (see Part II). However, in some applications, VB learning is still computationally too costly. In such cases, cruder approximation methods, where all or some of the parameters are point-estimated, with potentially less computation cost, are attractive alternatives. For example, Chu and Ghahramani (2009) applied partially Bayesian (PB) learning (introduced in Section 2.2.2), where the core tensor is integrated out and the factor matrices are point-estimated, to Tucker factorization (TF) (Carroll and Chang, 1970; Harshman, 1970; Tucker, 1996; Kolda and Bader, 2009). Mørup and Hansen (2009) applied the maximum a posteriori (MAP) learning to TF with the empirical Bayesian procedure, i.e., the hyperparameters are also estimated from observations. Their proposed empirical MAP learning, which only requires the same order of computation costs as the plain alternating least squares algorithm (Kolda and Bader, 2009), showed its model selection capability through the automatic relevance determination (ARD) property. Motivated by the empirical success, we have analyzed PB learning and MAP learning and their empirical Bayesian variants (Nakajima et al., 2011; Nakajima and Sugiyama, 2014), which this chapter introduces. Focusing on fully observed matrix factorization (MF), we first analyze the global and local solutions of MAP learning and PB learning and their empirical Bayesian variants. This analysis theoretically reveals similarities and dissimilarities to VB learning. After that, we discuss more general cases, including MF with missing entries and TF.
12.1 Theoretical Analysis in Fully Observed MF
In this section, we formulate MAP learning and PB learning in the free energy minimization framework (Section 2.1.1) and derive analytic-form solutions.
12.1.1 Problem Description
The model likelihood and the prior of the MF model are given by

$$p(V \mid A, B) \propto \exp\left( -\frac{1}{2\sigma^2} \left\| V - B A^\top \right\|_{\mathrm{Fro}}^2 \right), \tag{12.1}$$

$$p(A) \propto \exp\left( -\frac{1}{2} \mathrm{tr}\left( A C_A^{-1} A^\top \right) \right), \tag{12.2}$$

$$p(B) \propto \exp\left( -\frac{1}{2} \mathrm{tr}\left( B C_B^{-1} B^\top \right) \right), \tag{12.3}$$

where the prior covariance matrices are restricted to be diagonal:

$$C_A = \mathrm{Diag}(c_{a_1}^2, \ldots, c_{a_H}^2), \qquad C_B = \mathrm{Diag}(c_{b_1}^2, \ldots, c_{b_H}^2),$$

for $c_{a_h}, c_{b_h} > 0$, $h = 1, \ldots, H$. Without loss of generality, we assume that the diagonal entries of the product $C_A C_B$ are arranged in nonincreasing order, i.e., $c_{a_h} c_{b_h} \geq c_{a_{h'}} c_{b_{h'}}$ for any pair $h < h'$.

As in Section 2.2.2, we treat MAP learning and PB learning as special cases of VB learning in the free energy minimization framework. The Bayes posterior is given by

$$p(A, B \mid V) = \frac{p(V \mid A, B)\, p(A)\, p(B)}{p(V)}, \tag{12.4}$$

which is intractable for the MF model (12.1) through (12.3). Accordingly, we approximate it by

$$\widehat{r} = \mathop{\mathrm{argmin}}_{r} F(r) \quad \text{s.t.} \quad r(A, B) \in \mathcal{G}, \tag{12.5}$$

where $\mathcal{G}$ specifies the constraint on the approximate posterior. $F(r)$ is the free energy, defined as

$$F(r) = \left\langle \log \frac{r(A, B)}{p(V \mid A, B)\, p(A)\, p(B)} \right\rangle_{r(A,B)} = \left\langle \log \frac{r(A, B)}{p(A, B \mid V)} \right\rangle_{r(A,B)} - \log p(V), \tag{12.6}$$

which is a monotonic function of the KL divergence $\left\langle \log \frac{r(A,B)}{p(A,B \mid V)} \right\rangle_{r(A,B)}$ to the Bayes posterior.
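Constraints for MAP Learning and PB Learning
MAP learning finds the mode of the posterior distribution, which amounts to approximating the posterior with the Dirac delta function. Accordingly, solving the problem (12.5) with the following constraint gives the MAP solution:

$$r^{\mathrm{MAP}}(A, B) = \delta(A; \widehat{A})\, \delta(B; \widehat{B}), \tag{12.7}$$

where $\delta(\mu; \widehat{\mu})$ denotes the (pseudo-)Dirac delta function located at $\widehat{\mu}$.¹ Under the MAP constraint (12.7), the free energy (12.6) is written as

$$F^{\mathrm{MAP}}(\widehat{A}, \widehat{B}) = \left\langle \log \frac{\delta(A; \widehat{A})\, \delta(B; \widehat{B})}{p(V \mid A, B)\, p(A)\, p(B)} \right\rangle_{\delta(A; \widehat{A})\, \delta(B; \widehat{B})} = -\log p(V \mid \widehat{A}, \widehat{B})\, p(\widehat{A})\, p(\widehat{B}) + \chi_A + \chi_B, \tag{12.8}$$

where

$$\chi_A = \left\langle \log \delta(A; \widehat{A}) \right\rangle_{\delta(A; \widehat{A})}, \qquad \chi_B = \left\langle \log \delta(B; \widehat{B}) \right\rangle_{\delta(B; \widehat{B})} \tag{12.9}$$

are the negative entropies of the pseudo-Dirac delta functions.

¹ By the pseudo-Dirac delta function, we mean an extremely localized density function, e.g., $\delta(A; \widehat{A}) \propto \exp\left( -\|A - \widehat{A}\|_{\mathrm{Fro}}^2 / (2\varepsilon^2) \right)$ with a very small but strictly positive variance $\varepsilon^2 > 0$, such that its tail effect can be ignored, while its negative entropy $\chi_A = \langle \log \delta(A; \widehat{A}) \rangle_{\delta(A; \widehat{A})}$ remains finite.

PB learning is a strategy to analytically integrate out as many parameters as possible, and the rest are point-estimated. In the MF model, a natural choice is to integrate A out and point-estimate B, which we call PB-A learning, or to integrate B out and point-estimate A, which we call PB-B learning. Their solutions can be obtained by solving the problem (12.5) with the following constraints, respectively:

$$r^{\mathrm{PB\text{-}A}}(A, B) = r_A^{\mathrm{PB}}(A)\, \delta(B; \widehat{B}), \tag{12.10}$$

$$r^{\mathrm{PB\text{-}B}}(A, B) = \delta(A; \widehat{A})\, r_B^{\mathrm{PB}}(B). \tag{12.11}$$

Under the PB-A constraint (12.10), the free energy (12.6) is written as

$$F^{\mathrm{PB\text{-}A}}(r_A^{\mathrm{PB}}, \widehat{B}) = \left\langle \log \frac{r_A^{\mathrm{PB}}(A)}{p(V \mid A, \widehat{B})\, p(A)\, p(\widehat{B})} \right\rangle_{r_A^{\mathrm{PB}}(A)} + \chi_B = \left\langle \log \frac{r_A^{\mathrm{PB}}(A)}{p(A \mid V, \widehat{B})} \right\rangle_{r_A^{\mathrm{PB}}(A)} - \log p(V \mid \widehat{B})\, p(\widehat{B}) + \chi_B, \tag{12.12}$$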
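where

$$p(A \mid V, \widehat{B}) = \frac{p(V \mid A, \widehat{B})\, p(A)}{p(V \mid \widehat{B})}, \tag{12.13}$$

and

$$p(V \mid \widehat{B}) = \left\langle p(V \mid A, \widehat{B}) \right\rangle_{p(A)} \tag{12.14}$$

are the posterior distribution with respect to $A$ (given $\widehat{B}$) and the marginal distribution, respectively. Note that Eq. (12.12) is a functional of $r_A^{\mathrm{PB}}$ and $\widehat{B}$, and $\chi_B$ is a constant with respect to them. Since only the first term depends on $r_A^{\mathrm{PB}}$, on which no restriction is imposed, Eq. (12.12) is minimized when

$$r_A^{\mathrm{PB}}(A) = p(A \mid V, \widehat{B}) \tag{12.15}$$

for any $\widehat{B}$. With Eq. (12.15), the first term in Eq. (12.12) vanishes, and thus the estimator for $B$ is given by

$$\widehat{B}^{\mathrm{PB\text{-}A}} = \mathop{\mathrm{argmin}}_{\widehat{B}} \acute{F}^{\mathrm{PB\text{-}A}}(\widehat{B}), \tag{12.16}$$

where

$$\acute{F}^{\mathrm{PB\text{-}A}}(\widehat{B}) \equiv \min_{r_A^{\mathrm{PB}}} F^{\mathrm{PB\text{-}A}}(r_A^{\mathrm{PB}}, \widehat{B}) = -\log p(V \mid \widehat{B})\, p(\widehat{B}) + \chi_B. \tag{12.17}$$

The process to compute $\acute{F}^{\mathrm{PB\text{-}A}}(\widehat{B})$ in Eq. (12.17) corresponds to integrating $A$ out based on the conditional conjugacy. The probabilistic PCA, introduced in Section 3.1.2, was originally proposed with PB-A learning (Tipping and Bishop, 1999).

In the same way, we can obtain the approximate posterior under the PB-B constraint (12.11) as follows:

$$r_B^{\mathrm{PB}}(B) = p(B \mid V, \widehat{A}), \tag{12.18}$$

$$\widehat{A}^{\mathrm{PB\text{-}B}} = \mathop{\mathrm{argmin}}_{\widehat{A}} \acute{F}^{\mathrm{PB\text{-}B}}(\widehat{A}), \tag{12.19}$$

where

$$p(B \mid V, \widehat{A}) = \frac{p(V \mid \widehat{A}, B)\, p(B)}{p(V \mid \widehat{A})}, \tag{12.20}$$

$$p(V \mid \widehat{A}) = \left\langle p(V \mid \widehat{A}, B) \right\rangle_{p(B)}, \tag{12.21}$$

$$\acute{F}^{\mathrm{PB\text{-}B}}(\widehat{A}) \equiv \min_{r_B^{\mathrm{PB}}} F^{\mathrm{PB\text{-}B}}(r_B^{\mathrm{PB}}, \widehat{A}) = -\log p(V \mid \widehat{A})\, p(\widehat{A}) + \chi_A. \tag{12.22}$$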
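We define PB learning as whichever of PB-A learning and PB-B learning gives the lower free energy. Namely,

$$\widehat{r}^{\mathrm{PB}}(A, B) = \begin{cases} \widehat{r}^{\mathrm{PB\text{-}A}}(A, B) & \text{if } \min F^{\mathrm{PB\text{-}A}}(r_A^{\mathrm{PB}}, \widehat{B}) \leq \min F^{\mathrm{PB\text{-}B}}(r_B^{\mathrm{PB}}, \widehat{A}), \\ \widehat{r}^{\mathrm{PB\text{-}B}}(A, B) & \text{otherwise.} \end{cases}$$

Free Energies for MAP Learning and PB Learning
Clearly, the constraint (12.7) for MAP learning and the constraints (12.10) and (12.11) for PB learning force independence between $A$ and $B$, and therefore, they are stronger than the independence constraint

$$r^{\mathrm{VB}}(A, B) = r_A^{\mathrm{VB}}(A)\, r_B^{\mathrm{VB}}(B) \tag{12.23}$$

for VB learning. In Chapter 6, we showed that the VB posterior under the independence constraint (12.23) is in the following Gaussian form:

$$r_A(A) = \mathrm{MGauss}_{M,H}(A; \widehat{A}, I_M \otimes \widehat{\Sigma}_A) \propto \exp\left( -\frac{\mathrm{tr}\left( (A - \widehat{A})\, \widehat{\Sigma}_A^{-1} (A - \widehat{A})^\top \right)}{2} \right), \tag{12.24}$$

$$r_B(B) = \mathrm{MGauss}_{L,H}(B; \widehat{B}, I_L \otimes \widehat{\Sigma}_B) \propto \exp\left( -\frac{\mathrm{tr}\left( (B - \widehat{B})\, \widehat{\Sigma}_B^{-1} (B - \widehat{B})^\top \right)}{2} \right), \tag{12.25}$$

where the posterior covariances, $\widehat{\Sigma}_A$ and $\widehat{\Sigma}_B$, are diagonal. Furthermore, Eqs. (12.24) and (12.25) can be the pseudo-Dirac delta functions by setting the posterior covariances to $\widehat{\Sigma}_A = \varepsilon^2 I_H$ and $\widehat{\Sigma}_B = \varepsilon^2 I_H$, respectively, for a very small $\varepsilon^2 > 0$. Consequently, the MAP and the PB solutions can be obtained by minimizing the free energy for VB learning with posterior covariances clipped to $\varepsilon^2 I_H$, according to the corresponding constraint. Namely, we start from the free energy expression (6.42) for VB learning, i.e.,

$$2F = LM \log(2\pi\sigma^2) + \frac{\sum_{h=1}^{L} \gamma_h^2}{\sigma^2} + \sum_{h=1}^{H} 2F_h, \tag{12.26}$$

where

$$2F_h = M \log \frac{c_{a_h}^2}{\widehat{\sigma}_{a_h}^2} + L \log \frac{c_{b_h}^2}{\widehat{\sigma}_{b_h}^2} + \frac{\widehat{a}_h^2 + M \widehat{\sigma}_{a_h}^2}{c_{a_h}^2} + \frac{\widehat{b}_h^2 + L \widehat{\sigma}_{b_h}^2}{c_{b_h}^2} - (L + M) + \frac{-2 \widehat{a}_h \widehat{b}_h \gamma_h + \left( \widehat{a}_h^2 + M \widehat{\sigma}_{a_h}^2 \right) \left( \widehat{b}_h^2 + L \widehat{\sigma}_{b_h}^2 \right)}{\sigma^2}, \tag{12.27}$$

and set

$$\widehat{\sigma}_{a_h}^2 = \varepsilon^2, \qquad h = 1, \ldots, H, \tag{12.28}$$

for MAP learning and PB-B learning, and

$$\widehat{\sigma}_{b_h}^2 = \varepsilon^2, \qquad h = 1, \ldots, H, \tag{12.29}$$

for MAP learning and PB-A learning. Here,

$$\widehat{A} = \left( \widehat{a}_1 \boldsymbol{\omega}_{a_1}, \ldots, \widehat{a}_H \boldsymbol{\omega}_{a_H} \right), \qquad \widehat{B} = \left( \widehat{b}_1 \boldsymbol{\omega}_{b_1}, \ldots, \widehat{b}_H \boldsymbol{\omega}_{b_H} \right),$$

$$\widehat{\Sigma}_A = \mathrm{Diag}\left( \widehat{\sigma}_{a_1}^2, \ldots, \widehat{\sigma}_{a_H}^2 \right), \qquad \widehat{\Sigma}_B = \mathrm{Diag}\left( \widehat{\sigma}_{b_1}^2, \ldots, \widehat{\sigma}_{b_H}^2 \right),$$

and

$$V = \sum_{h=1}^{L} \gamma_h \boldsymbol{\omega}_{b_h} \boldsymbol{\omega}_{a_h}^\top$$

is the singular value decomposition (SVD) of $V$. Thus, the free energies for MAP learning, PB-A learning, and PB-B learning are given by Eq. (12.26) for

$$2F_h^{\mathrm{MAP}} = M \log c_{a_h}^2 + L \log c_{b_h}^2 + \frac{\widehat{a}_h^2}{c_{a_h}^2} + \frac{\widehat{b}_h^2}{c_{b_h}^2} + \frac{-2 \widehat{a}_h \widehat{b}_h \gamma_h + \widehat{a}_h^2 \widehat{b}_h^2}{\sigma^2} - (L + M) + (L + M)\chi, \tag{12.30}$$

$$2F_h^{\mathrm{PB\text{-}A}} = M \log \frac{c_{a_h}^2}{\widehat{\sigma}_{a_h}^2} + L \log c_{b_h}^2 + \frac{\widehat{a}_h^2 + M \widehat{\sigma}_{a_h}^2}{c_{a_h}^2} + \frac{\widehat{b}_h^2}{c_{b_h}^2} + \frac{-2 \widehat{a}_h \widehat{b}_h \gamma_h + \left( \widehat{a}_h^2 + M \widehat{\sigma}_{a_h}^2 \right) \widehat{b}_h^2}{\sigma^2} - (L + M) + L\chi, \tag{12.31}$$

$$2F_h^{\mathrm{PB\text{-}B}} = M \log c_{a_h}^2 + L \log \frac{c_{b_h}^2}{\widehat{\sigma}_{b_h}^2} + \frac{\widehat{a}_h^2}{c_{a_h}^2} + \frac{\widehat{b}_h^2 + L \widehat{\sigma}_{b_h}^2}{c_{b_h}^2} + \frac{-2 \widehat{a}_h \widehat{b}_h \gamma_h + \widehat{a}_h^2 \left( \widehat{b}_h^2 + L \widehat{\sigma}_{b_h}^2 \right)}{\sigma^2} - (L + M) + M\chi, \tag{12.32}$$

respectively, where

$$\chi = -\log \varepsilon^2 \tag{12.33}$$

is a large positive constant corresponding to the negative entropy of the one-dimensional pseudo-Dirac delta function.

As in VB learning, the free energy (12.26) is separable for each singular component as long as the noise variance $\sigma^2$ is treated as a constant. Therefore, the variational parameters $(\widehat{a}_h, \widehat{b}_h, \widehat{\sigma}_{a_h}^2, \widehat{\sigma}_{b_h}^2)$ and the prior covariances $(c_{a_h}^2, c_{b_h}^2)$ for the $h$th component can be estimated by minimizing $F_h$.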
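The clipping argument can be checked numerically. The sketch below, with hypothetical function names and arbitrary test values of our choosing, evaluates the componentwise VB free energy of Eq. (12.27) with both posterior variances set to a small ε², and compares it with the MAP expression of Eq. (12.30) using χ = −log ε² from Eq. (12.33).

```python
import math


def two_f_vb(a, b, s2a, s2b, c2a, c2b, gamma, sigma2, L, M):
    """Componentwise VB free energy 2F_h of Eq. (12.27)."""
    ea = a * a + M * s2a
    eb = b * b + L * s2b
    return (M * math.log(c2a / s2a) + L * math.log(c2b / s2b)
            + ea / c2a + eb / c2b - (L + M)
            + (-2.0 * a * b * gamma + ea * eb) / sigma2)


def two_f_map(a, b, c2a, c2b, gamma, sigma2, L, M, eps2):
    """MAP free energy 2F_h^MAP of Eq. (12.30), with chi = -log(eps2)."""
    chi = -math.log(eps2)
    return (M * math.log(c2a) + L * math.log(c2b)
            + a * a / c2a + b * b / c2b
            + (-2.0 * a * b * gamma + a * a * b * b) / sigma2
            - (L + M) + (L + M) * chi)


eps2 = 1e-12  # very small but strictly positive variance
args = dict(a=0.7, b=1.3, c2a=2.0, c2b=0.5, gamma=3.0, sigma2=1.0, L=3, M=5)
gap = two_f_vb(s2a=eps2, s2b=eps2, **args) - two_f_map(eps2=eps2, **args)
```

The remaining gap consists only of terms proportional to ε², which is why they are dropped in Eq. (12.30).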
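12.1.2 Global Solutions

In this section, we derive the global minimizers of the free energies, (12.30) through (12.32), and analyze their behavior.

Global MAP and PB Solutions

By minimizing the MAP free energy (12.30), we can obtain the global solution for MAP learning, given the hyperparameters $C_A$, $C_B$, $\sigma^2$ treated as fixed constants. Let $U = BA^\top$ be the target low-rank matrix.

Theorem 12.1 Given $C_A, C_B \in \mathbb{D}_{++}^H$ and $\sigma^2 \in \mathbb{R}_{++}$, the MAP solution of the MF model (12.1) through (12.3) is given by

\[ \hat{U}^{\mathrm{MAP}} = \sum_{h=1}^{H} \hat{\gamma}_h^{\mathrm{MAP}} \omega_{b_h}\omega_{a_h}^\top, \quad \text{where} \quad \hat{\gamma}_h^{\mathrm{MAP}} = \begin{cases} \breve{\gamma}_h^{\mathrm{MAP}} & \text{if } \gamma_h \ge \underline{\gamma}_h^{\mathrm{MAP}}, \\ 0 & \text{otherwise}, \end{cases} \tag{12.34} \]

where

\[ \underline{\gamma}_h^{\mathrm{MAP}} = \frac{\sigma^2}{c_{a_h}c_{b_h}}, \tag{12.35} \]

\[ \breve{\gamma}_h^{\mathrm{MAP}} = \gamma_h - \frac{\sigma^2}{c_{a_h}c_{b_h}}. \tag{12.36} \]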
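Proof Eq. (12.30) can be written as a function of $\hat{a}_h$ and $\hat{b}_h$ as

\[ \begin{aligned} 2F_h^{\mathrm{MAP}} &= \frac{\hat{a}_h^2}{c_{a_h}^2} + \frac{\hat{b}_h^2}{c_{b_h}^2} + \frac{-2\hat{a}_h\hat{b}_h\gamma_h + \hat{a}_h^2\hat{b}_h^2}{\sigma^2} + \text{const.} \\ &= \left(\frac{\hat{a}_h}{c_{a_h}} - \frac{\hat{b}_h}{c_{b_h}}\right)^2 + \frac{\left(\hat{a}_h\hat{b}_h - \left(\gamma_h - \frac{\sigma^2}{c_{a_h}c_{b_h}}\right)\right)^2}{\sigma^2} + \text{const.} \end{aligned} \tag{12.37} \]

Noting that the first two terms are nonnegative, we find that, if $\gamma_h > \underline{\gamma}_h^{\mathrm{MAP}}$, Eq. (12.37) is minimized when

\[ \breve{\gamma}_h^{\mathrm{MAP}} \equiv \hat{a}_h\hat{b}_h = \gamma_h - \frac{\sigma^2}{c_{a_h}c_{b_h}}, \tag{12.38} \]

\[ \hat{\delta}_h^{\mathrm{MAP}} \equiv \frac{\hat{a}_h}{\hat{b}_h} = \frac{c_{a_h}}{c_{b_h}}. \tag{12.39} \]

Otherwise, it is minimized when

\[ \hat{a}_h = 0, \qquad \hat{b}_h = 0, \]

which completes the proof.

Eqs. (12.38) and (12.39) immediately lead to the following corollary:

Corollary 12.2 The MAP posterior is given by

\[ r^{\mathrm{MAP}}(A, B) = \prod_{h=1}^{H} \delta\left(a_h; \hat{a}_h\omega_{a_h}\right) \prod_{h=1}^{H} \delta\left(b_h; \hat{b}_h\omega_{b_h}\right), \tag{12.40} \]

with the following estimators: if $\gamma_h > \underline{\gamma}_h^{\mathrm{MAP}}$,

\[ \hat{a}_h = \pm\sqrt{\breve{\gamma}_h^{\mathrm{MAP}}\hat{\delta}_h^{\mathrm{MAP}}}, \qquad \hat{b}_h = \pm\sqrt{\frac{\breve{\gamma}_h^{\mathrm{MAP}}}{\hat{\delta}_h^{\mathrm{MAP}}}}, \tag{12.41} \]

where

\[ \hat{\delta}_h^{\mathrm{MAP}} \equiv \frac{\hat{a}_h}{\hat{b}_h} = \frac{c_{a_h}}{c_{b_h}}, \tag{12.42} \]

and otherwise

\[ \hat{a}_h = 0, \qquad \hat{b}_h = 0. \tag{12.43} \]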
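As an illustration of Theorem 12.1 and Corollary 12.2, the sketch below computes the MAP estimators for one singular component and evaluates the $(\hat{a}_h, \hat{b}_h)$-dependent part of Eq. (12.37). The function names and the numerical values are assumptions made for this example only.

```python
import math

def map_solution(gamma, c_a, c_b, sigma2):
    """Return (gamma_hat, a_hat, b_hat) for one component.

    Implements the threshold (12.35), the shrinkage (12.36), and the
    positive-sign branch of the estimators (12.41)-(12.43).
    """
    threshold = sigma2 / (c_a * c_b)          # Eq. (12.35)
    if gamma <= threshold:
        return 0.0, 0.0, 0.0                  # Eq. (12.43)
    shrunk = gamma - threshold                # Eq. (12.36)
    delta = c_a / c_b                         # Eq. (12.42)
    return shrunk, math.sqrt(shrunk * delta), math.sqrt(shrunk / delta)

def map_objective(a, b, c_a, c_b, gamma, sigma2):
    """(a, b)-dependent part of 2*F_h^MAP, first line of Eq. (12.37)."""
    return (a**2 / c_a**2 + b**2 / c_b**2
            + (-2 * a * b * gamma + a**2 * b**2) / sigma2)

# Illustrative values (assumptions): threshold is 1.0, so gamma = 3 survives.
g_hat, a_hat, b_hat = map_solution(gamma=3.0, c_a=2.0, c_b=0.5, sigma2=1.0)
```

With these values the observed singular value 3.0 is shifted down by the threshold 1.0, and the product $\hat{a}_h\hat{b}_h$ reproduces the shrunken value; any perturbation of $(\hat{a}_h, \hat{b}_h)$ increases `map_objective`.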
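Similarly, by minimizing the PB-A free energy (12.31) and the PB-B free energy (12.32) and comparing them, we can obtain the global solution for PB learning:

Theorem 12.3 Given $C_A, C_B \in \mathbb{D}_{++}^H$ and $\sigma^2 \in \mathbb{R}_{++}$, the PB solution of the MF model (12.1) through (12.3) is given by

\[ \hat{U}^{\mathrm{PB}} = \sum_{h=1}^{H} \hat{\gamma}_h^{\mathrm{PB}} \omega_{b_h}\omega_{a_h}^\top, \quad \text{where} \quad \hat{\gamma}_h^{\mathrm{PB}} = \begin{cases} \breve{\gamma}_h^{\mathrm{PB}} & \text{if } \gamma_h \ge \underline{\gamma}_h^{\mathrm{PB}}, \\ 0 & \text{otherwise}, \end{cases} \tag{12.44} \]

where

\[ \underline{\gamma}_h^{\mathrm{PB}} = \sigma\sqrt{\max(L, M) + \frac{\sigma^2}{c_{a_h}^2c_{b_h}^2}}, \tag{12.45} \]

\[ \breve{\gamma}_h^{\mathrm{PB}} = \left(1 - \frac{\sigma^2\left(\max(L, M) + \sqrt{\max(L, M)^2 + \frac{4\gamma_h^2}{c_{a_h}^2c_{b_h}^2}}\right)}{2\gamma_h^2}\right)\gamma_h. \tag{12.46} \]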
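Theorem 12.3 translates directly into a per-component shrinkage rule. The following sketch, with a hypothetical function name `pb_solution` and illustrative inputs, implements Eqs. (12.44) through (12.46) as a function of the product $c_{a_h}c_{b_h}$.

```python
import math

def pb_solution(gamma, ca_cb, sigma2, L, M):
    """PB estimator of one singular value, Eqs. (12.44)-(12.46).

    ca_cb is the product c_{a_h} * c_{b_h} of the prior covariances.
    """
    m = max(L, M)
    q = ca_cb**2                                        # c_a^2 * c_b^2
    threshold = math.sqrt(sigma2 * (m + sigma2 / q))    # Eq. (12.45)
    if gamma < threshold:
        return 0.0
    root = math.sqrt(m**2 + 4.0 * gamma**2 / q)
    return (1.0 - sigma2 * (m + root) / (2.0 * gamma**2)) * gamma  # (12.46)

# Illustrative values (assumptions): L = 20, M = 50, unit noise variance.
nearly_flat = pb_solution(10.0, 1e6, 1.0, 20, 50)   # almost flat prior
informative = pb_solution(10.0, 1.0, 1.0, 20, 50)   # c_a c_b = 1
truncated = pb_solution(7.0, 1e6, 1.0, 20, 50)      # below sigma*sqrt(50)
```

Only $\max(L, M)$ enters the formulas, so swapping $L$ and $M$ leaves the estimator unchanged, and a smaller $c_{a_h}c_{b_h}$ yields stronger shrinkage.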
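Corollary 12.4 The PB posterior is given by

\[ r^{\mathrm{PB}}(A, B) = \begin{cases} r^{\mathrm{PB-A}}(A, B) & \text{if } L < M, \\ r^{\mathrm{PB-A}}(A, B) \text{ or } r^{\mathrm{PB-B}}(A, B) & \text{if } L = M, \\ r^{\mathrm{PB-B}}(A, B) & \text{if } L > M, \end{cases} \tag{12.47} \]

where $r^{\mathrm{PB-A}}(A, B)$ and $r^{\mathrm{PB-B}}(A, B)$ are the PB-A posterior and the PB-B posterior, respectively, given as follows. The PB-A posterior is given by

\[ r^{\mathrm{PB-A}}(A, B) = \prod_{h=1}^{H} \mathrm{Gauss}_M\left(a_h; \hat{a}_h\omega_{a_h}, \hat{\sigma}_{a_h}^2 I_M\right) \prod_{h=1}^{H} \delta\left(b_h; \hat{b}_h\omega_{b_h}\right), \tag{12.48} \]

with the following estimators: if

\[ \gamma_h > \underline{\gamma}_h^{\mathrm{PB-A}} \equiv \sigma\sqrt{M + \frac{\sigma^2}{c_{a_h}^2c_{b_h}^2}}, \tag{12.49} \]

then

\[ \hat{a}_h = \pm\sqrt{\breve{\gamma}_h^{\mathrm{PB-A}}\hat{\delta}_h^{\mathrm{PB-A}}}, \qquad \hat{b}_h = \pm\sqrt{\frac{\breve{\gamma}_h^{\mathrm{PB-A}}}{\hat{\delta}_h^{\mathrm{PB-A}}}}, \qquad \hat{\sigma}_{a_h}^2 = \frac{\sigma^2}{\breve{\gamma}_h^{\mathrm{PB-A}}/\hat{\delta}_h^{\mathrm{PB-A}} + \sigma^2/c_{a_h}^2}, \tag{12.50} \]

where

\[ \breve{\gamma}_h^{\mathrm{PB-A}} \equiv \hat{a}_h\hat{b}_h = \left(1 - \frac{\sigma^2\left(M + \sqrt{M^2 + \frac{4\gamma_h^2}{c_{a_h}^2c_{b_h}^2}}\right)}{2\gamma_h^2}\right)\gamma_h, \tag{12.51} \]

\[ \hat{\delta}_h^{\mathrm{PB-A}} \equiv \frac{\hat{a}_h}{\hat{b}_h} = \frac{c_{a_h}^2\left(M + \sqrt{M^2 + \frac{4\gamma_h^2}{c_{a_h}^2c_{b_h}^2}}\right)}{2\gamma_h}, \tag{12.52} \]

and otherwise

\[ \hat{a}_h = 0, \qquad \hat{b}_h = 0, \qquad \hat{\sigma}_{a_h}^2 = c_{a_h}^2. \tag{12.53} \]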
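The PB-B posterior is given by

\[ r^{\mathrm{PB-B}}(A, B) = \prod_{h=1}^{H} \delta\left(a_h; \hat{a}_h\omega_{a_h}\right) \prod_{h=1}^{H} \mathrm{Gauss}_L\left(b_h; \hat{b}_h\omega_{b_h}, \hat{\sigma}_{b_h}^2 I_L\right), \tag{12.54} \]

with the following estimators: if

\[ \gamma_h > \underline{\gamma}_h^{\mathrm{PB-B}} \equiv \sigma\sqrt{L + \frac{\sigma^2}{c_{a_h}^2c_{b_h}^2}}, \tag{12.55} \]

then

\[ \hat{a}_h = \pm\sqrt{\breve{\gamma}_h^{\mathrm{PB-B}}\hat{\delta}_h^{\mathrm{PB-B}}}, \qquad \hat{b}_h = \pm\sqrt{\frac{\breve{\gamma}_h^{\mathrm{PB-B}}}{\hat{\delta}_h^{\mathrm{PB-B}}}}, \qquad \hat{\sigma}_{b_h}^2 = \frac{\sigma^2}{\breve{\gamma}_h^{\mathrm{PB-B}}\hat{\delta}_h^{\mathrm{PB-B}} + \sigma^2/c_{b_h}^2}, \tag{12.56} \]

where

\[ \breve{\gamma}_h^{\mathrm{PB-B}} \equiv \hat{a}_h\hat{b}_h = \left(1 - \frac{\sigma^2\left(L + \sqrt{L^2 + \frac{4\gamma_h^2}{c_{a_h}^2c_{b_h}^2}}\right)}{2\gamma_h^2}\right)\gamma_h, \tag{12.57} \]

\[ \hat{\delta}_h^{\mathrm{PB-B}} \equiv \frac{\hat{a}_h}{\hat{b}_h} = \left(\frac{c_{b_h}^2\left(L + \sqrt{L^2 + \frac{4\gamma_h^2}{c_{a_h}^2c_{b_h}^2}}\right)}{2\gamma_h}\right)^{-1}, \tag{12.58} \]

and otherwise

\[ \hat{a}_h = 0, \qquad \hat{b}_h = 0, \qquad \hat{\sigma}_{b_h}^2 = c_{b_h}^2. \tag{12.59} \]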
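The PB-A part of Corollary 12.4 can be sanity-checked numerically. The sketch below is an illustration under assumed parameter values, with hypothetical function names; it computes the positive-sign PB-A estimators from Eqs. (12.49) through (12.53) and evaluates the parameter-dependent part of the PB-A free energy (12.31), so that the closed-form solution can be compared against perturbed parameter values.

```python
import math

def pba_posterior(gamma, c_a, c_b, sigma2, M):
    """Return (a_hat, b_hat, sigma2_a_hat) of the PB-A posterior."""
    q = (c_a * c_b)**2
    threshold = math.sqrt(sigma2 * (M + sigma2 / q))            # Eq. (12.49)
    if gamma <= threshold:
        return 0.0, 0.0, c_a**2                                 # Eq. (12.53)
    root = math.sqrt(M**2 + 4.0 * gamma**2 / q)
    shrunk = (1.0 - sigma2 * (M + root) / (2.0 * gamma**2)) * gamma  # (12.51)
    delta = c_a**2 * (M + root) / (2.0 * gamma)                 # Eq. (12.52)
    a_hat = math.sqrt(shrunk * delta)                           # Eq. (12.50)
    b_hat = math.sqrt(shrunk / delta)
    s2a_hat = sigma2 / (shrunk / delta + sigma2 / c_a**2)
    return a_hat, b_hat, s2a_hat

def two_fh_pba(a, b, s2a, c_a, c_b, gamma, sigma2, M):
    """Part of Eq. (12.31) that depends on (a, b, s2a); constants dropped."""
    return (-M * math.log(s2a) + (a**2 + M * s2a) / c_a**2 + b**2 / c_b**2
            + (-2 * a * b * gamma + (a**2 + M * s2a) * b**2) / sigma2)

# Illustrative values (assumptions).
a_hat, b_hat, s2a_hat = pba_posterior(10.0, 1.5, 0.8, 1.0, 50)
f_min = two_fh_pba(a_hat, b_hat, s2a_hat, 1.5, 0.8, 10.0, 1.0, 50)
```

The returned values satisfy $\hat{a}_h = \hat{b}_h\gamma_h/(\hat{b}_h^2 + \sigma^2/c_{a_h}^2)$ and $\hat{\sigma}_{a_h}^2 = \sigma^2/(\hat{b}_h^2 + \sigma^2/c_{a_h}^2)$, and perturbing any of the three parameters increases `two_fh_pba`.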
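Note that, when $L = M$, the choice from the PB-A and the PB-B posteriors depends on the prior covariances $C_A$ and $C_B$. However, as long as the estimator for the target low-rank matrix $U$ is concerned, the choice does not matter, as Theorem 12.3 states. This is because

\[ \underline{\gamma}_h^{\mathrm{PB-A}} = \underline{\gamma}_h^{\mathrm{PB-B}} \quad \text{and} \quad \breve{\gamma}_h^{\mathrm{PB-A}} = \breve{\gamma}_h^{\mathrm{PB-B}} \quad \text{when} \quad L = M, \]

as Corollary 12.4 implies.

Proofs of Theorem 12.3 and Corollary 12.4 We first derive the PB-A solution by minimizing the corresponding free energy (12.31):

\[ 2F_h^{\mathrm{PB-A}} = M \log\frac{c_{a_h}^2}{\hat{\sigma}_{a_h}^2} + L \log c_{b_h}^2 + \frac{\hat{a}_h^2 + M\hat{\sigma}_{a_h}^2}{c_{a_h}^2} + \frac{\hat{b}_h^2}{c_{b_h}^2} + \frac{-2\hat{a}_h\hat{b}_h\gamma_h + \left(\hat{a}_h^2 + M\hat{\sigma}_{a_h}^2\right)\hat{b}_h^2}{\sigma^2} - (L+M) + L\chi. \tag{12.60} \]

As a function of $\hat{a}_h$ (treating $\hat{b}_h$ and $\hat{\sigma}_{a_h}^2$ as fixed constants), Eq. (12.60) can be written as

\[ \begin{aligned} 2F_h^{\mathrm{PB-A}}(\hat{a}_h) &= \frac{\hat{a}_h^2}{c_{a_h}^2} + \frac{-2\hat{a}_h\hat{b}_h\gamma_h + \hat{a}_h^2\hat{b}_h^2}{\sigma^2} + \text{const.} \\ &= \frac{\hat{b}_h^2 + \sigma^2/c_{a_h}^2}{\sigma^2}\left(\hat{a}_h^2 - 2\frac{\hat{b}_h\gamma_h}{\hat{b}_h^2 + \sigma^2/c_{a_h}^2}\hat{a}_h\right) + \text{const.} \\ &= \frac{\hat{b}_h^2 + \sigma^2/c_{a_h}^2}{\sigma^2}\left(\hat{a}_h - \frac{\hat{b}_h\gamma_h}{\hat{b}_h^2 + \sigma^2/c_{a_h}^2}\right)^2 + \text{const.}, \end{aligned} \]

which is minimized when

\[ \hat{a}_h = \frac{\hat{b}_h\gamma_h}{\hat{b}_h^2 + \sigma^2/c_{a_h}^2}. \tag{12.61} \]

As a function of $\hat{\sigma}_{a_h}^2$ (treating $\hat{a}_h$ and $\hat{b}_h$ as fixed constants), Eq. (12.60) can be written as

\[ \begin{aligned} 2F_h^{\mathrm{PB-A}}(\hat{\sigma}_{a_h}^2) &= M\left(-\log\hat{\sigma}_{a_h}^2 + \left(\frac{1}{c_{a_h}^2} + \frac{\hat{b}_h^2}{\sigma^2}\right)\hat{\sigma}_{a_h}^2\right) + \text{const.} \\ &= M\left(-\log\left(\left(\frac{1}{c_{a_h}^2} + \frac{\hat{b}_h^2}{\sigma^2}\right)\hat{\sigma}_{a_h}^2\right) + \left(\frac{1}{c_{a_h}^2} + \frac{\hat{b}_h^2}{\sigma^2}\right)\hat{\sigma}_{a_h}^2\right) + \text{const.}, \end{aligned} \]

which is minimized when

\[ \hat{\sigma}_{a_h}^2 = \left(\frac{1}{c_{a_h}^2} + \frac{\hat{b}_h^2}{\sigma^2}\right)^{-1} = \frac{\sigma^2}{\hat{b}_h^2 + \sigma^2/c_{a_h}^2}. \tag{12.62} \]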
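Therefore, substituting Eqs. (12.61) and (12.62) into Eq. (12.60) gives the free energy with $\hat{a}_h$ and $\hat{\sigma}_{a_h}^2$ already optimized:

\[ \begin{aligned} 2\acute{F}_h^{\mathrm{PB-A}} &= \min_{\hat{a}_h, \hat{\sigma}_{a_h}^2} 2F_h^{\mathrm{PB-A}} \\ &= -M\log\hat{\sigma}_{a_h}^2 + \frac{\hat{a}_h^2 + M\hat{\sigma}_{a_h}^2}{\hat{\sigma}_{a_h}^2} + \frac{\hat{b}_h^2}{c_{b_h}^2} - \frac{2\hat{a}_h\hat{b}_h\gamma_h}{\sigma^2} + \text{const.} \\ &= M\log\left(\hat{b}_h^2 + \sigma^2/c_{a_h}^2\right) - \frac{\hat{b}_h^2\gamma_h^2}{\sigma^2\left(\hat{b}_h^2 + \sigma^2/c_{a_h}^2\right)} + \frac{\hat{b}_h^2}{c_{b_h}^2} + \text{const.} \\ &= M\log\left(\hat{b}_h^2 + \sigma^2/c_{a_h}^2\right) + \left(-\frac{\gamma_h^2}{\sigma^2} + \frac{\gamma_h^2}{c_{a_h}^2}\left(\hat{b}_h^2 + \sigma^2/c_{a_h}^2\right)^{-1}\right) + \left(\frac{\hat{b}_h^2}{c_{b_h}^2} + \frac{\sigma^2}{c_{a_h}^2c_{b_h}^2}\right) + \text{const.} \\ &= M\log\left(\hat{b}_h^2 + \sigma^2/c_{a_h}^2\right) + \frac{\gamma_h^2}{c_{a_h}^2}\left(\hat{b}_h^2 + \sigma^2/c_{a_h}^2\right)^{-1} + \frac{1}{c_{b_h}^2}\left(\hat{b}_h^2 + \sigma^2/c_{a_h}^2\right) + \text{const.} \end{aligned} \tag{12.63} \]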
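In the second-to-last equation, we added some constants so that the minimizer can be found by the following lemma:

Lemma 12.5 The function

\[ f(x) = \xi_{\log}\log x + \xi_{-1}x^{-1} + \xi_1 x \]

of $x > 0$ for positive coefficients $\xi_{\log}, \xi_{-1}, \xi_1 > 0$ is strictly quasiconvex,² and minimized at

\[ \hat{x} = \frac{-\xi_{\log} + \sqrt{\xi_{\log}^2 + 4\xi_1\xi_{-1}}}{2\xi_1}. \]

² The definition of quasiconvexity is given in footnote 1 in Section 8.1.

Proof $f(x)$ is differentiable in $x > 0$, and its first derivative is

\[ \begin{aligned} \frac{\partial f}{\partial x} &= \xi_{\log}x^{-1} - \xi_{-1}x^{-2} + \xi_1 \\ &= x^{-2}\left(\xi_1x^2 + \xi_{\log}x - \xi_{-1}\right) \\ &= \xi_1x^{-2}\left(x + \frac{\xi_{\log} + \sqrt{\xi_{\log}^2 + 4\xi_1\xi_{-1}}}{2\xi_1}\right)\left(x - \frac{-\xi_{\log} + \sqrt{\xi_{\log}^2 + 4\xi_1\xi_{-1}}}{2\xi_1}\right). \end{aligned} \]

Since the first two factors are positive, we find that $f(x)$ is strictly decreasing for $0 < x < \hat{x}$, and strictly increasing for $x > \hat{x}$, which proves the lemma.

By applying Lemma 12.5 to Eq. (12.63) with $x = \hat{b}_h^2 + \sigma^2/c_{a_h}^2$, we find that $\acute{F}_h^{\mathrm{PB-A}}$ is strictly quasiconvex and minimized when

\[ \hat{b}_h^2 + \sigma^2/c_{a_h}^2 = \frac{c_{b_h}^2\left(-M + \sqrt{M^2 + \frac{4\gamma_h^2}{c_{a_h}^2c_{b_h}^2}}\right)}{2}. \tag{12.64} \]

Since $\hat{b}_h^2$ is nonnegative, the minimizer of the free energy (12.63) is given by

\[ \hat{b}_h^2 = \max\left\{0, \frac{c_{b_h}^2\left(-\left(M + \frac{2\sigma^2}{c_{a_h}^2c_{b_h}^2}\right) + \sqrt{M^2 + \frac{4\gamma_h^2}{c_{a_h}^2c_{b_h}^2}}\right)}{2}\right\}. \tag{12.65} \]

Apparently, Eq. (12.65) is positive when

\[ \left(M + \frac{2\sigma^2}{c_{a_h}^2c_{b_h}^2}\right)^2 < M^2 + \frac{4\gamma_h^2}{c_{a_h}^2c_{b_h}^2}, \]

which leads to the thresholding condition:

\[ \gamma_h > \underline{\gamma}_h^{\mathrm{PB-A}}. \]

By using Eqs. (12.61), (12.62), (12.65), and

\[ \left(\hat{b}_h^2 + \sigma^2/c_{a_h}^2\right)^{-1} = \begin{cases} \dfrac{c_{a_h}^2\left(M + \sqrt{M^2 + \frac{4\gamma_h^2}{c_{a_h}^2c_{b_h}^2}}\right)}{2\gamma_h^2} & \text{if } \gamma_h > \underline{\gamma}_h^{\mathrm{PB-A}}, \\[2ex] \dfrac{c_{a_h}^2}{\sigma^2} & \text{otherwise}, \end{cases} \tag{12.66} \]

derived from Eqs. (12.64) and (12.65), we obtain

\[ \begin{aligned} \hat{\gamma}_h^{\mathrm{PB-A}} \equiv \hat{a}_h\hat{b}_h &= \frac{\hat{b}_h^2\gamma_h}{\hat{b}_h^2 + \sigma^2/c_{a_h}^2} = \left(1 - \frac{\sigma^2/c_{a_h}^2}{\hat{b}_h^2 + \sigma^2/c_{a_h}^2}\right)\gamma_h \\ &= \begin{cases} \left(1 - \dfrac{\sigma^2\left(M + \sqrt{M^2 + \frac{4\gamma_h^2}{c_{a_h}^2c_{b_h}^2}}\right)}{2\gamma_h^2}\right)\gamma_h & \text{if } \gamma_h > \underline{\gamma}_h^{\mathrm{PB-A}}, \\[2ex] 0 & \text{otherwise}, \end{cases} \end{aligned} \]

\[ \hat{\delta}_h^{\mathrm{PB-A}} \equiv \frac{\hat{a}_h}{\hat{b}_h} = \frac{\gamma_h}{\hat{b}_h^2 + \sigma^2/c_{a_h}^2} = \frac{c_{a_h}^2\left(M + \sqrt{M^2 + \frac{4\gamma_h^2}{c_{a_h}^2c_{b_h}^2}}\right)}{2\gamma_h} \quad \text{for } \gamma_h > \underline{\gamma}_h^{\mathrm{PB-A}}, \]

\[ \hat{\sigma}_{a_h}^2 = \frac{\sigma^2}{\hat{b}_h^2 + \sigma^2/c_{a_h}^2} = \begin{cases} \dfrac{\sigma^2}{\hat{\gamma}_h^{\mathrm{PB-A}}/\hat{\delta}_h^{\mathrm{PB-A}} + \sigma^2/c_{a_h}^2} & \text{if } \gamma_h > \underline{\gamma}_h^{\mathrm{PB-A}}, \\[2ex] c_{a_h}^2 & \text{otherwise}. \end{cases} \]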
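Lemma 12.5 and its use in Eq. (12.64) can be illustrated numerically. The following is a minimal sketch with hypothetical function names and assumed parameter values: it compares the closed-form minimizer of $f(x)$ with a coarse grid search and with the right-hand side of Eq. (12.64).

```python
import math

def f(x, xi_log, xi_m1, xi_1):
    """The function of Lemma 12.5."""
    return xi_log * math.log(x) + xi_m1 / x + xi_1 * x

def lemma_12_5_minimizer(xi_log, xi_m1, xi_1):
    """Closed-form minimizer stated in Lemma 12.5."""
    return (-xi_log + math.sqrt(xi_log**2 + 4.0 * xi_1 * xi_m1)) / (2.0 * xi_1)

# Coefficients read off from Eq. (12.63); the numbers are illustrative.
M, gamma, c2a, c2b = 50, 10.0, 2.25, 0.64
coeffs = (M, gamma**2 / c2a, 1.0 / c2b)

x_hat = lemma_12_5_minimizer(*coeffs)

# Grid search on [0.5 * x_hat, 1.5 * x_hat] as an independent check.
grid = [x_hat * (0.5 + 0.01 * k) for k in range(101)]
x_grid = min(grid, key=lambda x: f(x, *coeffs))

# Right-hand side of Eq. (12.64).
x_1264 = c2b * (-M + math.sqrt(M**2 + 4.0 * gamma**2 / (c2a * c2b))) / 2.0
```

Because $f$ is strictly quasiconvex, the grid minimizer lands on the grid point closest to `x_hat`, and `x_1264` agrees with `x_hat` up to rounding.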
Thus, we have obtained the PB-A posterior (12.48) specified by Eqs. (12.49) through (12.53). In exactly the same way, we can obtain the PB-B posterior (12.54) specified by Eqs. (12.55) through (12.59), by minimizing the free energy (12.32) for PB-B learning. Finally, for the choice from the PB-A posterior and the PB-B posterior, we can easily prove the following lemma:

Lemma 12.6 It holds that

\[ \min_{\hat{a}_h, \hat{b}_h, \hat{\sigma}_{a_h}^2} F_h^{\mathrm{PB-A}} < \min_{\hat{a}_h, \hat{b}_h, \hat{\sigma}_{b_h}^2} F_h^{\mathrm{PB-B}} \quad \text{if} \quad L < M, \]

\[ \min_{\hat{a}_h, \hat{b}_h, \hat{\sigma}_{a_h}^2} F_h^{\mathrm{PB-A}} > \min_{\hat{a}_h, \hat{b}_h, \hat{\sigma}_{b_h}^2} F_h^{\mathrm{PB-B}} \quad \text{if} \quad L > M. \]

Proof When comparing the PB-A free energy (12.31) and the PB-B free energy (12.32), the last terms are dominant since we assume that the negative entropy $\chi$, defined by Eq. (12.33), of the one-dimensional pseudo-Dirac delta function is finite but arbitrarily large. Then, comparing the last terms of Eqs. (12.31) and (12.32), each of which is proportional to the number of parameters point-estimated, proves Lemma 12.6.

Combining Lemma 12.6 with the PB-A posterior and the PB-B posterior obtained before, we have Corollary 12.4. Theorem 12.3 is a direct consequence of Corollary 12.4.

Comparison between MAP, PB, and VB Solutions

Here we compare the MAP solution (Theorem 12.1), the PB solution (Theorem 12.3), and the VB solution (Theorem 6.7 in Chapter 6). For all methods, the solution is a shrinkage estimator applied to each singular value, i.e., in the following form for $\breve{\gamma}_h \le \gamma_h$:

\[ \hat{U} = \sum_{h=1}^{H} \hat{\gamma}_h\omega_{b_h}\omega_{a_h}^\top, \quad \text{where} \quad \hat{\gamma}_h = \begin{cases} \breve{\gamma}_h & \text{if } \gamma_h \ge \underline{\gamma}_h, \\ 0 & \text{otherwise}. \end{cases} \tag{12.67} \]

When the prior is flat, i.e., $c_{a_h}c_{b_h} \to \infty$, the truncation threshold $\underline{\gamma}_h$ and the shrinkage factor $\breve{\gamma}_h$ are simplified as

\[ \lim_{c_{a_h}c_{b_h}\to\infty} \underline{\gamma}_h^{\mathrm{MAP}} = 0, \qquad \lim_{c_{a_h}c_{b_h}\to\infty} \breve{\gamma}_h^{\mathrm{MAP}} = \gamma_h, \tag{12.68} \]

\[ \lim_{c_{a_h}c_{b_h}\to\infty} \underline{\gamma}_h^{\mathrm{PB}} = \sigma\sqrt{\max(L, M)}, \qquad \lim_{c_{a_h}c_{b_h}\to\infty} \breve{\gamma}_h^{\mathrm{PB}} = \left(1 - \frac{\max(L, M)\sigma^2}{\gamma_h^2}\right)\gamma_h, \tag{12.69} \]

\[ \lim_{c_{a_h}c_{b_h}\to\infty} \underline{\gamma}_h^{\mathrm{VB}} = \sigma\sqrt{\max(L, M)}, \qquad \lim_{c_{a_h}c_{b_h}\to\infty} \breve{\gamma}_h^{\mathrm{VB}} = \left(1 - \frac{\max(L, M)\sigma^2}{\gamma_h^2}\right)\gamma_h, \tag{12.70} \]

and therefore the estimators $\hat{\gamma}_h$ can be written as

\[ \hat{\gamma}_h^{\mathrm{MAP}} = \gamma_h, \tag{12.71} \]

\[ \hat{\gamma}_h^{\mathrm{PB}} = \max\left(0, 1 - \frac{\max(L, M)\sigma^2}{\gamma_h^2}\right)\gamma_h, \tag{12.72} \]

\[ \hat{\gamma}_h^{\mathrm{VB}} = \max\left(0, 1 - \frac{\max(L, M)\sigma^2}{\gamma_h^2}\right)\gamma_h. \tag{12.73} \]
As expected, the MAP estimator (12.71) coincides with the maximum likelihood (ML) estimator when the prior is flat. On the other hand, the PB estimator (12.72) and the VB estimator (12.73) do not converge to the ML estimator. Interestingly, the PB estimator and the VB estimator coincide with each other, and they are in the form of the positive-part James–Stein (PJS) estimator (James and Stein, 1961; Efron and Morris, 1973) applied to each singular component (see Appendix A for a short introduction to the James–Stein estimator). The reason why the VB estimator is shrunken even with the flat prior was explained in Chapter 7 in terms of model-induced regularization (MIR) enhanced by phase transitions. Eq. (12.72) implies that PB learning, a cruder approximation to Bayesian learning, shares the same property as VB learning.

Figure 12.1 Truncation thresholds, $\underline{\gamma}_h^{\mathrm{VB}}$, $\underline{\gamma}_h^{\mathrm{PB}}$, $\underline{\gamma}_h^{\mathrm{PB-A}}$, and $\underline{\gamma}_h^{\mathrm{MAP}}$, as functions of the product $c_{a_h}c_{b_h}$ of the prior covariances. The noise variance is set to $\sigma^2 = 1$. Panels: (a) $L = 20$, $M = 50$; (b) $L = 20$, $M = 10$.

Figure 12.2 Truncation thresholds as functions of $M$. Panels: (a) $L = 20$, $c_{a_h}c_{b_h} \to \infty$; (b) $L = 20$, $c_{a_h}c_{b_h} = 1$.

To investigate the dimensionality selection behavior, we depict the truncation thresholds of VB learning, PB learning, PB-A learning, and MAP learning as functions of the product $c_{a_h}c_{b_h}$ of the prior covariances in Figure 12.1. The left panel is for the case with $L = 20$, $M = 50$, and the right panel is for the case with $L = 20$, $M = 10$. PB-A learning corresponds to PB learning with the predetermined marginalized and the point-estimated spaces as in Tipping and Bishop (1999) and Chu and Ghahramani (2009), i.e., the matrix $A$ is always marginalized out and $B$ is point-estimated regardless of the dimensionality. We see in Figure 12.1 that PB learning and VB learning show similar dimensionality selection behaviors, while PB-A learning behaves differently when $L > M$.

Figure 12.2 shows the truncation thresholds as functions of $M$ for $L = 20$. With the flat prior $c_{a_h}c_{b_h} \to \infty$ (left panel), the PB and the VB solutions agree with each other, as Eqs. (12.69) and (12.70) imply. The PB-A solution is also identical to them when $M \ge L$. However, its behavior changes at $M = L$: the truncation threshold of PB-A learning smoothly goes down as $M$ decreases, while those of PB learning and VB learning make a sudden turn and become constant. The right panel is for the case with a nonflat prior ($c_{a_h}c_{b_h} = 1$), which shows similar tendency to the case with the flat prior.

A question is which behavior is more desirable, a sudden turn in the threshold curve in VB/PB learning, or the smooth behavior in PB-A learning? We argue that the behavior of VB/PB learning is more desirable for the following reason. Let us consider the case where no true signal exists, i.e., the true rank is $H^* = 0$. In this case, we merely observe pure noise, $V = E$, and the average of the squared singular values of $V$ over all components is given by

\[ \frac{\left\langle \mathrm{tr}\left(EE^\top\right)\right\rangle_{\mathrm{MGauss}_{L,M}(E;\, 0_{L,M},\, \sigma^2 I_L \otimes I_M)}}{\min(L, M)} = \sigma^2\max(L, M). \tag{12.74} \]

Comparing Eq. (12.74) with Eqs. (12.70) and (12.69), we find that VB learning and PB learning always discard the components with singular values no greater than the average noise contribution (note here that Eqs. (12.70) and (12.69) give the thresholds for the flat prior $c_{a_h}c_{b_h} \to \infty$, and the thresholds increase as $c_{a_h}c_{b_h}$ decreases). The sudden turn in the threshold curve actually follows the behavior of the average noise contribution (12.74) to the singular values. On the other hand, PB-A learning does not necessarily discard such noise-dominant components, and can strongly overfit the noise when $L \gg M$.

Figure 12.3 shows the estimators $\hat{\gamma}_h$ by VB learning, PB learning, PB-A learning, and MAP learning for $c_{a_h}c_{b_h} = 1$, as functions of the observed singular value $\gamma_h$. We can see that the PB estimator behaves similarly to the VB estimator, while the MAP estimator behaves significantly differently. The right panel shows that PB-A learning also behaves differently from VB learning when $L > M$, which implies that the choice between the PB-A posterior and the PB-B posterior based on the free energy is essential to accurately approximate VB learning.

Figure 12.3 Behavior of VB, PB, PB-A, and MAP estimators (the vertical axis) for $c_{a_h}c_{b_h} = 1$, when the singular value $\gamma_h$ (the horizontal axis) is observed. The noise variance is set to $\sigma^2 = 1$. Panels: (a) $L = 20$, $M = 50$; (b) $L = 20$, $M = 10$.

Actually, the coincidence between the VB solution (12.70) and the PB solution (12.69) with the flat prior can be seen as a natural consequence from the similarity in the posterior shape. From Theorem 6.7 and Corollary 6.8, we can derive the following corollary:

Corollary 12.7 Assume that, when we make the prior flat $c_{a_h}c_{b_h} \to \infty$, $c_{a_h}$ and $c_{b_h}$ go to infinity in the same order, i.e., $c_{a_h}/c_{b_h} = \Theta(1)$.³ Then, the following hold for the variances of the VB posterior (6.40): when $L < M$,

\[ \lim_{c_{a_h}c_{b_h}\to\infty} \hat{\sigma}_{a_h}^2 = \infty, \qquad \lim_{c_{a_h}c_{b_h}\to\infty} \hat{\sigma}_{b_h}^2 = 0, \tag{12.75} \]

and when $L > M$,

\[ \lim_{c_{a_h}c_{b_h}\to\infty} \hat{\sigma}_{a_h}^2 = 0, \qquad \lim_{c_{a_h}c_{b_h}\to\infty} \hat{\sigma}_{b_h}^2 = \infty. \tag{12.76} \]

³ $\Theta(f(x))$ is a positive function such that $\limsup_{x\to\infty} |\Theta(f(x))/f(x)| < \infty$ and $\liminf_{x\to\infty} |\Theta(f(x))/f(x)| > 0$.

Proof Assume first that $L < M$. When $\gamma_h > \underline{\gamma}_h^{\mathrm{VB}}$, Eq. (6.52) gives $\lim_{c_{a_h}c_{b_h}\to\infty} \hat{\delta}_h^{\mathrm{VB}} = \infty$, and therefore Eq. (6.51) gives Eq. (12.75). When $\gamma_h \le \underline{\gamma}_h^{\mathrm{VB}}$, Eq. (6.54) gives $\hat{\zeta}_h^{\mathrm{VB}} = \sigma^2/M - \Theta(c_{a_h}^{-2}c_{b_h}^{-2})$ as $c_{a_h}c_{b_h} \to \infty$, and therefore Eq. (6.53) gives Eq. (12.75). When $L > M$, Theorem 6.7 and Corollary 6.8 hold for $V \leftarrow V^\top$, meaning that the VB posterior is obtained by exchanging the variational parameters for $A$ and those for $B$. Thus, we obtain Eq. (12.76), and complete the proof.

Corollary 12.7 implies that, with the flat prior $c_{a_h}c_{b_h} \to \infty$, the shape of the VB posterior is similar to the shape of the PB posterior: they extend in the space of $A$ when $M > L$, and extend in the space of $B$ when $M < L$. Therefore, it is no wonder that the solutions coincide with each other.

Global Empirical MAP and Empirical PB Solutions

Next, we investigate the empirical Bayesian variants of MAP learning and PB learning, where the hyperparameters $C_A$ and $C_B$ are also estimated from observations. Note that the noise variance $\sigma^2$ is still considered as a fixed constant (noise variance estimation will be discussed in Section 12.1.6).
Let us first consider the MAP free energy (12.30):

\[ 2F_h^{\mathrm{MAP}} = M \log c_{a_h}^2 + L \log c_{b_h}^2 + \frac{\hat{a}_h^2}{c_{a_h}^2} + \frac{\hat{b}_h^2}{c_{b_h}^2} + \frac{-2\hat{a}_h\hat{b}_h\gamma_h + \hat{a}_h^2\hat{b}_h^2}{\sigma^2} - (L+M) + (L+M)\chi. \]

We can make the MAP free energy arbitrarily small, i.e., $F_h^{\mathrm{MAP}} \to -\infty$, by setting $c_{a_h}^2, c_{b_h}^2 \to +0$ with the variational parameters set to the corresponding solution, i.e., $\hat{a}_h = \hat{b}_h = 0$ (see Corollary 12.2). Therefore, the global solution of empirical MAP (EMAP) learning is given by

\[ \hat{a}_h = 0, \quad \hat{b}_h = 0, \quad c_{a_h}^2 \to +0, \quad c_{b_h}^2 \to +0, \quad \text{for } h = 1, \ldots, H, \]

which results in the following theorem:

Theorem 12.8 The global solution of EMAP learning is $\hat{\gamma}_h^{\mathrm{EMAP}} \equiv \hat{b}_h\hat{a}_h = 0$ for all $h = 1, \ldots, H$, regardless of observations.

The same happens in empirical PB (EPB) learning. The PB-A free energy (12.31),

\[ 2F_h^{\mathrm{PB-A}} = M \log\frac{c_{a_h}^2}{\hat{\sigma}_{a_h}^2} + L \log c_{b_h}^2 + \frac{\hat{a}_h^2 + M\hat{\sigma}_{a_h}^2}{c_{a_h}^2} + \frac{\hat{b}_h^2}{c_{b_h}^2} + \frac{-2\hat{a}_h\hat{b}_h\gamma_h + \left(\hat{a}_h^2 + M\hat{\sigma}_{a_h}^2\right)\hat{b}_h^2}{\sigma^2} - (L+M) + L\chi, \]

can be arbitrarily small, i.e., $F_h^{\mathrm{PB-A}} \to -\infty$, by setting $c_{a_h}^2, c_{b_h}^2 \to +0$ with the variational parameters set to the corresponding solution $\hat{a}_h = \hat{b}_h = 0$, $\hat{\sigma}_{a_h}^2 = c_{a_h}^2$ (see Corollary 12.4). Also, the PB-B free energy (12.32),

\[ 2F_h^{\mathrm{PB-B}} = M \log c_{a_h}^2 + L \log\frac{c_{b_h}^2}{\hat{\sigma}_{b_h}^2} + \frac{\hat{a}_h^2}{c_{a_h}^2} + \frac{\hat{b}_h^2 + L\hat{\sigma}_{b_h}^2}{c_{b_h}^2} + \frac{-2\hat{a}_h\hat{b}_h\gamma_h + \hat{a}_h^2\left(\hat{b}_h^2 + L\hat{\sigma}_{b_h}^2\right)}{\sigma^2} - (L+M) + M\chi, \]

can be arbitrarily small, i.e., $F_h^{\mathrm{PB-B}} \to -\infty$, by setting $c_{a_h}^2, c_{b_h}^2 \to +0$ with the variational parameters set to the corresponding solution $\hat{a}_h = \hat{b}_h = 0$, $\hat{\sigma}_{b_h}^2 = c_{b_h}^2$. Thus, we have the following theorem:

Theorem 12.9 The global solution of EPB learning is $\hat{\gamma}_h^{\mathrm{EPB}} \equiv \hat{b}_h\hat{a}_h = 0$ for all $h = 1, \ldots, H$, regardless of observations.

Theorems 12.8 and 12.9 imply that empirical Bayesian variants of MAP learning and PB learning give useless trivial estimators. This happens because the posterior variances of the parameters to be point-estimated are fixed to a small value, so that the posteriors form the pseudo-Dirac delta functions. In VB learning, if we set $c_{a_h}c_{b_h}$ to a small value, the posterior variances, $\hat{\sigma}_{a_h}^2$ and $\hat{\sigma}_{b_h}^2$, get small accordingly, so that the third and the fourth terms in Eq. (12.27) do not diverge to $+\infty$. As a result, the first and the second terms in Eq. (12.27) remain finite. On the other hand, in MAP learning and PB learning, at least one of the posterior variances, $\hat{\sigma}_{a_h}^2$ and $\hat{\sigma}_{b_h}^2$, is treated as a constant and cannot be adjusted to the corresponding prior covariance when it is set to be small. This makes the free energy lower-unbounded. Actually, if we lower-bound the prior covariances as $c_{a_h}^2, c_{b_h}^2 \ge \varepsilon^2$ with the same $\varepsilon^2$ as the one we used for defining the variances (12.28) and (12.29) of the pseudo-Dirac delta functions and their entropy (12.33), the MAP and the PB free energies, $F_h^{\mathrm{MAP}}$, $F_h^{\mathrm{PB-A}}$, and $F_h^{\mathrm{PB-B}}$, are also lower-bounded by zero, as is the VB free energy, $F_h^{\mathrm{VB}}$.
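The lower-unboundedness discussed above is easy to observe numerically. The sketch below is illustrative only (hypothetical function names, assumed values, and $c_{a_h}^2 = c_{b_h}^2$ for simplicity): it evaluates Eq. (12.30) at the null solution for shrinking prior covariances, and the clipped form of Eq. (12.27) when the prior covariances are bounded below by $\varepsilon^2$.

```python
import math

def two_fh_map_null(c2, L, M, eps2):
    """Eq. (12.30) at a_h = b_h = 0 with c_a^2 = c_b^2 = c2."""
    chi = -math.log(eps2)
    return (L + M) * math.log(c2) - (L + M) + (L + M) * chi

def two_fh_clipped_null(c2, L, M, eps2, sigma2):
    """Eq. (12.27) at a_h = b_h = 0 with both posterior variances set
    to eps2 and c_a^2 = c_b^2 = c2 (no terms dropped)."""
    return ((L + M) * math.log(c2 / eps2) + (L + M) * eps2 / c2
            - (L + M) + L * M * eps2**2 / sigma2)

L, M, eps2, sigma2 = 20, 50, 1e-6, 1.0   # illustrative values

# Without a lower bound on c2, the free energy decreases without limit.
unbounded = [two_fh_map_null(10.0**(-k), L, M, eps2) for k in (2, 6, 10, 14)]

# With c2 >= eps2, the clipped free energy stays nonnegative.
bounded = [two_fh_clipped_null(eps2 * 10.0**k, L, M, eps2, sigma2)
           for k in (0, 1, 2, 3)]
```

The first list decreases monotonically and becomes negative once $c^2 < \varepsilon^2$, whereas every entry of the second list is nonnegative, in line with the remark on lower-bounding the prior covariances.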
12.1.3 Local Solutions

The analysis in Section 12.1.2 might seem contradictory with the reported results in Mørup and Hansen (2009), where EMAP showed good performance with the ARD property in TF: since the free energies in MF and TF are similar to each other, they should share the same issue of the lower-unboundedness. In the following, we elucidate that this apparent contradiction is because of the local solutions in EMAP learning and EPB learning that behave similarly to the nontrivial positive solution of EVB learning. Actually, EMAP learning and EPB learning can behave similarly to EVB learning when the free energy is minimized by local search.

Local EMAP and EPB Solutions

Here we conduct more detailed analysis of the free energies for EMAP learning and EPB learning, and clarify the behavior of their local minima. To make the free energy always comparable (finite), we slightly modify the problem. Specifically, we solve the following problem:

\[ \text{Given } \sigma^2 \in \mathbb{R}_{++}, \quad \min_{r, \{c_{a_h}^2, c_{b_h}^2\}_{h=1}^H} F, \]
\[ \text{s.t.} \quad c_{a_h}c_{b_h} \ge \varepsilon^2, \quad c_{a_h}/c_{b_h} = 1 \quad \text{for } h = 1, \ldots, H, \quad \text{and} \quad \begin{cases} r(A, B) = \delta(A; \hat{A})\delta(B; \hat{B}) & \text{(for EMAP learning)}, \\ r(A, B) = r_A(A)\delta(B; \hat{B}) & \text{(for EPB-A learning)}, \\ r(A, B) = \delta(A; \hat{A})r_B(B) & \text{(for EPB-B learning)}, \\ r(A, B) = r_A(A)r_B(B) & \text{(for EVB learning)}, \end{cases} \tag{12.77} \]

where the free energy $F$ is defined by Eq. (12.6), and the pseudo-Dirac delta function is defined as Gaussian with an arbitrarily small but positive variance $\varepsilon^2 > 0$:

\[ \delta(A; \hat{A}) = \mathrm{MGauss}_{M,H}(A; \hat{A}, \varepsilon^2 I_M \otimes I_H) \propto \exp\left(-\frac{\|A - \hat{A}\|_{\mathrm{Fro}}^2}{2\varepsilon^2}\right), \]
\[ \delta(B; \hat{B}) = \mathrm{MGauss}_{L,H}(B; \hat{B}, \varepsilon^2 I_L \otimes I_H) \propto \exp\left(-\frac{\|B - \hat{B}\|_{\mathrm{Fro}}^2}{2\varepsilon^2}\right). \]

Note that, in Eq. (12.77), we lower-bounded the product $c_{a_h}c_{b_h}$ of the prior covariances and fixed the ratio $c_{a_h}/c_{b_h}$. We added the constraint for EVB learning for comparison. Following the discussion in Section 12.1.1, we can write the posterior as

\[ r(A, B) = r_A(A)r_B(B), \]

where

\[ r_A(A) = \mathrm{MGauss}_{M,H}(A; \hat{A}, I_M \otimes \hat{\Sigma}_A) \propto \exp\left(-\frac{\mathrm{tr}\left((A - \hat{A})\hat{\Sigma}_A^{-1}(A - \hat{A})^\top\right)}{2}\right), \]
\[ r_B(B) = \mathrm{MGauss}_{L,H}(B; \hat{B}, I_L \otimes \hat{\Sigma}_B) \propto \exp\left(-\frac{\mathrm{tr}\left((B - \hat{B})\hat{\Sigma}_B^{-1}(B - \hat{B})^\top\right)}{2}\right), \]

for

\[ \hat{A} = \left(\hat{a}_1\omega_{a_1}, \ldots, \hat{a}_H\omega_{a_H}\right), \qquad \hat{B} = \left(\hat{b}_1\omega_{b_1}, \ldots, \hat{b}_H\omega_{b_H}\right), \]
\[ \hat{\Sigma}_A = \mathrm{Diag}\left(\hat{\sigma}_{a_1}^2, \ldots, \hat{\sigma}_{a_H}^2\right), \qquad \hat{\Sigma}_B = \mathrm{Diag}\left(\hat{\sigma}_{b_1}^2, \ldots, \hat{\sigma}_{b_H}^2\right), \]

and the variational parameters $\{\hat{a}_h, \hat{b}_h, \hat{\sigma}_{a_h}^2, \hat{\sigma}_{b_h}^2\}_{h=1}^H$ are the solution of the following problem:

\[ \text{Given } \sigma^2 \in \mathbb{R}_{++}, \quad \min_{\{\hat{a}_h, \hat{b}_h, \hat{\sigma}_{a_h}^2, \hat{\sigma}_{b_h}^2, c_{a_h}^2, c_{b_h}^2\}_{h=1}^H} F, \tag{12.78} \]
\[ \text{s.t.} \quad \hat{a}_h, \hat{b}_h \in \mathbb{R}, \quad c_{a_h}c_{b_h} \ge \varepsilon^2, \quad c_{a_h}/c_{b_h} = 1, \quad \text{and} \quad \begin{cases} \hat{\sigma}_{a_h}^2 = \varepsilon^2, \ \hat{\sigma}_{b_h}^2 = \varepsilon^2 & \text{(for EMAP learning)}, \\ \hat{\sigma}_{a_h}^2 \ge \varepsilon^2, \ \hat{\sigma}_{b_h}^2 = \varepsilon^2 & \text{(for EPB-A learning)}, \\ \hat{\sigma}_{a_h}^2 = \varepsilon^2, \ \hat{\sigma}_{b_h}^2 \ge \varepsilon^2 & \text{(for EPB-B learning)}, \\ \hat{\sigma}_{a_h}^2 \ge \varepsilon^2, \ \hat{\sigma}_{b_h}^2 \ge \varepsilon^2 & \text{(for EVB learning)}, \end{cases} \quad \text{for } h = 1, \ldots, H, \tag{12.79} \]
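where the free energy $F$ is explicitly written by Eqs. (12.26) and (12.27), that is,

\[ 2F = LM \log(2\pi\sigma^2) + \frac{\sum_{h=1}^{\min(L,M)} \gamma_h^2}{\sigma^2} + \sum_{h=1}^{H} 2F_h, \tag{12.80} \]

where

\[ 2F_h = M \log\frac{c_{a_h}^2}{\hat{\sigma}_{a_h}^2} + L \log\frac{c_{b_h}^2}{\hat{\sigma}_{b_h}^2} + \frac{\hat{a}_h^2 + M\hat{\sigma}_{a_h}^2}{c_{a_h}^2} + \frac{\hat{b}_h^2 + L\hat{\sigma}_{b_h}^2}{c_{b_h}^2} - (L+M) + \frac{-2\hat{a}_h\hat{b}_h\gamma_h + \left(\hat{a}_h^2 + M\hat{\sigma}_{a_h}^2\right)\left(\hat{b}_h^2 + L\hat{\sigma}_{b_h}^2\right)}{\sigma^2}. \tag{12.81} \]

By substituting the MAP solution (Corollary 12.2) and the PB solution (Corollary 12.4), respectively, into Eq. (12.81), we can write the free energy as a function of the product $c_{a_h}c_{b_h}$ of the prior covariances. We have the following lemmas (the proofs are given in Sections 12.1.4 and 12.1.5, respectively):

Lemma 12.10 In EMAP learning, the free energy (12.81) can be written as a function of $c_{a_h}c_{b_h}$ as follows:

\[ 2\acute{F}_h^{\mathrm{MAP}} = \min_{\hat{a}_h, \hat{b}_h \in \mathbb{R},\ \hat{\sigma}_{a_h}^2 = \hat{\sigma}_{b_h}^2 = \varepsilon^2} 2F_h = \begin{cases} (L+M)\left(\log c_{a_h}c_{b_h} + \dfrac{\varepsilon^2}{c_{a_h}c_{b_h}} - 1 + \chi\right) & \text{for } \varepsilon^2 \le c_{a_h}c_{b_h} \le \dfrac{\sigma^2}{\gamma_h}, \\[2ex] (L+M)\left(\log c_{a_h}c_{b_h} - 1 + \chi\right) - \sigma^{-2}\left(\gamma_h - \dfrac{\sigma^2}{c_{a_h}c_{b_h}}\right)^2 & \text{for } c_{a_h}c_{b_h} > \dfrac{\sigma^2}{\gamma_h}. \end{cases} \tag{12.82} \]

Lemma 12.11 In EPB learning, the free energy (12.81) can be written as a function of $c_{a_h}c_{b_h}$ as follows: if $\gamma_h > \sigma\sqrt{\max(L, M)}$,

\[ \begin{aligned} 2\acute{F}_h^{\mathrm{PB}} &= \min\left(2\acute{F}_h^{\mathrm{PB-A}}, 2\acute{F}_h^{\mathrm{PB-B}}\right) \\ &= \min\left\{\min_{\hat{a}_h, \hat{b}_h \in \mathbb{R},\ \hat{\sigma}_{a_h}^2 \ge \varepsilon^2,\ \hat{\sigma}_{b_h}^2 = \varepsilon^2} 2F_h,\ \min_{\hat{a}_h, \hat{b}_h \in \mathbb{R},\ \hat{\sigma}_{a_h}^2 = \varepsilon^2,\ \hat{\sigma}_{b_h}^2 \ge \varepsilon^2} 2F_h\right\} \\ &= \begin{cases} \min(L, M)\left(\log c_{a_h}c_{b_h} + \dfrac{\varepsilon^2}{c_{a_h}c_{b_h}} - 1 + \chi\right) & \text{for } \varepsilon^2 \le c_{a_h}c_{b_h} \le \dfrac{\sigma^2}{\sqrt{\gamma_h^2 - \max(L, M)\sigma^2}}, \\[3ex] \begin{aligned} &\tfrac{\min(L,M) + 2\max(L,M)}{2}\log c_{a_h}^2c_{b_h}^2 + \sqrt{\max(L, M)^2 + \tfrac{4\gamma_h^2}{c_{a_h}^2c_{b_h}^2}} - \tfrac{\sigma^2}{c_{a_h}^2c_{b_h}^2} \\ &\quad + \max(L, M)\log\left(-\max(L, M) + \sqrt{\max(L, M)^2 + \tfrac{4\gamma_h^2}{c_{a_h}^2c_{b_h}^2}}\right) \\ &\quad - \tfrac{\gamma_h^2}{\sigma^2} - \max(L, M)\log(2\sigma^2) + \min(L, M)(\chi - 1) \end{aligned} & \text{for } c_{a_h}c_{b_h} > \dfrac{\sigma^2}{\sqrt{\gamma_h^2 - \max(L, M)\sigma^2}}, \end{cases} \end{aligned} \tag{12.83} \]

and otherwise,

\[ 2\acute{F}_h^{\mathrm{PB}} = \min(L, M)\left(\log c_{a_h}c_{b_h} + \frac{\varepsilon^2}{c_{a_h}c_{b_h}} - 1 + \chi\right). \tag{12.84} \]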
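By minimizing the EMAP free energy (12.82) and the EPB free energy (12.83), respectively, with respect to the product $c_{a_h}c_{b_h}$ of the prior covariances, we obtain the following theorems (the proofs are given also in Sections 12.1.4 and 12.1.5, respectively):

Theorem 12.12 In EMAP learning, the free energy (12.81) has the global minimum such that

\[ \hat{\gamma}_h^{\mathrm{EMAP}} \equiv \hat{b}_h\hat{a}_h = 0. \]

It has a nontrivial local minimum such that

\[ \hat{\gamma}_h^{\mathrm{local-EMAP}} \equiv \hat{b}_h\hat{a}_h = \breve{\gamma}_h^{\mathrm{local-EMAP}} \quad \text{if and only if} \quad \gamma_h > \underline{\gamma}^{\mathrm{local-EMAP}}, \]

where

\[ \underline{\gamma}^{\mathrm{local-EMAP}} = \sigma\sqrt{2(L+M)}, \tag{12.85} \]

\[ \breve{\gamma}_h^{\mathrm{local-EMAP}} = \frac{1}{2}\left(\gamma_h + \sqrt{\gamma_h^2 - 2\sigma^2(L+M)}\right). \tag{12.86} \]

Theorem 12.13 In EPB learning, the free energy (12.81) has the global minimum such that

\[ \hat{\gamma}_h^{\mathrm{EPB}} \equiv \hat{b}_h\hat{a}_h = 0. \]

It has a nontrivial local minimum such that

\[ \hat{\gamma}_h^{\mathrm{local-EPB}} \equiv \hat{b}_h\hat{a}_h = \breve{\gamma}_h^{\mathrm{local-EPB}} \quad \text{if and only if} \quad \gamma_h > \underline{\gamma}^{\mathrm{local-EPB}}, \]

where

\[ \underline{\gamma}^{\mathrm{local-EPB}} = \sigma\sqrt{L + M + \sqrt{2LM + \min(L, M)^2}}, \tag{12.87} \]

\[ \breve{\gamma}_h^{\mathrm{local-EPB}} = \frac{\gamma_h}{2}\left(1 + \frac{-\max(L, M)\sigma^2 + \sqrt{\gamma_h^4 - 2(L+M)\sigma^2\gamma_h^2 + \max(L, M)^2\sigma^4}}{\gamma_h^2}\right). \tag{12.88} \]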
Figure 12.4 shows the free energy (normalized by $LM$) as a function of $c_{a_h}c_{b_h}$ for EMAP learning (given in Lemma 12.10), EPB learning (given in Lemma 12.11), and EVB learning, defined by

\[ 2\acute{F}_h^{\mathrm{VB}} = \min_{\hat{a}_h, \hat{b}_h \in \mathbb{R},\ \hat{\sigma}_{a_h}^2, \hat{\sigma}_{b_h}^2 \ge \varepsilon^2} 2F_h. \]

For EMAP learning and EPB learning, we ignored some constants (e.g., the entropy terms proportional to $\chi$) to make the shapes of the free energies comparable.

Figure 12.4 Free energy dependence on $c_{a_h}c_{b_h}$, where $L = 20$, $M = 50$. Crosses indicate nontrivial local minima. Curves: EVB, EPB, EMAP. Panels: (a) $\gamma_h = 10$; (b) $\gamma_h = 12$; (c) $\gamma_h = 14$; (d) $\gamma_h = 16$.

We can see deep pits at $c_{a_h}c_{b_h} \to +0$ in EMAP and EPB free energies, which correspond to the global solutions. However, we also see nontrivial local minima, which behave similarly to the nontrivial local solution for VB learning. Namely, nontrivial local minima of EMAP, EPB, and EVB free energies appear at locations similar to each other when the observed singular value $\gamma_h$ exceeds the thresholds given by Eqs. (12.85), (12.87), and (6.127), respectively.

The deep pit at $c_{a_h}c_{b_h} \to +0$ is essential when we stick to the global solution. The VB free energy does not have such a pit, which enables consistent inference based on the free energy minimization principle. However, as long as we rely on local search, the pit at the origin is not essential in practice. Assume that a nontrivial local minimum exists, and we perform local search only once. Then, whether local search for EMAP learning or EPB learning converges to the trivial global solution or the nontrivial local solution simply depends on the initialization. Note that the same applies also to EVB learning, for which local search is not guaranteed to converge to the global solution. This is because of the multimodality of the VB free energy, which can be seen in Figure 12.4.

One might wonder if some hyperpriors on $c_{a_h}^2$ and $c_{b_h}^2$ could fill the deep pits at $c_{a_h}c_{b_h} \to +0$ in the EMAP and the EPB free energies, so that the nontrivial local solutions are global when some reasonable conditions hold. However, when we rely on the ARD property for model selection, hyperpriors should be almost noninformative. With such an almost noninformative hyperprior, e.g., the inverse-Gamma, $p(c_{a_h}^2, c_{b_h}^2) \propto (c_{a_h}^2c_{b_h}^2)^{1.001} + 0.001/(c_{a_h}^2c_{b_h}^2)$, which was used in Bishop (1999b), deep pits still exist very close to the origin, which keep the global EMAP and EPB estimators useless.

Comparison between Local-EMAP, Local-EPB, and EVB Solutions

Let us observe the behavior of local solutions. We define the local-EMAP estimator and the local-EPB estimator, respectively, by
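\[ \hat{U}^{\mathrm{local-EMAP}} = \sum_{h=1}^{H} \hat{\gamma}_h^{\mathrm{local-EMAP}}\omega_{b_h}\omega_{a_h}^\top, \quad \text{where} \quad \hat{\gamma}_h^{\mathrm{local-EMAP}} = \begin{cases} \breve{\gamma}_h^{\mathrm{local-EMAP}} & \text{if } \gamma_h \ge \underline{\gamma}^{\mathrm{local-EMAP}}, \\ 0 & \text{otherwise}, \end{cases} \tag{12.89} \]

\[ \hat{U}^{\mathrm{local-EPB}} = \sum_{h=1}^{H} \hat{\gamma}_h^{\mathrm{local-EPB}}\omega_{b_h}\omega_{a_h}^\top, \quad \text{where} \quad \hat{\gamma}_h^{\mathrm{local-EPB}} = \begin{cases} \breve{\gamma}_h^{\mathrm{local-EPB}} & \text{if } \gamma_h \ge \underline{\gamma}^{\mathrm{local-EPB}}, \\ 0 & \text{otherwise}, \end{cases} \tag{12.90} \]

following the definition of the local-EVB estimator (6.131) in Chapter 6. In the following, we assume that local search algorithms for EMAP learning and EPB learning find these solutions.

Define the normalized (by the average noise contribution (12.74)) singular values:

\[ \gamma_h' = \frac{\gamma_h}{\sqrt{\max(L, M)\sigma^2}}. \]

We also define normalized versions of the estimator, the truncation threshold, and the shrinkage factor as

\[ \hat{\gamma}_h' = \frac{\hat{\gamma}_h}{\sqrt{\max(L, M)\sigma^2}}, \qquad \underline{\gamma}' = \frac{\underline{\gamma}}{\sqrt{\max(L, M)\sigma^2}}, \qquad \breve{\gamma}_h' = \frac{\breve{\gamma}_h}{\sqrt{\max(L, M)\sigma^2}}. \tag{12.91} \]

Then the normalized truncation thresholds and the normalized shrinkage factors can be written as functions of $\alpha = \min(L, M)/\max(L, M)$ as follows:

\[ \underline{\gamma}'^{\mathrm{EVB}} = \sigma\sqrt{1 + \alpha + \sqrt{\alpha}\left(\kappa + \frac{1}{\kappa}\right)}, \tag{12.92} \]

\[ \breve{\gamma}_h'^{\mathrm{EVB}} = \frac{\gamma_h'}{2}\left(1 - \frac{(1+\alpha)\sigma^2}{\gamma_h'^2} + \sqrt{\left(1 - \frac{(1+\alpha)\sigma^2}{\gamma_h'^2}\right)^2 - \frac{4\alpha\sigma^4}{\gamma_h'^4}}\right), \tag{12.93} \]

\[ \underline{\gamma}'^{\mathrm{local-EPB}} = \sigma\sqrt{1 + \alpha + \sqrt{2\alpha + \alpha^2}}, \tag{12.94} \]

\[ \breve{\gamma}_h'^{\mathrm{local-EPB}} = \frac{\gamma_h'}{2}\left(1 + \frac{-\sigma^2 + \sqrt{\gamma_h'^4 - 2(1+\alpha)\sigma^2\gamma_h'^2 + \sigma^4}}{\gamma_h'^2}\right), \tag{12.95} \]

\[ \underline{\gamma}'^{\mathrm{local-EMAP}} = \sigma\sqrt{2(1+\alpha)}, \tag{12.96} \]

\[ \breve{\gamma}_h'^{\mathrm{local-EMAP}} = \frac{1}{2}\left(\gamma_h' + \sqrt{\gamma_h'^2 - 2\sigma^2(1+\alpha)}\right). \tag{12.97} \]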
Note that $\kappa$ is also a function of $\alpha$. Figure 12.5 compares the normalized versions of the (global) EVB estimator $\hat{\gamma}_h'^{\mathrm{EVB}}$, the local-EPB estimator $\hat{\gamma}_h'^{\mathrm{local-EPB}}$, and the local-EMAP estimator $\hat{\gamma}_h'^{\mathrm{local-EMAP}}$. We can observe similar behaviors of those three empirical Bayesian estimators. This is in contrast to the nonempirical Bayesian estimators shown in Figure 12.3, where the PB estimator behaves similarly to the VB estimator, while the MAP estimator behaves differently.

Figure 12.5 Behavior of (global) EVB, the local-EPB, and the local-EMAP estimators for $\alpha = \min(L, M)/\max(L, M) = 1/3$.

Figure 12.6 compares the normalized versions of the EVB truncation threshold (12.92), the local-EPB truncation threshold (12.94), and the local-EMAP truncation threshold (12.96). We can see that those thresholds behave similarly. However, we can find an essential difference of the local-EPB threshold from the EVB and the local-EMAP thresholds: it holds that, for any $\alpha$,

\[ \underline{\gamma}'^{\mathrm{local-EPB}} < \underline{\gamma}'^{\mathrm{MPUL}} \le \underline{\gamma}'^{\mathrm{EVB}}, \underline{\gamma}'^{\mathrm{local-EMAP}}, \tag{12.98} \]

where

\[ \underline{\gamma}'^{\mathrm{MPUL}} = \frac{\underline{\gamma}^{\mathrm{MPUL}}}{\sqrt{\max(L, M)\sigma^2}} = \left(1 + \sqrt{\alpha}\right) \]

is the normalized version of the Marčenko–Pastur upper limit (MPUL) (Eq. (8.41) in Chapter 8), which is also shown in Figure 12.6.

Figure 12.6 Truncation thresholds.

As discussed in Section 8.4.1, the MPUL is the largest singular value of an $L \times M$ zero-mean independent random matrix in the large-scale limit where the matrix size $(L, M)$ goes to infinity with fixed ratio $\alpha = \min(L, M)/\max(L, M)$. In other words, the MPUL corresponds to the minimum observed singular value detectable (or distinguishable from noise) by any dimensionality reduction method. The inequalities (12.98) say that the local-EPB threshold is always smaller than the MPUL, while the EVB threshold and the local-EMAP threshold are never smaller than the MPUL. This implies that, for a large-scale observed matrix, the EVB estimator and the local-EMAP estimator discard the singular components dominated by noise, while the local-EPB estimator retains some of them.
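The ordering in (12.98) between the local-EPB threshold, the MPUL, and the local-EMAP threshold can be tabulated for a few aspect ratios. The following sketch is illustrative only: the function name is hypothetical, $\sigma = 1$ is assumed, and the EVB threshold is omitted because $\kappa$ is not defined in this section.

```python
import math

def normalized_thresholds(alpha, sigma=1.0):
    """Return (local-EPB, MPUL, local-EMAP) normalized thresholds.

    Uses Eqs. (12.94) and (12.96) and the normalized MPUL, 1 + sqrt(alpha).
    """
    epb = sigma * math.sqrt(1.0 + alpha + math.sqrt(2.0 * alpha + alpha**2))
    emap = sigma * math.sqrt(2.0 * (1.0 + alpha))
    mpul = 1.0 + math.sqrt(alpha)
    return epb, mpul, emap

# alpha = min(L, M) / max(L, M) on a grid 0.1, 0.2, ..., 1.0
table = [(a / 10.0, *normalized_thresholds(a / 10.0)) for a in range(1, 11)]

for alpha, epb, mpul, emap in table:
    print(f"alpha={alpha:.1f}  local-EPB={epb:.3f}  MPUL={mpul:.3f}  "
          f"local-EMAP={emap:.3f}")
```

On this grid the local-EPB threshold stays strictly below the MPUL, and the local-EMAP threshold touches the MPUL only at $\alpha = 1$.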
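12.1.4 Proofs of Lemma 12.10 and Theorem 12.12

By substituting the MAP solution, given by Corollary 12.2, we can write the free energy (12.81) as follows: for $\varepsilon^2 \le c_{a_h}c_{b_h} \le \sigma^2/\gamma_h$,

\[ \begin{aligned} 2\acute{F}_h^{\mathrm{MAP}} &= \min_{\hat{a}_h, \hat{b}_h \in \mathbb{R},\ \hat{\sigma}_{a_h}^2 = \hat{\sigma}_{b_h}^2 = \varepsilon^2} 2F_h \\ &= M \log c_{a_h}^2 + L \log c_{b_h}^2 + \left(\frac{M}{c_{a_h}^2} + \frac{L}{c_{b_h}^2}\right)\varepsilon^2 - (L+M) + (L+M)\chi, \end{aligned} \tag{12.99} \]

and for $c_{a_h}c_{b_h} > \sigma^2/\gamma_h$,

\[ \begin{aligned} 2\acute{F}_h^{\mathrm{MAP}} &= \min_{\hat{a}_h, \hat{b}_h \in \mathbb{R},\ \hat{\sigma}_{a_h}^2 = \hat{\sigma}_{b_h}^2 = \varepsilon^2} 2F_h \\ &= M \log c_{a_h}^2 + L \log c_{b_h}^2 + \left(\gamma_h - \frac{\sigma^2}{c_{a_h}c_{b_h}}\right)\left(\frac{2}{c_{a_h}c_{b_h}} + \frac{-2\gamma_h + \gamma_h - \frac{\sigma^2}{c_{a_h}c_{b_h}}}{\sigma^2}\right) + \left(\frac{M}{c_{a_h}^2} + \frac{L}{c_{b_h}^2}\right)\varepsilon^2 - (L+M) + (L+M)\chi \\ &= M \log c_{a_h}^2 + L \log c_{b_h}^2 - \sigma^{-2}\left(\gamma_h - \frac{\sigma^2}{c_{a_h}c_{b_h}}\right)^2 - (L+M) + (L+M)\chi. \end{aligned} \tag{12.100} \]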
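In the second-to-last equation in Eq. (12.100), we ignored the fourth term because $c_{a_h}c_{b_h} > \sigma^2/\gamma_h$ implies $c_{a_h}^2, c_{b_h}^2 \gg \varepsilon^2$ (with an arbitrarily high probability depending on $\varepsilon^2$). By fixing the ratio to $c_{a_h}/c_{b_h} = 1$, we obtain Eq. (12.82), which proves Lemma 12.10.

Now we minimize the free energy (12.82) with respect to $c_{a_h}c_{b_h}$, and find nontrivial local solutions. The free energy is continuous in the domain $\varepsilon^2 \le c_{a_h}c_{b_h} < \infty$, and differentiable except at $c_{a_h}c_{b_h} = \sigma^2/\gamma_h$. The derivative is given by

\[ \begin{aligned} \frac{\partial\, 2\acute{F}_h^{\mathrm{MAP}}}{\partial(c_{a_h}c_{b_h})} &= \begin{cases} (L+M)\left(\dfrac{1}{c_{a_h}c_{b_h}} - \dfrac{\varepsilon^2}{c_{a_h}^2c_{b_h}^2}\right) & \text{for } \varepsilon^2 \le c_{a_h}c_{b_h} \le \dfrac{\sigma^2}{\gamma_h}, \\[2ex] \dfrac{L+M}{c_{a_h}c_{b_h}} + 2\sigma^{-2}\left(\dfrac{\sigma^2}{c_{a_h}c_{b_h}} - \gamma_h\right)\dfrac{\sigma^2}{c_{a_h}^2c_{b_h}^2} & \text{for } c_{a_h}c_{b_h} > \dfrac{\sigma^2}{\gamma_h}, \end{cases} \\ &= \begin{cases} \dfrac{L+M}{c_{a_h}^2c_{b_h}^2}\left(c_{a_h}c_{b_h} - \varepsilon^2\right) & \text{for } \varepsilon^2 \le c_{a_h}c_{b_h} \le \dfrac{\sigma^2}{\gamma_h}, \\[2ex] \dfrac{1}{c_{a_h}^3c_{b_h}^3}\left((L+M)c_{a_h}^2c_{b_h}^2 - 2\gamma_hc_{a_h}c_{b_h} + 2\sigma^2\right) & \text{for } c_{a_h}c_{b_h} > \dfrac{\sigma^2}{\gamma_h}. \end{cases} \end{aligned} \tag{12.101} \]

Eq. (12.101) implies that the free energy $\acute{F}_h^{\mathrm{MAP}}$ is increasing for $\varepsilon^2 \le c_{a_h}c_{b_h} \le \sigma^2/\gamma_h$, and that it is increasing at $c_{a_h}c_{b_h} = \sigma^2/\gamma_h$ and at $c_{a_h}c_{b_h} \to +\infty$. In the region of $\sigma^2/\gamma_h < c_{a_h}c_{b_h} < +\infty$, the free energy has stationary points if and only if

\[ \gamma_h \ge \sigma\sqrt{2(L+M)} \equiv \underline{\gamma}^{\mathrm{local-EMAP}}, \tag{12.102} \]

because the derivative can be factorized (with real factors if and only if the condition (12.102) holds) as

\[ \frac{\partial\, 2\acute{F}_h^{\mathrm{MAP}}}{\partial(c_{a_h}c_{b_h})} = \frac{L+M}{c_{a_h}^3c_{b_h}^3}\left(c_{a_h}c_{b_h} - \acute{c}_{a_h}\acute{c}_{b_h}\right)\left(c_{a_h}c_{b_h} - \breve{c}_{a_h}\breve{c}_{b_h}\right), \]

where

\[ \acute{c}_{a_h}\acute{c}_{b_h} = \frac{\gamma_h - \sqrt{\gamma_h^2 - 2\sigma^2(L+M)}}{L+M}, \tag{12.103} \]

\[ \breve{c}_{a_h}\breve{c}_{b_h} = \frac{\gamma_h + \sqrt{\gamma_h^2 - 2\sigma^2(L+M)}}{L+M}. \tag{12.104} \]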
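Summarizing the preceding discussion, we have the following lemma:

Lemma 12.14 If $\gamma_h \le \gamma^{\text{local-EMAP}}$, the EMAP free energy $\acute{F}_h^{\mathrm{MAP}}$, defined by Eq. (12.82), is increasing for $c_{a_h} c_{b_h} > \varepsilon^2$, and therefore minimized at $c_{a_h} c_{b_h} = \varepsilon^2$. If $\gamma_h > \gamma^{\text{local-EMAP}}$,
\[
\acute{F}_h^{\mathrm{MAP}} \text{ is }
\begin{cases}
\text{increasing} & \text{for } \varepsilon^2 < c_{a_h} c_{b_h} < \acute{c}_{a_h} \acute{c}_{b_h}, \\
\text{decreasing} & \text{for } \acute{c}_{a_h} \acute{c}_{b_h} < c_{a_h} c_{b_h} < \breve{c}_{a_h} \breve{c}_{b_h}, \\
\text{increasing} & \text{for } \breve{c}_{a_h} \breve{c}_{b_h} < c_{a_h} c_{b_h} < +\infty,
\end{cases}
\]
and therefore has two (local) minima at $c_{a_h} c_{b_h} = \varepsilon^2$ and at $c_{a_h} c_{b_h} = \breve{c}_{a_h} \breve{c}_{b_h}$. Here $\acute{c}_{a_h} \acute{c}_{b_h}$ and $\breve{c}_{a_h} \breve{c}_{b_h}$ are defined by Eqs. (12.103) and (12.104), respectively.

When $\gamma_h > \gamma^{\text{local-EMAP}}$, the EMAP free energy (12.82) at the local minima is
\[
2\acute{F}_h^{\mathrm{MAP}} =
\begin{cases}
0 & \text{at } c_{a_h} c_{b_h} = \varepsilon^2, \\[1ex]
(L+M) \left( \log \breve{c}_{a_h} \breve{c}_{b_h} - 1 + \chi \right)
 - \sigma^{-2} \left( \gamma_h - \dfrac{\sigma^2}{\breve{c}_{a_h} \breve{c}_{b_h}} \right)^2
 & \text{at } c_{a_h} c_{b_h} = \breve{c}_{a_h} \breve{c}_{b_h},
\end{cases}
\]
respectively. Since we assume that $\chi = -\log \varepsilon^2$ is an arbitrarily large constant ($\varepsilon^2 > 0$ is arbitrarily small), $c_{a_h} c_{b_h} = \varepsilon^2$ is always the global minimum. Substituting Eq. (12.104) into Eq. (12.36) gives Eq. (12.86), which completes the proof of Theorem 12.12.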
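The threshold (12.102) and the stationary points (12.103) and (12.104) are easy to evaluate numerically. The sketch below is illustrative only: the function names are hypothetical, and the derivative is the second case of Eq. (12.101) written over the common denominator, with x standing for the product of the prior standard deviations.

```python
import numpy as np

def local_emap_stationary_points(gamma_h, sigma2, L, M):
    """Return (threshold, x_acute, x_breve) for x = c_a * c_b.
    The two roots are NaN when gamma_h is below the threshold (12.102)."""
    threshold = np.sqrt(2.0 * sigma2 * (L + M))
    disc = gamma_h**2 - 2.0 * sigma2 * (L + M)
    if disc < 0:
        return threshold, np.nan, np.nan
    root = np.sqrt(disc)
    return threshold, (gamma_h - root) / (L + M), (gamma_h + root) / (L + M)

def free_energy_derivative(x, gamma_h, sigma2, L, M):
    """Derivative of 2F with respect to x in the region x > sigma2 / gamma_h."""
    return ((L + M) * x**2 - 2.0 * gamma_h * x + 2.0 * sigma2) / x**3

thr, x_acute, x_breve = local_emap_stationary_points(15.0, 1.0, 20, 50)
print(thr, x_acute, x_breve)
```

Between the two roots the derivative is negative, which is the decreasing branch described in Lemma 12.14; the larger root is the nontrivial local minimum.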
12.1.5 Proofs of Lemma 12.11 and Theorem 12.13

We first analyze the free energy for EPB-A learning. From Eq. (12.49), we have
\[
c_{a_h} c_{b_h} = \frac{\sigma^2}{\sqrt{(\gamma_h^{\text{PB-A}})^2 - M\sigma^2}}.
\]
Therefore, if
\[
\gamma_h \le \sigma \sqrt{M},
\]
there exists only the null solution (12.53) for any $c_{a_h} c_{b_h} > 0$, and therefore the free energy (12.81) is given by
\[
2\acute{F}_h^{\text{PB-A}}
= \min_{\hat{a}_h, \hat{b}_h \in \mathbb{R},\ \hat{\sigma}_{a_h}^2 \ge \varepsilon^2,\ \hat{\sigma}_{b_h}^2 = \varepsilon^2} 2F_h
= L \left( \log c_{b_h}^2 + \frac{\varepsilon^2}{c_{b_h}^2} - 1 + \chi \right).
\tag{12.105}
\]
In the following, we consider the case where
\[
\gamma_h > \sigma \sqrt{M}.
\tag{12.106}
\]
For $\varepsilon^2 \le c_{a_h} c_{b_h} \le \sigma^2 / \sqrt{\gamma_h^2 - M\sigma^2}$, there still exists only the null solution (12.53) with the free energy given by Eq. (12.105). The positive solution (12.50) appears for $c_{a_h} c_{b_h} > \sigma^2 / \sqrt{\gamma_h^2 - M\sigma^2}$ with the free energy given by
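\[
\begin{aligned}
2\acute{F}_h^{\text{PB-A}}
&= \min_{\hat{a}_h, \hat{b}_h \in \mathbb{R},\ \hat{\sigma}_{a_h}^2 \ge \varepsilon^2,\ \hat{\sigma}_{b_h}^2 = \varepsilon^2} 2F_h \\
&= \min_{\hat{a}_h, \hat{b}_h \in \mathbb{R},\ \hat{\sigma}_{a_h}^2 \ge \varepsilon^2}
 \left\{ M \log \frac{c_{a_h}^2}{\hat{\sigma}_{a_h}^2} + L \log c_{b_h}^2 + \frac{\hat{b}_h^2}{c_{b_h}^2}
 - \frac{2 \hat{a}_h \hat{b}_h \gamma_h}{\sigma^2}
 + \left( \hat{a}_h^2 + M \hat{\sigma}_{a_h}^2 \right) \left( \frac{\hat{b}_h^2}{\sigma^2} + \frac{1}{c_{a_h}^2} \right)
 - (L+M) + L\chi \right\} \\
&= \min_{\hat{b}_h \in \mathbb{R}}
 \left\{ M \log \left( \hat{b}_h^2 + \frac{\sigma^2}{c_{a_h}^2} \right)
 + \frac{\hat{b}_h^2 + \sigma^2 / c_{a_h}^2}{c_{b_h}^2}
 - \frac{\hat{b}_h^2 \gamma_h^2}{\sigma^2 \left( \hat{b}_h^2 + \sigma^2 / c_{a_h}^2 \right)} \right\} \\
&\qquad - \frac{\sigma^2}{c_{a_h}^2 c_{b_h}^2} + M \log c_{a_h}^2 + L \log c_{b_h}^2 - M \log \sigma^2 - L + L\chi \\
&= \min_{\hat{b}_h \in \mathbb{R}}
 \left\{ M \log \left( \hat{b}_h^2 + \frac{\sigma^2}{c_{a_h}^2} \right)
 + \frac{\hat{b}_h^2 + \sigma^2 / c_{a_h}^2}{c_{b_h}^2}
 + \frac{\gamma_h^2}{c_{a_h}^2 \left( \hat{b}_h^2 + \sigma^2 / c_{a_h}^2 \right)} \right\} \\
&\qquad - \frac{\sigma^2}{c_{a_h}^2 c_{b_h}^2} - \frac{\gamma_h^2}{\sigma^2} + M \log c_{a_h}^2 + L \log c_{b_h}^2 - M \log \sigma^2 - L + L\chi.
\end{aligned}
\tag{12.107}
\]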
Here we used the conditions (12.61) and (12.62) for the PB-A solution. By substituting the other conditions (12.64) and (12.66) into Eq. (12.107), we have
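\[
\begin{aligned}
2\acute{F}_h^{\text{PB-A}}
&= M \log \left( -M + \sqrt{M^2 + \frac{4\gamma_h^2}{c_{a_h}^2 c_{b_h}^2}} \right)
 + \frac{-M + \sqrt{M^2 + \frac{4\gamma_h^2}{c_{a_h}^2 c_{b_h}^2}}}{2}
 + \frac{M + \sqrt{M^2 + \frac{4\gamma_h^2}{c_{a_h}^2 c_{b_h}^2}}}{2} \\
&\qquad - \frac{\sigma^2}{c_{a_h}^2 c_{b_h}^2} - \frac{\gamma_h^2}{\sigma^2}
 + M \log c_{a_h}^2 + (L+M) \log c_{b_h}^2 - M \log(2\sigma^2) - L + L\chi \\
&= M \log \left( -M + \sqrt{M^2 + \frac{4\gamma_h^2}{c_{a_h}^2 c_{b_h}^2}} \right)
 + \sqrt{M^2 + \frac{4\gamma_h^2}{c_{a_h}^2 c_{b_h}^2}}
 - \frac{\sigma^2}{c_{a_h}^2 c_{b_h}^2} - \frac{\gamma_h^2}{\sigma^2} \\
&\qquad + M \log c_{a_h}^2 + (L+M) \log c_{b_h}^2 - M \log(2\sigma^2) - L + L\chi.
\end{aligned}
\tag{12.108}
\]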
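The step from Eq. (12.107) to Eq. (12.108) is a one-dimensional minimization, so it can be sanity-checked numerically. The sketch below is an illustration under stated assumptions: D denotes the quantity b_h² + σ²/c_{a_h}² that appears inside the minimization in the last line of Eq. (12.107), the closed form is obtained by setting the derivative with respect to D to zero, and all function names are hypothetical.

```python
import numpy as np

def inner_objective(D, gamma_h, M, ca2, cb2):
    """Bracketed part of the last line of (12.107), with D = b_h^2 + sigma^2 / c_a^2."""
    return M * np.log(D) + D / cb2 + gamma_h**2 / (ca2 * D)

def inner_minimum_closed_form(gamma_h, M, ca2, cb2):
    """Stationary point D* and the corresponding minimum value."""
    S = np.sqrt(M**2 + 4.0 * gamma_h**2 / (ca2 * cb2))
    D_star = 0.5 * cb2 * (S - M)
    value = M * np.log(S - M) + S + M * np.log(cb2) - M * np.log(2.0)
    return D_star, value

gamma_h, M, ca2, cb2 = 15.0, 50, 1.0, 1.0
grid = np.linspace(0.5, 20.0, 200001)               # brute-force search over D
numeric_min = inner_objective(grid, gamma_h, M, ca2, cb2).min()
D_star, closed_form = inner_minimum_closed_form(gamma_h, M, ca2, cb2)
print(D_star, numeric_min, closed_form)
```

The terms M log(−M + √·) and √· in the closed form are the ones that survive in Eq. (12.108); the remaining M log c_{b_h}² and −M log 2 are absorbed into the (L + M) log c_{b_h}² and −M log(2σ²) terms.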
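The PB-B free energy can be derived in exactly the same way, and the result is symmetric to the PB-A free energy. Namely, if
\[
\gamma_h \le \sigma \sqrt{L},
\]
there exists only the null solution (12.59) for any $c_{a_h} c_{b_h} > 0$ with the free energy given by
\[
2\acute{F}_h^{\text{PB-B}}
= \min_{\hat{a}_h, \hat{b}_h \in \mathbb{R},\ \hat{\sigma}_{a_h}^2 = \varepsilon^2,\ \hat{\sigma}_{b_h}^2 \ge \varepsilon^2} 2F_h
= M \left( \log c_{a_h}^2 + \frac{\varepsilon^2}{c_{a_h}^2} - 1 + \chi \right).
\tag{12.109}
\]
Assume that
\[
\gamma_h > \sigma \sqrt{L}.
\]
For $\varepsilon^2 \le c_{a_h} c_{b_h} \le \sigma^2 / \sqrt{\gamma_h^2 - L\sigma^2}$, there still exists only the null solution (12.59) with the free energy given by Eq. (12.109). The positive solution (12.56) appears for $c_{a_h} c_{b_h} > \sigma^2 / \sqrt{\gamma_h^2 - L\sigma^2}$ with the free energy given by
\[
\begin{aligned}
2\acute{F}_h^{\text{PB-B}}
&= \min_{\hat{a}_h, \hat{b}_h \in \mathbb{R},\ \hat{\sigma}_{a_h}^2 = \varepsilon^2,\ \hat{\sigma}_{b_h}^2 \ge \varepsilon^2} 2F_h \\
&= L \log \left( -L + \sqrt{L^2 + \frac{4\gamma_h^2}{c_{a_h}^2 c_{b_h}^2}} \right)
 + \sqrt{L^2 + \frac{4\gamma_h^2}{c_{a_h}^2 c_{b_h}^2}}
 - \frac{\sigma^2}{c_{a_h}^2 c_{b_h}^2} - \frac{\gamma_h^2}{\sigma^2} \\
&\qquad + (L+M) \log c_{a_h}^2 + L \log c_{b_h}^2 - L \log(2\sigma^2) - M + M\chi.
\end{aligned}
\tag{12.110}
\]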
By fixing the ratio between the prior covariances to cah /cbh = 1 in Eqs. (12.105) and (12.108) through (12.110), and taking the posterior choice in Eq. (12.47) into account, we obtain Eqs. (12.83) and (12.84), which prove Lemma 12.11.
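Let us minimize the free energy with respect to $c_{a_h} c_{b_h}$. When
\[
\gamma_h \le \sigma \sqrt{\max(L, M)},
\]
the free energy is given by Eq. (12.84), and its derivative is given by
\[
\frac{\partial\, 2\acute{F}_h^{\mathrm{PB}}}{\partial (c_{a_h} c_{b_h})}
= \frac{\min(L, M)}{c_{a_h}^2 c_{b_h}^2} \left( c_{a_h} c_{b_h} - \varepsilon^2 \right).
\]
This implies that the free energy $\acute{F}_h^{\mathrm{PB}}$ is increasing for $\varepsilon^2 < c_{a_h} c_{b_h} < \infty$, and therefore minimized at $c_{a_h} c_{b_h} = \varepsilon^2$. Assume that
\[
\gamma_h > \sigma \sqrt{\max(L, M)}.
\]
In this case, the free energy is given by Eq. (12.83), which is continuous in the domain $\varepsilon^2 \le c_{a_h} c_{b_h} < \infty$, and differentiable except at $c_{a_h} c_{b_h} = \sigma^2 / \sqrt{\gamma_h^2 - \max(L, M)\sigma^2}$. Although the continuity is not very obvious, one can verify it by checking the value at $c_{a_h} c_{b_h} = \sigma^2 / \sqrt{\gamma_h^2 - \max(L, M)\sigma^2}$ for each case in Eq. (12.83). The continuity is also expected from the fact that the PB solution is continuous at the threshold $\gamma_h = \gamma_h^{\mathrm{PB}}$, i.e., the positive solution (Eq. (12.50) for PB-A and Eq. (12.56) for PB-B) converges to the null solution (Eq. (12.53) for PB-A and Eq. (12.59) for PB-B) when $\gamma_h \to \gamma_h^{\text{PB-A}} + 0$.

The free energy (12.83) is the same as Eq. (12.84) for $\varepsilon^2 \le c_{a_h} c_{b_h} \le \sigma^2 / \sqrt{\gamma_h^2 - \max(L, M)\sigma^2}$, and therefore increasing in $\varepsilon^2 < c_{a_h} c_{b_h} \le \sigma^2 / \sqrt{\gamma_h^2 - \max(L, M)\sigma^2}$. For $c_{a_h} c_{b_h} > \sigma^2 / \sqrt{\gamma_h^2 - \max(L, M)\sigma^2}$, the derivative of the free energy with respect to $c_{a_h}^2 c_{b_h}^2$ is given by
\[
\begin{aligned}
\frac{\partial\, 2\acute{F}_h^{\mathrm{PB}}}{\partial (c_{a_h}^2 c_{b_h}^2)}
&= \frac{\min(L,M) + 2\max(L,M)}{2 c_{a_h}^2 c_{b_h}^2}
 - \frac{4\gamma_h^2}{2 c_{a_h}^4 c_{b_h}^4 \sqrt{\max(L,M)^2 + \frac{4\gamma_h^2}{c_{a_h}^2 c_{b_h}^2}}}
 + \frac{\sigma^2}{c_{a_h}^4 c_{b_h}^4} \\
&\qquad - \frac{4 \max(L,M)\, \gamma_h^2}{2 c_{a_h}^4 c_{b_h}^4 \sqrt{\max(L,M)^2 + \frac{4\gamma_h^2}{c_{a_h}^2 c_{b_h}^2}}
 \left( -\max(L,M) + \sqrt{\max(L,M)^2 + \frac{4\gamma_h^2}{c_{a_h}^2 c_{b_h}^2}} \right)} \\
&= \frac{\min(L,M) + 2\max(L,M)}{2 c_{a_h}^2 c_{b_h}^2} + \frac{\sigma^2}{c_{a_h}^4 c_{b_h}^4}
 - \frac{4\gamma_h^2}{2 c_{a_h}^4 c_{b_h}^4 \left( -\max(L,M) + \sqrt{\max(L,M)^2 + \frac{4\gamma_h^2}{c_{a_h}^2 c_{b_h}^2}} \right)} \\
&= \frac{1}{2 c_{a_h}^2 c_{b_h}^2} \left( L + M + \frac{2\sigma^2}{c_{a_h}^2 c_{b_h}^2}
 - \sqrt{\max(L,M)^2 + \frac{4\gamma_h^2}{c_{a_h}^2 c_{b_h}^2}} \right) \\
&= \frac{1}{2 c_{a_h}^4 c_{b_h}^4} \left( (L+M)\, c_{a_h}^2 c_{b_h}^2 + 2\sigma^2
 - c_{a_h} c_{b_h} \sqrt{\max(L,M)^2 c_{a_h}^2 c_{b_h}^2 + 4\gamma_h^2} \right),
\end{aligned}
\tag{12.111}
\]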
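which has the same sign as
\[
\begin{aligned}
\tau(c_{a_h}^2 c_{b_h}^2)
&= \left( (L+M)\, c_{a_h}^2 c_{b_h}^2 + 2\sigma^2 \right)^2
 - \left( c_{a_h} c_{b_h} \sqrt{\max(L,M)^2 c_{a_h}^2 c_{b_h}^2 + 4\gamma_h^2} \right)^2 \\
&= \left( 2LM + \min(L,M)^2 \right) c_{a_h}^4 c_{b_h}^4
 - 4 \left( \gamma_h^2 - \sigma^2 (L+M) \right) c_{a_h}^2 c_{b_h}^2 + 4\sigma^4.
\end{aligned}
\tag{12.112}
\]
Eq. (12.112) is a quadratic function of $c_{a_h}^2 c_{b_h}^2$, being positive at $c_{a_h}^2 c_{b_h}^2 \to +0$ and at $c_{a_h}^2 c_{b_h}^2 \to +\infty$. The free energy has stationary points if and only if
\[
\gamma_h \ge \sigma \sqrt{L + M + \sqrt{2LM + \min(L,M)^2}} \equiv \gamma^{\text{local-EPB}},
\tag{12.113}
\]
because $\tau(c_{a_h}^2 c_{b_h}^2)$, which has the same sign as the derivative of the free energy, can be factorized (with real factors if and only if the condition (12.113) holds) as
\[
\tau(c_{a_h}^2 c_{b_h}^2) = \left( 2LM + \min(L,M)^2 \right)
 \left( c_{a_h}^2 c_{b_h}^2 - \acute{c}_{a_h}^2 \acute{c}_{b_h}^2 \right)
 \left( c_{a_h}^2 c_{b_h}^2 - \breve{c}_{a_h}^2 \breve{c}_{b_h}^2 \right),
\]
where
\[
\acute{c}_{a_h}^2 \acute{c}_{b_h}^2 = 2 \cdot
\frac{\left( \gamma_h^2 - \sigma^2 (L+M) \right) - \sqrt{\left( \gamma_h^2 - \sigma^2 (L+M) \right)^2 - \left( 2LM + \min(L,M)^2 \right) \sigma^4}}{2LM + \min(L,M)^2},
\tag{12.114}
\]
\[
\breve{c}_{a_h}^2 \breve{c}_{b_h}^2 = 2 \cdot
\frac{\left( \gamma_h^2 - \sigma^2 (L+M) \right) + \sqrt{\left( \gamma_h^2 - \sigma^2 (L+M) \right)^2 - \left( 2LM + \min(L,M)^2 \right) \sigma^4}}{2LM + \min(L,M)^2}.
\tag{12.115}
\]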
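Summarizing the preceding discussion, we have the following lemma:

Lemma 12.15 If $\gamma_h \le \gamma^{\text{local-EPB}}$, the EPB free energy $\acute{F}_h^{\mathrm{PB}}$, defined by Eqs. (12.83) and (12.84), is increasing for $c_{a_h} c_{b_h} > \varepsilon^2$, and therefore minimized at $c_{a_h} c_{b_h} = \varepsilon^2$. If $\gamma_h > \gamma^{\text{local-EPB}}$,
\[
\acute{F}_h^{\mathrm{PB}} \text{ is }
\begin{cases}
\text{increasing} & \text{for } \varepsilon^2 < c_{a_h} c_{b_h} < \acute{c}_{a_h} \acute{c}_{b_h}, \\
\text{decreasing} & \text{for } \acute{c}_{a_h} \acute{c}_{b_h} < c_{a_h} c_{b_h} < \breve{c}_{a_h} \breve{c}_{b_h}, \\
\text{increasing} & \text{for } \breve{c}_{a_h} \breve{c}_{b_h} < c_{a_h} c_{b_h} < +\infty,
\end{cases}
\]
and therefore has two (local) minima at $c_{a_h} c_{b_h} = \varepsilon^2$ and at $c_{a_h} c_{b_h} = \breve{c}_{a_h} \breve{c}_{b_h}$. Here, $\acute{c}_{a_h} \acute{c}_{b_h}$ and $\breve{c}_{a_h} \breve{c}_{b_h}$ are defined by Eqs. (12.114) and (12.115), respectively.

When $\gamma_h > \gamma^{\text{local-EPB}}$, the EPB free energy (12.83) at the null local solution $c_{a_h} c_{b_h} = \varepsilon^2$ is $2\acute{F}_h^{\mathrm{PB}} = 0$, while the EPB free energy at the positive local solution $c_{a_h} c_{b_h} = \breve{c}_{a_h} \breve{c}_{b_h}$ contains the term $\min(L, M)\chi$ with $\chi = -\log \varepsilon^2$ assumed to be arbitrarily large. Therefore, the null solution is always the global minimum. Substituting Eq. (12.115) into Eq. (12.46) gives Eq. (12.88), which completes the proof of Theorem 12.13.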
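The quantities in Eqs. (12.112) through (12.115) can be checked with a few lines of code. The following is a sketch with hypothetical function names; the comparison with the MPUL additionally assumes the unnormalized form σ(√L + √M), which is not restated in this section.

```python
import numpy as np

def local_epb_threshold(sigma2, L, M):
    """Threshold (12.113) on the observed singular value."""
    return np.sqrt(sigma2 * (L + M + np.sqrt(2.0 * L * M + min(L, M)**2)))

def tau(y, gamma_h, sigma2, L, M):
    """Quadratic (12.112) in y = c_a^2 * c_b^2."""
    B = 2.0 * L * M + min(L, M)**2
    A = gamma_h**2 - sigma2 * (L + M)
    return B * y**2 - 4.0 * A * y + 4.0 * sigma2**2

def tau_roots(gamma_h, sigma2, L, M):
    """Roots (12.114) and (12.115), or None if no stationary point exists."""
    B = 2.0 * L * M + min(L, M)**2
    A = gamma_h**2 - sigma2 * (L + M)
    disc = A**2 - B * sigma2**2
    if A <= 0 or disc < 0:
        return None
    r = np.sqrt(disc)
    return 2.0 * (A - r) / B, 2.0 * (A + r) / B

L, M, sigma2 = 20, 50, 1.0
print(local_epb_threshold(sigma2, L, M))        # local-EPB threshold
print(np.sqrt(sigma2) * (np.sqrt(L) + np.sqrt(M)))  # assumed MPUL
print(np.sqrt(2.0 * sigma2 * (L + M)))          # local-EMAP threshold (12.102)
print(tau_roots(15.0, sigma2, L, M))
```

For this example the three printed thresholds are ordered as local-EPB < MPUL < local-EMAP, in line with the discussion of the inequalities (12.98).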
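12.1.6 Noise Variance Estimation

The noise variance $\sigma^2$ is unknown in many practical applications. In VB learning, minimizing the free energy (12.26) with respect also to $\sigma^2$ gives a reasonable estimator, with which perfect dimensionality recovery was proven in Chapter 8. Here, we investigate whether MAP learning and PB learning offer good noise variance estimators.

We first consider the nonempirical Bayesian variants where the prior covariances $C_A, C_B$ are treated as given constants. By using Lemma 12.10, we can write the MAP free energy with the variational parameters optimized, as a function of $\sigma^2$, as follows:
\[
\begin{aligned}
2\acute{F}^{\mathrm{MAP}}
&= LM \log(2\pi\sigma^2) + \frac{\sum_{h=1}^{\min(L,M)} \gamma_h^2}{\sigma^2} + \sum_{h=1}^{H} 2\acute{F}_h^{\mathrm{MAP}} \\
&= LM \log(2\pi\sigma^2) + \frac{\sum_{h=1}^{\min(L,M)} \gamma_h^2}{\sigma^2}
 + \sum_{h=1}^{\min(H, \overline{H})} \left\{ (L+M) \left( \log c_{a_h} c_{b_h} - 1 + \chi \right)
 - \sigma^{-2} \left( \gamma_h - \frac{\sigma^2}{c_{a_h} c_{b_h}} \right)^2 \right\} \\
&\qquad + \sum_{h=\min(H, \overline{H})+1}^{\min(L,M)} (L+M) \left( \log c_{a_h} c_{b_h} + \frac{\varepsilon^2}{c_{a_h} c_{b_h}} - 1 + \chi \right) \\
&= LM \log \sigma^2 + \frac{\sum_{h=\min(H, \overline{H})+1}^{\min(L,M)} \gamma_h^2}{\sigma^2}
 + \sum_{h=1}^{\min(H, \overline{H})} \left( \frac{2\gamma_h}{c_{a_h} c_{b_h}} - \frac{\sigma^2}{c_{a_h}^2 c_{b_h}^2} \right) \\
&\qquad + LM \log(2\pi) + \sum_{h=1}^{\min(L,M)} (L+M) \left( \log c_{a_h} c_{b_h} + \frac{\varepsilon^2}{c_{a_h} c_{b_h}} - 1 + \chi \right) \\
&= LM \log \sigma^2 + \frac{\sum_{h=\min(H, \overline{H})+1}^{\min(L,M)} \gamma_h^2}{\sigma^2}
 + \sum_{h=1}^{\min(H, \overline{H})} \left( \frac{2\gamma_h}{c_{a_h} c_{b_h}} - \frac{\sigma^2}{c_{a_h}^2 c_{b_h}^2} \right) + \mathrm{const.}
\end{aligned}
\tag{12.116}
\]
for
\[
\sigma^{2\,\mathrm{MAP}}_{\overline{H}+1} \le \sigma^2 \le \sigma^{2\,\mathrm{MAP}}_{\overline{H}},
\tag{12.117}
\]
where
\[
\sigma^{2\,\mathrm{MAP}}_{h} =
\begin{cases}
\infty & \text{for } h = 0, \\
c_{a_h} c_{b_h} \gamma_h & \text{for } h = 1, \ldots, \min(L, M), \\
0 & \text{for } h = \min(L, M) + 1.
\end{cases}
\tag{12.118}
\]
Assume that we use the full-rank model $H = \min(L, M)$, and expect the ARD property to find the correct rank. Under this setting, the free energy (12.116) can be arbitrarily small for $\sigma^2 \to +0$, because the first term diverges to $-\infty$, and the second term is equal to zero for $0\ (= \sigma^{2\,\mathrm{MAP}}_{\min(L,M)+1}) \le \sigma^2 \le c_{a_{\min(L,M)}} c_{b_{\min(L,M)}} \gamma_{\min(L,M)}\ (= \sigma^{2\,\mathrm{MAP}}_{\min(L,M)})$ (note that $\gamma_{\min(L,M)} > 0$ with probability 1). This leads to the following lemma: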
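Lemma 12.16 Assume that $H = \min(L, M)$ and $C_A, C_B$ are given as constants. Then the MAP free energy with respect to $\sigma^2$ is (globally) minimized at
\[
\hat{\sigma}^{2\,\mathrm{MAP}} \to +0.
\]
The PB free energy behaves differently. By using Lemma 12.11, we can write the PB free energy with the variational parameters optimized, as a function of $\sigma^2$, as follows:
\[
\begin{aligned}
2\acute{F}^{\mathrm{PB}}
&= LM \log(2\pi\sigma^2) + \frac{\sum_{h=1}^{\min(L,M)} \gamma_h^2}{\sigma^2} + \sum_{h=1}^{H} 2\acute{F}_h^{\mathrm{PB}} \\
&= LM \log(2\pi\sigma^2) + \frac{\sum_{h=1}^{\min(L,M)} \gamma_h^2}{\sigma^2} \\
&\qquad + \sum_{h=1}^{\min(H, \overline{H})} \Bigg\{ \frac{\min(L,M) + 2\max(L,M)}{2} \log c_{a_h}^2 c_{b_h}^2
 + \sqrt{\max(L,M)^2 + \frac{4\gamma_h^2}{c_{a_h}^2 c_{b_h}^2}} - \frac{\sigma^2}{c_{a_h}^2 c_{b_h}^2} \\
&\qquad\qquad + \max(L,M) \log \left( -\max(L,M) + \sqrt{\max(L,M)^2 + \frac{4\gamma_h^2}{c_{a_h}^2 c_{b_h}^2}} \right)
 - \frac{\gamma_h^2}{\sigma^2} - \max(L,M) \log(2\sigma^2) + \min(L,M)(\chi - 1) \Bigg\} \\
&\qquad + \sum_{h=\min(H, \overline{H})+1}^{\min(L,M)} \min(L,M) \left( \log c_{a_h} c_{b_h} + \frac{\varepsilon^2}{c_{a_h} c_{b_h}} - 1 + \chi \right) \\
&= \left( \min(L,M) - \min(H, \overline{H}) \right) \max(L,M) \log(2\sigma^2)
 + \frac{\sum_{h=\min(H, \overline{H})+1}^{\min(L,M)} \gamma_h^2}{\sigma^2} \\
&\qquad + \sum_{h=1}^{\min(H, \overline{H})} \Bigg\{ \max(L,M) \log c_{a_h}^2 c_{b_h}^2
 + \sqrt{\max(L,M)^2 + \frac{4\gamma_h^2}{c_{a_h}^2 c_{b_h}^2}} - \frac{\sigma^2}{c_{a_h}^2 c_{b_h}^2} \\
&\qquad\qquad + \max(L,M) \log \left( -\max(L,M) + \sqrt{\max(L,M)^2 + \frac{4\gamma_h^2}{c_{a_h}^2 c_{b_h}^2}} \right) \Bigg\} \\
&\qquad + LM \log(\pi) + \sum_{h=1}^{\min(L,M)} \min(L,M) \left( \log c_{a_h} c_{b_h} - 1 + \chi \right) \\
&= \left( \min(L,M) - \min(H, \overline{H}) \right) \max(L,M) \log(2\sigma^2)
 + \frac{\sum_{h=\min(H, \overline{H})+1}^{\min(L,M)} \gamma_h^2}{\sigma^2} \\
&\qquad + \sum_{h=1}^{\min(H, \overline{H})} \Bigg\{ \max(L,M) \log c_{a_h}^2 c_{b_h}^2
 + \sqrt{\max(L,M)^2 + \frac{4\gamma_h^2}{c_{a_h}^2 c_{b_h}^2}} - \frac{\sigma^2}{c_{a_h}^2 c_{b_h}^2} \\
&\qquad\qquad + \max(L,M) \log \left( -\max(L,M) + \sqrt{\max(L,M)^2 + \frac{4\gamma_h^2}{c_{a_h}^2 c_{b_h}^2}} \right) \Bigg\} + \mathrm{const.}
\end{aligned}
\tag{12.119}
\]
for
\[
\sigma^{2\,\mathrm{PB}}_{\overline{H}+1} \le \sigma^2 \le \sigma^{2\,\mathrm{PB}}_{\overline{H}},
\tag{12.120}
\]
where
\[
\sigma^{2\,\mathrm{PB}}_{h} =
\begin{cases}
\infty & \text{for } h = 0, \\[1ex]
\dfrac{c_{a_h}^2 c_{b_h}^2}{2} \left( -\max(L,M) + \sqrt{\max(L,M)^2 + \dfrac{4\gamma_h^2}{c_{a_h}^2 c_{b_h}^2}} \right)
 & \text{for } h = 1, \ldots, \min(L, M), \\[1ex]
0 & \text{for } h = \min(L, M) + 1.
\end{cases}
\tag{12.121}
\]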
We find a remarkable difference between the MAP free energy (12.116) and the PB free energy (12.119): unlike in the MAP free energy, the first log term in the PB free energy disappears for $0 = \sigma^{2\,\mathrm{PB}}_{\min(L,M)+1} < \sigma^2 < \sigma^{2\,\mathrm{PB}}_{\min(L,M)}$, and therefore, the PB free energy does not diverge to $-\infty$ at $\sigma^2 \to +0$. We can actually prove that the noise variance estimator is lower-bounded by a positive value as follows. The PB free energy (12.119) is continuous, and, for $0 = \sigma^{2\,\mathrm{PB}}_{\min(L,M)+1} < \sigma^2 < \sigma^{2\,\mathrm{PB}}_{\min(L,M)}$, it can be written as
\[
2\acute{F}^{\mathrm{PB}} = - \sum_{h=1}^{\min(L,M)} \frac{\sigma^2}{c_{a_h}^2 c_{b_h}^2} + \mathrm{const.},
\]
which is monotonically decreasing. This leads to the following lemma:

Lemma 12.17 Assume that $H = \min(L, M)$ and $C_A, C_B$ are given as constants. Then the noise variance estimator in PB learning is lower-bounded by
\[
\hat{\sigma}^{2\,\mathrm{PB}} \ge \sigma^{2\,\mathrm{PB}}_{\min(L,M)}
= \frac{c_{a_{\min(L,M)}}^2 c_{b_{\min(L,M)}}^2}{2}
 \left( -\max(L,M) + \sqrt{\max(L,M)^2 + \frac{4\gamma_{\min(L,M)}^2}{c_{a_{\min(L,M)}}^2 c_{b_{\min(L,M)}}^2}} \right).
\tag{12.122}
\]
We numerically investigated the behavior of the noise variance estimator by creating random observed matrices $V = B^* A^{*\top} + E \in \mathbb{R}^{L \times M}$, and depicting the VB, PB, and MAP free energies as functions of $\sigma^2$ with the variational parameters optimized. Figure 12.7 shows a typical case for $L = 20$, $M = 50$, $H^* = 2$ with the entries of $A^* \in \mathbb{R}^{M \times H^*}$ and $B^* \in \mathbb{R}^{L \times H^*}$ independently drawn from $\mathrm{Gauss}_1(0, 1^2)$, and the entries of $E \in \mathbb{R}^{L \times M}$ independently drawn from $\mathrm{Gauss}_1(0, 0.3^2)$. We set the prior covariances to $c_{a_h} c_{b_h} = 1$. As Lemma 12.16 states, the global minimum of the MAP free energy is at $\sigma^2 \to +0$. Since no nontrivial local minimum is observed, local search gives the same trivial solution. On the other hand, the PB free energy has a minimum in the positive region $\sigma^2 > 0$ with probability 1, as Lemma 12.17 states. However, we empirically observed that PB learning tends to underestimate the noise variance, as in Figure 12.7. Therefore, we cannot expect that the noise variance estimation works well in PB learning, either.

Figure 12.7 Free energy dependence on σ² (curves for VB, PB, and MAP). Crosses indicate nontrivial minima.

The situation is more complicated in the empirical Bayesian variants. Since the global EMAP estimator and the global EPB estimator, given any $\sigma^2 > 0$, are the null solution, the joint global optimization over all variational parameters and the hyperparameters results in $\hat{\sigma}^2 = \sum_{h=1}^{\min(L,M)} \gamma_h^2 / (LM)$, regardless of observations; that is, all observed signals are considered to be noise. If we adopt nontrivial local solutions as estimators, i.e., the local-EMAP estimator and the local-EPB estimator, the free energies are not continuous anymore as functions of $\sigma^2$, because of the energy jump by the entropy factor $\chi$ of the pseudo-Dirac delta function. Even in that case, if we globally minimize the free energies with respect to $\sigma^2$, the estimator contains no nontrivial local solution, because the null solutions cancel all entropy factors. As such, no reasonable way to estimate the noise variance has been found in EMAP learning and in EPB learning. In the previous work on the TF model with PB learning (Chu and Ghahramani, 2009) and with EMAP learning (Mørup and Hansen, 2009), the noise variance was treated as a given constant. This was perhaps because the noise variance estimation failed, which is consistent with the preceding discussion.
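The breakpoints (12.118) and (12.121), and hence the lower bound of Lemma 12.17, can be computed directly from the observed singular values. The sketch below mirrors the setup described for Figure 12.7 (L = 20, M = 50, H* = 2, noise standard deviation 0.3, unit prior covariances); the function names, the random seed, and the use of a single scalar for the prior product are assumptions made for illustration.

```python
import numpy as np

def make_observation(L, M, H_true, noise_std, seed=0):
    """V = B A^T + E with standard Gaussian factors and Gaussian noise."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(M, H_true))
    B = rng.normal(size=(L, H_true))
    E = rng.normal(0.0, noise_std, size=(L, M))
    return B @ A.T + E

def sigma2_breakpoints_map(gammas, ca_cb):
    """sigma^2_h for MAP learning, Eq. (12.118), for h = 1..min(L, M)."""
    return ca_cb * gammas

def sigma2_breakpoints_pb(gammas, ca_cb, L, M):
    """sigma^2_h for PB learning, Eq. (12.121), for h = 1..min(L, M)."""
    c2 = ca_cb**2
    mx = max(L, M)
    return 0.5 * c2 * (-mx + np.sqrt(mx**2 + 4.0 * gammas**2 / c2))

L, M = 20, 50
V = make_observation(L, M, H_true=2, noise_std=0.3)
gammas = np.linalg.svd(V, compute_uv=False)      # sorted in decreasing order
pb = sigma2_breakpoints_pb(gammas, 1.0, L, M)
print("PB lower bound on the noise variance:", pb[-1])
```

The last PB breakpoint is strictly positive whenever the smallest singular value is positive, which is the content of the lower bound (12.122); it says nothing about how close the PB estimate is to the true noise variance.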
12.2 More General Cases

Although extending the analysis for fully observed MF to more general cases is not easy in general, some basic properties can be shown. Specifically, this section shows that the global solutions for EMAP learning and EPB learning are also trivial and useless in the MF model with missing entries and in the TF model. Nevertheless, we experimentally show in Section 12.3 that local search for EMAP learning and EPB learning provides estimators that behave similarly to the EVB estimator.
12.2.1 Matrix Factorization with Missing Entries

The MF model with missing entries was introduced in Section 3.2. There, the likelihood (12.1) is replaced with
\[
p(V | A, B) \propto \exp \left( - \frac{1}{2\sigma^2} \left\| P_\Lambda(V) - P_\Lambda\!\left( B A^\top \right) \right\|_{\mathrm{Fro}}^2 \right),
\tag{12.123}
\]
where $\Lambda$ denotes the set of observed indices, and
\[
\left( P_\Lambda(V) \right)_{l,m} =
\begin{cases}
V_{l,m} & \text{if } (l, m) \in \Lambda, \\
0 & \text{otherwise.}
\end{cases}
\]
The VB free energy is explicitly written as
\[
\begin{aligned}
2F &= \#(\Lambda) \cdot \log(2\pi\sigma^2) + M \log\det(C_A) + L \log\det(C_B) \\
&\quad - \sum_{m=1}^{M} \log\det \hat{\Sigma}_{A,m} - \sum_{l=1}^{L} \log\det \hat{\Sigma}_{B,l} - (L+M)H \\
&\quad + \mathrm{tr} \left\{ C_A^{-1} \left( \hat{A}^\top \hat{A} + \sum_{m=1}^{M} \hat{\Sigma}_{A,m} \right)
 + C_B^{-1} \left( \hat{B}^\top \hat{B} + \sum_{l=1}^{L} \hat{\Sigma}_{B,l} \right) \right\} \\
&\quad + \sigma^{-2} \sum_{(l,m) \in \Lambda} \left\{ V_{l,m}^2 - 2 V_{l,m}\, \hat{a}_m^\top \hat{b}_l
 + \mathrm{tr} \left( \left( \hat{a}_m \hat{a}_m^\top + \hat{\Sigma}_{A,m} \right) \left( \hat{b}_l \hat{b}_l^\top + \hat{\Sigma}_{B,l} \right) \right) \right\},
\end{aligned}
\tag{12.124}
\]
where $\#(\Lambda)$ denotes the number of observed entries. We define the EMAP learning problem and the EPB learning problem by Eq. (12.77) with the free energy given by Eq. (12.124). The following holds:

Lemma 12.18 The global solutions of EMAP learning and EPB learning for the MF model with missing entries, i.e., Eqs. (12.123), (12.2), and (12.3), are $\hat{U}^{\mathrm{EMAP}} = \hat{U}^{\mathrm{EPB}} = \hat{B} \hat{A}^\top = 0_{(L,M)}$, regardless of observations.

Proof The posterior covariance for A is clipped to $\hat{\Sigma}_{A,m} = \varepsilon^2 I_H$ in EPB-B learning, while the posterior covariance for B is clipped to $\hat{\Sigma}_{B,l} = \varepsilon^2 I_H$ in EPB-A learning. In either case, one can make the second or the third term in the free energy (12.124) arbitrarily small to cancel the fourth or the fifth term by setting $C_A = \varepsilon^2 I_H$ or $C_B = \varepsilon^2 I_H$. Then, because of the terms in the third line of Eq. (12.124), which come from the prior distributions, it holds that $\hat{A} \to 0_{(M,H)}$ or $\hat{B} \to 0_{(L,H)}$ for $\varepsilon^2 \to +0$, which results in $\hat{U}^{\mathrm{EPB}} = \hat{B} \hat{A}^\top \to 0_{(L,M)}$. In EMAP learning, both posterior covariances are clipped to $\hat{\Sigma}_{A,m} = \hat{\Sigma}_{B,l} = \varepsilon^2 I_H$. By the same argument as for EPB learning, we can show that $\hat{U}^{\mathrm{EMAP}} = \hat{B} \hat{A}^\top \to 0_{(L,M)}$, which completes the proof.
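12.2.2 Tucker Factorization

The TF model was introduced in Section 3.3.1. The likelihood and the priors are given by
\[
p(\mathcal{V} | \mathcal{G}, \{A^{(n)}\}) \propto \exp \left( - \frac{\left\| \mathcal{V} - \mathcal{G} \times_1 A^{(1)} \cdots \times_N A^{(N)} \right\|^2}{2\sigma^2} \right),
\tag{12.125}
\]
\[
p(\mathcal{G}) \propto \exp \left( - \frac{\mathrm{vec}(\mathcal{G})^\top \left( C_{G^{(N)}} \otimes \cdots \otimes C_{G^{(1)}} \right)^{-1} \mathrm{vec}(\mathcal{G})}{2} \right),
\tag{12.126}
\]
\[
p(\{A^{(n)}\}) \propto \exp \left( - \frac{\sum_{n=1}^{N} \mathrm{tr} \left( A^{(n)} C_{A^{(n)}}^{-1} A^{(n)\top} \right)}{2} \right),
\tag{12.127}
\]
where $\otimes$ and $\mathrm{vec}(\cdot)$ denote the Kronecker product and the vectorization operator, respectively. $\{C_{G^{(n)}}\}$ and $\{C_{A^{(n)}}\}$ are the prior covariances restricted to be diagonal, i.e.,
\[
C_{G^{(n)}} = \mathrm{Diag} \left( c_{g_1^{(n)}}^2, \ldots, c_{g_{H^{(n)}}^{(n)}}^2 \right), \qquad
C_{A^{(n)}} = \mathrm{Diag} \left( c_{a_1^{(n)}}^2, \ldots, c_{a_{H^{(n)}}^{(n)}}^2 \right).
\]
We denote $\breve{C}_G = C_{G^{(N)}} \otimes \cdots \otimes C_{G^{(1)}}$. The VB free energy is explicitly written as
\[
\begin{aligned}
2F &= \left( \prod_{n=1}^{N} M^{(n)} \right) \log(2\pi\sigma^2) + \log\det \breve{C}_G + \sum_{n=1}^{N} M^{(n)} \log\det \left( C_{A^{(n)}} \right) \\
&\quad - \log\det \hat{\breve{\Sigma}}_G - \sum_{n=1}^{N} M^{(n)} \log\det \hat{\Sigma}_{A^{(n)}} \\
&\quad + \frac{\|\mathcal{V}\|^2}{\sigma^2} - \prod_{n=1}^{N} H^{(n)} - \sum_{n=1}^{N} \left( M^{(n)} H^{(n)} \right) \\
&\quad + \mathrm{tr} \left( \breve{C}_G^{-1} \left( \hat{\breve{g}} \hat{\breve{g}}^\top + \hat{\breve{\Sigma}}_G \right) \right)
 + \sum_{n=1}^{N} \mathrm{tr} \left( C_{A^{(n)}}^{-1} \left( \hat{A}^{(n)\top} \hat{A}^{(n)} + M^{(n)} \hat{\Sigma}_{A^{(n)}} \right) \right) \\
&\quad - \frac{2}{\sigma^2} \breve{v}^\top \left( \hat{A}^{(N)} \otimes \cdots \otimes \hat{A}^{(1)} \right) \hat{\breve{g}} \\
&\quad + \frac{1}{\sigma^2} \mathrm{tr} \left\{ \left( \left( \hat{A}^{(N)\top} \hat{A}^{(N)} + M^{(N)} \hat{\Sigma}_{A^{(N)}} \right) \otimes \cdots \otimes
 \left( \hat{A}^{(1)\top} \hat{A}^{(1)} + M^{(1)} \hat{\Sigma}_{A^{(1)}} \right) \right)
 \cdot \left( \hat{\breve{g}} \hat{\breve{g}}^\top + \hat{\breve{\Sigma}}_G \right) \right\}.
\end{aligned}
\tag{12.128}
\]
In the TF model, we refer as PB-G learning to the approximate Bayesian method where the posteriors for the factor matrices $\{A^{(n)}\}$ are approximated by the pseudo-Dirac delta function, and as PB-A learning to the one where the posterior for the core tensor $\mathcal{G}$ is approximated by the pseudo-Dirac delta function. PB learning chooses the one giving a lower free energy from PB-G learning and PB-A learning. MAP learning approximates both posteriors by the pseudo-Dirac delta function. Note that the approach by Chu and Ghahramani (2009) corresponds to PB-G learning with the prior covariances fixed to $C_{G^{(n)}} = C_{A^{(n)}} = I_{H^{(n)}}$ for $n = 1, \ldots, N$, while the approach, called ARD Tucker, by Mørup and Hansen (2009) corresponds to EMAP learning with the prior covariances estimated from observations. In both approaches, the noise variance $\sigma^2$ was treated as a given constant. Again the global solutions of EMAP learning and EPB learning are trivial and useless.

Lemma 12.19 The global solutions of EMAP learning and EPB learning for the TF model, i.e., Eqs. (12.125) through (12.127), are $\hat{\mathcal{U}}^{\mathrm{EMAP}} = \hat{\mathcal{U}}^{\mathrm{EPB}} = \hat{\mathcal{G}} \times_1 \hat{A}^{(1)} \cdots \times_N \hat{A}^{(N)} = 0_{(M^{(1)}, \ldots, M^{(N)})}$, regardless of observations.

Proof The posterior covariance for $\mathcal{G}$ is clipped to $\hat{\breve{\Sigma}}_G = \varepsilon^2 I_{\prod_{n=1}^{N} H^{(n)}}$ in EPB-A learning, while the posterior covariances for $\{A^{(n)}\}$ are clipped to $\{\hat{\Sigma}_{A^{(n)}} = \varepsilon^2 I_{H^{(n)}}\}$ in EPB-G learning. In either case, one can make the second or the third term in the free energy (12.128) arbitrarily small to cancel the fourth or the fifth term by setting $\{C_{G^{(n)}} = \varepsilon^2 I_{H^{(n)}}\}$ or $\{C_{A^{(n)}} = \varepsilon^2 I_{H^{(n)}}\}$. Then, because of the terms in the fourth line of Eq. (12.128), which come from the prior distributions, it holds that $\hat{\mathcal{G}} \to 0_{(H^{(1)}, \ldots, H^{(N)})}$ or $\{\hat{A}^{(n)} \to 0_{(M^{(n)}, H^{(n)})}\}$ for $\varepsilon^2 \to +0$, which results in $\hat{\mathcal{U}}^{\mathrm{EPB}} = 0_{(M^{(1)}, \ldots, M^{(N)})}$. In EMAP learning, both posterior covariances are clipped to $\hat{\breve{\Sigma}}_G = \varepsilon^2 I_{\prod_{n=1}^{N} H^{(n)}}$ and $\{\hat{\Sigma}_{A^{(n)}} = \varepsilon^2 I_{H^{(n)}}\}$, respectively. By the same argument as for EPB learning, we can show that $\hat{\mathcal{U}}^{\mathrm{EMAP}} = 0_{(M^{(1)}, \ldots, M^{(N)})}$, which completes the proof.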
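The Kronecker ordering in Eqs. (12.126) and (12.128) matches the mode products in Eq. (12.125) only for a particular vectorization convention. The sketch below checks this for a three-mode tensor; it assumes that vec(·) stacks entries with the first index running fastest (column-major order), and the function names are hypothetical.

```python
import numpy as np

def tucker_reconstruct(G, A1, A2, A3):
    """U = G x_1 A1 x_2 A2 x_3 A3 for a three-mode core tensor G."""
    return np.einsum('abc,ia,jb,kc->ijk', G, A1, A2, A3)

def vec(T):
    """Column-major vectorization (first index runs fastest)."""
    return T.reshape(-1, order='F')

rng = np.random.default_rng(0)
G = rng.normal(size=(2, 3, 4))                     # core tensor, H = (2, 3, 4)
A1, A2, A3 = (rng.normal(size=s) for s in [(5, 2), (6, 3), (7, 4)])

U = tucker_reconstruct(G, A1, A2, A3)
lhs = vec(U)
rhs = np.kron(A3, np.kron(A2, A1)) @ vec(G)        # (A3 kron A2 kron A1) vec(G)
print(np.allclose(lhs, rhs))
```

With a row-major vectorization the Kronecker factors would have to be taken in the opposite order, so the convention should be fixed before implementing the update rules.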
12.3 Experimental Results

In this section, we experimentally investigate the behavior of EMAP learning and EPB learning, in comparison with EVB learning. We start from the fully observed MF model, where we can assess how often local search finds the nontrivial local solution (derived in Section 12.1.3) rather than the global null solution. After that, we conduct experiments in collaborative filtering, where the MF model with missing entries is used, and in TF.

For local search, we adopted the standard iterative algorithm. The standard iterative algorithm for EVB learning has been derived in Chapter 3. The standard iterative algorithms for EPB learning and EMAP learning, which can be derived simply by setting the derivatives of the corresponding free energies with respect to the unknown parameters to zero, similarly apply the stationary conditions in turn to update unknown parameters. For initialization, the entries of the mean parameters, e.g., $\hat{A}$, $\hat{B}$, and $\hat{\mathcal{G}}$, were drawn from $\mathrm{Gauss}_1(0, 1^2)$, while the covariance parameters were set to the identity, e.g., $\hat{\breve{\Sigma}}_G = I_{\prod_{n=1}^{N} H^{(n)}}$, $\hat{\Sigma}_{A^{(n)}} = C_{G^{(n)}} = C_{A^{(n)}} = I_{H^{(n)}}$. We used this initialization scheme through all experiments in this section.
12.3.1 Fully Observed MF

We first conducted an experiment on an artificial (Artificial1) data set, which was generated as follows. We randomly generated true matrices $A^* \in \mathbb{R}^{M \times H^*}$ and $B^* \in \mathbb{R}^{L \times H^*}$ such that each entry of $A^*$ and $B^*$ follows $\mathrm{Gauss}_1(0, 1)$. An observed matrix $V \in \mathbb{R}^{L \times M}$ was created by adding a noise subject to $\mathrm{Gauss}_1(0, 1)$ to each entry of $B^* A^{*\top}$. Figures 12.8 through 12.10 show the free energy and the estimated rank over iterations in EVB learning, local-EPB learning, and local-EMAP learning, respectively, on the Artificial1 data set with the data matrix size L = 100 and M = 300, and the true rank H* = 20. The noise variance was assumed to be known, i.e., it was set to σ² = 1. We performed iterative local search 10 times, starting from different initial points, and each trial is plotted by a solid curve in the figures. The results computed by the analytic-form solutions for EVB learning (Theorem 6.13), local-EPB learning (Theorem 12.13), and local-EMAP learning (Theorem 12.12) were plotted as dashed lines. We can observe that iterative local search for EPB learning and EMAP learning tends to successfully find the nontrivial local solutions, although they are not global solutions.

Figure 12.8 EVB learning on Artificial1 (L = 100, M = 300, H* = 20). (a) Free energy F/(LM) and (b) estimated rank Ĥ, plotted against the iteration count for EVB(Analytic) and EVB(Iterative).

Figure 12.9 Local-EPB learning on Artificial1. (a) Free energy and (b) estimated rank, plotted against the iteration count for Local-EPB(Analytic) and Local-EPB(Iterative).

Figure 12.10 Local-EMAP learning on Artificial1. (a) Free energy and (b) estimated rank, plotted against the iteration count for Local-EMAP(Analytic) and Local-EMAP(Iterative).

Table 12.1 Estimated rank in fully observed MF experiments.

                                      EVB Ĥ                 local-EPB Ĥ                    local-EMAP Ĥ
Data set         M      L    H*   Analytic  Iterative    Analytic  Iterative            Analytic  Iterative
Artificial1      300    100  20   20        20 (100%)    20        20 (100%)            20        20 (100%)
Artificial2      500    400  5    5         5 (100%)     8         8 (90%), 9 (10%)     5         5 (100%)
Chart            600    60   –    2         2 (100%)     2         2 (100%)             2         2 (100%)
Glass            214    9    –    1         1 (100%)     1         1 (100%)             1         1 (100%)
Optical Digits   5,620  64   –    10        10 (100%)    10        10 (100%)            6         6 (100%)
Satellite        6,435  36   –    2         2 (100%)     2         2 (100%)             1         1 (100%)

We also conducted experiments on another artificial data set and benchmark data sets. The results are summarized in Table 12.1. Artificial2 was created in the same way as Artificial1, but with L = 400, M = 500, and H* = 5. The benchmark data sets were collected from the UCI repository (Asuncion and
Newman, 2007), on which we set the noise variance under the assumption that the signal-to-noise ratio is 0 dB, following Mørup and Hansen (2009). In the table, the estimated ranks by the analytic-form solution and by iterative local search are shown. The percentages for iterative local search indicate the frequencies over 10 trials. We observe the following: first, iterative local search tends to estimate the same rank as the analytic-form (local) solution; and second, the estimated rank tends to be consistent among EVB learning, local-EPB learning, and local-EMAP learning. Furthermore, on the artificial data sets, where the true rank is known, the rank is correctly estimated in most of the cases. Exceptions are Artificial2, where local-EPB learning overestimates the rank, and Optical Digits and Satellite, where local-EMAP learning estimates a smaller rank than the others. These phenomena can be explained by the theoretical implications in Section 12.1.3: in Artificial2, the ratio ξ = H*/min(L, M) = 5/400 between the true rank and the possible largest rank is small, which means that most of the singular components consist of noise. In such a case, local-EPB learning with its truncation threshold lower than the MPUL tends to retain components purely consisting of noise (see Figure 12.6). In Optical Digits and Satellite, α (= 64/5620 for Optical Digits and = 36/6435 for Satellite) is extremely small, and therefore local-EMAP learning with its higher truncation threshold tends to discard more components than the others, as Figure 12.6 implies.
12.3.2 Collaborative Filtering

Next we conducted experiments in the collaborative filtering (CF) scenario, where the observed matrix has missing entries to be predicted by the MF model. We generated an artificial (ArtificialCF) data set in the same way as the fully observed case for L = 2,000, M = 5,000, H* = 5, and then masked 99% of the entries as missing values. We applied EVB learning, local-EPB learning, and local-EMAP learning to the MF model with missing entries, i.e., Eqs. (12.123), (12.2), and (12.3).4 Figure 12.11 shows the estimated rank and the generalization error over iterations for 10 trials, where the generalization error is defined as $\mathrm{GE} = \| P_{\Lambda'}(V) - P_{\Lambda'}(\hat{B} \hat{A}^\top) \|_{\mathrm{Fro}}^2 / (\#(\Lambda')\, \sigma^2)$ for $\Lambda'$ being the set of test indices.

4 Here we solve the EVB learning problem, the EPB learning problem, and the EMAP learning problem, respectively, by the standard iterative algorithms. However, we refer to the last two methods as local-EPB learning and local-EMAP learning, since we expect the local search algorithm to find not the global null solution but the nontrivial local solution.
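The generalization error used in this experiment is a masked, noise-normalized squared error, which can be sketched in a few lines. The helper names `project` and `generalization_error` and the Boolean-mask representation of the test index set are assumptions made for illustration.

```python
import numpy as np

def project(V, mask):
    """P_Lambda(V): keep entries where mask is True and set the others to zero."""
    return np.where(mask, V, 0.0)

def generalization_error(V, U_hat, test_mask, sigma2):
    """Squared Frobenius error on the test entries, divided by
    (number of test entries) * sigma2."""
    resid = project(V, test_mask) - project(U_hat, test_mask)
    return np.sum(resid**2) / (test_mask.sum() * sigma2)

V = np.array([[1.0, 2.0], [3.0, 4.0]])
U_hat = np.array([[1.0, 0.0], [0.0, 2.0]])
test_mask = np.array([[True, False], [False, True]])
print(generalization_error(V, U_hat, test_mask, sigma2=1.0))  # (0 + 4) / 2 = 2.0
```

Under this normalization, a predictor that recovers the noiseless matrix exactly would score about one on average, since the remaining error is the noise itself.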
Figure 12.11 CF result on ArtificialCF (L = 2,000, M = 5,000, H* = 5 with 99% missing ratio). (a) Estimated rank and (b) generalization error, plotted against the iteration count for EVB, Local-EPB, and Local-EMAP.

Figure 12.12 CF result on MovieLens (L = 943, M = 1,682 with 99% missing ratio). (a) Estimated rank and (b) generalization error, plotted against the iteration count for EVB, Local-EPB, and Local-EMAP.
We also conducted an experiment on the MovieLens data sets (with L = 943, M = 1,682).5 We randomly divided the observed entries into training entries and test entries, so that 99% of the entries are missing in the training phase. The test entries are used to evaluate the generalization error. Figure 12.12 shows the result in the same format as Figure 12.11. We see that, on both data sets, local-EMAP learning tends to estimate a similar rank to EVB learning, while local-EPB learning tends to estimate a larger rank, a tendency similar to the fully observed case. In terms of the generalization error, local-EPB learning performs comparably to EVB learning, while local-EMAP learning performs slightly worse.

5 www.grouplens.org/

12.3.3 Tensor Factorization

Finally, we conducted experiments on TF. We created an artificial (ArtificialTF) data set, following Mørup and Hansen (2009): we drew a three-mode random tensor of the size $(M^{(1)}, M^{(2)}, M^{(3)}) = (30, 40, 50)$ with the signal components $(H^{(1)*}, H^{(2)*}, H^{(3)*}) = (3, 4, 5)$. The noise is added so that the signal-to-noise ratio is 0 dB. We also used the Flow Injection Analysis (FIA) data set.6 Table 12.2 shows the estimated rank with frequencies over 10 trials. Here we also show the results by ARD Tucker with the ridge prior (Mørup and Hansen, 2009), performed with the code provided by the authors. Local-EMAP learning and ARD Tucker minimize exactly the same objective, and the slightly different results come from the differences in the local search algorithm (standard iterative vs. gradient descent) and in the initialization scheme. We generally observe that all learning methods provide reasonable results, although local-EMAP learning, as well as ARD Tucker, is less stable than the others.

6 www.models.kvl.dk/datasets

Table 12.2 Estimated rank (effective size of core tensor) in TF experiments.

Data set       M               H*          EVB Ĥ             local-EPB Ĥ       local-EMAP Ĥ       ARD-Tucker Ĥ
ArtificialTF   (30, 40, 50)    (3, 4, 5)   (3, 4, 5): 100%   (3, 4, 5): 100%   (3, 4, 5): 90%     (3, 4, 5): 100%
                                                                               (3, 7, 5): 10%
FIA            (12, 100, 89)   (3, 6, 4)   (3, 5, 3): 100%   (3, 5, 3): 100%   (3, 5, 2): 50%     (3, 4, 2): 70%
                                                                               (4, 5, 2): 20%     (3, 5, 2): 10%
                                                                               (5, 4, 2): 10%     (3, 7, 2): 10%
                                                                               (4, 4, 2): 10%     (10, 4, 3): 10%
                                                                               (8, 5, 2): 10%
Part IV Asymptotic Theory
13 Asymptotic Learning Theory
Part IV is dedicated to asymptotic theory of variational Bayesian (VB) learning. In this part, “asymptotic limit” always means the limit when the number N of training samples goes to infinity. The main goal of asymptotic learning theory is to clarify the behavior of some statistics, e.g., the generalization error, the training error, and the Bayes free energy, which indicate how fast a learning machine can be trained as a function of the number of training samples and how the trained machine is biased to the training samples by overfitting. This provides the mathematical foundation of information criteria for model selection—a task to choose the degree of freedom of a statistical model based on observed training data. We can also evaluate the approximation accuracy of VB learning to full Bayesian learning in terms of the free energy, i.e., the gap between the VB free energy and the Bayes free energy, which corresponds to the tightness of the evidence lower-bound (ELBO) (see Section 2.1.1). In this first chapter of Part IV, we give an overview of asymptotic learning theory as the background for the subsequent chapters.
13.1 Statistical Learning Machines

A statistical learning machine consists of two fundamental components, a statistical model and a learning algorithm (Figure 13.1). The statistical model is described by a probability distribution depending on some unknown parameters, and the learning algorithm estimates the unknown parameters from observed training samples. Before introducing asymptotic learning theory, we categorize statistical learning machines based on the model and the learning algorithm.
[Figure 13.1: A statistical learning machine consists of a statistical model and a learning algorithm. The diagram lists linear models, neural networks, and mixture models as examples of statistical models, and ML estimation, MAP learning, and Bayesian learning as examples of learning algorithms.]
[Figure 13.2: Statistical models are classified into regular models and singular models. The diagram lists linear models as regular, and neural networks, mixture models, Bayesian networks, and hidden Markov models as singular.]
13.1.1 Statistical Models: Regular and Singular
We classify the statistical models into two classes, the regular models and the singular models (Figure 13.2). The regular models are identifiable (Definition 7.4 in Section 7.3.1), i.e.,
$$ p(x|w_1) = p(x|w_2) \iff w_1 = w_2 \quad \text{for any } w_1, w_2 \in W, \tag{13.1} $$
and do not have singularities in the parameter space, i.e., the Fisher information
$$ \mathbb{S}_+^D \ni F(w) = \int \frac{\partial \log p(x|w)}{\partial w} \frac{\partial \log p(x|w)}{\partial w}^\top p(x|w)\, dx \tag{13.2} $$
is nonsingular (or full-rank) for any $w \in W$. With a few additional assumptions, the regular models were analyzed under the regularity conditions (Section 13.4.1), which lead to the asymptotic normality of the distribution of the maximum likelihood (ML) estimator, and the asymptotic normality of the Bayes posterior distribution (Cramér,
1949; Sakamoto et al., 1986; van der Vaart, 1998). Based on those asymptotic normalities, a unified theory was established, clarifying the asymptotic behavior of generalization properties, which are common over all regular models, and over all reasonable learning algorithms, including ML learning, maximum a posteriori (MAP) learning, and Bayesian learning, as will be seen in Section 13.4. On the other hand, analyzing singular models requires specific techniques for different models and different learning algorithms, and it was revealed that the asymptotic behavior of generalization properties depends on the model and the algorithm (Hartigan, 1985; Bickel and Chernoff, 1993; Takemura and Kuriki, 1997; Kuriki and Takemura, 2001; Amari et al., 2002; Hagiwara, 2002; Fukumizu, 2003; Watanabe, 2009). This is because the true parameter is at a singular point when the model size is larger than necessary to express the true distribution, and, in such cases, singularities affect the distribution of the ML estimator, as well as the Bayes posterior distribution even in the asymptotic limit. Consequently, the asymptotic normality, on which the regular learning theory relies, does not hold in singular models.
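The distinction between a full-rank and a rank-deficient Fisher information can be checked numerically. The following is a minimal sketch under stated assumptions: the two models, the closed-form Fisher matrices, and the function names are illustrative choices that do not appear in the text. The regular case is a Gaussian location-scale model in the parameters $(\mu, \sigma)$, and the singular case is a hypothetical model $p(x|a,b) = \mathrm{Gauss}_1(x; ab, 1)$, whose Fisher information is the outer product of the vector $(b, a)$ with itself.

```python
import numpy as np

def fisher_location_scale(mu, sigma):
    """Fisher information of Gauss(x; mu, sigma^2) in the parameters (mu, sigma).

    mu is accepted for symmetry; the matrix does not depend on it.
    """
    return np.diag([1.0 / sigma**2, 2.0 / sigma**2])

def fisher_product_model(a, b):
    """Fisher information of the hypothetical model Gauss(x; a*b, 1).

    The score is (x - a*b) * (b, a) and the noise variance is 1, so the
    Fisher information is the outer product of (b, a) with itself.
    """
    mean_gradient = np.array([b, a])
    return np.outer(mean_gradient, mean_gradient)

# Regular model: full rank at an arbitrary parameter value.
rank_regular = np.linalg.matrix_rank(fisher_location_scale(0.0, 1.5))
# Product model: rank at most 1 everywhere, and rank 0 at the origin.
rank_singular = np.linalg.matrix_rank(fisher_product_model(0.7, -1.2))
rank_origin = np.linalg.matrix_rank(fisher_product_model(0.0, 0.0))

print(rank_regular, rank_singular, rank_origin)
```

In this sketch the product model is not identifiable either, since every pair $(a, b)$ with the same product gives the same distribution, which is consistent with the rank deficiency of its Fisher information.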
13.1.2 Learning Algorithms—Point Estimation and Bayesian Learning When analyzing singular models, we also classify learning algorithms into two classes, point estimation and Bayesian learning (Figure 13.3). The point estimation methods, including ML learning and MAP learning, choose a single model (i.e., a single point in the parameter space) that maximizes a certain criterion such as the likelihood or the posterior probability, while Bayesian learning methods use an ensemble of models over the posterior distribution or its approximation.
[Figure 13.3: Learning algorithms are classified into point estimation and Bayesian learning. The diagram lists ML learning and MAP learning under point estimation, and Bayesian learning, VB learning, PB learning, and EP under Bayesian learning.]
Unlike in the regular models, point estimation and Bayesian learning show different learning behavior in singular models. This is because how singularities affect the learning property depends on the learning methods. For example, as discussed in Chapter 7, strong nonuniformity of the density of the volume element leads to model-induced regularization (MIR) in Bayesian learning, while it does not affect point-estimation methods.
13.2 Basic Tools for Asymptotic Analysis Here we introduce basic tools for asymptotic analysis.
13.2.1 Central Limit Theorem
Asymptotic learning theory heavily relies on the central limit theorem.
Theorem 13.1 (Central limit theorem) (van der Vaart, 1998) Let $\{x^{(1)}, \ldots, x^{(N)}\}$ be N i.i.d. samples from an arbitrary distribution with finite mean $\mu \in \mathbb{R}^D$ and finite covariance $\Sigma \in \mathbb{S}_{++}^D$, and let $\bar{x} = N^{-1} \sum_{n=1}^N x^{(n)}$ be their average. Then, the distribution of $z = \sqrt{N}(\bar{x} - \mu)$ converges to the Gaussian distribution with mean zero and covariance $\Sigma$ (footnote 1), i.e.,
$$ p(z) \to \mathrm{Gauss}_D(z; 0, \Sigma) \quad \text{as } N \to \infty. \tag{13.3} $$
Intuitively, Eq. (13.3) can be interpreted as
$$ p(\bar{x}) \to \mathrm{Gauss}_D\left(\bar{x}; \mu, N^{-1}\Sigma\right) \quad \text{as } N \to \infty, \tag{13.4} $$
implying that the distribution of the average $\bar{x}$ of i.i.d. random variables converges to the Gaussian distribution with mean $\mu$ and covariance $N^{-1}\Sigma$. The central limit theorem implies the (weak) law of large numbers (footnote 2), i.e., for any $\varepsilon > 0$,
$$ \lim_{N \to \infty} \mathrm{Prob}\left(\|\bar{x} - \mu\| > \varepsilon\right) = 0. \tag{13.5} $$
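Theorem 13.1 can be illustrated by simulation. The sketch below assumes a non-Gaussian source chosen for illustration only (two independent exponential coordinates with means 1 and 2, so that $\mu = (1, 2)$ and $\Sigma = \mathrm{diag}(1, 4)$); the sample sizes and the seed are arbitrary. It draws many data sets, forms $z = \sqrt{N}(\bar{x} - \mu)$ for each, and compares the empirical mean and covariance of $z$ with $0$ and $\Sigma$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 500, 4000

# Assumed source distribution: independent exponential coordinates.
# For an exponential variable with mean m, the variance is m^2.
mu = np.array([1.0, 2.0])
sigma = np.diag(mu**2)

x = rng.exponential(scale=mu, size=(trials, N, 2))
z = np.sqrt(N) * (x.mean(axis=1) - mu)   # one rescaled average per trial

emp_mean = z.mean(axis=0)                # should be close to zero
emp_cov = np.cov(z, rowvar=False)        # should be close to sigma

print(emp_mean)
print(emp_cov)
```

The agreement is only approximate for finite N and a finite number of trials, so any comparison should use a tolerance rather than exact equality.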
Footnote 1. We consider weak topology in the space of distributions, i.e., $p(x)$ is identified with $r(x)$ if $\langle f(x) \rangle_{p(x)} = \langle f(x) \rangle_{r(x)}$ for any bounded continuous function $f(x)$. Convergence (of a random variable $x$) in this sense is called convergence in distribution, weak convergence, or convergence in law, and denoted as $p(x) \to r(x)$ or $x \rightsquigarrow r(x)$ (van der Vaart, 1998).
Footnote 2. Convergence $x \to \mu$ in the sense that $\lim_{N \to \infty} \mathrm{Prob}(\|x - \mu\| > \varepsilon) = 0$, $\forall \varepsilon > 0$, is called convergence in probability.
13.2.2 Asymptotic Notation
We use the following asymptotic notation, a.k.a. Bachmann–Landau notation, to express the order of functions when the number N of samples goes to infinity:
$O(f(N))$: A function such that $\limsup_{N \to \infty} |O(f(N))/f(N)| < \infty$,
$o(f(N))$: A function such that $\lim_{N \to \infty} o(f(N))/f(N) = 0$,
$\Omega(f(N))$: A function such that $\liminf_{N \to \infty} |\Omega(f(N))/f(N)| > 0$,
$\omega(f(N))$: A function such that $\lim_{N \to \infty} |\omega(f(N))/f(N)| = \infty$,
$\Theta(f(N))$: A function such that $\limsup_{N \to \infty} |\Theta(f(N))/f(N)| < \infty$ and $\liminf_{N \to \infty} |\Theta(f(N))/f(N)| > 0$.
Intuitively, as a function of N, $O(f(N))$ is a function of no greater order than $f(N)$, $o(f(N))$ is a function of lower order than $f(N)$, $\Omega(f(N))$ is a function of no lower order than $f(N)$, $\omega(f(N))$ is a function of greater order than $f(N)$, and $\Theta(f(N))$ is a function of the same order as $f(N)$. One thing we need to be careful of is that the upper-bounding notations, $O$ and $o$, are preserved under addition and subtraction, while the lower-bounding notations, $\Omega$ and $\omega$, as well as the two-sided notation $\Theta$, are not necessarily preserved. For example, if $g_1(N) = \Theta(f(N))$ and $g_2(N) = \Theta(f(N))$, then $g_1(N) + g_2(N) = O(f(N))$, while it can happen that $g_1(N) + g_2(N) \neq \Theta(f(N))$, since the leading terms of $g_1(N)$ and $g_2(N)$ can coincide with each other with opposite signs and cancel out. For random variables, we use their probabilistic versions, $O_p$, $o_p$, $\Omega_p$, $\omega_p$, and $\Theta_p$, for which the corresponding conditions hold in probability. For example, for i.i.d. samples $\{x^{(n)}\}_{n=1}^N$ from $\mathrm{Gauss}_1(x; 0, 1^2)$, we can say that
$$ x^{(n)} = \Theta_p(1), \qquad \bar{x} = \frac{1}{N} \sum_{n=1}^N x^{(n)} = \Theta_p(N^{-1/2}), \qquad \overline{x^2} = \frac{1}{N} \sum_{n=1}^N \left(x^{(n)}\right)^2 = 1 + \Theta_p(N^{-1/2}). $$
Note that the second and the third equations are consequences of the central limit theorem (Theorem 13.1) applied to the samples $\{x^{(n)}\}$, which follow the Gaussian distribution, and to the samples $\{(x^{(n)})^2\}$, which follow the chi-squared distribution, respectively. In this book, we express asymptotic approximation mostly by using asymptotic notation. To this end, we sometimes need to translate convergence of a random variable into an equation with asymptotic notation. Let $x$ be a random variable depending on N, $r(x)$ be a distribution with finite mean and
covariance, and $f(x)$ be an arbitrary bounded continuous function. Then the following hold:
• If $p(x) \to r(x)$, i.e., the distribution of $x$ converges to $r(x)$, then
$$ x = O_p(1) \quad \text{and} \quad \langle f(x) \rangle_{p(x)} = \langle f(x) \rangle_{r(x)} \left(1 + o(1)\right). $$
• If $\lim_{N \to \infty} \mathrm{Prob}(\|x - y\| > \varepsilon) = 0$ for any $\varepsilon > 0$, then $x = y + o_p(1)$.
For example, the central limit theorem (13.3) implies that
$$ \bar{x} = \mu + O_p(N^{-1/2}), \qquad \left\langle (\bar{x} - \mu)(\bar{x} - \mu)^\top \right\rangle_{p(\bar{x})} = N^{-1}\Sigma + o(N^{-1}), $$
while the law of large numbers (13.5) implies that $\bar{x} = \mu + o_p(1)$.
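The orders $\Theta_p(N^{-1/2})$ stated for the sample mean and the sample second moment of standard Gaussian samples can be checked empirically. The following sketch uses arbitrary sample sizes, trial count, and seed, all of which are assumptions for illustration. It multiplies the fluctuation of each statistic by $\sqrt{N}$ and shows that the rescaled spread stays roughly constant as N grows (near 1 for the mean, and near $\sqrt{2}$ for the second moment, since a squared standard Gaussian variable has variance 2).

```python
import numpy as np

rng = np.random.default_rng(1)
trials = 500
rescaled = {}

for N in (100, 1000, 4000):
    x = rng.standard_normal((trials, N))
    mean_x = x.mean(axis=1)             # sample mean, one value per trial
    mean_x2 = (x**2).mean(axis=1)       # sample second moment per trial
    rescaled[N] = (
        np.sqrt(N) * mean_x.std(),          # spread of sqrt(N) * mean
        np.sqrt(N) * (mean_x2 - 1.0).std()  # spread of sqrt(N) * (second moment - 1)
    )

for N, (spread_mean, spread_sq) in rescaled.items():
    print(N, round(spread_mean, 3), round(spread_sq, 3))
```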
13.3 Target Quantities Here we introduce target quantities to be analyzed in asymptotic learning theory.
13.3.1 Generalization Error and Training Error
Consider a statistical model $p(x|w)$, where $x \in \mathbb{R}^M$ is an observed random variable and $w \in \mathbb{R}^D$ is a parameter to be estimated. Let $X = (x^{(1)}, \ldots, x^{(N)})^\top \in \mathbb{R}^{N \times M}$ be N i.i.d. training samples taken from the true distribution $q(x)$. We assume realizability, i.e., that the true distribution can be exactly expressed by the statistical model: $\exists w^*$ s.t. $q(x) = p(x|w^*)$, where $w^*$ is called the true parameter. Learning algorithms estimate the parameter value $w$ or its posterior distribution given the training data $D = X$, and provide the predictive distribution $p(x|X)$ for a new sample $x$. For example, ML learning provides the predictive distribution given by
$$ p^{\mathrm{ML}}(x|X) = p(x|\hat{w}^{\mathrm{ML}}), \tag{13.6} $$
where
$$ \hat{w}^{\mathrm{ML}} = \operatorname*{argmax}_{w} p(X|w) = \operatorname*{argmax}_{w} \left( \prod_{n=1}^N p(x^{(n)}|w) \right) \tag{13.7} $$
is the ML estimator, while Bayesian learning provides the predictive distribution given by
$$ p^{\mathrm{Bayes}}(x|X) = \langle p(x|w) \rangle_{p(w|X)} = \int p(x|w) p(w|X)\, dw, \tag{13.8} $$
where
$$ p(w|X) = \frac{p(X|w)p(w)}{p(X)} = \frac{p(X|w)p(w)}{\int p(X|w)p(w)\, dw} \tag{13.9} $$
is the posterior distribution (see Section 1.1). The generalization error, a criterion of generalization performance, is defined as the Kullback–Leibler (KL) divergence of the predictive distribution from the true distribution:
$$ \mathrm{GE}(X) = \int q(x) \log \frac{q(x)}{p(x|X)}\, dx. \tag{13.10} $$
Its empirical variant,
$$ \mathrm{TE}(X) = \frac{1}{N} \sum_{n=1}^N \log \frac{q(x^{(n)})}{p(x^{(n)}|X)}, \tag{13.11} $$
is called the training error, which is often used as an estimator of the generalization error. Note that, for the ML predictive distribution (13.6),
$$ -N \cdot \mathrm{TE}^{\mathrm{ML}}(X) = \sum_{n=1}^N \log \frac{p(x^{(n)}|\hat{w}^{\mathrm{ML}})}{q(x^{(n)})} \tag{13.12} $$
corresponds to the log-likelihood ratio, an important statistic for statistical tests, when the null hypothesis is true. The generalization error (13.10) and the training error (13.11) are random variables that depend on the realization of the training data $X$. Taking the average over the distribution of training samples, we define deterministic quantities,
$$ \overline{\mathrm{GE}}(N) = \langle \mathrm{GE}(X) \rangle_{q(X)}, \tag{13.13} $$
$$ \overline{\mathrm{TE}}(N) = \langle \mathrm{TE}(X) \rangle_{q(X)}, \tag{13.14} $$
which are called the average generalization error and the average training error, respectively. Here $\langle \cdot \rangle_{q(X)}$ denotes the expectation value over the distribution of N training samples. The average generalization error and the average training error are scalar functions of the number N of samples, and represent the generalization performance of a learning machine consisting of a statistical model and a learning algorithm. The optimality of Bayesian learning is proven in terms of the average generalization error (see Appendix D).
If a learning algorithm can successfully estimate the true parameter $w^*$ with reasonably small error, the average generalization error and the average training error converge to zero at the rate $\Theta(N^{-1})$ in the asymptotic limit (footnote 3). One of the main goals of asymptotic learning theory is to identify or bound the coefficients of their leading terms, i.e., $\lambda$ and $\nu$ in the following asymptotic expansions:
$$ \overline{\mathrm{GE}}(N) = \lambda N^{-1} + o(N^{-1}), \tag{13.15} $$
$$ \overline{\mathrm{TE}}(N) = \nu N^{-1} + o(N^{-1}). \tag{13.16} $$
We call $\lambda$ and $\nu$ the generalization coefficient and the training coefficient, respectively.
13.3.2 Bayes Free Energy
The marginal likelihood (defined by Eq. (1.6) in Chapter 1),
$$ p(X) = \int p(X|w) p(w)\, dw = \int p(w) \prod_{n=1}^N p(x^{(n)}|w)\, dw, \tag{13.17} $$
is also an important quantity in Bayesian learning. As explained in Section 1.1.3, the marginal likelihood can be regarded as the likelihood of an ensemble of models, namely the set of model distributions with the parameters subject to the prior distribution. Following the concept of the “likelihood” in statistics, we can say that the ensemble of models giving the highest marginal likelihood is most likely. Therefore, we can perform model selection by maximizing the marginal likelihood (Efron and Morris, 1973; Schwarz, 1978; Akaike, 1980; MacKay, 1992; Watanabe, 2009). Maximizing the marginal likelihood (13.17) amounts to minimizing the Bayes free energy, defined by Eq. (1.60):
$$ F^{\mathrm{Bayes}}(X) = -\log p(X). \tag{13.18} $$
Footnote 3. This holds if the estimator achieves a mean squared error in the same order as the Cramér–Rao lower-bound, i.e., $\left\langle \|\hat{w} - w^*\|^2 \right\rangle_{q(X)} \geq N^{-1} \mathrm{tr}\left(F^{-1}(w^*)\right)$, where $F$ is the Fisher information (13.2) at $w^*$. The Cramér–Rao lower-bound holds for any unbiased estimator under the regularity conditions.
The Bayes free energy is a random variable depending on the training samples $X$, and is of the order of $\Theta_p(N)$. However, the dominating part comes from the entropy of the true distribution, and depends on neither the statistical model nor the learning algorithm. In statistical learning theory, we therefore analyze the behavior of the relative Bayes free energy,
$$ \widetilde{F}^{\mathrm{Bayes}}(X) = \log \frac{q(X)}{p(X)} = F^{\mathrm{Bayes}}(X) - N S_N(X), \tag{13.19} $$
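where
$$ S_N(X) = -\frac{1}{N} \sum_{n=1}^N \log q(x^{(n)}) \tag{13.20} $$
is the empirical entropy. The negative of the relative Bayes free energy,
$$ -\widetilde{F}^{\mathrm{Bayes}}(X) = \log \frac{\int p(w) \prod_{n=1}^N p(x^{(n)}|w)\, dw}{\prod_{n=1}^N q(x^{(n)})}, \tag{13.21} $$
can be seen as an ensemble version of the log-likelihood ratio, i.e., the logarithm of the ratio between the marginal likelihood (alternative hypothesis) and the true likelihood (null hypothesis). When the prior $p(w)$ is positive around the true parameter $w^*$, the relative Bayes free energy (13.19) is known to be of the order of $\Theta(\log N)$ and can be asymptotically expanded as follows:
$$ \widetilde{F}^{\mathrm{Bayes}}(X) = \lambda'^{\,\mathrm{Bayes}} \log N + o_p(\log N), \tag{13.22} $$
where the coefficient of the leading term $\lambda'^{\,\mathrm{Bayes}}$ is called the Bayes free energy coefficient. Note that, although the relative Bayes free energy is a random variable depending on the realization of the training data $X$, the leading term in Eq. (13.22) is deterministic. Let us define the average relative Bayes free energy over the distribution of training samples:
$$ \overline{\widetilde{F}}^{\mathrm{Bayes}}(N) = \left\langle \widetilde{F}^{\mathrm{Bayes}}(X) \right\rangle_{q(X)} = \left\langle \log \frac{\prod_{n=1}^N q(x^{(n)})}{\int p(w) \prod_{n=1}^N p(x^{(n)}|w)\, dw} \right\rangle_{q(X)}. \tag{13.23} $$
An interesting and useful relation can be found between the average Bayes generalization error and the average relative Bayes free energy (Levin et al., 1990):
$$ \begin{aligned} \overline{\mathrm{GE}}^{\mathrm{Bayes}}(N) &= \left\langle \int q(x) \log \frac{q(x)}{p^{\mathrm{Bayes}}(x|X)}\, dx \right\rangle_{q(X)} \\ &= \left\langle \int q(x) \log \frac{q(x)}{\int p(x|w) p(w|X)\, dw}\, dx \right\rangle_{q(X)} \\ &= \left\langle \int q(x) \log \frac{q(x)}{\int p(x|w) p(X|w) p(w)\, dw}\, dx \right\rangle_{q(X)} - \left\langle \int q(x) \log \frac{1}{\int p(X|w) p(w)\, dw}\, dx \right\rangle_{q(X)} \\ &= \left\langle \log \frac{q(x) q(X)}{\int p(x|w) p(X|w) p(w)\, dw} \right\rangle_{q(x)q(X)} - \left\langle \log \frac{q(X)}{\int p(X|w) p(w)\, dw} \right\rangle_{q(X)} \\ &= \overline{\widetilde{F}}^{\mathrm{Bayes}}(N+1) - \overline{\widetilde{F}}^{\mathrm{Bayes}}(N). \end{aligned} \tag{13.24} $$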
The relation (13.24) combined with the asymptotic expansions, Eqs. (13.15) and (13.22), implies that the Bayes generalization coefficient and the Bayes free energy coefficient coincide with each other, i.e.,
$$ \lambda'^{\,\mathrm{Bayes}} = \lambda^{\mathrm{Bayes}}. \tag{13.25} $$
Importantly, this relation holds for any statistical model, regardless of being regular or singular.
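The relation (13.24) can be verified exactly on a toy model in which both sides have closed forms. The sketch below rests on assumptions that are not part of the text: a one-dimensional model $p(x|w) = \mathrm{Gauss}_1(x; w, 1)$, the prior $p(w) = \mathrm{Gauss}_1(w; 0, 1)$, and the true parameter $w^* = 0$. For this choice, direct Gaussian integration (done by hand for this example) gives an average relative Bayes free energy of $\frac{1}{2}\log(N+1) - \frac{N}{2(N+1)}$ and an average Bayes generalization error of $\frac{1}{2}\left[\log\frac{N+2}{N+1} - \frac{1}{(N+1)(N+2)}\right]$. The function names are hypothetical.

```python
import math

def avg_relative_free_energy(N):
    """Average relative Bayes free energy of the assumed toy model."""
    return 0.5 * math.log1p(N) - N / (2.0 * (N + 1))

def avg_bayes_generalization_error(N):
    """Average KL divergence from the true density to the Bayes predictive."""
    return 0.5 * (math.log1p(1.0 / (N + 1)) - 1.0 / ((N + 1) * (N + 2)))

gaps = []
for N in (10, 100, 10_000):
    lhs = avg_bayes_generalization_error(N)
    rhs = avg_relative_free_energy(N + 1) - avg_relative_free_energy(N)
    gaps.append(abs(lhs - rhs))          # Eq. (13.24): should vanish

# N * GE approaches the generalization coefficient, which is 1/2 here (D = 1).
scaled_ge = 10_000 * avg_bayes_generalization_error(10_000)
print(gaps, scaled_ge)
```

In this example the leading term of the free energy is $\frac{1}{2}\log N$ and the leading term of the generalization error is $\frac{1}{2}N^{-1}$, so both coefficients equal $1/2$, in agreement with Eq. (13.25).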
13.3.3 Target Quantities under Conditional Modeling
Many statistical models are for the regression or classification setting, where the model distribution $p(y|x, w)$ is the distribution of an output $y \in \mathbb{R}^L$ conditional on an input $x \in \mathbb{R}^M$ and an unknown parameter $w \in \mathbb{R}^D$. The input is assumed to be given for all samples including the future test samples. Let $D = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})\}$ be N i.i.d. training samples drawn from the true joint distribution $q(x, y) = q(y|x)q(x)$. As noted in Example 1.2 in Chapter 1, we can proceed with most computations without knowing the input distribution $q(x)$. Let $X = (x^{(1)}, \ldots, x^{(N)})^\top \in \mathbb{R}^{N \times M}$ and $Y = (y^{(1)}, \ldots, y^{(N)})^\top \in \mathbb{R}^{N \times L}$ separately summarize the inputs and the outputs in the training data. The predictive distribution, given as a conditional distribution on a new input $x$ as well as the whole training samples $D = (X, Y)$, can usually be computed without any information on $q(X)$. For example, the ML predictive distribution is given as
$$ p^{\mathrm{ML}}(y|x, D) = p(y|x, \hat{w}^{\mathrm{ML}}), \tag{13.26} $$
where
$$ \hat{w}^{\mathrm{ML}} = \operatorname*{argmax}_{w} p(Y|X, w) \cdot q(X) = \operatorname*{argmax}_{w} \left( \prod_{n=1}^N p(y^{(n)}|x^{(n)}, w) \right) \tag{13.27} $$
is the ML estimator, while the Bayes predictive distribution is given as
$$ p^{\mathrm{Bayes}}(y|x, D) = \langle p(y|x, w) \rangle_{p(w|X,Y)} = \int p(y|x, w) p(w|X, Y)\, dw, \tag{13.28} $$
where
$$ p(w|X, Y) = \frac{p(Y|X, w) p(w) \cdot q(X)}{\int p(Y|X, w) p(w)\, dw \cdot q(X)} = \frac{p(Y|X, w) p(w)}{\int p(Y|X, w) p(w)\, dw} \tag{13.29} $$
is the Bayes posterior distribution. Here $(x, y)$ is a new input–output sample pair, assumed to be drawn from the true distribution $q(y|x)q(x)$.
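The generalization error (13.10), the training error (13.11), and the relative Bayes free energy (13.19) can be expressed as follows:
$$ \mathrm{GE}(D) = \left\langle \log \frac{q(y|x) q(x)}{p(y|x, D) q(x)} \right\rangle_{q(y|x)q(x)} = \left\langle \log \frac{q(y|x)}{p(y|x, D)} \right\rangle_{q(y|x)q(x)}, \tag{13.30} $$
$$ \mathrm{TE}(D) = \frac{1}{N} \sum_{n=1}^N \log \frac{q(y^{(n)}|x^{(n)}) q(x^{(n)})}{p(y^{(n)}|x^{(n)}, D) q(x^{(n)})} = \frac{1}{N} \sum_{n=1}^N \log \frac{q(y^{(n)}|x^{(n)})}{p(y^{(n)}|x^{(n)}, D)}, \tag{13.31} $$
$$ \widetilde{F}^{\mathrm{Bayes}}(D) = \log \frac{q(Y|X) q(X)}{p(Y|X) q(X)} = F^{\mathrm{Bayes}}(Y|X) - N S_N(Y|X), \tag{13.32} $$
where
$$ F^{\mathrm{Bayes}}(Y|X) = -\log \int p(w) \prod_{n=1}^N p(y^{(n)}|x^{(n)}, w)\, dw, \tag{13.33} $$
$$ S_N(Y|X) = -\frac{1}{N} \sum_{n=1}^N \log q(y^{(n)}|x^{(n)}). \tag{13.34} $$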
We can see that the input distribution q(x) cancels out in most of the preceding equations, and therefore Eqs. (13.30) through (13.34) can be computed without considering q(x). Note that in Eq. (13.30), q(x) remains the distribution over which the expectation is taken. However, it is necessary only formally, and the expectation value does not depend on q(x) (as long as the regularity conditions hold). The same applies to the average generalization error (13.13), the average training error (13.14), and the average relative Bayes free energy (13.23), where the expectation · q(Y|X)q(X) over the distribution of the training samples is taken.
13.4 Asymptotic Learning Theory for Regular Models In this section, we introduce the regular learning theory, which generally holds under the regularity conditions.
13.4.1 Regularity Conditions The regularity conditions are defined for the statistical model p(x|w) parameterized by a finite-dimensional parameter vector w ∈ W ⊆ RD , and the true distribution q(x). We include conditions for the prior distribution p(w), which are necessary for analyzing MAP learning and Bayesian learning. There are variations, and we here introduce a (rough) simple set.
(i) The statistical model $p(x|w)$ is differentiable (as many times as necessary) with respect to the parameter $w \in W$ for any $x$, and the differential operator and the integral operator are commutable.
(ii) The statistical model $p(x|w)$ is identifiable, i.e., Eq. (13.1) holds, and the Fisher information (13.2) is nonsingular (full-rank) at any $w \in W$.
(iii) The support of $p(x|w)$, i.e., $\{x \in \mathcal{X}; p(x|w) > 0\}$, is common for all $w \in W$.
(iv) The true distribution is realizable by the statistical model, i.e., $\exists w^*$ s.t. $q(x) = p(x|w^*)$, and the true parameter $w^*$ is an interior point of the domain $W$.
(v) The prior $p(w)$ is twice differentiable and bounded as $0 < p(w) < \infty$ at any $w \in W$.
Note that the first three conditions are on the model distribution $p(x|w)$, the fourth is on the true distribution $q(x)$, and the fifth is on the prior distribution $p(w)$. An important consequence of the regularity conditions is that the log-likelihood can be Taylor-expanded about any $\bar{w} \in W$:
$$ \log p(x|w) = \log p(x|\bar{w}) + (w - \bar{w})^\top \left. \frac{\partial \log p(x|w')}{\partial w'} \right|_{w'=\bar{w}} + \frac{1}{2} (w - \bar{w})^\top \left. \frac{\partial^2 \log p(x|w')}{\partial w' \partial w'^\top} \right|_{w'=\bar{w}} (w - \bar{w}) + O(\|w - \bar{w}\|^3). \tag{13.35} $$
13.4.2 Consistency and Asymptotic Normality
We first show consistency and asymptotic normality, which hold in ML learning, MAP learning, and Bayesian learning.
Consistency of ML Estimator
The ML estimator is defined by
$$ \hat{w}^{\mathrm{ML}} = \operatorname*{argmax}_{w} \log \left( \prod_{n=1}^N p(x^{(n)}|w) \right) = \operatorname*{argmax}_{w} L_N(w), \tag{13.36} $$
where
$$ L_N(w) = \frac{1}{N} \sum_{n=1}^N \log p(x^{(n)}|w). \tag{13.37} $$
By the law of large numbers (13.5), it holds that
$$ L_N(w) = L^*(w) + o_p(1), \tag{13.38} $$
where
$$ L^*(w) = \langle \log p(x|w) \rangle_{p(x|w^*)}. \tag{13.39} $$
Identifiability of the statistical model guarantees that
$$ w^* = \operatorname*{argmax}_{w} L^*(w) \tag{13.40} $$
is the unique maximizer. Eqs. (13.36), (13.38), and (13.40) imply the consistency of the ML estimator, i.e.,
$$ \hat{w}^{\mathrm{ML}} = w^* + o_p(1). \tag{13.41} $$
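Asymptotic Normality of the ML Estimator
Since the gradient $\partial L_N(w)/\partial w$ is differentiable, the mean value theorem (footnote 4) guarantees that there exists $\acute{w} \in [\min(\hat{w}^{\mathrm{ML}}, w^*), \max(\hat{w}^{\mathrm{ML}}, w^*)]^D$ (where $\min(\cdot)$ and $\max(\cdot)$ operate elementwise) such that
$$ \left. \frac{\partial L_N(w)}{\partial w} \right|_{w=\hat{w}^{\mathrm{ML}}} = \left. \frac{\partial L_N(w)}{\partial w} \right|_{w=w^*} + \left. \frac{\partial^2 L_N(w)}{\partial w \partial w^\top} \right|_{w=\acute{w}} (\hat{w}^{\mathrm{ML}} - w^*). \tag{13.42} $$
By the definition (13.36) of the ML estimator and the differentiability of $L_N(w)$, the left-hand side of Eq. (13.42) is equal to zero, i.e.,
$$ \left. \frac{\partial L_N(w)}{\partial w} \right|_{w=\hat{w}^{\mathrm{ML}}} = 0. \tag{13.43} $$
The first term in the right-hand side of Eq. (13.42) can be written as
$$ \left. \frac{\partial L_N(w)}{\partial w} \right|_{w=w^*} = \frac{1}{N} \sum_{n=1}^N \left. \frac{\partial \log p(x^{(n)}|w)}{\partial w} \right|_{w=w^*}. \tag{13.44} $$
Footnote 4. The mean value theorem states that, for a differentiable function $f: [a, b] \to \mathbb{R}$, $\exists c \in [a, b]$ s.t. $\left. \frac{d f(x)}{dx} \right|_{x=c} = \frac{f(b) - f(a)}{b - a}$.
Since Eq. (13.40) and the differentiability of $L^*(w)$ imply that
$$ \left. \frac{\partial L^*(w)}{\partial w} \right|_{w=w^*} = \left\langle \left. \frac{\partial \log p(x|w)}{\partial w} \right|_{w=w^*} \right\rangle_{p(x|w^*)} = 0, \tag{13.45} $$
the right-hand side of Eq. (13.44) is the average over N i.i.d. samples of the random variable $\left. \frac{\partial \log p(x^{(n)}|w)}{\partial w} \right|_{w=w^*}$,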
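which follows a distribution with zero mean (Eq. (13.45)) and the covariance given by the Fisher information (13.2) at $w = w^*$, i.e.,
$$ F(w^*) = \left\langle \left. \frac{\partial \log p(x|w)}{\partial w} \right|_{w=w^*} \left. \frac{\partial \log p(x|w)}{\partial w} \right|_{w=w^*}^\top \right\rangle_{p(x|w^*)}. \tag{13.46} $$
Therefore, according to the central limit theorem (Theorem 13.1), the distribution of the first term in the right-hand side of Eq. (13.42) converges to
$$ p\left( \left. \frac{\partial L_N(w)}{\partial w} \right|_{w=w^*} \right) \to \mathrm{Gauss}_D\left( \left. \frac{\partial L_N(w)}{\partial w} \right|_{w=w^*}; 0, N^{-1} F(w^*) \right). \tag{13.47} $$
The coefficient of the second term in the right-hand side of Eq. (13.42) satisfies
$$ \left. \frac{\partial^2 L_N(w)}{\partial w \partial w^\top} \right|_{w=\acute{w}} = \left. \frac{\partial^2 L^*(w)}{\partial w \partial w^\top} \right|_{w=w^*} + o_p(1), \tag{13.48} $$
because of the law of large numbers and the consistency of the ML estimator, i.e., $[\min(\hat{w}^{\mathrm{ML}}, w^*), \max(\hat{w}^{\mathrm{ML}}, w^*)]^D \ni \acute{w} \to w^*$ since $\hat{w}^{\mathrm{ML}} \to w^*$. Furthermore, the following relation holds under the regularity conditions (see Appendix B.2):
$$ \left. \frac{\partial^2 L^*(w)}{\partial w \partial w^\top} \right|_{w=w^*} = \left\langle \left. \frac{\partial^2 \log p(x|w)}{\partial w \partial w^\top} \right|_{w=w^*} \right\rangle_{p(x|w^*)} = -F(w^*). \tag{13.49} $$
Substituting Eqs. (13.43), (13.48), and (13.49) into Eq. (13.42) gives
$$ \left( F(w^*) + o_p(1) \right) (\hat{w}^{\mathrm{ML}} - w^*) = \left. \frac{\partial L_N(w)}{\partial w} \right|_{w=w^*}. \tag{13.50} $$
Since the Fisher information is assumed to be invertible, Eq. (13.47) leads to the following theorem:
Theorem 13.2 (Asymptotic normality of ML estimator) Under the regularity conditions, the distribution of $v^{\mathrm{ML}} = \sqrt{N}(\hat{w}^{\mathrm{ML}} - w^*)$ converges to
$$ p\left( v^{\mathrm{ML}} \right) \to \mathrm{Gauss}_D\left( v^{\mathrm{ML}}; 0, F^{-1}(w^*) \right) \quad \text{as } N \to \infty. \tag{13.51} $$
Theorem 13.2 implies that
$$ \hat{w}^{\mathrm{ML}} = w^* + O_p(N^{-1/2}). \tag{13.52} $$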
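Theorem 13.2 can be illustrated on a model whose ML estimator and Fisher information are available in closed form. The sketch below assumes, purely for illustration, the exponential model $p(x|w) = w \exp(-wx)$ with rate $w > 0$, for which $\hat{w}^{\mathrm{ML}} = 1/\bar{x}$ and $F(w) = w^{-2}$; the true rate, sample sizes, and seed are arbitrary. It checks that the empirical variance of $v^{\mathrm{ML}} = \sqrt{N}(\hat{w}^{\mathrm{ML}} - w^*)$ is close to $F^{-1}(w^*) = (w^*)^2$.

```python
import numpy as np

rng = np.random.default_rng(2)
w_true, N, trials = 2.0, 1000, 2000

# NumPy parameterizes the exponential by its mean, i.e., scale = 1 / rate.
x = rng.exponential(scale=1.0 / w_true, size=(trials, N))

w_ml = 1.0 / x.mean(axis=1)              # ML estimator of the rate, per trial
v_ml = np.sqrt(N) * (w_ml - w_true)      # rescaled estimation error

inv_fisher = w_true**2                   # F(w) = 1 / w^2 for this model
emp_mean = v_ml.mean()                   # should be close to zero
emp_var = v_ml.var()                     # should be close to inv_fisher

print(emp_mean, emp_var, inv_fisher)
```

The ML estimator of the rate has a small finite-sample bias of order $N^{-1}$, so the empirical mean of $v^{\mathrm{ML}}$ is slightly positive; this bias vanishes after rescaling by $\sqrt{N}$ in the limit.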
13.4.3 Asymptotic Normality of the Bayes Posterior
The Bayes posterior can be written as follows:
$$ p(w|X) = \frac{\exp\left( N L_N(w) + \log p(w) \right)}{\int \exp\left( N L_N(w) + \log p(w) \right) dw}. \tag{13.53} $$
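In the asymptotic limit, the factor $\exp(N L_N(w))$ dominates the numerator, and the probability mass concentrates around the peak of $L_N(w)$, i.e., the ML estimator $\hat{w}^{\mathrm{ML}}$. The Taylor expansion of $L_N(w)$ about $\hat{w}^{\mathrm{ML}}$ gives
$$ \begin{aligned} L_N(w) &\approx L_N(\hat{w}^{\mathrm{ML}}) + (w - \hat{w}^{\mathrm{ML}})^\top \left. \frac{\partial L_N(w)}{\partial w} \right|_{w=\hat{w}^{\mathrm{ML}}} + \frac{1}{2} (w - \hat{w}^{\mathrm{ML}})^\top \left. \frac{\partial^2 L_N(w)}{\partial w \partial w^\top} \right|_{w=\hat{w}^{\mathrm{ML}}} (w - \hat{w}^{\mathrm{ML}}) \\ &\approx L_N(\hat{w}^{\mathrm{ML}}) - \frac{1}{2} (w - \hat{w}^{\mathrm{ML}})^\top F(w^*) (w - \hat{w}^{\mathrm{ML}}), \end{aligned} \tag{13.54} $$
where we used Eq. (13.43) and
$$ \left. \frac{\partial^2 L_N(w)}{\partial w \partial w^\top} \right|_{w=\hat{w}^{\mathrm{ML}}} = -F(w^*) + o_p(1), \tag{13.55} $$
which is implied by the law of large numbers and the consistency of the ML estimator. Eqs. (13.53) and (13.54) imply that the Bayes posterior can be approximated by a Gaussian in the asymptotic limit:
$$ p(w|X) \approx \mathrm{Gauss}_D\left( w; \hat{w}^{\mathrm{ML}}, N^{-1} F^{-1}(w^*) \right). $$
A more rigorous argument yields the following theorem:
Theorem 13.3 (Asymptotic normality of the Bayes posterior) (van der Vaart, 1998) Under the regularity conditions, the (rescaled) Bayes posterior distribution $p(v|X)$, where $v = \sqrt{N}(w - w^*)$, converges to
$$ p(v|X) \to \mathrm{Gauss}_D\left( v; v^{\mathrm{ML}}, F^{-1}(w^*) \right) \quad \text{as } N \to \infty, \tag{13.56} $$
where $v^{\mathrm{ML}} = \sqrt{N}(\hat{w}^{\mathrm{ML}} - w^*)$.
Theorem 13.3 implies that
$$ \hat{w}^{\mathrm{MAP}} = \hat{w}^{\mathrm{ML}} + o_p(N^{-1/2}), \tag{13.57} $$
$$ \hat{w}^{\mathrm{Bayes}} = \langle w \rangle_{p(w|X)} = \hat{w}^{\mathrm{ML}} + o_p(N^{-1/2}), \tag{13.58} $$
which prove the consistency of the MAP estimator and the Bayesian estimator.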
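The Gaussian approximation of the posterior can be compared with an exact posterior in a conjugate example. The sketch below assumes a Bernoulli model with success probability $w$ and a uniform prior, so that the exact posterior after $k$ successes in $N$ trials is $\mathrm{Beta}(1 + k, 1 + N - k)$, and uses the Fisher information $F(w) = 1/(w(1-w))$ of that model. Since $w^*$ is unknown in practice, the sketch plugs $\hat{w}^{\mathrm{ML}} = k/N$ into $F$, which is an additional assumption that agrees with $F(w^*)$ only asymptotically. The counts are arbitrary.

```python
import math

N, k = 10_000, 3_000                     # assumed data: k successes in N trials
w_ml = k / N                             # ML estimator of the success probability

# Exact posterior under the uniform prior: Beta(alpha, beta).
alpha, beta = 1 + k, 1 + N - k
post_mean = alpha / (alpha + beta)
post_var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))

# Gaussian approximation: variance N^{-1} F^{-1}(w) evaluated at w = w_ml.
gauss_var = w_ml * (1.0 - w_ml) / N

# Eq. (13.58): the posterior mean differs from the ML estimator by o_p(N^{-1/2}).
mean_gap_rescaled = math.sqrt(N) * abs(post_mean - w_ml)
var_ratio = post_var / gauss_var

print(mean_gap_rescaled, var_ratio)
```

With these counts the rescaled gap between the posterior mean and the ML estimator is small and the variance ratio is close to one, which is what Theorem 13.3 suggests for large N.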
13.4.4 Generalization Properties Now we analyze the generalization error and the training error in ML learning, MAP learning, and Bayesian learning, as well as the Bayes free energy. After that, we introduce information criteria for model selection, which were developed based on the asymptotic behavior of those quantities.
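13.4.5 ML Learning
The generalization error of ML learning can be written as
$$ \mathrm{GE}^{\mathrm{ML}}_{\mathrm{Regular}}(X) = \left\langle \log \frac{p(x|w^*)}{p(x|\hat{w}^{\mathrm{ML}})} \right\rangle_{p(x|w^*)} = L^*(w^*) - L^*(\hat{w}^{\mathrm{ML}}) \tag{13.59} $$
with $L^*(w)$ defined by Eq. (13.39). The Taylor expansion of the second term of Eq. (13.59) about the true parameter $w^*$ gives
$$ \begin{aligned} L^*(\hat{w}^{\mathrm{ML}}) &= L^*(w^*) + (\hat{w}^{\mathrm{ML}} - w^*)^\top \left. \frac{\partial L^*(w)}{\partial w} \right|_{w=w^*} + \frac{1}{2} (\hat{w}^{\mathrm{ML}} - w^*)^\top \left. \frac{\partial^2 L^*(w)}{\partial w \partial w^\top} \right|_{w=w^*} (\hat{w}^{\mathrm{ML}} - w^*) + O(\|\hat{w}^{\mathrm{ML}} - w^*\|^3) \\ &= L^*(w^*) - \frac{1}{2} (\hat{w}^{\mathrm{ML}} - w^*)^\top F(w^*) (\hat{w}^{\mathrm{ML}} - w^*) + O(\|\hat{w}^{\mathrm{ML}} - w^*\|^3), \end{aligned} \tag{13.60} $$
where we used Eqs. (13.45) and (13.49) in the last equality. Substituting Eq. (13.60) into Eq. (13.59) gives
$$ \mathrm{GE}^{\mathrm{ML}}_{\mathrm{Regular}}(X) = \frac{1}{2} (\hat{w}^{\mathrm{ML}} - w^*)^\top F(w^*) (\hat{w}^{\mathrm{ML}} - w^*) + O(\|\hat{w}^{\mathrm{ML}} - w^*\|^3). \tag{13.61} $$
The asymptotic normality (Theorem 13.2) of the ML estimator implies that
$$ \sqrt{N} F^{\frac{1}{2}}(w^*) (\hat{w}^{\mathrm{ML}} - w^*) \rightsquigarrow \mathrm{Gauss}_D(0, I_D), \tag{13.62} $$
and that
$$ O(\|\hat{w}^{\mathrm{ML}} - w^*\|^3) = O_p(N^{-3/2}). \tag{13.63} $$
Eq. (13.62) implies that the distribution of $s = N (\hat{w}^{\mathrm{ML}} - w^*)^\top F(w^*) (\hat{w}^{\mathrm{ML}} - w^*)$ converges to the chi-squared distribution with D degrees of freedom (footnote 5):
$$ p(s) \to \chi^2(s; D), \tag{13.64} $$
and therefore,
$$ N \left\langle (\hat{w}^{\mathrm{ML}} - w^*)^\top F(w^*) (\hat{w}^{\mathrm{ML}} - w^*) \right\rangle_{q(X)} = D + o(1). \tag{13.65} $$
Footnote 5. The chi-squared distribution with D degrees of freedom is the distribution of the sum of the squares of D i.i.d. samples drawn from $\mathrm{Gauss}_1(0, 1^2)$. It is a special case of the Gamma distribution, and it holds that $\chi^2(x; D) = \mathrm{Gamma}(x; D/2, 1/2)$. The mean and the variance are equal to $D$ and $2D$, respectively.
Eqs. (13.61), (13.63), and (13.65) lead to the following theorem: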
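Theorem 13.4 The average generalization error of ML learning in the regular models can be asymptotically expanded as
$$ \overline{\mathrm{GE}}^{\mathrm{ML}}_{\mathrm{Regular}}(N) = \left\langle \mathrm{GE}^{\mathrm{ML}}_{\mathrm{Regular}}(X) \right\rangle_{q(X)} = \lambda^{\mathrm{ML}}_{\mathrm{Regular}} N^{-1} + o(N^{-1}), \tag{13.66} $$
where the generalization coefficient is given by
$$ 2\lambda^{\mathrm{ML}}_{\mathrm{Regular}} = D. \tag{13.67} $$
Interestingly, the leading term of the generalization error only depends on the parameter dimension or the degree of freedom of the statistical model. The training error of ML learning can be analyzed in a similar fashion. It can be written as
$$ \mathrm{TE}^{\mathrm{ML}}_{\mathrm{Regular}}(X) = N^{-1} \sum_{n=1}^N \log \frac{p(x^{(n)}|w^*)}{p(x^{(n)}|\hat{w}^{\mathrm{ML}})} = L_N(w^*) - L_N(\hat{w}^{\mathrm{ML}}) \tag{13.68} $$
with $L_N(w)$ defined by Eq. (13.37). The Taylor expansion of the first term of Eq. (13.68) about the ML estimator $\hat{w}^{\mathrm{ML}}$ gives
$$ \begin{aligned} L_N(w^*) &= L_N(\hat{w}^{\mathrm{ML}}) + (w^* - \hat{w}^{\mathrm{ML}})^\top \left. \frac{\partial L_N(w)}{\partial w} \right|_{w=\hat{w}^{\mathrm{ML}}} + \frac{1}{2} (w^* - \hat{w}^{\mathrm{ML}})^\top \left. \frac{\partial^2 L_N(w)}{\partial w \partial w^\top} \right|_{w=\hat{w}^{\mathrm{ML}}} (w^* - \hat{w}^{\mathrm{ML}}) + O(\|w^* - \hat{w}^{\mathrm{ML}}\|^3) \\ &= L_N(\hat{w}^{\mathrm{ML}}) - \frac{1}{2} (w^* - \hat{w}^{\mathrm{ML}})^\top \left( F(w^*) + o_p(1) \right) (w^* - \hat{w}^{\mathrm{ML}}) + O(\|w^* - \hat{w}^{\mathrm{ML}}\|^3), \end{aligned} \tag{13.69} $$
where we used Eqs. (13.43) and (13.55). Substituting Eq. (13.69) into Eq. (13.68) and applying Eq. (13.52), we have
$$ \mathrm{TE}^{\mathrm{ML}}_{\mathrm{Regular}}(X) = -\frac{1}{2} (\hat{w}^{\mathrm{ML}} - w^*)^\top F(w^*) (\hat{w}^{\mathrm{ML}} - w^*) + o_p(N^{-1}). \tag{13.70} $$
Thus, Eq. (13.70) together with Eq. (13.65) gives the following theorem:
Theorem 13.5 The average training error of ML learning in the regular models can be asymptotically expanded as
$$ \overline{\mathrm{TE}}^{\mathrm{ML}}_{\mathrm{Regular}}(N) = \left\langle \mathrm{TE}^{\mathrm{ML}}_{\mathrm{Regular}}(X) \right\rangle_{q(X)} = \nu^{\mathrm{ML}}_{\mathrm{Regular}} N^{-1} + o(N^{-1}), \tag{13.71} $$
where the training coefficient is given by
$$ 2\nu^{\mathrm{ML}}_{\mathrm{Regular}} = -D. \tag{13.72} $$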
Comparing Theorems 13.4 and 13.5, we see that the generalization coefficient and the training coefficient are antisymmetric with each other:
$$ \lambda^{\mathrm{ML}}_{\mathrm{Regular}} = -\nu^{\mathrm{ML}}_{\mathrm{Regular}}. $$
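Theorems 13.4 and 13.5 can be checked by Monte Carlo simulation on a simple regular model. The sketch below assumes a D-dimensional Gaussian mean model $p(x|w) = \mathrm{Gauss}_D(x; w, I_D)$ with true parameter $w^* = 0$; the dimension, sample size, trial count, and seed are arbitrary choices. For this model the ML estimator is the sample mean and the generalization error is the KL divergence $\frac{1}{2}\|\hat{w}^{\mathrm{ML}}\|^2$, while the training error is evaluated directly from its definition (13.11).

```python
import numpy as np

rng = np.random.default_rng(3)
D, N, trials = 3, 200, 2000

x = rng.standard_normal((trials, N, D))  # samples from the true model, w* = 0
w_ml = x.mean(axis=1)                    # ML estimator per trial, shape (trials, D)

# Generalization error: KL(Gauss(0, I) || Gauss(w_ml, I)) = ||w_ml||^2 / 2.
ge = 0.5 * (w_ml**2).sum(axis=1)

# Training error: average of log q(x) - log p(x | w_ml) over the training samples.
sq_dist_ml = ((x - w_ml[:, None, :])**2).sum(axis=2)
sq_dist_true = (x**2).sum(axis=2)
te = 0.5 * (sq_dist_ml - sq_dist_true).mean(axis=1)

two_lambda = 2 * N * ge.mean()           # should be close to D
two_nu = 2 * N * te.mean()               # should be close to -D

print(two_lambda, two_nu)
```

In this particular model the antisymmetry holds exactly for every data set, because the training error reduces algebraically to $-\frac{1}{2}\|\hat{w}^{\mathrm{ML}}\|^2$; in general regular models it holds only for the leading terms.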
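13.4.6 MAP Learning
We first prove the following theorem:
Theorem 13.6 For any (point) estimator such that
$$ \hat{w} = \hat{w}^{\mathrm{ML}} + o_p(N^{-1/2}), \tag{13.73} $$
it holds that
$$ \mathrm{GE}^{\hat{w}}_{\mathrm{Regular}}(X) = \left\langle \log \frac{p(x|w^*)}{p(x|\hat{w})} \right\rangle_{p(x|w^*)} = \mathrm{GE}^{\mathrm{ML}}_{\mathrm{Regular}}(X) + o_p(N^{-1}), \tag{13.74} $$
$$ \mathrm{TE}^{\hat{w}}_{\mathrm{Regular}}(X) = N^{-1} \sum_{n=1}^N \log \frac{p(x^{(n)}|w^*)}{p(x^{(n)}|\hat{w})} = \mathrm{TE}^{\mathrm{ML}}_{\mathrm{Regular}}(X) + o_p(N^{-1}). \tag{13.75} $$
Proof The generalization error of the estimator $\hat{w}$ can be written as
$$ \mathrm{GE}^{\hat{w}}_{\mathrm{Regular}}(X) = \left\langle \log \frac{p(x|w^*)}{p(x|\hat{w})} \right\rangle_{p(x|w^*)} = L^*(w^*) - L^*(\hat{w}) = \mathrm{GE}^{\mathrm{ML}}_{\mathrm{Regular}}(X) + \left( L^*(\hat{w}^{\mathrm{ML}}) - L^*(\hat{w}) \right), \tag{13.76} $$
where the second term can be expanded as
$$ L^*(\hat{w}^{\mathrm{ML}}) - L^*(\hat{w}) = -(\hat{w} - \hat{w}^{\mathrm{ML}})^\top \left. \frac{\partial L^*(w)}{\partial w} \right|_{w=\hat{w}^{\mathrm{ML}}} - \frac{1}{2} (\hat{w} - \hat{w}^{\mathrm{ML}})^\top \left. \frac{\partial^2 L^*(w)}{\partial w \partial w^\top} \right|_{w=\hat{w}^{\mathrm{ML}}} (\hat{w} - \hat{w}^{\mathrm{ML}}) + O(\|\hat{w} - \hat{w}^{\mathrm{ML}}\|^3). \tag{13.77} $$
Eqs. (13.45) and (13.52) (with the differentiability of $\partial L^*(w)/\partial w$) imply that
$$ \left. \frac{\partial L^*(w)}{\partial w} \right|_{w=\hat{w}^{\mathrm{ML}}} = \left. \frac{\partial L^*(w)}{\partial w} \right|_{w=w^* + O_p(N^{-1/2})} = O_p(N^{-1/2}), $$
with which Eqs. (13.73) and (13.77) lead to
$$ L^*(\hat{w}^{\mathrm{ML}}) - L^*(\hat{w}) = o_p(N^{-1}). $$
Substituting the preceding into Eq. (13.76) gives Eq. (13.74).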
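Similarly, the training error of the estimator $\hat{w}$ can be written as
$$ \mathrm{TE}^{\hat{w}}_{\mathrm{Regular}}(X) = N^{-1} \sum_{n=1}^N \log \frac{p(x^{(n)}|w^*)}{p(x^{(n)}|\hat{w})} = L_N(w^*) - L_N(\hat{w}) = \mathrm{TE}^{\mathrm{ML}}_{\mathrm{Regular}}(X) + \left( L_N(\hat{w}^{\mathrm{ML}}) - L_N(\hat{w}) \right), \tag{13.78} $$
where the second term can be expanded as
$$ L_N(\hat{w}^{\mathrm{ML}}) - L_N(\hat{w}) = -(\hat{w} - \hat{w}^{\mathrm{ML}})^\top \left. \frac{\partial L_N(w)}{\partial w} \right|_{w=\hat{w}^{\mathrm{ML}}} - \frac{1}{2} (\hat{w} - \hat{w}^{\mathrm{ML}})^\top \left. \frac{\partial^2 L_N(w)}{\partial w \partial w^\top} \right|_{w=\hat{w}^{\mathrm{ML}}} (\hat{w} - \hat{w}^{\mathrm{ML}}) + O(\|\hat{w} - \hat{w}^{\mathrm{ML}}\|^3). \tag{13.79} $$
Eqs. (13.43), (13.73), and (13.79) lead to
$$ L_N(\hat{w}^{\mathrm{ML}}) - L_N(\hat{w}) = o_p(N^{-1}). $$
Substituting the preceding into Eq. (13.78) gives Eq. (13.75), which completes the proof.
Since the MAP estimator satisfies the condition (13.73) of Theorem 13.6 (see Eq. (13.57)), we obtain the following corollaries:
Corollary 13.7 The average generalization error of MAP learning in the regular models can be asymptotically expanded as
$$ \overline{\mathrm{GE}}^{\mathrm{MAP}}_{\mathrm{Regular}}(N) = \left\langle \mathrm{GE}^{\mathrm{MAP}}_{\mathrm{Regular}}(X) \right\rangle_{q(X)} = \lambda^{\mathrm{MAP}}_{\mathrm{Regular}} N^{-1} + o(N^{-1}), \tag{13.80} $$
where the generalization coefficient is given by
$$ 2\lambda^{\mathrm{MAP}}_{\mathrm{Regular}} = D. \tag{13.81} $$
Corollary 13.8 The average training error of MAP learning in the regular models can be asymptotically expanded as
$$ \overline{\mathrm{TE}}^{\mathrm{MAP}}_{\mathrm{Regular}}(N) = \left\langle \mathrm{TE}^{\mathrm{MAP}}_{\mathrm{Regular}}(X) \right\rangle_{q(X)} = \nu^{\mathrm{MAP}}_{\mathrm{Regular}} N^{-1} + o(N^{-1}), \tag{13.82} $$
where the training coefficient is given by
$$ 2\nu^{\mathrm{MAP}}_{\mathrm{Regular}} = -D. \tag{13.83} $$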
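13.4.7 Bayesian Learning
Eq. (13.58) and Theorem 13.6 imply that the Bayesian estimator also gives the same generalization and training coefficients as ML learning, if the plug-in predictive distribution $p(x|\hat{w}^{\mathrm{Bayes}})$, i.e., the model distribution with the Bayesian parameter plugged in, is used for prediction. We can show that the proper Bayesian procedure with the predictive distribution $p(x|X) = \langle p(x|w) \rangle_{p(w|X)}$ also gives the same generalization and training coefficients. We first prove the following theorem:
Theorem 13.9 Let $r(w)$ be a (possibly approximate posterior) distribution of the parameter, of which the mean and the covariance satisfy the following:
$$ \hat{w} = \langle w \rangle_{r(w)} = w^* + O_p(N^{-1/2}), \tag{13.84} $$
$$ \hat{\Sigma}_w = \left\langle \left( w - \langle w \rangle_{r(w)} \right) \left( w - \langle w \rangle_{r(w)} \right)^\top \right\rangle_{r(w)} = O_p(N^{-1}). \tag{13.85} $$
Then the generalization error and the training error of the predictive distribution $p(x|X) = \langle p(x|w) \rangle_{r(w)}$ satisfy
$$ \mathrm{GE}^{r}_{\mathrm{Regular}}(X) = \left\langle \log \frac{p(x|w^*)}{\langle p(x|w) \rangle_{r(w)}} \right\rangle_{p(x|w^*)} = \mathrm{GE}^{\hat{w}}_{\mathrm{Regular}}(X) + o_p(N^{-1}), \tag{13.86} $$
$$ \mathrm{TE}^{r}_{\mathrm{Regular}}(X) = N^{-1} \sum_{n=1}^N \log \frac{p(x^{(n)}|w^*)}{\langle p(x^{(n)}|w) \rangle_{r(w)}} = \mathrm{TE}^{\hat{w}}_{\mathrm{Regular}}(X) + o_p(N^{-1}), \tag{13.87} $$
where $\mathrm{GE}^{\hat{w}}_{\mathrm{Regular}}(X)$ and $\mathrm{TE}^{\hat{w}}_{\mathrm{Regular}}(X)$ are, respectively, the generalization error and the training error of the point estimator $\hat{w}$ (defined in Theorem 13.6).
Proof The predictive distribution can be expressed as
$$ \begin{aligned} \langle p(x|w) \rangle_{r(w)} &= \left\langle \exp\left( \log p(x|w) \right) \right\rangle_{r(w)} \\ &= \left\langle \exp\left( \log p(x|\hat{w}) + (w - \hat{w})^\top \left. \frac{\partial \log p(x|w')}{\partial w'} \right|_{w'=\hat{w}} + \frac{1}{2} (w - \hat{w})^\top \left. \frac{\partial^2 \log p(x|w')}{\partial w' \partial w'^\top} \right|_{w'=\hat{w}} (w - \hat{w}) + O(\|w - \hat{w}\|^3) \right) \right\rangle_{r(w)} \\ &= p(x|\hat{w}) \cdot \left\langle 1 + (w - \hat{w})^\top \left. \frac{\partial \log p(x|w')}{\partial w'} \right|_{w'=\hat{w}} + \frac{1}{2} (w - \hat{w})^\top \left. \frac{\partial \log p(x|w')}{\partial w'} \right|_{w'=\hat{w}} \left. \frac{\partial \log p(x|w')}{\partial w'} \right|_{w'=\hat{w}}^\top (w - \hat{w}) \right. \\ &\qquad \left. + \frac{1}{2} (w - \hat{w})^\top \left. \frac{\partial^2 \log p(x|w')}{\partial w' \partial w'^\top} \right|_{w'=\hat{w}} (w - \hat{w}) + O(\|w - \hat{w}\|^3) \right\rangle_{r(w)}. \end{aligned} $$
Here we first expanded $\log p(x|w)$ about $\hat{w}$, and then expanded the exponential function (with $\exp(z) = 1 + z + z^2/2 + O(z^3)$).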
Using the conditions (13.84) and (13.85) on r(w), we have w) · 1 + 12 tr Σ w Φ(x; w) + Op N −3/2 , p(x|w) r(w) = p(x|
(13.88)
where Φ(x; w) = Therefore, * + p(x|w)
log p(x| w)r(w)
∂ log p(x|w) ∂ log p(x|w) w= w ∂w ∂w w= w
p(x|w∗ )
+
∂2 log p(x|w) . ∂w ∂w w= w
(13.89)
= log 1 + 12 tr Σ w Φ(x; w) + Op N −3/2 p(x|w∗ ) = 12 tr w) + Op N −3/2 . Σ w Φ(x; (13.90) ∗ p(x|w )
Here we expanded the logarithm function (with $\log(1 + z) = z + O(z^2)$), using the condition (13.85) on the covariance, i.e., $\hat{\Sigma}_w = O_p(N^{-1})$. The condition (13.84) on the mean, i.e., $\hat{w} = w^* + O_p(N^{-1/2})$, implies that
$$\left\langle \Phi(x; \hat{w}) \right\rangle_{p(x|w^*)} = \left\langle \Phi(x; w^*) \right\rangle_{p(x|w^*)} + O_p(N^{-1/2}) = F(w^*) - F(w^*) + O_p(N^{-1/2}) = O_p(N^{-1/2}), \tag{13.91}$$
where we used the definition of the Fisher information (13.46) and its equivalent expression (13.49) (under the regularity conditions). Eqs. (13.90) and (13.91) together with the condition (13.85) give
$$\left\langle \log \frac{\langle p(x|w) \rangle_{r(w)}}{p(x|\hat{w})} \right\rangle_{p(x|w^*)} = O_p(N^{-3/2}),$$
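which results in Eq. (13.86). Similarly, by using the expression (13.88) of the predictive distribution, we have
$$\begin{aligned}
N^{-1} \sum_{n=1}^{N} \log \frac{\langle p(x^{(n)}|w) \rangle_{r(w)}}{p(x^{(n)}|\hat{w})} &= N^{-1} \sum_{n=1}^{N} \log\left( 1 + \frac{1}{2} \mathrm{tr}\left( \hat{\Sigma}_w \Phi(x^{(n)}; \hat{w}) \right) + O_p\left(N^{-3/2}\right) \right) \\
&= \frac{1}{2} N^{-1} \sum_{n=1}^{N} \mathrm{tr}\left( \hat{\Sigma}_w \Phi(x^{(n)}; \hat{w}) \right) + O_p\left(N^{-3/2}\right).
\end{aligned} \tag{13.92}$$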
The law of large numbers (13.5) and Eq. (13.91) lead to
$$N^{-1} \sum_{n=1}^{N} \Phi(x^{(n)}; \hat{w}) = \left\langle \Phi(x; \hat{w}) \right\rangle_{p(x|w^*)} + o_p(1) = o_p(1).$$
Substituting the preceding and the condition (13.85) into Eq. (13.92) gives
$$N^{-1} \sum_{n=1}^{N} \log \frac{\langle p(x^{(n)}|w) \rangle_{r(w)}}{p(x^{(n)}|\hat{w})} = o_p\left(N^{-1}\right),$$
which results in Eq. (13.87). This completes the proof.
The asymptotic normality of the Bayes posterior (Theorem 13.3), combined with the asymptotic normality of the ML estimator (Theorem 13.2), guarantees that the conditions (13.84) and (13.85) of Theorem 13.9 hold in Bayesian learning, which leads to the following corollaries:
Corollary 13.10 The average generalization error of Bayesian learning in the regular models can be asymptotically expanded as
$$\overline{\mathrm{GE}}^{\mathrm{Bayes}}_{\mathrm{Regular}}(N) = \left\langle \mathrm{GE}^{\mathrm{Bayes}}_{\mathrm{Regular}}(X) \right\rangle_{q(X)} = \lambda^{\mathrm{Bayes}}_{\mathrm{Regular}} N^{-1} + o(N^{-1}), \tag{13.93}$$
where the generalization coefficient is given by
$$2\lambda^{\mathrm{Bayes}}_{\mathrm{Regular}} = D. \tag{13.94}$$
Corollary 13.11 The average training error of Bayesian learning in the regular models can be asymptotically expanded as
$$\overline{\mathrm{TE}}^{\mathrm{Bayes}}_{\mathrm{Regular}}(N) = \left\langle \mathrm{TE}^{\mathrm{Bayes}}_{\mathrm{Regular}}(X) \right\rangle_{q(X)} = \nu^{\mathrm{Bayes}}_{\mathrm{Regular}} N^{-1} + o(N^{-1}), \tag{13.95}$$
where the training coefficient is given by
$$2\nu^{\mathrm{Bayes}}_{\mathrm{Regular}} = -D. \tag{13.96}$$
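The two corollaries can be checked numerically in a simple conjugate case. The following is a minimal sketch under assumptions that are not in the text: the model is the Gaussian $\mathcal{N}(x|w, I_D)$ with prior $\mathcal{N}(w|0, I_D)$, so the Bayes predictive distribution is Gaussian and both errors have closed forms. The function name, the true parameter, and the sample sizes are hypothetical choices made only for illustration.

```python
import numpy as np

def bayes_errors_gaussian_mean(X, w_true):
    """Generalization and training errors of the Bayes predictive distribution.

    Assumed toy setting (not from the text): model N(x | w, I_D) with
    prior N(w | 0, I_D), so the posterior and the predictive are Gaussian.
    X has shape (trials, N, D); one error value is returned per trial.
    """
    _, N, D = X.shape
    post_mean = X.sum(axis=1) / (N + 1)      # posterior mean of w
    s2 = 1.0 + 1.0 / (N + 1)                 # predictive variance per dimension
    # GE = KL( N(w*, I) || N(post_mean, s2 * I) ), available in closed form.
    ge = 0.5 * D * (np.log(s2) + 1.0 / s2 - 1.0) \
        + 0.5 * ((post_mean - w_true) ** 2).sum(axis=1) / s2
    # TE = empirical mean of log p(x|w*) - log p(x|X) over the training samples.
    sq_true = ((X - w_true) ** 2).sum(axis=2)
    sq_pred = ((X - post_mean[:, None, :]) ** 2).sum(axis=2)
    te = (-0.5 * sq_true + 0.5 * D * np.log(s2) + 0.5 * sq_pred / s2).mean(axis=1)
    return ge, te

rng = np.random.default_rng(0)
D, N, trials = 3, 100, 5000
w_true = np.ones(D)
X = w_true + rng.standard_normal((trials, N, D))
ge, te = bayes_errors_gaussian_mean(X, w_true)
two_lambda = 2 * N * ge.mean()   # Monte Carlo estimate of 2 * lambda, expected near D
two_nu = 2 * N * te.mean()       # Monte Carlo estimate of 2 * nu, expected near -D
print(two_lambda, two_nu)
```

At finite N the estimates deviate from D and −D by Monte Carlo noise and by o(N⁻¹) terms, so they should be read as approximate.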
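Asymptotic behavior of the Bayes free energy (13.18) was also analyzed (Schwarz, 1978; Watanabe, 2009). The Bayes free energy can be written as
$$F^{\mathrm{Bayes}}(X) = -\log p(X) = -\log \int p(w) \prod_{n=1}^{N} p(x^{(n)}|w) \, dw = -\log \int \exp\left( N L_N(w) + \log p(w) \right) dw,$$
where the factor $\exp(N L_N(w))$ dominates in the asymptotic limit. By using the Taylor expansion
$$\begin{aligned}
L_N(w) &= L_N(\hat{w}^{\mathrm{ML}}) + (w - \hat{w}^{\mathrm{ML}})^\top \left.\frac{\partial L_N(w)}{\partial w}\right|_{w=\hat{w}^{\mathrm{ML}}} + \frac{1}{2} (w - \hat{w}^{\mathrm{ML}})^\top \left.\frac{\partial^2 L_N(w)}{\partial w \partial w^\top}\right|_{w=\hat{w}^{\mathrm{ML}}} (w - \hat{w}^{\mathrm{ML}}) + O\left(\|w - \hat{w}^{\mathrm{ML}}\|^3\right) \\
&= L_N(\hat{w}^{\mathrm{ML}}) - \frac{1}{2} (w - \hat{w}^{\mathrm{ML}})^\top \left( F(w^*) + o_p(1) \right) (w - \hat{w}^{\mathrm{ML}}) + O\left(\|w - \hat{w}^{\mathrm{ML}}\|^3\right),
\end{aligned}$$
we can approximate the Bayes free energy as follows:
$$\begin{aligned}
F^{\mathrm{Bayes}}(X) &\approx -N L_N(\hat{w}^{\mathrm{ML}}) - \log \int \exp\left( -\frac{N}{2} (w - \hat{w}^{\mathrm{ML}})^\top F(w^*) (w - \hat{w}^{\mathrm{ML}}) + \log p(w) \right) dw \\
&= -N L_N(\hat{w}^{\mathrm{ML}}) - \log \int \exp\left( -\frac{1}{2} v^\top F(w^*) v + \log p(\hat{w}^{\mathrm{ML}} + N^{-1/2} v) \right) \frac{dv}{N^{D/2}} \\
&= -N L_N(\hat{w}^{\mathrm{ML}}) + \frac{D}{2} \log N + O_p(1),
\end{aligned} \tag{13.97}$$
where $v = \sqrt{N}(w - \hat{w}^{\mathrm{ML}})$ is a rescaled parameter, on which the integration was performed with $dv = N^{D/2} dw$. Therefore, the relative Bayes free energy (13.19) can be written as
$$\tilde{F}^{\mathrm{Bayes}}(X) = F^{\mathrm{Bayes}}(X) + N L_N(w^*) \approx \frac{D}{2} \log N + N \left( L_N(w^*) - L_N(\hat{w}^{\mathrm{ML}}) \right) + O_p(1). \tag{13.98}$$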
Here we used $S_N(X) = -L_N(w^*)$, which can be confirmed by their definitions (13.20) and (13.37). The second term in Eq. (13.98) is of the order of $O_p(1)$, because Eqs. (13.68), (13.70), and (13.52) imply that
$$L_N(w^*) - L_N(\hat{w}^{\mathrm{ML}}) = \mathrm{TE}^{\mathrm{ML}}_{\mathrm{Regular}}(X) = O_p(N^{-1}).$$
The following theorem was obtained with more rigorous discussion.
Theorem 13.12 (Watanabe, 2009) The relative Bayes free energy for the regular models can be asymptotically expanded as
$$\tilde{F}^{\mathrm{Bayes}}_{\mathrm{Regular}}(X) = F^{\mathrm{Bayes}}(X) - N S_N(X) = \lambda'^{\,\mathrm{Bayes}}_{\mathrm{Regular}} \log N + O_p(1), \tag{13.99}$$
where the Bayes free energy coefficient is given by
$$2\lambda'^{\,\mathrm{Bayes}}_{\mathrm{Regular}} = D. \tag{13.100}$$
Note that Corollary 13.10 and Theorem 13.12 are consistent with Eq. (13.25), which holds for any statistical model.
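The approximation (13.97) can be examined numerically in one dimension. The sketch below assumes a toy setting that is not in the text: the model $\mathcal{N}(x|w, 1)$ with prior $\mathcal{N}(w|0, 1)$, so that D = 1. It computes $-\log p(X)$ by a Riemann sum over a grid of $w$ and compares it with $-N L_N(\hat{w}^{\mathrm{ML}}) + \frac{D}{2}\log N$. The function names, the grid, and the true mean 0.7 are hypothetical.

```python
import numpy as np

def bayes_free_energy_grid(x, grid):
    """-log p(X) for the model N(x | w, 1) with prior N(w | 0, 1), by a Riemann sum over w."""
    N = x.size
    # Logarithm of the integrand: N * L_N(w) + log p(w).
    log_lik = -0.5 * N * np.log(2 * np.pi) \
        - 0.5 * (np.sum(x ** 2) - 2 * grid * np.sum(x) + N * grid ** 2)
    log_prior = -0.5 * np.log(2 * np.pi) - 0.5 * grid ** 2
    log_f = log_lik + log_prior
    m = log_f.max()                          # log-sum-exp trick for numerical stability
    dw = grid[1] - grid[0]
    return -(m + np.log(np.sum(np.exp(log_f - m)) * dw))

def bic_style_approximation(x):
    """-N * L_N(w_ML) + (D / 2) * log N with D = 1, cf. Eq. (13.97)."""
    N = x.size
    w_ml = x.mean()
    neg_max_loglik = 0.5 * N * np.log(2 * np.pi) + 0.5 * np.sum((x - w_ml) ** 2)
    return neg_max_loglik + 0.5 * np.log(N)

rng = np.random.default_rng(1)
grid = np.linspace(-10.0, 10.0, 400001)
samples, remainders = {}, {}
for N in (100, 1000, 10000):
    samples[N] = 0.7 + rng.standard_normal(N)
    remainders[N] = bayes_free_energy_grid(samples[N], grid) - bic_style_approximation(samples[N])
print(remainders)
```

In this conjugate case the remainder has the closed form $\frac{1}{2}\log(1 + 1/N) + \frac{1}{2} N \bar{x}^2/(N+1)$, where $\bar{x}$ is the sample mean, which stays bounded as N grows, in line with the $O_p(1)$ term in Eq. (13.97).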
13.4.8 Information Criteria
We have seen that the leading terms of the generalization error, the training error, and the relative Bayes free energy are proportional to the parameter dimension. Those results imply that how much a regular statistical model overfits training data mainly depends on the degrees of freedom of statistical models. Based on this insight, various information criteria were proposed for model selection. Let us first recapitulate the model selection problem. Consider a $(D - 1)$-degree polynomial regression model for one-dimensional input $t$ and output $y$:
$$y = \sum_{d=1}^{D} w_d t^{d-1} + \varepsilon,$$
where $\varepsilon$ denotes noise. This model can be written as $y = w^\top x + \varepsilon$, where $w \in \mathbb{R}^D$ is a parameter vector, and $x = (1, t, t^2, \ldots, t^{D-1})^\top$ is a transformed input vector. Suppose that the true distribution can be realized just with a $(D^* - 1)$-degree polynomial:
$$y = \sum_{d=1}^{D^*} w^*_d t^{d-1} + \varepsilon = w^{*\top} x + \varepsilon,$$
where $w^* \in \mathbb{R}^{D^*}$ is the true parameter vector, and $x = (1, t, t^2, \ldots, t^{D^*-1})^\top$.⁶ If we train a $(D - 1)$-degree polynomial model for $D < D^*$, we expect poor generalization performance because the true distribution is not realizable, i.e., the model is too simple to express the true distribution. On the other hand, it was observed that if we train a model such that $D \gg D^*$, the generalization performance is also not optimal, because the unnecessarily high degree terms cause overfitting. Accordingly, finding an appropriate degree of freedom $D$, based on the observed data, is an important task, which is known as a model selection problem. It would be a good strategy if we could choose the $D$ that minimizes the generalization error (13.30). Ignoring the terms that do not depend on the model (or $D$), the generalization error can be written as
$$\mathrm{GE}(D) = -\int q(x) q(y|x) \log p(y|x, D) \, dx \, dy + \mathrm{const.} \tag{13.101}$$
⁶ By “just,” we mean that $w^*_{D^*} \neq 0$, and therefore the true distribution is not realizable with any $(D - 1)$-degree polynomial for $D < D^*$.
Unfortunately, we cannot directly evaluate Eq. (13.101), since the true distribution $q(y|x)$ is inaccessible. Instead, the training error (13.31),
$$\mathrm{TE}(D) = -\frac{1}{N} \sum_{n=1}^{N} \log p(y^{(n)}|x^{(n)}, D) + \mathrm{const.}, \tag{13.102}$$
is often used as an estimator for the generalization error. Although Eq. (13.102) is accessible, the training error is known to be a biased estimator for the generalization error (13.101). In fact, the training error does not reflect the negative effect of redundancy of the statistical model, and tends to be monotonically decreasing as the parameter dimension $D$ increases. Akaike’s information criterion (AIC) (Akaike, 1974),
$$\mathrm{AIC} = -2 \sum_{n=1}^{N} \log p(y^{(n)}|x^{(n)}, \hat{w}^{\mathrm{ML}}) + 2D, \tag{13.103}$$
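was proposed as an estimator for the generalization error of ML learning with bias correction. Theorems 13.4 and 13.5 provide the bias between the generalization error and the training error as follows:
$$\begin{aligned}
\left\langle \mathrm{GE}^{\mathrm{ML}}_{\mathrm{Regular}}(\mathcal{D}) - \mathrm{TE}^{\mathrm{ML}}_{\mathrm{Regular}}(\mathcal{D}) \right\rangle_{q(\mathcal{D})} &= \overline{\mathrm{GE}}^{\mathrm{ML}}_{\mathrm{Regular}}(N) - \overline{\mathrm{TE}}^{\mathrm{ML}}_{\mathrm{Regular}}(N) \\
&= \frac{\lambda^{\mathrm{ML}}_{\mathrm{Regular}} - \nu^{\mathrm{ML}}_{\mathrm{Regular}}}{N} + o(N^{-1}) \\
&= \frac{D}{N} + o(N^{-1}),
\end{aligned} \tag{13.104}$$
where $\mathcal{D}$ denotes the observed data set.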
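Therefore, it holds that
$$\begin{aligned}
\mathrm{TE}^{\mathrm{ML}}_{\mathrm{Regular}}(\mathcal{D}) + \left\langle \mathrm{GE}^{\mathrm{ML}}_{\mathrm{Regular}}(\mathcal{D}) - \mathrm{TE}^{\mathrm{ML}}_{\mathrm{Regular}}(\mathcal{D}) \right\rangle_{q(\mathcal{D})} &= \mathrm{TE}^{\mathrm{ML}}_{\mathrm{Regular}}(\mathcal{D}) + \frac{D}{N} + o(N^{-1}) \\
&= \frac{\mathrm{AIC}}{2N} - S_N(Y|X) + o(N^{-1}),
\end{aligned} \tag{13.105}$$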
where S N (Y|X) is the (conditional) empirical entropy (13.34). Since the empirical entropy S N (Y|X) does not depend on the model, Eq. (13.105) implies that minimizing AIC amounts to minimizing an asymptotically unbiased estimator for the generalization error. Another strategy for model selection is to minimize an approximation to the Bayes free energy (13.33). Instead of performing integration for computing
the Bayes free energy, Schwarz (1978) proposed to minimize the Bayesian information criterion (BIC):
$$\mathrm{BIC} = \mathrm{MDL} = -2 \sum_{n=1}^{N} \log p(y^{(n)}|x^{(n)}, \hat{w}^{\mathrm{ML}}) + D \log N. \tag{13.106}$$
Interestingly, an equivalent criterion, called the minimum description length (MDL) (Rissanen, 1986), was derived in the context of information theory in communication. The relation between BIC and the Bayes free energy can be directly found from the approximation (13.97), i.e., it holds that
$$F^{\mathrm{Bayes}}(Y|X) \approx -N L_N(\hat{w}^{\mathrm{ML}}) + \frac{D}{2} \log N + O_p(1) = \frac{\mathrm{BIC}}{2} + O_p(1), \tag{13.107}$$
and therefore minimizing BIC amounts to minimizing an approximation to the Bayes free energy. The first terms of AIC (13.103) and BIC (13.106) are the maximum log-likelihood (the log-likelihood at the ML estimator) multiplied by −2. The second terms, called penalty terms, penalize high model complexity and thus explicitly work as Occam’s razor (MacKay, 1992) to prune off irrelevant degrees of freedom of the statistical model. AIC, BIC, and MDL are easily computable and have shown their usefulness in many applications. However, their derivations rely on the fact that the generalization coefficient, the training coefficient, and the free energy coefficient depend only on the parameter dimension under the regularity conditions. Actually, it has been revealed that, in singular models, those coefficients depend not only on the parameter dimension D but also on the true distribution.
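As an illustration of Eqs. (13.103) and (13.106), the following minimal sketch selects the polynomial dimension D by AIC and BIC. It rests on assumptions that are not in the text: the noise variance is known, the inputs are uniform on [−1, 1], and the true coefficients `w_true` (a cubic, so D* = 4) are hypothetical values. With Gaussian noise of known variance, the ML estimator coincides with the least-squares solution, which is what the code uses.

```python
import numpy as np

def max_log_likelihood(t, y, D, sigma2):
    """Maximum log-likelihood of a (D-1)-degree polynomial with known noise variance."""
    X = np.vander(t, D, increasing=True)           # columns 1, t, ..., t^(D-1)
    w_ml, *_ = np.linalg.lstsq(X, y, rcond=None)   # ML estimator = least squares
    rss = np.sum((y - X @ w_ml) ** 2)
    N = y.size
    return -0.5 * N * np.log(2 * np.pi * sigma2) - 0.5 * rss / sigma2

def aic(loglik, D):
    return -2.0 * loglik + 2.0 * D                 # Eq. (13.103)

def bic(loglik, D, N):
    return -2.0 * loglik + D * np.log(N)           # Eq. (13.106)

rng = np.random.default_rng(0)
N, sigma2 = 200, 0.09
w_true = np.array([1.0, -2.0, 0.5, 3.0])           # hypothetical true parameter, D* = 4
t = rng.uniform(-1.0, 1.0, N)
y = np.vander(t, w_true.size, increasing=True) @ w_true \
    + np.sqrt(sigma2) * rng.standard_normal(N)

candidates = range(1, 9)
logliks = {D: max_log_likelihood(t, y, D, sigma2) for D in candidates}
aics = {D: aic(logliks[D], D) for D in candidates}
bics = {D: bic(logliks[D], D, N) for D in candidates}
D_aic = min(aics, key=aics.get)
D_bic = min(bics, key=bics.get)
print(D_aic, D_bic)
```

The maximum log-likelihood never decreases as D grows, which is why it cannot be used alone for model selection; the penalty terms are what make the criteria prefer a finite D. Because the BIC penalty per parameter exceeds the AIC penalty once log N > 2, BIC never selects a larger D than AIC on the same data.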
13.5 Asymptotic Learning Theory for Singular Models
Many popular statistical models do not satisfy the regularity conditions. For example, neural networks, matrix factorization, mixture models, hidden Markov models, and Bayesian networks are all unidentifiable and have singularities, where the Fisher information is singular, in the parameter space. As discussed in Chapter 7, the true parameter is on a singular point when the true distribution is realizable with a model with parameter dimension smaller than the used model, i.e., when the model has redundant components for expressing the true distribution. In such cases, the likelihood cannot be Taylor-expanded about the true parameter, and the asymptotic normality does not hold.
Consequently, the regular learning theory, described in Section 13.4, cannot be applied to singular models. In this section, we first give intuition on how singularities affect generalization properties, and then introduce asymptotic theoretical results on ML learning and Bayesian learning. After that, we give an overview of asymptotic theory of VB learning, which will be described in detail in the subsequent chapters.
13.5.1 Effect of Singularities
Two types of effects of singularities have been observed, which will be detailed in the following subsections.
Basis Selection Effect
Consider a regression model for one-dimensional input $x \in [-10, 10]$ and output $y \in \mathbb{R}$ with $H$ radial basis function (RBF) units:
$$p(y|x, a, b, c) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2} \left( y - f(x; a, b, c) \right)^2 \right), \tag{13.108}$$
where
$$f(x; a, b, c) = \sum_{h=1}^{H} \rho_h\left(x; a_h, b_h, c_h^2\right). \tag{13.109}$$
Each RBF unit in Eq. (13.109) is a weighted Gaussian RBF,
$$\rho_h\left(x; a_h, b_h, c_h^2\right) = a_h \cdot \mathrm{Gauss}_1\left(x; b_h, c_h^2\right) = \frac{a_h}{\sqrt{2\pi c_h^2}} \exp\left( -\frac{(x - b_h)^2}{2 c_h^2} \right),$$
controlled by a weight parameter $a_h \in \mathbb{R}$, a mean parameter $b_h \in \mathbb{R}$, and a scale parameter $c_h^2 \in \mathbb{R}_{++}$. Treating the noise variance $\sigma^2$ in Eq. (13.108) as a known constant, the parameters to be estimated are summarized as $w = (a^\top, b^\top, c^\top)^\top \in \mathbb{R}^{3H}$, where $a = (a_1, \ldots, a_H)^\top \in \mathbb{R}^H$, $b = (b_1, \ldots, b_H)^\top \in \mathbb{R}^H$, and $c = (c_1^2, \ldots, c_H^2)^\top \in \mathbb{R}_{++}^H$. Figure 13.4(a) shows an example of the RBF regression function (13.109) for $H = 2$. Apparently, the model (13.108) is unidentifiable and has singularities: since $\rho_h(x; 0, b_h, c_h^2) = 0$ for any $b_h \in \mathbb{R}$, $c_h^2 \in \mathbb{R}_{++}$, the $(b_h, c_h^2)$ half-space at $a_h = 0$ is an unidentifiable set, on which the Fisher information is singular (see Figure 13.5).⁷ Accordingly, we call the model (13.108) a singular RBF regression model, of which the parameter dimension is equal to $D_{\mathrm{sin\text{-}RBF}} = 3H$.
⁷ More unidentifiable sets can exist, depending on the other RBF units. See Section 7.3.1 for details on identifiability.
[Figure 13.4, panel (a): Singular RBF with H = 2 units. Panel (b): Regular RBF with H = 6 units. Horizontal axis: x from −10 to 10; vertical axis: −1.5 to 1.5.]
Figure 13.4 Examples (solid curves) of the singular RBF regression function (13.109) and the regular RBF regression function (13.111). Each RBF unit ρh (x) is depicted as a dashed curve.
Figure 13.5 Singularities of the RBF regression model (13.108).
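The unidentifiability at $a_h = 0$ can be seen directly by evaluating the regression function (13.109). The following is a minimal sketch; the function names and the particular parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def rbf_unit(x, a, b, c2):
    """Weighted Gaussian RBF: a * Gauss_1(x; b, c2)."""
    return a / np.sqrt(2 * np.pi * c2) * np.exp(-(x - b) ** 2 / (2 * c2))

def rbf_regression(x, a, b, c2):
    """Regression function (13.109): sum of H weighted RBF units."""
    x = np.asarray(x, dtype=float)[..., None]      # broadcast over the H units
    return rbf_unit(x, np.asarray(a), np.asarray(b), np.asarray(c2)).sum(axis=-1)

x = np.linspace(-10.0, 10.0, 201)
# H = 2 with the second weight set to zero: (b_2, c_2^2) no longer affect the function.
f1 = rbf_regression(x, a=[3.0, 0.0], b=[-10.0, 2.0], c2=[1.0, 0.5])
f2 = rbf_regression(x, a=[3.0, 0.0], b=[-10.0, -7.0], c2=[1.0, 4.0])
print(np.max(np.abs(f1 - f2)))
```

Two different parameter vectors give the same regression function, and hence the same model distribution, so the parameter cannot be identified from data on this set.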
Let us consider another RBF regression model
$$p(y|x, a) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2} \left( y - f(x; a) \right)^2 \right), \tag{13.110}$$
where
$$f(x; a) = \sum_{h=1}^{H} \breve{\rho}_h(x; a_h) = \sum_{h=1}^{H} \rho_h\left(x; a_h, \breve{b}_h, \breve{c}_h^2\right). \tag{13.111}$$
Unlike the singular RBF model (13.108), we here treat the mean parameters $\breve{b} = (\breve{b}_1, \ldots, \breve{b}_H)^\top$ and the scale parameters $\breve{c} = (\breve{c}_1^2, \ldots, \breve{c}_H^2)^\top$ as fixed constants, and only estimate the weight parameters $a = (a_1, \ldots, a_H)^\top \in \mathbb{R}^H$.
Let us set the means and the scales as follows, so that the model covers the input domain $[-10, 10]$:
$$\breve{b}_h = -10 + 20 \cdot \frac{h - 1}{H - 1}, \tag{13.112}$$
$$\breve{c}_h^2 = 1. \tag{13.113}$$
Figure 13.4(b) shows an example of the RBF regression function (13.111) for $H = 6$. Clearly, it holds that $\breve{\rho}_h(x; a_h) \neq \breve{\rho}_h(x; a'_h)$ if $a_h \neq a'_h$, and therefore the model is identifiable. The other regularity conditions (summarized in Section 13.4.1) on the model distribution $p(y|x, a)$ are also satisfied. Accordingly, we call the model (13.110) a regular RBF regression model, of which the parameter dimension is equal to $D_{\mathrm{reg\text{-}RBF}} = H$. Now we investigate the difference in learning behavior between the singular RBF model (13.108) and the regular RBF model (13.110). Figure 13.6 shows trained regression functions (by ML learning) from $N = 50$ samples (shown as crosses) generated from the regression model,
$$q(y|x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2} \left( y - f^*(x) \right)^2 \right), \tag{13.114}$$
with the following true functions:
(i) poly: Polynomial function $f^*(x) = -0.002 x^3$.
(ii) cos: Cosine function $f^*(x) = \cos(0.5x)$.
(iii) tanh: Tangent hyperbolic function $f^*(x) = \tanh(-0.5x)$.
(iv) sin-sig: Sine times sigmoid function $f^*(x) = \sin(x) \cdot \frac{1}{1 + e^{-x}}$.
(v) sin-alg: Sine function aligned for the regular model $f^*(x) = \sin\left(2\pi \frac{9}{70} x\right)$.
(vi) rbf: Single RBF function $f^*(x) = \rho_1(x; 3, -10, 1)$.
The noise variance is set to σ2 = 0.01, and assumed to be known. We set the number of RBF units to H = 2 for the singular model, and H = 6 for the regular model, so that both models have the same degrees of freedom, Dsin−RBF = Dreg−RBF = 6. In Figure 13.6, we can observe the following: the singular RBF model can flexibly fit functions in different shapes (a) through (d), unless the function has too many peaks (e); the regular RBF model is not as flexible as the singular RBF model (a) through (d), unless the peaks and valleys match the predefined means of the RBF units (e). Actually, the frequency of sin-alg is aligned so that the peaks and the valleys match Eq. (13.112). These observations are quantitatively supported by the generalization error and the training error shown in Figure 13.7, leaving us an impression that the singular RBF model
[Figure 13.6, panels: (a) poly, (b) cos, (c) tanh, (d) sin-sig, (e) sin-alg, (f) rbf. Each panel plots the True, Singular, and Regular curves for x from −10 to 10.]
Figure 13.6 Trained regression functions by the singular RBF model (13.108) with H = 2 RBF units, and the regular RBF model (13.110) with H = 6 RBF units.
with two modifiable basis functions is more flexible than the regular RBF model with six prefixed basis functions. However, flexibility is granted at the risk of overfitting to noise, which can be observed in Figure 13.6(f). We can see that the true RBF function at x = −10 is captured by both models. However, the singular RBF model shows a small valley around x = 8, which is a consequence of overfitting to sample noise. Figure 13.7 also shows that, in the rbf case, the singular RBF model gives lower training error and higher generalization error than the regular RBF model—typical behavior when overfitting occurs. This overfitting tendency is reflected to the generalization and the training coefficients.
[Figure 13.7, panel (a): Generalization error. Panel (b): Training error. Each panel compares the Singular and Regular models on poly, cos, tanh, sin-sig, sin-alg, and rbf; vertical axis from 0 to 0.3.]
Figure 13.7 The generalization error and the training error by the singular RBF model and the regular RBF model.
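For the regular RBF model, ML learning with known noise variance reduces to linear least squares on the fixed basis, so the rbf example can be sketched in a few lines. The code below is only an illustration under assumptions that are not in the text: inputs are drawn uniformly from [−10, 10], the random seed is arbitrary, and the generalization error is approximated on a uniform grid. It does not reproduce the book's figures, and the singular model, which needs nonlinear optimization, is not covered.

```python
import numpy as np

def gauss(x, b, c2):
    return np.exp(-(x - b) ** 2 / (2 * c2)) / np.sqrt(2 * np.pi * c2)

def design_matrix(x, H=6):
    """Fixed basis of the regular RBF model: means from Eq. (13.112), scales from Eq. (13.113)."""
    b = -10.0 + 20.0 * np.arange(H) / (H - 1)
    return gauss(np.asarray(x, dtype=float)[:, None], b[None, :], 1.0)

def f_true(x):
    return 3.0 * gauss(x, -10.0, 1.0)              # example (vi): rho_1(x; 3, -10, 1)

rng = np.random.default_rng(0)
N, sigma2 = 50, 0.01
x = rng.uniform(-10.0, 10.0, N)                    # assumed input distribution
y = f_true(x) + np.sqrt(sigma2) * rng.standard_normal(N)

Phi = design_matrix(x)
a_ml, *_ = np.linalg.lstsq(Phi, y, rcond=None)     # ML estimator of the weights

# Training error: empirical mean of log q(y|x) - log p(y|x, a_ml).
train_err = np.mean((y - Phi @ a_ml) ** 2 - (y - f_true(x)) ** 2) / (2 * sigma2)
# Generalization error: KL divergence averaged over the assumed uniform q(x), on a fine grid.
xg = np.linspace(-10.0, 10.0, 20001)
gen_err = np.mean((design_matrix(xg) @ a_ml - f_true(xg)) ** 2) / (2 * sigma2)
print(train_err, gen_err)
```

Because the true function is realizable by this model and the ML estimator minimizes the residual sum of squares, the training error is never positive, while the generalization error is never negative.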
Apparently, if the true function is realizable, i.e., $\exists w^*$ s.t. $f^*(x) = f(x; w^*)$, the true distribution (13.114) is realizable by the RBF regression model (Eq. (13.108) or (13.110)). In the examples (a) through (e) in Figure 13.6, the true function is not realizable by the RBF regression model. In such cases, the generalization error and the training error do not converge to zero, and it holds that $\overline{\mathrm{GE}}(N) = \Theta(1)$ and $\overline{\mathrm{TE}}(N) = \Theta(1)$ for the best learning algorithm. On the other hand, the true function (f) consists of a single RBF unit, and furthermore its mean and variance match those of the first unit of the regular RBF model (see Eqs. (13.112) and (13.113)). Accordingly, the true function (f), and therefore the true distribution (13.114) in the example (f), is realizable, i.e., $\exists w^*$ s.t. $q(y|x) = p(y|x, w^*)$, by both the singular RBF model (13.108) and the regular RBF model (13.110). When the true parameter $w^*$ exists, the generalization error converges to zero, and, for any reasonable learning algorithm, the average generalization error and the average training error can be asymptotically expanded as Eqs. (13.15) and (13.16):
$$\overline{\mathrm{GE}}(N) = \lambda N^{-1} + o(N^{-1}), \qquad \overline{\mathrm{TE}}(N) = \nu N^{-1} + o(N^{-1}).$$
Since the regular RBF model (13.110) is regular, its generalization coefficient and training coefficient are given by
$$2\lambda_{\mathrm{reg\text{-}RBF}} = -2\nu_{\mathrm{reg\text{-}RBF}} = D_{\mathrm{reg\text{-}RBF}} = H \tag{13.115}$$
for ML learning, MAP learning, and Bayesian learning (under the additional regularity conditions on the prior). On the other hand, the generalization coefficient and the training coefficient for the singular RBF model (13.108) are unknown and can be significantly different from those of the regular models. As will be introduced in Section 13.5.3, the generalization coefficients of ML learning and MAP learning for various singular models have been clarified, and all results that have been found so far satisfy
$$2\lambda^{\mathrm{ML}}_{\mathrm{Singular}} \geq D, \qquad 2\lambda^{\mathrm{MAP}}_{\mathrm{Singular}} \geq D, \tag{13.116}$$
where $D$ is the parameter dimensionality. By comparing Eq. (13.116) with Eq. (13.115) (or Eqs. (13.67) and (13.81)), we find that the ML and the MAP generalization coefficients per single model parameter in singular models are larger than those in the regular models, which implies that singular models tend to overfit more than the regular models. We can explain this phenomenon as an effect of the neighborhood structure around singularities. Recall the example (f), where the singular RBF model and the regular RBF model learn the true function $f^*(x) = \rho_1(x; 3, -10, 1)$. For the singular RBF model, $w^*_{\mathrm{sin\text{-}RBF}} = (a_1, a_2, b_1, b_2, c_1^2, c_2^2) = (3, 0, -10, *, 1, *)$, where $*$ allows any value in the domain, are possible true parameters, while, for the regular RBF model, $w^*_{\mathrm{reg\text{-}RBF}} = (a_1, a_2, \ldots, a_6) = (3, 0, \ldots, 0)$ is the unique true parameter. Figure 13.8(a) shows the space of the three parameters $(a_2, b_2, c_2^2)$ of the second RBF unit of the singular RBF model, in which the true parameter is on the singularities. Since the true parameter extends over the two-dimensional half-space $\{(b_2, c_2^2); b_2 \in \mathbb{R}, c_2^2 \in \mathbb{R}_{++}\}$, the neighborhood (shown by small arrows) contains any RBF with adjustable mean and variance. Although the estimated parameter converges to the singularities in the asymptotic limit, ML learning on finite training samples tries to fit the noise, which contaminates the training samples, by selecting the optimal basis function, where the optimality is in terms of the training error. On the other hand, Figure 13.8(b) shows the parameter space $(a_h, \breve{b}_h, \breve{c}_h^2)$ for $h = 2, \ldots, 4$. For each $h$, the true distribution corresponds to a single point, indicated by a shadowed circle, and its neighborhood extends only in one direction, i.e., $a_h = 0 \pm \varepsilon$ with a prefixed RBF basis specified by the constants $(\breve{b}_h, \breve{c}_h^2)$.
Consequently, with the same number of redundant parameters as the singular RBF model, ML learning tries to fit the training noise only with those three basis functions. Although the probability that the three prefixed basis functions can fit the training noise better than a single flexible basis function is not zero, we would expect that the singular RBF model would likely capture the noise more flexibly than the regular RBF model. This intuition is supported by previous theoretical work that showed Eq. (13.116) in many singular models,
[Figure 13.8, panel (a): Singular RBF. Panel (b): Regular RBF.]
Figure 13.8 Neighborhood of the true distribution in the rbf example. (a) The parameter space of the second (h = 2) RBF unit of the singular RBF model. The true parameter is on the singularities, of which the neighborhood contains any RBF with adjustable mean and variance. (b) The parameter space of the second to the fourth (h = 2, . . . , 4) RBF units of the regular RBF model. With the same degrees of freedom as a single singular RBF unit, the neighborhood of the true parameter contains only three different RBF bases with prefixed means and variances.
as well as the numerical example in Figure 13.6. We call this phenomenon, i.e., singular models tending to overfit more than regular models, the basis selection effect. Although Eq. (13.116) was shown for ML learning and MAP learning, the basis selection effect should occur for any reasonable learning algorithms, including Bayesian learning. However, in Bayesian learning, this effect is canceled by the other effect of singularities, which is explained in the following subsection. Integration Effect Assume that, in Bayesian learning with a singular model, we adopt a prior distribution p(w) bounded as 0 < p(w) < ∞ at any w ∈ W. This assumption is the same as one of the regularity conditions in Section 13.4.1. However, this assumption excludes the use of the Jeffreys prior (see Appendix B) and positive mass is distributed over singularities. As discussed in detail in Chapter 7, this prior choice leads to nonuniformity of the volume element and favors models with smaller degrees of freedom, if a learning algorithm involving integral computations in the parameter space is adopted. As a result, singularities induce MIR in Bayesian learning and its approximation methods, e.g., VB learning. Importantly, the integration effect does not occur in point estimation methods, including ML learning and MAP learning, since the nonuniformity of
the volume element affects the estimator only through integral computations. We call this phenomenon the integration effect of singularities. The basis selection effect and the integration effect influence the learning behavior in opposite ways: the former intensifies overfitting, while the latter suppresses it. A question is which is stronger in Bayesian learning. Singular learning theory, which will be introduced in Section 13.5.4, has already answered this question. The following has been shown for any singular model:
$$2\lambda^{\mathrm{Bayes}}_{\mathrm{Singular}} \leq D. \tag{13.117}$$
Comparing Eq. (13.117) with Eq. (13.115) (or Eq. (13.94)), we find that the Bayes generalization coefficient per single model parameter in the singular models is smaller than that in the regular models. Note that the conclusion is opposite to that for ML learning and MAP learning: singular models overfit training noise more than the regular models in ML learning and MAP learning, while they overfit less in Bayesian learning. Since the basis selection effect should occur in any reasonable learning algorithm, we can interpret Eq. (13.117) as evidence that the integration effect is stronger than the basis selection effect in Bayesian learning. One might wonder why Bayesian learning is not analyzed with the Jeffreys prior, the parameterization-invariant noninformative prior. Actually, the Jeffreys prior, or any other prior distribution with zero mass at the singularities, is rarely used in singular models for computational reasons: when the computational tractability relies on the (conditional) conjugacy, the Jeffreys prior is not an option in singular models; when some sampling method is used for approximating the Bayes posterior, the diverging outskirts of the Jeffreys prior prevent the sampling sequence from converging. Note that this excludes the empirical Bayesian procedure, where the prior can be collapsed after training. Little is known about the learning behavior of empirical Bayesian learning in singular models, and the asymptotic learning theory part (Part IV) of this book also excludes this case. In the following subsections, we give a brief summary of theoretical results that revealed learning properties of singular models.
13.5.2 Conditions Assumed in Asymptotic Theory for Singular Models Singular models were analyzed under the following conditions on the true distribution and the prior distribution:
(i) The true distribution is realizable by the statistical model, i.e., $\exists w^*$ s.t. $q(x) = p(x|w^*)$.
(ii) The prior $p(w)$ is twice differentiable and bounded as $0 < p(w) < \infty$ at any $w \in \mathcal{W}$.
Under the second condition, the prior choice does not affect the generalization coefficient. Accordingly, the results, introduced in the following subsection, for ML learning can be directly applied to MAP learning.
13.5.3 ML Learning and MAP Learning
Fukumizu (1999) analyzed the asymptotic behavior of the generalization error of ML learning for the reduced rank regression (RRR) model (3.36), by applying random matrix theory to evaluate the singular value distribution of the ML estimator. Specifically, the large-scale limit where the dimensions of the input and the output are infinitely large was considered, and the exact generalization coefficient was derived. The training coefficient can be obtained in the same way (Nakajima and Watanabe, 2007). The Gaussian mixture model (GMM) (4.6) has been studied as a prototype of singular models in the case of ML learning. Akaho and Kappen (2000) showed that the generalization error and the training error behave quite differently from regular models. As defined in Eq. (13.12), $-N \cdot \mathrm{TE}^{\mathrm{ML}}(X)$ is the log-likelihood ratio, which asymptotically follows the chi-squared distribution for regular models, while little is known about its behavior for singular models. In fact, it is conjectured for the spherical GMM (4.6) that the log-likelihood ratio diverges to infinity in the order of $\log \log N$ (Hartigan, 1985). For mixture models with discrete components such as binomial mixture models, the asymptotic distribution of the log-likelihood ratio was studied through the distribution of the maximum of the Gaussian random field (Bickel and Chernoff, 1993; Takemura and Kuriki, 1997; Kuriki and Takemura, 2001). Based on the idea of locally conic parameterization (Dacunha-Castelle and Gassiat, 1997), the asymptotic behaviors of the log-likelihood ratio in some singular models were analyzed. For some mixture models with continuous components, including GMMs, it can be proved that the log-likelihood ratio diverges to infinity as $N \to \infty$. In neural networks, it is known that the log-likelihood ratio diverges in the order of $\log N$ when there are at least two redundant hidden units (Fukumizu, 2003; Hagiwara and Fukumizu, 2008).
In all previous works, the obtained generalization coefficient or its equivalent satisfies Eq. (13.116).
13.5.4 Singular Learning Theory for Bayesian Learning
For analyzing the generalization performance of Bayesian learning, a general approach, called the singular learning theory (SLT), was established, based on the mathematical techniques in algebraic geometry (Watanabe, 2001a, 2009). The average relative Bayes free energy (13.23),
$$\overline{\tilde{F}}^{\mathrm{Bayes}}(N) = \left\langle \log \frac{\prod_{n=1}^{N} q(x^{(n)})}{\int p(w) \prod_{n=1}^{N} p(x^{(n)}|w) \, dw} \right\rangle_{q(X)},$$
can be approximated as
$$\overline{\tilde{F}}^{\mathrm{Bayes}}(N) \approx -\log \int \exp\left( -N \left\langle \log \frac{q(x)}{p(x|w)} \right\rangle_{q(x)} \right) \cdot p(w) \, dw = -\log \int \exp\left( -N E(w) \right) \cdot p(w) \, dw, \tag{13.118}$$
where
$$E(w) = \left\langle \log \frac{q(x)}{p(x|w)} \right\rangle_{q(x)} \tag{13.119}$$
is the KL divergence between the true distribution $q(x)$ and the model distribution $p(x|w)$.⁸ Let us regard the KL divergence (13.119) as the energy in physics, and define the state density function for the energy value $s > 0$:
$$v(s) = \int \delta(s - E(w)) \cdot p(w) \, dw, \tag{13.120}$$
where $\delta(\cdot)$ is the Dirac delta function (located at the origin). Note that the state density (13.120) and the (approximation to the relative) Bayes free energy (13.118) are connected by the Laplace transform:
$$\overline{\tilde{F}}^{\mathrm{Bayes}}(N) = -\log \int \exp(-s) \, v\!\left( \frac{s}{N} \right) \frac{ds}{N}. \tag{13.121}$$
Define furthermore the zeta function as the Mellin transform, an extension of the Laplace transform, of the state density (13.120):
$$\zeta(z) = \int s^z v(s) \, ds = \int E(w)^z p(w) \, dw. \tag{13.122}$$
The zeta function (13.122) is a function of a complex number $z \in \mathbb{C}$, and it was proved that all the poles of $\zeta(z)$ are real, negative, and rational numbers.
⁸ It holds that $\overline{\tilde{F}}^{\mathrm{Bayes}}(N) = -\log \int \exp\left( -N E(w) \right) \cdot p(w) \, dw + O(1)$ if the support of the prior is compact (Watanabe, 2001a, 2009).
By using the relations through Laplace/Mellin transform among the free energy (13.118), the state density (13.120), and the zeta function (13.122), Watanabe (2001a) proved the following theorem:
Theorem 13.13 (Watanabe, 2001a, 2009) Let $0 > -\lambda_1 > -\lambda_2 > \cdots$ be the sequence of the poles of the zeta function (13.122) in decreasing order, and $m_1, m_2, \ldots$ be the corresponding orders of the poles. Then the average relative Bayes free energy (13.119) can be asymptotically expanded as
$$\overline{\tilde{F}}^{\mathrm{Bayes}}(N) = \lambda_1 \log N - (m_1 - 1) \log \log N + O(1). \tag{13.123}$$
Let $c(N) = \overline{\tilde{F}}^{\mathrm{Bayes}}(N) - \lambda_1 \log N + (m_1 - 1) \log \log N$ be the $O(1)$ term in Eq. (13.123). The relation (13.24) between the generalization error and the free energy leads to the following corollary:
Corollary 13.14 (Watanabe, 2001a, 2009) If $c(N + 1) - c(N) = o\left( \frac{1}{N \log N} \right)$, the average generalization error (13.13) can be asymptotically expanded as
$$\overline{\mathrm{GE}}^{\mathrm{Bayes}}(N) = \frac{\lambda_1}{N} - \frac{m_1 - 1}{N \log N} + o\left( \frac{1}{N \log N} \right). \tag{13.124}$$
To sum up, finding the maximum pole $-\lambda_1$ of the zeta function $\zeta(z)$ gives the Bayes free energy coefficient $\lambda'^{\,\mathrm{Bayes}} = \lambda_1$, which is equal to the Bayes generalization coefficient $\lambda^{\mathrm{Bayes}} = \lambda_1$. Note that Theorem 13.13 and Corollary 13.14 hold both for regular and singular models. As discussed in Section 7.3.2, MIR (or the integration effect of singularities) is caused by strong nonuniformity of the density of the volume element. Since the state density (13.120) reflects the strength of the nonuniformity, one can see that finding the maximum pole of $\zeta(z)$ amounts to finding the strength of the nonuniformity at the most concentrated point. Some general inequalities were proven (Watanabe, 2001b, 2009):
• If the prior is positive at any singular point, i.e., $p(w) > 0$, $\forall w \in \{w; \det(F(w)) = 0\}$, then
$$2\lambda'^{\,\mathrm{Bayes}} = 2\lambda^{\mathrm{Bayes}} \leq D. \tag{13.125}$$
• If the Jeffreys prior (see Appendix B.4) is adopted, for which $p(w) = 0$ holds at any singular point, then
$$2\lambda'^{\,\mathrm{Bayes}} = 2\lambda^{\mathrm{Bayes}} \geq D. \tag{13.126}$$
Some cases have been found where $2\lambda'^{\,\mathrm{Bayes}} = 2\lambda^{\mathrm{Bayes}}$ is strictly larger than $D$. These results support the discussion in Section 13.5.1 on the two effects of singularities: Eq. (13.125) implies that the integration effect dominates the basis selection effect in Bayesian learning, and Eq. (13.126) implies that the basis selection effect appears also in Bayesian learning if the integration effect is suppressed by using the Jeffreys prior. Theorem 13.13 and Corollary 13.14 hold for general statistical models, while they do not immediately tell us learning properties of singular models. This is because finding the maximum pole of the zeta function $\zeta(z)$ is not an easy task, and requires a specific technique in algebraic geometry called the resolution of singularities. The good news is that, when any pole larger than $-D/2$ is found, it provides an upper bound of the generalization coefficient and thus guarantees the performance with a tighter bound (Theorem 13.13 implies that the larger the found pole is, the tighter the provided bound is). For the RRR model (Aoyagi and Watanabe, 2005) and for the GMM (Aoyagi and Nagata, 2012), the maximum pole was found for general cases, and therefore the exact value of the free energy coefficient, as well as the generalization coefficient, was obtained. In other singular models, including neural networks (Watanabe, 2001a), mixture models (Yamazaki and Watanabe, 2003a), hidden Markov models (Yamazaki and Watanabe, 2005), and Bayesian networks (Yamazaki and Watanabe, 2003b; Rusakov and Geiger, 2005), upper bounds of the free energy coefficient were obtained by finding some poles of the zeta function. An effort has been made to perform the resolution of singularities systematically by using the Newton diagram (Yamazaki and Watanabe, 2004).
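The role of the maximum pole can be illustrated with two toy energy functions that are our own choices, not taken from the text, both with D = 2 and a uniform prior on the unit square. For the regular-type energy $E(a, b) = a^2 + b^2$, the zeta function (13.122) has its maximum pole at $z = -1 = -D/2$ with order 1. For the singular-type energy $E(a, b) = a^2 b^2$, we have $\zeta(z) = \left( \int_0^1 a^{2z} da \right)^2 = (2z + 1)^{-2}$, so $\lambda_1 = 1/2$ and $m_1 = 2$. The sketch below evaluates $-\log \int \exp(-N E(w)) p(w) dw$ for both and estimates the slope with respect to $\log N$; all function names are hypothetical.

```python
import math
import numpy as np

def free_energy_regular(N):
    """-log of the integral of exp(-N (a^2 + b^2)) over the unit square (uniform prior)."""
    z1 = math.sqrt(math.pi) * math.erf(math.sqrt(N)) / (2 * math.sqrt(N))
    return -2.0 * math.log(z1)

def free_energy_singular(N, M=200000):
    """-log of the integral of exp(-N a^2 b^2) over the unit square (uniform prior).

    The integral over a is done in closed form; the integral over b uses the midpoint rule.
    """
    b = (np.arange(M) + 0.5) / M
    u = math.sqrt(N) * b
    erf_u = np.array([math.erf(v) for v in u])
    inner = math.sqrt(math.pi) * erf_u / (2 * u)
    return -math.log(inner.mean())

def slope(F, N1=1e3, N2=1e5):
    """Finite-difference estimate of dF / d(log N)."""
    return (F(N2) - F(N1)) / (math.log(N2) - math.log(N1))

slope_reg = slope(free_energy_regular)
slope_sing = slope(free_energy_singular)
print(slope_reg, slope_sing)
```

The regular-type slope matches $\lambda_1 = D/2 = 1$, while the singular-type slope stays below $\lambda_1 = 1/2$ at finite N because of the $-(m_1 - 1) \log \log N$ term in Eq. (13.123). Both observations agree with $2\lambda_1 \leq D$ in Eq. (13.125).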
13.5.5 Information Criteria for Singular Models

The information criteria introduced in Section 13.4.8 rely on the learning theory under the regularity conditions. Therefore, although they were sometimes applied for model selection in singular models, their relations to generalization properties, e.g., AIC to the ML generalization error, and BIC to the Bayes free energy, do not generally hold. In the following, we introduce information criteria applicable for general statistical models including the regular and the singular models (Watanabe, 2009, 2010, 2013). They also cover a generalization of Bayesian learning. Consider a learning method, called generalized Bayesian learning, based on the generalized posterior distribution,
$$p^{(\beta)}(w|X) = \frac{p(w) \prod_{n=1}^{N} \left\{p(x^{(n)}|w)\right\}^{\beta}}{\int p(w) \prod_{n=1}^{N} \left\{p(x^{(n)}|w)\right\}^{\beta} dw}, \tag{13.127}$$
where $\beta$, called the inverse temperature parameter, modifies the importance of the likelihood per training sample. The prediction is made by the generalized predictive distribution,
$$p^{(\beta)}(x|X) = \left\langle p(x|w) \right\rangle_{p^{(\beta)}(w|X)}. \tag{13.128}$$
Generalized Bayesian learning covers both Bayesian learning and ML learning as special cases: when $\beta = 1$, the generalized posterior distribution (13.127) is reduced to the Bayes posterior distribution (13.9), with which the generalized predictive distribution (13.128) gives the Bayes predictive distribution (13.8); as $\beta$ increases, the probability mass of the generalized posterior distribution concentrates around the ML estimator, and, in the limit when $\beta \to \infty$, the generalized predictive distribution converges to the ML predictive distribution (13.6). Define the following quantities:
$$\mathrm{GL}(X) = -\left\langle \log \int p(x|w)\, p^{(\beta)}(w|X)\, dw \right\rangle_{q(x)}, \tag{13.129}$$
$$\mathrm{TL}(X) = -\frac{1}{N} \sum_{n=1}^{N} \log \int p(x^{(n)}|w)\, p^{(\beta)}(w|X)\, dw, \tag{13.130}$$
$$\mathrm{GGL}(X) = -\left\langle \int \log p(x|w)\, p^{(\beta)}(w|X)\, dw \right\rangle_{q(x)}, \tag{13.131}$$
$$\mathrm{GTL}(X) = -\frac{1}{N} \sum_{n=1}^{N} \int \log p(x^{(n)}|w)\, p^{(\beta)}(w|X)\, dw, \tag{13.132}$$
which are called the Bayes generalization loss, the Bayes training loss, the Gibbs generalization loss, and the Gibbs training loss, respectively. The generalization error and the training error of generalized Bayesian learning are, respectively, related to the Bayes generalization loss and the Bayes training loss as follows (Watanabe, 2009):
$$\mathrm{GE}^{(\beta)}(X) = \left\langle \log \frac{q(x)}{\int p(x|w)\, p^{(\beta)}(w|X)\, dw} \right\rangle_{q(x)} = \mathrm{GL}(X) - S, \tag{13.133}$$
$$\mathrm{TE}^{(\beta)}(X) = \frac{1}{N} \sum_{n=1}^{N} \log \frac{q(x^{(n)})}{\int p(x^{(n)}|w)\, p^{(\beta)}(w|X)\, dw} = \mathrm{TL}(X) - S_N(X), \tag{13.134}$$
where
$$S = -\left\langle \log q(x) \right\rangle_{q(x)} \quad \text{and} \quad S_N(X) = -\frac{1}{N} \sum_{n=1}^{N} \log q(x^{(n)}) \tag{13.135}$$
are the entropy of the true distribution and its empirical version, respectively.9 Also, Gibbs counterparts have the following relations:
$$\mathrm{GGE}^{(\beta)}(X) = \left\langle \int \log \frac{q(x)}{p(x|w)}\, p^{(\beta)}(w|X)\, dw \right\rangle_{q(x)} = \mathrm{GGL}(X) - S, \tag{13.136}$$
$$\mathrm{GTE}^{(\beta)}(X) = \frac{1}{N} \sum_{n=1}^{N} \int \log \frac{q(x^{(n)})}{p(x^{(n)}|w)}\, p^{(\beta)}(w|X)\, dw = \mathrm{GTL}(X) - S_N(X). \tag{13.137}$$
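Here $\mathrm{GGE}^{(\beta)}(X)$ and $\mathrm{GTE}^{(\beta)}(X)$ are the generalization error and the training error, respectively, of Gibbs learning, where prediction is made by $p(x|w)$ with its parameter $w$ sampled from the generalized posterior distribution (13.127). The following relations were proven (Watanabe, 2009):
$$\left\langle \mathrm{GL}(X) \right\rangle_{q(X)} = \left\langle \mathrm{TL}(X) \right\rangle_{q(X)} + 2\beta \left( \left\langle \mathrm{GTL}(X) \right\rangle_{q(X)} - \left\langle \mathrm{TL}(X) \right\rangle_{q(X)} \right) + o(N^{-1}), \tag{13.138}$$
$$\left\langle \mathrm{GGL}(X) \right\rangle_{q(X)} = \left\langle \mathrm{GTL}(X) \right\rangle_{q(X)} + 2\beta \left( \left\langle \mathrm{GTL}(X) \right\rangle_{q(X)} - \left\langle \mathrm{TL}(X) \right\rangle_{q(X)} \right) + o(N^{-1}), \tag{13.139}$$
which imply that asymptotically unbiased estimators for generalization losses (the left-hand sides of Eqs. (13.138) and (13.139)) can be constructed from training losses (the right-hand sides). The aforementioned equations lead to widely applicable information criteria (WAIC) (Watanabe, 2009), defined as
$$\mathrm{WAIC}_1 = \mathrm{TL}(X) + 2\beta \left( \mathrm{GTL}(X) - \mathrm{TL}(X) \right), \tag{13.140}$$
$$\mathrm{WAIC}_2 = \mathrm{GTL}(X) + 2\beta \left( \mathrm{GTL}(X) - \mathrm{TL}(X) \right). \tag{13.141}$$

9 $S_N(X)$ was defined in Eq. (13.20), and it holds that $S = \left\langle S_N(X) \right\rangle_{q(X)}$.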
Clearly, $\mathrm{WAIC}_1$ and $\mathrm{WAIC}_2$ are asymptotically unbiased estimators for the Bayes generalization loss $\mathrm{GL}(X)$ and the Gibbs generalization loss $\mathrm{GGL}(X)$, respectively, and therefore minimizing them amounts to minimizing the Bayes generalization error (13.133) and the Gibbs generalization error (13.136), respectively. The training losses, $\mathrm{TL}(X)$ and $\mathrm{GTL}(X)$, can be computed by, e.g., MCMC sampling (see Sections 2.2.4 and 2.2.5). Let $w^{(1)}, \ldots, w^{(T)}$ be samples drawn from the generalized posterior distribution (13.127). Then we can estimate the training losses by
$$\mathrm{TL}(X) \approx -\frac{1}{N} \sum_{n=1}^{N} \log \left( \frac{1}{T} \sum_{t=1}^{T} p(x^{(n)}|w^{(t)}) \right), \tag{13.142}$$
$$\mathrm{GTL}(X) \approx -\frac{1}{N} \sum_{n=1}^{N} \frac{1}{T} \sum_{t=1}^{T} \log p(x^{(n)}|w^{(t)}). \tag{13.143}$$
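The estimators (13.142) and (13.143) can be sketched in a few lines of NumPy. This is only a minimal illustration: the function name `waic` and the convention that the sampler output is stored as a $T \times N$ matrix of log-likelihood values are assumptions made here, not part of the original text.

```python
import numpy as np

def waic(loglik, beta=1.0):
    """Estimate WAIC1 and WAIC2 from a (T, N) array with
    loglik[t, n] = log p(x^(n) | w^(t)), where w^(t) are draws from the
    generalized posterior (13.127) with inverse temperature beta."""
    loglik = np.asarray(loglik, dtype=float)
    # Bayes training loss (13.142): log of the sample-averaged likelihood,
    # computed with the log-sum-exp trick for numerical stability.
    m = loglik.max(axis=0)
    log_mean_lik = m + np.log(np.exp(loglik - m).mean(axis=0))
    tl = -log_mean_lik.mean()
    # Gibbs training loss (13.143): sample average of the log-likelihood.
    gtl = -loglik.mean()
    penalty = 2.0 * beta * (gtl - tl)
    return tl + penalty, gtl + penalty, tl, gtl
```

By Jensen's inequality the Gibbs training loss is never smaller than the Bayes training loss, so the correction term $2\beta(\mathrm{GTL} - \mathrm{TL})$ returned by this sketch is nonnegative.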
WAIC can be seen as an extension of AIC, since minimizing it amounts to minimizing an asymptotically unbiased estimator for the generalization error. Indeed, under the regularity conditions, it holds that
$$\lim_{\beta \to \infty} 2\beta \left( \mathrm{GTL}(X) - \mathrm{TL}(X) \right) = \frac{D}{N},$$
and therefore
$$\mathrm{WAIC}_1, \mathrm{WAIC}_2 \to \frac{\mathrm{AIC}}{2N} \quad \text{as } \beta \to \infty.$$
An extension of BIC was also proposed. The widely applicable Bayesian information criterion (WBIC) (Watanabe, 2013) is defined as
$$\mathrm{WBIC} = -\int \sum_{n=1}^{N} \log p(x^{(n)}|w)\, p^{(\beta = 1/\log N)}(w|X)\, dw, \tag{13.144}$$
where $p^{(\beta = 1/\log N)}(w|X)$ is the generalized posterior distribution (13.127) with the inverse temperature parameter set to $\beta = 1/\log N$. It was shown that
$$F^{\mathrm{Bayes}}(X) \left( \equiv -\log \int p(w) \prod_{n=1}^{N} p(x^{(n)}|w)\, dw \right) = \mathrm{WBIC} + O_p\left(\sqrt{\log N}\right), \tag{13.145}$$
and therefore WBIC can be used as an estimator or approximation for the Bayes free energy (13.18) when $N$ is large. It was also shown that, under the regularity conditions, it holds that
$$\mathrm{WBIC} = \frac{\mathrm{BIC}}{2} + O_p(1).$$
WBIC (13.144) can be estimated, similarly to WAIC, from samples $w^{(1)}_{\beta = 1/\log N}, \ldots, w^{(T)}_{\beta = 1/\log N}$ drawn from $p^{(\beta = 1/\log N)}(w|X)$ as
$$\mathrm{WBIC} \approx -\frac{1}{T} \sum_{t=1}^{T} \sum_{n=1}^{N} \log p(x^{(n)}|w^{(t)}_{\beta = 1/\log N}). \tag{13.146}$$
Note that evaluating the Bayes free energy (13.18) is much more computationally demanding in general. For example, the all temperatures method (Watanabe, 2013) requires posterior samples $\{w^{(t)}_{\beta_j}\}$ for many $0 = \beta_1 < \beta_2 < \cdots < \beta_J = 1$, and estimates the Bayes free energy as
$$F^{\mathrm{Bayes}}(X) \approx -\sum_{j=1}^{J-1} \log \left( \frac{1}{T_j} \sum_{t=1}^{T_j} \exp \left( (\beta_{j+1} - \beta_j) \sum_{n=1}^{N} \log p(x^{(n)}|w^{(t)}_{\beta_j}) \right) \right).$$
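As a rough illustration of Eq. (13.146), the following sketch evaluates WBIC for a toy model in which the generalized posterior can be sampled exactly, so no MCMC is needed. The model (a Gaussian with unknown mean, unit variance, and a zero-mean Gaussian prior with standard deviation `s0`) and all constants are assumptions chosen for illustration; for this model the Bayes free energy is available in closed form and is computed for comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, s0 = 200, 5000, 10.0           # sample size, number of draws, prior std (illustrative)
x = rng.normal(0.3, 1.0, size=N)     # data from N(0.3, 1)

def neg_loglik(w):
    """-sum_n log N(x_n; w, 1), evaluated for each entry of the array w."""
    w = np.atleast_1d(w)
    sq = ((x[None, :] - w[:, None]) ** 2).sum(axis=1)
    return 0.5 * N * np.log(2 * np.pi) + 0.5 * sq

# Generalized posterior (13.127) at beta = 1/log N is Gaussian for this model.
beta = 1.0 / np.log(N)
prec = 1.0 / s0**2 + beta * N
w_samples = rng.normal(beta * x.sum() / prec, prec**-0.5, size=T)

wbic = neg_loglik(w_samples).mean()  # Eq. (13.146)

# Closed-form Bayes free energy: marginally x ~ N(0, I + s0^2 * 1 1^T).
quad = (x**2).sum() - s0**2 * x.sum()**2 / (1 + N * s0**2)
free_energy = 0.5 * N * np.log(2 * np.pi) + 0.5 * np.log(1 + N * s0**2) + 0.5 * quad
```

In this regular one-parameter example the two quantities differ only by a term that does not grow with $N$, which is within the $O_p(\sqrt{\log N})$ accuracy stated in Eq. (13.145).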
13.6 Asymptotic Learning Theory for VB Learning

In the rest of Part IV, we describe asymptotic learning theory for VB learning in detail. Here we give an overview of the subsequent chapters. VB learning is rarely applied to regular models.10 Actually, if the model (and the prior) satisfies the regularity conditions, Laplace approximation (Section 2.2.1) can give a good approximation to the posterior, because of the asymptotic normality (Theorem 13.3). Accordingly, we focus on singular models when analyzing VB learning. We are interested in generalization properties of the VB posterior, which is defined as
$$\hat{r} \equiv \operatorname*{argmin}_{r} F(r) \quad \text{s.t.} \quad r \in \mathcal{G}, \tag{13.147}$$
where
$$F(r) = \left\langle \log \frac{r(w)}{p(w) \prod_{n=1}^{N} p(x^{(n)}|w)} \right\rangle_{r(w)} = \mathrm{KL}\left( r(w) \,\|\, p(w|X) \right) + F^{\mathrm{Bayes}}(X) \tag{13.148}$$
is the free energy and $\mathcal{G}$ is the model-specific constraint, imposed for computational tractability, on the approximate posterior.

10 VB learning is often applied to a linear model with an ARD prior. In such a model, the model likelihood satisfies the regularity conditions, while the prior does not. Actually, the model exhibits characteristics of singular models, since it can be translated to a singular model with a constant prior (see Section 7.5). In Part IV, we only consider the case where the prior is fixed, without any hyperparameter optimized.
With the VB predictive distribution $p^{\mathrm{VB}}(x|X) = \left\langle p(x|w) \right\rangle_{\hat{r}(w)}$, the generalization error (13.10) and the training error (13.11) are defined and analyzed. We also analyze the VB free energy,
$$F^{\mathrm{VB}}(X) = F(\hat{r}) = \min_{r} F(r). \tag{13.149}$$
Since Eq. (13.148) implies that $F^{\mathrm{VB}}(X) - F^{\mathrm{Bayes}}(X) = \mathrm{KL}\left( \hat{r}(w) \,\|\, p(w|X) \right)$, comparing the VB free energy and the Bayes free energy reveals how accurately VB learning approximates Bayesian learning. Similarly to the analysis of Bayesian learning, we investigate the asymptotic behavior of the relative VB free energy,
$$\widetilde{F}^{\mathrm{VB}}(X) = F^{\mathrm{VB}}(X) - N S_N(X) = \lambda'^{\mathrm{VB}} \log N + o_p(\log N), \tag{13.150}$$
where S N (X) is the empirical entropy defined in Eq. (13.20), and λ VB is called the VB free energy coefficient. Chapter 14 introduces asymptotic VB theory for the RRR model. This model was relatively easily analyzed by using the analytic-form solution for fully observed matrix factorization (Chapter 6), and the exact values of the VB generalization coefficient, the VB training coefficient, and the VB free energy coefficient were derived (Nakajima and Watanabe, 2007). Since generalization properties of ML learning and Bayesian learning have also been clarified (Fukumizu, 1999; Aoyagi and Watanabe, 2005), similarities and dissimilarities among ML (and MAP) learning, Bayesian learning, and VB learning will be discussed. Chapters 15 through 17 are devoted to asymptotic VB theory for latent variable models. Chapter 15 analyzes the VB free energy of mixture models. The VB free energy coefficients and their dependencies on prior hyperparameters are revealed. Chapter 16 proceeds to such analyses of the VB free energy for other latent variable models, namely, Bayesian networks, hidden Markov models, probabilistic context free grammar, and latent Dirichlet allocation. Chapter 17 provides a formula for general latent variable models, which reduces the asymptotic analysis of the VB free energy to that of the Bayes free energy introduced in Section 13.5.4. Those results will clarify phase transition phenomena with respect to the hyperparameter setting—the shape of the posterior distribution in the asymptotic limit drastically changes when some
hyperparameter value exceeds a certain threshold. This implication suggests to practitioners how to choose hyperparameters. Note that the relation (13.25) does not necessarily hold for VB learning and other approximate Bayesian methods, since Eq. (13.24) only holds for the exact Bayes predictive distribution. Therefore, unlike Bayesian learning, the asymptotic behavior of the VB free energy does not necessarily inform us of the asymptotic behavior of the VB generalization error. An attempt to relate the VB free energy and the VB generalization error is introduced in Chapter 17, although clarifying the VB generalization error requires further effort and techniques.
14 Asymptotic VB Theory of Reduced Rank Regression
In this chapter, we introduce asymptotic theory of VB learning in the reduced rank regression (RRR) model (Nakajima and Watanabe, 2007). Among the singular models, the RRR model is one of the simplest, and many aspects of its learning behavior have been clarified. Accordingly, we can discuss similarities and dissimilarities of ML (and MAP) learning, Bayesian learning, and VB learning in terms of generalization error, training error, and free energy. After defining the problem setting, we show theoretical results and summarize insights into VB learning that the analysis on the RRR model provides.
14.1 Reduced Rank Regression

RRR (Baldi and Hornik, 1995; Reinsel and Velu, 1998), introduced in Section 3.1.2 as a special case of fully observed matrix factorization, is a regression model with a rank-$H$ ($\le \min(L, M)$) linear mapping between input $x \in \mathbb{R}^M$ and output $y \in \mathbb{R}^L$:
$$y = B A^\top x + \varepsilon, \tag{14.1}$$
where $A \in \mathbb{R}^{M \times H}$ and $B \in \mathbb{R}^{L \times H}$ are parameters to be estimated, and $\varepsilon$ is observation noise. Assuming Gaussian noise $\varepsilon \sim \mathrm{Gauss}_L(0, \sigma'^2 I_L)$, the model distribution is given as
$$p(y|x, A, B) = \left( 2\pi\sigma'^2 \right)^{-L/2} \exp\left( -\frac{1}{2\sigma'^2} \left\| y - B A^\top x \right\|^2 \right). \tag{14.2}$$
RRR is also called a linear neural network, since the three-layer neural network (7.13) is reduced to RRR (14.1) if the activation function $\psi(\cdot)$ is linear (see also Figure 3.1). We assume conditionally conjugate Gaussian priors for the parameters:
$$p(A) \propto \exp\left( -\frac{1}{2} \mathrm{tr}\left( A C_A^{-1} A^\top \right) \right), \qquad p(B) \propto \exp\left( -\frac{1}{2} \mathrm{tr}\left( B C_B^{-1} B^\top \right) \right), \tag{14.3}$$
with diagonal covariances $C_A$ and $C_B$:
$$C_A = \mathrm{Diag}(c_{a_1}^2, \ldots, c_{a_H}^2), \qquad C_B = \mathrm{Diag}(c_{b_1}^2, \ldots, c_{b_H}^2),$$
for $c_{a_h}, c_{b_h} > 0$, $h = 1, \ldots, H$. In the asymptotic analysis, we assume that the hyperparameters $\{c_{a_h}^2, c_{b_h}^2\}$, $\sigma'^2$ are fixed constants of the order of 1, i.e., $\{c_{a_h}^2, c_{b_h}^2\}, \sigma'^2 \sim \Theta(1)$ when $N \to \infty$. The degree of freedom of the RRR model is, in general, different from the apparent number, $(M + L)H$, of entries of the parameters $A$ and $B$. This is because of the trivial redundancy in parameterization: the transformation $(A, B) \to (A T^\top, B T^{-1})$ does not change the linear mapping $B A^\top$ for any nonsingular matrix $T \in \mathbb{R}^{H \times H}$. Accordingly, the essential parameter dimensionality is counted as
$$D = H(M + L) - H^2. \tag{14.4}$$
Suppose we are given $N$ training samples:
$$\mathcal{D} = \left\{ (x^{(n)}, y^{(n)});\ x^{(n)} \in \mathbb{R}^M,\ y^{(n)} \in \mathbb{R}^L,\ n = 1, \ldots, N \right\}, \tag{14.5}$$
which are independently drawn from the true distribution $q(x, y) = q(y|x) q(x)$. We also use the matrix forms that summarize the inputs and the outputs separately:
$$X = (x^{(1)}, \ldots, x^{(N)})^\top \in \mathbb{R}^{N \times M}, \qquad Y = (y^{(1)}, \ldots, y^{(N)})^\top \in \mathbb{R}^{N \times L}.$$
We suppose that the data are preprocessed so that the input and the output are centered, i.e.,
$$\frac{1}{N} \sum_{n=1}^{N} x^{(n)} = 0 \quad \text{and} \quad \frac{1}{N} \sum_{n=1}^{N} y^{(n)} = 0, \tag{14.6}$$
and the input is prewhitened (Hyvärinen et al., 2001), i.e.,
$$\frac{1}{N} \sum_{n=1}^{N} x^{(n)} x^{(n)\top} = \frac{1}{N} X^\top X = I_M. \tag{14.7}$$
The likelihood of the RRR model (14.1) on the training samples $\mathcal{D} = (X, Y)$ is expressed as
$$p(Y|X, A, B) \propto \exp\left( -\frac{1}{2\sigma'^2} \sum_{n=1}^{N} \left\| y^{(n)} - B A^\top x^{(n)} \right\|^2 \right). \tag{14.8}$$
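The redundancy in the parameterization noted earlier in this section, which is why the likelihood depends on $(A, B)$ only through $B A^\top$, can be checked numerically. The sizes and the particular matrix `T` below are arbitrary choices made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
L, M, H = 3, 5, 2                       # illustrative sizes
A = rng.normal(size=(M, H))
B = rng.normal(size=(L, H))
T = np.array([[2.0, 1.0], [0.5, 3.0]])  # any nonsingular H x H matrix

# Transformed parameters (A T^T, B T^{-1}) give the same linear mapping B A^T.
A2 = A @ T.T
B2 = B @ np.linalg.inv(T)
same_map = np.allclose(B @ A.T, B2 @ A2.T)

# Essential parameter dimensionality: H(M + L) minus the H^2 redundant directions.
D = H * (M + L) - H**2
```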
As shown in Section 3.1.2, the logarithm of the likelihood (14.8) can be written, as a function of the parameters, as follows:
$$\log p(Y|X, A, B) = -\frac{N}{2\sigma'^2} \left\| V - B A^\top \right\|_{\mathrm{Fro}}^2 + \mathrm{const.}, \tag{14.9}$$
where
$$V = \frac{1}{N} \sum_{n=1}^{N} y^{(n)} x^{(n)\top} = \frac{1}{N} Y^\top X. \tag{14.10}$$
Note that, unlike in Section 3.1.2, we here do not use the rescaled noise variance $\sigma^2 = \sigma'^2 / N$, in order to make the dependence on the number $N$ of samples clear for asymptotic analysis. Because the log-likelihood (14.9) is in the same form as that of the fully observed matrix factorization (MF), we can use the global VB solution, derived in Chapter 6, of the MF model for analyzing VB learning in the RRR model.
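The step from Eq. (14.8) to Eq. (14.9) relies on the prewhitening assumption (14.7): expanding the squared residuals gives $\sum_n \|y^{(n)} - B A^\top x^{(n)}\|^2 = N \|V - B A^\top\|_{\mathrm{Fro}}^2 + \sum_n \|y^{(n)}\|^2 - N \|V\|_{\mathrm{Fro}}^2$, where the last two terms do not depend on the parameters. The following sketch checks this identity on synthetic data; the data-generating choices are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
N, L, M, H = 500, 3, 4, 2

# Inputs: centre, then whiten so that X^T X / N = I_M holds exactly, Eq. (14.7).
X = rng.normal(size=(N, M)) @ rng.normal(size=(M, M))
X -= X.mean(axis=0)
evals, evecs = np.linalg.eigh(X.T @ X / N)
X = X @ evecs @ np.diag(evals**-0.5) @ evecs.T

# Outputs: arbitrary linear map plus noise, then centred, Eq. (14.6).
Y = X @ rng.normal(size=(M, L)) + rng.normal(size=(N, L))
Y -= Y.mean(axis=0)

V = Y.T @ X / N                               # Eq. (14.10), an L x M matrix
A = rng.normal(size=(M, H))
B = rng.normal(size=(L, H))
U = B @ A.T                                   # candidate linear mapping

lhs = ((Y - X @ U.T) ** 2).sum()              # sum of squared residuals in (14.8)
const = (Y ** 2).sum() - N * (V ** 2).sum()   # parameter-independent part
rhs = N * ((V - U) ** 2).sum() + const
```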
14.1.1 VB Learning

VB learning solves the following problem:
$$\hat{r} = \operatorname*{argmin}_{r} F(r) \quad \text{s.t.} \quad r(A, B) = r_A(A)\, r_B(B), \tag{14.11}$$
where
$$F = \left\langle \log \frac{r_A(A)\, r_B(B)}{p(Y|X, A, B)\, p(A)\, p(B)} \right\rangle_{r_A(A) r_B(B)}$$
is the free energy. As derived in Section 3.1, the solution to the problem (14.11) is in the following forms:
$$r_A(A) = \mathrm{MGauss}_{M,H}(A; \hat{A}, I_M \otimes \hat{\Sigma}_A) \propto \exp\left( -\frac{\mathrm{tr}\left( (A - \hat{A}) \hat{\Sigma}_A^{-1} (A - \hat{A})^\top \right)}{2} \right), \tag{14.12}$$
$$r_B(B) = \mathrm{MGauss}_{L,H}(B; \hat{B}, I_L \otimes \hat{\Sigma}_B) \propto \exp\left( -\frac{\mathrm{tr}\left( (B - \hat{B}) \hat{\Sigma}_B^{-1} (B - \hat{B})^\top \right)}{2} \right). \tag{14.13}$$
With the variational parameters $(\hat{A}, \hat{\Sigma}_A, \hat{B}, \hat{\Sigma}_B)$, the free energy can be explicitly written as
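$$\begin{aligned} 2F = {} & NL \log(2\pi\sigma'^2) + \frac{\sum_{n=1}^{N} \left\| y^{(n)} \right\|^2 - N \left\| V \right\|_{\mathrm{Fro}}^2}{\sigma'^2} + \frac{N \left\| V - \hat{B} \hat{A}^\top \right\|_{\mathrm{Fro}}^2}{\sigma'^2} + M \log \frac{\det(C_A)}{\det(\hat{\Sigma}_A)} + L \log \frac{\det(C_B)}{\det(\hat{\Sigma}_B)} \\ & - (L + M) H + \mathrm{tr}\left\{ C_A^{-1} \left( \hat{A}^\top \hat{A} + M \hat{\Sigma}_A \right) + C_B^{-1} \left( \hat{B}^\top \hat{B} + L \hat{\Sigma}_B \right) \right\} \\ & + N \sigma'^{-2}\, \mathrm{tr}\left\{ -\hat{A}^\top \hat{A} \hat{B}^\top \hat{B} + \left( \hat{A}^\top \hat{A} + M \hat{\Sigma}_A \right) \left( \hat{B}^\top \hat{B} + L \hat{\Sigma}_B \right) \right\}. \end{aligned} \tag{14.14}$$
We can further apply Corollary 6.6, which states that the VB learning problem (14.11) is decomposable in the following way. Let
$$V = \sum_{h=1}^{L} \gamma_h \boldsymbol{\omega}_{b_h} \boldsymbol{\omega}_{a_h}^\top \tag{14.15}$$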
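be the singular value decomposition (SVD) of $V$ (defined in Eq. (14.10)), where $\gamma_h$ ($\ge 0$) is the $h$th largest singular value, and $\boldsymbol{\omega}_{a_h}$ and $\boldsymbol{\omega}_{b_h}$ are the associated right and left singular vectors. Then the solution (or its equivalent) of the variational parameters $\hat{A} = (\hat{\boldsymbol{a}}_1, \ldots, \hat{\boldsymbol{a}}_H)$, $\hat{B} = (\hat{\boldsymbol{b}}_1, \ldots, \hat{\boldsymbol{b}}_H)$, $\hat{\Sigma}_A$, $\hat{\Sigma}_B$, which minimizes the free energy (14.14), can be expressed as follows:
$$\hat{\boldsymbol{a}}_h = \hat{a}_h \boldsymbol{\omega}_{a_h}, \qquad \hat{\boldsymbol{b}}_h = \hat{b}_h \boldsymbol{\omega}_{b_h},$$
$$\hat{\Sigma}_A = \mathrm{Diag}\left( \hat{\sigma}_{a_1}^2, \ldots, \hat{\sigma}_{a_H}^2 \right), \qquad \hat{\Sigma}_B = \mathrm{Diag}\left( \hat{\sigma}_{b_1}^2, \ldots, \hat{\sigma}_{b_H}^2 \right),$$
where $\{\hat{a}_h, \hat{b}_h \in \mathbb{R},\ \hat{\sigma}_{a_h}^2, \hat{\sigma}_{b_h}^2 \in \mathbb{R}_{++}\}_{h=1}^{H}$ are a new set of variational parameters. Thus, the VB posteriors (14.12) and (14.13) can be written as
$$r_A(A) = \prod_{h=1}^{H} \mathrm{Gauss}_M\left( \boldsymbol{a}_h; \hat{a}_h \boldsymbol{\omega}_{a_h}, \hat{\sigma}_{a_h}^2 I_M \right), \tag{14.16}$$
$$r_B(B) = \prod_{h=1}^{H} \mathrm{Gauss}_L\left( \boldsymbol{b}_h; \hat{b}_h \boldsymbol{\omega}_{b_h}, \hat{\sigma}_{b_h}^2 I_L \right), \tag{14.17}$$
with $\{\hat{a}_h, \hat{b}_h, \hat{\sigma}_{a_h}^2, \hat{\sigma}_{b_h}^2\}_{h=1}^{H}$ that are the solution of the following minimization problem:
$$\begin{aligned} & \text{Given} \quad \sigma'^2 \in \mathbb{R}_{++}, \quad \{c_{a_h}^2, c_{b_h}^2 \in \mathbb{R}_{++}\}_{h=1}^{H}, \\ & \min_{\{\hat{a}_h, \hat{b}_h, \hat{\sigma}_{a_h}^2, \hat{\sigma}_{b_h}^2\}_{h=1}^{H}} F \\ & \text{s.t.} \quad \{\hat{a}_h, \hat{b}_h \in \mathbb{R}, \quad \hat{\sigma}_{a_h}^2, \hat{\sigma}_{b_h}^2 \in \mathbb{R}_{++}\}_{h=1}^{H}. \end{aligned} \tag{14.18}$$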
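Here $F$ is the free energy (14.14), which can be decomposed as
$$2F = NL \log(2\pi\sigma'^2) + \frac{\sum_{n=1}^{N} \left\| y^{(n)} \right\|^2}{\sigma'^2} + \sum_{h=1}^{H} 2F_h, \tag{14.19}$$
where
$$\begin{aligned} 2F_h = {} & M \log \frac{c_{a_h}^2}{\hat{\sigma}_{a_h}^2} + L \log \frac{c_{b_h}^2}{\hat{\sigma}_{b_h}^2} + \frac{\hat{a}_h^2 + M \hat{\sigma}_{a_h}^2}{c_{a_h}^2} + \frac{\hat{b}_h^2 + L \hat{\sigma}_{b_h}^2}{c_{b_h}^2} \\ & - (L + M) + \frac{N}{\sigma'^2} \left\{ -2 \hat{a}_h \hat{b}_h \gamma_h + \left( \hat{a}_h^2 + M \hat{\sigma}_{a_h}^2 \right) \left( \hat{b}_h^2 + L \hat{\sigma}_{b_h}^2 \right) \right\}. \end{aligned} \tag{14.20}$$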
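14.1.2 VB Solution

Let us derive an asymptotic-form VB solution from the nonasymptotic global VB solution, derived in Chapter 6. Theorem 6.7 leads to the following theorem:

Theorem 14.1 The VB estimator $\hat{U}^{\mathrm{VB}} \equiv \left\langle B A^\top \right\rangle_{r_A(A) r_B(B)}$ for the linear mapping of the RRR model (14.2) and (14.3) can be written as
$$\hat{U}^{\mathrm{VB}} = \hat{B} \hat{A}^\top = \sum_{h=1}^{H} \hat{\gamma}_h^{\mathrm{VB}} \boldsymbol{\omega}_{b_h} \boldsymbol{\omega}_{a_h}^\top, \quad \text{where} \quad \hat{\gamma}_h^{\mathrm{VB}} = \max\left( 0, \breve{\gamma}_h^{\mathrm{VB}} \right) \tag{14.21}$$
for
$$\breve{\gamma}_h^{\mathrm{VB}} = \gamma_h \left( 1 - \frac{\max(L, M) \sigma'^2}{N \gamma_h^2} \right) + O_p(N^{-1}). \tag{14.22}$$
For each component $h$, $\hat{\gamma}_h^{\mathrm{VB}} > 0$ if and only if $\gamma_h > \underline{\gamma}_h^{\mathrm{VB}}$ for
$$\underline{\gamma}_h^{\mathrm{VB}} = \sigma' \sqrt{\frac{\max(L, M)}{N}} + O(N^{-1}). \tag{14.23}$$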
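The leading-order behavior described by Theorem 14.1 can be sketched as a thresholded shrinkage of the singular values of $V$. The function below is a minimal illustration that drops the $O_p(N^{-1})$ and $O(N^{-1})$ remainder terms; the function name and its argument conventions (`sigma2` stands for the unscaled noise variance $\sigma'^2$) are assumptions made here, and it is not the exact nonasymptotic solution of Chapter 6.

```python
import numpy as np

def vb_rrr_asymptotic(V, H, N, sigma2):
    """Leading-order VB estimator of the linear mapping, following
    Eqs. (14.21)-(14.23) with the remainder terms omitted."""
    L, M = V.shape
    Ub, gamma, Vat = np.linalg.svd(V, full_matrices=False)
    gamma_h = gamma[:H]
    threshold = np.sqrt(sigma2 * max(L, M) / N)           # Eq. (14.23)
    with np.errstate(divide="ignore", invalid="ignore"):
        breve = gamma_h * (1.0 - max(L, M) * sigma2 / (N * gamma_h**2))  # Eq. (14.22)
    # Components at or below the threshold are pruned to exactly zero.
    gamma_vb = np.where(gamma_h > threshold, np.maximum(breve, 0.0), 0.0)
    U_vb = (Ub[:, :H] * gamma_vb) @ Vat[:H]
    return U_vb, gamma_vb, threshold
```

For instance, with $L = 2$, $M = 3$, $N = 100$, and $\sigma'^2 = 1$, the threshold is $\sqrt{3/100} \approx 0.17$, so a singular value of $1$ is shrunk to $0.97$ while a singular value of $0.01$ is set to zero.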
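Proof Noting that Theorem 6.7 gives the VB solution for either $V$ or $V^\top \in \mathbb{R}^{L \times M}$ that satisfies $L \le M$, that the shrinkage estimator $\breve{\gamma}_h^{\mathrm{VB}}$ (given by Eq. (6.50)) is an increasing function of $\gamma_h$, and that $\breve{\gamma}_h^{\mathrm{VB}} = 0$ when $\gamma_h$ is equal to the threshold $\underline{\gamma}_h^{\mathrm{VB}}$ (given by Eq. (6.49)), we have Eq. (14.21) with
$$\begin{aligned} \breve{\gamma}_h^{\mathrm{VB}} &= \gamma_h \left( 1 - \frac{\sigma'^2}{2 N \gamma_h^2} \left( L + M + \sqrt{(M - L)^2 + \frac{4 \gamma_h^2}{c_{a_h}^2 c_{b_h}^2}} \right) \right) \\ &= \gamma_h \left( 1 - \frac{\sigma'^2}{2 N \gamma_h^2} \left( L + M + \sqrt{(M - L)^2 + O(\gamma_h^2)} \right) \right) \\ &= \begin{cases} \gamma_h \left( 1 - \frac{\sigma'^2}{2 N \gamma_h^2} \left( L + M + \max(L, M) - \min(L, M) + O(\gamma_h^2) \right) \right) & (\text{if } L \ne M) \\ \gamma_h \left( 1 - \frac{\sigma'^2}{2 N \gamma_h^2} \left( L + M + O(\gamma_h) \right) \right) & (\text{if } L = M) \end{cases} \\ &= \gamma_h \left( 1 - \frac{\max(L, M) \sigma'^2}{N \gamma_h^2} \left( 1 + O_p(\gamma_h) \right) \right) \\ &= \gamma_h \left( 1 - \frac{\max(L, M) \sigma'^2}{N \gamma_h^2} \right) + O_p(N^{-1}), \end{aligned} \tag{14.24}$$
and
$$\begin{aligned} \underline{\gamma}_h^{\mathrm{VB}} &= \frac{\sigma'}{\sqrt{N}} \sqrt{ \frac{L + M}{2} + \frac{\sigma'^2}{2 N c_{a_h}^2 c_{b_h}^2} + \sqrt{ \left( \frac{L + M}{2} + \frac{\sigma'^2}{2 N c_{a_h}^2 c_{b_h}^2} \right)^2 - LM } } \\ &= \frac{\sigma'}{\sqrt{N}} \sqrt{ \frac{L + M}{2} + \frac{\sigma'^2}{2 N c_{a_h}^2 c_{b_h}^2} + \sqrt{ \left( \frac{\max(L, M) - \min(L, M)}{2} \right)^2 + O(N^{-1}) } } \\ &= \begin{cases} \frac{\sigma'}{\sqrt{N}} \sqrt{ \frac{L + M}{2} + \frac{\sigma'^2}{2 N c_{a_h}^2 c_{b_h}^2} + \frac{\max(L, M) - \min(L, M)}{2} + O(N^{-1}) } & (\text{if } L \ne M) \\ \frac{\sigma'}{\sqrt{N}} \sqrt{ \frac{L + M}{2} + \frac{\sigma'^2}{2 N c_{a_h}^2 c_{b_h}^2} + O(N^{-1/2}) } & (\text{if } L = M) \end{cases} \\ &= \sigma' \sqrt{\frac{\max(L, M)}{N}} + O(N^{-1}), \end{aligned}$$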
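which completes the proof. Note that we used $\gamma_h = O_p(1)$ to get Eq. (14.24).

Theorem 14.1 states that the VB estimator converges to the positive-part James–Stein (PJS) estimator (see Appendix A), which is the same solution (Corollary 7.1) as the nonasymptotic MF solution with the flat prior. This is natural because the influence from the constant prior disappears in the asymptotic limit, making MAP learning converge to ML learning. Corollary 6.8 leads to the following corollary:

Corollary 14.2 The VB posterior of the RRR model (14.2) and (14.3) is given by Eqs. (14.16) and (14.17) with the variational parameters given as follows: if $\gamma_h > \underline{\gamma}_h^{\mathrm{VB}}$,
$$\hat{a}_h = \pm \sqrt{\breve{\gamma}_h^{\mathrm{VB}} \hat{\delta}_h^{\mathrm{VB}}}, \quad \hat{b}_h = \pm \sqrt{\frac{\breve{\gamma}_h^{\mathrm{VB}}}{\hat{\delta}_h^{\mathrm{VB}}}}, \quad \hat{\sigma}_{a_h}^2 = \frac{\sigma'^2 \hat{\delta}_h^{\mathrm{VB}}}{N \gamma_h}, \quad \hat{\sigma}_{b_h}^2 = \frac{\sigma'^2}{N \gamma_h \hat{\delta}_h^{\mathrm{VB}}}, \tag{14.25}$$
where
$$\hat{\delta}_h^{\mathrm{VB}} \equiv \frac{\hat{a}_h}{\hat{b}_h} = \begin{cases} \frac{(\max(L, M) - \min(L, M))\, c_{a_h}}{\gamma_h} + O_p(1) & (\text{if } L \le M), \\ \left( \frac{(\max(L, M) - \min(L, M))\, c_{b_h}}{\gamma_h} + O_p(1) \right)^{-1} & (\text{if } L > M), \end{cases} \tag{14.26}$$
and otherwise,
$$\hat{a}_h = 0, \quad \hat{b}_h = 0, \quad \hat{\sigma}_{a_h}^2 = c_{a_h}^2 \left( 1 - \frac{N L \hat{\zeta}_h^{\mathrm{VB}}}{\sigma'^2} \right), \quad \hat{\sigma}_{b_h}^2 = c_{b_h}^2 \left( 1 - \frac{N M \hat{\zeta}_h^{\mathrm{VB}}}{\sigma'^2} \right), \tag{14.27}$$
where
$$\hat{\zeta}_h^{\mathrm{VB}} \equiv \hat{\sigma}_{a_h}^2 \hat{\sigma}_{b_h}^2 = \begin{cases} \frac{\min(L, M) \sigma'^2}{N L M} + \Theta(N^{-2}) & (\text{if } L \ne M), \\ \frac{\min(L, M) \sigma'^2}{N L M} + \Theta(N^{-3/2}) & (\text{if } L = M). \end{cases} \tag{14.28}$$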
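Proof Noting that Corollary 6.8 gives the VB posterior for either $V$ or $V^\top \in \mathbb{R}^{L \times M}$ that satisfies $L \le M$, we have Eq. (14.25) with
$$\hat{\delta}_h^{\mathrm{VB}} = \begin{cases} \frac{N c_{a_h}}{\sigma'^2} \left( \gamma_h - \breve{\gamma}_h^{\mathrm{VB}} - \frac{L \sigma'^2}{N \gamma_h} \right) & (\text{if } L \le M) \\ \left( \frac{N c_{b_h}}{\sigma'^2} \left( \gamma_h - \breve{\gamma}_h^{\mathrm{VB}} - \frac{M \sigma'^2}{N \gamma_h} \right) \right)^{-1} & (\text{if } L > M) \end{cases} = \begin{cases} \frac{(\max(L, M) - \min(L, M))\, c_{a_h}}{\gamma_h} + O_p(1) & (\text{if } L \le M), \\ \left( \frac{(\max(L, M) - \min(L, M))\, c_{b_h}}{\gamma_h} + O_p(1) \right)^{-1} & (\text{if } L > M), \end{cases}$$
when $\gamma_h > \underline{\gamma}_h^{\mathrm{VB}}$, and Eq. (14.27) with
$$\begin{aligned} \hat{\zeta}_h^{\mathrm{VB}} &= \frac{\sigma'^2}{2 N L M} \left\{ L + M + \frac{\sigma'^2}{N c_{a_h}^2 c_{b_h}^2} - \sqrt{ \left( L + M + \frac{\sigma'^2}{N c_{a_h}^2 c_{b_h}^2} \right)^2 - 4 L M } \right\} \\ &= \frac{\sigma'^2}{2 N L M} \left\{ L + M + \frac{\sigma'^2}{N c_{a_h}^2 c_{b_h}^2} - \sqrt{ (\max(L, M) - \min(L, M))^2 + 2 (L + M) \frac{\sigma'^2}{N c_{a_h}^2 c_{b_h}^2} + \left( \frac{\sigma'^2}{N c_{a_h}^2 c_{b_h}^2} \right)^2 } \right\} \\ &= \begin{cases} \frac{\sigma'^2}{2 N L M} \left\{ L + M + \frac{\sigma'^2}{N c_{a_h}^2 c_{b_h}^2} - (\max(L, M) - \min(L, M)) \left( 1 + \frac{L + M}{(\max(L, M) - \min(L, M))^2} \frac{\sigma'^2}{N c_{a_h}^2 c_{b_h}^2} + O(N^{-2}) \right) \right\} & (\text{if } L \ne M) \\ \frac{\sigma'^2}{2 N L M} \left\{ L + M + \frac{\sigma'^2}{N c_{a_h}^2 c_{b_h}^2} - \sqrt{ 2 (L + M) \frac{\sigma'^2}{N c_{a_h}^2 c_{b_h}^2} + \left( \frac{\sigma'^2}{N c_{a_h}^2 c_{b_h}^2} \right)^2 } \right\} & (\text{if } L = M) \end{cases} \\ &= \begin{cases} \frac{\sigma'^2}{2 N L M} \left\{ 2 \min(L, M) + \Theta(N^{-1}) \right\} & (\text{if } L \ne M) \\ \frac{\sigma'^2}{2 N L M} \left\{ 2 \min(L, M) + \Theta(N^{-1/2}) \right\} & (\text{if } L = M) \end{cases} \\ &= \begin{cases} \frac{\min(L, M) \sigma'^2}{N L M} + \Theta(N^{-2}) & (\text{if } L \ne M), \\ \frac{\min(L, M) \sigma'^2}{N L M} + \Theta(N^{-3/2}) & (\text{if } L = M), \end{cases} \end{aligned}$$
when $\gamma_h \le \underline{\gamma}_h^{\mathrm{VB}}$. This completes the proof.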
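From Corollary 14.2, we can evaluate the orders of the optimal variational parameters in the asymptotic limit, which will be used when the VB free energy is analyzed.

Corollary 14.3 The orders of the optimal variational parameters, given by Eq. (14.25) or Eq. (14.27), are as follows: if $\gamma_h > \underline{\gamma}_h^{\mathrm{VB}}$ ($= \Theta(N^{-1/2})$),
$$\begin{aligned} & \hat{a}_h = \Theta_p(1), && \hat{b}_h = \Theta_p(\gamma_h), && \hat{\sigma}_{a_h}^2 = \Theta_p(N^{-1} \gamma_h^{-2}), && \hat{\sigma}_{b_h}^2 = \Theta_p(N^{-1}) && (\text{if } L < M), \\ & \hat{a}_h = \Theta_p(\gamma_h^{1/2}), && \hat{b}_h = \Theta_p(\gamma_h^{1/2}), && \hat{\sigma}_{a_h}^2 = \Theta_p(N^{-1} \gamma_h^{-1}), && \hat{\sigma}_{b_h}^2 = \Theta_p(N^{-1} \gamma_h^{-1}) && (\text{if } L = M), \\ & \hat{a}_h = \Theta_p(\gamma_h), && \hat{b}_h = \Theta_p(1), && \hat{\sigma}_{a_h}^2 = \Theta_p(N^{-1}), && \hat{\sigma}_{b_h}^2 = \Theta_p(N^{-1} \gamma_h^{-2}) && (\text{if } L > M), \end{aligned} \tag{14.29}$$
and otherwise,
$$\begin{aligned} & \hat{a}_h = 0, && \hat{b}_h = 0, && \hat{\sigma}_{a_h}^2 = \Theta(1), && \hat{\sigma}_{b_h}^2 = \Theta(N^{-1}) && (\text{if } L < M), \\ & \hat{a}_h = 0, && \hat{b}_h = 0, && \hat{\sigma}_{a_h}^2 = \Theta(N^{-1/2}), && \hat{\sigma}_{b_h}^2 = \Theta(N^{-1/2}) && (\text{if } L = M), \\ & \hat{a}_h = 0, && \hat{b}_h = 0, && \hat{\sigma}_{a_h}^2 = \Theta(N^{-1}), && \hat{\sigma}_{b_h}^2 = \Theta(1) && (\text{if } L > M). \end{aligned} \tag{14.30}$$

Proof Eqs. (14.22) and (14.23) give
$$\underline{\gamma}_h^{\mathrm{VB}} = \Theta(N^{-1/2}), \qquad \breve{\gamma}_h^{\mathrm{VB}} = \Theta(\gamma_h),$$
and Eq. (14.26) gives
$$\hat{\delta}_h^{\mathrm{VB}} = \begin{cases} \Theta_p(\gamma_h^{-1}) & (\text{if } L < M), \\ \Theta_p(1) & (\text{if } L = M), \\ \Theta_p(\gamma_h) & (\text{if } L > M). \end{cases}$$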
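Substituting the preceding into Eq. (14.25) gives Eq. (14.29), and substituting Eq. (14.28) into Eq. (14.27) gives Eq. (14.30), which completes the proof.

Corollary 14.3 implies that the posterior probability mass does not necessarily converge to a single point; for example, $\hat{\sigma}_{a_h}^2 = \Theta(1)$ if $\gamma_h < \underline{\gamma}_h^{\mathrm{VB}}$ and $L < M$. This is typical behavior of singular models with nonidentifiability. On the other hand, the probability mass of the linear mapping $U = B A^\top$ converges to a single point.

Corollary 14.4 It holds that
$$\left\langle \left\| B A^\top - \hat{B} \hat{A}^\top \right\|_{\mathrm{Fro}}^2 \right\rangle_{r_A(A) r_B(B)} = O_p(N^{-1}).$$

Proof We have
$$\begin{aligned} \left\langle \left\| B A^\top - \hat{B} \hat{A}^\top \right\|_{\mathrm{Fro}}^2 \right\rangle_{r_A(A) r_B(B)} &= \mathrm{tr} \left\langle \left( B A^\top - \hat{B} \hat{A}^\top \right)^\top \left( B A^\top - \hat{B} \hat{A}^\top \right) \right\rangle_{r_A(A) r_B(B)} \\ &= \mathrm{tr} \left\langle A B^\top B A^\top - 2 A B^\top \hat{B} \hat{A}^\top + \hat{A} \hat{B}^\top \hat{B} \hat{A}^\top \right\rangle_{r_A(A) r_B(B)} \\ &= \mathrm{tr} \left( \left\langle A^\top A B^\top B \right\rangle_{r_A(A) r_B(B)} - \hat{A}^\top \hat{A} \hat{B}^\top \hat{B} \right) \\ &= \mathrm{tr} \left( \left( \hat{A}^\top \hat{A} + M \hat{\Sigma}_A \right) \left( \hat{B}^\top \hat{B} + L \hat{\Sigma}_B \right) - \hat{A}^\top \hat{A} \hat{B}^\top \hat{B} \right) \\ &= \sum_{h=1}^{H} \left\{ \left( \hat{a}_h^2 + M \hat{\sigma}_{a_h}^2 \right) \left( \hat{b}_h^2 + L \hat{\sigma}_{b_h}^2 \right) - \hat{a}_h^2 \hat{b}_h^2 \right\} \\ &= \sum_{h=1}^{H} \left\{ L \hat{a}_h^2 \hat{\sigma}_{b_h}^2 + M \hat{b}_h^2 \hat{\sigma}_{a_h}^2 + L M \hat{\sigma}_{a_h}^2 \hat{\sigma}_{b_h}^2 \right\}. \end{aligned} \tag{14.31}$$
Corollary 14.3 guarantees that all terms in Eq. (14.31) are of the order of $\Theta_p(N^{-1})$ for any $L$, $M$, and $\{\gamma_h\}$, which completes the proof.

Now we derive an asymptotic form of the VB predictive distribution,
$$p(y|x, X, Y) = \left\langle p(y|x, A, B) \right\rangle_{r_A(A) r_B(B)}. \tag{14.32}$$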
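The trace identity used in the proof of Corollary 14.4 can be checked by Monte Carlo sampling from matrix-variate Gaussian posteriors of the form (14.12) and (14.13). In this sketch the posterior means and the diagonal covariances are arbitrary illustrative values, not a VB solution, and the number of samples `S` is chosen only to keep the Monte Carlo error small.

```python
import numpy as np

rng = np.random.default_rng(2)
L, M, H, S = 3, 4, 2, 20000
A_hat = rng.normal(size=(M, H))
B_hat = rng.normal(size=(L, H))
sA = np.array([0.05, 0.2])    # diagonal of Sigma_A (illustrative)
sB = np.array([0.1, 0.02])    # diagonal of Sigma_B (illustrative)

# Each row of A is an independent draw from N(row of A_hat, Sigma_A); same for B.
A = A_hat + rng.normal(size=(S, M, H)) * np.sqrt(sA)
B = B_hat + rng.normal(size=(S, L, H)) * np.sqrt(sB)
diff = np.einsum("slh,smh->slm", B, A) - B_hat @ A_hat.T
mc = (diff ** 2).sum(axis=(1, 2)).mean()     # Monte Carlo estimate of the left-hand side

# Closed form: tr{(A^T A + M Sigma_A)(B^T B + L Sigma_B) - A^T A B^T B} at the means.
GA = A_hat.T @ A_hat + M * np.diag(sA)
GB = B_hat.T @ B_hat + L * np.diag(sB)
exact = np.trace(GA @ GB - A_hat.T @ A_hat @ B_hat.T @ B_hat)
```

The per-component sum in the last line of Eq. (14.31) additionally uses the singular-vector structure of the VB solution, which this generic check does not assume.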
From Corollary 14.4, we expect that the predictive distribution is not very far from the plug-in VB predictive distribution (see Section 1.1.3):
$$p(y|x, \hat{A}, \hat{B}) = \mathrm{Gauss}_L(y; \hat{B} \hat{A}^\top x, \sigma'^2 I_L). \tag{14.33}$$
Indeed, we will show in the next section that both predictive distributions (14.32) and (14.33) give the same generalization and training coefficients. This justifies the use of the plug-in VB predictive distribution, which is easy to compute from the optimal variational parameters. By expanding the VB predictive distribution around the plug-in VB predictive distribution, we have the following theorem:

Theorem 14.5 The VB predictive distribution (14.32) of the RRR model (14.2) and (14.3) can be written as
$$p(y|x, X, Y) = \mathrm{Gauss}_L(y; \Psi \hat{B} \hat{A}^\top x, \sigma'^2 \Psi) + O_p(N^{-3/2}) \tag{14.34}$$
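for $\Psi = I_L + O_p(N^{-1})$.

Proof The VB predictive distribution can be written as follows:
$$\begin{aligned} p(y|x, X, Y) &= \left\langle p(y|x, A, B) \right\rangle_{r_A(A) r_B(B)} \\ &= p(y|x, \hat{A}, \hat{B}) \left\langle \frac{p(y|x, A, B)}{p(y|x, \hat{A}, \hat{B})} \right\rangle_{r_A(A) r_B(B)} \\ &= p(y|x, \hat{A}, \hat{B}) \left\langle \exp\left( -\frac{\left\| y - B A^\top x \right\|^2 - \left\| y - \hat{B} \hat{A}^\top x \right\|^2}{2\sigma'^2} \right) \right\rangle_{r_A(A) r_B(B)} \\ &= p(y|x, \hat{A}, \hat{B}) \left\langle \exp\left( -\frac{\left( y - B A^\top x + (y - \hat{B} \hat{A}^\top x) \right)^\top \left( y - B A^\top x - (y - \hat{B} \hat{A}^\top x) \right)}{2\sigma'^2} \right) \right\rangle_{r_A(A) r_B(B)} \\ &= p(y|x, \hat{A}, \hat{B}) \left\langle \exp\left( \frac{\left( y - (B A^\top - \hat{B} \hat{A}^\top) x \right)^\top (B A^\top - \hat{B} \hat{A}^\top) x}{\sigma'^2} \right) \right\rangle_{r_A(A) r_B(B)}. \end{aligned} \tag{14.35}$$
Corollary 14.4 implies that the exponent in Eq. (14.35) is of the order of $N^{-1/2}$, i.e.,
$$\phi \equiv \frac{\left( y - (B A^\top - \hat{B} \hat{A}^\top) x \right)^\top (B A^\top - \hat{B} \hat{A}^\top) x}{\sigma'^2} = O_p(N^{-1/2}). \tag{14.36}$$
By applying the Taylor expansion of the exponential function to Eq. (14.35), we obtain an asymptotic expansion of the predictive distribution around the plug-in predictive distribution:
$$p(y|x, X, Y) = p(y|x, \hat{A}, \hat{B}) \left( 1 + \left\langle \phi \right\rangle_{r_A(A) r_B(B)} + \frac{1}{2} \left\langle \phi^2 \right\rangle_{r_A(A) r_B(B)} + O_p(N^{-3/2}) \right).$$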
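Focusing on the dependence on the random variable $y$, we can identify the function form of the predictive distribution as follows:
$$\begin{aligned} p(y|x, X, Y) &\propto \exp\left( -\frac{\left\| y - \hat{B} \hat{A}^\top x \right\|^2}{2\sigma'^2} + \log\left( 1 + \left\langle \phi \right\rangle_{r_A(A) r_B(B)} + \frac{1}{2} \left\langle \phi^2 \right\rangle_{r_A(A) r_B(B)} + O_p(N^{-3/2}) \right) \right) \\ &= \exp\left( -\frac{\left\| y - \hat{B} \hat{A}^\top x \right\|^2}{2\sigma'^2} + \left\langle \phi \right\rangle_{r_A(A) r_B(B)} + \frac{1}{2} \left\langle \phi^2 \right\rangle_{r_A(A) r_B(B)} - \frac{1}{2} \left\langle \phi \right\rangle_{r_A(A) r_B(B)}^2 + O_p(N^{-3/2}) \right) \\ &\propto \exp\left( -\frac{\left\| y - \hat{B} \hat{A}^\top x \right\|^2}{2\sigma'^2} + \frac{1}{2} \left\langle \phi^2 \right\rangle_{r_A(A) r_B(B)} + O_p(N^{-3/2}) \right) \\ &\propto \exp\left( -\frac{\left\| y - \hat{B} \hat{A}^\top x \right\|^2 - y^\top \Psi_1 y}{2\sigma'^2} + O_p(N^{-3/2}) \right) \\ &\propto \exp\left( -\frac{\left\| y \right\|^2 - 2 y^\top \hat{B} \hat{A}^\top x - y^\top \Psi_1 y}{2\sigma'^2} + O_p(N^{-3/2}) \right) \\ &\propto \exp\left( -\frac{\left( y - \Psi \hat{B} \hat{A}^\top x \right)^\top \Psi^{-1} \left( y - \Psi \hat{B} \hat{A}^\top x \right)}{2\sigma'^2} + O_p(N^{-3/2}) \right), \end{aligned} \tag{14.37}$$
where
$$\Psi = (I_L - \Psi_1)^{-1}, \tag{14.38}$$
$$\Psi_1 = \frac{\left\langle (B A^\top - \hat{B} \hat{A}^\top) x x^\top (B A^\top - \hat{B} \hat{A}^\top)^\top \right\rangle_{r_A(A) r_B(B)}}{\sigma'^2}. \tag{14.39}$$
Here we used
$$\left\langle \phi \right\rangle_{r_A(A) r_B(B)} = \left\langle \frac{\left( y - (B A^\top - \hat{B} \hat{A}^\top) x \right)^\top (B A^\top - \hat{B} \hat{A}^\top) x}{\sigma'^2} \right\rangle_{r_A(A) r_B(B)} = -\left\langle \frac{\left\| (B A^\top - \hat{B} \hat{A}^\top) x \right\|^2}{\sigma'^2} \right\rangle_{r_A(A) r_B(B)} = \mathrm{const.},$$
$$\left\langle \phi^2 \right\rangle_{r_A(A) r_B(B)} = \left\langle \frac{\left( y - (B A^\top - \hat{B} \hat{A}^\top) x \right)^\top (B A^\top - \hat{B} \hat{A}^\top) x x^\top (B A^\top - \hat{B} \hat{A}^\top)^\top \left( y - (B A^\top - \hat{B} \hat{A}^\top) x \right)}{\sigma'^4} \right\rangle_{r_A(A) r_B(B)} = \frac{y^\top \Psi_1 y}{\sigma'^2} + O_p(N^{-3/2}).$$
Eq. (14.39) implies that $\Psi_1$ is symmetric and $\Psi_1 = O_p(N^{-1})$. Therefore, $\Psi$, defined by Eq. (14.38), is symmetric, positive definite, and can be written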
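as $\Psi = I_L + O_p(N^{-1})$. The function form of Eq. (14.37) implies that the VB predictive distribution converges to the Gaussian distribution in the asymptotic limit, and we thus have
$$\begin{aligned} p(y|x, X, Y) &= \frac{\exp\left( -\frac{\left( y - \Psi \hat{B} \hat{A}^\top x \right)^\top \Psi^{-1} \left( y - \Psi \hat{B} \hat{A}^\top x \right)}{2\sigma'^2} + O_p(N^{-3/2}) \right)}{\int \exp\left( -\frac{\left( y - \Psi \hat{B} \hat{A}^\top x \right)^\top \Psi^{-1} \left( y - \Psi \hat{B} \hat{A}^\top x \right)}{2\sigma'^2} + O_p(N^{-3/2}) \right) dy} \\ &= \frac{\exp\left( -\frac{\left( y - \Psi \hat{B} \hat{A}^\top x \right)^\top \Psi^{-1} \left( y - \Psi \hat{B} \hat{A}^\top x \right)}{2\sigma'^2} \right) \left( 1 + O_p(N^{-3/2}) \right)}{\int \exp\left( -\frac{\left( y - \Psi \hat{B} \hat{A}^\top x \right)^\top \Psi^{-1} \left( y - \Psi \hat{B} \hat{A}^\top x \right)}{2\sigma'^2} \right) \left( 1 + O_p(N^{-3/2}) \right) dy} \\ &= \frac{1}{(2\pi\sigma'^2)^{L/2} \det(\Psi)^{1/2}} \exp\left( -\frac{\left( y - \Psi \hat{B} \hat{A}^\top x \right)^\top \Psi^{-1} \left( y - \Psi \hat{B} \hat{A}^\top x \right)}{2\sigma'^2} \right) + O_p(N^{-3/2}), \end{aligned}$$
which completes the proof.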
14.2 Generalization Properties

Let us analyze generalization properties of VB learning based on the posterior distribution and the predictive distribution, derived in Section 14.1.2.
14.2.1 Assumption on True Distribution

We assume that the true distribution can be expressed by the model distribution with the true parameters $A^*$ and $B^*$ with their rank $H^*$:
$$q(y|x) = \mathrm{Gauss}_L\left( y; B^* A^{*\top} x, \sigma'^2 I_L \right) = \left( 2\pi\sigma'^2 \right)^{-L/2} \exp\left( -\frac{\left\| y - B^* A^{*\top} x \right\|^2}{2\sigma'^2} \right). \tag{14.40}$$
Let
$$U^* \equiv B^* A^{*\top} = \sum_{h=1}^{\min(L, M)} \gamma_h^* \boldsymbol{\omega}_{b_h}^* \boldsymbol{\omega}_{a_h}^{*\top} \tag{14.41}$$
be the SVD of the true linear mapping $B^* A^{*\top}$, where $\gamma_h^*$ ($\ge 0$) is the $h$th largest singular value, and $\boldsymbol{\omega}_{a_h}^*$ and $\boldsymbol{\omega}_{b_h}^*$ are the associated right and left singular vectors. The assumption that the true linear mapping has rank $H^*$ amounts to
$$\gamma_h^* = \begin{cases} \Theta(1) & \text{for } h = 1, \ldots, H^*, \\ 0 & \text{for } h = H^* + 1, \ldots, \min(L, M). \end{cases} \tag{14.42}$$
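14.2.2 Consistency of VB Estimator

Since the training samples are drawn from the true distribution (14.40), the central limit theorem (Theorem 13.1) guarantees the following:
$$V \left( \equiv \frac{1}{N} \sum_{n=1}^{N} y^{(n)} x^{(n)\top} \right) = \frac{1}{N} \sum_{n=1}^{N} \left( B^* A^{*\top} x^{(n)} + \varepsilon^{(n)} \right) x^{(n)\top} = B^* A^{*\top} + O_p(N^{-1/2}), \tag{14.43}$$
$$\left\langle x x^\top \right\rangle_{q(x)} = \frac{1}{N} \sum_{n=1}^{N} x^{(n)} x^{(n)\top} + O_p(N^{-1/2}) = I_M + O_p(N^{-1/2}). \tag{14.44}$$
Here we used the assumption (14.7) that the input is prewhitened. Eq. (14.43) is consistent with Eq. (14.9), which implies that the distribution of $V$ is given by
$$q(V) = \mathrm{MGauss}_{L,M}\left( V; B^* A^{*\top}, \frac{\sigma'^2}{N} I_L \otimes I_M \right), \tag{14.45}$$
and therefore
$$\left\langle V \right\rangle_{q(X,Y)} = \left\langle V \right\rangle_{q(V)} = B^* A^{*\top}, \tag{14.46}$$
and for each $(l, m)$,
$$\left\langle \left\| V_{l,m} - \left( B^* A^{*\top} \right)_{l,m} \right\|_{\mathrm{Fro}}^2 \right\rangle_{q(X,Y)} = \frac{\sigma'^2}{N}. \tag{14.47}$$
Eq. (14.43) implies that
$$\gamma_h = \gamma_h^* + O_p(N^{-1/2}), \tag{14.48}$$
where $\gamma_h$ is the $h$th largest singular value of $V$ (see Eq. (14.15)). Eq. (14.45) also implies that, for any $h$,
$$\left\langle \sum_{h': \gamma_{h'}^* = \gamma_h^*} \gamma_{h'} \boldsymbol{\omega}_{b_{h'}} \boldsymbol{\omega}_{a_{h'}}^\top \right\rangle_{q(X,Y)} = \sum_{h': \gamma_{h'}^* = \gamma_h^*} \gamma_{h'}^* \boldsymbol{\omega}_{b_{h'}}^* \boldsymbol{\omega}_{a_{h'}}^{*\top}, \tag{14.49}$$
$$\sum_{h': \gamma_{h'}^* = \gamma_h^*} \gamma_{h'} \boldsymbol{\omega}_{b_{h'}} \boldsymbol{\omega}_{a_{h'}}^\top = \sum_{h': \gamma_{h'}^* = \gamma_h^*} \gamma_{h'}^* \boldsymbol{\omega}_{b_{h'}}^* \boldsymbol{\omega}_{a_{h'}}^{*\top} + O_p(N^{-1/2}), \tag{14.50}$$
where $\sum_{h': \gamma_{h'}^* = \gamma_h^*}$ denotes the sum over all $h'$ such that $\gamma_{h'}^* = \gamma_h^*$. Eq. (14.50) implies that for any nonzero and nondegenerate singular component $h$ (i.e., $\gamma_h^* > 0$ and $\gamma_h^* \ne \gamma_{h'}^*$ $\forall h' \ne h$), it holds that
$$\boldsymbol{\omega}_{a_h} = \boldsymbol{\omega}_{a_h}^* + O_p(N^{-1/2}), \qquad \boldsymbol{\omega}_{b_h} = \boldsymbol{\omega}_{b_h}^* + O_p(N^{-1/2}).$$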
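Eq. (14.9) implies that the ML estimator is given by
$$\left( \hat{B} \hat{A}^\top \right)^{\mathrm{ML}} = \sum_{h=1}^{H} \gamma_h \boldsymbol{\omega}_{b_h} \boldsymbol{\omega}_{a_h}^\top. \tag{14.51}$$
Therefore, Eq. (14.43) guarantees the convergence of the ML estimator to the true linear mapping $B^* A^{*\top}$ when $H \ge H^*$.

Lemma 14.6 (Consistency of ML estimator in RRR) It holds that
$$\left\| \left( \hat{B} \hat{A}^\top \right)^{\mathrm{ML}} - B^* A^{*\top} \right\|_{\mathrm{Fro}} = \begin{cases} \Theta(1) & \text{if } H < H^*, \\ O_p(N^{-1/2}) & \text{if } H \ge H^*. \end{cases}$$
We can also show the convergence of the VB estimator:

Lemma 14.7 (Consistency of VB estimator in RRR) It holds that
$$\left\| \hat{B} \hat{A}^\top - B^* A^{*\top} \right\|_{\mathrm{Fro}} = \begin{cases} \Theta(1) & \text{if } H < H^*, \\ O_p(N^{-1/2}) & \text{if } H \ge H^*. \end{cases}$$

Proof The case where $H < H^*$ is trivial because the rank $H$ matrix $\hat{B} \hat{A}^\top$ can never converge to the rank $H^*$ matrix $B^* A^{*\top}$. Assume that $H \ge H^*$. Theorem 14.1 implies that, when $\gamma_h > \underline{\gamma}_h^{\mathrm{VB}}$ ($= \Theta(N^{-1/2})$),
$$\hat{\gamma}_h^{\mathrm{VB}} = \breve{\gamma}_h^{\mathrm{VB}} = \gamma_h \left( 1 - \frac{\max(L, M) \sigma'^2}{N \gamma_h^2} \right) + O_p(N^{-1}) = \gamma_h + O_p(N^{-1/2}),$$
and otherwise
$$\hat{\gamma}_h^{\mathrm{VB}} = 0.$$
Since $\gamma_h = O_p(N^{-1/2})$ for $h = H^* + 1, \ldots, \min(L, M)$, the preceding two equations lead to
$$\hat{B} \hat{A}^\top = \sum_{h=1}^{H} \hat{\gamma}_h^{\mathrm{VB}} \boldsymbol{\omega}_{b_h} \boldsymbol{\omega}_{a_h}^\top = \sum_{h=1}^{\min(L, M)} \gamma_h \boldsymbol{\omega}_{b_h} \boldsymbol{\omega}_{a_h}^\top + O_p(N^{-1/2}) = V + O_p(N^{-1/2}). \tag{14.52}$$
Substituting Eq. (14.43) into Eq. (14.52) completes the proof.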
14.2.3 Generalization Error

Now we analyze the asymptotic behavior of the generalization error. We first show the asymptotic equivalence between the VB predictive distribution, given by Theorem 14.5, and the plug-in VB predictive distribution (14.33): both give the same leading term of the generalization error with $O_p(N^{-3/2})$ difference. To this end, we use the following lemma:

Lemma 14.8 For any three sets of Gaussian parameters $(\mu^*, \Sigma^*)$, $(\hat{\mu}, \hat{\Sigma})$, $(\acute{\mu}, \acute{\Sigma})$ such that
$$\hat{\mu} = \mu^* + O_p(N^{-1/2}), \qquad \hat{\Sigma} = \Sigma^* + O_p(N^{-1/2}), \tag{14.53}$$
$$\acute{\mu} = \hat{\mu} + O_p(N^{-1}), \qquad \acute{\Sigma} = \hat{\Sigma} + O_p(N^{-1}), \tag{14.54}$$
it holds that
$$\left\langle \log \frac{\mathrm{Gauss}_L\left( y; \acute{\mu}, \acute{\Sigma} \right) + O_p(N^{-3/2})}{\mathrm{Gauss}_L\left( y; \hat{\mu}, \hat{\Sigma} \right)} \right\rangle_{\mathrm{Gauss}_L(y; \mu^*, \Sigma^*)} = O_p(N^{-3/2}). \tag{14.55}$$
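Proof Twice the left-hand side of Eq. (14.55) can be written as
$$\begin{aligned} \psi_1 &\equiv 2 \left\langle \log \frac{\mathrm{Gauss}_L\left( y; \acute{\mu}, \acute{\Sigma} \right) + O_p(N^{-3/2})}{\mathrm{Gauss}_L\left( y; \hat{\mu}, \hat{\Sigma} \right)} \right\rangle_{\mathrm{Gauss}_L(y; \mu^*, \Sigma^*)} \\ &= \left\langle \log \frac{\det(\hat{\Sigma})}{\det(\acute{\Sigma})} - (y - \acute{\mu})^\top \acute{\Sigma}^{-1} (y - \acute{\mu}) + (y - \hat{\mu})^\top \hat{\Sigma}^{-1} (y - \hat{\mu}) \right\rangle_{\mathrm{Gauss}_L(y; \mu^*, \Sigma^*)} + O_p(N^{-3/2}) \\ &= -\log \det\left( \Sigma^* \hat{\Sigma}^{-1} \Sigma^{*-1} \acute{\Sigma} \right) - \left\langle \left( y - \mu^* - (\acute{\mu} - \mu^*) \right)^\top \acute{\Sigma}^{-1} \left( y - \mu^* - (\acute{\mu} - \mu^*) \right) \right\rangle_{\mathrm{Gauss}_L(y; \mu^*, \Sigma^*)} \\ &\quad + \left\langle \left( y - \mu^* - (\hat{\mu} - \mu^*) \right)^\top \hat{\Sigma}^{-1} \left( y - \mu^* - (\hat{\mu} - \mu^*) \right) \right\rangle_{\mathrm{Gauss}_L(y; \mu^*, \Sigma^*)} + O_p(N^{-3/2}) \\ &= \mathrm{tr}\left( \log\left( \Sigma^* \acute{\Sigma}^{-1} \right) - \log\left( \Sigma^* \hat{\Sigma}^{-1} \right) \right) - \mathrm{tr}\left( \Sigma^* \acute{\Sigma}^{-1} \right) - (\acute{\mu} - \mu^*)^\top \acute{\Sigma}^{-1} (\acute{\mu} - \mu^*) \\ &\quad + \mathrm{tr}\left( \Sigma^* \hat{\Sigma}^{-1} \right) + (\hat{\mu} - \mu^*)^\top \hat{\Sigma}^{-1} (\hat{\mu} - \mu^*) + O_p(N^{-3/2}). \end{aligned}$$
By using Eqs. (14.53) and (14.54) and the Taylor expansion of the logarithmic function, we have
$$\begin{aligned} \psi_1 &= \mathrm{tr}\left( \Sigma^* \acute{\Sigma}^{-1} - I_L - \frac{\left( \Sigma^* \acute{\Sigma}^{-1} - I_L \right) \left( \Sigma^* \acute{\Sigma}^{-1} - I_L \right)}{2} \right) - \mathrm{tr}\left( \Sigma^* \hat{\Sigma}^{-1} - I_L - \frac{\left( \Sigma^* \hat{\Sigma}^{-1} - I_L \right) \left( \Sigma^* \hat{\Sigma}^{-1} - I_L \right)}{2} \right) \\ &\quad - \mathrm{tr}\left( \Sigma^* \acute{\Sigma}^{-1} \right) - (\acute{\mu} - \mu^*)^\top \acute{\Sigma}^{-1} (\acute{\mu} - \mu^*) + \mathrm{tr}\left( \Sigma^* \hat{\Sigma}^{-1} \right) + (\hat{\mu} - \mu^*)^\top \hat{\Sigma}^{-1} (\hat{\mu} - \mu^*) + O_p(N^{-3/2}) \\ &= O_p(N^{-3/2}), \end{aligned}$$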
which completes the proof.
Given a test input $x$, Lemma 14.8 can be applied to the true distribution (14.40), the plug-in VB predictive distribution (14.33), and the predictive distribution (14.34) when $H \ge H^*$, where
$$\begin{aligned} \mu^* &= B^* A^{*\top} x, & \Sigma^* &= \sigma'^2 I_L, \\ \hat{\mu} &= \hat{B} \hat{A}^\top x = \mu^* + O_p(N^{-1/2}), & \hat{\Sigma} &= \sigma'^2 I_L = \Sigma^*, \\ \acute{\mu} &= \Psi \hat{B} \hat{A}^\top x = \hat{\mu} + O_p(N^{-1}), & \acute{\Sigma} &= \sigma'^2 \Psi = \hat{\Sigma} + O_p(N^{-1}), \end{aligned}$$
for $\Psi = I_L + O_p(N^{-1})$. Here, Lemma 14.7 was used in the equation for $\hat{\mu}$. Thus, we have the following corollary:

Corollary 14.9 When $H \ge H^*$, it holds that
$$\left\langle \log \frac{p(y|x, X, Y)}{p(y|x, \hat{A}, \hat{B})} \right\rangle_{q(y|x)} = O_p(N^{-3/2}),$$
and therefore the difference between the generalization error (13.30) of the VB predictive distribution (14.34) and the generalization error of the plug-in VB predictive distribution (14.33) is of the order of $N^{-3/2}$, i.e.,
$$\mathrm{GE}(\mathcal{D}) = \left\langle \log \frac{q(y|x)}{p(y|x, X, Y)} \right\rangle_{q(y|x) q(x)} = \left\langle \log \frac{q(y|x)}{p(y|x, \hat{A}, \hat{B})} \right\rangle_{q(y|x) q(x)} + O_p(N^{-3/2}).$$
Corollary 14.9 leads to the following theorem:

Theorem 14.10 The generalization error of the RRR model is written as
$$\mathrm{GE}(\mathcal{D}) = \begin{cases} \Theta(1) & \text{if } H < H^*, \\ \frac{\left\| \hat{B} \hat{A}^\top - B^* A^{*\top} \right\|_{\mathrm{Fro}}^2}{2\sigma'^2} + O_p(N^{-3/2}) & \text{if } H \ge H^*. \end{cases} \tag{14.56}$$
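Proof When $H < H^*$, Theorem 14.5 implies that
$$\mathrm{GE}(D) = \left\langle \log \frac{q(y|x)}{p(y|x, \hat{A}, \hat{B})} \right\rangle_{q(y|x)q(x)} + O_p(N^{-1})
= \frac{\left\| \hat{B}\hat{A}^\top - B^* A^{*\top} \right\|_{\mathrm{Fro}}^2}{2\sigma'^2} + O_p(N^{-1}).$$
With Lemma 14.7, we have $\mathrm{GE}(D) = \Theta(1)$. When $H \ge H^*$, we have
$$\begin{aligned}
\mathrm{GE}(D) &= \left\langle \log \frac{q(y|x)}{p(y|x, \hat{A}, \hat{B})} \right\rangle_{q(y|x)q(x)} + O_p(N^{-3/2}) \\
&= -\left\langle \frac{\left\| y - B^* A^{*\top} x \right\|^2 - \left\| y - \hat{B}\hat{A}^\top x \right\|^2}{2\sigma'^2} \right\rangle_{q(y|x)q(x)} + O_p(N^{-3/2}) \\
&= -\left\langle \frac{\left\| y - B^* A^{*\top} x \right\|^2 - \left\| y - B^* A^{*\top} x - \left(\hat{B}\hat{A}^\top - B^* A^{*\top}\right) x \right\|^2}{2\sigma'^2} \right\rangle_{q(y|x)q(x)} + O_p(N^{-3/2}) \\
&= \left\langle \frac{\left\| \left(\hat{B}\hat{A}^\top - B^* A^{*\top}\right) x \right\|^2}{2\sigma'^2} \right\rangle_{q(x)} + O_p(N^{-3/2}) \\
&= \left\langle \frac{\mathrm{tr}\left\{ \left(\hat{B}\hat{A}^\top - B^* A^{*\top}\right) x x^\top \left(\hat{B}\hat{A}^\top - B^* A^{*\top}\right)^\top \right\}}{2\sigma'^2} \right\rangle_{q(x)} + O_p(N^{-3/2}).
\end{aligned}$$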
By using Eq. (14.44) and Lemma 14.7, we obtain Eq. (14.56), which completes the proof.

Next we compute the average generalization error (13.13) over the distribution of training samples. As Theorem 14.10 states, the generalization error never converges to zero if $H < H^*$, since a rank $H^*$ matrix cannot be well approximated by a rank $H$ matrix. Accordingly, we hereafter focus on the case where $H \ge H^*$. By $\mathrm{Wishart}_D(V, \nu)$ we denote the $D$-dimensional Wishart distribution with scale matrix $V$ and degrees of freedom $\nu$. Then we have the following theorem:

Theorem 14.11 The average generalization error of the RRR model for $H \ge H^*$ is asymptotically expanded as
$$\overline{\mathrm{GE}}(N) = \left\langle \mathrm{GE}(D) \right\rangle_{q(D)} = \lambda^{\mathrm{VB}} N^{-1} + O(N^{-3/2}),$$
where the generalization coefficient is given by
$$2\lambda^{\mathrm{VB}} = \left(H^*(L + M) - H^{*2}\right) + \left\langle \sum_{h=1}^{H - H^*} \theta\left(\gamma_h'^2 > \max(L, M)\right) \left(1 - \frac{\max(L, M)}{\gamma_h'^2}\right)^2 \gamma_h'^2 \right\rangle_{q(W)}. \tag{14.57}$$
Here $\gamma_h'^2$ is the $h$th largest eigenvalue of a random matrix $W \in \mathbb{S}_+^{\min(L,M)}$ subject to $\mathrm{Wishart}_{\min(L,M)-H^*}(I_{\min(L,M)-H^*}, \max(L, M) - H^*)$, and $\theta(\cdot)$ is the indicator function such that $\theta(\text{condition}) = 1$ if the condition is true and $\theta(\text{condition}) = 0$ otherwise.
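The expectation over $W$ in Eq. (14.57) is easy to estimate by simulation in the simplest case. The sketch below is an illustration under stated assumptions, not the procedure used in the text: it takes $\min(L, M) = H = 1$ and $H^* = 0$, so that $W$ is a scalar following the chi-square distribution with $\max(L, M)$ degrees of freedom and the first term of Eq. (14.57) vanishes. The function names, sample size, and seed are hypothetical.

```python
import random

def vb_redundant_term(w, max_lm):
    """Summand of Eq. (14.57): theta(w > max_lm) * (1 - max_lm / w)**2 * w."""
    if w <= max_lm:
        return 0.0
    return (1.0 - max_lm / w) ** 2 * w

def mc_coefficients(max_lm, n_samples=100_000, seed=0):
    """Monte Carlo estimates for min(L, M) = H = 1 and H* = 0.

    Returns (estimate of 2 * lambda^VB, plain average of w); the second value
    is what the summand would give without the truncated shrinkage factor.
    """
    rng = random.Random(seed)
    vb_sum = 0.0
    plain_sum = 0.0
    for _ in range(n_samples):
        # chi-square with max_lm degrees of freedom = Gamma(shape=max_lm/2, scale=2)
        w = rng.gammavariate(max_lm / 2.0, 2.0)
        vb_sum += vb_redundant_term(w, max_lm)
        plain_sum += w
    return vb_sum / n_samples, plain_sum / n_samples

vb, plain = mc_coefficients(5)
print("2*lambda_VB estimate:", vb, " unshrunk average:", plain)
```

The unshrunk average should be close to $\max(L, M)$, the mean of the chi-square variable, while the VB value is smaller because the summand is zero below the threshold and shrunk above it.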
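Proof Theorem 14.1 and Eqs. (14.42) and (14.48) imply that
$$\hat{\gamma}_h^{\mathrm{VB}} = \begin{cases}
\gamma_h + O_p(N^{-1}) = O_p(1) & \text{for } h = 1, \ldots, H^*, \\[4pt]
\max\left(0, \gamma_h\left(1 - \dfrac{\max(L, M)\sigma'^2}{N\gamma_h^2}\right)\right) + O_p(N^{-1}) = O_p(N^{-1/2}) & \text{for } h = H^* + 1, \ldots, H.
\end{cases} \tag{14.58}$$
Therefore, we have
$$\begin{aligned}
\left\| \hat{B}\hat{A}^\top - B^* A^{*\top} \right\|_{\mathrm{Fro}}^2
&= \left\| \sum_{h=1}^{H} \hat{\gamma}_h^{\mathrm{VB}} \omega_{b_h}\omega_{a_h}^\top - \sum_{h=1}^{\min(L,M)} \gamma_h^* \omega_{b_h}^*\omega_{a_h}^{*\top} \right\|_{\mathrm{Fro}}^2 \\
&= \left\| \sum_{h=1}^{H^*} \left(\gamma_h \omega_{b_h}\omega_{a_h}^\top - \gamma_h^* \omega_{b_h}^*\omega_{a_h}^{*\top}\right) + \sum_{h=H^*+1}^{H} \hat{\gamma}_h^{\mathrm{VB}} \omega_{b_h}\omega_{a_h}^\top + O_p(N^{-1}) \right\|_{\mathrm{Fro}}^2 \\
&= \left\| \sum_{h=1}^{H^*} \left(\gamma_h \omega_{b_h}\omega_{a_h}^\top - \gamma_h^* \omega_{b_h}^*\omega_{a_h}^{*\top}\right) + \sum_{h=H^*+1}^{H} \hat{\gamma}_h^{\mathrm{VB}} \omega_{b_h}\omega_{a_h}^\top \right\|_{\mathrm{Fro}}^2 + O_p(N^{-3/2}) \\
&= \left\| V - B^* A^{*\top} + \sum_{h=H^*+1}^{H} \hat{\gamma}_h^{\mathrm{VB}} \omega_{b_h}\omega_{a_h}^\top - \sum_{h=H^*+1}^{\min(L,M)} \gamma_h \omega_{b_h}\omega_{a_h}^\top \right\|_{\mathrm{Fro}}^2 + O_p(N^{-3/2}).
\end{aligned}$$
Here, in order to get the third equation, we used the fact that the first two terms in the norm in the second equation are of the order of $O_p(N^{-1/2})$. The expectation over the distribution of training samples is given by
$$\begin{aligned}
\left\langle \left\| \hat{B}\hat{A}^\top - B^* A^{*\top} \right\|_{\mathrm{Fro}}^2 \right\rangle_{q(D)}
&= \left\langle \left\| V - B^* A^{*\top} + \sum_{h=H^*+1}^{H} \hat{\gamma}_h^{\mathrm{VB}} \omega_{b_h}\omega_{a_h}^\top - \sum_{h=H^*+1}^{\min(L,M)} \gamma_h \omega_{b_h}\omega_{a_h}^\top \right\|_{\mathrm{Fro}}^2 \right\rangle_{q(D)} + O(N^{-3/2}) \\
&= \left\langle \left\| V - B^* A^{*\top} \right\|_{\mathrm{Fro}}^2 \right\rangle_{q(D)}
 + 2\left\langle \mathrm{tr}\left\{ \left(V - B^* A^{*\top}\right)^\top \left( \sum_{h=H^*+1}^{H} \hat{\gamma}_h^{\mathrm{VB}} \omega_{b_h}\omega_{a_h}^\top - \sum_{h=H^*+1}^{\min(L,M)} \gamma_h \omega_{b_h}\omega_{a_h}^\top \right) \right\} \right\rangle_{q(D)} \\
&\quad + \left\langle \sum_{h=H^*+1}^{H} \left(\hat{\gamma}_h^{\mathrm{VB}}\right)^2 + \sum_{h=H^*+1}^{\min(L,M)} \gamma_h^2 - 2\sum_{h=H^*+1}^{H} \gamma_h \hat{\gamma}_h^{\mathrm{VB}} \right\rangle_{q(D)} + O(N^{-3/2}) \\
&= \left\langle \left\| V - B^* A^{*\top} \right\|_{\mathrm{Fro}}^2 \right\rangle_{q(D)}
 + 2\left\langle \sum_{h=H^*+1}^{H} \gamma_h \hat{\gamma}_h^{\mathrm{VB}} - \sum_{h=H^*+1}^{\min(L,M)} \gamma_h^2 \right\rangle_{q(D)} \\
&\quad + \left\langle \sum_{h=H^*+1}^{H} \left(\hat{\gamma}_h^{\mathrm{VB}}\right)^2 + \sum_{h=H^*+1}^{\min(L,M)} \gamma_h^2 - 2\sum_{h=H^*+1}^{H} \gamma_h \hat{\gamma}_h^{\mathrm{VB}} \right\rangle_{q(D)} + O(N^{-3/2}) \\
&= \left\langle \left\| V - B^* A^{*\top} \right\|_{\mathrm{Fro}}^2 \right\rangle_{q(D)}
 - \left\langle \sum_{h=H^*+1}^{\min(L,M)} \gamma_h^2 \right\rangle_{q(D)}
 + \left\langle \sum_{h=H^*+1}^{H} \left(\hat{\gamma}_h^{\mathrm{VB}}\right)^2 \right\rangle_{q(D)} + O(N^{-3/2}).
\end{aligned} \tag{14.59}$$
Here we used Eq. (14.49) and the orthonormality of the singular vectors.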
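Eq. (14.45) implies that the first term in Eq. (14.59) is equal to
$$\left\langle \left\| V - B^* A^{*\top} \right\|_{\mathrm{Fro}}^2 \right\rangle_{q(D)} = LM\frac{\sigma'^2}{N}. \tag{14.60}$$
The redundant components $\{\gamma_h \omega_{b_h}\omega_{a_h}^\top\}_{h=H^*+1}^{\min(L,M)}$ are zero-mean (see Eq. (14.49)) Gaussian matrices capturing the Gaussian noise in the orthogonal space to the necessary components $\{\gamma_h \omega_{b_h}\omega_{a_h}^\top\}_{h=1}^{H^*}$. Therefore, the distribution of the corresponding singular values $\{\gamma_h\}_{h=H^*+1}^{\min(L,M)}$ coincides with the distribution of the singular values of $V' \in \mathbb{R}^{(\min(L,M)-H^*)\times(\max(L,M)-H^*)}$ subject to
$$q(V') = \mathrm{MGauss}_{\min(L,M)-H^*,\,\max(L,M)-H^*}\left(V'; 0_{\min(L,M)-H^*,\,\max(L,M)-H^*}, \frac{\sigma'^2}{N} I_{\min(L,M)-H^*} \otimes I_{\max(L,M)-H^*}\right). \tag{14.61}$$
This leads to
$$\left\langle \sum_{h=H^*+1}^{\min(L,M)} \gamma_h^2 \right\rangle_{q(D)} = (L - H^*)(M - H^*)\frac{\sigma'^2}{N}. \tag{14.62}$$
Let $\{\gamma_h'\}_{h=1}^{\min(L,M)-H^*}$ be the singular values of $\frac{\sqrt{N}}{\sigma'} V'$. Then, $\{\gamma_h'^2\}_{h=1}^{\min(L,M)-H^*}$ are the eigenvalues of $W = \frac{N}{\sigma'^2} V' V'^\top$, which is subject to $\mathrm{Wishart}_{\min(L,M)-H^*}(I_{\min(L,M)-H^*}, \max(L, M) - H^*)$. By substituting Eqs. (14.60), (14.62), and (14.58) into Eq. (14.59), we have
$$\begin{aligned}
\left\langle \left\| \hat{B}\hat{A}^\top - B^* A^{*\top} \right\|_{\mathrm{Fro}}^2 \right\rangle_{q(D)}
&= \frac{\sigma'^2}{N}\left\{LM - (L - H^*)(M - H^*)\right\}
 + \left\langle \sum_{h=H^*+1}^{H} \left( \max\left\{0, \gamma_h\left(1 - \frac{\max(L, M)\sigma'^2}{N\gamma_h^2}\right)\right\} \right)^2 \right\rangle_{q(D)} + O(N^{-3/2}) \\
&= \frac{\sigma'^2}{N}\left\{ \left(H^*(L + M) - H^{*2}\right) + \left\langle \sum_{h=1}^{H-H^*} \left( \max\left\{0, 1 - \frac{\max(L, M)}{\gamma_h'^2}\right\} \right)^2 \gamma_h'^2 \right\rangle_{q(W)} \right\} + O(N^{-3/2}).
\end{aligned}$$
Substituting the preceding into Eq. (14.56) completes the proof.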
The first and the second terms in Eq. (14.57) correspond to the contribution from the necessary components $h = 1, \ldots, H^*$ and the contribution from the redundant components $h = H^* + 1, \ldots, H$, respectively. If we focus on the parameter space of the first $H^*$ components, i.e., $\{a_h, b_h\}_{h=1}^{H^*}$, the true linear mapping $\{a_h^*, b_h^*\}_{h=1}^{H^*}$ lies at an (essentially) nonsingular point (after removing the trivial $H^{*2}$ redundancy). Therefore, as the regular learning theory states, the contribution from the necessary components is equal to the (essential) degree of freedom (see Eq. (14.4)) of the RRR model for $H = H^*$. On the other hand, the regular learning theory cannot be applied to the redundant components $\{a_h, b_h\}_{h=H^*+1}^{H}$ since the true parameter is on the singularities $\{a_h = 0\} \cup \{b_h = 0\}$, making the second term different from the degree of freedom of the redundant parameters.

Assuming that $L$ and $M$ are large, we can approximate the second term in Eq. (14.57) by using the random matrix theory (see Section 8.4.1). Consider the large-scale limit when $L, M, H, H^*$ go to infinity with the same ratio, so that
$$\alpha = \frac{\min(L, M) - H^*}{\max(L, M) - H^*}, \tag{14.63}$$
$$\beta = \frac{H - H^*}{\min(L, M) - H^*}, \tag{14.64}$$
$$\kappa = \frac{\max(L, M)}{\max(L, M) - H^*} \tag{14.65}$$
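are constant. Then the Marčenko–Pastur law (Proposition 8.11) states that the empirical distribution of the eigenvalues $\{y_1, \ldots, y_{\min(L,M)-H^*}\}$ of the random matrix $\frac{N V' V'^\top}{(\max(L,M)-H^*)\sigma'^2} \sim \mathrm{Wishart}_{\min(L,M)-H^*}(I_{\min(L,M)-H^*}, 1)$ almost surely converges to
$$p(y) = \frac{\sqrt{(y - \underline{y})(\overline{y} - y)}}{2\pi\alpha y}\,\theta\left(\underline{y} < y < \overline{y}\right), \qquad \underline{y} = (1 - \sqrt{\alpha})^2, \quad \overline{y} = (1 + \sqrt{\alpha})^2.$$

...

$$2\nu^{\mathrm{VB}} = -\left(H^*(L + M) - H^{*2}\right) - \left\langle \sum_{h=1}^{H-H^*} \theta\left(\gamma_h'^2 > \max(L, M)\right) \left(1 - \frac{\max(L, M)}{\gamma_h'^2}\right)\left(1 + \frac{\max(L, M)}{\gamma_h'^2}\right) \gamma_h'^2 \right\rangle_{q(W)}. \tag{14.76}$$
Here $\gamma_h'^2$ is the $h$th largest eigenvalue of a random matrix $W \in \mathbb{S}_+^{\min(L,M)}$ subject to $\mathrm{Wishart}_{\min(L,M)-H^*}(I_{\min(L,M)-H^*}, \max(L, M) - H^*)$.

Proof From Eq. (14.58), we have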
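$$\begin{aligned}
\left\| \hat{B}\hat{A}^\top - V \right\|_{\mathrm{Fro}}^2
&= \left\| \sum_{h=1}^{H} \hat{\gamma}_h^{\mathrm{VB}} \omega_{b_h}\omega_{a_h}^\top - \sum_{h=1}^{\min(L,M)} \gamma_h \omega_{b_h}\omega_{a_h}^\top \right\|_{\mathrm{Fro}}^2 \\
&= \left\| \sum_{h=H^*+1}^{H} \left(\hat{\gamma}_h^{\mathrm{VB}} - \gamma_h\right) \omega_{b_h}\omega_{a_h}^\top - \sum_{h=H+1}^{\min(L,M)} \gamma_h \omega_{b_h}\omega_{a_h}^\top + O_p(N^{-1}) \right\|_{\mathrm{Fro}}^2 \\
&= \left\| \sum_{h=H^*+1}^{H} \left(\hat{\gamma}_h^{\mathrm{VB}} - \gamma_h\right) \omega_{b_h}\omega_{a_h}^\top - \sum_{h=H+1}^{\min(L,M)} \gamma_h \omega_{b_h}\omega_{a_h}^\top \right\|_{\mathrm{Fro}}^2 + O_p(N^{-3/2}) \\
&= \sum_{h=H^*+1}^{H} \left(\hat{\gamma}_h^{\mathrm{VB}} - \gamma_h\right)^2 + \sum_{h=H+1}^{\min(L,M)} \gamma_h^2 + O_p(N^{-3/2}) \\
&= \sum_{h=H^*+1}^{H} \left( \max\left\{0, \gamma_h\left(1 - \frac{\max(L, M)\sigma'^2}{N\gamma_h^2}\right)\right\} - \gamma_h \right)^2 + \sum_{h=H+1}^{\min(L,M)} \gamma_h^2 + O_p(N^{-3/2}) \\
&= -\sum_{h=H^*+1}^{H} \theta\left(\gamma_h^2 > \frac{\max(L, M)\sigma'^2}{N}\right) \cdot \left(\gamma_h - \frac{\max(L, M)\sigma'^2}{N\gamma_h}\right)\left(\gamma_h + \frac{\max(L, M)\sigma'^2}{N\gamma_h}\right) \\
&\quad + \sum_{h=H^*+1}^{\min(L,M)} \gamma_h^2 + O_p(N^{-3/2}).
\end{aligned} \tag{14.77}$$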
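By using Eqs. (14.60), (14.62), and (14.77), we have
$$\begin{aligned}
\left\langle \left\| V - \hat{B}\hat{A}^\top \right\|_{\mathrm{Fro}}^2 - \left\| V - B^* A^{*\top} \right\|_{\mathrm{Fro}}^2 \right\rangle_{q(D)}
&= -\frac{\sigma'^2}{N}\left(H^*(L + M) - H^{*2}\right) \\
&\quad - \left\langle \sum_{h=H^*+1}^{H} \theta\left(\gamma_h^2 > \frac{\max(L, M)\sigma'^2}{N}\right) \cdot \left(\gamma_h - \frac{\max(L, M)\sigma'^2}{N\gamma_h}\right)\left(\gamma_h + \frac{\max(L, M)\sigma'^2}{N\gamma_h}\right) \right\rangle_{q(D)} + O(N^{-3/2}).
\end{aligned}$$
Thus, by introducing the singular values $\{\gamma_h'\}_{h=1}^{\min(L,M)-H^*}$ of $\frac{\sqrt{N}}{\sigma'} V'$, where $V'$ is a random matrix subject to Eq. (14.61), and using Theorem 14.15, we obtain Eq. (14.76), which completes the proof.

Finally, we apply the Marčenko–Pastur law (Proposition 8.11) for evaluating the second term in Eq. (14.76). In the large-scale limit when $L, M, H, H^*$ go to infinity with the same ratio, so that Eqs. (14.63) through (14.65) are constant, we have
$$\begin{aligned}
&\left\langle \sum_{h=1}^{H-H^*} \theta\left(\gamma_h'^2 > \max(L, M)\right) \left(1 - \frac{\max(L, M)}{\gamma_h'^2}\right)\left(1 + \frac{\max(L, M)}{\gamma_h'^2}\right) \gamma_h'^2 \right\rangle_{q(W)} \\
&\quad \to (\min(L, M) - H^*)(\max(L, M) - H^*) \int_{u_\beta}^{\infty} \theta(y > \kappa)\left(1 - \frac{\kappa}{y}\right)\left(1 + \frac{\kappa}{y}\right) y\, p(y)\, dy \\
&\quad = (\min(L, M) - H^*)(\max(L, M) - H^*) \int_{\max(\kappa, u_\beta)}^{\infty} \left(y - \kappa^2 y^{-1}\right) p(y)\, dy \\
&\quad = \frac{(\min(L, M) - H^*)(\max(L, M) - H^*)}{2\pi\alpha} \left\{ J_1(\acute{u}) - \kappa^2 J_{-1}(\acute{u}) \right\},
\end{aligned}$$
where $J_s(u)$, $u_\beta$, and $\acute{u}$ are defined in Eqs. (14.68), (14.69), and (14.70), respectively. Thus, the transformation $z = \left(y - (\underline{y} + \overline{y})/2\right)/(2\sqrt{\alpha})$ gives the following theorem:

Theorem 14.17 The VB training coefficient of the RRR model in the large-scale limit is given by
$$2\nu^{\mathrm{VB}} \to -\left(H^*(L + M) - H^{*2}\right) - \frac{(\min(L, M) - H^*)(\max(L, M) - H^*)}{2\pi\alpha} \left\{ J_1(\acute{z}) - \kappa^2 J_{-1}(\acute{z}) \right\}, \tag{14.78}$$
where $J_1(z)$, $J_{-1}(z)$, and $\acute{z}$ are defined in Theorem 14.12.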
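The large-scale integrals above can also be evaluated by direct numerical quadrature instead of the closed-form functions $J_s$. The sketch below is an illustration under several assumptions that are not confirmed by the text: it uses the standard Marčenko–Pastur density with support $[(1-\sqrt{\alpha})^2, (1+\sqrt{\alpha})^2]$, it interprets $u_\beta$ as the point above which the density carries mass $\beta$, and it integrates in the variable $t$ with $z = \cos t$, which removes the square-root behavior at the edges. All function names are hypothetical.

```python
import math

def mp_integral(f, alpha, z_min=-1.0, n=4000):
    """Integrate f(y) p(y) over y >= y(z_min) for the Marchenko-Pastur density p
    with ratio alpha in (0, 1], using y = 1 + alpha + 2 sqrt(alpha) cos(t)."""
    t_max = math.acos(max(-1.0, min(1.0, z_min)))
    h = t_max / n
    total = 0.0
    for i in range(n):
        t = (i + 0.5) * h                      # midpoint rule in t
        y = 1.0 + alpha + 2.0 * math.sqrt(alpha) * math.cos(t)
        total += f(y) * 2.0 * math.sin(t) ** 2 / (math.pi * y)
    return total * h

def z_of_y(y, alpha):
    """The transformation z = (y - (y_lower + y_upper) / 2) / (2 sqrt(alpha))."""
    return (y - (1.0 + alpha)) / (2.0 * math.sqrt(alpha))

def upper_quantile_z(beta, alpha, iters=40):
    """z such that the mass of p above y(z) equals beta (assumed meaning of u_beta)."""
    if beta >= 1.0:
        return -1.0
    lo, hi = -1.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mp_integral(lambda y: 1.0, alpha, z_min=mid) > beta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def redundant_terms(l, m, h, h_star):
    """Large-scale values of sum_h gamma'_h^2 without shrinkage, and of the
    second (bracketed) term of Eq. (14.76)."""
    lo_dim, hi_dim = min(l, m) - h_star, max(l, m) - h_star
    alpha = lo_dim / hi_dim
    beta = (h - h_star) / lo_dim
    kappa = max(l, m) / hi_dim
    z_beta = upper_quantile_z(beta, alpha)
    z_vb = max(z_beta, z_of_y(kappa, alpha))
    scale = lo_dim * hi_dim
    plain = scale * mp_integral(lambda y: y, alpha, z_min=z_beta)
    vb_train = scale * mp_integral(lambda y: y - kappa ** 2 / y, alpha, z_min=z_vb)
    return plain, vb_train

print(redundant_terms(30, 50, 30, 0))
```

For a full-rank model with $H^* = 0$, the unshrunk value should equal $LM$, since all eigenvalues are included and the density has unit mean; the VB term is smaller because only eigenvalues above $\kappa$ contribute.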
14.2.5 Free Energy
The VB free energy can be analyzed relatively easily based on the orders of the variational parameters, given by Corollary 14.3:

Theorem 14.18 The relative VB free energy (13.150) of the RRR model for $H \ge H^*$ is asymptotically expanded as
$$\tilde{F}^{\mathrm{VB}}(D) = F^{\mathrm{VB}}(Y|X) - N S_N(Y|X) = \lambda'^{\mathrm{VB}} \log N + O_p(1), \tag{14.79}$$
where the free energy coefficient is given by
$$2\lambda'^{\mathrm{VB}} = H^*(L + M) + (H - H^*)\min(L, M). \tag{14.80}$$
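Proof The VB free energy for the RRR model is given by Eq. (14.14), and the empirical entropy is given by
$$\begin{aligned}
2S_N(Y|X) &= -\frac{2}{N}\sum_{n=1}^{N} \log q(y^{(n)}|x^{(n)}) \\
&= L\log(2\pi\sigma'^2) + \frac{\sum_{n=1}^{N} \left\| y^{(n)} - B^* A^{*\top} x^{(n)} \right\|^2}{N\sigma'^2} \\
&= L\log(2\pi\sigma'^2) + \frac{\frac{1}{N}\sum_{n=1}^{N} \left\| y^{(n)} \right\|^2 - 2\mathrm{tr}\left(V^\top B^* A^{*\top}\right) + \left\| B^* A^{*\top} \right\|_{\mathrm{Fro}}^2}{\sigma'^2} \\
&= L\log(2\pi\sigma'^2) + \frac{\frac{1}{N}\sum_{n=1}^{N} \left\| y^{(n)} \right\|^2 + \left\| V - B^* A^{*\top} \right\|_{\mathrm{Fro}}^2 - \left\| V \right\|_{\mathrm{Fro}}^2}{\sigma'^2}.
\end{aligned}$$
Therefore, the relative VB free energy (14.79) is given as
$$\begin{aligned}
2\tilde{F}^{\mathrm{VB}}(D) &= N \cdot \frac{\left\| V - \hat{B}\hat{A}^\top \right\|_{\mathrm{Fro}}^2 - \left\| V - B^* A^{*\top} \right\|_{\mathrm{Fro}}^2}{\sigma'^2}
 + M\log\frac{\det(C_A)}{\det(\hat{\Sigma}_A)} + L\log\frac{\det(C_B)}{\det(\hat{\Sigma}_B)} - (L + M)H \\
&\quad + \mathrm{tr}\left\{ C_A^{-1}\left(\hat{A}^\top\hat{A} + M\hat{\Sigma}_A\right) + C_B^{-1}\left(\hat{B}^\top\hat{B} + L\hat{\Sigma}_B\right)
 + N\sigma'^{-2}\left( -\hat{A}^\top\hat{A}\hat{B}^\top\hat{B} + \left(\hat{A}^\top\hat{A} + M\hat{\Sigma}_A\right)\left(\hat{B}^\top\hat{B} + L\hat{\Sigma}_B\right) \right) \right\}.
\end{aligned} \tag{14.81}$$
Eqs. (14.52) and (14.43) imply that the first term in Eq. (14.81) is $O_p(1)$. Corollary 14.3 with Eqs. (14.42) and (14.48) implies that, for $h = 1, \ldots, H^*$,
$$\hat{a}_h = \Theta_p(1), \qquad \hat{b}_h = \Theta_p(1), \qquad \hat{\sigma}_{a_h}^2 = \Theta_p(N^{-1}), \qquad \hat{\sigma}_{b_h}^2 = \Theta_p(N^{-1}),$$
and, for $h = H^* + 1, \ldots, H$,
$$\begin{aligned}
&\hat{a}_h = O_p(1), && \hat{b}_h = O_p(N^{-1/2}), && \hat{\sigma}_{a_h}^2 = \Theta_p(1), && \hat{\sigma}_{b_h}^2 = \Theta_p(N^{-1}), && (\text{if } L < M), \\
&\hat{a}_h = O_p(N^{-1/4}), && \hat{b}_h = O_p(N^{-1/4}), && \hat{\sigma}_{a_h}^2 = \Theta_p(N^{-1/2}), && \hat{\sigma}_{b_h}^2 = \Theta_p(N^{-1/2}), && (\text{if } L = M), \\
&\hat{a}_h = O_p(N^{-1/2}), && \hat{b}_h = O_p(1), && \hat{\sigma}_{a_h}^2 = \Theta_p(N^{-1}), && \hat{\sigma}_{b_h}^2 = \Theta_p(1), && (\text{if } L > M).
\end{aligned}$$
These results imply that most terms in Eq. (14.81) are $O_p(1)$, and we thus have
$$\begin{aligned}
2\tilde{F}^{\mathrm{VB}}(D) &= M\log\frac{\det(C_A)}{\det(\hat{\Sigma}_A)} + L\log\frac{\det(C_B)}{\det(\hat{\Sigma}_B)} + O_p(1) \\
&= M\log\prod_{h=1}^{H} \hat{\sigma}_{a_h}^{-2} + L\log\prod_{h=1}^{H} \hat{\sigma}_{b_h}^{-2} + O_p(1) \\
&= \left\{H^*(L + M) + (H - H^*)\min(L, M)\right\}\log N + O_p(1),
\end{aligned}$$
which completes the proof.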
As is clear from the proof, the first and the second terms in Eq. (14.80) correspond to the contribution from the necessary components, $h = 1, \ldots, H^*$, and the contribution from the redundant components, $h = H^* + 1, \ldots, H$, respectively. Note that the contribution from the necessary components contains the trivial redundancy, i.e., it is $H^*(L + M)$ instead of $H^*(L + M) - H^{*2}$. This is because the independence between $A$ and $B$ prevents the VB posterior distribution from extending along the trivial redundancy.
14.2.6 Comparison with Other Learning Algorithms
Theorems 14.12, 14.17, and 14.18 allow us to compute the generalization, the training, and the free energy coefficients of VB learning. We can now compare those properties with those of ML learning and Bayesian learning, which have been clarified for the RRR model. Note that MAP learning with a smooth and finite prior (e.g., the Gaussian prior (14.3)) with fixed hyperparameters is asymptotically equivalent to ML learning, and has the same generalization and training coefficients.

ML Learning
The generalization error of ML learning in the RRR model was analyzed (Fukumizu, 1999), based on the Marčenko–Pastur law (Proposition 8.11). Let $\gamma_h'^2$ be the $h$th largest eigenvalue of a random matrix $W \in \mathbb{S}_+^{\min(L,M)}$ subject to $\mathrm{Wishart}_{\min(L,M)-H^*}(I_{\min(L,M)-H^*}, \max(L, M) - H^*)$.

Theorem 14.19 (Fukumizu, 1999) The average ML generalization error of the RRR model for $H \ge H^*$ is asymptotically expanded as
$$\overline{\mathrm{GE}}^{\mathrm{ML}}(N) = \lambda^{\mathrm{ML}} N^{-1} + O(N^{-3/2}),$$
where the generalization coefficient is given by
$$2\lambda^{\mathrm{ML}} = \left(H^*(L + M) - H^{*2}\right) + \left\langle \sum_{h=1}^{H-H^*} \gamma_h'^2 \right\rangle_{q(W)}. \tag{14.82}$$

Theorem 14.20 (Fukumizu, 1999) The ML generalization coefficient of the RRR model in the large-scale limit is given by
$$2\lambda^{\mathrm{ML}} \to \left(H^*(L + M) - H^{*2}\right) + \frac{(\min(L, M) - H^*)(\max(L, M) - H^*)}{2\pi\alpha} J_1(\acute{z}), \tag{14.83}$$
where $J_1(\cdot)$ and $\acute{z}$ are defined in Theorem 14.12.

Actually, Theorems 14.11 and 14.12 were derived by extending Theorems 14.19 and 14.20 to VB learning. We can derive Theorems 14.19 and 14.20 in the same way as for VB learning by replacing the VB estimator (14.58) with the ML estimator $\hat{\gamma}_h^{\mathrm{ML}} = \gamma_h$. The training error can be similarly analyzed.

Theorem 14.21 The average ML training error of the RRR model for $H \ge H^*$ is asymptotically expanded as
$$\overline{\mathrm{TE}}^{\mathrm{ML}}(N) = \nu^{\mathrm{ML}} N^{-1} + O(N^{-3/2}),$$
where the training coefficient is given by
$$2\nu^{\mathrm{ML}} = -\left(H^*(L + M) - H^{*2}\right) - \left\langle \sum_{h=1}^{H-H^*} \gamma_h'^2 \right\rangle_{q(W)}. \tag{14.84}$$

Theorem 14.22 The ML training coefficient of the RRR model in the large-scale limit is given by
$$2\nu^{\mathrm{ML}} \to -\left(H^*(L + M) - H^{*2}\right) - \frac{(\min(L, M) - H^*)(\max(L, M) - H^*)}{2\pi\alpha} J_1(\acute{z}), \tag{14.85}$$
where $J_1(\cdot)$ and $\acute{z}$ are defined in Theorem 14.12.

Note that Theorems 14.19 and 14.21 imply that the generalization coefficient and the training coefficient are antisymmetric in ML learning, i.e., $\lambda^{\mathrm{ML}} = -\nu^{\mathrm{ML}}$, while they are not antisymmetric in VB learning, i.e., $\lambda^{\mathrm{VB}} \ne -\nu^{\mathrm{VB}}$ (see Theorems 14.11 and 14.16).

Bayesian Learning
The Bayes free energy in the RRR model was clarified based on the singular learning theory (see Section 13.5.4).
Theorem 14.23 (Aoyagi and Watanabe, 2005) The relative Bayes free energy (13.32) in the RRR model is asymptotically expanded as
$$\tilde{F}^{\mathrm{Bayes}}(D) = F^{\mathrm{Bayes}}(Y|X) - N S_N(Y|X) = \lambda'^{\mathrm{Bayes}} \log N - (m - 1)\log\log N + O_p(1),$$
where the free energy coefficient, as well as the coefficient of the second leading term, is given as follows:
(i) When $L + H^* \le M + H$, $M + H^* \le L + H$, and $H^* + H \le L + M$:
    (a) If $L + M + H + H^*$ is even, then $m = 1$ and
    $$2\lambda'^{\mathrm{Bayes}} = \frac{-(H^* + H)^2 - (L - M)^2 + 2(H^* + H)(L + M)}{4}.$$
    (b) If $L + M + H + H^*$ is odd, then $m = 2$ and
    $$2\lambda'^{\mathrm{Bayes}} = \frac{-(H^* + H)^2 - (L - M)^2 + 2(H^* + H)(L + M) + 1}{4}.$$
(ii) When $M + H < L + H^*$, then $m = 1$ and $2\lambda'^{\mathrm{Bayes}} = HM - HH^* + LH^*$.
(iii) When $L + H < M + H^*$, then $m = 1$ and $2\lambda'^{\mathrm{Bayes}} = HL - HH^* + MH^*$.
(iv) When $L + M < H + H^*$, then $m = 1$ and $2\lambda'^{\mathrm{Bayes}} = LM$.

Theorem 14.23 immediately informs us of the asymptotic behavior of the Bayes generalization error, based on Corollary 13.14.

Theorem 14.24 (Aoyagi and Watanabe, 2005) The Bayes generalization error of the RRR model for $H \ge H^*$ is asymptotically expanded as
$$\overline{\mathrm{GE}}^{\mathrm{Bayes}}(N) = \lambda^{\mathrm{Bayes}} N^{-1} - (m - 1)(N\log N)^{-1} + o\left((N\log N)^{-1}\right),$$
where $\lambda^{\mathrm{Bayes}} = \lambda'^{\mathrm{Bayes}}$ and $m$ are given in Theorem 14.23.

Unfortunately, the Bayes training error has not been clarified yet.

Numerical Comparison
Let us visually compare the theoretically clarified generalization properties. Figures 14.1 through 14.4 show the generalization coefficients and the training coefficients of the RRR model under the following settings:
(i) $\max(L, M) = 50$, $\min(L, M) = 30$, $H = 1, \ldots, 30$ (horizontal axis), $H^* = 0$,
(ii) $\max(L, M) = 80$, $\min(L, M) = 1, \ldots, 80$ (horizontal axis), $H = 1$, $H^* = 0$,
(iii) $L = M = 80$, $H = 1, \ldots, 80$ (horizontal axis), $H^* = 0$,
(iv) $\max(L, M) = 50$, $\min(L, M) = 30$, $H = 20$, $H^* = 1, \ldots, 20$ (horizontal axis).

The vertical axis indicates the coefficient normalized by half of the essential parameter dimension $D$, given by Eq. (14.4). The curves in the positive vertical region correspond to the generalization coefficients of VB learning, ML learning, and Bayesian learning, while the curves in the negative vertical region correspond to the training coefficients. As a guide, we depicted the lines $2\lambda/D = 1$ and $2\nu/D = -1$, which correspond to the generalization and the training coefficients (by ML learning and Bayesian learning) of the regular models with the same parameter dimensionality. The curves for ML learning and VB learning were computed under the large-scale approximation, i.e., by using Theorems 14.12, 14.17, 14.20, and 14.22.¹

¹ We confirmed that numerical computation with Theorems 14.11, 14.16, 14.19, and 14.21 gives visually indistinguishable results.

Figure 14.1 The generalization coefficients (in the positive vertical region) and the training coefficients (in the negative vertical region) of VB learning, ML learning, and Bayesian learning in the RRR model with $\max(L, M) = 50$, $\min(L, M) = 30$, $H = 1, \ldots, 30$, and $H^* = 0$. (Curves: VB, ML, Bayes, Regular.)
Figure 14.2 The generalization coefficients and the training coefficients ($\max(L, M) = 80$, $\min(L, M) = 1, \ldots, 80$, $H = 1$, and $H^* = 0$). (Curves: VB, ML, Bayes, Regular.)
Figure 14.3 The generalization coefficients and the training coefficients ($L = M = 80$, $H = 1, \ldots, 80$, and $H^* = 0$). (Curves: VB, ML, Bayes, Regular.)
Figure 14.4 The generalization coefficients and the training coefficients ($\max(L, M) = 50$, $\min(L, M) = 30$, $H = 20$, and $H^* = 1, \ldots, 20$). (Curves: VB, ML, Bayes, Regular.)

We see in Figures 14.1 through 14.4 that VB learning generally provides comparable generalization performance to Bayesian learning. However, significant differences are also observed. For example, we see in Figure 14.1 that VB learning provides much worse generalization performance than Bayesian learning when $H \ll \min(L, M)$, and much better performance when $H \sim \min(L, M)$. Another finding is that, in Figures 14.1 and 14.3, the VB generalization coefficient depends on $H$ similarly to the ML generalization coefficient. Moreover, we see that, when $\min(L, M) = 80$ in Figure 14.2 and when $H = 1$ in Figure 14.3, the VB generalization coefficient slightly exceeds the line $2\lambda/D = 1$: the VB generalization coefficient per parameter dimension can be larger than that in the regular models, which never happens for the Bayes generalization coefficient (see Eq. (13.125)). Finally, Figure 14.4 shows that, for this particular RRR model with $\max(L, M) = 50$, $\min(L, M) = 30$, and $H = 20$, VB learning always gives smaller generalization error than Bayesian learning in the asymptotic limit, regardless of the true rank $H^*$. This might seem to contradict the proven optimality of Bayesian learning, namely that Bayesian learning is never dominated by any other method (see Appendix D for the optimality of Bayesian learning and Appendix A for the definition of the term "domination"). We further discuss this issue by considering subtle true singular values in Section 14.2.7.

Next we compare the VB free energy with the Bayes free energy, by using Theorems 14.18 and 14.23. Figures 14.5 through 14.8 show the free energy coefficients of the RRR model with the same setting as Figures 14.1 through 14.4, respectively. As for the generalization and the training coefficients, the vertical axis indicates the free energy coefficient normalized by half of the essential parameter dimensionality $D$, given by Eq. (14.4). The curves correspond to the VB free energy coefficient (Theorem 14.18), the Bayes free energy coefficient (Theorem 14.23), and the Bayes free energy coefficient $2\lambda'^{\mathrm{Bayes}}_{\mathrm{Regular}} = D$ of the regular models with the same parameter dimensionality.

Figure 14.5 Free energy coefficients ($\max(L, M) = 50$, $\min(L, M) = 30$, $H = 1, \ldots, 30$, and $H^* = 0$). The VB and the Bayes free energy coefficients almost overlap. (Curves: VB, Bayes, Regular.)
Figure 14.6 Free energy coefficients ($\max(L, M) = 80$, $\min(L, M) = 1, \ldots, 80$, $H = 1$, and $H^* = 0$). The VB and the Bayes free energy coefficients almost overlap. (Curves: VB, Bayes, Regular.)
Figure 14.7 Free energy coefficients ($L = M = 80$, $H = 1, \ldots, 80$, and $H^* = 0$). (Curves: VB, Bayes, Regular.)
Figure 14.8 Free energy coefficients ($\max(L, M) = 50$, $\min(L, M) = 30$, $H = 20$, and $H^* = 1, \ldots, 20$). (Curves: VB, Bayes, Regular.)

We find that the VB free energy almost coincides with the Bayes free energy in Figures 14.5 and 14.6, while the VB free energy is much larger than the Bayes free energy in Figures 14.7 and 14.8. Since the gap between the VB free energy and the Bayes free energy indicates how well the VB posterior approximates the Bayes posterior in terms of the KL divergence (see Section 13.6), our observation is not exactly what we would expect. For example, we see in Figure 14.1 that the generalization performance of VB learning is significantly different from Bayesian learning (when $H \ll \min(L, M)$ and when $H \sim \min(L, M)$), while the free energies in Figure 14.5 imply that the VB posterior well approximates the Bayes posterior. Also, by comparing Figures 14.3 and 14.7, we observe that, when $H \ll \min(L, M)$, VB learning provides much worse generalization performance than Bayesian learning, while the VB free energy well approximates the Bayes free energy; and that, when $H \sim \min(L, M)$, VB learning provides much better generalization performance, while the VB free energy is significantly larger than the Bayes free energy. Further investigation is required to understand the relation between the generalization performance and the gap between the VB and the Bayes free energies.

Figure 14.9 Free energy coefficients ($\max(L, M) = 80$, $\min(L, M) = 10, \ldots, 80$, $H = 10$, and $H^* = 0$). (Curves: VB, Bayes, Regular.)
Figure 14.10 Free energy coefficients ($\max(L, M) = 80$, $\min(L, M) = 20, \ldots, 80$, $H = 20$, and $H^* = 0$). (Curves: VB, Bayes, Regular.)
Figure 14.11 Free energy coefficients ($\max(L, M) = 80$, $\min(L, M) = 40, \ldots, 80$, $H = 40$, and $H^* = 0$). (Curves: VB, Bayes, Regular.)

Figures 14.9 through 14.11 show similar cases to Figure 14.6 but for different ranks $H = 10, 20, 40$, respectively. From Figures 14.5 through 14.11, we conclude that, in general, the VB free energy behaves similarly to the Bayes free energy when $L$ and $M$ are significantly different from each other or $H \ll \min(L, M)$. In Figure 14.8, the VB free energy behaves strangely and poorly approximates the Bayes free energy when $H^*$ is large. This is because of the trivial redundancy of the RRR model, of which VB learning with the independence constraint cannot make use to reduce the free energy (see the remark in the last paragraph of Section 14.2.5).
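The free energy coefficients compared above are simple enough to tabulate directly. The sketch below transcribes Eq. (14.80) and the case analysis of Theorem 14.23 into code; the function names are hypothetical, and the expression used for the essential parameter dimension, $D = H(L + M) - H^2$, is an assumption inferred from the degree of freedom $H^*(L + M) - H^{*2}$ quoted for $H = H^*$ (Eq. (14.4) itself is not reproduced here).

```python
def vb_free_energy_coef(l, m, h, h_star):
    """2 * lambda'^VB from Eq. (14.80)."""
    return h_star * (l + m) + (h - h_star) * min(l, m)

def bayes_free_energy_coef(l, m, h, h_star):
    """Return (2 * lambda'^Bayes, multiplicity) following the cases of Theorem 14.23."""
    if m + h < l + h_star:                       # case (ii)
        return h * m - h * h_star + l * h_star, 1
    if l + h < m + h_star:                       # case (iii)
        return h * l - h * h_star + m * h_star, 1
    if l + m < h + h_star:                       # case (iv)
        return l * m, 1
    s = h_star + h                               # case (i)
    base = -s * s - (l - m) ** 2 + 2 * s * (l + m)
    if (l + m + s) % 2 == 0:
        return base / 4, 1
    return (base + 1) / 4, 2

def essential_dim(l, m, h):
    """Assumed form of the essential parameter dimension D."""
    return h * (l + m) - h * h

# Normalized coefficients 2*lambda'/D for two of the settings discussed above.
for (l, m, h, h_star) in [(30, 50, 20, 0), (80, 80, 80, 0)]:
    d = essential_dim(l, m, h)
    vb = vb_free_energy_coef(l, m, h, h_star) / d
    bayes = bayes_free_energy_coef(l, m, h, h_star)[0] / d
    print((l, m, h, h_star), "VB:", vb, "Bayes:", bayes)
```

In the first setting the two coefficients coincide, while in the second the VB coefficient is clearly larger, in line with the qualitative behavior described for Figures 14.5 and 14.7.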
14.2.7 Analysis with Subtle True Singular Values
Here we conduct an additional analysis to explain the seemingly contradictory observation in Figure 14.4: in the RRR model with $\max(L, M) = 50$, $\min(L, M) = 30$, $H = 20$, VB learning always gives smaller generalization error than Bayesian learning, regardless of the true rank $H^*$. We show that this does not mean the domination by VB learning over Bayesian learning, which was proven to be never dominated by any other method (see Appendix D).

Distinct and Subtle Signal Assumptions
The contradictory observation was due to the assumption (14.42) on the true singular values:
$$\gamma_h^* = \begin{cases}
\Theta(1) & \text{for } h = 1, \ldots, H^*, \\
0 & \text{for } h = H^* + 1, \ldots, \min(L, M),
\end{cases} \tag{14.86}$$
which we call the distinct signal assumption. This assumption seems to cover any true linear mapping $B^* A^{*\top} = \sum_{h=1}^{H^*} \gamma_h^* \omega_{b_h}^* \omega_{a_h}^{*\top}$ by classifying all singular components such that $\gamma_h^* > 0$ to the necessary components $h = 1, \ldots, H^*$, and the other components such that $\gamma_h^* = 0$ to the redundant components $h = H^* + 1, \ldots, \min(L, M)$. However, in the asymptotic limit, the assumption (14.86) implicitly prohibits the existence of true singular values in the same order as the noise contribution, i.e., $\gamma_h^* = \Theta_p(N^{-1/2})$. In other words, the distinct signal assumption (14.86) considers all true singular values to be either infinitely larger than the noise or exactly equal to zero. As a result, asymptotic analysis under the distinct signal assumption reflects only the overfitting tendency of a learning machine, and ignores the underfitting tendency, which happens when the signal is not clearly separable from the noise. Since overfitting and underfitting are in a trade-off relation, it is important to investigate both tendencies when generalization performance is analyzed.

To relax the restriction discussed previously, we replace the assumption (14.42) with
$$\gamma_h^* = \begin{cases}
\Theta(1) & \text{for } h = 1, \ldots, H^*, \\
O(N^{-1/2}) & \text{for } h = H^* + 1, \ldots, \min(L, M),
\end{cases} \tag{14.87}$$
which we call the subtle signal assumption, in the following analysis (Watanabe and Amari, 2003; Nakajima and Watanabe, 2007). Note that, with the assumption (14.87), we do not intend to analyze the case where the true singular values depend on $N$. Rather, we assume realistic situations where the number of necessary components $H^*$ depends on $N$. Let us keep in mind the following two points, which are usually true when we analyze real-world data:
• The number $N$ of samples is always finite. Asymptotic theory is not to investigate what happens when $N \to \infty$, but to approximate the situation where $N$ is finite but large.
• It rarely happens that real-world data can be exactly expressed by a low-rank model. Statistical models are supposed to be simpler than the real-world data generation process, but expected to approximate it with certain accuracy, and the accuracy depends on the noise level and the number of samples.
Then we expect that, for most real-world data, it holds that $\gamma_h^* > 0$ for all $h = 1, \ldots, \min(L, M)$, but, given finite $N$, some of the true singular values are comparable to the noise contribution, $\gamma_h^* = \Theta(N^{-1/2})$, and some others are negligible, $\gamma_h^* = o(N^{-1/2})$. The subtle signal assumption (14.87) covers such realistic situations.
Generalization Error under Subtle Signal Assumption
Replacing the distinct signal assumption (14.86) with the subtle signal assumption (14.87) does not affect the discussion up to Theorem 14.10, i.e., Theorems 14.1, 14.5, and 14.10, Lemmas 14.6 through 14.8, and their corollaries still hold. Instead of Theorem 14.11, we have the following theorem:

Theorem 14.25 Under the subtle signal assumption (14.87), the average generalization error of the RRR model for $H \ge H^*$ is asymptotically expanded as
$$\overline{\mathrm{GE}}(N) = \lambda^{\mathrm{VB}} N^{-1} + O(N^{-3/2}),$$
where the generalization coefficient is given by
$$\begin{aligned}
2\lambda^{\mathrm{VB}} &= \left(H^*(L + M) - H^{*2}\right) + \frac{N}{\sigma'^2}\sum_{h=H^*+1}^{\min(L,M)} \gamma_h^{*2} \\
&\quad + \left\langle \sum_{h=1}^{H-H^*} \theta\left(\gamma_h''^2 > \max(L, M)\right) \cdot \left\{ \left(1 - \frac{\max(L, M)}{\gamma_h''^2}\right)^2 \gamma_h''^2 - 2\left(1 - \frac{\max(L, M)}{\gamma_h''^2}\right) \gamma_h'' \omega_{b_h}''^\top V''^* \omega_{a_h}'' \right\} \right\rangle_{q(V'')}.
\end{aligned} \tag{14.88}$$
Here,
$$V'' = \sum_{h=1}^{\min(L,M)-H^*} \gamma_h'' \omega_{b_h}'' \omega_{a_h}''^\top \tag{14.89}$$
is the SVD of a random matrix $V'' \in \mathbb{R}^{(\min(L,M)-H^*)\times(\max(L,M)-H^*)}$ subject to
$$q(V'') = \mathrm{MGauss}_{\min(L,M)-H^*,\,\max(L,M)-H^*}\left(V''; V''^*, I_{\min(L,M)-H^*} \otimes I_{\max(L,M)-H^*}\right), \tag{14.90}$$
and $V''^* \in \mathbb{R}^{(\min(L,M)-H^*)\times(\max(L,M)-H^*)}$ is a (nonsquare) diagonal matrix with the diagonal entries given by $V''^*_{h,h} = \frac{\sqrt{N}}{\sigma'}\gamma^*_{H^*+h}$ for $h = 1, \ldots, \min(L, M) - H^*$.

Proof From Eq. (14.58), we have
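$$\begin{aligned}
\left\| \hat{B}\hat{A}^\top - B^* A^{*\top} \right\|_{\mathrm{Fro}}^2
&= \left\| \sum_{h=1}^{H} \hat{\gamma}_h^{\mathrm{VB}} \omega_{b_h}\omega_{a_h}^\top - B^* A^{*\top} \right\|_{\mathrm{Fro}}^2 \\
&= \left\| \sum_{h=1}^{H^*} \gamma_h \omega_{b_h}\omega_{a_h}^\top - B^* A^{*\top} + \sum_{h=H^*+1}^{H} \hat{\gamma}_h^{\mathrm{VB}} \omega_{b_h}\omega_{a_h}^\top \right\|_{\mathrm{Fro}}^2 + O_p(N^{-3/2}) \\
&= \left\| V - B^* A^{*\top} + \sum_{h=H^*+1}^{H} \hat{\gamma}_h^{\mathrm{VB}} \omega_{b_h}\omega_{a_h}^\top - \sum_{h=H^*+1}^{\min(L,M)} \gamma_h \omega_{b_h}\omega_{a_h}^\top \right\|_{\mathrm{Fro}}^2 + O_p(N^{-3/2}),
\end{aligned}$$
and therefore,
$$\begin{aligned}
\left\langle \left\| \hat{B}\hat{A}^\top - B^* A^{*\top} \right\|_{\mathrm{Fro}}^2 \right\rangle_{q(D)}
&= \left\langle \left\| V - B^* A^{*\top} + \sum_{h=H^*+1}^{H} \hat{\gamma}_h^{\mathrm{VB}} \omega_{b_h}\omega_{a_h}^\top - \sum_{h=H^*+1}^{\min(L,M)} \gamma_h \omega_{b_h}\omega_{a_h}^\top \right\|_{\mathrm{Fro}}^2 \right\rangle_{q(D)} + O(N^{-3/2}) \\
&= \left\langle \left\| V - B^* A^{*\top} \right\|_{\mathrm{Fro}}^2 \right\rangle_{q(D)}
 + 2\left\langle \mathrm{tr}\left\{ \left(V - B^* A^{*\top}\right)^\top \left( \sum_{h=H^*+1}^{H} \hat{\gamma}_h^{\mathrm{VB}} \omega_{b_h}\omega_{a_h}^\top - \sum_{h=H^*+1}^{\min(L,M)} \gamma_h \omega_{b_h}\omega_{a_h}^\top \right) \right\} \right\rangle_{q(D)} \\
&\quad + \left\langle \sum_{h=H^*+1}^{H} \left(\hat{\gamma}_h^{\mathrm{VB}}\right)^2 - 2\sum_{h=H^*+1}^{H} \gamma_h \hat{\gamma}_h^{\mathrm{VB}} + \sum_{h=H^*+1}^{\min(L,M)} \gamma_h^2 \right\rangle_{q(D)} + O(N^{-3/2}) \\
&= \left\langle \left\| V - B^* A^{*\top} \right\|_{\mathrm{Fro}}^2 \right\rangle_{q(D)}
 + 2\Bigg\langle \sum_{h=H^*+1}^{H} \mathrm{tr}\left\{ \left(\gamma_h \omega_{b_h}\omega_{a_h}^\top - \gamma_h^* \omega_{b_h}^*\omega_{a_h}^{*\top}\right)^\top \hat{\gamma}_h^{\mathrm{VB}} \omega_{b_h}\omega_{a_h}^\top \right\} \\
&\qquad - \sum_{h=H^*+1}^{\min(L,M)} \mathrm{tr}\left\{ \left(\gamma_h \omega_{b_h}\omega_{a_h}^\top - \gamma_h^* \omega_{b_h}^*\omega_{a_h}^{*\top}\right)^\top \gamma_h \omega_{b_h}\omega_{a_h}^\top \right\} \Bigg\rangle_{q(D)} \\
&\quad + \left\langle \sum_{h=H^*+1}^{H} \left(\hat{\gamma}_h^{\mathrm{VB}}\right)^2 - 2\sum_{h=H^*+1}^{H} \gamma_h \hat{\gamma}_h^{\mathrm{VB}} + \sum_{h=H^*+1}^{\min(L,M)} \gamma_h^2 \right\rangle_{q(D)} + O(N^{-3/2}) \\
&= \left\langle \left\| V - B^* A^{*\top} \right\|_{\mathrm{Fro}}^2 \right\rangle_{q(D)}
 - 2\left\langle \sum_{h=H^*+1}^{H} \mathrm{tr}\left\{ \left(\gamma_h^* \omega_{b_h}^*\omega_{a_h}^{*\top}\right)^\top \hat{\gamma}_h^{\mathrm{VB}} \omega_{b_h}\omega_{a_h}^\top \right\} \right\rangle_{q(D)} \\
&\quad + 2\left\langle \sum_{h=H^*+1}^{\min(L,M)} \mathrm{tr}\left\{ \left(\gamma_h^* \omega_{b_h}^*\omega_{a_h}^{*\top}\right)^\top \gamma_h \omega_{b_h}\omega_{a_h}^\top \right\} \right\rangle_{q(D)}
 + \left\langle \sum_{h=H^*+1}^{H} \left(\hat{\gamma}_h^{\mathrm{VB}}\right)^2 \right\rangle_{q(D)} - \left\langle \sum_{h=H^*+1}^{\min(L,M)} \gamma_h^2 \right\rangle_{q(D)} + O(N^{-3/2}) \\
&= \left\langle \left\| V - B^* A^{*\top} \right\|_{\mathrm{Fro}}^2 \right\rangle_{q(D)}
 - \sum_{h=H^*+1}^{\min(L,M)} \left\langle \left\| \gamma_h \omega_{b_h}\omega_{a_h}^\top - \gamma_h^* \omega_{b_h}^*\omega_{a_h}^{*\top} \right\|_{\mathrm{Fro}}^2 \right\rangle_{q(D)} \\
&\quad - 2\left\langle \sum_{h=H^*+1}^{H} \mathrm{tr}\left\{ \left(\gamma_h^* \omega_{b_h}^*\omega_{a_h}^{*\top}\right)^\top \hat{\gamma}_h^{\mathrm{VB}} \omega_{b_h}\omega_{a_h}^\top \right\} \right\rangle_{q(D)}
 + \left\langle \sum_{h=H^*+1}^{H} \left(\hat{\gamma}_h^{\mathrm{VB}}\right)^2 \right\rangle_{q(D)} + \sum_{h=H^*+1}^{\min(L,M)} \gamma_h^{*2} + O(N^{-3/2}) \\
&= \frac{\sigma'^2}{N}\left\{ LM - (L - H^*)(M - H^*) + \frac{N}{\sigma'^2}\sum_{h=H^*+1}^{\min(L,M)} \gamma_h^{*2} \right\} \\
&\quad + \left\langle \sum_{h=H^*+1}^{H} \left(\hat{\gamma}_h^{\mathrm{VB}}\right)^2 - 2\sum_{h=H^*+1}^{H} \mathrm{tr}\left\{ \left(\gamma_h^* \omega_{b_h}^*\omega_{a_h}^{*\top}\right)^\top \hat{\gamma}_h^{\mathrm{VB}} \omega_{b_h}\omega_{a_h}^\top \right\} \right\rangle_{q(D)} + O(N^{-3/2}).
\end{aligned} \tag{14.91}$$
In the orthogonal space to the distinctly necessary components $\{\gamma_h, \omega_{a_h}, \omega_{b_h}\}_{h=1}^{H^*}$, the distribution of $\{\gamma_h, \omega_{a_h}, \omega_{b_h}\}_{h=H^*+1}^{\min(L,M)}$ coincides with the distribution of $\left\{\frac{\sigma'}{\sqrt{N}}\gamma_h'', \omega_{a_h}'', \omega_{b_h}''\right\}_{h=1}^{\min(L,M)-H^*}$, defined in Eq. (14.89), with $V''^*$ as the true matrix for the subtle or the redundant components, $h = H^* + 1, \ldots, \min(L, M)$. By using Eq. (14.58), we thus have
$$\begin{aligned}
\left\langle \left\| \hat{B}\hat{A}^\top - B^* A^{*\top} \right\|_{\mathrm{Fro}}^2 \right\rangle_{q(D)}
&= \frac{\sigma'^2}{N}\Bigg\{ \left(H^*(L + M) - H^{*2}\right) + \frac{N}{\sigma'^2}\sum_{h=H^*+1}^{\min(L,M)} \gamma_h^{*2} \\
&\quad + \left\langle \sum_{h=1}^{H-H^*} \theta\left(\gamma_h''^2 > \max(L, M)\right) \cdot \left\{ \left(1 - \frac{\max(L, M)}{\gamma_h''^2}\right)^2 \gamma_h''^2 - 2\left(1 - \frac{\max(L, M)}{\gamma_h''^2}\right) \gamma_h'' \omega_{b_h}''^\top V''^* \omega_{a_h}'' \right\} \right\rangle_{q(V'')} \Bigg\} + O(N^{-3/2}),
\end{aligned}$$
which completes the proof.

Training Error under Subtle Signal Assumption
The training error can be analyzed more easily.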
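Theorem 14.26 Under the subtle signal assumption (14.87), the average training error of the RRR model for $H \ge H^*$ is asymptotically expanded as
$$\overline{\mathrm{TE}}(N) = \nu^{\mathrm{VB}} N^{-1} + O(N^{-3/2}),$$
where the training coefficient is given by
$$\begin{aligned}
2\nu^{\mathrm{VB}} &= -\left(H^*(L + M) - H^{*2}\right) + \frac{N}{\sigma'^2}\sum_{h=H^*+1}^{\min(L,M)} \gamma_h^{*2} \\
&\quad - \left\langle \sum_{h=1}^{H-H^*} \theta\left(\gamma_h''^2 > \max(L, M)\right) \cdot \left(1 - \frac{\max(L, M)}{\gamma_h''^2}\right)\left(1 + \frac{\max(L, M)}{\gamma_h''^2}\right) \gamma_h''^2 \right\rangle_{q(V'')}.
\end{aligned} \tag{14.92}$$
Here $V''$ and $\{\gamma_h''\}$ are defined in Theorem 14.25.

Proof Theorem 14.15 and Eq. (14.77) still hold under the assumption (14.87). Therefore,
$$\begin{aligned}
\left\| V - \hat{B}\hat{A}^\top \right\|_{\mathrm{Fro}}^2
&= -\sum_{h=H^*+1}^{H} \theta\left(\gamma_h^2 > \frac{\max(L, M)\sigma'^2}{N}\right) \cdot \left(\gamma_h - \frac{\max(L, M)\sigma'^2}{N\gamma_h}\right)\left(\gamma_h + \frac{\max(L, M)\sigma'^2}{N\gamma_h}\right) \\
&\quad + \sum_{h=H^*+1}^{\min(L,M)} \left(\gamma_h - \gamma_h^*\right)^2 + \sum_{h=H^*+1}^{\min(L,M)} \gamma_h^{*2} + O_p(N^{-3/2}),
\end{aligned}$$
and
$$\begin{aligned}
\left\langle \left\| V - \hat{B}\hat{A}^\top \right\|_{\mathrm{Fro}}^2 - \left\| V - B^* A^{*\top} \right\|_{\mathrm{Fro}}^2 \right\rangle_{q(D)}
&= -\frac{\sigma'^2}{N}\left\{ \left(H^*(L + M) - H^{*2}\right) - \frac{N}{\sigma'^2}\sum_{h=H^*+1}^{\min(L,M)} \gamma_h^{*2} \right\} \\
&\quad - \left\langle \sum_{h=H^*+1}^{H} \theta\left(\gamma_h^2 > \frac{\max(L, M)\sigma'^2}{N}\right) \cdot \left(\gamma_h - \frac{\max(L, M)\sigma'^2}{N\gamma_h}\right)\left(\gamma_h + \frac{\max(L, M)\sigma'^2}{N\gamma_h}\right) \right\rangle_{q(D)} + O_p(N^{-3/2}).
\end{aligned}$$
Substituting the preceding equation into Eq. (14.75) and using the random matrix $V''$ and its singular values $\{\gamma_h''\}$, defined in Theorem 14.25, we obtain Eq. (14.92).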
Figure 14.12 The generalization coefficients and the training coefficients under the subtle signal assumption (14.87) in the RRR model with $\max(L, M) = 50$, $\min(L, M) = 30$, $H = 20$, and $H^* = 5$. (Curves: VB, ML, Regular.)
Comparison with Other Learning Algorithms

Figure 14.12 shows the generalization coefficients and the training coefficients, computed by using Theorems 14.25 and 14.26, respectively, as functions of a rescaled subtle true singular value $\sqrt{N}\gamma_h^*/\sigma'$. The considered RRR model has max(L, M) = 50, min(L, M) = 30, and H = 20, and the true linear mapping is assumed to consist of $H^* = 5$ distinctly necessary components ($\gamma_h^* = \Theta(1)$ for h = 1, ..., 5), 10 subtle components ($\gamma_h^* = \Theta(N^{-1/2})$ for h = 6, ..., 15), and the other five null components ($\gamma_h^* = 0$ for h = 16, ..., 20). The subtle singular values are assumed to be identical, $\gamma_h^* = \gamma^*$ for h = 6, ..., 15, and the horizontal axis indicates $\sqrt{N}\gamma^*/\sigma'$. The generalization coefficients and the training coefficients of ML learning can be derived in the same way as Theorems 14.25 and 14.26 with the VB estimator $\widehat{\gamma}_h^{\mathrm{VB}}$ replaced with the ML estimator $\widehat{\gamma}_h^{\mathrm{ML}} = \gamma_h$.

Unfortunately, neither the generalization error nor the training error of Bayesian learning under the subtle signal assumption has been clarified for the general RRR model. The Bayes generalization error under the subtle signal assumption has been analyzed only in the case where L = H = 1.

Theorem 14.27 (Watanabe and Amari, 2003) The Bayes generalization error of the RRR model with M ≥ 2, L = H = 1 under the assumption that the true mapping is $b^* a^{*\top} = O(N^{-1/2})$ is asymptotically expanded as

$$\mathrm{GE}^{\mathrm{Bayes}}(N) = \lambda^{\mathrm{Bayes}} N^{-1} + o(N^{-1}),$$

where the generalization coefficient is given by

$$
2\lambda^{\mathrm{Bayes}} = 1 + \left\langle \left( \left\| \frac{\sqrt{N}}{\sigma'} b^* a^* \right\|^2 + \frac{\sqrt{N}}{\sigma'} b^* a^{*\top} v \right) \frac{\Phi_M(v)}{\Phi_{M-2}(v)} \right\rangle_{q(v)}. \tag{14.93}
$$
14.2 Generalization Properties
[Figure 14.13: plot not reproduced. Legend: VB, ML (= Regular), Bayes.]
Figure 14.13 The generalization coefficients and the training coefficients under the subtle signal assumption (14.87) in the RRR model with M = 5, L = H = 1, and H∗ = 0.
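Here,

$$
\Phi_M(v) = \int_0^{\pi/2} \sin^M\theta \, \exp\left( -\frac{1}{2} \left\| \frac{\sqrt{N}}{\sigma'} b^* a^* + v \right\|^2 \sin^2\theta \right) d\theta,
$$

and $v \in \mathbb{R}^M$ is a random vector subject to $q(v) = \mathrm{Gauss}_M(v; 0, I_M)$.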
Figure 14.13 compares the generalization coefficients when M = 5, L = H = 1, and $H^* = 0$, where the horizontal axis corresponds to a rescaled subtle true singular value $\sqrt{N}\gamma^*/\sigma' = \sqrt{N}\|b^* a^*\|/\sigma'$. We see that the generalization error of VB learning is smaller than that of Bayesian learning when $\sqrt{N}\gamma^*/\sigma' = 0$, and identical when $\sqrt{N}\gamma^*/\sigma' \to \infty$. This means that, under the distinct signal assumption (14.86), which considers only the case where $\gamma^* = 0$ (i.e., $H^* = 0$) or $\gamma^* = \Theta(1)$ (i.e., $H^* = 1$), VB learning always performs at least as well as Bayesian learning. However, we can see in Figure 14.13 that, when $\sqrt{N}\gamma^*/\sigma' \approx 3$, Bayesian learning outperforms VB learning. Figure 14.13 simply implies that VB learning is more strongly regularized than Bayesian learning, or in other words, VB learning tends to underfit subtle signals such that $\gamma^* = \Theta(N^{-1/2})$, while Bayesian learning tends to overfit noise.

Knowing the proved optimality of Bayesian learning (Appendix D), we would expect that the same happens in Figure 14.12, where the limits $\sqrt{N}\gamma^*/\sigma' = 0$ and $\sqrt{N}\gamma^*/\sigma' \to \infty$ correspond to the cases with $H^* = 5$ and $H^* = 15$, respectively, in Figure 14.4 under the distinct signal assumption (14.86). Namely, if we could depict the Bayes generalization coefficient in Figure 14.12, there should be some interval where Bayesian learning outperforms VB learning.

(Footnote 2) When L = H = 1, the parameter transformation $ba \mapsto w$ makes the RRR model identifiable, and therefore the ML generalization coefficient is identical to that of the regular models. This is the reason why only the integration effect, or model-induced regularization, was observed in the one-dimensional matrix factorization model in Section 7.2 and Section 7.3.3. The basis selection effect appears only when a singular model cannot be equivalently transformed to a regular model (see the discussion in Section 14.3).
14.3 Insights into VB Learning

In this chapter, we analyzed the generalization error, the training error, and the free energy of VB learning in the RRR model, and derived their asymptotic forms. We also introduced theoretical results providing those properties for ML learning and Bayesian learning. As mentioned in Section 13.5, the RRR model is the only singular model for which those three properties have been theoretically clarified for ML learning, Bayesian learning, and VB learning. Accordingly, we here summarize our observations and discuss the effects of singularities in VB learning and other learning algorithms.

(i) In the RRR model, the basis selection effect, explained in Section 13.5.1, appears as a selection bias of largest singular values of a zero-mean random matrix.

Theorem 14.19 gives an asymptotic expansion of the ML generalization error. The second term in the generalization coefficient (14.82) is the expectation of the square of the $(H - H^*)$ largest singular values of a random matrix $\frac{\sigma'}{\sqrt{N}} V''$, where $V''$ is subject to the zero-mean Gaussian (14.61). This corresponds to the effect of basis selection: ML learning chooses the singular components that best fit the observation noise. With the full-rank model, i.e., H = min(L, M), the second term in the generalization coefficient (14.82) is equal to

$$
\sum_{h=1}^{\min(L,M)-H^*} \left\langle \gamma_h''^2 \right\rangle_{q(W)} = (\min(L,M) - H^*)(\max(L,M) - H^*) = (L - H^*)(M - H^*),
$$

and therefore the generalization coefficient becomes

$$
2\lambda^{\mathrm{ML}} = (H^*(L+M) - H^{*2}) + (L - H^*)(M - H^*) = LM = D,
$$

which is the same as the generalization coefficient of the regular models. Indeed, the full-rank RRR model is equivalently transformed to a regular model by $BA^\top \mapsto U$, where the domain for $U$ is the whole $\mathbb{R}^{L \times M}$ space (no low-rank restriction is imposed on $U$). In this case, no basis selection occurs because all possible bases are supposed to be used.
(ii) In the RRR model, the integration effect, explained in Section 13.5.1, appears as the James–Stein (JS) type shrinkage.

This was shown in Theorem 14.1 in the asymptotic limit: the VB estimator converges to the positive-part JS estimator operated on each singular component separately. By comparing Theorems 14.11 and 14.19, we see that ML learning and VB learning differ from each other by the factor $\theta\left(\gamma_h''^2 > \max(L,M)\right)\left(1 - \frac{\max(L,M)}{\gamma_h''^2}\right)^2$, which comes from the positive-part JS shrinkage. Unlike the basis selection effect, the integration effect appears even if the model can be equivalently transformed to a regular model: the full-rank RRR model (with H = min(L, M)) is still affected by the singularities. The relation between VB learning and the JS shrinkage estimator was also observed in nonasymptotic analysis in Chapter 7, where model-induced regularization (MIR) was illustrated as a consequence of the integration effect, by focusing on the one-dimensional matrix factorization model. Note that the basis selection effect does not appear in the one-dimensional matrix factorization model (where L = M = H), because it can be equivalently transformed to a regular model.

(iii) VB learning shows similarity both to ML learning and Bayesian learning.

Figures 14.1 through 14.4 generally show that VB learning is regularized as much as Bayesian learning, while its dependence on the model size (H, L, M, etc.) is more like ML learning. Unlike in Bayesian learning, the integration effect does not always dominate the basis selection effect in VB learning: a good property of Bayesian learning, $2\lambda^{\mathrm{Bayes}} \le D$, does not necessarily hold in VB learning, e.g., we observe that $2\lambda^{\mathrm{VB}} > D$ at min(L, M) = 80 in Figure 14.2, and at H = 1 in Figure 14.3.

(iv) In VB learning, the relation between the generalization error and the free energy is not as simple as in Bayesian learning.

In Bayesian learning, the generalization coefficient and the free energy coefficient coincide with each other, i.e., $\lambda^{\mathrm{Bayes}} = \lambda'^{\mathrm{Bayes}}$. This property does not hold in VB learning even approximately, as seen by comparing Figures 14.1 through 14.4 and Figures 14.5 through 14.8. In many cases, the VB free energy well approximates the Bayes free energy, while the VB generalization error significantly differs from the Bayes generalization error. Research on the relation between the free energy and the generalization error in VB learning is ongoing (see Section 17.4).
(v) MIR in VB learning can be stronger than that in Bayesian learning.

By definition, the VB free energy is never less than the Bayes free energy, and therefore it holds that $\lambda'^{\mathrm{VB}} \ge \lambda'^{\mathrm{Bayes}}$. On the other hand, such a relation does not hold for the generalization error, i.e., $\lambda^{\mathrm{VB}}$ can be larger or smaller than $\lambda^{\mathrm{Bayes}}$. However, even if $\lambda^{\mathrm{VB}}$ is less than or equal to $\lambda^{\mathrm{Bayes}}$ for any true rank $H^*$ in some RRR model, it does not mean the domination of VB learning over Bayesian learning. Since the optimality of Bayesian learning was proved (see Appendix D), $\lambda^{\mathrm{VB}} < \lambda^{\mathrm{Bayes}}$ simply means that VB learning is more strongly regularized than Bayesian learning, or in other words, VB learning tends to underfit small signals while Bayesian learning tends to overfit noise. Extending the analysis under the subtle signal assumption (14.87) to the general RRR model would clarify this point.

(vi) The generalization error depends on the dimensionality in an interesting way.

The shrinkage factor is governed by max(L, M) and independent of min(L, M) in the asymptotic limit (see Theorem 14.1). This is because the shrinkage is caused by the VB posterior extending, for the redundant components, into the higher-dimensional of the two parameter spaces (the M-dimensional input space or the L-dimensional output space), as seen in Corollary 14.3. This choice results from maximizing the entropy of the VB posterior distribution when the free energy is minimized. Consequently, when $L \ll M$, the shape of the VB posterior in the asymptotic limit is similar to that of partially Bayesian learning, where the posterior of A or B is approximated by the Dirac delta function (see Chapter 12).

On the other hand, an increase of the smaller dimensionality min(L, M) broadens the variety of basis selection: as mentioned in (i), the basis selection effect in the RRR model occurs by the redundant components selecting the $(H - H^*)$ largest singular components, and $(\min(L, M) - H^*)$ corresponds to the dimensionality that the basis functions can span. This phenomenon can be seen in Figure 8.3: the Marčenko–Pastur distribution is broad when $\alpha = (\min(L, M) - H^*)/(\max(L, M) - H^*)$ is large.

We can conclude that a large max(L, M) enhances the integration effect, leading to strong regularization, while a large min(L, M) enhances the basis selection effect, leading to overfitting. As a result, VB learning tends to be strongly regularized when $L \ll M$ or $L \gg M$, and tends to overfit when $L \approx M$.
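The shrinkage factor discussed in (ii) and (vi) can be illustrated numerically. The following is a minimal sketch, assuming singular values that have already been rescaled so that the threshold is max(L, M); the function names and the sample values are hypothetical and only serve to show how a larger max(L, M) prunes and shrinks more components.

```python
def vb_shrinkage_factor(gamma_sq, max_dim):
    """Positive-part James-Stein factor: theta(gamma^2 > max_dim) * (1 - max_dim / gamma^2)."""
    if gamma_sq <= max_dim:
        return 0.0  # the component is pruned entirely
    return 1.0 - max_dim / gamma_sq


def shrink_singular_values(gammas, max_dim):
    """Apply the factor to each (rescaled) singular value separately."""
    return [vb_shrinkage_factor(g * g, max_dim) * g for g in gammas]


if __name__ == "__main__":
    rescaled = [2.0, 7.0, 8.0, 12.0, 40.0]  # hypothetical rescaled singular values
    for max_dim in (5, 50):                 # hypothetical values of max(L, M)
        print(max_dim, shrink_singular_values(rescaled, max_dim))
```

With max(L, M) = 50, the values 2 and 7 are set to zero and 8 is shrunk to 1.75, whereas with max(L, M) = 5 only the value 2 is pruned. The factor depends on max(L, M) alone, in line with observation (vi).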
15 Asymptotic VB Theory of Mixture Models
In this chapter, we discuss the asymptotic behavior of the VB free energy of mixture models, for which VB learning algorithms were introduced in Sections 4.1.1 and 4.1.2. We first prepare basic lemmas commonly used in this and the following chapters.
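15.1 Basic Lemmas

Consider the latent variable model expressed as

$$p(\mathcal{D}|w) = \sum_{\mathcal{H}} p(\mathcal{D}, \mathcal{H}|w).$$

In this chapter, we analyze the VB free energy, which is the minimum of the free energy under the constraint

$$r(w, \mathcal{H}) = r_w(w)\, r_{\mathcal{H}}(\mathcal{H}), \tag{15.1}$$

i.e.,

$$F^{\mathrm{VB}}(\mathcal{D}) = \min_{r_w(w),\, r_{\mathcal{H}}(\mathcal{H})} F(r), \tag{15.2}$$

where

$$
F(r) = \left\langle \log \frac{r_w(w)\, r_{\mathcal{H}}(\mathcal{H})}{p(w, \mathcal{H}, \mathcal{D})} \right\rangle_{r_w(w) r_{\mathcal{H}}(\mathcal{H})}
= F^{\mathrm{Bayes}}(\mathcal{D}) + \mathrm{KL}\left(r_w(w)\, r_{\mathcal{H}}(\mathcal{H}) \,\|\, p(w, \mathcal{H}|\mathcal{D})\right). \tag{15.3}
$$

Here,

$$
F^{\mathrm{Bayes}}(\mathcal{D}) = -\log p(\mathcal{D}) = -\log \sum_{\mathcal{H}} p(\mathcal{D}, \mathcal{H}) = -\log \sum_{\mathcal{H}} \int p(\mathcal{D}, \mathcal{H}, w)\, dw
$$

is the Bayes free energy. Recall that the stationary condition of the free energy yields

$$r_w(w) = \frac{1}{C_w}\, p(w) \exp \left\langle \log p(\mathcal{D}, \mathcal{H}|w) \right\rangle_{r_{\mathcal{H}}(\mathcal{H})}, \tag{15.4}$$

$$r_{\mathcal{H}}(\mathcal{H}) = \frac{1}{C_{\mathcal{H}}} \exp \left\langle \log p(\mathcal{D}, \mathcal{H}|w) \right\rangle_{r_w(w)}. \tag{15.5}$$

For the minimizer $r_w(w)$ of $F(r)$, let

$$\widehat{w} = \langle w \rangle_{r_w(w)} \tag{15.6}$$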
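be the VB estimator. The following lemma shows that the free energy is decomposed into the sum of two terms.

Lemma 15.1 It holds that

$$F^{\mathrm{VB}}(\mathcal{D}) = \min_{r_w(w)} \{R + Q\}, \tag{15.7}$$

where

$$R = \mathrm{KL}(r_w(w) \,\|\, p(w)), \qquad Q = -\log C_{\mathcal{H}},$$

for $C_{\mathcal{H}} = \sum_{\mathcal{H}} \exp \left\langle \log p(\mathcal{D}, \mathcal{H}|w) \right\rangle_{r_w(w)}$.

Proof From the restriction of the VB approximation in Eq. (15.1), $F(r)$ can be divided into two terms,

$$
F(r) = \left\langle \log \frac{r_w(w)}{p(w)} \right\rangle_{r_w(w)} + \left\langle \log \frac{r_{\mathcal{H}}(\mathcal{H})}{p(\mathcal{D}, \mathcal{H}|w)} \right\rangle_{r_w(w) r_{\mathcal{H}}(\mathcal{H})}.
$$

Since the optimal VB posteriors satisfy Eqs. (15.4) and (15.5), if the VB posterior $r_{\mathcal{H}}(\mathcal{H})$ is optimized, then

$$
\left\langle \log \frac{r_{\mathcal{H}}(\mathcal{H})}{p(\mathcal{D}, \mathcal{H}|w)} \right\rangle_{r_w(w) r_{\mathcal{H}}(\mathcal{H})} = -\log C_{\mathcal{H}}
$$

holds. Thus, we obtain Eq. (15.7).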
The free energies of mixture models and other latent variable models involve the di-gamma function Ψ (x) and the log-gamma function log Γ(x) (see, e.g., Eq. (4.22)). To analyze the free energy, we will use the inequalities on these functions in the following lemma:
Lemma 15.2 (Alzer, 1997) For $x > 0$,

$$\frac{1}{2x} < \log x - \Psi(x) < \frac{1}{x}, \tag{15.8}$$

and

$$0 \le \log \Gamma(x) - \left\{ \left(x - \frac{1}{2}\right) \log x - x + \frac{1}{2} \log 2\pi \right\} \le \frac{1}{12x}. \tag{15.9}$$
The inequalities (15.8) ensure that substituting $\log x$ for $\Psi(x)$ only contributes at most additive constant terms to the VB free energy. The substitution for $\log \Gamma(x)$ is given by Eq. (15.9) as well.

For the i.i.d. latent variable models defined as

$$p(x|w) = \sum_{z} p(x, z|w), \tag{15.10}$$

the likelihood for the observed data $\mathcal{D} = \{x^{(n)}\}_{n=1}^N$ and the complete data $\{\mathcal{D}, \mathcal{H}\} = \{x^{(n)}, z^{(n)}\}_{n=1}^N$ is given by

$$p(\mathcal{D}|w) = \prod_{n=1}^N p(x^{(n)}|w), \qquad p(\mathcal{D}, \mathcal{H}|w) = \prod_{n=1}^N p(x^{(n)}, z^{(n)}|w),$$

respectively. In the asymptotic analysis of the free energy for such a model, when the free energy is minimized, the second term in Eq. (15.7), $Q = -\log C_{\mathcal{H}}$, is proved to be close to $N$ times the empirical entropy (13.20),

$$S_N(\mathcal{D}) = -\frac{1}{N} \sum_{n=1}^N \log p(x^{(n)}|w^*), \tag{15.11}$$
where $w^*$ is the true parameter generating the data. Thus, the first term in Eq. (15.7) shows the asymptotic behavior of the VB free energy, which is analyzed with the inequalities in Lemma 15.2. Let

$$\widetilde{Q} = Q - N S_N(\mathcal{D}). \tag{15.12}$$

It follows from Jensen's inequality that

$$
\widetilde{Q} = \log p(\mathcal{D}|w^*) - \log \sum_{\mathcal{H}} \exp \left\langle \log p(\mathcal{D}, \mathcal{H}|w) \right\rangle_{r_w(w)}
\ge \log \frac{p(\mathcal{D}|w^*)}{\left\langle p(\mathcal{D}|w) \right\rangle_{r_w(w)}}
\ge N E_N(\widehat{w}_{\mathrm{ML}}), \tag{15.13}
$$

where $\widehat{w}_{\mathrm{ML}}$ is the maximum likelihood (ML) estimator, and

$$
E_N(w) = L_N(w^*) - L_N(w) = \frac{1}{N} \sum_{n=1}^N \log \frac{p(x^{(n)}|w^*)}{p(x^{(n)}|w)} \tag{15.14}
$$
is the empirical KL divergence. Note here that $L_N$ is defined in Eq. (13.37), and $E_N(\widehat{w})$ corresponds to the training error of the plug-in predictive distribution (defined in Eq. (13.75) for regular models) with an estimator $\widehat{w}$.

If the domain of data $\mathcal{X}$ is discrete and with finite cardinality, $\#(\mathcal{X}) = M$, $Q$ in Eq. (15.7) can be analyzed in detail. In such a case, we can assume without loss of generality that $x \in \{e_1, \ldots, e_M\}$, where $e_m$ is the one-of-$M$ representation, i.e., only the $m$th entry is one and the other entries are zeros. Let $\widehat{N}_m = \sum_{n=1}^N x_m^{(n)}$ be the number of output $m$ in the sequence $\mathcal{D}$, and define the strongly $\varepsilon$-typical set $T_\varepsilon^N(p^*)$ with respect to the probability mass function $p^* = (p_1^*, \ldots, p_M^*)^\top = (p(x_1 = 1|w^*), \ldots, p(x_M = 1|w^*))^\top \in \Delta^{M-1}$ as follows:

$$
T_\varepsilon^N(p^*) = \left\{ \mathcal{D} \in \mathcal{X}^N ; \; \left| \frac{\widehat{N}_m}{N} - p_m^* \right| \le \frac{p_m^*}{\log M}\, \varepsilon, \; m = 1, \ldots, M \right\}. \tag{15.15}
$$
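The membership condition in Eq. (15.15) can be written as a short check. This is a minimal sketch, assuming outcomes are encoded as integers 0, ..., M-1 instead of one-of-M vectors and that M ≥ 2 (so that log M > 0); the function name is hypothetical.

```python
import math
from collections import Counter


def is_strongly_typical(sequence, p_star, eps):
    """Check |N_m / N - p*_m| <= p*_m * eps / log(M) for every outcome m."""
    M = len(p_star)
    N = len(sequence)
    counts = Counter(sequence)  # N_m for each observed outcome
    for m in range(M):
        freq = counts.get(m, 0) / N
        if abs(freq - p_star[m]) > p_star[m] * eps / math.log(M):
            return False
    return True


if __name__ == "__main__":
    p_star = [0.5, 0.25, 0.25]
    print(is_strongly_typical([0] * 50 + [1] * 25 + [2] * 25, p_star, 0.1))
    print(is_strongly_typical([0] * 60 + [1] * 20 + [2] * 20, p_star, 0.1))
```

Note that the tolerance scales with $p_m^*$, so any outcome with $p_m^* = 0$ must not appear at all in a strongly typical sequence.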
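It is known that the probability that the observed data sequence is not strongly $\varepsilon$-typical is upper-bounded as follows:

Lemma 15.3 (Han and Kobayashi, 2007) It holds that

$$\mathrm{Prob}\left(\mathcal{D} \notin T_\varepsilon^N(p^*)\right) \le \frac{\kappa M}{N \varepsilon^2},$$

where

$$\kappa = (\log M)^2 \max_{m:\, p_m^* \ne 0} \frac{1 - p_m^*}{p_m^*}.$$

Let

$$
\widehat{p} = (\widehat{p}_1, \ldots, \widehat{p}_M)^\top = \left( \langle p(x_1 = 1|w) \rangle_{r_w(w)}, \ldots, \langle p(x_M = 1|w) \rangle_{r_w(w)} \right)^\top \in \Delta^{M-1}
$$

be the probability mass function defined by the predictive distribution $\langle p(x|w) \rangle_{r_w(w)}$ with the VB posterior $r_w(w)$. For any fixed $\delta > 0$, define

$$R_\delta^* = \left\{ \widehat{p} \in \Delta^{M-1} ; \; \mathrm{KL}(p^* \| \widehat{p}) \le \delta \right\}, \tag{15.16}$$

where $\mathrm{KL}(p^* \| \widehat{p}) = \sum_{m=1}^M p_m^* \log \frac{p_m^*}{\widehat{p}_m}$. Then the following lemma holds: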
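Lemma 15.4 Suppose that the domain $\mathcal{X}$ is discrete and with finite cardinality, $\#(\mathcal{X}) = M$. For all $\varepsilon > 0$ and $\mathcal{D} \in T_\varepsilon^N(p^*)$, there exists a constant $C > 0$ such that if $\widehat{p} \notin R_{C\varepsilon^2}^*$,

$$\widetilde{Q} = \Omega_p(N). \tag{15.17}$$

Furthermore,

$$\min_{r_w(w)} \widetilde{Q} = O_p(1). \tag{15.18}$$

Proof From Eq. (15.13), we have

$$
\widetilde{Q} \ge \log \frac{p(\mathcal{D}|w^*)}{\left\langle p(\mathcal{D}|w) \right\rangle_{r_w(w)}}
= N \sum_{m=1}^M \frac{\widehat{N}_m}{N} \log \frac{p_m^*}{\widehat{p}_m}
= N \left\{ \mathrm{KL}(\widehat{p}^{\mathrm{ML}} \| \widehat{p}) - \mathrm{KL}(\widehat{p}^{\mathrm{ML}} \| p^*) \right\}, \tag{15.19}
$$

where $\widehat{p}^{\mathrm{ML}} = (\widehat{p}_1^{\mathrm{ML}}, \ldots, \widehat{p}_M^{\mathrm{ML}})^\top = (\widehat{N}_1/N, \ldots, \widehat{N}_M/N)^\top \in \Delta^{M-1}$ is the type, namely the empirical distribution of $\mathcal{D}$. Thus, if $\mathrm{KL}(\widehat{p}^{\mathrm{ML}} \| \widehat{p}) > \mathrm{KL}(\widehat{p}^{\mathrm{ML}} \| p^*)$, the right-hand side of Eq. (15.19) grows in the order of $N$. If $\mathcal{D} \in T_\varepsilon^N(p^*)$, $\mathrm{KL}(\widehat{p}^{\mathrm{ML}} \| p^*) = O_p(\varepsilon^2)$ since $\mathrm{KL}(\widehat{p}^{\mathrm{ML}} \| p^*)$ is well approximated by a quadratic function of $\widehat{p}^{\mathrm{ML}} - p^*$. To prove the first assertion of the lemma, it suffices to see that $\mathrm{KL}(p^* \| \widehat{p}) \le C\varepsilon^2$ is equivalent to $\mathrm{KL}(\widehat{p}^{\mathrm{ML}} \| \widehat{p}) \le C'\varepsilon^2$ for a constant $C' > 0$ if $\mathcal{D} \in T_\varepsilon^N(p^*)$. In fact, we have

$$
\mathrm{KL}(p^* \| \widehat{p}) = \mathrm{KL}(p^* \| \widehat{p}^{\mathrm{ML}}) + \mathrm{KL}(\widehat{p}^{\mathrm{ML}} \| \widehat{p}) + \sum_{m=1}^M (p_m^* - \widehat{p}_m^{\mathrm{ML}}) \left( \log \widehat{p}_m^{\mathrm{ML}} - \log \widehat{p}_m \right). \tag{15.20}
$$

It follows from $\mathcal{D} \in T_\varepsilon^N(p^*)$ that $\mathrm{KL}(p^* \| \widehat{p}^{\mathrm{ML}})/\varepsilon^2$ and $|p_m^* - \widehat{p}_m^{\mathrm{ML}}|/\varepsilon$ are bounded by constants. Then $\mathrm{KL}(\widehat{p}^{\mathrm{ML}} \| \widehat{p}) \le C'\varepsilon^2$ implies that $|\widehat{p}_m^{\mathrm{ML}} - \widehat{p}_m|/\varepsilon$ is bounded by a constant, and hence all the terms in Eq. (15.20) divided by $\varepsilon^2$ are bounded by constants.

It follows from Eq. (15.19) that

$$\min_{r_w(w)} \widetilde{Q} \ge -N\, \mathrm{KL}(\widehat{p}^{\mathrm{ML}} \| p^*). \tag{15.21}$$

The standard asymptotic theory of the multinomial model implies that twice the right-hand side of Eq. (15.21), with its sign flipped, asymptotically follows the chi-squared distribution with degree of freedom $M - 1$ as discussed in Section 13.4.5.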
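The chi-squared statement at the end of the proof can be illustrated by simulation. The following sketch draws multinomial samples and computes $2N\,\mathrm{KL}(\widehat{p}^{\mathrm{ML}} \| p^*)$; the probability vector, sample size, number of trials, and seed are arbitrary choices made for illustration, and the check relies only on the fact that a chi-squared variable with $M - 1$ degrees of freedom has mean $M - 1$.

```python
import math
import random
from collections import Counter


def kl(p, q):
    """KL(p || q) for discrete distributions, with the convention 0 * log 0 = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def two_n_kl_samples(p_star, N, trials, seed=0):
    """Simulate 2 * N * KL(p_ML || p_star) over independent data sets of size N."""
    rng = random.Random(seed)
    M = len(p_star)
    samples = []
    for _ in range(trials):
        counts = Counter(rng.choices(range(M), weights=p_star, k=N))
        p_ml = [counts.get(m, 0) / N for m in range(M)]  # the type of the sequence
        samples.append(2.0 * N * kl(p_ml, p_star))
    return samples


if __name__ == "__main__":
    p_star = [0.4, 0.3, 0.2, 0.1]
    samples = two_n_kl_samples(p_star, N=2000, trials=300)
    print("empirical mean:", sum(samples) / len(samples), "M - 1 =", len(p_star) - 1)
```

Because the statistic stays of order one as $N$ grows, the lower bound (15.21) is consistent with $\min \widetilde{Q} = O_p(1)$ in Eq. (15.18).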
This lemma is used for proving the consistency of the VB posterior and evaluating lower-bounds of VB free energy for discrete models in Sections 15.3 and 15.4 and Chapter 16.
15.2 Mixture of Gaussians

In this section, we consider the following Gaussian mixture model (GMM) introduced in Section 4.1.1 and give upper- and lower-bounds for the VB free energy (Watanabe and Watanabe, 2004, 2006):

$$p(z|\alpha) = \mathrm{Multinomial}_{K,1}(z; \alpha), \tag{15.22}$$

$$p(x|z, \{\mu_k\}_{k=1}^K) = \prod_{k=1}^K \left\{ \mathrm{Gauss}_M(x; \mu_k, I_M) \right\}^{z_k}, \tag{15.23}$$

$$p(\alpha|\phi) = \mathrm{Dirichlet}_K(\alpha; (\phi, \ldots, \phi)^\top), \tag{15.24}$$

$$p(\mu_k|\mu_0, \xi) = \mathrm{Gauss}_M(\mu_k; \mu_0, (1/\xi) I_M). \tag{15.25}$$
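Under the constraint $r(\mathcal{H}, w) = r_{\mathcal{H}}(\mathcal{H})\, r_w(w)$, the VB posteriors are given as follows:

$$r(\{z^{(n)}\}_{n=1}^N, \alpha, \{\mu_k\}_{k=1}^K) = r_z(\{z^{(n)}\}_{n=1}^N)\, r_\alpha(\alpha)\, r_\mu(\{\mu_k\}_{k=1}^K),$$

$$r_z(\{z^{(n)}\}_{n=1}^N) = \prod_{n=1}^N \mathrm{Multinomial}_{K,1}\left(z^{(n)}; \widehat{z}^{(n)}\right),$$

$$r_\alpha(\alpha) = \mathrm{Dirichlet}\left(\alpha; (\widehat{\phi}_1, \ldots, \widehat{\phi}_K)^\top\right),$$

$$r_\mu(\{\mu_k\}_{k=1}^K) = \prod_{k=1}^K \mathrm{Gauss}_M\left(\mu_k; \widehat{\mu}_k, \widehat{\sigma}_k^2 I_M\right).$$

The variational parameters $\{\widehat{z}^{(n)}\}_{n=1}^N$, $\{\widehat{\phi}_k\}_{k=1}^K$, $\{\widehat{\mu}_k, \widehat{\sigma}_k^2\}_{k=1}^K$ minimize the free energy,

$$
\begin{aligned}
F ={}& \log \left( \frac{\Gamma\left(\sum_{k=1}^K \widehat{\phi}_k\right)}{\prod_{k=1}^K \Gamma(\widehat{\phi}_k)} \right) - \log \left( \frac{\Gamma(K\phi)}{(\Gamma(\phi))^K} \right) - \frac{M}{2} \sum_{k=1}^K \log\left(\xi \widehat{\sigma}_k^2\right) - \frac{KM}{2} \\
&+ \sum_{n=1}^N \sum_{k=1}^K \widehat{z}_k^{(n)} \log \widehat{z}_k^{(n)} + \sum_{k=1}^K \left(\widehat{\phi}_k - \phi - \overline{N}_k\right) \left( \Psi(\widehat{\phi}_k) - \Psi\left(\textstyle\sum_{k'=1}^K \widehat{\phi}_{k'}\right) \right) \\
&+ \sum_{k=1}^K \frac{\xi \left( \|\widehat{\mu}_k - \mu_0\|^2 + M \widehat{\sigma}_k^2 \right)}{2} + \sum_{k=1}^K \frac{\overline{N}_k \left( M \log(2\pi) + M \widehat{\sigma}_k^2 \right)}{2} \\
&+ \sum_{k=1}^K \frac{\overline{N}_k \|\overline{x}_k - \widehat{\mu}_k\|^2 + \sum_{n=1}^N \widehat{z}_k^{(n)} \|x^{(n)} - \overline{x}_k\|^2}{2},
\end{aligned} \tag{15.26}
$$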
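where

$$\overline{N}_k = \sum_{n=1}^N \widehat{z}_k^{(n)}, \tag{15.27}$$

$$\overline{x}_k = \frac{1}{\overline{N}_k} \sum_{n=1}^N \widehat{z}_k^{(n)} x^{(n)}. \tag{15.28}$$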
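The stationary condition of the free energy yields

$$\widehat{z}_k^{(n)} = \frac{\overline{z}_k^{(n)}}{\sum_{k'=1}^K \overline{z}_{k'}^{(n)}}, \tag{15.29}$$

$$\widehat{\phi}_k = \overline{N}_k + \phi, \tag{15.30}$$

$$\widehat{\mu}_k = \frac{\overline{N}_k \overline{x}_k + \xi \mu_0}{\overline{N}_k + \xi}, \tag{15.31}$$

$$\widehat{\sigma}_k^2 = \frac{1}{\overline{N}_k + \xi}, \tag{15.32}$$

where

$$\overline{z}_k^{(n)} \propto \exp\left( \Psi(\widehat{\phi}_k) - \frac{1}{2} \left( \|x^{(n)} - \widehat{\mu}_k\|^2 + M \widehat{\sigma}_k^2 \right) \right). \tag{15.33}$$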
The following condition is assumed.

Assumption 15.1 The true distribution $q(x)$ is an $M$-dimensional GMM $p(x|w^*)$, which has $K_0$ components and parameter $w^* = (\alpha^*, \{\mu_k^*\}_{k=1}^{K_0})$:

$$q(x) = p(x|w^*) = \sum_{k=1}^{K_0} \alpha_k^*\, \mathrm{Gauss}_M(x; \mu_k^*, I_M), \tag{15.34}$$

where $x, \mu_k^* \in \mathbb{R}^M$. Suppose that the true distribution can be realized by our model in hand, i.e., $K \ge K_0$ holds.

Under this condition, we prove the following theorem, which evaluates the relative VB free energy,

$$\widetilde{F}^{\mathrm{VB}}(\mathcal{D}) = F^{\mathrm{VB}}(\mathcal{D}) - N S_N(\mathcal{D}). \tag{15.35}$$

The proof will appear in the next section.
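Theorem 15.5 The relative VB free energy of the GMM satisfies

$$
\underline{\lambda}'^{\mathrm{VB}}_{\mathrm{MM}} \log N + N E_N(\widehat{w}) + O_p(1) \le \widetilde{F}^{\mathrm{VB}}(\mathcal{D}) \le \overline{\lambda}'^{\mathrm{VB}}_{\mathrm{MM}} \log N + O_p(1), \tag{15.36}
$$

where $E_N$ is the empirical KL divergence (15.14), and the coefficients $\underline{\lambda}'^{\mathrm{VB}}_{\mathrm{MM}}$, $\overline{\lambda}'^{\mathrm{VB}}_{\mathrm{MM}}$ are given by

$$
\underline{\lambda}'^{\mathrm{VB}}_{\mathrm{MM}} =
\begin{cases}
(K - 1)\phi + \frac{M}{2} & \left(\phi < \frac{M+1}{2}\right), \\[4pt]
\frac{MK + K - 1}{2} & \left(\phi \ge \frac{M+1}{2}\right),
\end{cases} \tag{15.37}
$$

$$
\overline{\lambda}'^{\mathrm{VB}}_{\mathrm{MM}} =
\begin{cases}
(K - K_0)\phi + \frac{MK_0 + K_0 - 1}{2} & \left(\phi < \frac{M+1}{2}\right), \\[4pt]
\frac{MK + K - 1}{2} & \left(\phi \ge \frac{M+1}{2}\right).
\end{cases} \tag{15.38}
$$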
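The two coefficients in Eqs. (15.37) and (15.38) are easy to tabulate. The sketch below evaluates them together with $D/2 = (MK + K - 1)/2$, the coefficient of a regular model with the same number of parameters; the function name and the example values of M, K, K0, and φ are hypothetical.

```python
def gmm_vb_coefficients(M, K, K0, phi):
    """Return (lower, upper, D / 2) following Eqs. (15.37) and (15.38)."""
    half_dim = (M * K + K - 1) / 2           # D / 2 with D = MK + K - 1
    if phi < (M + 1) / 2:
        lower = (K - 1) * phi + M / 2
        upper = (K - K0) * phi + (M * K0 + K0 - 1) / 2
    else:
        lower = upper = half_dim             # all components are used
    return lower, upper, half_dim


if __name__ == "__main__":
    for phi in (0.1, 0.5, 1.0, 1.5, 3.0):
        print(phi, gmm_vb_coefficients(M=2, K=5, K0=2, phi=phi))
```

For M = 2, K = 5, K0 = 2, and φ = 0.5, the function returns (3.0, 4.0, 7.0). Both coefficients increase linearly in φ and reach D/2 exactly at φ = (M + 1)/2, so they are continuous across the threshold.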
In this theorem, $E_N(\widehat{w})$ is the training error of the VB estimator. Let $\widehat{w}_{\mathrm{ML}}$ be the ML estimator. Then it immediately follows from Eq. (15.14) that

$$N E_N(\widehat{w}) \ge N E_N(\widehat{w}_{\mathrm{ML}}), \tag{15.39}$$

where $N E_N(\widehat{w}_{\mathrm{ML}}) = \min_w \sum_{n=1}^N \log \frac{p(x^{(n)}|w^*)}{p(x^{(n)}|w)}$ is the (maximum) log-likelihood ratio statistic with sign inversion. As discussed in Section 13.5.3, it is conjectured for the GMM defined by Eqs. (15.22) and (15.23) that the log-likelihood ratio diverges in the order of $\log \log N$ (Hartigan, 1985). If this conjecture is proved, the statement of the theorem is simplified to

$$\widetilde{F}^{\mathrm{VB}}(\mathcal{D}) = \lambda' \log N + o_p(\log N),$$

for $\underline{\lambda}'^{\mathrm{VB}}_{\mathrm{MM}} \le \lambda' \le \overline{\lambda}'^{\mathrm{VB}}_{\mathrm{MM}}$. Note, however, that even if $N E_N(\widehat{w}_{\mathrm{ML}})$ diverges to minus infinity, Eq. (15.39) does not necessarily mean that $N E_N(\widehat{w})$ diverges in the same order. Also note that $N E_N(\widehat{w})$ does not affect the upper-bound in Eq. (15.36).

Since the dimension of the parameter $w$ is $D = MK + K - 1$, the relative Bayes free energy coefficient of regular statistical models, on which the Bayesian information criterion (BIC) (Schwarz, 1978) and the minimum description length (MDL) (Rissanen, 1986) are based, is given by $D/2$. Note that, unlike regular models, the advantage of Bayesian learning for singular models is demonstrated by the asymptotic analysis as seen in Eqs. (13.123), (13.124), and (13.125). Theorem 15.5 claims that the coefficient $\overline{\lambda}'^{\mathrm{VB}}_{\mathrm{MM}}$ of $\log N$ is smaller than $D/2$ when $\phi < (M + 1)/2$. This means that the VB free energy $F^{\mathrm{VB}}$ becomes smaller than that of regular models, i.e., $2\overline{\lambda}'^{\mathrm{VB}}_{\mathrm{MM}} \le D$ holds.

Theorem 15.5 shows how the hyperparameters affect the learning process. The coefficients $\underline{\lambda}'^{\mathrm{VB}}_{\mathrm{MM}}$ and $\overline{\lambda}'^{\mathrm{VB}}_{\mathrm{MM}}$ in Eqs. (15.37) and (15.38) are divided into two cases. These cases correspond to whether $\phi < \frac{M+1}{2}$ holds, indicating that the influence of the hyperparameter $\phi$ in the prior $p(\alpha|\phi)$ appears depending on the number $M$ of parameters in each component. Let $\widehat{K}$ be the number of components satisfying $\overline{N}_k = \Theta_p(N)$. Then the following corollary follows from the proof of Theorem 15.5.

Corollary 15.6 The upper-bound in Eq. (15.36) is attained when $\widehat{K} = K_0$ if $\phi < \frac{M+1}{2}$ and $\widehat{K} = K$ if $\phi \ge \frac{M+1}{2}$.

This corollary implies that the phase transition of the VB posterior occurs at $\phi = \frac{M+1}{2}$, i.e., only when $\phi < \frac{M+1}{2}$, the prior distribution reduces redundant components; otherwise, it uses all the components. The phase transition of the posterior occurs also in Bayesian learning, while the phase transition point is different from that of VB learning (Yamazaki and Kaji, 2013).

Theorem 15.5 also implies that the hyperparameter $\phi$ is the only hyperparameter on which the leading term of the VB free energy $F^{\mathrm{VB}}$ depends. This is due to the influence of the hyperparameters on the prior probability density around the true parameters. Consider the case where $K_0 < K$. In this case, for a parameter that gives the true distribution, either of the following holds: $\alpha_k = 0$ for some $k$, or $\mu_i = \mu_j$ for some pair $(i, j)$. The prior distribution $p(\alpha|\phi)$ given by Eq. (15.24) can drastically change the probability density around the points where $\alpha_k = 0$ for some $k$ by changing the hyperparameter $\phi$, while the prior distribution $p(\mu_k|\mu_0, \xi)$ given by Eq. (15.25) always takes positive values for any values of the hyperparameters $\xi$ and $\mu_0$. While the condition for the prior density $p(\alpha|\phi)$ to diverge at $\alpha_k = 0$ is $\phi < 1$, and hence is independent of $M$, the phase transition point of the VB posterior is $\phi = \frac{M+1}{2}$. As we will see in Section 15.4 for the Bernoulli mixture model, if some of the components are located at the boundary of the parameter space, the leading term of the relative VB free energy depends also on the hyperparameter of the prior for component parameters.

Theorem 15.5 is also extended to the case of the general Dirichlet prior $p(\alpha|\phi) = \mathrm{Dirichlet}_K(\alpha; \phi)$, where $\phi = (\phi_1, \ldots, \phi_K)^\top$ is the hyperparameter, as follows:

Theorem 15.7 (Nakamura and Watanabe, 2014) The relative VB free energy of the GMM satisfies

$$
\sum_{k=1}^K \underline{\lambda}'^{\mathrm{VB}}_k \log N + N E_N(\widehat{w}) + O_p(1) \le \widetilde{F}^{\mathrm{VB}}(\mathcal{D}) \le \sum_{k=1}^K \overline{\lambda}'^{\mathrm{VB}}_k \log N + O_p(1),
$$

where the coefficients $\underline{\lambda}'^{\mathrm{VB}}_k$, $\overline{\lambda}'^{\mathrm{VB}}_k$ are given by

$$
\underline{\lambda}'^{\mathrm{VB}}_k =
\begin{cases}
\phi_k - \frac{1}{2K} & \left(k \ne 1 \text{ and } \phi_k < \frac{M+1}{2}\right), \\[4pt]
\frac{M+1}{2} - \frac{1}{2K} & \left(k = 1 \text{ or } \phi_k \ge \frac{M+1}{2}\right),
\end{cases}
$$

$$
\overline{\lambda}'^{\mathrm{VB}}_k =
\begin{cases}
\phi_k - \frac{1}{2K} & \left(k > K_0 \text{ and } \phi_k < \frac{M+1}{2}\right), \\[4pt]
\frac{M+1}{2} - \frac{1}{2K} & \left(k \le K_0 \text{ or } \phi_k \ge \frac{M+1}{2}\right).
\end{cases}
$$

The proof of this theorem is omitted. This theorem implies that the phase transition of the VB posterior of each component occurs at the same transition point $\phi_k = \frac{M+1}{2}$ as in Theorem 15.5.

Proof of Theorem 15.5

Before proving Theorem 15.5, we show two lemmas where the two terms, $R = \mathrm{KL}(r_w(w) \| p(w))$ and $Q = -\log C_{\mathcal{H}}$, in Lemma 15.1 are respectively evaluated.

Lemma 15.8 It holds that

$$
\left| R - \left\{ G(\widehat{\alpha}) + \frac{\xi}{2} \sum_{k=1}^K \|\widehat{\mu}_k - \mu_0\|^2 \right\} \right| \le C,
$$

where $C$ is a constant, $\widehat{\mu}_k = \langle \mu_k \rangle_{r_\mu(\{\mu_k\}_{k=1}^K)} = \frac{\overline{N}_k \overline{x}_k + \xi \mu_0}{\overline{N}_k + \xi}$, and the function $G(\widehat{\alpha})$ of $\widehat{\alpha} = \left\{ \widehat{\alpha}_k = \langle \alpha_k \rangle_{r_\alpha(\alpha)} = \frac{\overline{N}_k + \phi}{N + K\phi} \right\}_{k=1}^K$ is defined by

$$
G(\widehat{\alpha}) = \frac{MK + K - 1}{2} \log N + \left\{ \frac{M + 1}{2} - \phi \right\} \sum_{k=1}^K \log \widehat{\alpha}_k. \tag{15.40}
$$
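Proof Calculating the KL divergence between the posterior and the prior, we obtain

$$
\mathrm{KL}(r_\alpha(\alpha) \| p(\alpha|\phi)) = \sum_{k=1}^K h(\overline{N}_k) - N \Psi(N + K\phi) + \log \Gamma(N + K\phi) + \log \frac{\Gamma(\phi)^K}{\Gamma(K\phi)}, \tag{15.41}
$$

where we use the notation $h(x) = x \Psi(x + \phi) - \log \Gamma(x + \phi)$. Similarly, we obtain

$$
\mathrm{KL}\left(r_\mu(\{\mu_k\}_{k=1}^K) \,\|\, p(\{\mu_k\}_{k=1}^K|\mu_0, \xi)\right)
= \sum_{k=1}^K \left\{ \frac{M}{2} \log \frac{\overline{N}_k + \xi}{\xi} - \frac{M}{2} + \frac{M}{2} \frac{\xi}{\overline{N}_k + \xi} \right\} + \frac{\xi}{2} \sum_{k=1}^K \|\widehat{\mu}_k - \mu_0\|^2. \tag{15.42}
$$

By using Inequalities (15.8) and (15.9), we obtain

$$
-1 + \frac{12\phi - 1}{12(x + \phi)} \le h(x) + \left(\phi - \frac{1}{2}\right) \log(x + \phi) - x - \phi + \frac{1}{2} \log 2\pi \le 0. \tag{15.43}
$$

Thus, from Eqs. (15.41), (15.42), (15.43), and

$$
R = \mathrm{KL}(r_\alpha(\alpha) \| p(\alpha|\phi)) + \mathrm{KL}\left(r_\mu(\{\mu_k\}_{k=1}^K) \,\|\, p(\{\mu_k\}_{k=1}^K|\mu_0, \xi)\right),
$$

it follows that

$$
\begin{aligned}
\left| R - \left\{ G(\widehat{\alpha}) + \frac{\xi}{2} \sum_{k=1}^K \|\widehat{\mu}_k - \mu_0\|^2 \right\} \right|
\le{}& \frac{MK + K - 1}{2} \log\left(1 + \frac{K\phi}{N}\right) + \frac{12N + 1}{12(N + K\phi)} + K + \sum_{k=1}^K \frac{|12\phi - 1|}{12(\overline{N}_k + \phi)} + \frac{K - 1}{2} \log 2\pi \\
&+ \left| \log \frac{\Gamma(\phi)^K}{\Gamma(K\phi)} - \frac{MK}{2}(1 + \log \xi) \right| + \sum_{k=1}^K \frac{\xi M}{2(\overline{N}_k + \xi)} + \frac{M}{2} \sum_{k=1}^K \left| \log \frac{\overline{N}_k + \xi}{\overline{N}_k + \phi} \right|.
\end{aligned}
$$

The right-hand side of the preceding inequality is bounded by a constant since

$$\frac{1}{N + \xi} < \frac{1}{\overline{N}_k + \xi} < \frac{1}{\xi},$$

and

$$\frac{1}{N + \phi} < \frac{1}{\overline{N}_k + \phi} < \frac{1}{\phi}.$$

Lemma 15.9 It holds that

$$
Q = -\sum_{n=1}^N \log \left( \sum_{k=1}^K \frac{1}{\sqrt{2\pi}^M} \exp\left( \Psi(\overline{N}_k + \phi) - \Psi(N + K\phi) - \frac{\|x^{(n)} - \widehat{\mu}_k\|^2}{2} - \frac{M}{2} \frac{1}{\overline{N}_k + \xi} \right) \right), \tag{15.44}
$$

and

$$
N E_N(\widehat{w}) - \frac{N}{N + K\phi} \le \widetilde{Q} \le N \overline{E}_N(\widehat{w}) - \frac{N}{2(N + K\phi)}, \tag{15.45}
$$

where $E_N(\widehat{w})$ is given by Eq. (15.14) and $\overline{E}_N(\widehat{w})$ is defined by

$$
\overline{E}_N(\widehat{w}) = \frac{1}{N} \sum_{n=1}^N \log \frac{p(x^{(n)}|w^*)}{\sum_{k=1}^K \frac{\widehat{\alpha}_k}{\sqrt{2\pi}^M} \exp\left( -\frac{\|x^{(n)} - \widehat{\mu}_k\|^2}{2} - \frac{M + 2}{2(\overline{N}_k + \min\{\phi, \xi\})} \right)}.
$$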
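Proof

$$
\begin{aligned}
C_{\mathcal{H}} &= \prod_{n=1}^N \sum_{z^{(n)}} \exp \left\langle \log p(x^{(n)}, z^{(n)}|w) \right\rangle_{r_w(w)} \\
&= \prod_{n=1}^N \sum_{k=1}^K \frac{1}{\sqrt{2\pi}^M} \exp\left( \Psi(\overline{N}_k + \phi) - \Psi(N + K\phi) - \frac{\|x^{(n)} - \widehat{\mu}_k\|^2}{2} - \frac{M}{2} \frac{1}{\overline{N}_k + \xi} \right).
\end{aligned}
$$

Thus, we have Eq. (15.44). Using again Inequality (15.8), we obtain

$$
Q \le -\sum_{n=1}^N \log \left( \sum_{k=1}^K \frac{\widehat{\alpha}_k}{\sqrt{2\pi}^M} \exp\left( -\frac{\|x^{(n)} - \widehat{\mu}_k\|^2}{2} - \frac{M + 2}{2(\overline{N}_k + \min\{\phi, \xi\})} \right) \right) - \frac{N}{2(N + K\phi)}, \tag{15.46}
$$

and

$$
Q \ge -\sum_{n=1}^N \log \left( \sum_{k=1}^K \frac{\widehat{\alpha}_k}{\sqrt{2\pi}^M} \exp\left( -\frac{\|x^{(n)} - \widehat{\mu}_k\|^2}{2} \right) \right) - \frac{N}{N + K\phi},
$$

which give upper- and lower-bounds in Eq. (15.45), respectively.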
Now, from the preceding lemmas, we prove Theorem 15.5 by showing upper- and lower-bounds, respectively. First, we show the upper-bound in Eq. (15.36). From Lemma 15.1, Lemma 15.8, and Lemma 15.9, it follows that

$$F - N S_N(\mathcal{D}) \le \min_{\widehat{w}} T_N(\widehat{w}) + C, \tag{15.47}$$

where

$$T_N(\widehat{w}) = G(\widehat{\alpha}) + \frac{\xi}{2} \sum_{k=1}^K \|\widehat{\mu}_k - \mu_0\|^2 + N \overline{E}_N(\widehat{w}).$$

From Eq. (15.47), it is noted that the function values of $T_N(\widehat{w})$ at specific points of the variational parameter $\widehat{w}$ give upper-bounds of the VB free energy $F^{\mathrm{VB}}(\mathcal{D})$. Hence, let us consider the following two cases.

(I) Consider the case where all components, including redundant ones, are used to learn $K_0$ true components, i.e.,

$$\widehat{\alpha}_k = \frac{\alpha_k^* N + \phi}{N + K\phi} \quad (1 \le k \le K_0 - 1),$$

$$\widehat{\alpha}_k = \frac{\alpha_{K_0}^* N/(K - K_0 + 1) + \phi}{N + K\phi} \quad (K_0 \le k \le K),$$

$$\widehat{\mu}_k = \mu_k^* \quad (1 \le k \le K_0 - 1),$$

$$\widehat{\mu}_k = \mu_{K_0}^* \quad (K_0 \le k \le K).$$
α∗k N+Kφ and
μk 2 x(n) −
M+2 αk − exp − > 0. √ 2 2(N k + min{φ, ξ}) 2π M
The second inequality follows from $\log(1 + x) \le x$ for $x > -1$. It follows that

$$T_N(\widehat{w}) < \left\{ (K - K_0)\phi + \frac{MK_0 + K_0 - 1}{2} \right\} \log N + C'', \tag{15.50}$$

where $C''$ is a constant. From Eqs. (15.47), (15.48), and (15.50), we obtain the upper-bound in Eq. (15.36).

Next we show the lower-bound in Eq. (15.36). It follows from Lemma 15.1, Lemma 15.8, and Lemma 15.9 that

$$F - N S_N(\mathcal{D}) \ge \min_{\widehat{\alpha}} \{G(\widehat{\alpha})\} + N E_N(\widehat{w}) - C - 1. \tag{15.51}$$

If $\phi \ge \frac{M+1}{2}$, then

$$G(\widehat{\alpha}) \ge \frac{MK + K - 1}{2} \log N - \left( \frac{M + 1}{2} - \phi \right) K \log K,$$

since Jensen's inequality yields that

$$\sum_{k=1}^K \log \widehat{\alpha}_k \le K \log \left( \frac{1}{K} \sum_{k=1}^K \widehat{\alpha}_k \right) = K \log \frac{1}{K}. \tag{15.52}$$
15.3 Mixture of Exponential Family Distributions If φ
0 and scale parameter $\beta > 0$:

$$p(x|\alpha, \beta) = \mathrm{Gamma}(x; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} \exp(-\beta x), \tag{15.73}$$

where $0 \le x < \infty$. The natural parameter $\eta$ is given by $\eta_1 = \beta$ and $\eta_2 = \alpha - 1$. Hence, Eq. (15.66) holds where $\underline{\lambda}'^{\mathrm{VB}}_{\mathrm{MM}}$ and $\overline{\lambda}'^{\mathrm{VB}}_{\mathrm{MM}}$ are given by Eqs. (15.67) and (15.68) with $M = 2$. When the shape parameter $\alpha$ is known, the likelihood ratio in ML learning diverges in the order of $\log \log N$ (Liu et al., 2003). This implies that $N E_N(\widehat{w}) = O_p(\log \log N)$ from Eq. (15.70).

Example 3 (Gaussian) Consider the $L$-dimensional Gaussian component with mean $\mu$ and covariance matrix $\Sigma$:

$$
p(x|\mu, \Sigma) = \mathrm{Gauss}_L(x; \mu, \Sigma) = \frac{1}{(2\pi)^{L/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right).
$$

The natural parameter $\eta$ is given by $\mu^\top \Sigma^{-1}$ and $\Sigma^{-1}$. These are functions of the elements of $\mu$ and the upper-right half of $\Sigma^{-1}$. Hence, Eq. (15.66) holds where $\underline{\lambda}'^{\mathrm{VB}}_{\mathrm{MM}}$ and $\overline{\lambda}'^{\mathrm{VB}}_{\mathrm{MM}}$ are given by Eqs. (15.67) and (15.68) with $M = L + L(L + 1)/2$. If the covariance matrix $\Sigma$ is known and the parameter is restricted to the mean $\mu$, it is conjectured that the likelihood ratio in ML learning diverges in the order of $\log \log N$ (Hartigan, 1985). This suggests that the likelihood ratio can diverge in a higher order than $\log \log N$ if the covariance matrices are also estimated.

Other than these examples, Theorems 15.10 and 15.11 apply to mixtures of distributions such as multinomial, Poisson, and Weibull.
Proof of Theorem 15.10

Here Theorem 15.10 is proved in the same way as Theorem 15.5. Since the VB posterior satisfies $r_w(w) = r_\alpha(\alpha)\, r_\eta(\{\eta_k\}_{k=1}^K)$, we have

$$
R = \mathrm{KL}(r_w(w) \| p(w)) = \mathrm{KL}(r_\alpha(\alpha) \| p(\alpha|\phi)) + \sum_{k=1}^K \mathrm{KL}(r_\eta(\eta_k) \| p(\eta_k|\nu_0, \xi)). \tag{15.74}
$$

The following lemma is used for evaluating $\mathrm{KL}(r_\eta(\eta_k) \| p(\eta_k|\nu_0, \xi))$ in the case of the mixture of exponential family distributions.

Lemma 15.12 It holds that

$$
\mathrm{KL}(r_\eta(\eta_k) \| p(\eta_k|\nu_0, \xi)) = \frac{M}{2} \log(\overline{N}_k + \xi) - \log p(\overline{\eta}_k|\nu_0, \xi) + O_p(1),
$$

where

$$
\overline{\eta}_k = \langle \eta_k \rangle_{r_\eta(\eta_k)} = \frac{1}{\widehat{\xi}_k} \frac{\partial \log C(\widehat{\xi}_k, \widehat{\nu}_k)}{\partial \widehat{\nu}_k}. \tag{15.75}
$$
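Proof Using the VB posterior, Eq. (15.58), we obtain

$$
\mathrm{KL}(r_\eta(\eta_k) \| p(\eta_k|\nu_0, \xi)) = -\log \frac{C(\widehat{\xi}_k, \widehat{\nu}_k)}{C(\xi, \nu_0)} + \overline{N}_k \left\{ \overline{\nu}_k^\top \langle \eta_k \rangle_{r_\eta(\eta_k)} - \langle A(\eta_k) \rangle_{r_\eta(\eta_k)} \right\}, \tag{15.76}
$$

where we used $\widehat{\xi}_k = \overline{N}_k + \xi$. Let us now evaluate the value of $C(\widehat{\xi}_k, \widehat{\nu}_k)$ when $\widehat{\xi}_k$ is sufficiently large. From Assumption 15.4, using the saddle point approximation, we obtain

$$
C(\widehat{\xi}_k, \widehat{\nu}_k) = \exp\left( \widehat{\xi}_k \{ \widehat{\nu}_k^\top \widehat{\eta}_k - A(\widehat{\eta}_k) \} \right) \left( \frac{2\pi}{\widehat{\xi}_k} \right)^{M/2} \sqrt{\det F(\widehat{\eta}_k)^{-1}} \left\{ 1 + O_p\left( \frac{1}{\widehat{\xi}_k} \right) \right\}, \tag{15.77}
$$

where $\widehat{\eta}_k$ is the maximizer of the function $\widehat{\nu}_k^\top \eta_k - A(\eta_k)$, that is,

$$\frac{\partial A(\widehat{\eta}_k)}{\partial \eta_k} = \widehat{\nu}_k.$$

Therefore, $-\log C(\widehat{\xi}_k, \widehat{\nu}_k)$ is evaluated as

$$
-\log C(\widehat{\xi}_k, \widehat{\nu}_k) = \frac{M}{2} \log \frac{\widehat{\xi}_k}{2\pi} + \frac{1}{2} \log \det F(\widehat{\eta}_k) - \widehat{\xi}_k \left\{ \widehat{\nu}_k^\top \widehat{\eta}_k - A(\widehat{\eta}_k) \right\} + O_p\left( \frac{1}{\widehat{\xi}_k} \right). \tag{15.78}
$$

Applying the saddle point approximation to

$$
\overline{\eta}_k - \widehat{\eta}_k = \frac{1}{C(\widehat{\xi}_k, \widehat{\nu}_k)} \int (\eta_k - \widehat{\eta}_k) \exp\left( \widehat{\xi}_k \left\{ \widehat{\nu}_k^\top \eta_k - A(\eta_k) \right\} \right) d\eta_k,
$$

we obtain

$$
\|\overline{\eta}_k - \widehat{\eta}_k\| \le \frac{A}{\widehat{\xi}_k} + O_p\left( \widehat{\xi}_k^{-3/2} \right), \tag{15.79}
$$

where $A$ is a constant. Since

$$
A(\eta_k) - A(\widehat{\eta}_k) = (\eta_k - \widehat{\eta}_k)^\top \widehat{\nu}_k + \frac{1}{2} (\eta_k - \widehat{\eta}_k)^\top F(\eta'_k) (\eta_k - \widehat{\eta}_k), \tag{15.80}
$$

for some point $\eta'_k$ on the line segment between $\eta_k$ and $\widehat{\eta}_k$, we have

$$
A(\overline{\eta}_k) - A(\widehat{\eta}_k) = (\overline{\eta}_k - \widehat{\eta}_k)^\top \widehat{\nu}_k + O_p(\widehat{\xi}_k^{-2}), \tag{15.81}
$$

and applying the saddle point approximation, we obtain

$$
\langle A(\eta_k) \rangle_{r_\eta(\eta_k)} - A(\widehat{\eta}_k) = (\overline{\eta}_k - \widehat{\eta}_k)^\top \widehat{\nu}_k + \frac{M}{2\widehat{\xi}_k} + O_p\left( \widehat{\xi}_k^{-3/2} \right). \tag{15.82}
$$

From Eqs. (15.81) and (15.82), we have

$$
\langle A(\eta_k) \rangle_{r_\eta(\eta_k)} - A(\overline{\eta}_k) = \frac{M}{2\widehat{\xi}_k} + O_p\left( \widehat{\xi}_k^{-3/2} \right). \tag{15.83}
$$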
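Thus, from Eqs. (15.76), (15.78), (15.81), and (15.82), we obtain the lemma.

Lemmas 15.8 and 15.9 are substituted by the following lemmas.

Lemma 15.13 It holds that

$$
\left| R - G(\widehat{\alpha}) + \sum_{k=1}^K \log p(\overline{\eta}_k|\nu_0, \xi) \right| \le C, \tag{15.84}
$$

where $C$ is a constant and the function $G(\widehat{\alpha})$ is defined by Eq. (15.40).

Proof From Eqs. (15.41), (15.43), and (15.74) and Lemma 15.12,

$$
\left| R - G(\widehat{\alpha}) + \sum_{k=1}^K \log p(\overline{\eta}_k|\nu_0, \xi) \right|
$$

is upper-bounded by a constant since

$$\frac{1}{N + \xi} < \frac{1}{\overline{N}_k + \xi} < \frac{1}{\xi}.$$

Lemma 15.14 It holds that

$$
N E_N(\widehat{w}) + O_p(1) \le \widetilde{Q} = -\log C_{\mathcal{H}} - N S_N(\mathcal{D}) \le N \overline{E}_N(\widehat{w}) + O_p(1), \tag{15.85}
$$

where the function $E_N(w)$ is defined by Eq. (15.69) and

$$
\overline{E}_N(\widehat{w}) = \frac{1}{N} \sum_{n=1}^N \log \frac{p(t^{(n)}|w^*)}{\sum_{k=1}^K \widehat{\alpha}_k\, p(t^{(n)}|\overline{\eta}_k) \exp\left( -\frac{A}{\overline{N}_k + \min\{\phi, \xi\}} \right)},
$$

where $A$ is a constant.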
Proof CH =
N K
exp log αk p(t (n) |ηk )
rw (w)
n=1 k=1
=
N K
exp Ψ (N k + φ) − Ψ (N + Kφ) + ηk t (n) − A(ηk ) rη (ηk ) + B(t (n) ) .
n=1 k=1
Again, using the inequalities in Eqs. (15.8) and (15.83), we obtain ⎛ K N ! 3 "⎞⎟ ⎜⎜⎜ M+2 − ⎟ (n) ⎜ + Op N k 2 ⎟⎟⎟⎠ + Op (1), αk p(t | Q≤ log ⎜⎝ ηk ) exp − 2(N k + min{φ, ξ}) n=1 k=1 ⎛ K ⎞ N ⎜⎜ ⎟⎟ Q≥− αk p(t (n) | log ⎜⎜⎜⎝ ηk )⎟⎟⎟⎠ + Op (1), n=1
k=1
which give the upper- and lower-bounds in Eq. (15.85), respectively.
Since the prior distribution $p(\{\eta_k\}_{k=1}^{K}|\nu_0,\xi)$ satisfies $0 < p(\{\eta_k\}_{k=1}^{K}|\nu_0,\xi) < \infty$, from Lemmas 15.13 and 15.14, we complete the proof of Theorem 15.10 in the same way as that of Theorem 15.5.
Proof of Theorem 15.11 The upper-bound follows from Theorem 15.10. From Lemmas 15.1 and 15.13 and the boundedness of the prior, we have the following lower-bound: + Op (1). α) + Q F − NS N (D) ≥ G( ∗ Lemma 15.4 implies that for ε > 0, if D ∈ T εN ( p∗ ) and p RCε 2 for the constant C in the lemma,
= Ωp (N). Q Since G( α) = Op (log N), this means that if the free energy is minimized, ∗
p ∈ RCε 2 for sufficiently large N, which implies that at least K0 components are active and | log αk | = Op (1) holds for at least K0 components. By minimizing G( α) under this constraint and the second assertion of Lemma 15.4, we have
VB
F VB (D) − NS N (D) ≥ λMM log N + Op (1), for D ∈ T εN ( p∗ ). Because the probability that the observed data sequence is strongly ε-typical tends to 1 as N → ∞ for any ε > 0 by Lemma 15.3, we obtain the theorem.
15.4 Mixture of Bernoulli with Deterministic Components

In the previous sections, we assumed that all true component parameters are in the interior of the parameter space. In this section, we consider the Bernoulli mixture model when some components are at the boundary of the parameter space (Kaji et al., 2010). For an $M$-dimensional binary vector, $x = (x_1,\ldots,x_M)^\top \in \{0,1\}^M$, we define the Bernoulli distribution with parameter $\mu = (\mu_1,\ldots,\mu_M)^\top$ as
$$\mathrm{Bern}_M(x|\mu) = \prod_{m=1}^{M}\mu_m^{x_m}(1-\mu_m)^{(1-x_m)}.$$
For each element of $\mu$, its conjugate prior, the Beta distribution, is given by
$$\mathrm{Beta}(\mu; a, b) = \frac{1}{B(a,b)}\mu^{a-1}(1-\mu)^{b-1},$$
for $a, b > 0$. The Bernoulli mixture model that we consider is given by
$$p(z|\alpha) = \mathrm{Multinomial}_{K,1}(z;\alpha), \qquad (15.86)$$
$$p(x|z,\{\mu_k\}_{k=1}^{K}) = \prod_{k=1}^{K}\left\{\mathrm{Bern}_M(x;\mu_k)\right\}^{z_k}, \qquad (15.87)$$
$$p(\alpha|\phi) = \mathrm{Dirichlet}_K(\alpha;(\phi,\ldots,\phi)^\top), \qquad (15.88)$$
$$p(\mu_k|\xi) = \prod_{m=1}^{M}\mathrm{Beta}(\mu_{km};\xi,\xi), \qquad (15.89)$$
where $\phi > 0$ and $\xi > 0$ are hyperparameters. Under the constraint, $r(H,w) = r_H(H)r_w(w)$, the VB posteriors are given as follows:
$$r(\{z^{(n)}\}_{n=1}^{N},\alpha,\{\mu_k\}_{k=1}^{K}) = r_z(\{z^{(n)}\}_{n=1}^{N})\,r_\alpha(\alpha)\,r_\mu(\{\mu_k\}_{k=1}^{K}),$$
$$r_z(\{z^{(n)}\}_{n=1}^{N}) = \prod_{n=1}^{N}\mathrm{Multinomial}_{K,1}\left(z^{(n)};\hat{z}^{(n)}\right),$$
$$r_\alpha(\alpha) = \mathrm{Dirichlet}\left(\alpha;(\hat{\phi}_1,\ldots,\hat{\phi}_K)^\top\right),$$
$$r_\mu(\{\mu_k\}_{k=1}^{K}) = \prod_{k=1}^{K}\prod_{m=1}^{M}\mathrm{Beta}\left(\mu_{km};\hat{a}_{km},\hat{b}_{km}\right).$$
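The variational parameters $\{\hat{z}^{(n)}\}_{n=1}^{N}$, $\{\hat{\phi}_k\}_{k=1}^{K}$, $\{\{\hat{a}_{km}\}_{m=1}^{M},\{\hat{b}_{km}\}_{m=1}^{M}\}_{k=1}^{K}$ minimize the free energy,
$$F = \sum_{n=1}^{N}\sum_{k=1}^{K}\hat{z}_k^{(n)}\log\hat{z}_k^{(n)} + \log\left(\frac{\Gamma(\sum_{k=1}^{K}\hat{\phi}_k)}{\prod_{k=1}^{K}\Gamma(\hat{\phi}_k)}\right) - \log\left(\frac{\Gamma(K\phi)}{(\Gamma(\phi))^K}\right) + \sum_{k=1}^{K}\left(\hat{\phi}_k - \phi - \bar{N}_k\right)\left(\Psi(\hat{\phi}_k) - \Psi(\textstyle\sum_{k'=1}^{K}\hat{\phi}_{k'})\right)$$
$$\quad + \sum_{k=1}^{K}\sum_{m=1}^{M}\Bigg\{\log\left(\frac{\Gamma(\hat{a}_{km}+\hat{b}_{km})}{\Gamma(\hat{a}_{km})\Gamma(\hat{b}_{km})}\right) - \log\left(\frac{\Gamma(2\xi)}{(\Gamma(\xi))^2}\right) + \left(\hat{a}_{km} - \xi - \bar{N}_k\bar{x}_{km}\right)\left(\Psi(\hat{a}_{km}) - \Psi(\hat{a}_{km}+\hat{b}_{km})\right)$$
$$\qquad + \left(\hat{b}_{km} - \xi - \bar{N}_k(1-\bar{x}_{km})\right)\left(\Psi(\hat{b}_{km}) - \Psi(\hat{a}_{km}+\hat{b}_{km})\right)\Bigg\},$$
where
$$\bar{N}_k = \sum_{n=1}^{N}\left\langle z_k^{(n)}\right\rangle_{r_H(H)}, \qquad \bar{x}_{km} = \frac{1}{\bar{N}_k}\sum_{n=1}^{N}\left\langle z_k^{(n)}\right\rangle_{r_H(H)}x_m^{(n)},$$
for $k = 1,\ldots,K$ and $m = 1,\ldots,M$. The stationary condition of the free energy yields
$$\hat{z}_k^{(n)} = \frac{\bar{z}_k^{(n)}}{\sum_{k'=1}^{K}\bar{z}_{k'}^{(n)}}, \qquad (15.90)$$
$$\hat{\phi}_k = \bar{N}_k + \phi, \qquad (15.91)$$
$$\hat{a}_{km} = \bar{N}_k\bar{x}_{km} + \xi, \qquad (15.92)$$
$$\hat{b}_{km} = \bar{N}_k(1-\bar{x}_{km}) + \xi, \qquad (15.93)$$
where
$$\bar{z}_k^{(n)} = \exp\Bigg(\Psi(\hat{\phi}_k) - \Psi(\textstyle\sum_{k'=1}^{K}\hat{\phi}_{k'}) + \sum_{m=1}^{M}\Big\{x_m^{(n)}\left(\Psi(\hat{a}_{km}) - \Psi(\hat{a}_{km}+\hat{b}_{km})\right) + (1-x_m^{(n)})\left(\Psi(\hat{b}_{km}) - \Psi(\hat{a}_{km}+\hat{b}_{km})\right)\Big\}\Bigg). \qquad (15.94)$$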
We assume the following condition.

Assumption 15.5 For $0 \le K_1^* \le K_0^* \le K$, the true distribution $q(x) = p(x|w^*)$ is represented by $K_0^*$ components and the parameter is given by $w^* = \{\alpha_k^*,\mu_k^*\}_{k=1}^{K_0^*}$:
$$q(x) = p(x|w^*) = \sum_{k=1}^{K_0^*}\alpha_k^*\,\mathrm{Bern}_M(x;\mu_k^*),$$
where $0 < \mu_{km}^* < 1$ $(1 \le k \le K_1^*)$, $\mu_{km}^* = 0$ or $1$ $(K_1^*+1 \le k \le K_0^*)$. We define $\Delta K^* = K_0^* - K_1^*$.

Let $\hat{K}_0$ be the number of components satisfying $\bar{N}_k/N = \Omega_p(1)$ and $\hat{K}_1$ be the number of components satisfying $\bar{x}_{km} = \Omega_p(1)$ and $1 - \bar{x}_{km} = \Omega_p(1)$ for all $m = 1,\ldots,M$. Then, for $\Delta\hat{K} \equiv \hat{K}_0 - \hat{K}_1$ components, it holds that $\bar{N}_k/N = \Omega_p(1)$ and $\bar{x}_{km} = o_p(1)$ or $1 - \bar{x}_{km} = o_p(1)$. Hence, the $\hat{K}_1$ components with $\bar{N}_k/N = \Omega_p(1)$ and $\bar{x}_{km} = \Omega_p(1)$ are said to be "nondeterministic" and the $\Delta\hat{K}$ components are said to be "deterministic," respectively. We have the following theorem.

Theorem 15.15 The relative free energy of the Bernoulli mixture model satisfies
$$\tilde{F}^{\mathrm{VB}}(D) = F^{\mathrm{VB}}(D) - N S_N(D) = \left\{\left(\frac{M+1}{2} - \phi\right)\hat{K}_1 + \left(\frac{1}{2} - \phi + M\xi\right)\Delta\hat{K} + K\phi - \frac{1}{2}\right\}\log N + \Omega_p(N J) + O_p(1),$$
where $J = 1$ if $\hat{K}_1 < K_1^*$ or $\Delta\hat{K} < \Delta K^*$, and otherwise $J = 0$.

The proof of Theorem 15.15 is shown after the next theorem. The following theorem claims that the numbers of deterministic and nondeterministic components are essentially determined by the hyperparameters.
Theorem 15.16 The estimated numbers of components $\hat{K}_0$ and $\hat{K}_1$ of the Bernoulli mixture model are determined as follows:

(1) If $\frac{M+1}{2} - \phi > 0$ and $\frac{1}{2} - \phi + M\xi > 0$, then $\hat{K}_1 = K_1^*$ and $\Delta\hat{K} = \Delta K^*$.
(2) If $\frac{M+1}{2} - \phi > 0$ and $\frac{1}{2} - \phi + M\xi < 0$, then $\hat{K}_1 = K_1^*$ and $\Delta\hat{K} = K - K_1^*$.
(3) If $\frac{M+1}{2} - \phi < 0$ and $\frac{1}{2} - \phi + M\xi > 0$, then $\hat{K}_1 = K - \Delta K^*$ and $\Delta\hat{K} = \Delta K^*$.
(4) If $\frac{M+1}{2} - \phi < 0$ and $\frac{1}{2} - \phi + M\xi < 0$, and
  (a) if $\xi > \frac{1}{2}$, then $\hat{K}_1 = K - \Delta K^*$ and $\Delta\hat{K} = \Delta K^*$.
  (b) if $\xi < \frac{1}{2}$, then $\hat{K}_1 = K_1^*$ and $\Delta\hat{K} = K - K_1^*$.

Proof Minimizing the coefficient of the relative free energy with respect to $\hat{K}_1$ and $\Delta\hat{K}$ under the constraint that the true distribution is realizable, i.e., $\hat{K}_1 \ge K_1^*$ and $\Delta\hat{K} \ge \Delta K^*$, we obtain the theorem.
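The case analysis of Theorem 15.16 can be checked numerically by minimizing the $\log N$ coefficient of Theorem 15.15 over the admissible pairs. The following is a minimal sketch, not the authors' procedure: the function name and the explicit budget constraint $\hat{K}_1 + \Delta\hat{K} \le K$ are assumptions made here for illustration.

```python
def estimated_component_counts(K, K1_true, dK_true, M, phi, xi):
    """Brute-force minimizer of the log N coefficient in Theorem 15.15.

    Hypothetical helper: searches K1_hat >= K1_true and dK_hat >= dK_true,
    assuming (for illustration) that K1_hat + dK_hat <= K.
    Requires K1_true + dK_true <= K.
    """
    c_nondet = (M + 1) / 2 - phi   # coefficient of each nondeterministic component
    c_det = 0.5 - phi + M * xi     # coefficient of each deterministic component
    best = None
    for K1 in range(K1_true, K + 1):
        for dK in range(dK_true, K - K1 + 1):
            coeff = c_nondet * K1 + c_det * dK + K * phi - 0.5
            if best is None or coeff < best[0]:
                best = (coeff, K1, dK)
    return best[1], best[2]
```

With $K = 5$, $K_1^* = 2$, $\Delta K^* = 1$, and $M = 3$, small $\phi$ keeps only the true components, while large $\phi$ fills the remaining slots with whichever component type has the more negative coefficient.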
Proof of Theorem 15.15 From Lemma 15.1, we first evaluate $R = \mathrm{KL}(r_w(w)\|p(w))$. The inequalities of the di-gamma and log-gamma functions in Eqs. (15.8) and (15.9) yield that
$$R = \sum_{k=1}^{K}\left(\frac{1}{2} - \phi\right)\log(\bar{N}_k+\phi) + \left(K\phi - \frac{1}{2}\right)\log(N+K\phi) + \sum_{k=1}^{K}\sum_{m=1}^{M}\left\{\left(\frac{1}{2} - \xi\right)\log(\bar{N}_k\bar{x}_{km}+\xi) + \left(\frac{1}{2} - \xi\right)\log(\bar{N}_k(1-\bar{x}_{km})+\xi)\right\}$$
$$\quad + M\sum_{k=1}^{K}\left(2\xi - \frac{1}{2}\right)\log(\bar{N}_k+2\xi) + O_p(1).$$
We consider variational parameters in which $\hat{K}_0$ components are active, i.e., $\bar{N}_k = \Omega_p(N)$. Furthermore, without loss of generality, we can assume that $0 < \bar{x}_{km} < 1$ $(1 \le m \le M)$ for $\hat{K}_1$ nondeterministic components and $\bar{x}_{km} = O_p(1/N)$ $(1 \le m \le M)$ for $\Delta\hat{K}$ deterministic components. Putting such variational parameters into the preceding expression, we have the asymptotic form in the theorem.

Lemmas 15.3 and 15.4 imply that if $\hat{K}_1 < K_1^*$ or $\Delta\hat{K} < \Delta K^*$, $Q = -\log C_H - N S_N(D) = \Omega_p(N)$, and otherwise $Q = O_p(1)$. Thus, we obtain the theorem.
16 Asymptotic VB Theory of Other Latent Variable Models
In this chapter, we proceed to asymptotic analyses of VB learning in other latent variable models discussed in Section 4.2, namely, Bayesian networks, hidden Markov models, probabilistic context-free grammar, and latent Dirichlet allocation.
16.1 Bayesian Networks

In this section, we analyze the VB free energy of the following Bayesian network model (Watanabe et al., 2009), introduced in Section 4.2.1:
$$p(x|w) = \sum_{z\in\mathcal{Z}}p(x,z|w), \qquad (16.1)$$
$$p(x,z|w) = p(x|b_z)\prod_{k=1}^{K}\prod_{i=1}^{T_k}a_{(k,i)}^{z_{k,i}},$$
$$p(x|b_z) = \prod_{j=1}^{M}\prod_{l=1}^{Y_j}b_{(j,l|z)}^{x_{j,l}},$$
$$p(w) = \left\{\prod_{k=1}^{K}p(a_k|\phi)\right\}\left\{\prod_{z\in\mathcal{Z}}\prod_{j=1}^{M}p(b_{j|z}|\xi)\right\},$$
$$p(a_k|\phi) = \mathrm{Dirichlet}_{T_k}\left(a_k;(\phi,\ldots,\phi)^\top\right),$$
$$p(b_{j|z}|\xi) = \mathrm{Dirichlet}_{Y_j}\left(b_{j|z};(\xi,\ldots,\xi)^\top\right),$$
where $\phi > 0$ and $\xi > 0$ are hyperparameters. Here, $\mathcal{Z} = \{(z_1,\ldots,z_K);\ z_k \in \{e_i\}_{i=1}^{T_k},\ k = 1,\ldots,K\}$, and $z_k \in \{e_i\}_{i=1}^{T_k}$ is the one-of-K representation, i.e., $z_{k,i} = 1$ for some $i \in \{1,\ldots,T_k\}$ and $z_{k,j} = 0$ for $j \ne i$. Also, $x = (x_1,\ldots,x_M)$ for $x_j \in \{e_l\}_{l=1}^{Y_j}$. The number of the parameters of this model is
$$D = M_{\mathrm{obs}}\prod_{k=1}^{K}T_k + \sum_{k=1}^{K}(T_k - 1), \qquad (16.2)$$
where
$$M_{\mathrm{obs}} = \sum_{j=1}^{M}(Y_j - 1).$$
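Under the constraint, $r(H,w) = r_H(H)r_w(w)$, the VB posteriors are given by
$$r_w(w) = \left\{\prod_{k=1}^{K}r_a(a_k)\right\}\left\{\prod_{z\in\mathcal{Z}}\prod_{j=1}^{M}r_b(b_{j|z})\right\},$$
$$r_a(a_k) = \mathrm{Dirichlet}_{T_k}\left(a_k;\hat{\phi}_k\right), \qquad (16.3)$$
$$r_b(b_{j|z}) = \mathrm{Dirichlet}_{Y_j}\left(b_{j|z};\hat{\xi}_{j|z}\right), \qquad (16.4)$$
$$r_H(H) = \prod_{n=1}^{N}r_z(z^{(n)}),$$
where
$$r_z(z^{(n)} = z) \propto \exp\Bigg(\sum_{k=1}^{K}\left\{\Psi(\hat{\phi}_{(k,i_k)}) - \Psi\left(\textstyle\sum_{i'_k=1}^{T_k}\hat{\phi}_{(k,i'_k)}\right)\right\} + \sum_{j=1}^{M}\left\{\Psi(\hat{\xi}_{(j,l_j^{(n)}|z)}) - \Psi\left(\textstyle\sum_{l'_j=1}^{Y_j}\hat{\xi}_{(j,l'_j|z)}\right)\right\}\Bigg) \qquad (16.5)$$
for $z = (e_{i_1},\ldots,e_{i_K})$ and $x_j^{(n)} = e_{l_j^{(n)}}$. The free energy is given by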
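$$F = \sum_{k=1}^{K}\Bigg\{\log\left(\frac{\Gamma(\sum_{i=1}^{T_k}\hat{\phi}_{(k,i)})}{\prod_{i=1}^{T_k}\Gamma(\hat{\phi}_{(k,i)})}\right) - \log\left(\frac{\Gamma(T_k\phi)}{(\Gamma(\phi))^{T_k}}\right) + \sum_{i=1}^{T_k}\left(\hat{\phi}_{(k,i)} - \phi - \bar{N}^{z}_{(k,i)}\right)\left(\Psi(\hat{\phi}_{(k,i)}) - \Psi(\textstyle\sum_{i'=1}^{T_k}\hat{\phi}_{(k,i')})\right)\Bigg\}$$
$$\quad + \sum_{z\in\mathcal{Z}}\sum_{j=1}^{M}\Bigg\{\log\left(\frac{\Gamma(\sum_{l=1}^{Y_j}\hat{\xi}_{(j,l|z)})}{\prod_{l=1}^{Y_j}\Gamma(\hat{\xi}_{(j,l|z)})}\right) - \log\left(\frac{\Gamma(Y_j\xi)}{(\Gamma(\xi))^{Y_j}}\right) + \sum_{l=1}^{Y_j}\left(\hat{\xi}_{(j,l|z)} - \xi - \bar{N}^{x}_{(j,l|z)}\right)\left(\Psi(\hat{\xi}_{(j,l|z)}) - \Psi(\textstyle\sum_{l'=1}^{Y_j}\hat{\xi}_{(j,l'|z)})\right)\Bigg\}$$
$$\quad + \sum_{n=1}^{N}\sum_{z\in\mathcal{Z}}r_z(z^{(n)} = z)\log r_z(z^{(n)} = z),$$
where
$$\bar{N}^{z}_{(k,i_k)} = \sum_{n=1}^{N}\left\langle z^{(n)}_{k,i_k}\right\rangle_{r_H(H)}, \qquad \bar{N}^{x}_{(j,l_j|z)} = \sum_{n=1}^{N}x^{(n)}_{j,l_j}\,r_z(z^{(n)} = z),$$
for
$$r_z(z^{(n)} = (e_{i_1},\ldots,e_{i_K})) = \left\langle\prod_{k=1}^{K}z^{(n)}_{k,i_k}\right\rangle_{r_H(H)}. \qquad (16.6)$$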
We assume the following condition:

Assumption 16.1 The true distribution $q(x)$ can be expressed by a Bayesian network with $H$ hidden nodes, each of which has $S_k$ states, for $H \le K$, i.e.,
$$q(x) = p(x|w^*) = \sum_{z\in\mathcal{Z}^*}p(x,z|w^*) = \sum_{z\in\mathcal{Z}^*}p(x|b_z^*)\prod_{k=1}^{H}\prod_{i=1}^{S_k}\left(a^*_{(k,i)}\right)^{z_{k,i}},$$
where
$$p(x|b_z^*) = \prod_{j=1}^{M}\prod_{l=1}^{Y_j}\left(b^*_{(j,l|z)}\right)^{x_{j,l}}$$
for $z \in \mathcal{Z}^* = \{(z_1,\ldots,z_H);\ z_k \in \{e_i\}_{i=1}^{S_k},\ k = 1,\ldots,H\}$. The true parameters $w^* = \{\{a_k^*\}_{k=1}^{H},\{b_z^*\}_{z\in\mathcal{Z}^*}\}$ are given by
$$a_k^* = \{a^*_{(k,i)};\ 1 \le i \le S_k\}\quad(k = 1,\ldots,H),$$
$$b_z^* = \{b^*_{j|z}\}_{j=1}^{M}\quad(z \in \mathcal{Z}^*),$$
$$b^*_{j|z} = \{b^*_{(j,l|z)};\ 1 \le l \le Y_j\}\quad(j = 1,\ldots,M).$$
For $k > H$, we define $S_k = 1$. The true distribution can be realized by the model, i.e., the model is given by Eq. (16.1), where $T_k \ge S_k$ holds for $k = 1,\ldots,H$. We assume that the true distribution is the smallest in the sense that it cannot be realized by any model with a smaller number of hidden units and with a smaller number of the states of each hidden unit.

Under this condition, we prove the following theorem, which evaluates the relative VB free energy. The proof will appear in the next section.
Theorem 16.1 The relative VB free energy of the Bayesian network model satisfies
$$\tilde{F}^{\mathrm{VB}}(D) = F^{\mathrm{VB}}(D) - N S_N(D) = \lambda^{\mathrm{VB}}_{\mathrm{BN}}\log N + O_p(1),$$
where
$$\lambda^{\mathrm{VB}}_{\mathrm{BN}} = \phi\sum_{k=1}^{K}T_k - \frac{K}{2} + \min_{\{u_k\}}\left\{\frac{M_{\mathrm{obs}}}{2}\prod_{k=1}^{K}u_k - \left(\phi - \frac{1}{2}\right)\sum_{k=1}^{K}u_k\right\}. \qquad (16.7)$$
The minimum is taken over the set of positive integers $\{u_k;\ S_k \le u_k \le T_k\}_{k=1}^{K}$.
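Because the minimization in Eq. (16.7) runs over a finite integer grid, the coefficient can be evaluated by exhaustive search for small networks. The sketch below is an illustration only; the function name `lambda_vb_bn` is hypothetical, and the caller is assumed to pad `S` with ones for hidden nodes $k > H$, following the convention $S_k = 1$ stated in Assumption 16.1.

```python
from itertools import product
from math import prod

def lambda_vb_bn(T, S, M_obs, phi):
    """Evaluate Eq. (16.7) by enumerating all u with S_k <= u_k <= T_k.

    T, S  : sequences of length K (S padded with 1 for k > H; an assumption
            of this sketch about how the caller encodes the true network).
    M_obs : sum_j (Y_j - 1).
    """
    K = len(T)
    grids = (range(s, t + 1) for s, t in zip(S, T))
    best = min(
        M_obs / 2 * prod(u) - (phi - 0.5) * sum(u)
        for u in product(*grids)
    )
    return phi * sum(T) - K / 2 + best
```

The cost of the enumeration grows as $\prod_k (T_k - S_k + 1)$, so this is only practical for a handful of hidden nodes; since the objective is linear in each $u_k$ separately, restricting the search to the corner points $u_k \in \{S_k, T_k\}$ would give the same minimum.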
If $K = 1$, this is reduced to the case of the naive Bayesian networks whose Bayes free energy or stochastic complexity has been evaluated (Yamazaki and Watanabe, 2003a; Rusakov and Geiger, 2005). Bounds for their VB free energy have also been obtained (Watanabe and Watanabe, 2004, 2005, 2006). The coefficient $\lambda^{\mathrm{VB}}_{\mathrm{BN}}$ is given by the solution of the minimization problem in Eq. (16.7). We present a few exemplary cases as corollaries in this section. By taking $u_k = S_k$ for $1 \le k \le H$ and $u_k = 1$ for $H+1 \le k \le K$, we obtain the following upper-bound for the VB free energy (Watanabe et al., 2006). This bound is tight if $\phi \le (1 + M_{\mathrm{obs}}\min_{1\le k\le K}\{S_k\})/2$.

Corollary 16.2 It holds that
$$\tilde{F}^{\mathrm{VB}}(D) \le \bar{\lambda}^{\mathrm{VB}}_{\mathrm{BN}}\log N + O_p(1), \qquad (16.8)$$
where
$$\bar{\lambda}^{\mathrm{VB}}_{\mathrm{BN}} = \phi\sum_{k=1}^{K}T_k - \phi K + \left(\phi - \frac{1}{2}\right)H + \left(\frac{1}{2} - \phi\right)\sum_{k=1}^{H}S_k + \frac{M_{\mathrm{obs}}}{2}\prod_{k=1}^{H}S_k. \qquad (16.9)$$
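If $K = H = 2$, that is, the true network and the model both have two hidden nodes, solving the minimization problem in Eq. (16.7) gives the following corollary. Suppose $S_1 \ge S_2$ and $T_1 \ge T_2$.

Corollary 16.3 If $K = H = 2$,
$$\tilde{F}^{\mathrm{VB}}(D) = \lambda^{\mathrm{VB}}_{\mathrm{BN}}\log N + O_p(1), \qquad (16.10)$$
where
$$\lambda^{\mathrm{VB}}_{\mathrm{BN}} = \begin{cases}(T_1 - S_1 + T_2 - S_2)\phi + \frac{M_{\mathrm{obs}}}{2}S_1S_2 + \frac{S_1+S_2}{2} - 1 & \left(0 < \phi \le \frac{1+S_2M_{\mathrm{obs}}}{2}\right),\\[4pt] (T_2 - S_2)\phi + \frac{M_{\mathrm{obs}}}{2}T_1S_2 + \frac{T_1+S_2}{2} - 1 & \left(\frac{1+S_2M_{\mathrm{obs}}}{2} < \phi \le \frac{1+T_1M_{\mathrm{obs}}}{2}\right),\\[4pt] \frac{M_{\mathrm{obs}}}{2}T_1T_2 + \frac{T_1+T_2}{2} - 1 & \left(\phi > \frac{1+T_1M_{\mathrm{obs}}}{2}\right).\end{cases}$$

… $\phi > 0$ and $\xi > 0$ are hyperparameters. Let $H = \{Z^{(1)},\ldots,Z^{(N)}\}$ be the set of hidden sequences. Under the constraint, $r(H,w) = r_H(H)r_w(w)$, the VB posteriors are given by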
$$r_w(w) = r_A(A)\,r_B(B),$$
$$r_A(A) = \prod_{k=1}^{K}\mathrm{Dirichlet}_K\left(\tilde{a}_k;(\hat{\phi}_{k,1},\ldots,\hat{\phi}_{k,K})^\top\right),$$
$$r_B(B) = \prod_{k=1}^{K}\mathrm{Dirichlet}_M\left(\tilde{b}_k;(\hat{\xi}_{k,1},\ldots,\hat{\xi}_{k,M})^\top\right),$$
$$r_H(H) = \prod_{n=1}^{N}r_Z(Z^{(n)}),$$
⎧ ⎞⎫ ⎛ T K K ⎛ K ⎟⎟⎪ ⎜⎜⎜ (n,t) (n,t−1) ⎪ ⎜⎜⎜ ⎪ ⎪ ⎨ ⎬ rZ (Z ) = exp ⎜⎜⎝ zk zl φk,l ⎟⎟⎟⎠⎪ Ψ ( φk,l ) − Ψ ⎜⎜⎝ ⎪ ⎪ ⎪ ⎩ ⎭ C Z (n)
t=2 k=1 l=1 l =1 ⎧ ⎫ ⎞ ⎛ ⎞ T M K M ⎪ ⎟⎟⎟⎪ ⎜⎜⎜ ⎪ ⎪ ⎬⎟⎟⎟⎟ (t) ⎨
⎜⎜⎝
⎟ z(n,t) x ) − Ψ ξ Ψ ( ξ + ⎟ ⎟, ⎪ ⎪ k,m k,m m ⎠ ⎪ ⎪ k ⎩ ⎭⎠
(n)
1
m =1
t=1 k=1 m=1
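where $C_{Z^{(n)}}$ is the normalizing constant. After the substitution of Eq. (15.5), the free energy is given by
$$F = \sum_{k=1}^{K}\Bigg\{\log\left(\frac{\Gamma(\sum_{l=1}^{K}\hat{\phi}_{k,l})}{\prod_{l=1}^{K}\Gamma(\hat{\phi}_{k,l})}\right) + \sum_{l=1}^{K}\left(\hat{\phi}_{k,l} - \phi\right)\left(\Psi(\hat{\phi}_{k,l}) - \Psi(\textstyle\sum_{l'=1}^{K}\hat{\phi}_{k,l'})\right)$$
$$\qquad + \log\left(\frac{\Gamma(\sum_{m=1}^{M}\hat{\xi}_{k,m})}{\prod_{m=1}^{M}\Gamma(\hat{\xi}_{k,m})}\right) + \sum_{m=1}^{M}\left(\hat{\xi}_{k,m} - \xi\right)\left(\Psi(\hat{\xi}_{k,m}) - \Psi(\textstyle\sum_{m'=1}^{M}\hat{\xi}_{k,m'})\right)\Bigg\}$$
$$\quad - K\log\left(\frac{\Gamma(K\phi)}{(\Gamma(\phi))^K}\right) - K\log\left(\frac{\Gamma(M\xi)}{(\Gamma(\xi))^M}\right) - \sum_{n=1}^{N}\log C_{Z^{(n)}}.$$
The variational parameters satisfy
$$\hat{\phi}_{k,l} = \bar{N}^{[z]}_{k,l} + \phi, \qquad \hat{\xi}_{k,m} = \bar{N}^{[x]}_{k,m} + \xi,$$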
for the expected sufficient statistics defined by [z]
N k,l =
N T
z(n,t−1) z(n,t) l k
n=1 t=2 [x]
N k,m =
N T
z(n,t) k
n=1 t=1
We assume the following condition.
rH (H)
rH (H) (n,t) xm .
,
⎫ ⎪ ⎪ ⎬ ⎪ ⎪ ⎭
16.2 Hidden Markov Models
Assumption 16.2 The true distribution q(X) has K0 hidden states and emits M-valued discrete symbols: q(X) = p(X|w∗ ) =
M Z m=1
(1)
(b∗1m ) xm
K0 K0 T
(t) (t−1) zk
(a∗kl )zl
t=2 k=1 l=1
M
(t) (t) xm
(b∗km )zk
,
m=1
(16.22)
where Z is taken over all possible values of the hidden variables. Moreover, the true parameter is defined by w∗ = ( A∗ , B∗ ) = ((a∗kl ), (b∗km )), where A∗ ∈ RK0 ×K0 and B∗ ∈ RK0 ×m . The number of hidden states K0 of the true distribution is the smallest under this parameterization (Ito et al., 1992) and all parameters {a∗kl , b∗km } are strictly positive: w∗ = ((a∗kl > 0), (b∗km > 0)) (1 ≤ k, l ≤ K0 , 1 ≤ m ≤ M). The statistical model given by Eq. (16.21) can attain the true distribution, thus the model has K (≥ K0 ) hidden states. Under this assumption, the next theorem evaluates the relative VB free N log p(X(n) |w∗ ) is the empirical entropy of the energy. Here S N (D) = − N1 n=1 true distribution (16.22). Theorem 16.4 The relative VB free energy of HMMs satisfies VB (D) = F VB (D) − NS N (D) = λ VB F HMM log N + Op (1), where λ VB HMM
⎧ ⎪ ⎪ ⎪ ⎨ =⎪ ⎪ ⎪ ⎩
K0 (K0 −1)+K0 (M−1) 2
0 < φ ≤ K0 +K+M−2 , 2K0 K +K+M−2 0 0 and ξ > 0 are hyperparameters. Here T (X) is the set of derivation Z sequences that generate X, ci→ jk is the count of the transition rule from the nonterminal symbol i to the pair of nonterminal symbols ( j, k) appearing in z(l) z(l) the derivation sequence Z, and z(l) = ( K ) is the indicator of the 1 , . . . , (nonterminal) symbol generating the lth output symbol of X. Under the constraint, r(H, w) = rH (H)rw (w), the VB posteriors are given by K K rw (w) = ra ({ai }i=1 )rb ({bi }i=1 ), K ra ({ai }i=1 )=
K i=1
DirichletK 2 ai ; ( φi→11 , . . . , φi→KK ) ,
16.3 Probabilistic Context-Free Grammar
K rb ({bi }i=1 )=
K
Dirichlet M bi ; ( ξi→1 , . . . , ξi→M ) ,
i=1
rH (H) =
N
rz (Z (n) ),
n=1
rz (Z (n) ) = γ Z (n) =
1 C Z (n) K
$ % exp γ Z (n) ,
(16.31)
, K K Z (n)
ci→ j =1 k =1 φi→ j k jk Ψ φi→ jk − Ψ
i, j,k=1
+
L M K
, M (n,l)
z(n,l) x Ψ ξ − Ψ ,
=1 ξi→m i→m m m i
l=1 i=1 m=1
where C Z (n) = Z∈T (X(n) ) exp(γ Z ) is the normalizing constant and T (X(n) ) is the set of derivation sequences that generate X(n) . After the substitution of Eq. (15.5), the free energy is given by ⎞ ⎧ ⎛ K K ⎪ φi→ jk ) ⎟⎟⎟ ⎪ ⎨ ⎜⎜⎜⎜ Γ( j,k=1 ⎟⎟⎟ F= log ⎜ ⎪ ⎪ ⎩ ⎜⎝ K Γ( φi→ jk ) ⎠ i=1 j,k=1 +
K
φi→ jk − Ψ Kj ,k =1 φi→ j k ) φi→ jk − φ Ψ j,k=1
⎛ M ⎞ M ⎜⎜⎜ Γ m=1 ξi→m ⎟⎟⎟ ⎟⎟⎟ +
+ log ⎜⎜⎜⎝ ξi→m − Ψ mM =1 ξi→m ξi→m − ξ Ψ ⎠ M
Γ ξ i→m m=1 m=1 N Γ(Mξ) Γ(K 2 φ) log C Z (n) . − K log − − K log (Γ(ξ)) M (Γ(φ))K 2 n=1 The variational parameters satisfy z
φi→ jk = N i→ jk + φ, x
ξi→m = N i→m + ξ,
where z
N i→ jk =
N L
(n)
Z ci→ jk
n=1 l=1 x
N i→m =
N L z(n,l) i n=1 l=1
We assume the following condition.
rz (Z (n) )
rz (Z (n) )
,
(n,l) xm .
⎫ ⎪ ⎪ ⎪ ⎬ ⎪ ⎪ ⎪ ⎭
Assumption 16.4 The true distribution $q(X)$ has $K_0$ nonterminal symbols and $M$ terminal symbols with parameter $w^*$:
$$q(X) = p(X|w^*) = \sum_{Z\in T(X)}p(X,Z|w^*). \qquad (16.32)$$
The true parameters are
$$w^* = \{\{a_i^*\}_{i=1}^{K_0},\{b_i^*\}_{i=1}^{K_0}\},$$
$$a_i^* = \{a^*_{i\to jk}\}_{j,k=1}^{K_0}\quad(1 \le i \le K_0),$$
$$b_i^* = \{b^*_{i\to m}\}_{m=1}^{M}\quad(1 \le i \le K_0),$$
which satisfy the constraints
$$a^*_{i\to ii} = 1 - \sum_{(j,k)\ne(i,i)}a^*_{i\to jk}, \qquad b^*_{i\to M} = 1 - \sum_{m=1}^{M-1}b^*_{i\to m},$$
respectively. Since PCFG has nontrivial nonidentifiability as in HMM (Ito et al., 1992), we assume that $K_0$ is the smallest number of nonterminal symbols under this parameterization. The statistical model given by Eq. (16.30) includes the true distribution, namely, the number of nonterminal symbols $K$ satisfies the inequality $K_0 \le K$.

Under this assumption, the next theorem evaluates the relative VB free energy. Here $S_N(D) = -\frac{1}{N}\sum_{n=1}^{N}\log p(X^{(n)}|w^*)$ is the empirical entropy of the true distribution (16.32).

Theorem 16.6 The relative VB free energy of the PCFG model satisfies
$$\tilde{F}^{\mathrm{VB}}(D) = F^{\mathrm{VB}}(D) - N S_N(D) = \lambda^{\mathrm{VB}}_{\mathrm{PCFG}}\log N + O_p(1),$$
where $\lambda^{\mathrm{VB}}_{\mathrm{PCFG}}$
⎧ ⎪ ⎪ ⎪ ⎪ ⎨ =⎪ ⎪ ⎪ ⎪ ⎩
K0 (K02 −1)+K0 (M−1) 2
+ K0 (K 2 − K02 )φ
K(K −1)+K(M−1) 2 2
! !
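0 0, respectively.

Under the constraint, $r(w,H) = r_{\Theta,B}(\Theta,B)\,r_z\left(\{\{z^{(n,m)}\}_{n=1}^{N^{(m)}}\}_{m=1}^{M}\right)$, the VB posteriors are given by
$$r_z\left(\{\{z^{(n,m)}\}_{n=1}^{N^{(m)}}\}_{m=1}^{M}\right) = \prod_{m=1}^{M}\prod_{n=1}^{N^{(m)}}\mathrm{Multinomial}_{H,1}\left(z^{(n,m)};\hat{z}^{(n,m)}\right),$$
$$r_{\Theta,B}(\Theta,B) = r_\Theta(\Theta)\,r_B(B),$$
$$r_\Theta(\Theta) = \prod_{m=1}^{M}\mathrm{Dirichlet}\left(\tilde{\theta}_m;\hat{\tilde{\alpha}}_m\right),$$
$$r_B(B) = \prod_{h=1}^{H}\mathrm{Dirichlet}\left(\beta_h;\hat{\eta}_h\right).$$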
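The free energy is given by
$$F = \sum_{m=1}^{M}\left(\log\left(\frac{\Gamma(\sum_{h=1}^{H}\hat{\alpha}_{m,h})}{\prod_{h=1}^{H}\Gamma(\hat{\alpha}_{m,h})}\right) - \log\left(\frac{\Gamma(H\alpha)}{\Gamma(\alpha)^H}\right)\right) + \sum_{h=1}^{H}\left(\log\left(\frac{\Gamma(\sum_{l=1}^{L}\hat{\eta}_{l,h})}{\prod_{l=1}^{L}\Gamma(\hat{\eta}_{l,h})}\right) - \log\left(\frac{\Gamma(L\eta)}{\Gamma(\eta)^L}\right)\right)$$
$$\quad + \sum_{m=1}^{M}\sum_{h=1}^{H}\left(\hat{\alpha}_{m,h} - (\bar{N}_h^{(m)} + \alpha)\right)\left(\Psi(\hat{\alpha}_{m,h}) - \Psi(\textstyle\sum_{h'=1}^{H}\hat{\alpha}_{m,h'})\right)$$
$$\quad + \sum_{h=1}^{H}\sum_{l=1}^{L}\left(\hat{\eta}_{l,h} - (\bar{W}_{l,h} + \eta)\right)\left(\Psi(\hat{\eta}_{l,h}) - \Psi(\textstyle\sum_{l'=1}^{L}\hat{\eta}_{l',h})\right)$$
$$\quad + \sum_{m=1}^{M}\sum_{n=1}^{N^{(m)}}\sum_{h=1}^{H}\hat{z}_h^{(n,m)}\log\hat{z}_h^{(n,m)},$$
where
$$\bar{N}_h^{(m)} = \sum_{n=1}^{N^{(m)}}\left\langle z_h^{(n,m)}\right\rangle_{r_z}, \qquad \bar{W}_{l,h} = \sum_{m=1}^{M}\sum_{n=1}^{N^{(m)}}w_l^{(n,m)}\left\langle z_h^{(n,m)}\right\rangle_{r_z},$$
with the expectations taken over $r_z(\{\{z^{(n,m)}\}_{n=1}^{N^{(m)}}\}_{m=1}^{M})$, for the observed data $D = \{\{w^{(n,m)}\}_{n=1}^{N^{(m)}}\}_{m=1}^{M}$. The variational parameters satisfy
$$\hat{\alpha}_{m,h} = \bar{N}_h^{(m)} + \alpha, \qquad (16.36)$$
$$\hat{\eta}_{l,h} = \bar{W}_{l,h} + \eta, \qquad (16.37)$$
$$\hat{z}_h^{(n,m)} = \frac{\bar{z}_h^{(n,m)}}{\sum_{h'=1}^{H}\bar{z}_{h'}^{(n,m)}}$$
for
$$\bar{z}_h^{(n,m)} = \exp\Bigg(\Psi(\hat{\alpha}_{m,h}) - \Psi(\textstyle\sum_{h'=1}^{H}\hat{\alpha}_{m,h'}) + \sum_{l=1}^{L}w_l^{(n,m)}\left(\Psi(\hat{\eta}_{l,h}) - \Psi(\textstyle\sum_{l'=1}^{L}\hat{\eta}_{l',h})\right)\Bigg).$$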
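Based on Lemma 15.1, we decompose the free energy as follows:
$$F = R + Q, \qquad (16.38)$$
where
$$R = \mathrm{KL}\left(r_\Theta(\Theta)r_B(B)\,\|\,p(\Theta|\alpha)p(B|\eta)\right)$$
$$\quad = \sum_{m=1}^{M}\left(\log\frac{\Gamma(\sum_{h=1}^{H}\hat{\alpha}_{m,h})\,\Gamma(\alpha)^H}{\prod_{h=1}^{H}\Gamma(\hat{\alpha}_{m,h})\,\Gamma(H\alpha)} + \sum_{h=1}^{H}\left(\hat{\alpha}_{m,h} - \alpha\right)\left(\Psi(\hat{\alpha}_{m,h}) - \Psi(\textstyle\sum_{h'=1}^{H}\hat{\alpha}_{m,h'})\right)\right)$$
$$\quad + \sum_{h=1}^{H}\left(\log\frac{\Gamma(\sum_{l=1}^{L}\hat{\eta}_{l,h})\,\Gamma(\eta)^L}{\prod_{l=1}^{L}\Gamma(\hat{\eta}_{l,h})\,\Gamma(L\eta)} + \sum_{l=1}^{L}\left(\hat{\eta}_{l,h} - \eta\right)\left(\Psi(\hat{\eta}_{l,h}) - \Psi(\textstyle\sum_{l'=1}^{L}\hat{\eta}_{l',h})\right)\right), \qquad (16.39)$$
$$Q = -\log C_H = -\sum_{m=1}^{M}N^{(m)}\sum_{l=1}^{L}V_{l,m}\log\left(\sum_{h=1}^{H}\frac{\exp\left(\Psi(\hat{\eta}_{l,h})\right)\exp\left(\Psi(\hat{\alpha}_{m,h})\right)}{\exp\left(\Psi(\sum_{l'=1}^{L}\hat{\eta}_{l',h})\right)\exp\left(\Psi(\sum_{h'=1}^{H}\hat{\alpha}_{m,h'})\right)}\right). \qquad (16.40)$$
Here, $V \in \mathbb{R}^{L\times M}$ is the empirical word distribution matrix with its entries given by $V_{l,m} = \frac{1}{N^{(m)}}\sum_{n=1}^{N^{(m)}}w_l^{(n,m)}$.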
16.4.1 Asymptotic Analysis of VB Learning

Here we analyze the VB free energy of LDA in the asymptotic limit when $N \equiv \min_m N^{(m)} \to \infty$ (Nakajima et al., 2014). Unlike the analyses for the latent variable models in the previous sections, we do not assume $L, M \ll N$, but $1 \ll L, M, N$ at this point. This amounts to considering the asymptotic limit when $L, M, N \to \infty$ with a fixed mutual ratio, or equivalently, assuming $L, M \sim O(N)$. We assume the following condition on the true distribution.

Assumption 16.5 The word distribution matrix $V$ is a sample from the multinomial distribution with the true parameter $U^* \in \mathbb{R}^{L\times M}$ whose rank is $H^* \sim O(1)$, i.e., $U^* = B^*\Theta^{*\top}$ where $\Theta^* \in \mathbb{R}^{M\times H^*}$ and $B^* \in \mathbb{R}^{L\times H^*}$.² The number of topics of the model $H$ is set to $H = \min(L, M)$ (i.e., the matrix $B\Theta^\top$ can express any multinomial distribution).

The stationary conditions, Eqs. (16.36) and (16.37), lead to the following lemma:
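Lemma 16.7 Let $\widehat{B\Theta^\top} = \langle B\Theta^\top\rangle_{r_{\Theta,B}(\Theta,B)}$. Then it holds that
$$\left\langle\left(B\Theta^\top - \widehat{B\Theta^\top}\right)^2_{l,m}\right\rangle_{r_{\Theta,B}(\Theta,B)} = O_p(N^{-2}), \qquad (16.41)$$
$$Q = -\sum_{m=1}^{M}N^{(m)}\sum_{l=1}^{L}V_{l,m}\log\left(\widehat{B\Theta^\top}\right)_{l,m} + O_p(M). \qquad (16.42)$$

Proof For the Dirichlet distribution $p(a|\breve{a}) \propto \prod_{h=1}^{H}a_h^{\breve{a}_h - 1}$, the mean and the variance are given as follows:
$$\hat{a}_h = \langle a_h\rangle_{p(a|\breve{a})} = \frac{\breve{a}_h}{\breve{a}_0}, \qquad \left\langle(a_h - \hat{a}_h)^2\right\rangle_{p(a|\breve{a})} = \frac{\breve{a}_h(\breve{a}_0 - \breve{a}_h)}{\breve{a}_0^2(\breve{a}_0 + 1)},$$
where $\breve{a}_0 = \sum_{h=1}^{H}\breve{a}_h$.

² More precisely, $U^* = B^*\Theta^{*\top} + O(N^{-1})$ is sufficient.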
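The Dirichlet mean and variance formulas quoted in the proof can be sanity-checked against simulation. The following sketch uses only the Python standard library; the function names are hypothetical, and the sampler relies on the standard construction of a Dirichlet draw from normalized independent Gamma variates.

```python
import random

def dirichlet_mean_var(a_breve):
    """Closed-form mean and variance of each coordinate of Dirichlet(a_breve)."""
    a0 = sum(a_breve)
    mean = [a / a0 for a in a_breve]
    var = [a * (a0 - a) / (a0 ** 2 * (a0 + 1)) for a in a_breve]
    return mean, var

def sample_dirichlet(a_breve, rng):
    """One Dirichlet draw: normalize independent Gamma(a_h, 1) variates."""
    g = [rng.gammavariate(a, 1.0) for a in a_breve]
    total = sum(g)
    return [x / total for x in g]

def empirical_mean_var(a_breve, n_samples=20000, seed=0):
    """Monte Carlo estimates of the same two moments."""
    rng = random.Random(seed)
    draws = [sample_dirichlet(a_breve, rng) for _ in range(n_samples)]
    H = len(a_breve)
    mean = [sum(d[h] for d in draws) / n_samples for h in range(H)]
    var = [sum((d[h] - mean[h]) ** 2 for d in draws) / n_samples for h in range(H)]
    return mean, var
```

Comparing `dirichlet_mean_var([2, 3, 5])` with `empirical_mean_var([2, 3, 5])` should show agreement up to Monte Carlo error; the closed form also makes visible how the variance shrinks as the total concentration $\breve{a}_0$ grows.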
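For fixed $N$, $R$, defined by Eq. (16.39), diverges to $+\infty$ if $\hat{\alpha}_{m,h} \to +0$ for any $(m,h)$ or $\hat{\eta}_{l,h} \to +0$ for any $(l,h)$. Therefore, the global minimizer of the free energy (16.38) is in the interior of the domain, where the free energy is differentiable. Consequently, the global minimizer is a stationary point. The stationary conditions (16.36) and (16.37) imply that
$$\hat{\alpha}_{m,h} \ge \alpha, \qquad \hat{\eta}_{l,h} \ge \eta,$$
$$\sum_{h=1}^{H}\hat{\alpha}_{m,h} = \sum_{h=1}^{H}\alpha + N^{(m)}, \qquad (16.43)$$
$$\sum_{l=1}^{L}\hat{\eta}_{l,h} = \sum_{l=1}^{L}\eta + \sum_{m=1}^{M}\left(\hat{\alpha}_{m,h} - \alpha\right). \qquad (16.44)$$
Therefore, we have
$$\left\langle(\Theta_{m,h} - \hat{\Theta}_{m,h})^2\right\rangle_{r_\Theta(\Theta)} = O_p(N^{-2})\quad\text{for all } (m,h), \qquad (16.45)$$
$$\left(\max_m\hat{\Theta}_{m,h}\right)^2\left\langle(B_{l,h} - \hat{B}_{l,h})^2\right\rangle_{r_B(B)} = O_p(N^{-2})\quad\text{for all } (l,h), \qquad (16.46)$$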
which leads to Eq. (16.41). By using Eq. (15.8), Q is bounded as follows: Q ≤ Q ≤ Q, where ⎛ ⎜⎜⎜ H ⎜ Q=− N (m) Vl,m log ⎜⎜⎜⎜⎜ ⎝ h=1 m=1 l=1 ⎛ ⎜⎜⎜ M H L ⎜ N (m) Vl,m log ⎜⎜⎜⎜⎜ Q=− ⎝ h=1 m=1 l=1 M
L
⎞ ⎟⎟⎟ ⎟⎟ ⎞ ⎞⎟ ⎟⎠ , ⎟⎟⎟ ⎟⎟⎟ ⎟ ⎟⎟⎠ ⎟⎟⎠ ⎟ 2 αm,h 2 L ηl ,h h =1 l =1 ⎞ ⎟⎟⎟ 1 1 exp − 2 exp − 2 αm,h ηl,h ⎟⎟ ⎛ ⎛ ⎞ ⎞⎟ ⎟⎠ . ⎜⎜ ⎜ ⎟⎟⎟ ⎟⎟⎟ ⎟ ⎟⎟⎠ exp⎜⎜⎜⎜⎝− 1 ⎟⎟⎠ ⎟ exp⎜⎜⎜⎝− H 1 L
1 exp − αm,h
ηl,h αm,h ⎛ H ⎜⎜
α lL =1 ηl ,h h =1 m,h exp⎜⎜⎜⎝− H 1
H
αm,h ηl,h
αm,h lL =1 ηl ,h
h =1
α h =1 m,h
1 exp − ηl,h ⎛ ⎜⎜ exp⎜⎜⎜⎝− 1
η l =1 l ,h
Using Eqs. (16.45) and (16.46), we have Eq. (16.42), which completes the proof of Lemma 16.7.

Eq. (16.41) implies the convergence of the posterior. Let $u_m^* = B^*(\tilde{\theta}_m^*)^\top$ be the true probability mass function for the $m$th document and $\hat{u}_m = \hat{B}(\hat{\tilde{\theta}}_m)^\top$ be its predictive probability. Define a measure of how far the predictive distributions are from the true distributions by
$$\hat{J} = \sum_{m=1}^{M}\frac{N^{(m)}}{N}\mathrm{KL}(u_m^*\|\hat{u}_m). \qquad (16.47)$$
Then, by the same arguments as the proof of Lemma 15.4, Eq. (16.42) leads to the following lemma:

Lemma 16.8 $Q$ is minimized when $\hat{J} = O_p(1/N)$, and it holds that
$$Q = N S_N(D) + O_p(\hat{J}N + LM),$$
where
$$S_N(D) = -\frac{1}{N}\log p(D|\Theta^*,B^*) = -\sum_{m=1}^{M}\frac{N^{(m)}}{N}\sum_{l=1}^{L}V_{l,m}\log(B^*\Theta^{*\top})_{l,m}.$$

Lemma 16.8 simply states that $Q/N$ converges to the empirical entropy $S_N(D)$ of the true distribution if and only if the predictive distribution converges to the true distribution (i.e., $\hat{J} = O_p(1/N)$).

Let $\hat{H} = \sum_{h=1}^{H}\theta\left(\frac{1}{M}\sum_{m=1}^{M}\hat{\Theta}_{m,h} \sim O_p(1)\right)$ be the number of topics used in the whole corpus, $\hat{M}^{(h)} = \sum_{m=1}^{M}\theta(\hat{\Theta}_{m,h} \sim O_p(1))$ be the number of documents that contain the $h$th topic, and $\hat{L}^{(h)} = \sum_{l=1}^{L}\theta(\hat{B}_{l,h} \sim O_p(1))$ be the number of words of which the $h$th topic consists. We have the following lemma:

Lemma 16.9 $R$ is written as follows:
$$R = \left\{M\left(H\alpha - \frac{1}{2}\right) + \hat{H}\left(L\eta - \frac{1}{2}\right) - \sum_{h=1}^{\hat{H}}\left(\hat{M}^{(h)}\left(\alpha - \frac{1}{2}\right) + \hat{L}^{(h)}\left(\eta - \frac{1}{2}\right)\right)\right\}\log N + (H - \hat{H})\left(L\eta - \frac{1}{2}\right)\log L + O_p(H(M+L)). \qquad (16.48)$$

Proof By using the bounds (15.8) and (15.9), $R$ can be bounded as
$$\underline{R} \le R \le \overline{R}, \qquad (16.49)$$
where H Γ(Hα) Γ(Lη) L M(H − 1) + H(L − 1) log(2π) log − − H Γ(α) Γ(η) 2 m=1 h=1 ⎧ ⎫ M ⎪ H H ⎪ ⎪ ⎪ 1 1 ⎨ ⎬
+ α − α − log log α Hα − ⎪ m,h m,h ⎪ ⎪ ⎪ ⎩ ⎭ 2 2 m=1 h=1 h=1 ⎧ ⎫ H L L ⎪ ⎪ ⎪ ⎪ 1 1 ⎨ ⎬
ηl,h − η− log ηl,h ⎪ + ⎪ ⎪ Lη − 2 log ⎪ ⎩ ⎭ 2 h=1 l=1 l=1 ⎧ ⎫ M ⎪ H H ⎪ ⎪ ⎪ % 1 $ 1 1 ⎨ ⎬
− − H αm,h − α + ⎪ ⎪ ⎪− ⎪ ⎩
α 12 α αm,h ⎭ 2 h =1 m,h m,h m=1 h=1 h=1 ⎧ ⎫ H ⎪ L L ⎪ ⎪ ⎪ % 1 $ 1 1 ⎨ ⎬
− − L ηl,h − η + − , (16.50) ⎪ ⎪ ⎪ ⎩ ⎭
ηl,h 2 l =1 12 ηl,h ηl ,h ⎪
R=−
M
h=1
log
l=1
l=1
H Γ(Hα) Γ(Lη) L M(H − 1) + H(L − 1) log(2π) R=− log log − − H Γ(α) Γ(η) 2 m=1 h=1 ⎧ ⎫ M ⎪ H H ⎪ ⎪ ⎪ 1 1 ⎨ ⎬
+ α − α − log log αm,h ⎪ Hα − ⎪ m,h ⎪ ⎪ ⎩ ⎭ 2 2 m=1 h=1 h=1 ⎧ ⎫ H ⎪ L L ⎪ ⎪ ⎪ 1 1 ⎨ ⎬
η − η − log log ηl,h ⎪ + Lη − ⎪ l,h ⎪ ⎪ ⎩ ⎭ 2 2 h=1 l=1 l=1 ⎧ ⎫ M ⎪ H ⎪ ⎪ ⎪ % 1 $ 1 1 ⎨ ⎬
− − H αm,h − α + ⎪ ⎪ ⎪ ⎪ H ⎩ 12 h=1 ⎭ 2 α
α α m,h m,h h =1 m,h m=1 h=1 ⎧ ⎫ H ⎪ L ⎪ ⎪ ⎪ % 1 $ 1 1 ⎨ ⎬
− − L ηl,h − η + (16.51) ⎪ ⎪ ⎪ ⎪. L ⎩ 12 l=1 2 ηl,h
,h ⎭
ηl,h η
l l =1 M
h=1
l=1
Eqs. (16.43) and (16.44) imply that ⎧ ⎫ M ⎪ H H ⎪ ⎪ ⎪ 1 1 ⎨ ⎬
αm,h − α− log log αm,h ⎪ R= Hα − ⎪ ⎪ ⎪ ⎩ ⎭ 2 2 m=1 h=1 h=1 ⎧ ⎫ H ⎪ L L ⎪ ⎪ ⎪ 1 1 ⎨ ⎬
ηl,h − η− log log ηl,h ⎪ + Lη − + Op (H(M + L)), ⎪ ⎪ ⎪ ⎩ ⎭ 2 2 h=1 l=1 l=1
which leads to Eq. (16.48). This completes the proof of Lemma 16.9.

Since we assumed that the true matrices $\Theta^*$ and $B^*$ are of the rank of $H^*$, $\hat{H} = H^* \sim O(1)$ is sufficient for the VB posterior to converge to the true distribution. However, $\hat{H}$ can be much larger than $H^*$ with $\langle B\Theta^\top\rangle_{r_{\Theta,B}(\Theta,B)}$ unchanged because of the nonidentifiability of matrix factorization: duplicating topics with divided weights, for example, does not change the distribution. Let
$$\tilde{F}^{\mathrm{VB}}(D) = F^{\mathrm{VB}}(D) - N S_N(D) \qquad (16.52)$$
be the relative free energy. Based on Lemmas 16.8 and 16.9, we obtain the following theorem:

Theorem 16.10 In the limit when $N \to \infty$ with $L, M \sim O(1)$, it holds that $\hat{J} = O_p(1/N)$, and
$$\tilde{F}^{\mathrm{VB}}(D) = \lambda^{\mathrm{VB}}_{\mathrm{LDA}}\log N + O_p(1),$$
where
$$\lambda^{\mathrm{VB}}_{\mathrm{LDA}} = M\left(H\alpha - \frac{1}{2}\right) + \hat{H}\left(L\eta - \frac{1}{2}\right) - \sum_{h=1}^{\hat{H}}\left(\hat{M}^{(h)}\left(\alpha - \frac{1}{2}\right) + \hat{L}^{(h)}\left(\eta - \frac{1}{2}\right)\right).$$
In the limit when $N, M \to \infty$ with $\frac{M}{N}, L \sim O(1)$, it holds that $\hat{J} = o_p(\log N)$, and
$$\tilde{F}^{\mathrm{VB}}(D) = \lambda^{\mathrm{VB}}_{\mathrm{LDA}}\log N + o_p(N\log N),$$
where
$$\lambda^{\mathrm{VB}}_{\mathrm{LDA}} = M\left(H\alpha - \frac{1}{2}\right) - \sum_{h=1}^{\hat{H}}\hat{M}^{(h)}\left(\alpha - \frac{1}{2}\right).$$
In the limit when $N, L \to \infty$ with $\frac{L}{N}, M \sim O(1)$, it holds that $\hat{J} = o_p(\log N)$, and
$$\tilde{F}^{\mathrm{VB}}(D) = \lambda^{\mathrm{VB}}_{\mathrm{LDA}}\log N + o_p(N\log N),$$
where $\lambda^{\mathrm{VB}}_{\mathrm{LDA}} = HL\eta$.
In the limit when $N, L, M \to \infty$ with $\frac{L}{N}, \frac{M}{N} \sim O(1)$, it holds that $\hat{J} = o_p(N\log N)$, and
$$\tilde{F}^{\mathrm{VB}}(D) = \lambda^{\mathrm{VB}}_{\mathrm{LDA}}\log N + o_p(N^2\log N),$$
where $\lambda^{\mathrm{VB}}_{\mathrm{LDA}} = H(M\alpha + L\eta)$.

Proof Lemmas 16.8 and 16.9 imply that the relative free energy can be written as follows:
$$\tilde{F}^{\mathrm{VB}}(D) = \left\{M\left(H\alpha - \frac{1}{2}\right) + \hat{H}\left(L\eta - \frac{1}{2}\right) - \sum_{h=1}^{\hat{H}}\left(\hat{M}^{(h)}\left(\alpha - \frac{1}{2}\right) + \hat{L}^{(h)}\left(\eta - \frac{1}{2}\right)\right)\right\}\log N + (H - \hat{H})\left(L\eta - \frac{1}{2}\right)\log L + O_p(\hat{J}N + LM). \qquad (16.53)$$
In the following, we investigate the leading term of the relative free energy (16.53) in different asymptotic limits.

In the Limit When $N \to \infty$ with $L, M \sim O(1)$
In this case, the minimizer should satisfy
$$\hat{J} = O_p\left(\frac{1}{N}\right) \qquad (16.54)$$
and the leading term of the relative free energy (16.52) is of the order of $O_p(\log N)$ as follows:
$$\tilde{F}^{\mathrm{VB}}(D) = \left\{M\left(H\alpha - \frac{1}{2}\right) + \hat{H}\left(L\eta - \frac{1}{2}\right) - \sum_{h=1}^{\hat{H}}\left(\hat{M}^{(h)}\left(\alpha - \frac{1}{2}\right) + \hat{L}^{(h)}\left(\eta - \frac{1}{2}\right)\right)\right\}\log N + O_p(1).$$
Note that Eq. (16.54) implies the consistency of the VB posterior.

In the Limit When $N, M \to \infty$ with $\frac{M}{N}, L \sim O(1)$
In this case,
$$\hat{J} = o_p(\log N), \qquad (16.55)$$
making the leading term of the relative free energy of the order of $O_p(N\log N)$ as follows:
$$\tilde{F}^{\mathrm{VB}}(D) = \left\{M\left(H\alpha - \frac{1}{2}\right) - \sum_{h=1}^{\hat{H}}\hat{M}^{(h)}\left(\alpha - \frac{1}{2}\right)\right\}\log N + o_p(N\log N).$$
Eq. (16.55) implies that the VB posterior is not necessarily consistent.

In the Limit When $N, L \to \infty$ with $\frac{L}{N}, M \sim O(1)$
In this case, Eq. (16.55) holds, and the leading term of the relative free energy is of the order of $O_p(N\log N)$ as follows:
$$\tilde{F}^{\mathrm{VB}}(D) = HL\eta\log N + o_p(N\log N).$$

In the Limit When $N, L, M \to \infty$ with $\frac{L}{N}, \frac{M}{N} \sim O(1)$
In this case,
$$\hat{J} = o_p(N\log N), \qquad (16.56)$$
and the leading term of the relative free energy is of the order of $O_p(N^2\log N)$ as follows:
$$\tilde{F}^{\mathrm{VB}}(D) = H(M\alpha + L\eta)\log N + o_p(N^2\log N).$$
This completes the proof of Theorem 16.10.
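The four leading coefficients in Theorem 16.10 are simple closed-form expressions, so they can be compared side by side for a given hyperparameter setting. The sketch below is illustrative only: the function names are hypothetical, and the per-topic counts $\hat{M}^{(h)}$ and $\hat{L}^{(h)}$ are passed in as plain lists whose common length plays the role of $\hat{H}$.

```python
def lambda_lda_fixed_LM(H, L, M, alpha, eta, M_hat, L_hat):
    """Coefficient for N -> infinity with L, M ~ O(1).

    M_hat[h], L_hat[h]: numbers of documents and words attached to
    active topic h (their common length is taken as H_hat).
    """
    H_hat = len(M_hat)
    penalty = sum(m * (alpha - 0.5) + l * (eta - 0.5)
                  for m, l in zip(M_hat, L_hat))
    return M * (H * alpha - 0.5) + H_hat * (L * eta - 0.5) - penalty

def lambda_lda_large_M(H, M, alpha, M_hat):
    """Coefficient for N, M -> infinity with M/N, L ~ O(1)."""
    return M * (H * alpha - 0.5) - sum(m * (alpha - 0.5) for m in M_hat)

def lambda_lda_large_L(H, L, eta):
    """Coefficient for N, L -> infinity with L/N, M ~ O(1)."""
    return H * L * eta

def lambda_lda_large_LM(H, L, M, alpha, eta):
    """Coefficient for N, L, M -> infinity with L/N, M/N ~ O(1)."""
    return H * (M * alpha + L * eta)
```

Only the first two coefficients depend on the active-topic counts, which is why the sparsity analysis that follows concentrates on those two limits.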
Since Eq. (16.41) was shown to hold, the predictive distribution converges to the true distribution if J = Op (1/N). Accordingly, Theorem 16.10 states that the consistency holds in the limit when N → ∞ with L, M ∼ O(1).
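Theorem 16.10 also implies that, in the asymptotic limits with small $L \sim O(1)$, the leading term depends on $\hat{H}$, meaning that it dominates the topic sparsity of the VB solution. We have the following corollary:

Corollary 16.11 Let $M^{*(h)} = \sum_{m=1}^{M}\theta(\Theta^*_{m,h} \sim O(1))$ and $L^{*(h)} = \sum_{l=1}^{L}\theta(B^*_{l,h} \sim O(1))$. Consider the limit when $N \to \infty$ with $L, M \sim O(1)$. When $0 < \eta \le \frac{1}{2L}$, the VB solution is sparse (i.e., $\hat{H} \ll H = \min(L, M)$) if $\alpha < \frac{1}{2} - \frac{\frac{1}{2} - L\eta}{\min_h M^{*(h)}}$, and dense (i.e., $\hat{H} \approx H$) if $\alpha > \frac{1}{2} - \frac{\frac{1}{2} - L\eta}{\min_h M^{*(h)}}$. When $\frac{1}{2L} < \eta \le \frac{1}{2}$, the VB solution is sparse if $\alpha < \frac{1}{2} + \frac{L\eta - \frac{1}{2}}{\max_h M^{*(h)}}$, and dense if $\alpha > \frac{1}{2} + \frac{L\eta - \frac{1}{2}}{\max_h M^{*(h)}}$. When $\eta > \frac{1}{2}$, the VB solution is sparse if $\alpha < \frac{1}{2} + \frac{L-1}{2\max_h M^{*(h)}}$, and dense if $\alpha > \frac{1}{2} + \frac{L\eta - \frac{1}{2}}{\min_h M^{*(h)}}$. In the limit when $N, M \to \infty$ with $\frac{M}{N}, L \sim O(1)$, the VB solution is sparse if $\alpha < \frac{1}{2}$, and dense if $\alpha > \frac{1}{2}$.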
Proof From the compact representation when $\hat{H} = H^*$, $\hat{M}^{(h)} = M^{*(h)}$, and $\hat{L}^{(h)} = L^{*(h)}$, we can decompose a singular component into two, keeping $B\Theta^\top$ unchanged, so that
$$\hat{H} \to \hat{H} + 1, \qquad (16.57)$$
$$\sum_{h=1}^{\hat{H}}\hat{M}^{(h)} \to \sum_{h=1}^{\hat{H}}\hat{M}^{(h)} + \Delta M\quad\text{for}\quad\min_h M^{*(h)} \le \Delta M \le \max_h M^{*(h)}, \qquad (16.58)$$
$$\sum_{h=1}^{\hat{H}}\hat{L}^{(h)} \to \sum_{h=1}^{\hat{H}}\hat{L}^{(h)} + \Delta L\quad\text{for}\quad 0 \le \Delta L \le \max_h L^{*(h)}. \qquad (16.59)$$
Here the lower-bound for $\Delta M$ in Eq. (16.58) corresponds to the case that the least frequent topic is chosen to be decomposed, while the upper-bound corresponds to the case that the most frequent topic is chosen. The lower-bound for $\Delta L$ in Eq. (16.59) corresponds to the decomposition such that some of the word-occurrences are moved to a new topic, while the upper-bound corresponds to the decomposition such that the topic with the widest vocabulary is copied to a new topic. Note that the bounds both for $\Delta M$ and $\Delta L$ are not always achievable simultaneously, when we choose one topic to decompose.

In the following, we investigate the relation between the sparsity of the solution and the hyperparameter setting in different asymptotic limits.

In the Limit When $N \to \infty$ with $L, M \sim O(1)$
The coefficient of the leading term of the free energy is
$$\lambda^{\mathrm{VB}}_{\mathrm{LDA}} = M\left(H\alpha - \frac{1}{2}\right) + \hat{H}\left(L\eta - \frac{1}{2}\right) - \sum_{h=1}^{\hat{H}}\left(\hat{M}^{(h)}\left(\alpha - \frac{1}{2}\right) + \hat{L}^{(h)}\left(\eta - \frac{1}{2}\right)\right). \qquad (16.60)$$
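Note that the solution is sparse if Eq. (16.60) is increasing with respect to $\hat{H}$, and dense if it is decreasing. Eqs. (16.57) through (16.59) imply the following:

(I) When $0 < \eta \le \frac{1}{2L}$ and $\alpha \le \frac{1}{2}$, the solution is sparse if
$$L\eta - \frac{1}{2} - \min_h M^{*(h)}\left(\alpha - \frac{1}{2}\right) > 0,\quad\text{or equivalently,}\quad\alpha < \frac{1}{2} - \frac{1}{\min_h M^{*(h)}}\left(\frac{1}{2} - L\eta\right),$$
and dense if
$$\alpha > \frac{1}{2} - \frac{1}{\min_h M^{*(h)}}\left(\frac{1}{2} - L\eta\right).$$

(II) When $0 < \eta \le \frac{1}{2L}$ and $\alpha > \frac{1}{2}$, the solution is sparse if
$$L\eta - \frac{1}{2} - \max_h M^{*(h)}\left(\alpha - \frac{1}{2}\right) > 0,\quad\text{or equivalently,}\quad\alpha < \frac{1}{2} - \frac{1}{\max_h M^{*(h)}}\left(\frac{1}{2} - L\eta\right),$$
and dense if
$$\alpha > \frac{1}{2} - \frac{1}{\max_h M^{*(h)}}\left(\frac{1}{2} - L\eta\right).$$
Therefore, the solution is always dense in this case.

(III) When $\frac{1}{2L} < \eta \le \frac{1}{2}$ and $\alpha < \frac{1}{2}$, the solution is sparse if
$$L\eta - \frac{1}{2} - \min_h M^{*(h)}\left(\alpha - \frac{1}{2}\right) > 0,\quad\text{or equivalently,}\quad\alpha < \frac{1}{2} + \frac{1}{\min_h M^{*(h)}}\left(L\eta - \frac{1}{2}\right),$$
and dense if
$$\alpha > \frac{1}{2} + \frac{1}{\min_h M^{*(h)}}\left(L\eta - \frac{1}{2}\right).$$
Therefore, the solution is always sparse in this case.

(IV) When $\frac{1}{2L} < \eta \le \frac{1}{2}$ and $\alpha \ge \frac{1}{2}$, the solution is sparse if
$$L\eta - \frac{1}{2} - \max_h M^{*(h)}\left(\alpha - \frac{1}{2}\right) > 0,\quad\text{or equivalently,}\quad\alpha < \frac{1}{2} + \frac{1}{\max_h M^{*(h)}}\left(L\eta - \frac{1}{2}\right),$$
and dense if
$$\alpha > \frac{1}{2} + \frac{1}{\max_h M^{*(h)}}\left(L\eta - \frac{1}{2}\right).$$

(V) When $\eta > \frac{1}{2}$ and $\alpha < \frac{1}{2}$, the solution is sparse if
$$L\eta - \frac{1}{2} - \max_h\left\{M^{*(h)}\left(\alpha - \frac{1}{2}\right) + L^{*(h)}\left(\eta - \frac{1}{2}\right)\right\} > 0, \qquad (16.61)$$
and dense if
$$L\eta - \frac{1}{2} - \max_h\left\{M^{*(h)}\left(\alpha - \frac{1}{2}\right) + L^{*(h)}\left(\eta - \frac{1}{2}\right)\right\} < 0. \qquad (16.62)$$
Therefore, the solution is sparse if
$$L\eta - \frac{1}{2} - \min_h M^{*(h)}\left(\alpha - \frac{1}{2}\right) - \max_h L^{*(h)}\left(\eta - \frac{1}{2}\right) > 0,\quad\text{or equivalently,}\quad\alpha < \frac{1}{2} + \frac{1}{\min_h M^{*(h)}}\left(L\eta - \frac{1}{2} - \max_h L^{*(h)}\left(\eta - \frac{1}{2}\right)\right),$$
and dense if
$$L\eta - \frac{1}{2} - \max_h M^{*(h)}\left(\alpha - \frac{1}{2}\right) - \max_h L^{*(h)}\left(\eta - \frac{1}{2}\right) < 0,\quad\text{or equivalently,}\quad\alpha > \frac{1}{2} + \frac{1}{\max_h M^{*(h)}}\left(L\eta - \frac{1}{2} - \max_h L^{*(h)}\left(\eta - \frac{1}{2}\right)\right).$$
Therefore, the solution is always sparse in this case.

(VI) When $\eta > \frac{1}{2}$ and $\alpha \ge \frac{1}{2}$, the solution is sparse if Eq. (16.61) holds, and dense if Eq. (16.62) holds. Therefore, the solution is sparse if
$$L\eta - \frac{1}{2} - \max_h M^{*(h)}\left(\alpha - \frac{1}{2}\right) - \max_h L^{*(h)}\left(\eta - \frac{1}{2}\right) > 0,\quad\text{or equivalently,}\quad\alpha < \frac{1}{2} + \frac{1}{\max_h M^{*(h)}}\left(L\eta - \frac{1}{2} - \max_h L^{*(h)}\left(\eta - \frac{1}{2}\right)\right),$$
and dense if
$$L\eta - \frac{1}{2} - \min_h M^{*(h)}\left(\alpha - \frac{1}{2}\right) - \max_h L^{*(h)}\left(\eta - \frac{1}{2}\right) < 0,\quad\text{or equivalently,}\quad\alpha > \frac{1}{2} + \frac{1}{\min_h M^{*(h)}}\left(L\eta - \frac{1}{2} - \max_h L^{*(h)}\left(\eta - \frac{1}{2}\right)\right).$$
Thus, we can conclude that, in this case, the solution is sparse if $\alpha < \frac{1}{2} + \frac{L-1}{2\max_h M^{*(h)}}$, and dense if $\alpha > \frac{1}{2} + \frac{L\eta - \frac{1}{2}}{\min_h M^{*(h)}}$.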
Summarizing the preceding, we have the following lemma: Lemma 16.12 When 0 < η ≤ and dense if α > α
12 , maxh M ∗ (h) h Lη− 1 α < 12 + 2 maxL−1M∗ (h) , and dense if α > 12 + min M2∗ (h) . h h
the solution is
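The monotonicity argument behind the cases above can be tried numerically. The following Python sketch evaluates the coefficient of Eq. (16.60), read here as $M(H\alpha - \frac{1}{2}) + \sum_{h=1}^{\hat{H}} (L\eta - \frac{1}{2} - \hat{M}^{(h)}(\alpha - \frac{1}{2}) - \hat{L}^{(h)}(\eta - \frac{1}{2}))$, and the change it undergoes when one topic is added. The function names and the toy numbers are hypothetical and serve only as an illustration.

```python
def lambda_vb(alpha, eta, H, M, L, M_hat, L_hat):
    """Leading coefficient of the VB free energy, following Eq. (16.60).

    M_hat[h] and L_hat[h] stand for the per-topic counts M-hat^(h) and
    L-hat^(h) of the H-hat active topics (names are illustrative).
    """
    assert len(M_hat) == len(L_hat)
    total = M * (H * alpha - 0.5)
    for m_h, l_h in zip(M_hat, L_hat):
        total += (L * eta - 0.5) - m_h * (alpha - 0.5) - l_h * (eta - 0.5)
    return total


def delta_lambda_vb(alpha, eta, L, delta_M, delta_L):
    """Change of the coefficient when one topic is added (Delta H-hat = 1)."""
    return (L * eta - 0.5) - delta_M * (alpha - 0.5) - delta_L * (eta - 0.5)


# Toy setting (assumed values): small alpha and eta.
alpha, eta, H, M, L = 0.1, 0.1, 5, 10, 20
before = lambda_vb(alpha, eta, H, M, L, M_hat=[10, 6], L_hat=[12, 9])
# One more active topic, contributing delta_M = 4 and delta_L = 0.
after = lambda_vb(alpha, eta, H, M, L, M_hat=[10, 6, 4], L_hat=[12, 9, 0])
increase = delta_lambda_vb(alpha, eta, L, delta_M=4, delta_L=0)
print(before, after, increase)
```

In this toy setting the coefficient grows when a topic is added, so the smaller number of active topics has the lower free energy, which is the sparse regime described above.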
In the Limit When $N, M \to \infty$ with $\frac{M}{N}, L \sim O(1)$

The coefficient of the leading term of the free energy is given by
\[
\lambda^{\mathrm{VB}}_{\mathrm{LDA}} = M\left(H\alpha - \frac{1}{2}\right) - \sum_{h=1}^{\hat{H}} \hat{M}^{(h)}\left(\alpha - \frac{1}{2}\right). \tag{16.63}
\]
Although the predictive distribution does not necessarily converge to the true distribution, we can investigate the sparsity of the solution by considering the duplication rules (16.57) through (16.59) that keep $\hat{B}\hat{\Theta}^{\top}$ unchanged. It is clear that Eq. (16.63) is increasing with respect to $\hat{H}$ if $\alpha < \frac{1}{2}$, and decreasing if $\alpha > \frac{1}{2}$. Combining this result with Lemma 16.12 completes the proof of Corollary 16.11.

In the case when $L, M \ll N$ and in the case when $L \ll M, N$, Corollary 16.11 provides information on the sparsity of the VB solution, which will be compared with other methods in Section 16.4.2. On the other hand, although we have successfully derived the leading term of the free energy also in the case when $M \ll L, N$ and in the case when $1 \ll L, M, N$, it unfortunately provides no information on the sparsity of the solution.
16.4.2 Asymptotic Analysis of MAP Learning and Partially Bayesian Learning

For training the LDA model, MAP learning and partially Bayesian (PB) learning (see Section 2.2.2), where Θ and/or B are point-estimated, are also popular choices. Although the difference in the update equations is small, it can affect the asymptotic behavior. In this subsection, we aim to clarify the difference in the asymptotic behavior.
MAP learning, PB-A learning, PB-B learning, and VB learning, respectively, solve the following problem:
\[
\min_{r} F \quad \text{s.t.} \quad
\begin{cases}
r_{\Theta,B}(\Theta, B) = \delta(\Theta; \hat{\Theta})\,\delta(B; \hat{B}) & \text{(for MAP learning)},\\
r_{\Theta,B}(\Theta, B) = r_{\Theta}(\Theta)\,\delta(B; \hat{B}) & \text{(for PB-A learning)},\\
r_{\Theta,B}(\Theta, B) = \delta(\Theta; \hat{\Theta})\,r_{B}(B) & \text{(for PB-B learning)},\\
r_{\Theta,B}(\Theta, B) = r_{\Theta}(\Theta)\,r_{B}(B) & \text{(for VB learning)}.
\end{cases}
\]
Similar analysis to Section 16.4.1 leads to the following theorem (the proof is given in Section 16.4.5):

Theorem 16.13 In the limit when $N \to \infty$ with $L, M \sim O(1)$, the solution is sparse if $\alpha < \alpha_{\mathrm{sparse}}$, and dense if $\alpha > \alpha_{\mathrm{dense}}$. In the limit when $N, M \to \infty$ with $\frac{M}{N}, L \sim O(1)$, the solution is sparse if $\alpha < \alpha_{M\to\infty}$, and dense if $\alpha > \alpha_{M\to\infty}$. Here, $\alpha_{\mathrm{sparse}}$, $\alpha_{\mathrm{dense}}$, and $\alpha_{M\to\infty}$ are given in Table 16.1.

A notable finding from Table 16.1 is that the threshold that determines the topic sparsity of PB-B learning is (in most cases exactly) $\frac{1}{2}$ larger than the threshold of VB learning. The same relation is observed between MAP learning and PB-A learning. From these, we can conclude that point-estimating Θ, instead of integrating it out, increases the threshold by $\frac{1}{2}$ in the LDA model. We will validate this observation by numerical experiments in Section 16.4.4.

Table 16.1 Sparsity thresholds of VB, PB-A, PB-B, and MAP methods (see Theorem 16.13). The first four columns show the thresholds $(\alpha_{\mathrm{sparse}}, \alpha_{\mathrm{dense}})$, of which the function forms depend on the range of η, in the limit when $N \to \infty$ with $L, M \sim O(1)$. A single value is shown if $\alpha_{\mathrm{sparse}} = \alpha_{\mathrm{dense}}$. The last column shows the threshold $\alpha_{M\to\infty}$ in the limit when $N, M \to \infty$ with $\frac{M}{N}, L \sim O(1)$.
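| η range | $0 < \eta \le \frac{1}{2L}$ | $\frac{1}{2L} < \eta \le \frac{1}{2}$ | $\frac{1}{2} < \eta < 1$ | $1 \le \eta < \infty$ | $\alpha_{M\to\infty}$ |
|---|---|---|---|---|---|
| VB | $\frac{1}{2} - \frac{\frac{1}{2} - L\eta}{\min_h M^{*(h)}}$ | $\frac{1}{2} + \frac{L\eta - \frac{1}{2}}{\max_h M^{*(h)}}$ | $\left(\frac{1}{2} + \frac{L-1}{2\max_h M^{*(h)}},\ \frac{1}{2} + \frac{L\eta - \frac{1}{2}}{\min_h M^{*(h)}}\right)$ | $\left(\frac{1}{2} + \frac{L-1}{2\max_h M^{*(h)}},\ \frac{1}{2} + \frac{L\eta - \frac{1}{2}}{\min_h M^{*(h)}}\right)$ | $\frac{1}{2}$ |
| PB-A | - | - | - | $\left(\frac{1}{2},\ \frac{1}{2} + \frac{L(\eta - 1)}{\min_h M^{*(h)}}\right)$ | $\frac{1}{2}$ |
| PB-B | $1$ | $1 + \frac{L\eta - \frac{1}{2}}{\max_h M^{*(h)}}$ | $\left(1 + \frac{L-1}{2\max_h M^{*(h)}},\ 1 + \frac{L\eta - \frac{1}{2}}{\min_h M^{*(h)}}\right)$ | $\left(1 + \frac{L-1}{2\max_h M^{*(h)}},\ 1 + \frac{L\eta - \frac{1}{2}}{\min_h M^{*(h)}}\right)$ | $1$ |
| MAP | - | - | - | $\left(1,\ 1 + \frac{L(\eta - 1)}{\min_h M^{*(h)}}\right)$ | $1$ |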
In the Limit When $N, M \to \infty$ with $\frac{M}{N}, L \sim O(1)$

The coefficient of the leading term of the free energy is given by
\[
\lambda^{\mathrm{PB-A}}_{\mathrm{LDA}} = M\left(H\alpha - \frac{1}{2}\right) - \sum_{h=1}^{\hat{H}} \hat{M}^{(h)}\left(\alpha - \frac{1}{2}\right). \tag{16.74}
\]
Although the predictive distribution does not necessarily converge to the true distribution, we can investigate the sparsity of the solution by considering the duplication rules (16.57) through (16.59) that keep $\hat{B}\hat{\Theta}^{\top}$ unchanged. It is clear that Eq. (16.74) is increasing with respect to $\hat{H}$ if $\alpha < \frac{1}{2}$, and decreasing if $\alpha > \frac{1}{2}$. Combining this result with Lemma 16.16, we obtain the following corollary:

Corollary 16.17 Assume that $\eta \ge 1$. In the limit when $N \to \infty$ with $L, M \sim O(1)$, the PB-A solution is sparse if $\alpha < \frac{1}{2}$, and dense if $\alpha > \frac{1}{2} + \frac{L(\eta - 1)}{\min_h M^{*(h)}}$. In the limit when $N, M \to \infty$ with $\frac{M}{N}, L \sim O(1)$, the PB-A solution is sparse if $\alpha < \frac{1}{2}$, and dense if $\alpha > \frac{1}{2}$.
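PB-B Learning

The free energy for PB-B learning is given as follows:
\[
F^{\mathrm{PB-B}} = \chi_{\Theta} + R^{\mathrm{PB-B}} + Q^{\mathrm{PB-B}}, \tag{16.75}
\]
where $\chi_{\Theta}$ is a large constant corresponding to the negative entropy of the delta functions, and
\[
\begin{aligned}
R^{\mathrm{PB-B}} &= \left\langle \log \frac{r_{\Theta}(\Theta)\, r_{B}(B)}{p(\Theta|\alpha)\, p(B|\eta)} \right\rangle_{r^{\mathrm{PB-B}}(\Theta, B)} \\
&= \sum_{m=1}^{M} \left( \log \frac{\Gamma(\alpha)^{H}}{\Gamma(H\alpha)} + \sum_{h=1}^{H} (1 - \alpha) \left( \log(\hat{\alpha}^{\mathrm{PB-B}}_{m,h}) - \log\Big(\sum_{h'=1}^{H} \hat{\alpha}^{\mathrm{PB-B}}_{m,h'}\Big) \right) \right) \\
&\quad + \sum_{h=1}^{H} \left( \log \frac{\Gamma\big(\sum_{l=1}^{L} \hat{\eta}^{\mathrm{PB-B}}_{l,h}\big)}{\prod_{l=1}^{L} \Gamma(\hat{\eta}^{\mathrm{PB-B}}_{l,h})} \frac{\Gamma(\eta)^{L}}{\Gamma(L\eta)} + \sum_{l=1}^{L} \big(\hat{\eta}^{\mathrm{PB-B}}_{l,h} - \eta\big) \left( \Psi(\hat{\eta}^{\mathrm{PB-B}}_{l,h}) - \Psi\Big(\sum_{l'=1}^{L} \hat{\eta}^{\mathrm{PB-B}}_{l',h}\Big) \right) \right),
\end{aligned} \tag{16.76}
\]
\[
\begin{aligned}
Q^{\mathrm{PB-B}} &= \left\langle \log \frac{r_{z}\big(\{\{z^{(n,m)}\}_{n=1}^{N^{(m)}}\}_{m=1}^{M}\big)}{p(\{w^{(n,m)}\}, \{z^{(n,m)}\} | \Theta, B)} \right\rangle_{r^{\mathrm{PB-B}}(\Theta, B, \{z^{(n,m)}\})} \\
&= - \sum_{m=1}^{M} N^{(m)} \sum_{l=1}^{L} V_{l,m} \log \left( \sum_{h=1}^{H} \frac{\hat{\alpha}^{\mathrm{PB-B}}_{m,h}}{\sum_{h'=1}^{H} \hat{\alpha}^{\mathrm{PB-B}}_{m,h'}} \frac{\exp\big(\Psi(\hat{\eta}^{\mathrm{PB-B}}_{l,h})\big)}{\exp\big(\Psi(\sum_{l'=1}^{L} \hat{\eta}^{\mathrm{PB-B}}_{l',h})\big)} \right).
\end{aligned} \tag{16.77}
\]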
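Let us first consider the case when $\alpha < 1$. In this case, $F$ diverges as $F \to -\infty$ with fixed $N$, when $\hat{\alpha}_{m,h} = O(1)$ for any $(m, h)$ and $\hat{\alpha}_{m,h'} \to +0$ for all other $h' \ne h$. Therefore, the solution is sparse (so sparse that the estimator is useless). When $\alpha \ge 1$, the solution satisfies the following stationary condition:
\[
\hat{\alpha}^{\mathrm{PB-B}}_{m,h} = \alpha - 1 + \sum_{n=1}^{N^{(m)}} \hat{z}^{\mathrm{PB-B}(n,m)}_{h}, \qquad
\hat{\eta}^{\mathrm{PB-B}}_{l,h} = \eta + \sum_{m=1}^{M} \sum_{n=1}^{N^{(m)}} \hat{z}^{\mathrm{PB-B}(n,m)}_{h} w^{(n,m)}_{l}, \tag{16.78}
\]
\[
\hat{z}^{\mathrm{PB-B}(n,m)}_{h} = \frac{\hat{\alpha}^{\mathrm{PB-B}}_{m,h} \exp\left( \sum_{l=1}^{L} w^{(n,m)}_{l} \left( \Psi(\hat{\eta}^{\mathrm{PB-B}}_{l,h}) - \Psi\big(\sum_{l'=1}^{L} \hat{\eta}^{\mathrm{PB-B}}_{l',h}\big) \right) \right)}{\sum_{h'=1}^{H} \hat{\alpha}^{\mathrm{PB-B}}_{m,h'} \exp\left( \sum_{l=1}^{L} w^{(n,m)}_{l} \left( \Psi(\hat{\eta}^{\mathrm{PB-B}}_{l,h'}) - \Psi\big(\sum_{l'=1}^{L} \hat{\eta}^{\mathrm{PB-B}}_{l',h'}\big) \right) \right)}. \tag{16.79}
\]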
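The stationary conditions (16.78) and (16.79) can be read as an alternating update. The following Python sketch performs, for a single document, the responsibility update of Eq. (16.79) followed by the update of $\hat{\alpha}_{m,h}$ in Eq. (16.78). It is only an illustration under stated assumptions: each word is given by its index (a one-hot $w^{(n,m)}$), all function and variable names are hypothetical, the numerical values are made up, and the digamma function $\Psi$ is approximated with a standard recurrence plus asymptotic series so that no external library is needed.

```python
import math


def digamma(x):
    """Digamma function for x > 0 (upward recurrence + asymptotic series)."""
    result = 0.0
    while x < 6.0:              # psi(x) = psi(x + 1) - 1 / x
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 * inv - series


def pbb_responsibilities(word_ids, alpha_hat_m, eta_hat):
    """Eq. (16.79) for one document; eta_hat is an L x H nested list.

    Assumes every entry of alpha_hat_m and eta_hat is strictly positive.
    """
    H, L = len(alpha_hat_m), len(eta_hat)
    psi_col = [digamma(sum(eta_hat[l][h] for l in range(L))) for h in range(H)]
    z_hat = []
    for l in word_ids:
        logits = [math.log(alpha_hat_m[h]) + digamma(eta_hat[l][h]) - psi_col[h]
                  for h in range(H)]
        top = max(logits)       # subtract the maximum for numerical stability
        weights = [math.exp(v - top) for v in logits]
        norm = sum(weights)
        z_hat.append([w / norm for w in weights])
    return z_hat


def pbb_alpha_update(z_hat, alpha):
    """First part of Eq. (16.78): alpha - 1 plus the summed responsibilities."""
    H = len(z_hat[0])
    return [alpha - 1.0 + sum(row[h] for row in z_hat) for h in range(H)]


# Made-up example with H = 3 topics, L = 4 words, and 5 word occurrences.
alpha = 1.5
alpha_hat_m = [2.0, 0.7, 1.3]
eta_hat = [[1.2, 0.4, 0.9], [0.3, 2.1, 0.5], [0.8, 0.6, 1.7], [1.1, 0.2, 0.4]]
word_ids = [0, 2, 2, 1, 3]
z_hat = pbb_responsibilities(word_ids, alpha_hat_m, eta_hat)
alpha_new = pbb_alpha_update(z_hat, alpha)
print(alpha_new)
```

Because each row of responsibilities sums to one, the updated $\hat{\alpha}_{m,h}$ sum to $H(\alpha - 1) + N^{(m)}$, which is a convenient sanity check for an implementation.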
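In the same way as for VB and PB-A learning, we can obtain the following lemma:

Lemma 16.18 Let $\hat{B}^{\mathrm{PB-B}} \hat{\Theta}^{\mathrm{PB-B}\top} = \langle B\Theta^{\top} \rangle_{r^{\mathrm{PB-B}}(\Theta, B)}$. Then it holds that
\[
\left\langle \big(B\Theta^{\top} - \hat{B}^{\mathrm{PB-B}} \hat{\Theta}^{\mathrm{PB-B}\top}\big)^{2}_{l,m} \right\rangle_{r^{\mathrm{PB-B}}(\Theta, B)} = O_p(N^{-2}), \tag{16.80}
\]
\[
Q^{\mathrm{PB-B}} = - \sum_{m=1}^{M} N^{(m)} \sum_{l=1}^{L} V_{l,m} \log \big(\hat{B}^{\mathrm{PB-B}} \hat{\Theta}^{\mathrm{PB-B}\top}\big)_{l,m} + O_p(N^{-1}). \tag{16.81}
\]

$Q^{\mathrm{PB-B}}$ is minimized when $\hat{J} = O_p(N^{-1})$, and it holds that
\[
Q^{\mathrm{PB-B}} = N S_N(D) + O_p(\hat{J} N + LM).
\]
$R^{\mathrm{PB-B}}$ is written as follows:
\[
R^{\mathrm{PB-B}} = \left\{ MH(\alpha - 1) + \hat{H}\left(L\eta - \frac{1}{2}\right) - \sum_{h=1}^{\hat{H}} \left( \hat{M}^{(h)}(\alpha - 1) + \hat{L}^{(h)}\left(\eta - \frac{1}{2}\right) \right) \right\} \log N + (H - \hat{H})\left(L\eta - \frac{1}{2}\right) \log L + O_p(H(M + L)). \tag{16.82}
\]
Taking the different asymptotic limits, we obtain the following theorem: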
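Theorem 16.19 When $\alpha < 1$, each row vector of $\hat{\Theta}^{\mathrm{PB-B}}$ has only one nonzero entry, and the PB-B solution is sparse. Assume in the following that $\alpha \ge 1$. In the limit when $N \to \infty$ with $L, M \sim O(1)$, it holds that $\hat{J} = O_p(1/N)$ and
\[
\widetilde{F}^{\mathrm{PB-B}}(D) = \lambda^{\mathrm{PB-B}}_{\mathrm{LDA}} \log N + O_p(1),
\]
where
\[
\lambda^{\mathrm{PB-B}}_{\mathrm{LDA}} = MH(\alpha - 1) + \hat{H}\left(L\eta - \frac{1}{2}\right) - \sum_{h=1}^{\hat{H}} \left( \hat{M}^{(h)}(\alpha - 1) + \hat{L}^{(h)}\left(\eta - \frac{1}{2}\right) \right).
\]
In the limit when $N, M \to \infty$ with $\frac{M}{N}, L \sim O(1)$, it holds that $\hat{J} = o_p(\log N)$, and
\[
\widetilde{F}^{\mathrm{PB-B}}(D) = \lambda^{\mathrm{PB-B}}_{\mathrm{LDA}} \log N + o_p(N \log N),
\]
where
\[
\lambda^{\mathrm{PB-B}}_{\mathrm{LDA}} = MH(\alpha - 1) - \sum_{h=1}^{\hat{H}} \hat{M}^{(h)}(\alpha - 1).
\]
In the limit when $N, L \to \infty$ with $\frac{L}{N}, M \sim O(1)$, it holds that $\hat{J} = o_p(\log N)$, and
\[
\widetilde{F}^{\mathrm{PB-B}}(D) = \lambda^{\mathrm{PB-B}}_{\mathrm{LDA}} \log N + o_p(N \log N),
\]
where $\lambda^{\mathrm{PB-B}}_{\mathrm{LDA}} = HL\eta$. In the limit when $N, L, M \to \infty$ with $\frac{L}{N}, \frac{M}{N} \sim O(1)$, it holds that $\hat{J} = o_p(N \log N)$, and
\[
\widetilde{F}^{\mathrm{PB-B}}(D) = \lambda^{\mathrm{PB-B}}_{\mathrm{LDA}} \log N + o_p(N^{2} \log N),
\]
where $\lambda^{\mathrm{PB-B}}_{\mathrm{LDA}} = H(M(\alpha - 1) + L\eta)$.

Theorem 16.19 states that the PB-B solution is sparse when $\alpha < 1$. In the following subsection, we investigate the sparsity of the solution for $\alpha \ge 1$.

In the Limit When $N \to \infty$ with $L, M \sim O(1)$

The coefficient of the leading term of the free energy is
\[
\lambda^{\mathrm{PB-B}}_{\mathrm{LDA}} = MH(\alpha - 1) + \sum_{h=1}^{\hat{H}} \left( L\eta - \frac{1}{2} - \hat{M}^{(h)}(\alpha - 1) - \hat{L}^{(h)}\left(\eta - \frac{1}{2}\right) \right).
\]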
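The solution is sparse if $\lambda^{\mathrm{PB-B}}_{\mathrm{LDA}}$ is increasing with respect to $\hat{H}$, and dense if it is decreasing. We focus on the case where $\alpha \ge 1$. Eqs. (16.57) through (16.59) imply the following:

(I) When $0 < \eta \le \frac{1}{2L}$, the solution is sparse if
\[
L\eta - \frac{1}{2} - \max_h M^{*(h)}(\alpha - 1) > 0, \quad \text{or equivalently,} \quad \alpha < 1 - \frac{1}{\max_h M^{*(h)}}\left(\frac{1}{2} - L\eta\right),
\]
and dense if
\[
\alpha > 1 - \frac{1}{\max_h M^{*(h)}}\left(\frac{1}{2} - L\eta\right).
\]
Therefore, the solution is always dense in this case.

(II) When $\frac{1}{2L} < \eta \le \frac{1}{2}$, the solution is sparse if
\[
L\eta - \frac{1}{2} - \max_h M^{*(h)}(\alpha - 1) > 0, \quad \text{or equivalently,} \quad \alpha < 1 + \frac{L\eta - \frac{1}{2}}{\max_h M^{*(h)}},
\]
and dense if
\[
\alpha > 1 + \frac{L\eta - \frac{1}{2}}{\max_h M^{*(h)}}.
\]

(III) When $\eta > \frac{1}{2}$, the solution is sparse if
\[
L\eta - \frac{1}{2} - \max_h \left\{ M^{*(h)}(\alpha - 1) + L^{*(h)}\left(\eta - \frac{1}{2}\right) \right\} > 0, \tag{16.83}
\]
and dense if
\[
L\eta - \frac{1}{2} - \max_h \left\{ M^{*(h)}(\alpha - 1) + L^{*(h)}\left(\eta - \frac{1}{2}\right) \right\} < 0. \tag{16.84}
\]
Therefore, the solution is sparse if
\[
L\eta - \frac{1}{2} - \max_h M^{*(h)}(\alpha - 1) - \max_h L^{*(h)}\left(\eta - \frac{1}{2}\right) > 0, \quad \text{or equivalently,} \quad \alpha < 1 + \frac{1}{\max_h M^{*(h)}}\left(L\eta - \frac{1}{2} - \max_h L^{*(h)}\left(\eta - \frac{1}{2}\right)\right),
\]
and dense if
\[
L\eta - \frac{1}{2} - \min_h M^{*(h)}(\alpha - 1) - \max_h L^{*(h)}\left(\eta - \frac{1}{2}\right) < 0, \quad \text{or equivalently,} \quad \alpha > 1 + \frac{1}{\min_h M^{*(h)}}\left(L\eta - \frac{1}{2} - \max_h L^{*(h)}\left(\eta - \frac{1}{2}\right)\right).
\]
Thus, we can conclude that, in this case, the solution is sparse if $\alpha < 1 + \frac{L - 1}{2 \max_h M^{*(h)}}$, and dense if $\alpha > 1 + \frac{L\eta - \frac{1}{2}}{\min_h M^{*(h)}}$.

Summarizing the preceding, we have the following lemma:

Lemma 16.20 Assume that $\alpha \ge 1$. When $0 < \eta \le \frac{1}{2L}$, the solution is always dense. When $\frac{1}{2L} < \eta \le \frac{1}{2}$, the solution is sparse if $\alpha < 1 + \frac{L\eta - \frac{1}{2}}{\max_h M^{*(h)}}$, and dense if $\alpha > 1 + \frac{L\eta - \frac{1}{2}}{\max_h M^{*(h)}}$. When $\eta > \frac{1}{2}$, the solution is sparse if $\alpha < 1 + \frac{L - 1}{2 \max_h M^{*(h)}}$, and dense if $\alpha > 1 + \frac{L\eta - \frac{1}{2}}{\min_h M^{*(h)}}$.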
In the Limit When $N, M \to \infty$ with $\frac{M}{N}, L \sim O(1)$

The coefficient of the leading term of the free energy is given by
\[
\lambda^{\mathrm{PB-B}}_{\mathrm{LDA}} = MH(\alpha - 1) - \sum_{h=1}^{\hat{H}} \hat{M}^{(h)}(\alpha - 1). \tag{16.85}
\]
Although the predictive distribution does not necessarily converge to the true distribution, we can investigate the sparsity of the solution by considering the duplication rules (16.57) through (16.59) that keep $\hat{B}\hat{\Theta}^{\top}$ unchanged. It is clear that Eq. (16.85) is decreasing with respect to $\hat{H}$ if $\alpha > 1$. Combining this result with Theorem 16.19, which states that the PB-B solution is sparse when $\alpha < 1$, and Lemma 16.20, we obtain the following corollary:

Corollary 16.21 Consider the limit when $N \to \infty$ with $L, M \sim O(1)$. When $0 < \eta \le \frac{1}{2L}$, the PB-B solution is sparse if $\alpha < 1$, and dense if $\alpha > 1$. When $\frac{1}{2L} < \eta \le \frac{1}{2}$, the PB-B solution is sparse if $\alpha < 1 + \frac{L\eta - \frac{1}{2}}{\max_h M^{*(h)}}$, and dense if $\alpha > 1 + \frac{L\eta - \frac{1}{2}}{\max_h M^{*(h)}}$. When $\eta > \frac{1}{2}$, the PB-B solution is sparse if $\alpha < 1 + \frac{L - 1}{2 \max_h M^{*(h)}}$, and dense if $\alpha > 1 + \frac{L\eta - \frac{1}{2}}{\min_h M^{*(h)}}$. In the limit when $N, M \to \infty$ with $\frac{M}{N}, L \sim O(1)$, the PB-B solution is sparse if $\alpha < 1$, and dense if $\alpha > 1$.

MAP Learning

The free energy for MAP learning is given as follows:
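\[
F^{\mathrm{MAP}} = \chi_{\Theta} + \chi_{B} + R^{\mathrm{MAP}} + Q^{\mathrm{MAP}}, \tag{16.86}
\]
where $\chi_{\Theta}$ and $\chi_{B}$ are large constants corresponding to the negative entropies of the delta functions, and
\[
\begin{aligned}
R^{\mathrm{MAP}} &= \left\langle \log \frac{r_{\Theta}(\Theta)\, r_{B}(B)}{p(\Theta|\alpha)\, p(B|\eta)} \right\rangle_{r^{\mathrm{MAP}}(\Theta, B)} \\
&= \sum_{m=1}^{M} \left( \log \frac{\Gamma(\alpha)^{H}}{\Gamma(H\alpha)} + \sum_{h=1}^{H} (1 - \alpha) \left( \log(\hat{\alpha}^{\mathrm{MAP}}_{m,h}) - \log\Big(\sum_{h'=1}^{H} \hat{\alpha}^{\mathrm{MAP}}_{m,h'}\Big) \right) \right) \\
&\quad + \sum_{h=1}^{H} \left( \log \frac{\Gamma(\eta)^{L}}{\Gamma(L\eta)} + \sum_{l=1}^{L} (1 - \eta) \left( \log(\hat{\eta}^{\mathrm{MAP}}_{l,h}) - \log\Big(\sum_{l'=1}^{L} \hat{\eta}^{\mathrm{MAP}}_{l',h}\Big) \right) \right),
\end{aligned} \tag{16.87}
\]
\[
\begin{aligned}
Q^{\mathrm{MAP}} &= \left\langle \log \frac{r_{z}\big(\{\{z^{(n,m)}\}_{n=1}^{N^{(m)}}\}_{m=1}^{M}\big)}{p(\{w^{(n,m)}\}, \{z^{(n,m)}\} | \Theta, B)} \right\rangle_{r^{\mathrm{MAP}}(\Theta, B, \{z^{(n,m)}\})} \\
&= - \sum_{m=1}^{M} N^{(m)} \sum_{l=1}^{L} V_{l,m} \log \left( \sum_{h=1}^{H} \frac{\hat{\alpha}^{\mathrm{MAP}}_{m,h}}{\sum_{h'=1}^{H} \hat{\alpha}^{\mathrm{MAP}}_{m,h'}} \frac{\hat{\eta}^{\mathrm{MAP}}_{l,h}}{\sum_{l'=1}^{L} \hat{\eta}^{\mathrm{MAP}}_{l',h}} \right).
\end{aligned} \tag{16.88}
\]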
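Let us first consider the case when $\alpha < 1$. In this case, $F$ diverges as $F \to -\infty$ with fixed $N$, when $\hat{\alpha}_{m,h} = O(1)$ for any $(m, h)$ and $\hat{\alpha}_{m,h'} \to +0$ for all other $h' \ne h$. Therefore, the solution is sparse (so sparse that the estimator is useless). Similarly, assume that $\eta < 1$. Then $F$ diverges as $F \to -\infty$ with fixed $N$, when $\hat{\eta}_{l,h} = O(1)$ for any $(l, h)$ and $\hat{\eta}_{l',h} \to +0$ for all other $l' \ne l$. Therefore, the solution is useless. When $\alpha \ge 1$ and $\eta \ge 1$, the solution satisfies the following stationary condition:
\[
\hat{\alpha}^{\mathrm{MAP}}_{m,h} = \alpha - 1 + \sum_{n=1}^{N^{(m)}} \hat{z}^{\mathrm{MAP}(n,m)}_{h}, \qquad
\hat{\eta}^{\mathrm{MAP}}_{l,h} = \eta - 1 + \sum_{m=1}^{M} \sum_{n=1}^{N^{(m)}} \hat{z}^{\mathrm{MAP}(n,m)}_{h} w^{(n,m)}_{l}, \tag{16.89}
\]
\[
\hat{z}^{\mathrm{MAP}(n,m)}_{h} = \frac{\hat{\alpha}^{\mathrm{MAP}}_{m,h} \prod_{l=1}^{L} \big(\hat{\eta}^{\mathrm{MAP}}_{l,h}\big)^{w^{(n,m)}_{l}}}{\sum_{h'=1}^{H} \hat{\alpha}^{\mathrm{MAP}}_{m,h'} \prod_{l=1}^{L} \big(\hat{\eta}^{\mathrm{MAP}}_{l,h'}\big)^{w^{(n,m)}_{l}}}. \tag{16.90}
\]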
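In the same way as for VB, PB-A, and PB-B learning, we can obtain the following lemma:

Lemma 16.22 Let $\hat{B}^{\mathrm{MAP}} \hat{\Theta}^{\mathrm{MAP}\top} = \langle B\Theta^{\top} \rangle_{r^{\mathrm{MAP}}(\Theta, B)}$. Then $Q^{\mathrm{MAP}}$ is minimized when $\hat{J} = O_p(N^{-1})$, and it holds that
\[
Q^{\mathrm{MAP}} = N S_N(D) + O_p(\hat{J} N + LM).
\]
$R^{\mathrm{MAP}}$ is written as follows:
\[
R^{\mathrm{MAP}} = \left\{ MH(\alpha - 1) + \hat{H} L (\eta - 1) - \sum_{h=1}^{\hat{H}} \left( \hat{M}^{(h)}(\alpha - 1) + \hat{L}^{(h)}(\eta - 1) \right) \right\} \log N + (H - \hat{H}) L (\eta - 1) \log L + O_p(H(M + L)). \tag{16.91}
\]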
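Taking the different asymptotic limits, we obtain the following theorem:

Theorem 16.23 When $\alpha < 1$, each row vector of $\hat{\Theta}^{\mathrm{MAP}}$ has only one nonzero entry, and the MAP solution is sparse. When $\eta < 1$, each column vector of $\hat{B}^{\mathrm{MAP}}$ has only one nonzero entry. Assume in the following that $\alpha, \eta \ge 1$. In the limit when $N \to \infty$ with $L, M \sim O(1)$, it holds that $\hat{J} = O_p(1/N)$ and
\[
\widetilde{F}^{\mathrm{MAP}}(D) = \lambda^{\mathrm{MAP}}_{\mathrm{LDA}} \log N + O_p(1),
\]
where
\[
\lambda^{\mathrm{MAP}}_{\mathrm{LDA}} = MH(\alpha - 1) + \hat{H} L (\eta - 1) - \sum_{h=1}^{\hat{H}} \left( \hat{M}^{(h)}(\alpha - 1) + \hat{L}^{(h)}(\eta - 1) \right).
\]
In the limit when $N, M \to \infty$ with $\frac{M}{N}, L \sim O(1)$, it holds that $\hat{J} = o_p(\log N)$, and
\[
\widetilde{F}^{\mathrm{MAP}}(D) = \lambda^{\mathrm{MAP}}_{\mathrm{LDA}} \log N + o_p(N \log N),
\]
where
\[
\lambda^{\mathrm{MAP}}_{\mathrm{LDA}} = MH(\alpha - 1) - \sum_{h=1}^{\hat{H}} \hat{M}^{(h)}(\alpha - 1).
\]
In the limit when $N, L \to \infty$ with $\frac{L}{N}, M \sim O(1)$, it holds that $\hat{J} = o_p(\log N)$, and
\[
\widetilde{F}^{\mathrm{MAP}}(D) = \lambda^{\mathrm{MAP}}_{\mathrm{LDA}} \log N + o_p(N \log N),
\]
where $\lambda^{\mathrm{MAP}}_{\mathrm{LDA}} = HL(\eta - 1)$. In the limit when $N, L, M \to \infty$ with $\frac{L}{N}, \frac{M}{N} \sim O(1)$, it holds that $\hat{J} = o_p(N \log N)$, and
\[
\widetilde{F}^{\mathrm{MAP}}(D) = \lambda^{\mathrm{MAP}}_{\mathrm{LDA}} \log N + o_p(N^{2} \log N),
\]
where $\lambda^{\mathrm{MAP}}_{\mathrm{LDA}} = H(M(\alpha - 1) + L(\eta - 1))$.

Theorem 16.23 states that the MAP solution is sparse when $\alpha < 1$. However, it provides no information on the sparsity of the MAP solution for $\eta < 1$. In the following, we investigate the sparsity of the solution for $\alpha, \eta \ge 1$.

In the Limit When $N \to \infty$ with $L, M \sim O(1)$

The coefficient of the leading term of the free energy is
\[
\lambda^{\mathrm{MAP}}_{\mathrm{LDA}} = MH(\alpha - 1) + \sum_{h=1}^{\hat{H}} \left( L(\eta - 1) - \hat{M}^{(h)}(\alpha - 1) - \hat{L}^{(h)}(\eta - 1) \right).
\]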
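The solution is sparse if $\lambda^{\mathrm{MAP}}_{\mathrm{LDA}}$ is increasing with respect to $\hat{H}$, and dense if it is decreasing. We focus on the case where $\alpha, \eta \ge 1$. Eqs. (16.57) through (16.59) imply the following: The solution is sparse if
\[
L(\eta - 1) - \max_h \left\{ M^{*(h)}(\alpha - 1) + L^{*(h)}(\eta - 1) \right\} > 0, \tag{16.92}
\]
and dense if
\[
L(\eta - 1) - \max_h \left\{ M^{*(h)}(\alpha - 1) + L^{*(h)}(\eta - 1) \right\} < 0. \tag{16.93}
\]
Therefore, the solution is sparse if
\[
L(\eta - 1) - \max_h M^{*(h)}(\alpha - 1) - \max_h L^{*(h)}(\eta - 1) > 0, \quad \text{or equivalently,} \quad \alpha < 1 + \frac{L(\eta - 1) - \max_h L^{*(h)}(\eta - 1)}{\max_h M^{*(h)}},
\]
and dense if
\[
L(\eta - 1) - \min_h M^{*(h)}(\alpha - 1) - \max_h L^{*(h)}(\eta - 1) < 0, \quad \text{or equivalently,} \quad \alpha > 1 + \frac{L(\eta - 1) - \max_h L^{*(h)}(\eta - 1)}{\min_h M^{*(h)}}.
\]
Thus, we can conclude that the solution is sparse if $\alpha < 1$, and dense if
\[
\alpha > 1 + \frac{L(\eta - 1)}{\min_h M^{*(h)}}.
\]

Summarizing the preceding, we have the following lemma:

Lemma 16.24 Assume that $\eta \ge 1$. The solution is sparse if $\alpha < 1$, and dense if $\alpha > 1 + \frac{L(\eta - 1)}{\min_h M^{*(h)}}$.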
In the Limit When $N, M \to \infty$ with $\frac{M}{N}, L \sim O(1)$

The coefficient of the leading term of the free energy is given by
\[
\lambda^{\mathrm{MAP}}_{\mathrm{LDA}} = MH(\alpha - 1) - \sum_{h=1}^{\hat{H}} \hat{M}^{(h)}(\alpha - 1). \tag{16.94}
\]
Although the predictive distribution does not necessarily converge to the true distribution, we can investigate the sparsity of the solution by considering the duplication rules (16.57) through (16.59) that keep $\hat{B}\hat{\Theta}^{\top}$ unchanged. It is clear that Eq. (16.94) is decreasing with respect to $\hat{H}$ if $\alpha > 1$. Combining this result with Theorem 16.23, which states that the MAP solution is sparse if $\alpha < 1$, and Lemma 16.24, we obtain the following corollary:

Corollary 16.25 Assume that $\eta \ge 1$. In the limit when $N \to \infty$ with $L, M \sim O(1)$, the MAP solution is sparse if $\alpha < 1$, and dense if $\alpha > 1 + \frac{L(\eta - 1)}{\min_h M^{*(h)}}$. In the limit when $N, M \to \infty$ with $\frac{M}{N}, L \sim O(1)$, the MAP solution is sparse if $\alpha < 1$, and dense if $\alpha > 1$.

Summary of Results

Summarizing Corollaries 16.11, 16.17, 16.21, and 16.25 completes the proof of Theorem 16.13.
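The thresholds collected in these corollaries can be gathered into a small lookup routine. The Python sketch below is one possible way to do so for the limit $N \to \infty$ with $L, M \sim O(1)$. The function name and argument names are hypothetical, `min_M` and `max_M` stand for $\min_h M^{*(h)}$ and $\max_h M^{*(h)}$, and the VB branch follows the thresholds as read from Lemma 16.12; it should be taken as an illustration under these assumptions rather than as a reference implementation.

```python
def sparsity_thresholds(method, eta, L, min_M, max_M):
    """Return (alpha_sparse, alpha_dense) for N -> infinity with L, M ~ O(1).

    method is one of "VB", "PB-A", "PB-B", "MAP" (labels chosen here).
    """
    if eta <= 0:
        raise ValueError("eta must be positive")
    if method in ("VB", "PB-B"):
        base = 0.5 if method == "VB" else 1.0
        if eta <= 1.0 / (2 * L):
            # VB: single threshold below 1/2; PB-B: single threshold at 1.
            t = base - (0.5 - L * eta) / min_M if method == "VB" else base
            return (t, t)
        if eta <= 0.5:
            t = base + (L * eta - 0.5) / max_M
            return (t, t)
        return (base + (L - 1) / (2.0 * max_M),
                base + (L * eta - 0.5) / min_M)
    if method in ("PB-A", "MAP"):
        if eta < 1:
            raise ValueError("thresholds are stated only for eta >= 1")
        base = 0.5 if method == "PB-A" else 1.0
        return (base, base + L * (eta - 1.0) / min_M)
    raise ValueError("unknown method: %r" % (method,))


# Made-up problem sizes: L = 10 words, topics covering 5 to 20 documents.
vb = sparsity_thresholds("VB", eta=0.25, L=10, min_M=5, max_M=20)
pbb = sparsity_thresholds("PB-B", eta=0.25, L=10, min_M=5, max_M=20)
pba = sparsity_thresholds("PB-A", eta=1.5, L=10, min_M=5, max_M=20)
mapt = sparsity_thresholds("MAP", eta=1.5, L=10, min_M=5, max_M=20)
print(vb, pbb, pba, mapt)
```

In this made-up setting the PB-B thresholds exceed the VB thresholds by exactly $\frac{1}{2}$, and the MAP thresholds exceed the PB-A thresholds by the same amount, which mirrors the observation about point-estimating Θ.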
17 Unified Theory for Latent Variable Models
In this chapter, we present a formula for evaluating an asymptotic form of the VB free energy of a general class of latent variable models by relating it to the asymptotic theory of Bayesian learning (Watanabe, 2012). This formula is applicable to all latent variable models discussed in Chapters 15 and 16.¹ It also explains relationships between these asymptotic analyses of VB free energy and several previous works where the asymptotic Bayes free energy has been analyzed for specific latent variable models. We apply this formula to Gaussian mixture models (GMMs) as an example and demonstrate another proof of the upper-bound of the VB free energy given in Section 15.2. Furthermore, this analysis also provides a quantity that is related to the generalization performance of VB learning. Analysis of generalization performance of VB learning has been conducted only for limited cases, as discussed in Chapter 14. We show inequalities that relate the VB free energy to the generalization errors of an approximate predictive distribution (Watanabe, 2012).

17.1 Local Latent Variable Model

Consider the joint model
\[
p(x, z | w) \tag{17.1}
\]
on the observed variable $x$ and the local latent variable $z$ with the parameter $w$. The marginal distribution of the observed variable is²
\[
p(x | w) = \sum_{z} p(x, z | w). \tag{17.2}
\]

¹ The reduced rank regression (RRR) model discussed in Chapter 14 is not included in this class of latent variable models.
² The model is written as if the local latent variable were discrete, but it can also be continuous. In this case, the sum $\sum_{z}$ is replaced by the integral $\int dz$. The probabilistic principal component analysis is an example with a continuous local latent variable.
For the complete data set $\{D, H\} = \{(x^{(1)}, z^{(1)}), \ldots, (x^{(N)}, z^{(N)})\}$, we assume the i.i.d. model
\[
p(D, H | w) = \prod_{n=1}^{N} p(x^{(n)}, z^{(n)} | w),
\]
which implies
\[
p(D | w) = \prod_{n=1}^{N} p(x^{(n)} | w), \qquad
p(H | D, w) = \prod_{n=1}^{N} p(z^{(n)} | x^{(n)}, w).
\]
We assume that
\[
p(x | w^{*}) = \sum_{z} p(x, z | w^{*})
\]
with the parameter $w^{*}$ is the underlying distribution generating data $D = \{x^{(1)}, \ldots, x^{(N)}\}$. Because of the nonidentifiability of the latent variable model, the set of true parameters,
\[
\mathcal{W}^{*} \equiv \left\{ \widetilde{w}^{*} ; \ \sum_{z} p(x, z | \widetilde{w}^{*}) = p(x | w^{*}) \right\}, \tag{17.3}
\]
is not generally a point but can be a union of several manifolds with singularities, as demonstrated in Section 13.5.

In the analysis in this chapter, we define and analyze quantities related to generalization performance of a joint model, where the local latent variables are treated as observed variables. Although we do not consider the case where the local latent variables are observed, those quantities are useful for relating generalization properties of VB learning to those of Bayesian learning, with which we establish a unified theory connecting VB learning and Bayesian learning of latent variable models.

Thus, consider for a moment the Bayesian learning of the joint model (17.1), where the complete data set $\{D, H\}$ is observed. For the prior distribution $p(w)$, the posterior distribution is given by
\[
p(w | D, H) = \frac{p(D, H | w)\, p(w)}{p(D, H)}. \tag{17.4}
\]
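The Bayes free energy of the joint model is defined by
\[
F^{\mathrm{Bayes}}_{\mathrm{Joint}}(D, H) = - \log p(D, H) = - \log \int p(D, H | w)\, p(w)\, dw.
\]
If $\widetilde{w}^{*} \in \mathcal{W}^{*}$ is the true parameter, i.e., the complete data set $\{D, H\}$ is generated from $q(x, z) = p(x, z | \widetilde{w}^{*})$ i.i.d., the relative Bayes free energy is defined by
\[
\widetilde{F}^{\mathrm{Bayes}}_{\mathrm{Joint}}(D, H) = F^{\mathrm{Bayes}}_{\mathrm{Joint}}(D, H) - N S_N(D, H), \tag{17.5}
\]
where
\[
S_N(D, H) = - \frac{1}{N} \sum_{n=1}^{N} \log p(x^{(n)}, z^{(n)} | \widetilde{w}^{*})
\]
is the empirical joint entropy. Then the average relative Bayes free energy is defined by
\[
\overline{F}^{\mathrm{Bayes}}_{\mathrm{Joint}}(N) = \left\langle \widetilde{F}^{\mathrm{Bayes}}_{\mathrm{Joint}}(D, H) \right\rangle_{p(D, H | \widetilde{w}^{*})},
\]
and the average Bayes generalization error of the predictive distribution for the joint model is defined by
\[
\mathrm{GE}^{\mathrm{Bayes}}_{\mathrm{Joint}}(N) = \left\langle \mathrm{KL}\big(p(x, z | \widetilde{w}^{*}) \,\|\, p(x, z | D, H)\big) \right\rangle_{p(D, H | \widetilde{w}^{*})},
\]
where
\[
p(x, z | D, H) = \int p(x, z | w)\, p(w | D, H)\, dw.
\]
These two quantities are related to each other as Eq. (13.24):
\[
\mathrm{GE}^{\mathrm{Bayes}}_{\mathrm{Joint}}(N) = \overline{F}^{\mathrm{Bayes}}_{\mathrm{Joint}}(N + 1) - \overline{F}^{\mathrm{Bayes}}_{\mathrm{Joint}}(N). \tag{17.6}
\]
Furthermore, the average relative Bayes free energy for the joint model can be approximated as (see Eq. (13.118))
\[
\overline{F}^{\mathrm{Bayes}}_{\mathrm{Joint}}(N) \approx - \log \int \exp\big(-N \overline{E}(w)\big) \cdot p(w)\, dw, \tag{17.7}
\]
where
\[
\overline{E}(w) = \mathrm{KL}\big(p(x, z | \widetilde{w}^{*}) \,\|\, p(x, z | w)\big) = \left\langle \log \frac{p(x, z | \widetilde{w}^{*})}{p(x, z | w)} \right\rangle_{p(x, z | \widetilde{w}^{*})}. \tag{17.8}
\]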
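Since the log-sum inequality yields that³
\[
\sum_{z} p(x, z | \widetilde{w}^{*}) \log \frac{p(x, z | \widetilde{w}^{*})}{p(x, z | w)} \ge p(x | w^{*}) \log \frac{p(x | w^{*})}{p(x | w)},
\]
we have
\[
\overline{E}(w) \ge E(w), \tag{17.10}
\]
where
\[
E(w) = \mathrm{KL}\big(p(x | w^{*}) \,\|\, p(x | w)\big) = \left\langle \log \frac{p(x | w^{*})}{p(x | w)} \right\rangle_{p(x | w^{*})}.
\]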
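The inequality (17.10) says that the KL divergence between joint distributions is never smaller than the KL divergence between the corresponding marginals. A quick numerical check in Python is sketched below; the two 2-by-2 joint tables are made-up numbers and the helper names are hypothetical.

```python
import math


def kl(p, q):
    """KL divergence between two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def flatten(joint):
    """Turn a table joint[x][z] into a flat list of probabilities."""
    return [v for row in joint for v in row]


def marginal_x(joint):
    """Sum out the latent variable z from joint[x][z]."""
    return [sum(row) for row in joint]


# Made-up joint tables p(x, z | true parameter) and p(x, z | w).
true_joint = [[0.30, 0.10], [0.15, 0.45]]
model_joint = [[0.20, 0.25], [0.35, 0.20]]

E_joint = kl(flatten(true_joint), flatten(model_joint))
E_marginal = kl(marginal_x(true_joint), marginal_x(model_joint))
print(E_joint, E_marginal)
```

For these tables the joint divergence is much larger than the marginal one, because the two models assign the latent variable very differently even though their marginals over $x$ are close.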
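Hence, it follows from Eq. (13.118) that
\[
\overline{F}^{\mathrm{Bayes}}(N) = \left\langle \widetilde{F}^{\mathrm{Bayes}}(D) \right\rangle_{q(D)} \approx - \log \int \exp\big(-N E(w)\big) \cdot p(w)\, dw \le - \log \int \exp\big(-N \overline{E}(w)\big) \cdot p(w)\, dw \approx \overline{F}^{\mathrm{Bayes}}_{\mathrm{Joint}}(N), \tag{17.11}
\]
where
\[
\widetilde{F}^{\mathrm{Bayes}}(D) = F^{\mathrm{Bayes}}(D) - N S_N(D)
\]
is the relative Bayes free energy defined by the Bayes free energy of the original marginal model,
\[
F^{\mathrm{Bayes}}(D) = - \log p(D) = - \log \int p(D | w)\, p(w)\, dw,
\]
and its empirical entropy,
\[
S_N(D) = - \frac{1}{N} \log p(D | w^{*}) = - \frac{1}{N} \sum_{n=1}^{N} \log p(x^{(n)} | w^{*}), \tag{17.12}
\]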
as in Section 13.3.2.

The asymptotic theory of Bayesian learning (Theorem 13.13) shows that an asymptotic form of $\overline{F}^{\mathrm{Bayes}}_{\mathrm{Joint}}(N)$ is given by
\[
\overline{F}^{\mathrm{Bayes}}_{\mathrm{Joint}}(N) = \lambda^{\mathrm{Bayes}}_{\mathrm{Joint}} \log N - (m^{\mathrm{Bayes}}_{\mathrm{Joint}} - 1) \log \log N + O(1), \tag{17.13}
\]
where $-\lambda^{\mathrm{Bayes}}_{\mathrm{Joint}}$ and $m^{\mathrm{Bayes}}_{\mathrm{Joint}}$ are respectively the largest pole and its order of the zeta function defined for a complex number $z$ by
\[
\zeta_{\overline{E}}(z) = \int \overline{E}(w)^{z}\, p(w)\, dw. \tag{17.14}
\]
This means that the asymptotic behavior of the free energy is characterized by $\overline{E}(w)$, while that of the Bayes free energy $F^{\mathrm{Bayes}}$ is characterized by $E(w) = \mathrm{KL}(p(x | w^{*}) \,\|\, p(x | w))$ and the zeta function $\zeta_{E}(z)$ in Eq. (13.122) as Theorem 13.13. The two functions, $\overline{E}$ and $E$, are related by the log-sum inequality (17.10). Then Corollary 13.14 implies the following asymptotic expansion of the average generalization error:
\[
\mathrm{GE}^{\mathrm{Bayes}}_{\mathrm{Joint}}(N) = \frac{\lambda^{\mathrm{Bayes}}_{\mathrm{Joint}}}{N} - \frac{m^{\mathrm{Bayes}}_{\mathrm{Joint}} - 1}{N \log N} + o\left(\frac{1}{N \log N}\right). \tag{17.15}
\]
With the preceding quantities, we first provide a general upper-bound for the VB free energy (Section 17.2), and then show inequalities relating the VB free energy to the generalization errors of an approximate predictive distribution for the joint model (Section 17.4).

³ The log-sum inequality is the following inequality satisfied for nonnegative reals $a_i \ge 0$ and $b_i \ge 0$:
\[
\sum_{i} a_i \log \frac{a_i}{b_i} \ge \left( \sum_{i} a_i \right) \log \frac{\sum_{i} a_i}{\sum_{i} b_i}. \tag{17.9}
\]
This can be proved by subtracting the right-hand side from the left-hand side and applying the nonnegativity of the KL divergence.
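The expansions (17.13) and (17.15) are tied together by the relation (17.6): the difference of the leading terms of the free energy at $N + 1$ and $N$ should reproduce the leading terms of the generalization error. The Python sketch below checks this numerically. The values of $\lambda$ and $m$ are arbitrary illustrative choices, the function names are hypothetical, and the $O(1)$ term of Eq. (17.13) is simply dropped.

```python
import math


def free_energy_leading(N, lam, m):
    """Leading terms of Eq. (17.13), with the O(1) term dropped."""
    return lam * math.log(N) - (m - 1) * math.log(math.log(N))


def gen_error_leading(N, lam, m):
    """Leading terms of Eq. (17.15), with the o(.) term dropped."""
    return lam / N - (m - 1) / (N * math.log(N))


lam, m, N = 1.5, 2, 10 ** 6      # illustrative values only
difference = free_energy_leading(N + 1, lam, m) - free_energy_leading(N, lam, m)
prediction = gen_error_leading(N, lam, m)
print(difference, prediction)
```

For large $N$ the two numbers agree up to terms of order $1/N^{2}$, which is consistent with reading Eq. (17.15) as the increment of Eq. (17.13).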
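17.2 Asymptotic Upper-Bound for VB Free Energy

Given the training data $D = \{x^{(1)}, \ldots, x^{(N)}\}$, consider VB learning for the latent variable model (17.2) with the prior distribution $p(w)$. Under the constraint $r(w, H) = r_{w}(w)\, r_{H}(H)$, the VB free energy is defined by
\[
F^{\mathrm{VB}}(D) = \min_{r_{w}(w),\, r_{H}(H)} F(r),
\]
where
\[
\begin{aligned}
F(r) &= \left\langle \log \frac{r_{w}(w)\, r_{H}(H)}{p(D, H | w)\, p(w)} \right\rangle_{r_{w}(w) r_{H}(H)} && (17.16) \\
&= F^{\mathrm{Bayes}}(D) + \mathrm{KL}\big(r_{w}(w)\, r_{H}(H) \,\|\, p(w, H | D)\big). && (17.17)
\end{aligned}
\]
The stationary condition of the free energy yields
\[
r_{w}(w) = \frac{1}{C_{w}}\, p(w) \exp \big\langle \log p(D, H | w) \big\rangle_{r_{H}(H)}, \tag{17.18}
\]
\[
r_{H}(H) = \frac{1}{C_{H}} \exp \big\langle \log p(D, H | w) \big\rangle_{r_{w}(w)}. \tag{17.19}
\]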
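The decomposition (17.17) and the stationary conditions (17.18) and (17.19) can be illustrated on a toy problem in which both the parameter and the latent variable take only two values. The Python sketch below alternates the two updates and records the free energy $F(r)$ of Eq. (17.16). All numbers and names are made up for illustration; `lik[w][h]` plays the role of $p(D, H | w)$ for a fixed data set.

```python
import math

prior = [0.6, 0.4]                      # p(w) for w in {0, 1}
lik = [[0.10, 0.02], [0.03, 0.20]]      # lik[w][h] = p(D, H = h | w)


def free_energy(r_w, r_h):
    """F(r) of Eq. (17.16) for a factorized r(w, H) = r_w(w) r_H(H)."""
    total = 0.0
    for w, rw in enumerate(r_w):
        for h, rh in enumerate(r_h):
            if rw > 0 and rh > 0:
                total += rw * rh * math.log(rw * rh / (lik[w][h] * prior[w]))
    return total


def normalize(values):
    s = sum(values)
    return [v / s for v in values]


def update_r_w(r_h):
    """Eq. (17.18): r_w(w) proportional to p(w) exp <log p(D,H|w)>_{r_H}."""
    return normalize([prior[w] * math.exp(sum(r_h[h] * math.log(lik[w][h])
                                              for h in range(2)))
                      for w in range(2)])


def update_r_h(r_w):
    """Eq. (17.19): r_H(H) proportional to exp <log p(D,H|w)>_{r_w}."""
    return normalize([math.exp(sum(r_w[w] * math.log(lik[w][h])
                                   for w in range(2)))
                      for h in range(2)])


# Bayes free energy -log p(D), with p(D) = sum_w sum_h p(D, h | w) p(w).
f_bayes = -math.log(sum(lik[w][h] * prior[w]
                        for w in range(2) for h in range(2)))

r_w, r_h = [0.5, 0.5], [0.5, 0.5]
history = [free_energy(r_w, r_h)]
for _ in range(20):
    r_w = update_r_w(r_h)
    r_h = update_r_h(r_w)
    history.append(free_energy(r_w, r_h))
print(f_bayes, history[0], history[-1])
```

The recorded values never increase and never fall below the Bayes free energy, in line with Eq. (17.17): the gap is the KL divergence between the factorized distribution and the joint posterior.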
Let us define the relative VB free energy
\[
\widetilde{F}^{\mathrm{VB}}(D) = F^{\mathrm{VB}}(D) - N S_N(D)
\]
by the VB free energy and the empirical entropy (17.12). For arbitrary $\widetilde{w}^{*} \in \mathcal{W}^{*}$, substituting Eq. (17.18) into Eq. (17.16), we have
\[
\begin{aligned}
\widetilde{F}^{\mathrm{VB}}(D) &= \min_{r_{H}(H)} \left[ - \log \int p(w) \exp \left\langle \log \frac{p(D, H | w)}{r_{H}(H)} \right\rangle_{r_{H}(H)} dw \right] + \log p(D | w^{*}) && (17.20) \\
&\le - \log \int \exp \left\{ \sum_{H} p(H | D, \widetilde{w}^{*}) \log \frac{p(D, H | w)}{p(D, H | \widetilde{w}^{*})} \right\} p(w)\, dw && (17.21) \\
&\equiv \widetilde{F}^{\mathrm{VB}*}(D).
\end{aligned}
\]
Here, we have substituted $r_{H}(H) \leftarrow p(H | D, \widetilde{w}^{*}) = \frac{p(D, H | \widetilde{w}^{*})}{\sum_{H} p(D, H | \widetilde{w}^{*})}$ to obtain the upper-bound (17.21). The expression (17.20) of the free energy corresponds to viewing the VB learning as a local variational approximation (Section 5.3.3), where the variational parameter $h(\xi)$ is the vector consisting of the elements $\log p(D, H, \xi)$ for all possible $H$.⁴ By taking the expectation with respect to the distribution of training samples, we define the average relative VB free energy and its upper-bound as
\[
\overline{F}^{\mathrm{VB}}(N) = \left\langle \widetilde{F}^{\mathrm{VB}}(D) \right\rangle_{p(D | w^{*})}, \tag{17.22}
\]
\[
\overline{F}^{\mathrm{VB}*}(N) = \left\langle \widetilde{F}^{\mathrm{VB}*}(D) \right\rangle_{p(D | w^{*})}. \tag{17.23}
\]
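⁴ The variational parameter $h(\xi)$ has one-to-one correspondence with $p(H | D, \xi)$, and is substituted as $h(\xi) \leftarrow h(\widetilde{w}^{*})$ in Eq. (17.21).

From Eq. (17.7), we have
\[
\overline{F}^{\mathrm{Bayes}}_{\mathrm{Joint}}(N) \approx - \log \int e^{-N \overline{E}(w)}\, p(w)\, dw \equiv F^{\mathrm{Bayes}}_{\mathrm{Joint}}(N),
\]
where $\overline{E}(w)$ is defined by Eq. (17.8). Then, the following theorem holds:

Theorem 17.1 It holds that
\[
\overline{F}^{\mathrm{Bayes}}(N) \le \overline{F}^{\mathrm{VB}}(N) \le \overline{F}^{\mathrm{VB}*}(N) \le F^{\mathrm{Bayes}}_{\mathrm{Joint}}(N).
\]

Proof The left inequality follows from Eq. (17.17). Eq. (17.21) gives
\[
\begin{aligned}
\overline{F}^{\mathrm{VB}}(N) \le \overline{F}^{\mathrm{VB}*}(N) &= \left\langle \widetilde{F}^{\mathrm{VB}*}(D) \right\rangle_{p(D | w^{*})} && (17.24) \\
&= \left\langle - \log \int \exp \left\{ \sum_{H} p(H | D, \widetilde{w}^{*}) \log \frac{p(D, H | w)}{p(D, H | \widetilde{w}^{*})} \right\} p(w)\, dw \right\rangle_{p(D | w^{*})} \\
&\le - \log \int \exp \left\{ \left\langle \sum_{H} p(H | D, \widetilde{w}^{*}) \log \frac{p(D, H | w)}{p(D, H | \widetilde{w}^{*})} \right\rangle_{p(D | w^{*})} \right\} p(w)\, dw \\
&= - \log \int e^{-N \overline{E}(w)}\, p(w)\, dw = F^{\mathrm{Bayes}}_{\mathrm{Joint}}(N).
\end{aligned}
\]
The first and second equalities are definitions of $\overline{F}^{\mathrm{VB}*}(N)$ and $\widetilde{F}^{\mathrm{VB}*}(D)$. We have applied Jensen's inequality to the convex function $\log \int \exp(\cdot)\, p(w)\, dw$ to obtain the last inequality. Finally, the last equality follows from the fact that $p(D | w^{*})\, p(H | D, \widetilde{w}^{*}) = p(D, H | \widetilde{w}^{*})$ and the i.i.d. assumption.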
The following corollary is immediately obtained from Theorems 13.13 and 17.1:

Corollary 17.2 Let $0 > -\lambda_1 > -\lambda_2 > \cdots$ be the sequence of the poles of the zeta function (17.14) in the decreasing order, and $m_1, m_2, \ldots$ be the corresponding orders of the poles. Then the average relative VB free energy (17.22) can be asymptotically upper-bounded as
\[
\overline{F}^{\mathrm{VB}}(N) \le \lambda_1 \log N - (m_1 - 1) \log \log N + O(1). \tag{17.25}
\]

It holds in Eqs. (17.13) and (17.15) that $\lambda^{\mathrm{Bayes}}_{\mathrm{Joint}} = \lambda_1$ and $m^{\mathrm{Bayes}}_{\mathrm{Joint}} = m_1$ for $\lambda_1$ and $m_1$ defined in Corollary 17.2. Note that $\overline{E}(w)$ depends on $\widetilde{w}^{*} \in \mathcal{W}^{*}$. For different $\widetilde{w}^{*}$, we have different values of $\lambda_1$, which is determined by the minimum over different $\widetilde{w}^{*} \in \mathcal{W}^{*}$ in Eq. (17.25). Then $m_1$ is determined by the maximum of the order of the pole for the minimum $\lambda_1$. Also note that unlike for Bayesian learning, even if the largest pole of the zeta function is obtained, Eq. (17.25) does not necessarily provide a lower-bound of the VB free energy.

If the joint model $p(x, z | w)$, the true distribution $p(x, z | \widetilde{w}^{*})$, and the prior $p(w)$ satisfy the regularity conditions (Section 13.4.1), it holds that
\[
2 \lambda^{\mathrm{Bayes}}_{\mathrm{Joint}} = D,
\]
where $D$ is the number of parameters. If the joint model $p(x, z | w)$ is identifiable, even though the true parameter is on the boundary of the parameter space or the prior does not satisfy 0