English Pages 121 [119] Year 2020
SPRINGER BRIEFS IN STATISTICS JSS RESEARCH SERIES IN STATISTICS
Hisayuki Tsukuma Tatsuya Kubokawa
Shrinkage Estimation for Mean and Covariance Matrices
SpringerBriefs in Statistics JSS Research Series in Statistics
Editors-in-Chief
Naoto Kunitomo, Economics, Meiji University, Chiyoda-ku, Tokyo, Tokyo, Japan
Akimichi Takemura, The Center for Data Science Education and Research, Shiga University, Bunkyo-ku, Tokyo, Japan
Series Editors
Genshiro Kitagawa, Meiji Institute for Advanced Study of Mathematical Sciences, Nakano-ku, Tokyo, Japan
Tomoyuki Higuchi, Faculty of Science and Engineering, Chuo University, Tokyo, Japan
Toshimitsu Hamasaki, Office of Biostatistics and Data Management, National Cerebral and Cardiovascular Center, Suita, Osaka, Japan
Shigeyuki Matsui, Graduate School of Medicine, Nagoya University, Nagoya, Aichi, Japan
Manabu Iwasaki, School of Data Science, Yokohama City University, Yokohama, Tokyo, Japan
Yasuhiro Omori, Graduate School of Economics, The University of Tokyo, Bunkyo-ku, Tokyo, Japan
Masafumi Akahira, Institute of Mathematics, University of Tsukuba, Tsukuba, Ibaraki, Japan
Takahiro Hoshino, Department of Economics, Keio University, Tokyo, Japan
Masanobu Taniguchi, Department of Mathematical Sciences/School, Waseda University/Science & Engineering, Shinjuku-ku, Japan
The current research of statistics in Japan has expanded in several directions in line with recent trends in academic activities in the area of statistics and statistical sciences over the globe. The core of these research activities in statistics in Japan has been the Japan Statistical Society (JSS). This society, the oldest and largest academic organization for statistics in Japan, was founded in 1931 by a handful of pioneer statisticians and economists and now has a history of about 80 years. Many distinguished scholars have been members, including the influential statistician Hirotugu Akaike, who was a past president of JSS, and the notable mathematician Kiyosi Itô, who was an earlier member of the Institute of Statistical Mathematics (ISM), which has been a closely related organization since the establishment of ISM. The society has two academic journals: the Journal of the Japan Statistical Society (English Series) and the Journal of the Japan Statistical Society (Japanese Series). The membership of JSS consists of researchers, teachers, and professional statisticians in many different fields including mathematics, statistics, engineering, medical sciences, government statistics, economics, business, psychology, education, and many other natural, biological, and social sciences. The JSS Series of Statistics aims to publish recent results of current research activities in the areas of statistics and statistical sciences in Japan that otherwise would not be available in English; they are complementary to the two JSS academic journals, both English and Japanese. Because the scope of a research paper in academic journals inevitably has become narrowly focused and condensed in recent years, this series is intended to fill the gap between academic research activities and the form of a single academic paper. 
The series will be of great interest to a wide audience of researchers, teachers, professional statisticians, and graduate students in many countries who are interested in statistics and statistical sciences, in statistical theory, and in various areas of statistical applications.
More information about this subseries at http://www.springer.com/series/13497
Hisayuki Tsukuma Faculty of Medicine Toho University Tokyo, Japan
Tatsuya Kubokawa Faculty of Economics University of Tokyo Tokyo, Japan
ISSN 2191-544X ISSN 2191-5458 (electronic) SpringerBriefs in Statistics ISSN 2364-0057 ISSN 2364-0065 (electronic) JSS Research Series in Statistics ISBN 978-981-15-1595-8 ISBN 978-981-15-1596-5 (eBook) https://doi.org/10.1007/978-981-15-1596-5 © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2020 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
Preface
The rapid development of computer technology has started to yield many types of high-dimensional data and to enable us to deal with them well. Indeed, high-dimensional data appear in numerous fields such as web data science, genomics, telecommunication, atmospheric science, financial engineering, and others. Against this background, the theory of statistical inference in high dimensions has received much attention in recent years. High-dimensional data are in general hard to handle, and ordinary or traditional statistical methods are frequently inapplicable to them. This has inspired statisticians to develop new high-dimensional methodology from both theoretical and practical aspects. Most statisticians' interests seem to lie in the development of efficient algorithms for statistical inference and in the investigation of their asymptotic properties as the dimension goes to infinity. On the other hand, there is not much literature on high-dimensional problems from a decision-theoretic point of view. Statistical decision theory is the study of how to make decisions in the presence of statistical knowledge under uncertainty. It has been studied since around the 1940s, and researchers have already produced many important and interesting results. Probably the most surprising result in decision-theoretic estimation is the inadmissibility of the sample mean vector as an estimator of a multivariate normal population mean. In multivariate normal mean estimation, the sample mean vector is the maximum likelihood estimator and the uniformly minimum variance unbiased estimator, and thus it had long been recognized as optimal. However, in 1956, Charles Stein showed that the sample mean vector is admissible in the one- and two-dimensional cases but inadmissible in three or more dimensions. Shortly afterwards, a specific estimator, called a shrinkage estimator, was provided that exactly dominates the sample mean vector.
To this day, various extensions of shrinkage estimation have been achieved in other statistical models. The purpose of this book is to give a brief overview of shrinkage estimation in the matrix-variate normal distribution model. More specifically, it includes recent techniques and results in estimation of mean and covariance matrices with a
high-dimensional setting that implies singularity of the sample covariance matrix. Such a high-dimensional model can really be analyzed by using the same arguments as for a low-dimensional model. Thus this book takes a unified approach to both high- and low-dimensional shrinkage estimation. Theory of shrinkage estimation for matrix parameters needs many mathematical tools. In Chap. 1, we begin by briefly introducing basic terminology of decision-theoretic estimation and a mathematical technique in shrinkage estimation. Chapter 2 defines the notation with respect to matrix algebra and collects useful results in terms of the Moore-Penrose inverse, the Kronecker product and matrix decompositions. Chapter 3 provides the definition and some properties of matrix-variate normal distribution and related distributions, including the Wishart distribution and joint distributions corresponding to the Cholesky and the eigenvalue decompositions of the Wishart matrix. With a unified treatment for high- and low-dimensional cases, some related distributions are discussed. Chapter 4 introduces a multivariate linear model and derives its canonical form. To find decision-theoretically optimal estimators, we usually direct our attention to several classes of invariant estimators. Therefore Chap. 4 briefly explains group invariance in the canonical form as well. A key tool in shrinkage estimation is an integration by parts formula, called the Stein identity. Chapter 5 gives a generalized Stein identity on matrix-variate normal distribution. Moreover we list some results on matrix differential operators and in particular show useful differentiation formulae concerning the Moore-Penrose inverse. Chapter 6 addresses the problem of estimating the mean matrix in matrix-variate normal distribution model. A unified result on matricial shrinkage estimation is presented, and extensions and applications are given for more general models. 
Chapter 7 deals with the problem of estimating the covariance matrix relative to an extended Stein loss and provides various unified estimation procedures for high- and low-dimensional cases. Some topics related to covariance estimation are also touched upon. The authors would like to thank Prof. M. Akahira for giving us the opportunity to publish this book. The work of the first author was supported in part by Grant-in-Aid for Scientific Research (18K11201) from the Japan Society for the Promotion of Science (JSPS). The work of the second author was supported in part by Grant-in-Aid for Scientific Research (18K11188) from the JSPS.
Tokyo, Japan, March 2020
Hisayuki Tsukuma
Tatsuya Kubokawa
Contents
1 Decision-Theoretic Approach to Estimation
  1.1 Decision-Theoretic Framework for Estimation
  1.2 James-Stein’s Shrinkage Estimator
  1.3 Unbiased Risk Estimate and Stein’s Identity
  References
2 Matrix Algebra
  2.1 Notation
  2.2 Nonsingular Matrix and the Moore-Penrose Inverse
  2.3 Kronecker Product and Vec Operator
  2.4 Matrix Decompositions
  References
3 Matrix-Variate Distributions
  3.1 Preliminaries
    3.1.1 The Multivariate Normal Distribution
    3.1.2 Jacobians of Matrix Transformations
    3.1.3 The Multivariate Gamma Function
  3.2 The Matrix-Variate Normal Distribution
  3.3 The Wishart Distribution
  3.4 The Cholesky Decomposition of the Wishart Matrix
  References
4 Multivariate Linear Model and Group Invariance
  4.1 Multivariate Linear Model
  4.2 A Canonical Form
  4.3 Group Invariance
  References
5 A Generalized Stein Identity and Matrix Differential Operators
  5.1 Stein’s Identity in Matrix-Variate Normal Distribution
  5.2 Some Useful Results on Matrix Differential Operators
  Appendix
  References
6 Estimation of the Mean Matrix
  6.1 Introduction
  6.2 The Unified Efron-Morris Type Estimators Including Singular Cases
    6.2.1 Empirical Bayes Methods
    6.2.2 The Unified Efron-Morris Type Estimator
  6.3 A Unified Class of Matricial Shrinkage Estimators
  6.4 Unbiased Risk Estimate
  6.5 Examples for Specific Estimators
    6.5.1 The Unified Efron-Morris Type Estimator
    6.5.2 A Modified Stein-Type Estimator
    6.5.3 Modified Efron-Morris Type Estimator
  6.6 Related Topics
    6.6.1 Positive-Part Rule Estimators
    6.6.2 Shrinkage Estimation with a Loss Matrix
    6.6.3 Application to a GMANOVA Model
    6.6.4 Generalization in an Elliptically Contoured Model
  Appendix
  References
7 Estimation of the Covariance Matrix
  7.1 Introduction
  7.2 Scale Invariant Estimators
  7.3 Triangular Invariant Estimators and the James-Stein Estimator
    7.3.1 The James-Stein Estimator
    7.3.2 Improvement Using a Subgroup Invariance
  7.4 Orthogonally Invariant Estimators
    7.4.1 Class of Orthogonally Invariant Estimators
    7.4.2 Unbiased Risk Estimate
    7.4.3 Examples
  7.5 Improvement Using Information on Mean Statistic
    7.5.1 A Class of Estimators and Its Risk Function
    7.5.2 Examples of Improved Estimators
    7.5.3 Further Improvements with a Truncation Rule
  7.6 Related Topics
    7.6.1 Decomposition of the Estimation Problem
    7.6.2 Decision-Theoretic Studies Under Quadratic Losses
    7.6.3 Estimation of the Generalized Variance
    7.6.4 Estimation of the Precision Matrix
  References
Index
Chapter 1
Decision-Theoretic Approach to Estimation
Statistical decision theory has been studied since around the 1940s, and researchers have already produced many remarkable results. In the field of decision-theoretic estimation, the most surprising result is the inadmissibility of the sample mean vector in estimation of the mean vector of a multivariate normal distribution. The inadmissibility result is closely related to the discovery of the shrinkage estimator. This chapter summarizes basic terminology of decision-theoretic estimation and shrinkage estimators in multivariate normal mean estimation. Also, Stein’s unbiased estimate of risk is briefly explained as a general method for finding better estimators. The unbiased risk estimate method is applied to the estimation of mean and covariance matrices discussed in this book.
1.1 Decision-Theoretic Framework for Estimation
Let $x$ be a random vector or matrix having a probability distribution characterized by an unknown parameter $\theta$ (possibly, $\theta$ can be a vector or matrix). Assume that we want to estimate $\theta$ based on $x$. An estimator of $\theta$ is denoted by $\hat{\theta} = \hat{\theta}(x)$, which is a function of $x$. In the literature $\hat{\theta}$ is also called the decision rule. Let $P$ be the parameter space. There are usually many estimation procedures for the unknown parameter $\theta \in P$, and thus we need to decide how to select an optimal procedure. From an intuitive point of view, it seems reasonable to select an estimator that minimizes a mathematical distance between $\hat{\theta}$ and $\theta$, or makes it as small as possible. In statistical decision theory, such a distance is regarded as the loss incurred by estimating $\theta$ with $\hat{\theta}$. For this reason, the distance is called a loss function of $\hat{\theta}$ and $\theta$. The loss function of $\hat{\theta}$ and $\theta$ is denoted by $L(\hat{\theta}, \theta)$, where it is nonnegative for any $\hat{\theta}$ and $\theta$. We usually employ a loss function that takes the value zero when $\hat{\theta}$ is equal to $\theta$ and increases as $\hat{\theta}$ moves away from $\theta$.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2020 H. Tsukuma and T. Kubokawa, Shrinkage Estimation for Mean and Covariance Matrices, JSS Research Series in Statistics, https://doi.org/10.1007/978-981-15-1596-5_1
However, the loss function of $\hat{\theta}$ and $\theta$ is random, and in practice the distance between $\hat{\theta}$ and $\theta$ is measured by the expected loss
$$R(\hat{\theta}, \theta) = E[L(\hat{\theta}, \theta)],$$
where $E$ denotes the expectation taken with respect to the distribution of $x$. Here $R(\hat{\theta}, \theta)$ is viewed as a quantified risk incurred by estimating $\theta$ with $\hat{\theta}$, and it is called the risk function relative to $L(\hat{\theta}, \theta)$.
The risk function evaluates the performance of estimators and can be employed to compare two estimators. If $R(\hat{\theta}_0, \theta) \le R(\hat{\theta}, \theta)$ for any $\theta \in P$, with strict inequality for some $\theta$, then the estimator $\hat{\theta}_0$ is said to be better than $\hat{\theta}$, or to dominate $\hat{\theta}$, or to improve on $\hat{\theta}$. An estimator $\hat{\theta}$ is said to be inadmissible if there exists another estimator $\hat{\theta}_0$ which dominates $\hat{\theta}$, and to be admissible if no such estimator $\hat{\theta}_0$ exists. For example, to prove the inadmissibility of an estimator it suffices to find another estimator that dominates it.
Admissibility is a fundamental criterion in decision-theoretic estimation, but in general there exist many admissible estimators. Therefore we consider another criterion called minimaxity. The minimaxity of an estimator $\hat{\theta}_0$ means that $\hat{\theta}_0$ minimizes the supremum of the risk function among all estimators. Namely, $\hat{\theta}_0$ is said to be minimax if
$$\sup_{\theta \in P} R(\hat{\theta}_0, \theta) \le \sup_{\theta \in P} R(\hat{\theta}, \theta)$$
for any estimator $\hat{\theta}$. If an estimator $\hat{\theta}$ is better than a minimax estimator, then $\hat{\theta}$ is also minimax. For a general explanation of decision-theoretic estimation, see, for example, Ferguson (1967) and Lehmann and Casella (1998). Clearly, admissibility and minimaxity strongly depend on the loss function, and there exists some criticism concerning criteria based on loss functions. Berger (1985) and Robert (2007) discussed some criticism of decision-theoretic estimation and the use of loss functions.
1.2 James-Stein’s Shrinkage Estimator
Now, we look at the problem of estimating the mean vector $\theta$ of the $p$-variate normal distribution with the identity covariance matrix relative to the quadratic loss $L(\hat{\theta}, \theta) = \|\hat{\theta} - \theta\|^2$, where $\|\cdot\|$ denotes the Euclidean norm and $\hat{\theta}$ is an estimator of $\theta$. A random vector drawn from the $p$-variate normal distribution is denoted by $X$.
In estimation of the normal mean vector $\theta$, the maximum likelihood estimator is $\hat{\theta}^{ML} = X$, which is the uniformly minimum variance unbiased estimator. Also, $\hat{\theta}^{ML}$ is the best invariant estimator under the group of the affine transformations
$$X \to AX + b, \quad \theta \to A\theta + b, \tag{1.1}$$
where $A$ is a $p \times p$ orthogonal matrix and $b$ is a $p$-dimensional vector, and it is minimax with the constant risk $p$. From the above facts, $\hat{\theta}^{ML}$ has been recognized to be optimal for a long time. However, Stein (1956) showed that $\hat{\theta}^{ML}$ is inadmissible for three or more dimensional cases. Further, James and Stein (1961) succeeded in providing an explicit estimator which dominates $\hat{\theta}^{ML}$. The dominating estimator is of the form
$$\hat{\theta}^{JS} = \left(1 - \frac{p-2}{\|X\|^2}\right) X,$$
which is invariant under a subgroup of (1.1), $X \to AX$ and $\theta \to A\theta$ for a $p \times p$ orthogonal matrix $A$. James and Stein’s estimator $\hat{\theta}^{JS}$ is called a shrinkage estimator since it shrinks $\hat{\theta}^{ML}$ toward the origin. James and Stein (1961) showed that the risk function of $\hat{\theta}^{JS}$ can be expressed as
$$R(\hat{\theta}^{JS}, \theta) = p - (p-2)^2 E\left[(p - 2 + 2K)^{-1}\right], \tag{1.2}$$
where $K$ is a Poisson random variable with mean $\|\theta\|^2/2$. The expectation on the r.h.s. of (1.2) is finite when $p \ge 3$, and then $R(\hat{\theta}^{JS}, \theta) \le R(\hat{\theta}^{ML}, \theta)$ for any $\theta$, namely, $\hat{\theta}^{ML}$ is inadmissible relative to the quadratic loss $L$.
James and Stein’s shrinkage estimator can be characterized by an empirical Bayes method as shown by Efron and Morris (1973). See Gruber (1998), who compared the shrinkage and the ML estimators for some linear models from the Bayesian and frequentist points of view. A broad survey on shrinkage estimation is presented by Kubokawa (1998), and a modern Bayesian approach is explained extensively by Fourdrinier et al. (2018). For a geometrical interpretation of the shrinkage estimator, see Brown and Zhao (2012).
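The domination result and the risk formula (1.2) can be illustrated numerically. The following is a minimal Monte Carlo sketch, not taken from the book: the function names, the choice p = 10, the value of θ, the number of replications, and the truncation of the Poisson series are all assumptions made for illustration.

```python
import numpy as np

def james_stein(x):
    """James-Stein estimate (1 - (p-2)/||x||^2) x, applied row by row."""
    p = x.shape[-1]
    norm_sq = np.sum(x**2, axis=-1, keepdims=True)
    return (1.0 - (p - 2) / norm_sq) * x

def monte_carlo_risk(estimator, theta, n_rep, rng):
    """Average quadratic loss over n_rep draws X ~ N_p(theta, I_p)."""
    x = theta + rng.standard_normal((n_rep, theta.size))
    est = estimator(x)
    return float(np.mean(np.sum((est - theta) ** 2, axis=1)))

def js_risk_formula(p, theta_norm_sq, n_terms=200):
    """Evaluate (1.2), truncating the Poisson series after n_terms terms."""
    lam = theta_norm_sq / 2.0
    pmf = np.exp(-lam)            # P(K = 0)
    expectation = 0.0
    for k in range(n_terms):
        expectation += pmf / (p - 2 + 2 * k)
        pmf *= lam / (k + 1)      # P(K = k + 1) obtained from P(K = k)
    return p - (p - 2) ** 2 * expectation

rng = np.random.default_rng(0)    # fixed seed, hypothetical choice
p = 10
theta = np.full(p, 0.5)
risk_ml = monte_carlo_risk(lambda x: x, theta, 200_000, rng)
risk_js = monte_carlo_risk(james_stein, theta, 200_000, rng)
risk_js_exact = js_risk_formula(p, float(theta @ theta))
print(f"ML: {risk_ml:.3f}  JS (simulated): {risk_js:.3f}  JS (1.2): {risk_js_exact:.3f}")
```

Under these assumptions the simulated risk of the maximum likelihood estimator should be close to p, and the simulated James-Stein risk should be close to the value given by (1.2) and smaller than p.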
1.3 Unbiased Risk Estimate and Stein’s Identity
When $X$ follows the $p$-variate normal distribution with mean $\theta$ and the identity covariance matrix, $\|X\|^2$ is distributed as the noncentral chi-square distribution with $p$ degrees of freedom and the noncentrality parameter $\|\theta\|^2$. The p.d.f. of the noncentral chi-square distribution is written as a Poisson mixture and $E[\|X\|^{-2}] = E[(p - 2 + 2K)^{-1}]$, where $K$ is the Poisson random variable defined in the previous section. The risk function of $\hat{\theta}^{JS}$ can be rewritten as $R(\hat{\theta}^{JS}, \theta) = E[\hat{R}^{JS}(X)]$, where
$$\hat{R}^{JS}(x) = p - \frac{(p-2)^2}{\|x\|^2}.$$
Here $\hat{R}^{JS}(x)$ is a function of the $p$-dimensional vector $x$ but does not depend on $\theta$. Therefore it is called the unbiased risk estimate for $R(\hat{\theta}^{JS}, \theta)$. The unbiased risk estimate for $R(\hat{\theta}^{ML}, \theta)$ is given by $\hat{R}^{ML}(x) = p$, so that $\hat{R}^{JS}(x) \le \hat{R}^{ML}(x)$ for any $x$ except when $x = 0$.
The unbiased risk estimate provides simple methods of proving the inadmissibility of estimators and of finding better estimators. In fact, if there exist unbiased risk estimates $\hat{R}_1(x)$ and $\hat{R}_2(x)$ with respect to two estimators $\hat{\theta}_1$ and $\hat{\theta}_2$, respectively, such that $\hat{R}_1(x) \le \hat{R}_2(x)$ for any $x$, then $R(\hat{\theta}_1, \theta) \le R(\hat{\theta}_2, \theta)$ uniformly for any $\theta$.
In general, unbiased risk estimates are hard to derive. When we consider the class of estimators of the form $\hat{\theta}^{G} = \hat{\theta}^{ML} + G$ with a vector-valued function $G$ of $X$, the risk function of $\hat{\theta}^{G}$ is expressed by
$$R(\hat{\theta}^{G}, \theta) = R(\hat{\theta}^{ML}, \theta) + E[\|G\|^2 + 2(X - \theta)^\top G].$$
Therefore we need to evaluate $E[(X - \theta)^\top G]$ for deriving an unbiased risk estimate of $R(\hat{\theta}^{G}, \theta)$. To this end, Stein (1973, 1981) proved a useful integration by parts formula in terms of a normal distribution: Let $x$ be a normal random variable with mean $\theta$ and variance one. Let $g$ be an absolutely continuous function such that $E[(x - \theta) g(x)]$ and $E[g'(x)]$ are finite. Then the integration by parts formula is given by
$$E[(x - \theta) g(x)] = E[g'(x)], \tag{1.3}$$
which enables us to evaluate $E[(X - \theta)^\top G]$ given above. The integration by parts formula (1.3) is named the Stein identity or the Stein lemma after Stein (1973, 1981). The Stein identity is nowadays a key tool for evaluating risk functions and for deriving unbiased risk estimates, and it has exerted an immeasurable influence on the development of shrinkage estimation.
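As a worked illustration (added here, not part of the original text), the unbiased risk estimate of the James-Stein estimator can be recovered from the Stein identity (1.3). The sketch assumes $p \ge 3$, takes $G(X) = -(p-2)X/\|X\|^2$ so that $\hat{\theta}^{ML} + G$ is the James-Stein estimator, and applies (1.3) to each coordinate with the other coordinates held fixed; the regularity conditions needed for this step are taken for granted.

```latex
\begin{align*}
G_i(x) &= -\frac{(p-2)\,x_i}{\|x\|^2}, \qquad
\frac{\partial G_i}{\partial x_i}(x)
  = -(p-2)\left(\frac{1}{\|x\|^2} - \frac{2x_i^2}{\|x\|^4}\right), \\
E\big[(X-\theta)^\top G\big]
  &= \sum_{i=1}^{p} E\!\left[\frac{\partial G_i}{\partial x_i}(X)\right]
   = -(p-2)\,E\!\left[\frac{p}{\|X\|^2} - \frac{2}{\|X\|^2}\right]
   = -(p-2)^2\,E\big[\|X\|^{-2}\big], \\
E\big[\|G\|^2\big] &= (p-2)^2\,E\big[\|X\|^{-2}\big], \\
R(\hat{\theta}^{JS}, \theta)
  &= p + (p-2)^2 E\big[\|X\|^{-2}\big] - 2(p-2)^2 E\big[\|X\|^{-2}\big]
   = E\!\left[\,p - \frac{(p-2)^2}{\|X\|^2}\right].
\end{align*}
```

The last line agrees with the unbiased risk estimate for the James-Stein estimator stated in this section, and with (1.2) once $E[\|X\|^{-2}]$ is written as $E[(p-2+2K)^{-1}]$.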
References
J.O. Berger, Statistical Decision Theory and Bayesian Analysis, 2nd edn. (Springer, New York, 1985)
L.D. Brown, L.H. Zhao, A geometrical explanation of Stein shrinkage. Stat. Sci. 27, 24–30 (2012)
B. Efron, C. Morris, Stein’s estimation rule and its competitors: An empirical Bayes approach. J. Am. Stat. Assoc. 68, 117–130 (1973)
T.S. Ferguson, Mathematical Statistics: A Decision Theoretic Approach (Academic Press, New York, 1967)
D. Fourdrinier, W.E. Strawderman, M.T. Wells, Shrinkage Estimation (Springer, New York, 2018)
M.H.J. Gruber, Improving Efficiency by Shrinkage (Marcel Dekker, New York, 1998)
W. James, C. Stein, Estimation with quadratic loss, in Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, ed. by J. Neyman, vol. 1 (University of California Press, Berkeley, 1961), pp. 361–379
T. Kubokawa, The Stein phenomenon in simultaneous estimation: A review, in Applied Statistical Science III, ed. by S.E. Ahmed, M. Ahsanullah, B.K. Sinha (Nova Science Publishers, New York, 1998), pp. 143–173
E.L. Lehmann, G. Casella, Theory of Point Estimation, 2nd edn. (Springer, New York, 1998)
C. Robert, The Bayesian Choice, 2nd edn. (Springer, New York, 2007)
C. Stein, Inadmissibility of the usual estimator for the mean of a multivariate normal distribution, in Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, ed. by J. Neyman, vol. 1 (University of California Press, Berkeley, 1956), pp. 197–206
C. Stein, Estimation of the mean of a multivariate normal distribution, Technical Report No. 48 (Department of Statistics, Stanford University, Stanford, 1973)
C. Stein, Estimation of the mean of a multivariate normal distribution. Ann. Stat. 9, 1135–1151 (1981)
Chapter 2
Matrix Algebra
Matrix algebra is an important step in mathematical treatment of shrinkage estimation for matrix parameters, and in particular the Moore-Penrose inverse and some matrix decompositions are required for defining matricial shrinkage estimators. This chapter first explains the notation used in this book and subsequently lists helpful results in matrix algebra.
2.1 Notation
Let $\mathbb{R}^n$ be the $n$-dimensional real vector space and in particular denote $\mathbb{R} = \mathbb{R}^1$. Let $\mathbb{R}^{m \times n}$ be the set of all $m \times n$ matrices with real elements. Note that $\mathbb{R}^{m \times n} = \mathbb{R}^{mn}$ and $\mathbb{R}^{mn}$ is the $(mn)$-dimensional real vector space. If $A \in \mathbb{R}^{m \times n}$ then $A$ is of the form
$$A = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{m1} & \cdots & a_{mn} \end{pmatrix},$$
where all the $a_{ij}$’s belong to $\mathbb{R}$ for $i = 1, \ldots, m$ and $j = 1, \ldots, n$. In some cases, $A$ given above is written as $A = (a_{ij})$, or $A = (a_1, \ldots, a_n)$ where for $j = 1, \ldots, n$
$$a_j = \begin{pmatrix} a_{1j} \\ \vdots \\ a_{mj} \end{pmatrix} \in \mathbb{R}^m.$$
Also, the $(i, j)$-th element of $A$ is sometimes expressed as $a_{i,j}$ or $\{A\}_{ij}$.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2020 H. Tsukuma and T. Kubokawa, Shrinkage Estimation for Mean and Covariance Matrices, JSS Research Series in Statistics, https://doi.org/10.1007/978-981-15-1596-5_2
Let $0_{m \times n}$ be the zero matrix of size $m \times n$. The diagonal matrix of size $n \times n$ is denoted by $\mathrm{diag}(d_1, \ldots, d_n)$, where $d_1, \ldots, d_n$ are the diagonal elements from the upper left corner to the lower right one. Define the identity matrix of size $n \times n$ as $I_n$, namely, $I_n = \mathrm{diag}(1, \ldots, 1)$ consisting of $n$ ones on the diagonal. Let $A^\top$ be the transpose of a matrix $A$ and let $\mathrm{tr}\, A$ and $|A|$ be, respectively, the trace and the determinant of a square matrix $A$. Also, let $A^{-1}$ be the inverse of a nonsingular matrix $A$. A matrix square root of a symmetric positive semi-definite matrix $A$ is written as $A^{1/2}$, where $A^{1/2}$ is symmetric such that $A^{1/2} A^{1/2} = A$. The inverse of $A^{1/2}$ is expressed as $A^{-1/2}$ if it exists.
We list the notation for special subsets in $\mathbb{R}^{m \times n}$ as follows:
$\mathbb{U}_n$: Set of all $n \times n$ nonsingular matrices;
$\mathbb{O}_n$: Set of all $n \times n$ orthogonal matrices;
$\mathbb{V}_{m,n}$: Set of all $m \times n$ matrices $A$ such that $A^\top A = I_n$ for $m \ge n$;
$\mathbb{D}_n$: Set of all $n \times n$ diagonal matrices;
$\mathbb{D}_n^{(\ge)}$: Set of all $n \times n$ diagonal matrices $\mathrm{diag}(d_1, \ldots, d_n)$ such that $d_1 \ge \cdots \ge d_n$;
$\mathbb{D}_n^{(\ge 0)}$: Set of all $n \times n$ diagonal matrices $\mathrm{diag}(d_1, \ldots, d_n)$ such that $d_1 \ge \cdots \ge d_n \ge 0$;
$\mathbb{L}_n^{(+)}$: Set of all $n \times n$ lower triangular matrices with positive diagonal elements;
$\mathbb{L}_n^{(1)}$: Set of all $n \times n$ lower triangular matrices with ones on the diagonal;
$\mathbb{L}_{m,n}^{(+)}$: Set of all $m \times n$ matrices of the form $A = \begin{pmatrix} A_1 \\ A_2 \end{pmatrix}$, where $m \ge n$, $A_1 \in \mathbb{L}_n^{(+)}$ and $A_2 \in \mathbb{R}^{(m-n) \times n}$;
$\mathbb{L}_{m,n}^{(1)}$: Set of all $m \times n$ matrices $A = (a_{ij}) \in \mathbb{L}_{m,n}^{(+)}$ such that $a_{ii} = 1$ for $i = 1, \ldots, n$;
$\mathbb{S}_n$: Set of all $n \times n$ symmetric matrices;
$\mathbb{S}_n^{(+)}$: Set of all $n \times n$ symmetric positive definite matrices;
$\mathbb{S}_{n,r}^{(+)}$: Set of all $n \times n$ symmetric positive semi-definite matrices with rank $r$.
Note that $\mathbb{O}_n = \mathbb{V}_{n,n}$, $\mathbb{L}_n^{(+)} = \mathbb{L}_{n,n}^{(+)}$, $\mathbb{L}_n^{(1)} = \mathbb{L}_{n,n}^{(1)}$ and $\mathbb{S}_n^{(+)} = \mathbb{S}_{n,n}^{(+)}$. The set $\mathbb{V}_{m,n}$ is referred to as the Stiefel manifold. Also, it is important to note that $\mathbb{O}_n$ and $\mathbb{L}_n^{(+)}$ are groups, called respectively the orthogonal and the lower triangular groups, with respect to the group action by usual matrix multiplication. Matrix inequality is defined in the Löwner sense. Namely, for $A_1, A_2 \in \mathbb{S}_n$, the matrix inequality $A_1 \preceq A_2$ (or $A_1 \succeq A_2$) means that $A_2 - A_1$ (or $A_1 - A_2$) is symmetric positive semi-definite. For definitions, concepts and applications in matrix algebra, see, for example, Rao (1973), Golub and Van Loan (1996) and Harville (1997).
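The Löwner ordering defined above is only a partial order: two symmetric matrices need not be comparable. A minimal numerical sketch follows; the helper name `loewner_leq`, the tolerance, and the example matrices are hypothetical choices made for illustration.

```python
import numpy as np

def loewner_leq(a1, a2, tol=1e-10):
    """Return True if a2 - a1 is symmetric positive semi-definite (up to tol)."""
    diff = a2 - a1
    if not np.allclose(diff, diff.T):
        return False
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order
    return bool(np.linalg.eigvalsh(diff)[0] >= -tol)

a1 = np.eye(2)
a2 = np.array([[2.0, 0.5], [0.5, 2.0]])   # a2 - a1 has eigenvalues 0.5 and 1.5
a3 = np.diag([0.0, 2.0])                  # a3 - a1 = diag(-1, 1) is indefinite

print(loewner_leq(a1, a2))                        # a1 precedes a2
print(loewner_leq(a1, a3), loewner_leq(a3, a1))   # a1 and a3 are not comparable
```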
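2.2 Nonsingular Matrix and the Moore-Penrose Inverse
A matrix $A \in \mathbb{R}^{n \times n}$ is said to be invertible if there exists a matrix $B \in \mathbb{R}^{n \times n}$ such that $AB = BA = I_n$. Such a matrix $B$ is uniquely determined by $A$ and is denoted by $A^{-1}$, called the inverse of $A$. A matrix $A$ is invertible if and only if $A$ belongs to $\mathbb{U}_n$, namely, $|A| \neq 0$, and thus an invertible matrix is also called a nonsingular matrix. Here we give an important result on the inverse of a partitioned matrix.
Lemma 2.1 Partition $A \in \mathbb{R}^{n \times n}$ as
$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix},$$
where $A_{11}$, $A_{12}$, $A_{21}$ and $A_{22}$ are matrix subblocks of any size. If $A_{11}$ and $A_{22 \cdot 1} = A_{22} - A_{21} A_{11}^{-1} A_{12}$ are square and nonsingular then $A$ is nonsingular and
$$A^{-1} = \begin{pmatrix} A_{11}^{-1} + A_{11}^{-1} A_{12} A_{22 \cdot 1}^{-1} A_{21} A_{11}^{-1} & -A_{11}^{-1} A_{12} A_{22 \cdot 1}^{-1} \\ -A_{22 \cdot 1}^{-1} A_{21} A_{11}^{-1} & A_{22 \cdot 1}^{-1} \end{pmatrix}.$$
In addition, if $A_{22}$ and $A_{11 \cdot 2} = A_{11} - A_{12} A_{22}^{-1} A_{21}$ are also nonsingular then
$$A_{11}^{-1} + A_{11}^{-1} A_{12} A_{22 \cdot 1}^{-1} A_{21} A_{11}^{-1} = A_{11 \cdot 2}^{-1}.$$
In particular, if $A \in \mathbb{S}_n^{(+)}$ then $A_{11}$, $A_{22}$, $A_{11 \cdot 2}$ and $A_{22 \cdot 1}$ are nonsingular.
If a matrix $A$ is partitioned as in Lemma 2.1 and $A_{11} \in \mathbb{U}_m$ with $m < n$, then $A$ can be expressed as
$$A = \begin{pmatrix} I_m & 0_{m \times (n-m)} \\ A_{21} A_{11}^{-1} & I_{n-m} \end{pmatrix} \begin{pmatrix} A_{11} & 0_{m \times (n-m)} \\ 0_{(n-m) \times m} & A_{22 \cdot 1} \end{pmatrix} \begin{pmatrix} I_m & A_{11}^{-1} A_{12} \\ 0_{(n-m) \times m} & I_{n-m} \end{pmatrix}. \tag{2.1}$$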
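Lemma 2.1 and the factorization (2.1) can be checked numerically. The following is a sketch under assumed choices (n = 5, m = 2, a randomly generated symmetric positive definite matrix, and illustrative variable names), in which all the required blocks are nonsingular.

```python
import numpy as np

rng = np.random.default_rng(1)            # fixed seed, hypothetical choice
n, m = 5, 2
c = rng.standard_normal((n, n))
a = c @ c.T + n * np.eye(n)               # symmetric positive definite

a11, a12 = a[:m, :m], a[:m, m:]
a21, a22 = a[m:, :m], a[m:, m:]
a11_inv = np.linalg.inv(a11)
a22_1 = a22 - a21 @ a11_inv @ a12         # Schur complement A_{22.1}
a22_1_inv = np.linalg.inv(a22_1)
a11_2 = a11 - a12 @ np.linalg.inv(a22) @ a21   # Schur complement A_{11.2}

# Block formula for the inverse from Lemma 2.1
a_inv_blocks = np.block([
    [a11_inv + a11_inv @ a12 @ a22_1_inv @ a21 @ a11_inv, -a11_inv @ a12 @ a22_1_inv],
    [-a22_1_inv @ a21 @ a11_inv, a22_1_inv],
])

# Three factors of the decomposition (2.1)
lower = np.block([[np.eye(m), np.zeros((m, n - m))], [a21 @ a11_inv, np.eye(n - m)]])
middle = np.block([[a11, np.zeros((m, n - m))], [np.zeros((n - m, m)), a22_1]])
upper = np.block([[np.eye(m), a11_inv @ a12], [np.zeros((n - m, m)), np.eye(n - m)]])

print(np.allclose(a_inv_blocks, np.linalg.inv(a)))
print(np.allclose(a_inv_blocks[:m, :m], np.linalg.inv(a11_2)))
print(np.allclose(lower @ middle @ upper, a))
```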
Therefore Lemma 2.1 can easily be verified by (2.1).

Next, we describe some basic and useful properties of the Moore-Penrose inverse, which will be needed for a unified treatment of high- and low-dimensional shrinkage estimators. The Moore-Penrose inverse extends the notion of an inverse to singular square matrices and to rectangular matrices, and it is a special case of a generalized inverse. Here, for a matrix $A \in \mathbb{R}^{m\times n}$, a matrix $A^{-} \in \mathbb{R}^{n\times m}$ is said to be a generalized inverse of $A$ if $A^{-}$ satisfies $AA^{-}A = A$. The generalized inverse is not unique, while the Moore-Penrose inverse is unique.

Definition 2.1 For a matrix $A \in \mathbb{R}^{m\times n}$, a matrix $A^{+} \in \mathbb{R}^{n\times m}$ is called the Moore-Penrose inverse of $A$ if $A^{+}$ satisfies
(i) $AA^{+}A = A$,
(ii) $A^{+}AA^{+} = A^{+}$,
(iii) $(AA^{+})^\top = AA^{+}$,
(iv) $(A^{+}A)^\top = A^{+}A$.

Lemma 2.2 For any matrix $A$, $A^{+}$ always exists and it is unique.

Lemma 2.3 For $m \geq n$, let $A \in \mathbb{R}^{m\times n}$ be of full column rank. Then we have
(i) $A^{+} = (A^\top A)^{-1}A^\top \in \mathbb{R}^{n\times m}$, and in particular $A^{+} = A^\top$ if $A \in \mathbb{V}_{m,n}$,
(ii) $(A^\top)^{+} = A(A^\top A)^{-1}$, and thus $(A^\top)^{+} = (A^{+})^\top$,
(iii) $A^{+}A = I_n$,
(iv) $(ABA^\top)^{+} = (A^\top)^{+}B^{-1}A^{+}$ for any $B \in \mathbb{U}_n$.

Parts (i) and (ii) of Lemma 2.3 can easily be verified by Definition 2.1 and Lemma 2.2. Part (iii) can be obtained from (i). Part (iv) follows from Definition 2.1 and Lemma 2.2 with (ii) and (iii). Using (i) of Lemma 2.3, we can see that if $A \in \mathbb{U}_n$ then $A^{+} = (A^\top A)^{-1}A^\top = A^{-1}(A^\top)^{-1}A^\top = A^{-1}$. This implies that if $A \in \mathbb{U}_n$ and it is symmetric then $A^{+}$ is symmetric as well. For more general results on the Moore-Penrose inverse, see Harville (1997) and Magnus and Neudecker (1999).
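As an illustration of Lemma 2.3 (i) and (iii), the following minimal sketch computes $A^{+} = (A^\top A)^{-1}A^\top$ for a small full-column-rank matrix in exact rational arithmetic and checks the four conditions of Definition 2.1. The helper names (`transpose`, `matmul`, `inverse`, `pinv_full_column_rank`) and the example matrix are assumptions made for this illustration only; they are not part of the text.

```python
from fractions import Fraction

def transpose(a):
    return [list(row) for row in zip(*a)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def inverse(a):
    """Invert a nonsingular square matrix by Gauss-Jordan elimination (exact)."""
    n = len(a)
    aug = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        # Pick a row with a nonzero pivot; StopIteration here means a is singular.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [x / pv for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def pinv_full_column_rank(a):
    """Lemma 2.3 (i): A^+ = (A^T A)^{-1} A^T, valid only for full column rank."""
    at = transpose(a)
    return matmul(inverse(matmul(at, a)), at)

# Hypothetical 3 x 2 example with full column rank.
A = [[1, 2], [0, 1], [1, 0]]
A_plus = pinv_full_column_rank(A)

# The four conditions of Definition 2.1.
penrose = [
    matmul(matmul(A, A_plus), A) == A,
    matmul(matmul(A_plus, A), A_plus) == A_plus,
    transpose(matmul(A, A_plus)) == matmul(A, A_plus),
    transpose(matmul(A_plus, A)) == matmul(A_plus, A),
]
# Lemma 2.3 (iii): A^+ A = I_n.
left_inverse = matmul(A_plus, A) == [[1, 0], [0, 1]]
```

Exact fractions are used so that the identities can be compared with `==` rather than with a floating-point tolerance.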
2.3 Kronecker Product and Vec Operator

The Kronecker product and the vec operator are very important for a clear theoretical treatment of matrix-variate distributions. Here, we provide their definitions and list some useful results without proofs. For the proofs and more details, see Harville (1997) and Muirhead (1982).

Definition 2.2 The Kronecker product of two matrices $A = (a_{ij}) \in \mathbb{R}^{m\times n}$ and $B \in \mathbb{R}^{p\times q}$ is denoted by $A \otimes B$, which is a block matrix of the form
$$A \otimes B = \begin{pmatrix} a_{11}B & a_{12}B & \cdots & a_{1n}B \\ a_{21}B & a_{22}B & \cdots & a_{2n}B \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1}B & a_{m2}B & \cdots & a_{mn}B \end{pmatrix} \in \mathbb{R}^{mp\times nq}.$$

Lemma 2.4 Some results on the Kronecker product are given as follows:
(i) $(A \otimes B)(C \otimes D) = AC \otimes BD$ for $A \in \mathbb{R}^{m\times n}$, $B \in \mathbb{R}^{p\times q}$, $C \in \mathbb{R}^{n\times r}$ and $D \in \mathbb{R}^{q\times s}$,
(ii) $(A \otimes B)^\top = A^\top \otimes B^\top$,
(iii) $|A \otimes B| = |A|^n|B|^m$ for $A \in \mathbb{R}^{m\times m}$ and $B \in \mathbb{R}^{n\times n}$,
(iv) $(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$ for nonsingular matrices $A$ and $B$.
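Definition 2.3 Let $A = (a_1, \ldots, a_n) \in \mathbb{R}^{m\times n}$, where the $a_i$'s lie in $\mathbb{R}^m$. The vec operation on $A$ is expressed by $\mathrm{vec}(A)$, which is of the form
$$\mathrm{vec}(A) = \begin{pmatrix} a_1 \\ \vdots \\ a_n \end{pmatrix} \in \mathbb{R}^{mn}.$$

Lemma 2.5 Some results on the vec operation are given as follows:
(i) If $A \in \mathbb{R}^{m\times n}$, $X \in \mathbb{R}^{n\times p}$ and $B \in \mathbb{R}^{p\times q}$ then $\mathrm{vec}(AXB) = (B^\top \otimes A)\,\mathrm{vec}(X)$.
(ii) If $X \in \mathbb{R}^{m\times n}$, $A \in \mathbb{R}^{m\times m}$ and $B \in \mathbb{R}^{n\times n}$ then $\mathrm{tr}\, AXBX^\top = \mathrm{vec}(X)^\top(B^\top \otimes A)\,\mathrm{vec}(X) = \mathrm{vec}(X)^\top(B \otimes A^\top)\,\mathrm{vec}(X)$.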
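The two identities of Lemma 2.5 can be checked on small integer matrices. The sketch below is only an illustration: the helper functions and the example matrices are assumptions, and `vec` stacks columns as in Definition 2.3.

```python
def transpose(a):
    return [list(row) for row in zip(*a)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def kron(a, b):
    """Kronecker product of Definition 2.2: block (i, j) equals a[i][j] * b."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]

def vec(a):
    """Stack the columns of a into one flat list (Definition 2.3)."""
    return [x for col in zip(*a) for x in col]

def matvec(a, v):
    return [sum(x * y for x, y in zip(row, v)) for row in a]

def trace(a):
    return sum(a[i][i] for i in range(len(a)))

# Lemma 2.5 (i): vec(A X B) = (B^T kron A) vec(X), with hypothetical matrices.
A = [[1, 2, 0], [3, -1, 4]]          # 2 x 3
X = [[1, 0], [2, 5], [-3, 1]]        # 3 x 2
B = [[2, 1], [0, 7]]                 # 2 x 2
lhs = vec(matmul(matmul(A, X), B))
rhs = matvec(kron(transpose(B), A), vec(X))

# Lemma 2.5 (ii): tr(A2 X B X^T) = vec(X)^T (B^T kron A2) vec(X).
A2 = [[1, 0, 2], [3, 1, 0], [0, -2, 1]]   # 3 x 3, matching the rows of X
trace_lhs = trace(matmul(matmul(matmul(A2, X), B), transpose(X)))
vx = vec(X)
quad_form = sum(v * w for v, w in zip(vx, matvec(kron(transpose(B), A2), vx)))
```

Note that `vec` here is column-major, whereas flattening a row-major array row by row corresponds to $\mathrm{vec}(X^\top)$.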
2.4 Matrix Decompositions

A matrix decomposition, or a matrix factorization, expresses a matrix as a product of some matrices. Matrix decompositions appear in various settings of statistical analysis. In shrinkage estimation, they are often used for finding better estimators. This section provides some important matrix decompositions without proofs. For the proofs and properties of matrix decompositions, see Golub and Van Loan (1996) and Harville (1997).

Lemma 2.6 (QR decomposition) Let $A \in \mathbb{R}^{m\times n}$ be of full column rank. Then there exist unique $R \in \mathbb{L}_n^{(+)}$ and $Q \in \mathbb{V}_{m,n}$ such that $A = QR^\top$.

The QR decomposition $A = QR^\top$ yields $A^\top = RQ^\top$, implying that for $A \in \mathbb{R}^{m\times n}$ of full row rank there exist unique $L \in \mathbb{L}_m^{(+)}$ and $Q \in \mathbb{V}_{n,m}$ such that $A = LQ^\top$. The decomposition $A = LQ^\top$ is called the LQ decomposition.

Lemma 2.7 (Cholesky decomposition) For any $A \in \mathbb{S}_n^{(+)}$, there exists a unique $L \in \mathbb{L}_n^{(+)}$ such that $A = LL^\top$.

If $A \in \mathbb{S}_n^{(+)}$ then $A$ can also be decomposed as $A = LDL^\top$, where $L \in \mathbb{L}_n^{(1)}$, $D \in \mathbb{D}_n$ with positive diagonals, and $L$ and $D$ are unique. This is known as the LDL decomposition. When $A \in \mathbb{R}^{n\times n}$ can be decomposed as $A = LU$, where $L$ and $U$ are, respectively, lower and upper triangular matrices, the decomposition is called the LU decomposition of $A$.

Lemma 2.8 (Eigenvalue decomposition) For any $A \in \mathbb{R}^{n\times n}$, there exists $P \in \mathbb{U}_n$ such that $A = PDP^{-1}$, where $D \in \mathbb{D}_n^{(\geq)}$. In particular, if $A \in \mathbb{S}_{n,r}^{(+)}$ then there exist $D \in \mathbb{D}_r^{(\geq 0)}$ and $P \in \mathbb{V}_{n,r}$ such that $A = PDP^\top$.
The eigenvalue decomposition is also named the spectral decomposition. The diagonal elements of $D$ in Lemma 2.8 are called the eigenvalues of $A$ and the $i$-th column of $P$ is called the eigenvector corresponding to the $i$-th diagonal element of $D$. The eigenvalue and eigenvector are also referred to as the characteristic root and characteristic vector, respectively. The eigenvalues of $A$ are arranged in descending order on the diagonal of $D$, implying that $D$ is uniquely determined. When $A$ is a member of $\mathbb{S}_{n,r}^{(+)}$, $D$ is unique and $P$ is also unique up to sign changes of columns of $P$.

Lemma 2.9 (Singular value decomposition) For any $A \in \mathbb{R}^{m\times n}$ with rank $r$, there exist $D \in \mathbb{D}_r^{(\geq 0)}$, $U \in \mathbb{V}_{m,r}$ and $V \in \mathbb{V}_{n,r}$ such that $A = UDV^\top$.

The diagonal elements of $D$ in Lemma 2.9 are called the singular values of $A$. The singular value decomposition $A = UDV^\top$ is unique up to sign changes of columns of $V$.

Here we introduce the wedge and descending wedge symbols, $\wedge$ and $\vee$, meaning that, for numbers $a$ and $b$,
$$a \wedge b = \min(a, b), \qquad a \vee b = \max(a, b).$$
Also for numbers $a$, $b$ and $c$, $a \wedge b \wedge c = \min(a, b, c)$. The following identities hold:
$$a \vee b - (a \wedge b) = |a - b| = a + b - 2(a \wedge b) = 2(a \vee b) - a - b.$$
If $A \in \mathbb{R}^{m\times n}$ is of full rank, then its rank is $m \wedge n$ and the singular value decomposition of $A$ can be expressed by $A = UDV^\top$, where $D \in \mathbb{D}_{m\wedge n}^{(\geq 0)}$, $U \in \mathbb{V}_{m,m\wedge n}$ and $V \in \mathbb{V}_{n,m\wedge n}$.
References
G.H. Golub, C.F. Van Loan, Matrix Computations, 3rd edn. (The Johns Hopkins University Press, Baltimore, 1996)
D.A. Harville, Matrix Algebra From a Statistician’s Perspective (Springer, New York, 1997)
J.R. Magnus, H. Neudecker, Matrix Differential Calculus with Applications in Statistics and Econometrics, 2nd edn. (Wiley, New York, 1999)
R.J. Muirhead, Aspects of Multivariate Statistical Theory (Wiley, New York, 1982)
C.R. Rao, Linear Statistical Inference and its Applications, 2nd edn. (Wiley, New York, 1973)
Chapter 3
Matrix-Variate Distributions
This chapter provides the definition and some useful properties of a matrix-variate normal distribution, nonsingular and singular Wishart distributions, and other related distributions. The matrix-variate distributions considered here are based on the multivariate (vector-valued) normal distribution, and so we begin by briefly introducing the multivariate normal distribution and some Jacobians for matrix transformations used to obtain probability density functions of the matrix-variate distributions.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2020 H. Tsukuma and T. Kubokawa, Shrinkage Estimation for Mean and Covariance Matrices, JSS Research Series in Statistics, https://doi.org/10.1007/978-981-15-1596-5_3

3.1 Preliminaries

3.1.1 The Multivariate Normal Distribution

A $p$-dimensional random vector follows a multivariate (vector-valued) normal distribution if the probability density function (p.d.f.) is given by
$$\frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp\Big\{-\frac{1}{2}(x - \mu)^\top \Sigma^{-1}(x - \mu)\Big\}, \qquad x \in \mathbb{R}^p,$$
where $\mu \in \mathbb{R}^p$ and $\Sigma \in \mathbb{S}_p^{(+)}$ are parameters. Such a multivariate ($p$-variate) normal distribution is denoted by $N_p(\mu, \Sigma)$. The multivariate normal distribution has the following properties.

Lemma 3.1 Let $x = (x_1, \ldots, x_p)^\top \sim N_p(\mu, \Sigma)$ with $\mu = (\mu_1, \ldots, \mu_p)^\top \in \mathbb{R}^p$ and $\Sigma = (\sigma_{ij}) \in \mathbb{S}_p^{(+)}$. Then
(i) $E[x_i] = \mu_i$ and $E[x_i x_j] = \mu_i\mu_j + \sigma_{ij}$ for all $i, j \in \{1, \ldots, p\}$. In other words,
$$E[x] = (E[x_1], \ldots, E[x_p])^\top = \mu, \qquad E[(x - \mu)(x - \mu)^\top] = \big(E[(x_i - \mu_i)(x_j - \mu_j)]\big) = \Sigma.$$
(ii) For a full row rank constant matrix $A \in \mathbb{R}^{q\times p}$ and a constant vector $b \in \mathbb{R}^q$, $Ax + b \sim N_q(A\mu + b, A\Sigma A^\top)$.
(iii) If $x$, $\mu$ and $\Sigma$ are partitioned, respectively, as
$$x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \qquad \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{21}^\top \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},$$
where $x_1 \in \mathbb{R}^q$, $\mu_1 \in \mathbb{R}^q$ and $\Sigma_{11} \in \mathbb{S}_q^{(+)}$, then
$$x_1 \sim N_q(\mu_1, \Sigma_{11}), \qquad x_2 \mid x_1 \sim N_{p-q}\big(\mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(x_1 - \mu_1),\ \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{21}^\top\big).$$
Further, $x_1$ and $x_2$ are independent if and only if $\Sigma_{21} = 0_{(p-q)\times q}$.

From (i) of Lemma 3.1, $N_p(\mu, \Sigma)$ is usually called the $p$-variate normal distribution with mean $\mu$ and covariance $\Sigma$. For theoretical properties other than Lemma 3.1, see Muirhead (1982) and Srivastava and Khatri (1979).
3.1.2 Jacobians of Matrix Transformations

Let $X$ be a matrix having $k$ functionally independent variables $x_i$ and let $Y = F(X)$ be a matrix transformation, where $Y$ is a matrix having $k$ functionally independent variables $y_i$. The Jacobian of the transformation is given by $J(X \to Y) = |J|$, where $J = (\partial x_i/\partial y_j)$ is the $k \times k$ Jacobian matrix. Let $(dX) = \bigwedge_{i=1}^k dx_i$ and $(dY) = \bigwedge_{i=1}^k dy_i$, where $\bigwedge$ denotes the exterior product. Then $(dX) = J(X \to Y)(dY)$.

This section provides some specific Jacobians of matrix transformations. When considering a matrix transformation, we need to pay attention to the number of functionally independent variables. For example, $S = (s_{ij}) \in \mathbb{S}_n$ has $n(n+1)/2$ distinct elements, and the exterior product of the distinct elements of the differential $dS = (ds_{ij}) \in \mathbb{S}_n$ is $(dS) = \bigwedge_{i=1}^n \bigwedge_{j=1}^i ds_{ij}$. See Muirhead (1982) and Mathai (1997) for more details of Jacobians and exterior products. Hereafter we ignore the signs of exterior differential forms and define only positive integrals.

Lemma 3.2 Let $X = (x_{ij}) \in \mathbb{R}^{m\times n}$ and $Y = (y_{ij}) \in \mathbb{R}^{m\times n}$. If $X = AYB + C$ with $A \in \mathbb{U}_m$, $B \in \mathbb{U}_n$ and $C \in \mathbb{R}^{m\times n}$, then $(dX) = |A|^n|B|^m(dY)$, where $(dX) = \bigwedge_{i=1}^m \bigwedge_{j=1}^n dx_{ij}$ and $(dY) = \bigwedge_{i=1}^m \bigwedge_{j=1}^n dy_{ij}$.

Proof See Muirhead (1982) and Mathai (1997).
Lemma 3.3 Let $X \in \mathbb{R}^{p\times n}$ with full row rank. Denote by $X = TQ^\top$ the LQ decomposition of $X$, where $T = (t_{ij}) \in \mathbb{L}_p^{(+)}$ and $Q \in \mathbb{V}_{n,p}$. Then
$$(dX) = \Big(\prod_{i=1}^p t_{ii}^{\,n-i}\Big)(dT)(Q^\top dQ),$$
where $(dT) = \bigwedge_{i=1}^p \bigwedge_{j=1}^i dt_{ij}$ and $(Q^\top dQ)$ is an unnormalized probability measure on $\mathbb{V}_{n,p}$.

Proof See Muirhead (1982, Theorem 2.1.13).

Muirhead (1982, p. 69) pointed out that the measure $(Q^\top dQ)$ on $\mathbb{V}_{n,p}$ is invariant under the orthogonal transformations $Q \to AQ$ for $A \in \mathbb{O}_n$ and $Q \to QB$ for $B \in \mathbb{O}_p$. Also,
$$\int_{\mathbb{V}_{n,p}} (Q^\top dQ) = \frac{2^p\pi^{np/2}}{\Gamma_p(n/2)}, \qquad (3.1)$$
which is given in Muirhead (1982, Theorem 2.1.15), where $\Gamma_p(n/2)$ is the multivariate gamma function (see Definition 3.1 given below).

Lemma 3.4 Let $S \in \mathbb{S}_p^{(+)}$ and let $S = TT^\top$ be the Cholesky decomposition of $S$, where $T = (t_{ij}) \in \mathbb{L}_p^{(+)}$. Then
$$(dS) = 2^p\Big(\prod_{i=1}^p t_{ii}^{\,p+1-i}\Big)(dT).$$

Proof See Muirhead (1982, Theorem 2.1.9).
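Lemma 3.5 Let $S \in \mathbb{S}_{p,r}^{(+)}$. Denote by $S = HLH^\top$ the eigenvalue decomposition of $S$, where $L = \mathrm{diag}(\ell_1, \ldots, \ell_r) \in \mathbb{D}_r^{(\geq 0)}$ and $H \in \mathbb{V}_{p,r}$. Then
$$(dS) = \frac{1}{2^r}\Big(\prod_{i=1}^r \ell_i^{\,p-r}\Big)\Big(\prod_{1\leq i<j\leq r}(\ell_i - \ell_j)\Big)(dL)(H^\top dH),$$
where $(dL) = \bigwedge_{i=1}^r d\ell_i$.

Proof See Uhlig (1994, Theorem 2).

Lemma 3.6 Let $X \in \mathbb{R}^{p\times n}$ with rank $r$. Denote by $X = UDV^\top$ the singular value decomposition of $X$, where $D = \mathrm{diag}(d_1, \ldots, d_r) \in \mathbb{D}_r^{(\geq 0)}$, $U \in \mathbb{V}_{p,r}$ and $V \in \mathbb{V}_{n,r}$. Then
$$(dX) = \frac{1}{2^r}\Big(\prod_{i=1}^r d_i^{\,p+n-2r}\Big)\Big(\prod_{1\leq i<j\leq r}(d_i^2 - d_j^2)\Big)(U^\top dU)(dD)(V^\top dV).$$

Proof See Díaz-García et al. (1997, Theorem 3.1).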
3.1.3 The Multivariate Gamma Function

The multivariate gamma function is defined as a multivariate extension of the gamma function. It is convenient for clearly expressing normalizing constants of matrix-variate distributions.

Definition 3.1 For $a > (p-1)/2$, the multivariate gamma function $\Gamma_p(a)$ is defined by
$$\Gamma_p(a) = \int_{\mathbb{S}_p^{(+)}} |W|^{a-(p+1)/2}\exp(-\mathrm{tr}\,W)(dW).$$

When $p = 1$, $\Gamma(a) = \Gamma_1(a) = \int_0^\infty w^{a-1}e^{-w}\,dw$, which is the usual gamma function. The multivariate gamma function can be rewritten as a product of gamma functions.

Proposition 3.1 For $a > (p-1)/2$,
$$\Gamma_p(a) = \pi^{p(p-1)/4}\prod_{i=1}^p \Gamma\Big(a - \frac{i-1}{2}\Big).$$

Proof For $W \in \mathbb{S}_p^{(+)}$, let $W = TT^\top$ be the Cholesky decomposition of $W$, where $T = (t_{ij}) \in \mathbb{L}_p^{(+)}$. From Lemma 3.4, the Jacobian of the transformation $W \to T$ is given by $J(W \to T) = 2^p\prod_{i=1}^p t_{ii}^{\,p-i+1}$, so that
$$\Gamma_p(a) = 2^p\int_{\mathbb{L}_p^{(+)}} |TT^\top|^{a-(p+1)/2}\exp(-\mathrm{tr}\,TT^\top)\Big(\prod_{i=1}^p t_{ii}^{\,p-i+1}\Big)(dT).$$
Since $|TT^\top| = \prod_{i=1}^p t_{ii}^2$ and $\mathrm{tr}\,TT^\top = \sum_{i=1}^p\sum_{j=1}^i t_{ij}^2$, we have
$$\begin{aligned}
\Gamma_p(a) &= 2^p\Big(\prod_{i=2}^p\prod_{j=1}^{i-1}\int_{-\infty}^{\infty} e^{-t_{ij}^2}\,dt_{ij}\Big)\prod_{i=1}^p\int_0^\infty t_{ii}^{\,2a-i}e^{-t_{ii}^2}\,dt_{ii} \\
&= 2^p \times (\sqrt{\pi})^{p(p-1)/2} \times \prod_{i=1}^p \frac{1}{2}\Gamma\Big(a - \frac{i-1}{2}\Big) \\
&= \pi^{p(p-1)/4}\prod_{i=1}^p \Gamma\Big(a - \frac{i-1}{2}\Big),
\end{aligned}$$
which completes the proof.
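The product formula of Proposition 3.1 is straightforward to evaluate numerically. The following sketch, whose function name `multigamma` is an assumption for this illustration, works on the log scale to avoid overflow and checks two simple consequences of the formula: $\Gamma_1(a) = \Gamma(a)$ and $\Gamma_2(3/2) = \sqrt{\pi}\,\Gamma(3/2)\Gamma(1) = \pi/2$.

```python
import math

def multigamma(a, p):
    """Gamma_p(a) via the product formula of Proposition 3.1 (needs a > (p - 1) / 2)."""
    if a <= (p - 1) / 2:
        raise ValueError("multigamma requires a > (p - 1) / 2")
    log_value = p * (p - 1) / 4 * math.log(math.pi)
    log_value += sum(math.lgamma(a - (i - 1) / 2) for i in range(1, p + 1))
    return math.exp(log_value)

# p = 1 recovers the usual gamma function.
g1 = multigamma(2.5, 1)

# Closed form for p = 2, a = 3/2: sqrt(pi) * Gamma(3/2) * Gamma(1) = pi / 2.
g2 = multigamma(1.5, 2)

# The product formula implies Gamma_p(a) = pi^{(p-1)/2} Gamma(a) Gamma_{p-1}(a - 1/2).
g3 = multigamma(3.0, 3)
g3_recursive = math.pi ** 1.0 * math.gamma(3.0) * multigamma(2.5, 2)
```

For large arguments it is usually preferable to keep the logarithm of $\Gamma_p(a)$ rather than exponentiating it.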
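3.2 The Matrix-Variate Normal Distribution

This section provides the definition of the matrix-variate normal distribution and its properties that are useful for analyzing a multivariate linear model. Here, the matrix-variate normal distribution is defined as an extension of $N_p(\theta, \Sigma)$.

Definition 3.2 For a random matrix $X \in \mathbb{R}^{n\times p}$, assume that $\mathrm{vec}(X^\top) \sim N_{np}(\mathrm{vec}(M^\top), \Omega \otimes \Sigma)$. That is, $\mathrm{vec}(X^\top)$ follows the $(np)$-variate normal distribution with mean $\mathrm{vec}(M^\top)$ and covariance $\Omega \otimes \Sigma$, where $M \in \mathbb{R}^{n\times p}$, $\Omega \in \mathbb{S}_n^{(+)}$ and $\Sigma \in \mathbb{S}_p^{(+)}$. Then $X$ is said to follow the matrix-variate normal distribution with mean matrix $M$ and covariance matrix $\Omega \otimes \Sigma$, which is denoted by $X \sim N_{n\times p}(M, \Omega \otimes \Sigma)$.

Note that $N_{np}(M, \Omega \otimes \Sigma)$ means an $(np)$-variate (vector-valued) normal distribution and is distinguished from $N_{n\times p}(M, \Omega \otimes \Sigma)$. The p.d.f. of the matrix-variate normal distribution can be written as follows.

Proposition 3.2 The p.d.f. of $N_{n\times p}(M, \Omega \otimes \Sigma)$ is given by
$$\frac{1}{(2\pi)^{np/2}|\Omega|^{p/2}|\Sigma|^{n/2}}\exp\Big\{-\frac{1}{2}\mathrm{tr}\,\Sigma^{-1}(X - M)^\top\Omega^{-1}(X - M)\Big\}.$$

Proof Using (iii) and (iv) of Lemma 2.4 and (ii) of Lemma 2.5, we observe $|\Omega \otimes \Sigma| = |\Omega|^p|\Sigma|^n$ and
$$\begin{aligned}
\mathrm{vec}((X - M)^\top)^\top(\Omega \otimes \Sigma)^{-1}\mathrm{vec}((X - M)^\top) &= \mathrm{vec}((X - M)^\top)^\top(\Omega^{-1} \otimes \Sigma^{-1})\mathrm{vec}((X - M)^\top) \\
&= \mathrm{tr}\,\Sigma^{-1}(X - M)^\top\Omega^{-1}(X - M).
\end{aligned}$$
Since $\mathrm{vec}(X^\top) \sim N_{np}(\mathrm{vec}(M^\top), \Omega \otimes \Sigma)$, the p.d.f. is written as
$$\begin{aligned}
&\frac{1}{(2\pi)^{np/2}|\Omega \otimes \Sigma|^{1/2}}\exp\Big\{-\frac{1}{2}\mathrm{vec}((X - M)^\top)^\top(\Omega \otimes \Sigma)^{-1}\mathrm{vec}((X - M)^\top)\Big\} \\
&\quad = \frac{1}{(2\pi)^{np/2}|\Omega|^{p/2}|\Sigma|^{n/2}}\exp\Big\{-\frac{1}{2}\mathrm{tr}\,\Sigma^{-1}(X - M)^\top\Omega^{-1}(X - M)\Big\}.
\end{aligned}$$
Hence the proof is complete.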
Using properties of the Kronecker product and the vec operator, we have the following proposition.

Proposition 3.3 If $X \sim N_{n\times p}(M, \Omega \otimes \Sigma)$, then $X^\top \sim N_{p\times n}(M^\top, \Sigma \otimes \Omega)$.

Proof Using (iv) of Lemma 2.4 and (ii) of Lemma 2.5 gives
$$\mathrm{tr}\,\Sigma^{-1}(X - M)^\top\Omega^{-1}(X - M) = \mathrm{tr}\,\Omega^{-1}(X - M)\Sigma^{-1}(X - M)^\top = \mathrm{vec}(X - M)^\top(\Sigma \otimes \Omega)^{-1}\mathrm{vec}(X - M).$$
This implies that $\mathrm{vec}(X) \sim N_{pn}(\mathrm{vec}(M), \Sigma \otimes \Omega)$.
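The first and second moments of the matrix-variate normal distribution are given in the following proposition.

Proposition 3.4 Let $X = (x_{ij}) \sim N_{n\times p}(M, \Omega \otimes \Sigma)$ with $M = (\mu_{ij})$, $\Omega = (\omega_{ij})$ and $\Sigma = (\sigma_{ij})$. Then, for any $i, k \in \{1, \ldots, n\}$ and $j, l \in \{1, \ldots, p\}$,
$$E[x_{ij}] = \mu_{ij}, \qquad E[x_{ij}x_{kl}] = \omega_{ik}\sigma_{jl} + \mu_{ij}\mu_{kl}.$$

Proof Let $X = (x_1, \ldots, x_n)^\top$ and $M = (\mu_1, \ldots, \mu_n)^\top$, where $x_i = (x_{i1}, \ldots, x_{ip})^\top \in \mathbb{R}^p$ and $\mu_i = (\mu_{i1}, \ldots, \mu_{ip})^\top \in \mathbb{R}^p$ for $i = 1, \ldots, n$. Then
$$\mathrm{vec}(X^\top) = \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} \sim N_{np}\left(\begin{pmatrix} \mu_1 \\ \vdots \\ \mu_n \end{pmatrix}, \begin{pmatrix} \omega_{11}\Sigma & \cdots & \omega_{1n}\Sigma \\ \vdots & \ddots & \vdots \\ \omega_{n1}\Sigma & \cdots & \omega_{nn}\Sigma \end{pmatrix}\right),$$
implying that, according to (i) of Lemma 3.1,
$$E[x_i] = \mu_i, \qquad E[(x_i - \mu_i)(x_k - \mu_k)^\top] = \omega_{ik}\Sigma,$$
further implying that
$$E[x_{ij}] = \mu_{ij}, \qquad E[(x_{ij} - \mu_{ij})(x_{kl} - \mu_{kl})] = \omega_{ik}\sigma_{jl}.$$
Hence the proof is complete.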
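Proposition 3.4 suggests that, if $X \sim N_{n\times p}(M, \Omega \otimes \Sigma)$, the covariance of any two rows of $X$ is proportional to $\Sigma$ and the covariance of any two columns of $X$ is proportional to $\Omega$. Further, Proposition 3.4 yields the following corollary.

Corollary 3.1 If $X \sim N_{n\times p}(M, \Omega \otimes \Sigma)$, then
$$E[X] = M, \qquad E[X^\top AX] = (\mathrm{tr}\,\Omega A)\Sigma + M^\top AM, \qquad E[XBX^\top] = (\mathrm{tr}\,\Sigma B)\Omega + MBM^\top$$
for constant matrices $A \in \mathbb{R}^{n\times n}$ and $B \in \mathbb{R}^{p\times p}$.

Proof Using Proposition 3.4 immediately gives $E[X] = (E[x_{ij}]) = (\mu_{ij}) = M$. Let $A = (a_{ij})$ and $B = (b_{ij})$. For $i, j \in \{1, \ldots, p\}$, the $(i, j)$-th element of $X^\top AX$ is $\{X^\top AX\}_{ij} = \sum_{k=1}^n\sum_{l=1}^n x_{ki}a_{kl}x_{lj}$, so that
$$\begin{aligned}
E[\{X^\top AX\}_{ij}] &= \sum_{k=1}^n\sum_{l=1}^n a_{kl}E[x_{ki}x_{lj}] \\
&= \sum_{k=1}^n\sum_{l=1}^n a_{kl}(\omega_{kl}\sigma_{ij} + \mu_{ki}\mu_{lj}) \\
&= (\mathrm{tr}\,\Omega A^\top)\sigma_{ij} + \{M^\top AM\}_{ij}.
\end{aligned}$$
It holds that $\mathrm{tr}\,\Omega A^\top = \mathrm{tr}\,\Omega A$, so that $E[X^\top AX] = (\mathrm{tr}\,\Omega A)\Sigma + M^\top AM$. Similarly, for $i, j \in \{1, \ldots, n\}$,
$$\begin{aligned}
E[\{XBX^\top\}_{ij}] &= \sum_{k=1}^p\sum_{l=1}^p b_{kl}E[x_{ik}x_{jl}] \\
&= \sum_{k=1}^p\sum_{l=1}^p b_{kl}(\omega_{ij}\sigma_{kl} + \mu_{ik}\mu_{jl}) \\
&= (\mathrm{tr}\,\Sigma B)\omega_{ij} + \{MBM^\top\}_{ij}.
\end{aligned}$$
Hence the proof is complete.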
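The index bookkeeping behind Corollary 3.1 can be checked exactly, without simulation. In the sketch below, the second moments $E[x_{ki}x_{lj}]$ are read off from the Kronecker-structured covariance of $\mathrm{vec}(X^\top)$ in Definition 3.2 and summed as in the proof, and the result is compared with the closed forms. The helper functions and all numerical values are hypothetical choices for this illustration; $A$ and $B$ are deliberately non-symmetric.

```python
def transpose(a):
    return [list(row) for row in zip(*a)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def kron(a, b):
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]

def trace(a):
    return sum(a[i][i] for i in range(len(a)))

n, p = 2, 3
Omega = [[2, 1], [1, 3]]                       # n x n, symmetric
Sigma = [[2, 1, 0], [1, 2, 1], [0, 1, 2]]      # p x p, symmetric
M = [[1, 0, 2], [-1, 1, 0]]                    # n x p mean matrix
A = [[1, 2], [0, -1]]                          # n x n constant matrix
B = [[1, 0, 1], [2, 1, 0], [0, 3, 1]]          # p x p constant matrix

# Covariance of vec(X^T): entry (k*p + i, l*p + j) is Cov(x_ki, x_lj), 0-based.
K = kron(Omega, Sigma)

def second_moment(k, i, l, j):
    """E[x_ki x_lj] = Cov(x_ki, x_lj) + mu_ki mu_lj (Proposition 3.4)."""
    return K[k * p + i][l * p + j] + M[k][i] * M[l][j]

# E[X^T A X] element by element, then the closed form (tr Omega A) Sigma + M^T A M.
lhs_A = [[sum(A[k][l] * second_moment(k, i, l, j) for k in range(n) for l in range(n))
          for j in range(p)] for i in range(p)]
MtAM = matmul(matmul(transpose(M), A), M)
c_A = trace(matmul(Omega, A))
rhs_A = [[c_A * Sigma[i][j] + MtAM[i][j] for j in range(p)] for i in range(p)]

# E[X B X^T] element by element, then the closed form (tr Sigma B) Omega + M B M^T.
lhs_B = [[sum(B[k][l] * second_moment(i, k, j, l) for k in range(p) for l in range(p))
          for j in range(n)] for i in range(n)]
MBMt = matmul(matmul(M, B), transpose(M))
c_B = trace(matmul(Sigma, B))
rhs_B = [[c_B * Omega[i][j] + MBMt[i][j] for j in range(n)] for i in range(n)]
```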
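Next, we provide a result on linear transformation of the matrix-variate normal distribution. The following proposition is an extension of (ii) in Lemma 3.1.

Proposition 3.5 Let $A$ be a full row rank constant matrix in $\mathbb{R}^{m\times n}$ and let $B$ be a full column rank constant matrix in $\mathbb{R}^{p\times q}$. Also, let $C$ be a constant matrix in $\mathbb{R}^{m\times q}$. If $X \sim N_{n\times p}(M, \Omega \otimes \Sigma)$, then
$$AXB + C \sim N_{m\times q}(AMB + C, A\Omega A^\top \otimes B^\top\Sigma B).$$

Proof From (i) of Lemma 2.5, it is seen that
$$\mathrm{vec}((AXB + C)^\top) = \mathrm{vec}(B^\top X^\top A^\top) + \mathrm{vec}(C^\top) = (A \otimes B^\top)\mathrm{vec}(X^\top) + \mathrm{vec}(C^\top).$$
Using (ii) of Lemma 3.1 gives
$$(A \otimes B^\top)\mathrm{vec}(X^\top) + \mathrm{vec}(C^\top) \sim N_{mq}\big((A \otimes B^\top)\mathrm{vec}(M^\top) + \mathrm{vec}(C^\top),\ (A \otimes B^\top)(\Omega \otimes \Sigma)(A \otimes B^\top)^\top\big),$$
which implies from (i) and (ii) of Lemma 2.4 that
$$(A \otimes B^\top)\mathrm{vec}(X^\top) + \mathrm{vec}(C^\top) \sim N_{mq}\big(\mathrm{vec}((AMB + C)^\top),\ A\Omega A^\top \otimes B^\top\Sigma B\big).$$
Hence the proof is complete.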
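A partitioning of a random matrix is needed in various fields of statistical inference. Here we give a distributional property for a partitioned random matrix with respect to the matrix-variate normal distribution. The following proposition is a generalization of (iii) of Lemma 3.1.

Proposition 3.6 Let $X \sim N_{n\times p}(M, \Omega \otimes \Sigma)$. Partition $X$, $M$ and $\Omega$ as, respectively,
$$X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}, \qquad M = \begin{pmatrix} M_1 \\ M_2 \end{pmatrix} \qquad \text{and} \qquad \Omega = \begin{pmatrix} \Omega_{11} & \Omega_{21}^\top \\ \Omega_{21} & \Omega_{22} \end{pmatrix},$$
where $X_1 \in \mathbb{R}^{m\times p}$, $M_1 \in \mathbb{R}^{m\times p}$ and $\Omega_{11} \in \mathbb{S}_m^{(+)}$. Let $\Omega_{22\cdot 1} = \Omega_{22} - \Omega_{21}\Omega_{11}^{-1}\Omega_{21}^\top$. Then
$$X_1 \sim N_{m\times p}(M_1, \Omega_{11} \otimes \Sigma), \qquad X_2 \mid X_1 \sim N_{(n-m)\times p}\big(M_2 + \Omega_{21}\Omega_{11}^{-1}(X_1 - M_1),\ \Omega_{22\cdot 1} \otimes \Sigma\big).$$
Further, if $\Omega_{21} = 0_{(n-m)\times m}$ then $X_1$ and $X_2$ are independently distributed as $X_1 \sim N_{m\times p}(M_1, \Omega_{11} \otimes \Sigma)$ and $X_2 \sim N_{(n-m)\times p}(M_2, \Omega_{22} \otimes \Sigma)$.

Proof Lemma 2.1 guarantees $\Omega_{11}$ and $\Omega_{22\cdot 1}$ to be nonsingular. Using (2.1) gives that $|\Omega| = |\Omega_{11}| \times |\Omega_{22\cdot 1}|$ and
$$\Omega^{-1} = \begin{pmatrix} I_m & -\Omega_{11}^{-1}\Omega_{21}^\top \\ 0_{(n-m)\times m} & I_{n-m} \end{pmatrix} \begin{pmatrix} \Omega_{11}^{-1} & 0_{m\times(n-m)} \\ 0_{(n-m)\times m} & \Omega_{22\cdot 1}^{-1} \end{pmatrix} \begin{pmatrix} I_m & 0_{m\times(n-m)} \\ -\Omega_{21}\Omega_{11}^{-1} & I_{n-m} \end{pmatrix}.$$
Here, $(X - M)^\top\Omega^{-1}(X - M)$ becomes
$$(X_1 - M_1)^\top\Omega_{11}^{-1}(X_1 - M_1) + \{X_2 - M_2 - \Omega_{21}\Omega_{11}^{-1}(X_1 - M_1)\}^\top\Omega_{22\cdot 1}^{-1}\{X_2 - M_2 - \Omega_{21}\Omega_{11}^{-1}(X_1 - M_1)\}.$$
From Proposition 3.2, the p.d.f. of $X$ is rewritten as
$$\begin{aligned}
&c_1\exp\Big\{-\frac{1}{2}\mathrm{tr}\,\Sigma^{-1}(X_1 - M_1)^\top\Omega_{11}^{-1}(X_1 - M_1)\Big\} \\
&\quad \times c_2\exp\Big[-\frac{1}{2}\mathrm{tr}\,\Omega_{22\cdot 1}^{-1}\{X_2 - M_2 - \Omega_{21}\Omega_{11}^{-1}(X_1 - M_1)\}\Sigma^{-1}\{X_2 - M_2 - \Omega_{21}\Omega_{11}^{-1}(X_1 - M_1)\}^\top\Big],
\end{aligned}$$
where
$$c_1 = (2\pi)^{-mp/2}|\Omega_{11}|^{-p/2}|\Sigma|^{-m/2}, \qquad c_2 = (2\pi)^{-(n-m)p/2}|\Omega_{22\cdot 1}|^{-p/2}|\Sigma|^{-(n-m)/2}.$$
Hence the above joint p.d.f. of $X_1$ and $X_2$ suggests that $X_1 \sim N_{m\times p}(M_1, \Omega_{11} \otimes \Sigma)$ and $X_2 \mid X_1 \sim N_{(n-m)\times p}(M_2 + \Omega_{21}\Omega_{11}^{-1}(X_1 - M_1), \Omega_{22\cdot 1} \otimes \Sigma)$.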
When $\Omega_{21} = 0_{(n-m)\times m}$, it is seen that $\Omega_{22\cdot 1} = \Omega_{22}$ and $X_2 \sim N_{(n-m)\times p}(M_2, \Omega_{22} \otimes \Sigma)$. As a consequence, $X_2$ does not depend on $X_1$, and thus $X_1$ and $X_2$ are mutually independent.

Proposition 3.6 suggests that if $X \sim N_{n\times p}(M, \Omega \otimes \Sigma)$ and $\Omega$ is a diagonal matrix then all the rows of $X$ are mutually independent. Various properties of the matrix-variate normal distribution have been discovered and studied in addition to the useful properties mentioned above. For other properties of the matrix-variate normal distribution, see Gupta and Nagar (1999) and Muirhead (1982).
3.3 The Wishart Distribution

The Wishart distribution is known as the distribution of the sample covariance matrix in a multivariate normal model and plays an important role in multivariate analysis. It is named for Wishart (1928). First, we provide the definition of the Wishart distribution.

Definition 3.3 Let $X \sim N_{n\times p}(0_{n\times p}, I_n \otimes \Sigma)$ with $\Sigma \in \mathbb{S}_p^{(+)}$ and denote $\nu = n \wedge p$. Then $S = X^\top X$ is called the Wishart matrix of rank $\nu$ and is said to follow the Wishart distribution with $n$ degrees of freedom and scale matrix $\Sigma$, which is denoted by $S \sim W_p^{\nu}(n, \Sigma)$.

If $n \geq p$, $W_p^{p}(n, \Sigma)$ is often abbreviated to $W_p(n, \Sigma)$ and the Wishart matrix $S$ lies in $\mathbb{S}_p^{(+)}$ with probability one. When $p > n$, the Wishart matrix is singular with probability one and belongs to $\mathbb{S}_{p,n}^{(+)}$. In the literature, the distribution of such a Wishart matrix is called a pseudo-Wishart distribution. See Srivastava and Khatri (1979) and Díaz-García et al. (1997). The following proposition provides a unified p.d.f. of $W_p^{\nu}(n, \Sigma)$ in the nonsingular and the singular cases of the Wishart matrix.

Proposition 3.7 Let $(dS)$ be defined as in Lemma 3.5 with $r = \nu$. The p.d.f. of $W_p^{\nu}(n, \Sigma)$ with respect to $(dS)$ is expressed by
$$\frac{\pi^{n(\nu-p)/2}}{2^{np/2}|\Sigma|^{n/2}\Gamma_\nu(n/2)}|L|^{(n-p-1)/2}\exp\Big(-\frac{1}{2}\mathrm{tr}\,\Sigma^{-1}S\Big),$$
where $L \in \mathbb{D}_\nu^{(\geq 0)}$ and the diagonal of $L$ consists of the $\nu$ positive eigenvalues of $S$.

Proof Let $X = (x_{ij}) \sim N_{n\times p}(0_{n\times p}, I_n \otimes \Sigma)$. Note from Proposition 3.3 that $X^\top \sim N_{p\times n}(0_{p\times n}, \Sigma \otimes I_n)$. Let $X^\top = VDU^\top$ be the singular value decomposition of $X^\top$, where $V \in \mathbb{V}_{p,\nu}$, $D = \mathrm{diag}(d_1, \ldots, d_\nu) \in \mathbb{D}_\nu^{(\geq 0)}$ and $U \in \mathbb{V}_{n,\nu}$. The p.d.f. of $N_{p\times n}(0_{p\times n}, \Sigma \otimes I_n)$ with respect to $(dX^\top) = \bigwedge_{j=1}^p\bigwedge_{i=1}^n dx_{ij}$ on $\mathbb{R}^{p\times n}$ is given by
$$\frac{1}{(2\pi)^{np/2}|\Sigma|^{n/2}}\exp\Big(-\frac{1}{2}\mathrm{tr}\,\Sigma^{-1}X^\top X\Big)(dX^\top). \qquad (3.2)$$
Lemma 3.6 is used to express the joint (unnormalized) p.d.f. of $V$, $D$ and $U$ as
$$\frac{1}{(2\pi)^{np/2}|\Sigma|^{n/2}}\exp\Big(-\frac{1}{2}\mathrm{tr}\,\Sigma^{-1}VD^2V^\top\Big) \times 2^{-\nu}|D|^{n+p-2\nu}\Big(\prod_{1\leq i<j\leq\nu}(d_i^2 - d_j^2)\Big)(V^\top dV)(dD)(U^\top dU).$$
Note from (3.1) that $\int_{\mathbb{V}_{n,\nu}}(U^\top dU) = 2^\nu\pi^{n\nu/2}/\Gamma_\nu(n/2)$. Making the transformation $L = \mathrm{diag}(\ell_1, \ldots, \ell_\nu) = D^2$ and integrating out with respect to $U$, we obtain the joint (unnormalized) p.d.f. of $V$ and $L$:
$$\frac{\pi^{n\nu/2}}{(2\pi)^{np/2}|\Sigma|^{n/2}\Gamma_\nu(n/2)}\exp\Big(-\frac{1}{2}\mathrm{tr}\,\Sigma^{-1}VLV^\top\Big) \times 2^{-\nu}|L|^{(n+p-2\nu-1)/2}\Big(\prod_{1\leq i<j\leq\nu}(\ell_i - \ell_j)\Big)(V^\top dV)(dL). \qquad (3.3)$$
Let $S = X^\top X = VLV^\top$. From Lemma 3.5,
$$(dS) = \frac{1}{2^\nu}|L|^{p-\nu}\Big(\prod_{1\leq i<j\leq\nu}(\ell_i - \ell_j)\Big)(dL)(V^\top dV),$$
which yields the p.d.f. of $S$.

Equation (3.3) is the joint (unnormalized) p.d.f. of the nonzero eigenvalues and the corresponding eigenvectors for the Wishart matrix $S \in \mathbb{S}_{p,\nu}^{(+)}$. Note that, for $\nu = p$, $S \in \mathbb{S}_p^{(+)}$ and then the p.d.f. of $W_p(n, \Sigma)$ on $\mathbb{S}_p^{(+)}$ can be expressed as
$$\frac{1}{2^{np/2}|\Sigma|^{n/2}\Gamma_p(n/2)}|S|^{(n-p-1)/2}\exp\Big(-\frac{1}{2}\mathrm{tr}\,\Sigma^{-1}S\Big),$$
because $|L| = |S|$ in Proposition 3.7. When $p = 1$ and $\Sigma = 1$, the p.d.f. above is the same as that of the chi-square distribution with $n$ degrees of freedom. Thus the Wishart distribution is a multivariate generalization of the chi-square distribution.

The following proposition is an important result on the expectation of the Wishart matrix, which is the basis for unbiased estimation of $\Sigma$.

Proposition 3.8 Let $S \sim W_p^{\nu}(n, \Sigma)$. Then $E[S] = n\Sigma$.

Proof Recall that $S = X^\top X$, where $X \sim N_{n\times p}(0_{n\times p}, I_n \otimes \Sigma)$. Thus, this proposition can immediately be verified by Corollary 3.1.

From Proposition 3.8, $W_p^{\nu}(n, \Sigma)$ is also called the Wishart distribution with $n$ degrees of freedom and mean $n\Sigma$.
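For $p = 1$ the density of Proposition 3.7 reduces to a scaled chi-square density, which makes a quick numerical sanity check possible. The sketch below, with hypothetical function names and parameter values, integrates the $p = 1$ density by a composite Simpson rule to confirm that it has total mass close to one and mean close to $n\Sigma$ (Proposition 3.8), writing the scalar scale as `sigma2`.

```python
import math

def wishart_pdf_1d(s, n, sigma2):
    """Density of W_1(n, sigma2) at s: Proposition 3.7 with p = nu = 1."""
    if s <= 0:
        return 0.0
    log_c = -(n / 2) * math.log(2) - (n / 2) * math.log(sigma2) - math.lgamma(n / 2)
    return math.exp(log_c + (n / 2 - 1) * math.log(s) - s / (2 * sigma2))

def simpson(f, a, b, steps):
    """Composite Simpson rule on [a, b]; steps must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3

n, sigma2 = 5, 2.0   # hypothetical degrees of freedom and scale
upper = 200.0        # the tail beyond this point is negligible for these values

total_mass = simpson(lambda s: wishart_pdf_1d(s, n, sigma2), 0.0, upper, 40000)
mean = simpson(lambda s: s * wishart_pdf_1d(s, n, sigma2), 0.0, upper, 40000)
```

With `sigma2 = 1` the function coincides with the chi-square density with `n` degrees of freedom, as noted in the text.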
Many interesting results on the Wishart distribution have already been obtained in the literature. For other results on the Wishart distribution, see Gupta and Nagar (1999) and Muirhead (1982).
3.4 The Cholesky Decomposition of the Wishart Matrix

Here, we provide distribution theories related to the Cholesky decomposition of the Wishart matrix. This decomposition is also named the Bartlett decomposition in statistics.

Let $X = (x_{ij}) \sim N_{n\times p}(0_{n\times p}, I_n \otimes \Sigma)$. The Wishart matrix is defined by $S = X^\top X$. In the case of $p > n$, partition $X$ as $X = (X_1, X_2)$, where $X_1 \in \mathbb{R}^{n\times n}$, and denote by $X_1^\top = T_1Q^\top$ the unique LQ decomposition of $X_1^\top$, where $T_1 \in \mathbb{L}_n^{(+)}$ and $Q \in \mathbb{O}_n$. Here,
$$X^\top = \begin{pmatrix} X_1^\top \\ X_2^\top \end{pmatrix} = \begin{pmatrix} T_1Q^\top \\ X_2^\top \end{pmatrix} = \begin{pmatrix} T_1 \\ X_2^\top Q \end{pmatrix}Q^\top \equiv TQ^\top, \qquad (3.4)$$
where $T = (T_1^\top, T_2^\top)^\top \in \mathbb{L}_{p,n}^{(+)}$ and $T_2 = X_2^\top Q \in \mathbb{R}^{(p-n)\times n}$, which yields $S = TT^\top$.

When $n \geq p$, the usual Cholesky decomposition of $S$ is given by $S = TT^\top$, where $T \in \mathbb{L}_p^{(+)} = \mathbb{L}_{p,p}^{(+)}$. Hence the above decompositions for the cases of $n \geq p$ and $p > n$ can be integrated into $S = TT^\top$ with a unique $T \in \mathbb{L}_{p,\nu}^{(+)}$, where $\nu = n \wedge p$. Then we have the following proposition.
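Proposition 3.9 The p.d.f. of $T = (t_{ij}) \in \mathbb{L}_{p,\nu}^{(+)}$ with $\nu = n \wedge p$ is given by
$$\frac{2^\nu\pi^{n\nu/2}}{(2\pi)^{np/2}|\Sigma|^{n/2}\Gamma_\nu(n/2)}\exp\Big(-\frac{1}{2}\mathrm{tr}\,\Sigma^{-1}TT^\top\Big)\prod_{i=1}^\nu t_{ii}^{\,n-i}. \qquad (3.5)$$

Proof For the $n \geq p$ case, let $X^\top = TQ^\top$ be the LQ decomposition of $X^\top$, where $T \in \mathbb{L}_p^{(+)}$ and $Q \in \mathbb{V}_{n,p}$. Making the transformation $X^\top \to (T, Q)$ in (3.2) and using Lemma 3.3, we can write the joint (unnormalized) p.d.f. of $T$ and $Q$ as
$$\frac{1}{(2\pi)^{np/2}|\Sigma|^{n/2}}\exp\Big(-\frac{1}{2}\mathrm{tr}\,\Sigma^{-1}TT^\top\Big)\Big(\prod_{i=1}^p t_{ii}^{\,n-i}\Big)(dT)(Q^\top dQ).$$
From (3.1), integrating out with respect to $Q$ gives
$$\frac{2^p\pi^{np/2}}{(2\pi)^{np/2}|\Sigma|^{n/2}\Gamma_p(n/2)}\exp\Big(-\frac{1}{2}\mathrm{tr}\,\Sigma^{-1}TT^\top\Big)\Big(\prod_{i=1}^p t_{ii}^{\,n-i}\Big)(dT).$$
This is the p.d.f. of $T$ for the $n \geq p$ case.

When $p > n$, noting from (3.4) that $(dX_2^\top) = (dT_2)$, we see
$$(dX^\top) = (dX_1^\top)(dX_2^\top) = \Big(\prod_{i=1}^n t_{ii}^{\,n-i}\Big)(dT_1)(Q^\top dQ)(dX_2^\top) = \Big(\prod_{i=1}^n t_{ii}^{\,n-i}\Big)(dT)(Q^\top dQ).$$
Hence the joint (unnormalized) p.d.f. of $T \in \mathbb{L}_{p,n}^{(+)}$ and $Q \in \mathbb{O}_n = \mathbb{V}_{n,n}$ can be expressed as
$$\frac{1}{(2\pi)^{np/2}|\Sigma|^{n/2}}\exp\Big(-\frac{1}{2}\mathrm{tr}\,\Sigma^{-1}TT^\top\Big)\Big(\prod_{i=1}^n t_{ii}^{\,n-i}\Big)(dT)(Q^\top dQ).$$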
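Using (3.1) gives $\int_{\mathbb{V}_{n,n}}(Q^\top dQ) = 2^n\pi^{n^2/2}/\Gamma_n(n/2)$, yielding the p.d.f. of $T$ given in (3.5) with $\nu = n$.

For deriving moments of $T$, we provide the distributions of the nonzero elements in each column of $T$. Define the Cholesky decomposition of $\Sigma$ as $\Sigma = \Xi\Xi^\top$, where $\Xi = (\xi_{i,j}) \in \mathbb{L}_p^{(+)}$. Denote $\Xi_{(1)} = \Xi$, $\Xi_{(p)} = \xi_{p,p}$ and, for $i = 1, \ldots, p-1$,
$$\Xi_{(i)} = \begin{pmatrix} \xi_{i,i} & 0_{p-i}^\top \\ \xi_{(i)} & \Xi_{(i+1)} \end{pmatrix} \in \mathbb{L}_{p-i+1}^{(+)},$$
where
$$\xi_{(i)} = \begin{pmatrix} \xi_{i+1,i} \\ \vdots \\ \xi_{p,i} \end{pmatrix} \in \mathbb{R}^{p-i}, \qquad \Xi_{(i+1)} = \begin{pmatrix} \xi_{i+1,i+1} & & 0 \\ \vdots & \ddots & \\ \xi_{p,i+1} & \cdots & \xi_{p,p} \end{pmatrix} \in \mathbb{L}_{p-i}^{(+)}.$$
Let $\gamma_{(i)} = (\gamma_{i+1,i}, \ldots, \gamma_{p,i})^\top = \xi_{i,i}^{-1}\xi_{(i)}$ for $i = 1, \ldots, p-1$ and let $\sigma_i^2 = \xi_{i,i}^2$ for $i = 1, \ldots, p$. For $i = 1, \ldots, p$, let $\Sigma_{(i)} = \Xi_{(i)}\Xi_{(i)}^\top$, where $\Sigma_{(1)} = \Sigma$ and $\Sigma_{(p)} = \sigma_p^2$. Note that for $i = 1, \ldots, p-1$
$$\Sigma_{(i)} = \begin{pmatrix} 1 & 0_{p-i}^\top \\ \gamma_{(i)} & I_{p-i} \end{pmatrix} \begin{pmatrix} \sigma_i^2 & 0_{p-i}^\top \\ 0_{p-i} & \Sigma_{(i+1)} \end{pmatrix} \begin{pmatrix} 1 & \gamma_{(i)}^\top \\ 0_{p-i} & I_{p-i} \end{pmatrix}. \qquad (3.6)$$
Similarly, for $i = 1, \ldots, \nu$, let $T_{(i)}$ be the submatrix obtained by removing the first $(i-1)$ rows and columns of $T = (t_{i,j}) \in \mathbb{L}_{p,\nu}^{(+)}$. For $i = 1, \ldots, \nu-1$, partition $T_{(i)}$ into four blocks as
$$T_{(i)} = \begin{pmatrix} t_{i,i} & 0_{p-i}^\top \\ t_{(i)} & T_{(i+1)} \end{pmatrix} \in \mathbb{L}_{p-i+1,\nu-i+1}^{(+)},$$
where $t_{(i)} = (t_{i+1,i}, \ldots, t_{p,i})^\top$. Note that for $i = 1, \ldots, \nu-1$
$$T_{(i)}T_{(i)}^\top = \begin{pmatrix} 1 & 0_{p-i}^\top \\ t_{i,i}^{-1}t_{(i)} & I_{p-i} \end{pmatrix} \begin{pmatrix} t_{i,i}^2 & 0_{p-i}^\top \\ 0_{p-i} & T_{(i+1)}T_{(i+1)}^\top \end{pmatrix} \begin{pmatrix} 1 & t_{i,i}^{-1}t_{(i)}^\top \\ 0_{p-i} & I_{p-i} \end{pmatrix} \qquad (3.7)$$
and
$$T_{(\nu)} = (t_{\nu,\nu}, t_{\nu+1,\nu}, \ldots, t_{p,\nu})^\top = \begin{cases} t_{p,p} & \text{for } n \geq p\ (\nu = p), \\ (t_{n,n}, t_{n+1,n}, \ldots, t_{p,n})^\top & \text{for } p > n\ (\nu = n). \end{cases}$$
In the $p > n$ case, define additionally $T_{(n)} = (t_{n,n}, t_{(n)}^\top)^\top$ with $t_{(n)} = (t_{n+1,n}, \ldots, t_{p,n})^\top$. Then we have the following proposition.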
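Proposition 3.10 The columns of $T$ are mutually independent and
$$\begin{aligned}
t_{i,i}^2 &\sim \sigma_i^2\chi_{n-i+1}^2 && \text{for } i = 1, \ldots, \nu, \\
t_{(i)} \mid t_{i,i} &\sim N_{p-i}(t_{i,i}\gamma_{(i)}, \Sigma_{(i+1)}) && \text{for } i = 1, \ldots, n \wedge (p-1).
\end{aligned} \qquad (3.8)$$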
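Proof From Proposition 3.9, it is immediately seen that the columns of $T$ are mutually independent. Using (3.6) and (3.7) gives
$$\mathrm{tr}\,\Sigma_{(i)}^{-1}T_{(i)}T_{(i)}^\top = \frac{t_{i,i}^2}{\sigma_i^2} + (t_{(i)} - t_{i,i}\gamma_{(i)})^\top\Sigma_{(i+1)}^{-1}(t_{(i)} - t_{i,i}\gamma_{(i)}) + \mathrm{tr}\,\Sigma_{(i+1)}^{-1}T_{(i+1)}T_{(i+1)}^\top,$$
so that
$$\begin{aligned}
\mathrm{tr}\,\Sigma^{-1}TT^\top &= \frac{t_{1,1}^2}{\sigma_1^2} + (t_{(1)} - t_{1,1}\gamma_{(1)})^\top\Sigma_{(2)}^{-1}(t_{(1)} - t_{1,1}\gamma_{(1)}) + \mathrm{tr}\,\Sigma_{(2)}^{-1}T_{(2)}T_{(2)}^\top \\
&= \sum_{i=1}^2\frac{t_{i,i}^2}{\sigma_i^2} + \sum_{i=1}^2(t_{(i)} - t_{i,i}\gamma_{(i)})^\top\Sigma_{(i+1)}^{-1}(t_{(i)} - t_{i,i}\gamma_{(i)}) + \mathrm{tr}\,\Sigma_{(3)}^{-1}T_{(3)}T_{(3)}^\top \\
&= \cdots \\
&= \sum_{i=1}^\nu\frac{t_{i,i}^2}{\sigma_i^2} + \sum_{i=1}^{n\wedge(p-1)}(t_{(i)} - t_{i,i}\gamma_{(i)})^\top\Sigma_{(i+1)}^{-1}(t_{(i)} - t_{i,i}\gamma_{(i)}).
\end{aligned}$$
Also, for $i = 1, \ldots, p-1$,
$$|\Sigma| = |\Sigma_{(2)}|\sigma_1^2 = |\Sigma_{(3)}|\prod_{j=1}^2\sigma_j^2 = \cdots = |\Sigma_{(i+1)}|\prod_{j=1}^i\sigma_j^2,$$
yielding
$$|\Sigma|^{n/2} = \prod_{i=1}^{n\wedge(p-1)}|\Sigma_{(i+1)}|^{1/2}\prod_{i=1}^\nu(\sigma_i^2)^{(n-i+1)/2}.$$
It turns out from Proposition 3.1 that
$$\frac{2^\nu\pi^{n\nu/2}}{(2\pi)^{np/2}\Gamma_\nu(n/2)} = \prod_{i=1}^{n\wedge(p-1)}\frac{1}{(2\pi)^{(p-i)/2}}\prod_{i=1}^\nu\frac{2}{2^{(n-i+1)/2}\Gamma((n-i+1)/2)},$$
so that the p.d.f. of $T$, given in Proposition 3.9, can be rewritten as
$$\begin{aligned}
&\prod_{i=1}^{n\wedge(p-1)}\frac{1}{(2\pi)^{(p-i)/2}|\Sigma_{(i+1)}|^{1/2}}\exp\Big\{-\frac{1}{2}(t_{(i)} - t_{i,i}\gamma_{(i)})^\top\Sigma_{(i+1)}^{-1}(t_{(i)} - t_{i,i}\gamma_{(i)})\Big\} \\
&\quad \times \prod_{i=1}^\nu\frac{2}{2^{(n-i+1)/2}\Gamma((n-i+1)/2)}(\sigma_i^2)^{-(n-i+1)/2}\,t_{i,i}^{\,n-i}\exp\Big(-\frac{t_{i,i}^2}{2\sigma_i^2}\Big).
\end{aligned}$$
Thus, for $i = 1, \ldots, n \wedge (p-1)$, $t_{(i)} \mid t_{i,i} \sim N_{p-i}(t_{i,i}\gamma_{(i)}, \Sigma_{(i+1)})$. Finally, making the change of variables $y_i = t_{i,i}^2$ for $i = 1, \ldots, \nu$ gives $y_i \sim \sigma_i^2\chi_{n-i+1}^2$.

The distributional decomposition (3.8) will be used in Sect. 7.6 for estimation of a covariance matrix. Proposition 3.10 immediately provides the following corollary.

Corollary 3.2 Let $X \sim N_{n\times p}(0_{n\times p}, I_n \otimes I_p)$. Denote by $X^\top X = TT^\top$ the Cholesky decomposition of $X^\top X$, where $T = (t_{i,j}) \in \mathbb{L}_{p,\nu}^{(+)}$ with $\nu = n \wedge p$. Then all the nonzero elements of $T$ are mutually independent and distributed as
$$\begin{aligned}
t_{i,i}^2 &\sim \chi_{n-i+1}^2 && \text{for } i = 1, \ldots, \nu, \\
t_{i,j} &\sim N(0, 1) && \text{for } 2 \leq i \leq p \text{ and } 1 \leq j \leq n \wedge (i-1).
\end{aligned}$$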
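Corollary 3.2 suggests a direct way to simulate the factor $T$, and hence a Wishart matrix with identity scale, without generating $X$. The following is a minimal sketch under that reading: the function name `bartlett_factor`, the seed and the dimensions are assumptions for illustration, and a $\chi^2_k$ variate is drawn as a gamma variate with shape $k/2$ and scale 2.

```python
import math
import random

def bartlett_factor(n, p, rng):
    """Draw T (p x nu, nu = min(n, p)) with the element distributions of Corollary 3.2."""
    nu = min(n, p)
    T = [[0.0] * nu for _ in range(p)]
    for i in range(p):                    # 0-based row index
        for j in range(min(i, nu)):       # strictly below the diagonal
            T[i][j] = rng.gauss(0.0, 1.0)
        if i < nu:
            dof = n - i                   # equals n - i + 1 in 1-based indexing
            T[i][i] = math.sqrt(rng.gammavariate(dof / 2, 2.0))
    return T

def gram(T):
    """S = T T^T, a p x p matrix of rank at most nu."""
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in T] for row_i in T]

rng = random.Random(0)      # fixed seed, for reproducibility only
n, p = 3, 5                 # a p > n example, so that nu = n = 3
T = bartlett_factor(n, p, rng)
S = gram(T)
```

Here $S$ is singular because $p > n$, in line with the pseudo-Wishart case of Definition 3.3; choosing $n \geq p$ gives a square lower triangular $T$.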
References
J.A. Díaz-García, R. Gutierrez-Jáimez, K.V. Mardia, Wishart and pseudo-Wishart distributions and some applications to shape theory. J. Multivar. Anal. 63, 73–87 (1997)
A.K. Gupta, D.K. Nagar, Matrix Variate Distributions (Chapman & Hall/CRC, New York, 1999)
A.M. Mathai, Jacobian of Matrix Transformations and Functions of Matrix Argument (World Scientific, Singapore, 1997)
R.J. Muirhead, Aspects of Multivariate Statistical Theory (Wiley, New York, 1982)
M.S. Srivastava, C.G. Khatri, An Introduction to Multivariate Statistics (North Holland, New York, 1979)
H. Uhlig, On singular Wishart and singular multivariate Beta distributions. Ann. Stat. 22, 395–405 (1994)
J. Wishart, The generalised product moment distribution in samples from a normal multivariate population. Biometrika 20, 32–52 (1928)
Chapter 4
Multivariate Linear Model and Group Invariance
The multivariate linear model generalizes the traditional multiple linear regression model to a multidimensional response variable. This chapter provides some fundamental properties of the multivariate linear model and the corresponding canonical form. Group invariance is also explained for shrinkage estimation in the multivariate linear model.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2020 H. Tsukuma and T. Kubokawa, Shrinkage Estimation for Mean and Covariance Matrices, JSS Research Series in Statistics, https://doi.org/10.1007/978-981-15-1596-5_4

4.1 Multivariate Linear Model

The sample size is denoted by $N$. For $i = 1, \ldots, N$, let $y_i$ be a $p$-dimensional column vector of response variables and let
$$y_i = B^\top x_i + \varepsilon_i,$$
where $x_i \in \mathbb{R}^m$ is a column vector of known explanatory variables ($m \leq N$), $B \in \mathbb{R}^{m\times p}$ is a matrix of unknown parameters and $\varepsilon_i \in \mathbb{R}^p$ is a random vector. Define $Y = (y_1, \ldots, y_N)^\top \in \mathbb{R}^{N\times p}$, $X = (x_1, \ldots, x_N)^\top \in \mathbb{R}^{N\times m}$ and $E = (\varepsilon_1, \ldots, \varepsilon_N)^\top \in \mathbb{R}^{N\times p}$. Then we obtain
$$Y = XB + E. \qquad (4.1)$$
Assume that for $i = 1, \ldots, N$ the $\varepsilon_i$'s are independently distributed as $N_p(0_p, \Sigma)$, namely, the error matrix $E$ follows $N_{N\times p}(0_{N\times p}, I_N \otimes \Sigma)$, where $\Sigma \in \mathbb{S}_p^{(+)}$ is unknown. In addition, $X$ is assumed to be of full column rank.

In the literature, model (4.1) is called the multivariate linear model or multivariate linear regression model, and the parameter matrix $B$ is called the regression coefficient matrix or simply the regression matrix. The multivariate linear model (4.1) is closely related to various models from simple to complex, such as a simple mean-variance model, the multivariate analysis of variance (MANOVA) model and the growth curve model.

We now consider estimation of $B$. There are many procedures for estimation of $B$, and we here introduce the least squares method, which is one of the most well-known procedures. Let
$$g(B) = \sum_{i=1}^N \|y_i - B^\top x_i\|^2,$$
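where $\|\cdot\|$ denotes the usual Euclidean norm. Then the least squares method is the minimization of $g(B)$ with respect to $B$. Noting that
$$Y^\top Y = \sum_{i=1}^N y_iy_i^\top, \qquad X^\top Y = \sum_{i=1}^N x_iy_i^\top, \qquad X^\top X = \sum_{i=1}^N x_ix_i^\top,$$
we observe that
$$\begin{aligned}
g(B) &= \sum_{i=1}^N \mathrm{tr}\,(y_i - B^\top x_i)(y_i - B^\top x_i)^\top \\
&= \mathrm{tr}\,(Y^\top Y - B^\top X^\top Y - Y^\top XB + B^\top X^\top XB).
\end{aligned}$$
Recall that $X$ is of full column rank, and thus $X^\top X$ is nonsingular. Completing the square with respect to $B$ gives
$$g(B) = \mathrm{tr}\,(B - \hat{B})^\top X^\top X(B - \hat{B}) + \mathrm{tr}\,(Y^\top Y - \hat{B}^\top X^\top X\hat{B}),$$
where
$$\hat{B} = (X^\top X)^{-1}X^\top Y.$$
Clearly $g(B)$ is minimized at $B = \hat{B}$. The resulting $\hat{B}$ is called the least squares (LS) estimator of $B$.

Since $E \sim N_{N\times p}(0_{N\times p}, I_N \otimes \Sigma)$ in (4.1), namely, $Y \sim N_{N\times p}(XB, I_N \otimes \Sigma)$, the likelihood without a normalizing constant can be written as
$$|\Sigma|^{-N/2}\exp\Big[-\frac{1}{2}\mathrm{tr}\,\Sigma^{-1}\{(B - \hat{B})^\top X^\top X(B - \hat{B}) + (Y^\top Y - \hat{B}^\top X^\top X\hat{B})\}\Big].$$
Thus the LS estimator $\hat{B}$ is the maximum likelihood estimator as well. Also, using Proposition 3.5 leads to
$$\hat{B} \sim N_{m\times p}(B, (X^\top X)^{-1} \otimes \Sigma),$$
implying that $\hat{B}$ is an unbiased estimator of $B$.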
4.1 Multivariate Linear Model
29
Next we find a reasonable estimator for the covariance matrix $\Sigma$ of the random error terms $\varepsilon_i$'s in the multivariate linear model (4.1). In the traditional multiple linear regression model, a residual sum of squares is often used to estimate an error variance, which can be extended to estimating the error covariance $\Sigma$. Let
\[
S = \sum_{i=1}^{N} (y_i - \hat{B}^\top x_i)(y_i - \hat{B}^\top x_i)^\top,
\]
which is called the residual sum of squares matrix. It turns out that
\[
\sum_{i=1}^{N} (y_i - \hat{B}^\top x_i)(y_i - \hat{B}^\top x_i)^\top
 = Y^\top Y - \hat{B}^\top X^\top Y - Y^\top X \hat{B} + \hat{B}^\top X^\top X \hat{B}
 = Y^\top \{ I_N - X (X^\top X)^{-1} X^\top \} Y,
\]
so that
\[
S = Y^\top (I_N - P_X) Y,
\]
where $P_X = X (X^\top X)^{-1} X^\top$. Here, $P_X$ is the orthogonal projection matrix onto the subspace spanned by the columns of $X$.

Now, we write the QR decomposition of $X$ as $X = Q_1 L^\top$, where $Q_1 \in \mathbb{V}_{N,m}$ and $L \in \mathbb{L}_m^{(+)}$. There exists $Q_2 \in \mathbb{V}_{N,n}$ for $n = N - m$ such that $(Q_1, Q_2) \in \mathbb{O}_N$, namely, the set of $N$ columns of the $N \times N$ matrix $(Q_1, Q_2)$ forms an orthonormal basis of $\mathbb{R}^N$. Then
\[
P_X = Q_1 L^\top (L Q_1^\top Q_1 L^\top)^{-1} L Q_1^\top = Q_1 Q_1^\top
\]
and $I_N - P_X = I_N - Q_1 Q_1^\top = Q_2 Q_2^\top$. Since $Q_2^\top X B = Q_2^\top Q_1 L^\top B = 0_{n\times p}$, using Proposition 3.5 gives $Q_2^\top Y \sim \mathcal{N}_{n\times p}(0_{n\times p}, I_n \otimes \Sigma)$. From Definition 3.3,
\[
S = Y^\top Q_2 Q_2^\top Y \sim \mathcal{W}_p^{\nu}(n, \Sigma)
\]
with $\nu = n \wedge p$. Hence, by Proposition 3.8,
\[
\hat{\Sigma}^{UB} = \frac{1}{n} S, \qquad n = N - m,
\]
is unbiased for $\Sigma$. The unbiased estimator $\hat{\Sigma}^{UB}$ is reasonable, but not the maximum likelihood estimator of $\Sigma$. Note that $(\hat{B}, \hat{\Sigma}^{UB})$, or $(\hat{B}, S)$, is a sufficient statistic for $(B, \Sigma)$. See Muirhead (1982, Theorem 10.1.1) or Anderson (2003, Corollary 8.2.1).
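The three expressions for the residual sum of squares matrix can be compared numerically. This is a minimal sketch on simulated data with hypothetical sizes; it uses NumPy's complete QR factorization to obtain an orthogonal matrix whose first m columns play the role of Q_1 and whose remaining columns play the role of Q_2. NumPy does not guarantee a positive diagonal in the triangular factor, which does not affect the projections used here.

```python
import numpy as np

rng = np.random.default_rng(1)
N, m, p = 10, 3, 4                       # hypothetical sizes
X = rng.standard_normal((N, m))
Y = rng.standard_normal((N, p))

# Complete QR: Q = (Q1, Q2) is N x N orthogonal, Q1 spans the columns of X.
Q, _ = np.linalg.qr(X, mode="complete")
Q1, Q2 = Q[:, :m], Q[:, m:]

P_X = X @ np.linalg.solve(X.T @ X, X.T)  # orthogonal projection onto col(X)
B_hat = np.linalg.solve(X.T @ X, X.T @ Y)
resid = Y - X @ B_hat

S_resid = resid.T @ resid                # sum of residual outer products
S_proj = Y.T @ (np.eye(N) - P_X) @ Y     # Y^T (I_N - P_X) Y
S_q2 = (Q2.T @ Y).T @ (Q2.T @ Y)         # Y^T Q2 Q2^T Y

n = N - m
Sigma_ub = S_resid / n                   # unbiased estimator S / n

print(np.allclose(P_X, Q1 @ Q1.T))
print(np.allclose(S_resid, S_proj), np.allclose(S_resid, S_q2))
```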
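4.2 A Canonical Form

Let $Q = (Q_1, Q_2) \in \mathbb{O}_N$, where $Q_1$ and $Q_2$ are defined as in the previous section. Then
\[
Q^\top X = \begin{pmatrix} Q_1^\top X \\ Q_2^\top X \end{pmatrix} = \begin{pmatrix} L^\top \\ 0_{n\times m} \end{pmatrix}
\]
for $n = N - m$. Let $\Theta = (\theta_1, \ldots, \theta_m)^\top = L^\top B \in \mathbb{R}^{m\times p}$. Define
\[
\begin{pmatrix} Z \\ U \end{pmatrix} = Q^\top Y = \begin{pmatrix} Q_1^\top Y \\ Q_2^\top Y \end{pmatrix},
\]
where $Z = (z_1, \ldots, z_m)^\top \in \mathbb{R}^{m\times p}$ and $U = (u_1, \ldots, u_n)^\top \in \mathbb{R}^{n\times p}$. Recall that $Y \sim \mathcal{N}_{N\times p}(XB, I_N \otimes \Sigma)$, so that by Proposition 3.5
\[
\begin{pmatrix} Z \\ U \end{pmatrix} \sim \mathcal{N}_{N\times p}\left( \begin{pmatrix} \Theta \\ 0_{n\times p} \end{pmatrix}, I_N \otimes \Sigma \right).
\]
Hence from Proposition 3.6, $Z$ and $U$ are independently distributed as
\[
Z \sim \mathcal{N}_{m\times p}(\Theta, I_m \otimes \Sigma), \qquad U \sim \mathcal{N}_{n\times p}(0_{n\times p}, I_n \otimes \Sigma), \tag{4.2}
\]
which is a canonical form of the multivariate linear model (4.1). The canonical form (4.2) is equivalent to
\[
z_i \sim \mathcal{N}_p(\theta_i, \Sigma), \quad i = 1, \ldots, m, \qquad u_j \sim \mathcal{N}_p(0_p, \Sigma), \quad j = 1, \ldots, n,
\]
where all of the $z_i$'s and $u_j$'s are mutually independent.

The QR decomposition of $X$ is denoted by $X = Q_1 L^\top$, yielding
\[
\hat{B} = (X^\top X)^{-1} X^\top Y = (L Q_1^\top Q_1 L^\top)^{-1} L Q_1^\top Y = (L^\top)^{-1} Z,
\]
namely, $Z = L^\top \hat{B}$. Here,
\[
S = U^\top U = \sum_{i=1}^{n} u_i u_i^\top
\]
is also called the Wishart matrix, which follows $\mathcal{W}_p^{\nu}(n, \Sigma)$ with $\nu = n \wedge p$. Thus $\hat{B}$ and $S$ are, respectively, made from $Z$ and $U$, and are mutually independent. From (4.2), $(Z, U)$ is a sufficient statistic of $(\Theta, \Sigma)$. In this book, $\Theta$ and $\Sigma$ are hereinafter called the mean and the covariance matrices, respectively, and the estimation problem for $\Theta$ and $\Sigma$ will be treated in the canonical form (4.2).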
4.3 Group Invariance

If an estimation problem is invariant under a transformation such as translation, scaling and rotation, it seems reasonable to require that all decision rules, namely, all possible estimators, be invariant under the transformation. In this section, we briefly provide the definition of group invariance, or simply invariance, and simple examples in the canonical form (4.2) of the multivariate linear model (4.1).

Let $P_\theta$ be a probability distribution on a sample space $\mathcal{X}$ parameterized by $\theta$. A statistical model $\mathcal{M}$ is defined by $\mathcal{M} = \{P_\theta : \theta \in \mathcal{P}\}$, namely, it is a family of probability distributions $P_\theta$, where $\mathcal{P}$ is a parameter space. Let $x$ be an observed random variable having $P_\theta$ and suppose $\theta$ is estimated by using $x$. Denote by $\hat{\theta} = \hat{\theta}(x)$ an estimator of $\theta$ based on $x$, and by $\mathcal{D}$ a decision space consisting of all possible estimators $\hat{\theta}$. A distance between $\theta$ and its estimator $\hat{\theta}$ is measured by a loss function $L(\hat{\theta}, \theta)$.

Let $\mathcal{G}$ be a transformation group which acts on $\mathcal{X}$. For any $g \in \mathcal{G}$, the group action on $x \in \mathcal{X}$ is written as $gx$. A statistical model $\mathcal{M}$ is said to be invariant under $\mathcal{G}$ if, for any $g \in \mathcal{G}$ and $\theta \in \mathcal{P}$, there exists a unique $\bar{g}\theta \in \mathcal{P}$ such that the distribution of $gx$ is $P_{\bar{g}\theta} \in \mathcal{M}$. For an invariant statistical model $\mathcal{M}$ under $\mathcal{G}$, all of the $\bar{g}$'s form a group of transformations from $\mathcal{P}$ into itself, which is called the group induced by $\mathcal{G}$.

When a statistical model $\mathcal{M}$ is invariant under $\mathcal{G}$, a loss function $L(\hat{\theta}, \theta)$ is said to be invariant under $\mathcal{G}$ if, for any $g \in \mathcal{G}$ and $\hat{\theta} \in \mathcal{D}$, there exists an estimator $\tilde{g}\hat{\theta}$ such that $L(\tilde{g}\hat{\theta}, \bar{g}\theta) = L(\hat{\theta}, \theta)$ for any $\theta \in \mathcal{P}$. Note that all of the $\tilde{g}$'s also form a group of transformations from $\mathcal{D}$ into itself. An estimator $\hat{\theta}(x)$ is said to be invariant under $\mathcal{G}$ if $\hat{\theta}(gx) = \tilde{g}\hat{\theta}(x)$ for any $g \in \mathcal{G}$ and $x \in \mathcal{X}$. An estimation problem is said to be invariant under $\mathcal{G}$ if the model $\mathcal{M}$ and the loss function $L$ are invariant under $\mathcal{G}$.
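Now, simple examples of invariance are given in terms of the canonical form (4.2) of the multivariate linear model (4.1), which is rewritten as
\[
X \sim \mathcal{N}_{m\times p}(\Theta, I_m \otimes \Sigma), \qquad Y \sim \mathcal{N}_{n\times p}(0_{n\times p}, I_n \otimes \Sigma), \tag{4.3}
\]
where $\Theta \in \mathbb{R}^{m\times p}$ and $\Sigma \in \mathbb{S}_p^{(+)}$. Consider the problem of estimating the mean matrix $\Theta$ relative to the quadratic loss
\[
L_Q(\hat{\Theta}, \theta) = \mathrm{tr}\,(\hat{\Theta} - \Theta) \Sigma^{-1} (\hat{\Theta} - \Theta)^\top,
\]
where $\hat{\Theta} = \hat{\Theta}(x)$ with $x = (X, Y)$ and $\theta = (\Theta, \Sigma)$.

In the model (4.3), the sample and the parameter spaces are, respectively, $\mathcal{X} = \mathbb{R}^{m\times p} \times \mathbb{R}^{n\times p}$ and $\mathcal{P} = \mathbb{R}^{m\times p} \times \mathbb{S}_p^{(+)}$. The decision space is denoted by $\mathcal{D} = \{\hat{\Theta}(x) \in \mathbb{R}^{m\times p} : x \in \mathcal{X}\}$. We define the transformation group $\mathcal{G}$ as
\[
\mathcal{G} = \{(O, U) : O \in \mathbb{O}_m \text{ and } U \in \mathbb{U}_p\}
\]
with operation $g_1 g_2 = (O_1 O_2, U_2 U_1)$ for any $g_1 = (O_1, U_1)$, $g_2 = (O_2, U_2) \in \mathcal{G}$. The group action of $g = (O, U) \in \mathcal{G}$ on $x \in \mathcal{X}$ and the induced group actions $\bar{g}$ on $\theta \in \mathcal{P}$ and $\tilde{g}$ on $\hat{\Theta} \in \mathcal{D}$ are given, respectively, by the scale transformations
\[
x \to gx = (OXU, YU), \qquad \theta \to \bar{g}\theta = (O\Theta U, U^\top \Sigma U), \qquad \hat{\Theta} \to \tilde{g}\hat{\Theta} = O\hat{\Theta}U. \tag{4.4}
\]
From Proposition 3.5, it is easy to see that the model (4.3) is invariant under $\mathcal{G}$. Also, the invariance of the $L_Q$-loss can be verified because
\[
L_Q(\tilde{g}\hat{\Theta}, \bar{g}\theta) = \mathrm{tr}\,(O\hat{\Theta}U - O\Theta U)(U^\top \Sigma U)^{-1}(O\hat{\Theta}U - O\Theta U)^\top
 = \mathrm{tr}\,(\hat{\Theta} - \Theta)\Sigma^{-1}(\hat{\Theta} - \Theta)^\top = L_Q(\hat{\Theta}, \theta).
\]
Thus the estimation problem of $\Theta$ in (4.3) relative to the $L_Q$-loss is invariant under $\mathcal{G}$.

Let $S = Y^\top Y$. For $n \geq p$, let an estimator of $\Theta$ be
\[
\hat{\Theta} = \hat{\Theta}(x) =
\begin{cases}
X (I_p - c_1 (X^\top X)^{-1} S) & \text{for } m > p, \\
(I_m - c_2 (X S^{-1} X^\top)^{-1}) X & \text{for } p \geq m,
\end{cases}
\]
where $c_1$ and $c_2$ are positive constants. When $n \geq p$, $S$ belongs to $\mathbb{S}_p^{(+)}$ with probability one. Here $\hat{\Theta}$ is invariant under $\mathcal{G}$. Indeed, for $m > p$,
\[
\hat{\Theta}(gx) = OXU[I_p - c_1 \{(OXU)^\top OXU\}^{-1} U^\top S U] = OX\{I_p - c_1 (X^\top X)^{-1} S\}U = O\hat{\Theta}(x)U = \tilde{g}\hat{\Theta}(x),
\]
and for $p \geq m$
\[
\hat{\Theta}(gx) = [I_m - c_2 \{OXU (U^\top S U)^{-1} (OXU)^\top\}^{-1}] OXU = O\{I_m - c_2 (X S^{-1} X^\top)^{-1}\}XU = O\hat{\Theta}(x)U = \tilde{g}\hat{\Theta}(x).
\]
On the other hand, for example, an estimator $\hat{\Theta}_0 = X - X / \mathrm{tr}\, X^\top X$ is not invariant under $\mathcal{G}$. Further, a quadratic-type loss $L_F(\hat{\Theta}, \theta) = \mathrm{tr}\,(\hat{\Theta} - \Theta)(\hat{\Theta} - \Theta)^\top$ is not invariant under $\mathcal{G}$ since
\[
L_F(\tilde{g}\hat{\Theta}, \bar{g}\theta) = \mathrm{tr}\,(O\hat{\Theta}U - O\Theta U)(O\hat{\Theta}U - O\Theta U)^\top
 = \mathrm{tr}\,(\hat{\Theta} - \Theta) U U^\top (\hat{\Theta} - \Theta)^\top \neq L_F(\hat{\Theta}, \theta).
\]
We may need to discuss invariance in terms of the original multivariate linear model (4.1), but this is omitted. For invariance in estimation of the covariance matrix, see Chap. 7.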
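The invariance claim for the case m > p can be illustrated numerically. This is a minimal sketch on simulated data, assuming that the scale group consists of nonsingular p x p matrices; the constant `c1`, the sizes and the helper names `theta_hat` and `theta_0` are hypothetical choices made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
m, p, n = 6, 3, 8                       # hypothetical sizes with m > p and n >= p
c1 = 0.5                                # arbitrary positive constant

def theta_hat(X, Y):
    """X (I_p - c1 (X^T X)^{-1} S) with S = Y^T Y (case m > p)."""
    S = Y.T @ Y
    return X @ (np.eye(X.shape[1]) - c1 * np.linalg.solve(X.T @ X, S))

def theta_0(X, Y):
    """The non-invariant example X - X / tr(X^T X)."""
    return X - X / np.trace(X.T @ X)

X = rng.standard_normal((m, p))
Y = rng.standard_normal((n, p))
O, _ = np.linalg.qr(rng.standard_normal((m, m)))    # orthogonal m x m matrix
U = rng.standard_normal((p, p)) + 3 * np.eye(p)     # generic nonsingular p x p matrix

# Compare the estimator at the transformed data gx = (O X U, Y U)
# with the transformed estimator O theta_hat(x) U.
inv_ok = np.allclose(theta_hat(O @ X @ U, Y @ U), O @ theta_hat(X, Y) @ U)
inv_0 = np.allclose(theta_0(O @ X @ U, Y @ U), O @ theta_0(X, Y) @ U)
print(inv_ok, inv_0)
```

With a generic nonsingular U the first comparison holds up to rounding while the second fails, because tr X^T X changes under the scale transformation.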
A general theory of invariance and its applications in statistics are discussed by Eaton (1983, 1989). For finding decision-theoretically optimal estimators, it is a standard tactic to restrict attention to the class of invariant estimators. See also Lehmann and Casella (1998) for the role of invariance in decision-theoretic estimation.
References
T.W. Anderson, An Introduction to Multivariate Statistical Analysis, 3rd edn. (Wiley, New York, 2003)
M.L. Eaton, Multivariate Statistics: A Vector Space Approach (Wiley, New York, 1983)
M.L. Eaton, Group Invariance Application in Statistics. Regional Conference Series in Probability and Statistics, vol. 1 (Institute of Mathematical Statistics, Hayward, 1989)
E.L. Lehmann, G. Casella, Theory of Point Estimation, 2nd edn. (Springer, New York, 1998)
R.J. Muirhead, Aspects of Multivariate Statistical Theory (Wiley, New York, 1982)
Chapter 5
A Generalized Stein Identity and Matrix Differential Operators
In shrinkage estimation, the Stein (1973, 1981) identity is known as an integration by parts formula for deriving unbiased risk estimates. It is a simple but very powerful mathematical tool and has contributed significantly to the development of shrinkage estimation. This chapter provides a generalized Stein identity in matrix-variate normal distribution model and also some useful results on matrix differential operators for a unified application of the identity to high- and low-dimensional normal models.
5.1 Stein's Identity in Matrix-Variate Normal Distribution

For $X = (x_{ij}) \in \mathbb{R}^{m\times p}$, denote by $d^X_{ij} = \partial/\partial x_{ij}$ the differential operator with respect to the $(i, j)$-th element of $X$. The matrix differential operator with respect to $X$ is defined by $\nabla_X = (d^X_{ij})$, which is an $m \times p$ matrix. Let $G = (g_{ij})$ be a $p \times a$ matrix-valued function such that all the elements of $G$, the $g_{ij}$'s, are absolutely continuous functions of $X$. Define $\nabla_X G$ as a usual matrix product: for $i = 1, \ldots, m$ and $j = 1, \ldots, a$, the $(i, j)$-th element of $\nabla_X G$ is
\[
\{\nabla_X G\}_{ij} = \sum_{k=1}^{p} d^X_{ik} g_{kj}.
\]
As for a scalar-valued function $f$, an $m \times p$ matrix $\nabla_X f(X)$ is defined by $\nabla_X f(X) = \nabla_X \{f(X) I_p\}$. Then a generalized Stein identity is given in the following theorem.

Theorem 5.1 Let $X = (x_{ij}) \sim \mathcal{N}_{m\times p}(\Theta, \Omega \otimes \Sigma)$, where $\Theta = (\theta_{ij}) \in \mathbb{R}^{m\times p}$, $\Omega \in \mathbb{S}_m^{(+)}$ and $\Sigma \in \mathbb{S}_p^{(+)}$. Let $G_1 \in \mathbb{R}^{a\times m}$ and $G_2 \in \mathbb{R}^{p\times b}$ such that
(i) all the elements of $G_1$ and $G_2$ are absolutely continuous functions of $X$,
(ii) $E[|\{G_1 (X - \Theta) G_2\}_{ij}|] < \infty$ for any $i \in \{1, \ldots, a\}$ and $j \in \{1, \ldots, b\}$.
Then
\[
E[G_1 (X - \Theta) G_2] = E[G_1 \Omega \nabla_X \Sigma G_2 + (G_2^\top \Sigma \nabla_X^\top \Omega G_1^\top)^\top]. \tag{5.1}
\]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2020 H. Tsukuma and T. Kubokawa, Shrinkage Estimation for Mean and Covariance Matrices, JSS Research Series in Statistics, https://doi.org/10.1007/978-981-15-1596-5_5
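Proof Let $\Gamma = (\gamma_{ij}) \in \mathbb{U}_m$ and $\Lambda = (\lambda_{ij}) \in \mathbb{U}_p$ such that $\Omega = \Gamma\Gamma^\top$ and $\Sigma = \Lambda\Lambda^\top$. Let $Z = (z_{ij}) = \Gamma^{-1} X (\Lambda^\top)^{-1}$ and $\Xi = (\xi_{ij}) = \Gamma^{-1} \Theta (\Lambda^\top)^{-1}$. Note here from Proposition 3.5 that $Z \sim \mathcal{N}_{m\times p}(\Xi, I_m \otimes I_p)$, namely, the $z_{ij}$'s are independently distributed as $z_{ij} \sim \mathcal{N}(\xi_{ij}, 1)$. Denote $G_1 = G_1(X)$ and $G_2 = G_2(X)$. For $i = 1, \ldots, a$ and $j = 1, \ldots, b$, let $h_{ij}$ be the $(i, j)$-th element of $E[G_1 (X - \Theta) G_2]$, which is given by
\[
h_{ij} = E[\{G_1(\Gamma Z \Lambda^\top)\, \Gamma (Z - \Xi) \Lambda^\top G_2(\Gamma Z \Lambda^\top)\}_{ij}]
 = \sum_{k=1}^{m} \sum_{l=1}^{p} E[\{G_1(\Gamma Z \Lambda^\top)\Gamma\}_{ik} (z_{kl} - \xi_{kl}) \{\Lambda^\top G_2(\Gamma Z \Lambda^\top)\}_{lj}].
\]
By the integration by parts formula, or by the Stein identity (1.3),
\[
h_{ij} = \sum_{k=1}^{m} \sum_{l=1}^{p} E\big[ d^Z_{kl} \big( \{G_1(\Gamma Z \Lambda^\top)\Gamma\}_{ik} \{\Lambda^\top G_2(\Gamma Z \Lambda^\top)\}_{lj} \big) \big]
\]
\[
\phantom{h_{ij}} = \sum_{k=1}^{m} \sum_{l=1}^{p} E\big[ \{G_1(\Gamma Z \Lambda^\top)\Gamma\}_{ik}\, d^Z_{kl} \{\Lambda^\top G_2(\Gamma Z \Lambda^\top)\}_{lj} + \{\Lambda^\top G_2(\Gamma Z \Lambda^\top)\}_{lj}\, d^Z_{kl} \{G_1(\Gamma Z \Lambda^\top)\Gamma\}_{ik} \big],
\]
where $d^Z_{kl} = \partial/\partial z_{kl}$. Since $x_{ij} = \{\Gamma Z \Lambda^\top\}_{ij} = \sum_{q=1}^{m} \sum_{r=1}^{p} \gamma_{iq} z_{qr} \lambda_{jr}$, using the chain rule gives
\[
d^Z_{kl} = \sum_{i=1}^{m} \sum_{j=1}^{p} (d^Z_{kl} x_{ij}) \cdot d^X_{ij} = \sum_{i=1}^{m} \sum_{j=1}^{p} \gamma_{ik} \lambda_{jl} \cdot d^X_{ij} = \{\Gamma^\top \nabla_X \Lambda\}_{kl}, \tag{5.2}
\]
so that
\[
h_{ij} = \sum_{k=1}^{m} \sum_{l=1}^{p} E\big[ \{G_1(X)\Gamma\}_{ik} \{\Gamma^\top \nabla_X \Lambda\}_{kl} \{\Lambda^\top G_2(X)\}_{lj} + \{\Lambda^\top G_2(X)\}_{lj} \{\Gamma^\top \nabla_X \Lambda\}_{kl} \{G_1(X)\Gamma\}_{ik} \big]
\]
\[
\phantom{h_{ij}} = E[\{G_1 \Omega \nabla_X \Sigma G_2\}_{ij} + \{G_2^\top \Sigma \nabla_X^\top \Omega G_1^\top\}_{ji}]
 = E[\{G_1 \Omega \nabla_X \Sigma G_2\}_{ij} + \{(G_2^\top \Sigma \nabla_X^\top \Omega G_1^\top)^\top\}_{ij}].
\]
Thus the proof is complete.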
In the proof of Theorem 5.1, the differentiability of elements of G 1 and G 2 and the interchangeability of integrals are guaranteed, respectively, by conditions (i) and (ii) of Theorem 5.1.
From Theorem 5.1, if $X \sim \mathcal{N}_{m\times p}(\Theta, I_m \otimes \Sigma)$ then, under some suitable conditions,
\[
E[\mathrm{tr}\,(X - \Theta) \Sigma^{-1} G^\top] = E[\mathrm{tr}\, \nabla_X G^\top], \tag{5.3}
\]
where all the elements of $G$ ($\in \mathbb{R}^{m\times p}$) are absolutely continuous functions of $X$. The r.h.s. of (5.3) depends only on expectations of the diagonals of $\nabla_X G^\top$, but not on those of the off-diagonals. See also Bilodeau and Kariya (1989) and Konno (1992) for Stein-type identities on matrix-variate normal distribution. The appendix of this chapter will discuss a simple derivation of (5.3) via the Gauss divergence theorem.

The Stein identity (5.1), or (5.3), yields a useful identity on the chi-square distribution. Let $x = (x_i) \sim \mathcal{N}_n(0_n, \sigma^2 I_n)$ and let $\nabla_x = (\partial/\partial x_i)$ be the $n$-dimensional differential operator vector. For a differentiable function $g(s)$ with $s = \|x\|^2$, using the Stein identity leads to
\[
E\Big[\frac{g(s)}{\sigma^2}\Big] = E\Big[\frac{x^\top x\, g(s)}{\sigma^2 s}\Big] = E\Big[\nabla_x^\top \Big\{ x\, \frac{g(s)}{s} \Big\}\Big] = E\Big[(n - 2)\frac{g(s)}{s} + 2 g'(s)\Big], \tag{5.4}
\]
where $g'(s) = dg(s)/ds$. Since $s \sim \sigma^2 \chi_n^2$, the identity (5.4) is named the chi-square identity, which was derived by Efron and Morris (1976).

The chi-square identity (5.4) can be extended to an identity on the nonsingular Wishart distribution. Let $S = (s_{ij}) \sim \mathcal{W}_p(n, \Sigma)$ and let $D_S$ be the $p \times p$ matrix differential operator whose $(i, j)$-th element is $(1/2)(1 + \delta_{ij})(\partial/\partial s_{ij})$, where $\delta_{ij}$ represents the Kronecker delta, namely, $\delta_{ij} = 1$ if $i = j$ and $\delta_{ij} = 0$ otherwise. Under suitable conditions, we can obtain an identity
\[
E[\mathrm{tr}\, \Sigma^{-1} G] = E[(n - p - 1)\, \mathrm{tr}\, S^{-1} G + 2\, \mathrm{tr}\, D_S G], \tag{5.5}
\]
where all the elements of $G$ ($\in \mathbb{S}_p$) are absolutely continuous functions of $S$. In the literature, the identity (5.5) is called the Haff (1977, 1979) identity, and it is a useful tool for estimation of $\Sigma$. Clearly, when $p = 1$, the Haff identity (5.5) is equivalent to the chi-square identity (5.4). See the appendix of this chapter for a brief derivation of the Haff identity (5.5).
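Two simple special cases may help to see how the identities (5.3) and (5.4) are used. They are worked here only as sanity checks under the stated choices of $G$ and $g$, which are illustrative and not taken from the text; the second one assumes $n > 2$ so that the expectation is finite.

```latex
% Case 1: G = X in (5.3), with X ~ N_{m x p}(Theta, I_m (x) Sigma).
% Right-hand side: {\nabla_X X^T}_{ij} = \sum_k \partial x_{jk} / \partial x_{ik} = p \delta_{ij}.
\[
\mathrm{tr}\, \nabla_X X^\top = \sum_{i=1}^{m} \sum_{k=1}^{p} \frac{\partial x_{ik}}{\partial x_{ik}} = mp .
\]
% Left-hand side: the cross term vanishes because E[X - \Theta] = 0.
\[
E[\mathrm{tr}\,(X - \Theta)\Sigma^{-1} X^\top]
 = E[\mathrm{tr}\,(X - \Theta)\Sigma^{-1}(X - \Theta)^\top]
 = mp .
\]

% Case 2: g(s) = 1 in (5.4), with s ~ sigma^2 chi^2_n and n > 2.
\[
\frac{1}{\sigma^2} = E\Big[\frac{n - 2}{s}\Big]
\quad\Longrightarrow\quad
E\Big[\frac{1}{s}\Big] = \frac{1}{(n - 2)\sigma^2} .
\]
```

The first case recovers the expected value $mp$ of the quadratic form $\mathrm{tr}\,(X - \Theta)\Sigma^{-1}(X - \Theta)^\top$, and the second recovers the familiar mean of an inverse chi-square variable.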
5.2 Some Useful Results on Matrix Differential Operators

Let $Y = (y_{ab}) \in \mathbb{R}^{n\times p}$. Denote the $n \times p$ matrix differential operator with respect to $Y$ by $\nabla_Y = (d^Y_{ab})$ with $d^Y_{ab} = \partial/\partial y_{ab}$. Here, we provide calculus formulas for $S = (s_{ij}) = Y^\top Y \in \mathbb{S}_{p,\nu}^{(+)}$ with $\nu = n \wedge p$ and its Moore-Penrose inverse $S^+ = (s^+_{ij})$.

Lemma 5.1 Abbreviate $d^Y_{ab}$ to $d$. Denote $dS = (ds_{ij})$ and $dS^+ = (ds^+_{ij})$. Then
\[
dS^+ = -S^+ [dS] S^+ + (I_p - SS^+)[dS] S^+ S^+ + S^+ S^+ [dS](I_p - SS^+).
\]

Proof Note that $S^+ = S^+ S S^+$, $SS^+ = S^+ S$ and $S = SSS^+$. Differentiating both sides of $S^+ = S^+ \times S \times S^+$, we have $dS^+ = [dS^+] SS^+ + S^+ [dS] S^+ + SS^+ dS^+$, so that $[dS^+] SS^+ = -S^+ [dS] S^+ + (I_p - SS^+) dS^+$. Thus
\[
dS^+ = [dS^+]\{SS^+ + (I_p - SS^+)\} = [dS^+] SS^+ + [dS^+](I_p - SS^+)
 = -S^+ [dS] S^+ + (I_p - SS^+) dS^+ + \{(I_p - SS^+) dS^+\}^\top. \tag{5.6}
\]
Differentiating both sides of $S = SS^+ \times S$ yields $dS = [d(SS^+)] S + SS^+ dS$, which implies that $[d(SS^+)] S = (I_p - SS^+) dS$, which further implies that
\[
[d(SS^+)] S^+ = (I_p - SS^+)[dS] S^+ S^+. \tag{5.7}
\]
Differentiating both sides of $S^+ = SS^+ \times S^+$, we obtain $dS^+ = [d(SS^+)] S^+ + SS^+ dS^+$, namely,
\[
(I_p - SS^+) dS^+ = [d(SS^+)] S^+ = (I_p - SS^+)[dS] S^+ S^+, \tag{5.8}
\]
where the second equality follows from (5.7). Substituting (5.8) into (5.6) completes the proof.

Lemma 5.2 Denote the Kronecker delta by $\delta_{ij}$, namely, $\delta_{ij} = 1$ for $i = j$ and $\delta_{ij} = 0$ for $i \neq j$. For $a, i \in \{1, \ldots, n\}$ and $b, j, k \in \{1, \ldots, p\}$, we have
(i) $d^Y_{ab}\, y_{ij} = \delta_{ai} \delta_{bj}$,
(ii) $d^Y_{ab}\, s_{jk} = \delta_{bj} y_{ak} + \delta_{bk} y_{aj}$,
(iii) $d^Y_{ab}\, s^+_{jk} = -\{YS^+\}_{ak} s^+_{bj} - \{YS^+\}_{aj} s^+_{bk} + \{YS^+S^+\}_{ak} \{I_p - SS^+\}_{bj} + \{YS^+S^+\}_{aj} \{I_p - SS^+\}_{bk}$,
(iv) $d^Y_{ab} \{YS^+\}_{ik} = \{I_n - YS^+Y^\top\}_{ai} s^+_{bk} + \{YS^+S^+Y^\top\}_{ai} \{I_p - SS^+\}_{bk} - \{YS^+\}_{ak} \{YS^+\}_{ib}$.

Proof Obviously, (i) holds. Differentiating $s_{jk} = \sum_{c=1}^{n} y_{cj} y_{ck}$ with respect to $y_{ab}$ yields
\[
d^Y_{ab}\, s_{jk} = \sum_{c=1}^{n} (y_{ck}\, d^Y_{ab} y_{cj} + y_{cj}\, d^Y_{ab} y_{ck})
 = \sum_{c=1}^{n} (y_{ck} \delta_{ac} \delta_{bj} + y_{cj} \delta_{ac} \delta_{bk}) = \delta_{bj} y_{ak} + y_{aj} \delta_{bk},
\]
which shows (ii).
From Lemma 5.1 and (ii), it is observed that
\[
d^Y_{ab}\, s^+_{jk} = \{d^Y_{ab} S^+\}_{jk}
 = \sum_{c=1}^{p} \sum_{d=1}^{p} \Big( -s^+_{jc} [d^Y_{ab} s_{cd}] s^+_{dk} + \{I_p - SS^+\}_{jc} [d^Y_{ab} s_{cd}] \{S^+S^+\}_{dk} + \{S^+S^+\}_{jc} [d^Y_{ab} s_{cd}] \{I_p - SS^+\}_{dk} \Big)
\]
\[
 = -s^+_{bj} \{YS^+\}_{ak} - \{YS^+\}_{aj} s^+_{bk}
 + \{I_p - SS^+\}_{bj} \{YS^+S^+\}_{ak} + \{Y(I_p - SS^+)\}_{aj} \{S^+S^+\}_{bk}
 + \{S^+S^+\}_{bj} \{Y(I_p - SS^+)\}_{ak} + \{YS^+S^+\}_{aj} \{I_p - SS^+\}_{bk}.
\]
Noting that $Y(I_p - SS^+) = 0_{n\times p}$ gives (iii). In view of the product rule,
\[
d^Y_{ab} \{YS^+\}_{ik} = \sum_{c=1}^{p} \big( [d^Y_{ab} y_{ic}] s^+_{ck} + y_{ic}\, d^Y_{ab} s^+_{ck} \big).
\]
Using (i) and (iii) and subsequently summing over all $c$, we obtain (iv).
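The differential formula of Lemma 5.1 can be compared with a finite-difference approximation. The following is a minimal sketch, assuming a simulated Y with p > n so that S = Y^T Y is singular; it differentiates along an arbitrary direction E in Y, which is a linear combination of the elementwise operators and therefore obeys the same formula. The closed form S^+ = Y^T (Y Y^T)^{-2} Y used below is valid when Y has full row rank, and the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 3, 5                          # p > n, so S = Y^T Y has rank n < p
Y = rng.standard_normal((n, p))
E = rng.standard_normal((n, p))      # direction in which Y is perturbed

def S_plus_of(Y):
    """Moore-Penrose inverse of S = Y^T Y for Y of full row rank:
    S^+ = Y^T (Y Y^T)^{-2} Y."""
    K = np.linalg.inv(Y @ Y.T)
    return Y.T @ K @ K @ Y

S, Sp = Y.T @ Y, S_plus_of(Y)
I = np.eye(p)
dS = E.T @ Y + Y.T @ E               # derivative of S = Y^T Y along E

# Right-hand side of Lemma 5.1.
dSp_formula = (-Sp @ dS @ Sp
               + (I - S @ Sp) @ dS @ Sp @ Sp
               + Sp @ Sp @ dS @ (I - S @ Sp))

# Central finite difference of S^+ along E.
eps = 1e-6
dSp_numeric = (S_plus_of(Y + eps * E) - S_plus_of(Y - eps * E)) / (2 * eps)

err = np.max(np.abs(dSp_formula - dSp_numeric))
print(err)
```

The last two terms of the formula vanish when S is nonsingular, in which case the expression reduces to the familiar derivative of an ordinary inverse.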
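The following lemma is useful in estimation of a covariance matrix in multivariate normal distribution model. The proof of the lemma is similar to that of the $p > n$ case given in Konno (2009). See also Stein (1977), Sheena (1995) and Kubokawa and Srivastava (2008).

Lemma 5.3 Denote by $S = H L H^\top$ the eigenvalue decomposition of $S = Y^\top Y$, where $L = \mathrm{diag}(\ell_1, \ldots, \ell_\nu) \in \mathbb{D}_\nu^{(\geq 0)}$ and $H \in \mathbb{V}_{p,\nu}$ with $\nu = n \wedge p$. Let $\Phi = \mathrm{diag}(\phi_1, \ldots, \phi_\nu) \in \mathbb{D}_\nu$ such that all the diagonal elements $\phi_i$'s are differentiable functions of $L$. Then
\[
\nabla_Y^\top Y S^+ H \Phi H^\top = H \Phi^* H^\top + (\mathrm{tr}\, L^{-1} \Phi)(I_p - HH^\top),
\]
where $\Phi^* = \mathrm{diag}(\phi_1^*, \ldots, \phi_\nu^*)$ and for $i = 1, \ldots, \nu$
\[
\phi_i^* = (n - \nu - 1)\frac{\phi_i}{\ell_i} + 2\frac{\partial \phi_i}{\partial \ell_i} + \sum_{j \neq i}^{\nu} \frac{\phi_i - \phi_j}{\ell_i - \ell_j}.
\]
In particular,
\[
\mathrm{tr}\, \nabla_Y^\top Y S^+ H \Phi H^\top = \sum_{i=1}^{\nu} \Big\{ (|n - p| - 1)\frac{\phi_i}{\ell_i} + 2\frac{\partial \phi_i}{\partial \ell_i} + 2 \sum_{j > i}^{\nu} \frac{\phi_i - \phi_j}{\ell_i - \ell_j} \Big\}.
\]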
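Appendix

In this appendix, we first give another derivation of the Stein identity (5.3). Denote by $f(X)$ the p.d.f. of $\mathcal{N}_{m\times p}(\Theta, I_m \otimes \Sigma)$. Let
\[
I_{ST}(G) = \int_{\mathbb{R}^{m\times p}} \mathrm{tr}\, \nabla_X \{G^\top f(X)\}(dX).
\]
It follows that $\nabla_X f(X) = -(X - \Theta)\Sigma^{-1} f(X)$, so that
\[
I_{ST}(G) = E[\mathrm{tr}\, \nabla_X G^\top] - E[\mathrm{tr}\,(X - \Theta)\Sigma^{-1} G^\top]
\]
provided the expectations exist. Hence the Stein identity (5.3) can be verified if $I_{ST}(G) = 0$.

For $r > 0$, let $B_r = \{X \in \mathbb{R}^{m\times p} : \|\mathrm{vec}(X)\| \leq r\}$, where $\|\cdot\|$ is the usual Euclidean norm and $\mathrm{vec}(\cdot)$ is defined in Definition 2.3. Then $B_r \to \mathbb{R}^{m\times p}$ as $r \to \infty$ and
\[
I_{ST}(G) = \lim_{r\to\infty} \int_{B_r} \mathrm{vec}(\nabla_X)^\top \mathrm{vec}(G f(X))(dX).
\]
The boundary of $B_r$ is expressed by $\partial B_r = \{\mathrm{vec}(X) \in \mathbb{R}^{mp} : \|\mathrm{vec}(X)\| = r\}$. Denote by $u$ an outward unit normal vector at a point $\mathrm{vec}(X) \in \partial B_r$. Let $\lambda_{\partial B_r}$ be Lebesgue measure on $\partial B_r$. By the Gauss divergence theorem,
\[
I_{ST}(G) = \lim_{r\to\infty} \int_{\partial B_r} u^\top \mathrm{vec}(G) f(X)(d\lambda_{\partial B_r}).
\]
For details of the Gauss divergence theorem, see Fleming (1977). Let $o(\cdot)$ be the Landau symbol, namely, for real-valued functions $f(x)$ and $g(x)$ with $g(x) \neq 0$, we write $f(x) = o(g(x))$ when $\lim_{x\to c} |f(x)/g(x)| = 0$ for an extended real number $c$. If
\[
\sup_{\mathrm{vec}(X) \in \partial B_r} \|\mathrm{vec}(G)\| f(X) = o(r^{1-mp}) \quad \text{as } r \to \infty,
\]
then $I_{ST}(G) = 0$. In fact,
\[
\int_{\partial B_r} |u^\top \mathrm{vec}(G)| f(X)(d\lambda_{\partial B_r}) \leq \int_{\partial B_r} \|\mathrm{vec}(G)\| f(X)(d\lambda_{\partial B_r}) \leq o(r^{1-mp}) \int_{\partial B_r} (d\lambda_{\partial B_r}) = o(1),
\]
because $\int_{\partial B_r} (d\lambda_{\partial B_r})$ is the surface area of the $(mp - 1)$-sphere of radius $r$ in $\mathbb{R}^{mp}$, namely, $\int_{\partial B_r} (d\lambda_{\partial B_r}) \approx r^{mp-1}$.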
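Next, a simple derivation of the Haff identity (5.5) is provided by using the Gauss divergence theorem. The derivation is essentially the same as Haff (1977, 1979). Let $f(S)$ be the p.d.f. of $\mathcal{W}_p(n, \Sigma)$. For a differentiable matrix-valued function $G \in \mathbb{S}_p$, let
\[
I_{HF}(G) = \int_{\mathbb{S}_p^{(+)}} \mathrm{tr}\, D_S \{G f(S)\}(dS).
\]
Since $D_S |S| = S^{-1}|S|$ and $D_S\, \mathrm{tr}\, \Sigma^{-1} S = \Sigma^{-1}$, we get $D_S f(S) = \{(n - p - 1) S^{-1} - \Sigma^{-1}\} f(S)/2$, implying that
\[
I_{HF}(G) = E[\mathrm{tr}\, D_S G] + \frac{n - p - 1}{2} E[\mathrm{tr}\, S^{-1} G] - \frac{1}{2} E[\mathrm{tr}\, \Sigma^{-1} G]
\]
provided the expectations exist. Hence the Haff identity (5.5) follows if $I_{HF}(G) = 0$.

Denote by $\partial/\partial S = (\partial/\partial s_{ij})$ the $p \times p$ matrix differential operator with respect to $S \in \mathbb{S}_p$. For $A = (a_{ij}) \in \mathbb{S}_p$, define
\[
\mathrm{Vec}(A) = (a_{11}, a_{21}, \ldots, a_{p1}, a_{22}, a_{32}, \ldots, a_{p2}, \ldots, a_{p-1,p-1}, a_{p,p-1}, a_{pp})^\top \in \mathbb{R}^q,
\]
where $q = p(p + 1)/2$. From symmetry of $G$, it holds that
\[
\mathrm{tr}\, D_S \{G f(S)\} = \sum_{i=1}^{p} \sum_{j=1}^{p} \frac{1 + \delta_{ij}}{2} \frac{\partial}{\partial s_{ij}} \{g_{ji} f(S)\}
 = \sum_{i=1}^{p} \sum_{j=1}^{i} \frac{\partial}{\partial s_{ij}} \{g_{ij} f(S)\}
 = \mathrm{Vec}(\partial/\partial S)^\top \mathrm{Vec}(G f(S)),
\]
so that
\[
I_{HF}(G) = \int_{\mathbb{S}_p^{(+)}} \mathrm{Vec}(\partial/\partial S)^\top \mathrm{Vec}(G f(S))(dS).
\]
For $r > 0$, let $\partial B_r^q = \{\mathrm{Vec}(S) \in \mathbb{R}^q : \|\mathrm{Vec}(S)\| = r\}$ and, for $0 < r_1 \leq r_2 < \infty$, let $C_{r_1,r_2} = \{\mathrm{Vec}(S) \in \mathbb{R}^q : r_1 \leq \|\mathrm{Vec}(S)\| \leq r_2\}$. Then $C_{r_1,r_2} \cap \mathbb{S}_p^{(+)} \to \mathbb{S}_p^{(+)}$ as $r_1 \to 0$ and $r_2 \to \infty$. The boundary of $C_{r_1,r_2} \cap \mathbb{S}_p^{(+)}$ can be expressed as $\bigcup_{i=1}^{3} \partial B_i$, where $\partial B_1$, $\partial B_2$ and $\partial B_3$ are certain sets satisfying $\partial B_1 \subset \partial B_{r_1}^q$, $\partial B_2 \subset \partial B_{r_2}^q$ and $\partial B_3 \subset \partial \mathbb{S}_p^{(+)}$. Note that, for any point $S \in \partial \mathbb{S}_p^{(+)}$, $|S| = 0$, namely, $f(S) = 0$ when $n - p - 1 > 0$. Let $u_1 = -\mathrm{Vec}(S)/\|\mathrm{Vec}(S)\|$ for $\mathrm{Vec}(S) \in \partial B_{r_1}^q$ and $u_2 = \mathrm{Vec}(S)/\|\mathrm{Vec}(S)\|$ for $\mathrm{Vec}(S) \in \partial B_{r_2}^q$. Denote by $\lambda_{\partial B_r^q}$ Lebesgue measure on $\partial B_r^q$. Using the Gauss divergence theorem gives
\[
I_{HF}(G) = \lim_{r_1 \to 0} \int_{\partial B_1} u_1^\top \mathrm{Vec}(G) f(S)(d\lambda_{\partial B_{r_1}^q}) + \lim_{r_2 \to \infty} \int_{\partial B_2} u_2^\top \mathrm{Vec}(G) f(S)(d\lambda_{\partial B_{r_2}^q}).
\]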
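Using the Landau symbol $o(\cdot)$, we assume that
\[
\sup_{\mathrm{Vec}(S) \in \partial B_1} \|\mathrm{Vec}(G)\| f(S) = o(r_1^{1-q}) \quad \text{as } r_1 \to 0
\]
and
\[
\sup_{\mathrm{Vec}(S) \in \partial B_2} \|\mathrm{Vec}(G)\| f(S) = o(r_2^{1-q}) \quad \text{as } r_2 \to \infty.
\]
Under these assumptions, we can see that
\[
\int_{\partial B_1} |u_1^\top \mathrm{Vec}(G)| f(S)(d\lambda_{\partial B_{r_1}^q}) \leq \int_{\partial B_1} \|\mathrm{Vec}(G)\| f(S)(d\lambda_{\partial B_{r_1}^q}) \leq o(r_1^{1-q}) \int_{\partial B_{r_1}^q} (d\lambda_{\partial B_{r_1}^q}) = o(1) \quad \text{as } r_1 \to 0
\]
and also
\[
\int_{\partial B_2} |u_2^\top \mathrm{Vec}(G)| f(S)(d\lambda_{\partial B_{r_2}^q}) \leq o(r_2^{1-q}) \int_{\partial B_{r_2}^q} (d\lambda_{\partial B_{r_2}^q}) = o(1) \quad \text{as } r_2 \to \infty,
\]
so that $I_{HF}(G) = 0$.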
References
M. Bilodeau, T. Kariya, Minimax estimators in the normal MANOVA model. J. Multivar. Anal. 28, 260–270 (1989)
B. Efron, C. Morris, Families of minimax estimators of the mean of a multivariate normal distribution. Ann. Stat. 4, 11–21 (1976)
W. Fleming, Functions of Several Variables, 2nd edn. (Springer, New York, 1977)
L.R. Haff, Minimax estimators for a multinormal precision matrix. J. Multivar. Anal. 7, 374–385 (1977)
L.R. Haff, An identity for the Wishart distribution with applications. J. Multivar. Anal. 9, 531–544 (1979)
Y. Konno, Improved estimation of matrix of normal mean and eigenvalues in the multivariate F-distribution. Doctoral dissertation, Institute of Mathematics, University of Tsukuba, 1992. http://mcm-www.jwu.ac.jp/~konno/
Y. Konno, Shrinkage estimators for large covariance matrices in multivariate real and complex normal distributions under an invariant quadratic loss. J. Multivar. Anal. 100, 2237–2253 (2009)
T. Kubokawa, M.S. Srivastava, Estimation of the precision matrix of a singular Wishart distribution and its application in high-dimensional data. J. Multivar. Anal. 99, 1906–1928 (2008)
Y. Sheena, Unbiased estimator of risk for an orthogonally invariant estimator of a covariance matrix. J. Jpn. Stat. Soc. 25, 35–48 (1995)
C. Stein, Estimation of the mean of a multivariate normal distribution. Technical Reports No.48 (Department of Statistics, Stanford University, Stanford, 1973)
C. Stein, Lectures on the theory of estimation of many parameters, in Proceedings of Scientific Seminars of the Steklov Institute Studies in the Statistical Theory of Estimation, Part I, vol. 74, ed. by I.A. Ibragimov, M.S. Nikulin (Leningrad Division, 1977), pp. 4–65
C. Stein, Estimation of the mean of a multivariate normal distribution. Ann. Stat. 9, 1135–1151 (1981)
Chapter 6
Estimation of the Mean Matrix
This chapter introduces a unified approach to high- and low-dimensional cases for matricial shrinkage estimation of a normal mean matrix with unknown covariance matrix. A historical background is briefly explained and matricial shrinkage estimators are motivated from an empirical Bayes method. An unbiased risk estimate is developed in a unified manner for a class of estimators covering all possible orderings of the sample size and the dimensions. Specific examples of matricial shrinkage estimators are provided and some related topics are also discussed.
6.1 Introduction

Matricial shrinkage estimation of a mean matrix of a matrix-variate normal distribution has been studied since Efron and Morris (1972, 1976) and Stein (1973). Assume now that an $m \times p$ observed data matrix $X$ follows $\mathcal{N}_{m\times p}(\Theta, I_m \otimes I_p)$, where $p \geq m$ and $\Theta$ is unknown, and consider estimation of the mean matrix $\Theta$. The performance of its estimator $\hat{\Theta}$ is evaluated by a risk function relative to the squared Frobenius norm loss
\[
L_F(\hat{\Theta}, \Theta) = \|\hat{\Theta} - \Theta\|_F^2 = \mathrm{tr}\,(\hat{\Theta} - \Theta)(\hat{\Theta} - \Theta)^\top.
\]
The maximum likelihood (ML) estimator of $\Theta$ is $\hat{\Theta}^{ML} = X$. It is unbiased and minimax. Efron and Morris (1972) considered empirical Bayes estimation for $\Theta$ and showed that $\hat{\Theta}^{ML}$ is dominated by the resulting empirical Bayes estimator of the form
\[
\hat{\Theta}^{EM} = \{I_m - c_0 (X X^\top)^{-1}\} X, \qquad c_0 = p - m - 1. \tag{6.1}
\]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2020 H. Tsukuma and T. Kubokawa, Shrinkage Estimation for Mean and Covariance Matrices, JSS Research Series in Statistics, https://doi.org/10.1007/978-981-15-1596-5_6
This is equivalent to $\{I_m - c_0 (X X^\top)^{-1}\} \hat{\Theta}^{ML}$, which is a matrix multiple of $\hat{\Theta}^{ML}$, while the James-Stein (1961) type shrinkage estimator can be defined by
\[
\hat{\Theta}^{JS} = \Big(1 - \frac{mp - 2}{\|X\|_F^2}\Big) \hat{\Theta}^{ML} = \Big(1 - \frac{mp - 2}{\mathrm{tr}\, X X^\top}\Big) \hat{\Theta}^{ML},
\]
which is a scalar multiple of $\hat{\Theta}^{ML}$. Therefore $\hat{\Theta}^{EM}$ and $\hat{\Theta}^{JS}$ are called, respectively, matricial and scalar shrinkage estimators.

The Efron-Morris estimator $\hat{\Theta}^{EM}$ can be written as $\hat{\Theta}^{EM} = (I_m - c_0 B L^{-1} B^\top) X$, where $X X^\top = B L B^\top$ such that $L = \mathrm{diag}(l_1, \ldots, l_m) \in \mathbb{D}_m^{(\geq 0)}$ and $B \in \mathbb{O}_m$. Efron and Morris (1972) also pointed out an interesting relationship between the mean matrix estimation and estimating a restricted precision matrix. The relationship suggests a positive-part rule for $\hat{\Theta}^{EM}$ and in fact they showed that
\[
\hat{\Theta}^{PEM} = (I_m - B C L^{-1} B^\top) X, \qquad C = \mathrm{diag}(c_1, \ldots, c_m), \quad c_i = \min(l_i, c_0), \tag{6.2}
\]
dominates $\hat{\Theta}^{EM}$ under the $L_F$-loss. Efron and Morris (1976) presented another improved estimator of the form
\[
\hat{\Theta}^{MEM} = \hat{\Theta}^{EM} - \frac{(m - 1)(m + 2)}{\|X\|_F^2} X, \tag{6.3}
\]
which is uniformly better than $\hat{\Theta}^{EM}$ under the $L_F$-loss. Meanwhile, Stein (1973) considered a multivariate generalization of Baranchik's (1970) estimator for a mean vector of multivariate normal distribution. Stein's class of estimators is given by
\[
\hat{\Theta}^{\Phi} = \{I_m - B \Phi(L) B^\top\} X, \qquad \Phi(L) = \mathrm{diag}(\phi_1(L), \ldots, \phi_m(L)), \tag{6.4}
\]
where the diagonals of $\Phi(L)$ are functions of $L$. Stein (1973) derived an unbiased risk estimate of $\hat{\Theta}^{\Phi}$ to provide alternative estimators. For example, he proposed
\[
\hat{\Theta}^{ST} = (I_m - B D L^{-1} B^\top) X, \qquad D = \mathrm{diag}(d_1, \ldots, d_m), \quad d_i = m + p - 2i - 1, \tag{6.5}
\]
which dominates $\hat{\Theta}^{EM}$ relative to the $L_F$-loss.

The purpose of this chapter is to extend these results above to the unknown covariance case: Assume that $X$ and $Y$ are mutually independent random matrices distributed as, respectively,
\[
X \sim \mathcal{N}_{m\times p}(\Theta, I_m \otimes \Sigma), \qquad Y \sim \mathcal{N}_{n\times p}(0_{n\times p}, I_n \otimes \Sigma), \tag{6.6}
\]
where $\Theta \in \mathbb{R}^{m\times p}$ and $\Sigma \in \mathbb{S}_p^{(+)}$ are unknown. The model (6.6) is a canonical form of the multivariate linear regression model (4.1). Denote by $\hat{\Theta}$ an estimator based on
$X$ and $S = Y^\top Y$. The problem we consider in this chapter is estimation of the mean matrix $\Theta$ relative to the invariant quadratic loss
\[
L(\hat{\Theta}, \Theta \mid \Sigma) = \mathrm{tr}\,(\hat{\Theta} - \Theta) \Sigma^{-1} (\hat{\Theta} - \Theta)^\top. \tag{6.7}
\]
The invariance of (6.7) follows under (4.4) and, more generally, under the group of transformations $\hat{\Theta} \to O\hat{\Theta}U + A$, $\Theta \to O\Theta U + A$ and $\Sigma \to U^\top \Sigma U$ for any $O \in \mathbb{O}_m$, $U \in \mathbb{U}_p$ and $A \in \mathbb{R}^{m\times p}$. The performance of $\hat{\Theta}$ is measured by the risk function $R(\hat{\Theta}, \Theta) = E[L(\hat{\Theta}, \Theta \mid \Sigma)]$, where $E$ is expectation taken with respect to (6.6).

The ML estimator of $\Theta$ in (6.6) is $\hat{\Theta}^{ML} = X$, which is a minimax estimator with the constant risk $mp$. Thus all estimators dominating $\hat{\Theta}^{ML}$ are minimax relative to the quadratic loss (6.7). When $n \geq p$ in (6.6), so that the Wishart matrix $S$ is nonsingular with probability one, some studies of improving $\hat{\Theta}^{ML}$ via matricial shrinkage estimation can be found in Bilodeau and Kariya (1989), Honda (1991), Konno (1990, 1991, 1992) and Tsukuma and Kubokawa (2007). Bilodeau and Kariya (1989) and Konno (1992) studied general classes of matricial shrinkage estimators which can be regarded as an extension of (6.4). In particular, Konno's (1992) class has the form
\[
\hat{\Theta}^{K} =
\begin{cases}
X \{I_p - Q \Phi(F) Q^{-1}\} & \text{for } m > p, \\
\{I_m - R \Phi(F) R^\top\} X & \text{for } p \geq m,
\end{cases} \tag{6.8}
\]
where $F \in \mathbb{D}_{m\wedge p}^{(\geq 0)}$, $Q \in \mathbb{U}_p$ and $R \in \mathbb{O}_m$ satisfy
\[
\begin{cases}
Q^\top S Q = I_p \text{ and } Q^\top X^\top X Q = F & \text{for } m > p, \\
X S^{-1} X^\top = R F R^\top & \text{for } p \geq m,
\end{cases}
\]
and $\Phi(F) \in \mathbb{D}_{m\wedge p}$ whose diagonal elements are functions of $F$. Konno (1992) showed that an unbiased risk estimate of the class (6.8) is only a function of $F$ for both cases of $m > p$ and $p \geq m$.

Numerical comparison of shrinkage estimators has been carried out by Tsukuma and Kubokawa (2007). The numerical results suggest that when the true eigenvalues of $\Theta \Sigma^{-1} \Theta^\top$ are scattered, matricial shrinkage estimators outperform a James-Stein (1961) type scalar one, which may motivate us to study matricial shrinkage estimation.

This chapter is based in part on Tsukuma and Kubokawa (2015). The structure of this chapter is as follows. In Sect. 6.2, via an empirical Bayes method, we begin by deriving a unified Efron-Morris estimator for any possible ordering on $m$, $p$ and $n$. Section 6.3 yields a unified class of matricial shrinkage estimators from the class (6.8) and studies some properties of the unified class. In Sect. 6.4 we derive an unbiased risk estimate of the unified class, and Sect. 6.5 gives specific examples of matricial shrinkage estimators. Section 6.6 discusses some related topics including a method of positive-part rule and an extension to the GMANOVA model. In the Appendix, we supplement some results on matrix differential operators.
6.2 The Unified Efron-Morris Type Estimators Including Singular Cases

6.2.1 Empirical Bayes Methods

First, when $n \geq p$, namely, when $S = Y^\top Y$ is nonsingular, we will derive empirical Bayes estimators of $\Theta$ in (6.6) individually in the cases of $m > p$ and $p \geq m$.

In the case of $m > p$, the prior distribution of $\Theta$ is assumed to be $\mathcal{N}_{m\times p}(0_{m\times p}, I_m \otimes A)$, where $A \in \mathbb{S}_p^{(+)}$ is unknown. Then the posterior distribution of $\Theta$ and the marginal distribution of $X$ are, respectively,
\[
\Theta \mid X \sim \mathcal{N}_{m\times p}(X(I_p - \Psi), I_m \otimes (\Sigma^{-1} + A^{-1})^{-1}), \qquad X \sim \mathcal{N}_{m\times p}(0_{m\times p}, I_m \otimes (\Sigma + A)),
\]
where $\Psi = (\Sigma + A)^{-1} \Sigma$. Thus the posterior mean of $\Theta$ is $\hat{\Theta}^{B} = X(I_p - \Psi)$. Since $\Psi$ is unknown, it may be estimated from the marginal distributions of $S$ and $W = X^\top X$, which are given by $S \sim \mathcal{W}_p(n, \Sigma)$ and $W \sim \mathcal{W}_p(m, \Sigma + A)$, respectively. It is reasonable that, as an estimator of $\Psi$, we take $\hat{\Psi} = c W^{-1} S$ for a suitable constant $c$. Substituting $\hat{\Psi}$ for $\Psi$ in $\hat{\Theta}^{B}$ yields an Efron-Morris (1972) type empirical Bayes estimator of the form
\[
\hat{\Theta}_1^{EMK} = X(I_p - \hat{\Psi}) = X\{I_p - c (X^\top X)^{-1} S\}.
\]
Next, we treat the case of $p \geq m$. Assume that the prior distribution of $\Theta$ is $\mathcal{N}_{m\times p}(0_{m\times p}, B \otimes \Sigma)$, where $B \in \mathbb{S}_m^{(+)}$ is unknown. Then the posterior distribution of $\Theta$ and the marginal distribution of $X$ are, respectively,
\[
\Theta \mid X \sim \mathcal{N}_{m\times p}((I_m - \Psi) X, (I_m - \Psi) \otimes \Sigma), \qquad X \sim \mathcal{N}_{m\times p}(0_{m\times p}, \Psi^{-1} \otimes \Sigma),
\]
where $\Psi = (I_m + B)^{-1}$. The resulting posterior mean of $\Theta$ becomes $\hat{\Theta}^{B} = (I_m - \Psi) X$. Since we need to estimate $\Psi$, it will be estimated from the marginal distributions of $X$ and $S$. Then from Corollary 3.1, $E[X \Sigma^{-1} X^\top] = p \Psi^{-1}$, and we think of $\hat{\Psi} = c (X S^{-1} X^\top)^{-1}$ as an estimator of $\Psi$, where $c$ is a positive constant. Thus the resulting Efron-Morris (1972) type empirical Bayes estimator of $\Theta$ for $p \geq m$ can be expressed as
\[
\hat{\Theta}_2^{EMK} = (I_m - \hat{\Psi}) X = \{I_m - c (X S^{-1} X^\top)^{-1}\} X.
\]
The Efron-Morris type estimators given here have been studied by Konno (1991, 1992). For other empirical Bayes approaches, see Tsukuma and Kubokawa (2007).
6.2.2 The Unified Efron-Morris Type Estimator

Let us here give a unified form of the empirical Bayes estimators $\hat{\Theta}_1^{EMK}$ and $\hat{\Theta}_2^{EMK}$ with properties of the Moore-Penrose inverse. When $m > p$ with $n \geq p$, using (i), (ii) and (iv) of Lemma 2.3 yields
\[
(X S^{-1} X^\top)^+ = (X^\top)^+ S X^+ = X (X^\top X)^{-1} S (X^\top X)^{-1} X^\top,
\]
so that $(X S^{-1} X^\top)^+ X = X (X^\top X)^{-1} S$. When $p \geq m$ with $n \geq p$, $X S^{-1} X^\top$ is of full rank and its Moore-Penrose inverse becomes $(X S^{-1} X^\top)^+ = (X S^{-1} X^\top)^{-1}$. Hence for both cases of $m > p$ and $p \geq m$ with $n \geq p$, the Efron-Morris type empirical Bayes estimators $\hat{\Theta}_1^{EMK}$ and $\hat{\Theta}_2^{EMK}$ can be unified into $\hat{\Theta}^{EMK} = X - c (X S^{-1} X^\top)^+ X$, where $c$ is a constant.

In the case of $p > n$, the rank of $S$ is deficient and its inverse does not exist. Therefore, we replace $S^{-1}$ with $S^+$. This leads to $\hat{\Theta}^{EMK} = X - c (X S^+ X^\top)^+ X$. On the other hand, in the case where $m = 1$, Chételat and Wells (2012) suggested the shrinkage estimator
\[
\hat{\Theta}^{CW} = X - \frac{c}{X S^+ X^\top} X S S^+.
\]
An important problem is how to extend $\hat{\Theta}^{CW}$ to the framework of estimation of the mean matrix $\Theta$. In particular, the Efron-Morris type estimator seems to take various variants which depend on possible orderings among $m$, $n$ and $p$. One of the interesting results provided here is that we can develop a unified form for the Efron-Morris type estimators, given by
\[
\hat{\Theta}^{EMK} = X - c (X S^+ X^\top)^+ X S S^+, \tag{6.9}
\]
for any set of $(m, n, p)$. Of course, the expression (6.9) includes $\hat{\Theta}_1^{EMK}$, $\hat{\Theta}_2^{EMK}$ and $\hat{\Theta}^{CW}$ as special cases.

The matrix $X S^+ X^\top$ is nonsingular for $n \wedge p \geq m$, while it is singular for $m > n \wedge p$. In fact, $(X S^+ X^\top)^+$ for $n \wedge p \geq m$ can be rewritten as
\[
(X S^+ X^\top)^+ =
\begin{cases}
(X S^{-1} X^\top)^{-1} & \text{for } n \geq p \geq m, \\
(X S^+ X^\top)^{-1} & \text{for } p > n \geq m,
\end{cases}
\]
and the corresponding Efron-Morris type estimators are provided. In the case of $m > n \wedge p$, $\hat{\Theta}^{EMK}$ in (6.9) can be expressed as in the following lemma.

Lemma 6.1 In the case of $m > n \wedge p$, the Efron-Morris type estimators in (6.9) can be expressed as
\[
\hat{\Theta}^{EMK} =
\begin{cases}
X - c X (X^\top X)^{-1} S & \text{for } n \geq m > p \text{ or for } m > n \geq p, \\
X - c X (S S^+ X^\top X S S^+)^+ S & \text{for } m > p > n \text{ or for } p \geq m > n.
\end{cases} \tag{6.10}
\]
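Proof When $n \geq m > p$ or when $m > n \geq p$, using (i) of Lemma 2.3 gives $S^+ = S^{-1}$. Further from Lemma 2.3,
\[
(X S^+ X^\top)^+ \cdot X S S^+ = (X^\top)^+ S X^+ \cdot X = X (X^\top X)^{-1} S,
\]
which verifies the expression (6.10) when $n \geq m > p$ or when $m > n \geq p$.

When $m > p > n$ or when $p \geq m > n$, we denote the eigenvalue decomposition of $S$ by $S = H L H^\top$, where $H \in \mathbb{V}_{p,n}$ and $L \in \mathbb{D}_n^{(\geq 0)}$. From (i) and (iv) of Lemma 2.3, $S^+ = H L^{-1} H^\top$, so that $S S^+ = H H^\top = S^+ S$. Since $X H$ ($\in \mathbb{R}^{m\times n}$) is of rank $n$, it is observed that
\[
\begin{aligned}
(X S^+ X^\top)^+ X S S^+ &= (X H L^{-1} H^\top X^\top)^+ X H H^\top \\
&= (H^\top X^\top)^+ L (X H)^+ X H H^\top && (\because \text{(iv) of Lemma 2.3}) \\
&= (H^\top X^\top)^+ L H^\top && (\because \text{(iii) of Lemma 2.3}) \\
&= X H (H^\top X^\top X H)^{-1} L H^\top && (\because \text{(ii) of Lemma 2.3}) \\
&= X H (H^\top X^\top X H)^{-1} H^\top H L H^\top && (\because H^\top H = I_n) \\
&= X H (H^\top X^\top X H)^{-1} H^\top S && (\because S = H L H^\top).
\end{aligned}
\]
Again from (i), (ii) and (iv) of Lemma 2.3, $H (H^\top X^\top X H)^{-1} H^\top = (H H^\top X^\top X H H^\top)^+ = (S S^+ X^\top X S S^+)^+$. Hence the expression (6.10) is obtained for the case where $m > p > n$ or $p \geq m > n$.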
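The reductions of the unified form (6.9) to its special cases can be illustrated numerically. The following is a minimal sketch on simulated data, with a hypothetical constant `c` and hypothetical helper names; it relies on `numpy.linalg.pinv` with an explicit cutoff for numerically zero singular values, and that cutoff is a choice made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(4)
c = 0.7                                  # arbitrary constant, for illustration only

def pinv(A):
    """Moore-Penrose inverse with an explicit relative cutoff."""
    return np.linalg.pinv(A, rcond=1e-10)

def theta_emk(X, Y):
    """Unified form (6.9): X - c (X S^+ X^T)^+ X S S^+, with S = Y^T Y."""
    S = Y.T @ Y
    Sp = pinv(S)
    return X - c * pinv(X @ Sp @ X.T) @ X @ S @ Sp

def draw(m, n, p):
    """Simulate an m x p matrix X and an n x p matrix Y."""
    return rng.standard_normal((m, p)), rng.standard_normal((n, p))

# n >= m > p: reduces to X {I_p - c (X^T X)^{-1} S}.
X, Y = draw(6, 8, 3)
S = Y.T @ Y
case1 = X @ (np.eye(3) - c * np.linalg.solve(X.T @ X, S))
ok1 = np.allclose(theta_emk(X, Y), case1)

# n >= p >= m: reduces to {I_m - c (X S^{-1} X^T)^{-1}} X.
X, Y = draw(2, 8, 4)
S = Y.T @ Y
case2 = (np.eye(2) - c * np.linalg.inv(X @ np.linalg.solve(S, X.T))) @ X
ok2 = np.allclose(theta_emk(X, Y), case2)

# m > p > n (singular S): second line of (6.10).
X, Y = draw(7, 3, 5)
S = Y.T @ Y
SSp = S @ pinv(S)
case3 = X - c * X @ pinv(SSp @ X.T @ X @ SSp) @ S
ok3 = np.allclose(theta_emk(X, Y), case3)

print(ok1, ok2, ok3)
```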
6.3 A Unified Class of Matricial Shrinkage Estimators

To define the class (6.8), Konno (1992) separately considered two cases, where $m > p$ and where $p \ge m$, under $n \ge p$. The arguments stated in the previous section suggest that we can construct a well-defined class of matricial shrinkage estimators for all possible orders on $m$, $n$ and $p$. Hereafter in this chapter, we denote $\tau = m \wedge n \wedge p$.

Define the eigenvalue decomposition of $S$ as $S = HLH^\top$, where $H \in \mathbb{V}_{p,n\wedge p}$ and $L \in \mathbb{D}_{n\wedge p}^{(\ge 0)}$. Let $L^{1/2} = \operatorname{diag}(\sqrt{\ell_1}, \dots, \sqrt{\ell_{n\wedge p}})$ and $L^{-1/2} = (L^{1/2})^{-1}$. Denote the singular value decomposition of $XHL^{-1/2}$ by
$$XHL^{-1/2} = RF^{1/2}V^\top,$$
where $R \in \mathbb{V}_{m,\tau}$, $V \in \mathbb{V}_{n\wedge p,\tau}$ and $F^{1/2} = \operatorname{diag}(\sqrt{f_1}, \dots, \sqrt{f_\tau}) \in \mathbb{D}_\tau^{(\ge 0)}$. It is clear that
$$XS^+X^\top = XHL^{-1}H^\top X^\top = RFR^\top,$$
which is the eigenvalue decomposition of $XS^+X^\top$. For any possible triplet $(m, n, p)$, a unified class of matricial shrinkage estimators is defined by
$$\hat\Theta^{SH} = \hat\Theta^{SH}(X, S) = X - R\Phi(F)R^\top XSS^+, \tag{6.11}$$
where $\Phi(F) = \operatorname{diag}(\phi_1(F), \dots, \phi_\tau(F)) \in \mathbb{D}_\tau$ and the $\phi_i(F)$'s are absolutely continuous functions of $F$.

Let us here discuss invariance of the unified class (6.11) under the orthogonal transformations $X \to OXP$, $S \to P^\top SP$, $\Theta \to O\Theta P$ and $\Sigma \to P^\top\Sigma P$ for any $O \in \mathbb{O}_m$ and $P \in \mathbb{O}_p$. Then for an estimator $\hat\Theta = \hat\Theta(X, S)$, it seems natural to require $\hat\Theta(OXP, P^\top SP) = O\hat\Theta(X, S)P$. Since the eigenvalue decomposition of $P^\top SP$ is $P^\top HLH^\top P$, it turns out that, due to (i), (ii) and (iv) of Lemma 2.3,
$$\begin{aligned}
(P^\top SP)^+ &= (H^\top P)^+L^{-1}(P^\top H)^+ \\
&= P^\top H(H^\top PP^\top H)^{-1}L^{-1}(H^\top PP^\top H)^{-1}H^\top P \\
&= P^\top HL^{-1}H^\top P = P^\top S^+P.
\end{aligned}$$
Thus, $OXP(P^\top SP)^+(OXP)^\top = OXS^+X^\top O^\top$, whose eigenvalue decomposition is $ORFR^\top O^\top$. This yields, for any $O \in \mathbb{O}_m$ and $P \in \mathbb{O}_p$,
$$\begin{aligned}
\hat\Theta^{SH}(OXP, P^\top SP) &= OXP - OR \cdot \Phi(F) \cdot R^\top O^\top \cdot OXP \cdot P^\top SP \cdot (P^\top SP)^+ \\
&= O\hat\Theta^{SH}(X, S)P,
\end{aligned}$$
which shows invariance of $\hat\Theta^{SH}$. Note that if $P \in \mathbb{U}_p$ and $P \notin \mathbb{O}_p$ with $p > n$ then $(P^\top H)^+ \ne H^\top P$ and $\hat\Theta^{SH}$ does not retain invariance, namely, it is not invariant under the scale transformations (4.4).

The Efron-Morris type estimator (6.9) lies in the unified class (6.11). It indeed holds that, according to (i), (ii) and (iv) of Lemma 2.3, $(XS^+X^\top)^+ = (RFR^\top)^+ = RF^{-1}R^\top$, yielding
$$\hat\Theta^{EMK} = X - R\Phi^{EMK}(F)R^\top XSS^+, \qquad \Phi^{EMK}(F) = cF^{-1}. \tag{6.12}$$
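As an illustration of how the unified class (6.11) could be computed, the following sketch builds $H$, $L$, $R$, $F$ and $V$ from an eigendecomposition of $S = Y^\top Y$ and a thin singular value decomposition of $XHL^{-1/2}$. The names `decompose` and `unified_estimator`, the tolerance `tol` and the convention that `phi` maps the vector $(f_1, \dots, f_\tau)$ to $(\phi_1, \dots, \phi_\tau)$ are assumptions made for this example only.

```python
import numpy as np


def decompose(X, Y, tol=1e-10):
    """Return (H, ell, R, f, V) with S = H diag(ell) H^T and
    X H L^{-1/2} = R diag(sqrt(f)) V^T, where S = Y^T Y."""
    S = Y.T @ Y
    w, U = np.linalg.eigh(S)
    keep = w > tol * w.max()          # drop the zero eigenvalues when p > n
    H, ell = U[:, keep], w[keep]
    Z = (X @ H) / np.sqrt(ell)        # X H L^{-1/2}
    R, d, Vt = np.linalg.svd(Z, full_matrices=False)
    return H, ell, R, d ** 2, Vt.T    # f_i = d_i^2, in decreasing order


def unified_estimator(X, Y, phi):
    """Evaluate (6.11): X - R Phi(F) R^T X S S^+, using S S^+ = H H^T."""
    H, _, R, f, _ = decompose(X, Y)
    return X - R @ np.diag(phi(f)) @ R.T @ X @ (H @ H.T)
```

Taking `phi = lambda f: c / f` should reproduce the Efron-Morris type estimator in (6.12), and replacing `(X, Y)` by `(O @ X @ P, Y @ P)` with orthogonal `O` and `P` should return `O @ unified_estimator(X, Y, phi) @ P`, in line with the invariance argument above.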
Next, we give some convenient representations for (6.11).

Lemma 6.2 The unified class in (6.11) can be rewritten by
$$\hat\Theta^{SH} = X(I_p - SS^+) + R\{I_\tau - \Phi(F)\}R^\top XSS^+.$$

Proof Since $S = HLH^\top$ and $XHL^{-1/2} = RF^{1/2}V^\top$ where $R \in \mathbb{V}_{m,\tau}$, it is seen that
$$(I_m - RR^\top)XSS^+ = (I_m - RR^\top)XHH^\top = (I_m - RR^\top)RF^{1/2}V^\top L^{1/2}H^\top = 0_{m\times p},$$
which is used to rewrite the class (6.11) as
$$\begin{aligned}
\hat\Theta^{SH} &= X - XSS^+ + XSS^+ - RR^\top XSS^+ + RR^\top XSS^+ - R\Phi(F)R^\top XSS^+ \\
&= X(I_p - SS^+) + R\{I_\tau - \Phi(F)\}R^\top XSS^+.
\end{aligned}$$
Hence the proof is complete.
When $p > n$, because of (i), (ii) and (iv) in Lemma 2.3,
$$SS^+ = Y^\top Y \cdot (Y^\top Y)^+ = Y^\top Y \cdot Y^\top(YY^\top)^{-2}Y = Y^\top(YY^\top)^{-1}Y$$
is the orthogonal projection matrix onto the subspace spanned by rows of $Y$. The ML estimator is rewritten as $\hat\Theta^{ML} = X(I_p - SS^+) + XSS^+$. Thus Lemma 6.2 implies that $\hat\Theta^{SH}$ is shrinking with respect only to the projections of rows of $X$ onto the subspace spanned by rows of $Y$.

Further, the unified class (6.11) can be rewritten as in the following lemma which is an extension for Konno's (1992) class (6.8) with $m > p$.

Lemma 6.3 Let $Q = HL^{-1/2}V \in \mathbb{R}^{p\times\tau}$. Then $Q^- = V^\top L^{1/2}H^\top \in \mathbb{R}^{\tau\times p}$ is the generalized inverse of $Q$. Further the unified class in (6.11) can be rewritten as
$$\hat\Theta^{SH} = X\{I_p - Q\Phi(F)Q^-\} = X(I_p - SS^+) + XSS^+Q\{I_\tau - \Phi(F)\}Q^-. \tag{6.13}$$

Proof It is seen that
$$QQ^-Q = HL^{-1/2}VV^\top L^{1/2}H^\top HL^{-1/2}V = HL^{-1/2}VV^\top V = HL^{-1/2}V = Q,$$
and consequently $Q^-$ is the generalized inverse of $Q$. Since
$$R = XHL^{-1/2}VF^{-1/2} = XQF^{-1/2}, \qquad R^\top XSS^+ = R^\top RF^{1/2}V^\top L^{1/2}H^\top = F^{1/2}Q^-, \tag{6.14}$$
it follows that
$$R\Phi(F)R^\top XSS^+ = XQF^{-1/2}\Phi(F)F^{1/2}Q^- = XQ\Phi(F)Q^-.$$
Hence, we get the first equality in (6.13). The second equality in (6.13) can be obtained similarly by using Lemma 6.2 with the fact that $Q = SS^+Q$.
As for $Q$ in Lemma 6.3, it is easy to check that $Q = S^+X^\top RF^{-1/2}$ and $Q^- = F^{-1/2}R^\top XSS^+$. Also, $Q^\top SQ = I_\tau$ and $Q^\top X^\top XQ = F$, while
$$(Q^-)^\top Q^- = \begin{cases} S & \text{for } m > n \wedge p, \\ SS^+X^\top(XS^+X^\top)^{-1}XSS^+ & \text{for } m \le n \wedge p, \end{cases}$$
and
$$(Q^-)^\top FQ^- = \begin{cases} X^\top X & \text{for } n \ge p \ge m, \\ SS^+X^\top(XS^+X^\top)(XS^+X^\top)^+XSS^+ & \text{otherwise}. \end{cases}$$

Lemmas 6.2 and 6.3 suggest that $\hat\Theta^{SH}$ shrinks not only columns, but also rows of $XSS^+$ in terms of $\hat\Theta^{ML}$. If all the diagonals of $\Phi(F)$ are positive and the matrix square root of $\Phi(F)$ is denoted by $\Phi^{1/2}$ then
$$R\Phi R^\top XSS^+ = R\Phi^{1/2}R^\top R\Phi^{1/2}R^\top XSS^+ = R\Phi^{1/2}R^\top XQ\Phi^{1/2}Q^-.$$
Since $Q = SS^+Q$, we get the following lemma as well.

Lemma 6.4 If all the diagonals of $\Phi(F)$ are positive and the matrix square root of $\Phi(F)$ is denoted by $\Phi^{1/2}$ then the unified class (6.11) can be rewritten by
$$\hat\Theta^{SH} = X - R\Phi^{1/2}R^\top XSS^+Q\Phi^{1/2}Q^-.$$
If all the diagonals of $I_\tau - \Phi(F)$ are positive and the matrix square root of $I_\tau - \Phi(F)$ is denoted by $(I_\tau - \Phi)^{1/2}$ then the unified class (6.11) can be rewritten by
$$\hat\Theta^{SH} = X(I_p - SS^+) + R(I_\tau - \Phi)^{1/2}R^\top XSS^+Q(I_\tau - \Phi)^{1/2}Q^-.$$

As seen in the above lemmas, $\hat\Theta^{SH}$ takes several different forms. In the following, we will employ the different forms for different purposes.
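6.4 Unbiased Risk Estimate

Abbreviate $\Phi(F)$ to $\Phi$. The quadratic loss (6.7) of $\hat\Theta^{SH}$ in (6.11) is expanded to
$$\begin{aligned}
L(\hat\Theta^{SH}, \Theta|\Sigma) &= \operatorname{tr}(X - \Theta)\Sigma^{-1}(X - \Theta)^\top - 2\operatorname{tr}(X - \Theta)\Sigma^{-1}SS^+X^\top R\Phi R^\top \\
&\quad + \operatorname{tr} R\Phi R^\top XSS^+\Sigma^{-1}SS^+X^\top R\Phi R^\top.
\end{aligned}$$
Recalling that $R(X, \Theta) = mp$, we obtain $R(\hat\Theta^{SH}, \Theta) = mp + E_2 - 2E_1$, where
$$E_1 = E[\operatorname{tr}(X - \Theta)\Sigma^{-1}SS^+X^\top R\Phi R^\top], \qquad E_2 = E[\operatorname{tr}\Sigma^{-1}SS^+X^\top R\Phi^2R^\top XSS^+].$$
Here Theorem 5.1, or (5.3), is used to evaluate $E_1$. If the conditions in Theorem 5.1 are satisfied for $G_1 = I_m$ and $G_2 = \Sigma^{-1}SS^+X^\top R\Phi R^\top$, then $E_1$ can be expressed as
$$E_1 = E[\operatorname{tr}\nabla_XSS^+X^\top R\Phi R^\top].$$
Similarly, since $Y \sim N_{n\times p}(0_{n\times p}, I_n \otimes \Sigma)$, Theorem 5.1 is used to rewrite $E_2$ as
$$E_2 = E[\operatorname{tr}\Sigma^{-1}Y^\top YS^+X^\top R\Phi^2R^\top XSS^+] = E[\operatorname{tr}\nabla_Y^\top YS^+X^\top R\Phi^2R^\top XSS^+].$$
A sufficient condition for applying Theorem 5.1 to $E_1$ and $E_2$ is
$$E[\operatorname{tr}S \cdot \operatorname{tr}F\Phi^2] < \infty. \tag{6.15}$$
For more details, see Tsukuma and Kubokawa (2015).

From the above observation, an unbiased risk estimate of $\hat\Theta^{SH}$ is given by
$$\hat R(\hat\Theta^{SH}) = mp + \operatorname{tr}\nabla_Y^\top YS^+X^\top R\Phi^2R^\top XSS^+ - 2\operatorname{tr}\nabla_XSS^+X^\top R\Phi R^\top.$$
Using Lemmas 6.6 and 6.7 in the Appendix yields

Theorem 6.1 Let $\phi_i = \phi_i(F)$ for $i = 1, \dots, \tau$. Assume that (6.15) is satisfied. For any possible ordering on $m$, $n$ and $p$, an unbiased risk estimate of $\hat\Theta^{SH}$ is
$$\begin{aligned}
\hat R(\hat\Theta^{SH}) = mp + \sum_{i=1}^\tau \Big\{ & a f_i\phi_i^2 - 2b\phi_i - 4f_i^2\phi_i\frac{\partial\phi_i}{\partial f_i} - 4f_i\frac{\partial\phi_i}{\partial f_i} \\
& - 2\sum_{j>i}^\tau \frac{f_i^2\phi_i^2 - f_j^2\phi_j^2}{f_i - f_j} - 4\sum_{j>i}^\tau \frac{f_i\phi_i - f_j\phi_j}{f_i - f_j} \Big\},
\end{aligned}$$
where $a = a_{m,p,n} = (|n - p| + 2m) \wedge (n + p) - 3$ and $b = b_{m,p,n} = |n \wedge p - m| + 1$.

The unbiased risk estimate $\hat R(\hat\Theta^{SH})$ in Theorem 6.1 depends on $F \in \mathbb{D}_\tau^{(\ge 0)}$. Let $\hat\Theta_0^{SH}$ and $\hat\Theta_1^{SH}$ be estimators belonging to the unified class (6.11). If $\hat R(\hat\Theta_0^{SH}) \le \hat R(\hat\Theta_1^{SH})$ for any $F \in \mathbb{D}_\tau^{(\ge 0)}$ and fixed $(m, n, p)$, then $\hat\Theta_0^{SH}$ dominates $\hat\Theta_1^{SH}$ relative to the quadratic loss (6.7). For example, we have

Corollary 6.1 If $\hat R(\hat\Theta^{SH}) \le \hat R(\hat\Theta^{ML}) = mp$ for any $F \in \mathbb{D}_\tau^{(\ge 0)}$ and fixed $(m, n, p)$ then $\hat\Theta^{SH}$ is a minimax estimator dominating $\hat\Theta^{ML}$ relative to the quadratic loss (6.7).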
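The formula in Theorem 6.1 is straightforward to evaluate once the eigenvalues $f_i$, the values $\phi_i$ and the partial derivatives $\partial\phi_i/\partial f_i$ are available. The sketch below is a plain transcription under assumptions introduced here: the function names are hypothetical, the $f_i$ are taken to be distinct, and the caller supplies the derivatives.

```python
import numpy as np


def risk_constants(m, n, p):
    """Constants a and b of Theorem 6.1."""
    a = min(abs(n - p) + 2 * m, n + p) - 3
    b = abs(min(n, p) - m) + 1
    return a, b


def unbiased_risk_estimate(f, phi, dphi, m, n, p):
    """Unbiased risk estimate of Theorem 6.1.

    f    : distinct eigenvalues f_1, ..., f_tau of X S^+ X^T
    phi  : values phi_i(F)
    dphi : partial derivatives d phi_i / d f_i
    """
    f, phi, dphi = (np.asarray(v, dtype=float) for v in (f, phi, dphi))
    a, b = risk_constants(m, n, p)
    total = np.sum(a * f * phi**2 - 2 * b * phi
                   - 4 * f**2 * phi * dphi - 4 * f * dphi)
    tau = f.size
    for i in range(tau):
        for j in range(i + 1, tau):      # the sums over j > i
            gap = f[i] - f[j]
            total -= 2 * (f[i]**2 * phi[i]**2 - f[j]**2 * phi[j]**2) / gap
            total -= 4 * (f[i] * phi[i] - f[j] * phi[j]) / gap
    return m * p + total
```

For $\phi_i = c/f_i$ the pairwise terms vanish and the value should reduce to $mp + \{(a+4)c - 2(b-2)\}c \sum_i 1/f_i$, which is the expression derived for the Efron-Morris type estimator in Sect. 6.5.1.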
6.5 Examples for Specific Estimators

6.5.1 The Unified Efron-Morris Type Estimator

The unified Efron-Morris type estimator $\hat\Theta^{EMK}$ could be rewritten as in (6.12). To apply Theorem 6.1 to $\hat\Theta^{EMK}$, we put $\phi_i = cf_i^{-1}$. Then condition (6.15) is expressed by $c^2E[\operatorname{tr}S\operatorname{tr}F^{-1}] < \infty$, which is satisfied if $b \ge 3$ or, equivalently, $|n \wedge p - m| \ge 2$ (see Lemma 6.8 of Tsukuma and Kubokawa 2015). The unbiased risk estimate of $\hat\Theta^{EMK}$ is
$$\hat R(\hat\Theta^{EMK}) = mp + \{(a + 4)c - 2(b - 2)\}c\sum_{i=1}^\tau \frac{1}{f_i},$$
implying that if $0 < c \le 2(b - 2)/(a + 4)$ and $b \ge 3$ then $\hat\Theta^{EMK}$ dominates $\hat\Theta^{ML}$ relative to the quadratic loss (6.7).

The unbiased risk estimate $\hat R(\hat\Theta^{EMK})$ is a quadratic function of $c$ and attains its minimum at
$$c^{EM} = \frac{b - 2}{a + 4} = \frac{|n \wedge p - m| - 1}{(|n - p| + 2m) \wedge (n + p) + 1}. \tag{6.16}$$
Define
$$\hat\Theta^{EM} = X - c^{EM}RF^{-1}R^\top XSS^+ = X - c^{EM}(XS^+X^\top)^+XSS^+. \tag{6.17}$$
This is an extension of Efron and Morris' (1972) original estimator in (6.1). The unbiased risk estimate of $\hat\Theta^{EM}$ has the form
$$\hat R(\hat\Theta^{EM}) = mp - (b - 2)c^{EM}\sum_{i=1}^\tau \frac{1}{f_i} = \hat R(\hat\Theta^{ML}) - (b - 2)c^{EM}\operatorname{tr}F^{-1}. \tag{6.18}$$
When $m$, $n$ and $p$ are given, the corresponding specific values of $a$ and $b$ in $c^{EM}$ are determined. Noting that $(|n - p| + 2m) \wedge (n + p) = n + p$ for $m > n \wedge p$, we can obtain specific values of $c^{EM}$,
$$c^{EM} = \begin{cases}
(p - m - 1)/(n - p + 2m + 1) & \text{for } n \ge p \ge m, \\
(n - m - 1)/(p - n + 2m + 1) & \text{for } p > n \ge m, \\
(m - p - 1)/(n + p + 1) & \text{for } n \ge m > p \text{ and } m > n \ge p, \\
(m - n - 1)/(n + p + 1) & \text{for } m > p > n \text{ and } p \ge m > n.
\end{cases}$$
The cases satisfying n ≥ p, namely, n ≥ p ≥ m, n ≥ m > p and m > n ≥ p, are provided by Konno (1992).
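The case-by-case values of $c^{EM}$ follow mechanically from (6.16). A minimal sketch, with the hypothetical function name `c_em` and exact rational arithmetic chosen here for easy comparison with the table above:

```python
from fractions import Fraction


def c_em(m, n, p):
    """Optimal constant (6.16): (b - 2) / (a + 4) with a, b as in Theorem 6.1."""
    a = min(abs(n - p) + 2 * m, n + p) - 3
    b = abs(min(n, p) - m) + 1
    return Fraction(b - 2, a + 4)
```

For instance, `c_em(2, 10, 5)` falls in the case $n \ge p \ge m$ and `c_em(8, 5, 10)` in the case $p \ge m > n$. The constant is positive only when $b \ge 3$, which is the condition required in Sect. 6.5.1.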
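6.5.2 A Modified Stein-Type Estimator

A modified Stein-type estimator is defined by
$$\hat\Theta^{mST} = \hat\Theta^{ST} - \frac{d}{\operatorname{tr}XS^+X^\top}RR^\top XSS^+,$$
where $\hat\Theta^{ST} = X - RCF^{-1}R^\top XSS^+$ for $C = \operatorname{diag}(c_1, \dots, c_\tau)$ with $c_1 \ge \dots \ge c_\tau$. This corresponds to the form
$$\phi_i = \frac{c_i}{f_i} + \frac{d}{\sum_{j=1}^\tau f_j} = \frac{c_i}{f_i} + \frac{d}{\operatorname{tr}F}.$$
Then, from Theorem 6.1, it follows that
$$\begin{aligned}
\hat R(\hat\Theta^{mST}) - \hat R(\hat\Theta^{ML})
&= \sum_{i=1}^\tau (ac_i^2 - 2bc_i + 4c_i + 4c_i^2)\frac{1}{f_i} \\
&\quad + \Big\{(a - 2\tau + 2)d^2 - 2\tau bd - 2\tau(\tau - 1)d + 4d + 2(a + 2)d\sum_{i=1}^\tau c_i\Big\}\frac{1}{\operatorname{tr}F} \\
&\quad + 4d^2\frac{\operatorname{tr}F^2}{(\operatorname{tr}F)^3} + 4d\frac{\operatorname{tr}CF}{(\operatorname{tr}F)^2} - 2\sum_{i=1}^\tau\sum_{j>i}^\tau \frac{(c_i - c_j)(c_i + c_j + 2)}{f_i - f_j} \\
&\quad - \frac{4d}{\operatorname{tr}F}\sum_{i=1}^\tau\sum_{j>i}^\tau \frac{c_if_i - c_jf_j}{f_i - f_j},
\end{aligned} \tag{6.19}$$
because $\sum_{i=1}^\tau\sum_{j>i}^\tau (f_i + f_j) = (\tau - 1)\operatorname{tr}F$. The condition for obtaining (6.19), namely, for satisfying (6.15), is $b \ge 3$. It is noted that
$$\operatorname{tr}F^2 \le (\operatorname{tr}F)^2, \qquad \operatorname{tr}CF/(\operatorname{tr}F)^2 \le c_1/\operatorname{tr}F,$$
$$\sum_{i=1}^\tau\sum_{j>i}^\tau \frac{(c_i - c_j)(c_i + c_j + 2)}{f_i - f_j} \ge \sum_{i=1}^\tau \frac{1}{f_i}\sum_{j>i}^\tau (c_i - c_j)(c_i + c_j + 2),$$
$$\sum_{i=1}^\tau\sum_{j>i}^\tau \frac{c_if_i - c_jf_j}{f_i - f_j} \ge \sum_{i=1}^\tau (\tau - i)c_i.$$
Thus,
$$\hat R(\hat\Theta^{mST}) - \hat R(\hat\Theta^{ML}) \le \sum_{i=1}^\tau h_c(i)/f_i + h_d/\operatorname{tr}F,$$
where
$$h_c(i) = (a + 4 - 2\tau + 2i)c_i^2 - 2(b - 2 + 2\tau - 2i)c_i + 2\sum_{j>i}^\tau c_j(c_j + 2),$$
$$h_d = (a - 2\tau + 6)d^2 - 2\Big\{b\tau + \tau(\tau - 1) - 2 - 2c_1 - (a + 2)\sum_{i=1}^\tau c_i + 2\sum_{i=1}^\tau (\tau - i)c_i\Big\}d.$$
For $i = 1, \dots, \tau$, put
$$c_i = \frac{b - 2 + 2\tau - 2i}{a + 4 - 2\tau + 2i}, \tag{6.20}$$
which satisfy $c_1 \ge \dots \ge c_\tau$ and $c_\tau = c^{EM}$ given in (6.16). Since $h_c(i)$ is a quadratic function of $c_i$, and $(a + 4 - 2\tau + 2i)c_i^2 - 2(b - 2 + 2\tau - 2i)c_i \le (a + 4 - 2\tau + 2i)c_{i+1}^2 - 2(b - 2 + 2\tau - 2i)c_{i+1}$ for each $i$, it is observed that
$$\begin{aligned}
h_c(i) &= (a + 4 - 2\tau + 2i)c_i^2 - 2(b - 2 + 2\tau - 2i)c_i + 2c_{i+1}(c_{i+1} + 2) + 2\sum_{j>i+1}^\tau c_j(c_j + 2) \\
&\le \{a + 4 - 2\tau + 2(i + 1)\}c_{i+1}^2 - 2\{b - 2 + 2\tau - 2(i + 1)\}c_{i+1} + 2\sum_{j>i+1}^\tau c_j(c_j + 2) \\
&\le \dots \le (a + 2)c_{\tau-1}^2 - 2bc_{\tau-1} + 2c_\tau(c_\tau + 2) \\
&\le (a + 4)c_\tau^2 - 2(b - 2)c_\tau = -(b - 2)c_\tau = -(b - 2)c^{EM}.
\end{aligned}$$
It is also seen that $2\sum_{i=1}^\tau (\tau - i)c_i = -(b + \tau - 3)\tau + (a + 4)\sum_{i=1}^\tau c_i$, which provides
$$h_d = (a - 2\tau + 6)d^2 - 4\Big\{\tau - 1 + \sum_{i=2}^\tau c_i\Big\}d.$$
Hence, from (6.18),
$$\hat R(\hat\Theta^{mST}) - \hat R(\hat\Theta^{ML}) \le \hat R(\hat\Theta^{EM}) - \hat R(\hat\Theta^{ML}) + \Big[(a - 2\tau + 6)d^2 - 4\Big\{\tau - 1 + \sum_{i=2}^\tau c_i\Big\}d\Big]\frac{1}{\operatorname{tr}F}.$$
These observations imply that $\hat\Theta^{ML}$ and $\hat\Theta^{EM}$ are improved on by $\hat\Theta^{ST} = X - RCF^{-1}R^\top XSS^+$ with constants $c_i$'s given in (6.20). With these $c_i$'s, $\hat\Theta^{ST}$ is an extension of Stein's (1973) estimator in (6.5) and further improved on by the modified Stein-type estimator
$$\hat\Theta^{mST} = X - RCF^{-1}R^\top XSS^+ - \frac{d}{\operatorname{tr}XS^+X^\top}RR^\top XSS^+$$
if $d$ satisfies $0 < d \le 4\{\tau - 1 + \sum_{i=2}^\tau c_i\}/(a - 2\tau + 6)$. This is a generalization of Tsukuma and Kubokawa (2007) for any possible ordering on $m$, $n$ and $p$.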
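The constants in (6.20) and the admissible range of $d$ can be tabulated for a given $(m, n, p)$. The sketch below is illustrative only; the function name `stein_constants` and the use of exact fractions are choices made here, and the function refuses inputs with $b < 3$ because (6.19) is derived under that condition.

```python
from fractions import Fraction


def stein_constants(m, n, p):
    """Return ([c_1, ..., c_tau], d_max) following (6.20) and the bound on d."""
    tau = min(m, n, p)
    a = min(abs(n - p) + 2 * m, n + p) - 3
    b = abs(min(n, p) - m) + 1
    if b < 3:
        raise ValueError("the derivation of (6.19) assumes b >= 3")
    c = [Fraction(b - 2 + 2 * tau - 2 * i, a + 4 - 2 * tau + 2 * i)
         for i in range(1, tau + 1)]
    # upper bound 4 * (tau - 1 + c_2 + ... + c_tau) / (a - 2 tau + 6)
    d_max = 4 * (tau - 1 + sum(c[1:])) / (a - 2 * tau + 6)
    return c, d_max
```

The returned constants are decreasing in $i$ and the last one equals $c^{EM}$ in (6.16), so the list can be used directly as the diagonal of $C$ in $\hat\Theta^{ST}$.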
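6.5.3 Modified Efron-Morris Type Estimator

Next, we extend the modified Efron-Morris (1976) estimator in (6.3) to the unknown covariance case. Let
$$\hat\Theta^{mEM} = \hat\Theta^{EM} - \frac{d}{\operatorname{tr}XS^+X^\top}RR^\top XSS^+,$$
where $\hat\Theta^{EM}$ is given in (6.17). This corresponds to the form
$$\phi_i = \frac{c^{EM}}{f_i} + \frac{d}{\sum_{j=1}^\tau f_j} = \frac{c^{EM}}{f_i} + \frac{d}{\operatorname{tr}F}.$$
Letting $c_i = c^{EM}$ in (6.19) for all $i$ and using the fact that $\operatorname{tr}F^2 \le (\operatorname{tr}F)^2$, we get
$$\begin{aligned}
\hat R(\hat\Theta^{mEM}) - \hat R(\hat\Theta^{ML})
&\le -(b - 2)c^{EM}\operatorname{tr}F^{-1} \\
&\quad + \{(a - 2\tau + 2)d^2 - 2\tau bd - 2\tau(\tau - 1)d + 4d + 2\tau(a + 2)c^{EM}d\}\frac{1}{\operatorname{tr}F} \\
&\quad + \frac{4d^2}{\operatorname{tr}F} + \frac{4c^{EM}d}{\operatorname{tr}F} - \frac{4c^{EM}d}{\operatorname{tr}F}\sum_{i=1}^\tau (\tau - i).
\end{aligned}$$
With some algebraic manipulation,
$$\hat R(\hat\Theta^{mEM}) - \hat R(\hat\Theta^{EM}) \le \Big\{(a - 2\tau + 6)d^2 - 2d\,\frac{(a + b + 2)(\tau - 1)(\tau + 2)}{a + 4}\Big\}\frac{1}{\operatorname{tr}F}. \tag{6.21}$$
Therefore $\hat\Theta^{mEM}$ improves on $\hat\Theta^{EM}$ if $0 < d \le 2(a + b + 2)(\tau - 1)(\tau + 2)/\{(a + 4)(a - 2\tau + 6)\}$.

6.6 Related Topics

[…] 1 with $n \ge p$ was given by Tsukuma (2010). When $m = 1$ with $p > n$, $\hat\Theta_+^{SH}$ was suggested by Chételat and Wells (2012), who showed by simulation that $\hat\Theta_+^{SH}$ outperforms $\hat\Theta^{SH}$. These kinds of dominance results can be proved unifiedly for any set $(m, n, p)$.

Theorem 6.2 Assume that $\Pr(\psi_i(F) < 0) > 0$ for some $i$. Then $\hat\Theta_+^{SH}$ dominates $\hat\Theta^{SH}$ relative to the quadratic loss (6.7) regardless of an order relation among $m$, $n$ and $p$.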
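Proof Abbreviate $\Psi(F) = \operatorname{diag}(\psi_1(F), \dots, \psi_\tau(F))$ to $\Psi = \operatorname{diag}(\psi_1, \dots, \psi_\tau)$ and $\Psi_+(F) = \operatorname{diag}(\psi_1^+(F), \dots, \psi_\tau^+(F))$ to $\Psi_+ = \operatorname{diag}(\psi_1^+, \dots, \psi_\tau^+)$, respectively. Put $\nu = n \wedge p$. Let $H_0 \in \mathbb{R}^{p\times(p-\nu)}$ such that $(H, H_0) \in \mathbb{O}_p$. We can express $\hat\Theta^{SH}$ as
$$\hat\Theta^{SH} = XH_0H_0^\top + R\Psi R^\top XHH^\top.$$
Note that
$$\begin{aligned}
\operatorname{tr}(\hat\Theta^{SH} - \Theta)\Sigma^{-1}(\hat\Theta^{SH} - \Theta)^\top
&= \operatorname{tr}(XH_0H_0^\top - \Theta)\Sigma^{-1}(XH_0H_0^\top - \Theta)^\top \\
&\quad + 2\operatorname{tr}\Psi R^\top XHH^\top\Sigma^{-1}(XH_0H_0^\top - \Theta)^\top R \\
&\quad + \operatorname{tr}\Psi^2R^\top XHH^\top\Sigma^{-1}HH^\top X^\top R.
\end{aligned}$$
Thus the difference in risk of $\hat\Theta_+^{SH}$ and $\hat\Theta^{SH}$ is
$$\begin{aligned}
R(\hat\Theta_+^{SH}, \Theta) - R(\hat\Theta^{SH}, \Theta)
&= E[\operatorname{tr}(\Psi_+^2 - \Psi^2)R^\top XHH^\top\Sigma^{-1}HH^\top X^\top R] \\
&\quad + 2E[\operatorname{tr}(\Psi_+ - \Psi)R^\top XHH^\top\Sigma^{-1}(XH_0H_0^\top - \Theta)^\top R].
\end{aligned} \tag{6.22}$$
The first expectation in the r.h.s. of (6.22) is not positive because $(\psi_i^+)^2 \le \psi_i^2$ for all $i$.

Recall that $S = HLH^\top$ is the eigenvalue decomposition, where $H \in \mathbb{V}_{p,\nu}$ and $L = \operatorname{diag}(\ell_1, \dots, \ell_\nu) \in \mathbb{D}_\nu^{(\ge 0)}$. From Proposition 3.2 and Equation (3.3), the joint (unnormalized) p.d.f. of $(X, L, H)$ without a normalizing constant can be written as
$$\exp\Big\{-\frac12\operatorname{tr}(X - \Theta)\Sigma^{-1}(X - \Theta)^\top - \frac12\operatorname{tr}\Sigma^{-1}HLH^\top\Big\}\,g_{n,p}(L),$$
where
$$g_{n,p}(L) = |L|^{(|n-p|-1)/2}\prod_{1\le i<j\le\nu}(\ell_i - \ell_j).$$
Noting that
$$\begin{aligned}
\operatorname{tr}(X - \Theta)\Sigma^{-1}(X - \Theta)^\top
&= \operatorname{tr}(XH_0H_0^\top - \Theta)\Sigma^{-1}(XH_0H_0^\top - \Theta)^\top \\
&\quad + 2\operatorname{tr}XHH^\top\Sigma^{-1}(XH_0H_0^\top - \Theta)^\top + \operatorname{tr}XHH^\top\Sigma^{-1}HH^\top X^\top,
\end{aligned}$$
we make the transformation $(Z, Z_0) = (XHL^{-1/2}, XH_0)$. Since, by Lemma 3.2, the Jacobian of the transformation is given by $J[X \to (Z, Z_0)] = |L|^{m/2}$, the joint (unnormalized) p.d.f. of $(Z, Z_0, L, H)$ without a normalizing constant is proportional to
$$\begin{aligned}
\exp\Big\{&-\frac12\operatorname{tr}(Z_0H_0^\top - \Theta)\Sigma^{-1}(Z_0H_0^\top - \Theta)^\top - \operatorname{tr}ZL^{1/2}H^\top\Sigma^{-1}(Z_0H_0^\top - \Theta)^\top \\
&- \frac12\operatorname{tr}ZL^{1/2}H^\top\Sigma^{-1}HL^{1/2}Z^\top - \frac12\operatorname{tr}\Sigma^{-1}HLH^\top\Big\}\,|L|^{m/2}g_{n,p}(L).
\end{aligned}$$
Then the second expectation in the r.h.s. of (6.22) becomes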
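$$K_0\int_{\mathbb{R}^{m\times(p-\nu)}\times\mathbb{D}_\nu^{(\ge 0)}\times\mathbb{V}_{p,\nu}} I \times f(Z_0, L, H)\,|L|^{m/2}g_{n,p}(L)\,(dZ_0)(dL)(H^\top dH),$$
where $K_0$ is a normalizing constant,
$$\begin{aligned}
I = \int_{\mathbb{R}^{m\times\nu}} &\operatorname{tr}(\Psi_+ - \Psi)R^\top ZL^{1/2}H^\top\Sigma^{-1}(Z_0H_0^\top - \Theta)^\top R \\
&\times \exp\Big\{-\operatorname{tr}ZL^{1/2}H^\top\Sigma^{-1}(Z_0H_0^\top - \Theta)^\top - \frac12\operatorname{tr}ZL^{1/2}H^\top\Sigma^{-1}HL^{1/2}Z^\top\Big\}(dZ)
\end{aligned}$$
and
$$f(Z_0, L, H) = \exp\Big\{-\frac12\operatorname{tr}(Z_0H_0^\top - \Theta)\Sigma^{-1}(Z_0H_0^\top - \Theta)^\top - \frac12\operatorname{tr}\Sigma^{-1}HLH^\top\Big\}.$$
Hence if it is shown that $I \le 0$, the proof of Theorem 6.2 will be complete.

We next consider the singular value decomposition $Z = RDV^\top$, where $R \in \mathbb{V}_{m,\tau}$, $D = \operatorname{diag}(d_1, \dots, d_\tau) = F^{1/2} \in \mathbb{D}_\tau^{(\ge 0)}$, $V \in \mathbb{V}_{\nu,\tau}$ and $\tau = m \wedge (n \wedge p) = m \wedge \nu$. From Lemma 3.6, we have
$$\begin{aligned}
(dZ) &= \frac{1}{2^\tau}|D|^{|m-\nu|}\prod_{1\le i<j\le\tau}(d_i^2 - d_j^2)\,(R^\top dR)(dD)(V^\top dV) \\
&= \frac{1}{2^{2\tau}}g_{m,\nu}(F)\,(R^\top dR)(dF)(V^\top dV),
\end{aligned}$$
where the second equality is verified by the transformation $F = D^2$. Recall that $(R^\top dR)$ and $(V^\top dV)$ are invariant with respect to any orthogonal transformation. For $i = 1, \dots, \tau$, it is observed that
$$\{R^\top ZL^{1/2}H^\top\Sigma^{-1}(Z_0H_0^\top - \Theta)^\top R\}_{ii} = f_i^{1/2}v_i^\top L^{1/2}H^\top\Sigma^{-1}(Z_0H_0^\top - \Theta)^\top r_i,$$
where $v_i$ and $r_i$ are the $i$-th column vectors of $V$ and $R$, respectively. Letting $a_i^\top = f_i^{1/2}v_i^\top L^{1/2}H^\top\Sigma^{-1}(Z_0H_0^\top - \Theta)^\top$, we obtain
$$I = \sum_{i=1}^\tau\int_{\mathbb{V}_{m,\tau}\times\mathbb{D}_\tau^{(\ge 0)}\times\mathbb{V}_{\nu,\tau}}(\psi_i^+ - \psi_i)\,a_i^\top r_i\,e^{-a_i^\top r_i}\,G_i\,(R^\top dR)(dF)(V^\top dV), \tag{6.23}$$
where
$$G_i = \exp\Big\{-\sum_{j\ne i}^\tau a_j^\top r_j - \frac12\operatorname{tr}FV^\top L^{1/2}H^\top\Sigma^{-1}HL^{1/2}V\Big\}\times\frac{1}{2^{2\tau}}g_{m,\nu}(F).$$
For each $i \in \{1, \dots, \tau\}$, we make the transformation $r_i \to -r_i$. This transformation is equivalent to the orthogonal transformation $R \to RO_i$, where $O_i \in \mathbb{D}_\tau$ such that the $i$-th diagonal is minus one and the other diagonals are ones. Since $(R^\top dR)$ is invariant with respect to the orthogonal transformation, (6.23) is rewritten as
$$I = \sum_{i=1}^\tau\int_{\mathbb{V}_{m,\tau}\times\mathbb{D}_\tau^{(\ge 0)}\times\mathbb{V}_{\nu,\tau}}(\psi_i^+ - \psi_i)(-a_i^\top r_i\,e^{a_i^\top r_i})\,G_i\,(R^\top dR)(dF)(V^\top dV). \tag{6.24}$$
Adding each side of (6.23) and (6.24) yields
$$2I = \sum_{i=1}^\tau\int_{\mathbb{V}_{m,\tau}\times\mathbb{D}_\tau^{(\ge 0)}\times\mathbb{V}_{\nu,\tau}}(\psi_i^+ - \psi_i)\,a_i^\top r_i\,(e^{-a_i^\top r_i} - e^{a_i^\top r_i})\,G_i\,(R^\top dR)(dF)(V^\top dV).$$
Since $\psi_i^+ \ge \psi_i$ and $a_i^\top r_i(e^{-a_i^\top r_i} - e^{a_i^\top r_i}) \le 0$ for any value of $a_i^\top r_i$, it always holds that $I \le 0$. Thus the proof of Theorem 6.2 is complete.

For example, the Efron-Morris estimator $\hat\Theta^{EM}$ is dominated by
$$\hat\Theta_+^{EM} = X(I_p - SS^+) + R\Psi_+^{EM}(F)R^\top XSS^+,$$
where the $i$-th diagonal element of $\Psi_+^{EM}(F)$ is $\max(0, 1 - c^{EM}/f_i)$. This positive-part rule is extending (6.2) to the unknown covariance case. Also, Theorem 6.2 can be applied to $\hat\Theta^{mEM}$, $\hat\Theta^{ST}$ and $\hat\Theta^{mST}$ given in Sect. 6.5, but the applications are omitted.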
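The positive-part rule for the Efron-Morris estimator can be sketched as follows. This is an illustration under assumptions made here: the function name `positive_part_em`, the tolerance `tol`, and passing the shrinkage constant `c` explicitly (for example $c^{EM}$ from (6.16)) are choices for this example. All $f_i$ are assumed to be strictly positive.

```python
import numpy as np


def positive_part_em(X, Y, c, tol=1e-10):
    """X (I_p - S S^+) + R Psi_+ R^T X S S^+ with Psi_+ = diag(max(0, 1 - c / f_i))."""
    S = Y.T @ Y
    w, U = np.linalg.eigh(S)
    keep = w > tol * w.max()               # nonzero eigenvalues of S
    H, ell = U[:, keep], w[keep]
    R, d, _ = np.linalg.svd((X @ H) / np.sqrt(ell), full_matrices=False)
    f = d ** 2                             # eigenvalues of X S^+ X^T
    psi_plus = np.maximum(0.0, 1.0 - c / f)
    P = H @ H.T                            # S S^+
    return X @ (np.eye(X.shape[1]) - P) + R @ np.diag(psi_plus) @ R.T @ X @ P
```

When every $f_i$ exceeds $c$, the truncation is inactive and the result should coincide with $X - c(XS^+X^\top)^+XSS^+$. When every $f_i$ is below $c$, the component of $X$ in the row space of $Y$ is shrunk to zero and only $X(I_p - SS^+)$ remains.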
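6.6.2 Shrinkage Estimation with a Loss Matrix

Next, we look at shrinkage estimation under a loss matrix of the form
$$L_M(\hat\Theta, \Theta|\Sigma) = (\hat\Theta - \Theta)\Sigma^{-1}(\hat\Theta - \Theta)^\top, \tag{6.25}$$
which is an $m \times m$ symmetric positive semi-definite matrix. The corresponding risk matrix is defined by $R_M(\hat\Theta, \Theta) = E[L_M(\hat\Theta, \Theta|\Sigma)]$. The loss matrix (6.25) is used in Bilodeau and Kariya (1989). For a more general loss matrix, see Honda (1991).

Using Theorem 5.1 gives
$$R_M(\hat\Theta^{ML}, \Theta) = E[(X - \Theta)\Sigma^{-1}(X - \Theta)^\top] = E[\nabla_X(X - \Theta)^\top] = pI_m.$$
Thus an estimator $\hat\Theta$ is said to be better than $\hat\Theta^{ML}$ relative to the loss matrix (6.25) if $\hat\Theta$ has a smaller risk matrix than $pI_m$ in the Löwner sense, namely, $R_M(\hat\Theta, \Theta) \preceq pI_m$.

Here, we focus our attention on $\hat\Theta^{SH}$ in (6.11). Denote $G = R\Phi R^\top X$. The risk matrix of $\hat\Theta^{SH}$ is written as
$$\begin{aligned}
R_M(\hat\Theta^{SH}, \Theta) &= R_M(\hat\Theta^{ML}, \Theta) - E[(X - \Theta)\Sigma^{-1}SS^+G^\top] \\
&\quad - E[\{(X - \Theta)\Sigma^{-1}SS^+G^\top\}^\top] + E[GSS^+\Sigma^{-1}SS^+G^\top].
\end{aligned}$$
Using the Stein identity (5.1) gives $E[(X - \Theta)\Sigma^{-1}SS^+G^\top] = E[\nabla_XSS^+G^\top]$ and
$$\begin{aligned}
E[GSS^+\Sigma^{-1}SS^+G^\top] &= E[GS^+Y^\top Y\Sigma^{-1}SS^+G^\top] \\
&= E[GS^+Y^\top\nabla_YSS^+G^\top] + E[\{GSS^+\nabla_Y^\top YS^+G^\top\}^\top].
\end{aligned}$$
Thus the unbiased risk estimate of $\hat\Theta^{SH}$ relative to the loss matrix (6.25) becomes
$$\begin{aligned}
\hat R_M(\hat\Theta^{SH}) &= pI_m - \nabla_XSS^+G^\top - \{\nabla_XSS^+G^\top\}^\top \\
&\quad + GS^+Y^\top\nabla_YSS^+G^\top + \{GSS^+\nabla_Y^\top YS^+G^\top\}^\top,
\end{aligned}$$
implying that, according to Lemmas 6.6 and 6.7 in the Appendix,
$$\hat R_M(\hat\Theta^{SH}) = pI_m - 2(\operatorname{tr}\Phi)(I_m - RR^\top) + R\Phi^*R^\top,$$
where $\Phi^* = \operatorname{diag}(\phi_1^*, \dots, \phi_\tau^*)$ and for $i = 1, \dots, \tau$
$$\begin{aligned}
\phi_i^* &= af_i\phi_i^2 - 4f_i^2\phi_i\frac{\partial\phi_i}{\partial f_i} - 2\sum_{j\ne i}^\tau\frac{f_i^2\phi_i^2}{f_i - f_j} + 2\sum_{j\ne i}^\tau\frac{f_i\phi_if_j\phi_j}{f_i - f_j} \\
&\quad - 2(n \wedge p - \tau + 1)\phi_i - 4f_i\frac{\partial\phi_i}{\partial f_i} - 2\sum_{j\ne i}^\tau\frac{f_i\phi_i - f_j\phi_j}{f_i - f_j}
\end{aligned}$$
with $a = n + p - 2(n \wedge p) + 2\tau - 3 = (|n - p| + 2m) \wedge (n + p) - 3$. The above discussion is summarized as follows.

Proposition 6.1 If $\sum_{i=1}^\tau\phi_i \ge 0$ and $\phi_i^* \le 0$ for $i = 1, \dots, \tau$, then $\hat\Theta^{SH}$ dominates $\hat\Theta^{ML}$ relative to the loss matrix (6.25).

As a specific example, we consider improvement on the Efron-Morris type estimator (6.9). Putting $\phi_i = c/f_i$ gives $\phi_i^* = \{(a + 4)c^2 - 2(n \wedge p - \tau - 1)c\}f_i^{-1}$ for each $i$. Hence if $\tau = m$ and $0 < c \le 2(n \wedge p - m - 1)/(a + 4)$, then $\hat R_M(\hat\Theta^{EMK}) \preceq pI_m$.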
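6.6.3 Application to a GMANOVA Model

Here, we treat shrinkage estimation in a generalized MANOVA model (Potthoff and Roy, 1964) of the form
$$Z = ABC + E, \tag{6.26}$$
where $Z \in \mathbb{R}^{N\times q}$ is an observation matrix, $A \in \mathbb{R}^{N\times m_1}$ and $C \in \mathbb{R}^{p\times q}$ are constant matrices of full rank with $N \ge m_1$ and $q \ge p$, $B \in \mathbb{R}^{m_1\times p}$ is an unknown regression coefficient matrix and $E \in \mathbb{R}^{N\times q}$ is a random error matrix. Assume that $E \sim N_{N\times q}(0_{N\times q}, I_N \otimes \Sigma_0)$ and $\Sigma_0 \in \mathbb{S}_q^{(+)}$ is unknown. The generalized MANOVA model is abbreviated as the GMANOVA model and it is also called the growth curve model. The purpose of this section is to present a shrinkage estimator of $B$ improving on the ML estimator.

To simplify the estimation problem, we first derive a canonical form of (6.26). Let $\Gamma_A \in \mathbb{O}_N$ and $\Gamma_C \in \mathbb{O}_q$ such that
$$\Gamma_AA = \begin{pmatrix}(A^\top A)^{1/2} \\ 0_{(N-m_1)\times m_1}\end{pmatrix}, \qquad C\Gamma_C^\top = ((CC^\top)^{1/2}, 0_{p\times m_2})$$
for $m_2 = q - p$. Denote $\Theta = (A^\top A)^{1/2}B(CC^\top)^{1/2}$,
$$\Gamma_AZ\Gamma_C^\top = \begin{pmatrix}X_1 & U \\ Z_1 & Z_2\end{pmatrix}, \qquad \Gamma_C\Sigma_0\Gamma_C^\top = \begin{pmatrix}I_p & \Xi^\top \\ 0_{m_2\times p} & I_{m_2}\end{pmatrix}\begin{pmatrix}\Sigma & 0_{p\times m_2} \\ 0_{m_2\times p} & \Omega\end{pmatrix}\begin{pmatrix}I_p & 0_{p\times m_2} \\ \Xi & I_{m_2}\end{pmatrix},$$
where $X_1 \in \mathbb{R}^{m_1\times p}$, $U \in \mathbb{R}^{m_1\times m_2}$, $\Sigma \in \mathbb{S}_p^{(+)}$, $\Omega \in \mathbb{S}_{m_2}^{(+)}$ and $\Xi \in \mathbb{R}^{m_2\times p}$. Further let $\Gamma_Z \in \mathbb{O}_{N-m_1}$ such that
$$\Gamma_ZZ_2 = \begin{pmatrix}W^{1/2} \\ 0_{n\times m_2}\end{pmatrix}$$
with $n = N - m_1 - m_2$ and $W = Z_2^\top Z_2$. Denote $\Gamma_ZZ_1 = (V^\top W^{1/2}, Y^\top)^\top$, where $V \in \mathbb{R}^{m_2\times p}$. Then a canonical form of (6.26) is given as follows:
$$X_1|U \sim N_{m_1\times p}(\Theta + U\Xi, I_{m_1} \otimes \Sigma), \tag{6.27}$$
$$U \sim N_{m_1\times m_2}(0_{m_1\times m_2}, I_{m_1} \otimes \Omega), \tag{6.28}$$
$$Y \sim N_{n\times p}(0_{n\times p}, I_n \otimes \Sigma), \tag{6.29}$$
$$V|W \sim N_{m_2\times p}(\Xi, W^{-1} \otimes \Sigma), \tag{6.30}$$
$$W \sim W_{m_2}(N - m_1, \Omega), \tag{6.31}$$
where $(X_1, U)$, $Y$ and $(V, W)$ are independent. For derivation of the above canonical form, see Srivastava and Khatri (1979, pp. 192–193).

Here we view the estimation problem of $\Theta$ relative to the quadratic loss
$$L(\hat\Theta, \Theta|\Sigma) = \operatorname{tr}(\hat\Theta - \Theta)\Sigma^{-1}(\hat\Theta - \Theta)^\top. \tag{6.32}$$
The risk is defined by $R(\hat\Theta, \Theta) = E[L(\hat\Theta, \Theta|\Sigma)]$, where the expectation is taken with respect to (6.27)–(6.31). The ML estimator of $\Theta$ is
$$\hat\Theta^{ML} = X_1 - UV.$$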
Denote $S = Y^\top Y$. To improve $\hat\Theta^{ML}$, we consider a double shrinkage estimator (Kariya et al. 1996, 1999) having the form
$$\hat\Theta^{DSH} = X_1 - G_1 - U(V - G_2),$$
where $G_1 = G_1(X_1, S) \in \mathbb{R}^{m_1\times p}$, and $G_2 = G_2(V, S|U, W) \in \mathbb{R}^{m_2\times p}$ satisfies
$$G_2(V, S|U, W) = G_2(V, S|-U, W). \tag{6.33}$$
The risk of $\hat\Theta^{DSH}$ can be written as
$$\begin{aligned}
R(\hat\Theta^{DSH}, \Theta) &= E[\operatorname{tr}(X_1 - G_1 - \Theta - U\Xi)\Sigma^{-1}(X_1 - G_1 - \Theta - U\Xi)^\top] \\
&\quad - 2E[\operatorname{tr}(X_1 - G_1 - \Theta - U\Xi)\Sigma^{-1}(V - G_2 - \Xi)^\top U^\top] \\
&\quad + E[\operatorname{tr}U(V - G_2 - \Xi)\Sigma^{-1}(V - G_2 - \Xi)^\top U^\top].
\end{aligned} \tag{6.34}$$
The second term of the r.h.s. in (6.34) is zero because the distributions (6.27), (6.28) and (6.30) are symmetric and $G_2$ has the symmetry assumption (6.33), so that
$$R(\hat\Theta^{DSH}, \Theta) = E^U[R_1(G_1)] + E^{U,W}[R_2(G_2)],$$
where
$$R_1(G_1) = E[\operatorname{tr}(X_1 - G_1 - \Theta - U\Xi)\Sigma^{-1}(X_1 - G_1 - \Theta - U\Xi)^\top|U],$$
$$R_2(G_2) = E[\operatorname{tr}U^\top U(V - G_2 - \Xi)\Sigma^{-1}(V - G_2 - \Xi)^\top|U, W].$$
This suggests the possibility of double shrinkage estimation in both distributions of $X_1$ and $V$. The risk of the ML estimator can be expressed as $R(\hat\Theta^{ML}, \Theta) = E^U[R_1(0_{m_1\times p})] + E^{U,W}[R_2(0_{m_2\times p})]$.

For example, we consider the case of $m_1 \ge m_2$. Let $\tau_1 = m_1 \wedge n \wedge p$,
$$X_1S^+X_1^\top = R_1F_1R_1^\top, \qquad G_1^{EM} = c_1R_1F_1^{-1}R_1^\top X_1SS^+,$$
where $F_1 \in \mathbb{D}_{\tau_1}^{(\ge 0)}$, $R_1 \in \mathbb{V}_{m_1,\tau_1}$ and $c_1$ is a positive constant. In addition, let $\tau_2 = m_2 \wedge n \wedge p$, $X_2 = (U^\top U)^{-1/2}WV$,
$$X_2S^+X_2^\top = R_2F_2R_2^\top, \qquad G_2^{EM} = c_2(U^\top U)^{-1/2}R_2F_2^{-1}R_2^\top X_2SS^+,$$
where $F_2 \in \mathbb{D}_{\tau_2}^{(\ge 0)}$, $R_2 \in \mathbb{V}_{m_2,\tau_2}$ and $c_2$ is a positive constant. Then the Efron-Morris type estimator is defined by
$$\hat\Theta^{EM} = X_1 - G_1^{EM} - U(V - G_2^{EM}).$$
Using the same arguments as in Sect. 6.5.1 immediately gives $R_1(G_1^{EM}) \le R_1(0_{m_1\times p})$ if
$$0 < c_1 \le \frac{2(|n \wedge p - m_1| - 1)}{(|n - p| + 2m_1) \wedge (n + p) + 1}. \tag{6.35}$$
For evaluating $R_2(G_2^{EM})$, let $\nabla_V$ and $\nabla_{X_2}$ be matrix differential operators with respect to $V$ and $X_2$, respectively. Using the same arguments as in (5.2) gives $\nabla_{X_2} = (U^\top U)^{1/2}W^{-1}\nabla_V$. The Stein identity (5.1) in terms of (6.30) is used to obtain
$$\begin{aligned}
E[\operatorname{tr}U^\top U(V - \Xi)\Sigma^{-1}(G_2^{EM})^\top|U, W] &= E[\operatorname{tr}U^\top UW^{-1}\nabla_V(G_2^{EM})^\top|U, W] \\
&= c_2E[\operatorname{tr}(U^\top U)^{1/2}W^{-1}\nabla_VSS^+X_2^\top R_2F_2^{-1}R_2^\top|U, W] \\
&= c_2E[\operatorname{tr}\nabla_{X_2}SS^+X_2^\top R_2F_2^{-1}R_2^\top|U, W].
\end{aligned}$$
Note also that
$$\begin{aligned}
E[\operatorname{tr}U^\top UG_2^{EM}\Sigma^{-1}(G_2^{EM})^\top|U, W] &= c_2^2E[\operatorname{tr}\Sigma^{-1}SS^+X_2^\top R_2F_2^{-2}R_2^\top X_2SS^+|U, W] \\
&= c_2^2E[\operatorname{tr}\nabla_Y^\top YS^+X_2^\top R_2F_2^{-2}R_2^\top X_2SS^+|U, W].
\end{aligned}$$
Hence,
$$\begin{aligned}
R_2(G_2^{EM}) &= R_2(0_{m_2\times p}) - 2c_2E[\operatorname{tr}\nabla_{X_2}SS^+X_2^\top R_2F_2^{-1}R_2^\top|U, W] \\
&\quad + c_2^2E[\operatorname{tr}\nabla_Y^\top YS^+X_2^\top R_2F_2^{-2}R_2^\top X_2SS^+|U, W].
\end{aligned}$$
Using the same arguments as in Sects. 6.4 and 6.5.1 gives $R_2(G_2^{EM}) \le R_2(0_{m_2\times p})$ if
$$0 < c_2 \le \frac{2(|n \wedge p - m_2| - 1)}{(|n - p| + 2m_2) \wedge (n + p) + 1}. \tag{6.36}$$
From the above, $\hat\Theta^{EM}$ dominates $\hat\Theta^{ML}$ relative to the quadratic loss (6.32) if $c_1$ and $c_2$ satisfy, respectively, (6.35) and (6.36). A unified dominance result in the case of $m_2 > m_1$ can be established in a similar way to the case of $m_1 \ge m_2$, but it is omitted.

Other approaches to decision-theoretic estimation in the GMANOVA model have been studied by Tan (1991), Kubokawa et al. (1992) and Kariya et al. (1996, 1999).
6.6.4 Generalization in an Elliptically Contoured Model

Consider the multivariate linear model (4.1), but the $N \times p$ random error matrix $E$ is assumed to have a p.d.f. of the form
$$|\Sigma|^{-N/2}g(\operatorname{tr}\Sigma^{-1}E^\top E), \tag{6.37}$$
where $g$ is a nonnegative and nonincreasing function on the nonnegative real line. In general, a probability distribution having the p.d.f. (6.37) is commonly called the elliptically contoured distribution. This section introduces shrinkage estimation in the elliptically contoured distribution model.

Using the orthogonal transformation as in Sect. 4.2, we can rewrite (6.37) as
$$|\Sigma|^{-N/2}g(\operatorname{tr}(X - \Theta)\Sigma^{-1}(X - \Theta)^\top + \operatorname{tr}\Sigma^{-1}Y^\top Y), \tag{6.38}$$
where $N = m + n$, $X \in \mathbb{R}^{m\times p}$, $Y \in \mathbb{R}^{n\times p}$, $\Theta \in \mathbb{R}^{m\times p}$ and $\Sigma \in \mathbb{S}_p^{(+)}$. Suppose that $\Theta$ and $\Sigma$ are unknown and that, using $X$ and $Y$, we want to decision-theoretically estimate $\Theta$ relative to the quadratic loss (6.7).

Let
$$G(x) = \frac12\int_x^\infty g(t)\,dt.$$
Denote
$$E^g[u(X, Y)] = \int_{\mathbb{R}^{m\times p}\times\mathbb{R}^{n\times p}}u(X, Y)|\Sigma|^{-N/2}g(w)(dX)(dY),$$
$$E^G[u(X, Y)] = \int_{\mathbb{R}^{m\times p}\times\mathbb{R}^{n\times p}}u(X, Y)|\Sigma|^{-N/2}G(w)(dX)(dY),$$
where $w = \operatorname{tr}(X - \Theta)\Sigma^{-1}(X - \Theta)^\top + \operatorname{tr}\Sigma^{-1}Y^\top Y$ and $u$ is an integrable function. Let $U \in \mathbb{R}^{m\times p}$ such that all elements of $U$ are absolutely continuous functions of $X$. Then, under some suitable conditions,
$$E^g[\operatorname{tr}(X - \Theta)\Sigma^{-1}U^\top] = E^G[\operatorname{tr}\nabla_XU^\top]. \tag{6.39}$$
For details of the conditions, see Kubokawa and Srivastava (2001). The identity (6.39) is an extension of the Stein identity (5.3).

Applying the Stein identity (6.39) to the risk function of $\hat\Theta^{ML} = X$, we obtain
$$R_g(\hat\Theta^{ML}, \Theta) = E^g[\operatorname{tr}(X - \Theta)\Sigma^{-1}(X - \Theta)^\top] = E^G[\operatorname{tr}\nabla_X(X - \Theta)^\top] = E^G[mp].$$
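The risk of $\hat\Theta^{SH}$ in (6.11) is expanded to
$$\begin{aligned}
R_g(\hat\Theta^{SH}, \Theta|\Sigma) = R_g(\hat\Theta^{ML}, \Theta|\Sigma) + E^g[&\operatorname{tr}R\Phi R^\top XSS^+\Sigma^{-1}SS^+X^\top R\Phi R^\top \\
&- 2\operatorname{tr}(X - \Theta)\Sigma^{-1}SS^+X^\top R\Phi R^\top],
\end{aligned}$$
so that, by the Stein identity (6.39), $R_g(\hat\Theta^{SH}, \Theta|\Sigma) = E^G[\hat R_G^{SH}]$, where
$$\hat R_G^{SH} = mp + \operatorname{tr}\nabla_Y^\top YS^+X^\top R\Phi^2R^\top XSS^+ - 2\operatorname{tr}\nabla_XSS^+X^\top R\Phi R^\top.$$
Although $\hat R_G^{SH}$ is not an unbiased risk estimator for $\hat\Theta^{SH}$, we can see that $\hat\Theta^{SH}$ dominates $\hat\Theta^{ML}$ if $\hat R_G^{SH} \le mp$. Hence the improving procedures in Sect. 6.5 can be applied to estimation of $\Theta$ in (6.38), and the corresponding dominance results hold without depending on the underlying function $g$.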
Appendix

This appendix provides some brief proofs of useful results on matrix differential operators that were previously applied to Theorem 6.1.

Let $X = (x_{ab}) \in \mathbb{R}^{m\times p}$ and $Y = (y_{ab}) \in \mathbb{R}^{n\times p}$. Denote the matrix differential operators with respect to $X$ and $Y$, respectively, by $\nabla_X = (d_{ab}^X)$ with $d_{ab}^X = \partial/\partial x_{ab}$ and by $\nabla_Y = (d_{ab}^Y)$ with $d_{ab}^Y = \partial/\partial y_{ab}$. Let $\tau = m \wedge n \wedge p$. Here, the eigenvalue decomposition of $XS^+X^\top$ is $XS^+X^\top = RFR^\top$, where $F = \operatorname{diag}(f_1, \dots, f_\tau) \in \mathbb{D}_\tau^{(\ge 0)}$ and $R = (r_{ij}) \in \mathbb{V}_{m,\tau}$.

Lemma 6.5 For $i = 1, \dots, \tau$, $k = 1, \dots, m$, $a = 1, \dots, m$ and $b = 1, \dots, p$, we have

(i) $d_{ab}^Xf_i = A_{ab}^{ii}$,

(ii) $d_{ab}^Xr_{ki} = \sum_{j\ne i}^\tau\dfrac{r_{kj}A_{ab}^{ij}}{f_i - f_j} + f_i^{-1}\{I_m - RR^\top\}_{ka}\{R^\top XS^+\}_{ib}$,

where $A_{ab}^{ij} = r_{aj}\{R^\top XS^+\}_{ib} + r_{ai}\{R^\top XS^+\}_{jb}$. For $i = 1, \dots, \tau$, $k = 1, \dots, m$, $a = 1, \dots, n$ and $b = 1, \dots, p$, we have

(iii) $d_{ab}^Yf_i = B_{ab}^{ii}$,

(iv) $d_{ab}^Yr_{ki} = \sum_{j\ne i}^\tau\dfrac{r_{kj}B_{ab}^{ij}}{f_i - f_j} + f_i^{-1}\{R^\top XS^+S^+Y^\top\}_{ia}\{(I_m - RR^\top)X(I_p - SS^+)\}_{kb}$,

where
$$\begin{aligned}
B_{ab}^{ij} &= -\{R^\top XS^+Y^\top\}_{ia}\{R^\top XS^+\}_{jb} - \{R^\top XS^+Y^\top\}_{ja}\{R^\top XS^+\}_{ib} \\
&\quad + \{R^\top XS^+S^+Y^\top\}_{ia}\{R^\top X(I_p - SS^+)\}_{jb} + \{R^\top XS^+S^+Y^\top\}_{ja}\{R^\top X(I_p - SS^+)\}_{ib}.
\end{aligned}$$

Proof Take $R_* \in \mathbb{V}_{m,m-\tau}$ such that $R_*^\top R = 0_{(m-\tau)\times\tau}$. Define $R_0 = (R, R_*) \in \mathbb{O}_m$. Denote $F_0 = \operatorname{diag}(f_1, \dots, f_\tau, 0, \dots, 0)$ $(\in \mathbb{D}_m^{(\ge 0)})$. Then $XS^+X^\top = R_0F_0R_0^\top$.

Differentiating both sides of $R_0^\top R_0 = I_m$ gives $[dR_0^\top]R_0 + R_0^\top dR_0 = 0_{m\times m}$, implying that $R_0^\top dR_0$ is skew-symmetric in $\mathbb{R}^{m\times m}$. Thus, for $j, i \in \{1, \dots, m\}$, $\{R_0^\top dR_0\}_{ji} = 0$ if $j = i$ and $\{R_0^\top dR_0\}_{ji} = -\{R_0^\top dR_0\}_{ij} = -\{[dR_0^\top]R_0\}_{ji}$ otherwise. Differentiating both sides of $XS^+X^\top = R_0F_0R_0^\top$ gives
$$d(XS^+X^\top) = [dR_0]F_0R_0^\top + R_0[dF_0]R_0^\top + R_0F_0\,dR_0^\top,$$
and then
$$\begin{aligned}
R_0^\top[d(XS^+X^\top)]R_0 &= [R_0^\top dR_0]F_0 + dF_0 + F_0[dR_0^\top]R_0 \\
&= [R_0^\top dR_0]F_0 + dF_0 - F_0R_0^\top dR_0.
\end{aligned}$$
Comparing each element in both sides of the above identity, we have
$$df_i = \{R^\top[d(XS^+X^\top)]R\}_{ii} \quad \text{for } i \in \{1, \dots, \tau\},$$
$$\{R^\top dR\}_{ji} = \frac{\{R^\top[d(XS^+X^\top)]R\}_{ji}}{f_i - f_j} \quad \text{for } j, i \in \{1, \dots, \tau\} \text{ with } j \ne i,$$
$$\{R_*^\top dR\}_{ji} = \frac{\{R_*^\top[d(XS^+X^\top)]R\}_{ji}}{f_i} \quad \text{for } j \in \{1, \dots, m - \tau\} \text{ and } i \in \{1, \dots, \tau\}.$$
Note that $d_{ab}^XX = E_{ab}$, where $E_{ab} \in \mathbb{R}^{m\times p}$ such that the $(a, b)$-th element is one and the other elements are zeros. Since $d_{ab}^X(XS^+X^\top) = [d_{ab}^XX]S^+X^\top + XS^+[d_{ab}^XX^\top] = E_{ab}S^+X^\top + XS^+E_{ab}^\top$, we observe that, for $j, i \in \{1, \dots, \tau\}$,
$$\begin{aligned}
\{R^\top[d_{ab}^X(XS^+X^\top)]R\}_{ji} &= \{R^\top E_{ab}S^+X^\top R\}_{ji} + \{R^\top XS^+E_{ab}^\top R\}_{ji} \\
&= r_{aj}\{S^+X^\top R\}_{bi} + \{R^\top XS^+\}_{jb}r_{ai} = A_{ab}^{ij}.
\end{aligned} \tag{6.40}$$
Thus, for $i = 1, \dots, \tau$, $d_{ab}^Xf_i = \{R^\top[d_{ab}^X(XS^+X^\top)]R\}_{ii} = A_{ab}^{ii}$, which shows (i). On the other hand, it is observed that for $k = 1, \dots, m$ and $i = 1, \dots, \tau$
$$\begin{aligned}
d_{ab}^Xr_{ki} &= \{d_{ab}^XR\}_{ki} = \{(RR^\top + R_*R_*^\top)d_{ab}^XR\}_{ki} \\
&= \sum_{j\ne i}^\tau r_{kj}\{R^\top d_{ab}^XR\}_{ji} + \{R_*R_*^\top d_{ab}^XR\}_{ki} \\
&= \sum_{j\ne i}^\tau\frac{r_{kj}A_{ab}^{ij}}{f_i - f_j} + \frac{\{R_*R_*^\top[d_{ab}^X(XS^+X^\top)]R\}_{ki}}{f_i}.
\end{aligned} \tag{6.41}$$
In a similar way to (6.40), for $k = 1, \dots, m$ and $i = 1, \dots, \tau$,
$$\{R_*R_*^\top[d_{ab}^X(XS^+X^\top)]R\}_{ki} = \{R_*R_*^\top\}_{ka}\{S^+X^\top R\}_{bi} + \{R_*R_*^\top XS^+\}_{kb}r_{ai}.$$
Here, $R_*^\top XS^+ = 0_{(m-\tau)\times p}$, so that
$$\{R_*R_*^\top[d_{ab}^X(XS^+X^\top)]R\}_{ki} = \{I_m - RR^\top\}_{ka}\{R^\top XS^+\}_{ib}. \tag{6.42}$$
Substituting (6.42) into (6.41), we obtain (ii).

Since $\{R^\top[d_{ab}^Y(XS^+X^\top)]R\}_{ji} = \{R^\top X[d_{ab}^YS^+]X^\top R\}_{ji}$ for $j, i \in \{1, \dots, \tau\}$, it is observed from (iii) of Lemma 5.2 that
$$\begin{aligned}
\{R^\top[d_{ab}^Y(XS^+X^\top)]R\}_{ji} &= -\{R^\top XS^+\}_{jb}\{R^\top XS^+Y^\top\}_{ia} - \{R^\top XS^+Y^\top\}_{ja}\{R^\top XS^+\}_{ib} \\
&\quad + \{R^\top X(I_p - SS^+)\}_{jb}\{R^\top XS^+S^+Y^\top\}_{ia} + \{R^\top XS^+S^+Y^\top\}_{ja}\{R^\top X(I_p - SS^+)\}_{ib} \\
&= B_{ab}^{ij}.
\end{aligned}$$
Similarly,
$$\{R_*R_*^\top[d_{ab}^Y(XS^+X^\top)]R\}_{ki} = \{R^\top XS^+S^+Y^\top\}_{ia}\{(I_m - RR^\top)X(I_p - SS^+)\}_{kb}$$
for $k = 1, \dots, m$ and $i = 1, \dots, \tau$. Hence using the same arguments as in the proofs of (i) and (ii) yields (iii) and (iv).

Lemma 6.6 Let $c_0 = |n \wedge p - m| + 1$. Define $\Phi = \operatorname{diag}(\phi_1, \dots, \phi_\tau) \in \mathbb{D}_\tau$ such that the diagonals of $\Phi$ are absolutely continuous functions of $F$. Then
$$\nabla_XSS^+X^\top R\Phi R^\top = R\Phi^*R^\top + (\operatorname{tr}\Phi)(I_m - RR^\top),$$
where $\Phi^* = \operatorname{diag}(\phi_1^*, \dots, \phi_\tau^*)$ and for $i = 1, \dots, \tau$
$$\phi_i^* = (n \wedge p - \tau + 1)\phi_i + 2f_i\frac{\partial\phi_i}{\partial f_i} + \sum_{j\ne i}^\tau\frac{f_i\phi_i - f_j\phi_j}{f_i - f_j}. \tag{6.43}$$
In particular,
$$\operatorname{tr}\nabla_XSS^+X^\top R\Phi R^\top = \sum_{i=1}^\tau\Big\{c_0\phi_i + 2f_i\frac{\partial\phi_i}{\partial f_i} + 2\sum_{j>i}^\tau\frac{f_i\phi_i - f_j\phi_j}{f_i - f_j}\Big\}. \tag{6.44}$$
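Proof For $a, c \in \{1, \dots, \tau\}$, the $(a, c)$-th element of $\nabla_XSS^+X^\top R\Phi R^\top$ is
$$\{\nabla_XSS^+X^\top R\Phi R^\top\}_{ac} = \sum_{b=1}^p\sum_{d=1}^p\sum_{k=1}^m\sum_{i=1}^\tau d_{ab}^X[\{SS^+\}_{bd}x_{kd}r_{ki}\phi_ir_{ci}],$$
and then
$$\{\nabla_XSS^+X^\top R\Phi R^\top\}_{ac} = D_{ac}^{(1)} + D_{ac}^{(2)} + D_{ac}^{(3)}, \tag{6.45}$$
where
$$D_{ac}^{(1)} = \sum_{b=1}^p\sum_{d=1}^p\sum_{k=1}^m\{SS^+\}_{bd}\{R\Phi R^\top\}_{kc}\,d_{ab}^Xx_{kd},$$
$$D_{ac}^{(2)} = \sum_{b=1}^p\sum_{i=1}^\tau\{SS^+X^\top R\}_{bi}r_{ci}\,d_{ab}^X\phi_i,$$
$$D_{ac}^{(3)} = \sum_{b=1}^p\sum_{k=1}^m\sum_{i=1}^\tau\{SS^+X^\top\}_{bk}\phi_i(r_{ci}\,d_{ab}^Xr_{ki} + r_{ki}\,d_{ab}^Xr_{ci}).$$
Since $d_{ab}^Xx_{kd} = \delta_{ak}\delta_{bd}$ and $SS^+$ is idempotent with rank $n \wedge p$, it follows that
$$D_{ac}^{(1)} = \sum_{b=1}^p\{SS^+\}_{bb}\{R\Phi R^\top\}_{ac} = (n \wedge p)\{R\Phi R^\top\}_{ac}. \tag{6.46}$$
To evaluate $D_{ac}^{(2)}$, we first use the chain rule and (i) of Lemma 6.5 to obtain
$$d_{ab}^X\phi_i = \sum_{j=1}^\tau\frac{\partial\phi_i}{\partial f_j}[d_{ab}^Xf_j] = \sum_{j=1}^\tau\frac{\partial\phi_i}{\partial f_j}A_{ab}^{jj}.$$
Note that $S^+SS^+ = S^+$ and
$$\sum_{b=1}^p\{SS^+X^\top R\}_{bi}A_{ab}^{jj} = 2r_{aj}\{R^\top XS^+X^\top R\}_{ji} = 2r_{aj}\{F\}_{ji},$$
so that
$$D_{ac}^{(2)} = 2\sum_{i=1}^\tau\sum_{j=1}^\tau\frac{\partial\phi_i}{\partial f_j}r_{aj}r_{ci}\{F\}_{ji} = 2\sum_{i=1}^\tau r_{ai}r_{ci}f_i\frac{\partial\phi_i}{\partial f_i}. \tag{6.47}$$
Finally, we consider $D_{ac}^{(3)}$. Using (ii) of Lemma 6.5 yields
$$\sum_{k=1}^m\{SS^+X^\top\}_{bk}\,d_{ab}^Xr_{ki} = \sum_{j\ne i}^\tau\frac{\{SS^+X^\top R\}_{bj}A_{ab}^{ij}}{f_i - f_j} + f_i^{-1}\{(I_m - RR^\top)XSS^+\}_{ab}\{R^\top XS^+\}_{ib}.$$
Since
$$\sum_{b=1}^p\{SS^+X^\top R\}_{bj}A_{ab}^{ij} = r_{aj}\{F\}_{ij} + r_{ai}f_j,$$
$$\sum_{b=1}^p\{(I_m - RR^\top)XSS^+\}_{ab}\{R^\top XS^+\}_{ib} = \{(I_m - RR^\top)XS^+X^\top R\}_{ai} = 0,$$
it is seen that
$$\begin{aligned}
\sum_{b=1}^p\sum_{k=1}^m\sum_{i=1}^\tau\{SS^+X^\top\}_{bk}\phi_ir_{ci}\,d_{ab}^Xr_{ki}
&= \sum_{i=1}^\tau\sum_{j\ne i}^\tau\frac{r_{ai}r_{ci}f_j\phi_i}{f_i - f_j} = \sum_{i=1}^\tau\sum_{j\ne i}^\tau\frac{r_{ai}r_{ci}(f_j - f_i + f_i)\phi_i}{f_i - f_j} \\
&= -(\tau - 1)\sum_{i=1}^\tau r_{ai}r_{ci}\phi_i + \sum_{i=1}^\tau\sum_{j\ne i}^\tau\frac{r_{ai}r_{ci}f_i\phi_i}{f_i - f_j}.
\end{aligned} \tag{6.48}$$
Similarly,
$$\begin{aligned}
\sum_{b=1}^p\sum_{k=1}^m\sum_{i=1}^\tau\{SS^+X^\top\}_{bk}\phi_ir_{ki}\,d_{ab}^Xr_{ci}
&= \sum_{i=1}^\tau\sum_{j\ne i}^\tau\frac{r_{aj}r_{cj}f_i\phi_i}{f_i - f_j} + (\operatorname{tr}\Phi)\{I_m - RR^\top\}_{ac} \\
&= -\sum_{i=1}^\tau\sum_{j\ne i}^\tau\frac{r_{ai}r_{ci}f_j\phi_j}{f_i - f_j} + (\operatorname{tr}\Phi)\{I_m - RR^\top\}_{ac}.
\end{aligned} \tag{6.49}$$
Combining (6.48) and (6.49) gives
$$D_{ac}^{(3)} = \sum_{i=1}^\tau r_{ai}r_{ci}\Big\{-(\tau - 1)\phi_i + \sum_{j\ne i}^\tau\frac{f_i\phi_i - f_j\phi_j}{f_i - f_j}\Big\} + (\operatorname{tr}\Phi)\{I_m - RR^\top\}_{ac}. \tag{6.50}$$
Substituting (6.46), (6.47) and (6.50) into (6.45) yields (6.43). Note that
$$n \wedge p - \tau + 1 + \operatorname{tr}(I_m - RR^\top) = n \wedge p + m - 2\tau + 1 = |n \wedge p - m| + 1 = c_0$$
and also that
$$\sum_{i=1}^\tau\sum_{j\ne i}^\tau\frac{f_i\phi_i - f_j\phi_j}{f_i - f_j} = 2\sum_{i=1}^\tau\sum_{j>i}^\tau\frac{f_i\phi_i - f_j\phi_j}{f_i - f_j}.$$
Hence taking the trace of (6.43) yields (6.44), which completes the proof.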
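Identity (6.44) can be spot-checked by finite differences in the simplest setting $n \ge p \ge m$ with $\phi_i = c/f_i$, where $R\Phi R^\top XSS^+ = c(XS^{-1}X^\top)^{-1}X$ and (6.44) reduces to $(c_0 - 2)c\sum_i 1/f_i$. The sketch below is an assumption-laden illustration: the helper names and the step size `h` are chosen here, and the check only covers this one special case.

```python
import numpy as np


def em_shrinkage_term(X, S_inv, c):
    """G(X) = c (X S^{-1} X^T)^{-1} X, i.e. R Phi R^T X S S^+ for phi_i = c / f_i
    when S is nonsingular and m <= p."""
    return c * np.linalg.solve(X @ S_inv @ X.T, X)


def numerical_divergence(G, X, h=1e-6):
    """Central-difference value of sum_{a,b} dG_ab / dx_ab = tr(nabla_X G^T)."""
    total = 0.0
    for a in range(X.shape[0]):
        for b in range(X.shape[1]):
            step = np.zeros_like(X)
            step[a, b] = h
            total += (G(X + step)[a, b] - G(X - step)[a, b]) / (2 * h)
    return total


def closed_form_divergence(X, S_inv, c, n):
    """Right-hand side of (6.44) for phi_i = c / f_i: (c_0 - 2) c sum_i 1 / f_i."""
    m, p = X.shape
    c0 = abs(min(n, p) - m) + 1
    f = np.linalg.eigvalsh(X @ S_inv @ X.T)
    return (c0 - 2) * c * np.sum(1.0 / f)
```

The two values are expected to agree up to the discretization error of the central difference, which gives a quick sanity check on the constant $c_0$ in Lemma 6.6.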
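Lemma 6.7 Let $c_1 = n - (n \wedge p) + \tau - 2$, $c_2 = p - (n \wedge p) + \tau - 1$ and $c_0 = c_1 + c_2$. Let $\Phi = \Phi(F) = \operatorname{diag}(\phi_1, \dots, \phi_\tau)$, where the $\phi_i$'s are absolutely continuous functions of $F$. Then
$$R\Phi R^\top XSS^+\nabla_Y^\top YS^+X^\top R\Phi R^\top = R\Phi^{*1}R^\top, \tag{6.51}$$
$$R\Phi R^\top XS^+Y^\top\nabla_YSS^+X^\top R\Phi R^\top = R\Phi^{*2}R^\top, \tag{6.52}$$
where $\Phi^{*k} = \operatorname{diag}(\phi_1^{*k}, \dots, \phi_\tau^{*k})$ for $k = 1, 2$ and, for $i = 1, \dots, \tau$,
$$\phi_i^{*k} = c_kf_i\phi_i^2 - 2f_i^2\phi_i\frac{\partial\phi_i}{\partial f_i} - \sum_{j\ne i}^\tau\frac{f_i^2\phi_i^2}{f_i - f_j} + \sum_{j\ne i}^\tau\frac{f_i\phi_if_j\phi_j}{f_i - f_j}.$$
In particular,
$$\operatorname{tr}\nabla_Y^\top YS^+X^\top R\Phi^2R^\top XSS^+ = \sum_{i=1}^\tau\Big\{c_0f_i\phi_i^2 - 2f_i^2\frac{\partial(\phi_i^2)}{\partial f_i} - 2\sum_{j>i}^\tau\frac{f_i^2\phi_i^2 - f_j^2\phi_j^2}{f_i - f_j}\Big\}. \tag{6.53}$$

Proof The proofs of (6.51) and (6.52) can be done by using the same arguments as in the proof of Lemma 6.6. Since
$$\begin{aligned}
\operatorname{tr}\nabla_Y^\top YS^+X^\top R\Phi^2R^\top XSS^+ &= \operatorname{tr}\nabla_Y^\top\{YS^+X^\top R\Phi R^\top \cdot R\Phi R^\top XSS^+\} \\
&= \operatorname{tr}R\Phi R^\top XSS^+\nabla_Y^\top YS^+X^\top R\Phi R^\top + \operatorname{tr}R\Phi R^\top XS^+Y^\top\nabla_YSS^+X^\top R\Phi R^\top,
\end{aligned}$$
the identity (6.53) can be verified by combining (6.51) and (6.52).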
References
A.J. Baranchik, A family of minimax estimators of the mean of a multivariate normal distribution. Ann. Math. Stat. 41, 642–645 (1970)
M. Bilodeau, T. Kariya, Minimax estimators in the normal MANOVA model. J. Multivar. Anal. 28, 260–270 (1989)
D. Chételat, M.T. Wells, Improved multivariate normal mean estimation with unknown covariance when p is greater than n. Ann. Stat. 40, 3137–3160 (2012)
B. Efron, C. Morris, Empirical Bayes on vector observations: an extension of Stein's method. Biometrika 59, 335–347 (1972)
B. Efron, C. Morris, Multivariate empirical Bayes and estimation of covariance matrices. Ann. Stat. 4, 22–32 (1976)
M.H.J. Gruber, Improving Efficiency by Shrinkage (Marcel Dekker, New York, 1998)
T. Honda, Minimax estimators in the MANOVA model for arbitrary quadratic loss and unknown covariance matrix. J. Multivar. Anal. 36, 113–120 (1991)
W. James, C. Stein, Estimation with quadratic loss, in Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, ed. by J. Neyman (University of California Press, Berkeley, 1961), pp. 361–379
T. Kariya, Y. Konno, W.E. Strawderman, Double shrinkage estimators in the GMANOVA model. J. Multivar. Anal. 56, 245–258 (1996)
T. Kariya, Y. Konno, W.E. Strawderman, Construction of shrinkage estimators for the regression coefficient matrix in the GMANOVA model. Commun. Stat. Theory Methods 28, 597–611 (1999)
Y. Konno, Families of minimax estimators of matrix of normal means with unknown covariance matrix. J. Japan Stat. Soc. 20, 191–201 (1990)
Y. Konno, On estimation of a matrix of normal means with unknown covariance matrix. J. Multivar. Anal. 36, 44–55 (1991)
Y. Konno, Improved estimation of matrix of normal mean and eigenvalues in the multivariate F-distribution. Doctoral dissertation, Institute of Mathematics, University of Tsukuba, 1992. (http://mcm-www.jwu.ac.jp/~konno/)
T. Kubokawa, A.K.Md.E. Saleh, K. Morita, Improving on MLE of coefficient matrix in a growth curve model. J. Stat. Plann. Infer. 31, 169–177 (1992)
T. Kubokawa, M.S. Srivastava, Robust improvement in estimation of a mean matrix in an elliptically contoured distribution. J. Multivar. Anal. 76, 138–152 (2001)
R.F. Potthoff, S.N. Roy, A generalized multivariate analysis of variance model useful especially for growth curve problems. Biometrika 51, 313–326 (1964)
M.S. Srivastava, C.G. Khatri, An Introduction to Multivariate Statistics (North Holland, New York, 1979)
C. Stein, Estimation of the mean of a multivariate normal distribution. Technical Reports No. 48 (Department of Statistics, Stanford University, Stanford, 1973)
M. Tan, Improved estimators for the GMANOVA problem with application to Monte Carlo simulation. J. Multivar. Anal. 38, 262–274 (1991)
H. Tsukuma, Shrinkage minimax estimation and positive-part rule for a mean matrix in an elliptically contoured distribution. Stat. Probab. Lett. 80, 215–220 (2010)
H. Tsukuma, T. Kubokawa, Methods for improvement in estimation of a normal mean matrix. J. Multivar. Anal. 98, 1592–1610 (2007)
H. Tsukuma, T. Kubokawa, A unified approach to estimating a normal mean matrix in high and low dimensions. J. Multivar. Anal. 139, 312–328 (2015)
Chapter 7
Estimation of the Covariance Matrix
This chapter addresses decision-theoretic estimation of an error covariance matrix in a multivariate linear model relative to a Stein-type entropy loss. With a unified treatment of the high- and low-dimensional cases, it discusses several important methods for improving on the best scale invariant and the best triangular invariant estimators by using the residual sum of squares matrix only. The chapter also provides dominance results that use the information in both the residual sum of squares matrix and the least squares estimator of the regression coefficient matrix.
7.1 Introduction
As seen in (4.2), a canonical form of the multivariate linear model (4.1) is given by
\[
Y \sim N_{n\times p}(0_{n\times p}, I_n \otimes \Sigma), \qquad X \sim N_{m\times p}(\Theta, I_m \otimes \Sigma),
\]
where $Y$ and $X$ are mutually independent, and $\Theta \in \mathbb{R}^{m\times p}$ and $\Sigma \in S_p^{(+)}$ are matrices of unknown parameters. Throughout this chapter, we use the following notation:
\[
\nu = n \wedge p, \qquad \kappa = n \vee p.
\]
Here the Wishart matrix $S = Y^\top Y$ is of rank $\nu$ and the above canonical form is replaced by
\[
S \sim W_p^{\nu}(n, \Sigma), \qquad X \sim N_{m\times p}(\Theta, I_m \otimes \Sigma). \tag{7.1}
\]
The problem of estimating the covariance matrix $\Sigma$ in (7.1) is looked at from a decision-theoretic point of view.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2020 H. Tsukuma and T. Kubokawa, Shrinkage Estimation for Mean and Covariance Matrices, JSS Research Series in Statistics, https://doi.org/10.1007/978-981-15-1596-5_7
In the literature, several loss functions have been employed for decision-theoretic estimation of $\Sigma$ with $n \ge p$. One such loss function is Stein's (1956) entropy loss
\[
L_S(\hat{\Sigma}, \Sigma) = \mathrm{tr}\,\Sigma^{-1}\hat{\Sigma} - \log|\Sigma^{-1}\hat{\Sigma}| - p. \tag{7.2}
\]
Stein (1956) focused on triangular invariant estimators and succeeded in deriving a minimax estimator improving on the sample covariance matrix relative to the loss (7.2). Note that Stein's (1956) results were summarized in James and Stein (1961) and, in this chapter, the minimax estimator is called James and Stein's minimax estimator. Although James and Stein's minimax estimator dominates the sample covariance matrix, it depends on the coordinate system, and this dependence causes its inadmissibility. Typical estimators improving on James and Stein's minimax estimator are orthogonally invariant estimators, which are not influenced by the coordinate system. The orthogonally invariant estimators have been studied since Stein (1975, 1977). For other studies on orthogonally invariant estimators, see Takemura (1984), Dey and Srinivasan (1985), Sheena and Takemura (1992) and Perron (1992).
James and Stein's minimax estimator and the improved estimators mentioned above are based only on $S = Y^\top Y$, while truncation rules have been proposed for improving the existing estimators by using the information contained in $X$. Such a truncation rule was first derived by Stein (1964) in estimation of the variance of a normal distribution, and several extensions to multivariate models were studied by Sinha and Ghosh (1987), Perron (1990), Kubokawa et al. (1992) and Kubokawa and Srivastava (2003) in the $n \ge p$ case. These articles applied conditional arguments to derive dominance results, but Kubokawa and Tsai (2006) used the Stein identity (5.3) to suggest an alternative truncation rule with shrinkage. When $p > n$, Konno (2009) studied decision-theoretic covariance estimation relative to a quadratic loss.
In the $p > n$ case, the Stein loss (7.2) is not available for singular estimators such as the unbiased estimator $S/n$ since $|\Sigma^{-1}S| = 0$. An extended Stein-type entropy loss applicable to singular estimators was treated by Tsukuma (2016a) and Tsukuma and Kubokawa (2016). This chapter will take a unified approach to both cases of $n \ge p$ and $p > n$. We assume that any estimator lies in $S_{p,\nu}^{(+)}$ and is of the same rank as $S$. More specifically, any estimator is of rank $p$ when $n \ge p$ and is of rank $n$ when $p > n$.
Now, let $\hat{\Sigma}$ be an estimator of $\Sigma$ based on $S$ and $X$ in (7.1), where $\hat{\Sigma} \in S_{p,\nu}^{(+)}$. Since $\Sigma^{-1}$ is positive definite, $\Sigma^{-1}\hat{\Sigma}$ has $\nu$ nonzero eigenvalues and they are all positive. Let $\mathrm{Ch}(\Sigma^{-1}\hat{\Sigma}) \in D_\nu$ be the diagonal matrix whose diagonal elements are the $\nu$ positive eigenvalues of $\Sigma^{-1}\hat{\Sigma}$. The extended Stein loss is defined by
\[
L_{ES}(\hat{\Sigma}, \Sigma) = \mathrm{tr}\,[\mathrm{Ch}(\Sigma^{-1}\hat{\Sigma})] - \log|\mathrm{Ch}(\Sigma^{-1}\hat{\Sigma})| - \nu. \tag{7.3}
\]
If $n \ge p$, $L_{ES}$ is the same as the ordinary Stein loss (7.2). The accuracy of estimators is measured by the risk function $R_{ES}(\hat{\Sigma}, \Sigma) = E[L_{ES}(\hat{\Sigma}, \Sigma)]$, where the expectation $E$ is taken with respect to the model (7.1).
With the extended Stein loss (7.3) used, this chapter gives some dominance results unifying both cases of $n \ge p$ and $p > n$. First in Sect. 7.2, we deal with the best scale invariant estimator, which is a scalar multiple of $S$. Section 7.3 considers the class of triangular invariant estimators and gives a unified expression of the James-Stein (1961) type estimators for the $n \ge p$ and $p > n$ cases. Section 7.4 provides some orthogonally invariant estimators improving on the best scale invariant and the unified James-Stein-type estimators relative to the extended Stein loss (7.3). Section 7.5 gives alternative unified estimators using information on the mean statistic $X$. In Sect. 7.6, we point out some relevant topics on decision-theoretic covariance estimation.
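To make the definition (7.3) concrete, the following Python sketch evaluates the extended Stein loss from the eigenvalues of $\Sigma^{-1}\hat{\Sigma}$. It is only an illustration: the function name `extended_stein_loss` and the tolerance `tol` used to decide which eigenvalues count as zero are assumptions of this example, not notation from the text.

```python
import math

def extended_stein_loss(eigs, tol=1e-12):
    """Extended Stein loss (7.3) from the eigenvalues of Sigma^{-1} Sigma_hat.

    `eigs` may contain all p eigenvalues; values not exceeding `tol` are
    treated as zero and dropped, so only the nu positive ones are used.
    """
    positive = [e for e in eigs if e > tol]
    nu = len(positive)
    trace_part = sum(positive)                       # tr[Ch(.)]
    logdet_part = sum(math.log(e) for e in positive)  # log|Ch(.)|
    return trace_part - logdet_part - nu

# A rank-2 estimator in dimension p = 3: one eigenvalue is exactly zero.
loss_singular = extended_stein_loss([2.0, 0.5, 0.0])
# All positive eigenvalues equal to one give zero loss.
loss_perfect = extended_stein_loss([1.0, 1.0, 1.0])
```

Each positive eigenvalue $e$ contributes $e - \log e - 1 \ge 0$, so the loss is nonnegative and vanishes only when all positive eigenvalues equal one.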
7.2 Scale Invariant Estimators
Recall that $\nu = n \wedge p$ and $\kappa = n \vee p$. Consider a simple class of estimators consisting of constant multiples of $S$. The simple class is denoted by
\[
\hat{\Sigma}^{c} = \hat{\Sigma}^{c}(S) = cS, \tag{7.4}
\]
where $c$ is a positive constant. The class (7.4) satisfies $P\hat{\Sigma}^{c}(S)P^\top = \hat{\Sigma}^{c}(PSP^\top)$ for any $P \in U_p$, namely, $\hat{\Sigma}^{c}$ is invariant under the scale transformations $S \to PSP^\top$ and $\Sigma \to P\Sigma P^\top$ for any $P \in U_p$. The unbiased estimator of $\Sigma$,
\[
\hat{\Sigma}^{UB} = \frac{1}{n}S,
\]
belongs to the class (7.4), but $\hat{\Sigma}^{UB}$ is not, in general, the best estimator among the class (7.4) relative to the extended Stein loss (7.3). The best estimator is given in the following proposition.
Proposition 7.1 Among the class (7.4), the best estimator relative to the extended Stein loss (7.3) is given by $\hat{\Sigma}^{BS} = \hat{\Sigma}^{c_0}$ with $c_0 = 1/\kappa$. Hence for $p > n$, $\hat{\Sigma}^{BS} = S/p$ dominates $\hat{\Sigma}^{UB} = S/n$ relative to the extended Stein loss (7.3).
Proof Recall that $S = Y^\top Y$ and $Y \in \mathbb{R}^{n\times p}$. The positive eigenvalues of $\Sigma^{-1}S$ are identical to those of $Y\Sigma^{-1}Y^\top$, so that the positive eigenvalues of $\Sigma^{-1}S$ are identical to those of the full-rank matrix
\[
\begin{cases}
\Sigma^{-1}S \ (\in \mathbb{R}^{p\times p}) & \text{for } n \ge p, \\
Y\Sigma^{-1}Y^\top \ (\in S_n^{(+)} \subset \mathbb{R}^{n\times n}) & \text{for } p > n.
\end{cases}
\]
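Note that $\mathrm{tr}\,[\mathrm{Ch}(\Sigma^{-1}\hat{\Sigma}^{c})] = \mathrm{tr}\,\Sigma^{-1}\hat{\Sigma}^{c}$. Since $\Sigma^{-1}S$ has $\nu$ positive eigenvalues with probability one, we obtain $|\mathrm{Ch}(c\Sigma^{-1}S)| = c^{\nu}|\mathrm{Ch}(\Sigma^{-1}S)|$. The risk of $\hat{\Sigma}^{c}$ with respect to the extended Stein loss (7.3) is expressed as
\[
R_{ES}(\hat{\Sigma}^{c}, \Sigma) = nc\,\mathrm{tr}\,\Sigma^{-1}\Sigma - \nu\log c - E[\log|\mathrm{Ch}(\Sigma^{-1}S)|] - \nu
 = npc - \nu\log c - E[\log|\mathrm{Ch}(\Sigma^{-1}S)|] - \nu,
\]
which is minimized at $c = c_0$ with
\[
c_0 = \frac{\nu}{np} = \frac{1}{\kappa}.
\]
Thus the proof is complete.
Denote $r_{\kappa,\nu} = E[\log|\mathrm{Ch}(\Sigma^{-1}S)|]$. The risk of $\hat{\Sigma}^{BS}$ is
\[
R_{ES}(\hat{\Sigma}^{BS}, \Sigma) = \nu\log\kappa - r_{\kappa,\nu}. \tag{7.5}
\]
Now, it follows that
\[
|\mathrm{Ch}(\Sigma^{-1}S)| =
\begin{cases}
|\Sigma^{-1}Y^\top Y| & \text{for } n \ge p, \\
|Y\Sigma^{-1}Y^\top| & \text{for } p > n,
\end{cases}
\]
and consequently
\[
r_{\kappa,\nu} =
\begin{cases}
E[\log|Z^\top Z|] & \text{for } n \ge p, \\
E[\log|ZZ^\top|] & \text{for } p > n,
\end{cases}
\]
where $Z \sim N_{n\times p}(0_{n\times p}, I_n \otimes I_p)$. Since $Z^\top Z \sim W_p(n, I_p)$ for $n \ge p$ and $ZZ^\top \sim W_n(p, I_n)$ for $p > n$, using Corollary 3.2 gives $r_{\kappa,\nu} = \sum_{i=1}^{\nu} E[\log s_i]$, where $s_i \sim \chi^2_{\kappa-i+1}$ for $i = 1, \ldots, \nu$. Denoting the digamma function by
\[
\psi(t) = \frac{d\log\Gamma(t)}{dt} = \frac{\Gamma'(t)}{\Gamma(t)},
\]
we observe $E[\log s_i] = \psi((\kappa - i + 1)/2) + \log 2$, so that
\[
r_{\kappa,\nu} = \sum_{i=1}^{\nu}\psi\Big(\frac{\kappa - i + 1}{2}\Big) + \nu\log 2.
\]
Hence $r_{\kappa,\nu}$ is a constant and $\hat{\Sigma}^{BS}$ has a constant risk.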
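The closed-form risk of a scalar multiple $cS$ can be evaluated numerically. The sketch below is a minimal illustration using only the Python standard library; the helper names `digamma`, `r_const` and `risk_scale_multiple` are assumptions of this example, and the digamma function is approximated by a standard recurrence plus asymptotic series rather than taken from a library.

```python
import math

def digamma(x):
    """Approximate digamma(x) for x > 0 (recurrence + asymptotic series)."""
    acc = 0.0
    while x < 6.0:              # psi(x) = psi(x + 1) - 1/x
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # 1/(12 x^2) - 1/(120 x^4) + 1/(252 x^6)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return acc + math.log(x) - 0.5 / x - series

def r_const(n, p):
    """r_{kappa,nu} = sum_i psi((kappa - i + 1)/2) + nu log 2."""
    kappa, nu = max(n, p), min(n, p)
    total = sum(digamma((kappa - i + 1) / 2.0) for i in range(1, nu + 1))
    return total + nu * math.log(2.0)

def risk_scale_multiple(c, n, p):
    """Risk n p c - nu log c - r_{kappa,nu} - nu of the estimator c S."""
    nu = min(n, p)
    return n * p * c - nu * math.log(c) - r_const(n, p) - nu

# High-dimensional example (p > n): compare S/n with S/p.
n, p = 5, 12
risk_ub = risk_scale_multiple(1.0 / n, n, p)   # unbiased estimator
risk_bs = risk_scale_multiple(1.0 / p, n, p)   # best scale invariant estimator
gap = risk_ub - risk_bs                         # equals (p - n) - n log(p / n)
```

For $p > n$ the gap simplifies to $(p - n) - n\log(p/n)$, which is positive because $x - 1 > \log x$ for $x > 1$; for $n \ge p$ the two estimators coincide.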
7.3 Triangular Invariant Estimators and the James-Stein Estimator
7.3.1 The James-Stein Estimator
As seen in Sect. 3.4, the Cholesky decomposition of $S$ is written as
\[
S = TT^\top = \begin{pmatrix} T_1 \\ T_2 \end{pmatrix}(T_1^\top, T_2^\top),
\]
where $T = (T_1^\top, T_2^\top)^\top \in L_{p,\nu}^{(+)}$, $T_1 \in L_\nu^{(+)}$ and $T_2 \in \mathbb{R}^{(p-\nu)\times\nu}$. Define a class of estimators as
\[
\hat{\Sigma}^{T} = \hat{\Sigma}^{T}(S) = TD_\nu T^\top, \tag{7.6}
\]
where $D_\nu = \mathrm{diag}(d_1, \ldots, d_\nu)$ and the $d_i$'s are positive constants. The unbiased estimator $\hat{\Sigma}^{UB}$ and the best scale invariant estimator $\hat{\Sigma}^{BS}$ are members of the class (7.6). The class (7.6) is invariant under the scale transformation with respect to the lower triangular group $L_p^{(+)}$. Indeed, this can be verified as follows: Denote by $T_*T_*^\top$ the Cholesky decomposition of $LSL^\top$ for $L \in L_p^{(+)}$. It is observed that $LSL^\top = LTT^\top L^\top$ and $LT \in L_{p,\nu}^{(+)}$. As discussed in the beginning of Sect. 3.4, the Cholesky decomposition of a symmetric positive semi-definite matrix $S$ is unique. Thus, we obtain $T_* = LT$, so that
\[
\hat{\Sigma}^{T}(LSL^\top) = T_*D_\nu T_*^\top = LTD_\nu T^\top L^\top = L\hat{\Sigma}^{T}(S)L^\top.
\]
Here $\hat{\Sigma}^{T}$ is named the triangular invariant estimator. We will now investigate the risk function of $\hat{\Sigma}^{T}$. The Cholesky decomposition of $\Sigma$ is expressed as $\Sigma = \Xi\Xi^\top$, where $\Xi \in L_p^{(+)}$. Let $U = \Xi^{-1}T$. Then
\[
E[\mathrm{tr}\,[\mathrm{Ch}(\Sigma^{-1}\hat{\Sigma}^{T})]] = E[\mathrm{tr}\,D_\nu T^\top\Sigma^{-1}T]
 = E[\mathrm{tr}\,D_\nu(\Xi^{-1}T)^\top\Xi^{-1}T] = E[\mathrm{tr}\,D_\nu U^\top U]. \tag{7.7}
\]
The distributions of nonzero elements of $U$ are given in Corollary 3.2. In the $p > n$ case, we partition $U$ as $U = (u_{ij}) = (U_1^\top, U_2^\top)^\top$, where $U_1 \in L_n^{(+)}$. Since $U_2 \sim N_{(p-n)\times n}(0_{(p-n)\times n}, I_{p-n} \otimes I_n)$, Corollary 3.1 leads to $E[U_2^\top U_2] = (p - n)I_n$. For $i = 1, \ldots, n$, the $i$-th diagonal element of $E[U_1^\top U_1]$ is
\[
\sum_{j=i}^{n} E[u_{ji}^2] = E[u_{ii}^2] + \sum_{j>i}^{n} E[u_{ji}^2] = (n - i + 1) + (n - i) = 2n - 2i + 1,
\]
so that
\[
E[\mathrm{tr}\,D_n U^\top U] = \mathrm{tr}\,D_n E[U_1^\top U_1] + \mathrm{tr}\,D_n E[U_2^\top U_2]
 = \sum_{i=1}^{n}\{(2n - 2i + 1)d_i + (p - n)d_i\}
 = \sum_{i=1}^{n}(n + p - 2i + 1)d_i. \tag{7.8}
\]
When $n \ge p$, it follows that
\[
E[\mathrm{tr}\,D_p U^\top U] = \sum_{i=1}^{p}\sum_{j=i}^{p} E[d_i u_{ji}^2] = \sum_{i=1}^{p}(n + p - 2i + 1)d_i. \tag{7.9}
\]
Combining (7.7), (7.8) and (7.9) gives
\[
E[\mathrm{tr}\,[\mathrm{Ch}(\Sigma^{-1}\hat{\Sigma}^{T})]] = \sum_{i=1}^{\nu}(n + p - 2i + 1)d_i. \tag{7.10}
\]
It is seen that $\Sigma^{-1}\hat{\Sigma}^{T}$ has the same positive eigenvalues as $D_\nu T^\top\Sigma^{-1}T$, implying that
\[
|\mathrm{Ch}(\Sigma^{-1}\hat{\Sigma}^{T})| = |D_\nu T^\top\Sigma^{-1}T|.
\]
Since $T^\top\Sigma^{-1}T \in S_\nu^{(+)}$, it follows that
\[
E[\log|\mathrm{Ch}(\Sigma^{-1}\hat{\Sigma}^{T})|] = \log|D_\nu| + E[\log|T^\top\Sigma^{-1}T|] = \sum_{i=1}^{\nu}\log d_i + r_{\kappa,\nu}, \tag{7.11}
\]
where $r_{\kappa,\nu}$ is the same as in (7.5). Using (7.10) and (7.11), we can write the risk of $\hat{\Sigma}^{T}$ under the extended Stein loss (7.3) as
\[
R_{ES}(\hat{\Sigma}^{T}, \Sigma) = \sum_{i=1}^{\nu}\{(n + p - 2i + 1)d_i - \log d_i\} - r_{\kappa,\nu} - \nu.
\]
Hence the triangular invariant estimator $\hat{\Sigma}^{T}$ has a constant risk. Clearly, the $d_i$'s minimizing the risk $R_{ES}(\hat{\Sigma}^{T}, \Sigma)$ are given by
\[
d_i^{JS} = \frac{1}{n + p - 2i + 1}
\]
for $i = 1, \ldots, \nu$. Thus the best triangular invariant estimator can be expressed as
\[
\hat{\Sigma}^{JS} = TD_\nu^{JS}T^\top, \tag{7.12}
\]
where $D_\nu^{JS} = \mathrm{diag}(d_1^{JS}, \ldots, d_\nu^{JS})$, which is named the James-Stein (1961) estimator. Since $\hat{\Sigma}^{BS}$ belongs to the class (7.6), $\hat{\Sigma}^{JS}$ dominates $\hat{\Sigma}^{BS}$ relative to the extended Stein loss (7.3). In fact, $\hat{\Sigma}^{JS}$ has the constant risk
\[
R_{ES}(\hat{\Sigma}^{JS}, \Sigma) = \sum_{i=1}^{\nu}\log(n + p - 2i + 1) - r_{\kappa,\nu}, \tag{7.13}
\]
which implies by (7.5) that
\[
R_{ES}(\hat{\Sigma}^{JS}, \Sigma) - R_{ES}(\hat{\Sigma}^{BS}, \Sigma) = \sum_{i=1}^{\nu}\log(n + p - 2i + 1) - \nu\log\kappa < 0,
\]
where the inequality follows from the concavity of the logarithmic function, because the average of $n + p - 2i + 1$ over $i = 1, \ldots, \nu$ equals $\kappa$; the inequality is strict for $\nu \ge 2$. The above can be summarized as follows.
Proposition 7.2 The James-Stein estimator $\hat{\Sigma}^{JS}$, namely, the best triangular invariant estimator, dominates the best scale invariant estimator $\hat{\Sigma}^{BS}$ relative to the extended Stein loss (7.3).
As pointed out by Stein (1956) and James and Stein (1961), $\hat{\Sigma}^{JS}$ is minimax when $n \ge p$. The proof of minimaxity comes from the invariance approach. A general theory of the invariance approach was studied in Kiefer (1957). For proving minimaxity of a specific estimator, the least favorable prior approach is also well known (Strawderman, 2000). See Tsukuma and Kubokawa (2015) for the minimaxity proof of $\hat{\Sigma}^{JS}$ by using the least favorable prior approach.
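The construction of the James-Stein estimator can be sketched in a few lines of Python. The example below is restricted, as an assumption of this sketch, to the nonsingular case $n \ge p$ with $S$ positive definite, so that an ordinary Cholesky factorization applies; the function names `cholesky_lower`, `james_stein_estimator` and `risk_gap_js_vs_bs` are hypothetical and matrices are plain nested lists.

```python
import math

def cholesky_lower(S):
    """Lower triangular T with positive diagonal such that S = T T^T."""
    p = len(S)
    T = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = S[i][j] - sum(T[i][k] * T[j][k] for k in range(j))
            T[i][j] = math.sqrt(s) if i == j else s / T[j][j]
    return T

def james_stein_estimator(S, n):
    """T D^{JS} T^T with d_i = 1/(n + p - 2i + 1), assuming n >= p."""
    p = len(S)
    T = cholesky_lower(S)
    d = [1.0 / (n + p - 2 * i + 1) for i in range(1, p + 1)]
    return [[sum(T[a][k] * d[k] * T[b][k] for k in range(p))
             for b in range(p)] for a in range(p)]

def risk_gap_js_vs_bs(n, p):
    """R(JS) - R(BS) = sum_i log(n + p - 2i + 1) - nu log kappa."""
    kappa, nu = max(n, p), min(n, p)
    return sum(math.log(n + p - 2 * i + 1) for i in range(1, nu + 1)) \
        - nu * math.log(kappa)

S = [[4.0, 2.0], [2.0, 3.0]]
sigma_js = james_stein_estimator(S, n=5)
gap = risk_gap_js_vs_bs(5, 2)   # log(6 * 4 / 25), a negative number
```

The risk gap depends only on $(n, p)$, in line with the constant-risk property of triangular invariant estimators.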
7.3.2 Improvement Using a Subgroup Invariance
In the literature, various estimators have been proposed for improving on the James-Stein estimator $\hat{\Sigma}^{JS}$ in (7.12). Here we introduce an invariant estimator under the commutator subgroup of $L_p^{(+)}$.
For two elements $A$ and $B$ of the group $L_p^{(+)}$, the commutator of $A$ and $B$ is defined by $A^{-1}B^{-1}AB$. The commutator subgroup of $L_p^{(+)}$ is generated by all the commutators of $L_p^{(+)}$ and coincides with $L_p^{(1)}$, where $L_p^{(1)}$ consists of all $p \times p$ lower triangular matrices with ones on the diagonal.
Let $S = T_1T_0T_1^\top$, where $T_0$ and $T_1$ are, respectively, unique elements of $D_\nu$ and $L_{p,\nu}^{(1)}$. Note that, when $n \ge p$, $T_1T_0T_1^\top$ is the $LDL^\top$ decomposition of $S$. Here we define a class of estimators as
\[
\hat{\Sigma}^{I} = \hat{\Sigma}^{I}(T_0, T_1) = T_1\Phi(T_0)T_1^\top, \tag{7.14}
\]
where $\Phi(T_0) \in D_\nu$ and each diagonal element of $\Phi(T_0)$ is an absolutely continuous function of $T_0$. The class (7.14) is invariant under the transformations $S \to ASA^\top$ and $\Sigma \to A\Sigma A^\top$ for any $A \in L_p^{(1)}$. The risk of the class (7.14) can be expressed as follows.
Theorem 7.1 Denote $T_0 = \mathrm{diag}(t_1, \ldots, t_\nu)$ and $\Phi(T_0) = \mathrm{diag}(\phi_1, \ldots, \phi_\nu)$. Then the risk function of $\hat{\Sigma}^{I}$ with respect to the extended Stein loss (7.3) is expressed by
\[
R_{ES}(\hat{\Sigma}^{I}, \Sigma) = E\Big[\sum_{i=1}^{\nu}\Big\{(n + p - 2i - 1)\frac{\phi_i}{t_i} + 2\frac{\partial\phi_i}{\partial t_i} - \log\frac{\phi_i}{t_i}\Big\}\Big] - r_{\kappa,\nu} - \nu,
\]
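where $r_{\kappa,\nu}$ is given by (7.5).
Proof It is observed that
\[
\log|\mathrm{Ch}(\Sigma^{-1}\hat{\Sigma}^{I})| = \log|\mathrm{Ch}(T_1^\top\Sigma^{-1}T_1T_0T_0^{-1}\Phi(T_0))|
 = \log|\mathrm{Ch}(T_1^\top\Sigma^{-1}T_1T_0)| + \log|\mathrm{Ch}(T_0^{-1}\Phi(T_0))|
 = \log|\mathrm{Ch}(\Sigma^{-1}S)| + \sum_{i=1}^{\nu}\log\frac{\phi_i}{t_i},
\]
so that
\[
E[\log|\mathrm{Ch}(\Sigma^{-1}\hat{\Sigma}^{I})|] = E\Big[\sum_{i=1}^{\nu}\log\frac{\phi_i}{t_i}\Big] + r_{\kappa,\nu}.
\]
Thus,
\[
R_{ES}(\hat{\Sigma}^{I}, \Sigma) = E[\mathrm{tr}\,\Sigma^{-1}T_1\Phi(T_0)T_1^\top - \log|\mathrm{Ch}(\Sigma^{-1}T_1\Phi(T_0)T_1^\top)| - \nu]
 = E\Big[\mathrm{tr}\,\Sigma^{-1}T_1\Phi(T_0)T_1^\top - \sum_{i=1}^{\nu}\log\frac{\phi_i}{t_i}\Big] - r_{\kappa,\nu} - \nu. \tag{7.15}
\]
Denote by $\Sigma^{-1} = \Gamma^\top\Sigma_0^{-1}\Gamma$ the $LDL^\top$-type decomposition of $\Sigma^{-1}$, where $\Sigma_0 = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_p^2)$ and $\Gamma$ are, respectively, unique elements of $D_p$ and $L_p^{(1)}$. Making the transformation $U = (u_{ij}) = \Gamma T_1$ $(\in L_{p,\nu}^{(1)})$ yields
\[
E[\mathrm{tr}\,\Sigma^{-1}\hat{\Sigma}^{I}] = E[\mathrm{tr}\,\Sigma_0^{-1}U\Phi(T_0)U^\top]
 = E\Big[\sum_{i=1}^{\nu}\{U^\top\Sigma_0^{-1}U\}_{ii}\,\phi_i\Big]
 = E\Big[\sum_{i=1}^{\nu}\phi_i\sum_{j=i}^{p}\frac{u_{ji}^2}{\sigma_j^2}\Big].
\]
Using Proposition 3.10 with some manipulation, we can see that $t_i \sim \sigma_i^2\chi^2_{n-i+1}$ for $i = 1, \ldots, \nu$ and $u_{ji}\,|\,t_i \sim N(0, (t_i/\sigma_j^2)^{-1})$ for $j > i$. Noting that $U \in L_{p,\nu}^{(1)}$, namely, $u_{ii} = 1$, we obtain
\[
E[\mathrm{tr}\,\Sigma^{-1}\hat{\Sigma}^{I}] = E\Big[\sum_{i=1}^{\nu}\phi_i\big(1/\sigma_i^2 + (p - i)/t_i\big)\Big],
\]
which implies by the chi-square identity (5.4) that
\[
E[\mathrm{tr}\,\Sigma^{-1}\hat{\Sigma}^{I}] = E\Big[\sum_{i=1}^{\nu}\Big\{(n + p - 2i - 1)\frac{\phi_i}{t_i} + 2\frac{\partial\phi_i}{\partial t_i}\Big\}\Big]. \tag{7.16}
\]
Hence combining (7.15) and (7.16) completes the proof.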
The James-Stein estimator $\hat{\Sigma}^{JS}$ belongs to the class (7.14). Using Theorem 7.1 with $\phi_i^{JS} = d_i^{JS}t_i$, we can obtain the same expression for the risk of $\hat{\Sigma}^{JS}$ as in (7.13).
Next, we provide an estimator improving on $\hat{\Sigma}^{JS}$. Let $\Phi^{M}(T_0) = \mathrm{diag}(\phi_1^{M}, \ldots, \phi_\nu^{M})$ with
\[
\phi_i^{M} = d_i^{JS}t_i - \frac{(t_i\log t_i)\,g(w)}{b + w}, \qquad w = \sum_{i=1}^{\nu}(\log t_i)^2,
\]
where $b$ is a suitable constant and $g(w)$ is a differentiable function of $w$.
Proposition 7.3 Suppose $\nu \ge 3$ and $b \ge 144(\nu - 2)^2/\{25(n + p - 1)^2\}$. If $g(w)$ is nondecreasing in $w$ and $0 < g(w) \le 12(\nu - 2)/\{5(n + p - 1)^2\}$, then
\[
\hat{\Sigma}^{M} = \hat{\Sigma}^{M}(T_0, T_1) = T_1\Phi^{M}(T_0)T_1^\top
\]
dominates $\hat{\Sigma}^{JS}$ relative to the extended Stein loss (7.3).
Proof This dominance result is proved along the same arguments as in Dey and Srinivasan (1985). For details, see Tsukuma (2014a).
Proposition 7.3 implies that the best invariant estimator $\hat{\Sigma}^{JS}$ under $L_p^{(+)}$ is dominated by the invariant estimator $\hat{\Sigma}^{M}$ under the commutator subgroup of $L_p^{(+)}$, namely, under $L_p^{(1)}$. Since the lower triangular group $L_p^{(+)}$ is solvable, $L_p^{(1)}$ also has a commutator subgroup. It is still not known whether there exists an invariant estimator under the commutator subgroup of $L_p^{(1)}$ which dominates $\hat{\Sigma}^{M}$.
7.4 Orthogonally Invariant Estimators
7.4.1 Class of Orthogonally Invariant Estimators
The triangular invariant estimator $\hat{\Sigma}^{T}$ in (7.6) depends on the coordinate system. This fact causes inadmissibility of the James-Stein estimator $\hat{\Sigma}^{JS}$ in (7.12). For example, the inadmissibility can be shown by using the same arguments as in Stein (1956): Let $P$ be a $p \times p$ unit anti-diagonal matrix of the form
\[
P = \begin{pmatrix} 0 & & 1 \\ & \iddots & \\ 1 & & 0 \end{pmatrix},
\]
which is symmetric and orthogonal. Denote $Y = (y_1, \ldots, y_n)^\top \in \mathbb{R}^{n\times p}$. Note that, for each $i = 1, \ldots, n$, $Py_i$ is a $p$-dimensional vector obtained from $y_i$ by reversing the order of coordinates. Define the Cholesky decomposition of $PSP^\top = PY^\top YP^\top$ as $T_*T_*^\top$, where $T_* \in L_{p,\nu}^{(+)}$ and $T_*T_*^\top$ is not the same as the Cholesky decomposition of $S$ because of its uniqueness. Let $\hat{\Sigma}^{U} = T_*D_\nu^{JS}T_*^\top$, which is the best triangular invariant estimator of $P\Sigma P^\top$. Here $P^\top\hat{\Sigma}^{U}P$ becomes an estimator of $\Sigma$ and has the same risk as $\hat{\Sigma}^{JS}$. From the convexity of the extended Stein loss (7.3), it is easily proved that $\hat{\Sigma}^{JS}$ is dominated by the combination estimator $(\hat{\Sigma}^{JS} + P^\top\hat{\Sigma}^{U}P)/2$.
In this section, we will consider a general class of estimators not depending on the coordinate system and aim to find better estimators dominating $\hat{\Sigma}^{BS}$ and $\hat{\Sigma}^{JS}$ relative to the extended Stein loss (7.3). Write the eigenvalue decomposition of $S$ as $S = HLH^\top$, where $L = \mathrm{diag}(\ell_1, \ldots, \ell_\nu) \in D_\nu^{(\ge 0)}$ and $H \in V_{p,\nu}$. The general class of estimators is defined by
\[
\hat{\Sigma}^{O} = \hat{\Sigma}^{O}(S) = H\,\mathrm{diag}(\ell_1\phi_1(L), \ldots, \ell_\nu\phi_\nu(L))H^\top = HL\Phi(L)H^\top,
\]
where $\Phi(L) = \mathrm{diag}(\phi_1(L), \ldots, \phi_\nu(L))$ and the $\phi_i(L)$'s are absolutely continuous functions of $L$. The class $\hat{\Sigma}^{O}$ is not only invariant with respect to exchanging coordinates, but also, more generally, orthogonally invariant in the sense that it satisfies $O\hat{\Sigma}^{O}(S)O^\top = \hat{\Sigma}^{O}(OSO^\top)$ for any $O \in O_p$.
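7.4.2 Unbiased Risk Estimate
Here, we will derive an unbiased risk estimate for orthogonally invariant estimators $\hat{\Sigma}^{O}$. Abbreviate $\phi_i(L)$ by $\phi_i$. Since
\[
H = HH^\top H = SS^+H = Y^\top YS^+H
\]
with $Y \sim N_{n\times p}(0_{n\times p}, I_n \otimes \Sigma)$, it follows from Theorem 5.1 and Lemma 5.3 that
\[
E[\mathrm{tr}\,[\mathrm{Ch}(\Sigma^{-1}\hat{\Sigma}^{O})]] = E[\mathrm{tr}\,\Sigma^{-1}HL\Phi(L)H^\top] = E[\mathrm{tr}\,\nabla_Y^\top YS^+HL\Phi(L)H^\top]
 = E\Big[\sum_{i=1}^{\nu}\Big\{(|n - p| + 1)\phi_i + 2\ell_i\frac{\partial\phi_i}{\partial\ell_i} + 2\sum_{j>i}^{\nu}\frac{\ell_i\phi_i - \ell_j\phi_j}{\ell_i - \ell_j}\Big\}\Big]. \tag{7.17}
\]
Note that $|\mathrm{Ch}(\Sigma^{-1}\hat{\Sigma}^{O})| = |H^\top\Sigma^{-1}HL| \cdot |\Phi(L)| = |\mathrm{Ch}(\Sigma^{-1}S)|\prod_{i=1}^{\nu}\phi_i$. The risk of $\hat{\Sigma}^{O}$ becomes
\[
R_{ES}(\hat{\Sigma}^{O}, \Sigma) = E[\mathrm{tr}\,\Sigma^{-1}\hat{\Sigma}^{O} - \log|\mathrm{Ch}(\Sigma^{-1}\hat{\Sigma}^{O})| - \nu]
 = E\Big[\sum_{i=1}^{\nu}\Big\{(|n - p| + 1)\phi_i + 2\ell_i\frac{\partial\phi_i}{\partial\ell_i} + 2\sum_{j>i}^{\nu}\frac{\ell_i\phi_i - \ell_j\phi_j}{\ell_i - \ell_j} - \log\phi_i\Big\}\Big] - r_{\kappa,\nu} - \nu.
\]
Hence, we obtain the unbiased risk estimate for $\hat{\Sigma}^{O}$,
\[
\hat{R}_{ES}(\hat{\Sigma}^{O}) = \sum_{i=1}^{\nu}\Big\{(|n - p| + 1)\phi_i + 2\ell_i\frac{\partial\phi_i}{\partial\ell_i} + 2\sum_{j>i}^{\nu}\frac{\ell_i\phi_i - \ell_j\phi_j}{\ell_i - \ell_j} - \log\phi_i\Big\} - r_{\kappa,\nu} - \nu. \tag{7.18}
\]
Comparing (7.5), or (7.13), with (7.18) gives a sufficient condition that $\hat{\Sigma}^{O}$ dominates $\hat{\Sigma}^{BS}$, or $\hat{\Sigma}^{JS}$, relative to the extended Stein loss (7.3). For example, we denote the difference between $\hat{R}_{ES}(\hat{\Sigma}^{O})$ and (7.13) by
\[
\hat{\Delta}(\hat{\Sigma}^{O}) = \hat{R}_{ES}(\hat{\Sigma}^{O}) - R_{ES}(\hat{\Sigma}^{JS}, \Sigma)
 = \sum_{i=1}^{\nu}\Big\{(|n - p| + 1)\phi_i + 2\ell_i\frac{\partial\phi_i}{\partial\ell_i} + 2\sum_{j>i}^{\nu}\frac{\ell_i\phi_i - \ell_j\phi_j}{\ell_i - \ell_j} - \log\phi_i\Big\} + \sum_{i=1}^{\nu}\log d_i^{JS} - \nu, \tag{7.19}
\]
implying that if $\hat{\Delta}(\hat{\Sigma}^{O}) \le 0$ for every $L \in D_\nu^{(\ge 0)}$ then $\hat{\Sigma}^{O}$ dominates $\hat{\Sigma}^{JS}$.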
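The data-dependent part of the unbiased risk estimate (7.18) is easy to evaluate numerically. The following Python sketch is an illustration under stated assumptions: the function name `ure_bracket` is hypothetical, the eigenvalues are assumed positive and distinct, `phi` is any user-supplied function mapping the list of eigenvalues to the list of factors $\phi_i$, and the partial derivatives are approximated by central differences with relative step `h` rather than computed analytically.

```python
import math

def ure_bracket(ell, phi, n, p, h=1e-6):
    """Sum over i in (7.18), omitting the constant -r_{kappa,nu} - nu.

    ell : distinct positive eigenvalues of S (length nu = min(n, p))
    phi : function returning [phi_1(L), ..., phi_nu(L)] for a list ell
    """
    nu = len(ell)
    base = phi(ell)
    total = 0.0
    for i in range(nu):
        # central difference for d phi_i / d ell_i
        up, dn = list(ell), list(ell)
        up[i] += h * ell[i]
        dn[i] -= h * ell[i]
        dphi = (phi(up)[i] - phi(dn)[i]) / (2.0 * h * ell[i])
        cross = sum((ell[i] * base[i] - ell[j] * base[j]) / (ell[i] - ell[j])
                    for j in range(i + 1, nu))
        total += ((abs(n - p) + 1) * base[i] + 2.0 * ell[i] * dphi
                  + 2.0 * cross - math.log(base[i]))
    return total

# Best scale invariant estimator: phi_i = 1/kappa for every i.
n, p = 3, 5
kappa, nu = max(n, p), min(n, p)
value = ure_bracket([5.0, 2.0, 1.0], lambda ell: [1.0 / kappa] * len(ell), n, p)
# value equals nu + nu * log(kappa), so (7.18) reduces to (7.5).
```

With constant factors $\phi_i = 1/\kappa$ the bracket equals $\nu + \nu\log\kappa$ for every $L$, so (7.18) returns the constant risk $\nu\log\kappa - r_{\kappa,\nu}$ of (7.5).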
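7.4.3 Examples
7.4.3.1 Haff's Empirical Bayes Estimator
When $n \ge p$, Haff (1980) considered the empirical Bayes estimation of $\Sigma$. For the sake of simplicity, let a prior distribution of $\Sigma^{-1}$ be $W_p(p + 1, \gamma^{-1}I_p)$, where $\gamma$ is an unknown hyperparameter. The resulting posterior distribution of $\Sigma^{-1}$ given $S$ is
\[
\Sigma^{-1}\,|\,S \sim W_p(n + p + 1, (S + \gamma I_p)^{-1}),
\]
so that the posterior mean of $\Sigma$ is $E[\Sigma\,|\,S] = n^{-1}(S + \gamma I_p)$, where this expectation will be explained in Sect. 7.6.4. The hyperparameter $\gamma$ is estimated from the marginal density of $S$, which is proportional to
\[
\gamma^{p(p+1)/2}|S|^{(n-p-1)/2}|S + \gamma I_p|^{-(n+p+1)/2}.
\]
Ignoring a constant, the log-likelihood function has the form
\[
l(\gamma\,|\,S) = \frac{p(p + 1)}{2}\log\gamma - \frac{n + p + 1}{2}\log|I_p + \gamma S^{-1}|.
\]
The first order approximation of $\log|I_p + \gamma S^{-1}|$ is $\gamma\,\mathrm{tr}\,S^{-1}$, so that the log-likelihood function can be approximated as
\[
l(\gamma\,|\,S) \approx \frac{p(p + 1)}{2}\log\gamma - \frac{n + p + 1}{2}\gamma\,\mathrm{tr}\,S^{-1},
\]
which attains a maximum at
\[
\hat{\gamma} = \frac{p(p + 1)}{n + p + 1}\,\frac{1}{\mathrm{tr}\,S^{-1}}.
\]
The estimate $\hat{\gamma}$ is an approximate maximum likelihood estimate of $\gamma$. Thus we obtain an empirical Bayes estimator of the form
\[
\hat{\Sigma}^{EB} = \frac{1}{n}(S + \hat{\gamma}I_p) = \frac{1}{n}\Big(S + \frac{p(p + 1)}{n + p + 1}\,\frac{1}{\mathrm{tr}\,S^{-1}}I_p\Big).
\]
Haff (1980) defined a general class of empirical Bayes estimators and gave some dominance results. Here, taking into account both the cases of $n \ge p$ and $p > n$, we define a Haff type estimator as
\[
\hat{\Sigma}^{HF} = \hat{\Sigma}^{BS} + \frac{a}{\kappa\,\mathrm{tr}\,S^+}SS^+,
\]
where $a$ is a positive constant. Since $\mathrm{tr}\,S^+ = \mathrm{tr}\,L^{-1}$ and $SS^+ = HH^\top$ for $S = HLH^\top$, the Haff estimator can be expressed as
\[
\hat{\Sigma}^{HF} = HL\Phi^{HF}(L)H^\top, \qquad \Phi^{HF}(L) = \frac{1}{\kappa}\Big(I_\nu + \frac{a}{\mathrm{tr}\,L^{-1}}L^{-1}\Big).
\]
Proposition 7.4 If $0 < a \le 2(\nu - 1)/(|n - p| + 1)$, then $\hat{\Sigma}^{HF}$ dominates $\hat{\Sigma}^{BS}$ relative to the extended Stein loss (7.3).
Proof For $i = 1, \ldots, \nu$, let $\phi_i^{HF} = \kappa^{-1}(1 + a\ell_i^{-1}/\mathrm{tr}\,L^{-1})$. Note that $\ell_i\phi_i^{HF} - \ell_j\phi_j^{HF} = \kappa^{-1}(\ell_i - \ell_j)$ and that
\[
\sum_{i=1}^{\nu}\ell_i\frac{\partial\phi_i^{HF}}{\partial\ell_i} = \frac{a}{\kappa}\Big(-1 + \frac{\mathrm{tr}\,L^{-2}}{(\mathrm{tr}\,L^{-1})^2}\Big) \le 0,
\]
so that, by (7.18),
\[
\hat{R}_{ES}(\hat{\Sigma}^{HF}) = \sum_{i=1}^{\nu}\Big\{(|n - p| + 1)\phi_i^{HF} + 2\ell_i\frac{\partial\phi_i^{HF}}{\partial\ell_i} + 2\sum_{j>i}^{\nu}\frac{\ell_i\phi_i^{HF} - \ell_j\phi_j^{HF}}{\ell_i - \ell_j} - \log\phi_i^{HF}\Big\} - r_{\kappa,\nu} - \nu
 \le \hat{R}_{ES}(\hat{\Sigma}^{BS}) + (|n - p| + 1)\frac{a}{\kappa} - \sum_{i=1}^{\nu}\log(1 + a\ell_i^{-1}/\mathrm{tr}\,L^{-1}).
\]
Since $\log(1 + x) \ge 2x/(2 + x)$ for $x \ge 0$, it holds that
\[
\sum_{i=1}^{\nu}\log(1 + a\ell_i^{-1}/\mathrm{tr}\,L^{-1}) \ge \sum_{i=1}^{\nu}\frac{2a\ell_i^{-1}/\mathrm{tr}\,L^{-1}}{2 + a\ell_i^{-1}/\mathrm{tr}\,L^{-1}} \ge \sum_{i=1}^{\nu}\frac{2a\ell_i^{-1}/\mathrm{tr}\,L^{-1}}{2 + a} = \frac{2a}{2 + a},
\]
which yields
\[
\hat{R}_{ES}(\hat{\Sigma}^{HF}) - \hat{R}_{ES}(\hat{\Sigma}^{BS}) \le (|n - p| + 1)\frac{a}{\kappa} - \frac{2a}{2 + a} = \frac{a}{(2 + a)\kappa}\{(|n - p| + 1)a - 2(\nu - 1)\},
\]
where the last equality uses $\kappa - |n - p| = \nu$. The right-hand side is not positive under the stated condition on $a$. Hence we complete the proof.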
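The bound used in the proof of Proposition 7.4 can be checked numerically for given eigenvalues. The sketch below is illustrative only; the function names `haff_factors`, `haff_max_a` and `haff_bound` are assumptions of this example. `haff_bound` evaluates the upper bound $(|n - p| + 1)a/\kappa - \sum_i\log(1 + a\ell_i^{-1}/\mathrm{tr}\,L^{-1})$ on the difference of the unbiased risk estimates.

```python
import math

def haff_factors(ell, n, p, a):
    """phi_i^{HF} = (1 + a / (ell_i * tr L^{-1})) / kappa."""
    kappa = max(n, p)
    tr_inv = sum(1.0 / l for l in ell)
    return [(1.0 + a / (l * tr_inv)) / kappa for l in ell]

def haff_max_a(n, p):
    """Largest a allowed by Proposition 7.4: 2(nu - 1)/(|n - p| + 1)."""
    return 2.0 * (min(n, p) - 1) / (abs(n - p) + 1)

def haff_bound(ell, n, p, a):
    """Upper bound on the URE difference between the Haff and BS estimators."""
    kappa = max(n, p)
    tr_inv = sum(1.0 / l for l in ell)
    log_sum = sum(math.log(1.0 + a / (l * tr_inv)) for l in ell)
    return (abs(n - p) + 1) * a / kappa - log_sum

n, p = 10, 4
ell = [8.0, 4.0, 2.0, 1.0]           # eigenvalues of S, nu = 4
a = haff_max_a(n, p)                  # 6/7 for this (n, p)
phi = haff_factors(ell, n, p, a)
bound = haff_bound(ell, n, p, a)      # not positive at the boundary value of a
```

The eigenvalues of the Haff estimator are $\ell_i\phi_i^{HF} = (\ell_i + a/\mathrm{tr}\,L^{-1})/\kappa$, that is, each eigenvalue of $\hat{\Sigma}^{BS}$ is expanded by the same amount.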
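7.4.3.2 The Efron-Morris-Dey Shrinkage Estimator
Since $\hat{\Sigma}^{HF} \succeq \hat{\Sigma}^{BS}$, the Haff estimator $\hat{\Sigma}^{HF}$ is an expansion estimator in the Löwner sense. Next, we give an improved estimator shrinking $\hat{\Sigma}^{BS}$.
For positive constants $b$ and $c$, define
\[
\hat{\Sigma}^{SH} = HL\Phi^{SH}(L)H^\top, \qquad \Phi^{SH}(L) = \frac{1}{\kappa}\Big(I_\nu + \frac{b}{\mathrm{tr}\,L^{c}}L^{c}\Big)^{-1},
\]
where $L^{c} = \mathrm{diag}(\ell_1^{c}, \ldots, \ell_\nu^{c})$. This estimator is inspired by Efron and Morris (1976) and Dey (1987) for estimating $\Sigma^{-1}$ under certain quadratic losses. Indeed, when $n \ge p$, it corresponds to Efron and Morris (1976) for $c = 1$ and to Dey (1987) for $c = 2$. For $b > 0$, it holds that $\Phi^{SH}(L) \preceq \kappa^{-1}I_\nu$, so that $\hat{\Sigma}^{SH} \preceq \hat{\Sigma}^{BS}$ in the Löwner sense. We can obtain a dominance result on the shrinkage estimator $\hat{\Sigma}^{SH}$ as follows.
Proposition 7.5 For $0 < b \le (\nu - 1)/\kappa$ and $c \ge 1$, $\hat{\Sigma}^{SH}$ dominates $\hat{\Sigma}^{BS}$ relative to the extended Stein loss (7.3).
Proof Let $\phi_i^{SH} = \kappa^{-1}(1 + b\ell_i^{c}/\mathrm{tr}\,L^{c})^{-1}$ and $\phi_i^{SH*} = b\ell_i^{c}\{(1 + b)\kappa\,\mathrm{tr}\,L^{c}\}^{-1}$ for $i = 1, \ldots, \nu$. Note that $\Phi^{SH}(L) = \mathrm{diag}(\phi_1^{SH}, \ldots, \phi_\nu^{SH})$ and, for $i = 1, \ldots, \nu$,
\[
\phi_i^{SH} = \kappa^{-1} - \frac{b\ell_i^{c}}{\kappa\,\mathrm{tr}\,L^{c}}\Big(1 + \frac{b\ell_i^{c}}{\mathrm{tr}\,L^{c}}\Big)^{-1}
 \le \kappa^{-1} - \frac{b\ell_i^{c}}{\kappa\,\mathrm{tr}\,L^{c}}(1 + b)^{-1} = \kappa^{-1} - \phi_i^{SH*}.
\]
Thus,
\[
E[\mathrm{tr}\,[\mathrm{Ch}(\Sigma^{-1}\hat{\Sigma}^{SH})]] \le E[\mathrm{tr}\,\Sigma^{-1}\hat{\Sigma}^{BS}] - E[\mathrm{tr}\,\Sigma^{-1}HL\Phi^{SH*}H^\top],
\]
where $\Phi^{SH*} = \mathrm{diag}(\phi_1^{SH*}, \ldots, \phi_\nu^{SH*})$. It follows that
\[
\log|\mathrm{Ch}(\Sigma^{-1}\hat{\Sigma}^{SH})| = \log|\mathrm{Ch}(\Sigma^{-1}\hat{\Sigma}^{BS})| - \log\Big|I_\nu + \frac{b}{\mathrm{tr}\,L^{c}}L^{c}\Big|
 \ge \log|\mathrm{Ch}(\Sigma^{-1}\hat{\Sigma}^{BS})| - \mathrm{tr}\,\Big(\frac{b}{\mathrm{tr}\,L^{c}}L^{c}\Big) = \log|\mathrm{Ch}(\Sigma^{-1}\hat{\Sigma}^{BS})| - b,
\]
so that
\[
R_{ES}(\hat{\Sigma}^{SH}, \Sigma) - R_{ES}(\hat{\Sigma}^{BS}, \Sigma) \le -E[\mathrm{tr}\,\Sigma^{-1}HL\Phi^{SH*}H^\top] + b. \tag{7.20}
\]
Using the identity (7.17) gives
\[
E[\mathrm{tr}\,\Sigma^{-1}HL\Phi^{SH*}H^\top] = E\Big[\sum_{i=1}^{\nu}\Big\{(|n - p| + 1)\phi_i^{SH*} + 2\ell_i\frac{\partial\phi_i^{SH*}}{\partial\ell_i} + 2\sum_{j>i}^{\nu}\frac{\ell_i\phi_i^{SH*} - \ell_j\phi_j^{SH*}}{\ell_i - \ell_j}\Big\}\Big]. \tag{7.21}
\]
For $b > 0$ and $c \ge 1$,
\[
\sum_{i=1}^{\nu}\ell_i\frac{\partial\phi_i^{SH*}}{\partial\ell_i} = \frac{bc}{(1 + b)\kappa}\Big(1 - \frac{\mathrm{tr}\,L^{2c}}{(\mathrm{tr}\,L^{c})^2}\Big) \ge 0. \tag{7.22}
\]
Since, for $1 \le i < j \le \nu$ and $c \ge 1$,
\[
\frac{\ell_i^{c+1} - \ell_j^{c+1}}{\ell_i - \ell_j} \ge \ell_i^{c} + \ell_j^{c},
\]
and $\sum_{i=1}^{\nu}\sum_{j>i}^{\nu}(\ell_i^{c} + \ell_j^{c}) = (\nu - 1)\,\mathrm{tr}\,L^{c}$, we obtain
\[
\sum_{i=1}^{\nu}\sum_{j>i}^{\nu}\frac{\ell_i\phi_i^{SH*} - \ell_j\phi_j^{SH*}}{\ell_i - \ell_j} \ge \frac{b}{(1 + b)\kappa\,\mathrm{tr}\,L^{c}}\sum_{i=1}^{\nu}\sum_{j>i}^{\nu}(\ell_i^{c} + \ell_j^{c}) = \frac{(\nu - 1)b}{(1 + b)\kappa}. \tag{7.23}
\]
Combining (7.20)–(7.23) provides
\[
R_{ES}(\hat{\Sigma}^{SH}, \Sigma) - R_{ES}(\hat{\Sigma}^{BS}, \Sigma) \le -\frac{(n + p - 1)b}{(1 + b)\kappa} + b = \frac{\kappa b^2 - (\nu - 1)b}{(1 + b)\kappa},
\]
which is not positive if $0 < b \le (\nu - 1)/\kappa$.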
7.4.3.3 Stein's Simple Estimator and Risk Minimization Method
In the nonsingular case, namely, $n \ge p$, the unbiased estimator $\hat{\Sigma}^{UB}$ can be expressed as $\hat{\Sigma}^{UB} = HLD_p^{UB}H^\top$ with $L = \mathrm{diag}(\ell_1, \ldots, \ell_p)$ and $D_p^{UB} = \mathrm{diag}(n^{-1}, \ldots, n^{-1})$. Then the first and last diagonal elements of $LD_p^{UB}$ are, respectively, $\ell_1/n$ and $\ell_p/n$, which are the largest and smallest eigenvalues of $\hat{\Sigma}^{UB}$. Let $\lambda_1$ be the largest eigenvalue of $\Sigma$ and let $\eta$ be the corresponding normalized eigenvector. Now,
\[
\lambda_1 = \eta^\top\Sigma\eta = E[\eta^\top\hat{\Sigma}^{UB}\eta] \le E\Big[\max_{\xi\in\mathbb{R}^p:\,\|\xi\|=1}\xi^\top\hat{\Sigma}^{UB}\xi\Big] = E[\ell_1/n].
\]
Similarly, it can be shown that $E[\ell_p/n] \le \lambda_p$, where $\lambda_p$ is the smallest eigenvalue of $\Sigma$. These imply that $\ell_1/n$ overestimates $\lambda_1$ and $\ell_p/n$ underestimates $\lambda_p$, so that $\hat{\Sigma}^{UB}$ should probably be modified by shrinking its largest eigenvalue $\ell_1/n$ and expanding its smallest eigenvalue $\ell_p/n$. In other words, an orthogonally invariant estimator $\hat{\Sigma}^{O} = HL\Phi(L)H^\top$ with $\Phi(L) = \mathrm{diag}(\phi_1(L), \ldots, \phi_p(L))$ should satisfy $\phi_1(L) \le \cdots \le \phi_p(L)$. Moreover, in the eigenvalue decomposition $S = HLH^\top$, $L$ is defined as the diagonal matrix consisting of the ordered eigenvalues $\ell_1 \ge \cdots \ge \ell_p$. Thus, for the orthogonally invariant estimators $\hat{\Sigma}^{O}$, it seems reasonable to require that the $\ell_i\phi_i(L)$'s have the property $\ell_1\phi_1(L) \ge \cdots \ge \ell_p\phi_p(L)$. As in Perron (1992), the properties $\phi_1(L) \le \cdots \le \phi_p(L)$ and $\ell_1\phi_1(L) \ge \cdots \ge \ell_p\phi_p(L)$ shall be called, respectively, the shrinkage and ordering properties. Sheena and Takemura (1992) called $\hat{\Sigma}^{O}$ order-preserving when $\hat{\Sigma}^{O}$ has the ordering property.
Here we will provide two well-known orthogonally invariant estimators given by Stein that are related to the shrinkage and ordering properties. For details, see Stein (1975, 1977) and also Dey and Srinivasan (1985).
The first estimator is given by
\[
\hat{\Sigma}^{ST} = HLD_\nu^{JS}H^\top, \tag{7.24}
\]
where $D_\nu^{JS}$ is given in (7.12). Since $d_1^{JS} \le \cdots \le d_\nu^{JS}$, $\hat{\Sigma}^{ST}$ has the shrinkage property, but lacks the ordering property. As in the following proposition, the simple estimator $\hat{\Sigma}^{ST}$ improves on $\hat{\Sigma}^{JS}$.
Proposition 7.6 $\hat{\Sigma}^{ST}$ dominates $\hat{\Sigma}^{JS}$ relative to the extended Stein loss (7.3).
Proof This proposition can be proved along the same lines as in Dey and Srinivasan (1985, Theorem 3.1). Using (7.19), we can write the difference between the unbiased risk estimate of $\hat{\Sigma}^{ST}$ and the constant risk of $\hat{\Sigma}^{JS}$ as
\[
\hat{\Delta}(\hat{\Sigma}^{ST}) = \hat{R}_{ES}(\hat{\Sigma}^{ST}) - R_{ES}(\hat{\Sigma}^{JS}, \Sigma)
 = \sum_{i=1}^{\nu}\Big\{(|n - p| + 1)d_i^{JS} + 2\sum_{j>i}^{\nu}\frac{d_i^{JS}\ell_i - d_j^{JS}\ell_j}{\ell_i - \ell_j}\Big\} - \nu.
\]
It is observed that
\[
\sum_{i=1}^{\nu}\sum_{j>i}^{\nu}\frac{d_i^{JS}\ell_i - d_j^{JS}\ell_j}{\ell_i - \ell_j}
 = \sum_{i=1}^{\nu}\sum_{j>i}^{\nu}\frac{d_i^{JS}(\ell_i - \ell_j) + (d_i^{JS} - d_j^{JS})\ell_j}{\ell_i - \ell_j}
 < \sum_{i=1}^{\nu}(\nu - i)d_i^{JS},
\]
where the inequality is verified by the facts that $\ell_1 > \cdots > \ell_\nu$ and that $d_1^{JS} < \cdots < d_\nu^{JS}$. Thus,
\[
\hat{\Delta}(\hat{\Sigma}^{ST}) < \sum_{i=1}^{\nu}(|n - p| + 1 + 2\nu - 2i)d_i^{JS} - \nu = \sum_{i=1}^{\nu}(n + p - 2i + 1)d_i^{JS} - \nu = 0,
\]
which completes the proof.
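The quantity $\hat{\Delta}(\hat{\Sigma}^{ST})$ in the proof of Proposition 7.6 has a closed form in the eigenvalues, so its sign can be checked directly. The Python sketch below is a minimal illustration; the function name `delta_stein_simple` is an assumption of this example, and the eigenvalues are assumed strictly decreasing with length $\nu = n \wedge p \ge 2$.

```python
def delta_stein_simple(ell, n, p):
    """URE of Stein's simple estimator minus the constant risk of the
    James-Stein estimator, for strictly decreasing eigenvalues `ell`."""
    nu = len(ell)
    d = [1.0 / (n + p - 2 * i + 1) for i in range(1, nu + 1)]  # d_i^{JS}
    total = -float(nu)
    for i in range(nu):
        cross = sum((d[i] * ell[i] - d[j] * ell[j]) / (ell[i] - ell[j])
                    for j in range(i + 1, nu))
        total += (abs(n - p) + 1) * d[i] + 2.0 * cross
    return total

# Nonsingular case n >= p and singular case p > n.
delta_low_dim = delta_stein_simple([8.0, 4.0, 2.0, 1.0], n=10, p=4)
delta_high_dim = delta_stein_simple([9.0, 3.0, 1.0], n=3, p=6)
# Both values are negative, in line with Proposition 7.6.
```

When the eigenvalues are widely separated, each cross term approaches $d_i^{JS}$ and the difference approaches zero from below.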
The other well-known estimator due to Stein (1975, 1977) is obtained by minimizing the unbiased risk estimate (7.18) with respect to the $\phi_i$'s while ignoring the differential terms. The unbiased risk estimate (7.18) without the differential terms can be rewritten as
\[
\hat{R}^{*}(\hat{\Sigma}^{O}) = \sum_{i=1}^{\nu}\Big\{(|n - p| + 1)\phi_i + 2\sum_{j\neq i}^{\nu}\frac{\ell_i\phi_i}{\ell_i - \ell_j} - \log\phi_i\Big\} + \text{const.},
\]
which is minimized by, for $i = 1, \ldots, \nu$,
\[
\phi_i^{RM} = 1/\omega_i(L), \qquad \omega_i(L) = |n - p| + 1 + 2\ell_i\sum_{j\neq i}^{\nu}\frac{1}{\ell_i - \ell_j}.
\]
The $\phi_i^{RM}$'s are sometimes negative and do not always satisfy the ordering property $\ell_1\phi_1^{RM} \ge \cdots \ge \ell_\nu\phi_\nu^{RM}$. To modify them, Stein (1977) suggested applying the isotonic regression to the $\phi_i^{RM}$'s. For a detailed example of the isotonic regression, see Lin and Perlman (1985). No exact dominance result exists for the resulting modified estimator with the ordering property, but it has been widely used for numerical comparison in the literature, including Lin and Perlman (1985), Haff (1991), Yang and Berger (1994) and Ledoit and Wolf (2004).
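The following Python sketch shows how the weights $\omega_i(L)$ can turn negative when two sample eigenvalues are close, and how a least-squares fit under a nonincreasing order constraint can be computed by pooling adjacent violators. It is only an illustration: the function names `stein_rm_weights` and `antitonic_regression` are assumptions of this example, and the plain unweighted pooling shown here is a generic isotonic regression, not a reproduction of the specific isotonizing procedure in Stein (1977) or Lin and Perlman (1985).

```python
def stein_rm_weights(ell, n, p):
    """omega_i(L) = |n - p| + 1 + 2 ell_i sum_{j != i} 1/(ell_i - ell_j)."""
    nu = len(ell)
    return [abs(n - p) + 1
            + 2.0 * ell[i] * sum(1.0 / (ell[i] - ell[j])
                                 for j in range(nu) if j != i)
            for i in range(nu)]

def antitonic_regression(y):
    """Least-squares fit c_1 >= ... >= c_k to y (pool adjacent violators)."""
    blocks = []                      # each block stores [sum, count]
    for v in y:
        blocks.append([v, 1])
        # merge while the previous block mean is below the last block mean
        while (len(blocks) > 1 and
               blocks[-2][0] * blocks[-1][1] < blocks[-1][0] * blocks[-2][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted

# Two nearly equal eigenvalues make the second weight negative.
omega = stein_rm_weights([1.1, 1.0, 0.2], n=10, p=3)
# A sequence violating the ordering c_1 >= c_2 >= c_3 is pooled.
fit = antitonic_regression([3.0, 1.0, 2.0])
```

In the second call the last two values violate the ordering and are replaced by their average, while the first value is left unchanged.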
7.4.3.4
The Dey-Srinivasan Estimator
Next, we introduce an improved estimator, which is based on Dey and Srinivasan (1985). Let DS (L) = diag (φ1DS , . . . , φνDS ) with φiDS = diJ S −
g(u) log i , b+u
u=
ν (log i )2 , i=1
where b is a suitable constant and g(u) is a differentiable function of u. Define DS = H L DS (L)H . DS has the shrinkage property. Of course, φ1DS ≤ · · · ≤ φνDS , so Proposition 7.7 Suppose ν ≥ 3 and b ≥ 144(ν − 2)2 /{25(n + p − 1)2 }. If g(u) is DS dominondecreasing in u and 0 < g(u) ≤ 12(ν − 2)/{5(n + p − 1)2 }, then JS ST nates and relative to the extended Stein loss (7.3). Proof From a straightforward calculation after substituting φi = φiDS into (7.18), DS can be expressed as the unbiased risk estimate of E S ( E S ( DS ) = R ST ) +
DS , R where DS =
ν i=1
g(u) log i ∂ g(u) log i − 2 i b+u ∂ i b + u
ν g(u) i log i − j log j g(u) log i . −2 − log 1 − J S b + u j>i i − j di (b + u)
− (|n − p| + 1)
DS ≤ 0 then the proposition is verified. If
92
7 Estimation of the Covariance Matrix
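Note that
$$\sum_{j>i}\frac{\ell_i\log \ell_i - \ell_j\log \ell_j}{\ell_i - \ell_j} = (\nu - i)\log \ell_i + \sum_{j>i}\ell_j\,\frac{\log \ell_i - \log \ell_j}{\ell_i - \ell_j} \ge (\nu - i)\log \ell_i$$
because $\ell_1 \ge \cdots \ge \ell_\nu$. Further noting that
$$\ell_i\frac{\partial}{\partial \ell_i}\Big[\frac{g(u)\log \ell_i}{b+u}\Big] = \frac{g(u)}{b+u} + 2\frac{g'(u)(\log \ell_i)^2}{b+u} - 2\frac{g(u)(\log \ell_i)^2}{(b+u)^2},$$
we obtain
$$\widehat{\Delta}^{DS} \le \sum_{i=1}^{\nu}\Big\{ -\frac{g(u)\log \ell_i}{d_i^{JS}(b+u)} - 2\frac{g(u)}{b+u} - 4\frac{g'(u)(\log \ell_i)^2}{b+u} + 4\frac{g(u)(\log \ell_i)^2}{(b+u)^2} - \log\Big(1 - \frac{g(u)\log \ell_i}{d_i^{JS}(b+u)}\Big)\Big\}. \tag{7.25}$$
It follows that $|x|/(b + x^2) \le \{2\sqrt{b}\}^{-1}$ for $b > 0$, implying that for each $i$
$$\frac{|\log \ell_i|}{b+u} < \frac{|\log \ell_i|}{b + (\log \ell_i)^2} \le \frac{1}{2\sqrt{b}}.$$
Combining this inequality and the given conditions on $b$ and $g(u)$ gives
$$\frac{g(u)|\log \ell_i|}{d_i^{JS}(b+u)} \le (n+p-1)\frac{g(u)|\log \ell_i|}{b+u} < \frac{1}{2\sqrt{b}} \times \frac{12(\nu-2)}{5(n+p-1)} \le \frac{1}{2}.$$
Since
$$\log(1+x) \ge x - \frac{5}{6}x^2$$
for $|x| \le 1/2$ (see Dey and Srinivasan 1985, Lemma 2.2), using the given condition on $g(u)$ yields
$$\log\Big(1 - \frac{g(u)\log \ell_i}{d_i^{JS}(b+u)}\Big) \ge -\frac{g(u)\log \ell_i}{d_i^{JS}(b+u)} - \frac{5}{6}\Big\{\frac{g(u)\log \ell_i}{d_i^{JS}(b+u)}\Big\}^2 > -\frac{g(u)\log \ell_i}{d_i^{JS}(b+u)} - 2(\nu-2)\frac{g(u)(\log \ell_i)^2}{(b+u)^2}. \tag{7.26}$$
Combining (7.25) and (7.26) gives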
$$\widehat{\Delta}^{DS} < \sum_{i=1}^{\nu}\Big\{ -2\frac{g(u)}{b+u} - 4\frac{g'(u)(\log \ell_i)^2}{b+u} + 2\nu\frac{g(u)(\log \ell_i)^2}{(b+u)^2}\Big\} = -2\nu\frac{g(u)}{b+u} - 4\frac{u g'(u)}{b+u} + 2\nu\frac{u g(u)}{(b+u)^2} < -4\frac{u g'(u)}{b+u} \le 0,$$
which completes the proof.
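The construction of the Dey-Srinivasan type estimator can be sketched numerically as follows. This is only an illustration under stated assumptions: the constants are taken as $d_i^{JS} = 1/(n+p-2i+1)$ (consistent with the bounds used in the proof), $g$ is a constant function set at the upper bound allowed by Proposition 7.7, $b$ is set at its lower bound, and the function and variable names are hypothetical.

```python
import numpy as np

def dey_srinivasan_estimate(S, n, b=None, g_const=None):
    """Sketch of a Dey-Srinivasan type estimate from S (p x p, rank nu = min(n, p))."""
    p = S.shape[0]
    nu = min(n, p)
    if nu < 3:
        raise ValueError("Proposition 7.7 assumes nu >= 3")
    # Eigenvalue decomposition S = H L H^T, nu largest eigenvalues in descending order.
    eigval, eigvec = np.linalg.eigh(S)
    order = np.argsort(eigval)[::-1][:nu]
    ell, H = eigval[order], eigvec[:, order]
    i = np.arange(1, nu + 1)
    d_js = 1.0 / (n + p - 2 * i + 1)  # assumed form of the constants d_i^{JS}
    # Illustrative defaults: boundary values permitted by Proposition 7.7.
    if b is None:
        b = 144.0 * (nu - 2) ** 2 / (25.0 * (n + p - 1) ** 2)
    if g_const is None:
        g_const = 12.0 * (nu - 2) / (5.0 * (n + p - 1) ** 2)
    u = np.sum(np.log(ell) ** 2)
    phi = d_js - g_const * np.log(ell) / (b + u)  # phi_i^{DS}
    sigma_hat = (H * (ell * phi)) @ H.T           # H L Phi^{DS}(L) H^T
    return sigma_hat, phi, d_js

# Hypothetical demo data: n = 20 observations in p = 5 dimensions.
rng = np.random.default_rng(0)
n_demo, p_demo = 20, 5
Y_demo = rng.standard_normal((n_demo, p_demo))
sigma_hat, phi, d_js = dey_srinivasan_estimate(Y_demo.T @ Y_demo, n_demo)
```

A constant $g$ is trivially nondecreasing, so it falls within the class covered by the proposition; other nondecreasing choices would need only the line computing `phi` to change.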
7.4.3.5
Sheena and Takemura’s Methods for Improving Non-order-preserving Estimators
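Let $\varphi = (\varphi_1, \ldots, \varphi_\nu)$ and $\mathrm{diag}(\varphi) = \mathrm{diag}(\varphi_1, \ldots, \varphi_\nu)$, where $\varphi_i = \ell_i\phi_i(L)$ for $i = 1, \ldots, \nu$. Here, as in Sheena and Takemura (1992), we call $\widehat{\Sigma}^{O} = H\,\mathrm{diag}(\varphi)H^\top$ order-preserving if $\widehat{\Sigma}^{O}$ has the ordering property $\varphi_1 \ge \cdots \ge \varphi_\nu$. Note that $\widehat{\Sigma}^{ST}$ and $\widehat{\Sigma}^{DS}$ are not always order-preserving since $\ell_1 \ge \cdots \ge \ell_\nu$, $d_1^{JS} \le \cdots \le d_\nu^{JS}$ and $\phi_1^{DS} \le \cdots \le \phi_\nu^{DS}$. Sheena and Takemura (1992) give two methods of improving such a non-order-preserving estimator in the nonsingular case. Here, the two methods are treated in a unified way for the nonsingular and singular cases:

(i) For $i = 1, \ldots, \nu$, let $\varphi_{(i)}$ be the $i$-th largest element in $\varphi = (\varphi_1, \ldots, \varphi_\nu)$. Define $\widehat{\Sigma}^{OS} = H\,\mathrm{diag}(\varphi^{OS})H^\top$ with $\varphi^{OS} = (\varphi_{(1)}, \ldots, \varphi_{(\nu)})$.

(ii) Denote by $(\varphi_1^{IR}, \ldots, \varphi_\nu^{IR})$ the isotonic regression of $(\varphi_1, \ldots, \varphi_\nu)$, satisfying
$$\min_{c_1 \ge \cdots \ge c_\nu}\sum_{i=1}^{\nu}(c_i - \varphi_i)^2 = \sum_{i=1}^{\nu}(\varphi_i^{IR} - \varphi_i)^2.$$
Define $\widehat{\Sigma}^{IR} = H\,\mathrm{diag}(\varphi^{IR})H^\top$ with $\varphi^{IR} = (\varphi_1^{IR}, \ldots, \varphi_\nu^{IR})$. The $\varphi_i^{IR}$'s are given by
$$\varphi_i^{IR} = \min_{s \le i}\max_{t \ge i}\frac{\sum_{r=s}^{t}\varphi_r}{t - s + 1},$$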
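so that $\sum_{i=1}^{\nu} g(\varphi_i^{IR}) \le \sum_{i=1}^{\nu} g(\varphi_i)$ for any convex function $g$. For the computation algorithm and mathematical properties of the isotonic regression, see Robertson et al. (1988, Chap. 1). We first show the following theorem.

Theorem 7.2 Let $\widehat{\Sigma}(\varphi^*) = H\,\mathrm{diag}(\varphi^*)H^\top$ be an orthogonally invariant estimator of $\Sigma$, where $\varphi^* = (\varphi_1^*, \ldots, \varphi_\nu^*)$ and the $\varphi_i^*$'s are functions of $L$. Assume that
$$\sum_{i=1}^{j}\varphi_i^* \ge \sum_{i=1}^{j}\varphi_i \ \text{ for } 1 \le j \le \nu - 1, \quad\text{and}\quad \sum_{i=1}^{\nu}\varphi_i^* = \sum_{i=1}^{\nu}\varphi_i.$$
If $\Pr(\varphi^* = \varphi) \ne 1$ and $\sum_{i=1}^{\nu}\log\varphi_i^* \ge \sum_{i=1}^{\nu}\log\varphi_i$, then $\widehat{\Sigma}(\varphi^*)$ dominates $\widehat{\Sigma}(\varphi)$ relative to the extended Stein loss (7.3).

Proof Since $|\mathrm{Ch}(\Sigma^{-1}H\,\mathrm{diag}(\varphi^*)H^\top)| = |H^\top\Sigma^{-1}H|\prod_{i=1}^{\nu}\varphi_i^*$, we obtain
$$\log|\mathrm{Ch}(\Sigma^{-1}H\,\mathrm{diag}(\varphi^*)H^\top)| = \log|H^\top\Sigma^{-1}H| + \sum_{i=1}^{\nu}\log\varphi_i^* \ge \log|H^\top\Sigma^{-1}H| + \sum_{i=1}^{\nu}\log\varphi_i = \log|\mathrm{Ch}(\Sigma^{-1}H\,\mathrm{diag}(\varphi)H^\top)|,$$
so that
$$R_{ES}(\widehat{\Sigma}(\varphi^*), \Sigma) - R_{ES}(\widehat{\Sigma}(\varphi), \Sigma) \le E[\mathrm{tr}\,\Sigma^{-1}H\{\mathrm{diag}(\varphi^*) - \mathrm{diag}(\varphi)\}H^\top].$$
For $i = 1, \ldots, \nu$, let $a_i = \{H^\top\Sigma^{-1}H\}_{ii}$. From (3.3),
$$E[\mathrm{tr}\,\Sigma^{-1}H\{\mathrm{diag}(\varphi^*) - \mathrm{diag}(\varphi)\}H^\top] = E\Big[\sum_{i=1}^{\nu}(\varphi_i^* - \varphi_i)a_i\Big] = c\int_{\mathbb{D}_\nu^{(\ge 0)}}\sum_{i=1}^{\nu}(\varphi_i^* - \varphi_i)E^*(a_i|L)\,|L|^{(|n-p|-1)/2}\prod_{1 \le i < j \le \nu}(\ell_i - \ell_j)\,(dL),$$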
where $c$ is a constant and
$$E^*(a_i|L) = \int_{\mathbb{V}_{p,\nu}} a_i\exp\Big(-\frac{1}{2}\sum_{k=1}^{\nu}a_k\ell_k\Big)(H^\top dH).$$
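Note that
$$\sum_{i=1}^{\nu}(\varphi_i^* - \varphi_i)E^*(a_i|L) = (\varphi_1^* - \varphi_1)\{E^*(a_1|L) - E^*(a_2|L)\} + (\varphi_1^* + \varphi_2^* - \varphi_1 - \varphi_2)\{E^*(a_2|L) - E^*(a_3|L)\} + \cdots + (\varphi_1^* + \cdots + \varphi_{\nu-1}^* - \varphi_1 - \cdots - \varphi_{\nu-1})\{E^*(a_{\nu-1}|L) - E^*(a_\nu|L)\}.$$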
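Hence, if
$$\Delta_i(L) \equiv E^*(a_i|L) - E^*(a_{i+1}|L) \le 0$$
for $i = 1, \ldots, \nu - 1$ then $R_{ES}(\widehat{\Sigma}(\varphi^*), \Sigma) \le R_{ES}(\widehat{\Sigma}(\varphi), \Sigma)$. Since $(H^\top dH)$ is invariant under any orthogonal transformation, it is invariant under permutation of columns of $H$. Exchanging the $i$-th and $(i+1)$-th columns of $H$ gives
$$\Delta_i(L) = \int_{\mathbb{V}_{p,\nu}}(a_{i+1} - a_i)\exp\Big(-\frac{1}{2}a_i\ell_{i+1} - \frac{1}{2}a_{i+1}\ell_i - \frac{1}{2}\sum_{k \ne i, i+1}a_k\ell_k\Big)(H^\top dH),$$
so that
$$2\Delta_i(L) = \int_{\mathbb{V}_{p,\nu}}(a_i - a_{i+1})\exp\Big(-\frac{1}{2}\sum_{k=1}^{\nu}a_k\ell_k\Big)(H^\top dH) + \int_{\mathbb{V}_{p,\nu}}(a_{i+1} - a_i)\exp\Big(-\frac{1}{2}a_i\ell_{i+1} - \frac{1}{2}a_{i+1}\ell_i - \frac{1}{2}\sum_{k \ne i, i+1}a_k\ell_k\Big)(H^\top dH)$$
$$= \int_{\mathbb{V}_{p,\nu}}(a_i - a_{i+1})\Big[1 - \exp\Big\{\frac{1}{2}(a_i - a_{i+1})(\ell_i - \ell_{i+1})\Big\}\Big]\exp\Big(-\frac{1}{2}\sum_{k=1}^{\nu}a_k\ell_k\Big)(H^\top dH).$$
For both when $a_i - a_{i+1} \ge 0$ and when $a_i - a_{i+1} < 0$, we can verify $\Delta_i(L) \le 0$. Thus the proof is complete.

It is easy to check that both estimators $\widehat{\Sigma}^{OS}$ and $\widehat{\Sigma}^{IR}$ satisfy the conditions of Theorem 7.2. Hence we obtain the following proposition.

Proposition 7.8 $\widehat{\Sigma}^{OS}$ and $\widehat{\Sigma}^{IR}$ are better than $\widehat{\Sigma}^{O}$ relative to the extended Stein loss (7.3).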
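The two order-restoring modifications can be illustrated with a short sketch. It assumes the input vector has positive entries, evaluates the isotonic regression directly from the min-max formula (cubic cost, chosen for transparency rather than speed), and checks the partial-sum and log-sum conditions of Theorem 7.2 on a small hypothetical example; all function names are illustrative.

```python
import numpy as np

def order_statistics_modification(phi):
    """Method (i): rearrange the entries in descending order."""
    return np.sort(np.asarray(phi, dtype=float))[::-1]

def isotonic_nonincreasing(phi):
    """Method (ii): nonincreasing isotonic regression via the min-max formula."""
    phi = np.asarray(phi, dtype=float)
    nu = len(phi)
    csum = np.concatenate(([0.0], np.cumsum(phi)))
    out = np.empty(nu)
    for i in range(nu):
        best = np.inf
        for s in range(i + 1):
            # Largest block average over blocks [s, t] with t >= i.
            block_max = max((csum[t + 1] - csum[s]) / (t - s + 1) for t in range(i, nu))
            best = min(best, block_max)
        out[i] = best
    return out

def satisfies_theorem_conditions(phi_star, phi, tol=1e-9):
    """Check partial sums, total sum and log-sum conditions of Theorem 7.2."""
    ps, p0 = np.cumsum(phi_star), np.cumsum(phi)
    partial = np.all(ps[:-1] >= p0[:-1] - tol)
    total = abs(ps[-1] - p0[-1]) <= tol
    logs = np.sum(np.log(phi_star)) >= np.sum(np.log(phi)) - tol
    return bool(partial and total and logs)

phi_demo = np.array([3.0, 1.0, 2.0])            # not order-preserving
phi_os = order_statistics_modification(phi_demo)
phi_ir = isotonic_nonincreasing(phi_demo)
```

For production use, a pool-adjacent-violators routine would give the same isotonic fit in linear time.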
7.4.3.6
The Perron Estimator
Consider the case of $n \ge p$. For every $O \in \mathbb{O}_p$, let $T_O T_O^\top$ be the Cholesky decomposition of $O^\top S O$, where $T_O \in \mathbb{L}_p^{(+)}$. Using the James-Stein estimator $\widehat{\Sigma}^{JS} = \widehat{\Sigma}^{JS}(S)$, we define an estimator of $\Sigma$ as
$$\widehat{\Sigma}^{E} = E[O\,\widehat{\Sigma}^{JS}(O^\top S O)\,O^\top | S] = E[O\,T_O D_p^{JS} T_O^\top O^\top | S],$$
where $E[\cdot|S]$ stands for conditional expectation with respect to the uniform distribution on $\mathbb{O}_p$ given $S$. Eaton (1970) suggested that $\widehat{\Sigma}^{E}$ is an orthogonally invariant estimator improving $\widehat{\Sigma}^{JS}$ relative to the ordinary Stein loss (7.2). Denote $\widehat{\Sigma}^{E} = H L \Phi^{E}(L) H^\top$ with $\Phi^{E}(L) = \mathrm{diag}(\phi_1^{E}, \ldots, \phi_p^{E})$. The computation of $\Phi^{E}(L)$ was done by Sharma and Krishnamoorthy (1983) for $p = 2$ and by Takemura (1984) for $p = 3$. Takemura (1984) also provided a detailed discussion on $\widehat{\Sigma}^{E}$ and showed that, for $i = 1, \ldots, p$, $\phi_i^{E}$ can be expressed by
$$\phi_i^{E} = \sum_{j=1}^{p} w_{ij}(L)\,d_j^{JS},$$
where $W(L) = (w_{ij}(L))$ is a doubly stochastic matrix. A closed form of $W(L)$ is hard to obtain. Perron (1992) proposed an approximation method for $W(L)$ and established a dominance result of the resulting estimator. Here, we give a unified Perron (1992) type estimator for the $n \ge p$ and the $p > n$ cases. For $i, j \in \{1, \ldots, \nu\}$, let
$$w_{ij}^{PR}(L) = \frac{\mathrm{tr}_{j-1}(L_i)}{\mathrm{tr}_{j-1}(L)} - \frac{\mathrm{tr}_j(L_i)}{\mathrm{tr}_j(L)},$$
where $L_i = \mathrm{diag}(\ell_1, \ldots, \ell_{i-1}, 0, \ell_{i+1}, \ldots, \ell_\nu)$ and
tr j (L) =
⎧ ⎪ 1, ⎨ ⎪ ⎩
1≤i 1 i where a = n + p − 2ν + 2τ − 1. It follows that
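$$\sum_{i=1}^{\tau}\sum_{j>i}\frac{f_i\psi_i - f_j\psi_j}{f_i - f_j} = \sum_{i=1}^{\tau}\Big\{(\tau - i)\psi_i + \sum_{j>i}f_j\,\frac{\psi_i - \psi_j}{f_i - f_j}\Big\},$$
implying that
$$E[\mathrm{tr}\,\Sigma^{-1}(Q^-)^\top\Psi Q^-] = E\Big[\sum_{i=1}^{\tau}\Big\{\alpha_i\psi_i - 2f_i\frac{\partial\psi_i}{\partial f_i} - 2\sum_{j>i}f_j\,\frac{\psi_i - \psi_j}{f_i - f_j}\Big\}\Big], \tag{7.30}$$
where $\alpha_i = a - 2(\tau - i) = |n - p| + 2i - 1$. Combining (7.29) and (7.30), we obtain (7.28). Thus the proof is complete.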
7.5.2 Examples of Improved Estimators
7.5.2.1
The Stein-Type Estimator
A Stein-type estimator similar to (7.24) is described by
$$\widehat{\Sigma}(\Psi^{ST}) = c_0\{S + (Q^-)^\top\Psi^{ST}(F)Q^-\}, \qquad \Psi^{ST}(F) = \mathrm{diag}(\psi_1^{ST}, \ldots, \psi_\tau^{ST}),$$
where for $i = 1, \ldots, \tau$,
$$\psi_i^{ST} = \frac{1}{c_0\alpha_i} - 1 = \frac{\nu - 2i + 1}{|n - p| + 2i - 1}.$$
The risk function of $\widehat{\Sigma}(\Psi^{ST})$ under the extended Stein loss (7.3) is expressed as
$$R_{ES}(\widehat{\Sigma}(\Psi^{ST}), \Sigma) = R_{ES}(\widehat{\Sigma}^{BS}, \Sigma) + \sum_{i=1}^{\tau}\{c_0\alpha_i\psi_i^{ST} - \log(1 + \psi_i^{ST})\} - 2c_0E[g_2(\Psi^{ST})].$$
It is observed that $\psi_i^{ST} - \psi_j^{ST} > 0$ for $j > i$, so that $g_2(\Psi^{ST}) > 0$. Also, we see that for $i = 1, \ldots, \tau$
$$c_0\alpha_i\psi_i^{ST} - \log(1 + \psi_i^{ST}) = -\{c_0\alpha_i - \log(c_0\alpha_i) - 1\} \le 0$$
because $x - \log x - 1 \ge 0$ for $x > 0$. Thus from Theorem 7.3, $\widehat{\Sigma}(\Psi^{ST})$ dominates $\widehat{\Sigma}^{BS}$ for any order of $m$, $n$ and $p$.

Further, if $\tau = \nu$, namely, $m > n \wedge p$, then $\widehat{\Sigma}(\Psi^{ST})$ dominates the James-Stein estimator $\widehat{\Sigma}^{JS}$ in (7.12). In fact, since $\sum_{i=1}^{\nu}(c_0\alpha_i - 1) = 0$ and
$$\sum_{i=1}^{\nu}\log(c_0\alpha_i) = -\nu\log\kappa + \sum_{i=1}^{\nu}\log\alpha_i = -\nu\log\kappa + \sum_{i=1}^{\nu}\log(n + p - 2i + 1),$$
it follows that, by (7.5) and (7.13),
$$R_{ES}(\widehat{\Sigma}(\Psi^{ST}), \Sigma) < R_{ES}(\widehat{\Sigma}^{BS}, \Sigma) - \sum_{i=1}^{\nu}\{c_0\alpha_i - \log(c_0\alpha_i) - 1\} = \sum_{i=1}^{\nu}\log(n + p - 2i + 1) - r_{\kappa,\nu} = R_{ES}(\widehat{\Sigma}^{JS}, \Sigma).$$
This shows that if $\tau = \nu$ then $\widehat{\Sigma}(\Psi^{ST})$ dominates $\widehat{\Sigma}^{JS}$ relative to the extended Stein loss (7.3).
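7.5.2.2
The Haff Type Estimator
As a reasonable estimator, we define the Haff (1980) type estimator as
$$\widehat{\Sigma}(\Psi^{HF}) = c_0\{S + (Q^-)^\top\Psi^{HF}Q^-\}, \qquad \Psi^{HF}(F) = \mathrm{diag}(\psi_1^{HF}, \ldots, \psi_\tau^{HF}),$$
$$\psi_i^{HF} = \frac{a f_i}{\mathrm{tr}\,F} \quad (i = 1, \ldots, \tau), \qquad a > 0.$$
Using Theorem 7.3, we can show that the Haff type estimator $\widehat{\Sigma}(\Psi^{HF})$ dominates $\widehat{\Sigma}^{BS}$ if the constant $a$ satisfies the inequality $0 < a \le 2(\nu - 1)/(|n - p| + 1)$ for $\nu > 1$. In fact, it is noted that
$$g_2(\Psi^{HF}) = \sum_{i=1}^{\tau}\sum_{j>i}\frac{\psi_i^{HF} - \psi_j^{HF}}{f_i - f_j}\,f_j = \frac{a}{\mathrm{tr}\,F}\sum_{i=1}^{\tau}\sum_{j>i}f_j = \frac{a}{\mathrm{tr}\,F}\sum_{i=1}^{\tau}(i - 1)f_i.$$
The difference in risk of $\widehat{\Sigma}(\Psi^{HF})$ and $\widehat{\Sigma}^{BS}$ is written as
$$R_{ES}(\widehat{\Sigma}(\Psi^{HF}), \Sigma) - R_{ES}(\widehat{\Sigma}^{BS}, \Sigma) = c_0(|n - p| + 1)a - 2c_0E[g_1(\Psi^{HF})] - E\Big[\sum_{i=1}^{\tau}\log(1 + \psi_i^{HF})\Big].$$
It follows that for any $F \in \mathbb{D}_\tau^{(\ge 0)}$
$$g_1(\Psi^{HF}) = a\Big\{1 - \frac{\mathrm{tr}\,F^2}{(\mathrm{tr}\,F)^2}\Big\} \ge 0.$$
Since $\log(1 + x) \ge 2x/(2 + x)$ for $x \ge 0$ and $\sum_{i=1}^{\tau}\psi_i^{HF} = a$, we observe
$$\sum_{i=1}^{\tau}\log(1 + \psi_i^{HF}) \ge \sum_{i=1}^{\tau}\frac{2\psi_i^{HF}}{2 + \psi_i^{HF}} \ge \sum_{i=1}^{\tau}\frac{2\psi_i^{HF}}{2 + a} = \frac{2a}{2 + a}.$$
Thus, using $c_0^{-1} = |n - p| + \nu$,
$$R_{ES}(\widehat{\Sigma}(\Psi^{HF}), \Sigma) - R_{ES}(\widehat{\Sigma}^{BS}, \Sigma) \le c_0(|n - p| + 1)a - \frac{2a}{2 + a} = c_0(|n - p| + 1)\,\frac{a}{2 + a}\Big\{a - \frac{2(\nu - 1)}{|n - p| + 1}\Big\},$$
which shows the dominance result.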
7.5.3 Further Improvements with a Truncation Rule
First, we provide a useful lemma which will be a key tool to show further dominance results.

Lemma 7.1 Let $\Psi(F) \in \mathbb{D}_\tau$ such that the diagonal elements are absolutely continuous and nonnegative functions of $F$. Then we have
$$E[\mathrm{tr}\,\Sigma^{-1}(Q^-)^\top(I_\tau + F)\Psi(F)Q^-] \ge E[(\kappa + m)\,\mathrm{tr}\,\Psi(F)].$$
Proof See Tsukuma and Kubokawa (2016).
By using Lemma 7.1, we will improve on $\widehat{\Sigma}(\Psi^{ST})$ and $\widehat{\Sigma}(\Psi^{HF})$. Let $[\Psi]^{TR} = \mathrm{diag}(\psi_1^{TR}(F), \ldots, \psi_\tau^{TR}(F)) \in \mathbb{D}_\tau$ such that the $i$-th diagonal element is given by
$$\psi_i^{TR}(F) = \min\Big\{\psi_i(F),\ \frac{1 + f_i}{c_0(\kappa + m)} - 1\Big\},$$
where $\Psi(F) = \mathrm{diag}(\psi_1(F), \ldots, \psi_\tau(F))$. Then we obtain a general dominance result for improvement on the class (7.27).

Theorem 7.4 For any possible ordering among $m$, $n$ and $p$, the truncated estimator $\widehat{\Sigma}([\Psi]^{TR})$ dominates $\widehat{\Sigma}(\Psi)$ relative to the extended Stein loss (7.3) if $\Pr([\Psi]^{TR} \ne \Psi) > 0$.

Proof Abbreviate $\Psi(F)$ to $\Psi$. The difference in risk of $\widehat{\Sigma}(\Psi)$ and $\widehat{\Sigma}([\Psi]^{TR})$ can be expressed as
$$R_{ES}(\widehat{\Sigma}(\Psi), \Sigma) - R_{ES}(\widehat{\Sigma}([\Psi]^{TR}), \Sigma) = E[c_0\,\mathrm{tr}\,\Sigma^{-1}(Q^-)^\top(\Psi - [\Psi]^{TR})Q^- - \log|I_\tau + \Psi| + \log|I_\tau + [\Psi]^{TR}|]$$
$$\ge E[c_0(\kappa + m)\,\mathrm{tr}\,(I_\tau + F)^{-1}(\Psi - [\Psi]^{TR}) - \log|I_\tau + \Psi| + \log|I_\tau + [\Psi]^{TR}|],$$
where the inequality follows directly from Lemma 7.1. The last r.h.s. can be written as $E[\sum_{i=1}^{\tau}\Delta_i]$ with
$$\Delta_i = c_0(\kappa + m)\cdot\frac{\psi_i - \psi_i^{TR}}{1 + f_i} - \log(1 + \psi_i) + \log(1 + \psi_i^{TR}).$$
When $\psi_i^{TR} = \psi_i$, we get $\Delta_i = 0$. When $\psi_i^{TR} = c_0^{-1}(1 + f_i)/(\kappa + m) - 1 < \psi_i$, it is observed that
$$\Delta_i = c_0(\kappa + m)\cdot\frac{1 + \psi_i}{1 + f_i} - \log\Big\{c_0(\kappa + m)\cdot\frac{1 + \psi_i}{1 + f_i}\Big\} - 1 \ge 0,$$
which completes the proof.

The following proposition is derived immediately from Theorem 7.4.

Proposition 7.10 The truncated estimator $\widehat{\Sigma}([\Psi^{ST}]^{TR})$ dominates $\widehat{\Sigma}(\Psi^{ST})$ relative to the extended Stein loss (7.3). Also, $\widehat{\Sigma}([\Psi^{HF}]^{TR})$ dominates $\widehat{\Sigma}(\Psi^{HF})$ relative to the extended Stein loss (7.3).
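The truncation rule can be sketched numerically for the Stein-type constants. The sketch assumes $c_0 = 1/\kappa$ with $\kappa = n \vee p$ (which is what the two expressions for $\psi_i^{ST}$ imply), takes $\tau$ and the eigenvalues $f_i$ as given inputs, and uses hypothetical names and demo values throughout.

```python
import numpy as np

def stein_type_psi(n, p, tau):
    """Stein-type constants psi_i^{ST} = 1/(c0 * alpha_i) - 1, assuming c0 = 1/max(n, p)."""
    kappa = max(n, p)
    c0 = 1.0 / kappa
    i = np.arange(1, tau + 1)
    alpha = abs(n - p) + 2 * i - 1
    return 1.0 / (c0 * alpha) - 1.0, c0, kappa

def truncate(psi, f, c0, kappa, m):
    """Truncation rule: psi_i^{TR} = min{psi_i, (1 + f_i) / (c0 (kappa + m)) - 1}."""
    return np.minimum(psi, (1.0 + f) / (c0 * (kappa + m)) - 1.0)

def delta_terms(psi, psi_tr, f, c0, kappa, m):
    """Per-coordinate terms Delta_i from the proof of the truncation theorem."""
    return (c0 * (kappa + m) * (psi - psi_tr) / (1.0 + f)
            - np.log1p(psi) + np.log1p(psi_tr))

# Hypothetical demo: n = 10, p = 4, m = 3, tau = 3, with made-up eigenvalues f_i.
n_demo, p_demo, m_demo, tau_demo = 10, 4, 3, 3
f_demo = np.array([2.0, 0.5, 0.1])
psi_st, c0, kappa = stein_type_psi(n_demo, p_demo, tau_demo)
psi_tr = truncate(psi_st, f_demo, c0, kappa, m_demo)
deltas = delta_terms(psi_st, psi_tr, f_demo, c0, kappa, m_demo)
```

In this demo only the third coordinate is actually truncated, and its $\Delta_i$ is strictly positive, in line with the argument in the proof.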
7.6 Related Topics
7.6.1 Decomposition of the Estimation Problem
When $n \ge p$, covariance estimation under the ordinary Stein loss (7.2) is closely related to simultaneous estimation of mean vectors and variances. Here we will briefly introduce the relationship and then give a simple improved procedure on $\widehat{\Sigma}^{JS}$ via the James-Stein (1961) shrinkage estimators of multivariate normal mean vectors.

We use the same notation as in Proposition 3.10 for $n \ge p$. Recall that $TT^\top$ and $\Xi\Xi^\top$ are the Cholesky decompositions of the Wishart matrix $S$ and the covariance matrix $\Sigma$, respectively, where $T = (t_{i,j}) \in \mathbb{L}_p^{(+)}$ and $\Xi = (\xi_{i,j}) \in \mathbb{L}_p^{(+)}$. Denote $T_{(p)} = t_{p,p}$ and $\Xi_{(p)} = \xi_{p,p}$ and, for $i = p - 1, \ldots, 1$, define $T_{(i)}$ and $\Xi_{(i)}$ inductively as, respectively,
$$T_{(i)} = \begin{pmatrix} t_{i,i} & 0_{p-i}^\top \\ t_{(i)} & T_{(i+1)} \end{pmatrix} \in \mathbb{L}_{p-i+1}^{(+)}, \qquad \Xi_{(i)} = \begin{pmatrix} \xi_{i,i} & 0_{p-i}^\top \\ \xi_{(i)} & \Xi_{(i+1)} \end{pmatrix} \in \mathbb{L}_{p-i+1}^{(+)}$$
with $t_{(i)} = (t_{i+1,i}, \ldots, t_{p,i})^\top$ and $\xi_{(i)} = (\xi_{i+1,i}, \ldots, \xi_{p,i})^\top$. Note that $T_{(1)} = T$ and $\Xi_{(1)} = \Xi$. By Proposition 3.10, the columns of $T$ are mutually independent and
$$t_{i,i}^2 \sim \sigma_i^2\chi_{n-i+1}^2 \ \text{ for } i = 1, \ldots, p, \qquad t_{(i)} | t_{i,i} \sim N_{p-i}(t_{i,i}\gamma_{(i)}, \Sigma_{(i+1)}) \ \text{ for } i = 1, \ldots, p - 1, \tag{7.31}$$
where $\gamma_{(i)} = \xi_{(i)}/\xi_{i,i}$ for $i = 1, \ldots, p - 1$, $\sigma_i^2 = \xi_{i,i}^2$ for $i = 1, \ldots, p$, and $\Sigma_{(i)} = \Xi_{(i)}\Xi_{(i)}^\top$ for $i = 1, \ldots, p$. Recall also that for $i = 1, \ldots, p - 1$
$$\Sigma_{(i)} = \begin{pmatrix} 1 & 0_{p-i}^\top \\ \gamma_{(i)} & I_{p-i} \end{pmatrix}\begin{pmatrix} \sigma_i^2 & 0_{p-i}^\top \\ 0_{p-i} & \Sigma_{(i+1)} \end{pmatrix}\begin{pmatrix} 1 & \gamma_{(i)}^\top \\ 0_{p-i} & I_{p-i} \end{pmatrix}. \tag{7.32}$$
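Let the $\widehat{\sigma}_i^2$'s and the $\widehat{\gamma}_{(i)}$'s be certain estimators of the $\sigma_i^2$'s and the $\gamma_{(i)}$'s, respectively. Set $\widehat{\Sigma}_{(p)} = \widehat{\sigma}_p^2$. For $i = p - 1, \ldots, 1$, we define $\widehat{\Sigma}_{(i)}$ $(\in \mathbb{S}_{p-i+1}^{(+)})$ inductively as
$$\widehat{\Sigma}_{(i)} = \begin{pmatrix} 1 & 0_{p-i}^\top \\ \widehat{\gamma}_{(i)} & I_{p-i} \end{pmatrix}\begin{pmatrix} \widehat{\sigma}_i^2 & 0_{p-i}^\top \\ 0_{p-i} & \widehat{\Sigma}_{(i+1)} \end{pmatrix}\begin{pmatrix} 1 & \widehat{\gamma}_{(i)}^\top \\ 0_{p-i} & I_{p-i} \end{pmatrix}. \tag{7.33}$$
Then $\widehat{\Sigma}^{A} = \widehat{\Sigma}_{(1)}$ is an estimator of $\Sigma$. Conversely, for any estimator $\widehat{\Sigma}$, the LDL decomposition of $\widehat{\Sigma}$ can be obtained uniquely from (7.33). Combining (7.32) and (7.33) yields
$$\mathrm{tr}\,\Sigma_{(i)}^{-1}\widehat{\Sigma}_{(i)} = \widehat{\sigma}_i^2/\sigma_i^2 + \widehat{\sigma}_i^2(\widehat{\gamma}_{(i)} - \gamma_{(i)})^\top\Sigma_{(i+1)}^{-1}(\widehat{\gamma}_{(i)} - \gamma_{(i)}) + \mathrm{tr}\,\Sigma_{(i+1)}^{-1}\widehat{\Sigma}_{(i+1)},$$
which is used again and again to obtain
$$\mathrm{tr}\,\Sigma^{-1}\widehat{\Sigma}^{A} = \sum_{i=1}^{p}\frac{\widehat{\sigma}_i^2}{\sigma_i^2} + \sum_{i=1}^{p-1}\widehat{\sigma}_i^2(\widehat{\gamma}_{(i)} - \gamma_{(i)})^\top\Sigma_{(i+1)}^{-1}(\widehat{\gamma}_{(i)} - \gamma_{(i)}).$$
Since $|\Sigma^{-1}\widehat{\Sigma}^{A}| = \prod_{i=1}^{p}\widehat{\sigma}_i^2/\sigma_i^2$, the ordinary Stein loss (7.2) of $\widehat{\Sigma}^{A}$ derived from (7.33) can alternatively be written as
$$L_S(\widehat{\Sigma}^{A}, \Sigma) = \sum_{i=1}^{p}\Big(\frac{\widehat{\sigma}_i^2}{\sigma_i^2} - \log\frac{\widehat{\sigma}_i^2}{\sigma_i^2} - 1\Big) + \sum_{i=1}^{p-1}\widehat{\sigma}_i^2(\widehat{\gamma}_{(i)} - \gamma_{(i)})^\top\Sigma_{(i+1)}^{-1}(\widehat{\gamma}_{(i)} - \gamma_{(i)}). \tag{7.34}$$
This suggests that the covariance estimation problem with the ordinary Stein loss (7.2) is considered as the problem of simultaneously estimating the $\sigma_i^2$'s and the $\gamma_{(i)}$'s under the decomposed loss (7.34) in the decomposed model (7.31).

If $\widehat{\Sigma}^{A} = \widehat{\Sigma}^{JS}$, then estimators of the $\sigma_i^2$'s and the $\gamma_{(i)}$'s can be written as, respectively,
$$\widehat{\sigma}_i^{2JS} = d_i^{JS}t_{i,i}^2 \ \text{ for } i = 1, \ldots, p, \qquad \widehat{\gamma}_{(i)}^{JS} = t_{(i)}/t_{i,i} \ \text{ for } i = 1, \ldots, p - 1.$$
Hence the risk of $\widehat{\Sigma}^{JS}$ is expressed by $R_S(\widehat{\Sigma}^{JS}, \Sigma) = E[L_S(\widehat{\Sigma}^{JS}, \Sigma)] = R_1^{JS} + R_2^{JS}$, where
$$R_1^{JS} = E\Big[\sum_{i=1}^{p}\Big(\frac{\widehat{\sigma}_i^{2JS}}{\sigma_i^2} - \log\frac{\widehat{\sigma}_i^{2JS}}{\sigma_i^2} - 1\Big)\Big],$$
$$R_2^{JS} = E\Big[\sum_{i=1}^{p-1}\widehat{\sigma}_i^{2JS}(\widehat{\gamma}_{(i)}^{JS} - \gamma_{(i)})^\top\Sigma_{(i+1)}^{-1}(\widehat{\gamma}_{(i)}^{JS} - \gamma_{(i)})\Big],$$
where the expectations are taken with respect to (7.31). Here, when $p \ge 4$, we consider improvement on
$$R_2^{JS} = E\Big[\sum_{i=1}^{p-1}\widehat{\sigma}_i^{2JS}(p - i)/t_{i,i}^2\Big] = \sum_{i=1}^{p-1}(p - i)d_i^{JS}.$$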
For $i = 1, \ldots, p - 1$, denote $x_{(i)} = t_{(i)}/t_{i,i}$ and $S_{(i)} = T_{(i)}T_{(i)}^\top$. Define
$$\widehat{\gamma}_{(i)}^{SH} = \begin{cases} \Big\{1 - \dfrac{p - i - 2}{(n - p + 3)\,t_{i,i}^2\,x_{(i)}^\top S_{(i+1)}^{-1}x_{(i)}}\Big\}x_{(i)} & \text{for } i = 1, \ldots, p - 3, \\ x_{(i)} & \text{for } i = p - 2 \text{ and } p - 1. \end{cases}$$
For $i = 1, \ldots, p - 3$, $\widehat{\gamma}_{(i)}^{SH}$ is the James-Stein (1961, Eq. 23) shrinkage estimator in estimation of the multivariate normal mean vector with unknown covariance matrix.

Note that $S_{(i+1)} \sim W_{p-i}(n - i, \Sigma_{(i+1)})$ independent of $x_{(i)}$ for each $i$. In a similar way to James and Stein (1961), we can show that for $i = 1, \ldots, p - 3$
$$E\big[\widehat{\sigma}_i^{2JS}(\widehat{\gamma}_{(i)}^{SH} - \gamma_{(i)})^\top\Sigma_{(i+1)}^{-1}(\widehat{\gamma}_{(i)}^{SH} - \gamma_{(i)})\big] \le (p - i)d_i^{JS},$$
which implies that $\widehat{\Sigma}^{A}$ obtained from using the $\widehat{\sigma}_i^{2JS}$'s and the $\widehat{\gamma}_{(i)}^{SH}$'s dominates $\widehat{\Sigma}^{JS}$ relative to the ordinary Stein loss (7.2). For more details and improvement on $R_1^{JS}$, see Tsukuma (2014a, 2016b). The improvement on $\widehat{\Sigma}^{JS}$ can also be done by using matricial shrinkage estimators of the mean matrix and this was discussed in Ma et al. (2012).
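The loss decomposition (7.34) can be checked numerically with a small sketch. It reads the variance and regression components off the Cholesky factors of both matrices (the diagonal entries squared, the scaled subdiagonal columns, and the trailing blocks of the factor), and compares the decomposed loss with the ordinary Stein loss computed directly. Function names and the random test matrices are hypothetical.

```python
import numpy as np

def cholesky_parts(A):
    """Return sigma_i^2, gamma_(i) and Sigma_(i) built from the Cholesky factor of A."""
    C = np.linalg.cholesky(A)                      # lower triangular, positive diagonal
    p = A.shape[0]
    diag = np.diag(C)
    sigma2 = diag ** 2
    gamma = [C[i + 1:, i] / diag[i] for i in range(p - 1)]
    sub = [C[i:, i:] @ C[i:, i:].T for i in range(p)]   # sub[i] is Sigma_(i+1) in 1-based notation
    return sigma2, gamma, sub

def stein_loss(sigma_hat, sigma):
    """Ordinary Stein loss: tr(M) - log|M| - p with M = Sigma^{-1} Sigma_hat."""
    M = np.linalg.solve(sigma, sigma_hat)
    return np.trace(M) - np.linalg.slogdet(M)[1] - sigma.shape[0]

def decomposed_loss(sigma_hat, sigma):
    """Decomposed form of the Stein loss in terms of variances and regression vectors."""
    s2_hat, g_hat, _ = cholesky_parts(sigma_hat)
    s2, g, sub = cholesky_parts(sigma)
    ratio = s2_hat / s2
    total = np.sum(ratio - np.log(ratio) - 1.0)
    for i in range(len(g)):
        d = g_hat[i] - g[i]
        total += s2_hat[i] * (d @ np.linalg.solve(sub[i + 1], d))
    return total

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
B = rng.standard_normal((5, 5))
sigma_demo = A @ A.T + 5.0 * np.eye(5)
sigma_hat_demo = B @ B.T + 5.0 * np.eye(5)
direct = stein_loss(sigma_hat_demo, sigma_demo)
decomposed = decomposed_loss(sigma_hat_demo, sigma_demo)
```

Note that the matrices `sub[i]` are products of trailing blocks of the Cholesky factor, not trailing blocks of the covariance matrix itself.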
7.6.2 Decision-Theoretic Studies Under Quadratic Losses
Instead of the ordinary Stein loss (7.2) or the extended Stein loss (7.3), some quadratic-type loss functions have often been used for obtaining decision-theoretic results on covariance estimation. A typical quadratic loss is
$$L_1(\widehat{\Sigma}, \Sigma) = \mathrm{tr}\,\Sigma^{-1}(\widehat{\Sigma} - \Sigma)\Sigma^{-1}(\widehat{\Sigma} - \Sigma) = \mathrm{tr}\,(\Sigma^{-1}\widehat{\Sigma} - I_p)^2.$$
The $L_1$-loss is invariant under the general scale transformation $\widehat{\Sigma} \to U\widehat{\Sigma}U^\top$ and $\Sigma \to U\Sigma U^\top$ for any $U \in \mathbb{U}_p$. Selliah (1964) addressed the $n \ge p$ case of covariance estimation under the $L_1$-loss and obtained a minimax estimator based on the Cholesky decomposition of the Wishart matrix. For other approaches, see Haff (1979b, 1980, 1991), Yang and Berger (1994) and Tsukuma (2014b). See also Konno (2009), who discussed the $p > n$ case under the $L_1$-loss.

A multivariate extension of squared error loss to covariance estimation may be defined by
$$L_2(\widehat{\Sigma}, \Sigma) = \mathrm{tr}\,(\widehat{\Sigma} - \Sigma)^2,$$
namely, the squared Frobenius norm of $\widehat{\Sigma} - \Sigma$. The $L_2$-loss has orthogonal invariance under the orthogonal transformation $\widehat{\Sigma} \to O\widehat{\Sigma}O^\top$ and $\Sigma \to O\Sigma O^\top$ for any $O \in \mathbb{O}_p$. In the literature, a much-discussed estimator is a linear shrinkage estimator $\widehat{\Sigma}^{LS} = \alpha\widehat{\Sigma}^{UB} + (1 - \alpha)(\mathrm{tr}\,\widehat{\Sigma}^{UB}/p)I_p$, where $0 \le \alpha \le 1$ and $\widehat{\Sigma}^{UB} = S/n$ is the unbiased estimator of $\Sigma$ under the normality assumption on error distribution. Leung and Chan (1998) suggested using $\alpha = n/(n + 2)$ from a decision-theoretic point of view. This suggestion of Leung and Chan (1998) was extended to an elliptically contoured distribution model by Leung and Ng (2004). Ledoit and Wolf (2004) took an asymptotic approach to estimating an optimal $\alpha$ from sample under a general error distribution.
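As a minimal sketch of the linear shrinkage estimator described above, the following code forms the convex combination of $S/n$ and a scaled identity, with the default weight $\alpha = n/(n+2)$ attributed in the text to Leung and Chan (1998). The function names and the demo data are hypothetical, and no data-driven choice of $\alpha$ (as in Ledoit and Wolf 2004) is attempted.

```python
import numpy as np

def linear_shrinkage(S, n, alpha=None):
    """alpha * (S/n) + (1 - alpha) * (tr(S/n)/p) * I_p, with alpha = n/(n+2) by default."""
    p = S.shape[0]
    if alpha is None:
        alpha = n / (n + 2.0)
    sigma_ub = S / n
    return alpha * sigma_ub + (1.0 - alpha) * (np.trace(sigma_ub) / p) * np.eye(p)

def l2_loss(sigma_hat, sigma):
    """Squared Frobenius-type loss tr (Sigma_hat - Sigma)^2."""
    D = sigma_hat - sigma
    return np.trace(D @ D)

rng = np.random.default_rng(2)
n_demo, p_demo = 15, 6
Y_demo = rng.standard_normal((n_demo, p_demo))
S_demo = Y_demo.T @ Y_demo
sigma_ls = linear_shrinkage(S_demo, n_demo)
```

The shrinkage preserves the trace of $S/n$ while pulling every eigenvalue toward the average eigenvalue, which is why the smallest eigenvalue cannot decrease.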
For $\Sigma = (\sigma_{ij})$ and $\widehat{\Sigma} = (\widehat{\sigma}_{ij})$, the $L_2$-loss can be generalized as
$$L_3(\widehat{\Sigma}, \Sigma) = \sum_{1 \le i \le j \le p} w_{ij}(\widehat{\sigma}_{ij} - \sigma_{ij})^2,$$
where $w_{ij} \ge 0$ for $1 \le i \le j \le p$. When $w_{ii} = 1$ for $i = 1, \ldots, p$ and $w_{ij} = 2$ for $1 \le i < j \le p$, the $L_3$-loss coincides with the $L_2$-loss. However, the $L_3$-loss does not include the $L_1$-loss. For decision-theoretic results under the $L_3$-loss, see Perlman (1972) and Haff (1979b).

For $n \ge p$, as a variant type of the $L_1$-loss, we define
$$L_4(\widehat{\Sigma}, \Sigma) = \mathrm{tr}\,\widehat{\Sigma}^{-1}(\widehat{\Sigma} - \Sigma)\Sigma^{-1}(\widehat{\Sigma} - \Sigma) = \mathrm{tr}\,\Sigma^{-1}\widehat{\Sigma} + \mathrm{tr}\,(\Sigma^{-1}\widehat{\Sigma})^{-1} - 2p.$$
The $L_4$-loss can also be obtained from the sum of the ordinary Stein loss (7.2) and its different entropy-type loss $L_P(\widehat{\Sigma}, \Sigma) = \mathrm{tr}\,(\Sigma^{-1}\widehat{\Sigma})^{-1} - \log|(\Sigma^{-1}\widehat{\Sigma})^{-1}| - p$. The invariance of the $L_4$-loss can easily be verified under a general scale transformation. Improved estimation under the $L_4$-loss was studied by Kubokawa and Konno (1990), Gupta and Ofori-Nyarko (1995) and Sun and Sun (2005).
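The relations among these losses are easy to confirm numerically. The sketch below uses hypothetical function names and random positive definite matrices; it checks that the weighted loss with unit diagonal weights and off-diagonal weights 2 reproduces the squared Frobenius loss, and that the symmetric loss equals the sum of the ordinary Stein loss and the entropy-type loss.

```python
import numpy as np

def l2_loss(sigma_hat, sigma):
    D = sigma_hat - sigma
    return np.trace(D @ D)

def l3_loss(sigma_hat, sigma, W):
    """Weighted loss over the upper triangle (i <= j), with weights W[i, j] >= 0."""
    iu = np.triu_indices(sigma.shape[0])
    return np.sum(W[iu] * (sigma_hat - sigma)[iu] ** 2)

def stein_loss(sigma_hat, sigma):
    M = np.linalg.solve(sigma, sigma_hat)          # Sigma^{-1} Sigma_hat
    return np.trace(M) - np.linalg.slogdet(M)[1] - sigma.shape[0]

def entropy_type_loss(sigma_hat, sigma):
    M_inv = np.linalg.solve(sigma_hat, sigma)      # (Sigma^{-1} Sigma_hat)^{-1}
    return np.trace(M_inv) - np.linalg.slogdet(M_inv)[1] - sigma.shape[0]

def l4_loss(sigma_hat, sigma):
    p = sigma.shape[0]
    return (np.trace(np.linalg.solve(sigma, sigma_hat))
            + np.trace(np.linalg.solve(sigma_hat, sigma)) - 2 * p)

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))
sigma_demo = A @ A.T + 4.0 * np.eye(4)
sigma_hat_demo = B @ B.T + 4.0 * np.eye(4)
W_demo = 2.0 * np.ones((4, 4)) - np.eye(4)         # 1 on the diagonal, 2 off the diagonal
```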
7.6.3 Estimation of the Generalized Variance
Some statistical measures are formulated as functions of the covariance matrix. Generalized variance is the determinant of the covariance matrix and is interpreted as a scalar measure of uncertainty.

Consider the case of $n \ge p$ in the model (7.1). The generalized variance is defined by $|\Sigma|$. Denote an estimator of $|\Sigma|$ by $|\widehat{\Sigma}|$ and we now treat decision-theoretic estimation of $|\Sigma|$ relative to a quadratic-type loss
$$L_G(|\widehat{\Sigma}|, |\Sigma|) = |\Sigma|^{-2}(|\widehat{\Sigma}| - |\Sigma|)^2.$$
The Cholesky decomposition of $S$ is denoted by $S = TT^\top$, where $T = (t_{ij}) \in \mathbb{L}_p^{(+)}$. Due to Proposition 3.10, it turns out that
$$E[|S|] = \prod_{i=1}^{p}E[t_{ii}^2] = \prod_{i=1}^{p}(n - i + 1)\sigma_i^2 = |\Sigma|\prod_{i=1}^{p}(n - i + 1),$$
so that
$$|\widehat{\Sigma}^{UB}| = \Big\{\prod_{i=1}^{p}(n - i + 1)\Big\}^{-1}|S| = \frac{(n - p)!}{n!}|S|$$
is the unbiased estimator of $|\Sigma|$. However, under the $L_G$-loss, $|\widehat{\Sigma}^{UB}|$ is not the best among constant-multiple estimators of the form $c|S|$ with positive constant $c$. The best constant $c$ that minimizes the risk of the estimator $c|S|$ is
$$c_0 = \frac{(n - p + 2)!}{(n + 2)!},$$
which can easily be verified by Proposition 3.10. Here, any constant-multiple estimator $c|S|$ is invariant under an affine transformation.

Shorrock and Zidek (1976) discussed improvement on the best affine invariant estimator $|\widehat{\Sigma}^{BC}| = c_0|S|$ by using the information on $X$. They showed that $|\widehat{\Sigma}^{BC}|$ is dominated by
$$|\widehat{\Sigma}^{SZ}| = \min\Big\{\frac{(n - p + 2)!}{(n + 2)!}|S|,\ \frac{(n + m - p + 2)!}{(n + m + 2)!}|S + X^\top X|\Big\}$$
relative to the $L_G$-loss. Clearly, the probability of $|\widehat{\Sigma}^{SZ}| \le |\widehat{\Sigma}^{BC}|$ is one, and hence $|\widehat{\Sigma}^{SZ}|$ is shrinking $|\widehat{\Sigma}^{BC}|$ toward zero. A different approach to proving the above dominance result is given by Sinha (1976). Rukhin and Sinha (1990) provided another dominance result without using the information on $X$. Some results under an entropy-type loss are obtained by Sinha and Ghosh (1987) and Kubokawa and Srivastava (2003). On the other hand, a dominance result in the case of $p > n$ is still not known.
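The constants in this subsection can be computed directly from their product forms. The sketch below is illustrative only: it assumes $X$ is stored as an $m \times p$ array so that $X^\top X$ is $p \times p$, writes the best constant as $\prod_{i=1}^{p}(n - i + 3)^{-1}$ (which equals $(n-p+2)!/(n+2)!$), and uses hypothetical function names and simulated data.

```python
import math
import numpy as np

def gv_constants(n, p):
    """Unbiased and best constant multiples of |S|, from their product forms."""
    i = np.arange(1, p + 1)
    c_unbiased = 1.0 / np.prod(n - i + 1.0)   # {prod (n - i + 1)}^{-1}
    c_best = 1.0 / np.prod(n - i + 3.0)       # equals (n - p + 2)! / (n + 2)!
    return c_unbiased, c_best

def shorrock_zidek(S, X, n):
    """Shorrock-Zidek type estimate of the generalized variance (X assumed m x p)."""
    m, p = X.shape
    i = np.arange(1, p + 1)
    c_best = 1.0 / np.prod(n - i + 3.0)
    c_pooled = 1.0 / np.prod(n + m - i + 3.0)  # equals (n + m - p + 2)! / (n + m + 2)!
    return min(c_best * np.linalg.det(S), c_pooled * np.linalg.det(S + X.T @ X))

rng = np.random.default_rng(4)
n_demo, m_demo, p_demo = 12, 3, 4
Y_demo = rng.standard_normal((n_demo, p_demo))
X_demo = rng.standard_normal((m_demo, p_demo))
S_demo = Y_demo.T @ Y_demo
c_ub, c_best = gv_constants(n_demo, p_demo)
gv_sz = shorrock_zidek(S_demo, X_demo, n_demo)
gv_bc = c_best * np.linalg.det(S_demo)
```

Since the best constant is smaller than the unbiased one, the best constant-multiple estimator already shrinks the unbiased estimator, and the Shorrock-Zidek rule can only shrink further.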
7.6.4 Estimation of the Precision Matrix
Assume now that $n \ge p$. Recall that $S = Y^\top Y \sim W_p(n, \Sigma)$. For any constant matrix $A \in \mathbb{S}_p$, an application of the Haff identity (5.5) to $\mathrm{tr}\,\Sigma^{-1}A$ yields
$$\mathrm{tr}\,\Sigma^{-1}A = E[(n - p - 1)\,\mathrm{tr}\,S^{-1}A + 2\,\mathrm{tr}\,D_S A] = E[(n - p - 1)\,\mathrm{tr}\,S^{-1}A].$$
From the arbitrariness of $A$, we obtain $\Sigma^{-1} = E[(n - p - 1)S^{-1}]$, so that $\widehat{\Sigma}^{-1}_{UB} = (n - p - 1)S^{-1}$ is the unbiased estimator of $\Sigma^{-1}$.

The inverse of the covariance matrix is commonly called the precision matrix. The estimation problem of the precision matrix $\Sigma^{-1}$ has been studied since Efron and Morris (1976). They pointed out that certain empirical Bayes estimation for a normal mean matrix is closely related to the problem of estimating $\Sigma^{-1}$ under a quadratic-type loss
$$L_{EM}(\widehat{\Sigma}^{-1}, \Sigma^{-1} | S) = \mathrm{tr}\,(\widehat{\Sigma}^{-1} - \Sigma^{-1})^2 S.$$
Here, we briefly introduce a unified approach to the $n \ge p$ and the $p > n$ cases based on the Efron-Morris (1976) estimator. Let
$$\widehat{\Sigma}^{-1}_{BS} = a_0 S^+, \qquad a_0 = |n - p| - 1.$$
Among estimators of the form $aS^+$ with positive constant $a$, $\widehat{\Sigma}^{-1}_{BS}$ is the best estimator relative to the $L_{EM}$-loss, and, when $n \ge p$, $\widehat{\Sigma}^{-1}_{BS}$ is equivalent to $\widehat{\Sigma}^{-1}_{UB}$. To improve $\widehat{\Sigma}^{-1}_{BS}$, we define the Efron-Morris (1976) type estimator as
$$\widehat{\Sigma}^{-1}_{EM} = \widehat{\Sigma}^{-1}_{BS} + \frac{(\nu - 1)(\nu + 2)}{\mathrm{tr}\,S}\,SS^+$$
with $\nu = n \wedge p$. If $n \ge p$, then $\widehat{\Sigma}^{-1}_{EM}$ is the same as in Efron and Morris (1976). Denote by $S = HLH^\top$ the eigenvalue decomposition of $S$, where $L \in \mathbb{D}_\nu^{(\ge 0)}$ and $H \in \mathbb{V}_{p,\nu}$, and then note that
$$\widehat{\Sigma}^{-1}_{EM} = H\Phi^{EM}H^\top, \qquad \Phi^{EM} = a_0L^{-1} + \frac{(\nu - 1)(\nu + 2)}{\mathrm{tr}\,L}I_\nu.$$
Proposition 7.11 $\widehat{\Sigma}^{-1}_{EM}$ dominates $\widehat{\Sigma}^{-1}_{BS}$ relative to the $L_{EM}$-loss.

The proof of Proposition 7.11 can be provided by an unbiased risk estimate method and it is omitted.

For decision-theoretic estimation of the precision matrix with $n \ge p$, other procedures for improving the unbiased estimator can be found in Haff (1977, 1979a, b), Dey (1987) and Dey et al. (1990). An improving method via using information on means was considered by Sinha and Ghosh (1987). Eaton and Olkin (1987) and Krishnamoorthy and Gupta (1989) provided minimax estimators based on the Cholesky decomposition of the Wishart matrix $S$ relative to the Stein-type loss
$$L_P(\widehat{\Sigma}^{-1}, \Sigma^{-1}) = \mathrm{tr}\,\widehat{\Sigma}^{-1}\Sigma - \log|\widehat{\Sigma}^{-1}\Sigma| - p.$$
See also Zhou et al. (2001) and Tsukuma (2014b) for related works on improved minimax estimation. Orthogonally invariant minimax estimators are obtained by Perron (1997) and Kubokawa (2005) for $p = 2$ and by Sheena (2003) for $p = 3$. However, for $p \ge 4$, orthogonally invariant minimax estimators are still not provided. The $p > n$ case is treated by Kubokawa and Srivastava (2008), who propose some improved estimators under quadratic-type losses.
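A short sketch of the Efron-Morris type precision estimator follows. It works through the eigenvalue decomposition of $S$, keeping the $\nu = n \wedge p$ positive eigenvalues, so that the same code covers both $n \ge p$ and $p > n$. The data matrix `Y` with $n$ rows and $p$ columns, and all function names, are hypothetical.

```python
import numpy as np

def efron_morris_type_precision(Y):
    """Return (EM type estimate, best scalar-multiple estimate a0 * S^+) for S = Y^T Y."""
    n, p = Y.shape
    nu = min(n, p)
    S = Y.T @ Y
    # S = H L H^T with the nu positive eigenvalues in descending order.
    eigval, eigvec = np.linalg.eigh(S)
    order = np.argsort(eigval)[::-1][:nu]
    ell, H = eigval[order], eigvec[:, order]
    a0 = abs(n - p) - 1
    best_scalar = (H * (a0 / ell)) @ H.T                   # a0 * S^+
    phi_em = a0 / ell + (nu - 1) * (nu + 2) / np.sum(ell)  # diagonal of Phi^{EM}; tr L = tr S
    return (H * phi_em) @ H.T, best_scalar

rng = np.random.default_rng(5)
Y_full = rng.standard_normal((12, 4))   # n >= p
Y_sing = rng.standard_normal((4, 9))    # p > n, singular S
em_full, bs_full = efron_morris_type_precision(Y_full)
em_sing, bs_sing = efron_morris_type_precision(Y_sing)
```

When $n \ge p$ the projection $SS^+$ is the identity, so the estimate reduces to $(n-p-1)S^{-1}$ plus a multiple of $I_p$; when $p > n$ the added term acts only on the column space of $S$.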
References
D.K. Dey, Improved estimation of a multinormal precision matrix. Stat. Probab. Lett. 6, 125–128 (1987)
D.K. Dey, M. Ghosh, C. Srinivasan, A new class of improved estimators of a multinormal precision matrix. Stat. Decisions 8, 141–151 (1990)
D.K. Dey, C. Srinivasan, Estimation of a covariance matrix under Stein’s loss. Ann. Stat. 13, 1581–1591 (1985)
M.L. Eaton, Some problems in covariance estimation. Technical Reports No. 49 (Department of Statistics, Stanford University, 1970)
M.L. Eaton, I. Olkin, Best equivariant estimators of a Cholesky decomposition. Ann. Stat. 15, 1639–1650 (1987)
B. Efron, C. Morris, Multivariate empirical Bayes and estimation of covariance matrices. Ann. Stat. 4, 22–32 (1976)
A.K. Gupta, S. Ofori-Nyarko, Improved minimax estimators of normal covariance and precision matrices. Statistics 26, 19–25 (1995)
L.R. Haff, Minimax estimators for a multinormal precision matrix. J. Multivar. Anal. 7, 374–385 (1977)
L.R. Haff, Estimation of the inverse covariance matrix: Random mixtures of the inverse Wishart matrix and the identity. Ann. Stat. 7, 1264–1276 (1979a)
L.R. Haff, An identity for the Wishart distribution with applications. J. Multivar. Anal. 9, 531–544 (1979b)
L.R. Haff, Empirical Bayes estimation of the multivariate normal covariance matrix. Ann. Stat. 8, 586–597 (1980)
L.R. Haff, The variational form of certain Bayes estimators. Ann. Stat. 19, 1163–1190 (1991)
W. James, C. Stein, Estimation with quadratic loss, in Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, ed. by J. Neyman (University of California Press, Berkeley, 1961), pp. 361–379
J. Kiefer, Invariance, minimax sequential estimation, and continuous time processes. Ann. Math. Stat. 28, 573–601 (1957)
Y. Konno, Shrinkage estimators for large covariance matrices in multivariate real and complex normal distributions under an invariant quadratic loss. J. Multivar. Anal. 100, 2237–2253 (2009)
K. Krishnamoorthy, A.K. Gupta, Improved minimax estimation of a normal precision matrix. Can. J. Stat. 17, 91–102 (1989)
T. Kubokawa, A revisit to estimating of the precision matrix of the Wishart distribution. J. Stat. Res. 39, 91–114 (2005)
T. Kubokawa, Y. Konno, Estimating the covariance matrix and the generalized variance under a symmetric loss. Ann. Inst. Stat. Math. 42, 331–343 (1990)
T. Kubokawa, C. Robert, A.K.Md.E. Saleh, Empirical Bayes estimation of the variance parameter of a normal distribution with unknown mean under an entropy loss. Sankhyā Ser. A 54, 402–410 (1992)
T. Kubokawa, M.S. Srivastava, Estimating the covariance matrix: a new approach. J. Multivar. Anal. 86, 28–47 (2003)
T. Kubokawa, M.S. Srivastava, Estimation of the precision matrix of a singular Wishart distribution and its application in high-dimensional data. J. Multivar. Anal. 99, 1906–1928 (2008)
T. Kubokawa, M.-T. Tsai, Estimation of covariance matrices in fixed and mixed effects linear models. J. Multivar. Anal. 97, 2242–2261 (2006)
O. Ledoit, M. Wolf, A well-conditioned estimator for large-dimensional covariance matrices. J. Multivar. Anal. 88, 365–411 (2004)
P.L. Leung, W.Y. Chan, Estimation of the scale matrix and its eigenvalues in the Wishart and the multivariate F distributions. Ann. Inst. Stat. Math. 50, 523–530 (1998)
P.L. Leung, F.Y. Ng, Improved estimation of a covariance matrix in an elliptically contoured matrix distribution. J. Multivar. Anal. 88, 131–137 (2004)
S.P. Lin, M.D. Perlman, A Monte Carlo comparison of four estimators for a covariance matrix, in Multivar. Anal. VI, ed. by P.R. Krishnaiah (North-Holland, Amsterdam, 1985), pp. 411–429
T. Ma, L. Jia, Y. Su, A new estimator of covariance matrix. J. Stat. Plan. Infer. 142, 529–536 (2012)
M.D. Perlman, Reduced mean square error estimation for several parameters. Sankhyā Ser. B 34, 89–92 (1972)
F. Perron, Equivariant estimators of the covariance matrix. Can. J. Stat. 18, 179–182 (1990)
F. Perron, Minimax estimators of a covariance matrix. J. Multivar. Anal. 43, 16–28 (1992)
F. Perron, On a conjecture of Krishnamoorthy and Gupta. J. Multivar. Anal. 62, 110–120 (1997)
T. Robertson, F.T. Wright, R.L. Dykstra, Order Restricted Statistical Inference (Wiley, New York, 1988)
A.L. Rukhin, B.K. Sinha, Decision-theoretic estimation of the product of gamma scales and generalized variance. Calcutta Stat. Assoc. Bull. 40, 257–265 (1990)
D. Sharma, K. Krishnamoorthy, Orthogonal equivariant minimax estimators of bivariate normal covariance matrix and precision matrix. Calcutta Stat. Assoc. Bull. 32, 23–46 (1983)
J.B. Selliah, Estimation and testing problems in a Wishart distribution. Technical Reports No. 10 (Department of Statistics, Stanford University, 1964)
Y. Sheena, On minimaxity of the normal precision matrix estimator of Krishnamoorthy and Gupta. Statistics 37, 387–399 (2003)
Y. Sheena, A. Takemura, Inadmissibility of non-order-preserving orthogonally invariant estimators of the covariance matrix in the case of Stein’s loss. J. Multivar. Anal. 41, 117–131 (1992)
R.B. Shorrock, J.V. Zidek, An improved estimator of the generalized variance. Ann. Stat. 4, 629–638 (1976)
B.K. Sinha, On improved estimators of the generalized variance. J. Multivar. Anal. 6, 617–626 (1976)
B.K. Sinha, M. Ghosh, Inadmissibility of the best equivariant estimators of the variance-covariance matrix, the precision matrix, and the generalized variance under entropy loss. Stat. Decisions 5, 201–227 (1987)
C. Stein, Some problems in multivariate analysis, Part I. Technical Reports No. 6 (Department of Statistics, Stanford University, 1956)
C. Stein, Inadmissibility of the usual estimator for the variance of a normal distribution with unknown mean. Ann. Inst. Stat. Math. 16, 155–160 (1964)
C. Stein, Estimation of a Covariance Matrix, Rietz Lecture, 39th Annual Meeting IMS (Atlanta, GA, 1975)
C. Stein, Lectures on the theory of estimation of many parameters, in Proceedings of Scientific Seminars of the Steklov Institute Studies in the Statistical Theory of Estimation, Part I, vol. 74, eds. by I.A. Ibragimov, M.S. Nikulin (Leningrad Division, 1977), pp. 4–65
W.E. Strawderman, Minimaxity. J. Am. Stat. Assoc. 95, 1364–1368 (2000)
D. Sun, X. Sun, Estimation of the multivariate normal precision and covariance matrices in a star-shape model. Ann. Inst. Stat. Math. 57, 455–484 (2005)
A. Takemura, An orthogonally invariant minimax estimator of the covariance matrix of a multivariate normal population. Tsukuba J. Math. 8, 367–376 (1984)
H. Tsukuma, Minimax covariance estimation using commutator subgroup of lower triangular matrices. J. Multivar. Anal. 124, 333–344 (2014a)
H. Tsukuma, Improvement on the best invariant estimators of the normal covariance and precision matrices via a lower triangular subgroup. J. Jpn. Stat. Soc. 44, 195–218 (2014b)
H. Tsukuma, Estimation of a high-dimensional covariance matrix with the Stein loss. J. Multivar. Anal. 148, 1–17 (2016a)
H. Tsukuma, Minimax estimation of a normal covariance matrix with the partial Iwasawa decomposition. J. Multivar. Anal. 145, 190–207 (2016b)
H. Tsukuma, T. Kubokawa, Minimaxity in estimation of restricted and non-restricted scale parameter matrices. Ann. Inst. Stat. Math. 67, 261–285 (2015)
H. Tsukuma, T. Kubokawa, Unified improvements in estimation of a normal covariance matrix in high and low dimensions. J. Multivar. Anal. 143, 233–248 (2016)
R. Yang, J.O. Berger, Estimation of a covariance matrix using the reference prior. Ann. Stat. 22, 1195–1211 (1994)
X. Zhou, X. Sun, J. Wang, Estimation of the multivariate normal precision matrix under the entropy loss. Ann. Inst. Stat. Math. 53, 760–768 (2001)
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2020 H. Tsukuma and T. Kubokawa, Shrinkage Estimation for Mean and Covariance Matrices, JSS Research Series in Statistics, https://doi.org/10.1007/978-981-15-1596-5