ICSA Book Series in Statistics Series Editor Ding-Geng Chen, College of Health Solutions, Arizona State University, Chapel Hill, NC, USA
The ICSA Book Series in Statistics showcases research from the International Chinese Statistical Association that has an international reach. It publishes books in statistical theory, applications, and statistical education. All books are associated with the ICSA or are authored by invited contributors. Books may be monographs, edited volumes, textbooks and proceedings.
Jiahua Chen
Statistical Inference Under Mixture Models
Jiahua Chen Department of Statistics University of British Columbia Vancouver, BC, Canada
ISSN 2199-0980 ISSN 2199-0999 (electronic) ICSA Book Series in Statistics ISBN 978-981-99-6139-9 ISBN 978-981-99-6141-2 (eBook) https://doi.org/10.1007/978-981-99-6141-2 Mathematics Subject Classification: 62F12, 62F03, 62E20 © Springer Nature Singapore Pte Ltd. 2023 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore Paper in this product is recyclable.
I would like to express my gratitude to Lily, Amy, and Andy for their unwavering support throughout the years. Your presence and support have meant a lot to me, and I am grateful to have you by my side. I would also like to dedicate a special mention to my late mother, who unfortunately passed away before witnessing many positive moments in my life.
Preface
After completing my Ph.D. in the design of experiments with Professor C. F. Jeff Wu, I was introduced to finite mixture models by Professor John D. Kalbfleisch. My initial goal was to study the large sample behavior of the maximum likelihood estimator for the mixing distribution in both general and finite mixture models. In 1993, I had the opportunity to attend the IMS research workshop in South Carolina, where I listened to lectures by Professor Bruce G. Lindsay and studied his lecture notes (Lindsay 1995). To gain a better understanding of the fundamentals, I taught myself from three classical books: Titterington et al. (1985), McLachlan and Peel (2004), and Everitt (2013), specifically their first editions. The renowned Berkeley conference series provided many intriguing results related to normal mixture models. It was discovered that the likelihood ratio statistic diverges to infinity even in a rather simple situation, and it converges to the supremum of a Gaussian process when the subpopulation parameter is restricted to a compact subspace (Hartigan 1985; Ghosh and Sen 1985). These findings sparked extensive discussions in the statistical community, as evidenced by publications such as Faraway (1993) and Chernoff and Lander (1995). The topic of the asymptotic properties of the likelihood ratio test remains highly relevant even today. We owe much of the popularity of finite mixture models to Pearson (1894), who applied a finite Gaussian mixture distribution of order 2 to a biometric dataset. His admirable computational effort enabled the estimation of the mixing distribution using the method of moments. With advancements in information technology, maximum likelihood estimation, commonly implemented through the Expectation-Maximization (EM) algorithm (Dempster et al. 1977), became the preferred estimation approach. The EM algorithm has a vast literature dedicated to it, even within the realm of finite mixture models (McLachlan and Krishnan 2007). The convergence result established by Wu (1983) is of paramount importance. Recent developments have further solidified the algorithm's popularity by studying its convergence rate (Xu and Jordan 1996; Balakrishnan et al. 2017). At the beginning of this book project, I was ambitious. However, the more I write, the more I realize how much I still have to learn. As a result, this book mainly
focuses on topics where I have conducted meaningful research or made significant contributions. The first chapter serves as a foundation for the subsequent discussions. I emphasize that a statistical model represents a distribution family, and in data analysis, point estimation aims to identify the "true" distribution within that family that generated the data. Some of the writings in this chapter are influenced by Lindsay (1995), such as the identifiability of finite binomial mixtures and the concept that any distribution can be represented as a mixture over a distribution family. Chapters 2–5 are dedicated to the consistent estimation of mixing distributions. When the mixing distribution is not parameterized, the parameter space of the mixture model has an infinite dimension. Nevertheless, the classical approach initiated by Wald (1949) remains effective in establishing the consistency of the nonparametric maximum likelihood estimator and, under certain conditions, of the maximum likelihood estimator under finite mixture models. For finite Gaussian and finite Gamma mixtures, their likelihood functions are unbounded unless the subpopulation parameter space is restricted to a compact space. We provide detailed accounts of special techniques to establish the consistency of the penalized maximum likelihood estimators and discuss cases where certain subpopulation parameters are "structural." Chapter 6 focuses on the geometric properties of the nonparametric maximum likelihood estimator of the mixing distribution and the associated numerical approach. One astonishing property of the nonparametric maximum likelihood estimator is that it has a finite number of support points, which is no more than the number of distinct observed values in the dataset. This chapter provides a detailed account of the results presented by Lindsay (1983a) and Lindsay (1983b). To understand finite mixture models, one cannot avoid learning about the EM algorithm. Chapter 7 presents results on the monotonicity of the likelihood values produced by EM iterations. Combined with the global convergence theorem, the algorithm is shown to converge to local maxima under certain general conditions (Wu 1983). The maximum likelihood estimator is well known for its consistency and asymptotic normality. However, in applications, the order of the finite mixture model is often unknown, and one can only specify a finite upper bound. In such cases, the mixture model becomes non-regular, and the maximum likelihood estimator loses these desirable properties. In fact, no approach can estimate the mixing distribution at a rate of n^{-1/2}, as is generally achieved under regular models, except in the case of the undesirable super-efficiency phenomenon. This finding by Chen (1995) adds another unusual property to the finite mixture model, in addition to those discovered by Hartigan (1985) and Ghosh and Sen (1985). The initially believed optimal convergence rate of n^{-1/4} was found to be too optimistic by Heinrich and Kahn (2018). Chapter 8 of this book provides further details on this topic. Most of the remaining chapters focus on hypothesis testing under finite mixture models. Testing hypotheses under finite mixture models remains a significant research problem in statistics. Before the development of viable likelihood ratio-based tests, the C-alpha test provided an elegant solution for the homogeneity null hypothesis, where the population distribution has a degenerate mixing distribution.
This approach does not require specific assumptions about the form of the mixing distribution under the alternative hypothesis. When the subpopulation distribution belongs to a special exponential family, this approach aligns with the test designed for over-dispersion. These materials are covered in Chap. 9. Likelihood ratio-based tests for the order of finite mixture models continually present surprising results. The likelihood ratio function, considered as a stochastic process, generally converges to a Gaussian process (Dacunha-Castelle and Gassiat 1999). The likelihood ratio test statistic is then defined as the supremum of this process. If the subpopulation parameter space is not compact, this supremum can be infinite. Characterizing the domain of the limiting Gaussian process can be challenging. Even if this problem is solved, implementing the resulting test for real data analysis is another challenge. Yet there is one notable practical result of the likelihood ratio test for binomial mixture models (Chernoff and Lander 1995). Besides the overall strength of the authors' research, they benefited from the naturally compact subpopulation parameter space and the natural upper bound of the likelihood function. Chapter 10 provides a detailed analysis of the likelihood ratio function and the limiting distribution of the likelihood ratio test, primarily for one-dimensional parameter subpopulations. Chapters 11 and 12 build upon Chap. 10 and introduce penalty functions applied to the likelihood function. These penalty functions partially restore the regularity of the likelihood function, pushing the fitted mixing distributions toward those whose support points align with the true mixing distribution. This leads to modified likelihood ratio tests. Chapters 13–15 adopt a different strategy by measuring the increment in likelihood through the EM algorithm after a few iterations from specific initial mixing distributions. By strategically selecting initial mixing distributions and designing the iteration scheme, it becomes easier to characterize the null limiting distributions for the likelihood increment. This approach allows for feasible numerical implementations for real data analysis. The book ends with a chapter on the order selection problem under finite mixture models. To test for the order of the finite mixture model, one must control the type I error so that it is below the nominal level in general and close to the nominal level when the true mixture is a typical null one, at least asymptotically. The default requirement of having the type I error equal the nominal level asymptotically is a huge obstacle, as seen in previous chapters. By order selection, one is free from such a burden. A widely accepted minimum requirement for order selection is consistency: the selected order equals the true order as the sample size n → ∞. Such a requirement, however, does little to screen out order selection approaches. This chapter provides some explanations of several order selection approaches without advocating any. I would like to express my deep gratitude to Professor John D. Kalbfleisch, who introduced me to the fascinating, challenging, and fruitful research area of mixture models. Many of my publications on mixture models were completed during my time as a post-doctoral fellow under his supervision. However, he often does not take credit for his contributions and is sometimes absent from the list of authors. I would also like to extend my heartfelt appreciation to my Ph.D. supervisor, Professor C.
F. Jeff Wu. He encouraged me to tackle significant research
problems and supported my exploration beyond his own research areas. He provided and still provides invaluable advice on research and career matters long after our formal supervision ended. A special thank you goes to Professor Hanfeng Chen. His positive feedback on my early work on finite mixture models led to a fruitful collaboration and the development of modified likelihood ratio tests. When the idea of the EM-test was conceived, I was thrilled to have admitted a new Ph.D. student, who is now Professor Pengfei Li. I entrusted him with many of the challenging technical details, and he completed numerous tasks with remarkable speed and accuracy. He quickly became the driving force in our several co-authored papers on mixture models. I would also like to acknowledge Edward Susko, Yuejiao Fu, Abbas Khalili, Qiong Zhang, and other Ph.D. students who researched mixture models under my supervision or co-supervision during their Ph.D. studies. Their contributions and dedication have greatly enriched our understanding of this research area. Last but certainly not least, I am grateful to the Department of Statistics and Actuarial Science at the University of Waterloo and the Department of Statistics at the University of British Columbia. These institutions have provided a stimulating research environment and have brought together wonderful colleagues who have supported and inspired me throughout my career. Vancouver, BC, Canada July 2023
Jiahua Chen
Contents

1  Introduction to Mixture Models
   1.1  Mixture Model
   1.2  Missing Data Structure
   1.3  Identifiability
   1.4  Identifiability of Some Commonly Used Mixture Models
        1.4.1  Poisson Mixture Model
        1.4.2  Negative Binomial Distribution
        1.4.3  Finite Binomial Mixtures
        1.4.4  Normal/Gaussian Mixture in Mean, in Variance, and in General
        1.4.5  Finite Normal/Gaussian Mixture
        1.4.6  Gamma Mixture
        1.4.7  Beta Mixture
   1.5  Connections Between Mixture Models
   1.6  Over-Dispersion

2  Non-Parametric MLE and Its Consistency
   2.1  Non-Parametric Mixture Model, Likelihood Function and the MLE
   2.2  Consistency of Non-Parametric MLE
        2.2.1  Distance and Compactification
        2.2.2  Expand the Mixture Model Space
        2.2.3  Jensen's Inequality
        2.2.4  Consistency Proof of Kiefer and Wolfowitz (1956)
        2.2.5  Consistency Proof of Pfanzagl (1988)
   2.3  Enhanced Jensen's Inequality and Other Technicalities
   2.4  Condition C20.2 and Other Technicalities
        2.4.1  Summary

3  Maximum Likelihood Estimation Under Finite Mixture Models
   3.1  Introduction
   3.2  Generic Consistency of MLE Under Finite Mixture Models
   3.3  Redner's Consistency Result
   3.4  Examples

4  Estimation Under Finite Normal Mixture Models
   4.1  Finite Normal Mixture with Equal Variance
   4.2  Finite Normal Mixture Model with Unequal Variances
        4.2.1  Unbounded Likelihood Function and Inconsistent MLE
        4.2.2  Penalized Likelihood Function
        4.2.3  Technical Lemmas
        4.2.4  Selecting a Penalty Function
        4.2.5  Consistency of the pMLE, Step I
        4.2.6  Consistency of the pMLE, Step II
        4.2.7  Consistency of the pMLE, Step III
   4.3  Consistency When G* Has Only One Subpopulation
   4.4  Consistency of the pMLE: General Order
   4.5  Consistency of the pMLE Under Multivariate Finite Normal Mixture Models
        4.5.1  Some Remarks
   4.6  Asymptotic Normality of the Mixing Distribution Estimation

5  Consistent Estimation Under Finite Gamma Mixture
   5.1  Upper Bounds on f(x; r, θ)
   5.2  Consistency for Penalized MLE
        5.2.1  Likelihood Function on G1
        5.2.2  Likelihood Function on G2
        5.2.3  Consistency on G3 and Consistency for All
   5.3  Consistency of the Constrained MLE
   5.4  Example
        5.4.1  Some Simulation Results
   5.5  Consistency of the MLE When Either Shape or Scale Parameter Is Structural

6  Geometric Properties of Non-parametric MLE and Numerical Solutions
   6.1  Geometric Properties of the Non-parametric MLE
   6.2  Directional Derivative
   6.3  Numerical Solutions to the Non-parametric MLE
   6.4  Remarks
   6.5  Algorithm Convergence
   6.6  Illustration Through Poisson Mixture Model
        6.6.1  Experiment with VDM
        6.6.2  Experiment with VEM
        6.6.3  Experiment with ISDM

7  Finite Mixture MLE and EM Algorithm
   7.1  General Introduction
   7.2  EM Algorithm for Finite Mixture Models
   7.3  Data Examples
        7.3.1  Poisson Mixture
        7.3.2  Exponential Mixture
   7.4  Convergence of the EM Algorithm Under Finite Mixture Model
        7.4.1  Global Convergence Theorem
        7.4.2  Convergence of EM Algorithm Under Finite Mixture
   7.5  Discussions

8  Rate of Convergence
   8.1  Example
   8.2  Best Possible Rate of Convergence
   8.3  Strong Identifiability and Minimum Distance Estimator
   8.4  Other Results on Best Possible Rates
   8.5  Strongly Identifiable Distribution Families
   8.6  Impact of the Best Minimax Rate Is Not n^{-1/4}

9  Test of Homogeneity
   9.1  Test for Homogeneity
   9.2  Hartigan's Example
   9.3  Binomial Mixture Example
   9.4  C(α) Test
        9.4.1  The Generic C(α) Test
        9.4.2  C(α) Test for Homogeneity
        9.4.3  C(α) Statistic Under NEF-QVF
        9.4.4  Expressions of the C(α) Statistics for NEF-VEF Mixtures

10  Likelihood Ratio Test for Homogeneity
    10.1  Likelihood Ratio Test for Homogeneity: One Parameter Case
    10.2  Examples
    10.3  The Proof of Theorem 10.1
          10.3.1  An Expansion of θ̂ Under the Null Model
          10.3.2  Expansion of R1n: Preparations
          10.3.3  Expanding R1n
    10.4  Homogeneity Under Normal Mixture Model
          10.4.1  The Result for a Single Mean Parameter
    10.5  Two-Mean Parameter Mixtures: Tests for Homogeneity
          10.5.1  Large Sample Behavior of the MLE's
          10.5.2  Analysis of Rn(e; I)
          10.5.3  Analysis of Rn(e; II)
          10.5.4  Asymptotic Distribution of the LRT

11  Modified Likelihood Ratio Test
    11.1  Test of Homogeneity with Binomial Observations
    11.2  Modified Likelihood Ratio Test for Homogeneity with Multinomial Observations
    11.3  Test for Homogeneity for General Subpopulation Distribution
    11.4  Test for Homogeneity in the Presence of a Structural Parameter

12  Modified Likelihood Ratio Test for Higher Order
    12.1  Test for Higher Order
    12.2  A Modified Likelihood Ratio Test for m = 2
          12.2.1  Technical Outlines
          12.2.2  Regularity Conditions and Rigorous Proofs

13  EM-Test for Homogeneity
    13.1  Limitations of the Modified Likelihood Ratio Test
    13.2  EM-Test for Homogeneity
    13.3  The Asymptotic Properties
    13.4  Precision-Enhancing Measures

14  EM-Test for Higher Order
    14.1  Introduction
    14.2  The EM-Test Statistic
    14.3  The Limiting Distribution
          14.3.1  Outline of the Proof
          14.3.2  Conditions Underlying the Limiting Distribution of the EM-Test
    14.4  Tuning Parameters
    14.5  Data Examples

15  EM-Test for Univariate Finite Gaussian Mixture Models
    15.1  Introduction
    15.2  The Construction of the EM-Test
    15.3  Asymptotic Results
    15.4  Choice of Various Influential Factors
    15.5  Data Example

16  Order Selection of the Finite Mixture Models
    16.1  Order Selection
    16.2  Selection via Classical Information Criteria
    16.3  Variations of the Information Criterion
    16.4  Shrinkage-Based Approach

17  A Few Key Probability Theory Results
    17.1  Introduction
    17.2  Borel-Cantelli Lemma
    17.3  Random Variables and Stochastic Processes
    17.4  Uniform Strong Law of Large Numbers

References

Index
Chapter 1
Introduction to Mixture Models
1.1 Mixture Model

For the record, I would like to declare that a statistical model is a family of probability distributions. This fundamental principle will be fully reflected throughout this book. In the realm of statistics, a distribution describes the stochastic properties of a random variable or a group of random entities. Mathematically, a random variable is a Borel measurable real-valued function defined on a sample space Ω, equipped with a σ-algebra A and a probability measure Pr(·). When conducting statistical data analysis, we typically do not need to explicitly specify the sample space, σ-algebra, or provide detailed mathematical explanations of Pr(·). Instead, we consider random variables as the quantified outcomes of experiments with inherent uncertainty. Given a random variable X, its cumulative distribution function (c.d.f.) is defined as follows:

F(x) = Pr(X ≤ x)

for x ∈ R. If X is a random vector, the same definition applies but the inequality X ≤ x is understood component-wise. The c.d.f. of a random variable X often has a specific algebraic expression. Consider the situation where a family has n > 0 children, and X represents the number of boys. Let p ∈ (0, 1) be the probability of each child being a boy. Under simplifying assumptions, we suggest the following expression for the c.d.f. F(x):

F(x) = \sum_{0 \le k \le x} \binom{n}{k} p^k (1 - p)^{n-k}

for x ∈ R, where the summation over k is taken over integers. A random variable with this form of c.d.f. is known as a binomially distributed random variable or a
binomial random variable. The parameters in this case are the number of children n and the probability p. Each pair of values for n and p gives rise to a specific binomial distribution. The collection of all binomial distributions with varying parameter values for n and p forms the binomial distribution family. If n is fixed (non-random), we have a narrower binomial distribution family. Regardless of whether n is fixed, this distribution family is considered a Binomial Model. Binomial random variables take on integer values. For such discrete random variables, the probability mass function (p.m.f.) is given by

f(x) = Pr(X = x) = F(x) - F(x - 1)

where x = 0, ±1, ±2, .... For a binomial random variable X, its p.m.f. is given by

BIN(k; n, p) = Pr(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}

for k = 0, 1, ..., n, and its value is 0 when k is outside of the specified range. We denote this distribution as X ~ Bin(n, p). Let π be a value between 0 and 1 and consider the following p.m.f.:

f(k) = π BIN(k; n, p_1) + (1 - π) BIN(k; n, p_2)    (1.1)
for .k = 0, 1, . . . , n, where .p1 , p2 ∈ [0, 1] are parameters. It is evident that the p.m.f. .f (·) defined above is also a valid p.m.f.. Distributions whose p.m.f.’s have the algebraic structure of (1.1) form a new distribution family or model. Due to its form as a convex combination of two p.m.f.’s from a well-known distribution family, this new distribution is referred to as a binomial mixture distribution. Consequently, we have a binomial mixture model. There is no compelling justification why (1.1) is a binomial mixture while a binomial distribution itself is not a mixture of some other distributions. In general, a mixture model starts with a well-known parametric distribution family, which serves as the component, subpopulation, or kernel distribution family. To clarify the concept, let us assume that each subpopulation distribution in the mixture model is characterized by a probability density function (p.d.f.) or probability mass function (p.m.f.) denoted as .f (x; θ ), where .θ represents an element in a parameter space .Θ. Each specific value of .θ corresponds to a distinct distribution within this family. Unless otherwise specified, we will assume that .Θ is a subset of .Rk for some positive integer k. Throughout this book, the focus will mainly be on the case where .k = 1. In general, we always refer to .f (x; θ ) as a p.d.f. because a p.m.f. is also a density function albeit it is with respect to a counting measure. Typical component/subpopulation distributions used in mixture models include the normal, exponential, gamma, binomial, Poisson, geometric distributions, and many others. It is important to acknowledge the substantial literature on mixture models that involve non-parametric subpopulation distributions (Aragam and Yang 2023). These non-parametric mixture models have been extensively studied and offer valuable
insights into modeling complex data structures without relying on specific parametric assumptions. This book does not delve into the specific topic of mixture models with non-parametric subpopulation distributions. The focus of this book is primarily on mixture models with parametric component distributions. The discussions and methodologies presented in this book revolve around the parametric framework and the associated statistical techniques for parameter estimation, hypothesis testing, and model selection within that context. Let {f(x; θ) : θ ∈ Θ} represent a parametric model, and let G(θ) be a c.d.f. or, equivalently, a probability measure on the parameter space Θ. The mixture density function can be obtained as follows:

f(x; G) = \int_{\Theta} f(x; θ) dG(θ),    (1.2)

where the integration is understood in the Lebesgue–Stieltjes sense. In the case where G(·) is absolutely continuous with a density function g(θ), the integration simplifies to:

\int_{\Theta} f(x; θ) g(θ) dθ.
In many cases, we are interested in situations where G is a discrete distribution assigning positive probabilities π_1, ..., π_m to a finite number of θ-values, namely θ_1, ..., θ_m, such that \sum_{j=1}^{m} π_j = 1. In this scenario, Eq. (1.2) reduces to the familiar convex combination form:

f(x; G) = \sum_{j=1}^{m} π_j f(x; θ_j).    (1.3)
4
1 Introduction to Mixture Models
distributions with at most m support points is represented as .Gm . It is important to note that due to the possibility of equal support points, we have .Gm ⊂ Gm+1 . When all subpopulations are distinct and have positive mixing proportions, the exact order of the finite mixture distribution is m. In cases where necessary, we utilize .Gm − Gm−1 to denote the family of mixtures with this specific order. A mixture model can be defined as a distribution family characterized by the expression: { {f (x; G) =
f (x; θ )dG(θ ) : G ∈ G}.
.
Θ
This definition requires both the specification of the component density functions f (x; θ ) : θ ∈ Θ and the set of mixing distributions .G. The notation .F (x; θ ) is used to represent the c.d.f. corresponding to the component density function .f (x; θ ). Similarly, .F (x; G) denotes the c.d.f. of .f (x; G). It is important to note that the same symbols F and f are utilized for both component and mixture distributions, with the context of the entry after the semicolon helping to identify the specific distribution. The notation .G({θ }) is employed to denote the probability of the distribution G assigned to a specific .θ value, and likewise, .F ({x}) represents the probability of .X = x under the distribution F . The terms “mixture density,” “mixture distribution,” and “mixture model” are used interchangeably to refer to .f (x; G). We have now completed the introduction to the mixture model. The main objective of this book is to explore inference problems related to various aspects of the mixing distribution G. In the simplest scenario, let .X1 , . . . , Xn represent a random sample of size n drawn from a population that follows a mixture distribution. In other words, these random variables are independent and identically distributed (i.i.d.) with a density function .f (x; G) and an unknown mixing distribution G. The specific G that corresponds to the observed values .x1 , . . . , xn is commonly referred to as the “true” G and is denoted as .G∗ . Although we do not know the exact identity of .G∗ , our goal is to learn about .G∗ based on the data generated from .f (x; G∗ ). .
1.2 Missing Data Structure Parametric models commonly used in statistics often have strong scientific and theoretical foundations. One example is the binomial distribution, which is typically motivated by experiments involving independent and identical Bernoulli trials. In these trials, each trial has two possible outcomes, referred to as success and failure. The probability of success remains constant for all trials, and the outcomes of the trials are independent of each other. The total number of successes observed then follows a binomial distribution. For instance, let’s consider the production of lumber in a mill where large quantities of a specific wood product are produced. Among a bundle of n pieces of this wood product, the number of substandard pieces can be
1.2 Missing Data Structure
5
effectively modeled using a binomial distribution. If a single binomial distribution adequately represents this population, we refer to it as homogeneous. However, when wood products from various mills are sent to wholesalers or distribution centers and mixed together before being shipped to retailers, the stock in a retail store comprises bundles from multiple mills. Each bundle in the store still follows a binomial distribution for the number of substandard pieces. However, the probability of substandard quality for each bundle may differ depending on the producing mill, unless all mills have the same substandard rate. It becomes evident that a single binomial distribution is no longer suitable to describe the wood product population at the wholesale level. In this case, the population is considered heterogeneous. Indeed, it can be verified that the number of substandard pieces in a randomly selected bundle follows a binomial mixture distribution. In theory, if we are able to identify the producing mill of each bundle, the heterogeneous population can be explicitly decomposed into m homogeneous subpopulations. However, when the identities of the lumber bundles are unknown, the population, whether it refers to the physical lumber bundles or the number of substandard pieces, becomes a mixture. This example reveals that the mixture is the consequence of missing identity. Mathematically, let .Z be a random vector of length m such that .Pr(Z = ej ) = πj , where .ej is a vector with all entries being 0 except for the j -th entry which is 1. Assuming that when .Z = ej , the distribution of X belongs to a parametric family .f (x; θ ) : θ ∈ Θ with a component parameter value of .θj , the density function of X can be expressed as: f (x; G) =
m E
.
πj f (x; θj )
j =1
where G representsEthe distribution that assigns probability .πj to .θj . Alternatively, we can write .G = m j =1 πj {θj } to denote this distribution. The distribution of X is a mixture because Z is missing. Otherwise, the complete pair .(X, Z) has a joint bivariate density function. The distribution of X is a mixture because the variable .Z is missing. If we had the complete pair .(X, Z), then we would have a joint bivariate density function given by: m | | .
{πj f (x; θj )}zj ,
j =1
where .z represents the realized value of .Z. The notation .zj refers to the j -th element of .z, and we take advantage of the fact that .a 0 = 1 for any .a > 0. It’s important to note that the above function is a density with respect to some .σ -finite measure. Viewing a mixture model as the result of missing data not only reveals the intrinsic
6
1 Introduction to Mixture Models
nature of the model but also proves useful in solving numerical problems related to maximum likelihood estimation. We will explore this further later on.
1.3 Identifiability An identifiable statistical model is one in which any two distributions in the corresponding distribution family are distinct. Identifiability is a crucial requirement for meaningful statistical inference. If we have two density functions, .f (x; θ1 ) and .f (x; θ2 ), that are identical as functions of x despite .θ1 /= θ2 , it becomes impossible to determine from the collected data which population is more suitable for modeling the system. Let us now provide a formal definition of identifiability. Definition 1.1 Consider a distribution family .F (x; ω) : ω ∈ Ω, where the c.d.f.’s are parameterized by .ω with a parameter space .Ω. If, for any .ω1 , ω2 ∈ Ω, the condition F (x; ω1 ) = F (x; ω2 ) for all x
.
implies .ω1 = ω2 , then the model .{F (x; ω) : ω ∈ Ω} is identifiable. It’s worth noting that an absolutely continuous distribution can have two density functions that differ only on a set of measure zero. Therefore, it is most convenient to discuss identifiability in terms of c.d.f.’s. We were somewhat ambiguous about “all x” in the definition above. As we primarily work with random vectors, “all x” means any .x ∈ Rk for some k in this book unless otherwise specified. We use .Ω in the parameter space definition because it does not have to be a subset of Euclidean space, nor exclusively a collection of mixing distributions. The identifiability of a model depends on the characteristics of .Ω. Non-identifiability arises when a particular distribution appears in the model multiple times with different parameter values. From the perspective of .Ω, the non-identifiability problem is resolved if .Ω contains only one parameter value for each distinct distribution in the model. Identifiability is typically not an issue for commonly used parametric models. Two binomial distributions are considered distinct if they have different parameter vectors .(n, p). However, two seemingly different mixture distributions can be identical. Let us illustrate this with a simple example. Suppose there are four lumber-producing mills: .M1 , M2 , M3 , and .M4 , with deficiency rates .θ1 = 0.1, θ2 = 0.7, θ3 = 0.25, and .θ4 = 1. Let each lumber bundle contain two pieces. Retailer .R1 receives stock from a 50–50 split between mills .M1 and .M2 , while retailer .R2 receives stock from an 80–20 split between mills .M3 and .M4 . Let .Xi denote the number of substandard pieces in a randomly selected bundle from retailer .Ri , where .i = 1, 2. The question now is, what are the distributions of .X1 and .X2 ? The probabilities can be computed as follows: For .X1 :
1.3 Identifiability
.
7
Pr(X1 = 0) = 0.5 × {(1 − 0.1)2 + (1 − 0.7)2 } = 0.45; Pr(X1 = 1) = 2 × 0.5 × (0.1 × 0.9 + 0.7 × 0.3) = 0.3; Pr(X1 = 2) = 0.5 × (0.12 + 0.72 ) = 0.25;
and for .X2 : .
Pr(X2 = 0) = 0.2 × (1 − 1)2 + 0.8 × (1 − 0.25)2 } = 0.45; Pr(X2 = 1) = 2{0.2 × 1 × (1 − 1) + 0.8 × 0.25 × 0.75)} = 0.3; Pr(X2 = 2) = 0.2 × 12 + 0.8 × 0.252 ) = 0.25.
The calculations above reveal that .X1 and .X2 have the same distribution. At the same time, both .X1 and .X2 have a finite binomial mixture distribution. Let G1 (θ ) = 0.51(0.1 ≤ θ ) + 0.51(0.7 ≤ θ ) = 0.5{0.1} + 0.5{0.7};
.
G2 (θ ) = 0.21(0.25 ≤ θ ) + 0.81(1 ≤ θ ) = 0.2{0.25} + 0.8{1} where .1(·) denotes the indicator function, which equals 1 when the statement in brackets is true and 0 otherwise. From the definitions of these two mixing distributions, we have .X1 ∼ f (x; G1 ) and .X2 ∼ f (x; G2 ), where the component density is binomial with parameters .n = 2 and .p = θ , or .f (x; θ ) = BIN(x; 2, θ ). This example illustrates that the binomial mixture model is not identifiable in general since .G1 /= G2 . Because .X1 and .X2 are identically distributed, no matter how many independent realizations of .X1 are observed, it is impossible to determine whether we are observing a sample from the binomial mixture with mixing distribution .G1 or with mixing distribution .G2 , unless we know the identities of the mills. In this example, we have considered the mixing distribution G as a parameter. When G is discrete with m support points, it can be presented as a real-valued column vector: G = (π1 , π2 , . . . , πm−1 , πm ; θ1 , θ2 , . . . , θm )T .
.
(1.4)
Because .πm can be derived from the other mixing proportions, it can be omitted from this presentation. However, this vector presentation leads to a trivial nonidentifiability in finite mixture models. Let ˜ = (πm , πm−1 , . . . , π1 , θm ; θm−1 , . . . , θ1 )T . G
.
˜ = f (x; G) for all x. ˜ /= G, but as functions, .f (x; G) In terms of vectors, .G Therefore, the finite mixture model with any subpopulation density .f (x; θ ) is not identifiable from this perspective.
8
1 Introduction to Mixture Models
To fit this example to Definition 1.1, if the parameter space .Ω contains all vectors in the form of (1.3), it allows each distribution to appear .m! times in this model, resulting in non-identifiability caused by permutation of the mixing components/subpopulations. This type of non-identifiability is not intrinsic. If .Θ ⊂ R, we can restore identifiability by requiring .θ1 < θ2 < · · · < θm for every vector contained in .Ω in the context of a finite mixture model. However, this tactic does not completely resolve the problem within the Bayesian methodology. The non-identifiability caused by permutation is eliminated when G is regarded as a distribution rather than a vector. This treatment is most convenient for most discussions in this book, and we will adopt it as our standard. For mathematical rigor, we provide the following definition tailored for mixture models: Definition 1.2 Let .F (x; G) : G ∈ G be a mixture model represented by their c.d.f.’s parameterized by the mixing distribution G in space .G. If for any .G1 , G2 ∈ G, .F (x; G1 ) = F (x; G2 ) for all x implies .G1 = G2 , the mixture model is identifiable. Let .F (x; G) : G ∈ Gm be a finite mixture model of order m represented by their c.d.f.’s parameterized by the mixing distribution G in space .Gm . If for any .G1 , G2 ∈ Gm , .F (x; G1 ) = F (x; G2 ) for all x implies .G1 = G2 , the finite mixture model of order m is identifiable.
1.4 Identifiability of Some Commonly Used Mixture Models Fortunately, the non-identifiable binomial mixture model is an exception rather than the norm. However, it highlights the importance of demonstrating identifiability for many mixture models to ensure the relevance of the entire theory. In the following sections, we will demonstrate the identifiability of commonly employed finite mixture models.
1.4.1 Poisson Mixture Model The component/subpopulation distribution of the Poisson mixture model is specified by the p.m.f.: f (x; θ ) =
.
θx exp(−θ ) for x = 0, 1, . . . x!
where the component parameter space is .Θ = R+ , representing all positive real values. Let X be a random variable with a finite Poisson mixture distribution with a mixing distribution G. Denote the moment generating function of G as .MG (·), which is well-defined when G has a finite number of support points. The momentgenerating function of X is given by:
1.4 Identifiability of Some Commonly Used Mixture Models
MX (t) =
.
{ ∞ E {exp(tx)
∞
f (x; θ )dG(θ )} 0
x=0
{
∞ ∞E
{exp(tx) ×
= 0
{ =
9
x=0 ∞
θx exp(−θ )}dG(θ ) x!
exp{θ (exp(t) − 1)}dG(θ )
0
= MG (exp(t) − 1). Now, let .f (x; G1 ) and .f (x; G2 ) be the p.m.f.’s of two finite Poisson mixtures. If they are identical, we must have: MG1 (exp(t) − 1) = MG2 (exp(t) − 1) for all t ∈ R.
.
When two distributions have the same moment generating function, they must be the same distribution. Hence, the above inequation implies .G1 = G2 . This result proves the identifiability of the finite Poisson mixture model. The same proof is applicable when the space of the mixing distribution is extended to all G whose moment generating function is finite in a neighborhood of .t = 0. Completely removing any conditions on the moment-generating function seems straightforward by employing characteristic functions instead of the momentgenerating function. However, the answer to this remains uncertain, and I leave this topic open for further exploration.
1.4.2 Negative Binomial Distribution The negative binomial distribution is defined by its p.m.f. ( ) m m .f (x; θ ) = θ (1 − θ )x x where m and .θ are parameters, and the range of the distribution is over .x = 0, 1, . . .. We are interested in the mixture with respect to .θ , which ranges from 0 to 1. Suppose we have two mixing distributions, .G1 and .G2 , such that their respective p.m.f.’s are equal for all x: f (x; G1 ) = f (x; G2 ).
.
This equality implies that the integrals of .θ m (1 − θ )x with respect to .G1 and .G2 are equal for all x:
10
1 Introduction to Mixture Models
{
\int θ^m (1 − θ)^x dG_1 = \int θ^m (1 − θ)^x dG_2.
1.4.3 Finite Binomial Mixtures Let’s discuss the identifiability of the binomial mixture model, which is the simplest model that is not identifiable in general, although its finite mixture can be identifiable under certain conditions. Lindsay (1995) provides a detailed discussion of these conditions for identifiability. The following discussion is a simplified overview. We denote the component p.m.f. of a binomial mixture model as: ( ) n k θ (1 − θ )n−k , .f (k; n, θ ) = k where k ranges from 0 to n, and n and .θ are parameters. For a fixed n, the p.m.f. of a binomial mixture model is given by: ( ){ 1 n .f (k; n, G) = θ k (1 − θ )n−k dG(θ ) k 0 where k ranges from 0 to n, and G is a mixing distribution over .θ in the interval .[0, 1]. Suppose we have two mixing distributions .G1 and .G2 such that .f (k; n, G1 ) = f (k; n, G2 ) for all k. This implies the following moment identity: {
1
.
0
{ θ dG1 (θ ) = k
1
θ k dG2 (θ )
(1.5)
0
where k ranges from 0 to n. Note that k does not extend beyond n because n is the highest power of .θ in the integration of .f (k; n, G). It is worth noting that it is possible to find two distinct distributions on the unit interval .[0, 1] that have the same
1.4 Identifiability of Some Commonly Used Mixture Models
11
first n moments. This fact highlights the general non-identifiability of the binomial mixture model. We can easily find two distinct distributions on the unit interval .[0, 1] that have the same first n moments. This fact reminds us again of the general nonidentifiability of the binomial mixture model. However, if we assume that .G1 and .G2 are discrete distributions with the combined set of distinct support points being .{θ1 , θ2 , . . . , θN }. Let .xi = G1 ({θi }) − G2 ({θi }) for .i = 0, 1, . . . , N. We can write the moment identity (1.5) into matrix form ⎡
1 ⎢ θ1 ⎢ ⎢ 2 . ⎢ θ1 ⎢. ⎣ ..
1 θ2 θ22 .. .
... ... ... .. .
1 θN θN2 .. .
θ1n θ2n . . . θNn
⎤
⎡ ⎤ x0 ⎥ ⎥ ⎢ x1 ⎥ ⎥ ⎢ ⎥ ⎥ × ⎢ . ⎥ = 0. ⎥ ⎣ .. ⎦ ⎦ xN
This equation utilizes a Vandermonde matrix, with each column containing geometric progressions. The Vandermonde matrix has full column rank when .N ≤ n+1 and .θi are distinct. In this case, the moment identity implies .xi = 0 for .i = 0, 1, . . . , N . Thus, .G1 and .G2 must be identical if their combined order is no more than .n + 1 when the moment identities are satisfied. From another perspective, a finite binomial mixture model is identifiable when the order of the mixture model is no more than .(n + 1)/2, where n is the number of trials in the component binomial distribution. It is important to note the technical difference between proving the identifiability for the binomial mixture and the negative binomial mixture: in the negative binomial mixture case, the moments of mixing distributions are identical for all positive integer orders, while in the binomial mixture case, the moments of mixing distributions are identical only up to n.
1.4.4 Normal/Gaussian Mixture in Mean, in Variance, and in General The normal (or Gaussian) mixture is undoubtedly the most commonly used model in applications. Since the terms “Normal distribution” and “Gaussian distribution” are often used interchangeably, we will follow the same convention throughout this book. The first application example of a finite normal mixture of order two can be found in Pearson’s paper Pearson (1894), where it was fitted to a biometric dataset provided by Professor Weldon. This paper played a significant role in popularizing mixture models. Due to the central importance of the normal distribution, it requires special notation. We denote the p.d.f. of the normal distribution with mean .θ and standard deviation .σ as:
12
1 Introduction to Mixture Models
{ (x − θ )2 } 1 exp − φ(x; θ, σ ) = √ 2σ 2 2Π σ
.
In the above expression, .Π represents the mathematical constant, not the mixing proportions typically referred to in this book. The c.d.f. of the normal distribution will be denoted as .Φ(x; θ, σ ). For convenience, when .σ is equal to 1, we simplify .φ(x; θ, σ ) to .φ(x; θ ). Further reduction of notation occurs when .θ = 0, resulting in .φ(x). The same conventions apply to the c.d.f.. In a general context, the normal distribution family is a location-scale family, with .θ and .σ as the location and scale parameters, respectively. We begin by considering the simpler case of the normal mixture in its location parameter, also known as the mixture in the mean. In this case, the corresponding scale parameter is held fixed and considered a structural parameter. For convenience, we set .σ = 1. The p.d.f. of a normal mixture model in the mean is given by: { φ(x; G) =
∞
.
−∞
φ(x − θ )dG(θ ).
This expression reveals that the mixture density is the convolution of .φ and G. Let X and .ξ be independent random variables with density .φ and distribution G, respectively. Then, .φ(x; G) is the p.d.f. of .X + ξ. The characteristic function of .X + ξ factorizes as follows:

E{exp(it(X + ξ))} = E{exp(itX)} × E{exp(itξ)},
where .i^2 = −1. If two mixing distributions, .G1 and .G2 , satisfy .φ(x; G1 ) = φ(x; G2 ) for all .x ∈ R, the two mixture distributions have the same characteristic function. Since the characteristic function of X never vanishes, dividing it out shows that .G1 and .G2 must also have the same characteristic function. By a standard result in probability theory, we conclude that .G1 = G2 . Therefore, the normal mixture model in the location parameter is identifiable. A technical remark is that the above proof applies to all location mixture models whose component characteristic function is nowhere zero. A technicality arises if the characteristic function of X, .E{exp(itX)}, is zero over an interval of t: if .G1 and .G2 have equal characteristic function values except for t in this interval, the location mixture model is not identifiable. Distributions with characteristic functions that vanish over an interval are not commonly used in statistics, so there is no need for excessive concern.
Let us now consider the normal mixture in its scale parameter, specifically in the variance or equivalently in the standard deviation. We examine the case where the location parameter .θ = 0, the location parameter being regarded as a structural parameter in this case. The p.d.f. of a normal mixture model in scale is given by

\phi(x; 0, G) = \int_0^{\infty} \sigma^{-1} \phi(x/\sigma)\, dG(\sigma).
The characteristic function of this distribution is found to be:
\varphi(t; G) = \int_0^{\infty} \exp\Big( -\frac{1}{2}\sigma^2 t^2 \Big)\, dG(\sigma).
This can be viewed as a form of Laplace transformation of G. By a standard result in mathematical analysis, the Laplace transformation is unique. Therefore, this subclass of the normal mixture in the variance parameter is identifiable. This result can be easily generalized to situations where the common location parameter .θ takes other values. Next, let us consider the normal mixture model in both the mean and variance. It is important to note that this model is not identifiable. Let .Z1 and .Z2 be two independent standard normal random variables. The distribution of .Z1 + Z2 is a normal mixture, mixed in the location parameter. At the same time, the sum has a normal distribution with mean 0 and variance .σ^2 = 2. It is evident that its distribution is the same as that of .√2 Z1 . However, the distribution of .√2 Z1 is also a normal mixture in the scale parameter, with the mixing distribution being degenerate such that .G(θ = 0, σ = √2) = 1. Consequently, these two normal mixture distributions are identical, but their mixing distributions are different. Thus, we can conclude that the normal mixture model in either the location or scale parameter alone is identifiable, whereas the normal mixture model in both the location and scale parameters is not identifiable.
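The non-identifiability just described can be checked numerically. The following sketch (SciPy assumed available; the evaluation points are arbitrary) compares the location-mixture density .∫ φ(x − θ)dΦ(θ), computed by numerical integration, with the density of the degenerate scale mixture placing all mass at .σ = √2; the two agree to numerical precision.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def location_mixture_pdf(x):
    # phi(x; G) = int phi(x - theta) dG(theta) with the mixing distribution G = N(0, 1)
    integrand = lambda t: norm.pdf(x - t) * norm.pdf(t)
    val, _ = quad(integrand, -np.inf, np.inf)
    return val

xs = np.array([-2.0, -0.5, 0.0, 1.0, 2.5])
loc_mix = np.array([location_mixture_pdf(x) for x in xs])
scale_mix = norm.pdf(xs, loc=0.0, scale=np.sqrt(2.0))   # degenerate scale mixture
print(np.max(np.abs(loc_mix - scale_mix)))              # numerically zero
```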
1.4.5 Finite Normal/Gaussian Mixture The finite normal mixture model, which is commonly used in applications, is of great interest, and its identifiability is crucial for the validity of these applications. Suppose there are two mixing distributions, .G1 and .G2 , defined with respect to .(θ, σ), such that

φ(x; G1 ) = φ(x; G2 )   (1.6)
for all x. Let the distinct support points of .G1 and .G2 be denoted as (θ1 , σ1 ), (θ2 , σ2 ), . . . , (θm , σm ).
.
Without loss of generality, we assume that .σj is in increasing order. Additionally, if σj = σj +1 , we have .θj < θj +1 . For simplicity, we denote the ordering as
.
(θ1 , σ1 ) < (θ2 , σ2 ) < · · · < (θm , σm ).
.
Let .πij represent the probability of .Gi assigned to the support point .(θj , σj ), where .i = 1, 2 and .j = 1, 2, . . . , m. Note that many of the .πij values may be zero. If these two mixing distributions result in two identical finite normal mixture models, we must have

\sum_{j=1}^{m} (\pi_{1j} - \pi_{2j}) \phi(x; \theta_j, \sigma_j) = 0.   (1.7)
Since .(θj , σj ) < (θm , σm ) when .j < m, it can be easily verified that

\lim_{x \to \infty} \frac{\phi(x; \theta_j, \sigma_j)}{\phi(x; \theta_m, \sigma_m)} = 0

for all .1 ≤ j < m. Dividing both sides of (1.7) by .φ(x; θm , σm ), we obtain

\sum_{j=1}^{m-1} (\pi_{1j} - \pi_{2j}) \frac{\phi(x; \theta_j, \sigma_j)}{\phi(x; \theta_m, \sigma_m)} + (\pi_{1m} - \pi_{2m}) = 0.   (1.8)
Taking the limit as x tends to infinity, we find that π1m = π2m .
.
This implies that the mth term in the equation is zero. By removing this term from the equation and repeating the same argument, we deduce that π1,m−1 = π2,m−1 .
.
This reasoning can be repeated iteratively, leading to π1j = π2j
.
for .j = 1, 2, . . . , m. Consequently, the two mixing distributions are equal, i.e., G1 = G2 . In summary, the finite normal or Gaussian mixture model is identifiable.
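The key step of the argument is the domination of .φ(x; θm , σm ) over the other components in the right tail. A small numerical illustration (SciPy assumed available; the parameter values are arbitrary choices respecting the ordering) is given below.

```python
import numpy as np
from scipy.stats import norm

# (theta_j, sigma_j) < (theta_m, sigma_m): either sigma_j < sigma_m, or equal
# sigmas with theta_j < theta_m.  In both cases the density ratio tends to 0.
theta_j, sigma_j = 1.0, 1.0
theta_m, sigma_m = 0.0, 1.5

for x in [5.0, 10.0, 20.0, 40.0]:
    log_ratio = norm.logpdf(x, theta_j, sigma_j) - norm.logpdf(x, theta_m, sigma_m)
    print(x, log_ratio)
# The log-ratios decrease toward minus infinity, so the ratio itself tends to 0.
```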
1.4.6 Gamma Mixture We will now discuss the identifiability of the Gamma mixture model. The p.d.f. of the Gamma distribution in the shape parameter is given by

f(x; \theta) = \frac{x^{\theta-1} \exp(-x)}{\Gamma(\theta)}
where both the parameter space and the domain of x are positive real numbers. The Gamma function is defined as

\Gamma(\theta) = \int_0^{\infty} t^{\theta-1} \exp(-t)\, dt
with a positive lower bound. Let .G1 and .G2 be two mixing distributions such that f (x; G1 ) = f (x; G2 )
.
(1.9)
for all x. We define measures on the parameter space as

\mu_j(A) = \int_A \frac{1}{\Gamma(\theta)}\, dG_j(\theta)
for all bounded A, denoting these measures as .H1 and .H2 . Using the equality in (1.9), we can derive

\int \exp\{\theta \log(x)\}\, dH_1 = \int \exp\{\theta \log(x)\}\, dH_2

for all positive real numbers x. Since the range of .log(x) is the entire real line, we conclude that .H1 and .H2 have the same moment generating function. Therefore, they must be the same measures, implying that .G1 = G2 . Thus, this class of Gamma mixture models is identifiable.
Moving on to the two-parameter Gamma distribution, its p.d.f. is given by

f(x; \theta) = \frac{\lambda^{\alpha} x^{\alpha-1} \exp(-\lambda x)}{\Gamma(\alpha)}
where the parameter space is .Θ = R+ × R+ and the domain of x is .R+ . Here, we regard .θ as a two-dimensional vector containing the shape parameter .α and the scale parameter .λ. It is unclear whether the general two-parameter Gamma mixture model is identifiable. However, it is apparent that the finite two-parameter Gamma mixture model is identifiable. One can follow a similar proof to that of the identifiability of the finite normal mixture model.
1.4.7 Beta Mixture Moving on to the Beta mixture, a two-parameter Beta distribution has its p.d.f. given by

f(x; \theta) = \frac{x^{\alpha-1} (1-x)^{\beta-1}}{B(\alpha, \beta)}
where the range of x is .(0, 1), and .B(α, β) is the beta function, which serves as the normalization constant. The subpopulation parameter is .θ = (α, β). The Beta mixture provides a nontrivial example of a mixture model that is not identifiable. We note that .x^{α−1}(1 − x)^{β−1} is a polynomial of degree .α + β − 2 in x for positive integers .α and .β. It is easy to construct a linear combination of two such polynomials that matches another polynomial. For instance, let

f_1(x) = 12 x^2 (1-x), \qquad f_2(x) = 12 x (1-x)^2

be the density functions of Beta distributions with parameters .θ1 = (3, 2) and .θ2 = (2, 3), respectively. The density function of their 50–50 mixture is

f_3(x) = 0.5 f_1(x) + 0.5 f_2(x) = 6 x (1-x),
which is the density function of a Beta distribution with .θ3 = (2, 2). When either .α or .β is fixed, the resulting one-parameter beta mixture is identifiable.
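This non-identifiability is easy to verify numerically; the sketch below (SciPy assumed available) evaluates the three Beta densities on a grid and confirms that the 50–50 mixture of .f1 and .f2 coincides with the Beta(2, 2) density.

```python
import numpy as np
from scipy.stats import beta

x = np.linspace(0.01, 0.99, 9)
f1 = beta.pdf(x, 3, 2)          # 12 x^2 (1 - x)
f2 = beta.pdf(x, 2, 3)          # 12 x (1 - x)^2
f3 = beta.pdf(x, 2, 2)          # 6 x (1 - x)
print(np.max(np.abs(0.5 * f1 + 0.5 * f2 - f3)))   # numerically zero
```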
1.5 Connections Between Mixture Models Mixture models encompass a wide range of distributions and can be seen as convex combinations of distributions from other families. From this perspective, almost all distribution families can be considered as mixture models.
The Distribution of Any Integer-Valued Random Variable Is a Mixture One example is the distribution of any integer-valued random variable, denoted by X. Suppose X has a probability mass function (pmf) .h(x). We can define the family of distributions .{f (x; n) : n ∈ N}, where each distribution assigns probability 1 to the integer value n. In other words, .f (x; n) = δx,n , where .δ is the Kronecker delta function. It is clear that .h(x) can be expressed as a mixture of the distributions in this family:

h(x) = \sum_{n \in \mathbb{N}} \Pr(X = n) f(x; n).
Poisson Mixture and the Negative Binomial Distribution Another example is the connection between the Poisson mixture and the negative binomial distribution. Suppose given .θ , X is a random variable with a Poisson distribution and mean parameter .θ . At the same time, .θ is a Gamma distributed random variable with its p.d.f. given by:
g(\theta; \alpha, \beta) = \frac{\beta^{\alpha} \theta^{\alpha-1} \exp(-\beta\theta)}{\Gamma(\alpha)}
with shape parameter .α and scale parameter .1/β. The marginal distribution of X then has a p.m.f. given by

f(x; G) = \int_0^{\infty} \frac{\theta^x \exp(-\theta)}{x!} \cdot \frac{\beta^{\alpha} \theta^{\alpha-1} \exp(-\beta\theta)}{\Gamma(\alpha)}\, d\theta
        = \frac{\Gamma(\alpha + x)}{\Gamma(\alpha)\Gamma(x + 1)} \Big( \frac{\beta}{\beta+1} \Big)^{\alpha} \Big( \frac{1}{\beta+1} \Big)^{x}
for all non-negative integers x. This p.m.f. corresponds to a negative binomial distribution with parameters .α and .p = β/(β + 1). In the context of Bernoulli trials, a negative binomial random variable represents the number of failures before the .αth success, where the probability of success is .β/(β + 1). It is important to note that .α is not necessarily an integer; it only needs to be positive. Since .β/(β + 1) can take any value between 0 and 1 through an appropriate choice of .β, the Poisson mixture encompasses all possible negative binomial distributions. Conversely, if we include all Poisson mixture distributions and negative binomial distributions in a model, the resulting model would not be identifiable. The Poisson mixture distribution offers a suitable alternative when dealing with counting datasets that exhibit a much larger sample variance compared to the sample mean. In such cases, the traditional Poisson model may no longer be appropriate. The Poisson mixture model provides a well-motivated approach to address this issue. One can certainly handle the phenomenon of over-dispersion by directly using the negative binomial model. The negative binomial distribution allows for greater variability (over-dispersion) than the Poisson distribution without the need to explicitly incorporate a mixture structure. Over-dispersion refers to the situation where the observed variance exceeds the mean in a dataset. This phenomenon often arises in count data, where the assumption of equidispersion (variance equal to the mean) is violated. The negative binomial model is specifically designed to handle over-dispersion by introducing an additional parameter that accounts for the excess variability. In summary, both the Poisson mixture model and the negative binomial model offer viable approaches for addressing over-dispersion in counting data. The choice between these models depends on the specific characteristics of the dataset and the research question at hand.
t-Distribution and Normal Mixture The t-distribution can be understood as a normal mixture model. Let Z be a random variable with a standard normal distribution. Additionally, let X be a chi-square distributed random variable with n degrees of freedom, where n is a positive number
that need not be an integer. It is well-known that when Z and X are independent, the random variable

T = Z / \sqrt{X/n}

follows a t-distribution with n degrees of freedom. Consider the density function .g(x) of .1/\sqrt{X/n}. The density function of T can be expressed as the integral
f(t; G) = \int \phi(t; 0, \theta)\, g(\theta)\, d\theta
where .φ(t; 0, θ) represents the normal density with mean 0 and variance .θ^2 . By performing this calculation, we observe that every t-distribution can be viewed as a normal scale mixture. In this model, the mixing distribution is that of .1/\sqrt{X/n}; equivalently, the mixing variance .θ^2 follows a scaled inverse chi-square distribution. This perspective provides a connection between the t-distribution and normal mixture models, revealing the underlying mixture structure within t-distributions.
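The scale-mixture representation can be verified by simulation. In the sketch below (SciPy assumed available; the degrees of freedom are an arbitrary illustrative choice), the mixing variable .θ = 1/√(X/n) is simulated and the Monte Carlo average of .φ(t; 0, θ) is compared with the t density.

```python
import numpy as np
from scipy.stats import chi2, norm, t

rng = np.random.default_rng(0)
df = 5
m = 200_000

# theta = 1 / sqrt(X / df) with X ~ chi-square(df): the mixing random variable.
theta = 1.0 / np.sqrt(chi2.rvs(df, size=m, random_state=rng) / df)

ts = np.array([-3.0, -1.0, 0.0, 0.5, 2.0])
# Monte Carlo evaluation of the scale-mixture integral  int phi(t; 0, theta) g(theta) d theta.
mix_pdf = np.array([norm.pdf(s, scale=theta).mean() for s in ts])
print(np.round(mix_pdf, 4))
print(np.round(t.pdf(ts, df), 4))   # the two rows agree up to Monte Carlo error
```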
1.6 Over-Dispersion In the context of parametric distribution families, the means and variances of the distributions are typically functions of the chosen parameter, assuming they exist. For certain distribution families, there exists a one-to-one mapping between the population mean and the parameter. Consequently, the variance is also a well-defined function of the mean. For instance, under the binomial distribution with probability of success denoted as .θ , the population mean .μ is given by .μ = nθ , where n represents the number of trials. Similarly, in the Poisson and exponential distributions, the population mean .μ is often denoted as .θ , hence .μ = θ . In these families, the relationship between the variance and the mean can be expressed as .Var(X) = σ^2(μ), where .μ and .σ^2 represent the population mean and variance, respectively. In practical applications, we may collect a random sample that is presumed to follow a Poisson distribution. However, upon computing the sample mean .x̄ and sample variance, we might observe that .x̄ is significantly smaller than the sample variance. Logically, if the assumption of a Poisson distribution is valid, we should have .σ^2(μ) = μ. However, when the sample statistics satisfy .σ̂^2 ≫ μ̂, we may question the validity of the assumed Poisson distribution. In general, when the data are collected from a model with the assumed relationship .Var(X) = σ^2(μ), but the sample mean and sample variance satisfy .s_n^2 ≫ σ^2(x̄), we consider the data to be over-dispersed. In such cases, an over-dispersed model is desired to supplement the originally assumed model. Mixture models are often selected as the preferred approach for modeling over-dispersed data.
Let us use the Poisson mixture model as an example. This model retains most of the desired properties of the Poisson model and provides an appealing explanation for over-dispersion. Recall that under the Poisson model, we have .σ 2 (μ) = μ. Now, let .θ be a random variable with distribution G. Suppose that given .θ , X follows a Poisson distribution with mean .θ . In this scenario, the distribution of X is a Poisson mixture. We can observe that μ = E(X) = E{E(X|θ )} = E{θ }.
.
However, the variance of X can be expressed as

VAR(X) = E{VAR(X|θ)} + VAR{E(X|θ)} = E(θ) + VAR(θ) > μ = σ^2(μ)
unless .VAR(θ) = 0. Therefore, a Poisson mixture model satisfies .VAR(X) > σ^2(μ). Consequently, it can be considered an over-dispersed Poisson model. In practical applications, one might believe that a population is composed of several subpopulations, each of which can be appropriately modeled by a homogeneous distribution. The overall distribution can then be characterized as a mixture and displays the property of over-dispersion. It is important to note that a model is referred to as an over-dispersion model only when it exhibits an inflated variance compared to some standard model.
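As a numerical illustration of both the over-dispersion calculation above and the Poisson–Gamma connection of Sect. 1.5, the following sketch (SciPy assumed available; the Gamma shape and rate are arbitrary choices) simulates a Gamma-mixed Poisson sample, checks that the sample variance exceeds the sample mean, and compares the empirical p.m.f. with the corresponding negative binomial p.m.f.

```python
import numpy as np
from scipy.stats import gamma, poisson, nbinom

rng = np.random.default_rng(1)
alpha, beta_par = 2.0, 0.5          # Gamma shape and rate for the mixing distribution
n = 100_000

theta = gamma.rvs(alpha, scale=1.0 / beta_par, size=n, random_state=rng)
x = poisson.rvs(theta, random_state=rng)          # Poisson mixture sample

print("sample mean     :", x.mean())              # close to E(theta) = alpha / beta = 4
print("sample variance :", x.var())               # close to E(theta) + Var(theta) = 12 > mean

# The same mixture is a negative binomial with size alpha and p = beta / (beta + 1).
k = np.arange(6)
mix_pmf = np.array([(x == j).mean() for j in k])
nb_pmf = nbinom.pmf(k, alpha, beta_par / (beta_par + 1.0))
print(np.round(mix_pmf, 3))
print(np.round(nb_pmf, 3))
```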
Chapter 2
Non-Parametric MLE and Its Consistency
2.1 Non-Parametric Mixture Model, Likelihood Function and the MLE A mixture model is a collection of distributions based on a specific component/subpopulation distribution family, denoted as .f (x; θ ) : θ ∈ Θ in general. Its corresponding mixture model is denoted as .{f (x; G) : G ∈ G}, where .G is a collection of distributions on .Θ, as shown in Eq. (1.2). When we restrict G to a parametric family of distributions, we have a parametric mixture model. Similarly, by constraining G to have only a finite number of support points, we obtain a finite mixture model, as illustrated in Eq. (1.3). If we impose practically no restrictions on G, the mixture model is referred to as a non-parametric mixture. In this case, the subpopulation distributions can still be parametric, but the overall mixing is nonparametric. It is important to note that we use this terminology differently from Aragam and Yang (2023), where their version would be called a mixture of nonparametric distributions. In this chapter, we focus on the point estimation problem under the assumption of a nonparametric mixture model, denoted as .{f (x; G) : G ∈ G}. Throughout this chapter, we use .x1 , . . . , xn to represent the observed values from a set of n independent and identically distributed random variables, sampled from the nonparametric mixture model .{f (x; G) : G ∈ G}. For the sake of simplicity in presentation, we treat .x1 , . . . , xn as both a set of real values and a set of random variables, depending on the context at hand. Given an i.i.d. sample as specified, the likelihood function of G is Ln (G) =
\prod_{i=1}^{n} f(x_i; G).
I would like to mention in passing that the likelihood function is a function defined on the model space. Up to a constant of proportionality, its value represents the probability of the event that the corresponding set of random variables equals the observed values, under the specific distribution in this model. Therefore, it is a function of the parameter. The data are utilized to assess the likelihood of individual distributions. In cases where the model consists solely of continuous distributions, the joint density function at the observed values serves as an ideal approximation of this probability. The log-likelihood function of G is denoted as ln (G) =
\sum_{i=1}^{n} \log f(x_i; G).   (2.1)
Both .Ln (G) and .ln (G) are functions defined on .G. The non-parametric maximum likelihood estimator (MLE) .Ĝn of G is a distribution on .Θ that satisfies

l_n(Ĝ) = sup{l_n(G) : G ∈ G}.   (2.2)
It is important to note that .Ĝ is related to the sample in a measurable manner and is potentially one of many possible global maximum points of .ln (·). Additionally, .Ĝ is allowed to be a limit point of .Gm such that .ln (Gm ) → sup{ln (G) : G ∈ G} as .m → ∞. As the sample size n increases, we gain more information about the mixture distribution, leading to an expectation of improved precision in the MLE .Ĝ. This motivates the discussion of consistency. Consistency is a standard minimum requirement for well-motivated point estimators, and it is typically expected to hold. However, mixture models are known for challenging classical statistical conclusions. Therefore, in this chapter, we will delve into a comprehensive discussion on the consistency of the non-parametric MLE for G.
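To make the objects .ln (G) and the optimization in (2.2) concrete, the following small sketch (SciPy assumed available; the Poisson component family, the sample, and the two candidate mixing distributions are arbitrary illustrative choices) evaluates the log-likelihood of two discrete mixing distributions under a Poisson mixture.

```python
import numpy as np
from scipy.stats import poisson

def log_likelihood(x, support, weights):
    # l_n(G) = sum_i log f(x_i; G) for a discrete mixing distribution G placing
    # probability weights[j] on support[j]; Poisson component densities assumed.
    dens = poisson.pmf(x[:, None], support[None, :]) @ weights
    return np.sum(np.log(dens))

rng = np.random.default_rng(2)
x = np.concatenate([rng.poisson(1.0, 60), rng.poisson(6.0, 40)])

G1 = (np.array([1.0, 6.0]), np.array([0.6, 0.4]))    # close to the truth
G2 = (np.array([3.0]), np.array([1.0]))              # a homogeneous Poisson model
print(log_likelihood(x, *G1), log_likelihood(x, *G2))
# The first value is larger: G1 explains the sample better than G2.
```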
2.2 Consistency of Non-Parametric MLE Numerous papers in the literature have explored the consistency of the nonparametric MLE under mixture models. Here, we mention three representative and generic ones. Kiefer and Wolfowitz (1956) provides a conceptually straightforward yet complex and rigorous proof of the consistency of .Ĝ, closely resembling the proof of Wald (1949) for the MLE under parametric models. For finite mixture models, Redner (1981) presents a consistency proof for the MLE, assuming that .Θ is compact. That paper introduces the concept of quotient topology on the space of mixing distributions, which has broader applicability beyond just mixture models. Essentially, Redner (1981) highlights that when
multiple parameter values correspond to a single distribution under a model, these values should be considered as a single point, accomplished through the introduction of a quotient topological space. This approach overcomes the non-identifiability issue of finite mixture models due to the permutation of component distributions. Interestingly, Redner (1981) does not cite Kiefer and Wolfowitz (1956). The third paper, by Pfanzagl (1988), provides yet another very general consistency proof that encompasses mixture models as a special case. Although Pfanzagl (1988) cites Kiefer and Wolfowitz (1956) but not Redner (1981), its consistency result appears to be more general than that of Kiefer and Wolfowitz (1956). However, it does not imply the consistency of the nonparametric MLE under finite mixture models. Kiefer and Wolfowitz (1956) also offers an intermediate result that is useful in other contexts but not available in Pfanzagl (1988). At the same time, a similar intermediate result seems possible. Therefore, these two papers do not entirely overlap concerning mixture models. Most of the content of this chapter can be found in Chen (2017). In the next section, we provide some preparatory results and introduce highlevel conditions that make the proof of consistency easier to comprehend. These high-level conditions should be established directly on the subpopulation density function .f (x; θ ) and possibly on the mixing distribution G. This task is crucial in determining the true generality of the consistency results. For the sake of clarity, we will postpone the detailed discussion of these conditions to later sections.
2.2.1 Distance and Compactification Whether under parametric or non-parametric models, the consistency of a point estimator is related to distance. A point estimator usually does not equal the true value, but it falls within a small neighborhood of the true value with high probability. The concept of distance is essential to define this “small neighbourhood.” Our default choice is the Euclidean distance when the parameter is a real-valued vector. However, in the space of mixing distributions, there is no commonly accepted default distance. In recent literature, researchers often favor the Wasserstein distance (Zhang and Chen 2022; Ho and Nguyen 2019; Heinrich and Kahn 2018), which we will introduce in later chapters. In this chapter, we utilize the distance metric proposed by Kiefer and Wolfowitz (1956) on the distribution space. Recall that we confine ourselves to the situation where the component parameter space .Θ is .R^k for some integer k. For any two mixing distributions .G1 and .G2 on .Θ, we define their distance as follows:

D_{kw}(G_1, G_2) = \int_{\Theta} |G_1(\theta) - G_2(\theta)| \exp(-|\theta|)\, d\theta,   (2.3)
where .|θ| is the Euclidean norm, and the integration is a multivariate one when .θ is a vector. This definition was proposed by Kiefer and Wolfowitz (1956), and it possesses some desirable properties that will be demonstrated.
In probability theory, there are many modes of convergence, one of which is convergence in distribution. A sequence of distribution functions .Gn is said to converge to distribution function G in distribution if .Gn(θ) → G(θ) for every .θ at which .G(·) is continuous. We write .Gn →d G when .Gn converges to G in distribution, or we explicitly state it. Convergence in distribution can also be characterized by the distance (2.3). We state this result as a lemma.
Lemma 2.1 For any sequence .{Gn } and .G∗ in .G, .Gn →d G∗ if and only if .Dkw (Gn , G∗ ) → 0 as .n → ∞.
Proof Suppose .Gn is a sequence of mixing distributions such that .Gn (θ ) → G∗ (θ ) for all .θ at which .G∗ is continuous. As a monotone function, .G∗ has at most a zero-measure set of points at which it is not continuous (in terms of Lebesgue measure). Because .|Gn (θ ) − G∗ (θ )| ≤ 1 and trivially .∫ 1 · exp(−|θ |)dθ < ∞, the Dominated Convergence Theorem (Chow and Teicher 2003) is applicable: namely, the exchange of the limit in n and the integration is legitimate. This implies

\int \{G_n(\theta) - G^*(\theta)\} \exp(-|\theta|)\, d\theta \to \int 0 \cdot \exp(-|\theta|)\, d\theta = 0.
Note that in the above derivation, the limit of the integrand is 0 at almost all .θ . Thus,
we have shown that .Gn −→ G∗ implies .Dkw (Gn , G∗ ) → 0. Conversely, suppose there is a continuous point .θ0 at which .Gn (θ0 ) /→ G∗ (θ0 ). Then there must be a subsequence of .Gn (θ0 ), which has a limit not equal to ∗ .G (θ0 ). Without loss of generality, let the sequence be .Gn itself and denote .e = lim{Gn (θ0 ) − G∗ (θ0 )} and assume that .e > 0 (instead of merely .e /= 0). Since the distribution function .Gn (θ ) is an increasing function in every element of .θ , we must have Gn (θ ) − G∗ (θ0 ) ≥ e/2
.
when .0 ≤ θ − θ0 ≤ δ for some small enough .δ and for all large enough n. Denote .A = {θ : 0 ≤ θ − θ0 ≤ δ}, which has a nonzero measure. We hence have

D_{kw}(G_n, G^*) \ge (e/2) \int_A \exp\{-|\theta_0| - |\delta|\}\, d\theta > 0.
This contradicts the assumption .Dkw (Gn , G∗ ) → 0. The same proof is applicable when .e < 0. u n
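The distance (2.3) is straightforward to evaluate numerically for one-dimensional .θ. The sketch below (NumPy assumed available; the grid and the point-mass sequence are illustrative choices) approximates .Dkw by a trapezoidal sum and illustrates the "if" direction of Lemma 2.1 with point masses converging to a fixed point.

```python
import numpy as np

def dkw(G1, G2, grid=np.linspace(-30.0, 30.0, 600001)):
    # D_kw(G1, G2) = int |G1(t) - G2(t)| exp(-|t|) dt, evaluated on a fine grid.
    vals = np.abs(G1(grid) - G2(grid)) * np.exp(-np.abs(grid))
    return np.trapz(vals, grid)

point_mass = lambda a: (lambda t: 1.0 * (t >= a))   # c.d.f. of a point mass at a

G_star = point_mass(1.0)
for n in [1, 2, 5, 10, 50]:
    print(n, dkw(point_mass(1.0 + 1.0 / n), G_star))
# The distances decrease to 0 as the point mass moves to 1, matching convergence in distribution.
```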
The mode of convergence in distribution does not necessarily convey a lot of information about certain aspects of the distribution. For example, we may have .Gn →d G∗ while the first moment of .Gn does not exist for any n, even when .G∗ has all finite moments. Additionally, when both .Gn and .G∗ have finite support, convergence in distribution does not imply that there are corresponding convergences between their support points. While these observations are obvious, it is not uncommon for many to overlook these issues when these questions are not explicitly posed.
Furthermore, we present the non-parametric MLE .Ĝn in terms of the cumulative distribution function for each sample point .ω in the sample space. That is, each sample point leads to a sequence of c.d.f.s. The consistency is therefore only meaningful when nearly every .ω-determined sequence converges to .G∗ . To explicitly spell this out, we provide a definition.
Definition 2.1 Let .Ĝn be a sequence of estimators of the mixing distribution based on a random sample of size n from a distribution in the mixture model .{f (x; G) : G ∈ G} with the true mixing distribution being .G∗ . We say that .Ĝn is strongly consistent when .Dkw (Ĝn , G∗ ) → 0 almost surely as .n → ∞. Equivalently, .Ĝn is strongly consistent when .Pr{limn→∞ Dkw (Ĝn , G∗ ) = 0} = 1.
The distance .Dkw (·, ·) given in (2.3) is bounded. Because .|G1 (θ ) − G2 (θ )| ≤ 1 for all .θ , we have

D_{kw}(G_1, G_2) \le \int_{\Theta} \exp(-|\theta|)\, d\theta < \infty.   (2.4)

For one-dimensional .θ , this upper bound is 2. When .θ has dimension k, the upper bound is the volume of the k-dimensional unit ball. Because the exact size of the upper bound is not relevant in logical derivations, we will proceed as if the upper bound equals 1 in the sequel unless this quantity makes a difference in the particular context.
We do not provide a proof here but declare that under this distance, the space of mixing distributions is totally bounded: for every real number .e > 0, there exists a finite collection of open balls whose union contains the whole space. A bounded space is not necessarily totally bounded. A totally bounded and closed space is compact. However, the mixing distribution space is not necessarily closed because limiting points of a mixing distribution sequence may not be members of .G, the space of all mixing distributions.
Example 2.1 Let .Gn (θ ) = 1(n ≤ θ ) for .n = 1, 2, . . .. Note that .Gn is a sequence of univariate distributions defined on .R. Each of them degenerates because each assigns all the probability on a single value n. Let .G(θ ) ≡ 0. Then .Dkw (Gn , G) → 0 as .n → ∞. However, G does not belong to .G, the space of all mixing distributions.
Proof The mixing distribution space .G contains all distributions on the real number space .R. We regard the definition of .Dkw (·, ·) as applicable to all monotone increasing functions in this example. With this convention, we have

D_{kw}(G_n, G) = \int_{-\infty}^{\infty} 1(n \le \theta) \exp(-|\theta|)\, d\theta = \exp(-n) \to 0
as .n → ∞. This completes the proof.
u n
It may be helpful to remark that n in this example is merely an index rather than the sample size in statistical analysis. This example shows that .G(θ ) ≡ 0 is a limiting point of .Gn under the distance measure .Dkw . Hence, .G is not a closed space because function .G(θ ) ≡ 0 is not a member. It further implies that .G is not compact. The phenomenon presented in this example is that some probability mass is placed on n, which escapes .Θ as .n → ∞. The limit, when exists, is often similar to a probability measure except for the total mass being less than 1. Whether or not .G is a compact space is not a concern in statistical applicants. However, it is a very important property utilized in establishing the consistency of the non-parametric MLE. To overcome this technical obstacle, we may expand the space of .G in two steps. The first step is to expand component parameter space .Θ to include all its limiting ¯ For example, the parameter space of points. The resulting set is its closure .Θ. Poisson distribution is typically .Θ = (0, ∞), which is an open set. The closure of ¯ = [0, ∞). Note that .Θ¯ is a closed set though it is not compact. Similarly, for .Θ is .Θ normal distribution family with known and specific variance, it can be parameterized by its mean with parameter space being .R = (−∞, ∞) which is a closed set. This expansion step does not make it .[−∞, ∞]. We have mostly regarded a mixing distribution G as a c.d.f. on .Θ. For convenience, we now also regard it as a probability measure so that .G(A) is the probability assigned to subset .A ⊂ Θ. For any .0 ≤ ρ ≤ 1, we define a subdistribution (subprobability measure) .H (·) = ρG(·) so that .H (A) = ρG(A). As a function, .H (·) is practically a c.d.f. except for possibly .H (Θ) = ρ < 1. Let ¯ be the space containing .G and all subdistributions defined on .Θ. ¯ We extend the .G ¯ definition of .Dkw (·, ·) to .G without modifications. ¯ equipped with distance .Dkw (·, ·) is compact. Theorem 2.1 The new space .G Proof This result is somewhat well known in probability theory. Yet it is difficult to give a clean and noninvolved proof. Thus, we refer to the classical textbook Feller (1991). Theorem 1 on page 267 of the second volume states that every distribution sequence on Euclidean space .R d has a convergence subsequence to a limit, and the limit is either a proper distribution or an improper one. This result is also ¯ The mode applicable to distributions on any closed subset of .R d such as our .Θ.
of the convergence coincides with our .Dkw (·, ·). Applied to the problem in our context, every sequence of mixing distributions has a converging subsequence to ¯ is a closed a mixing distribution or a subdistribution. This property means that .G ¯ u n space. Because .G is also totally bounded, it is therefore also compact.
2.2.2 Expand the Mixture Model Space The expansion of .G in the last subsection is meaningful only if we can also extend the domain of the functional .f (x; G), namely the mixture density, accordingly. We first assume that the component distribution family .{f (x; θ ) : θ ∈ Θ} can ¯ be continuously extended to .{f (x; θ ) : θ ∈ Θ}. C20.0 (Continuity Assumption) It is possible to extend the definition of .f (x; θ ) ¯ we have that .θj → θ as .j → ∞, implies to .Θ¯ such that for any .θ and .θj in .Θ, .f (x; θj ) → f (x; θ ) for all x except for a 0-probability set, which does not depend on sequence .{θj }. Example 2.2 Consider the Poisson distribution family with f (x; θ ) =
\frac{\theta^x \exp(-\theta)}{x!}.
For any .θj → 0, we have .f (x; θj ) → 1(x = 0). Hence, we may define .f (x; 0) = 1(x = 0) and obtain the expanded Poisson distribution family whose .Θ¯ = [0, ∞). ¯ Since every .H ∈ G ¯ can be written as .ρG Next, we extend .f (x; G) to all .G ∈ G. for some .G ∈ G, it is natural to define { { .f (x; H ) = f (x; θ )dH (θ ) = ρ f (x; θ )dG(θ ) ¯ We may regard .f (x; H ) as a density function of some submixture for every .H ∈ G. distribution. It appears magical that we can force the compact issue go away through a distance definition and an expansion of .G. However, this ease of manipulation comes with a trade-off in a different context. A crucial technical condition for establishing the consistency of non-parametric MLE under a mixture model is some form of “uniform continuity” of .f (x; G) in G over its neighborhood. A ¯ is much larger than a neighborhood in .G, making the uniformity neighborhood in .G ¯ condition in .G more restrictive. This issue requires detailed explanation and clarification in subsequent subsections.
2.2.3 Jensen’s Inequality Now, let us take a side step to introduce the well-known Jensen’s inequality in probability theory (Chow and Teicher 2003). Lemma 2.2 (Jensen’s Inequality) Let X be a random variable such that .E|X| < ∞ and .ϕ(t) be a convex function in t. Then E{ϕ(X)} ≥ ϕ(E(X)).
.
As an example, consider .ϕ(t) = t^2 . Jensen's inequality then implies the well-known moment relationship .E(X^2 ) ≥ E^2 (X) whenever .E(X) is properly defined. The Jensen's inequality is essential for proving the consistency of MLEs under parametric models, as it leads to the implied inequality on Kullback–Leibler information. Let .Y = f (X; G)/f (X; G∗ ), where X has the mixture distribution with density function .f (x; G∗ ) with respect to some .σ -finite measure .μ(·), and let G be any member of .Ḡ. Let .E∗ denote the expectation with respect to .f (x; G∗ ) as usual. Clearly,
E (Y ) =
.
{
∗
∗
{f (x; G)/f (x; G )}f (x; G )dμ(x) =
{ f (x; G)dμ(x) ≤ 1.
The range of the first integration is over the support of .f (x; G∗ ). The second integration is over the intersection of the supports of .f (x; G∗ ) and .f (x; G). In addition, we have included the possibility that G is a subdistribution. Thus, the value of the integration on the right-hand side can be strictly smaller than 1. Applying the Jensen’s inequality to this Y and .ϕ(t) = − log(t), we find .
− E∗ log{f (X; G)/f (X; G∗ )} ≥ − log[E∗ {f (X; G)/f (X; G∗ )}] ≥ 0.
The inequality is strict unless .f (x; G) = f (x; G∗ ). When .E∗ | log f (X; G∗ )| < ∞, we get E∗ log f (X; G) ≤ E∗ log f (X; G∗ ).
.
In this inequality, we allow the left-hand side to take value .−∞. The Kullback–Leibler information/divergence between any two distributions is defined to be { .KL(f (x), g(x)) = log{f (x)/g(x)}f (x)dμ(x) (2.5) where .f (x) and .g(x) are density functions with respect to the .σ -finite measure μ(·). The specific form of the Jensen’s inequality implies .KL(f (x), g(x)) ≥ 0 with
.
2.2 Consistency of Non-Parametric MLE
29
equality holding when .f (x) = g(x) for all x except for a 0-measure set of x in terms of .μ(·). We do not immediately make use of the above inequality but use it to motivate a ¯ define high-level condition. For any .e > 0 and .G ∈ G, ¯ Dkw (G, ¯ :G ¯ ∈ G, ¯ G) < e}. f (x; G, e) = sup{f (x; G)
.
(2.6)
We assume that .f (x; G, e) is measurable in x. Otherwise, it is implied by the continuity assumption C20.0. Given a set of n i.i.d. observations from the mixture model, we define E .ln (G, e) = log f (xi ; G, e). Be aware that .f (·) and .ln (·) stand for different functions when their entries are altered. We believe that the inconvenience of minor confusion is more than compensated by the convenience of simpler notation. Lemma 2.3 Suppose .G ∈ G and .e > 0 are such that E∗ log{f (X; G∗ )/f (X; G, e)} > 0.
.
Then there exists a .ρ > 0 such that ln (G, e) < ln (G∗ ) − nρ
.
almost surely. Proof By the law of large numbers, we have a.s.
n−1 {ln (G∗ ) − ln (G, e)} −→ E∗ log{f (X; G∗ )/f (X; G, e)}.
.
Let ρ = (1/2){E∗ log f (X; G∗ ) − E∗ log f (X; G, e)} > 0,
.
with the positiveness guaranteed by the condition of the lemma. We clearly have a.s.
n−1 ln (G∗ ) − n−1 ln (G, e) −→ 2ρ.
.
Thus, for almost all sample points, when n is sufficiently large, n−1 ln (G∗ ) − n−1 ln (G, e) > ρ.
.
This proves the lemma.
u n
30
2 Non-Parametric MLE
Under some conditions, the log likelihood value at .G∗ is larger than the log likelihood value at any other G. The above lemma reinforces the conclusion by quantifying the difference. Its significance is felt in the proof of many consistency results.
2.2.4 Consistency Proof of Kiefer and Wolfowitz (1956) We now state one more high-level condition for the consistency theorem of Kiefer and Wolfowitz (1956). The first one is practically the Jensen’s inequality but applied to a broader class of functions including .f (X; G, e). For convenient reference, we name it as enhanced Jensen’s inequality. ¯ and .G /= G∗ , there exists C20.1 (Enhanced Jensen’s Inequality) For any .G ∈ G an .e > 0 such that E{log f (X; G∗ )/f (X; G, e)} > 0
.
(2.7)
where .E is computed with .f (x; G∗ ) being the distribution of X. We remark that this condition is built on C20.0, because we have to first have ¯ In addition, this condition implies the definition of .f (x; G) extended to all .G ∈ G. the identifiability of the mixture model: if .G /= G∗ but .f (x; G∗ ) = f (x; G), then C20.1 is violated. We will give examples of the mixture models that satisfy this condition later. At this moment, we give a simple proof of the strong consistency of the non-parametric MLE with the help of this condition. Theorem 2.2 Let .X1 , . . . , Xn be a set of i.i.d. random variables from a mixture ˆ an model .{f (x; G) : G ∈ G} with the true mixing distribution given by .G∗ . Let .G MLE of G as defined by (2.2). Under the conditions C20.0 and C20.1 (continuity and the enhanced Jensen’s inequality), we have, as .n → ∞, ˆ n , G∗ ) −→ 0. Dkw (G a.s.
.
Proof For any .δ > 0, define ¯ δ = {G : Dkw (G, G∗ ) ≥ δ}. G
.
¯ is compact, so is .G ¯ δ for any .δ > 0. Because .G δ ¯ For each .G ∈ G , let .eG be a corresponding .e-value satisfying condition C20.1. For notational simplicity, we will drop G from .eG . Let ¯ B(G, e) = {H : Dkw (G, H ) < e, H ∈ G}
.
2.2 Consistency of Non-Parametric MLE
31
¯ equipped with distance .Dkw (·, ·). We note that be the .e-neighbourhood of G in .G B(G, e) contains G for the least. Thus, we have
.
¯ δ }. ¯ δ ⊂ ∪{B(G, e) : G ∈ G G
.
¯ δ is covered by a set of open balls .B(G, e). By the That is, the compact space .G ¯ δ , say .G1 , . . . , GK , property of compact set, there exists a finite number of .G ∈ G such that ¯ δ ⊂ ∪K B(Gk , ek ) G k=1
.
where .ek = eGk is also a simplified notation. The finite coverage conclusion implies .
¯ δ } ≤ max {ln (Gk , ek )} ≤ ln (G∗ ) sup{ln (G) : G ∈ G 1≤k≤K
(2.8)
almost surely by Lemma 2.3. Hence, by the definition of the nonparametric MLE, ¯ δ or equivalently .Dkw (G ˆ n , G∗ ) < δ almost surely for any choice of ˆ n /∈ G we have .G .δ. Applying this conclusion to .δ = 1, 1/2, 1/3, . . . (countable), we find that except ˆ n , G∗ ) = 0. This proves the theorem. n u for a 0-probability event, .limn→∞ Dkw (G We skipped a minor technicality in the proof for the smooth flow of the logic. The proof indicates that the log-likelihood function attains its maximum within a ∗ .δ-neighborhood of .G almost surely for any .δ > 0. This neighborhood, however, contains both usual mixing distributions as well as some unwanted mixing subdistri¯ a mixing subdistribution? Suppose butions. Is it possible that the MLE defined on .G this subdistribution is .H = ρG. It is apparent that .ln (H ) ≤ ln (G) if .0 < ρ < 1. Thus, the non-parametric MLE must be a proper mixing distribution. Similarly, if ∗ .G has its probability all assigned to .Θ, then any distribution sufficiently close to ∗ ¯ .G must have all its probability all assigned to .Θ as compared to .Θ. The above proof can be tightened to obtain some useful intermediate results. Theorem 2.3 Under the same setting and conditions of Theorem 2.2 for any .δ > 0, there exists an .e > 0 such that ¯ δ } ≥ ne ˆ − sup{ln (G) : G ∈ G ln (G)
.
almost surely as .n → ∞. We do not prove this result here. Interested readers may easily work it out as an exercise. It shows that the likelihood value at the MLE exceeds the likelihood value outside of the neighborhood of the true mixing distribution by a big margin in ¯ δ is .G∗ dependent. Therefore, .e in the logarithm scale. Note that the definition of .G ∗ above theorem is .G dependent. One should also be aware of the i.i.d. background when interpreting this result. This property indicates that even if a statistic does not quite attain the maximum possible likelihood value, it may still estimate the true mixing distribution consistently.
32
2 Non-Parametric MLE
Corollary 2.1 Assume the same setting and conditions of Theorem 2.2. Suppose ˆ be the mixing distribution at which l˜n (G) − ln (G) = o(n) uniformly on .G. Let .G ∗ ˜ ˆ .ln (G) is maximized. Then .G is a consistent estimator of .G . .
In some statistical problems, the likelihood function itself is not the most ideal platform to be directly employed. Instead, penalty or regularization terms are added to .ln (G) before it is employed for the purpose of inference. Corollary 2.1 saves us the trouble to reprove the consistency of these corresponding estimators. Most researchers in the literature prefer to state the consistency result to a more general class of the MLE. It amounts to say that even if a mixing distribution is not the maximum point of the likelihood function, it is still a strongly consistent estimator if its log likelihood is within .o(n) distance of the supremum. ˆ is a Corollary 2.2 Under the same setting and conditions of Theorem 2.2, if .G member of .G such that ¯ = o(n), ˆ − sup{ln (G) : G ∈ G} ln (G)
.
ˆ n converges to .G∗ almost surely as .n → ∞. then .G
2.2.5 Consistency Proof of Pfanzagl (1988) The theory presented in Pfanzagl (1988) is more general than merely establishing the consistency of the non-parametric MLE under mixture models. However, the consistency of the non-parametric MLE under mixture models is its most concrete application example. We include his results in this subsection after they are completely adapted to the context of mixture models. Compared with Kiefer and Wolfowitz (1956), the consistency result of Pfanzagl (1988) is obtained under much less restrictive conditions. At the same time, whether or not his consistency result covers finite mixture models is a question. I tend to believe not, but cannot exclude a possible simple extension. There are three key technical innovations in Pfanzagl (1988). The following lemma is one of them. Lemma 2.4 Let .f (x) and .f ∗ (x) be density functions of any two distributions with respect to some .σ -finite measure .μ. For any .u ∈ (0, 1), we have E∗ log{1 + u[f ∗ (X)/f (X) − 1]} ≥ 0
.
with equality holds if and only if .f ∗ (x) = f (x) for almost all x with respect to .μ, where the expectation is with respect to .f ∗ distribution.
Proof Let .Y = {1 + u[f ∗ (X)/f (X) − 1]}−1 . We have Y =
.
f (X) f (X) . ≤ uf ∗ (X) uf ∗ (X) + (1 − u)f (X)
Clearly, we have .E∗ |Y | ≤ 1/u < ∞. Thus, Jensen’s inequality as presented in Lemma 2.2 is applicable to .ϕ(y) = − log y. The conclusion is then immediate. n u When .u → 1, this inequality reduces to the standard Jensen’s inequality if the order of the limit and the expectation can be exchanged. The strength of this lemma is that it does not put any restrictions on .f (·) when .0 < u < 1. This is a huge advantage to be seen. In the same fashion as in the last subsection, we use this inequality to motivate a seemingly similar high-level condition. Here, .f (x; G, e) denotes the same function as defined earlier and consider the mixture model with its ¯ domain extended to .G. ¯ and .G /= G∗ , there exists an C20.2 For some given .u ∈ (0, 1) and any .G ∈ G .e > 0 such that E∗ log{1 + u[f (X; G∗ )/f (X; G, e) − 1]} > 0
.
(2.9)
where .G∗ is the true mixing distribution. Theorem 2.4 Let .X1 , . . . , Xn be a set of i.i.d. random variables from a mixture ˆ be model .{f (x; G) : G ∈ G} with the true mixing distribution given by .G∗ . Let .G an MLE of G as defined by (2.2). Under Conditions C20.0 and C20.2, we have, as .n → ∞, a.s. ˆ n , G∗ ) −→ Dkw (G 0.
.
Proof Part of this proof is parallel to that of Theorem 2.2. For any .δ > 0, define ¯ ¯ δ = {G : Dkw (G, G∗ ) ≥ δ, G ∈ G}. G
.
¯ is compact, so is .G ¯ δ . Thus, condition C20.2 implies that there exist a Because .G finite number of .Gk , k = 1, . . . , K, with corresponding .ek such that ¯ δ ⊂ ∪K Bk G k=1
.
(2.10)
where .Bk = {G : Dkw (G, Gk ) < ek }, and E∗ log{1 + u[f (X; G∗ )/f (X; Gk , ek ) − 1]} > 0.
.
(2.11)
By the law of large numbers, (2.11) implies n−1
n E
.
log{1 + u[f (xi ; G∗ )/f (xi ; Gk , ek ) − 1]} > 0
i=1
almost surely for .k = 1, 2, . . . , K. Consequently, we have 0
ln (G)
.
¯ δ almost surely. Since the likelihood function at any G in .G ¯ δ is smaller for all .G ∈ G ∗ than the likelihood value at another mixing distribution .uG +(1−u)G, none of G in ¯ δ can possibly attain the supremum of .ln (G). This implies that the non-parametric .G MLE must reside in the .δ-neighbourhood of .G∗ almost surely. The arbitrariness of ˆ G∗ ) → 0 almost surely as .n → 0. This completes the proof. n .δ implies .Dkw (G, u In this proof, Pfanzagl (1988) tactically takes advantage of the linearity of the mixture model in mixing distributions. That is, uf (x; G∗ ) + (1 − u)f (x; G) = f (x; uG∗ + (1 − u)G)
.
which is the density function of another mixture distribution. We refer to this fact as the second key technical innovation. The third one is somewhat related. In the proof of both Wald (1949) and Kiefer and Wolfowitz (1956), the likelihood value .ln (G∗ ) is used as the anchor point to be compared with. They proved that .ln (G) < ln (G∗ ) for any G outside a neighborhood of .G∗ almost surely and uniformly. Pfanzagl (1988), however, replaces .ln (G∗ ) by .ln (uG∗ + (1 − u)G). This provides a lot more flexibility in establishing the consistency, whether it is in the context of mixture models or for other models.
2.3 Enhanced Jensen’s Inequality and Other Technicalities
Using .ln (uG∗ + (1 − u)G) reduces the generality of the consistency result. If we search for the MLE under a finite mixture model with at most m support points, this proof is no longer applicable because .uG∗ + (1 − u)G likely has more than m support points even if both .G∗ and G have only m support points. For this reason, we suspect that the result of Pfanzagl (1988) does not cover the consistency of MLE under parametric models. More specifically, it does not cover the result of Wald (1949). We cannot rule out the possibility to have the result of Pfanzagl (1988) extended to cover finite mixture models with ease. I wonder similarly on whether or not the conclusion can be extended to mixture model contains some structural parameters. This pitfall seems to disappear if one uses the inequality with .u = 0 as the case in the proof of Kiefer and Wolfowitz (1956). However, the proof with .u = 0 practically reduces this proof to that of Kiefer and Wolfowitz (1956).
2.3 Enhanced Jensen’s Inequality and Other Technicalities The proofs of the consistency of the non-parametric MLE presented seem rather simple, against our intuitions. One reason for the unexpected simplicity is that we have smoothed a thorny spot with high-level conditions. In presenting the proof of Kiefer and Wolfowitz (1956), we assume the enhanced Jensen’s inequality C20.1. However, the devil is in details. Often, a research paper may declare that a condition is needed in order to prove a result. More objectively, conditions are used to identify a territory so that the corresponding conclusion is applicable only within this territory. This angle allows the most objective judgment on whether or not a condition is restrictive, or it is worthwhile to prove an existing result under a set of seemingly weaker conditions. Recall that the negative binomial distribution is a special parametric Poisson mixture. In general, the data do not contain sufficient amount of information to decide whether or not a non-parametric Poisson mixture or a negative binomial model is most appropriate. Hence, we do not attempt to verify the conditions of a theorem based on data. A user should first decide whether the model employed is appropriate for the data being analyzed. After which, it is important to verify that this model satisfies the conditions in order to draw relevant inference conclusions. Hence, if a non-parametric Poisson mixture model is used in an application, one should verify the enhanced Jensen’s inequality before citing the consistency result on the non-parametric MLE of the mixing distribution. In this spirit, we go over a number of important issues. Denote .[x]+ = max{0, x} and .[x]− = max{0, −x} for any .x ∈ R. We first state an equivalent result. Lemma 2.5 Suppose the component distribution in the mixture model satisfies condition C20.0. In addition, assume that .lim||θ||→∞ f (x; θ ) = 0 for all but a measure zero set of x.
A necessary and sufficient condition for Condition C20.1, the enhanced Jensen’s ¯ and .G /= G∗ , there exists an .e > 0 such that inequality, is that for any .G ∈ G E∗ [log{f (X; G, e)/f (X; G∗ )}]+ < ∞.
.
(2.12)
Proof Assume C20.1 holds. That is, there is an .e > 0 such that E∗ [log{f (X; G, e)/f (X; G∗ )}] < 0.
.
The above inequality is more than sufficient for (2.12). Conversely, assume (2.12) holds for some .e > 0. Note that [log{f (X; G, δ)/f (X; G∗ )}]+ ≤ [log{f (X; G, e)/f (X; G∗ )}]+
.
for all .δ < e. Hence, .log{f (X; G, δ)/f (X; G∗ )} is uniformly integrable from above. At the same time, it decreases as .δ ↓ 0. Thus, the monotone convergence theorem is applicable, which implies .
lim E∗ [log{f (X; G, δ)/f (X; G∗ )}] = E∗ [lim log{f (X; G, δ)/f (X; G∗ )}]. δ↓0
δ↓0
Because E∗ [log{f (X; G)/f (X; G∗ )}] < 0
.
when .G /= G∗ , the final step is to show that .
lim log{f (x; G, δ)/f (x; G∗ )} = log{f (x; G)/f (x; G∗ )} δ↓0
for all most all x in terms of .f (x; G∗ ). We proceed to show this is true. Under condition C20.0 and the condition that .lim||θ||→∞ f (x; θ ) = 0 in this lemma, .f (x; θ ) must be a bounded continuous function on .θ ∈ Θ¯ for each given ¯ x. If .f (x; G, e) → f (x; G) is not true, then there must exists a sequence .Gk ∈ G, .Dkw (Gk , G) → 0 such that .f (x; Gk ) /→ f (x; G). This is impossible from the fact d
that .Dkw (Gk , G) → 0 implies .Gk −→ G as shown earlier. Let .Ek and .EG denote expectations in .θ with respect to .Gk and G. One property of the convergence in distribution is that Ek {ϕ(θ )} → EG {ϕ(θ )}
.
whenever .ϕ(·) is a bounded and continuous function. Applying this result to .f (x; ·) for each fixed x, we have f (x; Gk ) = Ek {f (x; θ )} → EG {f (x; θ )} = f (x; G).
.
This proves the lemma.
u n
The concept of uniformly integrable from above and the monotone convergence theorem can be found in (Chow and Teicher 2003, 1997, pp 94 and pp 95). In the above proof, we have reversed the monotonicity from increasing to decreasing and integrable from below to from above. The theorem has otherwise been straightforwardly applied. The continuity of .f (x; G) in G was directly assumed in Kiefer and Wolfowitz (1956). The above lemma provides a more concrete condition, which is easier to verify. Still, the seemingly simple condition (2.12) is not so straightforward to verify. In fact, it is possible to find obscure but nevertheless legitimate counter examples even for commonly used mixture models. Kiefer and Wolfowitz (1956) resort to placing restrictions on G to ensure the validity of (2.12). We now give several examples where C20.1 is satisfied in the same spirit. Consider the discrete component distribution for which we have .f (x; G, e) ≤ 1 ¯ and .e > 0. Thus, a sufficient condition for (2.12) is for any .G ∈ G E∗ log f (X; G∗ ) > −∞.
.
(2.13)
Indeed, a condition of this nature is routinely assumed in the consistency proofs. This condition is clearly satisfied by the binomial mixture distribution. The following example shows that the Poisson distribution satisfies this condition if we place a restriction on .G∗ . Example 2.3 If .G∗ (M) = 1 for some M, then the Poisson mixture model satisfies (2.12). Proof Based on the previous discussion, it suffices to show that (2.13) holds. Let θ0 be a support point of .G∗ such that .G∗ (θ0 + e) − G∗ (θ0 − e) ≥ δ > 0 for some sufficiently small positive constants .e and .δ. Hence, we must have
.
f (x; G∗ ) ≥
.
(θ0 − e)x c δ exp(−θ0 − e) ≥ x! x!
for some constant c depending on .θ0 , .δ and .e. Therefore, we have E∗ log f (X; G∗ ) ≥ log(c) − E∗ log(X!) { = log(c) − Eθ log(X!)dG∗ (θ ).
.
(2.14)
We note .Eθ log(X!) is finite and continuous in .θ for .θ ∈ [0, M]. Hence, { .
Eθ log(X!)dG∗ (θ ) < ∞.
Applying this result to (2.14), we get (2.13).
(2.15) u n
In applications, it may not be so controversial to believe the true mixing distribution has its support contained in a finite interval. Yet, placing such an unnatural condition to ensure consistency is at most tolerable. On the other hand, if we work with finite mixture of Poisson distributions, the restriction .G∗ (M) = 1 for some .M > 0 is naturally satisfied. Thus, the MLE of G under the finite Poisson mixture model is consistent. Is (2.12) true without any restrictions on .G∗ ? My intuition is that E∗ log f (X; G∗ ) > −∞
.
under Poisson mixture model is likely a necessary condition for (2.12). Can we find a .G∗ such that this condition is violated? Example 2.4 Let .G∗ be a mixing distribution such that G∗ ({log n}) = c{n(log n)(log log n)2 }−1
.
with some normalizing positive constant c, for .n = 20, 21, . . .. Then, under Poisson mixture model, we have E∗ log f (X; G∗ ) = −∞
.
Proof The specific mixing distribution .G∗ places probability .c{n(log n)(log log n)2 }−1 on .θ = log n. The size of c does not affect the proof. Thus, we pretend .c = 1 in the following proof. With this .G∗ , we have f (x; G∗ ) =
.
∞ E ] [ (log n)x exp(− log n) {n(log n)(log log n)2 }−1 x!
n=20
∞ 1 E (log n)x−1 = . x! (n log log n)2 n=20
Using a technique in mathematical analysis, we have ∞ E .
n=20
(log n)x−1 ≈ (n log log n)2
{
∞
t=20
(log t)x−1 dt. (t log log t)2
The approximation is precise enough so I will not give a precise assessment of the difference. Changing variable with .u = log t, we find {
∞
.
t=20
(log t)x−1 dt ≤ (t log log t)2
{
∞
u=0
ux−1 exp(−u)du = (x − 1)!.
Hence, we find .
log f (x; G∗ ) ≤ log{(x − 1)!/x!} = − log x.
Thus, the conclusion is true if .E∗ log(X) = ∞. Decomposing .E∗ into .EG Eθ , we have E∗ log(X) ≥ EG log{Eθ X} = EG log θ = ∞.
.
This proves the conclusion of this example.
u n
Some readers may find it mystic that .G∗ has support points starting from .log(20). The number 20 is chosen so that .log log(20) > 1 to have the inequality on integration the simplest. Exploring the result of this example, it becomes possible to find an example where C20.1 is not satisfied. We leave it as a problem for readers. It is worthwhile to emphasize that the above example is completely artificial. However, this example does make a clear point: it may not be a simple matter to establish C20.1 even for the commonly used Poisson mixture model. At the same time, the readers have been assured that the consistency result for finite mixture models is largely true. This point is further illustrated by the following example on the exponential mixture model. Example 2.5 Condition C20.1 is satisfied by the exponential mixture model if G∗ (δ) = 0 and .G∗ (M) = 1 for some .0 < δ < M < ∞.
.
Proof This example is based on the parameterization of the component density function as f (x; θ ) = θ exp(−θ x)
.
for .x ≥ 0. The parameter space for the component distribution is .Θ = R + . Under G∗ , the component parameter .θ is confined into the finite interval .[δ, M]. Hence, ∗ ∗ ∗ .f (x; G ) < M which implies that .E log f (X; G ) < ∞. We next show that ∗ ∗ .E log f (X; G ) > −∞. The same restriction implied by .G∗ implies .
f (x; G∗ ) ≥ δ exp(−Mx).
.
Thus, E∗ log f (X; G∗ ) ≥ log δ − M E∗ (X)
.
Using conditional argument, we have ∗
E (X) ≤
.
{
θ −1 dG∗ (θ ) ≤ 1/δ.
Hence, we have .E∗ log f (X; G∗ ) > −∞.
Combining two bounds, we have shown .E∗ log f (X; G∗ ) is finite. Condition C20.1 is therefore implied if E∗ log f (X; G, e) < ∞
.
(2.16)
for any G with a sufficiently small .e. Given any x-value, the component density function .f (x; θ ) is maximized when .θ = 1/x. Therefore, f (X; G, e) ≤ (1/x) exp(−1).
.
We can easily verify that E∗ f (X; G, e) = E∗ log(1/X) − 1 < ∞.
.
Hence, (2.16) is true. Two results: .E∗ log f (X; G∗ ) > −∞ and .E∗ log f (X; G, e) < ∞, lead to C20.1. n u Mathematically, it is rather unsatisfactory to state that the non-parametric MLE is consistent under the exponential mixture model when the true mixing distribution has a support contained in an interval bounded away from both 0 and .∞. In applications, a finite exponential mixture model might be more important in which case C20.1 is implied based on the above example. I suspect that it is possible to show that condition C20.1 is not satisfied by the exponential mixture model in general. Constructing an example likely ends up another very artificial technical discussion. We will not make such an effort.
2.4 Condition C20.2 and Other Technicalities

Rather interestingly, Condition C20.2 is universally satisfied as long as the mixture model is identifiable.

Lemma 2.6 Suppose the mixture model is identifiable and the component distribution satisfies condition C20.0. Then, if lim_{||θ||→∞} f(x; θ) = 0, Condition C20.2 is satisfied.

Proof Note that C20.0 and lim_{||θ||→∞} f(x; θ) = 0 imply that f(x; G) is continuous in G, as proved in Lemma 2.5. Next, we note that for any 0 < u < 1,

E* log{1 + u[f(X; G*)/f(X; G, ε) − 1]} ≥ log(1 − u) > −∞.
When ε ↓ 0, we have

log{1 + u[f(X; G*)/f(X; G, ε) − 1]} → log{1 + u[f(X; G*)/f(X; G) − 1]}.

These two properties make the monotone convergence theorem or Fatou's lemma applicable (Chow and Teicher 2003). We therefore have

lim_{ε↓0} E* log{1 + u[f(X; G*)/f(X; G, ε) − 1]}
  = E* log{1 + u[f(X; G*)/f(X; G) − 1]}
  ≥ −log E*[{u f(X; G*) + (1 − u) f(X; G)}/f(X; G*)] = 0,

where the last inequality is due to Jensen's inequality. The equality holds only if f(x; G*) = f(x; G) for all x outside a set of measure 0. Hence, under the identifiability condition and when G ≠ G*, there must be an ε > 0 such that

E* log{1 + u[f(X; G*)/f(X; G, ε) − 1]} > 0.

This completes the proof. ⊓⊔
While C20.2 is practically satisfied universally, the consistency result on the non-parametric MLE comes with one fewer dividend. As we pointed out earlier, the result of Pfanzagl (1988) does not imply the consistency of the MLE under the finite mixture model. Hence, it leaves some room for the results of both Kiefer and Wolfowitz (1956) and Redner (1981). The proof of Pfanzagl (1988) indicates that, almost surely, the likelihood attains its global maximum within an arbitrarily small neighborhood of G*, the true mixing distribution. However, the likelihood value at G* itself need not be large. This is in sharp technical contrast with the classical results in Wald (1949) or Kiefer and Wolfowitz (1956).
2.4.1 Summary

Based on the result of Pfanzagl (1988), the non-parametric MLEs of the mixing distribution under Poisson and exponential mixture models are strongly consistent. Based on the result of Kiefer and Wolfowitz (1956), the MLEs of the mixing distribution under Poisson and exponential finite mixture models are strongly consistent.
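The consistency summarized above can be visualized with a crude numerical approximation of the non-parametric MLE: restrict the support of G to a fixed grid and maximize the likelihood over the mixing weights with EM updates. The sketch below is only our own illustration under these simplifying choices (grid, sample size, and function names are arbitrary); it is not the algorithm discussed in the book.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(1)
# simulate from a two-point Poisson mixing distribution (support 1 and 5)
n = 500
theta_true = rng.choice([1.0, 5.0], size=n, p=[0.4, 0.6])
x = rng.poisson(theta_true)

# fixed support grid; maximize the likelihood over the mixing weights only
grid = np.linspace(0.1, 10.0, 100)              # candidate support points
dens = poisson.pmf(x[:, None], grid[None, :])   # n-by-G matrix f(x_i; theta_g)
w = np.full(len(grid), 1.0 / len(grid))         # initial mixing weights

for _ in range(500):                             # EM updates for the weights
    post = dens * w                              # unnormalized posteriors
    post /= post.sum(axis=1, keepdims=True)
    w = post.mean(axis=0)

# the fitted mixing distribution should concentrate near theta = 1 and theta = 5
support = grid[w > 1e-3]
print(np.round(support, 2), np.round(w[w > 1e-3], 3))
```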
Chapter 3
Maximum Likelihood Estimation Under Finite Mixture Models
3.1 Introduction

In this chapter, we focus on the consistency of the Maximum Likelihood Estimator (MLE) of the mixing distribution under finite mixture models in the presence of n i.i.d. observations. A finite mixture model of order m has the following generic density function:

f(x; G) = Σ_{j=1}^m πj f(x; θj),

as also given in Chap. 1. The corresponding mixing distribution has its c.d.f. given by

G(θ) = Σ_{j=1}^m πj 1(θj ≤ θ), or equivalently G = Σ_{j=1}^m πj {θj},

to highlight that it assigns probability πj to the parameter value θj. Be aware that a more commonly used notation for {θj} is δ_{θj}. Our choice avoids the need for double subscripts on many occasions. The class of all mixing distributions with at most m support points as given above will be denoted as Gm. The class of finite mixture models of order m is then given by {f(x; G) : G ∈ Gm}. The MLE of G under the finite mixture model {f(x; G) : G ∈ Gm} is therefore a Ĝm ∈ Gm satisfying

ln(Ĝm) = sup{ln(G) : G ∈ Gm}.    (3.1)
Some technicality may occur when the supremum is not attained at any mixing distribution G in Gm. The MLE is then described more rigorously following (2.2).
By this definition, the actual number of support points of Ĝm can be smaller than m. We assume that the true mixing distribution G*, from which the samples were generated, has at most m support points. Otherwise, Ĝm is inconsistent and the subsequent consistency discussion becomes trivial.
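As a concrete illustration of the log-likelihood ln(G) and the optimization in (3.1), the following sketch evaluates and maximizes the likelihood of a finite Poisson mixture of order m = 2 by direct numerical optimization. The parameterization (softmax weights, log means) and all names are our own choices, not code from the book.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import softmax
from scipy.stats import poisson

def neg_log_likelihood(params, x, m):
    """Negative log-likelihood -ln(G) of a finite Poisson mixture of order m.
    params packs unconstrained weights (via softmax) and log-means."""
    w = softmax(params[:m])
    theta = np.exp(params[m:])
    dens = poisson.pmf(x[:, None], theta[None, :]) @ w   # f(x_i; G)
    return -np.sum(np.log(dens + 1e-300))

rng = np.random.default_rng(0)
x = rng.poisson(rng.choice([1.0, 6.0], size=300, p=[0.5, 0.5]))

m = 2
start = np.concatenate([np.zeros(m), np.log([2.0, 4.0])])
fit = minimize(neg_log_likelihood, start, args=(x, m), method="Nelder-Mead")
print(softmax(fit.x[:m]), np.exp(fit.x[m:]))   # fitted weights and support points
```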
3.2 Generic Consistency of MLE Under Finite Mixture Models

The consistency of the nonparametric MLE largely depends on the validity of C20.0 and C20.1. Essentially, these two conditions require that the component or subpopulation density function f(x; θ) is smooth in θ and has finite KL information. The same requirements are needed for the consistency of the MLE under finite mixture models. The simpler structure of Gm compared to G makes it much easier to have C20.1 verified.

Theorem 3.1 Let X1, . . . , Xn be a set of i.i.d. random variables from the finite mixture model {f(x; G) : G ∈ Gm} with true mixing distribution G* ∈ Gm. Let Ĝm be an MLE of G as defined by (3.1). Under conditions C20.0 and C20.1, we have that, as n → ∞,

Dkw(Ĝm, G*) → 0  almost surely.
Proof We will use some common notations without redefining them. The intermediate result (2.8) in the proof of Theorem 2.2 remains valid:

sup{ln(G) : G ∈ Ḡδ} < ln(G*)

almost surely. That is, no mixing distribution, with finite support or not, outside the δ-neighborhood of G* achieves as large a log-likelihood value as that at G*. At the same time, the δ-neighborhood of G* contains at least one mixing distribution with at most m support points, namely G* itself. Hence, the restricted maximum point of ln(G) must be within the δ-neighborhood of G* almost surely. Because this δ can be chosen arbitrarily small as before, we must have

Dkw(Ĝm, G*) → 0  almost surely.

This completes the proof. ⊓⊔
In addition to the consistency result, other miscellaneous results following Theorem 2.2 also remain valid. For instance, suppose G̃m is a mixing distribution with at most m support points and

ln(G̃m) > sup{ln(G) : G ∈ Gm} + o(n),
then it is also a consistent estimator. If G* has exactly m distinct support points so that πj* > 0 for all j = 1, 2, . . . , m, then D(Ĝ, G*) → 0 almost surely clearly implies that the support points of Ĝ can be labeled in such a way that

θ̂j → θj*,  π̂j → πj*  almost surely

for j = 1, . . . , m.

When G* has fewer than m distinct support points, the consistency of Ĝm as a mixing distribution is not affected. However, the loss of full identifiability makes the asymptotic behavior of θ̂j a mess. For example, suppose m = 2 and Θ = R. Assume the true distribution G* is actually a degenerate mixing distribution with θ1* = 0 and π1* = 1. Let the mixing distribution G1n = (1 − 1/n){0} + (1/n){(−1)^n}. Recall that {0} means a point mass at θ = 0. It is easily seen that G1n → G* in distribution. However, the second support point sequence, θ2 = (−1)^n, does not have a limit. The point of this example is that convergence of G1n to G* in distribution does not imply the convergence of its support points.

When m = 1, the mixing distribution is degenerate. In this case, the consistency of Ĝ becomes the consistency of the ordinary parametric MLE. Hence, Theorem 2.2 is more general than it appears. Another useful implication is on structural parameters. In many applications, the component distribution family has its vector parameter θ partitioned into two subvectors θ1 and θ2 so that the mixing distribution G degenerates in the second subvector θ2. Theorem 2.2 also remains applicable to this case. That is, the non-parametric MLE of the mixing distribution in terms of θ1 as well as the parametric MLE of θ2 is consistent.

In the following example, we demonstrate that the finite mixture of exponential distributions satisfies C20.1.

Example 3.1 Condition C20.1 is satisfied by the finite mixture of exponential distributions.

Proof In Example 2.5, we proved that the exponential mixture satisfies C20.1 if there exist 0 < δ < M < ∞ such that G*(δ) = 0 and G*(M) = 1. Note that this condition is placed on the unknown true mixing distribution G*. Under the finite mixture model, G*(θ) has a finite number of support points. Naturally, the smallest is larger than 0 and the largest is finite. The condition is therefore satisfied by choosing the smallest and the largest support points as δ and M. ⊓⊔

Note that the condition G*(δ) = 0 and G*(M) = 1 for some 0 < δ < M < ∞ may appear restrictive for generic mixture models. This condition is automatically satisfied for finite mixture models. Let us point out that C20.1 does not always hold true for finite mixture models.

Example 3.2 Condition C20.1 is not satisfied by the finite mixture of normal distributions in both mean and variance.
Proof By this example, we mean that C20.1 is not satisfied in general. Thus, we need only an example with a specific parameter setting. Consider the case where the order of the normal mixture model is m = 2. Let G* = {(0, 1)} be a mixing distribution that places probability 1 at (μ, σ²) = (0, 1), where μ and σ² are the mixing parameters for the mean and variance. That is, the true mixture is in fact the standard normal. We now select a particular G = {(0, 2)} and characterize the function f(x; G, ε) in the context of the finite normal mixture model. This specific parameter value choice (μ, σ²) = (0, 2) does not play a role. Based on the KW-distance defined by (2.3), for any given value t and γ²,

G_{t,γ} = (1 − ε){(0, 2)} + ε{(t, γ²)}

is within ε distance of G. Because of this, we find that at any x,

f(x; G, ε) ≥ sup_γ f(x; G_{x,γ}) ≥ sup_γ ε/(√(2π) γ) = ∞,

where π is the mathematical constant. Hence, for any ε > 0,

E*{log f(X; G*)/f(X; G, ε)} = −∞,

which violates C20.1. ⊓⊔
In fact, it is well known that the likelihood function under the finite normal mixture model is unbounded. It takes value ∞ when one of the component parameters is given by (x1, 0), where x1 is the value of the first observation. Hence, the inconsistency of the MLE can be easily and directly demonstrated. At the same time, the likelihood function often has a local maximum which estimates the mixing distribution well in the common sense. A careless application of the MLE under the finite mixture model does not necessarily lead to a disaster. Normal mixture models are so special that we will devote a chapter to this topic.
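The unboundedness is easy to reproduce numerically. In the sketch below (our own illustration, with a 50-50 mixture assumed), one component is fixed at the standard normal while the other is centered at the first observation with a shrinking standard deviation; the log-likelihood then increases without bound.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(size=100)          # data from a standard normal, the "true" model

def log_lik(x, mu1, s1, mu2, s2, w=0.5):
    """Log-likelihood of a two-component normal mixture in mean and variance."""
    dens = w * norm.pdf(x, mu1, s1) + (1 - w) * norm.pdf(x, mu2, s2)
    return np.sum(np.log(dens))

# fix one component at (0, 1); center the other at x[0] and shrink its variance
for s2 in [1.0, 0.1, 0.01, 0.001]:
    print(s2, log_lik(x, 0.0, 1.0, x[0], s2))   # grows without bound as s2 -> 0
```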
3.3 Redner's Consistency Result

In Chap. 2, we mentioned that the result of Redner (1981) is specifically developed for the consistency of the MLE when the model is non-identifiable. The generic result of Redner (1981) is particularly relevant to finite mixture models. For any G ∈ Gm, one may attempt to represent it as a real-valued vector rather than a cumulative distribution function defined on the component parameter space Θ. For instance, a mixing distribution

G(θ) = π1 1(θ1 ≤ θ) + π2 1(θ2 ≤ θ)
may be represented just as effectively as a parameter vector

G = (π1, π2, θ1, θ2).

Here we have purposely used the same generic symbol G for the mixing distribution. This particular parametric system is convenient for discussing, for instance, the limiting distribution of some estimator θ̂1. The drawback is that it leads to a somewhat superficial non-identifiability, because

G̃ = (π2, π1, θ2, θ1)

is a different vector from the above G. However, we have f(x; G) = f(x; G̃). Although finite mixture models are never identifiable from this angle, G is identifiable when viewed as a member of Gm for many specific mixtures. The topology of Gm in the context of the vector parameter becomes a quotient topology. Namely, the vectors defining the same mixing distribution are regarded as a single point under this topology. The distance between two mixing distributions may be taken as the smallest Euclidean distance among all possible vector representations. Once the problem is portrayed from this angle, the consistency results in Wald (1949) are nearly directly applicable. In the context of finite mixture models, we need to interpret the conditions in Wald (1949) in corresponding terms.

Let us reaffirm that the component distributions are members of a specific parametric family {f(x; θ) : θ ∈ Θ}. Let δ(·,·) be a metric defined on Θ. We use G* for the true mixing distribution with finite support of size at most m: the random sample of size n is from f(x; G*). We use E* for the expectation under this distribution as usual. Let

f(x; θ, ε) = sup{f(x; θ') : δ(θ', θ) ≤ ε}

and f*(x; θ, ε) = max{1, f(x; θ, ε)}.

C30.1 The parameter space Θ is a metric space with distance δ(·,·) such that every closed and bounded subset of Θ is compact.
C30.2 For each θ ∈ Θ and for sufficiently small ε, f(x; θ, ε) is measurable in x, and E* log{f*(X; θ, ε)} < ∞.
C30.3 For any θ1, θ2 ∈ Θ, E_{θ1}|log f(X; θ2)| < ∞.
C30.4 If θk → θ, then f(x; θk) → f(x; θ) as k → ∞, except on a set A of x which does not depend on the sequence θk and satisfies Pr(X ∈ A) = 0 under the distribution specified by f(x; θ).
These conditions are practically verbatim the conditions given by Redner (1981). Notably, the list does not include an identifiability condition. Not only is identifiability not required on the mixture aspect of the model, it is also not required of the subpopulation distribution family {f(x; θ) : θ ∈ Θ}. Because this book focuses on mixture models whose subpopulation parameter space Θ is a subset of a Euclidean space, C30.1 is automatically satisfied by interpreting δ(·,·) as the Euclidean distance. C30.2 and C30.3 have their counterparts in Wald (1949). They ensure that, when f(x; θ) is smooth as specified by C30.4, the enhanced Jensen's inequality is true for a small enough ε.

Theorem 3.2 Under Conditions C30.1-4, assume that Ĝ maximizes ln(G) in a compact subset G̃m of Gm and that G̃m contains G*. Then Ĝ is strongly consistent in the quotient topological space G̃m.

Proof Whether we work on finite mixture models or non-parametric models, the consistency is implied by the enhanced Jensen's inequality C20.1 and the compactness of the parameter space. Hence, to prove this theorem, it suffices to show that C20.1 is implied by C30.1-C30.4; the compactness has already been directly assumed.

Let G(θ) = Σ_{j=1}^m γj {θj} be any mixing distribution in G̃m. Clearly,

f(x; G, ε) ≤ max_j f(x; θj, ε) ≤ max_j f*(x; θj, ε).
Due to the fact that f* ≥ 1, this implies

log f(x; G, ε) ≤ log{max_j f(x; θj, ε)} ≤ Σ_{j=1}^m log f*(x; θj, ε).

Hence, by C30.2, we have

E*[log{f(X; G, ε)}] ≤ Σ_{j=1}^m E*{log f*(X; θj, ε)} < ∞.    (3.2)
In the next step, we show E*|log f(X; G*)| < ∞. Let A = {x : f(x; G*) > 1}. When x ∈ A, we have

1 ≤ f(x; G*) ≤ max_j f(x; θj*),

which leads to

|log f(x; G*)| ≤ max_j log f(x; θj*) ≤ Σ_{j=1}^m |log f(x; θj*)|.

When x ∉ A, we have
1 ≥ f(x; G*) ≥ min_j f(x; θj*)

and therefore

|log f(x; G*)| ≤ |log{min_j f(x; θj*)}| ≤ Σ_{j=1}^m |log f(x; θj*)|.

Hence, for all x, we have

|log f(x; G*)| ≤ Σ_{j=1}^m |log f(x; θj*)|.

By Condition C30.3, we find

E*|log f(X; G*)| ≤ Σ_{j=1}^m E*|log f(X; θj*)| < ∞.    (3.3)

Finally, combining with the smoothness condition C30.4, we find that for any G ≠ G*, there exists an ε such that

E*[log{f(X; G, ε)/f(X; G*)}] < 0.

We remark that G ≠ G* is used above in the sense of the quotient topology. The above inequality is the condition C20.1 we promised to verify. ⊓⊔
Compared with other approaches to proving consistency, the advantage of this result is the clarity of its conditions. We notice that C30.3 is essentially the only condition that goes beyond the conditions of Wald (1949). At the same time, one cannot help noticing that the extra compactness condition stated within the above theorem is restrictive. A likely remedy for this deficiency is to compactify Gm. If so, we are back to Kiefer and Wolfowitz (1956). Nonetheless, it is easier to verify C30.3 than the more generic conditions in Kiefer and Wolfowitz (1956).

Under the finite mixture of binomial distributions, the component parameter space Θ is compact. Because of this, Theorem 3.2 is directly applicable. The concept of quotient topology is handy: we are freed from a discussion of identifiability. Of course, without identifiability, we will be burdened by how to interpret the meaning of the consistency.

Under the finite mixture of Poisson distributions, let θ be the mean of the component distribution. The component parameter space is [0, ∞). We may define f(x; ∞) ≡ 0 so that Θ = [0, ∞] and alter the distance on Θ appropriately so that Gm is expanded to its compactified version Ḡm. Theorem 3.2 would then be applicable.

It is known that the MLE of the mixing distribution under the finite normal mixture model is not consistent. Therefore, Theorem 3.2 is definitely inapplicable
or its conclusion would be wrong. In view of this, we may wonder which part of the conditions of this theorem is violated. We point to the parameter space of the component variance σ², which is not compact. In fact, there is no way for us to compactify it while maintaining the smoothness of f(x; G) in G. Suppose it is known that σ² ≥ ε > 0 for all mixing distributions under consideration; then the restricted MLE is consistent when G* satisfies this restriction. Needless to say, placing such an artificial lower bound is not a satisfactory solution. There are more natural remedies for obtaining consistent estimators under the finite normal mixture model. We will discuss them in a designated chapter.
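The quotient-topology viewpoint discussed above, in which the distance between two mixing distributions is the smallest Euclidean distance over all vector representations, can be made concrete with a few lines of code. The following is a minimal sketch (function name and representation are ours) for mixing distributions with the same number of components.

```python
import itertools
import numpy as np

def quotient_distance(G1, G2):
    """Smallest Euclidean distance between two finite mixing distributions
    over all ways of relabeling the components.  Each G is given as
    (weights, support_points) with the same number of components."""
    w1, t1 = map(np.asarray, G1)
    w2, t2 = map(np.asarray, G2)
    best = np.inf
    for perm in itertools.permutations(range(len(w2))):
        v1 = np.concatenate([w1, t1])
        v2 = np.concatenate([w2[list(perm)], t2[list(perm)]])
        best = min(best, np.linalg.norm(v1 - v2))
    return best

# two vector representations of the same mixture are distance 0 apart
G = ([0.3, 0.7], [1.0, 5.0])
G_relabeled = ([0.7, 0.3], [5.0, 1.0])
print(quotient_distance(G, G_relabeled))   # 0.0
```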
3.4 Examples

In this section, we go over a number of commonly used finite mixtures and examine whether their MLEs are consistent by checking the consistency conditions. By consistency, we implicitly assume that the dataset is made of n i.i.d. observations from a distribution in the model and that the sample size n goes to infinity.

Example 3.3 The MLE of the mixing distribution under the finite exponential mixture is consistent.

We discuss this example taking Θ = (0, ∞) and f(x; θ) = θ exp(−θx) for x > 0. The parameter space clearly satisfies condition C30.1. For any θ ∈ Θ, there exists a sufficiently small ε such that (θ − ε, θ + ε) ⊂ Θ. In addition,

sup{θ' exp(−θ'x) : θ' ∈ (θ − ε, θ + ε)} ≤ (θ + ε) exp{−(θ − ε)x}.

Hence,

log{f*(x; θ, ε)} ≤ [log(θ + ε) − (θ − ε)x]⁺ ≤ [log(θ + ε)]⁺,

which has a finite upper bound not depending on x. Therefore,

E* log{f*(X; θ, ε)} < ∞.

This verifies condition C30.2. For any θ ∈ Θ, we have log f(x; θ) = log θ − θx, which has a finite moment under any exponential distribution. Hence, condition C30.3 is verified. The density function f(x; θ) = θ exp(−θx) is apparently continuous in θ, so condition C30.4 is satisfied. Finally, Theorem 3.2 only claims the consistency of the MLE restricted to a compact subset of Gm. Since {f(x; θ) : θ ∈ (0, ∞)} can be continuously extended to {f(x; θ) : θ ∈ [0, ∞]}, this restriction is no longer a restriction in the latter space. Hence, the MLE is consistent. ⊓⊔
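Since the MLE under the finite exponential mixture is consistent, it is worth noting how it is usually computed in practice, namely with the EM algorithm. The following is a minimal EM sketch under the rate parameterization used above; the initialization, sample sizes, and names are our own choices, not code from the book.

```python
import numpy as np

def em_exponential_mixture(x, m=2, iters=200, seed=0):
    """EM iterations for a finite exponential (rate-parameterized) mixture."""
    rng = np.random.default_rng(seed)
    w = np.full(m, 1.0 / m)                                      # mixing proportions
    theta = 1.0 / np.quantile(x, rng.uniform(0.2, 0.8, size=m))  # initial rates
    for _ in range(iters):
        dens = theta * np.exp(-np.outer(x, theta))   # n-by-m matrix f(x_i; theta_j)
        resp = dens * w
        resp /= resp.sum(axis=1, keepdims=True)      # E-step: responsibilities
        w = resp.mean(axis=0)                        # M-step: weights
        theta = resp.sum(axis=0) / (resp * x[:, None]).sum(axis=0)  # M-step: rates
    return w, theta

rng = np.random.default_rng(1)
x = np.where(rng.random(1000) < 0.4,
             rng.exponential(1 / 0.5, 1000),
             rng.exponential(1 / 3.0, 1000))         # true rates 0.5 and 3.0
print(em_exponential_mixture(x))
```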
Example 3.4 The MLE of the mixing distribution under the finite Gamma mixture is not consistent.

In this example, we consider the finite Gamma mixture with two component parameters:

f(x; r, θ) = x^{r−1} exp(−x/θ)/{θ^r Γ(r)}

for x > 0. In this expression, θ is the scale parameter and r is the degree of freedom. Both parameters take positive real values. The density function of the finite Gamma mixture of order m is given by

f(x; G) = Σ_{j=1}^m πj f(x; rj, θj).

Note that the Gamma mixture has a two-dimensional subpopulation parameter vector made of r and θ. Suppose x1, x2, . . . , xn form a random sample of size n from the two-parameter finite Gamma mixture. The log-likelihood of this sample is given by

ln(G) = Σ_{i=1}^n log f(xi; G).

Given any observed value x1, let (r1, θ1) be a parameter vector such that r1 θ1 = x1 for a sufficiently large r1. By the Stirling formula of Batir (2008),

Γ(r) = √(2π) r^{r−1/2} exp(−r){1 + O(r⁻¹)}.

Hence, the subpopulation density function at x1 = r1 θ1 is given by

f(x1; r1, θ1) = x1^{r1−1} exp(−r1)/{Γ(r1)(x1/r1)^{r1}} = {√(r1)/(√(2π) x1)} {1 + O(r1⁻¹)}⁻¹.

Thus, by letting r1 → ∞ while keeping r1 θ1 = x1, we find

f(x1; r1, θ1) → ∞.

Let π1 = 0.5, and let the other subpopulation parameters be rj = 1, θj = 1, and πj = 0.5/(m − 1) for j = 2, . . . , m. For the mixing distribution with this parameter setting, the log-likelihood function satisfies

ln(G) ≥ n log(0.5) + log f(x1; r1, θ1) + Σ_{i=2}^n log f(xi; 1, 1) → ∞

as r1 → ∞ and θ1 = x1/r1. By definition, this G is an MLE. Clearly, it is not relevant to the true mixing distribution from which the dataset was generated. Hence, the MLE is inconsistent. ⊓⊔

The reason for the inconsistency is clearly that f(x1; r1, θ1) → ∞ for a specific combination of r1 and θ1. This makes condition C20.1 not satisfiable.
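The divergence f(x1; r1, θ1) → ∞ along r1 θ1 = x1 is easy to check numerically, for instance with scipy's Gamma density (shape a = r, scale = θ); the values grow roughly like √(r1), in line with the Stirling approximation above. This small check is our own illustration.

```python
import numpy as np
from scipy.stats import gamma

x1 = 2.5                     # a fixed observed value
for r1 in [10, 100, 1000, 10000]:
    theta1 = x1 / r1         # keep the subpopulation mean r1 * theta1 equal to x1
    # scipy's gamma pdf with shape a = r1 and scale theta1 matches f(x; r, theta)
    print(r1, gamma.pdf(x1, a=r1, scale=theta1))   # grows like sqrt(r1)
```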
Chapter 4
Estimation Under Finite Normal Mixture Models
We use normal and Gaussian interchangeably in this book. The techniques for statistical data analysis under the normal model are usually the simplest: there are many standard procedures known to be optimal according to numerous criteria we are aware of. In comparison, data analysis under the normal mixture model is most challenging, not only because the mixture model is not regular but also because the finite normal mixture model loses additional identifiability compared to other mixture models. It is so extreme that the maximum likelihood estimator of the mixing distribution under the finite normal mixture model is not consistent or not well defined. The best possible rate of convergence for estimating the mixing distribution is also lower. Because of these issues, it is appropriate to have a chapter designated exclusively for normal mixture models.

Let G be a bivariate distribution on R × R⁺ and denote the density function of a univariate normal distribution with mean θ and variance σ² by

φ(x; θ, σ²) = {1/(√(2π) σ)} exp{−(x − θ)²/(2σ²)}.

This π is the mathematical constant. The most generic normal mixture model with mixing distribution G has density function given by

f(x; G) = ∫ φ(x; θ, σ²) dG(θ, σ).    (4.1)

When G degenerates in its second argument σ, f(x; G) reduces to a normal mixture model in the location parameter θ with a structural parameter σ:

f(x; G1, σ) = ∫ φ(x; θ, σ²) dG1(θ)    (4.2)
where, as a c.d.f., G1(θ) = G(θ; ∞). When G degenerates in its first argument θ, f(x; G) reduces to a normal mixture model in the scale parameter with a structural parameter θ:

f(x; θ, G2) = ∫ φ(x; θ, σ²) dG2(σ).    (4.3)

As a c.d.f., G2(σ) = G(∞; σ). The generic normal mixture model (4.1) is not identifiable, as discussed in Chap. 1. Thus, the consistency results on the non-parametric MLE of G in Chap. 2 are not applicable. When we restrict the mixing distribution G to special classes, consistent estimation becomes possible. Because the normal distribution has two parameters, there are many meaningful special types of normal mixture models.

(a) Consider the model f(x; G1; σ0) with the σ0 value given. In this case, the population is a convolution of φ(x; θ, σ) with G1(θ). The problem of estimating G1(θ) is a deconvolution problem. Under mild conditions that G1 is known to have some properties, it is possible to have it consistently estimated. See Fan (1992) and Zhang (1990).
(b) Consider the model in (a) except that σ is an unknown parameter in f(x; G1; σ). In this case, G1 is generally not identifiable. Thus, consistent estimation of G1 is not possible.
(c) Consider the model in (b) except that G1 has at most m support points and m is known. In this case, σ is a structural parameter: a parameter value shared by all subpopulations in this finite normal mixture model. We will devote a section to showing that both G1 and σ can be consistently estimated in this case.
(d) Models (a)-(c) can have the roles of θ and σ exchanged to get three new (sub-)models. Among them, the model f(x; θ0; G2) with the θ0 value given is special. Let Y = log{|X − θ0|}. Then Y is a random variable with a distribution from a location family. Due to the symmetry of the distribution of X − θ0 under this model, the transformation does not lose any information about G2. Thus, the problem of estimating G2 reduces to a deconvolution problem. We are not aware of much discussion on f(x; θ; G2) when θ is an unknown structural parameter.
(e) When G has finite support, f(x; G) as defined in (4.1) is the most generic finite normal mixture model. This is probably the most commonly used mixture model. However, this generic probability model has very tricky finite-sample and asymptotic properties in its related statistical inferences. Like model (c), it needs a specific section.
4.1 Finite Normal Mixture with Equal Variance

Consider the finite normal mixture model (c) where the subpopulation distributions share an equal but unknown variance:

f(x; G; σ²) = Σ_{j=1}^m πj φ(x; θj, σ²).    (4.4)
The log-likelihood function based on a set of i.i.d. observations is given by

ln(G; σ²) = Σ_{i=1}^n log f(xi; G; σ²).
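In practice, ln(G; σ²) is typically maximized with an EM iteration in which all components share one variance. The following is a minimal sketch of such an iteration (our own illustration; the initialization, simulated data, and names are arbitrary choices), not code from the book.

```python
import numpy as np
from scipy.stats import norm

def em_equal_variance(x, m=2, iters=300, seed=0):
    """EM iterations for a finite normal mixture whose components share one
    structural variance sigma^2, as in model (c)."""
    rng = np.random.default_rng(seed)
    w = np.full(m, 1.0 / m)
    mu = rng.choice(x, size=m, replace=False)      # initial subpopulation means
    var = np.var(x)                                # initial common variance
    for _ in range(iters):
        dens = norm.pdf(x[:, None], mu[None, :], np.sqrt(var))
        resp = dens * w
        resp /= resp.sum(axis=1, keepdims=True)    # E-step
        w = resp.mean(axis=0)                      # M-step: mixing proportions
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
        var = (resp * (x[:, None] - mu[None, :]) ** 2).sum() / len(x)
    return w, mu, var

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(4, 1, 300)])
print(em_equal_variance(x))   # roughly equal weights, means near 0 and 4, var near 1
```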
There are many approaches to estimate the mixing distribution G and the structural parameter .σ 2 . Our discussion focuses on the maximum likelihood estimator when the order of the finite mixture, m, is pre-specified. We first give a preliminary result similar to the one in Chen and Chen (2003). The result here is much more general. Lemma 4.1 Assume that we have a set of i.i.d. observations from the finite normal ˆ σ 2 ) be a global maximum point of the mixture model (4.1) with m known. Let .(G, 2 likelihood function .ln (G; σ ). Then, there exist constants .0 < e < Δ < ∞ such that as the sample size .n → ∞, the event sequence .{e ≤ σˆ 2 ≤ Δ} occurs almost surely. Proof It is easily seen that .f (x; G; σ 2 ) ≤ 1/σ for all x and G. Thus, when .σ 2 > Δ, we have ln (G; σ 2 ) ≤ −(1/2)n log Δ.
.
Let .x¯ be sample mean, and .sn2 = n−1
En
i=1 (xi
− x) ¯ 2 . We have
ln (x; ¯ sn2 ) ≥ −n log(sn ) − (n/2),
.
where .x¯ in the first entry represents the mixing distribution assigning probability 1 to the parameter value .θ = x. ¯ The above calculations imply ln (x, ¯ sn2 ) − ln (G; σ 2 ) ≥ {−n log(sn ) − (n/2)} − {−(1/2)n log Δ}
.
= n{log Δ − log(sn ) − (1/2)} uniformly for all .σ 2 > Δ. Let X denote a random variable with the true finite normal mixture distribution. Applying the law of large numbers, .sn2 almost surely converges to .VAR(X) as .n → ∞. Hence, when .log Δ > log{VAR(X)} + (1/2), we have
56
4 Estimation Under Finite Normal Mixture Models
ln (G; σ 2 ) < ln (x, ¯ sn2 )
.
almost surely for all .σ 2 > Δ. This proves that the MLE for .σ 2 is below this finite value .Δ. In other words, the global maximum point .σˆ 2 < Δ almost surely for this choice of .Δ. Next we show that .σˆ 2 is also bounded below from zero almost surely. Due to the algebraic form of the normal density, we always have .
log f (x; G, σ 2 ) ≤ − log(σ )
(4.5)
regardless the actual value of x. At the same time, we have f (x; G; σ 2 ) =
m E
.
πj φ(x; θj , σ 2 ) ≤ max φ(x; θj , σ 2 ). j
j =1
Hence, for any G and .σ 2 , we have another upper bound .
log f (x; G; σ 2 ) ≤ − log σ − (2σ 2 )−1 min (x − θj )2 . 1≤j ≤m
Let M be an arbitrary positive number and denote the truncated .θ value as ⎧ ⎨ −M .θ˜ = θ ⎩ M
θ < −M; |θ | < M; θ > M.
We denote .θ for .(θ1 , . . . , θm ) in G and .θ˜ for .(θ˜1 , . . . , θ˜m ). The space of .θ˜ is clearly compact with finite M. When .|x| ≤ M, we have .
log f (x; G, σ 2 ) ≤ − log σ − (2σ 2 )−1 min (x − θ˜j )2 . 1≤j ≤m
(4.6)
Applying (4.5) to those with .|xi | > M and (4.6) for those with .|xi | ≤ M, we find ln (G, σ ) ≤ −n log σ − (2σ 2 )−1
n E {
.
i=1
} min (xi − θ˜j )2 1(|xi | ≤ M).
1≤j ≤m
Let us focus on h(θ˜ , X) =
.
{
} min (X − θ˜j )2 1(|X| ≤ M).
1≤j ≤m
(4.7)
4.1 Finite Normal Mixture with Equal Variance
57
˜ x) ≤ M 2 and it is equicontinuous in .θ˜ for all x. Because the space of Clearly, .h(θ, θ˜ is compact, by the uniform strong law of large numbers (Rubin 1956),
.
n E
−1
n
.
a.s. ˜ xi ) −→ ˜ X)} h(θ, E∗ {h(θ,
i=1
uniformly in .θ˜ . In addition, the expectation is a smooth function of .θ˜ . Note that .E∗ is the expectation under the true mixture distribution. Being clearly nonzero at each ˜ , we must have .θ .
˜ X)} = δ > 0 inf E∗ {h(θ,
where the infimum is taken over the compact space of .θ˜ . Applying this result to (4.6), we find ln (G, σ ) ≤ −n{log σ + δ/σ 2 }
.
almost surely for any .σ and therefore ln (G, σ ) − ln (x, ¯ sn2 ) ≤ −n{log σ − log(sn ) + δ/σ 2 − 1/2}
.
(4.8)
almost surely. When .σ 2 is small enough, the upper bound goes to negative infinite as .n → ∞. Hence, the maximum value of .ln (G, σ ) must be attained when .σ > e for some .e > 0. This completes the proof. u n We did not give a motivation for introducing .θ˜ for the sake of smooth presentation. Why do we need it? This can be explained as follows. No matter what the true mixture distribution is governing the data, there is always a large enough M such that practically all observed values fall within the range .[−M, M]. Hence, mixing distributions with subpopulation means outside of this range cannot possibly be the local maximum of .ln (G). Introducing .θ˜ fits into this insight and it subsequently allows us to work on a compact space. The compact space is a key condition to make use of Rubin’s uniform strong law of large numbers. We may also take note that M does not have to be large enough at all in the above proof. The proof has taken note that the loss in the size of the likelihood due to mistreating observations inside .[−M, M] is arbitrarily inflated by the sufficiently small .σ . This loss is reflected by the term .δ/σ 2 in inequality (4.8). The result presented here has its root in Chen and Chen (2003). While the proof here shares the basic strategies, the idea of focusing on observations within a compact space .[−M, M] allows a stronger conclusion. Namely, the result is shown without the restriction on the parameter space of .θ . This lemma has a very important implication. That is, under the finite normal mixture model with equal variance, the effective subpopulation parameter space for
58
4 Estimation Under Finite Normal Mixture Models
(θ, σ 2 ) is .R × [e, Δ] from the asymptotic point of view. Restricting the space of .σ this way leads to a compact parameter space. This enables the use of a previously proven result on the consistency of the maximum likelihood estimator. On this subpopulation parameter space, we have
.
.
lim f (x; θ, σ 2 ) = 0.
|θ|→∞
In addition, the density function has a positive upper bound. All the conditions in Theorem 3.1 are therefore easily verified. We therefore arrive at the consistency result. Theorem 4.1 Assume that we have a set of i.i.d. observations from the finite normal ˆ σˆ 2 ) be a global maximum point of the mixture model (4.1) with m known. Let .(G, 2 ˆ likelihood function .ln (G; σ ). Then .(G, σˆ 2 ) are strongly consistent for .(G, σ 2 ). Proof In view of the conclusion in Lemma 4.2, we need to only prove the strong ˆ σˆ 2 ) when the space of consistency for the maximum likelihood estimators of .(G, .σ is reduced to the compact interval .[e, Δ]. This reduction makes it possible for us to use the consistency of the nonparametric MLE given by Kiefer and Wolfowitz (1956). For this purpose, we verify the conditions specified in Lemma 2.5. First of all, the subpopulation normal density function is apparently continuous. In addition, .
lim φ(x; θ, σ ) = 0
|θ|→∞
for any .σ ∈ [e, Δ]. This step reveals the reason behind confining .σ into a compact space. Lastly, we need remark on condition (2.12), which requires the Kullback– to Leibler information being finite. This is clearly true when .f (X; G∗ ) is a finite normal mixture density. In conclusion, the consistency of the MLE is proved. u n
4.2 Finite Normal Mixture Model with Unequal Variances The density function under this model is given by f (x; G) =
m E
.
πj φ(x; θj , σj2 ).
(4.9)
j =1
By unequal-variance assumption, we do not exclude the possibility that the true subpopulation variances are all equal. Under this model, the mixing distribution G is on two-dimensional space .R × R + . The log-likelihood is algebraically the same as before:
4.2 Finite Normal Mixture Model with Unequal Variances
ln (G) =
n E
.
59
log f (xi ; G).
i=1
4.2.1 Unbounded Likelihood Function and Inconsistent MLE Consider the case .m = 2. Let .(θ1 , σ1 ) = (0, 1), .π1 = π2 = 0.5. Let .θ2 = x1 and σ2 = 1/k for .k = 1, 2, . . .. Let .Gk be the corresponding mixing distribution. This setup creates a sequence of mixing distributions .{Gk }∞ k=1 . It is seen that
.
( x2 ) k 0.5k 0.5 f (x1 ; Gk ) = √ + √ exp − 1 ≥ 2 2 Π 2Π 2Π
.
and that for .i ≥ 2, ( k 2 (x − x )2 ) ( x2 ) 0.5 0.5k i 1 + √ exp − i f (xi ; Gk ) = √ exp − 2 2 2Π 2Π ( x2 ) 1 ≥ exp − i . 2Π 2
.
Consequently, we have 1E 2 xi − n log(2Π ). .ln (Gk ) ≥ log(k) − 2 n
i=2
Given the dataset .x1 , . . . , xn , and as .k → ∞, we find .ln (Gk ) → ∞. As .∞ is the supremum of .l(G), the limiting point of .Gk , as .k → ∞, corresponds to the mixing distribution which assigns 50–50% of probability of .(θ, σ ) values on .(0, 1) and .(x1 , 0). Clearly, this MLE is not consistent. There are many misconceptions in the literature. First, some researchers take note that .x1 is the observed value of .X1 and .X1 is a continuous random variable. Thus, given any .θ , we have .Pr(X1 = θ ) = 0. So they wish to conclude that the probability of having a degenerate MLE is zero. This claim is false because .θ2 in this example is chosen after .x1 rather than the other way around. Whatever the value .x1 assumes, we can always examine the likelihood value at .θ2 = x1 in the data analysis. We are allowed to draw a target after having shot an arrow. Second, the non-consistent MLEs under the finite normal mixture model with unequal variances have at least one degenerate subpopulation variance. In applications, however, a user can often find many non-degenerate local maximum of .ln (G) and use the one that gives the largest likelihood value as the MLE of G. Based on our experience accumulated in simulation studies, the resulting estimator has good statistical properties. Because of this, we cannot simply write this practice off.
60
4 Estimation Under Finite Normal Mixture Models
The inconsistency conclusion may not be an obstacle in applications to many. Nonetheless, it is most satisfactory to have a “fool-proof” method whose property is ensured with a solid theory, and it works well in applications. There are two approaches to achieve consistent estimation based on likelihood proposed in the literature. One is to restrict the subpopulation parameter space to avoid singularity. The other is to place a penalty to the likelihood function. Proving the consistency in either approaches involves a lot of technical issues.
4.2.2 Penalized Likelihood Function Let us go back to the example of the finite normal mixture model with unequal variances and in the context that a set of n i.i.d. observations are available. The likelihood contribution of an observation .xi is quantified by .f (xi ; G). Once the set of i.i.d. observations are given, we can always find an G such that .f (x1 ; G) = ∞ or make it arbitrarily large. However, once a specific choice of G makes .f (xi ; G) = ∞ for several .xi ’s, the likelihood contributions of the rest observations remain regular. If so, we may restore the regularity by removing the likelihood contributions of these abnormal observations. This goal can be achieved by introducing a penalized likelihood: l˜n (G) = ln (G) + pn (G).
.
We may then estimate G by one of the global maximum points of .l˜n (G): ˜ = arg max l˜n (G). G
.
˜ the The uniqueness and existence are subject to further discussion. We call .G penalized maximum likelihood estimator, or pMLE for short, rather than maximum ˜ as .θ˜j and penalized likelihood estimator. We denote the subpopulation means in .G so on. To achieve the goal of restoring regularity, we make .pn (G) → −∞ at some rate when .minj σj2 → 0. We next give a technical lemma, which motivates the requirements on the size of the penalty function. Some kind of penalty as function of .σj /σi may also be effective but not discussed here.
4.2.3 Technical Lemmas We have pointed out that .f (xi ; G) = ∞ can occur only if .xi is arbitrarily close to one of the subpopulation means .θj , .j = 1, 2, . . . , m specified by G. Is it possible for us to find a set of .θj such that every .xi is close to one of them? The answer is clearly no. For instance, if .x1 = 1, x2 = 2, . . . , xn = n, we can at most make .xi − θj ≈ 0
4.2 Finite Normal Mixture Model with Unequal Variances
61
for m of these x-values. In general, when the observations spread out instead of forming clusters, the number of “abnormal observations” is limited regardless the choice of G. Based on this example, it is vital to quantify the spreads of the observations from a finite normal mixture model. The spread-out characterization turns out to be true for samples from any distribution with bounded and continuous density function. Lemma 4.2 Let .x1 , . . . , xn be a set of n i.i.d. observations from an absolute continuous distribution F with density function .f (x). E Assume that .f (x) is continuous and .M = supx f (x) < ∞. Let .Fn (x) = n−1 ni=1 1(xi ≤ x) be the empirical distribution function. Then, as .n → ∞ and almost surely, for any given .e > 0, .
sup {Fn (θ + e) − Fn (θ )} ≤ 2Me + 8n−1 log n.
θ∈R
We note that .n{Fn (θ + e) − Fn (θ )} equals the number of observations falling inside the interval .(θ, θ +e]. The length of this interval is .e. The starting point of this interval .θ is arbitrary because the operator “.supθ∈R ” is introduced in this lemma. Because of this, Lemma 4.2 shows that wherever we place an interval of length .e, with probability 1, this interval can at most contain .2nMe + 8 log n observations. If we let .Me = n−1 log n, the total is no more than .n{2Me + 8n−1 log n} = 10 log n. This result will be technically further tightened. Yet the interpretation stays the same. Proof Under the continuity assumption on .F (x), we can always find .η0 , η1 , . . . , ηn such that .F (ηi ) = j/n for .0 < j < n and that .η0 = −∞ and .ηn = ∞. This ensures that for any .θ value, there exists a j such that .ηj −1 ≤ θ ≤ ηj . Therefore, .
sup{Fn (θ + e) − Fn (θ )} ≤ max{Fn (ηj + e) − Fn (ηj −1 )} j
θ
≤ max |{Fn (ηj + e) − Fn (ηj −1 )} j
−{F (ηj + e) − F (ηj −1 )}| + max{F (ηj + e) − F (ηj −1 )}. j
We now give an appropriate bound for each of the two terms. Let us examine the simpler second term first. We have F (ηj + e) − F (ηj −1 ) ≤ {F (ηj + e) − F (ηj )} + {F (ηj ) − F (ηj −1 )}.
.
By the mean value theorem in calculus, we have F (ηj + e) − F (ηj ) ≤ Me.
.
(4.10)
62
4 Estimation Under Finite Normal Mixture Models
Due to the choice of .ηj ’s we have .F (ηj ) − F (ηj −1 ) = n−1 . Therefore, F (ηj + e) − F (ηj −1 ) ≤ Me + n−1
.
regardless of j and for any positive value .e even if it depends on En. Let .Yi = 1(ηj −1 < xi ≤ ηj + e) and write .Δj = n−1 i {Yi − EYi }. The first term in (4.10) equals .maxj Δj . It is a typical quantity in the context of the empirical process. Since .Yi is a bounded random variable with its variance bounded by .δ = Me +n−1 , we may apply Bernstein’s inequality to .Δj (Serfling 1980, pp95): .
{ Pr(Δj ≥ t) ≤ 2 exp −
} nt 2 2δ + (2/3)t
for any .t > 0. Note that this upper bound does not depend on j and valid for any choices of j , .e. Put .t = Me + 8n−1 log n. After some simple simplifications, we have .
Pr(Δj ≥ Me + 8n−1 log n) ≤ 2 exp{−3 log n} = 2n−3 .
(4.11)
By Bonferroni’s inequality, we get .
Pr(max Δj ≥ Me + 8n−1 log n) ≤ j
E
(2n−3 ) = 2n−2 .
j
E −2 Because . ∞ n=1 (2n ) is finite, by Borel-Cantelli Lemma, we have .maxj Δj < −1 Me + 8n log n almost surely. Combining two bounds, we have shown .
sup{Fn (θ + e) − Fn (θ )} ≤ 2Me + n−1 + 8n−1 log n
(4.12)
θ
almost surely. Because .n−1 is a high order term compared to .n−1 log n, it can be easily absorbed into the latter and therefore removed from the upper bound to arrive at a simpler expression. For instance, the earlier inequality could have been proved with .8n−1 log n replaced by .7.9n−1 log n to leave a room to absorb .n−1 . u n The proof and hence the conclusion remain solid when .e is a value that depends on n. Because of this, .e is allowed to take an arbitrarily small value without invalidating the inequality. This is also the reason for not further absorbing .n−1 log n into .Me to get an even simpler expression. The upper bound has to be the sum of two terms. Here are more technical remarks on the proof. The first step of the proof, (4.10), breaks down the “continuous space” of .θ into a discrete space. Its validity is built on the monotonicity of .Fn (·). There is often a zero-probability event on which
4.2 Finite Normal Mixture Model with Unequal Variances
63
a probabilistic conclusion is invalid. The discretization step prevents the zeroprobability events from accumulating into a nonzero probability event. In addition, Borel–Cantelli lemma is almost the only tool for establishing an “almost sure” result. When it is directly employed, we typically assess the probability of exception for each n. If the sum of these probabilities over n is finite, then the almost sure result is established. Hence, a sharp inequality is often vital, and Bernstein’s inequality is very sharp for the mean of bounded random variables. What is the relevance of this lemma to the finite normal mixture model with unequal variances? For any given finite normal mixture model, with equal or unequal variances, its density function is bounded. Thus, this lemma is applicable. That is, a set of i.i.d. observations from any specific finite normal mixture model with unequal variances will not be clustered into a small neighborhood more severely than this lemma has asserted. This lemma is taken and revised from Chen et al. (2008). The original lemma has overly focused on finite normal mixture models. Here we have realized that the conclusion is generally applicable. The upper bound in Lemma 4.2 is established for each .e almost surely. Technically, it leaves a zero-probability event on which the upper bound is invalid for each value of .e. The union of these zero-probability events for .e in an interval, say .(0, 1], does not have to be a zero-probability event. To avoid this problem, we take advantage that the quantity to be bounded is monotone in .e. We show the result holds uniformly in .e. Lemma 4.3 The upper bound given in Lemma 4.2 after a minor alteration: .
sup {Fn (θ + e) − Fn (θ )} ≤ 2Me + 10n−1 log n.
θ∈R
holds uniformly for .e > 0 almost surely. Proof Let .N = n/(2M log n), and we pretend as if N is an integer in this writing. Let .ej = (2j M log n)/n for .j = 1, 2, . . . N. It is seen that .2MeN = 1, which makes the upper bound of the inequality in this lemma larger than 1. Hence, the inequality is trivially true for any larger value of .e. Clearly, for all .e ∈ (ej −1 , ej ], we have {Fn (θ + e) − Fn (θ )} ≤ {Fn (θ + ej ) − Fn (θ )}.
.
Applying inequality (4.11) to .ej in exactly the same fashion, we get .
( ) Pr sup{Fn (θ + ej ) − Fn (θ )} ≥ 2Mej + 9n−1 log n ≤ 2n−19/8 . θ
Noting that .2M(ej − ej −1 ) ≤ n−1 log n, we have 2Mej + 9n−1 log n ≤ 2Me + 10n−1 log n
.
64
4 Estimation Under Finite Normal Mixture Models
for all .e ∈ (ej −1 , ej ]. Hence, .
Pr(supθ supe {[Fn (θ + e) − Fn (θ )] − [2Me + 10n−1 log n]} ≥ 0) ≤ 2n−19/8 N ≤ Cn−11/8
E for some C. Since . (n−11/8 ) < ∞, we have shown Fn (θ + e) − Fn (θ ) ≤ 2Me + 10n−1 log n
.
u n
uniformly in both .θ and .e almost surely.
The above proof has been burdened by tedious mathematical details on choosing various constants such as .10n−1 log n rather than .9n−1 log n. These are necessary for presenting a tight proof, but are far from important for general understanding. We will be less rigorous in the presentation of the next lemma. Let us now consider the same problem for multivariate F of dimension d. Assume that F is absolutely continuous. Let .Fj and .fj be its marginal c.d.f. and p.d.f. for the j th subpopulation, .j = 1, . . . , d. That is, if we denote the corresponding random vector as .X = (X1 , . . . , Xd )T , .Fj and .fj are c.d.f. and p.d.f. of .Xj . Let .M = min1≤j ≤d supx fj (x). Define the cubic of size .e located at d-dimensional vector .θ as A(θ, e) = (θ1 , θ1 + e] × (θ2 , θ2 + e] × · · · × (θd , θd + e].
.
Given a set of i.i.d. observations .x1 , . . . , xn from F , the empirical c.d.f. is defined in the same way to be Fn (x) = n−1
n E
.
1(xi ≤ x)
i=1
where the inequality is component-wise. For any subset A of .R d , we denote .Fn (A) as the proportion of observations falls into A. Lemma 4.4 The upper bound given in Lemma 4.3 remains valid for multivariate observations: .
sup {Fn (A(θ, e))} ≤ 2Me + 10n−1 log n.
θ ∈R d
holds uniformly for .e > 0 almost surely. Proof Denote the n i.i.d. vector observations as .xi , i = 1, . . . , n. In addition, write .xi = (xi1 , xi2 , . . . , xid )T . It is seen that .xi ∈ A(θ , e) implies .θ1 < xi1 ≤ θ1 + e. Clearly, .xi1 , i = 1, . . . , n are i.i.d. observations from population .F1 . Thus, Lemma 4.3 is applicable. Applying Lemma 4.3, we get
4.2 Finite Normal Mixture Model with Unequal Variances
.
65
sup {Fn (A(θ, e))} ≤ 2{sup f1 (x)}e + 10n−1 log n.
θ ∈R d
The upper bound is tightened further by applying Lemma 4.3 to each of d components of .X. This proves the lemma. n u This lemma naturally shows that i.i.d. observations from a population with bounded density function spread out evenly almost surely. The upper bounds given in these lemmas are likely not tight. However, they are good enough for our purposes. They provide the quantitative information on how severe a penalty is needed so that the penalized likelihood leads to consistent estimation of the mixing distribution. These results in this subsection seem not so closely related to statistical analysis of finite mixture models. We will see that this is not the case soon, but not before settling another issue.
4.2.4 Selecting a Penalty Function We recommend the pMLE of the mixing distribution under the finite normal mixture model with unequal variances. The procedure starts with selecting a penalty function .pn (G) and define the penalized log likelihood l˜n (G) = ln (G) + pn (G).
.
(4.13)
We use the maximum point of .l˜n (G) with at most m subpopulations as the estimate of G: ˜ = arg max l˜n (G). G
.
(4.14)
The technical lemmas in the last subsection enable us to determine the properties ˜ consistent. pn (G) must have to make .G For simplicity, we require the penalty function of have the following three properties. Once more, the mixing distribution G has m support points with the j th subpopulation mean and variance given by .(θj , σj2 ). E P40.1. Additivity: .pn (G) = m j =1 p˜ n (σj ). P40.2. Uniform upper bound: .supσ >0 [p˜ n (σ )]+ = o(n); Individual lower bound: .p ˜ n (σ ) = o(n) for any given .σ > 0. P40.3. Sufficiently severe: .p˜ n (σ ) < 4(log n)2 log(σ ) for .σ < n−1 log n when n is large enough.
.
The first property is not restrictive. It is in fact unnecessary if merely for the sake of consistently estimating the mixing distribution. Yet it makes the consistency
66
4 Estimation Under Finite Normal Mixture Models
discussion simple. It also makes the numerical problem straightforward. Thus, we do not try to use more general penalty functions. The uniform upper bound in P40.2 is nearly necessary. Otherwise, the penalty may inflate the value of the likelihood of an arbitrary mixing distribution by a size of order n. Recall that the differences between the sizes of the log-likelihood value at any two specific mixing distributions are of order n. Thus, without this upper bound restriction, the penalty may take over the likelihood. That is, the penalized MLE will be determined solely by the penalty, not by the likelihood. This clearly spells inconsistency. The lower bound in P40.2 is also nearly necessary. Otherwise, the penalty will be able to reduce the log-likelihood at an arbitrary mixing distribution by a size of order n. The true mixing distribution and its neighborhood can therefore be disqualified to become the pMLE purely due to the overly severe penalty they may receive. Property P40.3 requires the size of penalty is large enough to stop .σj ≈ 0 in the penalized MLE of G. The exact utility of this property will be clear in the proof of the consistency of the pMLE. The following is an example of .p˜ n with the required properties: p˜ n (σ ) = −σ −2 − log σ 2 .
.
This penalty function goes to negative infinite when .σ → 0 or .σ → ∞. It is minimized when .σ = 1. The upper bound condition in .P40.2 on .[p˜ n (σ )]+ is certainly satisfied. At any given .σ , .p˜ n (σ ) takes a value does not depend on n. Thus, the lower bound condition in .P40.2 is also satisfied. As for P40.3, we note that when .σ < n−1 log n and n is large, .−σ −2 < n log σ . Hence, this condition is also satisfied. This specific example of the penalty function is particularly meaningful as it represents a prior Gamma distribution placed on .σ −2 . This form of penalty is very convenient when the EM algorithm will be used for numerical computation. In Chen et √ al. (2008), penalty functions in the form of .an p˜ n (σ ) with .an = 1/n and .an = 1/ n were found effective in simulation studies. Most recently, we have seen applications where n is much larger than what (Chen√et al. 2008) had experimented. In view of new experience, we recommend .an = 1/ n instead. In applications, it is best to have the penalty chosen so that the pMLE is scale invariant. For instance, if .sn2 is the sample variance, then { 2} s sn2 .p ˜ n (σ ) = − 2 + log n2 σ σ will be a statistically sensible choice. For mathematical consideration, we will only work on penalty functions without random components. The technical results can be easily extended to allow some level of randomness in the penalty function. The pMLE of G is consistent when the penalty function satisfied P40.1–3. It is hard for an inexperienced to keep a whole picture to understand a lengthy proof. We divide the proof into three big blocks to enhance the comprehension.
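As a rough illustration of how such a penalty acts, the sketch below evaluates a penalized log-likelihood for a two-component normal mixture in which one component variance shrinks toward zero; the penalty pulls the criterion back down as σ → 0. This is only our own reading of the scale-invariant recommendation, with a_n = 1/√n assumed; it is not code from the book.

```python
import numpy as np
from scipy.stats import norm

def penalized_log_likelihood(x, w, mu, sigma, a_n=None):
    """ln(G) plus a scale-invariant penalty of the form
    a_n * sum_j { -s_n^2/sigma_j^2 - log(sigma_j^2/s_n^2) },
    with a_n = 1/sqrt(n) by default (our assumed choices)."""
    n = len(x)
    s2 = np.var(x)
    if a_n is None:
        a_n = 1.0 / np.sqrt(n)
    dens = norm.pdf(x[:, None], mu[None, :], sigma[None, :]) @ w
    log_lik = np.sum(np.log(dens))
    penalty = a_n * np.sum(-s2 / sigma**2 - np.log(sigma**2 / s2))
    return log_lik + penalty

x = np.random.default_rng(0).normal(size=200)
w, mu = np.array([0.5, 0.5]), np.array([0.0, x[0]])
for s2_small in [1.0, 1e-2, 1e-4, 1e-6]:
    sigma = np.array([1.0, np.sqrt(s2_small)])
    # the penalized criterion decreases as the second variance degenerates
    print(s2_small, penalized_log_likelihood(x, w, mu, sigma))
```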
4.2 Finite Normal Mixture Model with Unequal Variances
67
Before the formal proof, we restate that there is an i.i.d. sample of size n from a finite normal mixture model with unequal variances. The true mixing distribution is denoted as .G∗ . Let .K ∗ = E∗ {log f (X; G∗ )} where the expectation is with respect to the true mixture distribution. Let .M = supx f (x; G∗ ). The next few sections will give detailed proofs for .m = 2 with .G∗ having two distinct support points and positive mixing probabilities. The proof will then be upgraded so that even when .G∗ degenerates so that it has only a single support point, the pMLE obtained in the space of mixing distributions with at most two support points remains consistent. Proving consistency for the most general case in a single strike can make the presentation even harder for comprehension.
4.2.5 Consistency of the pMLE, Step I We now choose a small enough positive constant .e0 such that (1) (2) (3) (4)
e0 < 1; 5Me0 (log e0 )2 ≤ 1; 2 ∗ .(log e0 ) + log(e0 ) ≥ 4 − 2K ; 2 .(2 + log e0 ) ≥ 16. . .
When .e0 ↓ 0, these four conditions are all satisfied. Hence, it is always possible to find such an .e0 . The exact forms of these requirements are not essential, but they are designed to make the proof presentation simple. The requirement (4) will not be needed until late in the proof. We first consider the situation where the mixing distribution has at most .m = 2 support points and the true mixing distribution .G∗ has two distinct support points with positive mixing proportions. That is, the mixture model is specified to be f (x; G) = π1 φ(x; μ1 , σ12 ) + π2 φ(x; μ2 , σ22 )
.
(4.15)
with .π1 , π2 ≥ 0 and .π1 + π2 = 1. This model contains mixing distributions with a single support point. With the constant .e0 chosen, we divide the parameter space into three nonoverlapping areas: Γ1 = {G : σ1 ≤ σ2 ≤ e0 }. Γ2 = {G : σ1 < τ0 , σ2 > e0 } for some constant .τ0 < e0 to be specified. c .Γ3 = {Γ1 ∪ Γ2 } . . .
˜ almost surely does not fall into .Γ1 . More We first show that the pMLE, .G precisely, we have the following theorem. Theorem 4.2 Given an i.i.d. sample from the finite normal mixture model (4.15) with the true mixing distribution .G∗ . Let .l˜n (G) be the penalized likelihood defined ˜ is the pMLE given by (4.13). Suppose the penalty function satisfies P40.1–3, and .G .m = 2. Then
68
4 Estimation Under Finite Normal Mixture Models
.
sup{l˜n (G) : G ∈ Γ1 } − ln (G∗ ) → −∞
almost surely as .n → ∞. Proof Define .A1 = {i : |xi − θ1 | < |σ1 log σ1 |}, and .A2 = {i : |xi − θ2 | < |σ2 log σ2 |}. For any index set, say A, we define ln (G; A) =
E
.
log f (xi ; G).
i∈A
With this convention, we may partition the entries in .ln (G) to arrive at ln (G) = ln (G; A1 ) + ln (G; Ac1 A2 ) + ln (G; Ac1 Ac2 ).
.
Let us examine the asymptotic order of these three terms. Denote the number of observations in set A as .n(A). Since the mixture density function is bounded by −1 .σ 1 for .G ∈ Γ1 , we get ln (G; A1 ) ≤ −n(A1 ) log(σ1 ).
.
Applying Lemma 4.2 with .e = σ1 log(1/σ1 ), which is positive, we get that almost surely, n(A) ≤ −2nMσ1 log(σ1 ) + 10 log n.
.
Hence, ln (G; A1 ) ≤ 2nMσ1 (log σ1 )2 − 10(log n)(log σ1 ).
.
(4.16)
In the above inequality, the upper bound is dominated by the second term when σ1 → 0 and n is fixed. One may also notice that the upper bound goes to infinite when .σ < 1. By P40.3, we have .p˜ n (σ1 ) < 4(log n)2 log(σ ) when .σ1 < n−1 log n. Based on (4.16) and this property, almost surely,
.
ln (G; A1 ) + p˜ n (σ1 ) ≤ 2nMσ1 (log σ1 )2 − 2{5(log n) − 2(log n)2 } log(σ1 )
.
≤ 2nMσ1 (log σ1 )2 . When .e0 > σ1 > n−1 log n, we have .10(log n)(log σ1 ) + pn (σ1 ) = o(n) with .o(n) does not depend on .σ1 or other parameter values. Hence, we have ln (G; A1 ) + p˜ n (σ1 ) ≤ 2nMσ1 (log σ1 )2 + o(n).
.
This upper bound will be used as a unified upper bound. We may similarly work on .A2 and get a parallel upper bound:
4.2 Finite Normal Mixture Model with Unequal Variances
69
ln (G; Ac1 A2 ) + p˜ n (σ2 ) ≤ 2nMσ2 (log σ2 )2 + o(n).
.
Our choice of sufficiently small .e0 implies .σj (log σj )2 ≤ e0 (log e0 )2 for both .j = 1, 2. Two upper bounds are subsequently replaced by .2nMe0 (log e0 )2 and omiting the redundant .o(n) term for simplicity. For observations falling outside both .A1 and .A2 , their log-likelihood contributions are bounded by .
log{
π1 π2 1 φ(− log σ1 ) + φ(− log σ2 )} ≤ − log(e0 ) − (log e0 )2 σ1 σ2 2
which is negative for the specified small enough .e0 . At the same time, it is easy to show that almost surely as .n → ∞, n(Ac1 Ac2 ) ≥ n − {n(A1 ) + n(A2 )} ≥
.
n . 2
Hence, we get the third bound n 1 ln (G; Ac1 Ac2 ) ≤ − {log(e0 ) − (log e0 )2 }. 2 2
.
Combining the three bounds and recalling the choice of .e0 , we conclude that when .G ∈ Γ1 , l˜n (G) = {ln (G; A1 ) + p˜ n (σ1 )} + {ln (G; Ac1 A2 ) − p˜ n (σ2 )} + ln (G; Ac1 Ac2 )
.
≤ 5Mne0 (log e0 )2 −
n 1 { (log e0 )2 + log(e0 )} 2 2
n (4 − 2K ∗ ) 2 = n(K ∗ − 1). ≤ n−
The last a few inequalities are due to the restrictions placed on the size of .e0 . There is a big room for the size of .e0 within which the inequalities remain valid. Hence, do not struggle to understand the specific size choices. By the strong law of large numbers, .n−1 l˜n (G∗ ) → K ∗ almost surely. The last inequality is simplified to .
sup l˜n (G) − l˜n (G∗ ) ≤ −n → −∞ G∈Γ1
almost surely as .n → ∞. This completes the proof.
n u
When both .σ1 , σ2 < e0 for a very small value of .e0 , the likelihood contribution of each observation in .A1 ∪ A2 is very large. However, Lemma 4.3 shows that the number of observations in .A1 ∪ A2 is not high. The penalty function is made severe
70
4 Estimation Under Finite Normal Mixture Models
enough to fully counter their effect. These considerations are reflected in the bounds on .ln (G; A1 ) + p˜ n (σ1 ) and .ln (G; Ac1 A2 ) + p˜ n (σ2 ). Other than the factor n, the coefficients in these two bounds go to zero as .e0 → 0. The damage on .l˜n (G) suffered from having .σ1 , σ2 < e0 is substantial as reflected in the bound to .ln (G; Ac1 Ac2 ). The coefficient of n in this bound goes to negative infinite as .e0 → 0. Thus, when .e0 is small enough, the penalized log-likelihood goes to negative infinite, compared to baseline likelihood: .l˜n (G∗ ). This is the technical thinking behind the theorem.
4.2.6 Consistency of the pMLE, Step II This section follows closely the last section. We have seen that there is a small enough constant .e0 serving as an almost sure lower bound for the subpopulation variances in the pMLE when .m = 2. More precisely, it is almost surely impossible that both subpopulation variances in the pMLE to fall below this lower bound. We now move to the task of showing that the pMLE of G is almost surely not inside .Γ2 , for an appropriately chosen .τ0 . The choice may depend on .G∗ but not on the sample size n. We now introduce a function g(x; G) = π1 φ(x; θ1 , 2e02 ) + π2 φ(x; θ2 , σ22 ).
.
(4.17)
We will include subdistribution G by allowing .π1 + π2 < 1. This function is a constant with respect to .σ1 . This function is easier to work with compared to the one adopted in Chen et al. (2008). On the restricted mixing distribution space .Γ2 , .σ2 has a non-zero lower bound. Thus, .g(x; G) is well defined over .Γ2 and beyond. That is, we may expand .Γ2 to include sub-distributions allowing .π1 + π2 < 1 and include .σ1 = 0 for the domain of .g(x; G). Let us call this space .Γ¯2 . It is compact in terms of the distance measure { { .Dkw (G1 , G2 ) = |G1 (θ, σ ) − G2 (θ, σ )| exp(−|θ | − σ )dθ dσ. This distance is a slight generalization of .Dkw given in Chap. 2. Without loss of generality, we let .τ0 be small enough such that the true mixing distribution .G∗ /∈ Γ2 . This is possible when .G∗ has two distinct support points and positive mixing proportions. Hence, by Jensen’s inequality E∗ log{g(X; G)/f (X; G∗ )} < 0
.
strictly because .f (x; G∗ ) is not one of .g(X; G)’s. Using a slightly different symbol from .l, let us now define a log-likelihood-like function
4.2 Finite Normal Mixture Model with Unequal Variances
ln (G) =
n E
.
71
log{g(xi ; G)}
i=1
on .Γ¯2 . This function mimics the log-likelihood function and has some similar properties. In particular, by the strong law of large numbers and the Jensen’s inequality just established, we have n−1 {ln (G) − ln (G∗ )} → E∗ log{g(X; G)/f (X; G∗ )} < 0.
.
Formally in the next lemma, we show that this inequality holds uniformly on .Γ¯2 . Lemma 4.5 Give a set of n i.i.d. from .f (x; G∗ ), the function .g(x; G) Eobservations n defined by (4.17) and .ln (G) = i=1 log{g(xi ; G)}, we have ln (G) − ln (G∗ ) ≤ −nδ(e0 )
.
(4.18)
for some .δ(e0 ) > 0 almost surely. Proof Let ¯ ∈ Γ¯2 , D(G, ¯ G) < e}. g(x; G, e) = sup{g(x; G) : G
.
Because .σ2 > τ0 > 0 for all .G ∈ Γ¯2 , we find g(x; G, e) ≤ 1 + τ0−1 .
.
Thus, we have .E∗ {log g(X; G, e)} < ∞. It is apparent that .E∗ {log g(X; G, e)} > −∞. Consequently, for any .e > 0, the expectation .E∗ {g(X; G, e)/f (X; G∗ )} is also well defined. Let .e ↓ 0 and according to monotone convergence theorem, we find .
lim E∗ {g(X; G, e)/f (X; G∗ )} ≤ E∗ {g(X; G)/f (X; G∗ )} < 0. e↓0
Next, note that .Γ¯2 is compact based on distance .Dkw (·, ·), we may find a finite number of G and an .e such that Γ¯2 ⊂ ∪Jj=1 {G : Dkw (G, Gj ) ≤ ej }
.
and for each .j = 1, . . . , J , E∗ {g(X; Gj , ej )/f (X; G∗ )} < 0.
.
This leads to the claim of this lemma:
72
4 Estimation Under Finite Normal Mixture Models
ln (G) − ln (G∗ ) ≤ −nδ(e0 )
.
for some .δ(e0 ) > 0 whose size is affected by the size of .e0 .
u n
We now connect .ln (G) to .ln (G) on the space .Γ¯2 and refine this result to obtain the major result of this subsection. Theorem 4.3 Assume the same conditions as in Theorem 4.2. As .n → ∞, we have almost surely that .supG∈Γ2 ln (G) − ln (G∗ ) → −∞. Proof Recall that we defined .A1 = {i : |xi − θ1 | ≤ σ1 log(1/σ1 )}. In addition, for each mixing distribution .G ∈ Γ2 , .σ1 ≤ e0 < 1. It is easily verified that for .i ∈ A1 , we have f (xi ; G) ≤ (1/σ1 )g(xi ; G).
.
Therefore, the log-likelihood contribution of each observation in .A1 is .
log{f (xi ; G)} ≤ log(1/σ1 ) + log{g(xi ; G)}.
For observations not in .A1 , their observed values satisfy .|x − θ1 | ≥ |σ1 log σ1 |. For these x values, we have .
(x − θ1 )2 1 (x − θ1 )2 1 (x − θ1 )2 2 (log σ + ) ≥ + (log σ1 )2 . ≥ 1 2 2 2 4 4 4σ1 4e0 2σ1
Consequently, we find .
{ (x − θ1 )2 } { 1 } { (x − θ1 )2 } 1 = exp − × exp − (log σ1 )2 − log σ1 exp − 2 2 4 σ1 2σ1 4e0 { (x − θ1 )2 } } { 1 = exp − × exp − (log σ1 + 2)2 + 1 . 2 4 4e0
Let us point out that the factor .1/σ1 on the left hand side is rewritten as exp(− log σ1 ) in the first line. Recall that we required .e0 small enough such that 2 .(log e0 + 2) ≥ 16. This condition ensures .
.
1 − (log σ1 + 2)2 + 1 ≤ −3 4
when .σ1 ≤ e0 . Hence, we get .
{ (x − θ1 )2 { (x − θ1 )2 } } 1 ≤ exp − exp − −3 . 2 2 σ1 2σ1 4e0
4.2 Finite Normal Mixture Model with Unequal Variances
73
Consequently, for these x values not in set .A1 , φ(x; θ1 , σ12 ) = √
.
{ (x − θ1 )2 } exp − 2σ12 2π σ1 1
} { (x − θ1 )2 1 ≤ √ exp − −3 2 4e0 2π √ = 2 exp(−3)φ(x; θ1 , 2e02 ) ≤ φ(x; θ1 , 2e02 ). √ Be aware that we need . 2 exp(−3) < 1, which explains the specific upper bound .−3 given earlier. The value .−3 itself does not have a particular significance. Next, we get f (x; G) = π1 φ(x; θ1 , σ12 ) + π2 φ(x; θ2 , σ22 )
.
≤ π1 φ(x; θ1 , 2e02 ) + π2 φ(x; θ2 , σ22 ) = g(x; G). In summary, when .i /∈ A1 , its log-likelihood contributions .
log f (xi ; G) ≤ log{g(xi ; G)}.
Combining two cases on observations in and not in .A1 , we find ln (G) ≤ n(A1 ) log(1/σ1 ) +
E
.
log{g(xi ; G)}.
This leads to .
sup l˜n (G) ≤ sup {ln (G) + p˜ n (σ2 )} + sup {n(A1 ) log(1/σ1 ) + p˜ n (σ1 )}. G∈Γ2
G∈Γ2
G∈Γ2
Applying the bound developed in the proof of .Γ1 in the last subsection, the property of the penalty function .p˜ n (·) ensures .
sup {n(A1 ) log(1/σ1 ) + p˜ n (σ1 )} < 2Mnτ0 (log τ0 )2 G∈Γ2
almost surely. Hence, the proof of the theorem is reduced to show .
sup{ln (G) + p˜ n (σ2 )} − l˜n (G∗ ) + 2Mnτ0 (log τ0 )2 < 0 Γ2
(4.19)
74
4 Estimation Under Finite Normal Mixture Models
almost surely as .n → ∞. Because we make the penalty function satisfy [pn (σ2 )]+ = op (n), it is sufficient to show that
.
.
sup ln (G) − ln (G∗ ) ≤ −δn
(4.20)
Γ2
for some .δ > 2Mτ0 (log τ0 )2 almost surely. This is implied by Lemma 4.5 when .τ0 is chosen sufficiently small, after the choice of .e0 . u n With the conclusion in this lemma, we have successfully excluded the possibility that the pMLE of G falls in .Γ1 ∪ Γ2 . We finish up the consistency proof in the next subsection.
4.2.7 Consistency of the pMLE, Step III Now we consider the pMLE of G when G is confined to .{Γ1 ∪ Γ2 }c . This is a space on which .σ1 > e0 > 0 and .σ2 > τ0 > 0. On this space, we have .
lim
min(|θ1 |,|θ2 |)→∞
f (x; G) = 0
(4.21)
at any x. Together with the properties of normal distribution, it makes the result of Theorem 2.2 applicable and gives us the following. Theorem 4.4 Given an i.i.d. sample from the finite normal mixture model (4.15) with the true mixing distribution .G∗ having two distinct support points. Let .l˜n (G) be the penalized likelihood defined by (4.13). Suppose the penalty function satisfies ˜ is the pMLE given .m = 2. Then .G ˜ → G∗ almost surely as .n → ∞. P40.1–3, and .G Proof We apply Kiefer–Wolfowitz theorem (Theorem 2.2) under condition C20.0 and C20.1. Condition C20.0 requires the smooth extension of the subpopulation density function to the boundary of the parameter space. This has been demonstrated by (4.21). Condition C20.1 is the generalized Jensen’s inequality. Its validity is implied by verifying .E∗ [f (x; G, e)/f (x; G∗ )]+ < ∞. This is clearly true when the subpopulation density is normal distribution with .σ bounded away from 0. Finally, the condition P40.3, .p˜ n (G) = o(n), ensures that the consistency proof of Kiefer-Wolfowitz (Theorem 2.2) remains valid for the penalized likelihood. This completes the proof. u n
4.3 Consistency When G∗ Has Only One Subpopulation The proof we presented in the last section is not completely satisfactory because we assumed that .G∗ have two distinct support points. In some applications, we are not
4.3 Consistency When G∗ Has Only One Subpopulation
75
sure how many distinct support points .G∗ actually have. We now upgrade the proof to allow .G∗ to have actually only one support point. We retain the partition of the mixing distribution space: .Γ1 , .Γ2 and .Γ3 . We first claim that conclusion of Theorem 4.2 remains valid. In this proof, the only step that .G∗ involved is the size of .n−1 ln (G∗ ) is .K ∗ as .n → ∞. So the pMLE ˜ /∈ Γ1 almost surely. The same as before. .G When .G∗ has only one support point, say .(μ∗ , σ 2∗ ). It is possible to write f (x; G∗ ) = 0 × φ(x; θ1 , 2e02 ) + 1 × φ(x; θ2 , σ22 )
.
which is the form (4.17). Therefore, .G∗ ∈ Γ2 although it is fitted in a bit artificially. This observation invalidates the claim that E∗ log{g(X; G)/f (X; G∗ )} < 0
.
(4.22)
for any .G ∈ Γ2 . And (4.22) is a key step of the previous proof. To fix this problem, let .ε > 0 be some arbitrarily small constant (not random, does not depend on n) to be clarified. Define an .ε-sized ball containing the support point of .G∗ : B(ε) = {(μ, σ 2 ) : (μ − μ∗ )2 + (σ − σ ∗ )2 < ε2 }.
.
We further partition the space .Γ2 defined earlier into Γ21 = {G : σ1 < τ0 , σ2 > e0 , π1 > ε}, Γ22 = {G : σ1 < τ0 , σ2 > e0 , π1 < ε, (μ2 , σ2 ) /∈ B(ε)}, .Γ23 = {G : σ1 < τ0 , σ2 < e0 , π1 > ε, (μ2 , σ2 ) ∈ B(ε)}. . .
When .τ0 is sufficiently small, we have clearly .G∗ /∈ Γ21 . Hence (4.22) holds on for all .G ∈ Γ21 . The proof of Lemma 4.5 is applicable when .Γ2 is replaced by .Γ21 and therefore the conclusion holds on .Γ21 . The subspace .Γ22 has explicitly excluded .G∗ by requiring .(μ2 , σ2 ) /∈ B(ε). The proof of Lemma 4.5 is therefore also applicable when .Γ2 is replaced by .Γ22 and therefore the conclusion holds on .Γ22 . Finally, we find that the mixing distributions in .Γ23 are already within .O(ε) KW-distance from .G∗ . Since .ε can be chosen arbitrarily small, this indicates that the maximum likelihood point in space .Γ23 is a consistent estimator of .G∗ . The maximum likelihood point in space .Γ3 remains consistent for .G∗ . In summary, we have strengthened the conclusion of the last theorem. Theorem 4.5 Given an i.i.d. sample from the finite normal mixture model (4.15) with the true mixing distribution .G∗ having at most two distinct support points. Let .l˜n (G) be the penalized likelihood defined by (4.13). Suppose the penalty function ˜ is the pMLE given .m = 2. Then .G ˜ → G∗ almost surely as satisfies P40.1–3, and .G .n → ∞.
76
4 Estimation Under Finite Normal Mixture Models
The consistency conclusion of the pMLE has been utilized in developing the EM test for finite normal mixture models. Indeed, the consistency when the assumed order m larger than the “true” order of .G∗ is of particular importance. This section helps to patch up the missing point in the published proof.
4.4 Consistency of the pMLE: General Order We have given a proof for the consistency of the penalized MLE under the assumption that the order of the true mixture model is no more than .m = 2. There is no technical difficulty to prove the consistency when the true order is known to be no more than m for any m. We omit much of the details but give an outline as follows. Note that I have weakened the conclusion in the spirit of the remark in the last section. With a general upper bound on the order of the mixture model m, we start with defining Γ1 = {G : max σj ≤ e0 }.
.
1≤j ≤m
which is a subspace of the mixing distribution space .Gm . The same technique used ˜ /∈ Γ1 for some choice for .m = 2 can be used to show that almost surely we have .G of .e0 > 0. After which, we define Γ2 = {G :
.
max
1≤j ≤m−1
σj ≤ e0 , σm > τ0 }
for some .τ0 > 0, The same technique used for .m = 2 can be used to show that ˜ /∈ Γ2 for some choice of .τ0 > 0. In this step, the function .g(x; G) almost surely .G used in the proof for .m = 2, (4.17), needs some adjustment accordingly. It becomes routine to then define Γk = {G :
.
max
1≤j ≤m−k
σj ≤ e0 , σm−i > τi , i = 0, 1, . . . , k − 2}
for some constants .τk−2 chosen consecutively. We should then show that almost ˜ /∈ Γk for some choice of .τk−2 > 0. The logic is repeated for .k = surely .G 2, 3, . . . , m − 1. The above steps lead to the conclusion that the pMLE almost surely belongs to Γm = {G : min σj ≥ τm }
.
1≤j ≤m
for some .τm > 0. On the space .Γm , Kiefer–Wolfowitz Theorem becomes applicable. Hence, we get the consistency result as follows.
4.5 Consistency of the pMLE Under Multivariate Finite Normal Mixture Models
77
Theorem 4.6 Given an i.i.d. sample from the finite normal mixture model (4.21) with the true mixing distribution .G∗ whose number of support points is at most m. Let .l˜n (G) be the penalized likelihood defined by (4.13). Suppose the penalty ˜ is the pMLE given general m support point as function satisfies P40.1–3, and .G ∗ ˜ defined by (4.13). Then .G → G almost surely as .n → ∞. The reason that the consistency holds when the true order “.m∗ ” is smaller than m is the same as when .m = 2.
4.5 Consistency of the pMLE Under Multivariate Finite Normal Mixture Models Does the consistency of the pMLE generates to multivariate finite normal mixture models? This problem has already been discussed with positive answer to the corresponding question. Let .ϕ(x; θ , Σ) be the multivariate normal density with .(d × 1) mean vector .θ and .d × d covariance matrix .Σ. That is, denote 1 ϕ(x; θ , Σ) = (2Π i)−d/2 |Σ|−1/2 exp{− (x − θ)T Σ −1 (x − θ)}. 2
.
A d-dimensional random vector X has a multivariate finite normal mixture distribution with p subpopulations if its density function is given by f (x; G) = π1 ϕ(x; θ 1 , Σ 1 ) + π2 ϕ(x; θ 2 , Σ 2 ) + · · · + πm ϕ(x; θ m , Σ m ).
.
(4.23)
Formally, let .σj (k1 , k2 ) be the (.k1 , k2 )th element of .Σ j and let T λj = (θ T j , σj (k1 , k2 ) : k1 = 1, . . . , d, k2 = 1, . . . , k1 )
.
which is a vector with .d + d × (d + 1)/2 elements. For convenience, let G also stand for its relevant parameters, namely, T T G = (π1 , . . . , πm , λT 1 , . . . , λp ) .
.
or more intuitively, we may even write it as G = (π1 , . . . , πm , θ 1 , . . . , θ m , Σ 1 , . . . , Σ p ).
.
Let .x1 , x2 , . . . , xn be a random sample from (4.23). Then ln (G) =
n E
.
i=1
log f (xi ; G)
78
4 Estimation Under Finite Normal Mixture Models
is the log-likelihood function. For the same reason as in the univariate case, even if the determinant of the variance–covariance matrices .|Σ j | > 0 for all j , .ln (G) is unbounded at .θ 1 = x1 when .|Σ 1 | gets arbitrarily small. We recommend the use of penalized log-likelihood function in the form pln (G) = ln (G) + pn (G)
.
which is based on the same consideration as for the univariate case. The requirements on the penalty function are nearly identical. Ep C1. .pn (G) = j =1 p -n (Σ j ), where .p -n (·) is a function of .d × d matrix. C2. At any fixed G such that .|Σ j | > 0 for all .j = 1, 2, . . . , m, we have .pn (G) = o(n), and .supG max{0, pn (G)} = o(n). In addition, . pn (G) is differentiable with respect to G and as .n → ∞, ' 1/2 ) at any fixed G such that .|Σ | > 0 for all .j = 1, 2, . . . , m. .pn (G) = o(n j C3. For large enough n, .p -n (Σ) ≤ 4(log n)2 log |Σ|, when .|Σ| is smaller than −2d for some .c > 0. .cn Theorem 4.7 Assume that the true density function ∗
∗
f (x; G ) =
.
m E
πj∗ ϕ(x; θ ∗j , Σ ∗j )
j =1
satisfies .πj∗ > 0, .|Σj∗ | > 0, and .(θ ∗j , Σ ∗j ) /= (θ ∗k , Σ ∗k ) for all .j, k = 1, 2, . . . , m∗ and .j /= k and .m∗ ≤ m. ˜ n is a mixing Assume that the penalty function .pn (G) satisfies C1–C3 and .G distribution with at most m subpopulations satisfying ˜ n ) − pln (G∗ ) ≥ c > −∞, pln (G
.
for all n. Then, as .n → ∞, d ∗ ˜ n −→G , G
.
almost surely. This result is proved along the same line as for the univariate case. Interested readers may find the full proof given by Chen and Tan (2009). One technical remark is needed here. The conclusion of Lemma 2 in Chen and Tan (2009) is solid, but the proof was pointed out by Alexandrovich (2014) to be less than tight. He is very kind to provide an alternative rigorous proof. In view of the newer approach, a much simpler proof is possible.
4.5 Consistency of the pMLE Under Multivariate Finite Normal Mixture Models
79
4.5.1 Some Remarks It should be remarked here that the properties of the pMLE under finite normal mixture model were discussed by Ciuperca et al. (2003). The construction of function .g(x; G) in this chapter is largely motivated from their proof. The specific form of our .g(x; G) has evolved substantially, but the spirit remains the same. Two lemmas are, however, novel and crucial for the consistency proofs. Another line of approach for consistency estimation of the mixing distribution under finite normal mixture model is via constrained MLE. Suppose the true mixing distribution has .σj /σi ≥ e > 0 for some known .e. Under this condition, Hathaway (1985) showed that the constrained MLE of G is consistent. In fact, the result remains valid even if .e decreases with the sample size n at some rate. The result on constrained MLE is cited as often as the penalized MLE if not a lot more. In a specific application, which one would we recommend? I find that even when a paper cites the result of Hathaway (1985) in their data analysis, it may in fact totally ignore the requirement of putting up a lower bound .e. At the same time, there are even more papers that apply the straightforward MLE in their data analysis. By straightforward MLE, the users apply EM algorithm to locate local maxima of the likelihood function with multiple initial values. If a local maximum point contains a degenerate subpopulation variance, this maximum is dismissed. The local maximum achieving the highest likelihood value without any degenerate subpopulation variances is regarded as the valid MLE of the mixing distribution. One has to suggest that the straightforward MLE is theoretically baseless. However, Chen and Tan (2009) have conducted extensive simulation studies in various settings to investigate the properties of the straightforward MLE and the pMLE. They found that there are hardly any differences between these two methods in terms of bias and variance of the parameter estimations. The penalized approach enjoys some advantage in terms of faster convergence of the EM iteration used for numerical computation. The EM iteration suffered from occasional breakdowns when computing the straightforward MLE. These breakdowns are not hard to spot by a diligent researcher. In view of these outcomes, one may also conclude easily that the constrained MLE will perform well in general even if loosely implemented. If a strict implement is required, the choice of the lower bound .e can be a delicate issue. The penalized approach has a similar issue, but we feel that it is less delicate. In their simulation, Chen and Tan (2009) employed pn (G) = −an
.
m { } E tr(Sx Σ −1 ) + log |Σ | j j
(4.24)
j =1
with .an = n−1 or .an = n−1/2 , where .Sx is the sample covariance matrix. Again, there are hardly any differences between these two choices of .an in terms of simulated mean square errors in their simulations.
80
4 Estimation Under Finite Normal Mixture Models
In the content of big data, these two choices start to show some meaningful differences. We recommend the pMLE with .an = n−1/2 for theoretical rigor and numerical efficiency.
4.6 Asymptotic Normality of the Mixing Distribution Estimation The consistency of the pMLE has been proved under the assumption that the order of the finite normal mixture model is known up to an upper bound. When the order is known, then mixing proportions, subpopulation means, and covariance matrices are all estimated consistently. The consistency implies that to the first order, the large sample properties of the pMLE under the finite mixture model are the same as the one when the parameter space is restricted into a small neighborhood of the true parameter value. ˆ n and .G∗ have the same number of subpopulations, all elements in .G ˆn Because .G converge to those of .G∗ almost surely. Furthermore, let Sn (G) =
.
n E ∂ log f (xi ; G)
∂G
i=1
be the vector score function at G. Let Sn' (G) =
.
n E ∂Sn (G) i=1
∂G
be the matrix of the second derivative of the log-likelihood function. At .G = G∗ , the normal mixture model is regular and hence, the Fisher information [ ] In (G∗ ) = nI (G∗ ) = −E{Sn' (G∗ )} = E {Sn (G∗ )}T Sn (G∗ )
.
is positive definite. Using classical asymptotic techniques and under condition C2 such that .pn' (G) = op (n1/2 ), we have ˆ n − G∗ = {−Sn' (G∗ )}−1 Sn (G∗ ) + op (n−1/2 ). G
.
ˆ n is an asymptotically normal and efficient estimator. Therefore, .G Theorem 4.8 Under the same conditions as in Theorem 4.7, as .n → ∞, √ ˆ n − G∗ } → N(0, I (G∗ )) n{G
.
in distribution.
4.6 Asymptotic Normality of the Mixing Distribution Estimation
81
Although claiming the asymptotic normality is easy, actual computation of the Fisher information in the current situation is not easy. This above result is therefore of only symbolic significance.
Chapter 5
Consistent Estimation Under Finite Gamma Mixture
We discussed the problem of consistent estimation of the mixing distribution under finite normal mixture models in the previous chapter. The main reason for the specific discussion just for this finite normal mixture is the phenomenon of unbounded likelihood function. This is also the case for all finite mixture of distributions from a location-scale family. See Tanaka and Takemura (2006) and Tanaka (2009). This is also the case for Gamma mixture. Consider the finite mixture of Gamma distribution with component density function given by f (x; r, θ ) =
.
x r−1 exp(−x/θ ) . θ r Γ (r)
As discussed in Example 3.4, when r and .θ are chosen such that .rθ = x, we have f (x; r, θ ) =
.
√ r x r−1 exp(−r) 1 . = √ Γ (r)(x/r)r 2Π x 1 + O(r −1 )
Let .r → ∞ while keeping .rθ = x, we find f (x; r, θ ) → ∞.
.
The above .Π is the mathematical constant. Suppose we have a set of i.i.d. observations from a finite Gamma mixture distribution of order .m = 2 and we wish to find a mixing distribution G at which point the log-likelihood function .ln (G) is maximized. To fix the idea, suppose one observed value is .x1 = 1 and let Gr = (1/2){(r, r −1 )} + (1/2){(1, 1)},
.
© Springer Nature Singapore Pte Ltd. 2023 J. Chen, Statistical Inference Under Mixture Models, ICSA Book Series in Statistics, https://doi.org/10.1007/978-981-99-6141-2_5
83
84
5 Gamma Mixture
assigning equal probabilities to two subpopulations. The above derivation can be used to show that .l(Gr ) → ∞ as .r → ∞. Yet the limit of this sequence of mixing distribution .Gr clearly is not the true mixing distribution. The unboundedness of the density function occurs when .r → ∞. Therefore, to restore consistency of the MLE, we may simply place an upper bound on r. Let .R0 be an arbitrarily large constant. Let the component parameter space under a Gamma distribution be Ω = {(r, θ ) : 0 < r < R0 , 0 < θ < ∞}.
.
Given any .x > 0, f (x; rk , θk ) =
.
x x rk −1 exp(− ) → 0 θ rk Γ (rk ) θk
for any parameter sequence in .Ω when .rk → 0, when .θk → ∞ or when .θk → 0. The density function is clearly continuous with respect to r and .θ in the interior of .Ω and can be continuously extended to its closure. Let .Gc be the space of distributions with m support points on .Ω. Define the ˆ c satisfying constrained MLE of G to be .G ˆ c ) = sup{ln (G) : G ∈ Gc } ln (G
.
where .ln (G) is the likelihood function based on n i.i.d. observations. Here the subscript is for constraint. We now give a consistency result for the constrained MLE. The result can be derived based on Theorems 2.2 or 3.2. Theorem 5.1 Given an i.i.d. sample of size n from a finite Gamma mixture, the ˆ c is consistent for the true mixing constrained MLE of the mixing distribution .G distribution .G∗ when .G∗ ∈ Gc . In addition, let .Gec be the mixing distributions that are at least an .e-distance from ∗ .G . Then, ln (G∗ ) − sup{ln (G) : G ∈ Gec } ≥ δn
.
for some .δ > 0 almost surely. In applications, it may not be so difficult to find a reasonable upper bound for the degree of freedom r. Our experience also indicates that even though the MLE is in theory not consistent, the unboundedness does not bother us in applications. Any reasonable computer code will find a respectable local maximum point, which gives a good fit to the data. For the sake of rigor, we still prefer a penalized approach to ensure the consistency. In addition, the consistency proof demands rather nonconventional technical manipulations. This also makes such a chapter worthwhile. This is the scheme of this chapter.
5.1 Upper Bounds on f (x; r, θ)
85
5.1 Upper Bounds on f (x; r, θ) To design a penalized likelihood approach, it is essential to know how fast the likelihood increases when the parameter value approaches certain irregular points. In this section, we develop a few inequalities regarding .log f (x; r, θ ). For this purpose, we first give a side inequality. Lemma 5.1 Let .ε ∈ (0, 1). For any .ρ such that .|ρ| > ε, .
exp(ρ) − (1 + ρ) > (1/3)ε2 .
(5.1)
Proof Note that .g(ρ) = exp(ρ) − (1 + ρ) is an increasing function of .ρ when ρ > ε > 0. Hence,
.
g(ρ) = exp(ρ) − (1 + ρ) ≥ (1/2)ρ 2 > (1/3)ε2
.
which leads to (5.1) for .ρ > ε. When .ρ < 0, function .g(ρ) decreases on .ρ ∈ (−∞, 0). For any .ρ < −ε, we have 1 1 g(ρ) − ε2 ≤ g(−ε) − ε2 3 3
.
1 ≤ exp(−ε) − 1 + ε − ε2 3 1 2 1 2 1 3 ≤ ε − ε + ε˜ . 2 3 6 1 = (ε2 + ε˜ 3 ) > 0. 6
(5.2)
Note that we have applied Taylor expansion in the third step, with .ε˜ ∈ (−ε, 0) being an intermediate value. This completes the proof of (5.1). u n The following lemma cites two very accurate bounds of the gamma function given by Li et al. (2007). Lemma 5.2 When .r > 1, we have (r − γ ) log r − (r − 1) < log Γ (r) < (r − 0.5) log r − (r − 1),
.
where the constants .γ = 0.577215 . . . is Euler–Mascheroni constant. Note that the difference between the upper and lower bounds is merely .(γ − 0.5) log r, which is relatively very small compared to .log Γ (r). When .0 < r < 1, it is seen { Γ (r) =
.
0
∞
{
1
x r−1 exp(−x)dx ≥ exp(−1) 0
x r−1 exp(−x)dx ≥ r −1 exp(−1).
86
5 Gamma Mixture
Namely, .log Γ (r) > − log r − 1 for r in this range. A crude but useful fact is that log Γ (r) has a finite lower bound. There is no r value at which .Γ (r) is very close to 0. Now we are ready to develop some inequalities for log-gamma density function. Let r and .θ be two positive numbers and denote the density function of the exponential distribution with mean .θ r by
.
h(x; r, θ ) =
.
x 1 exp(− ). rθ rθ
Exponential distribution family is part of Gamma distribution family with degree of freedom (or shape parameter) being 1. The following lemma gives bounds to the difference between a gamma density function and a specific exponential density function. Lemma 5.3 The gamma density function for .r ≥ 1 satisfies .
log f (x; r, θ ) − log h(x; r, θ ) ≤ γ log r
(5.3)
where .γ is the constant given in Lemma 5.2. √ √ Let .ρ = log(x) − log(rθ ). When .r > 4 and .|ρ| > 2 log r/ r, we have .
log f (x; r, θ ) ≤ log h(x; r, θ ) + γ {log r − log2 r}.
(5.4)
For any positive x, r and .θ , we have .
log h(x; r, θ ) ≤ − log x − 1 ≤ − log x.
(5.5)
Proof We first organize the difference between two log density functions log{f (x; . r, θ )} − log{h(x; r, θ )} x x )− } rθ rθ = r log(r) − log Γ (r) + (r − 1){ρ − exp(ρ)} = r log(r) − log Γ (r) + (r − 1){log(
= r log(r) − (r − 1) − log Γ (r) + (r − 1){1 + ρ − exp(ρ)} where .ρ = log(x) − log(rθ ) as defined in the lemma. It is easily seen that .1+ρ −exp(ρ) ≤ 0 for any value of .ρ. Hence, by Lemma 5.2, when .r > 1, we have .
log{f (x; r, θ )} − log h(x; r, θ ) ≤ r log r − (r − 1) − log Γ (r) ≤ γ log r.
This proves (5.3). √ √ We now prove the second inequality. Denote .εr = 2 log r/ r which takes a value between 0 and 1 when .r > 4. For .ρ in the range of consideration and .r > 4, applying Lemma 5.2, get
5.2 Consistency for Penalized MLE
87
log{f (x; . r, θ )} − log{h(x; r, θ )}.
(5.6)
= r log(r) − (r − 1) − log Γ (r) + (r − 1){1 + ρ − exp(ρ)} 1 ≤ r log(r) − (r − 1) + log Γ (r) − (r − 1)εr2 3 2(r − 1) log2 r ≤ γ log r − 3r ≤ γ (log r − log2 r).
(5.7)
This proves (5.4). The proof of inequality (5.5) is straightforward.
u n
The inequalities in the above lemma are simplified version of the ones given in Chen et al. (2016). In the process, we realized that there is a mistake in their proof. However, the general conclusion is not affected in any substantive way. The current upper bounds have coefficient increased slightly from .0.5 to .γ = 0.577215 . . .. It will not have any effect in subsequent proof. The current choice of .εr is arbitrary to a large degree, but it makes the upper bounds slightly tidier. Under a finite mixture model, one may choose one subpopulation/component parameter vector .(r1 , θ1 ) to boost the likelihood contributions of a few observations, and choose the other subpopulation parameter vector .(r2 , θ2 ) to maintain certain size of the contributions of other observations. This leads to the unbounded likelihood phenomenon. These two inequalities quantify these effects. One states that it is possible to boost the log-likelihood contribution of some observations to a level of .log r. The other inequality states that the effect to the log-likelihood contributions of other observations will be as low as .log r − log2 r. Choosing a large r may not boost the overall log-likelihood value when those super contributors are countered by some penalty. The information learned from these inequalities helps us to choose a penalty function of an appropriate magnitude to counter the inflationary effect. We introduce the penalized MLE and prove its consistency next.
5.2 Consistency for Penalized MLE We require the penalty function to satisfy the following two conditions: P50.1
For any mixing distribution G with m support points and subpopulation parameters .(rj , θj ), pn (G) =
m E
.
pn (rj )
i=1
P50.2
such√that (1) .pn (r) ≤ 0 and (2) .pn (r) ≤ − log2 (n) log r when .r > exp( n) for large n. At each given G, .pn (G) = o(n).
88
5 Gamma Mixture
For notational simplicity, we use .pn (·) for both functions of G and functions of r. This should not cause much confusion. The additive requirement on .pn (G) is not technical but for operational convenience. The first condition also requires √ the penalty to be sufficiently severe when r is large. The requirement .r > exp( n) can be reduced to .r > exp(nb ) for any .0 < b < 1. The setting .b = 0.5 makes the presentation easier to follow without losing much generality. The second condition requires the penalty to be moderate so that it does not take over the likelihood function itself. A penalty function such as pn (r) ≤ −
.
r n
√ satisfied these conditions. Clearly, when .r > exp( n), P50.1 is satisfied by this function. More convenient penalty functions will be suggested later. Given an i.i.d. sample .x1 , . . . , xn from a Gamma mixture model of order m, we define a penalized likelihood function pln (G) = ln (G) + pn (G).
.
ˆ be a mixing distribution of order m such that Let .G ˆ = sup{pln (G) : G ∈ Gm } pln (G)
.
where .Gm is the space of all mixing distributions with at most m support points. We prove the consistency of the pMLE when .m = 2. The general case is more complex but follows the same line of approach. As it is the case in the last chapter, we first investigate the case when .G∗ has two distinct support points with positive mixing proportions. The conclusion will be strengthened subsequently. Following the same idea of the last chapter, we divide the space of the mixing distribution into G1 = {G : min(r1 , r2 ) > R0 }
.
G2 = {G : max(r1 , r2 ) > R0 , min(r1 , r2 ) < R0 } G3 = G − (G1 ∪ G2 ) with the constant .R0 sufficiently large and to be specified when necessary. Note that the above .G1 , G2 , G3 are not .Gm with .m = 1, 2, 3. The notational choice is not logical, but we are short of proper and easy-to-remember ones here. We establish the consistency of the pMLE by studying the properties of the penalized likelihood function in these three regions.
5.2 Consistency for Penalized MLE
89
5.2.1 Likelihood Function on G1 Without loss of generality, let .r1 ≥ r2 ≥ R0 . Let .εrj = and
√ √ 2log rj / rj for .j = 1, 2,
| | Aj = {i : | log xi − log(rj θj )| ≤ εrj }.
.
The sets .Aj are collections of units whose values are close to .rj θj , .j = 1, 2. We use .n(Aj ) for the number of units in .Aj . It can be seen that the component density function of .Y = log X is given by f (exp(y); r, θ ) exp(y) =
.
1 exp{r(y − log θ ) − exp(y − log θ )}. Γ (r)
When .0 < r < 1, the density function of X is unbounded. This does not happen to the density function of Y , which is given by .
sup{f (exp(y); r, θ ) exp(y)} < ∞ y
for any given .r > 0 and .θ > 0. Therefore, .M = supy {f (exp(y); G∗ ) exp(y)} is finite for any true mixing distribution .G∗ . The finiteness of M activates the conclusion of Lemma 4.3, so n(Aj ) ≤ 2nMεrj + 10 log n
.
almost surely for .j = 1, 2. When .rj > R0 and both n and .R0 are large enough, it can be seen that 2Mεrj
R0 , .r2 < R0 . We retain the notation A1 = {i : | log(x) − log(rθ )| ≤ εr1 }.
.
For .i /∈ A1 , by (5.4), we find .
log{f (x; r, θ )} − log{h(x; r, θ )} ≤ 0.5(log r1 − log2 r1 ).
92
5 Gamma Mixture
We further define g(x; G) = π1 h(x; r1 , θ1 ) + π2 f (x; r2 , θ2 ).
.
By (5.3) of Lemma 5.3, for .i ∈ A1 , √ √ f (xi ; G) ≤ π1 { er1 h(xi ; r1 , θ1 } + π2 f (x1 ; r2 , θ2 ) ≤ er1 g(xi ; G).
.
Note that the constant e is needed for the validity of the inequality, but it does not have any practical effect in the subsequent derivations. When .i /∈ A1 and .r1 > R0 , which is sufficiently large, we have .
log r1 − log2 r1 < 0.
Hence, using one of the upper bounds in Lemma 5.3, we find f (xi ; G) ≤ π1 h(xi ; r1 , θ1 ) + π2 f (x1 ; r2 , θ2 ) = g(xi ; G).
.
These two observations show that E 1 n(A1 ) log r1 + log{g(xi ; G)}. 2 n
ln (G) ≤
.
i=1
For convenience, we define ln (G) =
n E
.
log{g(xi ; G)}.
i=1
Lemma 5.4 Let .G∗ be the true mixing distribution of order .m = 2 so that none of its component distributions have one degree of freedom. Then, there exists a .δ > 0 such that almost surely as .n → ∞, E .
ln (G) − ln (G∗ ) ≤ −nδ.
G∈G2
Proof Note that the exponential distribution is a special Gamma distribution with degree of freedom .r = 1. Because .h(x; G) is a special Gamma mixture, we may write ˜ ln (G) = ln (G)
.
with ˜ = π1 {(1, r1 θ1 )} + π2 {(r2 , θ2 )}. G
.
5.2 Consistency for Penalized MLE
93
That is, .ln (G) is a likelihood function confined on the space where the first subpopulation is a Gamma distribution with scale parameter .r1 θ1 and one degree of ˜ When .G∗ does not have a subpopulation ˜ be denoted .G. freedom. Let the space of .G ˜ Hence, there must exist an .e > 0 with one degree of freedom, we have .G∗ /∈ G. such that .
inf Dkw (G, G∗ ) ≥ e.
˜ G∈G
By Lemma 5.2, there exists a .δ such that almost surely sup ln (G) − ln (G∗ ) = sup ln (G) − ln (G∗ ) ≤ −nδ.
.
˜ G∈G
G∈G2
u n
This completes the proof.
Note that the constant .δ is not dependent on how .G1 , .G2 are defined. Its size is influenced by the true mixing distribution .G∗ . Next, we sharpen the upper bound for .n(A1 ) once more. When .r1 > R0 , which is sufficiently large, we rework the upper bound (5.8) to get n(A1 ) ≤ 2nMεr1 + 10 log n ≤
.
δn + 10 log n. 1 + log r1
√ The simplification in the last step is based on .log r/ r exp( n), by P50.1 on the penalty function .pn (G), we have .pn (G) + 5 log n(1 + log r1 ) < 0. Otherwise, .
√ 1 1 − δn + 5 log n(1 + log r1 ) ≤ − δn + 5( n + 1) log n < 0 2 2
when n is sufficiently large. Hence, uniformly for .G ∈ G2 pln (G) < pln (G∗ )
.
almost surely.
94
5 Gamma Mixture
We now summarize the result as follows. Theorem 5.2 Based on a set of n i.i.d. observations from a Gamma mixture model, and for a sufficiently large .R0 in the definition of .G2 , .
sup pln (G) < pln (G∗ )
(5.11)
G∈G2
almost surely as .n → ∞. Proof When .G∗ does not have a component gamma distribution with one degree of freedom, this conclusion is implied by Lemma 5.2. If .G∗ has a component gamma distribution with one degree of freedom, we can alter the proof by defining h(x; r, θ ) =
.
4x 2x exp(− ), 2 rθ (rθ )
which is a Gamma distribution with 2 degrees of freedom and scale parameter .rθ/2. With this choice, the upper bound in Lemma 5.3 is merely increased by a constant 2. The same technique implies that .
sup pln (G) < pln (G∗ ) G∈G2
when the other subpopulation does not have 2 degrees of freedom. If .G∗ has two subpopulations with degrees of freedom 1 and 2, we may alter the proof by choosing .h(x; r, θ ) to be the gamma density function with .1.5 degrees of freedom. In conclusion, whatever the true mixing distribution .G∗ , as long as .f (x; G∗ ) is a finite Gamma mixture of order 2, we always have .
sup pln (G) < pln (G∗ ).
(5.12)
G∈G2
This proves (5.11).
u n
5.2.3 Consistency on G3 and Consistency for All When .R0 is sufficiently large, we must have .G∗ ∈ G3 . The pMLE on .G3 becomes the constrained MLE. The consistency of the constrained MLE has been claimed in Theorem 5.1. We hence claim that the pMLE for a finite Gamma mixture is consistent when the penalty function satisfies P50.1 and P50.2. Our proof is rigorous when the order of the Gamma mixture is 2, but the general case is similar.
5.4 Example
95
Theorem 5.3 Based on a set of n i.i.d. observations from a finite Gamma mixture model with order m known, the pMLE is consistent when .pn (G) satisfies Conditions P50.1 and P50.2. Final remark is needed to close a gap in the proof. When .G∗ has only one support point, the consistency of the pMLE remains valid. The reason is the same as the one in the last chapter proved for the consistency of the pMLE under the finite normal mixture models. The .e-neighbourhood of .G∗ contains all mixing distributions in the form of δG1 + (1 − δ)G2
.
such that .δ is sufficiently small and .G2 is sufficiently close to .G∗ . Once these distributions are excluded from .G2 , all intermediate results remain valid. This makes up the gap in the proof.
5.3 Consistency of the Constrained MLE As mentioned earlier, another way to restore the consistency of the likelihood-based estimator is to place constraints on the component parameter space. For instance, under finite normal mixture models, one may require the fitted subpopulation variances to be larger than a positive constant. There is a theoretical risk that the artificial lower bound on the variance may be too large, that is, larger than the true variances of some subpopulations. To overcome this difficulty, one may gradually relax the constraints as the sample size increases. These approaches have been extensively discussed in Tanaka and Takemura (2005) and Tanaka and Takemura (2006). The constrained MLE has been shown to be consistent for many mixture models with location-scale family distributions as the subpopulations. The constraint approach can also be employed for finite Gamma mixtures. Observe that the penalty functions were activated only infrequently in our proof: √ in √ argument (c) for .r1 > exp( n) and√immediately after (5.10) for .r1 > exp( n). If we require .max{r1 , . . . , rm } < exp( n), then these two steps become unnecessary. Therefore, we have the following consistency result. Theorem 5.4 Based on a set of n i.i.d. observations from a finite Gamma mixture model with order m known, the constrained MLE for G is consistent under the √ constraint .max{r1 , . . . , rm } < exp( n).
5.4 Example Much of the material in this chapter is taken from Chen et al. (2016). Here is the application example from the same paper. For a group of 245 unrelated individuals,
96
5 Gamma Mixture
Table 5.1 MLE versus pMLE .π1 .π2 .π3 .r1 .r2 .r3 .θ1 .θ2 Two-component Gamma mixture fits MLE 0.62 0.38 4.97 8.91 0.04 0.15 pMLE 0.62 0.38 4.95 8.77 0.04 0.15 Three-component Gamma mixture fits MLE 0.62 0.07 0.31 4.90 277.80 9.25 0.04 0.00 10.60 0.04 0.05 pMLE 0.62 0.21 0.17 4.83 21.65 Wild three-component Gamma mixture fits 0.62 0.07 0.31 4.90 1.00e.+100 9.25 0.04 1.00e.−100 0.62 0.07 0.31 4.90 1.00e.+100 9.25 0.04 1.00e.−100
.θ3
.pln .−46.21 .−47.34
0.15 .−41.53 0.15 .−45.27 0.15 47.3 0.15 .−6.39e.+98
Bechtel et al. (1993) studied the distribution of enzymatic activity in the blood for an enzyme involved in the metabolism of carcinogenic substances. There is clear evidence for a bimodal distribution for the data. Hence, it makes sense to try to fit finite Gamma mixture models with .m = 2 and 3. Let the penalty function be pn (G) = −n−1/2
.
m E {rj − log rj }. j =1
Note that its value goes to .−∞ when either .r → 0 or .r → ∞. The best fits for m = 2 and 3 after 50 random starting values are given in Table 5.1. The MLE and pMLE are almost the same even with 50 random initial values for a two-component Gamma mixture. This seems to go against the claim that the MLE does not even exist. However, the MLE and pMLE are quite different when we fit a three-component model: the MLE contains a small (proportion .= 7%) component with a very large (.r = 277.8) degree of freedom. This gives us a hint that the problem of nonconsistency of the usual MLE does show up sometimes. To illustrate the point that the MLE is not consistent, one may purposely select some trouble-making values of the parameter vector. The results of such mischievous practice are given in the last two rows of Table 5.1. The log-likelihood value at this fit is 47.3, much larger than the largest value (.−41.53) obtained through EM iteration. The penalized log-likelihood value (.−6.39e.+98) is much smaller than the smallest value obtained (.−45.27). In theory, a similar example for a two-component mixture is also possible. Such an example is possible when r-value exceeds 1e200 by some intelligent calculation. As it exceeds the capacity of the R language, numerical illustration is not feasible. Yet it shows that the straightforward MLE as we understand (or misinterpreted one) is generally safe though the danger is also real. Figure 5.1 shows a histogram of the data and the density function of the fitted Gamma mixture model. The density of the two-component pMLE is omitted because it is almost identical to that for the two-component MLE. The density corresponding to the three-component MLE has a sharp peak at .AF MU/1X = 1.
.
97
3.5
5.4 Example
2.0 1.5 0.0
0.5
1.0
Density
2.5
3.0
PMLE:3 MLE:2 MLE:3
0.0
0.5
1.0
1.5
2.0
2.5
3.0
AFMU/1X
Fig. 5.1 Observed frequency distribution of the AFMU/IX ratio and the density functions based on the two-component MLE (MLE:2), the three-component MLE (MLE:3), and the pMLE (pMLE:3)
Its density does not match the histogram as closely as the other two densities do. Hence, a two-component Gamma mixture model is a good choice for this dataset, which agrees with Bechtel et al. (1993).
5.4.1 Some Simulation Results Although the MLE (pMLE) is consistent for many mixture models, its convergence rate is always slow. The best possible rates are known no better than .n−1/4 or −1/8 for finite mixture models; see Chen (1995), Nguyen (2013) and Heinrich and .n Kahn (2018). Here is a small simulation experiment to provide some idea about the numerical properties of the pMLE. Let us generate data from the following two Gamma mixture models: Model 1: .0.32f (x; 0.2, 9) + 0.35f (x; 5, 6) + 0.33f (x; 30, 10); Model 2: .0.6f (x; 3, 2) + 0.4f (x; 5, 10).
98
5 Gamma Mixture
The penalized log-likelihood is defined as .pln (G) = ln (G) + pn (G) with the following choices of .pn (G): 0: 1: 2: 3:
no penalty, .pn (G) E= 0; 2 pn (G) = −n−1 m j =1 {rj − log (n) log rj }; E m −1 .pn (G) = −n j =1 {rj − log rj }; E m −1/2 .pn (G) = −n j =1 {rj − log rj }. .
The MLE with no penalty is known to be inconsistent, but the EM algorithm often converges to an estimate with reasonable properties. Hence, we include it in the simulation. The other choices of .pn (G) satisfy conditions P50.1 and P50.2. For both models, we simulated data with the sample sizes .k × 20, k × 80, and .k × 320, where .k = 3 for Model 1 and .k = 2 for Model 2. Given a random sample, we used nine sets of randomly generated parameter values as well as the true values as initial parameter values for the EM algorithm. Thus, given a random sample and the penalty function, there were up to 10 local MLEs or pMLEs. The global MLE is known to have .rˆ = ∞. We ignore this and define the global MLE (pMLE) to be the local MLE (pMLE) with the highest .pln value. We performed 1000 replications for each setting (model, sample size, and penalty). We compared the performance of the estimates based on the root mean squared error (RMSE), which is l | n E | .RMSE = |n−1 (θˆi − θ0 )2 i=1
where .n = 1000 is the number of replications, .θˆi is the estimated value, and .θ0 is the corresponding true parameter value (as for .α, r). The numerical MLEs or pMLEs do not have subpopulation labels. Hence, to compute the RMSE, we must match the fitted subpopulations with the true subpopulations. We did this based on the size of r for Model 1 and the size of .θ for Model 2, because their values differ the most over the respective subpopulations. The results are in Tables 5.2 and 5.3; the last column is the median number of EM iterations. In theory, the RMSEs of the MLE are extremely large. This is not seen here. It is beneficial for the EM algorithm to only find sensible local maxima. Applying penalty 3 seems to produce best results. These methods do not differ much, which is not against any theory. The penalty is to prevent the estimates of r from diverging to infinity. It does not markedly interfere with the parameter estimation.
5.5 Consistency of the MLE When Either Shape or Scale Parameter Is Structural The most commonly used Gamma distribution has a shape parameter r and a scale parameter .θ . Given any positive value .x ∗ and M, we can find a pair of r and .θ values such that .f (x ∗ ; r, θ ) > M. This is why the plain MLE under the finite (2-parameter)
5.5 Consistency of the MLE When Either Shape or Scale Parameter Is Structural
99
Table 5.2 RMSE of MLE and pMLEs: Model 1 .π2 Pen .π1 Sample size .= 60 0.08 0 0.08 1 0.08 0.08 0.08 0.08 2 0.07 0.07 3 Sample size .= 240 0.05 0.05 0 1 0.05 0.05 2 0.05 0.05 0.05 0.04 3 Sample size .= 960 0 0.03 0.03 0.03 0.03 1 0.03 0.03 2 0.03 0.03 3
Iters
.π3
.r1
.r2
.r3
.θ1
.θ2
.θ3
0.06 0.06 0.06 0.06
0.08 0.09 0.08 0.09
6.13 4.19 5.06 2.40
15.55 11.06 11.55 9.08
28.02 15.78 17.51 9.75
2.82 3.38 2.76 3.38
3.50 3.46 3.40 5.26
62.75 48.00 61.00 53.50
0.03 0.03 0.03 0.03
0.03 0.03 0.03 0.03
1.86 1.81 1.85 1.62
5.08 5.00 5.03 4.53
12.65 12.43 12.57 11.61
1.48 1.47 1.47 1.45
1.61 1.61 1.61 1.68
89.00 77.50 89.00 82.00
0.01 0.01 0.01 0.01
0.01 0.01 0.01 0.01
0.79 0.78 0.79 0.77
2.39 2.39 2.39 2.37
7.78 7.72 7.78 7.63
0.74 0.74 0.74 0.73
0.81 0.81 0.81 0.82
118.75 115.50 118.50 117.00
Table 5.3 RMSE of MLE and pMLEs: Model 2 Pen .π1 Sample size .= 40 0.11 0 1 0.10 2 0.11 3 0.09 Sample size .= 160 0 0.04 1 0.04 0.04 2 3 0.04 Sample size .= 640 0.02 0 1 0.02 2 0.02 0.02 3
.π2
.r1
.r2
.θ1
.θ2
Iters
0.11 0.10 0.11 0.09
9.48 3.33 3.76 1.43
3.53 3.03 3.24 2.38
0.84 0.80 0.82 0.73
6.12 6.41 5.99 6.10
31.50 30.50 31.50 34.00
0.04 0.04 0.04 0.04
0.52 0.52 0.52 0.51
1.43 1.41 1.42 1.33
0.39 0.39 0.39 0.39
2.36 2.38 2.36 2.38
32.50 32.00 32.50 32.50
0.02 0.02 0.02 0.02
0.25 0.25 0.25 0.25
0.61 0.61 0.61 0.61
0.19 0.19 0.19 0.19
1.15 1.15 1.15 1.15
33.50 32.50 33.50 33.00
Gamma mixture model is inconsistent. At the same time, the arbitrarily large value of the density function occurs only if we allow r to take very large values. Hence, either placing a penalty on a large value of r, or placing an finite upper bound bruteforce restores the consistency MLE, penalized or constrained. Consider the problem of estimating the mixing distribution under a finite Gamma mixture model when all components have the same valued shape parameter r. Namely, the shape parameter is structural. In this case, a tactical choice of a pair of .(r, θ ) can inflate the value of .f (x; r, θ ) at any x of choice. However, the same r
100
5 Gamma Mixture
value will make .f (y; r, θ ) ≈ 0 for all y outside a small neighbourhood of x. The likelihood value, the sum of log-likelihood contributions of all observations, could remain very small. In other words, the MLE cannot have a large .rˆ value. Based on our experience with the finite normal mixture model, the plain MLE has its .rˆ within a compact region almost surely. Establishing this proposition will lead to the consistency of the plain MLE. While this idea is simple, the proof under the finite Gamma mixture model is much more complex. We refer to He and Chen (2022a,b) to the answers for two special cases. The consistency problem under the finite Gamma mixture distribution has many variations. They are not statistically important and post technical challenges for these interested.
Chapter 6
Geometric Properties of Non-parametric MLE and Numerical Solutions
6.1 Geometric Properties of the Non-parametric MLE The non-parametric MLE is defined to be a global maximum point of the likelihood function over a mixing distribution space .G that contains nearly any distributions on the subpopulation parameter space. Because the dimension of .G is infinite, searching for a global maximum point changes from a challenging numerical problem to a nearly impossible problem. There are many practical solutions at least for some most commonly used mixture models. The geometric properties of the non-parametric MLE are the key to the design of effective numerical solutions. We include some materials in this section based on several key references: Lindsay (1983a), Böhning (1999), and Lesperance and Kalbfleisch (1992). Although building a robust and efficient software package is a serious adventure, employing the principle of some algorithms and experimenting on a toy dataset are simple. This chapter also presents the results of the experimental implementation of the algorithms proposed in the literature. Suppose we have an i.i.d. sample .x1 , . . . , xn from a non-parametric mixture model. Let .uk , k = 1, . . . , K be distinct values in this sample, and .nk be the corresponding number of .xi values equaling .uk . We can mathematically write En .nk = 1(x i = uk ). In theory, the distinct number of observed values K i=1 equals n when the subpopulation distributions are continuous. Due to rounding, it is quite possible that .K < n in real dataset regardless. Under discrete models such as Poisson mixture, there is a good chance that K is much smaller than n. With these conventions, the log-likelihood function has two algebraic expressions: ln (G) =
n E
.
i=1
log f (xi ; G) =
K E
nk log f (uk ; G).
k=1
© Springer Nature Singapore Pte Ltd. 2023 J. Chen, Statistical Inference Under Mixture Models, ICSA Book Series in Statistics, https://doi.org/10.1007/978-981-99-6141-2_6
101
102
6 Geometric Properties
The numerical problem of computing the non-parametric MLE of G is to search for a G in .G at which .ln (G) is maximized. For each parameter value .θ , define a K-dimensional vector y(θ ) = (y1 (θ ), . . . , yK (θ ))T = {f (u1 ; θ ), . . . , f (uK ; θ )}T .
.
Related to this length K vector, we define a super-curve Γ = {y(θ ) : θ ∈ Θ}
(6.1)
.
The convex hull of .Γ is all of its finite convex combinations: conv(Γ } = {
m E
.
πj y(θj ) : for some m, and πj ≥ 0, j = 1, . . . , m;
E
πj = 1}.
j =1
Be aware that m is finite yet is allowed to be arbitrarily large. Corresponding to these quantities, for any mixing distribution G, let { y(G) =
y(θ )dG(θ ).
.
Θ
and Y = {y(G) : G ∈ G}.
.
(6.2)
Two spaces, .conv(Γ } and .Y are both made of convex combinations of .y(θ ) for θ ∈ Θ. The former is restricted to finite convex combinations, and the latter allows infinite convex combinations. Hence, .conv(Γ } ⊂ Y. Potentially, .Y may contain some vectors, which cannot be written as finite convex combinations of .y(θ ). This remark leads to the question: when do we have .Y = conv(Γ )? The answer is affirmative when .Γ is compact. This result is famous but elementary in the context of convex analysis. Interested readers can find this result in Phelps (2001). We will provide an elementary proof to assist thorough understanding and have it embedded in the proof of the most famous geometric property on non-parametric MLE of Lindsay (1983a) later. We assume that .Γ is compact in this chapter. E Back to the likelihood function, if we define .ϕ(y) = K k=1 nk log yk , then it is apparent that
.
.
sup{ϕ(y) : y ∈ Y} = sup{ln (G) : G ∈ G}.
(6.3)
The above relationship shows that we may solve the optimization problem by maximizing .ϕ(y) with respect to .y instead of maximizing .ln (G) with respect to G. The finite dimensionality of .y conceptually simplifies the numerical problem.
6.1 Geometric Properties of the Non-parametric MLE
103
If .yˆ solves the optimization problem on the left-hand side of (6.3), then there ˆ which solves the optimization problem on the right-hand side and must be a .G, ˆ ˆ = y(G). We hence have a simple result as follows. .y Theorem 6.1 Suppose .f (x; θ ) is continuous in .θ at every observed value of x. Assume further that .Y defined by (6.2) is a closed and bounded set. Then there exists a unique .yˆ ∈ Y such that ϕ(ˆy) = sup{ϕ(y) : y ∈ Y} = sup{ln (G) : G ∈ G}.
.
In addition, .yˆ is a boundary point of .Y. This conclusion is obvious. Being a concave and continuous function, .ϕ(y) must attain its maximum at some point in the compact space .Y. The concavity also implies that the maximum point is unique. The continuity condition on .f (x; θ ) can be reduced to be held true on all but a zero-measure set of x. Why is the maximum point always located at a boundary point of .Y? If .yˆ is not on the boundary, we must be able to find a number .ρ > 1 such that .ρ yˆ is still a member of .Y. Obviously, we have ϕ(ρ yˆ ) > ϕ(ˆy).
.
Hence, .yˆ cannot be the maximum point unless it is on the boundary. Even though the solution to the optimization problem on the left-hand side is unique in space .y, the solution to the optimization problem on the right-hand side ˆ is not necessarily unique. There are potentially several .G ˆ all of them in space .G ˆ Based giving rise of the same .yˆ . It is also of interest to know the structure of .G. on Carátheodory theorem from standard convex theory, Lindsay (1983a) finds that ˆ can always be chosen as a discrete mixture with at most K support points. The .G following result is behind the conclusion of Lindsay (1983a) and taken from Silvey (2013) after some very minor alternations. Theorem 6.2 Let .Γ be any set of vectors of length-K and .conv(Γ ) be its convex hull. Any .y ∈ conv(Γ ) can be written as a convex combination of at most .K + 1 vectors in .Γ . If .y is a boundary point of .conv(Γ ), then it can be written as a convex combination of at most K vectors in .Γ . Proof We first elongate the vector .y to .y' of .conv(Γ ) to length-.(K + 1) by adding 1 as its .K + 1th element. The corresponding .Γ is then denoted as .Γ ' . Next, we introduce the notion of convex cone: c-cone(Γ ) = {y : y =
N E
.
i=1
λi yj , λj > 0, yj ∈ Γ, j = 1, . . . , N for any finite N }.
104
6 Geometric Properties
The convex cone is made of all positive linear combination of vectors in .Γ . Now for any .y' ∈ c-cone(Γ ' ), by definition, there exist N and .y'j ∈ Γ ' and .aj > 0 such that y' =
N E
.
aj y'j .
j =1
We show that these .y'j can always be made linear independent. If .y'j are not independent, there exist .bj , not all zero, such that N E .
bj y'j = 0.
j =1
The above two equalities imply that for any real value t: y' =
N E
.
(aj + tbj )y'j .
j =1
Let t be the value such that .minj {aj + tbj } = 0. We then have obtained a positive combination for .y' with .N − 1 many .y'j . Repeating this procedure if the remaining ' .y are not linear independent until all of them are linear independent. j Because there can be at most .K + 1 independent vectors in any set, every vector in .c-cone(Γ ' ) can be written as positive combination of at most .K + 1 vectors in ' .Γ . Since .conv(Γ ) ⊂ c-cone(Γ ' ), it implies that any .y' ∈ conv(Γ ' ) can be written as a positive combination of at most .K + 1 vectors in .Γ ' . Because the .(K + 1)th element of vectors in .Γ ' equals 1, the positive combination of vectors in .Γ is also a convex combination. This proves the first conclusion. The second conclusion is to reduce the number of vectors in the convex combination from .K + 1 to K. It is seen that if .y is a boundary point of .conv(Γ ), its corresponding .y' is a boundary point of .conv(Γ ' ). For a boundary .y' ∈ conv(Γ ), let y' =
K+1 E
.
aj y'j
j =1
be a positive combination in conv(Γ') in which the y'_j are linearly independent, by the side conclusion established above. If a_j = 0 for some j, then the conclusion holds. Suppose a_j > 0 for all j = 1, ..., K + 1. The linear independence of the y'_j implies that any vector z' of length K + 1 can be written as a linear combination of them for some coefficients b_j:

z' = Σ_{j=1}^{K+1} b_j y'_j.
This implies that, for t small enough in absolute value,

y' + t z' = Σ_{j=1}^{K+1} (a_j + t b_j) y'_j
is a positive combination. That is, y' + t z' ∈ c-cone(Γ') for any z' when |t| is small enough. This contradicts the assumption that y' is a boundary point. Hence, the second conclusion must be true. □

Theorem 6.3 Suppose f(x; θ) is continuous in θ at every observed value of x and Γ is closed and bounded. Let K be the number of distinct observed values (denoted u_j) and let ŷ ∈ Y be the solution to

ϕ(ŷ) = sup{ϕ(y) : y ∈ Y} = sup{l_n(G) : G ∈ G}.
There exists a Ĝ ∈ G with at most K support points such that

ŷ_j = ∫ f(u_j; θ) dĜ(θ)
for j = 1, ..., K, where ŷ_j is the jth component of ŷ.

Proof The conclusion is implied by the last theorem if we can show, in addition, that Y = conv(Γ); namely, that there is no difference between finite convex combinations and general convex combinations. We proceed as follows. For any given mixing distribution G, it is possible to find a sequence of mixing distributions {G_m}_{m=1}^∞, each with a finite number of support points, such that

∫ y(θ) dG_m(θ) → ∫ y(θ) dG(θ).
Based on Theorem 6.2, we can further write all G_m as mixing distributions with at most K support points. Let the mixing proportions be π_1^{(m)}, ..., π_K^{(m)} and the supporting vectors be y(θ_1^{(m)}), ..., y(θ_K^{(m)}); the superscript m links them with G_m. The space of mixing proportions is clearly compact, and the space Γ is compact by assumption. Thus, we can always find a subsequence of m along which all of these quantities have limits; let the subsequence be the sequence itself. Because conv(Γ) is compact, all limits of y(θ_1^{(m)}), ..., y(θ_K^{(m)}) are elements of conv(Γ). Thus, they are all finite convex combinations of y(θ). This implies that ∫ y(θ) dG(θ) is also a finite convex combination of the y(θ). Consequently, Y ⊂ conv(Γ). It is obvious that Y ⊃ conv(Γ). Hence, Y = conv(Γ). □
Is Ĝ unique? The answer is negative in general. However, Ĝ is unique in most examples we are interested in. We refer to Lindsay (1983a) for more details.
If Γ is not compact, the conclusion of this theorem may not be true. One must take additional steps to examine whether the conclusion remains valid. One may examine whether Θ can be expanded so that Γ becomes compact as a result (Lindsay, 1983a). In order to achieve this goal, one must first make sure the component density f(x; θ) can be appropriately defined for θ on the boundary of Θ. In many specific examples, it appears that we can find a simpler approach. For most mixture models used in applications, we can often easily identify a compact subspace of Θ which contains the support points of the non-parametric MLE. On this subspace, the conditions in the above theorem are easily verified.

Example 6.1 Suppose we have an i.i.d. sample from a normal mixture model in the location parameter with a common and known variance. Let x_(1) and x_(n) be the minimum and maximum observed values. Then the support points of the non-parametric MLE are contained in [x_(1), x_(n)], which is a compact space.

Proof Without loss of generality, let us assume the component variance σ = 1 in the normal mixture model. The log-likelihood function is then given by

l_n(G) = Σ_{i=1}^n log [ ∫_{−∞}^{∞} exp{−(x_i − θ)²/2} dG(θ) ].
Let G be any mixing distribution such that G(x_(n)) < 1 and denote p† = 1 − G(x_(n)). Define a new mixing distribution that moves this mass to x_(n):

G†(θ) = G(θ) 1(θ < x_(n)) + 1(x_(n) ≤ θ).
We notice that when θ > x_(n),

exp{−(x_i − θ)²/2} ≤ exp{−(x_i − x_(n))²/2}
for all x_i. Therefore, the following is also true:

∫_{−∞}^{∞} exp{−(x_i − θ)²/2} dG(θ) ≤ ∫_{−∞}^{∞} exp{−(x_i − θ)²/2} dG†(θ)
for all x_i. Consequently, we find l_n(G) < l_n(G†).
That is, if the support of a candidate mixing distribution G extends beyond θ = x_(n), it cannot be a non-parametric MLE. In the same vein, a non-parametric MLE cannot have any support points in (−∞, x_(1)). This proves that the support of the non-parametric MLE is contained in [x_(1), x_(n)]. □
We remark that this restriction is a numerical matter rather than an asymptotic one. Once a set of observations is given, the interval [x_(1), x_(n)] is neither random nor changing with n; it is simply a fixed compact set. Hence, the non-parametric MLE of the mixing distribution under the location-normal mixture model has at most K ≤ n support points, all contained in this interval. A numerical algorithm for computing the non-parametric MLE should therefore restrict the support of the candidate mixing distribution G to this compact interval, for instance as in the sketch below.
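The following is a minimal numerical sketch, not an algorithm from this book: it approximates the non-parametric MLE by restricting candidate support points to a fixed grid on [x_(1), x_(n)] and running EM updates on the mixing weights only. The unit-variance normal location kernel, the grid size, and the function name are illustrative choices.

```python
import numpy as np

def npmle_weights_on_grid(x, n_grid=200, n_iter=500):
    """Grid-based approximation to the non-parametric MLE of the mixing
    distribution for a location-normal mixture with unit variance.
    Support candidates are restricted to [x_(1), x_(n)]; only the mixing
    weights are optimized, by EM."""
    x = np.asarray(x, dtype=float)
    grid = np.linspace(x.min(), x.max(), n_grid)       # candidate support points
    # component densities f(x_i; theta_k), up to the common constant 1/sqrt(2*pi)
    dens = np.exp(-0.5 * (x[:, None] - grid[None, :]) ** 2)
    w = np.full(n_grid, 1.0 / n_grid)                  # initial mixing weights
    for _ in range(n_iter):
        mix = dens @ w                                 # f(x_i; G)
        post = dens * w / mix[:, None]                 # posterior membership probabilities
        w = post.mean(axis=0)                          # EM update of the weights
    return grid, w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(-1, 1, 150), rng.normal(2, 1, 150)])
    grid, w = npmle_weights_on_grid(x)
    print("approximate support:", np.round(grid[w > 1e-4], 2))
```

Because the weight update is a genuine EM step for a mixture with fixed components, the log-likelihood is non-decreasing over iterations, and the fitted weights concentrate on a small number of grid points, in line with the at-most-K-support-point characterization above.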
6.2 Directional Derivative

Let G_0 and G_1 be two mixing distributions. Their linear combination (1 − e)G_0 + eG_1 is another mixing distribution. It is seen that

l_n((1 − e)G_0 + eG_1) = Σ_{i=1}^n log{ f(x_i; G_0) + e[f(x_i; G_1) − f(x_i; G_0)] }.

Clearly, we have

∂ l_n((1 − e)G_0 + eG_1)/∂e |_{e↓0} = Σ_{i=1}^n f(x_i; G_1)/f(x_i; G_0) − n.
The above quantity is the directional derivative from G_0 toward G_1, and it deserves a special notation:

A(G_1; G_0) = Σ_{i=1}^n f(x_i; G_1)/f(x_i; G_0) − n.    (6.4)
When G_1 is the degenerate mixing distribution that places all its mass on a specific parameter value θ, we write the directional derivative as A(θ; G_0). When A(G_1; G_0) > 0, by continuity of the derivative in e, there exists a small enough e > 0 such that

l_n((1 − e)G_0 + eG_1) > l_n(G_0).

If so, G_0 is not a local maximum. Thus, a necessary condition for Ĝ to be a local maximum point of l_n(G) is that A(G; Ĝ) ≤ 0 for all G ∈ G.

Theorem 6.4 Assume the setting of this section.
(a) A necessary and sufficient condition for Ĝ to be a non-parametric MLE is that A(G; Ĝ) ≤ 0 for any G ∈ G.
(b) The condition in (a) is equivalent to: A(θ; Ĝ) ≤ 0 for any θ ∈ Θ.
(c) Furthermore, the support points of Ĝ are contained in the set of θ such that A(θ; Ĝ) = 0.
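Conditions (b) and (c) also suggest a simple numerical diagnostic: evaluate the directional derivative A(θ; Ĝ) over a grid of θ and check that it never rises noticeably above zero, with near-zero values at the fitted support points. The sketch below is an illustration under an assumed unit-variance normal location kernel; the candidate support, weights, and grid are hypothetical inputs, for example the output of the grid EM sketch in Sect. 6.1.

```python
import numpy as np

def directional_derivative(theta_grid, x, support, weights):
    """A(theta; G) = sum_i f(x_i; theta)/f(x_i; G) - n for a discrete mixing
    distribution G; the unit-variance normal kernel is assumed here."""
    def phi(z):
        return np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    f_G = phi(x[:, None] - support[None, :]) @ weights      # f(x_i; G)
    f_theta = phi(x[:, None] - theta_grid[None, :])         # f(x_i; theta_k)
    return (f_theta / f_G[:, None]).sum(axis=0) - x.size

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(3, 1, 100)])
support = np.array([0.0, 3.0])      # hypothetical candidate support points
weights = np.array([0.5, 0.5])      # hypothetical candidate mixing weights
grid = np.linspace(x.min(), x.max(), 400)
A = directional_derivative(grid, x, support, weights)
# near zero (and <= 0 elsewhere) only if the candidate is close to the NPMLE
print("max A(theta; G-hat) over the grid:", round(A.max(), 3))
```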
Proof (a) The necessity has already been derived before the theorem; we need only prove the sufficiency. Suppose Ĝ is a mixing distribution such that

A(G; Ĝ) ≤ 0

for all G ∈ G. We need to show that this property leads to l_n(Ĝ) ≥ l_n(G) for all G ∈ G. Assume the contrary: there is a mixing distribution G such that l_n(Ĝ) < l_n(G). For this G, define

ϕ(e) = l_n((1 − e)Ĝ + eG).
.
Its first two derivatives are given by

ϕ'(e) = Σ_{i=1}^n [f(x_i; G) − f(x_i; Ĝ)] / [f(x_i; Ĝ) + e{f(x_i; G) − f(x_i; Ĝ)}]

and

ϕ''(e) = − Σ_{i=1}^n {f(x_i; G) − f(x_i; Ĝ)}² / [f(x_i; Ĝ) + e{f(x_i; G) − f(x_i; Ĝ)}]² ≤ 0.
Note that

ϕ'(0) = Σ_{i=1}^n f(x_i; G)/f(x_i; Ĝ) − n = A(G; Ĝ) ≤ 0 by assumption.

When l''_n(0) > 0, which also occurs with probability approximately 1/2, the local maximum point of l_n(v) will be different. To reveal more details, we first provide sufficiently precise approximate values of l_n^{(3)}(0) and l_n^{(4)}(0). By some simple calculations, we have

l_n^{(3)}(0) = 2 Σ_{i=1}^n f^{(3)}(X_i, 0)/f(X_i, 0)

and

l_n^{(4)}(0) = 6 Σ_{i=1}^n f^{(4)}(X_i, 0)/f(X_i, 0) − 12 Σ_{i=1}^n { f''(X_i, 0)/f(X_i, 0) }².
Denote

A_i = f''(X_i, 0)/f(X_i, 0),  B_i = f^{(3)}(X_i, 0)/f(X_i, 0),  C_i = f^{(4)}(X_i, 0)/f(X_i, 0).
Under regularity conditions on .f (x; θ ) given in this theorem, they all have zero expectation:
E A_i = 0,  E B_i = 0,  E C_i = 0.
In addition, they all have finite variances:

E A_i² < ∞,  E B_i² < ∞,  E C_i² < ∞.
These calculations lead to the following stochastic order assessments:

Σ_{i=1}^n A_i = O_p(n^{1/2}),  Σ_{i=1}^n B_i = O_p(n^{1/2}),  Σ_{i=1}^n C_i = O_p(n^{1/2}).
With this fact and a Taylor expansion, we find

l_n(v) = l_n(0) + v² Σ_{i=1}^n A_i + (1/3) v³ Σ_{i=1}^n B_i − (1/2) v⁴ Σ_{i=1}^n A_i² + O_p(n^{1/2} v⁴),

as v → 0. Differentiating with respect to v and setting the derivative to zero, we get an approximating cubic equation, one of whose roots is zero. Since we work on the occasion where Σ_{i=1}^n A_i = (1/2) l''_n(0) > 0, the nearest local maximum to v = 0 must be one of the other roots,

{ Σ_{i=1}^n B_i ± [ (Σ_{i=1}^n B_i)² + 16 (Σ_{i=1}^n A_i)(Σ_{i=1}^n A_i²) ]^{1/2} } × { 4 Σ_{i=1}^n A_i² }^{-1} {1 + o_p(1)}.

Clearly, the solution that is closer to 0 is given by the one with the minus sign. Unless E[f''(X, θ)/f(X, θ)]² = 0, the average n^{-1} Σ_{i=1}^n A_i² tends to a positive constant almost surely. Excluding this possibility, we have

v̂ = 0 when Σ_{i=1}^n A_i < 0, and v̂ = { Σ_{i=1}^n A_i² }^{-1/2} { Σ_{i=1}^n A_i }^{1/2} {1 + o_p(1)} otherwise,

which is an O_p(n^{-1/4}) quantity. This proves the theorem.
□
The slower rate of convergence is due to the fact that G_v has only one support point at v = 0 and ∫ θ dG_v(θ) = 0 for all G_v in this model. Because of this, the model has degenerate Fisher information at v = 0. The distribution family {G_v, v ∈ R} is otherwise well defined, in the sense that different v correspond to different G_v, and it is smooth around v = 0. A special remark concerns the construction of this example: the choice of mixing weights 1/3 and 2/3 ensures that G_v(θ) has mean zero and remains identifiable. Had we chosen weights 1/2 and 1/2, we would end up with G_v(θ) = G_{−v}(θ). A small simulation sketch of the n^{-1/4} magnitude of v̂ is given below.
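The following is a minimal simulation sketch, not taken from the book, that illustrates the n^{-1/4} magnitude of v̂. It assumes a standard normal component density f(x; θ) = φ(x − θ), for which A_i = X_i² − 1, and uses the plug-in approximation v̂ ≈ (Σ A_i / Σ A_i²)^{1/2} on the event Σ A_i > 0 derived above; sample sizes, replication counts, and the choice of the upper quartile as a summary are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def vhat_approx(x):
    """Approximate local MLE of v from the expansion, with A_i = x_i^2 - 1
    (normal location kernel assumed for illustration)."""
    a = x ** 2 - 1.0
    s = a.sum()
    return 0.0 if s < 0 else np.sqrt(s / np.sum(a ** 2))

for n in [500, 5000, 50000]:
    vals = np.array([vhat_approx(rng.standard_normal(n)) for _ in range(500)])
    # if vhat = O_p(n^{-1/4}), the upper quartile of n^{1/4} * vhat stays stable in n
    print(n, round(np.quantile(n ** 0.25 * vals, 0.75), 3))
```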
8.2 Best Possible Rate of Convergence

Even though the MLE of v fails to achieve a root-n rate of convergence at v = 0, this does not immediately mean that root-n is impossible. Due to the phenomenon of super-efficiency, it is possible to create an estimator with an arbitrarily fast convergence rate at a specific parameter value; the cost is usually a lower convergence rate in a small neighborhood of that value. However, it will be shown that if one considers the performance of an estimator over a neighborhood of v = 0, then no estimator can, uniformly over that neighborhood, achieve a rate better than n^{-1/4}. To explain this notion clearly, we introduce the following simplified definition of local asymptotic normality (LAN) from Ibragimov and Has'minskii (2013) and a result of Hájek (1972).

Definition 8.1 Let X_1, X_2, ..., X_n, ... be i.i.d. random variables with density function belonging to {f(x, θ), θ ∈ Θ}. The family is called locally asymptotically normal at the point θ_0 ∈ Θ as n → ∞ if, for some ϕ(n) = ϕ(n, θ_0) and any u ∈ R, the representation

Z_{n,θ_0}(u) = Π_{i=1}^n f(X_i, θ_0 + ϕ(n)u) / Π_{i=1}^n f(X_i, θ_0) = exp{ u Z_n − u²/2 + ψ_n(u, θ_0) }    (8.2)
is valid, where Z_n → N(0, 1) in distribution as n → ∞ under θ = θ_0, and moreover, for any u ∈ R, we have ψ_n(u, θ_0) → 0 in probability as n → ∞.
Distribution families that satisfy the LAN condition have special properties. We are most concerned with the following one, due to Hájek (1972). We use Π for the mathematical constant to avoid the risk of confusing it with the mixing proportion π.

Theorem 8.2 Let {f(x, θ), θ ∈ Θ} satisfy the LAN condition at the point θ_0 ∈ Θ with the normalizing value ϕ(n), and let ϕ(n) → 0 as n → ∞. Let w(t) be a non-negative symmetric function of t which is continuous at 0, not identically 0, and with w(0) = 0. Further, assume that the sets {t : w(t) < c} are convex for all c > 0 and are bounded for all sufficiently small c > 0.
Then for any family of estimators T_n of θ and any e > 0, the inequality

lim inf_{n→∞} sup_{|θ−θ_0|<e} E_θ{ w[ϕ^{-1}(n)(T_n − θ)] } ≥ (1/√(2Π)) ∫ w(t) exp(−t²/2) dt

holds. When Y_i > −1/2, we have that for every i,

| log(1 + Y_i) − Y_i + (1/2) Y_i² | ≤ |Y_i|³.
Hence, when min{Y_i} > −1/2,

log{Z_{n,0}} = Σ_{i=1}^n log(1 + Y_i)
= Σ_{i=1}^n Y_i − (1/2) Σ_{i=1}^n Y_i² + C Σ_{i=1}^n |Y_i|³
= u{ n^{-1/2} Σ_{i=1}^n (A_i/σ) } − (1/2) u² { n^{-1} Σ_{i=1}^n (A_i/σ)² } + o_p(1),    (8.4)
where C in the middle is a bounded random variable. At the same time, by the Markov inequality, the finite third-moment conditions, and the expansion of Y_i, we get

Pr(min{Y_i} < −1/2) ≤ Σ_{i=1}^n Pr(Y_i < −1/2) ≤ 8 Σ_{i=1}^n E|Y_i|³ = O(n^{-1/2}).
Hence, (8.4) holds with probability tending to 1. Furthermore, it is clear that

{ n^{-1/2} Σ_{i=1}^n (A_i/σ) } → N(0, 1)

in distribution, and

n^{-1} Σ_{i=1}^n (A_i/σ)² → 1

almost surely. Hence, the LAN condition is satisfied with ϕ(n) = n^{-1/2} for estimating t. □

Our LAN analysis differs from the standard parametric theory because the first derivative of the log-likelihood in u is identically zero at u = 0, regardless of the data. For this reason, higher-order derivatives come into play in the asymptotic analysis, and one must switch from the parameter u to the parameter t to obtain a √n rate. Given this result, the following theorem needs no serious proof.

Theorem 8.3 Suppose that in model (8.1) the true mixing distribution G ∈ G_{m−1}. Then the optimal rate of convergence ϕ(n) for estimating G is at most n^{-1/4}.
Proof The general model contains the submodel given in the lemma. The best possible rate for estimating t in the submodel is ϕ(n) = n^{-1/2}. Since ||Q_t − Q_0|| = O(t^{1/2}), the best possible rate for estimating Q_t, or G in general, is O(n^{-1/4}). □
This result firmly establishes that the minimax rate for estimating the mixing distribution cannot be faster than n^{-1/4}. Is this rate upper bound sharp? Chen (1995) believed that it is, and suggested that this rate is attained by a minimum distance estimator under a strong identifiability condition for a one-dimensional subpopulation parameter θ. Unfortunately, the proof of this claim contains an error in its Lemma 2. Heinrich and Kahn (2018) show that the best possible minimax rate under more or less the same conditions is O(n^{-1/(4k+2)}), where k is the difference between the assumed order of the mixture model and the true order of the mixture distribution. Namely, even with the smallest order over-specification, k = 1, the sharp best possible rate is n^{-1/6}, slower than n^{-1/4}. The reason n^{-1/4} is not attainable may lie in the notion of neighbourhood: the neighbourhood of Q_0 under model (8.1), in the form of f(x; Q_t), differs from the neighbourhood of G_0 = Q_0 under the finite mixture model f(x; G) with G ∈ G_2. In spite of these complications, some techniques and concepts developed in Chen (1995) remain relevant. We continue the discussion along this line.
8.3 Strong Identifiability and Minimum Distance Estimator

A mixture model is identifiable when F(x; G_1) = F(x; G_2) for all x implies G_1 = G_2. This is essentially the necessary, and almost the sufficient, condition for consistent estimation of G under either the nonparametric mixture model or the finite mixture model; the other conditions required for consistency are more or less technical. However, as discussed in the last section, the convergence rate for G can be much lower than the usual n^{-1/2} attained under regular parametric models. What conditions are needed to allow estimation of G at the convergence rate O(n^{-1/(4k+2)}) of Heinrich and Kahn (2018)? The answer remains the same as the one given in Chen (1995).
.
Definition 8.2 A parametric distribution family {F(x; θ), θ ∈ Θ} is strongly identifiable up to order m if F is twice differentiable in θ and, for any m distinct θ_1, ..., θ_m, the equality

sup_x | Σ_{j=1}^m [ α_j F(x; θ_j) + β_j F'(x; θ_j) + γ_j F''(x; θ_j) ] | = 0    (8.5)
j =1
implies that α_j = β_j = γ_j = 0 for j = 1, 2, ..., m.
Clearly, strong identifiability implies finite identifiability, because the latter only requires that sup_x | Σ_{j=1}^m α_j F(x; θ_j) | = 0 if and only if α_j = 0 for j = 1, 2, ..., m. Including F'(x; θ) and F''(x; θ) in the definition imposes a
stronger identifiability requirement. In fact, the strength of identifiability can vary across subpopulation distribution families. Ho and Nguyen (2019) generalize the strong identifiability concept further by allowing various orders of derivatives of F(x; θ) to be included in (8.5); different identifiability strengths can lead to different rates of convergence (Ho and Nguyen 2019). The condition given in (8.5) will be referred to as second-order identifiability. The invention of strong identifiability was motivated by showing that the minimum distance estimator achieves the rate n^{-1/4}. Although this rate has to be reduced to O(n^{-1/(4k+2)}), the concept remains relevant.
Suppose the order of the mixture model is m and the true order of the mixture distribution is m_0 = m − k with k ≥ 1. Let F_n(x) be the empirical distribution function constructed from the i.i.d. sample X_1, X_2, ..., X_n generated from F(x; G). Let Ĝ_n be a distribution function on Θ such that
1 sup |F (x, Gˆn ) − Fn (x)| ≤ inf sup |F (x, G) − Fn (x)| + G x n x
(8.6)
where .infG is taken over all mixing distributions with at most m support points. The ˆ n . Regarding term .n−1 on the right-hand side is used to ensure the existence of .G .supx |F (x, G) − Fn (x)| as the distance between two distributions .F (x, G) and ˆ n is a minimum distance estimator. .Fn (x), .G ˆ n attain the best possible rate of convergence? Let us first define Does .G ⎧ } { ⎪ ⎨ sup |F (x, G1 ) − F (x, G2 )| if G1 = / G2 , x ||G1 − G2 ||2k+1 .ψ(G1 , G2 ) = ⎪ ⎩∞ if G1 = G2 . Suppose for some .δ > 0, we have .
lim lim inf{ψ(G1 , G2 ) : ||G1 − G0 || ≤ δ1 , ||G2 − G0 || ≤ δ2 } ≥ δ
δ1 →0 δ2 →0
(8.7)
where .G1 and .G2 have at most m components. Chen (1995) has erred in the validity of this inequality with exponent .(2k + 1) instead of 4 when defining .ψ(G1 , G2 ). By the well-known Glivenko–Cantelli theorem, .
sup |Fn (x) − F (x; G)| = o(1) x
whenever the true mixture is .F (x; G). Therefore, we have .
sup |F (x, Gˆn ) − F (x; G)| ≤ sup |F (x, Gˆn ) − Fn (x)| + sup |F (x, G) − Fn (x)| x
x
x
≤ 2 sup |F (x, G) − Fn (x)| + x
= o(1).
1 n
8.3 Strong Identifiability and Minimum Distance Estimator
153
ˆ n . Therefore, (8.7) is When G is within a small neighborhood of .G0 , so is .G applicable. We would have ˆ n − G||2k+1 ≤ δ sup |F (x, Gˆn ) − F (x; G)| = Op (n−1/2 ). ||G
.
x
This leads to ˆ n − G|| = Op (n−1/(4k+2) ) ||G
.
the rate given by Heinrich and Kahn (2018). We have assumed the validity of (8.7) above to claim that the minimum distance estimator has convergence rate .n−1/(4k+2) for one-dimensional .θ under the strong identifiability condition. The truthfulness of the inequality in the form of (8.7) is much more complicated, and its establishment is rather involved. We now directly present part of Theorem 6.3 of Heinrich and Kahn (2018) here instead. To do this, we must directly introduce the Wasserstein distance on the space of .Gm . The .c(θ1 , θ2 ) be a bivariate function on .Θ. Denote two mixing distributions .G1 , G2 ∈ Gm by G1 =
m E
.
αi,· {θi }; G2 =
m E
α·,j {θi };
j =1
i=1
with .αi,· , α·,j being mixing proportions. Let Π = {[αij ]i,j =1,2,...,m :
E
.
αi,j = α·,j ,
i
E
αi,j = αi,· }
j
the collection of all bivariate distributions whose marginal distributions are .G1 and G2 . The transportation divergence between .G1 , G2 is defined to be
.
T (G1 , G2 ) = inf
.
⎧ ⎨E ⎩
i.j
⎫ ⎬ αi,j c(θi , θj ) . ⎭
Suppose .c(θi , θj ) is the cost of transporting a unit of mass from location .θi to θj , then .T (G1 , G2 ) is the lowest possible transportation cost for mass distributed according to .G1 to mass distributed according to .G2 . If we let .c(θ1 , θ2 ) = ||θ1 − θ2 )||r for some .r ≥ 1, then .T 1/r (G1 , G2 ) is the r-Wasserstein distance .Wr (G1 , G2 ). When .r = 1 and .Θ ⊂ R, the .Wr (G1 , G2 ) is equivalent to .||G − H || defined in the beginning of this chapter. Both transportation divergence and Wasserstein distance have many useful properties and are widely used recently related to data science. See also (Villani 2003; Dedecker and Merlevede 2017; Chen et al. 2021).
.
154
8 Rate of Convergence
For our purpose, we only mention that Wr1 (G1 , G2 ) ≤ Wr2 (G1 , G2 )
.
when .1 ≤ r2 ≤ r2 . Theorem 8.4 (Part of Theorem 6.3 of Heinrich and Kahn 2018) Let .G0 ∈ Gm0 but .G0 /∈ Gm0 −1 . Assume .F (x; θ ) is differentiable to the .(2m)th order, 2m E .
αt,j F (t) (x; θj ) = 0
t=1
for all x implies .αt,j = 0 for all .t, j , and for some uniform continuity modulus .ω(·) satisfying .ω(h) → 0 as .h → 0, .
sup |F (2m) (x; θ ) − F (2m) (x; θ ' )| ≤ ω(θ − θ ' ). x
Then when .Θ is compact, there are .e, δ > 0 such that for .q = 2(m − m0 ) + 1, inf
{ sup |F (x; G ) − F )x; G )| 1 2 x . : q Wq (G1 , G2 ))
} Wq (G1 , G0 )) ≤ e, Wq (G2 , G0 )) < e, G1 /= G2 ; G1 , G2 ∈ Gm ≥ δ.
The identifiability condition in this theorem is much stronger than the secondorder identifiability initially proposed in Chen (1995). This theorem also requires .F (x; θ ) to be differentiable with respect to .θ to a potentially very high order: 2m. However, the smoothness conditions are not so restrictive because most commonly used subpopulation distribution families are smooth. Nonetheless, the identifiability condition makes it harder to generalize the rate result to mixtures with multidimensional subpopulation parameters. The above theorem shows how fast a minimum distance estimator converges as .n → ∞. The other side of this picture is whether any estimators can attain a faster minimax rate. The rate given in our Theorem 8.3 is known not to be tight. The answer to this special case is as follows. Theorem 8.5 (Simplified and Condensed Theorem 6.1 of Heinrich and Kahn 2018) Under the same conditions as 8.4, there is a mixture distribution subfamily of .Gm {F (x; Gn,u ) : n = 1, 2, . . . , u ∈ R}
.
with .F (x; Gn,u=0 ) = F (x; G0 ) with the following properties:
8.3 Strong Identifiability and Minimum Distance Estimator
155
(a) For all distinct .u, u' ∈ R, we have together W1 (Gn,u , Gn,u' ) ≥ c1 (u, u' )n−1/(4(m−m0 )+2) ≥ c2 (u)W1 (Gn,u , G0 )
.
where .c1 (u, u' ) and .c2 (u) are positive constants possibly depend on u and .u' . (b) For some .Γn > 0, .Zn → N(0, 1), n E .
i=1
log
{ f (x ; G } / u2 i n,u − uZn Γn + Γn → 0 f (xi ; Gn,0 2
in probability, as .n → ∞ and .lim inf Γn > 0. In other words, a sub-mixture model of order m can be created so that it is locally asymptotic normal, and the Wasserstein distance between its members with fixed .u /= u' are at order .n−1/(4(m−m0 )+2) . The property of the locally asymptotic normality in (b) implies no estimator can separate .Gn,u and .G0 to an order of .Op (1) in minimax sense. The property (a) implies the same estimator cannot separate −1/(4(m−m0 )+2) . .G ∈ Gm better than .n While the proof of Theorem 8.5 is involved, we can gain much sight from two sequences of mixing distributions Gn,1 = (1/2){−2n−1/6 } + (1/2){2n−1/6 }
.
and Gn,2 = (4/5){−n−1/6 } + (1/5){4n−1/6 }.
.
They have the same first and second moments, but different third moments. Their third moments are .O(n−1/2 ) quantities. Consequently, the difference between −1/2 ) while .||G −1/6 ). .F (x; Gn,1 ) and .F (x; Gn,2 ) is an .O(n n,1 − Gn,2 | = O(n Based on n i.i.d. observations, the data contain information that distinguishes these two distributions with an .Op (n−1/2 ) difference but not more. Hence, in terms of −1/6 ). Because both .W (G , G ) are in a .W1 (Gn,1 , Gn,2 ), the best rate is .Op (n 1 n,1 n,2 small neighbourhood of .G0 = {0}, the minimax rate at .G0 is .Op (n−1/6 ). We may also note that this example does not extend to .Gn,1 , Gn,2 whose supports are located at .n−1/8 or .n−1/4 distances from 0. The rate conclusion also changes if we allow .Gn,1 , Gn,2 to have three support points. In this case, it becomes possible to match the first three moments of .Gn,1 , Gn,2 whose support points are in .n−1/10 distance from 0. Hence, the optimal rate is reduced when the degree of mixture order over-specification increases. The example presented in the beginning of this paper Gv =
.
1 2 {−v} + {2v} 3 3
156
8 Rate of Convergence
with .ν = O(n−1/4 ) is of similar nature. We aim to match moments of .G0 = {0} instead of mixing distributions in a neighbourhood of .G0 . Because of this, allowing additional support points in .Gv does not lead to moment matching of higher orders. Hence, this example does not catch the minimax rate caused by multi-support points mixing distributions that are infinitesimally close to .G0 .
8.4 Other Results on Best Possible Rates One noticeable development on the topic of best possible rates is Nguyen (2013). They propose to introduce the strong identifiability to allow multidimensional .θ : | k | |E 2 f (x; θ ) | ∂f (x; θ ) ∂ i i | | .ess sup | αi f (x; θi ) + βiT + γiT |=0 T | ∂θ ∂θi ∂θi | x i i=1
necessitates .αi = 0 and .βi = γi = 0. Because the density function values are defined up to a measure zero set, a zero-measure set of x exceptions in the equation is allowed. This is what ess is about. In addition, .βi , γi are vectors of the same length of .θ . Under some mild additional conditions such as compactness of .Θ and Lipschitz continuity of .∂ 2 f (x; θi )/∂θi ∂θiT , they suggest that ψ(G1 , G2 ) =
.
supx |F (x; G1 ) − F (x; G2 )| W22 (G1 , G2 )
has a positive lower bound when both .G1 , G2 are in a small neighborhood of .G0 . This would imply a minimum distance estimator with minimax rate .n−1/4 . This claim may not be solid similar to that of Chen (1995). There are some classical distribution families that fail to be strongly identifiable. The most important and surprising one is the normal/Gaussian family in both mean and variance. While the inference problems related to normal model often have clean and optimal solutions, it is opposite for Gaussian mixtures in spite of its importance. We will show that the univariate Gaussian mixture is not strongly identifiable in the next section. The multivariate Gaussian mixture model has even more complex identifiability issues. Ho and Nguyen (2016a) suggest that it is best to characterize its identifiability based on a system of polynomial equations. Another interesting example is Gamma distribution family in shape and scale/rate parameter: f (x; r, λ) =
.
λr x r−1 exp(−λx) Γ (r)
8.5 Strongly Identifiable Distribution Families
157
over .x ≥ 0, and parameters .r > 0 and .λ > 0. It is seen that .
∂f (x; r, λ) r = {f (x; r, λ) − f (x; r + 1; λ)}. λ ∂λ
Namely, the Gamma density functions are not algebraically linear independent with their first derivatives. Hence, the Gamma mixture model is not strongly identifiable in general. At the same time, the loss of strong identifiability occurs only if some subpopulation parameters .ri − rj ∈ {1, 0} or .λi − λj ∈ {1, 0}. Consequently, the minimax rate can be different for Gamma mixtures of different subpopulation configurations (Ho and Nguyen 2016a).
8.5 Strongly Identifiable Distribution Families We have discussed the best possible rate of convergence for estimating G under finite mixture models when the order is over-specified. The best possible rate depends on many factors. One of them is the degree of identifiability. The strong identifiability condition is first introduced and subsequently extended in many directions. In this section, we give results regarding only strong identifiability of several commonly considered subpopulation distribution families. Theorem 8.6 Suppose that .f (x) is a differentiable density function and .F (x, θ ) = {x f (t − θ )dt. If .limx→±∞ f (x) = limx→±∞ f ' (x) = 0, then .F (x, θ ) satisfies −∞ the strong identifiability condition. Proof We need to show that if m E .
[αj F (x, θj ) + βj F ' (x, θj ) + γj F '' (x, θj )] = 0
(8.8)
j =1
for any x and some m, then .αj , .βj and .γj are all 0. Let us transform it in the language of characteristic functions. If (8.8) is true, we have { exp{ı tx}d{
.
m E
[αj F (x, θj ) + βj F ' (x, θj ) + γj F '' (x, θj )]} = 0.
j =1
Hence,
.
{ m E [αj − βj (ı t) + γj (ı t)2 ] exp{ı tθj } exp{ı tx}dF (x) = 0. j =1
158
8 Rate of Convergence
{ Since . exp{ı tx}dF (x) equals 1 when .t = 0 and is continuous at the point, we have m E .
[αj − βj (ı t) + γj (ı t)2 ] exp{ı tθj } = 0
j =1
for t in a neighbor of 0. Because the function on the left-hand side is an analytic function of .ı t, being 0 in a neighborhood of .t = 0 implies that it equals 0 at all t. Multiplying it by .exp{− 12 t 2 } and taking the inverse Fourier transformation, we obtain m E 1 . [αj − βj H1 (x − θj ) + γj H2 (x − θj )] exp{− (x − θj )2 }] = 0 2 j =1
for all x, where .H1 (x) and .H2 (x) are Hermite polynomials. Observe that when .x → ∞, one of .exp{− 12 (x−θj )2 } tends to 0 with the slowest rate, hence its corresponding polynomial .αj − βj H1 (x − θj ) + γj H2 (x − θj ) must be 0. This implies that all .αj , .βj and .γj equal 0. Hence, the theorem is proved. u n If a random variable X has distribution .F (x), then .Y = |X| has distribution F (x) + 1 − F (−x). Further the distribution of .log Y belongs to a location model. Hence, the above theorem applies, and it proves the corollary.
.
Corollary 8.1 The same conclusion of Theorem 8.6 remains true when 1 .F (x, θ ) = θ
{
x
t f ( )dt θ −∞
for some density function .f (·) where .θ is in .(0, ∞). By Theorem 8.6 and straightforward calculations, the location family and scale families of normal and Cauchy distributions satisfy conditions of Theorem 8.6. Example 8.1 The finite mixture of Poisson distribution is strongly identifiable. The proof of this conclusion is as follows. Since f (x, θ ) =
.
θx exp{−θ }, x!
we have .
.
x(x − 1) 2x f '' (x, θ ) + 1, = − θ f (x, θ ) θ2
x(x − 1)(x − 2) 3x(x − 1) 3x f (3) (x, θ ) − 1, = − + θ f (x, θ ) θ3 θ2
8.5 Strongly Identifiable Distribution Families
159
and .
x(x − 1)(x − 2)(x − 3) 4x(x − 1)(x − 2) 6x(x − 1) 4x f (4) (x, θ ) + 1. = − + − θ f (x, θ ) θ4 θ3 θ2
Clearly, the moment and Lipschitz conditions of Theorems 8.2 are satisfied. Now let us examine the strong identifiability. In this model, we have F (x, θ ) =
x E θi
.
i=0
i!
exp{−θ }.
Thus, if
.
m E [αj F (x, θj ) + βj F ' (x, θj ) + γj F '' (x, θj )] = 0,
(8.9)
j =1
for all x, we have to show that all .αj , .βj and .γj are 0. By calculating the moment generating function of (8.9), we obtain
.
m E [(αj − βj + γj ) + (βj − 2γj ) exp{t} + γj exp{2t}] exp{θj (exp(t) − 1)} = 0 j =1
(8.10) for any t. Suppose .θm is the largest among .θj ’s. Then, .exp{θm (exp(t)−1)+2t} goes to infinity with the fastest speed as t goes to .∞. So .γm must be 0 because of (8.10). When this is the case, .exp{θm (exp(t) − 1) + t} becomes the fastest one, hence .βm must be 0 and so on. Repeating this procedure, we conclude that if (8.9) and hence (8.10) holds, all the .αj , .βj and .γj are 0. This shows that the Poisson mixture model is strongly identifiable. Example 8.2 The finite mixture of normal distribution in only location parameter is strongly identifiable. This conclusion can be easily obtained from the general result for finite mixture of location distribution families. Example 8.3 The finite mixture of normal distribution in both location and scale parameters is not strongly identifiable. Let .φ(x) be the density function of the standard normal distribution. Let φ(x; μ, σ ) =
.
1 (x − μ) φ σ σ
160
8 Rate of Convergence
which is the density function of a normal distribution with mean .μ and variance .σ 2 . It is easy to get .
∂ 2φ (x − μ)2 − σ 2 (x; μ, σ ) = φ(x; μ, σ ); ∂μ2 σ4 ∂ 2φ (x − μ)2 − σ 2 (x; μ, σ ) = φ(x; μ, σ ). ∂σ 2 σ3
The fact that σ
.
∂ 2φ ∂ 2φ (x; μ, σ ) = (x; μ, σ ) ∂μ2 ∂σ 2
violates the strong identifiability condition. The loss of strong identifiability has been blamed for many ill asymptotic properties of the finite normal mixture models.
8.6 Impact of the Best Minimax Rate Is Not n−1/4 The erroneous .n−1/4 rate of convergence has been claimed for many years. When developing methods for hypothesis testing regarding the order of finite mixture models, it is common to examine the local power of the proposed tests. The motivation behind studying “local power” lies in the fact that any proper test should approach a power of 1 as the sample size increases. However, this property alone is not sufficient for comparing the power of different tests. To address this, we select a specific alternative hypothesis that acts as a moving target and allows for a proper limit of the test’s power between 0 and 1. By doing so, we can determine which test is superior when other properties are approximately equal. In the context of studying the asymptotic properties of the test for homogeneity, we often choose a local alternative in the form of Gn = (1 − π ){−n−1/4 π } + π {n−1/4 (1 − π )}
.
which approaches the null mixing distribution .G0 = {0} at a rate of .n−1/4 . This rate choice is guided by the rate result of Chen (1995). The investigation of local power using this approach has been successful. Importantly, the conclusions drawn from these investigations are not dependent on whether the minimax rate is .n−1/6 or not. In all likelihood, this line of investigation focuses solely on .G0 as the target rather than all mixing distributions in its neighborhood. As a result, the minimax rate issue is not relevant to these specific investigations.
Chapter 9
Test of Homogeneity
9.1 Test for Homogeneity A finite mixture model is so flexible that it fits any population well. We need only choose a higher-order m when needed. One does not need specific justifications to use a finite mixture model when the goal is simply a good fit. In other applications, the presence of non-homogeneity is a supporting evidence to certain scientific hypothesis. In this case, we may wish to quantify the strength of the evidence in the data for the lack of homogeneity. Test for homogeneity serves this purpose. Let .{f (x, θ ) : θ ∈ Θ} be a parametric distribution family with .Θ ⊂ R. Let .X1 , . . . , Xn be an i.i.d. sample from the mixture model π1 f (x, θ1 ) + π2 f (x, θ2 ).
.
The homogeneous model is specified by the null hypothesis H0 : π1 π2 (θ2 − θ1 ) = 0.
.
The alternative is simply Ha : π1 π2 (θ2 − θ1 ) /= 0.
.
Given a set of data with i.i.d. structure and the parametric form of the finite mixture model, the classical likelihood ratio test for .H0 against .Ha comes to our mind naturally. Contradicting to the wisdom of many early researchers, the likelihood ratio statistic does not have a standard chi-square limiting distribution for the current test problem. In fact, this statistics may diverge to infinity in some seemingly ordinary situations as first pointed out by Hartigan (1985). We present the original example with more detailed analysis in the next section.
© Springer Nature Singapore Pte Ltd. 2023 J. Chen, Statistical Inference Under Mixture Models, ICSA Book Series in Statistics, https://doi.org/10.1007/978-981-99-6141-2_9
161
162
9 Test of Homogeneity
9.2 Hartigan’s Example The example is based on a pair of hypotheses H0 : N(0, 1); Ha : (1 − π )N (0, 1) + π N (θ, 1).
.
For better understanding, we first take on the situation where .θ assumes a pre-given value in .R. The log-likelihood function of the general model under .Ha is simply ln (π ; θ ) =
n E
.
log φ(Xi ; 0, 1) +
i=1
n E
log[1 + π {exp(θ Xi − θ 2 /2) − 1}].
i=1
Since .θ is pre-specified, it only plays a secondary role in .ln (·) which leads to the above notational choice. The likelihood ratio statistic is given by Rn (π ; θ ) = 2
n E
.
log[1 + π {exp(θ Xi − θ 2 /2) − 1}]
i=1
The derivative of .Rn (π ; θ ) with respect to .π is given by Rn' (π ; θ ) = 2
n E
.
i=1
exp(θ Xi − θ 2 /2) − 1 . 1 + π {exp(θ Xi − θ 2 /2) − 1}
Thus, we find Rn' (0; θ ) = 2
.
n n E E {exp(θ Xi − θ 2 /2) − 1} = 2 Yi (θ ), i=1
i=1
where .Yi (θ ) = exp(θ Xi − θ 2 /2) − 1. It is simple to verify that .Yi (θ )’s are i.i.d. with 2 2 mean 0 and finite variance, E say, .σ (θ ) = exp{θ } − 1. By the central limit theorem, when .n → ∞, .n−1/2 ni=1 Yi (θ ) → N(0, σ 2 (θ )) in distribution. Hence, .
Pr{Rn' (0; θ ) > 0} = Pr{n−1/2
n E
Yi (θ ) > 0} → 0.5.
i=1
When .Rn' (0; θ ) > 0, the likelihood function increases at .π = 0. Thus, the maximum value of .Rn (π ; θ ) is attained at some .π > 0 or the maximum likelihood estimator .π ˆ > 0. When .Rn' (0; θ ) ≤ 0, Eit is straightforward to verify that .πˆ = 0. Let .y¯n+ (θ ) = max{n−1 ni=1 yi (θ ), 0} as usual. For each pre-specified .θ value, we have n y¯ + (θ ) + op (n−1/2 ). πˆ (θ ) = En n 2 i=1 yi (θ )
.
9.2 Hartigan’s Example
163
E By the law of large numbers, .n−1 ni=1 yi2 (θ ) → σ 2 (θ ) = exp(θ 2 ) − 1 almost surely. This allows a further simplified expansion πˆ (θ ) = σ −2 (θ )y¯n+ (θ ) + op (n−1/2 ).
.
Consequently, we find Rn (πˆ (θ ); θ ) = nσ −2 (θ ){y¯n+ (θ )}2 + op (1) → 0.5χ02 + 0.5χ12 .
.
(9.1)
This example shows that the likelihood ratio statistic for testing homogeneity under finite mixture models does not have to have a standard chi-square limiting distribution. If .θ is not pre-specified but an unknown parameter value in .R, the large sample behavior of the likelihood ratio statistic is more erratic. Assume as above that we still have a set of i.i.d. observations from the standard normal distribution. Based on this set of observations, we test H0 : N(0, 1); Ha : (1 − π )N (0, 1) + π N (θ, 1)
.
with .θ being an unspecified real value. Let us choose a particular set of .θ values given by .θj = j m for .j = 1, . . . , m and some positive integer m. We use this set of .θ values to show that the corresponding likelihood ratio statistics .Rn is stochastically unbounded. The likelihood ratio statistic .Rn is apparently related to the previous one through Rn = sup Rn (πˆ (θ ); θ ).
.
θ
Clearly, we have Rn ≥ max {Rn (πˆ (θj ); θj )}.
.
1≤j ≤m
The finiteness of m and the expansion (9.1) imply Rn ≥ max nσ −2 (θj ){y¯n+ (θj )}2 + op (1).
.
0≤j ≤m
Be careful that values of .op (1)’s are different for different .θj ’s. When m does not increase with√n, the maximum absolute value of these .op (1) terms is still .op (1). Note that . nσ −1 (θj )y¯n+ (θj ), .j = 1, . . . , m are asymptotically jointly multivariate normal with zero mean and covariance matrix specified by σj k = {
.
exp(θj θk ) − 1 }1/2 { }1/2 . exp(θj2 ) − 1 exp(θk2 ) − 1
164
9 Test of Homogeneity
Thus, let .(Z1 , . . . , Zm )T be a multivariate normally distributed random vector with the same mean and covariance matrix. For all x, we have ( ) 2 . lim Pr(Mn > x ) ≥ Pr max Zj > x . 1≤j ≤m
n→∞
When m increases, the correlation coefficients between .Zj ’s shrink to zero. Therefore, let .m0 be a positive integer. .Zj∗ , .j = 1, . . . , m0 be a set of i.i.d. standard normally distributed random variables. As .m → ∞, we have ( .
Pr
)
( ≥ Pr
max Zj > x
1≤j ≤m
) max Zj > x
1≤j ≤m0
( → Pr
max
1≤j ≤m0
Zj∗
) >x
= 1 − Φ m0 (x). For any .0 < x < ∞ and .e > 0, there exists .m0 such that .Φ m0 (x) < e. Therefore, by activating sufficiently large .m0 and m, we have ( .
lim Pr(Rn > x 2 ) ≥ Pr
n→∞
) max Zj > x
1≤j ≤m
≥ 1 − e.
That is, .Rn is stochastically unbounded. It exceeds any pre-specified value with probability approaching 1. Since the alternative model contains two extra parameters compared with the null model, if the model is regular, the likelihood ratio statistic under i.i.d. setting for regular models would have a chi-square limiting distribution with two degrees of freedom. This example demonstrates the non-regularity of the finite mixture model and its effect. How fast does .Rn diverge to infinite? Hartigan (1985) suggested the rate is 2 .(log log n) . A proof is given by Liu and Shao (2004), which is built on Bickel and Chernoff (1993). In fact, Bickel and Chernoff (1993) find that .Rn , after properly rescaled, has a type I extreme value limiting distribution. The apparent cause of the asymptotic unboundedness is that the parameter space of .θ is unbounded. This unboundness issue is not limited to normal mixtures. It can be argued that in applications, one often has prior information on the potential size of .θ and therefore imposing a compact .Θ restriction is acceptable. There are also situations where the parameter space is naturally bounded. With compact .Θ, the likelihood ratio statistic is generally asymptotically stochastically bounded. Because the finite mixture model is not regular, the limiting distribution of the likelihood ratio statistic may still be different from the usual chi-square. The result of Chernoff and Lander (1995) is simple, useful, and illustrative.
9.3 Binomial Mixture Example
165
9.3 Binomial Mixture Example Let us consider the situation where the subpopulation distribution is Binomial with known size parameter m and unknown probability of success parameter .θ . Finite mixture of binomial distributions is often used in statistical genetics (Ott 1999). The binomial p.m.f. is, for .k = 0, 1, . . . , m, ( ) m k θ (1 − θ )m−k .f (k; m, θ ) = k for some positive integer m and probability of success .0 ≤ θ ≤ 1. Let .X1 , . . . , Xn be a set of i.i.d. random variables with common finite mixture of binomial distributions such that .
Pr(X = k) = (1 − π )f (k; m, θ1 ) + πf (k; m, θ2 ).
Clearly, the parameter space of the mixing parameters .θ1 and .θ2 is bounded. This property disables the choice of arbitrarily large number of arbitrarily distant values of .θ as we did in Hartigan’s example. Consequently, the likelihood ratio statistic is stochastically bounded for the test of homogeneity. We now first demonstrate this fact. Let .nk be the number of observation such that .Xi = k for .k = 0, 1, . . . , m. The log-likelihood function has another form: ln (π, θ1 , θ2 ) =
m E
.
nk log{(1 − π )f (k; m, θ1 ) + πf (k; m, θ2 )}.
k=0
Let .θˆk = nk /n, serving as a symbol here and it does not have to be regarded as an estimator. By Jensen’s inequality, we have ln (π, θ1 , θ2 ) ≤ n
m E
.
θˆk log θˆk
k=0
for any choices of .π , .θ1 and .θ2 . Let .θk∗ = Pr(X = k) under the true distribution of X. It is well known that in distribution, we have n
m E
.
k=0
θˆk log θˆk − n
m E
θˆk log θk∗ → χm2
k=0
as .n → ∞. Let us again use .Rn for the likelihood ratio statistic. It is seen that Rn = 2{sup ln (π, θ1 , θ2 ) − sup ln (1, θ, θ )} ≤ n
m E
.
k=0
θˆk log θˆk − n
m E k=0
θˆk log θk∗
166
9 Test of Homogeneity
which has a .χm2 limiting distribution. That is, .Rn = Op (1), or it is stochastically bounded. The conclusion .Rn = Op (1) does not particularly rely on .Xi ’s having a binomial distribution. The conclusion remains true when .Xi ’s have common and finite number of support points. In spite of boundedness of .Rn for binomial mixtures, finding the limiting distribution of .Rn troubled statisticians in early days as well as geneticists for a long time, see Risch and Rao (1989) and Faraway (1993). The first most satisfactory answer is given by Chernoff and Lander (1995). Unlike the results under regular models, the limiting distribution of the likelihood ratio statistic under binomial mixtures is not an asymptotic pivot. The limiting distribution varies according to the size of m, the true null distribution, and so on. There is one particular situation where the limiting distribution of the LRT can be fully specified analytically. Consider the following pair of hypotheses H0 : f (k; 2, 0.5), Ha : (1 − π )f (k; 2, 0.5) + πf (k; 2, θ ).
.
0.8 0.6 0.2
0.4
H0
0.0
Fig. 9.1 Two line segments whose points all stand for the same distribution
1.0
We aim to find the limiting distribution of the LRT statistic in the presence of n i.i.d. observations from .H0 . Under the alternative hypothesis .Ha , the parameter space of .(π, θ ) is the unit square .[0, 1] × [0, 1]. The null hypothesis is made of two lines in this unit square: one is formed by .π = 0 and the other is by .θ = 0.5 (Fig. 9.1). Unlike the hypothesis test problems under regular models, all points on these two lines stand for the same distribution. Each line symbolizes one type of loss of identifiability. The lost of identifiability has been the cause of unusual large sample properties. One way to overcome the non-identifiability difficulty is to make a parameter transformation.
0.0
0.2
0.4
0.6
0.8
1.0
9.3 Binomial Mixture Example
167
Using the parameter setting under the alternative model, we note .
Pr(X = 0) = 0.25(1 − π ) + π(1 − θ )2 ; Pr(X = 1) = 0.5(1 − π ) + 2π θ (1 − θ ); Pr(X = 2) = 0.25(1 − π ) + π θ 2 .
Based on the data generated from the null model, parameters .π, θ are not uniquely defined and therefore cannot be consistently estimated. At the same time, let ξ1 = 0.5π(θ − 0.5); ξ2 = π(θ − 0.5)2 .
.
The null model is uniquely defined by .ξ1 = ξ2 = 0. Under the new parameterization scheme, we have .
Pr(X = 0) = 0.25 − 2ξ1 + ξ2 ; Pr(X = 1) = 0.5 − 2ξ2 ; Pr(X = 2) = 0.25 + 2ξ1 + ξ2 .
Let .pˆ 0 , pˆ 1 , pˆ 2 be the sample proportions of .X = 0, 1, 2. The likelihood would be maximized at pˆ 0 = 0.25 − 2ξ1 + ξ2 ;
.
pˆ 1 = 0.5 − 2ξ2 ; pˆ 2 = 0.25 + 2ξ1 + ξ2 . The unconstrained solution to the above equation in .ξ1 and .ξ2 is given by ξ˜1 = (pˆ 2 − pˆ 0 )/4;
.
ξ˜2 = (pˆ 2 + pˆ 0 − 0.5)/2. At the same time, because π = 4ξ12 /ξ2 , θ = 0.5(1 + ξ2 /ξ1 )
.
and they have range .[−0.25, 0.25] × [0, 1], we must have |ξ2 | ≤ ξ1 , 4ξ12 ≤ ξ2 .
.
The range is shown as the left figure in Fig. 9.2.
9 Test of Homogeneity
H1
0.05
0.05
0.10
0.10
2
2
0.15
0.15
0.20
0.20
0.25
0.25
168
H1
H1
H0
H1
-0.1
0.0
0.1
0.00
0.00
H0 -0.2
-0.1
0.0
0.1
0.2
-0.2
0.2
1
1
Fig. 9.2 Parameter spaces of the full model in new parameterization. The left is original, and the right is its cone approximation
When the shaded region is expanded, it is well approximated by a cone as shown by the plot on the right-hand side of the figure: |ξ2 | ≤ ξ1 , ξ2 ≥ 0.
.
Let .ϕ be the angle from the positive x-axis. Then, the cone contains two disjoint regions: .0 < ϕ < π/4 and .3π/4 < ϕ < π/2. At the same time, under the null model, and by the classical central limit theorem, √ d n(ξ˜1 , ξ˜2 ) −→ (Z1 , Z2 )
.
where .(Z1 , Z2 ) are bivariate normal with covariance matrix .I−1 and .I is the Fisher information with respect to .(ξ1 , ξ2 ): [
] 32 0 .I = . 0 16 According to Chernoff (1954), there is a simple link between the original test problem and another test problem to be stated. Suppose we have a single pair of observations .(Z1 , Z2 ) from a bivariate normal distribution with mean .(ξ1 , ξ2 ) and covariance matrix .I−1 . We wish to test the hypothesis H0 : ξ1 = ξ2 = 0
.
against the alternative Ha : |ξ2 | ≤ ξ1 , ξ2 ≥ 0.
.
9.3 Binomial Mixture Example
169
We will work out the distribution of the LRT statistic momentarily. Before that, we remark that the distribution of this LRT is the limiting distribution of the LRT of the original test. Under the current setting, the log-likelihood function is given by l(ξ1 , ξ2 ) = −16(Z1 − ξ1 )2 − 8(Z2 − ξ2 )2 + c
.
where c is parameter free constant. The outcome of maximizing .l(ξ1 , ξ2 ) can be concretely specified in three situations depending on the location of the observed value of .(Z1 , Z2 ). Case I: .|Z2 | ≤ Z1 , Z2 ≥ 0. In this case, the MLE .ξˆ1 = Z1 and .ξˆ2 = Z2 . Because of this, the likelihood ratio statistic is found to be T = −2{l(ξˆ1 , ξˆ2 ) − l(0, 0)} = 32Z12 + 16Z22 ∼ χ22
.
Case II: .Z2 < 0. In this case the MLE .ξˆ1 = Z1 and .ξˆ2 = 0. Hence, the likelihood ratio statistic is T = −2{l(ξˆ1 , ξˆ2 ) − l(0, 0)} = 32Z12 ∼ χ12
.
Case III: .|Z1 | > Z2 ≥ 0. In this case, the likelihood is maximized when .ξ1 = ξ2 . the MLE .ξˆ1 = ξˆ2 = (2Z1 + Z2 )/3. Hence, the likelihood ratio statistic is T = −2{l(ξˆ1 , ξˆ2 ) − l(0, 0)} = (16/3)(Z2 − Z1 )2
.
It can be verified that the event .{|Z2 | ≤ Z1 , Z2 ≥ 0} in Case I is independent of .{32Z12 + 16Z22 ≤ z} for any z. The independence remains true for the event in Case II. However, the event in the third case is not independent of .(Z2 − Z1 )2 ≤ z. Thus, we conclude that the limiting distribution of the likelihood ratio test statistic T is given by .
Pr(T ≤ t) = 0.5 Pr(χ12 ≤ t) + 2λ Pr(χ22 ≤ t)
| +(0.5 − 2λ) Pr{(Z2 − Z1 )2 ≤ 3t/16 | |Z1 | > Z2 ≥ 0}
√ where .λ = arctan(1/ 2)/(2π ). The third term cannot be further simplified due to the lack of independence. As we remarked earlier, the distribution of T is the limiting distribution of .Rn for the test of homogeneity under the binomial mixture model. Namely, the question about the limiting distribution of .Rn is fully answered under this simple model. The example we just discussed amounts to put .θ1 = θ2 = 0.5 under the null model, and the alternative models are made of those with .θ1 = 0.5 such that .π(θ2 − 0.5) /= 0. The generic test for homogeneity is for H0 : π(1 − π )(θ2 − θ1 ) = 0
.
170
9 Test of Homogeneity
against the general alternative H0 : π(1 − π )(θ2 − θ1 ) /= 0.
.
When .m = 2, the same technique can be used to find the limiting distribution of the LRT is a much simpler 0.5χ12 + 0.5χ22 .
.
Chernoff and Lander (1995) obtain the limiting distributions for any specific m or for random m. The limiting distributions, however, do not have as simple analytical forms.
9.4 C(α) Test The general C(.α) test is designed to test a specific null value of a parameter of interest in the presence of nuisance parameters. More specifically, suppose the statistical model is made of a family of density functions .f (x; ξ, η) with some appropriate parameter space for .(ξ, η). The problem of interest is to test for .H0 : ξ = ξ0 versus the alternative .Ha : ξ /= ξ0 . Note that the parameter value of .η is left unspecified in both hypotheses, and it is of no specific interest. Because of this, .η earns a name of nuisance parameter. Due to its presence, the null and alternative hypotheses are composite as oppose to simple, due to the fact that both contain more than a single parameter value in terms of .(ξ, η). As in common practice of methodological development in statistical significance test, the size of the test is set at some .α value between 0 and 1. Working on composite hypothesis and having size .α appear to be the reasons behind the name C(.α). More close reading of Neyman and Scott (1965) reveals it may also be a tribute to Cramer (1946). While our interest lies in the use of C(.α) to test for homogeneity in the context of mixture models, it is helpful to have a generic introduction.
9.4.1 The Generic C(α) Test To motivate the C(.α) test, let us first examine the situation where the model is free from nuisance parameters. Denote the model without nuisance parameter as a distribution family .f (x; ξ ) with some parameter space for .ξ . In addition, we assume that this family is regular. Namely, the density function is differentiable in .ξ for all x, the derivative and the integration of the density function with respect to parameter can be exchanged, certain functions have an integrable upper bound, and so on. Based on an i.i.d. sample .x1 , . . . , xn , the score function of .ξ is given by
9.4 C(α) Test
171
Sn (ξ ) =
.
n E ∂ log f (xi ; ξ )
∂ξ
i=1
.
When the true distribution is given by .f (x; ξ0 ), we have .E{Sn (ξ0 )} = 0. Define the Fisher information (matrix) to be [{ I(ξ ) = E
.
∂ log f (xi ; ξ ) ∂ξ
}{
∂ log f (xi ; ξ ) ∂ξ
}T ] .
It is well known that d
SnT (ξ0 ){nI(ξ0 )}−1 Sn (ξ0 ) −→ χd2
.
where d is the dimension of .ξ . Clearly, an asymptotic test for .H0 : ξ = ξ0 versus the alternative .Ha : ξ /= ξ0 based on .Sn can be sensibly constructed with rejection region specified by SnT (ξ0 ){I−1 (ξ0 )}Sn (ξ0 ) ≥ nχd2 (1 − α).
.
We call it score test and credit its invention to Rao (2005). When the dimension of .ξ is .d = 1, then the test can be equivalently defined based on the asymptotic normality of .Sn (ξ0 ). In applications, one may replace .nI(ξ0 ) by observed information and evaluate it at a root-n consistent estimator .ξˆ . Let us now go back to the general situation where the model is the distribution family .{f (x; ξ, η)} with two parameters. If .η value in .f (x; ξ, η) is in fact known, the test problem reduces to the one we have just described. Nevertheless, let us go over the whole process again with .η carried around. Define Sn (ξ ; η) =
.
n E ∂ log f (xi ; ξ, η) i=1
∂ξ
.
The semicolon in .Sn indicates that the “score” operation is with respect to only .ξ . Similarly, let us define the .ξ -specific Fisher information to be [{ I11 (ξ, η) = E
.
∂ log f (X; ξ, η) ∂ξ
}{
∂ log f (X; ξ, η) ∂ξ
}T ] .
With the value of .η correctly specified, we have a score test statistic and its limiting distribution d
SnT (ξ0 ; η){nI11 (ξ0 , η)}−1 Sn (ξ0 ; η) −→ χd2 .
.
A score test can therefore be carried out using this statistic.
172
9 Test of Homogeneity
Without a known .η value, the temptation is to have .η replaced by an efficient or root-n consistent estimator. In general, the chi-square limiting distribution of SnT (ξ0 ; η){nI(ξ ˆ ˆ −1 Sn (ξ0 ; η) ˆ 0 ; η)}
.
is no longer the simple chi-square. For a specific choice of .η, ˆ we may work out ˆ and similarly define a new test statistic. The the limiting distribution of .SnT (ξ0 ; η) approach of Neyman (1959) achieved this goal in a graceful way. To explain the scheme of Neyman (1959), let us introduce the other part of the score function Sn (η; ξ ) =
.
n E ∂ log f (xi ; ξ, η) i=1
∂η
.
The above notation highlights that the “score operation” is with respect to only .η. Next, let us define the other part of the Fisher information matrix to be I12 (ξ ; η) =
.
IT 21 (ξ ; η)
[{ =E
∂ log f (X; ξ, η) ∂ξ
}{
∂ log f (X; ξ, η) ∂η
}T ]
and [{ I22 (ξ ; η) = E
.
∂ log f (X; ξ, η) ∂η
}{
∂ log f (X; ξ, η) ∂η
}T ] .
Let us now project .Sn (ξ ; η) into the orthogonal space of .Sn (η; ξ ) to get Wn (ξ, η) = Sn (ξ ; η) − I12 (ξ ; η)I−1 22 (ξ ; η)Sn (η; ξ ).
.
(9.2)
Conceptually, it removes the influence of the nuisance parameter .η on the score function of .ξ . At the true parameter value of .ξ, η, d
n−1/2 Wn (ξ, η) −→ N(0, {I11 − I12 I−1 22 I21 }).
.
Under the null hypothesis, the value of .ξ is specified as .ξ0 , the value of .η is ˆ unspecified. Thus, we naturally try to construct a test statistic based on .Wn (ξ0 , η) where .ηˆ is some root-n consistent estimator of .η. For this purpose, we must know the distribution of .Wn (ξ0 , η). ˆ The following result of Neyman (1959) makes the answer to this question simple.
9.4 C(α) Test
173
Theorem 9.1 Suppose .x1 , . . . , xn is an i.i.d. sample from .f (x; ξ, η). Let .Wn (ξ, η) be defined as in (9.2) together with other accompanying notations. Let .ηˆ be a root-n consistent estimator of .η when .ξ = ξ0 . We have Wn (ξ0 , η) − Wn (ξ0 , η) ˆ = op (n1/2 )
.
as .n → ∞, under any distribution where .ξ = ξ0 . Due to the above theorem and if it is to be scaled by a factor .n−1/2 , the limiting distribution of .Wn (ξ0 , η) ˆ is the same as that of .Wn (ξ0 , η) with .(ξ0 , η) being the true parameter values of the model that generated the data. Thus, −1 −1 WnT (ξ0 , η)[n{I ˆ ˆ 11 − I12 I22 I21 }] Wn (ξ0 , η)
.
may be used as the final C(.α) test statistic. The information matrix in the above definition is evaluated at .ξ0 , η, ˆ and the rejection region is decided based on its chisquare limiting distribution. If we choose .ηˆ as the constrained maximum likelihood estimator given .ξ = ξ0 , we would have .Wn (ξ0 , η) ˆ = Sn (ξ0 ; η) ˆ in (9.2). The projected score function .Sn (ξ0 , η) is one of many possible zero-mean functions satisfying some regularity properties. Neyman (1959) called such class of functions Cramér functions. Each Cramér function can be projected to obtain a corresponding .Wn and therefore a test statistic for .H0 : ξ = ξ0 . Within this class, the test based on score function .Sn (ξ0 , η) is optimal: having the highest asymptotic power against some local alternatives. In general, if the local alternative is of two-sided nature, the optimality based on the notion of “uniformly most powerful” cannot be achieved. The local optimality has to be justified in a more restricted sense. The detailed discussion is apparently tedious and not the focus of this book. We refer interested readers to Section 12 of Neyman (1959). The material in this subsection is otherwise solid and readily applicable.
9.4.2 C(α) Test for Homogeneity As shown in the last subsection, the C(.α) test is designed to test for a special null hypothesis in the presence of some nuisance parameters. The most convincing example of its application is for homogeneity test. Recall that a mixture model is represented by its density function in the form of { f (x; G) =
f (x; θ )dG(θ ).
.
Θ
174
9 Test of Homogeneity
Neyman and Scott (1965) regard the variance of the mixing distribution .G as the parameter of interest, and the mean and other aspects of .G as nuisance parameters. In the simplest situation where .Θ = R, we may rewrite the mixture density function as { √ (9.3) .ϕ(x; θ, σ, G) = f (x; θ + σ ξ )dG(ξ ) Θ
such that the standardized mixing distribution .G(·) has mean√0 and variance 1. The null hypothesis is .H0 : σ = 0. The rational of the choice of . σ instead of .σ in the above definition will be seen momentary. Both .θ and the mixing distribution G are nuisance parameters. The partial derivative of .log ϕ(x; θ, σ, G) with respect to .σ is given by { √ ξf ' (x; θ + σ ξ )dG(ξ ) ∂ϕ(x; θ, σ, G) . . = √Θ { √ ∂σ 2 σ Θ f (x; θ + σ ξ )dG(ξ ) At .σ = 0 or let .σ ↓ 0, we find .
| ∂ϕ(x; θ, σ, G) || f '' (x; θ ) . = | ∂σ 2f (x; θ ) σ ↓0
This is the score √ function for .σ based on a single observation. We may notice that the choice of . σ gives us the non-degenerate score function, which is the reason for this choice in the first place. The partial derivative of .log ϕ(x; θ, σ, G) with respect to .θ is given by { √ f ' (x; θ + σ ξ )dG(ξ ) ∂ϕ(x; θ, σ, G) Θ = { . √ ∂θ Θ f (x; θ + σ ξ )dG(ξ ) which leads to score function for .θ based on a single observation as: .
f ' (x; θ ) ∂ϕ(x; θ, 0, G) = . ∂θ f (x; θ )
Both of them are free from the mixing distribution G. Let us now define yi (θ ) =
.
f ' (xi ; θ ) f '' (xi ; θ ) , zi (θ ) = f (xi ; θ ) 2f (xi ; θ )
(9.4)
with .xi ’s being i.i.d. observationsEfrom the mixture E model. The score functions based on the entire sample are . ni=1 zi (θ ) and . ni=1 yi (θ ) for the mean and variance of G. Based on the principle of deriving test statistic .Wn in the last
9.4 C(α) Test
175
subsection, we first project .zi (θ ) into space of .yi (θ ), and make use of the residual wi (θ ) = zi (θ ) − β(θ )yi (θ ). The regression coefficient
.
β(θ ) =
.
E{Y1 (θ )Z1 (θ )} . E{Y12 (θ )}
We have capitalized Y and Z to highlight their status as random variables. The expectation is with respect to the homogeneous model .f (x; θ ). When .θˆ is the maximum likelihood estimator of .θ under the homogeneous model assumption .f (x, θ ), the C(.α) statistic has a simple form: En En ˆ ˆ i=1 Zi (θ) i=1 Wi (θ ) / .Wn = = / nν(θˆ ) nν(θˆ )
(9.5)
with .ν(θ ) = E{W12 (θ )}. Because the parameter of interest is univariate, .Wn may be directly used as a test statistic. There is not need to create a quadratic form of .Wn . In addition, .Wn has standard normal limiting distribution, and the homogeneity null hypothesis is one-sided. At a given significance level .α, we reject the homogeneity hypothesis .H0 when .Wn > zα . This is then the C(.α) test for homogeneity. In deriving the C(.α) statistic, we assumed that the parameter space .Θ = R. With this parameter space, if .G(·) is a mixing distribution on .Θ, so is .G((θ − θ ∗ )/σ ∗ ) for any .θ ∗ and .σ ∗ ≥ 0. We have made use of this fact in (9.2). If .Θ = R + as in the Poisson mixture model where .θ ≥ 0, .G((θ − θ ∗ )/σ ∗ ) cannot be regarded as a legitimate mixing distribution for some .θ ∗ and .σ ∗ . In the specific example of Poisson mixture, one may re-parameterize model with .ξ = log θ . However, there seems to be no unified approach in general, and the optimality consideration is at stake. Whether or not the mathematical derivation of .Wn can be carried out as we did earlier for other forms of .Θ, the statistic .Wn remains a useful metric on the plausibility of the homogeneity hypothesis. The limiting distribution of .Wn remains the same, and it is useful in detecting the population heterogeneity.
9.4.3 C(α) Statistic Under NEF-QVF Many commonly used distributions in statistics belong to a group of natural exponential families with quadratic variance function (Morris 1982, NEF-QVF). The examples include normal, Poisson, binomial, and exponential. The density function in one-parameter natural exponential family has a unified analytical form f (x; θ ) = h(x) exp{xφ − A(φ)},
.
176
9 Test of Homogeneity
with respect to some .σ -finite measure, where .θ = A' (φ) represents the mean parameter. Let .σ 2 = A'' (φ) be the variance under .f (x; θ ). To be a member of NEF-QVF, there must exist constants .a, b, and c such that σ 2 = A'' (φ) = a{A' (φ)}2 + bA' (φ) + c = aθ 2 + bθ + c.
(9.6)
.
Namely, the variance is a quadratic function of the mean. When the kernel density function .f (x; θ ) is a member of NEF-QVF, the C(.α) statistic has a particularly simple analytical form and simple interpretation. Theorem 9.2 When the kernel .f (x; θ ) is a member of NEF-QVF, then the En
− x) ¯ 2 − nσˆ 2 , √ 2n(a + 1)σˆ 2
i=1 (xi
Wn =
.
E where C(.α) statistic is given by .x¯ = n−1 ni=1 xi and .σˆ 2 = a x¯ 2 + bx¯ + c with coefficients given by (9.6) are the maximum likelihood estimators of .θ and .σ 2 , respectively. The analytical forms of the C(.α) test statistics for the normal, Poisson, binomial, and exponential kernels are included in Table 9.1 for easy reference. Their derivation is given in the next subsection. E Note that the C(.α) statistics contains the factor . ni=1 (xi − x) ¯ 2 which is a scaled up sample variance. The second term in the numerator of these C(.α) statistics is the corresponding “estimated variance” if the data are from the corresponding homogeneous NEF-QVF distribution. Hence, in each case, the test statistic is the difference between the “observed variance” and the “perceived variance” when the null hypothesis is true under the corresponding NEF-QVF subpopulation distribution assumption. The difference is then divided by their estimated null asymptotic variance. Thus, the C(.α) test is the same as the “over-dispersion” test. See Dean and Lawless (1989).
9.4.4 Expressions of the C(α) Statistics for NEF-VEF Mixtures The quadratic variance function under the natural exponential family is characterized by its density function .f (x; θ ) = h(x) exp{xφ − A(φ)} and .A'' (φ) = aA' (φ) + bA' (φ) + c for some constants a, b, and c. The mean and variance are Table 9.1 Analytical form of C(.α) some NEF-QVF mixtures Kernel
.N (θ, 1)
En
C(.α)
.
i=1 (xi
√
−x) ¯ 2 −n
2n
Poisson(.θ)
BIN(.m, p)
En
En
.
i=1 (xi
√
−x) ¯ 2 −nx¯
2nx¯
i=1 (xi
. √
−x) ¯ 2 −nx(m− ¯ x)/m ¯
2n(1−1/m)x(m− ¯ x)/m ¯
Exp(.θ) En .
¯ i=1 (xi −x) √
2 −nx¯ 2
4nx¯ 2
9.4 C(α) Test
177
given by .θ = A' (φ) and .σ 2 = A'' (φ). Taking derivatives with respect to .φ on the quadratic relationship, we find that A''' (φ) = {2aA' (φ) + b}A'' (φ) = (2aθ + b)σ 2 ,
.
A(4) (φ) = 2a{A'' (φ)}2 + {2aA' (φ) + b}A''' (φ) = 2aσ 4 + (2aθ + b)2 σ 2 . Because of the regularity of the exponential family, we have { E
.
d k f (X; θ ∗ )/dφ k f (X; θ ∗ )
} =0
for .k = 1, 2, 3, 4 where .θ ∗ is the true parameter value under the null model. This implies E{(X − θ ∗ )3 } = A''' (φ ∗ ) = (2aθ ∗ + b)σ ∗ 2 ,
.
E{(X − θ ∗ )4 } = 3{A'' (φ ∗ )}2 + A(4) (φ ∗ ) = (2a + 3)σ ∗ 4 + (2aθ ∗ + b)2 σ ∗ 2 , where .φ ∗ is the value of the natural parameter corresponding to .θ ∗ , and similarly for .σ ∗ 2 . The ingredients of the C(.α) statistics are .
Yi (θ ∗ ) =
(Xi − θ ∗ ) f ' (X; θ ∗ ) = , f (Xi ; θ ∗ ) σ ∗2
Zi (θ ∗ ) =
f '' (X; θ ∗ ) (Xi − θ ∗ )2 − (2aθ ∗ + b)(Xi − θ ∗ ) − σ ∗ 2 . = 2f (Xi ; θ ∗ ) 2{σ ∗ }4
We then have E{Yi (θ ∗ )Zi (θ ∗ )}
.
=
E{(Xi − θ ∗ )3 } − (2aθ ∗ + b)E{(Xi − θ ∗ )2 } − σ ∗ 2 E(Xi − θ ∗ ) = 0. 2σ ∗ 6
Therefore, the regression coefficient of .Zi (θ ∗ ) against .Yi (θ ∗ ) is .β(θ ∗ ) = 0. This leads to the projection Wi (θ ∗ ) = Zi (θ ∗ ) − β(θ ∗ )Yi (β ∗ ) = Zi (θ ∗ )
.
and 4{σ ∗ }8 VAR{Wi (θ ∗ )} = VAR{(Xi − θ ∗ )2 } − 2(2aθ ∗ + b)E{(Xi − θ ∗ )3 }
.
+(2aθ ∗ + b)2 E{(Xi − θ ∗ )2 }
178
9 Test of Homogeneity
= (2a + 3){σ ∗ }4 + (2aθ ∗ + b)2 {σ ∗ }2 − {σ ∗ }4 −2(2aθ ∗ + b)2 {σ ∗ }2 + (2aθ ∗ + b)2 {σ ∗ }2 = (2a + 2){σ ∗ }4 . Hence, .ν(θ ∗ ) = VAR{Wi (θ ∗ )} = 0.5(a + 1){σ ∗ }−4 . ¯ we find that Because the maximum likelihood estimator .θˆ = X, n E .
i=1
Wi (θˆ ) =
n E i=1
Zi (θˆ ) =
En
¯ 2 − σˆ 2 − X) 4 2σˆ
i=1 (Xi
with .σˆ 2 = / a X¯ 2 + bX¯ + c due to invariance. The C(.α) test statistic, .Tn = En ˆ ˆ i=1 Wi (θ )/ nν(θ ), is therefore given by the simplified expression in Theorem 9.2. Must of the context in this section is part of Chen et al. (2016). u n
Chapter 10
Likelihood Ratio Test for Homogeneity
10.1 Likelihood Ratio Test for Homogeneity: One Parameter Case While C(.α) test is largely successful, statisticians remain faithful to the likelihood ratio test for homogeneity. The example of Hartigan (1985) in Sect. 9.2 shows that the straightforward likelihood ratio statistic .Rn is stochastically unbounded. Hence, a test rejecting homogeneity when .Rn > C0 for any .C0 not depending on the sample size n has asymptotic 100% type I error as .n → ∞. Thus, if the likelihood ratio test for homogeneity is insisted, some remedy is a must. One such remedy is to set the rejection region into the form of .Rn > Cn for some non-random .Cn depending on n. By allowing .Cn increasing with n, it is possible in principle to have the type I error, .Pr(Rn > Cn ) under the null model to be close to the desired nominal level when n is large enough. Searching for such a .Cn is, however, technically challenging. For the same example as in Hartigan (1985), the asymptotic results given in Bickel and Chernoff (1993) and Liu and Shao (2004) may be regarded as effort in this respect. The unboundedness revealed in Hartigan (1985) is mostly due to the fact that the component parameter space .Θ is unbounded. If .Θ is bounded such as in the binomial mixture, the likelihood ratio statistics can have a non-degenerating limiting distribution as shown in the last chapter, and we do not need additional scaling. In this chapter, we go over technical details in obtaining the limiting distribution of the likelihood ratio statistic when .Θ is one-dimensional and chosen to be bounded, followed by some similar results under normal mixture. We start with the general setting and some notation. Let .f (x; θ ) with .θ ∈ Θ ⊂ R be a p.d.f. or p.m.f. with respect to a .σ -finite measure. We observe a random sample .X1 , . . . , Xn of size n from a population with the mixture p.d.f. f (x; G) = (1 − π )f (x; θ1 ) + πf (x; θ2 ),
.
© Springer Nature Singapore Pte Ltd. 2023 J. Chen, Statistical Inference Under Mixture Models, ICSA Book Series in Statistics, https://doi.org/10.1007/978-981-99-6141-2_10
(10.1)
179
180
10 Likelihood Ratio Test for Homogeneity
where .θj ∈ Θ and .j = 1, 2 are component parameter values and .π and .1 − π are the mixing proportions. Without loss of generality, .0 ≤ π ≤ 1/2. In this set up, the mixing distribution G has at most two distinct support points .θ1 and .θ2 . We consider the problem of testing H0 : π = 0 or θ1 = θ2 ,
.
versus the full model (10.1). The log likelihood function of the mixing distribution is given by ln (G) =
n E
.
log{(1 − π )f (xi ; θ1 ) + πf (xi ; θ2 )}.
i=1
ˆ is a mixing distribution The maximum likelihood estimator of G, denoted as .G, on .Θ with at most two distinct support points at which .ln (G) attains its maximum ˆ will be denoted as .π, ˆ θˆ1 , and .θˆ2 . With the convention value. The composition of .G declared earlier, we have .πˆ ≤ 1/2 under the full model. The likelihood ratio test statistic is twice of the difference between two maximum log likelihood values. One is obtained under the full model, and the other is obtained under the null model. In our case, when the model is confined in the space of null hypothesis, the log likelihood function is simplified to ln (θ ) =
n E
.
log{f (xi ; θ )}.
i=1
Note that .ln (·) is a generic notation, which can be a function of the mixing distribution G or its degenerate form .θ . Denote the global maximum point of .ln (θ ) as .θˆ . The likelihood ratio test statistic is then ˆ − ln (θˆ )}. Rn = 2{ln (G)
.
This chapter gives one approach of finding the limiting distribution of .Rn . We assume that the finite mixture model (10.1) satisfies various conditions including C30.1–4 in Theorem 3.1 without explicitly spelling out all of them. Many conditions such as the smoothness of .f (x; θ ) in .θ are apparently needed. Yet we include only most relevant conditions in stating conclusions in this chapter, not these apparent ones. Strong identifiability The model (10.1) is strongly identifiable if for any .θ1 /= θ2 in .Θ, 2 E .
{aj f (x, θj ) + bj f ' (x, θj ) + cj f '' (x, θj )} = 0,
j =1
for all x implies that .aj = bj = cj = 0, .j = 1, 2.
10.1 Likelihood Ratio Test for Homogeneity: One Parameter Case
181
Strong identifiability in simple words is to require the functions in x, .f (x; θ ), f ' (x; θ ), and .f '' (x; θ ) are linearly algebraically independent. This condition was first introduced in Chap. 8 in the context for establishing the best possible rate of estimating G. The above form is a simplified version for the specific occasion. Its re-appearance is an indication on its intrinsic necessity. The following quantities have appeared in several places already. Some of them are needed soon and others will be needed later. We have them defined in the same place here: for .i = 1, 2, . . . , n,
.
.
Yi (θ ) =
f (Xi , θ ) − f (Xi , θ ∗ ) ; (θ − θ ∗ )f (Xi , θ ∗ )
Zi (θ ) =
Yi (θ ) − Yi (θ ∗ ) . θ − θ∗
We may omit the subindex i so that .Y (θ ) stands for a generate random variable not attached to specific observation in the sample. At .θ = θ ∗ , functions .Yi (θ ) and .Zi (θ ) take their continuity limits as their values. We denote the projection residual of Z to the space of Y as Wi (θ ) = Zi (θ ) − h(θ )Yi (θ ∗ ),
.
(10.2)
where h(θ ) = E{Y (θ ∗ )Z(θ )}/E{Y 2 (θ ∗ )}.
.
Unlike Y , the entity W is uncorrelated with Z and this is the desired property to be utilized. These notations match the ones defined in the section for C(.α) test with a minor alternation: the current .Y (θ ∗ ) differs from the previous one by a factor of 2. This alternation leads to ease in presentation and does not change the nature of the problem. Under generic model assumptions on the component distribution function .f (x; θ ), both .Y (θ ) and .Z(θ ) are continuous in .θ and .E{Y (θ )} = 0 and ' .E{Z(θ )} = 0. By the mean value theorem in mathematical analysis, .Z(θ ) = Y (θ˜ ) ˜ for some .θ . This gives us a useful connection. Both .Y (θ ) and .Z(θ ) were regarded as score functions in the contents of C(.α) test. We now put down a bit unusual requirement on .Y (θ ) and its derivative .Y ' (θ ). Note that .Y (θ ) is also a function of X given .θ . Uniform Integrable Upper Bound Condition There exist a constant .δ > 0 and a function .g(·) such that .
|Y (θ )|4+δ ≤ g(X); |Y ' (θ )|3 ≤ g(X); |Z(θ )|4+δ ≤ g(X); |Z ' (θ )|3 ≤ g(X)
for all .θ ∈ Θ and .E{g(X)} < ∞.
182
10 Likelihood Ratio Test for Homogeneity
The following is the main conclusion of this chapter. Theorem 10.1 Suppose the mixture model (10.1) satisfies all conditions specified in this section. When .f (x, θ ∗ ) is the true null distribution that generated n i.i.d. observations, the asymptotic distribution of the likelihood ratio test statistic for homogeneity as .n → ∞ is that of {sup W + (θ )}2 ,
.
θ∈Θ
where .W (θ ), .θ ∈ Θ, is a Gaussian process with mean 0, variance 1 and autocorrelation function .ρ(θ, θ ' ) given by COV {W1 (θ ), W1 (θ ' )}
ρ(θ, θ ' ) = /
.
E{W12 (θ )}E{W12 (θ ' )}
,
(10.3)
and that .W1 (θ ) is defined by (10.2). This specific result is first presented in Chen and Chen (2001). More generic results can be found in Dacunha-Castelle and Gassiat (1999). These results were worked out independently. The former is based on straightforward but tedious analysis, and the latter utilizes more advanced theory in probability. Even though the limiting distribution presented here is rather clean mathematically, it is difficult to develop a practical asymptotic test for homogeneity. The first obstacle of such an attempt is to numerically determine the quantiles of the limiting distribution, which is not one of the classical parametric distributions. The second obstacle is that the limiting distribution depends on the value .θ ∗ , the unknown truth behind the homogeneous null distribution and nonexistence when the data are from an alternative distribution. The last one we can think of is: the range of the supremum is also dependent on .θ ∗ . This is an extra layer of difficulty. The general results of Dacunha-Castelle and Gassiat (1999) lead to even more issues. It can be challenging to merely have the range of supremum specified concretely rather than merely conceptually. Nevertheless, their results as well as ours here form a useful platform for further developments. Before giving a proof, we first examine the Gaussian process .W (θ ) with a few examples.
10.2 Examples The autocorrelation function .ρ(θ, θ ' ) of the Gaussian process .W (θ ) is well behaved with commonly used subpopulation densities. The tightness conditions can be easily verified for the following examples, and their verifications are omitted.
10.2 Examples
183
Example 10.1 (Normal Subpopulation Density) Let .f (x; θ ) be normal .N(θ, σ 2 ) with known .σ . For simplicity, we only work on the problem with the true parameter values given by .θ ∗ = 0 and .σ ∗ = 1. Otherwise, one should make an appropriate parameter transformation before the conclusion of this example becomes applicable. We will also ignore the exact form of .Θ but assume it is a compact interval. With this convention, it is simple to find that with normal subpopulation, Y1 (θ ) =
.
exp(θ X1 − θ 2 /2) − 1 θ
which leads to .Y1 (0) = X1 . Consequently, when .θ /= 0, exp(θ X1 − θ 2 /2) − 1 − θ X1 Y1 (θ ) − Y1 (0) = . θ θ2
Z1 (θ ) =
.
Let .θ → 0, we get .Z1 (0) = (1/2)(X12 − 1). Note that .E{exp(θ X − θ 2 /2)} = 1 for all .θ under .N(0, 1). Taking derivative with respect to .θ , we find E{(X − θ ) exp(θ X − θ 2 /2)} = 0.
.
This helps to find that . COV
(X, Y (θ )) = 1.
When both .θ /= 0 and .θ ' /= 0, the above calculations lead to E{Z1 (θ )Z1 (θ ' )} =
.
exp(θ θ ' ) − 1 − θ θ ' . (θ θ ' )2
The correlation function is therefore given by exp(θ θ ' ) − 1 − θ θ ' ρ(θ, θ ' ) = / . {exp(θ 2 ) − 1 − θ 2 }{exp(θ '2 ) − 1 − θ '2 }
.
When .θ /= 0 but .θ ' = 0, the correlation function becomes θ2 ρ(θ, 0) = / . 2{exp(θ 2 ) − 1 − θ 2 }
.
This is the auto-correlation function of .W (θ ) that defines the limiting distribution of the likelihood ratio test for homogeneity under normal subpopulation. u n Example 10.2 (Binomial Subpopulation) Consider the binomial subpopulation density function
184
10 Likelihood Ratio Test for Homogeneity
f (x; θ ) ∝ θ x (1 − θ )k−x , for x = 0, . . . , k.
.
We now express .Y (θ ) as Y (θ ) =
.
} 1 { f (X; θ ) − 1 θ − θ ∗ f (X; θ ∗ )
which leads to 1 .Y (θ ) = θ − θ∗ Y (θ ∗ ) =
[(
1−θ 1 − θ∗
)k {
(1 − θ ∗ )/θ ∗ (1 − θ )/θ
}X
] −1 ;
X k−X − . θ∗ 1 − θ∗
The exact expression of .Z(θ ) is algebraically massive, but the computation of its covariance functions can be accomplished untediously. We start with the knowledge that under binomial distribution .(k, θ ∗ ), its “moment generating function” is given by E{t X } = (1 − θ ∗ + θ ∗ t)k .
.
Making use of this fact eases the computation of the result given below 1 ' . COV (Y (θ ), Y (θ )) = (θ − θ ∗ )(θ ' − θ ∗ )
[{
(1 − θ )(1 − θ ' ) θ θ ' + ∗ 1 − θ∗ θ
}k
] −1 .
Letting .θ ' → θ ∗ , we get . COV
(Y (θ ), Y (θ ∗ )) =
k θ ∗ (1 − θ ∗ )
which does not depend on .θ . This shows . VAR
(Y (θ ∗ )) =
k . θ ∗ (1 − θ ∗ )
Recall .Z(θ ) = {Y (θ ) − Y (θ ∗ )}/(θ − θ ∗ ). The covariance of .Z1 (θ ) and .Z1 (θ ' ) for ' ∗ .θ, θ /= θ is therefore given by 1 . ∗ 2 (θ − θ ) (θ ' − θ ∗ )2 Let
[{
(θ − θ ∗ )(θ ' − θ ∗ ) 1+ θ ∗ (1 − θ ∗ )
}k
] (θ − θ ∗ )(θ ' − θ ∗ ) −1−k . θ ∗ (1 − θ ∗ )
10.2 Examples
185
/ u=
.
k(θ − θ ∗ ) , θ ∗ (1 − θ ∗ )
/ u' =
k(θ ' − θ ∗ ) . θ ∗ (1 − θ ∗ )
We obtain an easier to comprehend expression: ρ(θ, θ ' ) = /{
.
(1 + uu' /k)k − 1 − uu' }{ }. (1 + u2 /k)k − 1 − u2 (1 + u' 2 /k)k − 1 − u' 2
Interestingly, when k is large, exp(uu' ) − 1 − uu' ρ(θ, θ ' ) ≈ / . {exp(u2 ) − 1 − u2 }{exp(u'2 ) − 1 − u'2 }
.
That is, when k is large, the autocorrelation function of .W (θ ) behaves similarly to that of the normal subpopulation. Because when .k → ∞, binomial distribution converges to normal distribution; this result is very natural. u n Example 10.3 (Poisson Subpopulation Density Function) Under this model, the density function f (x, θ ) ∝ θ x exp(−θ ), for x = 0, 1, 2, . . . .
.
Hence, the key quantities for computing the correlation function of .W (θ ) are given by Y (θ ) =
.
Y (θ ∗ ) =
exp{−(θ − θ ∗ )}(θ/θ ∗ )X − 1 for θ − θ∗
θ /= θ ∗ ,
X − θ∗ . θ∗
It is seen that .E(θ/θ ∗ )X = exp(θ − θ ∗ ). From this, we easily find . COV
(Y (θ ), Y (θ ' )) =
exp{(θ − θ ∗ )(θ ' − θ ∗ )/θ ∗2 } − 1 . (θ − θ ∗ )(θ ' − θ ∗ )
Let .θ ' → θ ∗ , we find . COV
(Y (θ ), Y (θ ∗ )) = VAR(Y (θ ∗ ) =
1 . θ∗
Based on these result, when both .θ /= θ ∗ and .θ ' /= θ ∗ , we find . COV
{Z(θ ), Z(θ ' )} =
exp{(θ − θ ∗ )(θ ' − θ ∗ )/θ ∗ } − 1 − (θ − θ ∗ )(θ ' − θ ∗ )/θ ∗ . (θ − θ ∗ )2 (θ ' − θ ∗ )2
186
10 Likelihood Ratio Test for Homogeneity
Put θ − θ∗ , v= √ θ∗
.
θ' − θ∗ v' = √ . θ∗
Then the correlation function becomes ρ(θ, θ ' ) = /
.
exp(vv ' ) − 1 − vv '
.
{exp(v 2 ) − 1 − v 2 }{exp(v ' 2 ) − 1 − v ' 2 }
Interestingly, this form is identical to the one for normal subpopulation. The classical central limit theorem may have played a role here again. u n We can easily verify that all conditions of the theorem are satisfied in these examples, when the parameter space .Θ is confined to a compact subset of .R. Exponential distribution is another popular member of NEF-QVF. This distribution does not satisfy the integrable upper bound condition in general. It is somewhat a surprise that many of the results developed in the literature are not applicable to the exponential mixture model. We will learn more about exponential mixture when a new line of approach is introduced.
10.3 The Proof of Theorem 10.1 10.3.1 An Expansion of θˆ Under the Null Model If the data set is known to be a random sample from a distribution .f (x; θ ∗ ), which belongs to the space of null model yet the value .θ ∗ is hidden from us, we may estimate .θ ∗ by its MLE .θˆ under the null model. Since the null model is regular in the current setting, we claim the following two widely hold expansions related to likelihood approach without derivations: En ∗ i=1 Yi (θ ) ∗ ˆ + op (n−1/2 ), .θ − θ = E n 2 ∗ i=1 Yi (θ ) and R2n
.
E { ni=1 Yi (θ ∗ )}2 ∗ ˆ + op (1). = 2{ln (θ ) − ln (θ )} = En 2 ∗ i=1 Yi (θ )
The reason of using notation .R2n will soon be seen. The first expansion is the basis for the asymptotic normality of .θˆ , and the second expansion is the basis for asymptotic chisquared limiting distribution for the likelihood ratio statistics for regular parametric models.
10.3 The Proof of Theorem 10.1
187
The relevance of these two expansions in the context of homogeneity test is through the following decomposition of the likelihood ratio test statistic: ˆ − ln (θ ∗ )} − 2{ln (θˆ ) − ln (θ ∗ )} Rn = 2{ln (G)
.
= R1n − R2n
(10.4)
An expansion of the second term .R2n has been given. Searching for the limiting distribution of .Rn will be accomplished through an expansion for .R1n . This is the major task, and we will get it done in a few steps.
10.3.2 Expansion of R1n : Preparations ˆ for Based on the consistency result given in Chap. 2, we take the consistency of .G G∗ (θ ) = 1(θ ∗ ≤ θ ) as granted. Denote the MLE of G in its c.d.f. form
.
.
ˆ ) = (1 − πˆ )1(θˆ1 ≤ θ ) + π1( G(θ ˆ θˆ2 ≤ θ )
and recall that .πˆ ≤ 1/2 by our convention. Lemma 10.1 As the sample size .n → ∞, both .θˆ1 −θ ∗ and .πˆ (θˆ2 −θ ∗ ) converge to 0 almost surely when .f (x; θ ∗ ) is the true distribution, namely, when the i.i.d. sample of size n is generated from .f (x; θ ∗ ). Proof Since .Θ is compact as assumed in this chapter, we have .δ infθ ∈Θ exp(−|θ|) > 0. Thus, for the distance defined by (2.3), we have ˆ G∗ ) = .Dkw (G,
{
=
ˆ ) − G∗ (θ )| exp(−|θ|)dθ |G(θ Θ
{
≥δ
ˆ ) − G∗ (θ )|dθ |G(θ
Θ
= δ{(1 − πˆ )|θˆ1 − θ ∗ | + π| ˆ θˆ2 − θ ∗ |}. ˆ G∗ ) → 0 almost surely. Since .πˆ < 1/2, ˆ .Dkw (G, By the strong consistency of .G, ∗ this implies both .|θˆ1 − θ | → 0 and .π| ˆ θˆ2 − θ ∗ | → 0 almost surely. These are the conclusions of this lemma. u n It will be seenEthat the limiting distribution of .Rn is a functional of the E limiting processes . ni=1 Yi (θ ), . ni=1 Zi (θ ) and the like. Some discussions on their stochastic properties are necessary. We first show the following. Lemma 10.2 Suppose the model (10.1) satisfies the strong identifiability condition and the uniform integrable upper bound condition, then the covariance matrix of the vector .(Y (θ ∗ ), Z(θ ))T is positive definite at all .θ ∈ Θ under the homogeneous model .f (x; θ ∗ ).
188
10 Likelihood Ratio Test for Homogeneity
Proof By Cauchy inequality, for each given .θ ∈ Θ, we have E2 {Y (θ ∗ )Z(θ )} ≤ E{Y 2 (θ ∗ )}E{Z 2 (θ )},
.
where the equality holds only if .Y (θ ∗ ) and .Z(θ ) are linearly associated. This possibility is eliminated when the strong identifiability condition holds. Hence, the strict inequality holds under the lemma assumption. The integrable upper bound conditions ensure that these expectations exist. This completes the proof. u n Suppose the sample space of the probability space is a metric space. A collection of probability measures are tight if for any .e > 0, there exists a compact set A whose measure is greater than .1 − e in terms of every one of its probability measures. The precise definition of tightness is given as Definition 17.2 in Chap. 17. Roughly speaking, a sequence of stochastic process defined on an interval induces a sequence of probability measures in the sample space made of functions. We say that this sequence is tight when the collection of these probability measures is tight. Being tight is a necessary condition for the convergence of stochastic processes. In addition, it ensures that the supremum of a stochastic process is a proper random variable. Namely, it does not take infinite value with a non-zero probability. For this reason, the following result is needed here. Lemma the uniform E integrability condition, the processes E10.3 Under E E n−1/2 Yi (θ ), .n−1/2 Yi' (θ ), .n−1/2 Zi (θ ), and .n−1/2 Zi' (θ ) over .Θ are tight.
.
Proof We make use of a result provided in Chap. 17 derived E from Theorem 12.3 of (Billingsley 2008, pp95) for this lemma. Clearly, .n−1/2 Yi (θ ) has continuous sample path as .Yi (θ ) is Econtinuous in .θ . At any fixed .θ value, it is asymptotically normal. Hence, .n−1/2 Yi (θ ) is tight when .θ is held fixed. It is seen that for any .θ1 , θ2 ∈ Θ, under the assumed uniform integrable upper bound condition on page 181, we have } { E{Y (θ2 ) − Y (θ1 )}2 ≤ E g 2/3 (X1 )(θ2 − θ1 )2 = C(θ2 − θ1 )2
.
for some constant C. This implies that { E n−1/2
n E
.
Yi (θ2 ) − n−1/2
i=1
n E
}2 Yi (θ1 )
≤ C(θ2 − θ1 )2 .
i=1
E This result together with theE tightness of .n−1/2 ni=1 Yi (θ ) at a single .θ value n implies the tightness of .n−1/2 i=1 Yi (θ ) as a stochastic process. The claims on the tightness of other two processes are true for the same reason. u n We are now ready to work on .R1n .
10.3 The Proof of Theorem 10.1
189
10.3.3 Expanding R1n We carefully analyze the asymptotic behavior of the likelihood ratio test statistics under the true distribution .f (x; θ ∗ ). For convenience, we rewrite the log likelihood function under two-component mixture model as ln (π, θ1 , θ2 ) =
n E
.
log{(1 − π )f (Xi ; θ1 ) + πf (Xi ; θ2 )}.
i=1
The new notation helps to highlight the detailed structure of the mixing distribution G. As usual, we have used .ln (·) in a very generic and non-rigorous fashion. Let r1n (π, θ1 , θ2 ) = 2{ln (π, θ1 , θ2 ) − ln (0, θ ∗ , θ ∗ )}
.
=2
n E
log(1 + δi ),
i=1
where { f (Xi , θ1 ) } } { f (Xi , θ2 ) δi = (1 − π ) −1 −1 +π ∗ ∗ f (Xi , θ ) f (Xi , θ )
.
= (1 − π )(θ1 − θ ∗ )Yi (θ1 ) + π(θ2 − θ ∗ )Yi (θ2 ). The first term in (10.4) may be written as R1n = sup r1n (π, θ1 , θ2 ).
.
π,θ1 ,θ2
By Lemma 10.1, the supremum occurs only at values at which .θ1 − θ ∗ = op (1) and ∗ .π(θ2 −θ ) = op (1). This result confines our attention of .θ1 to a small neighborhood ∗ of .θ . Yet to work out the supremum, we must expect that the supremum can occur with .θ2 being located in any part of .Θ. For this reason, we investigate the value of .r1n (π, θ1 , θ2 ) at each individual .θ2 value in two cases. Let R1n (e; I ) =
.
sup
r1n (π, θ1 , θ2 )
sup
r1n (π, θ1 , θ2 ).
|θ2 −θ ∗ |>e
and R1n (e; I I ) =
.
|θ2 −θ ∗ |≤e
The size of .e will be chosen very small but the exact size is to be specified later.
190
10 Likelihood Ratio Test for Homogeneity
Case I: .|θ2 − θ ∗ | > e. In this case, let .πˆ (θ2 ) and .θˆ1 (θ2 ) be the MLE’s of .π and .θ1 with fixed .θ2 ∈ Θ such that .|θ2 − θ ∗ | > e. Namely, they maximize .r1n (π, θ1 , θ2 ) given .θ2 in this region. The consistency results of Lemma 10.1 remain true for .πˆ (θ2 ) and .θˆ1 (θ2 ). For simplicity of notation, we write .πˆ = πˆ (θ2 ) and .θˆ1 = θˆ1 (θ2 ) by omitting their dependence on .θ2 . Recalling the consistency combined with restriction .π ≤ 1/2 implies .θˆ1 −θ ∗ = op (1) and .πˆ (θ2 −θ ∗ ) = op (1). With the restriction of .|θ2 −θ ∗ | ≥ e here, we further have .πˆ = op (1). First we establish an upper bound on .Rn (e; I ). By the inequality .2 log(1 + x) ≤ 2x − x 2 + (2/3)x 3 , we have r1n (π, θ1 , θ2 ) = 2
n E
.
log(1 + δi ) ≤ 2
i=1
n E
δi −
i=1
n E
2E 3 δi . 3 n
δi2 +
i=1
(10.5)
i=1
Replacing .θ1 with .θ ∗ in the quantity .Yi (θ1 ) gives δi = (1 − π )(θ1 − θ ∗ )Yi (θ ∗ ) + π(θ2 − θ ∗ )Yi (θ2 ) + ein ,
.
where the remainder is given by ein = (1 − π )(θ1 − θ ∗ ){Yi (θ1 ) − Yi (θ ∗ )}
.
= (1 − π )(θ1 − θ ∗ )2 Zi (θ1 ). Introducing
m1 = (1 − π )(θ1 − θ ∗ ) + π(θ2 − θ ∗ );
.
m2 = π(θ2 − θ ∗ )2
(10.6)
we get a slightly simpler expression (1 − π )(θ1 − θ ∗ )Yi (θ ∗ ) + π(θ2 − θ ∗ )Yi (θ2 ) = m1 Yi (θ ∗ ) + m2 Zi (θ2 ).
.
E Since .n−1/2 ni=1 Zi (θ1 ) = Lemma 10.3, we conclude that en =
n E
.
Op (1) due to its tightness established in
n E { } ein = n1/2 (1 − π )(θ1 − θ ∗ )2 n−1/2 Zi (θ1 ) = n1/2 (θ1 − θ ∗ )2 Op (1).
i=1
i=1
At the same time, (θ1 − θ ∗ )2 = (1 − π )−2 {m1 + π(θ2 − θ ∗ )}2
.
≤ 2(1 − π )−2 {m21 + m22 /(θ2 − θ ∗ )2 } ≤ 2(1 − π )−2 {m21 + m22 /e 2 } = (m21 + m22 )Op (1).
10.3 The Proof of Theorem 10.1
191
Hence, en = n1/2 (m21 + m22 )Op (1) = n(m21 + m22 )op (1).
.
We wish to do the same to the quadratic and cubic terms of .r1n without changing the first order asymptotic. Note that n E .
δi2 =
i=1
n E
{m1 Yi (θ ∗ ) + m2 Zi (θ2 )}2
i=1
+
n E
{m1 Yi (θ ∗ ) + m2 Zi (θ2 )}ein +
i=1
n E
2 ein .
i=1
By uniform strong law of large numbers again, −1
n
.
n E
{m1 Yi (θ ∗ ) + m2 Zi (θ2 )}2 → E{m1 Y (θ ∗ ) + m2 Z(θ2 )}2 .
i=1
Due to strong identifiability (page 180), .E{m1 Y (θ ∗ )+m2 Z(θ2 )}2 is positive definite quadratic form of .m1 and .m2 , for all .θ ∗ and .θ2 . Hence, there exists a positive constant .γ such that almost surely, E{m1 Y (θ ∗ ) + m2 Z(θ2 )}2 ≥ γ {m21 + m22 }.
.
Applying this result back, we find the order assessment:
.
n E {m1 Yi (θ ∗ ) + m2 Zi (θ2 )}2 ≥ nγ {m21 + m22 } i=1
almost surely. One may not realize the significance of the direction in this inequality. It shows that the quadratic form is not merely .Op (n), but is strictly as large in asymptotic order. This result makes it possible to show other terms are of relatively smaller order and therefore negligible. Making use of the fact that we are considering the region where .π < 1/2, combined with .θˆ1 − θ ∗ = o(1) and .πˆ = op (1), we find that in this region, (1 − π )2 (θ1 − θ ∗ )4 = (1 − π )−2 {(1 − π )(θ1 − θ ∗ )}4
.
= (1 − π )−2 {m1 − π(θ2 − θ ∗ )}4 ≤ 16(1 − π )−2 [m41 + {π(θ2 − θ ∗ )}4 ] ≤ 16(1 − π )−2 m41 + 16(1 − π )−3 π 2 m22 = (m21 + m22 )op (1).
192
10 Likelihood Ratio Test for Homogeneity
Hence, we have n E .
2 ein = (1 − π )2 (θ1 − θ ∗ )4
n E
i=1
Zi2 (θ1 )
i=1 ∗ 4
= n(1 − π ) (θ1 − θ ) Op (1) 2
= n{m21 + m22 }op (1). Note that we have applied once more of the uniform strong law of large numbers, n E .
Zi2 (θ1 ) = O(n)
i=1
uniformly for .θ1 in a small neighborhood of .θ ∗ . Simply speaking, we have shown n E .
2 ein =
n [E ] {m1 Yi (θ ∗ ) + m2 Zi (θ2 )}2 × op (1).
i=1
i=1
By Cauchy inequality, n n n [E ]2 [ E ][ E ] 2 , {m1 Yi (θ ∗ ) + m2 Zi (θ2 )}ein ≤ {m1 Yi (θ ∗ ) + m2 Zi (θ2 )}2 ein
.
i=1
i=1
i=1
from which we conclude .
n n E ] [E {m1 Yi (θ ∗ ) + m2 Zi (θ2 )}ein = {m1 Yi (θ ∗ ) + m2 Zi (θ2 )}2 × op (1). i=1
i=1
Combining these, we conclude n E
δi2 =
.
i=1
n [E
] {m1 Yi (θ ∗ ) + m2 Zi (θ2 )}2 × {1 + op (1)}.
i=1
E In the upperE bound for . Eni=1 log(1 + δi ), we have now worked out some conclusions for . ni=1 δi and . ni=1 δi2 . In the current region of consideration, both .m1 and .m2 are .op (1) terms and |δi |3 ≤ 27{|m1 Yi (θ ∗ )|3 + |m2 Zi (θ2 )|3 + |ein |3 }.
.
Using the uniform law of large numbers (Theorem 17.3), we have
10.3 The Proof of Theorem 10.1
.
n−1 n−1
E E
193
|m1 Yi (θ ∗ )|3 = m31 Op (1) = m21 op (1), |m2 Zi (θ ∗ )|3 = m32 Op (1) = m22 op (1).
For the last term in the above inequality, we have n E .
3 ein = (1 − π )3 (θ1 − θ ∗ )6
i=1
n E
Zi3 (θ1 )
i=1
= n(1 − π )3 (θ1 − θ ∗ )6 op (1) = n{m21 + m22 }op (1). Combining these inequalities, we get n E .
δi3 =
n [E
i=1
] {m1 Yi (θ ∗ ) + m2 Zi (θ2 )}2 × op (1).
i=1
Combining all these, we get a uniform upper bound in the current region: r1n (π, θ1 , θ2 ) ≤ 2
.
n E {m1 Yi (θ ∗ ) + m2 Zi (θ2 )} i=1
−
n ] [E {m1 Yi (θ ∗ ) + m2 Zi (θ2 )}2 {1 + op (1)}. i=1
To make use of this upper bound, an orthogonal transformation as in (10.2) is needed with the same .W (θ ) and .h(θ ) already introduced. The choice of .W (θ ) is to ensure .E{Y (θ ∗ )W (θ )} = 0. This implies that {
.
sup n θ2
−1/2
n E
} Yi (θ ∗ )Wi (θ2 ) = Op (1).
i=1
Based on this, we break down the quadratic term further:
.
n n n E E E [ ] {tˆYi (θ ∗ ) + m ˆ 2 Wi (θ2 )}2 = (tˆ)2 Yi2 (θ ∗ ) + (m ˆ 2 )2 Wi2 (θ2 ) (1 + op (1)) i=1
i=1
i=1
ˆ 2 h(θ2 ), and .m ˆ 1 and .m ˆ 2 are those defined in (10.6) with .π, θ1 ˆ1 + m where .tˆ = m replaced by .πˆ , θˆ1 . These preparations lead to
194
10 Likelihood Ratio Test for Homogeneity
rn (πˆ , θˆ1 , θ2 ) ≤ 2
.
n E {tˆYi (θ ∗ ) + m ˆ 2 Wi (θ2 )} i=1
− {tˆ2
n E
Yi2 (θ ∗ ) + m ˆ 22
i=1
n E
Wi2 (θ2 )}{1 + op (1)}.
i=1
For fixed .θ2 , consider the quadratic function { n } n n E E E ∗ 2 2 ∗ 2 2 .Q(t, m2 ) = 2 {tYi (θ ) + m2 Wi (θ2 )} − t Yi (θ ) + m2 Wi (θ2 ) . i=1
i=1
i=1
(10.7) Over the region of .m2 ≥ 0, for any .θ2 fixed, .Q(t, m2 ) is maximized at .t = t˜ and m2 = m ˜ 2 , with
.
E Yi (θ ∗ ) , t˜ = E 2 Yi (θ ∗ )
.
E { Wi (θ2 )}+ m ˜2 = E 2 . Wi (θ2 )
(10.8)
Thus, r1n (πˆ , θˆ1 , θ2 ) ≤ Q(t˜, m ˜ 2 ) + op (1) E En [{ ni=1 Wi (θ2 )}+ ]2 { i=1 Yi (θ ∗ )}2 + En + op (1). = En 2 ∗ 2 i=1 Yi (θ ) i=1 Wi (θ2 )
.
Finally, by integrable upper bound condition, n−1
E
.
Yi2 (θ ∗ ) = E{Y12 (θ ∗ )} + op (1)
and n−1
.
E
Wi2 (θ2 ) = E{W12 (θ2 )} + op (1).
We obtain a further simplified expansion: E E [{n−1/2 ni=1 Wi (θ2 )}+ ]2 {n−1/2 ni=1 Yi (θ ∗ )}2 + + op (1). .r1n (π ˆ , θˆ1 , θ2 ) ≤ E{Y 2 (θ ∗ )} E{W 2 (θ2 )} Note that the first term in the above expansion is .R2n + op (1). This leads to R1n (e; I ) − R2n ≤
.
E [{n−1/2 Wi (θ2 )}+ ]2 + op (1). E{W 2 (θ2 )}
(10.9)
10.3 The Proof of Theorem 10.1
195
We have finally established an easy to use upper bound on .Rn (e; I ) − R2n as follows: E [{n−1/2 ni=1 Wi (θ )}+ ]2 + op (1). .R1n (e; I ) − R2n ≤ sup E{W 2 (θ )} |θ−θ ∗ |>e To determine the exact size of .R1n (e; I ), we work out a lower bound of R1n (e; I ). Let .π˜ and .θ˜1 be two specific values so that they make
.
m1 + m2 h(θ2 ) = t˜; m2 = m ˜2
.
with .t˜ and .m ˜ 2 being values defined in (10.8). Such values clearly exist because the system contains equal number of equations and variables. Examining the definition (10.8) clearly shows that ˜ 2 = Op (n−1/2 ) t˜ = Op (n−1/2 ), m
.
uniformly in .θ2 . There is also a simple upper bound for .h(θ2 ) as follows: |h(θ2 )| ≤
/
.
E{Z12 (θ2 )}/E{Y12 (θ ∗ )}.
This fact implies that the solution in .π and .θ1 satisfies .
π˜ = Op (n−1/2 ); θ˜1 = Op (n−1/2 )
uniformly in .θ2 satisfying .|θ2 − θ ∗ | ≥ e. With these issues settled, we consider the following Taylor expansion r1n (π˜ , θ˜1 , θ2 ) = 2
n E
.
i=1
δ˜i −
n E
δ˜i2 (1 + η˜ i )−2 ,
i=1
where .|η˜ i | < |δ˜i | and .δ˜i is given by .
δ˜i = (1 − π˜ )(θ˜1 − θ ∗ )Yi (θ˜1 ) + π(θ2 − θ ∗ )Yi (θ2 ).
We have |δ˜i | ≤ (|θ˜1 − θ ∗ | + |π|M) ˜ max
.
[
sup {|Yi (θ )|}
]
1≤i≤n θ ∈Θ
where M is the largest possible .θ2 value, which is finite as .Θ is compact. By integrable upper bound condition, .|Yi (θ )|4+δ ≤ g(Xi ) and .E{g(X)} is finite. With this moment property plus the i.i.d. nature of .Yi (θ ), we get the following order assessment:
196
10 Likelihood Ratio Test for Homogeneity
.
max [sup {|Yi (θ )|}] = op (n1/4 ).
1≤i≤n θ∈Θ
Hence, combining the order assessments of .θ˜1 , .π˜ and .max |Yi (θ )|, we conclude max (|η˜ i |) = op (1) uniformly. With this, we get
.
r1n (π˜ , θ˜1 , θ2 ) = 2
n E
.
δ˜i −
n {E
i=1
} δ˜i2 {1 + op (1)}.
i=1
Applying the argument leading to (10.9), we know that with this pair of specific values, .π˜ and .θ˜1 , [{n−1/2
En
i=1 Wi (θ2 )} E{W12 (θ2 )}
.r1n (π, ˜ θ˜1 , θ2 ) − R2n ≥
+ ]2
+ op (1).
The supremum value of a function is no less than the value of the function at any specific point. Hence, we must have sup r1n (π, θ1 , θ2 ) − R2n ≥ r1n (π˜ , θ˜1 , θ2 ) − R2n
.
π,θ1
≥
[{n−1/2
En
i=1 Wi (θ2 )} E{W12 (θ2 )}
+ ]2
+ op (1).
Combining with (10.9), we thus arrive at the following lemma. Lemma 10.4 Suppose all conditions specified in this section. Let .R1n (e; I ) − R2n be the likelihood ratio statistic under restriction .|θ2 − θ ∗ | > e. Then when .f (x, θ ∗ ) is the true null distribution, as .n → ∞, R1n (e; I ) − R2n =
.
sup
|θ−θ ∗ |>e
E [{n−1/2 ni=1 Wi (θ )}+ ]2 + op (1). E{W 2 (θ )}
/ E The asymptotic distribution of .n−1/2 Wi (θ )/ E{W 2 (θ )} when .f (x, θ ∗ ) is the true null distribution, is a Gaussian process with mean 0, variance 1 and autocorrelation function .ρ(θ, θ ' ) given by (10.3). Proof We need only to notice that the upper and lower bounds of .R1n (e; I ) − R2n differ by a quantity of size .op (1). This leads to the conclusion. n u Case II: .|θ2 − θ ∗ | ≤ e. When .θ2 is in an arbitrarily small neighborhood of .θ ∗ , some complications appear since .θ1 − θ ∗ and .θ2 − θ ∗ are confounded, so that .n1/2 (θˆ1 − θ ∗ )2 is no longer negligible when compared to .n1/2 (θˆ2 − θ ∗ )2 . However, in this case, .θ1 and .θ2 can be
10.3 The Proof of Theorem 10.1
197
treated equally, so that the usual quadratic approximation to the likelihood becomes possible. Since the MLE of .θ1 is consistent, in addition to .|θ2 − θ ∗ | ≤ e, we can restrict .θ1 in the following analysis to the region of .|θ1 − θ ∗ | ≤ e. In the sequel, .π, ˆ .θˆ1 , and .θˆ2 denote the MLE’s of .π, .θ1 , and .θ2 within the region defined by .0 ≤ π ≤ 1/2, .|θ1 − θ ∗ | ≤ e and .|θ2 − θ ∗ | ≤ e. Again, we start with an inequality like (10.5): r1n (π, θ1 , θ2 ) = 2
n E
.
log(1 + δi ) ≤ 2
i=1
n E
δi −
n E
i=1
2E 3 δi 3 n
δi2 +
i=1
i=1
but with δi = m1 Yi (θ ∗ ) + m2 Zi (θ ∗ ) + ein
.
and having .m1 the same as before, but a slightly different m2 = (1 − π )(θ1 − θ ∗ )2 + π(θ2 − θ ∗ )2 .
.
Note that .|m1 | ≤ e and .m2 ≤ e 2 . The remainder term now becomes ein = (1 − π )(θ1 − θ ∗ )2 {Zi (θ1 ) − Zi (θ ∗ )} + π(θ2 − θ ∗ )2 {Zi (θ2 ) − Zi (θ ∗ )}.
.
By integrable upper bound condition, n−1/2
.
n { E Zi (θ1 ) − Zi (θ ∗ ) } i=1
θ1 − θ ∗
is tight over the space of .θ1 . Hence, its supremum is of order .Op (1). The order assessment is equally applicable when .θ1 is replaced by .θ2 . Let m3 = (1 − π )|(θ1 − θ ∗ )|3 + π|(θ2 − θ ∗ )|3 .
.
We then have en =
n E
.
ein = n1/2 m3 Op (1).
i=1
This leads to r1n (π, θ1 , θ2 ) ≤ 2
n E
.
{m1 Yi (θ ∗ ) + m2 Zi (θ ∗ )} −
i=1
+
2 3
n E {m1 Yi (θ ∗ ) + m2 Zi (θ ∗ )}2 i=1
n E
{m1 Yi (θ ∗ ) + m2 Zi (θ ∗ )}3 + n1/2 m3 Op (1).
i=1
(10.10)
198
10 Likelihood Ratio Test for Homogeneity
Using the same argument as in Case I, E |m1 Yi (θ ∗ ) + m2 Zi (θ2∗ )|3 ≤ max (|m1 |, |m2 |) Op (1) ≤ eOp (1). .E {m1 Yi (θ ∗ ) + m2 Zi (θ2∗ )}2 Inequality (10.10) leads to r1n (π, θ1 , θ2 ) ≤ 2
.
n E {m1 Yi (θ ∗ ) + m2 Zi (θ ∗ )} i=1
n E − {m1 Yi (θ ∗ ) + m2 Zi (θ ∗ )}2 {1 + eOp (1)} i=1
+ n1/2 m3 Op (1).
(10.11)
We also have n1/2 m3 = {op (1) + eOp (1)}n1/2 m ˆ2
.
≤ {op (1) + eOp (1)}(1 + nm ˆ 22 ) ≤ eOp (1 + nm ˆ 22 ). Consequently, in terms of the MLE’s, (10.11) reduces to n E {m ˆ 1 Yi (θ ∗ ) + m ˆ 2 Zi (θ ∗ )}
r1n (πˆ , θˆ1 , θˆ2 ) ≤ 2
.
i=1
−
n E {m ˆ 1 Yi (θ ∗ ) + m ˆ 2 Zi (θ ∗ )}2 {1 + eOp (1)} + eOp (1). i=1
Note that to arrive at the above upper bound, a term in the form of .eOp (nm ˆ 22 ) has been absorbed into the quadratic term. Other big O and small o terms must stay to retain the validity of the asymptotic derivation, however. They will be cleaned up in later steps. By orthogonalization, we have r1n (π, ˆ θˆ1 , θˆ2 ) ≤ 2
.
n E { } tˆYi (θ ∗ ) + m ˆ 2 Wi (θ ∗ ) i=1
{
−
ˆ2
t
n E i=1
Yi2 (θ ∗ ) + m ˆ 22
n E
} Wi2 (θ ∗ )
{1 + eOp (1)} + eOp (1).
i=1
Applying the same technique leading to (10.7), we get
10.3 The Proof of Theorem 10.1
r1n (π, ˆ θˆ1 , θˆ2 ) ≤ {1 + eOp (1)}
.
199
−1
[ E ] E { Yi (θ ∗ )}2 [{ Wi (θ ∗ )}+ ]2 + eOp (1). E 2 ∗ + E 2 ∗ Yi (θ ) Wi (θ )
Furthermore, by the law of large numbers, we have n E . {Yi (θ ∗ )}2 = nE{Y12 (θ ∗ )}(1 + op (1)) i=1
and so on. Therefore, E eOp (1) {n−1/2 Yi (θ ∗ )}2 1 + eOp (1) E{Y12 (θ ∗ )} E [{n−1/2 Wi (θ ∗ )}+ ]2 + + eOp (1) {1 + eOp (1)}E{W12 (θ ∗ )} E [{n−1/2 Wi (θ ∗ )}+ ]2 + eOp (1). = E{W12 (θ ∗ )}
r1n (πˆ , θˆ1 , θˆ2 ) − R2n ≤
.
Some technical remarks are needed. In the above inequality, the first term on the right hand side has a factor .e. As long as the other factor is .Op (1), it is rightfully absorbed in .eOp (1) term. When .{1 + eOp (1)}−1 is replaced by 1, the size change is clearly .eOp (1). Because the other factor in the second term is .Op (1) implied by the E tightness of .n−1/2 Wi (θ ∗ ), the overall size change is also .eOp (1). The difference is therefore also absorbed into .eOp (1). Hence, the excessive many .eOp (1) terms are cleaned up now. With this inequality, we arrive at R1n (e; I I ) − R2n
.
E [{n−1/2 Wi (θ ∗ )}+ ]2 ≤ + eOp (1). E{W12 (θ ∗ )}
Next, let .θ˜2 = θ ∗ , and let .π˜ , .θ˜1 be determined by E Yi (θ ∗ ) .m1 + m2 h(θ ) = E , Yi2 (θ ∗ ) ∗
E { Wi (θ ∗ )}+ m2 = E 2 . Wi (θ ∗ )
The rest of the proof is the same as that in Case I, and we get R1n (e; I I ) − R2n
.
E [{n−1/2 Wi (θ ∗ )}+ ]2 ≥ + op (1). E{W12 (θ ∗ )}
Lemma 10.5 Suppose that Conditions 1–5 hold. Let .Rn (e; I I ) be the likelihood ratio statistic with restriction .|θ2 − θ ∗ | ≤ e for any arbitrarily small .e > 0. When
200
10 Likelihood Ratio Test for Homogeneity
f (x, θ ∗ ) is the true null distribution, as .n → ∞,
.
E [{n−1/2 Wi (θ ∗ )}+ ]2 + op (1) ≤ Rn (e; I I ) − R2n . E{W12 (θ ∗ )} E [{n−1/2 Wi (θ ∗ )}+ ]2 + eOp (1). ≤ E{W12 (θ ∗ )} Our final task is to combine two conclusions to finish the proof of Theorem 10.1 Proof of Theorem 10.1 For any small .e > 0, .Rn = max{Rn (e; I ), Rn (e; I I )}. By Lemmas 10.3 and 10.5, [ ] { } E E [{n−1/2 Wi (θ )}+ ]2 [{n−1/2 Wi (θ ∗ )}+ ]2 + eOp (1) .Rn ≤ max sup , E{W12 (θ )} E{W12 (θ ∗ )} |θ−θ ∗ |>e plus a term in .op (1), and [ Rn ≥ max
.
{ sup
|θ−θ ∗ |>e
] } E E [{n−1/2 Wi (θ )}+ ]2 [{n−1/2 Wi (θ ∗ )}+ ]2 + op (1). , E{W12 (θ ∗ )} E{W12 (θ )}
/ E Since .n−1/2 Wi (θ )/ E{W12 (θ )} converges to the Gaussian process .W (θ ), .θ ∈ Θ, with mean 0, variance 1, and autocorrelation function .ρ(θ, θ ' ), which is given by (10.3), the theorem follows by first letting .n → ∞ and then letting .e → 0. u n
10.4 Homogeneity Under Normal Mixture Model We have already seen that normal mixture models are peculiar in many ways. When it comes to homogeneity test, finite normal mixture retains its reputation again. The limiting distribution of the likelihood ratio test statistic goes beyond a simple Gaussian process. Because of this, we must include a section for homogeneity test for normal with component variance as its structural parameter. Suppose we have a random sample .X1 , . . . , Xn from an order two finite normal mixture model with the following density function f (x; G, σ ) = (1 − π )φ(x; θ1 , σ 2 ) + π φ(x; θ2 , σ 2 ).
.
(10.12)
Let us consider the problem of testing for the null hypothesis of homogeneity H0 : π(1 − π )(θ2 − θ1 ) = 0,
.
versus the full model (10.12). The mixture density function may also be written as
10.4 Homogeneity Under Normal Mixture Model
201
{ f (x; G, σ ) =
.
φ(x; u, σ 2 )dG(u)
with mixing distribution G(u) = (1 − π )1(θ1 ≤ u) + π 1(θ2 ≤ u).
.
To ease the technicality, we obtain a result for a simpler case first.
10.4.1 The Result for a Single Mean Parameter The simpler case is when the value of .θ1 is known. With minor loss of generality, we take .θ1 = 0 and write .θ2 = θ in (10.12). In addition, we place a restriction .|θ | ≤ M for some known .M < ∞. Without this restriction, we are unlikely to obtain a meaningful result about the likelihood ratio test (Hartigan 1985). We do not expect the result for the current case useful in applications but to gain insights on technical issues related to likelihood based homogeneity test. The knowledge gained through this exercise is helpful to handle the full model. Based on a set of i.i.d. observations from the current model, the log-likelihood function of .π , .θ , and .σ is ln (π, θ, σ ) =
n E
.
log{(1 − π )φ(xi ; 0, σ ) + π φ(xi ; θ, σ )}.
i=1
The null hypothesis under consideration in this section is simply H0 : π θ = 0.
.
We aim to demonstrate the rigorous approach of obtaining the limiting distribution of LRT via sandwich method for the simpler setting. This sandwich method requires us to establish an asymptotic upper bound and a lower bound for the test statistic: Rn = 2{sup ln (π, θ, σ ) − sup ln (π, θ, σ )}
.
H1
H0
We denote, under the full model, the MLE of .σ 2 to be .σˆ 2 and the MLE of G to be ˆ n (u) = (1 − πˆ )1(0 ≤ u) + π 1(θˆ ≤ u). G
.
The following two lemmas tell us how .πˆ , θˆ , and .σˆ 2 behave asymptotically.
202
10 Likelihood Ratio Test for Homogeneity
Lemma 10.6 Under the null distribution .N(0, 1), there exist constants .0 < e < Δ < ∞ such that .
lim Pr(e ≤ σˆ 2 ≤ Δ) = 1.
n→∞
Lemma 10.7 Under the null distribution .N(0, 1), as .n → ∞, we have πˆ θˆ → 0 and σˆ 2 → 1,
.
almost surely. Proof According to Theorem 4.1 in Chap. 4, the MLE of G and .σ 2 are strongly consistent under generic normal mixture model with a structural parameter. The conclusion remains valid when .θ1 = 0 is given in a finite normal mixture of order .m = 2. Hence, .σ ˆ 2 → 1 is a direct consequence. At the same time, the consistency conclusion implies the estimated c.d.f. of the mixing distribution ˆ n (u) = (1 − πˆ )1(0 ≤ u) + π 1(θˆ ≤ u) → 1(0 ≤ u) G
.
almost surely at all u. Because the space of .π and .θ is compact, we must have .πˆ θˆ → 0. Otherwise, we will be able to find a subsequence such that .(πˆ , θˆ ) → (π˜ , θ˜ ) with .π˜ θ˜ /= 0. This ˆ n (u) → 1(0 ≤ u) where .1(0 ≤ u) is the true mixing distribution. n contradicts .G u 2 Under .H0 , we have E .π 2θ = 0. The only unknown parameter is .σ , and its MLE is 2 −1 given by .σˆ 0 = n Xi . Let
r1n (π, θ, σ ) = 2{ln (π, θ, σ ) − ln (0, 0, 1)},
.
r2n = 2{ln (0, 0, σˆ 0 ) − ln (0, 0, 1)}. and rn (π, θ, σ ) = r1n (π, θ, σ ) − r2n .
.
(10.13)
The LRT for testing .H0 can also be written as Rn = sup rn (π, θ, σ ).
.
The goal of deriving the limiting distribution of .Rn is attained by finding useful approximations to both .r1n and .r2n . The task for .r2n is rather simple. Let us first focus on an upper bound for .r1n (π, θ, σ ). As we did in the last section, express
10.4 Homogeneity Under Normal Mixture Model
r1n (π, θ, σ ) = 2
203
n E
.
log(1 + δi )
i=1
with .δi = (σ 2 − 1)Ui (σ ) + π θ Yi (θ, σ ), and [ Ui (σ ) = (σ − 1) 2
.
−1
] { X2 1 } 1 i exp − ( − 1) − 1 , σ 2 σ2
(10.14)
and [
] { Xi2 Xi2 } Xi2 } (Xi − θ )2 − exp − + + exp − . 2 2 2σ 2 2σ 2 (10.15) The functions .Ui (σ ) and .Yi (θ, σ ) are continuously differentiable by defining 1 .Yi (θ, σ ) = σθ
Ui (1) =
.
{
{ X2 1 } Xi 1 2 (Xi − 1); Yi (0, σ ) = 3 exp − i ( 2 − 1) . 2 2 σ σ
They all have zero mean. By the same inequality used before, 2 2 log(1 + x) ≤ 2x − x 2 + x 3 , 3
.
we have n E
r1n (π, θ, σ ) = 2
.
i=1
log(1 + δi ) ≤ 2
n E i=1
δi −
n E i=1
2E 3 δi . 3 n
δi2 +
(10.16)
i=1
E E We now set out for neat approximations to . ni=1 δi and . ni=1 δi2 , and to show that En 3 . i=1 δi is negligible for asymptotic consideration. E We start with the task of finding a neat approximation for . ni=1 δi . Re-write .δi as δi = (σ 2 − 1)Ui (1) + π θ Yi (θ, 1) + ein ,
.
(10.17)
the sum of simple leading term linear in .Ui (1) and .θ Yi (θ, 1). The remainder is ein = (σ 2 − 1){Ui (σ ) − Ui (1)} + π θ {Yi (θ, σ ) − Yi (θ, 1)}.
.
E What is the order of . ni=1 ein ? The following proposition is needed or helpful to answer this question.
204
10 Likelihood Ratio Test for Homogeneity
Proposition 10.1 Let .0 < δ < 1 be a nonrandom constant. Then under the null distribution .N (0, 1), the processes Un∗ (σ ) = n−1/2
.
n E { Ui (σ ) − Ui (1) } i=1
σ2 − 1
;
and Yn∗ (θ, σ ) = n−1/2
.
n E { Yi (θ, σ ) − Yi (θ, 1) } i=1
σ2 − 1
are tight for .σ 2 ∈ [1 − δ, 1 + δ] and .|θ | ≤ M. Proof In the light of (Billingsley 2008, page 95) the tightness of .Un∗ (σ ) would be implied by inequality E{Un∗ (σ1 ) − Un∗ (σ2 )}2 ≤ C(σ1 − σ2 )2 ,
.
for some constant C. Thus, all we do subsequently is to verify this inequality. We can see that this inequality holds if the derivative of U ∗ (σ 2 ) =
.
U (σ ) − U (1) σ2 − 1
with respect to .σ 2 is bounded by a square integrable function. Now we follow this line of thinking. Let { 1 X2 X2 } H (σ 2 ) = √ exp − + 2 2σ 2 σ2
.
as a function of .σ 2 for presentational convenience. It is seen that U (σ ) =
.
H (σ 2 ) − H (1) , σ2 − 1
with .U (1) = H ' (1), and subsequently, U ∗ (σ 2 ) =
.
H (σ 2 ) − H (1) − U (1)(σ 2 − 1) U (σ ) − U (1) = . σ2 − 1 (σ 2 − 1)2
The derivative of .U ∗ (σ 2 ) in (10.18) with respect to .σ 2 is given by
(10.18)
10.4 Homogeneity Under Normal Mixture Model
.
205
H (σ 2 ) − H (1) − U (1)(σ 2 − 1) − 12 {H ' (σ 2 ) − U (1)}(σ 2 − 1) . (σ 2 − 1)3
By Taylor expansion, 1 H ' (σ 2 ) − U (1) = H '' (1)(σ 2 − 1) + H ''' (ξ1 )(σ 2 − 1)2 2
.
and notice that .H ' (1) = U (1), we get H (σ 2 ) − H (1) − U (1)(σ 2 − 1) =
.
1 1 '' H (1)(σ 2 − 1)2 + H ''' (ξ2 )(σ 2 − 1)3 2 6
for some .ξ1 , ξ2 taking values between .σ 2 and 1. Substituting these two expansions to the derivative of .U ∗ (σ 2 ), we find that it becomes .
1 ''' 1 H (ξ2 ) − H ''' (ξ1 ). 6 4
By direct calculations and for .σ 2 ∈ [1 − δ, 1 + δ] for some small .δ, we can easily find a constant C, such that } { δ ''' 2 2 6 4 2 2 2 X . .|H (σ )| ≤ C(X + X + X + 1) exp 1+δ It is clear that the above upper bound is integrable with respect to and under the null distribution .N(0, 1) when .0 < δ < 1. This upper bound is clearly also an upper bound for the derivative of .U ∗ (σ 2 ). The existence of such an upper bound establishes the tightness of .Un∗ (σ ), as argued earlier. The tightness proof for .Yn∗ (θ, σ ) follows the same line of thinking. We need only to look for some square integrable .g(x) such that |2 | ∂2 |2 | ∂ 2 | | | | H (θ, σ )| ≤ g(X), | 2 H (θ, σ )| + | ∂θ ∂σ ∂θ
.
with new function H (θ, σ 2 ) =
.
{ (X − θ )2 X2 } 1 exp − . + 2 σ 2 2σ
Again by direct calculations, we have, for some constant C, } { | ∂2 |2 (X − θ )2 | | 2 2 2 +X H (θ, σ )| ≤ C(X + |X| + 1) exp − .| (1 + δ) ∂θ 2 } { δX2 2 2 + 2M|X| . ≤ C(X + |X| + 1) exp (1 + δ)
206
10 Likelihood Ratio Test for Homogeneity
The rightmost of the above inequality is again integrable under the null distribution N (0, 1), as .0 < δ < 1. The similar argument can also be used to show that 2 2 .|∂ Hi (θ, σ )/∂θ ∂σ | is bounded above by an integrable random variable. The proof is completed. u n .
After this proposition is established, we go back to the task of assessing the order E of . ein . The tightness conclusions in this proposition imply that for small .δ > 0 and finite M, .
sup |σ 2 −1|≤δ
|Un∗ (σ )| = Op (1),
and .
sup |θ|≤M,|σ 2 −1|≤δ
|Yn∗ (θ, σ )| = Op (1).
Therefore, n E .
ein = n1/2 {(σ 2 − 1)2 Un∗ (σ ) + π θ (σ 2 − 1)Yn∗ (θ, σ )}
i=1
= {(σ 2 − 1)2 + |π θ (σ 2 − 1)|}Op (n1/2 ). With this, we get what we wanted approximation for the linear term in (10.16): n E .
δi = (σ 2 − 1)
i=1
n E i=1
Ui + π θ
n E
Yi (θ )} + {(σ 2 − 1)2 + |π θ (σ 2 − 1)|}Op (n1/2 ).
i=1
(10.19) We now move to work for a neat approximation for the quadratic term in (10.16). For convenience of notation, put En1 = (σ 2 − 1)2 Op (1), En2 = π θ (σ 2 − 1)Op (1).
.
Clearly, we have n E .
δi2 =
i=1
n E
{(σ 2 − 1)Ui + π θ Yi (θ )}2 + 2
i=1
n n E E 2 {(σ 2 − 1)Ui + π θ Yi (θ )}ein + ein i=1
i=1
(10.20) and |
n E
.
i=1
δi3
n E − {(σ 2 − 1)Ui + π θ Yi (θ )}3 | = n(|En1 |3 + |En2 |3 ). i=1
(10.21)
10.4 Homogeneity Under Normal Mixture Model
207
It is important to note that in (10.21), the remainder terms have a factor of n rather than .n3/2 . This is implied by
.
n n | | E E | | 2 | Ui (σ ) − Ui |3 |(σ − 1){Ui (σ ) − Ui }|3 = {n(σ 2 − 1)6 }n−1 | | σ2 − 1 i=1
i=1
= n(σ − 1) Op (1) = n|En1 |3 . 2
6
Now by (10.19), (10.20), and (10.21), r1n (π, θ, σ ) ≤ 2
n E
.
{(σ 2 − 1)Ui + π θ Yi (θ )} −
i=1
n E {(σ 2 − 1)Ui + π θ Yi (θ )}2 i=1
+(2/3)
n E
{(σ 2 − 1)Ui + π θ Yi (θ )}3
i=1
+n
1/2
3 E (En1 + En2 ) + n (|En1 |j + |En2 |j ).
(10.22)
j =2
In the next few steps, we aim to find the leading terms (10.22) by identifying terms that can be ignored for asymptotic consideration. This task includes: finding tightened expressions for the first two terms in (10.22) and showing the rest terms are asymptotically dominated by the quadratic term in (10.22). This fact will be used to justify that these terms are negligible. We start with introducing a new quantity Zi (θ ) = Yi (θ ) − θ Ui
.
to replace .Yi in the expression. Unlike .Yi , .Zi is orthogonal to .Ui for any .θ : E{Ui Zi (θ )} = 0. Further, let
.
t1 = σ 2 − 1 + π θ 2 , t2 = π θ.
.
Now we get a simpler expression (σ 2 − 1)Ui + π θ Yi (θ ) = t1 Ui + t2 Zi (θ ).
.
Due to orthogonality, we have .
n n E E {(σ 2 − 1)Ui + π θ Yi (θ )}2 = {t1 Ui + t2 Zi (θ )}2 i=1
i=1
= {t12
n E i=1
Ui2 + t22
n E i=1
Zi2 (θ )}{1 + op (1)}
208
10 Likelihood Ratio Test for Homogeneity
since the cross term is of lower order. These give more organized linear and quadratic terms. Due to the law of large numbers, there must be positive constants .δ1 and .δ2 such that with probability 1, n E .
Ui2 ≥ nδ1 ;
n E
i=1
Zi2 (θ ) ≥ nδ2 .
(10.23)
i=1
Being .O(n) merely means the corresponding quantity does not exceed .Cn for some large C. The above inequalities reveal that these two summations are no smaller than an .O(n) quantity. One should pay attention to find why such more accurate order assessments are needed in subsequent derivations. Next we look into the cubic term. Note that .
n n E E {(σ 2 − 1)Ui + π θ Yi (θ )}3 = {t1 Ui + t2 Zi (θ )}3 i=1
i=1
≤ |t1 |3
n E
|Ui |3 + |t2 |3
n E
i=1
≤
|Zi (θ )|3
i=1
(|t1 | + |t2 |){t12
n E
|Ui |
3
+ t22
i=1
( = (|t1 | + |t2 |)Op t12
n E
|Zi (θ )|3 }
i=1 n E
Ui2 + t22
i=1
n E
Zi2 (θ )
)
i=1
= (|t1 | + |t2 |)Op (n). In other words, if .t1 and .t2 are arbitrarily small, then the cubic term is dominated by the quadratic term, which has exact order .n(t12 + t22 ) by (10.23). We now move to evaluate .En1 and .En2 . We first notice that (σ 2 − 1)2 = (σ 2 − 1 − π θ 2 + π θ 2 )2
.
= (t1 + t2 )2 ≤ 2(t12 + t22 ). This result enables us to give an order assessment for .En1 as: En1 = (σ 2 − 1)2 Op (1) = (t12 + t22 )Op (1).
.
This is more than sufficient to arrive at: n n E ( E ) n1/2 En1 = op t12 Ui2 + t22 Zi2 (θ ) .
.
i=1
i=1
10.4 Homogeneity Under Normal Mixture Model
209
The same is applicable to .En2 . Applying the same arguments, we get n n E ( E ) 2 2 n(En1 + En2 ) = op t12 Ui2 + t22 Zi2 (θ )
.
i=1
( n(|En1 |3 + |En2 |3 ) = op t12
n E
i=1
Ui2 + t22
i=1
n E
) Zi2 (θ ) .
i=1
Combining these order assessments, we can write (10.22) as r1n (π, θ, σ ) ≤ 2
.
n E {t1 Ui + t2 Zi (θ )} i=1
− {t12
n E
Ui2 + t22
i=1
n E
Zi2 (θ )}{1 + (|t1 | + |t2 |)Op (1). (10.24)
i=1
+op (1)}.
(10.25)
Note that all terms in (10.22) are accounted as either .(|t1 | + |t2 |)Op (1) or .op (1). Let .tˆ1 = σˆ 2 − 1 + πˆ θˆ 2 and .tˆ2 = πˆ θˆ be the MLE’s. Due to the consistency conclusion, we have .tˆ1 = op (1) and .tˆ2 = op (1). Consequently, replacement of the MLE’s in (10.25) gives a much simpler expression: r1n (πˆ , θˆ , σˆ ) ≤ 2{tˆ1
n E
.
Ui + tˆ2
i=1
n E
Zi (θˆ )}
i=1
n n E { E } − tˆ12 Ui2 + tˆ22 Zi2 (θˆ )}{1 + op (1) . i=1
(10.26)
i=1
To obtain a short form for the upper bound, fix .θ and consider the quadratic function Q(t1 , t2 ) = 2{t1
n E
.
i=1
Ui + t 2
n E i=1
Zi (θ )} − {t12
n E
Ui2 + t22
i=1
n E
Zi2 (θ )}.
i=1
If .θ ≥ 0, then .t2 = π θ ≥ 0 and if .θ < 0, then .t2 < 0. By considering the regions of .t2 ≥ 0 and .t2 < 0 separately, we see that for fixed .θ , .Q(t1 , t2 ) is maximized at .t1 = t˜1 and .t2 = t˜2 with E E Ui [SGN(θ ) Zi (θ )]+ , .t˜1 = E , t˜2 = E 2 Ui2 Zi (θ )
(10.27)
210
10 Likelihood Ratio Test for Homogeneity
where .SGN(θ ) is the sign function, and E E [{SGN(θ ) Zi (θ )}+ ]2 { Ui } 2 ˜ ˜ . .Q(t1 , t2 ) = E + E 2 Ui2 Zi (θ ) Therefore, by (10.26), it follows that E E [{SGN(θ ) Zi (θ )}+ ]2 { Ui } 2 ˆ {1 + op (1)} .r1n (π ˆ , θ , σˆ ) ≤ E 2 {1 + op (1)} + sup E 2 Ui Zi (θ ) |θ|≤M E E { Ui }2 [{SGN(θ ) Zi (θ )}+ ]2 = E 2 + sup + op (1). (10.28) E 2 Ui Zi (θ ) |θ|≤M Recall that (10.13) Rn = rn (πˆ , θˆ , σˆ ) = r1n (πˆ , θˆ , σˆ ) − r2n ,
.
and note that .r2n renders an ordinary quadratic approximation, that is, r2n
.
E Ui } 2 = E 2 + op (1). Ui {
An upper bound for .Rn is thus obtained as follows: Rn ≤ sup
.
|θ|≤M
[{SGN(θ )
En
i=1 Zi (θ )} nE{Z12 (θ )}
+ ]2
+ op (1).
(10.29)
Here $n\,\mathrm{E}\{Z_1^2(\theta)\}$ substitutes for $\sum Z_i^2(\theta)$ since they are asymptotically equivalent, uniformly in $\theta$. We now show that the lower bound for $R_n$ has the same form. Let $e > 0$ be any fixed small number. Let $R_n(e)$ be the supremum of $r_n(\pi,\theta,\sigma)$ under the restriction $e \le |\theta| \le M$. For fixed $\theta\in[e, M]$, let $\tilde\pi(\theta)$ and $\tilde\sigma(\theta)$ assume the values determined by (10.27). Consider the Taylor expansion

$$r_{1n}(\tilde\pi(\theta), \theta, \tilde\sigma(\theta)) = 2\sum_{i=1}^n \tilde\delta_i - \sum_{i=1}^n \tilde\delta_i^2(1 + \tilde\eta_i)^{-2},$$

where $|\tilde\eta_i| < \tilde\delta_i$ and $\tilde\delta_i$ equals $\delta_i$ in (10.17) with $\pi = \tilde\pi(\theta)$ and $\sigma = \tilde\sigma(\theta)$. Owing to bounding $\theta$ away from 0, the solution $\tilde\pi(\theta)$ is feasible, so that $\tilde\sigma^2(\theta) - 1 = O_p(n^{-1/2})$ and $\tilde\pi(\theta) = O_p(n^{-1/2})$, uniformly in $|\theta|\in[e, M]$. Since $\tilde\delta_i = (\tilde\sigma^2 - 1)U_i(\tilde\sigma) + \tilde\pi\theta Y_i(\theta, \tilde\sigma)$, we have

$$|\tilde\delta_i| \le |\tilde\sigma^2 - 1|\,|U_i(\tilde\sigma)| + |\tilde\pi\theta|\,|Y_i(\theta,\tilde\sigma)|.$$
For a generic constant C,

$$\sup_{|\theta|\le M}|Y_i(\theta,\tilde\sigma)| \le C(X^* + 1)\exp\{CX^*\} = o_p(n^{1/2}),$$

where $X^* = \max\{|X_i|\} = o_p(\sqrt{\log n})$; see Serfling (1980). Similarly, $|U_i(\tilde\sigma)| \le C X^{*2} = o_p(n^{1/2})$. Thus, uniformly in $\theta$,

$$\max\{|\tilde\eta_i|\} \le \max\{|\tilde\delta_i|\} = o_p(1), \qquad (10.30)$$
that is,

$$r_{1n}(\tilde\pi(\theta), \theta, \tilde\sigma(\theta)) = 2\sum_{i=1}^n \tilde\delta_i - \Big\{\sum_{i=1}^n \tilde\delta_i^2\Big\}\{1 + o_p(1)\}.$$

Thus, from (10.27) with fixed $\theta$, $\tilde\pi$ and $\tilde\sigma$ are such that

$$r_{1n}(\tilde\pi(\theta), \theta, \tilde\sigma(\theta)) = \frac{\{\sum U_i\}^2}{\sum U_i^2} + \frac{[\{\mathrm{SGN}(\theta)\sum Z_i(\theta)\}^+]^2}{n\,\mathrm{E}Z_1^2(\theta)} + o_p(1).$$
It follows that

$$R_n(e) \ge \sup_{e\le|\theta|\le M} r_n(\tilde\pi(\theta), \theta, \tilde\sigma(\theta)) = \sup_{e\le|\theta|\le M}\frac{[\{\mathrm{SGN}(\theta)\sum Z_i(\theta)\}^+]^2}{n\,\mathrm{E}Z_1^2(\theta)} + o_p(1). \qquad (10.31)$$

Clearly, this lower bound matches the upper bound given in (10.29). Combining the two bounds gives us the main theorem of this section.

Theorem 10.2 Let $X_1, \ldots, X_n$ be a random sample from the mixture distribution

$$(1-\pi)N(0,\sigma^2) + \pi N(\theta, \sigma^2),$$
where $0\le\pi\le 1$, $|\theta|\le M$, and $\sigma > 0$, otherwise unknown. Let $R_n$ be (twice) the log-likelihood ratio test statistic for testing

$$H_0: \pi = 0 \ \text{ or } \ \theta = 0.$$

Then under the null distribution $N(0,1)$, as $n\to\infty$,

$$R_n \to \sup_{|\theta|\le M}\{\zeta^+(\theta)\}^2,$$

where $\zeta(0) = 0$ and, for $0 < |\theta|\le M$, $\zeta(\theta)$ is a Gaussian process with mean 0, variance 1, and autocorrelation given by
$$\rho(s,t) = \mathrm{SGN}(st)\,\frac{\exp(st) - 1 - (st)^2/2}{\sqrt{\{\exp(s^2) - 1 - s^4/2\}\{\exp(t^2) - 1 - t^4/2\}}}, \qquad (10.32)$$

for $s, t \ne 0$.

Proof The proof starts with (10.29) and (10.31). The process

$$\frac{1}{\sqrt{n\,\mathrm{E}Z_1^2(\theta)}}\sum_{i=1}^n Z_i(\theta), \qquad |\theta|\le M,$$

converges weakly to a Gaussian process $\xi(\theta)$. Direct calculation of the mean and the covariance of $Z_i(\theta)$ shows that $\xi(\theta)$ has mean 0, variance 1, and, for $s, t \ne 0$, autocorrelation function

$$\frac{\exp(st) - 1 - (st)^2/2}{\sqrt{\{\exp(s^2) - 1 - s^4/2\}\{\exp(t^2) - 1 - t^4/2\}}}.$$

Therefore, the upper bound of $R_n$ converges in probability to

$$\sup_{|\theta|\le M}\{\zeta^+(\theta)\}^2,$$

where $\zeta(0) = 0$ and, for $0 < |\theta|\le M$, $\zeta(\theta) = \mathrm{SGN}(\theta)\xi(\theta)$ is a Gaussian process with mean 0, variance 1, and autocorrelation function (10.32). For given $e > 0$, the lower bound of $R_n$ converges weakly to

$$R(e) = \sup_{e\le|\theta|\le M}\{\zeta^+(\theta)\}^2.$$

Now letting $e \to 0$, $R(e)$ approaches in distribution

$$\sup_{|\theta|\le M}\{\zeta^+(\theta)\}^2.$$

This completes the proof. ∎
We can learn a few things from these derivations. First, we employed the same basic mathematical tools as for other mixtures. Second, the limiting distribution of the LRT for homogeneity looks similar to the ones for regular one-parameter mixtures. Finally, there is something different: the autocorrelation has an unusual factor $\mathrm{SGN}(st)$.
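The limit in Theorem 10.2 has no closed form, but it can be approximated by simulation: discretize $\theta$, draw the Gaussian process $\zeta(\theta)$ from the autocorrelation (10.32), and record the supremum of $\{\zeta^+(\theta)\}^2$ over the grid. The sketch below is one such Monte Carlo recipe; the value of M, the grid, and the number of replications are illustrative choices of ours rather than prescriptions from the text.

```python
import numpy as np

def rho(s, t):
    # autocorrelation (10.32) of zeta(theta) at s, t (both nonzero)
    num = np.sign(s * t) * (np.exp(s * t) - 1.0 - (s * t) ** 2 / 2.0)
    den = np.sqrt((np.exp(s ** 2) - 1.0 - s ** 4 / 2.0) *
                  (np.exp(t ** 2) - 1.0 - t ** 4 / 2.0))
    return num / den

def simulate_limit(M=3.0, n_grid=80, n_rep=5000, seed=1):
    rng = np.random.default_rng(seed)
    # theta grid avoiding 0, where zeta(0) = 0 contributes nothing to the supremum
    grid = np.concatenate([np.linspace(-M, -0.05, n_grid // 2),
                           np.linspace(0.05, M, n_grid // 2)])
    R = np.array([[rho(s, t) for t in grid] for s in grid])
    # eigenvalue clipping guards against tiny negative eigenvalues from rounding
    w, V = np.linalg.eigh(R)
    A = V @ np.diag(np.sqrt(np.clip(w, 0.0, None)))
    Z = rng.standard_normal((n_rep, len(grid))) @ A.T   # each row ~ N(0, R)
    return np.max(np.maximum(Z, 0.0) ** 2, axis=1)      # sup over grid of {zeta^+}^2

# approximate upper 5% point of the limiting distribution
print(np.quantile(simulate_limit(), 0.95))
```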
10.5 Two-Mean Parameter Mixtures: Tests for Homogeneity

In this section, we illustrate the derivation of the limiting distribution of the LRT for homogeneity under the fuller finite normal mixture model (10.12). In this case, both mean parameters $\theta_1$ and $\theta_2$ are unknown. Again, we assume $0\le\pi\le 1/2$ so that $\theta_1$ and $\theta_2$ are distinguishable, at least for the purpose of theoretical consideration. We retain the assumption that $|\theta|\le M$ for some finite M. The homogeneity hypothesis is given by

$$H_0: \pi(\theta_1 - \theta_2) = 0.$$

As usual, we assume the test is based on a random sample $X_1,\ldots,X_n$ of size n from (10.12). The log-likelihood function is given by

$$l_n(\pi,\theta_1,\theta_2,\sigma) = \sum_{i=1}^n \log\{(1-\pi)\phi(x_i;\theta_1,\sigma) + \pi\phi(x_i;\theta_2,\sigma)\},$$

also the same as before. Let $\hat\theta = \bar x$ and $\hat\sigma_0^2 = n^{-1}\sum(x_i - \bar x)^2$ be the MLE's of $\theta_1 = \theta_2 = \theta$ and $\sigma^2$ under the null hypothesis. Define

$$r_n(\pi,\theta_1,\theta_2,\sigma) = 2\{l_n(\pi,\theta_1,\theta_2,\sigma) - l_n(0,\hat\theta,\hat\theta,\hat\sigma_0)\}.$$

More explicitly, we may write

$$r_n(\pi,\theta_1,\theta_2,\sigma) = 2\sum_{i=1}^n \log\Big[\frac{1-\pi}{\sigma}\exp\Big\{-\frac{(x_i-\theta_1)^2}{2\sigma^2}\Big\} + \frac{\pi}{\sigma}\exp\Big\{-\frac{(x_i-\theta_2)^2}{2\sigma^2}\Big\}\Big] + n\Big\{1 + \log\frac{\sum(x_i-\bar x)^2}{n}\Big\}.$$

Let $\hat\pi$, $\hat\theta_1$, $\hat\theta_2$, and $\hat\sigma^2$ be the MLE's of the corresponding parameters under the full model. The likelihood ratio statistic for homogeneity is hence

$$R_n = r_n(\hat\pi,\hat\theta_1,\hat\theta_2,\hat\sigma).$$

The goal of this section is to derive the limiting distribution of $R_n$ under the restriction $|\theta|\le M$ and the specific null model $N(0,1)$.
10.5.1 Large Sample Behavior of the MLE’s We first point out that the conclusion of Lemma 10.6 remains true. That is, under the null distribution .N(0, 1), there are constants .0 < e < Δ < ∞ such that .
lim Pr(e ≤ σˆ 2 ≤ Δ) = 1.
n→∞
We give another useful but easy to prove result about the asymptotic behaviors of MLEs. Lemma 10.8 Under the null distribution .N(0, 1), as .n → ∞, .
2 θˆ1 → 0, σˆ 2 → 1, πˆ θˆ2 → 0
πˆ θˆ2 + (1 − πˆ )θˆ1 → 0, in probability. Proof Consider .e ≤ σ 2 ≤ Δ for some constants .0 < e < 1 < Δ < ∞. Let the space G = {G : G(u) = (1−π )I (u ≥ θ1 )+π I (u ≥ θ2 ), 0 ≤ π ≤ 1/2, |θi | ≤ M, i = 1, 2}
.
be metrized by the KW distance defined in (2.3) of Chap. 2. In fact, the proof does not depend on this specific distance. Any reasonable distance works. The product space .[e, Δ] × G is apparently compact. These properties ensure that the proof on the consistency of the MLEs given in Chap. 3 is applicable to these of .σ 2 and G. We use this somewhat heavy-handed conclusion to infer that the MLE of the moments of G: { . uk dG(u) = (1 − π )θ1k + π θ2k { are consistent. Under the null distribution .N(0, 1), . uk dG(u) = 0. Thus, we have both (1 − πˆ )θˆ1 + πˆ θˆ2 → 0
.
and (1 − πˆ )θˆ12 + πˆ θˆ22 → 0.
.
These are possible only if .πˆ θˆ22 → 0 and .θˆ1 → 0 since .1 − πˆ ≥ 1/2. The lemma is proved. u n
Remember that the LRT statistic is the supremum of $r_n(\pi,\theta_1,\theta_2,\sigma)$ over the space of $\pi, \theta_1, \theta_2, \sigma$. In the light of Lemma 10.7, without loss of generality, $\sigma^2$ can be restricted to a small neighborhood of $\sigma^2 = 1$, say $[1-\delta, 1+\delta]$ for a small positive number $\delta$, when considering the supremum of $r_n(\pi,\theta_1,\theta_2,\sigma)$. For the same consideration, $\theta_1$ can be restricted to a small neighborhood of 0, say $[-\delta, \delta]$. In terms of $\pi$ and $\theta_2$, the same consideration indicates that $\pi\theta_2^2$ can be restricted to a small neighborhood of 0. Yet this does not sufficiently narrow down the range of $\theta_2$. When $\pi\theta_2^2$ is small, either $|\theta_2| > e$ with a small $\pi$ value, or $|\theta_2| \le e$ with a $\pi$ value anywhere in [0, 0.5], for some $e > 0$. The supremum of $r_n(\pi,\theta_1,\theta_2,\sigma)$ will be analyzed over these two regions by a sandwich approach. Let $R_n(e; I)$ denote the supremum of the likelihood function over the part $|\theta_2| > e$, and $R_n(e; II)$ the supremum over $|\theta_2| \le e$. The results will be combined through the relationship

$$R_n = \max\{R_n(e; I), R_n(e; II)\} + o_p(1).$$

The term $o_p(1)$ looks after the cases when the supremum is attained outside these two regions. The number $e$ will remain fixed as n approaches infinity. It is easily seen that Lemma 10.7 remains true under either restriction $|\theta_2| > e$ or $|\theta_2| \le e$. Dependence on $e$ will be suppressed notationally for the MLE's of the parameters. Thus $\hat\pi, \hat\theta_1, \hat\theta_2$, and $\hat\sigma$ will denote the constrained MLE's of $\pi, \theta_1, \theta_2$, and $\sigma$ with restriction $|\theta_2| \ge e$ in the analysis of $R_n(e; I)$, but stand for the constrained MLE's with restriction $|\theta_2| \le e$ in the analysis of $R_n(e; II)$.
10.5.2 Analysis of $R_n(e; I)$

We first establish an asymptotic upper bound for $R_n(e; I)$. As in the previous section, we write

$$r_n(\pi,\theta_1,\theta_2,\sigma) = 2\{l_n(\pi,\theta_1,\theta_2,\sigma) - l_n(0,0,0,1)\} + 2\{l_n(0,0,0,1) - l_n(0,\hat\theta,\hat\theta,\hat\sigma_0)\} = r_{1n}(\pi,\theta_1,\theta_2,\sigma) + r_{2n}.$$

To analyze $r_{1n}(\pi,\theta_1,\theta_2,\sigma)$, express $r_{1n}(\pi,\theta_1,\theta_2,\sigma) = 2\sum\log(1+\delta_i)$, where

$$\delta_i = (1-\pi)\Big[\frac{1}{\sigma}\exp\Big\{\frac{X_i^2}{2} - \frac{(X_i-\theta_1)^2}{2\sigma^2}\Big\} - 1\Big] + \pi\Big[\frac{1}{\sigma}\exp\Big\{\frac{X_i^2}{2} - \frac{(X_i-\theta_2)^2}{2\sigma^2}\Big\} - 1\Big]$$
$$\quad = (1-\pi)\theta_1 Y_i(\theta_1,\sigma) + \pi\theta_2 Y_i(\theta_2,\sigma) + (\sigma^2-1)U_i(\sigma), \qquad (10.33)$$
where $Y_i(\theta,\sigma)$ and $U_i(\sigma)$ are defined in (10.15) and (10.14). From this point, we again use the inequality

$$r_{1n}(\pi,\theta_1,\theta_2,\sigma) \le 2\sum_{i=1}^n \delta_i - \sum_{i=1}^n \delta_i^2 + (2/3)\sum_{i=1}^n \delta_i^3$$
and aim to show that $\sum_{i=1}^n \delta_i^3$ is negligible and that the linear and quadratic terms can be cleaned up, leading to easy-to-handle expansions. Unlike what we did before, let us work on the linear term first. We rewrite

$$\delta_i = m_1 Y_i(0,1) + (\sigma^2 - 1 + m_2)U_i(1) + m_3 V_i(\theta_2) + e_{in},$$

where $e_{in}$ is the remainder of the replacement,

$$m_1 = (1-\pi)\theta_1 + \pi\theta_2, \quad m_2 = (1-\pi)\theta_1^2 + \pi\theta_2^2, \quad m_3 = \pi\theta_2^3,$$

and

$$V_i(\theta_2) = \frac{Y_i(\theta_2, 1) - Y_i(0,1) - \theta_2 U_i(1)}{\theta_2^2}. \qquad (10.34)$$
Define $V_i(0) = -(X_i/2) + (X_i^3/6)$ so that the function $V_i(\theta)$ is continuously differentiable. Using the simplified notation $U_i = U_i(1) = (X_i^2 - 1)/2$ and $Y_i(0,1) = X_i$, we have

$$\sum_{i=1}^n \delta_i = m_1\sum_{i=1}^n X_i + (\sigma^2 - 1 + m_2)\sum_{i=1}^n U_i + m_3\sum_{i=1}^n V_i(\theta_2) + e_n. \qquad (10.35)$$
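As a quick check of the continuity claim behind the value of $V_i(0)$, note that (10.33) at $\sigma = 1$ gives $Y_i(\theta, 1) = \{\exp(\theta X_i - \theta^2/2) - 1\}/\theta$, whose expansion around $\theta = 0$ is

$$Y_i(\theta,1) = X_i + \frac{\theta}{2}(X_i^2 - 1) + \frac{\theta^2}{6}(X_i^3 - 3X_i) + O(\theta^3).$$

Substituting this into (10.34) and letting $\theta \to 0$ yields

$$V_i(\theta) = \frac{Y_i(\theta,1) - Y_i(0,1) - \theta U_i(1)}{\theta^2} \longrightarrow \frac{X_i^3 - 3X_i}{6} = -\frac{X_i}{2} + \frac{X_i^3}{6},$$

which is the value used above.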
With nearly the same argument as in the analysis of the single-mean-parameter case, we find

$$e_n = \sum_{i=1}^n e_{in} = O_p\Big(\sqrt{n}\,|\sigma^2-1|\{|m_1| + \theta_1^2 + \pi\theta_2^2 + |\sigma^2-1|\} + \sqrt{n}\,|\theta_1^3|\Big).$$
To this point, we have a clean expansion for the linear term (10.35). The next task is to get a clean expansion for the quadratic term $\sum\delta_i^2$. We have

$$r_{1n}(\pi,\theta_1,\theta_2,\sigma) \le 2\sum_{i=1}^n \delta_i - \sum_{i=1}^n \delta_i^2 + (2/3)\sum_{i=1}^n \delta_i^3$$
$$= 2\sum_{i=1}^n\{m_1 X_i + (\sigma^2-1+m_2)U_i + m_3 V_i(\theta_2)\} - \sum_{i=1}^n\{m_1 X_i + (\sigma^2-1+m_2)U_i + m_3 V_i(\theta_2)\}^2$$
$$\quad + (2/3)\sum_{i=1}^n\{m_1 X_i + (\sigma^2-1+m_2)U_i + m_3 V_i(\theta_2)\}^3 + O_p\Big\{\sqrt{n}\,|\sigma^2-1|(|m_1| + \theta_1^2 + \pi\theta_2^2 + |\sigma^2-1|) + \sqrt{n}\,|\theta_1^3|\Big\}. \qquad (10.36)$$

Furthermore, the cubic sum is negligible compared to the square sum. This can be justified by the idea leading to (10.26). First, the square sum times $n^{-1}$ approaches $\mathrm{E}\{m_1 X_1 + (\sigma^2-1+m_2)U_1 + m_3 V_1(\theta_2)\}^2$ uniformly. The limit is a positive definite quadratic form in the variables $m_1$, $\sigma^2-1+m_2$, and $m_3$. Next, noting that $X_i$, $U_i$, and $V_i(\theta_2)$ are mutually orthogonal, we see that
$$r_{1n}(\hat\pi,\hat\theta_1,\hat\theta_2,\hat\sigma) \le 2\Big\{\hat m_1\sum_{i=1}^n X_i + (\hat\sigma^2-1+\hat m_2)\sum_{i=1}^n U_i + \hat m_3\sum_{i=1}^n V_i(\hat\theta_2)\Big\}$$
$$\qquad - \Big\{\hat m_1^2\sum_{i=1}^n X_i^2 + (\hat\sigma^2-1+\hat m_2)^2\sum_{i=1}^n U_i^2 + \hat m_3^2\sum_{i=1}^n V_i^2(\hat\theta_2)\Big\}\{1 + o_p(1)\} + \hat e_n.$$
Here the terms with a hat are the (constrained) MLE's with restriction $|\theta_2| > e$, as remarked at the end of Sect. 10.5.1. In particular, from (10.36),

$$\hat e_n = O_p\Big\{\sqrt{n}\,|\hat\sigma^2-1|\big[|\hat m_1| + \hat\theta_1^2 + \hat\pi\hat\theta_2^2 + |\hat\sigma^2-1|\big] + \sqrt{n}\,|\hat\theta_1^3|\Big\}.$$

By the Cauchy inequality [e.g., $\sqrt{n}|\hat m_1| \le 1 + n\hat m_1^2$] and the restriction $|\theta_2| > e$ (hence $|\hat\theta_2| \ge e$), we have

$$\sqrt{n}\,|\hat\sigma^2-1|\big[|\hat m_1| + \hat\theta_1^2 + \hat\pi\hat\theta_2^2 + |\hat\sigma^2-1|\big] + \sqrt{n}\,|\hat\theta_1^3|$$
$$\le |\hat\sigma^2-1|\big[4 + n\{\hat m_1^2 + \hat\theta_1^4 + (\hat\pi\hat\theta_2^2)^2 + (\hat\sigma^2-1)^2\}\big] + |\hat\theta_1|(1 + n\hat\theta_1^4)$$
$$= o_p(1) + n\,o_p\{\hat m_1^2 + \hat\theta_1^4 + (\hat\pi\hat\theta_2^2)^2 + (\hat\sigma^2-1)^2\}$$
$$= o_p(1) + n\,o_p\{\hat m_1^2 + (\hat\sigma^2-1+\hat m_2)^2 + \hat m_3^2\}.$$

Hence the remainder term $\hat e_n$ can also be absorbed into the quadratic sum; that is,
$$r_{1n}(\hat\pi,\hat\theta_1,\hat\theta_2,\hat\sigma) \le 2\Big\{\hat m_1\sum_{i=1}^n X_i + (\hat\sigma^2-1+\hat m_2)\sum_{i=1}^n U_i + \hat m_3\sum_{i=1}^n V_i(\hat\theta_2)\Big\}$$
$$\qquad - \Big\{\hat m_1^2\sum_{i=1}^n X_i^2 + (\hat\sigma^2-1+\hat m_2)^2\sum_{i=1}^n U_i^2 + \hat m_3^2\sum_{i=1}^n V_i^2(\hat\theta_2)\Big\}\{1 + o_p(1)\} + o_p(1).$$
Applying the argument leading to (10.28), the right-hand side of the above inequality becomes even greater when $\hat m_1$, $\hat\sigma^2-1+\hat m_2$, and $\hat m_3$ are replaced with

$$\tilde m_1 = \frac{\sum X_i}{\sum X_i^2}, \quad \tilde\sigma^2 - 1 + \tilde m_2 = \frac{\sum U_i}{\sum U_i^2}, \quad \tilde m_3 = \frac{\{\mathrm{SGN}(\theta_2)\sum V_i(\theta_2)\}^+}{\sum V_i^2(\theta_2)}, \qquad (10.37)$$
for any $e < |\theta_2| \le M$, so that

$$r_{1n}(\hat\pi,\hat\theta_1,\hat\theta_2,\hat\sigma) \le \frac{\{\sum X_i\}^2}{\sum X_i^2} + \frac{\{\sum U_i\}^2}{\sum U_i^2} + \sup_{e<|\theta|\le M}\frac{[\{\mathrm{SGN}(\theta)\sum V_i(\theta)\}^+]^2}{\sum V_i^2(\theta)} + o_p(1).$$

$$l_n(\hat G) - l_n(G^*) = \frac{1}{2}(s_{n1}\ s_{n2})\{nI_n\}^{-1}(s_{n1}\ s_{n2})^T\,1(\tilde m_2 > 0) + \frac{1}{2}\{nI_{n11}\}^{-1}s_{n1}^2\,1(\tilde m_2 \le 0),$$

where $I_{n11}$ is the $(1,1)$th entry of the $I_n$ matrix. We remark here that the penalty term is not yet included. Next, we investigate the maximum likelihood value under the null hypothesis. It is straightforward to find

$$l_n(\hat\theta) - l_n(G^*) = \frac{1}{2}\{nI_{n11}\}^{-1}s_{n1}^2 + o_p(1).$$

It is seen that if $\tilde m_2 \le 0$, we have

$$l_n(\hat G) - l_n(\hat\theta) = o_p(1).$$

Otherwise, we have

$$2\{l_n(\hat G) - l_n(\hat\theta)\} = (s_{n1}\ s_{n2})\{nI_n\}^{-1}(s_{n1}\ s_{n2})^T - \{nI_{n11}\}^{-1}s_{n1}^2 + o_p(1).$$
Clearly, this statistic has a chi-square limiting distribution with one degree of freedom. It can be verified that the event $\{\tilde m_2 < 0\}$ is asymptotically independent of

$$(s_{n1}\ s_{n2})\{nI_n\}^{-1}(s_{n1}\ s_{n2})^T - \{nI_{n11}\}^{-1}s_{n1}^2.$$

Hence, we obtain the conclusion of the theorem. ∎
It is now appropriate to recount the conditions necessary for the validity of Lemma 11.4 and Theorem 11.2.
C110.1 The parameter space $\Theta$ is an open subset of $R$.

Being an open subset is neither a demanding nor a strictly necessary condition. It would be nice if our result remained valid when $\theta$ is allowed to be a vector. When $\theta$ is a vector, the modified likelihood remains relevant. However, the limiting distribution of $R_n$ does not enjoy a simple analytical form. Technically, the difficulty arises when we have to specify the natural restrictions on the range of $\hat m_1$ and $\hat m_2$ in the multivariate situation. One may certainly provide an abstract specification. However, an abstract specification is not very meaningful when the test is implemented to work on data.

C110.2 The functions $p_x(\theta)$ are three times differentiable with bounded third derivative.

A smoothness restriction on $p_x(\theta)$ is not restrictive in general. We would be startled if a model had to be parameterized non-smoothly. Because the range of $p_x(\theta)$ is bounded between 0 and 1, it is also natural that its derivatives are bounded. In the binomial mixture example, $p_x(\theta)$ is a polynomial in $\theta$ and $\Theta$ is bounded. Hence, this condition is satisfied. Technically, if the smoothness part of this condition is violated, the above proof immediately falls apart in deriving (11.5). Having a bounded third derivative makes it simple to justify that the remainder term is $o_p(1)$ in (11.5). A slightly weaker condition on the third derivative may also suffice. Yet we do not think that such an effort is worthwhile in statistics.

C110.3 The Fisher information matrix defined in (11.6), in the general form

$$I(\theta) = \begin{bmatrix} \sum_{x=0}^k p_x(\theta)y_x^2 & \sum_{x=0}^k p_x(\theta)y_x z_x\\ \sum_{x=0}^k p_x(\theta)y_x z_x & \sum_{x=0}^k p_x(\theta)z_x^2 \end{bmatrix}, \qquad (11.7)$$

is positive definite at all $\theta$, where

$$y_x = \frac{p_x'(\theta)}{p_x(\theta)}, \qquad z_x = \frac{p_x''(\theta)}{p_x(\theta)}.$$

The $\theta$ value in $I(\theta)$ in the proof was set at 0 for ease of presentation. As the true null value is generally unknown, this condition must be required at every $\theta$. This condition is a special case of "strong identifiability." It requires that the first two derivatives are not completely linearly correlated. Without this property, the mixing distribution cannot be estimated at the rate $n^{-1/4}$, the believed optimal one under mixture models (Chen 1995). The limiting distribution of $R_n$, if it still exists, likely does not have a clean analytical form in this case. While the minimax rate is found to be different from $n^{-1/4}$ (Heinrich and Kahn 2018), the conclusions here are not affected.

C110.4 The mixture model (11.4) with $G\in\mathbb{G}_m$ satisfies conditions that ensure consistency of the maximum modified likelihood estimator of G.
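Condition C110.3 is easy to check numerically for a specific kernel. The sketch below does so for an assumed binomial kernel $p_x(\theta) = \binom{k}{x}\theta^x(1-\theta)^{k-x}$, the example mentioned above; the kernel choice and the finite-difference step are ours and only for illustration.

```python
import numpy as np
from scipy.stats import binom

def fisher_info_11_7(theta, k, h=1e-5):
    # numerical version of the matrix in (11.7) for a binomial kernel
    x = np.arange(k + 1)
    p = binom.pmf(x, k, theta)
    p1 = (binom.pmf(x, k, theta + h) - binom.pmf(x, k, theta - h)) / (2 * h)
    p2 = (binom.pmf(x, k, theta + h) - 2 * p + binom.pmf(x, k, theta - h)) / h ** 2
    y, z = p1 / p, p2 / p
    return np.array([[np.sum(p * y * y), np.sum(p * y * z)],
                     [np.sum(p * y * z), np.sum(p * z * z)]])

# positive definiteness at, say, theta = 0.3 amounts to both eigenvalues being positive
print(np.linalg.eigvalsh(fisher_info_11_7(0.3, k=5)))
```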
A condition of this nature should generally be prohibited in research papers. However, the consistency of $\hat G$ has been extensively discussed in Chap. 3 in our case. Thus, we do not spell out the full legitimate set of conditions here, that is, the conditions placed on the subpopulation distribution family.
11.3 Test for Homogeneity for General Subpopulation Distribution

What is the motivation behind first working on the modified likelihood ratio test under the multinomial mixture models? The answer is that under this class of mixture models, the likelihood ratio statistic is stochastically bounded, as shown by Lemma 11.2. This property partially restores regularity to the modified likelihood function. In Chap. 10, it is seen that the likelihood ratio statistic has a limiting distribution in the form of a squared and truncated Gaussian process, as in Theorem 10.1. In addition to the many restrictions on the finite mixture model in terms of its order and its subpopulation distribution, we point out that the space of the subpopulation parameter, $\Theta$, must be compact to ensure the validity of the tightness condition, and therefore to arrive at the final conclusion. Otherwise, the supremum of the Gaussian process itself may not be well defined. Once the stochastic boundedness is established, the modified likelihood ratio test for homogeneity can be established in exactly the same way as for Theorem 11.2.
m E
.
πj f (x; θj ).
j =1
Assume that .G ∈ Gm with a pre-specified m and that the conditions of Theorem 11.2 are satisfied. Define the modified likelihood ratio statistic as ˆ − ln (θˆ )} Rn = 2{ln (G)
.
ˆ maximizes .pln (G) in the space of .Gm and .θˆ maximizes .pln (G) in the where .G space of .G1 . Then, under the null hypothesis that the true mixing distribution .G∗ is degenerate, we have d
Rn −→ 0.5χ02 + 0.5χ12 .
.
240
11 Modified Likelihood Ratio Test
Let me slightly contradict myself on some statements in Chen et al. (2001). It is true that in most applications we are able to specify a compact .Θ for the mixture model. For instance, the height of a human being is safely confined between 0 and 3 in meters. Once such a condition is in place, the space for mean parameter is compact. We may then fell that that compact condition is satisfied in all realistic applications. However, as it is noticed in various proofs, the chisquare type of limiting distribution is the consequence of .max |θˆj − θ ∗ | → 0. If we have a precise range of .θ ∗ , the convergence will take effect for moderate sample size. A huge sized though compact .Θ may require a huge sample size to achieve a small value of ∗ .max |θˆj − θ |. When this happens, the limiting distribution of .Rn may not be a good description of the finite sample distribution where .Θ is “large” and the sample size is “moderate” even if .Θ is compact. The size of .Θ also has an impact on how large C has to be in order to make the limiting distribution a good approximation of the finite sample distribution. In general, a large C makes the approximation more accurate, while it may hurt the power of detecting a departure from the null model.
11.4 Test for Homogeneity in the Presence of a Structural Parameter The modified likelihood ratio test has been applied to only mixture model whose subpopulation distribution has a single parameter. This reason is mostly technical; otherwise, the test statistic does not have a simple limiting distribution. Chen and Kalbfleisch (1996) made an attempt to apply the modified likelihood test for homogeneity under normal mixture model in the presence of a structure parameter. More specifically, the model under consideration has density function f (x; G, σ ) = π φ(x; θ1 , σ ) + (1 − π )φ(x; θ2 , σ ).
.
The parameter space for .θ is assumed to be a compact .Θ. The parameter space of .σ is usual .R + . As usual, we denote .f (x; G, σ ) as .f (x; θ, σ ) when the mixing distribution G is degenerate. Given a set of i.i.d. observations, the log likelihood function is given by ln (G, σ ) =
n E
.
log f (xi ; G, σ )
i=1
and the modified log likelihood function is defined to be pln (G, σ ) = ln (G, σ ) + C log{π(1 − π )}.
.
11.4 Test for Homogeneity in the Presence of a Structural Parameter
241
ˆ and .σˆ be the mixing distribution and the value of .σ at which .pln (G, σ ) is Let .G maximized, and that .θ˜ and .σ˜ be maximum points of .pln (θ, σ ). Define the modified likelihood ratio test to be ˆ σˆ ) − ln (θ˜ , σ˜ )}. Rn = 2{ln (G,
.
The paper was not able to find a simple analytic form of the limiting distribution of Rn . However, it is found that
.
.
lim Pr(Rn ≤ x) ≤ Pr(χ22 ≤ x)
n→∞
for any x. Consequently, if one uses .χ22 as the reference distribution to compute the approximate p-value, the resulting test is asymptotically conservative. In the context of modified likelihood ratio test, the limiting distribution often under-approximates the stochastic size of the test statistic. Hence, these two factors may work against each other and make the .χ22 is good approximation. This logic is verified in the simulation results of Chen and Kalbfleisch (1996).
Chapter 12
Modified Likelihood Ratio Test for Higher Order
12.1 Test for Higher Order

Constructing a test of the hypothesis $G\in\mathbb{G}_2$ is, in principle, a task similar to the test for homogeneity. Perhaps due to its mathematical complexity, however, the literature is less extensive. Some approaches can be found in the diagnostic method (Roeder 1994; Lindsay and Roeder 1997), the moment methods (Lindsay 1989), and the model selection approach (Chen and Kalbfleisch 1996; Henna 1985). If the component distributions are known, Chen and Cheng (1997) discuss a bootstrap method. Even though there is relatively little literature on the subject, the problem of testing $G\in\mathbb{G}_2$ against the alternative that G has more support points is just as important in applications. For example, if a quantitative trait is determined by a simple gene with two alleles, a mixture of two normal distributions might be appropriate when the mode of inheritance is dominant, whereas a mixture of three or more normals will be appropriate when the mode of inheritance is additive or more complex in nature. We devote most of our energy in this chapter to the modified likelihood ratio test. While the test itself is not very complex, the mathematical derivation behind its asymptotic properties is rather involved.
12.2 A Modified Likelihood Ratio Test for m = 2

Let $x_1,\ldots,x_n$ be an i.i.d. sample of size n from a finite mixture model

$$f(x; G) = \sum_{j=1}^m \pi_j f(x;\theta_j). \qquad (12.1)$$
The problem of interest here is to test the null hypothesis $H_0: G\in\mathbb{G}_2$, or $m = 2$, against the alternative $m \ge 3$. This had been a daunting task for many years. The search for an effective and easy-to-implement procedure before this development had met with only limited success. The success remains limited, as we only consider the case where the parameter space of $\theta$ is one-dimensional and the space $\Theta$ is a finite interval. With $m = 2$ being the null hypothesis, we write the true mixing distribution as

$$G^*(\theta) = \pi^*\{\theta_1^*\} + (1-\pi^*)\{\theta_2^*\}, \qquad (12.2)$$

where $\theta_1^*$ and $\theta_2^*$ are distinct interior points of $\Theta$ and $0 < \pi^* < 1$. Expectations and probabilities will subsequently be computed with respect to this null mixing distribution. For a mixing distribution of the form

$$G(\theta) = \sum_{i=1}^m \pi_i 1(\theta_i \le \theta) \in \mathbb{G}_m$$

with $m \ge 2$, the modified log likelihood function is defined to be

$$pl_n^{(m)}(G) = l_n(G) + C_m\sum_{j=1}^m \log\pi_j, \qquad (12.3)$$
where $C_m$ is some positive constant and

$$l_n(G) = \sum_{i=1}^n \log f(x_i; G)$$

is the ordinary log likelihood function. Due to the introduction of a penalty that depends on the order of the finite mixture model, the above definition appears more complicated than warranted. The superscript in $pl_n^{(m)}$ insists on writing every mixing distribution in $\mathbb{G}_m$ in the form of having m support points. If a mixing distribution in $\mathbb{G}_m$ has fewer than m support points, we use an expression of the form $G = \sum_{i=1}^m \pi_i\{\theta_i\}$ such that $\sum_{j=1}^m|\log\pi_j|$ is minimized. The constant $C_m$ determines the level of penalty on the mixing proportions $\pi_j$ of G. The asymptotic properties of the statistical procedures based on $pl_n^{(m)}(G)$ do not depend on the choice of $C_m$. In practice, the choice of $C_m$ is expected to have some effect on the performance of the statistical procedures for small or moderate sample sizes. This is an issue best discussed separately. The modified MLE of G is then obtained by maximizing $pl_n^{(m)}$ over the space $\mathbb{G}_m$. We denote the modified MLE of G by $\hat G = \hat G^{(m)}$, and the modified MLE's of $\theta_j$ and $\pi_j$ by $\hat\theta_j$ and $\hat\pi_j$. Thus, the modified MLE of G under the null hypothesis is
$\hat G^* = \hat G^{(2)}$, which maximizes the modified likelihood (12.3) when $m = 2$. Let $\hat\pi^*$, $\hat\theta_1^*$, and $\hat\theta_2^*$ be the modified MLE's of $\pi^*$, $\theta_1^*$, and $\theta_2^*$, respectively. That is,

$$\hat G^*(\theta) = \hat\pi^*\{\hat\theta_1^*\} + (1-\hat\pi^*)\{\hat\theta_2^*\}.$$
The modified LRT is then based on the statistic

$$R_n = 2\{l_n(\hat G) - l_n(\hat G^*)\},$$

where $\hat G$ is the modified MLE of G in $\mathbb{G}_m$ with $m \ge m^*$. Although both $\hat G$ and $\hat G^*$ are maximum points of the modified likelihood function, the test statistic itself is defined with the penalty terms removed. This measure leads to a cleaner limiting distribution. The modified LRT procedure for testing

$$H_0: G\in\mathbb{G}_2 \quad\text{versus}\quad H_1: G\in\mathbb{G}$$

is as follows.

Step 1. Obtain the estimate $\hat G^*$, which maximizes the modified likelihood function $pl_n^{(2)}(G)$ over $\mathbb{G}_2$.
Step 2. Let $\hat m^* = 2/\min\{\hat\pi^*, 1-\hat\pi^*\}$ and let $m \ge \hat m^*$ be any integer. Obtain the estimate $\hat G$, which maximizes $pl_n^{(m)}(G)$ over $\mathbb{G}_m$.
Step 3. Compute the test statistic $R_n = 2\{l_n(\hat G) - l_n(\hat G^*)\}$. Reject the null hypothesis $H_0$ if $R_n$ exceeds some critical value.
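For concreteness, the maximizations in Steps 1 and 2 can be carried out with a penalized EM iteration. The sketch below is written for an assumed normal kernel $f(x;\theta) = N(\theta, 1)$; the kernel, the penalty constant C, and the stopping rule are illustrative choices of ours rather than prescriptions from the text. The only change from the ordinary EM is the M-step for the mixing proportions, which adds C to each posterior count.

```python
import numpy as np
from scipy.stats import norm

def penalized_em(x, theta0, pi0, C=1.0, n_iter=500, tol=1e-8):
    """Maximize l_n(G) + C * sum_j log(pi_j) over mixing distributions
    with m = len(theta0) support points, for a unit-variance normal kernel."""
    x = np.asarray(x, dtype=float)
    theta, pi = np.array(theta0, dtype=float), np.array(pi0, dtype=float)
    n, m = len(x), len(theta)
    last = -np.inf
    for _ in range(n_iter):
        dens = norm.pdf(x[:, None], loc=theta[None, :])       # n x m component densities
        w = pi[None, :] * dens
        mix = w.sum(axis=1)
        pl = np.log(mix).sum() + C * np.log(pi).sum()         # penalized log likelihood
        if pl - last < tol:
            break
        last = pl
        w /= mix[:, None]                                     # E-step: posterior weights
        pi = (w.sum(axis=0) + C) / (n + m * C)                # penalized M-step for pi
        theta = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)  # M-step for the means
    loglik = np.log((pi[None, :] * norm.pdf(x[:, None], loc=theta[None, :])).sum(axis=1)).sum()
    return theta, pi, loglik   # loglik excludes the penalty, as the test statistic requires

# R_n = 2 * (loglik with m support points - loglik with 2 support points),
# each term obtained from penalized_em with suitable starting values.
```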
In Step 2, we require $m > \hat m^*$ to ensure that $\hat G$ with m support points has some desired structure. More specifically, we want to asymptotically decompose it as $\hat\pi\hat G_1 + (1-\hat\pi)\hat G_2$ such that (a) the support points of $\hat G_1$ and $\hat G_2$ are consistent estimators of $\theta_1^*$ or $\theta_2^*$, respectively, (b) $\hat\pi$ is consistent for $\pi^*$, and (c) both $\hat G_1$ and $\hat G_2$ have room for at least two support points. The last point is to permit both of them to have positive variance. These technicalities appear statistically nonsensical but are needed to establish our limiting distribution. The critical value of any test is conceptually determined by the targeted nominal type I error. In lieu of a critical value, one may compute a p-value describing how extreme the observed value of the test statistic is. If this p-value is below the pre-chosen significance level, the null hypothesis is rejected in favor of the alternative hypothesis. The computation of the p-value is generally based on a null limiting distribution that is used as the reference distribution. Thus, one of the two most critical issues in designing an effective significance test procedure is to find the null limiting distribution of $R_n$ under some null distribution of the data. Furthermore, we need to make sure that the null limiting distribution is a good approximation of the finite sample distribution of $R_n$ and that the associated p-value calculation is reasonably simple. In fact, since the null hypothesis space $\mathbb{G}_2$ is composite, different members of $\mathbb{G}_2$ may lead to different null limiting distributions. In general, we use some methods to
identify a distribution in $\mathbb{G}_2$ as the null distribution and use the limiting distribution of the test statistic under this distribution as the reference distribution. This not-so-rigorous approach is widely accepted, and it also underlies the development in this chapter. The main conclusion we wish to provide in this chapter is that the modified likelihood ratio test statistic $R_n$ has the following limiting distribution:

$$\Big(\frac{1}{2} - \frac{\alpha}{2\Pi}\Big)\chi_0^2 + \frac{1}{2}\chi_1^2 + \frac{\alpha}{2\Pi}\chi_2^2,$$

with the value of $\alpha$ to be specified later and $\Pi = 3.14\cdots$ the mathematical constant. This unusual notation is adopted because the usual $\pi$ is the default notation for a mixing proportion in this book. The value of $\alpha$, unfortunately, depends on $G^*$. However, a p-value calculation is easy based on an estimated $\alpha$. Thus, the modified likelihood ratio test provides one of the few early practical methods for testing the order of finite mixture models. At the same time, some more recent developments make this method less attractive. Nevertheless, the development of this method remains an interesting topic.
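Once $\alpha$ is available (its estimation is described in the numerical discussion later in this chapter), the p-value implied by this limiting distribution is immediate. A minimal sketch, with names of ours:

```python
import numpy as np
from scipy.stats import chi2

def mlrt_order2_pvalue(r_obs, alpha):
    # p-value under ((1/2) - alpha/(2*Pi)) chi^2_0 + (1/2) chi^2_1 + (alpha/(2*Pi)) chi^2_2
    if r_obs <= 0:
        return 1.0
    return 0.5 * chi2.sf(r_obs, df=1) + (alpha / (2 * np.pi)) * chi2.sf(r_obs, df=2)
```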
12.2.1 Technical Outlines

The modified likelihood ratio test, or any test, would be useless if it were practically impossible to compute at least its approximate p-value in real data situations. The approximation is generally built on a concrete form of the limiting null distribution of the test statistic, which is $R_n$ in the current problem. The limiting distribution has been presented in the last section. The technical details behind this limiting distribution are tedious. Many of these details coincide with what has been discussed in the chapters on the homogeneity test and the brute-force likelihood ratio test. In this section, we first paint a big picture and aim to get into the details subsequently. The goal is to include as many intuitive though not rigorous explanations as possible while, at the same time, not losing certain key ideas.

Consistency Issue The first key observation in developing the modified likelihood ratio test is the consistency of the modified MLE of G. This result enables us to approximate the likelihood function by some relatively easy-to-handle quadratic forms. The eventual form of the limiting distribution is derived from such quadratic forms. Let $\hat G$ be the modified MLE of G over $\mathbb{G}_m$ and put

$$\hat G = \sum_{j=1}^m \hat\pi_j\{\hat\theta_j\}.$$
247
Lemma 12.1 Suppose that Conditions C120.1–5 in the Appendix section of this chapter hold and that the true distribution is a finite mixture .f (x; G∗ ). Then, for any given .m > 0, as .n → ∞, m E .
log πˆ j = Op (1).
j =1
¯ be the ordinary MLE of G over .Gm . That is, at which .ln (G) attains its Proof Let .G global maximum on .Gm . By definition, we have (m) ∗ ˆ pl(m) n (G) ≥ pln (G )
.
and ¯ ≥ ln (G). ˆ ln (G)
.
Hence, (m) ∗ ˆ 0 ≤ pl(m) n (G) − pln (G )
.
ˆ − ln (G∗ )} + Cm { = {ln (G)
m E
log πˆ j −
j =1
¯ − ln (G∗ )} + Cm { ≤ {ln (G)
m E
m E
log πj∗ }
j =1
log πˆ j −
j =1
m E
log πj∗ }.
j =1
Hence, we get .
− Cm
m E
¯ − ln (G∗ )} − Cm log πˆ j ≤ {ln (G)
j =1
m E
log πj∗ .
j =1
Under the conditions of the current lemma, the results of Dacunha-Castelle and Gassiat (1999) are justified. As also pointed out in Chap. 9, among many implications of their results, a seemingly not so important but useful conclusion is that ¯ − ln (G∗ ) = Op (1). ln (G)
.
E ∗ Since .Cm > 0 and . m j =1 log πj does not depend on n, we conclude m E .
log πˆ j = Op (1).
j =1
And the lemma follows.
u n
248
12 Modified Likelihood Ratio Test for Higher Order
E ∗ By . m j =1 log πj in the above proof, we split the true mixing proportion into sum of m positive constants so that its value is minimized. In fact, whatever way it is done, the outcome is a constant that does not depend on n. Therefore, the conclusion will not be affected. A similar result, Lemma 11.1, was given in Chap. 11 in the context of homogeneity test. Under a homogeneity null hypothesis, the mixing distribution has a single support point. Hence, such a result implies that all support points in the modified ˆ converge to that parameter value in probability. MLE .G The implication of Lemma 12.1 is the same though the presentation is more complex. Under the null distribution .f (x; G∗ ) when .G∗ has two distinct support ˆ .θˆj ’s, must converge to those of .G∗ . To be exact, points, all support points of .G, ∗ ∗ ∗ let .θ = (θ1 + θ2 )/2 be the middle point of the support points of .G∗ . Define ˆ ∗ ), which is the probability assigned to the support points .θˆj ≤ θ ∗ by the .π ˆ = G(θ ˆ Then .G ˆ can be expressed as a mixture as follows: mixing distribution estimate .G. ˆ 2 (θ ), ˆ ) = πˆ G ˆ 1 (θ ) + (1 − πˆ )G G(θ
.
ˆ 1 (θ ∗ ) = 1 and .G ˆ 2 (θ ∗ ) = 0. Similarly, we can express the null mixing where .G ∗ distribution .G as G∗ (θ ) = π ∗ G1 (θ ) + (1 − π ∗ )G2 (θ ),
.
where .G1 (θ ∗ ) = 1 and .G2 (θ ∗ ) = 0. With these conventions, the consistency conclusion can be stated unambiguously as follows. Lemma 12.2 Suppose that Conditions C120.1–5 hold and that the true distribution is .f (x; G∗ ). Then (a) (b) (c) (d)
πˆ = π ∗ + op (1). ˆ j − Gj | = op (1), where .| · | is the supremum norm. For .j = 1, 2, we have .|G ˆ All support points of .G{i converge to those of .Gi for .i = 1, 2. ˆ j (θ ) = op (1) for .j = 1, 2 and .r > 0. The absolute moment . |θ − θi∗ |r d G
.
ˆ The results in this lemma Eare consequences of the consistency of .G as a mixing distribution, the fact that . j log πˆ j = Op (1) and that the parameter space .Θ is compact. They pave the way for expanding the modified likelihood ratio test statistic. ˆ are in an infinitesimal Expansion Step Knowing that all support points of .G ˆ at .G∗ becomes neighborhood of the two support points of .G∗ , expanding .ln (G) ˆ − ln (G ˆ ∗ )} meaningful. We study the large sample properties of .Rn = 2{ln (G) through the decomposition Rn = R1n − R0n ,
.
12.2 A Modified Likelihood Ratio Test for m = 2
249
ˆ − ln (G∗ )} and .R0n = 2{ln (G ˆ ∗ ) − ln (G∗ )}. The subsequent where .R1n = 2{ln (G) derivations may look familiar. The same ideas have been used many times. The key task is to obtain quadratic-type expansions of .R1n and .R0n separately. Before we start the expansion, it is helpful to review some basics on the classical Wilks theorem established for regular models. Let us temporarily switch to regular models until we are told to switch back. Denote the true parameter value as .θ ∗ , the regular MLE as .θˆ , and use the same .ln (θ ) for the log likelihood function based on ˆ = 0. Therefore, n i.i.d. observations. For regular models, .l'n (θ) 1 ˆ 2 ln (θ ∗ ) ≈ ln (θˆ ) + l'n (θˆ )(θ ∗ − θˆ ) + l'n (θˆ )(θ ∗ − θ) 2 1 ≈ ln (θˆ ) − {n(θ ∗ − θˆ )2 }I (θ ∗ ) 2
.
where .I (θ ∗ ) is the Fisher information based on a single observation. For regular models, .θ ∗ − θˆ = Op (n−1/2 ). Thus, from the fact that the approximation error is at a higher order of .n(θ ∗ − θˆ )2 = Op (1), we have 2{ln (θˆ ) − ln (θ ∗ )} = {n(θ ∗ − θˆ )2 }I (θ ∗ ) + op (1).
.
√ This leads to the classical chi-square limiting distribution result, because . n(θˆ −θ ∗ ) is asymptotically normal with variance or variance–covariance matrix .I −1 . What do we learn from such a quick review? We note that the limiting distribution of the likelihood ratio under regular model is determined by the leading term in the expansion of .2{ln (θˆ ) − ln (θ ∗ )}. Since the rate of convergence of the MLE under a regular model is .Op (n−1/2 ), we need to keep track of terms only up to .(θ ∗ − θˆ )2 in order to determine the limiting distribution. Now we switch back to non-regular finite mixture model. Under finite mixture models, the convergence rate for G is usually at a much lower .Op (n−1/4 ). Hence, we have to keep many more terms in the expansion of .ln (G) in order to avoid missing out “leading terms.” At the same time, the convergence rate for estimating the moments of G may attain the usual .n−1/2 . This understanding guides us on how many terms we should keep in various expansions. For .i = 1, . . . , n, define Yi' (θ ) =
.
f ' (xi ; θ ) f '' (xi ; θ ) f ''' (xi ; θ ) '' ''' ; Y ; Y . (θ ) = (θ ) = i i f (xi ; G∗ ) f (xi ; G∗ ) f (xi ; G∗ )
(12.4)
When the kernel density function .f (x; θ ) is regular, we have .E{Y ' (X; θ )} = 0 calculated under the true mixture. Thus, n E .
i=1
Yi' (θ ) = Op (n1/2 )
250
12 Modified Likelihood Ratio Test for Higher Order
when some moment conditions are satisfied at every single .θ value. When .Θ is a compact space and so on, this order claim holds uniformly. This is assumed in the current intuitiveE derivations. Put .R1n = 2 log(1 + δi ), with δi =
.
ˆ − f (xi ; G∗ ) f (xi ; G) . f (xi ; G∗ )
(12.5)
ˆ match these of .G∗ . This consideration By Lemma 12.2, the support points of .G leads to a useful decomposition: δi = (πˆ − π ∗ )Δi + πˆ
.
ˆ 1 ) − f (xi ; θ ∗ ) ˆ 2 ) − f (xi ; θ ∗ ) f (xi ; G f (xi ; G 1 2 , + (1 − π ˆ ) f (xi ; G∗ ) f (xi ; G∗ ) (12.6)
where Δi =
.
f (xi ; θ1∗ ) − f (xi ; θ2∗ ) . f (xi ; G∗ )
Clearly, all three terms converge to zero in probability because of Theorem 12.2. For a .θ value such that .θ − θ1∗ = Op (n−1/4 ) as anticipated, we have .
n E f (xi ; θ ) − f (xi ; θ ∗ )
f (xi ; G∗ )
i=1
1
≈ (θ −θ1∗ )
n E i=1
E 1 Yi' (θ1∗ )+ (θ −θ1∗ )2 Yi'' (θ1∗ ) 2 n
(12.7)
i=1
The remainder term we have omitted above is roughly (θ − θ1∗ )3
n E
.
Y ''' (θ1∗ ) = n1/2 (θ − θ1∗ )3 = op (1).
i=1
By stopping the expansion at this term, quantities that do not affect the limiting distribution are left out. Put, for .i, j = 1, 2, { m ˆ ij =
.
ˆ j (θ ). (θ − θj∗ )i d G
ˆ j . They are the most important quantities in the Namely, they are moments of .G ˆ 1 , we get expansion of .Rn . Integrating two sides of (12.7) on .θ with respect to .G
.
n E ˆ 1 ) − f (xi ; θ1 ) f (xi ; G i=1
f (xi ; G∗ )
≈m ˆ 11
n E
Yi' (θ1∗ ) +
i=1
ˆ 2. We also get a similar expression in terms of .θ2∗ and .G
n m ˆ 21 E '' ∗ Yi (θ1 ). 2 i=1
12.2 A Modified Likelihood Ratio Test for m = 2
251
When .θ ’s are in an .n−1/4 neighborhood of .θ1∗ , we may get an impression that .m ˆ 11 = Op (n−1/4 ) and .m ˆ 21 = Op (n−1/2 ). That is, they have different orders. If so, the expansion that keeps two quantities of different orders seem not sensible. However, .m ˆ 11 = Op (n−1/4 ) includes the possibility that .m ˆ 11 = Op (n−1/2 ) because the first moment is a linear combination of potentially both positive and negative terms. Applying this expansion to .δi , we find n E .
i=1
n { E (πˆ − π ∗ )Δi + πˆ m δi ≈ ˆ 11 Yi' (θ1∗ ) + (1 − πˆ )m ˆ 12 Yi' (θ2∗ ) i=1
} 1 1 + πˆ m ˆ 21 Yi'' (θ1∗ ) + (1 − πˆ )m ˆ 22 Yi'' (θ2∗ ) . 2 2
(12.8)
Denote ( )T bi = Δi , Yi' (θ1 ), Yi' (θ2 ), Yi'' (θ1 ), Yi'' (θ2 ) , .
.
1 1 ˆt = (πˆ − π ∗ , πˆ m ˆ 21 , (1 − π) ˆ m ˆ 22 )T . ˆ 11 , (1 − πˆ )m ˆ 12 , πˆ m 2 2
(12.9) (12.10)
ˆ The summand in (12.8) becomes .bT i t in new notation. Define further b=
n E
.
bi ; B =
i=1
n E
bi bT i .
(12.11)
i=1
The writing becomes even simpler: n E .
δi ≈ bT ˆt.
i=1
When .δi is small, .2 log(1 + δi ) ≈ 2δi − δi2 . This leads to { } R1n ≈ 2bT ˆt − ˆtT Bˆt + op (1).
.
(12.12)
Ignoring the high-order term, this leads to { } R1n ≤ sup 2bT t − tT Bt + op (1).
.
(12.13)
t
The range of .t in the above supremum is not .R 5 : Its first entry .−1 ≤ π − π ∗ ≤ 1, and its fourth and fifth entries are nonnegative.
252
12 Modified Likelihood Ratio Test for Higher Order
To facilitate clear presentation, let us introduce a more general .t as follows: )T ( 1 1 t = t(G) = π − π ∗ , π m11 , (1 − π )m12 , π m21 , (1 − π )m22 2 2 { with .π = π(G) = G(θ ∗ ) and its .G1 , .G2 , and .mij = (θ − θj∗ )i dGj (θ ). The ˆ where .G ˆ is the maximum modified likelihood estimator. specific .tˆ = t(G) T T T T Partition the vector .b, .t by .bT = (bT 1 , b2 ), .t = (t1 , t2 ) and .
( B=
.
B11 B12 B21 B22
)
so that .b1 and .t1 are .3 × 1 vectors and .B11 is a .3 × 3 matrix. Some matrix algebra allow us to decompose this quadratic form into T˜ ˜T ˜ ˜ ˜T 2bT t − tT Bt = (2bT 1 t1 − t1 B11 t1 ) + (2b2 t2 − t2 B22 t2 ),
.
(12.14)
where ˜t1 = t1 − B−1 B12 t2 , 11
.
T T −1 b˜ T 2 = b2 − b1 B11 B12 ,
˜ 22 = B22 − B21 B−1 B12 . B 11 If .˜t1 and .t2 can assume any values, then (12.14) attains its upper bound when ˜ −1 ˜ T ˜t1 = B−1 bT ˜ 11 1 , t2 = B22 b2 .
.
If the above .˜t1 and .˜t2 were feasible, the following upper bound for (12.13) would be tight: −1 ˜ T ˜ −1 ˜ R1n ≤ bT 1 B11 b1 + b2 B22 b2 + op (1).
.
This is not the case, however, because the elements in .t2 are intrinsically nonnega˜T tive. There may not be a mixing distribution G, which makes .˜t2 = B˜ −1 22 b2 . Hence, we can tighten the upper bound by requiring .t2 ≥ 0. With this restriction imposed, we get −1 T˜ ˜T R1n ≤ bT 1 B11 b1 + sup {2b2 t2 − t2 B22 t2 } + op (1).
.
t2 ≥0
(12.15)
Up to this point, (12.15) remains an upper bound, rather than an expansion of R1n . Whether or not this upper bound is in fact an expansion depends on whether or
.
12.2 A Modified Likelihood Ratio Test for m = 2
253
not the upper bound (12.13) is attained at some G. The answer is positive and to be elaborated. Quadratic Expansion of .R0n The mixture model under the null hypothesis is “regular”: we know the number of support points is exactly two and the mixing ˆ ∗ ) − ln (G∗ ) proportions are between 0 and 1. A useful expansion of .R0n = ln (G can hence be obtained with much less effort. Let us make further use of (12.14). When .m = 2 or when the null hypothesis ˆ 1 and .G ˆ 2 have a single support point so that .m holds, both .G ˆ 2j = m ˆ 21j . Consequently, we would find .ˆt2 = Op (n−1 ), which makes T˜ T˜ ˜ ˜T 2b˜ T 2 t2 − t2 B22 t2 = op {2b1 t1 − t1 B11 t1 }.
.
That is, the second term is negligible for asymptotic consideration in this case. Omitting the .op (1) term, we get ˜ ˜ ˜T R0n ≤ sup{2bT 1 t1 − t1 B11 t1 }.
.
˜t1
Because this expansion is derived under the regular null model, there is practically no restriction on the range of .˜t1 in the above expansion. In addition, the size of penalty is not an issue under the null hypothesis. Therefore, the upper bound is −1 given by .bT 1 B11 b1 . Ignoring high-order term, we find T −1 ˜ ˜T ˜ R0n = sup{2bT 1 t1 − t1 B11 t1 } = b1 B11 b1 .
.
˜t1
We have hence arrived at the conclusion that T˜ Rn ≤ sup {2b˜ T 2 t2 − t2 B22 t2 } + op (1).
.
t2 ≥0
(12.16)
Is this upper bound tight? Namely, does there exist a mixing distribution G so that the inequality in (12.16) becomes an equality? This is the final task and the confirmation of which leads to the limiting distribution. The Upper Bound Is Tight Lemma 12.3 Suppose that Conditions C120.1–5 hold and that the true distribution is .f (x; G∗ ) with .0 < π ∗ < 1 and .θ1∗ /= θ2∗ . Then as .n → ∞ −1 T˜ ˜T R1n = bT 1 B11 b1 + sup {2b2 t2 − t2 B22 t2 } + op (1).
.
t2 ≥0
Therefore,
254
12 Modified Likelihood Ratio Test for Higher Order T˜ Rn = sup {2b˜ T 2 t2 − t2 B22 t2 } + op (1).
.
t2 ≥0
(12.17)
ˆ is the maximum point of .pln (G). By this definition, The modified MLE of G, .G, for any mixing distribution G on .Θ with m support points, we have ˆ − ln (G ˆ ∗ )} Rn = 2{ln (G)
.
ˆ ˆ∗ ˆ = 2{pl(m) n (G) − ln (G )} − 2pm (G) ˆ∗ ˆ ≥ 2{pl(m) n (G) − ln (G )} − 2pm (G) ˆ ∗ )} − 2{pm (G) ˆ − pm (G)}. ≥ 2{ln (G) − ln (G By requiring m sufficiently large as specified earlier in Step 2 of the construction of ˆ − pm (G) = op (1) as well as modified LRT, it is possible to find a G so that .pm (G) having its .t2 maximize T˜ ˜T ˜ ˜T ˜ (2bT 1 t1 − t1 B11 t1 ) + (2b2 t2 − t2 B22 t2 ),
.
with respect to .˜t1 and .t2 over the range of .t2 ≥ 0. If so, the opposite of (12.16) would hold: T˜ ˆ ∗ )} + op (1) ≥ sup {2b˜ T Rn ≥ 2{ln (G) − ln (G 2 t2 − t2 B22 t2 } + op (1).
.
t2 ≥0
We point out that this is indeed possible here. Analytical Form of the Limiting Distribution Having established a quadratic expansion for .Rn , we are ready to reveal the exact analytical form of the limiting distribution. For this purpose, let us pay attention to the fact that .b˜ 2 is a sum of i.i.d. zero-mean, finite variance random vectors. Thus, .n−1/2 b˜ 2 is asymptotically multivariate normal. For the same reason, the law of large number is applicable to matrix .n−1 B˜ 22 , which converges to the covariance matrix of .n−1/2 b˜ 2 . If the range of −1 ˜ t2 were not restricted, the quadratic leading term in .Rn would be given by .b˜ T .˜ 2 B22 b2 . This is a textbook example for a chisquare limiting distribution with 2 degrees of freedom. ˆ 1 and .G ˆ 2 . Hence, In the current situation, .˜t2 is made of the second moments of .G 2 its range is the first quadrant of the .R plane. This restriction leads to a limiting distribution, which is stochastically smaller than .χ22 . Theorem 12.1 Suppose that Conditions 1–5 hold and that the true distribution is f (x; G∗ ) with .0 < π ∗ < 1 and .θ1∗ /= θ2∗ . Suppose that m satisfies .m ≥ m∗ . Then the asymptotic distribution of the modified LRT statistic .Rn is that of the mixture
.
( .
1 α − 2 2Π
)
1 α 2 χ02 + χ12 + χ , 2 2Π 2
12.2 A Modified Likelihood Ratio Test for m = 2
255
where .α = arccos(ρ) and .ρ is the correlation coefficient between the two components of .b˜ 2 . Proof Without loss of generality, we assume that the covariance matrix of .n−1/2 b˜ 2 has the following standard form: [
] 1ρ .Σ = . ρ 1 Let .(Z1 , Z2 )T be bivariate normal with mean zero and covariance matrix .Σ, and let 2 −1/2 (Z − ρZ ), so that .W and .W are independent .W1 = Z1 and .W2 = (1 − ρ ) 2 1 1 2 .N (0, 1) variates. As .n → ∞, it can be seen that T˜ sup. {2b˜ T 2 t2 − t2 B22 t2 }
t2 ≥0
→
sup
ξ1 ≥0,ξ2 ≥0
{2Z1 ξ1 + 2Z2 ξ2 − ξ12 − 2ρξ1 ξ2 − ξ22 }
= W12 + W22 − = W12 + W22 −
inf
{[W1 − (ξ1 + ρξ2 )]2 + [W2 − (1 − ρ 2 )1/2 ξ2 ]2 }
inf
{(W1 − η1 )2 + (W2 − η2 )2 }.
ξ1 ≥0,ξ2 ≥0 (η1 ,η2 )∈S
The restrictions .ξ1 ≥ 0 and .ξ2 ≥ 0 are transformed into restrictions on .(η1 , η2 ) specified by the cone / S = {(η1 , η2 ) : η2 ≥ 0, η1 ≥ ρη2 / 1 − ρ 2 }.
.
This cone is illustrated in Fig. 12.1 In this figure, we have divided the .R 2 plane into three cones including .S . Its dual cone .S ∗ is formed by vectors with an angle at least .Π /2 to any vectors in .S . The complement of these two is not named in the figure. The size of .S is .α = arccos(ρ), which is at most .Π , half of the plane. Therefore, .S ∗ is nearly always non-empty and the complement of .S ∪ S ∗ has size .Π . The limiting distribution of .Rn is that of W12 + W22 −
.
inf
(η1 ,η2 )∈S
{(W1 − η1 )2 + (W2 − η2 )2 }.
As a standard bivariate normally distributed random vector, the norm of .(W1 , W2 ) and its direction vector are statistically independent. The limiting distribution of .Rn can therefore be discussed, conditional on which cone .(W1 , W2 ) falls in. 1. Given that .(W1 , W2 ) ∈ S , .Rn has a .χ22 distribution. The size of the cone is .α, the correlation coefficient between two components of .b˜ 2 . It occurs with probability .α/(2Π ).
256
12 Modified Likelihood Ratio Test for Higher Order
Fig. 12.1 The cone .S and the dual cone .S ∗
2. Given that .(W1 , W2 ) is in the dual cone (.S ∗ in Fig. 12.1), .Rn = 0. The probability for this portion of limiting distribution complements the other two. 3. Given that .(W1 , W2 ) is in the remaining region, .Rn has a .χ12 distribution. This cone takes half of two dimensional plane. Thus it occurs with probability 1. Combining the above three cases, we get the analytical form of the limiting distribution given this theorem. u n Numerical Consideration The analytical form of the limiting distribution of .Rn is comparatively simple to work with. When applied to a data example, there are few ˆ and .G ˆ ∗ are almost the same as practical issues. The numerical problems to find .G the problem of searching for the usual maximum likelihood estimator. The EMalgorithm is readily applicable. Multiple initial values can be used to lower the chance of merely obtaining a local maximum. ˆ ∗ . Let .θ ∗ and .θ ∗ be its support points. Suppose we have successfully located .G 1 2 We then compute ˆ ∗ ) = (Δi , Yi' (θ1∗ ), Yi' (θ2∗ ), Yi'' (θ1∗ ), Yi'' (θ2∗ ))T bi = bi (G
.
E for each i. After which, we obtain .B = ni=1 bi bT . From there, we obtain .B˜ 22 = ˜ B22 −B21 B−1 11 B12 . Note that .B22 is a two-by-two matrix. From which, we determine the correlation coefficient as ρ = B˜ 22 [1, 2]{B˜ 22 [1, 1] ∗ B˜ 22 [2, 2]}−1/2
.
12.2 A Modified Likelihood Ratio Test for m = 2
257
and hence .α = arccos(ρ). We have used a convention that .B[i, j ] is the (i, j)th entry of matrix .B in the above equation. Let .Robs be the observed value of the modified likelihood ratio statistic. The pvalue for testing .H0 : the data are from a two-component mixture model against the alternative that the mixture model has three or more component is given by p=
.
1 α Pr(χ12 > Robs ) + Pr(χ22 > Robs ). 2 2Π
If the size of the test is set at .5%, one would reject .H0 when .p < 0.05.
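The steps just described are mechanical once the vectors $b_i$ have been evaluated at the fitted null values. A small sketch follows, with variable names of ours, assuming b is the $n\times 5$ array whose rows are the $b_i$.

```python
import numpy as np
from scipy.stats import chi2

def alpha_and_pvalue(b, r_obs):
    # B, the adjusted block tilde{B}_22, rho, alpha = arccos(rho), and the p-value
    B = b.T @ b
    B11, B12, B21, B22 = B[:3, :3], B[:3, 3:], B[3:, :3], B[3:, 3:]
    B22_tilde = B22 - B21 @ np.linalg.solve(B11, B12)
    rho = B22_tilde[0, 1] / np.sqrt(B22_tilde[0, 0] * B22_tilde[1, 1])
    alpha = np.arccos(rho)
    p = 0.5 * chi2.sf(r_obs, df=1) + (alpha / (2 * np.pi)) * chi2.sf(r_obs, df=2)
    return alpha, p
```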
12.2.2 Regularity Conditions and Rigorous Proofs Here are the conditions imposed on the mixture model under which the conclusion of Theorem 12.1 holds. C120.1. C120.2. C120.3.
C120.4.
The parameter space .Θ is a bounded and closed interval of .R. The mixing parameters of .G∗ are interior point of .Θ. Conditions C20.1 and C20.2 given in Chap. 2 are satisfied. Smoothness. The support of .f (x; θ ) is independent of .θ and .f (x; θ ) is three times differentiable with respect to .θ in .Θ. Further, .f (x; θ ) and its derivatives with respect to .θ , .f ' (x; θ ), .f '' (x; θ ), and .f ''' (x; θ ), are jointly continuous in x and .θ . Strong identifiability. For any .θ1 /= θ2 in .Θ, 2 E .
{aj f (x; θj ) + bj f ' (x; θj ) + cj f '' (x; θ )} = 0,
j =1
C120.5.
for all x, it implies that .aj = bj = cj = 0, .j = 1, 2. Uniform boundedness. There exists an integrable function g and some ' '' ''' 2 2 2 .δ > 0 such that .|Y (θ )| ≤ g(Xi ), .|Y (θ )| ≤ g(Xi ) and .|Y (θ )| ≤ i i i g(Xi ) for all .θ . Some of the above conditions are natural. They simply specify the range of the applicability of the theorem. Often, they are listed to rule out illy constructed mixture models. Other conditions are admittedly imposed for technical reasons. Their presence implies that the result is not as broadly applicable as we would like. Here are some explanations.
C120.1 is technical. One would love to allow vector parameters and unbounded Θ if at all possible. Being one-dimensional makes the result applicable to Poisson, Binomial mixture, and so on. Conceptually, there can be many mixture models with multidimensional parameters. An example can be the finite mixture of Gamma distributions, which has two- or three-parameter versions. Requiring .Θ bounded
.
258
12 Modified Likelihood Ratio Test for Higher Order
may not appear too restrictive. We can often come up with a natural finite upper bound for .θ in most applications. However, the true implication of the bounded .Θ assumption is the rate of convergence. The limiting distribution may not be so meaningful until the sample size n is exceedingly large if .Θ is practically unbounded. C120.2 is there to ensure the consistency of the maximum modified likelihood estimator of the mixing distribution. The consistency topic has been discussed in Chap. 2. Particularly when .Θ is bounded, these conditions are satisfied by most commonly used models. Similarly, C120.3 places almost no restrictions to the applicability of Theorem 12.1. C120.4 has also been discussed in Chap. 2. Technically, this condition ensures the support points of .G∗ are estimated consistently at a rate of .n−1/4 . More trivially, without this condition, matrix .B defined in (12.11) may not be strictly positive definite. Fortunately, this condition is not so restrictive as to seriously reduce the usefulness of Theorem 12.1. C120.5 is surprisingly restrictive. It is not satisfied by the finite mixture of exponential distributions. For example, suppose .θ1∗ = 1 and .θ2∗ = 2 with .π ∗ = 0.5 so that f (x; G∗ ) = 0.5{exp(−x) + 2 exp(−2x)}
.
on .x ≥ 0. It is seen that .Y ' (θ ) ≥ θ exp{−(θ − 1)x}. Hence, at .θ = 0.5, we have .{Y ' (0.5)}2 ≥ 0.25 exp(x), which does not have finite moment when X has a distribution specified by .f (x; G∗ ). In spite of this example, C120.5 is satisfied by Poisson, binomial mixture models. E The desired implication is the tightness of the stochastic process .n−1/2 Yi' (θ ) over the compact .Θ. For any .θ1 , θ2 ∈ Θ, we have n−1 E
.
{E
Yi' (θ1 ) −
E
Yi' (θ2 )
}2
= E{Y1' (θ1 ) − Y1' (θ2 )}2 = E{Y1'' (η)}2 (θ1 − θ2 )2 ≤ E{g(X1 )}(θ1 − θ2 )2 .
This Lipschitz type continuity leads to tightness following from Theorem 12.3 of (Billingsley 2008, pp. 95). Hence, .
E { } sup n−1/2 Yi' (θ ) = Op (1).
θ∈Θ
In other words, we have .
|E ' | sup | Yi (θ )| = Op (n1/2 ).
θ∈Θ
(12.18)
12.2 A Modified Likelihood Ratio Test for m = 2
259
The same argument also applies to .Yi'' and .Yi''' . It is this result that makes (12.8) a sensible expansion. Additional Details A lot of details and justifications have been omitted so far. Some of them can be easily filled up, while others are not. One purpose of this book is to have all otherwise unexposed important details documented carefully. We start with the inequality R1n ≤ 2
n E
.
i=1
δi −
n E
δi2
i=1
+ (2/3)
n E
δi3
(12.19)
i=1
with the same δi = (πˆ − π ∗ )
n E
.
Δi + πˆ
i=1
ˆ 1 ) − f (xi ; θ ∗ ) ˆ 2 ) − f (xi ; θ ∗ ) f (xi ; G f (xi ; G 1 2 . + (1 − πˆ ) ∗ f (xi ; G∗ ) f (xi ; G ) (12.20)
Note that { } n ˆ 1 ) − f (xi ; θ ∗ ) E f (xi ; G 1 . f (xi ; G∗ ) i=1
} n { { E f (xi ; θ ) − f (xi ; θ1∗ ) ˆ 1 (θ ) dG = f (xi ; G∗ ) i=1
] n { [ E 1 1 ˆ1 (θ − θ1∗ )Yi' (θ1∗ ) + (θ − θ1∗ )2 Yi'' (θ1∗ ) + (θ − θ1∗ )3 Yi''' (η1 ) d G 2 6 i=1 } { n { n n E E E 1 1 ˆ1 ˆ 21 Yi'' (θ1∗ ) + Yi''' (η1 ) d G =m ˆ 11 Yi' (θ1∗ ) + m (θ − θ1∗ )3 2 6 =
i=1
i=1
i=1
where .η1 is between .θ and .θ1∗ . Since the value of .η1 depends on .θ , the integration cannot be factorized. ˆ 2 . We may hence write A similar expression holds for the term with .G n E .
δi =
i=1
n [ E
(πˆ − π ∗ )Δi + πˆ m ˆ 11 Yi' (θ1 ) + (1 − πˆ )m ˆ 21 Yi' (θ2 )
i=1
+ πˆ =
] m ˆ 12 m ˆ 22 Yi '' (θ1 ) + (1 − πˆ ) Yi '' (θ2 ) + ein 2 2
n E ˆ {bT i t + ein } i=1
= bT ˆt + en .
260
12 Modified Likelihood Ratio Test for Higher Order
The expressions of both .ein and .en can be easily figured out from the above derivation: { { 1 1 ∗ 3 ''' ˆ ˆ2 ) Y (η )d G + .ein = (θ − θ1 (θ − θ2∗ )3 Yi''' (η2 )d G 1 1 i 6 6 En and .en = i=1 ein . It is useful to notice that .η1 and .η2 are shared across .i = 1, 2, . . . , n and random. Other quantities, .bi , .b, and .t, were defined in (12.9) and its vicinity. E Based on C120.5, we concluded that . ni=1 Yi''' (η1 ) = Op (n1/2 ). Define { m ˆ 31 =
.
ˆ 1 (θ ). |θ − θ1∗ |3 d G
ˆ 1 . This is needed for order assessment, Unlike .m ˆ 11 , .m ˆ 31 is absolute moments of .G not for expansion itself. With this notation, we get
.
n ˆ 1 ) − f (xi ; θ ∗ ) E f (xi ; G i=1
=m ˆ 11
1
f (xi ; G∗ )
n E i=1
E 1 ˆ 21 Yi' (θ1∗ ) + m Yi'' (θ1∗ ) + m ˆ 31 Op (n1/2 ). 2 n
i=1
Subsequently, we also get en = { m ˆ 31 + m ˆ 32 }Op (n1/2 )
.
ˆ 1 are consistent for .θ ∗ as emphatically Due to the fact that support points of .G 1 established in Lemma 12.2, we have .m ˆ 31 = m ˆ 21 op (1) and similarly .m ˆ 32 = m ˆ 22 op (1). We further conclude en = { m ˆ 21 + m ˆ 22 }op (n1/2 ).
.
By Cauchy inequality, .n1/2 m ˆ 21 ≤ 1 + nm ˆ 221 . Hence, we find en = [1 + n{m ˆ 221 + m ˆ 222 }]op (1) = {1 + n||ˆt||2 }op (1).
.
Up to this point, we have suggested that .nm ˆ 221 = Op (1) in several occasions, but 2 cannot rule out the possibility that .nm ˆ 21 = op (1). Hence, the constant 1 in the above assessment cannot be omitted in a rigorous derivation. In summary, the additional detail we have E provided here is an order assessment of the error term .en in the expansion of .Ln = ni=1 δi . EnIn the2 next step, we work for a similar order assessment for approximating .Qn = i=1 δi . Along the same line of logic and with apparent notation, we have 2 Tˆ 2 ˆ δi2 = {bT i t + ein } = {bi t} + ein .
.
12.2 A Modified Likelihood Ratio Test for m = 2
261
E Much of the technicality for .Qn is to show that . ni=1 ein is negligible compared to the “leading term”: n n E E Tˆ 2 T ˆ ˆ ˆT ˆ . {bi t} = t { bi bT i }t = t Bt. i=1
i=1
Because of the i.i.d. structure in .bi , .n−1 B almost surely converges to the variancecovariance matrix of .b. The strong identifiability condition, C120.4, implies that the limit has full rank. Hence, we have ˆtT Bˆt ≥ nγ ||ˆt||2
.
for some non-random constant .γ > 0 almost surely. Thus, if a quantity is of higher order than .n||ˆt||2 , it is a .op (ˆtT Bˆt) quantity. By Cauchy inequality, [ n ]2 n n E E E Tˆ 2 2 ˆ . {bi t}ein ≤ {bT × ein . t } i i=1
i=1
i=1
E E 2 = o (1 + ˆtT Bˆt), so is . n {bT ˆt}e . We now aim at showing Thus, if . ni=1 ein p in i=1 i the former. We can further simplify the matter by focusing on one of two terms in .ein . It is seen that } }2 {E n {{ n E ˆ1 ≤ . g(Xi ) m ˆ 31 . (θ − θ1 )3 Yi''' (η1 )d G i=1
i=1
The justification is from C120.5, which bounds .Y ''' . By Lemma 12.2, all support ˆ 1 are consistent for .θ ∗ . Hence, we have .m ˆ 31 = m ˆ 21 op (1). By the law of points of .G 1 E large numbers, . ni=1 g(Xi ) = Op (n). Hence, we conclude that n {{ E .
(θ
ˆ1 − θ1 )3 Yi''' (η1 )G
}2 = nm ˆ 21 op (1).
i=1
This leads to n E .
2 ein = nm ˆ 221 op (1) = n||ˆt||2 op (1) = {ˆtT Bˆt}op (1)
i=1
as .m221 is one of summands in .||t||2 . Applying this result to .Qn , we have proved Qn = {ˆtT Bˆt}(1 + op (1)).
.
262
12 Modified Likelihood Ratio Test for Higher Order
The last term in the expansion of .R1n is the cubic .Cn = show that .Cn = ˆtT Bˆt(1 + op (1)). Let us write n E .
δi3 =
i=1
n E
En
3 i=1 δi .
We aim to
3 {bT i t + ein } .
i=1
Note that n E .
|bi ˆt|3 = O(n||ˆt||3 ) = op (ˆtT Bˆt)
i=1
and n E .
|ein |3 ≤ {
i=1
n E
g(Xi )}{m ˆ 331 + m ˆ 332 } = op (ˆtT Bˆt).
i=1
It is reminded here that .m ˆ 231 = m ˆ 221 op (1) = ||ˆt||2 op (1). Substituting these order assessment to (12.19), we get R1n ≤ 2bT ˆt − ˆtT Bˆt(1 + op (1)) + op (1).
.
In this part of the derivations, we assume nothing about the rates of convergence ˆ 1 and .G ˆ 2 . Thus, we do not know whether or not .ˆtT Bˆt × op (1) = op (1). This is on .G found true only as an after-fact. For this reason, the previous derivations may appear odd to have .op (1) appeared twice in the above upper bound. The Tightness of the Upper Bound The upper bound (12.15) has been used to T˜ get (12.16). Now, we further claim .Rn = supt2 {2b˜ T 2 − t2 B22 t2 } + op (1) in (12.17). Why is this true or is it true? ˆ maximizes The complexity of proving (12.17) is rooted in the fact that .G (m) (G) instead of .l (G). The resulting solution is generally a compromising .pl n between achieving higher likelihood and at the lower penalty. Such a compromise may not end up with a clean solution. Fortunately, this issue is solved if we set m large enough. When m is large enough, there is an .G ∈ Gm , which maximizes both .ln (G) and p(G) up to two .op (1) quantities under a restriction to be explained. ˆ = G ˆ ∗ + op (1) while .G ˆ ∈ Gm . For simplicity, The first restriction is that .G ∗ we look for a representation of .G in .Gm , which draws minimum penalty, or equivalently, maximizes .p(G). It is easy to see that this representation has the following form
.
12.2 A Modified Likelihood Ratio Test for m = 2
G∗∗ (θ ) = π ∗
k E 1
.
j =1
k
{θ1∗ } + (1 − π ∗ )
263 m−k E j =1
1 {θ ∗ } m−k 2
for some k between 1 and .m − 1. For given k, its corresponding penalty is given by p(G∗∗ ) = k log{π ∗ /k} + (m − k) log{(1 − π ∗ )/(m − k)}.
.
The penalty is maximized when k is proportional to .π ∗ . Let .m∗ ≥ 2/ min{π ∗ , 1 − π ∗ }, and .k ∗ be the value at which .p(G∗∗ ) is maximized. Then, we have ∗ ∗ ∗ .min{k , m − k } ≥ 2. Among all mixing distributions with .m∗ support points, there is one whose .ˆt T˜ attains the maximum of .supt2 {2b˜ T 2 − t2 B22 t2 }. Therefore, we have shown that T˜ Rn = sup{2b˜ T 2 − t2 B22 t2 } + op (1).
.
t2
Remark in .m∗ In applications, we first compute .πˆ ∗ and then set .m ˆ∗ = (m) ∗ ∗ 2/ min{πˆ , 1 − πˆ } or the nearest integer. We then maximize .pln (G) with .m = m ˆ ∗ . Under the null model, the .π ∗ is consistently estimated. Thus, the limiting distribution will not be affected. Suppose one blindly uses .m∗ = 4 under the alternative model. The resulting .Rn will then be stochastically smaller than one whose limiting distribution is given by Theorem 12.1. In theory, the resulting test has a size smaller than nominal. However, since .m∗ is derived at the worst case scenario, the loss of efficiency due to a lower type I error is minimum. In Chen et al. (2004), they conclude that one only needs to have m ˆ ∗ = [1.5/ min{πˆ ∗ , 1 − πˆ ∗ }]
.
to ensure the upper bound on .Rn is attained asymptotically. We feel that the paper erred in technical details. As discussed already, the choice of .m ˆ ∗ is not critical.
Chapter 13 EM -Test
for Homogeneity
13.1 Limitations of the Modified Likelihood Ratio Test The modified likelihood ratio test introduces a penalty that partially restores regularity to the likelihood function or to the finite mixture model. The partially restored regularity is utilized to construct tests for homogeneity and for other orders of the finite mixture model in general. These tests are advocated as having relatively simpler limiting distributions, easy to implement and therefore more accessible in applications. Often, we brush off the damaging implications due to conditions in various theoretical results. Yet these conditions have real implications. In essence, conditions spell out a territory within which certain results are applicable. Outside this territory, these results are unproven or ultimately invalid. In the context of modified likelihood ratio test, we soon note that the results in the previous chapter do not extend to the finite mixture of exponential distributions. Consider the finite mixture model .f (x; G) with an exponential component density function f (x; θ ) =
.
x 1 exp{− } θ θ
over the domain .x ≥ 0 and parameter space .Θ = (0, M] for some finite M. Consider further the model with order .m = 2 so that the mixing distribution can be written as G(θ ) = (1 − π ){θ1 } + π {θ2 }
.
with mixing proportion .π ∈ [0, 1]. The score function with respect to .π at .π = 0 under this model is
© Springer Nature Singapore Pte Ltd. 2023 J. Chen, Statistical Inference Under Mixture Models, ICSA Book Series in Statistics, https://doi.org/10.1007/978-981-99-6141-2_13
265
266
13 EM-Test for Homogeneity
S(G) =
.
| ∂ f (x; θ2 ) | log f (x; G)| − 1. = π =0 ∂π f (x; θ1 )
When the data distribution has .π = 0, then the Fisher information based on a single observation is given by ⎧ ⎨ (θ1 − θ2 )2 , 2 .I(G) = E{S (G)} = ⎩ θ2 (2θ1 − θ2 ) ∞,
θ2 < 2θ1 , θ2 ≥ 2θ1 .
Note that the simple limiting distribution of the modified likelihood ratio test as presented in Theorems 11.2 and 11.3 is obtained under conditions including ∗ .f (x; θ )/f (x; G ) has finite third moment at all .θ values in the subpopulation parameter space .Θ, where .G∗ is the true null mixing distribution. This places a serious restriction on the unknown relative size of two component parameters in the context of the finite mixture of exponential distributions. Another condition required by modified likelihood ratio test as in Theorem 11.2 is that .Θ is compact. In applications, the user can often confidently specify a compact .Θ that contains all component parameter values in the true mixing distribution. For instance, the mean height of human population is between 0 and 3 m, which is a compact space. The true value is certain not on the boundary. Hence, it appears that the result obtained under this condition is broadly applicable. While this argument is mathematically valid, we remark that the real damage of this condition lies in the accuracy of the limiting distribution as an approximation to the finite sample distribution. In some applications, one may need a rather large compact .Θ to ensure the inclusion of the true parameter value. When this is the case, the limiting distribution is likely a poor approximation to the finite sample distribution until the sample size is unrealistically large. An asymptotic result derived without requiring compact .Θ may suffer much less from such a pitfall. Fortunately, along the same line of thinking of the modified likelihood ratio test, a much relaxed EM-test is invented in the recent years (Chen et al. 2020, 2012; Niu et al. 2011; Li et al. 2009a; Li and Chen 2010; Chen and Li 2009).
13.2
EM -Test
for Homogeneity
Let .x1 , . . . , xn be an i.i.d. sample of size n from a two-component mixture model and let ln (π, θ1 , θ2 ) =
n E
.
i=1
log{(1 − π )f (Xi ; θ1 ) + πf (Xi ; θ2 )}
(13.1)
13.2 EM-Test for Homogeneity
267
be the ordinary log-likelihood function. We define again the penalized loglikelihood function l˜n (π, θ1 , θ2 ) = ln (π, θ1 , θ2 ) + p(π )
.
where .p(π ) is a penalty function on the mixing proportion .π . For each fixed .π = π0 ∈ (0, 0.5], for instance .0.5, let us first compute a penalized likelihood ratio statistic Mn (π0 ) = 2{l˜n (π0 , θ˜01 , θ˜02 ) − l˜n (0.5, θ˜0 , θ˜0 )}
(13.2)
.
with .θ˜01 and .θ˜02 being the maximizers of .l˜n (π0 , θ1 , θ2 ) and .θ˜0 being the maximizer of .l˜n (0.5, θ, θ ). It can be seen that this statistic has the null limiting distribution given by .(1/2)χ02 +(1/2)χ12 as for many test statistics in previous chapters. Its value is inflated when the true distribution of the data is not homogeneous but a member of the alternative. Hence, one may use .Mn (π0 ) for the purpose of homogeneity test. If the data are from an alternative model with the true mixing proportion .π ∗ different from the mixing proportion .π0 used above, the test based on (13.2) is apparently not efficient as the same test that has used .π0 = π ∗ . One remedy to this imperfectness is to introduce an EM-iteration to improve this simplistic test by updating the .π value. This operation leads to the name of the new test. The EM-algorithm is a tool to search for a better fitted distribution in terms of the likelihood value using an iterative scheme. Taking advantage of this property, the EM-test for homogeneity to be introduced measures how much .Mn (π ) increases after a few iterations from .Mn (π0 ). Furthermore, one may measure the increments from multiple initial values of .π0 . With multiple initial values, we need fewer iterations to have one of iteration outcomes close the true value of .θ when the data are from the alternative model. Using the maximum value of these .Mn (π0 )-values as a test statistic leads to a good test. We now formally define the EM-test statistic through a computational procedure. It starts with choosing a number of mixing proportions, .π1 , . . . , πJ , say, computing ∗ ˜n (0.5, θ, θ ), and letting .j = 1 and .k = 0. .θ˜ = argmaxθ l Step 1. Step 2.
(k)
Let .πj
= πj .
(k) (k) ˜ Compute .(θj(k) 1 , θj 2 ) = arg maxθ1 ,θ2 ln (πj , θ1 , θ2 ) and (k) (k) (k) Mn(k) (πj ) = 2{l˜n (πj , θj 1 , θj 2 ) − l˜n (0.5, θ˜0 , θ˜0 )}.
.
Step 3.
For .i = 1, . . . , n, compute the weights, which are the conditional expectations in the E-step. (k)
(k)
πj f (Xi ; θj 2 )
(k)
wij =
.
(k)
(k)
(k)
(k)
(1 − πj )f (Xi ; θj 1 ) + πj f (Xi ; θj 2 )
.
268
13 EM-Test for Homogeneity
Following the M-step, let (k+1)
πj
.
= arg max{(n − π
(k+1)
θj 1
= arg max{ θ1
θj(k+1) = arg max{ 2 θ2
n E
(k)
wij ) log(1 − π ) +
i=1
n E
(k)
wij log(π ) + p(π )},
i=1
n E
(k)
(1 − wij ) log f (Xi ; θ1 )},
i=1 n E
(k) wij log f (Xi ; θ2 )}.
i=1
Compute Mn(k+1) (πj ) = 2{l˜n (πj
(k+1)
.
Step 4. Step 5.
(k+1)
, θj 1
(k+1)
, θj 2
) − l˜n (0.5, θ˜0 , θ˜0 )}.
Let .k = k + 1 and repeat Step 3 for a fixed number of iterations in k. Let .j = j + 1, .k = 0 and go to Step 1, until .j = J . When .k = K for some pre-specified K, report (K)
. EM n
= max{Mn(K) (πj ), j = 1, . . . , J }.
(K)
We use .EMn as the test statistic. If we keep working on the EM-iteration until convergence, there is a good chance that one or all of .(πj , θj 1 , θj 2 )(k) will converge to the global maximum of the modified likelihood function. If so, the EM-test statistic becomes the modified likelihood ratio test discussed in previous chapters. We will see that by stopping at .k = K, the test has superior asymptotic properties. Under finite mixture models, the EM-iteration is slow at precisely locating the local maximum point of the likelihood function. This is partly caused by the flatness of the likelihood surface. Because the EM-test uses log-likelihood value rather than the location of the maximum point, employing a large number of iterations does not (k) lead to meaningful changes in .EMn for the purpose of hypothesis test. Therefore, by choosing a relatively small K, we gain in simple first-order asymptotic properties without suffering any noticeable power loss.
13.3 The Asymptotic Properties An immediate task of activating EM-test is to work out the distribution or more realistically the limiting distribution of .EM(K) under the null hypothesis. We give n the asymptotic results in this section. Here are the conditions on the mixture model and the penalty function. Some notations have been given in this book repeatedly, and they may not be re-introduced below.
13.3 The Asymptotic Properties
269
C130.1 (Wald’s Integrability Conditions) (i) .E| log f (X; θ ∗ )| < ∞; (ii) for sufficiently small .ρ and for sufficiently large r, the expected values .E log{1 + f (X; θ, ρ)} < ∞ for .θ ∈ Θ and .E log{1 + ϕ(X, r)} < ∞, where ' .f (x; θ, ρ) = sup|θ ' −θ|r f (x; θ ); (iii) .lim|θ|→∞ f (x; θ ) = 0 for all x except on a set with probability zero. C130.2 (Smoothness) The kernel function .f (x; θ ) has common support and is three times continuously differentiable with respect to .θ . The first two derivatives are denoted by .f ' (x; θ ) and .f '' (x; θ ). C130.3 (Identifiability) For any functions .G1 and .G2 with f two mixing distribution f two supporting points such that . f (x; θ )dG1 (θ ) = f (x; θ )dG2 (θ ), for all x, we must have .G1 = G2 . C130.4 (Uniform boundedness) Let Yi (θ ) =
.
f (Xi ; θ ) − f (Xi ; θ ∗ ) f ' (Xi ; θ ∗ ) ∗ ∗ , θ = / θ ; Y = Y (θ ) = i i (θ − θ ∗ )f (Xi ; θ ∗ ) f (Xi ; θ ∗ )
(13.3)
Yi (θ ) − Yi (θ ∗ ) f '' (Xi ; θ ∗ ) ∗ ∗ , θ = / θ . ; Z = Z (θ ) = i i (θ − θ ∗ ) 2f (Xi ; θ ∗ )
(13.4)
Zi (θ ) =
.
For some neighborhood .N(θ ∗ ) of .θ ∗ , there exists a .g(·) with finite expectation such that .|Yi (θ )|3 < g(Xi ), |Zi (θ )|3 < g(Xi ) and |Zi '' (θ )|2 < g(Xi ) for all .θ ∈ N (θ ∗ ). C130.5 (Positive Definiteness) The covariance matrix of .(Yi , Zi ) is positive definite. The upper bounded requirements in C130.4 are placed on .θ close to .θ ∗ at which the expectation is calculated. It is not restrictive because of this. For instance, the finite mixture of exponential distribution satisfies these conditions. The null limiting distribution of .EM(k) n is given as follows. Theorem 13.1 Assume the same conditions as in the above theorem, and that one of the .πj ’s is equal to 0.5. Under the null distribution .f (x; θ ∗ ), and for any fixed finite K, as .n → ∞, (K)
. EM n
→0.5χ02 + 0.5χ12 ,
in distribution. Because .EM(k) has the same limiting distribution as that of the modified n likelihood ratio statistic .Rn in Theorem 11.3, the test rejects the null hypothesis of homogeneity at level .π ∈ (0, 0.5] when (K)
. EM n
> c0
270
13 EM-Test for Homogeneity
where .c0 is the .(1 − α)th quantile of .0.5χ02 + 0.5χ12 . When we choose .α = 0.05 as in general practice, .c0 = 2.7055. We recommend .K = 3, .J = 3 and .{πj } = {0.1, 0.3, 0.5}. These values are good choices supported by common sense but are not supported by some optimality consideration. We do not think other choices can lead to demonstrable benefits. The choice of the penalty function and the tuning value C will be discussed later. (K) It is not a coincident that the limiting distribution of .EMn is the same as the modified likelihood ratio statistic .Rn in Theorem 11.3. The following key intermediate result shows that the fitted subpopulation parameter values are clustered in a small neighborhood of the true value. This leads to precise quadratic expansion of the likelihood function, just like case of modified likelihood ratio test. The rest of this section is devoted to an intuitive proof of various theoretical results. One can find more or complementary details in Li et al. (2009b). Lemma 13.1 Suppose that the subpopulation density function .f (x; θ ) satisfies Conditions C130.1–C130.5 and the penalty function .p(π ) is a continuous function such that .p(π ) → −∞ as .π → 0 and which attains its maximal value at .π = 0.5. Under the null distribution .f (x; θ ∗ ), we have, for .j = 1, . . . , J and any fixed finite k, πj − πj = op (1), θj 1 − θ ∗ = Op (n−1/4 ), θj 2 − θ ∗ = Op (n−1/4 ),
.
(k)
(k)
(k)
mj 1 = (1 − πj )(θj 1 − θ ∗ ) + πj (θj 2 − θ ∗ ) = Op (n−1/2 ).
.
(k)
(k)
(k)
(k)
(k)
Proof Consider the log-likelihood function .ln (π, θ1 , θ2 ) in (13.1) over the region (π, θ1 , θ2 ) ∈ [δ, 0.5] × Θ × Θ
.
for some .δ > 0. Under the null model, the true mixing distribution is a degenerate G∗ = {θ ∗ }. Under the assumptions on mixture model here, and following the same idea in the proof of the consistency of the nonparametric maximum likelihood estimator of G in Chap. 2, any mixing distribution in the form
.
¯ = π¯ {θ¯1 } + (1 − π¯ ){θ¯2 } G
.
satisfying ln (π¯ , θ¯1 , θ¯2 ) − ln (0.5, θ ∗ , θ ∗ ) > c > −∞
.
for a constant c not dependent on n is consistent for .G∗ . With .π¯ ∈ [δ, 0.5], the ¯ necessitates both .θ¯1 − θ ∗ = op (1) and .θ¯2 − θ ∗ = op (1). consistency of .G One may take note that if .π¯ is very close to zero, then the size of .θ¯1 becomes irrelevant. Because of this, condition .π¯ ≤ δ cannot be disposed or we do not have ∗ .θ¯1 − θ = op (1).
13.3 The Asymptotic Properties
271
Let m ¯ 1 = (1 − π)( ¯ θ¯1 − θ ∗ ) + π¯ (θ¯2 − θ ∗ );
.
m ¯ 2 = (1 − π)( ¯ θ¯1 − θ ∗ )2 + π¯ (θ¯2 − θ ∗ )2 . Introduce .Wi = Zi − βYi with .β = E(Y1 Z1 )/E(Y12 ), and .m ¯ =m ¯ 1 − βm ¯ 2. When .θ¯1 and .θ¯2 are in a small neighborhood of .θ ∗ , we have for each i, (1 − π¯. )[f (xi ; θ¯1 )/f (xi ; θ ∗ )] + π¯ [f (xi ; θ¯2 )/f (xi ; θ ∗ )] = 1 + Yi m ¯ 1 + Zi m ¯ 2 + ei = 1 + Y i (m ¯ 1 + βm ¯ 2 ) + (Zi − βYi )m ¯ 2 + ei = 1 + Yi m ¯ + Wi m ¯ 2 + ei for some remainder .ei . This leads to log{(1 . − π¯ )f (xi ; θ¯1 ) + πf ¯ (xi ; θ¯2 )} − log f (xi ; θ ∗ ) = log{1 + Yi m ¯ + Wi m ¯ 2 + ei } = {mY ¯ i +m ¯ 2 Wi } − (1/2){mY ¯ i +m ¯ 2 Wi }2 + e˜i with some new remainder term .e˜i . Summing over i, we get 2{pl. n (π¯ , θ¯1 , θ¯2 ) − pln (0.5, θ ∗ , θ ∗ )} < 2{ln (π¯ , θ¯1 , θ¯2 ) − ln (0.5, θ ∗ , θ ∗ )} 0. The upper bound is the largest possible value of the quadratic function. Again, this trick has been repeatedly used previously. The last .Op (1) claim holds because .Wi ’s are i.i.d. with mean 0 and finite variance and so do .Yi ’s. Together with the condition pln (π, ¯ θ¯1 , θ¯2 ) − pln (0.5, θ ∗ , θ ∗ ) > c > −∞,
.
272
13 EM-Test for Homogeneity
the inequality (13.5) implies that pln (π¯ , θ¯1 , θ¯2 ) − pln (0.5, θ ∗ , θ ∗ ) = Op (1).
.
It further implies that the term in the middle of (13.5) 2m ¯2
n E
.
Wi − m ¯ 22 (
i=1
n E
Wi2 ){1 + op (1)} = Op (1).
(13.6)
i=1
E E Since . ni=1 Wi = Op (n1/2 ) and .n−1 ni=1 Wi2 have positive limit by the law of large numbers, (13.6) holds only if .m ¯ 2 = Op (n−1/2 ). Since .δ < π¯ < 0.5 for some ∗ −1/4 ), .θ¯ − θ ∗ = O (n−1/4 ). .δ ∈ (0, 0.5], we further conclude that .θ¯1 − θ = Op (n 2 p −1/2 Similarly, we have .m ¯ = Op (n ) and therefore .m ¯ 1 = Op (n−1/2 ). u n The order conclusions in this lemma are not statistically meaningful by themselves. The likelihood ratio statistics are often the difference of the log likelihood values at two slightly different mixing distributions and can be approximated by some kind quadratic function. Precise-order assessments in this lemma allow a precision assessment of the remainder term in the quadratic approximation. In Chap. 8, we learned that the minimax optimal rate of convergence for estimating the mixing distribution is .n−1/6 when the order of the true mixture distribution is over-specified by 1. The apparent order of convergence in the above lemma is not minimax order. It is not our concern whether a mixture distribution of order 2 in a small neighborhood of the assumed null distribution is estimated with a rate as high as .n−1/4 . Now we show that, under the null model, the EM-iteration changes the fitted value of .π by at most .op (1). Let .(π¯ , θ¯1 , θ¯2 ) be some estimators of .(π, θ1 , θ2 ) with the same asymptotic properties as before, and let w¯ i =
.
π¯ f (Xi ; θ¯2 ) . (1 − π¯ )f (Xi ; θ¯1 ) + π¯ f (Xi ; θ¯2 )
We further define ( Rn (π ) = n −
n E
.
i=1
) w¯ i log(1 − π ) +
n E
w¯ i log(π )
i=1
and .Hn (π ) = Rn (π ) + p(π ). The EM-test updates .π by .π¯ ∗ = arg maxπ Qn (π ). Lemma 13.2 Suppose that the conditions of Theorem 13.1 hold and .π¯ − π0 = op (1) for some .π0 ∈ (0, 0.5]. Under the null distribution .f (x; θ ∗ ), we have .|π¯ ∗ − π0 | = op (1).
13.3 The Asymptotic Properties
273
Proof For .i = 1, . . . , n, let δ¯i = (1 − π¯ )
.
} { f (X ; θ¯ ) } { f (X ; θ¯ ) i 2 i 1 − 1 + π ¯ − 1 f (Xi ; θ ∗ ) f (Xi ; θ ∗ )
=m ¯ 1 Yi + (1 − π¯ )(θ¯1 − θ ∗ )2 Zi (θ¯1 ) + π¯ (θ¯2 − θ ∗ )2 Zi (θ¯2 ), where .Yi and .Zi (·) are defined in (13.3) and (13.4). Thus, max |δ¯i | < |m ¯ 1 | max |Yi | + m ¯ 2 max { sup |Zi (θ )|}.
.
1 0 is recommended: p(π ) = C log(1 − |1 − 2π |).
.
Since .
log(1 − |1 − 2π |) ≤ log(1 − |1 − 2π |2 ) = log{4π(1 − π )}
(13.7)
276
13 EM-Test for Homogeneity
with the same value of constant C, this penalty is more severe than the penalty function .C log{4π(1 − π )} introduced for the modified likelihood ratio test. The difference is relatively small when .π − 0.5 = 0, and large when .π − 0.5 deviates from 0. In addition, when .π = 0.5, .log(1−|1−2π |) = −|1−2π |. The penalty (13.7) therefore resembles a lasso-type penalty (Tibshirani 1996) that is, it is a continuous function for all .π , but not smooth at .π = 0.5. It has therefore similar properties to the lasso-type penalty for linear regression, the probability of the fitted value of .π being 0.5 is positive. This function retains the simplicity of Step 3 for updating .π values: ⎧ E { n w (k) + C } ⎪ E ⎪ i=1 ij (k) ⎪ ⎪ min , 0.5 , n−1 ni=1 wij < 0.5 ⎨ n+C (k+1) .π = . j En ⎪ (k) } { ⎪ w ⎪ E i=1 ij (k) ⎪ ⎩ max , 0.5 , n−1 ni=1 wij > 0.5 n+C There is also a natural generalization to the hypothesis testing problem with more than two components, which will be discussed in the next chapter. Another precision enhancing measure is motivated by the following observation. It is suggestive that .(1 − pn )χ02 + pn χ12 with .pn = Pr(EM(k) n > 0) may better approximate the finite-sample distribution than the asymptotic limit given above. The exact value .pn is hard to get, but a useful approximation is easy. As reviewed in Chap. 1, Section 1.6, for some parametric distribution family, a distribution in the family is completely determined by its mean. In this case, we can let parameter .θ be the mean of the distribution with density .f (x; θ ) and its variance may be denoted as σ 2 = σ 2 (θ ).
.
for some .g(·). Let .X1 , . . . , Xn be i.i.d. random variables having a mixture density f (x; G) for some mixing distribution G. Let
.
S = E[{X1 − E(X1 )}2 ] − σ 2 (E(X1 )).
.
The first term in S is the variance of .X1 , and the second term is the variance of .X1 if its distribution is .f (x; θ ) with .θ = E(X1 ). One may consistently estimate S by Sn =
.
n E (Xi − X¯ n )2 /n − σ 2 (X¯ n ). i=1
It is known that .S ≥ 0 for any G and .S = 0 when G degenerates. Therefore, if .Sn < (k) 0, the homogeneous model should not be rejected. The value .pn = Pr(EMn > 0) resembles .Pr{Sn > 0}. One may therefore use an approximate value of .Pr{Sn > 0} to approximate .pn .
13.4 Precision-Enhancing Measures
277
One may use Edgeworth expansion as in Hall (2013) to get an approximate value of .Pr{Sn > 0}. We quote this result here without further explanations. Due to notational conflict, we replace the mathematical constant .π by .3.1416 in this theorem. Theorem 13.2 Under the null hypothesis and assuming .E(X16 ) < ∞, we have pn = Pr{Sn > 0} = 0.5 + (2n × 3.1416)−1/2 (a − b/6) + op (n−1/2 ),
.
(13.8)
where } { Sn , a = lim n1/2 E √ n→∞ {VAR(Sn )} { S − E(S ) }3 n n b = lim n1/2 E √ . n→∞ {VAR(Sn )}
.
Furthermore, if .E(X110 ) < ∞, then the remainders term .op (n−1/2 ) in (13.8) is strengthened to .Op (n−3/2 ). For many commonly used distributions, one √ may work out a and b analytically. Li et al. (2009b) show the values of .(a − b/6)/ 2 in (13.8) are given by .−5/6 for 2 .N (θ, σ ), .−(5θ + 1)/(6θ ) for Poisson (.θ ), .−8/3 for Exponential (.θ ) and .−{θ (1 − 0 √ θ )(5m−11)+1}/[6θ (1−θ ) {m(m − 1)}] for Binomial (.m, θ ). When it depends on unknown parameter value .θ , the MLE of .θ under homogeneous model can be used in applications. Simulation shows that the Edgeworth approximations are accurate when the sample size .n = 100 and 200. They further lead to a good approximation precision to the null rejection probability. The penalty function .p(π ) clearly has (K) effects on the probability of .EMn = 0, but this is not reflected in the Edgeworth expansion. Simulations show that the expansion works well for the penalty in (13.7) and a range of C values. The last issue is the choice of C. In applications, a definitive recommended C is preferred. Hypothetically, one may attempt a high-order approximation to the null rejection probability for the EM-test. If the leading term is a function of C, n and the true parameter value of the null distribution, one should be able to choose a C so that the EM-test has close to nominal level null rejection probabilities in a wide variety of mixture models. Realistically, obtaining a high order approximation is not a pleasant task. Chen and Li (2011) instead use computer experiment to build an empirical function of the null rejection probability with C, n, and some characteristics of the subpopulation distribution. They find that a choice of .C = 0.54 provides a satisfactory type I error control in all cases of the EM-test for homogeneity.
Chapter 14 EM -Test
for Higher Order
14.1 Introduction This chapter discusses the EM-test for the null order .m0 ≥ 2. We restrict ourselves to finite mixtures whose subpopulation distribution is in a one-parameter family in this chapter. Much of the materials in this chapter are based on Li and Chen (2010). More specifically, this chapter presents the EM-test that is applicable to .H0 : m = m0 versus HA : m > m0 for any general order .m0 . The EM-test statistic has a null limiting distribution in the form of a mixture of .χ02 , .χ12 , .. . ., .χm2 0 distributions, where 2 .χ is a point mass at 0. The mixing proportions depend on the true null distribution. 0 In applications, we compute them under the fitted null distribution numerically.
14.2 The EM-Test Statistic We still consider the situation where we have a set of i.i.d. observation from some mixture distribution. The problem is to test the null hypothesis that its order is .m = m0 versus .m0 > m0 . We do not go over all notations again but mention a few such as the subpopulation density .f (x; θ ) has its parameter .θ ∈ Θ ∈ R. The mixing distribution is denoted as G(θ ) =
m E
.
πh {θh }
h=1
and the mixture density is .f (x; G). The space of all mixing distributions of order m is .Gm . The true null mixing distribution is given by
© Springer Nature Singapore Pte Ltd. 2023 J. Chen, Statistical Inference Under Mixture Models, ICSA Book Series in Statistics, https://doi.org/10.1007/978-981-99-6141-2_14
279
280
14 EM-Test for Higher Order
G0 =
m0 E
.
π0h {θ0h }
h=1
where the support points .θ01 < θ02 < · · · < θ0m0 are interior points of .Θ and the mixing proportions .π0h are non-zeroes. All expectations and probabilities are computed with respect to this null distribution. We denote .π 0 = (π01 , . . . , π0m0 )t and .θ 0 = (θ01 , . . . , θ0m0 )t , so that .π and .θ are bold faced. The log-likelihood function based on i.i.d. observations is given by ln (G) =
n E
.
log{f (xi ; G)}.
i=1
ˆ 0 be the maximum likelihood estimator (MLE) of G under the null hypothesis Let .G such that ˆ0 = G
m0 E
.
πˆ 0h {θˆ0h }
h=1
with .θˆ01 ≤ θˆ02 ≤ · · · ≤ θˆ0m0 . Apparently, this convention works when .θ is a scale. The EM-test to be introduced will pretend that the alternative mixture may have an order up to .2m0 . Any finite mixture distribution with .m0 < m < 2m0 can be expressed as a mixture of .m = 2m0 subpopulations, permitting some zero mixing proportions or identical subpopulation distributions. If the data are from a mixture with order .m > 2m0 , our EM-test to be developed is probably less efficient than it could be. We do not know if there are more rooms for further improvement. Even though the EM-test demands much less technical work, it is still rather formidable, which discourage us from perfection. If the true mixing distribution has an order lower than .m0 , we suspect that rejection probability will be asymptotically lower than nominal level. This is unproven, but it is not a serious concern in general. Like the approach presented in the last chapter, we construct the EM-test statistic ˆ 0 be the maximum likelihood estimator of G under the null via an algorithm. Let .G hypothesis and its support points be .θˆ0h , .h = 1, 2, . . . , m0 . Divide the parameter space .Θ into .m0 intervals with the cut-off points being .ηh = (θˆ0h + θˆ0 h+1 )/2, .h = 1, 2, . . . , m0 − 1. Namely, each interval contains one support point. Let .Ih = (ηh−1 , ηh ] denote these intervals. We interpret .η0 and .ηm0 as lower and upper bounds of .Θ. For example, when .Θ = (−∞, +∞), we put .η0 = −∞ and .ηm0 = +∞. For each vector .β = (β1 , . . . , βm0 )t such that .βh ∈ (0, 0.5], create a class of ˆ 0: mixing distributions of order .2m0 with the help of .G Ω2m0 (β) =
m0 {E
.
h=1
} [πh βh {θ1h } + πh (1 − βh ){θ2h }] : θ1h , θ2h ∈ Ih .
14.2 The EM-Test Statistic
281
ˆ 0 into two Mixture distributions in .Ω2m0 (β) split the hth subpopulation weight of .G in proportions of .βh and .1 − βh . They are then assigned to two parameter values .θ1h and .θ2h required to be inside interval .Ih . For each .G ∈ Ω2m0 (β), we let the modified log-likelihood function be pln (G) = ln (G) +
m0 E
.
p(βh ) = ln (G) + p(β)
h=1
with .p(β) as a continuous penalty function .p(β) such that it is maximized at .0.5 and goes to negative infinity as .β goes to 0 or 1. When .β is a vector, .p(β) represents the sum. Without loss of generality, we let .p(0.5) = 0. Here we let penalty depend on .βh but not on the original mixing proportion .πh . Next, define a finite set .B of J values from .(0, 0.5] such as .B = {0.1, 0.3, 0.5} with .J = 3. Denote its set product by .Bm0 , which has .J m0 vectors of .β. For instance, ⎧ ⎫ ⎨ (0.1, 0.1) (0.1, 0.3) (0.1, 0.5) ⎬ 2 .{0.1, 0.3, 0.5} = (0.3, 0.1) (0.3, 0.3) (0.3, 0.5) . ⎩ ⎭ (0.5, 0.1) (0.5, 0.3) (0.5, 0.5) For each .β 0 ∈ Bm0 , compute G(1) (β 0 ) = arg max{pln (G) : G ∈ Ω2m0 (β 0 )}
.
where the maximization is with respect to .π = (π1 , . . . , πm0 )t , .θ 1 = (θ11 , . . . , θ1m0 )t , and .θ 2 = (θ21 , . . . , θ2m0 )t . This step leads to a new estimator (1) (β ) achieving higher penalized log likelihood value in .Ω .G 2m0 (β 0 ). 0 Now we put .β (1) = β 0 . With .G(k) (β 0 ) obtained, for .i = 1, 2, . . . , n and .h = 1, . . . , m0 , following EM-algorithm principle to compute (k) (k)
(k)
(k) wi1h =
πh βh f (xi ; θ1h ) , f (xi ; G(k) (β 0 ))
(k) = wi2h
πh (1 − βh )f (xi ; θ2h ) . f (xi ; G(k) (β 0 ))
.
(k)
(k)
(k)
After these, we assemble .G(k+1) (β 0 ) by letting (k+1)
πh
.
= n−1
n E (k) (k) {wi1h + wi2h }, i=1
(k+1)
θj h
= arg max θ
n n E {E i=1 i=1
} (k) wij h log f (xi ; θ ) , j = 1, 2,
282
14 EM-Test for Higher Order
and βh(k+1) = arg max
n {E
.
β
(k) wi1h log(β) +
i=1
n E
} (k) wi2h log(1 − β) + p(β) .
i=1
Iterate it with a pre-specified number of times, K. Starting from each .β 0 ∈ Bm0 , we create (k)
. EM n
ˆ 0 )}. (β 0 ) = 2{pln (G(k) (β 0 )) − ln (G
Define the EM-test statistic with a pre-specified K by (K)
. EM n
m0 = max{EM(K) n (β 0 ) : β 0 ∈ B }. (K)
As usual, the test rejects .H0 : m = m0 when .EMn exceeds a threshold value so that the type I error is close to the nominal level of choice, asymptotically. This threshold value has to be chosen based on the distribution or the limiting distribution (K) of .EMn given that the data are from a null model/distribution. A key advantage of the EM-test is its relative ease of working out its limiting distribution under a null distribution.
14.3 The Limiting Distribution To quickly get the essence of the asymptotic results, we first forgone the precise conditions and model settings. Briefly, we have a set of i.i.d. observations from a mixture distribution .f (x; G) whose order m is subject to hypothesis test. The subpopulation density function .f (x; θ ) has one dimensional parameter .θ , and its distribution family is regular. The theorem below presents the limiting distribution of the EM-test statistic when the data distribution is .f (x; G0 ), which is a member of null hypothesis .H0 : m = m0 . The limiting distribution depends on .G0 through a matrix .B˜ 22 . Theorem 14.1 Assume some conditions and that .0.5 ∈ B. Let .w = (w1 , . . . , wm0 )t be a 0-mean multivariate normal random vector with some ˜ 22 determined by the null distribution .f (x; G0 ). Then under covariance matrix .B the null distribution and for any fixed finite K, as .n → ∞, (K) . EM n
t
t˜
→ sup(2v w − v B22 v) = v≥0
E 0 in distribution, for some .ah ≥ 0 and . m h=0 ah = 1.
m0 E h=0
ah χh2
(14.1)
14.3 The Limiting Distribution
283
The matrix .B˜ 22 is the same matrix occurred in presenting the limiting distribution of the modified likelihood ratio test in Chap. 12 when .m0 = 2. Suppose .B˜ 22 is diagonal, then the solution of maximization problem (14.1) is a sum of .m0 i.i.d. 2 2 .(0.5)χ + (0.5)χ distributed random variables. The EM -test is to test whether each 0 1 of the putative subpopulation fitted under the null hypothesis is actually itself a mixture with two subpopulations. From this angle, The EM-test statistic for higher order is the “sum” of .m0 EM-test statistics for homogeneity. The coefficients .ah are probability mass function of the binomial distribution. When .B˜ 22 is not diagonal, these .(0.5)χ02 + (0.5)χ12 distributed random variables are dependent. Hence, the limiting distribution is a general chi-square bar, as such a distribution is usually called. We next work on the values of .ah for .m0 = 1, 2, 3, which rely on the specifics of .B˜ 22 . The following quantities have been used in previous chapters. It helps to refresh the memory: f (xi ; θ0h ) − f (xi ; θ0m0 ) f ' (xi ; θ ) f '' (xi ; θ ) , Yi (θ ) = , Zi (θ ) = . f (xi ; G0 ) f (xi ; G0 ) 2f (xi ; G0 ) (14.2) Form two vectors: Δih =
.
( )t b1i = Δi1 , . . . , Δi m0 −1 , Yi (θ01 ), . . . , Yi (θ0m0 ) ,
.
( )t b2i = Zi (θ01 ), . . . , Zi (θ0m0 ) .
(14.3)
Create a matrix .B with its entries given by Bj k = COV(bj i , bki ). = E{bj i − E(bj i )}{bki − E(bki )}t }.
.
Regress .b2i against .b1i and denote the residual as .b˜ 2i = b2i − B21 B−1 11 b1i . Its −1 ˜ covariance matrix is our final product .B22 = B22 − B21 B11 B12 . ˜ 22 . In particular, The limiting distribution in Theorem 14.1 is determined by .B 1. When .m0 = 1, we have .a0 = a1 = 0.5. 2. When .m0 = 2, we have .a0 = (π − arccos ω12 )/(2π ), .a1 = 0.5, and .a0 + a2 = 0.5, where .ω12 is the correlation between .w1 and .w2 . 3. When .m0 = 3, we have .a0 + a2 = a1 + a3 = 0.5 and a0 = (2π − arccos ω12 − arccos ω13 − arccos ω23 )/(4π ),
.
a1 = (3π − arccos ω12:3 − arccos ω13:2 − arccos ω23:1 )/(4π ), where .ωij is the correlation between .wi and .wj and
284
14 EM-Test for Higher Order
(ωij − ωik ωj k ) ωij :k = / 2 )(1 − ω2 ) (1 − ωik jk
.
is the conditional correlation between .wi and .wj given .wk . Be aware that .π = 3.1416, the mathematical constant in the above expression. ˆ 0 under the null model. To implement EM-test for higher orders, we first obtain .G After which, we may use the above formulas to obtain .ah values when .m0 = 2, 3 ˆ 0 as the true mixing distribution. regarding .G For .m0 > 3, we instead generate data from the mixture distribution correspondˆ 0 ) and estimate .B˜ 22 . It is simple to show that ing .f (x; G ah = Pr
m0 (E
.
1(vˆl > 0) = h
)
l=1
where vˆ = arg sup(2vt w − vt B˜ 22 v).
.
v≥0
To implement this test, one may use Monte Carlo simulation to compute .ah values. Such an R-function (R Development Core Team, 2008) is built in mixtureInf (Li et al. 2022) package.
14.3.1 Outline of the Proof The proofs of the asymptotic results in the context of finite mixture model have a large chunk of brute force derivations. To retain the sanity as well as avoid mistakes, it is best to have a grand scheme in the mind. When the true data distribution .f (x; G0 ) is a finite mixture with null order .m0 , every part of .G0 is estimated with convergence rate .n−1/2 , the same as the rate for parameter estimation under regular models. When the order of the finite mixture is over specified, the rate is reduced to .n−1/4 or slower. The EM-test statistic is constructed in such a way that each subpopulation in .f (x; G0 ) is regarded as a finite mixture of order 2. Hence, both .θˆ1h and .θˆ2h stay in the .n−1/4 neighborhood of .θh in the first step of the construction. The EM -algorithm has them updated with the likelihood value always increased. This property ensures they stay in .n−1/4 neighborhood of .θ0h when the true distribution is a member of null hypothesis. The following theorem puts this in writing. With the intuition explained already plus a proof of similar result given in the last section, we do not give its proof. Theorem 14.2 Suppose that .f (x; θ ) and .p(β) satisfy certain usual conditions to be spelled out and suppose .G(k) (β) is obtained via the EM-test procedure given in
14.3 The Limiting Distribution
285
the last section. Under the null distribution .f (x; G0 ), and for each given .β 0 ∈ Bm0 , we have π (k) − π 0 = Op (n−1/2 ), β (k) − β 0 = Op (n−1/6 ),
.
θ 1 − θ 0 = Op (n−1/4 ), θ 2 − θ 0 = Op (n−1/4 ), (k)
(k)
and .m1 = Op (n−1/2 ), where .m1 = (m11 , . . . , m1m0 )t and for .h = 1, . . . , m0 , (k)
(k)
(k)
(k)
(k)
(k)
(k)
(k)
(k)
m1h = βh (θ1h − θ0h ) + (1 − βh )(θ2h − θ0h ).
.
Once we know the intermediate estimates are close to their target parameters, we have a way to approximate the difference between the log likelihood function values at the true distribution and the estimated distribution. Theorem 14.2 serves this purpose. From which, one establishes the next intermediate result. Theorem 14.3 Assume the same conditions as in Theorem 14.2 albeit loosely, and that .0.5 ∈ B. Under the null distribution .f (x; G0 ), and for any fixed finite K, as .n → ∞, (K)
. EM n
= sup{2vt v≥0
n E
b˜ 2i − nvt B˜ 22 v} + op (1).
i=1
The supremum over .v is taken over the range specified by {v ≥ 0} = {v = (v1 , . . . , vm0 )t : vj ≥ 0, j = 1, . . . , m0 }.
.
E In this theorem, one may notice that . ni=1 b˜ 2i is asymptotically normal. Hence, the limiting distribution is the one given earlier in Theorem 14.2. In the proof of this theorem, the following quantities emerged: (k) (k) (k) (k) 2 2 m(k) 2h = βh (θ1h − θ0h ) + (1 − βh )(θ2h − θ0h ) .
.
They are the second moment of the mixing distribution (k) (k) Gh = βh(k) {θ1h − θ0h } + (1 − βh(k) ){θ2h − θ0h }.
.
The EM-test statistic is closely connected to the maximum value of the log likelihood with respect to .Gh . Maximizing with respect to .Gh becomes maximization with respect to .m(k) 2h when the higher-order remainders are ignored. Because the second moments are nonnegative, the range of maximization is therefore .vj ≥ 0 for all j or simply .v ≥ 0. Clearly, we must ensure that ignoring the higher-order remainders does not change the limiting distribution. The assurance is established in Theorem 14.2.
286
14 EM-Test for Higher Order
One may rightfully question the condition on .β = 0.5 ∈ B in this theorem. This requirement is somewhat artificial. Without this condition, the penalty function .p(β) (K) never equals 0. Consequently, the first-order expansion of .EMn will have an extra term .min p(β) : β ∈ B. By requiring .β = 0.5 ∈ B, this nuisance term disappears.
14.3.2 Conditions Underlying the Limiting Distribution of the EM-Test For completeness, we include the following regularity conditions in this section. C140.0
The penalty .p(β) is a continuous function such that it is maximized at β = 0.5 and goes to negative infinity as .β goes to 0 or 1. The subpopulation density function .f (x; θ ) is such that the mixture distribution .f (x; G) satisfies Wald’s integrability conditions for consistency of the maximum likelihood estimate. See Chap. 2 for general consistency discussion. The subpopulation density function .f (x; θ ) has common support and is four times continuously differentiable with respect to .θ . For any two mixing distribution functions .G1 and .G2 such that
.
C140.1
C140.2 C140.3
f
f .
C140.4
f (x; θ )dG1 (θ ) =
f (x; θ )dG2 (θ ), for all x,
we have .G1 = G2 . Let .N(θ, e) = {θ ' ∈ Θ : |θ ' − θ | ≤ e} for some positive .e. There exist an integrable .g(·) and a small positive .e0 such that (k)
|Δih |3 ≤ g(Xi ), |Yi (θ )|3 ≤ g(Xi ), |Zi (θ )|3 ≤ g(Xi ),
.
(k)
C140.5
for .θ ∈ N(θ0h , e0 ), .h = 1, . . . , m0 , and .k = 0, 1, 2 with .Zi (θ ) being the kth derivative. See (14.2) for their definitions. The variance–covariance matrix .B of bi = (Δi1 , . . . , Δi m0 −1 , Yi (θ01 ), . . . , Yi (θ0m0 ), Zi (θ01 ), . . . , Zi (θ0m0 ))t ,
.
with its elements defined in (14.2), is positive definite. It is helpful to remark here that these conditions are satisfied under finite mixtures whose subpopulation distributions are exponential, normal with known mean, binomial, and Poisson. We are not aware of any one-parameter families that do not satisfy these conditions. The main restriction of the current result is that it does not permit multi-parameter subpopulation distributions. Condition C140.5 is a type of strong identifiability requirement in a neighborhood of the true mixing distribution. It does not fully overshadow the identifiability condition C140.3.
14.4 Tuning Parameters
287
14.4 Tuning Parameters Classical mathematical statistics problems often have exact answers. For instance, with i.i.d. observations, the sample mean is uniformly minimum variance unbiased estimator of the population mean under the normal model. This is not the case when we work with other problems. For instance, under linear regression models, the most popular LASSO approach for variable selection requires the user to choose a level of penalty, which is consequential on the number of variables chosen. The EM-test contains many tuning parameters. The first-order asymptotic results regarding EMtest are not altered with a large variety of choices of the penalty function of the set .B and the number of EM-iterations. At the same time, these choices lead to differential finite sample performances of the EM-test and it is difficult to decide which specific specification is “optimal.” Based on simulation studies, Li et al. (2009b) and Chen and Li (2009) recommend to have .B = {0.1, 0.3, 0.5} and the iteration number .K = 2 or 3. Since it does not make much difference, we simply recommend .K = 3 without any support of optimality consideration. We recommend penalty function .p(β) = C log(1−|1−2β|) for some C for its numerical convenience and its proven effectiveness. (K) Clearly, a larger C leads to a smaller .EMn value. If we use the same limiting distribution for p-value calculation and the same nominal level, a large C leads to smaller type I error. Conceptually, when all others are the same, one can find a perfect C value such that the type I error of the EM-test equals the nominal level on some kind of average. More practically, we can simulate data from various mixture distributions and build an empirical relationship between C and other factors: sample size, subpopulation distribution, the type I error, and so on. The empirical relation can then be employed to recommend a data adaptive C value. For Poisson mixture, Binomial mixture, and Normal mixture in location parameter models, we recommend the following empirical formula: ⎧ 0.5 exp(5 − 10.6ω − 123/n) 12 ⎪ m0 = 2; ⎪ ⎨ 1 + exp(5 − 10.6ω − 123/n) , 12 .C = ⎪ 0.5 exp(3.3 − 5.5ω12 − 5.5ω23 − 165/n) ⎪ ⎩ , m0 = 3. 1 + exp(3.3 − 5.5ω12 − 5.5ω23 − 165/n)
(14.4)
For exponential mixture models, we recommend the following empirical formula:
C=
.
⎧ ⎪ ⎪ ⎨
exp(2.3 − 8.5ω12 ) , 1 + exp(2.3 − 8.5ω12 )
⎪ ⎪ ⎩
exp(2.2 − 5.9ω12 − 5.9ω23 ) , m0 = 3 1 + exp(2.2 − 5.9ω12 − 5.9ω23 )
m0 = 2; (14.5)
288
14 EM-Test for Higher Order
Simulation results show that a single formula for normal, Poisson, and binomial mixtures works very well: the type I errors are close to nominal. Yet the same formula is not suitable for the mixture of exponential distributions. One may remember that the exponential mixture has infinite Fisher information, unlike other mixture models. This might be the reason why it requires different tuning values. Here is a simplified description of the method to obtain the empirical formula for the finite Poisson mixture model. 1. For each .m0 , choose a number of Poisson mixture distributions, a set of C values, and three sample sizes: 100, 200, and 500. 2. For each mixture, C value, and sample size combination, generate 1000 random samples and calculate the simulated type I error .pˆ of the size-5% test based on (3) . EM n . 3. Fit a linear model regarding y = log{p/(1 ˆ − p)} ˆ − log(0.05/0.95)
.
as response variable to covariates .1/n, .log{2C/(1 − 2C)}, and .ω12 for .m0 = 2, or .(ω12 + ω23 )/2 for .m0 = 3. 4. Let the fitted value .yˆ = 0 to get the formula in (14.4). Because the auxiliary linear model fits data very well, the constant C given by (14.4) results in an EM-test with .y = log{p/(1 ˆ − p)} ˆ − log(0.05/0.95) close to 0 for a wide range of models and sample sizes. We have not attempted an empirical formula for other mixtures or for .m0 ≥ 3. When .m0 > 3, we let .C = 0.5 in MixtureInf (Li et al. 2022). Users may conduct some simulation to confirm the appropriateness of this choice in their specific applications.
14.5 Data Examples We now include a data example presented in Li and Chen (2010). The data set consists of occupation times of hospital patients. Because of substantial patient heterogeneity, the occupancy times cannot be properly modeled by a single parametric distribution as as exponential as pointed out in Harrison (2001). It is therefore suggested that an exponential mixture distribution of order .m0 = 2 and .m0 = 3 may provide a good fit to the length-of-stay data in Harrison and Millard (1991a). The EM-test presented in this Chapter is an ideal approach to shed some light on this query by testing the null hypotheses .m0 = 2 and .m0 = 3. The length-of-stay data of Harrison and Millard (1991a) consist of observations for 469 geriatric patients in a psychiatric hospital in North East London in 1991. We first test the null hypothesis .m0 = 2. Under the null model, the MLE of the mixing distribution is given by G = 0.57{627.4} + 0.43{7812.9}.
.
14.5 Data Examples
289
If this fitted model is suitable, then the patients are made of two subpopulations whose average length of stays are 627.4 and 7812.9 days, respectively. To test the hypothesis of .m0 = 2, we compute the “correlation” between these two subpopulation .ω12 = 0.374, which determines the limiting distribution of the EMtest statistic. Applying the empirical formula (14.5), we arrive at the level of penalty (3) .C = 0.29. Let .K = 3, the value of the EM-test statistic is . EM n = 36.03, and the mixing proportion in its limiting distribution being .(a0 , a1 , a2 ) = (0.31, 0.50, 0.19). The p-value is found to be practically 0. Hence, EM-test strongly rejects the null order .m0 = 2. Is .m0 = 3 acceptable for this data set? We can use EM-test for this null hypothesis. The fitted null exponential mixture distribution has mixing distribution given by G = 0.11{50.1} + 0.50{895.0} + 0.39{8343.3}
.
suggesting a subpopulation made of 10% of the patients whose average length of stay is 50.1 days. Based on this null fit, we get .ω12 = 0.392 and .ω23 = 0.436 leading to .C = 0.06 by the empirical formula (14.5). The mixing proportions in the limiting distribution are estimated to be .(a0 , a1 , a2 , a3 ) = (0.20, 0.44, 0.30, 0.06). (3) The test statistic .EMn = 0.53 suggesting exponential mixture distributions of order 4 or higher will not provide a better fit. In summary, the EM-test of Li and Chen (2010) is likely the most effective tool to provide such a clear assessment on the order of the exponential mixture for this data set.
Chapter 15 EM -Test
for Univariate Finite Gaussian Mixture Models
15.1 Introduction This chapter pushes the EM-test one step further (Chen et al. 2012). The EM-test presented in the last chapter does not apply to univariate finite normal mixture models because its subpopulation distribution has two parameters. Curiously, the classical statistical inference conclusions are the neatest when the data are from a normal distribution. Yet we find the finite normal mixture models post most challenging technical problems. One may remember that the finite normal mixture is not strongly identifiable. Yet rather surprisingly, when the technical dust settles, we find that the EM-test has the most elegant limiting distributions when applied to the univariate finite normal mixtures for its order. Recall the finite normal mixture of with order m has a density function φ(x; G) =
m E
.
f πj f (x; θj , σj ) =
φ(x; θ, σ ) dG(θ, σ ),
(15.1)
j =1
E where .πj ’s are the mixing proportions with . m j =1 πj = 1, and the .θj ’s and .σj ’s are the mean and variance of the j th subpopulation. We write the mixing distribution G as m E .G = πj {(θj , σj )}, j =1
which is notationally simpler than the corresponding cumulative distribution function. Suppose we have a random sample .x1 , . . . , xn of size n from a finite univariate normal mixture distribution. This chapter presents the EM-test developed in Chen et al. (2012) for H0 : m = m0 versus HA : m > m0
.
© Springer Nature Singapore Pte Ltd. 2023 J. Chen, Statistical Inference Under Mixture Models, ICSA Book Series in Statistics, https://doi.org/10.1007/978-981-99-6141-2_15
(15.2) 291
292
15 EM-Test for Univariate Finite Gaussian Mixture Models
with some given positive integer .m0 . To be definitive, we regard the true order of a finite mixture model as the smallest number of components such that all component densities are different and all mixing proportions are nonzero in this chapter. We further restrict our attention to the most practical situation where all the subpopulation mean parameters are different. The testing problem where two normal mixture components have the same mean but different variances is less interesting in applications but more challenging technically. The asymptotic result in this chapter is applicable only to univariate finite normal mixture models with distinct component means.
15.2 The Construction of the EM-Test We have discussed the idea behind the EM-test in previous chapters. Here we put down the EM-test immediately. Given a random sample, the log-likelihood function of the mixing distribution is ln (G) =
n E
.
log φ(xi ; G).
i=1
This likelihood function diverges to positive infinity when some component variance goes to 0. See Chap. 4. The maximum likelihood estimator of G is known to be inconsistent. This technical difficulty can be resolved by putting a penalty En on the −1 small subpopulation variance the same way as before. Let .x ¯ = n i=1 xi and En 2 −1 2 be the sample mean and sample variance. Define .sn = n (x − x) ¯ i=1 i pln (G) = ln (G) +
m E
.
pn (σj2 ; sn2 )
j =1
for some .sn2 -dependent smooth penalty function .pn (σ 2 ; sn2 ) of .σ 2 , so that it goes to ˆ 0 be the maximum point negative infinity when .σ goes to either 0 or infinity. Let .G of .ln (G) under the null hypothesis. As indicated in Chap. 4, Chen et al. (2008) ˆ 0 is a consistent estimator of G under .H0 with some mild conditions on prove that .G .pn (· ; ·). One recommended choice is pn (σ 2 ; sn2 ) = −an {sn2 /σ 2 + log(σ 2 /sn2 ) − 1}.
.
A wide range of .an is acceptable if the consistency is the only concern. We recommend .an = n−1/2 based on our experience. ˆ 0 defined earlier. Without loss of Let .θˆ0h and .σˆ 0h be the constituent entries of .G ˆ ˆ ˆ generality, assume .θ01 ≤ θ02 ≤ · · · ≤ θ0m0 . The EM-test has four key ingredients.
15.2 The Construction of the EM-Test
293
The first ingredient is .m0 intervals defined as .Ij = (ηj −1 , ηj ], .j = 1, . . . , m0 , where .η0 = −∞, .ηm0 = ∞, and .ηj = (θˆ0j + θˆ0 j +1 )/2, .j = 1, ..., m0 − 1. They partition the space of the subpopulation mean. The second ingredient is a special class of mixing distributions of order .2m0 defined as Ω2m0 (β) =
.
m0 } {E {πj βj {θ1j , σ1j } + πj (1 − βj ){θ2j , σ2j }} : θ1j , θ2j ∈ Ij j =1
for some vector .β = (β1 , . . . , βm0 )T such that .βj ∈ (0, 0.5]. When .β is given, the mixing distributions in .Ω2m0 (β) have their subpopulation means confined into specific regions and the mixing proportions fixed. We match each subpopulation fitted under .H0 by two subpopulations as seen in the definition of .Ω2m0 (β). Conceptually, if the true order is .m > m0 , then at least one of the subpopulations fitted under .H0 is forced to match two or more true subpopulations of the data generating mixture. We do not know which of these subpopulations should be split, so we decide to overfit. The approach allows us to catch all of them. At the same time, it is quite possible that other arrangements may lead to a better test in some sense. The third ingredient is a modified penalized likelihood function defined on .Ω2m0 (β): pln (G) = ln (G) +
.
m0 m0 E E 2 2 2 2 {pn (σ1j ; σˆ 0j ) + pn (σ2j ; σˆ 0j )} + p(βj ), j =1
j =1
where .pn (σ 2 ; σˆ 2 ) was defined earlier, and .p(β) will be chosen as a unimodal continuous function that goes to .−∞ when .β goes to 0. We recommend .p(β) = 2 ) depends on .G ˆ 0 , and it log(1 − |1 − 2β|). Note that the penalty term .pn (σh2 ; σˆ 0h prevents .pln (G) from attaining its maximum value at mixing distributions with some .σh2 = 0. We should mention this .pln (G) is not the same .pln (G) given in the last section. By this, we aim to not overload the meaning of a notation and pray that others will not mind. 2 ) as a constant. Specific recommenAt this moment, we regard .an in .pn (σh2 ; σˆ 0h dations will be experimentally tuned so that the related statistical procedures have ˆ 0 is not tuned because it desirable properties. To clarify, the .an value used to get .G ˆ 0 already for the purpose of hypothesis test. leads to a sufficiently accurate .G The fourth ingredient is a finite set of numbers from .(0, 0.5], denoted by .B. We recommend .B = {0.1, 0.3, 0.5}. If .B contains J elements, then .Bm0 contains .J m0 vectors of .β. For each .β 0 ∈ Bm0 , we compute G(1) (β 0 ) = arg max{pln (G) : G ∈ Ω2m0 (β 0 )},
.
294
15 EM-Test for Univariate Finite Gaussian Mixture Models
where the maximization is with respect to .π = (π1 , . . . , πm0 )T , .θ 1 = (θ11 , . . . , θ1m0 )T , .θ 2 = (θ21 , . . . , θ2m0 )T , .σ 1 = (σ11 , . . . , σ1m0 )T , and .σ 2 = (σ21 , . . . , σ2m0 )T . The EM algorithm with multiple initial values should be used to search for .G(1) (β 0 ). Note that .G(1) (β 0 ) is a member of .Ω2m0 (β 0 ). When all four ingredients are ready, the EM-iteration leads to the EM-test as follows. Let .β (1) = β 0 . Suppose we have .G(k) (β 0 ) already calculated with .α (k) , (k) (k) (k) (k) (k) .θ available. The calculation for .k = 1 has been 1 , .θ 2 , .σ 1 , .σ 2 , and .β (1) (1) (1) (1) (1) illustrated with .π , .θ 1 , .θ 2 , .σ 1 , .σ 2 , and .β (1) being constituent entities of (1) (β ). A more appropriate notation might be .G 0 (1) (1) (1) (1) G(α (1) , θ (1) 1 , θ 2 , σ 1 , σ 2 , β ; β 0 ).
.
For simplicity, we use the compact notation .G(1) (β 0 ) and more generally, .G(k) (β 0 ). For each .i = 1, . . . , n and .h = 1, . . . , m0 , let (k) (k)
(k)
wi1h =
.
(k)
(k)
(k)
(k)
(k)
(k)
πh βh f (Xi ; θ1h , σ1h ) π (1 − βh )f (Xi ; θ2h , σ2h ) (k) and wi2h = h . (k) f (Xi ; G (β 0 )) f (Xi ; G(k) (β 0 ))
We then proceed to obtain .G(k+1) (β 0 ) by setting (k+1)
πh
.
= n−1
n E (k) (k) {wi1h + wi2h }, i=1
(k+1)
(θj h
.
(k+1)
, σj h
) = arg max
n {E
θ,σ
} (k) 2 wij h log f (Xi ; θ, σ ) + pn (σ 2 ; σˆ 0h ) , j = 1, 2,
i=1
and (k+1)
βh
.
= arg max
n {E
β
i=1
(k)
wi1h log(β) +
n E
} (k) wi2h log(1 − β) + p(β) .
i=1
The computation is iterated a pre-specified number of times, K. For each .β 0 ∈ Bm0 and k, we define ˆ 0 )}. Mn(k) (β 0 ) = 2{pln (G(k) (β 0 )) − ln (G
.
The re-tooled EM-test statistic, for a pre-specified K, is then defined to be (K)
. EM n
= max{Mn(K) (β 0 ) : β 0 ∈ Bm0 }. (K)
The EM-test rejects the null hypothesis when .EMn
(15.3)
exceeds some critical value.
15.3 Asymptotic Results
295
We now give the motivation behind the complex and lengthy definition. In the usual likelihood ratio test, the likelihood is maximized over the whole parameter space under both the null and full models. The degree of improvement in the loglikelihood value over the best possible distributions in the null model is the metric on the preference of the full model over the null model. The literature in the context of mixture model tells us that such a statistic has complex stochastic behavior even in simple situations (Dacunha-Castelle and Gassiat 1999; Liu and Shao 2004). Because of this, the first two ingredients of the EM-test confine the primary candidate alternative models to a relatively narrow region of mixing distributions denoted as .Ω2m0 (β). The optimal mixing distribution within .Ω2m0 (β 0 ) has simple asymptotic (1) properties when the null hypothesis is true for any fixed .β 0 . This leads to .Mn (β 0 ), which has a simple limiting distribution itself. In principle, we may directly use (1) .Mn (β 0 ) for testing the order of the finite normal mixture model. However, .Mn(1) (β 0 ) appears overly dependent on an arbitrary choice of .β 0 . To mitigate this potential drawback, we employ the fourth ingredient, .Bm0 , which contains a number of .β vectors that fill the space of the full model evenly. (1) Hence, .EMn measures the amount of improvement in log-likelihood over many representative distributions in the alternative hypothesis. Any specific distribution in the alternative hypothesis is likely reasonably approximated by one mixing distribution in .∪{Ω2m0 (β) : β ∈ Bm0 }. Therefore, we do not conduct a full scale search. The EM-iteration further expands the range of distributions in the alternative hypothesis being investigated. It checks the amount of possible further improvement in the log-likelihood after a few more iterations. With the help of the third ingredient, (1) the simple limiting distribution of .EMn is not lost by a finite number of iterations. Thus, we obtain a new test that is highly effective yet convenient to implement.
15.3 Asymptotic Results Working with univariate finite normal mixture, nearly all aspects of the model distribution is completely specified. Other than the null order .m0 , we do not place many additional conditions on the model except for the null distribution that has exactly .m0 distinct subpopulation means. Intuitively, if the null distribution has fewer than .m0 distinct subpopulation means, the test based on our asymptotic results has below nominal level of type I error asymptotically. This does not invalidate the test. (K) The asymptotic distribution of .EMn is, however, obtained with the careful choice of two penalty functions .p(β) and .pn (·; ·): C150.1
p(β) is a continuous function such that it is maximized at .β = 0.5 and goes to negative infinity as .β goes to 0 or 1. Furthermore, .p(0.5) = 0.
.
296
C150.2 C150.3 C150.4 C150.5
15 EM-Test for Univariate Finite Gaussian Mixture Models
For any given .σ22 > 0, .pn (σ12 ; σ22 ) is a smooth function of .σ12 and it is maximized at .σ12 = σ22 . Furthermore, .pn (σ22 ; σ22 ) = 0. For any given .σ12 > 0 and .σ22 > 0, .pn (σ12 ; σ22 ) = o(n). For any given .σ22 > 0, there exists a .c > 0 such that .pn (σ12 ; σ22 ) ≤ 4(log n)2 log(σ1 ), when .σ1 ≤ c/n and n is large. For any given .σ12 > 0 and .σ22 > 0, .pn; (σ12 ; σ22 ) = op (n1/4 ). Here 2 2 2 2 2 ; .pn (σ ; σ ) is the partial derivative of .pn (σ ; σ ) with respect to .σ . 1 2 1 2 1
Examples of functions satisfying the above mathematical conditions were given in an earlier chapter. Since the user has the freedom to choose the penalty functions, these conditions are not restrictive as long as such functions exist. The utility of p(β) is to restore some level of identifiability to finite mixture models. The penalty p_n(σ^2; σ̂^2) prevents fitted mixing distributions with degenerate component variances. Conditions C150.2–4 make Ĝ_0 a consistent estimator of the mixing distribution G as shown in Chen et al. (2008). Condition C150.5 allows a particularly simple limiting distribution to be presented below.
Theorem 15.1 Let EM_n^{(K)} be defined as in (15.3) based on a random sample of size n from the finite normal mixture model (15.1). Assume the penalty functions in the definition of EM_n^{(K)} satisfy C150.1–5, and the set B in the definition of EM_n^{(K)} contains the real number 0.5. Under the null hypothesis H_0 (15.2) that the order of the finite normal mixture model is m = m0, and for any fixed finite positive integer K,

EM_n^{(K)} → χ²_{2m0}

in distribution as the sample size n → ∞.

Utilizing such an asymptotic result in a hypothesis test is often taken as automatic. To make it explicit, we reject the corresponding H_0 and recommend the alternative order M > m0 when the size of EM_n^{(K)} exceeds the 100(1 − α)th percentile of χ²_{2m0}, where the nominal level is specified to be α ∈ (0, 1). With the limiting distribution χ²_{2m0} well documented, the final decision is easy to make.

We will not include a proof here. One can start working on the special case m0 = 1 and confirm that the limiting distribution is χ²_2. The normal mixture with m0 = 1 has two free parameters, and the normal mixture with m0 = 2 has five free parameters. Yet the two subpopulation means under m0 = 2 are not completely free from each other when the null hypothesis holds. Past experience indicates that we gain only 0.5 parameters for each additional mean parameter. The same can be said about the subpopulation variance: we gain another 0.5 for each additional variance parameter. Together with the extra mixing proportion parameter, the overall gain is 2 each time the order of the mixture is increased by 1. Hence, the chi-square limiting distribution gains 2 degrees of freedom. When each of the m0 subpopulations is matched by 2 subpopulations under the alternative model, the overall gain in the number of parameters is 2m0. The normal mixture model is so special that these gains are independent, which leads to the neat χ²_{2m0} limiting distribution for the EM-test statistic.
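The quality of this chi-square approximation can also be probed by simulation. The sketch below assumes the emtest.norm function of the R package MixtureInf introduced in the next sections; the number of replications, the sample size, and the Kolmogorov-Smirnov comparison are illustrative choices of ours, and the component name "EM-test Statistics" follows the printed output shown in Sect. 15.5.

# Simulate the EM-test statistic under the null model with m0 = 1
# (data from a single normal), and compare it with the chi-square(2) limit.
library(MixtureInf)
set.seed(123)
em.stats <- replicate(200, emtest.norm(rnorm(200), m0 = 1)[["EM-test Statistics"]])
ks.test(em.stats, "pchisq", df = 2)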
15.4 Choice of Various Influential Factors

The favorite question of many referees is what the optimal values of the various constants in a proposed approach are. Most often, our knowledge is limited to the asymptotic properties of the proposed approach, and a wide range of these constants lead to the same asymptotic conclusions. This is also the case for the EM-tests proposed in the past few chapters. We usually have only suboptimal answers to the optimality question. To apply the EM-test presented in the last section, we must specify the set B, the number of iterations K, and the penalty functions p(β) and p_n(σ^2; σ̂^2). Most of these have been given in previous sections. We put them together here and finalize a few additional issues. Based on our experience, we recommend choosing B = {0.1, 0.3, 0.5}, K = 3, and p(β) = log(1 − |1 − 2β|). For the penalty function p_n(σ^2; σ̂^2), we recommend

p_n(σ^2; σ̂^2) = −a_n {σ̂^2/σ^2 + log(σ^2/σ̂^2) − 1}.
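As a small illustration, the two recommended penalty functions can be coded directly; this is a sketch with our own function names (p.beta, p.var), not code from the MixtureInf package.

# Proportion penalty: maximized (= 0) at beta = 0.5 and -Inf at beta = 0 or 1.
p.beta <- function(beta) log(1 - abs(1 - 2 * beta))

# Variance penalty: maximized (= 0) at sigma2 = sigma2.hat and strongly
# negative when sigma2 is close to 0 relative to sigma2.hat.
p.var <- function(sigma2, sigma2.hat, an) {
  -an * (sigma2.hat / sigma2 + log(sigma2 / sigma2.hat) - 1)
}

p.beta(0.5)               # 0
p.var(1, 1, an = 0.3)     # 0
p.var(1e-4, 1, an = 0.3)  # a large negative penalty

Both functions cost nothing at the null fit and heavily penalize boundary or degenerate values, which is exactly the role assigned to them by Conditions C150.1–5.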
This function satisfies Conditions C150.2–5 as long as a_n = o_p(n^{1/4}). That is, any choice of a_n of this order does not change the first-order asymptotic conclusion. Yet choosing a large value of a_n is clearly not wise. Searching for an optimal a_n is likely not an easy research topic because it is difficult to come up with a workable criterion. As in the previous chapters, instead of regarding the choice of a_n as a nuisance task for publishing a paper, we use it to improve the precision of the EM-test. In essence, we choose a_n to improve how well the limiting distribution approximates the finite-sample distribution of EM_n^{(K)}. We have one suggested approach through computer experiments, but carry it out slightly differently from the previous one. Bartlett (1953) suggests that if one can inflate a statistic by a factor such that its expectation matches the expectation of the corresponding limiting distribution to a higher order, then the consequential confidence regions have coverage probabilities closer to their targets. Following this idea, we may choose the value of a_n such that

E{EM_n^{(K)}} = 2m0

when the observations are from a distribution in the null model. Because we do not know the true distribution of the sample, this task cannot be directly accomplished. Instead, we develop through computer experiments an empirical formula for a_n based on the sample size, the data, and the null hypothesis, so that E{EM_n^{(K)}} and 2m0 are close. Given a mixing distribution G_0 and a sample size, we simulate the value of E{EM_n^{(K)}} as a function of a_n and find the value â_n that solves E{EM_n^{(K)}} = 2m0. We regard â_n or a function of it as a dependent variable and (n, G_0) as explanatory variables. Through exploratory data analysis, we build a regression model between â_n and some covariates based on (n, G_0). Based on simulated data, we obtain an empirical formula in the form of
a_n = g(n, G_0)

with g(·) chosen based on our knowledge. In applications, we first compute Ĝ_0 and then choose a tuning parameter according to a_n = g(n, Ĝ_0) for the EM-test.

The general scheme for m0 = 2 is as follows. We choose many representative normal mixture distributions of order m0 = 2 and find a general range of suitable a_n values. We select a number of sample sizes n such as 100, 300, and 500. Larger sample sizes are not necessary because the matching between the limiting distribution and the finite-sample distribution is good for the a_n selected. We choose some combinations of mixing proportions and subpopulation parameter values to form a full factorial design, and generate 1000 repetitions from the distributions at each level combination to obtain â_n. When m0 = 3, the experiment space is much larger. The idea remains the same, but a lot more computation is needed. In short, we obtained two empirical formulas as follows. Under a two-component normal mixture model, an observation X from component two is misclassified into component one with probability

ω_{1|2} = Pr(π_1 f(X; θ_1, σ_1) > π_2 f(X; θ_2, σ_2)).
Similarly, let ω_{2|1} be the opposite misclassification rate, and let ω_{12} = (ω_{1|2} + ω_{2|1})/2 be the average misclassification rate. When m0 = 2,

a_n = 0.35 exp(−1.859 − 0.577 log{ω_{12}/(1 − ω_{12})} − 60.453/n) / [1 + exp(−1.859 − 0.577 log{ω_{12}/(1 − ω_{12})} − 60.453/n)].    (15.4)

When m0 = 3, with ω_{23} defined analogously between components two and three,

a_n = 0.35 exp(−1.602 − 0.240 log{ω_{12}ω_{23}/[(1 − ω_{12})(1 − ω_{23})]} − 130.394/n) / [1 + exp(−1.602 − 0.240 log{ω_{12}ω_{23}/[(1 − ω_{12})(1 − ω_{23})]} − 130.394/n)].
Both of them are found to be good choices via confirmatory simulation studies. While users may choose a different penalty, iteration number K, or tuning value a_n, this section has provided a complete set of default values. We have created a function emtest.norm in the R package MixtureInf (Li et al. 2022). We give an example in the next section.
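To see how the empirical formula translates into a number, here is a minimal R sketch; the function names, the Monte Carlo evaluation of the misclassification rates, and the plugged-in parameter values (taken, for illustration, from the null fit reported in the next section) are our own choices and not the internal implementation of MixtureInf.

# omega_{1|2}: probability that an observation drawn from component two
# is more likely under component one (Monte Carlo approximation).
misclass.rate <- function(pi1, mu1, sd1, pi2, mu2, sd2, B = 1e5) {
  x <- rnorm(B, mu2, sd2)
  mean(pi1 * dnorm(x, mu1, sd1) > pi2 * dnorm(x, mu2, sd2))
}

# Empirical formula (15.4) for the penalty level when m0 = 2.
an.m2 <- function(omega12, n) {
  eta <- -1.859 - 0.577 * log(omega12 / (1 - omega12)) - 60.453 / n
  0.35 * exp(eta) / (1 + exp(eta))
}

# Illustrative two-component fit: proportions 0.664/0.336, means 0.220/0.348,
# variances 0.003/0.012, and n = 190.
w12 <- 0.5 * (misclass.rate(0.664, 0.220, sqrt(0.003), 0.336, 0.348, sqrt(0.012)) +
              misclass.rate(0.336, 0.348, sqrt(0.012), 0.664, 0.220, sqrt(0.003)))
an.m2(w12, n = 190)

In MixtureInf, a penalty level of this kind is computed internally from the fitted null model and reported as the "Level of penalty" in the output shown in the next section.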
15.5 Data Example

Following Roeder (1994) and Chen et al. (2012), we apply the EM-test to the sodium-lithium countertransport (SLC) activity data in red blood cells. The genetic motivation of this study comes from the observation that SLC activity relates to blood pressure and the
prevalence of hypertension. Furthermore, SLC activity is easier to study than blood pressure. The data consist of red blood cell SLC activity measured on 190 individuals. If there is a major genetic factor behind the SLC activity, then a finite normal mixture model should fit the data well. There are in addition two competing genetic models, the simple dominance model and the additive model, corresponding to either a two-component or a three-component normal mixture model. One may therefore test the null hypothesis m0 = 2 against m0 > 2. For the purpose of numerical illustration, one may also choose to test m0 = 3 against m0 > 3. We can apply the EM-test to both problems easily with a function in MixtureInf (Li et al. 2022):

data(SLC)
emtest.norm(c(SLC[[1]]), m0 = 2)

$`MLE under null hypothesis (order = m0)`
           [,1]  [,2]
alpha.mix 0.664 0.336
mean.mix  0.220 0.348
var.mix   0.003 0.012

$`Parameter estimates under the order = 2m0`
      [,1]     [,2]  [,3]  [,4]
[1,] 0.258 0.258000 0.242 0.242
[2,] 0.253 0.180000 0.345 0.282
[3,] 0.001 0.000792 0.016 0.008

$`EM-test Statistics`
[1] 4.236

$`P-values`
[1] 0.375

$`Level of penalty`
[1] 1.000 0.069

Because this function chooses random initial values, the output can vary from one run to another. The differences are generally very small. From this output, we find the EM-test statistic to be 4.236 and the p-value of the test for m0 = 2 to be 0.375. Hence, we do not have evidence to reject this null hypothesis. The level of penalty is a_n = 0.069, evaluated based on the fitted null model as given. The parameter estimates obtained by K = 3 EM-iterations for order 2m0 = 4 are also given. Based on this analysis, we cannot reject the simple dominance model corresponding to m0 = 2. The same function can be used to test the hypothesis for m0 = 1. The p-value is about 10^{−7}. Similarly, to test for m0 = 3, the p-value of repeated applications can vary markedly. However, none of them will reject the null hypothesis. The reason
for the unstable performance is the difficulty of locating the global maximum of the likelihood function when m0 = 3. Yet our theory is built on the premise of successfully locating the MLE under the null model. Another reason is that our theory assumes the data are generated from a finite Gaussian/normal mixture model of order m0 or higher. Because the SLC data are best fitted by a finite Gaussian mixture of order 2, the distribution theory developed under the null model does not apply. Nevertheless, this does not invalidate the conclusion of the data analysis.
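For completeness, the two additional tests mentioned above use the same function; the commented p-values simply restate what is reported in the text.

emtest.norm(c(SLC[[1]]), m0 = 1)  # p-value about 1e-7: order 1 is firmly rejected
emtest.norm(c(SLC[[1]]), m0 = 3)  # p-values vary across runs; none reject m0 = 3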
Chapter 16
Order Selection of the Finite Mixture Models
16.1 Order Selection

In the context of finite mixtures, we first specify the distribution family for subpopulations. The data set we consider is a random sample from a distribution which is a mixture of at most M distinct subpopulation distributions from this family. The true distribution may contain M or fewer distinct subpopulations in a real-world problem. By order selection, we estimate the number of distinct subpopulations actually present in the true distribution. Many of the previous chapters are devoted to hypothesis test problems regarding the order of finite mixture models. A valid test rejects the null hypothesis with probability close to but no more than a chosen nominal level such as 5% whenever the true distribution is a member of the null hypothesis. This rule is often breached slightly, and we do not wish to discuss this issue in this book. An effective test, in our words, should reject the null hypothesis with a probability larger than the nominal level when the true distribution is a member of the alternative hypothesis, and the rejection probability should go to 1 when the sample size goes to infinity. Such a test is not always easy to come by in a strict sense because it is often too difficult to determine the exact null distribution of a test statistic in a numerically practical manner. The straightforward likelihood ratio statistics under finite mixture models, for instance, generally fail to have useful limiting distributions even though the conceptual likelihood ratio test is probably "effective." At the same time, we can easily construct a valid test by randomly rejecting the null hypothesis with probability equaling the nominal level regardless of the data. Yet such a valid test is ineffective. In statistics, we place a much less stringent requirement on order selection. By order selection in the context of finite mixture models, we merely give a point estimate of the number of distinct subpopulations. We usually do not require such an estimator to have a proper limiting distribution other than that it should converge to the true order in probability or almost surely. It is not technically challenging to
come up with many approaches that are consistent. In addition, there seem to exist no widely agreed optimality standards for order selection. We generally resort to simulation experiments to evaluate the performance of an order selection approach. Yet there are so many ways to evaluate performance that we are often at a loss when asked to recommend new approaches proposed in journal submissions.
16.2 Selection via Classical Information Criteria

Without a widely agreed optimality standard, order selection methods are often developed based on various motivations. Two classical information criteria immediately get our attention: the Akaike and Bayes information criteria (AIC and BIC). See Akaike (1973) and Schwarz (1978). Both criteria have information theoretic motivations, and their final mathematical forms are derived for models satisfying some regularity conditions. These information theoretic motivations do not apply to the problem of selecting the order of finite mixture models for the obvious reason: mixture models are not regular. We may nevertheless use their mathematical forms to select a mixing distribution G from the model space 𝒢 that minimizes

XIC(G) = −2 l_n(G) + a_n ||G||_0     (16.1)
with ||G||_0 being the number of parameters in G and a_n a positive constant. I choose X in XIC to make it notationally generic. Within this generic expression, l_n(G) is the log likelihood of G given n random observations, a_n = 2 for AIC, and a_n = log n for BIC. Strictly speaking, this criterion does not directly select the order but the most suitable mixing distribution G. It favors a mixing distribution G with a lower ||G||_0 value on top of having a high likelihood value. The order of the optimal G is regarded as the order of the eventual selection. In the context of finite mixture models, 𝒢 contains all finite mixture distributions, or equivalently their mixing distributions, with a pre-specified maximum of M subpopulations. The sub-models are those containing a given number of distinct subpopulations. The information theoretic motivations behind these two criteria, when applied to finite mixture models, do not lead to the mathematical form given in (16.1) due to the non-regularity of the finite mixture model. At the same time, Leroux (1992a) shows that under some general conditions, the AIC is not consistent and the BIC selects an order that is asymptotically no lower than the true order. The most cited consistency result for XIC-format selection methods is given in Keribin (2000). The paper, among others, shows that the BIC selects the order consistently under some conditions on the mixture model. The key conditions employed in the proof include:
1. The observations are i.i.d. from a finite mixture distribution whose order is below some known upper bound M.
2. The subpopulation parameter space Θ is compact.
3. The subpopulation parametric family is regular in the sense of Cramér (Shao 2003) but much more involved: the regularity conditions are imposed on derivatives of the log likelihood function up to order 5.
This paper is widely cited for establishing the consistency of order selection using BIC. Most citations pay little attention to the restrictive nature of the conditions that lead to this result. The i.i.d. assumption is clearly not a problem. Requiring a compact Θ is not ideal; the condition is added purely for technical reasons, to prevent the subpopulation parameter σ^2 from assuming a value arbitrarily close to 0 when the criterion is applied to finite Gaussian mixture models. The seemingly harmless regularity conditions can be difficult to grasp. The paper includes the univariate finite normal mixture and the Poisson mixture as examples. It is not trivial to verify these regularity conditions even for these two examples.
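Despite these caveats, the recipe in (16.1) is simple to use in practice. Below is a minimal sketch for univariate normal mixtures, assuming the mixtools package for the EM fitting; the function name xic.order, the parameter count 3m − 1 for an m-component normal mixture with unequal variances, and the default a_n = log n (i.e., BIC) are our own illustrative choices.

# Select the order by minimizing XIC(G) = -2 * loglik + a_n * ||G||_0.
library(mixtools)

xic.order <- function(x, M = 4, an = log(length(x))) {
  crit <- numeric(M)
  # One component: a single normal with 2 parameters.
  crit[1] <- -2 * sum(dnorm(x, mean(x), sd(x), log = TRUE)) + an * 2
  for (m in 2:M) {
    fit <- normalmixEM(x, k = m, maxit = 1000)
    crit[m] <- -2 * fit$loglik + an * (3 * m - 1)  # ||G||_0 = 3m - 1
  }
  which.min(crit)
}

set.seed(1)
x <- c(rnorm(150, 0, 1), rnorm(150, 3, 1.5))
xic.order(x, M = 4)  # typically selects 2 for data like these

Because the normal mixture likelihood is unbounded, the EM fit above only reaches a local maximum; this and related distortions of the likelihood are exactly the issues discussed in the next section.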
16.3 Variations of the Information Criterion

The famous AIC and BIC both fit into the general frame given by (16.1). Not only do they lose their initial justification, but they also bump into some other mathematical issues when applied to finite mixtures. We generally regard the likelihood as a metric of how well a candidate fits the data and ||G||_0 as a metric of model complexity. Hence, the two terms in XIC serve two purposes: one promotes good fit and the other discourages overfitting induced by model complexity. Particularly in the context of finite mixture models, the likelihood can be distorted unexpectedly. It is well known, for instance, that the likelihood function of the finite Gaussian mixture is unbounded. This extends to mixtures of location-scale families and Gamma distribution families. Furthermore, the number of independent parameters summarized by ||G||_0 does not quantify the model complexity properly. These considerations motivate the introduction of other approaches that still fall in this category.

To avoid the complex technical issues related to the likelihood function, Chen and Kalbfleisch (1996) use a minimum distance instead of the likelihood as a goodness-of-fit metric. They further suggest that the penalty function employed in the modified likelihood ratio test,

p_n(G) = c_n Σ_{j=1}^{m} log π_j,
can serve as a model complexity metric. This penalty becomes more severe when the order of the mixture model increases as well as when the mixing proportions are not even. Let .Fn be the empirical distribution based on a set of n i.i.d. observations from a mixture distribution, and .F (x; G) be a candidate mixture distribution. One may then define a proper distance .d(Fn , F (x; G)) on the space of all distributions. The order selection procedure is to find G that minimizes
D(F_n, F(x; G)) = d(F_n, F(x; G)) − c_n Σ_{j=1}^{m} log π_j
for some c_n > 0. It is apparent that a proper combination of the distance d(·, ·) and the level of penalty c_n leads to consistent order selection.

The mathematical form of the BIC in (16.1) has its level of penalty a_n = log n, which is obtained by approximating the posterior probability of a candidate model under a generic Bayesian framework. More concretely, suppose we wish to select a model from K proposed statistical models, M_1, . . . , M_K. Be aware that each of them is a distribution family and they are not necessarily disjoint. A Bayesian model selection approach necessarily places a positive prior probability on each of them. Each M_k may be a regular or non-regular parametric distribution family. If it is regular, it is justifiable to put a smooth and positive (conditional) prior density on its parameter space. Almost magically, the leading term of the log posterior probability of model M_k, when it contains the true distribution, is given by

BIC(M_k) = min_{w ∈ M_k} {−2 l_n(w) + d_k log n}     (16.2)
up to an additive constant. In this approximation, w represents an individual distribution in M_k, d_k is the number of parameters of M_k, and l_n(w) is the log likelihood function on M_k based on n i.i.d. observations from an unknown distribution. The approximation is up to an o_p(log n) term and a quantity of any order not dependent on M_k. Remember, the log likelihood function is defined up to an additive constant.

The approximation in (16.2) breaks down when M_k is not regular. The finite mixture model is not regular in general, so (16.2) is not well justified, though it provides a workable recipe. One may still insist on using the posterior probabilities of M_k: k = 1, . . . , K for model selection. If so, one needs a proper prior distribution on the space/model M_k for each k, and proper prior probabilities assigned to the M_k. After these, we must evaluate or approximate the resulting posterior probabilities of the M_k. The idea in Watanabe (2013) may be interpreted along this line. Choose a prior density function ϕ_k(w) on M_k. Let w be a parameter value that identifies a distribution in M_k. The probability of observing the i.i.d. data is, up to a multiplicative constant,

L_n(w) = ∏_{i=1}^{n} f_k(x_i; w).
We may regard L_n(w) as the density function of the data when the true distribution is f_k(x; w), a member of M_k. Let us imagine an experiment which first selects M_k with some (prior) probability, then generates w according to the prior density ϕ_k(w), and finally draws a random sample x_n of size n from f_k(x; w). Given that k was selected in the first step, the conditional density function of x_n is then given by

L_n(x_n; M_k) = ∫_{M_k} ∏_{i=1}^{n} f_k(x_i; w) ϕ_k(w) dμ_k(w)
where dμ_k(w) can be any proper measure on M_k. Suppose the prior probabilities are equal for all M_k under consideration. When regarded as a function of M_k, L_n(x_n; M_k) is proportional to the posterior probability of M_k. For regular M_k, −log L_n(x_n; M_k) ≈ BIC(M_k). The penalty term in the usual BIC, in the form of d log n, is the result of spreading the prior probability over the d-dimensional space. When M_k is not regular, Watanabe (2009, 2013) suggests that
− log Ln (xn ; Mk ) ≈ −ln (w0 ) + λk log n + Op (log log n)
for some λ_k > 0 and w_0 ∈ M_k. If the true population distribution is a member of M_k, then w_0 is the true parameter value. Otherwise,

w_0 = arg min{KL(f*(x), f(x; w)) : w ∈ M_k},
which identifies the distribution nearest to the true distribution f*(x) in terms of Kullback-Leibler divergence. Because this approximation depends on the true distribution when M_k is not regular, it cannot be directly used for model selection. Yet it motivates another version of the information criterion: the Widely Applicable Bayesian Information Criterion (WBIC), which (a) approximates log L_n(x_n; M_k) and (b) can be evaluated practically using a simulation procedure. Along this line of consideration, Drton and Plummer (2017) propose a singular Bayesian Information Criterion (sBIC) targeting particularly the order of finite mixture models. Let w_0 represent the true distribution and M_k, over a range of k, the candidate models. For many singular models, one can have a more accurate approximation in the form of
− log Ln (xn ; Mk ) ≈ −ln (w0 ) + λk (w0 ) log n + {mk (w0 ) − 1} log log n + Op (1) (16.3)
for some functions λ_k(·) and m_k(·) when the true distribution w_0 ∈ M_k; see Watanabe (2013). We cannot directly use this approximation for model/order selection because (a) it depends on the true distribution w_0 and (b) it is not trivial to find the values of λ_k(w_0) and m_k(w_0). The sBIC of Drton and Plummer (2017) gets around these issues as follows. The first manipulation is to make use of

l_n(w_0) = sup_{w ∈ M_k} l_n(w) + O_p(1).

Hence, we may replace l_n(w_0) by sup_{w ∈ M_k} l_n(w) in (16.3). The crucial next manipulation is more complex. In the order selection problem under finite mixture
models, we generally have M_1 ⊂ M_2 ⊂ · · · ⊂ M_K. For any fixed finite mixture distribution w, there exists a j such that w ∈ M_j but w ∉ M_{j−1}. For instance, a finite mixture of order exactly 3 is a degenerate finite mixture of order 4, but it is not a finite mixture of order 2. When w ∈ M_j but w ∉ M_{j−1},

λ_k(w) = λ_{kj},  m_k(w) = m_{kj}

for some constants λ_{kj} and m_{kj}. In other words, both λ_k(·) and m_k(·) are piecewise constant functions on M_k. If the true distribution w_0 ∈ M_j were known, then these claims would allow proper approximations to all terms in (16.3). Yet knowing w_0 ∈ M_j is impossible, or the model/order selection problem would not exist in the first place. One may try to use the posterior probabilities of w_0 ∈ M_j to construct a weighted average of log L_n(x_n; M_k). Yet if the posterior probabilities or accurate approximations to them were available, we would directly use them for order selection rather than develop an sBIC. Drton and Plummer (2017) get around this obstacle by motivating a system of equations for these posterior probabilities based on λ_{kj}, m_{kj}, and sup_{w ∈ M_k} l_n(w). Denoting the solution by L'(M_k), they set

sBIC(M_k) = log{L'(M_k)}.
One selects the model M_k that maximizes the value of sBIC. While sBIC gives another well-motivated approach to the order selection problem for finite mixture models, it remains challenging to implement it faithfully. We learn that it is consistent in terms of model selection under some conditions, and that it is more liberal than BIC when applied to the finite binomial mixture and the finite Gaussian mixture in the mean with a structural and known subpopulation variance. There is a lack of comprehensive information regarding whether the sBIC exhibits superior or inferior performance otherwise. In addition, values of λ_{kj} and m_{kj} are not readily available for most finite mixtures.
16.4 Shrinkage-Based Approach

The order selection problem is a more popular topic in the context of linear regression. It is, however, presented as a variable selection problem in that context: we decide which subset of predictors has non-zero regression coefficients. If one regards the number of active predictors in a sub-model as the order of the regression function, then the variable selection problem is also an order selection problem. One may directly use the algebraic forms of AIC and BIC to choose the set of active predictors for which the AIC or BIC is minimized. There is a rich literature in this respect. At least conceptually, we must evaluate AIC and BIC values for all subsets of the predictors when we apply these methods. This task quickly becomes numerically infeasible when the number of predictors increases. The nature of evaluating all
subsets also leads to instability: following the conventional wisdom in the literature, a small change in the data may lead to a dramatic change in the set of selected predictors (Breiman 1996). A completely new type of variable selection method, without these particular shortcomings of AIC and BIC, called the LASSO, was introduced by Tibshirani (1996). This approach was soon accompanied by many variations such as SCAD, the elastic net, and the adaptive LASSO; see Fan and Li (2001), Zou (2006), and Zou and Zhang (2009). Among their many merits, one common theme is that they carry out parameter estimation and variable selection simultaneously. In particular, the LASSO shrinks the least-squares (LSE) regression coefficient of a predictor toward 0 in absolute value. A predictor with a near-zero LSE coefficient may be shrunk to exactly zero when fitted by LASSO-type approaches. The shrinkage is induced by applying a non-smooth penalty on the size of the regression coefficients to the objective function (likelihood or sum of squares). The degree of shrinkage is determined in general by a scale parameter in the penalty function. By adjusting the degree of shrinkage, we obtain a set of active predictors of suitable size. This achieves the goal of order selection.

In the context of finite mixture models, when two subpopulations in a mixture distribution have the same parameter value, the actual order of the mixture distribution is one less than its apparent order. Suppose we have a random sample from a finite mixture distribution whose order is known to be at most M. Denote the potential subpopulation parameters as θ_j: j = 1, 2, . . . , M and let β_j = ||θ_j − θ_{j−1}||. When one fits a mixture distribution to the data according to some objective function, likelihood or sum of squares, with a non-smooth penalty on the size of β_j, some fitted β_j are reduced to zero. The effective order of the fitted finite mixture distribution is then reduced, and the goal of order selection is attained together with an estimate of the mixing distribution. This general idea has been explored by many researchers. An early example is Chen and Khalili (2009). They consider the problem when the subpopulation distribution family has the form {f(x; θ, σ): θ ∈ Θ, σ ∈ Ω}. A typical example is the univariate normal distribution family with mean θ and standard deviation σ. They consider the order selection problem when σ is a structural parameter: its value is the same for all subpopulations. Because they consider only a one-dimensional θ, one may, without loss of generality, assume θ_{j−1} ≤ θ_j and define η_j = θ_{j+1} − θ_j. Denote the log likelihood function by

l_n(G; σ) = Σ_{i=1}^{n} log f(x_i; G, σ)
where G = Σ_{j=1}^{M} π_j {θ_j} is the mixing distribution and σ is the structural parameter. They then pile a non-smooth penalty function p_n(·) and a regularization function onto the log likelihood and create

l̃_n(G; σ) = l_n(G; σ) + C_M Σ_{j=1}^{M} log π_j + Σ_{j=1}^{M−1} p_n(η_j)
for some C_M > 0. They then estimate (G, σ) by the mixing distribution Ĝ and structural parameter σ̂ that maximize l̃_n(G; σ). With proper choices of the tuning parameters involved, the order selection method is consistent.

Some explanations are needed for the specific choice of l̃_n(G; σ). Recall that a finite mixture model has two types of partial non-identifiability issues. A mixture distribution whose true order is smaller than M can be approximated arbitrarily precisely by a mixture whose (a) G has some of its π_j close to zero but all θ_j meaningfully different from each other, or (b) G has all π_j meaningfully away from 0 but some θ_j values practically equal. Chen and Khalili (2009) use two penalty functions that work as follows. The term C_M Σ_{j=1}^{M} log π_j discourages the fitting of any G with π_j close to 0. Hence, mixing distributions of form (a) should lead to low l̃_n(G; σ) values and drop out of the competition. Consequently, the mixing distributions with high l̃_n(G; σ) values are of form (b), whose θ_j values are clustered. The non-smooth penalty function p_n(η_j) then promotes some η̂_j = 0, and this leads to a mixture distribution of lower order.

After the overall scheme is developed, some secondary issues emerge. One must tune the levels of penalty in both penalty functions to sweet spots so that the order selection matches the truth as the sample size n → ∞. Chen and Khalili (2009) suggest a grid search based on five-fold cross-validation for the tuning parameter in p_n(η_j) and a conventional choice for C_M. Another issue is the choice of the non-smooth penalty function. It turns out that the absolute deviation employed in the LASSO is not suitable: the value of Σ_{j=1}^{M−1} |η_j| remains unchanged even if some η_j are reduced to exactly 0. Chen and Khalili (2009) show that SCAD or its like is effective for this purpose. Finally, Chen and Khalili (2009) use a modified EM-algorithm for the numerical solution. Not surprisingly, they find the estimator is consistent for estimating the order of the mixture distribution.
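To make the structure of l̃_n(G; σ) concrete, the sketch below merely evaluates the penalized log-likelihood at a candidate (π, θ, σ); it is not the modified EM-algorithm of Chen and Khalili (2009). The SCAD form (with its customary constant a = 3.7) and the sign convention p_n(η) = −SCAD_λ(η) are our own illustrative choices consistent with, but not copied from, their paper.

# SCAD penalty of Fan and Li (2001): non-smooth at 0 and bounded for large |t|.
scad <- function(t, lambda, a = 3.7) {
  t <- abs(t)
  ifelse(t <= lambda, lambda * t,
         ifelse(t <= a * lambda,
                (2 * a * lambda * t - t^2 - lambda^2) / (2 * (a - 1)),
                lambda^2 * (a + 1) / 2))
}

# Penalized log-likelihood for a univariate normal mixture with structural sigma.
pen.loglik <- function(x, pi, theta, sigma, CM = 1, lambda = 0.5) {
  o <- order(theta); pi <- pi[o]; theta <- theta[o]   # order the component means
  eta <- diff(theta)                                  # eta_j = theta_{j+1} - theta_j
  dens <- sapply(seq_along(theta), function(j) pi[j] * dnorm(x, theta[j], sigma))
  sum(log(rowSums(dens))) + CM * sum(log(pi)) - sum(scad(eta, lambda))
}

# Evaluate the objective at a candidate with M = 3 components.
set.seed(2)
x <- c(rnorm(100, 0, 1), rnorm(100, 2, 1))
pen.loglik(x, pi = rep(1/3, 3), theta = c(0, 1.9, 2.1), sigma = 1)

Maximizing this objective over (π, θ, σ), for instance with a modified EM-algorithm, shrinks small gaps η_j to exactly zero and thereby reduces the fitted order.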
When the subpopulation parameter is multidimensional, there is no natural way to queue the parameters in a linear fashion. One may instead seek to group and sort the subpopulation parameters; a penalty is then used to fuse those that are redundant. This line of approach is similar to Chen and Khalili (2009) in spirit, but one must overcome many more serious challenges. See Manole and Khalili (2021) for the latest development.
Chapter 17
A Few Key Probability Theory Results
17.1 Introduction

The materials presented in many of the previous chapters are heavily technical. We establish many of the theoretical conclusions with the help of classical probability theory results. While these conclusions are well known, many of us have difficulty remembering their exact details. In particular, a conclusion is often quoted from a paper or a book without specifying whether it is applicable on a particular occasion, and it is challenging to check that the required conditions are satisfied. This chapter collects some such conclusions cited in the previous chapters.
17.2 Borel-Cantelli Lemma

The Borel-Cantelli Lemma is nearly the only technical tool for establishing results with respect to "almost surely" at the lowest level. If an "almost sure" result is not proven using this lemma, then it is "almost surely" proved via a result proved using this lemma. Due to its importance, we include this lemma here.

A probability space consists of a sample space Ω, a probability measure Pr, and a σ-algebra 𝒜. The members of 𝒜 are subsets of Ω. The probability measure Pr assigns a value between 0 and 1 to every member of 𝒜. Let {A_n}_{n=1}^∞ be a sequence of events, namely members of 𝒜.

Lemma 17.1 (Borel-Cantelli Lemma) Let A = ∩_{n=1}^∞ ∪_{m=n}^∞ A_m. If Σ_{n=1}^∞ Pr(A_n) < ∞, then Pr(A) = 0.

In order for a sample point ω ∈ Ω to be an element of the A defined in the lemma, it has to be an element of infinitely many of the A_n, n = 1, 2, . . .. When the finite-summation condition is satisfied, the lemma claims that, practically, A is an empty set. That is, almost surely, A_n does not occur infinitely often. The proof of Lemma 4.2 directly
employs this lemma. The event A_n corresponds to the complement of

sup_{θ∈R} {F_n(θ + ε) − F_n(θ)} ≤ 2Mε + 10 n^{−1} log n,
where M is the largest value of the smooth density function .f (x) and .Fn (·) is the empirical distribution function created by n independent and identically distributed observations from .f (x). The conclusion in Lemma 4.2 itself is rather crude but sufficient for the particular purpose.
17.3 Random Variables and Stochastic Processes

A regular though simplistic definition of a random variable has been given in the first chapter of this book: a random variable is a measurable real-valued function on a probability space denoted as (Ω, B, Pr). When we study a number of random variables defined on the same space, the analysis becomes multivariate, and we are concerned with their joint stochastic behaviors. Let Z(t) for each given t be a random variable, where the index t runs over some index set T. We therefore end up with a collection of random variables {Z(t) : t ∈ T}. When T contains a finite number of elements, the collection reduces to a finite number of random variables, and no new technical issues arise. For generic T, let us temporarily call {Z(t) : t ∈ T} a process without the adjective "stochastic;" we reserve that terminology for the specific processes to be introduced. It is overly ambitious to investigate the properties of the most generic process, so we focus on a subset. The definition below carves out such a subset.

Definition 17.1 A process {Z(t) : t ∈ T} with T = [0, 1] is separable if there exists a countable set C = {t_j ∈ T, j = 1, 2, . . .} such that for any open interval (a, b) ⊂ R and any S ⊂ T, the difference between

{ω : a < Z(t; ω) ≤ b for all t ∈ S}

and

{ω : a < Z(t; ω) ≤ b for all t ∈ S ∩ C}

is a (subset of a) zero-probability event.

This definition is taken from Bickel and Doksum (2015, p. 21) or the classical Billingsley (2008). Following the example of the former, only a separable process is called a stochastic process. The extension from T to any well-shaped subset of Euclidean space, such as T = [a, b]^k for some a < b and positive integer k, is obvious. Such explicit extensions are somehow missing from the books of many experts.
As a stochastic process is required to be separable here, its stochastic properties are (almost surely) decided by a corresponding process on the reduced index space C. Note that in the definition, the "equality" of the two events is required for indexes confined to any given S ⊂ T. Requiring this property ensures that two processes are "equivalent" in sufficient detail. With separability, we may study the properties of stochastic processes through their simpler equivalents.

Another property used to reduce the complexity of the subject is the notion of "tightness." This notion appears in many varieties. Broadly speaking, it is a concept applied to probability measures defined on a sample space Ω equipped with a metric. That is, there is a distance defined on Ω, which induces a σ-algebra and a topological structure. In simple words, once a distance is available, a class of subsets of Ω are identified as open and a class of subsets as compact.

Definition 17.2 A collection of probability measures Π on a metric sample space Ω is tight if for any ε > 0, there exists a compact subset A ⊂ Ω such that

Pr(A) ≥ 1 − ε

for all Pr(·) ∈ Π.

Roughly, the sample space Ω is reduced to a compact A with minuscule loss if we are only concerned with probability measures in Π with the tightness property. From the opposite direction, when we work with probability measures or distributions in Π, we may regard the corresponding random variables as having values in a compact set A when the tightness condition is satisfied. In most applications, the subjects under investigation are random variables or stochastic processes. Suppose X_n: n = 1, 2, . . . is a sequence of random variables defined on a common sample space. Given each X_n, we may define a probability measure on its domain R by
Pr_n(B) = Pr(X_n ∈ B).
On the metric sample space R equipped with the Borel σ-algebra, we now have a sequence of probability measures induced by these random variables. When this collection of probability measures on {R, B} is tight, we say that {X_n}_{n=1}^∞ is tight. This notion is also generalized to stochastic processes, but one has to tie up many loose ends. Let {Z(t) : t ∈ T} be a stochastic process defined on a sample space Ω equipped with some σ-algebra. For each ω ∈ Ω, the values of Z(t) over t form a sample path

{Z(t; ω) : t ∈ T}.
In the situation where T is an interval of real values, say [0, 1], the space of all continuous functions on this unit interval is commonly denoted as C[0, 1]. The space of functions that have left limits and are right continuous at every point in [0, 1] is commonly denoted as D[0, 1]. Notice that C[0, 1] ⊂ D[0, 1]. For any function h ∈ D[0, 1], one may define an l_∞ norm
|h|_∞ = sup{|h(t)| : t ∈ [0, 1]}.
This norm induces a distance on D[0, 1] and hence collections of open, closed, and compact sets. In addition, it is possible to construct a σ-algebra, say ℋ, such that the map h: (Ω, F) → (D[0, 1], ℋ) is measurable. This σ-algebra remark is essential mathematically (Pollard 2012, p. 65), but it will be ignored here without much harm. For any set H ∈ ℋ, Z(t) induces a probability measure through

Pr_Z(H) = Pr{ω : Z(t; ω) ∈ H}.
We say that Z(t) is tight when Pr_Z(·) is a tight probability measure on {D[0, 1], ℋ}. With this, the tightness of a sequence of stochastic processes Z_n(t), n = 1, 2, . . . is defined in exactly the same way.

In summary, the concept of tightness is defined for probability measures on a metric sample space. This leads to many technical issues such as measurability and compactness, but they can be largely ignored with little harm by non-experts. The concept is applied to random objects through the corresponding induced probability measures on an appropriate metric sample space coupled with some σ-algebra. The ultimate purpose of introducing tightness is to characterize the convergence of a sequence of stochastic processes. It is well known that Z_n(t) → Z(t) in distribution for every t is not sufficient to connect their distributional behaviors. The classical example (Doob 1953) of this nature is that sup Z_n(t) does not necessarily converge to sup Z(t) in distribution even when Z_n(t) → Z(t) for any finitely many t. The concept of convergence in finite dimension itself is defined as follows.

Definition 17.3 A sequence of stochastic processes Z_n(t) converges to the process Z(t) on t ∈ T in finite dimension if for all choices of finitely many {t_1, . . . , t_k} ⊂ T, we have

{Z_n(t_1), . . . , Z_n(t_k)} → {Z(t_1), . . . , Z(t_k)}
in distribution. Unlike convergence in finite dimension, convergence in distribution for a sequence of random elements, such as stochastic processes, takes the required property sup Z_n(t) → sup Z(t) in distribution into direct consideration. The following definition is universal for convergence in distribution as its target is a generic "random element."

Definition 17.4 A sequence of random elements Z_n converges to the random element Z in distribution if for any continuous and bounded function h(·),

E{h(Z_n)} → E{h(Z)}.
The concept of convergence in distribution of stochastic processes has been crucial in the technical development of likelihood-based hypothesis test problems. The
claim that sup Z_n(t) → sup Z(t) in distribution is now part of the definition of Z_n(t) → Z(t) in distribution. Because of this, the above definition is not operational. The purpose of this chapter/section is to specify conditions under which a sequence of stochastic processes converges. We aim to provide specific, sufficient, and easy-to-verify conditions as needed in the previous chapters.

Theorem 17.1 Let Z_n(t): t ∈ [a, b] be a sequence of random processes on some sample space (Ω, F, P) with domain either C[a, b] or D[a, b], with corresponding appropriate σ-algebras. Suppose
(1) Z_n(t) converges to Z(t) in finite dimension;
(2) Z_n(t) is tight.
Then Z_n(t) → Z(t) in distribution.

This theorem combines Theorems 8.1 and 15.1 of Billingsley (2008). As far as we are concerned, Theorem 15.1 of Billingsley (2008) slightly weakens the finite-dimension convergence condition on D[a, b]. The weakened condition could be beneficial in special cases and useful to experts in probability theory. Our judgment is that the above version suffices in most statistical applications. We are finally at the very last step, a core theorem used in the previous chapters.

Theorem 17.2 A sequence of stochastic processes Z_n(t): t ∈ [a, b] with domain either C[a, b] or D[a, b] is tight if it satisfies these two conditions:
(1) The sequence Z_n(0) is tight.
(2) There exist constants γ > 0 and α > 1 and a nondecreasing, continuous function φ on [a, b] such that
Pr{|Z_n(t_2) − Z_n(t_1)| ≥ ε} ≤ ε^{−γ} |φ(t_2) − φ(t_1)|^α

holds for all t_1, t_2, and n and all positive ε.

This is Theorem 12.3 of Billingsley (2008) after some changes in notation. We boldly claim here that the result is applicable to stochastic processes when T is the generic [a, b]^k and the domain of Z_n(t) is the corresponding C(T) or D(T). We feel no additional issues should arise when T = [a, b]^k for any finite integer k. When the Chebyshev inequality is applicable, we would have
Pr{|Z_n(t_2) − Z_n(t_1)| ≥ ε} ≤ ε^{−2} E{Z_n(t_2) − Z_n(t_1)}^2.

Hence, a nearly sufficient condition for tightness is

E{Z_n(t_2) − Z_n(t_1)}^2 ≤ C|t_2 − t_1|^2

for some constant C not dependent on n. The tightness of Z_n(0) can often be taken for granted as it is implied by convergence in finite dimension. This is the result we have repeatedly cited in the previous chapters.
17.4 Uniform Strong Law of Large Numbers

A large class of the stochastic processes we encounter in statistics are sums of i.i.d. random variables indexed by some parameter. For each given parameter value, the (strong) law of large numbers is applicable. In many applications, we are interested in the uniformity of the large-sample properties in θ. The result of Rubin (1956) is recommended here.

Let X_1, X_2, . . . , X_n, . . . be a sequence of i.i.d. random variables taking values in an arbitrary space 𝒳. Let g(x; θ) be a measurable function in x for each θ ∈ Θ. Suppose further that Θ is a compact parameter space.

Theorem 17.3 Suppose there exists a function H(·) such that E{H(X)} < ∞ and that |g(x; θ)| ≤ H(x) for all θ. In addition, suppose there exist A_j, j = 1, 2, . . . such that

Pr(X_i ∈ ∪_{j=1}^∞ A_j) = 1
and g(x; θ) is equicontinuous in θ for x ∈ A_j for each j. Then, almost surely and uniformly in θ ∈ Θ,

n^{−1} Σ_{i=1}^{n} g(X_i; θ) → E{g(X_1; θ)},
and E{g(X_1; θ)} is a continuous function in θ.

Proof We may define B_k = ∪_{j=1}^{k} A_j for k = 1, 2, . . .. Note that B_k is monotonically increasing. The theorem condition implies that Pr(X ∈ B_k) → 1 as k → ∞, and therefore

H(X)1(X ∈ B_k^c) → 0

almost surely, where X is a random variable with the same distribution as X_1. By the dominated convergence theorem, the condition E{H(X)} < ∞ leads to

E{H(X)1(X ∈ B_k^c)} → 0
as k → ∞. We now take note of

sup_θ |n^{−1} Σ_{i=1}^{n} g(X_i; θ) − E{g(X; θ)}|
  ≤ sup_θ |n^{−1} Σ_{i=1}^{n} g(X_i; θ)1(X_i ∈ B_k) − E{g(X; θ)1(X ∈ B_k)}|
  + sup_θ |n^{−1} Σ_{i=1}^{n} g(X_i; θ)1(X_i ∉ B_k) − E{g(X; θ)1(X ∉ B_k)}|.
The second term is bounded by

n^{−1} Σ_{i=1}^{n} H(X_i)1(X_i ∉ B_k) + E{H(X)1(X ∈ B_k^c)} → 2E{H(X)1(X ∈ B_k^c)},
which is arbitrarily small almost surely when k is made large. Because H(X) dominates g(X; θ), these results show that the proof of the theorem can be carried out as if 𝒳 = B_k for some large enough k. In other words, we need only prove this theorem when g(x; θ) is simply equicontinuous over x. Under this condition, for any ε > 0, there exists a finite number of θ values, θ_1, θ_2, . . . , θ_m, such that

sup_{θ∈Θ} min_j |g(x; θ) − g(x; θ_j)| < ε/2.
This also implies

sup_{θ∈Θ} min_j |E{g(X; θ)} − E{g(X; θ_j)}| < ε/2.
Next, we easily observe that

sup_θ |n^{−1} Σ_{i=1}^{n} g(X_i; θ) − E{g(X; θ)}| ≤ max_{1≤j≤m} |n^{−1} Σ_{i=1}^{n} g(X_i; θ_j) − E{g(X; θ_j)}| + ε.
The first term goes to 0 almost surely by the conventional strong law of large numbers, and ε is an arbitrarily small positive number. The conclusion therefore holds. This result is utilized in proving the consistency of the MLE under the finite Gaussian mixture model with structural subpopulation variance in Chap. 4. ∎
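As a quick numerical illustration of Theorem 17.3 (not part of the formal development), take g(x; θ) = φ(x; θ, 1), the N(θ, 1) density evaluated at x, with X ~ N(0, 1), so that E{g(X; θ)} = φ(θ; 0, √2); the grid, sample sizes, and seed below are arbitrary choices.

# Supremum over a theta grid of |sample average of g(X_i; theta) - E g(X; theta)|.
set.seed(1)
theta.grid <- seq(-2, 2, by = 0.01)
sup.dev <- function(n) {
  x <- rnorm(n)
  max(sapply(theta.grid,
             function(th) abs(mean(dnorm(x, th, 1)) - dnorm(th, 0, sqrt(2)))))
}
sapply(c(100, 1000, 10000), sup.dev)  # the supremum deviation shrinks as n grows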
References
Akaike, H. 1973. Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory, eds. B.N. Petrox and F. Caski, 267–281. Alexandrovich, G. 2014. A note on the article? Inference for multivariate normal mixtures? by J. Chen and X. Tan. Journal of Multivariate Analysis 129: 245–248. Aragam, B., and R. Yang. 2023. Uniform consistency in nonparametric mixture models. The Annals of Statistics 51 (1): 362–390. Balakrishnan, S., M. J. Wainwright, B. Yu, et al. 2017. Statistical guarantees for the em algorithm: From population to sample-based analysis. The Annals of Statistics 45 (1): 77–120. Bartlett, M. 1953. Approximate confidence intervals. Biometrika 40 (1/2): 12–19. Batir, N. 2008. Inequalities for the gamma function. Archiv der Mathematik 91 (6): 554–563. Bechtel, Y. C., C. Bonaiti-Pellie, N. Poisson, J. Magnette, and P. R. Bechtel. 1993. A population and family study N-acetyltransferase using caffeine urinary metabolites. Clinical Pharmacology & Therapeutics 54: 134–141. Bickel, P. J., and H. Chernoff. 1993. Asymptotic distribution of the likelihood ratio statistic in a prototypical non regular problem. In Statistics and Probability: A Raghu Raj Bahadur Gestschriftt, eds. J. K. Ghosh, S. K. Mitra, K. R. Parthasarathy, and B. L. S. Prakasa Rao, 83–96. Hoboken: Wiley Eastern. Bickel, P.J., and K. A. Doksum. 2015. Mathematical Statistics: Basic Ideas and Selected Topics, vol. 2. Boca Raton: CRC Press. Billingsley, P. 2008. Probability and Measure. Hoboken: Wiley. Böhning, D. 1999. Computer-Assisted Analysis of Mixtures and Applications: Meta-Analysis, Disease Mapping and Others, vol. 81. Boca Raton: CRC Press. Breiman, L. 1996. Heuristics of instability and stabilization in model selection. The Annals of Statistics 24 (6): 2350–2383. Chen, J. 1995. Optimal rate of convergence for finite mixture models. The Annals of Statistics 23: 221–233. Chen, J. 1998. Penalized likelihood-ratio test for finite mixture models with multinomial observations. Canadian Journal of Statistics 26 (4): 583–599. Chen, J. 2017. Consistency of the MLE under mixture models. Statistical Science 32 (1): 47–63. Chen, J., and P. Cheng. 1995. The limit distribution of the restricted likelihood ratio statistic for finite mixture models. Northeast Mathematical Journal 11: 365–374. Chen, J., and P. Cheng. 1997. On testing the number of components in finite mixture models with known relevant component distributions. The Canadian Journal of Statistics/La Revue Canadienne de Statistique 25 (3): 389–400.
Chen, H., and J. Chen. 2001. The likelihood ratio test for homogeneity in finite mixture models. Canadian Journal of Statistics 29 (2): 201–215. Chen, H., and J. Chen. 2003. Tests for homogeneity in normal mixtures in the presence of a structural parameter. Statistica Sinica 13: 351–365. Chen, J., and J. D. Kalbfleisch. 1996. Penalized minimum-distance estimates in finite mixture models. Canadian Journal of Statistics 24 (2): 167–175. Chen, J., and A. Khalili. 2009. Order selection in finite mixture models with a nonsmooth penalty. Journal of the American Statistical Association 104 (485): 187–196. Chen, J., and P. Li. 2009. Hypothesis test for normal mixture models: The EM approach. The Annals of Statistics 37: 2523–2542. Chen, J., and P. Li. 2011. Tuning the em-test for finite mixture models. Canadian Journal of Statistics 39 (3): 389–404. Chen, J., and X. Tan. 2009. Inference for multivariate normal mixtures. Journal of Multivariate Analysis 100 (7): 1367–1383. Chen, H., J. Chen, and J. D. Kalbfleisch. 2001. A modified likelihood ratio test for homogeneity in finite mixture models. Journal of the Royal Statistical Society: Series B 63 (1): 19–29. Chen, H., J. Chen, and J. D. Kalbfleisch. 2004. Testing for a finite mixture model with two components. Journal of the Royal Statistical Society: Series B 66: 95–115. Chen, J., X. Tan, and R. Zhang. 2008. Inference for normal mixtures in mean and variance. Statistica Sinica 18: 443–465. Chen, J., P. Li, and Y. Fu. 2012. Inference on the order of a normal mixture. Journal of the American Statistical Association 107 (499): 1096–1105. Chen, J., P. Li, and Y. Liu. 2016. Sample-size calculation for tests for homogeneity. The Canadian Journal of Statistics 44 (1): 82–101. Chen, J., S. Li, and X. Tan. 2016. Consistency of the penalized MLE for two-parameter gamma mixture models. Science China Mathematics 59 (12): 2301–2318. Chen, J., P. Li, and G. Liu. 2020. Homogeneity testing under finite location-scale mixtures. Canadian Journal of Statistics 48 (4): 670–684. Chen, Y., Z. Lin, and H.-G. Muller. 2021. Wasserstein regression. Journal of the American Statistical Association 1–14. Chernoff, H. 1954. On the distribution of the likelihood ratio. The Annals of Mathematical Statistics 25 (3): 573–578. Chernoff, H., and E. Lander. 1995. Asymptotic distribution of the likelihood ratio test that a mixture of two binomials is a single binomial. Journal of Statistical Planning and Inference 43 (1): 19– 40. Chow, Y. S., and H. Teicher. 2003. Probability Theory: Independence, Interchangeability, Martingales. Cham: Springer Science & Business Media. Ciuperca, G., A. Ridolfi, and J. Idier. 2003. Penalized maximum likelihood estimator for normal mixtures. Scandinavian Journal of Statistics 30 (1): 45–59. Cramer, H. 1946. Mathematical Method in Statistics. Princeton: Princeton University Press. Dacunha-Castelle, D., and E. Gassiat. 1999. Testing the order of a model using locally conic parametrization: Population mixtures and stationary ARMA processes. The Annals of Statistics 27 (4): 1178–1209. Dean, C., and J. F. Lawless. 1989. Tests for detecting overdispersion in poisson regression models. Journal of the American Statistical Association 84 (406): 467–472. Dedecker, J., and F. Merlevede. 2017. Behavior of the wasserstein distance between the empirical and the marginal distributions of stationary .α-dependent sequences. Bernoulli 23 (3): 2083– 2127. Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. 
Journal of the Royal Statistical Society. Series B 39: 1–38. Doob, J. L. 1953. Stochastic Processes. Hoboken: Wiley. Drton, M., and M. Plummer. 2017. A bayesian information criterion for singular models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79 (2): 323–380.
Everitt, B. 2013. Finite Mixture Distributions. Monographs on Statistics and Applied Probability. Netherlands: Springer.
Fan, J. 1992. Deconvolution with supersmooth distributions. Canadian Journal of Statistics 20 (2): 155–169.
Fan, J., and R. Li. 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96 (456): 1348–1360.
Faraway, J. J. 1993. Distribution of the admixture test for the detection of linkage under heterogeneity. Genetic Epidemiology 10 (1): 75–83.
Feller, W. 1991. An Introduction to Probability Theory and Its Applications, Volume 2, vol. 81. Hoboken: Wiley.
Ghosh, J. K., and P. K. Sen. 1985. On the asymptotic performance of the log likelihood ratio statistic for the mixture model and related results. In Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer, eds. L. LeCam, and R. A. Olshen, vol. 2, 789–806. Belmont, CA: Wadsworth.
Hájek, J. 1972. Local asymptotic minimax and admissibility in estimation. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, 175–194.
Hall, P. 2013. The Bootstrap and Edgeworth Expansion. Cham: Springer Science & Business Media.
Harrison, G. W. 2001. Implications of mixed exponential occupancy distributions and patient flow models for health care planning. Health Care Management Science 4: 37–45.
Harrison, G. W., and P. H. Millard. 1991a. Balancing acute and long-stay care: The mathematics of throughput in departments of geriatric medicine. Methods of Information in Medicine 30: 221–228.
Harrison, G. W., and P. H. Millard. 1991b. Balancing acute and long-term care: The mathematics of throughput in departments of geriatric medicine. Methods of Information in Medicine 30 (3): 221–228.
Hartigan, J. A. 1985. A failure of likelihood asymptotics for normal mixtures. In Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer, eds. L. LeCam, and R. A. Olshen, vol. 2, 807–810. Belmont, CA: Wadsworth.
Hathaway, R. J. 1985. A constrained formulation of maximum-likelihood estimation for normal mixture distributions. The Annals of Statistics 13: 795–800.
He, M., and J. Chen. 2022a. Consistency of the MLE under a two-parameter gamma mixture model with a structural shape parameter. Metrika 85: 951–975.
He, M., and J. Chen. 2022b. Consistency of the MLE under two-parameter mixture models with a structural scale parameter. Advances in Data Analysis and Classification 16: 125–154.
Heinrich, P., and J. Kahn. 2018. Strong identifiability and optimal minimax rates for finite mixture estimation. The Annals of Statistics 46 (6A): 2844–2870.
Henna, J. 1985. On estimating of the number of constituents of a finite mixture of continuous distributions. Annals of the Institute of Statistical Mathematics 37: 235–240.
Ho, N., and X. Nguyen. 2016a. Convergence rates of parameter estimation for some weakly identifiable finite mixtures. The Annals of Statistics 44: 2726–2755.
Ho, N., and X. Nguyen. 2016b. On strong identifiability and convergence rates of parameter estimation in finite mixtures. Electronic Journal of Statistics 10: 271–307.
Ho, N., and X. Nguyen. 2019. Singularity structures and impacts on parameter estimation in finite mixtures of distributions. SIAM Journal on Mathematics of Data Science 1 (4): 730–758.
Ibragimov, I. A., and R. Z. Has'minskii. 2013. Statistical Estimation: Asymptotic Theory, vol. 16. Cham: Springer Science & Business Media.
Kalbfleisch, J. G. 1985. Probability and Statistical Inference: Probability, vol. 1. Cham: Springer Science & Business Media.
Keribin, C. 2000. Consistent estimation of the order of mixture models. Sankhyā: The Indian Journal of Statistics. Series A 62: 49–66.
Kiefer, J., and J. Wolfowitz. 1956. Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. The Annals of Mathematical Statistics 27: 887–906.
Lemdani, M., and O. Pons. 1995. Tests for genetic linkage and homogeneity. Biometrics 51: 1033–1041.
Leroux, B. G. 1992a. Consistent estimation of a mixing distribution. The Annals of Statistics 20 (3): 1350–1360.
Leroux, B. G. 1992b. Maximum-likelihood estimation for hidden Markov models. Stochastic Processes and Their Applications 40 (1): 127–143.
Lesperance, M. L., and J. D. Kalbfleisch. 1992. An algorithm for computing the nonparametric MLE of a mixing distribution. Journal of the American Statistical Association 87 (417): 120–126.
Li, P., and J. Chen. 2010. Testing the order of a finite mixture. Journal of the American Statistical Association 105 (491): 1084–1092.
Li, P., J. Chen, and P. Marriott. 2009a. Non-finite Fisher information and homogeneity: An EM approach. Biometrika 96 (2): 411–426.
Li, P., J. Chen, and P. Marriott. 2009b. Non-finite Fisher information and homogeneity: An EM approach. Biometrika 96: 411–426.
Li, X., C.-P. Chen, et al. 2007. Inequalities for the gamma function. Journal of Inequalities in Pure and Applied Mathematics 8 (1): 1–3.
Li, S., J. Chen, and P. Li. 2022. MixtureInf: Inference for finite mixture models. R package version 2.0. https://github.com/jhchen-stat-ubc-ca/Mixturelnf2.0
Lindsay, B., and K. Roeder. 1997. Moment-based oscillation properties of mixture models. The Annals of Statistics 25 (1): 378–386.
Lindsay, B. G. 1983a. The geometry of mixture likelihoods: A general theory. The Annals of Statistics 11 (1): 86–94.
Lindsay, B. G. 1983b. The geometry of mixture likelihoods, part II: The exponential family. The Annals of Statistics 11 (3): 783–792.
Lindsay, B. G. 1989. Moment matrices: Applications in mixtures. The Annals of Statistics 17 (2): 722–740.
Lindsay, B. G. 1995. Mixture models: Theory, geometry and applications. In NSF-CBMS Regional Conference Series in Probability and Statistics. Hayward, CA: Institute of Mathematical Statistics.
Liu, X., and Y. Shao. 2004. Asymptotics for the likelihood ratio test in a two-component normal mixture model. Journal of Statistical Planning and Inference 123 (1): 61–81.
Liu, X., C. Pasarica, and Y. Shao. 2003. Testing homogeneity in gamma mixture models. Scandinavian Journal of Statistics 30 (1): 227–239.
Manole, T., and A. Khalili. 2021. Estimating the number of components in finite mixture models via the group-sort-fuse procedure. The Annals of Statistics 49 (6): 3043–3069.
McLachlan, G., and D. Peel. 2000. Finite Mixture Models. Hoboken: Wiley.
McLachlan, G., and D. Peel. 2004. Finite Mixture Models. Hoboken: Wiley.
McLachlan, G. J., and T. Krishnan. 2007. The EM Algorithm and Extensions. Hoboken: Wiley.
Morris, C. N. 1982. Natural exponential families with quadratic variance functions. The Annals of Statistics 10: 65–80.
Neyman, J. 1959. Optimal asymptotic tests of composite hypotheses. In Probability and Statistics, 213–234. New York: Wiley.
Neyman, J., and E. Scott. 1965. On the use of C(α) optimal tests of composite hypotheses. Bulletin of the International Statistical Institute 41 (1): 477–497.
Nguyen, X. 2013. Convergence of latent mixing measures in finite and infinite mixture models. The Annals of Statistics 41: 370–400.
Niu, X., P. Li, and P. Zhang. 2011. Testing homogeneity in a multivariate mixture model. Canadian Journal of Statistics 39 (2): 218–238.
Ott, J. 1999. Analysis of Human Genetic Linkage. Baltimore: JHU Press.
Pearson, K. 1894. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London. A 185: 71–110.
Pfanzagl, J. 1988. Consistency of maximum likelihood estimators for certain nonparametric families, in particular: mixtures. Journal of Statistical Planning and Inference 19 (2): 137–158.
Phelps, R. R. 2001. Lectures on Choquet's Theorem. Berlin: Springer.
Pollard, D. 2012. Convergence of Stochastic Processes. Cham: Springer Science & Business Media.
Rao, C. R. 2005. Score test: Historical review and recent developments. In Advances in Ranking and Selection, Multiple Comparisons, and Reliability: Methodology and Applications, 3–20. Boston: Birkhäuser.
Redner, R. 1981. Note on the consistency of the maximum likelihood estimate for nonidentifiable distributions. The Annals of Statistics 9 (1): 225–228.
Risch, N., and D. Rao. 1989. Linkage detection tests under heterogeneity. Genetic Epidemiology 6 (4): 473–480.
Roeder, K. 1994. A graphical technique for determining the number of components in a mixture of normals. Journal of the American Statistical Association 89 (426): 487–495.
Rubin, H. 1956. Uniform convergence of random functions with applications to statistics. The Annals of Mathematical Statistics 27: 200–203.
Schwarz, G. 1978. Estimating the dimension of a model. The Annals of Statistics 6 (2): 461–464.
Serfling, R. J. 1980. Approximation Theorems of Mathematical Statistics. New York: Wiley.
Shao, J. 2003. Mathematical Statistics. Cham: Springer Science & Business Media.
Silvey, S. 2013. Optimal Design: An Introduction to the Theory for Parameter Estimation, vol. 1. Cham: Springer Science & Business Media.
Tanaka, K. 2009. Strong consistency of the maximum likelihood estimator for finite mixtures of location-scale distributions when penalty is imposed on the ratios of the scale parameters. Scandinavian Journal of Statistics 36 (1): 171–184.
Tanaka, K., and A. Takemura. 2005. Strong consistency of MLE for finite uniform mixture when the scale parameters are exponentially small. Annals of the Institute of Statistical Mathematics 57 (1): 1–19.
Tanaka, K., and A. Takemura. 2006. Strong consistency of the maximum likelihood estimator for finite mixtures of location-scale distributions when the scale parameters are exponentially small. Bernoulli 12: 1003–1017.
Tibshirani, R. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) 58 (1): 267–288.
Titterington, D. M., A. F. Smith, and U. E. Makov. 1985. Statistical Analysis of Finite Mixture Distributions. New York: Wiley.
van der Vaart, A. W. 2000. Asymptotic Statistics. New York: Cambridge University Press.
Villani, C. 2003. Topics in Optimal Transportation, vol. 58. Providence: American Mathematical Society.
Wald, A. 1949. Note on the consistency of the maximum likelihood estimate. The Annals of Mathematical Statistics 20 (4): 595–601.
Watanabe, S. 2009. Algebraic Geometry and Statistical Learning Theory, vol. 25. Cambridge: Cambridge University Press.
Watanabe, S. 2013. A widely applicable Bayesian information criterion. Journal of Machine Learning Research 14 (27): 867–897.
Wu, C. 1978. Some algorithmic aspects of the theory of optimal designs. The Annals of Statistics 6: 1286–1301.
Wu, C. J. 1983. On the convergence properties of the EM algorithm. The Annals of Statistics 11: 95–103.
Xu, L., and M. I. Jordan. 1996. On convergence properties of the EM algorithm for Gaussian mixtures. Neural Computation 8 (1): 129–151.
Zangwill, W. I. 1969. Nonlinear Programming: A Unified Approach, vol. 52. Englewood Cliffs, NJ: Prentice-Hall.
Zhang, C.-H. 1990. Fourier methods for estimating mixing densities and distributions. The Annals of Statistics 18 (2): 806–831.
Zhang, Q., and J. Chen. 2022. Distributed learning of finite Gaussian mixtures. Journal of Machine Learning Research 23 (99): 1–40.
Zou, H. 2006. The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101 (476): 1418–1429.
Zou, H., and H. H. Zhang. 2009. On the adaptive elastic-net with a diverging number of parameters. The Annals of Statistics 37 (4): 1733–1751.
Index
A
Akaike information criterion (AIC), 302, 303, 306, 307
B
Bayes information criterion (BIC), 302–307
Bernstein's inequality, 62, 63
Beta mixture, 15–16
Binomial mixture, viii, ix, 2, 5, 7, 8, 10–11, 37, 165–170, 179, 226, 231, 238, 257, 258, 306
Bonferroni's inequality, 62
Bootstrap method, 243
Borel-Cantelli Lemma, 62, 63, 148, 311–312
C
Caratheodory theorem, 103
Compactification, 23–27
Complete data likelihood, 127
Computer experiment, 277, 297
Consistency, viii, ix, 21–41, 43–50, 54, 58, 60, 65–80, 84, 87–95, 98–100, 141, 151, 187, 190, 202, 209, 214, 227, 234, 235, 238, 239, 246, 248, 258, 270, 286, 292, 302, 303, 317
Constrained MLE, 79, 84, 94, 95, 99, 215, 217
Convergence of EM-algorithm, 79, 133–139
C-alpha test, viii
D
Diagnostic method, 243
Directional derivative, 107–111, 116–123
E
EM-test, x, 291–300
Empirical formula, 287–289, 297, 298
Enhanced Jensen's inequality, 30, 35–40, 48
Expectation-Maximization (EM) algorithm, vii, ix, 66, 79, 98, 112, 125–139, 267, 274, 281, 284, 294, 308
F
Finite mixture, 3, 22, 43, 55, 83, 125, 141, 161, 180, 231, 243, 265, 279, 292, 301
G
Gamma mixture, viii, 14–15, 51, 83–100, 157
Gaussian mixture, vii, viii, 11–14, 156, 291–300, 303, 306, 317
Geometric properties, viii, 101–123
Global convergence theorem, viii, 113, 114, 116, 133–137
H
Hartigan's example, 162–165, 179, 231
Homogeneous, 5, 19, 109, 122, 161, 175, 176, 182, 187, 234, 267, 276, 277
I
Identifiability, viii, 6–16, 30, 41, 45, 47–49, 53, 142, 148, 151–157, 159, 160, 166, 180, 181, 187, 188, 191, 225, 232, 238, 257, 261, 269, 286, 296, 308
Infinite Fisher information, 288
Intra-simplex Direction Method (ISDM), 111–113, 117, 118, 122–123
J
Jensen's inequality, 28–30, 33, 35–41, 48, 70, 71, 74, 138, 165, 227
K
Kiefer's proof, 30–32, 34, 35
KW-distance, 46, 75, 214
L
Likelihood ratio test, vii, ix, x, 161, 169, 179–241, 243–263, 265–266, 268, 270, 276, 283, 295, 301, 303
Linkage analysis, 226, 230
Local alternative, 160, 173
Locally asymptotic normal, 146, 155
Loss of identifiability, 45, 157, 160, 166, 225
M
Minimax rate, 147, 151, 154–157, 160, 238
Minimum distance estimator, 142, 148, 151–156
Misclassification rate, 298
Missing identity, 5
Mixing distribution, 3, 22, 43, 53, 83, 101, 128, 141, 174, 180, 226, 244, 265, 279, 291, 302
Mixing probability, 67, 118, 130, 141, 234
Mixture density, 3, 4, 174, 279
Mixture distribution, vii, 2–8, 13, 17, 28, 34, 37, 55, 57, 67, 83, 122, 128, 141, 142, 151, 154, 211, 223, 226, 272, 279–282, 284, 287–289, 291, 298, 302, 303, 306–308
Mixture model, 1, 21, 43, 53, 83, 101, 125, 161, 180, 225, 243, 265, 284, 291, 301, 317
Modified likelihood, ix, x, 225–241, 243–263, 265–266, 268–270, 276, 283, 303
N
Natural exponential family, 175, 176
Negative binomial mixture, 9–11, 16–17, 35
Non-parametric MLE, viii, 21–41, 44, 54, 58, 101–123, 141, 270
Normal mixture, vii, 11–13, 15, 17–18, 46, 50, 53–81, 83, 95, 100, 106, 107, 113, 115, 130, 139, 141, 160, 164, 179, 200–213, 240, 287, 291, 292, 295, 296, 298–300, 303
O
Order of finite mixture, viii, ix, 3, 4, 55, 141, 148, 160, 200, 244, 284, 295, 296, 301, 302, 305, 307
Order selection, ix, 301–309
Over-dispersion, ix, 17–19, 131, 176
P
Penalized likelihood, 60, 65, 67, 74, 75, 77, 85, 88, 267, 293
Penalized MLE, 66, 76, 79, 87–95, 99
Pfanzagl's proof, 32–35, 41
Poisson mixture, 8–9, 16–17, 19, 35, 37–39, 101, 117–123, 131–133, 159, 175, 287, 288, 303
Precision-enhancing, 275–277
Q
Quadratic variance function, 175, 176
R
Rate of convergence, vii, viii, 53, 141–160, 223, 249, 258, 272, 284
Redner's consistency, 23, 41, 46–50
Redner's proof, 22
S
Separable stochastic process, 312, 313
Singular Bayesian Information Criterion (sBIC), 305, 306
Sodium-lithium countertransport (SLC) activity data, 298–300
Statistical model, viii, 1, 6, 170, 304
Strong identifiability, 148, 151–160, 180, 181, 187, 188, 191, 238, 257, 261, 286, 291
Structural parameter, viii, 12, 35, 45, 55, 98–100, 200, 202, 223, 240–241, 307, 308
Subpopulation, vii, viii, ix, 2–5, 7, 8, 16, 19, 21, 23, 48, 51, 54, 55, 57–60, 64, 65, 70, 74–80, 84, 87, 93–95, 98, 101, 125, 128–130, 138, 141–143, 148, 151, 152, 154, 157, 165, 176, 182, 183, 185, 186, 225–227, 234, 239–240, 266, 270, 277, 279–284, 286, 287, 289, 291–293, 295, 296, 298, 301–303, 306–309, 317
Super-efficiency, viii, 146
T
Test for higher order, 243–263, 279–289
Test of homogeneity, 160–241, 243, 246, 265–277
Tightness, 182, 188, 190, 199, 204–206, 239, 258, 262, 313–315
Transportation divergence, 153
Tuning parameter, 287–288, 298, 308
Tuning value, 270, 288, 298
U
Unbounded likelihood, viii, 46, 59–60, 83, 87, 115, 139, 163, 179
Uniform strong law of large numbers, 57, 192, 316–317
V
Vertex Direction Method (VDM), 110, 112, 113, 115, 118–123
Vertex Exchange Method (VEM), 110–113, 115–118, 120–123
W
Widely Applicable Bayesian Information Criterion (WBIC), 305