
Dirichlet and Related Distributions

Dirichlet and Related Distributions: Theory, Methods and Applications, First Edition. Kai Wang Ng, Guo-Liang Tian and Man-Lai Tang. © 2011 John Wiley & Sons, Ltd. Published 2011 by John Wiley & Sons, Ltd. ISBN: 978-0-470-68819-9

WILEY SERIES IN PROBABILITY AND STATISTICS
Established by WALTER A. SHEWHART and SAMUEL S. WILKS
Editors: David J. Balding, Noel A.C. Cressie, Garrett M. Fitzmaurice, Harvey Goldstein, Geert Molenberghs, David W. Scott, Adrian F.M. Smith, Ruey S. Tsay, Sanford Weisberg
Editors Emeriti: Vic Barnett, Ralph A. Bradley, J. Stuart Hunter, J.B. Kadane, David G. Kendall, Jozef L. Teugels
A complete list of the titles in this series can be found on http://www.wiley.com/WileyCDA/Section/id-300611.html.

Dirichlet and Related Distributions Theory, Methods and Applications

Kai Wang Ng • Guo-Liang Tian Department of Statistics and Actuarial Science The University of Hong Kong, Hong Kong

Man-Lai Tang Department of Mathematics, Hong Kong Baptist University Kowloon Tong, Hong Kong

This edition first published 2011 © 2011 John Wiley & Sons, Ltd Registered office John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com. The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher. Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books. Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book. This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought. Library of Congress Cataloging-in-Publication Data Ng, Kai Wang, author. Dirichlet and Related Distributions : Theory, Methods and Applications / Kai Wang Ng, Guo-Liang Tian, Man-Lai Tang. p. cm. – (Wiley Series in Probability and Statistics) Includes bibliographical references and index. ISBN 978-0-470-68819-9 (hardback) 1. Distribution (Probability theory) 2. Dirichlet problem. I. Tian, Guo-Liang, author. II. Tang, Man-Lai, author. III. Title. QA276.7.N53 2011 519.2’4–dc22 2010053428 A catalogue record for this book is available from the British Library. Print ISBN: 978-0-470-68819-9 ePDF ISBN: 978-1-119-99586-9 oBook ISBN: 978-1-119-99578-4 ePub ISBN: : 978-1-119-99841-9 Mobi ISBN: 978-1-119-99842-6 Typeset in 10.25/12pt, Times Roman, by Thomson Digital, Noida, India

To May, Jeanne, Jason and his family To Yanli, Margaret and Adam To Daisy and Beatrice

Contents

Preface
Acknowledgments
List of abbreviations
List of symbols
List of figures
List of tables

1 Introduction
  1.1 Motivating examples
  1.2 Stochastic representation and the =d operator
    1.2.1 Definition of stochastic representation
    1.2.2 More properties on the =d operator
  1.3 Beta and inverted beta distributions
  1.4 Some useful identities and integral formulae
    1.4.1 Partial-fraction expansion
    1.4.2 Cambanis–Keener–Simons integral formulae
    1.4.3 Hermite–Genocchi integral formula
  1.5 The Newton–Raphson algorithm
  1.6 Likelihood in missing-data problems
    1.6.1 Missing-data mechanism
    1.6.2 The expectation–maximization (EM) algorithm
    1.6.3 The expectation/conditional maximization (ECM) algorithm
    1.6.4 The EM gradient algorithm
  1.7 Bayesian MDPs and inversion of Bayes' formula
    1.7.1 The data augmentation (DA) algorithm
    1.7.2 True nature of Bayesian MDP: inversion of Bayes' formula
    1.7.3 Explicit solution to the DA integral equation
    1.7.4 Sampling issues in Bayesian MDPs
  1.8 Basic statistical distributions
    1.8.1 Discrete distributions
    1.8.2 Continuous distributions

2 Dirichlet distribution
  2.1 Definition and basic properties
    2.1.1 Density function and moments
    2.1.2 Stochastic representations and mode
  2.2 Marginal and conditional distributions
  2.3 Survival function and cumulative distribution function
    2.3.1 Survival function
    2.3.2 Cumulative distribution function
  2.4 Characteristic functions
    2.4.1 The characteristic function of u ∼ U(Tn)
    2.4.2 The characteristic function of v ∼ U(Vn)
    2.4.3 The characteristic function of a Dirichlet random vector
  2.5 Distribution for linear function of a Dirichlet random vector
    2.5.1 Density for linear function of v ∼ U(Vn)
    2.5.2 Density for linear function of u ∼ U(Tn)
    2.5.3 A unified approach to linear functions of variables and order statistics
    2.5.4 Cumulative distribution function for linear function of a Dirichlet random vector
  2.6 Characterizations
    2.6.1 Mosimann's characterization
    2.6.2 Darroch and Ratcliff's characterization
    2.6.3 Characterization through neutrality
    2.6.4 Characterization through complete neutrality
    2.6.5 Characterization through global and local parameter independence
  2.7 MLEs of the Dirichlet parameters
    2.7.1 MLE via the Newton–Raphson algorithm
    2.7.2 MLE via the EM gradient algorithm
    2.7.3 Analyzing serum-protein data of Pekin ducklings
  2.8 Generalized method of moments estimation
    2.8.1 Method of moments estimation
    2.8.2 Generalized method of moments estimation
  2.9 Estimation based on linear models
    2.9.1 Preliminaries
    2.9.2 Estimation based on individual linear models
    2.9.3 Estimation based on the overall linear model
  2.10 Application in estimating ROC area
    2.10.1 The ROC curve
    2.10.2 The ROC area
    2.10.3 Computing the posterior density of the ROC area
    2.10.4 Analyzing the mammogram data of breast cancer

3 Grouped Dirichlet distribution
  3.1 Three motivating examples
  3.2 Density function
  3.3 Basic properties
  3.4 Marginal distributions
  3.5 Conditional distributions
  3.6 Extension to multiple partitions
    3.6.1 Density function
    3.6.2 Some properties
    3.6.3 Marginal distributions
    3.6.4 Conditional distributions
  3.7 Statistical inferences: likelihood function with GDD form
    3.7.1 Large-sample likelihood inference
    3.7.2 Small-sample Bayesian inference
    3.7.3 Analyzing the cervical cancer data
    3.7.4 Analyzing the leprosy survey data
  3.8 Statistical inferences: likelihood function beyond GDD form
    3.8.1 Incomplete 2 × 2 contingency tables: the neurological complication data
    3.8.2 Incomplete r × c contingency tables
    3.8.3 Wheeze study in six cities
    3.8.4 Discussion
  3.9 Applications under nonignorable missing data mechanism
    3.9.1 Incomplete r × c tables: nonignorable missing mechanism
    3.9.2 Analyzing the crime survey data

4 Nested Dirichlet distribution
  4.1 Density function
  4.2 Two motivating examples
  4.3 Stochastic representation, mixed moments, and mode
  4.4 Marginal distributions
  4.5 Conditional distributions
  4.6 Connection with exact null distribution for sphericity test
  4.7 Large-sample likelihood inference
    4.7.1 Likelihood with NDD form
    4.7.2 Likelihood beyond NDD form
    4.7.3 Comparison with existing likelihood strategies
  4.8 Small-sample Bayesian inference
    4.8.1 Likelihood with NDD form
    4.8.2 Likelihood beyond NDD form
    4.8.3 Comparison with the existing Bayesian strategy
  4.9 Applications
    4.9.1 Sample surveys with nonresponse: simulated data
    4.9.2 Dental caries data
    4.9.3 Competing-risks model: failure data for radio transmitter receivers
    4.9.4 Sample surveys: two data sets for death penalty attitude
    4.9.5 Bayesian analysis of the ultrasound rating data
  4.10 A brief historical review
    4.10.1 The neutrality principle
    4.10.2 The short memory property

5 Inverted Dirichlet distribution
  5.1 Definition through the density function
    5.1.1 Density function
    5.1.2 Several useful integral formulae
    5.1.3 The mixed moment and the mode
  5.2 Definition through stochastic representation
  5.3 Marginal and conditional distributions
  5.4 Cumulative distribution function and survival function
    5.4.1 Cumulative distribution function
    5.4.2 Survival function
  5.5 Characteristic function
    5.5.1 Univariate case
    5.5.2 The confluent hypergeometric function of the second kind
    5.5.3 General case
  5.6 Distribution for linear function of inverted Dirichlet vector
    5.6.1 Introduction
    5.6.2 The distribution of the sum of independent gamma variates
    5.6.3 The case of two dimensions
  5.7 Connection with other multivariate distributions
    5.7.1 Connection with the multivariate t distribution
    5.7.2 Connection with the multivariate logistic distribution
    5.7.3 Connection with the multivariate Pareto distribution
    5.7.4 Connection with the multivariate Cook–Johnson distribution
  5.8 Applications
    5.8.1 Bayesian analysis of variance in a linear model
    5.8.2 Confidence regions for variance ratios in a linear model with random effects

6 Dirichlet–multinomial distribution
  6.1 Probability mass function
    6.1.1 Motivation
    6.1.2 Definition via a mixture representation
    6.1.3 Beta–binomial distribution
  6.2 Moments of the distribution
  6.3 Marginal and conditional distributions
    6.3.1 Marginal distributions
    6.3.2 Conditional distributions
    6.3.3 Multiple regression
  6.4 Conditional sampling method
  6.5 The method of moments estimation
    6.5.1 Observations and notations
    6.5.2 The traditional moments method
    6.5.3 Mosimann's moments method
  6.6 The method of maximum likelihood estimation
    6.6.1 The Newton–Raphson algorithm
    6.6.2 The Fisher scoring algorithm
    6.6.3 The EM gradient algorithm
  6.7 Applications
    6.7.1 The forest pollen data
    6.7.2 The teratogenesis data
  6.8 Testing the multinomial assumption against the Dirichlet–multinomial alternative
    6.8.1 The likelihood ratio statistic and its null distribution
    6.8.2 The C(α) test
    6.8.3 Two illustrative examples

7 Truncated Dirichlet distribution
  7.1 Density function
    7.1.1 Definition
    7.1.2 Truncated beta distribution
  7.2 Motivating examples
    7.2.1 Case A: matrix is known
    7.2.2 Case B: matrix is unknown
    7.2.3 Case C: matrix is partially known
  7.3 Conditional sampling method
    7.3.1 Consistent convex polyhedra
    7.3.2 Marginal distributions
    7.3.3 Conditional distributions
    7.3.4 Generation of random vector from a truncated Dirichlet distribution
  7.4 Gibbs sampling method
  7.5 The constrained maximum likelihood estimates
  7.6 Application to misclassification
    7.6.1 Screening test with binary misclassifications
    7.6.2 Case–control matched-pair data with polytomous misclassifications
  7.7 Application to uniform design of experiment with mixtures

8 Other related distributions
  8.1 The generalized Dirichlet distribution
    8.1.1 Density function
    8.1.2 Statistical inferences
    8.1.3 Analyzing the crime survey data
    8.1.4 Choice of an effective importance density
  8.2 The hyper-Dirichlet distribution
    8.2.1 Motivating examples
    8.2.2 Density function
  8.3 The scaled Dirichlet distribution
    8.3.1 Two motivations
    8.3.2 Stochastic representation and density function
    8.3.3 Some properties
  8.4 The mixed Dirichlet distribution
    8.4.1 Density function
    8.4.2 Stochastic representation
    8.4.3 The moments
    8.4.4 Marginal distributions
    8.4.5 Conditional distributions
  8.5 The Liouville distribution
  8.6 The generalized Liouville distribution

Appendix A: Some useful S-plus Codes
References
Author index
Subject index

Preface

The Dirichlet distribution is one of the primary multivariate distributions, whose support is limited to the simplex of a multidimensional space. It appears in many applications, including modeling of compositional data, Bayesian analysis, statistical genetics, nonparametric inference, distribution-free tolerance intervals, multivariate analysis, order statistics, reliability, probability inequalities, probabilistic constrained programming models, limit laws, delivery problems, stochastic processes, and other areas. Development in these areas requires extensions of the Dirichlet distribution in different directions to suit different purposes. In particular, the grouped Dirichlet distribution (GDD) and the nested Dirichlet distribution (NDD) have been introduced as new tools for the statistical analysis of incomplete categorical data. Articles regarding these generalizations, their properties, and their applications are presently scattered in the literature. The book is an exposition whose aims are (i) to systematically review these results and their underlying relationships, (ii) to delineate methods for generating random vectors following these new distributions, and (iii) to show some of their important applications in practice, such as incomplete categorical data analyses.

Chapter 1 introduces the topic with motivating examples and summarizes the concepts and tools necessary for the book. As missing data are an important impetus for generalizing the Dirichlet distribution, we discuss at length the handling of the likelihood function in such problems and the Bayesian approach. It is pointed out that the aim of Bayesian missing-data problems amounts to the inversion of Bayes' formula (IBF), the density form of the converse of Bayes' theorem.

Chapter 2 provides a comprehensive review of the Dirichlet distribution, including its basic properties, marginal and conditional distributions, survival function and cumulative distribution function, characteristic function, the distribution of a linear function of a Dirichlet random vector, characterizations, maximum likelihood estimates of the Dirichlet parameters, moments estimation, generalized moments estimation, estimation based on linear models, and an application to estimating the receiver operating characteristic area. The most recent materials include: the survival function (Section 2.3.1); characteristic functions for two uniform distributions over the hyperplane (Section 2.4.1) and the simplex (Section 2.4.2); the distribution of a linear function of a Dirichlet random vector (Section 2.5); estimation via the expectation–maximization gradient algorithm (Sections 2.7.2 and 2.7.3); and an application (Section 2.10).

Chapters 3 and 4 review two new families of distributions (GDD and NDD), which are two generalizations of the traditional Dirichlet distribution with extra parameters. More importantly, we emphasize their applications to incomplete categorical data and survey data with nonresponse. Chapter 5 collects theoretical results on the inverted Dirichlet distribution and its applications previously scattered throughout the literature. Chapters 6 and 7 gather the results on the Dirichlet–multinomial distribution and the truncated Dirichlet distribution that are at present available only in scattered form in the literature. Chapter 8 reviews the generalized Dirichlet distribution, hyper-Dirichlet distribution, scaled Dirichlet distribution, mixed Dirichlet distribution, Liouville distribution, and the generalized Liouville distribution.

The book is intended as a graduate-level textbook for theoretical statistics, applied statistics, and biostatistics. It can also be used as a reference and handbook for researchers and teachers in institutes and universities. The book is also useful, in part at least, to undergraduates in statistics and to practitioners and investigators in survey companies. A knowledge of standard probability and statistics and of some topics taught in multivariate analysis is necessary for the understanding of this book. Some results in the book are new and have not been published previously.

This book includes an accompanying website. Please visit www.wiley.com/go/dirichlet for more information.

Kai Wang Ng and Guo-Liang Tian
Department of Statistics and Actuarial Science
The University of Hong Kong
Pokfulam Road, Hong Kong, P. R. China

Man-Lai Tang
Department of Mathematics
Hong Kong Baptist University
Kowloon Tong, Hong Kong, P. R. China

Acknowledgments

From Kai Wang Ng: I have enjoyed very much working with Guo-Liang and Man-Lai, particularly Guo-Liang, with whom this is our second co-authored book. I could not have fulfilled my commitment in finishing the project without the patience and support of my family. The staff in the Department of Statistics and Actuarial Science have been very helpful in providing technical support.

From Guo-Liang Tian: I am very grateful for my family's understanding that I work so far away from them, seeing them only one or two months each year. I hope the fruits of this second co-authored book of mine are worthy of their understanding. I am grateful to Professor Ming T. Tan and Dr Hong-Bin Fang of the University of Maryland Greenebaum Cancer Center, who have invited me for several visits. I enjoy working with my co-authors. I appreciate very much my long collaboration with Kai since 1998, especially on missing-data problems and the related applications of the inversion of Bayes' formula. Working with Kai always gives me the opportunity to learn a lot.

From Man-Lai Tang: Albeit too technical for my wife and my baby daughter, this book is a very special gift to the three of us. I would also like to thank my parents for their blessings and for letting me freely follow my ambitions throughout my childhood, and my brother and sister for taking care of my parents while I pursued my PhD degree at UCLA and my career in the USA. It is my honor to co-author this interesting book with Kai and Guo-Liang, who opened the door to me on this subject.

List of abbreviations

a.s.      almost surely
AUC       area under ROC curve
cdf       cumulative distribution function
cf        characteristic function
cf.       confer
CI        confidence/credible interval
CPU       central processing unit
DA        data augmentation
DOI       degree of infiltration
EM        expectation–maximization
ECM       expectation/conditional maximization
E-step    expectation step
FPF       false positive fraction
GDD       grouped Dirichlet distribution
GMM       generalized method of moments
HPD       highest posterior density
IBF       inversion of Bayes' formula
iff       if and only if
i.i.d.    independently and identically distributed
I-step    imputation step
LOR       local odds ratio
MAR       missing at random
MCAR      missing completely at random
MCMC      Markov chain Monte Carlo
MDP       missing data problems
mgf       moment-generating function
MLE       maximum likelihood estimate
M-step    maximization step
NDD       nested Dirichlet distribution
pdf       probability density function
pmf       probability mass function
PIEM      partial imputation expectation–maximization
P-step    posterior step
ROC       receiver operating characteristic
SE        standard error
std       standard deviation
TPF       true positive fraction

List of symbols

Mathematics
∝              proportional to
               end of an example
End            end of the proof of a lemma/property/theorem
¶              end of a definition/lemma/property/remark/theorem
=ˆ             equal by definition
≡              always equal to
≈              approximately equal to
≠              not equal to
ℜ(a)           real part of the complex number a
[a]            largest integer not exceeding a
a+, a·         sum of all components in the vector a
a⁺             max(a, 0)
(a)m           Γ(a + m)/Γ(a) if m > 0, see (2.25)
◦              component-wise operator, see Footnote 5.3
x̄              mean of all components in the vector x = (x1, . . . , xn)
x′             transpose of the vector x
x−n            (x1, . . . , xn−1)
x−i            (x1, . . . , xi−1, xi+1, . . . , xn)
‖x‖p           ℓp-norm of x, (|x1|^p + · · · + |xn|^p)^(1/p)
‖x‖1           ℓ1-norm of x, x1 + · · · + xn if all xi ≥ 0
‖x‖            ‖x‖2
diag(x)        diagonal matrix with main diagonal elements x1, . . . , xn
0n             n-dimensional vector of zeros
1n             n-dimensional vector of ones
In             n × n identity matrix
A′             transpose of the matrix A
|A|            determinant of the matrix A
tr(A)          trace of the matrix A
rank(A)        rank of the matrix A
A−1            inverse matrix of A
A+             Moore–Penrose generalized inverse matrix of A, see Footnote 2.11
∇f             gradient of the function f
∇2f            Hessian of the function f
∇ij f(x, y)    ∂^(i+j) f(x, y)/∂x^i ∂y^j
Γ(a)           gamma function (a > 0)
ψ(x)           digamma function, d log Γ(x)/dx
ψ′(x)          trigamma function, dψ(x)/dx
B(a1, a2)      beta function, Γ(a1)Γ(a2)/Γ(a1 + a2)
Ix(a, b)       incomplete beta function, ∫0^x y^(a−1)(1 − y)^(b−1) dy / B(a, b)
Bn(a)          multivariate beta function, B(a1, . . . , an)
I(x∈X), IX(x)  indicator function
sgn(x)         sign function
(n k)          binomial coefficient, n!/{k!(n − k)!}
(N x)          multinomial coefficient, N!/(x1! · · · xn!)
∅              empty set
N              set of natural numbers, {1, 2, . . .}
Cn             n-dimensional unit cube, (0, 1)^n
[a, b]         n-dimensional rectangle, [a1, b1] × · · · × [an, bn]
Rn             n-dimensional Euclidean space
R              real line, (−∞, +∞)
Rn+            n-dimensional positive orthant
R+             positive real line, (0, +∞)
Vn(c)          n-dimensional open simplex, {x: x ∈ Rn+, 1n′x < c}
Vn             Vn(1)
Vn−1(a, b)     convex polyhedron; see (7.2)
Tn(c)          n-dimensional hyperplane or closed simplex, {x: x ∈ Rn+, 1n′x = c}
Tn             Tn(1)
Tn(a, b)       convex polyhedron, {x: x ∈ [a, b], 1n′x = 1}

Probability
∼              distributed as
∼iid           independently and identically distributed
∼ind           independently distributed
=d             having the same distribution on both sides
=pj.           equality in the sense of projection
→L             convergence in distribution
X ⊥⊥ Y         X is independent of Y
X, Y, Z        random variables/vectors
x, y, z        realized values
Pr             probability measure
E(X)           expectation of random variable X
Var(X)         variance of X
Cov(X, Y)      covariance of X and Y
Corr(X, Y)     correlation coefficient of X and Y
φ(x)           pdf of standard normal distribution
Φ(x)           cdf of standard normal distribution
o(x)           small-o. As x → ∞, o(x)/x = o(1) → 0
O(x)           big-O. As x → ∞, O(x)/x = O(1) → constant

Statistics
fX(x)          density function of the random variable X
FX(x)          distribution function of the random variable X
ϕX(t)          characteristic function of the random variable X
ϕx(t)          characteristic function of the random vector x
Mx(s)          moment-generating function of the random vector x
Ycom, Yaug     complete or augmented data
Yobs           observed data
Ymis, z        missing data or latent vector
θ, Θ           parameter vector, parameter space
L(θ|Yobs)      observed likelihood function of θ for fixed Yobs, statistically identical to the joint pdf f(Yobs|θ)
ℓ(θ|Yobs)      log-likelihood function for observed data
ℓ(θ|Yobs, z)   log-likelihood function for complete data
∇ℓ(θ|Yobs)     score vector
Iobs(θ)        observed information, −∇2ℓ(θ|Yobs)
Iaug(θ)        augmented information, −∇2ℓ(θ|Yaug)
J(θ)           Fisher/expected information, E{Iobs(θ)}
θ̂              MLE of θ
se(θ̂)          standard error of θ̂
Cov(θ̂)         covariance matrix of θ̂
π(θ)           prior distribution/density of θ
f(θ|Yobs, z)   complete-data posterior distribution
f(z|Yobs, θ)   conditional predictive distribution
f(θ|Yobs)      observed-data posterior distribution
f(z|Yobs)      marginal predictive distribution
S(θ|Yobs)      conditional support of θ|Yobs
S(z|Yobs)      conditional support of z|Yobs
{X(ℓ)}, ℓ = 1, . . . , L   L i.i.d. samples from the distribution of X
{X(i)}, i = 1, . . . , m   order statistics of m i.i.d. samples
θ̃              mode of f(θ|Yobs)
J(y → x)       Jacobian determinant from y to x, |∂y/∂x|

List of figures

1.1 Plots of the densities of Z ∼ Beta(a1, a2) with various parameter values.
2.1 Plots of the densities of (x1, x2) ∼ Dirichlet(a1, a2; a3) on V2 with various parameter values.
2.2 ROC curves for the useless and perfect tests for comparison.
2.3 (a) The posterior density of the AUC given by (2.112) for the ultrasound rating data of breast cancer metastasis, where the density curve is estimated by a kernel density smoother based on 10 000 i.i.d. posterior samples. (b) The histogram of the AUC.
3.1 Plots of grouped Dirichlet densities GD3,2,2(x−3|a, b).
4.1 Perspective plots of nested Dirichlet densities ND3,2(x−3|a, b).
4.2 Comparisons of three posterior densities in each plot, obtained by using the exact IBF sampling (solid curve, L = 20 000 i.i.d. samples), the DAND algorithm (dotted curve, a total of 40 000 and retaining the last 20 000 samples), and the DAD algorithm (dashed curve, a total of 45 000 and retaining the last 20 000 samples).
4.3 Comparison of the two cause-specific hazard rates for the radio transmitter receivers data.
4.4 (a) Posterior density of the AUC given by (2.112) for the ultrasound rating data of breast cancer metastasis, where the density curve is estimated by a kernel density smoother based on 20 000 i.i.d. samples. (b) The histogram of the AUC.

List of tables

1.1 Relative frequencies of serum proteins in white Pekin ducklings.
1.2 Mammogram data using a five-category scale.
1.3 Cervical cancer data.
1.4 Leprosy survey data.
1.5 Neurological complication data.
1.6 Victimization results from the national crime survey.
1.7 Observed cell counts of the failure times for radio transmitter receivers.
1.8 Ultrasound rating data for detection of breast cancer metastasis.
1.9 Observed forest pollen data from the Bellas Artes core.
1.10 Observed teratogenesis data from exposure to hydroxyurea.
1.11 The joint and marginal distributions.
3.1 Frequentist and Bayesian estimates of parameters for the leprosy survey data.
3.2 Observed counts and cell probabilities for incomplete r × c table.
3.3 Maternal smoking cross-classified by child's wheezing status.
3.4 MLEs and CIs of parameters for child's wheeze data.
3.5 Parameter structure under nonignorable missing mechanism.
3.6 Posterior means and standard deviations for the crime survey data under nonignorable missing mechanism.
4.1 r × 2 contingency table with missing data.
4.2 MLEs, SEs and Bayesian estimates of parameters for the simulated data.
4.3 The values of qz(θ0) with θ0 = 1_3/3 and pz = f(z|Yobs).
4.4 MLEs and Bayesian estimates of gi(tj) for the failure data of radio transmitter receivers.
4.5 Bayesian estimates of AUC for the ultrasound rating data.
7.1 Bayesian estimates of p and π for various prior parameters.
7.2 Frequentist estimates of p for four different cases.
7.3 Bayesian estimates of p for two different cases.
8.1 Posterior means and standard deviations for the crime survey data under the ignorable missing mechanism.
8.2 Results of 88 chess matches among three players.
8.3 First five results of a sports league comprising nine players.
8.4 Results from doubles and singles tennis matches.

1 Introduction

As the multivariate version of the beta distribution, the Dirichlet distribution is one of the key multivariate distributions for random vectors confined to the simplex. In the early studies of compositional data, or measurements of proportions, it is the most natural distribution for modeling. In Bayesian statistics, it is the conjugate prior distribution for counts following a multinomial distribution. Applications of the Dirichlet distribution in other areas form a long list: statistical genetics, nonparametric inference, distribution-free tolerance intervals, order statistics, reliability theory, probability inequalities, probabilistic constrained programming models, limit laws, delivery problems, stochastic processes, and so on.

The Dirichlet family, however, is not rich enough to represent many important applications. Extensions of the Dirichlet distribution in different directions are necessary for different purposes. Previous generalizations and their properties and applications are scattered in the literature. For instance, motivated by the likelihood functions of several incomplete categorical data sets, the authors of this book and their collaborators have recently developed flexible parametric classes of distributions beyond the Dirichlet distribution. Amongst them, the grouped Dirichlet distribution (GDD) and the nested Dirichlet distribution (NDD) are perhaps the most important. In fact, the GDD and NDD can be used as new tools for the statistical analysis of incomplete categorical data. In addition, the inverted Dirichlet, Dirichlet-multinomial, truncated Dirichlet, generalized Dirichlet, hyper-Dirichlet, scaled Dirichlet, mixed Dirichlet, Liouville, and generalized Liouville distributions have a close relation with the Dirichlet distribution.

In this book, we shall systematically review these distributions and their underlying properties, including methods for generating random vectors from them. We also present some of their important applications in practice, such as incomplete categorical data analyses.



1.1 Motivating examples

The following real data sets help to motivate the Dirichlet distribution and the related distributions that will be studied in this book.

Example 1.1 (Serum-protein data of white Pekin ducklings). Mosimann (1962) reported the blood serum proportions (pre-albumin, albumin, and globulin) in 3-week-old white Pekin ducklings; see Table 1.1. In each of the 23 sets of data, the three proportions are based on 7 to 12 white Pekin ducklings, while the individual measurements are not available. Ducklings in each set had the same diet, but the diet differed from set to set. In Chapter 2 we will use these data to illustrate the Newton–Raphson algorithm and the expectation–maximization (EM) gradient algorithm for calculating the maximum likelihood estimates (MLEs) of the Dirichlet parameters and the associated standard errors (SEs).

Example 1.2 (Mammogram data for patients with/without breast cancer). Mammography is used for the early detection of breast cancer, typically by detecting the characteristic masses and/or microcalcifications by means of low-dose X-rays of the breast. As a diagnostic and screening tool, the result is reported on a five-point rating scale: normal, benign, probably benign, suspicious, and malignant. Table 1.2 summarizes the mammographer's results (Zhou et al., 2002: 20–21) for 60 patients presenting for breast cancer screening. The sample consists of 30 patients with pathology-proven cancer and 30 patients with normal mammograms for two consecutive years.

Table 1.1 Relative frequencies of serum proteins in white Pekin ducklings.a

  i    xi1     xi2     xi3        i    xi1      xi2     xi3
  1   0.178   0.346   0.476      13   0.0500   0.485   0.4650
  2   0.162   0.307   0.531      14   0.0730   0.378   0.5490
  3   0.083   0.448   0.469      15   0.0640   0.562   0.3740
  4   0.087   0.474   0.439      16   0.0850   0.465   0.4500
  5   0.078   0.503   0.419      17   0.0940   0.388   0.5180
  6   0.040   0.456   0.504      18   0.0140   0.449   0.5370
  7   0.049   0.363   0.588      19   0.0600   0.544   0.3960
  8   0.100   0.317   0.583      20   0.0310   0.569   0.4000
  9   0.075   0.394   0.531      21   0.0250   0.491   0.4840
 10   0.084   0.445   0.471      22   0.0450   0.613   0.3420
 11   0.060   0.435   0.505      23   0.0195   0.526   0.4545
 12   0.089   0.418   0.493

Source: Mosimann (1962). Reproduced by permission of Oxford University Press (China) Ltd.
a xi1 + xi2 + xi3 = 1, where xi1 = pre-albumin, xi2 = albumin, xi3 = globulins.


Table 1.2 Mammogram data using a five-category scale.

Cancer status   Test results
                Normal   Benign   Probably benign   Suspicious   Malignant   Total
Present            1        0            6              11           12        30
Absent             9        2           11               8            0        30

Source: Zhou et al. (2002). Reproduced by permission of John Wiley & Sons (Asia) Pte Ltd.

In Chapter 2 we will discuss how to analyze this data set using the Bayesian method based on independent Dirichlet posterior distributions. More specifically, we will consider how to compute the Bayesian posterior distribution of the receiver operating characteristic (ROC) area.

Example 1.3 (Cervical cancer data). Williamson and Haber (1994) reported a case–control study of the possible relationship between disease status of cervical cancer and the number of sex partners, together with other risk factors. Cases were women aged 20–70 years in Fulton or Dekalb county in Atlanta, Georgia, who were diagnosed and ascertained to have invasive cervical cancer. Controls were randomly chosen from the same counties and the same age ranges. Table 1.3 gives the cross-classification of disease status (control or case, denoted by X = 0 or X = 1) and number of sex partners ('few' (0–3) or 'many' (≥4), denoted by Y = 0 or Y = 1). Generally, a sizable proportion (13.5% in this example) of the responses would be missing because of 'unknown' or 'refused to answer' in a telephone interview. In this example, it is assumed that the mechanism is missing at random (MAR; see Section 1.6.1). The objective is to examine whether an association exists between the number of sex partners and the disease status of cervical cancer. In Chapter 3 we will use this example to motivate the definition of the extended Dirichlet distribution; that is, the so-called GDD with two partitions.

Table 1.3 Cervical cancer data.a

Disease status     Number of sex partners                   Supplement on X
                   Y = 0 (0–3)       Y = 1 (≥4)
X = 0 (Control)    164 (n1, θ1)      164 (n2, θ2)           43 (n12, θ1 + θ2)
X = 1 (Case)       103 (n3, θ3)      221 (n4, θ4)           59 (n34, θ3 + θ4)

Source: Williamson and Haber (1994). Reproduced by permission of John Wiley & Sons (Asia) Pte Ltd.
a The observed counts and the corresponding cell probabilities are in parentheses.


Table 1.4 Leprosy survey data.a

                       Clinical condition
DOI                    Marked         Moderate       Slight         Stationary      Worse
Main sample
  Little               11 (n1, θ1)    27 (n2, θ2)    42 (n3, θ3)    53 (n4, θ4)     11 (n5, θ5)
  Much                  7 (n6, θ6)    15 (n7, θ7)    16 (n8, θ8)    13 (n9, θ9)      1 (n10, θ10)
Supplemental sample    Improvement                                  Stationary      Worse
  Little               144 (n123, θ1 + θ2 + θ3)                     120 (ñ4, θ4)    16 (ñ5, θ5)
  Much                  92 (n678, θ6 + θ7 + θ8)                      24 (ñ9, θ9)     4 (ñ10, θ10)

Source: Hocking and Oxspring (1974). Reproduced by permission of John Wiley & Sons (Asia) Pte Ltd.
a DOI = degree of infiltration. The observed counts and cell probabilities are in parentheses.

Example 1.4 (Leprosy survey data). Hocking and Oxspring (1974) analyzed a data set (see Table 1.4) on the use of drugs in the treatment of leprosy. A random sample of 196 patients was cross-classified by two categories of the degree of infiltration (little or much) and five categories of change in clinical condition (marked, moderate, or slight improvement, stationary, or worse) after a fixed time over which treatments were administered. A supplementary sample of another 400 different patients was classified coarsely with respect to improvement in health. In this example, MAR is again assumed. The goal is to estimate the cell probabilities. In Chapter 3 we will use the observed likelihood function to illustrate the GDD with multiple partitions.

Example 1.5 (Neurological complication data). Choi and Stablein (1982) reported a neurological study in which 33 young meningitis patients at the St. Louis Children's Hospital were given neurological tests at the beginning and at the end of a standard treatment for neurological complications. The response of each test is the absence (denoted by 0) or presence (denoted by 1) of any neurological complication. The data are reported in Table 1.5. The primary objective of this study is to detect whether there is a substantial difference in the proportion before and after the treatment. In Chapter 3 we will use this example to illustrate how an EM algorithm based on the GDD applies to an augmented likelihood function with fewer latent variables than the conventional EM algorithm based on the Dirichlet distribution.


Table 1.5 Neurological complication data.a

Complication status     Complication status at the end of the treatment     Supplement on X
at the beginning        Y = 0                   Y = 1
X = 0 (Absent)          6 (n1, θ1)              3 (n2, θ2)                   2 (n12, θ1 + θ2)
X = 1 (Present)         8 (n3, θ3)              8 (n4, θ4)                   4 (n34, θ3 + θ4)
Supplement on Y         2 (n13, θ1 + θ3)        0 (n24, θ2 + θ4)

Source: Choi and Stablein (1982). Reproduced by permission of John Wiley & Sons (Asia) Pte Ltd.
a 'X = 0(1)' means that the patient's complication at the beginning of the treatment is absent (present); 'Y = 0(1)' means that the patient's complication at the end of the treatment is absent (present). The observed frequencies and probabilities are in parentheses.

Example 1.6 (Crime survey data). The data set in Table 1.6 was obtained via the national crime survey conducted by the US Bureau of the Census (Kadane, 1985). Households were interviewed to see if they had been victimized by crime in the preceding 6-month period. The occupants of the same housing unit were interviewed again 6 months later to determine if they had been victimized in the intervening months, whether these were the same people or not. Denote the probability that a household is crime free (or victimized) in both periods by θ1 (or θ4), and the probability that it is crime free (or victimized) in period 1 and victimized (or crime free) in period 2 by θ2 (or θ3). Naturally,

    θ1 + θ2 + θ3 + θ4 = 1    and    θj > 0,    j = 1, 2, 3, 4.

One of the goals is to obtain estimates of the θj. In Chapter 3 we will show how a GDD can be applied to obtain explicit estimates of the cell probabilities for this example under the assumption of a nonignorable missing mechanism (Little and Rubin, 2002).

Table 1.6 Victimization results from the national crime survey.a

The first visit    The second visit
                   Crime free       Victims         Nonresponse
Crime free         392 (n1, θ1)     55 (n2, θ2)      33 (n12)
Victims             76 (n3, θ3)     38 (n4, θ4)       9 (n34)
Nonresponse         31 (n13)         7 (n24)        115 (n1234)

Source: Kadane (1985). Reproduced by permission of Elsevier.
a The observed frequencies of households and the corresponding probabilities are in parentheses.




Table 1.7 Observed cell counts of the failure times for radio transmitter receivers.a

Index j   Time period [tj−1, tj)   Type I failures (n1j)   Type II failures (n2j)   Total   rj
  1       0–50                      26                      15                       41     328
  2       50–100                    29                      15                       44     284
  3       100–150                   28                      22                       50     234
  4       150–200                   35                      13                       48     186
  5       200–250                   17                      11                       28     158
  6       250–300                   21                       8                       29     129
  7       300–350                   11                       7                       18     111
  8       350–400                   11                       5                       16      95
  9       400–450                   12                       3                       15      80
 10       450–500                    7                       4                       11      69
 11       500–550                    6                       1                        7      62
 12       550–600                    9                       2                       11      51
 13       600–629                    6                       1                        7      44
          Not fail at 630 h          —                       —                       44       —
Total                               218                     107                     369

Source: Mendenhall and Hader (1958). Reproduced by permission of Oxford University Press (China), Ltd.
a rj is the number of right-censored items during the jth time period.

Example 1.7 (Failure data of radio transmitter receivers). Consider a study of the failure times of 369 radio transmitter receivers (Mendenhall and Hader, 1958; Cox, 1959). The failures were classified into two types: those confirmed (Type I) and unconfirmed (Type II) on arrival at the maintenance center. Of the 369 receivers, 44 were censored; that is, they did not fail during the test period of 630 h. These data are listed in Table 1.7.

Example 1.8 (Ultrasound rating data of breast cancer metastasis). Consider the rating data for the study of breast cancer metastasis described in Peng and Hall (1996) and Hellmich et al. (1998). The diagnosis of metastasis was made with ultrasound by a radiologist reading nine images from subjects with metastasis and 14 without metastasis. The rating scale is from 1 to 4, where 1 denotes definitely abnormal, 2 denotes probably abnormal, 3 denotes probably normal, and 4 denotes definitely normal. In addition, the rating 23 denotes equivocal (i.e., probably abnormal or probably normal). The ultrasound ratings of the 23 patients are given in Table 1.8. In Chapter 4 we will demonstrate how to apply the NDD to analyze the two data sets described in Examples 1.7 and 1.8.

Example 1.9 (Forest pollen data in the Bellas Artes core). Consider the data that consist of counts of the frequency of occurrence of different kinds of pollen grains at various levels of a core from the Valley of Mexico (Mosimann, 1962). Four arboreal


Table 1.8 Ultrasound rating data for detection of breast cancer metastasis.a

Metastasis   Rating scale                                                                      Total
             1             2             23                     3             4
Yes          5 (y1, θ1)    2 (y2, θ2)    0 (y23, θ2 + θ3)       2 (y3, θ3)    0 (y4, θ4)        9
No           0 (z1, φ1)    1 (z2, φ2)    2 (z23, φ2 + φ3)       5 (z3, φ3)    6 (z4, φ4)       14

Source: Peng and Hall (1996). Reproduced by permission of SAGE Publications.
a The observed counts and the corresponding cell probabilities are in parentheses.

types of pollen (Pinus, Abies, Quercus, and Alnus) were counted until a total of 100 grains was reached. Seventy-three levels are arranged in Table 1.9 in the order of occurrence. That is, there are 73 clusters here and the cluster size throughout the experiment is 100.

Example 1.10 (Teratogenesis data from exposure to hydroxyurea). Consider a study on reproductive and developmental effects resulting from exposure to hydroxyurea (Chen et al., 1991). Pregnant female animals were assigned to a control group or to one of three dosed groups (low dose, medium dose, and high dose). The data for each female animal in each treatment group include the numbers of dead/resorbed fetuses, malformed fetuses, normal fetuses, and implantation sites. The control group has zero responses in malformations, so only the data from the three dose groups are shown in Table 1.10. In Chapter 6 we will show how to apply different Dirichlet-multinomial distributions to fit the two data sets described in Examples 1.9 and 1.10.

1.2 Stochastic representation and the =d operator

1.2.1 Definition of stochastic representation

If X is a random variable whose distribution is identical to the distribution of a function of other random variables Y1, . . . , Yn, we shall write

    X =d g(Y1, . . . , Yn),                                          (1.1)

and refer to (1.1) as a stochastic representation of X. Obviously, one should prefer a stochastic representation in which Y1, . . . , Yn are mutually independent and the distributions of the Yi for i = 1, . . . , n are familiar to us. When the distribution of X and its properties are difficult to derive, or its quantities such as moments are not easily accessible, the stochastic representation is very often a powerful tool; for example, see its important role in the theory of symmetric multivariate distributions (Fang et al., 1990). It has also been widely used in Monte Carlo simulation.


Table 1.9 Observed forest pollen data from the Bellas Artes core.a

Levels i = 1–37:
  xi1: 94 75 81 95 89 84 81 97 86 86 82 72 89 93 87 85 91 95 94 85 91 99 90 91 79 89 95 90 93 90 89 88 99 86 88 91 84
  xi2: 0 2 2 2 3 5 3 0 1 2 2 1 0 4 1 0 0 1 3 1 1 1 2 0 1 0 2 3 1 2 2 1 0 1 0 0 0
  xi3: 5 14 13 3 1 7 10 2 8 11 10 16 9 2 11 12 7 3 3 12 4 0 8 8 19 7 1 5 6 7 9 9 1 10 7 7 14
  xi4: 1 9 4 0 7 4 6 1 5 1 6 11 2 1 1 3 2 1 0 2 4 0 0 1 1 4 2 2 0 1 0 2 0 3 5 2 2

Levels i = 38–73:
  xi1: 84 97 83 81 81 76 87 91 94 88 93 84 87 89 73 87 94 81 88 94 69 90 86 90 74 82 87 68 77 86 79 87 79 74 80 85
  xi2: 1 0 0 1 1 2 3 1 0 1 4 0 1 1 0 3 1 2 0 0 7 0 1 0 5 2 3 3 3 2 1 0 1 0 0 3
  xi3: 12 3 13 15 16 18 7 5 5 11 2 8 12 6 13 8 3 9 9 4 18 8 8 7 16 11 9 26 11 7 11 11 17 19 14 9
  xi4: 3 0 4 3 2 4 3 3 1 0 1 8 0 4 14 2 2 8 3 2 6 2 5 3 5 5 1 3 9 5 9 2 3 7 6 3

Source: Mosimann (1962). Reproduced by permission of Oxford University Press (China), Ltd.
a xi1 + xi2 + xi3 + xi4 = 100, where xi1 = Pinus, xi2 = Abies, xi3 = Quercus, and xi4 = Alnus.


Table 1.10 Observed teratogenesis data from exposure to hydroxyurea.a

Low dose (animals i = 1–23):
  i = 1–8:    xi1: 1 0 0 5 1 0 2 2      xi2: 0 0 0 0 0 2 0 2      Ni: 14 11 11 8 12 8 11 10
  i = 9–16:   xi1: 1 0 2 1 0 4 0 1      xi2: 0 0 4 0 0 0 0 5      Ni: 12 14 13 7 9 9 11 14
  i = 17–23:  xi1: 2 1 0 1 1 0 1        xi2: 0 3 0 0 2 2 3        Ni: 14 10 10 9 10 9 10

Medium dose (animals i = 1–20):
  i = 1–7:    xi1: 2 0 12 9 12 5 3      xi2: 0 0 1 0 0 5 1        Ni: 14 11 16 11 13 13 13
  i = 8–14:   xi1: 3 10 4 3 9 9 8       xi2: 0 1 2 4 0 1 1        Ni: 11 12 13 9 10 11 13
  i = 15–20:  xi1: 4 2 1 8 2 4          xi2: 1 6 1 3 1 0          Ni: 8 10 12 11 12 12

High dose (animals i = 1–47):
  i = 1–16:   xi1: 7 8 6 10 7 3 4 9 7 4 11 5 10 10 6 8
              xi2: 1 3 1 0 2 1 0 0 1 1 0 4 0 0 0 1
              Ni:  9 11 11 11 11 11 9 10 8 5 12 12 12 11 11 13
  i = 17–32:  xi1: 6 8 11 2 11 2 1 7 10 9 8 7 9 10 12 4
              xi2: 3 0 0 4 0 1 0 2 2 1 0 0 1 1 1 4
              Ni:  12 9 12 10 12 9 11 10 12 11 9 9 11 12 14 12
  i = 33–47:  xi1: 3 11 7 10 15 1 4 5 7 4 2 4 3 11 7
              xi2: 1 0 3 1 1 1 3 2 3 2 1 2 0 1 0
              Ni:  9 12 12 12 16 9 9 9 11 12 10 11 11 12 10

Source: Chen et al. (1991). Reproduced by permission of John Wiley & Sons (Asia) Pte Ltd.
a xi1 = the number of dead/resorbed fetuses; xi2 = the number of malformed fetuses; xi3 = the number of normal fetuses; and Ni = xi1 + xi2 + xi3 = the number of implantation sites.


The following are some elementary properties of the =d operator:

(1) If X = Y, then X =d Y. However, the converse may not be true.
(2) Let X and Y be two random variables. Then X|Y =d X if and only if (iff) X and Y are independent.
(3) If the cumulative distribution function (cdf) of X is symmetric about the origin (i.e., F(x) = 1 − F(−x) for all x ∈ R), then we have X =d −X. For example, if X ∼ N(0, 1) (i.e., the standard normal distribution), then X =d −X; however, X ≠ −X. Another example is U =d 1 − U, where U ∼ U(0, 1), the uniform distribution on (0, 1). This fact is often used in Monte Carlo simulation.

Before considering more properties of the =d operator, let us introduce two examples to illustrate the advantage of stochastic representation.

Example 1.11 (Expectation of gamma distribution). A random variable Y has a gamma distribution with shape parameter n and scale parameter β > 0, written as Y ∼ Gamma(n, β), if Y has the probability density function (pdf)

    Gamma(y|n, β) = [β^n/Γ(n)] y^(n−1) e^(−βy),    y > 0.

In particular, when n = 1, Gamma(1, β) reduces to Exponential(β), the exponential distribution with pdf

    Exponential(y|β) = β e^(−βy),    β > 0,  y > 0.

If Y ∼ Gamma(n, β), we would like to calculate the expectation of Y. First, we employ the analytic method. Starting from the density function, we have

    E(Y) = ∫₀^∞ y · Gamma(y|n, β) dy
         = [β^n/Γ(n)] ∫₀^∞ y^n e^(−βy) dy
         = [β^n/Γ(n)] · Γ(n + 1)/β^(n+1)
         = n/β.

Here, the integral identity

    ∫₀^∞ y^n e^(−βy) dy = Γ(n + 1)/β^(n+1)                           (1.2)

and the recursive relationship

    Γ(n + 1) = nΓ(n)                                                 (1.3)


for the gamma function are used. The proofs of (1.2) and (1.3) are available in most textbooks of mathematical statistics and are omitted here.

Now, let us consider the application of stochastic representation. Let the random variables X1, . . . , Xn be independently and identically distributed (i.i.d.) as Exponential(β), that is, X1, . . . , Xn ∼iid Exponential(β); then, from the above,

    Y =d X1 + · · · + Xn.

Since

    E(X1) = ∫₀^∞ β · x e^(−βx) dx = 1/β,

we immediately obtain E(Y ) = n · E(X1 ) = n/β.
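The representation also lends itself directly to a Monte Carlo check. The following sketch, in the S-plus/R style of the code collected in Appendix A (the values of n, β, and the sample size N below are arbitrary illustrative choices, not taken from the book), approximates E(Y) by averaging simulated sums of i.i.d. exponential variables and compares the result with n/β.

# Monte Carlo check of E(Y) = n/beta via the representation Y =d X1 + ... + Xn
n <- 5; beta <- 2; N <- 10000                    # arbitrary illustrative values
x <- matrix(rexp(n * N, rate = beta), nrow = N)  # N rows, each with n i.i.d. Exponential(beta) draws
y <- apply(x, 1, sum)                            # each row sum is one realization of Y
mean(y)                                          # should be close to n/beta = 2.5
n / beta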



Example 1.12 (Expectation of gamma-integral distribution). Let us consider another example (Devroye, 1986: p. 191). Let X have the gamma-integral distribution with parameter α > 0, which is defined by the pdf

    fX(x) = ∫ₓ^∞ [t^(α−2) e^(−t)/Γ(α)] dt,    x > 0.                 (1.4)

Suppose that it is of interest to calculate E(X). In this case, the analytic method is somewhat tedious, as

    E(X) = ∫₀^∞ x { ∫ₓ^∞ [t^(α−2) e^(−t)/Γ(α)] dt } dx.

Note that

    X =d UY,

where U ∼ U(0, 1) is independent of Y ∼ Gamma(α, 1). As a result, E(X) = E(U) · E(Y) = α/2.
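A similar S-plus/R-style sketch (again with arbitrary illustrative constants) checks E(X) = α/2 by simulating the representation X =d UY directly, which avoids the double integral altogether.

# Monte Carlo check of E(X) = alpha/2 via the representation X =d U * Y
alpha <- 3; N <- 10000               # arbitrary illustrative values
u <- runif(N)                        # U ~ U(0, 1)
y <- rgamma(N, alpha)                # Y ~ Gamma(alpha, 1), independent of U
x <- u * y                           # realizations of X
mean(x)                              # should be close to alpha/2 = 1.5
alpha / 2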

1.2.2 More properties on the =d operator

The equality in distribution extends easily to two n-dimensional random vectors, x and y, which have the same joint distribution. We denote this by x =d y. The following are some basic properties (Anderson and Fang, 1982, 1987; Zolotarev, 1985; Fang and Chen, 1986; Fang and Zhang, 1990).

Property 1.1 If x =d y and hj(·), j = 1, . . . , m, are measurable functions, then (h1(x), . . . , hm(x)) =d (h1(y), . . . , hm(y)). ¶

This can be readily verified via the characteristic function (cf). For example, if x =d y, then we have

    Ax + b =d Ay + b    and    (x′A1x, . . . , x′Amx) =d (y′A1y, . . . , y′Amy),

where A, A1, . . . , Am are constant matrices and b is a constant vector.


Property 1.2 Let the random variables X and Z be independent, and let Y and W also be independent.

(i) If X =d Y and Z =d W, then X + Z =d Y + W, and hence X + c =d Y + c for any constant c.
(ii) If Z =d W and the cf ϕZ(t) of Z is nonzero for almost all t, then X + Z =d Y + W implies X =d Y. ¶

Proof. (i) If X =d Y and Z =d W, then the cfs of X + Z and Y + W satisfy the following equations:

ϕX+Z (t) = ϕX (t)ϕZ (t) = ϕY (t)ϕW (t) = ϕY +W (t), so that X + Z =d Y + W. (ii) If Z =d W and X + Z =d Y + W, we need to prove X =d Y . Since ϕX (t)ϕZ (t) = ϕX+Z (t) = ϕY +W (t) = ϕY (t)ϕW (t), / 0 for almost all t, we have ϕX (t) = ϕY (t) for almost all t. By and ϕW (t) = ϕZ (t) = continuity of the cf, we have X =d Y . End The condition of independence in Property 1.2 is necessary. For instance, let X ∼ N(0, 1), Y = −X, and Z = W = X. Although X =d Y and Z =d W, the distribution of X + Z = 2X is different from that of Y + W = 0. Property 1.3 and Y .

Let random variable Z be independent of random variables X

(i) X =d Y implies ZX =d ZY . (ii) If Pr(X > 0) = Pr(Y > 0) = Pr(Z > 0) = 1 and the cf of log Z satisfies ϕlog Z (t) = / 0 for almost all t, then ZX =d ZY implies X =d Y . (iii) If Pr(Z > 0) = 1 and the cf of log Z satisfies ϕlog Z (t) = / 0 for almost all t, then d d ZX = ZY implies X = Y . ¶ Proof. (i) If Z = 0, the assertion is trivial. Let Pr(Z = / 0) = 1 and the cdf of Z be FZ (·). Without loss of generality, we assume that Pr(Z > 0) = 1. The cdfs of ZX and ZY have the following relationship:

INTRODUCTION

13

Table 1.11 The joint and marginal distributions. Z\X

−1

−2

Marginal

Z\Y

1

2

Marginal

1 −1

1/4 1/4

1/4 1/4

1/2 1/2

1 −1

1/4 1/4

1/4 1/4

1/2 1/2

Marginal

1/2

1/2

1

Marginal

1/2

1/2

1



Pr{ZX ≤ t} =

Pr{ZX ≤ t|Z = z} dFZ (z)  t dFZ (z) (by independence) = Pr X ≤ z   t d dFZ (z) (by X = Y ) = Pr Y ≤ z  = Pr{ZY ≤ t|Z = z} dFZ (z) (by independence) 

= Pr{ZY ≤ t}. This implies Assertion (i). (ii) By Property 1.1, from the assumption that ZX =d ZY , we have d

log Z + log X = log Z + log Y. By Property 1.2(ii), we obtain log X =d log Y ; that is, X =d Y . (iii) The proof of this part can be found in Fang and Zhang (1990: 38–39). End In Assertion (iii) the condition Pr(Z > 0) = 1 is necessary. For instance, let the joint distributions of (Z, X) and (Z, Y ) and their marginals be as reported in Table 1.11. It is easy to see that Z is independent of X and Y , respectively, and ZX d d / Y. = ZY , but X =

1.3 Beta and inverted beta distributions Definition 1.1 A random variable Z is said to follow a beta distribution with parameters a1 , a2 > 0, if it has the pdf Beta(z|a1 , a2 ) =

za1 −1 (1 − z)a2 −1 , B(a1 , a2 )

0 < z < 1,

(1.5)

where B(a1 , a2 ) = (a1 )(a2 )/ (a1 + a2 ) denotes the beta function. We will write Z ∼ Beta(a1 , a2 ). ¶

DIRICHLET AND RELATED DISTRIBUTIONS

4 3 2

Density

0.8

0

0.0

1

0.4

Density

1.2

5

14

0.0

0.2

0.4

0.6

0.8

1.0

0.0

0.2

0.4

0.6

0.8

1.0

0.6

0.8

1.0

(ii)

8

10

0

0

2

2

4

6

Density

4

Density

6

12

(i)

0.0

0.2

0.4

0.6

0.8

1.0

0.0

0.2

0.4 (iv)

(iii)

Figure 1.1 Plots of the densities of Z ∼ Beta(a1 , a2 ) with various parameter values (i) a1 = a2 = 0.9; (ii) a1 = a2 = 20; (iii) a1 = 0.5 and a2 = 2; (iv) a1 = 5 and a2 = 0.4.

The beta densities with various parameters are shown in Figure 1.1. Alternatively, the beta distribution can be defined by two independent gamma variables, as shown in the following theorem. Theorem 1.1 Z ∼ Beta(a1 , a2 ) if and only if d

Z=

Y1 , Y1 + Y2

where Yi ∼ Gamma(a1 , β), i = 1, 2, and Y1 ⊥ ⊥ Y2 .

(1.6) ¶
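The representation (1.6) also gives an immediate way to simulate beta variates from gamma variates. The following sketch in the S-plus/R style of Appendix A (the parameter values and the sample size are arbitrary illustrative choices) compares the sample mean of Y1/(Y1 + Y2) with that of draws from the built-in generator rbeta.

# Simulating Z ~ Beta(a1, a2) via the representation Z =d Y1/(Y1 + Y2) in (1.6)
a1 <- 2; a2 <- 5; N <- 10000         # arbitrary illustrative values
y1 <- rgamma(N, a1)                  # Y1 ~ Gamma(a1, 1)
y2 <- rgamma(N, a2)                  # Y2 ~ Gamma(a2, 1), independent of Y1
z  <- y1 / (y1 + y2)                 # stochastic representation (1.6)
mean(z)                              # close to a1/(a1 + a2) = 2/7
mean(rbeta(N, a1, a2))               # direct beta sampling for comparison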


If Z ∼ Beta(a1 , a2 ), then the rth moment of Z is given by E(Zr ) =

(a1 + r) (a1 + a2 ) B(a1 + r, a2 ) = . B(a1 , a2 ) (a1 ) (a1 + a2 + r)

(1.7)

In particular, we have a1 , a1 + a 2 a1 (a1 + 1) E(Z2 ) = , and (a1 + a2 )(a1 + a2 + 1) a1 a2 Var(Z) = . (a1 + a2 )2 (a1 + a2 + 1) E(Z) =

Other properties for Z ∼ Beta (a1 , a2 ) are reported as follows: (1) Beta(1, 1) = U(0, 1); (2) 1 − Z =d Beta(a2 , a1 ); and (3) the kth order statistic for a random sample of size n from U(0, 1) follows Beta(k, n − k + 1). It is noteworthy that the beta distribution is the conjugate prior for the binomial likelihood. It includes the Jeffreys’ invariant noninformative prior distribution with a1 = a2 = 1/2. The built-in S-plus function, rbeta(N, a1 , a2 ), can be used to generate N i.i.d. samples from Beta(a1 , a2 ). Finally, we notice that there is a close link between beta and the inverted beta distribution1 (Aitchison, 1986: 60–61). Definition 1.2 A non-negative random variable W is said to follow an inverted beta distribution with parameters a1 , a2 > 0, if its density is IBeta(w|a1 , a2 ) =

1 wa1 −1 , · B(a1 , a2 ) (1 + w)a1 +a2

w > 0.

(1.8) ¶

We will write W ∼ IBeta(a1 , a2 ).

It can be seen that W ∼ IBeta(a1 , a2 ) has the following stochastic representation: d

W=

Z d Y1 = , 1−Z Y2

(1.9)

where Z ∼ Beta(a1 , a2 ), Yi ∼ Gamma(ai , β), i = 1, 2, and Y1 ⊥ ⊥ Y2 . 1

Sometimes, the inverted beta distribution is called a Type II beta distribution (Thomas and George, 2004) or Pearson Type VI distribution (Johnson and Kotz, 1970b: 51).
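In the same spirit, the representation (1.9) provides a simple generator for the inverted beta distribution. The short S-plus/R-style sketch below (with arbitrary illustrative parameter values) draws W as a ratio of independent gamma variates; for a2 > 1 the sample mean can be checked against a1/(a2 − 1), the mean implied by the density (1.8).

# Simulating W ~ IBeta(a1, a2) via the representation W =d Y1/Y2 in (1.9)
a1 <- 2; a2 <- 4; N <- 10000         # arbitrary illustrative values
y1 <- rgamma(N, a1)                  # Y1 ~ Gamma(a1, 1)
y2 <- rgamma(N, a2)                  # Y2 ~ Gamma(a2, 1), independent of Y1
w  <- y1 / y2                        # stochastic representation (1.9)
mean(w)                              # close to a1/(a2 - 1) = 2/3 when a2 > 1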


1.4 Some useful identities and integral formulae 1.4.1 Partial-fraction expansion Several useful formulae can be obtained by combining the surface integral formula with the partial-fraction identity (Hazewinkel, 1990: 311)  N(bk )k (b1 , . . . , bn ) P(x) = , (x − b1 ) · · · (x − bn ) x − bk k=1 n

(1.10)

where bj = / bk for j, k = 1, . . . , n, j = / k, ˆ k (b1 , . . . , bn ) =

n

1 , b − bj j= / k, j=1 k

(1.11)

and P(x) is an r-order polynomial, 0 ≤ r ≤ n − 1. In particular, taking P(x) = −x in (1.10) and setting x = 0, we have n 

k (b1 , . . . , bn ) = 0.

(1.12)

k=1

1.4.2 Cambanis–Keener–Simons integral formulae Let h(·) and g(·) be measurable functions on R+ and R, respectively, and let g(·) have the (n − 1)th absolutely continuous derivatives. It can be shown by induction and through the use of (1.12) that (Cambanis et al., 1983: 225) 

h

 n

Rn+



xj

g(n−1)

 n

j=1

 n  sj xj dx = k (s1 , . . . , sn )

j=1



0

k=1

h(u) g(usk ) du. (1.13)

An alternative version of (1.13) is given by 

h Vn (c)

 n



g(n−1)

xj

j=1

 n

 c n  sj xj dx = k (s1 , . . . , sn ) h(u) g(usk ) du,

j=1

0

k=1

(1.14) where Vn (c) is defined by (2.2). Another important formula is 

h Vn (c)

 n j=1

xj dx =

1 (n − 1)!



c

h(u)un−1 du, 0

which can be obtained by using (5.10) of Fang et al. (1990: 115).

(1.15)


1.4.3 Hermite–Genocchi integral formula The classical Hermite–Genocchi formula (Karlin et al., 1986: 71) can be stated as

  n n √  (n−1) g sj xj dx = n g(sk )k (s1 , . . . , sn ), (1.16) Tn

j=1

k=1

where Tn ≡ Tn (1), Tn (c) is defined by (2.1), dx denotes the volume element of Tn , g(·) has the same meaning as in (1.13), and k (s1 , . . . , sn ) is defined by (1.11) and (1.12).

1.5 The Newton–Raphson algorithm

Let Yobs denote the observed data and θ the parameter vector of interest. We assume that the log-likelihood function ℓ(θ|Yobs) is twice continuously differentiable and concave. The objective is to find the MLE of the parameter vector: θ̂ = arg max ℓ(θ|Yobs).

Let ∇ denote the derivative operator. In mathematics, ∇ℓ is called the gradient vector and ∇²ℓ the Hessian matrix. When applied to the likelihood as a function of θ, ∇ℓ is called the score vector, the matrix Iobs(θ) = −∇²ℓ(θ|Yobs) is called the observed information matrix, and the matrix of expected values

    J(θ) = E{Iobs(θ)} = −∫ ∇²ℓ(θ|Yobs) · f(Yobs|θ) dYobs

is called the Fisher/expected information matrix. The iteration scheme of Newton's method (in statistics usually called the Newton–Raphson algorithm) is

    θ^(t+1) = θ^(t) + Iobs^(−1)(θ^(t)) ∇ℓ(θ^(t)|Yobs).    (1.17)

The Fisher scoring algorithm (Rao, 1973) is obtained by replacing the observed information matrix with the expected information matrix in the Newton–Raphson algorithm,

    θ^(t+1) = θ^(t) + J^(−1)(θ^(t)) ∇ℓ(θ^(t)|Yobs).    (1.18)

Since J(θ) can alternatively be represented as the variance–covariance matrix Var{∇ℓ(θ|Yobs)}, the matrix J(θ) is positive semidefinite. So, in practice, the inverse of J(θ) is more likely to exist than the inverse of Iobs(θ). Another advantage of the scoring algorithm is that, when the iteration converges, the inverse matrix J^(−1)(θ̂) immediately provides the estimated matrix of asymptotic variances and covariances of the MLE θ̂. It is important to note that neither of these two methods is an ascent algorithm guaranteeing ℓ(θ^(t+1)|Yobs) > ℓ(θ^(t)|Yobs) at each iteration. As in all iterative methods, proper monitoring of the likelihood values achieved by the sequence of estimates is always advisable for the purpose of terminating the iteration.
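As a concrete sketch of iteration (1.17), the following Python function (not from the book; the interface is an assumption) performs Newton–Raphson updates from user-supplied score and observed-information functions, with step-halving so the monitored log-likelihood never decreases. Passing the expected information in place of the observed information gives Fisher scoring (1.18).

```python
import numpy as np

def newton_raphson(theta0, loglik, score, obs_info, max_iter=100, tol=1e-8):
    """Maximize loglik by iteration (1.17), monitoring the likelihood as advised above."""
    theta = np.asarray(theta0, dtype=float)
    ll = loglik(theta)
    for _ in range(max_iter):
        step = np.linalg.solve(obs_info(theta), score(theta))   # I_obs^{-1} * score
        damp, new_theta, new_ll = 1.0, theta, ll
        while damp > 1e-10:                                     # step-halving safeguard
            new_theta = theta + damp * step
            new_ll = loglik(new_theta)
            if new_ll >= ll:
                break
            damp /= 2.0
        if abs(new_ll - ll) < tol:
            return new_theta
        theta, ll = new_theta, new_ll
    return theta
```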

1.6 Likelihood in missing-data problems

Missing-data problems (MDPs) refer to situations where values are actually missing from the planned data set, or where the original data set is complete but, for easier modeling or for new computational strategies, we deliberately augment the model with certain data that are not available. Incomplete data arise in many settings. For example, returned questionnaires in a survey may include nonresponses to questions that the respondents did not wish to answer. In industrial experiments, the results of certain runs may not be available owing to uncontrollable events in the process that are not part of the design. A pharmaceutical experiment on the after-effects of a toxic product may skip some doses for a given patient. In drug experiments, a mouse may die before the end of the study. Augmented-data problems include the latent-class model, the mixture model, some constrained-parameter models, and so on. The frequentist question is how to handle the likelihood function in such problems.

1.6.1 Missing-data mechanism

Let Yobs denote the set of observed values and Ymis the missing values, while Ycom = {Yobs, Ymis} denotes the union of the two sets with a likelihood function for the parameter vector θ. In some problems we can define the associated missing-data indicator M, whose value is unity if the corresponding element in Ycom is in Yobs and zero if it is in Ymis. Suppose the joint distribution of (Ycom, M) is defined by the following factorization of the pdf with an additional parameter ψ,

    f(Ycom, M|θ, ψ) = f(Ycom|θ) · f(M|Ycom, ψ).

Then the missing-data mechanism (Rubin, 1976) is characterized by the conditional distribution of M given Ycom, which is indexed by a separate unknown parameter vector ψ that is unrelated to θ. If that distribution does not depend on Ycom at all, that is,

    f(M|Yobs, Ymis, ψ) = f(M|ψ),    (1.19)

we have the case of missing completely at random (MCAR): the likelihood of θ is the same without the data M, and M provides no information gain about θ. If the distribution of the missing-data mechanism depends on the observed values and the parameter ψ, but not on the missing values,

    f(M|Yobs, Ymis, ψ) = f(M|Yobs, ψ),    (1.20)

it is called the case of missing at random (MAR). If we can integrate out Ymis, the distribution of the observed data is

    f(Yobs, M|θ, ψ) = ∫ f(Yobs, Ymis|θ) · f(M|Yobs, Ymis, ψ) dYmis,

and, under MAR, we have the observed likelihood

    f(Yobs, M|θ, ψ) = f(M|Yobs, ψ) · f(Yobs|θ).

In some other problems, we may not be able to handle the likelihood in this way.

1.6.2 The expectation–maximization (EM) algorithm

Consider the likelihood L(θ|Yobs, z), where Yobs is available but z is not. Suppose either we cannot integrate out z, or that z consists of strategically augmented latent variables introduced for easier modeling of the likelihood. In maximizing the log-likelihood ℓ(θ|Yobs, z) = log L(θ|Yobs, z), there are folklore (or intuitive) versions of the EM algorithm consisting of the iterative alternation of an expectation step (E-step) and a maximization step (M-step). Let θ^(0) be an initial estimate; one version computes z^(0) = E(z|Yobs, θ^(0)) at the E-step and then maximizes ℓ(θ|Yobs, z^(0)) at the M-step to obtain a new estimate θ^(1). The iteration is repeated until convergence according to certain criteria. If ℓ(θ|Yobs, z) depends on z through K functions of z, say ℓ(θ|Yobs, z) = g(θ|Yobs, h1(z), . . . , hK(z)), then a variant finds hk^(0) = E{hk(z)|Yobs, θ^(0)} at the E-step and uses them directly as substitutes for hk(z), instead of using hk(z^(0)), for maximization in the M-step. The properties of likelihood ascent from θ^(0) to θ^(1) and eventual convergence had not been established in general for such folklore versions of EM until Dempster et al. (1977) proposed a general formulation intended to assure such properties, as follows.

Let θ^(t) be the current estimate of the MLE θ̂. The E-step is to compute the Q function defined by

    Q(θ|θ^(t)) = E{ℓ(θ|Yobs, z) | Yobs, θ^(t)} = ∫_Z ℓ(θ|Yobs, z) · f(z|Yobs, θ^(t)) dz,    (1.21)

and the M-step is to maximize Q with respect to θ to obtain

    θ^(t+1) = arg max Q(θ|θ^(t)).    (1.22)

The two-step process is repeated until convergence occurs.


Wu (1983) made important contributions in clarifying the convergence properties of this version and provided a list of confirmed properties. For an excellent discussion of EM and more references, see Meng and van Dyk (1997).

Nowadays, the practical operation of EM seems to be as follows. For the likelihood function of the complete data, first identify those functions of z appearing in the expressions of the MLE if explicitly available, or in the MLE equation ∇ℓ(θ|Yobs, z) = 0, or in the mechanism that does the maximization. Then, in the E-step we replace these functions by their expectations conditional on (Yobs, θ^(t)). The M-step returns the new MLE θ^(t+1) based on the substitutes of the functions. The first of the following examples demonstrates the procedure when the MLE expressions are explicitly available. The second example demonstrates the fact that the system of MLE equations, ∇ℓ(θ|Yobs, z) = 0, can be presented in different but equivalent ways, so that we have different sets of functions of z to substitute and thus the sequences of estimates are different. So whatever the theory says, it is always a shrewd practice to keep track of the likelihood values achieved by the sequence of estimates for a particular choice of the substitutes.

Example 1.13 (Two-parameter multinomial model). To illustrate the EM algorithm, we consider a two-parameter multinomial model with nine cells,

    Multinomial(N; a1θ1, b1, a2θ1, b2, a3θ2, b3, a4θ2, b4, cθ3),

where the ai and bi are positive and known, 0 ≤ c = 1 − Σ_{i=1}^4 bi = a1 + a2 = a3 + a4 ≤ 1, and the unknown positive parameters satisfy θ1 + θ2 + θ3 = 1. Some frequencies are missing and the available data consist of the subtotals of the first four consecutive and mutually exclusive pairs plus the frequency in the last cell; the complete data are {Yobs, z} = (z1, Y1 − z1, . . . , z4, Y4 − z4, Y5). By the amalgamation property of the multinomial distribution, the likelihood function for the observables {Y1, Y2, Y3, Y4, Y5} is not easy to maximize:

    L(θ|Yobs) ∝ θ3^{y5} Π_{i=1}^2 (aiθ1 + bi)^{yi} Π_{i=3}^4 (aiθ2 + bi)^{yi},   θ ∈ T3.    (1.23)

If z is available, however, the likelihood function is very easy to maximize:

    L(θ|Yobs, z) ∝ θ1^{z1+z2} θ2^{z3+z4} θ3^{y5},   θ ∈ T3.    (1.24)

Indeed, the MLEs have explicit expressions in terms of the complete data,

    θ̂1 = (z1 + z2)/(y5 + z+),   θ̂2 = (z3 + z4)/(y5 + z+),   θ̂3 = y5/(y5 + z+),    (1.25)

where z+ = Σ_{i=1}^4 zi. And for imputation of the missing data z, we have the conditional predictive density

    f(z|Yobs, θ) = Π_{i=1}^4 Binomial(zi|yi, pi),    (1.26)


where

    pi = aiθ1/(aiθ1 + bi), i = 1, 2;   pi = aiθ2/(aiθ2 + bi), i = 3, 4.    (1.27)

So, the E-step of the EM algorithm computes E(zi|Yobs, θ) = yi pi, with pi defined by (1.27), and the M-step updates (1.25) by replacing {zi}_{i=1}^4 with E(zi|Yobs, θ). For illustrative purposes, we consider the data set reported in Gelfand and Smith (1990):

    (y1, . . . , y5) = (14, 1, 1, 1, 5),   a1 = · · · = a4 = 0.25,
    b1 = 1/8,   b2 = b3 = 0,   b4 = 3/8.    (1.28)

Using θ (0) = (1/3, 1/3, 1/3) as the initial values, the EM algorithm converged in eight iterations. The resultant MLE is given by θˆ = (0.585900, 0.0716178, 0.342482).
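The E- and M-steps of Example 1.13 are simple enough to code directly. The following Python sketch (not from the book) implements (1.25)–(1.27) for the data in (1.28); up to rounding it should reproduce the MLE quoted above.

```python
import numpy as np

y = np.array([14.0, 1.0, 1.0, 1.0, 5.0])      # observed counts (y1,...,y5) from (1.28)
a = np.array([0.25, 0.25, 0.25, 0.25])
b = np.array([1/8, 0.0, 0.0, 3/8])

theta = np.array([1/3, 1/3, 1/3])             # theta^(0)
for _ in range(200):
    # E-step: E(z_i | Y_obs, theta) = y_i * p_i with p_i from (1.27)
    num = np.where(np.arange(4) < 2, a * theta[0], a * theta[1])
    p = num / (num + b)
    ez = y[:4] * p
    # M-step: plug the expected z_i into the explicit MLEs (1.25)
    zplus = ez.sum()
    new_theta = np.array([ez[0] + ez[1], ez[2] + ez[3], y[4]]) / (y[4] + zplus)
    if np.max(np.abs(new_theta - theta)) < 1e-10:
        break
    theta = new_theta

print(theta)   # approximately (0.5859, 0.0716, 0.3425)
```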



Example 1.14 (Censored gamma data). Let Y1, . . . , Ym ~iid Gamma(α, β) and suppose we are given a set of upper-censored observations {y1, . . . , yr; c, . . . , c}, where each yi (< c) is not censored and not necessarily associated with Yi by subscript i, while each c represents a missing observation greater than or equal to c. The log-likelihood without any censoring (i.e., with {yr+1, . . . , ym} also known) would be

    ℓ(α, β|Ycom) = (α − 1) Σ_{i=1}^m log yi − m ȳ β + m{α log β − log Γ(α)},

where ȳ = m^(−1) Σ_{i=1}^m yi. Taking first derivatives with respect to α and β, we have the system of two equations

    β = α/ȳ,   Γ′(α)/Γ(α) = (1/m) Σ_{i=1}^m log yi + log β,

the solution of which gives the MLEs without censoring. Note that the digamma function ψ(α) = Γ′(α)/Γ(α) and the trigamma function ψ′(α) are available in most computing software, so the equations can be solved numerically. Now, with censoring, there are two functions of the missing data to be substituted in the EM algorithm, namely Σ_{i=r+1}^m yi in ȳ and Σ_{i=r+1}^m log yi in Σ_{i=1}^m log yi. Thus, at each iteration with current MLE (α^(t), β^(t)), a missing yi in Σ_{i=r+1}^m yi is substituted by E(Y|Y > c) and a missing log yi in Σ_{i=r+1}^m log yi is substituted by E(log Y|Y > c), where Y ∼ Gamma(α^(t), β^(t)), before numerically solving this system of equations to get the new (α^(t+1), β^(t+1)).


An equivalent system for complete data is obtained by substituting the first equation into the second:

    Γ′(α)/Γ(α) − log α = (1/m) Σ_{i=1}^m log yi − log ȳ,   β = α/ȳ.

The equation for α can be solved numerically, say by the Newton–Raphson method. Once α̂ is found, β̂ is obtained from the second equation. With censoring, a new function of the missing data, different from those in the aforesaid system, is log ȳ. Rigorously speaking, this new function should be substituted by

    E{ log( Σ_{i=1}^r yi + Σ_{i=r+1}^m Yi ) | Yi > c } − log m,

where Yi ~iid Gamma(α^(t), β^(t)). It is arguable that one may simply substitute ȳ in log ȳ, ignoring this new function. In the same spirit, one may just substitute for the missing Yi in all functions of the missing data in any of the above routes of maximization, effectively going back to the folklore version. In general practice, only by comparing the trails of likelihood values achieved by different sequences of estimates can we settle which of the procedures amounts to minimum work and yet achieves the same level of accuracy in maximization.
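The conditional expectations E(Y|Y > c) and E(log Y|Y > c) needed in the E-step of Example 1.14 have no elementary closed form, but they are easy to evaluate numerically. The sketch below (Python with scipy; the function name and parameter values are illustrative assumptions, not the book's) computes both by one-dimensional quadrature.

```python
import numpy as np
from scipy import stats, integrate

def truncated_gamma_moments(alpha, beta, c):
    """E(Y | Y > c) and E(log Y | Y > c) for Y ~ Gamma(alpha, rate=beta)."""
    dist = stats.gamma(a=alpha, scale=1.0 / beta)
    tail = dist.sf(c)                                   # Pr(Y > c)
    e_y = integrate.quad(lambda y: y * dist.pdf(y), c, np.inf)[0] / tail
    e_logy = integrate.quad(lambda y: np.log(y) * dist.pdf(y), c, np.inf)[0] / tail
    return e_y, e_logy

# Example usage with hypothetical current estimates and censoring point
print(truncated_gamma_moments(alpha=2.0, beta=1.5, c=2.5))
```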

1.6.3 The expectation/conditional maximization (ECM) algorithm The vital property for the EM iteration defined by (1.22) is that an increase in Q(θ|θ (t) ) forces an increase in the log-likelihood (θ|Yobs ) of the observed data. If the M-step does not have an explicit solution, we can solve it numerically, but at the risk of losing this property by an approximated θ (t+1) . Meng and Rubin (1993) proposed an expectation/conditional maximization (ECM) algorithm so that an otherwise complicated M-step is replaced by several computationally simpler conditional maximization steps, called the CM-steps, while the ascending property is retained. Although requiring more iterations than the EM algorithm, ECM is possibly faster in total computer time. The reader is referred to the aforementioned paper for more details.

1.6.4 The EM gradient algorithm

Another numerical M-step that preserves the ascent property under a mild condition is the EM gradient algorithm proposed by Lange (1995a). Let ∇Q(θ^(t)|θ^(t)) be the gradient vector of Q(θ|θ^(t)) evaluated at the current estimate θ^(t), ∇ℓ(θ^(t)|Yobs) the gradient vector of ℓ(θ|Yobs) evaluated at θ^(t), and ∇²Q(θ^(t)|θ^(t)) the Hessian matrix of Q(θ|θ^(t)) evaluated at θ^(t). The EM gradient algorithm updates the estimate as

    θ^(t+1) = θ^(t) − [∇²Q(θ^(t)|θ^(t))]^(−1) ∇Q(θ^(t)|θ^(t))
            = θ^(t) − [∇²Q(θ^(t)|θ^(t))]^(−1) ∇ℓ(θ^(t)|Yobs).    (1.29)

Because ℓ(θ|Yobs) − Q(θ|θ^(t)) has its minimum at θ = θ^(t), the equality ∇Q(θ^(t)|θ^(t)) = ∇ℓ(θ^(t)|Yobs) holds whenever θ^(t) is an interior point of the parameter domain (Dempster et al., 1977). If ∇²Q(θ^(t)|θ^(t)) is negative definite, the θ^(t+1) in (1.29) will ascend the likelihood. For the case of the Dirichlet distribution, as an example, the Hessian matrix ∇²Q(θ|θ^(t)) is indeed negative definite. This fact in turn implies strict concavity of Q(θ|θ^(t)) and the uniqueness of the maximum point θ^(t+1).

1.7 Bayesian MDPs and inversion of Bayes’ formula In dealing with Bayesian MDPs, the counterpart to the EM algorithm is the data augmentation (DA) algorithm proposed in the seminal paper of Tanner and Wong (1987). The paper was discussed by prominent experts in the field, including two from Harvard University, and has been viewed by many researchers as a key milestone in the study of Bayesian MDPs. It is elaborated in more detail in the monograph by Tanner (1993, 1996). Other references relevant to this development are Gelfand and Smith (1990), Wei and Tanner (1990a–c), Schervish and Carlin (1992), Wong and Li (1992), and Liu et al. (1994, 1995).

1.7.1 The data augmentation (DA) algorithm

In order to delineate more clearly the DA algorithm, and its relationship to the inversion of Bayes' formula in the next subsection, we deviate from the vague notation of a generic function f(·), the meaning of which is to be indicated by the arguments we put in. So let p(θ|Yobs, z) denote the posterior density if both Yobs and z are available, π(θ|Yobs) denote the observed posterior density since z is not available, f(z|Yobs, θ) denote the conditional density of z given the other two, and h(z|Yobs) denote the predictive density of z. The DA algorithm of Tanner and Wong (1987) is to find the observed posterior density π(θ|Yobs) in terms of f(z|Yobs, θ) and p(θ|Yobs, z) by virtue of the 'successive substitution method' of functional analysis, to be implemented with Monte Carlo approximation in each substitution step. The reader is referred to Tanner (1996: Chapter 5) for more details.


Under the positivity assumption that all density functions are positive in the product space of the arguments, the following holds:

    π(θ|Yobs) = ∫ p(θ|Yobs, z) h(z|Yobs) dz    (1.30)
              = ∫ p(θ|Yobs, z) { ∫ f(z|Yobs, φ) π(φ|Yobs) dφ } dz.    (1.31)

Interchanging the order of integration in the last identity gives an integral equation for the unknown function g(·) = π(·|Yobs) with a given kernel function K(·, ·),

    g(θ) = ∫ K(θ, φ) g(φ) dφ,    (1.32)

where

    K(θ, φ) = ∫ p(θ|Yobs, z) f(z|Yobs, φ) dz.    (1.33)

With an initial function g0(·), we have a sequence of functions gk(·) generated by the successive substitution

    g_{k+1}(θ) = ∫ K(θ, φ) gk(φ) dφ,    (1.34)

which we wish to converge to g(·) as k → ∞. The paper proved the convergence of the sequence assuming the following set of conditions, in addition to the positivity assumption p(θ|Yobs, z) > 0 and f(z|Yobs, φ) > 0:

(1) K(θ, φ) is uniformly bounded;
(2) K(θ, φ) is equi-continuous in θ;
(3) for any θ0 in the parameter space, there exists an open neighborhood U of θ0 such that K(θ, φ) > 0 for all θ, φ ∈ U;
(4) the initial function g0 is such that sup_θ g0(θ)/π(θ|Yobs) < ∞.

The analytic integration in each substitution cannot be done in practice and we have to use Monte Carlo integration. In line with the proof of convergence, the Monte Carlo integration should follow (1.34) together with (1.33) as follows.

• Monte Carlo substitution according to (1.34) and (1.33). Draw θ1, . . . , θM from gk(θ) – the current estimate of π(θ|Yobs). For each θi, draw zij ∼ f(z|Yobs, θi), j = 1, . . . , N. Update the estimate of π(θ|Yobs) by

    g_{k+1}(θ) = (1/NM) Σ_{i=1}^M Σ_{j=1}^N p(θ|Yobs, zij).    (1.35)

The authors of the paper recommended an alternative Monte Carlo integration based on (1.30) instead, using conditional sampling in the sequence of gk(θ) followed by f(z|Yobs, θ) for sampling the predictive density h(z|Yobs) in the current substitution:

(a) Imputation step. For j = 1, . . . , m, draw θ^(j) from gk(θ) – the current estimate of π(θ|Yobs). For each θ^(j), draw z^(j) from f(z|Yobs, θ^(j)).
(b) Posterior step. Update the estimate of π(θ|Yobs) by

    g_{k+1}(θ) = (1/m) Σ_{j=1}^m p(θ|Yobs, z^(j)).    (1.36)

When the two functions gk(·) and g_{k+1}(·) are judged to be close enough, as illustrated in Tanner (1996: Section 5.4), the procedure terminates. Note that the first Monte Carlo method (1.35) becomes the second one (1.36) if N = 1. It would be of interest to study the difference in results between the two methods for m = M × N and N ≠ 1, if the algorithm is to be used.

1.7.2 True nature of Bayesian MDP: inversion of Bayes' formula

As we have seen from the DA algorithm, the aim of Bayesian MDPs is to find the observed posterior density π(θ|Yobs) based on f(z|Yobs, θ) and p(θ|Yobs, z). Since Yobs is a given constant throughout, acting like an indexing parameter in the family of joint distributions for (θ, z), we may drop the given Yobs in all density functions in solving the problem, without loss of generality at the conceptual level. So in this abstraction we are going in the opposite direction of Bayes' theorem – finding π(θ) (the prior) from p(θ|z) (the posterior) with f(z|θ) (the likelihood), as depicted in the following diagram:

    [ Given the likelihood f(z|θ) ]
    prior π(θ)  ————  Bayes' theorem or Bayes' formula  ————>  posterior p(θ|z)
    prior π(θ)  <————  Converse of Bayes' theorem or inversion of Bayes' formula  ————  posterior p(θ|z)


That is, the aim of a Bayesian MDP is equivalent to finding the converse of Bayes' theorem; or a Bayesian MDP is all about inversion of Bayes' formula (IBF). In this new context, let us explore the basic results without Yobs, as follows. For any fixed θ, integrate both sides of the basic identity

    π(θ) f(z|θ) = f(z) p(θ|z)    (1.37)

with respect to z in S(z|θ), the conditional support of z given θ,

    π(θ) = { ∫_{S(z|θ)} f(z) dz } / { ∫_{S(z|θ)} [f(z|θ)/p(θ|z)] dz }.    (1.38)

If θ0 satisfies S(z|θ0) = S(z) (i.e., the projection of S(z|θ0) on the space of z equals S(z)), the identity (1.38) reduces to

    π(θ0) = { ∫_{S(z)} [f(z|θ0)/p(θ0|z)] dz }^(−1),    (1.39)

which is called the point-wise IBF for θ0. If z0 exists with S(θ|z0) = S(θ), by symmetry we have the point-wise IBF for z0,

    f(z0) = { ∫_{S(θ)} [p(θ|z0)/f(z0|θ)] dθ }^(−1),    (1.40)

and hence we have the function-wise IBF for all θ in S(θ),

    π(θ) = { ∫_{S(θ)} [p(θ|z0)/f(z0|θ)] dθ }^(−1) · p(θ|z0)/f(z0|θ),    (1.41)

which must be the same for any other z0 satisfying S(θ|z0 ) = S(θ). It seems that Ng (1995a,b, 1997a,b) was the first to recognize this underlying nature of Bayesian MDPs and obtained the explicit inverse Bayes’ formulae for special positivity conditions. They turn out to be particular forms of the converse of Bayes’ theorem. Ng (1996) presented the main ideas of the inversion algorithm of Bayes’ formula for the haphazard patterns of positivity conditions, which was in print later (Ng, 2010a,b). Then he recognized that the inversion algorithm was squarely the Converse of Bayes’ Theorem, while other IBF results were just corollaries. So he presented this synthesis in the Academic Seminar on 20 October 2010, organized by the Hong Kong Statistical Society in celebration of the first World Statistics Day (www.hkss.org.hk/events/academic.htm). His interview by the Society after the presentation was reported in the Society’s Jan. 2011 Bulletin Vol. 33 No. 2, 17–28 (www.hkss.org.hk/Publication/Bulletins.htm).

1.7.3 Explicit solution to the DA integral equation

Transferring the notation for functions back to MDPs, π(θ) is now the density of θ prior to both Yobs and z, and p(θ|Yobs, z) is posterior to both Yobs and z, while π(θ|Yobs) is posterior to Yobs but prior to z. Under the positivity assumption of the DA algorithm, S(θ, z) = S(θ) × S(z), so S(z|θ) = S(z) for all θ and S(θ|z) = S(θ) for all z; hence, we can drop the integration region in the following, as in Section 1.7.2. The DA integral equation (1.31) has an explicit solution in point-wise integral form from (1.39),

    π(θ|Yobs) = { ∫ [f(z|Yobs, θ)/p(θ|Yobs, z)] dz }^(−1),    (1.42)

and the function-wise form with integration at any one point z due to (1.40),

    π(θ|Yobs) = { ∫ [p(θ|Yobs, z)/f(z|Yobs, θ)] dθ }^(−1) · p(θ|Yobs, z)/f(z|Yobs, θ).    (1.43)

By symmetry, we have the predictive density in point-wise IBF form,

    h(z|Yobs) = { ∫ [p(θ|Yobs, z)/f(z|Yobs, θ)] dθ }^(−1),    (1.44)

and in function-wise IBF form,

    h(z|Yobs) = { ∫ [f(z|Yobs, θ)/p(θ|Yobs, z)] dz }^(−1) · f(z|Yobs, θ)/p(θ|Yobs, z).    (1.45)

It is illustrative to cross-check by substitution that (1.42) and (1.43) satisfy (1.31). The following chain of equalities is obtained by substituting φ into (1.43), then substituting the resulting π(φ|Yobs) into the right-hand side of (1.31), canceling f(z|Yobs, φ) in the inside integral, using the point-wise form (1.44) to get the predictive density h(z|Yobs) in the second-to-last equality, and finally using (1.30):

    ∫ p(θ|Yobs, z) { ∫ f(z|Yobs, φ) π(φ|Yobs) dφ } dz
    = ∫ p(θ|Yobs, z) { ∫ [ ∫_{S(θ)} p(θ|Yobs, z)/f(z|Yobs, θ) dθ ]^(−1) p(φ|Yobs, z) dφ } dz
    = ∫ p(θ|Yobs, z) { ∫_{S(θ)} p(θ|Yobs, z)/f(z|Yobs, θ) dθ }^(−1) dz
    = ∫ p(θ|Yobs, z) h(z|Yobs) dz
    = π(θ|Yobs).

Now let us consider Bayesian analysis for Example 1.13, where an EM algorithm has been established to find the MLE of θ for the two-parameter multinomial model with missing data z.

Example 1.15 (Two-parameter multinomial model revisited). Given the prior distribution θ ∼ Dirichlet(α1, α2, α3), the posterior of θ given (Yobs, z) is then given


by

    p(θ|Yobs, z) = Dirichlet3(θ | z1 + z2 + α1, z3 + z4 + α2, y5 + α3).    (1.46)

The conditional predictive density for z is known from Example 1.13:

    f(z|Yobs, θ) = Π_{i=1}^4 (yi choose zi) [pi/(1 − pi)]^{zi} (1 − pi)^{yi}.    (1.47)

Obviously, the analytic substitution (1.34) with (1.33) cannot be done to find π(θ|Yobs). If the iterative Monte Carlo version (1.35) or (1.36) of the DA algorithm is to be used to approximate π(θ|Yobs), the first task is to find an initial function g0(θ) satisfying condition (4) of Section 1.7.1 in order to ensure convergence; that is,

    sup_θ g0(θ)/π(θ|Yobs) < ∞.    (1.48)

Since π(θ|Yobs) is unknown, good experience is vital in selecting g0(θ). The array of IBF formulae (1.42) to (1.45) provides useful alternatives to the DA algorithm. As z is discrete in this case and summation is always easier than integration, (1.42) is a direct way to find π(θ|Yobs). First note that

    f(z|Yobs, θ)/p(θ|Yobs, z)
    = B(α1 + z1 + z2, α2 + z3 + z4, α3 + y5) Π_{i=1}^4 (yi choose zi) ai^{zi} bi^{yi − zi}
      / [ θ1^{α1−1} θ2^{α2−1} θ3^{y5+α3−1} Π_{i=1}^2 (aiθ1 + bi)^{yi} Π_{i=3}^4 (aiθ2 + bi)^{yi} ],    (1.49)

where B(·, ·, ·) is the multivariate beta function. By (1.42) we have

    π(θ|Yobs) = C^(−1) θ1^{α1−1} θ2^{α2−1} θ3^{y5+α3−1} Π_{i=1}^2 (aiθ1 + bi)^{yi} Π_{i=3}^4 (aiθ2 + bi)^{yi},    (1.50)

where the normalizing constant C is defined as

    C = Σ_{k1=0}^{y1} · · · Σ_{k4=0}^{y4} B(α1 + k1 + k2, α2 + k3 + k4, α3 + y5) Π_{i=1}^4 (yi choose ki) ai^{ki} bi^{yi − ki}.    (1.51)

It is to be noted that (1.50) together with (1.51) introduces a new generalized Dirichlet distribution with four groups of parameters, (α, y, a, b). Multiplying (1.49) and (1.50), we obtain the predictive density

    h(z|Yobs) = C^(−1) B(α1 + z1 + z2, α2 + z3 + z4, α3 + y5) Π_{i=1}^4 (yi choose zi) ai^{zi} bi^{yi − zi},    (1.52)

which is a complex mixture of four binomial distributions, where ai ≥ 0, bi ≥ 0, a1 + a2 + Σ_{i=1}^4 bi = 1, and a3 + a4 + Σ_{i=1}^4 bi = 1.


Let us consider obtaining π(θ|Yobs) through the joint density of (θ, Yobs). If z is not missing but augmented, this simply means backtracking to the model of Yobs. But, in general, we may not be able to integrate (or sum) out the unobservable z. In this example, the density of Yobs given θ is

    g(Yobs|θ) = [(Σ_{i=1}^5 yi)! / Π_{i=1}^5 yi!] (cθ3)^{y5} Π_{i=1}^2 (aiθ1 + bi)^{yi} Π_{i=3}^4 (aiθ2 + bi)^{yi},    (1.53)

where c = 1 − Σ_{i=1}^4 bi, and the joint density of (θ, Yobs) is

    f(θ, Yobs) = g(Yobs|θ) × [Γ(Σ_{i=1}^3 αi) / Π_{i=1}^3 Γ(αi)] Π_{i=1}^3 θi^{αi−1}.    (1.54)

From this joint density, we have

    π(θ|Yobs) ∝ (1 − θ1 − θ2)^{y5+α3−1} Π_{i=1}^2 θi^{αi−1} Π_{i=1}^2 (aiθ1 + bi)^{yi} Π_{i=3}^4 (aiθ2 + bi)^{yi}.    (1.55)

Without the knowledge of C given by (1.51), which has been derived from the IBF procedure, we have to get the normalizing constant by the double integral of the right-hand side of (1.55) over the simplex {(θ1 , θ2 ): θ1 ≥ 0, θ2 ≥ 0, θ1 + θ2 ≤ 1}. This integral is not in the standard integral table and not easy. 
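Because z is discrete and finite here, the constant C in (1.51) is a finite sum and can be evaluated exactly. The sketch below (Python; the Dirichlet(1, 1, 1) prior is an assumption made only for illustration, as are the helper names) computes C for the data in (1.28), after which (1.50) and (1.52) are available in closed form.

```python
import numpy as np
from itertools import product
from scipy.special import gammaln, comb

y = np.array([14, 1, 1, 1, 5])
a = np.array([0.25, 0.25, 0.25, 0.25])
b = np.array([1/8, 0.0, 0.0, 3/8])
alpha = np.array([1.0, 1.0, 1.0])          # hypothetical prior parameters

def mv_beta(v):
    """Multivariate beta function B(v1, v2, v3)."""
    return np.exp(gammaln(v).sum() - gammaln(v.sum()))

def weight(k):
    """One term of (1.51): B(alpha1+k1+k2, alpha2+k3+k4, alpha3+y5) * prod C(yi,ki) a_i^ki b_i^(yi-ki)."""
    k = np.asarray(k)
    w = mv_beta(np.array([alpha[0] + k[0] + k[1], alpha[1] + k[2] + k[3], alpha[2] + y[4]]))
    for i in range(4):
        w *= comb(y[i], k[i]) * a[i] ** k[i] * b[i] ** (y[i] - k[i])   # 0**0 == 1 handles b_i = 0
    return w

C = sum(weight(k) for k in product(*[range(y[i] + 1) for i in range(4)]))
print(C)
```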

1.7.4 Sampling issues in Bayesian MDPs

Obtaining i.i.d. samples from π(θ|Yobs) is a useful method for constructing Bayesian posterior credible intervals or regions (the counterparts of confidence intervals (CIs) or regions) when other methods fail, especially with the fast computers available nowadays. If p(θ|Yobs, z) and f(z|Yobs, θ) can be easily sampled, the popular Gibbs sampling may be applied to the two blocks of random variables, (z|Yobs) and (θ|Yobs). We shall refer to such Gibbs sampling for the DA structure as 'two-block Gibbs sampling' or 'DA Gibbs sampling.' It is an iterative sampling scheme done in the following way:

The DA Gibbs sampling:
    I-step: draw z^(t+1) ∼ f(z|Yobs, θ^(t)).
    P-step: draw θ^(t+1) ∼ p(θ|Yobs, z^(t+1)).
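For Example 1.15 both conditionals are standard, so the I-step/P-step scheme is straightforward to sketch. The Python code below (not from the book; the Dirichlet(1, 1, 1) prior and run length are illustrative assumptions) alternates binomial imputation (1.27) with Dirichlet posterior draws (1.46).

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([14, 1, 1, 1, 5])
a = np.array([0.25, 0.25, 0.25, 0.25])
b = np.array([1/8, 0.0, 0.0, 3/8])
alpha = np.array([1.0, 1.0, 1.0])          # hypothetical prior

theta = np.array([1/3, 1/3, 1/3])
draws = []
for t in range(5000):
    # I-step: z_i | Y_obs, theta ~ Binomial(y_i, p_i) with p_i from (1.27)
    num = np.concatenate([a[:2] * theta[0], a[2:] * theta[1]])
    p = num / (num + b)
    z = rng.binomial(y[:4], p)
    # P-step: theta | Y_obs, z ~ Dirichlet(z1+z2+alpha1, z3+z4+alpha2, y5+alpha3), as in (1.46)
    theta = rng.dirichlet([z[0] + z[1] + alpha[0], z[2] + z[3] + alpha[1], y[4] + alpha[2]])
    draws.append(theta)

print(np.mean(draws[1000:], axis=0))       # posterior mean estimate after burn-in
```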

There are two issues on Gibbs sampling and other Markov Chain Monte Carlo (MCMC) methods that common users may not pay attention to. First, the state of the art for checking stationarity of a series of variates generated from iterative sampling is still not as satisfactory as checking convergence in iterative algorithms for approximating a number or a function. As rightly pointed out by Gelfand (2002), ‘in general, convergence can never be assessed, as comparison can be made only between different iterations of one chain or between different observed chains, but


never with the true stationary distribution.’ Second, even if a series is deemed to be stationary, the variates generated in the same series are not independent and so we should not use a segment of the variates as an i.i.d. sample; that is, n independent iterative series are needed to get a sample of size n. The last point is very different from a time-series model. A cautious view is that the MCMC methods are to be used only when there is no better alternative; for example see discussions in Evans and Swartz (1995, 2000) and Hobert and Casella (1996). The function-wise IBF in (1.43) and (1.45) can accommodate those sampling methods targeting a distribution with known density function, apart from the normalizing constant. These methods include the rejection sampling, adaptive rejection sampling (Gilks and Wild, 1992), Metropolis sampling, Metropolis–Hastings sampling, and the sampling/importance-resampling method (Rubin, 1987; 1988). Another important aspect of the function-wise IBF for π(θ|Yobs ) is its role in constructing what is called the highest posterior density (HPD) region. If we have an i.i.d. sample from a one-dimensional posterior distribution, it is very popular to form a CI of 95% by the lower 2.5 percentile and the 97.5 percentile. This is what we call an equal-tail CI. It is fine if the density is symmetric. For a skewed density, however, the equal-tail CI is much longer than an equal-height CI, the endpoints of which have the same height in density function; that is, the same value of density function. This fact is frequently overlooked in sampling applications. If a credible region is required for a vector parameter, the equal-tail method with Bonferroni adjustment will always end up with cubes or rectangular blocks, which have a far larger volume than an HPD region. Either a density or its cdf is needed to aid the construction of an equal-height CI or HPD; see Tanner (1996: Section 5.3). Obviously, a smoothed histogram based on the simulated sample or the empirical cdf is not as good as their analytic counterparts, if available. By symmetry, the above discussion can equally apply to the confidence regions, or intervals, of z. If h(z|Yobs ) in (1.45) is easier to sample than π(θ|Yobs ) in (1.43), we may use conditional sampling: if {z( ) }L =1 is an i.i.d. sample from h(z|Yobs ) and θ ( ) ∼ p(θ|Yobs , z( ) ) for each , then {θ ( ) }L1 are i.i.d. samples from the observed posterior distribution π(θ|Yobs ). When z is discrete and finite, the integral in (1.45) is a finite summation and h(z|Yobs ) is a probability mass function (pmf). So one can easily get the finitely many probabilities by choosing a particularly convenient θ 0 . Since sampling from a finite pmf is always possible, the aforesaid conditional sampling is to be used. The procedure for such discrete data is called exact IBF sampling in Tian et al. (2007).

1.8 Basic statistical distributions

1.8.1 Discrete distributions

(a) General finite distribution
Notation: X ∼ Finiten(x, p), where x = (x1, . . . , xn) and p = (p1, . . . , pn) ∈ Tn.
Density: Pr(X = xi) = pi, i = 1, . . . , n.
Moments: E(X) = Σ_{i=1}^n xi pi and Var(X) = Σ_{i=1}^n xi^2 pi − (Σ_{i=1}^n xi pi)^2.
Note: The discrete uniform distribution is the special case of the general finite distribution with pi = 1/n for all i.

(b) Hypergeometric distribution
Notation: X ∼ Hgeometric(m, n, k), where m, n, k are positive integers.
Density: Hgeometric(x|m, n, k) = (m choose x)(n choose k − x)/(m + n choose k), where x = max(0, k − n), . . . , min(m, k).
Moments: E(X) = km/N and Var(X) = kmn(N − k)/{N^2(N − 1)}, where N =def m + n.

(c) Poisson distribution
Notation: X ∼ Poisson(λ), where λ > 0.
Density: Poisson(x|λ) = λ^x e^{−λ}/x!, x = 0, 1, 2, . . . .
Moments: E(X) = λ and Var(X) = λ.
Properties: (i) If {Xi}_{i=1}^n ~ind Poisson(λi), then Σ_{i=1}^n Xi ∼ Poisson(Σ_{i=1}^n λi) and (X1, . . . , Xn) | Σ_{i=1}^n Xi = N ∼ Multinomialn(N, p), where p = (λ1, . . . , λn)/Σ_{i=1}^n λi.
(ii) The Poisson distribution is related to the gamma density by Σ_{x=k}^∞ Poisson(x|λ) = ∫_0^λ Gamma(y|k, 1) dy.

(d) Binomial distribution
Notation: X ∼ Binomial(n, p), where n is a positive integer and p ∈ (0, 1).
Density: Binomial(x|n, p) = (n choose x) p^x (1 − p)^{n−x}, x = 0, 1, . . . , n.
Moments: E(X) = np and Var(X) = np(1 − p).
Properties: (i) If {Xi}_{i=1}^n ~ind Binomial(ni, p), then Σ_{i=1}^n Xi ∼ Binomial(Σ_{i=1}^n ni, p).
(ii) The binomial and beta distributions have the relationship Σ_{x=0}^k Binomial(x|n, p) = ∫_0^{1−p} Beta(x|n − k, k + 1) dx, where 0 ≤ k ≤ n.
Note: When n = 1, the binomial distribution reduces to the Bernoulli distribution with probability parameter p, denoted by X ∼ Bernoulli(p).

(e) Multinomial distribution
Notation: x ∼ Multinomial(N; p1, . . . , pn) or x ∼ Multinomialn(N, p), where N is a positive integer and p = (p1, . . . , pn) ∈ Tn.
Density: Multinomialn(x|N, p) = (N choose x1, . . . , xn) Π_{i=1}^n pi^{xi}, where x = (x1, . . . , xn) ∈ Tn(N).
Moments: E(xi) = Npi, Var(xi) = Npi(1 − pi), and Cov(xi, xj) = −Npi pj.
Note: The binomial distribution is a special case of the multinomial distribution with n = 2.

1.8.2 Continuous distributions

(a) Uniform distribution
Notation: X ∼ U(a, b), a < b.
Density: U(x|a, b) = 1/(b − a), a < x < b.
Moments: E(X) = (a + b)/2 and Var(X) = (b − a)^2/12.
Properties: If Y ∼ U(0, 1), then X = a + (b − a)Y ∼ U(a, b).

(b) Beta distribution
Notation: X ∼ Beta(a, b), a > 0, b > 0.
Density: Beta(x|a, b) = x^{a−1}(1 − x)^{b−1}/B(a, b), 0 < x < 1.
Moments: E(X) = a/(a + b), E(X^2) = a(a + 1)/{(a + b)(a + b + 1)}, and Var(X) = ab/{(a + b)^2(a + b + 1)}.
Properties: If Y1 ∼ Gamma(a, 1), Y2 ∼ Gamma(b, 1), and Y1 ⊥⊥ Y2, then Y1/(Y1 + Y2) ∼ Beta(a, b).
Note: When a = b = 1, Beta(1, 1) = U(0, 1).

(c) Exponential distribution
Notation: X ∼ Exponential(β), where β > 0 is the rate parameter.
Density: Exponential(x|β) = β e^{−βx}, x > 0.
Moments: E(X) = 1/β and Var(X) = 1/β^2.
Properties: (i) If U ∼ U(0, 1), then −(log U)/β ∼ Exponential(β).
(ii) If {Xi}_{i=1}^n ~iid Exponential(β), then Σ_{i=1}^n Xi ∼ Gamma(n, β).

(d) Gamma distribution
Notation: X ∼ Gamma(α, β), where α > 0 is the shape parameter and β > 0 is the rate parameter.
Density: Gamma(x|α, β) = {β^α/Γ(α)} x^{α−1} e^{−βx}, x > 0.
Moments: E(X) = α/β and Var(X) = α/β^2.
Properties: (i) If X ∼ Gamma(α, β) and c > 0, then cX ∼ Gamma(α, β/c).
(ii) If {Xi}_{i=1}^n ~ind Gamma(αi, β), then Σ_{i=1}^n Xi ∼ Gamma(Σ_{i=1}^n αi, β).
(iii) Γ(α + 1) = αΓ(α), Γ(1) = 1, and Γ(1/2) = √π.
Note: Gamma(1, β) = Exponential(β).

(e) Inverse gamma distribution
Notation: X ∼ IGamma(α, β), where α > 0 is the shape parameter and β > 0 is the scale parameter.
Density: IGamma(x|α, β) = {β^α/Γ(α)} x^{−(α+1)} e^{−β/x}, x > 0.
Moments: E(X) = β/(α − 1) (if α > 1) and Var(X) = β^2/{(α − 1)^2(α − 2)} (if α > 2).
Properties: If X^{−1} ∼ Gamma(α, β), then X ∼ IGamma(α, β).
Note: IGamma(x|α, β) = Gamma(x^{−1}|α, β)/x^2.

(f) Chi-square distribution
Notation: X ∼ χ^2(ν) ≡ Gamma(ν/2, 1/2), where ν > 0 is the degrees of freedom.
Density: χ^2(x|ν) = {2^{−ν/2}/Γ(ν/2)} x^{ν/2−1} e^{−x/2}, x > 0.
Moments: E(X) = ν and Var(X) = 2ν.
Properties: (i) If Y ∼ N(0, 1), then X = Y^2 ∼ χ^2(1).

34

DIRICHLET AND RELATED DISTRIBUTIONS ind

(ii) If {Xi }ni=1 ∼ χ2 (νi ), then n 

 n Xi ∼ χ νi . 2

i=1

i=1

(g) t- or Student’s t-distribution X ∼ t(ν), where ν > 0 is the degree of freedom.

( ν+1 ) x2 −(ν+1)/2 2 1+ , −∞ < x < ∞. Density: t(x|ν) = √ ν πν ( 2ν ) ν Moments: E(X) = 0 (if ν > 1) and Var(X) = ν−2 (if ν > 2). Properties: Let Z ∼ N(0, 1), Y ∼ χ2 (ν), and Z ⊥ ⊥ Y , then Notation:

Z √ ∼ t(ν). Y/ν Note:

When ν = 1, t(ν) = t(1) is called the standard Cauchy distribution, whose mean and variance do not exist.

(h) F- or Fisher’s F-distribution X ∼ F (ν1 , ν2 ), where ν1 and ν2 are two degrees of freedom.

(ν1 /ν2 )ν1 /2 (ν1 /2)−1 ν1 x −(ν1 +ν2 )/2 Density: F (x|ν1 , ν2 ) = x 1+ , x > 0. B( ν21 , ν22 ) ν2 ν2 Moments: E(X) = (if ν2 > 2) and ν2 − 2 2 2ν2 (ν1 + ν2 − 2) Var(X) = (if ν2 > 4). ν1 (ν2 − 4)(ν2 − 2)2 ⊥ Y2 , then Properties: Let Yi ∼ χ2 (νi ), i = 1, 2, and Y1 ⊥

Notation:

Y1 /ν1 ∼ F (ν1 , ν2 ). Y2 /ν2 (i) Normal or Gaussian distribution X ∼ N(μ, σ 2 ), where −∞< μ < ∞ and σ 2 > 0. 2 1 (x − μ) Density: N(x|μ, σ 2 ) = √ exp − , −∞ < x < ∞. 2σ 2 2πσ Moments: E(X) = μ and Var(X) = σ 2 . ind Properties: (i) If {Xi }ni=1 ∼ N(μi , σi2 ), then Notation:

n  i=1

a i Xi ∼ N

 n i=1

ai μ i ,

n  i=1

ai2 σi2 .

INTRODUCTION

35

(ii) If X1 |X2 ∼ N(X2 , σ12 ) and X2 ∼ N(μ2 , σ22 ), then X1 ∼ N(μ2 , σ12 + σ22 ). (j) Multivariate normal or Gaussian distribution x ∼ Nn (μ, ) or N(μ, ), where  μ ∈ Rn and  > 0.  1 1 Density: Nn (x|μ, ) = √ exp − (x − μ)−1 (x − μ) , where 2 ( 2π )n ||1/2 x = (x1 , . . . , xn ) ∈ Rn . Moments: E(x) = μ and Var(x) = . Notation:

2

Dirichlet distribution Whenever a multivariate observation is a set of proportions, called compositional data by Aitchison (1986), the Dirichlet family of distributions is usually the first candidate employed for modeling the data. In Bayesian inference for multinomial data, the Dirichlet distribution is the conjugate prior distribution so that the posterior distribution is also a Dirichlet distribution. Applications of the distribution are so many and so diverse that here we can name only a few. For example, Wilks (1962) used it in theoretical analysis to derive the distribution function of a set of order statistics, Theil (1975) used it to model random rational behavior in consumption expenditures, and Spiegelhalter et al. (1994) used it to study the frequencies of congenital heart disease. In biology, the Dirichlet distribution is used to represent proportions of amino acids when modeling sequences with hidden Markov models (Sj¨olander et al., 1996) or with allelic frequencies (Laval et al., 2003). In textmining, it is adopted to model topic probabilities (Blei et al., 2006). In Section 2.1 we introduce the Dirichlet distribution through the density function and derive some basic properties, including mixed moments, stochastic representation, and the mode of the density. In Section 2.2 we discuss the corresponding marginal and conditional distributions. The survival function and the cumulative distribution function are developed in Section 2.3. Characteristic functions are studied in Section 2.4 and distributions for a linear function of a Dirichlet random vector are reviewed in Section 2.5. In Section 2.6 we provide several characterizations for the Dirichlet distribution. The maximum likelihood estimation via the Newton–Raphson algorithm and the EM gradient algorithm of the Dirichlet parameters is investigated in Section 2.7. The generalized method of moments estimation and the estimation based on linear models are discussed in Sections 2.8 and 2.9 respectively. The application of the Dirichlet distribution to the estimation of the ROC area in medical diagnostic tests is given in Section 2.10. Dirichlet and Related Distributions: Theory, Methods and Applications, First Edition. Kai Wang Ng, Guo-Liang Tian and Man-Lai Tang. © 2011 John Wiley & Sons, Ltd. Published 2011 by John Wiley & Sons, Ltd. ISBN: 978-0-470-68819-9

38

DIRICHLET AND RELATED DISTRIBUTIONS

2.1 Definition and basic properties Let c be a positive number. The n-dimensional closed simplex in Rn and (n − 1)dimensional open simplex in Rn−1 are defined by  Tn (c) =

(x1 , . . . , xn ): xi > 0, 1 ≤ i ≤ n,

n 



xi = c

(2.1)

i=1

and  Vn−1 (c) =

(x1 , . . . , xn−1 ): xi > 0, 1 ≤ i ≤ n − 1,

n−1 



xi < c

(2.2)

i=1

respectively. Furthermore, we define Tn = Tn (1) and Vn−1 = Vn−1 (1).

2.1.1 Density function and moments (a) Density function The Dirichlet distribution derives its name from the integral  Vn−1

n−1  i=1



xiai −1

1−

n−1 

an −1

xi

dx1 · · · dxn−1

i=1

n (ai ) , n = i=1  i=1 ai

(2.3)

which is commonly attributed to Dirichlet (1839), who studied a class of integrals arising in celestial mechanics. Definition 2.1 A random vector x = (x1 , . . . , xn ) ∈ Tn is said to have a Dirichlet distribution if the density of x−n = (x1 , . . . , xn−1 ) is

n n  a −1  i=1 ai Dirichletn (x−n |a) = n xi i , (a ) i i=1 i=1

x−n ∈ Vn−1 ,

(2.4)

where a = (a1 , . . . , an ) is a positive parameter vector. We will write x ∼ Dirichletn (a) on Tn or x−n ∼ Dirichlet(a1 , . . . , an−1 ; an ) on Vn−1 accordingly. ¶ When all ai → 0, the distribution becomes noninformative. When n = 2, the Dirichlet distribution Dirichlet(a1 ; a2 ) reduces to the beta distribution Beta(a1 , a2 ). When n = 3, the densities of the Dirichlet distribution with different parameters are shown in Figure 2.1.

39

Z -1 0 1 2 3 4 5

Z 0 0 0 10 20 30 4 5

DIRICHLET DISTRIBUTION

1

1 0.8

0.8 0.6

0 Y .4

0.2 0

0.2

0.4

0.6

0.8

1

0.6

0 Y .4

0.2

X

0

0.2

0.4

0.6

0.8

1

X

0

0

(ii)

Z 0 5 0 -5 0 5 1 1 2

Z 5 0 5 -5 0 5 10 1 2 2

(i)

1 0.8

1 0.6

0 Y .4

0.2 0

0.2

0.4

0.6

0.8

1

0.8

0.6

0 Y .4

X

0

0.2 0

(iii)

0.2

0.4

0.6

0.8

1

X

0

(iv)

Figure 2.1 Plots of the densities of (x1 , x2 ) ∼ Dirichlet(a1 , a2 ; a3 ) on V2 with various parameter values: (i) a1 = a2 = a3 = 2; (ii) a1 = 1, a2 = 5, a3 = 10; (iii) a1 = 10, a2 = 3, a3 = 8; (iv) a1 = 2, a2 = 10, a3 = 4.

(b) The moments Let x ∼ Dirichletn (a) on Tn . For any r1 , . . . , rn ≥ 0, the mixed moment of x is given by E

 n



xiri

=

i=1

Let a+ = ˆ

n i=1

⎧ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎨

B(a1 + r1 , . . . , an + rn ) . B(a1 , . . . , an )

(2.5)

ai . From (2.5), we have E(xi ) =

Var(xi ) =

⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎩ Cov(xi , xj ) =

ai , a+

i = 1 . . . , n,

ai (a+ − ai ) , 2 a+ (1 + a+ )

i = 1 . . . , n,

−ai aj < 0, i = / j + a+ )

2 a+ (1

and

i, j = 1 . . . , n.

(2.6)

40

DIRICHLET AND RELATED DISTRIBUTIONS

The expectations of all the xi remain the same if all ai are scaled with the same multiplicative constant. From (2.6), it is clear that any two random components in x are negatively correlated.

2.1.2 Stochastic representations and mode (a) Stochastic representation via independent gamma variates Extending Theorem 1.1 to Dirichlet distribution, we obtain the following result that suggests a stochastic representation for Dirichlet random vector via a sequence of independent gamma variates. Theorem 2.1

x = (x1 , . . . , xn ) ∼ Dirichletn (a) on Tn iff d

xi =

yi , y1 + · · · + y n

i = 1, . . . , n,

where yi ∼ Gamma(ai , 1) and {yi }ni=1 are mutually independent.

(2.7) ¶

Proof. The joint density of y = (y1 , . . . , yn ) is proportional to

     n n ai −1 yi exp − yi . i=1

i=1

Consider the following transformation xi = yi /y1 , n  y1 = yi .

i = 1, . . . , n − 1,

and

i=1

The corresponding inverse transformation is yi = xi y1 , i = 1, . . . , n − 1,   n−1  yn = 1 − xi y1 .

and

i=1

The Jacobian is given by J(y → x−n , y1 ) = yn−1 . 1 Therefore, the joint density of (x−n , y1 ) is proportional to n−1  i=1

xiai −1

n n−1 an −1  ai −1 −y 1, 1− xi · y1 i=1 e i=1

which implies that

n x−n ∼ Dirichletn (a) on Vn−1 and x−n is independent of y1 ∼ Gamma End i=1 ai , 1 .

DIRICHLET DISTRIBUTION

41

(b) Stochastic representation via independent beta variates There is a close relationship between Dirichlet and beta distributions, which was reviewed by Wilks (1962), Aitchison (1963), Basu and Tiwari (1982), Devroye (1986: 593–596), and Gupta and Richards (1987). We note that the intractable open simplex Vn−1 can be converted to the unit cube Cn−1 = {z−n = (z1 , . . . , zn−1 ): 0 < zi < 1, i = 1, . . . , n − 1}

via the following transformation (see Fang et al. (1990): 162–163): x1 = z1

xi = zi

and

i−1  (1 − zj ),

i = 2, . . . , n − 1.

(2.8)

j=1

The inverse transformation is then given by   i−1  z1 = x1 and zi = xi 1− xj ,

i = 2, . . . , n − 1,

j=1

and the Jacobian is J(x1 , . . . , xn−1 → z1 , . . . , zn−1 ) =

n−1 

(1 − zi )n−1−i .

i=1

By induction, we have n−1 

xi = 1 −

i=1

n−1 

(1 − zi ).

(2.9)

i=1

We summarize these results in Theorem 2.2 below, which gives two alternative approaches to generate a Dirichlet random vector by a sequence of independent beta random variables. Theorem 2.2

Let x ∼ Dirichletn (a) on Tn , then the following hold.

(i) x has a beta stochastic representation ⎧ d ⎪ x1 = z1 , ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ i−1 ⎪  ⎪ d ⎪ ⎨ xi = zi (1 − zj ), i = 2, . . . , n − 1, ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ n−1 ⎪  ⎪ d ⎪ ⎪ x = (1 − zj ), ⎪ n ⎩ j=1

and (2.10)

j=1



where zj ∼ Beta aj , independent.

 a , j = 1, . . . , n − 1, and {zj }n−1 k j=1 are mutually k=j+1

n

42

DIRICHLET AND RELATED DISTRIBUTIONS

(ii) x has an alternative beta stochastic representation ⎧ n−1  ⎪ ⎪ d ⎪ x1 = ⎪ z j , ⎪ ⎪ ⎪ ⎪ j=1 ⎪ ⎪ ⎪ ⎨ n−1  d ⎪ x = (1 − z ) z j , i = 2, . . . , n − 1, ⎪ i i−1 ⎪ ⎪ ⎪ j=i ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ d ⎩ xn = 1 − z n−1 , where z j ∼ Beta independent.



j k=1

(2.11)

and

 ak , aj+1 , j = 1, . . . , n − 1, and {z j }n−1 j=1 are mutually ¶
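Before turning to the proof, the stick-breaking representation (2.10) is easy to verify numerically. The sketch below (Python with numpy; the parameter vector is an illustrative assumption) builds x from independent betas and compares component means with a direct Dirichlet sample and with E(xi) = ai/a+ from (2.6).

```python
import numpy as np

rng = np.random.default_rng(7)
a = np.array([2.0, 5.0, 1.5, 3.0])
n, N = len(a), 100_000

# Representation (2.10): z_j ~ Beta(a_j, a_{j+1}+...+a_n), stick-breaking product
z = np.column_stack([rng.beta(a[j], a[j + 1:].sum(), size=N) for j in range(n - 1)])
x = np.empty((N, n))
remaining = np.ones(N)
for j in range(n - 1):
    x[:, j] = z[:, j] * remaining
    remaining *= 1.0 - z[:, j]
x[:, n - 1] = remaining

print(x.mean(axis=0))
print(rng.dirichlet(a, size=N).mean(axis=0))
print(a / a.sum())
```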

Proof. Assertion (i) is a direct consequence of (2.8) and (2.9). The proof of Assertion (ii) is similar to that of Assertion (i). End

Remark 2.1 Based on (2.7), we have an intuitive interpretation of (2.10) and (2.11). In fact, writing zi = yi / Σ_{j=i}^n yj, we have

    xi =d yi / Σ_{j=1}^n yj
       = [yi / Σ_{j=i}^n yj] · [Σ_{j=i}^n yj / Σ_{j=i−1}^n yj] · · · [Σ_{j=2}^n yj / Σ_{j=1}^n yj]
       = zi (1 − z_{i−1}) · · · (1 − z2)(1 − z1).

Similarly, writing z′j = Σ_{k=1}^j yk / Σ_{k=1}^{j+1} yk, we have

    xi =d yi / Σ_{j=1}^n yj
       = [yi / Σ_{j=1}^i yj] · [Σ_{j=1}^i yj / Σ_{j=1}^{i+1} yj] · · · [Σ_{j=1}^{n−1} yj / Σ_{j=1}^n yj]
       = (1 − z′_{i−1}) z′_i · · · z′_{n−1}. ¶

(c) Stochastic representation via independent inverted beta variates By combining (1.9) with Theorem 2.2, we immediately obtain the following results.

Theorem 2.3 Let x ∼ Dirichletn(a) on Tn. Then:

(i) x has an inverted beta stochastic representation

    x1 =d w1/(1 + w1),
    xi =d wi Π_{j=1}^{i} [1/(1 + wj)],   i = 2, . . . , n − 1,   and
    xn =d Π_{j=1}^{n−1} [1/(1 + wj)],    (2.12)

where wj ∼ IBeta(aj, Σ_{k=j+1}^n ak), 1 ≤ j ≤ n − 1, and {wj}_{j=1}^{n−1} are mutually independent; and

(ii) x has an alternative inverted beta stochastic representation

    x1 =d Π_{j=1}^{n−1} [w′j/(1 + w′j)],
    xi =d [1/(1 + w′_{i−1})] Π_{j=i}^{n−1} [w′j/(1 + w′j)],   i = 2, . . . , n − 1,   and
    xn =d 1/(1 + w′_{n−1}),    (2.13)

where w′j ∼ IBeta(Σ_{k=1}^j ak, aj+1), 1 ≤ j ≤ n − 1, and {w′j}_{j=1}^{n−1} are mutually independent. ¶

(d) The mode of the Dirichlet density function

Theorem 2.4 The mode of the Dirichlet density (2.4) is given by

    x̂i = (ai − 1)/(a+ − n),   1 ≤ i ≤ n,    (2.14)

if ai ≥ 1, and does not exist otherwise. ¶
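As a rough sanity check of (2.14), the following sketch (Python with scipy; the parameter values and the crude grid are illustrative assumptions) compares the formula with the grid point of highest density on V2.

```python
import numpy as np
from scipy.stats import dirichlet

a = np.array([3.0, 4.0, 6.0])
mode = (a - 1) / (a.sum() - len(a))                    # formula (2.14)

grid = [(t1, t2)
        for t1 in np.linspace(0.01, 0.98, 98)
        for t2 in np.linspace(0.01, 0.98, 98)
        if t1 + t2 < 0.99]
dens = [dirichlet.pdf([t1, t2, 1 - t1 - t2], a) for (t1, t2) in grid]
print(mode, grid[int(np.argmax(dens))])                # first two coordinates should agree
```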

2.2 Marginal and conditional distributions Let x ∼ Dirichletn (a) on Tn . The marginal distribution can be readily obtained without performing any integral operations. On the one hand, for any s < n, we can


rewrite (2.7) as yi y1 + · · · + ys + (ys+1 + · · · + yn ) yi = ˆ , i = 1, . . . , s, y1 + · · · + ys + ys+1 d

xi =

where yi ∼ Gamma(ai , 1), i = 1, . . . , s, n 

ys+1 =

yj ∼ Gamma

j=s+1

 n

 aj , 1 ,

j=s+1

are mutually independent. It follows immediately that and y1 , . . . , ys , ys+1

  (x1 , . . . , xs ) ∼ Dirichlet a1 , . . . , as ; nj=s+1 aj .

On the other hand, for any fixed i, the stochastic representation (2.7) can be rewritten as d

xi =

yi +

yi n

j=1,j = / i

yj

.

Therefore, it is clear that xi ∼ Beta(ai , a+ − ai ). Next, the stochastic representation (2.7) can also be employed to derive the conditional distribution of (xs+1 , . . . , xn−1 ) given x1 = x1∗ , . . . , xs = xs∗ . Set xi = xi



1−

s 

 xj ,

i = s + 1, . . . , n − 1.

j=1

By (2.7), we have xi = d

yi , ys+1 + · · · + yn

i = s + 1, . . . , n − 1,

which implies the distribution of (xs+1 , . . . , xn−1 ) is again a Dirichlet distribution with parameters (as+1 , . . . , an−1 ; an ), independent of x1 , . . . , xs . Therefore, given x1 = x1∗ , . . . , xs = xs∗ , we have

d

xi = 1 −

s 

xj∗



xi ,

i = s + 1, . . . , n − 1.

j=1

We summarize these results in the following theorem.

DIRICHLET DISTRIBUTION

Theorem 2.5

45

Let x ∼ Dirichletn (a) on Tn , then we have the following results.

. . . , xs ) has a Dirichlet distribution with (i) For any s 1/n, since define

m! = (m − n)!

n i=1



 Vn ∩[b1 1n ,11n ]

1−

n 

m−n

xi

dx.

(2.16)

i=1

xi ≥ nb > 1, it is clear that Jb(n) (1, m) = 0. For n = 0, we 

Jb(0) (1, m)

=

1,

if b > 0,

0,

otherwise.

46

DIRICHLET AND RELATED DISTRIBUTIONS

By integrating xn from t to 1 − Jb(n) (1, m)

n−1 i=1

xi , we have 



m! = (m − n + 1)!

1−b−

Vn−1 (1−b)∩[b1 1n−1 ,11n−1 ]

n−1 

xi

m−n+1 n−1 

i=1

dxi .

i=1

Let yi = xi /(1 − b), i = 1, . . . , n − 1. Then n−1 



yi ≤ 1

b max 0, 1−b

and

i=1

Since 0 ≤ b ≤ 1/n, we have 0 ≤ n−1 

b 1−b



 1 ≤ yi ≤ min 1, . 1−b

and 1 ≤

yi ≤ 1 and

i=1



1 , 1−b

(2.17)

so that (2.17) becomes

b ≤ yi ≤ 1. 1−b

Thus, a recursive formula for Jb(n) (1, m) can be obtained as follows (Rao and Sobel, 1980): Jb(n) (1, m) (1 − b)m m! = (m − n + 1)!



 b Vn−1 ∩[ 1−b 1n−1 ,11n−1 ]

1−

n−1  i=1

yi

m−n+1 n−1 

dyi

i=1

(n−1) = (1 − b)m Jb/(1−b) (1, m).

(2.18)

Note that the region b/(1 − b) > 1/(n − 1) is the same as b > 1/n, where the J-function vanishes. Repeatedly using the relation in (2.18), we have Jb(n) (1, m) = 1 − nb m ,

(2.19)

where x = ˆ max{x, 0}.

2.3.2 Cumulative distribution function Suppose that x = (x1 , . . . , xn ) ∼ Dirichlet(a1 , . . . , an ; an+1 ) on Vn and a = (a1 , . . . , an+1 ). Let b = (b1 , . . . , bn ) and [00n , b) denote the rectangle [0, b1 ) × · · · × [0, bn ). The cdf of x is given by Fn (b|a) = Pr{x < b} = Pr{x ∈ Vn ∩ [00n , b)}  n (1 − ni=1 xi )an+1 −1  ai −1 = xi dx. Bn+1 (a) Vn ∩[0 0n ,b) i=1

(2.20)

DIRICHLET DISTRIBUTION

47

As the Dirichlet density function equals to zero outside the unit simplex Vn , by the application of the inclusion–exclusion principle, it is easy to obtain the following theorem. Theorem 2.6 (Sz´antai, 1985; Pr´ekopa, 1995). Let b(1) ≤ · · · ≤ b(n) be the ordered values of b1 , . . . , bn . (i) If b(1) + b(2) > 1, then we have Fn (b|a) = 1 − n +

n 

F (bi ),

(2.21)

i=1

where F (·) denotes the marginal cdf of xi ∼ Beta(ai , a1n+1 − ai ), which is an incomplete beta function. (ii) If b(1) + b(2) ≤ 1 and b(1) + b(2) + b(3) > 1, then we have   n  n−1  Fn (b|a) = (n − 2) − F (bi ) + F (bi , bj ), (2.22) 2 i=1 1≤i 0, ⎪ ⎪ (α) ⎪ ⎪ ⎨ (−1)−m ˆ (2.25) (α)m = , if m < 0, and ⎪ ⎪ ⎪ (1 − α) −m ⎪ ⎪ ⎪ ⎩ 1, if m = 0.

48

DIRICHLET AND RELATED DISTRIBUTIONS

For k = 1, . . . , n, let F (b1 , . . . , bk ) = Pr{x1 < b1 , . . . , xk < bk } and F¯ (b1 , . . . , bk ) = Pr{x1 ≥ b1 , . . . , xk ≥ bk }. First, the marginal cdf and survival function of xi can be calculated by 

bi

F (bi ) = 0

xiai −1 (1 − xi )a 1n+1 −ai −1 dxi B(ai , a1n+1 − ai ) 

⎧ 0, ⎪ ⎪ ⎪ ⎨ = Ibi (ai , a1n+1 − ai ), ⎪ ⎪ ⎪ ⎩ 1,

if bi < 0, if 0 ≤ bi ≤ 1,

and

if bi > 1

and F¯ (bi ) = 1 − F (bi ), where Iθ (α, β) denotes the incomplete beta function. Second, the marginal cdf and survival function of (xi , xj ) are given by F (bi , bj ) =

⎧ ⎨ (2.23), ⎩

if bi + bj ≤ 1,

and

1 − F¯ (bi ) − F¯ (bj ), if bi + bj > 1,

and F¯ (bi , bj ) =

⎧ ⎨ 1 − F (bi ) − F (bj ) + F (bi , bj ),

if bi + bj ≤ 1,



if bi + bj > 1

0,

and

respectively. Proceeding this way, in the general case, we can compute F (b1 , . . . , bk ) and F¯ (b1 , . . . , bk ) for k = 3, . . . , n as follows: F (b1 , . . . , bk ) ⎧ ⎪ ⎪ ⎪ ⎪ (2.23), ⎪ ⎪ ⎪ ⎪ ⎪  ⎪ ⎪ ⎪ 1− F¯ (bi1 ) ⎪ ⎪ ⎪ ⎨ 1≤i1 ≤k  = ⎪ + F¯ (bi1 , bi2 ) + · · · ⎪ ⎪ ⎪ ⎪ 1≤i1 0, which can be employed to give ⎧ −2πin (v − bk )n−1 (−1)n ⎪ ⎪ , if bk > 0, 0 ≤ v ≤ bk , ⎨ (n − 1)! Ik = n n−1 ⎪ −2πi (−v + bk ) ⎪ ⎩ , if bk < 0, bk ≤ v ≤ 0, (n − 1)! for some fixed bk . We can rewrite Ik as Ik =

2πin (bk − v)n−1 sgn(bk )I[bk− , bk+ ] (v), (n − 1)!

where sgn(·) denotes the sign function, bk− = min(0, bk ) and bk+ = max(0, bk ). We summarize these results into the following theorem. Theorem 2.10 Let v = (v1 , . . . , vn ) ∼ U(Vn ). If all bk ( = / 0) are different, then  the pdf of V = b v = nk=1 bk vk is

 v n−1 n σnk 1− sgn(bk )I[bk− , bk+ ] (v), fV (v) = bk bk k=1 n 

where weights {σnk : k = 1, . . . , n} are defined by (2.46).

(2.47) ¶

Remark 2.3 (i) When all bk > 0, formula (5.5.4) in David (1981: 103) coincides with (2.47), which indicates that fV (v) is a mixture of beta distributions with scale bk . (ii) If b1 = · · · = bn = 1, it is easy to see V = nk=1 vk ∼ Beta(n, 1) by the definition of v ∼ Dirichletn+1 (1 1) on Vn . ¶

DIRICHLET DISTRIBUTION

59

2.5.2 Density for linear function of u∼ U(Tn ) Parallel to Theorem 2.10, we have the following theorem. Theorem 2.11 Let u = (u1 , . . . , un ) ∼ U(Tn ). If all bk ( = / 0) are different, then  the pdf of U = b u = nk=1 bk uk is

 u n−1 n−1 fU (u) = σnk 1− sgn(bk )I[bk− , bk+ ] (u), bk bk k=1 n 

where the weights {σnk : k = 1, . . . , n} are defined by (2.46).

(2.48) ¶

Example 2.1 (Random weighting method). Since the well-known paper of Efron (1979) appeared there has been considerable work on resampling methods. Among all of these techniques, the bootstrap is the simplest and most attractive one, and the random weighting method is an alternative which is aimed at estimating the error distribution of estimators. Let xk = μ + ε k ,

k = 1, 2, . . . ,

(2.49)

be a measure model, where {εk } are random errors of measurements. It is assumed that {εk } are i.i.d. with a common distribution function F (x) satisfying   x dF (x) = μ and (x − μ)2 dF (x) = σ 2 , and that μ and σ 2 > 0 are unknown. The common estimator for μ is the sample mean x¯ , with sample size n. To construct a CI for μ, we need to know the distribution of the error x¯ − μ. The main idea of the random weighting method is to construct a distribution based on samples {xk }nk=1 , to mimic the distribution of x¯ − μ. Let u = (un1 , . . . , unn ) ∼ U(Tn ) be independent of {xk }nk=1 and define Dn∗

n √  = n (xk − x¯ )unk , k=1



which is the weighted mean of n(xk − x¯ ) with random weight unk . Zheng ∗ (1987, 1992) showed that Dn∗ |(x1 , x2 , . . .), the conditional √ distribution of Dn given (x1 , x2 , . . .), is close to the distribution of the error n(xk − μ) when n is large; that is, with probability one, L

Dn∗ |(x1 , x2 , . . .) → N(0, σ 2 ), L

where the notation → stands for convergence in distribution, provided in model (2.49) the errors {εk } are i.i.d. with E(εk ) = 0 and Var(εk ) = σ 2 < ∞.

60

DIRICHLET AND RELATED DISTRIBUTIONS

Our interest here is to find the exact pdf of Dn∗ |(x1 , x2 , . . .). In fact, the (conditional) pdf of Dn∗ |(x1 , x√ 2 , . . .) is a mixture of beta distributions with scale by virtue of (2.48) with bk = n(xk − x¯ ).  Example 2.2 (Serial correlation problem). Consider the following model of time series (Johnson and Kotz 1970: 233): Xt = ρXt−1 + Zt ,

|ρ| < 1,

t = 1, 2, . . . ,

iid

where Zt ∼ N(0, 1), and, further, Zt is independent of all Xk for k < t. Define the following modified non-circular serial correlation coefficient n−1 

R1,1 =

¯ 1 )(Xk+1 − X ¯ 1) + (Xk − X

k=1 n 

¯ 1 )2 + (Xk − X

k=1

2n−1 

¯ 2 )(Xk+1 − X ¯ 2) (Xk − X

k=n+1 2n 

¯ 2 )2 (Xk − X

k=n+1

where  ¯1 = 1 X Xk n k=1 n

2n  ¯2 = 1 and X Xk . n k=n+1

The exact distribution of R1,1 has been obtained by Pan (1968) for the case when the correlation between Xi and Xj is ρ for |i − j| = 1, and is 0 otherwise. In this case it can be shown that R1,1 is distributed as n−1 

λ k ξk

 n−1

k=1

ξk ,

k=1

iid

where ξ1 , . . . , ξn−1 ∼ Exponential(1), and λ2k−1 = cos[2kπ/(n + 1)],

k = 1, 2, . . . , [n/2],

while λ2 , λ4 , . . . , λ2[(n−1)/2] are roots to the equation   (1 − λ)2 (−0.5)n − 0.5Dn−1 (λ) − (n + 1 − nλ)Dn (λ) = 0 with Dn (λ) = (−0.5λ)n

[n/2+1] 

 k=1

n+1  (1 − λ2 )k−1 . 2k − 1

It is easy to see that R1,1 has the same distribution as by (2.48) with (u1 , . . . , un−1 ) ∼ U(Tn−1 ).

n−1 k=1

λk uk , whose pdf is given 

DIRICHLET DISTRIBUTION

61

2.5.3 A unified approach to linear functions of variables and order statistics In this subsection we aim to introduce a unified approach to linear functions of random variables and order statistics. For this purpose, we first define the family of multivariate 1 -norm symmetric distributions (Fang and Fang, 1988), which is a special case of the family of Liouville distributions.1 Definition 2.2 A random vector zn×1 is said to have an 1 -norm symmetric distribution, denoted by z ∼ Liouvillen (1 1n ; f ), if z =d R · u, where R ≥ 0 is a random variable with pdf f , u ∼ U(Tn ) and R ⊥ ⊥ u. ¶ We adopt the following notation for samples and their order statistics: v = (v1 , . . . , vn ) ∼ U(Vn ),

v(1) ≤ · · · ≤ v(n) ,

u = (u1 , . . . , un ) ∼ U(Tn ),

u(1) ≤ · · · ≤ u(n) ,

z = (z1 , . . . , zn ) ∼ Liouvillen (1 1n ; g), z(1) ≤ · · · ≤ z(n) , iid

ξ = (ξ1 , . . . , ξn ) ∼ Exponential(1), iid

η = (η1 , . . . , ηn ) ∼ U(0, 1),

ξ(1) ≤ · · · ≤ ξ(n) , η(1) ≤ · · · ≤ η(n) ,

and demonstrate that all distributions of the linear functions n 

n 

ck v(k) ,

k=1

ck u(k) ,

k=1

n  k=1 n 

can n be reduced to the distribution of k=1 bk uk as given by (2.48). n

k =1 ck v(k)



ck ξ(k) ,

k=1

dk zk , and

k=1

(a)

n 

ck z(k) ,

n 

n 

ck η(k) ,

k=1

d k ξk

k=1

n k=1

bk vk as given by (2.47) or to that of

n

k =1 bk vk

From Example 5.1 of Fang et al. (1990: 121), we know that v ∼ U(Vn ) belongs to the family of 1 -norm symmetric distributions. Therefore, the conclusion stated in Theorem 5.12 of Fang et al. (1990: 126) can be applied to v. The normalized 2

1

There exist two procedures to define multivariate Liouville distributions. The first one is introduced in Section 3.4 and the second one is presented in Section 8.5. Therefore, we use two different notations to denote the multivariate Liouville distribution. n 2 In fact, we can write v = v · (v/v ). Remark 2.3 shows that v = v ∼ Beta(n, 1). 1 1 1 k=1 k Theorem 5.1 of Fang et al. (1990: 114) shows that v/v1 ∼ U(Tn ) and v1 ⊥ ⊥ (v/v1 ).


The normalized spacings of $v$ are defined by
$$v_k^* \mathrel{\hat{=}} (n - k + 1)(v_{(k)} - v_{(k-1)}), \quad v_{(0)} = 0, \quad k = 1, \ldots, n,$$
thus $(v_1^*, \ldots, v_n^*) \overset{d}{=} v$, which implies
$$\sum_{k=1}^{n} c_k v_{(k)} \overset{d}{=} \sum_{k=1}^{n} b_k v_k, \quad\text{where } b_k = \frac{\sum_{j=k}^{n} c_j}{n - k + 1}. \tag{2.50}$$

(b) $\sum_{k=1}^{n} c_k u_{(k)} \longrightarrow \sum_{k=1}^{n} b_k u_k$

Since $u \sim U(T_n)$ also belongs to the family of $\ell_1$-norm symmetric distributions, similar to (2.50), we have
$$\sum_{k=1}^{n} c_k u_{(k)} \overset{d}{=} \sum_{k=1}^{n} b_k u_k, \quad\text{where } b_k = \frac{\sum_{j=k}^{n} c_j}{n - k + 1}.$$

(c) $\sum_{k=1}^{n} c_k z_{(k)}$ and $\sum_{k=1}^{n} d_k z_k \longrightarrow \sum_{k=1}^{n} b_k u_k$

According to Definition 2.2, we have
$$\sum_{k=1}^{n} c_k z_{(k)} \overset{d}{=} R \sum_{k=1}^{n} c_k u_{(k)} \quad\text{and}\quad \sum_{k=1}^{n} d_k z_k \overset{d}{=} R \sum_{k=1}^{n} d_k u_k.$$

(d) $\sum_{k=1}^{n} c_k \xi_{(k)}$ and $\sum_{k=1}^{n} d_k \xi_k \longrightarrow \sum_{k=1}^{n} b_k u_k$

The i.i.d. exponential random variables and their order statistics are related by
$$\sum_{k=1}^{n} c_k \xi_{(k)} \overset{d}{=} \sum_{k=1}^{n} d_k \xi_k, \quad\text{where } d_k = \frac{\sum_{j=k}^{n} c_j}{n - k + 1}.$$
On the other hand, from Theorem 2.1, we have
$$u \overset{d}{=} (\xi_1/\|\xi\|_1, \ldots, \xi_n/\|\xi\|_1).$$
Hence,
$$\sum_{k=1}^{n} d_k \xi_k \overset{d}{=} \|\xi\|_1 \sum_{k=1}^{n} b_k u_k, \quad\text{where } b_k = d_k,$$
and where $\|\xi\|_1 \sim \text{Gamma}(n, 1)$ is independent of $\sum_{k=1}^{n} b_k u_k$.

(e) $\sum_{k=1}^{n} c_k \eta_{(k)} \longrightarrow \sum_{k=1}^{n} b_k v_k$

Note that $v_k \overset{d}{=} \eta_{(k)} - \eta_{(k-1)}$, $\eta_{(0)} = 0$, $k = 1, \ldots, n$. Therefore,
$$\sum_{k=1}^{n} c_k \eta_{(k)} \overset{d}{=} \sum_{k=1}^{n} b_k v_k, \quad\text{where } b_k = \sum_{j=k}^{n} c_j.$$
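As a quick numerical illustration of reduction (b), the following Python sketch (our own illustration, not part of the original text) compares the Monte Carlo distribution of $\sum_k c_k u_{(k)}$ with that of $\sum_k b_k u_k$; the vector $u \sim U(T_n)$ is generated as normalized i.i.d. Exponential(1) variables, as in Theorem 2.1, and the coefficient vector $c$ is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 200_000
c = np.array([3.0, -1.0, 2.0, 0.5, 1.0])                  # arbitrary coefficients c_1, ..., c_n
b = np.array([c[k:].sum() / (n - k) for k in range(n)])   # b_k = (sum_{j>=k} c_j) / (n - k + 1), 0-based here

# u ~ U(T_n): normalized i.i.d. Exponential(1) variables (Theorem 2.1)
xi = rng.exponential(size=(m, n))
u = xi / xi.sum(axis=1, keepdims=True)

lhs = np.sort(u, axis=1) @ c      # sum_k c_k u_(k), using the ascending order statistics
rhs = u @ b                       # sum_k b_k u_k

# the two empirical distributions should agree up to Monte Carlo error
for q in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(q, np.quantile(lhs, q), np.quantile(rhs, q))
```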


2.5.4 Cumulative distribution function for linear function of a Dirichlet random vector

Provost and Cheong (2000) discussed the derivation of the distribution of a linear combination of the components of a Dirichlet random vector. Let $x = (x_1, \ldots, x_n)' \sim \text{Dirichlet}_n(a)$ on $T_n$ and define
$$X = b'x = \sum_{k=1}^{n} b_k x_k.$$
Without loss of generality, we assume that $b_1 \geq \cdots \geq b_n$. Using (2.7), we know that the exact distribution function of $X$ is given by
$$\Pr(X \leq x) = \Pr\left(\frac{\sum_{k=1}^{n} b_k y_k}{\sum_{k=1}^{n} y_k} \leq x\right) = \Pr\left(\frac{\sum_{k=1}^{n} b_k z_k}{\sum_{k=1}^{n} z_k} \leq x\right) = \Pr\left(\sum_{k=1}^{n} (b_k - x) z_k \leq 0\right),$$
where $y_k \sim \text{Gamma}(a_k, 1)$, $z_k = 2y_k \sim \chi^2(2a_k)$ and $\{z_k\}_{k=1}^{n}$ are mutually independent. This amounts to computing the probability for a linear combination of independent chi-square random variables to be less than or equal to zero. Provost and Cheong (2000) used the results from Imhof (1961) to express the distribution function of $X$ as follows:
$$\Pr(X \leq x) = \frac{1}{2} - \frac{1}{\pi} \int_0^{\infty} \frac{\sin\left(\sum_{k=1}^{n} a_k \tan^{-1}\{(b_k - x)u\}\right)}{u \prod_{k=1}^{n} \{1 + (b_k - x)^2 u^2\}^{a_k/2}}\, du \tag{2.51}$$
for $b_n < x < b_1$. Clearly, $\Pr(X \leq x) = 0$ whenever $x \leq b_n$ and $\Pr(X \leq x) = 1$ whenever $x \geq b_1$. Representations of the density function of $X$ for two- and three-dimensional vectors were also provided in Provost and Cheong (2000). For example, for $n = 2$ we have
$$X = b_1 x_1 + b_2 x_2 \overset{d}{=} \frac{b_1 y_1 + b_2 y_2}{y_1 + y_2} = \frac{(b_1 - b_2) y_1}{y_1 + y_2} + b_2,$$
so that
$$\frac{X - b_2}{b_1 - b_2} \overset{d}{=} \frac{y_1}{y_1 + y_2} \sim \text{Beta}(a_1, a_2).$$
Hence, the density of $X$ is
$$f_X(x) = \frac{1}{(b_1 - b_2) B(a_1, a_2)} \left(\frac{x - b_2}{b_1 - b_2}\right)^{a_1 - 1} \left(1 - \frac{x - b_2}{b_1 - b_2}\right)^{a_2 - 1} \tag{2.52}$$
for $b_2 < x < b_1$.
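The integral in (2.51) is straightforward to evaluate numerically. The following Python sketch (our own, using a generic quadrature routine rather than any code of Provost and Cheong) computes $\Pr(X \leq x)$ and, as a check, compares it with the exact beta form (2.52) in the case $n = 2$.

```python
import numpy as np
from scipy import integrate, stats

def dirichlet_linear_cdf(x, b, a):
    """Pr(b'X <= x) for X ~ Dirichlet(a) via the Imhof-type integral (2.51)."""
    b, a = np.asarray(b, float), np.asarray(a, float)
    if x <= b.min():
        return 0.0
    if x >= b.max():
        return 1.0
    d = b - x
    def integrand(u):
        theta = np.sum(a * np.arctan(d * u))
        rho = np.exp(0.5 * np.sum(a * np.log1p((d * u) ** 2)))
        return np.sin(theta) / (u * rho)
    val, _ = integrate.quad(integrand, 0.0, np.inf, limit=200)
    return 0.5 - val / np.pi

# sanity check against the exact Beta form (2.52) when n = 2
a, b = [2.0, 3.0], [1.0, 0.2]
x0 = 0.5
exact = stats.beta.cdf((x0 - b[1]) / (b[0] - b[1]), a[0], a[1])
print(dirichlet_linear_cdf(x0, b, a), exact)
```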


2.6 Characterizations In this section we focus on six characterizations for the Dirichlet distribution.

2.6.1 Mosimann's characterization

Mosimann (1962) gave a characterization of the Dirichlet distribution by using the proportion-sum independence theorem of Lukacs (1955), which characterizes the gamma and beta distributions.

Proportion-sum Independence Theorem (Lukacs, 1955). Let $Y_1$ and $Y_2$ be two positive and independent random variables. Then, the random variable $Y_1/(Y_1 + Y_2)$ is independent of the random variable $Y_1 + Y_2$ iff $Y_1 \sim \text{Gamma}(a_1, \beta)$ and $Y_2 \sim \text{Gamma}(a_2, \beta)$. ¶

Corollary to Lukacs' Theorem (Mosimann, 1962). Let $\{Y_i\}_{i=1}^{n}$ be $n$ positive and mutually independent random variables. Then, each of the $n - 1$ random variables $Y_i/\sum_{j=1}^{n} Y_j$ is independent of $\sum_{j=1}^{n} Y_j$ iff $Y_i \sim \text{Gamma}(a_i, \beta)$ for all $i = 1, \ldots, n$. ¶

Proof.³ From the proof of Theorem 2.1, we see that the implication '⇐' is trivial. We only need to prove '⇒'. Let $Z_i \mathrel{\hat{=}} \sum_{j=1}^{n} Y_j - Y_i$. It follows that
$$Y_i \perp\!\!\!\perp Z_i, \tag{2.53}$$
as $\{Y_i\}_{i=1}^{n}$ are mutually independent. For any fixed $i$, we note that
$$\frac{Y_i}{\sum_{j=1}^{n} Y_j} \perp\!\!\!\perp \sum_{j=1}^{n} Y_j \iff \frac{Y_i}{Y_i + Z_i} \perp\!\!\!\perp (Y_i + Z_i).$$
By Lukacs' theorem, we have
$$Y_i \sim \text{Gamma}(a_i, \beta_i) \quad\text{and}\quad Z_i \sim \text{Gamma}(a_i^*, \beta_i). \tag{2.54}$$
From (2.53) and (2.54), we have
$$\sum_{j=1}^{n} Y_j = Y_i + Z_i \sim \text{Gamma}(a_i + a_i^*, \beta_i), \quad \forall\, i.$$
Since $\sum_{j=1}^{n} Y_j$ does not involve the subscript $i$, we have $\beta_1 = \cdots = \beta_n = \beta$. End

The proof of the following theorem (i.e., Theorem 2.12) follows immediately from Theorem 2.1 and the corollary to Lukacs' theorem.

³ The proof presented here is different from that of Mosimann (1962).


Theorem 2.12 (Mosimann, 1962). Let $\{Y_i\}_{i=1}^{n}$ be $n$ positive and mutually independent random variables. Define $X_i = Y_i/\sum_{j=1}^{n} Y_j$ for $i = 1, \ldots, n$. Then, each $X_i$ is distributed independently of $\sum_{j=1}^{n} Y_j$ iff $(X_1, \ldots, X_n) \sim \text{Dirichlet}_n(a)$ on $T_n$. ¶

2.6.2 Darroch and Ratcliff's characterization

Darroch and Ratcliff (1971) showed the following characterization for the Dirichlet distribution.

Theorem 2.13 (Darroch and Ratcliff, 1971). Let $\{X_i\}_{i=1}^{n-1}$ be $n - 1$ continuous positive random variables satisfying $\sum_{i=1}^{n-1} X_i < 1$. If
$$\frac{X_i}{1 - \sum_{j=1,\, j \neq i}^{n-1} X_j}$$
is independent of the set $\{X_j: j = 1, \ldots, n - 1,\ j \neq i\}$ for each $i = 1, \ldots, n - 1$, then $(X_1, \ldots, X_{n-1}) \sim \text{Dirichlet}_n(a)$ on $V_{n-1}$. ¶

To prove Theorem 2.13, Darroch and Ratcliff (1971) presented four lemmas. Lemma 2.2 is a straightforward extension of a result given in Feller (1966: 269) and the proof of Lemma 2.3 is trivial.

Lemma 2.2 Let $f(\cdot)$ and $g(\cdot)$ be two arbitrary functions defined in $(0, 1)$ satisfying
$$\frac{f(xy)}{g(x)} \to h(y) < \infty \quad\text{as } x \to 0$$
at all points. Then, $h(y) = p y^{\rho}$ for some constant $\rho$. ¶

Lemma 2.3 If $x^a (s - x)^b = c$ for all $x \in (0, s)$, then $a = b = 0$ and $c = 1$. ¶

Lemma 2.4 Let $H(x)$ be a finite, positive, and continuous function on $(0, 1)$ satisfying the functional relationship
$$H(z/[1 - y]) \cdot H(y) = c \cdot H(y/[y + z]) \cdot H(y + z), \tag{2.55}$$
where $y > 0$, $z > 0$, and $y + z < 1$. Then, (i) $\lim_{x \to 0+} H(x)$ exists, (ii) $H(x) = k$ for all $x \in (0, 1)$, and (iii) $c = 1$. ¶


Proof. (i) Let $b$ be a real number and $b > 1$. Substituting $z = (b - 1)y$ and $z = b^{-1} - y$ in (2.55) gives
$$H\!\left(\frac{(b - 1)y}{1 - y}\right) H(y) = c\, H(1/b)\, H(by) \tag{2.56}$$
and
$$H\!\left(\frac{(1/b) - y}{1 - y}\right) H(y) = c\, H(by)\, H(1/b) \tag{2.57}$$
respectively for all $y \in (0, 1/b)$. By comparing (2.56) with (2.57), we have
$$H\!\left(\frac{(b - 1)y}{1 - y}\right) = H\!\left(\frac{(1/b) - y}{1 - y}\right), \quad \forall\, y \in (0, 1/b). \tag{2.58}$$
Putting $x = (b - 1)y/(1 - y)$ in (2.58), we have
$$H(x) = H\!\left(\frac{1 - x}{b}\right), \quad \forall\, x \in (0, 1).$$
Since $H(x)$ is continuous, it can be seen that
$$k \mathrel{\hat{=}} \lim_{x \to 0+} H(x) = \lim_{x \to 0+} H\!\left(\frac{1 - x}{b}\right) = H(1/b), \quad \forall\, b > 1. \tag{2.59}$$
Note that $0 < H(x) < +\infty$ for all $x \in (0, 1)$. Hence, $k = \lim_{x \to 0+} H(x)$ exists and $0 < k = H(1/b) < +\infty$.
(ii) Letting $x = 1/b$ in (2.59), we have $H(x) = k$ for all $x \in (0, 1)$.
(iii) Substituting $H(x) = k$ in (2.55) yields $k^2 = c k^2$. Hence $c = 1$. End

Lemma 2.5 If the function $f(x, y)$ has the triangular domain $\{(x, y): 0 < y < 1 - x,\ 0 < x < 1\}$ and is expressible as $f(x, y) = g_1(x/(1 - y))\, g_2(y) = h_1(y/(1 - x))\, h_2(x)$, where $g_1$, $g_2$, $h_1$, $h_2$ are four functions defined on $(0, 1)$, then either $f(x, y)$ is zero throughout the domain or it is nonzero throughout the domain. ¶

Proof. If $f(x_0, y_0) = 0$, then either $g_1(x_0/(1 - y_0)) = 0$ or $g_2(y_0) = 0$. Thus, either $f(x, y) = 0$ for all $(x, y)$ on the line $x/(1 - y) = x_0/(1 - y_0)$ or for all $(x, y)$ on the line $y = y_0$. Similarly, $f(x_1, y_1) = 0$ implies that either $f(x, y) = 0$ on the line $y/(1 - x) = y_1/(1 - x_1)$ or on the line $x = x_1$. As a result, $f$ being zero at a point implies that $f$ is zero on a line, which implies that $f$ is zero throughout part or all of the triangular domain, which in turn implies that $f$ is zero throughout the whole domain. End

Theorem 2.14 (Darroch and Ratcliff, 1971). Let $X > 0$ and $Y > 0$ be two continuous random variables which satisfy $X + Y < s$. Assume that the pdfs of $X$ and $Y$,


defined on $(0, s)$, and of $X/(s - Y)$ and $Y/(s - X)$, defined on $(0, 1)$, are all continuous. Then,
$$\frac{X}{s - Y} \perp\!\!\!\perp Y \quad\text{and}\quad \frac{Y}{s - X} \perp\!\!\!\perp X$$
iff $(X, Y)$ have the Dirichlet density
$$f(x, y) = c\, x^{a_1 - 1} y^{a_2 - 1} (s - x - y)^{a_3 - 1}$$
for some $a_1$, $a_2$ and $a_3$. ¶

Proof. To verify the necessity part (i.e., '⇐'), let $Z = X/(s - Y)$ and $W = s - Y$. Then, $X = ZW$ and $Y = s - W$. Since the Jacobian is $J(X, Y \to Z, W) = W$, the joint density of $(Z, W)$ is given by
$$f(z, w) = f(x, y) \cdot J(x, y \to z, w) = c\, z^{a_1 - 1} (1 - z)^{a_3 - 1} \cdot w^{a_1 + a_3 - 1} (s - w)^{a_2 - 1},$$
which is a product of two independent parts. Therefore, $Z \perp\!\!\!\perp W$; that is, $X/(s - Y) \perp\!\!\!\perp Y$. By symmetry, we also have $Y/(s - X) \perp\!\!\!\perp X$.

To verify the sufficiency part (i.e., '⇒'), we first consider the case when $s = 1$. Let $g_1$, $g_2$, $h_1$, and $h_2$ denote the pdfs of $X/(1 - Y)$, $Y$, $Y/(1 - X)$, and $X$. Then
$$f(x, y) = g_1(x/(1 - y))\, g_2(y)/(1 - y) = h_1(y/(1 - x))\, h_2(x)/(1 - x).$$
By Lemma 2.5, $f(x, y)$ cannot be zero for all $(x, y) \in \{(x, y): 0 < y < 1 - x,\ 0 < x < 1\}$ and must be positive. Therefore,
$$\frac{g_1(x/(1 - y))}{h_2(x)} = \frac{h_1(y/(1 - x))}{g_2(y)} \cdot \frac{1 - y}{1 - x}. \tag{2.60}$$
Since $h_1$ is continuous, the right-hand side of (2.60) converges to $h_1(y)(1 - y)/g_2(y)$ as $x \to 0+$. By Lemma 2.2, the left-hand side of (2.60) converges as $x \to 0+$ and its limit must be of the form $p(1 - y)^{-a_1 + 1}$. Thus,
$$h_1(y) = p\,(1 - y)^{-a_1} g_2(y). \tag{2.61}$$
By symmetry, we have
$$g_1(x) = q\,(1 - x)^{-a_2} h_2(x). \tag{2.62}$$
By interchanging the roles of $h_1$ and $h_2$ in (2.60), we have
$$\frac{g_1(x/(1 - y))}{h_1(y/(1 - x))} = \frac{h_2(x)}{g_2(y)} \cdot \frac{1 - y}{1 - x}.$$


Thus,
$$\lim_{x \to 1 - y} \frac{g_1(x/(1 - y))}{h_1(y/(1 - x))} = \lim_{x \to 1 - y} \frac{h_2(x)}{g_2(y)} \cdot \frac{1 - y}{1 - x} = \frac{h_2(1 - y)}{g_2(y)} \cdot \frac{1 - y}{y}.$$
Defining
$$g_1^*(x) = g_1(1 - x) \quad\text{and}\quad h_1^*(x) = h_1(1/(1 + x)),$$
we have from Lemma 2.2 that
$$\lim_{x \to 1 - y} \frac{g_1(x/(1 - y))}{h_1(y/(1 - x))} = \lim_{x \to 1 - y} \frac{g_1^*((1 - x - y)/(1 - y))}{h_1^*((1 - x - y)/y)} = r\,(y/(1 - y))^{a_3 - 1}.$$
Consequently,
$$h_2(1 - y) = r\,(y/(1 - y))^{a_3} g_2(y). \tag{2.63}$$
Now substituting (2.61), (2.62), and (2.63) into (2.60), we have
$$\frac{g_2((1 - x - y)/(1 - y))\, g_2(y)}{g_2(y/(1 - x))\, g_2(1 - x)} = \frac{p\,(1 - x - y)^{a_2 - a_3 - a_1}}{q\,(1 - y)^{a_2 - 1} (1 - x)^{-a_3 - a_1 + 1}}.$$
Let $z = 1 - x - y$ and
$$H(x) \mathrel{\hat{=}} \frac{g_2(x)}{x^{a_2 - 1} (1 - x)^{a_1 + a_3 - 1}}.$$
Hence, $H$ is continuous and positive on $(0, 1)$ and satisfies
$$H(z/(1 - y))\, H(y) = (p/q)\, H(y/(y + z))\, H(y + z).$$
By Lemma 2.4, $H(x) = k$ and $g_2(x) = k\, x^{a_2 - 1} (1 - x)^{a_1 + a_3 - 1}$. The desired joint pdf is
$$f(x, y) = c\, x^{a_1 - 1} y^{a_2 - 1} (1 - x - y)^{a_3 - 1},$$
where $c$ is a normalizing constant. For $s \neq 1$, we can similarly apply the above proof to $X/s$ and $Y/s$. End

Proof of Theorem 2.13. Similar to Theorem 2.14, the necessity part of the proof is trivial. For the sufficiency part, given $X_3 = x_3$, $X_4 = x_4$, $\ldots$, $X_{n-1} = x_{n-1}$, we have
$$\frac{X_1}{\left(1 - \sum_{i \neq 1,2} x_i\right) - X_2} \perp\!\!\!\perp X_2 \quad\text{and}\quad \frac{X_2}{\left(1 - \sum_{i \neq 1,2} x_i\right) - X_1} \perp\!\!\!\perp X_1.$$


Applying Theorem 2.14 to the conditional distribution of $(X_1, X_2)$ given $X_3 = x_3, \ldots, X_{n-1} = x_{n-1}$, we immediately obtain
$$f(x) = c^{(12)}(x_{-12})\, x_1^{a_1^{(12)}(x_{-12})} x_2^{a_2^{(12)}(x_{-12})} \left(1 - \sum_{i=1}^{n-1} x_i\right)^{a_n^{(12)}(x_{-12})},$$
where $x \mathrel{\hat{=}} (x_1, \ldots, x_{n-1})'$, $x_{-12} \mathrel{\hat{=}} (x_3, \ldots, x_{n-1})'$, and $a_i^{(12)}(x_{-12})$, $i = 1, 2, n$, and $c^{(12)}(x_{-12})$ are functions of $x_{-12}$. Similarly,
$$f(x) = c^{(13)}(x_{-13})\, x_1^{a_1^{(13)}(x_{-13})} x_3^{a_3^{(13)}(x_{-13})} \left(1 - \sum_{i=1}^{n-1} x_i\right)^{a_n^{(13)}(x_{-13})}.$$
Lemma 2.3 can now be applied to obtain
$$a_1^{(12)}(x_{-12}) = a_1^{(13)}(x_{-13}), \quad a_n^{(12)}(x_{-12}) = a_n^{(13)}(x_{-13}),$$
and
$$c^{(12)}(x_{-12})\, x_2^{a_2^{(12)}(x_{-12})} = c^{(13)}(x_{-13})\, x_3^{a_3^{(13)}(x_{-13})}.$$
It is easy to show that $a_k^{(ij)}(x_{-ij}) = a_k - 1$, $k = 1, 2, \ldots, n$, and
$$c^{(12)}(x_{-12}) = c\, x_3^{a_3 - 1} \cdots x_{n-1}^{a_{n-1} - 1}.$$
End

2.6.3 Characterization through neutrality

The main results in this subsection are based on Fabius (1973). Let $(X_1, \ldots, X_{n-1}) \in V_{n-1}$ be a random vector and $n \geq 3$. Assume that neither $X_i$ nor $X_n = 1 - \sum_{i=1}^{n-1} X_i$ vanishes almost surely.

Definition 2.3 A random vector $(X_1, \ldots, X_{n-1})$ is called (CM)$_i$-neutral⁴ for a given $i \in \{1, \ldots, n - 1\}$ if for any integers $r_j \geq 0$, $j \neq i$, there is a constant $c$ such that
$$E\left(\prod_{j \neq i} X_j^{r_j} \,\Big|\, X_i = x_i\right) = c\,(1 - x_i)^{\sum_{j \neq i} r_j} \quad \text{a.s.} \quad ¶$$

Definition 2.4 A random vector $(X_1, \ldots, X_{n-1})$ is called (DR)$_i$-neutral⁵ for a given $i \in \{1, \ldots, n - 1\}$ if for any integer $r \geq 0$, there is a constant $c$ such that
$$E(X_i^r \mid X_j = x_j,\ j \neq i) = c\left(1 - \sum_{j \neq i} x_j\right)^{r} \quad \text{a.s.} \quad ¶$$

⁴ CM is the acronym of Connor and Mosimann (1969).
⁵ DR is the acronym of Darroch and Ratcliff (1971). Both (CM)$_i$-neutrality and (DR)$_i$-neutrality are independence properties.


Theorem 2.15 (Fabius, 1973). The following statements are equivalent:

(i) $(X_1, \ldots, X_{n-1})$ is (CM)$_i$-neutral for all $i$.
(ii) $(X_1, \ldots, X_{n-1})$ is (DR)$_i$-neutral for all $i$.
(iii) The distribution of $(X_1, \ldots, X_{n-1})$ is a Dirichlet distribution or a limit of a Dirichlet distribution.⁶ ¶

Proof. The implications (iii) ⇒ (i) and (iii) ⇒ (ii) are well-known properties of Dirichlet distributions (Wilks, 1962). The inverse implications follow from Lemmas 2.7 and 2.8 below. End

For the sake of convenience, we define
$$Y_i = \sum_{j \neq i} X_j \quad\text{and}\quad Y_I = \sum_{j \notin I} X_j$$
for any $i \in \{1, \ldots, n - 1\}$ and $I \subset \{1, \ldots, n - 1\}$. Proofs of the following three lemmas can be found in Fabius (1973).

Lemma 2.6 (ii) implies, for any proper subset $I$ of $\{1, \ldots, n - 1\}$, any $i \in I$, and any integer $r \geq 0$, the existence of a constant $c$ such that
$$E(X_i^r \mid X_j = x_j,\ j \notin I) = c\,(1 - Y_I)^r \quad \text{a.s.} \quad ¶$$

Lemma 2.7 (ii) implies, for any proper subset $I$ of $\{1, \ldots, n - 1\}$ and any integers $r_i \geq 0$, $i \in I$, the existence of a constant $c$ such that
$$E\left(\prod_{i \in I} X_i^{r_i} \,\Big|\, X_j = x_j,\ j \notin I\right) = c\,(1 - Y_I)^{\sum_{i \in I} r_i} \quad \text{a.s.}$$
Thus, in particular, (ii) implies (i). ¶

Lemma 2.8 (i) implies (iii). ¶

2.6.4 Characterization through complete neutrality

James and Mosimann (1980) provided another characterization of the Dirichlet distribution using the concepts of neutrality and complete neutrality. Let $(X_1, \ldots, X_{n-1}) \in V_{n-1}$ be a random vector and $n \geq 3$. Define $S_k = \sum_{i=1}^{k} X_i$, $k = 1, \ldots, n - 1$. Suppose that neither the $X_i$ nor $1 - S_{n-1}$ is degenerate at zero.

⁶ (iii) is a simple way of saying that the distribution of $(X_1, \ldots, X_{n-1})$ is either a Dirichlet distribution or discrete and concentrated in the vertices of the simplex $V_{n-1}$ or degenerate.


Definition 2.5 $(X_1, \ldots, X_k)$, $k \leq n - 2$, is neutral⁷ in $(X_1, \ldots, X_{n-1})$ if there are nonnegative random variables $Y_1, \ldots, Y_{n-1}$ with $(Y_1, \ldots, Y_k) \perp\!\!\!\perp (Y_{k+1}, \ldots, Y_{n-1})$ such that
$$(X_1, \ldots, X_{n-1}) \overset{d}{=} \left(Y_1, \ldots, Y_k,\ Y_{k+1}(1 - T_k), \ldots, Y_{n-1}(1 - T_k)\right),$$
where $T_k = \sum_{i=1}^{k} Y_i$. ¶

Remark 2.4 The neutrality of $(X_1, \ldots, X_k)$ implies $S_k \overset{d}{=} T_k$. This neutrality states that the random vectors
$$(X_1, \ldots, X_k) \perp\!\!\!\perp \left(X_{k+1}/(1 - S_k), \ldots, X_{n-1}/(1 - S_k)\right).$$

James and Mosimann (1980) reformulated Fabius' characterization (i.e., Theorem 2.15) as follows.

Reformulation of Theorem 2.15. The following statements are equivalent:
(i) $X_i$ is neutral in $(X_1, \ldots, X_{n-1})$ for all $i = 1, \ldots, n - 1$.
(ii) $(X_j;\ j \neq i)$ is neutral in $(X_1, \ldots, X_{n-1})$ for all $i = 1, \ldots, n - 1$.
(iii) The distribution of $(X_1, \ldots, X_{n-1})$ is a Dirichlet distribution or a limit of a Dirichlet distribution. ¶

Definition 2.6 $(X_1, \ldots, X_{n-1})$ is completely neutral⁸ if there exist mutually independent and nonnegative random variables $Z_1, \ldots, Z_{n-1}$ such that
$$(X_1, \ldots, X_{n-1}) \overset{d}{=} \left(Z_1,\ Z_2(1 - Z_1),\ \ldots,\ Z_{n-1} \prod_{i=1}^{n-2} (1 - Z_i)\right). \quad ¶$$

James and Mosimann (1980) provided the following alternative characterization of the Dirichlet distribution and its limits, and hence confirmed the conjectures of Doksum (1974: 193) and Mosimann (1975: 223). Here, we state this theorem without the proof.

Theorem 2.16 (James and Mosimann, 1980). The following statements are equivalent:
(i) $(X_1, \ldots, X_{n-1})$ is completely neutral and $x_{n-1}$ is neutral in $x_{-n}$.
(ii) The distribution of $(X_1, \ldots, X_{n-1})$ is a Dirichlet distribution or a limit of a Dirichlet distribution. ¶

⁷ The term neutrality was introduced by Connor and Mosimann (1969).
⁸ The term complete neutrality was introduced by Connor and Mosimann (1969).


2.6.5 Characterization through global and local parameter independence

Suppose $\{X_{ij}: 1 \leq i \leq m,\ 1 \leq j \leq n\}$ is a set of positive random variables that sum to unity. Define
$$X_{i\cdot} = \sum_{j=1}^{n} X_{ij}, \quad X_{\cdot j} = \sum_{i=1}^{m} X_{ij}, \quad X_{I\cdot} = \{X_{i\cdot}\}_{i=1}^{m-1}, \quad X_{\cdot J} = \{X_{\cdot j}\}_{j=1}^{n-1},$$
$$X_{j|i} = X_{ij}\Big/\sum_{j} X_{ij}, \quad X_{i|j} = X_{ij}\Big/\sum_{i} X_{ij}, \quad X_{J|i} = \{X_{j|i}\}_{j=1}^{n-1}, \quad X_{I|j} = \{X_{i|j}\}_{i=1}^{m-1}.$$
The following lemma gives a well-known property of the Dirichlet distribution.

Lemma 2.9 (Dawid and Lauritzen, 1993). Let $\{X_{ij}: 1 \leq i \leq m,\ 1 \leq j \leq n\} \sim \text{Dirichlet}_{mn}(\{a_{ij}\})$ on $T_{mn}$. Then, $X_{I\cdot}, X_{J|1}, \ldots, X_{J|m}$ all follow a Dirichlet distribution and are mutually independent. ¶

Geiger and Heckerman (1997) proved the following theorem.

Theorem 2.17 (Geiger and Heckerman, 1997). Let $\{X_{ij}: 1 \leq i \leq m,\ 1 \leq j \leq n\}$ be positive random variables having a positive pdf. If $\{X_{I\cdot}, X_{J|1}, \ldots, X_{J|m}\}$ are mutually independent and $\{X_{\cdot J}, X_{I|1}, \ldots, X_{I|n}\}$ are mutually independent, then $\{X_{ij}\}$ is Dirichlet distributed. ¶

2.7 MLEs of the Dirichlet parameters

For $i = 1, \ldots, m$, let $x_i = (x_{i1}, \ldots, x_{in})' \overset{\text{iid}}{\sim} \text{Dirichlet}_n(a)$ on $T_n$. The objective of this section is to estimate the Dirichlet parameter vector $a = (a_1, \ldots, a_n)'$ by the maximum likelihood method.

2.7.1 MLE via the Newton–Raphson algorithm

(a) Formulation of the Newton–Raphson algorithm

The log-likelihood function of the parameter vector $a$ for the observed data $Y_{\text{obs}} = \{x_i\}_{i=1}^{m}$ is
$$\ell(a \mid Y_{\text{obs}}) = m\left\{\log \Gamma(a_+) - \sum_{j=1}^{n} \log \Gamma(a_j) + \sum_{j=1}^{n} (a_j - 1) \log G_j\right\}, \tag{2.64}$$
where $a_+ = \sum_{j=1}^{n} a_j$, and
$$G_j = \left(\prod_{i=1}^{m} x_{ij}\right)^{1/m}, \quad j = 1, \ldots, n, \tag{2.65}$$

denote the geometric means of the $n$ variables. Since the Dirichlet distribution belongs to the exponential family, the objective function (2.64) is globally concave (Ronning, 1989) and the Newton–Raphson algorithm converges to the global optimum. It is easy to verify that the gradient and the Hessian matrix are given by
$$g = \nabla \ell(a \mid Y_{\text{obs}}) = m \begin{pmatrix} \psi(a_+) - \psi(a_1) + \log G_1 \\ \vdots \\ \psi(a_+) - \psi(a_n) + \log G_n \end{pmatrix} \tag{2.66}$$
and
$$H = \nabla^2 \ell(a \mid Y_{\text{obs}}) = B + b\, \mathbf{1}_n \mathbf{1}_n', \tag{2.67}$$
where $B \mathrel{\hat{=}} -m\, \text{diag}(\psi'(a_1), \ldots, \psi'(a_n))$, $b \mathrel{\hat{=}} m\, \psi'(a_+)$, and where
$$\psi(x) = \frac{d \log \Gamma(x)}{dx} \quad\text{and}\quad \psi'(x) = \frac{d \psi(x)}{dx} \tag{2.68}$$
are the digamma and trigamma functions respectively. It is easy to show that
$$H^{-1} = B^{-1} - \frac{B^{-1} \mathbf{1}_n \mathbf{1}_n' B^{-1}}{b^{-1} + \mathbf{1}_n' B^{-1} \mathbf{1}_n}.$$
Note that (2.67) does not depend on the observed data $Y_{\text{obs}}$. Hence, $J(a) = I_{\text{obs}}(a)$; that is, the Newton–Raphson algorithm is identical to the scoring algorithm. From (1.17), we obtain
$$a^{(t+1)} = a^{(t)} - [\nabla^2 \ell(a^{(t)} \mid Y_{\text{obs}})]^{-1} \nabla \ell(a^{(t)} \mid Y_{\text{obs}}) = a^{(t)} - H^{-1} g. \tag{2.69}$$

A Fortran subroutine written by Narayanan (1991) calculates the MLE $\hat{a}$ based on (2.69).

(b) Calculation of initial values

The Newton–Raphson method requires initial parameter estimates. Dishon and Weiss (1980) suggested using estimates from the method of moments. In fact, from (2.6), we have
$$a_j = a_+ \cdot E(x_j) \quad\text{and}\quad 1 + a_+ = \frac{E(x_j)\{1 - E(x_j)\}}{\text{Var}(x_j)}.$$
Using the empirical means and variances
$$\bar{x}_j = \frac{1}{m} \sum_{i=1}^{m} x_{ij} \quad\text{and}\quad s_{jj} = \frac{1}{m-1} \sum_{i=1}^{m} (x_{ij} - \bar{x}_j)^2, \quad j = 1, \ldots, n,$$
to estimate $E(x_j)$ and $\text{Var}(x_j)$ respectively, we have⁹
$$a^{(0)} = a_+^{(0)} \begin{pmatrix} \bar{x}_1 \\ \vdots \\ \bar{x}_n \end{pmatrix} \quad\text{and}\quad a_+^{(0)} = \left\{\prod_{j=1}^{n-1} \frac{\bar{x}_j (1 - \bar{x}_j)}{s_{jj}}\right\}^{1/(n-1)} - 1. \tag{2.70}$$
Ronning (1989) observed that the $a^{(0)}$ defined above could result in parameters outside the admissible region ($a_j > 0$, $\forall\, j \in \{1, \ldots, n\}$) and thus proposed
$$a^{(0)} = \min_{1 \leq i \leq m,\ 1 \leq j \leq n} \{x_{ij}\} \cdot \mathbf{1}_n. \tag{2.71}$$
With this initialization, Ronning (1989) proved that, after the first step of the Newton–Raphson algorithm, the parameters remain within the admissible region. However, it was not proved that this is still true in the subsequent steps. Wicker et al. (2008) further showed that the admissibility is not guaranteed for an infinite number of cases under general conditions.

(c) Refinement of the initialization

The following formulation is due to Wicker et al. (2008). Assume that $b_j = a_j/a_+$ is known¹⁰ and only $a_+$ is the unknown parameter. The log-likelihood (2.64) becomes
$$\ell(a_+ \mid Y_{\text{obs}}) = m\left\{\log \Gamma(a_+) - \sum_{j=1}^{n} \log \Gamma(b_j a_+) + \sum_{j=1}^{n} (b_j a_+ - 1) \log G_j\right\}.$$
At the maximum, we have
$$D = \psi(a_+) - \sum_{j=1}^{n} b_j\, \psi(b_j a_+) + \delta_0 = 0, \quad\text{where}\quad \delta_0 \mathrel{\hat{=}} \sum_{j=1}^{n} b_j \log G_j. \tag{2.72}$$

⁹ Note that the corresponding formula on line 11 of Wicker et al. (2008: 1316) has a typo.
¹⁰ In practice, the $b_j = E(x_j)$ are estimated with $\bar{x}_j = (1/m)\sum_{i=1}^{m} x_{ij}$.


The digamma function $\psi(x)$ has the following two approximations (e.g., see Abramowitz and Stegun (1965)):
$$\psi(x) = \log(x) - 1/(2x) + o(1/x) \ \text{ when } x \to +\infty \quad\text{and}\quad \psi(x) \sim -1/x \ \text{ when } x \to 0.$$
Furthermore, $\psi(x)$ can be approximated by
$$A(x) = \frac{x \log(x)}{1 + x} - \frac{\gamma}{x},$$
so that
$$\lim_{x \to 0} \frac{\psi(x)}{A(x)} = \frac{1}{\gamma}, \qquad A(1) = -\gamma = \psi(1), \qquad\text{and}\qquad \lim_{x \to +\infty} \frac{\psi(x)}{A(x)} = 1.$$
Note that $\sum_{j=1}^{n} b_j = 1$ and $D$ thus becomes
$$D \doteq \frac{a_+ \log(a_+)}{1 + a_+} - \frac{\gamma}{a_+} - \sum_{j=1}^{n} b_j\left\{\frac{b_j a_+ \log(b_j a_+)}{1 + b_j a_+} - \frac{\gamma}{b_j a_+}\right\} + \delta_0$$
$$= \frac{(n-1)\gamma}{a_+} + \delta_0 - \sum_{j=1}^{n}\left\{\frac{b_j^2 a_+ \log(b_j)}{1 + b_j a_+} + \log(a_+)\left(\frac{b_j^2 a_+}{1 + b_j a_+} - \frac{b_j a_+}{1 + a_+}\right)\right\}.$$
It is easy to show that
$$\lim_{a_+ \to 0} \sum_{j=1}^{n}\left\{\frac{b_j^2 a_+ \log(b_j)}{1 + b_j a_+} + \log(a_+)\left(\frac{b_j^2 a_+}{1 + b_j a_+} - \frac{b_j a_+}{1 + a_+}\right)\right\} = 0$$
and
$$\lim_{a_+ \to +\infty} \sum_{j=1}^{n}\left\{\frac{b_j^2 a_+ \log(b_j)}{1 + b_j a_+} + \log(a_+)\left(\frac{b_j^2 a_+}{1 + b_j a_+} - \frac{b_j a_+}{1 + a_+}\right)\right\} = \sum_{j=1}^{n} b_j \log(b_j).$$
As a result, the approximation of $D$ is
$$D \doteq \frac{(n-1)\gamma}{a_+} + \delta_0 - \sum_{j=1}^{n} b_j \log(b_j).$$
The solution is then given by
$$a_+ = \frac{(n-1)\gamma}{\sum_{j=1}^{n} b_j \log(b_j) - \delta_0}, \tag{2.73}$$
where $\gamma = -\psi(1) = 0.57722$ and $\delta_0$ is given by (2.72).
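For readers without access to Narayanan's (1991) Fortran subroutine, the following Python sketch (our own illustration, not the published routine) implements the Newton–Raphson iteration (2.69) together with the moment-based initial value (2.70), using the closed-form expression for $H^{-1}$ given above; the small clamping guard at the end of each step is ours, reflecting the admissibility issue discussed above rather than any published rule.

```python
import numpy as np
from scipy.special import digamma, polygamma

def dirichlet_mle(X, tol=1e-10, max_iter=500):
    """Newton-Raphson MLE of Dirichlet parameters; X is an (m, n) matrix of compositions."""
    m, n = X.shape
    logG = np.log(X).mean(axis=0)                       # log geometric means, log G_j in (2.65)
    xbar, s = X.mean(axis=0), X.var(axis=0, ddof=1)
    # moment-based initial value (2.70)
    a_plus0 = np.prod(xbar[:-1] * (1 - xbar[:-1]) / s[:-1]) ** (1.0 / (n - 1)) - 1.0
    a = np.maximum(a_plus0 * xbar, 1e-3)
    for _ in range(max_iter):
        g = digamma(a.sum()) - digamma(a) + logG        # gradient / m, as in (2.66)
        q = -polygamma(1, a)                             # diagonal of B / m
        b = polygamma(1, a.sum())                        # b / m
        # Newton step a - H^{-1} g, using the Sherman-Morrison form of H^{-1} above
        step = (g - (g / q).sum() / (1.0 / b + (1.0 / q).sum())) / q
        a_new = np.maximum(a - step, 1e-10)              # crude guard against leaving the admissible region
        if np.all(np.abs(a_new - a) < tol):
            return a_new
        a = a_new
    return a

# example on synthetic data of roughly the same scale as the serum-protein analysis below
rng = np.random.default_rng(1)
X = rng.dirichlet([3.2, 20.4, 21.7], size=23)
print(dirichlet_mle(X))
```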

76

DIRICHLET AND RELATED DISTRIBUTIONS

2.7.2 MLE via the EM gradient algorithm

Let $x = (x_1, \ldots, x_n)' \sim \text{Dirichlet}_n(a)$ on $T_n$ denote the observed data. Hence, Theorem 2.1 implies that the underlying random vector $y = (y_1, \ldots, y_n)'$ constitutes the complete data, where $x_j = y_j/\mathbf{1}'y$ and
$$y_j \overset{\text{ind}}{\sim} \text{Gamma}(a_j, 1), \quad j = 1, \ldots, n.$$
Let $Y_{\text{obs}} = \{x_i\}_{i=1}^{m}$, $x_i = (x_{i1}, \ldots, x_{in})' \overset{\text{iid}}{\sim} \text{Dirichlet}_n(a)$ on $T_n$. We aim to estimate the parameter vector $a = (a_1, \ldots, a_n)'$ by the EM algorithm. Note that the complete data are $Y_{\text{com}} = \{y_i\}_{i=1}^{m}$ and the corresponding complete-data log-likelihood function is
$$\ell(a \mid Y_{\text{com}}) = \sum_{i=1}^{m} \sum_{j=1}^{n} \left\{-\log \Gamma(a_j) + (a_j - 1) \log(y_{ij}) - y_{ij}\right\}.$$
Thus, the Q function is given by
$$Q(a \mid a^{(t)}) = E\left\{\ell(a \mid Y_{\text{com}}) \mid Y_{\text{obs}}, a^{(t)}\right\} = \sum_{j=1}^{n} \left[-m \log \Gamma(a_j) + (a_j - 1) \sum_{i=1}^{m} E\{\log(y_{ij}) \mid x_i, a^{(t)}\}\right]. \tag{2.74}$$
We cannot optimize $Q(a \mid a^{(t)})$ analytically due to the $\log \Gamma(a_j)$ terms in (2.74). However, it is easy to verify that the Hessian matrix
$$\nabla^{20} Q(a \mid a^{(t)}) = \text{diag}(-m\psi'(a_1), \ldots, -m\psi'(a_n))$$
is negative definite. By combining (1.29) with (2.66), we have the following EM gradient algorithm:
$$a^{(t+1)} = a^{(t)} + \begin{pmatrix} \{\psi(a_+^{(t)}) - \psi(a_1^{(t)}) + \log G_1\}/\psi'(a_1^{(t)}) \\ \vdots \\ \{\psi(a_+^{(t)}) - \psi(a_n^{(t)}) + \log G_n\}/\psi'(a_n^{(t)}) \end{pmatrix}. \tag{2.75}$$

One advantage of the EM gradient algorithm (2.75) over the corresponding EM algorithm is that it is unnecessary to evaluate the conditional expectations of the E-step (i.e., the calculation of the term E{log(yij )|xi , a(t) } in (2.74) can be avoided).
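A minimal Python sketch of the EM gradient iteration (2.75) is given below (ours, written in the same style as the Newton–Raphson sketch above); in line with the comparison reported in Section 2.7.3, it typically needs many more iterations than Newton–Raphson to reach the same tolerance.

```python
import numpy as np
from scipy.special import digamma, polygamma

def dirichlet_mle_em_gradient(X, a0, tol=1e-10, max_iter=100_000):
    """EM gradient iteration (2.75): a_j <- a_j + (psi(a_+) - psi(a_j) + log G_j) / psi'(a_j)."""
    logG = np.log(X).mean(axis=0)
    a = np.asarray(a0, dtype=float).copy()
    for _ in range(max_iter):
        update = (digamma(a.sum()) - digamma(a) + logG) / polygamma(1, a)
        a_new = a + update
        if np.all(np.abs(a_new - a) < tol):
            return a_new
        a = a_new
    return a
```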

2.7.3 Analyzing serum-protein data of Pekin ducklings

We consider the serum-protein data displayed in Table 1.1. Let $y_{ij}$ be a random variable whose value is proportional to the particular type of blood serum level for the $i$th set. Hence,
$$x_{ij} = \frac{y_{ij}}{y_{i1} + y_{i2} + y_{i3}}$$
gives the corresponding proportions, $i = 1, \ldots, m$ ($m = 23$) and $j = 1, \ldots, n$ ($n = 3$). Suppose that
$$x_i = (x_{i1}, x_{i2}, x_{i3})' \overset{\text{iid}}{\sim} \text{Dirichlet}_3(a_1, a_2, a_3) \ \text{ on } T_3.$$
Our objective here is to find the MLEs of these parameters and the associated SEs. We use both the Newton–Raphson algorithm specified by (2.69) and the EM gradient algorithm specified by (2.75) to find the MLEs of $a_1$, $a_2$, and $a_3$. Four initial values are considered. From (2.70), (2.71), and (2.73), we obtain
$$a_+^{(0)} = 40.662, \quad a^{(0)} = (2.9091, 18.3436, 19.4088)';$$
$$a^{(0)} = (0.0140, 0.0140, 0.0140)';$$
$$\text{and}\quad a_+^{(0)} = 50.4850, \quad a^{(0)} = (3.6118, 22.7751, 24.0976)'.$$

The last one is a(0) = (1, 1, 1). Starting from a(0) , the four Newton–Raphson (EM gradient) algorithms converge to the maximum point aˆ = (ˆa1 , aˆ 2 , aˆ 3 ) = (3.2154, 20.383, 21.685), taking respectively 3 (369), 16 (485), 3 (480), and 9 (475) iterations for the log-likelihood to achieve its final value 73.125. The estimated asymptotic covariance matrix of the MLE aˆ is ⎛ ⎞ 0.459 21 2.462 10 2.623 30 ⎜ ⎟ ⎝ 2.462 06 18.7008 19.0044 ⎠ . 2.623 32 19.0044 21.1704 Thus, the corresponding asymptotic SEs are 0.677 65, 4.324 44, 4.601 13, and the asymptotic 90% CIs are [2.1008, 4.3301], [13.270, 27.496], and [14.117, 29.254]. This example confirms the relatively slow convergence of the EM gradient algorithm.

2.8 Generalized method of moments estimation

Suppose that we observe a sample of size $m$ from a Dirichlet population, denoted by
$$Y_{\text{obs}} = \{x_i\}_{i=1}^{m}, \quad\text{where}\quad x_i = (x_{i1}, \ldots, x_{in})' \overset{\text{iid}}{\sim} \text{Dirichlet}_n(a) \tag{2.76}$$
on $T_n$. In this section, we will introduce two methods (namely, the moments estimation and the generalized moments estimation) to estimate the Dirichlet parameters $a = (a_1, \ldots, a_n)'$. For the sake of convenience, we define the sample matrix as
$$X_{m \times n} = \begin{pmatrix} x_1' \\ \vdots \\ x_m' \end{pmatrix} = (x_{(1)}, \ldots, x_{(n)}) = (x_{ij}), \tag{2.77}$$


for $i = 1, \ldots, m$ and $j = 1, \ldots, n$. The corresponding sample mean vector and sample covariance matrix are given by
$$\bar{x} = \frac{1}{m} \sum_{i=1}^{m} x_i = \frac{1}{m} X'\mathbf{1}_m = \begin{pmatrix} \frac{1}{m}\, x_{(1)}'\mathbf{1}_m \\ \vdots \\ \frac{1}{m}\, x_{(n)}'\mathbf{1}_m \end{pmatrix} = \begin{pmatrix} \bar{x}_1 \\ \vdots \\ \bar{x}_n \end{pmatrix} \tag{2.78}$$
and
$$S_{n \times n} = \frac{1}{m-1} \sum_{i=1}^{m} (x_i - \bar{x})(x_i - \bar{x})' = \frac{1}{m-1} X' Q_{\mathbf{1}_m} X = (s_{jj'}),$$
where
$$Q_{\mathbf{1}_m} \mathrel{\hat{=}} I_m - P_{\mathbf{1}_m} = I_m - \frac{\mathbf{1}_m \mathbf{1}_m'}{m}$$
is a projection matrix and
$$s_{jj'} = \frac{1}{m-1}\, x_{(j)}' Q_{\mathbf{1}_m} x_{(j')}, \quad j, j' = 1, \ldots, n. \tag{2.79}$$

2.8.1 Method of moments estimation

Using the sample mean vector to estimate the population mean vector and the sample covariance matrix to estimate the population covariance matrix, we can obtain the moment estimators of $a$ satisfying the system of equations from (2.6):
$$\begin{cases} \bar{x}_j = \dfrac{a_j}{a_+}, & j = 1, \ldots, n, \\[2mm] s_{jj} = \dfrac{a_j (a_+ - a_j)}{a_+^2 (1 + a_+)}, & j = 1, \ldots, n, \\[2mm] s_{jj'} = \dfrac{-a_j a_{j'}}{a_+^2 (1 + a_+)}, & j \neq j', \quad j, j' = 1, \ldots, n. \end{cases} \tag{2.80}$$
The method of moments estimation only works when the number of moment conditions equals the number of parameters to be estimated. Note that the number of equations in (2.80) is $n(n + 3)/2$, which is more than $n$, the number of parameters. Therefore, the system of equations specified by (2.80) is algebraically overidentified and may not be solvable. However, the moment estimators given in (2.70) are solved from the first two equations in (2.80).


2.8.2 Generalized method of moments estimation

The generalized method of moments (GMM) estimation developed by Hanson (1982) is a very general statistical method for obtaining estimates of parameters of interest. It is a generalization of the method of moments estimation.

(a) Definition of GMM estimator

Let $x$ denote the population random vector and $\theta_{n \times 1} \in \Theta \subseteq \mathbb{R}^n$ be the parameter vector of interest. The observations $Y_{\text{obs}}$ are specified by (2.76). The first step of the GMM is to seek $p$ appropriate functions $\{g_j(x; \theta)\}_{j=1}^{p}$ such that
$$E\{g(x; \theta)\} = E\begin{pmatrix} g_1(x; \theta) \\ \vdots \\ g_p(x; \theta) \end{pmatrix} = \begin{pmatrix} E\{g_1(x; \theta)\} \\ \vdots \\ E\{g_p(x; \theta)\} \end{pmatrix} = \mathbf{0}_p. \tag{2.81}$$
The second step is to replace $x$ in $g_j(x; \theta)$ by $x_i$ and then compute the average $\sum_{i=1}^{m} g_j(x_i; \theta)/m$. When $p = n$, the GMM estimator $\hat{\theta}$ is defined to be the solution to the system of equations
$$\frac{1}{m} \sum_{i=1}^{m} g_j(x_i; \theta) = 0, \quad j = 1, \ldots, p. \tag{2.82}$$
When $p > n$, the GMM estimator is defined by $\hat{\theta} = \arg\min \|h(\theta; Y_{\text{obs}})\|$ subject to $\theta \in \Theta$, where $h(\theta; Y_{\text{obs}}) = (h_1(\theta; Y_{\text{obs}}), \ldots, h_p(\theta; Y_{\text{obs}}))'$ and
$$h_j(\theta; Y_{\text{obs}}) = \frac{1}{m} \sum_{i=1}^{m} g_j(x_i; \theta), \quad j = 1, \ldots, p.$$

(b) How to find the function vector $g(x; \theta)$

Let $f(x; \theta)$ denote the density function of the random vector $x$. Hence,
$$1 = \int f(x; \theta)\, dx, \quad \forall\, \theta \in \Theta \subseteq \mathbb{R}^n.$$
By differentiating both sides of this identity with respect to $\theta$, we have
$$\mathbf{0}_n = \int \frac{\partial f(x; \theta)}{\partial \theta}\, dx = \int \frac{\partial \log f(x; \theta)}{\partial \theta} \cdot f(x; \theta)\, dx = E\left\{\frac{\partial \log f(x; \theta)}{\partial \theta}\right\}. \tag{2.83}$$


Therefore, it is appropriate to choose
$$g(x; \theta) = \frac{\partial \log f(x; \theta)}{\partial \theta} \tag{2.84}$$
and $p = n$.

(c) The GMM estimators of Dirichlet parameters

Let $x \sim \text{Dirichlet}_n(a)$ on $T_n$. We have
$$f(x; a) = \frac{1}{B_n(a)} \prod_{j=1}^{n} x_j^{a_j - 1}$$
and
$$\log f(x; a) = \log \Gamma(a_+) - \sum_{j=1}^{n} \log \Gamma(a_j) + \sum_{j=1}^{n} (a_j - 1) \log x_j.$$
From (2.84), we obtain
$$g_j(x; a) = \frac{\partial \log f(x; a)}{\partial a_j} = \psi(a_+) - \psi(a_j) + \log x_j, \quad j = 1, \ldots, n,$$
where $\psi(\cdot)$ is the digamma function defined by (2.68). Based on the observed data $Y_{\text{obs}} = \{x_i\}_{i=1}^{m}$ specified by (2.76), we can from (2.82) obtain the GMM estimator of $a$, $\hat{a}_{\text{GMM}}$, which satisfies the system of equations
$$\psi(a_+) - \psi(a_j) + \log G_j = 0, \quad j = 1, \ldots, n,$$
or
$$\begin{pmatrix} \psi(a_+) - \psi(a_1) + \log G_1 \\ \psi(a_+) - \psi(a_2) + \log G_2 \\ \vdots \\ \psi(a_+) - \psi(a_n) + \log G_n \end{pmatrix} = \mathbf{0}_n, \tag{2.85}$$

where {Gj } are defined in (2.65). By comparing (2.85) with (2.66), we know that the GMM estimator and the MLE of a are identical for the Dirichlet distribution.

2.9 Estimation based on linear models Some materials presented in this section are adapted from Zhang (2000).


2.9.1 Preliminaries

(a) Two useful lemmas

We first introduce two useful lemmas, which will be used in the following.

Lemma 2.10 (Zhang, 2000). Let $0 < u_i < 1$, $i = 1, \ldots, m$, and $\bar{u} = \frac{1}{m}\sum_{i=1}^{m} u_i$. We have
$$\sum_{i=1}^{m} (u_i - \bar{u})^2 < m\,\bar{u}(1 - \bar{u}). \quad ¶$$

Proof. Since $0 < u_i < 1$ for $i = 1, \ldots, m$, we have $u_i^2 < u_i$ and
$$\sum_{i=1}^{m} (u_i - \bar{u})^2 = \sum_{i=1}^{m} u_i^2 - m\bar{u}^2 < \sum_{i=1}^{m} u_i - m\bar{u}^2 = m\,\bar{u}(1 - \bar{u}). \quad\text{End}$$


0, we define the event     ξ m      Am =  √  < δ . m

(2.91)

2

Hence, Pr

√

 m{f (¯zm ) − f (μ)} < z ∩ Am √  ¯m + Pr m{f (¯zm ) − f (μ)} < z ∩ A

m{f (¯zm ) − f (μ)} < z

= Pr

√

= ˆ I1 + I2 . ¯ m ) → 0 as m → ∞. Next, we consider the limit We first note that 0 ≤ I2 ≤ Pr(A √ of I1 as m → ∞. From (2.91), we have ||ξ m || < mδ. Utilizing the property that a continuous function is bounded, there exists a positive constant M such that |R1 | =

1  M |ξ m W∗ ξ m | ≤ ||ξ m ||2 . 2 2

√ Therefore, R1 / m converges in probability to zero. (ii) When θθ = 0, the first term in the right-hand side of (2.87) is zero. Similarly, we can prove that mR3 converges in probability to zero. End (b) Some notations Let x = (x1 , . . . , xn ) ∼ Dirichletn (a) on Tn . Making the one-to-one parameter transformation a+ =

n 

aj ,

j=1

bj =

aj , a+

j = 1, . . . , n,

we know that the number of the free new parameters is n as rewrite (2.6) as ⎧ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎨

E(xj ) = bj , Var(xj ) =

j=1

bj = 1. We can

j = 1, . . . , n,

bj (1 − bj ) , j = 1, . . . , n, 1 + a+

⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ −bj bj ⎪ ⎪ , ⎩ Cov(xj , xj ) = 1 + a+

n

j= / j ,

and

j, j = 1, . . . , n.

(2.92)

84

DIRICHLET AND RELATED DISTRIBUTIONS

In terms of matrix notation, we have ⎧ E(x) = b, ⎪ ⎪ ⎪ ⎨ ⎪ ⎪ ⎪ ⎩ Var(x) =

and (2.93)

 1  diag(b) − bb , 1 + a+

where b = (b1 , . . . , bn ) and ⎛

b1 ⎜0 ⎜ diag(b) = ⎜ ⎜ .. ⎝ . 0

0 b2 .. . 0

⎞ 0 0⎟ ⎟ ⎟ .. ⎟ . . ⎠ · · · bn

··· ··· .. .

Although the matrix diag(b) − bb is singular, its Moore–Penrose generalized inverse matrix11  +  1 1  −1  diag(b) − bb = diag(b−1 ) + 11 − b 1 + 1(b−1 ) nc n

is unique,12 where b−1 = ˆ (b1−1 , . . . , bn−1 ) and c = ( n1

n j=1

bj−1 )−1 .

2.9.2 Estimation based on individual linear models (a) Derivation of the estimates The row vectors of the sample matrix X defined in (2.77) are observed samples and the column vectors of X denote the compositional vectors. For example, x(j) = (x1j , . . . , xmj ) denotes the samples for composition j, j = 1, . . . , n. From (2.92), we obtain a total of n linear models: ⎧ E{x(j) } = bj 1m , ⎪ ⎪ ⎪ ⎨ ⎪ bj (1 − bj ) ⎪ ⎪ Im , ⎩ Var{x(j) } = 1 + a+

A matrix B+ : n × m is said to be the Moore–Penrose generalized inverse of B: m × n if it satisfies BB+ B = B, B+ BB+ = B+ , (B+ B) = B+ B, and (BB+ ) = BB+ (Srivastava, 2002: 635). 12 Let B = diag(b) − bb, it can be verified that BB+ = B+ B = I − 1 1 1 is a projection matrix. n n n n We have BB+ B = B and B+ BB+ = B+ . In addition, we found a typo in the formula (2.7) of Zhang (2000: 135). 11

DIRICHLET DISTRIBUTION

for j = 1, . . . , n. Using the sample means estimate bj and the sample variances

1  x 1 m (j) m

85

= x¯ j defined in (2.78) to

1 1  x(j) Q1m x(j) = (xij − x¯ j )2 m−1 m − 1 i=1 m

sjj =

defined in (2.79) to estimate bj (1 − bj )/(1 + a+ ) yields13 bˆ j = x¯ j ,

and aˆ + = aˆ +(j) =

x¯ j (1 − x¯ j ) − 1, sjj

j = 1, . . . , n.

(2.94)

Noting that aˆ + depends on the subscript j, we obtain a total of n different estimates of a+ , denoting them by aˆ +(j) . We observe that some of the aˆ +(j) in (2.94) could be negative. However, if we replace sjj by 1  (xij − x¯ j )2 m i=1 m

sj2 =

(2.95)

and modify aˆ +(j) to ∗ aˆ +(j) =

x¯ j (1 − x¯ j ) − 1, sj2

j = 1, . . . , n,

(2.96)

∗ > 0 for all j. then it follows from Lemma 2.10 that aˆ +(j)

(b) Asymptotic variances of aˆ ∗+( j ) To compare the efficiency of the estimators derived in (2.96), we only need to compare their asymptotic variances. Lemma 2.11(i) indicates that the asymptotic distribution of a function of sample mean z¯ m , say f (¯zm ), is N(f (μ), θθ/m) with asymptotic variance θθ/m. Thus, we only need to compare θθ. Let 

zi =

13

zi1 zi2





= ˆ

xij xij2



,

i = 1, . . . , m.

We should note the difference between the moment estimators in (2.70) and the estimators in (2.94).

86

DIRICHLET AND RELATED DISTRIBUTIONS

Hence, 1  1  zi1 = xij = x¯ j z¯ 1 = m i=1 m i=1 m

m

1  1  2 zi2 = x . m i=1 m i=1 ij m

z¯ 2 =

and

m

We can rewrite (2.96) as ∗ = aˆ +(j)

=

1 m

x¯ j (1 − x¯ j ) −1 m 2 ¯ j2 i=1 xij − x

z¯ 1 (1 − z¯ 1 ) −1 z¯ 2 − z¯ 21

= ˆ f (¯z1 , z¯ 2 ). ∗ can be derived by Lemma 2.11(i). It can be The asymptotic distribution of aˆ +(j) shown that ⎛ ⎞     bj xj xj (2.92) μ2×1 = E(zi ) = E =E = E ⎝ bj (1 + bj a+ ) ⎠ xj2 xj2 1 + a+

and 

 = Var(zi ) =

Var(xj )

Cov(xj , xj2 )

Cov(xj2 , xj )

Var(xj2 )





= ˆ

σ11 σ12

σ12 σ22



(2.97)

,

where σ11 =

bj (1 − bj ) , 1 + a+

σ12 =

2bj (1 − bj )(1 + bj a+ ) , (1 + a+ )(2 + a+ )

σ22 =

and

2 bj (1 + bj a+ )[6(1 + a+ ) − (6 − 4a+ )bj − a+ (6 + 4a+ )bj2 ]

(1 + a+ )2 (2 + a+ )(3 + a+ )

Let f (t) = f (t1 , t2 ) =

t1 (1 − t1 ) − 1. t2 − t12

.

DIRICHLET DISTRIBUTION

87

We have ∂f (t1 , t2 ) t 2 − 2t1 t2 + t2 = 1 ∂t1 (t2 − t12 )2

and

∂f (t1 , t2 ) −t1 (1 − t1 ) = . ∂t2 (t2 − t12 )2

Hence, we have 

θ=

θ1 θ2



   1 + 2bj a+ ∂f (t)  1 + a+ = = ∂t t=μ bj (1 − bj ) −(1 + a+ )

and θθ = θ12 σ11 + 2θ1 θ2 σ12 + θ22 σ22 .

2.9.3 Estimation based on the overall linear model (a) Derivation of the estimates Alternatively, we could directly deal with the sample matrix X defined in (2.77). From (2.93), the overall linear model can be summarized as ⎧ E(Xm×n ) = 1m b, ⎪ ⎪ ⎪ ⎨ ⎪ ⎪ ⎪ ⎩ Var(xi )

=

and

diag(b) − bb , 1 + a+

(2.98) ∀ i = 1, . . . , m,

and {xi }m i=1 are mutually independent. This is a multivariate linear model subject to the constraint 1b = 1 and with degenerate covariance matrix. In order to reduce the constrained linear model to an unconstrained linear model, we may delete one column of the X – the last column, say – to obtain X∗ = (x(1) , . . . , x(n−1) ). Let b−n = (b1 , . . . , bn−1 ). It can be shown that E(X∗ ) = 1m b−n , and the row random vectors of X∗ are mutually independent and have the same covariance matrix =

diag(b−n ) − b−n b−n . 1 + a+

(2.99)

88

DIRICHLET AND RELATED DISTRIBUTIONS

Using the results of the multivariate linear model, the least-squares estimators of b−n and a+ are determined by 1 bˆ −n = X∗1m , and m diag(bˆ −n ) − bˆ −n bˆ−n 1 ˆ = = ˆ XQ1 X∗ . 1 + aˆ + m−n+1 ∗ m

(2.100) (2.101)

(b) Several different estimators of a+ based on (2.101) Several different estimators of a+ can be derived from (2.101). First, we consider the criterion of trace. Since n−1 ˆ2 bˆ j − n−1 j=1 j=1 bj ˆ = tr () , 1 + aˆ + the first estimator of a+ is given by n−1 j=1

aˆ + (1) =

bˆ j (1 − bˆ j ) − 1. ˆ tr ()

(2.102)

Second, we utilize n−1 j=1

ˆ 1n−1 = 1n−1 1

bˆ j − (

n−1 j=1

bˆ j )2

1 + aˆ +

=

bˆ n (1 − bˆ n ) 1 + aˆ +

to obtain the second estimator of a+ as aˆ + (2) =

bˆ n (1 − bˆ n ) − 1. ˆ 1n−1 1n−1 1

(2.103)

Third, we consider the criterion of determinant. From14 ˆ = || =

|diag(bˆ −n ) − bˆ −n bˆ−n | (1 + aˆ + )n−1 |diag(bˆ −n )|[1 − bˆ−n {diag(bˆ −n )}−1 bˆ −n ] n

=

14

j=1

(1 + aˆ + )n−1

bˆ j

(1 + aˆ + )n−1

,

Here, we need to use a special result of Theorem A.3.2 of Anderson (1984: 594): if C is nonsingular, then |C + yz| = |C|(1 + zC−1 y).

DIRICHLET DISTRIBUTION

89

the third estimator of a+ is given by  n

aˆ + (3) =

j=1

bˆ j

1/(n−1)

− 1.

ˆ ||

(2.104)

(c) Asymptotic variances of aˆ + (1), aˆ + (2), and aˆ + (3) We only derive the asymptotic variance of aˆ + (1). A similar method can be applied to derive the asymptotic variances of aˆ + (2) and aˆ + (3). From (2.101), we have ˆ = tr ()

n−1  j=1

=

n−1  j=1

=

n−1  j=1

1 x Q1 x(j) m − n + 1 (j) m  1 (xij − x¯ j )2 m − n + 1 i=1 m

m s2 , m−n+1 j

where sj2 is defined by (2.95). Thus, (2.102) becomes n−1

ˆ ˆ j=1 bj (1 − bj ) aˆ + (1) = n−1 m 2 − 1, j=1 m−n+1 sj which may be negative for some m and n. To ensure the positivity, we modify aˆ + (1) as n−1 ∗ aˆ + (1)

=

bˆ j (1 − bˆ j ) − 1. n−1 2 j=1 sj

j=1

∗ It can be verified that aˆ + (1) > 0 by means of Lemma 2.10. Similar to the procedure presented in Section 2.9.2(b), let



⎞ ⎛ ⎞ xi1 zi1 ⎜ . ⎟ ⎜ . ⎟ ⎜ .. ⎟ ⎜ .. ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ zi,n−1 ⎟ ⎜ xi,n−1 ⎟ ⎜ ⎟ ⎜ zi = ⎜ ˆ ⎜ 2 ⎟ ⎟ = ⎟, ⎜ yi1 ⎟ ⎜ xi1 ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ .. ⎟ ⎜ .. ⎟ ⎝ . ⎠ ⎝ . ⎠

yi,n−1

2 xi,n−1

i = 1, . . . , m.

(2.105)

90

DIRICHLET AND RELATED DISTRIBUTIONS

Hence,

z¯ j =

1  1  zij = xij = x¯ j m i=1 m i=1

y¯ j =

1  1  2 yij = x , m i=1 m i=1 ij

m

m

m

and

m

j = 1, . . . , n − 1.

We can rewrite (2.105) as n−1

∗ aˆ + (1)

¯ j (1 − x¯ j ) j=1 x −1 = n−1 yj − x¯ j2 ) j=1 (¯ n−1 ¯ j (1 − z¯ j ) j=1 z −1 = n−1 yj − z¯ 2j ) j=1 (¯ = ˆ f (¯z1 , . . . , z¯ n−1 , y¯ 1 , . . . , y¯ n−1 ).

∗ The asymptotic distribution of aˆ + (1) can be derived by Lemma 2.11(i). It can be shown that

⎞ b1 ⎜ ⎟ x1 .. ⎜ ⎟ . ⎜ ⎟ ⎜ . ⎟ ⎜ ⎟ ⎜ .. ⎟ ⎜ ⎟ ⎜ ⎟ bn−1 ⎜ ⎟ ⎜ ⎟ ⎜ xn−1 ⎟ (2.92) ⎜ b1 (1 + b1 a+ ) ⎟ ⎜ ⎟ ⎜ ⎟ = E(zi ) = E ⎜ 2 ⎟ = ⎜ ⎟ 1 + a+ ⎜ ⎟ ⎜ x1 ⎟ ⎜ ⎟ ⎜ ⎟ .. ⎜ ⎟ ⎜ .. ⎟ ⎜ ⎟ ⎝ . ⎠ . ⎜ ⎟ ⎝ 2 bn−1 (1 + bn−1 a+ ) ⎠ xn−1 1 + a+ ⎛

μ2(n−1)×1





and 

ˆ  = Var(zi ) =

11 21

12 22



,

DIRICHLET DISTRIBUTION

91

where 11 = Var(x−n ) = 

defined by (2.99),

12 = (Cov(xj , xj2 )) = 21 ,

and

22 = (Cov(xj2 , xj2 )). The jth diagonal elements of 12 and 22 are Cov(xj , xj2 ) = σ12 and Cov(xj2 , xj2 ) = σ22 , respectively, specified by (2.97). The corresponding off-diagonal elements of 12 and 22 are given by Cov(xj , xj2 ) = −

2bj bj (1 + bj a+ ) , (1 + a+ )(2 + a+ )

Cov(xj2 , xj2 ) = −

2bj bj (1 + bj a+ )(1 + bj a+ )(3 + 2a+ ) , (1 + a+ )2 (2 + a+ )(3 + a+ )

j= / j ,

and j= / j ,

respectively. Let n−1

j=1 tj (1 − tj ) − 1. f (t, r) = f (t1 , . . . , tn−1 , r1 , . . . , rn−1 ) = n−1 2 j=1 (rj − tj )

We have ∂f (t, r) = ∂tj

n−1

j=1 (rj

− tj2 ) + 2tj n−1 j=1 (tj − rj ) n−1 2 2 { j=1 (rj − tj )}

n−1 ∂f (t, r) j=1 tj (1 − tj ) = − n−1 , ∂rj { j=1 (rj − tj2 )}2

j = 1, . . . , n − 1.

Hence, we can calculate ⎛

⎞ ∂f (t, r)  ⎜ ∂t ⎟  ⎟ θ=⎜ , ⎝ ∂f (t, r) ⎠  (t,r)=μ ∂r

and θθ.

and

92

DIRICHLET AND RELATED DISTRIBUTIONS

2.10 Application in estimating ROC area 2.10.1 The ROC curve Diagnostic tests play an important role in medical studies and contribute significantly to health-care costs (Zhou et al., 2002). The ROC curve provides an overall accuracy for a diagnosis test (Pepe, 2003). We use the binary variable D to denote true disease status:  1 for disease, D= 0 for nondisease. Let Y denote the continuous result of a diagnostic test. By convention, larger values of the Y are more indicative of disease. Using a threshold c, we define a binary test T as follows: T =+ T =−

if Y ≥ c and if Y < c.

Furthermore, let FPF(c) = Pr{T = +|D = 0} and TPF(c) = Pr{T = +|D = 1}

(2.106) (2.107)

denote false positive fraction (FPF) and true positive fraction (TPF) at the threshold c respectively. The ROC is defined as ROC(·) = {(FPF(c), TPF(c)), c ∈ (−∞, +∞)} = {(t, ROC(t)), t ∈ (0, 1)}.

(2.108)

Therefore, the ROC curve (2.108) is a plot of the diagnostic test’s TPF (or sensitivity; i.e., the test’s ability to detect the condition of interest) versus its FPF (or 1−specificity; i.e., the test’s inability to recognize normal anatomy and physiology as normal). It can be shown that the ROC curve is a monotone increasing function mapping (0, 1) onto (0, 1). A useless test is one such that the distribution functions for T are the same in the diseased and nondiseased populations. The ROC curve for the useless test is then ROC(t) = t. On the other hand, a perfect test entirely separates diseased and nondiseased subjects. Its ROC curve is along the left and upper borders of the first unit quadrant. Better tests have ROC curves closer to the upper left corner. These are illustrated in Figure 2.2.

2.10.2 The ROC area In the literature, several numerical indices are proposed to summarize ROC curves. The most commonly used summary measure is the area under ROC curve (AUC),

93

1.0

DIRICHLET DISTRIBUTION

perfect

0.6 0.4

useless

0.0

0.2

ROC(t) = TPF(c)

0.8

ROC(t )

0.0

0.2

0.4

0.6

0.8

1.0

t = FPF(c )

Figure 2.2 ROC curves for the useless and perfect tests for comparison.

which is defined as



AUC =

1

ROC(t) dt. 0

The AUC gives a measure of the overall accuracy of the diagnostic test. Theorem 2.18 (Pepe, 2003: 78). Let YD and YD¯ denote independent and randomly chosen test results from the diseased and nondiseased populations respectively. Then AUC = Pr(YD > YD¯ ).

(2.109) ¶

Proof. Let SD and SD¯ denote the survivor functions for Y in the diseased and nondiseased populations: SD (y) = Pr(Y ≥ y|D = 1)

and SD¯ (y) = Pr(Y ≥ y|D = 0).

From (2.106), we have t = FPF(c) (2.106)

= Pr{Y ≥ c|D = 0} = SD¯ (c).

94

DIRICHLET AND RELATED DISTRIBUTIONS

−1 Thus, c = SD ¯ (t). On the other hand, from (2.106), we obtain

ROC(t) = TPF(c) (2.107)

= Pr{Y ≥ c|D = 1} = SD (c) −1 = SD (SD t ∈ (0, 1). ¯ (t)),

(2.110)

Finally, we have 

AUC =

1

ROC(t) dt 0

(2.110)



=

0



= = =

1

−1 SD (SD ¯ (t)) dt

−1 (let y = SD ¯ (t))

−∞

+∞  +∞ −∞  +∞ −∞

SD (y) dSD¯ (y) Pr(YD > y)fD¯ (y) dy

(by YD ⊥ ⊥ YD¯ )

Pr(YD > YD¯ |YD¯ = y)fD¯ (y) dy

= Pr(YD > YD¯ ), End

which completes the proof.

The AUC given by (2.109) has an interesting interpretation. It is equal to the probability that test results from a randomly selected pair of diseased and nondiseased subjects are correctly ordered.

2.10.3 Computing the posterior density of the ROC area When the diagnostic result is discrete (or ordinal), it can be shown that (Result 4.10 in Pepe (2003)) AUC = Pr(YD > YD¯ ) +

1 Pr(YD = YD¯ ). 2

(2.111)

Now, we assume that the possible values of YD and YD¯ are the consecutive integers 1, 2, . . . , n. Let θi = Pr(YD = i), φj = Pr(YD¯ = j),

i = 1, . . . , n, j = 1, . . . , n.

and

DIRICHLET DISTRIBUTION

95

We have θ = (θ1 , . . . , θn ) ∈ Tn and φ = (φ1 , . . . , φn ) ∈ Tn . From (2.111) and the independence between YD and YD¯ , we obtain (Broemeling, 2007: 72, 82) AUC(θ, φ) =

n  i−1 

θi φj +

1 θi φi 2 i=1

θi φj −

1 θi φi 2 i=1

n

i=2 j=1

=

n  i 

n

i=1 j=1

= θ( n − 0.5In )φ, where



1 0 ⎜1 1 ⎜ ˆ ⎜

n = ⎜ .. .. ⎝. . 1 1

(2.112)

⎞ ··· 0 ··· 0⎟ ⎟ ⎟ .. ⎟ . .. . .⎠

···

1

Let yi denote the frequency of category ‘i’ in the M diseased patients and z j denote the frequency n of category ‘j’ in the N nondiseased patients. We have n i=1 yi = M and j=1 zj = N. The likelihood function for the observed data Yobs = {yi } ∪ {zj } is L(θ, φ|Yobs ) ∝

n  i=1

y

θi i

n 

z

φj j .

j=1

When the joint prior is the product of two Dirichlet densities, Dirichletn (θ −n |a) · Dirichletn (φ−n |b), the joint posterior density is given by Dirichletn (θ −n |a + y) · Dirichletn (φ−n |b + z),

(2.113)

where θ −n ∈ Vn−1 , φ−n ∈ Vn−1 , y = (y1 , . . . , yn ), and z = (z1 , . . . , zn ). The stochastic representation in Theorem 2.2(i) can be employed to generate i.i.d. posterior samples of θ and φ according to (2.113). Therefore, the posterior distribution of the AUC is readily determined as (2.112) is a function of θ and φ.

2.10.4 Analyzing the mammogram data of breast cancer We consider the mammogram data of breast cancer in Table 1.2. We adopt the uniform prior distributions for the parameters θ and φ. From (2.113), it can be seen that the posterior distribution of θ is Dirichlet with parameters (2, 1, 7, 12, 13), the posterior distribution of φ is also Dirichlet with parameters (10, 3, 12, 9, 1), and they are mutually independent. With the S-plus code presented in Appendix A.2.1, we generate 10 000 samples from the joint posterior distribution (2.113) and

8

DIRICHLET AND RELATED DISTRIBUTIONS

0

0

2

2

4

4

6

6

8

96

0.4

0.5

0.6

0.7 (a)

0.8

0.9

1.0

0.0

0.2

0.4

0.6

0.8

1.0

(b)

Figure 2.3 (a) The posterior density of the AUC given by (2.112) for the ultrasound rating data of breast cancer metastasis, where the density curve is estimated by a kernel density smoother based on 10 000 i.i.d. posterior samples. (b) The histogram of the AUC.

calculate 10 000 values, via (2.112), for the AUC. The Bayesian estimates of the median, mean, standard deviation, and 95% CI of AUC are given by 0.784 708, 0.781 545, 0.051 471, and [0.671 515, 0.872313] respectively. The corresponding posterior density and histogram of the AUC are shown in Figure 2.3.

3

Grouped Dirichlet distribution In this chapter we introduce a new family of distributions, called the GDD, which includes the Dirichlet distribution as a special member. First, we develop distribution theory for the GDD in its own right. Second, we use this family as a new tool for statistical analysis of incomplete categorical data. In Section 3.1 we present three examples related to incomplete categorical data analyses to motivate our consideration on GDD. Starting with the definition of the density function of a GDD with two partitions (see Section 3.2), we derive its two stochastic representations that provide two simple procedures for simulation in Section 3.3. Other properties, such as mixed moments and mode, are also derived in Section 3.3. Marginal distributions and their relations to the Liouville distribution of the second kind are investigated in Section 3.4. In Section 3.5 we consider conditional distributions of the GDD. The corresponding properties for the general GDD with multiple partitions are discussed in a similar fashion in Section 3.6. Section 3.7 introduces statistical methods for the analysis of incomplete categorical data for large- or small-sample sizes when the likelihood function is with the GDD form. Two data sets from a case–control study and a leprosy survey are used to illustrate how the GDD can be used as a new tool for analyzing incomplete categorical data. In Section 3.8 we provide the EM algorithm and DA Gibbs sampling to deal with likelihood functions beyond the GDD form by first considering a simple 2 × 2 contingency table with incomplete observations and analyzing the neurological complication data set. Next we investigate how to apply the GDD to construct CIs of parameters of interest for a more general incomplete r × c contingency table. Finally, in Section 3.9, the application of the GDD in the analysis of incomplete categorical data under a nonignorable missing data mechanism is presented. The approach based on GDD has at least two advantages over the commonly used approach based on the Dirichlet distribution in both frequentist and conjugate Bayesian inference: (i) in some cases, both the maximum likelihood and Baysian Dirichlet and Related Distributions: Theory, Methods and Applications, First Edition. Kai Wang Ng, Guo-Liang Tian and Man-Lai Tang. © 2011 John Wiley & Sons, Ltd. Published 2011 by John Wiley & Sons, Ltd. ISBN: 978-0-470-68819-9

98

DIRICHLET AND RELATED DISTRIBUTIONS

estimates have closed-form expressions in the new approach, which may not be available based on the commonly used approach; and (ii) even if a closed-form solution is not available, the EM and DA algorithms in the new approach converge much faster than those in the commonly used approach.

3.1 Three motivating examples This section will present three motivating examples. In the first example, the likelihood function (3.1) can be expressed exactly in terms of a GDD with two partitions (up to a normalizing constant). In the second example, the likelihood function (3.2) can be represented in terms of a GDD with four partitions. In the third example, the likelihood function (3.3) can be written as a product of two terms, namely a GDD (up to a normalizing constant) and a product of powers of linear combination of the parameters of interest. Efficient methods for analyzing these data sets are developed in subsequent sections. In Example 1.3 and Table 1.3, we denote the observed counts by Yobs = {n1 , . . . , n4 ; n12 , n34 }. The cell probability vector is θ = (θ1 , θ2 , θ3 , θ4 ) ∈ T4 and the odds ratio is defined by ψ=

θ1 θ4 . θ2 θ3

Under the assumption of MAR (Rubin, 1976), the observed-data likelihood function of θ is given by L(θ|Yobs ) ∝

 4



θini

· (θ1 + θ2 )n12 (θ3 + θ4 )n34 .

(3.1)

i=1

Obviously, if we treat θ as a random vector, then θ ∼ GDn,2,s (a, b), up to a normalizing constant, where n = 4, s = 2, a = (n1 + 1, n2 + 1, n3 + 1, n4 + 1), and b = (n12 , n34 ). In Example 1.4 and Table 1.4, let Yobs = {n1 , . . . , n10 ; n123 , n˜ 4 , n˜ 5 , n678 , n˜ 9 , n˜ 10 } represent the observed counts and θ = (θ1 , . . . , θ10 ) ∈ T10 be the cell probabilities. Under the assumption of MAR, the observed-data likelihood is L(θ|Yobs ) ∝

 10

θini +˜ni



· (θ1 + θ2 + θ3 )n123 (θ4 + θ5 )0

i=1

× (θ6 + θ7 + θ8 )n678 (θ9 + θ10 )0 ,

(3.2)

where n˜ i = 0 for i = 1, 2, 3, 6, 7, 8. Similarly, we find that θ ∼ GDn,m,s (a, b), up to a normalizing constant, where n = 10, m = 4, s = (3, 5, 8, 10), a = (n1 + n˜ 1 + 1, . . . , n10 + n˜ 10 + 1), and b = (n123 , 0, n678 , 0).

GROUPED DIRICHLET DISTRIBUTION

99

In Example 1.5 and Table 1.5, let Yobs = {n1 , . . . , n4 ; n12 , n34 ; n13 , n24 } denote the observed frequencies and θ = (θ1 , θ2 , θ3 , θ4 ) ∈ T4 the cell probability vector. The parameter of interest is θ23 = Pr(Y = 1) − Pr(X = 1) = θ2 − θ3 ; that is, the difference between the proportions of patients with a neurological complication before and after the treatment. Under the assumption of MAR, the observed-data likelihood function is L(θ|Yobs ) ∝

 4



θini · (θ1 + θ2 )n12 (θ3 + θ4 )n34

i=1

× (θ1 + θ3 )n13 (θ2 + θ4 )n24 .

(3.3)

Again, we note that the first term in (3.3) follows GD4,2,2 (a, b) with a = (n1 + 1, n2 + 1, n3 + 1, n4 + 1) and b = (n12 , n34 ), up to a normalizing constant, while the second term is simply a product of powers of a linear combination of θ = (θ1 , θ2 , θ3 , θ4 ).

3.2 Density function Motivated by the likelihood functions (3.1) and (3.3), we first define and study the GDD with two partitions (Tian et al., 2003; Ng et al., 2008). Definition 3.1 A random vector x ∈ Tn is said to follow a GDD with two partitions if the density of x−n = ˆ (x1 , . . . , xn−1 ) ∈ Vn−1 is −1

GDn,2,s (x−n |a, b) = cGD ·

 n 

xiai −1

  s b1  n b2 · xi xi ,

i=1

i=1

(3.4)

i=s+1

where a = (a1 , . . . , an ) is a positive parameter vector, b = (b1 , b2 ) is a nonnegative parameter vector, s is a known positive integer less than n, and the normalizing constant is given by cGD = B(a1 , . . . , as ) · B(as+1 , . . . , an ) · B

s i=1

ai + b 1 ,

n i=s+1

ai + b 2 .

We will write x ∼ GDn,2,s (a, b) on Tn or x−n ∼ GDn,2,s (a, b) on Vn−1 to distinguish the two equivalent representations. ¶ In particular, when b1 = b2 = 0, the GDD in (3.4) reduces to the Dirichlet distribution. Figure 3.1 shows some GDD densities for n = 3 with various combinations of a and b.

Z

5 0 5 -5 0 5 10 1 2 2

Z

DIRICHLET AND RELATED DISTRIBUTIONS

0 -2 0 2 4 6 8 1

100

1

1 0.8

0.6

Y

0.4 0.2 0

0.2

0.4

0.6

0.8

0.8

1

0.6

Y

0.4 0.2

X

0

0

0.2

1

X

Z

1

0 0 0 -10 0 10 2 3 4

(b)

5 0 5 0 - 5 0 5 10 1 2 2 3

Z

0.8

0

(a)

0.8

0.4

0.6

1 0.6

0. Y 4

0.2 0

0.2

0.4

0.6

X

0.8

1

0.8

0.6

Y

0

0.4 0.2 0

(c)

0.2

0.4

0.6

0.8

1

X

0

(d)

Figure 3.1 Plots of grouped Dirichlet densities GD3,2,2 (x−3 |a, b): (a) a = (1, 2, 3) and b = (4, 0); (b) a = (6, 7, 8) and b = (1, 10); (c) a = (11, 12, 13) and b = (0, 3); (d) a = (16, 7, 3) and b = (10, 2).

In addition, when s = n − 1, the GDD (3.4) reduces to the following important special case:

  n−1 ai −1 x · x−n 1b1 (1 − x−n 1 )an +b2 −1 i=1 i

 , (3.5) GDn,2,n−1 (x−n |a, b) = n−1 B(a1 , . . . , an−1 ) · B i=1 ai + b1 , an + b2 n−1 where x−n ∈ Vn−1 and x−n 1 = ˆ i=1 xi . For a better understanding of the nature of the GDD, we partition x, y, and a in the same fashion:       x(1) s y(1) a(1) , yn×1 = xn×1 = , an×1 = . x(2) n−s y(2) a(2)

Furthermore, throughout this chapter, we define the parametric functions, α1 , α2 , and α12 , as follows: α1 = ˆ a(1) 1 + b1 ,

α2 = ˆ a(2) 1 + b2 ,

α12 = α1 + α2 .

(3.6)

GROUPED DIRICHLET DISTRIBUTION

101

Hence, the normalizing constant cGD can be rewritten as cGD = Bs (a(1) ) · Bn−s (a(2) ) · B(α1 , α2 ).

(3.7)

3.3 Basic properties The following theorem provides a stochastic representation of the GDD and hence a straightforward procedure for generating i.i.d. samples, which plays a crucial role in Bayesian analysis for incomplete categorical data. Theorem 3.1

A random vector x ∼ GDn,2,s (a, b) on Tn iff     x(1) d R · y(1) x= = x(2) (1 − R) · y(2)

(3.8)

where (i) (ii) (iii) (iv)

y(1) ∼ Dirichlets (a(1) ) on Ts ; y(2) ∼ Dirichletn−s (a(2) ) on Tn−s ; R ∼ Beta(α1 , α2 ); and y(1) , y(2) , and R are mutually independent.



Proof. If x ∼ GDn,2,s (a, b), then the pdf of x−n is given by (3.4). Noting that x(1) = x(1) 1 ·

x(1) , x(1) 1

x(2) , x(2) 1

(3.9)

R = x(1) 1 .

(3.10)

x(2) = x(2) 1 ·

and x1 = 1, we make the following transformation: y(1) =

x(1) , R

y(2) =

x(2) , 1−R

and

The Jacobian can be shown to be J(x−n → y1 , . . . , ys−1 , ys+1 , . . . , yn−1 , R) = Rs−1 (1 − R)n−s−1 . Hence, the joint density of y1 , . . . , ys−1 , ys+1 , . . . , yn−1 and R is s ai −1 ai −1 n Rα1 −1 (1 − R)α2 −1 i=s+1 yi i=1 yi · , · Bs (a(1) ) B(α1 , α2 ) Bn−s (a(2) )

(3.11)

where y(1) ∈ Ts , y(2) ∈ Tn−s , and 0 ≤ R ≤ 1. Note that (3.11) has been factorized into independent Dirichlet and beta distributions for y(1) , y(2) , and R. By combining (3.9) and (3.10), we immediately obtain (3.8). Conversely, suppose that (3.8) holds. Therefore, the joint density of y1 , . . . , ys−1 , ys+1 , . . . , yn−1 and R is given by (3.11). It is easy to show that the density of x−n is given by (3.4). End

102

DIRICHLET AND RELATED DISTRIBUTIONS

The following two theorems are immediate results from Theorem 3.1. A random vector x ∼ GDn,2,n−1 (a, b) on Tn iff

Theorem 3.2



x=

x−n 1 − x−n 1



 d

=

R · y−n 1−R



,

(3.12)

where (i) y−n ∼ Dirichletn−1 (a−n ) on Tn−1 ; (ii) R ∼ Beta(a−n 1 + b1 , an + b2 ); and (iii) y−n and R are mutually independent. If x = (x(1), x(2)) ∼ GDn,2,s (a, b) on Tn , then

Theorem 3.3 (i) (ii) (iii) (iv)



x(1) /x(1) 1 ∼ Dirichlets (a(1) ) on Ts ; x(2) /x(2) 1 ∼ Dirichletn−s (a(2) ) on Tn−s ; x(1) 1 ∼ Beta(α1 , α2 ); and x(1) /x(1) 1 , x(2) /x(2) 1 and x(1) 1 are mutually independent.



We now consider the mixed moments of x by means of the stochastic representation (3.8). The independence among y(1) , y(2) , and R implies

E

 n 



xiri

=E

 s 

i=1



yiri



·E

i=1

n 



yiri

n

s  · E R i=1 ri (1 − R) i=s+1 ri .

i=s+1

Using Theorem 1.3 of Fang et al. (1990: 18), we obtain the following results. Theorem 3.4 Let x ∼ GDn,2,s (a, b) on Tn . For any r1 , . . . , rn ≥ 0, the mixed moments of x are given by

E

 n



xiri

i=1

=

Bs (a(1) + r(1) ) Bn−s (a(2) + r(2) ) · Bs (a(1) ) Bn−s (a(2) ) ×

B(α1 + r(1) 1 , α2 + r(2) 1 ) , B(α1 , α2 )

where r = (r(1), r(2)), r(1) : s × 1, r(2) : (n − s) × 1. In particular, we have E(xi ) =

ai α12



 α2 α1 · I + · I , (1≤i≤s) (s+1≤i≤n) a(1) 1 a(2) 1

GROUPED DIRICHLET DISTRIBUTION

Var(xi ) =

103

⎧   α1 ai ai + 1 α1 ai α1 + 1 ⎪ ⎪ · − · , ⎪ ⎪ ⎪ α12 a(1) 1 α12 + 1 a(1) 1 + 1 α12 a(1) 1 ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ for 1 ≤ i ≤ s, ⎨   ⎪ ⎪ ⎪ ai + 1 α2 ai α2 + 1 α2 ai ⎪ ⎪ · − · , ⎪ ⎪ ⎪ α12 a(2) 1 α12 + 1 a(2) 1 + 1 α12 a(2) 1 ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎩ for s + 1 ≤ i ≤ n,

  ⎧ 1 α1 α1 a i a j α1 + 1 ⎪ ⎪ · − , ⎪ ⎪ α12 a(1) 1 α12 + 1 a(1) 1 + 1 α12 a(1) 1 ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ for 1 ≤ i < j ≤ s, ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪   ⎪ ⎪ 1 1 α1 α2 ai aj ⎪ ⎪ , ⎪ ⎨ α a(1)  · a(2)  α + 1 − α 12 12 12 1 1 Cov(xi , xj ) = ⎪ ⎪ ⎪ ⎪ ⎪ for 1 ≤ i ≤ s, s + 1 ≤ j ≤ n, ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪   ⎪ ⎪ ⎪ α2 ai aj 1 α2 α2 + 1 ⎪ ⎪ · − , ⎪ ⎪ α12 a(2) 1 α12 + 1 a(2) 1 + 1 α12 a(2) 1 ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎩ for s + 1 ≤ i < j ≤ n,

where α1 , α2 , and α12 are defined in (3.6), and I(·) denotes the indicator function. ¶ Next, we can find the mode of the GDD directly by calculus of several variables and we have the following result. Theorem 3.5 given by

If ai ≥ 1, then the mode of the grouped Dirichlet density (3.4) is  ai − 1 α1 − s xˆ i = · I(1≤i≤s) a1 + b1 − n a(1) 1 − s  α2 − (n − s) + (2) · I(s+1≤i≤n) , a 1 − (n − s)

where 1 ≤ i ≤ n.

(3.13) ¶

104

DIRICHLET AND RELATED DISTRIBUTIONS

3.4 Marginal distributions In this section we first derive the marginal distributions of subvectors of the random vector following a GDD with two partitions. It turns out that these marginal distributions are the Liouville distribution of the second kind, including the beta–Liouville distribution. Here, we follow the definition presented in Fang et al. (1990: Chapter 6). An alternative definition with different notations is given in Section 8.5 of this book. Let γ = (γ1 , . . . , γm ). A random vector z ∈ Rm + follows a Liouville distribution, denoted by z ∼ Liouvillem (γ; f ), if z has the stochastic representation d

z = R · y, where R ⊥ ⊥ y with y ∼ Dirichletm (γ) on Tm , and R is called the generating variate, with density f , called the generating density. It can be shown that the Liouville distribution Liouvillem (γ; f ) has density 

g(z) =

γi −1 m i=1 zi

Bm (γ)



·

f (z1 ) γ1 −1

z1

.

(3.14)

If f has an unbounded domain, the Liouville distribution is of the first kind. If f is defined in the interval [0, d], the Liouville distribution is of the second kind, being restricted in the open simplex Vm (d). In particular, when R ∼ Beta(α, β), we say that z follows a beta–Liouville distribution and write z ∼ BLiouvillem (γ; α, β). First, we note that the GDD with s = n − 1, namely GDn,2,n−1 (a, b), having the density (3.5) and the stochastic representation (3.12), is equivalent to a beta– Liouville distribution, namely BLiouvillen−1 (a−n ; a−n 1 + b1 , an + b2 ). Second, by comparing the stochastic representation (3.8) with that of the Liouville distribution, we can see that x(1) and x(2) have beta–Liouville distributions, BLiouvilles (a(1) ; α1 , α2 ) and BLiouvillen−s (a(2) ; α2 , α1 ) respectively, where α1 and α2 are defined in (3.6). We therefore have the following results. Theorem 3.6 Let x = (x(1), x(2)) ∼ GDn,2,s (a, b) on Tn and α1 and α2 be defined in (3.6), then the following hold. (i) x(1) has the following stochastic representation: d

x(1) = R · y(1) ∼ BLiouvilles (a(1) ; α1 , α2 )

 d = GDs+1,2,s (a(1), a(2) 1 ), b

GROUPED DIRICHLET DISTRIBUTION

105

inside Vs with density s

g(x ) = (1)

i=1

xiai −1 x(1) 1b1 (1 − x(1) 1 )α2 −1 , Bs (a(1) ) · B(α1 , α2 )

and the stochastic representation can be alternatively written as     x(1) R · y(1) d = . 1−R 1 − x(1) 1

(3.15)

(3.16)

(ii) Similarly, we have d

x(2) = (1 − R) · y(2) ∼ BLiouvillen−s (a(2) ; α2 , α1 )

 d = GDn−s+1,2,n−s (a(2), a(1) 1 ), (b2 , b1 ) inside Vn−s with density n xiai −1 · x(2) 1b2 (1 − x(2) 1 )α1 −1 (2) g(x ) = i=s+1 , Bn−s (a(2) ) · B(α2 , α1 ) and the stochastic representation is given by     x(2) (1 − R) · y(2) d = . R 1 − x(2) 1



Note that the two subvectors in Theorem 3.6 are very special since they correspond to the two components of the partitions of GDn,2,s (a, b). We consider the marginal distributions of those subvectors x(1) and x(2) as well. Theorem 3.7 define

Let x ∼ GDn,2,s (a, b) on Tn , and for 1 ≤ m < s, s + 1 ≤ k < n,

x∗(1) = (x1 , . . . , xm ),

δ1 =

a∗(1) = (a1 , . . . , am ),

τ1 =

x∗(2) = (xs+1 , . . . , xk ), a∗(2)

m i=1

m i=1

xi , ai ,

k

y∗(1) = (y1 , . . . , ym ), τ2 = si=m+1 ai ,

y∗(2) = (ys+1 , . . . , yk ), = (as+1 , . . . , ak ), τ3 = ki=s+1 ai , τ4 = ni=k+1 ai . δ2 =

i=s+1

xi ,

The following are true: (i) x∗(1) ∼ Liouvillem (a∗(1) ; f∗(1) ) with density   ai −1 m 1 i=1 xi (1) · τ1 −1 · f∗(1) (δ1 ), g(x∗ ) = (1) δ1 Bm (a∗ )

x∗(1) ∈ Vm ,

(3.17)

106

DIRICHLET AND RELATED DISTRIBUTIONS

where the generating density f∗(1) is given by  1−δ δτ11 −1 0 1 (δ1 + y)b1 (1 − δ1 − y)α2 −1 yτ2 −1 dy (1) f∗ (δ1 ) = . B(τ1 , τ2 ) B(α1 , α2 ) Furthermore, x∗(1) has the stochastic representation

      d x∗(1) = R · y∗(1) 1 · y∗(1) / y∗(1) 1 ,

(3.18)

(3.19)

  where R ∼ Beta(α1 , α2 ), y∗(1) 1 ∼ Beta(τ1 , τ2 ),

y(1)  ∗  ∼ Dirichletm (a∗(1) ),  (1)  y∗ 

 (1)    ∼ f (1) , R(1) ∗ = R · y∗ ∗ 1

1

    and R, y∗(1) 1 and y∗(1) / y∗(1) 1 are mutually independent.

(ii) x∗(2) ∼ Liouvillek−s a∗(2) ; f∗(2) with density ⎧ ⎫ ai −1 ⎬ ⎨ k 1 i=s+1 xi

 · τ −1 · f∗(2) (δ2 ), x∗(2) ∈ Vk−s , g(x∗(2) ) = (2) ⎭ δ 3 ⎩B a 2 k−s



where the generating density f∗(2) is given by  1−δ δτ23 −1 0 2 (δ2 + y)b2 (1 − δ2 − y)α1 −1 yτ4 −1 dy (2) . f∗ (δ2 ) = B(τ3 , τ4 ) B(α2 , α1 ) Furthermore, x∗(2) has the stochastic representation        d x∗(2) = (1 − R) · y∗(2) 1 · y∗(2) / y∗(2) 1 ,   where R ∼ Beta(α1 , α2 ), y∗(2) 1 ∼ Beta(τ3 , τ4 ),

y(2)  ∗  ∼ Dirichletk−s (a∗(2) ),  (2)  y∗ 

 (2)    ∼ f (2) , R(2) ∗ = (1 − R) y∗ ∗ 1

1

    and R, y∗(2) 1 and y∗(2) / y∗(2) 1 are mutually independent.



Proof. To derive (3.17), we first introduce a special case of the famous integral formula of Joseph Liouville (e.g., see Fang et al. (1990: 21)). Let h(·) be a real function defined on [0, d]. Then 

n n ai −1   d n x x dx n h i i i i=1 i=1 R+ h(y)y i=1 ai −1 dy. (3.20) = Bn (a1 , . . . , an ) 0 From (3.15), the marginal density of x∗(1) is given by m ai −1  



s  s ai −1 i=1 xi h x x dx g x∗(1) = i , i=m+1 i i=m+1 i Bs (a(1) )B(α1 , α2 )

GROUPED DIRICHLET DISTRIBUTION

107

where h(y) = (δ1 + y)b1 (1 − δ1 − y)α2 −1 ,

y ∈ [0, 1 − δ1 ].

Applying (3.20), we immediately obtain (3.17) with f∗(1) given by (3.18). Comparing (3.14) with (3.17), we know that (3.17) is a Liouville distribution with f∗(1) being the generating density. Alternatively, from the viewpoint of stochastic representation, we can arrive at d the same conclusion. In fact, from (3.8) we have x(1) = R · y(1) . Hence, d

x∗(1) = R · y∗(1) = (R · y∗(1) 1 ) · (y∗(1) /y∗(1) 1 ), which implies (3.19). Since y(1) = (y1 , . . . , ys ) ∼ Dirichlets (a(1) ) on Ts , from the properties of the Dirichlet distribution, we know that y∗(1) = (y1 , . . . , ym ) ∼ Dirichletm (a∗(1) ; τ2 ) on Vm . Making the transformations y∗(1) 1 =

m

yj

yi wi = m

and

j=1

j=1

yj

i = 1, . . . , m − 1,

,

. Note that the joint density of y∗(1) 1 and we obtain the Jacobian as y∗(1) m−1 1 (w1 , . . . , wm−1 ) can be factorized as a product of a beta density and a Dirichlet density. We therefore obtain y∗(1) 1 ∼ Beta(τ1 , τ2 )

and

y∗(1) y∗(1) 1

∼ Dirichletm (a∗(1) ).

d

(1) (1) In addition, it is easy to verify that R(1) ∗ = R · y∗ 1 has the density f∗ . We complete the proof of Case (i). Similarly, we can prove Case (ii). End

In particular, setting m = 1 in Theorem 3.7 yields the following marginal density of x1 :  1−x ∗ x1a1 −1 0 1 (x1 + y)b1 (1 − x1 − y)α2 −1 yτ2 −1 dy g(x1 ) = , (3.21) B(a1 , τ2∗ ) B(α1 , α2 ) where τ2∗ = a2 + · · · + as . When b1 is a positive integer, applying the binomial expansion on (x1 + y)b1 yields g(x1 ) =

b1 b1  B(a1 + b1 − , τ ∗ + ) =0



B(a1 , τ2∗ )

2

× Beta(x1 |a1 + b1 − , τ2∗ + + α2 ).

(3.22)

Note that the coefficients in (3.22) constitute a beta–binomial distribution for variable (i.e., a beta mixture of binomial distribution). Hence, (3.22) is a beta–binomial mixture of beta distributions when b1 is a positive integer. The marginal density of xi (2 ≤ i ≤ n) can be obtained in a similar manner.

108

DIRICHLET AND RELATED DISTRIBUTIONS

3.5 Conditional distributions In this section we consider conditional distributions of the GDD. We adopt the following notation: x = (x1 , . . . , xn ) ∈ Tn , x−n = (x1 , . . . , xn−1 ) ∈ Vn−1 , x(1) = (x1 , . . . , xs ),

1 = x(1) 1 ,

x(2) = (xs+1 , . . . , xn ),

(2)

2 = x(2) 1 , x−n = (xs+1 , . . . , xn−1 ),

a(1) = (a1 , . . . , as ),

(1) ν1 = a−s 1 ,

(1) a−s = (a1 , . . . , as−1 ),

a(2) = (as+1 , . . . , an ),

(2) ν2 = a−n 1

(2) a−n = (as+1 , . . . , an−1 ),

(1) 1 , u1 = x−s

(2) u2 = x−n 1 .

(1) x−s = (x1 , . . . , xs−1 ),

Let x follow the GDD (3.4). Our objective is to derive the conditional distributions of x(1) |x(2) and x(2) |x(1) . Direct derivation of the conditional density of x(1) |x(2) via the joint and marginal densities will encounter the difficulty that x does not have a density as x1 = 1. Instead, we first consider the conditional distribution of (1) (2) |x , which is again a Liouville distribution whose generating density depends x−s on x(2) only through the 1 -norm x(2) 1 . Theorem 3.8 Let x ∼ GDn,2,s (a, b) on Tn , we then have the following results.

 (1)|(2) (1) (2) (1) |x ∼ Liouvillem a−s ; f−s with density (i) x−s

(1) (2) |x g x−s



⎧ ⎫ ⎨ s−1 xai −1 ⎬ f (1)|(2) (u |x(2) ) 1 i=1 i

 · −s ν −1 = , 1 (1) ⎭ ⎩B u 1 s−1 a−s

(1) ∈ Vs−1 (1 − 2 ), and the generating density is given by where x−s

ν1 −1 as −1 u1 u1 1 − 1− 2 1− 2 (1)|(2) , f−s (u1 |x(2) ) = (1 − 2 )B(ν1 , as )

which is a beta distribution with scale

 parameter 1 − 2 . (2)|(1) (2) (1) (2) (ii) x−n |x ∼ Liouvillem a−n ; f−n with density

g

(2) (1) x−n |x



⎧ ⎫ ⎨ n−1 xai −1 ⎬ f (2)|(1) (u |x(1) ) 2 i=s+1 i

 · −n ν −1 = , 2 (2) ⎭ ⎩B u2 a n−1−s

(3.23)

−n

(2) ∈ Vn−1−s (1 − 1 ), and the generating density is where x−n

ν2 −1 an −1 u2 u2 1 − 1−

1−

1 1 (2)|(1) f−n (u2 |x(1) ) = , (1 − 1 )B(ν2 , an )

which is a beta distribution with scale parameter 1 − 1 .

(3.24) ¶

GROUPED DIRICHLET DISTRIBUTION

109

 (2) Proof. We only need to prove Case (ii). Let g(x−n ) = g x(1) , x−n denote the

grouped Dirichlet density (3.4) and g(x(1) ) the marginal density (3.15). Hence,

 (2) n ai −1

 g x(1) , x−n (2) (1) i=s+1 xi g x−n |x = = . (3.25) (2) g(x(1) ) Bn−s (a(2) ) · (1 − 1 )a 1 −1 Simplifying (3.25) yields (3.23) and (3.24). Theorem 3.9

If x ∼ GDn,2,s (a, b) on Tn , then (1) 

  (2) x−s (1) x ∼ Dirichlets−1 a−s ; as on Vs−1 and  1 − 2 (2) 

  (1) x−n (2) x ∼ Dirichletn−1−s a−n ; a on Vn−1−s . n 1−  1

End

(3.26) (3.27) ¶

Proof. To prove (3.28), we consider the following transformation: (2) x−n (2) = (ws+1 , . . . , wn−1 ) = . w−n 1 − 1 (2) n−1−s Hence, the Jacobian is (1 − 1 ) andw−n ∈ Vn−1−s . From (3.25), we have

(2) (1) (2) (1) g w−n |x = g x−n |x (1 − 1 )n−1−s

an −1 n−1 n−1 ai −1 w 1 − w i i=s+1 i i=s+1 , (3.28) = Bn−s (a(2) )

which verifies (3.28). Similarly, we can prove (3.26). End n−1 If we set wn = 1 − i=s+1 wi in (3.28), then we have wn = xn /(1 − 1 ) and (2) w = (ws+1 , . . . , wn−1 , wn ) = x(2) /(1 − 1 ) ∈ Tn−s . From (3.28), we immediately have w(2) |x(1) ∼ Dirichletn−s (a(2) ) and the following result. Theorem 3.10 If x ∼ GDn,2,s (a, b) on Tn , then (i) (ii)

x(1) |x(2) 1− 2 x(2) |x(1) 1− 1

∼ Dirichlets (a(1) ) on Ts ; and ∼ Dirichletn−s (a(2) ) on Tn−s .



Theorem 3.10 implies that the conditional distribution of x(1) |x(2) is a Dirichlet distribution with scale parameter 1 − 2 , which is a constant when x(2) is given. That is, d

x(1) |x(2) = (1 − 2 ) · ξ (1) where ξ (1) ∼ Dirichlets (a(1) ) on Ts . Similarly, we have d

x(2) |x(1) = (1 − 1 ) · ξ (2) where ξ (2) ∼ Dirichletn−s (a(2) ) on Tn−s .

110

DIRICHLET AND RELATED DISTRIBUTIONS

3.6 Extension to multiple partitions 3.6.1 Density function Motivated by (3.2), we extend the GDD with two partitions to the general GDD with multiple partitions (Tian et al., 2003). We then develop the corresponding distribution properties. Definition 3.2 A random vector x ∈ Tn is said to follow a GDD with m partitions if the density of x−n ∈ Vn−1 is given by −1 GDn,m,s (x−n |a, b) = cm

 n

  bj sj m  xiai −1 · xk ,

i=1

j=1

(3.29)

k=sj−1 +1

where a = (a1 , . . . , an ) is a positive parameter vector, b = (b1 , . . . , bm ) is a nonnegative parameter vector, 0 = ˆ s0 < 1 ≤ s1 < · · · < sm = ˆ n, s = (s1 , . . . , sm ), and the normalizing constant is ⎧ ⎫ m ⎨ ⎬ cm = Bsj −sj−1 (asj−1 +1 , . . . , asj ) ⎩ ⎭ j=1 ⎛ ⎞ sm s1 × Bm ⎝ ak + b1 , . . . , ak + b m ⎠ . k=sm−1 +1

k=1

We will write x ∼ GDn,m,s (a, b) on Tn or x−n ∼ GDn,m,s (a, b) on Vn−1 accordingly. ¶ In particular, when m = 2, (3.29) reduces to the GDD (3.4). Let tj = sj − sj−1 , j = 1, . . . , m. We partition x, y, and a in the same fashion: ⎛ (1) ⎞ ⎛ (1) ⎞ ⎛ (1) ⎞ x t1 y a ⎜ . ⎟. ⎜ . ⎟ ⎜ . ⎟ ⎟ ⎜ ⎟ ⎜ ⎟ xn×1 = ⎜ ⎝ .. ⎠.. , yn×1 = ⎝ .. ⎠ , an×1 = ⎝ .. ⎠ . x(m) tm y(m) a(m) Let β = (β1 , . . . , βm ), where βj = a 1 + bj = (j)

sj

ak + b j ,

j = 1, . . . , m.

(3.30)

k=sj−1 +1

Hence, the normalizing constant cm can be rewritten as   m (j) cm = Btj (a ) · Bm (β). j=1

(3.31)

GROUPED DIRICHLET DISTRIBUTION

111

3.6.2 Some properties Similar to Theorems 3.1 and 3.3, we have the following results. Theorem 3.11 x = (x(1), . . . , x(m)) ∼ GDn,m,s (a, b) on Tn iff (x(1), . . . , x(m)) = (R1 · y(1), . . . , Rm · y(m)), d

(3.32)

where (i) y(j) ∼ Dirichlettj (a(j) ) on Ttj , j = 1, . . . , m; (ii) R = (R1 , . . . , Rm ) ∼ Dirichletm (β) on Tm ; and (iii) y(1) , . . . , y(m) and R are mutually independent.

¶ d

Theorem 3.12 If x = (x(1), . . . , x(m)) ∼ GDn,m,s (a, b) on Tn , then Rj = x(j) 1 and d

y(j) =

x(j) , x(j) 1

j = 1, . . . , m.



Let r1 , . . . , rn ≥ 0 and ⎛

⎞ ⎛ (1) ⎞ r1 r ⎜ . ⎟ ⎜ . ⎟ ⎟ ⎜ ⎟ r=⎜ ⎝ .. ⎠ = ⎝ .. ⎠ rn r(m)

have the same partition as x = (x(1), . . . , x(m)) ∼ GDn,m,s (a, b). The stochastic representation (3.32) can be used to calculate the mixed moments of x. The independence among y(1) , . . . , y(m) and R implies

E

 n 



xiri

=

i=1

=

⎧ m ⎨ ⎩

j=1



E⎝

sj  i=sj−1 +1

⎞⎫ ⎛ ⎞ m ⎬  r(j) 1 ri ⎠ ⎠ yi · E ⎝ Rj ⎭ j=1

⎧ ⎫ m ⎨ (j) (j) ⎬ Btj (a + r ) ⎩

×

j=1

Btj (a(j) )



Bm β1 + r(1) 1 , . . . , βm + r(m) 1 . Bm (β)

Analogous to Theorem 3.5, we can obtain the mode as follows.

(3.33)

112

DIRICHLET AND RELATED DISTRIBUTIONS

Theorem 3.13 The mode of the GDD with m partitions, as defined by (3.29), is ⎧   b1 ai − 1 ⎪ ⎪ 1 + (1) , ⎪ ⎪ ⎪ N a 1 − t1 ⎪ ⎪ ⎪ ⎪ ⎪ ⎪  ⎪ ⎨ ai − 1  b2 1 + (2) , xˆ i = N a 1 − t2 ⎪ ⎪ ⎪ ⎪ ⎪ ······ ⎪ ⎪   ⎪ ⎪ ⎪ ai − 1 bm ⎪ ⎪ 1 + (m) , ⎩ N a 1 − tm

1 ≤ i ≤ s1 ,

s1 + 1 ≤ i ≤ s2 ,

sm−1 + 1 ≤ i ≤ sm = n,

where N = ˆ a1 + b1 − n.



3.6.3 Marginal distributions Let x follow the GDD defined in (3.29). Theorem 3.14 below tells us that the marginal distributions of the tj -subvector x(j) and the sr -subvector (x(1), . . . , x(r)), 1 ≤ r < m, still belong to the family of the GDD. Theorem 3.14 Let x = (x(1), . . . , x(m)) ∼ GDn,m,s (a, b) on Tn . Then: (i) for any fixed j (1 ≤ j ≤ m), x(j) follows GDD (3.34) on Vtj with two partitions and the stochastic representation is 

x(j) 1 − x(j) 1



 d

=

Rj · y(j) 1 − Rj



∼ GDtj +1,2,tj



a(j) a1 − a(j) 1

 

,

bj b1 − bj

(3.34)



; and

(ii) for any fixed r (1 ≤ r < m), (x(1), . . . , x(r)) follows GDD (3.35) on Vsr with r + 1 partitions and the stochastic representation is given by ⎛ ⎜ ⎜ ⎜ ⎜ ⎝

1−

⎞ R1 · y(1) ⎟ ⎜ ⎟ .. ⎟ d ⎜ ⎟ . ⎟=⎜ ⎟ ⎟ ⎜ ⎟ (r) ⎠ ⎝ Rr · y ⎠ r (j) 1 − j=1 Rj j=1 x 1

x(1) .. . x(r) r





GROUPED DIRICHLET DISTRIBUTION

⎛⎛ ⎜⎜ ⎜⎜ ⎜ ∼ GDsr +1,r+1,sr⎜ ⎜⎜ ⎝⎝

⎞ ⎛

a(1) .. . (r) a

n

i=sr +1

⎟ ⎟ ⎟, ⎟ ⎠

ai

⎜ ⎜ ⎜ ⎜ ⎝

m

⎞⎞

b1 .. . br

j=r+1

113

⎟⎟ ⎟⎟ ⎟⎟ , ⎟⎟ ⎠⎠

(3.35)

bj

where sr = ˆ (s1 , . . . , sr , sr + 1).

¶ d

Proof. Theorem 3.11 implies that x(j) = Rj · y(j) , where y(j) ∼ Dirichlettj (a(j) )

and Rj ∼ Beta(βj , β1 − βj ).

From (3.30), we have Rj ∼ Beta(a(j) 1 + bj , a1 − a(j) 1 + b1 − bj ). Using Theorem 3.2, we obtain (3.34). In addition, we note that the stochastic representation in (3.35) is an immediate result of (3.32), and

 ∼ Dirichletr+1 β1 , . . . , βr , m β j=r+1 j

 n m = Dirichletr+1 β1 , . . . , βr , i=sr +1 ai + j=r+1 bj .

R1 , . . . , Rr , 1 −

r

j=1 Rj



 in (3.35) can be written as 1 − rj=1 Rj · 1, where ai is the degenerate Dirichlet distribution. By comparing 1∼ (3.32) with (3.35), we obtain the second part of (3.35). End

Note that 1 −

r

j=1 Rj

n Dirichlet1 i=sr +1

The following theorem describes the relationship between the marginal distributions of the GDD and the beta–Liouville distribution. Theorem 3.15 Given that x = (x(1), . . . , x(m)) ∼ GDn,m,s (a, b) on Tn , we have d

x(j) = Rj · y(j) ∼ BLiouvilletj (a(j) ; βj , β1 − βj ),

3.6.4 Conditional distributions

j = 1, . . . , m.



  Now we consider conditional pdfs of x[1] x[2] and x[2]  x[1]  . Theorem 3.16 below x[1]  [2] x[2]  [1] shows that the conditional pdfs of 1− [2] x and 1− [1]  x still belong to the family of the GDD.

114

DIRICHLET AND RELATED DISTRIBUTIONS

Theorem 3.16 Let x = (x(1), . . . , x(m)) ∼ GDn,m,s (a, b) on Tn . For 1 ≤ r < m, define x[1] = (x(1), . . . , x(r)), x[2] = (x(r+1), . . . , x(m)), (j)

[1] = x[1] 1 = rj=1 x(j) 1 , [2] = x[2] 1 = m j=r+1 x 1 , a[1] = (a(1), . . . , a(r)),

a[2] = (a(r+1), . . . , a(m)),

b[1] = (b1 , . . . , br ),

b[2] = (br+1 , . . . , bm ),

s[1] = (s1 , . . . , sr ),

s[2] = (sr+1 , . . . , sm ).

We have the following conditional distributions:  x[1]  [2] ∼ GDsr ,r, s[1] (a[1] , b[1] ) on Tsr with stochastic representation (i) 1−

[2] x ⎛

x[1]  [2] x 1 − [2]

R1 · y(1) /(1 − [2] ) .. .

⎜ ⎜ ⎜ ⎜ ⎜ d ⎜ =⎜ ⎜ Rr−1 · y(r−1) /(1 − [2] ) ⎜ ⎜ ⎜ ⎜ (r) ⎝ (1 − [2] − r−1 j=1 Rj ) · y

⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟  [2] ⎟ x . ⎟ ⎟ ⎟ ⎟ ⎟ ⎠

(3.36)

1 − [2] 

(ii)

x[2]  [1] x 1− [1] 

∼ GDn−sr ,m−r, s[2] (a[2] , b[2] ) on Tn−sr with stochastic representation ⎛

x[2]  [1] x 1 − [1]

Rr+1 · y(r+1) /(1 − [1] ) .. .

⎜ ⎜ ⎜ ⎜ ⎜ d ⎜ =⎜ ⎜ Rm−1 · y(m−1) /(1 − [1] ) ⎜ ⎜ ⎜ ⎜ (m) ⎝ (1 − [1] − m−1 j=r+1 Rj ) · y

⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟  [1] ⎟ x . ⎟ ⎟ ⎟ ⎟ ⎟ ⎠



1 − [1] Proof. Here, we only need to show Case (i). From (3.32), since 1 − [2] is a constant when x[2] is given, we have    x[1]  [2] d R1 · y(1) Rr · y(r)  [2] x = , . . . , x , 1− [2]  1 − [2] 1 − [2] 

(3.37)

GROUPED DIRICHLET DISTRIBUTION

115

where (R1 , . . . , Rm ) ∼ Dirichletm (β) on Tm , y(j) ∼ Dirichlettj (a(j) ) on Ttj for j = 1, . . . , r. Since m

1 − [2] = 1 −

m

d

x(j) 1 = 1 −

j=r+1

Rj ,

j=r+1

we have Rr = 1 −

r−1 j=1

Rj −

m

d

Rj = 1 − [2] −

j=r+1

r−1

Rj .

j=1

Hence, from (3.37) we obtain (3.36). When x[2] is given, from Theorem 1.6 of Fang et al. (1990: 21), we obtain     Rr Rr R1 R1 d ,..., ,..., = 1 − [2] 1 − [2] 1− m 1− m j=r+1 Rj j=r+1 Rj ∼ Dirichletr (β1 , . . . , βr ) on Tr . Combining (3.37) with (3.38) yields  x[1]  [2] x ∼ GDsr ,r, s[1] (a[1] , b[1] ) on Tsr . 1 − [2] 

(3.38)

End

From Theorem 3.16, we can see that the conditional distribution of x[1] |x[2] is a GDD with scale 1 − [2] , which is a constant when x[2] is given. Namely, d

x[1] |x[2] = (1 − [2] ) · ξ [1] , where ξ [1] ∼ GDsr ,r, s[1] (a[1] , b[1] ) on Tsr . Similarly, d

x[2] |x[1] = (1 − [1] ) · ξ [2] , where ξ [2] is distributed as GDn−sr ,m−r, s[2] (a[2] , b[2] ) on Tn−sr .

3.7 Statistical inferences: likelihood function with GDD form In this section we assume that each subject can be classified into one of n categories and θ = (θ1 , . . . , θn ) ∈ Tn denotes the cell probability vector. Let Yobs denote the observed counts that consist of two parts: the complete observations (e.g., {ni }4i=1 in (3.1)) and the partial observations (e.g., n12 and n34 in (3.1)).

116

DIRICHLET AND RELATED DISTRIBUTIONS

Under the MAR mechanism, we further assume that the likelihood function has the GDD form: L(θ|Yobs ) = GDn,m,s (θ −n |a, b) ⎞bj  n  m ⎛ sj  a −1  ⎝ ∝ θi i θk ⎠ . i=1

(3.39)

k=sj−1 +1

j=1

The objective of this section is to introduce statistical methods for the analysis of incomplete categorical data for large- or small-sample sizes when the likelihood function is given by (3.39).

3.7.1 Large-sample likelihood inference From Theorem 3.13, the MLEs of θ are given by ⎧   ⎪ ⎪ ˆ i = ai − 1 1 + s b1 θ , ⎪ ⎪ 1 ⎪ N ⎪ i=1 (ai − 1) ⎪ ⎪ ⎪ ⎪ ⎪   ⎪ ⎪ ⎪ ⎪ b2 ⎨ θˆ = ai − 1 1 + , i s2 N (ai − 1) i=s +1 1 ⎪ ⎪ ⎪ ⎪ .. ⎪ ⎪ . ⎪ ⎪ ⎪   ⎪ ⎪ ⎪ ai − 1 bm ⎪ ⎪ ˆ 1 + n , ⎪ ⎩ θi = N i=sm−1 +1 (ai − 1)

1 ≤ i ≤ s1 ,

s1 + 1 ≤ i ≤ s2 ,

(3.40)

sm−1 + 1 ≤ i ≤ n,

n m where N = ˆ i=1 (ai − 1) + j=1 bj .  Let θ −n = (θ1 , . . . , θn−1 ) . The asymptotic covariance matrix of the MLE θˆ −n −1 ˆ is then given by Iobs (θ −n ), where

Iobs (θ −n ) = −

∂2 log L(θ|Yobs ) ∂θ −n ∂θ−n

is the observed information matrix. From (3.39), we have log L(θ|Yobs ) ∝

n−1



(ai − 1) log θi + (an − 1) log 1 −

i=1

+

m−1 =1



b log ⎝

s k=s −1 +1



n−1



θi

i=1



θk ⎠ + bm log 1 −

sm−1 k=1



θk

.

GROUPED DIRICHLET DISTRIBUTION

Define m subscript index sets: I = {s −1 + 1, s −1 + 2, . . . , s },

= 1, . . . , m − 1,

and

Im = {sm−1 + 1, sm−1 + 2, . . . , sm }\{n}

= {sm−1 + 1, sm−1 + 2, . . . , n − 1}. It can be verified that ⎧ b ai −1 an −1 ⎪ ⎪ − + ⎪ ⎪ ⎪ θi θn ⎪ k∈I θk ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ∂ log L(θ|Yobs ) ⎨ bm = − sm−1 , ⎪ ∂θi 1 − ⎪ k=1 θk ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ a i − 1 an − 1 ⎪ ⎪ − , ⎩ θi θn



i ∈ I , 1 ≤ ≤ m − 1,

i ∈ Im ,

⎧ ai −1 an −1 ⎪ ⎪ + + φ + φm , ⎪ 2 ⎪ ⎪ θn2 ⎨ θi

2

∂ log L(θ|Yobs ) = ⎪ ∂θi2 ⎪ ⎪ ai − 1 an − 1 ⎪ ⎪ + , ⎩ θn2 θi2

⎧ an −1 ⎪ ⎪ ⎪ θ 2 + φ + φ m , ⎪ ⎪ ⎪ ⎪ n ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ an − 1 2 + φm , ∂ log L(θ|Yobs ) ⎨ θn2 = − ⎪ ∂θi ∂θj ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ a −1 ⎪ ⎪ ⎩ n , θn2

i ∈ I , 1 ≤ ≤ m−1,

i ∈ Im ,

and

i, j ∈ I , 1 ≤ ≤ m − 1,

i ∈ I , 1 ≤ ≤ m − 1, j ∈ I , 1 ≤ < ≤ m − 1, i ∈ I , 1 ≤ ≤ m − 1, j ∈ Im ,

where φ =

b

k∈I

θk

2 ,

1 ≤ ≤ m − 1,

and

φm =

1−

bm sm−1 k=1

θk

2 .

117

118

DIRICHLET AND RELATED DISTRIBUTIONS

Hence, the observed information matrix can be written as    B an−1 − 1 an − 1 a1 − 1  Iobs (θ −n ) = diag ,..., + 1n−1 1n−1 + 2 2 2 θn O θ1 θn−1

O O



, (3.41)

where B is a matrix of sm−1 × sm−1 defined by ⎛ ⎞ φ1 11s1 O ··· O ⎜ ⎟ ⎜ O ⎟ O φ2 11s2 −s1 · · · ⎜ ⎟ B=⎜ . ⎟ + φm 11sm−1 . . . . ⎜ .. ⎟ .. .. .. ⎝ ⎠ O O · · · φm−1 11sm−1 −sm−2

3.7.2 Small-sample Bayesian inference For Bayesian analysis, the GDD is the natural conjugate prior distribution. Multiplying (3.39) by the prior distribution θ ∼ GDn,m,s (a∗ , b∗ )

(3.42)

yields the grouped Dirichlet posterior distribution θ|Yobs ∼ GDn,m,s (a + a∗ − 1n , b + b∗ ).

(3.43)

The exact first-order and second-order posterior moments of {θi } can be obtained explicitly from Theorem 3.4. The posterior means are similar to the MLEs in the symmetric case. In addition, the posterior samples of θ in (3.43) can be generated by utilizing (3.8), which only involves the simulation of independent Dirichlet random vectors and a beta random variable.

3.7.3 Analyzing the cervical cancer data In Example 1.3, we described the cervical cancer data set. The likelihood function (3.1) has the density form of a GDD with two partitions. By comparing (3.1) with (3.4), it is clear that the MLE of θ is exactly the mode of the density of GD4,2,2 (a, b) with a = (n1 + 1, . . . , n4 + 1) and b = (n12 , n34 ). From (3.40), the MLEs of θ are given by   ni n1 + n2 + n12 n3 + n4 + n34 θˆ i = · I(1≤i≤2) + · I(3≤i≤4) , N n1 + n 2 n3 + n 4 4 where N = ˆ i=1 ni + n12 + n34 . Hence, the MLE of the odds ratio ψ is ˆ ˆ ˆ ˆ ˆ ψ = θ1 θ4 /(θ2 θ3 ). −1 ˆ The asymptotic covariance matrix of the MLE θˆ −4 is then given by Iobs (θ −4 ), where Iobs (θ −4 ) is defined by (3.41). Hence, the delta method (e.g., Tanner, 1996: 34)

GROUPED DIRICHLET DISTRIBUTION

119

ˆ and a 95% asymptotic CI for ψ can be can be used to approximate the SE of ψ constructed as ˆ − 1.96 · se (ψ), ˆ ψ ˆ + 1.96 · se (ψ)], ˆ [ψ where 

ˆ = se (ψ)

∂ψ ∂θ −4

   1/2 ∂ψ  −1 I (θ −4 |Yobs ) . ∂θ −4 θˆ

For the cervical cancer data in Table 1.3, we have θˆ 1 = 0.2460, θˆ 2 = 0.2460, ˆ = 2.1456. The corresponding SEs are given by θˆ 3 = 0.1615, θˆ 4 = 0.3465, and ψ (0.0163, 0.0163, 0.0143, 0.0181) and 0.3488. Therefore, the 95% asymptotic CI for ψ is [1.4620, 2.8293]. For Bayesian analysis, we adopt the following conjugate prior distribution:

 θ ∼ GD4,2,2 n∗1 , . . . , n∗4 ), (n∗12 , n∗34 . (3.44) The resulting posterior distribution is



  . (3.45) θ|Yobs ∼ GD4,2,2 n1 + n∗1 , . . . , n4 + n∗4 , n12 + n∗12 , n34 + n∗34 The marginal posterior density for each component of θ can be obtained from (3.21). Although we cannot obtain closed-form expressions of the first- and second-order posterior moments for the odds ratio ψ (or an arbitrary function of θ, say h(θ)), an i.i.d. posterior sample of ψ (or h(θ)) can be obtained provided that an i.i.d. sample of θ is available. Fortunately, the i.i.d. posterior samples of θ in (3.45) can be drawn by using the stochastic representation given in (3.8). Noting that both uniform and Dirichlet distributions are special members of the GDD, we usually use the uniform prior distribution in practice. To demonstrate the Bayesian analysis, we use the uniform prior: n∗1 = · · · = n∗4 = 1 and n∗12 = n∗34 = 0. We generate 30 000 i.i.d. posterior samples from (3.45), and the Bayesian estimates of θ and the odds ratio ψ are given by (0.2460, 0.2459, 0.1619, 0.3459) and 2.1708. Therefore, the estimated odds of cervical cancer for patients with many sex partners is about 2.1708 times the estimated odds for patients with few sex partners. The corresponding Bayesian standard deviations are (0.0163, 0.0163, 0.0142, 0.0179) and 0.3557. The 95% Bayesian CI of ψ is then [1.5671, 2.9527]. Since both the asymptotic and Bayesian lower bounds are larger than the value of 1, we have reason to believe that there is an association between the number of sex partners and disease status of cervical cancer.

3.7.4 Analyzing the leprosy survey data In Example 1.4, we described the leprosy survey data set. The likelihood function (3.2) has the density form of a GDD with four partitions. The first objective of this example is to obtain the MLE of θ in (3.2). By comparing (3.2) with (3.29), we

120

DIRICHLET AND RELATED DISTRIBUTIONS

Table 3.1 Frequentist and Bayesian estimates of parameters for the leprosy survey data. Parameters

θ1 θ2 θ3 θ4 θ5 θ6 θ7 θ8 θ9 θ10

Frequentist approach

Bayesian approach

MLE

SE

Mean

std

0.0516 0.1268 0.1973 0.2902 0.0453 0.0401 0.0861 0.0918 0.0620 0.0083

0.0147 0.0209 0.0234 0.0185 0.0085 0.0140 0.0185 0.0188 0.0098 0.0037

0.0541 0.1262 0.1940 0.2873 0.0461 0.0427 0.0855 0.0910 0.0628 0.0098

0.0147 0.0202 0.0226 0.0183 0.0085 0.0137 0.0178 0.0182 0.0098 0.0040

Source: Ng et al. (2008). Reproduced by permission of Elsevier.

know that the MLE of θ is exactly the mode of the density of GD10,4,s ((n1 + n˜ 1 + 1, . . . , n10 + n˜ 10 + 1), (n123 , 0, n678 , 0)) with s = (s1 , . . . , s4 ) = (3, 5, 8, 10). From (3.40), we can obtain the explicit MLEs of θ. For the leprosy survey data, the MLEs of θ and the corresponding SEs are reported in Table 3.1. For Bayesian analysis, a GDD with four partitions is a natural conjugate prior for (3.2). We take θ ∼ GD10,4,s ((n∗1 , . . . , n∗10 ), (n∗123 , n∗45 , n∗678 , n∗9,10 )) as the prior. The posterior distribution is then θ|Yobs ∼ GD10,4,s (a∗ , b∗ ), where a∗ = (n1 + n˜ 1 + n∗1 , . . . , n10 + n˜ 10 + n∗10 ) and b∗ = (n123 + n∗123 , n∗45 , n678 + n∗678 , n∗9,10 ). We adopt the uniform prior: n∗1 = · · · = n∗10 = 1 and n∗123 = n∗45 = n∗678 = n∗9,10 = 0 and generate 30 000 i.i.d. samples of θ by using the stochastic representation in (3.32). The Bayesian means and standard deviations of θ are also reported in Table 3.1.

GROUPED DIRICHLET DISTRIBUTION

121

3.8 Statistical inferences: likelihood function beyond GDD form In this section, under the MAR mechanism, we assume that the likelihood function is a product of two terms, namely a GDD up to a normalizing constant and a product of powers of linear function of θ ∈ Tn : L(θ|Yobs ) ∝ GDn,m,s (θ −n |a, b) ×

 n q  k=1

mk

δik θi

,

(3.46)

i=1

where n×q = (δik ) is a known matrix with δik = 0 or 1 and there exists at least one nonzero entry in each column of . For instance, in (3.3), we have n = 4, q = 2, m1 = n13 , m2 = n24 , and ⎞ ⎛ ⎞ ⎛ 1 0 δ11 δ12 ⎜δ ⎟ ⎜0 1⎟ ⎟ ⎜ 21 δ22 ⎟ ⎜ ⎟.

=⎜ ⎟=⎜ ⎜ ⎝ δ31 δ32 ⎠ ⎝ 1 0 ⎟ ⎠ 0 1 δ41 δ42 In general, we are unable to find the closed-form MLEs of θ. However, we can utilize the EM algorithm and DA Gibbs sampling to deal with the likelihood of θ as given in (3.46). Below, we first consider a simple 2 × 2 contingency table with incomplete observations and analyze the neurological complication data. Then, we investigate how to apply the GDD to construct CIs of parameters of interest for a more general incomplete r × c contingency table.

3.8.1 Incomplete 2 × 2 contingency tables: the neurological complication data (a) MLEs via the EM algorithm In Example 1.5, we described the neurological complication data set in the form of a 2 × 2 contingency table with two supplemental margins. The likelihood function (3.3) has the form of (3.46). Writing n13 = z1 + (n13 − z1 )

and n24 = z2 + (n24 − z2 ),

we introduce a two-dimensional latent vector z = (z1 , z2 ) and obtain an augmented likelihood function

  ni +zi 4 L(θ|Yobs , z) ∝ θ · (θ1 + θ2 )n12 (θ3 + θ4 )n34 , i=1 i where z3 = ˆ n13 − z1 and z4 = ˆ n24 − z2 . This augmented likelihood function has the same functional form of (3.39). From (3.40), the augmented-data MLEs of θ

122

DIRICHLET AND RELATED DISTRIBUTIONS

are given by   ⎧ ni + zi n12 ⎪ ˆ ⎪ θi = 1+ , i = 1, 2, ⎪ ⎪ N n1 + z 1 + n 2 + z 2 ⎨   ⎪ ⎪ ⎪ n + zi n34 ⎪ ⎩ θˆ i = i 1+ , i = 3, 4, N n3 + z 3 + n 4 + z 4

(3.47)

where N = 4i=1 (ni + zi ) + n12 + n34 = 4i=1 ni + n12 + n34 + n13 + n24 . The conditional predictive distribution is then given by     θ1  f (z|Yobs , θ) = Binomial z1 n13 , θ1 + θ 3     θ2  × Binomial z2 n24 , . θ2 + θ 4

(3.48)

Thus, the E-step of the EM is to compute the conditional expectations E(z1 |Yobs , θ) =

n13 θ1 θ1 + θ 3

and E(z2 |Yobs , θ) =

n24 θ2 , θ2 + θ 4

(3.49)

and the M-step updates (3.47) by replacing {z }2 =1 with their conditional expectations. The MLE of θ23 is then given by θˆ 23 = θˆ 2 − θˆ 3 . The 95% asymptotic CI for θ23 can be constructed as [θˆ 23 − 1.96 · se (θˆ 23 ), θˆ 23 + 1.96 · se (θˆ 23 )], where se (θˆ 23 ) = {Var(θˆ 2 ) + Var(θˆ 3 ) − 2Cov(θˆ 2 , θˆ 3 )}1/2 . For the neurological complication data in Table 1.5, using θ (0) = 0.25 · 14 as the initial values, the EM algorithm based on (3.47) and (3.49) converges in six iterations. The resultant MLEs are (θˆ 1 , θˆ 2 , θˆ 3 , θˆ 4 ) = (0.2495, 0.1094, 0.3422, 0.2989), and θˆ 23 = −0.2328. The corresponding SEs are given by (0.0819, 0.0585, 0.0925, 0.0865) and 0.1191. Therefore, the 95% asymptotic CI of θ23 is [−0.4663, 0.0007]. Since the asymptotic CI includes the value of 0, we can conclude that the incidence rates of neurological complication before and after the standard treatment are essentially the same. However, such a CI depends on the large-sample theory, and the sample size in this example is not large enough.

GROUPED DIRICHLET DISTRIBUTION

123

(b) Bayesian estimates via the DA Gibbs sampling For Bayesian analysis, we choose the same prior in (3.44). Hence, the complete-data posterior is a GDD and we have

 θ|(Yobs , z) ∼ GD4,2,2 n1 + n∗1 + z1 , . . . , n4 + n∗4 + z4 ,

  n12 + n∗12 , n34 + n∗34 . (3.50) Again, we adopt the uniform prior assuming no prior information; that is, n∗1 = · · · = n∗4 = 1 and n∗12 = n∗34 = 0. Based on (3.48) and (3.50), we implement the DA Gibbs sampling to obtain posterior samples for θ by running a single chain with length 100 000. The second half of the sequence will be used to calculate the Bayesian estimates of θ and θ23 , which are given by (0.2233, 0.1283, 0.3567, 0.2915) and −0.2284. The corresponding Bayesian standard deviations are (0.0702, 0.0574, 0.0833, 0.0789) and 0.1126. The 95% Bayesian CI of θ23 is [−0.4456, −0.0029], which excludes the value of 0 and contradicts the conclusion derived from the likelihood-based approach. This, on the other hand, lends support to the belief that there is a slight difference for the incidence rates of neurological complication before and after the standard treatment. (c) An alternative DA structure based on the Dirichlet distribution In the conventional DA structure based on the Dirichlet distribution, we need to introduce four latent variables z1 , z2 , w1 , and w3 such that the augmented-likelihood function is L(θ|Yobs , z1 , z2 , w1 , w3 ) ∝

4 

θini +zi +wi ,

θ ∈ T4 ,

i=1

ˆ n12 − w1 and which has the pdf form of a Dirichlet distribution, where w2 = w4 = ˆ n34 − w3 . It is known that the fewer the latent variables, the faster the EM (Liu, 1999; Little and Rubin, 2002). Thus, the resulting EM algorithm (or DA Gibbs sampling) converges slowly because of the introduction of two unnecessary latent variables w1 and w3 . For a general r × c table with two supplemental margins, Tang et al. (2007) theoretically proved that the convergence speed of the EM based on the GDD with only r(c − 1) latent variables is faster than that of the EM based on the Dirichlet distribution with a total of 2rc − r − c latent variables (see Section 3.8.2). Their simulation studies further support this conclusion.

3.8.2 Incomplete r × c contingency tables Constructing CIs for functions of cell probabilities (e.g., rate difference, rate ratio, and odds ratio) becomes a standard procedure for categorical data analysis in clinical trials and medical studies. In this subsection, for incomplete r × c contingency tables, we first introduce an EM algorithm via a DA structure based on a Dirichlet

124

DIRICHLET AND RELATED DISTRIBUTIONS

distribution to construct a CI of the parameters of interest. Next, we present a new EM algorithm via another DA structure based on a GDD to construct two bootstraptype CIs for parameters of interest with and without the normality assumption. The second DA structure requires fewer latent variables than the first DA structure and will consequently lead to a more efficient EM algorithm. (a) CI construction via a DA structure based on the Dirichlet distribution Let X and Y be two correlated categorical variables with joint distribution πij = Pr{X = i, Y = j},

1 ≤ i ≤ r, 1 ≤ j ≤ c,

and let ˆ (π11 , π21 , . . . , πr1 , . . . , π1c , π2c , . . . , πrc ) π = {πij } = denote the cell probability vector. The observed counts and the corresponding cell probabilities for n=

r c

nij = n1rc

i=1 j=1

complete data and mx + my partially incomplete data are summarized in Table 3.2. Under the MAR mechanism, we assume that n = {nij } ∼ Multinomialrc (n, π), mx = (m1x , . . . , mrx ) ∼ Multinomial(mx ; π1+ , . . . , πr+ ), my = (my1 , . . . , myc ) ∼ Multinomial(my ; π+1 , . . . , π+c ), Table 3.2 Observed counts and cell probabilities for incomplete r × c table. Y =1

···

Y =c

Subtotal

Supplement on X

Total

X=1 .. . X=r

n11 (π11 ) .. . nr1 (πr1 )

··· .. . ···

n1c (π1c ) .. . nrc (πrc )

n1+ (π1+ ) .. . nr+ (πr+ )

m1x .. . mrx

n1+ + m1x .. . nr+ + mrx

Subtotal

n+1 (π+1 )

···

n+c (π+c )

n

mx

n + mx

my1

···

myc

my

n+1 + my1

···

n+c + myc

n + my

Supplement on Y Total

Source: Tang et al. (2007). Reproduced by permission of Elsevier.

n + mx + my

GROUPED DIRICHLET DISTRIBUTION

125

and they are mutually independent. Suppose that we are interested in constructing a CI for some function of π, say ϑ = h(π). Let Yobs = {n; mx ; my } denote the observed data. The observed likelihood is ⎛ ⎞ c r  r c    n m mix πijij ⎠ πi+ π+jyj . L(π|Yobs ) ∝ ⎝ i=1 j=1

i=1

j=1

We introduce the latent vectors zi (r) = (zi1 (r), . . . , zic (r)) and zj (c) = (z1j (c), . . . , zrj (c)) such that mix = cj=1 zij (r) for i = 1, . . . , r and myj = ri=1 zij (c) for D = {z(r), z(c)}, where j = 1, . . . , c. Denote these latent (or missing) data by Ymis ⎛ ⎞ ⎛ ⎞ z1 (r) z1 (c) ⎜ z (r) ⎟ ⎜ z (c) ⎟ ⎜ 2 ⎟ ⎜ 2 ⎟ ⎟ and z(c) = ⎜ ⎟ z(r) = ⎜ (3.51) ⎜ .. ⎟ ⎜ .. ⎟ . ⎝ . ⎠ ⎝ . ⎠ zr (r)

zc (c)

D D Hence, the likelihood function for the augmented data Yaug = {Yobs , Ymis } (equivaD lently, the joint density of Yaug ) is D D L(π|Yaug ) = f (Yaug |π) ∝

r  c 

n +zij (r)+zij (c)

πijij

,

(3.52)

i=1 j=1

which is a Dirichlet distribution up to a constant. Hence, the augmented-data MLEs for π are given by πˆ ij =

nij + zij (r) + zij (c) , n + m x + my

1 ≤ i ≤ r, 1 ≤ j ≤ c.

(3.53)

Note that given (Yobs , π), zi (r) follows a multinomial distribution with mix and parameter vector (πi1 , . . . , πic )/πi+ , i = 1, . . . , r. Similar results are available for zj (c). Hence, the E-step of the EM algorithm computes the following conditional expectations: E(zij (r)|Yobs , π) =

mix πij , πi+

1 ≤ i ≤ r, 1 ≤ j ≤ c,

(3.54)

E(zij (c)|Yobs , π) =

myj πij , π+j

1 ≤ i ≤ r, 1 ≤ j ≤ c;

(3.55)

and the M-step updates (3.53) by replacing zij (r) and zij (c) with their conditional expectations. It is noteworthy that, for the 2 × 2 table with two margins, Chen and Fienberg (1974) obtained a formula similar to (3.53) to (3.55) from a different perspective.

126

DIRICHLET AND RELATED DISTRIBUTIONS

Note that the asymptotic covariance matrix V of the MLE of π−rc = (π11 , π21 , . . . , πr1 , . . . , π1c , π2c , . . . , πr−1,c ) can be obtained by the Louis (1982) method or direct computation of the observed information matrix Iobs (π−rc ) = −

∂2 log L(π|Yobs ) ∂π−rc ∂π−rc

−1 evaluated at π−rc = πˆ −rc (i.e., V = Iobs (πˆ −rc )). Hence, the MLE of ϑ is given ˆ and its corresponding 95% CI can be obtained by the delta method by ϑˆ = h(π) (Tanner, 1996); that is,

ˆ ˆ [ϑˆ − 1.96 · se (ϑ), ϑˆ + 1.96 · se (ϑ)].

(3.56)

(b) Bootstrap CIs via a DA structure based on GDD Although (3.53)–(3.55) can be readily implemented, it may converge slowly because of the inclusion of additional but unnecessary latent variables. In this case, the number of latent variables is 2rc − r − c. It is well known that the fewer the latent variables, the faster the EM algorithm (Liu, 1999; Little and Rubin, 2002). Also, the CI in (3.56) depends on the large-sample theory and may not be reliable −1 in small-sample studies. Even worse, Iobs (πˆ −rc ) may not exist in some cases (e.g., the rheumatoid arthritis study in Tang et al. (2007)). In those cases, it is not possible to obtain the asymptotic CIs for ϑ based on (3.56). In this subsection, another DA structure with only r(c − 1) latent variables is introduced, leading to a more efficient EM algorithm. In addition, we also present two bootstrap CIs based on this new EM algorithm for ϑ in small sample designs. GD By introducing only Ymis = z(r), the likelihood for the augmented-data GD GD GD Yaug = {Yobs , Ymis } (equivalently, the joint density of Yaug ) is ⎛ ⎞ r  c c

    n +z (r) m GD GD L π|Yaug = f Yaug |π ∝ ⎝ πijij ij ⎠ π+jyj , (3.57) i=1 j=1

j=1

which is a GDD up to a constant. Hence, from (3.40), the augmented-data MLEs of π are given by   nij + zij (r) myj πˆ ij = 1+ , (3.58) n + m x + my n+j + ri =1 zi j (r) for 1 ≤ i ≤ r and 1 ≤ j ≤ c. Hence, the new E-step only needs to compute E(zij (r)|Yobs , π), while the new M-step simply updates (3.58) by replacing zij (r) with their conditional expectations (3.54). As a computer-based approach, the bootstrap-type CIs invented by Efron (1979) are a useful tool for estimating the SE of ϑˆ and are especially appropriate for smallsample studies (see Efron and Tibshirani (1993)). Based on the πˆ obtained by the

GROUPED DIRICHLET DISTRIBUTION

127

new EM algorithm from (3.58), we can independently generate ˆ n∗ = {n∗ij } ∼ Multinomialrc (n, π), m∗x = (m∗1x , . . . , m∗rx ) ∼ Multinomial(mx ; πˆ 1+ , . . . , πˆ r+ ), m∗y = (m∗y1 , . . . , m∗yc ) ∼ Multinomial(my ; πˆ +1 , . . . , πˆ +c ).

and

∗ Having obtained Yobs = {n∗ ; m∗x ; m∗y }, we can calculate a bootstrap replication ϑˆ ∗ via the new EM algorithm in (3.58). Independently repeating this process G times, ˆ ˆ we obtain G bootstrap replications {ϑˆ g∗ }G g=1 . Consequently, the SE, se (ϑ), of ϑ, can be estimated by the sample standard deviation of the G replications; that is, ⎧ ⎫1/2 G  ⎨ ⎬ 2 &

∗ ˆ = % (ϑ) se /G (G − 1) . (3.59) ϑˆ g∗ − ϑˆ 1∗ + · · · + ϑˆ G ⎩ ⎭ g=1

If {ϑˆ g∗ }G g=1 is approximately normally distributed, a 95% bootstrap CI for ϑ is ˆ ˆ % (ϑ), % (ϑ)]. [ϑˆ − 1.96 · se ϑˆ + 1.96 · se

(3.60)

Alternatively, if {ϑˆ g∗ }G g=1 is nonnormally distributed, a 95% bootstrap CI is [ϑˆ b,L , ϑˆ b,U ],

(3.61)

where ϑˆ b,L and ϑˆ b,U are the 0.025 and 0.975 quantiles of {ϑˆ g∗ }G g=1 . Obviously, the new EM algorithm (3.58) has at least two advantages over the old EM algorithm (3.53)–(3.55): (i) it is more efficient, since only r(c − 1) latent variables are introduced; and (ii) for r × c tables with only one supplemental margin (i.e., mx = 0 or my = 0) it converges in one step, as there will be no missing values under the new DA structure. In other words, the MLEs of the parameters have closedform solutions. Therefore, the computing time for constructing the bootstrap CI is identical to that for calculating G explicit MLEs. (c) The rates of convergence of the two EM algorithms Now, we investigate and compare the convergence rates of the two EM algorithms presented in Sections 3.8.2(a) and 3.8.2(b). Let sequences {θ (t) }∞ t=0 be the output of any EM algorithm with augmented data Yaug = {Yobs , Ymis } and parameter vector θ ∈ Tn . Any EM algorithm implicitly defines a mapping θ → M(θ) = (M1 (θ), . . . , Mn (θ)) from Tn to Tn such that θ (t+1) = M(θ (t) ). If θ (t+1) converges to some fixed point ˆ After expanding M(θ (t) ) in the neighborhood of θ, ˆ we have θˆ ∈ Tn , then θˆ = M(θ). ˆ (t) − θ). ˆ θ (t+1) − θˆ ≈ dM(θ)(θ ˆ the derivative of M(θ) evaluated at θ = θ, ˆ Following Meng (1994), we define dM(θ), (t) as the matrix rate of convergence of the sequence {θ }. The largest eigenvalue

128

DIRICHLET AND RELATED DISTRIBUTIONS

ˆ denoted as ρ{dM(θ)}, ˆ is called the global rate of convergence of {θ (t) }. of dM(θ), ˆ = In − dM(θ) ˆ is called the matrix speed of convergence of {θ (t) }, Furthermore, S(θ) ˆ of S(θ) ˆ is known as the global speed of and the smallest eigenvalue 1 − ρ{dM(θ)} (t) {θ }. Under mild regularity conditions, Dempster et al. (1977) showed that ˆ obs (θ), ˆ ˆ = In − I −1 (θ)I (3.62) dM(θ) aug

where

  ∂2 log f (Yaug |θ)  Iaug (θ) = E − Yobs , θ ∂θ∂θ 

Iobs (θ) = −

and

∂2 log f (Yobs |θ) ∂θ∂θ

(3.63)

(3.64)

are the expected complete-data information matrix and the observed information matrix respectively. To compare two EM algorithms (EM1 and EM2 say) based on the same Yobs but different DA structures, we only need to compare their matrix speeds. Since ˆ is independent of DA structures, it suffices to compare their Iaug (θ). ˆ Meng Iobs (θ) and van Dyk (1997) showed that if EM2 ˆ EM1 ˆ Iaug (θ) ≤ Iaug (θ),

then the global speed of EM2 is greater than or equal to the global speed of EM1. Let c{B} denote some criterion for measuring the size of the positive definite matrix B. We say EM2 dominates EM1 in c-criterion if EM2 ˆ EM1 ˆ c{Iaug (θ)} ≤ c{Iaug (θ)}.

Furthermore, we say EM2 uniformly dominates EM1 in c-criterion if EM2 EM1 (θ)} ≤ c{Iaug (θ)} c{Iaug

for any θ ∈ Tn . Besides the largest eigenvalue, the two commonly used criteria for measuring the size of a positive definite matrix are trace and determinant. Following the notations in Sections 3.8.2(a) and 3.8.2(b), we let D D Yaug = {Yobs , Ymis } denote the augmented data for the DA structure based on a GD GD = {Yobs , Ymis } Dirichlet distribution (i.e., the EM algorithm (3.53)–(3.55)) and Yaug the augmented data for the DA structure based on the GDD (i.e., the EM (3.58)). In (3.62)–(3.64), let n = rc and θ = π. We can similarly define ' ( D ˆ Iaug ˆ dM D (π), (π), S D (π) and

( ' GD ˆ Iaug ˆ . dM GD (π), (π), S GD (π)

We have the following results.

GROUPED DIRICHLET DISTRIBUTION

129

Theorem 3.17 The new EM algorithm in (3.58) uniformly dominates the EM algorithm in (3.53)–(3.55) in terms of both the largest eigenvalue and trace criteria. That is, for any π ∈ Trc , GD D (π)} ≤ ρ{Iaug (π)}; and (i) ρ{Iaug GD D (π)} and the strict inequality holds whenever there exists (ii) tr {Iaug (π)} ≤ tr {Iaug at least one j such that myj > 0. ¶ GD D Proof. We only need to show that Iaug (π) ≤ Iaug (π) for any π ∈ Trc . Since GD D Ymis = z(r) is a subset of Ymis = {z(r), z(c)}, where z(r) and z(c) are defined by (3.51), we have GD GD GD D = {Yobs , Ymis } ⊆ {Yobs , Ymis , z(c)} = Yaug . Yaug

Given (Yobs , π), we have





 D GD GD f Yaug |π = f Yaug |z(c), π · f (z(c)|π) = f Yaug |π · f (z(c)|π). The second equality holds due to the conditional independence between GD Ymis |(Yobs , π) and z(c)|(Yobs , π). Hence, from (3.63), we have D GD Iaug (π) − Iaug (π)



   ∂2 log f Y GD |π   D |π  ∂2 log f Yaug  aug   =E − Yobs , π − E − Yobs , π ∂π∂π ∂π∂π 

   ∂2 log f (z(c)|π)  =E − Y , π ≥ 0,  obs ∂π∂π

which yields the two assertions. For the trace criterion, we present an alternative proof below. Define three index sets V1 = {(i, j) : i = 1, . . . , r, j = 1, . . . , c − 1}, V2 = {(i, j) : i = 1, . . . , r − 1, j = c}, and V = V1 ∪ V2 . From (3.52), we have r c

 D log f Yaug |π = [nij + zij (r) + zij (c)] log πij . i=1 j=1

Since πrc = 1 − −

(i,j)∈V

πij , we immediately have

D |π) ∂2 log f (Yaug

∂πij2

=

nij + zij (r) + zij (c) nrc + zrc (r) + zrc (c) + 2 πrc πij2

130

DIRICHLET AND RELATED DISTRIBUTIONS

for all (i, j) ∈ V. From (3.54), (3.55) and (3.63),     2 D  ∂ log f (Y |π) aug D Yobs , π (π)} = tr E − tr {Iaug  ∂π ∂π −rc

−rc

  nij mix myj = + + πij π+j πij2 πij πi+ (i,j)∈V 

 mrx myc nrc + (rc − 1) 2 + + . πrc πrc πr+ πrc π+c

(3.65)

Similarly, from (3.57), we have GD |π) = [nij + zij (r)] log πij + [nrc + zrc (r)] log πrc log f (Yaug (i,j)∈V

+

c−1



myj log π+j + myc log ⎝1 −

j=1





πij ⎠ .

(i,j)∈V1

Hence, D |π) ∂ log f (Yaug

∂πij

=

⎧ myj myc nij + zij (r) nrc + zrc (r) ⎪ ⎪ − + − , ⎪ ⎪ π π π π ⎪ ij rc +j +c ⎨

∀ (i, j) ∈ V1 ,

⎪ ⎪ ⎪ nij + zij (r) nrc + zrc (r) ⎪ ⎪ − , ⎩ πij πrc

∀ (i, j) ∈ V2 ,

and



D |π) ∂2 log f (Yaug

∂πij2

=

⎧ myj myc nij + zij (r) nrc + zrc (r) ⎪ ⎪ + + 2 + 2 , ⎪ 2 2 ⎪ πrc π+j π+c πij ⎪ ⎨

∀ (i, j) ∈ V1 ,

⎪ ⎪ ⎪ nij + zij (r) nrc + zrc (r) ⎪ ⎪ + , ⎩ 2 πrc πij2

∀ (i, j) ∈ V2 .

GROUPED DIRICHLET DISTRIBUTION

131

Comparing (3.54) and (3.63), we have

 ⎧ ⎫ ⎬ GD ⎨  ∂2 log f Yaug ' ( |π  GD Yobs , π tr Iaug (π) = tr E − ⎩ ⎭ ∂π−rc ∂π−rc  =

    nij mix mrx nrc + + (rc − 1) + 2 πij πi+ πrc πrc πr+ πij2 (i,j)∈V

+r

c−1 myc myj + r(c − 1) 2 . 2 π π +c j=1 +j

(3.66)

By combining (3.65) with (3.66), we have   c r ' ( ' ( 1 myj 1 D GD (π) − tr Iaug (π) = − tr Iaug π πij π+j i=1 j=1 +j     myc 1 1 1 + (rc − 2) − + 2(r − 1) ≥ 0, π+c πrc π+c π+c

and the strict inequality holds whenever there exists at least one j such End that myj > 0. (d) Comparison with existing strategies Many strategies have been developed in the literature to speed up the EM-type algorithms. One strategy is to impute partial missing data (or introduce less latent variables). It is well known that the MLEs of the parameters of a multivariate normal distribution from incomplete data with a monotone pattern have closed-form expressions and that the MLEs from incomplete data with a general missing-data pattern can be obtained using the EM algorithm by first creating a monotone pattern (Anderson, 1957; Liu, 1999; Little and Rubin, 2002). In saturated multinomial models for contingency tables, it is noted (Schafer, 1997) that monotone DA produces faster rates of convergence for both EM algorithms and the corresponding MCMC algorithms. For graphical models, a monotone pattern of incomplete data can be represented as nested hyperedges (Didelez and Pigeot, 1998). For graphical models with general missing pattern, Geng et al. (2000) developed a partial imputation EM (PIEM) algorithm that imputes partial missing data as few as possible so that the graphical model can be decomposed. All aforementioned work mainly considered monotone missing patterns. The main feature of our proposed strategy is that we deal with the grouped missing pattern for r × c contingency tables by taking the advantage that the mode of a GDD has a closed-form expression. Under the context of graphical models, Geng et al. (2000) theoretically proved that the convergence speed of the PIEM, which assumes a monotone missing pattern, is faster than the traditional EM algorithm that requires imputing a great amount of missing data. However, they only considered the criterion of largest eigenvalue.

132

DIRICHLET AND RELATED DISTRIBUTIONS

Under the grouped missing pattern, the new EM algorithm is a special case of the PIEM. Nonetheless, Theorem 3.17 considers both the largest eigenvalue and trace criteria. For the trace criterion, the analytic proof of Theorem 3.17 is quite different from that of Geng et al. (2000). In addition, under the Bayesian framework, the MCMC version of the new EM algorithm can be viewed as the collapsed Gibbs sampler or DA Gibbs sampling (Liu et al., 1994).

3.8.3 Wheeze study in six cities In this subsection, we consider the wheeze study in six cities (Ware et al., 1984; Lipsitz and Fitzmaurice, 1996). The primary objective is to investigate the effects of maternal smoking on respiratory illness in children. Table 3.3 displays the number of children cross-classified according to maternal smoking status (X = 1: none; X = 2: moderate; X = 3: heavy, or missing) and child’s wheezing status (Y = 1: no wheeze; Y = 2: wheeze with cold; Y = 3: wheeze apart from cold, or missing). We also consider the MAR mechanism as in Lipsitz and Fitzmaurice (1996). From Tables 3.2 and 3.3, we have r = c = 3, n = 528, mx = 507, and my = 103. Let πij denote the cell probabilities and LOR(ii : jj ) =

πij πi j πij πi j

the local odds ratio between the (i, i )-row and (j, j )-column. Using π(0) = 19 /9 as the initial values, the old EM algorithm (3.53)–(3.55) converges in 29 iterations. The MLEs of {πij } and {LOR(ii : jj )} are given in the second column of Table 3.4. Using the same initial values, the new EM algorithm (3.58) converges to the same MLEs in seven iterations, which is 4.142 times faster than the old EM. Based on (3.59), we generate G = 30 000 bootstrap samples to estimate the SEs of {πij } and {LOR(ii : jj )}, which are given in the third column of Table 3.4. The distribution densities of {πij } and {LOR(ii : jj )} estimated by a kernel density smoother using the 30 000 bootstrap samples are plotted in Figure 4 of Tang et al. (2007).

Table 3.3 Maternal smoking cross-classified by child’s wheezing status. Maternal smoking

None (X = 1) Moderate (X = 2) Heavy (X = 3) Missing

Child’s wheezing status No wheeze (Y = 1)

Wheeze with cold (Y = 2)

Wheeze apart from cold (Y = 3)

Missing

287 18 91

39 6 22

38 4 23

279 27 201

59

18

26

Source: Lipsitz and Fitzmaurice (1996). Reproduced by permission of John Wiley & Sons (Asia) Pte Ltd.

GROUPED DIRICHLET DISTRIBUTION

133

Table 3.4 MLEs and CIs of parameters for child’s wheeze data. Parameters π11 π12 π13 π21 π22 π23 π31 π32 π33 LOR(12 : 12) LOR(12 : 13) LOR(12 : 23) LOR(13 : 12) LOR(13 : 13) LOR(13 : 23) LOR(23 : 12) LOR(23 : 13) LOR(23 : 23)

MLE

SE

0.4747 0.0701 0.0742 0.0327 0.0119 0.0087 0.2058 0.0559 0.0657 2.4738 1.7089 0.6908 1.8403 2.0470 1.1123 0.7439 1.1978 1.6101

0.0178 0.0106 0.0107 0.0066 0.0046 0.0039 0.0147 0.0093 0.0101 1.5353 1.1266 1.0332 0.5383 0.5676 0.4383 0.9164 1.6911 2.1581

95% bootstrap CIa [0.4408, [0.0497, [0.0535, [0.0197, [0.0038, [0.0019, [0.1767, [0.0380, [0.0471, [0.7080, [0.3473, [0.1281, [1.0693, [1.2374, [0.5763, [0.2748, [0.4332, [0.3715,

0.5101] 0.0912] 0.0962] 0.0462] 0.0221] 0.0173] 0.2346] 0.0753] 0.0864] 6.5331] 4.6412] 3.0452] 3.2004] 3.5039] 2.2483] 2.7518] 5.7994] 8.7335]

Source: Tang et al. (2007). Reproduced by permission of Elsevier. a Bootstrap CI based on nonnormality assumption; see (3.61).

Obviously, the curve for π23 , LOR(12 : 23), LOR(23 : 12), LOR(23 : 13), LOR(23 : 23) are nonnormal. Therefore, we only consider the nonnormal-based bootstrap CIs. The 95% bootstrap CIs based on the nonnormality assumptions are listed in the fourth column of Table 3.4. Since both the lower bounds for LOR(13 : 12) and LOR(13 : 13) are larger than the value 1, we have reason to believe that there is a significant association between child’s wheezing status and maternal smoking status (none and heavy).

3.8.4 Discussion An efficient EM algorithm via a DA structure with fewer latent variables is introduced in this section. By augmenting the observed data to complete data having a likelihood with the kernel of a GDD, the new EM algorithm converges much faster than the EM algorithm based on the conventional DA structure, which augments the observed data to complete-data having likelihood with the kernel of a Dirichlet distribution. Theoretical justification supports this conclusion. The algorithm can be applied to compute MLEs of parameters of interest for r × c tables with incomplete observations. In particular, when the r × c table has only one supplemental margin, the new EM algorithm converges in one step; that is, the MLEs can be obtained in closed

134

DIRICHLET AND RELATED DISTRIBUTIONS

form. Based on the assumption of normality or nonnormality for the bootstrap samples, two bootstrap CIs for parameters of interest via the new EM algorithm are presented. The method remains valid for problems with small sample sizes.

3.9 Applications under nonignorable missing data mechanism In Sections 3.7 and 3.8, we assume the mechanism of MAR. Nordheim (1984) and Choi and Stablein (1988) further considered nonrandom or nonignorable missing mechanisms which depend on either the treatment or the outcome. In these cases, more nuisance parameters would be induced, leading to the problem of nonidentifiability. Generally, we can get rid of those nuisance parameters by adopting Bayesian methods, especially, non-iterative Bayesian methods (Tan et al., 2010).

3.9.1 Incomplete r × c tables: nonignorable missing mechanism Example 1.6 considers a crime survey with the resultant data set being reported in Table 1.6. By discarding n1234 (=115) households without response in both interviews, which is equivalent to the MAR or ignorable missing mechanism, Schafer (1997: 45, 271) analyzed this data set by the EM algorithm from a frequentist perspective. In a Bayesian framework, this data set was originally analyzed by Kadane (1985). Under the assumption of a nonignorable missing mechanism, Tian et al. (2003) reanalyzed this data set and obtained explicit Bayesian estimates for the cell probabilities based on a GDD. (a) Posterior density in the form of GDD Let Yobs = {n1 , . . . , n4 ; n12 , n34 ; n13 , n24 ; n1234 } denote the observed frequencies in Table 1.6. Under the nonignorable missing mechanism, we have a total of 15 free parameters, denoted by the cell probability vector π = {πij } = (π11 , π21 , π31 , π41 , . . . , π14 , π24 , π34 , π44 ) ∈ T16 ; see Table 3.5. These {πij } are not identifiable unless there is a prior distribution for π. Note that πij can be decomposed as πij = θj λij ,

i, j = 1, . . . , 4,

where λ·j = λ1j + · · · + λ4j = 1,

j = 1, . . . , 4,

and θj = π1j + · · · + π4j ,

j = 1, . . . , 4.

(3.67)

GROUPED DIRICHLET DISTRIBUTION

135

Table 3.5 Parameter structure under nonignorable missing mechanism.a Category

R12

R12

R12

R12

R12¯

R12¯

R12 ¯

R12 ¯

R1¯ 2¯

Probability

A1 A2 A3 A4

π11 0 0 0

0 π12 0 0

0 0 π13 0

0 0 0 π14

π21 π22 0 0

0 0 π23 π24

π31 0 π33 0

0 π32 0 π34

π41 π42 π43 π44

θ1 θ2 θ3 θ4

Counts

n1

n2

n3

n4

n12

n34

n13

n24

n1234

n\1

Source: Tian et al. (2003). Reproduced by permission of the Institute of Statistical Science, Academia Sinica. a R , R , R , and R denote the responding sets. ¯ 12 12¯ 12 1¯ 2¯

Naturally, λij denotes the corresponding conditional probability, reflecting the prior information of nonignorability. For instance, λ11 (λ41 ) is the conditional probability that a household responds (does not respond) in both interviews given that this household is crime free in both periods. Thus, in Table 3.5, the responding set R12¯ represents that a household responds in the first interview but does not in the second, and the other responding sets have analogous interpretations. Obviously, A1 (A4 ) represents the category that a household is crime free (victimized) in both periods. Hence, we can write λ11 = Pr(R12 |A1 ), λ31 = Pr(R12 ¯ |A1 ),

λ21 = Pr(R12¯ |A1 ), and

λ41 = Pr(R1¯ 2¯ |A1 ).

The likelihood function is given by ⎛ ⎞ 4  nj L(π|Yobs ) ∝ ⎝ π1j ⎠ (π21 + π22 )n12 (π23 + π24 )n34 j=1



× (π31 + π33 )n13 (π32 + π34 )n24 ⎝

4

⎞n1234

π4j ⎠ .

(3.68)

j=1

  α −1 If the prior density is f (π) ∝ 4i=1 4j=1 πijij , the posterior density is proportional to L(π|Yobs ) × f (π). Using the reparametrization ⎧ ξi = π1i , 1 ≤ i ≤ 4, ⎪ ⎪ ⎪ ⎪ ⎪ ξi = π2,i−4 , 5 ≤ i ≤ 8, ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎨ ξ9 = π31 , ξ10 = π33 , (3.69) ⎪ ⎪ ⎪ ξ11 = π32 , ⎪ ⎪ ⎪ ⎪ ⎪ ξ12 = π34 , ⎪ ⎪ ⎪ ⎩ ξi = π4,i−12 , 13 ≤ i ≤ 16,

136

DIRICHLET AND RELATED DISTRIBUTIONS

it can be shown that the posterior density becomes 4 

ξini +α1i −1 ·

i=1

8 

α

ξi 2,i−4

−1

α33 −1 α32 −1 α34 −1 · ξ9α31 −1 ξ10 ξ11 ξ12 ·

×⎝

4

⎞0 ⎛

ξj ⎠ ⎝

j=1



× ⎝

α

ξi 4,i−12

−1

i=13

i=5



16 

10

6

⎞n12 ⎛

ξj ⎠



ξj ⎠



j=9

12 j=11

⎞n34

ξj ⎠

j=7

j=5

⎞n13 ⎛

8

⎞n24 ⎛

ξj ⎠



16

⎞n1234

ξj ⎠

,

j=13

which is a GDD with six partitions. Hence, (3.33) can be employed to calculate the expectation of ξi , i = 1, . . . , 16. For instance: ⎧ n1 + α11 ⎪ ⎪ E(ξ1 |Yobs ) = , ⎪ ⎪ n + α·· ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ α21 (α21 + α22 + n12 ) ⎪ ⎪ ⎪ ⎪ E(ξ5 |Yobs ) = (α + α )(n + α ) , ⎪ 21 22 ·· ⎨ ⎪ ⎪ ⎪ α31 (α31 + α33 + n13 ) ⎪ ⎪ E(ξ9 |Yobs ) = , ⎪ ⎪ (α31 + α33 )(n + α·· ) ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ α (α + n1234 ) ⎪ ⎪ ⎩ E(ξ13 |Yobs ) = 41 4· , α4· (n + α·· )

where

n = ∑_{i=1}^{4} ni + n12 + n34 + n13 + n24 + n1234,   α4· = ∑_{j=1}^{4} α4j,   and   α·· = ∑_{i=1}^{4} ∑_{j=1}^{4} αij.   (3.70)


Therefore, the Bayes estimator of θ1 is given by

θ̃1 = E(θ1|Yobs) = E(ξ1|Yobs) + E(ξ5|Yobs) + E(ξ9|Yobs) + E(ξ13|Yobs).   (3.71)

Similarly, the respective posterior means for θ2, θ3, and θ4 can also be obtained.

(b) Determination of the prior density

How do we determine the values of all the αij in the prior density? In practice, what we typically know is the joint prior of the original parameters {θj} and {λij} rather than that of π, specified by f(θ, λ1, . . . , λ4), where θ = (θ1, . . . , θ4) and λj = (λ1j, . . . , λ4j) for j = 1, . . . , 4, as defined in (3.67). We would like to clarify the relation between f(θ, λ1, . . . , λ4) and f(π). Consider the more general case of (3.67) with i = 1, . . . , k and j = 1, . . . , m. The Jacobian of the transformation (3.67) is

∏_{j=1}^{m} θj^{k−1}.

Paulino and de Bragança Pereira (1992) showed that π ∼ Dirichlet({αij}) is equivalent to

θ = (θ1, . . . , θm) ∼ Dirichlet(α·1, . . . , α·m),
λj = (λ1j, . . . , λkj) ∼ Dirichlet(α1j, . . . , αkj),  j = 1, . . . , m,
θ, λ1, . . . , λm mutually independent,   (3.72)

where θ1 + · · · + θm = 1 and λ1j + · · · + λkj = 1 for j = 1, . . . , m. In this way, all the αij in the prior f(π) can be determined.
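As an aside that we add here for illustration, the equivalence (3.72) makes it mechanical to assemble the prior parameters {αij} from separate Dirichlet priors on θ and on the λj. The Python sketch below is ours, not part of the original text; build_alpha is a hypothetical helper name, and the hyperparameters shown are those of the expert prior used in Section 3.9.2.

```python
import numpy as np

def build_alpha(theta_prior, lambda_priors):
    """Assemble the k x m matrix {alpha_ij} of the Dirichlet prior for pi.

    By (3.72), pi ~ Dirichlet({alpha_ij}) iff theta ~ Dirichlet(alpha_.1, ..., alpha_.m),
    lambda_j ~ Dirichlet(alpha_1j, ..., alpha_kj), all mutually independent.  So the
    j-th column of alpha is the hyperparameter vector of lambda_j, and its column sum
    must equal the j-th hyperparameter of theta.
    """
    alpha = np.column_stack([np.asarray(l, float) for l in lambda_priors])  # k x m
    if not np.allclose(alpha.sum(axis=0), theta_prior):
        raise ValueError("lambda priors are incompatible with the theta prior")
    return alpha

# Expert prior of Section 3.9.2: theta ~ Dirichlet(10, 5, 5, 10), lambda_j as listed there.
alpha_expert = build_alpha(
    [10, 5, 5, 10],
    [[1, 3, 2, 4], [1, 0.5, 2, 1.5], [1.5, 2, 0.5, 1], [4, 2, 3, 1]],
)
```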

3.9.2 Analyzing the crime survey data

In this subsection, we analyze the crime survey data presented in Table 1.6. The goal is to obtain Bayes estimates of θi, i = 1, . . . , 4. We assume the nonignorable missing mechanism. Equations (3.70) and (3.71) give the Bayes estimator of θ1.


Similarly, we have

θ̃2 = E(θ2|Yobs) = E(ξ2|Yobs) + E(ξ6|Yobs) + E(ξ11|Yobs) + E(ξ14|Yobs),
θ̃3 = E(θ3|Yobs) = E(ξ3|Yobs) + E(ξ7|Yobs) + E(ξ10|Yobs) + E(ξ15|Yobs),   and
θ̃4 = E(θ4|Yobs) = E(ξ4|Yobs) + E(ξ8|Yobs) + E(ξ12|Yobs) + E(ξ16|Yobs),

where

E(ξi|Yobs) = (ni + α1i)/(n + α··),   i = 1, 2, 3, 4,
E(ξ5|Yobs) = α21 d12,     E(ξ6|Yobs) = α22 d12,
E(ξ7|Yobs) = α23 d34,     E(ξ8|Yobs) = α24 d34,
E(ξ9|Yobs) = α31 d13,     E(ξ10|Yobs) = α33 d13,
E(ξ11|Yobs) = α32 d24,    E(ξ12|Yobs) = α34 d24,
E(ξ13|Yobs) = α41 d1234,  E(ξ14|Yobs) = α42 d1234,
E(ξ15|Yobs) = α43 d1234,  E(ξ16|Yobs) = α44 d1234,

and

d12 = (α21 + α22 + n12) / [(α21 + α22)(n + α··)],
d34 = (α23 + α24 + n34) / [(α23 + α24)(n + α··)],
d13 = (α31 + α33 + n13) / [(α31 + α33)(n + α··)],
d24 = (α32 + α34 + n24) / [(α32 + α34)(n + α··)],
d1234 = (α4· + n1234) / [α4·(n + α··)].

The corresponding standard deviation for θi can be obtained by calculating the variance of ξi|Yobs and the covariances of ξi|Yobs and ξj|Yobs with (3.33). We consider two prior distributions. The first is a uniform prior with αij = 1 for all i, j = 1, . . . , 4. From (3.72), this is equivalent to θ = (θ1, . . . , θ4) ∼ Dirichlet(4, 4, 4, 4), λj ∼ Dirichlet(1, 1, 1, 1) for j = 1, . . . , 4, with θ, λ1, . . . , λ4 mutually independent.


Table 3.6 Posterior means and standard deviations for the crime survey data under nonignorable missing mechanism.

             θ1               θ2               θ3               θ4
Priors       Mean     std     Mean     std     Mean     std     Mean     std
Uniform      0.5916   0.0125  0.1396   0.0137  0.1668   0.0180  0.1020   0.0102
Experts      0.6570   0.0134  0.1152   0.0142  0.1362   0.0156  0.0916   0.0187

Source: Tian et al. (2003). Reproduced by permission of the Institute of Statistical Science, Academia Sinica.

The second prior represents the opinion of experts, namely

θ ∼ Dirichlet(10, 5, 5, 10),
λ1 ∼ Dirichlet(1, 3, 2, 4),
λ2 ∼ Dirichlet(1, 0.5, 2, 1.5),
λ3 ∼ Dirichlet(1.5, 2, 0.5, 1),
λ4 ∼ Dirichlet(4, 2, 3, 1),

and they are independent. Table 3.6 summarizes the posterior means and standard deviations, which indicate that the posterior means are slightly sensitive to the choice of prior.
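As a numerical companion to this subsection, the following Python sketch is an illustration we add here (the function name is ours and the Table 1.6 counts must be supplied by the reader); it evaluates the closed-form posterior means using (3.70), (3.71), and the expressions above.

```python
import numpy as np

def gdd_posterior_means(counts, alpha):
    """Posterior means of theta_1,...,theta_4 under the nonignorable-missingness
    model of Section 3.9.1, via the closed-form expectations (3.70)-(3.71).

    counts: dict with keys n1..n4, n12, n34, n13, n24, n1234 (frequencies from Table 1.6)
    alpha:  4 x 4 matrix of prior parameters alpha_ij
    """
    a = np.asarray(alpha, float)
    n = sum(counts[f"n{j}"] for j in (1, 2, 3, 4)) + counts["n12"] + counts["n34"] \
        + counts["n13"] + counts["n24"] + counts["n1234"]
    tot = n + a.sum()                                   # n + alpha_..
    a4dot = a[3].sum()                                  # alpha_4.

    d12 = (a[1, 0] + a[1, 1] + counts["n12"]) / ((a[1, 0] + a[1, 1]) * tot)
    d34 = (a[1, 2] + a[1, 3] + counts["n34"]) / ((a[1, 2] + a[1, 3]) * tot)
    d13 = (a[2, 0] + a[2, 2] + counts["n13"]) / ((a[2, 0] + a[2, 2]) * tot)
    d24 = (a[2, 1] + a[2, 3] + counts["n24"]) / ((a[2, 1] + a[2, 3]) * tot)
    d1234 = (a4dot + counts["n1234"]) / (a4dot * tot)

    e1 = [(counts[f"n{j}"] + a[0, j - 1]) / tot for j in (1, 2, 3, 4)]   # E(xi_1..xi_4)
    e2 = [a[1, 0] * d12, a[1, 1] * d12, a[1, 2] * d34, a[1, 3] * d34]    # E(xi_5..xi_8)
    e3 = [a[2, 0] * d13, a[2, 2] * d13, a[2, 1] * d24, a[2, 3] * d24]    # E(xi_9..xi_12)
    e4 = [a[3, j] * d1234 for j in range(4)]                             # E(xi_13..xi_16)

    # theta_j collects one term from each response pattern, as in (3.71) and the
    # analogous expressions for theta_2, theta_3, theta_4 above.
    return [
        e1[0] + e2[0] + e3[0] + e4[0],
        e1[1] + e2[1] + e3[2] + e4[1],
        e1[2] + e2[2] + e3[1] + e4[2],
        e1[3] + e2[3] + e3[3] + e4[3],
    ]

# Uniform prior: alpha_ij = 1 for all i, j.
# theta_tilde = gdd_posterior_means({...Table 1.6 counts...}, np.ones((4, 4)))
```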

4
Nested Dirichlet distribution

This chapter introduces another distribution family, named the NDD, which is defined on the n-dimensional closed simplex Tn. The NDD includes the Dirichlet distribution as a special member and can serve as another tool for analyzing incomplete categorical data. We first introduce the density function (Section 4.1) through two motivating examples (Section 4.2). Important distribution properties, such as stochastic representations, mixed moments and mode, are presented in Section 4.3. Marginal and conditional distributions are discussed in Sections 4.4 and 4.5 respectively. The connection with the exact null distribution for the sphericity test is discussed in Section 4.6. New large-sample likelihood approaches (Section 4.7) and small-sample Bayesian approaches (Section 4.8) for analyzing incomplete categorical data are provided and compared with existing likelihood/Bayesian strategies. The new approaches are shown to have at least three advantages over existing approaches based on the Dirichlet distribution in both frequentist and conjugate Bayesian inference for incomplete categorical data. The new methods

• possess closed-form expressions for both the MLEs and Bayes estimates when the likelihood function is in NDD form;
• provide computationally efficient EM and DA algorithms when the likelihood is not in NDD form; and
• provide exact sampling procedures for some special cases.

In Section 4.9 we demonstrate the applications of the NDD in analyzing incomplete categorical data, fitting a competing-risks model, and evaluating cancer diagnosis tests. One simulated data set and four real data sets involving dental caries, failure times of radio transmitter receivers, attitudes toward the death penalty, and ultrasound ratings


for breast cancer metastasis are analyzed. Finally, we give a brief historical review on the development of the distribution family in Section 4.10.

4.1 Density function

Motivated by the likelihood functions (4.3) and (4.4), we first define the NDD via a density function (Tian et al., 2003, 2010; Ng et al., 2009).

Definition 4.1 A random vector x ∈ Tn is said to follow an NDD if the density of x−n ≜ (x1, . . . , xn−1) ∈ Vn−1 is

NDn,n−1(x−n|a, b) = cND^{−1} ( ∏_{i=1}^{n} xi^{ai − 1} ) ∏_{j=1}^{n−1} ( ∑_{k=1}^{j} xk )^{bj},   (4.1)

where a = (a1, . . . , an) is a positive parameter vector, b = (b1, . . . , bn−1) is a nonnegative parameter vector, and cND denotes the normalizing constant given by

cND = ∏_{j=1}^{n−1} B(dj, aj+1)   with   dj ≜ ∑_{k=1}^{j} (ak + bk).   (4.2)

We will write x ∼ NDn,n−1 (a, b) on Tn or x−n ∼ NDn,n−1 (a, b) on Vn−1 equivalently, where the first subscript n in the notation NDn,n−1 (a, b) represents the effective dimension of the parameter vector a and the second subscript n − 1 denotes that of the parameter vector b. ¶ Another motivation of the NDD density (4.1) comes from the factorization of the likelihood with a monotone pattern for incomplete categorical data (Rubin, 1974; Little and Rubin, 2002: Chapter 13). When b = 0n−1 , the distribution in (4.1) reduces to the Dirichlet distribution Dirichletn (a). On the other hand, when b = (0, . . . , 0, bn−1 ) with bn−1 > 0, it reduces to a beta–Liouville distribution (e.g., see formula (6.13) of Fang et al. (1990)). Figure 4.1 shows some plots of NDD densities for n = 3 with various combinations of a and b. We observe that NDD provides more varieties than the Dirichlet distribution in terms of skewness.
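For readers who want to evaluate (4.1) numerically, the following Python sketch is ours and not part of the original text; it computes the log density, with the normalizing constant obtained from (4.2) via log beta functions.

```python
import numpy as np
from scipy.special import betaln

def log_nd_density(x, a, b):
    """Log of the nested Dirichlet density ND_{n,n-1}(x_{-n} | a, b) in (4.1),
    with the normalizing constant c_ND computed from (4.2).
    x: full vector (x_1,...,x_n) on the simplex T_n;
    a: positive parameters (a_1,...,a_n); b: nonnegative parameters (b_1,...,b_{n-1})."""
    x, a, b = (np.asarray(v, float) for v in (x, a, b))
    d = np.cumsum(a[:-1] + b)                       # d_j = sum_{k<=j} (a_k + b_k)
    log_c = np.sum(betaln(d, a[1:]))                # log c_ND = sum_j log B(d_j, a_{j+1})
    log_kernel = np.sum((a - 1.0) * np.log(x)) \
        + np.sum(b * np.log(np.cumsum(x)[:-1]))     # partial sums x_1 + ... + x_j
    return log_kernel - log_c

# Panel (a) of Figure 4.1: a = (1, 2, 3), b = (4, 5)
print(np.exp(log_nd_density([0.2, 0.3, 0.5], [1, 2, 3], [4, 5])))
```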

Figure 4.1 Perspective plots of nested Dirichlet densities ND3,2(x−3|a, b): (a) a = (1, 2, 3) and b = (4, 5); (b) a = (6, 7, 8) and b = (9, 10); (c) a = (11, 12, 13) and b = (14, 15); (d) a = (16, 17, 18) and b = (19, 20).

4.2 Two motivating examples

In this section, two examples are presented to motivate the NDD. In the first example, the likelihood function can be expressed exactly in terms of an NDD (up to a normalizing constant). In the second example, the likelihood function can be written as a product of two terms, namely an NDD (up to a normalizing constant) and a product of powers of linear combinations of the parameters of interest.

Example 4.1 (Sample surveys with nonresponse). Let N denote the total number of questionnaires sent out. Among these, suppose m individuals respond and the rest (i.e., N − m) do not. Of those m respondents, nr+i individuals give answers classified into category i (denoted by X = i), i = 1, . . . , r. Denote the nonrespondents by R = 0 and the respondents by R = 1. The left-hand side of Table 4.1 represents the survey with nonresponse as an r × 2 contingency table with missing data. Let

θi = Pr(X = i, R = 0)   and   θr+i = Pr(X = i, R = 1)

denote the cell probabilities, i = 1, . . . , r.

Table 4.1 r × 2 contingency table with missing data.

Left panel (observed counts):
Category   R = 0    R = 1    Total
X = 1               nr+1
 ...                 ...
X = i               nr+i
 ...                 ...
X = r               n2r
Total      N − m    m        N

Right panel (cell probabilities):
Category   R = 0    R = 1     Total
X = 1      θ1       θr+1      θ1 + θr+1
 ...        ...      ...       ...
X = i      θi       θr+i      θi + θr+i
 ...        ...      ...       ...
X = r      θr       θ2r       θr + θ2r
Total                         1


The parameter of interest is Pr(X = i) = θi + θr+i (Albert and Gupta, 1985). Let Yobs = {nr+1, . . . , n2r; N − m} denote the observed counts and θ = (θ1, . . . , θ2r) ∈ T2r. Under the MAR mechanism, the observed-data likelihood function for θ is given by

L(θ|Yobs) ∝ ( ∏_{i=r+1}^{2r} θi^{ni} ) · (θ1 + · · · + θr)^{N−m},   (4.3)

where m = ∑_{i=r+1}^{2r} ni. Obviously, if we treat θ as a random vector, then θ ∼ ND2r,2r−1(a, b), where a is a 2r × 1 vector with ai = 1 for i = 1, . . . , r and ai = ni + 1 for i = r + 1, . . . , 2r, and b is a (2r − 1) × 1 vector with bj = 0 for j ≠ r and br = N − m.

Example 4.2 (Dental caries data). To determine the degree of sensitivity to dental caries, dentists often consider three risk levels: low, medium, and high, labeled X = 1, X = 2, and X = 3 respectively. Each subject is assigned a risk level based on the spittle color obtained using a coloration technique. However, some subjects may not be fully categorized because adjacent categories can be difficult to distinguish. Paulino and de Bragança Pereira (1995) analyzed the following data set using Bayesian methods. Of 97 subjects, only 51 were fully categorized, with n1 = 14, n2 = 17, and n3 = 20 subjects classified as low, medium, and high risk respectively. A total of n12 = 28 subjects were classified only as low or medium risk, and n23 = 18 only as medium or high risk. The primary objective of this study is to estimate the cell probabilities. Let Yobs = {n1, n2, n3; n12, n23} be the observed frequencies and θ = (θ1, θ2, θ3) ∈ T3 the corresponding cell probability vector. Under the assumption of MAR, the observed-data likelihood function is

L(θ|Yobs) ∝ ( ∏_{i=1}^{3} θi^{ni} ) θ1^{0} (θ1 + θ2)^{n12} · (θ2 + θ3)^{n23}.   (4.4)

Again, we observe that the first term in (4.4) follows the ND3,2(a, b) with a = (n1 + 1, n2 + 1, n3 + 1) and b = (0, n12) up to a normalizing constant, while the second term is simply a power of a linear combination of θ = (θ1, θ2, θ3).
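To fix ideas, the short sketch below is an illustration we add (the variable names are ours); it writes down the NDD parameters implied by (4.3) and (4.4) for the counts just quoted, and flags the extra factor that takes the likelihood beyond pure NDD form.

```python
# Dental caries data of Example 4.2
n1, n2, n3, n12, n23 = 14, 17, 20, 28, 18

# First factor of (4.4): an ND_{3,2}(a, b) kernel (up to a normalizing constant)
a = (n1 + 1, n2 + 1, n3 + 1)   # (15, 18, 21)
b = (0, n12)                   # (0, 28)

# The remaining factor (theta_2 + theta_3)**n23 is a power of a linear combination
# of theta; handling it is what motivates the EM and exact-sampling methods of
# Sections 4.7.2 and 4.8.2.
```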

4.3 Stochastic representation, mixed moments, and mode

Theorems 4.1 and 4.2 below provide two alternative stochastic representations of an NDD and hence give two simple procedures for generating i.i.d. samples from the NDD, which play a crucial role in Bayesian analysis for incomplete categorical data. The following result suggests that the NDD can be stochastically represented by a sequence of mutually independent beta random variates.

Theorem 4.1 A random vector x ∼ NDn,n−1(a, b) on Tn iff

xi =d (1 − yi−1) · ∏_{j=i}^{n−1} yj,   y0 ≜ 0,   1 ≤ i ≤ n − 1,   and
xn =d 1 − yn−1,   (4.5)

where yj ∼ Beta(dj, aj+1), 1 ≤ j ≤ n − 1, the {dj} are defined in (4.2), and y1, . . . , yn−1 are mutually independent. ¶

Proof. If x ∼ NDn,n−1(a, b) on Tn, then the density of x−n is given by (4.1). We consider the following two transformations:

yj = ( ∑_{i=1}^{j} xi ) / ( ∑_{i=1}^{j+1} xi ),  1 ≤ j ≤ n − 2,   yn−1 = ∑_{i=1}^{n−1} xi,   and   uj = ∑_{i=1}^{j} xi,  1 ≤ j ≤ n − 1.   (4.6)

From (4.6), we obtain

uj = yj yj+1 · · · yn−1,   1 ≤ j ≤ n − 1.   (4.7)

From (4.6) and (4.7), the corresponding Jacobian determinants are given by

J(u−n → x−n) ≜ |∂u−n/∂x−n| = 1   and   J(u−n → y−n) = ∏_{j=1}^{n−1} yj^{j−1}.

Hence, we have

J(x−n → y−n) = J(x−n → u−n) · J(u−n → y−n) = ∏_{j=1}^{n−1} yj^{j−1},

and the joint density of y−n is given by

f(y−n) = cND^{−1} · ∏_{j=1}^{n−1} yj^{dj − 1} (1 − yj)^{aj+1 − 1},   (4.8)

where cND and the dj are defined in (4.2). It is noteworthy that (4.8) factorizes into a product of independent beta pdfs. Combining (4.6) and (4.7), we immediately obtain (4.5). Conversely, if (4.5) holds, then the joint density of y−n is given by (4.8), and it is easy to show that the density of x−n is given by (4.1); that is, x ∼ NDn,n−1(a, b) on Tn. End

DIRICHLET AND RELATED DISTRIBUTIONS

The following theorem suggests another stochastic representation for NDD and is an immediate result of Theorem 4.1. An n-vector x ∼ NDn,n−1 (a, b) on Tn iff

Theorem 4.2

⎧ n−1  ⎪ d ⎪ ⎪ yj , ⎨ x1 + · · · + x i = ⎪ ⎪ ⎪ ⎩

1 ≤ i ≤ n − 1,

and (4.9)

j=i d

= 1 − yn−1 ,

xn

where {yj }n−1 j=1 are defined in Theorem 4.1.



Next, we study the mixed moments of x by using the stochastic representation (4.5). For this purpose, we note that the independence among {yj } implies E

 n



xiri

=

i=1

n   i=1

E(1 − yi−1 )ri ·

n−1 

 r E(yjj ) .

j=i

Using the moments of the beta distribution, we obtain the following results. Theorem 4.3 Let x ∼ NDn,n−1 (a, b) on Tn . For any r1 , . . . , rn ≥ 0, the mixed moments of x are given by     n n  n−1 B(di−1 , ai + ri )  B(dj + ri , aj+1 ) E · xiri = . B(di−1 , ai ) B(dj , aj+1 ) j=i i=1 i=1 In particular, we have ⎧ n−1  ⎪ ai dj ⎪ ⎪ E(x ) = , 1 ≤ i ≤ n, ⎪ i ⎪ di−1 + ai j=i dj + aj+1 ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ai (ai + 1) ⎪ ⎪ E(xi2 ) = ⎪ ⎪ ⎪ (d + ai )(di−1 + ai + 1) i−1 ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ n−1 ⎪  dj (dj + 1) ⎨ × , and (d + a j j+1 )(dj + aj+1 + 1) ⎪ j=i ⎪ ⎪ ⎪ ⎪ ⎪   j−2 ⎪ ⎪ ai dk aj dj−1 ⎪ ⎪ ⎪ E(x x ) = i j ⎪ ⎪ di−1 + ai k=i dk + ak+1 (dj−1 +aj )(dj−1 + aj + 1) ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ n−1 ⎪  ⎪ dk (dk + 1) ⎪ ⎪ , i < j, × ⎪ ⎩ (dk + ak+1 )(dk + ak+1 + 1) k=j where {dj } are defined in (4.2).



NESTED DIRICHLET DISTRIBUTION

Similarly, the higher order moments of stochastic representation (4.9).

i j=1

147

xj can be obtained from the

Let x ∼ NDn,n−1 (a, b) on Tn and r ≥ 0, then

Theorem 4.4 E

 i

r

xj

=

n−1  j=i

j=1

B(dj + r, aj+1 ) , B(dj , aj+1 )

1 ≤ i ≤ n − 1.



Finally, Ng et al. (2009) obtained a closed-form expression for the mode of a nested Dirichlet density, implying that explicit MLEs of cell probabilities are available in the frequentist analysis of incomplete categorical data. Theorem 4.5 If ai ≥ 1 for all i ∈ {1, . . . , n} and bj ≥ 0 for all j ∈ {1, . . . , n − 1} with at least one j such that bj > 0, then the mode of the nested Dirichlet density (4.1) is given by ⎧ an − 1 ⎪ ⎪ , xˆ n = ⎪ ⎪ d ⎪ n−1 + an − n ⎪ ⎪ ⎪ ⎪ ⎪ (ai − 1)(1 − xˆ i+1 − xˆ i+2 − · · · − xˆ n ) ⎨ xˆ i = , di−1 + ai − i ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ 2 ≤ i ≤ n − 1, and ⎪ ⎪ ⎪ ⎪ ⎩ xˆ 1 = 1 − xˆ 2 − · · · − xˆ n .

where {dj } are defined in (4.2).

(4.10)



Proof. The mode of the nested Dirichlet density (4.1) is readily obtained by the calculus of several variables. If L denotes the log-kernel of the density (4.1), then L=

n−1 



 (ai − 1) log(xi ) + (an − 1) log 1 − n−1 x i i=1

i=1

+

n−1 

bj log(x1 + · · · + xj ).

j=1

Setting the derivative of L with respect to xi equal to zero yields bj ai − 1 an − 1  − + = 0, xi xn x + · · · + xj j=i 1 n−1

1 ≤ i ≤ n − 1.

It is easy to verify that (4.10) is true when n = 3. By induction, we obtain (4.10). End

148

DIRICHLET AND RELATED DISTRIBUTIONS

4.4 Marginal distributions Let x ∼ NDn,n−1 (a, b) and 1 ≤ m < n. We partition x, z, and a into the same pattern:       x(1) m z(1) a(1) , zn×1 = xn×1 = , and an×1 = . x(2) n−m z(2) a(2) Furthermore, we let ⎛

b(n−1)×1 Finally, we define x1 = 1x = Theorem 4.6

⎞ b(1) m−1 ⎜ ⎟ . = ⎝ bm ⎠1 (2) n−m−1 b

n i=1

xi . We have the following theorem.

If x = (x(1), x(2)) ∼ NDn,n−1 (a, b) on Tn , then     x(1) d R · z(1) x= = , x(2) (1 − R) · z(2)

(4.11)

where (i) (ii) (iii) (iv)

z(1) ∼ NDm,m−1 (a(1) , b(1) ) on Tm ; z(1) is independent of (R, z(2) ); R =d n−1 j=m yj with {yj } being defined in Theorem 4.1; and z(2) follows a mixture of NDDs with density 

bm+1

f (z ) = (2)

bn−1   ··· ω(k(2) )

km+1 =0

kn−1 =0

× NDn−m,n−m−1 (z(2) |a(2) , b(2) − k(2) ) , z(2) ∈ Tn−m ,

(4.12) ˆ (km+1 , . . . , kn−1 ), the weights are given by where k(2) = ω(k(2) ) =

B(dm + k(2) 1 , a(2) 1 + b(2) 1 − k(2) 1 ) B(dm , am+1 )  n−1   bj B( j =m+1 (a + b − k ), aj+1 ) × , B(dj , aj+1 ) k j j=m+1

and {dj } are defined in (4.2).

(4.13) ¶

NESTED DIRICHLET DISTRIBUTION

149

Proof. If x ∼ NDn,n−1 (a, b) on Tn , then the density of x−n is given by (4.1). Noting that x(1) = x(1) 1 ·

x(1) , x(1) 1

x(2) = x(2) 1 ·

x(2) , x(2) 1

(4.14)

x(2) . 1−R

(4.15)

and x1 = 1, we make the transformation z(1) =

x(1) , R

R = x(1) 1 ,

and

z(2) =

The Jacobian determinant is then given by J(x−n → z1 , . . . , zm−1 , R, zm+1 , . . . , zn−1 ) = Rm−1 (1 − R)n−m−1 . Hence, the joint density of (z1 , . . . , zm−1 , R, zm+1 , . . . , zn−1 ) can be written as j ai −1 bj ( m ) · m−1 i=1 zi j=1 ( k=1 zk ) · f (R, z(2) ), (4.16) m−1 B(d , a ) j j+1 j=1 where z(1) ∈ Tm , R ∈ [0, 1], z(2) ∈ Tn−m , and f (R, z(2) ) =

   (2) n R dm −1 (1 − R)a 1 −1 ai −1 z n−1 i j=m B(dj , aj+1 ) i=m+1 bj j n−1    × R + (1 − R) zk . j=m+1

(4.17)

k=m+1

Therefore, (4.11) follows from (4.14) and (4.15). In addition, Assertions (i) and (ii) follow by the fact that (4.16) can be factorized into two components. By combining (4.15) with (4.9), we immediately obtain Assertion (iii). To derive the marginal density of z(2) , we assume all bj (j = m + 1, . . . , n − 1) are positive integers. By integrating out the density of R from (4.17) and using the Taylor expansion, we obtain (4.12) and (4.13) easily. End Remark 4.1 From (4.11), we can see that d

x(1) = R · z(1) follows a mixture distribution of NDDs. In addition, d

x(2) = (1 − R) · z(2) follows a double-mixture distribution of NDDs in the sense that z(2) itself is a mixture of NDDs. This is not surprising, because of the asymmetry of xi in (4.1). ¶

150

DIRICHLET AND RELATED DISTRIBUTIONS

From (4.17), setting bm+1 = · · · = bn−1 = 0 immediately yields the following result. Theorem 4.7 If x ∼ NDn,n−1 (a, b) on Tn with b(2) = 0, then the stochastic representation in (4.11) still holds and (i) (ii) (iii) (iv)

z(1) ∼ NDm,m−1 (a(1) , b(1) ) on Tm ; z(2) ∼ Dirichletn−m (a(2) ) on Tn−m ; R ∼ Beta(dm , a(2) 1 ); and z(1) , z(2) and R are mutually independent.



Remark 4.2 According to Fang et al. (1990: 147), Theorem 4.7 implies that x(2) follows a beta–Liouville distribution and we can write x(2) ∼ BLiouvillen−m (a(2) ; a(2) 1 , dm ). ¶ By letting m = n − 1 in Theorem 4.6, we have the following result. Theorem 4.8 Given that x ∼ NDn,n−1 (a, b) on Tn , we have the following stochastic representation 

x=

x−n 1 − x−n 1



 d

=

R · z−n 1−R



,

where (i) z−n ∼ NDn−1,n−2 (a−n , b−(n−1) ) on Tn−1 ; (ii) R ∼ Beta(dn−1 , an ); and (iii) z−n and R are mutually independent.



4.5 Conditional distributions In this section, we consider the conditional distributions of x(1) |x(2) and x(2) |x(1) . For the sake of convenience, we denote x(1) 1 and x(2) 1 by 1 and 2 respectively. Theorem 4.9

If x ∼ NDn,n−1 (a, b) on Tn , then  x(1)  (2) x ∼ NDm,m−1 (a(1) , b(1) ) on Tm . 1 − 2 

(4.18)

Furthermore, we have  x(2)  (1) x ∼ g(·|x(1) ) on Tn−m , 1 − 1 

(4.19)

NESTED DIRICHLET DISTRIBUTION

151

where g(z(2) |x(1) ) ∝

  n

ziai −1

i=m+1

×

n−1 





1 + (1 − 1 )

j=m+1

j 

bj

zk

.

(4.20)

k=m+1



Proof. From (4.11), we have d

d

x(1) = R · z(1) = x(1) 1 · z(1) = (1 − 2 ) · z(1) . Result (ii) in Theorem 4.6 implies that (a) z(1) and 2 are mutually independent and (b) z(1) is independent of x(2) . Therefore, x(1) d (1) =z 1 − 2 and

 x(1)  (2) d (1) (2) d (1) x = z |x = z ∼ NDm,m−1 (a(1) , b(1) ) on Tm . 1 − 2 

Similarly, we have  x(2)  (1) d (2) (1) d (2) d x = z |x = z |1 = z(2) |R. 1 − 1 

From (4.17), we immediately obtain (4.20).

End

Combining Theorems 4.7 and 4.9, we have the following result. Theorem 4.10

and

Given that x ∼ NDn,n−1 (a, b) on Tn and b(2) = 0, we have  x(1)  (2) x ∼ NDm,m−1 (a(1) , b(1) ) on Tm 1 − 2   x(2)  (1) x ∼ Dirichletn−m (a(2) ) on Tn−m . 1 − 1 



Remark 4.3 The first result (i.e., (4.18)) of Theorem 4.9 suggests that the conditional density g(x(1) |x(2) ) depends on x(2) only through the 1 -norm x(2) 1 and is an NDD with scale parameter 1 − 2 , which is a constant when x(2) is fixed. We do not have a similar conclusion for g(x(1) |x(2) ) because of the asymmetry between x(1) and x(2) . However, for the special case that b(2) = 0, Theorem 4.10 shows that the conditional distribution g(x(2) |x(1) ) depends on x(1) only through the 1 -norm x(1) 1

152

DIRICHLET AND RELATED DISTRIBUTIONS

and is a Dirichlet distribution with scale parameter 1 − 1 , which is a constant when x(1) is given. ¶

4.6 Connection with exact null distribution for sphericity test In this section we show that the exact null distribution of the likelihood ratio statistic for testing sphericity in an n-variate normal distribution is identical to the marginal distribution of an NDD with a given set of parameters (Thomas and Thannippara, 2008). iid Let the n-dimensional random vectors x1 , . . . , xm ∼ Nn (μ, ). Consider the sphericity test1 with null hypothesis specified by H 0 :  = σ 2 In ,

σ 2 > 0.

The likelihood ratio statistic for testing H0 is given by (Mauchly, 1940)  m/2 | m1 A| ˆ σˆ 2 In |¯x, A) L(μ, = , 1 ˆ x, A) ˆ |¯ |[ mn tr (A)]In | L(μ, where 1  xj , m j=1 m

ˆ = x¯ = μ

ˆ = 1 A,  m

and

σˆ 2 = A=

1 tr (A), mn m 

(xj − x¯ )(xj − x¯ ).

j=1

Define 

= ˆ

ˆ σˆ 2 In |¯x, A) L(μ, ˆ x, A) ˆ |¯ L(μ,



2/m

=

1/n n i=1 λi  n 1 i=1 λi n

n

,

(4.21)

where λ1 , . . . , λn are the eigenvalues of A. Bilodeau and Brenner (1999) showed that the exact null distribution of is characterized by a product of (n − 1) independent beta random variables. Theorem 4.11 Let be defined by (4.21). We have =d n−1 j=1 yj , where   m − 1 − j (n + 2)j , , 1 ≤ j ≤ n − 1, (4.22) yj ∼ Beta 2 2n and y1 , . . . , yn−1 are mutually independent.



The contour of the density function of x ∼ Nn (μ, ) is an ellipsoid. That is, (x − μ)−1 (x − μ) = c for each c > 0. When  = σ 2 In , the ellipsoid becomes a sphere.

1

NESTED DIRICHLET DISTRIBUTION

153

Comparing the stochastic representation of with the stochastic representation of x1 in (4.5), we immediately obtain the following result. Theorem 4.12 If x = (x1 , . . . , xn ) ∼ NDn,n−1 (a, b) on Tn with a = (a1 , . . . , an ) and b = (b1 , . . . , bn−1 ), where a1 + b1 =

m−2 (n + 2)(j − 1) , aj = for j = 2, . . . , n, and 2  2n  j−1 j + bj = − , for j = 2, . . . , n − 1, 2 n

then the distribution of in (4.21) and the marginal distribution of x1 are identical. ¶ Let λ denote the observed value of and g(x1 ) denote the density function of x1 . The exact p-value of can be computed by 

p = Pr( ≤ λ) =

λ

g(x1 ) dx1 . 0

When n = 2,  d

= x1 ∼ Beta

 m−2 , 1 . 2

When n ≥ 3, Theorem 4.11 implies that the Monte Carlo method can be employed to calculate the exact p-value in three steps:
(1) Independently generate beta random variables {yj(ℓ)}_{j=1}^{n−1} from (4.22), for ℓ = 1, . . . , L (L = 30 000, say).
(2) Compute the L simulated values of the statistic in (4.21) as the products y1(ℓ) y2(ℓ) · · · yn−1(ℓ), ℓ = 1, . . . , L.
(3) Estimate the p-value by the proportion of simulated values that are less than or equal to the observed value λ.
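A minimal Python sketch of these three steps follows; it is our illustration, and the function name is hypothetical.

```python
import numpy as np

def sphericity_pvalue(lam_obs, n, m, L=30_000, rng=None):
    """Monte Carlo p-value for the sphericity statistic defined in (4.21):
    simulate independent y_j ~ Beta((m - 1 - j)/2, (n + 2)j/(2n)) as in (4.22),
    form their product, and count how often it falls at or below lam_obs."""
    rng = np.random.default_rng() if rng is None else rng
    j = np.arange(1, n)                                          # j = 1, ..., n-1
    y = rng.beta((m - 1 - j) / 2.0, (n + 2) * j / (2.0 * n), size=(L, n - 1))
    lam_sim = y.prod(axis=1)
    return np.mean(lam_sim <= lam_obs)
```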

4.7 Large-sample likelihood inference In this section we investigate the application of the NDD in analyzing incomplete categorical data with large sample sizes. For simplicity, we assume that each subject is classified into one of n categories and θ = (θ1 , . . . , θn ) ∈ Tn is the corresponding cell probability vector. Let Yobs denote the observed frequencies that consist of two parts: the complete observations (e.g., n1 , n2 , n3 and n12 in (4.4)) and the partial observations (e.g., n23 in (4.4)). Under the MAR mechanism, the likelihood function is usually expressed as L(θ|Yobs ) = NDn,n−1 (θ|a, b) · Lst (θ|Yobs ),

(4.23)

154

DIRICHLET AND RELATED DISTRIBUTIONS

where the first term is in the NDD form. For the second term, we consider two cases: (i) Lst (θ|Yobs ) is a constant and (ii) Lst (θ|Yobs ) is not constant, where the superscript ‘st’ represents the second term of (4.23).

4.7.1 Likelihood with NDD form If Lst (θ|Yobs ) is a constant, the likelihood function in (4.23) is proportional to the NDD NDn,n−1 (θ|a, b); that is, L(θ|Yobs ) ∝

 n

 n−1 bk k  θiai −1 · θ .

i=1

k=1

(4.24)

=1

Recall that (4.3) belongs to this form. From Theorem 4.5, we immediately obtain the MLE of θ in closed form. Let θ −n = (θ1 , . . . , θn−1 ). The asymptotic variance–covariance matrix of the −1 ˆ MLE θˆ −n is then given by Iobs (θ −n ), where Iobs (θ −n ) = −

∂2 log L(θ|Yobs ) ∂θ −n ∂θ−n

is the observed information matrix. From (4.24), it is easy to show that ai − 1 an − 1  bk ∂ log L(θ|Yobs ) = − + , k ∂θi θi θn =1 θ k=i n−1

1 ≤ i ≤ n − 1,



ai − 1 an − 1  ∂2 log L(θ|Yobs ) = + + ψk , θn2 ∂θi2 θi2 k=i

1 ≤ i ≤ n − 1,



n−1  an − 1 ∂2 log L(θ|Yobs ) = + ψk , ∂θi ∂θj θn2 k=max(i,j)

i= / j,

n−1

where bk ψk = k , ( =1 θ )2

k = 1, . . . , n − 1.

Hence, the observed information matrix can be expressed as   an−1 − 1 a1 − 1 Iobs (θ −n ) = diag ,..., 2 θ12 θn−1 an − 1 + · 1n−1 1n−1 + An−1 , θn2

(4.25)

NESTED DIRICHLET DISTRIBUTION

155

where ⎛

An−1

1 ⎜0 ⎜ =⎜ ⎜ .. ⎝. 0

⎞⎛ ··· 1 ψ1 ⎜ψ ··· 1⎟ ⎟⎜ 2 ⎟⎜ .. ⎟ ⎜ .. .. . . ⎠⎝. ··· 1 ψn−1

1 1 .. . 0

0 ψ2 .. . ψn−1

⎞ 0 ⎟ 0 ⎟ ⎟. .. ⎟ ⎠ . · · · ψn−1

··· ··· .. .

(4.26)

4.7.2 Likelihood beyond NDD form If Lst (θ|Yobs ) is not constant, it can generally be written as a product of powers of linear functions of θ. Here, we assume Lst (θ|Yobs ) =

q  n   j=1

mj

δij θi

i=1

so that L(θ|Yobs ) ∝

 n  i=1

θiai −1

 n−1 k  k=1

bk 

θ

×

=1

q  n   j=1

δij θi

mj ,

(4.27)

i=1

where n×q = (δij ) is a known matrix with δij = 0 or 1, and that there exists at least one nonzero entry in each column of . For instance, in (4.4), we have n = 3, q = 1, m1 = n23 , δ11 = 0, and δ21 = δ31 = 1. In general, the MLE of θ based on (4.27) may not be available in closed form. Here, we introduce a new EM algorithm based on NDD rather than the Dirichlet distribution to obtain the MLE. For this purpose, we first augment the observed data Yobs with the following latent data: ND Ymis = {z1 , . . . , zj , . . . , zq },

where zj = (z1j , . . . , znj ) is used to split (δ1j θ1 + · · · + δnj θn )mj . ND When δij = 0, we set zij = 0. The likelihood for the new augmented-data Yaug = ND ND {Yobs , Ymis } (equivalently, the joint density of Yaug ) can be readily shown to be ND ND L(θ|Yaug ) = f (Yaug |θ) = NDn,n−1 (θ|a + Z11q , b),

(4.28)

156

DIRICHLET AND RELATED DISTRIBUTIONS

where Zn×q = (z1 , . . . , zq ). Hence, the augmented-data MLEs of θ (cf. (4.10)) are given by ⎧ an + z(n) 1q − 1 ⎪ ⎪ ˆn =  θ ⎪ n−1 , ⎪ n  ⎪ ⎪ =1 (a + z() 1q − 1) + =1 b ⎪ ⎪ ⎪ ⎪ ⎪  ⎪ ⎨ (ai + z(i) 1q − 1)(1 − nj=i+1 θˆ j ) 2 ≤ i ≤ n − 1, θˆ i = i i−1 , (4.29)  ⎪ (a + z 1 − 1) + b  q  ⎪ () =1 =1 ⎪ ⎪ ⎪ ⎪ ⎪ n ⎪  ⎪ ⎪ ⎪ ˆ1 = 1 − θ θˆ j , ⎪ ⎩ j=2

where z(i) denotes the ith row of the matrix Z. In addition, it is easy to show that the conditional predictive distribution is given by ND f (Ymis |Yobs , θ) =

q 

f (zj |Yobs , θ),

(4.30)

j=1

where 

 (δ1j θ1 , . . . , δnj θn )  zj |(Yobs , θ) ∼ Multinomial mj ; , n =1 δj θ

1 ≤ j ≤ q.

Thus, the E-step of the new EM computes the following conditional expectations mj δij θi E(zij |Yobs , θ) = n , =1 δj θ

1 ≤ i ≤ n,

1 ≤ j ≤ q,

(4.31)

while the M-step updates (4.29) by replacing the zij by their conditional expectations. The asymptotic covariance matrix of the MLE θˆ can be obtained by the method of Louis (1982) or the direct computation of the observed information ˆ matrix evaluated at θ = θ.

4.7.3 Comparison with existing likelihood strategies When the likelihood (4.23) takes the NDD form in (4.24), the traditional strategy would introduce latent variables to split n−1 

 k 

k=1

=1

bk

θ

,

so that the augmented likelihood is in the form of a Dirichlet density and hence the EM algorithm can be used to obtain the MLE of θ. In this case, the noniterative method introduced in Section 4.7.1 is more preferable than the EM algorithm as the closed-form MLE solution of θ exists.

NESTED DIRICHLET DISTRIBUTION

157

(a) The traditional EM algorithm based on Dirichlet distribution When the likelihood function is given by (4.27), the traditional strategy introduces (n − 1)(n − 2)/2 latent variables (denoted by {wk }n−1 with the k=2 ), when compared  new strategy, where wk = (w1k , w2k , . . . , wkk ) is used to split ( k=1 θ )bk . Thus, the corresponding missing data are denoted by D ND Ymis = Ymis ∪ {w2 , . . . , wn−1 }, D D so that the likelihood for the augmented data Yaug = {Yobs , Ymis } (equivalently, the D joint pdf of Yaug ) is given by D D ) = f (Yaug |θ) = Dirichletn (θ|s1 , . . . , sn ), L(θ|Yaug

(4.32)

where si = ai + z(i) 1q +

n−1 

wik , 1 ≤ i ≤ n − 1, w11 = ˆ b1 ,

and

(4.33)

k=i

sn = an + z(n) 1q . Thus, the augmented-data MLEs are si − 1 θˆ i = n , =1 (s − 1)

1 ≤ i ≤ n.

(4.34)

On the other hand, the conditional predictive distributions are given by (4.30) and   (θ1 , . . . , θk ) , 2 ≤ k ≤ n − 1. (4.35) wk |(Yobs , θ) ∼ Multinomial bk ; k =1 θ Therefore, the E-step of the traditional EM algorithm computes (4.31) and bk θi E(wik |Yobs , θ) = k

=1 θ

,

2 ≤ k ≤ n − 1,

1 ≤ i ≤ k,

(4.36)

and the M-step updates (4.34) by replacing the {zij } and {wik } with their conditional expectations. (b) Comparison of two EM algorithms Based on the criteria for comparing different EM algorithms introduced in Section 3.8.2(c), we have the following results. Theorem 4.13 The EM algorithm given in (4.29) and (4.31) uniformly dominates the EM algorithm given in (4.34), (4.31), and (4.36) under the trace criterion; that is, ND D tr {Iaug (θ)} ≤ tr {Iaug (θ)}

∀ θ ∈ Tn ,

158

DIRICHLET AND RELATED DISTRIBUTIONS

and the strict inequality holds provided that there exists at least one k such that bk > 0. In addition, the new EM dominates the traditional EM under the criterion of largest eigenvalue. ¶ Proof. From (4.32), it is easy to obtain − −

D ∂2 log f (Yaug |θ)

∂θi2 D ∂2 log f (Yaug |θ)

∂θi ∂θj

=

si − 1 sn − 1 + , θn2 θi2

i = 1, . . . , n − 1,

=

sn − 1 , θn2

i= / j.

and

Thus, we have 

D (θ) Iaug

  D ∂2 log f (Yaug |θ)   =E − Yobs , θ ∂θ∂θ  ∗  ∗ sn−1 −1 s∗ − 1 s1 − 1 = diag ,..., + n 2 · 1n−1 1n−1 , 2 2 θn θ1 θn−1

where, for i = 1, . . . , n − 1, si∗ = E(si |Yobs , θ) = ai + E(z(i) 1q |Yobs , θ) +

n−1  k=i

bk θi k

=1 θ

and

sn∗ = E(sn |Yobs , θ) = an + E(z(n) 1q |Yobs , θ). On the other hand, from (4.28), ND ND L(θ|Yaug ) = f (Yaug |θ) = NDn,n−1 (θ|˜a, b),

where a˜ = (˜a1 , . . . , a˜ n ) with a˜ i = ai + z(i) 1q . Similar to (4.25), we obtain    ND ∂2 log f (Yaug |θ)  ND Yobs , θ Iaug (θ) = E −  ∂θ∂θ  ∗  a˜ ∗ − 1 a˜ ∗ − 1 a˜ − 1 = diag 1 2 , . . . , n−12 + n 2 · 1n−1 1n−1 + An−1 , θn θ1 θn−1 where An−1 is given by (4.26), and a˜ i∗ = E(˜ai |Yobs , θ) = ai + E(z(i) 1q |Yobs , θ), Let hk = bk /

k

i = 1, . . . , n.

for k = 1, . . . , n − 1. Consequently,   n−1 n−1 hn−1 k=1 hk k=2 hk D ND Iaug (θ) − Iaug (θ) = diag , ,..., − An−1 . θ1 θ2 θn−1 =1 θ

NESTED DIRICHLET DISTRIBUTION

159

For the trace criterion, we have D (θ)} tr {Iaug



ND tr {Iaug (θ)}

=

n−1 n−1  



hk

i=1 k=i

1 1 − k θi =1 θ



≥ 0,

for all θ ∈ Tn , and strict inequality holds provided there exists at least one k such that bk > 0. The first part of Theorem 4.13 is thus proved. To prove the second part, we notice that Geng et al. (2000) developed a partial imputation EM algorithm that imputes partial missing data, and they proved that the convergence speed of the partial imputation EM is faster than the traditional EM algorithm. Since the new EM algorithm defined by (4.29) and (4.31) can be viewed as a special partial imputation EM, Theorem 4 of Geng et al. (2000) confirms the second conclusion of Theorem 4.13. End

4.8 Small-sample Bayesian inference When the sample size is small, the asymptotic methods in Section 4.7 are not appropriate and the Bayesian approach is a useful alternative. Furthermore, in situations where some parameters are unidentified (see Section 4.9.1) in frequentist settings, the Bayesian approach may be feasible if an informative prior is assigned. In addition, it is appealing to use a Bayesian approach to specify the whole posterior curve for the parameter of interest.

4.8.1 Likelihood with NDD form When Lst (θ|Yobs ) is a constant, the likelihood function is given by (4.24). For Bayesian analysis, the NDD is the natural conjugate prior distribution. Multiplying (4.24) by the prior distribution θ ∼ NDn,n−1 (a∗ , b∗ )

(4.37)

yields the nested Dirichlet posterior distribution θ|Yobs ∼ NDn,n−1 (a + a∗ − 1n , b + b∗ ).

(4.38)

The exact first-order and second-order posterior moments of {θi } can be obtained explicitly from Theorem 4.3. The posterior means are similar to the MLEs. In addition, the posterior samples of θ in (4.38) can be generated by utilizing (4.5), which only involves the simulation of independent beta random variables.

4.8.2 Likelihood beyond NDD form When the observed likelihood function is given by (4.27), we introduce a new DA Gibbs sampling (Tanner and Wong, 1987) based on NDD, rather than the Dirichlet distribution, to generate dependent posterior samples of θ. We take the prior as

160

DIRICHLET AND RELATED DISTRIBUTIONS

(4.37). From (4.28), the complete-data posterior is an NDD; that is, ND f (θ|Yobs , Ymis ) = NDn,n−1 (θ|a + Z1 1q + a∗ − 1n , b + b∗ ).

(4.39)

Based on (4.30) and (4.39), the new DA Gibbs sampling can be implemented to obtain dependent posterior samples for θ. Furthermore, when q (cf. Eq. (4.27)) is small (e.g., q = 1 or 2), we can adopt the exact IBF sampling of Tian et al. (2007) to obtain i.i.d. samples from the posterior distribution f (θ|Yobs ). In fact, from (4.30), (4.39), and the sampling IBF (Tan et al., 2003), we have ND |Yobs ) ∝ f (Ymis

ND f (Ymis |Yobs , θ 0 ) ND , f (θ 0 |Yobs , Ymis )

(4.40)

ND where θ 0 is an arbitrary point in Tn . Since Ymis is a discrete random vector assumND() L ing finite values on its domain, we can first generate i.i.d. samples {Ymis }=1 of ND() ND from the discrete distribution (4.40) and then generate θ () ∼ f (θ|Yobs , Ymis ). Ymis () L Thus, {θ }=1 are i.i.d. samples from the posterior f (θ|Yobs ).

4.8.3 Comparison with the existing Bayesian strategy When the likelihood (4.23) has the NDD form given by (4.24), the usual Bayesian strategy introduces latent variables so that the augmented posterior is a Dirichlet distribution, and hence the DA Gibbs sampling can be used to obtain dependent posterior samples of θ. The noniterative sampling approach in Section 4.8.1 can obtain i.i.d. posterior samples of θ easier than the iterative DA Gibbs sampling. When the likelihood function is given by (4.27), the traditional Bayesian strategy introduces (n − 1)(n − 2)/2 latent variables (denoted by W = {wk }n−1 k=2 ), so that the likelihood function is given by (4.32). If we consider the conjugate Dirichlet distribution θ ∼ Dirichletn (a∗ ) as the prior, then the augmented posterior remains to be a Dirichlet distribution, ND f (θ|Yobs , Ymis , W) = Dirichletn (θ|s + a∗ ),

s= ˆ (s1 , . . . , sn ),

(4.41)

where {si }ni=1 are defined in (4.33). Therefore, the P-step of the traditional DA ND generates θ from (4.41), and the I-step independently inputs Ymis from (4.30) and inputs W from (4.35). To compare the different DA structures, we consider the criterion of lag-1 autocorrelation, a common measure for studying the mixing rate of a Markov chain. If the chain from a DA Gibbs sampling reaches its equilibrium, Liu (1994) showed that, for any nonconstant scalar-valued function h, Corr{h(θ (t) ), h(θ (t+1) )} =

Var[E{h(θ)|Yaug }|Yobs ] . Var{h(θ)|Yobs }

NESTED DIRICHLET DISTRIBUTION

161

Therefore, the maximum autocorrelation over linear combinations h(θ) = xθ is xVar{E(θ|Yaug )|Yobs }x = ρ(B), xVar(θ|Yobs )x x= / 0

sup Corr(xθ (t) , xθ (t+1) ) = sup

x= / 0

where ρ(B) denotes the spectral radius of B = In − {Var(θ|Yobs )}−1 E{Var(θ|Yaug )|Yobs }, the Bayesian fraction of missing information for θ under f (Yaug |θ). Thus, to reduce the autocorrelation, we need to maximize E{Var(θ|Yaug )|Yobs } over all DA structures using the positive semi-definite ordering (van Dyk and Meng, 2001). To compare two different DA structures, say DA1 and DA2, based on the same observed data Yobs , we only need to compare their E{Var(θ|Yaug )|Yobs }. We say algorithm DA1 converges no slower than algorithm DA2 if DA1 DA2 E{Var(θ|Yaug )|Yobs } ≥ E{Var(θ|Yaug )|Yobs }.

To compare the traditional DA structure defined by (4.41), (4.30), and (4.35) with the proposed DA structure introduced in Section 4.8.2, we must first choose the same prior distribution, which can be done by setting b∗ = 0 in (4.39) or (4.37). Then, the observed posterior distributions corresponding to the two DA structures are identical. Since ND ND ND D Yaug = {Yobs , Ymis } ⊆ {Yobs , Ymis , W} = Yaug ,

we immediately have ND D Var(θ|Yaug ) ≥ Var(θ|Yaug ), ND D )|Yobs } ≥ E{Var(θ|Yaug )|Yobs }. so that E{Var(θ|Yaug

Theorem 4.14 The DA Gibbs sampling defined by (4.39) and (4.30) converges no slower than the traditional DA Gibbs sampling defined by (4.41), (4.30), and (4.35). ¶ Since Yobs does not vary in the sampling process, we can represent the two DA structures simply as DAND Structure:

ND θ|Ymis ,

ND Ymis |θ.

DAD Structure:

ND θ|(Ymis , W),

ND (Ymis , W)|θ.

ND with W In the DAND structure, the two components being iterated are θ and Ymis D being integrated out, while in the DA structure, an extra random vector W is introduced. Using Theorem 5.1 of Liu et al. (1994), we immediately obtain the following result.

162

DIRICHLET AND RELATED DISTRIBUTIONS

Theorem 4.15 Let F ND and F D denote the forward operators of the two DA ND structures and ||F || and ||F D || the corresponding norms. We have (i) ||F ND || ≤ ||F D || and (ii) the spectral radius of the DAND structure is less than or equal to that of the DAD structure. ¶ The former notions of forward operator, norm, and spectral radius can be found in Liu et al. (1994). Theorem 4.15 shows that the maximal correlation between θ and ND ND Ymis is always smaller than that between θ and (Ymis , W). Furthermore, when q is small, the exact IBF sampling in Section 4.8.2 can be used to generate i.i.d. samples from the posterior f (θ|Yobs ), avoiding the convergence problems associated with the iterative DA Gibbs sampling.

4.9 Applications 4.9.1 Sample surveys with nonresponse: simulated data For Example 4.1 in Section 4.2, let φ and 1 − φ denote the probabilities of response and nonresponse respectively. Hence, we have φ = Pr(R = 1) = 1 − φ = Pr(R = 0) =

2r 

θi

and

i=r+1 r 

θi .

i=1

Let N = 1000, r = 5, and φ = 0.65. By independently drawing 100 values of m from Binomial(N, φ) and averaging them, we obtain m = 652, which denotes the number of individuals who respond to the survey. We further assume that θ6 = 0.2, θ7 = 0.12, θ8 = 0.08, θ9 = 0.15, and θ10 = 0.10. Independently drawing 100 values of (n6 , . . . , n10 ) from Multinomial (m; (θ6 , . . . , θ10 )/0.65) and averaging the samples for each component, we obtain n6 = 199, n7 = 120, n8 = 81, n9 = 151, and n10 = 101. Based on the simulated counts Yobs = {(nr+1 , . . . , n2r ); N − m} and the likelihood function (4.3), it is easy to see that the MLE of θ is exactly the mode of the density of NDn,n−1 (a, b) with n = 2r, a = (0, . . . , 0, nr+1 , . . . , n2r ) + 1n and b = (00r−1 , N − m, 0r−1 ). In frequentist settings, we note that θ1 , . . . , θr are nonestimable but 1 − φ is estimable. From Theorem 4.5, we obtain θˆ 6 = 0.199, θˆ 7 = 0.12, θˆ 8 = 0.081, θˆ 9 = 0.151, θˆ 10 = 0.101, and φˆ = 0.652. The corresponding SEs are reported in the fourth column of Table 4.2.

NESTED DIRICHLET DISTRIBUTION

163

Table 4.2 MLEs, SEs and Bayesian estimates of parameters for the simulated data.a Frequentist method

Bayesian method

Parameter

True value

MLE

SE

Mean

std

95% CI

θ1 θ2 θ3 θ4 θ5 1−φ θ6 θ7 θ8 θ9 θ10 φ

— — — — — 0.35 0.20 0.12 0.08 0.15 0.10 0.65

— — — — — 0.348 0.199 0.120 0.081 0.151 0.101 0.652

— — — — — 0.0151 0.0126 0.0102 0.0086 0.0113 0.0095 0.0151

0.1428 0.0727 0.0538 0.0447 0.0390 0.3534 0.1971 0.1192 0.0807 0.1493 0.1001 0.6466

0.0660 0.0568 0.0451 0.0395 0.0348 0.0150 0.0124 0.0100 0.0084 0.0113 0.0093 0.0150

[0.0280, 0.2741] [0.0025, 0.2114] [0.0019, 0.1680] [0.0013, 0.1498] [0.0011, 0.1311] [0.3246, 0.3832] [0.1732, 0.2223] [0.1002, 0.1397] [0.0647, 0.0983] [0.1277, 0.1719] [0.0825, 0.1191] [0.6167, 0.6753]

Source: Ng et al. (2009). Reproduced by permission of the Institute of Statistical Science, Academia Sinica. 5 10 a 1 − φ = Pr(R = 0) = θ and φ = Pr(R = 1) = θ. i=1 i i=6 i

On the other hand, in Bayesian settings, θ1 , . . . , θr are estimable if an informative prior distribution can be assigned to θ. Here, we use the informative prior ∗ with a1∗ = · · · = an∗ = 2 and b1∗ = · · · = bn−1 = 1 in (4.37). We generate 10 000 i.i.d. posterior samples of θ from (4.38). The posterior means, posterior standard deviations, and 95% Bayes CIs for θ and φ are given in Table 4.2.

4.9.2 Dental caries data For Example 4.2 in Section 4.2, we are unable to find the closed-form MLEs. In this case, we consider the EM algorithm introduced in Section 4.7.2 and the exact IBF sampling presented in Section 4.8.2 to handle the likelihood of θ given by (4.4). As (4.4) is a special case of (4.27), with n = 3, q = 1, m1 = n23 , δ11 = 0, and δ21 = δ31 = 1, we only introduce one latent variable z to split (θ2 + θ3 )n23 , so that (4.28) and (4.30) become  

 L(θ|Yobs , z) = ND3,2 θ (n1 + 1, n2 + 1 + z, n3 + 1 + n23 − z), (0, n12 ) and

   f (z|Yobs , θ) = Binomial zn23 ,

 θ2 , θ2 + θ 3

z = 0, 1, . . . , n23 ,

(4.42)

respectively. For the dental caries data, using θ (0) = 13 /3 as the initial value, the new EM algorithm based on (4.29) and (4.31) converges in seven iterations, with 0.016 s CPU time. The resulting MLEs are given by θˆ 1 = 0.2393, θˆ 2 = 0.4880, and θˆ 3 = 0.2727. The corresponding SEs are 0.0547, 0.0674, and 0.0514, obtained by the

164

DIRICHLET AND RELATED DISTRIBUTIONS

Table 4.3 The values of qz (θ 0 ) with θ 0 = 13 /3 and pz = f (z|Yobs ). z

qz (θ 0 )

pz

z

qz (θ 0 )

pz

0 1 2 3 4 5 6 7 8 9

2.713e+032 4.883e+033 4.150e+034 2.213e+035 8.300e+035 2.324e+036 5.036e+036 8.633e+036 1.187e+037 1.319e+037

3.815e-006 6.866e-005 5.836e-004 3.113e-003 1.167e-002 3.268e-002 7.082e-002 1.214e-001 1.669e-001 1.855e-001

10 11 12 13 14 15 16 17 18 —

1.187e+037 8.633e+036 5.036e+036 2.324e+036 8.300e+035 2.213e+035 4.150e+034 4.883e+033 2.713e+032 —

1.669e-001 1.214e-001 7.082e-002 3.268e-002 1.167e-002 3.113e-003 5.836e-004 6.866e-005 3.815e-006 —

Source: Ng et al. (2009). Reproduced by permission of the Institute of Statistical Science, Academia Sinica.

ˆ However, direct computation of the observed information matrix evaluated at θ = θ. using the same initial value, the traditional EM based on Dirichlet augmentation (see (4.34) and (4.36)) converges in 22 iterations, with 0.04 s CPU time, which is about three (or 2.5) times slower than the new EM in terms of the iteration number (or the computing time). The difference would be more substantial if the traditional EM algorithm introduces more than one extra latent variable. For Bayesian analysis, we adopt the uniform prior; that is, setting a∗ = 13 and ∗ b = 0 in (4.37). Hence, the complete-data posterior (4.39) and the sampling IBF (4.40) become  

 f (θ|Yobs , z) = ND3,2 θ (n1 + 1, n2 + 1 + z, n3 + 1 + n23 − z), (0, n12 ) (4.43) and f (z|Yobs ) ∝

f (z|Yobs , θ 0 ) = ˆ qz (θ 0 ) f (θ 0 |Yobs , z)

(4.44)

23 respectively. Let θ 0 = 13 /3. From (4.44), we can calculate {qz (θ 0 )}nz=0 . By defining pz = f (z|Yobs ) for z = 0, 1, . . . , n23 , we have

qz (θ 0 ) pz = n23 , k=0 qk (θ 0 ) which is independent of θ 0 . We report the results in Table 4.3. The exact IBF sampling can be conducted as follows: draw L = 20 000 independent samples {z() }L1 of z from the discrete distribution (4.44) with probabilities pz ; generate θ () ∼ f (θ|Yobs , z() ) at (4.43) for  = 1, . . . , L; then, {θ () }L1 are i.i.d. samples from the observed posterior distribution f (θ|Yobs ). For the dental caries data, the Bayes means for θ1 , θ2 , and θ3 are given by 0.2457, 0.4784,

Posterior density 2 4 6 8

165

0

0

Posterior density 2 4 6 8

NESTED DIRICHLET DISTRIBUTION

0.0

0.2

0.4

θ1

0.6

0.8

1.0

0.2

0.4

θ2

0.6

0.8

1.0

(b)

0

Posterior density 2 4 6 8

(a)

0.0

0.0

0.2

0.4

θ3 (c)

0.6

0.8

1.0

Figure 4.2 Comparisons of three posterior densities in each plot, obtained by using the exact IBF sampling (solid curve, L = 20 000 i.i.d. samples), the DAND algorithm (dotted curve, a total of 40 000 and retaining the last 20 000 samples), and the DAD algorithm (dashed curve, a total of 45 000 and retaining the last 20 000 samples). (a) θ1 ; (b) θ2 ; and (c) θ3 .

and 0.2759, with the corresponding Bayes standard deviations being 0.0532, 0.0654, and 0.0501. The 95% Bayesian CIs are [0.1487, 0.3571], [0.3498, 0.6061], and [0.1832, 0.3785] respectively. The computing time is 36.82 s. Figure 4.2 shows the posterior curves for θ1 , θ2 , and θ3 . To compare the new DAND algorithm defined in (4.42) and (4.43) with the traditional DAD algorithm defined in (4.42),     θ1 f (w|Yobs , θ) = Binomial wn12 , , w = 0, 1, . . . , n12 , θ1 + θ 2 and

 

 f (θ|Yobs , z, w) = Dirichlet3 θ n1 + 1 + w, n2 + 1 + n12 −w + z, n3 + 1 + n23 −z ,

we take the whole posterior curve of {θi } obtained from the exact IBF sampling as a benchmark to assess the convergence of the two Markov chains. We run a single chain of the DAND (DAD ) algorithm to produce 40 000 (45 000) samples and retain the last 20 000 samples. The corresponding computing times are 83.562, and 168.922 s. Figure 4.2 shows that the three curves are almost identical, indicating final convergences for the two DA Gibbs samplers.

166

DIRICHLET AND RELATED DISTRIBUTIONS

4.9.3 Competing-risks model: failure data for radio transmitter receivers (a) Competing-risks model In clinical and epidemiologic studies, we often encounter competing-risks problems (Prentice et al., 1978). The term ‘competing-risks’ has come to encompass the study of any failure process (e.g., survival study) in which there is more than one distinct type of failure or cause of death. For the purpose of illustration, we assume that there are only two possible causes of failure, indexed by i (i = 1, 2). Suppose that the notional times to failure of a unit under those two risks are denoted by the random variables X and Y respectively. The variables X and Y cannot be observed. Available data on each unit typically include the time of failure T = min(X, Y ) ≥ 0, which may be right censored, and the corresponding cause of failure C ∈ {1, 2}, which will be unknown if T is censored. To connect the likelihood function for the competing-risks model with the NDD, we assume that the time to failure T is discrete with m possible values, say t1 , . . . , tm . For continuous failure times, one can classify the failure times into a finite number of intervals with {tj }m j=1 , for instance, being the endpoints of those intervals. The cause-specific hazard rate for the ith cause, gi (tj ), is defined as the rate of failure at time tj from the ith cause given that the unit has been working to time tj ; that is, gi (tj ) = Pr(C = i, T = tj |T ≥ tj ),

i = 1, 2,

1 ≤ j ≤ m.

(4.45)

If we let pij = Pr(C = i, T = tj ), then pij ≥ 0,

m 2  

pij = 1,

i=1 j=1

and (4.45) becomes pij gi (tj ) = 2 m i=1

j  =j

pij

(4.46)

.

One of the objectives is to estimate the cause-specific hazard rate. Let Yobs = {nij : i = 1, 2, 1 ≤ j ≤ m} ∪ {r1 , . . . , rm } denote the observed data, where nij is the number of failures from cause i in the time period indexed by j, and rj is the number of right-censored items during that time period. The likelihood function for the unknown cell probabilities {pij } is given by (Dykstra et al., 1998) L({pij }|Yobs ) ∝

m 2   i=1 j=1

n pijij

×

m   m  j=1

rj

(p1j + p2j )

j  =j+1

.

NESTED DIRICHLET DISTRIBUTION

167

If we reparameterize {pij } by θ = (θ1 , . . . , θ2m ) with θ2j−1 = p1,m−j+1

and θ2j = p2,m−j+1 ,

1 ≤ j ≤ m,

(4.47)

θ ∈ T2m ,

(4.48)

then the above likelihood function can be expressed as L(θ|Yobs ) ∝

2m  i=1

θiai −1

×

j 2m−1   j=1

bj

θk

,

k=1

where a = (a1 , . . . , a2m ) and b = (b1 , . . . , b2m−1 ) with a2j−1 = n1,m−j+1 + 1, b2j−1 = 0,

a2j = n2,m−j+1 + 1, b2j = rm−j ,

1 ≤ j ≤ m, and 1 ≤ j ≤ m − 1.

Therefore, the likelihood function (4.48) is an NDD up to a normalizing constant; that is, θ ∼ ND2m,2m−1 (a, b) on T2m ,

(4.49)

if θ is regarded as a random variable. The MLEs of θ can be obtained analytically2 by using Theorem 4.5. After some straightforward algebras, we find that the MLEs of the cause-specific hazard rates at time tj are given by nij , j  =j (n1j  + n2j  + rj  )

gˆ i (tj ) = m

i = 1, 2,

1 ≤ j ≤ m,

(4.50)

which coincide with the results obtained by Davis and Lawrence (1989). (b) Analyzing failure data of radio transmitter receivers Next, consider the failure data of radio transmitter receivers described in Example 1.7 and Table 1.7. Following Cox (1959), we exclude the 44 items in our analysis as the information contained in them would not be helpful in the estimation of the cause-specific hazard rates for the two types of failure. Based on (4.50), we calculate the estimates gˆ i (tj ) for i = 1, 2 and j = 1, . . . , 13 (see the second and fifth columns of Table 4.4). Figure 4.3 shows the comparison of the two hazard rates. To describe the variability of the estimates gˆ i (tj ), we need to compute their SEs. As a complicated function of {pij }, gi (tj ) defined in (4.46) is related to the parameter vector θ through the relationship (4.47). Hence, the delta method is quite difficult to apply. However, the Bayesian approach is rather straightforward to apply in the current situation. In fact, if we utilize the uniform prior distribution of θ, then the 2

Although Dykstra et al. (1998) obtained the analytical expressions for the MLEs of the parameters of interest based on the approach of Dykstra et al. (1991), they seemed to be unaware that the likelihood function (4.48) is indeed a density function with closed-form mode.

168

DIRICHLET AND RELATED DISTRIBUTIONS

Table 4.4 MLEs and Bayesian estimates of gi (tj ) for the failure data of radio transmitter receivers. Index j

g1 (tj ) Classical

1 2 3 4 5 6 7 8 9 10 11 12 13

g2 (tj )

Bayesian

Classical

Bayesian

MLE

Mean

std

MLE

Mean

std

0.012 05 0.016 22 0.019 19 0.029 78 0.018 06 0.027 81 0.018 42 0.023 50 0.033 61 0.026 71 0.032 96 0.079 64 0.117 64

0.012 68 0.016 90 0.020 17 0.031 29 0.019 65 0.030 20 0.021 13 0.027 59 0.039 98 0.035 38 0.048 53 0.086 67 0.122 05

0.002 46 0.003 07 0.003 63 0.005 20 0.004 58 0.006 36 0.005 96 0.007 83 0.010 83 0.012 18 0.017 65 0.039 46 0.059 21

0.006 95 0.008 39 0.015 07 0.011 06 0.011 68 0.010 59 0.011 72 0.010 68 0.008 40 0.015 26 0.005 49 0.017 69 0.019 60

0.007 43 0.009 05 0.016 08 0.012 12 0.013 06 0.012 38 0.014 05 0.013 84 0.012 33 0.022 07 0.013 92 0.041 18 0.017 94

0.001 84 0.002 23 0.003 35 0.003 24 0.003 83 0.004 02 0.004 86 0.005 54 0.006 18 0.009 82 0.009 62 0.022 55 0.019 21

Type I failure (confirmed)

0.04

0.06

0.08

0.10

Type II failure (unconfirmed)

0.00

0.02

Cause-specific hazard rate

0.12

Source: Tian et al. (2010). Reproduced by permission of Elsevier.

50

100

150

200

250

300

350

400

450

500

550

600 630

Time t j (hours)

Figure 4.3 Comparison of the two cause-specific hazard rates for the radio transmitter receivers data. Source: Tian et al. (2010). Reproduced by permission of Elsevier.

NESTED DIRICHLET DISTRIBUTION

169

posterior distribution of θ is still given by (4.49). Using the stochastic representation in Theorem 4.1, we generate 20 000 posterior samples of θ from (4.49) and calculate 20 000 values of gi (tj ) via (4.46) and (4.47). The corresponding Bayesian means and standard deviations are given in Table 4.4.

4.9.4 Sample surveys: two data sets for death penalty attitude Kadane (1983) analyzed a data set from two sample surveys of jurors’ attitudes toward the death penalty, where respondents are classified into one and only one of the following four groups. C1 : Would not decide guilt versus innocence in a fair and impartial manner. C2 : Fair and impartial on guilt and, when sentencing, would sometimes and sometimes not vote for the death penalty. C3 : Fair and impartial on guilt and, when sentencing, would never vote for the death penalty. C4 : Fair and impartial on guilt versus innocence and when sentencing, would always vote for the death penalty regardless of circumstance. Let ni be the frequency of group i (i = 1, 2, 3, 4). In some cases, some jurors found it difficult to classify their attitudes into one and only one of the aforementioned four categories. Instead, these jurors would consider themselves as a member of a union of some of the categories. For instance, some jurors may consider themselves to be fair and impartial, and would consider the death penalty and would at least sometimes vote for it, if the defendant is found guilty (i.e., C2 ∪ C4 ). Let n24 and n123 be the frequencies for groups C2 ∪ C4 and C1 ∪ C2 ∪ C3 respectively. In the present death penalty study, the frequency data were given by n1 = 68, n3 = 97, and n24 = 674 according to the survey from the Field Research Corporation and by n4 = 15 and n123 = 1484 according to the survey from the Harris Survey Company. The goal is to estimate the cell probabilities. Let Yobs = {Yobs,1 , Yobs,2 } be the combined data and θ = (θ1 , . . . , θ4 ) the cell probabilities, where Yobs,1 = {n1 , n3 ; n24 } and Yobs,2 = {n4 ; n123 } are the counts obtained by the Field Research Corporation and the Harris Survey Company respectively. The likelihood function corresponding to the Field survey is L(θ|Yobs,1 ) ∝ θ1n1 θ3n3 (θ2 + θ4 )n24 . Similarly, the likelihood function based on the Harris survey is L(θ|Yobs,2 ) ∝ θ4n4 (θ1 + θ2 + θ3 )n123 . The combined likelihood for the observed data Yobs is then given by (Dickey et al., 1987)  n L(θ|Yobs ) ∝ ( 4j=1 θj j )θ10 (θ1 + θ2 )0 (θ1 + θ2 + θ3 )n123 · (θ2 + θ4 )n24 , (4.51)

170

DIRICHLET AND RELATED DISTRIBUTIONS

where θ ∈ T4 and n2 = ˆ 0. We observe that the first term in (4.51) follows the ND4,3 (a, b) with a = (n1 + 1, n2 + 1, n3 + 1, n4 + 1) and b = (0, 0, n123 ) up to a normalizing constant, while the second term is simply a power of a linear combination of the components of θ. We use the EM algorithm to calculate the MLEs of θ. By introducing a latent variable z to split (θ2 + θ4 )n24 , the conditional predictive density is    f (z|Yobs , θ) = Binomial zn24 ,

 θ2 , θ2 + θ 4

z = 0, 1, . . . , n24 .

(4.52)

The likelihood function L(θ|Yobs , z) for the complete data is then given by ND4,3 (θ|(n1 + 1, n2 + 1 + z, n3 + 1, n4 + 1 + n24 − z), (0, 0, n123 )). (4.53) Note that the MLEs of θ based on the complete-data {Yobs , z} can be obtained analytically via Theorem 4.5. Using θ (0) = 14 /4 as the initial value, the EM algorithm based on (4.52) and (4.53) converges in 11 iterations. The resultant MLEs are given by θˆ 1 = 0.08105, θˆ 2 = 0.79332, θˆ 3 = 0.11561, and θˆ 4 = 0.01002 with the corresponding SEs being 0.00942, 0.01396, 0.01104, and 0.00257, which are obtained by the direct computation of the observed information matrix evaluated ˆ at θ = θ.

4.9.5 Bayesian analysis of the ultrasound rating data In Section 2.10 we introduced two important notions on the ROC curve and the AUC in disease diagnosis. Let YD and YD¯ denote independent and randomly chosen test results from the diseased and nondiseased population respectively. When the diagnostic result is ordinal, the possible values of YD and YD¯ can be assumed to be consecutive integers 1, 2, . . . , n. Let θi = Pr(YD = i), φj = Pr(YD¯ = j),

i = 1, . . . , n and j = 1, . . . , n, with θ = (θ1, . . . , θn) ∈ Tn and φ = (φ1, . . . , φn) ∈ Tn. The AUC can be calculated via (2.112). Consider the rating data of breast cancer metastasis described in Example 1.8 and Table 1.8. Let yi denote the frequency of category i in the nine diseased patients and zj denote the frequency of category j in the 14 nondiseased patients. Hence, Σ_{i=1}^{4} yi + y23 = 9 and Σ_{j=1}^{4} zj + z23 = 14. Under the assumption of MAR, the likelihood function for the observed data Yobs = {y1, . . . , y4, y23} ∪ {z1, . . . , z4, z23} is

L(θ, φ | Yobs) ∝ {∏_{i=1}^{4} θi^{yi}} (θ2 + θ3)^{y23} × {∏_{j=1}^{4} φj^{zj}} (φ2 + φ3)^{z23},

where θ = (θ1, . . . , θ4) ∈ T4 and φ = (φ1, . . . , φ4) ∈ T4. When the joint prior is the product of two independent Dirichlet densities

∏_{i=1}^{4} θi^{ai−1} × ∏_{j=1}^{4} φj^{bj−1},

the joint posterior density is given by

{∏_{i=1}^{4} θi^{yi+ai−1}} (θ2 + θ3)^{y23} × {∏_{j=1}^{4} φj^{zj+bj−1}} (φ2 + φ3)^{z23}.   (4.54)

Note that the second term in (4.54) can be rewritten as φ2z2 +b2 −1 φ3z3 +b3 −1 φ4z4 +b4 −1 φ1z1 +b1 −1 × φ20 (φ2 + φ3 )z23 (φ2 + φ3 + φ4 )0 . In other words, (φ2 , φ3 , φ4 , φ1 ) ∼ ND4,3 (a∗ , b∗ ) with a∗ = (z2 + b2 , z3 + b3 , z4 + b4 , z1 + b1 ) and b∗ = (0, z23 , 0). We have similar result for the first term in (4.54). The stochastic representation in Theorem 4.1 can be employed to generate i.i.d. posterior samples of θ and φ according to (4.54). Therefore, the posterior distribution of the AUC is readily determined, since (2.112) is a function of θ and φ. For the ultrasound rating data in Table 1.8, we adopt uniform prior distributions (i.e., all ai = bj = 1) for the parameters θ and φ. From (4.54), we can see that the posterior distribution of θ is Dirichlet3 with parameters (6, 3, 3, 1) and is independent of (φ2 , φ3 , φ4 , φ1 ), which is nested Dirichlet with parameters (2, 6, 7, 1) and (0, 2, 0). Using the stochastic representation in Theorem 4.1, we generate 20 000 samples from the two independent posterior distributions and calculate 20 000 values, via (2.112), for the AUC. The posterior density and histogram of the AUC are shown in Figure 4.4, and the corresponding results are reported in Table 4.5. It should be noted that Peng and Hall (1996) and Hellmich et al. (1998) used a regression model for the ordinal responses, normality assumption for a latent variable, and MCMC methods for computing the posterior distribution of the AUC. In addition, their results were based on a five-point ordinal scale; that is, the rating scales are 1, 2, 23, 3, and 4. Therefore, there is a great difference between their results and ours. However, this is not surprising, since we expect that the Bayesian estimate would produce underestimated area as it is based on the linear interpolation of four points on the graph (Broemeling, 2007: 84). 3

Note that y23 = 0.
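For readers who want to reproduce this analysis, a minimal R sketch is given below. It draws θ from Dirichlet(6, 3, 3, 1) and (φ2, φ3, φ4, φ1) from ND4,3((2, 6, 7, 1), (0, 2, 0)) using the independent-beta (short memory) representation of Section 4.10.2; the Beta(2, 6), Beta(10, 7), and Beta(17, 1) parameters below are derived from that correspondence rather than quoted from Theorem 4.1. The AUC line implements one common definition for ordinal ratings and should be replaced by the exact form of (2.112) to match Table 4.5.

```r
## Posterior simulation for the ultrasound rating data (uniform priors)
rdirichlet1 <- function(a) { g <- rgamma(length(a), a); g / sum(g) }
set.seed(1)
M   <- 20000
auc <- numeric(M)
for (m in 1:M) {
  theta <- rdirichlet1(c(6, 3, 3, 1))          # diseased cell probabilities
  y  <- rbeta(3, c(2, 10, 17), c(6, 7, 1))     # short-memory betas for the ND part
  s3 <- y[3]; s2 <- y[2] * y[3]; s1 <- y[1] * y[2] * y[3]
  phi <- c(1 - s3, s1, s2 - s1, s3 - s2)       # (phi1, phi2, phi3, phi4)
  ## One common AUC: Pr(Y_nondiseased > Y_diseased) + 0.5 Pr(tie); use (2.112) instead
  auc[m] <- sum(outer(theta, phi) * outer(1:4, 1:4, "<")) + 0.5 * sum(theta * phi)
}
c(median = median(auc), mean = mean(auc), sd = sd(auc))
quantile(auc, c(0.025, 0.975))                 # an equal-tail 95% interval
```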


Figure 4.4 (a) Posterior density of the AUC given by (2.112) for the ultrasound rating data of breast cancer metastasis, where the density curve is estimated by a kernel density smoother based on 20 000 i.i.d. samples. (b) The histogram of the AUC. Source: Tian et al. (2010). Reproduced by permission of Elsevier.

Table 4.5 Bayesian estimates of AUC for the ultrasound rating data.

Investigator              Median    Mean     std      95% CI
Peng and Hall (1996)      —         0.987    —        [0.927, 0.999]
Hellmich et al. (1998)    —         0.903    0.073    [0.720, 0.990]
Our method                0.700     0.693    0.098    [0.487, 0.864]

Source: Tian et al. (2010). Reproduced by permission of Elsevier.

4.10 A brief historical review

4.10.1 The neutrality principle

Connor and Mosimann (1969) introduced the concept of so-called neutrality. Let x = (x_1, . . . , x_n) ∈ T_n. The proportion x_1 is said to be neutral if x_1 is independent of the random vector

(x_2/(1 − x_1), x_3/(1 − x_1), . . . , x_n/(1 − x_1)).

In addition, they also introduced the notion of complete neutrality; see Definition 2.6. Alternatively, Theorem 2 of Connor and Mosimann (1969) can be used as the definition of complete neutrality. That is, x is completely neutral if and only if z_1, z_2, . . . , z_{n−1} are mutually independent, where

z_1 = x_1   and   z_i = x_i / (1 − Σ_{j=1}^{i−1} x_j),   i = 2, . . . , n − 1.   (4.55)

Connor and Mosimann (1969) applied the concept of complete neutrality to obtain a generalization of the Dirichlet distribution. If x is completely neutral and z_i ∼ Beta(α_i, β_i), i = 1, . . . , n − 1, then the inverse transformation is given by (2.8) and the Jacobian is

J(x_1, . . . , x_{n−1} → z_1, . . . , z_{n−1}) = ∏_{i=1}^{n−1} (1 − z_i)^{n−1−i}.

Therefore, the density function of x_{−n} = (x_1, . . . , x_{n−1}) is given by

{∏_{i=1}^{n−1} B(α_i, β_i)}^{−1} x_n^{β_{n−1}−1} ∏_{i=1}^{n−1} {x_i^{α_i−1} (Σ_{j=i}^{n} x_j)^{β_{i−1}−(α_i+β_i)}}.   (4.56)

Connor and Mosimann (1969) called (4.56) the generalized Dirichlet density function. In particular, if we let β_{i−1} = α_i + β_i for i = 2, . . . , n − 1, then (4.56) reduces to the standard Dirichlet density. Wong (1998) studied the moment structure of this generalized Dirichlet distribution. James (1972, 1975) provided some results based on Connor and Mosimann's generalization and proposed a new generalization. While dealing with problems in reliability studies and survival analysis, Mathai (2002) came across a general density of the type (e.g., see Thomas and George (2004))

c ∏_{i=1}^{n} x_i^{α_i−1} · (1 − x_1)^{β_1} (1 − x_1 − x_2)^{β_2} · · · (1 − x_1 − · · · − x_{n−1})^{β_{n−1}},

where c is the normalizing constant. In fact, we can rewrite (4.56) as

x_n^{α_n+β_n−1} x_{n−1}^{α_{n−1}−1} · · · x_1^{α_1−1} × x_n^{β_{n−1}−(α_n+β_n)} (x_n + x_{n−1})^{β_{n−2}−(α_{n−1}+β_{n−1})} · · · (x_n + · · · + x_2)^{β_1−(α_2+β_2)},

implying (x_n, x_{n−1}, . . . , x_1) ∼ ND_{n,n−1}(a, b) on T_n, where a = (α_n + β_n, α_{n−1}, . . . , α_1) and b = (b_1, . . . , b_{n−1}) with

b_j = β_{n−j} − (α_{n−j+1} + β_{n−j+1}),   j = 1, . . . , n − 1.
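A completely neutral vector is easy to simulate directly from this construction. The R sketch below generates draws from the Connor–Mosimann generalized Dirichlet (4.56) by drawing the independent beta variables z_i and inverting (4.55); the stick-breaking form of the inverse map is stated here as an assumption rather than quoted from (2.8).

```r
## Sample from the generalized Dirichlet of Connor and Mosimann (1969):
## z_i ~ Beta(alpha_i, beta_i) independently, then x_1 = z_1,
## x_i = z_i * prod_{j<i}(1 - z_j), and x_n = prod_j (1 - z_j).
rgendirichlet <- function(m, alpha, beta) {
  k <- length(alpha)                       # k = n - 1 beta variables
  t(replicate(m, {
    z     <- rbeta(k, alpha, beta)
    stick <- cumprod(c(1, 1 - z))          # stick remaining before each break
    c(z * stick[1:k], stick[k + 1])        # (x_1, ..., x_{n-1}, x_n)
  }))
}
x <- rgendirichlet(5, alpha = c(2, 3, 4), beta = c(5, 4, 3))
rowSums(x)                                 # every row lies on the simplex T_4
```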

4.10.2 The short memory property

Let x = (x_1, . . . , x_n) ∈ T_n and define

y_i = (Σ_{j=1}^{i} x_j)/(Σ_{j=1}^{i+1} x_j),   i = 1, . . . , n − 2,   and   y_{n−1} = Σ_{j=1}^{n−1} x_j.

If y_1, . . . , y_{n−1} are mutually independent, then x is said to have a short memory property (Thomas and George, 2004). Suppose that x has a short memory property and y_i ∼ Beta(α_i, β_i), i = 1, . . . , n − 1; then the inverse transformation is given by (4.5) and the Jacobian is (cf. the proof of Theorem 4.1)

J(x_1, . . . , x_{n−1} → y_1, . . . , y_{n−1}) = ∏_{i=1}^{n−1} y_i^{i−1}.

Therefore, the density function of x_{−n} = (x_1, . . . , x_{n−1}) is given by

{∏_{i=1}^{n−1} B(α_i, β_i)}^{−1} x_1^{α_1−1} ∏_{i=1}^{n−1} x_{i+1}^{β_i−1} ∏_{j=2}^{n−1} (Σ_{k=1}^{j} x_k)^{α_j−(α_{j−1}+β_{j−1})}.   (4.57)

That is, x ∼ ND_{n,n−1}(a, b) on T_n, where a = (α_1, β_1, β_2, . . . , β_{n−1}) and b = (0, b_2, . . . , b_{n−1}) with

b_j = α_j − (α_{j−1} + β_{j−1}),   j = 2, . . . , n − 1.

5

Inverted Dirichlet distribution This chapter introduces and studies the inverted Dirichlet distribution. The definition of the inverted Dirichlet distribution through a density function is presented in Section 5.1. Alternatively, the definition through stochastic representation is introduced in Section 5.2. Marginal and conditional distributions and the cumulative distribution function and survival function are derived in Sections 5.3 and 5.4 respectively. In Sections 5.5 and 5.6, we respectively investigate the characteristic function of the inverted Dirichlet distribution and the distribution of the linear combination of inverted Dirichlet variates. Connections between the inverted Dirichlet distribution and other multivariate distributions are discussed in Section 5.7. Some important applications are presented in Section 5.8.

5.1 Definition through the density function 5.1.1 Density function Definition 5.1 A random vector v = (v1 , . . . , vn ) ∈ Rn+ is said to follow an inverted Dirichlet distribution if its density is

IDirichlet_{n+1}(v | a) = [1/B_{n+1}(a)] · ∏_{i=1}^{n} v_i^{a_i−1} / (1 + Σ_{i=1}^{n} v_i)^{Σ_{j=1}^{n+1} a_j},   (5.1)

where a = (a_1, . . . , a_n, a_{n+1}) is a parameter vector and a_j > 0 for j = 1, . . . , n + 1. We will write v ∼ IDirichlet(a_1, . . . , a_n; a_{n+1}) on R^n_+ or v ∼ IDirichlet_{n+1}(a) on R^n_+ accordingly. ¶

5.1.2 Several useful integral formulae

First of all, we need to verify the following integral identity:

∫_{R^n_+} ∏_{i=1}^{n} v_i^{a_i−1} / (1 + Σ_{i=1}^{n} v_i)^{Σ_{j=1}^{n+1} a_j} dv = B_{n+1}(a).   (5.2)

For this purpose, we make the transformation x_i = v_i/(1 + v_+), i = 1, . . . , n, where v_+ =̂ v_1 + · · · + v_n. It is easy to see that x = (x_1, . . . , x_n) ∈ V_n. The corresponding Jacobian is

J(x → v) = |∂(x_1, . . . , x_n)/∂(v_1, . . . , v_n)| = (1 + v_+)^{−2n} × |(1 + v_+)I_n − v 1_n'| = (1 + v_+)^{−n−1}.

Thus, the left-hand side of (5.2) is equal to

∫_{V_n} [∏_{i=1}^{n} {x_i(1 + v_+)}^{a_i−1} / (1 + v_+)^{Σ_{j=1}^{n+1} a_j}] (1 + v_+)^{n+1} dx
  = ∫_{V_n} ∏_{i=1}^{n} x_i^{a_i−1} (1/(1 + v_+))^{a_{n+1}−1} dx
  = ∫_{V_n} ∏_{i=1}^{n} x_i^{a_i−1} (1 − 1_n'x)^{a_{n+1}−1} dx
  = ∏_{j=1}^{n+1} Γ(a_j) / Γ(Σ_{j=1}^{n+1} a_j) = B_{n+1}(a)      [by (2.1)],

which completes the proof of (5.2). In fact, for any nonnegative measurable function f(·), we have the following general integral (Liouville, 1839):

∫_{V_n(c)} f(Σ_{i=1}^{n} z_i) ∏_{i=1}^{n} z_i^{b_i−1} dz_i = B_n(b) ∫_0^{c} f(y) y^{Σ_{i=1}^{n} b_i − 1} dy,   (5.3)

where b = (b_1, . . . , b_n). By letting c → ∞ in (5.3), we obtain the well-known Liouville integral

∫_{R^n_+} f(Σ_{i=1}^{n} z_i) ∏_{i=1}^{n} z_i^{b_i−1} dz_i = B_n(b) ∫_0^{∞} f(y) y^{Σ_{i=1}^{n} b_i − 1} dy.   (5.4)

Let f(y) = (1 + y)^{−Σ_{i=1}^{n+1} a_i} in (5.4); we obtain (5.2) alternatively.

5.1.3 The mixed moment and the mode

Using identity (5.2), the mixed moment of v is

E(∏_{i=1}^{n} v_i^{r_i}) = B(a_1 + r_1, . . . , a_n + r_n, a_{n+1} − r_+) / B_{n+1}(a)
  = [Γ(a_{n+1} − r_+)/Γ(a_{n+1})] × ∏_{i=1}^{n} Γ(a_i + r_i)/Γ(a_i),   if a_{n+1} > r_+ =̂ Σ_{i=1}^{n} r_i,   (5.5)

and does not exist otherwise. In particular, we have

E(v_i) = a_i/(a_{n+1} − 1),
Var(v_i) = a_i(a_i + a_{n+1} − 1) / {(a_{n+1} − 1)^2 (a_{n+1} − 2)},   and
Cov(v_i, v_j) = a_i a_j / {(a_{n+1} − 1)^2 (a_{n+1} − 2)},   i ≠ j.   (5.6)

Theorem 5.1 The mode of the inverted Dirichlet density (5.1) is

v̂_i = (a_i − 1)/(a_{n+1} + n),   1 ≤ i ≤ n,   (5.7)

if a_i ≥ 1, and does not exist otherwise. ¶

5.2 Definition through stochastic representation

Definition 5.2 Let {y_i}_{i=1}^{n+1} be independent random variables. If y_i ∼ Gamma(a_i, 1) and

v_i =^d y_i / y_{n+1},   i = 1, . . . , n,   (5.8)

then the distribution of v = (v1 , . . . , vn ) is called the inverted Dirichlet distribution with parameter vector a = (a1 , . . . , an , an+1 ). We will write v ∼ IDirichletn+1 (a) on Rn+ . ¶ It is easy to verify that Definitions 5.1 and 5.2 are equivalent. The stochastic representation (5.8) is very useful. First, by using (5.8), we immediately obtain the mixed moment (5.5). Second, the stochastic representation (5.8) can be used to prove Theorems 5.2 and 5.3 below, which connect the inverted Dirichlet distribution with Dirichlet distribution and inverted beta distribution.
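For instance, (5.8) gives a two-line sampler, and a quick Monte Carlo run can be used to check the moment formulas (5.5) and (5.6); the R sketch below does this for an arbitrary (illustrative) parameter vector.

```r
## Inverted Dirichlet draws via the gamma-ratio representation (5.8):
## v_i = y_i / y_{n+1}, with independent y_j ~ Gamma(a_j, 1).
ridirichlet <- function(m, a) {
  n <- length(a) - 1
  y <- matrix(rgamma(m * (n + 1), shape = rep(a, each = m)), m, n + 1)
  y[, 1:n, drop = FALSE] / y[, n + 1]
}
a <- c(3, 2, 4, 6)          # (a_1, a_2, a_3; a_4) with a_{n+1} = 6
v <- ridirichlet(1e5, a)
colMeans(v)                  # compare with a_i/(a_{n+1} - 1) = (0.6, 0.4, 0.8)
cov(v)[1, 2]                 # compare with a_1 a_2 / ((a_{n+1}-1)^2 (a_{n+1}-2)) = 0.06
```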

Theorem 5.2 Let v = (v_1, . . . , v_n) ∼ IDirichlet_{n+1}(a) on R^n_+.

(i) If we define

x =^d v/(1 + v_1 + · · · + v_n)   or   v =^d x/(1 − x_1 − · · · − x_n),   (5.9)

then x = (x_1, . . . , x_n) ∼ Dirichlet_{n+1}(a) on V_n.

(ii) ‖v‖_1 ∼ IBeta(Σ_{j=1}^{n} a_j, a_{n+1}), v/‖v‖_1 ∼ Dirichlet(a_1, . . . , a_n) on T_n, and ‖v‖_1 is independent¹ of v/‖v‖_1.   (5.10) ¶

Theorem 5.3 Let v = (v1 , . . . , vn ) ∼ IDirichletn+1 (a) on Rn+ and W1 = vi /(1 + W2 ), where W2 = v1 + v2 + · · · + vi−1 . Then W1 ∼ IBeta(ai , a1 + · · · + ai−1 + an+1 ), W2 ∼ IBeta(a1 + · · · + ai−1 , an+1 ), and W1 ⊥ ⊥ W2 .



5.3 Marginal and conditional distributions Using the stochastic representation (5.8), we can easily show that the marginal distribution of any subvector of the inverted Dirichlet random vector is still an inverted Dirichlet distribution. Next, the stochastic representation (5.8) can also be employed to derive the conditional distribution of (vs+1 , . . . , vn ) given v1 = v∗1 , . . . , vs = v∗s . Set 

s vi = vi 1+ vj , i = s + 1, . . . , n. (5.11) j=1

By (5.8), we have v i = d

yi yi = ˆ , y1 + · · · + ys + yn+1 yn+1

i = s + 1, . . . , n,

which implies the distribution of (v s+1 ,  . . . , v n ) is again inverted Dirichlet distributed with parameters (as+1 , . . . , an ; sj=1 aj + an+1 ), and is independent of 1

Note that v = (v/v1 ) · v1 .


v1 , . . . , vs . Therefore, given v1 = v∗1 , . . . , vs = v∗s , we have d

vi = 1 +

s

v∗j



v i ,

i = s + 1, . . . , n.

(5.12)

j=1

We summarize these observations in the following theorem. Theorem 5.4 Let v = (v1 , . . . , vn ) ∼ IDirichletn+1 (a) on Rn+ . (i) For any s < n, the subvector (v1 , . . . , vs ) has an inverted Dirichlet distribution with parameters (a1 , . . . , as ; an+1 ). In particular, vi ∼ IBeta(ai , an+1 ). (ii) The conditional distribution of 

s v i = vi 1+ v∗j , i = s + 1, . . . , n, j=1

given v1 = v∗1 , . . . , vs = v∗s , follows an inverted Dirichlet distribution with parameters (as+1 , . . . , an ; sj=1 aj + an+1 ). ¶ It is noteworthy that the conditional distribution of (vs+1 , . . . , vn ) given v1 , . . . , vs is not a multivariate inverted Dirichlet distribution.

5.4 Cumulative distribution function and survival function

5.4.1 Cumulative distribution function

Let v ∼ IDirichlet_{n+1}(a) on R^n_+ and b = (b_1, . . . , b_n). The cdf of v is defined by

F_n^{ID}(b|a) = Pr{v < b} = ∫_0^{b_1} · · · ∫_0^{b_n} IDirichlet_{n+1}(v|a) dv.   (5.13)

(a) A recursive formula for the special case that all {ai } are integers First, we consider the special case that all a1 , . . . , an+1 are integers and derive the ID cdf FnID (·|·) in terms of Fn−r (·|·). Lemma 5.1 (Tiao and Guttman, 1965). Let m be a positive integer and (τ, λ, b) be positive numbers. Hence,  b B(λ, m) t m−1 dt I1 = ˆ = − I2 , (5.14) m+λ (1 + τ)λ 0 (1 + t + τ)


where



I2 = ˆ =



t m−1 dt (1 + t + τ)m+λ b m−1 (m) bj (j + λ)

(m + λ)

j=0

j!(1 + b + τ)j+λ

.

(5.15) ¶

Proof. Let u = (1 + τ)/(1 + t + τ) and θ = (1 + τ)/(1 + b + τ). Thus, I1 and I2 respectively become  1 uλ−1 (1 − u)m−1 du and I1 = (1 + τ)−λ θ  θ uλ−1 (1 − u)m−1 du. I2 = (1 + τ)−λ 0

Therefore, we have (1 + τ)λ I1 = 1 − Iθ (λ, m) B(λ, m) (1 + τ)λ I2 = Iθ (λ, m), B(λ, m)

and

(5.16) (5.17)

where Iθ (λ, m) denotes the incomplete beta function. By combining (5.16) with (5.17), we immediately obtain (5.14). On the other hand,  θ (1 + τ)λ I2 = uλ−1 (1 − u)m−1 du 0  1 yλ−1 (1 − yθ)m−1 dy = θλ 0  1 = θλ yλ−1 [(1 − y) + y(1 − θ)]m−1 dy = θλ

0 m−1  j=0

m − 1 (1 − θ)j B(j + λ, m − j), j

which yields (5.15).

End

When a1 is an integer, using Lemma 5.1, we have ID FnID (b|a) = Fn−1 (b2 , . . . , bn |a2 , . . . , an ; an+1 )

an+1 a

s 1 −1 1 (an+1 + s) b1 − 1 + b1 (an+1 )s! 1 + b1 s=0 

bn  b2 ID ×Fn−1 ,..., a , . . . , a ; a +s . 2 n n+1 1 + b1 1 + b1 

(5.18)


In particular, we have F1ID (b1 |a1 ; a2 ) ID F2 (b1 , b2 |a1 , a2 ; a3 )

= Ib1 /(1+b1 ) (a1 , a2 ) and (5.19) = Ib2 /(1+b2 ) (a2 , a3 )

a3 a

s 1 −1 1 (a3 + s) b1 − 1 + b1 (a3 )s! 1 + b1 s=0 × Ib2 /(1+b1 +b2 ) (a2 , a3 + s).

(5.20)

(b) Calculation of the cdf by using the importance sampling method Exact evaluation of the cdf FnID (b|a) for arbitrary a is generally difficult, particularly for large n. Yassaee (1976) developed a generalized Gaussian method to calculate the cdf FnID (b|a). Below, we use the importance sampling method to evaluate the cdf. Let yi = vi /bi , i = 1, . . . , n. Thus, (5.13) becomes n

biai /ai Bn+1 (a)

FnID (b|a) =



1

i=1



1

···

ai yiai −1 dyi n+1  a (1 + ni=1 bi yi ) j=1 j i=1

0

0

n

K biai /ai 1 · Bn+1 (a) K k=1



n

1

i=1

(1 +

n

(k) i=1 bi yi )

n+1 , j=1

(5.21)

aj

where yi(1) , . . . , yi(K) is a random sample of size K from Beta(ai , 1), i = 1, . . . , n. (c) A lower bound of the cdf Using the inequality 1+

n

vi ≤

i=1

n  (1 + vi ),

vi > 0,

i=1

or

1+

n

−11a



vi

i=1

n   (1 + vi )−11 a , i=1

Yassaee (1981) obtained the following lower bound: FnID (b|a) = ≥

1 Bn+1 (a) 1 Bn+1 (a)



bn



···

0



0

b1

0

bn



··· 0

b1

n

vai −1 dvi ni (1 + i=1 vi )1a n ai −1 v dvi ni=1 i 1a i=1 (1 + vi ) i=1


n

=

n

=

i=1

n

=

B(ai , 1a − ai )  Bn+1 (a) i=1 n

i=1

i=1



B(ai , 1 a − ai ) Bn+1 (a) B(ai , 1a − ai ) Bn+1 (a)

 0

bi

viai −1 1 · dvi B(ai , 1a − ai ) (1 + vi )1a

n  bi /(1+bi )  i=1 n 

0

xiai −1 (1 − xi )1 a−ai −1 dxi B(ai , 1a − ai ) 

Ibi /(1+bi ) (ai , 1a − ai ).

(5.22)
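As an aside, the importance-sampling estimator (5.21) given above is straightforward to code. The R sketch below evaluates F_n^{ID}(b|a) by drawing y_i from Beta(a_i, 1) and averaging (1 + Σ_i b_i y_i)^{−Σ_j a_j}, scaled by {∏_i b_i^{a_i}/a_i}/B_{n+1}(a); the parameter values in the example are illustrative, and the univariate check uses (5.19).

```r
## Importance-sampling estimate of the inverted Dirichlet cdf, following (5.21)
pidirichlet <- function(b, a, K = 1e5) {
  n <- length(b)                                # a has length n + 1
  lconst <- sum(a[1:n] * log(b)) - sum(log(a[1:n])) -
            (sum(lgamma(a)) - lgamma(sum(a)))   # log of prod(b_i^{a_i}/a_i)/B_{n+1}(a)
  y <- matrix(rbeta(K * n, rep(a[1:n], each = K), 1), K, n)
  w <- (1 + drop(y %*% b))^(-sum(a))            # integrand at the K beta draws
  exp(lconst) * mean(w)
}
## Univariate check against (5.19): F_1^ID(b1|a1; a2) = I_{b1/(1+b1)}(a1, a2)
pidirichlet(2.5, c(3, 4)); pbeta(2.5 / 3.5, 3, 4)
```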

i=1

5.4.2 Survival function Let v ∼ IDirichletn+1 (a) on Rn+ and t = (t1 , . . . , tn ). The survival function of v is defined as  ∞  ∞ SnID (t|a) = Pr{v ≥ t} = ··· IDirichletn+1 (v|a) dv. tn

t1

Theorem 5.5 Let all ai (i = 1, . . . , n + 1) be positive integers. Hence,

an+1 a a n −1 1 −1 1 1 t1k1 · · · tnkn n SnID (t|a) = ··· (an+1 ) 1 + i=1 ti k ! · · · kn ! k1 =0 kn =0 1 n (an+1 + i=1 ki ) n . ×  (1 + ni=1 ti ) i=1 ki In particular, when ai = 1 for i = 1, . . . , n, (5.23) can be simplified to

an+1 1 n 1n , an+1 ) = . SnID (t|1 1 + i=1 ti  Proof. For convenience, we denote si=r ai by ar→s . Since  ∞ va11 −1 dv1 ˆ J1 = a +a t1 (1 + v1 + v2→n ) 1 2→n+1 (5.15)

=

a 1 −1 (a1 ) t1k1 (k1 + a2→n+1 ) , (a1 + a2→n+1 ) k =0 k1 !(1 + t1 + v2→n )k1 +a2→n+1 1

we have (an+1 )SnID (t|a)   ∞  ∞ n (a1→n+1 ) viai −1 dvi = ··· · J1 , (a1 ) (ai ) tn t2 i=2  a n 1 −1 k1  ∞  ∞  t1 (a2→n+1 + k1 ) viai −1 dvi = · J2 , ··· k ! tn (a2 ) (ai ) t3 i=3 k =0 1 1

(5.23)

(5.24) ¶


where 

v2a2 −1 dv2 a +a +k t2 (1 + v2 + v3→n + t1 ) 2 3→n+1 1 (a2 ) (5.15) = (a2 + a3→n+1 + k1 ) a 2 −1 t2k2 (k2 + a3→n+1 + k1 ) . × k !(1 + t2 + v3→n + t1 )k2 +a3→n+1 +k1 k =0 2

J2 = ˆ



2

End

Repeatedly applying (5.15), we can arrive at (5.23).

5.5 Characteristic function 5.5.1 Univariate case When n = 1, the inverted Dirichlet distribution defined in (5.1) reduces to an inverted beta distribution with parameters a1 and a2 (cf. Definition 1.2). The cf of W ∼ IBeta(a1 , a2 ) is given by ϕW (t) = E(eitW )  ∞ 1 wa1 −1 = eitw dw B(a1 , a2 ) 0 (1 + w)a1 +a2  ∞ 1 (a1 + a2 ) e−w(−it) wa1 −1 (1 + w)(1−a2 )−a1 −1 dw · = (a2 ) (a1 ) 0 (a1 + a2 ) = (a1 ; 1 − a2 ; −it), (a2 ) where 1 (b; c; z) = ˆ (b)





e−zu

0

ub−1 du (1 + u)b−c+1

(5.25)

is called the confluent hypergeometric function of the second kind. Function has been extensively studied in the applied mathematics literature; see Erd´eyli (1953: Chapter 6) for a detailed review. The relationship between (the confluent hypergeometric function of the first kind) and is described in Remark 2.2.

5.5.2 The confluent hypergeometric function of the second kind Let b = (b1 , . . . , bn ) and z = (z1 , . . . , zn ). The confluent hypergeometric function of the second kind with several arguments is defined by 1 n (b; c; z) = n j=1 (bj )



e Rn+

n

−zu

(1 +

bj −1 j=1 uj n 1b−c+1 j=1 uj )

du

(5.26)


for (bj ) > 0 and (zj ) > 0, j = 1, . . . , n. The domain of the function n can be generalized beyond (zj ) > 0, j = 1, . . . , n. In fact, if (c) < 1

(5.27)

then the integral representation (5.26) still holds for zj on the imaginary axis; that is, for (zj ) ≥ 0, j = 1, . . . , n. Phillips (1988) obtained the following result, which is much simpler than (5.26). Lemma 5.2 (Phillips, 1988). If (bj ) > 0 (j = 1, . . . , n) and (c) < 1, then n (b; c; z) =

1 (1 1b − c + 1)







e−x x1 b−c

0

n  (x + zj )−bj dx.

(5.28)

j=1

¶ Remark 5.1 In particular, consider the case when n = 1. Making transformation x/z1 = u, (5.28) becomes  ∞ 1 e−x xb1 −c (x + z1 )−b1 dx (b1 − c + 1) 0  ∞ z1−c 1 = e−z1 u ub1 −c (1 + u)−b1 du (b1 − c + 1) 0 (5.25)

= z1−c 1 (b1 − c + 1; 2 − c; z1 ) = (b1 ; c; z1 ).

The last equality follows from Equation (6) of Erd´eyli (1953).



5.5.3 General case The cf of v = (v1 , . . . , vn ) ∼ IDirichletn+1 (a) on Rn+ is known to be given by a confluent hypergeometric function of the second kind with many arguments (Phillips, 1988). Similar to the univariate case, we have ϕv (t) = E(eit1 v1 +···+itn vn ) n ai −1  1  i=1 vi = eit v n+1 dv n a Bn+1 (a) Rn+ (1 + i=1 vi ) j=1 j   n+1  j=1 aj n (a1 , . . . , an ; 1 − an+1 ; −it), = (an+1 )

(5.29)

where n is defined in (5.26). Note that condition (5.27) is satisfied in the current case since c = 1 − an+1 and aj > 0 for j = 1, . . . , n + 1 in (5.1). From (5.28) and (5.29), we find that the cf of v ∼ IDirichletn+1 (a) has the following one-dimensional


integral representation: 1 ϕv (t) = (an+1 )





e−x x

n+1 j=1

aj −1

0

n  (x − itj )−aj dx.

(5.30)

j=1

5.6 Distribution for linear function of inverted Dirichlet vector 5.6.1 Introduction Let v = (v1 , . . . , vn ) ∼ IDirichletn+1 (a) on Rn+ and c = (c1 , . . . , cn ). To study  the n distribution of the linear function of v, we define a random variable V = c v = k=1 ck vk . The cf of V is 

ϕV (t) = E(eitc v ) = ϕv (tc) (5.30)

=

1 (an+1 )





e−x x

n+1 j=1

0

aj −1

n  (x − itcj )−aj dx.

(5.31)

j=1

On the one hand, the cdf of V can be expressed as the probability of a linear combination of a Dirichlet random vector being less than or equal to v; that is, Pr(V ≤ v) = Pr

n

c i vi ≤ v

i=1

n

i=1 ci xi  = Pr ≤v 1 − ni=1 xi  n  = Pr (ci + v)xi ≤ v ,

(5.9)

(5.32)

i=1

where x = (x1 , . . . , xn ) ∼ Dirichletn+1 (a) on Vn . On the other hand, using Definition 5.2, the cdf of V can be written as Pr(V ≤ v) = Pr

n

c i vi ≤ i=1 n i=1 ci yi

v

≤v yn+1

Y = Pr ≤v , yn+1

(5.8)

= Pr

(5.33)


n n+1 where Y = ˆ i=1 ci yi , yi ∼ Gamma(ai , 1) and {yi }i=1 are mutually independent. Noting that Y is independent of yn+1 , (5.33) then becomes 

 ∞  Y Pr ≤ vyn+1 = z · Gamma(z|an+1 , 1) dz Pr(V ≤ v) = yn+1 0  ∞ zan+1 −1 e−z dz. Pr(Y ≤ vz) · = (an+1 ) 0

Hence, it suffices to obtain the cdf of Y . For example, from (5.35), it is easy to see that Pr(V ≤ v) can be expressed in terms of an infinite series of F densities.

5.6.2 The distribution of the sum of independent gamma variates Let Yi = ci yi . Hence, Y = Y1 + · · · + Yn . When ci > 0, we have Yi ∼ Gamma(ai , ci−1 ) with density Gamma(yi |ai , ci−1 ) =

1 yai −1 e−yi /ci ciai (ai ) i

and {Yi }ni=1 are mutually independent. Moschopoulos (1985) obtained the pdf and cdf of Y as follows: 

∞  n −1  fY (y) = C δk Gamma y ai + k, c1 , (5.34) i=1

k=0



 w ∞  n δk Gamma y ai + k, c1−1 dy, Pr(Y ≤ w) = C k=0

0

(5.35)

i=1

 where C = ni=1 (c1 /ci )ai , c1 = min ci , and the coefficients {δk } satisfy the recurrence relations: δ0 = 1 and ⎧ ⎫

k+1 n 1 ⎨ c1 i ⎬ δk+1 = aj 1 − δk+1−i , k = 0, 1, 2, . . . ⎩ k+1 cj ⎭ i=1

j=1

Alouini et al. (2001) considered the case of correlated gamma random variables. Provost (1988, 1989) provided a formula similar to (5.34), which is given by fY (y) =



θk Gamma(y|α + k, 1)

k=0

for some α and θk . Efthymoglou and Aalo (1995) obtained the following single integral representation for the density of Y :   1 ∞ cos{ nk=1 ak tan−1 (ck t) − yt} n fY (y) = dt (5.36) 2 2 ak /2 π 0 k=1 (1 + t ck )


for y > 0. Other expressions of the pdf/cdf of Y can be found in Mathai and Saxena (1978), Springer (1979), Mathai (1993), Aalo et al. (2005), and Nadarajah (2008).
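Moschopoulos' series (5.34)–(5.35) is simple to evaluate numerically. The R sketch below implements the truncated series for the density of Y = Σ c_i y_i and compares it with a simulated histogram; the truncation level and the parameter values are illustrative choices, not part of the original result.

```r
## Density of Y = sum_i c_i * Gamma(a_i, 1) via the Moschopoulos (1985) series (5.34),
## truncated after kmax terms; delta_k follows the recurrence stated above.
dsumgamma <- function(y, a, cc, kmax = 60) {
  c1  <- min(cc)
  gam <- sapply(1:kmax, function(i) sum(a * (1 - c1 / cc)^i))
  delta <- numeric(kmax + 1); delta[1] <- 1           # delta_0 = 1
  for (k in 0:(kmax - 1))
    delta[k + 2] <- sum(gam[1:(k + 1)] * rev(delta[1:(k + 1)])) / (k + 1)
  C <- prod((c1 / cc)^a)
  C * sapply(y, function(v)
    sum(delta * dgamma(v, shape = sum(a) + 0:kmax, scale = c1)))
}
## Quick check against simulation (illustrative parameters)
a <- c(2, 3, 1.5); cc <- c(1, 0.5, 2)
ys <- replicate(1e5, sum(cc * rgamma(3, a)))
hist(ys, breaks = 100, freq = FALSE)
curve(dsumgamma(x, a, cc), add = TRUE, lwd = 2)
```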

5.6.3 The case of two dimensions When n = 2, Al-Ruzaiza and El-Gohary (2008) consider the distribution of v1 + v2 , which follows an F distribution since v1 + v2 is a ratio of two independent gamma variates. Gupta and Nadarajah (2006) derived the exact distribution of the general form c1 v1 + c2 v2 for two cases: c1 > 0, c2 > 0; and c1 < 0, c2 > 0. These densities involve the Gauss hypergeometric function,2 which is defined by (Press, 1982) 2 F1 (α, β; γ; z)

=

∞ (α)k (β)k k=0

(γ)k

·

zk . k!

(5.37)

Series (5.37) converges for |z| < 1 if γ > 0. It is clear that α and β are exchangeable; that is, 2 F1 (α, β; γ; z)

= 2 F1 (β, α; γ; z).

If α > 0, γ > 1 and |z| < 1, it can be shown that  1 α−1 1 t (1 − t)γ−α−1 dt. 2 F1 (α, β; γ; z) = B(α, γ − α) 0 (1 − tz)β

(5.38)

This expression is known as the Euler integral (Exton, 1976). Finally, the Gauss hypergeometric function 2 F1 (α, β; γ; z) is a solution to the following linear differential equation z(1 − z)

d2 w(z) dw(z) + {γ − (α + β + 1)z} − αβw(z) = 0. dz2 dz

Theorem 5.6 (Gupta and Nadarajah, 2006). Let v = (v1 , v2 ) ∼ IDirichlet (a1 , a2 ; a3 ) on R2+ . Hence, the density of V = c1 v1 + c2 v2 for c1 > 0 and c2 > 0 is given by va1 +a2 −1 c2a1 +a3 (a+ ) · 2 F1 (a1 , a+ ; a1 + a2 ; z), c1a1 (a1 + a2 )(a3 ) (c2 + v)a+  1 −c2 )v for 0 < v < ∞, where a+ = 3i=1 ai and z = (c . c1 (c2 +v) fV (v) =

2

(5.39) ¶

The general form of a hypergeometric function is p Fq (α1 , . . . , αp ;

β1 , . . . , βq ; z) =

∞

(α1 )k ···(αp )k k=0 (β1 )k ···(βq )k

·

zk k! .

In (2.24) let n = 1. Thus, the Lauricella function of the first kind, L1 (α, β; γ; z), is equivalent to In addition, the confluent hypergeometric function of the first kind, (α; β; z) defined in (2.37), reduces to 1 F1 (α, β; z).

2 F1 (α, β; γ; z).


Proof. Let a = (a1 , a2 , a3 ). Using (5.1) and letting v1 = (v/c1 )t, we have 

v − c1 v1  IDirichlet3 v1 , a dv1 c2  0  v/c1 a1 −1 1 v1 {(v − c1 v1 )c2−1 }a2 −1 = dv1 c2 B3 (a) 0 {1 + v1 + (v − c1 v1 )c2−1 }a+  (v/c1 )a1 (v/c2 )a2 −1 1 t a1 −1 (1 − t)a2 −1 = dt a c2 B3 (a) 0 {1 + vt/c1 + v(1 − t)/c2 } +  1 a1 −1 (v/c1 )a1 (v/c2 )a2 −1 t (1 − t)a2 −1 = dt. c2 B3 (a)(1 + v/c2 )a+ 0 (1 − tz)a+

fV (v) =

1 c2



v/c1

End

By using (5.38), we immediately obtain (5.39).

Theorem 5.7 (Gupta and Nadarajah, 2006). If v = (v1 , v2 ) ∼ IDirichlet (a1 , a2 ; a3 ) on R2+ , then the density of V = c1 v1 + c2 v2 for c1 < 0 and c2 > 0 is given by fV (v) =

a3 c2a1 +a3 (a+ ) va1 +a2 −1 × (−c1 )a1 (a1 + a3 + 1)(a2 ) (c2 + v)a+

(c2 − c1 )v × 2 F1 a1 , a+ ; a1 + a3 + 1; 1 + , c1 (c2 + v)

(5.40)

for v > 0, and fV (v) =

a3 (−c1 )a2 +a3 (a+ ) a2 c2 (a2 + a3 + 1)(a1 )

× 2 F1 for v < 0, where a+ =

3 i=1

×

(−v)a1 +a2 −1 (−c1 − v)a+

(c2 − c1 )v a2 , a+ ; a2 + a3 + 1; 1 − c2 (c1 + v)



,

(5.41)

ai .



5.7 Connection with other multivariate distributions 5.7.1 Connection with the multivariate t distribution In this subsection we describe the connection between the inverted Dirichlet distribution and the multivariate t distribution discussed in Cornish (1954), Dunnett and Sobel (1954), and Tiao and Guttman (1965).


Definition 5.3 A random vector x ∈ Rn is said to follow a multivariate t distribution if its density is   ( ν+n ) (x − μ)−1 (x − μ) −(ν+n)/2 2 tn (x|μ, , ν) = ν √ 1+ , ( 2 )( νπ )n ||1/2 ν where μ ∈ Rn is the location parameter vector, n×n > 0 is the dispersion matrix, and ν > 0 is the degree of freedom. We write x ∼ tn (μ, , ν). ¶ The multivariate t provides a heavy-tail alternative to the multinormal while accounting for correlation among the components of x. Some basic properties for the multivariate t are as follows: ν  (if ν > 2). ν−2 (2) The following mixture property gives a conditional sampling algorithm for generating the multivariate t distribution. If

ν ν y ∼ Gamma , and x|y ∼ Nn (μ, y−1 ), 2 2 (1) E(x) = μ (if ν > 1) and Var(x) =

then x ∼ tn (μ, , ν). (3) Equivalently, we have the following stochastic representation: Nn (00, ) d x =μ+  , χ2 (ν)/ν which can serve as the definition of the multivariate t variate. (4) The quadratic form

n ν (x − μ)−1 (x − μ) ∼ IBeta , ν 2 2 and hence has the nF (n, ν)/ν distribution. (5) Let ◦ denote the component-wise operator3 of multiplier. Hence,

ν 1 −1 (x − μ) −1 (x − μ) √ √ 1n ; ◦ ∼ IDirichletn+1 2 2 ν ν

(5.42)

(5.43)

(5.44)

on Rn+ . To prove Properties (4) and (5), we let ξ ∼ Nn (00, In ). From (5.42), we have η= ˆ

ξ −1 (x − μ) d √ = . ν χ2 (ν)

(5.45)

Let a = (a1 , . . . , an ) and b = (b1 , . . . , bn ). Hence, a ◦ b = (a1 b1 , . . . , an bn ) and its ith component is denoted by (a ◦ b)i = ai bi , i = 1, . . . , n.

3


Therefore, ξξ d χ2 (n) = 2 χ2 (ν) χ (ν) d Gamma(n/2, 1/2) = Gamma(ν/2, 1/2) d Gamma(n/2, 1) , = Gamma(ν/2, 1)

ηη = d

which implies Property (4). In addition, from (5.45), we obtain d

(η ◦ η)i =

ξi2 d χ2 (1) = 2 , χ2 (ν) χ (ν)

which implies (5.44).

5.7.2 Connection with the multivariate logistic distribution A multivariate logistic distribution has been described by Johnson and Kotz (1972: 291) and Malik and Abraham (1973). Its joint density is given by

−(n+1)

n n n! 1 + e−xi exp − xi , i=1

x ∈ Rn+ .

(5.46)

i=1

We can verify that x has the exponential stochastic representation:  d



x = (x1 , . . . , xn ) = log





yn+1 yn+1  , . . . , log , y1 yn

(5.47)

iid

where y1 , . . . , yn+1 ∼ Exponential(1). Thus, the multivariate logistic random vector x is related to the inverted Dirichlet vector v = (v1 , . . . , vn ) via the following stochastic representation: x = (− log v1 , . . . , − log vn ), d

(5.48)

where v ∼ IDirichletn+1 (1 1). Inversely, Yassaee (1974) derived the density (5.46) by using the stochastic representation (5.48). Furthermore, Yassaee (1974) obtained the moment-generating function of x having the n-dimensional logistic distribution, when it exists, by using the n-dimensional inverted Dirichlet distribution IDirichletn+1 (11). Utilizing (5.5),


we have 

Mx (t) = E(e t x )  n  = E e− i=1 ti log vi 

n −ti =E vi i=1

= (1 + t1n ) ·

n 

(1 − ti ).

(5.49)

i=1

This formula corrects the typos in the formula of Mx (t) given in Malik and Abraham (1973).

5.7.3 Connection with the multivariate Pareto distribution The univariate Pareto density with parameter a > 0 is a/xa+1 , x ≥ 1. A multivariate Pareto density with parameter a > 0 (Johnson and Kotz, 1972: 286) is given by a(a + 1) · · · (a + n − 1)  , ( ni=1 xi − n + 1)a+n

xi ≥ 1, i = 1, . . . , n.

It is easy to show that

y1 yn ,...,1 + x = (x1 , . . . , xn ) = 1 + yn+1 yn+1  d

 ,

iid

where y1 , . . . , yn ∼ Exponential(1), yn+1 ∼ Gamma(a, 1), and they are independent. Hence, the multivariate Pareto random vector x is related to the inverted Dirichlet random vector v = (v1 , . . . , vn ) via the following simple way: d

x = 1n + v,

(5.50)

1n ; a). where v ∼ IDirichletn+1 (1

5.7.4 Connection with the multivariate Cook–Johnson distribution The n-dimensional Cook–Johnson distribution (Cook and Johnson, 1981; Devroye, 1986: 600–601) has the joint density function n −1/a−1 (a + n) i=1 xi · n −1/a , n (a)a ( i=1 xi − n + 1)a+n

0 < xi ≤ 1, i = 1, . . . , n,

where a > 0 is the parameter. Some basic properties for the multivariate Cook– Johnson distribution are as follows:


(1) Its cumulative distribution function is

−a n −1/a xi −n+1 , 0 < xi ≤ 1,

i = 1, . . . , n.

i=1

 (2) As a → ∞, the cdf converges to ni=1 xi (the independent case), and as a ↓ 0, it converges to min (x1 , . . . , xn ) (the totally dependent case). (3) It has following stochastic representation: 

 y1 −a yn −a  d x = (x1 , . . . , xn ) = 1+ ,..., 1 + , yn+1 yn+1 iid

where y1 , . . . , yn ∼ Exponential(1), yn+1 ∼ Gamma(a, 1), and they are independent. Hence, the multivariate Cook–Johnson random vector x is related to the inverted Dirichlet random vector v = (v1 , . . . , vn ) via the following stochastic representation:   d x = (1 + v1 )−a , . . . , (1 + vn )−a , (5.51) where v ∼ IDirichletn+1 (1 1n ; a).

5.8 Applications 5.8.1 Bayesian analysis of variance in a linear model Consider the linear model y = Xβ + ε,

ε ∼ Nn (00, σ 2 In ),

where yn×1 is the vector of outcomes, X is an n × k design matrix with rank k (k < n), βk×1 is the unknown regression coefficient, and σ 2 is the unknown common variance. (a) The joint distribution of the MLE βˆ and the residual sum of squares Given (β, σ 2 ), the MLE βˆ = (XX)−1 Xy is normally distributed with mean vector β and covariance matrix σ 2 (XX)−1 ; that is,   βˆ ∼ Nk β, σ 2 (XX)−1 . The residual sum of squares is defined by ˆ (y − Xβ) ˆ R = (y − Xβ) = (y − PX y)(y − PX y) = yQX y,


where PX = X(XX)−1 X and QX = In − PX are two projection matrices, and rank (QX ) = n − rank (PX ) = n − k. It is well known that R ∼ χ2 (n − k), σ2



or

1 n−k R ∼ Gamma , . 2 2σ 2

ˆ = 0. Therefore, the joint In addition, βˆ is independent of R since Cov(QX y, β) 2 4 density of βˆ and R for given (β, σ ) is

 (βˆ − β)XX(βˆ − β) 2σ 2 2πσ 2   (1/2σ 2 )(n−k)/2 [(n−k)/2]−1 R R × exp − 2σ 2 ( n−k ) 2   R + (βˆ − β)XX(βˆ − β) −(2n−k) [(n−k)/2]−1 ∝σ R exp − . 2σ 2

ˆ R|β, σ 2 ) = f (β,



1

n



exp



(b) The posterior distribution of the variance The likelihood function for β and σ 2 is equal to the joint density of y; that is, Nn (y|Xβ, σ 2 In ). Following Raiffa and Schlaifer (1961: 324–325), Tiao and Guttman (1965) used the following conjugate distribution as the joint prior density for (β, σ 2 ): p(β, σ 2 ) = p(β|σ 2 ) × p(σ 2 )



 a0 b 0 = Nk (β|β0 , σ 2 C0−1 ) × IGamma σ 2  , 2 2

k   (β − β0 )C0 (β − β0 ) 1 = √ exp − 2σ 2 2πσ 2   ( b0 )a0 /2 b0 × 2 a0 (σ 2 )−[(a0 /2)+1] exp − 2 ( 2 ) 2σ   b0 + (β − β0 )C0 (β − β0 ) −(k+a0 +2) ∝σ exp − , 2σ 2

where a0 > 0, b0 > 0, β0 : k × 1, and C0 : k × k are known. The resulting posterior distribution of σ 2 is

n + a0 η , , σ 2 |y ∼ IGamma 2 2 4

It seems to us that there is a typo in the formula (4.2) of Tiao and Guttman (1965).


where η = b0 + R + (βˆ − β0 )A(βˆ − β0 ),

A= ˆ [C0−1 + (XX)−1 ]−1 .

(c) The unconditional sampling distribution of the statistic η To conduct a ‘pre-posterior’ analysis of the variance σ 2 , we need to find the ‘unˆ conditional’ sampling distribution of the statistic η which is a function of R and β. The joint distribution of R and βˆ is given in Tiao and Guttman (1965) as  ˆ = f (β, ˆ R|β, σ 2 ) × p(β, σ 2 ) dβ dσ 2 p(R, β)

R + (βˆ − β0 )A(βˆ − β0 ) −(n+a0 )/2 ∝ R[(n−k)/2]−1 1 + . b0 Let ξ = (ξ1 , . . . , ξk ) be the canonical variables of βˆ − β0 ; that is, ξ = A1/2 (βˆ − β0 ). Hence, the joint density of R and ξ is    ∂(R, β) ˆ   ˆ × p(R, ξ) = p(R, β)   ∂(R, ξ) 

R + ξ12 + · · · + ξk2 −(n+a0 )/2 [(n−k)/2]−1 ∝R 1+ . b0

(5.52)

If we make the transformation v1 =

R ξ2 ξ2 , v2 = 1 , . . . , vk+1 = k , b0 b0 b0

in (5.52), then the joint density of the random vector v = (v1 , . . . , vk+1 ) is

−(n+a0 )/2 k+1 [(n−k)/2]−1 (1/2)−1 (1/2)−1 v2 · · · vk+1 1+ vi ; p(v) ∝ v1 i=1

that is,



v ∼ IDirichlet

1 a0 n−k 1 , ,..., ; . 2 2 2 2

By using Theorem 5.2(ii), we have5 n a  R + (βˆ − β0 )A(βˆ − β0 ) η 0 , −1= = v1 ∼ IBeta . b0 b0 2 2 There is a typo in the formula on line 5 of Tiao and Guttman (1965: 801). We think (n − r)/2 should be replaced by n/2.

5


5.8.2 Confidence regions for variance ratios in a linear model with random effects Sahai and Anderson (1973) utilized an inverted Dirichlet distribution to calculate the exact confidence coefficient associated with the confidence regions for variance ratios in random-effects models with balanced data. Consider the following linear normal random-effects model: r y = μ1 1n + Xi βi + ε, i=1

where y is an n × 1 observation vector, μ is an unknown scalar, Xi is an n × mi known matrix of full rank, {βi }ri=1 are the random effects, βi ∼ Nmi (00, σi2 Imi ),

ε ∼ Nn (00, σ02 In ),

and {βi }ri=1 and ε are mutually independent. (a) A conservative confidence region for variance ratios Let Si2 and νi be the mean square and corresponding degrees of freedom associated with the random effects βi (i = 1, . . . , r), and let S02 and ν0 be the mean square and degrees of freedom associated with the residual variance. The expected mean squares are given by  E(Si2 ) = σi∗2 = σ02 (1 + rj=1 Kij ρj ), where ρj = σj2 /σ02 are variance ratios and Kij are some constants given in standard textbooks. For example, Hartley (1967) showed how to compute them for any design of a random-effect model. For balanced random models, νi Si2 /σi∗2 (i = 1, . . . , r) and ν0 S02 /σ02 are independent central chi-square random variates with νi and ν0 degrees of freedom respectively. From the definition of the central F distribution, it follows that  2 ∗2  S0 σi Pr ≤ F (αi ; ν0 , νi ) = 1 − αi , i = 1, . . . , r, Si2 σ02 where F (αi ; ν0 , νi ) denotes the upper 100αi % percentage point of the F distribution with ν0 and νi degrees of freedom. Using Kimball’s result (e.g., Miller, 1966: 102)  2 ∗2  S0 σi Pr ≤ F (α ; ν , ν ), i = 1, . . . , r ≥ 1 − α, i 0 i Si2 σ02 Broemeling (1969) obtained a conservative confidence region with confidence coefficient 1 − α for the variance ratios ρ = (ρ1 , . . . , ρr ) as   r Si2 S(y, ρ) = ρ : 1 + Kij ρj ≤ 2 F (αi ; ν0 , νi ), i = 1, . . . , r , (5.53) S0 j=1 where {αi } are chosen so that 1 − α =

r

i=1 (1

− αi ).


(b) The exact confidence level associated with the confidence region (5.53) The exact confidence level associated with the confidence region (5.53) was obtained by Sahai and Anderson (1973) in terms of the upper tail of the probability integral (or the survival function; see Section 5.4.2) of an inverted Dirichlet distribution in the form 

 S02 σi∗2 Pr ≤ F (αi ; ν0 , νi ), i = 1, . . . , r Si2 σ02  2  χ (νi ) ≥ t = Pr , i = 1, . . . , r i χ2 (ν0 ) 

 ∞  ∞  ν1 νr ν0 = ··· IDirichletr+1 v , . . . , ; dv 2 2 2 tr t1 

 ν1 νr ν0 ID  = Sr t1 , . . . , tr  , . . . , ; , 2 2 2

(5.54)

where ti = νi /[ν0 F (αi ; ν0 , νi )]. Example 5.1 (A three-stage nested model ). Consider a three-stage nested model (Broemeling, 1969): yijk = μ + βi + γij + εijk ,

1 ≤ i ≤ p, 1 ≤ j ≤ q, 1 ≤ k ≤ s,

iid

iid

iid

(¯yi·· − y¯ ··· )2 /ν1 ,

ν1 = p − 1,

(5.55)

where μ is a scalar, {βi } ∼ N(0, σ12 ), {γij } ∼ N(0, σ22 ), {εijk } ∼ N(0, σ02 ), and they are mutually independent. From the analysis of variance, the three mean squares S12 = qs

p

i=1 q p

S22 = s



(¯yij· − y¯ i·· )2 /ν2 ,

ν2 = p(q − 1),

i=1 j=1

and S02

=

q p s

(yijk − y¯ ij· )2 /ν0 ,

ν0 = pq(s − 1)

i=1 j=1 k=1

have corresponding expectations σ1∗2 = σ02 + qsσ12 + sσ22 = σ02 (1 + qsρ1 + sρ2 ), σ2∗2 = σ02 + sσ22 = σ02 (1 + sρ2 ),


and σ02 . The conservative confidence region (5.53) with confidence coefficient 1 − α becomes S(y, ρ1 , ρ2 ) = {(ρ1 , ρ2 ): 1 + qsρ1 + sρ2 ≤ c1 , 1 + sρ2 ≤ c2 },

(5.56)

where 1 − α = (1 − α1 )(1 − α2 ) and c =

S2 S2 , F (α ; ν , ν ) =  0  S02 S02 F (1 − α ; ν , ν0 )

 = 1, 2.

(5.57)

Simultaneous CIs for ρ1 and ρ2 are obtained by projecting S(y, ρ1 , ρ2 ) onto the coordinate axes of the ρ1 ρ2 -plane and are 0 ≤ ρ1 ≤

c1 − 1 qs

and 0 ≤ ρ2 ≤

min(c1 − 1, c2 − 1) . s

(5.58)

The exact confidence level (5.54) associated with the confidence region (5.56) reduces to S2ID (t1 , t2 |ν1 /2, ν2 /2; ν0 /2). For illustration purposes, let us consider a numerical example from Graybill (1961: 371–372) where p = 12, q = 5, s = 3, S12 = 3.5629, ν1 = 11, S22 = 1.2055, ν2 = 48, S02 = 0.6113, and ν0 = 120. Choose α1 = α2 = 0.01 so that α = 0.9801. Since F (0.99; 11, 120) = 0.270 971 and F (0.99; 48, 120) = 0.548 95,6 from (5.57) we have c1 = 21.5093 and c2 = 3.592 36. The CIs for ρ1 and ρ2 with confidence level 98.01% are 0 ≤ ρ1 ≤ 1.367 29 and 0 ≤ ρ2 ≤ 0.864 12. Note that the estimated variance ratios from the analysis of variance are ρˆ 1 = 0.25 and ρˆ 2 = 0.32.  Example 5.2 (A two-way classification model without interaction ). Consider the following model (Broemeling, 1969): yij = μ + βi + γj + εij , iid
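The numbers in this example are easy to reproduce with R's qf(), as in the footnote below; a short sketch computing (5.57) and the simultaneous CIs (5.58) for Graybill's data:

```r
## Example 5.1: conservative region (5.56) and simultaneous CIs (5.58)
p <- 12; q <- 5; s <- 3
S1sq <- 3.5629; S2sq <- 1.2055; S0sq <- 0.6113
nu1 <- 11; nu2 <- 48; nu0 <- 120
alpha1 <- alpha2 <- 0.01                    # (1 - alpha1)(1 - alpha2) = 0.9801
c1 <- S1sq / (S0sq * qf(alpha1, nu1, nu0))  # F(0.99; 11, 120) = qf(0.01, 11, 120)
c2 <- S2sq / (S0sq * qf(alpha2, nu2, nu0))
c(c1, c2)                                   # 21.5093 and 3.59236
c(rho1.upper = (c1 - 1) / (q * s),          # 1.36729
  rho2.upper = min(c1 - 1, c2 - 1) / s)     # 0.86412
```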

1 ≤ i ≤ p, 1 ≤ j ≤ q, iid

(5.59)

iid

where μ is a scalar, {βi } ∼ N(0, σ12 ), {γj } ∼ N(0, σ22 ), {εij } ∼ N(0, σ02 ), and they are mutually independent. The three mean squares S12 = q

p

(¯yi· − y¯ ·· )2 /ν1 ,

ν1 = p − 1,

i=1 q

S22 = p



(¯y·j − y¯ ·· )2 /ν2 ,

ν2 = q − 1,

j=1

We employ the built-in R function qf(1 − α, df1, df2) to exactly calculate the upper 100α% quantile of the F distribution F (df1, df2) instead of two approximate values (i.e., F (0.99; 11, 120) ≈ 0.24 and F (0.99; 48, 120) ≈ 0.43) used by Broemeling (1969).

6


and S02

=

q p

(yij − y¯ i· − y¯ ·j + y¯ ·· )2 /ν0 ,

ν0 = (p − 1)(q − 1)

i=1 j=1

have the corresponding expectations σ1∗2 = σ02 + qσ12 = σ02 (1 + qρ1 ), σ2∗2 = σ02 + pσ22 = σ02 (1 + pρ2 ), and σ02 . The conservative confidence region (5.53) with confidence coefficient 1 − α becomes S(y, ρ1 , ρ2 ) = {(ρ1 , ρ2 ): 1 + qρ1 ≤ d1 , 1 + pρ2 ≤ d2 },

(5.60)

where 1 − α = (1 − α1 )(1 − α2 ) and d =

S2 S2 , F (α ; ν0 , ν ) = 2 2 S0 S0 F (1 − α ; ν , ν0 )

 = 1, 2.

Simultaneous CIs for ρ1 and ρ2 generated by this region are given by 0 ≤ ρ1 ≤

d1 − 1 q

and 0 ≤ ρ2 ≤

d2 − 1 . p

The exact confidence level (5.54) associated with the confidence region (5.60) reduces to S2ID (t1 , t2 |ν1 /2, ν2 /2; ν0 /2). 

6

Dirichlet–multinomial distribution This chapter introduces the Dirichlet–multinomial distribution, which is a compound distribution of a multinomial distribution and a Dirichlet distribution. In Section 6.1 we first introduce the probability mass density function through a mixture representation. We also present several distributional properties for the beta–binomial distribution, which is a special case of the Dirichlet–multinomial distribution. Moments of the Dirichlet–multinomial distribution are considered in Section 6.2. Marginal distributions, conditional distributions, and multiple regression are discussed in Section 6.3. A conditional sampling method is provided in Section 6.4. In Section 6.5 we investigate the method of moments estimation of the parameters of the Dirichlet–multinomial distribution. The maximum likelihood estimation via the Newton–Raphson algorithm, the Fisher scoring algorithm, and the EM gradient algorithm for the Dirichlet–multinomial parameters is studied in Section 6.6. Some applications are given in Section 6.7. The methods for testing the multinomial assumption against the Dirichlet–multinomial alternative are discussed in Section 6.8.

6.1 Probability mass function

6.1.1 Motivation

Multi-category frequencies in many biological investigations sometimes show more variation than the multinomial distribution can accommodate. One possibility for this kind of extra variation is that the multinomial probabilities p1 , . . . , pn do not remain constant across the trials. The parameter vector p = (p1 , . . . , pn ) can be


treated as a random vector having a certain distribution in the n-dimensional closed simplex Tn . It is noteworthy that the Dirichlet distribution is defined in the simplex Tn and it is a natural conjugate prior to the multinomial distribution. If the Dirichlet distribution is adopted as the distribution of p, the resulting mixture distribution is the Dirichlet–multinomial distribution or the compound multinomial distribution (Mosimann, 1962).

6.1.2 Definition via a mixture representation Let N be a fixed positive integer and x = (x1 , . . . , xn ) be a random vector of nonnegative integers such that n 

xi = N.

i=1

According to (2.1) and (2.2), the above assumptions imply that x ∈ Tn (N) or equivalently x−n = (x1 , . . . , xn−1 ) ∈ Vn−1 (N). Statistically, if x|p ∼ Multinomialn (N, p)

(6.1)

p = (p1 , . . . , pn ) ∼ Dirichletn (a) on Tn ,

(6.2)

and

then the marginal distribution of x is given by  f (x, p) dp f (x) = Tn  = f (x|p)f (p) dp Tn  n n    x N pai −1 i dp pi × i=1 i = Bn (a) x1 , . . . , xn i=1 Tn   N Bn (x + a) = . Bn (a) x

(6.3)

Naturally, we have the following definition. Definition 6.1 An n-dimensional random vector x = (x1 , . . . , xn ) ∈ Tn (N) is said to have a Dirichlet–multinomial distribution if the pmf of x−n = (x1 , . . . , xn−1 ) is   N Bn (x + a) , x−n ∈ Vn−1 (N), (6.4) DMultinomialn (x−n |N, a) = Bn (a) x where a = (a1 , . . . , an ) is a positive parameter vector. We will write x ∼ DMultinomialn (N, a) on Tn (N) or x−n ∼ DMultinomial(N, a1 , . . . , an−1 ; an ) on Vn−1 (N) accordingly. ¶


From (6.4), we naturally have the following identity:    N Bn (x + a) = Bn (a). x x∈T (N)


(6.5)

n

From the mixture representation (6.3), we know that (6.2) and (6.1) can be employed to generate the Dirichlet–multinomial distribution. As a by-product of this mixture representation, the Dirichlet–multinomial distribution is a robust alternative to the multinomial distribution in the sense of (6.18).

6.1.3 Beta–binomial distribution When n = 2, the Dirichlet–multinomial distribution reduces to the beta–binomial distribution (Ishii and Hayakawa, 1960) with pmf   N B(x + a1 , N − x + a2 ) BBinomial(x|N, a1 , a2 ) = , (6.6) B(a1 , a2 ) x where x = 0, 1, . . . , N, and we write X ∼ BBinomial(N, a1 , a2 ). Similar to (6.5), we have the identity N  x=0



N x



B(x + a1 , N − x + a2 ) = B(a1 , a2 ).

(6.7)

When both a1 and a2 are positive integers, (6.6) can be rewritten as     x + a1 − 1 N − x + a2 − 1 N + a 1 + a2 − 1 , x N −x N which is a negative hypergeometric distribution (Johnson and Kotz, 1969: 309). (a) Moments For the beta–binomial distribution, the generalized factorial moment is E =



J (X − j + 1)

 x N 

j=1

x(x − 1) · · · (x − J + 1) Pr(X = x)

B(x + a1 , N − x + a2 ) N! · (let y = x − J) (x − J)!(N − x)! B(a1 , a2 ) x=J   N−J  N − J B(y + J + a1 , N − J − y + a2 ) N! = (N − J)! y=0 B(a1 , a2 ) y

=

202

DIRICHLET AND RELATED DISTRIBUTIONS

N! (N − J)! N! = (N − J)!

(6.7)

=

B(J + a1 , a2 ) B(a1 , a2 ) (J + a1 ) (a1 + a2 ) · · . (a1 ) (J + a1 + a2 ) ·

(6.8)

In particular, letting J = 1 and J = 2 in (6.8), we have Na1 and a1 + a 2 Var(X) = E{X(X − 1)} + E(X) − {E(X)}2 N(N + a1 + a2 )a1 a2 = . (a1 + a2 )2 (1 + a1 + a2 ) E(X) =

(6.9)

(6.10)

(b) Generation Since X is a discrete random variable assuming values over {0, 1, . . . , N}, the built-in S-plus function sample((0, 1, . . . , N), m, prob = (π0 , π1 , . . . , πN ), replace = F)

(6.11)

can be used to produce i.i.d. samples from this distribution; that is, a vector of length m randomly drawn with replacement from {0, 1, . . . , N} with corresponding probabilities {π0 , π1 , . . . , πN }, where πx = BBinomial(x|N, a1 , a2 ) for x = 0, 1, . . . , N.

Var(X) =

> = = =

N(N + a1 + a2 )a1 a2 (a1 + a2 )2 (1 + a1 + a2 ) N 2 a1 a2 (a1 + a2 )2 (1 + a1 + a2 ) N 2 Var(p) Var(Np) Var{E(X|p)}.

DIRICHLET–MULTINOMIAL DISTRIBUTION

203

(d) The convolution property Assume that Xk ∼ BBinomial(Nk , a, b) for k = 1, . . . , K, and they are mutually independent. These assumptions imply that ind

Xk |p ∼ Binomial(Nk , p)

and

p ∼ Beta(a, b).

Owing to the convolution property1 of the binomial distribution, we have K k=1

ind Xk |p ∼ Binomial( K k=1 Nk , p)

and

p ∼ Beta(a, b).

By definition, we obtain K 

Xk ∼ BBinomial

k=1

 K 



Nk , a, b .

k=1

6.2 Moments of the distribution Let X and Y be two random variables. Using the identities E(X) = E{E(X|Y )} and Var(X) = E{Var(X|Y )} + Var{E(X|Y )},

(6.12)

we can find the moments of the Dirichlet–multinomial distribution by means of (6.1) and (6.2). For the multinomial distribution in (6.1), the moments are given by ⎧ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎨

E(xi |pi ) = Npi , Var(xi |pi ) = Npi (1 − pi ),

⎪ Cov(xi , xj |pi , pj ) = −Npi pj , ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪  ⎪ ⎪ pi pj ⎪ ⎪ Corr(x , , x |p , p ) = − i j i j ⎩ (1 − pi )(1 − pj ) 1

i = 1, . . . , n, i = 1, . . . , n, i= / j;

i, j = 1, . . . , n,

i= / j;

i, j = 1, . . . , n.

The convolution property indicates that sums of independent random variables having this particular distribution come from the same distribution family.

204

DIRICHLET AND RELATED DISTRIBUTIONS

In matrix form, we have ⎧ ⎪ ⎨ E(x|p) = Np ⎪ ⎩

and (6.13) 

Var(x|p) = N{diag(p) − pp }.

For the Dirichlet distribution in (6.2), the corresponding moments are given by (2.6). In matrix form, we have ⎧ E(p) = b and ⎪ ⎪ ⎪ ⎨ (6.14) ⎪ 1 ⎪  ⎪ {diag(b) − bb }, ⎩ Var(p) = 1 + a+ where a+ = ni=1 ai and b = a/a+ ∈ Tn .2 For the Dirichlet–multinomial distribution, we then obtain E(xi ) = E{E(xi |pi )} = NE(pi ) (6.14)

= Var(xi ) = = = =

(2.6)

=

= Cov(xi , xj ) = =

Nbi , i = 1, . . . , n, (6.15) E{Var(xi |pi )} + Var{E(xi |pi )} E{Npi (1 − pi )} + Var(Npi ) NE(pi ) − N[Var(pi ) + {E(pi )}2 ] + N 2 Var(pi ) NE(pi ){1 − E(pi )} + N(N − 1)Var(pi ) ai (a+ − ai ) ai (a+ − ai ) N + N(N − 1) 2 2 a+ a+ (1 + a+ ) N(N + a+ ) bi (1 − bi ), i = 1, . . . , n, (6.16) 1 + a+ E{E(xi xj |pi , pj )} − E(xi )E(xj ) E{Cov(xi , xj |pi , pj ) + E(xi |pi )E(xj |pj )} − E(xi )E(xj )

= E(−Npi pj + N 2 pi pj ) − E(xi )E(xj ) = N(N − 1)E(pi pj ) − E(xi )E(xj ) = N(N − 1){Cov(pi , pj ) + E(pi )E(pj )} − E(xi )E(xj )

ai aj N 2 ai aj −ai aj (2.6) + 2 = N(N − 1) 2 − 2 a+ (1 + a+ ) a+ a+ N(N + a+ ) bi bj , i = / j; i, j = 1, . . . , n, (6.17) = − 1 + a+ 2

The parameter vector a in the Dirichlet distribution (6.2) can be alternatively represented by the precision parameter a+ and the mean parameter vector b (Minka, 2003).

DIRICHLET–MULTINOMIAL DISTRIBUTION

and



Corr(xi , xj ) = −

bi bj , (1 − bi )(1 − bj )

i= / j;

205

i, j = 1, . . . , n,

indicating that the correlations of variates in the Dirichlet–multinomial distribution are unchanged from those in the multinomial distribution, in the sense that if b = E(p) is replaced by p (Mosimann, 1962: 68). In matrix form, we have E(x) = Nb and N + a+ Var(x) = · N{diag(b) − bb } 1 + a+ N + a+ = M 1 + a+ = ˆ c M > M ,

(6.18)

where M = N{diag(b) − bb }, which equals the conditional covariance matrix of the multinomial distribution when p = b; that is, M = Var(x|p = b) (cf. (6.13) and (6.14)). On the other hand, Var(x) = N(N + a+ )Var(p). Hence, the correlations of variates in the Dirichlet–multinomial distribution are identical to those in the Dirichlet distribution.

6.3 Marginal and conditional distributions Let x = (x1 , . . . , xn ) ∼ DMultinomialn (N, a) on Tn (N). We partition x and a in the same fashion:     x(1) s a(1) and a = x= . x(2) n−s a(2) The 1 -norm of x(1) is defined by x(1) 1 = si=1 xi = N − x(2) 1 .

6.3.1 Marginal distributions We first consider the distribution of the random subvector x(1) for any 1 ≤ s < n. It is not difficult to verify that       N N N − x(1) 1 = · , (6.19) x x(1) , N − x(1) 1 x(2) and Bn (a) = Bs+1 (a(1) , a(2) 1 ) · Bn−s (a(2) ).

(6.20)

206

DIRICHLET AND RELATED DISTRIBUTIONS

Replacing a in (6.20) by x + a, we obtain Bn (x + a) = Bs+1 (x(1) + a(1) , N − x(1) 1 + a(2) 1 ) × Bn−s (x(2) + a(2) ).

(6.21)

By combining (6.19) with (6.20) and (6.21), we have f (x(1) ) =

 s

x(2) ∈Tn−s (N−







i=1

f (x(1) , x(2) ) xi )

Bn (x + a) Bn (a) x(2)   N Bs+1 (x(1) + a(1) , N − x(1) 1 + a(2) 1 ) = (1) (1) Bs+1 (a(1) , a(2) 1 ) x , N − x 1    N − x(1) 1 Bn−s (x(2) + a(2) ) × Bn−s (a(2) ) x(2) x(2)   N Bs+1 (x(1) + a(1) , N − x(1) 1 + a(2) 1 ) (6.5) = ; Bs+1 (a(1) , a(2) 1 ) x(1) , N − x(1) 1

(6.4)

=

N x

that is, x(1) ∼ DMultinomial(N, a(1) ; a(2) 1 ) on Vs (N).

(6.22)

In particular, letting s = 1, we obtain   n  x1 ∼ BBinomial N, a1 , ai .

(6.23)

i=2

By symmetry, we have xi ∼ BBinomial(N, ai , a+ − ai ),

i = 1, . . . , n.

6.3.2 Conditional distributions Next, we consider the conditional distribution of the random subvector x(2) given x(1) . Direct calculation yields f (x(1) , x(2) ) f (x(1) )   N − x(1) 1 Bn−s (x(2) + a(2) ) , = Bn−s (a(2) ) x(2)

f (x(2) |x(1) ) =

DIRICHLET–MULTINOMIAL DISTRIBUTION

207

which implies s

x(2) |x(1) ∼ DMultinomialn−s (N − x(1) 1 , a(2) )

(6.24)

on Tn−s (N − i=1 xi ). We notice from (6.24) that the conditional distribution of x(2) |x(1) depends on x(1) only through its 1 -norm x(1) 1 . We summarize these observations into the following theorem. Theorem 6.1

If x ∼ DMultinomialn (N, a) on Tn (N), then:

(i) For any s < n, the subvector (x1 , . . . , xs ) has a Dirichlet–multinomial distribution with parameters (N, a1 , . . . , as ; ni=s+1 ai ). In particular, xi ∼ BBinomial(N, ai , a+ − ai ).  (ii) The conditional distribution of (xs+1 , . . . , xn ) given (x1 , . . . , x has s) a Dirichlet–multinomial distribution with parameters (N − si=1 xi , as+1 , . . . , an ). ¶

6.3.3 Multiple regression Theorem 6.1 can be extended in the following ways. Let 1 ≤ r1 < · · · < rs ≤ n. The joint distribution of the subset xr1 , . . . , xrs , (and N − si=1 xri ) is also of the form (6.4) with parameters N, ar1 , . . . , ars , a+ − si=1 ari . The conditional distribution of the remaining xj , given xr1 , . . . , xrs is of form (6.4) with parameters N − si=1 xri , {aj , j = / r1 , . . . , rs }. Therefore, the multiple regression of xj (j = / r1 , . . . , rs ) on xr1 , . . . , xrs is    −1 s s   xri aj a+ − ari . E(xj |xr1 , . . . , xrs ) = N − i=1

i=1

6.4 Conditional sampling method Suppose that the joint density of a random vector (x1 , . . . , xn−1 ) has the following factorization: f (x1 , . . . , xn−1 ) = f1 (x1 )

n−1 

fi (xi |x1 , . . . , xi−1 ).

(6.25)

i=2

To generate the random vector, the conditional sampling method states that we only need to first generate x1 from the marginal density f1 (x1 ), and then generate xi sequentially from the conditional density fi (xi |x1 , . . . , xi−1 ) for i = 2, . . . , n − 1. On the right-hand side of (6.19), if we let s = 1, 2, . . . , n − 2 sequentially, the multinomial coefficient can be expressed as a product of n − 1 binomial coefficients:         N N N − x1 N − x 1 − x2 N − n−2 j=1 xj = ··· . x1 , . . . , xn x1 x2 x3 xn−1

208

DIRICHLET AND RELATED DISTRIBUTIONS

Moreover, on the right-hand side of (6.20), if we let s = 1, 2, . . . , n − 2 sequentially, the multivariate beta function can be represented in terms of the following product of beta functions: 

    n n  Bn (a) = B a1 , ai B a2 , ai · · · B(an−1 , an ). i=2

i=3

For (6.21), we have similar results. Therefore, the joint density (6.4) has a factorization of the form (6.25), where x1 follows a beta–binomial distribution with parameters specified by (6.23), and   i−1 n   xi |(x1 , . . . , xi−1 ) ∼ BBinomial N − xj , ai , aj , j=1

(6.26)

j=i+1

for i = 2, . . . , n − 1. In this way, the Dirichlet–multinomial distribution can be generated sequentially by a sequence of beta–binomial variates. If x ∼ DMultinomialn (N, a) on Tn (N), then the conditional sampling method for generating x is as follows: • • •

Draw x1 ∼ BBinomial(N, a1 , nj=2 aj ) using (6.11). n Draw xi ∼ BBinomial(N − i−1 j=i+1 aj ), 2 ≤ i ≤ n − 1. j=1 xj , ai , n−1 Set xn = N − i=1 xi .

6.5 The method of moments estimation 6.5.1 Observations and notations Assume that we have observations of size m from a Dirichlet–multinomial population, denoted by Yobs = {xi }m i=1 , where iid

xi = (xi1 , . . . , xin ) ∼ DMultinomialn (N, a)

(6.27)

on Tn (N). In this section and next, we shall discuss the estimation of the Dirichlet– multinomial parameters a = (a1 , . . . , an ) or equivalently a+ = nj=1 aj and b = (b1 , . . . , bn ) = a/a+ ∈ Tn . The sample matrix is defined by ⎛

Xm×n

⎞ x1 ⎜ . ⎟ ⎟ =⎜ ⎝ .. ⎠ = (x(1) , . . . , x(n) ) = (xij ), xm

(6.28)

DIRICHLET–MULTINOMIAL DISTRIBUTION

209

for i = 1, . . . , m and j = 1, . . . , n so that the corresponding sample mean vector and sample covariance matrix are given by ⎛

x¯ =

1 m ⎛

m 

xi =

i=1

m

⎜ 1  X 1m = ⎜ ⎝ m ⎛

.. .

⎞ ⎟ ⎟ ⎠

1  x 1 m (n) m

⎞ x¯ 1 ⎜ ⎟ ⎜ . ⎟ .. ⎜ ⎟ ⎜ . ⎟ . ⎜ ⎟ ⎜ . ⎟ ⎜ 1 m ⎟ ⎜ ⎟ ⎜ ⎜ ⎟ = ⎜ m i=1 xij ⎟ ⎟ = ⎜ x¯ j ⎟ ⎜ ⎜ ⎟ ⎟ .. ⎜ ⎟ ⎜ .. ⎟ . ⎝ ⎠ ⎝ . ⎠ 1 m x¯ n i=1 xin m 1 m

i=1

xi1



1  x 1 m (1) m

(6.29)

and 1  (xi − x¯ )(xi − x¯ ) m − 1 i=1 m

Sn×n = =

(6.30)

1 XQ1m X = (sjj ), m−1

where ˆ Im − P1m = Im − Q1m =

1m 1m m

is a projection matrix and 1  (xij − x¯ j )(xij − x¯ j ) m − 1 i=1 m

sjj = =

1 x Q1 x(j ) , m − 1 (j) m

(6.31)

j, j  = 1, . . . , n.

6.5.2 The traditional moments method Note that xi ∈ Tn (N), and hence that the sum of the elements in each row of the sample matrix Xm×n defined in (6.28) is N; that is, xi 1n = N. Therefore, we have n ¯ j = N, implying that the number of free sample means is n − 1. Using each j=1 x sample mean, x¯ j (j = 1, . . . , n − 1), to estimate each population mean and each element in the sample covariance matrix to estimate each element in the population covariance matrix from (6.15), (6.16), and (6.17), we can theoretically obtain the


moment estimators of (a_+, b) satisfying the following system of equations:
$$\begin{cases} \bar{x}_j = Nb_j, & j=1,\dots,n-1,\\[4pt] s_{jj} = \dfrac{N(N+a_+)}{1+a_+}\,b_j(1-b_j), & j=1,\dots,n-1, \ \text{and}\\[4pt] s_{jj'} = -\dfrac{N(N+a_+)}{1+a_+}\,b_jb_{j'}, & j\neq j',\ j,j'=1,\dots,n-1. \end{cases} \qquad (6.32)$$
The moments method only works when the number of moment conditions is equal to the number of parameters to be estimated. It is noted that the number of equations in (6.32) is (n − 1)(n + 2)/2, which is larger than n (the number of parameters) if n − 1 ≥ 2. Therefore, the system of equations specified by (6.32) is algebraically overidentified and has no exact solution.

6.5.3 Mosimann’s moments method (a) The moment estimators of parameters Mosimann (1962) rewrote the first n − 1 equations in (6.32) as aj =

x¯ j a+ , N

j = 1, . . . , n − 1.

(6.33)

Using (6.18), Mosimann (1962) suggested two estimates of the parameter³
$$c = \frac{N+a_+}{1+a_+} \qquad (6.34)$$
by setting
$$S = c\,W_M, \qquad (6.35)$$

³ Instead of using the precision parameter a_+ or the parameter c, Neerchal and Morel (2005) defined
$$\rho^2 = \frac{1}{1+a_+}, \quad a_+>0, \qquad = \frac{c-1}{N-1}, \quad c>1,$$
and called ρ the over-dispersion parameter, where 0 < ρ < 1.


where S is specified in (6.30) and W_M = (w_{jj'}) are consistent estimates of Var(x) and of the corresponding multinomial covariance matrix respectively. For example, possible elements of W_M are
$$w_{jj} = \frac{\bar{x}_j(N-\bar{x}_j)}{N}, \quad j=1,\dots,n, \qquad\text{and}\qquad w_{jj'} = \frac{-\bar{x}_j\bar{x}_{j'}}{N}, \quad j\neq j',\ j,j'=1,\dots,n.$$

One way of estimating c is to take the trace of the matrices on both sides of (6.35), yielding the first estimator
$$\hat{c}_1 = \frac{\mathrm{tr}(S)}{\mathrm{tr}(W_M)} = \frac{\sum_{j=1}^{n}s_{jj}}{\sum_{j=1}^{n}w_{jj}}. \qquad (6.36)$$
Alternatively, c could be estimated by
$$\hat{c}_2 = \left(\frac{|\tilde{S}|}{|\tilde{W}_M|}\right)^{1/(n-1)}, \qquad (6.37)$$
where S̃ and W̃_M denote the minor of order n − 1 for S and W_M respectively. Setting
$$\hat{c} = \frac{N+a_+}{1+a_+}$$
will result in the moment estimator of a_+ as
$$\hat{a}_+ = \frac{N-\hat{c}}{\hat{c}-1}.$$
Thus, the moment estimators of a are given by
$$\hat{a}_j = \frac{\bar{x}_j}{N}\,\hat{a}_+ = \frac{\bar{x}_j}{N}\cdot\frac{N-\hat{c}}{\hat{c}-1}, \qquad j=1,\dots,n. \qquad (6.38)$$

(b) The univariate case

In the univariate case (n − 1 = 1), the estimate of c in (6.37) gives
$$\hat{c}_2 = \frac{s_{11}}{w_{11}} = \frac{\dfrac{N}{m-1}\displaystyle\sum_{i=1}^{m}(x_{i1}-\bar{x}_1)^2}{\bar{x}_1(N-\bar{x}_1)}. \qquad (6.39)$$
If s_{11} in (6.39) is replaced by
$$s_1^2 = \frac{1}{m}\sum_{i=1}^{m}(x_{i1}-\bar{x}_1)^2,$$
then using (6.39) as an estimate of c and substituting in (6.38), we algebraically obtain formulae identical to those obtained by Skellam (1948) for the beta–binomial distribution using the method of moments.


(c) Other estimates of c

It is noted that c defined by (6.34) is a one-to-one mapping of a_+. The second equation in (6.32) shows that a single sample variance s_{jj} can be used to estimate a_+. From
$$s_{jj} = \frac{N(N+a_+)}{1+a_+}\cdot\frac{\bar{x}_j}{N}\left(1-\frac{\bar{x}_j}{N}\right) = c\cdot w_{jj},$$
we have
$$\hat{c}(j) = \frac{s_{jj}}{w_{jj}} \qquad (6.40)$$
or
$$\hat{a}_+(j) = \frac{N\bar{x}_j(N-\bar{x}_j) - Ns_{jj}}{Ns_{jj} - \bar{x}_j(N-\bar{x}_j)},$$
for any fixed j (j = 1, ..., n − 1). An estimate of c or a_+ from only one sample variance is probably less desirable than an estimate using all sample variances and covariances. Next, based on (6.40), the third estimator of c is
$$\hat{c}_3 = \left(\frac{\prod_{j=1}^{n-1}s_{jj}}{\prod_{j=1}^{n-1}w_{jj}}\right)^{1/(n-1)}. \qquad (6.41)$$
In addition, it is still an open question whether the constraint ĉ > 1 is satisfied for the various ĉ. Finally, the relative efficiency of these moment estimators needs further investigation.
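As a compact illustration, the following R sketch computes ĉ_1 from (6.36) and the moment estimates (6.38); it assumes that every row of the count matrix X sums to the same N, and the function name mosimann_moments is our own.

```r
mosimann_moments <- function(X) {
  m <- nrow(X); N <- sum(X[1, ])
  xbar <- colMeans(X)
  S  <- cov(X)                                # sample covariance (6.30), divisor m - 1
  WM <- diag(xbar) - tcrossprod(xbar) / N     # elements w_jj and w_jj' given above
  c1 <- sum(diag(S)) / sum(diag(WM))          # first estimator of c, (6.36)
  a_plus <- (N - c1) / (c1 - 1)               # moment estimate of a_+
  list(c1 = c1, a_plus = a_plus, a = xbar / N * a_plus)   # moment estimates of a, (6.38)
}
```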

6.6 The method of maximum likelihood estimation

Let Y_obs = {x_i}_{i=1}^m denote the independent observed data, where
$$x_i = (x_{i1},\dots,x_{in})^{\top} \overset{\mathrm{ind}}{\sim} \mathrm{DMultinomial}_n(N_i, a) \qquad (6.42)$$
on T_n(N_i), and these N_i are not necessarily identical. The objective of this section is to estimate the Dirichlet–multinomial parameters a = (a_1, ..., a_n)^⊤ by the maximum likelihood method.

6.6.1 The Newton–Raphson algorithm

(a) Formulation

In general, Newton's method provides the fastest means for finding the MLE. To conduct Newton's method, one needs the log-likelihood function, the score vector,


and the observed information matrix. The log-likelihood function of a for the observed data Y_obs is
$$\ell(a|Y_{\mathrm{obs}}) = m\left\{\log\Gamma(a_+) - \sum_{j=1}^{n}\log\Gamma(a_j)\right\} + \sum_{i=1}^{m}\sum_{j=1}^{n}\log\Gamma(x_{ij}+a_j) - \sum_{i=1}^{m}\log\Gamma(N_i+a_+). \qquad (6.43)$$
It can be verified that the first-order and second-order derivatives are
$$\frac{\partial\ell(a|Y_{\mathrm{obs}})}{\partial a_j} = m\{\psi(a_+)-\psi(a_j)\} + \sum_{i=1}^{m}\{\psi(x_{ij}+a_j)-\psi(N_i+a_+)\}, \qquad 1\le j\le n,$$
$$\frac{\partial^2\ell(a|Y_{\mathrm{obs}})}{\partial a_j^2} = m\{\psi'(a_+)-\psi'(a_j)\} + \sum_{i=1}^{m}\{\psi'(x_{ij}+a_j)-\psi'(N_i+a_+)\}, \qquad 1\le j\le n, \quad\text{and}$$
$$\frac{\partial^2\ell(a|Y_{\mathrm{obs}})}{\partial a_j\,\partial a_{j'}} = m\,\psi'(a_+) - \sum_{i=1}^{m}\psi'(N_i+a_+), \qquad j\neq j',$$

where ψ(·) and ψ'(·) denote the digamma (Hille, 1959) and trigamma functions defined by (2.68). The score vector and the observed information matrix are given by
$$\nabla\ell(a|Y_{\mathrm{obs}}) = \left\{m\,\psi(a_+) - \sum_{i=1}^{m}\psi(N_i+a_+)\right\}\mathbf{1}_n - \begin{pmatrix} m\,\psi(a_1) - \sum_{i=1}^{m}\psi(x_{i1}+a_1)\\ \vdots\\ m\,\psi(a_n) - \sum_{i=1}^{m}\psi(x_{in}+a_n) \end{pmatrix} \qquad (6.44)$$
and
$$-\nabla^2\ell(a|Y_{\mathrm{obs}}) = \Omega - \omega\,\mathbf{1}_n\mathbf{1}_n^{\top} \qquad (6.45)$$
respectively, where Ω = diag(ω_1, ..., ω_n),
$$\omega_j = m\,\psi'(a_j) - \sum_{i=1}^{m}\psi'(x_{ij}+a_j), \qquad j=1,\dots,n, \quad\text{and} \qquad (6.46)$$
$$\omega = m\,\psi'(a_+) - \sum_{i=1}^{m}\psi'(N_i+a_+). \qquad (6.47)$$


Based on the Sherman–Morrison formula (Miller, 1987),
$$(\Omega - \omega\,\mathbf{1}_n\mathbf{1}_n^{\top})^{-1} = \Omega^{-1} + \frac{\omega\,\Omega^{-1}\mathbf{1}_n\mathbf{1}_n^{\top}\Omega^{-1}}{1 - \omega\,\mathbf{1}_n^{\top}\Omega^{-1}\mathbf{1}_n}, \qquad (6.48)$$
Newton's method updates the current parameter iterate a^{(t)} by
$$a^{(t+1)} = a^{(t)} + \{-\nabla^2\ell(a^{(t)}|Y_{\mathrm{obs}})\}^{-1}\,\nabla\ell(a^{(t)}|Y_{\mathrm{obs}}). \qquad (6.49)$$

The search can be initialized with the moment estimates presented in Section 6.5.

(b) Lange's remedy

It is well known that Newton's method is not necessarily an ascent algorithm; that is, it does not guarantee ℓ(a^{(t+1)}|Y_obs) > ℓ(a^{(t)}|Y_obs). To obtain an ascent algorithm, the observed information matrix I_obs(a^{(t)}) in (6.49) should be positive definite for all t ≥ 1, which may not always be true. Lange (1995b) suggested a remedy that replaces the observed information matrix I_obs(a^{(t)}) with an approximating matrix that is positive definite. It is noted that the trigamma function is decreasing (Hille, 1959). The constant ω defined by (6.47) is always positive, and the ω_j defined by (6.46) are also positive provided that x_ij > 0. Hence Ω > 0 and Ω^{-1} > 0. The idea of Lange (1995b) is to approximate −∇²ℓ(a|Y_obs) by the right-hand side of (6.45) with a decreased value of the constant ω if necessary. Since the inverse of a positive definite matrix is positive definite, it follows from (6.48) that Ω − ω 1_n 1_n^⊤ is positive definite if and only if
$$1 - \omega\,\mathbf{1}_n^{\top}\Omega^{-1}\mathbf{1}_n = 1 - \omega\sum_{j=1}^{n}\frac{1}{\omega_j} > 0.$$
This result suggests that ω should be replaced by
$$\min\left\{\omega,\ (1-\varepsilon)\Big/\sum_{j=1}^{n}\omega_j^{-1}\right\}, \qquad (6.50)$$
where ε is a small positive constant.
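The following R sketch puts (6.44)–(6.50) together. It is an illustrative implementation rather than the authors' code: the function name dm_newton, the convergence settings, and the absence of a positivity safeguard on a are our own choices.

```r
# Newton-Raphson iteration (6.49) with Lange's safeguard (6.50)
# X: m x n matrix of counts with row sums N_i; a: starting value (e.g. moment estimates)
dm_newton <- function(X, a, maxit = 100, tol = 1e-8, eps = 1e-4) {
  m <- nrow(X); N <- rowSums(X)
  for (t in seq_len(maxit)) {
    ap <- sum(a)
    Xa <- sweep(X, 2, a, "+")                            # x_{ij} + a_j
    score <- m * (digamma(ap) - digamma(a)) +
             colSums(digamma(Xa)) - sum(digamma(N + ap))        # score vector (6.44)
    omega_j <- m * trigamma(a) - colSums(trigamma(Xa))          # (6.46)
    omega   <- m * trigamma(ap) - sum(trigamma(N + ap))         # (6.47)
    omega   <- min(omega, (1 - eps) / sum(1 / omega_j))         # Lange's remedy (6.50)
    info    <- diag(omega_j) - omega     # scalar is recycled: Omega - omega * 1 1', (6.45)
    a_new   <- a + solve(info, score)    # Newton update (6.49)
    if (max(abs(a_new - a)) < tol) return(a_new)
    a <- a_new
  }
  a
}
```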

6.6.2 The Fisher scoring algorithm

The scoring algorithm replaces the observed information matrix I_obs(a) = −∇²ℓ(a|Y_obs) by the expected information matrix J(a) = E{−∇²ℓ(a|Y_obs)}. The scoring algorithm is better than the Newton method in the sense that it immediately provides asymptotic standard deviations of the parameter estimates. Paul et al. (2005) derived explicit expressions for the elements of the exact Fisher information matrix of the Dirichlet–multinomial distribution. They considered the alternative parameters b = (b_1, ..., b_n)^⊤ = a/a_+ ∈ T_n and φ = 1/(1 + a_+). The {b_i}, similar to the multinomial cell probabilities {p_i}, may be interpreted as probabilities, and φ (0 < φ < 1) is the over-dispersion parameter. Obviously, a = b(1 − φ)/φ. From (6.42), the density of x_i can be expressed as
$$\binom{N_i}{x_i}\frac{B_n(x_i+a)}{B_n(a)} = \binom{N_i}{x_i}\frac{\prod_{j=1}^{n}\prod_{r=1}^{x_{ij}}\{b_j(1-\phi)+(r-1)\phi\}}{\prod_{r=1}^{N_i}\{(1-\phi)+(r-1)\phi\}}. \qquad (6.51)$$
Thus, up to a constant, the log-likelihood function of (b, φ) for the observed data Y_obs = {x_i}_{i=1}^m is
$$\ell \mathrel{\hat=} \ell(b,\phi|Y_{\mathrm{obs}}) = \sum_{i=1}^{m}\left[\sum_{j=1}^{n-1}\sum_{r=1}^{x_{ij}}\log\{b_j(1-\phi)+(r-1)\phi\} + \sum_{r=1}^{x_{in}}\log\{b_n(1-\phi)+(r-1)\phi\} - \sum_{r=1}^{N_i}\log\{(1-\phi)+(r-1)\phi\}\right].$$

Paul et al. (2005) provided the following results.

Theorem 6.2 For j, j' = 1, ..., n − 1, we have
$$E\!\left(\frac{-\partial^2\ell}{\partial b_j^2}\right) = (1-\phi)^2\sum_{i=1}^{m}\left[\sum_{r=1}^{N_i}\frac{\Pr(X_{ij}\ge r)}{\{b_j(1-\phi)+(r-1)\phi\}^2} + \sum_{r=1}^{N_i}\frac{\Pr(X_{in}\ge r)}{\{b_n(1-\phi)+(r-1)\phi\}^2}\right],$$
$$E\!\left(\frac{-\partial^2\ell}{\partial b_j\,\partial b_{j'}}\right) = (1-\phi)^2\sum_{i=1}^{m}\sum_{r=1}^{N_i}\frac{\Pr(X_{in}\ge r)}{\{b_n(1-\phi)+(r-1)\phi\}^2}, \qquad j\neq j',$$
$$E\!\left(\frac{-\partial^2\ell}{\partial b_j\,\partial\phi}\right) = \frac{\phi-1}{\phi}\sum_{i=1}^{m}\left[\sum_{r=1}^{N_i}\frac{b_j\Pr(X_{ij}\ge r)}{\{b_j(1-\phi)+(r-1)\phi\}^2} - \sum_{r=1}^{N_i}\frac{b_n\Pr(X_{in}\ge r)}{\{b_n(1-\phi)+(r-1)\phi\}^2}\right], \quad\text{and}$$
$$E\!\left(\frac{-\partial^2\ell}{\partial\phi^2}\right) = \phi^{-2}\sum_{i=1}^{m}\left[\sum_{j=1}^{n}\sum_{r=1}^{N_i}\frac{b_j^2\Pr(X_{ij}\ge r)}{\{b_j(1-\phi)+(r-1)\phi\}^2} - \sum_{r=1}^{N_i}\frac{1}{\{1+(r-2)\phi\}^2}\right],$$
where X_{ij} ∼ BBinomial(N_i, b_j(1 − φ)/φ, (1 − b_j)(1 − φ)/φ).




When n = 2, (6.51) is the beta–binomial distribution with pmf
$$\binom{N_i}{x_i}\frac{\prod_{r=0}^{x_i-1}\{b(1-\phi)+r\phi\}\cdot\prod_{r=0}^{N_i-x_i-1}\{(1-b)(1-\phi)+r\phi\}}{\prod_{r=0}^{N_i-1}\{(1-\phi)+r\phi\}}, \qquad (6.52)$$
where b = b_1, 1 − b = b_2, and φ is the dispersion parameter or the intra-cluster correlation coefficient. The elements of the Fisher information matrix for the parameters b and φ reduce to
$$E\!\left(\frac{-\partial^2\ell}{\partial b^2}\right) = (1-\phi)^2\sum_{i=1}^{m}\left[\sum_{r=1}^{N_i}\frac{\Pr(X_i\ge r)}{\{b(1-\phi)+(r-1)\phi\}^2} + \sum_{r=1}^{N_i}\frac{\Pr(X_i\le N_i-r)}{\{(1-b)(1-\phi)+(r-1)\phi\}^2}\right], \qquad (6.53)$$
$$E\!\left(\frac{-\partial^2\ell}{\partial b\,\partial\phi}\right) = \frac{\phi-1}{\phi}\sum_{i=1}^{m}\left[\sum_{r=1}^{N_i}\frac{b\Pr(X_i\ge r)}{\{b(1-\phi)+(r-1)\phi\}^2} - \sum_{r=1}^{N_i}\frac{(1-b)\Pr(X_i\le N_i-r)}{\{(1-b)(1-\phi)+(r-1)\phi\}^2}\right], \quad\text{and} \qquad (6.54)$$
$$E\!\left(\frac{-\partial^2\ell}{\partial\phi^2}\right) = \phi^{-2}\sum_{i=1}^{m}\left[\sum_{r=1}^{N_i}\frac{b^2\Pr(X_i\ge r)}{\{b(1-\phi)+(r-1)\phi\}^2} + \sum_{r=1}^{N_i}\frac{(1-b)^2\Pr(X_i\le N_i-r)}{\{(1-b)(1-\phi)+(r-1)\phi\}^2} - \sum_{r=1}^{N_i}\frac{1}{\{1+(r-2)\phi\}^2}\right], \qquad (6.55)$$
where X_i ∼ BBinomial(N_i, b(1 − φ)/φ, (1 − b)(1 − φ)/φ).

6.6.3 The EM gradient algorithm

We note that (6.1) and (6.2) provide a DA scheme so that an EM algorithm or an EM gradient algorithm can be used to find the MLE of the Dirichlet–multinomial parameter vector a = (a_1, ..., a_n)^⊤. By augmenting the observed data Y_obs = {x_i}_{i=1}^m with the latent variables {p_i}_{i=1}^m to form the complete data Y_com = {x_i, p_i}_{i=1}^m, where
$$x_i\,|\,p_i \overset{\mathrm{ind}}{\sim} \mathrm{Multinomial}_n(N_i,p_i) \quad\text{and}\quad p_i = (p_{i1},\dots,p_{in})^{\top} \overset{\mathrm{iid}}{\sim} \mathrm{Dirichlet}_n(a) \ \text{on } T_n,$$

we know that the marginal distribution of x_i is specified by (6.42). The joint density of the complete data Y_com is
$$\prod_{i=1}^{m}f(x_i|p_i)f(p_i) = \prod_{i=1}^{m}\left[\binom{N_i}{x_i}\prod_{j=1}^{n}p_{ij}^{x_{ij}}\cdot\frac{\prod_{j=1}^{n}p_{ij}^{a_j-1}}{B_n(a)}\right]$$
and the conditional predictive distribution of p_i is p_i | (x_i, a) ∼ Dirichlet_n(x_i + a) on T_n. Thus, the resulting log-likelihood function for the complete data is
$$\ell(a|Y_{\mathrm{com}}) = m\log\Gamma(a_+) - m\sum_{j=1}^{n}\log\Gamma(a_j) + \sum_{j=1}^{n}(a_j-1)\sum_{i=1}^{m}\log p_{ij}.$$

Since the E-step involves the calculation of the conditional expectation E(log p_ij | x_i, a^{(t)}), which may not have a closed-form expression, and the complete-data MLEs of a cannot be obtained explicitly, the EM algorithm is not practically appealing. In this case, the EM gradient algorithm is a useful alternative. Now, the Q function is
$$Q(a|a^{(t)}) = E\left\{\ell(a|Y_{\mathrm{com}})\,\middle|\,Y_{\mathrm{obs}},a^{(t)}\right\} = m\log\Gamma(a_+) - m\sum_{j=1}^{n}\log\Gamma(a_j) + \sum_{j=1}^{n}(a_j-1)\sum_{i=1}^{m}E(\log p_{ij}|x_i,a^{(t)}). \qquad (6.56)$$
Elementary calculus based on the Q function (6.56) shows that the gradient and the Hessian have the following entries:
$$\frac{\partial Q(a|a^{(t)})}{\partial a_j} = m\{\psi(a_+)-\psi(a_j)\} + \sum_{i=1}^{m}E(\log p_{ij}|x_i,a^{(t)}), \qquad 1\le j\le n,$$
$$\frac{\partial^2 Q(a|a^{(t)})}{\partial a_j^2} = m\{\psi'(a_+)-\psi'(a_j)\}, \qquad 1\le j\le n, \quad\text{and}$$
$$\frac{\partial^2 Q(a|a^{(t)})}{\partial a_j\,\partial a_{j'}} = m\,\psi'(a_+), \qquad j\neq j'.$$
Hence, the Hessian matrix is
$$\nabla^{20}Q(a|a^{(t)}) = m\left\{\mathrm{diag}(-\psi'(a_1),\dots,-\psi'(a_n)) + \mathbf{1}_n\mathbf{1}_n^{\top}\psi'(a_+)\right\}$$


with an explicit inverse matrix similar to (6.48). By combining (1.28) with (6.44), we have the following EM gradient algorithm:
$$a^{(t+1)} = a^{(t)} - [\nabla^{20}Q(a^{(t)}|a^{(t)})]^{-1}\,\nabla\ell(a^{(t)}|Y_{\mathrm{obs}}), \qquad (6.57)$$
which, in contrast to the corresponding EM algorithm, avoids calculating the conditional expectation E(log p_ij | x_i, a^{(t)}).
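For completeness, here is a small R sketch of one EM gradient step (6.57). It reuses the score computation of the Newton–Raphson sketch above, and the function name is our own.

```r
# One EM gradient step (6.57); X is the m x n count matrix, a the current iterate
em_gradient_step <- function(X, a) {
  m <- nrow(X); N <- rowSums(X); ap <- sum(a)
  score <- m * (digamma(ap) - digamma(a)) +
           colSums(digamma(sweep(X, 2, a, "+"))) - sum(digamma(N + ap))   # (6.44)
  H <- m * (diag(-trigamma(a)) + trigamma(ap))   # Hessian of Q; scalar recycled over 1 1'
  a - solve(H, score)                            # update (6.57)
}
```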

6.7 Applications

6.7.1 The forest pollen data

Consider the forest pollen data described in Example 1.9. Let x_{i1}, x_{i2}, x_{i3}, and x_{i4} denote the respective counts of Pinus, Abies, Quercus, and Alnus for the ith cluster, i = 1, ..., m (m = 73). Assume that
$$x_i = (x_{i1},x_{i2},x_{i3},x_{i4})^{\top} \overset{\mathrm{ind}}{\sim} \mathrm{DMultinomial}_4(N_i,a)$$
on T_4(N_i), where all N_i = 100 and a = (a_1, a_2, a_3, a_4)^⊤. Our objective here is to find the MLEs of the parameters {a_j}_{j=1}^4 and their SEs. We use both the Newton–Raphson algorithm specified by (6.49) and the EM gradient algorithm specified by (6.57) to find the MLEs of {a_j}_{j=1}^4. The initial values of {a_j}_{j=1}^4 can be computed by Mosimann's moments method presented in Section 6.5.3.

(a) Initial values of a via Mosimann's moments method

From (6.29), (6.30), and (6.35), we obtain
$$\bar{x} = (86.2739,\ 1.41096,\ 9.06849,\ 3.24658)^{\top},$$
$$S = \begin{pmatrix} 48.50723 & -3.030822 & -31.741248 & -13.735160\\ -3.03082 & 2.078767 & 0.638128 & 0.313927\\ -31.74125 & 0.638128 & 25.870244 & 5.232877\\ -13.73516 & 0.313927 & 5.232877 & 8.188356 \end{pmatrix},$$
and
$$W_M = \mathrm{diag}(\bar{x}) - \frac{1}{N}\bar{x}\bar{x}^{\top} = \begin{pmatrix} 11.84199 & -1.2172903 & -7.823749 & -2.8009495\\ -1.21729 & 1.3910509 & -0.127953 & -0.0458078\\ -7.82375 & -0.1279527 & 8.246117 & -0.2944155\\ -2.80095 & -0.0458078 & -0.294415 & 3.1411728 \end{pmatrix}$$
respectively. From (6.36), we have ĉ_1 = 3.438 and â_+ = 39.6071. Hence, the first moment estimate of a is given by
$$a^{(0)} = (34.17064,\ 0.55884,\ 3.59177,\ 1.28588)^{\top}. \qquad (6.58)$$


Similarly, from (6.37), we have ĉ_2 = 2.19621 and â_+ = 81.7612. Hence, the second moment estimate of a is given by
$$a^{(0)} = (70.53867,\ 1.15362,\ 7.41451,\ 2.65444)^{\top}. \qquad (6.59)$$

(b) MLEs and SEs via the Newton–Raphson algorithm

Using (6.58) as the initial values, the Newton–Raphson algorithm converges in five iterations. The resultant MLE is
$$\hat{a} = (51.8953,\ 0.988744,\ 5.34529,\ 1.96602)^{\top}. \qquad (6.60)$$
The estimated asymptotic covariance matrix of the MLE â is
$$\begin{pmatrix} 86.8799 & 1.2620030 & 8.520328 & 2.8617393\\ 1.26200 & 0.0351321 & 0.125389 & 0.0421146\\ 8.52033 & 0.1253888 & 0.958357 & 0.2843337\\ 2.86174 & 0.0421146 & 0.284334 & 0.1330512 \end{pmatrix}.$$
Thus, the corresponding asymptotic SEs are 9.320944, 0.187436, 0.978957, and 0.364762, and the asymptotic 90% CIs are [36.563, 67.226], [0.68044, 1.29705], [3.73505, 6.95553], and [1.36604, 2.56600]. If we use (6.59) as the initial values, then the Newton–Raphson algorithm converges in six iterations to the same maximum point specified by (6.60). Let π = a/a_+ and ρ = 1/√(1 + a_+). From (6.60), we then have π̂ = (0.8621147, 0.0164256, 0.0887990, 0.0326607)^⊤ and ρ̂ = 0.127832, which are identical to the results obtained by Neerchal and Morel (2005: 41, Table 5). Furthermore, defining φ = 1/(1 + a_+), it follows from (6.60) that φ̂ = 0.0163411, which is the same as the result obtained by Paul et al. (2005: 234, Table 1).

(c) MLEs via the EM gradient algorithm

Using (6.58) or (6.59) as the initial values, the EM gradient algorithm converges in 50 or 40 iterations respectively, and results in the same MLE given by (6.60). This example again confirms the relatively slow convergence of the EM gradient algorithm.

6.7.2 The teratogenesis data

Consider the teratogenesis data described in Example 1.10. Let x_{i1}, x_{i2}, and x_{i3} denote the numbers of dead/resorbed fetuses, malformed fetuses, and normal fetuses respectively. Furthermore, let N_i = x_{i1} + x_{i2} + x_{i3} be the number of implantation sites for the ith litter, i = 1, ..., m. Assume that
$$x_i = (x_{i1},x_{i2},x_{i3})^{\top} \overset{\mathrm{ind}}{\sim} \mathrm{DMultinomial}_3(N_i,a)$$


on T_3(N_i). We use both the Newton–Raphson algorithm specified by (6.49) and the EM gradient algorithm specified by (6.57) to find the MLEs of {a_j}_{j=1}^3 and their SEs.

(a) Initial values of a via Mosimann's moments method

We note that Mosimann's moments method presented in Section 6.5.3 requires all N_i to be equal. For the current three cases, we can take N = ∑_{i=1}^m N_i/m to find approximate moment estimates of {a_j}_{j=1}^3.

First we consider the case of low dose in Table 1.10. From (6.36) and (6.37), we have ĉ_1 = 3.34004 and ĉ_2 = 2.08386. Hence, the two approximate moment estimates of a (based on ĉ_1 and ĉ_2 respectively) are given by
$$a_{I}^{(0)} = (0.332227,\ 0.293893,\ 2.517255)^{\top} \qquad (6.61)$$
and
$$a_{I}^{(0)} = (0.839766,\ 0.742870,\ 6.362842)^{\top}. \qquad (6.62)$$
Second, for the case of medium dose in Table 1.10, we obtain ĉ_1 = 4.64473, ĉ_2 = 3.63059,
$$a_{II}^{(0)} = (0.912515,\ 0.232277,\ 0.804672)^{\top}, \qquad (6.63)$$
and
$$a_{II}^{(0)} = (1.444765,\ 0.367758,\ 1.274020)^{\top}. \qquad (6.64)$$
Finally, for the case of high dose in Table 1.10, we have ĉ_1 = 3.3301, ĉ_2 = 2.65912,
$$a_{III}^{(0)} = (2.057819,\ 0.353490,\ 0.789041)^{\top}, \qquad (6.65)$$
and
$$a_{III}^{(0)} = (3.150087,\ 0.541119,\ 1.207855)^{\top}. \qquad (6.66)$$

(b) MLEs and SEs via the Newton–Raphson algorithm

Low dose case. Using (6.61) as the initial values, the Newton–Raphson algorithm converges in six iterations. The resultant MLE is
$$\hat{a}_{I} = (0.816865,\ 0.517579,\ 5.26511)^{\top}. \qquad (6.67)$$
The estimated asymptotic covariance matrix of the MLE â_I is
$$\begin{pmatrix} 0.1278063 & 0.0632157 & 0.7193311\\ 0.0632157 & 0.0749975 & 0.5204692\\ 0.7193311 & 0.5204692 & 6.2553480 \end{pmatrix}.$$
Thus, the corresponding asymptotic SEs are 0.3575, 0.273857, and 2.501069, and the asymptotic 90% CIs are [0.2288, 1.4049], [0.0671251, 0.9680337], and [1.15121, 9.37900].

Medium dose case. Using (6.63) as the initial values, the Newton–Raphson algorithm converges in five iterations. The resultant MLE is
$$\hat{a}_{II} = (1.54325,\ 0.501156,\ 1.31957)^{\top}. \qquad (6.68)$$


The estimated asymptotic covariance matrix of the MLE â_{II} is
$$\begin{pmatrix} 0.2893409 & 0.0522498 & 0.1783418\\ 0.0522498 & 0.0309024 & 0.0434131\\ 0.1783418 & 0.0434131 & 0.2102445 \end{pmatrix}.$$
Thus, the corresponding asymptotic SEs are 0.537904, 0.175791, and 0.458524, and the asymptotic 90% CIs are [0.6585, 2.4280], [0.212006, 0.790306], and [0.565365, 2.073775].

High dose case. Using (6.65) as the initial values, the Newton–Raphson algorithm converges in five iterations. The resultant MLE is
$$\hat{a}_{III} = (3.87937,\ 0.809339,\ 1.49653)^{\top}. \qquad (6.69)$$
The estimated asymptotic covariance matrix of the MLE â_{III} is
$$\begin{pmatrix} 1.175094 & 0.1741749 & 0.3674508\\ 0.1741749 & 0.0475538 & 0.0603841\\ 0.3674508 & 0.0603841 & 0.1673286 \end{pmatrix}.$$
Thus, the asymptotic SEs are 1.084018, 0.218068, and 0.409058, and the asymptotic 90% CIs are [2.09632, 5.66242], [0.450649, 1.168030], and [0.823687, 2.169369].

(c) MLEs via the EM gradient algorithm

Using (6.61), (6.63), or (6.65) as the initial values, the EM gradient algorithm converges in 91, 29, or 44 iterations respectively, and results in the MLE of a given by (6.67), (6.68), or (6.69).

6.8 Testing the multinomial assumption against the Dirichlet–multinomial alternative

For known parameter values, Potthoff and Whittinghill (1966) developed locally most powerful tests for goodness of fit of the multinomial (binomial) distribution against the Dirichlet–multinomial (beta–binomial) alternative. When the multinomial (binomial) parameters are unknown, they derived conservative tests. Paul et al. (1989) proposed the C(α) test for goodness of fit of the multinomial distribution against the Dirichlet–multinomial alternative and showed that the statistic is the generalization of the C(α) test for goodness of fit of the binomial distribution against the beta–binomial alternative (Tarone, 1979).

6.8.1 The likelihood ratio statistic and its null distribution

Consider m independent Dirichlet–multinomial samples, where the ith sample x_i is specified by (6.42) with alternative parameters b = (b_1, ..., b_n)^⊤ = a/a_+ ∈ T_n


and θ = 1/a_+ = 1/∑_{j=1}^n a_j. We have a = bθ^{−1}. From (6.42), the density of x_i can be written as
$$\binom{N_i}{x_i}\frac{B_n(x_i+a)}{B_n(a)} = \binom{N_i}{x_i}\frac{\Gamma(\theta^{-1})}{\Gamma(N_i+\theta^{-1})}\prod_{j=1}^{n}\frac{\Gamma(x_{ij}+b_j\theta^{-1})}{\Gamma(b_j\theta^{-1})} = \binom{N_i}{x_i}\frac{\prod_{j=1}^{n}\prod_{r=0}^{x_{ij}-1}(b_j+r\theta)}{\prod_{r=0}^{N_i-1}(1+r\theta)}. \qquad (6.70)$$
It can be shown that, as θ → 0, the Dirichlet–multinomial distribution converges to the multinomial distribution with parameters N_i and b. This result can be exploited to construct a test of the multinomial versus the Dirichlet–multinomial distribution, which is equivalent to testing
$$H_0:\ \theta=0 \qquad\text{against}\qquad H_1:\ \theta>0.$$
The log-likelihood function of (b, θ) for the m independent samples Y_obs = {x_i}_{i=1}^m is
$$\ell(b,\theta|Y_{\mathrm{obs}}) = \sum_{i=1}^{m}\left[\sum_{j=1}^{n}\sum_{r=0}^{x_{ij}-1}\log(b_j+r\theta) - \sum_{r=0}^{N_i-1}\log(1+r\theta)\right]. \qquad (6.71)$$

Under H_0, the likelihood function of b based on Y_obs is
$$\prod_{i=1}^{m}\left\{\binom{N_i}{x_i}\prod_{j=1}^{n}b_j^{x_{ij}}\right\}.$$
Thus, under H_0 the MLE of b_j (j = 1, ..., n) is given by
$$\tilde{b}_j = \frac{\sum_{i=1}^{m}x_{ij}}{\sum_{i=1}^{m}N_i} \overset{(6.29)}{=} \frac{m\bar{x}_j}{N}, \qquad (6.72)$$
which is √N-consistent, where N ≜ ∑_{i=1}^m N_i. The maximum log-likelihood, apart from a constant, under H_0 is
$$\ell_0 = \sum_{i=1}^{m}\sum_{j=1}^{n}x_{ij}\log\tilde{b}_j = N\sum_{j=1}^{n}\tilde{b}_j\log\tilde{b}_j. \qquad (6.73)$$
Similarly, the maximum log-likelihood under H_1 is
$$\ell_1 = \sum_{i=1}^{m}\left[\sum_{j=1}^{n}\sum_{r=0}^{x_{ij}-1}\log(\hat{b}_j+r\hat{\theta}) - \sum_{r=0}^{N_i-1}\log(1+r\hat{\theta})\right], \qquad (6.74)$$
where b̂_1, ..., b̂_n, θ̂ are the MLEs of b and θ under H_1. The likelihood ratio statistic for testing θ = 0 is 2(ℓ_1 − ℓ_0).


Under standard conditions, the asymptotic null distribution of the likelihood ratio statistic would be a chi-square distribution with one degree of freedom. However, owing to the nonnegativity constraint on θ, a boundary problem arises and the regular asymptotic likelihood theory is not applicable in this situation. Under H_0, the valid asymptotic distribution of the likelihood ratio statistic is a 50:50 mixture of zero and a chi-square with one degree of freedom, provided that b_j ∈ (0, 1) for all j between 1 and n (Self and Liang, 1987).

6.8.2 The C(α) test

Let
$$S(b) = \left.\frac{\partial\ell(b,\theta|Y_{\mathrm{obs}})}{\partial\theta}\right|_{\theta=0} \qquad\text{and}\qquad \sigma_{\theta j} = E\!\left.\left\{-\frac{\partial^2\ell(b,\theta|Y_{\mathrm{obs}})}{\partial\theta\,\partial b_j}\right\}\right|_{\theta=0}, \qquad j=1,\dots,n-1.$$
If σ_{θj} = 0 for j = 1, ..., n − 1, then Neyman's (1959) C(α) criterion for testing H_0: θ = 0 has the form
$$Z = \frac{S(b)}{\sqrt{\mathrm{Var}\{S(b)\}}}. \qquad (6.75)$$

Note that
$$S(b) = \sum_{i=1}^{m}\left.\left[\sum_{j=1}^{n}\sum_{r=0}^{x_{ij}-1}\frac{r}{b_j+r\theta} - \sum_{r=0}^{N_i-1}\frac{r}{1+r\theta}\right]\right|_{\theta=0} = \sum_{i=1}^{m}\left[\sum_{j=1}^{n}\sum_{r=0}^{x_{ij}-1}\frac{r}{b_j} - \sum_{r=0}^{N_i-1}r\right] = \sum_{i=1}^{m}\left[\sum_{j=1}^{n}\frac{x_{ij}(x_{ij}-1)}{2b_j} - \frac{N_i(N_i-1)}{2}\right]. \qquad (6.76)$$

When θ = 0, we have x_ij ∼ Binomial(N_i, b_j), so that
$$E\{x_{ij}(x_{ij}-1)\} = N_i(N_i-1)b_j^2, \qquad (6.77)$$
$$E\{x_{ij}(x_{ij}-1)(x_{ij}-2)\} = N_i(N_i-1)(N_i-2)b_j^3, \quad\text{and}$$
$$E\{x_{ij}(x_{ij}-1)(2x_{ij}-1)\} = 2E\{x_{ij}(x_{ij}-1)(x_{ij}-2)\} + 3E\{x_{ij}(x_{ij}-1)\} = N_i(N_i-1)b_j^2\{2(N_i-2)b_j+3\}. \qquad (6.78)$$


Using (6.77), we immediately obtain E{S(b)} = 0. In addition, for j = 1, ..., n − 1,
$$\left.\frac{\partial^2\ell(b,\theta|Y_{\mathrm{obs}})}{\partial\theta\,\partial b_j}\right|_{\theta=0} = \sum_{i=1}^{m}\left.\left[\sum_{r=0}^{x_{ij}-1}\frac{-r}{(b_j+r\theta)^2} + \sum_{r=0}^{x_{in}-1}\frac{r}{(b_n+r\theta)^2}\right]\right|_{\theta=0} = \sum_{i=1}^{m}\left[\sum_{r=0}^{x_{ij}-1}\frac{-r}{b_j^2} + \sum_{r=0}^{x_{in}-1}\frac{r}{b_n^2}\right] = \sum_{i=1}^{m}\left[\frac{-x_{ij}(x_{ij}-1)}{2b_j^2} + \frac{x_{in}(x_{in}-1)}{2b_n^2}\right].$$
By (6.77), we have σ_{θj} = 0 for j = 1, ..., n − 1. Finally,
$$\mathrm{Var}\{S(b)\} = E\!\left.\left\{-\frac{\partial^2\ell(b,\theta|Y_{\mathrm{obs}})}{\partial\theta^2}\right\}\right|_{\theta=0} = E\!\left.\left[\sum_{i=1}^{m}\left\{\sum_{j=1}^{n}\sum_{r=0}^{x_{ij}-1}\frac{r^2}{(b_j+r\theta)^2} - \sum_{r=0}^{N_i-1}\frac{r^2}{(1+r\theta)^2}\right\}\right]\right|_{\theta=0}$$
$$= E\!\left[\sum_{i=1}^{m}\left\{\sum_{j=1}^{n}\sum_{r=0}^{x_{ij}-1}\frac{r^2}{b_j^2} - \sum_{r=0}^{N_i-1}r^2\right\}\right] = \sum_{i=1}^{m}\sum_{j=1}^{n}\frac{E\{x_{ij}(x_{ij}-1)(2x_{ij}-1)\}}{6b_j^2} - \sum_{i=1}^{m}\frac{N_i(N_i-1)(2N_i-1)}{6} \overset{(6.78)}{=} \frac{n-1}{2}\sum_{i=1}^{m}N_i(N_i-1).$$

By replacing b_j in (6.76) with b̃_j defined by (6.72), the Z statistic in (6.75) becomes
$$Z = \frac{\displaystyle N\sum_{j=1}^{n}\frac{1}{m\bar{x}_j}\sum_{i=1}^{m}x_{ij}(x_{ij}-1) - \sum_{i=1}^{m}N_i(N_i-1)}{\sqrt{\displaystyle 2(n-1)\sum_{i=1}^{m}N_i(N_i-1)}}\ \sim\ N(0,1). \qquad (6.79)$$
This Z statistic is used to test multinomial goodness of fit against the Dirichlet–multinomial alternatives. When n = 2, it is easily verified that Z reduces to Tarone's (1979) Z statistic for testing binomial goodness of fit against the beta–binomial alternatives.


6.8.3 Two illustrative examples

Example 6.1 (Pregnant mice data in a toxicological experiment). Paul et al. (1989) considered the treatment-group data of pregnant mice in a toxicological experiment. The observed proportions in the treatment group are (Kupper and Haseman, 1978: 75): 0/5, 2/5, 1/7, 0/8, 2/8, 3/8, 0/9, 4/9, 1/10, 6/10. The value of the Z statistic is 2.575 and the corresponding p-value is Pr(Z > 2.575) = 1 − Φ(2.575) = 0.005012. Therefore, at the 0.01 level of significance, the null hypothesis must be rejected. Alternatively, the value of the likelihood ratio statistic is 3.944, and the p-value corresponding to a 50:50 mixture of zero and chi-square with one degree of freedom is 0.0050. Thus, both tests reject the null hypothesis of a binomial distribution in favor of the beta–binomial distribution.

Example 6.2 (Forest pollen data in the Bellas Artes core). Consider the forest pollen data in Table 1.9. Paul et al. (1989) tested whether the multinomial hypothesis for the data is rejected in favor of a Dirichlet–multinomial distribution. For the data in Table 1.9, a value of Z = 15.06 and a value of 15.166 for the likelihood ratio statistic were obtained. The p-values for Z and for the likelihood ratio test based on a 50:50 mixture of zero and chi-square with one degree of freedom are indistinguishable from zero. Therefore, both tests reject the null hypothesis of a multinomial distribution in favor of the Dirichlet–multinomial distribution.
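As a quick numerical check of Example 6.1, the following R sketch evaluates the Z statistic (6.79) on the ten litters listed above; the variable names are our own.

```r
# Pregnant mice data: observed counts x1 out of litter sizes Ni
x1 <- c(0, 2, 1, 0, 2, 3, 0, 4, 1, 6)
Ni <- c(5, 5, 7, 8, 8, 8, 9, 9, 10, 10)
X  <- cbind(x1, Ni - x1)                        # n = 2 columns
m  <- nrow(X); n <- ncol(X); N <- sum(Ni)
num <- N * sum(colSums(X * (X - 1)) / (m * colMeans(X))) - sum(Ni * (Ni - 1))
Z   <- num / sqrt(2 * (n - 1) * sum(Ni * (Ni - 1)))
Z              # about 2.575
1 - pnorm(Z)   # one-sided p-value, about 0.005
```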

7

Truncated Dirichlet distribution

This chapter introduces the truncated Dirichlet distribution (Fang et al., 2000), which usually appears in misclassified multinomial models (e.g., screening or diagnostic tests in medical studies) and in experimental designs with restricted mixtures. In Section 7.1 we first introduce the density function of the truncated Dirichlet distribution. We then derive several important properties, such as the random variable generation approach, the moments, and the mode, for the truncated beta distribution, which is a special case of the truncated Dirichlet distribution. In Section 7.2, several examples involving misclassification in diagnostic tests are presented to motivate the truncated Dirichlet distribution and three statistical issues. Marginal and conditional distributions and a conditional sampling method are provided in Section 7.3. In Section 7.4 the Gibbs sampler is demonstrated to generate random samples from a truncated Dirichlet distribution. Methods for identifying the mode of a truncated Dirichlet density or finding the constrained MLE are developed in Section 7.5. Applications to misclassification and to uniform design of experiments with mixtures are shown in Sections 7.6 and 7.7 respectively.

7.1 Density function

7.1.1 Definition

Let a = (a_1, ..., a_n)^⊤ and b = (b_1, ..., b_n)^⊤ be two real-valued vectors satisfying 0 ≤ a_i < b_i ≤ 1 for i = 1, ..., n. Similar to (2.1) and (2.2), we define two convex


polyhedra:
$$T_n(a,b) = \left\{(x_1,\dots,x_n)^{\top}:\ a_i\le x_i\le b_i,\ 1\le i\le n,\ \sum_{i=1}^{n}x_i=1\right\} \qquad (7.1)$$
and
$$V_{n-1}(a,b) = \left\{(x_1,\dots,x_{n-1})^{\top}:\ a_i\le x_i\le b_i,\ 1\le i\le n-1,\ a_n\le 1-\sum_{i=1}^{n-1}x_i\le b_n\right\}. \qquad (7.2)$$
Clearly, T_n(a, b) ⊆ T_n(0_n, 1_n) = T_n and V_{n−1}(a, b) ⊆ V_{n−1}(0_n, 1_n) = V_{n−1}. In addition, if x = (x_1, ..., x_n)^⊤ ∈ T_n(a, b), we then have x_{−n} = (x_1, ..., x_{n−1})^⊤ ∈ V_{n−1}(a, b).

Definition 7.1 A random vector x = (x_1, ..., x_n)^⊤ ∈ T_n(a, b) is said to follow a truncated Dirichlet distribution if the density function of x_{−n} is given by
$$\mathrm{TDirichlet}_n(x_{-n}|\gamma;a,b) = c_{\mathrm{TD}}^{-1}\cdot\prod_{i=1}^{n}x_i^{\gamma_i-1}, \qquad x_{-n}\in V_{n-1}(a,b), \qquad (7.3)$$
where γ = (γ_1, ..., γ_n)^⊤ is a positive parameter vector and c_TD is the normalizing constant. We will write x ∼ TDirichlet_n(γ; a, b) on T_n(a, b) or x_{−n} ∼ TDirichlet(γ_1, ..., γ_{n−1}; γ_n; a, b) on V_{n−1}(a, b) accordingly. ¶

In particular, when a = 0 and b = 1, the truncated Dirichlet distribution TDirichlet_n(γ; 0, 1) reduces to the Dirichlet distribution Dirichlet_n(γ). When γ = 1_n, the distribution TDirichlet_n(1_n; a, b) becomes the uniform distribution over the convex polyhedron T_n(a, b).

7.1.2 Truncated beta distribution

When n = 2, the truncated Dirichlet distribution reduces to the truncated beta distribution with density function¹
$$\mathrm{TBeta}(x|\gamma_1,\gamma_2;a,b) = c_{\mathrm{TB}}^{-1}\cdot x^{\gamma_1-1}(1-x)^{\gamma_2-1}, \qquad (7.4)$$
where x ∈ [a, b] ⊆ [0, 1] and the normalizing constant is given by²
$$c_{\mathrm{TB}} = \int_a^b x^{\gamma_1-1}(1-x)^{\gamma_2-1}\,\mathrm{d}x = B(\gamma_1,\gamma_2)\{I_b(\gamma_1,\gamma_2)-I_a(\gamma_1,\gamma_2)\}. \qquad (7.5)$$
The corresponding cdf is
$$F_{\mathrm{TB}}(x) = \begin{cases} 0, & \text{if } x<a,\\[4pt] \dfrac{I_x(\gamma_1,\gamma_2)-I_a(\gamma_1,\gamma_2)}{I_b(\gamma_1,\gamma_2)-I_a(\gamma_1,\gamma_2)}, & \text{if } a\le x\le b,\\[4pt] 1, & \text{if } x>b. \end{cases} \qquad (7.6)$$

¹ In (7.2), letting n = 2 and x_1 = x, we have a_1 ≤ x ≤ b_1 and a_2 ≤ 1 − x ≤ b_2. Thus, max(a_1, 1 − b_2) ≤ x ≤ min(b_1, 1 − a_2). Namely, V_1(a, b) = [a, b] ⊆ [0, 1], where a = max(a_1, 1 − b_2) and b = min(b_1, 1 − a_2).

We write X ∼ TBeta(γ_1, γ_2; a, b).

(a) Generation of random variable from truncated beta distribution

Let U ∼ U[0, 1] and X ∼ TBeta(γ_1, γ_2; a, b). By the inversion method, it can be readily shown that
$$F_{\mathrm{TB}}(X) \overset{d}{=} U \qquad\text{or}\qquad X \overset{d}{=} F_{\mathrm{TB}}^{-1}(U),$$
where F_TB^{-1}(·) denotes the inverse function of F_TB(·). By (7.6), the random variable X can be generated as (Devroye, 1986: 38–39)
$$X = F_B^{-1}\big(F_B(a) + U\{F_B(b)-F_B(a)\}\big), \qquad (7.7)$$
where F_B(x) = I_x(γ_1, γ_2) denotes the cdf of the beta distribution with parameters γ_1 and γ_2, and F_B^{-1}(·) is its inverse function. The built-in R (or S-plus) functions pbeta and qbeta can be used to calculate F_B(·) and F_B^{-1}(·) respectively.

(b) The moments

For the sake of convenience, we denote the normalizing constant c_TB defined in (7.5) by c_TB(γ_1, γ_2). Let α and β be two positive integers. The mixed moment for X ∼ TBeta(γ_1, γ_2; a, b) is
$$E\{X^{\alpha}(1-X)^{\beta}\} = \frac{c_{\mathrm{TB}}(\gamma_1+\alpha,\gamma_2+\beta)}{c_{\mathrm{TB}}(\gamma_1,\gamma_2)}. \qquad (7.8)$$
When α = 1 and β = 0, we obtain E(X). Similarly, we can obtain E(X²) and Var(X).

² The incomplete beta function I_x(γ_1, γ_2) is defined by
$$I_x(\gamma_1,\gamma_2) = \frac{1}{B(\gamma_1,\gamma_2)}\int_0^x t^{\gamma_1-1}(1-t)^{\gamma_2-1}\,\mathrm{d}t,$$
which is also the cdf of the beta distribution with parameters γ_1 and γ_2.


(c) The mode

If the first-order derivative of the density specified in (7.4) is set equal to zero, we have
$$\tilde{x} = \frac{\gamma_1-1}{\gamma_1+\gamma_2-2}.$$
Hence, the mode of the truncated beta density (7.4) is given by
$$\hat{x} = \begin{cases} a, & \text{if } \tilde{x}<a,\\ \tilde{x}, & \text{if } a\le\tilde{x}\le b,\\ b, & \text{if } \tilde{x}>b, \end{cases} \;=\; \mathrm{median}(a,\tilde{x},b). \qquad (7.9)$$
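As a small illustration, the following R sketch implements the inversion sampler (7.7) with the pbeta and qbeta functions mentioned above. The function name rtbeta is our own; it is reused in later sketches in this chapter.

```r
# Draws n variates from TBeta(g1, g2; a, b) by inversion, (7.7)
rtbeta <- function(n, g1, g2, a, b) {
  u  <- runif(n)
  Fa <- pbeta(a, g1, g2); Fb <- pbeta(b, g1, g2)
  qbeta(Fa + u * (Fb - Fa), g1, g2)
}

set.seed(1)
rtbeta(5, 9, 89, 0.26, 0.89)   # e.g. the posterior TBeta(9, 89; 0.26, 0.89) of Section 7.6.1
```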

7.2 Motivating examples

Statistical inferences based on a truncated Dirichlet distribution are sometimes needed in medical and biometrical studies. In this section, several examples are presented to motivate the truncated Dirichlet distribution. We use the binary variable D to denote the true disease status:
$$D = \begin{cases} 1, & \text{for disease;}\\ 2, & \text{for nondisease.} \end{cases}$$
Let π = Pr(D = 1) denote the prevalence of the disease of interest. Consider a diagnostic test that only yields a binary result (e.g., positive or negative). Furthermore, we denote positive test results as T = 1 and negative test results as T = 2 (Johnson and Gastwirth, 1991). Two basic measures of the diagnostic accuracy of a test are sensitivity and specificity, defined by λ_{11} = Pr(T = 1|D = 1) and λ_{22} = Pr(T = 2|D = 2) respectively. The probability that the test result of a randomly selected individual is positive is denoted by p = Pr(T = 1).

Therefore, we have
$$\begin{pmatrix} p\\ 1-p \end{pmatrix} = \begin{pmatrix} \lambda_{11} & 1-\lambda_{22}\\ 1-\lambda_{11} & \lambda_{22} \end{pmatrix}\begin{pmatrix} \pi\\ 1-\pi \end{pmatrix}. \qquad (7.10)$$

(7.11)

The jth column of  is the conditional probability distribution of the apparent classification given the true state j. Consequently,  is a probability matrix with λ·j =

n 

λij = 1.

i=1

The main objective in this problem is to estimate the true classification probability π.

7.2.1 Case A: matrix  is known The likelihood function for the observed data s is a multinomial distribution with the parameter vector p. If π has a Dirichlet prior distribution Dirichletn (d) with d = (d1 , . . . , dn ), then the posterior density of π given s and  is proportional to n  i=1

⎛ ⎝

n  j=1

⎞si

λij πj ⎠ ·

n 

πidi −1 ,

π ∈ Tn .

(7.12)

i=1

Estimates of the true classification probabilities π can be routinely obtained based on the posterior (7.12); see for example, Viana (1994). It is noted that π ∈ Tn implies each component πi of π can vary from zero to unity provided that ni=1 πi = 1. However, in practice, we may know that πi falls into a certain interval; say, 0 ≤ ai ≤ πi ≤ bi ≤ 1, where ai and bi are known constants. For example, when n = 2, if 0 ≤ π1 ≤ 0.05, then it suggests a prior with the prevalence of the disease being less than or equal to 5%. Therefore, given such information of restrictions, it is reasonable to specify a truncated Dirichlet prior for π; namely, π ∼ TDirichletn (d; a, b) on Tn (a, b). Thus, the posterior distribution of π given s and  is still given by (7.12) with π ∈ Tn (a, b). Estimates of the true classification probabilities π can be obtained now by using the importance sampling method with a truncated Dirichlet distribution as the importance sampling density.

232

DIRICHLET AND RELATED DISTRIBUTIONS

7.2.2 Case B: matrix  is unknown When n = 2, three parameters (i.e., λ11 , λ22 , and π) need to be estimated. Viana et al. (1993) assumed that the joint prior specification for λ11 , λ22 , and π is the product of three independent beta distributions. However, the condition λ11 > 1 − λ22 is sometimes reasonable. In fact, from (7.10), we have p = λ11 π + (1 − λ22 )(1 − π). Given λ11 and λ22 , we note that p is a function of π and write p = f (π|λ11 , λ22 ). According to the definitions of both π and p, it is natural to require that p is an increasing function of π. From ∂p/∂π > 0, we obtain λ11 > 1 − λ22 , or equivalently (1 − λ11 ) + (1 − λ22 ) < 1. Imposing this restriction, Evans et al. (1996) developed a Bayesian method for the analysis of binary data subject to misclassification by specifying the joint prior of (λ∗11 , λ∗22 ) and π as the product of independent Dirichlet distribution and beta distribution, where λ∗11 = ˆ 1 − λ11 and λ∗22 = ˆ 1 − λ22 . Furthermore, if other information is available – for instance, 0.7 ≤ λ11 ≤ 0.95 and 0.8 ≤ λ22 ≤ 0.90 – then we may use a truncated Dirichlet distribution as the prior distribution of (λ∗11 , λ∗22 ).

7.2.3 Case C: matrix  is partially known From (7.11), it is clear that the estimates of p should satisfy the following constraints: min{λi1 , . . . , λin } ≤ pi ≤ max{λi1 , . . . , λin },

1 ≤ i ≤ n.

(7.13)

When some of these inequalities fail, the vector −1 p does not belong to (0, 1)n (the admissible domain of π) and the corresponding estimates of {πi } are assigned to be either 0 or 1. When n = 2 and  is completely known and −1 exists, Lew and Levy (1989) proposed a numerical approximation to the Bayesian estimates of p under the uniform prior restricted by (7.13) which leads to admissible posterior estimates. Define ai = min{λi1 , . . . , λin }

and bi = max{λi1 , . . . , λin },

1 ≤ i ≤ n.

(7.14)

Now we assume that, for each i (1 ≤ i ≤ n), only ai and bi are known and the values of other entries in  are unknown. In this situation, the method proposed by Viana (1994) cannot be applied to estimate the true classification probabilities π. The new idea is as follows. First, one can find the estimates of p including Bayesian estimates and the MLEs. Next, one can derive the corresponding estimates of π based on the estimates of p. The posterior density of p is proportional to the product of the likelihood function and the prior distribution Dirichletn (d) subject

TRUNCATED DIRICHLET DISTRIBUTION

233

to (7.13); that is, n 

psi i +di −1 ,

p ∈ Tn (a, b),

(7.15)

i=1

where ai and bi are defined by (7.14). Therefore, finding the Bayesian estimates of p is equivalent to finding the moments of TDirichletn (s + d; a, b). The MLEs of p maximize the likelihood function n 

psi i ,

(7.16)

i=1

subject to the constraints in (7.13). This is, however, equivalent to extracting the mode of TDirichletn (s + 1; a, b) on Tn (a, b). Cases A, B, and C motivate the following three issues: • • •

How do you generate an i.i.d. sample from a truncated Dirichlet distribution? How do you calculate the moments of a truncated Dirichlet distribution? How do you find the mode of a truncated Dirichlet density?

7.3 Conditional sampling method In this section we derive the marginal and univariate conditional distributions of a truncated Dirichlet distribution. We also introduce a conditional sampling method to generate a random vector following a truncated Dirichlet distribution.

7.3.1 Consistent convex polyhedra A convex polyhedron Tn (a, b) is nonempty if and only if a+ ≤ 1 ≤ b+ , (7.17) n where a+ = i=1 ai and b+ = i=1 bi . From Section 7.2.3, it can be shown that the domain Tn (a, b) with ai and bi specified by (7.14) and (7.13) is nonempty. In fact, from (7.17), we immediately obtain n

a+ =

n  i=1

ai =

n  i=1

min{λi1 , . . . , λin } ≤

n 

λi1 = λ·1 = 1.

i=1

 Similarly, we have b+ = ni=1 bi ≥ 1. Methods for detecting Tn (a, b) with superfluous constraints were given in Piepel (1983) and Crosier (1984, 1986). Let a∗ = (a1∗ , . . . , an∗ ) and b∗ = (b1∗ , . . . , bn∗ ). The convex polyhedron Tn (a∗ , b∗ ) is equivalent to Tn (a, b) if

ai∗ = max(ai , 1 − b+ + bi ) and bi∗ = min (bi , 1 − a+ + ai ),

(7.18) (7.19)

234

DIRICHLET AND RELATED DISTRIBUTIONS

for i = 1, . . . , n. Piepel (1983) showed that there are no superfluous constraints in Tn (a∗ , b∗ ). In this case, the convex polyhedron Tn (a∗ , b∗ ) is said to be consistent.

7.3.2 Marginal distributions Let x ∼ TDirichletn (γ; a, b) on Tn (a, b). It is easy to verify that xn−1 follows a truncated beta distribution, TBeta(γn−1 , γ+ − γn−1 ; ξn−1 , ηn−1 ), ˆ where γ+ =

n i=1

γi . From (7.18) and (7.19), we have ξn−1 = max(an−1 , 1 − b+ + bn−1 ) and ηn−1 = min (bn−1 , 1 − a+ + an−1 ),

(7.20) (7.21)

for i = 1, . . . , n. Therefore, we have the following result. Theorem 7.1

If x ∼ TDirichletn (γ; a, b) on Tn (a, b), then xn−1 ∼ TBeta(γn−1 , γ+ − γn−1 ; ξn−1 , ηn−1 ),

where ξn−1 and ηn−1 are given by (7.20) and (7.21) respectively.



7.3.3 Conditional distributions Let (x1 , . . . , xn−1 ) ∼ TDirichlet(γ1 , . . . , γn−1 ; γn ; a, b) on Vn−1 (a, b). We first consider deriving the conditional distribution of xn−2 |xn−1 . For the situation of no restrictions, Theorem 1.6 of Fang et al. (1990: 21) states that given xn−1 the joint conditional distribution of x1 xn−2 ,..., 1 − xn−1 1 − xn−1 is a Dirichlet distribution with parameters γ1 , . . . , γn−2 and γn . Consequently, for the truncated Dirichlet density (7.3), we have    xn−2  x1 ,..., x n−1 1 − xn−1 1 − xn−1    ∼ TDirichlet γ1 , . . . , γn−2 ; γn ; a(n−1) , b(n−1) ,

where 

a

(n−1)

= 

b(n−1) =

an−2 an a1 ,..., , 1 − xn−1 1 − xn−1 1 − xn−1 bn−2 bn b1 ,..., , 1 − xn−1 1 − xn−1 1 − xn−1



and 

.

(7.22)

TRUNCATED DIRICHLET DISTRIBUTION

235

It remains to show that the domain Vn−2 (a(n−1) , b(n−1) ) or equivalently Tn−1 (a(n−1) , b(n−1) ) is nonempty. According to (7.17), we only need to show that n−2

ai + a n ≤1≤ 1 − xn−1

n−2

i=1

i=1 bi + bn . 1 − xn−1

(7.23)

In fact, since ai ≤ xi ≤ bi for i = 1, . . . , n, we have n−2 

ai + a n ≤

i=1

n−2 

xi + xn = 1 − xn−1 > 0

i=1

and n−2 

bi + b n ≥

i=1

n−2 

xi + xn = 1 − xn−1 > 0.

i=1

Thus, (7.23) follows immediately. Applying Theorem 7.1 to (7.22), we have  xn−2  xn−1 ∼ TBeta(γn−2 , γ+ − γn−1 − γn−2 ; ξn−2 , ηn−2 ), 1 − xn−1 

where 

ξn−2 = max 

ηn−2 = min

b+ − bn−1 − bn−2 an−2 , 1− 1 − xn−1 1 − xn−1



and

 a+ − an−1 − an−2 bn−2 , 1− . 1 − xn−1 1 − xn−1

Similarly, we can obtain the conditional distribution of   xk xk+1 , . . . , xn−1 1 − xk+1 − · · · − xn−1 

for k = n − 3, n − 4, . . . , 1. These results are summarized in the following theorem. Theorem 7.2 If (x1 , . . . , xn−1 ) ∼ TDirichlet(γ1 , . . . , γn−1 ; γn ; a, b) on Vn−1 (a, b), then for k = n − 2, n − 3, . . . , 1, 

1−

xk n−1 j=k+1

xj

    n−1   xk+1 , . . . , xn−1 ∼ TBeta γk , γ+ − γj ; ξk , ηk ,  j=k

236

where

DIRICHLET AND RELATED DISTRIBUTIONS

  b+ − n−1 j=k bj ξk = max , 1− and  1 − j=k+1 xj 1 − n−1 j=k+1 xj    a+ − n−1 bk j=k aj ηk = min , 1 − .   1 − n−1 1 − n−1 j=k+1 xj j=k+1 xj 

ak n−1

(7.24) (7.25) ¶

7.3.4 Generation of random vector from a truncated Dirichlet distribution Consider the following factorization of the joint density of x−n : n−2   f (xk |xk+1 , xk+2 , . . . , xn−1 ) × f (xn−1 ). f (x−n ) =

(7.26)

k=1

By the conditional sampling method, to generate x−n from the joint density f (x−n ) we first generate xn−1 from the marginal density f (xn−1 ) and then sequentially generate xk from the univariate conditional density f (xk |xk+1 , xk+2 , . . . , xn−1 ) for k = n − 2, n − 3, . . . , 1. Based on Theorems 7.1 and 7.2, a stochastic representation of a truncated Dirichlet random vector can be obtained as a by-product. iid

Theorem 7.3 If x ∼ TDirichletn (γ; a, b) on Tn (a, b) and u1 , . . . , un−1 ∼ U[0, 1], then x has the following stochastic representation: ⎧ d −1 ⎪ xn−1 = Fn−1 (un−1 ), ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪   n−1 ⎪  ⎪ ⎪ d ⎨ xk = 1− xj Fk−1 (uk ), k = n − 2, . . . , 1, j=k+1 ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ n−1 ⎪  ⎪ ⎪ d ⎪ x = 1 − xk , ⎪ n ⎩ k=1

where Fk−1 (·) denotes the inverse function of Fk (·), which is the cdf of ⎛ ⎞ n−1  TBeta ⎝γk , γ+ − γj ; ξk , ηk ⎠ j=k

with ξk and ηk given by (7.24) and (7.25) respectively, for k = n − 1, n − 2, . . . , 1, and the summation n−1 ¶ j=k+1 is defined as zero if k + 1 > n − 1.

TRUNCATED DIRICHLET DISTRIBUTION

237

Remark 7.1 Theorem 7.3 shows that a truncated Dirichlet distribution can be generated sequentially by a sequence of truncated beta variates (see Section 7.1.2(a)). On the other hand, the relevant moments of a truncated Dirichlet distribution can be calculated numerically by the Monte Carlo method. ¶
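To make Theorem 7.3 concrete, here is a minimal R sketch of one draw from TDirichlet_n(γ; a, b). It reuses the rtbeta() helper from Section 7.1.2(a), assumes that the bound vectors already satisfy (7.17), and the function name is our own.

```r
# One draw from TDirichlet_n(gamma; a, b) by the conditional sampling method (Theorem 7.3)
rtdirichlet1 <- function(gamma, a, b) {
  n <- length(gamma); gp <- sum(gamma); ap <- sum(a); bp <- sum(b)
  x <- numeric(n)
  for (k in (n - 1):1) {
    rest <- if (k < n - 1) sum(x[(k + 1):(n - 1)]) else 0     # sum_{j>k} x_j
    xi  <- max(a[k] / (1 - rest), 1 - (bp - sum(b[k:(n - 1)])) / (1 - rest))   # (7.24)
    eta <- min(b[k] / (1 - rest), 1 - (ap - sum(a[k:(n - 1)])) / (1 - rest))   # (7.25)
    g2  <- gp - sum(gamma[k:(n - 1)])                         # second shape parameter
    x[k] <- (1 - rest) * rtbeta(1, gamma[k], g2, xi, eta)
  }
  x[n] <- 1 - sum(x[1:(n - 1)])
  x
}

set.seed(1)
rtdirichlet1(c(2, 3, 4), a = c(0.1, 0.2, 0.1), b = c(0.6, 0.7, 0.5))
```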

7.4 Gibbs sampling method As an alternative, the Gibbs sampler (Geman and Geman, 1984; Gelfand and Smith, 1990) can be adopted to generate random samples from a truncated Dirichlet distribution. To implement the Gibbs sampling, we need to obtain all full univariate conditional distributions. Suppose that we want to generate the values of a random vector x ∼ TDirichletn (γ; a, b) on Tn (a, b) or x−n ∼ TDirichlet(γ1 , . . . , γn−1 ; γn ; a, b) on Vn−1 (a, b). The Gibbs sampling requires the specification of the full univariate conditional density f (xi |x−i ), i = 1, . . . , n − 1, where x−i = (x1 , . . . , xi−1 , xi+1 , . . . , xn−1 ) denotes the complemental vector of the component xi in x−n . For some fixed i (i = 1, . . . , n − 1), the conditional density f (xi |x−i ) is proportional to the joint density (7.3) and we have γ −1

f (xi |x−i ) ∝ xi i (xi∗ − xi )γn −1

(7.27)

max(ai , xi∗ − bn ) ≤ xi ≤ min(bi , xi∗ − an ),

(7.28)

with support

where xi∗ = xn + xi = 1 −

n−1 

xj = 1 − 1x−i

j=1,j = / i

is a linear function of x−i . Next, we derive the support in (7.28). On the one hand, ai ≤ xi ≤ bi . On the other hand, from (7.2), we have an ≤ xi∗ − xi ≤ bn , resulting in xi∗ − bn ≤ xi ≤ xi∗ − an . Combining the two facts, we immediately obtain (7.28). From (7.27), taking transformation yi =

xi , xi∗

i = 1, . . . , n − 1,

yields yi |x−i ∼ TBeta(γi , γn ; αi , βi ),

(7.29)

238

DIRICHLET AND RELATED DISTRIBUTIONS

where 

αi = max

bn ai , 1− ∗ xi∗ xi





and βi = min

an bi , 1− ∗ xi∗ xi



.

(7.30)

The method introduced in Section 7.1.2(a) can be used to simulate the random variable yi given x−i from the truncated beta distribution (7.29). Once yi is available, the value of xi given x−i is given by xi = xi∗ · yi . The implementation of the Gibbs sampling and the calculation of arbitrary expectations of interest were presented by Gelfand and Smith (1991) and Arnold (1993). Remark 7.2 We need to prove that αi and βi given in (7.30) satisfy 0 ≤ αi ≤ βi ≤ 1, so that the sampling from (7.29) can proceed routinely. In fact, max(ai , xi∗ − bn ) ≥ ai ≥ 0. Noting that xi∗ = xn + xi is a positive real number, we have αi =

max(ai , xi∗ − bn ) ≥ 0. xi∗

Similarly, it can be shown that βi ≤ 1 − an /xi∗ ≤ 1. Finally, we will show that αi ≤ βi is equivalent to ˆ max(ai , xi∗ − bn ) ≤ βi∗ = ˆ min(bi , xi∗ − an ). α∗i = Case 1: ai ≤ xi∗ − bn and bi ≤ xi∗ − an . Hence, xi∗ − bi ≤ xi∗ − xi = xn ≤ bn leads to α∗i = xi∗ − bn ≤ bi = βi∗ . Case 2: ai ≤ xi∗ − bn and bi > xi∗ − an . From bn ≥ an , we have α∗i = xi∗ − bn ≤ xi∗ − an = βi∗ . Case 3: ai > xi∗ − bn and bi ≤ xi∗ − an . Thus, α∗i = ai ≤ bi = βi∗ . Case 4: ai > xi∗ − bn and bi > xi∗ − an . From xi∗ − ai ≥ xi∗ − xi = xn ≥ an , we have α∗i = ai ≤ xi∗ − an = βi∗ . The assertion follows immediately.
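As an illustration of the Gibbs updates (7.29)–(7.30), the following R sketch performs one full sweep over the components; it again reuses rtbeta(), assumes a feasible starting vector x in T_n(a, b), and the function name is our own.

```r
# One Gibbs sweep for x ~ TDirichlet_n(gamma; a, b), updating x_1, ..., x_{n-1}
gibbs_sweep <- function(x, gamma, a, b) {
  n <- length(x)
  for (i in 1:(n - 1)) {
    xi_star <- x[n] + x[i]                                   # x_i* = x_n + x_i
    alpha <- max(a[i] / xi_star, 1 - b[n] / xi_star)         # (7.30)
    beta  <- min(b[i] / xi_star, 1 - a[n] / xi_star)
    x[i] <- xi_star * rtbeta(1, gamma[i], gamma[n], alpha, beta)   # (7.29), x_i = x_i* y_i
    x[n] <- xi_star - x[i]                                   # keep the components summing to 1
  }
  x
}
```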



TRUNCATED DIRICHLET DISTRIBUTION

239

7.5 The constrained maximum likelihood estimates The objective of this section is to find the mode of a truncated Dirichlet distribution. As indicated in Section 7.2.3, finding the mode of TDirichletn (s1 + 1, . . . , sn + 1; a, b; ) on Tn (a, b) is equivalent to finding the constrained MLE of p that maximizes the likelihood function (7.16) subject to the constraints p ∈ Tn (a, b). It is clear that the log-likelihood function   n−1 n−1   (p) = si log pi + sn log 1 − pi i=1

i=1

is a strictly concave function of p (Lehmann, 1983: 48–49). The unconstrained MLEs of p is given by   sn  s1  pˆ = (pˆ 1 , . . . , pˆ n ) = ,..., , (7.31) s+ s+  where s+ = ni=1 si . Let pˆ c = (pˆ 1c , . . . , pˆ nc ) denote the constrained MLE of p. If ˆ otherwise we have the following theorem, which states pˆ ∈ Tn (a, b), then pˆ c = p; that pˆ c must fall on the boundary. Theorem 7.4 Define S = {i: pˆ i < ai or pˆ i > bi }. If S is not empty, then there exists at least one i ∈ S such that pˆ ic = ai or bi . ¶ Proof. Suppose the conclusion is not true; that is, ai < pˆ ic < bi

for all i ∈ S.

(7.32)

Since the log-likelihood is a strictly concave function of p, we have ˆ + (1 − α) (pˆ c ) (αpˆ + (1 − α)pˆ c ) ≥ α (p)

for any α ∈ (0, 1).

ˆ ≥ (pˆ c ), the above inequality implies Since (p) (˜p) ≥ (pˆ c ),

(7.33)

where p˜ = ˆ αpˆ + (1 − α)pˆ c . Furthermore, for a sufficient small α (> 0), we can show that p˜ ∈ Tn (a, b); that is, ai ≤ p˜ i ≤ bi

1 ≤ i ≤ n,

and

n 

p˜ i = 1.

(7.34)

i=1

The second part of (7.34) is trivial and we only need to prove the first part. Consider two cases: (i) i ∈ S and (ii) i ∈ / S. First, we consider Case (i). From (7.32), we have ai < pˆ ic < bi . Let ti = pˆ i − pˆ ic .

240 • • •

DIRICHLET AND RELATED DISTRIBUTIONS

If ti = 0, then p˜ i = pˆ ic , which implies ai < p˜ i < bi . If ti > 0, setting α < min{(bi − pˆ ic )/ti , 1} yields ai < p˜ i < bi . If ti < 0, taking α < min{(pˆ ic − ai )/(−ti ), 1} gives ai < p˜ i < bi . Next, we consider Case (ii). From the definition of S, we have ai ≤ pˆ i ≤ bi .

• • •

If ai < pˆ ic < bi , similar to Case (i), we have ai < p˜ i < bi . If pˆ ic = ai , then ai ≤ p˜ i ≤ bi . If pˆ ic = bi , then ai ≤ p˜ i ≤ bi .

Thus, the inequalities in (7.34) are valid. Since (p) is strictly concave and p˜ = / pˆ c , we have (˜p) > (pˆ c ) from (7.33). This contradicts that pˆ c is the constrained MLE.

End

Example 7.1 (An illustrative example ). Suppose that S = {i, i }, where pˆ i < ai and pˆ i > bi . According to Theorem 7.4, we have pˆ ic = ai or pˆ i c = bi . For the case of pˆ ic = ai , set pi = ai . The log-likelihood function becomes n 

(p) = si log ai +

sj log pj .

j= / i,j=1

Thus, the MLE of p in this case is given by pˆ ic = ai ,

pˆ jc =

sj (1 − ai ), s+ − s i

j= / i,

j = 1, . . . , n.

(7.35)

If aj ≤ pˆ jc ≤ bj for all j = / i, j = 1, . . . , n, we stop; otherwise, there exists some j ∈ {1, . . . , n} − {i} such that pˆ j c < aj

or

pˆ j c > bj .

Repeat the above procedure (or the following procedure, see (7.36)) by simultaneously setting pi = ai and pj = aj (or pj = bj ). As a result, we obtain the marginal MLE p∗ , say. For the case of pˆ i c = bi , similar to (7.35), we have pˆ i c = bi ,

pˆ jc =

sj (1 − bi ), s+ − si

j= / i ,

j = 1, . . . , n.

(7.36)

We can also obtain another marginal MLE p∗∗ , say. If (p∗ ) ≥ (p∗∗ ), then p∗ = pˆ c is the desired constrained MLE.

Theorem 7.4 can be used to develop the following algorithm to compute the constrained MLE pˆ c .

TRUNCATED DIRICHLET DISTRIBUTION

241

Step 1: Compute the unconstrained MLE pˆ according to (7.31). Step 2: Find the set S = {i: pˆ i < ai or pˆ i > bi }. If S = ∅, then set pˆ c = pˆ and stop; otherwise, we have S = {j1 , . . . , jt }, where 1 ≤ j1 < · · · < jt ≤ n. Step 3: Find a group of marginal MLEs {pˆ (i) , i = j1 , . . . , jt }. Step 4: Select a constrained MLE from this group as pˆ c , which has the maximum log-likelihood function.

7.6 Application to misclassification In this section we will apply the methods introduced in previous sections to misclassification.

7.6.1 Screening test with binary misclassifications Consider a study on a screening test of chest radiographs of 96 black women for the presence of pulmonary hypertension (Lew and Levy, 1989). The sensitivity and specificity were assumed to be λ11 = 0.89 and λ22 = 0.74 respectively. The test outcomes showed that eight women were positive and 88 negative; i.e., s = (s1 , s2 ) = (8, 88). The relationship between the probability p of positive test result and the probability π of prevalence of the disease is given by (7.10). Note that the unconstrained MLE of p is pˆ =

s1 8 = 0.0833, = s1 + s 2 96

which falls outside the interval [1 − λ22 , λ11 ] = [0.26, 0.89]. From (7.10), the unconstrained MLE of π, πˆ =

0.74pˆ − 0.26(1 − p) ˆ = −0.28, 0.89 × 0.74 − 0.26 × 0.11

is outside the range (0, 1) and the MLE of π is assigned to be zero according to the traditional approach. The constrained MLE pˆ c of p should maximize the likelihood function (7.16) with n = 2; that is, p8 (1 − p)88 subject to the constraint 0.26 ≤ p ≤ 0.89. This reduces to finding the mode of the truncated beta distribution TBeta(9, 89; 0.26, 0.89). Using (7.9), the constrained MLE of p is pˆ c = median(0.26, 8/96, 0.89) = 0.26, which results in the estimate of π being πˆ =

0.74pˆ c − 0.26(1 − pˆ c ) = 0. 0.63

This demonstrates why the traditional approach assigns πˆ to be zero.
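As a small numerical check of this example, the following R sketch computes the constrained MLE through the truncated beta mode (7.9) and the resulting estimate of π; the helper name tbeta_mode is our own.

```r
# Mode of TBeta(g1, g2; a, b) via (7.9)
tbeta_mode <- function(g1, g2, a, b) median(c(a, (g1 - 1) / (g1 + g2 - 2), b))

p_c    <- tbeta_mode(9, 89, 0.26, 0.89)                  # constrained MLE of p: 0.26
pi_hat <- (0.74 * p_c - 0.26 * (1 - p_c)) / 0.63         # estimate of pi: 0
c(p_c, pi_hat)
```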

242

DIRICHLET AND RELATED DISTRIBUTIONS

Table 7.1 Bayesian estimates of p and π for various prior parameters. d1

d2

E(p|s)

std(p|s)

E(π|s)

std(π|s)

1 5 1 5 10 1 10

1 5 5 1 10 10 1

0.2707 0.2717 0.2701 0.2724 0.2732 0.2695 0.2754

0.0104 0.0112 0.0098 0.0119 0.0125 0.0092 0.0144

0.0169 0.0185 0.0160 0.0197 0.0209 0.0150 0.0245

0.0165 0.0178 0.0156 0.0189 0.0198 0.0147 0.0229

Source: Fang et al. (2000). Reproduced by permission of John Wiley & Sons (Asia) Pte Ltd.

The exact Bayesian estimator can be obtained by the posterior density (see (7.15)) f (p|s) ∝ p8+d1 −1 (1 − p)88+d2 −1 ,

0.26 ≤ p ≤ 0.89,

where the prior distribution of p is TBeta(d1 , d2 ; 0.26, 0.89). We have p|s ∼ TBeta(8 + d1 , 88 + d2 ; 0.26, 0.89). The first-order and second-order posterior moments E(p|s) and E(p2 |s) can be calculated by using (7.8) for various prior parameters d1 and d2 . Therefore, the posterior mean and standard deviation of π are E(π|s) =

E(p|s) − 0.26 0.63

and √ Var(p|s) std(π|s) = 0.63 respectively. Table 7.1 summarizes some of the results. Using the uniform prior, Lew and Levy (1989) obtained the approximate Bayesian estimate E(π|s) = 0.0176 with standard deviation 0.021 33. In contrast, with the uniform prior the exact posterior estimate of π is 0.0169 with std(π|s) = 0.0165.
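As a rough check of Table 7.1, the following R sketch evaluates the exact posterior moments through (7.8) for the uniform prior d_1 = d_2 = 1; the helper names ctb and post_moment are our own, and the output should approximately reproduce the first row of the table.

```r
# Normalizing constant (7.5) and posterior moments of p | s ~ TBeta(8 + d1, 88 + d2; 0.26, 0.89)
ctb <- function(g1, g2, a, b) beta(g1, g2) * (pbeta(b, g1, g2) - pbeta(a, g1, g2))
post_moment <- function(alpha, d1 = 1, d2 = 1, a = 0.26, b = 0.89)
  ctb(8 + d1 + alpha, 88 + d2, a, b) / ctb(8 + d1, 88 + d2, a, b)    # E(p^alpha | s), (7.8)

Ep  <- post_moment(1)                      # E(p | s), about 0.2707
sdp <- sqrt(post_moment(2) - Ep^2)         # std(p | s), about 0.0104
c((Ep - 0.26) / 0.63, sdp / 0.63)          # E(pi | s) and std(pi | s), cf. Table 7.1
```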

7.6.2 Case–control matched-pair data with polytomous misclassifications (a) Formulation of the problem In a case–control matched-pair study of ulcer history as a risk factor for stomach cancer, Greenland (1989) and Viana (1994) considered the estimation problem of

TRUNCATED DIRICHLET DISTRIBUTION

243

the underlying joint probabilities π = (π1 , π2 , π3 , π4 ) of pair counts via the observed data s = (s1 , s2 , s3 , s4 ) of the apparent joint probabilities p = (p1 , p2 , p3 , p4 ) of pair counts. Viana (1994) assumed that the matrix  of classification error probabilities is given by ⎛ ⎞ 0.350 0.014 0.050 0.002 ⎜0.350 0.686 0.050 0.098⎟ ⎜ ⎟ =⎜ ⎟. ⎝0.150 0.006 0.450 0.018⎠ 0.150 0.294 0.450 0.882 Now we assume that only the maximum entry and the minimum entry in each row of  are known and the other entries are unknown. To illustrate the proposed approach, in this subsection we focus on the estimation of p instead of π. According to (7.13), each component of p falls within the range of the corresponding row of  and the domain Tn (a, b) should be T4 (a, b) = {p: 0.002 ≤ p1 ≤ 0.350, 0.050 ≤ p2 ≤ 0.686,

0.006 ≤ p3 ≤ 0.450, 0.150 ≤ p4 ≤ 0.882, 4 i=1

pi = 1},

which is nonempty and consistent with

and

a = (0.002, 0.050, 0.006, 0.150)

(7.37)

b = (0.350, 0.686, 0.450, 0.882).

(7.38)

(b) Frequentist analysis The observed data s = (s1 , s2 , s3 , s4 ) are given in the second column of Table 7.2 for four different cases. From (7.31), the unconstrained MLEs of p are computed and listed in the third column of Table 7.2. We can see that many estimates indicated by ∗ fall outside their proper ranges. For Case 1, the constrained MLE pˆ c is equal to pˆ since all pˆ i for i = 1, . . . , 4 fall into their proper ranges. For Case 2, from Theorem 7.4, we have S = {3, 4}. By setting p3 = b3 = 0.45 and from (7.36), we obtain pˆ 1c = pˆ 2c = pˆ 4c =

11 = 0.1833 and 60

pˆ 3c =

27 = 0.45. 60

It is easy to verify that this pˆ ∗c ∈ T4 (a, b) and the value of the log-likelihood function is (pˆ ∗c ) = −32.0367. Next, by setting p4 = a4 = 0.15, we obtain from

244

DIRICHLET AND RELATED DISTRIBUTIONS

Table 7.2 Frequentist estimates of p for four different cases. s



pˆ c

(ˆpc )

1

1 6 2 21

0.0333 0.2000 0.0667 0.7000

0.0333 0.2000 0.0667 0.7000

−25.9641

2

3 3 21 3

0.1000 0.1000 ∗0.7000 ∗0.1000

0.1833 0.1833 0.4500 0.1833

−32.0367

3

3 21 3 3

0.1000 ∗0.7000 0.1000 ∗0.1000

0.0944 0.6611 0.0944 0.1500

−28.5435

4

21 3 3 3

∗0.7000 0.1000 0.1000 ∗0.1000

0.3500 0.2167 0.2167 0.2167

−35.8095

Case

Source: Fang et al. (2000). Reproduced by permission of John Wiley & Sons (Asia) Pte Ltd. Note: ∗ indicates that the estimate falls outside its proper ranges.

(7.35) that pˆ 1c = pˆ 2c =

17 , 180

pˆ 3c =

119 , 180

and

pˆ 4c =

27 . 180

Since pˆ 3c = 0.6611 > b3 = 0.45, by simultaneously setting p4 = a4 = 0.15 and p3 = b3 = 0.45, we obtain pˆ 1c = pˆ 2c = 0.20,

pˆ 3c = 0.45,

and

pˆ 4c = 0.15.

ˆ ∗∗ ˆ c = pˆ ∗c . It is clear that this pˆ ∗∗ c ∈ T4 (a, b) and (p c ) = −32.1167. Hence, we take p For the other cases, we display pˆ c and the corresponding values of the log-likelihood function in Table 7.2. (c) Bayesian analysis We take the uniform truncated Dirichlet as the prior distribution; that is, di = 1 for all i. Therefore, by (7.15), the posterior of p−4 = (p1 , p2 , p3 ) is distributed as TDirichlet(s1 + 1, s2 + 1, s3 + 1; s4 + 1; a, b)

(7.39)

TRUNCATED DIRICHLET DISTRIBUTION

245

Table 7.3 Bayesian estimates of p for two different cases. Case 1

s

E(p|s)

std(p|s)

Case

s

E(p|s)

std(p|s)

1 6 2 21

0.0589 0.2061 0.0880 0.6470

0.0397 0.0682 0.0479 0.0807

3

3 21 3 3

0.1175 0.6041 0.1176 0.1608

0.0541 0.0585 0.0544 0.0372

Source: Fang et al. (2000). Reproduced by permission of John Wiley & Sons (Asia) Pte Ltd.

on V3 (a, b) with a and b given by (7.37) and (7.38) respectively. By Theorem 7.3, we first generate m = 10 000 i.i.d. samples ( ) ( )  (p( ) 1 , p2 , p3 ) ,

= 1, . . . , m,

( ) ( ) from (7.39) using (n − 1) × m = 3×10 000 random numbers u( ) 1 , u2 , u3 . Next, we compute the first-order and the second-order posterior moments E(pi |s) and E(p2i |s) for i = 1, . . . , 4. Table 7.3 lists the posterior means and standard deviations of p for two different cases.

7.7 Application to uniform design of experiments with mixtures

Experimental designs with mixtures play an important role in the chemical, rubber, food, materials, and pharmaceutical industries. Such designs must be constructed on the closed simplex Tn. Cornell (1990) gave a comprehensive and systematic review of this topic from the optimal-design point of view. In particular, he discussed optimal designs for mixture experiments under both lower- and upper-bound restrictions; that is, the design region is the convex polyhedron Tn(a, b). Wang and Fang (1996) proposed a uniform design of experiments with restricted mixtures by generating a set of points uniformly scattered on Tn(a, b) with the acceptance–rejection method. Fang and Yang (2000) employed the conditional sampling method to generate the uniform distribution on Tn(a, b). The key is to find a stochastic representation for the random vector x ∼ U(Tn(a, b)). Since the uniform distribution U(Tn(a, b)) is a special case of the truncated Dirichlet distribution TDirichlet_n(γ; a, b), the stochastic representation of x is given by Theorem 7.3 with γ = 1_n.

Example 7.2 (Uniform design of experiments with restricted mixtures). Fang and Yang (2000) considered a four-factor experiment with restricted mixtures for a chemical product. The design region is assumed to be

    T4(a, b) = { x: 0.60 ≤ x1 ≤ 0.85, 0.20 ≤ x2 ≤ 0.40, 0.05 ≤ x3 ≤ 0.10, 0.0001 ≤ x4 ≤ 0.000 15, ∑_{i=1}^4 x_i = 1 }.

To remove superfluous constraints in T4(a, b), using (7.18) and (7.19), we obtain a consistent domain T4(a*, b*), where a* = a and b* = (0.7499, 0.3499, 0.10, 0.000 15). Suppose that we want to generate 11 experimental points uniformly scattered on the domain T4(a*, b*). By the method of good lattice points (Fang and Wang, 1994) with the generating vector (1, 3, 9), we first generate 11 points u1, . . . , u11 in the cube (0, 1)^3. For the ℓ-th point

    u_ℓ = (u_1^(ℓ), u_2^(ℓ), u_3^(ℓ))⊤,   ℓ = 1, . . . , 11,

we then apply Theorem 7.3 with n = 4 and γ = 1_4 to obtain 11 experimental points:

    x1  = (0.6315, 0.3072, 0.0611, 0.000 102 3),
    x2  = (0.6512, 0.2740, 0.0747, 0.000 106 8),
    x3  = (0.6659, 0.2456, 0.0884, 0.000 111 4),
    x4  = (0.7141, 0.2336, 0.0522, 0.000 115 9),
    x5  = (0.7282, 0.2061, 0.0656, 0.000 120 5),
    x6  = (0.6165, 0.3042, 0.0792, 0.000 125 0),
    x7  = (0.6340, 0.2729, 0.0930, 0.000 129 5),
    x8  = (0.6716, 0.2716, 0.0567, 0.000 134 1),
    x9  = (0.6885, 0.2413, 0.0703, 0.000 138 6),
    x10 = (0.7002, 0.2158, 0.0838, 0.000 143 1),

and x11 = (0.6046, 0.2975, 0.0977, 0.000 147 7).
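The bound-tightening step above (passing from (a, b) to the consistent (a*, b*)) can be checked numerically. The R sketch below uses one common consistency adjustment, b_i* = min(b_i, 1 − ∑_{j≠i} a_j) and a_i* = max(a_i, 1 − ∑_{j≠i} b_j); the book's own formulas (7.18) and (7.19) are not reproduced in this section, so this is an assumed reconstruction that happens to reproduce the values quoted above.

    # Sketch (assumed adjustment, not (7.18)-(7.19) verbatim): tighten the
    # componentwise bounds of T_n(a, b) so that no constraint is superfluous.
    tighten_bounds <- function(a, b) {
      n <- length(a)
      a.star <- sapply(1:n, function(i) max(a[i], 1 - sum(b[-i])))
      b.star <- sapply(1:n, function(i) min(b[i], 1 - sum(a[-i])))
      list(a = a.star, b = b.star)
    }
    a <- c(0.60, 0.20, 0.05, 0.0001)
    b <- c(0.85, 0.40, 0.10, 0.00015)
    tighten_bounds(a, b)   # b* = (0.7499, 0.3499, 0.10, 0.00015), a* = a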



8

Other related distributions

In Section 8.1 we discuss in detail a generalized Dirichlet distribution, including its properties, the associated techniques for inference, and an effective importance density. In Section 8.2 we introduce a so-called hyper-Dirichlet distribution motivated by certain statistical analyses of sporting data. In Sections 8.3 and 8.4 we introduce two new distributions: the scaled Dirichlet distribution and the mixed Dirichlet distribution, respectively. Their distributional properties are investigated. Finally, we briefly review the Liouville and generalized Liouville distributions. All these distributions include the Dirichlet distribution as a special member.

8.1 The generalized Dirichlet distribution

The generalized Dirichlet distribution (Dickey et al., 1987; Tian et al., 2003) usually appears in statistical inference for incomplete categorical data, misclassified data, and screening/diagnostic test data.

8.1.1 Density function

Motivated by the likelihood function (8.4) and the posterior density (8.8), to be discussed later, we first define the generalized Dirichlet distribution as follows.

Definition 8.1 A random vector x ∈ Tn is said to follow a generalized Dirichlet distribution if the joint density function of x_{−n} ∈ V_{n−1} is given by

    GDirichlet_{n,q}(x_{−n}|a, b; Δ) = c_gD^{−1} · gD(x_{−n}|a, b; Δ)
                                     = c_gD^{−1} ∏_{i=1}^n x_i^{a_i−1} ∏_{j=1}^q ( ∑_{i=1}^n δ_{ij} x_i )^{b_j},    (8.1)


where c_gD is the normalizing constant,1 gD(x_{−n}|a, b; Δ) denotes the kernel of the joint density, a = (a1, . . . , an)⊤ is a positive parameter vector, b = (b1, . . . , bq)⊤ is a nonnegative parameter vector, and Δ = (δ_{ij}) is a known2 n × q matrix with δ_{ij} = 0 or 1 and at least one nonzero element in each column. We will write x ∼ GDirichlet_{n,q}(a, b; Δ) on Tn or x_{−n} ∼ GDirichlet_{n,q}(a, b; Δ) on V_{n−1} accordingly. ¶

Remark 8.1 In particular, if we let b = 0 in (8.1), the corresponding generalized Dirichlet distribution GDirichlet_{n,q}(a, 0; Δ) reduces to the Dirichlet distribution Dirichlet_n(a). In addition, the GDD introduced in Chapter 3 and the NDD introduced in Chapter 4 are also two special and important members of the family of generalized Dirichlet distributions. Finally, the hyper-Dirichlet distribution (see (8.18), to be defined later) is also a special case of the generalized Dirichlet distribution (8.1) if b = (b1, . . . , bq)⊤ is allowed to be a real parameter vector. ¶

(a) The normalizing constant

The normalizing constant c_gD can be calculated by

    c_gD ≐ c_gD(a, b; Δ) = ∫_{V_{n−1}} gD(x_{−n}|a, b; Δ) dx_{−n}.    (8.2)

Generally speaking, the normalizing constant c_gD does not have an explicit expression except for three special cases; that is, the Dirichlet distribution, the GDD, and the NDD. Dickey et al. (1987) noted that (8.2) has a close relationship with Carlson's multiple hypergeometric function (Carlson, 1977; Dickey, 1983). One method, proposed by Kadane (1985), is multinomial expansion of the integrand in (8.2); another is Laplace's integral method (Tierney and Kadane, 1986). However, neither is convenient for users. Based on the stochastic representations of the Dirichlet distribution, GDD, and NDD, Tian et al. (2003) suggested an importance sampling approach with a feasible importance density for approximating the normalizing constant. For the importance sampling approach, one first needs to find an appropriate importance density h(·) defined on V_{n−1}. Then, c_gD can be approximated by

    ĉ_gD(a, b; Δ) = (1/K) ∑_{k=1}^K gD(x_{−n}^{(k)}|a, b; Δ) / h(x_{−n}^{(k)}),    (8.3)

1 Here, we use the notation c_gD instead of c_GD to denote the normalizing constant of a generalized Dirichlet distribution, since c_GD was used to represent the normalizing constant of a GDD in Chapter 3.
2 In practice, all elements in the matrix Δ can be identified from the partially observed counts; for example, see (8.6).


where x_{−n}^{(1)}, . . . , x_{−n}^{(K)} is a random sample of size K from h(·). Feasible choices of h(·) include the Dirichlet distribution Dirichlet_n(a), the GDD with suitable parameter vectors a and b, and the NDD with corresponding parameter vectors a and b. The choice of an effective importance density will be discussed in Section 8.1.4.
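As an illustration of (8.3), the following R sketch approximates c_gD with the Dirichlet_n(a) importance density. With that choice the ratio gD/h reduces to B_n(a) ∏_j (∑_i δ_{ij} x_i)^{b_j}, so only the 'grouped' factors need to be averaged. The helper name and the use of normalized gamma variates to simulate Dirichlet draws are illustrative assumptions, not code from the original text.

    # Sketch: importance-sampling estimate of c_gD in (8.3) with h = Dirichlet_n(a).
    # Then gD(x)/h(x) = B_n(a) * prod_j (sum_i delta_ij x_i)^(b_j).
    approx_cgd <- function(a, b, Delta, K = 1e5) {
      n <- length(a)
      g <- matrix(rgamma(K * n, shape = rep(a, each = K)), nrow = K)  # Gamma(a_i, 1) draws
      x <- g / rowSums(g)                                             # rows ~ Dirichlet_n(a)
      logBn <- sum(lgamma(a)) - lgamma(sum(a))                        # log B_n(a)
      logw  <- log(x %*% Delta) %*% b                                 # sum_j b_j log(sum_i delta_ij x_i)
      mean(exp(logw + logBn))                                         # Monte Carlo average of gD/h
    }

For the crime-survey posterior (8.9) discussed below, one would take a as the vector of exponents n_i + α_i, b = (n12, n34, n13, n24), and Delta as the matrix in (8.6).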

(b) The motivation

Consider a contingency table of n cells or categories with incomplete observations. Let Yobs = {y1, . . . , yn; y1*, . . . , yq*}, where y1, . . . , yn denote the fully categorized counts and y1*, . . . , yq* the partially categorized counts. Furthermore, let θ ∈ Tn be the cell probability vector of interest. The likelihood function L(θ|Yobs) consists of two parts:

• ∏_{i=1}^n θ_i^{y_i} – the product of powers of the cell probabilities corresponding to the completely categorized observations for each category; and
• ∏_{j=1}^q ( ∑_{i=1}^n δ_{ij} θ_i )^{y_j*} – the product of powers of linear combinations of cell probabilities over those sets of categories which cannot be distinguished.

In other words, the likelihood function is given by

    L(θ|Yobs) ∝ ( ∏_{i=1}^n θ_i^{y_i} ) · ∏_{j=1}^q ( ∑_{i=1}^n δ_{ij} θ_i )^{y_j*}.    (8.4)

If we treat θ = (θ1, . . . , θn)⊤ as a random vector, then θ ∼ GDirichlet_{n,q}(a, b; Δ), where a = (y1 + 1, . . . , yn + 1)⊤ and b = (y1*, . . . , yq*)⊤. For example, in Example 1.5 and Table 1.5, under the assumption of MAR, the observed-data likelihood function is proportional to

    ( ∏_{i=1}^4 θ_i^{n_i} ) · (θ1 + θ2)^{n12} (θ3 + θ4)^{n34} (θ1 + θ3)^{n13} (θ2 + θ4)^{n24}.    (8.5)

Similarly, we observe that θ ∼ GDirichlet_{n,q}(a, b; Δ), where n = q = 4, a = (n1 + 1, . . . , n4 + 1)⊤, b = (n12, n34, n13, n24)⊤, and

    Δ = (δ_{ij}) = ( 1 0 1 0 )
                   ( 1 0 0 1 )
                   ( 0 1 1 0 )
                   ( 0 1 0 1 ).    (8.6)


(c) The mixed moment and other properties

If x ∼ GDirichlet_{n,q}(a, b; Δ) on Tn, the mixed moment of x can be expressed as a ratio of normalizing constants:

    E( ∏_{i=1}^n x_i^{r_i} ) = c_gD(a + r, b; Δ) / c_gD(a, b; Δ),    (8.7)

where r = (r1 , . . . , rn ). An analytical solution to the mode of the generalized Dirichlet density (8.1) is not generally available. To our knowledge, a stochastic representation of the random vector x is not available yet.

8.1.2 Statistical inferences

We note that the likelihood function in (8.4) is a special case of the likelihood function in (4.27). Therefore, the EM algorithm presented in Section 4.7.2 can be used to find the MLE of θ. Furthermore, the DA Gibbs sampling and the exact IBF sampling introduced in Section 4.8.2 can be utilized to compute the posterior moments of θ, where a Dirichlet distribution may be taken as the conjugate prior distribution.

8.1.3 Analyzing the crime survey data

In Section 3.9, we analyzed the crime survey data reported in Table 1.6 using a Bayesian approach under the nonignorable missing mechanism. In this subsection we consider the situation of an ignorable missing mechanism or MAR; that is, the elements in each column of the conditional probability matrix (λ_{ij}) are equal. Removing these {λ_{ij}} from the likelihood function in (3.68), we have

    L(θ|Yobs) ∝ ( ∏_{i=1}^4 θ_i^{n_i} ) · (θ1 + θ2)^{n12} (θ3 + θ4)^{n34} (θ1 + θ3)^{n13} (θ2 + θ4)^{n24},    (8.8)

where Yobs = {n1, . . . , n4; n12, n34; n13, n24} denotes the observed frequencies and θ = (θ1, . . . , θ4)⊤ ∈ T4 is the cell probability vector.

(a) MLEs via the EM algorithm

Schafer (1997: 45) analyzed the same crime survey dataset by using an EM algorithm based on a Dirichlet DA (i.e., the augmented likelihood function is of a Dirichlet density form). We note that the two likelihood functions in (8.8) and (8.5), or in (3.3), are identical under the MAR mechanism. Therefore, we can employ the EM algorithm introduced in Section 3.8.1(a), which is based on a grouped Dirichlet DA (i.e., the augmented likelihood function is of a grouped Dirichlet density form). The E-step and M-step are specified by (3.49) and (3.47) respectively.
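To make the iteration concrete, here is a minimal R sketch of an EM algorithm of the Dirichlet-DA type mentioned above: each supplemental margin is allocated to its two cells in proportion to the current θ in the E-step, and θ is renormalized in the M-step. It is a generic sketch, not a transcription of (3.47) and (3.49); the fully classified counts n1–n4 come from Table 1.6, which is not reproduced in this section, so they are left as placeholders.

    # Sketch: EM for the MAR likelihood (8.8).
    em_crime <- function(n, n12, n34, n13, n24, theta = rep(0.25, 4), tol = 1e-8) {
      N <- sum(n) + n12 + n34 + n13 + n24
      repeat {
        e1 <- n[1] + n12 * theta[1] / (theta[1] + theta[2]) + n13 * theta[1] / (theta[1] + theta[3])
        e2 <- n[2] + n12 * theta[2] / (theta[1] + theta[2]) + n24 * theta[2] / (theta[2] + theta[4])
        e3 <- n[3] + n34 * theta[3] / (theta[3] + theta[4]) + n13 * theta[3] / (theta[1] + theta[3])
        e4 <- n[4] + n34 * theta[4] / (theta[3] + theta[4]) + n24 * theta[4] / (theta[2] + theta[4])
        theta.new <- c(e1, e2, e3, e4) / N          # M-step: renormalize expected counts
        if (max(abs(theta.new - theta)) < tol) return(theta.new)
        theta <- theta.new
      }
    }
    # em_crime(n = c(n1, n2, n3, n4), n12 = 33, n34 = 9, n13 = 31, n24 = 7)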


Using θ^(0) = 0.25 · 1_4 as the initial values, we obtain the following estimated cell probabilities after five iterations:

    (θ̂1, θ̂2, θ̂3, θ̂4) = (0.697 123, 0.098 630 4, 0.135 783, 0.068 463 2).

Another parameter of interest is the odds ratio

    ψ = (θ1 θ4) / (θ2 θ3),

which is often employed to measure the association of two binary variables in a contingency table. Note that ψ being larger than one implies that victimization is chronic (Kadane, 1985). The MLE of ψ is given by

    ψ̂ = (θ̂1 θ̂4) / (θ̂2 θ̂3) = 3.563 77.

In other words, households that were victimized during the first period appear to be more than 3.56 times as likely, on the odds scale, to have been victimized in the second period than households that were crime free in the first period.

(b) Bayesian estimates via importance sampling

We adopt Dirichlet_4(α) as the prior distribution of θ = (θ1, . . . , θ4)⊤. The resulting posterior is a generalized Dirichlet distribution with kernel

    ∏_{i=1}^4 θ_i^{n_i+α_i−1} · (θ1 + θ2)^{n12} (θ3 + θ4)^{n34} (θ1 + θ3)^{n13} (θ2 + θ4)^{n24}.    (8.9)

Consequently, the posterior moments of {θ_i}_{i=1}^4 can be calculated by (8.7), (8.3), and (8.2). The importance density3 can be taken as

    h(θ) = c_GD^{−1} · ∏_{i=1}^4 θ_i^{n_i+α_i−1} · (θ1 + θ2)^{n12} (θ3 + θ4)^{n34},    (8.10)

which is a GDD with two partitions. In addition, the posterior moments of the odds ratio ψ are also of interest. In fact, the posterior mean E(ψ|Yobs) and the posterior

3 Alternatively, we may take the GDD

    h*(θ) = c_GD^{*−1} · ∏_{i=1}^4 θ_i^{n_i+α_i−1} · (θ1 + θ3)^{n13} (θ2 + θ4)^{n24}

as the importance density. From Table 1.6, since n12 = 33 > 31 = n13 and n34 = 9 > 7 = n24, the importance density h(θ) specified by (8.10) is more efficient than h*(θ) when the importance sampling approach is used. For more details on the choice of an effective importance density, see Section 8.1.4.


Table 8.1 Posterior means and standard deviations for the crime survey data under the ignorable missing mechanism.^a

Prior                                θ1               θ2               θ3               θ4               ψ
1 Uniform, 1_4                       0.6914 (0.0125)  0.0953 (0.0182)  0.1354 (0.0104)  0.0779 (0.0130)  4.4153 (1.2403)
2 Jeffreys, 0_4                      0.6927 (0.0131)  0.0943 (0.0183)  0.1350 (0.0102)  0.0780 (0.0139)  4.3811 (1.2655)
3 Haldane, 0.5 · 1_4                 0.6920 (0.0127)  0.0953 (0.0183)  0.1352 (0.0103)  0.0775 (0.0134)  4.4006 (1.2515)
4 Information, (7.5, 1, 1, 0.5)      0.6930 (0.0131)  0.0957 (0.0186)  0.1347 (0.0099)  0.0766 (0.0141)  4.3540 (1.2779)
5 Uniform*, 4 · 1_4                  0.6880 (0.0149)  0.0956 (0.0180)  0.1370 (0.0129)  0.0794 (0.0157)  4.4186 (1.2297)
6 Experts, (10, 5, 5, 10)            0.6899 (0.0126)  0.0891 (0.0118)  0.1355 (0.0099)  0.0855 (0.0148)  4.9954 (0.7231)

Source: Tian et al. (2003). Reproduced by permission of the Institute of Statistical Science, Academia Sinica.
a The posterior standard deviations are in parentheses.

standard deviation

    std(ψ|Yobs) = sqrt( E(ψ²|Yobs) − {E(ψ|Yobs)}² )

can be computed by (8.7) with r = (1, −1, −1, 1)⊤ and r = (2, −2, −2, 2)⊤ respectively. We consider six Dirichlet_4(α) prior distributions (Kadane, 1985):

1. A uniform prior with α = 1_4.
2. A Haldane prior with α = 0_4.
3. A Jeffreys prior with α = 0.5 · 1_4.
4. A Kadane information prior with α = (7.5, 1, 1, 0.5)⊤.
5. The uniform prior in Table 3.6 with α = 4 · 1_4.
6. The experts prior with α = (10, 5, 5, 10)⊤.

Table 8.1 displays the corresponding posterior means and standard deviations. It is noteworthy that the posterior means are robust to the choices of the prior.
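The posterior summaries of ψ in Table 8.1 can be reproduced (up to Monte Carlo error) by combining the ratio formula (8.7) with the importance-sampling estimator (8.3). The R sketch below reuses the hypothetical helper approx_cgd sketched after (8.3); the argument names are illustrative assumptions, not the book's code.

    # Sketch: posterior mean and std of psi via (8.7) with r = (1,-1,-1,1) and (2,-2,-2,2).
    # a.post = n_i + alpha_i, b.post = c(n12, n34, n13, n24), Delta as in (8.6).
    post_psi <- function(a.post, b.post, Delta, K = 1e5) {
      c0 <- approx_cgd(a.post, b.post, Delta, K)
      m1 <- approx_cgd(a.post + c(1, -1, -1, 1), b.post, Delta, K) / c0   # E(psi | Yobs)
      m2 <- approx_cgd(a.post + c(2, -2, -2, 2), b.post, Delta, K) / c0   # E(psi^2 | Yobs)
      c(mean = m1, sd = sqrt(m2 - m1^2))
    }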

8.1.4 Choice of an effective importance density

We next consider the approximation of the normalizing constant in (8.3). In importance sampling, it is crucial to find an appropriate importance density h(·) which mimics the target function gD(·|a, b; Δ). The multivariate split normal/Student importance density suggested by Geweke (1989) seems to be infeasible for the present


situation since θ belongs to the hyperplane Tn. In Section 8.1.1(a), we suggested three feasible choices for h(·). Two issues emerge: (i) What is a natural class of importance densities? (ii) Which member in the class is the most effective? Tian et al. (2003) partially addressed the two issues in the framework of Bayesian inference. Let the observed posterior f(θ|Yobs) be a generalized Dirichlet distribution assuming the form in (8.9). Below, we shall discuss how to find an effective importance density like (8.10). Using the generic symbol f for all three functions in the function-wise formula (1.43) and choosing any particular z = z0, we obtain

    f(θ|Yobs) = c^{−1} · f(θ|Yobs, z0) / f(z0|Yobs, θ)
              = { ∫ [ f(θ|Yobs, z0) / f(z0|Yobs, θ) ] dθ }^{−1} · f(θ|Yobs, z0) / f(z0|Yobs, θ).    (8.11)

A natural class of importance densities is the family of complete-data posterior densities { f(θ|Yobs, z0): z0 ∈ S(z|Yobs) }, where S(z|Yobs) denotes the conditional support of z|Yobs. However, the efficiency for approximating the normalizing constant

    c = ∫ [ f(θ|Yobs, z0) / f(z0|Yobs, θ) ] dθ

by importance sampling depends on how well the importance density f(θ|Yobs, z0) mimics the target function

    f(θ|Yobs, z0) / f(z0|Yobs, θ) = c · f(θ|Yobs).

Let θ̃obs denote the mode of the observed posterior density f(θ|Yobs). The EM algorithm shows that f(θ|Yobs, z0) and f(θ|Yobs) share the same mode θ̃obs whenever

    z0 = E(z|Yobs, θ̃obs).    (8.12)

Thus, there is a substantial amount of overlapping area under the importance density and the target function. In other words, f (θ|Yobs , z0 ), with z0 given by (8.12), is heuristically an effective importance density. Next, we use the crime survey data under the assumption of an ignorable missing mechanism to illustrate the idea. Let the observed data in Section 8.1.3(b) be


Yobs = {n1, . . . , n4; n12, n34, n13, n24}. Note that the observed posterior density f(θ|Yobs) is given by (8.9) up to a normalizing constant. We introduce a latent vector z = (z1, z2)⊤ so that the complete-data posterior is a grouped Dirichlet density

    f(θ|Yobs, z) ∝ θ1^{n1+z1+α1−1} θ2^{n2+z2+α2−1} θ3^{n3+n13−z1+α3−1} θ4^{n4+n24−z2+α4−1} (θ1 + θ2)^{n12} (θ3 + θ4)^{n34},    (8.13)

and the conditional predictive density is given by

    f(z|Yobs, θ) = Binomial( z1 | n13, θ1/(θ1 + θ3) ) × Binomial( z2 | n24, θ2/(θ2 + θ4) ).    (8.14)

The EM algorithm based on (8.13) and (8.14) can be used to compute the posterior mode θ˜ obs and z0 via (8.12). Hence, the most efficient importance density is f (θ|Yobs , z0 ). Comparing f (θ|Yobs , z0 ) with (8.10), we notice that both of them belong to the same class of importance densities. Therefore, the importance density in (8.10) is feasible but not the best, while f (θ|Yobs , z0 ) is the best at the expense of running an EM algorithm.
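Given the posterior mode θ̃obs (for example, from the EM iteration sketched in Section 8.1.3(a)), the value z0 in (8.12) follows directly from the binomial means in (8.14). A minimal R sketch of this step; the mode vector theta.mode is assumed to be available and its name is illustrative:

    # Sketch: compute z0 = E(z | Yobs, theta.mode) from (8.14), as required by (8.12).
    n13 <- 31; n24 <- 7                                   # supplemental margins (Table 1.6)
    z0 <- c(n13 * theta.mode[1] / (theta.mode[1] + theta.mode[3]),
            n24 * theta.mode[2] / (theta.mode[2] + theta.mode[4]))
    # f(theta | Yobs, z0) is then the grouped Dirichlet density (8.13) evaluated at z = z0,
    # which serves as the effective importance density.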

8.2 The hyper-Dirichlet distribution

Hankin (2010) discussed a generalization of the Dirichlet distribution, namely the hyper-Dirichlet, for which various types of incomplete observations can be incorporated. He also introduced the hyperdirichlet R package. We first present three motivating examples from the statistical analysis of sporting data. Then, we give a formal definition and several remarks.

8.2.1 Motivating examples

Example 8.1 (Pairwise comparison in the sporting world). In the sporting world, there exist many repeated pairwise comparisons between two of a larger number of 'players' (Jech, 1983). The general problem of comparing n players with skill probability vector θ = (θ1, . . . , θn)⊤ ∈ Tn potentially involves n(n − 1)/2 pairwise comparisons. A match between players i and j constitutes a Bernoulli trial with nij + nji trials in total and success probability θi/(θi + θj), where nij denotes the number of games won by player i. Consider three chess players P1, P2, and P3 who compete in pairs. Table 8.2 shows the results of 88 chess matches up to 2001. These players form a circular triad (see Knezek et al. (1998)) with repeated comparisons/matches being allowed.


Table 8.2 Results of 88 chess matches among three players.^a

Match       P1 (Topalov)      P2 (Anand)        P3 (Karpov)       Total
P1 vs. P2   22 (n12)          13 (n21)          –                 35 (n12 + n21)
P2 vs. P3   –                 23 (n23)          12 (n32)          35 (n23 + n32)
P3 vs. P1   8 (n13)           –                 10 (n31)          18 (n13 + n31)
Total       30 (n12 + n13)    36 (n21 + n23)    22 (n31 + n32)    88 (N)

Source: Hankin (2010). Reproduced by permission of the American Statistical Association.
a N = n12 + n13 + n21 + n23 + n31 + n32. Entries show the number of games won up to 2001 (draws are discarded). Topalov beats Anand 22–13; Anand beats Karpov 23–12; and Karpov beats Topalov 10–8.

Let Yobs = {n12, n21, n23, n32, n13, n31} denote the observed counts. The observed likelihood function of θ is

    L(θ|Yobs) ∝ ∏_{i<j} ( θi/(θi + θj) )^{n_ij} ( θj/(θi + θj) )^{n_ji}.    (8.15)

If b_{{i}} > 0 for i = 1, . . . , n and the other bJ ≥ 0, the corresponding hyper-Dirichlet distribution in (8.18) can be viewed as a special case of the generalized Dirichlet distribution in (8.1). ¶

The normalizing constant of the density (8.18) is given by

    c_HD ≐ c_HD(b) = ∫_{V_{n−1}} ( ∏_{i=1}^n x_i^{−1} ) ∏_{J⊆N} ( ∑_{i∈J} x_i )^{b_J} dx_{−n}.    (8.19)

This is given by the function B() in the package.4 If the distribution is the Dirichlet, GDD, or NDD, the closed-form expression for the normalizing constant is used; otherwise numerical methods are used.

4 The hyperdirichlet R package is available from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=hyperdirichlet.


Remark 8.3 Since bJ can assume any real value, say a negative integer, the kernel

    ( ∏_{i=1}^n x_i^{−1} ) ∏_{J⊆N} ( ∑_{i∈J} x_i )^{b_J}

does not necessarily result in a density function (8.18). For example, for the likelihood function (8.15), the corresponding normalizing constant is 1.474 63 × 10−28 ≈ 0. ¶ Although the hyperdirichlet R package is available, further investigation is necessary for the distributional properties of the hyper-Dirichlet distribution (especially for the case that bJ is a negative integer), including marginal distributions, conditional distributions, stochastic representation, mode, and the corresponding statistical inference methods.

8.3 The scaled Dirichlet distribution

Aitchison (2003: 305–306) and Cabanlit et al. (2006) provided a generalization of the Dirichlet distribution by introducing a new set of scale parameters. The new distribution, called the scaled Dirichlet, is more flexible than the Dirichlet distribution and can be used to model different real-life situations and phenomena.

8.3.1 Two motivations

The first motivation is related to the shape of the Dirichlet density. Let x ∼ Dirichlet_n(a) on Tn, where the parameter vector a determines the shape of the distribution. For illustration purposes, we first consider the case n = 2. When a1 = a2 = 1, we obtain the uniform distribution on (0, 1). When a1 = a2 < 1, Figure 1.1(i) shows that the beta density function is U-shaped on (0, 1). When a1 = a2 > 1, Figure 1.1(ii) indicates that the beta density is similar in shape to a normal density on (0, 1). When a1 < 1 and a2 > 1, we obtain a decreasing beta density on (0, 1), as illustrated by Figure 1.1(iii). When a1 > 1 and a2 < 1, we have an increasing density on (0, 1), as shown in Figure 1.1(iv). For n > 2, Figure 2.1 shows that the distribution varies in shape for different values of and restrictions on the ai. By introducing another set of parameters, further useful probability models can be obtained.

The second motivation comes from the interpretation of the correlation coefficient of any two components of the Dirichlet random vector. From (2.6), it can be seen that the relative magnitude of ai specifies the mean of xi. Moreover, for a fixed value of E(xi) = ai/a+, the variance

    Var(xi) = E(xi){1 − E(xi)} / (1 + a+)


decreases with a+, where the overall magnitude a+ represents the strength of the prior information when the Dirichlet distribution is adopted as a prior for the parameters of a multinomial distribution. There are no other parameters left to model important aspects of the prior belief. For example, the correlation coefficient

    Corr(xi, xj) = Cov(xi, xj) / sqrt{ Var(xi) · Var(xj) }
                 = − sqrt{ ai aj / [(a+ − ai)(a+ − aj)] }
                 = − sqrt{ E(xi)E(xj) / [{1 − E(xi)}{1 − E(xj)}] }

is entirely determined by E(xi) and E(xj). Therefore, the Dirichlet distribution can only represent a limited range of prior beliefs.

8.3.2 Stochastic representation and density function

Definition 8.3 If x ∼ Dirichlet_n(a) on Tn and

    w_i =_d  b_i x_i / ( ∑_{j=1}^n b_j x_j ),   i = 1, . . . , n,    (8.20)

then the distribution of w = (w1, . . . , wn)⊤ is called the scaled Dirichlet distribution with shape parameter vector a and scale parameter vector b = (b1, . . . , bn)⊤, where5 0 < bi < 1 for i = 1, . . . , n. We will write w ∼ SDirichlet_n(a, b) on Tn. ¶

In particular, when b1 = · · · = bn = b, we have w = x; that is, the scaled Dirichlet distribution reduces to the Dirichlet distribution. Thus, the family of scaled Dirichlet distributions is more flexible than the family of Dirichlet distributions.

To derive the joint density of w, we need to find the Jacobian. The inverse transformation of (8.20) is given by

    x_i = b_i^{−1} w_i / ( ∑_{j=1}^n b_j^{−1} w_j )
        = b_i^{−1} w_i / { ∑_{j=1}^{n−1} b_j^{−1} w_j + b_n^{−1} ( 1 − ∑_{j=1}^{n−1} w_j ) },   i = 1, . . . , n.

5 In general, we assume only that all bi > 0. If there exists at least one i such that bi ≥ 1, we can replace each bi by bi* = bi / ∑_{j=1}^n bj for i = 1, . . . , n. Clearly, we have 0 < bi* < 1.


Let τ = ∑_{j=1}^n b_j^{−1} w_j and d = (b_1^{−1}, . . . , b_{n−1}^{−1})⊤. Since

    ∂x_i/∂w_i = { b_i^{−1} τ − b_i^{−1} w_i (b_i^{−1} − b_n^{−1}) } / τ²,   i = 1, . . . , n − 1,   and
    ∂x_i/∂w_j = − b_i^{−1} w_i (b_j^{−1} − b_n^{−1}) / τ²,   i ≠ j,

it can be readily verified that the Jacobian is

    J(x → w_{−n}) = | ∂(x_1, . . . , x_{n−1}) / ∂(w_1, . . . , w_{n−1}) |
                  = τ^{−2(n−1)} · | τ · diag(d) − (b_1^{−1} w_1, . . . , b_{n−1}^{−1} w_{n−1})⊤ (d − b_n^{−1} 1_{n−1})⊤ |
                  = τ^{−2(n−1)} · τ^{n−2} ∏_{i=1}^n b_i^{−1}
                  = τ^{−n} ∏_{i=1}^n b_i^{−1}.

Here, we use the famous formula |C + y z⊤| = |C| (1 + z⊤ C^{−1} y); see Footnote 14 of Chapter 2 for more details. Thus, the joint density of w_{−n} = (w_1, . . . , w_{n−1})⊤ ∈ V_{n−1} is given by

    SDirichlet_n(w_{−n}|a, b) = [ B_n(a) ∏_{i=1}^n b_i^{a_i} ]^{−1} · ∏_{i=1}^n w_i^{a_i−1} / ( ∑_{i=1}^n b_i^{−1} w_i )^{a_+},    (8.21)

where a_+ = ∑_{i=1}^n a_i.

8.3.3 Some properties

(a) Other stochastic representations

The stochastic representation in (8.20) provides a way to generate the random vector w via the Dirichlet random vector. Alternatively, by applying (2.7), we can connect the scaled Dirichlet random vector with independent gamma variates as follows:

    w_i =_d  b_i y_i / ( ∑_{j=1}^n b_j y_j )  =_d  z_i / ( ∑_{j=1}^n z_j ),   i = 1, . . . , n,    (8.22)

where {y_i}_{i=1}^n are independent Gamma(a_i, 1) variates, {z_i}_{i=1}^n are independent Gamma(a_i, b_i^{−1}) variates, and w is independent of (Aitchison, 2003: 306)

    ∑_{i=1}^n b_i^{−1} z_i ∼ Gamma(a_+, 1).
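Representation (8.22) translates directly into a two-line simulator. The following R sketch (illustrative, not code from the book) draws one observation from SDirichlet_n(a, b) by scaling independent Gamma(a_i, 1) variates:

    # Sketch: simulate one draw from SDirichlet_n(a, b) via (8.22).
    rsdirichlet <- function(a, b) {
      y <- rgamma(length(a), shape = a, rate = 1)  # independent Gamma(a_i, 1) variates
      b * y / sum(b * y)                           # w_i = b_i y_i / sum_j b_j y_j
    }
    rsdirichlet(a = c(2, 3, 4), b = c(0.2, 0.3, 0.5))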

It can be seen that (8.22) gives a more straightforward way of simulating w than (8.20) does. Finally, from (8.22) and (5.8), we obtain

    w_i =_d  b_i v_i / ( ∑_{j=1}^{n−1} b_j v_j + b_n ),   i = 1, . . . , n − 1,   and
    w_n =_d  b_n / ( ∑_{j=1}^{n−1} b_j v_j + b_n ),

where v_{−n} = (v_1, . . . , v_{n−1})⊤ ∼ IDirichlet_n(a) on R_+^{n−1}.

(b) The mode of the scaled Dirichlet density

The logarithm of the density (8.21) is proportional to

    ∑_{i=1}^n (a_i − 1) log(w_i) − a_+ · log( ∑_{i=1}^n b_i^{−1} w_i ).

Letting the first-order partial derivative of the log-density with respect to w_i be zero, for i = 1, . . . , n − 1, we obtain

    (a_i − 1)/w_i − (a_n − 1)/w_n − (b_i^{−1} − b_n^{−1}) a_+ / ( ∑_{j=1}^n b_j^{−1} w_j ) = 0,

or equivalently

    (a_i − 1)/w_i − b_i^{−1} a_+ / ( ∑_{j=1}^n b_j^{−1} w_j ) = (a_n − 1)/w_n − b_n^{−1} a_+ / ( ∑_{j=1}^n b_j^{−1} w_j ) ≐ c,    (8.23)

where c is a constant not depending on the subscript i. In fact, equation (8.23) holds for all i = 1, . . . , n. Let τ = ∑_{j=1}^n b_j^{−1} w_j. From (8.23), we have

    a_i − 1 − b_i^{−1} w_i a_+ / τ = c w_i,

which gives

    ∑_{i=1}^n (a_i − 1) − ( ∑_{i=1}^n b_i^{−1} w_i ) a_+ / τ = c ∑_{i=1}^n w_i


or c = −n. Hence, the mode of the scaled Dirichlet density (8.21) is the solution of the system of equations

    (a_i − 1)/w_i − b_i^{−1} a_+ / ( ∑_{j=1}^n b_j^{−1} w_j ) + n = 0,   i = 1, . . . , n − 1,
    w_1 + · · · + w_n = 1.    (8.24)
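In practice the system (8.24) is solved numerically. A minimal R sketch, assuming an interior mode (all a_i > 1), maximizes the log-kernel of (8.21) directly over the simplex via a softmax reparametrization rather than solving (8.24) itself; the function name is illustrative.

    # Sketch: locate the mode of (8.21) by maximizing its log-kernel with optim().
    sdirichlet_mode <- function(a, b) {
      logkern <- function(eta) {                 # eta in R^n, w = softmax(eta)
        w <- exp(eta) / sum(exp(eta))
        sum((a - 1) * log(w)) - sum(a) * log(sum(w / b))
      }
      fit <- optim(rep(0, length(a)), logkern, control = list(fnscale = -1))
      exp(fit$par) / sum(exp(fit$par))           # mode on the simplex
    }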

(c) The moments of the ratio and reciprocal of components

From (2.5), it is easy to verify that

    E(x_j x_i^{−1}) = a_j / (a_i − 1),   a_i > 1,  j ≠ i,    (8.25)
    E(x_j² x_i^{−2}) = a_j (a_j + 1) / {(a_i − 1)(a_i − 2)},   a_i > 2,  j ≠ i,    (8.26)
    E(x_j x_k x_i^{−2}) = a_j a_k / {(a_i − 1)(a_i − 2)},   a_i > 2,  j ≠ i,  k ≠ i.    (8.27)

Hence, it follows from (8.20) and (8.25)–(8.27) that

    E(w_j/w_i) = b_j b_i^{−1} E(x_j x_i^{−1}) = b_j a_j / {b_i (a_i − 1)},   a_i > 1,  j ≠ i,    (8.28)

    E(1/w_i) = 1 + ∑_{j=1, j≠i}^n b_j b_i^{−1} E(x_j x_i^{−1})
             = 1 + ∑_{j=1, j≠i}^n b_j a_j / {b_i (a_i − 1)}
             = 1 + (b⊤a − b_i a_i) / {b_i (a_i − 1)}  ≐  1 + μ_i,   a_i > 1,    (8.29)

and

    Var(1/w_i) = E{ ( 1 + ∑_{j≠i} b_j b_i^{−1} x_j x_i^{−1} )² } − (1 + μ_i)²
               = E{ ( ∑_{j≠i} b_j b_i^{−1} x_j x_i^{−1} )² } − μ_i²
               = ∑_{j≠i} b_j² b_i^{−2} E(x_j² x_i^{−2}) + 2 ∑_{j≠i} ∑_{k≠i, k>j} b_j b_k b_i^{−2} E(x_j x_k x_i^{−2}) − μ_i²
               = ∑_{j≠i} b_j² a_j (a_j + 1) / {b_i² (a_i − 1)(a_i − 2)} + 2 ∑_{j≠i} ∑_{k≠i, k>j} b_j a_j b_k a_k / {b_i² (a_i − 1)(a_i − 2)} − μ_i²,    (8.30)

where ai > 2.
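Formulas (8.28)–(8.30) are easy to check by simulation. The short R sketch below compares the Monte Carlo value of E(1/w_i) with 1 + μ_i of (8.29), reusing the simulator rsdirichlet sketched after (8.22); names and parameter values are illustrative.

    # Sketch: Monte Carlo check of (8.29) for the scaled Dirichlet distribution.
    a <- c(5, 4, 6); b <- c(0.2, 0.3, 0.5); i <- 1
    W <- t(replicate(2e5, rsdirichlet(a, b)))                 # rows ~ SDirichlet_n(a, b)
    mu.i <- (sum(b * a) - b[i] * a[i]) / (b[i] * (a[i] - 1))  # mu_i in (8.29)
    c(mc = mean(1 / W[, i]), formula = 1 + mu.i)              # should agree closely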

8.4 The mixed Dirichlet distribution

Ongaro et al. (2008) developed a new class of distributions, called the Flexible6 Dirichlet, which includes the Dirichlet distribution as a special member and allows various dependence structures. In this section, ei denotes the n × 1 basis vector with the i-th element equal to one and the others zero.

8.4.1 Density function

Definition 8.4 A random vector x ∈ Tn is said to have a mixed Dirichlet distribution if the joint density function of x_{−n} ∈ V_{n−1} is

    MDirichlet_n(x_{−n}|a, b, p) = ∑_{i=1}^n p_i · Dirichlet_n(x_{−n}|a + b e_i)
                                 = [ Γ(a_+ + b) / ∏_{i=1}^n Γ(a_i) ] ( ∏_{i=1}^n x_i^{a_i−1} ) ∑_{i=1}^n p_i Γ(a_i) x_i^b / Γ(a_i + b),    (8.31)

where a = (a1, . . . , an)⊤ is a positive parameter vector with a_+ = ∑_{i=1}^n a_i, b is a nonnegative parameter, and p = (p1, . . . , pn)⊤ ∈ Tn is a weight parameter vector. We will write either x ∼ MDirichlet_n(a, b, p) on Tn or x_{−n} ∼ MDirichlet_n(a, b, p) on V_{n−1}. ¶

In particular, when n = 2, the mixed Dirichlet distribution becomes the mixed beta distribution with density

    MBeta(x|a1, a2, b, p) = p Beta(x|a1 + b, a2) + (1 − p) Beta(x|a1, a2 + b),

and we write X ∼ MBeta(a1, a2, b, p).

6 Noting that (8.31) is a finite mixture of Dirichlet distributions, we prefer the term mixed Dirichlet.


In addition, when b = 0 or b = 1 and pi = ai /a+ for i = 1, . . . , n in (8.31), the mixed Dirichlet distribution MDirichletn (a, 0, p) reduces to the Dirichlet distribution Dirichletn (a).

8.4.2 Stochastic representation

The stochastic representation in Theorem 8.1 below suggests a straightforward way to generate the mixed Dirichlet random vector via a sequence of independent gamma variates and a multinomial random vector.

Theorem 8.1 If x = (x1, . . . , xn)⊤ ∼ MDirichlet_n(a, b, p) on Tn, then we have the following.

(i) x has the stochastic representation

    x =_d  w / ‖w‖_1 = w / (w_1 + · · · + w_n),    (8.32)

where w = (w_1, . . . , w_n)⊤, y = (y_1, . . . , y_n)⊤,

    w =_d  y + y · z,    (8.33)
    y_i ∼ Gamma(a_i, 1),  i = 1, . . . , n,    (8.34)
    y ∼ Gamma(b, 1),    (8.35)
    z = (z_1, . . . , z_n)⊤ ∼ Multinomial_n(1, p),    (8.36)

and the vector y, the scalar y, and z are mutually independent.

(ii) ‖w‖_1 ∼ Gamma(a_+ + b, 1) is independent of x = w/‖w‖_1. ¶
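Theorem 8.1 can be turned into a sampler almost verbatim. The following R sketch (illustrative, not code from the book) draws one observation from MDirichlet_n(a, b, p):

    # Sketch: simulate x ~ MDirichlet_n(a, b, p) via the representation (8.32)-(8.36).
    rmdirichlet <- function(a, b, p) {
      yv <- rgamma(length(a), shape = a, rate = 1)       # y_i ~ Gamma(a_i, 1)
      y0 <- rgamma(1, shape = b, rate = 1)               # scalar y ~ Gamma(b, 1)
      z  <- as.vector(rmultinom(1, size = 1, prob = p))  # z ~ Multinomial_n(1, p)
      w  <- yv + y0 * z                                  # w = y + y * z
      w / sum(w)                                         # x = w / ||w||_1
    }
    rmdirichlet(a = c(1, 2, 3), b = 2, p = c(0.2, 0.3, 0.5))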

Proof. (i) Assume that (8.32)–(8.36) hold; it suffices to verify that x ∼ MDirichlet_n(a, b, p) on Tn. In fact, since z_i ∼ Bernoulli(p_i), the cdf of w_i is

    Pr(w_i < w) = Pr(y_i + y z_i < w)
                = ∫ Pr(y_i + y z_i < w | z_i) · f(z_i) dz_i
                = (1 − p_i) Pr(y_i < w) + p_i Pr(y_i + y < w).

Hence, the density of w_i is given by

    (1 − p_i) Gamma(w_i | a_i, 1) + p_i Gamma(w_i | a_i + b, 1),    (8.37)

which is a mixture of two dependent gamma distributions. Similarly, the cdf of w is

    Pr(y + y · z < w) = ∫ Pr(y + y · z < w | z) · f(z) dz = ∑_{i=1}^n p_i Pr(y + y · e_i < w).    (8.38)

Let w_i = (w_{i1}, . . . , w_{in})⊤ = y + y · e_i. Hence, w_i is a random vector with independent gamma components and its joint density is given by

    { ∏_{j=1, j≠i}^n Gamma(w_{ij} | a_j, 1) } Gamma(w_{ii} | a_i + b, 1).    (8.39)

From (8.38) and (8.39), we know that the joint density of x with stochastic representation (8.32) is exactly that given by (8.31).

(ii) Since ∑_{i=1}^n z_i = 1, we have

    ‖w‖_1 = ∑_{i=1}^n y_i + y ∼ Gamma(a_+ + b, 1),

which is independent of x = w/‖w‖_1.    End

Remark 8.4 We can rewrite the stochastic representation in (8.32) as

    x =_d  (y + y · z) / (‖y‖_1 + y) = [ ‖y‖_1/(‖y‖_1 + y) ] · y/‖y‖_1 + [ y/(‖y‖_1 + y) ] · z,    (8.40)

where ‖y‖_1/(‖y‖_1 + y) ∼ Beta(a_+, b), y/(‖y‖_1 + y) ∼ Beta(b, a_+), and y/‖y‖_1 ∼ Dirichlet_n(a).

8.4.3 The moments Let x ∼ MDirichletn (a, b, p) on Tn . From (8.32), we have w = w1 ·

w d = w1 · x. w1

Since w1 ⊥ ⊥ x and w1 ∼ Gamma(a+ + b, 1), the mixed moment of x is given by E

 n i=1



xiri

=

   n E( ni=1 wri i )

(a+ + b) ri w = · E i , E{(w1 )r+ }

(a+ + b + r+ ) i=1

(8.41)

266

DIRICHLET AND RELATED DISTRIBUTIONS

where r+ =

n

i=1 ri .

It is easy to verify from (8.33) that

E(wi ) = ai + bpi , Var(wi ) = ai + bpi + b2 pi (1 − pi ), Cov(wi , wj ) = −b2 pi pj ,

i = 1, . . . , n, i = 1, . . . , n, i= / j.

and

(8.42) (8.43) (8.44)

Therefore, from (8.41), we obtain, for i = 1, . . . , n, E(xi ) = (8.42)

=

Var(xi ) = (8.43)

= =

Cov(xi , xj ) = (8.44)

= =

E(wi ) a+ + b ai + bpi = ˆ μi , a+ + b E(w2i ) − μ2i (a+ + b)(a+ + b + 1) ai + ai2 + 2ai bpi + (b + b2 )pi − μ2i (a+ + b)(a+ + b + 1) b2 pi (1 − pi ) μi (1 − μi ) , and a+ + b + 1 (a+ + b)(a+ + b + 1) E(wi wj ) − μ i μj (a+ + b)(a+ + b + 1) ai aj + ai bpj + bpi aj − μ i μj (a+ + b)(a+ + b + 1) −μi μj b 2 pi pj − , i= / j. a+ + b + 1 (a+ + b)(a+ + b + 1)

(8.45)

(8.46)

(8.47)

8.4.4 Marginal distributions Let x = (x1 , . . . , xn ) ∼ MDirichletn (a, b, p) on Tn . We partition x, w, y, z, a, and p in the same fashion: 

x= 

z=

x(1) x(2) z(1) z(2)



s ,w= n−s







,

a =

w(1) w(2) a(1) a(2)





,

y=





, and p =

y(1) y(2) p(1) p(2)



, 

.

We have the following lemma related to the multinomial distribution. Lemma 8.1 following.

If z = (z1 , . . . , zn ) ∼ Multinomialn (N, p), then we have the

OTHER RELATED DISTRIBUTIONS

(i) The marginal distribution of z(1) is a multinomial distribution:      z(1) p(1)   ∼ Multinomials+1 N, . N − si=1 zi 1 − si=1 pi

267

(8.48)

(ii) z(2) 1 ∼ Binomial(N, p(2) 1 ) and z(1) 1 ∼ Binomial(N, p(1) 1 ). (iii) The conditional distribution of z(2) given z(1) follows a multinomial distribution: z(2) |z(1) ∼ Multinomials (N − z(1) 1 , p(2) /p(2) 1 ).

(8.49) ¶

Proof. We can rewrite the joint density of z into 

N z



n 



pzi i

i=1

 s  N−z(1)  s 1  z  N i = p · 1 − p i z(1) , N − z(1) 1 i=1 i i=1   n  zi  N − z(1) 1 pi × , p(2) 1 z(2) i=s+1

which implies (8.48) and (8.49). In addition, since the marginal distribution of a component of a multinomial random vector is binomially distributed, the first conclusion in (ii) is immediately obtained from (8.48). By symmetry, the second result in (ii) is also valid. End We now consider the distribution of the random subvector x(1) for any 1 ≤ s < n. We can rewrite (8.32) as wi w1 + · · · + ws + (ws+1 + · · · + wn ) wi = ˆ , i = 1, . . . , s, w1 + · · · + ws + ws+1 d

xi =

or equivalently  1− where



w(1) ws+1



x(1) s

=

i=1 xi



 d

=

w(1) ws+1

i=s+1

yi

y(1)

n

i=s+1

"



y(1)

n 

=

 d

yi

(w1 + · · · + ws + ws+1 ),



+y· 

n 

+y·



z(1)

i=s+1 zi

1−

z(1) s

i=1 zi

by (8.33) 

,

(8.50)

268

DIRICHLET AND RELATED DISTRIBUTIONS

n

 yi ∼ Gamma( ni=s+1 ai , 1) and      z(1) p(1)   ∼ Multinomials+1 1, 1 − si=1 zi 1 − si=1 pi

i=s+1

by Lemma 8.1(i). Thus, the marginal distribution of x(1) is a mixed Dirichlet distribution:       x(1) a(1) p(1) ∼ MDirichlets+1 , b, (8.51) 1 − x(1) 1 a(2) 1 p(2) 1 on Ts+1 . In particular, when s = 1, we have x1 ∼ MBeta(a1 , a+ − a1 , b, p1 ).

(8.52)

By symmetry, we obtain xi ∼ MBeta(ai , a+ − ai , b, pi ),

i = 1, . . . , n.

(8.53)

8.4.5 Conditional distributions We now consider the conditional distribution of the random subvector x(1) given x(2) . Theorem 8.2 If x ∼ MDirichletn (a, b, p) on Tn , then the conditional distribution of x(1) /(1 − x(2) 1 ) given x(2) is a mixture of a mixed Dirichlet and a Dirichlet:     (2) x(1) p(1) 1 p(1) (1) x ∼ · MDirichlets a , b, 1 − x(2) 1  p(1) 1 + q(x(2) ) p(1) 1 (2) q(x ) · Dirichlets (a(1) ), + (1) (8.54) p 1 + q(x(2) ) where q(x(2) ) =

n 

(a(1) 1 + b)

(ai ) b x . pi (1) (2) b

(a 1 )(1 − x 1 ) i=s+1 (ai + b) i

(8.55) ¶

Proof. Letting v(1) = x(1) /(1 − x(2) 1 ), the conditional distribution of v(1) given x(2) can be expressed as     Fv(1) |x(2) (v(1) ) = Fv(1) |{x(2) ,z(1) 1 =1} (v(1) ) · Pr z(1) 1 = 1x(2)     + Fv(1) |{x(2) ,z(1) 1 =0} (v(1) ) · Pr z(1) 1 = 0x(2) .

OTHER RELATED DISTRIBUTIONS

It can be verified that





x|{z(1) 1 = 1} ∼ MDirichletn a, b,

p(1) /p(1) 1 0n−s

269



(8.56)

on Tn . On the one hand, from (8.56) the distribution of x(1) |{x(2) , z(1) 1 = 1} can be directly computed. Thus we obtain    (1)   (2) x(1)  x , z(1)  = 1 ∼ MDirichlets a(1) , b, p (8.57) 1 1 − x(2) 1  p(1) 1 on Ts . Similarly,

   (2) x(1)  x , z(1)  = 0 ∼ Dirichlets (a(1) ) 1 1 − x(2) 1 

on

Ts .

(8.58)

Now, the weight is given by    f (x(2) | z(1) 1 = 1) · Pr(z(1) 1 = 1)  Pr z(1) 1 = 1x(2) = f (x(2) ) f (x(2) | z(1) 1 = 1) · p(1) 1 = , f (x(2) ) which follows from Lemma 8.1(ii) since z(1) 1 ∼ Bernoulli(p(1) 1 ). On the other hand, from (8.56) we obtain       (1) x(2)  z  = 1 ∼ Dirichletn−s+1 a(2) , a+ + b − a(2)  . 1 1 1 − x(2) 1  Similar to (8.51), we have       x(2) a(2) p(2) ∼ MDirichletn−s+1 , b, 1 − x(2) 1 a(1) 1 p(1) 1 on Tn−s+1 . Some algebraic operations lead to (8.55).

End

8.5 The Liouville distribution The Liouville multiple integral over the positive orthant Rn+ = {(z1 , . . . , zn ): zi > 0, 1 ≤ i ≤ n}

(8.59)

is an extension of the Dirichlet integral. Marshall and Olkin (1979: Chapter 11) described the family of the Liouville distributions. Sivazlian (1981a,b) presented some results on marginal distributions and transformation properties for the class of Liouville distributions. Anderson and Fang (1982, 1987) studied a subclass of Liouville distributions arising from the distribution of quadratic forms. Devroye

270

DIRICHLET AND RELATED DISTRIBUTIONS

(1986: 596–599) provided methods for generating the Liouville distribution. Gupta and Richards (1987, 1992a,b) used the Weyl fractional integral and Deny’s theorem in measure theory on locally compact groups to derive some important results on the multivariate Liouville distributions and also extended some results to the matrix analogues. Fang et al. (1990: Chapter 6) provided an extensive study of the Liouville distributions. Gupta and Richards (2001b) and Kotz et al. (2000: Chapter 50) reviewed the history and the development of the Liouville distributions from various aspects. A random vector z ∈ Rn+ with density

Definition 8.5

−1

cL

  n n ·g zi ziai −1 i=1

(8.60)

i=1

is said to have a Liouville distribution, denoted by z ∼ Liouvillen (g; a), where cL is a normalizing constant, g(·) ≥ 0 is a Lebesgue measurable function, and a = (a1 , . . . , an ) is a positive parameter vector. ¶ If g(·) has an unbounded domain, the Liouville distribution is said to be of the first kind, denoted by z ∼ Liouville(1) n (g; a); otherwise, it is of the second kind, denoted by z ∼ Liouville(2) n (g; a). The normalizing constant is given by 

cL = 0



n

= where a+ =

n

Theorem 8.3 sentation:

i=1



···



0

i=1 (ai )

(a+ )

    n n ai −1 g zi zi dzi



i=1 ∞

i=1

g(z)za+ −1 dz,

(8.61)

0

ai .

If z ∼ Liouvillen (g; a), then z has the following stochastic repred

z = R · x,

(8.62)

where R is a nonnegative random variable distributed as Liouville1 (g; a+ ), x ∼ Dirichletn (a) on Tn , and R ⊥ ⊥ x. ¶ ⊥ x, then the joint Proof. If R ∼ Liouville1 (g; a+ ), x ∼ Dirichletn (a) on Tn , and R ⊥ density of x−n = (x1 , . . . , xn−1 ) and R is proportional to g(R)Ra+ −1 ·

n  i=1

xiai −1 .

OTHER RELATED DISTRIBUTIONS

271

Making the transformation zi = Rxi ,

i = 1, . . . , n − 1,

 n−1   and zn = R 1 − xi ,

(8.63)

i=1

we obtain the following Jacobian: J(z → x−n , R) = Rn−1 . End

Thus, the joint density of z is given by (8.60). Remark 8.5 (8.62) that

Note that zi ≥ 0 and z1 = z1 + · · · + zn ; hence we obtain from d

R = z1

and x =

z . z1

Hence, (8.62) can be intuitively expressed as z = z1 ·

z . z1

(8.64)

The first term of the right-hand side of (8.64) represents the length of the random vector z, while the second term denotes its direction. ¶ Remark 8.6 It is easily seen that the density function fR (·) of R ∼ Liouville1 (g; a+ ) is related to g(·) via " ∞ a+ −1 fR (r) = g(r)r g(r)r a+ −1 dr 0 −1

= cL Bn (a) · g(r)r a+ −1 ,

(8.65) ¶

where cL is given by (8.61).

Example 8.4 (Independent gamma random variables). If z1 , . . . , zn are independent gamma random variables and zi ∼ Gamma(ai , 1), then z = (z1 , . . . , zn ) has a Liouville distribution of the first kind; that is, −z z ∼ Liouville(1) n (e ; a),

where a = (a1 , . . . , an ) and g(z) = e−z has an unbounded domain (0, +∞). The 1 -norm d

R = z1 =

n 

zi ∼ Gamma(a+ , 1).

i=1

Example 8.5 (Dirichlet distribution). In (8.60), if g(z) = (1 − z)b−1 ,

0 < z < 1,



272

DIRICHLET AND RELATED DISTRIBUTIONS

then z = (z1 , . . . , zn ) has a Liouville distribution of the second kind; that is, z ∼ Liouville(2) n (g; a1 , . . . , an ), which is equivalent to z ∼ Dirichlet(a1 , . . . , an ; b) on Vn . It follows from (8.65) that the density of R is proportional to r a+ −1 (1 − r)b−1 ,

0 < r < 1; 

that is, R ∼ Beta(a+ , b). Example 8.6 (Inverted Dirichlet distribution). In (8.60), if g(z) = (1 + z)−(a+ +b) ,

z > 0,

then z = (z1 , . . . , zn ) follows the inverted Dirichlet distribution with density (cf. (5.1))   −(a+ +b) n n 

(a+ + b) n ziai −1 1 + zi . { i=1 (ai )} (b) i=1 i=1 The 1 -norm R ∼ IBeta(a+ , b); see (1.8).



Example 8.7 (Multivariate unit-gamma distribution). In (8.60), if g(z) = (− log z)k−1 ,

0 < z < 1, k > 0,

then z = (z1 , . . . , zn ) follows the multivariate unit-gamma-type distribution (Kotz and Johnson, 1982: 86–87, vol. 5) with density proportional to   k−1  n n n  − log zi · ziai −1 , 0 < yi < 1, 0 < zi < 1. i=1

i=1

i=1

This distribution is an n-dimensional extension of the unit-gamma distribution. The density of the 1 -norm R is proportional to (− log r)k−1 r a+ −1 , 0 < r < 1. We have R = e−Y d

and

Y ∼ Gamma(k, a+ ).



Example 8.8 ( 1 -norm symmetric distribution). In (8.60), if a1 = · · · = an = 1, we obtain the 1 -norm symmetric distribution. For more details, see Fang and Fang (1988). 

8.6 The generalized Liouville distribution The integral related to the generalized Liouville distribution was first presented by Edwards (1922: 160–162) but received little attention from statisticians until Marshall and Olkin (1979). Sivazlian (1981a,b) focused on deriving the Dirichlet

OTHER RELATED DISTRIBUTIONS

273

and beta distributions from the generalized Liouville family. Rayens and Srinivasan (1994) further extended the Liouville family on the open simplex to the generalized Liouville family and studied its application to compositional data analysis. Definition 8.6

A random vector y = (y1 , . . . , yn ) ∈ Vn with density  n  n   yi bi  −1 yiai −1 cGL · g q i i=1 i=1

(8.66)

is said to have a generalized Liouville distribution, denoted by y ∼ GLiouvillen (g; a, b, q) on Vn , where g(·) is a continuous function from R1+ to R1+ , a = (a1 , . . . , an ), b = (b1 , . . . , bn ), and q = (q1 , . . ., qn ) are three positive parameter vectors, and cGL is the normalizing constant. ¶ Obviously, this family includes the Liouville distributions when bi = qi = 1 for all i = 1, . . . , n. However, as pointed out by Gupta and Richards (2001a: 132), ‘the classical Liouville densities on the open simplex Vn cannot be transformed into the densities given in (8.66) by means of a power-scale transformation.’ We are mainly interested in generating random variates from this family. Theorem 8.4 If y = (y1 , . . . , yn ) ∼ GLiouvillen (g; a, b, q) on Vn , then y has the following stochastic representation: d

1/bi

yi = qi zi

,

i = 1, . . . , n,

(8.67)

where z = (z1 , . . . , zn ) ∼ Liouvillen (g; a1 /b1 , . . . , an /bn ).



Proof. Assume that z = (z1 , . . . , zn ) is obtained via the transformation (8.67). The Jacobian of this transformation is then given by   n    ∂(y1 , . . . , yn )   qi 1/bi −1 = J(y → z) =  z .  ∂(z1 , . . . , zn ) bi i i=1 The joint density of z is proportional to f (z) ∝ g ∝g

n n n      qi 1/bi −1 1/b zi (qi zi i )ai −1 · z bi i i=1 i=1 i=1 n n   a /b −1 zi zi i i . i=1

i=1

From y ∈ Vn , we obtain 0 ≤ zi ≤ (1/qi )bi , i = 1, . . . , n, proves the theorem.

n i=1

1/bi

qi z i

≤ 1. This End

Appendix A: Some useful S-plus codes

In this appendix we provide some useful S-plus codes related to the multinomial, Dirichlet, grouped Dirichlet, nested Dirichlet, and Dirichlet–multinomial distributions.

A.1 Multinomial distribution

A.1.1 x ← rmultinomial(N, p)

This S-plus function generates a random vector x = (x1, . . . , xn) ∼ Multinomial_n(N, p) on Tn(N), where N is a positive integer and p = (p1, . . . , pn) ∈ Tn. The joint density of x is defined in Section 1.8.1(e). The conditional sampling method can be used to generate one sample of x as follows:

• Draw x_i ∼ Binomial( N − ∑_{j=1}^{i−1} x_j,  p_i / ∑_{j=i}^n p_j ),  1 ≤ i ≤ n − 1.
• Set x_n = N − ∑_{j=1}^{n−1} x_j.

function(N, p)
{
# x =
# Output: x = c(x_1, ..., x_n) ~ Multinomial_n(N, p)
# -------------------------------------------------------
n