Behaviormetrics: Quantitative Approaches to Human Behavior 3
Nobuoki Eshima
Statistical Data Analysis and Entropy
Behaviormetrics: Quantitative Approaches to Human Behavior Volume 3
Series Editor Akinori Okada, Graduate School of Management and Information Sciences, Tama University, Tokyo, Japan
This series covers in their entirety the elements of behaviormetrics, a term that encompasses all quantitative approaches of research to disclose and understand human behavior in the broadest sense. The term includes the concept, theory, model, algorithm, method, and application of quantitative approaches, from theoretical or conceptual studies to empirical or practical application studies, to comprehend human behavior. The Behaviormetrics series deals with a wide range of topics of data analysis and of developing new models, algorithms, and methods to analyze these data.
The characteristics featured in the series have four aspects. The first is the variety of the methods utilized in data analysis and newly developed methods, including not only standard or general statistical methods and psychometric methods traditionally used in data analysis, but also cluster analysis, multidimensional scaling, machine learning, correspondence analysis, biplot, network analysis and graph theory, conjoint measurement, biclustering, visualization, and data and web mining. The second aspect is the variety of types of data, including ranking, categorical, preference, functional, angle, contextual, nominal, multi-mode multi-way, continuous, discrete, high-dimensional, and sparse data. The third comprises the varied procedures by which the data are collected: by survey, experiment, sensor devices, purchase records, and other means. The fourth aspect of the Behaviormetrics series is the diversity of fields from which the data are derived, including marketing and consumer behavior, sociology, psychology, education, archaeology, medicine, economics, political and policy science, cognitive science, public administration, pharmacy, engineering, urban planning, agriculture and forestry science, and brain science. In essence, the purpose of this series is to describe the new horizons opening up in behaviormetrics: approaches to understanding and disclosing human behaviors both in the analyses of diverse data by a wide range of methods and in the development of new methods to analyze these data.
Editor in Chief: Akinori Okada (Rikkyo University)
Managing Editors: Daniel Baier (University of Bayreuth), Giuseppe Bove (Roma Tre University), Takahiro Hoshino (Keio University)
More information about this series at http://www.springer.com/series/16001
Nobuoki Eshima Center for Educational Outreach and Admissions Kyoto University Kyoto, Japan
ISSN 2524-4027  ISSN 2524-4035 (electronic)
Behaviormetrics: Quantitative Approaches to Human Behavior
ISBN 978-981-15-2551-3  ISBN 978-981-15-2552-0 (eBook)
https://doi.org/10.1007/978-981-15-2552-0
© Springer Nature Singapore Pte Ltd. 2020
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
Preface
In modern times, various kinds of data are gathered and accumulated in all scientific research fields, in practical business affairs, in governments, and so on. Computing efficiency and data analysis methodologies have been developing, and they have been extending the capacity to process big and complex data. Statistics provides researchers and practitioners with useful methods to handle data for their purposes. Modern statistics dates back to the early 20th century. The Student's t statistic by W. S. Gosset and the basic idea of experimental designs by R. A. Fisher have had great effects on the development of statistical methodologies. On the other hand, information theory originates from C. E. Shannon's 1948 paper, "A Mathematical Theory of Communication". The theory is indispensable for measuring the uncertainty of events from information sources and systems of random variables and for processing data effectively from the viewpoint of entropy. These days, interdisciplinary research domains have been increasing in order to promote novel studies that resolve complicated problems. The common tool of statistics and information theory is "probability", and the common aim is to deal with information and data effectively. In this sense, both theories have similar scopes. The logarithm of probability is negative information; the logarithms of odds and odds ratios in statistics are relative information; and the log likelihood function is asymptotically a negative entropy. The author is a statistician and takes a standpoint in statistics; however, there are problems in statistical data analysis that cannot be resolved within conventional views of statistics. In such cases, perspectives based on entropy may provide us with good clues and ideas to tackle the problems and to develop statistical methodologies. The aim of this book is to elucidate how most statistical methods, e.g. correlation analysis and the t, F, and χ² statistics, can be interpreted in terms of entropy, and to introduce entropy-based methods for data analysis, e.g. entropy-based approaches to the analysis of association in contingency tables and path analysis
with generalized linear models, which may be useful tools for behaviormetric research. The author hopes that this book motivates readers, especially young researchers, to grasp that "entropy" is a useful tool for dealing with practical data and study themes.
Kyoto, Japan
Nobuoki Eshima
Acknowledgements
I am grateful to Prof. A. Okada, Rikkyo University Professor Emeritus, for giving me the chance and encouragement to write the present book. I would like to express my gratitude to Prof. M. Kitano, Kyoto University Professor Emeritus and an Executive Vice President of Kyoto University, for providing precious time and an excellent environment to aid me in the present work. Lastly, I deeply appreciate the valuable comments of an anonymous referee, which improved the draft of the present book.
Contents
1 Entropy and Basic Statistics
  1.1 Introduction
  1.2 Information
  1.3 Loss of Information
  1.4 Entropy
  1.5 Entropy of Joint Event
  1.6 Conditional Entropy
  1.7 Test of Goodness-of-Fit
  1.8 Maximum Likelihood Estimation of Event Probabilities
  1.9 Continuous Variables and Entropy
  1.10 Discussion
  References
2 Analysis of the Association in Two-Way Contingency Tables
  2.1 Introduction
  2.2 Odds, Odds Ratio, and Relative Risk
  2.3 The Association in Binary Variables
  2.4 The Maximum Likelihood Estimation of Odds Ratios
  2.5 General Two-Way Contingency Tables
  2.6 The RC(M) Association Models
  2.7 Discussion
  References
3 Analysis of the Association in Multiway Contingency Tables
  3.1 Introduction
  3.2 Loglinear Model
  3.3 Maximum Likelihood Estimation of Loglinear Models
  3.4 Generalized Linear Models
  3.5 Entropy Multiple Correlation Coefficient for GLMs
  3.6 Multinomial Logit Models
  3.7 Entropy Coefficient of Determination
  3.8 Asymptotic Distribution of the ML Estimator of ECD
  3.9 Discussions
  References
4 Analysis of Continuous Variables
  4.1 Introduction
  4.2 Correlation Coefficient and Entropy
  4.3 The Multiple Correlation Coefficient
  4.4 Partial Correlation Coefficient
  4.5 Canonical Correlation Analysis
  4.6 Test of the Mean Vector and Variance–Covariance Matrix in the Multivariate Normal Distribution
  4.7 Comparison of Mean Vectors of Two Multivariate Normal Populations
  4.8 One-Way Layout Experiment Model
  4.9 Classification and Discrimination
  4.10 Incomplete Data Analysis
  4.11 Discussion
  References
5 Efficiency of Statistical Hypothesis Test Procedures
  5.1 Introduction
  5.2 The Most Powerful Test of Hypotheses
  5.3 The Pitman Efficiency and the Bahadur Efficiency
  5.4 Likelihood Ratio Test and the Kullback Information
  5.5 Information of Test Statistics
  5.6 Discussion
  References
6 Entropy-Based Path Analysis
  6.1 Introduction
  6.2 Path Diagrams of Variables
  6.3 Path Analysis of Continuous Variables with Linear Models
  6.4 Examples of Path Systems with Categorical Variables
  6.5 Path Analysis of Structural Generalized Linear Models
  6.6 Path Analysis of GLM Systems with Canonical Links
  6.7 Summary Effects Based on Entropy
  6.8 Application to Examples 6.1 and 6.2
    6.8.1 Path Analysis of Dichotomous Variables (Example 6.1 (Continued))
    6.8.2 Path Analysis of Polytomous Variables (Example 6.2 (Continued))
  6.9 General Formulation of Path Analysis of Recursive Systems of Variables
  6.10 Discussion
  References
7 Measurement of Explanatory Variable Contribution in GLMs
  7.1 Introduction
  7.2 Preliminary Discussion
  7.3 Examples
  7.4 Measuring Explanatory Variable Contributions
  7.5 Numerical Illustrations
  7.6 Application to Test Analysis
  7.7 Variable Importance Assessment
  7.8 Discussion
  References
8 Latent Structure Analysis
  8.1 Introduction
  8.2 Factor Analysis
    8.2.1 Factor Analysis Model
    8.2.2 Conventional Method for Measuring Factor Contribution
    8.2.3 Entropy-Based Method of Measuring Factor Contribution
    8.2.4 A Method of Calculating Factor Contributions by Using Covariance Matrices
    8.2.5 Numerical Example
  8.3 Latent Trait Analysis
    8.3.1 Latent Trait and Item Response
    8.3.2 Item Characteristic Function
    8.3.3 Information Functions and ECD
    8.3.4 Numerical Illustration
  8.4 Discussion
  References
Chapter 1
Entropy and Basic Statistics
1.1 Introduction

Entropy is a physical concept for measuring the complexity or uncertainty of systems under study, and it is used in thermodynamics, statistical mechanics, information theory, and so on; however, its definitions differ across these research domains. In measuring the information of events, "entropy" was introduced in the study of communication systems by Shannon [6]; it is called the Shannon entropy, and it contributed greatly to the basis of modern information theory. Nowadays, the theory is applied to wide interdisciplinary research fields. "Entropy" can be grasped as a mathematical concept of the uncertainty of events in sample spaces and of random variables. The basic idea in information theory is to use a logarithmic measure of probability. In statistical data analysis, parameters describing the populations under consideration are estimated from random samples, and confidence intervals for the estimators and various statistical tests concerning the parameters are usually constructed. In these cases, the uncertainty of the estimators, i.e., of the events under consideration, is measured with probability, for example, through the 100(1 − α)% confidence intervals of the estimators, the significance levels of the tests, and so on. Information theory and statistics both discuss phenomena and data processing systems with probability. In this respect, it is significant to have another look at statistics from a viewpoint of entropy. In this textbook, data analysis is considered through information theory, especially "entropy"; in what follows, the term "entropy" implies the Shannon entropy. In this chapter, information theory is first reviewed for readers unfamiliar with entropy, and second, basic statistics are reconsidered in view of entropy. Section 1.2 introduces an information measure for measuring the uncertainties of events in the sample space, and in Sect. 1.3 the loss of information is assessed by the information measure. In Sect. 1.4, the Shannon entropy is introduced for measuring the uncertainties of sample spaces and random variables, and the Kullback–Leibler (KL) information [4] is derived through a theoretical discussion. Sections 1.5 and 1.6 discuss the joint and the conditional entropy of random variables and briefly discuss the association
between the variables through the mutual information (entropy) and the conditional entropy. In Sect. 1.7, the chi-square test is considered through the KL information. Section 1.9 treats the information of continuous variables, and t and F statistics are expressed through entropy.
1.2 Information

Let Ω be a sample space, A ⊂ Ω an event, and P(A) the probability of event A. The information of event A is defined mathematically according to its probability, not its content. The smaller the probability of an event, the greater we feel its value. Based on this intuition, the mathematical definition of information [6] is given as follows.

Definition 1.1 For P(A) ≠ 0, the information of A is defined by

I(A) = \log_a \frac{1}{P(A)},   (1.1)

where a > 1. In what follows, the base of the logarithm is e, and (1.1) is simply denoted by

I(A) = \log \frac{1}{P(A)}.

In this case, the unit is called a "nat," i.e., a natural unit of information. If P(A) = 1, i.e., event A always occurs, then I(A) = 0, which implies that event A has no information. The information measure I(A) has the following properties:

(i) For events A and B, if P(A) ≥ P(B) > 0, the following inequality holds:

I(A) ≤ I(B).   (1.2)

(ii) If events A and B are statistically independent, then it follows that

I(A ∩ B) = I(A) + I(B).   (1.3)

Proof Inequality (1.2) is trivial. In (ii), we have P(A ∩ B) = P(A)P(B). From this,

I(A ∩ B) = \log \frac{1}{P(A ∩ B)} = \log \frac{1}{P(A)P(B)} = \log \frac{1}{P(A)} + \log \frac{1}{P(B)} = I(A) + I(B).   (1.4)

Example 1.1 In a trial drawing a card from a deck of cards, let events A and B be "ace" and "heart," respectively. Then,

P(A ∩ B) = \frac{1}{52} = \frac{1}{13} × \frac{1}{4}, \quad P(A) = \frac{1}{13}, \quad P(B) = \frac{1}{4}.

Since A ∩ B ⊂ A, B, corresponding to (1.2), we have I(A ∩ B) > I(A), I(B), and since the events are statistically independent, we also have Eq. (1.3).

Remark 1.1 In information theory, base 2 is usually used for the logarithm. Then the unit of information is referred to as a "bit." One bit is the information of an event with probability 1/2.
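As a brief numerical illustration of Definition 1.1 and properties (1.2) and (1.3), the following Python sketch (not part of the original text; the function name is chosen only for illustration) computes information in nats and bits and checks the additivity (1.3) for the card example.

```python
import math

def information(p, base=math.e):
    """Information I(A) = log_base(1/p) of an event with probability p (Definition 1.1)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return math.log(1.0 / p, base)

# Example 1.1: drawing a card; A = "ace" and B = "heart" are independent events.
p_A, p_B = 1 / 13, 1 / 4
p_AB = p_A * p_B                      # = 1/52 by independence

print(information(p_A), information(p_B), information(p_AB))   # I(A), I(B), I(A ∩ B) in nats
# Additivity (1.3): I(A ∩ B) = I(A) + I(B) for independent events
print(math.isclose(information(p_AB), information(p_A) + information(p_B)))  # True
# Remark 1.1: with base 2 the unit is the bit; an event with probability 1/2 carries 1 bit
print(information(0.5, base=2))       # 1.0
```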
1.3 Loss of Information

When we forget the last two digits of a 10-digit phone number, e.g., 075-753-25##, the right number can be restored only with probability 1/100 if there is no other information about the number. In this case, the loss of information about the right phone number is log 100. In general, the loss of information is defined as follows:

Definition 1.2 Let A and B be events such that A ⊂ B. Then, the loss of "information concerning event A" by using or knowing event B for A is defined by

Loss(A|B) = I(A) − I(B).   (1.5)

In the above definition, A ∩ B = A, and from (1.5) the loss is also expressed by

Loss(A|B) = \log \frac{1}{P(A)} − \log \frac{1}{P(B)} = \log \frac{P(B)}{P(A)} = \log \frac{1}{P(A|B)}.   (1.6)

In the above example of a phone number, the true number, e.g., 097-753-2517, is event A, and the memorized number, e.g., 097-753-25##, is event B. Then, P(A|B) = 1/100.

Table 1.1 Original data

Category   0   1   2   3
Frequency  5  10  15   5

Table 1.2 Aggregated data from Table 1.1

Category   0   1
Frequency  15  20

Example 1.2 Let X be a categorical random variable that takes values in sample space Ω = {0, 1, 2, ..., I − 1} and let π_i = P(X = i). Suppose that for an integer k > 0, the sample space is changed into Ω* = {0, 1} by

X^* = \begin{cases} 0 & (X < k) \\ 1 & (X ≥ k) \end{cases}.

Then, the loss of information of X = i is given by

Loss(X = i | X^* = j(i)) = \log \frac{P(X^* = j(i))}{P(X = i)} =
\begin{cases} \log \dfrac{\sum_{a=0}^{k−1} π_a}{π_i} & (i < k;\ j(i) = 0) \\ \log \dfrac{\sum_{a=k}^{I−1} π_a}{π_i} & (i ≥ k;\ j(i) = 1) \end{cases}.

In the above example, random variable X is dichotomized, which is a usual technique for processing and/or analyzing categorical data; however, the information of the original variable is reduced.

Example 1.3 In Table 1.1, categories 0 and 1 are reclassified into category 0 as shown in Table 1.2, and categories 2 and 3 into category 1. Then the loss of information is considered. Let A and B be the events shown in Tables 1.1 and 1.2, respectively. Then, we have P(A|B) = \frac{1}{15+1} × \frac{1}{20+1} = \frac{1}{336}, which is the conditional probability of restoring Table 1.1 from Table 1.2, and from (1.6) we have Loss(A|B) = log 336.
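The following Python sketch (illustrative only, not from the original text) computes the loss of information in Example 1.2 when the empirical distribution of Table 1.1 is dichotomized at k = 2.

```python
import math

# Empirical distribution of Table 1.1: categories 0..3 with frequencies 5, 10, 15, 5
freq = [5, 10, 15, 5]
N = sum(freq)
pi = [f / N for f in freq]             # pi_i = P(X = i)

k = 2                                   # dichotomize: X* = 0 if X < k, else 1
p_low, p_high = sum(pi[:k]), sum(pi[k:])

# Loss(X = i | X* = j(i)) = log( P(X* = j(i)) / P(X = i) ), as in Example 1.2
for i, p in enumerate(pi):
    group_prob = p_low if i < k else p_high
    print(f"category {i}: loss = {math.log(group_prob / p):.4f} nats")
```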
1.4 Entropy

Let Ω = {ω_1, ω_2, ..., ω_n} be a discrete sample space and let P(ω_i) be the probability of event ω_i. The uncertainty of the sample space is measured by the mean of the information I(ω_i) = − log P(ω_i).

Table 1.3 Probability distribution of X

X            0    1    2
Probability  1/4  1/2  1/4

Definition 1.3 The entropy of discrete sample space Ω [6] is defined by

H(Ω) = \sum_{i=1}^{n} P(ω_i) I(ω_i) = − \sum_{i=1}^{n} P(ω_i) \log P(ω_i).   (1.7)

Remark 1.2 In the above definition, sample space Ω and probability space (Ω, P) are identified.

Remark 1.3 In definition (1.7), if P(ω_i) = 0, then for convenience of the discussion we set P(ω_i) log P(ω_i) = 0 · log 0 = 0. If P(ω_i) = 1, i.e., P(ω_j) = 0 for j ≠ i, we have H(Ω) = 0.

Example 1.4 Let X be a categorical random variable that takes values in sample space Ω = {0, 1, 2} with the probabilities illustrated in Table 1.3. Then, the entropy of Ω is given by

H(Ω) = \frac{1}{4} \log 4 + \frac{1}{2} \log 2 + \frac{1}{4} \log 4 = \frac{3}{2} \log 2.

In this case, the entropy is also referred to as that of X and is denoted by H(X). With respect to entropy (1.7), we have the following theorem:

Theorem 1.1 Let p = (p_1, p_2, ..., p_n) and q = (q_1, q_2, ..., q_n) be probability distributions, i.e.,

\sum_{i=1}^{n} p_i = \sum_{i=1}^{n} q_i = 1.

Then, it follows that

\sum_{i=1}^{n} p_i \log p_i ≥ \sum_{i=1}^{n} p_i \log q_i.   (1.8)

The equality holds true if and only if p = q.
Proof For x > 0, the following inequality holds true: log x ≤ x − 1. By using the above inequality, we have

\sum_{i=1}^{n} p_i \log q_i − \sum_{i=1}^{n} p_i \log p_i = \sum_{i=1}^{n} p_i \log \frac{q_i}{p_i} ≤ \sum_{i=1}^{n} p_i \left( \frac{q_i}{p_i} − 1 \right) = 0.

This completes the theorem.

In (1.7), the entropy is defined for the sample space. For a probability distribution p = (p_1, p_2, ..., p_n), the entropy is defined by

H(p) = − \sum_{i=1}^{n} p_i \log p_i.

From (1.8), we have

H(p) = − \sum_{i=1}^{n} p_i \log p_i ≤ − \sum_{i=1}^{n} p_i \log q_i.   (1.9)

In (1.9), setting q_i = 1/n, i = 1, 2, ..., n, we have

H(p) = − \sum_{i=1}^{n} p_i \log p_i ≤ \log n.

Hence, from Theorem 1.1, entropy H(p) is maximized at p = (1/n, 1/n, ..., 1/n), i.e., the uniform distribution. The following entropy is referred to as the Kullback–Leibler (KL) information or divergence [4]:

D(p ‖ q) = \sum_{i=1}^{n} p_i \log \frac{p_i}{q_i}.   (1.10)

From Theorem 1.1, (1.10) is nonnegative and is 0 if and only if p = q. When the entropy of distribution q for distribution p is defined by

H_p(q) = − \sum_{i=1}^{n} p_i \log q_i,

we have D(p ‖ q) = H_p(q) − H(p). This entropy is interpreted as the loss of information by substituting distribution q for the true distribution p. With respect to the KL information, in general, D(p ‖ q) ≠ D(q ‖ p).

Table 1.4 Probability distribution of Y

Y            0    1    2
Probability  1/8  3/8  1/2

Example 1.5 Let Y be a categorical random variable that has the probability distribution shown in Table 1.4. Then, random variable X in Example 1.4 and Y are compared. Let the distributions in Tables 1.3 and 1.4 be denoted by p and q, respectively. Then, we have

D(p ‖ q) = \frac{1}{4} \log 2 + \frac{1}{2} \log \frac{4}{3} + \frac{1}{4} \log \frac{1}{2} = \frac{1}{2} \log \frac{4}{3} ≈ 0.144,
D(q ‖ p) = \frac{1}{8} \log \frac{1}{2} + \frac{3}{8} \log \frac{3}{4} + \frac{1}{2} \log 2 = \frac{3}{8} \log \frac{3}{2} ≈ 0.152.

In what follows, distributions and the corresponding random variables are identified, and the KL information is also denoted by using the variables; e.g., in Example 1.5 the KL information D(p ‖ q) and D(q ‖ p) are expressed as D(X ‖ Y) and D(Y ‖ X), respectively, depending on the occasion.

Theorem 1.2 Let q = (q_1, q_2, ..., q_k) be a probability distribution that is made by appropriately combining categories of distribution p = (p_1, p_2, ..., p_n). Then, it follows that H(p) ≥ H(q).

Proof For simplicity of the discussion, let q = (p_1 + p_2, p_3, ..., p_n). Then,

H(p) − H(q) = p_1 \log \frac{1}{p_1} + p_2 \log \frac{1}{p_2} − (p_1 + p_2) \log \frac{1}{p_1 + p_2} = p_1 \log \frac{p_1 + p_2}{p_1} + p_2 \log \frac{p_1 + p_2}{p_2} ≥ 0.

In this case, the theorem holds true. The above proof can be extended to the general case, and the theorem follows.

In the above theorem, the mean loss of information by combining categories of distribution p is given by H(p) − H(q).

Example 1.6 In a large and closed human population, if marriage is random, the distribution of genotypes is stationary. In ABO blood types, let the ratios of genes A, B, and O be p, q, and r, respectively, where p + q + r = 1. Then, the probability distribution of genotypes AA, AO, BB, BO, AB, and OO is

u = (p^2, 2pr, q^2, 2qr, 2pq, r^2).

We usually observe phenotypes A, B, AB, and O, which correspond to genotypes "AA or AO," "BB or BO," AB, and OO, respectively, and the phenotype probability distribution is

v = (p^2 + 2pr, q^2 + 2qr, 2pq, r^2).

In this case, the mean loss of information is given as follows:

H(u) − H(v) = p^2 \log \frac{p^2 + 2pr}{p^2} + 2pr \log \frac{p^2 + 2pr}{2pr} + q^2 \log \frac{q^2 + 2qr}{q^2} + 2qr \log \frac{q^2 + 2qr}{2qr}
            = p^2 \log \frac{p + 2r}{p} + 2pr \log \frac{p + 2r}{2r} + q^2 \log \frac{q + 2r}{q} + 2qr \log \frac{q + 2r}{2r}.

Theorem 1.3 Let p = (p_1, p_2, ..., p_n) and q = (q_1, q_2, ..., q_n) be probability distributions. For 0 ≤ λ ≤ 1, the following inequality holds true:

H(λp + (1 − λ)q) ≥ λH(p) + (1 − λ)H(q).   (1.11)

Proof The function x log x is convex, so we have

(λp_i + (1 − λ)q_i) \log(λp_i + (1 − λ)q_i) ≤ λ p_i \log p_i + (1 − λ) q_i \log q_i.

Summing the above inequality with respect to i, we have inequality (1.11).
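As a numerical check of (1.7) and (1.10), the following Python sketch (illustrative, not from the original text) computes H(p), D(p‖q), and D(q‖p) for the distributions of Tables 1.3 and 1.4, and confirms that the entropy is maximized by the uniform distribution.

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -sum p_i log p_i in nats, with 0 log 0 = 0 (Remark 1.3)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl(p, q):
    """Kullback-Leibler information D(p || q) = sum p_i log(p_i / q_i), Eq. (1.10)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [1/4, 1/2, 1/4]   # Table 1.3
q = [1/8, 3/8, 1/2]   # Table 1.4

print(entropy(p))                      # (3/2) log 2 ≈ 1.040 (Example 1.4)
print(kl(p, q), kl(q, p))              # ≈ 0.144 and ≈ 0.152 (Example 1.5); note the asymmetry

# H(p) is maximized by the uniform distribution (Theorem 1.1 with q_i = 1/n)
uniform = [1/3, 1/3, 1/3]
print(entropy(uniform), math.log(3))   # both ≈ 1.0986: the uniform attains log n
print(entropy(p) <= entropy(uniform))  # True
```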
1.5 Entropy of Joint Event

Let X and Y be categorical random variables having sample spaces Ω_X = {C_1, C_2, ..., C_I} and Ω_Y = {D_1, D_2, ..., D_J}, respectively. In what follows, the sample spaces are simply denoted as Ω_X = {1, 2, ..., I} and Ω_Y = {1, 2, ..., J}, as long as no confusion occurs in the context. Let π_{ij} = P(X = i, Y = j), i = 1, 2, ..., I; j = 1, 2, ..., J; π_{i+} = P(X = i), i = 1, 2, ..., I; and let π_{+j} = P(Y = j), j = 1, 2, ..., J.

Definition 1.4 The joint entropy of X and Y is defined as

H(X, Y) = − \sum_{i=1}^{I} \sum_{j=1}^{J} π_{ij} \log π_{ij}.

With respect to the joint entropy, we have the following theorem:

Theorem 1.4 For categorical random variables X and Y,

H(X, Y) ≤ H(X) + H(Y).   (1.12)

Proof From Theorem 1.1,

H(X, Y) = − \sum_{i=1}^{I} \sum_{j=1}^{J} π_{ij} \log π_{ij} ≤ − \sum_{i=1}^{I} \sum_{j=1}^{J} π_{ij} \log(π_{i+} π_{+j}) = − \sum_{i=1}^{I} \sum_{j=1}^{J} π_{ij} \log π_{i+} − \sum_{i=1}^{I} \sum_{j=1}^{J} π_{ij} \log π_{+j} = H(X) + H(Y).

Hence, inequality (1.12) follows. The equality holds if and only if π_{ij} = π_{i+} π_{+j}, i = 1, 2, ..., I; j = 1, 2, ..., J, i.e., X and Y are statistically independent.

From the above theorem, the following entropy is interpreted as that reduced due to the association between variables X and Y:

H(X) + H(Y) − H(X, Y).   (1.13)

The stronger the association, the greater this entropy. The image of the entropies H(X), H(Y), and H(X, Y) is illustrated in Fig. 1.1. Entropy H(X) is expressed by the left ellipse, H(Y) by the right ellipse, and H(X, Y) by the union of the two ellipses.

Fig. 1.1 Image of Entropy H(X), H(Y) and H(X, Y)
Table 1.5 Joint distribution of X and Y

X \ Y   1      2      3      4      Total
1       1/8    1/8    0      0      1/4
2       1/16   1/8    1/16   0      1/4
3       0      1/16   1/8    1/16   1/4
4       0      0      1/8    1/8    1/4
Total   3/16   5/16   5/16   3/16   1
From Theorem 1.4, a more general proposition can be derived. We have the following corollary:

Corollary 1.1 Let X_1, X_2, ..., X_K be categorical random variables. Then, the following inequality holds true:

H(X_1, X_2, ..., X_K) ≤ \sum_{k=1}^{K} H(X_k).

The equality holds if and only if the categorical random variables are statistically independent.

Proof From Theorem 1.4, we have

H((X_1, X_2, ..., X_{K−1}), X_K) ≤ H(X_1, X_2, ..., X_{K−1}) + H(X_K).

Hence, inductively the corollary follows.

Example 1.7 The joint probability distribution of categorical variables X and Y is given in Table 1.5. From this table, we have

H(X) = 4 × \frac{1}{4} \log 4 = 2 \log 2 ≈ 1.386,
H(Y) = 2 × \frac{3}{16} \log \frac{16}{3} + 2 × \frac{5}{16} \log \frac{16}{5} = 4 \log 2 − \frac{3}{8} \log 3 − \frac{5}{8} \log 5 ≈ 1.355,
H(X, Y) = 6 × \frac{1}{8} \log 8 + 4 × \frac{1}{16} \log 16 = \frac{13}{4} \log 2 ≈ 2.253.

From this, the entropy reduced through the association between X and Y is calculated as

H(X) + H(Y) − H(X, Y) ≈ 1.386 + 1.355 − 2.253 = 0.488.
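The following Python sketch (illustrative only) reproduces the Example 1.7 computations directly from the joint distribution in Table 1.5.

```python
import math

def entropy(probs):
    """H = -sum p log p over positive probabilities (0 log 0 = 0)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Table 1.5: rows X = 1..4, columns Y = 1..4
joint = [
    [1/8,  1/8,  0,    0   ],
    [1/16, 1/8,  1/16, 0   ],
    [0,    1/16, 1/8,  1/16],
    [0,    0,    1/8,  1/8 ],
]

p_x = [sum(row) for row in joint]                  # marginal distribution of X
p_y = [sum(col) for col in zip(*joint)]            # marginal distribution of Y

H_x  = entropy(p_x)                                # ≈ 1.386
H_y  = entropy(p_y)                                # ≈ 1.355
H_xy = entropy([p for row in joint for p in row])  # ≈ 2.253

# Entropy reduced by the association, Eq. (1.13)
print(H_x, H_y, H_xy, H_x + H_y - H_xy)            # last value ≈ 0.488
```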
1.6 Conditional Entropy

Let X and Y be categorical random variables having sample spaces Ω_X = {1, 2, ..., I} and Ω_Y = {1, 2, ..., J}, respectively. The conditional probability of Y = j given X = i is given by

P(Y = j | X = i) = \frac{P(X = i, Y = j)}{P(X = i)}.

Definition 1.5 The conditional entropy of Y given X = i is defined by

H(Y | X = i) = − \sum_{j=1}^{J} P(Y = j | X = i) \log P(Y = j | X = i),

and that of Y given X, H(Y | X), is defined by taking the expectation of the above entropy with respect to X:

H(Y | X) = − \sum_{i=1}^{I} \sum_{j=1}^{J} P(X = i, Y = j) \log P(Y = j | X = i).   (1.14)

With respect to the conditional entropy, we have the following theorem [6]:

Theorem 1.5 Let H(X, Y) and H(X) be the joint entropy of (X, Y) and the entropy of X, respectively. Then, the following equality holds true:

H(Y | X) = H(X, Y) − H(X).   (1.15)

Proof From (1.14), we have

H(Y | X) = − \sum_{i=1}^{I} \sum_{j=1}^{J} P(X = i, Y = j) \log \frac{P(X = i, Y = j)}{P(X = i)}
         = − \sum_{i=1}^{I} \sum_{j=1}^{J} P(X = i, Y = j) \log P(X = i, Y = j) + \sum_{i=1}^{I} \sum_{j=1}^{J} P(X = i, Y = j) \log P(X = i)
         = H(X, Y) + \sum_{i=1}^{I} P(X = i) \log P(X = i) = H(X, Y) − H(X).

This completes the theorem.

With respect to the conditional entropy, we also have the following theorem:
Theorem 1.6 For the entropy H(Y) and the conditional entropy H(Y | X), the following inequality holds:

H(Y | X) ≤ H(Y).   (1.16)

The equality holds if and only if X and Y are statistically independent.

Proof From (1.12) and (1.15), we have

H(Y) − H(Y | X) = H(Y) − (H(X, Y) − H(X)) = H(X) + H(Y) − H(X, Y) ≥ 0.

Hence, the inequality follows. From Theorem 1.4, the equality in (1.16) holds true if and only if X and Y are statistically independent. This completes the theorem.

From the above theorem, the following entropy is interpreted as that of Y explained by X:

I_M(X, Y) = H(Y) − H(Y | X).   (1.17)

From (1.17) and (1.15) we have

I_M(X, Y) = H(Y) − (H(X, Y) − H(X)) = H(X) + H(Y) − H(X, Y).   (1.18)

This implies that I_M(X, Y) is symmetric with respect to X and Y, i.e., I_M(X, Y) = I_M(Y, X). Thus, I_M(X, Y) is the entropy reduced by the association between the variables and is referred to as the mutual information. Moreover, we have the following theorem:

Theorem 1.7 For the entropy measures H(X, Y), H(X), and H(Y), the following inequality holds true:

H(X, Y) ≤ H(X) + H(Y).   (1.19)

The equality holds true if and only if X and Y are statistically independent.

Proof From (1.15) and (1.16), (1.19) follows.

The mutual information (1.17) is expressed by the KL information:

I_M(X, Y) = \sum_{i=1}^{I} \sum_{j=1}^{J} P(X = i, Y = j) \log \frac{P(X = i, Y = j)}{P(X = i) P(Y = j)}.
Fig. 1.2 Image of entropy measures
From the above theorems, the entropy measures are illustrated as in Fig. 1.2. Theil [8] used entropy to measure the association between an independent variable X and a dependent variable Y. With respect to the mutual information (1.18), the following definition is given [2, 3].

Definition 1.6 The ratio of I_M(X, Y) to H(Y), i.e.,

C(Y | X) = \frac{I_M(X, Y)}{H(Y)} = \frac{H(Y) − H(Y | X)}{H(Y)},   (1.20)

is defined as the coefficient of uncertainty.

The above ratio is that of the explained (reduced) entropy of Y by X, and we have 0 ≤ C(Y | X) ≤ 1.

Example 1.8 Table 1.6 shows the joint probability distribution of categorical variables X and Y. In this table, the following entropy measures of the variables are computed:

H(X, Y) = 8 × \frac{1}{8} \log 8 = 3 \log 2,
H(X) = 4 × \frac{1}{4} \log 4 = 2 \log 2, \quad H(Y) = 4 × \frac{1}{4} \log 4 = 2 \log 2.
From the above entropy measures, we have

I_M(X, Y) = H(X) + H(Y) − H(X, Y) = \log 2.

Table 1.6 Joint probability distribution of categorical variables X and Y

X \ Y   1     2     3     4     Total
1       1/8   1/8   0     0     1/4
2       1/8   1/8   0     0     1/4
3       0     0     1/8   1/8   1/4
4       0     0     1/8   1/8   1/4
Total   1/4   1/4   1/4   1/4   1
From the above entropy, the coefficient of uncertainty is obtained as

C(Y | X) = \frac{\log 2}{2 \log 2} = \frac{1}{2}.
From this, 50% of the entropy (information) of Y is explained (reduced) by X.

In general, Theorem 1.6 can be extended as follows:

Theorem 1.8 Let X_i, i = 1, 2, ..., and Y be categorical random variables. Then, the following inequality holds true:

H(Y | X_1, X_2, ..., X_i, X_{i+1}) ≤ H(Y | X_1, X_2, ..., X_i).   (1.21)

The equality holds if and only if X_{i+1} and Y are conditionally independent given (X_1, X_2, ..., X_i).

Proof It is sufficient to prove the case i = 1. As in Theorem 1.5, from (1.12) and (1.15) we have

H(X_2, Y | X_1) ≤ H(X_2 | X_1) + H(Y | X_1), \quad H(Y | X_1, X_2) = H(X_2, Y | X_1) − H(X_2 | X_1).

By adding both sides of the above inequality and equation, it follows that

H(Y | X_1, X_2) ≤ H(Y | X_1).

Hence, inductively the theorem follows.
From Theorem 1.8, with respect to the coefficient of uncertainty (1.20), the following corollary follows.

Corollary 1.2 Let X_i, i = 1, 2, ..., and Y be categorical random variables. Then,

C(Y | X_1, X_2, ..., X_i, X_{i+1}) ≥ C(Y | X_1, X_2, ..., X_i).

Proof From (1.21),

C(Y | X_1, X_2, ..., X_i, X_{i+1}) = 1 − \frac{H(Y | X_1, X_2, ..., X_i, X_{i+1})}{H(Y)} ≥ 1 − \frac{H(Y | X_1, X_2, ..., X_i)}{H(Y)} = C(Y | X_1, X_2, ..., X_i).

This completes the corollary.
The above corollary shows that increasing the number of explanatory variables X i induces an increase of the explanatory power to explain the response variable Y.
Table 1.7 Joint probability distribution of X_1, X_2, and Y

X_1   X_2    Y = 0   Y = 1   Total
0     0      3/32    1/32    1/8
0     1      1/32    3/32    1/8
1     0      1/16    1/4     5/16
1     1      1/8     5/16    7/16
Total        5/16    11/16   1
The above property is preferable for measuring the predictive or explanatory power of the explanatory variables X_i for the response variable Y.

Example 1.9 The joint probability distribution of X_1, X_2, and Y is given in Table 1.7. Let us compare the entropy measures H(Y), H(Y | X_1), and H(Y | X_1, X_2). First, H(Y), H(X_1), H(X_1, X_2), H(X_1, Y), and H(X_1, X_2, Y) are calculated. From Table 1.7, we have

H(Y) = − \frac{5}{16} \log \frac{5}{16} − \frac{11}{16} \log \frac{11}{16} ≈ 0.621,
H(X_1) = − \frac{1}{4} \log \frac{1}{4} − \frac{3}{4} \log \frac{3}{4} ≈ 0.563,

and, similarly, H(X_1, X_2) ≈ 1.418, H(X_1, Y) ≈ 1.157, and H(X_1, X_2, Y) ≈ 1.695. From the above measures of entropy, we have the following conditional entropy measures:

H(Y | X_1) = H(X_1, Y) − H(X_1) ≈ 1.157 − 0.563 = 0.595,
H(Y | X_1, X_2) = H(X_1, X_2, Y) − H(X_1, X_2) ≈ 1.695 − 1.418 = 0.277.

From these results, the inequalities in Theorems 1.6 and 1.8 hold true:

H(Y | X_1, X_2) < H(Y | X_1) < H(Y).

Moreover, we have
C(Y | X_1, X_2) = \frac{0.621 − 0.277}{0.621} ≈ 0.554, \quad C(Y | X_1) = \frac{0.621 − 0.595}{0.621} ≈ 0.042.
From the above coefficients of uncertainty, the entropy of Y is reduced by about 4% by explanatory variable X_1; however, variables X_1 and X_2 together reduce the entropy by about 55%. The above coefficients of uncertainty demonstrate Corollary 1.2.

The joint entropy can be decomposed into conditional entropies. We have the following theorem.

Theorem 1.9 Let X_i, i = 1, 2, ..., n, be categorical random variables. Then,

H(X_1, X_2, ..., X_n) = H(X_1) + H(X_2 | X_1) + ... + H(X_n | X_1, X_2, ..., X_{n−1}).

Proof The random variables (X_1, X_2, ..., X_n) are divided into (X_1, X_2, ..., X_{n−1}) and X_n. From Theorem 1.5, it follows that

H(X_1, X_2, ..., X_n) = H(X_1, X_2, ..., X_{n−1}) + H(X_n | X_1, X_2, ..., X_{n−1}).
Hence, the theorem follows. The above theorem gives a recursive decomposition of the joint entropy.
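The chain rule (1.15) and the coefficient of uncertainty (1.20) are easy to verify numerically. The Python sketch below (illustrative only) recomputes Example 1.8 from the Table 1.6 joint distribution; C(Y|X) comes out to exactly 0.5.

```python
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

# Table 1.6: rows X = 1..4, columns Y = 1..4
joint = [
    [1/8, 1/8, 0,   0  ],
    [1/8, 1/8, 0,   0  ],
    [0,   0,   1/8, 1/8],
    [0,   0,   1/8, 1/8],
]

p_x = [sum(row) for row in joint]
p_y = [sum(col) for col in zip(*joint)]

H_x, H_y = entropy(p_x), entropy(p_y)
H_xy = entropy([p for row in joint for p in row])

H_y_given_x = H_xy - H_x            # chain rule (1.15)
I_m = H_y - H_y_given_x             # mutual information (1.17)
C = I_m / H_y                       # coefficient of uncertainty (1.20)
print(H_y_given_x, I_m, C)          # ≈ 0.693 (= log 2), ≈ 0.693, 0.5
```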
1.7 Test of Goodness-of-Fit

Let X be a categorical random variable with sample space Ω = {1, 2, ..., I}; let p_i be the hypothesized probability of X = i, i = 1, 2, ..., I; and let n_i be the number of observations with X = i, i = 1, 2, ..., I (Table 1.8).

Table 1.8 Probability distribution and the numbers of observations

X            1     2     …     I     Total
Probability  p_1   p_2   …     p_I   1
Frequency    n_1   n_2   …     n_I   N

Then, for testing the goodness-of-fit of the model, the following log-likelihood ratio test statistic [9] is used:

G_1^2 = 2 \sum_{i=1}^{I} n_i \log \frac{n_i}{N p_i}.   (1.22)

Let \hat{p}_i be the estimators of p_i, i = 1, 2, ..., I, i.e.,

\hat{p}_i = \frac{n_i}{N}.

Then, statistic (1.22) is expressed as follows:
G_1^2 = 2N \sum_{i=1}^{I} \frac{n_i}{N} \log \frac{n_i / N}{p_i} = 2N \sum_{i=1}^{I} \hat{p}_i \log \frac{\hat{p}_i}{p_i} = 2N · D(\hat{p} ‖ p),   (1.23)

where p = (p_1, p_2, ..., p_I) and \hat{p} = (\hat{p}_1, \hat{p}_2, ..., \hat{p}_I). From this, the above statistic is 2N times the KL information. As N → ∞, \hat{p}_i → p_i, i = 1, 2, ..., I (in probability). For x ≈ 1,

\log x ≈ (x − 1) − \frac{1}{2}(x − 1)^2.

For a large sample size N, it follows that p_i / \hat{p}_i ≈ 1 (in probability). From this, noting that the first-order term vanishes since \sum_i p_i = \sum_i \hat{p}_i = 1, we have

G_1^2 = −2N \sum_{i=1}^{I} \hat{p}_i \log \frac{p_i}{\hat{p}_i} ≈ N \sum_{i=1}^{I} \hat{p}_i \left( \frac{p_i}{\hat{p}_i} − 1 \right)^2 = N \sum_{i=1}^{I} \frac{(\hat{p}_i − p_i)^2}{\hat{p}_i} ≈ N \sum_{i=1}^{I} \frac{(\hat{p}_i − p_i)^2}{p_i} = \sum_{i=1}^{I} \frac{(n_i − N p_i)^2}{N p_i}.
Let us set

X^2 = \sum_{i=1}^{I} \frac{(n_i − N p_i)^2}{N p_i}.   (1.24)

The above statistic is the chi-square statistic [5]. Hence, the log-likelihood ratio statistic G_1^2 is asymptotically equal to the chi-square statistic (1.24). For a given distribution p, the statistic (1.22) is asymptotically distributed according to a chi-square distribution with I − 1 degrees of freedom as N → ∞.

Let us consider the following statistic, similar to (1.22):

G_2^2 = 2 \sum_{i=1}^{I} N p_i \log \frac{N p_i}{n_i}.   (1.25)

The above statistic is described as in (1.23):

G_2^2 = 2N \sum_{i=1}^{I} p_i \log \frac{p_i}{\hat{p}_i} = 2N · D(p ‖ \hat{p}).

For large N, given the probability distribution p, it follows that
G_2^2 = −2N \sum_{i=1}^{I} p_i \log \frac{\hat{p}_i}{p_i} ≈ N \sum_{i=1}^{I} p_i \left( \frac{\hat{p}_i}{p_i} − 1 \right)^2 = \sum_{i=1}^{I} \frac{(n_i − N p_i)^2}{N p_i}.

Table 1.9 Probability distribution and the numbers of observations

X            1    2    3    4    5    6    Total
Probability  1/6  1/6  1/6  1/6  1/6  1/6  1
Frequency    12   13   19   14   19   23   100

Table 1.10 Data produced with distribution q = (1/6, 1/12, 1/4, 1/4, 1/6, 1/12)

X          1    2    3    4    5    6    Total
Frequency  12   8    24   23   22   11   100
From the above result, given p = (p_1, p_2, ..., p_I), the three statistics are asymptotically equivalent, i.e.,

G_1^2 ≈ G_2^2 ≈ X^2 \ (\text{in probability}).

Example 1.10 For the uniform distribution p = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6), data with sample size 100 were generated, and the results are shown in Table 1.9. From this table, we have

G_1^2 = 5.55, \quad G_2^2 = 5.57, \quad X^2 = 5.60.

The three statistics are approximately equal, and the number of degrees of freedom of the chi-square distribution is 5, so the results are not statistically significant.
X 2 = 15.1.
Since the degrees of freedom is 5, the critical point of significance level 0.05 is 11.1 and the three statistics lead to the same result, i.e., the null hypothesis is rejected.
1.8 Maximum Likelihood Estimation of Event Probabilities

Let X be a categorical random variable with sample space Ω = {1, 2, ..., I}; let p_i(θ_1, θ_2, ..., θ_K) be the hypothesized probability of X = i, i = 1, 2, ..., I, where
θ_k, k = 1, 2, ..., K, are parameters; and let n_i be the number of observations with X = i, i = 1, 2, ..., I. Then, the log likelihood function is given by

l(θ_1, θ_2, ..., θ_K) = \sum_{i=1}^{I} n_i \log p_i(θ_1, θ_2, ..., θ_K).   (1.26)

The maximum likelihood (ML) estimation is carried out by maximizing the above log likelihood with respect to θ_k, k = 1, 2, ..., K. Let \hat{θ}_k be the ML estimators of the parameters θ_k. Then, the maximum of (1.26) is expressed as

l(\hat{θ}_1, \hat{θ}_2, ..., \hat{θ}_K) = N \sum_{i=1}^{I} \frac{n_i}{N} \log p_i(\hat{θ}_1, \hat{θ}_2, ..., \hat{θ}_K).
Concerning the data, the following definition is given:

Definition 1.7 The negative data entropy (information) in Table 1.11 is defined by

l_{data} = \sum_{i=1}^{I} n_i \log \frac{n_i}{N}.   (1.27)

Since

l_{data} = N \sum_{i=1}^{I} \frac{n_i}{N} \log \frac{n_i}{N},

the negative data entropy (1.27) is N times the negative entropy of the distribution (n_1/N, n_2/N, ..., n_I/N). The log-likelihood ratio statistic for testing the goodness-of-fit of the model is given by

G^2(\hat{θ}) = 2(l_{data} − l(\hat{θ})) = 2N \sum_{i=1}^{I} \frac{n_i}{N} \log \frac{n_i / N}{p_i(\hat{θ})},   (1.28)

where \hat{θ} = (\hat{θ}_1, \hat{θ}_2, ..., \hat{θ}_K).
Table 1.11 Data of X

X                   1      2      …     I      Total
Relative frequency  n_1/N  n_2/N  …     n_I/N  1
Frequency           n_1    n_2    …     n_I    N
For simplicity of notation, let us set

\left( \frac{n_i}{N} \right) = \left( \frac{n_1}{N}, \frac{n_2}{N}, ..., \frac{n_I}{N} \right) \quad \text{and} \quad (p_i(\hat{θ})) = (p_1(\hat{θ}), p_2(\hat{θ}), ..., p_I(\hat{θ})).

Then, the log-likelihood ratio test statistic is described as follows:

G^2(\hat{θ}) = 2N × D\left( \left( \frac{n_i}{N} \right) \middle\| (p_i(\hat{θ})) \right).   (1.29)

Hence, from (1.29), the ML estimation minimizes the difference between the data (n_i/N) and the model (p_i(θ)) in terms of the KL information, and statistic (1.29) is used for testing the goodness-of-fit of the model (p_i(θ)) to the data. Under the null hypothesis H_0: model (p_i(θ)), statistic (1.29) is asymptotically distributed according to a chi-square distribution with I − 1 − K degrees of freedom [1].
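As an illustration of Sect. 1.8, the following Python sketch (hypothetical counts; a crude grid search stands in for a proper optimizer) fits a one-parameter model p(θ) = ((1 − θ)², 2θ(1 − θ), θ²) to observed frequencies by maximizing (1.26), which is equivalent to minimizing the KL information in (1.29), and then computes G²(θ̂) with I − 1 − K = 3 − 1 − 1 = 1 degree of freedom.

```python
import math

n = [38, 46, 16]                       # hypothetical counts for categories 0, 1, 2
N = sum(n)

def model(theta):
    """Hypothesized cell probabilities p_i(theta) of a binomial(2, theta) model."""
    return [(1 - theta) ** 2, 2 * theta * (1 - theta), theta ** 2]

def loglik(theta):
    """Log likelihood (1.26)."""
    return sum(ni * math.log(pi) for ni, pi in zip(n, model(theta)))

# ML estimation by grid search over theta in (0, 1)
grid = [k / 10000 for k in range(1, 10000)]
theta_hat = max(grid, key=loglik)
print(theta_hat)                       # close to the closed-form MLE (n_1 + 2*n_2) / (2N) = 0.39

# goodness-of-fit statistic (1.28)-(1.29): 2N * D(data || model)
p_hat = [ni / N for ni in n]
g2 = 2 * N * sum(ph * math.log(ph / pm) for ph, pm in zip(p_hat, model(theta_hat)) if ph > 0)
print(g2)                              # compare with a chi-square distribution with 1 df
```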
1.9 Continuous Variables and Entropy

Let X be a continuous random variable that has density function f(x). As in the above discussion, we define the information of X = x by

\log \frac{1}{f(x)}.

Since the density f(x) is not a probability, the above quantity does not have the meaning of information in a strict sense; however, in an analogous manner, the entropy of continuous variables is defined as follows [6]:

Definition 1.8 The entropy of a continuous variable X with density function f(x) is defined by

H(X) = − \int f(x) \log f(x) dx.   (1.30)

Remark 1.4 For continuous distributions, the entropy (1.30) is not necessarily positive.

Remark 1.5 For a constant a ≠ 0, if we transform variable X to Y = aX, the density function of Y, f^*(y), is

f^*(y) = f\left( \frac{y}{a} \right) \left| \frac{dx}{dy} \right| = \frac{1}{|a|} f\left( \frac{y}{a} \right).

From this, we have

H(Y) = − \int f^*(y) \log f^*(y) dy = − \int \frac{1}{|a|} f\left( \frac{y}{a} \right) \left( \log f\left( \frac{y}{a} \right) − \log |a| \right) dy = − \int f(x) \log f(x) dx + \log |a| = H(X) + \log |a|.

Hence, the entropy (1.30) is not scale-invariant.

For a differential dx, f(x)dx can be viewed as the probability of x ≤ X < x + dx. Hence,

\log \frac{1}{f(x)dx}

is the information of the event x ≤ X < x + dx. Let us compare two density functions f(x) and g(x). The following quantity is the relative information of g(x) for f(x):

\log \frac{1/(g(x)dx)}{1/(f(x)dx)} = \log \frac{f(x)}{g(x)}.

From this, we have the following definition:

Definition 1.9 The KL information between two density functions (distributions) f(x) and g(x), given that f(x) is the true density function, is defined by

D(f ‖ g) = \int f(x) \log \frac{f(x)}{g(x)} dx.   (1.31)

The above entropy is interpreted as the loss of information of the true distribution f(x) when g(x) is used for f(x). Although the entropy (1.30) is not scale-invariant, the KL information (1.31) is scale-invariant. With respect to the KL information (1.31), the following theorem holds:

Theorem 1.10 For density functions f(x) and g(x), D(f ‖ g) ≥ 0. The equality holds if and only if f(x) = g(x) almost everywhere.

Proof Since the function − log x is convex, from (1.31) and Jensen's inequality we have

D(f ‖ g) = − \int f(x) \log \frac{g(x)}{f(x)} dx ≥ − \log \int f(x) \frac{g(x)}{f(x)} dx = − \log \int g(x) dx = − \log 1 = 0.

This completes the theorem.

Fig. 1.3 N(4, 1) and N(6, 1)
Example 1.12 Let f(x) and g(x) be the normal density functions of N(μ_1, σ_1^2) and N(μ_2, σ_2^2), respectively. For σ_1^2 = σ_2^2 = σ^2 = 1, μ_1 = 1, and μ_2 = 4, the graphs of the density functions are illustrated as in Fig. 1.3. Since in general we have

\log \frac{f(x)}{g(x)} = \frac{1}{2} \left( \log \frac{σ_2^2}{σ_1^2} − \frac{(x − μ_1)^2}{σ_1^2} + \frac{(x − μ_2)^2}{σ_2^2} \right), \quad
\log \frac{g(x)}{f(x)} = \frac{1}{2} \left( \log \frac{σ_1^2}{σ_2^2} − \frac{(x − μ_2)^2}{σ_2^2} + \frac{(x − μ_1)^2}{σ_1^2} \right),

it follows that

D(f ‖ g) = \frac{1}{2} \left( \log \frac{σ_2^2}{σ_1^2} + \frac{σ_1^2}{σ_2^2} + \frac{(μ_1 − μ_2)^2}{σ_2^2} − 1 \right),   (1.32)
D(g ‖ f) = \frac{1}{2} \left( \log \frac{σ_1^2}{σ_2^2} + \frac{σ_2^2}{σ_1^2} + \frac{(μ_1 − μ_2)^2}{σ_1^2} − 1 \right).   (1.33)

For σ_1^2 = σ_2^2 = σ^2, from (1.32) and (1.33) we have

D(f ‖ g) = D(g ‖ f) = \frac{(μ_1 − μ_2)^2}{2σ^2}.   (1.34)
The above KL information is increasing in the difference |μ_1 − μ_2| and decreasing in the variance σ^2.

Let us consider a two-sample problem. Let {X_1, X_2, ..., X_m} and {Y_1, Y_2, ..., Y_n} be random samples from N(μ_1, σ^2) and N(μ_2, σ^2), respectively. For comparing the two distributions with respect to their means, we usually use the following t statistic [7]:

t = \frac{\bar{X} − \bar{Y}}{\sqrt{(m − 1)U_1^2 + (n − 1)U_2^2}} \sqrt{\frac{mn(m + n − 2)}{m + n}},

where

\bar{X} = \frac{\sum_{i=1}^{m} X_i}{m}, \quad \bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n}, \quad U_1^2 = \frac{\sum_{i=1}^{m} (X_i − \bar{X})^2}{m − 1}, \quad U_2^2 = \frac{\sum_{i=1}^{n} (Y_i − \bar{Y})^2}{n − 1}.

Since the unbiased estimator of the variance σ^2 is

U^2 = \frac{(m − 1)U_1^2 + (n − 1)U_2^2}{m + n − 2},

the above test statistic is expressed as

t = \frac{\bar{X} − \bar{Y}}{U \sqrt{\frac{m + n}{mn}}}.

Hence, the square of statistic t is

t^2 = \frac{(\bar{X} − \bar{Y})^2}{U^2} · \frac{mn}{m + n} = 2\hat{D}(f ‖ g) · \frac{mn}{m + n},

where

\hat{D}(f ‖ g) = \frac{(\bar{X} − \bar{Y})^2}{2U^2}.

From this, the KL information is related to the t^2 statistic, i.e., the F statistic.
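The relation between t² and the estimated KL information above is purely algebraic, so it can be checked on any two samples. The Python sketch below (illustrative, with made-up data) also evaluates the Gaussian KL formula (1.32).

```python
import math

def kl_normal(mu1, s1sq, mu2, s2sq):
    """KL information D(f||g) between N(mu1, s1sq) and N(mu2, s2sq), Eq. (1.32)."""
    return 0.5 * (math.log(s2sq / s1sq) + s1sq / s2sq + (mu1 - mu2) ** 2 / s2sq - 1)

print(kl_normal(1, 1, 4, 1))           # equal variances: (mu1 - mu2)^2 / (2 sigma^2) = 4.5, Eq. (1.34)

# two hypothetical samples
x = [5.1, 4.7, 6.0, 5.5, 4.9, 5.8]
y = [4.2, 3.9, 4.8, 4.4, 4.1]
m, n = len(x), len(y)
xbar, ybar = sum(x) / m, sum(y) / n
u1 = sum((v - xbar) ** 2 for v in x) / (m - 1)
u2 = sum((v - ybar) ** 2 for v in y) / (n - 1)
u = ((m - 1) * u1 + (n - 1) * u2) / (m + n - 2)        # pooled variance U^2

t = (xbar - ybar) / math.sqrt(u * (1 / m + 1 / n))     # two-sample t statistic
d_hat = (xbar - ybar) ** 2 / (2 * u)                   # estimated KL information
print(math.isclose(t ** 2, 2 * d_hat * m * n / (m + n)))  # True
```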
Example 1.13 Let the density functions f(x) and g(x) be normal density functions with variances σ_1^2 and σ_2^2, respectively, where the means are the same, μ_1 = μ_2 = μ; see, e.g., Fig. 1.4. From (1.32) and (1.33), we have

D(f ‖ g) = \frac{1}{2} \left( \log \frac{σ_2^2}{σ_1^2} − 1 + \frac{σ_1^2}{σ_2^2} \right), \quad D(g ‖ f) = \frac{1}{2} \left( \log \frac{σ_1^2}{σ_2^2} − 1 + \frac{σ_2^2}{σ_1^2} \right).

In this case, D(f ‖ g) ≠ D(g ‖ f) in general.

Fig. 1.4 Graphs of density functions of N(5, 1) and N(5, 4)
Let

x = \frac{σ_2^2}{σ_1^2}, \quad D(f ‖ g) = \frac{1}{2} \left( \log x − 1 + \frac{1}{x} \right).

The graph of D(f ‖ g) is illustrated in Fig. 1.5. This function is increasing in x > 1 and decreasing in 0 < x < 1.

Let {X_1, X_2, ..., X_m} and {Y_1, Y_2, ..., Y_n} be random samples from N(μ_1, σ_1^2) and N(μ_2, σ_2^2), respectively. Then, the F statistic for testing the equality of the variances is given by

F = \frac{U_1^2}{U_2^2},   (1.35)

Fig. 1.5 Graph of D(f ‖ g) = \frac{1}{2}(\log x − 1 + \frac{1}{x}), x = σ_2^2 / σ_1^2
where

\bar{X} = \frac{\sum_{i=1}^{m} X_i}{m}, \quad \bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n}, \quad U_1^2 = \frac{\sum_{i=1}^{m} (X_i − \bar{X})^2}{m − 1}, \quad U_2^2 = \frac{\sum_{i=1}^{n} (Y_i − \bar{Y})^2}{n − 1}.

Under the null hypothesis H_0: σ_1^2 = σ_2^2, statistic F is distributed according to the F distribution with m − 1 and n − 1 degrees of freedom. Under the null hypothesis μ_1 = μ_2 = μ, the estimators of the KL information D(f ‖ g) and D(g ‖ f) are given by

\hat{D}(f ‖ g) = \frac{1}{2} \left( \log \frac{U_1^2}{U_2^2} − 1 + \frac{U_2^2}{U_1^2} \right) = \frac{1}{2} \left( \log F − 1 + \frac{1}{F} \right),
\hat{D}(g ‖ f) = \frac{1}{2} \left( \log \frac{U_2^2}{U_1^2} − 1 + \frac{U_1^2}{U_2^2} \right) = \frac{1}{2} \left( − \log F − 1 + F \right).

Although the above KL information measures are not symmetric with respect to the statistics F and 1/F, the following measure is a symmetric function of F and 1/F:

\hat{D}(f ‖ g) + \hat{D}(g ‖ f) = \frac{1}{2} \left( F + \frac{1}{F} − 2 \right).
Example 1.14 Let f(x) and g(x) be the following exponential distributions:

f(x) = λ_1 \exp(−λ_1 x), \quad g(x) = λ_2 \exp(−λ_2 x), \quad λ_1, λ_2 > 0.

Since

\log \frac{f(x)}{g(x)} = \log \frac{λ_1}{λ_2} − λ_1 x + λ_2 x, \quad \log \frac{g(x)}{f(x)} = \log \frac{λ_2}{λ_1} − λ_2 x + λ_1 x,

we have

D(f ‖ g) = \log \frac{λ_1}{λ_2} − 1 + \frac{λ_2}{λ_1} \quad \text{and} \quad D(g ‖ f) = \log \frac{λ_2}{λ_1} − 1 + \frac{λ_1}{λ_2}.
Example 1.15 (Property of the Normal Distribution in Entropy) [6] Let f(x) be a density function of a continuous random variable X that satisfies the following conditions:

\int x^2 f(x) dx = σ^2 \quad \text{and} \quad \int f(x) dx = 1.   (1.36)

Then, the entropy is maximized with respect to the density function f(x) as follows. For Lagrange multipliers λ and μ, we set

w = f(x) \log f(x) + λ x^2 f(x) + μ f(x).

Differentiating the function w with respect to f(x), we have

\frac{d}{d f(x)} w = \log f(x) + 1 + λ x^2 + μ = 0.

From this, it follows that f(x) = \exp(−1 − μ − λ x^2). Determining the multipliers according to (1.36), we have

f(x) = \frac{1}{\sqrt{2π σ^2}} \exp\left( −\frac{x^2}{2σ^2} \right).

Hence, the normal distribution has the maximum entropy given the variance.

Example 1.16 Let X be a continuous random variable on the finite interval (a, b) with density f(x). The function that maximizes the entropy

H(X) = − \int_a^b f(x) \log f(x) dx

is determined under the condition

\int_a^b f(x) dx = 1.

As in Example 1.15, for a Lagrange multiplier λ, the following function is made:

w = f(x) \log f(x) + λ f(x).

By differentiating the above function with respect to f(x), we have

\frac{d}{d f(x)} w = \log f(x) + 1 + λ = 0.

From this, f(x) = \exp(−λ − 1) = constant. Hence, we have the following uniform distribution:
f(x) = \frac{1}{b − a}.
As discussed in Sect. 1.4, the entropy (1.7) of a discrete sample space or variable is maximized by the uniform distribution P(ω_i) = 1/n, i = 1, 2, ..., n. The above result for the continuous distribution is similar to that for the discrete case.
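A simple numerical illustration of Example 1.15 (not in the original text): using the standard closed-form differential entropies of the normal, uniform, and Laplace distributions, the sketch below compares distributions that share the same variance; the normal has the largest entropy.

```python
import math

sigma2 = 2.0  # common variance

# closed-form differential entropies (in nats) for variance sigma2
h_normal  = 0.5 * math.log(2 * math.pi * math.e * sigma2)
h_uniform = 0.5 * math.log(12 * sigma2)        # uniform of width w: variance w^2/12, entropy log w
h_laplace = 1 + 0.5 * math.log(2 * sigma2)     # Laplace with scale b: variance 2b^2, entropy 1 + log(2b)

print(h_normal, h_uniform, h_laplace)
print(h_normal > h_uniform and h_normal > h_laplace)  # True: the normal maximizes entropy for fixed variance
```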
1.10 Discussion

Information theory and statistics use "probability" as an important tool for measuring the uncertainty of events, random variables, and the data processed in communications and in statistical data analysis. Based on this common point, the present chapter has discussed how to take a look at "statistics" from the viewpoint of "entropy." Reviewing information theory, the present chapter has shown that the likelihood ratio, t, and F statistics can be expressed with entropy. In order to devise novel and innovative methods for statistical data analysis, it is a good approach to borrow ideas from information theory, and there are possibilities to resolve problems in statistical data analysis by using information theory.
References

1. Agresti, A. (2002). Categorical data analysis. Hoboken: Wiley.
2. Joe, H. (1989). Relative entropy measures of multivariate dependence. Journal of the American Statistical Association, 84, 157–164.
3. Haberman, S. J. (1982). Analysis of dispersion of multinomial responses. Journal of the American Statistical Association, 77, 568–580.
4. Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22, 79–86.
5. Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, 5, 157–175. https://doi.org/10.1080/14786440009463897.
6. Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27, 379–423.
7. Student. (1908). The probable error of a mean. Biometrika, 6, 1–25.
8. Theil, H. (1970). On the estimation of relationships involving qualitative variables. American Journal of Sociology, 76, 103–154.
9. Wilks, S. S. (1938). The large-sample distribution of the likelihood ratio for testing composite hypotheses. Annals of Mathematical Statistics, 9, 60–62. https://doi.org/10.1214/aoms/1177732360.
Chapter 2
Analysis of the Association in Two-Way Contingency Tables
2.1 Introduction

In categorical data analysis, it is a basic approach to discuss the analysis of association in two-way contingency tables. The usual approach for analyzing the association is to compute odds, odds ratios, and relative risks in the contingency tables; however, as the numbers of categories of the categorical variables concerned increase, the number of odds ratios to be calculated also increases, and the interpretation of the results becomes complicated. In this sense, processing contingency tables is more difficult than processing continuous variables, because the association (correlation) between two continuous variables can be measured with the correlation coefficient between them in the bivariate normal distribution. Then, especially for contingency tables of nominal variables, it is useful to summarize the association by a single measure, although some information in the tables is lost. For analyzing the association in ordinal two-way contingency tables, the RC(1) association model [11], which has a structure similar to the bivariate normal distribution [3, 4, 12, 13], was proposed. The model was extended to the RC(M) association model with M canonical association components [13, 14], and [6] showed that the RC(M) association model is a discretized version of the multivariate normal distribution in canonical correlation analysis. A generalized version of the RC(M) association model was discussed in terms of canonical exponential models by Eshima [7]. It is a good approach to treat the association in two-way contingency tables by a method that is similar to correlation analysis in the normal distribution. In this chapter, a summary measure of association in contingency tables is discussed from a viewpoint of entropy. In Sect. 2.2, odds, odds ratios, and relative risks are reviewed, and their logarithms are interpreted through information theory [8]. Section 2.3 discusses the association between binary variables, and the entropy covariance and the entropy correlation coefficient of the variables are proposed. It is shown that the entropy covariance is expressed by the KL information and that the entropy correlation coefficient is the absolute value of the Pearson correlation coefficient of binary variables. In Sect. 2.4,
properties of the maximum likelihood estimator of the odds ratio are discussed in terms of entropy. It is shown that the Pearson chi-square statistic for testing the independence of binary variables is related to the square of the entropy correlation coefficient, and the log-likelihood ratio test statistic to the KL information. Section 2.5 considers the association in general two-way contingency tables based on odds ratios and the log-likelihood ratio test statistic. In Sect. 2.6, the RC(M) association model, which was designed for analyzing the association in general two-way contingency tables, is considered in view of entropy. Desirable properties of the model are discussed, and the entropy correlation coefficient is expressed by using the intrinsic association parameters and the correlation coefficients of scores assigned to the categorical variables concerned.
2.2 Odds, Odds Ratio, and Relative Risk Let X and Y be binary random variables that take 0 or 1. The responses 0 and 1 are formal expressions of categories for convenience of notations. In real data analyses, responses are “negative” and “positive”, “agreement” and “disagreement”, “yes” or “no”, and so on. Table 2.1 shows the joint probabilities of the variables. The odds of Y = 1 instead of Y = 0 given X = i, i = 1, 2 are calculated by i =
P(X = i, Y = 1) P(Y = 1|X = i) = , i = 0, 1, P(Y = 0|X = i) P(X = i, Y = 0)
(2.1)
for P(X = i, Y = 0) = 0, i = 0, 1. The above odds are basic association measures for change of responses in Y , given X = i, i = 1, 2. The logarithms of the odds are referred to as log odds and are given as follows: logi = (−logP(Y = 0|X = i)) − (−logP(Y = 1|X = i)), i = 0, 1. From this, log odds are interpreted as a decrease of uncertainty of response Y = 1 for Y = 0, given X = i. Table 2.1 Joint probability distribution of X and Y
2.2 Odds, Odds Ratio, and Relative Risk
31
Definition 2.1 In Table 2.1, the odds ratio (OR) is defined by 1 P(Y = 1|X = 1)P(Y = 0|X = 0) . = 0 P(Y = 0|X = 1)P(Y = 1|X = 0)
(2.2)
The above OR is interpreted as the association or effect of variable X on variable Y . From (2.1), we have 1 P(X = 1|Y = 1)P(X = 0|Y = 0) P(X = 1, Y = 1)P(X = 0, Y = 0) = . = 0 P(X = 1, Y = 0)P(X = 0, Y = 1) P(X = 0|Y = 1)P(X = 1|Y = 0) (2.3) Hence, odds ratio is the ratio of the cross products in Table 2.1 and symmetric for X and Y , and the odds ratio is a basic association measure to analyze categorical data. Remark 2.1 With respect to odds ratio expressed in (2.2) and (2.3), if variable X precedes Y , or X is a cause and Y is effect, the first expression of odds ratio (2.2) implies a prospective expression, and the second expression (2.3) is a retrospective expression. Thus, the odds ratio is the same for both samplings, e.g., a prospective sampling in a clinical trial and a retrospective sampling in a case-control study. From (2.2), the log odds ratio is decomposed as follows: log
1 = (logP(Y = 1|X = 1) − logP(Y = 0|X = 1)) 0 − (logP(Y = 1|X = 0) − logP(Y = 0|X = 0)) 1 1 = log − log P(Y = 0|X = 1) P(Y = 1|X = 1) 1 1 − log , − log P(Y = 0|X = 0) P(Y = 1|X = 0)
(2.4)
where P(X = i, Y = j) = 0, i = 0, 1; j = 0, 1. The first term of (2.4) is a decrease of uncertainty (information) of Y = 1 for Y = 0 at X = 1, and the second term that at X = 0. Thus, the log odds ratio is a change of uncertainty of Y = 1 in X . If X is an explanatory variable or factor and Y is a response variable, OR in (2.4) may be interpreted as the effect of X on response Y measured with information (uncertainty). From (2.3), an alternative expression of the log odds ratio is given as follows: log
1 = (logP(X = 1|Y = 1) − logP(X = 0|Y = 1)) 0 − (logP(X = 1|Y = 0) − logP(X = 0|Y = 0)).
(2.5)
32
2 Analysis of the Association in Two-Way Contingency Tables
From this expression, the above log odds ratio can be interpreted as the effect of Y on X , if X is a response variable and Y is explanatory variable. From (2.3), the third expression of the log odds ratio is made as follows: log
1 = (logP(X = 1, Y = 1) + logP(X = 0, Y = 0)) 0 − (logP(X = 1, Y = 0) + logP(X = 0, Y = 1)) 1 1 = log + log P(X = 1, Y = 0) P(X = 0, Y = 1) 1 1 + log . − log P(X = 0, Y = 0) P(X = 1, Y = 1)
(2.6)
If X and Y are response variables, from (2.6), the log odds ratio is the difference between information of discordant and concordant responses in X and Y , and it implies a measure of the association between the variables. Theorem 2.1 For binary random variables X and Y , the necessary and sufficient condition that the variables are statistically independent is 1 = 1. 0
(2.7)
Proof If (2.7) holds, from (2.3), P(X = 1, Y = 1)P(X = 0, Y = 0) = P(X = 1, Y = 0)P(X = 0, Y = 1). There exists constant a > 0, such that P(X = 1, Y = 1) = a P(X = 0, Y = 1), P(X = 1, Y = 0) = a P(X = 0, Y = 0). From (2.8), we have P(X = i) =
ai , i = 0, 1. a+1
and a (P(X = 0, Y = 1) + P(X = 1, Y = 1)) a+1 a = (P(X = 0, Y = 1) + a P(X = 0, Y = 1)) a+1 = a P(X = 0, Y = 1) = P(X = 1, Y = 1).
P(X = 1)P(Y = 1) =
(2.8)
2.2 Odds, Odds Ratio, and Relative Risk Table 2.2 Conditional distribution of Y given factor X
X
33 Y
Total
0
1
0
P(Y = 0|X = 0)
P(Y = 1|X = 0)
1
1
P(Y = 0|X = 1)
P(Y = 1|X = 1)
1
Similarly, we have P(X = i)P(Y = j) = P(X = i, Y = j).
(2.9)
This completes the sufficiency. Reversely, if X and Y are statistically independent, it is trivial that (2.9) holds true. Then, (2.7) follows. This completes the theorem. Remark 2.2 The above discussion has been made under the assumption that variables X and Y are random. As in (2.4), the interpretation is also given in terms of the conditional probabilities. In this case, variable X is an explanatory variable, and Y is a response variable. If variable X is a factor, i.e., non-random, odds are made according to Table 2.2. For example, X implies one of two groups, e.g., “control” and “treatment” groups in a clinical trial. In this case, the interpretation can be given as the effect of factor X on response variable Y in a strict sense. Example 2.1 The conditional distribution of Y given factor X is assumed as in Table 2.3. Then, the odds ratio (2.2) is OR1 =
7 8 1 8
· ·
11 16 5 16
=
77 . 5
Changing the responses of Y as in Table 2.3, the odds ratio is OR2 =
1 8 7 8
· ·
5 16 11 16
=
5 . 77
Therefore, the strength of association is the same for Table 2.3 Conditional distribution of Y given factor X
X
Y
Total
0
1
7 8 5 16
1 8 11 16
1
1 8 11 16
7 8 5 16
1
(a) 0 1
1
(b) 0 1
1
34
2 Analysis of the Association in Two-Way Contingency Tables
OR1 =
1 . OR2
In terms of information, i.e., log odds ratio, we have logOR1 = −logOR2 . Next, the relative risk is considered. In Table 2.2, risks with respect to response Y are expressed by the conditional probabilities P(Y = 1|X = i), i = 0, 1. Definition 2.2 In Table 2.2, the relative risk (RR) is defined by RR =
P(Y = 1|X = 1) . P(Y = 1|X = 0)
(2.10)
The logarithm of relative risk (2.10) is logRR = (−logP(Y = 1|X = 0)) − (−logP(Y = 1|X = 1)).
(2.11)
The above information is a decrease of uncertainty (information) of Y = 1 by factor X . In a clinical trial, let X be a factor assigning a “placebo” group by X = 0 and a treatment group by X = 1, and let Y denote a side effect, expressing “no” by Y = 0 and “yes” Y = 1. Then, the effect of the factor is measured by (2.10) and (2.11). In this case, it may follow that P(Y = 1|X = 1) ≥ P(Y = 1|X = 0), and RR ≥ 1 ⇔ logRR ≥ 0 Then, the effect of the factor on the risk is positive. From (2.2), we have OR =
P(Y = 1|X = 1)(1 − P(Y = 1|X = 0)) (1 − P(Y = 1|X = 0)) = RR × . P(Y = 1|X = 0)(1 − P(Y = 1|X = 1)) (1 − P(Y = 1|X = 1))
If risks P(Y = 1|X = 0) and P(Y = 1|X = 1) are small in Table 2.2, it follows that OR ≈ RR. Example 2.2 The conditional distribution of binary variables of Y given X is illustrated in Table 2.4. In this example, the odds (2.1) and odds ratio (2.3) are calculated as follows:
2.2 Odds, Odds Ratio, and Relative Risk Table 2.4 Conditional probability distributions of Y given X
0 =
35
X
Y
Total
0
1
0
0.95
0.05
1
1
0.75
0.25
1
0.05 0.25 1 1 1 = = , 1 = = , OR = 0.95 19 0.75 3 0
1 3 1 19
=
19 ≈ 6.33. 3
From the odds ratio, it may be thought intuitively that the effect of X on Y is strong. The effect measured by the relative effect is given by RR =
0.25 = 5. 0.05
The result is similar to that of odds ratio. Example 2.3 In estimating the odds ratio and the relative risk, it is critical to take the sampling methods into consideration. For example, flu influenza vaccine efficacy is examined through a clinical trial using vaccination and control groups. In the clinical trial, both of the odds ratio and the relative risk can be estimated as discussed in this section. Let X be factor taking 1 for “vaccination” and 0 for “control” and let Y be a state of infection, 1 for “infected” and 0 for “non-infected.” Then, P(Y = 1|X = 1) and P(Y = 1|X = 0) are the infection probabilities (ratios) of vaccination and control groups, respectively. Then, the reduction ratio of the former probability (risk) for the latter is assessed by efficacy =
P(Y = 1|X = 0) − P(Y = 1|X = 1) P(Y = 1|X = 0)
× 100 = (1 − RR) × 100,
where RR is the following relative risk: RR =
P(Y = 1|X = 1) . P(Y = 1|X = 0)
On the other hand, the influenza vaccine effectiveness is measured through a retrospective method (case-control study) by using flu patients (infected) and controls (non-infected), e.g., non-flu patients. In this sampling, the risks P(Y = 1|X = 0) and P(Y = 1|X = 1) cannot be estimated; however, the odds ratio can be estimated as mentioned in Remark 2.1. In this case, P(X = 1|Y = 0) and P(X = 1|Y = 1) can be estimated, and then, the vaccine effectiveness of the flu vaccine is evaluated by using odds ratios (OR): effectiveness = 1 −
P(X =1|Y =0) 1−P(X =1|Y =0) P(X =1|Y =1) 1−P(X =1|Y =1)
× 100 = (1 − OR) × 100,
36
2 Analysis of the Association in Two-Way Contingency Tables
where OR is OR =
P(X =1|Y =0) 1−P(X =1|Y =0) P(X =1|Y =1) 1−P(X =1|Y =1)
.
2.3 The Association in Binary Variables In this section, binary variables are assumed to be random. Let πi j = P(X = i, Y = j); πi+ = P(X = i), and let π+ j = P(Y = j). The logarithms of the joint probabilities are transformed as follows: logπi j = λ + λiX + λYj + ψi j , i = 0, 1; j = 0, 1.
(2.12)
In order to identify the model parameters, the following appropriate constraints are placed on the parameters: 1
λiX πi+ =
1
i=0
1
λYj π+ j =
j=0
ψi j πi+ =
i=0
1
ψi j π+ j = 0.
(2.13)
j=0
From (2.12) and (2.13), we have the following equations: 1 1
πi+ π+ j logπi j = λ,
i=0 j=0
1
π+ j logπi j = λ + λiX
j=0 1
πi+ logπi j = λ + λYj .
i=0
Therefore, we get ψi j = logπi j +
1 j=0
π+ j logπi j +
1
πi+ logπi j −
i=0
1 1
πi+ π+ j logπi j . (2.14)
i=0 j=0
Let be the following two by two matrix: =
ψ00 ψ01 . ψ01 ψ11
The matrix has information of the association between the variables. For a link to a discussion in continuous variables, (2.12) is re-expressed as follows:
2.3 The Association in Binary Variables
log p(x, y) = λ + λxX + λYy + (1 − x, x)(1 − y, y)T . where
37
(2.15)
p(x, y) = πx y .
From (2.15), we have (1 − x, x)(1 − y, y)t = ϕx y + (ψ01 − ψ00 )x + (ψ10 − ψ00 )y + ψ00 where ϕ = ψ00 + ψ11 − ψ01 − ψ10 From the results, (2.15) can be redescribed as log p(x, y) = α + β(x) + γ (y) + ϕx y.
(2.16)
where α = λ + ψ00 , β(x) = λxX + (ψ01 − ψ00 )x, γ (y) = λYy + (ψ10 − ψ00 )y. (2.17) By using expression (2.16), for baseline categories X = x0 , Y = y0 , we have log
p(x, y) p(x0 , y0 ) = ϕ(x − x0 )(y − y0 ). p(x0 , y) p(x, y0 )
(2.18)
In (2.18), log odds ratio is given by log
p(1, 1) p(0, 0) = ϕ. p(0, 1) p(1, 0)
Definition 2.3 Let μ X and μY be the expectation of X and Y , respectively. Formally substituting baseline categories x0 , and y0 in (2.18) for μ X and μY , respectively, a generalized log odds ratio is defined by ϕ(x − μ X )(y − μY ).
(2.19)
and the exponential of the above quantity is referred to as a generalized odds ratio. Remark 2.3 In binary variables X and Y , it follows that μ X = π1+ and μY = π+1 . Definition 2.4 The entropy covariance between X and Y is defined by the expectation of (2.19) with respect to X and Y , and it is denoted by ECov(X, Y ) : ECov(X, Y ) = ϕCov(X, Y ).
(2.20)
38
2 Analysis of the Association in Two-Way Contingency Tables
The entropy covariance is expressed by the KL information: ECov(X, Y ) =
1 1 i=0
πi j πi+ π+ j πi j log + πi+ π+ j log ≥ 0. (2.21) π π πi j i+ + j j=0 i=0 j=0 1
1
The equality holds true if and only if variables X and Y are statistically independent. From this, the entropy covariance is an entropy measure that explains the difference between distributions (πi j ) and πi+ π+ j . In (2.20), ECov(X, Y ) is also viewed as an inner product between X and Y with respect to association parameter ϕ. From the Cauchy inequality, we have 0 ≤ ECov(X, Y ) ≤ |ϕ| Var(X ) Var(Y ).
(2.22)
Since the (Pearson) correlation coefficient between binary variables X and Y is Corr(X, Y ) = √
π00 π11 − π01 π10 Cov(X, Y ) =√ . √ π0+ π1+ π+0 π+1 Var(X ) Var(Y )
for binary variables, it can be thought that the correlation coefficient measures the degree of response concordance or covariation of the variables, and from (2.20) and (2.22), we have 0≤
ϕCov(X, Y ) ϕ ECov(X, Y ) ≤ = Corr(X, Y ) √ √ √ √ |ϕ| |ϕ| Var(X ) Var(Y ) |ϕ| Var(X ) Var(Y ) = |Corr(X, Y )|.
From this, we make the following definition: Definition 2.5 The entropy correlation coefficient (ECC) between binary variables X and Y is defined by ECorr(X, Y ) = |Corr(X, Y )|.
(2.23)
From (2.21), the entropy covariance ECov(X, Y ) is the KL information, and the √ √ upper bound of it is given as |ϕ| Var(X ) Var(Y ) in (2.22). Hence, ECC (2.23) is interpreted as the ratio of the explained (reduced) entropy by the association between variables X and Y . The above discussion gives the following theorem. Theorem 2.2 For binary random variables X and Y , the following statements are equivalent: (i) (ii) (iii) (iv)
Binary variables X and Y are statistically independent. Corr(X, Y ) = 0. ECorr(X, Y ) = 0. RR = 1.
2.3 The Association in Binary Variables Table 2.5 Joint probability distribution of X and Y
X
39 Y
Total
0
1
0
0.4
0.1
1
0.2
0.3
0.5
Total
0.6
0.4
1
0
0.49
0.01
0.5
1
0.1
0.4
0.5
Total
0.59
0.41
1
(a) 0.5
(b)
Proof From Definition 2.5, Proposition (ii) and (iii) are equivalent. From Theorem 2.1, Proposition (i) is equivalent to OR =
π11 π22 = 1 ⇔ π11 π22 − π12 π21 = 0 ⇔ Corr(X, Y ) = 0. π12 π21
Similarly, we can show that OR =
π11 π22 π12 π21 + π22 = 1 ⇔ RR = · = 1. π12 π21 π11 + π12 π22
This completes the theorem.
Example 2.4 In Table 2.5a, the joint distribution of binary variables X and Y is given. In this table, the odds (2.1) and odds ratio (2.3) are calculated as follows: 0 =
1 3 0.1 0.3 = , 1 = = , 0.4 4 0.2 2
1 = 0
3 2 1 4
= 6.
From the odds ratio, it may be thought intuitively that the association between the two variables are strong in a sense of odds ratio, i.e., the effect of X on Y or that of Y on X . For example, X is supposed to be a smoking state, where “smoking” is denoted by X = 1 and “non-smoking” by X = 0, and Y be the present “respiratory problem,” “yes” by Y = 1 and “no” by Y = 0. Then, from the above odds ratio, it may be concluded that the effect of “smoking” on “respiratory problem” is strong. The conclusion is valid in this context, i.e., the risk of disease. On the other hand, treating variables X and Y parallelly in entropy, ECC is 0.4 · 0.3 − 0.1 · 0.2 ≈ 0.408. ECorr(X, Y ) = Corr(X, Y ) = √ 0.5 · 0.5 · 0.6 · 0.4
40
2 Analysis of the Association in Two-Way Contingency Tables
Table 2.6 Joint probability distribution of X and Y for t ∈ (0, 2)
X
Y
0
Total
0
1
0.4t
0.1t
0.5t
1
0.4(1 − 0.5t)
0.6(1 − 0.5t)
(1 − 0.5t)
Total
0.4 + 0.2t
0.6 − 0.2t
1
In the above result, we may say the association between the variables is moderate in a viewpoint of entropy. Similarly, we have the following coefficient of uncertainty C(Y |X ) (1.19). C(Y |X ) ≈ 0.128. In this sense, the reduced (explained) entropy of Y by X is about 12.8% and the effect of X on Y is week. Finally, the measures mentioned above are demonstrated according to Table 2.5. We have 0 =
1 0.01 0.4 = , 1 = = 4, 0.49 49 0.1
1 4 = 1 = 196; 0 49
ECorr(X, Y ) = Corr(X, Y ) ≈ 0.793; C(Y |X ) ≈ 0.558. In the joint distribution in Table 2.5b, the association between the variables are strong as shown by ECorr(X, Y ) and C(Y |X ). Remark 2.4 As defined by (2.2) and (2.10), OR and RR do not depend on the marginal probability distributions. For example, for any real value t ∈ (0, 2), Table 2.5 is modified as in Table 2.6. Then, OR and RR in Table 2.6 are the same as in Table 2.5a; however,
ECorr(X, Y ) =
t(2 − t) . (t + 2)(3 − t)
(2.24)
√ and it depends on real value t. The above ECC takes the maximum at t = 6 − 2 6 ≈ 1.101: max ECorr(X, Y ) ≈ 0.410.
t∈(0,2)
The graph of ECorr(X, Y ) (2.24) is illustrated in Fig. 2.1. Hence, for analyzing contingency tables, not only association measures OR and RR but also ECC is needed.
2.4 The Maximum Likelihood Estimation of Odds Ratios
41
ECC 0.45 0.4 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0 0.08 0.16 0.24 0.32 0.4 0.48 0.56 0.64 0.72 0.8 0.88 0.96 1.04 1.12 1.2 1.28 1.36 1.44 1.52 1.6 1.68 1.76 1.84 1.92
0
t
Fig. 2.1 Graph of ECorr(X, Y ) (2.24)
2.4 The Maximum Likelihood Estimation of Odds Ratios The asymptotic distribution of the ML estimator of log odds ratio is considered. Let n i j , i = 0, 1; j = 0, 1 be the numbers of observations in cell (X, Y ) = (i, j); n i+ = n i0 + n i1 ; n + j = n 0 j + n 1 j ; and let n ++ = n 0+ + n 1+ = n +0 + n +1 . The sampling is assumed to be independent Poisson samplings in the cells. Then, the distribution of numbers n i j given the total n ++ is the multinomial distribution, and the ML estimators of the cell probabilities πi j are calculated as πˆ i j =
ni j , i = 0, 1; j = 0, 1. n ++
The ML estimator of the log odds ratio is computed by
OR =
n 00 n 11 , n 01 n 10
and from this, for sufficiently large n i j , we have
n 00 n 11 μ00 μ11 n 00 n 01 − log = log − log n 01 n 10 μ01 μ10 μ00 μ01 n 10 n 11 1 1 − log + log ≈ (n 00 − μ00 ) − (n 01 − μ01 ) μ10 μ11 μ00 μ01 1 1 − (n 10 − μ10 ) + (n 11 − μ11 ), μ10 μ11
logOR − logOR = log
where μi j are the expectations of n i j , i = 0, 1; j = 0, 1. Since the sampling method is an independent Poisson sampling, if data n i j , i = 0, 1; j = 0, 1 are sufficiently large, they are asymptotically and distributed according to N μi j , μi j , independent and we have n i j = μi j + o μi j , i = 0, 1; j = 0, 1, where
42
2 Analysis of the Association in Two-Way Contingency Tables
o μi j lim = 0. μi j →∞ μi j
Therefore, the asymptotic variance of OR is given by
1 1 1 1 1 1 1 1 + + + ≈ + + + . Var logOR ≈ μ00 μ01 μ10 μ11 n 00 n 01 n 10 n 11
(2.25)
According to the above results, the asymptotic distribution of logOR is normal with variance (2.25) and the standard error (SE) of the estimator is given by
SE ≈
1 1 1 1 + + + . n 00 n 01 n 10 n 11
The test of the independence of binary variables X and Y can be performed by using the above SE, i.e., the critical region for significance level α is given as follows:
α 1 1 1 1 logOR > Q + + + , 2 n 00 n 01 n 10 n 11
α 1 1 1 1 + + + , logOR < −Q 2 n 00 n 01 n 10 n 11
where function Q(α) is the 100(1 − α) percentile of the standard normal distribution. Remark 2.5 From (2.25), the following statistic is asymptotically distributed according to the standard normal distribution:
Z=
logOR − logOR 1 n 00
+
1 n 01
+
1 n 10
+
1 n 11
.
Then, a 100(1 − α)% confidence interval for the log odds ratio is
α 1 1 1 1 logOR − Q + + + < logOR < logOR 2 n 00 n 01 n 10 n 11
α 1 1 1 1 +Q + + + . 2 n 00 n 01 n 10 n 11
(2.26)
From the above confidence interval, if the interval does not include 0, then the null hypothesis is rejected with the level of significance α. In order to test the independence of binary variables X and Y , the following Pearson chi-square statistic is usually used:
2.4 The Maximum Likelihood Estimation of Odds Ratios
X2 =
43
n ++ (n 00 n 11 − n 10 n 01 )2 . n 0+ n 1+ n +0 n +1
(2.27)
Since |n 00 n 11 − n 10 n 01 | , ECorr(X, Y ) = √ n 0+ n 1+ n +0 n +1
we have
X 2 = n ++ ECorr(X, Y )2 . Hence, the Pearson chi-square statistic is also related to entropy. On the other hand, the log-likelihood ratio statistic for testing the independence is given by G2 = 2
1 1
n i j log
j=0 i=0
n ++ n i j n i+ n + j
(2.28)
As mentioned in Chap. 1, the above statistic is related to the KL information. Under the null hypothesis that X and Y are statistically independent, the statistics (2.27) and (2.28) are asymptotically equal and the asymptotic distribution is the chisquare distribution with degrees of freedom 1. By using the above testing methods, a test for independence between X and Y can be carried out. Example 2.5 Table 2.7 illustrates the two-way cross-classified data of X : smoking state and Y : chronic bronchitis state from 339 adults over 50 years old. The log OR is estimated as
logOR = log
43 · 121 = 2.471, 162 · 13
A 95% confidence interval of the log odds ratio (2.26) is calculated as follows: 0.241 < logOR < 1.568, Table 2.7 Data of smoking status and chronic bronchitis of subjects above 50 years old
X
(2.29)
Y
Total
Yes (1)
No (0)
Yes (1)
43
162
205
No (0)
13
121
134
Total
56
283
339
X smoking state; Y chronic bronchitis state Source Asano et al. [2], p. 161
44
2 Analysis of the Association in Two-Way Contingency Tables
From this, the association between the two variables is statistically significant at significance level 0.05. From (2.29), a 95% confidence interval of the odds ratio is given by 1.273 < OR < 4.797. By using test statistic G 2 , we have G 2 = 7.469, d f = 1, P = 0.006. Hence, the statistic indicates that the association in Table 2.7 is statistically significant.
2.5 General Two-Way Contingency Tables Let X and Y be categorical random variables that have categories {1, 2, . . . , I } and {1, 2, . . . , J }, respectively, and let πi j = P(X = i, Y = j), πi+ = P(X = i), and let π+ j = P(Y = j). Then, the number of odds ratios is I (I − 1) J (J − 1) I J · . = 2 2 2 2 Table 2.8 illustrates a joint probability distribution of categorical variables. The odds ratio with respect to X = i, i and Y = j, j is the following cross product ratio: Table 2.8 Joint probability distribution of general categorical variables X and Y
2.5 General Two-Way Contingency Tables
45
πi j πi j OR i, i ; j, j = for i = i and j = j . πi j πi j
(2.30)
The discussion on the analysis of association in contingency tables is complicated as the numbers of categories in variables are increasing. As an extension of Theorem 2.1, we have the following theorem: Theorem 2.3 Categorical variables X and Y are statistically independent, if and only if OR i, i ; j, j = 1 for all i = i and j = j .
(2.31)
Proof In (2.30), we formally set i = i and j = j . Then, we have OR(i, i; j, j) = 1. Hence, assumption (2.31) is equivalent to πi j πi j OR i, i ; j, j = = 1 for all i, i , j, and j . πi j πi j From this, it follows that J
1=
j =1 J j =1
I
i =1
I
πi j πi j
i =1 πi j πi j
=
πi j π++ πi j = = 1 for all i and j. πi+ π+ j πi+ π+ j
Hence, we have πi j = πi+ π+ j for all i and j, it means that X and Y are statistically independent. This completes the theorem. model in Table 2.8 be denoted by πi j and let the independent model by Let the πi+ π+ j . Then, from (1.10), the Kullback–Leibler information is J I D πi j || πi+ π+ j = πi j log i=1 j=1
πi j . πi+ π+ j
Let n i j be the numbers of observations in cells (X, Y ) = (i, j), i = 1, 2, . . . , I ; j = 1, 2, . . . , J . Then, the ML estimators of probabilities πi j , πi+ , and π+ j are given as πˆ i j =
ni j n+ j n i+ , πˆ i+ = , πˆ + j = , n ++ n ++ n ++
46
2 Analysis of the Association in Two-Way Contingency Tables
Table 2.9 Cholesterol and high blood pressure data in a medical examination for students in a university X
Y
Total
L
M
H
L
2
6
3
11
M
12
32
11
56
H Total
9
5
3
17
24
43
17
84
Source [1], p. 206
where n i+ =
J
ni j , n+ j =
j=1
I i=1
n i j , n ++ =
I J
ni j .
i=1 j=1
The test of independence is performed by the following statistic: G2 = 2
J I i=1 j=1
n i j log
n i j n ++ = 2n ++ · D πˆ i j || πˆ i+ πˆ + j . n i+ n + j
(2.32)
The above statistic implies 2n ++ times the KL information for discriminating model π . Under the null hypothesis H0 : π model πi j and the independent i+ + j the independent model πi+ π+ j , the above statistic is distributed according to a chi-square distribution with degrees of freedom (I − 1)(J − 1). Example 2.6 Table 2.9 shows a part of data in a medical examination. Variables X and Y are cholesterol and high blood pressure, respectively, which are graded as low (L), medium (M), and high (H). Applying (2.32), we have G 2 = 6.91, d f = 4, P = 0.859. From this, the independence of the two variables is not rejected at the level of significant 0.05, i.e., the association of the variables is not statistically significant. Example 2.7 Table 2.10 illustrates cross-classified data with respect to eye color X and hair color Y in [5]. For baselines X = Blue, Y = Fair, odds ratios are calculated in Table 2.11, and P-values with respect to the related odds ratios are shown in Table 2.12. All the estimated odds ratios related to X = Midium, Dark are statistically significant. The summary test of the independence of the two variables is made with statistic (2.32), and we have G 2 = 1218.3, d f = 9, P = 0.000. Hence, the independence of the variables is strongly rejected. By using odds ratios in Table 2.11, the association between the two variables in Table 2.10 is illustrated in Fig. 2.2.
2.5 General Two-Way Contingency Tables
47
Table 2.10 Eye and hair color data X
Y
Total
Fair
Red
Medium
Dark
Black
Blue
326
38
241
110
3
718
Light
688
116
584
188
4
1580
Medium
343
84
909
412
26
1774
Dark
98
48
403
681
85
1315
Total
1455
286
2137
1391
118
5387
Source [5], p. 143 Table 2.11 Baseline odds ratios for baselines X = blue, Y = fair X
Y Red
Medium
Dark
Black
Light
1.45
1.15
0.81
0.63
Medium
2.10
3.58
3.56
8.24
Dark
4.20
5.56
20.59
94.25
Table 2.12 P-values in estimators of odds ratios in Table 2.11 X
Y Red
Medium
Dark
Black
Light
0.063
0.175
0.125
0.549
Medium
0.000
0.000
0.000
0.001
Dark
0.000
0.000
0.000
0.000
Odds Ratio 100 80 60 40
Dark
20
Medium
0 Red Medium
0-20
Light Black
Dark 20-40
40-60
60-80
80-100
Fig. 2.2 Image of association according to odds ratios in Table 2.11
48
2 Analysis of the Association in Two-Way Contingency Tables
As demonstrated in Example 2.7, since a discussion of the association in contingency tables becomes complex as the number of categories increases, an association measure is needed to summarize the association in two-way contingency tables for polytomous variables. An entropy-based approach for it will be given below.
2.6 The RC(M) Association Models The association model was proposed for analyzing two-way contingency tables of categorical variables [11, 13]. Let X and Y be categorical variables that have categories {1, 2, . . . , I } and {1, 2, . . . , J }, respectively, and let πi j = P(X = i, Y = j); πi+ = P(X = i), and let π+ j = P(Y = j), i = 1, 2, . . . , I ; j = 1, 2, . . . , J . Then, a loglinear formulation of the model is given by logπi j = λ + λiX + λYj + ψi j . In order to identify the model parameters, the following constraints are assumed: I
λiX πi+ =
i=1
J
λYj π+ j =
j=1
I
ψi j πi+ =
i=1
J
ψi j π+ j = 0.
j=1
The association between the two variables ψi j and the by parameters is expressed log odds ratio with respect to (i, j), i , j , i, j , and i , j is calculated as log
πi j πi j = ψi j + ψi j − ψi j − ψi j . πi j πi j
To discuss the association in the two-way contingency table, the RC(M) association model is given as logπi j = λ +
λiX
+
λYj
+
M
φm μmi νm j ,
(2.33)
m=1
where μmi and νm j are scored for X = i and Y = j in the m th components, m = 1, 2, . . . , M, and the following constraints are used for model identification:
2.6 The RC(M) Association Models
49
⎧ I J ⎪ ⎪ μ π = νm j π+ j = 0, m = 1, 2, . . . , M; ⎪ mi i+ ⎪ ⎪ i=1 j=1 ⎪ ⎪ ⎨ I J μ2mi πi+ = νm2 j π+ j = 1, m = 1, 2, . . . , M; ⎪ i=1 j=1 ⎪ ⎪ ⎪ I J ⎪ ⎪ ⎪ μmi μm i πi+ = νm j νm j π+ j = 0, m = m . ⎩ i=1
(2.34)
j=1
Then, the log odds ratio is given by the following bilinear form: πi j πi j = φm (μmi − μmi ) νm j − νm j . πi j πi j m=1 M
log
(2.35)
Preferable properties of the model in entropy are discussed below. For the RC(1) model, Gillula et al. [10] gave a discussion of properties of the model in entropy and Eshima [9] extended the discussion in the RC(M) association model. The discussion is given below. In the RC(M) association model (2.33), let μm = (μm1 , μm2 , . . . , μm I ) and ν m = (νm1 , νm2 , . . . , νm J ) be scores for X and Y in the m th components, m = 1, 2, . . . , M. From (2.35), parameters φm are related to log odds ratios, so the parameters are referred to as the intrinsic association parameters. If it is possible for scores to vary continuously, the intrinsic parameters φm are interpreted as the log odds ratio in unit changes of the related scores in the mth components. Let Corr μm , ν m be the correlation coefficients between μm and ν m , m, m = 1, 2, . . . , M, which are defined by J I Corr μm , ν m = μmi νm j πi j .
(2.36)
j=1 i=1
For simplicity of the notation, let us set ρm = Corr μm , ν m , m = 1, 2, . . . , M. With respect to the RC(M) association model (2.33), we have the following theorem. Theorem 2.4 The RC(M) association model maximizes the entropy, given the correlation coefficients between μm and ν m and the marginal probability distributions of X and Y . Proof For Lagrange multipliers κ, λiX , λYj , and φm , we set G=−
J I
πi j logπi j + κ
j=1 i=1
+
J j=1
λYj
I i=1
J I
πi j +
j=1 i=1
πi j +
M m=1
φm
I J
I
λiX
i=1
J j=1
μmi νm j πi j .
j=1 i=1
Differentiating the above function with respect to πi j , we have
πi j
50
2 Analysis of the Association in Two-Way Contingency Tables
∂G = −logπi j + 1 + κ + λiX + λYj + φm μmi νm j = 0. ∂πi j m=1 M
Setting λ = κ + 1, we have (2.33). This completes the theorem.
With respect to the correlation coefficients and the intrinsic association parameters, we have the following theorem. Theorem 2.5 In the RC(M) association model (2.33), let U be the M×M matrix that ∂ρa . Then, U is positive definite, given the marginal probability has (a, b) elements ∂φ b distributions of X and Y and scores μm and ν m , m = 1, 2, . . . , M. Proof Differentiating the correlation coefficients ρa with respect to φb in model (2.33), we have ∂ρa = μai νa j πi j × ∂φb j=1 i=1 I
J
∂λYj ∂λiX ∂λ + + + μbi νbj . ∂φb ∂φb ∂φb
(2.37)
Since marginal probabilities πi+ and π+ j are given, we have ∂πi+ = πi j × ∂φb j=1 J
∂π+ j = πi j × ∂φb i=1 I
∂λYj ∂λ X ∂λ + i + + μbi νbj ∂φb ∂φb ∂φb
∂λYj ∂λ X ∂λ + i + + μbi νbj ∂φb ∂φb ∂φb
= 0,
(2.38)
= 0.
(2.39)
From (2.38) and (2.39), we obtain J ∂λYj ∂λiX ∂λ ∂λ ∂πi+ = πi j × + + + μbi νbj = 0, ∂φb ∂φb ∂φb ∂φb ∂φb j=1 J ∂λYj ∂λiX ∂λ ∂λiX ∂πi+ = πi j × + + + μbi νbj = 0, ∂φb ∂φb ∂φb ∂φb ∂φb j=1 I ∂λYj ∂λ ∂λYj ∂λiX ∂π+ j = πi j × + + + μbi νbj = 0. ∂φb ∂φb ∂φb ∂φb ∂φb i=1
(2.40)
(2.41)
(2.42)
From (2.41) and (2.42), it follows that I J i=1
∂λ πi j × ∂φa j=1
∂λYj ∂λ X ∂λ + i + + μbi νbj ∂φb ∂φb ∂φb
= 0,
(2.43)
2.6 The RC(M) Association Models I J i=1
∂λ X πi j × i ∂φa j=1
I J
πi j ×
i=1 j=1
∂λYj ∂φa
51
∂λYj ∂λ X ∂λ + i + + μbi νbj ∂φb ∂φb ∂φb ∂λYj ∂λ X ∂λ + i + + μbi νbj ∂φb ∂φb ∂φb
= 0,
(2.44)
= 0.
(2.45)
By using (2.43) to (2.45), formula (2.37) is reformed as follows: ∂λYj ∂λiX ∂λ + + + μbi νbj ∂φb ∂φb ∂φb j=1 i=1 J I ∂λYj ∂λiX ∂λ ∂λ + πi j × + + + μbi νbj . ∂φb ∂φb ∂φb ∂φb i=1 j=1 J I ∂λYj ∂λ X ∂λ ∂λ X + πi j × i + i + + μbi νbj ∂φa ∂φb ∂φb ∂φb i=1 j=1 J I ∂λYj ∂λYj ∂λ X ∂λ πi j × + i + + μbi νbj + ∂φba ∂φb ∂φb ∂φb i=1 j=1 I J ∂λYj ∂λYj ∂λiX ∂λiX ∂λ ∂λ πi j × + + + μai νa j + + + μbi νbj = ∂φa ∂φa ∂φa ∂φb ∂φb ∂φb
∂ρa = μai νa j πi j × ∂φb J
I
i=1 j=1
The above quantity is an inner product between I J dimensional vectors
∂λYj ∂λYj ∂λiX ∂λiX ∂λ ∂λ + + + μai νa j and + + + μbi νbj ∂φa ∂φa ∂φa ∂φb ∂φb ∂φb
with respect to the joint probability distribution πi j . Hence, theorem follows.
From the above theorem, we have the following theorem. Theorem 2.6 In the RC(M) association model (2.33), ρm = Corr μm , ν m are increasing in the intrinsic association parameters φm , m = 1, 2, . . . , M, given the marginal probability distributions of X and Y . ∂ρa is positive definite, so we get Proof In the above theorem, matrix U = ∂φ b ∂ρa ≥ 0. ∂φa This completes the theorem.
With respect to the entropy in RC(M) association model (2.33), we have the following theorem.
52
2 Analysis of the Association in Two-Way Contingency Tables
Theorem 2.7 Let φ = (φ1 , φ2 , . . . , φ M ) be the intrinsic association parameter vector. For real value t > 0, let φ = (tφ10 , tφ20 , . . . , tφ M0 ). Then, the entropy of the RC(M) association model (2.33) is decreasing in t > 0, given the marginal probability distributions of X and Y . Proof The entropy of RC(M) association model (2.33) is calculated as follows: H =−
I J
⎛ πi j logπi j = −⎝λ +
j=1 i=1
I
λiX πi+ +
i=1
J
λYj π+ j + t
M
⎞ φm0 ρm ⎠.
m=1
j=1
Differentiating the entropy with respect to t, we have M M I ∂λiX d(−H ) ∂λ = φm0 + φm0 πi+ dt ∂φm ∂φm m=1 m=1 i=1
+
M J ∂λYj m=1 j=1
∂φm
φm0 π+ j +
M
φm0 ρm + t
m=1
M M n=1 m=1
φn0
∂ρm φm0 . ∂φm
Since I J
πi j = 1,
j=1 i=1
we have J J J I I I d d πi j = πi j = πi j dt j=1 i=1 dt j=1 i=1 j=1 i=1 M M M M ∂λ ∂λYj ∂λiX φm0 + φm0 + φm0 + φm0 μmi νm j ∂φm ∂φm ∂φm m=1 m=1 m=1 m=1
=
M J M ∂λiX ∂λ φm0 + φm0 πi+ ∂φm ∂φm m=1 j=1 m=1
+
J M M ∂λiX φm0 π+ j + φm0 ρm = 0. ∂φm m=1 j=1 m=1
From the above result and Theorem 2.6, it follows that dH ∂ρm = −t φn0 φm0 < 0. dt ∂φm n=1 m=1 M
M
2.6 The RC(M) Association Models
53
Thus, the theorem follows.
Remark 2.6 The RC(1) association model was proposed for the analysis of association between ordered categorical variables X and Y . The RC(1) association model is given by logπi j = λ + λiX + λYj + φμi ν j , i = 1, 2, . . . , I ; j = 1, 2, . . . , J. Then, from Theorem 2.6, the correlation coefficient between scores (μi ) and ν j is increasing in the intrinsic association parameter φ and from Theorem 2.7 the entropy of the model is decreasing in φ. In the RC(M) association model (2.33), as in the previous discussion, we have J I i=1 j=1
πi j πi+ π+ j + πi+ π+ j log = φm μmi νm j πi j πi+ π+ j πi j m=1 i=1 j=1 j=1 i=1 J
I
πi j log
M
=
M
J
I
φm ρm ≥ 0.
(2.46)
m=1
In the RC(M) association model, categories X = i and Y = j are identified to score vectors (μ1i , μ2i , . . . , μ Mi ) and ν1 j , ν2 j , . . . , ν M j , respectively. From this, as an extension of (2.20) in Definition 2.4, the following definition is given. Definition 2.6 The entropy covariance between categorical variables X and Y in the RC(M) association model (2.33) is defined by ECov(X, Y ) =
M
φm ρm .
(2.47)
m=1
The above covariance can be interpreted as that of variables X and Y in entropy (2.46) and is decomposed into M components. Since ECov(X, Y ) =
M
φm
m=1
I J
μmi νm j πi j ,
(2.48)
j=1 i=1
we can formally calculate ECov(X, X ) by ECov(X, X ) =
M
φm
m=1
Similarly, we also have
I J j=1 i=1
μ2mi πi j =
M m=1
φm
I i=1
μ2mi πi+ =
M m=1
φm .
54
2 Analysis of the Association in Two-Way Contingency Tables
ECov(Y, Y ) =
M
φm .
(2.49)
m=1
From (2.48) and (2.49), ECov(X, X ) and ECov(Y, Y ) can be interpreted as the variances of X and Y in entropy, which are referred to as EVar(X ) and EVar(Y ), respectively. As shown in (2.35), since association parameters φm are related to odds ratios, the contributions of the mth pairs of score vectors μm , ν m in the association between X and Y may be measured by φm ; however, from the entropy covariance (2.47), it is more sensible to measure the contributions by φm ρm . The contribution ratios of the mth components μm , ν m can be calculated by φm ρm . M k=1 φk ρk
(2.50)
In (2.46), from the Cauchy–Schwarz inequality, we have M m=1
φm ρm =
M m=1
φm
I J
μmi νm j πi j ≤
M
φm
m=1
j=1 i=1
I
μ2mi πi+
i=1
J j=1
νm2 j π+ j =
M
φm .
m=1
It implies that 0 ≤ ECov(X, Y ) ≤
EVar(X ) EVar(Y ).
(2.51)
From this, we have the following definition: Definition 2.7 The entropy correlation coefficient (ECC) between X and Y in the RC(M) association model (2.33) is defined by M m=1 ECorr(X, Y ) = M
φm ρm
m=1
φm
.
(2.52)
From the above discussion, it follows that 0 ≤ ECorr(X, Y ) ≤ 1. As in Definition 2.4, the interpretation of ECC (2.52) is the ratio of entropy explained by the RC(M) association model. Example 2.8 Becker and Clogg [5] analyzed the data in Table 2.13 with the RC(2) association model. The estimated parameters are given in Table 2.14. From the estimates, we have ρˆ1 = 0.444, ρˆ2 = 0.164.
2.6 The RC(M) Association Models
55
Table 2.13 Data of eye color X and hair color Y X
Y
Total
Fair
Red
Medium
Dark
Black
Blue
326
38
241
110
3
718
Light
688
116
584
188
4
1580
Medium
343
84
909
412
26
1774
Dark
98
48
403
681
85
1315
Total
1455
286
2137
1391
118
5387
Source [5]
and φˆ 1 ρˆ1 = 0.219, φˆ 2 ρˆ2 = 0.018. From the above estimates, the entropy correlation coefficient between X and Y is computed as
ECorr(X, Y ) =
φˆ 1 ρˆ1 + φˆ 2 ρˆ2 = 0.390. φˆ 1 + φˆ 2
(2.53)
From this, the association in the contingency table is moderate. The contribution ratios of the first and second pairs of score vectors are φˆ 2 ρˆ2 φˆ 1 ρˆ1 = 0.076. = 0.924, φˆ 1 ρˆ1 + φˆ 2 ρˆ2 φˆ 1 ρˆ1 + φ2 ρ2
About 92.4% of the association between the categorical variables are explained by the first pair of components (μ1 , ν 1 ). A preferable property of ECC is given in the following theorem. Theorem 2.8 Let φ = (φ1 , φ2 , . . . , φ M ) be the intrinsic association parameter vector in the RC(M) association model (2.33). For real value t > 0, let φ = (tφ10 , tφ20 , . . . , tφ M0 ). Then, the entropy covariance (2.47) is increasing in t > 0, given the marginal probability distributions of X and Y . Proof Since ECov(X, Y ) = t
M
φm0 ρm ,
m=1
differentiating the above entropy covariance with respect to t, we have
0.494
0.110
1
2
Source [5], p. 145
φm
m
0.271
−1.045
−0.894
1.246
μ2
μ1 1.520 0.824
0.166
μ4
−1.356
μ3
Table 2.14 Estimated parameters in the RC(2) association model
0.782
−1.313
ν1 0.518
−0.425
ν2
−1.219
0.019
ν3
0.945
1.213
ν4
0.038
2.567
ν5
56 2 Analysis of the Association in Two-Way Contingency Tables
2.6 The RC(M) Association Models
57
M M M M d ∂ρa dφb 1 φm0 ρm + t φa0 φm ρm ECov(X, Y ) = = dt ∂φb dt t m=1 m=1 b=1 a=1
+t
M M b=1 a=1
φa0 φb0
∂ρa ∂φb
∂ρa 1 φa0 φb0 . ECov(X, Y ) + t t ∂φb b=1 a=1 M
=
M
For t > 0, the first term is positive, and the second term is also positive. This completes the theorem. By using Theorem 2.5, we have the following theorem. Theorem 2.9 Let φ = (φ1 , φ2 , . . . , φ M ) be the intrinsic association parameter vector in the RC(M) association model (2.33). For real value t > 0, let φ = (tφ10 , tφ20 , . . . , tφ M0 ). Then, ECC (2.52) is increasing in t > 0, given the marginal probability distributions of X and Y . Proof For φ = (tφ10 , tφ20 , . . . , tφ M0 ), we have M m=1 ECorr(X, Y ) = M
φm0 ρm
m=1
φm0
.
By differentiating ECorr(X, Y ) with respect to t, we obtain d ECorr(X, Y ) = dt =
M M dρm dφm dρm m=1 m =1 φm0 dφm dt m=1 φm0 dt = M M m=1 φm0 m=1 φm0 M M dρm m=1 m =1 φm0 dφm φm 0 ≥ 0. M m=1 φm0
M
This completes the theorem.
2.7 Discussion Continuous data are more tractable than categorical data, because the variables are quantitative and, variances and covariances among the variables can be calculated for summarizing the correlations of them. The RC(M) association model, which was proposed for analyzing the associations in two-way contingency tables, has a similar correlation structure to the multivariate normal distribution in canonical correlation analysis [6] and gives a useful method for processing the association in contingency
58
2 Analysis of the Association in Two-Way Contingency Tables
tables. First, this chapter has considered the association between binary variables in entropy, and the entropy correlation coefficient has been introduced. Second, the discussion has been extended for the RC(M) association mode. Desirable properties of the model in entropy have been discussed, and the entropy correlation coefficient to summarize the association in the RC(M) association model has been given. It is sensible to treat the association in contingency tables as in continuous data. The present approach has a possibility to be extended for analyzing multiway contingency tables.
References 1. Asano, C., & Eshima, N. (1996). Basic multivariate analysis. Nihon Kikaku Kyokai: Tokyo. (in Japanese). 2. Asano, C., Eshima, N., & Lee, K. (1993). Basic statistic. Tokyo: Morikita Publishing Ltd. (in Japanese). 3. Becker, M. P. (1989a). Models of the analysis of association in multivariate contingency tables. Journal of the American Statistical Association, 84, 1014–1019. 4. Becker, M. P. (1989b). On the bivariate normal distribution and association models for ordinal data. Statistics and Probability Letters, 8, 435–440. 5. Becker, M. P., & Clogg, C. C. (1989). Analysis of sets of two-way contingency tables using association models. Journal of the American Statistical Association, 84, 142–151. 6. Eshima, N. (2004). Canonical exponential models for analysis of association between two sets of variables. Statistics and Probability Letters, 66, 135–144. 7. Eshima, N., & Tabata, M. (1997). The RC(M) association model and canonical correlation analysis. Journal of the Japan Statistical Society, 27, 109–120. 8. Eshima, N., & Tabata, M. (2007). Entropy correlation coefficient for measuring predictive power of generalized linear models. Statistics and Probability Letters, 77, 588–593. 9. Eshima, N., Tabata, M., & Tsujitani, M. (2001). Properties of the RC(M) association model and a summary measure of association in the contingency table. Journal of the Japan Statistical Society, 31, 15–26. 10. Gilula, Z., Krieger, A., & Ritov, Y. (1988). Ordinal association in contingency tables: Some interpretive aspects. Journal of the American Statistical Association, 74, 537–552. 11. Goodman, L. A. (1979). Simple models for the analysis of association in cross-classifications having ordered categories. Journal of the American Statistical Association, 74(367), 537–552. 12. Goodman, L. A. (1981a). Association models and bivariate normal for contingency tables with ordered categories. Biometrika, 68, 347–355. 13. Goodman, L. A. (1981b). Association models and canonical correlation in the analysis of crossclassification having ordered categories. Journal of the American Statistical Association, 76, 320–334. 14. Goodman, L. A. (1985). The analysis of cross-classified data having ordered and/or unordered categories: Association models, correlation models, and asymmetry models for contingency tables with or without missing entries. Annals of Statistics, 13, 10–69.
Chapter 3
Analysis of the Association in Multiway Contingency Tables
3.1 Introduction Basic methodologies to analyze two-way contingency tables can be extended for multiway contingency tables, and the RC(M) association model [11] and ECC [7] discussed in the previous section can be applied to the analysis of association in multiway contingency tables [6]. In the analysis of the association among the variables, it is needed to treat not only the association between the variables but also the higher-order association among the variables concerned. Moreover, the conditional association among the variables is also analyzed. In Sect. 3.2, the loglinear model is treated, and the basic properties of the model are reviewed. Section 3.3 considers the maximum likelihood estimation of loglinear models for three-way contingency tables. It is shown that the ML estimates of model parameters can be obtained explicitly except model [X Y, Y Z , Z X ] [2]. In Sect. 3.4, generalized linear models (GLMs), which make useful regression analyses of both continuous and categorical response variables [13, 15], are discussed, and properties of the models are reviewed. Section 3.5 introduces the entropy correlation coefficient (ECC) to measure the predictive power of GLMs. The coefficient is an extension of the multiple correlation coefficient for the ordinary linear regression model, and its preferable properties for measuring the predictive power of GLMs are discussed in comparison with the regression correlation coefficient. The conditional entropy correlation coefficient is also treated. In Sect. 3.6, GLMs with multinomial (polytomous) response variables are discussed by using generalized logit models. Section 3.7 considers the coefficient of determination in a unified framework, and the entropy coefficient of determination (ECD) is introduced, which is the predictive or explanatory power measure for GLMs [9, 10]. Several predictive power measures for GLMs are compared on the basis of six preferable criteria, and it is shown that ECD is the most preferable predictive power measure for GLMs. Finally, in Sect. 3.8, the asymptotic property of the ML estimator of ECD is considered.
© Springer Nature Singapore Pte Ltd. 2020 N. Eshima, Statistical Data Analysis and Entropy, Behaviormetrics: Quantitative Approaches to Human Behavior 3, https://doi.org/10.1007/978-981-15-2552-0_3
59
60
3 Analysis of the Association in Multiway Contingency Tables
3.2 Loglinear Model Let X, Y, and Z be categorical random variables having categories {1, 2, . . . , I }, {1, 2, . . . , J }, and {1, 2, . . . , K }, respectively; let πi jk = P(X = i, Y = j, Z = k). Then, the logarithms of the probabilities can be expressed as follows: YZ ZX XYZ logππi jk = λ + λiX + λYj + λkZ + λiXY j + λ jk + λki + λi jk .
(3.1)
In order to identify the parameters, the following constraints are used. ⎧ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎨
I I
⎪ i=1 ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎩
λiXj Y =
J j=1
i=1
λiXj Y = I i=1
λiX = J j=1
J j=1
λYj =
λYjkZ =
λiXjkY Z =
J j=1
K k=1
K k=1
λkZ = 0
λYjkZ =
λiXjkY Z =
K k=1
K k=1
λkiZ X =
I i=1
λkiZ X = 0,
(3.2)
λiXjkY Z = 0.
The above model is referred to as the loglinear model and is used for the analysis of association in three-way contingency tables. The model has two-dimensional association terms λiXj Y , λYjkZ , and λkiZ X and three-dimensional association terms λiXjkY Z , so in this textbook, the above loglinear model is denoted as model [XYZ] for simplicity of the notation, where the notation is correspondent with the highest dimensional association terms [2]. Let n i jk be the numbers of observations for X = i, Y = j, and Z = k, i = 1, 2, . . . , I ; j = 1, 2, . . . , J ; k = 1, 2, . . . , K . Since model (3.1) is saturated, the ML estimators of the cell probabilities are given by logπˆ i jk = log
n i jk , n +++
where n +++ =
I J K
n i jk .
i=1 j=1 k=1
From this, the estimates of the parameters are obtained by solving the following equations:
λ + λˆ iX + λˆ Yj + λˆ kZ + λˆ iXj Y + λˆ YjkZ + λˆ kiZ X + λˆ iXjkY Z = log i = 1, 2, . . . , I ; j = 1, 2, . . . , J ; k = 1, 2, . . . , K .
More parsimonious models than [XYZ] are given as follows:
n i jk , n +++
3.2 Loglinear Model
61
(i) [X Y, Y Z , Z X ] If there are no three-dimensional association terms in loglinear model (3.1), we have logπi jk = λ + λiX + λYj + λkZ + λiXj Y + λYjkZ + λkiZ X .
(3.3)
and then, we denote the model as [X Y, Y Z , Z X ] corresponding to the highest dimensional association terms with respect to the variables, i.e., λiXj Y , λYjkZ , and λkiZ X . These association terms indicate the conditional associations related to the variables, i.e., the conditional log odds ratios with respect to X and Y given Z are log
πi jk πi j k = λiXj Y + λiX Yj − λiX Yj − λiXj Y . πi jk πi j k
Similarly, the conditional odds ratios with respect to Y and Z given X and those with respect to X and Z given Y can be discussed. (ii) [X Y, Y Z ] If λkiZ X = 0 in (3.3), it follows that logπi jk = λ + λiX + λYj + λkZ + λiXj Y + λYjkZ .
(3.4)
The highest dimensional association terms with respect to the variables are λiXj Y and λYjkZ , so we express the model by [X Y, Y Z ]. Similarly, models [Y Z , Z X ] and [Z X, X Y ] can also be defined. In model (3.4), the log odds ratios with respect to X and Z given Y are log
πi jk πi jk = 0. πi jk πi jk
From this, the above model implies that X and Z are conditionally independent, given Y. Hence, we have πi jk =
πi j+ π+ jk . π+ j+
In this case, the marginal distribution of X and Y, πi j+ , is given as follows:
logπi j+ = λ + λiX + λYj + λiXj Y , where
λYj = λYj + log
K k=1
exp λkZ + λYjkZ .
(3.5)
62
3 Analysis of the Association in Multiway Contingency Tables
From (3.5), parameters λiX and λiXj Y can be expressed by the marginal probabilities πi j+ , and similarly, λkZ and λYjkZ are described with π+ jk . (iii) [X Y, Z ] If λYjkZ = 0 in (3.4), the model is logπi jk = λ + λiX + λYj + λkZ + λiXj Y .
(3.6)
The highest dimensional association terms with respect to X and Y are λiXj Y , and for Z term λkZ is the highest dimensional association terms. In this model, (X, Y ) and Z are statistically independent, and it follows that πi jk = πi j+ π++k The marginal distribution of (X, Y ) has the same parameters λiX , λYj , and λiXj Y as in (3.6), i.e., logπi j+ = λ + λiX + λYj + λiXj Y , where λ = λ + log
K
exp λkZ .
k=1
From this, parameters λiX , λYj , and λiXj Y can be explicitly expressed by probabilities πi j+ , and similarly, λkZ by π++k . Parallelly, we can interpret models [Y Z , X ] and [Z X, Y ]. (iv) [X, Y, Z ] This model is the most parsimonious one and is given by logπi jk = λ + λiX + λYj + λkZ . The above model has no association terms among the variables, and it implies the three variables are statistically independent. Hence, we have πi jk = πi++ π+ j+ π++k Remark 3.1 Let n i jk be the numbers (counts) of events for categories X = i, Y = j, and Z = k, i = 1, 2, . . . , I ; j = 1, 2, . . . , J ; k = 1, 2, . . . , K , which are independently distributed according to Poisson distributions with means μi jk . Then, the loglinear models are given by replacing πi jk by μi jk in the models discussed in the above, that is
3.2 Loglinear Model
63
logμi jk = λ + λiX + λYj + λkZ + λiXj Y + λYjkZ + λkiZ X + λiXjkY Z . This sampling model is similar to that for the three-way layout experiment model, and the terms λiXj Y , λYjkZ , and λkiZ X are referred to as two-factor interaction terms and λiXjkY Z three-factor interaction terms. Remark 3.2 Let us observe one of IJK events (X, Y, Z ) = (i, j, k) with probabilities πi jk in one trial and repeat it n +++ times independently; let n i jk be the numbers (counts) of events for categories X = i, Y = j, and Z = k, i = 1, 2, . . . , I ; j = 1, 2, . . . , J ; k = 1, 2, . . . , K . Then, the distribution of n i jk is the multinomial distribution with total sampling size n +++ . In the independent Poisson sampling mentioned in Remark 3.1, the conditional distribution of observations n i jk given n +++ is the multinomial distribution with cell probabilities πi jk =
μi jk , μ+++
where μ+++ =
I J K
μi jk .
i=1 j=1 k=1
As discussed above, measuring the associations in loglinear models is carried out with odds ratios. For binary variables, the ordinary approach may be sufficient; however, for polytomous variables, further studies for designing summary measures of the association will be needed.
3.3 Maximum Likelihood Estimation of Loglinear Models We discuss the maximum likelihood estimation of loglinear models in the previous section. Let n i jk be the numbers of observations for (X, Y, Z ) = (i, j, k). For model [X Y, Y Z , Z X ] (3.3), we set
λiXj Y = λ + λiX + λYj + λiXj Y . Then, considering the constraints in (3.2), the number of independent parameters λiXj Y is IJ. Since the loglikelihood function is l=
J K I i=1 j=1 k=1
n i jk λiXj Y + λkZ + λYjkZ + λkiZ X
64
3 Analysis of the Association in Multiway Contingency Tables
=
I J
n i j+ λiXj Y +
i=1 j=1
+
K
n ++k λkZ +
J K
k=1
K I
n + jk λYjkZ
j=1 k=1
n i+k λkiZ X .
i=1 k=1
From J K I
πi jk = 1,
i=1 j=1 k=1
for Lagrange multiplier ζ, we set lLagrange = l − ζ
J K I
πi jk
i=1 j=1 k=1
and differentiating the above function with respect to λiXj Y , we have ∂ l = n i jk − ζ πi jk = 0. X Y Lagrange ∂λi j k=1 k=1 K
K
From this, n i j+ − ζ πi j+ = 0
(3.7)
and we have ζ = n +++ Hence, from (3.7), it follows that πˆ i j+ =
n i j+ , n +++
Let λˆ iXj Y , λˆ kZ , λˆ YjkZ , and λˆ kiZ X be the ML estimators of λiXj Y , λkZ , λYjkZ , and λkiZ X , respectively. Then, we obtain K πˆ i j+ = exp λ + λˆ iX + λˆ Yj + λˆ iXj Y exp λˆ kZ + λˆ YjkZ + λˆ kiZ X
k=1 K = exp λˆ iXj Y exp λˆ kZ + λˆ YjkZ + λˆ kiZ X . k=1
(3.8)
3.3 Maximum Likelihood Estimation of Loglinear Models
65
Similarly, the ML estimators of πi+k and π+ jk that are made by the ML estimator of the model parameters are obtained as πˆ i+k =
n + jk n i+k , πˆ + jk = ; n +++ n +++
and as in (3.8), we have J exp λˆ Yj + λˆ iXj Y + λˆ YjkZ , πˆ i+k = exp λ + λˆ iX + λˆ kZ + λˆ kiZ X
j=1 I πˆ + jk = exp λ + λˆ Yj + λˆ kZ + λˆ YjkZ exp λˆ iX + λˆ iXj Y + λˆ kiZ X
i=1
As shown above, marginal probabilities πi j+ , πi+k , and π+ jk can be estimated in explicit forms of observations n i j+ , n i+k , and n + jk ; however, the model parameters in model [X Y, Y Z , Z X ] cannot be obtained in explicit forms of the observations. Hence, in this model, an iteration method is needed to get the ML estimators of the model parameters. In the ML estimation of model [X Y, Y Z ], the following estimators can be derived as follows: πˆ i j+ =
n i j+ n + jk , πˆ + jk = . n +++ n +++
Thus, by solving the following equations
n i j+ , n +++ n + jk = log n +++
λ + λˆ iX + λˆ Yj + λˆ iXj Y = log
λ + λˆ Yj + λˆ kZ + λˆ YjkZ
we can obtain the ML estimators λ, λˆ iX , λˆ Yj , λˆ kZ , λˆ iXj Y , and λˆ YjkZ in explicit forms. Through a similar discussion, the ML estimators of parameters in models [X Y, Z ] and [X, Y, Z ] can also be obtained in explicit forms as well.
3.4 Generalized Linear Models Generalized linear models (GLMs) are designed by random, systematic, and link components and make useful regression analyses of both continuous and categorical response variables [13, 15]. There are many cases of non-normal response variables in various fields of studies, e.g., biomedical researches, behavioral sciences, economics, etc., and GLMs play an important role in regression analyses for the
66
3 Analysis of the Association in Multiway Contingency Tables
non-normal response variables. GLMs include various kinds of regression models, e.g., ordinary linear regression model, loglinear models, Poisson regression models, and so on. In this section, variables are classified into response (dependent) and explanatory (independent) variables. Let Y be the response variable, and let
X = X 1 , X 2 , . . . , X p be an explanatory variable vector. The GLMs are composed of three components (i)–(iii) explained below. (i) Random Component Let f (y|x) be the conditional density or probability function. Then, the function is assumed to be the following exponential family of distributions:
f (y|x) = exp
yθ − b(θ ) + c(y, ϕ) , a(ϕ)
(3.9)
where θ and ϕ are the parameters. This assumption is referred to as the random component. If Y is the Bernoulli trial, then the conditional probability function is
π f (y|x) = π y (1 − π )1−y = exp ylog + log(1 − π ) . 1−π Corresponding to (3.9), we have
π , a(ϕ) = 1, b(θ ) = −log(1 − π ), c(y, ϕ) = 0. θ = log 1−π
(3.10)
For normal variable Y with mean μ and variance σ 2 , the conditional density function is
yμ − 21 μ2 y2 1 exp + − 2 f (y|x) = √ , σ2 2σ 2π σ 2 where θ = μ, a(ϕ) = σ 2 , b(θ ) =
1 2 y2 μ , c(y, ϕ) = − 2 . 2 2σ
(3.11)
Remark 3.3 For distribution (3.9), the expectation and variance are calculated. For simplicity, response variable is assumed to be continuous. Since f (y|x)dy = 1, we have d dθ
f (y|x)dy =
1 y − b (θ ) f (y|x)dy = 0. a(ϕ)
(3.12)
3.4 Generalized Linear Models
67
Hence, E(Y ) = b (θ ).
(3.13)
Moreover, differentiating (3.12) with respect to θ , we obtain Var(Y ) = b (θ )a(ϕ). In this sense, the dispersion parameter a(ϕ) relates to the variance of the response variable. If response variable Y is discrete, the integral in (3.12) is replaced by an appropriate summation with respect to Y. (ii) Systematic Component
T For regression coefficients β0 and β = β1 , β2 , . . . , β p , the linear predictor is given by η = β0 + β1 x1 + β2 x2 + . . . + β p x p = β0 + β T x,
(3.14)
where
T x = x1 , x2 , . . . , x p . (iii) Link Function For function h(u), mean (3.13) and predictor (3.14) are linked as follows:
b (θ ) = h −1 β0 + β T x . Eventually, according to the link function, θ is regarded as a function of β T x, so for simplicity of the discussion, we describe it as
θ = θ βT x . By using the above three components, regression models can be constructed flexibly. Example 3.1 For Bernoulli random variable Y (3.10), the following link function is assumed: h(u) = log
u . 1−u
Then, from h(π ) = log
π = β0 + β T x, 1−π
68
3 Analysis of the Association in Multiway Contingency Tables
we have the conditional mean of Y given x is
exp β0 + β T x f (1|x) =
. 1 + exp β0 + β T x The above model is a logistic regression model. For a normal distribution with (3.11), the identity function h(u) = u is used. Then, the GLM is an ordinary linear regression model, i.e., μ = β0 + β T x In both cases, we have θ = β0 + β T x and the links are called canonical links.
3.5 Entropy Multiple Correlation Coefficient for GLMs In GLMs composed of (i), (ii), and (iii) in the previous section, for baseline (X, Y ) = (x 0 , y0 ), the log odds ratio is given by
(y − y0 ) θ β T x − θ β T x 0 , log OR(x, x 0 ; y, y0 ) = a(ϕ)
(3.15)
where OR(x, x 0 ; y, y0 ) =
f (y|x) f (y0 |x 0 ) . f (y0 |x) f (y|x 0 )
(3.16)
The above formulation of log odds ratio is similar to that of the association model
(2.35). Log odds ratio (3.15) is viewed as an inner product of y and θ β T x with respect to the dispersion parameter a(ϕ). From (3.16), since log OR(x, x 0 ; y, y0 ) = {(−log f (y0 |x)) − (−log f (y|x))} − {(−log f (y0 |x 0 )) − (−log f (y|x 0 ))},
3.5 Entropy Multiple Correlation Coefficient for GLMs
69
predictor β T x is related to the reduction in uncertainty of response Y through explanatory variable X. For simplicity of the discussion, variables X and Y are assumed to be random. Then, the expectation of (3.15) with respect to the variables is calculated as
Cov Y, θ β T X (E(Y ) − y0 ) E θ β T X − θ β T x 0 + , a(ϕ) a(ϕ) The first term can be viewed as the mean change of uncertainty of response variable Y in explanatory variables X for baseline Y = E(Y ). Let g(x) and f (y) be the marginal density functions of X and Y, respectively. Then, as in the previous section, we have
¨ Cov Y, θ β T X f (y|x) = f (y|x)g(x)log dxdy a(ϕ) f (y) ¨ f (y) dxdy. + f (y)g(x)log f (y|x)
(3.17)
The above quantity is the sum of the two types of the KL information, so it is denoted by KL(X, Y ). If X and/or Y are discrete, the integrals in (3.17) are replaced by appropriate summations with respect to the variables. If X is a factor, i.e., not random, taking levels X = x 1 , x 2 , . . . , x K , then, (3.17) is modified as
K Cov Y, θ β T X f (y|x k ) 1 = dy f (y|x k ) log a(ϕ) K f (y) k=1 K f (y) 1 + dy. f (y) log K f (y|x k) k=1 From the Cauchy inequality, it follows that
√
Var(Y ) Var θ β T X Cov Y, θ β T X ≤ . KL(X, Y ) = a(ϕ) a(ϕ)
(3.18)
Definition 3.1 The entropy (multiple) correlation coefficient (ECC) between X and Y in GLM (3.9) with (i), (ii), and (iii) is defined as follows [7]:
Cov Y, θ β T X ECorr(Y, X) = √
. Var(Y ) Var θ β T X From (3.18) and (3.19), 0 ≤ ECorr(Y, X) ≤ 1.
(3.19)
70
3 Analysis of the Association in Multiway Contingency Tables
Since inequality (3.18) indicates the upper bound of KL(X, Y ) is
√ √ Var(Y ) Var(θ (β T X )) , a(ϕ)
as in the previous section, ECC is interpreted as the proportion of the explained entropy of response variable Y by explanatory variable vector X, and it can be used for a predictive or explanatory power measure for GLMs. This measure is an extension of the multiple correlation coefficient R in ordinary linear regression analysis. Remark 3.4 KL(X, Y ) (3.17) can also be expressed as
¨ Cov Y, θ β T X = ( f (y|x) − f (y))g(x)log f (y|x)dxdy. a(ϕ)
Theorem 3.1 In (3.19), ECC is decreasing as a(ϕ), given Var(Y ) and Var θ β T X . Proof For simplicity of the discussion, let us set E(Y ) = 0, and the proof is given in the case where explanatory variable X is continuous and random. Let f (y|x) be the conditional density or probability function of Y, and let g(x) be the marginal density or probability function of X. Then,
Cov Y, θ β X T
¨ =
yθ β T x f (y|x)g(x)dxdy.
Differentiating the above covariance with respect to a(ϕ), we have ¨ T
d d Cov Y, θ β X = yθ β T x f (y|x)g(x)dxdy da(ϕ) da(ϕ) ¨
d = yθ β T x f (y|x)g(x)dxdy da(ϕ) ¨ T
2 1 =− yθ β x f (y|x)g(x)dxdy ≤ 0. 2 a(ϕ)
Thus, the theorem follows.
For a GLM with the canonical link, ECC is the correlation coefficient between response variable Y and linear predictor θ = β0 + β T x, where
T β = β1 , β2 , . . . , β p .
(3.20)
3.5 Entropy Multiple Correlation Coefficient for GLMs
71
Then, we have p ECorr(Y, X) = √
βi Cov(Y, X i )
. Var(Y ) Var β T X i=1
(3.21)
Especially, for simple regression, ECorr(Y, X ) = |Corr(Y, X )|.
(3.22)
The property (3.21) is preferable for a predictive power measure, because the effects (contributions) of explanatory variables (factors) may be assessed by components related to the explanatory variables βi Cov(Y, X i ). Theorem 3.2 In GLMs with canonical links (3.20), if explanatory variables X 1 , X 2 , . . . , X p are statistically independent, βi Cov(Y, X i ) ≥ 0, i = 1, 2, . . . , p. Proof For simplicity of the discussion, we make a proof of the theorem for continuous variables. Since explanatory variables X 1 , X 2 , . . . , X p are independent, the joint density function is expressed as follows: g(x) =
p
gi (xi ),
i=1
where gi (xi ) are the marginal density or probability functions. Without the loss of generality, it is sufficient to show β1 Cov(Y, X 1 ) ≥ 0
The joint density function of Y and X = X 1 , X 2 , . . . , X p , f (y, x), and the
conditional density function of Y and X 1 given X /1 = X 2 , . . . , X p , f 1 y|x /1 , are expressed by f (y, x) = f (y|x)
f 1 y|x
/1
=
p
gi (xi ),
i=1
f (y, x) p dx1 , i=2 gi (x i )
where f (y|x) is the conditional density or probability function of Y, given X = x. p
βi X i , we have Since f 1 y|x /1 is a GLM with canonical link θ = β0 + i=2
72
3 Analysis of the Association in Multiway Contingency Tables
¨
p
f (y|x)
dxdy f 1 y|x /1 i=1
¨ p /1 f 1 y|x /1 dxdy gi (xi )log + f 1 y|x f (y|x) i=1 ¨ p
= f (y|x)g1 (x1 ) − f y|x /1 gi (xi )log f (y|x)dxdy
0≤
f (y|x)
gi (xi )log
i=1
= β1 Cov(Y, X 1 ).
This completes the theorem.
As a predictive power measure, the regression correlation coefficient (RCC) is also used [17]. The regression function is the conditional expectation of Y given X, i.e., E(Y |X) and RCC are defined by RCorr(Y, X) = √
Cov(Y, E(Y |X)) . √ Var(Y ) Var(E(Y |X))
(3.23)
The above measure also satisfies 0 ≤ RCorr(Y, X) ≤ 1 and in linear regression analysis it is the multiple correlation coefficient R. Since Cov(Y, E(Y |X)) = Var(E(Y |X)), RCC can also be expressed as follows: RCorr(Y, X) =
Var(E(Y |X)) . Var(Y )
The measure does not have such a decomposition property as ECC (3.21). In this respect, ECC is more preferable for GLMs than RCC. Comparing ECC and RCC, ECC measures the predictive power of GLMs in entropy; whereas, RCC measures that in the Euclidian space. As discussed above, regression coefficient β in GLMs is related to the entropy of response variable Y. From this point of view, ECC is more advantageous than RCC. Theorem 3.3 In GLMs, the following inequality holds: RCorr(Y, X) ≥ ECorr(Y, X). Proof Since
Cov Y, θ β T X = Cov E(Y |X), θ β T X , Cov(Y, E(Y |X)) = Var(E(Y |X)),
3.5 Entropy Multiple Correlation Coefficient for GLMs
73
standardizing them, we have
Cov Y, θ β T X Cov E(Y |X), θ β T X Cov(E(Y |X), E(Y |X)) =√ ≤√ √
√ Var(Y ) Var(E(Y |X)) Var(Y ) Var θ β T X Var(Y ) Var θ β T X =√
Var(E(Y |X)) , √ Var(Y ) Var(E(Y |X))
Hence, the inequality of the theorem follows. The equality holds if and only if there exist constants c and d such that
θ β T X = cE(Y |X) + d.
This completes the theorem.
Example 3.2 For binary response variable Y and explanatory variable X, the following logistic regression model is considered: f (y|x) =
exp{y(α + βx)} , y = 0, 1. 1 + exp(α + βx)
Since the link is canonical and the regression is simple, ECC can be calculated by (3.22). On the other hand, in this model, the regression function is E(Y |x) =
exp{(α + βx)} . 1 + exp(α + βx)
From this, RCC is obtained as Corr(Y, E(Y |X )). Example 3.3 Table 3.1 shows the data of killed beetles after exposure to gaseous carbon disulfide [4]. Response variable Y is Y = 0(killed), 1(alive). For the data, a complementary log–log model was applied. In the model, for function h(u), the link is given by h −1 (E(Y |x)) = log(−log(1 − E(Y |x))) = α + βx. Table 3.1 Beetle mortality data Log dose
1.691
1.724
1.755
1.784
1.811
1.837
1.861
1.884
No. beetles
59
60
62
56
63
59
62
60
No. killed
6
13
18
25
52
53
61
60
Source Bliss [4], Agresti [2]
74
3 Analysis of the Association in Multiway Contingency Tables 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 1.65
1.7
1.75
1.8
1.85
1.9
Fig. 3.1 Estimated conditional probability function in log dose X
Since the response variable is binary, E(Y |x) is equal to the conditional probability function of Y, and it is given by f (y|x) = {1 − exp(−exp(α + βx))} y exp(−(1 − y)exp(α + βx)). Then, θ in (3.9) is θ = log
1 − exp(−exp(α + βx)) f (1|x) = . 1 − f (1|x) exp(−exp(α + βx))
From the data, the estimates of the parameters are obtained as follows: αˆ = −39.52, βˆ = 22.01. The graph of the estimated f (y|x) in log dose X is given in Fig. 3.1. ECC and RCC are calculated according to (3.19) and (3.23), respectively, as follows: ECC = 0.681, RCC = 0.745. According to ECC, 68.1% of entropy in Y is reduced by the explanatory variable. Definition 3.2 Let sub X Y be a subset of the sample space of explanatory variable vector X and response variable Y in a GLM. Then, the correlation coefficient between
entropy correlation θ β T X and Y restricted in sub X Y is referred to as the conditional
. coefficient between X and Y, which is denoted by ECorr Y, X| sub XY T For explanatory T variable vector X = (X 1 , X 2 ) , the correlation coefficient between θ β X and Y given X 2 = x2 is the conditional correlation coefficient between X and Y, i.e.,
3.5 Entropy Multiple Correlation Coefficient for GLMs
75
ECorr(Y, X|X 2 = x2 ) = Corr Y, θ β T X |X 2 = x2 In an experimental case, let X be a factor with quantitative levels {c1 , c2 , . . . , c I } and βx be the systematic component. In this case, the factor levels are randomly assigned to an experimental unit (subject), so it implies P(X = ci ) = 1I , and restricting the factor levels to ci , c j , i.e., sub X Y = (x, y)|x = ci or c j ,
1 it follows that P X = ci | sub X Y = 2 . From this, it follows that
θ (βci ) + θ βc j sub , E θ (β X )| X Y = 2
E(Y |X = ci ) + E Y |X = c j sub , E Y | X Y = 2
1 Cov(Y, θ (β X )) = E(Y |X = ci ) − E Y |X = c j θ (βci ) − θ βc j , 4
2
θ (βci ) − θ βc j sub . Var θ (β X )| X Y = 4 Then, we have
ECorr Y,
X | sub XY
E(Y |X = ci ) − E Y |X = c j , =
2 Var Y | sub XY
(3.24)
The results are similar to those of pairwise comparisons of the effects of factor levels. If response variable Y is binary, i.e.,Y ∈ {0, 1}, we have
ECorr Y, X | sub XY
E(Y |X = ci ) − E Y |X = c j =
. E(Y |X = ci ) + E Y |X = c j 2 − E(Y |X = ci ) − E Y |X = c j (3.25)
Example 3.4 We apply (3.24) to the data in Table 3.1. For X = 1.691, 1.724, the partial data are illustrated in Table 3.2. In this case, we calculate the estimated conditional probabilities (Table 3.3), and from (3.25), we have Table 3.2 Partial table in Table 3.1
Log dose
1.691
1.724
No. beetles
59
60
No. killed
6
13
76
3 Analysis of the Association in Multiway Contingency Tables
Table 3.3 Estimated conditional probability distribution of beetle mortality data in Table 3.1 Log dose
1.691
Killed
0.095
1.724 0.187
Alive
0.905
0.813
Table 3.4 Conditional ECCs for base line log dose X = 1.691 Log dose
1724
1.755
1.784
1.811
1.837
1.861
1.884
Conditional ECC
0.132
0.293
0.477
0.667
0.822
0.893
0.908
1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 1.7
1.72
1.74
1.76
1.78
1.8
1.82
1.84
1.86
1.88
1.9
Fig. 3.2 Estimated conditional ECC in l log dose X
ECorr(Y, X |X = {1.691, 1.724}) = 0.132. It means the effect of level X = 1.724 for base line X = 1.691, i.e., 13.2% of entropy of the response is reduced by the level X = 1.724. The conditional ECCs of X = x for baseline X = 1.691 are calculated in Table 3.4. The conditional ECC is increasing in log dose X. The estimated conditional ECC is illustrated in Fig. 3.2.
3.6 Multinomial Logit Models Let X 1 and X 2 be categorical explanatory variables with categories {1, 2, . . . , I } and {1, 2, . . . , J }, respectively, and let Y be a categorical response variable with levels {1, 2, . . . , K }. Then, the conditional probability function of Y given explanatory variables (X 1 , X 2 ) is assumed to be
3.6 Multinomial Logit Models
77
exp αk + β(1)ki + β(2)k j f (Y = k|(X 1 , X 2 ) = (i, j)) = K
. k=1 exp αk + β(1)ki + β(2)k j
(3.26)
Let X ai =
1 (X a = i) , a = 1, 2, ; Yk = 0 (X a = i)
1 (Y = k) , 0 (Y = k)
Then, dummy variable vectors X 1 = (X 11 , X 12 , . . . , X 1I )T , X 2 = (X 21 , X 22 , . . . , X 2J )T , and Y = (Y1 , Y2 , . . . , Y K )T are identified with categorical explanatory variables X 1 , X 2 , and Y, respectively, where the upper suffix “T ” implies the transpose of the vector and matrix. From this, the systematic component of the above model can be expressed as follows: T T X 1 + B (2) X 2, θ = α + B (1)
where θ = (θ1 , θ2 , . . . , θ K )T , α = (α1 , α2 , . . . , α K )T , ⎛ ⎛ ⎞ β(1)11 β(1)12 . . . β(1)1I β(2)11 β(2)12 ⎜ β(1)21 β(1)22 . . . β(1)2I ⎟ ⎜ β(2)21 β(2)22 ⎜ ⎜ ⎟ B(1) = ⎜ . .. .. .. ⎟, B(2) = ⎜ .. .. . ⎝ . ⎝ . . . . ⎠ . β(1)K 1 β(1)K 2 . . . β(1)K I β(2)K 1 β(2)K 2
⎞ . . . β(2)1J . . . β(2)2J ⎟ ⎟ .. .. ⎟. . . ⎠ . . . β(2)K J
Then, the conditional probability function is described as follows:
exp y T α + y T B (1) x 1 + y T B (2) x 2
, f ( y|x 1 , x 2 ) = T T T y exp y α + y B (1) x 1 + y B (2) x 2 where y implies the summation over all categories y. In this case, a(ϕ) = 1 and the KL information (3.17) become as follows: KL(X, Y ) = trCov(θ , Y ),
where Cov(θ , Y ) is the K × K matrix with (i. j) elements Cov θi , Y j . From this, ECC can be extended as ECorr(Y , (X 1 , X 2 )) = √
trCov(θ , Y ) , √ trCov(θ, θ ) trCov(Y , Y )
78
3 Analysis of the Association in Multiway Contingency Tables
where trCov(θ , θ ) the K × K matrix with (i. j) elements
Cov θi , θ j ; and trCov(Y , Y )the K × K matrix with (i. j) elements Cov Yi , Y j . Since trCov(θ , Y ) =
K
Cov(θk , Yk ) = trB(1) Cov(X 1 , Y ) + trB(2) Cov(X 2 , Y ),
k=1
trCov(θ , θ ) =
K
Var(θk ), trCov(Y , Y ) =
k=1
K
Var(Yk ),
k=1
we have ECorr(Y , (X 1 , X 2 )) =
trB (1) Cov(X 1 , Y ) + trB (2) Cov(X 2 , Y ) , K K k=1 Var(θk ) k=1 Var(Yk )
From this, the contributions of categorical variables X 1 and X 2 may be assessed by using the above decomposition. Remark 3.5 In model (3.26), let πiXjk1 X 2 Y ≡ P(X 1 = i, X 2 = j, Y = k)
exp αk + β(1)ki + β(2)k j = K
P(X 1 = i, X 2 = j). k=1 exp αk + β(1)ki + β(2)k j
(3.27)
Then, the above model is equivalent to loglinear model [X 1 X 2 , X 1 Y, X 2 Y ]. Example 3.5 The data for an investigation of factors influencing the primary food choice of alligators (Table 3.5) are analyzed ([2; pp. 268–271]). In this example, explanatory variables are X 1 : lakes where alligators live, {1. Hancock, 2. Oklawaha, Table 3.5 Alligator food choice data Lake
Size
Primary food choice Fish
Invertebrate
Reptile
Bird
Other
≤2.3 m (S)
23
4
2
2
8
>2.3 m (L)
7
0
1
3
5
Oklawaha
S
5
11
1
0
3
L
13
8
6
1
0
Trafford
S
5
11
2
1
5
L
8
7
6
3
5
S
16
19
1
2
3
L
17
1
0
1
3
Hancock
George
Source Agresti ([2], p. 268)
3.6 Multinomial Logit Models
79
3. Trafford, 4. George}; and X 2 : sizes of alligators, {1. small, 2. large}; and the response variable is Y: primary food choice of alligators, {1. fish, 2. invertebrate, 3. reptile, 4. bird, 5. other}. Model (3.26) is used for the analysis, and the following dummy variables are introduced: X 1i = Yk =
1 (X 1 = i) , i = 1, 2, 3, 4; X 2 j = 0 (X 1 = i)
1 (X 2 = j) , j = 1, 2, ; 0 (X 2 = j)
1 (Y = k) , k = 1, 2, 3, 4, 5. 0 (Y = k)
Then, the categorical variables X 1 , X 2 , and the response variable Y are identified with the correspondent dummy random vectors: X 1 = (X 11 , X 12 , X 13 , X 14 )T , X 2 = (X 21 , X 22 )T , and Y = (Y1 , Y2 , Y3 , Y4 , Y5 )T , respectively. Under the constraints, β(1)ki = 0 for k = 5, i = 4; β(2)k j = 0 for k = 5, j = 2, the following estimates of regression coefficients are obtained: ⎛
ˆ (1) B
−0.826 ⎜ −2.485 ⎜ ⎜ = ⎜ 0.417 ⎜ ⎝ −0.131 0
−0.006 0.931 2.454 −0.659 0
−1.516 −0.394 1.419 −0.429 0
⎞ ⎞ ⎛ 0 −0.332 0 ⎜ 1.127 0 ⎟ 0⎟ ⎟ ⎟ ⎜ ⎟ ˆ ⎟ ⎜ 0 ⎟, B (2) = ⎜ −0.683 0 ⎟. ⎟ ⎟ ⎜ ⎝ −0.962 0 ⎠ 0⎠ 0 0 0
From this, we have
trBˆ (1) Cov(X 1 , Y ) = 0.258, trBˆ (2) Cov(X 2 , Y ) = 0.107, T
T
trBˆ (1) Bˆ (1) Cov(X 1 , X 1 ) = 2.931, trBˆ (2) Bˆ (2) Cov(X 2 , X 2 ) = 0.693,
trCov(Y , Y ) = 0.691. From the above estimates, we calculate the ECC as
ˆ (2) Cov(X 2 , Y ) ˆ (1) Cov(X 1 , Y ) + tr B 0.258 + 0.107 tr B =√ ECorr(Y , (X 1 , X 2 )) = √ 2.931 + 0.693 0.691 trCov(θ , θ ) trCov(Y , Y )
= 0.231. Although the effects of factors are statistically significant, the predictive power of the logit model may be small, i.e., only 23.1% of the uncertainty of the response
80
3 Analysis of the Association in Multiway Contingency Tables
variable is reduced or explained by the explanatory variables. In this sense, the explanatory power of the model is weak. From the above analysis, we assess the effects of lake and size on food, respectively. Since
ˆ (2) Cov(X 2 , Y ) = 0.107. ˆ (1) Cov(X 1 , Y ) = 0.258 and tr B tr B It may be thought that the effect of lake on food is about 2.4 times larger than that of size.
3.7 Entropy Coefficient of Determination Measurement of the predictive or explanatory power is important in GLMs, and it leads to the determination of important factors in GLMs. In ordinary linear regression models, the coefficient of determination is useful to measure the predictive power of the models. As discussed above, since GLMs have an entropy-based property, it is suitable to consider predictive power measures for GLMs from a viewpoint of entropy. Definition 3.3 A general coefficient of determination is defined by R 2D =
D(Y ) − D(Y |X) , D(Y )
(3.28)
where D(Y ) and D(Y |X) are a variation function of Y and a conditional variation function of Y given X, respectively [1, 5, 12]. D(Y |X) implies that D(Y |X) =
D(Y |X = x)g(x)dx,
where g(x) is the marginal density or probability function of X [1, 11]. For ordinary linear regression models, D(Y ) = Var(Y ), D(Y |X) = Var(Y |X), A predictive power measure based on the likelihood function (15, 16) is given as follows:
R L2 = 1 −
L(0) L(β)
n2
,
(3.29)
where L(β) is a likelihood function relating to the regression coefficient vector β, and n is the sample size. For the ordinary linear regression model, the above measure is the coefficient of determination R 2 ; however, it is difficult to interpret the measure
3.7 Entropy Coefficient of Determination
81
(3.29) in general cases. This is a drawback of this measure. For GLMs with categorical response variables, the following entropy-based measure is proposed: R 2E =
H (Y ) − H (Y |X) , H (Y )
(3.30)
where H (Y ) and H (Y |X) are the entropy of Y and the conditional entropy given X [11]. As discussed in Sect. 3.4, GLMs have properties in entropy, e.g., log odds ratios (3.15) and KL information (3.17) are related to linear predictors. We refer the entropy-based measure (3.17), i.e., KL(X, Y ), to as a basic
predictive power measure for GLMs. The measure is increasing in Cov Y, θ β T X and decreasing in a(ϕ). Since
Var(Y |X = x) = a(ϕ)b θ β T x and
Cov Y, θ β T X , KL(X, Y ) = a(ϕ) function a(ϕ)
can be interpreted as error variation of Y in entropy, and Cov Y, θ β T X is the explained variation of Y by X. From this, we define entropy variation function of Y as follows:
D E (Y ) = Cov Y, θ β T X + a(ϕ),
(3.31)
Since
Cov Y, θ β X |X = T
(Y − E(Y |x))θ β T x g(x)dx = 0,
the conditional entropy variation function D E (Y |X)is calculated as
D E (Y |X) = Cov Y, θ β T X |X + a(ϕ) = a(ϕ).
(3.32)
According to (3.28), we have the following definition: Definition 3.4 The entropy coefficient of determination (ECD) [8] is defined by ECD(X, Y ) =
D E (Y ) − D E (Y |X) . D E (Y )
(3.33)
ECD can be described by using KL(X, Y ), i.e.,
Cov Y, θ β T X KL(X, Y ) ECD(X, Y ) = = . T
KL(X, Y ) + 1 Cov Y, θ β X + a(ϕ)
(3.34)
82
3 Analysis of the Association in Multiway Contingency Tables
From the above formulation, ECD is interpreted as the ratio of the explained variation of Y by X for the variation of Y in entropy. For ordinary linear regression 2 models, T ECDT is the coefficient of determination R , and for canonical link, i.e., θ β X = β X, ECD has the following decomposition property: p βi Cov(Y, X i ) ECD(X, Y ) = . i=1 T
Cov Y, θ β X + a(ϕ) Thus, ECC and ECD have the decomposition property with respect to explanatory variables X i . This property is preferable, because the contributions of explanatory variables X 1 can be assessed with components βi Cov(Y, X i ) ,
Cov Y, θ β T X + a(ϕ)
if the explanatory variables are statistically independent. A theoretical discussion on assessing explanatory variable contribution is given in Chap. 7. We have the following theorem:
T Theorem 3.4 Let X = X 1 , X 2 , . . . , X p be an explanatory variable vector; let
T X \i = X 1 , X 2 , . . . , X i−1 X i+1 , . . . , X p be a sub-vector of X; and let Y be a response variable. Then,
KL(X, Y ) ≥ KL X \i , Y , i = 1, 2, . . . , p. Proof Without loss of generality, the theorem is proven for k = 1, i.e., X \1 =
T X 2 , . . . , X p . For simplicity of the discussion, the variables concerned are assumed
to be continuous. Let f (x, y) be the joint density function of X and Y; f 1 x \1 , y be the joint density function of X \1 and Y; f Y (y) be the marginal density function of Y; g(x) be the marginal density function of X; and let g1 x \1 be the marginal density function of X \1 . Since
f (x, y) = f x1 , x \1 , y , g(x) = g x1 , x \1 , we have ˚ f x1 , x \1 , y \1 \1
dx1 dx \1 dy KL(X, Y ) − KL X , Y = f x1 , x , y log f Y (y)g x1 , x \1 ˚ f Y (y)g x1 , x \1 \1
dx1 dx \1 dy. + f Y (y)g x1 , x log f x1 , x \1 , y ˚ f 1 x \1 , y
dx1 dx \1 dy − f x1 , x \1 , y log f Y (y)g1 x \1
3.7 Entropy Coefficient of Determination
83
f Y (y)g1 x \1
dx1 dx \1 dy − f 1 x \1 , y ˚ f x1 , x \1 , y g1 x \1
dx1 dx \1 dy
= f x1 , x \1 , y log f 1 x \1 , y g x1 , x \1 ˚ f 1 x \1 , y g x1 , x \1
dx1 dx \1 dy. + f Y (y)g x1 , x \1 log \1 g1 x f x1 , x \1 , y ˚ f x1 , x \1 , y /g x1 , x \1 \1
= f x1 , x , y log dx1 dx \1 dy f 1 x \1 , y /g1 x \1 ˚ g x1 , x \1 /g1 x \1
dx1 dx \1 dy ≥ 0. + f Y (y)g x1 , x \1 log f x1 , x \1 , y / f 1 x \1 , y ˚
f Y (y)g x1 , x \1 log
This completes the theorem.
The above theorem guarantees that ECD is increasing in complexity of GLMs, where “complexity” implies the number of explanatory variables. This is an advantage of ECD. Example 3.6 Applying ECD to Example 3.5, from (3.34) we have
ˆ (2) Cov(X 2 , Y ) ˆ (1) Cov(X 1 , Y ) + tr B 0.258 + 0.107 tr B = ECD(Y , (X 1 , X 2 )) = ˆ (1) Cov(X 1 , Y ) + tr B ˆ (2) Cov(X 2 , Y ) + 1 0.258 + 0.107 + 1 tr B
= 0.267. From this, 26.5% of the variation of response variable Y in entropy is explained by explanatory variables X 1 and X 2 . Desirable properties for predictive power measures for GLMs are given as follows [9]: (i) A predictive power measure can be interpreted, i.e., interpretability. (ii) The measure is the multiple correlation coefficient or the coefficient of determination in normal linear regression models. (iii) The measure has an entropy-based property. (iv) The measure is applicable to all GLMs (applicability to all GLMs). (v) The measure is increasing in the complexity of the predictor (monotonicity in the complexity of the predictor). (vi) The measure is decomposed into components with respect to explanatory variables (decomposability). First, RCC (3.23) is checked for the above desirable properties. The measure can be interpreted as the correlation coefficient (cosine) of response variable Y and the regression function E(Y |X) and is the multiple correlation coefficient R in the ordinary linear regression analysis; however, the measure has no property in entropy. The measure is available only for single continuous or binary variable cases. For
84
3 Analysis of the Association in Multiway Contingency Tables
polytomous response variable cases, the measure cannot be employed. It has not been proven whether the measure satisfies property (v) or not, and it is trivial for the measure not to have property (vi). Second, R L2 is considered. From (3.29), the measure is expressed as 2
R L2 =
2
L(β) n − L(0) n 2
L(β) n ,
;
2 2 however, L(β) n and L(0) n cannot be interpreted. Let βˆ be the ML estimates of β. Then, the likelihood ratio statistic is
⎛ ⎞ n2 L βˆ 1 ⎠ = ⎝ . L(0) 1 − R2 From this, this measure satisfies property (ii). Since logL(β) is interpreted in a viewpoint of entropy, so the measure has property (iii). This measure is applicable to all GLMs and increasing in the number of explanatory variables, because it is based on the likelihood function; however, the measure does not have property (vi). Concerning R 2E , the measure was proposed for categorical response variables, so the measure has properties except (ii) and (vi). For ECC, as explained in this chapter and the previous one, this measure has (i), (ii), (iii), and (vi); and the measure is the correlation coefficient of the response variable and the predictor of explanatory variables, so this measure is not applicable to continuous response variable vectors and polytomous response variables. Moreover, it has not proven whether the measure has property (v) or not. As discussed in this chapter, ECD has all the properties (i) to (vi). The above discussion is summarized in Table 3.6. Mittlebock and Schemper [14] compared some summary measures of association in logistic regression models. The desirable properties are (1) interpretability, (2) consistency with basic characteristics of logistic regression, (3) the potential range should be [0,1], and (4) R 2 . Property (2) corresponds to (iii), and property (3) is met Table 3.6 Properties of predictive power measures for GLMs RCC
R 2L
R 2E
ECC
ECD
(i) Interpretability
◯
×
◯
◯
◯
(ii) Ror R 2
◯
◯
×
◯
◯
(iii) Entropy
×
◯
◯
◯
◯
(iv) All GLMs
×
◯
◯
×
◯
(v) Monotonicity
◯
◯
◯
(vi) Decomposition
×
×
×
◯
◯
◯: The measure has the property; : may have the property, but has not been proven; ×: does not have the property
3.7 Entropy Coefficient of Determination
85
by ECD and ECC. In the above discussion, ECD is the most suitable measure for the predictive power of GLMs. A similar discussion for binary response variable models was given by Ash and Shwarts [3].
3.8 Asymptotic Distribution of the ML Estimator of ECD The entropy coefficient of determination is a function of KL(X, Y ) as shown in (3.34), so it is sufficient to derive the asymptotic property of the ML estimator of KL(X, Y ). Let f (y|x) be the conditional density or probability function of Y given X = x, and let f (y) be the marginal density or probability function of Y. First, we consider the case where explanatory variables are not random. We have the following theorem [8]: Theorem 3.5 In GLMs with systematic component (3.14), let x k , k = 1, 2, . . . , K be the levels of p dimensional explanatory variable vector X and let n k be the numbers of observations at levels x k , k = 1, 2, . . . , K ; Then, for sufficiently large n k , k = 1, 2, . . . , K , statistic N K L (X, Y ) is asymptotically chi-square distribution with degrees of freedom p under the null hypothesis H0 : β = 0, where N = nk=1 n k .
Proof Let (x k , yki ), i = 1, 2, . . . , n k ; k = 1, 2, . . . , K be random samples. Let fˆ(y|x), fˆ(y), and K L (X, Y ) be the ML estimators of f (y|x), f (y), and K L(X, Y ), respectively. Then, the likelihood functions under H0 : β = 0 and H1 : β = 0 are, respectively, given as follows:
l0 =
nk K
log fˆ(yki ),
(3.35)
log fˆ(yki |x k ).
(3.36)
k=1 i=1
and l=
nk K k=1 i=1
Then, K nk 1 fˆ(yki |x k ) 1 log (l − l0 ) = N N k=1 i=1 fˆ(yki ) K nk f (y|x k ) dy(in probability), → f (y|x k )log N f (y) k=1
as n k → ∞, k = 1, 2, . . . , K . Under the null hypothesis, it follows that
86
3 Analysis of the Association in Multiway Contingency Tables
K nk
fˆ(y|x k ) dy fˆ(y|x k )log N fˆ(y) k=1 K nk f (y|x k ) → dy(in probability). f (y|x k )log N f (y) k=1
For sufficiently large n k , k = 1, 2, . . . , K , under the null hypothesis: H0 : β = 0, we also have K nk k=1
N
K 1 nk fˆ(y|x k ) fˆ(y) dy = dy + o fˆ(y|x k )log fˆ(y)log , N N fˆ(y) fˆ(y|x k ) k=1
where
N ·o
1 N
→ 0 (in probability).
From the above discussion, we have fˆ(y) fˆ(y|x k ) ˆ KL(X, Y ) = dy + fˆ(y)log dy f (y|x k )log N fˆ(y) fˆ(y|x k ) k=1
2 1 = (l − l0 ) + o . N N
K nk
!
From this, under the null hypothesis: H0 : β = 0, it follows that
N KL(X, Y ) = 2(l − l0 ) + N · o
1 . N
Hence, the theorem follows.
As in the above theorem, for random explanatory variables X, the following theorem holds:
Theorem 3.6 For random explanatory variables X, statistic N K L (X, Y ) is asymptotically chi-square distribution with degrees of freedom p under the null hypothesis H0 : β = 0 for sufficiently large sample size N. Proof Let (x i , yi ), i = 1, 2, . . . , N , be random samples, and let fˆ(y), g(x), ˆ and fˆ(y|x) be the ML estimators of functions f (y), g(x), and f (y|x), respectively. Let l0 =
N i=1
log fˆ(yi )g(x ˆ i)
3.8 Asymptotic Distribution of the ML Estimator of ECD
87
and l=
N
log fˆ(yi |x i )g(x ˆ i ).
i=1
Then, by using a similar method in the previous theorem, as N → ∞, we have
¨ N 1 1 fˆ(y|x) fˆ(yi |x i ) 1 ˆ dxdy + o = f (y|x)g(x)log ˆ log (l − l0 ) = ˆ ˆ N N i=1 N f (yi ) f (y) ¨ f (y|x) → f (y|x)g(x)log dxdy(in probability). f (y) Under the null hypothesis: H0 : β = 0, we also have ¨
¨ fˆ(y|x) dxdy = fˆ(y)g(x)log ˆ fˆ(y|x)g(x)log ˆ fˆ(y)
fˆ(y) 1 . dxdy + o ˆ N f (y|x)
From this, it follows that
1 2(l − l0 ) + N · o N
¨
fˆ(y|x) fˆ(y|x)g(x)log ˆ dxdy fˆ(y)
¨ ˆ(y) f 1 + fˆ(y)g(x)log ˆ dxdy + o ˆ N f (y|x)
1 . = N KL(X, Y ) + o N
=N
This completes the theorem as follows.
By using the above theorems, we can test H0 : β = 0 versus H1 : β = 0
Example 3.7 In Example 3.5, the test for H0 : B (1) , B (2) = 0 versus H1 : B (1) , B (2) = 0 is performed by using the above discussion. Since N = 219,we have ˆ (1) Cov(X 1 , Y ) + tr B ˆ (2) Cov(X 2 , Y ) 219KL(Y , (X 1 , X 2 )) = 219 tr B
= 219(0.258 + 0.107) = 79.935.(d f = 16, P = 0.000).
From this, regression coefficient matrix B (1) , B (2) is statistically significant.
88
3 Analysis of the Association in Multiway Contingency Tables
3.9 Discussions In this chapter, first, loglinear models have been discussed. In the models, the logarithms of probabilities in contingency tables are modeled with association parameters, and as pointed out in Sect. 3.2, the associations between the categorical variables in the loglinear models are measured with odds ratios. Since the number of odds ratios is increasing as those of categories of variables increasing, further studies are needed to make summary measures of association for the loglinear models. There may be a possibility to make entropy-based association measures that are extended versions of ECC and ECD. Second, predictive or explanatory power measures for GLMs have been discussed. Based on an entropy-based property of GLMs, ECC and ECD were considered. According to six desirable properties for predictive power measures as listed in Sect. 3.7, ECD is the best to measure the predictive power.
References 1. Agresti, A. (1986). Applying R2 -type measures to ordered categorical data. Technometrics, 28, 133–138. 2. Agresti, A. (2002). Categorical data analysis (2nd ed.). New York: John Wiley & Sons Inc. 3. Ash, A., & Shwarts, M. (1999). R2 : A useful measure of model performance with predicting a dichotomous outcome. Stat Med, 18, 375–384. 4. Bliss, C. I. (1935). The calculation of the doze-mortality curve. Ann Appl Biol, 22, 134–167. 5. Efron, B. (1978). Regression and ANOVA with zero-one data: measures of residual variation. J Am Stat Assoc, 73, 113–121. 6. Eshima, N., & Tabata, M. (1997). The RC(M) association model and canonical correlation analysis. J Jpn Stat Soc, 27, 109–120. 7. Eshima, N., & Tabata, M. (2007). Entropy correlation coefficient for measuring predictive power of generalized linear models. Stat Probab Lett, 77, 588–593. 8. Eshima, N., & Tabata, M. (2010). Entropy coefficient of determination for generalized linear models. Comput Stat Data Anal, 54, 1381–1389. 9. Eshima, N., & Tabata, M. (2011). Three predictive power measures for generalized linear models: entropy coefficient of determination, entropy correlation coefficient and regression correlation coefficient. Comput Stat Data Anal, 55, 3049–3058. 10. Goodman, L. A. (1981). Association models and canonical correlation in the analysis of crossclassification having ordered categories. J Am Stat Assoc, 76, 320–334. 11. Haberman, S. J. (1982). Analysis of dispersion of multinomial responses. J Am Stat Assoc, 77, 568–580. 12. Korn, E. L., & Simon, R. (1991). Explained residual variation, explained risk and goodness of fit. Am Stat, 45, 201–206. 13. McCullagh, P., & Nelder, J. A. (1989). Generalized linear models (2nd ed.). Chapman and Hall: London. 14. Mittlebock, M., & Schemper, M. (1996). Explained variation for logistic regression. Stat Med, 15, 1987–1997. 15. Nelder, J. A., & Wedderburn, R. W. M. (1972). Generalized linear model. J Roy Stat Soc A, 135, 370–384.
References
89
16. Theil, H. (1970). On the estimation of relationships involving qualitative variables. Am J Sociol, 76, 103–154. 17. Zheng, B., & Agresti, A. (2000). Summarizing the predictive power of a generalized linear model. Stat Med, 19, 1771–1781.
Chapter 4
Analysis of Continuous Variables
4.1 Introduction Statistical methodologies for continuous data analysis have been well developed for over a hundred years, and many research fields apply them for data analysis. For correlation analysis, simple correlation coefficient, multiple correlation coefficient, partial correlation coefficient, and canonical correlation coefficients [12] were proposed and the distributional properties of the estimators were studied. The methods are used for basic statistical analysis of research data. For making confidence regions of mean vectors and testing those in multivariate normal distributions, the Hotelling’s T 2 statistic was proposed as a multivariate extension of the t statistic [11], and it was proven its distribution is related to the F distribution. In discriminant analysis of several populations [7, 8], discriminant planes are made according to the optimal classification method of samples from the populations, which minimize the misclassification probability [10]. In methods of experimental designs [6], the analysis of variance has been developed, and the methods are used for studies of experiments, e.g., clinical trials and experiments with animals. In real data analyses, we often face with missing data and there are cases that have to be analyzed by assuming latent variables, as well. In such cases, the Expectation and Maximization (EM) algorithm [2] is employed for the ML estimation of the parameters concerned. The method is very useful to make the parameter estimation from missing data. The aim of this chapter is to discuss the above methodologies in view of entropy. In Sect. 4.2, the correlation coefficient in the bivariate normal distribution is discussed through an association model approach in Sect. 2.6, and with a discussion similar to the RC(1) association model [9], the entropy correlation coefficient (ECC) [3] is derived as the absolute value of the usual correlation coefficient. The distributional properties are also considered with entropy. Section 4.3 treats regression analysis in the multivariate normal distribution. From the association model and the GLM frameworks, it is shown that ECC and the multiple correlation coefficient are equal and that the entropy coefficient of determination (ECD) [4] is equal to the usual © Springer Nature Singapore Pte Ltd. 2020 N. Eshima, Statistical Data Analysis and Entropy, Behaviormetrics: Quantitative Approaches to Human Behavior 3, https://doi.org/10.1007/978-981-15-2552-0_4
91
92
4 Analysis of Continuous Variables
coefficient of determination. In Sect. 4.4, the discussion in Sect. 4.3 is extended to that of the partial ECC and ECD. Section 4.5 treats canonical correlation analysis. First, canonical correlation coefficients between two random vectors are derived with an ordinary method. Second, it is proven that the entropy with respect to the association between the random vectors is decomposed into components related to the canonical correlation coefficients, i.e., pairs of canonical variables. Third, ECC and ECD are considered to measure the association between the random vectors and to assess the contributions of pairs of canonical variables in the association. In Sect. 4.6, the Hotelling’s T 2 statistic from the multivariate samples from a population is discussed. It is shown that the statistic is the estimator of the KL information between two multivariate normal distributions, multiplied by the sample size. Section 4.7 extends the discussion in Sect. 4.6 to that for comparison between two multivariate normal populations. In Sect. 4.8, the one-way layout experimental design model is treated in a framework of GLMs, and ECD is used for assessing the factor effect. Section 4.9, first, makes a discriminant plane between two multivariate normal populations based on an optimal classification method of samples from the populations, and second, an entropy-based approach for discriminating the two populations is given. It is also shown that the squared Mahalanobis’ distance between the two populations [13] is equal to the KL information between the two multivariate normal distributions. Finally, in Sect. 4.10, the EM algorithm for analyzing incomplete data is explained in view of entropy.
4.2 Correlation Coefficient and Entropy Let (X , Y ) be a random vector of which the joint distribution is a bivariate normal distribution with mean vector μ = (μX , μY )T and variance–covariance matrix =
σ11 σ12 . σ21 σ22
(4.1)
Then, the density function f (x1 , x2 ) is given by σ12 (x − μX )(y − μY ) σ22 (x − μX )2 + σ11 (y − μY )2 − f (x, y) = 1 exp || 2|| 2π || 2 (4.2) 1
The logarithm of the above density function is σ22 (x − μX )2 σ11 (y − μY )2 − 2|| 2|| 2π || σ12 (x − μX )(y − μY ) . + ||
log f (x, y) = log
1
1 2
−
4.2 Correlation Coefficient and Entropy
93
Let us set λ = log
1 2π ||
1 2
, λX (x) = −
σ22 (x − μX )2 Y σ11 (y − μY )2 σ12 . , λ (y) = − ,ϕ = || 2|| 2||
From the above formulation, we have log f (x, y) = λ + λX (x) + λY (y) + ϕ(x − μX )(y − μY ).
(4.3)
The model is similar to the RC(1) association in Sect. 2.6. The log odds model ratio with respect to four points, (x, y), x , y , x, y , and x , y , is calculated as f (x, y)f x , y = ϕ x − x y − y . log f (x , y)f (x, y ) Replacing x and y for the means and taking the expectation of the above log odds ratio with respect to X and Y we have
f (X , Y )f (μX , μY ) E log f (μX , Y )f (X , μY )
= ϕCov(X , Y ).
Definition 4.1 The entropy covariance between X and Y in (4.3) is defined by ECov(X , X ) = ϕCov(X , Y ).
(4.4)
Let fX (x) and fY (y) be the marginal density functions of X and Y, respectively. Since ¨ f (x, y) f (X , Y )f (μX , μY ) = f (x, y)log dxdy E log f (μX , Y )f (X , μY ) fX (x)fY (y) ¨ fX (x)fY (y) dxdy ≥ 0, (4.5) + fX (x)fY (y)log f (x, y) it follows that ϕCov(X , Y ) ≥ 0. From the Cauchy inequality, we have 0 ≤ ϕCov(X , Y ) < |ϕ| Var(X ) Var(X ).
(4.6)
Dividing the second term by the third one in (4.6), we have the following definition:
94
4 Analysis of Continuous Variables
Definition 4.2 In model (4.3), the entropy correlation coefficient (ECC) is defined by ECorr(X , Y ) = √
|Cov(X , Y )| . √ Var(X ) Var(X )
(4.7)
From (4.5) √ √ and (4.6), the upper limit of the KL information (4.5) is |ϕ| Var(X ) Var(X ). Hence, ECC (4.7) can be interpreted as the ratio of the information induced by the correlation between X and Y in entropy. As shown in (4.7), ECorr(X , Y ) = |Corr(X , Y )|. Remark 4.1 The above discussion has been made in a framework of the RC(1) association model in Sect. 2.6. In the association model, ϕ is assumed to be positive for scores assigned to categorical variables X and Y. On the other hand, in the normal model (4.2), if Cov(X , Y ) < 0, then, ϕ < 0. From the above remark, we make the following definition of the entropy variance: Definition 4.3 The entropy variances of X and Y in (4.3) are defined, respectively, by EVar(X ) = |ϕ|Cov(X , X )(= |ϕ|Var(X )) and EVar(Y ) = |ϕ|Cov(Y , Y )(= |ϕ|Var(Y )).
(4.8)
By using (4.4) and (4.8), ECC (4.7) can also be expressed as ECorr(X , Y ) = √
ECov(X , X ) . √ EVar(X ) EVar(X )
Since the entropy covariance (4.4) is the KL information between model f (x, y) and independent model fX (x)fY (y), in which follows, for simplicity of discussion, let us set ¨ f (x, y) dxdy KL(X , Y ) = f (x, y)log fX (x)fY (y) ¨ fX (x)fY (y) dxdy + fX (x)fY (y)log f (x, y) and let ρ = Corr(X , Y ).
4.2 Correlation Coefficient and Entropy
95
Calculating (4.5) in normal distribution model (4.2), since σ
ϕ=
12 ρ 1 σ12 σ12 σ11 σ22 = = ×√ . = 2 2 σ || σ11 σ22 − σ12 σ12 1−ρ σ11 σ22 1 − σ1112σ22
we have KL(X , Y ) = ϕσ12 =
ρ σ12 ρ2 × = . √ 1 − ρ2 σ11 σ22 1 − ρ2
Next, the distribution of the estimator of KL(X , Y ) is considered. Let (X1 , Y1 ), (X2 , Y2 ), . . . , (Xn , Yn ) be random samples from the bivariate normal distribution with mean vector μ = (μ1 , μ2 )T and variance–covariance matrix (4.1). Then, the ML estimator of correlation coefficient ρ is given by n i=1 Xi − X Yi − Y r= 2 n 2 . n X Y − X − Y i i i=1 i=1 Then, the following statistic is the non-central F distribution with degrees 1 and n − 2:
F = (n − 2)KL(X , Y ),
(4.9)
where
KL(X , Y ) =
r2 . 1 − r2
(4.10)
Hence, F statistic is proportional to the estimator of the KL information. In order to test the hypotheses H0 :ρ = 0 versus H1 :ρ = 0, we have the following t statistic with degrees of freedom n − 2: t=
√ r n − 2√ . 1 − r2
(4.11)
Then, the squared t statistic is proportional to the KL information (4.10), i.e.,
t 2 = F = (n − 2)KL(X , Y ). Remark 4.2 Let X be an explanatory variable and let Y be a response variable in bivariate normal distribution model (4.2). Then, the conditional distribution of Y given X can be expressed in a GLM formulation and it is given by
96
4 Analysis of Continuous Variables
σ
− μX )(y − μY ) − 21 σσ2211 (x − μX )2 σ22 1 − ρ 2 (y − μY )2 . − 2σ22 1 − ρ 2
1 fY (y|x) = exp 2π σ22 1 − ρ 2
σ11 (x 12
In the GLM framework (3.9), it follows that θ=
σ22 σ12 (x − μX ), a(ϕ) = σ22 1 − ρ 2 , b(θ ) = (x − μX )2 , σ11 2σ11 c(y, ϕ) = −
(y − μY )2 . 2σ22 1 − ρ 2
Then, we have ECC for the above GLM as
2 σ12 Cov Y , σσ1211 X Cov(Y , θ ) σ11 = Corr(Y , θ ) = √ √
= √ σ2 √ Var(Y ) Var(θ ) σ12 σ22 σ1211 Var(Y ) Var σ11 X 2 σ12 =√ √ = |ρ|. σ22 σ11 Hence, ECCs based on the RC(1) association model approach (4.3) and the above GLM approach are the same. In statistic (4.9), for large sample size n, (4.9) is asymptotically distributed according to the non-central chi-square distribution with degrees of freedom 1 and non-centrality λ = (n − 2)
ρ2 . 1 − ρ2
To give an asymptotic distribution of (4.10), the following theorem [14] is used. Theorem 4.1 Let χ 2 be distributed according to a non-central chi-square distribution with non-central parameter λ and the degrees of freedom ν; and let us set c=ν+
λ2 λ and ν = ν + . ν+λ ν + 2λ
Then, the following statistic χ2 =
χ 2 c
(4.12)
4.2 Correlation Coefficient and Entropy
97
is asymptotically distributed according to the chi-square distribution with degrees of freedom ν . From the above theorem, statistic (4.9) is asymptotically distributed according to the non-central chi-square distribution with degrees of freedom 1 and non-centrality λ = (n − 2)
ρ2 . 1 − ρ2
Let ν = 1, c = 1 +
λ2 λ and ν = 1 + . 1+λ 1 + 2λ
(4.13)
Then, for large sample size n, statistic
F (n − 2)KL(X , Y ) = c c
(4.14)
is asymptotically distributed according to the chi-square distribution with degrees of freedom ν . Hence, statistic (4.14) is asymptotically normal with mean ν and variance 2ν . From (4.13), for large n we have ρ (n − 2) 1−ρ 2 2
c =1+
ν = 1 +
ρ 1 + (n − 2) 1−ρ 2 2 2 ρ (n − 2)2 1−ρ 2 2
ρ2
1 + 2(n − 2) 1−ρ 2
≈ 2, ρ2 from r 2 ≈ ρ 2 . ≈ (n − 2) 2 2 1−ρ
Normalizing the statistic (4.12), we have
ZKL
ρ KL(X , Y ) − F − (n − 2) 1−ρ √ − ν 2 n − 2 = √ = = ρ2 ρ2 2ν 2 1−ρ 2 (n − 2) 1−ρ 2 2 2
F c
ρ2 1−ρ 2
.
(4.15)
The above statistic is asymptotically distributed according to the standard normal distribution. By using the following asymptotic property:
ZKL
KL(X , Y ) − √ = n−2 ρ2 2 1−ρ 2
ρ2 1−ρ 2
√ KL(X , Y ) − KL(X , Y ) ≈ n−2 , r2 2 1−r 2
we have the following asymptotic standard error (SE):
98
4 Analysis of Continuous Variables
SE = √
r2 , 1 − r2
2 n−2
ρ From this, the asymptotic confidence intervals of KL(X , Y ) = 1−ρ 2 can be constructed, for example, a 100(1 − α)% confidence interval of it is given by 2
2 α ×√ KL(X , Y ) − z 1 − 2 n−2
r2 1 − r2
2 α < KL(X , Y ) < KL(X , Y ) + z 1 − ×√ 2 n−2
r2 , 1 − r2
(4.16)
where z 1 − α2 is the upper 100 1 − α2 % point of the standard normal distribution. Similarly, by using (4.16), a 100(1 − α)% confidence interval of ρ 2 can be constructed. From the above result, the asymptotic confidence interval is given by
A1 < ρ2 < 1 + A1
A2 , 1 + A2
(4.17)
where 2 α ×√ A1 = KL(X , Y ) − z 1 − 2 n−2
r2 , 1 − r2 2 r2 α ×√ A2 = KL(X , Y ) + z 1 − 2 n − 2 1 − r2
Concerning estimator r, the following result is obtained [1]. Theorem 4.2 In normal distribution (4.2) with variance–covariance matrix (4.1), the asymptotic distribution of statistic Z=
√
n−1
r−ρ 1 − ρ2
(4.18)
is the standard normal distribution. The statistic (4.18) is refined in view of the normality. The Fisher Z transformation [5] is given by ZFisher =
1+r 1 log 2 1−r
(4.19)
4.2 Correlation Coefficient and Entropy
99
Theorem 4.3 The Fisher Z transformation (4.19) is asymptotically normally and variance n − 1. distributed with mean 21 log 1+ρ 1−ρ Proof By the Taylor expansion of ZFisher at ρ, for large sample size n we have ZFisher
1 1 1+ρ 1 1 ≈ log + + (r − ρ) 2 1−ρ 2 1+ρ 1−ρ 1 1+ρ 1 = log + (r − ρ). 2 1−ρ 1 − ρ2
From Theorem 4.2, the above statistic is asymptotically distributed according to and variance n − 1. the normal distribution with mean 21 log 1+ρ 1−ρ In Theorem 4.3, a better approximation of the variance of ZFisher is given by n − 3 [1]. By using the result, an asymptotic 100(1 − α)% confidence interval is computed as exp(B2 ) − 1 exp(B1 ) − 1 0. The above model is similar to the RC(1) association model. In the model, ECC is calculated as follows: Cov(μ(X), ν(Y )) . ECorr(X, Y ) = √ √ Var(μ(X)) Var(Y ) Remark 4.3 The ECC cannot be directly defined in the case where X and Y are ran T T dom vectors, i.e., X = X1 , X2 , . . . , Xp (p ≥ 2) and Y = Y1 , Y2 , . . . , Yq (q ≥ 2). The discussion is given in Sect. 4.5. Like the RC(1) association model, the ECC is positive. In matrix (4.21), let XX·Y = XX − σ XY
1 σ Y X and σYY ·X = σYY − σ Y X −1 XX σ XY . σYY
For a normal distribution with mean vector 0 and variance–covariance matrix (4.21), since f (x, y) =
1 (2π )
p+1 2
1
[] 2
−1 1 T XX σ XY x × exp − x , y , σ Y X σYY y 2 in model (4.24) we set
(4.25)
4.3 The Multiple Correlation Coefficient
103
p+1 1 λ = −log (2π ) 2 [] 2 , 1 1 −1 2 λX (x) = − xT −1 XX·Y x, λY (y) = − σYY ·X y , 2 2 −1 μ(x) = σYY σ Y X −1 XX·Y x, ν(y) = y, φ = 1.
From the above formulation, we have (4.22). The ECC for the normal distribution (4.25) is the multiple correlation coefficient. Third, in the GLM framework (3.9), ECC in the normal distribution is discussed. Let f (y|x) be the conditional density function of Y given X. Assuming the joint distribution of X and Y is normal with mean vector 0 and variance–covariance matrix (4.21). Then, f (y|x) is given as the following GLM expression: f (y|x) =
(2π )
× exp
p+1 2
1 σYY − σ Y X −1 XX σ XY
yβx − 21 (βx)2
σYY − σ Y X −1 XX σ XY
−
1 2 y 2
σYY − σ Y X −1 XX σ XY
,
where β = −1 XX σ XY In (3.9), setting θ = βx, a(ϕ) = σYY − σ Y X −1 XX σ XY , b(θ ) = c(y, ϕ) = −
1 2 y 2
σYY − σ Y X −1 XX σ XY
1 2 θ , and 2
,
(4.26)
the ECC between X and Y is given by Cov(βX, Y ) = ECorr(X, Y ) = Corr(βX, Y ) = √ √ Var(βX) Var(Y )
σ Y X −1 XX σ XY σYY
In this case, ECCs in the RC(1) association model and the GLM frameworks are equal to the multiple correlation coefficient. In this sense, the ECC can be called the entropy multiple correlation coefficient between X and Y. The entropy coefficient of determination (ECD) is calculated. For (4.26), we have
104
4 Analysis of Continuous Variables
ECD(X, Y ) = =
σ Y X −1 σ XY Cov(βX, Y ) XX = −1 Cov(βX, Y ) + a(ϕ) σ Y X XX σ XY + σYY − σ Y X −1 XX σ XY σ Y X −1 XX σ XY = ECorr(X, Y )2 . σYY
In the normal case, the above equation holds true. Below, a more general discussion on ECC and ECD is provided.
4.4 Partial Correlation Coefficient Let X, Y, and Z be random variables that have the joint normal distribution with the following variance–covariance matrix: ⎞ σX2 σXY σXZ = ⎝ σYX σY2 σYZ ⎠, σZX σZY σZ2 ⎛
Then, the conditional distribution of X and Y, given Z, has the following variance– covariance matrix: 2 1 σXZ σX σXY − 2 XY ·Z = σZX σZY 2 σYX σY σZ σYZ
2 σ σ XZ ZX σX − σ 2 σXY − σXZσσ2 ZY Z Z = . σYX − σYZσσ2ZX σY2 − σYZσσ2ZY Z
Z
From the above result, the partial correlation coefficient ρXY ·Z between X and Y given Z is ρXY ·Z =
σXY − σX2
−
σXZ σZX σZ2
σXZ σZY σZ2
σY2
−
σYZ σZY σZ2
=
ρXY − ρXZ ρYZ . 2 2 1 − ρXZ 1 − ρYZ
With respect to the above correlation coefficient, the present discussion can be directly applied. Let fXY (x, y|z) be the conditional density function of X and Y given Z = z; let fX (x|z) and fY (y|z) be the conditional density functions of X and Y given Z = z, respectively, and let fZ (z) be the marginal density function of Z. Then, we have
4.4 Partial Correlation Coefficient
105
˚
fXY (x, y|z) dxdydz fX (x|z)fY (y|z) ¨ 2 ρXY fX (x|z)fY (y|z) ·Z dxdy = + fX (x|z)fY (y|z)fZ (z) log . 2 fXY (x, y|z) 1 − ρXY ·Z (4.27)
KL(X , Y |Z) =
fXY (x, y|z)fZ (z) log
Thus, the partial (conditional) ECC and ECD of X and Y given Z are computed as follows: ECorr(X , Y |Z) = |ρXY ·Z |, ECD(X , Y |Z) =
KL(X , Y |Z) 2 = ρXY ·Z. KL(X , Y |Z) + 1
T T For normal random vectors X = X1 , X2 , . . . , Xp , Y = Y1 , Y2 , . . . , Yq , and Z = (Z1 , Z2 , . . . , Zr )T , the joint distribution is assumed to be the multivariate normal with the following variance–covariance matrix: ⎞ XX XY XZ = ⎝ YX YY YZ ⎠. ZX ZY ZZ ⎛
From the above matrix, the partial variance–covariance matrix of X and Y given Z is calculated as follows: XX XY XZ XX·Z XY·Z = − Σ −1 (X,Y)·Z ≡ ZZ ZX ZY . YX·Z YY·Z YX YY YZ Let the inverse of the above matrix be expressed as −1 (X,Y)·Z
=
XX·Z XY·Z . YX·Z YY·Z
Then, applying (4.27) to this multivariate case, we have KL(X, Y|Z) = −tr XY·Z YX·Z . Hence, the partial ECD of X and Y given Z is calculated as follows: ECD(X, Y|Z) =
−tr XY·Z YX·Z . −tr XY·Z YX·Z + 1
Example 4.2 Let X = (X1 , X2 )T , Y = (Y1 , Y2 )T , and Z = (Z1 , Z2 )T be random vectors that have the joint normal distribution with the variance–covariance matrix:
106
4 Analysis of Continuous Variables
⎛
1 0.8 ⎜ 0.8 1 ⎜ ⎜ ⎜ 0.6 0.5 =⎜ ⎜ 0.5 0.6 ⎜ ⎝ 0.6 0.5 0.7 0.4
0.6 0.5 1 0.5 0.7 0.6
0.5 0.6 0.5 1 0.4 0.5
0.6 0.7 0.5 0.4 0.7 0.6 0.4 0.5 1 0.8 0.8 1
⎞ ⎟ ⎟ ⎟ ⎟ ⎟. ⎟ ⎟ ⎠
In this case, we have ⎛
(X,Y)·Z
⎞ 1 0.8 0.6 0.5 ⎜ 0.8 1 0.5 0.6 ⎟ ⎟ =⎜ ⎝ 0.6 0.5 1 0.5 ⎠ 0.5 0.6 0.5 1 ⎛ ⎞ 0.6 0.7 ⎜ 0.5 0.4 ⎟ 1 0.8 −1 0.6 0.5 0.7 0.4 ⎜ ⎟ −⎝ 0.7 0.4 0.6 0.5 0.7 0.6 ⎠ 0.8 1 0.4 0.5 ⎛ ⎞ 0.506 0.5 0.156 0.15 ⎜ 0.5 0.75 0.15 0.4 ⎟ ⎟ =⎜ ⎝ 0.156 0.15 0.506 0.2 ⎠, ⎛
−1 (X,Y)·Z
0.15 0.4
0.2 0.75
7.575 −5.817 ⎜ −5.817 6.345 =⎜ ⎝ −1.378 0.879 1.955 −2.455
−1.378 0.879 2.479 −0.854
⎞ 1.955 −2.455 ⎟ ⎟. −0.854 ⎠ 2.479
From this, we compute
tr XY·Z
YX·Z
0.156 0.15 = tr 0.15 0.4
5.175 −4.825 −5.478 4.522
= −0.771.
Hence, ECD(X, Y|Z) =
0.771 = 0.435. 0.771 + 1
Remark 4.5 The partial variance–covariance matrix (X,Y)·Z can also be calculated by using the inverse of . Let ⎛
−1
⎞ XX XY XZ = ⎝ YX YY YZ ⎠. ZX ZY ZZ
4.4 Partial Correlation Coefficient
107
Then, we have (X,Y)·Z =
XX XY YX YY
−1 .
4.5 Canonical Correlation Analysis T T Let X = X1 , X2 , . . . , Xp and Y = Y1 , Y2 , . . . , Yq be two random vectors for p < q. Without loss of generality, we set E(X) = 0 and E(Y) = 0. Corresponding to the random vectors, the variance–covariance matrix of X and Y is divided as =
XX XY , YX YY
(4.28)
where XX = E XX T , XY = E XY T , YX = E YX T , and YY = E YY T . T T For coefficient vectors a = a1 , a2 , . . . , ap and b = b1 , b2 , . . . , bp , we determine V1 = aT X and W1 = bT Y that maximize their correlation coefficient under the following constraints: Var(V1 ) = aT XX a = 1, Var(W1 ) = bT YY b = 1. For Lagrange multiplier λ and μ, we set h = aT XY b −
λ T μ a XX a − bT YY b. 2 2
Differentiating the above function with respect to a and b, we have ∂h ∂h = XY b − λ XX a = 0, = YX a − μ YY b = 0. ∂a ∂b Applying constraints (4.29) to the above equations, we have aT XY b = λ = μ Hence, we obtain
λ XX − XY − YX μ YY
a = 0. b
(4.29)
108
4 Analysis of Continuous Variables
Since a = 0 and b = 0, It follows that λ XX − XY − YX λ YY = 0. Since λ XX − XY λ XX 0 = − YX λ YY 0 λ YY − 1 YX −1 XY XX λ 1 −1 = |λ XX |λ YY − YX XX XY λ 2 p−q = λ | XX | λ YY − YX −1 XX XY = 0, we have 2 λ YY − YX −1 XY = 0. XX
(4.30)
−1 From this, λ2 is the maximum eigenvalue of YX −1 XX XY YY and the coefficient vector b is the eigenvector corresponding to the eigenvalue. Equation (4.30) has p non-negative eigenvalues such that
1 ≥ λ21 ≥ λ22 ≥ · · · ≥ λ2p ≥ 0.
(4.31)
Since the positive square roots of these are given by 1 ≥ λ1 ≥ λ2 ≥ · · · ≥ λp ≥ 0, the maximum correlation coefficient is λ1 . Similarly, we have 2 λ XX − XY −1 YX = 0. YY
(4.32)
From the above equation, we get the roots in (4.31) and q − p zero roots, and the −1 2 coefficient vector a is the eigenvector of XY −1 YY YX XX , corresponding to λ1 . Let us denote the obtained coefficient vectors a and b as a(1) and b(1) , respectively. Then, the pair of variables V1 = aT1 X and W1 = bT1 Y is called the first pair of canonical variables. Next, we determine V2 = aT(2) X and W2 = bT(2) Y that make the maximum correlation under the following constraints:
4.5 Canonical Correlation Analysis
109
aT(2) XX a(1) = 0, aT(2) XY b(1) = 0, bT(2) YX a(1) = 0, bT(2) YY b(1) = 0, aT(2) XX a(2) = 1, bT(2) YY b(2) = 1. By using a similar discussion that derives the first pair of canonical variables, we can derive Eqs. (4.30) and (4.32). Hence, the maximum correlation coefficient satisfying the above constraints is given by λ2 , and the coefficient vectors a(2) −1 and b(2) are, respectively, obtained as the eigenvectors of XY −1 YY YX XX and −1 −1 2 YX XX XY YY , corresponding to λ2 . Similarly, let a(i) and b(i) be the eigenvec−1 −1 −1 tors of XY −1 YY YX XX and YX XX XY YY corresponding to the eigenvalues 2 T λi , i ≥ 3. Then, the pair of variables Vi = a(i) X and Wi = bT(i) Y gives the maximum correlation under the following constraints: aT(i) XX a(j) = 0, aT(i) XY b(j) = 0, bT(i) YX a(j) = 0, bT(i) YY b(j) = 0, i > j; aT(i) XX a(i) = 1, bT(i) YY b(i) = 1. As a result, we can get aT(i) XX a(j) = 0, aT(i) XY b(j) = 0, bT(i) YY b(j) = 0, i = j; and Corr(Vi , Wi ) = Corr aT(i) X, bT(i) Y = aT(i) XY b(i) = λi , i = 1, 2, . . . , p. Inductively, we can decide bT(q+k) such that bT(p+k) YY b(1) , b(2) , . . . , b(p) , . . . , b(p+k−1) = 0, bT(p+k) YY b(p+k) = 1, k = 1, 2, . . . , q − p. Let us set
T V = aT(1) X, aT(2) X, . . . , aT(p) X ,
T
T W (1) = bT(1) Y, bT(2) Y, . . . , bT(p) Y , W (2) = bT(p+1) Y, bT(p+2) Y, . . . , bT(q) Y , A = a(1) , a(2) , . . . , a(p) , B(1) = b(1) , b(2) , . . . , b(p) , B(2) = b(p+1) , b(p+2) , . . . , b(q) . (4.33) Then, the covariance matrix of V, W (1) and W (2) is given by
110
4 Analysis of Continuous Variables
⎞ Cov V T , W T(1) Cov V T , W T(1) Var V T ⎜ ⎟ Var V T , W T(1) , W T(2) = ⎝ Cov V T , W T(1) Var W T(1) Cov W T(1) , W T(1) ⎠ T Var W T(1) Cov V , W T(1) Cov W T(1) , W T(1) ⎞ ⎛ T A XX A AT XY B(1) AT XY B(2) = ⎝ BT(1) YX A BT(1) YY B(1) BT(1) XY B(2) ⎠ BT(2) YX A BT(2) YX B(1) BT(2) YY B(2) ⎛ ⎞ ⎡ ⎤ λ1 0 ⎜ ⎟ ⎢ ⎥ λ2 ⎜ ⎟ ⎢ ⎥ ⎜ ⎟ Ip 0 ⎢ ⎥ .. ⎜ ⎟ ⎣ ⎦ . ⎜ ⎟ 0 ⎜ ⎟ λp ⎜⎡ ⎟ ⎤ ⎜ ⎟ = ⎜ λ1 ⎟ ≡ ∗, ⎜⎢ ⎟ 0 ⎥ ⎜⎢ ⎟ λ2 ⎥ ⎜⎢ ⎟ I 0 ⎥ p ⎜⎣ ⎟ .. ⎦ ⎜ ⎟ . 0 ⎜ ⎟ ⎝ ⎠ λp ⎛
0
0
I q−p (4.34)
where I p and I q−p are the p- and (q − p)-dimensional identity matrix, respectively. Then, the inverse of the above variance–covariance matrix is computed by ⎛⎡
⎤⎡
1 1−λ21
⎜⎢ ⎜⎢ ⎜⎢ ⎜⎢ ⎜⎢ ⎜⎢ ⎜⎢ ⎜⎢ ⎜⎢ ⎜⎣ 0 ⎜ ⎜ ⎜⎡ ⎜ −λ1 ∗−1 ⎜ 1−λ 2 ⎜⎢ 1 ⎜⎢ ⎜⎢ ⎜⎢ ⎜⎢ ⎜⎢ ⎜⎢ ⎜⎢ ⎜⎢ ⎜⎣ 0 ⎜ ⎜ ⎝
0 1 1−λ22
..
. 1 1−λ2p
0 −λ2 1−λ22
..
. −λp 1−λ2p
⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎦⎣ ⎤⎡ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎦⎣
⎤
−λ1 1−λ21
0 −λ2 1−λ22
..
.
0
−λp 1−λ2p
1 1−λ21
0 From this, we have the following theorem:
0 1 1−λ22
..
.
0 1 1−λ22
0
⎞
⎟ ⎥ ⎟ ⎥ ⎟ ⎥ ⎟ ⎥ ⎟ ⎥ ⎟ 0 ⎥ ⎟ ⎥ ⎟ ⎥ ⎟ ⎥ ⎟ ⎦ ⎟ ⎟ ⎟ ⎤ ⎟ ⎟ ⎟ ⎥ ⎟ ⎥ ⎟ ⎥ ⎟ ⎥ ⎟ ⎥ ⎥ 0 ⎟ ⎟ ⎥ ⎟ ⎥ ⎟ ⎥ ⎟ ⎦ ⎟ ⎟ ⎠ I q−p
(4.35)
4.5 Canonical Correlation Analysis
111
T Theorem 4.5 Let us assume the joint distribution of X = X1 , X2 , . . . , Xp and T Y = Y1 , Y2 , . . . , Yq is the multivariate normal distribution with variance–covariance matrix (4.28) and let (Vi , Wi ) = aT(i) X, bT(i) Y , i = 1, 2, . . . , p
(4.36)
be the p pairs of canonical variates defined in (4.33). Then, KL(X, Y) =
p i=1
λ2i . 1 − λ2i
T Proof The joint distribution of random vectors V and W T(1) , W T(2) is the multivariate normal distribution with variance–covariance matrix (4.34) and the inverse of (4.34) is given by (4.35). From this, we have ⎛
KL V , W T(1) , W T(2)
T
⎜ ⎜ = −tr⎜ ⎝
=
⎞
λ1 λ2 0
p
λ2i
i=1
1 − λ2i
0 ..
. λp
⎛
⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎠⎜ ⎝
⎞
λ1 − 1−λ 2 1
λ2 − 1−λ 2 2
0
0 ..
.
λ
− 1−λp 2
⎟ ⎟ ⎟ ⎟ ⎟ ⎠
p
.
Since T , KL(X, Y) = KL V , W T(1) , W T(2)
the theorem follows.
Remark 4.6 Let the inverse of the variance–covariance matrix (4.28) be denoted as −1 =
XX XY . YX YY
Then, we have KL(X, Y) = −tr XY YX .
(4.37)
From Theorem 4.5, we obtain −tr XY
YX
=
p i=1
λ2i . 1 − λ2i
(4.38)
112
4 Analysis of Continuous Variables
From the p pairs of canonical variates (4.36), it follows that KL(Vi , Wi ) =
λ2i , i = 1, 2, . . . , p. 1 − λ2i
(4.39)
From (4.37), (4.38), and (4.39), we have KL(X, Y) =
p
KL(Vi , Wi ).
i=1
Next, the entropy correlation coefficient introduced in the previous section is extended for measuring the association between random vectors X = T T X1 , X2 , . . . , Xp and Y = Y1 , Y2 , . . . , Yq , where p ≤ q. Let f (x, y) be the joint density function of X and Y and let g(v, w) be the joint density function of canonical T random vectors V and W = W T(1) , W T(2) in (4.33). Then, the joint density function of X and Y is expressed as follows: −1 f (x, y) = g(v, w)|A|−1 B(1) , B(2) ⎛ ⎞ p p q wj2 vi2 + wi2 1 ⎝ ⎠|A|−1 B(1) , B(2) −1 ,
− = vi φi wi − p+q 1 exp 2 2 ∗ (2π ) 2 2 i=1 i=1 2 1 − λi j=p+1
where φi =
λi , i = 1, 2, . . . , p. 1 − λ2i
The above formulation is similar to the RC(p) association model explained in Chap. 2. Assuming the means of X and Y are zero vectors, we have the following log odds ratio: g(v, w)g(0, 0) f (x, y)f (0, 0) = log = vi φi wi . f (x, 0)f (0, y) g(v, 0)g(0, w) i=1 p
log
As discussed in Chap. 2, we have the entropy covariance between X and Y as ECov(X, Y) =
p
φi Cov(vi , wi ) =
i=1
p
φi λ i =
i=1
p i=1
λ2i . 1 − λ2i
and ECov(X, X) =
p i=1
φi Cov(vi , vi ) =
p i=1
φi =
p i=1
λi = ECov(Y, Y). 1 − λ2i
4.5 Canonical Correlation Analysis
113
Then, we have 0 ≤ ECov(X, Y) ≤
ECov(X, X) ECov(Y, Y).
From the above results, the following definition can be made: Definition 4.3 The entropy correlation coefficient between X and Y is defined by p i=1 ECov(X, Y) ECorr(X, Y) = √ = p √ ECov(X, X) ECov(Y, Y) i=1
λ2i 1−λ2i λi 1−λ2i
.
(4.40)
For p = q = 1, (4.40) is the absolute value of the simple correlation coefficient, and p = 1 (4.40) becomes the multiple correlation coefficient. Finally, in a GLM framework, the entropy coefficient of determination is considered for the multivariate normal distribution. Let f (y|x) be the conditional density function of response variable vector Y given explanatory variable vector X. Then, we have 1 1 T −1 f (y|x) = q 1 exp − (y − μY (x)) YY·X (y − μY (x)) , 2 (2π ) 2 | YY·X | 2 where E(X) = μX , E(Y) = μY , μY (x) = μY + YX −1 XX (x − μX ), YY·X = YY − YX −1 XX XY . Let ⎧ −1 ⎪ ⎨ θ = YY·X μY (x), a(ϕ) = 1, b(θ ) = 21 YY·X θ T θ,
⎪ ⎩ c(y, ϕ) = − 1 yT −1 y − log (2π ) 2q | YY·X | 21 . 2
(4.41)
YY·X
Then, we have 1 1 T −1 T −1 T −1 μ y exp y μ μ y − − (x) (x) (x) 1 YY·X Y YY·X Y YY·X 2 Y 2 (2π ) | YY·X | 2 T y θ − b(θ ) (4.42) + c(y, ϕ) . = exp a(ϕ)
f (y|x) =
1
q 2
As shown above, the multivariate linear regression under the multivariate normal distribution is expressed by a GLM with (4.41) and (4.42). Since
114
4 Analysis of Continuous Variables
Cov(Y, θ ) −1 −1 = Cov Y, −1 YY·X μY (X) = tr YY·X YX XX XY a(ϕ) −1 = tr YY − YX −1 YX −1 XX XY XX XY ,
KL(X, Y) =
it follows that −1 YX −1 tr YY − YX −1 KL(X, Y) XX XY XX XY = ECD(X, Y) = . −1 −1 −1 KL(X, Y) + 1 tr YY − YX XX XY YX XX XY + 1 Since KL(X, Y) is invariant for any one-to-one transformations of X and Y, we have KL(X, Y) = KL(V , W ) =
p i=1
λ2i , 1 − λ2i
T where V and W = W T(1) , W T(2) are given in (4.33). Hence, it follows that p λ2i i=1 1−λ2i KL(V , W ) = ECD(X, Y) = . λ2i p KL(V , W ) + 1 2 + 1
(4.43)
i=1 1−λi
The above formula implies that the effect of explanatory variable vector X on λ2 response variable vector Y is decomposed into p canonical components 1−λi 2 , i = i 1, 2, . . . , p. From the above results (4.40) and (4.43), the contribution ratios of canonical variables (Vi , Wi ) to the association between random vectors X and Y can be evaluated by λ2i
λ2i
KL(Vi , Wi ) 1−λ2i 1−λ2i = CR(Vi , Wi ) = = . YX λ2j p KL(X, Y) −tr XY 2
(4.44)
j=1 1−λj
Example 4.3 Let X = (X1 , X2 )T and Y = (Y1 , Y2 , Y3 )T be random vectors that have the joint normal distribution with the following correlation matrix: ⎛
XX XY YX YY
⎜ ⎜ ⎜ =⎜ ⎜ ⎝
The inverse of the above matrix is
1 0.7 0.4 0.6 0.4
0.7 1 0.7 0.6 0.5
0.4 0.7 1 0.8 0.7
0.6 0.4 0.6 0.5 0.8 0.7 1 0.6 0.6 1
⎞ ⎟ ⎟ ⎟ ⎟. ⎟ ⎠
4.5 Canonical Correlation Analysis
115
⎛
XX XY YX YY
−1
=
XX XY YX YY
⎜ ⎜ ⎜ =⎜ ⎜ ⎝
3.237 −2.484 2.568 −2.182 −0.542
−2.484 3.885 −3.167 1.467 0.394
2.569 −2.182 −3.167 1.457 6.277 −3.688 −3.688 4.296 −1.626 0.148
−0.542 0.394 −1.626 0.148 2.069
⎞ ⎟ ⎟ ⎟ ⎟. ⎟ ⎠
From this, we obtain XY YX =
⎛
⎞ 2.569 −3.167 0.4 0.6 0.4 ⎝ −0.498 −0.235 . −2.182 1.457 ⎠ = 0.7 0.6 0.5 0.218 −1.146 −0.542 0.394
From the above result, it follows that ECD(X.Y) =
1.644 −tr XY YX = 0.622. = YX 1.644 +1 −tr XY + 1
By the singular value decomposition, we have
−0.498 −0.235 0.218 −1.146
=
0.147 −0.989 0.989 0.147
3.381 0 0 0.705
0.121 −0.993 . 0.993 0.121
From this, the two sets of canonical variables are given as follows: V1 = 0.147X1 + 0.989X2 , W1 = 0.121Y1 − 0.993Y2 , V2 = −0.989X1 + 0.147X2 , W2 = 0.993Y1 + 0.121Y2 . From (4.44), the contribution ratios of the above two sets of canonical variables are computed as follows: 3.381 = 0.716, 3.381 + 0.705 0.705 CR(V2 , W2 ) = = 0.284. 3.381 + 0.705 CR(V1 , W1 ) =
4.6 Test of the Mean Vector and Variance–Covariance Matrix in the Multivariate Normal Distribution Let X 1 , X 2 , . . . , X n be random samples from the p-variate normal distribution with mean vector μ and variance–covariance matrix Σ. First, we discuss the following statistic:
116
4 Analysis of Continuous Variables
T∗
2
T
= nX S−1 X,
(4.45)
where T 1 1 Xi − X Xi − X . Xi, S = n i=1 n − 1 i=1 n
X=
n
Theorem 4.6 Let f (x|μ, Σ) be the multivariate normal density function with mean vector μ and variance–covariance matrix Σ. Then, it follows that " f (x|μ, Σ)log
f (x|μ, Σ) dx + f (x|0, Σ)
" f (x|0, Σ)log
f (x|0, Σ) dx = μT −1 μ. f (x|μ, Σ) (4.46)
Proof Since log
f (x|μ, Σ) 1 1 1 = − (x − μ)T −1 (x − μ) + xT −1 x = xT −1 μ − μT −1 μ, f (x|0, Σ) 2 2 2
we have "
1 f (x|μ, Σ) dx = μT −1 μ, f (x|0, Σ) 2 " 1 T −1 f (x|0, Σ) dx = μ μ. f (x|0, Σ)log f (x|μ, Σ) 2 f (x|μ, Σ)log
Hence, the theorem follows.
From the above theorem, statistics (4.45) is the ML estimator of the KL information (4.46) multiplied by sample size n − 1. With respect to the statistic (4.45), we have the following theorem [1]. n−p Theorem 4.7 For the statistic (4.45), p(n−1) (T ∗ )2 is distributed according to the non-central F distribution with degrees of freedom p and n − p. The non-centrality parameter is μT Σ −1 μ.
In the above theorem, the non-centrality parameter is the KL information (4.46). The following statistic is called the Hotelling’s T 2 statistic [11]: T T 2 = n X − μ0 S−1 X − μ0 . Testing the mean vector of the multivariate normal distribution, i.e.,
(4.47)
4.6 Test of the Mean Vector and Variance-Covariance Matrix …
117
H0 :μ = μ0 versus H1 :μ = μ1 , from Theorem 4.6, we have D(f (x|μ1 , Σ)||f (x|μ0 , Σ)) + D(f (x|μ0 , Σ)||f (x|μ1 , Σ)) = (μ1 − μ0 )T −1 (μ1 − μ0 ). By substituting μ1 and for X and S in the above formula, respectively, we have T T 2 = n X − μ0 S−1 X − μ0 . From Theorem 4.7, we have the following theorem: n−p Theorem 4.8 For the Hotelling’s T 2 statistic (4.47), p(n−1) T 2 is distributed according to the F distribution with degrees of freedom p and n−p under the null hypothesis H0 :μ = μ0 . Under the alternative hypothesis H1 :μ = μ1 , the statistic is distributed according to the F distribution with degrees of freedom p and n−p. The non-centrality is (μ1 − μ0 )T Σ −1 (μ1 − μ0 ).
Concerning the Hotelling’s T 2 statistic (4.47), the following theorem asymptotically folds true. Theorem 4.9 For sufficiently large sample size n, T 2 is asymptotically distributed according to non-central chi-square distribution with degrees of freedom p. The non-centrality is n(μ1 − μ0 )T Σ −1 (μ1 − μ0 ). Proof Let F=
n−p 2 T . p(n − 1)
For large sample size n, we have n−p 2 1 T ≈ T 2. p(n − 1) p From this, T 2 ≈ pF, and from Theorem 4.8, T 2 is asymptotically distributed according to the non-central chi-square distribution with degrees of freedom p. This completes the theorem. In comparison with the above discussion, let us consider the test for variance matrices, H0 :Σ = Σ 0 , versus H1 :Σ = Σ 0 . Then, the following KL information expresses the difference between normal distributions f (x|μ, Σ 0 ) and f (x|μ, Σ):
118
4 Analysis of Continuous Variables
D(f (x|μ, Σ)||f (x|μ, Σ 0 )) + D(f (x|μ, Σ 0 )||f (x|μ, Σ)) " f (x|μ, Σ) = f (x|μ, Σ)log dx f (x|μ, Σ 0 ) " f (x|μ, Σ 0 ) dx + f (x|μ, Σ 0 )log f (x|μ, Σ) 1 −1 = −p + tr Σ −1 . 0 + Σ 0Σ 2
(4.48)
For random samples X 1 , X 2 , . . . , X n , the ML estimator of μ and Σ is given by T 1 1 Xi, Σ = Xi − X Xi − X . n i=1 n i=1 n
X=
n
Then, for given Σ 0 , the ML estimators of D(f (x|μ, Σ)||f (x|μ, Σ 0 )) and D(f (x|μ, Σ 0 )||f (x|μ, Σ)) are calculated as
ˆ ||f x|X, ¯ ¯ 0 , D f (x|μ, )||f (x|μ, 0 ) = D f x|X,
ˆ . ¯ ¯ 0 ||f x||X, D f (x|μ, 0 )||f (x|μ, ) = D f x|X,
From this, the ML estimator of the KL information (4.48) is calculated as
D f (x|μ, )||f (x|μ, 0 ) + D f (x|μ, 0 ||f )(x|μ, )
−1 1 ˆ −1 ˆ = −p + tr 0 + 0 2
The above statistic is an entropy-based test statistic for H0 :Σ = Σ 0 , versus H1 :Σ = Σ 0 . With respect to the above statistics, we have the following theorem: Theorem 4.10 Let f (x|μ, Σ) be the multivariate normal density function with mean vector μ and variance–covariance matrix Σ. Then, for null hypothesis H0 :Σ = Σ 0 , the following statistic is asymptotically distributed according to the chi-square distribution with degrees of freedom 21 p(p − 1):
χ 2 = n D f (x|μ, ||f (x|μ, 0 ) + D f (x|μ, 0 ||f (x|μ, )
Proof Let X i , i = 1, 2, . . . , n be random samples from f (x|μ, Σ); let l(μ, Σ) be the following log likelihood function l(μ, Σ) =
n i=1
and let
logf (X i |μ, Σ),
4.6 Test of the Mean Vector and Variance-Covariance Matrix …
119
l μ, Σ = max l(μ, Σ), l μ, Σ 0 = max l(μ, Σ 0 ),
μ,
μ,Σ
where μ and Σ are the maximum likelihood estimators of μ and Σ, respectively, i.e., 2 1 1 Xi − X . Xi, Σ = n i=1 n i=1 n
μ=X=
n
Then, it follows that
n
f x |μ , i 1 1 log l μ, − l μ, 0 = n i=1 n f x|μ, 0 " f (x|μ, ) → f (x|μ, )log dx, f (x|μ, 0 )
f x|μ,
" dx D f (x|μ, )||f (x|μ, 0 ) = f x|μ, log f x|μ, 0 " f (x|μ, ) dx, → f (x|μ, )log f (x|μ, 0 )
" f x|μ, 0
D f (x|μ, 0 )||f (x|μ, ) = f x|μ, 0 log dx f x|μ, " f (x|μ, 0 ) dx → f (x|μ, 0 )log f (x|μ, )
in probability, as n → ∞. Under the null hypothesis, we have
"
f x|μ, Σ f x|μ, Σ 0
dx + o(n), dx = n f x|μ, Σ 0 log f x|μ, Σ log f x|μ, Σ 0 f x|μ, Σ
"
n
where o(n) → 0(n → ∞). n Hence, under the null hypothesis, we obtain
"
f x|μ, dx + o(n) 2 l μ, − l μ, 0 = 2n f x|μ, log f x|μ, 0
120
4 Analysis of Continuous Variables
⎛ "
f x|μ, dx = n⎝ f x|μ, log f x|μ, 0 ⎞ " f x|μ, 0
dx⎠ + o(n) + f x|μ, 0 log f x|μ,
= n D f (x|μ, ||f )(x|μ, 0 ) + D f (x|μ, 0 ||f )(x|μ, )
+ o(n).
This completes the theorem. By using the above result, we can test H0 :Σ = Σ 0 , versus H1 :Σ = Σ 0 .
4.7 Comparison of Mean Vectors of Two Multivariate Normal Populations Let p-variate random vectors X k be independently distributed according to normal distribution with mean vectors and variance–covariance matrices μk and Σ k , and let f (xk |μk , Σ k ) be the density functions, k = 1, 2. Then, X 1 − X 2 is distributed according to f (x|μ1 − μ2 , Σ 1 + Σ 2 ) and we have "
f (x|μ1 − μ2 , Σ 1 + Σ 2 ) dx f (x|μ1 − μ2 , Σ 1 + Σ 2 )log f (x|0, Σ 1 + Σ 2 ) " f (x|0, Σ 1 + Σ 2 ) dx + f (x|0, Σ 1 + Σ 2 )log f (x|μ1 − μ2 , Σ 1 + Σ 2 ) = (μ1 − μ2 )T (Σ 1 + Σ 2 )−1 (μ1 − μ2 ).
First, we consider the case of Σ 1 = Σ 2 = Σ. Let {X 11 , X 12 , . . . , X 1m } and {X 21 , X 22 , . . . , X 2n } be random samples from distributions f (x|μ1 , Σ) and f (x|μ2 , Σ), respectively. Then, the sample mean vectors 1 X 1i and m i=1 m
X1 =
1 X 2i , n i=1 n
X2 =
(4.49)
are distributed according to the normal distribution f x|μ1 , m1 Σ and f x|μ2 , 1n Σ , respectively. From the samples, the unbiased estimators of μ1 , μ2 and Σ are, respectively, given by
μ1 = X 1 , μ2 = X 2 , and S =
(m − 1)S1 + (n − 1)S2 , m+n−2
4.7 Comparison of Mean Vectors of Two Multivariate …
121
where ⎧ ⎪ ⎪ ⎨ S1 =
1 m−1
⎪ ⎪ ⎩ S2 =
1 n−1
m T X 1i − X 1 X 1i − X 1 , i=1
n T X 2i − X 2 X 2i − X 2 . i=1
From this, the Hotelling’s T 2 statistic is given by T2 =
T −1 mn X1 − X2 S X1 − X2 . m+n
(4.50)
From Theorem 4.7, we have the following theorem: Theorem 4.11 For the Hotelling’s T 2 (4.50), m+n−p−1 T 2 is distributed according (m+n−2)p to non-central F distribution with degrees of freedom p and n − p − 1, where the non-centrality is (μ1 − μ2 )T Σ −1 (μ1 − μ2 ). Second, the case of Σ 1 = Σ 2 is discussed. Since sample mean vecare distributed according to normal distributions tors X 1 and X 2 (4.49) f x1 |μ1 , m1 Σ 1 and f x2 |μ2 , 1n Σ 2 , respectively, X 1 − X 2 is distributed according to f x|μ1 − μ2 , m1 Σ 1 + 1n Σ 2 . From this, it follows that f x|μ1 − μ2 , m1 Σ 1 + 1n Σ 2 1 1 f x|μ1 − μ2 , Σ 1 + Σ 2 log dx m n f x|0, m1 Σ 1 + 1n Σ 2 " f x|0, m1 Σ 1 + 1n Σ 2 1 1 dx + f x|0, Σ 1 + Σ 2 log m n f x|μ1 − μ2 , m1 Σ 1 + 1n Σ 2 −1 1 T 1 Σ1 + Σ2 = (μ1 − μ2 ) (μ1 − μ2 ). m n
"
Then, the estimator of the above information is given as the following statistic: T T 2 = X1 − X2
1 1 S1 + S2 m n
−1
X1 − X2 .
(4.51)
With respect to the above statistic, we have the following theorem: Theorem 4.12 Let {X 11 , X 12 , . . . , X 1m } and {X 21 , X 22 , . . . , X 2n } be random samples from distributions f (x|μ1 , Σ 1 ) and f (x|μ2 , Σ 2 ), respectively. For sufficiently large sample size m and n, the statistic T 2 (4.51) is asymptotically distributed according to the non-central chi-square distribution with degrees of freedom p. Proof For large sample size m and n, it follows that
122
4 Analysis of Continuous Variables
T T 2 = X1 − X2 T ≈ X1 − X2
1 1 S1 + S2 m n
−1
1 1 Σ1 + Σ2 m n
X1 − X2
−1
X1 − X2 .
(4.52)
Since X 1 − X 2 is normally distributed according to f x|μ1 − μ2 , m1 Σ 1 + 1n Σ 2 , statistic (4.52) is distributed according to the non-central chi-square distribution with degrees of freedom p. Hence, the theorem follows. Remark 4.7 In Theorem 4.12, the theorem holds true, regardless of the normality assumption of samples, distributed according to because X 1 − X 2 is asymptotically normal distribution f x|μ1 − μ2 , m1 Σ 1 + 1n Σ 2 .
4.8 One-Way Layout Experiment Model Let Yij , j = 1, 2, . . . , n be random samples observed at level i = 1, 2, . . . , I . Oneway layout experiment model [7] is expressed by Yij = μ + αi + eij , i = 1, 2, . . . , I ; j = 1, 2, . . . , n. where E Yij = μ + αi , i = 1, 2, . . . , I ; j = 1, 2, . . . , n, E eij = 0, Var eij = σ 2 , Cov eij , ekl = 0, i = k, j = j. For model identification, the following constraint is placed on the model. I
αi = 0.
(4.53)
i=1
In order to consider the above model in a GLM framework, the following dummy variables are introduced: # 1 (for level i) Xi = , i = 1, 2, . . . , I . (4.54) 0 (for the other levels) and we set the explanatory dummy variable vector as X = (X1 , X2 , . . . , XI )T . Assuming the random component, i.e., the conditional distribution of Y given X = x, f (y|x), to be the normal distribution with mean θ and variance σ 2 , we have
4.8 One-Way Layout Experiment Model
123
2 √ yθ − θ2 y2 2 f (y|x) = exp − − log 2π σ , σ2 2σ 2 Since the systematic component is given by a linear combination of the dummy variables (4.54): η=
I
αi Xi ,
i=1
it follows that θ =μ+
I
αi Xi .
i=1
The factor levels are randomly assigned to experimental units, e.g., subjects, with probability 1I , so the marginal distribution of response Y is
I yθi − 1 f (y) = exp I i=1 σ2
θi2 2
√ y2 2 − − log 2π σ , 2σ 2
where θi = μ + αi , i = 1, 2, . . . , I . Then, we have " I " f (y) 1 f (y|Xi = 1) dy + f (y)log dy f (y|Xi = 1)log I i=1 f (y) f (y|Xi = 1) I I 1 1 2 Cov(Y , θ ) i=1 Cov(Y , αi Xi ) i=1 αi I I = = = . (4.55) σ2 σ2 σ2 Hence, the entropy coefficient of determination is calculated as ECD(X, Y ) =
1 I
I
2 i=1 αi . 2 2 i=1 αi + σ
1 I I
In the above one-way layout experiment, the variance of response variable Y is computed as follows:
124
4 Analysis of Continuous Variables
" Var(Y ) =
(y − μ)2 f (y)dy "
I
yθi − = (y − μ) exp I i=1 σ2
I " yθi − 1 2 = (y − μ) exp I i=1 σ2 21
θi2 2
√ y2 2 − − log 2π σ dy 2σ 2 √ y2 2 − − log 2π σ dy 2σ 2
1 2 1 2 αi + σ 2 = α + σ 2. I i=1 I i=1 i I
=
θi2 2
I
(4.56)
and the partial variance of Y given X = (X1 , X2 , . . . , XI )T is 1 Var(Y |Xi = 1) I i=1 I
Var(Y |X) =
1 = I i=1 I
"
yθi − (y − μ − αi )2 exp σ2
θi2 2
√ y2 − − log 2π σ 2 dy 2σ 2
1 2 σ = σ 2. I i=1 I
=
(4.57)
From (4.56) and (4.57), the explained variance of Y by X is given by 1 2 α . I i=1 i I
Var(Y ) − Var(Y |X) = Hence, we obtain
Cov(Y , θ ) = Var(Y ) − Var(Y |X), and KL information (4.55) is also expressed by " I " f (y) 1 f (y|Xi = 1) dy + f (y)log dy f (y|Xi = 1)log I i=1 f (y) f (y|Xi = 1) =
Var(Y ) − Var(Y |X) . Var(Y |X)
The above result shows the KL information (4.55) is interpreted as a signal-tonoise ratio, where the signal is Cov(Y , θ ) = Var(Y ) − Var(Y |X) and the noise is Var(Y |X) = σ 2 .
4.8 One-Way Layout Experiment Model
125
From the data Yij , i = 1, 2, . . . , I ; j = 1, 2, . . . , J , we usually have the following decomposition: I I I n n 2 2 2 SS T = Yij − Y ++ = n Yij − Y i+ , Y i+ − Y ++ + i=1 j=1
i=1
i=1 j=1
where Y i+ =
n I n 1 1 Yij , Y ++ = Yij . n j=1 nI i=1 j=1
Let SS A = n
I I n 2 2 Yij − Y i+ , Y i+ − Y ++ , SS E = i=1
(4.58)
i=1 j=1
Then, the expectations of the above sums of variances are calculated as follows: E(SS A ) = n
I I $ 2 % = (I − 1)σ 2 + n E Y i+ − Y ++ αi2 , i=1
i=1
I I n n $ 2 % n−1 2 E(SS E ) = σ = I (n − 1)σ 2 . = E Yij − Y i+ n i=1 j=1 i=1 j=1
In this model, the entropy correlation coefficient between factor X and response Y is calculated as & ' 1 I 2 ' Cov(Y , θ ) i=1 αi I =(1 ECorr(X , Y ) = √ . (4.59) √ I 2 2 Var(Y ) Var(θ ) i=1 αi + σ I and ECC is the square root of ECD. Since the ML estimators of the effects αi2 and error variance σ 2 are, respectively, given by α i = Y i+ − Y ++ and σ 2 =
the ML estimator of ECD is computed as
I n 2 1 Yij − Y i+ , nI i=1 j=1
126
4 Analysis of Continuous Variables
Table 4.2 Variance decomposition Factor
SS
df
SS/df
Factor A
SSA
I −1
SSA I −1
Error
SSE
I (n − 1)
Total
SSA + SSE
nI − 1
SSE I (n−1)
Expectation σ2 +
J I −1
I i=1
αi2
σ2
2 I i=1 Y i+ − Y ++ ECD(X, Y ) = 2 2 I 1 + nI1 Ii=1 nj=1 Yij − Y i+ i=1 Y i+ − Y ++ I 1 I
=
SSA . SSA + SSE
(4.60)
Since the F statistic in Table 4.2 is F=
1 SSA I −1 1 SSE I (n−1)
⇔
SSA I −1 F. = SSE I (n − 1)
(4.61)
ECD can be expressed by the above F statistic.
ECD(X, Y ) =
SSA SSE
+1
SSA SSE
=
(I − 1)F . (I − 1)F + I (n − 1)
(4.62)
If samples are Yij , j = 1, 2, . . . , ni ; i = 1, 2, . . . , I , the constraint on the main effects (4.53) is modified as I
ni αi = 0,
(4.63)
i=1
and the marginal density function as
I yθi − ni exp f (y) = N σ2 i=1
θi2 2
√ y2 − − log 2π σ 2 , 2σ 2
where N=
I i=1
ni .
(4.64)
4.8 One-Way Layout Experiment Model
127
Then, the formulae derived for samples Yij , j = 1, 2, . . . , n; i = 1, 2, . . . , I are modified by using (4.63) and (4.64). Multiway layout experiment models can also be discussed in entropy as explained above. Two-way layout experiment models are considered in Chap. 7.
4.9 Classification and Discrimination T Let X = X1 , X2 , . . . , Xp be p-dimensional random vector, and let fi (x), i = 1, 2 be density functions. In this section, we discuss the optimal classification of observation X = x into one of the two populations with the density functions fi (x), i = 1, 2 [15]. Below, the populations and the densities are identified. In the classification, we have two types of errors. One is the error that the observation X = x from population f1 (x) is assigned to population f2 (x), and the other is the error that the observation X = x from population f2 (x) is assigned to population f1 (x). To discuss the optimal classification, the sample space of X, , is decomposed into two subspaces = 1 + 2 .
(4.65)
such that if x ∈ 1 , then the observation x is classified into population f1 (x) and if x ∈ 2 , into population f2 (x). If there is no prior information on the two populations, the first error probability is " P(2|1) =
f1 (x)dx, 2
and the second one is " P(1|2) =
f2 (x)dx. 1
In order to minimize the total error probability P(2|1) + P(1|2), the optimal classification procedure is made by deciding the optimal decomposition of the sample space (4.65). Theorem 4.13 The optimal classification procedure that minimizes the error probability P(2|1) + P(1|2) is given by the following decomposition of sample space (4.65): ) ) # # f1 (x) f1 (x) 1 = x| > 1 , 2 = x| ≤1 . f2 (x) f2 (x)
(4.66)
128
4 Analysis of Continuous Variables
Proof Let 1 and 2 be any decomposition of sample space . For the decomposition, the error probabilities are denoted by P(2|1) and P(1|2) . Then, P(2|1) =
"
"
2
P(1|2) =
"
f1 (x)dx =
f1 (x)dx + 2 ∩1
"
"
"
f2 (x)dx = 1
f1 (x)dx,
2 ∩2
f2 (x)dx + 1 ∩1
f2 (x)dx.
1 ∩2
From this, we have "
P(2|1) − P(2|1) =
" f1 (x)dx −
f1 (x)dx 2
2
⎛
"
⎜ f1 (x)dx − ⎝
= "
"
=
f1 (x)dx − 1 ∩2
−
2 ∩2
"
f1 (x)dx ≤
2 ∩1
"
⎟ f1 (x)dx⎠
f1 (x)dx +
2 ∩1
2
⎞
"
"
f2 (x)dx 1 ∩2
f1 (x)dx,
2 ∩1
and "
"
P(1|2) − P(1|2) =
f2 (x)dx − 2 ∩1
1 ∩2
"
−
" f2 (x)dx ≤
f1 (x)dx 2 ∩1
f2 (x)dx.
1 ∩2
Hence, it follows that (P(2|1) + P(1|2)) − P(2|1) + P(1|2) ≤ 0 ⇔ P(2|1) + P(1|2) ≤ P(2|1) + P(1|2) . This completes the theorem.
The above theorem is reconsidered from a viewpoint of entropy. Let δ(x) be a classification (decision) function that takes population fi (x) with probability πi (x), i = 1, 2, where
4.9 Classification and Discrimination
129
π1 (x) + π2 (x) = 1, x ∈ . In this sense, the classification function δ(X) is random variable. If there is no prior information on the two populations, then the probability that decision δ(x) is correct is p(x) = π1 (x)
f1 (x) f2 (x) + π2 (x) , f1 (x) + f2 (x) f1 (x) + f2 (x)
and the incorrect decision probability is 1 − p(x). Let # Y (δ(X)) =
1 (if the decision δ(X)is correct) . 0 (if the decision δ(X) is incorrect)
Then, the entropy of classification function δ(X) with respect to the correct and incorrect classifications can be calculated by " H(Y (δ(X))) =
(−p(x)logp(x) − (1 − p(x))log(1 − p(x)))f (x)dx,
where f (x) =
f1 (x) + f2 (x) . 2
With respect to the above entropy, it follows that # min −log
" H (Y (δ(X))) ≥
" =−
f (x)log 1
" −
f (x)log 2
) f1 (x) f2 (x) , −log f (x)dx f1 (x) + f2 (x) f1 (x) + f2 (x)
f1 (x) dx f1 (x) + f2 (x)
f2 (x) dx, f1 (x) + f2 (x)
(4.67)
where 1 and 2 are given in (4.66). According to Theorem 4.13, for (4.66), the optimal classification function δoptimal (X) can be set as # π1 (x) =
1 x ∈ 1 , π2 (x) = 1 − π1 (x). 0 x ∈ 2
(4.68)
From the above discussion, we have the following theorem: Theorem 4.14 For any classification function δ(x) for populations fi (x), i = 1, 2, the optimal classification function δoptimal (X) (4.68) satisfies the following inequality:
130
4 Analysis of Continuous Variables
H(Y (δ(X))) ≥ H Y δoptimal (X) Proof For the optimal classification function, we have H Y δoptimal (X) = −
" f (x)log 1
" −
f (x)log 2
f1 (x) dx f1 (x) + f2 (x)
f2 (x) dx. f1 (x) + f2 (x)
From (4.67), the theorem follows.
For p-variate normal distributions N (μ1 , ) and N (μ2 , ), let f1 (x) and f2 (x) be the density functions, respectively. Then, it follows that 1 f1 (x) > 1 ⇔ logf1 (x) − logf2 (x) = − (x − μ1 ) −1 (x − μ1 )T f2 (x) 2 1 + (x − μ2 ) −1 (x − μ2 )T = (μ1 − μ2 ) −1 xT 2 1 1 − μ1 −1 μT1 + μ2 −1 μT2 2 2 μT1 + μT2 −1 T x − >0 = (μ1 − μ2 ) 2
(4.69)
Hence, sample subspace 1 is made by (4.69), and the p-dimensional hyperplane 1 1 (μ1 − μ2 ) −1 xT − μ1 −1 μT1 + μ2 −1 μT2 = 0, 2 2 discriminates the two normal distributions. From this, the following function is called the linear discriminant function: Y = (μ1 − μ2 ) −1 xT .
(4.70)
The Mahalanobis’ distance between the two mean vectors μ1 and μ2 of p-variate normal distributions N (μ1 , ) and N (μ2 , ) is given by DM (μ1 , μ2 ) =
(μ1 − μ2 )T −1 (μ1 − μ2 )
The square of the above distance is expressed by the following KL information:
4.9 Classification and Discrimination
131
"
f (x|μ2 , Σ) dx f (x|μ2 , Σ)log f (x|μ1 , Σ) " f (x|μ1 , Σ) dx. + f (x|μ1 , Σ)log f (x|μ2 , Σ)
DM (μ1 , μ2 )2 =
(4.71)
The discriminant function (4.70) is considered in view of the above KL information, i.e., square of the Mahalanobis’ distance. We have the following theorem: Theorem 4.15 For any p-dimensional column vectors a, μ1 , μ2 and p × p vari 2 T ance–covariance matrix KL information DM α T μ between nor1 , α μ2 T Σ, the mal distributions N α μ1 , α T Σα and N α T μ2 , α T Σα (4.71) is maximized by α = Σ −1 (μ1 − μ2 ) and 2 max DM α T μ1 , α T μ2 = (μ1 − μ2 )T −1 (μ1 − μ2 ). α
(4.72)
Proof From (4.71), it follows that 2 T T 2 α μ1 − α T μ2 T DM α μ1 , α μ2 = αT Σα
(4.73)
Given αT Σα = 1, the above function is maximized. For Lagrange multiplier λ, we set 2 g = α T μ1 − α T μ2 − λα T Σα and differentiating the above function with respect to α, we have ∂g = 2 α T μ1 − α T μ2 (μ1 − μ2 ) − 2λΣα = 0 ∂α T Since ν = α μ1 − α T μ2 is a scalar, we obtain ν(μ1 − μ2 ) = λα. From this, we have α=
ν −1 (μ1 − μ2 ). λ
Function (4.73) is invariant with respect to scalar λν , so the theorem follows. From the above theorem, the discriminant function (4.70) discriminates N (μ1 , ) and N (μ2 , ) in the sense of the maximum of the KL information (4.72).
132
4 Analysis of Continuous Variables
For the two distributions fi (x), i = 1, 2, if prior probabilities of the distributions are given, the above discussion is modified as follows. Let qi be prior probabilities of fi (x), i = 1, 2, where q1 + q2 = 1. Then, substituting fi (x) for qi fi (x), a discussion similar to the above one can be made. For example, the optimal procedure (4.66) is modified as ) ) # # q1 f1 (x) q1 f1 (x) < 1 , 2 = x ≤1 . 1 = x q2 f2 (x) q2 f2 (x)
4.10 Incomplete Data Analysis The EM algorithm [2] is widely used for the ML estimation from incomplete data. Let X and Y be complete and incomplete data (variables), respectively; let f (x|φ) and g(y|φ) be the density or probability function of X and Y, respectively; let be the sample space of complete data X and (y) = {x ∈ |Y = y} be the conditional sample space of X given Y = y. Then, the log likelihood function of φ based on incomplete data Y = y is l(φ) = logg(y|φ)
(4.74)
and the conditional density or probability function of X given Y = y is k(x|y, φ) =
f (x|φ) . g(y|φ)
(4.75)
Let H φ |φ = E logk X|y, φ |y, φ " = k(x|y, φ)logk x|y, φ dx. (y)
Then, the above function is the negative entropy of distribution k x|y, φ (4.75) for distribution k(x|y, φ). Hence, from Theorems 1.1 and 1.8, we have H φ |φ ≤ H (φ|φ). The inequality holds if and only if k x|y, φ = k(x|y, φ). Let
(4.76)
4.10 Incomplete Data Analysis
133
⎛ ⎜ Q φ |φ = E logf X|φ |y, φ ⎝=
⎞
"
⎟ f (x|φ) logf x|φ dx⎠. g(y|φ)
(y)
From (4.74) and (4.75), we have H φ |φ =
" (y)
" f x|φ f (x|φ) f (x|φ) log dx = log f x|φ dx g(y|φ) g(y|φ) g y|φ (y)
− logg y|φ = Q φ |φ − l φ .
(4.77)
With respect to function H φ |φ , we have the following theorem [2]: Theorem 4.16 Let φ p , p = 1, 2, . . . , be a sequence such that Q φ |φ p = Q φ p+1 |φ p . max φ
Then, l φ p+1 ≥ l φ p , p = 1, 2, . . . . Proof From (4.77), it follows that 0 ≤ Q φ p+1 |φ p − Q φ p |φ p = H φ p+1 |φ p + l φ p+1 − H φ p |φ p + l φ p = H φ p+1 |φ p − H φ p |φ p + l φ p+1 − l φ p . (4.78) From (4.76), we have H φ p+1 |φ p − H φ p |φ p ≤ 0, so it follows that l φ p+1 ≥ l φ p .
from (4.77). This completes the theorem.
The above theorem is applied to obtain the ML estimate of parameter φ, φ , such that l φ = max l(φ)
φ
from incomplete data y. Definition 4.4 The following iterative procedure with E-step and M-step for the ML estimation, (i) and (ii), is defined as the Expectation-Maximization (EM) algorithm:
134
4 Analysis of Continuous Variables
(i) Expectation step (E-step) For estimate φ p at the pth step, compute the conditional expectation of logf (x|φ) given the incomplete data y and parameter φ p : Q φ|φ p = E logf (x|φ)|y, φ p .
(4.79)
(ii) Maximization step (M-step) Obtain φ p+1 such that Q φ p+1 |φ p = max Q φ|φ p .
(4.80)
φ
If there exists φ ∗ = lim φ p , from Theorem 4.16, we have p→∞
l φ ∗ ≥ l φ p , p = 1, 2, . . . .
For the ML estimate φ , it follows that
Q φ |φ = max Q φ|φ .
φ
Hence, the EM algorithm can be applied to obtain the ML estimate φ . If the density or probability function of complete data X is assumed to be the following exponential family of distribution: f (x|φ) = b(x)
exp φt(x)T , a(φ)
where φ be a 1 × r parameter vector and t(x) be a 1 × r sufficient statistic vector. Since " a(φ) = b(x)exp φt(x)T dx, we have ∂ ∂ Q φ|φ p = E logf (x|φ)|y, φ p ∂φ ∂φ ∂ loga(φ) = E t(X)|y, φ p − ∂φ " 1 ∂ = E t(X)|y, φ p − b(x)exp φt(x)T dx a(φ) ∂φ p = E t(X)|y, φ − E(t(X)|φ) = 0.
4.10 Incomplete Data Analysis
135
From the above result, the EM algorithm with (4.79) and (4.80) can be simplified as follows: (i) E-step Compute t p+1 = E t(X)|y, φ p .
(4.81)
(ii) M-step Obtain φ p+1 from the following equation: t p+1 = E(t(X)|φ).
(4.82)
Example 4.4 Table 4.3 shows random samples according to bivariate normal distribution with mean vector μ = (50, 60) and variance–covariance matrix =
225 180 . 180 400
Table 4.3 Artificial complete data from bivariate normal distribution Case
X1
X2
Case
X1
X2
Case
X1
X2
1
69.8
80.8
21
54.8
28.3
41
41.6
50.4
2
69.5
47.1
22
54.7
92.7
42
40.4
28.7
3
67.9
60.5
23
53.5
78.3
43
39.9
64.4
4
67.8
75.8
24
53.0
56.8
44
39.8
66.5
5
65.5
58.5
25
52.9
69.3
45
39.1
44.9
6
64.4
85.8
26
52.2
80.1
46
38.7
57.8
7
63.9
52.3
27
51.6
71.7
47
38.0
49.2
8
63.9
77.8
28
51.4
60.4
48
37.8
49.8
9
62.7
52.0
29
49.9
60.9
49
36.8
70.5
10
60.5
71.2
30
49.8
58.8
50
33.7
48.6
11
60.5
74.0
31
48.8
71.3
51
33.3
45.5
12
60.4
78.7
32
47.4
52.3
52
32.9
53.6
13
60.0
83.3
33
47.1
57.1
53
32.4
54.2
14
57.9
78.9
34
46.4
63.2
54
30.3
45.5
15
57.6
52.9
35
45.6
63.7
55
27.8
53.0
16
57.2
72.8
36
44.8
74.2
56
26.7
58.0
17
57.0
73.7
37
44.8
51.5
57
24.9
39.3
18
55.5
58.2
38
43.8
43.4
58
24.5
30.4
19
55.2
66.6
39
43.6
58.7
59
20.9
23.0
20
55.1
68.4
40
41.7
57.5
60
14.0
28.6
136
4 Analysis of Continuous Variables
From Table 4.3, the mean vector, the variance–covariance matrix, and the correlation coefficient are estimated as 174.3 123.1 , ρ = 0.607. μ = (47.7, 59.7), = 123.18 235.7
Table 4.4 illustrates incomplete (missing) data from Table 4.3. In the missing data, if only the samples 1–40 with both values of the two variables are used for estimating the parameters, we have
μ0 = (55.3, 65.5), 0 =
60.7 24.7 , ρ 0 = 0.242. 24.7 171.0
(4.83)
In data analysis, we have to use all the data in Table 4.4 to estimate the parameters. The EM algorithm (4.81) and (4.82) is applied to analyze the data in Table 4.4. Let μp , p and ρ p be the estimates of the parameters in the pth iteration, where Table 4.4 Incomplete (missing) data from Table 4.3 Case
X1
X2
Case
X1
X2
Case
X1
X2
1
69.8
80.8
21
54.8
28.3
41
41.6
Missing
2
69.5
47.1
22
54.7
92.7
42
40.4
Missing
3
67.9
60.5
23
53.5
78.3
43
39.9
Missing
4
67.8
75.8
24
53.0
56.8
44
39.8
Missing
5
65.5
58.5
25
52.9
69.3
45
39.1
Missing
6
64.4
85.8
26
52.2
80.1
46
38.7
Missing
7
63.9
52.3
27
51.6
71.7
47
38.0
Missing
8
63.9
77.8
28
51.4
60.4
48
37.8
Missing
9
62.7
52.0
29
49.9
60.9
49
36.8
Missing
10
60.5
71.2
30
49.8
58.8
50
33.7
Missing
11
60.5
74.0
31
48.8
71.3
51
33.3
Missing
12
60.4
78.7
32
47.4
52.3
52
32.9
Missing
13
60.0
83.3
33
47.1
57.1
53
32.4
Missing
14
57.9
78.9
34
46.4
63.2
54
30.3
Missing
15
57.6
52.9
35
45.6
63.7
55
27.8
Missing
16
57.2
72.8
36
44.8
74.2
56
26.7
Missing
17
57.0
73.7
37
44.8
51.5
57
24.9
Missing
18
55.5
58.2
38
43.8
43.4
58
24.5
Missing
19
55.2
66.6
39
43.6
58.7
59
20.9
Missing
20
55.1
68.4
40
41.7
57.5
60
14.0
Missing
4.10 Incomplete Data Analysis
137
p p μ = μ1 , μ2 , p =
p
p p σ11 σ12 . p p σ21 σ22
Let x1i and x2i be the observed values of case i for variables X1 and X2 in Table 4.4, respectively. For example, (x11 , x21 ) = (69.8, 80.8) and x1,41 , x2,41 = (41.6, missing). Then, in the E-step (4.81), the missing values in Table 4.4 are estimated as follows: σ21 p p x1i − μ1 , i = 41, 42, . . . , 60. σ11 p
p
p
x2i = μ2 +
In the (p + 1) th iteration, by using the observed data in Table 4.4 and the above estimates of the missing values, we have μp+1 , p+1 and ρ p+1 in the M-step (4.82). Setting the initial values of the parameters by (4.83), we have μ∞ = (47.7, 62.4), ∞ =
174.3 70.8 , ρ ∞ = 0.461. 70.8 135.2
(4.84)
Hence, the convergence values (4.84) are the ML estimates obtained by using the incomplete data in Table 4.4. The EM algorithm can be used for the ML estimation from incomplete data in both continuous and categorical data analyses. A typical example of categorical incomplete data is often faced in an analysis from phenotype data. Although the present chapter has treated continuous data analysis, in order to show an efficacy of the EM algorithm to apply to analysis of phenotype data, the following example is provided. Example 4.5 Let p, q, and r be the probabilities or ratios of blood gene types A, B, and O, respectively, in a large and closed population. Then, a randomly selected individual in the population has one of genotypes AA, AO, BB, BO, AB, and OO with probability p2 , 2pr, q2 , 2qr, 2pq, r 2 , respectively; however, the complete information on the genotype cannot be obtained and we can merely decide phenotype A, B, or O from his or her blood sample. Table 4.5 shows an artificial data produced by p = 0.4, q = 0.3, r = 0.3. Let nAA , nAO , nBB , nBO , nAB , and nOO be the numbers of genotypes AA, AO, BB, BO, AB, and OO in the data in Table 4.5. Then, we have nAA + nAO = 394, nBB + nBO = 277, nAB = 239, nOO = 90. Table 4.5 Artificial blood phenotype data Phenotype
A
B
AB
OO
Total
Frequency
394
277
239
90
1000
138
4 Analysis of Continuous Variables
In this sense, the data in Table 4.5 are incomplete, and X = (nAA , nAO , nBB , nBO , nAB , nOO ) is complete and sufficient statistics of the parameters. Given the total in Table 4.5, the complete data are distributed according to a multinomial distribution, so the E-step and M-step (4.81) and (4.82) are given as follows. Let p(u) , q(u) , and r (u) be the estimates of parameters p, q, and r in the (u + 1) u+1 u+1 u+1 u+1 u+1 th iteration, and let nu+1 AA , nAO , nBB , nBO , nAB , and nOO be the estimated complete data in the (u + 1) th iteration. Then, we have the EM algorithm as follows: (i) E-step p(u)2 2p(u) r (u) , nu+1 , AO = (nAA + nAO ) (u)2 (u) (u) + 2p r p + 2p(u) r (u) q(u)2 2q(u) r (u) = (nBB + nBO ) (u)2 , nu+1 , BO = (nBB + nBO ) (u)2 (u) (u) q + 2q r q + 2q(u) r (u)
nu+1 AA = (nAA + nAO ) nu+1 BB
p(u)2
u+1 nu+1 AB = nAB , nOO = nOO .
(ii) M-step p(u+1) =
u+1 2nu+1 2nu+1 + nu+1 AA + nAO + nAB BO + nAB , q(u+1) = BB , 2n 2n
r (u+1) = 1 − p(u+1) − q(u+1) , where n = nAA + nAO + nBB + nBO + nAB + nOO . For the initial estimates of the parameters p(0) = 0.1, q(0) = 0.5, and r (0) = 0.4, we have p(1) = 0.338, q(1) = 0.311, r (1) = 0.350, and the convergence values are p(10) = 0.393, q(10) = 0.304, r (10) = 0.302. In this example, the algorithm converges quickly.
4.11 Discussion In this chapter, first, correlation analysis has been discussed in association model and GLM frameworks. In correlation analysis, the absolute values of the correlation coefficient and the partial correlation coefficient are ECC and the conditional ECC, respectively, and the multiple correlation coefficient and the coefficient of determination are the same as ECC and ECD, respectively. In canonical correlation analysis,
4.11 Discussion
139
it has been shown that the KL information that expresses the association between two random vectors are decomposed into those between the pairs of canonical variables, and ECC and ECD have been calculated to measure the association between two random vectors. Second, it has been shown that basic statistical methods have been explained in terms of entropy. In testing the mean vectors, the Hotelling’s T 2 statistic has been deduced from KL information and in testing variance–covariance matrices of multivariate normal distributions, an entropy-based test statistic has been proposed. The discussion has been applied in comparison with two multivariate normal distributions. The discussion has a possibility, which is extended to a more general case, i.e., comparison of more than two normal distributions. In the experimental design, one-way layout experimental design model has been treated from a viewpoint of entropy. The discussion can also be extended to multiway layout experimental design models. In classification and discriminant analysis, the optimal classification method is reconsidered through an entropy-based approach, and the squared Mahalanobis’s distance has been explained with the KL information. In missing data analysis, the EM algorithm has been overviewed in entropy. As explained in this chapter, entropy-based discussions for continuous data analysis are useful and it suggests a novel direction to make approaches for other analyses not treated in this chapter.
References 1. Anderson, T. W. (1984). An introduction to multivariate statistical analysis. New York: Wiley. 2. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, B, 39, 1–38. 3. Eshima, N., & Tabata, M. (2007). Entropy correlation coefficient for measuring predictive power of generalized linear models. Statistics and Probability Letters, 77, 588–593. 4. Eshima, N., & Tabata, M. (2010). Entropy coefficient of determination for generalized linear models. Computational Statistics & Data Analysis, 54, 1381–1389. 5. Fisher, R. A. (1915). Frequency distribution of the values of the correlation coefficient in samples of an indefinitely large population. Biometrika, 10, 507–521. 6. Fisher, R. A. (1935). The design of experiments. London: Pliver and Boyd. 7. Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179–188. 8. Fisher, R. A. (1938). The statistical utilization of multiple measurements. Annals of Eugenics, 8, 376–386. 9. Goodman, L. A. (1981). Association models and canonical correlation in the analysis of crossclassification having ordered categories. Journal of the American Statistical Association, 76, 320–334. 10. Hastie, T., & Tibshirani, R. (1996). Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society: Series B, 58, 155–176. 11. Hotelling, H. (1931). The generalization of student’s ratio. Annals of Mathematical Statistics, 2, 360–372. 12. Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28, 321–377. 13. Mahalanobis, P. C. (1936). On the general distance in statistics. Proceedings of the National Institute of Sciences of India, 2, 49–55.
140
4 Analysis of Continuous Variables
14. Patnaik, P. B. (1949). The non-central χ2 and F-distributions and their applications. Biometrika, 36, 202–232. 15. Wald, A. (1944). On a statistical problem arising in the classification of an individual into one of two groups. Annals of Statistics, 15, 145–162.
Chapter 5
Efficiency of Statistical Hypothesis Test Procedures
5.1 Introduction The efficiency of hypothesis test procedures has been discussed mainly in the context of the Pittman- and Bahadur-efficiency approaches [1] up to now. In the Pitmanefficiency approach [4, 5], the power functions of the test procedures are compared from a viewpoint of sample sizes, on the other hand, in the Bahadur-efficiency approach, the efficiency is discussed according to the slopes of log power functions, i.e., the limits of the ratios of the log power functions for sample sizes. Although the theoretical aspects in efficiency of test statistics have been derived in both approaches, herein, most of the results based on the two approaches are the same [6]. The aim of this chapter is to reconsider the efficiency of hypothesis testing procedures in the context of entropy. In Sect. 5.2, the likelihood ratio test is reviewed, and it is shown that the procedure is the most powerful test. Section 5.3 reviews the Pitman and Bahadur efficiencies. In Sect. 5.4, the asymptotic distribution of the likelihood ratio test statistic is derived, and the entropy-based efficiency is made. It is shown that the relative entropy-based efficiency is related the Fisher information. By using the median test as an example, it is shown that the results based on the relative entropybased efficiency, the relative Pitman and Bahadur efficiencies are the same under an appropriate condition. Third, “information of parameters” in general test statistics is discussed, and a general entropy-based efficiency is defined. The relative efficiency of the Wilcoxon test, as an example, is discussed from the present context.
5.2 The Most Powerful Test of Hypotheses Let x1 , x2 , . . . , xn be random samples according to probability or density function f (x|θ ), where θ is a parameter. In testing the null hypothesis H0 : θ = θ0 versus the
© Springer Nature Singapore Pte Ltd. 2020 N. Eshima, Statistical Data Analysis and Entropy, Behaviormetrics: Quantitative Approaches to Human Behavior 3, https://doi.org/10.1007/978-981-15-2552-0_5
141
142
5 Efficiency of Statistical Hypothesis Test Procedures
alternative hypothesis H0 : θ = θ1 , for constant λ, the critical region with significance level α is assumed to be n f (xi |θ1 ) >λ . (5.1) W0 = (x1 , x2 , . . . , xn ) ni=1 i=1 f (xi |θ0 ) The above testing procedure is called the likelihood ratio test. The procedure is the most powerful, and it is proven in the following theorem [3]: Theorem 5.1 (Neyman–Pearson Lemma) Let W be any critical region with significance level α and W0 that based on the likelihood ratio with significance level α (5.1), the following inequality holds: P((X1 , X2 , . . . , Xn ) ∈ W0 |θ1 ) ≥ P((X1 , X2 , . . . , Xn ) ∈ W |θ1 ). Proof Let W be any critical region with significance level α. Then, P(W0 |θ1 ) − P(W |θ1 ) = P W0 ∩ W c |θ1 − P W0c ∩ W |θ1 n f (xi |θ1 )dx1 dx2 . . . dxn = W0 ∩W c i=1 n
−
W0c ∩W
n
≥λ W0
−λ
f (xi |θ1 )dx1 dx2 . . . dxn .
i=1
∩W c
f (xi |θ0 )dx1 dx2 . . . dxn
i=1 n
f (xi |θ0 )dx1 dx2 . . . dxn .
W0c ∩W i=1
Since the critical regions W0 and W have significance level α, we have α=
n W0
f (xi |θ0 )dx1 dx2 . . . dxn
i=1
n
=
f (xi |θ0 )dx1 dx2 . . . dxn
W0 ∩W i=1
+ W0
and we also have
∩W c
n i=1
f (xi |θ0 )dx1 dx2 . . . dxn .
(5.2)
5.2 The Most Powerful Test of Hypotheses
α=
n W
143
f (xi |θ0 )dx1 dx2 . . . dxn
i=1
n
=
f (xi |θ0 )dx1 dx2 . . . dxn
W0 ∩W i=1
+
n
f (xi |θ0 )dx1 dx2 . . . dxn .
W0c ∩W i=1
From the above results, it follows that W0
∩W c
n
f (xi |θ0 )dx1 dx2 . . . dxn =
i=1
W0c ∩W
n
f (xi |θ0 )dx1 dx2 . . . dxn .
i=1
Hence, from (5.2) P(W0 |θ1 ) ≥ P(W |θ1 ). This completes the theorem.
From the above theorem, we have the following corollary: Corollary 5.1 Let X1 , X2 , . . . , Xn be random samples from a distribution with parameter θ and let Tn (X1 , X2 , . . . , Xn ) be any statistic of the samples. Then, in testing the null hypothesis H0 : θ = θ0 and the alternative hypothesis H1 : θ = θ1 , the power of the likelihood ratio test with the samples Xi is greater than or equal to that with the statistic Tn . Proof Let f (x|θ) be the probability or density function of samples Xi and let ϕn (t|θ) be that of Tn . The critical regions with significance level α of the likelihood ratio statistics of samples X1 , X2 , . . . , Xn and statistic Tn (X1 , X2 , . . . , Xn ) are set as
n ⎧ ⎨ W = (x1 , x2 , . . . , xn )
ni=1 f (xi |θ1 ) > λ(f , n) , f |θ ) (x i 0 i=1 ⎩ WT = T (X1 , X2 , . . . , Xn ) = t
ϕn (t|θ1 ) > λ(ϕ, n) , ϕn (t|θ0 ) such that n ϕn (t|θ1 ) f (xi |θ1 ) α = P ni=1 > λ(f , n)|H0 = P > λ(ϕ, n)|H0 . ϕn (t|θ0 ) i=1 f (xi |θ0 ) Then, Theorem 5.1 proves P(W |H1 ) ≥ P(WT |H1 ).
144
5 Efficiency of Statistical Hypothesis Test Procedures
Since from Theorem 5.1, critical region WT gives the most powerful test procedure among those based on statistic T, the theorem follows. In the above corollary, if statistic Tn (X1 , X2 , . . . , Xn ) is a sufficient statistic of θ , then, the power of the statistic is the same that of the likelihood ratio test with samples X1 , X2 , . . . , Xn . In testing the null hypothesis H0 : θ = θ0 versus the alternative hypothesis H1 : θ = θ1 , the efficiencies of test procedures can be compared with the power functions. Optimal properties of the likelihood ratio test were discussed in Bahadur [2].
5.3 The Pitman Efficiency and the Bahadur Efficiency Let θ be a parameter; H0 : θ = θ0 be null hypothesis; H1 : θ = θ0 + √hn (= θn ) be a sequence of alternative hypotheses; and let Tn = Tn (X1 , X2 , . . . , Xn ) be the test statistic, where X1 , X2 , . . . , Xn are random samples. In this section, for simplicity of the discussion, we assume h > 0. Definition 5.1 (Pitman efficiency) Let Wn = {Tn > tn } be critical regions with significant level α, i.e., α = P{Wn |θ0 }. Then, for the sequence of power functions γn (θn ) = P{Wn |θn }, n = 1, 2, . . ., the Pitman efficiency is defined by PE(T ) = lim γn (θn ). n→∞
(5.3)
Definition 5.2 (Relative Pitman efficiency) Let Tn(k) = Tn(k) (X1 , X2 , . . . , Xn ), k = 1, 2 be the two sequences of test statistics and let γn(k) (θn ) be the power functions of θNn , the relative efficiency is defined by Tn(k) , k = 1, 2. Then, for γn(1) (θn ) = γN(2) n n RPE T (2) |T (1) = logn→∞ Nn
(5.4)
Example 5.1 Let X1 , X2 , . . . , Xn be random sample from normal distribution N μ, σ 2 . In testing the null hypothesis H0 : μ = μ0 versus the alternative hypothesis H1 : μ = μ0 + √hn (= μn ) for h > 0. The critical region with significance level α is given by σ 1 Xi > μ0 + zα √ , n i=1 n n
X = where
σ P X > μ0 + zα √ |μ0 n
= α,
5.3 The Pitman Efficiency and the Bahadur Efficiency
145
Let Z=
√ n X − μ0 − σ
√h n
.
Then, Z is distributed according to the standard normal distribution under H1 and we have σ h h = P X > μ0 + zα √ |μ0 + √ γ μ0 + √ n n n h h = P Z > − + zα |μ = 0 = 1 − − + zα , σ σ where function (z) is the distribution function of N (0, 1). The above power function is constant in n, so we have h PE X = 1 − − + zα σ Next, the Bahadur efficiency is reviewed. In testing the null hypothesis H0 : θ = θ0 versus the alternative hypothesis H1 : θ = θ1 ; let Tn(k) = Tn(k) (X), k = 1, 2 be test statistics; and let (k) (k) (k) Tn (X)|θ , L(k) n Tn (X )|θ = −2 log 1 − Fn where Fn(k) (t|θ ) = P Tn(k) < t|θ , k = 1, 2 and X = (X1 , X2 , . . . , Xn ). Definition 5.3 (Bahadur efficiency) If under the alternative hypothesis H1 : θ = θ1 , (k) L(k) P n Tn (X)|θ0 → ck (θ1 , θ0 ) (n → ∞), n then, the Bahadur efficiency (RBE) is defined by BE T (k) = ck (θ1 , θ0 ), k = 1, 2.
(5.5)
Definition 5.4 (Relative Bahadur efficiency) For (5.5), the relative Bahadur efficiency (RBE) is defined by c2 (θ1 , θ0 ) . RBE T (2) |T (1) = c1 (θ1 , θ0 )
(5.6)
Under an appropriate condition, the relative Bahadur efficiency is equal to the relative Pitman efficiency [1]. Example 5.2 Under the same condition of Example 5.1, let
146
5 Efficiency of Statistical Hypothesis Test Procedures
Tn (X) = X . Since under H1 : μ = μ1 (> μ0 ), for large n, 1 X = μ1 + O n− 2 , where 1 1 P O n− 2 /n− 2 → c (n → ∞). Since +∞ √ n n exp − 2 (t − μ0 )2 dt 1 − Fn (Tn (X)|μ0 ) = √ 2σ 2π σ 2 X
+∞ = 1 μ1 +O n− 2
√ n n exp − 2 (t − μ0 )2 dt √ 2σ 2π σ 2
+∞ √ n n ≈ exp − 2 (s − μ0 + μ1 )2 ds √ 2σ 2π σ 2 0 √ n n =√ exp − 2 (μ1 − μ0 )2 2σ 2π σ 2 +∞ n n 2 exp − 2 (μ1 − μ0 )s − s ds, σ 2σ 2 0
for sufficiently large n, we have √
1 − Fn (Tn (X)|μ0 ) < √
n exp − 2 (μ1 − μ0 )2 2σ 2π σ 2 n
+∞ n exp − 2 (μ1 − μ0 )s ds σ 0 √ n σ2 n =√ exp − 2 (μ1 − μ0 )2 2σ n(μ1 − μ0 ) 2π σ 2 n σ =√ exp − 2 (μ1 − μ0 )2 , 2σ 2π n(μ1 − μ0 ) √ n n 1 − Fn (Tn (X)|μ0 ) > √ exp − 2 (μ1 − μ0 )2 2σ 2π σ 2
5.3 The Pitman Efficiency and the Bahadur Efficiency
147
+∞ n exp − 2 (s − μ1 + μ0 )2 ds 2σ 0 √ n n >√ exp − 2 (μ1 − μ0 )2 2σ 2π σ 2 +∞ n exp − 2 (s − μ1 + μ0 )2 ds 2σ μ1 −μ0
=
n 1 exp − 2 (μ1 − μ0 )2 . 2 2σ
From this, we have n −2 σ 2 log √ exp − 2 (μ1 − μ0 ) n 2σ 2π n(μ1 − μ0 ) σ −2 n 2 log √ = − (μ1 − μ0 ) n 2π n(μ1 − μ0 ) 2σ 2 1 → 2 (μ1 − μ0 )2 = c(μ1 , μ0 )(n → ∞), σ −2 n −2 n 1 −log2 − = log exp − 2 (μ1 − μ0 )2 (μ1 − μ0 )2 2 n 2 2σ n 2σ 1 → 2 (μ1 − μ0 )2 , σ so it follows that −2log(1 − Fn (Tn (X)|μ0 )) 1 Ln (Tn (X)|μ0 ) = → 2 (μ1 − μ0 )2 n n σ = c(μ1 , μ0 ) (n → ∞), Let f (x) and g(x) be normal density functions of N μ1 , σ 2 and N μ2 , σ 2 , respectively. Then, in this case, we also have 2D(g||f ) =
1 (μ1 − μ0 )2 = c(μ1 , μ0 ). σ2
In the next section, the efficiency of test procedures is discussed in view of entropy.
148
5 Efficiency of Statistical Hypothesis Test Procedures
5.4 Likelihood Ratio Test and the Kullback Information As reviewed in Sect. 5.2, the likelihood ratio test is the most powerful test. In this section, the test is discussed with respect to information. Let f (x) and g(x) be density or probability functions corresponding with null hypothesis H0 and alternative hypothesis H1 , respectively. Then, for a sufficiently large sample size n, i.e., X1 , X2 , . . . , Xn , we have n n g(Xi ) −nD(f ||g) (under H0 ) i=1 g(Xi ) log n = log ≈ nD(g||f ) (under H1 ) f (Xi ) i=1 f (Xi ) i=1
(5.7)
where D(f ||g) =
f (x) dx, D(g||f ) = f (x)log g(x)
g(x)log
g(x) dx. f (x)
(5.8)
In (5.8), the integrals are replaced by the summations, if the f (x) and g(x) are probability functions. Remark 5.1 In general, with respect to the KL information (5.8), it follows that D(f ||g) = D(g||f ). In the normal distribution, we have D(f ||g) = D(g||f ). The critical region of the likelihood ratio test with significance level α is given by n g(Xi ) i=1 > λ(n), n i=1 f (Xi )
(5.9)
where n g(Xi ) =α P i=1 > λ(n)|H 0 n i=1 f (Xi ) With respect to the likelihood ratio function, we have the following theorem: Theorem 5.2 For parameter θ , let f (x) = f (x|θ0 ) and g(x) = f (x|θ0 + δ) be density or probability functions corresponding to null hypothesis H0 : θ = θ0 and alternative hypothesis H1 : θ = θ0 + δ, respectively, and let X1 , X2 , . . . , Xn be samples to test the hypotheses. For large sample size n and small constant δ, the following log-likelihood ratio statistic
5.4 Likelihood Ratio Test and the Kullback Information
149
n n g(Xi ) i=1 g(Xi ) = log n log f (Xi ) i=1 f (Xi ) i=1
(5.10)
is asymptotically distributed according to normal distribution under H0 , N −nD(f ||g), nδ 2 I (θ0 ) − nD(f ||g)2 and N nD(g||f ), nδ 2 I (θ ) − nD(g||f )2 under H1 , where I (θ ) is the Fisher information: I (θ ) =
d logf (x|θ) dθ
2 f (x|θ)dx.
(5.11)
Proof Under H0 , statistic n 1 g(Xi ) log n i=1 f (Xi )
(5.12)
is asymptotically distributed according to normal distribution with mean −D(f ||g) g(X ) 1 and variance n Var log f (X ) θ0 . For small δ, we obtain g(X ) d = logf (X |θ0 + δ) − logf (X |θ0 ) ≈ δ logf (X |θ0 ). f (X ) dθ
log Since
E
d d logf (X |θ0 )
θ0 = f (X |θ0 )dx = 0, dθ dθ
we have E
g(X ) log f (X )
2
2
d
2 logf (X |θ0 ) θ0 = δ 2 I (θ0 ).
θ0 ≈ δ E
dθ
Hence, we get
1 g(X )
1 θ0 ≈ δ 2 I (θ0 ) − D(f ||g)2 . Var log
n f (X ) n is asymptotOn the other hand, under the alternative hypothesis H1 , statistic (5.12) ) |θ + δ . ically normally distributed with mean D(g||f ) and variance 1n Var log g(X f (X ) 0 For small δ, we have
150
5 Efficiency of Statistical Hypothesis Test Procedures
E
g(X ) log f (X )
2
2
d
2 logf (x|θ0 ) θ0 + δ
θ0 + δ ≈ δ E
dθ 2
d
2 ≈δ E logf (x|θ0 ) θ0 = δ 2 I (θ0 ).
dθ
From this it follows that
g(X )
1 1 θ Var log + δ ≈ δ 2 I (θ0 ) − D(g||f )2 . 0
n f (X ) n
Hence, the theorem follows. According to the above theorem, for critical region (5.9), the power is n g(Xi ) P i=1 > λ(n)|H 1 → 1 (n → ∞). n i=1 f (Xi ) Below, the efficiency of the likelihood ratio test is discussed.
Lemma 5.1 If density or probability function f (x|θ) is continuous with respect to θ and if there exists an integrable function ϕ1 (x) and ϕ2 (x) such that
f (x|θ + δ) d logf (x|θ) < ϕ1 (x)
dθ
(5.13)
d
f (x|θ) < ϕ2 (x),
dθ
(5.14)
and
then, f (x|θ + δ)
d logf (x|θ )dx → 0 (δ → 0). dθ
Proof From (5.13), we obtain
d d f (x|θ + δ) logf (x|θ)dx → f (x|θ) logf (x|θ)dx dθ dθ d = f (x|θ)dx (δ → 0). dθ
From (5.14), it follows that
5.4 Likelihood Ratio Test and the Kullback Information
d d f (x|θ )dx = dθ dθ
151
f (x|θ )dx = 0.
From the above results, we have d f (x|θ + δ) logf (x|θ )dx → 0 (δ → 0). dθ
Hence, the Lemma follows.
Example 5.3 In exponential density function f (x|λ) = λe−λx (x > 0), for a < λ < b and small δ, we have
−(λ+δ)x
2
f (x|λ + δ) d logf (x|λ) = f (x|λ + δ) d f (x|λ) = (λ + δ)e 1−λ
f (x|λ) dλ
dλ λ <
(λ + δ)e−(λ+δ)x
1 − λ2 < 2 1 − λ2 e−λx (= ϕ1 (x)). λ
Similarly, it follows that
d
f (x|λ) = |1 − λx|e−λx < |1 − ax|e−ax (= ϕ2 (x))
dλ Since functions ϕ1 (x) and ϕ2 (x) are integrable, from Lemma 5.1, we have ∞
d f (x|λ + δ) logf (x|λ)dx = dλ
0
∞ 1 − x dx (λ + δ)exp(−(λ + δ)x) λ 0
=
1 1 − → 0 (δ → 0). λ λ+δ
The relation between the KL information and the Fisher information is given in the following theorem: Theorem 5.3 Let f (x|θ ) and f (x|θ + δ) be density or probability functions with parameters θ and θ + δ, respectively, and let I (θ ) be the Fisher information. Under the same condition as in Lemma 5.1, it follows that D(f (x|θ)||f (x|θ + δ)) I (θ ) → (δ → 0), 2 δ 2
(5.15)
I (θ ) D(f (x|θ + δ)||f (x|θ)) → (δ → 0). δ2 2
(5.16)
152
5 Efficiency of Statistical Hypothesis Test Procedures
Proof log
f (x|θ ) = logf (x|θ) − logf (x|θ + δ) f (x|θ + δ) d δ 2 d2 ≈ −δ logf (x|θ) − logf (x|θ). dθ 2 dθ 2
From this and Lemma 5.1, for small δ, we have f (x|θ) dx D(f (x|θ)||f (x|θ + δ)) = f (x|θ )log f (x|θ + δ) d δ 2 d2 ≈ f (x|θ ) −δ logf (x|θ) − logf (x|θ) dx dθ 2 dθ 2 d = −δ f (x|θ ) logf (x|θ)dx dθ δ2 d2 − f (x|θ) 2 logf (x|θ )dx 2 dθ δ2 d d2 f (x|θ )dx − ≈ −δ f (x|θ ) 2 logf (x|θ )dx dθ 2 dθ 2 2 δ δ2 d =− f (x|θ) 2 logf (x|θ)dx = I (θ ). 2 dθ 2 Similarly, since log
f (x|θ + δ) = logf (x|θ + δ) − logf (x|θ ) f (x|θ) d δ 2 d2 = −δ logf (x|θ + δ) − logf (x|θ + δ). dθ 2 dθ 2
we have D(f (x|θ + δ)||f (x|θ )) =
f (x|θ + δ)log
f (x|θ + δ) dx f (x|θ )
δ 2 d2 d logf + δ) f (x|θ + δ) δ logf (x|θ + δ) − dx (x|θ dθ 2 dθ 2 d ≈ −δ f (x|θ ) logf (x|θ )dx dθ δ2 d2 f (x|θ ) 2 logf (x|θ )dx − 2 dθ 2 d2 δ2 δ f (x|θ ) 2 logf (x|θ )dx = I (θ ). =− 2 dθ 2
≈
5.4 Likelihood Ratio Test and the Kullback Information
153
Hence, the theorem follows. By using the above theorem, we have the following theorem:
Theorem 5.4 In the likelihood ratio test (5.9) for H0 : θ = θ0 versus H1 , θ = θ0 + δ, for small δ and large sample size n, the 100α% critical region (5.9) is given by logλ(n) = −n
δ 2 I (θ0 ) + zα nδ 2 I (θ0 ), 2
(5.17)
percentile of the standard normal distribution, and the where zα is the upper 100α asymptotic power is Q zα − nδ 2 I (θ0 ) , where the function Q(z) is +∞ Q(z) = z
2 1 x dx. √ exp − 2 2π
Proof From Theorems 5.2, under the null hypothesis H0 , the log-likelihood ratio statistic (5.10) is asymptotically distributed according to the normal distribution N −nD(f ||g), nδ 2 I (θ0 ) − nD(f ||g)2 . From Theorem 5.3, under the null hypothesis, from (5.15) we have D(f ||g) ≈
δ 2 I (θ0 ) , 2
so it follows that −nD(f ||g) ≈ −n
δ 2 I (θ0 ) , 2
δ4 I (θ0 )2 ≈ nδ 2 I (θ0 ). 4 2 From this, the asymptotic distribution is N −n δ I2(θ0 ) , nδ 2 I (θ0 ) . From this, for (5.17) we have nδ 2 I (θ0 ) − nD(f ||g)2 ≈ nδ 2 I (θ0 ) − n
n
i=1 g(Xi ) > λ(n)
H0 = α. P n i=1 f (Xi ) Under the alternative hypothesis H1 , from (5.16) in Theorem 5.3, D(g||f ) ≈
δ2 I (θ0 ) 2
and from Theorem 5.2 the asymptotic mean and variance of the log-likelihood function under H1 are
154
5 Efficiency of Statistical Hypothesis Test Procedures
nD(g||f ) ≈
nδ 2 I (θ0 ) 2
nδ 2 I (θ0 ) − nD(g||f )2 ≈ nδ 2 I (θ0 ) − n
δ4 I (θ0 )2 ≈ nδ 2 I (θ0 ). 4
Hence, the asymptotic distribution of statistic (5.10) is 2 nδ I (θ0 ) N nD(g||f ), nδ 2 I (θ0 ) ≈ N , nδ 2 I (θ0 ) . 2 Then, we have n i=1 g(Xi ) P n > λ(n)|H1 ≈ Q zα − nδ 2 I (θ0 ) i=1 f (Xi )
(5.18)
This completes the theorem. With respect to the Fisher information, we have the following theorem.
Theorem 5.5 For random variable X and a function of the variable Y = η(X ), let f (x|θ ) and f ∗ (y|θ ) be the density or probability functions of X and Y, respectively; and let If (θ ) and If ∗ (θ ) be the Fisher information of f (x|θ) and that of f ∗ (y|θ ), respectively. Under the same condition as in Theorem 5.3, it follows that If (θ ) ≥ If ∗ (θ ). The equality holds true if and only if y = η(x) is a one-to-one function. Proof Since variable Y = η(X ) is a function of X, in this sense, it is a restricted version of X. From this, we have D(f (x|θ)||f (x|θ + δ)) ≥ D f ∗ (y|θ)||f ∗ (y|θ + δ) .
(5.19)
From Theorem 5.3, If (θ ) D(f (x|θ)||f (x|θ + δ)) → (δ → 0), 2 δ 2 If ∗ (θ ) D(f ∗ (y|θ)||f ∗ (y|θ + δ)) → (δ → 0), 2 δ 2 Hence, the theorem follows.
As shown in the above discussion, the test power depends on the KL and the Fisher information. For random variables, the efficiency for testing hypotheses can be defined according to entropy.
5.4 Likelihood Ratio Test and the Kullback Information
155
Definition 5.5 (Entropy-basede efficiency) Let f (x|θ0 ) and f (x|θ0 + δ) be the density or probability functions of random variable X under hypotheses H0 : θ = θ0 versus H1 : θ = θ0 + δ, respectively. Then, the entropy-based efficiency (EE) is defined by EE(X ; θ0 , θ0 + δ) = D(f (x|θ0 + δ)||f (x|θ0 )).
(5.20)
Definition 5.6 (Relative entropy-based efficiency) Let f ∗ (y|θ0 ) and f ∗ (y|θ0 + δ) be the probability or density functions of Y = η(X ) for testing hypotheses H0 : θ = θ0 and H1 : θ = θ0 + δ, respectively. Then, the relative entropy-based efficiency (REE) of the entropy-based efficiency of Y = η(X ) for that of X is defined as follows: REE(Y |X ; θ0 , θ0 + δ) =
D(f ∗ (y|θ0 + δ)||f ∗ (y|θ0 )) . D(f (x|θ0 + δ)||f (x|θ0 ))
(5.21)
From (5.19) in Theorem 5.5, we have 0 ≤ REE(Y |X ) ≤ 1. Let X1 , X2 , . . . , Xn be the original random samples and let Y1 , Y2 , . . . , Yn be the transformed random variables by function Y = η(X ). Then, the REE (5.21) gives the relative efficiency of the likelihood ratio test with samples Y1 , Y2 , . . . , Yn for that with the original samples X1 , X2 , . . . , Xn . Hence, the log-likelihood ratio statistic (5.10) has the maximum of EE for testing hypotheses H0 : θ = θ0 versus H1 : θ = θi . Moreover, from Theorem 5.5, we have If ∗ (θ0 ) D(f ∗ (y|θ0 + δ)||f ∗ (y|θ0 )) = . δ→0 D(f (x|θ0 + δ)||f (x|θ0 )) If (θ0 )
lim REE(Y |X ; θ0 , θ0 + δ) = lim
δ→0
(5.22)
By using the above discussion, the REE is calculated in the following example. Example 5.3 Let X be a random variable that is distributed according to normal distribution with mean μ and variance σ 2 . In testing the null hypothesis H0 : μ = 0 and the alternative hypothesis H1 : μ = δ, let X ∗ be the dichotomized variable such as 0 (X < 0) ∗ X = . 1 (X ≥ 0) Then, the entropy-based relative efficiency of the test with the dichotomized variable X ∗ for that with the normal variable X is calculated. Since the normal density function with mean μ and variance σ 2 is 1 1 f (x) = √ exp − 2 (x − μ)2 , 2σ 2π σ the Fisher information with respect to mean δ is
156
5 Efficiency of Statistical Hypothesis Test Procedures
If (δ) = E
∂ logf (X ) ∂δ
2
=E
X −δ − 2 σ
2 =
1 = If (0). σ2
The distribution of dichotomized variable X ∗ is P X =1 =
∗
+∞ 0
1 2 exp − 2 (x − δ) dx. √ 2σ 2π σ 1
(5.23)
For simplicity of the discussion, we denote the above probability as p(δ). From this, the distribution f ∗ is the Bernoulli distribution with the positive probability (5.23) and the probability function is expressed as follows: P X ∗ = x = p(δ)x (1 − p(δ))1−x . The Fisher information of this distribution with respect to δ is
2 ∂ ∗ X log p(δ) + 1 − X ∗ log(1 − p(δ)) If ∗ (δ) = E ∂δ 2 2 ∂ ∂ log p(δ) + (1 − p(δ)) log(1 − p(δ)) = p(δ) ∂δ ∂δ 2 2 ∂ ∂ 1 1 p(δ) + = (1 − p(δ)) p(δ) ∂δ 1 − p(δ) ∂δ 2 2 ∂ ∂ 1 1 p(δ) + p(δ) = p(δ) ∂δ 1 − p(δ) ∂δ 2 ∂ 1 p(δ) = p(δ)(1 − p(δ)) ∂δ Since ∂ p(δ) = − ∂δ
+∞ 0
1 x−δ 1 1 δ2 2 exp − 2 (x − δ) dx = √ exp − 2 , √ σ2 2σ 2σ 2π σ 2π σ
we have 2 1 δ2 1 exp − 2 √ p(δ)(1 − p(δ)) 2σ 2πσ 2 2 1 1 δ × = exp − 2 → p(δ)(1 − p(δ)) 2π σ 2 σ πσ2 = If ∗ (0) (δ → 0).
If ∗ (δ) =
5.4 Likelihood Ratio Test and the Kullback Information
157
From this, the REE (5.21) is If ∗ (0) = lim REE X ∗ |X ; θ0 , θ1 + δ = δ→0 If (0)
2 πσ 2 1 σ2
=
2 . π
The above REE is the relative efficiency of the likelihood ratio test with dichotomized samples X1∗ , X2∗ , . . . , Xn∗ for that with the original samples X1 , X2 , . . . , Xn . By using the above example, the relative Pitman efficiency is computed on the basis of Theorem 5.3. Example 5.4 In Example 5.3, let H0 : θ = θ0 be null hypothesis; and let H1 : θ = θ0 + √hn (= θn ) be the alternative hypothesis. In Theorem 5.4, setting θ0 = 0 and δ = √hn , from (5.18) in Theorem 5.3, for sufficiently large sample size n, the asymptotic power of the likelihood ratio test statistic based onnormal vari ∗ (1) able X and the dichotomized variable X are γ − h2 If (0) and ≈ Q z (θ ) n α n θNn ≈ Q zα − Nnn h2 If ∗ (0) , respectively. Assuming γN(2) n γn(1) (θn ) = γN(2) θNn , n we have h2 If (0) =
Nn 2 h If ∗ (0). n
Hence, the relative Pitman efficiency (RPE) is If ∗ (0) 2 n = . = RPE X ∗ |X = logn→∞ Nn If (0) π The above RPE is equivalent to REE in Example 5.3. Example 5.5 In Example 5.3, the relative Bahadur efficiency (RBE) is considered. Let H0 : μ = 0 be null hypothesis; and let H1 : μ = δ be the alternative hypothesis. Let Tn(1) (X) =
1 Xi , n i=1
Tn(2) (X) =
1 ∗ X . n i=1 i
n
n
Then, under H1 : μ = δ, it follows that
158
5 Efficiency of Statistical Hypothesis Test Procedures
1 − Fn Tn(1) (X)|μ = 0 =
+∞ √ n n exp − 2 t 2 dt √ 2σ 2π σ 2
X
√
+∞ =
√ 1 δ+O n− 2
n exp − 2 t 2 dt 2σ 2π σ 2 n
and 1−
Fn Tn(2) (X)|μ
+∞ √ 2n 1 2 =0 = dt √ exp −2n t − 2 π
Tn(2) (X)
2n 1 2 dt, √ exp −2n t − 2 π
√
+∞ = 1 p(δ)+O n− 2
where +∞ p(δ) = 0
1 2 exp − 2 (x − δ) dx. √ 2σ 2π σ 2 1
From the above results, as in Example 5.2, we have −2log 1 − Fn Tn(1) (X)|μ = 0 δ2 → c1 (δ, 0) = 2 , σ n −2log 1 − Fn Tn(2) (X)|μ = 0 1 2 → c2 (δ, 0) = 4 p(δ) − (n → ∞). n 2 Hence, RBE is computed as follows: 2 4 4 p(δ) − 21 c2 (δ, 0) ∗ RBE X |X = = → 2 δ c1 (δ, 0) σ2
√ 1 2πσ 2 1 σ2
2 =
2 (δ → 0). π
Hence, the RBE is equal to REE. From Examples 5.3, 5.4, and 5.5, the medium test for testing the mean of the normal distribution, the relative entropy-based efficiency is the same as the relative Pitman and the relative Bahadur efficiencies. Under appropriate assumptions [2], as in the above example, it follows that
5.4 Likelihood Ratio Test and the Kullback Information
159
Ln (Tn (X)|θ0 ) → 2D(g||f ) (n → ∞). n
5.5 Information of Test Statistics In this section, test statistics are considered in a viewpoint of entropy. Let X1 , X2 , . . . , Xn be random samples; let Tn (X1 , X2 , . . . , Xn ) = Tn (X) be a test statistic; and let ϕn (t|θ ) be the probability or density function of statistic Tn (X). Then, the following definition is made. Definition 5.7 (Entropy-based efficiency) For test hypotheses H0 : θ = θ0 and H1 : θ = θ0 + δ, the entropy-based efficiency of test statistic Tn (X) is defined by EE(Tn ; θ0 , θ0 + δ) = D(ϕn (t|θ0 + δ)||ϕn (t|θ0 )).
(5.24)
Definition 5.8 (Relative entropy-based efficiency) Let Tn(1) (X) and Tn(2) (X) be test statistics for testing H0 : θ = θ0 versus H1 : θ = θ0 + δ, and let ϕn(1) (t|θ ) and ϕn(2) (t|θ) be the probability or density functions of the test statistics, respectively. Then, the relative entropy-based efficiency of Tn(2) (X) for Tn(1) (X) is defined by (2) (1) EE Tn(2) ; θ0 , θ0 + δ , REE Tn |Tn ; θ0 , θ0 + δ = EE Tn(1) ; θ0 , θ0 + δ
(5.25)
and the relative entropy-based efficiency of the procedure Tn(2) (X) for Tn(1) (X) is defined as REE T (2) |T (1) ; θ0 , θ0 + δ = lim REE Tn(2) |Tn(1) ; θ0 , θ0 + δ . n→∞
(5.26)
According to the discussions in the previous sections, to test the hypotheses H0 : θ = θ0 and H1 : θ = θ0 + δ at significance level α, the test procedure with statistic Tn (X) is the most powerful to use the following critical region:
ϕn (t|θ0 + δ) > λ(n) , W0 = (x1 , x2 , . . . , xn )
ϕn (t|θ0 ) where P(W0 |θ0 ) = α. With respect to REE (5.25), we have the following theorem:
160
5 Efficiency of Statistical Hypothesis Test Procedures
Theorem 5.6 Let X1 , X2 , . . . , Xn be random samples from distribution f (x|θ ); let Tn(1) = Tn(1) (X1 , X2 , . . . , Xn ) be the lo- likelihood ratio test statistic for testing H0 : θ = θ0 versus H1 : θ = θ0 + δ; Tn(2)= Tn(2) (X1 , X2 , . . . , Xn ) be a test statistic of
which the asymptotic distribution is N η(θ ), σn , where η(θ ) is an increasing and differentiable function of parameter θ . Then, the relative entropy-based efficiency of the test T (2) for the likelihood ratio test T (1) is 2
η (θ0 )2 lim REE T (2) |T (1) ; θ0 , θ0 + δ = ≤ 1, δ→0 I (θ0 )σ 2
(5.27)
where I (θ ) =
d f (x|θ)log logf (x|θ ) dθ
2 dx.
Proof For testing H0 : θ = θ0 versus H1 : θ = θ0 + δ, the asymptotic distribution of the log-likelihood ratio statistic Tn(1) is N −nD(f ||g), nδ 2 I (θ0 ) − nD(f ||g)2 under H0 ,and N nD(g||f ), nδ 2 I (θ0 + δ) − nD(g||f )2 under H1 from Theorem 5.2. Let nor x|μ, σ 2 be the normal density function with mean μ and variance σ 2 . Then, for small δ the entropy-based efficiency of Tn(2) is σ2 σ2 ||nor x|η(θ0 ), EE Tn(2) ; θ0 , θ0 + δ = D nor x|η(θ0 + δ), n n =
n(η(θ0 + δ) − η(θ0 ))2 . 2σ 2
and from Theorem 5.3, for large sample size n, we have nI (θ0 )δ 2 . EE Tn(1) ; θ0 , θ0 + δ = nD(f (x|θ0 )||f (x|(θ0 + δ))) ≈ 2 From this, the relative efficiency of the test with Tn(2) is (2) (1) EE Tn(2) ; θ0 , θ1 + δ REE T |T ; θ0 , θ0 + δ = lim n→∞ EE Tn(1) ; θ0 , θ1 + δ D nor x|η(θ0 + δ), σ 2 nor x|η(θ0 ), σ 2 = lim n→∞ nD f x|θ0 + δ, σ 2 f x| θ0 , σ 2 (η(θ0 +δ)−η(θ0 ))2 η(θ0 + δ) − η(θ0 ) 2 1 2σ 2 = · = I (θ0 )δ 2 δ I (θ0 )σ 2 2
Hence, we have
5.5 Information of Test Statistics
161 (η(θ0 +δ)−η(θ0 ))2 2σ 2 I (θ0 )δ 2 δ→0 2
lim REE T (2) |T (1) ; θ0 , θ0 + δ = lim
δ→0
=
η (θ0 )2 . I (θ0 )σ 2
For Tn(2) , η−1 Tn(2) is asymptotically normally distributed according to statistic 2 nor x|θ0 , nη σ(θ )2 , and for Fisher information I (θ0 ), we have 0
1 σ2 . ≥ nI (θ0 ) nη (θ0 )2 From this, we obtain η (θ0 )2 ≤ 1. I (θ0 )σ 2 This completes the theorem.
From Theorem 5.6, we have the following theorem: Theorem 5.7 Let θˆ be the maximum likelihood estimator of θ . Then, “the test of null hypothesis H0 : θ = θ0 versus alternative hypothesis H1 : θ = θ0 + δ” based on θˆ is asymptotically equivalent to the likelihood ratio test (5.9). 1 Proof For a large sample, the asymptotic distribution of θˆ is N θ0 , nI (θ under H0 0) 1 2 and N θ0 + δ, nI (θ0 +δ) under H1 . In Theorem 5.6, we set η(θ ) = θ and σ = I (θ10 ) , it follows η (θ ) = 1 in (5.27). Let T be the log-likelihood function in (5.9). Then, we have lim REE θˆ |T ; θ0 , θ0 + δ = 1. δ→0
This completes the theorem.
Moreover, in general, with respect to the likelihood ratio test, the following theorem holds. Theorem 5.8 Let X1 , X2 , . . . , Xn be random samples from probability or density function f (x|θ0 ) under H0 : θ = θ0 and from f (x|θ0 + δ) under H1 : θ = θ0 + δ. Let Tn(1) = T (1) (X1 , X2 , . . . , Xn ) be the log-likelihood ratio test statistic; let Tn(2) = T (2) (X1 , X2 , . . . , Xn ) be any test statistic and let ϕn (t|θ0 ) and ϕn (t|θ0 + δ) be the probability or density function of Tn(2) under H0 : θ = θ0 and H1 : θ = θ0 + δ, respectively. Then, D(ϕn (t|θ0 + δ)||ϕn (t|θ0 )) ≤ 1. REE T (2) |T (1) ; θ0 , θ1 + δ = nD(f (x|θ0 + δ)||f (x|θ0 ))
(5.28)
162
5 Efficiency of Statistical Hypothesis Test Procedures
Proof For simplicity of the discussion, the proof is made for the continuous random samples Xi . Then, we have D(ϕn (t|θ0 )||ϕn (t|θ0 + δ)) =
ϕn (t|θ0 )log
=
log
d dt
n
ϕn (t|θ0 ) dt ϕn (t|θ0 + δ)
f (xi |θ0 )dxi
Tn 0, b(θ ), andc(y, ϕ) are specific functions. p In GLM, θ is a function of linear predictor η = βT x = i=1 βi xi . Let fi (x) be marginal density or probability functions of explanatory variables Xi , and let Cov(Y , θ |Xi ) be the conditional covariances of Y and θ , given Xi . From a viewpoint of entropy, the covariance Cov(Y , θ ) can be regarded as the explained variation of Y in entropy by all the explanatory variables Xk and Cov(Y , θ |Xi ) is that excluding the effect of Xi . From this, we make the following definition. Definition 7.1 In a GLM with explanatory variables X1 , X2 , . . . , Xp (7.9), the contribution ratio of Xi for predicting response Y is defined by CR(Xi → Y ) =
Cov(Y , θ ) − Cov(Y , θ |Xi ) , i = 1, 2, . . . , p. Cov(Y , θ )
Remark 7.1 For canonical link θ = CR(Xi → Y ) =
βi Cov(Xi , Y ) −
p i=1
(7.10)
βi Xi , the contribution ratio of Xi is
Xj , Y |Xi
j =i βj Cov Xj , Y − Cov p j=1 βj Cov Xj , Y
, i = 1, 2, . . . , p.
(7.11)
If explanatory variables Xi are statistically independent, it follows that βi Cov(Xi , Y ) , i = 1, 2, . . . , p. CR(Xi → Y ) = p j=1 βj Cov Xj , Y Especially, in the ordinary linear regression model, we have β 2 Var(Xi ) CR(Xi → Y ) = p i 2 , i = 1, 2, . . . , p. j=1 βj Var Xj In this case, the denominator is the explained variance of Y by all the explanatory variables, and the numerator that by Xi .
7.4 Measuring Explanatory Variable Contributions
207
Fig. 7.5 Path diagram of explanatory variables Xi and X \i and response variable Y
Remark 7.2 In the entropy-based path analysis in Chap. 6, the calculation of contribution ratios of explanatory variables is based on the total effect of variable Xi under the assumption that variable Xi is the parent of variables X \i = T X1 , X2 , . . . , Xi−1 , Xi+1 , . . . , Xp as in Fig. 7.5. Then, the total effect of Xi on Y is defined by eT (Xi → Y ) =
Cov(Y , θ ) − Cov(Y , θ |Xi ) , i = 1, 2, . . . , p. Cov(Y , θ ) + a(ϕ)
and we have CR(Xi → Y ) =
eT (Xi → Y ) , i = 1, 2, . . . , p. eT (X → Y )
Hence, the contribution ratios of the explanatory variables Xi in the path diagrams for GLMs such as Fig. 7.1a, b, and c are the same as that in the case described in path diagram Fig. 7.5. Lemma 7.1 Let X = (X1 , X2 )T be a factor or explanatory variable vector; let Y be a response variable; let f (x1 , x2 , y) be the joint probability or density function of X and Y; let f1 (y|x1 ) be the conditional probability or density function of Y given X1 = x1 ; let g(x1 , x2 ) be the joint probability or density function of X; and let f (y) be the marginal probability or density function of Y. Then, ˚ f1 (y|x1 )g(x1 , x2 ) log f (x1 , x2 , y)dx1 dx2 dy ˚ ≥ f (y)g(x1 , x2 ) log f (x1 , x2 , y)dx1 dx2 dy,
(7.12)
where for discrete variables the related integrals are replaced by the summations. Proof For simplification, the Lemma is proven in the case of the continuous distribution. Let ˚ h= f1 (y|x1 )g(x1 , x2 ) log f (x1 , x2 , y)dx1 dx2 dy ˚ − f (y)g(x1 , x2 ) log f (x1 , x2 , y)dx1 dx2 dy. (7.13)
208
7 Measurement of Explanatory Variable Contribution in GLMs
Under the following constraint: ˚ f (x1 , x2 , y)dx1 dx2 dy = 1, the left side of (7.12) is minimized with respect to f (x1 , x2 , y). For Lagrange multiplier λ, let ˚ L=h−λ f (x1 , x2 , y)dx1 dx2 dy ˚ = f1 (y|x1 )g(x1 , x2 ) log f (x1 , x2 , y)dx1 dx2 dy ˚ − f (y)g(x1 , x2 ) log f (x1 , x2 , y)dx1 dx2 dy ˚ −λ f (x1 , x2 , y)dx1 dx2 dy. Differentiating L with respect to f (x1 , x2 , y), we have f1 (y|x1 )g(x1 , x2 ) f (y)g(x1 , x2 ) ∂L = − ∂f (x1 , x2 , y) f (x1 , x2 , y) f (x1 , x2 , y) ˚ ∂ f (x1 , x2 , y)dx1 dx2 dy −λ ∂f (x1 , x2 , y) f1 (y|x1 )g(x1 , x2 ) f (y)g(x1 , x2 ) − − λ = 0. = f (x1 , x2 , y) f (x1 , x2 , y)
(7.14)
From (7.14), it follows that f1 (y|x1 )g(x1 , x2 ) − f (y)g(x1 , x2 ) = λf (x1 , x2 , y),
(7.15)
Since ˚
˚ f1 (y|x1 )g(x1 , x2 )dx1 dx2 dy = =
˚
f (y)g(x1 , x2 )dx1 dx2 dy f (x1 , x2 , y)dx1 dx2 dy = 1,
Integrating the both sides of Eq. (7.15) with respect to x1 , x2 and y, we have λ = 0. From this, we obtain f1 (y|x1 ) = f (y), and it derives h = 0 as the minimum value of h (7.13). This completes the Lemma.
7.4 Measuring Explanatory Variable Contributions
209
Remark 7.3 Let X = (X1 , X2 )T be a factor vector with x1i , x2j , i = levels 1, 2, . . . , I ; j = 1, 2, . . . , J ; let nij be sample sizes at x1i , x2j ; and let ni+ = J I I J nij , n+j = nij , n = nij . Then, Lemma 7.1 holds by setting as follows: j=1
i=1
i=1 j=1
J nij nij ni+ 1 g x1i , x2j = , g1 (x1i ) = , f1 (y|x1i ) = , f y|x1i , x2j n n ni+ j=1 n
f (y) =
I J nij f y|x1i , x2j . n i=1 j=1
T T Remark 7.4 Let X = X1 , X2 , . . . , Xp ; let X (1) = X1 , X2 , . . . , Xq and let X (2) = T Xq+1 , Xq+2 , . . . , Xp . Then, by replacing X1 and X2 in Lemma 7.1 for X (1) and X (2) , respectively, the inequality (7.12) holds true. From Lemma 7.1, we have the following theorem: Theorem 7.1 In the GLM (7.9) with explanatory variable vector X T X1 , X2 , . . . , Xp , Cov(θ, Y ) Cov(θ, Y |Xi ) − ≥ 0, i = 1, 2, . . . , p. a(ϕ) a(ϕ)
=
(7.16)
Proof For simplicity of the discussion, the theorem is proven for continuous variables in the case of p = 2 and i = 1. Since Cov(θ, Y ) = a(ϕ)
˚ (f (x1 , x2 , y) − f (y)g(x1 , x2 )) log f (x1 , x2 , y)dx1 dx2 dy
and Cov(θ, Y |Xi ) = a(ϕ)
˚ (f (x1 , x2 , y) − f1 (y|x1 )g(x1 , x2 )) log f (x1 , x2 , y)dx1 dx2 dy,
we have Cov(θ, Y ) Cov(θ, Y |Xi ) − a(ϕ) a(ϕ) ˚ = (f1 (y|x1 )g(x1 , x2 ) − f (y)g(x1 , x2 ))logf (x1 , x2 , y)dx1 dx2 dy. From Lemma 7.1, the theorem follows. From the above theorem, we have the following theorem.
210
7 Measurement of Explanatory Variable Contribution in GLMs
T Theorem 7.2 In the GLM (7.9) with X = X1 , X2 , . . . , Xp , 0 ≤ CR(Xi → Y ) ≤ 1, i = 1, 2, . . . , p.
(7.17)
Proof Since Cov(θ, Y |Xi ) ≥ 0, from (7.16) Theorem 7.1 shows that 0 ≤ Cov(θ, Y ) − Cov(θ, Y |Xi ) ≤ Cov(θ, Y ).
Thus, from (7.10) the theorem follows.
Remark 7.4 Contribution ratio (7.10) can be interpreted as the explained variation of response Y by factor or explanatory variable Xi . In a GLM (7.9) with a canonical link, if explanatory variables Xi are independent, then Cov(θ, Y ) − Cov(θ, Y |Xi ) = βi Cov(Xi , Y ) ≥ 0. and we have CR(Xi → Y ) =
βi Cov(Xi , Y ) βi Cov(Xi , Y ) . = p Cov(θ, Y ) j=1 βj Cov Xj , Y
(7.18)
Remark 7.5 Let n be sample size; let
Cov(θ, Y ) − Cov(θ, Y |Xi ) ; χ 2 (Xi ) = n a ϕˆ
of regression parameters related to Xi , where Cov(θ, Y ), and let τi be the number Cov(θ, Y |Xi ), a ϕˆ are the ML estimators of Cov(θ, Y ), Cov(θ, Y |Xi ), and a(ϕ), respectively. Then, for sufficiently large sample size n, χ 2 (Xi ) is asymptotically non-central chi-square distribution with degrees of freedom τi and the asymptotic non-centrality parameter is
λ=n
Cov(θ, Y ) − Cov(θ, Y |Xi ) . a(ϕ)
(7.19)
In GLMs, after a model selection procedure, there may be cases in which some explanatory variables have no direct paths to the response variables, that is, the related regression coefficients are zeros. One of the examples is depicted in Fig. 7.1c. For such cases, we have the following theorem: Theorem 7.3 In GLM (7.9), let the explanatory variable vector X be decomposed into two sub-vectors X (1) with the direct paths to response variable Y and X (2) without the direct paths to Y. Then, Cov(θ, Y ) . KL(Y , X) = KL Y , X (1) = a(ϕ)
(7.20)
7.4 Measuring Explanatory Variable Contributions
211
Proof Let f (y|x) be the conditional probability or density functions of Y given X = x and let f1 y|x(1) be that given X (1) = x(1) . In the GLM, the linear predictor is a linear combination of X (1) and it means that f (y|x) = f1 y|x(1) .
From this, the theorem follows.
7.5 Numerical Illustrations Example 1 (continued) According to a model selection procedure, the estimated regression coefficients, their standard errors, and the factor contributions (7.10) for explanatory variables are reported in Table 7.5. The estimated linear predictor θ is θ = 1.522X1 + 8.508X3 , where R2 (= ECD) = 0.932. In this case, as illustrated in Fig. 7.6, there does not exist the direct path X2 to Y. According to (7.10), the factor contribution in the model is calculated in Table 7.5. This table shows that BVS (X1 ) has the highest individual contribution and that the contribution of this variable is 1.3 times greater than INC (X3 ). Although FY1 (X2 ) does not have the direct path to PRICE (Y ), when considering the association among the explanatory variables, the contribution is high, i.e. 0.889. These conclusions are consistent with other more specific empirical studies about the Ohlson’s model [6]. Table 7.5 Estimated regression coefficients and contribution ratios Valuable
Regression coefficient
SE
CR(Xk → Y )
SE
Intercept
−1.175
1.692
/
/
BVS (X1 )
1.522
0.149
0.981
0.121
BY1 (X2 )
/
/
0.889
0.115
INC (X3 )
0.858
3.463
0.669
0.100
Fig. 7.6 Path diagram for the final model for explaining PRICE
212
7 Measurement of Explanatory Variable Contribution in GLMs
Example 2 The present discussion is applied to the ordinary two-way layout experimental design model. Before analyzing the data in Table 7.2, a general two-way layout experiment model is formulated in a GLM framework. Let X1 and X2 be factors with levels {1, 2, . . . , I } and {1, 2, . . . , J }, respectively. Then, the linear predictor is a function of (X1 , X2 ) = (i, j), i.e. θ = μ + αi + βj + (αβ)ij .
(7.21)
where I i=1
αi =
J
βj =
j=1
I J (aβ)ij = (aβ)ij = 0. i=1
j=1
Let Xki =
1 (Xk = i) , k = 1, 2. 0 (Xk = i)
In experimental designs, factor levels are randomly assigned to subjects, so the above dummy variables can be regarded as independent random variables such that P(X1i = 1) =
1 1 , i = 1, 2, . . . , I ; P X2j = 1 = , j = 1, 2, . . . , J .. I J
Dummy vectors X 1 = (X11 , X12 , . . . , X1I )T and X 2 = (X21 , X22 , . . . , X2J )T are identified with factors X1 and X2 , respectively. From this, the systematic component of model (7.21) can be written as follows: θ = μ + αTX 1 + β TX 2 + γ TX 1 ⊗ X 2 , where α = (α1 , α2 , . . . , αI )T , β = (β1 , β2 , . . . , βJ )T , T γ = (αβ)11 , (αβ)12 , . . . , (αβ)1J , . . . , (αβ)IJ , and X 1 ⊗ X 2 = (X11 X21 , X11 X22 , . . . , X11 X2J , . . . , X1I X2J )T .
(7.22)
7.5 Numerical Illustrations
213
Let Cov(X 1 , Y ), Cov(X 2 , Y ), and Cov(X 1 ⊗ X 2 , Y ) are covariance matrices, then, the total effect of X 1 and X 2 is Cov(θ, Y ) α T Cov(X 1 , Y ) β T Cov(X 2 , Y ) γ T Cov(X 1 ⊗ X 2 , Y ) = + + 2 2 2 σ σ σ σ2 J 1 J 1 I 2 2 1 I 2 β (αβ) αi j=1 j i=1 j=1 ij J IJ = I i=1 + + . σ2 σ2 σ2
KL(Y , X) =
(7.23)
The above three terms are referred to as the main effect of X1 , that of X2 and the interactive effect, respectively. Then, ECD is calculated as follows: Cov(θ, Y ) Cov(θ, Y ) + σ 2 J 1 I 1 J 1 I 2 2 2 i=1 αi + J j=1 βj + IJ i=1 j=1 (αβ)ij I = R2 . = 1 I J I J 2 2+ 1 2+ 1 2 α β + σ i=1 i j=1 j i=1 j=1 (αβ)ij I J IJ
ECD((X1 , X2 ), Y ) =
(7.24)
In this case, since Cov(θ, Y |X 1 ) =
J I J 1 1 2 βj + (αβ)2ij , J j=1 IJ i=1 j=1
(7.25)
Cov(θ, Y |X 2 ) =
I I J 1 1 2 αi + (αβ)2ij , I i=1 IJ i=1 j=1
(7.26)
we have eT (X1 → Y ) = =
Cov(θ, Y ) − Cov(θ, Y |X 1 ) Cov(θ, Y ) + σ 2 1 I 1 I I
eT (X2 → Y ) =
i=1
1 I I
i=1
αi2 + αi2 +
1 J
1 J
2 i=1 αi I 1 2 j=1 βj + IJ i=1
J
2 j=1 βj I 1 2 j=1 βj + IJ i=1
J
J
J
I
1 J
J
2 j=1 (αβ)ij
2 j=1 (αβ)ij
+ σ2 + σ2
,
(7.27)
.
(7.28)
From the above results, the contributions of X1 and X2 are given by Cov(θ, Y ) − Cov(θ, Y |X 1 ) Cov(θ, Y ) 1 I 2 i=1 αi I = 1 I , J J 2 1 1 I 2 2 i=1 αi + J j=1 βj + IJ i=1 j=1 (αβ)ij I 1 J 2 j=1 βj J CR(X2 → Y ) = 1 I . J J 2 1 1 I 2 2 i=1 αi + J j=1 βj + IJ i=1 j=1 (αβ)ij I
CR(X1 → Y ) =
(7.29)
(7.30)
214
7 Measurement of Explanatory Variable Contribution in GLMs
The above contribution ratios correspond to those of main effects of X1 and X2 , respectively. The total effect of factors X1 and X2 is given by (7.24), i.e. eT ((X1 , X2 ) → Y ) = ECD((X1 , X2 ), Y ). From this, the following effect is referred to as the interaction effect: eT ((X1 , X2 ) → Y ) − eT (X1 → Y ) − eT (X2 → Y ) J 2 1 I i=1 j=1 (αβ)ij IJ = 1 I . J 2 1 J 1 I 2 2 2 i=1 αi + J j=1 βj + IJ i=1 j=1 (αβ)ij + σ I From the present data, we have
Cov(θ, Y ) = 100.6, Cov(θ, Y |X 1 ) = 30.3, Cov(θ, Y |X 2 ) = 61.4.
From this, it follows that
CR(X1 → Y ) =
Cov(θ, Y ) − Cov(θ, Y |X 1 )
Cov(θ, Y )
=
80.2 − 30.3 = 0.622, 80.2
CR(X2 → Y ) = 0.234. Example 3 (continued) The data in Table 7.4 is analyzed and the explanatory variable contributions are calculated. From the ML estimators of the parameters, we have
Cov(θ, Y ) = 0.539, Cov(θ, Y |X 1 ) = 0.323, Cov(θ, Y |X 2 ) = 0.047,
(7.31)
we have 0.539 = 0.350 (= eT ((X1 , X2 ) → Y )), 0.539 + 1 0.539 − 0.323 = 0.140, eT (X1 → Y ) = 0.539 + 1 0.539 − 0.047 eT (X2 → Y ) = = 0.320 0.539 + 1 0.140 eT (X1 → Y ) CR(X1 → Y ) = = 0.401, = eT ((X1 , X2 ) → Y ) 0.350 0.320 eT (X2 → Y ) = 0.913. CR(X2 → Y ) = = eT ((X1 , X2 ) → Y ) 0.350 ECD =
From ECD, 35% of variation of Y in entropy is explained by the two explanatory variables, and from the above results, the contribution of alcohol use X1 on marijuana
7.5 Numerical Illustrations
215
use Y is 40.1%, whereas that of cigarette use X2 on marijuana use Y is 91.3%. The effect of cigarette use X2 on marijuana use Y is about two times greater than that of alcohol use X1 .
7.6 Application to Test Analysis Let Xi , i = 1, 2, . . . , p be scores of items (subjects) for a test battery, for example, subjects are English, Mathematics, Physics, Chemistry, Social Science, and so on. In usual, test score Y =
p
Xi
(7.32)
i=1
is used for evaluating students’ learning levels. In order to assess contributions of items to the test score, the present method is applied. Let us assume the following linear regression model: Y =
p
Xi + e,
i=1
where e is an error term according to N 0, σ 2 and independent of item scores Xi . In a GLM framework, we set θ=
p
Xi
i=1
from (7.10), and we have Var(θ ) − Var(θ |Xi ) Cov(Y , θ ) − Cov(Y , θ |Xi ) = Cov(Y , θ ) Var(θ ) Var(Y ) − Var(Y |Xi ) as σ 2 → 0 , i = 1, 2, . . . , p. → Var(Y )
CR(Xi → Y ) =
Hence, in (7.32), the contributions of Xi to Y is calculated by CR(Xi → Y ) =
Var(Y ) − Var(Y |Xi ) = Corr(Y , Xi )2 , i = 1, 2, . . . , p. Var(Y )
(7.33)
216
7 Measurement of Explanatory Variable Contribution in GLMs
Table 7.6 Test data for five test items Subject
Japanese X1
Social X3
Mathematics X4
1
64
English X2 65
83
69
Science X5 70
351
Total Y
2
54
56
53
40
32
235
3
80
68
75
74
84
381
4
71
65
40
41
68
285
5
63
61
60
56
80
320
6
47
62
33
57
87
286
7
42
53
50
38
23
206
8
54
17
46
58
58
233
9
57
48
59
26
17
207
10
54
72
58
55
30
269
11
67
82
52
50
44
295
12
71
82
54
67
28
302
13
53
67
74
75
53
322
14
90
96
63
87
100
436
15
71
69
74
76
42
332
16
61
100
92
53
58
364
17
61
69
48
63
71
312
18
87
84
64
65
53
353
19
77
75
78
37
44
311
20
57
27
41
54
30
209
Source [1, 18]
Table 7.6 shows a test data with five items. Under the normality of the data, the above discussion is applied to analyze the test data. From this table, by using (7.33) contribution ratios CR(Xi → Y ), i = 1, 2, 3, 4, 5 are calculated as follows: CR(X1 → Y ) = 0.542, CR(X2 → Y ) = 0.568, CR(X3 → Y ) = 0.348, CR(X4 → Y ) = 0.542, CR(X5 → Y ) = 0.596. Except for Social Science X3 , the contributions of the other subjects are similar, and the correlations between Y and Xi are strong. The contributions are illustrated in Fig. 7.7.
7.7 Variable Importance Assessment In regression analysis, any causal order in explanatory variables is not assumed in usual, so in assessing the explanatory variable contribution in GLMs, as explained above, the calculation of the contribution of the explanatory variables is made as if
7.7 Variable Importance Assessment
217
Japanese 0.6 0.5 0.4 0.3 0.2
Science
English
0.1 0
Mathematics
Social
Fig. 7.7 Radar chart of the contributions of five subjects to the total score
explanatory variable Xi were the parent of the other variables, as shown in Fig. 7.5. If the explanatory variables are causally ordered, e.g. X1 → X2 → . . . → Xp → Y = Xp+1 ,
(7.34)
applying the path analysis in Chap. 6, from (6.46) and (6.47) the contributions of explanatory variables Xi are computed as follows:
\1,2,...,i−1 \1,2,...,i KL X pa(p+1) , Y |X pa(i) − KL X pa(p+1) , Y |X pa(i+1) eT (Xi → Y ) = CR(Xi → Y ) = , eT (X → Y ) KL X pa(p+1) , Y
(7.35)
T where X = X1 , X2 , . . . , Xp . Below, contributions CR(Xi → Y ) are merely described as CR(Xi ) as far as not confusing the response variable Y. Then, it follows that p
CR(Xi ) = 1.
(7.36)
i=1
With respect to measures for variable importance assessment in ordinary linear regression models, the following criteria for R2 -decomposition are listed [12]: (a) Proper decomposition: the model variance is to be decomposed into shares, that is, the sum of all shares has to be the model variance. (b) Non-negativity: all shares have to be non-negative.
218
7 Measurement of Explanatory Variable Contribution in GLMs
(c) Exclusion: the share allocated to a regressor Xi with βi = 0 should be 0. (d) Inclusion: a regressor Xi with βi = 0 should receive a non-zero share. In the GLM framework, the model variance in (a) and R2 are substituted for KL(Y , X) and ECD(Y , X), respectively, and from Theorem 7.3, we have ECD(Y , X) = ECD Y , X (1) , where X (1) is the subset of all the explanatory variables with non-zero regression coefficients βi = 0 in X. Hence, the explanatory power of X and X (1) are the same. From this, in the present context, based on the above criteria, the variable importance is discussed. In which follows, it is assumed all the explanatory variables X1 , X2 , . . . , Xp have non-zero regression coefficients. In GLMs with explanatory variables that have no causal ordering, the present entropy-based path analysis is applied to rel ative importance assessment of explanatory variables. Let U = X1 , X2 , . . . , Xp ; r = r1 , r2 , . . . , rp a permutation of explanatory variable indices {1, 2, . . . , p}; Si (r) be the parent variables that appear before Xi in permutation r; and let Ti (r) = U \Si (r). Definition 7.2 For causal ordering r = r1 , r2 , . . . , rp in explanatory variables
U = X1 , X2 , . . . , Xp , the contribution ratio of Xi is defined by CRr (Xi ) =
KL(Ti (r), Y |Si (r)) − KL(Ti (r)\{Xi }, Y |Si (r)∪{Xi }) . KL(X, Y )
(7.37)
Definition 7.3 The degree of relative importance of Xi is defined as RI(Xi ) =
1 CRr (Xi ), p! r
(7.38)
where the summation implies the sum of CRr (Xi )’s over all the permutations r = r1 , r2 , . . . , rp . From (7.37), we have CRr (Xi ) > 0, i = 1, 2, . . . , p; p
CRr (Xi ) = 1 for any permutation r.
i=1
Hence, from (7.38) it follows that RI(Xi ) > 0, i = 1, 2, . . . , p; p i=1
RI(Xi ) = 1.
(7.39)
(7.40)
7.7 Variable Importance Assessment
219
Remark 7.6 In GLMs (7.9), CRr (Xi ) can be expressed in terms of covariances of θ and the explanatory variables, i.e. CRr (Xi ) = =
Cov(θ,Y |Si (r)) a(ϕ)
Cov(θ,Y |Si (r)∪{Xi }) a(ϕ) Cov(θ,Y ) a(ϕ)
−
Cov(θ, Y |Si (r)) − Cov(θ, Y |Si (r)∪{Xi }) . Cov(θ, Y )
(7.41)
Remark 7.7 In the above definition of the relative importance of explanatory variables (7.38), if Xi is the parent of the other explanatory variables, i.e. Si (r) = φ, then, (7.37) implies KL(U , Y ) − KL(Ti (r)\{Xi }, Y |Xi ) KL(X, Y ) KL X1 , X2 , . . . , Xp , Y − KL(Ti (r)\{Xi }, Y |Xi ) . = KL(X, Y )
CRr (Xi ) =
If the parents of Xi are Ti (r) = {Xi }, i.e. Si (r)∪{Xi } = U , formula (7.37) is calculated as CRr (Xi ) =
KL(Xi , Y |U \{Xi }) KL(Ti (r), Y |Si (r)) = . KL(X, Y ) KL(X, Y )
Remark 7.8 In (7.37), for any permutations r and r such that Ti (r) = Ti r or Si (r) = Si r , it follows that CRr (Xi ) = CRr (Xi ). Example 7.4 For p = 2, i.e. U = {X1 , X2 }, we have two permutations of the explanatory variables r = (1, 2), (2, 1). Then, the relative importance of the explanatory variables is evaluated as follows: RI(Xi ) =
1 CRr (Xi ), i = 1, 2, 2 r
where CR(1,2) (X1 ) =
Cov(θ, Y ) − Cov(θ, Y |X1 ) KL((X1 , X2 ), Y ) − KL(X2 , Y |X1 ) , = KL((X1 , X2 ), Y ) Cov(θ, Y ) CR(1,2) (X2 ) =
KL(X2 , Y |X1 ) , KL((X1 , X2 ), Y )
CR(2,1) (X1 ) =
KL(X1 , Y |X2 ) , KL((X1 , X2 ), Y )
220
7 Measurement of Explanatory Variable Contribution in GLMs
CR(2,1) (X2 ) =
KL((X1 , X2 ), Y ) − KL(X2 , Y |X2 ) . KL((X1 , X2 ), Y )
Hence, we have 1 CR(1,2) (X1 ) + CR(2,1) (X1 ) 2 KL((X1 , X2 ), Y ) − KL(X2 , Y |X1 ) + KL(X1 , Y |X2 ) , = 2KL((X1 , X2 ), Y )
RI(X1 ) =
RI(X2 ) =
KL((X1 , X2 ), Y ) − KL(X1 , Y |X2 ) + KL(X2 , Y |X1 ) . 2KL((X1 , X2 ), Y )
Similarly, for p = 3, i.e. U = {X1 , X2 , X3 }, we have 1 2CR(1,2,3) (X1 ) + CR(2,1,3) (X1 ) + CR(3,1,2) (X1 ) + 2CR(2,3,1) (X1 ) 3! 1 = (2KL((X1 , X2 , X3 ), Y ) − 2KL((X2 , X3 ), Y |X1 ) 6KL((X1 , X2 , X3 ), Y )
RI(X1 ) =
+ KL((X1 , X3 ), Y |X2 ) − KL(X3 , Y |X1 , X2 ) + KL((X1 , X2 ), Y |X3 ) −KL(X2 , Y |X1 , X3 ) + 2KL(X1 , Y |X2 , X3 )) =
1 (2Cov(θ, Y ) − 2Cov(θ, Y |X1 ) + Cov(θ, Y |X2 ) − Cov(θ, Y |X1 , X2 ) 6Cov(θ, Y )
+Cov(θ, Y |X3 ) − Cov(θ, Y |X1 , X3 ) + 2Cov(θ, Y |X2 , X3 )),
RI(X2 ) =
1 2CR(2,1,3) (X2 ) + CR(1,2,3) (X2 ) + CR(3,2,1) (X2 ) + 2CR(1,3,2) (X2 ) . 3!
RI(X3 ) =
1 2CR(3,1,2) (X3 ) + CR(1,3,2) (X3 ) + CR(2,3,1) (X3 ) + 2CR(1,2,3) (X3 ) 3!
As shown in the above example, the calculation of RI(Xi ) becomes complex as the number of the explanatory variables increases. Example 7.1 (continued) Three explanatory variables are hypothesized prior to the linear regression analysis; however, the regression coefficients of two explanatory variables X1 and X3 are statistically significant. According to the result, the variable importance is assessed in X1 and X3 . Since Cov(θ, Y ) = 394.898, Cov(θ, Y |X1 ) = 7.538, and Cov(θ, Y |X3 ) = 130.821, by using (7.41), we have 394.898 − 7.538 = 0.981, 394.898 7.538 = 0.019, CR(1,3) (X3 ) = 394.898
CR(1,3) (X1 ) =
7.7 Variable Importance Assessment
221
130.821 = 0.331, 394.898 CR(3,1) (X3 ) = 1 − CR(3,1) (X1 ) = 0.669. CR(3,1) (X1 ) =
Hence, it follows that CR(1,3) (X1 ) + CR(3,1) (X1 ) = 0.656, 2 RI(X3 ) = 0.344.
RI(X1 ) =
Example 7.2 (continued) By using the model (7.22), the relative importance of the explanatory variables X1 and X2 is considered. From (7.37), we have Cov(θ, Y ) − Cov(θ, Y |X1 ) Cov(θ, Y ) 1 I 2 i=1 αi I = 1 I . J 2 1 J 1 I 2 2 i=1 αi + J j=1 βj + IJ i=1 j=1 (αβ)ij I
CR(1,2) (X1 ) =
J 1 J β 2 + 1 I 2 Cov(θ, Y |X1 ) j=1 j i=1 j=1 (αβ)ij J IJ = 1 I CR(1,2) (X2 ) = , J β2 + 1 I J 1 2 2 Cov(θ, Y ) i=1 αi + J j=1 j i=1 j=1 (αβ)ij I IJ
CR(2,1) (X1 ) = CR(2,1) (X2 ) =
J 2 1 I 2 i=1 αi + IJ i=1 j=1 (αβ)ij . J 2 1 I 1 J 1 I 2 2 i=1 αi + J j=1 βj + IJ i=1 j=1 (αβ)ij I 1 I
1 I
I i=1
I
αi2 +
1 J 2 j=1 βj J J 1 1 2 j=1 βj + IJ J
I i=1
J
2 j=1 (αβ)ij
.
Hence, from (7.38), it follows that CR(1,2) (X1 ) + CR(2,1) (X1 ) Cov(θ, Y ) − Cov(θ, Y |X1 ) + Cov(θ, Y |X2 ) = 2 2Cov(θ, Y ) J 1 I 2 + 1 I 2 α (αβ) i=1 i=1 j=1 i ij I 2IJ , = 1 I J 1 J β 2 + 1 I 2 2 i=1 αi + J j=1 j i=1 j=1 (αβ)ij I IJ CR(1,2) (X2 ) + CR(2,1) (X2 ) Cov(θ, Y ) − Cov(θ, Y |X2 ) + Cov(θ, Y |X1 ) = RI(X2 ) = 2 2Cov(θ, Y ) J 1 J β 2 + 1 I 2 j=1 j i=1 j=1 (αβ)ij J 2IJ . = 1 I J 2 + 1 J β 2 + 1 I 2 α i=1 i j=1 j i=1 j=1 (αβ)ij I J IJ RI(X1 ) =
In order to demonstrate the above results, the following estimates are used:
Cov(θ, Y ) = 100.6, Cov(θ, Y |X1 ) = 30.3, Cov(θ, Y |X2 ) = 61.4. From the estimates, we have
222
7 Measurement of Explanatory Variable Contribution in GLMs
CR(1,2) (X1 ) =
Cov(θ, Y ) − Cov(θ, Y |X1 )
Cov(θ, Y )
= 0.699,
CR(2,1) (X1 ) =
Cov(θ, Y |X2 )
Cov(θ, Y )
= 0.610,
CR(1,2) (X2 ) =
Cov(θ, Y |X1 )
Cov(θ, Y )
CR(2,1) (X2 ) =
= 0.301,
Cov(θ, Y ) − Cov(θ, Y |X2 )
Cov(θ, Y )
= 0.390,
0.699 + 0.610 CR(1,2) (X1 ) + CR(2,1) (X1 ) = = 0.655, 2 2 CR(1,2) (X2 ) + CR(2,1) (X2 ) RI(X2 ) = = 0.346. 2 RI(X1 ) =
From the above results, the degree of importance of factor X1 (type of patients) is about twice than that of factor X2 (age groups of nurses). Example 7.3 (continued) From (7.31), we have CR(1,2) (X1 ) =
0.539 − 0.323 0.047 = 0.401, CR(2,1) (X1 ) = = 0.087, 0.539 0.539
CR(1,2) (X2 ) =
0.323 0.539 − 0.047 = 0.599, CR(2,1) (X2 ) = = 0.913, 0.539 0.539
RI(X1 ) =
0.401 + 0.087 0.599 + 0.913 = 0.244, RI(X2 ) = = 0.756. 2 2
From the results, the variable importance of cigarette use X2 is more than thrice greater than that of alcohol use X1 .
7.8 Discussion In regression analysis, important is not only estimation and statistical significance tests of regression coefficients of explanatory variables but also measuring contributions of the explanatory variables and assessing explanatory variable importance. The present chapter has discussed the latter subject in GLMs, and applying an entropybased path analysis in Chap. 6, a method for treating it has been discussed. The method employs ECD for dealing with the subject. It is an advantage that the present method can be applied to all GLMs. As demonstrated in numerical examples, the entropy-based method can be used in practical data analyses with GLMs.
References
223
References 1. Adachi, K., & Trendafilov, N. T. (2018). Some mathematical properties of the matrix decomposition solution in factor analysis. Psychometrika, 83, 407–424. 2. Agresti, A. (2002). Categorical data analysis (2nd ed.). New York: Wiley. 3. Azen, R., & Budescu, D. V. (2003). The dominance analysis approach for comparing predictors in multiple regression. Psychological Methods, 8(2), 129–148. 4. Eshima, N., Borroni, C. G., & Tabata, M. (2016). Relative-importance assessment of explanatory variables in generalized linear models: an entropy-based approach, Statistics and Applications, 14, 107–122. 5. Daniel, W. W. (1999). Biostatistics: A foundation for analysis in the health sciences (7th ed.). New York: Wiley. 6. Darlington, R. B. (1968). Multiple regression in psychological research and practice. Psychological Bulletin, 69, 161–182. 7. Dechow, P. M., Hutton, A. P., & Sloan, R. G. (1999). An empirical assessment of the residual income valuation model. Journal of Accounting and Economics, 26, 1–34. 8. Eshima, N., & Tabata, M. (2010). Entropy coefficient of determination for generalized linear models. Computational Statistics and Data Analysis, 54, 1381–1389. 9. Eshima, N., & Tabata, M. (2011). Three predictive power measures for generalized linear models: Entropy coefficient of determination, entropy correlation coefficient and regression correlation coefficient. Computational Statistics & Data Analysis, 55, 3049–3058. 10. Eshima, N., Tabata, M., Borroni, C. G., & Kano, Y. (2015). An entropy-based approach to path analysis of structural generalized linear models: A basic idea. Entropy, 17, 5117–5132. 11. Gr˝omping, U. (2006). Relative importance for linear regression in R: The package relaimpo. Journal of Statistical Software, 17, 1–26. 12. Gr˝omping, U. (2007). Estimators of relative importance in linear regression based on variance decomposition. The American Statistician, 61, 139–147. 13. Gr˝omping, U. (2009). Variable importance assessment in regression: Linear regression versus random forest. The American Statistician, 63, 308–319. 14. Kruskal, W. (1987). Relative importance by averaging over orderings. The American Statistician, 41, 6–10. 15. McCullagh, P., & Nelder, J. A. (1989). Generalized linear models (2nd ed.). Chapman and Hall: London. 16. Nelder, J. A., & Wedderburn, R. W. M. (1972). Generalized linear model. Journal of the Royal Statistical Society, A, 135, 370–384. 17. Ohlson, J. A. (1995). Earnings, book values and dividends in security valuation. Contemporary Accounting Research, 11, 661–687. 18. Tanaka, Y., & Tarumi, T. (1995). Handbook for statistical analysis: Multivariate analysis (windows version). Tokyo: Kyoritsu-Shuppan. (in Japanese). 19. Theil, H. (1987). How many bits of information does an independent variable yield in a multiple regression? Statistics and Probability Letters, 6, 107–108. 20. Theil, H., & Chung, C. F. (1988). Information-theoretic measures of fit for univariate and multivariate regressions. The American Statistician, 42, 249–253. 21. Thomas, D. R., & Zumbo, B. D. (1996). Using a measure of variable importance to investigate the standardization of discriminant coefficients. Journal of Educational and Behavioral Statistics, 21, 110–130.
Chapter 8
Latent Structure Analysis
8.1 Introduction Latent structure analysis [11] is a general name that includes factor analysis [19, 22], T latent trait analysis [12, 14], and latent class analysis. Let X = X1 , X2 , . . . , Xp T be a manifest variable vector; let ξ = ξ1 , ξ2 , . . . , ξq be a latent variable (factor) vector that influence the manifest variables; let f (x) and f (x|ξ ) be the joint density or probability functions of manifest variable vector X and the conditional one given latent variables ξ , respectively; let f (xi |ξ ) be the conditional density or probability function of manifest variables Xi given latent variables ξ ; let g(ξ ) be the marginal density or probability function of manifest variable vector ξ . Then, a general latent structure model assumes that f (x|ξ ) =
p
f (xi |ξ ).
(8.1)
f (xi |ξ )g(ξ )d ξ ,
(8.2)
i=1
From this, it follows that f (x) =
p i=1
where the integral is replaced by the summation when the manifest variables are discrete. The manifest variables Xi are observable; however, the latent variables ξj imply latent traits or abilities that cannot be observed and are hypothesized components, for example, in human behavior or responses. The above assumption (8.1) is referred to as that of local independence. The assumption implies that latent variables explain the association of the manifest variables. In general, latent structure analysis explains how latent variables affect manifest variables. Factor analysis treats continuous manifest and latent variables, and latent trait models are modeled with discrete manifest © Springer Nature Singapore Pte Ltd. 2020 N. Eshima, Statistical Data Analysis and Entropy, Behaviormetrics: Quantitative Approaches to Human Behavior 3, https://doi.org/10.1007/978-981-15-2552-0_8
225
226
8 Latent Structure Analysis
variables and continuous latent variables. Latent class analysis handles categorical manifest and latent variables. It is important to estimate the model parameters, and the interpretation of the extracted latent structures is also critical. In this chapter, latent structure models are treated as GLMs, and the entropy-based approach in Chaps. 6 and 7 is applied to factor analysis and latent trait analysis. In Sect. 8.2, factor analysis is treated, and an entropy-based method of assessing factor contribution is considered [6, 7]. In comparison with the conventional method of measuring factor contribution and that based on entropy, advantages of the entropy-based method are discussed in a GLM framework. For oblique factor analysis models, a method for calculating factor contributions is given by using covariance matrices. Section 8.3 deals with the latent trait model that expresses dichotomous responses underlying a common latent ability or trait. In the ML estimation of latent abilities of individuals from their responses, the information of test items and that of tests are discussed according to the Fisher information. Based on the GLM framework, ECD is used for measuring test reliability. Numerical illustrations are also provided to demonstrate the present discussion.
8.2 Factor Analysis 8.2.1 Factor Analysis Model The origin of factor analysis dates back to the works of [19], and the single factor model was extended to the multiple factor model [22]. Let Xi be manifest variables; ξj latent variables (common factors); εi unique factors peculiar to Xi ; and let λij be factor loadings that are weights of factors ξj to explain Xi . Then, the factor analysis model is given as follows: Xi =
m
λij ξj + εi (i = 1, 2, . . . , p),
(8.3)
j=1
where ⎧ ⎪ E(Xi ) = E(εi ) = 0, i = 1, 2, . . . , p; ⎪ ⎪ ⎪ ⎪ E ξj = 0, j = 1, 2, . . . , m; ⎨ Var ξj = 1, j = 1, 2, . . . , m; ⎪ ⎪ ⎪ Var(εi ) = ωi2 > 0, i = 1, 2, . . . , p; ⎪ ⎪ ⎩ Cov(εk , εl ) = 0, k = l. T Let be the variance–covariance matrix of X = X1 , X2 , . . . , Xp ; let be the m × m correlation matrix of common factor vector ξ = (ξ1 , ξ2 , . . . , ξm )T ; let be T the p × p variance–covariance matrix of unique factor ε = ε1 , ε2 , . . . , εp ; and let
8.2 Factor Analysis
227
be the p × m regression coefficient matrix of λij . Then, we have = T + ,
(8.4)
and the model (8.3) is expressed by using the matrices as X = ξ + ε.
(8.5)
Let C be any m × m non-singular matrix with standardized row vectors. Then, transformation ξ ∗ = Cξ makes another model as X = ∗ ξ ∗ + ε,
(8.6)
where ∗ = C −1 . In this model, matrix ∗ = CC T is the correlation matrix of factors ξ ∗ and we have = ∗ ∗ ∗T + .
(8.7)
It implies factor analysis model is not identified with respect to matrices C. If common factors ξi are mutually independent, correlation matrix is the identity matrix and covariance structure (8.4) is simplified as follows: = T + .
(8.9)
and for any orthogonal matrix C, we have = ∗ ∗T + .
(8.10)
where ∗ = C −1 = C T . As discussed above, factor analysis model (8.5) cannot constraints be determined uniquely. Hence, to estimate the model parameters, m(m−1) 2 have to be placed on the model parameters . On the other hand, for transformation of scales of the manifest variables, factor analysis model is scale-invariant. For diagonal matrix D, let us transform manifest variables X as X ∗ = DX.
(8.11)
X ∗ = Dξ + Dε.
(8.12)
Then,
and factor analysis models (8.5) and (8.11) are equivalent. From this, factor analysis can be treated under standardization of the manifest variables. Below, we set Var(Xi ) = 1, i = 1, 2, . . . , p. Methods of parameter estimation in factor analysis have been studied actively by many authors, for example, for least square estimation,
228
8 Latent Structure Analysis
[1, 10], for ML estimation, [8, 9, 15] and so on. The estimates of model parameters can be obtained by using the ordinary statistical software packages [24]. Moreover, high-dimensional factor analysis where the number of manifest variables is greater than that of observations has also been developed [16, 20, 21, 23]. In this chapter, the topics of parameter estimation are not treated, and an entropy-based method for measuring factor contribution is discussed.
8.2.2 Conventional Method for Measuring Factor Contribution The interpretation of the extracted factors is based on factor loading matrix and factor structure matrix . After interpreting the factors, it is important to assess factor contribution. For orthogonal factor analysis models (Fig. 8.1a), the contribution of factor ξj on manifest variables Xi is defined as follows: Fig. 8.1 a Path diagram of an orthogonal factor analysis model. b Path diagram of an orthogonal factor analysis model
(a)
(b)
8.2 Factor Analysis
229
Cj =
p
2 2 Cov Xi , ξj = λij . p
i=1
(8.13)
i=1
Measuring contributions of the extracted factors can also be made from the following decomposition of total variances of the observed variables Xi [2, p. 59]: p i=1
Var(Xi ) =
p
λ2i1 +
i=1
p
λ2i2 + · · · +
i=1
p
λ2im +
i=1
p
ωi2
(8.14)
i=1
From this, the contribution (8.13) is regarded as the quantity in the sum of the variances of the manifest variables. Applying it to the manifest variables observed, contribution (8.13) is not scale-invariant. From this reason, factor contribution is considered for standardized versions of manifest variables Xi ; however, the sum of the variances (8.14) has no physical meaning. The variation of manifest variable matrix vector X = X1 , X2 , . . . , Xp is summarized by the variance–covariance . The generalized variance of manifest variable vector X = X1 , X2 , . . . , Xp is the determinant of the variance–covariance matrix |Σ|, and it expresses p-dimensional variation of random vector X. If the determinant could be decomposed into the sums of quantities related to factors ξj as in (8.14), it would be successful to define the factor contribution by the quantities. It is impossible to make such decompositions based on |Σ|. For standardized manifest variables Xi , we have λij = Corr Xi , ξj .
(8.15)
and p
Var(Xi ) = p.
(8.16)
i=1
The squared correlation coefficients (8.15) are the ratios of explained variances for manifest variables Xi and can be interpreted as the contributions (effects) of factors to the individual manifest variables, but the sum of these has no logical foundation to be viewed as the contribution of factor ξj to manifest variable vector X = X1 , X2 , . . . , Xp . In spite of it, the contribution ratio of ξj is defined by Cj RCj = m
l=1 Cl
Cj = m p l=1
k=1
λ2kl
(8.17)
The above measure is referred to as the factor contribution ratio in the common factor space. Another contribution ratio of ξj is referred to as that in the whole space of manifest variable vector X = (Xi ), and it is defined by
230
8 Latent Structure Analysis
Cj Cj = . p i=1 Var(Xi )
j = p RC
(8.18)
The conventional approach can be said to be intuitive. In order to overcome the above difficulties, an entropy-based path analysis approach [5] was applied to measuring the contributions of factors to manifest variables [7]. In the next subsection, the approach is explained. Remark 8.1 The path diagram in Fig. 8.1a is used to express factor analysis model by using linear equations as (8.3). The present chapter treats the factor analysis model as a GLM, so the common factors ξj are explanatory variables and the manifest variables Xi are response variables. In this framework, the effects of unique factors are dealt with the random component of the GLM. From this, in which follows, Fig. 8.1b is employed; i.e., the random component is expressed with the perforated arrows.
8.2.3 Entropy-Based Method of Measuring Factor Contribution Factor analysis model (8.3) is discussed in a framework of GLMs. A general path diagram of factor analysis model is illustrated in Fig. 8.2a. It is assumed that factors ξj , j = 1, 2, . . . , m and εi , i = 1, 2, . . . , p are normally distributed. Then, the conditional density functions of manifest variables of Xi , i = 1, 2, . . . , p given the factors ξj , j = 1, 2, . . . , m are given by ⎛ 2 ⎞ m x − λ ξ i ij j j=1 1 ⎟ ⎜ exp⎝− fi (xi |ξ ) = ⎠ 2 2ω 2 i 2π ω ⎛
i
⎜ = exp⎝−
xi
m
j=1 λij ξj −
1 2
m
j=1 λij ξj
ωi2
2 −
xi2 2ωi2
⎞
⎟ − log 2π ωi2 ⎠,
i = 1, 2, . . . , p. Let θi =
m j=1
(8.19)
x2 λij ξj and C xi , ωi2 = − 2ωi 2 − log 2π ωi2 . Then, the above density i
function is described in a GLM framework as follows: xi θi − 21 θi2 + C xi , ωi2 , i = 1, 2, . . . , p. fi (xi |ξ ) = exp ωi2
(8.20)
From the general latent structure model (8.1), the conditional normal density function of X given ξ is expressed as
8.2 Factor Analysis
231
(a)
(b)
(d) (c)
Fig. 8.2 a Path diagram of a general factor analysis model. b Path diagram of manifest variables Xi , i = 1, 2, . . . , p, and common factor vector ξ . c Path diagram of manifest variable sub-vectors X (a) , a = 1, 2, . . . , A, common factor vector ξ and error sub-vectors ε (a) related to X (a) , a = 1, 2, . . . , A. d. Path diagram of manifest variable vector X, common factors ξj , and unique factors εi
f (x|ξ ) =
p i=1
= exp
exp
xi θi − 21 θi2 ωi2
p xi θi − 1 θi2 2 i=1
ωi2
2 + C xi , ωi p 2 + C xi , ωi .
From (8.21), we have the following theorem:
i=1
(8.21)
232
8 Latent Structure Analysis
T Theorem 8.1 In factor analysis model (8.21), let X = X1 , X2 , . . . , Xp and ξ = (ξ1 , ξ2 , . . . , ξm )T . Then, KL(X, ξ ) =
p
KL(Xi , ξ ).
(8.22)
i=1
Proof Let fi (xi |ξ ) be the conditional density functions of manifest variables Xi , given factor vector ξ ; fi (xi ) be the marginal density functions of Xi ; f (x) be the marginal density function of and let g(ξ ) be the marginal density function of common X; factor vector ξ = ξj . Then, from (8.20), we have ¨
fi (xi |ξ ) dxi dξ fi (xi |ξ )g(ξ )log fi (xi ) ¨ fi (xi ) dxi dξ . + fi (xi )g(ξ )log fi (xi |ξ ) Cov(Xi , θi ) = , i = 1, 2, . . . , p. ωi2
KL(Xi , ξ ) =
(8.23)
From (8.21), it follows that ¨
f (x|ξ )g(ξ ) dxdξ f (x|ξ )g(ξ )log f (x) ¨ f (x) dxdξ + f (x)g(ξ )log f (x|ξ )g(ξ ) ¨ = (f (x|ξ )g(ξ ) − f (x)g(ξ ))logf (x|ξ )dxi dξ
KL(X, ξ ) =
=
p Cov(Xi , θi ) i=1
ωi2
.
(8.24)
Hence, the theorem follows. KL information (8.23) is viewed as a signal-to-noise ratio, and the total KL information (8.24) is decomposed into KL information components for manifest variables Xi . In this sense, KL(X, ξ ) and KL(Xi , ξ ) measure the effects of factor vector ξ in Fig. 8.2b. Let Ri be the multiple correlation coefficient of Xi and ξ = ξj . Since E(Xi |ξ ) = θi =
m j=1
λij ξj , i = 1, 2, . . . , p,
8.2 Factor Analysis
233
we have Cov(Xi , θi ) = R2i , i = 1, 2, . . . , p. Cov(Xi , θi ) + ωi2 Then, from (8.23) KL(Xi , ξ ) =
R2i , i = 1, 2, . . . , p. 1 − R2i
(8.25)
In a GLM framework, ECDs with respect to manifest variables Xi and factor vector ξ are computed as follows: ECD(Xi , ξ ) =
KL(Xi , ξ ) = R2i , i = 1, 2, . . . , p; KL(Xi , ξ ) + 1 p Cov(Xi ,θi ) p R2i ωi2 Cov(Xi ,θi ) i=1 ωi2 i=1
ECD(X, ξ ) = p
+1
= p
i=1 1−R2i
R2i i=1 1−R2i
+1
.
(8.26)
(8.27)
From Theorem 8.1, in Fig. 8.2c, we have the following corollary: Corollary 8.1 Let manifest variable sub-vectors X (a) , a = 1, 2, . . . , A be any T decomposition of manifest variable vector X = X1 , X2 , . . . , Xp . Then, KL(X, ξ ) =
A
KL X (a) , ξ .
(8.28)
a=1
Proof In factor analysis models, sub-vectors X (a) , a = 1, 2, . . . , A are conditionally independent, given factor vector ξ . From this, the proof is similar to that of Theorem 8.1. This completes the corollary. With respect to Theorem 8.1, a more general theorem is given as follows: T Theorem 8.2 Let X = X1 , X2 , . . . , Xp and ξ = (ξ1 , ξ2 , . . . , ξm )T be manifest and latent variable vectors, respectively. Under the assumption of local independence (8.1), the same decomposition as in (8.22) holds as follows: KL(X, ξ ) =
p
KL(Xi , ξ ).
i=1
Proof KL(X, ξ ) =
¨ p i=1
p fi (xi |ξ )g(ξ )log
k=1 fk (xk |ξ ) dxdξ +
f (x)
¨ f (x)g(ξ )log p
f (x)
k=1 fk (xk |ξ )
dxdξ
234
8 Latent Structure Analysis = =
¨ p
fi (xi |ξ )g(ξ ) − f (x) log
i=1 ¨ p
= =
p
i=1 k=1 p ¨ k=1 p ¨ i=1
=
¨
fk (xk |ξ )
dxdξ +
k=1 fk (xk )
p ¨ p
p
fk (xk |ξ )dxdξ
k=1
fi (xi |ξ )g(ξ )log k=1 p
i=1
=
p
p fk (xk ) f (x)g(ξ )log pk=1 dxdξ k=1 fk (xk |ξ )
p ¨
f (x |ξ ) fi (xi |ξ )g(ξ )log k k dxdξ + fk (xk )
k=1 p ¨
f (x |ξ ) fk (xk |ξ )g(ξ )log k k dxk dξ + fk (xk )
k=1
fi (xi |ξ ) dxi dξ + fk (xk |ξ )g(ξ )log fi (xi )
f (x ) f (x)g(ξ )log k k dxdξ fk (xk |ξ )
f (x ) fk (xk )g(ξ )log k k dxk dξ fk (xk |ξ )
¨
fi (xi )g(ξ )log
fi (xi ) dxi dξ fi (xi |ξ )
KL(Xi , ξ ).
i=1
This completes the theorem. Remark 8.2 Let = λij be a p × m factor loading matrix; let be the m × m correlation matrix of common factor vector ξ = (ξ1 , ξ2 , . . . , ξm )T ; and let be the T p × p variance–covariance matrix of unique factor vector ε = ε1 , ε2 , . . . , εp . Then, the conditional density function of X given ξ , f (x|ξ ), is normal with mean ξ and variance matrix and is given as follows:
1 T˜ ˜ ˜ 2 ξ − 21 ξ T T X T ξ X X f (x|ξ ) = − 2 p 1 exp || || (2π) 2 || 2
1
˜ is the cofactor matrix of . Then, KL information (8.24) is expressed by where ¨ KL(X, ξ ) =
f (x|ξ )g(ξ )log ¨
+
f (x)g(ξ )log
f (x|ξ ) dxdξ f (x)
T ˜ f (x) tr dxdξ = || f (x|ξ )
The above information can be interpreted as a generalized signal-to-noise ratio, T ˜ and noise is ||. From (8.24) and (8.25), we have where signal is tr T Cov(Xi , θi ) R2 ˜ tr i = . = 2 || ω 1 − R2i i i=1 i=1 p
p
Hence, the generalized signal-to-noise ratio is decomposed into those for manifest variables. The contributions of factors in factor analysis model (8.20) are discussed in view of entropy. According to the above discussion, the following definitions are made:
8.2 Factor Analysis
235
Definition 8.1 The contribution of factor vector ξ = (ξ1 , ξ2 , . . . , ξm )T to manifest T variable vector X = X1 , X2 , . . . , Xp is defined by C(ξ → X) = KL(X, ξ ).
(8.29)
Definition 8.2 The contributions of factor vector ξ = (ξ1 , ξ2 , . . . , ξm )T to manifest variables Xi are defined by C(ξ → Xi ) = KL(Xi , ξ ) =
R2i , i = 1, 2, . . . , p. 1 − R2i
(8.30)
T Definition 8.3 The contributions of ξj to X = X1 , X2 , . . . , Xp are defined by C ξj → X = KL(X, ξ ) − KL X, ξ \j |ξj , j = 1, 2, . . . , m,
(8.31)
T where ξ \j = ξ1 , ξ2 , . . . , ξj−1 , ξj+1 , . . . , ξm and KL X, ξ \j |ξj are the following conditional KL information: ¨ f (x|ξ ) \j KL X, ξ |ξj = f (x|ξ )g(ξ )log dxdξ f x|ξj ¨ f x|ξj dxdξ , j = 1, 2, . . . , m. + f x|ξj g(ξ )log f (x|ξ ) According to Theorem 8.1 and (8.25), we have the following decomposition: C(ξ → X) =
p
C(ξ → Xi ) =
i=1
p i=1
R2i 1 − R2i
(8.32)
The contributions C(ξ → X) and C(ξ → Xi ) measure the effects of common factors in Fig. 8.2b. The effects of common factors in Fig. 8.2d are evaluated by contributions C ξj → X . Remark 8.3 The contribution C(ξ → X) is decomposed into C(ξ → Xi ) with respect to manifest variables Xi (8.32); however, in general, it follows that C(ξ → X) =
m C ξj → X j=1
due to the correlations between common factors ξj . In Fig. 8.2a, the following definition is given.
(8.33)
236
8 Latent Structure Analysis
Definition 8.4 The contribution of ξj to Xi , i = 1, 2, . . . , p is defined by Var(θi ) − Var θi |ξj \j C ξj → Xi = KL(Xi , ξ ) − KL Xi , ξ |ξj = , ωi2 i = 1, 2, . . . , p; j = 1, 2, . . . , m.
(8.34)
the above discussion, a more general decomposition of contribution Considering C ξj → X can be made in the factor analysis model. T Theorem 8.3 Let X = X1 , X2 , . . . , Xp and ξ = (ξ1 , ξ2 , . . . , ξm )T be manifest and latent variable vectors, respectively. Under the assumption of local independence (8.1), KL X, ξ \j |ξj = KL Xi , ξ \j |ξj , j = 1, 2, . . . , m. p
(8.35)
i=1
Proof Let f x, ξ \j |ξj be the conditional density function of X and ξ \j given ξj ; f x|ξj be the conditional density function of X given ξj ; and let g ξ \j |ξj be the conditional density function of ξ \j given ξj . Then, in factor analysis model (8.21), we have ˚ f x, ξ \j |ξj dxdξ \j dξj KL X, ξ \j |ξj = f (x, ξ )log f x|ξj g ξ \j |ξj ˚ f x|ξj g ξ \j |ξj dxdξ \j dξj + f x|ξj g(ξ )log f x, ξ \j |ξj ˚ ˚ f x|ξj f (x|ξ ) dxdξ /j dξj + dxdξ \j dξj = f (x, ξ )log f x|ξj g(ξ )log f (x|ξ ) f x|ξj p p Cov Xi , θi |ξj = = KL Xi , ξ \j |ξj , j = 1, 2, . . . , m. 2 ωi i=1 i=1
This completes the theorem.
From Theorems 8.1 and 8.3, we have the following decomposition of the contribution of ξj to X: C ξj → X = KL(X, ξ ) − KL X, ξ \j |ξj = KL(Xi , ξ ) − KL Xi , ξ \j |ξj p
p
i=1
=
i=1
KL(Xi , ξ ) − KL Xi , ξ \j |ξj = C ξj → Xi . p
p
i=1
i=1
(8.36)
8.2 Factor Analysis
237
From Remark 8.2, in general,

$$C(\xi \to X) \ne \sum_{j=1}^{m} \sum_{i=1}^{p} C(\xi_j \to X_i).$$

However, in orthogonal factor analysis models, the following theorem holds:

Theorem 8.4 Let X = (X1, X2, ..., Xp)ᵀ and ξ = (ξ1, ξ2, ..., ξm)ᵀ be manifest and latent variable vectors, respectively. In factor analysis models (8.1), if the factors ξi, i = 1, 2, ..., m, are statistically independent, then

$$C(\xi \to X) = \sum_{j=1}^{m} \sum_{i=1}^{p} C(\xi_j \to X_i). \tag{8.37}$$
Proof Since the factors are independent, it follows that

$$\mathrm{KL}(X_i, \xi) = \frac{\sum_{k=1}^{m} \lambda_{ik}^2}{\omega_i^2}, \qquad \mathrm{KL}(X_i, \xi_{\setminus j} \mid \xi_j) = \frac{\sum_{k \ne j} \lambda_{ik}^2}{\omega_i^2}.$$

From this, we obtain

$$C(\xi_j \to X_i) = \mathrm{KL}(X_i, \xi) - \mathrm{KL}(X_i, \xi_{\setminus j} \mid \xi_j) = \frac{\lambda_{ij}^2}{\omega_i^2}.$$

From (8.23) and (8.24), we have

$$C(\xi \to X) = \mathrm{KL}(X, \xi) = \sum_{i=1}^{p} \sum_{j=1}^{m} \frac{\lambda_{ij}^2}{\omega_i^2}.$$
Hence, Eq. (8.37) holds. This completes the theorem.

In the conventional method of measuring factor contribution, the relative contributions (8.17) and (8.18) are used. In the present context, the relative contributions in Fig. 8.2d are defined, corresponding to (8.17) and (8.18).

Definition 8.5 The relative contribution of factor ξj to manifest variable vector X is defined by

$$\mathrm{RC}(\xi_j \to X) = \frac{C(\xi_j \to X)}{C(\xi \to X)} = \frac{\mathrm{KL}(X, \xi_j)}{\mathrm{KL}(X, \xi)}. \tag{8.38}$$

From (8.23) and (8.24), we have
$$\mathrm{RC}(\xi_j \to X) = \frac{\mathrm{KL}(X, \xi) - \mathrm{KL}(X, \xi_{\setminus j} \mid \xi_j)}{\mathrm{KL}(X, \xi)} = \frac{\sum_{i=1}^{p} \dfrac{\mathrm{Cov}(X_i, \theta_i) - \mathrm{Cov}(X_i, \theta_i \mid \xi_j)}{\omega_i^2}}{\sum_{i=1}^{p} \dfrac{\mathrm{Cov}(X_i, \theta_i)}{\omega_i^2}}. \tag{8.39}$$
Entropy KL(X, ξ) implies the entropy variation of manifest variable vector X explained by factor vector ξ, and $\mathrm{KL}(X, \xi_{\setminus j} \mid \xi_j)$ is that explained by $\xi_{\setminus j}$, excluding the effect of factor ξj on X. Based on the ECD approach [4], we give the following definition:

Definition 8.6 The relative contribution of factor ξj to manifest variable vector X in the whole entropy space of X is defined by

$$\widetilde{\mathrm{RC}}(\xi_j \to X) = \frac{\mathrm{KL}(X, \xi) - \mathrm{KL}(X, \xi_{\setminus j} \mid \xi_j)}{\mathrm{KL}(X, \xi) + 1} = \frac{\sum_{i=1}^{p} \dfrac{\mathrm{Cov}(X_i, \theta_i) - \mathrm{Cov}(X_i, \theta_i \mid \xi_j)}{\omega_i^2}}{\sum_{i=1}^{p} \dfrac{\mathrm{Cov}(X_i, \theta_i)}{\omega_i^2} + 1}. \tag{8.40}$$
8.2.4 A Method of Calculating Factor Contributions by Using Covariance Matrices

In factor analysis model (8.3), in order to simplify the discussion, the factor contribution of ξ1 is calculated. Let Σ ≡ Var(X, ξ) be the variance–covariance matrix of X = (X1, X2, ..., Xp)ᵀ and ξ = (ξ1, ξ2, ..., ξm), that is,

$$\Sigma = \begin{pmatrix} \mathrm{Var}(X) & \mathrm{Cov}(X, \xi) \\ \mathrm{Cov}(X, \xi)^{\mathsf T} & \mathrm{Var}(\xi) \end{pmatrix},$$

and let the matrix be divided according to X, ξ1, and (ξ2, ξ3, ..., ξm) as follows:

$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} & \Sigma_{13} \\ \Sigma_{21} & \Sigma_{22} & \Sigma_{23} \\ \Sigma_{31} & \Sigma_{32} & \Sigma_{33} \end{pmatrix},$$

where Σ11 = Var(X), (Σ12  Σ13) = Cov(X, ξ), and

$$\begin{pmatrix} \Sigma_{22} & \Sigma_{23} \\ \Sigma_{32} & \Sigma_{33} \end{pmatrix} = \mathrm{Var}(\xi).$$

Let the partial variance–covariance matrix of X and (ξ2, ξ3, ..., ξm) given ξ1 be denoted by

$$\Sigma_{(1,3)(1,3)\cdot 2} = \begin{pmatrix} \Sigma_{11\cdot 2} & \Sigma_{13\cdot 2} \\ \Sigma_{31\cdot 2} & \Sigma_{33\cdot 2} \end{pmatrix}. \tag{8.41}$$
When the inverse of the above matrix is expressed as

$$\Sigma^{-1} = \begin{pmatrix} \Sigma^{11} & \Sigma^{12} & \Sigma^{13} \\ \Sigma^{21} & \Sigma^{22} & \Sigma^{23} \\ \Sigma^{31} & \Sigma^{32} & \Sigma^{33} \end{pmatrix},$$

then matrix (8.41) is computed as follows:

$$\Sigma_{(1,3)(1,3)\cdot 2} = \begin{pmatrix} \Sigma^{11} & \Sigma^{13} \\ \Sigma^{31} & \Sigma^{33} \end{pmatrix}^{-1},$$

and it follows that

$$\mathrm{KL}((\xi_2, \xi_3, \ldots, \xi_m), X \mid \xi_1) = -\mathrm{tr}\left(\Sigma_{31\cdot 2}\, \Sigma^{13}\right).$$

By using the above result, we can calculate the contribution of factor ξ1 as follows:

$$C(\xi_1 \to X) = \mathrm{KL}((\xi_1, \xi_2, \ldots, \xi_m), X) - \mathrm{KL}((\xi_2, \xi_3, \ldots, \xi_m), X \mid \xi_1) = -\mathrm{tr}\left(\begin{pmatrix} \Sigma_{21} \\ \Sigma_{31} \end{pmatrix}\left(\Sigma^{12}\ \Sigma^{13}\right)\right) + \mathrm{tr}\left(\Sigma_{31\cdot 2}\, \Sigma^{13}\right). \tag{8.42}$$
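As a computational note, the block calculation in (8.42) is straightforward with standard matrix routines. The following Python fragment is a minimal sketch of the reconstruction above, not code from the book: it assumes the joint variance–covariance matrix of (X, ξ) is available as a NumPy array, and the function name and index conventions are only illustrative.

```python
import numpy as np

def contribution_of_factor(Sigma, p, j=0):
    """Sketch of Eq. (8.42): contribution C(xi_{j+1} -> X) computed from the
    joint covariance matrix Sigma of (X, xi).  Rows/columns 0..p-1 refer to the
    manifest variables X, and p..p+m-1 to the common factors xi_1,...,xi_m."""
    m = Sigma.shape[0] - p
    X = np.arange(p)                                              # indices of X
    f = np.array([p + j])                                         # index of the factor of interest
    r = np.array([p + k for k in range(m) if k != j], dtype=int)  # indices of the other factors
    xi = np.concatenate([f, r])                                   # all factor indices

    Sinv = np.linalg.inv(Sigma)

    # First term of (8.42): KL((xi_1,...,xi_m), X) = -tr( Sigma_{xi,X} [Sigma^{-1}]_{X,xi} )
    kl_all = -np.trace(Sigma[np.ix_(xi, X)] @ Sinv[np.ix_(X, xi)])

    # Partial covariance Sigma_{31.2} of the remaining factors and X given xi_j
    S31_2 = (Sigma[np.ix_(r, X)]
             - Sigma[np.ix_(r, f)] @ np.linalg.inv(Sigma[np.ix_(f, f)]) @ Sigma[np.ix_(f, X)])

    # Second term: KL of the remaining factors and X given xi_j = -tr( Sigma_{31.2} [Sigma^{-1}]_{X,rest} )
    kl_cond = -np.trace(S31_2 @ Sinv[np.ix_(X, r)])

    return kl_all - kl_cond                                       # C(xi_j -> X) as in (8.42)
```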
8.2.5 Numerical Example

The entropy-based method for measuring factor contribution is scale-invariant, so a numerical illustration is made under the standardization of the manifest variables. Assuming five common factors, Table 8.1 shows the factor loadings of nine artificial manifest variables. First, in order to compare the method with the conventional one, the five factors are assumed to be orthogonal.

Table 8.1 Artificial factor loadings of five common factors in nine manifest variables

Common factor   X1      X2      X3      X4      X5      X6      X7      X8      X9
ξ1              0.1     0.8     0.4     0.3     0.1     0.3     0.8     0.2     0.4
ξ2              0.3     0.2     0.2     0.3     0.8     0.4     0.1     0.1     0.2
ξ3              0.2     0.1     0.2     0.7     0.1     0.2     0.2     0.3     0.8
ξ4              0.7     0.1     0.1     0.1     0.1     0.6     0.1     0.1     0.1
ξ5              0.1     0.1     0.7     0.1     0.1     0.1     0.2     0.7     0.1
ωi²             0.360   0.290   0.260   0.310   0.320   0.340   0.260   0.360   0.140

(ωi² are the unique factor variances.)
Table 8.2 Factor contributions measured with the conventional method (orthogonal case)

Factor   ξ1      ξ2      ξ3      ξ4      ξ5
Cj       1.72    0.79    1.07    1.60    1.08
RCj      0.275   0.126   0.171   0.256   0.173
R̃Cj      0.246   0.113   0.153   0.229   0.154
Table 8.3 Factor contributions measured with the entropy-based method (orthogonal case)

Factor         ξ1      ξ2      ξ3      ξ4      ξ5
C(ξj → X)      16.13   2.18    4.27    7.64    4.11
RC(ξj → X)     0.474   0.063   0.124   0.223   0.120
R̃C(ξj → X)     0.457   0.062   0.121   0.216   0.116
By using the conventional method, the factor contributions of the common factors are calculated with Cj (8.13), RCj (8.17), and R̃Cj (8.18) (Table 8.2). On the other hand, the entropy-based approach uses C(ξj → X) (8.36), RC(ξj → X) (8.38), and R̃C(ξj → X) (8.40), and the results are shown in Table 8.3. According to the conventional method, factors ξ1 and ξ4 have similar contributions to the manifest variables; however, in the entropy-based method, the contribution of ξ1 is more than twice that of ξ4.

Second, an oblique case is considered by using the factor loadings shown in Table 8.1. Assuming the correlation matrix of the five common factors to be

$$\begin{pmatrix}
1 & 0.7 & 0.5 & 0.2 & 0.1 \\
0.7 & 1 & 0.7 & 0.5 & 0.2 \\
0.5 & 0.7 & 1 & 0.7 & 0.5 \\
0.2 & 0.5 & 0.7 & 1 & 0.7 \\
0.1 & 0.2 & 0.5 & 0.7 & 1
\end{pmatrix},$$
the factor contributions are computed by (8.39) and (8.42), and the results are shown in Table 8.4. Owing to the correlations between the factors, factors ξ3, ξ4, and ξ5 have large contributions to the manifest variables; i.e., their relative contributions are greater than 0.5. In particular, the relative contributions of ξ3 and ξ4 are more than 0.7.

Table 8.4 Factor contributions measured with the entropy-based method (oblique case)

Factor         ξ1      ξ2       ξ3       ξ4       ξ5
C(ξj → X)      9.708   12.975   21.390   18.417   13.972
RC(ξj → X)     0.385   0.515    0.859    0.731    0.553
R̃C(ξj → X)     0.370   0.495    0.816    0.703    0.531
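To apply a block-matrix routine like the sketch following (8.42) to this oblique example, the joint covariance matrix of (X, ξ) first has to be assembled from the loadings in Table 8.1, the unique variances ωi², and the factor correlation matrix displayed above. The fragment below is a hypothetical sketch under the common factor model (Var(X) = ΛᵀΦΛ + Ω, with the rows of Λ indexed by the factors as in Table 8.1); whether it reproduces the printed contributions exactly depends on computational conventions not recoverable from the text, so it is meant only as an illustration of the mechanics.

```python
import numpy as np

# Factor loadings from Table 8.1: rows are the factors xi_1,...,xi_5,
# columns the manifest variables X1,...,X9.
Lambda = np.array([
    [0.1, 0.8, 0.4, 0.3, 0.1, 0.3, 0.8, 0.2, 0.4],
    [0.3, 0.2, 0.2, 0.3, 0.8, 0.4, 0.1, 0.1, 0.2],
    [0.2, 0.1, 0.2, 0.7, 0.1, 0.2, 0.2, 0.3, 0.8],
    [0.7, 0.1, 0.1, 0.1, 0.1, 0.6, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1, 0.1, 0.1, 0.2, 0.7, 0.1],
])
Omega = np.diag([0.360, 0.290, 0.260, 0.310, 0.320, 0.340, 0.260, 0.360, 0.140])

# Factor correlation matrix of the oblique case (displayed above).
Phi = np.array([
    [1.0, 0.7, 0.5, 0.2, 0.1],
    [0.7, 1.0, 0.7, 0.5, 0.2],
    [0.5, 0.7, 1.0, 0.7, 0.5],
    [0.2, 0.5, 0.7, 1.0, 0.7],
    [0.1, 0.2, 0.5, 0.7, 1.0],
])

# Joint covariance matrix of (X, xi) implied by the common factor model.
Sxx = Lambda.T @ Phi @ Lambda + Omega   # Var(X), 9 x 9
Sxf = Lambda.T @ Phi                    # Cov(X, xi), 9 x 5
Sigma = np.block([[Sxx, Sxf],
                  [Sxf.T, Phi]])

# Sigma can now be passed to a routine such as contribution_of_factor(Sigma, p=9, j=0)
# from the sketch given after Eq. (8.42).
```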
The entropy-based method is now applied to the factor analysis of the data in Table 7.6, assuming two common factors. By using the covarimin method, we obtain the factor loadings in Table 8.5, where the manifest variables are standardized.

Table 8.5 Estimated factor loadings from data in Table 7.6

      Japanese X1   English X2   Social X3   Mathematics X4   Science X5
ξ1    0.593         0.768        0.677       0.285            0
ξ2    0.244         0            −0.122      0.524            0.919
According to the results, factor ξ1 can be interpreted as a latent ability for the liberal arts and factor ξ2 as that for the sciences. The correlation coefficient between the two common factors is estimated as 0.315, and the factor contributions are calculated as in Table 8.6. The table shows the following decompositions of C(ξj → X) and C(ξ → X):

$$C(\xi_j \to X) = \sum_{i=1}^{5} C(\xi_j \to X_i), \quad j = 1, 2; \qquad C(\xi \to X) = \sum_{i=1}^{5} C(\xi \to X_i).$$

However, the factor analysis model is oblique, and it follows that

$$C(\xi \to X_i) \ne \sum_{j=1}^{2} C(\xi_j \to X_i), \quad i = 1, 2, 3, 4, 5.$$
The effect of ξ2 on the manifest variable vector X is about 1.7 times greater than that of ξ1. A summary of the contributions (effects) to the manifest variables is given in Table 8.7. From R̃C(ξ → X) = 0.904, the explanatory power of the latent common factor vector ξ is strong, especially that of ξ2, i.e., R̃C(ξ2 → X) = 0.638.

Table 8.6 Factor contributions C(ξj → X), C(ξj → Xi), and C(ξ → Xi)

      Japanese X1   English X2   Social X3   Mathematics X4   Science X5   Total^a
ξ1    0.902         1.438        0.704       0.368            0.539        3.951
ξ2    0.373         0.143        0.014       0.685            5.433        6.648
ξ     1.009         1.438        0.728       0.818            5.433        9.426

^a Totals in the table are equal to C(ξj → X), j = 1, 2, and C(ξ → X)
Table 8.7 Factor contributions to manifest variable vector X

               ξ1      ξ2      Effect of ξ on X
C(ξj → X)      3.951   6.648   C(ξ → X) = 9.426
R̃C(ξj → X)     0.379   0.638   R̃C(ξ → X) = 0.904
RC(ξj → X)     0.419   0.705
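As a check, the relative contributions in Table 8.7 follow directly from (8.38) and (8.40); for example,

$$\mathrm{RC}(\xi_2 \to X) = \frac{C(\xi_2 \to X)}{C(\xi \to X)} = \frac{6.648}{9.426} \approx 0.705, \qquad \widetilde{\mathrm{RC}}(\xi_2 \to X) = \frac{C(\xi_2 \to X)}{C(\xi \to X) + 1} = \frac{6.648}{9.426 + 1} \approx 0.638.$$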
As mentioned in Remark 8.3, since the factor analysis model is oblique, in Table 8.7 we have

$$C(\xi \to X) \ne \sum_{j=1}^{2} C(\xi_j \to X).$$
8.3 Latent Trait Analysis

8.3.1 Latent Trait and Item Response

In a test battery composed of response items with binary response categories, e.g., "yes" or "no", "positive" or "negative", "success" or "failure", and so on, let us assume that the responses to the test items depend on a latent trait or ability θ, where θ is a hypothesized variable that cannot be observed directly. In item response theory, the relationship between responses to the items and the latent trait is analyzed and applied to test construction. In this section, the term "latent trait" is mainly used for convenience; however, the term "latent ability" is also employed. In a test battery with p test items, let Xi be the response to item i, i = 1, 2, ..., p, such that

$$X_i = \begin{cases} 1 & (\text{success response to item } i) \\ 0 & (\text{failure}) \end{cases}, \quad i = 1, 2, \ldots, p,$$

and let Pi(θ) = P(Xi = 1 | θ) be the (success) response probability functions of an individual with latent trait θ for items i, i = 1, 2, ..., p. Then, under the assumption of local independence (Fig. 8.3), it follows that

$$P(X_1, X_2, \ldots, X_p = x_1, x_2, \ldots, x_p \mid \theta) = \prod_{i=1}^{p} P_i(\theta)^{x_i} Q_i(\theta)^{1 - x_i}, \tag{8.43}$$
where Qi(θ) = 1 − Pi(θ), i = 1, 2, ..., p are the failure probabilities of an individual with latent trait θ. The functions Pi(θ), i = 1, 2, ..., p are called item characteristic functions.

[Fig. 8.3 Path diagram of the latent trait model]

Usually, the test score
$$V = \sum_{i=1}^{p} X_i$$
is used for evaluating the latent traits of individuals. Then, we have

$$E\left(\sum_{i=1}^{p} X_i \,\Big|\, \theta\right) = \sum_{i=1}^{p} E(X_i \mid \theta) = \sum_{i=1}^{p} P_i(\theta), \qquad \mathrm{Var}\left(\sum_{i=1}^{p} X_i \,\Big|\, \theta\right) = \sum_{i=1}^{p} \mathrm{Var}(X_i \mid \theta) = \sum_{i=1}^{p} P_i(\theta) Q_i(\theta).$$
In general, if weights ci are assigned to responses Xi, we have the following general score:

$$V = \sum_{i=1}^{p} c_i X_i. \tag{8.44}$$
For the above score, the average of the conditional expectations of ci Xi given latent trait θ over the items is given by

$$T(\theta) = \frac{1}{p} \sum_{i=1}^{p} c_i P_i(\theta). \tag{8.45}$$

The above function is called a test characteristic function. In the next section, a theoretical discussion for deriving the item characteristic function Pi(θ) is provided [18, pp. 37–40].
8.3.2 Item Characteristic Function

Let Yi be the latent trait for answering item i, i = 1, 2, ..., p, and let θ be a common latent trait for answering all the items. Since latent traits are unobservable, hypothesized variables, it is assumed that variables Yi and θ are jointly distributed according to a bivariate normal distribution with mean vector (0, 0) and variance–covariance matrix

$$\begin{pmatrix} 1 & \rho_i \\ \rho_i & 1 \end{pmatrix}. \tag{8.46}$$
Let ηi be the threshold of latent ability Yi for successfully answering item i, i = 1, 2, ..., p. The probabilities that an individual with latent trait θ gives correct answers to items i, i = 1, 2, ..., p are computed as follows:
$$P_i(\theta) = P(Y_i > \eta_i \mid \theta) = \int_{\eta_i}^{+\infty} \frac{1}{\sqrt{2\pi}\sqrt{1 - \rho_i^2}} \exp\left(-\frac{(y - \rho_i \theta)^2}{2(1 - \rho_i^2)}\right) dy = \int_{\frac{\eta_i - \rho_i \theta}{\sqrt{1 - \rho_i^2}}}^{+\infty} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{t^2}{2}\right) dt = \int_{-a_i(\theta - d_i)}^{+\infty} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{t^2}{2}\right) dt = \Phi(a_i(\theta - d_i)), \quad i = 1, 2, \ldots, p, \tag{8.47}$$

where Φ(x) is the standard normal distribution function and

$$a_i = \frac{\rho_i}{\sqrt{1 - \rho_i^2}}, \qquad d_i = \frac{\eta_i}{\rho_i}.$$
From this, for θ = di, the success probability for item i is 1/2, and in this sense, the parameter di implies the difficulty of item i. Since the cumulative normal distribution function (8.47) is difficult to treat mathematically, the following logistic model is used as an approximation:

$$P_i(\theta) = \frac{\exp(D a_i(\theta - d_i))}{1 + \exp(D a_i(\theta - d_i))} = \frac{1}{1 + \exp(-D a_i(\theta - d_i))}, \quad i = 1, 2, \ldots, p, \tag{8.48}$$
where D is a constant. If we set D = 1.7, the above functions are good approximations of (8.47). For ai = 2, di = 1, the two curves are almost identical, as shown in Fig. 8.4a and b. Differentiating (8.48) with respect to θ, we have

$$\frac{d}{d\theta} P_i(\theta) = D a_i P_i(\theta) Q_i(\theta) \le D a_i P_i(d_i) Q_i(d_i) = \frac{D a_i}{4}, \quad i = 1, 2, \ldots, p. \tag{8.49}$$
From this, the success probability for item i increases rapidly in a neighborhood of θ = di, i = 1, 2, ..., p (Fig. 8.5). As the parameter ai increases, the slope of the item characteristic curve at θ = di also increases; in this sense, the parameters ai express the discriminating powers of items i, i = 1, 2, ..., p. As shown in Fig. 8.6, for a fixed discrimination parameter a, the success probabilities decrease as the difficulty parameter d increases. Since logistic models (8.48) are GLMs, model (8.48) can be handled more easily than the normal distribution model (8.47). In the next section, the information of tests for estimating latent traits is discussed, and ECD is applied for measuring the reliability of tests.
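A minimal numerical sketch of the two-parameter logistic model (8.48), assuming D = 1.7 as above, is given below; the function names are illustrative only.

```python
import numpy as np

D = 1.7  # scaling constant of model (8.48)

def icc(theta, a, d):
    """Item characteristic function P_i(theta) of the logistic model (8.48)."""
    return 1.0 / (1.0 + np.exp(-D * a * (theta - d)))

def icc_slope(theta, a, d):
    """Derivative dP_i/dtheta = D * a_i * P_i * Q_i, cf. (8.49)."""
    p = icc(theta, a, d)
    return D * a * p * (1.0 - p)

# Example: the item with a = 2, d = 1 plotted in Fig. 8.4b.
theta = np.linspace(-3.0, 3.0, 7)
print(icc(theta, a=2.0, d=1.0))       # success probabilities on a coarse grid
print(icc_slope(1.0, a=2.0, d=1.0))   # maximal slope D * a / 4 = 0.85
```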
[Fig. 8.4 a The standard normal distribution function (8.47) (a = 2, d = 1). b Logistic model (8.48) (a = 2, d = 1)]
[Fig. 8.5 Logistic models for a = 0.5, 1, 2, 3; d = 1]

[Fig. 8.6 Logistic models for a = 2; d = 1, 1.5, 2, 2.5]

8.3.3 Information Functions and ECD

In item response models (8.48), the likelihood functions of latent common trait θ given responses Xi are

$$L_i(\theta \mid X_i) = P_i(\theta)^{X_i} Q_i(\theta)^{1 - X_i}, \quad i = 1, 2, \ldots, p.$$

From this, the log-likelihood functions are given as follows:
$$l_i(\theta \mid X_i) = \log L_i(\theta \mid X_i) = X_i \log P_i(\theta) + (1 - X_i) \log Q_i(\theta), \quad i = 1, 2, \ldots, p.$$

Then, the Fisher information for estimating latent trait θ is computed as follows:

$$\begin{aligned}
I_i(\theta) &= E\left(\frac{d}{d\theta} l_i(\theta \mid X_i)\right)^2 = E\left(\frac{X_i}{P_i(\theta)} \frac{dP_i(\theta)}{d\theta} + \frac{1 - X_i}{Q_i(\theta)} \frac{dQ_i(\theta)}{d\theta}\right)^2 \\
&= \frac{1}{P_i(\theta)} \left(\frac{dP_i(\theta)}{d\theta}\right)^2 + \frac{1}{Q_i(\theta)} \left(\frac{dQ_i(\theta)}{d\theta}\right)^2 = \frac{P_i'(\theta)^2}{P_i(\theta) Q_i(\theta)} = D^2 a_i^2 P_i(\theta) Q_i(\theta), \quad i = 1, 2, \ldots, p,
\end{aligned} \tag{8.50}$$

where

$$P_i'(\theta) = \frac{dP_i(\theta)}{d\theta}, \quad i = 1, 2, \ldots, p.$$
In test theory, the functions (8.50) are called item information functions, because the Fisher information is related to the precision of the estimate of latent trait θ. Under the assumption of local independence (8.43), from (8.50), the Fisher information of response X = (X1, X2, ..., Xp) is calculated as follows:

$$I(\theta) = \sum_{i=1}^{p} I_i(\theta) = \sum_{i=1}^{p} \frac{P_i'(\theta)^2}{P_i(\theta) Q_i(\theta)} = D^2 \sum_{i=1}^{p} a_i^2 P_i(\theta) Q_i(\theta). \tag{8.51}$$
The above function is referred to as the test information function, and the characteristics of a test can be discussed through it. Based on model (8.48), we have

$$P(X_1, X_2, \ldots, X_p = x_1, x_2, \ldots, x_p \mid \theta) = \prod_{i=1}^{p} \frac{\exp(x_i D a_i(\theta - d_i))}{1 + \exp(D a_i(\theta - d_i))} = \frac{\exp\left(D \sum_{i=1}^{p} x_i a_i(\theta - d_i)\right)}{\prod_{i=1}^{p} \left(1 + \exp(D a_i(\theta - d_i))\right)}. \tag{8.52}$$
From (8.52), we obtain the following sufficient statistic:

$$V_{\mathrm{sufficient}} = \sum_{i=1}^{p} a_i X_i. \tag{8.53}$$
The ML estimator θ̂ of latent trait θ is a function of sufficient statistic V_sufficient, and from Fisher information (8.51), the asymptotic variance of the ML estimator θ̂ for response X = (X1, X2, ..., Xp) is computed as follows:

$$\mathrm{Var}(\hat{\theta}) \approx \frac{1}{\sum_{i=1}^{p} I_i(\theta)} = \frac{1}{D^2 \sum_{i=1}^{p} a_i^2 P_i(\theta) Q_i(\theta)}. \tag{8.54}$$
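Building on the sketch of the item characteristic function given after (8.49), the item information (8.50), the test information (8.51), and the asymptotic variance (8.54) can be evaluated numerically. The following is again only an illustrative sketch; the parameter values used in the example are those of Test (A) in Table 8.8 of the next section.

```python
import numpy as np

D = 1.7

def icc(theta, a, d):
    """P_i(theta) of the logistic model (8.48); a and d may be arrays."""
    return 1.0 / (1.0 + np.exp(-D * np.asarray(a) * (theta - np.asarray(d))))

def item_information(theta, a, d):
    """I_i(theta) = D^2 a_i^2 P_i(theta) Q_i(theta), Eq. (8.50)."""
    p = icc(theta, a, d)
    return (D * np.asarray(a)) ** 2 * p * (1.0 - p)

def test_information(theta, a, d):
    """I(theta) = sum_i I_i(theta), Eq. (8.51)."""
    return float(np.sum(item_information(theta, a, d)))

# Test (A): a_i = 2 and d_i = -2, -1.5, ..., 2 (Table 8.8).
a = np.full(9, 2.0)
d = np.array([-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0])
theta0 = 0.0
print(test_information(theta0, a, d))        # test information at theta = 0
print(1.0 / test_information(theta0, a, d))  # approximate Var(theta_hat), Eq. (8.54)
```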
In this sense, the score (8.53) is the best for measuring the common latent trait θ. For a large number of items p, the statistic (score) V (8.44) is asymptotically distributed according to the normal distribution $N\left(\sum_{i=1}^{p} c_i P_i(\theta),\ \sum_{i=1}^{p} c_i^2 P_i(\theta) Q_i(\theta)\right)$. The mean is increasing in latent trait θ; however, the variance is comparatively stable over a range including the item difficulties di, i = 1, 2, ..., p. From this, assuming the variance to be constant as

$$\sum_{i=1}^{p} c_i^2 P_i(\theta) Q_i(\theta) = \sigma^2, \tag{8.55}$$
the log-likelihood function based on statistic V is given by

$$l(\theta \mid V) = -\frac{1}{2} \log\left(2\pi \sigma^2\right) - \frac{1}{2\sigma^2} \left(v - \sum_{i=1}^{p} c_i P_i(\theta)\right)^2.$$

From this, it follows that

$$\frac{d}{d\theta} l(\theta \mid V) = \frac{1}{\sigma^2} \left(v - \sum_{i=1}^{p} c_i P_i(\theta)\right) \cdot D \sum_{i=1}^{p} a_i c_i P_i(\theta) Q_i(\theta).$$

Then, the Fisher information is calculated by

$$E\left(\frac{d}{d\theta} l(\theta \mid V)\right)^2 = \frac{D^2}{\sigma^2} \left(\sum_{i=1}^{p} a_i c_i P_i(\theta) Q_i(\theta)\right)^2. \tag{8.56}$$
Since

$$\sum_{i=1}^{p} a_i c_i P_i(\theta) Q_i(\theta) = \left(a_1\ a_2\ \cdots\ a_p\right) \begin{pmatrix} P_1(\theta) Q_1(\theta) & & 0 \\ & \ddots & \\ 0 & & P_p(\theta) Q_p(\theta) \end{pmatrix} \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_p \end{pmatrix},$$

the above quantity can be viewed as an inner product between the vectors (a1, a2, ..., ap) and (c1, c2, ..., cp) with respect to the diagonal matrix diag(P1(θ)Q1(θ), P2(θ)Q2(θ), ..., Pp(θ)Qp(θ)).
Hence, under the condition (8.55), Fisher information (8.56) is maximized by

$$c_i = \tau a_i, \quad i = 1, 2, \ldots, p,$$

where τ is a constant. From (8.56), we have

$$\max_{c_i,\, i = 1, 2, \ldots, p} E\left(\frac{d}{d\theta} l(\theta \mid V)\right)^2 = D^2 \sum_{i=1}^{p} a_i^2 P_i(\theta) Q_i(\theta).$$
The above information is the same as that for V_sufficient (8.51). Comparing (8.54) and (8.56), we have the following definition:

Definition 8.7 The efficiency of score V (8.44) for sufficient statistic V_sufficient is defined by

$$\psi(\theta) = \frac{\left(\sum_{i=1}^{p} a_i c_i P_i(\theta) Q_i(\theta)\right)^2}{\sum_{i=1}^{p} a_i^2 P_i(\theta) Q_i(\theta) \cdot \sum_{i=1}^{p} c_i^2 P_i(\theta) Q_i(\theta)}.$$
exp(xi Dai (θ − di )) , i = 1, 2, . . . , p. 1 + exp(Dai (θ − di ))
Hence, the above logistic models are GLMs, and we have KL(Xi , θ ) = Dai Cov(Xi , θ ), i = 1, 2, . . . , p. and ECD(Xi , θ ) =
KL(Xi , θ ) Dai Cov(Xi , θ ) = , i = 1, 2, . . . , p. KL(Xi , θ ) + 1 Dai Cov(Xi , θ ) + 1
8.2, the KL information concerning the association between X = From Theorem X1 , X2 , . . . , Xp and θ is calculated as follows: KL(X, θ) = KL(Vsufficient , θ) = Cov(DVsufficient , θ) =
n i=1
Dai Cov(Xi , θ) =
p
KL(Xi , θ).
i=1
Hence, ECD in GLM (8.52) is computed by ECD(X, θ ) = ECD(Vsufficient , θ ) =
p KL(Xi , θ ) KL(X, θ ) = i=1 . KL(X, θ ) + 1 KL(X, θ ) + 1
(8.57)
The above ECD can be called the entropy coefficient of reliability (ECR) of the test. Since θ is distributed according to N(0, 1), from model (8.48), we have

$$\mathrm{KL}(X_i, \theta) = D a_i \mathrm{Cov}(X_i, \theta) = D a_i E(X_i \theta) = D a_i \int_{-\infty}^{+\infty} \frac{\theta \exp(D a_i(\theta - d_i))}{1 + \exp(D a_i(\theta - d_i))} \cdot \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{\theta^2}{2}\right) d\theta, \quad i = 1, 2, \ldots, p.$$

Since

$$\lim_{a_i \to +\infty} \mathrm{Cov}(X_i, \theta) = \lim_{a_i \to +\infty} \int_{-\infty}^{+\infty} \frac{\theta \exp(D a_i(\theta - d_i))}{1 + \exp(D a_i(\theta - d_i))} \cdot \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{\theta^2}{2}\right) d\theta = \int_{d_i}^{+\infty} \theta \cdot \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{\theta^2}{2}\right) d\theta = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{d_i^2}{2}\right), \quad i = 1, 2, \ldots, p,$$

it follows that

$$\lim_{a_i \to +\infty} \mathrm{KL}(X_i, \theta) = +\infty \iff \lim_{a_i \to +\infty} \mathrm{ECD}(X_i, \theta) = 1, \quad i = 1, 2, \ldots, p.$$
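The integral representation of KL(Xi, θ) above lends itself to simple numerical quadrature, which in turn gives the ECR (8.57) of a whole test. The sketch below is not the book's code; its output should only be expected to agree with the values reported in the next section (e.g., the KL row of Table 8.8 and the ECD of Test (A)) up to numerical and rounding error.

```python
import numpy as np

D = 1.7
GRID = np.linspace(-8.0, 8.0, 4001)   # integration grid for theta ~ N(0, 1)

def kl_item(a, d):
    """KL(X_i, theta) = D * a_i * E(X_i * theta), approximated by the trapezoidal rule."""
    p = 1.0 / (1.0 + np.exp(-D * a * (GRID - d)))          # P_i(theta)
    phi = np.exp(-0.5 * GRID ** 2) / np.sqrt(2.0 * np.pi)  # standard normal density
    return D * a * np.trapz(GRID * p * phi, GRID)

def ecr(a, d):
    """Entropy coefficient of reliability (8.57) of a test with parameters a, d."""
    kl = sum(kl_item(ai, di) for ai, di in zip(a, d))
    return kl / (kl + 1.0)

# Test (A) of Table 8.8: a_i = 2, d_i = -2, -1.5, ..., 2.
a = [2.0] * 9
d = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
print([round(kl_item(2.0, di), 3) for di in d])   # compare with the KL row of Table 8.8
print(round(ecr(a, d), 3))                        # compare with ECD(X, theta) for Test (A)
```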
8.3.4 Numerical Illustration

Table 8.8 shows the discrimination and difficulty parameters of nine artificial items for a simulation study [Test (A)]. The nine item characteristic functions are illustrated in Fig. 8.7. In this case, according to KL(Xi, θ) and ECD(Xi, θ) (Table 8.8), the explanatory powers of latent trait θ for the responses to the nine items are moderate. According to Table 8.8, the KL information is maximized for d = 0. The entropy coefficient of reliability of this test is calculated by (8.57). Since

$$\mathrm{KL}(X, \theta) = \sum_{i=1}^{9} \mathrm{KL}(X_i, \theta) = 0.234 + 0.479 + \cdots + 0.234 = 6.356,$$
Table 8.8 An item response model with nine items and the KL information [Test (A)]

Item          1       2       3       4       5       6       7       8       9
ai            2       2       2       2       2       2       2       2       2
di           −2      −1.5    −1      −0.5     0       0.5     1.0     1.5     2
KL(Xi, θ)     0.234   0.479   0.794   1.075   1.190   1.075   0.794   0.479   0.234
ECD(Xi, θ)    0.190   0.324   0.443   0.518   0.543   0.518   0.443   0.324   0.190
[Fig. 8.7 Item characteristic functions in Table 8.8]
we have

$$\mathrm{ECD}(X, \theta) = \frac{6.356}{6.356 + 1} = 0.864.$$
In this test, the test reliability can be considered high. Usually, the following test score is used for measuring latent trait θ:

$$V = \sum_{i=1}^{9} X_i. \tag{8.58}$$

In Test (A), since all the discrimination parameters are equal, the above score is a sufficient statistic of the item response model with the nine items in Table 8.8, and the information of the item response model is the same as that of the above score. Then, the test characteristic function is given by

$$T(\theta) = \frac{1}{9} \sum_{i=1}^{9} P_i(\theta). \tag{8.59}$$
As shown in Fig. 8.8, the relation between latent trait θ and T(θ) is almost linear. The test information function of Test (A) is given in Fig. 8.9; the precision for estimating the latent trait θ is maximized at θ = 0. Second, Test (B), shown in Table 8.9, is considered; its ECDs are uniformly smaller than those in Table 8.8, and the item characteristic functions are illustrated in Fig. 8.10. The ECD of Test (B) is ECD(X, θ) = 0.533
[Fig. 8.8 Test characteristic function (8.59) of Test (A)]

[Fig. 8.9 Test information function of Test (A)]
Table 8.9 An item response model with nine items and the KL information [Test (B)]

Item          1       2       3       4       5       6       7       8       9
ai            0.5     0.5     0.5     0.5     0.5     0.5     0.5     0.5     0.5
di           −2      −1.5    −1      −0.5     0       0.5     1.0     1.5     2
KL(Xi, θ)     0.095   0.116   0.135   0.148   0.152   0.148   0.135   0.116   0.095
ECD(Xi, θ)    0.087   0.104   0.119   0.129   0.132   0.129   0.119   0.104   0.087
and is less than that of Test (A). Comparing Figs. 8.9 and 8.11, the precision of estimating the latent trait in Test (A) is higher than that in Test (B); the precision depends on the discrimination parameters ai.
[Fig. 8.10 Item characteristic functions of Test (B)]

[Fig. 8.11 Test information function of Test (B)]
Lastly, Test (C), with the nine items in Table 8.10, is considered. The item characteristic functions are illustrated in Fig. 8.12. According to the KL information in Table 8.10, the latent trait has the largest predictive power for the responses to items 3 and 7. The ECD of this test is computed as
Table 8.10 An item response model with nine items and the KL information [Test (C)]

Item          1       2       3       4       5       6       7       8       9
ai            0.5     1       2       1       0.5     1       2       1       0.5
di           −2      −1.5    −1      −0.5     0       0.5     1.0     1.5     2
KL(Xi, θ)     0.095   0.260   0.794   0.442   0.152   0.442   0.794   0.260   0.095
ECD(Xi, θ)    0.087   0.207   0.443   0.306   0.132   0.306   0.443   0.207   0.087
[Fig. 8.12 Item characteristic functions of Test (C)]
ECD(X, θ) = 0.769.

The test information function of Test (C) is illustrated in Fig. 8.13. Its configuration is similar to that of the KL information plotted against the item difficulties di of Test (C) in Table 8.10 (Fig. 8.14). Since this test has nine items with discrimination parameters 0.5, 1, and 2, the sufficient statistic (score) is given by

$$V_{\mathrm{sufficient}} = 0.5 X_1 + X_2 + 2 X_3 + X_4 + 0.5 X_5 + X_6 + 2 X_7 + X_8 + 0.5 X_9. \tag{8.60}$$

The above score is the best among scores of the form (8.44) for estimating the latent trait. Test score (8.58) is usually used in practice; however, the efficiency of this test score relative to the sufficient score (8.60) is less than 0.9 over the range of latent trait θ, as illustrated in Fig. 8.15. Since the distribution of latent trait θ is assumed to be N(0, 1), about 95% of the latent trait values lie in the range (−1.96, 1.96), i.e.,
[Fig. 8.13 Test information function of Test (C)]

[Fig. 8.14 KL information KL(Xi, θ) and item difficulties di]
[Fig. 8.15 Efficiency of test score V (8.58) in Test (C)]
P(−1.96 < θ < 1.96) ≈ 0.95. Hence, over the range −1.96 < θ < 1.96, the precision for estimating θ with test score (8.58) is about 80% of that attained with the sufficient statistic (8.60), as shown in Fig. 8.15.
8.4 Discussion

In this chapter, latent structure models have been considered in the framework of GLMs. In factor analysis, the contributions of common factors have been measured through an entropy-based path analysis; the contributions of common factors have been defined as the effects of the factors on the manifest variable vector. In latent trait analysis, for the two-parameter logistic model for dichotomous manifest variables, the test reliability has been discussed with ECD. It is critical to extend the discussion to the graded response model [17], the partial credit model [13], the nominal response model [3], and so on. Latent class analysis has not been treated in this chapter; however, an entropy-based discussion for comparing latent classes may also be possible. Further studies are needed to extend the discussion presented in this chapter.
References

1. Brown, M. N. (1974). Generalized least squares in the analysis of covariance structures. South African Statistical Journal, 8, 1–24.
2. Bartholomew, D. J. (1987). Latent variable models and factor analysis. New York: Oxford University Press.
3. Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29–51.
4. Eshima, N., & Tabata, M. (2010). Entropy coefficient of determination for generalized linear models. Computational Statistics & Data Analysis, 54, 1381–1389.
5. Eshima, N., Tabata, M., Borroni, C. G., & Kano, Y. (2015). An entropy-based approach to path analysis of structural generalized linear models: A basic idea. Entropy, 17, 5117–5132.
6. Eshima, N., Borroni, C. G., & Tabata, M. (2016). Relative importance assessment of explanatory variables in generalized linear models: An entropy-based approach. Statistics & Applications, 16, 107–122.
7. Eshima, N., Tabata, M., & Borroni, C. G. (2018). An entropy-based approach for measuring factor contributions in factor analysis models. Entropy, 20, 634.
8. Jöreskog, K. G. (1967). Some contributions to maximum likelihood factor analysis. Psychometrika, 32, 443–482.
9. Jöreskog, K. G. (1969). A general approach to confirmatory maximum likelihood factor analysis. Psychometrika, 34, 183–202.
10. Jöreskog, K. G., & Goldberger, A. S. (1972). Factor analysis by generalized least squares. Psychometrika, 37, 243–260.
11. Lazarsfeld, P. F., & Henry, N. W. (1968). Latent structure analysis. New York: Houghton-Mifflin.
12. Lord, F. M. (1952). A theory of test scores (Psychometrika Monograph, No. 7). Richmond: Psychometric Corporation.
13. Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174.
14. Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Illinois: The University of Chicago Press.
15. Rubin, D. B., & Thayer, D. T. (1982). EM algorithms for ML factor analysis. Psychometrika, 32, 443–482.
16. Robertson, D., & Symons, J. (2007). Maximum likelihood factor analysis with rank-deficient sample covariance matrices. Journal of Multivariate Analysis, 98, 813–828.
17. Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores (Psychometric Monograph). Princeton: Educational Testing Services.
18. Shiba, S. (1991). Item response theory. Tokyo: Tokyo University.
19. Spearman, S. (1904). "General-intelligence", objectively determined and measured. American Journal of Psychology, 15, 201–293.
20. Sundberg, R., & Feldmann, U. (2016). Exploratory factor analysis: Parameter estimation and scores prediction with high-dimensional data. Journal of Multivariate Analysis, 148, 49–59.
21. Trendafilov, N. T., & Unkel, S. (2011). Exploratory factor analysis of data matrices with more variables than observations. Journal of Computational and Graphical Statistics, 20, 874–891.
22. Thurstone, L. L. (1935). Vector of mind: Multiple factor analysis for the isolation of primary traits. Chicago, IL: The University of Chicago Press.
23. Unkel, S., & Trendafilov, N. T. (2010). A majorization algorithm for simultaneous parameter estimation in robust exploratory factor analysis. Computational Statistics & Data Analysis, 54, 3348–3358.
24. Young, A. G., & Pearce, S. (2013). A beginner's guide to factor analysis: Focusing on exploratory factor analysis. Quantitative Methods for Psychology, 9, 79–94.