Behaviormetrics: Quantitative Approaches to Human Behavior Volume 13
Series Editor Akinori Okada, Professor Emeritus, Rikkyo University, Tokyo, Japan
This series covers in their entirety the elements of behaviormetrics, a term that encompasses all quantitative approaches of research to disclose and understand human behavior in the broadest sense. The term includes the concept, theory, model, algorithm, method, and application of quantitative approaches, from theoretical or conceptual studies to empirical or practical application studies, to comprehend human behavior. The Behaviormetrics series deals with a wide range of topics of data analysis and of developing new models, algorithms, and methods to analyze these data.

The characteristics featured in the series have four aspects. The first is the variety of methods utilized in data analysis and newly developed methods, including not only standard or general statistical methods and psychometric methods traditionally used in data analysis, but also cluster analysis, multidimensional scaling, machine learning, correspondence analysis, biplot, network analysis and graph theory, conjoint measurement, biclustering, visualization, and data and web mining. The second aspect is the variety of types of data, including ranking, categorical, preference, functional, angle, contextual, nominal, multi-mode multi-way, continuous, discrete, high-dimensional, and sparse data. The third comprises the varied procedures by which the data are collected: by survey, experiment, sensor devices, purchase records, and other means. The fourth aspect of the Behaviormetrics series is the diversity of fields from which the data are derived, including marketing and consumer behavior, sociology, psychology, education, archaeology, medicine, economics, political and policy science, cognitive science, public administration, pharmacy, engineering, urban planning, agriculture and forestry science, and brain science.

In essence, the purpose of this series is to describe the new horizons opening up in behaviormetrics: approaches to understanding and disclosing human behaviors both in the analyses of diverse data by a wide range of methods and in the development of new methods to analyze these data.

Editor in Chief: Akinori Okada (Rikkyo University)
Managing Editors: Daniel Baier (University of Bayreuth), Giuseppe Bove (Roma Tre University), Takahiro Hoshino (Keio University)
More information about this series at https://link.springer.com/bookseries/16001
Kojiro Shojima
Test Data Engineering: Latent Rank Analysis, Biclustering, and Bayesian Network
Research Division, National Center for University Entrance Examinations, Tokyo, Japan
This work was supported by JSPS KAKENHI Grant Number JP20K03383

ISSN 2524-4027          ISSN 2524-4035 (electronic)
Behaviormetrics: Quantitative Approaches to Human Behavior
ISBN 978-981-16-9985-6          ISBN 978-981-16-9986-3 (eBook)
https://doi.org/10.1007/978-981-16-9986-3

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
To Sachiko, Sāya, and Yūka
Contents

1 Concept of Test Data Engineering  1
  1.1 Measurement as Projection  2
  1.2 Testing as Projection  4
  1.3 Artimage Morphing  7
  1.4 Test Data Engineer  8
  1.5 What Is "Good" Testing?  10
    1.5.1 Context of Measurement  10
    1.5.2 Context of Explanation  10
    1.5.3 Context of Presence  11
  1.6 Book Overview  12

2 Test Data and Item Analysis  15
  2.1 Data Matrix  15
  2.2 Mathematical Expressions of Data  17
    2.2.1 Mathematical Expression of Data Matrix  17
    2.2.2 Mathematical Expression of Item Data Vector  18
    2.2.3 Mathematical Expression of Student Data Vector  19
  2.3 Student Analysis  19
    2.3.1 Total Score  20
    2.3.2 Passage Rate  21
    2.3.3 Standardized Score  23
    2.3.4 Percentile Rank  25
    2.3.5 Stanine  27
  2.4 Single-Item Analysis  28
    2.4.1 Correct Response Rate  28
    2.4.2 Item Odds  30
    2.4.3 Item Threshold  31
    2.4.4 Item Entropy  32
  2.5 Interitem Correct Response Rate Analysis  34
    2.5.1 Joint Correct Response Rate  34
    2.5.2 Conditional Correct Response Rate  36
    2.5.3 Item Lift  39
    2.5.4 Mutual Information  42
  2.6 Interitem Correlation Analysis  43
    2.6.1 Phi Coefficient  44
    2.6.2 Tetrachoric Correlation  45
    2.6.3 Item-Total Correlation  50
    2.6.4 Item-Total Biserial Correlation  52
  2.7 Test Analysis  56
    2.7.1 Simple Statistics of Total Score  56
    2.7.2 Distribution of Total Score  57
  2.8 Dimensionality Analysis  59
    2.8.1 Dimensionality of Correlation Matrix  59
    2.8.2 Multivariate Standard Normal Distribution  61
    2.8.3 Eigenvalue Decomposition  63
    2.8.4 Scree Plot  65
    2.8.5 Eigenvector  67
    2.8.6 Eigenspace  68

3 Classical Test Theory  73
  3.1 Measurement Model  74
  3.2 Reliability  75
  3.3 Parallel Measurement  76
  3.4 Tau-Equivalent Measurement  78
  3.5 Tau-Congeneric Measurement  81
  3.6 Chapter Summary  83

4 Item Response Theory  85
  4.1 Theta (θ): Ability Scale  85
  4.2 Item Response Function  86
    4.2.1 Two-Parameter Logistic Model  87
    4.2.2 Three-Parameter Logistic Model  90
    4.2.3 Four-Parameter Logistic Model  92
    4.2.4 Required Sample Size  93
    4.2.5 Test Response Function  94
  4.3 Ability Parameter Estimation  96
    4.3.1 Assumption of Local Independence  96
    4.3.2 Likelihood of Ability Parameter  98
    4.3.3 Maximum Likelihood Estimate  100
    4.3.4 Maximum a Posteriori Estimate  103
    4.3.5 Expected a Posteriori Estimate  105
    4.3.6 Posterior Standard Deviation  107
  4.4 Information Function  108
    4.4.1 Test Information Function  108
    4.4.2 Item Information Function of 2PLM  111
    4.4.3 Item Information Function of 3PLM  112
    4.4.4 Item Information Function of the 4PLM  113
    4.4.5 Derivation of TIF  114
  4.5 Item Parameter Estimation  116
    4.5.1 EM Algorithm  116
    4.5.2 Expected Log-Posterior Density  117
    4.5.3 Prior Density of Item Parameter  119
    4.5.4 Expected Log-Likelihood  121
    4.5.5 Quadrature Approximation  123
    4.5.6 Maximization Step  125
    4.5.7 Convergence Criterion of EM Cycles  128
    4.5.8 Posterior Standard Deviation  131
  4.6 Model Fit  139
    4.6.1 Analysis Model  140
    4.6.2 Null and Benchmark Models  141
    4.6.3 Chi-Square Statistics  144
    4.6.4 Goodness-of-Fit Indices  148
    4.6.5 Information Criteria  150
  4.7 Chapter Summary  152

5 Latent Class Analysis  155
  5.1 Class Membership Matrix  157
  5.2 Class Reference Matrix  158
  5.3 The EM Algorithm  158
    5.3.1 Expected Log-Likelihood  159
    5.3.2 Expected Log-Posterior Density  162
    5.3.3 Maximization Step  163
    5.3.4 Convergence Criterion of EM Cycles  164
  5.4 Main Outputs of LCA  166
    5.4.1 Test Reference Profile  166
    5.4.2 Item Reference Profile  167
    5.4.3 Class Membership Profile  169
    5.4.4 Latent Class Distribution  172
  5.5 Model Fit  173
    5.5.1 Analysis Model  174
    5.5.2 Null Model and Benchmark Model  174
    5.5.3 Chi-Square Statistics  177
    5.5.4 Model-Fit Indices and Information Criteria  181
  5.6 Estimating the Number of Latent Classes  185
  5.7 Chapter Summary  188

6 Latent Rank Analysis  191
  6.1 Latent Rank Scale  192
    6.1.1 Test Accuracy  192
    6.1.2 Test Discrimination  193
    6.1.3 Test Resolution  194
    6.1.4 Selection and Diagnostic Tests  194
    6.1.5 Accountability and Qualification Test  195
    6.1.6 Rank Reference Matrix  197
  6.2 Estimation by Self-organizing Map  198
    6.2.1 Winner Rank Selection  199
    6.2.2 Data Learning by Rank Reference Matrix  201
    6.2.3 Prior Probability in Winner Rank Selection  204
    6.2.4 Results Under SOM Learning  205
  6.3 Estimation by Generative Topographic Mapping  208
    6.3.1 Update of Rank Membership Matrix  209
    6.3.2 Rank Membership Profile Smoothing  210
    6.3.3 Update of Rank Reference Matrix  216
    6.3.4 Convergence Criterion  218
    6.3.5 Results Under GTM Learning  219
  6.4 IRP Indices  223
    6.4.1 Item Location Index  223
    6.4.2 Item Slope Index  224
    6.4.3 Item Monotonicity Index  226
    6.4.4 Section Summary  227
  6.5 Can-Do Chart  229
  6.6 Test Reference Profile  232
    6.6.1 Ordinal Alignment Condition  233
  6.7 Latent Rank Estimation  234
    6.7.1 Estimation of Rank Membership Profile  234
    6.7.2 Rank-Up and Rank-Down Odds  235
    6.7.3 Latent Rank Distribution  239
  6.8 Model Fit  240
    6.8.1 Benchmark Model and Null Model  241
    6.8.2 Chi-square Statistics  244
    6.8.3 DF and Effective DF  246
    6.8.4 Model-Fit Indices and Information Criteria  249
  6.9 Estimating the Number of Latent Ranks  252
  6.10 Chapter Summary and Supplement  256
    6.10.1 LRA-SOM Vs LRA-GTM  256
    6.10.2 Equating Under LRA  257
    6.10.3 Differences Between IRT, LCA, and LRA  257

7 Biclustering  259
  7.1 Biclustering Parameters  262
    7.1.1 Class Membership Matrix  262
    7.1.2 Field Membership Matrix  262
    7.1.3 Bicluster Reference Matrix  263
  7.2 Parameter Estimation  264
    7.2.1 Likelihood and Log-Likelihood  265
    7.2.2 Estimation Framework Using EM Algorithm  266
    7.2.3 Expected Log-Likelihood  268
    7.2.4 Expected Log-Posterior Density  270
    7.2.5 Maximization Step  272
    7.2.6 Convergence Criterion of EM Cycles  273
  7.3 Ranklustering  274
    7.3.1 Parameter Estimation Procedure  275
    7.3.2 Class Membership Profile Smoothing  276
    7.3.3 Other Different Points  281
  7.4 Main Outputs  283
    7.4.1 Bicluster Reference Matrix  283
    7.4.2 Field Reference Profile  285
    7.4.3 Test Reference Profile  287
    7.4.4 Field Membership Profile  292
    7.4.5 Field Analysis  292
    7.4.6 Rank Membership Profile  295
    7.4.7 Latent Rank Distribution  299
    7.4.8 Rank Analysis  300
  7.5 Model Fit  303
    7.5.1 Benchmark Model  304
    7.5.2 Null Model  308
    7.5.3 Analysis Model  312
    7.5.4 Effective Degrees of Freedom of Analysis Model  314
    7.5.5 Section Summary Thus Far  317
    7.5.6 Model-Fit Indices  320
    7.5.7 Information Criteria  326
  7.6 Optimal Number of Fields and Ranks  328
  7.7 Confirmatory Ranklustering  333
  7.8 Infinite Relational Model  336
    7.8.1 Binary Class Membership Matrix and Latent Class Vector  337
    7.8.2 Binary Field Membership Matrix and Latent Field Vector  338
    7.8.3 Estimation Process  339
    7.8.4 Gibbs Sampling for Latent Class  339
    7.8.5 Gibbs Sampling for Latent Field  343
    7.8.6 Convergence and Parameter Estimation  344
    7.8.7 Section Summary  347
  7.9 Chapter Summary  347

8 Bayesian Network Model  349
  8.1 Graph  350
    8.1.1 Terminology  351
    8.1.2 Adjacency Matrix  352
    8.1.3 Reachability  354
    8.1.4 Distance and Diameter  357
    8.1.5 Connected Graph  358
    8.1.6 Simple Graph  362
    8.1.7 Acyclic Graph  364
    8.1.8 Section Summary and Directed Acyclic Graph  366
  8.2 D-Separation  369
    8.2.1 Chain Rule  369
    8.2.2 Serial Connection  371
    8.2.3 Diverging Connection  373
    8.2.4 Converging Connection  375
    8.2.5 Section Summary  377
  8.3 Parameter Learning  378
    8.3.1 Parameter Set  378
    8.3.2 Parent Item Response Pattern  385
    8.3.3 Likelihood  389
    8.3.4 Posterior Density  393
    8.3.5 Maximum a Posteriori Estimate  395
  8.4 Model Fit  400
    8.4.1 Analysis Model  400
    8.4.2 Null Model  402
    8.4.3 Benchmark Model  403
    8.4.4 Chi-Square Statistic and Degrees of Freedom  406
    8.4.5 Section Summary  407
  8.5 Structure Learning  407
    8.5.1 Genetic Algorithm  409
    8.5.2 Population-Based Incremental Learning  413
    8.5.3 PBIL Structure Learning  416
  8.6 Chapter Summary  420

9 Local Dependence Latent Rank Analysis  421
  9.1 Local Independence and Dependence  421
    9.1.1 Local Independence  422
    9.1.2 Global Independence  424
    9.1.3 Global Dependence  425
    9.1.4 Local Dependence  427
    9.1.5 Local Dependence Structure (Latent Rank DAG)  429
  9.2 Parameter Learning  430
    9.2.1 Rank Membership Matrix  433
    9.2.2 Temporary Estimate of Smoothed Membership  433
    9.2.3 Local Dependence Parameter Set  435
    9.2.4 Likelihood  438
    9.2.5 Posterior Distribution  442
    9.2.6 Estimation of Parameter Set  443
    9.2.7 Estimation of Rank Membership Matrix  449
    9.2.8 Marginal Item Reference Profile  452
    9.2.9 Test Reference Profile  458
  9.3 Model Fit  459
    9.3.1 Benchmark Model  459
    9.3.2 Null Model  461
    9.3.3 Analysis Model  462
    9.3.4 Section Summary  465
  9.4 Structure Learning  465
    9.4.1 Adjacency Array  466
    9.4.2 Number of DAGs and Topological Sorting  468
    9.4.3 Population-Based Incremental Learning  474
    9.4.4 Structure Learning by PBIL  480
  9.5 Chapter Summary  488

10 Local Dependence Biclustering  491
  10.1 Parameter Learning  492
    10.1.1 Results of Ranklustering  493
    10.1.2 Dichotomized Field Membership Matrix  495
    10.1.3 Local Dependence Parameter Set  497
    10.1.4 PIRP Array  502
    10.1.5 Likelihood  504
    10.1.6 Posterior Density  506
    10.1.7 Estimation of Parameter Set  506
  10.2 Outputs  510
    10.2.1 Field PIRP Profile  510
    10.2.2 Rank Membership Matrix Estimate  513
    10.2.3 Marginal Field Reference Profile  516
    10.2.4 Test Reference Profile  521
    10.2.5 Model Fit  522
  10.3 Structure Learning  524
  10.4 Chapter Summary  526

11 Bicluster Network Model  527
  11.1 Difference Between BINET and LDB  528
  11.2 Parameter Learning  531
    11.2.1 Binary Field Membership Matrix  533
    11.2.2 Class Membership Matrix  534
    11.2.3 Parent Student Response Pattern  535
    11.2.4 Local Dependence Parameter Set  538
    11.2.5 Parameter Selection Array  541
    11.2.6 Likelihood  542
    11.2.7 Posterior Distribution  543
    11.2.8 Estimation of Parameter Set  544
  11.3 Output and Interpretation  547
    11.3.1 Class Membership Matrix Estimate  548
    11.3.2 Interpretation for Parameter Estimates  552
    11.3.3 Marginal Field Reference Profile  556
    11.3.4 Test Reference Profile  560
    11.3.5 Model Fit Based on Benchmark Model  562
    11.3.6 Model Fit Based on Saturated Models  564
  11.4 Structure Learning  566
  11.5 Chapter Summary  568

References  571
Abbreviations, Symbols, and Notations
Abbreviations

1PLM  1-parameter logistic model
2PLM  2-parameter logistic model
3PLM  3-parameter logistic model
4PLM  4-parameter logistic model
AIC  Akaike information criterion
AM  Analysis model
AMDS  Asymmetric multidimensional scaling
BIC  Bayesian information criterion
BINET  Bicluster network model
BM  Benchmark model
BNM  Bayesian network model
BRM  Bicluster reference matrix
BVN  Bivariate normal
CAIC  Consistent AIC
CAT  Computer adaptive test
CBT  Computer-based test
CCRR  Conditional correct response rate
CDF  Cumulative distribution function
CDM  Cognitive diagnostic model
CFA  Confirmatory factor analysis
CFI  Comparative fit index
CIRR  Conditional incorrect response rate
CMD  Class membership distribution
CP  Conditional probability
CRM  Class reference matrix
CRP  Chinese restaurant process
CRR  Correct response rate
CRV  Class reference vector
CTT  Classical test theory
DAG  Directed acyclic graph
DF  Degrees of freedom
DIF  Differential item functioning
EAP  Expected a posteriori
EDF  Effective degrees of freedom
EFA  Exploratory factor analysis
ELL  Expected log-likelihood
ELP  Expected log-posterior
EM algorithm  Expectation-maximization algorithm
FRP  Field reference profile
GA  Genetic algorithm
GTM  Generative topographic mapping
ICC  Item characteristic curve (=IRF)
IFI  Incremental fit index
IIF  Item information function
IQR  Interquartile range
IRC  Item-remainder correlation
IRF  Item response function
IRM  Infinite relational model
IRP  Item reference profile
IRR  Incorrect response rate
IRT  Item response theory
ITC  Item-total correlation
JCRR  Joint correct response rate
JP  Joint probability
KR-20  Kuder-Richardson formula 20 (reliability coefficient)
LCA  Latent class analysis
LCD  Latent class distribution
LDB  Local dependence biclustering
LD-LCA  Local dependence LCA
LD-LRA  Local dependence LRA
LDR  Local dependence ranklustering
LFA  Latent field analysis
LFD  Latent field distribution
LHS  Left-hand side
LRA  Latent rank analysis
LRD  Latent rank distribution
LRE  Latent rank estimate
MAP  Maximum a posteriori
MIR  Monotonic IRP ratio
ML  Maximum likelihood
MLE  Maximum likelihood estimate (or estimation)
MVN  Multivariate normal
NaN  Not a number
NFI  Normed fit index
NM  Null model
NP  Number of parameters
NRS  Number-right score
PBIL  Population-based incremental learning
PDF  Probability density function
PIRP  Parent item response pattern
POS data  Point of sales data
PPT  Paper-and-pencil test
PSD  Posterior standard deviation
PSR  Passing student rate
PSRP  Parent student response pattern
RDO  Rank-down odds
RFI  Relative fit index
RHS  Right-hand side
RMD  Rank membership distribution
RMP  Rank membership profile
RMSEA  Root mean square error of approximation
RRM  Rank reference matrix
RRV  Rank reference vector
RUO  Rank-up odds
SD  Standard deviation
SE  Standard error
SEM  Structural equation modeling
SOAC  Strongly ordinal alignment condition
SOM  Self-organizing map
TCC  Test characteristic curve (=TRF)
TDA  Test data analysis
TDE  Test data engineering
TF  True or false
TIF  Test information function
TLI  Tucker-Lewis index
TPS  Total passing students
TRF  Test response function
TRP  Test reference profile
TVN  Trivariate normal
WOAC  Weakly ordinal alignment condition
Symbols

∧  And (logical conjunction)
∨  Or (logical disjunction)
∼  X ∼ D implies "X is randomly sampled from distribution D"
≈  Approximately equal to
∝  Proportional to
:=  Defined as: e.g., LHS := RHS implies "LHS is defined as RHS"
⇔  If and only if: A ⇔ B implies "A is true (false) if and only if B is true (false)"
≷  For real numbers a, b, c, and d, a ≷ b ⇔ c ≷ d implies "a > b if c > d, and a < b if c < d"
∴  Therefore
∵  Because
|·|  Absolute value: for a real number a, |a| is the non-negative value of a, e.g., |−3.14| = 3.14 and |2.72| = 2.72 (unchanged when a is non-negative)
‖·‖  Norm: in this book, it denotes a vector's Euclidean distance from the origin; for a vector $a = [a_1 \cdots a_J]'$, $\|a\| = \sqrt{\sum_{j=1}^{J} a_j^2}$
⌈·⌉  Ceiling function: rounds the argument to the smallest integer not smaller than the argument, e.g., ⌈3.14⌉ = 4
⌊·⌋  Floor function: rounds the argument to the largest integer not larger than the argument, e.g., ⌊3.14⌋ = 3
round(·)  Rounding (half away from 0) function: rounds the argument to the nearest integer, e.g., round(3.4) = 3, round(3.5) = 4, and round(−3.5) = −4
∏  Repeated multiplication: for a series of numbers $\{a_1, a_2, \cdots, a_N\}$, $\prod_{n=3}^{6} a_n = a_3 \times a_4 \times a_5 \times a_6$
!  Factorial: $n! = \prod_{i=1}^{n} i = n \times (n-1) \times (n-2) \times \cdots \times 2 \times 1$
⊙  Hadamard product: conducts element-wise product of two matrices of the same size; if $A = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{bmatrix}$ and $B = \begin{bmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \end{bmatrix}$, then $A \odot B = \begin{bmatrix} a_{11}b_{11} & a_{12}b_{12} & a_{13}b_{13} \\ a_{21}b_{21} & a_{22}b_{22} & a_{23}b_{23} \end{bmatrix}$
⊘  Hadamard division: conducts element-wise division of two matrices of the same size; if $A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}$ and $B = \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix}$, then $A \oslash B = \begin{bmatrix} a_{11}/b_{11} & a_{12}/b_{12} \\ a_{21}/b_{21} & a_{22}/b_{22} \end{bmatrix}$
A^{∘b}  Hadamard power: raises each element of a matrix to a power; for a matrix $A = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{bmatrix}$ and real number b, $A^{\circ b} = \begin{bmatrix} a_{11}^b & a_{12}^b & a_{13}^b \\ a_{21}^b & a_{22}^b & a_{23}^b \end{bmatrix}$; when b = 1/2, it is particularly referred to as the Hadamard root
∀  For all: ∀j ∈ N_J implies "for all numbers in the set of natural numbers N_J = {1, 2, ⋯, J}," which means "for all items" in this book; similarly, ∀s ∈ N_S stands for "for all students"
∈  In or member (element) of: 2 ∈ N implies "2 is a member of the natural numbers"; 0.3 ∈ [0, 1] represents "0.3 is in (the range of) 0 to 1"
\  Set difference or relative complement: if A = {a1, a2, a3, a4} and B = {a1, a3}, then A\B = {a2, a4}
arg max  x̂ = arg max_{x ∈ R} f(x) implies x̂ is the value in R that maximizes f
cl(·)  Clip function or clamp function: cl(x) = x when x_min ≤ x ≤ x_max, cl(x) = x_min when x < x_min, and cl(x) = x_max when x > x_max, where (x_min, x_max) = (0, 1) in this book
det(·)  Determinant: det(A) denotes the determinant of matrix A
diag(·)  Extracts the diagonal elements of a square matrix and vectorizes them: for $A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}$, $\mathrm{diag}(A) = \begin{bmatrix} a_{11} \\ a_{22} \end{bmatrix}$
E[·]  Expectation: the expectation or expected value of a random variable X, denoted E[X], is intuitively the arithmetic mean of a large number of independently sampled realizations of X; the expectation of a discrete random variable is the probability-weighted sum of all possible outcomes
exp(·)  Exponential: for a real number a, exp(a) is the same as e^a, where e = 2.71828⋯ is Napier's constant; exp(3) = 2.71828⋯³ ≈ 20.09
ln(·)  Natural logarithm: the logarithm with base e = 2.71828⋯; ln a = log_e a
row(·)  Returns the number of rows of the argument (matrix): e.g., row(A) = 3 when A is a 3 × 2 matrix; returns the number of elements when the argument is a vector
sgn(·)  Signum (or sign) function: returns the argument's sign, i.e., sgn(x) = 1 if x > 0, sgn(x) = 0 if x = 0, and sgn(x) = −1 if x < 0
tr(·)  Trace: the sum of the diagonal elements of the argument (a square matrix); when A is a J × J matrix, $\mathrm{tr}(A) = \sum_{j=1}^{J} a_{jj}$
N  The set of all natural numbers, {1, 2, 3, ⋯}: the set with subscript n (N_n) represents the set of natural numbers from 1 to n, {1, 2, ⋯, n}
∅  The empty set: the set that has no elements; the size of this set is zero
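For readers who want a quick numerical check of the Hadamard operations defined above, the following sketch (in Python, purely illustrative and not part of the book) computes the element-wise product, division, and root of two small matrices:

```python
# Element-wise (Hadamard) product, division, and power for 2 x 2 matrices.
A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]

had_prod = [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]
had_div  = [[a / b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]
had_root = [[a ** 0.5 for a in ra] for ra in A]  # Hadamard power with b = 1/2

print(had_prod)  # [[5.0, 12.0], [21.0, 32.0]]
print(had_div)   # [[0.2, 0.333...], [0.428..., 0.5]]
print(had_root)  # [[1.0, 1.414...], [1.732..., 2.0]]
```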
Notations

1_n  A vector of size n in which all elements are 1: 1_3 = [1 1 1]′
A = {a_jk}  Adjacency matrix, where a_jk = 1 when an edge Item j → Item k exists; otherwise a_jk = 0
a_j  Slope parameter of Item j (IRT)
α̃_j  Slope index of Item j (LRA): the lower rank of the two adjacent ranks between which the CRR increases the most in the IRP of Item j
ã_j  CRR difference between Ranks α̃_j and α̃_j + 1 (LRA)
b_j  Location parameter of Item j (IRT)
β̃_j  Location index of Item j (LRA): the rank whose CRR is the closest to 0.5
b̃_j  CRR at Rank β̃_j in the IRP of Item j (LRA)
C  Number of latent classes (LCA)
c_j  Lower asymptote parameter of Item j (IRT)
γ̃_j  Monotonicity index of Item j (LRA): the ratio of the number of adjacent rank pairs between which the CRR decreases to the total number of adjacent rank pairs
c̃_j  Cumulative decrease over the adjacent rank pairs between which the CRR decreases (LRA)
d_j  Upper asymptote parameter of Item j
Δ  A diagonal matrix in which the eigenvalues are arranged in descending order
δ_j  j-th largest eigenvalue, or the j-th diagonal element of Δ
E = {e_j}  A matrix in which the j-th row vector is e_j′, the eigenvector corresponding to δ_j
f_N(·; μ, σ)  PDF of the normal distribution with mean μ and SD σ (variance σ²); f_N(·; 0, 1) is the PDF of the standard normal distribution
F_N(·; μ, σ)  CDF of the normal distribution with mean μ and SD σ; F_N(·; 0, 1) is the CDF of the standard normal distribution
F_N⁻¹(·; μ, σ)  Quantile function or inverse CDF of the normal distribution with mean μ and SD σ; F_N⁻¹(·; 0, 1) is the quantile function of the standard normal distribution
f_N2(·, ·; ρ)  PDF of the standard bivariate normal distribution with correlation ρ
Γ = {γ_jfcd}  Four-dimensional PSRP array (BINET), where γ_jfcd is 1 when Item j is the d-th item in Field f and Class c is locally dependent at that field; otherwise γ_jfcd is 0
Γ = {γ_sjd}  Three-dimensional PIRP array (BNM), where γ_sjd is 1 when the response pattern of Student s for Item j's parent item(s) is the d-th pattern; otherwise γ_sjd is 0
Γ = {γ_srjd}  Four-dimensional PIRP array (LD-LRA), where γ_srjd is 1 when the response pattern of Student s in Rank r for Item j's parent item(s) is the d-th pattern; otherwise γ_srjd is 0
Γ = {γ_srfd}  Four-dimensional PIRP array (LDB), where γ_srfd is 1 when the NRS of Student s in Rank r for the items classified in Field f's parent field(s) is the d-th pattern; otherwise γ_srfd is 0
I_n  Identity matrix of dimensions n × n
I_j(λ)  Posterior information matrix of Item j, where λ is an item parameter vector (IRT)
I_j^(F)(λ)  Fisher information matrix of Item j, where λ is an item parameter vector (IRT)
I^(pr)(λ)  Prior information matrix, where λ is an item parameter vector (IRT)
J  Number of items (test length)
λ_j  Item parameter vector of Item j (IRT)
Λ  Item parameter matrix containing λ_j in the j-th row (IRT)
M_C = {m_sc}  Class membership matrix, where the (s, c)-th element m_sc represents the membership (probability) that Student s belongs to Class c (LCA)
M_F = {m_jf}  Field membership matrix, where m_jf represents the membership (probability) that Item j is classified in Field f (biclustering)
M_G = {m_sg}  Group membership matrix, where m_sg is 1 if Student s belongs to Group g; otherwise m_sg is 0
M_R = {m_sr}  Rank membership matrix, where m_sr represents the membership (probability) that Student s belongs to Rank r (LRA)
o = {o_j}  Item odds vector, in which o_j is the odds of Item j
p = {p_j}  CRR vector, in which p_j is the CRR of Item j
p^(w) = {p_j^(w)}  Item mean vector, in which p_j^(w) is the mean of Item j
P_C = {p_k|j}  CCRR matrix: an asymmetric matrix (P_C ≠ P_C′) with all diagonal elements being 1, where the (j, k)-th entry p_k|j is the CRR for Item k of the students passing Item j
P_G = {p_jg}  Group reference matrix, where p_jg is the CRR for Item j of the students belonging to Group g
P_J = {p_jk}  JCRR matrix: a symmetric matrix (P_J = P_J′) whose (j, k)-th element p_jk is the JCRR of Items j and k, and whose j-th diagonal element is the CRR of Item j
P_L = {l_jk}  Item lift matrix: a symmetric matrix (P_L = P_L′), where l_jk is the lift of Item j → Item k
P(θ; λ_j)  IRF of Item j with item parameter λ_j (IRT), also denoted P_j(θ)
Q(θ; λ_j)  Incorrect response function of Item j, identical to 1 − P(θ; λ_j) (IRT), also denoted Q_j(θ)
Φ = {φ_jk}  A symmetric matrix (Φ = Φ′) with diagonal elements 1, where φ_jk is the φ coefficient between Items j and k
Π_B = {π_cf}  Bicluster reference matrix (biclustering), where π_cf is the CRR of Class c students for a Field f item
Π_C = {π_cj}  Class reference matrix (LCA), where π_cj is the CRR of Class c students for Item j
Π_DC = {π_fcj}  Local dependence parameter set (BINET), where π_fcj is the PSR of Class c students for Item j in Field f
Π_DF = {π_rfd}  Local dependence parameter set (LDB), where π_rfd is the CRR for a Field f item of Rank r students whose PIRP is the d-th pattern
Π_DI = {π_rjd}  Local dependence parameter set (LD-LRA), where π_rjd is the CRR for Item j of Rank r students whose PIRP is the d-th pattern
Π = {π_jd}  Parameter set (BNM), where π_jd is the CRR for Item j of the students whose PIRP is the d-th pattern
Π_R = {π_jr}  Rank reference matrix (LRA), where π_jr is the CRR of Rank r students for Item j
R  Number of latent ranks (LRA)
R  (Tetrachoric) correlation matrix
r = {r_s}  Student passage rate vector, where r_s is the passage rate of Student s for the test items
r^(w) = {r_s^(w)}  Student scoring rate vector, where r_s^(w) is the scoring rate of Student s
ρ_ζ = {ρ_j,ζ}  Item-total correlation vector, where the j-th element ρ_j,ζ is the item-total correlation of u_j with ζ
S  Sample size (number of students)
t = {t_s}  Student NRS vector, where t_s is the NRS of Student s
t^(w) = {t_s^(w)}  Student total score vector, where t_s^(w) is the total score of Student s
t_TRP  TRP, the column sum vector of a class/rank/bicluster reference matrix, representing the expected NRS on the test
t_TRP^(w)  Weighted TRP, representing the expected score on the test
τ = {τ_s}  Student true score vector, where τ_s is the true score of Student s
θ = {θ_s}  Student ability value vector, where θ_s is the ability value of Student s (IRT)
U = {u_sj}  Data matrix, where u_sj is the response of Student s to Item j, coded 1 if correct, 0 if incorrect, and "." (dot) if missing (the item is not presented to the student)
U = {u_sj}  PIRP matrix, where the (s, j)-th entry u_sj is d when Student s's PIRP for Item j is the d-th pattern
u_j  Data vector of Item j, or the j-th column vector in U
u_s  (Vertical) vector of Student s's data, arranged in the s-th row of U
V^(w)  Variance of total scores, or Var[t^(w)]
V^(τ)  Variance of true scores, or Var[τ]
V^(e)  Error variance, or Var[e]
v = {v_j}  Item variance vector, where v_j is the variance of Item j
w = {w_j}  Item weight vector, where w_j is the weight of Item j
Z = {z_sj}  Missing indicator matrix, where z_sj is 1 if Item j is presented to Student s; otherwise z_sj is 0
z_j  Missing indicator vector for Item j, or the j-th column vector of Z
z_s  (Vertical) vector of Student s's missing indicators, arranged in the s-th row of Z
ζ = {ζ_s}  Vector of standardized scores of the students' passage rates, where ζ_s is the standardized score of Student s's passage rate
ζ^(w) = {ζ_s^(w)}  Vector of standardized scores of the students' scoring rates, where ζ_s^(w) is the standardized score of Student s's scoring rate
Chapter 1
Concept of Test Data Engineering
When managing a test, item writing (or question making) and scoring (or marking) are the most visible segments; however, a test involves a whole package of design, including temporal planning and spatial layout, item writing, implementation, scoring, analysis, evaluation, and feedback. Among these, the term "test data analysis" is an expression that focuses on the analysis component, whereas "test data engineering," the concept considered in this book, represents a broader perspective.

Simply put, tests are tools used in society. They are highly public in nature; thus, they can be said to be public tools. They sometimes affect the future of test takers, most of whom are children and young people. Tests should therefore be neither groundless nor irresponsible, and adults must properly summarize the information contained in the test data and provide it back to the test takers. In this book, test takers are referred to as "students."

The primary information obtained from a test administration is the raw data, i.e., the students' responses. If an item is "What is the capital of the USA?," the responses are the uncoded raw responses of the students, such as "New York," "Washington, D.C.," and "London." The responses include "missing data" and "nonresponses," which are also important information about the students.

Secondary information is the marked (coded) data. Except for essay questions,1 each test datum is trichotomous: correct, incorrect, or missing. Correct and incorrect responses are normally coded as 1 and 0, respectively, while a missing datum is denoted by various symbols such as 99, −1, or "." (dot). In addition, a nonresponse is usually regarded as an incorrect response and coded as 0 (see also Table 2.2, p. 16). This book mainly focuses on analyzing the secondary information (post-coded data).
1 This book does not deal with essay questions. An essay question is often graded from multiple points of view. For example, if an essay item has five viewpoints, and each viewpoint is assigned three points, the scale of the essay item is 0 to 15.
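To make the coding step above concrete, here is a minimal sketch (in Python; the function, variable names, and toy answer key are illustrative assumptions, not from the book) of processing raw responses into the trichotomous codes: 1 for correct, 0 for incorrect and for nonresponses, and "." for an item not presented:

```python
# A minimal sketch of coding raw responses into secondary information.
# NONRESPONSE marks a blank answer to a presented item (coded 0),
# while None marks an item that was not presented (coded "." as missing).

NONRESPONSE = ""  # the student saw the item but left it blank

def code_response(raw, key):
    """Code one raw response against the answer key."""
    if raw is None:          # item not presented to this student
        return "."           # missing datum
    if raw == NONRESPONSE:   # nonresponse is treated as incorrect
        return 0
    return 1 if raw == key else 0

# Toy example: one item, "What is the capital of the USA?"
key = "Washington, D.C."
raw_responses = ["New York", "Washington, D.C.", "London", NONRESPONSE, None]
coded = [code_response(r, key) for r in raw_responses]
print(coded)  # [0, 1, 0, 0, '.']
```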
Fig. 1.1 Analogy of measurement
Generally, changing the state of something by retouching it is called "processing"; therefore, changing the primary information (raw data) into the secondary information (coded data) is also considered a process. In addition, counting the number of correct responses for each student and calculating the correct response rate for each item are also processes; their results may be referred to as "tertiary information briefing the secondary information" and are often used as the summary and feedback returned to the students and the teacher who administered the test. In other words, a test administration is a series of processes pertaining to collecting, coding, scoring, and transforming data.
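As a sketch of these summarizing processes (a hypothetical toy example; the data matrix is defined formally in Chap. 2), each student's total score is a row sum of the coded data matrix, and each item's correct response rate is a column mean:

```python
# A toy 4-students x 3-items coded data matrix (1 = correct, 0 = incorrect).
U = [
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
]

# Tertiary information briefing the secondary information:
total_scores = [sum(row) for row in U]            # per student
n_students = len(U)
crr = [sum(row[j] for row in U) / n_students      # per item
       for j in range(len(U[0]))]

print(total_scores)  # [2, 3, 1, 2]
print(crr)           # [0.75, 0.5, 0.75]
```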
1.1 Measurement as Projection

Generally, measurement is the action of assigning a numerical value to an object.2 For example, suppose that Mr. Smith's height is measured and found to be 173 cm (5′8″); this can be formulated as

\[ f_{\text{SCALE}}(\text{MR. SMITH}) = 173\ \text{(cm)}. \tag{1.1} \]

This indicates that the height scale can be regarded as a function that outputs "173 (cm)" for the input "Mr. Smith." The function f_SCALE can also be considered a machine that extracts only his body length (Fig. 1.1, left) but discards all other information, such as gender, age, nationality, ethnicity, birthday, address, and hobbies.
2 For a historical background on educational measurement theory, see Clauser and Bunch (2022). This chapter describes the author's own thinking.
Fig. 1.2 Measurement as projection
Equation (1.1) can alternatively be represented as

\[ \text{MR. SMITH} \xrightarrow{f_{\text{SCALE}}} 173\ \text{(cm)}. \]

The two formulations are different representations conveying the same meaning, where $A \xrightarrow{f} B$ means that A is transformed (or, in technical terms, mapped) to B by f. In a more literal sense, it can be read as "Painter f studied Scenery A and drew Picture B," or "when A was illuminated from the direction of f, the shadow (image) of B appeared behind it." The latter expression is technically referred to as a projection (Fig. 1.1, right).

As Fig. 1.2 (left) illustrates, illuminating an object from various angles, examining the shadows (flattened images on the respective walls) of the object, and inferring the original shape of the object is referred to as projection pursuit. In particular, when the shape of an object is complex and multidimensional, the object cannot be observed directly; in this case, the original shape can be imagined by inspecting the shadow images cast by light from various directions.

Measurement is thus considered a projection. "Mr. Smith" is essentially a high-dimensional input composed of various pieces of information.3 Thus, it is necessary to illuminate (measure) Mr. Smith from various angles to understand him (see Fig. 1.2, right). When Mr. Smith is illuminated by the projector (i.e., from the direction) "height meter," the image "173 cm" is cast on the wall; if he is illuminated by "weight scale," then "76.5 kg" is cast on the wall. Thus, we can depict the true shape of Mr. Smith by illuminating him with more projectors. A set of measurements (i.e., projectors) is referred to as an "assessment." In a medical assessment, the doctor obtains the shadows of Mr. Smith's health status, an extremely high-dimensional object, using multiple projectors such as blood tests, urinalysis, radiography, and echocardiography, and then integrates the shadows and diagnoses the status.

3 It contains biological (gender, age, height, weight, eyesight, skin color, grip strength, voice pitch), socioeconomic (address, educational history, income, number of siblings, marital status), psychological (agreeableness, conscientiousness, extraversion, neuroticism, openness), and intellectual (language, mathematics, science, history) characteristics.
1.2 Testing as Projection

Likewise, a test can be viewed as a function or projection. Suppose Mr. Smith took a test and scored 73; this situation can be represented as

\[ f_{\text{TEST}}(\text{MR. SMITH}) = 73\ \text{(points)}. \tag{1.2} \]

This test can also be regarded as a projector illuminating Mr. Smith and casting the image "73." If the test is a math test, the number expresses his mathematical ability. Moreover, some tests may return a grade such as A, B, or C. Note that f_TEST extracts only the object of interest (i.e., mathematical ability) and discards (or neglects) all other information.

Let us express Eq. (1.2) in finer detail. Because the test consists of a number of items (e.g., 100 items), a correct/incorrect (1/0) response pattern for the 100 items is first output by f_TEST, as follows:

\[ f_{\text{TEST}}(\text{MR. SMITH}) = \overbrace{11001011011 \cdots 1}^{100\ \text{items}}. \]

This example shows that Mr. Smith passed 73 items (e.g., Items 1, 2, 5, 7, 8, ⋯) and failed the other 27 items. Then, by inputting the response pattern into the function "number-right score," f_NRS, the function finally returns "73" as follows:

\[ f_{\text{NRS}}(11001011011 \cdots 1) = 73\ \text{(points)}. \]

This f_NRS is also regarded as a projector because a different projector casts a different image. For example, if one uses the projector "correct response rate," f_CRR, the response pattern is transformed into

\[ f_{\text{CRR}}(11001011011 \cdots 1) = 0.73. \]

In addition, suppose he did not respond to 12 of the 100 items. By the projector "nonresponse rate," f_NRR, the response pattern is processed into

\[ f_{\text{NRR}}(11001011011 \cdots 1) = 0.12. \]

Both 0.73 and 0.12 are shadows (images) cast by the respective projectors.
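These three projectors can be sketched as plain functions. The sketch below (illustrative Python; the marker 'N' for nonresponses is an assumption made so that the pattern retains the information f_NRR needs, since the printed 0/1 string alone does not show which zeros are nonresponses) reproduces Mr. Smith's three images:

```python
# Mr. Smith's record: 73 correct, 15 incorrect, 12 nonresponses (100 items).
# 'N' marks a nonresponse; it is scored as 0 but counted by f_NRR.
record = ["1"] * 73 + ["0"] * 15 + ["N"] * 12

def f_nrs(rec):   # number-right score
    return sum(1 for r in rec if r == "1")

def f_crr(rec):   # correct response rate
    return f_nrs(rec) / len(rec)

def f_nrr(rec):   # nonresponse rate
    return sum(1 for r in rec if r == "N") / len(rec)

print(f_nrs(record), f_crr(record), f_nrr(record))  # 73 0.73 0.12
```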
Fig. 1.3 Projection by measurement and analysis
More specifically, f_TEST is classified as measurement, while f_NRS, f_CRR, and f_NRR are grouped under analysis. Accordingly, the process of administering a test to obtain student scores basically includes two projections: measurement and analysis. Equation (1.2) is thus decomposed as follows:

\[ f_{\text{NRS}}(f_{\text{TEST}}(\text{MR. SMITH})) = f_{\text{NRS}}(11001011011 \cdots 1) = 73\ \text{(points)}, \]
\[ \text{MR. SMITH} \xrightarrow{f_{\text{TEST}}} (11001011011 \cdots 1) \xrightarrow{f_{\text{NRS}}} 73\ \text{(points)}. \]

These two equations are identical; i.e., they represent the object (Mr. Smith) first being measured as (transformed into, output as, symbolized as, coded to, or induced to) the binary response pattern through the test, and the pattern then being analyzed as (processed into, or calculated as) 73.

Usually, a test is taken by a group of students (Fig. 1.3, left), which is expressed as follows:

\[ f_{\text{TEST}}(\text{STUDENTS}) = \text{DATA}, \quad \text{or} \quad \text{STUDENTS} \xrightarrow{f_{\text{TEST}}} \text{DATA}. \]

In these equations, DATA denotes a data matrix (Sect. 2.1, p. 15), in which Student s's response to Item j is placed in the s-th row and j-th column. The figure also shows that a different aspect (i.e., response pattern) of the students is projected if the same students take a different test, which can be represented as follows:

\[ f_{\text{MATH}}(\text{STUDENTS}) = \text{DATA}_M, \quad \text{or} \quad \text{STUDENTS} \xrightarrow{f_{\text{MATH}}} \text{DATA}_M, \tag{1.3} \]
\[ f_{\text{FRENCH}}(\text{STUDENTS}) = \text{DATA}_F, \quad \text{or} \quad \text{STUDENTS} \xrightarrow{f_{\text{FRENCH}}} \text{DATA}_F. \]
Furthermore, Fig. 1.3 (right) shows that different images (i.e., shadows) are generated even from the same data when different analyses are conducted. In the figure, item response theory (IRT; Chap. 4, p. 85) and the Bayesian network model (BNM; Chap. 8, p. 349) are employed. This situation is represented as follows:

\[ f_{\text{IRT}}(\text{DATA}) = \text{IRF}, \quad \text{or} \quad \text{DATA} \xrightarrow{f_{\text{IRT}}} \text{IRF}, \]
\[ f_{\text{BNM}}(\text{DATA}) = \text{DAG}, \quad \text{or} \quad \text{DATA} \xrightarrow{f_{\text{BNM}}} \text{DAG}. \]

When data are analyzed by IRT, one of the outputs is the item response function (IRF; Sect. 4.2, p. 86).4 In addition, the BNM yields an output known as a directed acyclic graph (DAG; Sect. 8.1.8, p. 366).

A single test administration is a single-facet assessment, whereas a multifacet assessment measures each student via multiple tests. In general, teachers should inspect each student from multiple angles by employing various evaluation methods. For instance, in an entrance exam, practical skills, interviews, and activity history should be integrated with paper-and-pencil tests (PPTs) to determine the treatment (pass or fail) of each applicant. Although each shadow (i.e., image) is a plane, the original shape of the student, as an essentially high-dimensional being, can be approximated from multiple planes.

Design Concept

It may seem a matter of course that different analyses project different images; this is because each analysis method was developed based on a unique design concept. The design concept of a method is its policy, philosophy, or specification of the information to be extracted from the data, based on the purpose of the analysis. It also consists of some detailed assumptions (or constraints). For example, the typical assumptions used in IRT are unidimensionality (Sect. 2.8, p. 59) and local independence (Sect. 4.3.1, p. 96). A representative assumption of the BNM is d-separation (Sect. 8.2, p. 369).

Once a statistical analysis is conducted, some outputs are produced; thus, a large quantity of information can be extracted from the data by the method. However, it should be noted that what one method can extract from the data is only a small part of the entire information contained therein, and a large part is discarded (or neglected), because data contain a vast amount of information. For example, in IRT, information about the local dependency structure that generally exists among items is neglected and abandoned under the assumption of local independence; the local dependency information is shaped (transformed or processed) as if the items were locally independent, and such artificial results are finally output.
4
Various other results can also be output.
1.3 Artimage Morphing
7
Fig. 1.4 Artimage morphing
1.3 Artimage Morphing Measurement and analysis are the processes of retouching the input information and remakeing it into a different product. Specifically, measurement processes an object (phenomenon or event) into an image represented by numbers (i.e., data), and analysis processes the data into images of statistical outputs. In particular, testing, which is a measurement method, transforms the input “students” into the image “numerical array” (i.e., binary true/false data); test data analysis then processes the input “binary data” into images such as “indices,” “figures,” “tables,” “information,” “knowledge,” and “words (including numbers).” Note that all images obtained through such processes are not natural but artificial; thus, they are called artimages Shojima 2007a.5 Figure 1.4 depicts artimage morphing, where an artimage is processed into a differently looking artimage. We obtain information from an object by measuring it and extracting knowledge from the information by analyzing it. However, we should also be aware that we are losing (or discarding) a huge amount of information in the mapping process. If we administer a math test, we obtain information (data) about math ability for the input “students”; however, this is practically synonymous to discarding (or neglecting) all information other than math ability. Similarly, if we analyze the math test data with IRT, we can obtain knowledge such as IRF and ability parameter; however, we lose all the other information. As shown in the bottom part of Fig. 1.4, the more advanced the process, the higher the abstraction level of the information induced from the artimage, which is a merit of the artimage morphing. Every single object (phenomenon, event, or initial input) is normally too complex for direct comprehension. Therefore, we must first replace the input “students” with artimage “data” through testing for increasing levels of abstraction; however, this array of numbers is still extremely complicated and beyond direct interpretation. Therefore, we pass “data” through a statistical model to a further 5
Genzo .
in Japanese, which is a compound word of “phenomenon”
and “artificial”
8
1 Introduction To Applications
increasing level of abstraction by changing it into an understandable artimage (i.e., knowledge). Art The concept of projection is widely applicable. For example, when an artist paints a landscape, or when a poet composes a poem about love, this can be expressed as follows: f PAINTER
f PAINTER (SCENERY) = PICTURE, or SCENERY −−−−→ PICTURE, f POET
f POET (LOVE) = POEM, or LOVE −−→ POEM. The picture or poem is considered to be an artimage processed by the artist or poet, respectively; however, in these cases, we observe that f is filled with the experience and emotion of the painter or poet; therefore, it is not only a process of information abstraction (or impoverishment) but also a process of information enrichment. If the attributes of a student are enumerated successively, the number of attributes can easily exceed a million or billion; thus, each student is a high-dimensional object. However, of these approximately ten subjects can be measured by tests, such as languages, math, sciences, and social studies, and each test contains not more than 100 items. Why do we inquire into only 1000 items from the ten subjects of a student with a billion attributes? The test administrator should be accountable for why and how the 1000 items were selected, and a higher-stakes test6 must be more responsible for this point. Furthermore, even though there are many methods for analyzing data, we usually apply only a few. Why do we select those methods that produce indicators for each student and evaluate the student using the indicators that could determine the future of the student? The test administrator should also be accountable for this point. Though we are given a variety of choices for measuring and analyzing (see Fig. 1.5), we can only trace a few routes in this tree; thus, we should be able to explain why we chose this process.
1.4 Test Data Engineer As seen above, administering a test, obtaining data, and obtaining the results is a process of retouching an object into an artimage after another, which is generally expressed as 6
A high-stakes test is a competitive test such that a high scorer on the test will lead an individual to a high social status (e.g., advanced civil service exam, medical licensing exam, bar exam, and entrance exams for leading universities).
1.4 Test Data Engineer
9
Fig. 1.5 Potentiality of artimage processing
f MEASURE
f MODEL
OBJECT −−−−−→ DATA −−−→ KNOWLEDGE, where f MODEL is a system, machine, or procedure in which a data decoding mechanism (routine) is built. It is generally a statistical model such as IRT and BNM, but it includes a simple method such as counting the correct items for each student and calculating the passing student rate for each item. This process consists of two steps that can be partitioned into smaller steps; for example, the steps of DATA→KNOWLEDGE can be divided as follows: f MODEL
f FEEDBACK
DATA −−−→ OUTPUT −−−−−→ REPORT. In fact, the outputs of IRT or BNM are not friendly to students and teachers. Thus, it is necessary to further process the outputs into something easily comprehensible to them; however, interpreting the outputs depends on the analyst; thus, the report finally fed back to the students and teacher can differ by analyst. Needless to say, testing ends by feeding the report back to the students, teachers, and item writers. A test data analyst is not involved only in the process of DATA → OUTPUT but is required to have a comprehensive perspective from item-writing for test development to report writing for test improvement. A person who is interested, involved, and dependable in all phases in this process can qualify as a test data engineer. Why is the person specifically called a test data “engineer” and not test data researcher, practitioner, expert, specialist, or technician? There are two reasons for this. First, the process from OBJECT to REPORT is analogous to the manufacturing process of a product. Second, engineers are generally pursuers, not of the truth,
10
1 Introduction To Applications
but the good: car engineers are trying to pursue a good car rather than a true car. As mentioned at the beginning of this chapter, tests are (public) tools used in our society. There is no such thing as a true test. What we can do is to pursue a good test that is adaptable to the time, place, situation, and students.
1.5 What Is “Good” Testing? To implement good testing (a package of development, administration, analysis, feedback, and report), we, as test data engineers, must comprehensively consider what and how tests are administered to students (children) in our society and what and how results are fed back to them. To improve testing, the validity, reason, and significance (or raison d’etre) of the testing should be checked in three contexts (Shojima, 2010, 2018).
1.5.1 Context of Measurement The context of measurement refers to the fact that the testing must be checked whether the object measured by the test is exactly the one intended to be measured and whether the object is accurately measured. This concept is similar to test validity (Messick, 1989, 1995) and test reliability (Sect. 3.2). Among the various validities, most important for testing is content validity. As described above, a student’s ability is superhigh-dimensional. The item writers of the test should be accountable for why they focus on a dozen dimensions (select the dozen items). At this point, just because the engineer is familiar with the IRT, they should not compel the item writers to create the items to satisfy the local dependence assumption, which is required for IRT. Item writers are the only people who can develop optimal items with valid content and format (multiple-choice, fill-in-the-blank, or essay) according to the social situation. Item writing is a stage in which subject-teaching experts play a leading role, while test data engineers play a supporting role; however, the latter need to be proficient in various methods to be able to handle (analyze) any type of item.
1.5.2 Context of Explanation The context of explanation means that tests, as public tools, must explain the phenomenon of academic abilities of students in our society.7 The phenomena referred 7
If the test is a psychological questionnaire, the test must explain the psychological phenomenon.
1.5 What Is “Good” Testing?
11
to here include the current status of school students’ academic achievements in our society, their longitudinal change through years, and strong and weak skills compared with students in other countries; in other words, testing is not simply administering a test and evaluating students. A test data engineer has the duty to explain to students, teachers, and policy makers, the present state of students’ academic abilities, the effective means to improve their abilities, and the requisite lifestyle and environment to support such improvements. Testing is a package that inevitably includes feedback, and feedback to each individual makes test data analysis unique compared to other data analyses. Test data engineers should also consider the type of feedback to be created. Desirable Feedback (A) (B) (C) (B)
Map Current location Path to go (recommended route to the goal) Feature of each path
In particular, when explaining (or providing feedback) to students and teachers, the report may refer to the points listed in the above box. Point (A) is a map or world that the test is measuring, which indicates where the start and goals are, and it sketches how paths run between them. This book covers many statistical models that can present some useful maps: latent rank analysis (LRA, Chap. 6), ranklustering (Sect. 7.3), local dependence LRA (Chap. 9), local dependence biclustering (Chap. 10), and bicluster network model (Chap. 11). In addition, as indicated by Point (B), good testing must (accurately) indicate to each student their current location (ability status) in the map. The number-right or weighted scores are the simplest measures for showing where the student is.8 Various other ideas on showing the location will be introduced in the following chapters. Point (C) suggests to each student to select (study) the next path (an item or a bunch of items), which is important in smoothly guiding them to the goal of the map. Point (D) implies that it is essential to understand the characteristics of each path as a component of the map (i.e., test), and this information is preferable for teachers and item writers compared to students.
1.5.3 Context of Presence Finally, the context of existence represents the washback effect (Cheng et al., 2003) in a narrow sense; in simple terms, this means that testing has an impact on society. For example, if a high-stakes test consists of multiple-choice items, cram schools would then teach students techniques for solving multiple-choice items, and the students 8
In this case, the path from 0 to 100 points is regarded as the (unidimensional) map.
12
1 Introduction To Applications
would drill with only such an item type; the way (manner) of testing itself, including the specification, structure, and format of the test, can directly or indirectly orient students’ learning styles and strategies. A higher-stakes test has a stronger washback effect on students’ attitudes toward learning; thus, the method of presenting the test (or simply testing) is important. The presence includes how the test is designed and administered, how data are scored, and how passers are selected. Every minor specification is sent to the students as a message from society (adults). For example, assigning high weights to Russian listening and speaking tests sends explicit or implicit messages that society welcomes (or favors) people who can work with Russians, and if the percentile rank is adopted as the score fed back to each student, then the fact is sent as a message simultaneously as the test administrator underscores the relative position, and not the absolute achievement level, of the student in the examinee group. Some students sit down at their desks day and night to obtain good test scores, while others use tests as an opportunity to make great strides in society. It is a highstakes test that drives them to their desks and influences what they study. Thus, tests must be designed and implemented to have a positive effect on society, with a concrete philosophy and policy about what knowledge the students should acquire. Therefore, good testing, with gradual increase in difficulty levels in terms of items appearing in the tests, can increase the ability of community members.
1.6 Book Overview In the artimage morphing process, this book shows that various kinds of knowledge (artimages) can be processed from the same test data by employing various analysis methods ( f MODEL s). Currently, IRT is the most widely used method for test data analysis. Although IRT is an elaborate statistical theory, it uses only an aspect of data and processes the data into an artimage according to the data decoding rules that compose the theory, while discarding (neglecting) all other aspects. We hope that readers will see and experience various methods and discover the one suitable for enhancing their tests and testing for students and societies. The remainder of the book is structured as follows. Chapter 2, “Test Data and Item Analysis,” introduces the test data and the basics for analyzing test items. Chapter 3 describes the classical test theory which is the first theory that deals with test data as scientific objects. Thereafter, the following chapters will introduce four methods: IRT (Chap. 4), latent class analysis (Chap. 5), LRA (Chap. 6), and biclustering (Chap. 7). They all assume that test items (or fields) are locally independent; however, the models in which items (or fields) are not assumed to be locally independent will be described in the following chapters: BNM (Chap. 8), local dependence LRA (Chap. 9), local dependence biclustering (Chap. 10), and bicluster network model (Chap. 11). However, it is regrettable that the cognitive diagnostic model cannot be covered here. Please refer to Nichols et al. (1995), Leighton & Gierl (2007), Rupp et al. (2010)
1.6 Book Overview
13
for this model. Further, asymmetric multidimensional scaling (e.g., Tobler, 1976– 1977, 1977, Krumhansl, 1978; Weeks & Bentler, 1982; Okada & Imaizumi, 1987, 1997; Chino & Shiraiwa, 1993; Zielman & Heiser, 1993; Chino, 1990; Yadohisa & Niki, 1999; de Rooij & Heiser, 2003; Saito & Yadohisa, 2005; Saburi & Chino, 2008; Okada, 2012; Okada & Tsurumi, 2012, 2013; Shojima, 2012, 2014; Kosugi, 2018) and asymmetric clustering (Hubert, 1973; Takeuchi et al., 2007; Olszewski & Šter, 2014; Okada & Yokoyama, 2015; Bove et al., 2021) are also excluded, which are useful methods for visualizing asymmetric relationships between items.
Chapter 2
Test Data and Item Analysis
This chapter provides a basis for reading this book. The data for each analysis are available at Test Data Engineering Accompanying Website http://shojima.starfree.jp/tde/
2.1 Data Matrix Suppose there are 100 students1 who take a test with 50 items. Then, response data for 100 (students) × 50 (items) are generated. Each individual response is called an item response. The response is either “pass,” “fail,” or “missing,” and is usually coded as “1,” “0,” or “. (dots),”2 respectively. The data matrix, also called the response matrix, is an array of all student item responses in a plane. Table 2.1 shows the most common form of the test data matrix. Each row represents a student’s response data for the 50-item test. For example, the third row contains all 50 item responses from Student 3. In addition, each column shows data of 100 students for the corresponding item. For example, the second column contains the data of 100 students for Item 2. The number of rows in the matrix represents the number of students and is statistically referred to as the sample size. Further, the number of items in a test, which is the number of columns in the matrix, is often called the test length.
1
In this book, “student” is used instead of “examinee,” “test taker,” or “subject.”. Numerical values such as “−1,” “99,” or “−999” are also preferred as an expression for missing data..
2
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 K. Shojima, Test Data Engineering, Behaviormetrics: Quantitative Approaches to Human Behavior 13, https://doi.org/10.1007/978-981-16-9986-3_2
15
16
2 Test Data and Item Analysis
Table 2.1 Data matrix Student 1 Student 2 Student 3 .. . Student 100
Item 1 Item 2 1 0 0 0 1 1 .. .. . . 1 1
· · · Item 49 Item 50 ··· 1 . ··· 0 . ··· . 1 .. .. .. . . . ··· . 1
The responses “1,” “0,” and “. (dot)” represent pass, fail, and missing, respectively.
Table 2.2 Differences between Nonresponse and Missing Treatment State Type Nonresponse Item is presented but not responded. 0 (incorrect). Missing Item is NOT presented and not responded. . (dot)
Each individual datum is a trichotomous variable; Student s’s response for Item j, u s j , is expressed as
us j
⎧ ⎪ ⎨1, if correct = 0, if incorrect . ⎪ ⎩ ., if missing
If a data matrix has no missing entries, each individual response becomes a dichotomous (binary) variable (i.e., pass/fail). Missing and nonresponse are different (see Table 2.2). A nonresponse is when an item is presented to a student, but the student does not respond to the item. This is usually treated as an incorrect answer and is coded as 0. Meanwhile, missing is the state when an item is not presented to a student, and thus, the student cannot respond to the item. Suppose that a student has to respond to any one of the two items (Items 49 or 50). In Table 2.2, it can be seen that Student 1 selected Item 49 and passed it. In psychological and sociological questionnaires, we seldom distinguish between nonresponse and missing, but in test data, we should strictly distinguish between the two. Number of Nonresponses It should be noticed that students with a higher number of nonresponses tend to have lower ability as measured by the test. The lower the ability level, the less likely the student is to be able to reach some items, and thus, the item responses become nonresponses. Table 2.3 is the missing indicator matrix (of Table 2.1). It shows which items were presented to each student, where “presented” is coded as 1 and “not presented”
2.1 Data Matrix
17
Table 2.3 Missing indicator matrix Student 1 Student 2 Student 3 .. . Student 100
Item 1 Item 2 1 1 1 1 1 1 .. .. . . 1 1
· · · Item 49 Item 50 ··· 1 0 ··· 1 0 ··· 0 1 .. .. .. . . . ··· 0 1
The values 1 and 0 stand for “presented” and “not presented,” respectively
as 0. Generally, the missing indicator of Item j for Student s is expressed as zs j =
1, if Item j is presented to Student s . 0, if Item j is NOT presented to Student s
Each entry of the missing indicator matrix is either 0 or 1. When there are no missing responses, all elements in the matrix become 1.
2.2 Mathematical Expressions of Data This section describes how to mathematically represent whole data and a part of data.
2.2.1 Mathematical Expression of Data Matrix The data matrix of Table 2.1 is mathematically expressed as follows: ⎡
1 ⎢0 ⎢ ⎢ U = ⎢1 ⎢ .. ⎣.
0 0 1 .. .
··· 1 ··· 0 ··· . . . .. . .
⎤ . .⎥ ⎥ 1⎥ ⎥ (100 students × 50 items). .. ⎥ .⎦
1 1 ··· . 1
In this book, matrices are represented in uppercase boldface. Furthermore, “100×50” stands for the size of the matrix. The first number represents the sample size (i.e., number of rows in the matrix), and the second number is the number of items (i.e., number of columns). Generally, the elements in the data matrix are expressed as
18
2 Test Data and Item Analysis
⎡
u 11 · · · ⎢ .. . . U =⎣ . . u S1 · · ·
⎤ u 1J .. ⎥ = u (S × J ). sj . ⎦ uSJ
This expression implies that the sample size and test length of the data matrix are S and J , respectively. Matrix elements are expressed in lowercase. In other words, the response of Student s to Item j is denoted as u s j . For example, the response of Student 3 to Item 2 in Table 2.1 is represented as u 32 = 1. Similarly, the missing indicator matrix is expressed by ⎤ ⎡ z 11 · · · z 1J ⎥ ⎢ Z = ⎣ ... . . . ... ⎦ = z s j (S × J ). z S1 · · · z S J
2.2.2 Mathematical Expression of Item Data Vector The second column extracted from Table 2.1 is ⎡ ⎤ 0 ⎢0 ⎥ ⎢ ⎥ ⎢ ⎥ u2 = ⎢1⎥ (100 × 1). ⎢ .. ⎥ ⎣.⎦ 1 A matrix composed of only one column is called a vector; hence, u2 is called the item data vector (of Item 2). The subscript indicates the column number in the original matrix. The item data vector u2 contains the responses of all students to Item 2. A general expression for the item data vector is given as follows: ⎡ ⎤ u1 j ⎢ .. ⎥ u j = ⎣ . ⎦ = u s j (S × 1). uSj This expression also shows that the vector is the j-th column vector in U and has S entries, and the s-th entry is u s j . Similarly, the j-th column vector of the missing indicator matrix (Z) is represented as ⎡ ⎤ z1 j ⎢ .. ⎥ z j = ⎣ . ⎦ = z s j (S × 1). zSj This vector is called the item missing indicator vector (of Item j).
2.2 Mathematical Expressions of Data
19
2.2.3 Mathematical Expression of Student Data Vector The third row vector extracted from Table 2.1 is ⎡ ⎤ 1 ⎢1⎥ ⎢ ⎥ ⎢ ⎥ u3 = ⎢ ... ⎥ (50 × 1). ⎢ ⎥ ⎣.⎦ 1 This vector contains all the responses by Student 3 and is called the student data vector (of Student 3). The elements of this vector are originally arranged horizontally in U, but in this book, the vector elements are always arranged vertically. If a vector with the elements transposed is required, with prime ( ), it can be expressed as u3 = 1 1 · · · . 1 . This operation swaps the row and column positions of all elements in a vector or matrix. Generally, the vector for Student s is expressed as ⎤ u s1 ⎢ ⎥ us = ⎣ ... ⎦ = u s j (J × 1). ⎡
us J This vector is originally the s-th row vector in U, but it becomes a column vector when expressed individually. Similarly, the s-th row vector of the missing indicator matrix (Z) is expressed as ⎤ z s1 ⎢ ⎥ z s = ⎣ ... ⎦ = z s j (J × 1), ⎡
zs J and it is called the student missing indicator vector (of Student s).
2.3 Student Analysis This section describes some basic methods for transforming student responses. For each student, we obtain J responses about the J items, but the responses are usually transformed (aggregated) into one score to give an indication of the student’s ability level.
20
2 Test Data and Item Analysis
2.3.1 Total Score A typical indicator that reflects a student’s academic achievement is the total score. The simplest one is the number-right score (NRS), which counts the number of items passed. The NRS of Student s is given by ts = z s1 u s1 + · · · + z s1 u s J =
J
zs j u s j .
(2.1)
j=1
Missing responses are omitted from the sum.3 When ts is high, Student s may have high academic ability. Using matrix algebra, Eq. (2.1) is simply written as ⎤ u s1 J ⎢ ⎥ · · · z s J ⎣ ... ⎦ = z s1 u s1 + · · · + z s J u s J = zs j u s j . ⎡
ts = z s us = z s1
(2.2)
j=1
us J
That is, ts is obtained by the inner product of the data vector and the missing indicator vector of Student s.4 The item weight vector is used when the items have a weight to compute the total score. Let w1 , w2 , · · · , w J be the weights of the J items and the item weight vector be ⎡ ⎤ w1 ⎢ .. ⎥ w = ⎣ . ⎦ = {w j } (J × 1), wJ
the total score of Student s is calculated by ⎤ ⎡ ⎤⎞ u s1 z s1 ⎜⎢ . ⎥ ⎢ . ⎥⎟ = w (z j us ) = w1 · · · w J ⎝⎣ .. ⎦ ⎣ .. ⎦⎠ ⎛⎡
ts(w)
zs J us J ⎤ ⎡ z s1 u s1 J ⎢ . ⎥ w j zs j u s j , = w1 · · · w J ⎣ .. ⎦ = j=1 zs J u s J
(2.3)
where is the Hadamard product (p.xviii). 3
This is why it is better to represent missing data as a numerical value instead of “. (dot)” as mentioned in Footnote 2 (p.15) as “Number (0)”דcharacter (.)” cannot be calculated. 4 The inner product is the sum of the products of the corresponding elements in two vectors. For a = [a1 a2 a3 ] and b = [b1 b2 b3 ], a b = a1 b1 + a2 b2 + a3 b3 .
2.3 Student Analysis
21
Item Weight When analyzed by factor analysis and item response theory, a higher weight is given to items with higher discrimination (Sect. 2.6.3, p.50). Compared to such elaborated methods, one may think that the weights determined by the teacher (or test administrator) are arbitrary, unobjective, and unscientific. The author thinks otherwise. The weights given by a subject-teaching expert is based on his or her experience. Teachers tend to give higher weights to items that are more important for students to master, whereas analysts usually do not have such a perspective. If one wants students to be more capable in calculating, give higher weights to calculation items in math tests. Students will surely read your expectations and spend more time on calculation training. The weights can thus reflect one’s educational policy. The following matrix operation can give the NRSs of all students at once: ⎤ ⎡ ⎤⎞ ⎡ ⎤ u 11 · · · u 1J 1 z 11 · · · z 1J ⎥ ⎢ ⎥⎟ ⎢ ⎥ ⎜⎢ t = (Z ⊗ U)1 J = ⎝⎣ ... . . . ... ⎦ ⊗ ⎣ ... . . . ... ⎦⎠ ⎣ ... ⎦ z S1 · · · z S J u S1 · · · u S J 1 ⎤⎡ ⎤ ⎡ ⎤ ⎡ 1 t1 z 11 u 11 · · · z 1J u 1J ⎥ ⎢ ⎥ ⎢ ⎢ .. . . . .. .. ⎦ ⎣ .. ⎦ = ⎣ ... ⎥ =⎣ . ⎦, ⎛⎡
z S1 u S1 · · · z S J u S J
1
tS
where 1 J is a vector with size J , whose elements are all 1. In addition, the vector containing the total scores of all students is obtained by ⎤ ⎡ ⎤ ⎡ (w) ⎤ w1 t1 z 11 u 11 · · · z 1J u 1J ⎥ ⎢ .. ⎥ ⎢ .. ⎥ ⎢ .. . . . . = (Z ⊗ U)w = ⎣ . . . ⎦⎣ . ⎦ = ⎣ . ⎦. z S1 u S1 · · · z S J u S J wJ t S(w) ⎡
t (w)
(2.4)
2.3.2 Passage Rate A perfect NRS equals the number of items. That is, in a 50-item test, 50 is the highest score. However, when there are several testlets in a test, and students can choose some of them, the maximum number of items may vary by student. For example, suppose that a test with 50 items is composed of four testlets, Testlets 3 and 4 (composed of 10 and 8 items, respectively) are optional, and a student is required to choose either of them. The highest score of the student who selected Testlets 1, 2, and 3 is 42, while that of the student who selected Testlets 1, 2, and 4 is 40. Furthermore, in a
22
2 Test Data and Item Analysis
computer-based test (CBT), the number of items can vary depending on the student’s ability. When the number of presented items differs by student, the NRS does not always reflect the students’ academic ability fairly. In such cases, one may calculate the student’s passage rate. The passage rate of Student s, rs , is the NRS (ts ) divided by the number of presented items (Js ). Student s’s number of presented items is computed by Js = 1J z s =
J
zs j .
j=1
Thus, the passage rate of Student s (rs ) is obtained by ts z us rs = = s = Js 1 J zs
J
j=1 z s j u s j
J
j=1 z s j
.
The passage rate falls within the range of [0, 1]. That is, it takes the value of 0 at the minimum and 1 at the maximum, which implies that Student s incorrectly responded to most items on the test when rs is close to 0, while the student correctly responded to most items as rs is close to 1. In addition, the scoring rate of Student s is the rate depicting how many scores the student can obtain proportional to his/her perfect score, and is calculated by rs(w) =
ts(w) Js(w)
=
w (z s us ) = w zs
J j=1
J
w j zs j u s j
j=1
w j zs j
,
where Js(w) = w z s is Student s’s maximum possible score if all presented items are passed. Furthermore, the passage rates and scoring rates of all students can be calculated by ⎤ ⎡ ⎤ ⎡ ⎤ ⎡ ⎤ ⎡ J1 t1 /J1 r1 t1 ⎢ .. ⎥ ⎢ .. ⎥ ⎢ .. ⎥ ⎢ .. ⎥ r = t Z1 J = ⎣ . ⎦ ⎣ . ⎦ = ⎣ . ⎦ = ⎣ . ⎦ , JS t S /JS rS ⎡ (w) (w) ⎤ ⎡ (w) ⎤ r1 t1 /J1 ⎥ ⎢ .. ⎥ ⎢ . (w) . = t Zw = ⎣ ⎦ = ⎣ . ⎦, . tS
r (w)
t S(w) /JS(w)
where is the Hadamard division (p.xviii).
r S(w)
2.3 Student Analysis
23
Fig. 2.1 Standardized score in standard normal distribution
2.3.3 Standardized Score It is often said that student abilities follow (form) a normal distribution.5 The normal distribution with a mean of 0 and variance of 16 is called the standard normal distribution (see Fig. 2.1). The standardized score indicates how high or low the student’s ability is placed in the standard normal distribution. The standardized score of Student s, ζs , is rs − r¯ , ζs = √ V ar [r] where r¯ and V ar [r] are the mean and variance of the student passage rates, respectively. That is, S rs 1S r = s=1 , S S S (rs − r¯ )2 (r − r¯ 1 S ) (r − r¯ 1 S ) = s=1 . V ar [r] = S−1 S−1
r¯ =
The total area of the standard normal distribution (and any statistical distribution) is 1.0. If a student’s standardized score is 1.5 (see Fig. 2.1), then the area of the shaded part is calculated as 0.933, which indicates that the student’s ability is placed at 93.3%7 from the bottom (6.7% from the top), implying that the student ranks at 6th or 7th place among 100 students. By calculating the standardized score and referring to the area smaller than the score in the standard normal distribution, the student’s relative position in the group (or population) can be estimated. However, it should 5
Note that the normal distribution of student abilities does not necessarily produce a normal distribution of their test scores. 6 Note that the standard deviation (SD) of the standard normal distribution is also 1 because S D = √ V A R. 7 This percentage (93.3) can be used as the feedback to the student.
24
2 Test Data and Item Analysis
be noted that this prediction is valid only if the histogram of the passage rates of the S students appears to have a normal distribution. The area smaller than the standardized score of Student s (ζs ) in the standard normal distribution is calculated by ps(ζ )
=
ζs
−∞
f N (x; 0, 1)dx,
where f N (·; μ, σ ) is the probability density function (PDF) of the normal distribution with mean μ and SD σ (i.e., variance σ 2 ), and which is defined as 1
1 f N (x; 0, 1) = √ exp − 2 2π σ 2
x −μ σ
2 ,
(2.5)
where exp(a) = ea and e = 2.71828 · · · is Napier’s constant (or Euler’s number). Thus, this definition can be written as 1 1 x−μ 2 2.718− 2 ( σ ) . f N (x; 0, 1) = √ 2π σ 2 As for Fig. 2.1, the shaded area is computed as
1.5 −∞
f N (x; 0, 1)dx = 0.933.
Regarding the scoring rate (of Student s), rs(w) , the standardized score is obtained by r (w) − r¯ (w) ζs(w) = s , V ar [r (w) ] where S r (w) 1S r (w) r¯ = = s=1 s , S S S (w) (r (w) − r¯ (w) )2 (r − r¯ (w) 1 S ) (r (w) − r¯ (w) 1 S ) (w) V ar [r ] = = s=1 s . S−1 S−1 (w)
The vectors respectively containing the standardized passage rates and standardized scoring rates of all students are obtained by
2.3 Student Analysis
25
⎡ ⎤ ζ1 ⎢ ⎥ ζ = (r − r¯ 1 S ) ( V ar [r]1 S ) = ⎣ ... ⎦ ,
ζS
ζ (w) = (r (w) − r¯ (w) 1 S )
V ar [r (w) ]1 S
⎡
⎤ ζ1(w) ⎢ ⎥ = ⎣ ... ⎦ . ζ S(w)
Meanwhile, the NRS (of Student s), ts , should not be standardized when the number of presented items differ by student, because the standardized scores of students with a smaller number of presented items become unfairly small. Normal versus Uniform Distribution Some test data analysts and test administrators blindly believe that it is required for the scores to be normally distributed. In this book, we use a test to identify individual differences in academic performance. Thus, the dispersion of score distribution is more important than its shape. If the dispersion (variance) is small, it means that both high- and low-ability students score similar points; such a test is not diagnostic (discriminating) of academic ability. This belief in data normality is a false belief that test data analysts are inclined to have and is probably fostered by the fact that many statistical tests assume data normality. Recall that the purpose of (academic) testing is not statistical testing but distinguishing individual differences in academic ability. A test is therefore successful if the scores are uniformly distributed on the score scale.
2.3.4 Percentile Rank The percentile rank indicates the group to which each student belongs if all students are divided into 100 groups according to their scores. For example, if a student’s percentile rank is 14, the student’s score falls in (13%, 14%] when sorted in ascending order, where x ∈ (a, b] denotes a < x ≤ b. That is, a student with a score of just 13% from the bottom was not grouped in the 14th percentile rank, but in the 13th percentile rank.
26
2 Test Data and Item Analysis
Example 1
Suppose 100 students took a test, and Student 1 scored 1 point, Student 2 scored 2 points, · · · , and Student 100 scored 100 points. Student (S=100) 001 Score 1 Rank of Score ( A) 1 Percentile (B = 100 × A/S) 1 Percentile Rank ( B) 1
002 2 2 2 2
003 3 3 3 3
··· ··· ··· ··· ···
099 99 99 99 99
100 100 100 100 100
First, the scores were sorted in ascending order and then ranked ( A). If some students obtained the same score, they shared a larger rank. For example, if Students 4, 5, and 6 obtained the same score of 4, their scores were then ranked as 6. Next, the ranks ( A) are divided by the sample size (S), and the results are multiplied by 100 (B), which is called the percentile. Finally, the percentiles (B) are processed by the ceiling function (p.xviii), and the outcomes are then the percentile rank. Example 2
Suppose 101 students took a test, and Student 1 scored 0 points, Student 2 scored 1 point, · · · , and Student 101 scored 100 points. 001 Student (S = 101) Score 0 Ranking of Score ( A) 1 Percentile (B = 100 × A/S) 0.99 Percentile Rank ( B) 1
002 1 2 1.98 2
003 3 3 2.97 3
· · · 099 100 101 · · · 98 99 100 · · · 99 100 101 · · · 98.02 99.01 100 · · · 99 100 100
In this example, the percentile ranks B of both Students 100 and 101 are 100.
2.3 Student Analysis
27
Example 3
Let us now consider another example: Two students took a test. Student 1 scored 10 and Student 2 scored 30, although it is nonsense to calculate the percentile rank when the sample size is two. In general, however, the characteristics of a method are well appeared under extreme conditions. 001 Student (S = 2) Score 10 Ranking of Score ( A) 1 Percentile (B = 100 × A/S) 50 Percentile Rank ( B) 50
002 20 2 100 100
The percentile rank of the student with the lower score was 50, and that of the one with the higher score was 100. Example 4
Let us consider an additional example: The sample size was one, and the student scored 50. 001 Student (S = 1) Score 50 Ranking of Score ( A) 1 Percentile (B = 100 × A/S) 100 Percentile Rank ( B) 100 In this case, the percentile rank of the student becomes 100, regardless of the score. As is evident from these four examples, this method always produces 100 percentile ranker(s). Conversely, this method never brings about a zero-percentile ranker because there is no student whose percentile (B) is ≤ 0.
2.3.5 Stanine Percentile rank is a method to “equipartition” students into 100 groups. Each group contains about 1% of all students. Stanine (Angoff, 1984) divides students into nine groups; however, the method does not divide into ninths (11.1%); it groups students according to the percentage division shown in Table 2.4. For example, a student whose score is > 4% and ≤ 11% is given a stanine score of 2. The number of divisions is nine because this is the largest single-digit odd number. Being a single digit is advantageous for information portability, and being odd produces a middle group. In addition, the percentage division is based on a standard normal distribution (see Fig. 2.2); the percentage thresholds correspond to the nine
28
2 Test Data and Item Analysis
Table 2.4 Stanine Stanine Score Percentage Division 1 lowest 4% 2 next 7% 3 next 12% 4 next 17% 5 medium 20% 6 next 17% 7 next 12% 8 next 7% 9 highest 4%
Percentile (0, 4] (4, 11] (11, 23] (23, 40] (40, 60] (60, 77] (77, 89] (89, 96] (96, 100]
Fig. 2.2 Basis of stanine percentage division
areas of the standard normal distribution when the distribution is partitioned by eight points from −1.75 to 1.75 with an increment of 0.5. Note that using the standard normal distribution as the standard for the percentage division does not necessarily require students’ scores to be normally distributed.
2.4 Single-Item Analysis This section describes some basic methods for analyzing an individual item in test data.
2.4.1 Correct Response Rate The correct response rate (CRR) is one of the most basic and important statistics for item analysis. It is an index of item difficulty and a measure of how many students out of those who tried an item correctly responded to it. The CRR of Item j is calculated as
2.4 Single-Item Analysis
29
pj =
zj u j 1S z j
S =
s=1 z s j u s j
S
s=1 z s j
∈ [0, 1],
(2.6)
where p j ∈ [0, 1] (p.xvii); that is, p j takes a value ≥ 0 and ≤ 1. When p j is close to 1, Item j was easy for students, as most correctly answered the item. Conversely, when p j is close to 0, it means that Item j was difficult for the students. The CRR is identical to the mean of the binary data (correct=1/incorrect=0).8 The CRR multiplied by the weight of the item is the item mean, given as = wj pj. p (w) j The ceiling effect is said to occur when the item mean is close to the item weight (i.e., CRR is close to 1). Conversely, the floor effect indicates that the item mean is very low (CRR is close to 0). Items with a ceiling or floor effect tend to be excluded from psychological and social questionnaires because such items are inappropriate for detecting individual differences. However, in a test, one may include some items with a ceiling effect. This is because they can loosen up nervous students and give a sense of achievement to those with low academic performance. The vectors containing the CRRs and item means of all J items were obtained by ⎤ p1 ⎢ ⎥ p = {(Z ⊗ U) 1 S } Z 1 S = ⎣ ... ⎦ , ⎡
pJ ⎡ (w) ⎤ p1 w1 p 1 ⎢ .. ⎥ ⎢ .. ⎥ = w ⊗ p = ⎣ . ⎦ = ⎣ . ⎦. ⎤
⎡
p(w)
wJ pJ
p (w) J
Item Arrangement In a test, the arrangement of items is important. In general, it is better to place easy items earlier in the test. This can help students relax and warm up. The test administrator should pay attention to the students so that they can perform to their full potential.
8
Originally, u s j was trichotomous (1/0/missing); however, the missing datum was excluded by z s j .
30
2 Test Data and Item Analysis
Fig. 2.3 Item odds
2.4.2 Item Odds Odds are a commonly used metric in marketing research and medical statistics. Test data analysis is similar to point of sales (POS) data analysis in the sense that both deal with binary data. POS data record who bought the products. Each datum is binary and coded as 1 if a customer buys a product; otherwise, it is coded as 0.9 The field of marketing research has developed a set of effective methods for analyzing binary data. The odds are one such classic method. In item analysis, item odds are defined as S z j u j z j u j zs j u s j 1S z j pj = = S s=1 = . oj = 1 − pj 1 S z j z j (1 S − u j ) z j (1 S − u j ) s=1 z s j (1 − u s j ) It is clear from the above equation that odds equal CRR/IRR (incorrect response rate). That is, the odds are an index that compares CRR to IRR. If p j = 0.75, then o j = 0.75/0.25 = 3.0, indicating that the possibility of passing Item j is three times more likely than that of failing the item. The odds are thus easy to understand because the likelihood of passing an item is directly compared with failing the item. The odds have three fundamental states (see Fig. 2.3). If the CRR is smaller than the IRR (i.e., p j < 0.5); that is, if p j < 1 − p j , then o j ∈ [0, 1). When p j = 1 − p j = 0.5, the odds become exactly 1. When p j > 0.5 (i.e., p j > 1 − p j ), the odds are > 1. Note that the odds cannot be calculated when p j = 1.0 because o j = 1/0 (division by 0). Furthermore, the vector containing the odds of all J items is obtained by o = p (1 J − p). 9
If a customer buys the product two pieces, the act was then recorded as 2, such POS data are not dichotomous.
2.4 Single-Item Analysis
31
Fig. 2.4 Item threshold
2.4.3 Item Threshold Item threshold is a measure of difficulty based on a standard normal distribution. The student abilities are first assumed to follow a standard normal distribution, and only the students with an ability higher than the threshold are considered to be able to pass the item (see Table 2.4, left). For example, the CRR of an item that is 0.2 is regarded as a result that only the students with ability ≥ 0.842 could pass the item, because the area above 0.842 in the standard normal distribution amounts to 0.2 (see Table 2.4, right). In addition, if the CRR was 0.5, the item threshold was 0.0. In general, for Item j, the relationship between CRR ( p j ) and item threshold (τ j ) is given by pj =
∞ τj
f N (x; 0, 1)dx.
From this, the item threshold τ j is obtained as follows: pj = 1 −
τj
−∞
f N (x; 0, 1)dx = 1 − FN (τ j ; 0, 1),
FN (τ j ; 0, 1) = 1 − p j , τ j = FN−1 (1 − p j ; 0, 1), where FN (·; 0, 1) and FN−1 (·; 0, 1) are the cumulative distribution function (CDF) and quantile function (or inverse CDF) of the standard normal distribution, respectively. The larger the CRR (i.e., the easier the item is), the smaller is the item threshold. Therefore, p j is said to be an index of easiness because a larger value implies that
32
2 Test Data and Item Analysis
Fig. 2.5 Item entropy
the item is easier, while τ j is the index of difficulty because a greater value indicates that the item is more difficult.
2.4.4 Item Entropy The item entropy is an indicator of the variability or randomness of the responses. The entropy of Item j is defined as e j = − p j log2 p j − (1 − p j ) log2 (1 − p j )
∈ [0, 1].
It can be said that the responses are the most dispersive when the correct and incorrect responses are half and half; in other words, item entropy becomes the maximum when p j = 0.5, as shown in Fig. 2.5. Then, the item entropy is calculated as10 e j = −0.5 log2 0.5 − 0.5 log2 0.5 = 0.5 + 0.5 = 1.0. Conversely, the responses are the least dispersive when p j equals 0.0 and 1.0. The entropy is then computed as11 e j = −0.0 log2 0.0 − 1.0 log2 1.0 = 0 + 0 = 0, e j = −1.0 log2 1.0 − 0.0 log2 0.0 = 0 + 0 = 0. Item entropy is advantageous in that it can be applied to a nominal variable and is used as the dispersion of prescored response choices. For example, suppose that Item j is a multiple-choice item composed of C j (1, · · · , C j ) options and the selection 10 11
log2 0.5 = log2 (1/2) = log2 2−1 = − log2 2 = −1. Suppose that 0 × (−∞) is 0. Thus, 0 log2 0 = 0 × (−∞) = 0.
2.4 Single-Item Analysis
33
Selection Rate of Category 2 (p2)
Fig. 2.6 Item entropy for three options
1.0
1.0
0.8
0.8
0.6
0.6
0.4
0.4
0.2
0.2
0.0 0.0 0.0
0.2
0.4
0.6
0.8
1.0
Selection Rate of Category 1 (p1)
rates of the options are p j1 , p j2 , · · · , p jC j , the item entropy is then computed by ej = −
Cj
p jc logC j p jc
∈ [0, 1].
c=1
Note that the logarithm base here is set to C j in accord with the number of options, which is necessary for the maximum to equal 1.0. The most random state occurs when the selection rates of all options are 1/C j . The item entropy is then obtained as12 e j = −C j
1 1 = − logC j C −1 logC j j = logC j C j = 1. Cj Cj
Figure 2.6 shows the item entropy when there are three options. The horizontal axis indicates the selection rate for Option 1, and the vertical axis represents Option 2. The selection rate of Option 3 is determined by p1 and p2 (i.e., p3 = 1 − p1 − p2 ). In addition, from p3 ≥ 0, the entropy is computable only if p1 + p2 ≤ 1 is satisfied. As the density plot shows, the entropy is largest when p1 = p2 = 1/3(= p3 ), and the value is then 1.
12
Note that loga a = 1.
34
2 Test Data and Item Analysis
Table 2.5 Contingency Table on Items j and k Item j\ k
Incorrect
Incorrect S jk
correct
S jk
Total
S jk
p jk =
p jk =
p jk =
Correct
S jk S jk S jk S jk
S jk
S jk
p jk =
p jk =
Total S jk S jk S jk S jk
S jk
S jk
S jk S jk S jk p jk = S jk S jk
p jk =
p jk =
S jk S jk S jk S jk
S jk (1.0)
*numbers and percentages (in parentheses)
2.5 Interitem Correct Response Rate Analysis In this section, we discuss three types of CRRs to analyze the relationship between the two items. Among these, item lift is frequently used in marketing research.
2.5.1 Joint Correct Response Rate The joint correct response rate (JCRR) is the rate of students who passed both items. Table 2.5 shows the contingency table of Items j and k. The JCRR of Items j and k is defined as the ratio of the students who passed both items (S jk ) out of the number of students who responded to the two items (S jk ). The numbers S jk and S jk are given by S jk = z j z k =
S
z s j z sk ,
s=1
S jk = (z j u j ) (z k uk ) =
S
z s j z sk u s j u sk ,
(2.7)
s=1
and thus the JCRR of Items j and k, p jk , is obtained as p jk =
S jk S jk
=
(z j u j ) (z k uk ) = z j z k
S
s=1 z s j z sk u s j u sk S s=1 z s j z sk
.
(2.8)
2.5 Interitem Correct Response Rate Analysis
35
Table 2.6 Examples of Frequency Contingency Table Example 1
Example 2
j\ k 0 1 T 0 80 20 100 1 40 60 100 T 120 80 200
j\ k 0 1 T 0 140 0 140 1 0 60 60 T 140 60 200
p jk = 60/200 = 0.3
p jk = 60/200 = 0.3
Example 3 j\ k 0 1 T
0 1 T 40 60 100 40 60 100 80 120 200
p jk = 60/200 = 0.3
When the JCRR is close to 1, it means that both items are easy for students. When the JCRR is close to 0, it indicates that one or both of the items are difficult. In fact, the JCRR does not indicate much about the relationship between the two items. Table 2.6 shows three contingency tables, each of which has a JCRR of 0.3. In Example 1, 40% of the 200 students failed both items, and 30% passed them. We generally consider the two items to be strongly associated13 because 70% of the students simultaneously passed or failed both of them. Example 2 shows a stronger association between the two items, in which 70% of the 200 students failed the two items, and the remaining 30% passed them. Because all students either passed or failed both items, the association between the two items was extremely strong. Although the association of Example 2 is stronger than that of Example 1, the JCRRs of both examples are the same at 0.3. The JCRR is therefore not an indicator of the strength of association between a pair of items. Conversely, the two items are unassociated in Example 3. Of the 100 students failing Item j, 60 passed Item k. Meanwhile, of the other 100 students passing Item j, 60 also passed Item k is also 60. The relationship between these two items has “no association” because the percentage of students passing item k does not change, regardless of whether or not the students pass Item j. Even in this case, the JCRR is 0.3. To obtain the JCRRs of all item pairs at once, the following matrix operation can be used: ⎤ ⎡ p1 sym ⎥ ⎢ p21 p2 ⎥ ⎢ P J = (Z U) (Z U) Z Z = ⎢ . ⎥ = { p jk } (J × J ). . . . . . ⎦ ⎣ . . . pJ1 pJ2 · · · pJ
13
Note that two items are qualitative (ordinal) variables. When the two items are quantitative variables, a “correlation” rather than an “association” is a suitable expression.
36
2 Test Data and Item Analysis
Characteristics of Joint Correct Response Rate Matrix 1. This matrix is symmetric.a 2. The CRRs are placed in the diagonal entries. ⎤ matrix is a square matrix that is equal to its transpose. For example, A = ⎡A symmetric a1 b1 b2 ⎣b1 a2 b3 ⎦ is a symmetric matrix because A = A. b2 b3 a3
a
This is the JCRR matrix, denoted as P J , which has two characteristics. First, the matrix is symmetric. Thus, it satisfies14 p jk = pk j (∀ j, k ∈ {1, 2, · · · , J }). It is natural for the JCRR of Items 1 and 2 to be equal to that of Items 2 and 1. Second, the j-th diagonal element in P J is the CRR of Item j. Let us check this. From Eq. (2.8), the j-th diagonal element is expressed as15 pjj =
Sjj Sjj
(z j u j ) (z j u j ) = = z j z j
S
2 2 s=1 z s j u s j S 2 s=1 z s j
S =
s=1 z s j u s j
S
s=1 z s j
= pj,
and is equal to the CRR of Item j in Eq. (2.6).
2.5.2 Conditional Correct Response Rate The conditional correct response rate (CCRR) represents the ratio of the students who passed Item C (consequent item) to those who passed Item A (antecedent item). Figure 2.7 shows a Venn diagram showing the difference between the JCRR and CCRR, in which Items j and k are the antecedent and consequent items, respectively. In general, test items can be sorted in a certain learning sequence, which roughly corresponds to the order of difficulty. The CCRR stands for the order between two items; thus, it is an important indicator when reviewing the learning sequence. Using the notation in Table 2.5, the CCRR of Item k given Item j, denoted by pk| j , is expressed as pk| j =
S jk S jk
=
(z j u j ) (z k uk ) = (z j u j ) z k
S
s=1 z s j z sk u s j u sk
S
s=1 z s j z sk u s j
.
(2.9)
14
∀ indicates “for all” or “every.” Here, it means “for all j and k from 1 to J , p jk = pk j holds.”.
15
Because 02 = 0 and 12 = 1, z s2j = z s j and z s2j u 2s j = z s j u s j .
2.5 Interitem Correct Response Rate Analysis
37
Fig. 2.7 Joint correct response and conditional correct response rates
This expresses the ratio of the students passing Item k (as well) to those passing Item j. In this equation, the numerator S jk = (z j u j ) (z k uk ) is derived from Eq. (2.7). In addition, the denominator is the number of students who passed Item j and responded to Item k, which is therefore expressed as S jk = (z j u j ) z k =
S
z s j z sk u s j .
s=1
In practice, for simplicity, the following equation is defined as CCRR: pk| j =
S jk S jk
=
S jk /S jk S jk /S jk
≈
p jk pj
,
(2.10)
where ≈ indicates “approximately” or “nearly equal.” The denominator (S jk /S jk ) represents the ratio of the students passing Item j out of the students responding to Items j and k; thus, this figure is nearly equal to the CRR of Item j. Indeed, S jk /S jk = p j if there are no missing data. Strictly speaking, however, they differ in the following manner: S (z j u j ) z k s=1 z s j z sk u s j = = S S jk z j zk s=1 z s j z sk S z j u j s=1 z s j u s j pj = = . (2.6) S 1S z j s=1 z s j S jk
(2.9),
Table 2.7 shows three relative frequency contingency tables corresponding respectively to the three examples shown in Table 2.6, and thus the sum of the four cells is 1.0 in each table. As discussed further in the text, normally pk| j = p j|k ; thus, in Example 1, pk| j = 0.60 and p j|k = 0.75, which means that of the students who passed Item j, the ratio of students who passed Item k was 60%, whereas the reverse ratio was 75%.
38
2 Test Data and Item Analysis
Table 2.7 Examples of Relative Frequency Contingency Table Example 1
Example 2
Example 3
j\ k 0 1 T 0 .40 .10 .50 1 .20 .30 .50 T .60 .40 1.0
j\ k 0 1 T 0 .70 .00 .70 1 .00 .30 .30 T .70 .30 1.0
j\ k 0 1 T 0 .20 .30 .50 1 .20 .30 .50 T .40 .60 1.0
pk| j = .30/.50 = .60 p j|k = .30/.40 = .75
pk| j = .30/.30 = 1.0 p j|k = .30/.30 = 1.0
pk| j = .30/.50 = .60 p j|k = .30/.60 = .50
It is erroneous to conclude from p j|k > pk| j that Item k is the basis or cause of Item j because the inequality direction is simply derived from the fact that the CRR of Item k is lower than that of Item j. It can be seen from the table that the CRRs of Items j and k are 0.5 and 0.4, respectively; that is, Item k is more difficult. It is natural for the ratio of students passing the easier Item j out of those passing the harder Item k ( p j|k ) to become higher than the reverse ratio of students passing the harder Item k of those passing the easier Item j ( pk| j ) because a student who passes the difficult item can usually pass the easier item as well. Therefore, one should not determine which item is more fundamental by simply comparing the two CCRRs of an item pair. As for which is more basic, in this case, Item j is because its CCR is greater (i.e., Item j is easier). In addition, Example 2 is a case in which two items are completely associated. In this case, the two CCRRs obtained were 1.0. Moreover, Example 3 is a case in which the two items are unassociated. Nevertheless, the values of pk| j for Examples 1 and 3 were the same (0.6). These facts suggest that when we want to investigate the relevance of a pair of items, it is insufficient to check only one of the CCRRs. To obtain the CCRRs of all item pairs at once, we can use the following equation: ⎤ ⎛⎡ ⎤ ⎞ sym p1 p1 ⎟ ⎥ ⎜⎢ .. ⎥ ⎢ P C = P J ( p1J ) = ⎣ ... . . . ⎦ ⎝⎣ . ⎦ 1 · · · 1 ⎠ pJ pJ1 · · · pJ ⎤ ⎡ ⎤ ⎡ sym p1 · · · p1 p1 ⎥ ⎢ .. . . .. ⎥ ⎢ = ⎣ ... . . . ⎦⎣ . . . ⎦ pJ · · · pJ pJ1 · · · pJ ⎡ ⎤ 1 p2|1 · · · p J |1 ⎢ p1|2 1 · · · p J |2 ⎥ ⎢ ⎥ =⎢ . .. . . .. ⎥ = { pk| j } (J × J ). . ⎣ . . . ⎦ . ⎡
p1 J p2 J · · · 1
2.5 Interitem Correct Response Rate Analysis
39
Fig. 2.8 Concept of item lift
Characteristics of Conditional Correct Response Rate Matrix 1. In the ( j, k)-th (the j-th row and k-th column) entry, pk| j (not p j|k ) is placed. 2. This matrix is asymmetric ( P C = P C ). 3. All diagonal elements are 1. This matrix is called the CCRR matrix, and its characteristics can be summarized as three points. For Point 2, pk| j = p j|k generally holds. If p j|k ≥ pk| j , then the students passing Item j out of those passing Item k outnumber the students passing Item k out of those passing Item j. Then, Item j is possibly more basic than Item k. In addition, for Point 3, the diagonal elements of the CCRR matrix are 1 because the j-th diagonal element results in p j| j =
pjj pj
=
pj = 1. pj
2.5.3 Item Lift The lift (Brin et al., 1997) is a commonly used index in a POS data analysis. The item lift of Item k to Item j is defined as l jk =
pk| j . pk
That is, l jk is the ratio of pk| j to pk . Figure 2.8 shows a conceptual image of the item lift. What this figure shows is “how many times the probability of passing Item k after passing Item j is as large as the probability of passing Item k.” In other words, it represents “how much easier it becomes to pass Item k after passing Item j.”
40
2 Test Data and Item Analysis
Fig. 2.9 Item lift Table 2.8 Item Lift Values of Three Contingency Tables Example 1
Example 2
Example 3
j\ k 0 1 T 0 .40 .10 .50 1 .20 .30 .50 T .60 .40 1.0
j\ k 0 1 T 0 .70 .00 .70 1 .00 .30 .30 T .70 .30 1.0
j\ k 0 1 T 0 .20 .30 .50 1 .20 .30 .50 T .40 .60 1.0
l jk = (.30/.50)/.40 = 1.5
l jk = (.30/.30)/.30 = 3.333
l jk = (.30/.50)/.60 = 1.0
For example, supposing that l jk = 3.0 (Fig. 2.9, right), the possibility of passing Item k after passing Item j is three times greater than that of simply passing Item k, which can be viewed as the CRR of Item k being “lifted” by Item j. When l jk = 1.0 (Fig. 2.9, center), the possibility of passing Item k after passing Item j is not different from the possibility of passing only Item k, which suggests that what Item j measures is unrelated to what Item k measures. In addition, note that the item lift is rarely < 1 (Fig. 2.9, left). When this happens, it means that the possibility of passing Item k after passing Item j is smaller than the possibility of passing Item k, which additionally implies that the students who pass Item j tend to fail Item k, and vice versa. In this case, there is a possibility that the contents of either or both items are invalid; therefore, their contents should be reviewed. Table 2.8 shows the item lift for each of the three contingency tables. For Example 1, with a moderate association between two items, the item lift is calculated as 1.5. For Example 2, with a strong association, the lift is 3.333. Finally, for Example 3, with no association, the lift is 1.0. Unlike the JCRR, the item lift can quantify the strength of the association between a pair of items.16 In addition, the CCRR is neither useful because we must check two results ( pk| j and p j|k ) to estimate the strength of the association between the items. However, in this respect, as Eq. (2.10) shows, the item lift is convenient because 16
It can be seen from Table 2.6 (p.35) that the JCRRs of Examples 1–3 are equal.
2.5 Interitem Correct Response Rate Analysis
l jk =
41
p jk pk| j p j|k = = = lk j . pk p j pk pj
(2.11)
This symmetry (l jk = lk j ) states that, when l jk = 3.0 as an example, Item k lifts the CRR of Item j by a factor of 3, which implies that the reverse is also true, i.e., that Item j also lifts the CRR of Item k by a factor of 3. This symmetry, however, prevents us from concluding that the item measures the more basic ability. To calculate the item lifts of all item pairs at once, we can use the following equation: ⎡
l11 ⎢ l21 ⎢ P L = P C (1 J p ) = ⎢ . ⎣ ..
sym l22 .. . . . .
⎤ ⎥ ⎥ ⎥. ⎦
lJ1 lJ2 · · · lJ J
This item lift matrix is symmetric (P_L = P_L') because l_jk = l_kj. Note additionally that the diagonal elements are meaningless; thus, they need not be interpreted.

Probabilistic Causality by Patrick Suppes
The American philosopher of science P. Suppes (1922–2014) did a great deal of work on introducing probability theory into causality analysis. When Event C is the cause of Event E, the first of his three famous definitions of causality is written as P(E|C) > P(E). That is, the probability of Event E occurring after Event C must be higher than the probability of Event E occurring on its own. Considering this in the context of this book, for Item j to be the cause (basic ability) of Item k, the above condition can be expressed asᵃ

$$p_{k|j} > p_k \quad \therefore \quad \frac{p_{k|j}}{p_k} = l_{jk} > 1.$$

Accordingly, his first definition requires that the item lift be > 1.
ᵃ ∴ means "therefore."
Table 2.9 Mutual Information Values of Three Contingency Tables

Example 1 (MI_jk = .125):
j\k    0    1    T
0     .40  .10  .50
1     .20  .30  .50
T     .60  .40  1.0

Example 2 (MI_jk = 0.881):
j\k    0    1    T
0     .70  .00  .70
1     .00  .30  .30
T     .70  .30  1.0

Example 3 (MI_jk = 0.00):
j\k    0    1    T
0     .20  .30  .50
1     .20  .30  .50
T     .40  .60  1.0
2.5.4 Mutual Information

Mutual information is a measure that represents the degree of interdependence between two items. Using the notation in Table 2.5, the mutual information of Items j and k is defined as follows:

$$MI_{jk} = p_{\bar{j}\bar{k}} \log_2 \frac{p_{\bar{j}\bar{k}}}{(1-p_j)(1-p_k)} + p_{\bar{j}k} \log_2 \frac{p_{\bar{j}k}}{(1-p_j)p_k} + p_{j\bar{k}} \log_2 \frac{p_{j\bar{k}}}{p_j(1-p_k)} + p_{jk} \log_2 \frac{p_{jk}}{p_j p_k}. \qquad (2.12)$$
More specifically, this index quantifies the "uncertainty reduction magnitude" for the true or false (TF) information of Item k after knowing the TF information of Item j. The range of this index is [0, 1], and the closer the value is to 1, the lower the uncertainty (the greater the certainty) regarding the TF of Item k given the TF of Item j, which suggests that Items j and k are associated. Note that the mutual information is a symmetric measure (i.e., MI_jk = MI_kj). It can therefore also be said that this index measures the uncertainty reduction for Item j after knowing Item k.

Table 2.9 shows the mutual information for each of the three contingency tables. For instance, the value for Example 1 is calculated as

$$MI_{jk} = 0.4 \log_2 \frac{0.4}{0.5 \times 0.6} + 0.1 \log_2 \frac{0.1}{0.5 \times 0.4} + 0.2 \log_2 \frac{0.2}{0.5 \times 0.6} + 0.3 \log_2 \frac{0.3}{0.5 \times 0.4} = 0.166 - 0.100 - 0.117 + 0.175 = 0.125.$$
For Example 2, where the two items are strongly associated, the mutual information is calculated as 0.881.17 For Example 3, where they are unassociated, the mutual

17 The mutual information reaches its maximum of 1 when the top-left and bottom-right cells are both .50.
information is 0. Mutual information is similar in purpose to the item lift and is easier to interpret because its range is [0, 1], whereas that of the item lift is [0, ∞). However, there is no numeric standard for judging the interdependency between two items; the standard depends on the field at hand. In a clinical trial for a new cold medicine,18 a mutual information value of 0.125 is almost equivalent to no medicinal effect, and such a medicine would not be marketable against existing products. However, in a criminal investigation,19 such a value can serve as a useful reference to narrow down the number of suspects. In test data analysis, the value of the mutual information can have different meanings for different subjects. For a math test, a value of 0.125 may be considered small, whereas for language and history tests, it may be considered large.

Moreover, considering the logarithm in the fourth term of Eq. (2.12), its argument is the item lift defined in Eq. (2.11) (p.41). Because this lift focuses on the bottom-right cell of the contingency table, it can be written as

$$l_{jk} = \frac{p_{jk}}{p_j p_k}.$$

Similarly, the arguments in the first, second, and third terms can be expressed as

$$l_{\bar{j}\bar{k}} = \frac{p_{\bar{j}\bar{k}}}{(1-p_j)(1-p_k)}, \qquad l_{\bar{j}k} = \frac{p_{\bar{j}k}}{(1-p_j)p_k}, \qquad l_{j\bar{k}} = \frac{p_{j\bar{k}}}{p_j(1-p_k)}.$$

These are the item lifts corresponding to the top-left, top-right, and bottom-left cells of the table. Using these, Eq. (2.12) can be rewritten as follows:

$$MI_{jk} = p_{\bar{j}\bar{k}} \log_2 l_{\bar{j}\bar{k}} + p_{\bar{j}k} \log_2 l_{\bar{j}k} + p_{j\bar{k}} \log_2 l_{j\bar{k}} + p_{jk} \log_2 l_{jk}.$$

Accordingly, the mutual information can be seen as the mean of the logarithms of the item lifts over the four cells, and it can thus also be called the "mean log lift."
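As a quick numeric check, the following is a minimal sketch (not the book's code) that computes the four cell lifts and the mutual information of Example 1 directly from its joint proportions; the array P below simply restates the Example 1 table of Tables 2.8 and 2.9.

```python
# A minimal sketch (not the book's code): item lift and mutual information for
# the 2x2 joint proportions of Example 1 (Tables 2.8 and 2.9).
import numpy as np

P = np.array([[0.40, 0.10],            # rows: Item j = 0, 1; cols: Item k = 0, 1
              [0.20, 0.30]])
p_j, p_k = P[1].sum(), P[:, 1].sum()   # CRRs: 0.5 and 0.4
marg = np.outer([1 - p_j, p_j], [1 - p_k, p_k])   # cell products of the margins

lift = P / marg                        # item lift of each cell (Eq. 2.11)
MI = (P * np.log2(lift)).sum()         # "mean log lift" form of Eq. (2.12)
print(round(lift[1, 1], 3), round(MI, 3))         # 1.5 and 0.125
```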
2.6 Interitem Correlation Analysis

This section introduces four indicators of the correlation between items. Of these, the tetrachoric and biserial correlations are mathematically complicated. However, they are extremely important indicators; thus, an understanding of what type of information they extract is recommended.

18 For example, Item j is "placebo (0)/dose (1)" and Item k is "not recovered (0)/recovered (1)."
19 For example, Item j is "no cigarette butt (0)/a cigarette butt (1) at the scene of the murder" and Item k is "the killer is female (0)/male (1)."
2.6.1 Phi Coefficient

The phi coefficient, denoted φ, is the Pearson's product-moment correlation coefficient between two binary items, which, despite being ordinal by nature, are treated as continuous variables. The φ coefficient between Items j and k is defined as

$$\phi_{jk} = \frac{Cov_{jk}}{\sqrt{Var_j Var_k}} \in [-1, 1], \qquad (2.13)$$
where Var_j is the variance of Item j, and Cov_jk is the covariance between the two items. That is,

$$Var_j = Cov_{jj} = \frac{\{z_j \odot (u_j - p_j 1_S)\}'\{z_j \odot (u_j - p_j 1_S)\}}{z_j'z_j - 1} = \frac{\sum_{s=1}^{S} z_{sj}(u_{sj} - p_j)^2}{\sum_{s=1}^{S} z_{sj} - 1},$$

$$Cov_{jk} = \frac{\{z_j \odot (u_j - p_j 1_S)\}'\{z_k \odot (u_k - p_k 1_S)\}}{z_j'z_k - 1} = \frac{\sum_{s=1}^{S} z_{sj}z_{sk}(u_{sj} - p_j)(u_{sk} - p_k)}{\sum_{s=1}^{S} z_{sj}z_{sk} - 1}.$$
The φ coefficient is a correlation coefficient; thus, it falls within [−1, 1].20 The closer the φ coefficient is to 1, the more likely it is that students passing Item j also pass Item k. The closer it is to −1, the less likely students passing Item j are to pass Item k. In addition, if it is close to 0, the TF of Item j is unrelated to that of Item k.

To obtain the covariances of all item pairs, that is, the covariance matrix, we can use the following equation:

$$C = \{Z \odot (U - 1_S p')\}'\{Z \odot (U - 1_S p')\} \oslash (Z'Z - 1_J 1_J') = \begin{bmatrix} Var_1 & & & \text{sym} \\ Cov_{21} & Var_2 & & \\ \vdots & \vdots & \ddots & \\ Cov_{J1} & Cov_{J2} & \cdots & Var_J \end{bmatrix} (J \times J).$$

Note that this matrix is symmetric, and its diagonal elements are the variances of the items. Next, by extracting the diagonal elements from this square matrix and vectorizing them, the variance vector is obtained as follows:

20 However, it is known that the maximum and minimum of the φ coefficient are not 1 and −1, respectively, when the marginal distributions of the two items are not the same (Ferguson, 1941; Guilford, 1965). In addition, the theoretical upper and lower limits of the correlation between two Likert items with three or more points can be obtained using linear programming (Shojima, 2020).
Table 2.10 Example data for tetrachoric correlation (numbers and percentages in parentheses)

Item j \ Item k   Correct       Incorrect     Total
Correct           40 (0.465)    20 (0.233)    60 (0.698)
Incorrect          6 (0.070)    20 (0.233)    26 (0.302)
Total             46 (0.535)    40 (0.465)    86 (1.000)

[Table 2.10, right panel: jitter plot of the responses, Item j (horizontal, 0/1) vs. Item k (vertical, 0/1)]
$$v = \mathrm{diag}(C) = \begin{bmatrix} Var_1 \\ \vdots \\ Var_J \end{bmatrix} (J \times 1). \qquad (2.14)$$

Then, the matrix containing all φ coefficients is given by21

$$\Phi = C \oslash \{v^{\circ\frac{1}{2}}(v^{\circ\frac{1}{2}})'\} = \begin{bmatrix} 1 & & & \text{sym} \\ \phi_{21} & 1 & & \\ \vdots & \vdots & \ddots & \\ \phi_{J1} & \phi_{J2} & \cdots & 1 \end{bmatrix} (J \times J).$$

This matrix is symmetric, and its diagonal elements are 1.

21 "◦" denotes the Hadamard power (p.xviii).
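To make the matrix formulas above concrete, here is a minimal sketch (not the book's code) that computes the covariance matrix C, the variance vector v, and the φ coefficient matrix from a small hypothetical data matrix U with missing-data indicator Z; the toy values are made up for illustration.

```python
# A minimal sketch (not the book's code): phi coefficient matrix following
# Eqs. (2.13)-(2.14), for hypothetical binary data U and indicator Z.
import numpy as np

U = np.array([[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 1], [1, 1, 0]])  # toy data
Z = np.ones_like(U)                      # 1 = presented, 0 = missing

p = (Z * U).sum(0) / Z.sum(0)            # CRR vector
D = Z * (U - p)                          # centered responses, zeroed when missing
C = (D.T @ D) / (Z.T @ Z - 1)            # pairwise covariance matrix
v = np.diag(C)                           # variance vector (Eq. 2.14)
Phi = C / np.sqrt(np.outer(v, v))        # phi coefficient matrix
print(np.round(Phi, 3))                  # symmetric, ones on the diagonal
```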
2.6.2 Tetrachoric Correlation

The φ coefficient has a fault in that, although it is a correlation coefficient, its maximum and minimum do not reach 1 and −1, respectively, in most cases, as stated in Footnote 20 (p.44). The tetrachoric correlation (Divgi, 1979; Olsson, 1979; Harris, 1988) is superior to the φ coefficient as a measure of the relation between an item pair.

The concept of a tetrachoric correlation is first explained. Suppose that the contingency table of Items j and k is obtained as shown in Table 2.10 (left), in which the number of students who pass both items is 40 out of 86 (p_jk = 0.465), and the CRRs of the two items are p_j = 0.698 and p_k = 0.535, respectively. Table 2.10 (right) shows
Fig. 2.10 Tetrachoric correlation
Fig. 2.11 Bivariate standard normal distribution
the scatter plot (jitter plot) of the responses to the two items. In the figure, the dots are placed with small random noise to avoid overlap. Next, from Sect. 2.4.3 (p.31), the item thresholds of the two items are computed from the CRRs as τ_j = −0.518 and τ_k = −0.088, respectively. In Fig. 2.10 (left), the scatter plot area is divided into four parts by the two item thresholds. Next, the standard bivariate normal (BVN) distribution is assumed in the background of the scatter plot. The SD of each variable of this distribution is 1, and the correlation of the two variables, ρ (rho), takes a value from −1 to 1. Figure 2.11 illustrates the standard BVN distributions when the correlation is 0.6, 0.0, and −0.4. We then consider how large the correlation of the BVN distribution would have to be for it to be likely to produce these data (Table 2.10). Of the four quadrants separated by the two thresholds, many dots are concentrated in the first (top-right) and third (bottom-left) quadrants; thus, the correlation must be positive. The tetrachoric correlation is estimated to be 0.619, implying that these data are most likely to occur under the BVN distribution with ρ = 0.619.
The idea of estimating a parameter (here, the correlation) of a model (here, the standard BVN distribution) from which the data are most likely to have been generated is called maximum likelihood estimation. The density plot of the standard BVN distribution with a correlation of 0.619 is illustrated in Fig. 2.10 (right). This is an overview of the tetrachoric correlation.22

The procedure for computing a tetrachoric correlation (Olsson, 1979) is described next. First, the PDF of the standard BVN distribution is defined as

$$f_{N_2}(x_j, x_k; \rho) = \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left\{-\frac{x_j^2 - 2\rho x_j x_k + x_k^2}{2(1-\rho^2)}\right\}, \qquad (2.15)$$
where x_j and x_k are the continuous abilities required for Items j and k, respectively. Under standard bivariate normality, the probability of passing both items is given as23

$$\text{Top Right: } p_{jk}(\rho; \tau_j, \tau_k) = \int_{\tau_j}^{\infty}\int_{\tau_k}^{\infty} f_{N_2}(x_j, x_k; \rho)\,dx_j\,dx_k = F_{N_2}(-\tau_j, -\tau_k; \rho).$$
In the equation above, only ρ is an unknown parameter to be estimated because τ_j and τ_k can be obtained from the respective CRRs. This equation is used to evaluate the sectioned volume shown in Fig. 2.12 (top right), although the volume cannot be quantified while ρ is unknown. Similarly, the volumes of the other three sections are defined as follows:
$$\text{TL: } p_{\bar{j}k}(\rho; \tau_j, \tau_k) = \int_{-\infty}^{\tau_j}\int_{\tau_k}^{\infty} f_{N_2}(x_j, x_k; \rho)\,dx_j\,dx_k = F_N(\tau_j; 0, 1) - F_{N_2}(\tau_j, \tau_k; \rho),$$

$$\text{BR: } p_{j\bar{k}}(\rho; \tau_j, \tau_k) = \int_{\tau_j}^{\infty}\int_{-\infty}^{\tau_k} f_{N_2}(x_j, x_k; \rho)\,dx_j\,dx_k = F_N(\tau_k; 0, 1) - F_{N_2}(\tau_j, \tau_k; \rho),$$

$$\text{BL: } p_{\bar{j}\bar{k}}(\rho; \tau_j, \tau_k) = \int_{-\infty}^{\tau_j}\int_{-\infty}^{\tau_k} f_{N_2}(x_j, x_k; \rho)\,dx_j\,dx_k = F_{N_2}(\tau_j, \tau_k; \rho).$$
Next, we build the likelihood, which is the probability of the data expressed as a function of the unknown parameter(s). Because the data of the 86 students are independent of each other, the likelihood is constructed as the product of the occurrence probabilities of all students, as follows24:
22 If the number of categories of either or both of the two items is > 2, it is called a polychoric correlation.
23 The total volume of the BVN (and of any statistical distribution) is 1.0 (100%); the volume over a specified region of the BVN is obtained by double integration and represents a probability.
24 For $S_{\bar{j}\bar{k}}$, $S_{\bar{j}k}$, $S_{j\bar{k}}$, and $S_{jk}$, refer to Table 2.5 (p.34).
Fig. 2.12 Volume of quadrants sectioned by two thresholds
$$l(\rho; \tau_j, \tau_k) = p_{\bar{j}\bar{k}}(\rho; \tau_j, \tau_k)^{S_{\bar{j}\bar{k}}} \cdot p_{\bar{j}k}(\rho; \tau_j, \tau_k)^{S_{\bar{j}k}} \cdot p_{j\bar{k}}(\rho; \tau_j, \tau_k)^{S_{j\bar{k}}} \cdot p_{jk}(\rho; \tau_j, \tau_k)^{S_{jk}}. \qquad (2.16)$$

Note again that only ρ is an unknown parameter in the above equation. The maximum likelihood estimation method estimates the unknown parameter(s) (herein, only ρ) by maximizing this equation (i.e., the likelihood), and the value obtained by this method is called the maximum likelihood estimate (MLE). In fact, it is extremely difficult to maximize Eq. (2.16) directly because it is a long chain of multiplications. Therefore, by taking the logarithm of the likelihood, we have25

$$ll(\rho; \tau_j, \tau_k) = S_{\bar{j}\bar{k}} \ln p_{\bar{j}\bar{k}}(\rho; \tau_j, \tau_k) + S_{\bar{j}k} \ln p_{\bar{j}k}(\rho; \tau_j, \tau_k) + S_{j\bar{k}} \ln p_{j\bar{k}}(\rho; \tau_j, \tau_k) + S_{jk} \ln p_{jk}(\rho; \tau_j, \tau_k).$$

This is called the log-likelihood.

25 The logarithm to the base e = 2.7182818⋯ (Napier's constant) is the natural logarithm, denoted by ln; i.e., ln a = log_e a.
For the data shown in Table 2.10, the log-likelihood is constructed as

$$ll(\rho; -0.518, -0.088) = 20 \ln p_{\bar{j}\bar{k}}(\rho; -0.518, -0.088) + 6 \ln p_{\bar{j}k}(\rho; -0.518, -0.088) + 20 \ln p_{j\bar{k}}(\rho; -0.518, -0.088) + 40 \ln p_{jk}(\rho; -0.518, -0.088). \qquad (2.17)$$

Again, the only unknown parameter in Eq. (2.17) is ρ. For example, if ρ = 0.5, the log-likelihood is computed as ll(0.5; −0.518, −0.088) = −106.2. This means that, under ρ = 0.5, the probability of observing the data in Table 2.10 is e^{−106.2} = 7.55 × 10^{−47}. The probability (likelihood) falls within [0, 1]; thus, the log-likelihood takes a value within the range (−∞, 0].26 Because the log-likelihood is nonpositive, the sizes of such values are easy to compare. In addition, from ll(0.0; −0.518, −0.088) = −112.9, we have the inequality

$$ll(0.5; -0.518, -0.088) > ll(0.0; -0.518, -0.088),$$

which reveals that the data are more likely to occur under ρ = 0.5 than under ρ = 0.0. Figure 2.13 shows the plot of the function ll(ρ; −0.518, −0.088), in which we can see that the log-likelihood reaches its maximum when ρ = 0.619. This demonstrates that the data are most likely to occur when ρ = 0.619; thus, 0.619 is settled as the MLE of the tetrachoric correlation. It is reasonable to regard the parameter value giving the highest data observability as the estimate. The tetrachoric correlation is the MLE of the correlation coefficient that makes a 2 × 2 table most observable under a BVN.27 As a feature of the tetrachoric correlation, its absolute value is usually larger than that of the φ coefficient (Cohen, 1983); the φ coefficient calculated from Table 2.10 is 0.401. However, to calculate a tetrachoric correlation, one needs to assume that two continuous ability variables exist behind the two binary items and that they follow a standard BVN distribution, although it is quite natural, as a naive belief, to suppose that academic ability is normally distributed.

26 ln 0 = −∞ and ln 1 = 0.
27 The Newton–Raphson method, the steepest descent method, and the bisection method are used to find the value of ρ that maximizes the log-likelihood.
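As a check on this derivation, here is a minimal sketch (not the book's code) that rebuilds the log-likelihood of Eq. (2.17) with SciPy's bivariate normal CDF and locates its maximum by a simple grid search; Footnote 27 mentions Newton–Raphson and related methods, so the grid search here is only for illustration.

```python
# A minimal sketch (not the book's code): grid-search MLE of the tetrachoric
# correlation for the 2x2 table in Table 2.10.
import numpy as np
from scipy.stats import norm, multivariate_normal

S00, S01, S10, S11 = 20, 6, 20, 40   # fail-fail, fail-pass, pass-fail, pass-pass
S = S00 + S01 + S10 + S11
p_j = (S10 + S11) / S                # CRR of Item j = 0.698
p_k = (S01 + S11) / S                # CRR of Item k = 0.535
tau_j, tau_k = norm.ppf(1 - p_j), norm.ppf(1 - p_k)  # thresholds -0.518, -0.088

def loglik(rho):
    F = lambda a, b: multivariate_normal.cdf([a, b], mean=[0, 0],
                                             cov=[[1, rho], [rho, 1]])
    p00 = F(tau_j, tau_k)                    # bottom-left quadrant volume
    p01 = norm.cdf(tau_j) - p00              # top-left
    p10 = norm.cdf(tau_k) - p00              # bottom-right
    p11 = 1 - p00 - p01 - p10                # top-right
    return (S00 * np.log(p00) + S01 * np.log(p01)
            + S10 * np.log(p10) + S11 * np.log(p11))

rhos = np.linspace(-0.95, 0.95, 191)
print(round(rhos[np.argmax([loglik(r) for r in rhos])], 3))  # close to 0.619
```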
Fig. 2.13 Log-likelihood function of tetrachoric correlation
Fig. 2.14 Scatter plot of item score and (standardized) passage rate
2.6.3 Item-Total Correlation

The item-total correlation (ITC)28 is a Pearson's correlation of an item29 with the NRS (or total score). The Pearson's correlation was originally an index for measuring the strength of a linear relationship between two continuous variables. Here, however, one of the pair is a binary ordinal variable (when it contains no missing data), and the other (the NRS) is a discrete variable because it takes only integers, although it is often regarded as continuous. In general, the scatter plot of an item and the NRS is obtained as shown in Fig. 2.14 (left), in which the horizontal axis represents the item, and the vertical axis represents the standardized passage rate (ζ).

28 Also known as the point-biserial correlation.
29 The passage rate or scoring rate can be used instead.
This ITC, otherwise known as "item discrimination," is a very important index in test data analysis. In general, each item should be positively correlated with a variable representing what the entire test measures. Figure 2.14 (left) shows that students who pass this item tend to have a higher passage rate (the ITC of this item is 0.526). Conversely, if the ITC is negative, as shown in Fig. 2.14 (right), more capable students tend to fail the item, and such an item is inappropriate for measuring the ability.

Qualitative or Quantitative?
The NRS is strictly a qualitative variable because the ability difference between 1 and 2 points is not equal to that between 2 and 3, 3 and 4, ⋯. Moreover, if the NRS is regarded as a merely ordinal variable, the mean, SD, and other statistics allowed for a continuous variable become invalid and cannot be calculated. Therefore, for convenience, the NRS is usually treated as an interval (quantitative) variable.

The ITC of Item j is given by

$$\rho_{j,\zeta} = \frac{Cov_{j,\zeta}}{\sqrt{Var_j Var_\zeta}} = \frac{Cov_{j,\zeta}}{\sqrt{Var_j}},$$
where

$$Cov_{j,\zeta} = \frac{\{z_j \odot (u_j - p_j 1_S)\}'\zeta}{z_j'1_S - 1} = \frac{\sum_{s=1}^{S} z_{sj}(u_{sj} - p_j)\zeta_s}{\sum_{s=1}^{S} z_{sj} - 1}.$$
Note that the mean and SD of the standardized passage rates are 0 and 1, respectively. For the mean and variance of Item j, see Sects. 2.4.1 (p.28) and 2.6.1 (p.44), respectively. The vector containing the ITCs of all items can be obtained by

$$\rho_\zeta = [\{Z \odot (U - 1_S p')\}'\zeta \oslash (Z'1_S - 1_J)] \oslash v^{\circ\frac{1}{2}} = \begin{bmatrix} \rho_{1,\zeta} \\ \vdots \\ \rho_{J,\zeta} \end{bmatrix} = \{\rho_{j,\zeta}\} (J \times 1).$$
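As an illustration of these formulas, the following minimal sketch (not the book's code) computes the ITC of every item with the standardized passage rate for a small hypothetical complete-data matrix (so every z_sj = 1); the data are made up.

```python
# A minimal sketch (not the book's code): item-total correlations of all items
# with the standardized passage rate zeta, following the formulas above.
import numpy as np

U = np.array([[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 1], [1, 1, 0]])  # toy data
S, J = U.shape
passage = U.mean(1)                               # passage rate of each student
zeta = (passage - passage.mean()) / passage.std(ddof=1)  # standardized: mean 0, SD 1

p = U.mean(0)                                     # CRR vector (complete data)
cov_jz = (U - p).T @ zeta / (S - 1)               # covariance of each item with zeta
var_j = ((U - p) ** 2).sum(0) / (S - 1)           # item variances
print(np.round(cov_jz / np.sqrt(var_j), 3))       # ITC vector (J x 1)
```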
Item-Remainder Correlation
The ITC is the correlation of an item with the NRS and, as an extremely important indicator, is also known as "item discrimination." However, it tends to be overestimated because the item itself is included in the NRS, which becomes a problem particularly when considering the item discrimination of a testlet. Suppose a test has four testlets with 10, 20, 10, and 10 items, respectively. Then, the ITC of Testlet 2 would be the largest simply because the NRS (or total score) contains the most items from Testlet 2. To avoid this problem, the item-remainder correlation is used. This indicator is a Pearson's correlation of the testlet score with the NRS minus that testlet score.
2.6.4 Item-Total Biserial Correlation

The ITC is simply a correlation calculated by regarding the item responses as continuous data, even though they are originally binary. Meanwhile, a biserial correlation is also a correlation between a dichotomous-ordinal variable and a continuous variable, but unlike the ITC, it assumes a normal distribution behind the dichotomous variable.30 The item-total biserial correlation is a biserial correlation of an item with the NRS, total score, or passage rate.

Let us examine an example. First, suppose that data are obtained as shown in Table 2.11 (left), where the counterpart variable used to compute the biserial correlation with Item j is the standardized passage rate (ζ). Table 2.11 (right) shows the scatter plot of the two variables, between which the ITC is ρ_{j,ζ} = 0.412. Next, from p_j = 0.6 (= 12/20), the item threshold is obtained as τ_j = F_N^{-1}(1 − 0.6; 0, 1) = −0.253 (see Sect. 2.4.3, p.31), as illustrated in Fig. 2.15 (left). In addition, by assuming a standard BVN distribution on the scatter plot, the biserial correlation is estimated based on the idea of how large a correlation would most likely generate these data. This is the very idea of the MLE used in computing the tetrachoric correlation (Sect. 2.6.2, p.45); here, the biserial correlation is obtained as 0.565. Ordinarily, the absolute value of the biserial correlation is larger than that of the ITC.

The procedure for estimating the biserial correlation was explained by Olsson et al. (1982). First, as shown in Eq. (2.15), the PDF of a standard BVN distribution is defined as

30 When the number of categories of the ordinal variable is > 2, the correlation is called a polyserial correlation.
Table 2.11 Example data for biserial correlation

Item j:    0     0     0     0    0    0    0    0    1     1
ζ (zeta): −3.21 −1.02 −0.71 −0.71 0.02 0.25 0.48 0.97 −0.43 −0.42
Item j:    1     1    1    1    1    1    1    1    1    1
ζ (zeta): −0.32 −0.20 0.25 0.36 0.38 0.42 0.49 0.74 1.23 1.44

[Table 2.11, right panel: scatter plot of item score (horizontal, 0/1) vs. standardized passage rate (vertical)]
Fig. 2.15 Biserial correlation
$$f_{N_2}(x_j, \zeta; \rho) = \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left\{-\frac{x_j^2 - 2\rho x_j \zeta + \zeta^2}{2(1-\rho^2)}\right\},$$

where x_j is the continuous ability supposed to be behind Item j, and x_j and ζ (the standardized passage rate) are assumed to follow a standard BVN distribution with correlation ρ. Next, the above equation can be transformed into31

31 Any bivariate PDF can be factored as f(x, y) = f(x)f(y|x). In addition, when f(x, y) is a BVN, f(x) and f(y|x) are also normal distributions, expressed as $f_N(x; \mu_x, \sigma_x^2)$ and $f_N\left(y; \mu_y + \frac{\rho\sigma_y}{\sigma_x}(x - \mu_x), (1-\rho^2)\sigma_y^2\right)$, respectively.
Fig. 2.16 Likelihood of a student for estimating biserial correlation
$$f_{N_2}(x_j, \zeta; \rho) = f_N(\zeta; 0, 1) f_N(x_j; \rho\zeta, 1 - \rho^2).$$

Thus, the likelihood of Student s's data is given by

$$l_s(\rho) = \left\{\int_{-\infty}^{\tau_j} f_{N_2}(x_j, \zeta_s; \rho)\,dx_j\right\}^{1-u_{sj}} \left\{\int_{\tau_j}^{\infty} f_{N_2}(x_j, \zeta_s; \rho)\,dx_j\right\}^{u_{sj}}$$
$$= f_N(\zeta_s; 0, 1) \left\{F_N(\tau_j; \rho\zeta_s, 1-\rho^2)\right\}^{1-u_{sj}} \left\{1 - F_N(\tau_j; \rho\zeta_s, 1-\rho^2)\right\}^{u_{sj}}$$
$$\propto \left\{F_N(\tau_j; \rho\zeta_s, 1-\rho^2)\right\}^{1-u_{sj}} \left\{1 - F_N(\tau_j; \rho\zeta_s, 1-\rho^2)\right\}^{u_{sj}},$$

implying that the first factor is selected if u_sj = 0 and the second factor is selected if u_sj = 1. Note that this equation becomes a function of ρ because u_sj, τ_j, and ζ_s are known and only ρ is unknown. Moreover, f_N(ζ_s; 0, 1) is not a necessary factor because it does not contain ρ.32

Figure 2.16 shows the likelihood of Student s; the figure is illustrated under the condition ρ = 0.5 (in fact, this correlation is unknown, and its estimation is the goal). If ζ_s = −1.0 (i.e., the student's ability is low), as shown in the left figure, the probability of the student passing Item j is represented by the area of the right-side part exceeding the threshold (lighter-shaded area). Conversely, the probability of failing the item is expressed by the area of the left side (darker-shaded area) below the threshold. The right figure is an example of a student with high ability (i.e., ζ_s = 1.0). Because ρ = 0.5, the higher the value of ζ, the larger the probability of passing Item j.
32 ∝ means "proportional to."
Fig. 2.17 Log-likelihood function of biserial correlation
As the likelihood of one student can be specified, the likelihood of all students shown in Table 2.11 is given by33

$$l(\rho) = \prod_{s=1}^{S} l_s(\rho),$$

and the log-likelihood becomes

$$ll(\rho) = \sum_{s=1}^{S} \ln l_s(\rho) = \sum_{s=1}^{S} \left[(1 - u_{sj}) \ln F_N(\tau_j; \rho\zeta_s, 1-\rho^2) + u_{sj} \ln\{1 - F_N(\tau_j; \rho\zeta_s, 1-\rho^2)\}\right].$$

Figure 2.17 shows the log-likelihood function for the data in Table 2.11. This log-likelihood function reaches its maximum when ρ = 0.565, which means that the data are then most observable (most likely to occur); 0.565 is thus determined to be the MLE of the biserial correlation. Items with a higher biserial correlation measure the ability targeted by the entire test more directly. Conversely, if an item has a low or negative biserial correlation, it should be excluded from the test; at a minimum, the content of the item must be reviewed.
33 ∏ means "repeated multiplication." For {a_1, a_2, ⋯, a_N}, $\prod_{n=3}^{6} a_n = a_3 \times a_4 \times a_5 \times a_6$.
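The following minimal sketch (not the book's code) rebuilds this log-likelihood for the 20 students in Table 2.11 and maximizes it by a simple grid search (again, only for illustration; Newton–Raphson or similar methods would normally be used).

```python
# A minimal sketch (not the book's code): grid-search MLE of the item-total
# biserial correlation for the data in Table 2.11 (tau_j = -0.253).
import numpy as np
from scipy.stats import norm

u = np.array([0]*8 + [1]*12)                     # Item j scores
zeta = np.array([-3.21, -1.02, -0.71, -0.71, 0.02, 0.25, 0.48, 0.97,
                 -0.43, -0.42, -0.32, -0.20, 0.25, 0.36, 0.38, 0.42,
                  0.49, 0.74, 1.23, 1.44])       # standardized passage rates
tau_j = norm.ppf(1 - u.mean())                   # threshold: -0.253

def loglik(rho):
    # conditional probability of failing Item j given each zeta_s
    F = norm.cdf(tau_j, loc=rho * zeta, scale=np.sqrt(1 - rho**2))
    return np.sum((1 - u) * np.log(F) + u * np.log(1 - F))

rhos = np.linspace(-0.95, 0.95, 191)
print(round(rhos[np.argmax([loglik(r) for r in rhos])], 3))  # close to 0.565
```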
Low Discrimination Item
The ITC and biserial correlation are often referred to as "item discrimination." If they are low, such an item is, in general, inappropriate for inclusion in a test because it does not measure what the entire test attempts to measure. However, an item that is unprecedented and unfamiliar to students is usually unlikely to be highly discriminating because students are not well trained on such an item. Basically, the ITC or biserial correlation becomes larger simply when low-/high-ability students fail/pass the item; another way to read this is that low-/high-ability students are poorly/well trained on the item. A fresh item, even if it is truly highly discriminating, cannot be passed even by students with high capability if they are not sufficiently trained on that type of item, which makes the surface item discrimination (i.e., the ITC and biserial correlation) appear low. If one continues to use the item for a while, instead of hastily concluding that it is a low-discriminating item, highly capable students will soon be well prepared for that type of item, and the item discrimination will then increase.
2.7 Test Analysis

This section describes statistics regarding the total score. Why the total score and not the NRS? The maximum NRS (i.e., the number of presented items) can vary by student when the test contains optional items (or testlets) that students can select, whereas the maximum total score is usually set to be unique regardless of the item choice. Thus, the statistics in this section mainly concern the total score vector t^{(w)} of the students (see Eq. (2.4), p.21).
2.7.1 Simple Statistics of Total Score

First, as important indices for the total score, the mean, variance, and SD are as follows:

$$\bar{t}^{(w)} = \frac{1_S' t^{(w)}}{1_S' 1_S} = \frac{\sum_{s=1}^{S} t_s^{(w)}}{S},$$
$$Var[t^{(w)}] = \frac{(t^{(w)} - \bar{t}^{(w)} 1_S)'(t^{(w)} - \bar{t}^{(w)} 1_S)}{1_S' 1_S - 1} = \frac{\sum_{s=1}^{S} (t_s^{(w)} - \bar{t}^{(w)})^2}{S - 1}, \qquad (2.18)$$
$$SD[t^{(w)}] = \sqrt{Var[t^{(w)}]}.$$
Fig. 2.18 Interquartile range
First, the mean summarizes the difficulty of the test for the students. Next, the SD is particularly important. If the SD is small, both high- and low-ability students score similar points. Generally, the aim of a test is to discriminate individual differences in academic performance. To realize this, high-/low-ability students must obtain high/low scores, and when this is realized, the SD is large.

The interquartile range (IQR) is an index of dispersion. Suppose that Q1, Q2, and Q3 are the scores quartering the total scores sorted in ascending order; i.e., they are the quartiles: the first quartile (the 25th percentile34), the median, and the third quartile (the 75th percentile), respectively. The IQR is then given by

$$IQR = Q_3 - Q_1,$$

indicating that half of all students' scores fall within this range. In the histogram shown in Fig. 2.18, from Q1 = 57 and Q3 = 84, the IQR is calculated as 27 (= 84 − 57). The IQR should be sufficiently large compared with the score range (here, [0, 100]). If the individual differences among the middle 50% of all students are large, this statistic will be large. Note, however, that if a test is made to identify upper-ability students well, the range [Q3, 100] should be wide.
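For concreteness, the following is a minimal sketch (not the book's code) of these summary statistics for a hypothetical total-score vector; the scores below are made up.

```python
# A minimal sketch (not the book's code): mean, SD, and interquartile range of
# a total-score vector t, as defined in Eq. (2.18) and the IQR above.
import numpy as np

t = np.array([45, 57, 62, 68, 71, 75, 80, 84, 90, 97])  # hypothetical total scores
q1, q2, q3 = np.percentile(t, [25, 50, 75])              # quartiles
print(t.mean(), round(t.std(ddof=1), 2), q3 - q1)        # mean, SD, IQR = Q3 - Q1
```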
2.7.2 Distribution of Total Score

Although there are many indices for expressing the distribution shape, such as skewness and kurtosis, they are not given much attention in this book because the author is not particularly concerned with the shape. However, if the author were to venture an opinion, a uniform distribution would be ideal.
34 See Sect. 2.3.4 (p.25).
Fig. 2.19 A test is similar to a road roller
Generally speaking, it is difficult to create a test that is highly discriminating from the bottom to the top of the ability range we want to measure. However, each individual test is usually given a role as a selection, achievement, or diagnostic test, and these are generally intended to measure high-, middle-, and low-level students, respectively. It is important to develop a test that disperses the scores of the targeted ability group. To disperse the scores of a particular ability group, one should include in the test many items that are moderately difficult for that group. The test used in Fig. 2.18 is found to be a basic examination consisting of easy items for discriminating lower-ability students; therefore, the scores of the lower-ability students disperse, whereas those of the upper- and average-ability students do not (Fig. 2.19). A test can be likened to a road roller. The reader should be clear about the difference between the ability distribution and the score distribution. A test, just like a road roller, flattens the ability strata where we want to discriminate in the original ability distribution, which is unobserved but often believed to be normally distributed. As stated in Normal versus Uniform Distribution (p.25), this is why it is wrong to conceive that the scores should be normally distributed. The normality of the ability distribution is a different issue from the normality of the score distribution. The score distribution is the result of the ability distribution being rolled by the test. An ideal score distribution, we daresay, is a uniform distribution that spreads flatly across the score scale because the ideal test, which could discriminate across the entire ability range, would flatten the complete ability distribution.
PPT vs CBT
It is difficult to design a paper-and-pencil test (PPT) that is discriminating across the entire ability range because a PPT contains a limited number of items. Meanwhile, a computer-based test (CBT) can present different items depending on each student's ability level. A CBT with such a mechanism is called a computer adaptive test (CAT; e.g., van der Linden and Glas, 2000; Wainer et al., 2000). In a CAT, each time a student responds to an item, the student's tentative ability level is estimated, and the next item, appropriate for that level, is selected from the item bank and presented to the student. To do this, however, it is necessary to have a large stock of items of various difficulty levels.
2.8 Dimensionality Analysis

This section describes a method for examining the dimensionality of test data. Put shortly, the dimensionality is the number of contents (or components) the test measures. For example, suppose that a French test consists of vocabulary and grammar items. Then, the test taken as a whole measures one's knowledge of French; however, looking into the details, the test actually measures one's vocabulary and grammar abilities. In this case, the dimensionality of the test is two. This dimension is often referred to as a factor or component.
2.8.1 Dimensionality of Correlation Matrix

If a test measures only one object, the test is said to be unidimensional or unifactorial, whereas when the test targets two or more objects, it is called bidimensional (bifactorial), tridimensional (trifactorial), or multidimensional (multifactorial). In general, the more unidimensional the test data are, the better the test is because, most of the time, when we give feedback to a student, only one score is returned. A student's responses to a unidimensional test can be integrated into a single score, whereas the responses to a bidimensional test should be summarized into two scores. In an extreme case, if a French test and a math test were analyzed together and found to be unidimensional, one could integrate the results of the two tests into a single score, and there would be no need to provide separate feedback on French and math. However, the dimensionality of the aggregated data of two tests for two different targets is usually obtained as ≥ 2. Consequently, the scores of French and math tests are difficult to combine. The unidimensionality of the data can be confirmed by inspecting the correlation matrix. Table 2.12 shows three correlation matrices for data that are unidimensional, bidimensional, and tridimensional, respectively.
Table 2.12 Dimensionality of test appearing in correlation matrix
[Three panels: unidimensional, bidimensional, and tridimensional correlation matrices]
Note that they are not φ coefficient matrices (Sect. 2.6.1, p.44) but tetrachoric correlation matrices (Sect. 2.6.2, p.45). Because they are correlation matrices, each is symmetric, and all diagonal elements are 1. The left table shows an example of a correlation matrix when the test data are unidimensional, where the correlations of all item pairs are large, indicating that the six items measure a certain single ability. The central table shows a bidimensional correlation matrix, where Items 1–3 compose the first factor because the interitem correlations among them are large, whereas Items 4–6 similarly form the second factor. Items 1–3 and Items 4–6 are different bunches of items because the between-group correlations are not large; thus, the test data are bidimensional. Considering the right table in the same way, the test data are found to be tridimensional. In this way, the dimensionality (i.e., the number of factors) of the test data can be determined by checking its correlation matrix. However, this process becomes increasingly difficult as the number of items increases.

Dimensionality of φ Coefficient Matrix
It is not recommended to conduct a dimensionality analysis on the φ coefficient matrix (Sect. 2.6.1, p.44) because several studies have shown that the φ coefficient matrix does not properly reproduce the dimensionality (e.g., McDonald and Ahlawat, 1974; Olsson, 1979). More specifically, when verifying the dimensionality of a φ coefficient matrix, the number of dimensions is determined by the number of difficulty-level groups among the test items, even if the test is unidimensional. For instance, if items are grouped into three difficulty levels, i.e., easy, medium, and difficult, three dimensions (factors) will be yielded. Such factors corresponding to difficulty levels are called "difficulty factors."
2.8.2 Multivariate Standard Normal Distribution

The standard BVN distribution was introduced in Sect. 2.6.2 (p.45). In general, a normal distribution with more than one variable is called a multivariate normal (MVN) distribution, and an MVN with a mean of 0 and an SD of 1 (thus, a variance of 1) for all variables is called a standard MVN distribution. When the number of variables is J, the PDF of the standard MVN distribution is given by

$$f_{N_J}(x; R) = (2\pi)^{-\frac{J}{2}} (\det R)^{-\frac{1}{2}} \exp\left(-\frac{1}{2} x'R^{-1}x\right), \qquad (2.19)$$

where

$$x = \begin{bmatrix} x_1 \\ \vdots \\ x_J \end{bmatrix} (J \times 1)$$

is the J-variate vector, and

$$R = \begin{bmatrix} 1 & r_{21} & \cdots & r_{J1} \\ r_{21} & 1 & \cdots & r_{J2} \\ \vdots & \vdots & \ddots & \vdots \\ r_{J1} & r_{J2} & \cdots & 1 \end{bmatrix} (J \times J)$$
$
% $ % 1 ρ 1 −ρ , R−1 = (det R)−1 and det R = 1 − ρ 2 . Using them, ρ 1 −ρ 1 the PDF of a general form of the BVN distribution is obtained as shown in Eq. (2.15) (p.47).
35
When J = 2, from R =
62
2 Test Data and Item Analysis
Fig. 2.20 Contour plot of trivariate standard normal distribution
Inverse Matrix and Determinant Square Matrix: Identity Matrix:
A matrix with an equal number of rows and columns A square matrix with all diagonal elements of 1; for example, the second- and third-order identity matrices are ⎤ 100 10 I2 = , and I 3 = ⎣0 1 0⎦ 01 001 $
Inverse Matrix:
Determinant:
%
⎡
For an n-th order square matrix A, if there exists an n-th order square matrix B such that B A = AB = I n , then B is called the inverse of A, as denoted by A−1 . One explanation is that the determinant is a product of the eigenvalues. This is closely related to the volume of the MVN distribution. Note that when the determinant is 0, it means that the volume of the MVN distribution is 0, which is a very serious situation because the volume of any distribution must be 1 (100%).
2.8 Dimensionality Analysis
63
Figure 2.20 shows the contour plots of standard trivariate normal (TVN) distributions.36 The left figure shows a TVN distribution with all correlations of 0, which, in other words, is the TVN when the correlation matrix is the identity matrix. In the right figure, the correlation between Items 1 and 3 is −0.5, which can be seen from the plane of Item 2 = 0 (cf. the darker vertical plane). In addition, the correlation between Items 2 and 3 is 0.5, which is expressed in the plane of Item 1 = 0 (cf. the light-shaded vertical plane). Finally, the correlation between Items 1 and 2 is 0, but we can observe a weak (positive) correlation from the horizontal plane. This is because Items 1 and 2 are uncorrelated globally across the entire space of Item 3, but the correlation is not locally 0 at the plane of Item 3 = 0.37 Thus, the correlation matrix is an important parameter that determines the shape of the (standard) MVN distribution.38 In other words, the correlation matrix is the only parameter in a standard MVN distribution. If it is settled, the shape of the standard MVN distribution is fixed.
2.8.3 Eigenvalue Decomposition In Sect. 2.8.1 (p.59), the dimensionality of a matrix is checked by looking for item groups, within each of which the interitem correlations are large. However, this procedure becomes more difficult as the matrix size increases. Therefore, in practice, the dimensionality is checked by the eigenvalue decomposition (or eigendecomposition) of the matrix. Using this method, a tetrachoric correlation matrix R (J × J ) is decomposed as follows: R = EE .
(2.20)
where ⎡ δ1 ⎢0 ⎢ =⎢. ⎣ ..
0 δ2 .. .
··· ··· .. .
⎤ 0 0⎥ ⎥ .. ⎥ (J × J ) .⎦
0 0 · · · δJ
36
In Fig. 2.11 (p.46), the vertical axis (i.e., z-axis) is used to show the density of the BVN distribution. The trivariate distribution requires the fourth axis to indicate the density, and because a four-dimensional body is not drawable, the size of the density is expressed by the brightness. 37 More specifically, the correlation between Items 1 and 2 at the locus of Item 3 = 0 is 0.33. Similarly, the (global) correlation between Items 1 and 3 across the entire space of Item 2 is −0.5, but the local correlation at Item 2 = 0 is not −0.5. 38 The parameters of the MVN distribution are mean vector and variance–covariance matrix.
64
2 Test Data and Item Analysis
is a diagonal matrix and its diagonal elements (δ1 , · · · , δ J ) are J eigenvalues extracted from the correlation matrix. The important properties of the eigenvalues when the correlation matrix is decomposed are summarized in the following box. Eigenvalues of Correlation Matrix 1. 2. 3. 4.
The J eigenvalues are ordered in the descending order in . All eigenvalues are positive. The product of the eigenvalues is the determinant of R. The sum of the eigenvalues is J .
First, as indicated by Point 1, the J eigenvalues obtained by the decomposition of the correlation matrix are arranged in descending order in . That is, δ1 ≥ δ2 ≥ · · · ≥ δ J > 0. As the reason for the nonstrict inequality, there are cases in which the eigenvalues of the same value can be obtained. In addition, as described in Point 2, all eigenvalues obtained from a correlation matrix are normally positive. Moreover, for Point 3, there is a relation in which the product of all eigenvalues is equal to the determinant. That is, det R =
J "
δj.
j=1
If there is even one zero eigenvalue, the determinant becomes zero, and a division by zero then occurs in the PDF of the MVN (Eq. (2.19), p.61) as follows: (det R)
− 21
=
& (det
R)−1
=
1 = det R
&
1 . 0
Thus, a single zero eigenvalue makes the determinant zero, and such an MVN distribution cannot exist. The MVN volume must be 1. That is, ∞ ∞ 1 J 1 f N J (x; R)dx = 1 ∴ exp − x R−1 x dx = (2π ) 2 (det R) 2 . 2 −∞ −∞ In other words, (2π ) 2 (det R) 2 is the volume of exp(−x R−1 x/2). The volume of any PDF should not be 0. Therefore, each single eigenvalue must not be 0. In addition, the eigenvalues must not be negative because the determinant (i.e., volume) will then be negative. However, one might believe there is no problem when there is an even number of negative eigenvalues; however, a single eigenvalue must not be negative. The reason for this is described in Sect. 2.8.6 (p.68). J
1
2.8 Dimensionality Analysis
65
Eigen-equation and Eigenvalue Decomposition The eigenvalues and eigenvectors can be obtained by the eigenvalue decomposition. This is done by solving simultaneous equations called an eigenequation (or characteristic polynomial). Although the method regarding how to solve the equations is omitted in this book, it can be found in introductory books in the field of linear algebra. The readers can practice using a 2 × 2 symmetric matrix. This will help in grasping the features of an eigenvalue decomposition. Decomposing a matrix of larger than 3 × 3 by hand is extremely complicated and calculation, and a computer is therefore usually used. Standard software provides functions for such computations.
2.8.4 Scree Plot Eigenvalues of Correlation Matrix (p.64) is In this section, Point 4 listed in explained. When a correlation matrix is eigendecomposed, the sum of the eigenvalues is obtained as the matrix size. That is, when the size of the correlation matrix R is J × J , we have39
tr =
J
δ j = J.
j=1
Table 2.13 shows a simple example. The left column is a 2 × 2 example with a correlation of 0.3. When this matrix is eigendecomposed, two eigenvalues are obtained: (δ1 , δ2 ) = (1.3, 0.7). The two eigenvalues are positive; thus, this matrix is said to be positive definite. These eigenvalues also indicate that their sum equal to the matrix size (1.3 + 0.7 = 2). In addition, the determinant is obtained as 0.91 (= 1.3 × 0.7). Similarly, in the right column, when eigendecomposing a 2 × 2 matrix with a correlation of 0.6, the eigenvalues are extracted as (δ1 , δ2 ) = (1.6, 0.4). Note that their sum is the same at 2 (= 1.6 + 0.4), but the magnitudes of the eigenvalues are different between these two examples.
39 The trace of a matrix is the sum of its diagonal elements.
Table 2.13 Eigenspace of bivariate standard normal distribution

Left example (r = 0.3):
$$R = \begin{bmatrix} 1.0 & 0.3 \\ 0.3 & 1.0 \end{bmatrix}, \quad \Delta = \begin{bmatrix} 1.3 & \\ & 0.7 \end{bmatrix}, \quad E = [e_1\ e_2] = \begin{bmatrix} .707 & .707 \\ .707 & -.707 \end{bmatrix}, \quad E\Delta^{\circ\frac{1}{2}} = [\sqrt{\delta_1}e_1\ \sqrt{\delta_2}e_2] = \begin{bmatrix} .806 & .592 \\ .806 & -.592 \end{bmatrix}$$

Right example (r = 0.6):
$$R = \begin{bmatrix} 1.0 & 0.6 \\ 0.6 & 1.0 \end{bmatrix}, \quad \Delta = \begin{bmatrix} 1.6 & \\ & 0.4 \end{bmatrix}, \quad E = [e_1\ e_2] = \begin{bmatrix} .707 & .707 \\ .707 & -.707 \end{bmatrix}, \quad E\Delta^{\circ\frac{1}{2}} = \begin{bmatrix} .894 & .447 \\ .894 & -.447 \end{bmatrix}$$

[Scree plot row: the eigenvalues plotted in descending order (1.3, 0.7 on the left; 1.6, 0.4 on the right). Eigenspace row: the arrows √δ1 e1 and √δ2 e2 drawn over the contour of each BVN distribution in the (Item 1, Item 2) plane.]
Clearly, the unidimensionality is higher for the example on the right because the correlation between Items 1 and 2 is larger, which means that the two items measure a similar ability. Meanwhile, the unidimensionality of the example on the left is lower because its correlation is 0.3. In other words, when a correlation matrix with higher unidimensionality is eigendecomposed, the first eigenvalue obtained is larger, with the sum of the eigenvalues unchanged. The scree plot in the third row of the table shows a plot of the eigenvalues.

The length of a test is occasionally ≥ 50. When examining the unidimensionality of such a test, we need to eigendecompose a correlation matrix of size 50 × 50 to obtain the 50 eigenvalues. The point is to check the contribution rate of the first eigenvalue, which is the ratio of the first eigenvalue to the sum of all eigenvalues (i.e., the number of items J). That is,

$$\text{Contribution Rate (\%)} = \frac{\delta_1}{J} \times 100.$$

When this rate is large, the test is considered to be unidimensional (Reckase, 1979). However, there is no clear standard for how large it should be; the standard varies depending on the field at hand.40
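For illustration, the following minimal sketch (not the book's code) eigendecomposes the right-hand correlation matrix of Table 2.14 and reports its first-eigenvalue contribution rate.

```python
# A minimal sketch (not the book's code): eigendecomposition of a correlation
# matrix and the first-eigenvalue contribution rate.
import numpy as np

R = np.array([[1.0, 0.6, 0.6],
              [0.6, 1.0, 0.6],
              [0.6, 0.6, 1.0]])
delta, E = np.linalg.eigh(R)             # eigenvalues (ascending) and eigenvectors
delta, E = delta[::-1], E[:, ::-1]       # reorder to descending
print(np.round(delta, 3))                # [2.2, 0.4, 0.4]; their sum is J = 3
print(round(delta[0] / R.shape[0] * 100, 1))  # contribution rate: 73.3 (%)
```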
2.8.5 Eigenvector

Next, the eigenvectors are explained. Eigenvalue decomposition yields eigenvectors in addition to the eigenvalues. Each column of E in Eq. (2.20) (p.63) is an eigenvector. The number of eigenvectors is the same as the number of eigenvalues; thus, J eigenvectors are obtained after decomposing a correlation matrix of size J. In Table 2.13, two eigenvectors are extracted because the matrix size is 2. Each eigenvector corresponds to an eigenvalue. In the left-hand example, the first column

$$e_1 = \begin{bmatrix} 0.707 \\ 0.707 \end{bmatrix}$$

is the first eigenvector, corresponding to the first eigenvalue (δ1 = 1.3), and the second column

$$e_2 = \begin{bmatrix} 0.707 \\ -0.707 \end{bmatrix}$$

is the eigenvector of the second eigenvalue (δ2 = 0.7).
40 A parallel analysis (Drasgow and Lissak, 1983) is also a useful method to investigate unidimensionality. Furthermore, Hattie (1985) provides a good review of methods for checking unidimensionality.
In addition, the length of each eigenvector is 1. That is,41

$$\|e_j\| = \sqrt{e_{1j}^2 + e_{2j}^2 + \cdots + e_{Jj}^2} = \sqrt{\sum_{j'=1}^{J} e_{j'j}^2} = 1.$$

Indeed, the lengths of the two eigenvectors shown above are

$$\|e_1\| = \sqrt{0.707^2 + 0.707^2} = 1, \qquad \|e_2\| = \sqrt{0.707^2 + (-0.707)^2} = 1.$$

In addition, as an extremely important property, the eigenvectors are orthogonal to each other. The orthogonality of a pair of vectors is expressed by an inner product of 0. The above two eigenvectors are orthogonal; thus, their inner product is

$$e_1'e_2 = e_{11}e_{12} + e_{21}e_{22} = 0.707 \times 0.707 + 0.707 \times (-0.707) = 0.$$

Property of Eigenvector
1. The number of eigenvectors is equal to the size of the correlation matrix.
2. The j-th eigenvector corresponds to the j-th eigenvalue.
3. The length (Euclidean norm) of an eigenvector is 1.
4. The eigenvectors are orthogonal to each other; in other words, the inner product of any pair of eigenvectors extracted from the same matrix is 0.
2.8.6 Eigenspace

The space spanned by the eigenvectors is called an eigenspace. For ∆ (see Eq. (2.20), p.63), in which each diagonal element is an eigenvalue, $\Delta^{\circ\frac{1}{2}}$ represents a matrix in which each diagonal element is the square root of the corresponding eigenvalue, as indicated in the following:

41 ‖·‖ denotes the norm of a vector, representing the square root of the sum of the squares of the vector elements. In other words, this is the Euclidean norm, representing the Euclidean distance between the two ends of the vector.
$$\Delta^{\circ\frac{1}{2}} = \begin{bmatrix} \sqrt{\delta_1} & & & \\ & \sqrt{\delta_2} & & \\ & & \ddots & \\ & & & \sqrt{\delta_J} \end{bmatrix}.$$

Because this matrix is symmetric,

$$\Delta^{\circ\frac{1}{2}} = (\Delta^{\circ\frac{1}{2}})'$$

holds, and the square of this matrix is

$$\Delta^{\circ\frac{1}{2}}\Delta^{\circ\frac{1}{2}} = \Delta^{\circ\frac{1}{2}}(\Delta^{\circ\frac{1}{2}})' = \Delta.$$

Using this property, the eigenvalue decomposition can be rewritten as

$$R = E\Delta E' = E\Delta^{\circ\frac{1}{2}}\Delta^{\circ\frac{1}{2}}E' = E\Delta^{\circ\frac{1}{2}}(E\Delta^{\circ\frac{1}{2}})'.$$

Then, taking a closer look at $E\Delta^{\circ\frac{1}{2}}$, it is formulated as

$$E\Delta^{\circ\frac{1}{2}} = [e_1\ e_2\ \cdots\ e_J]\begin{bmatrix} \sqrt{\delta_1} & & \\ & \ddots & \\ & & \sqrt{\delta_J} \end{bmatrix} = [\sqrt{\delta_1}e_1\ \sqrt{\delta_2}e_2\ \cdots\ \sqrt{\delta_J}e_J].$$

For the left example of Table 2.13, $E\Delta^{\circ\frac{1}{2}}$ is obtained as

$$E\Delta^{\circ\frac{1}{2}} = [\sqrt{\delta_1}e_1\ \sqrt{\delta_2}e_2] = \begin{bmatrix} 0.806 & 0.592 \\ 0.806 & -0.592 \end{bmatrix}.$$
Each column vector in this equation is shown in the bottom row of Table 2.13. It can be seen that the two arrows are orthogonal, which can also be confirmed from the fact that the inner product of the two vectors is 0. In addition, the square root of the eigenvalue (√δ) expresses the length of the arrow: the length of the longer vector is √δ1 = √1.3 = 1.140, and that of the shorter one is √δ2 = √0.7 = 0.837. The contour plot of the BVN distribution is shown in the background of each eigenspace. The left and right figures show the contour plots for the BVN with correlations of 0.3 and 0.6, respectively. We can see that the first eigenvector e1 points along the principal axis of the ellipse of the BVN distribution, and the second eigenvector e2 along the minor axis. In addition, the square roots of the first and second eigenvalues (√δ1 and √δ2) correspond to the lengths of the principal and minor axes, respectively. More specifically, the square roots represent the SDs (i.e., the eigenvalues represent the variances) of the
Table 2.14 Eigenspace of trivariate standard normal distribution

Left example (all correlations 0.3):
$$R = \begin{bmatrix} 1.0 & 0.3 & 0.3 \\ 0.3 & 1.0 & 0.3 \\ 0.3 & 0.3 & 1.0 \end{bmatrix}, \quad \Delta = \begin{bmatrix} 1.6 & & \\ & 0.7 & \\ & & 0.7 \end{bmatrix},$$
$$E = [e_1\ e_2\ e_3] = \begin{bmatrix} .577 & .000 & .816 \\ .577 & -.707 & -.408 \\ .577 & .707 & -.408 \end{bmatrix}, \quad E\Delta^{\circ\frac{1}{2}} = [\sqrt{\delta_1}e_1\ \sqrt{\delta_2}e_2\ \sqrt{\delta_3}e_3] = \begin{bmatrix} .730 & .000 & .683 \\ .730 & -.592 & -.342 \\ .730 & .592 & -.342 \end{bmatrix}$$

Right example (all correlations 0.6):
$$R = \begin{bmatrix} 1.0 & 0.6 & 0.6 \\ 0.6 & 1.0 & 0.6 \\ 0.6 & 0.6 & 1.0 \end{bmatrix}, \quad \Delta = \begin{bmatrix} 2.2 & & \\ & 0.4 & \\ & & 0.4 \end{bmatrix},$$
$$E = \begin{bmatrix} .577 & .000 & .816 \\ .577 & -.707 & -.408 \\ .577 & .707 & -.408 \end{bmatrix}, \quad E\Delta^{\circ\frac{1}{2}} = \begin{bmatrix} .856 & .000 & .516 \\ .856 & -.447 & -.258 \\ .856 & .447 & -.258 \end{bmatrix}$$

[Scree plot row: the three eigenvalues plotted in descending order (1.6, 0.7, 0.7 on the left; 2.2, 0.4, 0.4 on the right). Eigenspace row: the axes of each TVN density ellipsoid.]
Table 2.15 Scree plots of three matrices in Table 2.12
[Three scree plots (eigenvalue, 0–6, vs. number of eigenvalues, 1–6) for the unidimensional, bidimensional, and tridimensional correlation matrices]
corresponding ellipse axes. For this reason, all eigenvalues (i.e., all variances) must be positive, as described above.

Table 2.14 shows two examples of an eigenvalue decomposition, each with its own TVN distribution, in which all correlations are 0.3 and 0.6 in the left and right examples, respectively. That is, the right example can be considered more unidimensional because the correlations are larger. The first-eigenvalue contribution rates of the left and right examples are

$$\frac{\delta_1}{J} \times 100 = \frac{1.6}{3} \times 100 = 53.3\%, \qquad \frac{\delta_1}{J} \times 100 = \frac{2.2}{3} \times 100 = 73.3\%,$$

respectively. These values indicate that the first-eigenvalue contribution of the right-hand example is higher. Thus, the first-eigenvalue contribution indicates the degree of unidimensionality. In addition, the eigenvectors represent the three axes of the density ellipsoid of the TVN distribution, and the square root of each eigenvalue expresses the SD along the corresponding ellipsoid axis. The scree plots of the three examples in Table 2.12 are shown in Table 2.15. Only the first eigenvalue is outstanding in the unidimensional example (left), whereas the second eigenvalue is moderately large in the bidimensional example (center). In addition, the first to third eigenvalues are not small in the tridimensional example (right).
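As a numeric check of Table 2.13, this minimal sketch (not the book's code) recomputes the eigenspace axes E∆^{∘1/2} for the left example; note that the signs of eigenvectors returned by a library are arbitrary, so the columns may differ from the table by a factor of −1.

```python
# A minimal sketch (not the book's code): eigenspace axes E * Delta^{1/2} for
# the 2x2 correlation matrix with r = 0.3 (left example of Table 2.13).
import numpy as np

R = np.array([[1.0, 0.3], [0.3, 1.0]])
d, E = np.linalg.eigh(R)                 # ascending order
d, E = d[::-1], E[:, ::-1]               # reorder to descending: 1.3, 0.7
print(np.round(E * np.sqrt(d), 3))       # columns ~ [.806, .806], [.592, -.592]
```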
Guttman Criterion
The Guttman criterion/ruleᵃ (Guttman, 1954; Kaiser, 1960a) is a standard that determines the number of dimensions as the number of eigenvalues ≥ 1. Applying this criterion to the three examples in Table 2.15, the left, central, and right correlation matrices are judged to be unidimensional, bidimensional, and tridimensional, respectively. However, when the first eigenvalue is extremely large, even if the matrix is evaluated as multidimensional by the Guttman criterion, it is often regarded as unidimensional in practice. The difference between the first and second eigenvalues can then be used as an indicator of unidimensionality.

ᵃ a.k.a. the Kaiser–Guttman criterion/rule or Kaiser criterion/rule.
Chapter 3
Classical Test Theory
Classical test theory (CTT; e.g., Linn, 1993; Finch and French, 2018) is arguably the first theory to treat test scores as scientific objects. This theory is fundamental in test data analysis and mainly concerns total scores; therefore, some of the contents introduced in the previous chapter fall within the scope of CTT. In addition, test reliability is among the greatest achievements of CTT. In short, it is a measure of the accuracy of test scores.

Figure 3.1 shows the measurement accuracy of a height scale and a test. Suppose Mr. Smith's height was measured to be 173 cm (5′8.1″). The scale can be viewed as a measurement tool (or function) that returns a height in response to an input. This situation can be expressed as

f_SCALE(MR. SMITH) = 173 (cm).

Suppose also that Mr. Smith took a test and scored 73 points; this situation may then be similarly expressed as

f_TEST(MR. SMITH) = 73 (points).

Note that there is no difference between height scales and tests in the sense that both are tools for assigning a numerical value to the object being measured. In this regard, do we doubt that Mr. Smith's height is neither 172 cm nor 174 cm but 173 cm? Usually, we do not. However, what about a test? Can we believe that Mr. Smith's score is neither 72 nor 74 but exactly 73? We do not usually take the 73 points as is because we know from experience that a test score can easily fluctuate by ±5 points and sometimes even ±10 points.

Next, suppose that Mr. Davis's height is measured at 174 cm (5′8.5″). In this case, we usually do not doubt that his height is slightly but certainly greater than Mr. Smith's. This is because we have faith in accurate measurements made with a height
Fig. 3.1 Reliability of height scale and test
scale. Conversely, when Mr. Davis's test result is 74 points, do we believe that his ability is greater than that of Mr. Smith? We empirically know that there is no real ability difference in this 1-point difference. It is clear from the discussion so far that a test is a less reliable (accurate) measurement tool than a height scale. In general, as shown in Fig. 3.1 (left), a scale can be regarded as a measurement tool that can sort students' heights in ascending (or descending) order without failure; a test cannot. Figure 3.1 (right) depicts higher-ability students as having longer pencils, and a test can only roughly sort the students' abilities. The student with the highest/lowest ability will not always obtain the highest/lowest score, and it often happens that two students' abilities are moderately different, yet their scores are contrary to expectation. A test is a less reliable measurement tool than a height scale or a weight scale.1 However, a test is an important public tool that sometimes determines a student's future. Therefore, a test administrator has the social responsibility to disclose the reliability and accuracy of the test.
3.1 Measurement Model

How can the reliability of a test be defined? First, following Lord and Novick (1968), the score of Student s is expressed as
1 Likewise, psychological and social questionnaires are said to be similarly unreliable.
$$t_s^{(w)} = \tau_s + e_s. \qquad (3.1)$$
This is called the measurement model. The measurement model describes the mechanism by which a score is generated. In the model, τ_s and e_s denote the true score and the measurement error of Student s, respectively. For example, Mr. Smith would have received 76 if he could have demonstrated his usual ability, but because of a stomachache on the day he took the test, an error of −3 was introduced, and his score was then observed as 73 (= 76 − 3). In other words, the observed score includes measurement error. To express Eq. (3.1) for all students, we have

$$t^{(w)} = \tau + e \qquad (3.2)$$

or, more precisely,

$$\begin{bmatrix} t_1^{(w)} \\ \vdots \\ t_S^{(w)} \end{bmatrix} = \begin{bmatrix} \tau_1 \\ \vdots \\ \tau_S \end{bmatrix} + \begin{bmatrix} e_1 \\ \vdots \\ e_S \end{bmatrix}.$$
(3.3)
It is natural to assume that the true score and error are uncorrelated. If positively/negatively correlated, this means there is a tendency to add a larger error to a larger/smaller true score. Such an error is no longer an error; it is a kind of ability related to true ability. Note that, when the correlation is 0, the covariance is also 0. That is,2 Cor [τ , e] = √
Cov[τ , e] =0 V ar [τ ]V ar [e]
∴ Cov[τ , e] = 0.
From this assumption, the variance of Eq. (3.2) is obtained by V ar [t (w) ] = V ar [τ + e] = V ar [τ ] + 2Cov[τ , e] + V ar [e] = V ar [τ ] + V ar [e]. 2
∴ means “therefore.”.
76
3 Classical Test Theory
Thus, when the true score and error are uncorrelated, “the variance of the sum (V ar [τ + e])” becomes equal to “the sum of the variances (V ar [τ ] + V ar [e]).” For simplicity, the above equation can be rewritten as V (w) = V (τ ) + V (e) .
(3.4)
Reliability is defined as the ratio of the variance of the true score to the variance of the (observed) test score. That is, ρ (w) =
V (τ ) V (e) = 1 − V (w) V (w)
∈ [0, 1].
For example, if V (w) = 100 and V (τ ) = 80, then the reliability is calculated as ρ (w) = 0.8 (= 80/100), which indicates that 80% of the test score variance can be explained by the true score variance. Reliability is, by definition, obtained within a range of [0, 1]. The closer the reliability is to 1, the more reliable (accurate) the test is. The more public the test, the higher the reliability required. The minimum standard would be approximately 0.8. Subject and Reliability Generally, the reliability of a math and science test tends to be high while that of a history or (native) language test (e.g., English test in the USA, Japanese test in Japan) is not easy to enhance. The higher the unidimensionality (or unifactority) of a test, the higher its reliability tends to be. It is difficult to construct a test of one’s own country’s mother language to be unidimensional.
3.3 Parallel Measurement In fact, both the true score vector τ and error vector e are unknown. The only observation was the students’ total score vector t (w) . Therefore, it is necessary to develop a method to calculate test reliability. Suppose we have two test results. The measurement models of Student s for the two tests are given by (w) = τs1 + es1 , ts1 (w) = τs2 + es2 . ts2
If the measurement models of the two tests satisfy the following assumptions, the two tests are called (strongly) parallel measurements.
3.3 Parallel Measurement
77
Assumptions of Parallel Measurement 1. For all students, the true scores measured in the two tests are equal. That is, τs1 = τs2 = τs (∀s ∈ N S 55 ). 2. The error variances of the two tests are equal (V ar [e1 ] = V ar [e2 ] = V (e) ). The first assumption requires3 that the magnitudes of the true scores in the two tests are equal for each student. For example, although a student scored 62 on Test 1 and 58 on Test 2, the true scores of both tests are the same at 61 with errors of +1 on Test 1 and −4 on Test 2. Based on the assumption, the vectors of the two measurement models can be rewritten as t (w) 1 = τ 1 + e1 = τ + e1 , t (w) 2 = τ 2 + e2 = τ + e2 . The second assumption is that the error variances of the two tests are equal (the errors in the two tests of each student can be different). This makes the variances of the two tests equal, as follows: (τ ) + V (e) = V (w) , V ar [t (w) 1 ] = V ar [τ ] + V ar [e1 ] = V (τ ) + V (e) = V (w) , V ar [t (w) 2 ] = V ar [τ ] + V ar [e2 ] = V
where, from Eq. (3.3), the assumption that the true score and error are uncorrelated is applied. If the two tests satisfy these assumptions, the test reliability can be represented as the correlation between the two tests and can easily be calculated as⁴

Cor[t_1^(w), t_2^(w)] = Cov[t_1^(w), t_2^(w)] / √(Var[t_1^(w)] Var[t_2^(w)])
                      = Cov[t_1^(w), t_2^(w)] / V^(w)   (∵ Var[t_1^(w)] = Var[t_2^(w)] = V^(w)).  (3.5)
In addition, the covariance in this equation can be simplified as

Cov[t_1^(w), t_2^(w)] = Cov[τ + e_1, τ + e_2]
                      = Cov[τ, τ] + Cov[τ, e_2] + Cov[e_1, τ] + Cov[e_1, e_2]
                      = Var[τ] + 0 + 0 + 0 = V^(τ),

³ ∀s ∈ N_S means "for all members included in N_S = {1, 2, …, S}."
⁴ ∵ means "because."
Fig. 3.2 Path diagram of parallel measurement reliability
because any correlation with an error is 0⁵; thus, we have Eq. (3.5) as

Cor[t_1^(w), t_2^(w)] = Cov[t_1^(w), t_2^(w)] / V^(w) = V^(τ) / V^(w),  (3.6)
and this accords with the definition of test reliability. Figure 3.2 represents the parallel measurement assumption by a path diagram as used in structural equation modeling (SEM; e.g., Jöreskog and Sörbom, 1979; Bollen, 1989b; Toyoda, 1998; Kaplan, 2000; Muthén and Muthén, 2017), where the observed variables and error variables are represented as rectangles and small circles, respectively. A large ellipse expresses a latent variable (or factor), which here means the true score, because it is not observed. The specific constraints required for this model (i.e., path diagram) are summarized in the following box; a small simulation sketch follows the box.

Settings of Parallel Measurement in Path Diagram
1. The loadings of Tests 1 and 2 on True Score are fixed at 1.
2. The two error variances are constrained to be equal.

By fitting the model with these constraints, the squared multiple correlations (or coefficients of determination) of the two tests with the latent variable (i.e., the true score) are obtained to be the same. This value represents the reliability coefficient.
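To make the parallel-measurement logic concrete, the following minimal Python sketch (illustrative only, not the book's software; the variance values reuse the V^(w) = 100, V^(τ) = 80 example of Sect. 3.2) simulates two parallel tests and checks that their correlation recovers the reliability:

    import numpy as np

    # Minimal sketch: simulate two (strongly) parallel tests and check
    # that their correlation approximates the reliability V^(tau)/V^(w).
    rng = np.random.default_rng(0)
    S = 100_000                              # number of students (illustrative)
    tau = rng.normal(50, np.sqrt(80), S)     # true scores, V^(tau) = 80
    e1 = rng.normal(0, np.sqrt(20), S)       # errors of Test 1, V^(e) = 20
    e2 = rng.normal(0, np.sqrt(20), S)       # errors of Test 2, equal variance
    t1, t2 = tau + e1, tau + e2              # observed scores, V^(w) = 100

    print(np.corrcoef(t1, t2)[0, 1])         # approx. 0.8 = 80/100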
3.4 Tau-Equivalent Measurement The τ -equivalent measurement is a model in which only Assumption 1 of Assumptions of Parallel Measurement (p.76) is maintained, but the number of tests (testlets, or items) is generally supposed to be three or more. Consider a test consisting of J testlets (or items); the measurement model then becomes
⁵ Note that an error can be an error because it does not correlate with any variable.
t_1^(w) = τ/J + e_1,
  ⋮
t_J^(w) = τ/J + e_J.

Each testlet measures one of the J equal parts of the true score. In this case, the variances of the testlets are obtained by

V_1^(w) = V^(τ)/J² + V_1^(e),
  ⋮
V_J^(w) = V^(τ)/J² + V_J^(e).

The error variances differ for each testlet, but the true score variances are equal. Thus, the sum of the testlet variances is obtained by

Σ_{j=1}^J V_j^(w) = Σ_{j=1}^J (V^(τ)/J² + V_j^(e)) = V^(τ)/J + V^(e).  (3.7)
Note that the error variance of the total score is given by the sum of the error variances of the testlets (V^(e) = Σ_{j=1}^J V_j^(e)). Solving Eqs. (3.4) and (3.7) with respect to V^(e), we obtain

V^(w) − V^(τ) = Σ_{j=1}^J V_j^(w) − V^(τ)/J,
(1 − 1/J) V^(τ) = V^(w) − Σ_{j=1}^J V_j^(w),

ρ_α^(w) = V^(τ) / V^(w) = J/(J − 1) × (1 − Σ_{j=1}^J V_j^(w) / V^(w)).
This index is called Cronbach's α coefficient (Cronbach, 1951),⁶ and it is often used as a reliability coefficient for tests and psychological scales. In the formula, the variance of the total score (V^(w)) is given by Eq. (2.18), and the sum of the testlet variances (Σ_{j=1}^J V_j^(w)) can be calculated using v from Eq. (2.14) as follows:
⁶ When each test item is dichotomous, α is referred to as Kuder–Richardson Formula 20 (KR20; Kuder & Richardson, 1937).
Fig. 3.3 Path diagram of τ-equivalent measurement (J = 5)
Σ_{j=1}^J V_j^(w) = 1_J′ v^(w) = 1_J′ (w^∘2 ∘ v).
Figure 3.3 shows the path diagram of the τ-equivalent measurement when the number of testlets is five. Unlike in the path diagram of the parallel measurement (Fig. 3.2), the error variances are not equal. Additionally, the loadings on the latent variable are 0.2 (= 1/5). The variance of the true score estimated in this path diagram is V^(τ).⁷ Cortina (1993) reported that the α coefficient underestimates the true reliability because the τ-equivalent assumption usually does not hold. However, α is still widely used as a reliability coefficient for tests and psychological scales owing to its ease of calculation. For the confidence interval of α, see Kristof (1963); for the estimation of α in item response theory, see Shojima and Toyoda (2002).

Internal Consistency
This coefficient is also referred to as an index of internal consistency because the higher the correlation coefficients among the testlets (items), the closer the value is to 1. In addition, α tends to be larger as the number of testlets increases. This is natural because, in general, a 20-item test is more accurate (reliable) than a 10-item test.
⁷ The τ-equivalent measurement is identical to a unifactor model with fixed factor loadings. For reliability coefficient estimation by SEM, see Raykov (1997).
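As a computational sketch (illustrative, not the software of the accompanying website; it assumes a complete S × J score matrix with no missing responses), the α formula above can be coded directly:

    import numpy as np

    def cronbach_alpha(scores):
        """Cronbach's alpha from an S x J matrix (students x testlets)."""
        scores = np.asarray(scores, dtype=float)
        J = scores.shape[1]
        item_vars = scores.var(axis=0, ddof=1)      # V_j^(w), j = 1, ..., J
        total_var = scores.sum(axis=1).var(ddof=1)  # V^(w) of the total score
        return J / (J - 1) * (1 - item_vars.sum() / total_var)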
3.5 Tau-Congeneric Measurement

The τ-congeneric measurement is a more relaxed model than the τ-equivalent measurement. When a test consists of J testlets (or items), the model under the τ-congeneric measurement is expressed as

t_1^(w) = a_1 τ + e_1,
  ⋮
t_J^(w) = a_J τ + e_J,

where a_j is the loading of the j-th testlet on the true score (factor). When a_j is positive and large, the score of Testlet j strongly reflects the true score; put differently, a_j can be regarded as the discrimination power of Testlet j. Figure 3.4 shows the path diagram of this measurement model, which is the same as a factor analysis model with one factor. Unlike in the τ-equivalent measurement, the loadings of the testlets on the latent variable are not equal and are estimated from the data, while the variance of the latent variable is fixed at 1. The variance of Testlet j in this case is obtained by

V_j^(w) = a_j² V^(τ) + V_j^(e)  (j = 1, 2, …, J).

In addition, the covariance between Testlets j and k is given by

C_jk^(w) = Cov[t_j^(w), t_k^(w)] = Cov[a_j τ + e_j, a_k τ + e_k] = a_j a_k Cov[τ, τ] = a_j a_k V^(τ).
Fig. 3.4 Path diagram of τ-congeneric measurement (J = 5)
Thus, for the test score (t^(w) = t_1^(w) + ⋯ + t_J^(w)), its variance (V^(w)) is obtained by⁸

V^(w) = Σ_{j=1}^J V_j^(w) + 2 Σ_{j=2}^J Σ_{k=1}^{j−1} C_jk^(w) = (Σ_{j=1}^J a_j)² V^(τ) + Σ_{j=1}^J V_j^(e).
The first term is the variance related to the true score. Let V^(τ) be 1, as shown in Fig. 3.4. This supposes that the variance of the latent variable is 1, which is a usual assumption in factor analysis. Then, a reliability coefficient called McDonald's ω coefficient (McDonald, 1999) is defined as

ρ_ω^(w) = (Σ_{j=1}^J a_j)² V^(τ) / {(Σ_{j=1}^J a_j)² V^(τ) + Σ_{j=1}^J V_j^(e)} = (Σ_{j=1}^J a_j)² / {(Σ_{j=1}^J a_j)² + Σ_{j=1}^J V_j^(e)}.
This equation indicates that the ω coefficient can be calculated from the factor loadings and the error variances after fitting the one-factor analysis model: (Σ_{j=1}^J a_j)² is the square of the sum of the loadings, and Σ_{j=1}^J V_j^(e) is the sum of the error variances. The τ-congeneric measurement, like the τ-equivalent measurement, assumes a unifactorial structure of the data; in other words, each testlet (item) is required to measure the same object. However, the assumption of the τ-congeneric measurement is more relaxed than that of the τ-equivalent measurement and is usually more realistic for actual data. Therefore, ω under the τ-congeneric assumption is a more practical choice than α.

Reliability and Validity
If a questionnaire uses five-point Likert items, the reliability coefficient may exceed 0.9 with about 10 items, but it is not easy for a test to exceed 0.9 even with more than 40 items, because test items are binary. If the reliability of a test exceeds 0.9 with about 10 items, we should rather suspect that the contents of the items are very similar. That is, it is highly likely that the contents of the items overlap and are biased toward some specific area of the entire field of the subject. Such a test is said to lack content validity. Test reliability is a subconcept included in test validity. Ensuring validity is the first priority; then, reliability must be enhanced. See Messick (1989, 1995) for comprehensive discussions on validity.
⁸ The variance of the total score is the sum of the variances and covariances of the testlets.
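A corresponding sketch for ω (again illustrative; it assumes the loadings a_j and error variances V_j^(e) have already been estimated by a one-factor model with the factor variance fixed at 1):

    import numpy as np

    def mcdonald_omega(loadings, error_vars):
        """McDonald's omega from one-factor loadings and error variances."""
        a_sum_sq = np.sum(loadings) ** 2          # (sum of a_j)^2
        return a_sum_sq / (a_sum_sq + np.sum(error_vars))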
3.6 Chapter Summary

The greatest achievement of the CTT is the invention of the measurement model, declaring that the observed score of each student is not the true score but a score containing an error. This chapter introduced three measurement models: the parallel, τ-equivalent, and τ-congeneric measurements. The reliability coefficients differ according to the measurement model adopted. In addition, α and ω are frequently used reliability coefficients, and ρ_ω^(w) ≥ ρ_α^(w) generally holds because the assumption for ω, the τ-congeneric measurement, is more relaxed than that for α, the τ-equivalent measurement. To calculate these reliability coefficients for real data using software, refer to the accompanying website. Moreover, both the α and ω coefficients require the data to be unidimensional. When a test targets multiple measurement objects, which usually makes the test data multidimensional, reliability coefficients must be calculated for each factor (i.e., dimension). When the correlations between the factors are small, the reliability coefficient for the entire test is computed to be low, even if the reliability for each factor is high. The analyst is obligated to report reliability to guarantee the quality of his or her test. Generally, if the reliability coefficient exceeds 0.9, the test is considered to be highly accurate. However, we must keep in mind that, even in such a case, 10% of the variance is still measurement error, which indicates that the futures of children should not be determined by tests alone.

Reliability Study and Limitation of Test
It is difficult to create a test with reliability over 0.9. Revealing this limitation is, in the author's opinion, the greatest achievement of CTT. Evidently, tests are inferior to height and weight scales in accuracy. It is risky for a social system to rely too heavily on tests.
Chapter 4
Item Response Theory
Item response theory (IRT; e.g., Hambleton & Swaminathan, 1984; Lord, 1980; Toyoda, 2012; van der Linden, 2016) is today the most popular statistical model underlying test administration. The Program for International Student Assessment, the Trends in International Mathematics and Science Study, and the Test of English as a Foreign Language, as well as many other global-scale tests, are designed based on IRT. In addition, this theory is increasingly becoming an essential subject, not only for students of educational measurement but also for students of educational psychology in general. A remarkable feature of IRT is that it allows us to evaluate detailed statistical properties of each item, such as item difficulty and item discrimination. A test is a set of items; thus, it is possible to edit a test with a difficulty appropriate for the targeted students by collecting items having known statistical properties. It is also possible to create two parallel tests with an equal difficulty level but composed of different item sets.
4.1 Theta (θ): Ability Scale

The first assumption of IRT is that there exists a continuous and unidimensional ability scale,¹ denoted θ, which is called the assumption of unidimensionality. In other words, test data are required to be unidimensional (or unifactorial). This is a strong assumption because, strictly speaking, data can never be completely unidimensional. However, we can check how high the unidimensionality of the test data is (see Sect. 2.8, p. 59; Fig. 4.1).
¹ Multidimensional IRT (Reckase, 2009) can be applied when the ability scale is more than one-dimensional.
Fig. 4.1 Unidimensional and continuous ability scale of θ
The larger the value of θ, the higher the ability. By definition, the range of θ is (−∞, ∞), but for practical purposes, it is approximately (−3, 3). When the number of students is S, each of them has an ability parameter, which can be written in vector form as

θ = [θ_1, …, θ_S]′ = {θ_s}  (S × 1).
This is called the ability parameter vector. One of the objectives of IRT is to estimate individual ability parameters in the vector. When θs = 0, the ability level of Student s is average.
4.2 Item Response Function

It is natural to think that the correct response rate (CRR) of an item depends on the ability parameter. For instance, a student with large θ is likely to pass an item, while a student with low θ is not. In addition, the CRRs of some items may be high regardless of the student's ability, and some items may be difficult for all students. Furthermore, the CRRs of some items increase dramatically with θ, while others barely change with θ. In other words, each item has its own unique characteristics. The item response function,² or IRF, expresses such item characteristics seen in the CRR. The IRF of an item is usually an upward curve because the CRR generally increases as the ability θ increases. In addition, the lower and upper limits of the IRF are 0 and 1, respectively, because the CRR takes a value of 0 at the minimum and 1 at the maximum. Although there are many functions with these characteristics, logistic functions are the most widely used in IRT.
² Also called the item characteristic curve (ICC).
4.2.1 Two-Parameter Logistic Model

The two-parameter logistic model (2PLM) is a classic model that defines the probability of passing Item j by a student with ability θ, denoted Pr[u_j = 1|θ],³ as⁴

Pr[u_j = 1|θ] := P(θ; a_j, b_j) = 1 / (1 + exp{−a_j(θ − b_j)})  (a_j, b_j ∈ (−∞, ∞)),
where a_j is referred to as the slope parameter⁵ and b_j the location parameter.⁶ These parameters determine the shape of the IRF and together are called item parameters. The ranges of both parameters are (−∞, ∞) by definition; however, the practical ranges of a_j and b_j are [−0.5, 5] and [−4, 4], respectively.

Scaling Factor 1.7
In some works, the IRF is defined as

P(θ; A_j, b_j) = 1 / (1 + exp{−1.7 A_j(θ − b_j)}),
where A_j is multiplied by 1.7. This was important in the early days of IRT because the curve can then closely approximate the normal ogive function. Nowadays, however, the logistic function is no longer compared with the normal ogive function, and the simpler model is better because it is easier to interpret. In this book, 1.7 × A_j is expressed as a_j. Some software may factor 1.7 out of a_j; thus, one must take care when interpreting the results.

Figure 4.2 shows the IRFs of the 2PLM for three items with a slope parameter of 1 and location parameters of −1, 0, and 1. As the location parameter increases, the IRF shifts further to the right. When θ = 0, the IRFs of the three items take the values:

Item 1: P(0; 1, −1) = 0.731,
Item 2: P(0; 1, 0) = 0.500,
Item 3: P(0; 1, 1) = 0.269.
This indicates that for a student of average ability, the CRRs for the items are 73.1%, 50%, and 26.9%. That is, of the three items, Item 3, with the largest location parameter, is the most difficult.

³ Pr[A|B] stands for "the probability of A given B."
⁴ LHS := RHS means "LHS is defined as RHS."
⁵ Often called the discrimination parameter.
⁶ Often called the difficulty parameter.
Fig. 4.2 IRFs of 2PLM with various location parameters
When an IRT analysis is performed, the location parameters are estimated for all items. In general, the more difficult an item, the larger the estimated location parameter and the further to the right its IRF is located. Therefore, as this parameter controls the IRF position, which indicates the difficulty of the item, it is also called the difficulty parameter. In the 2PLM, the CRR of an item is 50% for a student with an ability parameter equal to the item's location parameter.⁷ That is,⁸

2PLM: P(θ = b_j; a_j, b_j) = 1 / (1 + exp(0)) = 1 / (1 + 1) = 0.5.

Considering the three items in Fig. 4.2,

Item 1: P(θ = −1; 1, −1) = 0.5,
Item 2: P(θ = 0; 1, 0) = 0.5,
Item 3: P(θ = 1; 1, 1) = 0.5.
The slope parameter is explained in Fig. 4.3. When the slope parameter is positive, as in Items 1 and 2, the IRF increases sigmoidally, and the larger the value, the steeper the slope. In the figure, the IRF of Item 1 is steeper than that of Item 2, as a_1 > a_2. In addition, the IRF is steepest when θ = b_j; thus, the IRFs of Items 1 and 2 are steepest when θ = −1 because b_1 = b_2 = −1. Item 1 is steepest when θ = −1, which means that this item is likely to pass students with ability > −1 and to block (fail) students with ability < −1. In other words, the ability of a student who passes Item 1 is probably > −1; conversely, that of a student who fails this item is likely < −1, which means Item 1 discriminates
⁷ This is not true for the 3PLM and 4PLM.
⁸ exp(0) = e⁰ = 2.7183⁰ = 1. Note that a number to the 0th power is 1.
Fig. 4.3 IRFs of 2PLM with various slope parameters
whether the ability of a student is above or below −1. Therefore, we can say that Item 1 is highly discriminating around θ = −1.⁹ In comparison, Item 2 is less discriminating than Item 1 at locus θ = −1. Item 2 is also steepest at θ = b_2 = −1, but passing this item does not necessarily mean that the student's ability is θ ≥ −1, because even a low-ability student with θ = −2 has a 37.8% chance of passing it:

Item 2: Pr[u_2 = 1|θ = −2] = P(−2; 0.5, −1) = 0.378.

Similarly, the fact that a student fails Item 2 does not mean the student's θ < −1, because there is a 26.9% chance of failing the item even for a high-ability student with θ = 1:

Item 2: Pr[u_2 = 0|θ = 1] = 1 − P(1; 0.5, −1) = 0.269.

In addition, as shown by Item 3 in Fig. 4.3, when the slope parameter is 0, the IRF is constant regardless of θ. That is,

P(θ; 0, b_j) = 1 / (1 + exp(0)) = 1/2.
Such an item is inappropriate because there is always a 50–50 chance that it will be passed by students of any ability level. For this reason, an item with this parameter close to 0 is usually regarded as not discriminating. Furthermore, the IRF of Item 4 monotonically decreases because the slope parameter is negative, which indicates that a student with higher ability is more likely to fail the item. This situation is unusual; therefore, the content of such an item should be reexamined.

⁹ Note that Item 1 is not highly discriminating across the entire θ scale. There is no such globally discriminating item.
In general, an item with a positive and large slope parameter is preferable and regarded as highly (but locally) discriminating. To edit a test for measuring high-ability students, a number of highly discriminating items should be selected from among the items whose location parameters are large. The 2PLM features are summarized as follows (a small computational sketch follows the box):

Characteristics of Two-Parameter Logistic Model
1. As the location parameter (b_j) increases, the IRF shifts further to the right.
2. As the slope parameter (a_j) becomes positive and larger, the sigmoid of the IRF rises more sharply.
3. The CRR is 0.5 when θ = b_j.
4. The IRF is steepest when θ = b_j.
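As an illustrative sketch (not the book's code), the 2PLM IRF can be written directly from its definition; the first printed value reproduces Item 3 of Fig. 4.2 at θ = 0:

    import numpy as np

    def irf_2plm(theta, a, b):
        """IRF of the 2PLM: P(theta; a_j, b_j)."""
        return 1.0 / (1.0 + np.exp(-a * (theta - b)))

    print(irf_2plm(0.0, 1.0, 1.0))                         # 0.269 (Item 3 of Fig. 4.2)
    print(irf_2plm(np.array([-2.0, 0.0, 2.0]), 1.0, 0.0))  # rises sigmoidally with theta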
4.2.2 Three-Parameter Logistic Model

The three-parameter logistic model (3PLM) is a model in which a lower asymptote parameter, c_j, is added to the 2PLM. The 3PLM is currently the model most commonly used in practical tests. The IRF of the 3PLM is expressed by

P(θ; a_j, b_j, c_j) = c_j + (1 − c_j) / (1 + exp{−a_j(θ − b_j)})  (c_j ∈ [0, 1]).
The range of c j is [0, 1]. In addition, when c j = 0, P(θ ; a j , b j , 0) is identical to the 2PLM. The IRFs of the 3PLM are shown in Fig. 4.4. The slope parameter is 2, and the location parameter is 0 for all three items. As seen from the figure, the lower limit of
Fig. 4.4 IRFs of 3PLM with various lower asymptote parameters
IRF is higher as the lower asymptote parameter increases. If c_j < 0 or c_j > 1, the range of IRF values falls outside [0, 1]; thus, the parameter range is set as c_j ∈ [0, 1]. The 3PLM is effective when analyzing a test containing multiple-choice items. The CRR of such an item is not 0 even for a student with very low ability: when answering a four-choice item, even a very low-ability student has a 25% chance of passing by luck. In addition, even when an item is not multiple-choice but is very easy, its CRR is not 0 even for a student with very low ability. Many books and papers refer to c_j as the guessing parameter, which is not a neutral expression, because low-ability students do not always pass only by guessing. As mentioned above, the lower asymptote parameter of a very easy item is usually estimated to be larger, but this does not imply that low-ability students pass by guessing. As for Point 3 listed in Characteristics of 2PLM (p. 90), the CRR is 0.5 when θ = b_j, but this is not the case for the 3PLM. The CRR of the 3PLM when θ = b_j is

P(θ = b_j; a_j, b_j, c_j) = c_j + (1 − c_j)/2 = (1 + c_j)/2.
For example, if c_j = 0.3, P(θ = b_j; a_j, b_j, 0.3) = 0.65. The value of θ that gives a CRR of 0.5 is found by solving the following equation:

0.5 = c_j + (1 − c_j) / (1 + exp{−a_j(θ − b_j)}),
exp{−a_j(θ − b_j)} = 1 / (1 − 2c_j),
a_j(θ − b_j) = ln(1 − 2c_j),
θ = ln(1 − 2c_j)/a_j + b_j.  (4.1)
However, because it is difficult to find a practical meaning for this value, it is usually not checked. Meanwhile, Point 4 of Characteristics of 2PLM (p. 90) is also true for the 3PLM: the gradient of the IRF is steepest when θ = b_j.
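A quick numerical check of Eq. (4.1) (an illustrative sketch; c_j = 0.3 as in the example above, while a_j = 2 and b_j = 0 are assumed):

    import numpy as np

    def irf_3plm(theta, a, b, c):
        """IRF of the 3PLM."""
        return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

    a, b, c = 2.0, 0.0, 0.3
    theta_half = np.log(1 - 2 * c) / a + b    # Eq. (4.1)
    print(theta_half)                         # approx. -0.458
    print(irf_3plm(theta_half, a, b, c))      # 0.5, as derived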
Fig. 4.5 IRFs of 4PLM with various upper asymptote parameters
Guessing or Lower Asymptote?
Test data analysts should not refer to c_j as the guessing parameter when feeding the analysis results back to item creators. Such an expression suggests that they created an item that could be answered correctly by guesswork, which often discourages them from creating new items and reviewing existing ones. As mentioned in the main text, the lower asymptote of a very easy item is estimated to be large whether the item is multiple-choice or not. For example, the lower asymptote of a very easy four-choice item is estimated to be > 0.25. Although guessing contributes to c_j, it is not all of c_j.
4.2.3 Four-Parameter Logistic Model

The four-parameter logistic model (4PLM; Barton & Lord, 1981; Liao et al., 2012) is a model in which one additional parameter d_j, called the upper asymptote parameter, is added to the 3PLM. The IRF of the 4PLM is defined as

P(θ; a_j, b_j, c_j, d_j) = c_j + (d_j − c_j) / (1 + exp{−a_j(θ − b_j)})  (0 ≤ c_j < d_j ≤ 1).  (4.2)
Figure 4.5 shows the IRFs of three items for which the slope, location, and lower asymptote parameters are the same but the upper asymptote parameters are different. It is clear from the figure that the fourth parameter controls the upper limit of the IRF; the larger the value, the higher the upper limit. This parameter is useful for a difficult item whose CRR does not reach 1 even for high-ability students. If d_j = 1, then P(θ; a_j, b_j, c_j, 1) is identical to the 3PLM.
One may believe that the upper asymptote must be larger than the lower asymptote (d_j > c_j), but it is mathematically possible for d_j < c_j, in which case the IRF decreases, meaning that a higher-ability student tends to fail the item. This situation is mathematically possible but pedagogically unusual; if the two parameters have such a relationship, it is better to review the item content. In the 4PLM, with reference to Eq. (4.1), the value of θ giving CRR = 0.5 is found to be

0.5 = c_j + (d_j − c_j) / (1 + exp{−a_j(θ − b_j)})
∴ θ = {ln(1 − 2c_j) − ln(2d_j − 1)}/a_j + b_j,
although it is also difficult to find a practical meaning for this value. In addition, the IRF gradient is steepest when θ = b_j, which is common to the 2PLM, 3PLM, and 4PLM. Currently, the most frequently used model is the 3PLM, and the 4PLM is rarely used in practice.

Rasch Model
Although rarely used nowadays, the one-parameter logistic model (1PLM) is a model with only one parameter, b_j. In this model, the CRR of Item j for a student with ability θ is defined by

P(θ; b_j) = 1 / (1 + exp{−(θ − b_j)}).
This model is a 2PLM in which a_j is constrained to 1. The model was often used before computers became powerful, and it is also called the Rasch model after its developer, Georg Rasch (1901–1980), a Danish mathematician, statistician, and psychometrician who contributed to IRT in its early days.
4.2.4 Required Sample Size

We need to decide which model to use before conducting an analysis. Generally, the 4PLM best fits data because it is the most flexible. However, when the sample size is small, using a model with fewer item parameters is desirable, because a model with more parameters may yield less accurate parameter estimates. Şahin and Anil (2017) indicate the required sample size depending on the model¹⁰ and the number of items, as shown in Table 4.1. For example, for a test with 30 items, if one desires to use the 2PLM, the sample size should be > 250. This may be encouraging information for analysts because some works have insisted that the

¹⁰ However, the 4PLM was excluded.
Table 4.1 Sample size required under IRT models and test lengths

Test Length | 1PLM | 2PLM | 3PLM
10          | 150  | 750  | 750
20          | 150  | 500  | 750
30          | 150  | 250  | 350

Şahin and Anil (2017, Table 4)
sample size should exceed 1000 under such a condition. This is too large, because the 2PLM has been shown to be mathematically identical to a categorical factor analysis model with a single factor (Takane & de Leeuw, 1987); one would be at a loss if a factor analysis could not be performed unless the sample size were > 1000.
4.2.5 Test Response Function

The test response function¹¹ (TRF) is the expected number-right score given ability value θ. That is,¹²

E[t|θ] := T(θ; Λ) = Σ_{j=1}^J P(θ; λ_j).

If the items are weighted, the expectation is given by

E[t^(w)|θ] := T^(w)(θ; Λ) = Σ_{j=1}^J w_j P(θ; λ_j).

In this case, this expectation is the expected total score given θ. In the above equations, λ_j is the item parameter vector of Item j. The contents of the vector vary by IRT model as follows:

λ_j = [a_j b_j]            (2PLM)
λ_j = [a_j b_j c_j]        (3PLM)
λ_j = [a_j b_j c_j d_j]    (4PLM)

¹¹ Also called the test characteristic curve (TCC).
¹² E[A|B] denotes the expectation of A given B. The expectation is the average of a random variable.
Table 4.2 Item parameters of a 15-item test

Item    | Slope | Location | L. Asymp. | U. Asymp.
Item 01 | 0.776 | −1.441   | 0.149     | 0.946
Item 02 | 0.997 | −1.648   | 0.095     | 0.891
Item 03 | 1.008 | −1.727   | 0.248     | 0.869
Item 04 | 1.770 | −1.336   | 0.040     | 0.965
Item 05 | 1.303 | −1.970   | 0.188     | 0.920
Item 06 | 1.515 | −1.702   | 0.157     | 0.976
Item 07 | 1.879 | −0.878   | 0.205     | 0.925
Item 08 | 1.602 | −0.135   | 0.248     | 0.888
Item 09 | 1.410 | 1.664    | 0.265     | 0.944
Item 10 | 0.977 | −1.472   | 0.070     | 0.817
Item 11 | 1.731 | 0.992    | 0.037     | 0.974
Item 12 | 1.992 | 0.952    | 0.105     | 0.850
Item 13 | 1.235 | −0.960   | 0.056     | 0.927
Item 14 | 1.570 | −0.795   | 0.208     | 0.994
Item 15 | 1.154 | −0.476   | 0.298     | 0.957
In addition, Λ is the item parameter matrix. That is,

    ⎡ λ_1′ ⎤   ⎡ a_1  b_1  …  ⎤
Λ = ⎢  ⋮   ⎥ = ⎢  ⋮    ⋮   ⋱  ⎥,
    ⎣ λ_J′ ⎦   ⎣ a_J  b_J  …  ⎦
where the j-th row vector is the item parameter vector of Item j. The number of columns of this matrix differs by the IRT model used; for example, the size of the matrix is J × 3 when the 3PLM is employed. As an illustration, Table 4.2 shows the result of a 15-item test analyzed by the 4PLM. It is sufficient to give the item parameter estimates to three decimal places. The IRFs for these 15 items are shown in Fig. 4.6 (left), and the TRF then looks like Fig. 4.6 (right). The TRF is simply the sum of the IRFs; note that the vertical axis is the expected number-right score. For instance, a student with θ = 0 would be expected to pass 10.01 items, while a student with θ = 1 would pass 11.91 items on average. By and large, the location parameters of many items are negative, and this test can thus be said to be suited to measuring low-ability students. The TRF is useful when editing a test for checking how difficult the test is: for example, one may adjust a test so that students with lower ability get about 40% of the items right to encourage them, or retouch a test to be a little harder so as not to bore students with higher ability. It is a good idea to monitor the TRF as one selects items to ensure the test matches the targeted ability level.
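As a sketch (illustrative, not the book's software), the TRF can be computed by summing the IRFs over the rows of the item parameter matrix Λ; the two rows below are Items 01 and 02 of Table 4.2:

    import numpy as np

    def irf_4plm(theta, a, b, c, d):
        """IRF of the 4PLM, Eq. (4.2)."""
        return c + (d - c) / (1 + np.exp(-a * (theta - b)))

    def trf(theta, Lambda):
        """TRF: expected number-right score, the sum of the IRFs."""
        return sum(irf_4plm(theta, *row) for row in Lambda)

    Lambda = np.array([[0.776, -1.441, 0.149, 0.946],   # Item 01 of Table 4.2
                       [0.997, -1.648, 0.095, 0.891]])  # Item 02 of Table 4.2
    print(trf(0.0, Lambda))   # expected score on these two items at theta = 0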
Fig. 4.6 Test response function
4.3 Ability Parameter Estimation

One of the important parts of testing is providing feedback to each student about his or her ability, which is θ_s for Student s. When test data are obtained, the item parameter matrix (Λ = {λ_j}) is first estimated, and then the ability parameter vector (θ = {θ_s}) is estimated. This section explains how to estimate each student's ability parameter under the condition that the item parameters have already been obtained. This situation is not unusual: if one edits a test by selecting items from an item bank in which the item parameters have already been fixed, then, after the test is administered, estimating the ability parameters of the students is the next task. A "ˆ" (hat) denotes an estimate; Λ̂ is the matrix of item parameter estimates.
4.3.1 Assumption of Local Independence

An important assumption is introduced in estimating the ability parameter: the assumption of local independence (Lord & Novick, 1968). This assumption and the unidimensionality assumption are the two important assumptions of IRT. It is assumed that, given θ, the true or false (TF) response to each item is independent. This can be rephrased as, "given θ, only the characteristics of the item determine the TF." In statistics, this is exactly the same as conditional independence, but it is customarily called local independence in this area. The concept of this assumption is explained in detail below (Fig. 4.7). This assumption is analogous to a vision test in which Landolt rings of various sizes are oriented in different directions. The smaller the ring, the more difficult it is to answer correctly. In general, a student with an eyesight of 1.0 can report the slit in a ring with a diameter of 4.5 mm at a distance of 3 m.
Fig. 4.7 Vision test analogy to local independence assumption
In this case, the local independence means that, “given an eyesight, the trial for each ring is independent.” Once the eyesight of a student is determined (whether unknown or known to the oculist), then only the size of each ring (the feature of the ring) can determine the TF response. The responses to the j-th and k-th rings are independent for each student whose vision is fixed (whether known or unknown). It does not seem difficult for the reader who has had his or her eyesight tested to consider that each trial is independent,13 and almost all statistical models that use latent variables, such as factor analysis and latent class analysis, assume the local independence assumption. However, a superficial user of IRT skips the antecedent “given θ ” in the assumption and only accepts the consequent “the TF response to each item is independent.” If the antecedent is dropped, the consequent alone means “global independence” (Fischer & Molenaar, 1995), which is never going to happen (Table 4.3).14 The local independence assumption requires that, just as the trials of Rings j and k were independent for a given eyesight (i.e., with a constant value), the TF responses to Items j and k be independent given ability θ . It is not at all unusual for students with good/bad eyesight (ability) to answer correctly/incorrectly to Rings (Items) j and k, which is the global dependence between items. Local independence is not an assumption that negates this global dependence. Yen (1993) summarized the factors that threaten local independence,15 one of which is item chaining. This is a type of two (or more) items where the answer of Item j is used to pass Item k. Such two items violate local independence. Suppose that there are two students with the same ability. Then their CRRs to pass Item j are the same. However, there is a possibility that their TFs differ even if their abilities are equal because their TFs are determined stochastically. If the items are locally independent, their CRRs for Item k are also the same. However, if the items are locally dependent, a student passing Item j is likely to pass Item k, whereas there is no possibility that the student (accidentally) failing Item j can pass Item k because the answer of Item j is used to pass Item k.
¹³ However, it is an assumption rarely met in the real world.
¹⁴ See also Local Independence in IRT (p. 424).
¹⁵ Yen (1984, 1993) also proposed an index of local independence, Q₃.
Table 4.3 Local independence

Concept              Meaning
Global Dependence    The TF of Item j is associated with the TF of Item k. It is natural that two items are associated, because students passing Item j have higher ability and thus can pass Item k as well.
Global Independence  The TF of Item j is unassociated with the TF of Item k, which is impossible if the two items measure the same ability. In addition, this global independence should not be mistaken for local independence.
Local Dependence     (Even) for students with the same ability, the TF of Item j is associated with the TF of Item k. There is the possibility of item chaining (see below).
Local Independence   For students with the same ability, the TFs of Items j and k are unassociated, which is assumed in the IRT. Only the item characteristics other than the students' ability affect the TFs.
Differential Item Functioning
Items that are advantageous or disadvantageous for a particular subgroup are referred to as having differential item functioning (DIF). For example, an item with a large difference in the CRR by race is said to have DIF. A test editor should avoid including such items in a test. In most countries, the average scores of female students are generally higher than those of male students in language tests. In mathematics, conversely, the average score of male students tends to be higher at 18 years of age.¹⁶ Of course, there are many female students who are good at math and many male students who are good at foreign languages; one should not have a preconceived notion of what subjects a student is good at based on his or her gender. We do not say that there is DIF when the CRRs are 10% higher for female students than for male students on all items of a foreign language test. This is simply a sex difference. For more information on how to detect DIF, see Camilli and Shepard (1994) and Steinberg and Thissen (2006).
4.3.2 Likelihood of Ability Parameter

Due to the local independence assumption, given θ, the item responses can be regarded as independent. Strictly speaking, this does not hold in the real world,¹⁷ but it is a very useful assumption that many other statistical models also adopt. Under this

¹⁶ At age 12, the average scores of female students are generally higher in all subjects.
¹⁷ This is similar to the assumption of data normality in analysis of variance. Strictly speaking, there is no random variable in the real world that is exactly normally distributed (Matloff, 2020). The normal distribution is a mental abstraction.
assumption, the mathematics of a statistical model becomes simpler and easier to manipulate. Here, suppose Student s, whose ability is θ_s, passes Item j and fails Item k. The joint probability of the two responses is factored as¹⁸

Pr[u_sj = 1 ∧ u_sk = 0|θ_s] = Pr[u_sj = 1|θ_s] × Pr[u_sk = 0|θ_s] = P(θ_s; λ̂_j){1 − P(θ_s; λ̂_k)},

because of the basic principle that the joint probability of mutually independent events is the product of the probabilities of the events.¹⁹ In this way, the joint probability can be expressed as the product of the two IRFs. This is the benefit of this assumption. Moreover, the assumption allows the factorization not only of two items but of all item responses of Student s. Thus, the joint probability of J items by Student s is factored as

Pr[u_s|θ_s] := l(u_s|θ_s) = Π_{j=1}^J [P(θ_s; λ̂_j)^{u_sj} {1 − P(θ_s; λ̂_j)}^{1−u_sj}]^{z_sj},  (4.3)

where l(u_s|θ_s) denotes the occurrence probability (or likelihood) of data vector u_s for Student s whose ability is θ_s. In this equation, the likelihood corresponding to Item j is

Pr[u_sj|θ_s] := l(u_sj|θ_s) = [P(θ_s; λ̂_j)^{u_sj} {1 − P(θ_s; λ̂_j)}^{1−u_sj}]^{z_sj}.
Summary of Likelihood Function of Item j
1. If the response of Student s to Item j is missing (z_sj = 0), then l(u_sj|θ_s) = 1. This does not affect the overall likelihood.
2. If the response to Item j is observed (z_sj = 1) and passed (u_sj = 1), P(θ_s; λ̂_j) is selected. If failed, {1 − P(θ_s; λ̂_j)} is selected.

The two main points of this equation are listed above. Note that in Eq. (4.3), the only unknown parameter is θ_s. Then, l(u_s|θ_s) is called the likelihood function of θ_s. Likelihood is a probability expressed by using an unknown parameter(s).
¹⁸ ∧ means "and" (logical conjunction). In addition, ∨ means "or" (logical disjunction).
¹⁹ If two dice are rolled, the joint probability that one die shows an even number and the other shows a three is factored as Pr[even ∧ three] = Pr[even] × Pr[three] = (1/2) × (1/6).
Table 4.4 Item parameters of a nine-item test and example data

Item      |    1 |    2 |    3 |    4 |   5 |   6 |   7 |   8 |   9
Slope     |  2.0 |  2.0 |  2.0 |  2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0
Location  | −2.4 | −1.8 | −1.2 | −0.6 | 0.0 | 0.6 | 1.2 | 1.8 | 2.4
Student 1 |    1 |    1 |    0 |    0 |   0 |   0 |   0 |   0 |   0
Student 2 |    1 |    1 |    1 |    1 |   0 |   0 |   0 |   0 |   0
Student 3 |    1 |    1 |    1 |    1 |   1 |   1 |   1 |   0 |   0
Student 4 |    0 |    0 |    0 |    0 |   0 |   0 |   0 |   0 |   0
Student 5 |    1 |    1 |    1 |    1 |   1 |   1 |   1 |   1 |   1
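Eq. (4.3) translates directly into code; for numerical stability, the sketch below (illustrative, not the book's software) works with its logarithm, anticipating Eq. (4.4). It reproduces the log-likelihood value quoted in Eq. (4.5) for Student 1 of Table 4.4 under the 2PLM:

    import numpy as np

    def irf_2plm(theta, a, b):
        return 1.0 / (1.0 + np.exp(-a * (theta - b)))

    def log_likelihood(theta, u, z, a, b):
        """Natural log of Eq. (4.3) under the 2PLM.
        u: 1/0 responses, z: 1/0 observed indicators, a, b: item parameters."""
        p = irf_2plm(theta, a, b)
        return np.sum(z * (u * np.log(p) + (1 - u) * np.log(1 - p)))

    # Student 1 of Table 4.4: passes Items 1-2, fails Items 3-9
    a = np.full(9, 2.0)
    b = np.arange(-2.4, 2.5, 0.6)
    u1 = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0])
    z1 = np.ones(9)
    print(log_likelihood(-1.545, u1, z1, a, b))   # approx. -1.248, Eq. (4.5)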
4.3.3 Maximum Likelihood Estimate

The likelihood (probability) is a function of the unknown parameter θ_s. Therefore, it makes sense to find the θ_s that maximizes this probability, because it is natural to assume that the data were observed precisely because their occurrence probability was large. The procedure of obtaining a parameter(s) by maximizing the likelihood is called maximum likelihood estimation (MLE), and the estimate obtained by this method is called the maximum likelihood estimate (also abbreviated MLE). From 0 < P(θ_s; λ̂_j) < 1, the likelihood of Eq. (4.3) is a very small number. Thus, we consider the log-likelihood function, which is the (natural) logarithm of the likelihood, as follows:

ll(u_s|θ_s) = Σ_{j=1}^J z_sj [u_sj ln P(θ_s; λ̂_j) + (1 − u_sj) ln{1 − P(θ_s; λ̂_j)}].  (4.4)
The logarithmic transformation preserves the original relative order of values. That is, when θ_s ≠ θ_t,²⁰ ²¹

l(u_s|θ_s) ≷ l(u_t|θ_t) ⇔ ll(u_s|θ_s) ≷ ll(u_t|θ_t).

In other words, the θ that maximizes the likelihood function is the θ that maximizes the log-likelihood function, and the θ maximizing the log-likelihood is therefore the MLE. Suppose a nine-item test is analyzed with the 2PLM, and the item parameters are obtained as in Table 4.4, where the slope parameters of all the items are 2.0 but the location parameters range from −2.4 to 2.4 in steps of 0.6. The IRFs of these nine items are shown in Fig. 4.8. Because all the slope parameters are the same, the nine IRFs are parallel. Suppose further that five students respond to these nine items (Table 4.4). Student 1 passed only Items 1 and 2 but failed the other seven items. The ability parameter
²⁰ ⇔ represents "if and only if." A ⇔ B means "A is true (false) if and only if B is true (false)."
²¹ For real numbers a, b, c, and d, a ≷ b ⇔ c ≷ d means "a > b if c > d, and a < b if c < d."
Fig. 4.8 IRFs of a nine-item test
Fig. 4.9 Log-likelihood of the ability parameter
for this student is expected to lie between −1.8 < θ_1 < −1.2. In addition, Student 2 passed Items 1–4 but failed the remaining five items; the ability of this student would be −0.6 < θ_2 < 0.0. Finally, Student 3 passed the first seven items and failed the remaining two; the ability of this student would be 1.2 < θ_3 < 1.8. The log-likelihood functions of Students 1–3 are shown in Fig. 4.9. As the ability parameter estimate for Student 1 was expected to satisfy −1.8 < θ_1 < −1.2, the θ that maximized the log-likelihood function of Student 1 was −1.545. In other words, the log-likelihood function reaches its maximum when θ_1 = −1.545. The value of the log-likelihood function is then

ll(u_1|θ_1 = −1.545) = −1.248.  (4.5)
The value −1.545 is the MLE of the ability parameter for Student 1, denoted θ̂_1^(ML) = −1.545. Similarly, the MLEs of the ability parameters for Students 2 and 3 are θ̂_2^(ML) = −0.303 and θ̂_3^(ML) = 1.545, respectively. The MLE returns a very sensible value for a student's ability.
Fig. 4.10 Log-likelihood with no local maximum
The MLE of the ability parameter of Student s is mathematically defined as

θ̂_s^(ML) = arg max_{θ_s ∈ (−∞, ∞)} ll(u_s|θ_s).
This equation means that, within the range −∞ < θ_s < ∞, the θ_s that maximizes the log-likelihood function ll(u_s|θ_s) is the MLE, θ̂_s^(ML). However, the MLE has a flaw. As shown in Table 4.4, Student 4 fails all items, and Student 5 passes all of them. The log-likelihood function for students who fail/pass all items does not have a (local) maximum. In Fig. 4.10, the log-likelihood function for Student 4 is monotonically decreasing and has no maximum; the function is maximal only as θ → −∞. However, it is inappropriate to report −∞ to the student as his or her ability value. Similarly, the log-likelihood of Student 5, who passed all items, monotonically increases, and the MLE is then obtained as ∞, which is inappropriate as well. Thus, if an estimate does not converge to a finite value, the estimate is said to be divergent.
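Continuing the previous sketch, the MLE can be found by numerical optimization (the bounds below are practical limits, since, as just noted, the MLE diverges for all-pass/all-fail patterns):

    from scipy.optimize import minimize_scalar

    # Maximize ll by minimizing its negative over a bounded theta range.
    res = minimize_scalar(lambda t: -log_likelihood(t, u1, z1, a, b),
                          bounds=(-6, 6), method="bounded")
    print(res.x)   # approx. -1.545 for Student 1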
Optimization Technique
There are many methods for finding the θ that gives the log-likelihood function its maximum value, such as the Newton–Raphson method, Fisher's scoring method, the quasi-Newton method, the steepest descent method, and the bisection method. In general, maximizing/minimizing a function to obtain the parameter(s) that gives the function its maximum/minimum is called optimization. Until about the year 2000, it was necessary for a psychometrician or analyst to acquire optimization techniques, but nowadays, if a function and its parameter(s) are properly specified, numerical software can easily execute an accurate optimization using an appropriate method. Therefore, it has become less important to learn optimization techniques, and the explanation of optimization is simplified in this book.
4.3.4 Maximum a Posteriori Estimate

As mentioned earlier, the MLE has a flaw: the MLE for students who passed/failed all items is not obtained as a real number. The maximum a posteriori (MAP) estimate overcomes this shortcoming. It is the value of θ that maximizes the posterior probability. The posterior probability of a parameter is integrated information about the parameter that combines current information (i.e., the likelihood) with prior information (i.e., the prior probability of the parameter); it is obtained by using Bayes' theorem (see p. 105). If the parameter is continuous, the prior and posterior probabilities are the prior and posterior probability density distributions (in short, prior and posterior densities). The posterior density of θ_s, pr(θ_s|u_s), is expressed as

pr(θ_s|u_s) = l(u_s|θ_s) pr(θ_s) / pr(u_s) ∝ l(u_s|θ_s) pr(θ_s),  (4.6)

where pr(θ_s) is the prior density of θ_s, for which the standard normal density f_N(θ_s; 0, 1) is often designated. In addition,

pr(u_s) = ∫_{−∞}^{∞} l(u_s|θ_s) pr(θ_s) dθ_s
is the marginal probability of u_s, which is a constant that does not contain an unknown parameter because θ_s is integrated out (marginalized). Rather than maximizing the posterior probability, it is easier to maximize its logarithm. The log-posterior density of Eq. (4.6) is given as
Fig. 4.11 Log-posterior of five students
ln pr(θ_s|u_s) = ll(u_s|θ_s) + ln f_N(θ_s; 0, 1) + const. = ll(u_s|θ_s) − θ_s²/2 + const.

The constant term (const.) is irrelevant in maximization. By comparing this with Eq. (4.4), we can see that the log-posterior density is the log-likelihood plus −θ_s²/2. The log-posterior densities for the five students in Table 4.4 are shown in Fig. 4.11. It can be seen that all the functions, even those for Students 4 and 5 who failed/passed all items, have a maximum. The MAP estimate of θ_s is the value that maximizes this log-posterior density. That is,

θ̂_s^(MAP) = arg max_{θ_s ∈ (−∞, ∞)} ln pr(θ_s|u_s).
The MAP estimate for Student 1 is obtained as θ̂_1^(MAP) = −1.170, while from Eq. (4.5), the MLE for the student was θ̂_1^(ML) = −1.545. This is the effect of the standard normal density given as the prior: the MAP estimate is shrunk toward 0 in comparison with the MLE. In addition, the MAP estimates for Students 4 and 5, whose MLEs diverge negatively and positively, respectively, are obtained as

θ̂_4^(MAP) = −2.191,  θ̂_5^(MAP) = 2.191.

The MAP estimate does not diverge, which is the benefit of using a prior. If one wants to soften the influence of the prior, a normal prior with a larger variance can be used. In general, as the number of items increases, the MAP estimate becomes closer to the MLE because the likelihood becomes more influential on the posterior density; that is, the influence of the prior density decreases.
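In code, the MAP needs only one change to the MLE sketch above: add the standard normal log-prior −θ²/2 to the log-likelihood before optimizing (constants are irrelevant):

    # Continuing the previous sketch (log_likelihood, u1, z1, a, b).
    res = minimize_scalar(
        lambda t: -(log_likelihood(t, u1, z1, a, b) - t ** 2 / 2),
        bounds=(-6, 6), method="bounded")
    print(res.x)   # approx. -1.170 for Student 1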
Bayes’ Theorem The famous Bayes’ theorem on conditional probability, which is the basis of Bayesian statistics, is defined as P(B|A) =
P(A|B)P(B) . P(A)
This theorem is based on the two conditional probabilities for events A and B as follows: P(A ∧ B) P(A) P(A ∧ B) P(A|B) = P(B) P(B|A) =
⇒ P(A ∧ B) = P(B|A)P(A), ⇒ P(A ∧ B) = P(A|B)P(B).
4.3.5 Expected a Posteriori Estimate

The expected a posteriori (EAP) estimate is currently the standard ability parameter estimate. The merits of using the EAP estimate over the MLE and MAP can be summarized under the following three points (e.g., Bock & Mislevy, 1982; de Ayala, 2008):

Advantages of the EAP Estimate
1. It is easy to calculate.
2. The prior on the EAP is less influential than that on the MAP.
3. Simulation studies show good recovery of the true value.

For Point 1, the EAP is the expected value of the posterior density of θ_s, pr(θ_s|u_s) in Eq. (4.6). The posterior densities for the five students in Table 4.4 are shown in Fig. 4.12. Each of these posterior densities is derived by synthesizing the information of the prior density and the likelihood, and those for Students 4 and 5, who failed/passed all items, are located on the far left/right, respectively. The peak location of the posterior density is the MAP estimate, while the expectation of the posterior density is the EAP estimate; thus, the EAP equals the MAP if the posterior is symmetric. The expectation is the mean of a random variable.²² Thus, the expectation of the posterior density of θ_s is obtained as follows:
²² For a continuous random variable x whose density is f(x), the expectation is ∫ x f(x) dx.
Fig. 4.12 Posterior densities of five students
Table 4.5 ML, MAP, and EAP estimates of a nine-item test

          | ML         | MAP    | EAP Score | PSD*1
Student 1 | −1.545     | −1.170 | −1.180    | 0.493
Student 2 | −0.303     | −0.232 | −0.233    | 0.483
Student 3 | 1.545      | 1.170  | 1.180     | 0.493
Student 4 | NaN*2 (−∞) | −2.191 | −2.251    | 0.561
Student 5 | NaN (∞)    | 2.191  | 2.251     | 0.561

*1 posterior standard deviation  *2 not a number

θ̂_s^(EAP) = ∫_{−∞}^{∞} θ_s pr(θ_s|u_s) dθ_s
           = ∫_{−∞}^{∞} θ_s {l(u_s|θ_s) pr(θ_s) / pr(u_s)} dθ_s
           = ∫_{−∞}^{∞} θ_s l(u_s|θ_s) pr(θ_s) dθ_s / ∫_{−∞}^{∞} l(u_s|θ_s) pr(θ_s) dθ_s.
The integration is sufficiently accurate if the quadrature points are placed, for example, from −4 to 4 at intervals of 0.1, and recent programs can properly perform numerical integration, so there is no problem in trusting the results without worrying about the detailed settings. Unlike the MLE and MAP, this calculation does not require optimization. This simplicity has the merit of shortening the computation time, which is highly advantageous because the sample size of test data sometimes exceeds 10,000. The EAP scores for the five students in Table 4.4 are shown in Table 4.5. The table also shows the MLEs and MAPs, and we can see that the MAP of each student is closer to 0 than the student's EAP. This is because each posterior density has a longer skirt (or thicker tail) on the side opposite the origin. For example, in Fig. 4.12,
the posterior densities of Students 3 and 5 are slightly positively skewed (the right skirt is longer). Conversely, the posterior densities of Students 1, 2, and 4 are slightly negatively skewed (the left skirt is longer). Therefore, the MAP, as the peak of the density, is closer to 0 than the EAP. This is the second merit of the EAP: the EAP scores are farther from 0 than the MAP scores. This means that the EAP scores are more dispersed, which is preferable because it preserves individual differences better. However, the ability estimates are not returned to the students as they are but are usually fed back after rescaling (e.g., a linear transformation²³). Thus, the second benefit is not a major one, because the ability ordering under the MAP scores is the same as under the EAP scores; the MAP scores simply shrink toward 0. The third advantage is the most crucial. Many simulation studies have shown that the EAP best reproduces the true value (e.g., Bock & Mislevy, 1982). Briefly, an ability parameter serving as the true value, denoted θ_s^(true), is first randomly generated from the standard normal density, and then J binary responses are generated, where the j-th binary response is drawn from a Bernoulli distribution with success probability P(θ_s^(true); λ_j).²⁴ After that, the MLE, MAP, and EAP are estimated from the generated J binary responses to check which estimate is closest to θ_s^(true). When this single validation was repeated 1000 or 10,000 times for various true values, it was found that the EAP best reproduced the true values on average. Thus, the EAP has become the standard.
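Continuing the same sketch, the EAP falls out of a simple quadrature, exactly as described above (grid from −4 to 4 in steps of 0.1, standard normal prior; the uniform grid spacing cancels in the ratio):

    import numpy as np

    grid = np.arange(-4.0, 4.01, 0.1)
    prior = np.exp(-grid ** 2 / 2)          # unnormalized N(0, 1) density
    like = np.array([np.exp(log_likelihood(t, u1, z1, a, b)) for t in grid])
    post = like * prior                     # unnormalized posterior
    eap = np.sum(grid * post) / np.sum(post)
    print(eap)                              # approx. -1.180 for Student 1 (Table 4.5)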
4.3.6 Posterior Standard Deviation

Table 4.5 also shows the posterior standard deviation (PSD), which is the SD of the posterior density. In Fig. 4.12, the SDs of the densities are the PSDs of the five students. The PSD of Student s is computed by²⁵

PSD_s = {∫_{−∞}^{∞} (θ_s − θ̂_s^(EAP))² pr(θ_s|u_s) dθ_s}^{1/2}
      = {∫_{−∞}^{∞} (θ_s − θ̂_s^(EAP))² l(u_s|θ_s) pr(θ_s)/pr(u_s) dθ_s}^{1/2}
      = {∫_{−∞}^{∞} (θ_s − θ̂_s^(EAP))² l(u_s|θ_s) pr(θ_s) dθ_s / ∫_{−∞}^{∞} l(u_s|θ_s) pr(θ_s) dθ_s}^{1/2}.
²³ For score x, a transformation such as ax + b.
²⁴ All item parameters are predetermined.
²⁵ x^{1/2} = √x. Generally, x^{1/a} is the a-th root of x.
4 Item Response Theory
The PSD can be understood as the estimation accuracy of an EAP score. The smaller the PSD, the more reliable the EAP estimate is. For example, the PSD of Student 1 is 0.493, which is generally considered large because there are only nine items. The EAP of the Student 1 is 0.493; this score is variable between θˆ1(EAP) − PSD1 ≤ θˆ1(EAP) ≤ θˆ1(EAP) + PSD1
∴
−1.673 ≤ θˆ1(EAP) ≤ −0.687.
In addition, as Fig. 4.12 shows, the posterior densities of Students 3 and 5 overlap because their PSDs are large. This tells us that it is difficult to assert that the ability of Student 5 is higher than that of Student 3 with this test with only nine items. In general, the larger the number of items, the smaller the PSD, because the ability parameter can be more accurately estimated as the number of items increases. However, even if a test consists of a large number of items but the items are low discriminating, PSDs does not become small. Point Estimate and Interval Estimate Like θˆ1 = −1.545 or θˆ2(MAP) = −0.232, an estimate reporting a student’s ability as a single point on the ability scale is called a point estimate of the ability parameter. In contrast, an estimate indicating a range, such as −1.673 ≤ θˆ1(EAP) ≤ −0.687, is called an interval estimate. It is important to return an interval estimate as well as a point estimate to the student for showing the accuracy (or limitation) of the test.
4.4 Information Function

The information function is a very useful tool when editing a test. The item information function (IIF) tells us where on the θ scale, and how strongly, an individual item discriminates. There is no highly discriminating item across the entire practical range θ ∈ (−3, 3): easy items can discriminate students with low ability but are ineffective for measuring students with high ability, while difficult items are invalid for students with low ability. Thus, along the θ scale, the region that is highly discriminated differs by item. In addition, the test information function (TIF) is the corresponding function for a set of items. The TIF is the simple sum of the IIFs and is explained first.
4.4.1 Test Information Function

Let the TIF and the IIF of Item j be denoted I(θ; Λ) and I(θ; λ_j), respectively; the TIF is then expressed as
Fig. 4.13 Test information function of 15-item test
I(θ; Λ) = Σ_{j=1}^J I(θ; λ_j).
Before the mathematical definition of the information function is described, its usage is first introduced. To begin with, Fig. 4.13 (left) shows the IIFs of the 15 items in Table 4.2 (p. 95). Given the item parameters, the shape of the IIF is determined. The right figure shows the TIF. Its features are as follows:

Summary of Test Information Function
1. It is used in editing a test.
2. "Information" stands for the inverse of the variance of the MLE θ̂.

The TIF can be used to evaluate where on the θ scale a test best discriminates. As per Point 1 listed in Summary of TIF, when a test is edited to target high-ability students, one may select items so that the TIF becomes high at θ ≥ 1. Figure 4.13 (right) shows that the TIF is large around θ = −1, which means that this test can be used to discriminate students with slightly lower ability. Also, as per Point 2 listed in Summary of TIF, the "information" indicates the inverse of the variance of the MLE θ̂.²⁶ That is, the relationship between the TIF and the MLE can be expressed as

Var[θ̂^(ML)] = 1 / I(θ̂^(ML); Λ)  ⇔  SE[θ̂^(ML)] = 1 / √I(θ̂^(ML); Λ).
The above two equations have the same meaning, where the standard error (SE) is the SD of the estimate. When the SE is small, this implies the estimate is obtained with good precision. In other words, the larger the TIF (the smaller the SE), the more 26
²⁶ For the details of the derivation, see Sect. 4.4.5 (p. 114).
Fig. 4.14 SE function of a 15-item test
accurately the MLE (θ̂^(ML)) is estimated. For example, in Fig. 4.13 (right), when θ = −1, the TIF is 2.89. That is,

I(θ = −1) = 2.89  ⇒  SE[θ = −1] = 1/√2.89 = 0.59.

This means that, for a student with estimated ability θ̂ = −1, the ability is estimated with a precision of −1 ± 0.59. Meanwhile, for a student with θ̂ = 2, the ability is obtained with a lower precision of 2 ± 1.16 because the TIF is small at θ̂ = 2, as follows:

I(θ = 2) = 0.74  ⇒  SE[θ = 2] = 1/√0.74 = 1.16.

Figure 4.14 shows the plot of the SE function, which is the inverse of the square root of the TIF:

SE(θ; Λ) = 1 / √I(θ; Λ).
We can see that this test discriminates well the students with ability in the range −2 ≤ θ ≤ 1. In general, it is recommended to edit a test so that the SE function falls below 0.4 (or, equivalently, the TIF exceeds 6.25) in the θ range the test is targeting. The number of items may need to be > 30 to achieve this, although this is unrealizable when the IIFs of all 30 items have a low peak.
Fig. 4.15 Item information function of 2PLM
4.4.2 Item Information Function of 2PLM

This section describes the item information function (IIF). As previously mentioned, the IIF of each item expresses where on the θ scale and how accurately the item discriminates. The IIF features are summarized as follows.

Features of Item Information Function
1. The IIF expresses the item's estimation accuracy on the θ scale.
2. The IIF is a function of θ, and its shape is unimodal.
3. The larger the slope parameter, the higher the mode of the IIF.
4. The larger the location parameter, the further to the right the IIF is located.
5. As the lower asymptote parameter approaches 0, the peak of the IIF becomes higher and moves slightly to the left.
6. As the upper asymptote parameter approaches 1, the peak of the IIF becomes higher and moves slightly to the right.
Regarding Point 1, in the CTT, test reliability is evaluated by the α (Sect. 3.4, p. 78) and ω (Sect. 3.5, p. 81) coefficients. However, these coefficients do not indicate each item's contribution to the overall reliability. In the IRT, by contrast, the IIF allows us to evaluate each item's accuracy in estimating θ.

Alpha If Item Deleted
"Alpha if item deleted" is the α obtained when an item is removed. When the index drops significantly, the item can be regarded as contributing substantially to the test reliability, but this is an indirect index; a direct index would indicate the contribution rate of the item to the entire test reliability. See Raykov (2007, 2008) for notes on editing a test using the "alpha if item deleted" method.
In addition, from Point 2, the IIF is a function of θ; thus, its value changes depending on the value of θ. Figure 4.15 shows the IIFs of the 2PLM. The IIF of Item 1 has a peak at θ = −1, which indicates that Item 1 best discriminates among students with θ = −1. The IIF of the 2PLM is defined as

$$ I(\theta;\lambda_j)=a_j^2 P(\theta;\lambda_j)Q(\theta;\lambda_j). \tag{4.7} $$

The derivation of this expression is explained in Sect. 4.4.5 (p. 114). In this equation, λ_j = [a_j b_j] represents the item parameters of the 2PLM, and Q(θ; λ_j) is the incorrect response function:²⁷

$$ Q(\theta;\lambda_j)=1-P(\theta;\lambda_j)=\frac{\exp\{-a_j(\theta-b_j)\}}{1+\exp\{-a_j(\theta-b_j)\}}=\frac{1}{1+\exp\{a_j(\theta-b_j)\}}. $$

As per Point 3 listed in Features of IIF (p. 111), and as shown in Fig. 4.15 and Eq. (4.7), the peak of the IIF becomes higher as the slope parameter increases. In addition, the IIF is placed further to the right as the location parameter increases, as indicated under Point 4; the IIF of the 2PLM has its mode at θ = b_j. For example, the IIFs of Items 1 and 2 peak at θ equal to −1 and 1, respectively, because their location parameters are b_1 = −1 and b_2 = 1, although the peak of Item 1 is higher than that of Item 2 because a_1 > a_2. The IIF of Item 3 is 0 regardless of θ because its slope parameter is 0.

Subject and Slope Parameter
An item is said to be more discriminating as its slope parameter is larger. How large should a slope parameter be? It is difficult to answer this question because it depends on what the test measures. For instance, math and science test items are inclined to have high discrimination, while history and language test items tend to be low on average. In addition, the slope parameters of multiple-choice items tend to be small.

²⁷ Note that x⁻ᵃ = 1/xᵃ.
4.4.3 Item Information Function of 3PLM

The IIF of the 3PLM is defined as
Fig. 4.16 Item information function of 3PLM
$$ I(\theta;\lambda_j)=\frac{a_j^2\,Q(\theta;\lambda_j)\{P(\theta;\lambda_j)-c_j\}^2}{(1-c_j)^2 P(\theta;\lambda_j)}, $$

where the incorrect response function is

$$ Q(\theta;\lambda_j)=1-P(\theta;\lambda_j)=\frac{1-c_j}{1+\exp\{a_j(\theta-b_j)\}}. $$
Note that the IIF of the 3PLM reduces to that of the 2PLM when c_j = 0. Figure 4.16 shows the IIFs of the 3PLM, where the slope parameters are 2 and the location parameters are 0 for all items. As shown in this figure, the shape of the IIF is not symmetric but positively skewed (i.e., the right tail is longer). In addition, as the lower asymptote parameter gets closer to 0, the peak of the IIF gets higher and moves slightly to the left (cf. Point 5 listed under Features of IIF). From Birnbaum (1968) and Kato et al. (2014), the peak is attained at

$$ \theta=b_j+a_j^{-1}\ln\Bigl\{0.5\bigl(1+\sqrt{1+8c_j}\bigr)\Bigr\}, $$

although it is hard to attach any substantive meaning to this value.
4.4.4 Item Information Function of the 4PLM

The IIF of the 4PLM (Liao et al., 2012) is given as

$$ I(\theta;\lambda_j)=\frac{a_j^2\{P(\theta;\lambda_j)-c_j\}\{d_j-P(\theta;\lambda_j)\}\bigl[P(\theta;\lambda_j)\{c_j+d_j-P(\theta;\lambda_j)\}-c_j d_j\bigr]}{(d_j-c_j)^2 P(\theta;\lambda_j)Q(\theta;\lambda_j)}, $$
where

$$ Q(\theta;\lambda_j)=1-P(\theta;\lambda_j)=\frac{1-c_j+(1-d_j)\exp\{a_j(\theta-b_j)\}}{1+\exp\{a_j(\theta-b_j)\}}. $$
Note that when d_j = 1, these two equations reduce to the IIF and incorrect response function of the 3PLM, respectively. Figure 4.17 shows the IIFs of the 4PLM where, contrary to the IIFs of the 3PLM, the larger the value of the upper asymptote parameter, the higher the peak of the IIF (cf. Point 6 of Features of IIF). In addition, the peak moves slightly to the right as the parameter approaches 1. Although the θ giving the peak has not been specified in closed form, it would be difficult to find a substantive meaning for it, as for the 3PLM. Note also that when d_j = 1 − c_j, the IIF becomes symmetric.

Fig. 4.17 Item information function of the 4PLM
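To make the formula concrete, here is a small Python sketch (all parameter values hypothetical) that implements the 4PLM IIF above; since the 3PLM (d_j = 1) and 2PLM (additionally c_j = 0) are special cases, the same function covers all three. It also illustrates Point 5: raising the lower asymptote lowers the peak and shifts it to the right.

```python
import numpy as np

def irf_4plm(theta, a, b, c=0.0, d=1.0):
    """IRF of the 4PLM; c=0 and d=1 give the 2PLM, d=1 alone the 3PLM."""
    return c + (d - c) / (1.0 + np.exp(-a * (theta - b)))

def iif_4plm(theta, a, b, c=0.0, d=1.0):
    """IIF of the 4PLM as defined above (reduces to the 3PLM/2PLM cases)."""
    p = irf_4plm(theta, a, b, c, d)
    q = 1.0 - p
    num = a**2 * (p - c) * (d - p) * (p * (c + d - p) - c * d)
    return num / ((d - c)**2 * p * q)

theta = np.linspace(-4, 4, 161)
# Same slope/location, increasing lower asymptote: peak drops and shifts right
for c in (0.0, 0.1, 0.2, 0.3):
    iif = iif_4plm(theta, a=2.0, b=0.0, c=c)
    print(f"c={c:.1f}: max IIF={iif.max():.3f} at theta={theta[np.argmax(iif)]:+.2f}")
```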
4.4.5 Derivation of TIF

This section is a bit mathematically advanced; it describes the meaning of "information," particularly the Fisher information, which represents the estimation accuracy of the MLE of the ability parameter θ (cf. Point 1 of Features of IIF, p. 111). More specifically, the Fisher information is the negative expectation of the second derivative of the log-likelihood function; simply put, its inverse is the variance of the MLE. The TIF is the application of this Fisher information to the IRT. Let us look again at the log-likelihood function for estimating the MLE of the ability parameter of Student s (θ_s):

$$ ll(\boldsymbol{u}_s|\theta_s)=\sum_{j=1}^{J}\bigl\{u_{sj}\ln P_j(\theta_s)+(1-u_{sj})\ln Q_j(\theta_s)\bigr\} \qquad \text{(see (4.4))} $$
In this equation, suppose that the number of presented items is J and that missing data irrelevant to the ability parameter estimation have already been omitted. Assume also that all the item parameters are known, in which case P(θ_s; λ_j) and Q(θ_s; λ_j) are denoted P_j(θ_s) and Q_j(θ_s), respectively, for simplicity. What follows is a general explanation that does not depend on a specific model such as the 2PLM, 3PLM, or 4PLM. The second derivative of the function with respect to θ_s is

$$ \frac{d^2 ll(\boldsymbol{u}_s|\theta_s)}{d\theta_s^2}=\sum_{j=1}^{J}\Bigl\{u_{sj}\frac{d^2\ln P_j(\theta_s)}{d\theta_s^2}+(1-u_{sj})\frac{d^2\ln Q_j(\theta_s)}{d\theta_s^2}\Bigr\}, \tag{4.8} $$

where, under the 4PLM,

$$ \frac{d^2\ln P_j(\theta_s)}{d\theta_s^2}=-\frac{a_j^2\{P_j(\theta_s)-c_j\}\{d_j-P_j(\theta_s)\}\{P_j(\theta_s)^2-c_j d_j\}}{(d_j-c_j)^2 P_j(\theta_s)^2}, \tag{4.9} $$

$$ \frac{d^2\ln Q_j(\theta_s)}{d\theta_s^2}=-\frac{a_j^2\{P_j(\theta_s)-c_j\}\{d_j-P_j(\theta_s)\}\{Q_j(\theta_s)^2-(1-c_j)(1-d_j)\}}{(d_j-c_j)^2 Q_j(\theta_s)^2}. \tag{4.10} $$

These reduce to the 3PLM results by assuming d_j = 1 and to those of the 2PLM by further assuming c_j = 0. The TIF is then obtained by taking the negative expectation of Eq. (4.8) with respect to the responses of Student s as follows:
$$ I(\theta_s):=-E\Bigl[\frac{d^2 ll(\boldsymbol{u}_s|\theta_s)}{d\theta_s^2}\Bigr]
=-\sum_{j=1}^{J}\Bigl\{E[u_{sj}|\theta_s]\frac{d^2\ln P_j(\theta_s)}{d\theta_s^2}+E[1-u_{sj}|\theta_s]\frac{d^2\ln Q_j(\theta_s)}{d\theta_s^2}\Bigr\}
=\sum_{j=1}^{J}\Bigl\{-P_j(\theta_s)\frac{d^2\ln P_j(\theta_s)}{d\theta_s^2}-Q_j(\theta_s)\frac{d^2\ln Q_j(\theta_s)}{d\theta_s^2}\Bigr\}, $$

where the term corresponding to Item j,

$$ I(\theta_s;\lambda_j)=I_j(\theta_s)=-P_j(\theta_s)\frac{d^2\ln P_j(\theta_s)}{d\theta_s^2}-Q_j(\theta_s)\frac{d^2\ln Q_j(\theta_s)}{d\theta_s^2}, $$

is the IIF of Item j. Substituting Eqs. (4.9) and (4.10) in this equation, the IIF of the 4PLM is obtained. This derivation is for Student s, but it holds for anyone. Thus, removing subscript s, I_j(θ) is the IIF of Item j, and the sum of the IIFs, I(θ) = Σ_j I_j(θ), is the TIF.
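The definition of information as a negative expected second derivative can be verified numerically. The sketch below, for a hypothetical 2PLM item, approximates −E[d²ll/dθ²] by a finite difference of the expected log-likelihood and compares it with a²PQ from Eq. (4.7).

```python
import numpy as np

def irf(theta, a=1.2, b=0.3):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def expected_ll(theta, a, b, theta0):
    """E[ll(u|theta)] over responses u ~ Bernoulli(P(theta0))."""
    p0 = irf(theta0, a, b)          # E[u | theta0]
    p = irf(theta, a, b)
    return p0 * np.log(p) + (1 - p0) * np.log(1 - p)

theta, h, a, b = 0.5, 1e-4, 1.2, 0.3
# Negative second difference of the expected log-likelihood at theta
info = -(expected_ll(theta + h, a, b, theta) - 2 * expected_ll(theta, a, b, theta)
         + expected_ll(theta - h, a, b, theta)) / h**2
p = irf(theta, a, b)
print(info, a**2 * p * (1 - p))   # the two values agree closely (about 0.355)
```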
TIF Versus PSD
Because the TIF represents the accuracy of the MLE of θ, the two should be reported together. In reality, as mentioned in Sect. 4.3.5, the EAP is the current standard ability estimate; thus, the PSD, which represents the EAP accuracy, should be reported with the EAP instead of the TIF. However, the PSD cannot be decomposed into the contributions of individual items, while the TIF is the composition (i.e., sum) of the IIFs. The TIF is thus useful when editing a test because it can easily display the accuracy of the entire test after adding or removing items. Therefore, the TIF is monitored when a test is edited, although the EAP and PSD are the standards fed back to each student. Few test administrators find this to be a contradiction, because the EAP and MLE become more similar as the number of items increases.
4.5 Item Parameter Estimation

This section describes how to estimate the item parameters (Λ) from the item response data (U). At this stage, the item parameters are unknown, as are the ability parameters of the S students (θ). When both the item and ability parameters are unknown, the item parameters are estimated first, followed by the ability parameters. The current standard for estimating the item parameters is marginal maximum likelihood estimation with the EM algorithm (MML-EM; Bock & Aitkin, 1981; Tsutakawa, 1984; Tsutakawa & Lin, 1986).²⁸ This section describes the MML-EM. The expectation-maximization (EM) algorithm (Dempster et al., 1977) explained in this chapter is a bit complicated; some readers may find it easier to first study this algorithm in latent class analysis (Chap. 5, p. 155) or latent rank analysis (Chap. 6, p. 191).
4.5.1 EM Algorithm

Parameters are said to be structural if their number is unaffected by the sample size, and nuisance otherwise. In the IRT, the item parameters (Λ) are structural parameters, and the ability parameters (θ) are nuisance parameters. It is known that estimating the structural and nuisance parameters at the same time biases the estimates. Therefore, when the item parameters (i.e., the structural parameters) are estimated, the ability parameters (i.e., the nuisance parameters) must be eliminated (marginalized out). This is done by using the EM algorithm, summarized in the following list.

²⁸ For a method using the Gibbs sampler, see Albert (1992) and Shigemasu and Nakamura (1996).
Summary of the EM Algorithm
1. It estimates the MAPs of the item parameters (Λ) after the ability parameters (θ) are integrated out (marginalized).
2. It repeats a cycle of E (expectation) and M (maximization) steps.
3. In each E step, the expected log-posterior density is constructed by marginalizing the ability parameters.
4. In each M step, the MAPs of the item parameters are obtained for each individual item by maximizing the expected log-posterior density.
4.5.2 Expected Log-Posterior Density

The EM algorithm repeats a cycle of E and M steps (cf. Point 2 of Summary of the EM Algorithm), and this section describes the E step at the t-th cycle (Point 3). At this cycle, we have a tentative set of item parameters Λ^(t−1) obtained at the (t−1)-th M step. For the first E step, the tentative set Λ^(0) consists of the initial values of the item parameters. For example, they are given by

$$ a_j^{(0)}=2\rho_{j,\zeta},\qquad b_j^{(0)}=2\tau_j,\qquad c_j^{(0)}=0.05,\qquad d_j^{(0)}=0.95, $$

where ρ_{j,ζ} and τ_j are the ITC (Sect. 2.6.3, p. 50) and the item threshold (Sect. 2.4.3, p. 31), respectively, of Item j. The purpose of the E step is to construct the expected log-posterior (ELP) density. First, given the data matrix U = {u_sj} (S × J), the logarithm of the posterior density of the item parameters, ln pr(Λ|U), is interpreted as the log-posterior after taking the expectation with respect to the ability parameters (θ) as follows:²⁹

$$ \ln pr(\Lambda|U)=E_{\boldsymbol{\theta}|U,\Lambda^{(t-1)}}[\ln pr(\Lambda|\boldsymbol{\theta},U)]=\int_{-\infty}^{\infty}\ln pr(\Lambda|\boldsymbol{\theta},U)\,pr(\boldsymbol{\theta}|U,\Lambda^{(t-1)})\,d\boldsymbol{\theta}. \tag{4.11} $$

²⁹ This is in fact a multiple (S-fold) integral:
$$ \int_{-\infty}^{\infty} f(\boldsymbol{\theta})\,d\boldsymbol{\theta}=\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} f(\boldsymbol{\theta})\,d\theta_1\,d\theta_2\ldots d\theta_S. $$
In this equation, the expectation of the log-posterior of the item parameters is taken over the density of θ, more specifically, the posterior density of θ conditional on the (t−1)-th MAP of the item parameters Λ^(t−1):

$$ pr(\boldsymbol{\theta}|U,\Lambda^{(t-1)})=\prod_{s=1}^{S} pr(\theta_s|\boldsymbol{u}_s,\Lambda^{(t-1)})=\prod_{s=1}^{S}\frac{l(\boldsymbol{u}_s|\theta_s,\Lambda^{(t-1)})\,pr(\theta_s)}{\int_{-\infty}^{\infty} l(\boldsymbol{u}_s|\theta_s,\Lambda^{(t-1)})\,pr(\theta_s)\,d\theta_s}. \tag{4.12} $$
As the ability parameters are independent of each other, the posterior density of θ factors into the product of the posterior densities of θ_1, …, θ_S. In addition, pr(θ_s) is the prior density of θ_s, which is usually the standard normal density f_N(θ_s; 0, 1). Furthermore,

$$ l(\boldsymbol{u}_s|\theta_s,\Lambda^{(t-1)})=\prod_{j=1}^{J}\bigl\{P(\theta_s;\lambda_j^{(t-1)})^{u_{sj}}\,Q(\theta_s;\lambda_j^{(t-1)})^{1-u_{sj}}\bigr\}^{z_{sj}} $$
is the likelihood of u_s given Λ^(t−1). Meanwhile, from Bayes' theorem, the posterior density of the item parameters inside Eq. (4.11) is expressed as

$$ pr(\Lambda|\boldsymbol{\theta},U)=\frac{pr(U|\boldsymbol{\theta},\Lambda)\,pr(\Lambda|\boldsymbol{\theta})}{pr(U|\boldsymbol{\theta})}=\frac{l(U|\boldsymbol{\theta},\Lambda)\,pr(\Lambda)}{pr(U|\boldsymbol{\theta})}, \tag{4.13} $$
where pr(U|θ, Λ) is the likelihood, which may be rewritten as l(U|θ, Λ). From the local independence assumption (Sect. 4.3.1, p. 96), this likelihood may be expressed as

$$ l(U|\boldsymbol{\theta},\Lambda)=\prod_{s=1}^{S} l(\boldsymbol{u}_s|\theta_s,\Lambda)=\prod_{s=1}^{S}\prod_{j=1}^{J}\bigl\{P(\theta_s;\lambda_j)^{u_{sj}}\,Q(\theta_s;\lambda_j)^{1-u_{sj}}\bigr\}^{z_{sj}}. $$
In addition, because the item parameters (Λ) and ability parameters (θ) are mutually disjoint, pr(Λ|θ) = pr(Λ) holds, and it can be regarded as the prior density of the item parameters. The logarithm of Eq. (4.13) is then given as

$$ \ln pr(\Lambda|\boldsymbol{\theta},U)=\ln l(U|\boldsymbol{\theta},\Lambda)+\ln pr(\Lambda)-\ln pr(U|\boldsymbol{\theta})=ll(U|\boldsymbol{\theta},\Lambda)+\ln pr(\Lambda)-\ln pr(U|\boldsymbol{\theta}). \tag{4.14} $$

The first term is the log-likelihood. That is,

$$ ll(U|\boldsymbol{\theta},\Lambda)=\sum_{j=1}^{J} ll(\boldsymbol{u}_j|\boldsymbol{\theta},\lambda_j)=\sum_{j=1}^{J}\sum_{s=1}^{S} z_{sj}\bigl\{u_{sj}\ln P(\theta_s;\lambda_j)+(1-u_{sj})\ln Q(\theta_s;\lambda_j)\bigr\}. $$
Then, substituting Eq. (4.14) in Eq. (4.11), we have

$$ \ln pr(\Lambda|U)=\int_{-\infty}^{\infty}\bigl\{ll(U|\boldsymbol{\theta},\Lambda)+\ln pr(\Lambda)-\ln pr(U|\boldsymbol{\theta})\bigr\}\,pr(\boldsymbol{\theta}|U,\Lambda^{(t-1)})\,d\boldsymbol{\theta}
=\int_{-\infty}^{\infty} ll(U|\boldsymbol{\theta},\Lambda)\,pr(\boldsymbol{\theta}|U,\Lambda^{(t-1)})\,d\boldsymbol{\theta}+\ln pr(\Lambda)+\text{const.} \tag{4.15} $$
This is the ELP at the t-th E step, in which the only unknown parameter set is Λ because θ is marginalized out. The Λ that maximizes this ELP at the t-th M step is the t-th MAP of the item parameters. In practice, as indicated by Point 4 listed in Summary of the EM Algorithm (p. 117), the ELP can be maximized for each individual item. Because the parameters of an item are disjoint from those of all other items, Eq. (4.15) is rewritten as³⁰

$$ \ln pr(\Lambda|U)=\int_{-\infty}^{\infty}\sum_{j=1}^{J} ll(\boldsymbol{u}_j|\boldsymbol{\theta},\lambda_j)\,pr(\boldsymbol{\theta}|U,\Lambda^{(t-1)})\,d\boldsymbol{\theta}+\sum_{j=1}^{J}\ln pr(\lambda_j)+\text{const.}
=\sum_{j=1}^{J}\Bigl\{\int_{-\infty}^{\infty} ll(\boldsymbol{u}_j|\boldsymbol{\theta},\lambda_j)\,pr(\boldsymbol{\theta}|U,\Lambda^{(t-1)})\,d\boldsymbol{\theta}+\ln pr(\lambda_j)\Bigr\}+\text{const.}
=\sum_{j=1}^{J}\ln pr(\lambda_j|\boldsymbol{u}_j)+\text{const.} $$

In other words, if the MAP of λ_j that maximizes the ELP of Item j, ln pr(λ_j|u_j), is found, and this is done for every j, then the MAP of Λ that maximizes the entire ELP for all items, ln pr(Λ|U), is obtained.
³⁰ The order of integration and summation is interchangeable.

4.5.3 Prior Density of Item Parameter

The prior for the item parameters is included in the ELP. This section introduces some priors for the item parameters.

Fig. 4.18 Prior density of the item parameter

As the prior for the slope parameter, the log-normal distribution (Fig. 4.18, left) is frequently used. The PDF of this distribution is defined as

$$ pr(a_j;\mu_a,\sigma_a)=\frac{1}{\sqrt{2\pi}\,\sigma_a a_j}\exp\Bigl\{-\frac{1}{2}\Bigl(\frac{\ln a_j-\mu_a}{\sigma_a}\Bigr)^2\Bigr\}. $$

The domain of this distribution is (0, ∞); therefore, the domain of the posterior of the slope parameter is constrained to (0, ∞). This is not particularly a problem, as the slope parameter is normally positive. The log-normal density has two parameters, μ_a and σ_a, and its mode is at a_j = e^{μ_a−σ_a²}. A density whose mode is around a_j = 1.0 is a good choice. The logarithm of the log-normal density is given as

$$ \ln pr(a_j;\mu_a,\sigma_a)=-\frac{1}{2}\Bigl(\frac{\ln a_j-\mu_a}{\sigma_a}\Bigr)^2-\ln\sigma_a-\ln a_j+\text{const.}, $$
which is required in maximizing the log-posterior.

The normal density is usually used as the prior for the location parameter. That is,

$$ f_N(b_j;\mu_b,\sigma_b)=\frac{1}{\sqrt{2\pi}\,\sigma_b}\exp\Bigl\{-\frac{1}{2}\Bigl(\frac{b_j-\mu_b}{\sigma_b}\Bigr)^2\Bigr\},\qquad
\ln f_N(b_j;\mu_b,\sigma_b)=-\frac{1}{2}\Bigl(\frac{b_j-\mu_b}{\sigma_b}\Bigr)^2-\ln\sigma_b+\text{const.} $$

Note that it is not the standard normal density f_N(b_j; 0, 1) but a normal density with a slightly larger SD, such as f_N(b_j; 0, 2), that is often used. Moreover, for the priors of the lower and upper asymptote parameters (c_j and d_j), the beta distribution is employed. The PDF (Fig. 4.18, right) and its logarithm are given as

$$ pr(x;\beta_0,\beta_1)=\frac{x^{\beta_1-1}(1-x)^{\beta_0-1}}{B(\beta_0,\beta_1)},\qquad
\ln pr(x;\beta_0,\beta_1)=(\beta_1-1)\ln x+(\beta_0-1)\ln(1-x)+\text{const.} $$

The domain of the distribution is [0, 1], which is suitable since the range of c_j and d_j is [0, 1]. B(·, ·) in the above equation is the beta function, defined as
$$ B(\beta_0,\beta_1)=\int_0^1 t^{\beta_1-1}(1-t)^{\beta_0-1}\,dt. $$

When β_0 and β_1 are positive integers,³¹

$$ B(\beta_0,\beta_1)=\frac{(\beta_0-1)!\,(\beta_1-1)!}{(\beta_0+\beta_1-1)!},\qquad
\ln B(\beta_0,\beta_1)=\sum_{i=1}^{\beta_0-1}\ln i+\sum_{i=1}^{\beta_1-1}\ln i-\sum_{i=1}^{\beta_0+\beta_1-1}\ln i. $$

In particular, when these two parameters are > 1, the mode of the density is (β_1 − 1)/(β_0 + β_1 − 2). Thus, for the prior of c_j, it is recommended to set β_0 > β_1 so that the mode is around c_j = 0.2. Conversely, for the prior of d_j, β_0 < β_1 is recommended in order to place the mode around d_j = 0.9.

Prior Densities of the IRT Models
2PLM: pr(a_j, b_j) = pr(a_j; μ_a, σ_a) f_N(b_j; μ_b, σ_b)
3PLM: pr(a_j, b_j, c_j) = pr(a_j; μ_a, σ_a) f_N(b_j; μ_b, σ_b) pr(c_j; β_c0, β_c1)
4PLM: pr(a_j, b_j, c_j, d_j) = pr(a_j; μ_a, σ_a) f_N(b_j; μ_b, σ_b) pr(c_j; β_c0, β_c1) pr(d_j; β_d0, β_d1)

The priors of the 2PLM, 3PLM, and 4PLM are listed above. The parameter(s) in a prior density are called the hyperparameter(s); in this book, they are treated as known constants.³² Generally, the hyperparameters are set the same for all items.

³¹ n! = n × (n − 1) × ⋯ × 2 × 1 = ∏_{i=1}^{n} i.
³² The hyperparameters can be treated as unknown. Such an analysis is called hierarchical Bayesian estimation (e.g., Mislevy, 1986).
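The three prior families can be evaluated with scipy.stats; the hyperparameters below are hypothetical choices consistent with the recommendations above (slope-prior mode near 1, c-prior mode 0.2, d-prior mode 0.9). Note that scipy's beta(a, b) corresponds to pr(x; β_0 = b, β_1 = a) in the notation here.

```python
import numpy as np
from scipy import stats

# Hypothetical hyperparameters chosen to match the recommendations above
prior_a = stats.lognorm(s=0.5, scale=np.exp(0.0))  # log-normal, mu_a=0.0, sigma_a=0.5
prior_b = stats.norm(loc=0.0, scale=2.0)           # f_N(b_j; 0, 2)
prior_c = stats.beta(a=2, b=5)                     # beta_1=2, beta_0=5: mode = 0.2
prior_d = stats.beta(a=10, b=2)                    # beta_1=10, beta_0=2: mode = 0.9

print(np.exp(0.0 - 0.5**2))     # mode of the slope prior, exp(mu - sigma^2) ~ 0.78
print((2 - 1) / (2 + 5 - 2))    # mode of the c_j prior = 0.2
log_prior = (prior_a.logpdf(1.0) + prior_b.logpdf(-0.8)
             + prior_c.logpdf(0.2) + prior_d.logpdf(0.9))
print(log_prior)                # joint log-prior of one 4PLM item's parameters
```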
4.5.4 Expected Log-Likelihood

The first term in Eq. (4.15) is the expectation of the log-likelihood; thus, this part is called the expected log-likelihood (ELL). In other words, the ELP constructed in each E step can be written as

Expected log-posterior = Expected log-likelihood + Log-prior of Λ + const.

When no prior density is used, or when the prior is noninformative (e.g., a uniform density), the ELP is in effect equal to the ELL. In this section, the ELL is
explained in more detail. The ELL is rewritten as

$$ ell(U|\Lambda)=\int_{-\infty}^{\infty} ll(U|\boldsymbol{\theta},\Lambda)\,pr(\boldsymbol{\theta}|U,\Lambda^{(t-1)})\,d\boldsymbol{\theta}
=\int_{-\infty}^{\infty}\sum_{j=1}^{J} ll(\boldsymbol{u}_j|\boldsymbol{\theta},\lambda_j)\,pr(\boldsymbol{\theta}|U,\Lambda^{(t-1)})\,d\boldsymbol{\theta}
=\sum_{j=1}^{J}\int_{-\infty}^{\infty} ll(\boldsymbol{u}_j|\boldsymbol{\theta},\lambda_j)\,pr(\boldsymbol{\theta}|U,\Lambda^{(t-1)})\,d\boldsymbol{\theta}
=\sum_{j=1}^{J} ell(\boldsymbol{u}_j|\lambda_j). $$
It is clear from this equation that the ELL is simply the sum of the ELLs of the items; thus, it can be maximized item by item. Focusing on the ELL of Item j, it is expressed as

$$ ell(\boldsymbol{u}_j|\lambda_j)=\int_{-\infty}^{\infty} ll(\boldsymbol{u}_j|\boldsymbol{\theta},\lambda_j)\,pr(\boldsymbol{\theta}|U,\Lambda^{(t-1)})\,d\boldsymbol{\theta}
=\int_{-\infty}^{\infty}\sum_{s=1}^{S} ll(u_{sj}|\theta_s,\lambda_j)\,pr(\boldsymbol{\theta}|U,\Lambda^{(t-1)})\,d\boldsymbol{\theta}
=\sum_{s=1}^{S}\int_{-\infty}^{\infty} ll(u_{sj}|\theta_s,\lambda_j)\,pr(\boldsymbol{\theta}|U,\Lambda^{(t-1)})\,d\boldsymbol{\theta}, \tag{4.16} $$

where

$$ ll(u_{sj}|\theta_s,\lambda_j)=z_{sj}\bigl\{u_{sj}\ln P(\theta_s;\lambda_j)+(1-u_{sj})\ln Q(\theta_s;\lambda_j)\bigr\} \tag{4.17} $$
is the log-likelihood of response u_sj. In addition, as shown in Eq. (4.12), the posterior density of θ in Eq. (4.16) is the product of the posterior ability densities of all students. Furthermore, the integral in the equation in fact denotes a multiple integral (see Footnote 29, p. 117). Focusing on Student s, and without simplifying the multiple integral, Eq. (4.16) may be expressed as
$$ \int_{-\infty}^{\infty} ll(u_{sj}|\theta_s,\lambda_j)\,pr(\boldsymbol{\theta}|U,\Lambda^{(t-1)})\,d\boldsymbol{\theta}
=\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} ll(u_{sj}|\theta_s,\lambda_j)\prod_{s'=1}^{S} pr(\theta_{s'}|\boldsymbol{u}_{s'},\Lambda^{(t-1)})\,d\theta_1\ldots d\theta_S $$
$$ =\int_{-\infty}^{\infty} pr(\theta_1|\boldsymbol{u}_1,\Lambda^{(t-1)})\,d\theta_1\times\int_{-\infty}^{\infty} pr(\theta_2|\boldsymbol{u}_2,\Lambda^{(t-1)})\,d\theta_2\times\cdots
\times\int_{-\infty}^{\infty} ll(u_{sj}|\theta_s,\lambda_j)\,pr(\theta_s|\boldsymbol{u}_s,\Lambda^{(t-1)})\,d\theta_s\times\cdots\times\int_{-\infty}^{\infty} pr(\theta_S|\boldsymbol{u}_S,\Lambda^{(t-1)})\,d\theta_S $$
$$ =1\times 1\times\cdots\times\int_{-\infty}^{\infty} ll(u_{sj}|\theta_s,\lambda_j)\,pr(\theta_s|\boldsymbol{u}_s,\Lambda^{(t-1)})\,d\theta_s\times\cdots\times 1
=\int_{-\infty}^{\infty} ll(u_{sj}|\theta_s,\lambda_j)\,pr(\theta_s|\boldsymbol{u}_s,\Lambda^{(t-1)})\,d\theta_s. $$
The integral (area) of the posterior θ density of each student is 1; thus, all the factors other than the one related to Student s cancel. The ELL of Item j, Eq. (4.16), is then simplified as

$$ ell(\boldsymbol{u}_j|\lambda_j)=\sum_{s=1}^{S}\int_{-\infty}^{\infty} ll(u_{sj}|\theta_s,\lambda_j)\,pr(\theta_s|\boldsymbol{u}_s,\Lambda^{(t-1)})\,d\theta_s. \tag{4.18} $$
4.5.5 Quadrature Approximation

The ELL includes an integral (i.e., an area), which is usually approximated by a sum of Q rectangles, where Q is the number of quadrature points. Computer programs normally prioritize accuracy over speed; thus, a very large number of quadrature points may be set by default to approximate the integral precisely, which poses a large computational burden, especially when the sample size S is large. For this reason, rather than relying on the program's default setting, the integral is usually approximated with a reduced number of quadrature points. Consider Q quadrature points {θ_1, θ_2, …, θ_Q} on the ability scale; the approximation is sufficiently accurate if Q ≥ 12. For instance, when Q = 17 and the points are placed from −3.2 to 3.2 at increments of 0.4, the quadrature points are
$$ \{\theta_1,\theta_2,\ldots,\theta_{17}\}=\{-3.2,-2.8,\ldots,3.2\}. $$

Then, Eq. (4.18) is approximated as

$$ ell(\boldsymbol{u}_j|\lambda_j)\approx\sum_{s=1}^{S}\sum_{q=1}^{Q} ll(u_{sj}|\theta_q,\lambda_j)\,pr(\theta_q|\boldsymbol{u}_s,\Lambda^{(t-1)})
=\sum_{s=1}^{S}\sum_{q=1}^{Q} ll(u_{sj}|\theta_q,\lambda_j)\,w_{sq}^{(t)}, \tag{4.19} $$

where

$$ \boldsymbol{w}_s^{(t)}=\begin{bmatrix} w_{s1}^{(t)}\\ \vdots\\ w_{sQ}^{(t)} \end{bmatrix}
=\begin{bmatrix} pr(\theta_1|\boldsymbol{u}_s,\Lambda^{(t-1)})\big/\sum_{q=1}^{Q} pr(\theta_q|\boldsymbol{u}_s,\Lambda^{(t-1)})\\ \vdots\\ pr(\theta_Q|\boldsymbol{u}_s,\Lambda^{(t-1)})\big/\sum_{q=1}^{Q} pr(\theta_q|\boldsymbol{u}_s,\Lambda^{(t-1)}) \end{bmatrix} \quad (Q\times 1) $$

is the approximated posterior density of θ_s at the t-th cycle. This vector is normalized so that the sum of its elements equals 1 (1_Q' w_s^(t) = 1). The discretely approximated posterior densities of all students are collected as

$$ W^{(t)}=\begin{bmatrix} \boldsymbol{w}_1^{(t)\prime}\\ \vdots\\ \boldsymbol{w}_S^{(t)\prime} \end{bmatrix}
=\begin{bmatrix} w_{11}^{(t)} & \cdots & w_{1Q}^{(t)}\\ \vdots & \ddots & \vdots\\ w_{S1}^{(t)} & \cdots & w_{SQ}^{(t)} \end{bmatrix} \quad (S\times Q). $$
The s-th row contains the discretely approximated posterior density of θ_s at the t-th cycle. From W^(t) and Eq. (4.17), Eq. (4.19) is rewritten as

$$ ell(\boldsymbol{u}_j|\lambda_j)\approx\sum_{s=1}^{S}\sum_{q=1}^{Q} ll(u_{sj}|\theta_q,\lambda_j)\,w_{sq}^{(t)}
=\sum_{s=1}^{S}\sum_{q=1}^{Q} z_{sj}\bigl\{u_{sj}\ln P(\theta_q;\lambda_j)+(1-u_{sj})\ln Q(\theta_q;\lambda_j)\bigr\}w_{sq}^{(t)} \tag{4.20} $$
$$ =\sum_{q=1}^{Q}\Bigl\{\Bigl(\sum_{s=1}^{S} w_{sq}^{(t)} z_{sj}u_{sj}\Bigr)\ln P(\theta_q;\lambda_j)+\Bigl(\sum_{s=1}^{S} w_{sq}^{(t)} z_{sj}(1-u_{sj})\Bigr)\ln Q(\theta_q;\lambda_j)\Bigr\}
=\sum_{q=1}^{Q}\bigl\{U_{1jq}^{(t)}\ln P(\theta_q;\lambda_j)+U_{0jq}^{(t)}\ln Q(\theta_q;\lambda_j)\bigr\}, \tag{4.21} $$
where

$$ U_{1jq}^{(t)}=(\boldsymbol{z}_j\circ\boldsymbol{u}_j)'\boldsymbol{w}_q^{(t)}=\sum_{s=1}^{S} w_{sq}^{(t)} z_{sj}u_{sj},\qquad
U_{0jq}^{(t)}=\{\boldsymbol{z}_j\circ(\boldsymbol{1}_S-\boldsymbol{u}_j)\}'\boldsymbol{w}_q^{(t)}=\sum_{s=1}^{S} w_{sq}^{(t)} z_{sj}(1-u_{sj}). $$

In addition, w_q^(t) is the q-th column vector of W^(t). All J × Q elements of U_{1jq}^(t) and U_{0jq}^(t) are obtained by the following matrix operations:

$$ U_1^{(t)}=\{U_{1jq}^{(t)}\}=(Z\circ U)'W^{(t)} \quad (J\times Q),\qquad
U_0^{(t)}=\{U_{0jq}^{(t)}\}=\{Z\circ(\boldsymbol{1}_S\boldsymbol{1}_J'-U)\}'W^{(t)} \quad (J\times Q). $$
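The E-step bookkeeping maps directly onto array operations. The following Python sketch, with simulated 2PLM data (all parameter values hypothetical), computes the discretized posterior W^(t) and the matrices U_1^(t) and U_0^(t) exactly as in the matrix expressions above.

```python
import numpy as np

rng = np.random.default_rng(0)
S, J, Q = 500, 15, 17
theta_q = np.linspace(-3.2, 3.2, Q)                     # quadrature points
a, b = rng.uniform(0.5, 1.5, J), rng.uniform(-2, 1, J)  # hypothetical 2PLM parameters

# Simulated complete data (Z = 1 everywhere) for illustration
theta_true = rng.normal(0, 1, S)
P = lambda th: 1 / (1 + np.exp(-a * (th[:, None] - b)))
U = (rng.uniform(size=(S, J)) < P(theta_true)).astype(float)
Z = np.ones_like(U)

# E step: log-likelihood of each student at each node, then posterior weights W
Pq = P(theta_q)                                          # (Q, J): IRF at each node
logL = (Z * U) @ np.log(Pq).T + (Z * (1 - U)) @ np.log(1 - Pq).T   # (S, Q)
post = np.exp(logL) * np.exp(-0.5 * theta_q**2)          # x standard normal prior
W = post / post.sum(axis=1, keepdims=True)               # each row sums to 1

U1 = (Z * U).T @ W                                       # (J, Q): U_1^(t)
U0 = (Z * (1 - U)).T @ W                                 # (J, Q): U_0^(t)
print(U1.shape, np.allclose(U1 + U0, Z.T @ W))           # U1 + U0 accounts for all responses
```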
4.5.6 Maximization Step

At the t-th E step, the ELP was constructed, whose main component, the ELL, was approximated with the discretized posterior ability distribution W^(t). At the t-th M step, the estimate of the item parameter set Λ^(t) is obtained by maximizing the ELP or ELL, which can be carried out item by item (Point 4 in Summary of the EM Algorithm, p. 117). In short, the item parameter estimation procedure iteratively updates the discrete ability distribution and the item parameter estimates:

$$ \underbrace{\Lambda^{(0)}\to W^{(1)}\to\Lambda^{(1)}}_{\text{EM Cycle 1}}\;\underbrace{\to W^{(2)}\to\Lambda^{(2)}}_{\text{EM Cycle 2}}\to\cdots. $$

In this process, the pair of an E step and an M step is called an EM cycle. The MLE for Item j is obtained by maximizing the ELL of Eq. (4.21):

$$ \lambda_j^{(\mathrm{ML},t)}=\arg\max_{\lambda_j}\{ell(\boldsymbol{u}_j|\lambda_j)\}
\approx\arg\max_{\lambda_j}\sum_{q=1}^{Q}\bigl\{U_{1jq}^{(t)}\ln P(\theta_q;\lambda_j)+U_{0jq}^{(t)}\ln Q(\theta_q;\lambda_j)\bigr\}. $$

In addition, the MAP for Item j is obtained by maximizing the ELP as follows:
Fig. 4.19 M step in the first EM cycle of 2PLM (Item 1)
$$ \lambda_j^{(\mathrm{MAP},t)}=\arg\max_{\lambda_j}\{\ln pr(\lambda_j|\boldsymbol{u}_j)\}
\approx\arg\max_{\lambda_j}\Bigl[\sum_{q=1}^{Q}\bigl\{U_{1jq}^{(t)}\ln P(\theta_q;\lambda_j)+U_{0jq}^{(t)}\ln Q(\theta_q;\lambda_j)\bigr\}+\ln pr(\lambda_j)\Bigr]. $$

The difference between the MLE and the MAP is whether or not the priors for the item parameters are included in the maximized function. In practice, it is better to employ the priors because the item parameter estimates are more stable when the sample size is small. An example is shown below. A 15-item test whose sample size is 500 (J15S500) is analyzed by the 2PLM. The analysis settings are as follows:

Parameter Estimation Setting
1. Objective function: Expected log-posterior (ELP)
2. Quadrature points: 17 points from −3.2 to 3.2 at 0.4 increments
3. Initial value of slope parameter: a_j^(0) = 2ρ_{j,ζ} (∀j ∈ N_J)
4. Initial value of location parameter: b_j^(0) = 2τ_j (∀j ∈ N_J)
5. Prior of ability: Standard normal distribution f_N(θ_s; 0, 1) (∀s ∈ N_S)
6. Prior of slope: Log-normal distribution pr(a_j; 0.0, 0.5) (∀j ∈ N_J)
7. Prior of location: Normal distribution f_N(b_j; 0, 2) (∀j ∈ N_J)

Figure 4.19 shows the maximization process at the first M step for Item 1. The contour lines show the magnitude of the ELP of the item constructed in the first E step. The triangle in the map represents the maximum of the contour, which reaches −261.848 at (a_1, b_1) = (0.728, −1.573).³³ That is,

³³ The constant term of the prior for the item parameters is excluded from this calculation.
The constant term of the prior for the item parameters is excluded from this calculation.
$$ \ln pr(a_1=0.728,\, b_1=-1.573\,|\,\boldsymbol{u}_1)=-261.848. $$

Therefore, (a_1, b_1) = (0.728, −1.573) is determined as (a_1^(1), b_1^(1)), the MAP of Item 1 at the first M step. The lines in the figure also show the optimization process from the starting point (a_1, b_1) = (0.5, 0.0)³⁴ toward the peak. The Newton–Raphson method, Fisher's scoring method, and many other methods can be used for the optimization. Recent software can properly optimize an objective function once the function and its parameter(s) are specified.

³⁴ A bad starting point is set deliberately to show the optimization process clearly.

When analyzing with the 3PLM, the ELP is maximized with respect to a_j, b_j, and c_j, and the values of the triplet giving the maximum are the MAPs of the item parameters at the t-th M step, (a_j^(t), b_j^(t), c_j^(t)). For this, the following settings are added to Parameter Estimation Setting:

Additional Parameter Estimation Setting
9. Initial value of lower asymptote parameter: c_j^(0) = 0.05 (∀j ∈ N_J)
10. Prior of lower asymptote: Beta distribution B(c_j; 2, 5) (∀j ∈ N_J)

Fig. 4.20 M step in the first EM cycle of the 3PLM (Item 1)

Figure 4.20 shows the maximization process at the first M step for Item 1. The vertical axis represents the lower asymptote parameter; although it is a continuous parameter, only four layers are visualized. The ELP has a maximum of −263.923 at
(a_1, b_1, c_1) = (0.826, −0.960, 0.240), as indicated by the triangle. That is,

$$ \ln pr(a_1=0.826,\, b_1=-0.960,\, c_1=0.240\,|\,\boldsymbol{u}_1)=-263.923. $$

Therefore, these values are the MAPs of the item parameters of Item 1 at the first M step, (a_1^(1), b_1^(1), c_1^(1)).
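A minimal single-item M step can be written with a general-purpose optimizer, as the text notes. The sketch below maximizes the ELP of Eq. (4.21) plus the log-priors of Settings 6–7 for one hypothetical 2PLM item; the toy U_1, U_0 rows are generated here rather than taken from J15S500.

```python
import numpy as np
from scipy.optimize import minimize

theta_q = np.linspace(-3.2, 3.2, 17)

# Toy E-step counts for one item; in practice these are rows U1[j], U0[j]
# from the E-step sketch above.
w = np.exp(-0.5 * theta_q**2); w /= w.sum()
p_gen = 1 / (1 + np.exp(-0.8 * (theta_q + 1.5)))   # data-generating 2PLM IRF
u1, u0 = 500 * w * p_gen, 500 * w * (1 - p_gen)

def neg_elp(params):
    """Negative ELP of one 2PLM item: -(ELL of Eq. (4.21) + log-priors)."""
    a, b = params
    if a <= 0:
        return np.inf
    p = 1 / (1 + np.exp(-a * (theta_q - b)))
    ell = np.sum(u1 * np.log(p) + u0 * np.log(1 - p))
    log_prior = -0.5 * (np.log(a) / 0.5) ** 2 - np.log(a)  # log-normal(0, 0.5)
    log_prior += -0.5 * (b / 2.0) ** 2                     # normal(0, 2)
    return -(ell + log_prior)

res = minimize(neg_elp, x0=[1.0, 0.0], method="Nelder-Mead")
print(res.x)   # MAP (a_j, b_j) at this M step, near the generating (0.8, -1.5)
```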
4.5.7 Convergence Criterion of EM Cycles

An EM cycle consists of an E step and an M step, and the EM algorithm approaches the true values of the item parameters by repeating the EM cycles. It is more important to repeat the cycles than to execute each M step with high precision. By monitoring the value of the ELP in each EM cycle,³⁵

$$ \ln pr(\Lambda^{(t)}|U)=\sum_{j=1}^{J}\ln pr(\lambda_j^{(t)}|\boldsymbol{u}_j), $$

we can observe that the value gradually increases:

$$ \ln pr(\Lambda^{(t)}|U)>\ln pr(\Lambda^{(t-1)}|U). \tag{4.22} $$

As the EM cycles are repeated, the ELP values of adjacent EM cycles gradually become almost the same, which is called convergence. At this point, the EM algorithm is terminated, and the estimates obtained at the latest M step are regarded as the final MAP estimate of the item parameters, Λ̂^(MAP). For example, the EM algorithm is terminated when the following criterion is satisfied:

$$ \ln pr(\Lambda^{(t)}|U)-\ln pr(\Lambda^{(t-1)}|U)<c\,|\ln pr(\Lambda^{(t-1)}|U)|, \tag{4.23} $$

where c is a threshold, for example, c = 10⁻⁴ or c = 10⁻⁵; a criterion with a smaller c is harder to satisfy. The left-hand side (LHS) of this inequality represents the change in the ELP after updating, which is positive from Eq. (4.22). Meanwhile, because the ELP is generally negative, the right-hand side (RHS) is enclosed in absolute value signs |·|. Under this criterion, the algorithm is terminated when the updated value of the ELP at cycle t is almost the same as that at cycle t − 1.

Additional Parameter Estimation Setting 2
11. Convergence criterion: Eq. (4.23) with c = 10⁻⁴
³⁵ In estimating the MLEs of the item parameters, the ELL ell(U|Λ^(t)) should be monitored instead.
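Schematically, the whole procedure is a loop over EM cycles that stops when Eq. (4.23) is met. In the sketch below, e_step and m_step are assumed helper functions standing for the computations sketched earlier, not library functions.

```python
def em_algorithm(U, Z, lam0, c=1e-4, max_cycles=100):
    """Schematic MML-EM loop with the convergence criterion of Eq. (4.23).

    e_step and m_step are assumed to be defined elsewhere: they perform the
    computations sketched above (posterior weights W; item-by-item MAP).
    """
    lam, elp_prev, t = lam0, None, 0
    for t in range(1, max_cycles + 1):
        W = e_step(U, Z, lam)          # discretized posterior of theta, Eq. (4.12)
        lam, elp = m_step(U, Z, W)     # maximize ELP per item, Eq. (4.21) + log-prior
        if elp_prev is not None and elp - elp_prev < c * abs(elp_prev):
            break                      # Eq. (4.23): relative ELP change below c
        elp_prev = elp
    return lam, t
```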
Table 4.6 MAP estimates of the item parameters by MML-EM

           2PLM                3PLM
Model      Slope    Loc.       Slope    Loc.      LA*      θ_{P=0.5}**
Item 01    0.698   −1.683      0.818   −0.834    0.280     −1.840
Item 02    0.810   −1.552      0.860   −1.119    0.185     −1.658
Item 03    0.559   −1.838      0.657   −0.699    0.305     −2.130
Item 04    1.416   −1.178      1.550   −0.949    0.144     −1.168
Item 05    0.681   −2.241      0.721   −1.558    0.258     −2.566
Item 06    0.997   −2.162      1.022   −1.876    0.183     −2.321
Item 07    1.084   −1.039      1.255   −0.655    0.179     −1.009
Item 08    0.694   −0.557      0.748   −0.155    0.131     −0.560
Item 09    0.347    1.630      1.177    2.287    0.293      1.538
Item 10    0.492   −1.420      0.546   −0.505    0.222     −1.580
Item 11    1.122    1.021      1.476    1.090    0.063      0.999
Item 12    1.216    1.032      1.478    1.085    0.046      1.019
Item 13    0.875   −0.719      0.898   −0.502    0.096     −0.739
Item 14    1.200   −1.231      1.418   −0.788    0.226     −1.212
Item 15    0.823   −1.202      0.908   −0.812    0.153     −1.215

* Lower asymptote
** θ giving P_j(θ; â_j^(MAP), b̂_j^(MAP), ĉ_j^(MAP)) = 0.5
Under the convergence criterion shown above, the 15-item test data (J15S500) are analyzed with the 2PLM and 3PLM. With the 2PLM, analysis settings 1–8 and 11 are applied, and with the 3PLM, all the settings 1–11 are applied. The results are shown in Table 4.6.³⁶ In the analysis by the 2PLM, the EM algorithm converged at the 10th cycle, which means that

$$ \ln pr(\Lambda^{(10)}|U)-\ln pr(\Lambda^{(9)}|U)<0.0001\,|\ln pr(\Lambda^{(9)}|U)| \tag{4.24} $$

is satisfied. Meanwhile, the number of EM cycles was 12 in the 3PLM analysis. In general, convergence becomes slower as the number of parameters in the applied model grows. Comparing the results of the 2PLM and 3PLM, we can see that the location parameters of the 3PLM tend to be larger; however, this does not mean that the difficulty of an item changes with the model because, from Eq. (4.1) (p. 91), the larger the lower asymptote parameter, the smaller the θ that gives CRR = 0.5. The rightmost column of Table 4.6, θ_{P=0.5}, shows the θ at which the CRR is 0.5. For the 2PLM,

³⁶ The results may differ across software programs due to differences in optimization methods and prior densities of the item parameters. In particular, it is necessary to be careful about whether the slope parameter includes the scaling factor (see Scaling Factor 1.7, p. 87). In this book, the scaling factor is included in the slope parameter.
Fig. 4.21 Comparison of IRFs between the 2PLM and 3PLM
because the CRR is 0.5 when θ = b_j, θ_{P=0.5} in the 3PLM and b_j in the 2PLM are comparable, and the comparison shows that the values are similar. In Fig. 4.21, the IRFs of the 2PLM and 3PLM are plotted together for four items. The IRT can be viewed as a logistic regression analysis in which the dependent variable is dichotomous and the independent variable is latent; IRT analysis is thus basically curve fitting to binary data, and the fitted curve does not change greatly just because the applied model changes. In particular, the fit of the regression curve around −1 ≤ θ ≤ 1 is accurate because the student abilities are concentrated in this interval; accordingly, the curves of the 2PLM and 3PLM overlap well there. Although not shown in the figure, the results for the other 11 items are similar. The nature of the difficulty does not change even if the value of the location parameter changes with the model.

Table 4.6 shows that the slope parameters of the 3PLM are larger for all items, but this does not mean that the discriminating power has improved, because the item discrimination decreases as the lower asymptote parameter increases.³⁷

³⁷ The IIF decreases as the lower asymptote parameter increases (see Sect. 4.4.3, p. 112).

Fig. 4.22 Comparison of IIFs between the 2PLM and 3PLM

Figure 4.22 shows the IIFs of Items 4, 8, 9, and 11. It can be seen that the IIF of the 3PLM is not
always larger than that of the 2PLM for all items. For some items, the IIF under the 2PLM is larger, while for other items, that under the 3PLM is larger.
4.5.8 Posterior Standard Deviation

This section describes the estimation accuracy of the item parameters. Suppose the location parameter of an item is estimated as 1.234. If its accuracy is 1.234 ± 1.0, it is difficult to trust the value 1.234; however, if it is 1.234 ± 0.1, we can trust it. The precision of an item parameter estimate is evaluated by the SD of the estimate: the estimation accuracy of the MAP of an item parameter is evaluated by the PSD. The PSD can be calculated from the posterior information matrix. The size of the posterior information matrix I_j(λ) is 4 × 4 for the 4PLM, where each element is a function of λ = [a b c d]:

$$ I_j(\lambda)=I_j^{(F)}(\lambda)+I^{(pr)}(\lambda), $$

where I_j^{(F)}(λ) and I^{(pr)}(λ) denote the Fisher information and prior information matrices, respectively. Each of them is a square matrix of size 4 under the 4PLM, and each element is a function of λ.

For Item j, substituting λ = λ̂_j^(MAP) into the posterior information matrix and inverting the matrix, the PSDs of the item parameters are the square roots of the diagonal elements of the inverted matrix.³⁸ That is,

$$ \mathrm{PSD}[\hat{\lambda}_j^{(\mathrm{MAP})}]=\begin{bmatrix} \mathrm{PSD}[\hat{a}_j^{(\mathrm{MAP})}]\\ \mathrm{PSD}[\hat{b}_j^{(\mathrm{MAP})}]\\ \mathrm{PSD}[\hat{c}_j^{(\mathrm{MAP})}]\\ \mathrm{PSD}[\hat{d}_j^{(\mathrm{MAP})}] \end{bmatrix} \tag{4.25} $$
$$ =\mathrm{diag}\bigl\{I_j(\hat{\lambda}_j^{(\mathrm{MAP})})^{-1}\bigr\}^{\circ\frac{1}{2}} \tag{4.26} $$
$$ =\mathrm{diag}\bigl[\{I_j^{(F)}(\hat{\lambda}_j^{(\mathrm{MAP})})+I^{(pr)}(\hat{\lambda}_j^{(\mathrm{MAP})})\}^{-1}\bigr]^{\circ\frac{1}{2}}. \tag{4.27} $$
⎤ SE[aˆ (ML) ] j ⎢ SE[bˆ (ML) ] ⎥ ◦ 1 (ML) ⎢ ⎥ (F) (ML) j SE[λˆ j ] = ⎢ ⎥ = diagI j (λˆ j )−1 2 . ⎣ SE[cˆ(ML) ⎦ ] j SE[dˆ (ML) ] j When the SE is small, it indicates that the MLE obtained for the item parameter is accurate. The content of prior information matrix I ( pr ) (λ), which is a diagonal matrix,40 is given by ⎡ 2 ⎤ d ln pr (a) − 0 0 0 ⎢ ⎥ da 2 ⎢ ⎥ ⎢ ⎥ d2 ln pr (b) ⎢ ⎥ 0 − 0 0 2 ⎢ ⎥ ( pr ) db I (λ) = ⎢ ⎥ . (4.28) 2 d ln pr (c) ⎢ ⎥ ⎢ ⎥ 0 0 − 0 ⎢ ⎥ dc2 2 ⎣ d ln pr (d) ⎦ 0 0 0 − dd 2 38
More precisely, this is the asymptotic PSD that approaches the true PSD as the sample size increases. 39 More precisely, this is the asymptotic SE that approaches the true SE as the sample size increases. 40 A square matrix all of whose off-diagonal elements are 0.
4.5 Item Parameter Estimation
133
In the matrix, the diagonal elements are the second-order derivatives of the logarithm of the prior densities for the item parameters (see Sect. 4.5.3, p. 119) with negative sign, but the hyperparameters are omitted for ease of viewing. These diagonal elements are specified as follows: Elements of Prior Information Matrix
−
d2 ln pr(a|μa , σa ) μa + 1 − σa2 − ln a = da 2 a 2 σa2
−
d2 ln pr(b|μb , σb ) 1 = 2 (a constant regardless of b) db2 σb
βc − 1 βc − 1 d2 ln pr(c|βc1 , βc2 ) = 12 + 2 2 dc2 c 1−c 2 d ln pr(d|βd1 , βd2 ) βd − 1 βd − 1 − = 1 2 + 2 2 dd 2 d 1−d
−
Next, the Fisher information matrix I_j^{(F)}(λ) consists of the negative expectations of the second-order derivatives of the ELL.⁴¹ Under the 4PLM, the Fisher information matrix of Item j is given by⁴²

$$ I_j^{(F)}(\lambda)=E\Bigl[-\frac{\partial^2 ell(\boldsymbol{u}_j|\lambda)}{\partial\lambda\,\partial\lambda'}\Bigr]
=-E\begin{bmatrix}
\dfrac{\partial^2 ell(\boldsymbol{u}_j|\lambda)}{\partial a^2} & & & \text{sym}\\
\dfrac{\partial^2 ell(\boldsymbol{u}_j|\lambda)}{\partial b\,\partial a} & \dfrac{\partial^2 ell(\boldsymbol{u}_j|\lambda)}{\partial b^2} & &\\
\dfrac{\partial^2 ell(\boldsymbol{u}_j|\lambda)}{\partial c\,\partial a} & \dfrac{\partial^2 ell(\boldsymbol{u}_j|\lambda)}{\partial c\,\partial b} & \dfrac{\partial^2 ell(\boldsymbol{u}_j|\lambda)}{\partial c^2} &\\
\dfrac{\partial^2 ell(\boldsymbol{u}_j|\lambda)}{\partial d\,\partial a} & \dfrac{\partial^2 ell(\boldsymbol{u}_j|\lambda)}{\partial d\,\partial b} & \dfrac{\partial^2 ell(\boldsymbol{u}_j|\lambda)}{\partial d\,\partial c} & \dfrac{\partial^2 ell(\boldsymbol{u}_j|\lambda)}{\partial d^2}
\end{bmatrix}. \tag{4.29} $$

⁴¹ The matrix of second-order derivatives is called the Hessian matrix; thus, the Fisher information matrix is the negative expectation of the Hessian matrix.
⁴² ∂f(x)/∂x_j denotes the partial derivative of a function f of multiple variables x with respect to a single variable x_j. When differentiating with respect to x_j, the other variables are regarded as constants, as in
$$ \frac{\partial}{\partial x_2}(ax_1^3+bx_1x_2^2+cx_2x_3^2)=0+2bx_1x_2+cx_3^2. $$
In the matrix, the (1, 1)-th element, for example, is the (direct) second-order partial derivative of ell(u_j|λ) with respect to a (the slope parameter). That is, from Eq. (4.20),⁴³

$$ \frac{\partial^2 ell(\boldsymbol{u}_j|\lambda)}{\partial a^2}=\sum_{s=1}^{S}\sum_{q=1}^{Q} w_{sq}z_{sj}\Bigl\{u_{sj}\frac{\partial^2\ln P(\theta_q;\lambda)}{\partial a^2}+(1-u_{sj})\frac{\partial^2\ln Q(\theta_q;\lambda)}{\partial a^2}\Bigr\}, \tag{4.30} $$

where P(θ; λ) is the IRF of the 4PLM (see (4.2), p. 92) and Q(θ; λ) = 1 − P(θ; λ). In addition, because the item parameters have already been estimated, w_sq is, unlike w_sq^(t) in Eq. (4.20), determined as

$$ w_{sq}=\frac{pr(\theta_q|\boldsymbol{u}_s,\hat{\Lambda})}{\sum_{q'=1}^{Q} pr(\theta_{q'}|\boldsymbol{u}_s,\hat{\Lambda})}. $$

In the above equation, the MAPs (Λ̂^(MAP)) are substituted when obtaining the PSDs, or the MLEs (Λ̂^(ML)) when obtaining the SEs. Next, taking the negative expectation of Eq. (4.30) with respect to u_sj, we have

$$ -E\Bigl[\frac{\partial^2 ell(\boldsymbol{u}_j|\lambda)}{\partial a^2}\Bigr]
=-\sum_{s=1}^{S}\sum_{q=1}^{Q} w_{sq}z_{sj}\Bigl\{P(\theta_q;\lambda)\frac{\partial^2\ln P(\theta_q;\lambda)}{\partial a^2}+Q(\theta_q;\lambda)\frac{\partial^2\ln Q(\theta_q;\lambda)}{\partial a^2}\Bigr\}
=-\sum_{q=1}^{Q} Z_{jq}\Bigl\{P(\theta_q;\lambda)\frac{\partial^2\ln P(\theta_q;\lambda)}{\partial a^2}+Q(\theta_q;\lambda)\frac{\partial^2\ln Q(\theta_q;\lambda)}{\partial a^2}\Bigr\}, \tag{4.31} $$

where

$$ Z_{jq}=\sum_{s=1}^{S} w_{sq}z_{sj}. $$

After the differentiation, Eq. (4.31) becomes

$$ -E\Bigl[\frac{\partial^2 ell(\boldsymbol{u}_j|\lambda)}{\partial a^2}\Bigr]=\sum_{q=1}^{Q} Z_{jq}\frac{(\theta_q-b)^2(P(\theta_q;\lambda)-c)^2(d-P(\theta_q;\lambda))^2}{(d-c)^2 P(\theta_q;\lambda)Q(\theta_q;\lambda)}. $$
⁴³ Similarly, the (2, 1)-th element is the (cross) second-order partial derivative with respect to a and b, where ell(u_j|λ) is first differentiated with respect to a and then with respect to b (the location parameter). The order of differentiation is interchangeable.
This is the (1, 1)-th element of the Fisher information matrix. All the elements of the matrix are listed in the following box:

Elements of Fisher Information Matrix by 4PLM

$$ -E\Bigl[\frac{\partial^2 ell(\boldsymbol{u}_j|\lambda)}{\partial a^2}\Bigr]=\sum_{q=1}^{Q} Z_{jq}\frac{(\theta_q-b)^2(P(\theta_q;\lambda)-c)^2(d-P(\theta_q;\lambda))^2}{(d-c)^2 P(\theta_q;\lambda)Q(\theta_q;\lambda)} $$
$$ -E\Bigl[\frac{\partial^2 ell(\boldsymbol{u}_j|\lambda)}{\partial b\,\partial a}\Bigr]=-\sum_{q=1}^{Q} Z_{jq}\frac{a(\theta_q-b)(P(\theta_q;\lambda)-c)^2(d-P(\theta_q;\lambda))^2}{(d-c)^2 P(\theta_q;\lambda)Q(\theta_q;\lambda)} $$
$$ -E\Bigl[\frac{\partial^2 ell(\boldsymbol{u}_j|\lambda)}{\partial b^2}\Bigr]=\sum_{q=1}^{Q} Z_{jq}\frac{a^2(P(\theta_q;\lambda)-c)^2(d-P(\theta_q;\lambda))^2}{(d-c)^2 P(\theta_q;\lambda)Q(\theta_q;\lambda)} $$
$$ -E\Bigl[\frac{\partial^2 ell(\boldsymbol{u}_j|\lambda)}{\partial c\,\partial a}\Bigr]=\sum_{q=1}^{Q} Z_{jq}\frac{(\theta_q-b)(P(\theta_q;\lambda)-c)(d-P(\theta_q;\lambda))^2}{(d-c)^2 P(\theta_q;\lambda)Q(\theta_q;\lambda)} $$
$$ -E\Bigl[\frac{\partial^2 ell(\boldsymbol{u}_j|\lambda)}{\partial c\,\partial b}\Bigr]=-\sum_{q=1}^{Q} Z_{jq}\frac{a(P(\theta_q;\lambda)-c)(d-P(\theta_q;\lambda))^2}{(d-c)^2 P(\theta_q;\lambda)Q(\theta_q;\lambda)} $$
$$ -E\Bigl[\frac{\partial^2 ell(\boldsymbol{u}_j|\lambda)}{\partial c^2}\Bigr]=\sum_{q=1}^{Q} Z_{jq}\frac{(d-P(\theta_q;\lambda))^2}{(d-c)^2 P(\theta_q;\lambda)Q(\theta_q;\lambda)} $$
$$ -E\Bigl[\frac{\partial^2 ell(\boldsymbol{u}_j|\lambda)}{\partial d\,\partial a}\Bigr]=\sum_{q=1}^{Q} Z_{jq}\frac{(\theta_q-b)(P(\theta_q;\lambda)-c)^2(d-P(\theta_q;\lambda))}{(d-c)^2 P(\theta_q;\lambda)Q(\theta_q;\lambda)} $$
$$ -E\Bigl[\frac{\partial^2 ell(\boldsymbol{u}_j|\lambda)}{\partial d\,\partial b}\Bigr]=-\sum_{q=1}^{Q} Z_{jq}\frac{a(P(\theta_q;\lambda)-c)^2(d-P(\theta_q;\lambda))}{(d-c)^2 P(\theta_q;\lambda)Q(\theta_q;\lambda)} $$
$$ -E\Bigl[\frac{\partial^2 ell(\boldsymbol{u}_j|\lambda)}{\partial d\,\partial c}\Bigr]=\sum_{q=1}^{Q} Z_{jq}\frac{(P(\theta_q;\lambda)-c)(d-P(\theta_q;\lambda))}{(d-c)^2 P(\theta_q;\lambda)Q(\theta_q;\lambda)} $$
$$ -E\Bigl[\frac{\partial^2 ell(\boldsymbol{u}_j|\lambda)}{\partial d^2}\Bigr]=\sum_{q=1}^{Q} Z_{jq}\frac{(P(\theta_q;\lambda)-c)^2}{(d-c)^2 P(\theta_q;\lambda)Q(\theta_q;\lambda)} $$
The Fisher information matrix under the 3PLM can be constructed by first substituting d = 1 into that of the 4PLM. Next, the fourth row and fourth column corresponding to d in the Fisher information matrix (and prior information matrix) are deleted; the size of the Fisher information matrix (and prior information matrix) then becomes 3 × 3. Finally, because d − P(θ_q; λ) = Q(θ_q; λ) when d = 1, the elements of the Fisher information matrix under the 3PLM simplify as follows:
Elements of Fisher Information Matrix by 3PLM

$$ -E\Bigl[\frac{\partial^2 ell(\boldsymbol{u}_j|\lambda)}{\partial a^2}\Bigr]=\sum_{q=1}^{Q} Z_{jq}\frac{(\theta_q-b)^2(P(\theta_q;\lambda)-c)^2 Q(\theta_q;\lambda)}{(1-c)^2 P(\theta_q;\lambda)} $$
$$ -E\Bigl[\frac{\partial^2 ell(\boldsymbol{u}_j|\lambda)}{\partial b\,\partial a}\Bigr]=-\sum_{q=1}^{Q} Z_{jq}\frac{a(\theta_q-b)(P(\theta_q;\lambda)-c)^2 Q(\theta_q;\lambda)}{(1-c)^2 P(\theta_q;\lambda)} $$
$$ -E\Bigl[\frac{\partial^2 ell(\boldsymbol{u}_j|\lambda)}{\partial b^2}\Bigr]=\sum_{q=1}^{Q} Z_{jq}\frac{a^2(P(\theta_q;\lambda)-c)^2 Q(\theta_q;\lambda)}{(1-c)^2 P(\theta_q;\lambda)} $$
$$ -E\Bigl[\frac{\partial^2 ell(\boldsymbol{u}_j|\lambda)}{\partial c\,\partial a}\Bigr]=\sum_{q=1}^{Q} Z_{jq}\frac{(\theta_q-b)(P(\theta_q;\lambda)-c)Q(\theta_q;\lambda)}{(1-c)^2 P(\theta_q;\lambda)} $$
$$ -E\Bigl[\frac{\partial^2 ell(\boldsymbol{u}_j|\lambda)}{\partial c\,\partial b}\Bigr]=-\sum_{q=1}^{Q} Z_{jq}\frac{a(P(\theta_q;\lambda)-c)Q(\theta_q;\lambda)}{(1-c)^2 P(\theta_q;\lambda)} $$
$$ -E\Bigl[\frac{\partial^2 ell(\boldsymbol{u}_j|\lambda)}{\partial c^2}\Bigr]=\sum_{q=1}^{Q} Z_{jq}\frac{Q(\theta_q;\lambda)}{(1-c)^2 P(\theta_q;\lambda)} $$
In addition, the estimation accuracy of the item parameters under the 2PLM is obtained by further substituting c = 0 into the Fisher information matrix of the 3PLM. Then, by deleting the third row and column corresponding to c in the matrix (and in the prior information matrix), the size of the Fisher information matrix (and prior information matrix) becomes 2 × 2. Each element of the Fisher information matrix reduces as follows.

Elements of Fisher Information Matrix by 2PLM
$$ -E\Bigl[\frac{\partial^2 ell(\boldsymbol{u}_j|\lambda)}{\partial a^2}\Bigr]=\sum_{q=1}^{Q} Z_{jq}(\theta_q-b)^2 P(\theta_q;\lambda)Q(\theta_q;\lambda) $$
$$ -E\Bigl[\frac{\partial^2 ell(\boldsymbol{u}_j|\lambda)}{\partial b\,\partial a}\Bigr]=-\sum_{q=1}^{Q} Z_{jq}\,a(\theta_q-b)P(\theta_q;\lambda)Q(\theta_q;\lambda) $$
$$ -E\Bigl[\frac{\partial^2 ell(\boldsymbol{u}_j|\lambda)}{\partial b^2}\Bigr]=\sum_{q=1}^{Q} Z_{jq}\,a^2 P(\theta_q;\lambda)Q(\theta_q;\lambda) $$
Let us show an example. For Item 1 from Table 4.6 (p. 129), the MAPs of the item parameters were obtained as

$$ \hat{\lambda}_1^{(\mathrm{MAP})}=\begin{bmatrix} 0.818\\ -0.834\\ 0.280 \end{bmatrix}. $$

First, refer to Elements of Prior Information Matrix (p. 133) to evaluate the prior information in Eq. (4.28). The priors for a, b, and c are described in Points 6, 7, and 10 of Parameter Estimation Setting (p. 126), respectively. We then have the prior information matrix

$$ I^{(pr)}(\hat{\lambda}_1^{(\mathrm{MAP})})=\begin{bmatrix} 5.693 & & \\ & 0.250 & \\ & & 20.446 \end{bmatrix}. $$

Note that this matrix is diagonal, and its size is 3 × 3 because of the 3PLM. Next, referring to Elements of Fisher Information Matrix by 3PLM (p. 136), the Fisher information in Eq. (4.29) is calculated as
ˆ (MAP) ) I (F) 1 (λ 1
⎤ 79.569 −37.031 70.783 40.315 −116.256 ⎦ . = ⎣ −37.031 70.783 −116.256 371.543
Thus, we have the posterior information matrix as follows: ⎡
(MAP) I 1 (λˆ 1 )
=
ˆ (MAP) ) I (F) 1 (λ1
+
(MAP) I ( pr ) (λˆ 1 )
⎤ 85.263 −37.031 70.783 40.565 −116.256 ⎦ . = ⎣ −37.031 70.783 −116.256 391.989
The PSDs of the item parameters are, as shown in Eq. (4.27), the square roots of the diagonal elements of the inverse matrix of the posterior information matrix. To begin with, the inverse matrix of the posterior information is obtained as ⎡
(MAP) −1
I 1 (λˆ 1
)
⎤ 0.033 0.087 0.020 = ⎣ 0.087 0.394 0.101 ⎦ . 0.020 0.101 0.029
The diagonal elements of this matrix are the variances of the MAPs. In other words, each PSD is the square root of the corresponding diagonal element. That is, ⎤ ⎡ ⎤◦ 1 ⎡ ⎤ PSD[aˆ 1(MAP) ] 0.033 2 0.182 = ⎣ PSD[bˆ1(MAP) ] ⎦ = ⎣ 0.394 ⎦ = ⎣ 0.628 ⎦ . 0.029 0.170 PSD[cˆ1(MAP) ] ⎡
(MAP) PSD[λˆ 1 ]
Table 4.7 shows the PSDs under the 2PLM and 3PLM corresponding to the MAPs shown in Table 4.6. There are several features that should be described. First, all
Table 4.7 PSDs of the item parameter MAP estimates

           2PLM                3PLM
Model      Slope    Loc.       Slope    Loc.      LA*
Item 01    0.109    0.266      0.182    0.628    0.170
Item 02    0.117    0.221      0.157    0.471    0.149
Item 03    0.099    0.338      0.162    0.798    0.173
Item 04    0.157    0.114      0.227    0.216    0.104
Item 05    0.115    0.360      0.148    0.700    0.186
Item 06    0.150    0.273      0.171    0.423    0.158
Item 07    0.128    0.130      0.214    0.289    0.117
Item 08    0.100    0.153      0.148    0.394    0.108
Item 09    0.077    0.428      0.493    0.423    0.044
Item 10    0.091    0.306      0.131    0.779    0.156
Item 11    0.131    0.125      0.263    0.120    0.032
Item 12    0.138    0.118      0.245    0.115    0.028
Item 13    0.111    0.133      0.142    0.272    0.086
Item 14    0.141    0.134      0.248    0.291    0.125
Item 15    0.113    0.180      0.159    0.383    0.125

* Lower asymptote
Table 4.7 shows the PSDs under the 2PLM and 3PLM corresponding to the MAPs shown in Table 4.6. There are several features to describe. First, all values are positive because they are posterior SDs, which must be nonnegative by definition. In addition, the PSD of each parameter under the 2PLM is smaller than the corresponding PSD under the 3PLM. For example, the PSDs of the slope parameter for Item 1 under the 2PLM and 3PLM are 0.109 and 0.182, respectively. This is due to model flexibility. Generally, a model with more parameters is more flexible and thus fits the data better, because the flexibility of the entire model derives from the cumulative mobility range of each single parameter; at the same time, that flexibility works against the smallness of the PSD of each parameter's MAP (or the SE in the case of the MLE). Therefore, increasing the number of parameters in a model makes the model more flexible at the expense of the estimation accuracy of the parameters. The 3PLM should be used when the sample size is large; the sample size of this analysis (J15S500) was 500, which is, in general, too small for the 3PLM to be applied (see Table 4.1, p. 94).
Fig. 4.23 Basic concept of model fit in SEM
4.6 Model Fit

Model fit measures how well a statistical model fits the data. In the IRT, the 4PLM is the most flexible; thus, it fits the data best. That is, the order of goodness of fit matches the order of the number of parameters: 4PLM > 3PLM > 2PLM (> 1PLM). However, if there is no large difference between the goodness of fit of two models, one may choose the model with the smaller number of parameters.

The approach to model fit introduced in this book follows the style of structural equation modeling (SEM; e.g., Bollen, 1989b; Jöreskog & Sörbom, 1979; Kaplan, 2000; Muthén & Muthén, 2017; Toyoda, 1998). The basic idea of evaluating model fit in SEM is to first assume two extreme models: a very well-fitting model (the saturated model) and a very poorly fitting model (the null model), and then to assess where the present model (the analysis model) is placed between the two (Fig. 4.23). The fit of the analysis model is regarded as better the closer it is to the fit of the saturated model, and worse the closer it is to the fit of the null model.

The saturated model in SEM is a model with 0 degrees of freedom (DF), in which all elements of the covariance matrix are treated as model parameters and estimated, while the null model is a model in which only the diagonal elements of the covariance matrix (i.e., the variances) are estimated, and all the off-diagonal elements (i.e., the covariances) are assumed to be 0. Because the purpose of SEM is generally to analyze the covariances (or correlations) between variables and to visualize the relationships in a path diagram, the null model, which assumes that all off-diagonal elements are 0, hardly fits the data.
4.6.1 Analysis Model

Summary of Log-Likelihood
1. The likelihood is the probability represented by the model.
2. The range of the likelihood is [0, 1], meaning the data are more likely to occur as it approaches 1.
3. The log-likelihood is the logarithm of a likelihood.
4. The range of the log-likelihood is (−∞, 0], meaning the data are more likely to occur as it approaches 0.

The ELL (expected log-likelihood) can be used as a measure of the goodness of fit of the analysis model. Let us review the likelihood. As already explained in this book, the likelihood is the probability represented by the parameter(s), which are built into a model. Thus, as Point 1 in the above box indicates, the likelihood can be regarded as the probability represented by the model. In addition, from Point 2, the range of the likelihood is [0, 1] because it is a probability, and the closer it is to 1, the higher the probability that the data are generated under the model. However, as described in Point 3, the likelihood becomes very close to 0 because it is the product of the likelihoods of the individual data, each ∈ [0, 1], so it is hard to evaluate the magnitude of the value. Thus, taking the logarithm of the likelihood, we have the log-likelihood, with its range stretched to (−∞, 0],⁴⁴ which makes the magnitude easier to evaluate. The log-likelihood is 0 when the likelihood is 1; thus, the log-likelihood approaches 0 as the model fits the data better.

The log-likelihood in the IRT is the ELL. From Eq. (4.21) (p. 124), the ELL of Item j is given as

$$ ell_A(\boldsymbol{u}_j|\hat{\lambda}_j)=\sum_{q=1}^{Q}\bigl\{U_{1jq}\ln P(\theta_q;\hat{\lambda}_j)+U_{0jq}\ln Q(\theta_q;\hat{\lambda}_j)\bigr\}, \tag{4.32} $$
where the subscript "A" refers to the analysis model for emphasis, and

$$ U_{1jq}=\sum_{s=1}^{S} w_{sq}z_{sj}u_{sj},\qquad U_{0jq}=\sum_{s=1}^{S} w_{sq}z_{sj}(1-u_{sj}),\qquad
w_{sq}=\frac{pr(\theta_q|\boldsymbol{u}_s,\hat{\Lambda})}{\sum_{q'=1}^{Q} pr(\theta_{q'}|\boldsymbol{u}_s,\hat{\Lambda})}. $$

⁴⁴ Note that ln 0 = −∞ and ln 1 = 0.
This equation gives the ELL of Item j, which also implies that we can evaluate the goodness of fit per item. In addition, the item parameters of Item j, λ̂_j, have already been estimated as the MAPs (λ̂_j^(MAP)) or MLEs (λ̂_j^(ML)).⁴⁵ Thus, there are no unknown parameters in Eq. (4.32), and it is therefore obtained as a negative constant.

⁴⁵ The ELL (not the ELP) should be used even when the estimates are MAPs. This choice is necessary, for example, when comparing the goodness of fit of the 2PLM, 3PLM, and 4PLM, because the priors used for them differ.
4.6.2 Null and Benchmark Models

It is hard to evaluate the goodness of fit of the analysis model in absolute terms from the value of ell_A; thus, two models are prepared for comparison: a null model that fits the data poorly and a saturated model that fits the data best. The null model is explained first. What kind of model is a null model in test data analysis? It may vary by analyst. Let us consider here a model in which the CRR does not change with ability θ. Then, the IRF of Item j can be defined as

$$ P_N(\theta;p_j)=p_j, \tag{4.33} $$

where p_j is the CRR of Item j. In this model, the CRR of Item j is constant at p_j, regardless of θ. One can imagine even worse-fitting models, for example, P_N(θ) = 0.5, which assumes for all items that the CRR is constant at 0.5 regardless of θ; the fit of such a model would be very bad. In this book, the model shown in Eq. (4.33) is specified as the null model. Referring to Eq. (4.20), the ELL of the null model is expressed as

$$ ell_N(\boldsymbol{u}_j|p_j)=\sum_{q=1}^{Q}\sum_{s=1}^{S}\bigl\{w_{sq}z_{sj}u_{sj}\ln P_N(\theta_q;p_j)+w_{sq}z_{sj}(1-u_{sj})\ln Q_N(\theta_q;p_j)\bigr\}
=U_{1j}\ln p_j+U_{0j}\ln(1-p_j), \tag{4.34} $$
where

$$ U_{1j}=\sum_{q=1}^{Q}\sum_{s=1}^{S} w_{sq}z_{sj}u_{sj}=\sum_{s=1}^{S} z_{sj}u_{sj}\sum_{q=1}^{Q} w_{sq}=\sum_{s=1}^{S} z_{sj}u_{sj},\qquad
U_{0j}=\sum_{q=1}^{Q}\sum_{s=1}^{S} w_{sq}z_{sj}(1-u_{sj})=\sum_{s=1}^{S} z_{sj}(1-u_{sj}). $$
Note that the sum of the elements of w_s = [w_s1 … w_sQ]' is 1 because w_s is the discrete posterior distribution of θ_s. In Eq. (4.34), the expectation is taken with respect to θ; ell_N is thus an ELL, but it is in fact the plain log-likelihood because the IRF (Eq. 4.33) does not depend on θ. Accordingly, U_1j and U_0j are simply the numbers of correct and incorrect responses to Item j, respectively. In addition, the fit of the null model is worse than that of the analysis model; thus, the following inequality holds:

$$ ell_N(\boldsymbol{u}_j|p_j)<ell_A(\boldsymbol{u}_j|\hat{\lambda}_j)<0. $$

Next, the saturated model, which is the perfect-fit model with DF = 0, would normally be explained; instead, in this book, the benchmark model (Shojima, 2008) is introduced (for the saturated model in test data analysis, refer to Sect. 11.3.6, p. 564). The benchmark model is not a model that fits the data perfectly, but one that fits the data sufficiently well. There can be several "sufficiently well-fitting" models; a natural candidate is a model that assumes there are subgroups within the whole student group, each with a different CRR. Consider here a grouping based on the number-right score (NRS). This model, in which the CRR changes with the NRS, is specified as the benchmark model.

When the number of items is J, grouping by the NRS yields a maximum of J + 1 groups {0, 1, …, J}. However, it is not always possible to observe all the NRS patterns; for example, if a test is very easy, no student will score 0 points. Thus, let the number of observed NRS patterns be denoted G. Then, consider a group membership indicator, a binary variable coded 1 if Student s belongs to Group g and 0 otherwise. That is,
$$ m_{sg}=\begin{cases} 1, & \text{if Student } s \text{ belongs to Group } g,\\ 0, & \text{otherwise.} \end{cases} $$
In addition,

$$ M_G=\begin{bmatrix} m_{11} & m_{12} & \cdots & m_{1G}\\ m_{21} & m_{22} & \cdots & m_{2G}\\ \vdots & \vdots & \ddots & \vdots\\ m_{S1} & m_{S2} & \cdots & m_{SG} \end{bmatrix}=\{m_{sg}\} \quad (S\times G) \tag{4.35} $$

is the group membership matrix, in which m_sg is placed in the s-th row and g-th column. The s-th row vector of this matrix is the group membership profile of Student s, in which only one element is 1 and the others are 0; that is, 1_G' m_s = 1. The sum of each row must be 1, because if it were 0 there would be a student belonging to no group, and if it were ≥ 2 there would be a student belonging to more than one group. In addition, the sum of the g-th column elements of the group membership matrix, 1_S' m_g, gives the number of students belonging to Group g.
Using the group memberships, the CRR for Item j of the students belonging to Group g is expressed as

$$ P_B(\theta;p_{jg})=p_{jg}=\frac{\boldsymbol{m}_g'(\boldsymbol{z}_j\circ\boldsymbol{u}_j)}{\boldsymbol{m}_g'\boldsymbol{z}_j}=\frac{\sum_{s=1}^{S} m_{sg}z_{sj}u_{sj}}{\sum_{s=1}^{S} m_{sg}z_{sj}}. \tag{4.36} $$

This CRR is constant regardless of θ because θ does not appear in the RHS. Then, the (expected) log-likelihood of the benchmark model becomes
$$ ell_B(\boldsymbol{u}_j|\boldsymbol{p}_j)=\sum_{g=1}^{G}\sum_{q=1}^{Q}\sum_{s=1}^{S} m_{sg}w_{sq}z_{sj}\bigl\{u_{sj}\ln P_B(\theta_q;p_{jg})+(1-u_{sj})\ln Q_B(\theta_q;p_{jg})\bigr\}
=\sum_{g=1}^{G}\bigl\{U_{1gj}\ln p_{jg}+U_{0gj}\ln(1-p_{jg})\bigr\}, \tag{4.37} $$
where p_j = [p_j1 … p_jG]' (G × 1), and

$$ U_{1gj}=\sum_{q=1}^{Q}\sum_{s=1}^{S} m_{sg}w_{sq}z_{sj}u_{sj}=\sum_{s=1}^{S} m_{sg}z_{sj}u_{sj},\qquad
U_{0gj}=\sum_{q=1}^{Q}\sum_{s=1}^{S} m_{sg}w_{sq}z_{sj}(1-u_{sj})=\sum_{s=1}^{S} m_{sg}z_{sj}(1-u_{sj}). $$
In the above equations, U_1gj and U_0gj are the numbers of correct and incorrect responses to Item j by the students in Group g, respectively.

Relationships Among Benchmark, Analysis, and Null Models

$$ \text{Log-likelihood:}\quad ell_B(\boldsymbol{u}_j|\boldsymbol{p}_j)>ell_A(\boldsymbol{u}_j|\hat{\lambda}_j)>ell_N(\boldsymbol{u}_j|p_j) \tag{4.38} $$
$$ \text{N of parameters:}\quad \mathrm{row}(\boldsymbol{p}_j)>\mathrm{row}(\hat{\lambda}_j)>\mathrm{row}(p_j) \tag{4.39} $$

The benchmark model is expected to fit better than the analysis model. Therefore, the (expected) log-likelihoods of the benchmark, analysis, and null models satisfy the relationships shown in Eq. (4.38) in the above box.
Fig. 4.24 Concept of the chi-square statistic of the analysis model
In addition, there is one more relationship among these three models from the perspective of the number of parameters. The number of parameters of the analysis model is row(λ̂_j), where row(·) denotes the operator returning the number of rows of its argument; if the argument is a vector, it returns the number of elements, so, for example, row(λ̂_j) = 3 under the 3PLM. In addition, the numbers of parameters of the benchmark and null models are row(p_j) = G for the vector p_j = [p_j1 … p_jG]' and row(p_j) = 1 for the scalar p_j,⁴⁶ respectively. Thus, regarding the number of parameters, the relationship represented in Eq. (4.39) holds among the three models. The right inequality (row(λ̂_j) > 1) almost always holds because the 1PLM is rarely used nowadays, and the left inequality is also almost always fulfilled because the number of test items is usually ≥ 10. As implied by Eqs. (4.38) and (4.39), the larger the number of parameters, the better the fit of the model.
4.6.3 Chi-Square Statistics

The χ² (chi-square) statistic is used for examining the difference in fit between two models. More specifically, the χ² can evaluate how inferior the fit of the analysis model is to that of the benchmark model. Figure 4.24 shows the relationship between the likelihoods, log-likelihoods, and χ² values of the three models (i.e., the benchmark, analysis, and null models). The log-likelihood is the logarithm of the likelihood; that of the benchmark model (ell_B in the figure) is the largest (i.e., closest to 0), while that of the null model (ell_N) is the smallest (most negative). The χ² value evaluates the difference in fit between the analysis and benchmark models, defined as twice the log-likelihood difference between the two models:

$$ \chi_{Aj}^2=2\{ell_B(\boldsymbol{u}_j|\boldsymbol{p}_j)-ell_A(\boldsymbol{u}_j|\hat{\lambda}_j)\},\qquad
df_{Aj}=\mathrm{row}(\boldsymbol{p}_j)-\mathrm{row}(\hat{\lambda}_j)=G-\mathrm{row}(\hat{\lambda}_j). $$

⁴⁶ Note that a scalar is a matrix with one row and one column.

The analysis model for Item j is regarded as better as χ²_Aj decreases. In addition, df_Aj refers to the DF for Item j. The DF is the difference in the number of parameters used by the two models under comparison, and it represents the difference in their flexibility. A small df_Aj indicates that nearly as many parameters are used in the analysis model as in the benchmark model, and such an analysis model is not necessarily considered acceptable even if χ²_Aj is small. Generally, an analysis model is regarded as valid when df_Aj is large (i.e., the number of parameters of the analysis model is small) and χ²_Aj is small (i.e., the log-likelihood of the analysis model is close to that of the benchmark model).

To examine the difference in fit between the analysis and null models, the following statistics are used:

$$ \chi_{A/N,j}^2=2\{ell_A(\boldsymbol{u}_j|\hat{\lambda}_j)-ell_N(\boldsymbol{u}_j|p_j)\},\qquad df_{A/N,j}=\mathrm{row}(\hat{\lambda}_j)-1. $$

As the former value is larger, the analysis model fits the data better, because a large value means the log-likelihood of the analysis model is far from that of the "very poorly fitting" null model. However, χ²_{A/N,j} is rarely referred to. In addition, the difference in fit between the benchmark and null models can be examined by the following statistics:

$$ \chi_{Nj}^2=2\{ell_B(\boldsymbol{u}_j|\boldsymbol{p}_j)-ell_N(\boldsymbol{u}_j|p_j)\},\qquad df_{Nj}=\mathrm{row}(\boldsymbol{p}_j)-1=G-1. $$

These are the χ² values and DFs for Item j. The χ² value of the entire model (i.e., the test) is simply the sum of the χ²_Aj over all items, as shown in the following box; similarly, the DF of the entire model is simply the sum of the df_Aj over all items.
χ² and df of Analysis and Null Models

Analysis Model
$$ \chi_A^2=\sum_{j=1}^{J}\chi_{Aj}^2=2\sum_{j=1}^{J}\{ell_B(\boldsymbol{u}_j|\boldsymbol{p}_j)-ell_A(\boldsymbol{u}_j|\hat{\lambda}_j)\} $$
$$ df_A=\sum_{j=1}^{J} df_{Aj}=\sum_{j=1}^{J}\{\mathrm{row}(\boldsymbol{p}_j)-\mathrm{row}(\hat{\lambda}_j)\}=\sum_{j=1}^{J}\{G-\mathrm{row}(\hat{\lambda}_j)\} $$

Null Model
$$ \chi_N^2=\sum_{j=1}^{J}\chi_{Nj}^2=2\sum_{j=1}^{J}\{ell_B(\boldsymbol{u}_j|\boldsymbol{p}_j)-ell_N(\boldsymbol{u}_j|p_j)\} $$
$$ df_N=\sum_{j=1}^{J} df_{Nj}=\sum_{j=1}^{J}\{\mathrm{row}(\boldsymbol{p}_j)-\mathrm{row}(p_j)\}=\sum_{j=1}^{J}(G-1) $$

These values are not for an individual Item j but for the whole test.
Table 4.8 shows the (expected) log-likelihoods of the null and benchmark models, as well as the ELLs, χ 2 values, and DFs for the two analysis models (i.e., 2PLM and 3PLM). As shown in the first row (Test) in the table, the ELLs of the four models show the following inequality: ell N < ell 2P L M < ell 3P L M < ell B . This is the order of the (badness of) fit. As expected, the fit of the null model was the worst and that of the benchmark model was the best. This is because the number of parameters of the null model was the smallest and that of the benchmark model was the largest. A model with a larger number of parameters fits the data more flexibly. The difference in log-likelihood between the 2PLM and 3PLM was slight. In the null model, the number of parameters for the entire test is 15, which is given by 1 (the number of parameters per item) × 15 (items). In addition, those of the 2PLM and 3PLM are 30 (= 2 × 15) and 45 (= 3 × 15), respectively. Furthermore, that of the benchmark model is 210, that is, 14 (groups) × 15 (items). The number of items in the test being analyzed is 15; thus, the maximum possible number of NRS patterns (i.e., groups) is 16 {0, 1, . . . , 15}. However, as there were no students with NRS 0 and 1, the number of observed NRS groups was 14. Accordingly, the DFs of the 2PLM and 3PLM were obtained as 180 (= 210 − 30) and 165 (= 210 − 45), respectively. The DF is also regarded as a measure of parsimony. When the DF is large, the analysis model has small number of parameters, and such a parameter-saving model is therefore said to be parsimonious.
4.6 Model Fit
147
Table 4.8 Log-likelihood and χ 2 of the benchmark, null, and analysis models BM∗
Null Model
ell B
ell N
ell2P L M
χ2
df
ell3P L M
χ2
df
−3560.01
−4350.22
−3897.23
674.46
180
−3890.32
660.64
165
Item 01
−240.19
−283.34
−264.21
48.05
12
−263.92
47.46
11
Item 02
−235.44
−278.95
−253.63
36.38
12
−254.33
37.79
11
Item 03
−260.91
−293.60
−281.58
41.34
12
−281.53
41.25
11
Item 04
−192.07
−265.96
−205.86
27.58
12
−204.89
25.63
11
Item 05
−206.54
−247.40
−232.75
52.42
12
−233.07
53.07
11
Item 06
−153.94
−198.82
−174.52
41.16
12
−174.47
41.06
11
Item 07
−228.38
−298.35
−252.54
48.31
12
−251.40
46.03
11
Item 08
−293.23
−338.79
−314.44
42.43
12
−315.72
45.00
11
Item 09
−300.49
−327.84
−325.10
49.22
12
−322.53
44.07
11
Item 10
−288.20
−319.85
−309.74
43.08
12
−310.01
43.63
11
Item 11
−224.09
−299.27
−251.29
54.41
12
−248.56
48.94
11
Item 12
−214.80
−293.60
−240.58
51.57
12
−239.02
48.44
11
Item 13
−262.03
−328.40
−292.50
60.93
12
−294.37
64.69
11
Item 14
−204.95
−273.21
−224.67
39.44
12
−223.70
37.50
11
Item 15
−254.76
−302.85
−273.83
38.13
12
−272.80
36.08
11
Model Statistic Test
∗
2PLM
3PLM
benchmark model
The χ 2 value of the 3PLM was slightly smaller than that of the 2PLM, which means that the ELL of the 3PLM is slightly closer to that of the benchmark model than the 2PLM. In short, the 3PLM better fits the data. However, the ELL difference between the two models is only approximately 14, compared to the DF difference equal to 15. In general, if the χ 2 value decreases (i.e., improves) by 2 upon adding one parameter, the parameter is considered worth adding (see Sect. 4.6.5, p. 150). Although the number of parameters in the 3PLM is 15 more than that of the 2PLM, the χ 2 value has not improved by 30 (= 15 × 2). There is thus a possibility that the 3PLM does not work effectively. A reason for this is that the sample size of the data (J15S500) is 500, which is a little small for the 3PLM to be applied. In general, the larger the sample size, the larger the improvement in the χ 2 value by applying the 3PLM. In addition, referring to the χ 2 values for each item, those of Items 1, 3, 4, etc. are smaller for the 3PLM, and those of Items 2, 5, 8, etc. are smaller for the 2PLM. In general, as the sample size increases, the χ 2 values of the 3PLM become smaller for all items. Note that it is not necessary to apply the same IRT model to all items. The model to be applied may vary by item (Shojima, 2007; Shojima & Toyoda, 2004; Thissen, 1991).
148
4 Item Response Theory
4.6.4 Goodness-of-Fit Indices Using the χ 2 statistics and DFs introduced above, we can consider various goodnessof-fit indices, such as those used in SEM. They are summarized in Standardized Fit Indices . The meaning of “standardized” is that each index has an upper and lower limit. Of these indices, all except the RMSEA fall in [0, 1], and as their values approach 1, the better the fitness indicated of the analysis model to the data. Meanwhile, the range of the RMSEA is [0, ∞), and the fit of the analysis model is judged to be better as its value approaches 0. Although the indices have different formulations, each index basically compares the χ 2 values of the analysis and null models. Standardized Fit Indices Normed Fit Index (NFI; Bentler & Bonett, 1980)∗1 χ A2 j NFI j = 1 − 2 (∈ [0, 1]) χN j Relative Fit Index (RFI; Bollen, 1986)∗1 χ A2 j /d f A j RFI j = 1 − 2 (∈ [0, 1]) χ N j /d f N j Incremental Fit Index (IFI; Bollen, 1989a)∗1 χ A2 j − d f A j IFI j = 1 − 2 (∈ [0, 1]) χN j − d f A j Tucker-Lewis Index (TLI; Bollen, 1989a)∗1 χ A2 j /d f A j − 1 (∈ [0, 1]) TLI j = 1 − 2 χ N j /d f N j − 1 Comparative Fit Index (CFI; Bentler, 1990)∗1 χ A2 j − d f A j CFI j = 1 − 2 (∈ [0, 1]) χN j − d f N j Root Mean Square Error of Approximation (RMSEA; Browne & Cudeck, 1993)∗2 % χ A2 j − d f A j RMSEA j = (∈ [0, ∞)) d f A j (S − 1) ∗1 Larger
values closer to 1.0 indicate a better fit. values closer to 0.0 indicate a better fit.
∗2 Smaller
4.6 Model Fit
149
Table 4.9 Fit indices for the 2PLM and 3PLM 2PLM
Model
3PLM
Index
NFI
RFI
IFI
TLI
CFI
RM∗
NFI
RFI
IFI
TLI
CFI
RM
Test
.573
.538
.647
.613
.643
.074
.582
.506
.650
.577
.642
.078
Item 01
.443
.397
.515
.467
.508
.078
.450
.350
.516
.412
.503
.082
Item 02
.582
.547
.675
.643
.671
.064
.566
.487
.648
.572
.638
.070
Item 03
.368
.315
.450
.393
.440
.070
.369
.254
.444
.318
.423
.074
Item 04
.813
.798
.885
.875
.884
.051
.827
.795
.893
.872
.891
.052
Item 05
.359
.305
.420
.363
.412
.082
.351
.233
.405
.277
.388
.088
Item 06
.541
.503
.625
.588
.620
.070
.543
.459
.618
.537
.608
.074
Item 07
.655
.626
.716
.690
.714
.078
.671
.611
.728
.674
.724
.080
Item 08
.534
.496
.615
.578
.611
.071
.506
.416
.576
.486
.565
.079
Item 09
.100
.025
.128
.033
.107
.079
.194
.048
.243
.063
.207
.078
Item 10
.319
.263
.394
.331
.382
.072
.311
.185
.376
.233
.351
.077
Item 11
.638
.608
.693
.666
.691
.084
.675
.615
.728
.674
.724
.083
Item 12
.673
.646
.728
.704
.726
.081
.693
.637
.745
.694
.741
.083
Item 13
.541
.503
.595
.557
.591
.090
.513
.424
.559
.470
.552
.099
Item 14
.711
.687
.780
.759
.778
.068
.725
.675
.789
.746
.785
.069
.604
.571
.690
.660
.686
.066
.625
.557
.706
.644
.698
.068
Item 15 ∗ RMSEA
(root mean square error of approximation)
The above box shows the fit indices to the data vector of Item j. If the indices are calculated using the χ 2 values and DFs of the entire test, the goodness of fit of the entire model to the data matrix can be obtained. Table 4.9 shows the fit indices for this analysis. In the SEM context, the five indices from the NFI to CFI indicate a good fit when they are ≥ 0.9 or ≥ 0.95, while the RMSEA has good fit when ≤ 0.06. However, these are the standards used in SEM; they are used only as reference in test data analysis because the null and benchmark (or saturated) models assumed in the SEM are different from those assumed here. In addition, although standardized, the fit indices are not absolute but relative; thus, they are improved if setting a worse-fitting benchmark and a worse-fitting null models. The benchmark model assumed here is a very well-fitting model; thus, it is difficult for the NFI–CFI to be ≥0.9 and for the RMSEA to be ≤ 0.06. Moreover, unlike SEM, we have no choice but to accept the present analysis model in reality, even if the model is poor-fitting, as there are few alternative models. If one wants to improve the analysis model, the candidates include the multidimensional IRT model (Reckase, 2009), multigroup IRT model (Zimowski et al., 2003), and latent class IRT model (Rost, 1990; 1991), which are theoretically very attractive. However, they are not used in practice.
150
4 Item Response Theory
4.6.5 Information Criteria In contrast to the standardized indices, the information criterion is an unstandardized index; thus, its value has no upper and lower limits. Thus, this index is used for comparing two or moremodels. Many information criteria have been proposed. The three criteria shown in Information Criteria are frequently used in SEM. The box shows them for an individual item. Those for the entire test can be obtained with the χ 2 value and DF of the entire test (i.e., χ A2 and d f A ). Note that the lower the information criterion, the better the model fit is to the model’s complexity. Information Criteria Akaike Information Criterion (AIC; Akaike, 1987) AIC j = χ A2 j − 2d f A j Consistent AIC (CAIC; Bozdogan, 1987) CAIC j = χ A2 j − d f A j ln(S + 1) Bayesian Information Criterion (BIC; Schwarz, 1978) BIC j = χ A2 j − d f A j ln S A lower value indicates a better fit.
Among these, the AIC is the simplest. As shown previously, χ A2 represents the distance between the analysis and benchmark models. When χ A2 is small, the analysis model is close to the benchmark model, which means the analysis model is wellfitting. Meanwhile, it has also been mentioned previously that the DF is a measure of parsimony. When d f A is large, the number of parameters in the analysis model is small, which means the analysis model is parsimonious. As the top part in Fig. 4.25 shows, in general, a well-fitting (i.e., small χ A2 ) model is likely to use more parameters (i.e., large d f A ). This is because a model with a larger number of parameters can more flexibly fit the data. That is, a better-fitting model is inclined to be less parsimonious (i.e., more wasteful) and thus has a smaller DF. On the other hand, as the second part of the figure shows, a parsimonious model is simple because it uses fewer parameters, but consequently, such a model cannot flexibly fit the data so that its χ 2 is likely to be large. If a model is well-fitting (small χ A2 ) and parsimonious (large d f A ), the AIC of the model becomes small. The AICs of three models are compared in Fig. 4.26. Model 1 is well-fitting so that χ12 is small, but the model is parameter-wasting so that d f 1 is small. Thus, there is a possibility that Model 1 overfits this set of data and would not fit well when similar data are observed. Therefore, the AIC is often referred to as an index for evaluating the fit of future data.
4.6 Model Fit
151
Fig. 4.25 Concept of Akaike information criterion
Fig. 4.26 Model comparison by AIC
Model 2 is parsimonious (i.e., simple) because its number of parameters is small (i.e., d f 2 is large), but due to this parsimony, Model 2 is poor-fitting (i.e., χ22 is large). Model 3 is neither the best-fitting nor the most parsimonious but is balanced so that its AIC is the lowest of the three models. Therefore, Model 3 is the most favored in terms of the AIC. Table 4.10 shows the three information criteria for the 2PLM and 3PLM. The AIC of the 2PLM is smaller than that of the 3PLM, which means that the AIC favors the 2PLM. The CAIC and BIC are also smaller for the 2PLM; thus, the two indices also agree with the AIC. From the definitions of the CAIC and BIC, as shown in
152
4 Item Response Theory
Table 4.10 Information criteria for the 2PLM and 3PLM Model Index Test Item 01 Item 02 Item 03 Item 04 Item 05 Item 06 Item 07 Item 08 Item 09 Item 10 Item 11 Item 12 Item 13 Item 14 Item 15
AIC 314.46 24.05 12.38 17.34 3.58 28.42 17.16 24.31 18.43 25.22 19.08 30.41 27.57 36.93 15.44 14.13
2PLM BIC CAIC −444.53 −444.17 −26.55 −26.53 −38.22 −38.20 −33.26 −33.23 −47.01 −46.99 −22.17 −22.15 −33.44 −33.42 −26.29 −26.26 −32.17 −32.15 −25.38 −25.36 −31.52 −31.49 −20.19 −20.16 −23.03 −23.01 −13.67 −13.64 −35.16 −35.14 −36.47 −36.45
3PLM CAIC AIC 330.64 −365.10 25.46 −20.92 15.79 −30.59 19.25 −27.14 3.63 −42.75 31.07 −15.32 19.06 −27.33 24.03 −22.35 23.00 −23.38 22.07 −24.31 21.63 −24.75 26.94 −19.44 26.44 −19.94 42.69 −3.70 15.50 −30.88 14.08 −32.30
BIC −364.77 −20.90 −30.57 −27.11 −42.73 −15.30 −27.30 −22.33 −23.36 −24.29 −24.73 −19.42 −19.92 −3.67 −30.86 −32.28
Information Criteria , their values are nearly equal when the sample size is large. Thus, it is sufficient to report either one of them (usually the BIC). Comparing the AICs of the 2PLM and 3PLM for each item, the 2PLM is preferred for Items 7, 9, 11, 12, and 15, while the 3PLM is preferred for the other items. However, the CAIC and BIC prefer the 2PLM for all items. In this way, the favored model may differ by information criterion. The coefficients of the DF for the AIC and BIC are different; the former is 2 and the latter ln S. When S ≥ 8, ln S > 2, and the sample size is usually S ≥ 8. Thus, the BIC is more parsimony-oriented than the AIC. In other words, when model selection is performed by the BIC (or CAIC), a simpler model is more likely to be chosen.
4.7 Chapter Summary The largest theoretical leap from the CTT to IRT is the invention of the IRF, that is, a function of ability parameter θ that can represent the CRR characteristics of each individual item. The CTT has not been able to express the characteristics of each item to this extent. While the basic structure of the IRF is simply the logistic function with a few item parameters (the slope, location, lower asymptote, and upper asymptote parameters) built in, the IRF can express a CRR curve of various types of items such as easy, difficult, low-discriminating, and high-discriminating items.
4.7 Chapter Summary
153
Fig. 4.27 Equating by concurrent calibration
Because the features of each item can be captured, when a test is edited as a collection of items, the features of the test can be derived as the sums of the features of the items. The TRF and TIF are the sum of the IRFs and IIFs of all items, respectively. In addition, with the IRT, it is easy to edit parallel tests, that is, two or more tests having almost the same performance. When one wants to conduct a two-wave survey to measure the academic growth of students, the same test cannot be used at the two waves. The test at the second wave should have the same performance but not contain a single item used at the first wave, for otherwise it will become a mere memory-retention test. If the two tests are edited so that their TIFs almost overlap, they can be regarded to be parallel. Moreover, the IRT makes it easy to compare two or more different tests. The procedure to make tests comparable is called the test equating (Gonzalez and Wiberg, 2017; Kolen & Brennan, 2014; von Davier, 2012). By repeating the procedure, all items in equated tests become comparable on the same θ scale, and it can then be said that a huge item bank is constructed. With a larger item bank, it is easier to create parallel tests. Figure 4.27 illustrates the concurrent calibration method (Hanson & Béguin, 2002; Kim & Cohen, 1998, 2002), which is currently the most commonly used equating method due to its simplicity. When new items need to be stored in an existing item bank, a test is created including the new items and some anchoring items. An anchoring item is an item that already stored in the bank; thus, its item parameters are known. The concurrent calibration method then estimates the item parameters of only the new items while keeping the item parameters of the anchor items fixed. For this reason, the item parameter estimates of the new items are equated and obtained as those on the ability scale of the bank, θbank . The larger the number of anchor items, the more reliable equating is conducted. A large item bank is advantageous when administering a computer adaptive test (CAT), in which presented items may vary by student according to his or her ability level. Thus, in a CAT, each student is inevitably presented with a number of items
154
4 Item Response Theory
that are appropriate for his or her ability level. Unless the bank stores a large number of items at each ability level, it will quickly run out of unexposed items.
Chapter 5
Latent Class Analysis
Let us suppose that a test is to be administered to all sophomores in a high school. On surface, it will seem like only one group is scheduled to take the test. In reality, several subgroups with different features were included in the group. For instance, the group is a mixture of males and females of different ages and ethnicities. In addition, some students belong to sports clubs, such as football, baseball, and tennis clubs, whereas others belong to cultural clubs, such as music, drama, science, and calligraphy clubs. In other words, high-school sophomores are not a single group but a mixture of subgroups. A subgroup that is already known to the test administrator is referred to as the manifest class or manifest group. The manifest classes of a student are obtained as additional information when the student takes the test (enrolling in the school). These are qualitative variables that represent student features. For example, gender is a nominal variable, and the Common European Framework of Reference for Languages (CEFR) is an ordinal variable of A, B, and C.1 It is very informative to analyze data by subgroups, which is referred to as multigroup analysis or subgroup analysis. However, some sophomore subgroups may be mixed, which may not be obvious to the test administrator (Fig. 5.1). For example, some sophomores have divorced parents and others do not, and some sophomores are self-supporting students, whereas others are not. Such information is students personal or private information; thus, the test administrator does not usually acquire it, although the teachers in the school may be aware of it. Such socioeconomic subgroups can be treated as a nominal variable; for example, the students with divorced parents are coded 1 and the others 0, but the variable is hidden from the test administrator and analyst. A psychological status, in which introverted students are coded 0 and extraverted students 1, is also an implicit variable. Such an unobserved and implicit status is called a latent class. Latent class analysis (LCA; e.g., Bandeen-Roche et al., 1997; Bartholomew, 1987; Clogg, 1995; Everitt & Hand, 1981; Hagenaars & McCutcheon, 2002; In more detail, the CEFR ranks language learners into six categories: A1 , A2 , B1 , B2 , C1 , and C2 . © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 155 K. Shojima, Test Data Engineering, Behaviormetrics: Quantitative Approaches to Human Behavior 13, https://doi.org/10.1007/978-981-16-9986-3_5
1
156
5 Latent Class Analysis
Fig. 5.1 Mixture of implicit classes (latent classes)
Fig. 5.2 Classification by latent class analysis
McLachlan & Peel, 2000) is a method that assumes that there are several hidden subgroups (i.e., latent classes) in a given single group and extracts the latent classes. However, it does not disclose what kind of group each extracted class is. Suppose that an LCA analysis reports that the number of latent mixed classes mixed is three, but the analysis does not reveal the features of each class; for example, Class 1 students are females, Class 2 students are introverted, and Class 3 students live by the sea. As shown in Fig. 5.2, the LCA simply classifies students into a smaller number of groups (i.e., classes) according to the homogeneity or similarity of their response patterns. In this respect, the purpose of this method (i.e., classification of students) is
5.1 Class Membership Matrix
157
the same as that of cluster analysis (e.g., Everitt et al., 2011; Wierzcho´n & Kłopotek, 2018) and self-organizing map (e.g., Kohonen, 1995; Oja & Kaski, 1999; Sangole, 2009). To clarify what kind of students belong to each class, an ex-post analysis following the LCA is required. Furthermore, LCA is the basis for a cognitive diagnostic model (CDM; Leignton & Gierl, 2007; Nichols et al., 1995; Rupp, et al., 2010; Tatsuoka, 2009), latent rank analysis (LRA; Chap. 6, p. 191), and biclustering (Chap. 7, p. 259).
5.1 Class Membership Matrix Suppose that the number of mixed latent classes is known, and it is C. The method for determining the optimal number of classes from the data is described in Sect. 5.6 (p. 185). Let m sc be a dichotomous indicator coded as 1 when Student s belongs to Class c ∈ NC ; otherwise, it is coded as 0. That is, m sc =
1, if Student s belongs to Class c 0, otherwise
.
In addition, ⎤ m 11 · · · m 1C ⎥ ⎢ M C = ⎣ ... . . . ... ⎦ (S × C) m S1 · · · m SC ⎡
is referred to as the class membership matrix, where all m sc s are collected. The s-th row vector of this matrix, ⎤ m s1 ⎥ ⎢ ms = ⎣ ... ⎦ , ⎡
m sC is the class membership profile of Student s. Note that this is a vertical vector after it is extracted from the class membership matrix. Under the condition that dual membership is not allowed, the class membership profile of each student has only one 1, and the other entries are 0. That is, 1C ms = 1. Class membership matrix M C is unknown in the LCA. If it was known, we could designate which latent class each student belonged to. Thus, estimating M C is one of the major objectives of the LCA.
158
5 Latent Class Analysis
5.2 Class Reference Matrix The LCA supposes that the correct response rate (CRR) of a student for Item j changes depending on the latent class to which the student belongs. Let π jc denote the CRR of a student belonging to Class c for Item j. That is, Pr [u s j = 1|m sc = 1] = π jc .
(5.1)
Then, ⎤ π1C .. ⎥ = {π } (J × C) jc . ⎦
⎡
π11 · · · ⎢ .. . . C = ⎣ . . πJ1 · · ·
πJC
is referred to as the class reference matrix, which includes all π jc s (∀ j ∈ N J ; ∀c ∈ NC ). This matrix is also unknown and is estimated in the process of the LCA analysis. The c-th column vector obtained from the class reference matrix ⎡ ⎤ π1c ⎢ ⎥ (5.2) π c = ⎣ ... ⎦ (J × 1) πJc
is referred to as the class reference vector of Class c. This vector contains all the item CRRs of students belonging to Class c. If Class c is a high-ability group, the elements in π c must be closer to 1 compared with the other classes. In addition, the j-th row vector in class reference matrix C , ⎤ π j1 ⎥ ⎢ π j = ⎣ ... ⎦ (C × 1), ⎡
(5.3)
π jC
is the item reference profile (IRP) of Item j.2 The IRP of Item j lists the CRRs of all latent classes for the item.
5.3 The EM Algorithm The parameters in the LCA are class reference matrix C and class membership matrix M C , which are unknown before the analysis. That is, the LCA can be regarded as a function for estimating C and M C from data matrix U. 2
Note that this is a vertical vector.
5.3 The EM Algorithm
159
The number of parameters in class membership matrix M C is SC, which depends on the sample size. Thus, M C is a nuisance-parameter matrix. In contrast, the number of parameters in class reference matrix C is J C, which does not change with the sample size, and C , thus, is a structural parameter matrix. If a model contains both structural and nuisance parameters, the EM algorithm (Dempster et al., 1977; see Sect. 4.5.1, p. 116) is useful for parameter estimation. Summary of EM Algorithm 1. It repeats a cycle of E (expectation) and M (maximization) steps. 2. In each E step, the expected log-posterior (ELP) density is constructed by marginalizing the nuisance parameters. 3. In each M step, the MAPs of the structural parameters are obtained by maximizing the ELP. As indicated by Point 2 above, the expected log-posterior (ELP) density is constructed in each E step by marginalizing the nuisance parameters (i.e., class membership matrix M C ). In addition, for Point 3, the structural parameters (i.e., class reference matrix C ) are estimated at each M step by maximizing the ELP. Moreover, let the initial value of the ( j, c)-th element in the class reference matrix be π (0) jc =
c (∀ j ∈ N J ). C +1
For example, when the number of classes is five (C = 5), the initial value of the class reference vector of Item j is ⎡
π (0) j
⎤ 1/6 ⎢2/6⎥ ⎢ ⎥ ⎥ =⎢ ⎢3/6⎥ (∀ j ∈ N J ). ⎣4/6⎦ 5/6
5.3.1 Expected Log-Likelihood The purpose of the E step is to construct the ELP of the structural parameters (i.e., class reference matrix C ). ELP is the sum of the expected log-likelihood (ELL) and log-prior density of the structural parameters, as follows: Expected log-posterior = Expected log-likelihood + Log-prior of C . First, the ELL is explained in this section.
(5.4)
160
5 Latent Class Analysis
At the t-th E step, the estimate of the class reference matrix at the (t − 1)-th M step, C(t−1) , has already been obtained. Under this condition, if Student s belongs to Class c, the likelihood of observing the data vector of Student s, us , is expressed as l(us |π (t−1) ) c
=
J
{(π (t−1) )u s j (1 − π (t−1) )1−u s j }zs j jc jc
∈ [0, 1].
(5.5)
j=1
Likelihood is the occurrence probability of data (herein, us ) constructed using param). This equation can be calculated as a real number between [0, 1] eters (herein, π (t−1) c because unknown parameters are not included in it. Thus, given the class reference matrix obtained at the (t − 1)-th M step, C(t−1) , the posterior distribution of the class membership for Student s at the t-th E step is obtained as ⎡
π1l(us |π (t−1) ) 1
⎤
⎢ C ⎥ ⎢ c=1 πc l(us |π (t−1) )⎥ c ⎢ ⎥ (t−1) ⎢ π2 l(us |π 2 ) ⎥ ⎢
⎥ ⎢ C π l(u |π (t−1) ) ⎥ = =⎢ ⎢ ⎥ (C × 1), ⎥ c c s .. ⎥ ⎣ ⎦ ⎢ c=1 .. . ⎢ ⎥ . (t−1) ⎢ ⎥ pr (m sC |us , C ) ⎢ ⎥ (t−1) ⎣ πC l(us |π C ) ⎦
C (t−1) ) c=1 πc l(us |π c ⎡
m(t) s
⎤ pr (m s1 |us , C(t−1) ) ⎢ pr (m s2 |us , (t−1) ) ⎥ C ⎢ ⎥
where πc is the prior probability of belonging to Class c. Unless one has a specific idea about this probability, its value can be set as πc =
1 (∀c ∈ NC ). C
In addition, pr (m sc |us , C(t−1) ) is the posterior probability at the t-th E step that Student s belongs to Class c. All the elements in m(t) s are obtained within the range of [0, 1], and the sum of the elements is 1. Hereafter, it is simply denoted as ⎡
m(t) s
⎤ m (t) s1 ⎢m (t) ⎥ ⎢ s2 ⎥ = ⎢ . ⎥ (C × 1). ⎣ .. ⎦ m (t) sC
By calculating m(t) s for all students, we have the class membership matrix at the t-th (t) E step, M C . An example of an estimate of the class membership profile is presented in Fig. 5.3. The size of the data used in the example (J15S500) was 500 (students) × 15 (items). Figure 5.3 shows the class membership profile of Student 1 at the first E step under the condition that the number of latent classes was five. At this point, the probability
5.3 The EM Algorithm
161
Fig. 5.3 Class membership profile of Student 1 at the first E step
that Student 1 belongs to Class 3 is the highest at approximately 50%, followed by Class 4 at about 30%. However, this class membership is successively updated as the EM cycle is repeated. Furthermore, in the E step, the ELL is constructed by marginalizing the loglikelihood of the structural parameters (i.e., C ) over the posterior distribution of the nuisance parameters (i.e., M C ). First, the log-likelihood of the belonging of Student s to Class c is given as ll(us |π c ) =
J
z s j {u s j ln π jc + (1 − u s j ) ln(1 − π jc )}.
(5.6)
j=1
In this equation, only π c was unknown. Taking the expectation of the log-likelihood with respect to the posterior distribution of the class membership m(t) s , we have ell(us |C ) =
C
pr (m sc |us , C(t−1) )ll(us |π c )
c=1
=
C
m (t) sc
c=1
J
z s j {u s j ln π jc + (1 − u s j ) ln(1 − π jc )},
j=1
where C(t−1) is known; however, C is unknown. Thus, the ELL of U was constructed as follows: ell(U|C ) =
S
ell(us |C )
s=1
=
S C J s=1 c=1 j=1
m (t) sc z s j {u s j ln π jc + (1 − u s j ) ln(1 − π jc )}.
(5.7)
162
5 Latent Class Analysis
5.3.2 Expected Log-Posterior Density The estimate of the class reference matrix that maximizes the ELL of Eq. (5.7) is the MLE of C at the t-th M step, C(ML,t) . In practice, each (item × class) term can be maximized with respect to π jc because each ( j, c)-th term in the ELL is mutually disjointed. Taking the terms of only π jc from Eq. (5.7), they are given as ell(u j |π jc ) =
S
m (t) sc z s j {u s j ln π jc + (1 − u s j ) ln(1 − π jc )}
s=1
= U1(t)jc ln π jc + U0(t)jc ln(1 − π jc ),
(5.8)
where U1(t)jc =
S
m (t) sc z s j u s j ,
s=1
U0(t)jc =
S
m (t) sc z s j (1 − u s j )
s=1
can be regarded as the numbers of correct and incorrect responses by the students belonging to Class c for Item j (at the t-th E step). All U1(t)jc s and U0(t)jc s are obtained by the following matrix operations: (t) (t) (J × C), U (t) 1 = {U1 jc } = (Z U) M C (t) (t) U (t) (J × C). 0 = {U0 jc } = {Z (1 S 1 J − U)} M C
In addition, the density function of the beta distribution is used as the prior density of π jc , as it is the conjugate prior density3 in this case. The beta density is defined as β −1
pr (π jc ; β0 , β1 ) =
π jc1 (1 − π jc )β0 −1 B(β0 , β1 )
,
where β0 and β1 are the hyperparameters and are assumed to be known, for example, (β0 , β1 ) = (1, 1).4 For details of the beta density, refer to Sect. 4.5.3 (p. 119). In addition, the logarithm of the beta density is ln pr (π jc ; β0 , β1 ) = (β1 − 1) ln π jc + (β0 − 1) ln(1 − π jc ) + const. 3
A conjugate density is a prior density such that the posterior density has the same functional form as the prior. 4 The hyperparameters can be treated as unknown in hierarchical Bayesian estimation.
5.3 The EM Algorithm
163
Thus, from Eq. (5.4) (p. 159), the ELP at the t-the E step is given as ln pr (π jc |u j ) = ell(u j |π jc ) + ln pr (π jc ; β0 , β1 ) = (U1(t)jc + β1 − 1) ln π jc + (U0(t)jc + β0 − 1) ln(1 − π jc ) + const. (5.9)
5.3.3 Maximization Step In the t-th E step, class membership matrix M C(t) was calculated, and then the ELP of Eq. (5.9) was constructed. Next, in the t-th M step, C = {π jc } that maximizes the ELP is found. Thus, the estimation process in the LCA alternately updates the class membership matrix and class reference matrix as follows: EM Cycle 1
C(0)
EM Cycle 2
(1) (1) (2) (2) → M C → C → M C → C → · · ·
A pair of steps E and M is referred to as an EM cycle. Figure 5.4 shows the ELP of π11 (i.e., the CRR of Class 1 students for Item 1) in the first M step. This function peaks when π11 = 0.327; thus, this value 0.327 is the (MAP,1) . MAP of π11 at the first M step, π11 In general, π jc , which maximizes Eq. (5.9), is obtained by first taking the derivative of Eq. (5.9) with respect to π jc and then by solving the derivative equal to 0 with respect to π jc .5 By differentiating Eq. (5.9) and setting the result to 0, we obtain
Fig. 5.4 Expected log-posterior of π11 to be maximized at the first M step
5
The derivative of the function represents the slope of the function. When the slope is 0, the value of the function is an extremum (the global maximum in this case).
164
5 Latent Class Analysis
U1(t)jc + β1 − 1 U0(t)jc + β0 − 1 d ln pr (π jc |u j ) = − = 0. dπ jc π jc 1 − π jc Then, by solving the equation with respect to π jc , the MAP of π jc at the t-th M step is obtained as follows: π (MAP,t) = jc
U1(t)jc + β1 − 1 U0(t)jc + U1(t)jc + β0 + β1 − 2
.
(5.10)
If no prior is used, ELP is reduced to ELL. Thus, by differentiating the ELL from Eq. (5.8) with respect to π jc , setting the derivative equal to 0, and solving it with respect to π jc , we obtain the MLE of π jc at the t-th M step as follows: π (ML,t) = jc
U1(t)jc U1(t)jc + U0(t)jc
.
The MLE can also be regarded as an MAP when the hyperparameters are (β0 , β1 ) = (1, 1).
5.3.4 Convergence Criterion of EM Cycles The EM algorithm finds the MAP (or the MLE if not assuming the prior) of the structural parameters (i.e., class reference matrix C ) by repeating the EM cycles. At the end of each EM cycle, the ELP value can be computed as ln pr (C(t) |U) =
C J
ln pr (π (t) jc |u j )
j=1 c=1 J C (t) (t) = {(U1(t)jc + β1 − 1) ln π (t) jc + (U0 jc + β0 − 1) ln(1 − π jc )}. j=1 c=1
Each time the EM cycle is repeated, the ELL value increases as follows: ln pr (C(t) |U) > ln pr (C(t−1) |U).
(5.11)
However, as the cycles progress, the increment gradually decreases and eventually vanishes. Such a state is referred to as convergence, and the EM algorithm is terminated. For the criterion to judge the convergence, ln pr (C(t) |U) − ln pr (C(t−1) |U) < c| ln pr (C(t−1) |U)|
(5.12)
5.3 The EM Algorithm
165
ˆ C) Table 5.1 MAP estimate of class reference matrix ( Latent Class Item 1 Item 2 Item 3 Item 4 Item 5 Item 6 Item 7 Item 8 Item 9 Item 10 Item 11 Item 12 Item 13 Item 14 Item 15 Test
1 0.519 0.553 0.796 0.507 0.615 0.684 0.483 0.377 0.311 0.529 0.101 0.036 0.205 0.351 0.388 6.453
2 0.700 0.628 0.320 0.581 0.752 0.750 0.439 0.398 0.398 0.534 0.050 0.167 0.549 0.738 0.608 7.613
3 0.764 0.812 0.937 0.869 0.947 0.948 0.834 0.626 0.266 0.761 0.001 0.159 0.894 0.772 0.825 10.415
4 0.856 0.888 0.706 0.873 0.789 1.000 0.874 0.912 0.165 0.677 0.621 0.296 0.672 0.904 0.838 11.072
5 0.860 0.855 0.849 1.000 0.886 0.907 0.900 0.590 0.673 0.781 0.623 0.673 0.784 1.000 0.823 12.205
CRR∗ 0.746 0.754 0.726 0.776 0.804 0.864 0.716 0.588 0.364 0.662 0.286 0.274 0.634 0.764 0.706 9.664
*correct response rate
is used, where c is a very small constant, such as 10−4 = 0.0001 and 10−5 = 0.00001. The LHS of Eq. (5.12) represents the improvement of the t-th ELP compared with the (t − 1)-th ELP, which is positive from Eq. (5.11). Thus, the above equation implies that the EM algorithm is stopped when the improvement between the ELPs at the t-th and (t − 1)-th cycles is sufficiently small compared with the RHS (i.e., the magnitude of the ELP at the t-th cycle), which is generally negative; therefore, it is enclosed in an absolute value sign. Estimation Setting of LCA 1. 2. 3. 4. 5. 6.
Data: J15S500 (www) Number of latent classes: C = 5 Prior probability for the membership in the E steps: π = 1/5 × 15 Estimator in the M steps: MAP (maximum a posteriori) estimate Prior for π jc in the M Steps: Beta density with (β0 , β1 ) = (1, 1) Constant used in the convergence criterion: c = 10−4
Let us present an example of the analysis. The settings used for the analysis are summarized in the above box. The MAP is selected as the estimator, but it is noted that the MAP is identical to the MLE when (β0 , β1 ) = (1, 1). The EM algorithm converged at 73 iterations, which one might think was large, but the computation time was very short. ˆ C . For example, Table 5.1 shows the MAP estimate of class reference matrix the CRR for Item 1 by the students belonging to Class 1 is 51.9%, and that for Item 2 is 55.3%. The table also shows the CRR of each item by all students in the
166
5 Latent Class Analysis
ˆ C . For rightmost column. The bottom row (test) represents the columnar sum of example, the value for Class 1 is 6.453, which is the sum of [0.519 0.553 · · · 0.388]. The interpretation of the table is explained in detail in the next section.
5.4 Main Outputs of LCA This section introduces the main outputs obtained from the LCA. The results of the test data analysis were classified into two categories: results for the test and items and results for the students. The former will be referred by the test administrator or teacher to improve the test and future tests. In addition, the latter will be used to feed back to the students and teachers to improve their future studies and teaching.
5.4.1 Test Reference Profile Here, we first explain the test reference profile (TRP). The TRP is the column ˆ C . Thus, TRP is expressed as sum vector of the estimated class reference matrix, follows: ˆ C 1 J (C × 1) ˆt TRP = and is shown in the bottom row of Table 5.1. When each item has a weight, using the item weight vector, ⎤ w1 ⎢ ⎥ w = ⎣ ... ⎦ , ⎡
wJ
the TRP is computed by ˆ ˆt (w) TRP = C w (C × 1). The line graph in Fig. 5.5 represents the TRP. The c-th element of the TRP is the sum of the CRRs, that is, the (expected) NRS on the test, by the students in Class c. From Eq. (5.1), tTRP,c =
J j=1
Pr [u s j = 1|m sc = 1].
5.4 Main Outputs of LCA
167
Fig. 5.5 Test reference profile (LCA)
Thus, when the TRP of a latent class is large, the class can be regarded as a high-ability group. We can see from the figure that the students in Class 1 correctly responded to about six items in this 15-item test, and the students in Class 3 had about 10 items. As indicated by the monotonically increasing TRP, the five latent classes are sorted from low- to high-ability groups; however, this arrangement of the classes was not the ˆ C were sorted afterward same as that originally output by the LCA. The columns in so that the TRP increases monotonically and were relabeled as Classes 1 through 5. This is because sorting the classes according to the TRP can help us understand the results. Originally, there was no ordinality in latent classes. An LCA with ordering latent classes is referred to as latent rank analysis, which is introduced in Chap. 6 (p. 191). Moreover, the bars in Fig. 5.5 represent the numbers of students belonging to respective classes. The number of students judged to be belonging to Class 3 was the largest at 125, whereas that of Class 1 was the smallest at 87. The method of evaluating the class to which each student belongs is described in Sect. 5.4.3 (p. 169).
5.4.2 Item Reference Profile Next, the item reference profile (IRP) obtained for each item is explained. As shown by Eq. (5.3) (p. 158), the IRP of Item j is the j-th row vector in the class reference matrix, πˆ j . The IRPs of the 15 items are presented in Table 5.1 and illustrated in Fig. 5.6. The top left panel in the figure shows the IRP of Item 1, which is the plot of the numbers placed in the first row (Item 1) in Table 5.1. As the IRP of Item 1 shows, the CRR of the students in Class 1 was approximately 50%, whereas that of the students in Class 5 was approximately 80%. As the five classes are sorted according to their ability, the IRP of Item 1 increases monotonically. Similarly, the IRPs of the other items tend to increase in general as the class label number increases.
168
5 Latent Class Analysis
Fig. 5.6 Item reference profiles (LCA)
However, almost all of the items do not increase monotonically. For example, the IRP of Item 9 dropped the lowest in Class 4, although this class was the group with the second-highest ability among the five classes. Thus, we can see that Class 4 is a group of students who are weak in Item 9. Therefore, it would be a good idea to review the content of Item 9 for students who are deemed to belong to Class 4. To capture the characteristics of each latent class in detail, it is useful to inspect the plots of the class reference vectors (Eq. (5.2), p. 158), which are the column vectors of the class reference matrix (Table 5.1). Figure 5.7 shows the plots of the class reference vectors for the five classes. For example, the plot labeled 1 denotes
5.4 Main Outputs of LCA
169
Fig. 5.7 Class reference vector (----- correct response rate by all students)
the class reference vector of Class 1. In addition, the dashed plot represents the CRRs of the 15 items for all the students. By examining CRRs of low and high items regarding each class, the outline of what knowledge the class has acquired and what knowledge it has not yet acquired becomes clear. For example, we can see that Class 2 is a group of students who are particularly weak in Item 3 compared with other classes. If the test is a math test, Class 2 can be diagnosed as a group in which students generally have low ability and are not good at, for example, solving simultaneous equations (Item 3). It is important to work with subject-teaching experts to clarify what kind of group each latent class is.
5.4.3 Class Membership Profile ˆ C shown Along with updating the structural parameters (i.e., class reference matrix in Table 5.1) in the repetition of the EM cycles, the nuisance parameters (i.e., class membership matrix M C , Sect. 5.1) have also been updated. Let the EM algorithm be stopped at the T -th cycle; then, the class reference matrix ˆ C = C(T ) ).6 In the previous T -th E step, M C(T ) at the T -th M step is the estimate ( must be computed, which is the final update of the class membership matrix; thus, it ˆ C, m ˆ C = M C(T ) ). The s-th row vector of M ˆ s , is the is determined as the estimate ( M
6
(MAP,T )
It is C
(MAP,T )
if it is the MAP or C
if it is the MLE.
170
5 Latent Class Analysis
ˆ C) Table 5.2 Final estimate of class membership matrix ( M Latent Class Student 1 Student 2 Student 3 Student 4 Student 5 Student 6 Student 7 Student 8 Student 9 Student 10 Student 11 Student 12 Student 13 Student 14 Student 15 .. . Student 500 LCD∗2 CMD∗3 ∗1 ∗3
1 0.784 0.035 0.015 0.002 0.213 0.000 0.031 0.020 0.000 0.016 0.290 0.624 0.000 0.001 0.947 .. . 0.960 87 90.372
2 3 4 5 LCE∗1 0.171 0.004 0.041 0.000 1 0.052 0.836 0.078 0.000 3 3 0.105 0.802 0.033 0.045 4 0.023 0.330 0.366 0.280 0.784 0.001 0.000 0.001 2 4 0.001 0.001 0.873 0.124 0.112 0.000 0.071 0.786 5 0.011 0.001 0.969 0.000 4 5 0.002 0.149 0.064 0.785 2 0.740 0.145 0.099 0.000 4 0.150 0.004 0.557 0.000 1 0.068 0.294 0.014 0.000 0.001 0.002 0.804 0.194 4 0.003 0.000 0.000 0.996 5 1 0.050 0.003 0.000 0.000 .. .. .. .. .. . . . . . 1 0.037 0.002 0.000 0.000 97 125 91 100 97.105 105.238 102.800 104.484
latent class estimate, ∗2 latent class distribution, class membership distribution
class membership profile of Student s, namely the posterior probability distribution representing the student’s belonging to the respective latent classes. As mentioned above, 73 EM cycles were required until convergence was achieved. ˆ C = M C(73) is the estimate of the class membership matrix,7 as shown in That is, M Table 5.2, which contains the class membership profiles of all 500 students. From the table, for Student 1, the probability of belonging to Class 1 is the highest at 78.4%, followed by Class 2 at 17.1%; thus, the latent class estimate (LCE in the table) of the student is Class 1. Similarly, the LCE of Student 2 is Class 3, because the probability of belonging to Class 3 was the highest at 83.6%. By plotting the class membership profile, we can easily see the class to which each student belongs. Figure 5.8 shows the class membership profiles of the first 15 ˆ C . The top left panel is that of Student 1, which was obtained at the students of M ˆ 1 = m(73) 73rd E step (m 1 ). Compared with the profile obtained in the first E step (1) (m1 shown in Fig. 5.3, p. 167), we can see that it has changed significantly in the repetition of the EM cycles. In addition, the latent class estimates of Students 4 and 6 are the same (i.e., Class 4), but their probabilities of belonging to the class are different. It was 87.3% for Student 6, whereas it was 36.6% for Student 4, followed by the membership in Class 7
(74)
MC
(73)
(74)
can be obtained from C , but M C
(73)
is almost equal to M C
owing to convergence.
5.4 Main Outputs of LCA
171
Fig. 5.8 Class membership profiles
3 (33.0%). Thus, Class 4 for Student 6 can be regarded as more stable, even though their latent classes are the same. Therefore, the feedback of their results should differ. With the profile analysis of latent classes (Fig. 5.7), the class membership profile of each student can be used as useful information for feeding back diagnostic results to the student and for guiding his or her future study.
172
5 Latent Class Analysis
Fig. 5.9 Latent class distribution and class membership distribution
5.4.4 Latent Class Distribution As explained in the previous section, the latent class estimate of each student is the class with the largest membership, as shown in the rightmost column of Table 5.2. The latent class distribution (LCD) is a histogram of latent class estimates. The LCD of the present analysis is shown in Fig. 5.9.8 The number of members in Class 3 is the highest at 125 among the five classes, whereas that in Class 1 is the smallest at 87. By creating the LCD for each school or class (not latent class), we find that some schools/classes have more students belonging to low-ability latent classes, whereas other schools/classes do not. Accordingly, with reference to the rank reference vectors, the LCD is useful for improving the teaching method by class and for determining whether a supplementary lesson is conducted for each class and even for reorganizing the existing classes according to ability level. The line graph in the figure plots the class membership distribution (CMD), as shown in the bottom row of Table 5.2. CMD is the column sum vector of the class membership matrix; that is, ˆ C 1 S (C × 1). dˆ CMD = M The LCD simply totals the latent class estimates. For example, for Student 4, the frequency of Class 4 is added by 1 because the latent class estimate of the student is Class 4, even though membership in the class is only 36.6% (Fig. 5.8). However, in the CMD, the frequency for Class 4 is added by no more than 0.366, owing to the low certainty of the membership; similarly, the other frequencies are added by the respective memberships. Therefore, the CMD reflects the ambiguity of the latent class estimate for each student. Because the CMD considers such ambiguity, the CMD can be regarded as the LCD of the population behind the 500 students in the analyzed data. It is clear from the 8
The histogram in this figure is the same as that in Fig. 5.5 (p. 167).
5.4 Main Outputs of LCA
173
CMD plot shown in the figure that the value of every class is approximately 100. Thus, when analyzing population data, students are likely to be roughly equipartitioned into five classes. Generally, the between-class differences in frequencies under the CMD are reduced compared with the LCD.
5.5 Model Fit This section describes the model fit of the LCA to the data. The basic approach is the same as that used in item response theory (IRT; see Sect. 4.6, p. 139). First, we prepare a benchmark model that fits to the data better than the model under consideration (analysis model) and a null model that fits the data worse than the analysis model; the goodness of fit of the analysis model is then evaluated by examining where the fit of the analysis model is placed between the fits of the benchmark and null models (Fig. 5.10). The idea of evaluating the fit of the analysis model by sandwiching it between wellfitting and poor-fitting models has been developed in structural equation modeling (SEM; e.g., Bollen, 1989b; Jöreskog & Sörbom, 1979; Kaplan, 2000; Muthén & Muthén, 2017; Toyoda, 1998). In this case, the model with five latent classes was used as the analysis model. As the benchmark model is the best and the null model is the worst in fitting among the benchmark, analysis, and null models, the likelihoods of the three models (l B , l A , and l N ) hold the following inequality: 1 > l B > l A > l N > 0. In addition, the log-likelihood is more precisely the “expected” log-likelihood (ELL) because the EM algorithm is used in this analysis. Let the ELLs of the three models denote ell B , ell A , and ell N , respectively. Then, the following inequality holds between them: 0 > ell B > ell A > ell N > −∞.
Fig. 5.10 Concept of model fit (identical to Fig. 4.24, p. 144)
174
5 Latent Class Analysis
5.5.1 Analysis Model First, the ELL of the analysis model is computed using Eq. (5.8) (p. 162) as follows: ell A (u j |πˆ j ) =
C
ell(u j |πˆ jc )
c=1
=
C S
mˆ sc z s j {u s j ln πˆ jc + (1 − u s j ) ln(1 − πˆ jc )}.
(5.13)
c=1 s=1
This equation is obtained as a real number because it includes no unknown parameters.
5.5.2 Null Model and Benchmark Model Null Model and Benchmark Model Null model:
The model assuming students belong to a single group.
Benchmark model:
The model assuming multiple groups in each of which the number-right scores (NRSs) of the students are equal.
The null and benchmark models are the same as those assumed in the IRT, which are listed in the above box. The null model is a single-group model that attempts to explain the response vector of Item j, u j , only by the CRR of Item j, p j . Accordingly, the likelihood of the null model (for Item j) is given as l N (u j | p j ) =
S
u
{ p j s j (1 − p j )1−u s j }zs j .
s=1
Thus, the log-likelihood is obtained as ll N (u j | p j ) =
S
z s j {u s j ln p j + (1 − u s j ) ln(1 − p j )}.
s=1
Note that the single-group model (i.e., null model) is identical to the LCA model with the number of latent classes set to one. In this model, class membership matrix M C is reduced to
5.5 Model Fit
175
⎡ ⎤ 1 ⎢ .. ⎥ M C = ⎣ . ⎦ = 1 S (S × 1). 1 In this situation, the class membership matrix is a matrix with one column (i.e., a vector), and each student belongs to Class 1 with a probability of 100%. Then, the ELL, which is the log-likelihood of which the nuisance parameters are marginalized, is obtained as ell N (u j | p j ) =
S 1
m sc z s j {u s j ln p j + (1 − u s j ) ln(1 − p j )}
s=1 c=1
=
S
1×z s j {u s j ln p j + (1 − u s j ) ln(1 − p j )}
s=1
=
S
z s j {u s j ln p j + (1 − u s j ) ln(1 − p j )}
s=1
= ll N (u j | p j ). Thus, for the null model, ELL is equivalent to log-likelihood. Therefore, ll N is denoted as ell N . In addition, there are no unknown parameters in the equation, and the ELL is obtained as a real number. Next, the (expected) log-likelihood of the benchmark model is computed. The benchmark model is a multigroup model in which the students are divided into subgroups in each of which students have the same NRS. When the number of items is J , the maximum number of NRS patterns is J + 1 {0, 1, . . . , J }. However, all the patterns are not always observed; for example, there are no 0-point students if a test is very easy. Thus, let G be the number of observed NRS patterns and p jg be the CRR of Group g for Item j. The likelihood of Student s belonging to Group g is obtained as l B (u j | p j ) =
S G
u
{ p jgs j (1 − p jg )1−u s j }zs j m sg .
s=1 g=1
Thus, the log-likelihood is given as ll B (u j | p j ) =
S G
m sg z s j {u s j ln p jg + (1 − u s j ) ln(1 − p jg )},
s=1 g=1
where p j is the vector collecting the CRRs of all groups for Item j. That is,
(5.14)
176
5 Latent Class Analysis
⎤ p j1 ⎥ ⎢ p j = ⎣ ... ⎦ (G × 1). ⎡
p jG In addition, m sg is the membership representing m sg =
1, if Student s belongs to Group g , 0, otherwise
where the binary of 1 is also considered as a probability of 1.0 (100%) that Student s belongs to Group g and 0 as a probability of 0.0 (0%) that the student belongs to the group. In addition, ⎤ m s1 ⎥ ⎢ ms = ⎣ ... ⎦ (G × 1) ⎡
m sG is the group membership profile of Student s. As each student belongs to only one group, the sum of memberships is 1. That is, 1G ms = 1. Note that the group membership profile differs from the class membership profile. The group membership of a student is known before the LCA analysis because it is defined by the student’s NCR, whereas the class membership of the student is unknown before the analysis and is estimated during the analysis. However, the two memberships are the same in the sense that they are both discrete probability distributions. The requirements for a discrete probability distribution are summarized in the following box. Requirements for Discrete Probability Distribution Suppose X is a discrete random variable, and its possible values are x = [X 1 . . . X n ] . Suppose also that the respective occurrence probabilities are p = [ p1 . . . pn ] . For p to be a discrete probability distribution, the following two requirements must be satisfied. 1. All the elements in p is required to fall in [0, 1] ( pi ∈ [0, 1], ∀i ∈ Nn ) because pi is the occurrence probability for X i .
n pi = 1n p = 1). 2. The sum of the elements in p is required to be 1 ( i=1 If (and only if) all possible values are listed, the probability of any of the events occurring is 1 (100%). The class and group membership profiles are discrete probability distributions because they satisfy the above two requirements. Thus, the only difference between
5.5 Model Fit
177
the log-likelihoods of the benchmark and analysis models is that the former is marginalized over the group membership, while the latter is marginalized over class membership. Accordingly, the log-likelihood of the benchmark model, as shown in Eq. (5.14), can also be said to be the expected log-likelihood. That is, ell B (u j | p j ) = ll B (u j | p j ) =
S G
m sg z s j {u s j ln p jg + (1 − u s j ) ln(1 − p jg )}.
s=1 g=1
Designation of Null Model 1. The specification of the null model is left to the analyst. It is not a strict rule that the single-group model must be specified as the null model. The point is that the null model should be a worse-fitting model compared to the analysis model. It is sufficient for the null model if its ELL can sandwich that of the analysis model from below with that of the benchmark model. 2. The null model is the same as the LCA model with one latent class. The null model used here is a model that considers the whole students as one group and explains the data vector of each item by only the CRR of the item. This single-group model is equivalent to the single-class LCA model. If the single-class model is specified as the analysis model, then the two models become identical, and the ELLs of them are equal, too. In this case, the ELLs of the null and benchmark models cannot sandwich that of the analysis model.
5.5.3 Chi-Square Statistics As illustrated in Fig. 5.10, the χ 2 value of the analysis model (for Item j) was calculated as twice the log-likelihood difference between the benchmark and analysis models. That is, χ A2 j = 2{ell B (u j | p j ) − ell A (u j |πˆ j )}, d f A j = row( p j ) − row(πˆ j ) = G − R,
178
5 Latent Class Analysis
where d f A j denotes the degrees of freedom (DF) of the analysis model (for Item j) and is defined as the difference in the number of parameters between the two models. In addition, row( p j ) is the number of elements in p j and is equal to G, which is the number of observed NRS patterns. Similarly, row(πˆ j ) is the number of elements in πˆ j = [πˆ j1 . . . πˆ jC ] and is, thus, obtained as row(πˆ j ) = C. Refer to Summary of χ A2 j and d f A j for interpreting the χ 2 value and DF of the analysis model. Summary of χ A2 j and d f A j 1. As χ A2 j is smaller, the ELL of the analysis model (for Item j) is closer to that of the benchmark model (see Fig. 5.10), which represents the fit of the analysis model is better. Conversely, the larger the χ A2 j , the worse the fit of the analysis model. 2. If the ELL of the analysis model happens to be larger than that of the benchmark model, χ A2 j is then negative. 3. The smaller the d f A j is, the closer the number of parameters of the analysis model is to that of the benchmark model, which means the analysis model is a flexible but complicated model, like the benchmark model. Conversely, the larger the d f A j is, the simpler the analysis model is, like the null model. In addition, the χ 2 value of the null model (for Item j), which is the fit difference between the null and benchmark models, is computed as χ N2 j = 2{ell B (u j | p j ) − ell N (u j | p j )}, d f N j = row( p j ) − row( p j ) = G − 1. 2 To obtain the χ statistic and DF for the entire test, refer to 2 χ and d f of Analysis Model and Null Model . The χ 2 value and DF of the entire test are simply the sum of the χ 2 values and the sum of the DFs of all items, respectively.
5.5 Model Fit
179
Table 5.3 Log-likelihood and χ² of benchmark, null, and analysis models

            BM*1       Null Model                LCA (C = 5)
Model       ell_B      ell_N      χ²_N    df_N   ell_A      χ²_A    df_A
Test       −3560.01   −4350.22   1580.42   195  −3663.99   207.98    135
Item 1      −240.19    −283.34     86.31    13   −264.18    47.98      9
Item 2      −235.44    −278.95     87.02    13   −256.36    41.85      9
Item 3*2    −260.91    −293.60     65.38    13   −237.89   −46.04      9
Item 4      −192.07    −265.96    147.78    13   −208.54    32.93      9
Item 5      −206.54    −247.40     81.73    13   −226.45    39.82      9
Item 6      −153.94    −198.82     89.76    13   −164.76    21.64      9
Item 7      −228.38    −298.35    139.93    13   −249.38    42.00      9
Item 8      −293.23    −338.79     91.13    13   −295.97     5.48      9
Item 9*2    −300.49    −327.84     54.70    13   −294.25   −12.48      9
Item 10     −288.20    −319.85     63.30    13   −306.99    37.57      9
Item 11*2   −224.09    −299.27    150.36    13   −187.20   −73.77      9
Item 12     −214.80    −293.60    157.60    13   −232.31    35.02      9
Item 13     −262.03    −328.40    132.73    13   −267.65    11.23      9
Item 14*2   −204.95    −273.21    136.52    13   −203.47    −2.97      9
Item 15     −254.76    −302.85     96.17    13   −268.62    27.71      9

*1 benchmark model, *2 item with χ²_Aj < 0

χ² and DF of Analysis Model and Null Model

Analysis Model
$$\chi^2_A = \sum_{j=1}^J \chi^2_{Aj} = 2\sum_{j=1}^J \{ell_B(u_j|p_j) - ell_A(u_j|\hat{\pi}_j)\}$$
$$df_A = \sum_{j=1}^J df_{Aj} = \sum_{j=1}^J \{\mathrm{row}(p_j) - \mathrm{row}(\hat{\pi}_j)\} = \sum_{j=1}^J (G - C)$$

Null Model
$$\chi^2_N = \sum_{j=1}^J \chi^2_{Nj} = 2\sum_{j=1}^J \{ell_B(u_j|p_j) - ell_N(u_j|p_j)\}$$
$$df_N = \sum_{j=1}^J df_{Nj} = \sum_{j=1}^J \{\mathrm{row}(p_j) - 1\} = \sum_{j=1}^J (G - 1)$$

These values are not of individual Item j but of the whole test.
Table 5.3 shows the χ² values and DFs obtained in the analysis of J15S500 using the LCA with C = 5. Although the maximum possible number of NRS patterns was 16 because the test length was 15, there were no students scoring 0 or 1 point; thus, the number of observed NRS patterns was 14 ({2, 3, …, 15}). This implies that the number of parameters in the benchmark model was 14 for each item. Thus, the DFs for the null and analysis models are

$$df_{Nj} = 14 - 1 = 13 \quad (\forall j \in \mathbb{N}_J),$$
$$df_{Aj} = 14 - 5 = 9 \quad (\forall j \in \mathbb{N}_J).$$

In addition, for Item 1, for example, the χ² values of the two models are 86.31 and 47.98, respectively, obtained through the following calculations:

$$\chi^2_{N1} = 2 \times \{-240.19 - (-283.34)\} = 86.31,$$
$$\chi^2_{A1} = 2 \times \{-240.19 - (-264.18)\} = 47.98.$$
Moreover, the χ² values of the analysis model were found to be negative⁹ for Items 3, 9, 11, and 14. This is because the ELL of the analysis model was greater than that of the benchmark model for these four items. For example, for Item 3, −∞ < −260.91 (ell_B3) < −237.89 (ell_A3) < 0, and the closer the ELL is to 0, the better the model fits the data. That is, the data vectors of the four items fit the analysis model better than the benchmark model. This also indicates that, for these four items, classifying the students into five latent classes explains the data vector better than grouping them by NRS patterns. However, the sum of the ELLs of the analysis model (−3663.99) is not greater than that of the benchmark model (−3560.01); thus, the benchmark model fits the entire test data better, and the χ² value of the analysis model for the whole test is positive:

$$\chi^2_A = 2 \times \{-3560.01 - (-3663.99)\} = 207.98.$$

⁹ The negative χ² values in Table 5.3 should strictly be expressed as 0 because the χ² statistic is nonnegative by definition. They are shown as they are to indicate how much better the analysis model fits the data than the benchmark model.
Designation of Benchmark Model
1. The specification of the benchmark model is left to the analyst. Just as the specification of the null model is left to the analyst, the benchmark model may also be designated by the analyst, but the benchmark model is required to fit the data better. In this section, the multigroup model grouped by NRS was specified as the benchmark model, but one could instead specify, for example, the LCA model with 20 classes. Alternatively, the saturated model, which fits the data perfectly (see Sect. 11.3.6, p. 564), can also be set as the benchmark model.
2. The LCA is good-fitting. Generally, the LCA is a very well-fitting model. Thus, if the number of classes of the analysis model (C) equals the number of groups of the benchmark model (G), the goodness of fit of the analysis model is much better than that of the benchmark model. For this reason, the fit of the analysis model often exceeds that of the benchmark model even when the number of classes is much smaller than G. In that case, the fit of the analysis model cannot be sandwiched by those of the null and benchmark models.
5.5.4 Model-Fit Indices and Information Criteria

In LCA, as in IRT, it is possible to calculate fit indices for the entire test and for each item. There are two types of fit indices: standardized and relative. The box Standardized Fit Indices summarizes the standardized indices. Although the indices in the list are defined for an individual item, the indices for the entire test can be calculated using the sum of the χ² values and the sum of the DFs over all items.
Standardized Fit Indices

Normed Fit Index (NFI; Bentler & Bonett, 1980)*1
$$NFI_j = cl\Big(1 - \frac{\chi^2_{Aj}}{\chi^2_{Nj}}\Big) \in [0, 1]$$

Relative Fit Index (RFI; Bollen, 1986)*1
$$RFI_j = cl\Big(1 - \frac{\chi^2_{Aj}/df_{Aj}}{\chi^2_{Nj}/df_{Nj}}\Big) \in [0, 1]$$

Incremental Fit Index (IFI; Bollen, 1989a)*1
$$IFI_j = cl\Big(1 - \frac{\chi^2_{Aj} - df_{Aj}}{\chi^2_{Nj} - df_{Aj}}\Big) \in [0, 1]$$

Tucker-Lewis Index (TLI; Bollen, 1989a)*1
$$TLI_j = cl\Big(1 - \frac{\chi^2_{Aj}/df_{Aj} - 1}{\chi^2_{Nj}/df_{Nj} - 1}\Big) \in [0, 1]$$

Comparative Fit Index (CFI; Bentler, 1990)*1
$$CFI_j = cl\Big(1 - \frac{\chi^2_{Aj} - df_{Aj}}{\chi^2_{Nj} - df_{Nj}}\Big) \in [0, 1]$$

Root Mean Square Error of Approximation (RMSEA; Browne & Cudeck, 1993)*2
$$RMSEA_j = \sqrt{\frac{\max(0, \chi^2_{Aj} - df_{Aj})}{df_{Aj}(S - 1)}} \in [0, \infty)$$

*1 Larger values closer to 1.0 indicate a better fit.
*2 Smaller values closer to 0.0 indicate a better fit.
This list is essentially the same as the list shown in IRT (p. 148). The difference is that the clip function (or clamp function), denoted cl(·), is included in the definition of each index except the RMSEA. The clip function is defined as

$$cl(x; x_{\min}, x_{\max}) = \begin{cases} x_{\min}, & \text{if } x < x_{\min} \\ x, & \text{if } x_{\min} \le x \le x_{\max} \\ x_{\max}, & \text{if } x > x_{\max} \end{cases}$$

which means that it returns x if argument x falls in [x_min, x_max], x_min if x < x_min, and x_max if x > x_max. In this book, (x_min, x_max) = (0, 1); thus, cl(x; x_min, x_max) is abbreviated as cl(x). The clip function limits the range of the NFI–CFI to [0, 1]. For the
same reason, max(·) is added to the definition of the RMSEA compared with the IRT chapter. The value range is [0, 1] for all the standardized indices other than the RMSEA, and the better the model fit, the closer the values are to 1. In SEM, the fit of a model is regarded as good when some of these indices are ≥ 0.9. The RMSEA value range is not [0, 1] but [0, ∞); the closer the value is to 0, the better the fit, and a common standard for a good-fitting model is an RMSEA below 0.06. These criteria are shared in the field of SEM, but they may not always be applicable to test data analysis.

Information Criteria

Akaike Information Criterion (AIC; Akaike, 1987)
$$AIC_j = \chi^2_{Aj} - 2\,df_{Aj}$$

Consistent AIC (CAIC; Bozdogan, 1987)
$$CAIC_j = \chi^2_{Aj} - df_{Aj} \ln(S + 1)$$

Bayesian Information Criterion (BIC; Schwarz, 1978)
$$BIC_j = \chi^2_{Aj} - df_{Aj} \ln S$$

A lower value indicates a better fit.
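To make these definitions concrete, the following minimal Python sketch (the function names are ours) computes all of the indices above from the χ² values, DFs, and sample size; applied to the whole-test values of Table 5.3, it reproduces the "Test" row of Table 5.4.

```python
import math

def cl(x, lo=0.0, hi=1.0):
    """Clip function cl(x): restrict an index to [lo, hi]."""
    return min(max(x, lo), hi)

def fit_indices(chi2_A, df_A, chi2_N, df_N, S):
    return {
        "NFI":   cl(1 - chi2_A / chi2_N),
        "RFI":   cl(1 - (chi2_A / df_A) / (chi2_N / df_N)),
        "IFI":   cl(1 - (chi2_A - df_A) / (chi2_N - df_A)),
        "TLI":   cl(1 - (chi2_A / df_A - 1) / (chi2_N / df_N - 1)),
        "CFI":   cl(1 - (chi2_A - df_A) / (chi2_N - df_N)),
        "RMSEA": math.sqrt(max(0.0, chi2_A - df_A) / (df_A * (S - 1))),
        "AIC":   chi2_A - 2 * df_A,
        "CAIC":  chi2_A - df_A * math.log(S + 1),
        "BIC":   chi2_A - df_A * math.log(S),
    }

# Entire-test values from Table 5.3 (S = 500):
print(fit_indices(207.98, 135, 1580.42, 195, 500))
# e.g., NFI 0.868, IFI 0.950, RMSEA 0.033, AIC -62.02, BIC -631.00
```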
In addition, the information criteria are used to compare the fit of two or more analysis models. Three information criteria (for Item j) are listed in the box Information Criteria above.¹⁰ The information criteria for the entire test can be obtained from the sum of the χ² values and the sum of the DFs over all items. The value of each information criterion can be either positive or negative. Each information criterion is used for comparing two or more analysis models; thus, there is no need to refer to it when there is only one analysis model. When comparing two or more analysis models, the model with the smaller information criterion value is judged to be the better fit. Because the basic concept is the same for the AIC, CAIC, and BIC, we focus on the AIC.¹¹

¹⁰ The theory of the information criterion has developed remarkably well, and many criteria have been proposed recently. One of the most popular is the WAIC (Watanabe, 2009).
¹¹ See also Figs. 4.25 (p. 151) and 4.26 (p. 151).

The AIC is constructed as

AIC = Badness of Fit (χ²_A) − 2 × Simplicity of Model (df_A).

As χ²_A grows, the ELL of the analysis model is placed farther from that of the (well-fitting) benchmark model; thus, χ²_A represents the "badness of fit" of the analysis model. Meanwhile, the greater the df_A, the larger the difference in the number of
parameters between the analysis and benchmark models. In other words, when df_A is large, the analysis model is simple, with a small number of parameters, unlike the benchmark model. Conversely, the closer df_A is to 0, the closer the number of parameters of the analysis model is to that of the benchmark model, and the more complicated the analysis model becomes. In other words, the AIC increases as the fit worsens (i.e., χ²_A is larger) and the model becomes more complicated (i.e., df_A is smaller). Conversely, the AIC decreases as the fit improves and the model becomes simpler. However, a good-fitting model is generally a complex model with a large number of parameters, which results in a large AIC, as follows:

Good-Fitting Model (Small χ²_A) ⟺ Complex Model (df_A Close to 0)
⇓
Large AIC

The AIC does not judge such a model to be good because the model may merely overfit the present data and may not fit similar data in the future. Conversely, a simple model tends to fit poorly and to have a large AIC, as follows:

Simple Model (Large df_A) ⟺ Bad-Fitting Model (Large χ²_A)
⇓
Large AIC

In this way, the AIC credits a model for being simple but penalizes it for fitting badly; the AIC thus does not consider such a model a good one either. The AIC evaluates a balanced model as good when it is both moderately well-fitting (moderately small χ²_A) and moderately simple (moderately large df_A).

Table 5.4 shows the fit indices of the LCA model with C = 5 for the entire test and for each individual item. Regarding the fit indices for the test, the IFI, TLI, and CFI were > 0.9, indicating that the model was a good fit. For Items 3, 9, 11, and 14, the NFI–CFI were 1.0 and the RMSEA was 0.0 because the ELL of the analysis model was greater than that of the benchmark model. In addition, the information criteria need not be checked here because they are used when there are two or more analysis models.
Table 5.4 Fit indices of latent class analysis (C = 5)

          Standardized Index                          Information Criterion
          NFI    RFI    IFI    TLI    CFI    RM*      AIC      CAIC      BIC
Test     0.868  0.810  0.950  0.924  0.947  0.033   −62.02   −631.27   −631.00
Item 01  0.444  0.197  0.496  0.232  0.468  0.093    29.98     −7.97     −7.95
Item 02  0.519  0.305  0.579  0.359  0.556  0.086    23.85    −14.10    −14.08
Item 03  1.000  1.000  1.000  1.000  1.000  0.000   −64.04   −101.99   −101.97
Item 04  0.777  0.678  0.828  0.744  0.822  0.073    14.93    −23.02    −23.00
Item 05  0.513  0.296  0.576  0.352  0.552  0.083    21.82    −16.13    −16.11
Item 06  0.759  0.652  0.843  0.762  0.835  0.053     3.64    −34.30    −34.29
Item 07  0.700  0.566  0.748  0.625  0.740  0.086    24.00    −13.95    −13.93
Item 08  0.940  0.913  1.000  1.000  1.000  0.000   −12.52    −50.47    −50.45
Item 09  1.000  1.000  1.000  1.000  1.000  0.000   −30.48    −68.43    −68.42
Item 10  0.406  0.143  0.474  0.179  0.432  0.080    19.57    −18.38    −18.36
Item 11  1.000  1.000  1.000  1.000  1.000  0.000   −91.77   −129.72   −129.70
Item 12  0.778  0.679  0.825  0.740  0.820  0.076    17.02    −20.93    −20.91
Item 13  0.915  0.878  0.982  0.973  0.981  0.022    −6.77    −44.72    −44.70
Item 14  1.000  1.000  1.000  1.000  1.000  0.000   −20.97    −58.92    −58.90
Item 15  0.712  0.584  0.785  0.675  0.775  0.065     9.71    −28.24    −28.23

* RMSEA (root mean square error of approximation)
5.6 Estimating the Number of Latent Classes

The results of the LCA model with five classes have been shown as an example of an analysis. However, these results do not indicate that the optimal number of latent classes is five. How should the number be determined? For this, one can use the fit indices, especially an information criterion. After analyzing the data with LCA models having various numbers of classes and calculating the information criterion for each model, the number of classes of the model with the lowest information criterion is taken as the optimal number of latent classes.

Table 5.5 shows the three information criteria and the χ² values obtained by varying the number of latent classes from two to ten. The settings for all analyses were the same. The EMCs in the table are the numbers of EM cycles required until convergence; for example, the EM algorithm converged in the fourth cycle in the analysis with C = 2. The null and benchmark models were the same for all analyses: the null model is the single-group model, and the benchmark model is the multigroup model grouped by NRS. From Table 5.3, the log-likelihoods of the null and benchmark models and the χ² value and DF of the null model are as follows:
Table 5.5 Information criteria with various numbers of latent classes

LCs*1  EMCs*2    ell_A      χ²_A     df_A     AIC       CAIC      BIC
 2       4     −3942.48    764.95    180     404.95   −354.04   −353.68
 3      17     −3841.76    563.50    165     233.50   −462.24   −461.91
 4      37     −3747.48    374.95    150      74.95   −557.55   −557.25
 5      73     −3663.99    207.98    135     −62.02   −631.26   −631.00
 6      75     −3577.75     35.49    120    −204.51   −710.50   −710.26
 7      88     −3478.86   −162.29    105    −372.29   −815.03   −814.82
 8      78     −3411.66   −296.70     90    −476.70   −856.19   −856.01
 9      67     −3328.03   −463.94     75    −613.94   −930.19   −930.04
10      68     −3273.66   −572.70     60    −692.70   −945.69   −945.57

*1 number of latent classes, *2 number of EM cycles
$$ell_B = -3560.01, \quad ell_N = -4350.22, \quad \chi^2_N = 1580.42, \quad df_N = 195.$$

Note that the benchmark and null models are not altered by changing the number of latent classes. For instance, in the analyses with two and three classes, the χ² values of the two analysis models were calculated as

$$\chi^2_{A,C=2} = 2 \times \{-3560.01 - (-3942.48)\} = 764.95,$$
$$\chi^2_{A,C=3} = 2 \times \{-3560.01 - (-3841.76)\} = 563.50.$$
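The class-number search itself is easy to mechanize. Below is a minimal Python sketch (values taken from Table 5.5; variable names are ours) that recomputes χ²_A and df_A for each candidate C and picks the criterion minimizer.

```python
import math

ell_B, G, J, S = -3560.01, 14, 15, 500   # benchmark ELL, NRS patterns, items, students
ell_A = {2: -3942.48, 3: -3841.76, 4: -3747.48, 5: -3663.99, 6: -3577.75,
         7: -3478.86, 8: -3411.66, 9: -3328.03, 10: -3273.66}

def criteria(C):
    chi2 = 2 * (ell_B - ell_A[C])        # chi-square of the analysis model
    df = J * (G - C)                      # DF summed over the J items
    return {"AIC": chi2 - 2 * df, "BIC": chi2 - df * math.log(S)}

best = min(ell_A, key=lambda C: criteria(C)["BIC"])
print(best, criteria(best))               # BIC picks C = 10 within this range
```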
We can see that the χ²_A value, a measure of poor fit, decreases as the number of classes increases. This is because a model with a larger number of latent classes fits the data more flexibly. In addition, the χ²_A value is negative¹² when the number of classes is seven or greater. This is because the ELL when C = 7 (−3478.86) is greater than that of the benchmark model (−3560.01), which means that the analysis model with C = 7 fits better than the benchmark model with G = 14 (the number of NRS patterns). Moreover, the DF decreases as the number of classes increases. The DF is a measure of the simplicity of the model: the greater the DF, the simpler the model. Among the models with numbers of classes ranging from two to ten, the simplest is the one with two classes, and its DF is thus the largest.

The AIC and BIC of Table 5.5 are plotted in Fig. 5.11.¹³ The figure shows that the AIC and BIC decrease as the number of classes increases, which means that, among the analysis models with two to ten classes, both the AIC and BIC prefer the model with C = 10.

¹² This is an irregular expression because the value range of the χ² statistic is nonnegative by definition.
¹³ The CAIC is almost equal to the BIC and is therefore omitted.
Fig. 5.11 AIC and BIC under various numbers of latent classes
Table 5.6 Information criteria with various numbers of latent classes (S = 100)

LCs*1  EMCs*2    ell_A     χ²_A     df_A     AIC       CAIC      BIC
 2       6     −774.62    244.16    150     −55.84   −448.11   −446.62
 3      14     −729.74    154.42    135    −115.58   −468.63   −467.28
 4      20     −685.06     65.04    120    −174.96   −488.78   −487.58
 5      41     −647.14    −10.78    105    −220.78   −495.37   −494.33
 6      23     −612.23    −80.61     90    −260.61   −495.98   −495.08
 7      38     −576.99   −151.09     75    −301.09   −497.23   −496.48
 8      48     −563.31   −178.45     60    −298.45   −455.35   −454.76
 9      52     −538.11   −228.86     45    −318.86   −436.54   −436.09
10      27     −521.87   −261.34     30    −321.34   −399.79   −399.49

*1 number of latent classes, *2 number of EM cycles
This figure suggests that the optimal number of classes is > 10; in fact, it is 66 by the AIC (at −2346.04) and 15 by the BIC (at −1169.97). The AIC tends to favor a more complicated model than the BIC because the BIC has a larger coefficient on the DF in its formula; the BIC thus selects a simpler model than the AIC. Furthermore, as the sample size increases, a model with a larger number of classes is likely to be selected. Therefore, a model with an unrealistically large number of classes is often selected when exploring the optimal number of classes based on the AIC or BIC.

In practice, when the sample size is small, the optimal number of classes can be determined using the information criteria. Table 5.6 shows the AICs and BICs when analyzing the first 100 students in the data (J15S500). In this case, the ELLs of the benchmark and null models and the χ²_N and DF of the null model are obtained as

$$ell_B = -652.54, \quad ell_N = -871.74, \quad \chi^2_N = 438.42, \quad df_N = 15 \times (12 - 1) = 165,$$
Fig. 5.12 Information criteria under various numbers of latent classes (S = 100)
where the DF of the null model is 165 because the number of observed NRS patterns is 12 when the sample size is 100. The AIC and BIC values of Table 5.6 are plotted in Fig. 5.12, which shows that the smallest AIC and BIC are obtained when the number of classes is ten and seven, respectively. Therefore, the BIC (and CAIC) can work when the sample size is small. However, the sample size of a test often exceeds 1000 or 10,000; thus, one may sometimes fail to select the number of classes based on an information criterion. In such cases, the number of classes should be determined by carefully considering the purpose of the analysis without consulting the fit indices. For example, if one wants to organize school classes in a grade and there are four teachers in the grade, the number of latent classes should be four.
5.7 Chapter Summary

LCA is a statistical method for classifying students into groups (latent classes) that are unknown before the data are analyzed; students with similar response patterns are gathered into each latent class. The LCA has structural and nuisance parameters: the former is the class reference matrix (Π_C), and the latter is the class membership matrix (M_C), both of which can be estimated using the EM algorithm. The calculation is straightforward and fast. LCA, like IRT, is fundamental for any test data analyst. Many advanced models are based on LCA, such as latent rank analysis (Shojima, 2008, 2009, 2011; Chap. 6), in which the latent classes are ordered; the latent class IRT model (Rost, 1990, 1991), which combines IRT and LCA; and the model combining the Bayesian network model with latent class (rank) analysis (Shojima, 2011). Performing LCA is one of the basic skills of a psychometrician, educational evaluator, or educational engineer.

However, it is difficult to determine the number of classes because the LCA is generally very flexible and fits the data well. The greater the number of classes, the
more the model fits the data, and if one tries to determine the number of classes based on an information criterion, the number is often very large. Each information criterion suggests an optimal number of classes by referring only to the information inside the test data. However, analysts are aware of the world outside the test data. We may know, for example, that five classes are sufficient for grouping students in reality, and that if the students are classified into ten classes, ten different teaching materials, ten teaching methods, or ten teachers cannot usually be prepared for the respective classes. The test data engineer is the only person who can reconcile the information from inside and outside the test data and integrate them properly. A number of classes selected by an information criterion is certainly an objective decision, but a statistical decision is not always a realistic one. When it is not realistic, we must use knowledge about the world outside the data to make our decisions. LCA (computer software) does not use knowledge outside the data in producing its results, but we can inspect the statistical outputs, which reflect the information inside the data, in light of the knowledge outside the data. This is our ability to integrate information in and out of data.¹⁴ The evidence for making a decision is not necessarily a statistical indicator; if it were, the fields of literature, law, and politics would no longer be viable. When determining the number of classes, one is allowed to refer to one's experience and knowledge and to consult any person or literature with expertise on the issue.

¹⁴ Artificial intelligence may well do this before long.
Chapter 6
Latent Rank Analysis
Latent rank analysis¹ (LRA; Shojima, 2008, 2009, 2011) is a model similar to latent class analysis (LCA) in that it supposes several latent classes. The classes in LCA are unordered and, in general, unrelated, although they might happen to be ordered. In LRA, by contrast, the classes are assumed to be ordered. In other words, LRA is an LCA that assumes the ordinality of classes.² Moreover, compared with item response theory (IRT), LRA can be expressed as an IRT model whose continuous ability scale θ is altered to an ordinal ability scale.

Level of Ability Scale
Continuous: Item response theory (θ)
Nominal: Latent class analysis
Ordinal: Latent rank analysis

The box above classifies the models by the level of their ability scale. The θ scale in IRT is continuous, so the student ability estimates obtained are continuous values such as 1.2345 and −0.9876, and students with a high θ are considered to have high ability. Meanwhile, LCA evaluates each student's ability by specifying the latent class the student belongs to (LCE in Table 5.2, p. 170). Because the latent class is a nominal variable, the ability scale of LCA is a nominal scale. It is thus possible to say that LCA is a nonparametric IRT whose ability scale is not continuous but nominal, although latent classes can be reordered by the test reference profile (TRP;
¹ Or, latent rank theory. In this book, it is referred to as latent rank analysis in parallel with LCA. It is also called neural test theory.
² Some similar methods have also been proposed by Lazarsfeld and Henry (1968) and Croon (2002).
see Figure 5.5, p. 167). "Nonparametric" means that a specific mathematical function is not employed to define the model; IRT is thus a parametric model because it is formulated using a logistic function. LRA is a model that inherently orders the latent classes and evaluates the ability of each student on an ordinal scale. Thus, LRA can be said to be a nonparametric IRT whose ability scale is ordinal.
6.1 Latent Rank Scale

The LRA assumes ordinal latent classes, or latent ranks. Why do we consider latent ranks? The reasons are shown in the following box.

Why Latent Rank?
1. Test reliability
2. Selection tests and diagnostic tests
3. Accountability and qualification tests

Regarding Point 1, tests are not reliable enough as measurement tools to support evaluation on a continuous score. As described in classical test theory (CTT; Chap. 3, p. 73), tests are less reliable than height and weight scales. In the following sections, we deepen the discussion of test reliability in terms of accuracy, discrimination, and resolution.
6.1.1 Test Accuracy

Let Mr. Smith's height be 173 cm (5′8.1″; see Fig. 6.1, left). A height scale is a measurement tool that returns the length of an object (i.e., input) as a real number but discards all other information regarding the input. This process can be expressed as

$$f_{SCALE}(\text{MR. SMITH}) = 173 \text{ (cm)}.$$

In addition, assume that Mr. Smith took a test and scored 73 points (Fig. 6.1, right). This situation is expressed as

$$f_{TEST}(\text{MR. SMITH}) = 73 \text{ (points)}.$$

Height scales and tests are similar in the sense that both are tools that assign a numerical value to one feature of the input while discarding (or neglecting) everything else about it.
Do we question the value measured by a height scale? We usually would not suspect that Mr. Smith's height was really 172 cm or 174 cm; we take it to be exactly 173 cm. What about test results? Can we believe that Mr. Smith's score is neither 72 nor 74 but exactly 73? Generally, we cannot easily accept from experience that Mr. Smith's ability is exactly 73, because we know that a test score is influenced by the uneven sampling of item content and by Mr. Smith's physical condition on the day. Thus, in terms of accuracy, tests are clearly inferior to height scales as measurement tools.
6.1.2 Test Discrimination

Next, discrimination is the capability of distinguishing a minute difference between two similar objects; the smaller the difference distinguished, the better the discrimination. From Fig. 6.2 (left), Mr. Smith and Mr. Davis are roughly the same height (the difference is 1 cm). A height scale can detect that small 1 cm difference, and we do not usually doubt the scale's report that the difference, although small, definitely exists. Now, let us consider test scores. Suppose Mr. Davis scores 74 on the test on which Mr. Smith scored 73 (Fig. 6.2, right). Can we believe that Mr. Davis's ability is higher? Usually, we would not consider a one-point score difference to reflect a meaningful difference in ability, because we have often seen the test scores of two students with clearly different abilities be reversed.
Fig. 6.1 Accuracy of height scale and test
Fig. 6.2 Discrimination of height scale and test
Fig. 6.3 Resolution of height scale and test
6.1.3 Test Resolution

Finally, let us consider the resolution of a test. Here, resolution refers to how fine a unit of difference can be distinguished; the smaller the unit differentiated, the higher the resolution. Height scales can sort students from shortest to tallest, whether there are a thousand or a million of them, without making any mistakes (Fig. 6.3, left). However, the resolution of a test is not that high (Fig. 6.3, right): it is impossible to sort students' abilities from lowest to highest without making mistakes. If the scale of a test is [0, 100] and we try to classify students into 101 groups, the test does not have sufficient resolution to classify abilities into such fine units. A test has a rough resolution, dividing students into at most about 20 levels. Even if we accept the assumption that students' abilities vary on a continuous scale, tests are not precise enough to capture that continuum.
6.1.4 Selection and Diagnostic Tests

Next, Point 2 in Why Latent Rank? (p. 192) is explained. Tests play two main roles: selection and diagnosis. A selection test is administered to pick a certain number of high-ability students, as in university entrance exams. A competitive test
with a fixed quota of successful applicants is generally a selection test. Meanwhile, a diagnostic test is conducted to examine each student's knowledge and ability (e.g., TOEIC,³ TOEFL, calculation tests, and vocabulary tests). A selection test usually has a fixed quota. To precisely adjust the number of successful applicants to the enrollment limit, it is better for each applicant's ability to be expressed as a continuous score; thus, θ in IRT or the total score in CTT is more useful in a selection test when the score is used as a tool. For example, when the goal of a university entrance exam is to pass 300 high-ability students out of 1,000 applicants, a five-rank evaluation is not useful because there would be about 200 applicants in each rank, making it difficult to fine-tune the number of successful candidates. Even if a five-rank scale is valid and reasonable for the exam in terms of reliability and resolution, such a scale is inconvenient as a public tool in our society. Meanwhile, in monitoring (diagnosing) each student's daily achievement, it is not so important to evaluate his or her ability with a fine continuous score. Moreover, a test score contains measurement error. For example, a difference of about ±5 points on a 100-point test reflects no practically meaningful difference in ability, and it is pointless to worry over such a small fluctuation in score. Instructing a student to be careful about small changes in score may train him or her to make fewer careless mistakes, but it may not lead the student to big ideas. It is better to let students concentrate on improving their abilities without pressure in everyday school life. Thus, a graded evaluation is better than a continuous score for a diagnostic test, because a grade is more stable than a continuous score.
6.1.5 Accountability and Qualification Test

Finally, let us consider Point 3 of Why Latent Rank? (p. 192). It is very important for a test administrator to be able to explain what kind of test it is (i.e., test accountability). What does the administrator need to be able to explain about the test? These factors are roughly summarized as follows:

Test administrator must be accountable for
1. The ability the test measures
2. The purpose of the test
3. The main target of the test
4. The degree of test reliability
5. …
There are more points besides those listed above; the test administrator needs to be able to answer essentially anything about the test. Points 1–3 concern test validity,

³ It becomes a selection test when students must score 800 or higher to be eligible to study abroad.
Fig. 6.4 Accountability and qualification test
and Point 4 concerns test reliability. For more details on validity, Messick (1989, 1995) is a good reference. Test reliability, in turn, is an idea developed in CTT (Chap. 3, p. 73). In fact, it is very difficult to explain Points 1 and 2 in detail. If the test is an academic achievement test, expertise in the subject is required. For example, if it is a foreign language test, the test administrator needs to know whether the test measures grammatical skill, reading comprehension, or writing ability; and more specifically, if the test concerns grammar, it is important to understand whether that grammar is used in conversation or in reading essays. Furthermore, the test administrator should be accountable for the difficulty level of the grammar test (e.g., a level equivalent to native-speaking junior high school students). In addition, the test administrator should understand the properties of each item format. If multiple-choice items are mainly used, the test measures simply the presence or absence of grammatical knowledge; but if fill-in-the-blank items are used in a conversational context, the test measures not superficial knowledge but knowledge at a deeper level, such as its application. Thus, although doing so is very difficult in reality, the test administrator must be able to account correctly and in detail for the test.

When a continuous score is fed back to two students scoring 73 and 74, respectively, on a 100-point test, it is very difficult to explain in words the difference in ability between 73 and 74 (Fig. 6.4, left). What is the point of giving different feedback when we cannot express the difference in words? In contrast, if the ability scale is ordinal, it is easier to explain the ability level required at each grade. Figure 6.4 (right) shows an example of brief descriptions corresponding to each ability level for a foreign language test. Evidently, a graded evaluation is useful for verbalizing (and visualizing) the path from the start to the goal of the test. Although this verbalization is not easy and requires deep expertise in the subject, it is much more feasible than describing every one-point difference between adjacent continuous scores.
Fig. 6.5 Rank reference matrix on the latent rank scale
If the path from the start (i.e., zero marks) to the goal (i.e., full marks) of a test is well described, the test can be said to be a qualification test. In such a test, a list of the abilities to be acquired is clarified for each grade. LRA is a method of grading students, and it makes the ability list at each grade accountable. In other words, LRA can be said to be a tool for turning a test into a qualification test.
6.1.6 Rank Reference Matrix

LRA supposes an ordinal ability scale called the latent rank scale (Fig. 6.5). Assume that the number of latent ranks is R (in the figure, R = 6) and that Rank 1 is the lowest ability group, whereas Rank R is the highest. When the number of items is J (J = 10 in the figure), the rank reference matrix of size J × R is given as

$$\Pi_R = \begin{bmatrix} \pi_{11} & \cdots & \pi_{1R} \\ \vdots & \ddots & \vdots \\ \pi_{J1} & \cdots & \pi_{JR} \end{bmatrix} = \{\pi_{jr}\} \quad (J \times R),$$

where the (j, r)-th element π_jr (∈ [0, 1]) is the correct response rate (CRR) of Item j for the students belonging to Rank r. This matrix is unknown and is estimated in LRA.
The r-th column vector of the rank reference matrix (Π_R),

$$\pi_r = \begin{bmatrix} \pi_{1r} \\ \vdots \\ \pi_{Jr} \end{bmatrix} \quad (J \times 1),$$

is referred to as the rank reference vector (of Rank r). In addition, the j-th row vector of the matrix,

$$\pi_j = \begin{bmatrix} \pi_{j1} \\ \vdots \\ \pi_{jR} \end{bmatrix} \quad (R \times 1),$$

is referred to as the item reference profile (IRP) of Item j. Because the average ability increases with larger rank labels, the elements of π_j tend to satisfy

$$\pi_{j1} \le \pi_{j2} \le \cdots \le \pi_{jR}.$$
6.2 Estimation by Self-organizing Map

The LRA estimates Π_R from U; thus, it can be regarded as a function that outputs Π_R in response to input U. In this section, a method of estimating the rank reference matrix (Π_R) using a self-organizing map (SOM; e.g., Kohonen, 1995; Oja and Kaski, 1999; Sangole, 2009) is described, under the condition that the number of latent ranks, R, is given.⁴ The SOM is a statistical learning model that simulates the learning mechanism of neurons in a simplified way, and it is regarded as a neural network model without a supervisory signal (i.e., dependent variable). The procedure is shown in the following box.⁵

⁴ How to determine the number of latent ranks is described in Sect. 6.7 (p. 234).
⁵ The practical programming statements depend on the programming language; statements written in a natural language are called "pseudocode."
LRA-SOM Estimation Procedure
1. Set Π_R^(1,0).
2. Repeat [2-1] to [2-3] for t = 1 through T.
  2-1. Randomly sort the row vectors of U and obtain U^(t).
  2-2. Repeat [2-2-1] to [2-2-2] for i = 1 through S.
    2-2-1. Select the winner rank for u_i^(t).
    2-2-2. Update Π_R^(t,i−1) and obtain Π_R^(t,i).
  2-3. Set Π_R^(t,S) as Π_R^(t+1,0).

First, in Command [1], the initial value of the rank reference matrix, Π_R^(1,0), is specified. If there is no prior information on the matrix, the initial value of the j-th row vector of Π_R^(1,0), the IRP of Item j, is

$$\pi_j^{(1,0)} = \begin{bmatrix} 1/(R+1) \\ 2/(R+1) \\ \vdots \\ R/(R+1) \end{bmatrix} \quad (\forall j \in \mathbb{N}_J).$$

Next, [2] commands that [2-1] through [2-3] be repeated for t from 1 through T, where T is the number of iterations (e.g., T = 100), and [2-1] orders that the data matrix (U) be reordered. The SOM is a statistical learning model, and the rank reference matrix learns the individual student data vectors sequentially. If the input order were always the same, the learning result would be affected by that order (an order effect). Therefore, in [2-1], the input order is changed randomly each time, where U^(t) denotes U as reordered at the t-th iteration. In addition, [2-2] is a command to repeat [2-2-1] and [2-2-2] for i from 1 through S. In the i-th iteration, the i-th row vector of U^(t), u_i^(t), is input and processed.
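As a sketch of Command [1] (NumPy assumed; the function name is ours), the initialization can be written as follows.

```python
import numpy as np

def init_rank_reference(J, R):
    """Every item starts from the same increasing IRP
    [1/(R+1), 2/(R+1), ..., R/(R+1)] when no prior information exists."""
    return np.tile(np.arange(1, R + 1) / (R + 1), (J, 1))   # J x R matrix

print(init_rank_reference(15, 6)[0])   # initial IRP of Item 1: [1/7 ... 6/7]
```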
6.2.1 Winner Rank Selection

Next, [2-2-1] at the i-th iteration is the process of selecting the rank reference vector closest to the input data (u_i^(t)). The rank whose rank reference vector is closest to the input data is called the winner rank. Some commonly applied distances (or closenesses) are defined as follows:
Winner Rank Selection Methods

Least Squares (LS) Distance
$$d_{LS}(t, i, r) = (z_i^{(t)})'(u_i^{(t)} - \pi_r^{(t,i-1)})^{\circ 2} = \sum_{j=1}^J z_{ij}^{(t)} (u_{ij}^{(t)} - \pi_{jr}^{(t,i-1)})^2$$

Log-Likelihood (LL) Distance
$$d_{LL}(t, i, r) = -ll(u_i^{(t)}|\pi_r^{(t,i-1)}) = -\sum_{j=1}^J z_{ij}^{(t)} \big\{u_{ij}^{(t)} \ln \pi_{jr}^{(t,i-1)} + (1 - u_{ij}^{(t)}) \ln(1 - \pi_{jr}^{(t,i-1)})\big\}$$

Log-Posterior (LP) Distance
$$d_{LP}(t, i, r) = -ll(u_i^{(t)}|\pi_r^{(t,i-1)}) - \ln \pi_r$$

The least squares distance (d_LS) is the distance used in a classical SOM; it is defined as the sum of the squared element differences between the input data vector (u_i^(t)) and the latest update of Rank r's rank reference vector (π_r^(t,i−1)).⁶ The more similar the input and rank reference vectors, the smaller the d_LS. Thus, the winner rank is determined as

$$w_i^{(LS,t)} = \arg\min_{r \in \mathbb{N}_R} d_{LS}(t, i, r).$$

This equation means that the latent rank whose rank reference vector has the smallest d_LS is selected as the winner rank.

⁶ The missing data are excluded from the sum.

The log-likelihood distance (d_LL; Shojima, 2008) is a stochastic closeness. The likelihood takes a value within [0, 1] because it is a probability, and the better the fit, the closer it is to 1. Thus, the log-likelihood, ll(u_i^(t)|π_r^(t,i−1)), is nonpositive; that is, its range is (−∞, 0], and the better the fit, the closer it is to 0. Negating it, −ll(u_i^(t)|π_r^(t,i−1)), makes it nonnegative, and the closer it is to 0, the better the fit. Accordingly, the winner rank determined by the log-likelihood distance is

$$w_i^{(LL,t)} = \arg\min_{r \in \mathbb{N}_R} d_{LL}(t, i, r).$$

In addition, d_LP is the log-posterior distance (Shojima, 2008). By this distance, the winner rank is selected as

$$w_i^{(LP,t)} = \arg\min_{r \in \mathbb{N}_R} d_{LP}(t, i, r),$$

where π_r is the prior probability. A commonly used LRA software package, Exametrika (Shojima, 2008–), applies this log-posterior distance.
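A minimal NumPy sketch of the three selection rules follows (function and argument names are ours; the returned rank label is 1-based, as in the text).

```python
import numpy as np

def winner_rank(u, z, Pi, prior=None, method="LP"):
    """u, z: one student's responses and missing indicators, shape (J,);
    Pi: current rank reference matrix, shape (J, R); prior: shape (R,)."""
    R = Pi.shape[1]
    if prior is None:
        prior = np.full(R, 1.0 / R)                  # discrete uniform prior
    ll = (z[:, None] * (u[:, None] * np.log(Pi)
          + (1 - u)[:, None] * np.log(1 - Pi))).sum(axis=0)
    if method == "LS":                               # least squares distance
        d = (z[:, None] * (u[:, None] - Pi) ** 2).sum(axis=0)
    elif method == "LL":                             # log-likelihood distance
        d = -ll
    else:                                            # log-posterior distance
        d = -ll - np.log(prior)
    return int(np.argmin(d)) + 1                     # 1-based winner rank
```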
Fig. 6.6 Winner rank selection (the distances d_LS, d_LL, and d_LP plotted against the latent ranks)
Let us now consider the following example. The number of items and the sample size of the data to be analyzed are 15 and 500 (J15S500), respectively, and the number of latent ranks is assumed to be six. Consider the first data vector input at the first repetition (t = 1). At this time, the rank reference matrix (Π_R) is in an unlearned state and has the initial value Π_R^(1,0). As an exercise, let the first input data vector be the response vector of Student 1 (u_1).⁷ The three distances between the input data (u_1) and each latent rank are calculated and plotted in Fig. 6.6. All three distances indicate that the rank reference vector of Rank 4 is the closest to the input data; that rank is thus the winner rank. For the log-posterior distance, the discrete uniform distribution is used as the prior distribution, that is,

$$\pi_r = \frac{1}{R} = \frac{1}{6} \quad (r \in \mathbb{N}_R).$$

Consequently, in this case, the plot of d_LP is parallel to the d_LL plot, shifted upward by −ln(1/6) = 1.792, and the winner ranks selected by the log-likelihood and log-posterior distances are thus the same. Note that the winner ranks selected by the three distances are usually not the same; their agreement in this example is purely coincidental.
6.2.2 Data Learning by Rank Reference Matrix

This section describes [2-2-2] in the LRA-SOM Estimation Procedure (p. 199). In this process, the rank reference matrix, which is a kind of neural network, learns the input data. In other words, the winner rank and the ranks around the winner adapt to the input data.

⁷ In practice, the first input data are the first row vector of U^(1).
Fig. 6.7 Learning around the winner rank
In Sect. 6.2.1, the winner rank was selected for the input data (Fig. 6.7, left). Learning occurs around the winner rank (Fig. 6.7, right). Learning here means that the elements of each rank reference vector approach the input data. First, the values of the rank reference vector of the winner rank approach the input data. Then, the rank reference vectors of the ranks around the winner also approach the input data, although less closely than that of the winner rank; the nearer a rank is to the winner, the more closely its rank reference vector approximates the input data.

The extent of learning is called the learning rate. The learning rate of the winner is the largest, and the nearer a rank is to the winner, the larger its learning rate. Suppose the winner rank is w for the input data (u_i^(t)); then, through learning, the present rank reference matrix, Π_R^(t,i−1), is updated to Π_R^(t,i) as follows:

$$\Pi_R^{(t,i)} = \Pi_R^{(t,i-1)} + (1_J h^{(t)\prime}) \circ (z_i^{(t)} 1_R') \circ (u_i^{(t)} 1_R' - \Pi_R^{(t,i-1)}). \tag{6.1}$$

This equation can be rephrased in words as

New Π_R = Old Π_R + (Learning Rate) ∘ (Skip Missing Data) ∘ (Difference between Input Data and Old Π_R).

The size of the third factor in the second term of Eq. (6.1), (u_i^(t) 1_R′ − Π_R^(t,i−1)), is J × R, where the first part of the expression is

$$u_i^{(t)} 1_R' = \begin{bmatrix} u_{i1}^{(t)} \\ \vdots \\ u_{iJ}^{(t)} \end{bmatrix} [1 \cdots 1] = \begin{bmatrix} u_{i1}^{(t)} & \cdots & u_{i1}^{(t)} \\ \vdots & \ddots & \vdots \\ u_{iJ}^{(t)} & \cdots & u_{iJ}^{(t)} \end{bmatrix} = [u_i^{(t)} \cdots u_i^{(t)}] \quad (J \times R).$$
This is a matrix in which R replications of the input vector (u_i^(t)) are arranged horizontally. Thus, the r-th column vector of (u_i^(t) 1_R′ − Π_R^(t,i−1)), namely u_i^(t) − π_r^(t,i−1), represents the difference between the input vector and Rank r's reference vector. Next, (z_i^(t) 1_R′) in Eq. (6.1),

$$z_i^{(t)} 1_R' = \begin{bmatrix} z_{i1}^{(t)} \\ \vdots \\ z_{iJ}^{(t)} \end{bmatrix} [1 \cdots 1] = \begin{bmatrix} z_{i1}^{(t)} & \cdots & z_{i1}^{(t)} \\ \vdots & \ddots & \vdots \\ z_{iJ}^{(t)} & \cdots & z_{iJ}^{(t)} \end{bmatrix} = [z_i^{(t)} \cdots z_i^{(t)}] \quad (J \times R),$$
is a matrix composed of R copies of the missing indicator vector z_i^(t) corresponding to the input vector u_i^(t). Thus, if Item j in u_i^(t) is missing, all the elements in the j-th row of z_i^(t) 1_R′ become 0; in that case, the j-th row elements of the rank reference matrix do not learn the input data and are not updated.

Moreover, the learning rate is high at the early stages of SOM learning and gradually decreases as the learning progresses. The initial and final stages of learning are when iteration t equals 1 and T (e.g., T = 100), respectively. In other words, the update size of the rank reference matrix decreases as t increases. In Eq. (6.1), h^(t) represents the learning rate and controls the update size as follows:

$$h^{(t)} = \{h_{rw}^{(t)}\} \quad (R \times 1), \qquad h_{rw}^{(t)} = \frac{\alpha_t R}{S} \exp\Big(-\frac{(r - w)^2}{2 R^2 \sigma_t^2}\Big), \tag{6.2}$$
$$\alpha_t = \frac{(T - t)\alpha_1 + (t - 1)\alpha_T}{T - 1}, \qquad \sigma_t = \frac{(T - t)\sigma_1 + (t - 1)\sigma_T}{T - 1}.$$

In the above equation, the numerator of (r − w)²/R² in the exponential term means that the nearer a rank is to the winner, the greater its update size, while the denominator means that the larger the number of ranks (R), the more widely the update spreads. Moreover, the numerator of the coefficient R/S, namely R, means that the larger the number of ranks, the larger the update size, and the denominator S means that the larger the sample size, the smaller the update size per student. In addition, α_t represents the update size at Period t, which is α_1 at the beginning of learning and decreases to α_T at the end. Likewise, σ_t represents the range over which the learning at Period t extends, which is σ_1 at the beginning and shrinks to σ_T at the end. For example, if the number of ranks is six, the winner is Rank 4, and we let (T, α_1, α_T, σ_1, σ_T) = (100, 1.0, 0.01, 1.0, 0.12),⁸ the learning rates of the six ranks at the first period (t = 1) are
$$h^{(1)} = [.01059 \ \ .01135 \ \ .01183 \ \ .01200 \ \ .01183 \ \ .01135]'.$$

The learning rate for the winner (i.e., Rank 4) is the greatest, and the farther a rank is from the winner, the lower its learning rate. The learning rates at the 10th period are

$$h^{(10)} = [.00942 \ \ .01023 \ \ .01074 \ \ .01092 \ \ .01074 \ \ .01023]'.$$

This exemplifies that the learning rate decreases as the period t progresses. Moreover, the first factor of the second term in Eq. (6.1), (1_J h^(t)′), is

$$1_J h^{(t)\prime} = \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix} [h_{1w}^{(t)} \cdots h_{Rw}^{(t)}] = \begin{bmatrix} h_{1w}^{(t)} & \cdots & h_{Rw}^{(t)} \\ \vdots & \ddots & \vdots \\ h_{1w}^{(t)} & \cdots & h_{Rw}^{(t)} \end{bmatrix} = \begin{bmatrix} h^{(t)\prime} \\ \vdots \\ h^{(t)\prime} \end{bmatrix} \quad (J \times R),$$

meaning that the same learning rate is applied to every item.

⁸ This is the default setting of Exametrika, the LRA software.
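The following minimal NumPy sketch (names are ours) implements the learning rates of Eq. (6.2) and the update of Eq. (6.1); with the default setting above it reproduces h(1).

```python
import numpy as np

def learning_rate(t, w, R, S, T=100, a1=1.0, aT=0.01, s1=1.0, sT=0.12):
    """Eq. (6.2): rank-wise learning rates at period t for winner rank w."""
    alpha = ((T - t) * a1 + (t - 1) * aT) / (T - 1)
    sigma = ((T - t) * s1 + (t - 1) * sT) / (T - 1)
    r = np.arange(1, R + 1)
    return alpha * R / S * np.exp(-((r - w) ** 2) / (2 * R ** 2 * sigma ** 2))

def som_update(Pi, u, z, h):
    """Eq. (6.1): move each rank reference vector toward input vector u at
    rank-specific rate h, skipping missing responses (z = 0)."""
    return Pi + h[None, :] * z[:, None] * (u[:, None] - Pi)

print(np.round(learning_rate(t=1, w=4, R=6, S=500), 5))
# [0.01059 0.01135 0.01183 0.012   0.01183 0.01135], matching h(1) above
```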
6.2.3 Prior Probability in Winner Rank Selection

As we have seen, estimation by the SOM algorithm is basically a repetition of selecting the winner rank (Sect. 6.2.1) and updating the rank reference matrix (Sect. 6.2.2). However, during learning with the SOM algorithm, the ranks at both ends (i.e., Ranks 1 and R) are likely to be selected as the winner (Shojima, 2008). This property has also been observed in the original SOM (e.g., Amari, 1980; Ritter and Schulten, 1986; Kohonen, 1995). The LRA is useful for organizing school classes by ability level, but if the numbers of students in the lowest and highest ranks are large, it will be hard for the teachers who instruct those ranks; when organizing classes, the numbers of students should be approximately equal. A latent rank scale on which the number of students belonging to each rank is equal is called an equiprobability rank scale.

To ensure that the number of students classified into each rank is nearly equal, it is effective to assume a prior probability when selecting the winner rank; in other words, the log-posterior distance introduced in Winner Rank Selection Methods (p. 200) is used. Let the initial value of the prior probability for Rank r, π_r^(1,0),⁹ be
$$\pi_r^{(1,0)} = \frac{1}{R}.$$

In addition, let the prior probability of Rank r be π_r^(t,i−1) before the i-th data vector at the t-th period, u_i^(t), is input. Then, if the winner is Rank w after u_i^(t) is input, the following operation is performed on the prior probability of Rank r:

$$\pi_r^{(t,i)} = \begin{cases} \max\Big(0, \ \pi_r^{(t,i-1)} - \dfrac{R-1}{R}\kappa_t\Big) & (r = w) \\[2mm] \min\Big(1, \ \pi_r^{(t,i-1)} + \dfrac{\kappa_t}{R}\Big) & (r \ne w) \end{cases}, \qquad \kappa_t = \frac{(T - t)\kappa_1 + (t - 1)\kappa_T}{T - 1}.$$

In other words, the prior probability of the rank chosen as the winner is reduced by (R − 1)κ_t/R under the constraint that the prior not fall below 0. The smaller the prior probability, the less likely the rank is to be selected as the winner for the next input data, u_{i+1}^(t). Meanwhile, κ_t/R is added to the prior probabilities of the remaining R − 1 ranks that were not selected as the winner; as a prior probability increases, the possibility of that rank being selected as the winner when the next data vector (u_{i+1}^(t)) is input increases. The sum of the prior probabilities is 1 both before and after the selection. Moreover, κ_t denotes the magnitude of the change in the prior probability at the t-th period; it is κ_1 at the first period and decreases to κ_T at the final period. For example, (κ_1, κ_T) is set as (0.01, 0.0001).¹⁰

⁹ If a π has two subscripts, such as π_jr, it is an element of the rank reference matrix. If a π has one subscript, like π_r, it is the prior probability (for Rank r).
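A minimal NumPy sketch of this prior update (names are ours; the winner index w is 0-based here) follows.

```python
import numpy as np

def update_prior(prior, w, t, T=100, k1=0.01, kT=0.0001):
    """Shrink the winner's prior by (R-1)*kappa/R and raise the other R-1
    priors by kappa/R, clamping each to [0, 1]."""
    R = len(prior)
    kappa = ((T - t) * k1 + (t - 1) * kT) / (T - 1)
    new = prior + kappa / R                    # non-winner ranks gain kappa/R
    new[w] = prior[w] - (R - 1) * kappa / R    # winner loses (R-1)*kappa/R
    return np.clip(new, 0.0, 1.0)
```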
6.2.4 Results Under SOM Learning

This section shows the estimation results for the rank reference matrix (Π_R) by SOM learning. The conditions of learning are as follows:

Estimation Setting of LRA-SOM
1. Data: J15S500 (www)
2. Number of latent ranks: R = 6
3. Number of learning repetitions in [2]: T = 100
4. Winner rank selection in [2-2-1]: Log-posterior distance
5. Prior update size in [2-2-1]: (κ_1, κ_T) = (0.01, 0.0001)
6. Learning size in [2-2-2]: (α_1, α_T) = (1.0, 0.01)
7. Learning range in [2-2-2]: (σ_1, σ_T) = (1.0, 0.12)

¹⁰ This is the setting in Exametrika, the software for the LRA.
Table 6.1 Estimate of the rank reference matrix (Π̂_R) by SOM learning

          Latent Rank
          1      2      3      4       5       6       CRR*
Item 1   0.595  0.639  0.713  0.784   0.846   0.895   0.746
Item 2   0.538  0.640  0.764  0.855   0.888   0.889   0.754
Item 3   0.608  0.624  0.684  0.764   0.818   0.851   0.726
Item 4   0.502  0.634  0.775  0.879   0.943   0.973   0.776
Item 5   0.685  0.749  0.794  0.832   0.873   0.909   0.804
Item 6   0.694  0.792  0.896  0.956   0.963   0.944   0.864
Item 7   0.443  0.546  0.697  0.835   0.898   0.917   0.716
Item 8   0.365  0.468  0.601  0.683   0.717   0.735   0.588
Item 9   0.332  0.323  0.308  0.318   0.387   0.467   0.364
Item 10  0.535  0.583  0.664  0.723   0.734   0.749   0.662
Item 11  0.074  0.078  0.126  0.268   0.479   0.625   0.286
Item 12  0.069  0.099  0.153  0.239   0.406   0.594   0.274
Item 13  0.375  0.509  0.667  0.742   0.770   0.792   0.634
Item 14  0.519  0.604  0.725  0.851   0.936   0.971   0.764
Item 15  0.481  0.586  0.709  0.802   0.846   0.860   0.706
Test     6.814  7.874  9.277  10.532  11.504  12.170  9.664
Prior    0.157  0.205  0.183  0.164   0.158   0.133

* correct response rate (item mean)
Table 6.1 shows the estimate of the rank reference matrix, Π̂_R. For example, the CRRs of the students belonging to Rank 1 are estimated as 59.5% for Item 1, 53.8% for Item 2, …, and 48.1% for Item 15.¹¹ In this way, by focusing on a certain column of the rank reference matrix (i.e., a rank reference vector), we can examine which items the students belonging to that rank are good at and which they are not, and grasp the features of the students classified into Rank 1, Rank 2, …, and Rank 6. Although subject expertise is required for this characterization of ranks, if it is done as in Fig. 6.4 (right), the accountability of the test increases, and the path from the start to the goal of the test becomes visible.

In addition, the j-th row of the rank reference matrix is the item reference profile (IRP) of Item j. Figure 6.8 shows the plots of the IRPs for the 15 items. The IRP of each item shows an upward tendency as the rank increases.¹² However, the IRP of Item 6 is not monotonically increasing: its CRR for Rank 6 is 0.944, whereas that for Rank 5 is 0.963.

¹¹ The order of inputting the data vectors is randomly sorted in [2-1] of the LRA-SOM Estimation Procedure (p. 199). As a result, the estimates vary slightly each time, even when the same data are analyzed with the same settings. If one fixes the random number seed, the same estimates are obtained every time the same data are analyzed; Exametrika fixes the random number seed to return the same results every time.
¹² Note that the IRPs obtained using LRA are much smoother than those obtained using LCA (Figure 5.6, p. 168).
Fig. 6.8 Item reference profiles (LRA-SOM): correct response rate plotted against latent rank for Items 1–15
LRA-SOM Procedure (Monotonic Increasing IRP)
1. Set Π_R^(1,0).
2. Repeat [2-1] to [2-4] for t = 1 through T.
  2-1. Randomly sort the row vectors of U and obtain U^(t).
  2-2. Repeat [2-2-1] to [2-2-2] for i = 1 through S.
    2-2-1. Select the winner rank for u_i^(t).
    2-2-2. Update Π_R^(t,i−1) and obtain Π_R^(t,i).
  2-3. Set Π_R^(t,S) as Π_R^(t+1,0).
  2-4. Repeat [2-4-1] for j = 1 through J.
    2-4-1. Sort π_j^(t+1,0).
The IRPs of all items are not always monotonically increasing. When it is desirable for all IRPs to be monotonic, the learning algorithm is modified slightly, as shown in the LRA-SOM Procedure (Monotonic Increasing IRP) above; a minimal code sketch of the added step is given below. In [2-4-1], the command "sort π_j^(t+1,0)" is a simple operation of sorting the IRP elements of Item j in ascending order. One may find this idea unrefined, but the learning algorithm loses its simplicity and becomes very complicated if a subroutine that updates the rank reference matrix while ensuring monotonically increasing IRPs is built into the algorithm.

In Table 6.1, the second line from the bottom, labeled "Test," represents the test reference profile, which will be described later. The bottom line, "Prior," is the final update of the prior probabilities used in winner rank selection. As mentioned above, a simple application of the SOM mechanism tends to place more students in the ranks at both ends (i.e., Ranks 1 and 6 in this case) than in the middle ranks. Thus, by manipulating the prior probabilities as explained in Sect. 6.2.3 (p. 204), a roughly equal number of students is distributed into each rank (i.e., an equiprobability rank scale). As the "Prior" line of the table shows, the prior probability for Rank 6 is the smallest, thereby reducing the number of students belonging to that end rank; the prior probability of Rank 1 is smaller than that of Rank 2 for the same reason.
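A minimal NumPy sketch of the sorting step [2-4-1] (names are ours):

```python
import numpy as np

def enforce_monotone_irp(Pi):
    """Sort each row (the IRP of one item) in ascending order so that every
    IRP is monotonically increasing over the ranks."""
    return np.sort(Pi, axis=1)
```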
6.3 Estimation by Generative Topographic Mapping

The LRA-SOM described in the previous section is a type of sequential learning, because the rank reference matrix learns (is updated by) the data vector of one student at a time. This method can learn each data vector minutely and sensitively, but the larger the sample size, the longer the computation time, and the sample size of test data is often large. When the sample size exceeds about 5000, the waiting time may feel long, although it also depends on computer performance and the number of repetitions T.
In contrast to sequential learning, batch learning is a method in which the data of all students, U, are learned at once. As a batch-type SOM, generative topographic mapping (GTM; Bishop et al., 1998) has been proposed, which is, in fact, an ordinary EM algorithm. The GTM itself does not have a mechanism for ordering the latent ranks; if it were simply applied here, such batch-type learning would be identical to LCA (Chap. 5, p. 155). In this section, LRA-GTM, a method of ordering the latent classes using the EM algorithm (GTM), is described. The following box shows the LRA-GTM procedure.

LRA-GTM Procedure (Monotonic Increasing IRP)
1. Set Π_R^(0).
2. Repeat the following steps until convergence.
  2-1. Obtain M_R^(t) from U and Π_R^(t−1).
  2-2. Obtain S^(t) by smoothing M_R^(t).
  2-3. Obtain Π_R^(t) from S^(t).
  2-4. Repeat [2-4-1] for j = 1 through J.
    2-4-1. Sort π_j^(t).

The procedure begins with the specification of the initial value of the rank reference matrix, Π_R^(0), at [1]. The j-th row vector of Π_R^(0), π_j^(0), which is the initial value of the IRP of Item j, is set as

$$\pi_j^{(0)} = \begin{bmatrix} 1/(R+1) \\ 2/(R+1) \\ \vdots \\ R/(R+1) \end{bmatrix} \quad (\forall j \in \mathbb{N}_J),$$

and the learning of the LRA-GTM then proceeds as

$$\Pi_R^{(0)} \xrightarrow{\text{EM Cycle 1}} M_R^{(1)} \to S^{(1)} \to \Pi_R^{(1)} \xrightarrow{\text{EM Cycle 2}} M_R^{(2)} \to S^{(2)} \to \Pi_R^{(2)} \to \cdots. \tag{6.3}$$
6.3.1 Update of Rank Membership Matrix

Before the convergence criterion in [2] is described, [2-1] is explained in this section. In this command at the t-th EM cycle, the rank membership matrix (M_R) is updated using the data matrix (U) and the tentative estimate of the rank reference matrix obtained at the previous cycle (Π_R^(t−1)). The rank membership matrix (M_R), of size S × R, is given by

$$M_R = \begin{bmatrix} m_{11} & \cdots & m_{1R} \\ \vdots & \ddots & \vdots \\ m_{S1} & \cdots & m_{SR} \end{bmatrix} = \{m_{sr}\} \quad (S \times R),$$

where the s-th row, r-th column element, m_sr, denotes the probability that Student s belongs to Rank r. The s-th row vector of the rank membership matrix, m_s, is called the rank membership profile of Student s. That is,
$$m_s = \begin{bmatrix} m_{s1} \\ \vdots \\ m_{sR} \end{bmatrix} = \{m_{sr}\} \quad (R \times 1).$$

The sum of the rank membership profile is 1 (1_R′ m_s = 1) because each student belongs to one of the ranks. Accordingly, the row sum vector of the rank membership matrix is M_R 1_R = 1_S.

In [2-1], the rank membership matrix at the t-th cycle, M_R^(t), is computed given Π_R^(t−1). The likelihood (i.e., probability) of observing data vector u_s under the condition that Student s belongs to Rank r is expressed as

$$l(u_s|\pi_r^{(t-1)}) = \prod_{j=1}^J \Big[(\pi_{jr}^{(t-1)})^{u_{sj}} (1 - \pi_{jr}^{(t-1)})^{1-u_{sj}}\Big]^{z_{sj}}.$$

Then, using the prior probability of belonging to Rank r, π_r, the posterior probability that Student s belongs to Rank r is expressed as

$$m_{sr}^{(t)} = \frac{l(u_s|\pi_r^{(t-1)})\,\pi_r}{\sum_{r'=1}^R l(u_s|\pi_{r'}^{(t-1)})\,\pi_{r'}}.$$

Consequently, given Π_R^(t−1), the rank membership profile of Student s is updated as

$$m_s^{(t)} = \begin{bmatrix} m_{s1}^{(t)} \\ \vdots \\ m_{sR}^{(t)} \end{bmatrix} = \begin{bmatrix} l(u_s|\pi_1^{(t-1)})\,\pi_1 \,\big/\, \sum_{r=1}^R l(u_s|\pi_r^{(t-1)})\,\pi_r \\ \vdots \\ l(u_s|\pi_R^{(t-1)})\,\pi_R \,\big/\, \sum_{r=1}^R l(u_s|\pi_r^{(t-1)})\,\pi_r \end{bmatrix} \quad (R \times 1).$$
Note that the indicator for the EM cycle has also been updated from t − 1 to t in this timing. By computing this for all S students, the rank membership matrix at the t-th cycle, M (t) R , is obtained.
6.3.2 Rank Membership Profile Smoothing Next, [2-2] in LRA-GTM Procedure (p.208) is described here. This command requires us to obtain S(t) by smoothing M (t) R . If this command line is skipped, LRA-
6.3 Estimation by Generative Topographic Mapping
211
Fig. 6.9 Smoothing by simple moving average
GTM becomes identical to LCA. In other words, the difference between LRA-GTM and LCA is the presence of line [2-2]. The smoothed rank membership profile at the t-th cycle for Student s is denoted as s(t) s , which is given by smoothing the student’s rank membership profile at the (t) (t) (t) t-th cycle, m(t) and M (t) s . The profiles s s and ms are the s-th row vectors in S R , respectively. Smoothing is a technique often used in time series analysis, spatial statistics, and image processing. The most basic smoothing method is the simple moving average (Fig. 6.9). The top of the figure represents the (unsmoothed) rank membership profile and the bottom the smoothed rank membership profile. The unevenness of the unsmoothed profile is softened by smoothing. In the figure, Rank 2 membership, for example, in the smoothed rank membership profile is calculated by s2 =
m1 + m2 + m3 . 3
(6.4)
That is, the smoothed membership of Rank 2 is the average of the unsmoothed memberships of Ranks 1, 2, and 3. Similarly, the smoothed memberships of Ranks 3, 4, and 5 are calculated as m1 + m2 + m3 , 3 m2 + m3 + m4 s4 = , 3 m3 + m4 + m5 s5 = . 3
s3 =
212
6 Latent Rank Analysis
In these equations, the weight, ⎡
⎤ 1/3 f = ⎣1/3⎦ , 1/3 is called the kernel. The sum of the elements in the kernel must be 1. Thus, at the ranks in both ends, the following kernel f =
1/2 1/2
is used to obtain the smoothed memberships as follows: m1 + m2 , 2 m5 + m6 . s6 = 2 s1 =
By using a kernel with a larger size, a wider range of rank memberships can be averaged. For example, a Size 5 filter is given by ⎡
⎤ 0.2 ⎢0.2⎥ ⎢ ⎥ ⎥ f =⎢ ⎢0.2⎥ . ⎣0.2⎦ 0.2 The larger the size of the kernel, the less the unevenness of the smoothed rank membership profile that is obtained. In this way, the simple moving average smooths the original rank membership profile using a kernel with equal elements, while the following nonuniform kernel, ⎡ ⎤ 0.2 f = ⎣0.6⎦ 0.2 also serves to smooth the rank membership profile. Figure 6.10 illustrates the smoothing when this kernel is used. Such a moving average using a nonuniform kernel is called weighted moving average. Note that the sum of this kernel is also 1. Using the kernel, the smoothed rank membership of, for example, Rank 2 is calculated by s2 = 0.2m 1 + 0.6m 2 + 0.2m 3 .
6.3 Estimation by Generative Topographic Mapping
213
Fig. 6.10 Smoothing by weighted moving average
Because 60% of s2 is m 2 , this membership is less smooth than than that of a simple moving average (the component of m 2 in Eq. (6.4) is 33%). That is, as the center element of the kernel is larger, the original shape of the rank membership profile is better preserved. We can see that the shape of Fig. 6.10 (bottom) is more similar to the original rank membership profile than that of Fig. 6.9 (bottom). Note that, with the sum of the kernel being 1, the elements for the ranks at both ends are modified as follows: 1 0.6 0.75 = Latent Rank 1: f = , 0.25 0.6 + 0.2 0.2 1 0.2 0.25 = . Latent Rank 6: f = 0.75 0.2 + 0.6 0.6 In general, suppose first a kernel is symmetric and defined as ⎤ fL ⎢ .. ⎥ ⎢.⎥ ⎢ ⎥ ∗ ⎥ f =⎢ ⎢ f0 ⎥ L × 1 , ⎢ .. ⎥ ⎣.⎦ fL ⎡
where f 0 is the center element. The length of this kernel is an odd number (L ∗ = 2L + 1). Based on this kernel, the following matrix
214
6 Latent Rank Analysis
⎡
f0 ⎢ .. ⎢. ⎢ ⎢ ⎢ fL ⎢ ⎢ ∗ F =⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎣
··· .. . .. . .. .
fL .. . . . . .. f0 . .. . . . . . f .. L
⎤
fL .. . . . .
. f0 . . .. . . . . fL · · ·
⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ (R × R) ⎥ ⎥ fL ⎥ ⎥ .. ⎥ .⎦ f0
is created. However, the sums of the first L − 1 columns and last L − 1 columns are not 1. Thus, the following operation is conducted to make them 1: F = F ∗ {1 L ∗ (1L ∗ F ∗ )} (R × R). The sums of all columns are 1 in this matrix, which is called the filter. In Exametrika (Shojima, 2008–), the Size 3 kernel, ⎡ ⎤ (1 − f 0 )/2 ⎦, f0 f =⎣ (1 − f 0 )/2 is used, where f 0 varies depending on the number of ranks as follows: ⎧ ⎪ ⎨1.05 − 0.05R (1 ≤ R ≤ 5) f 0 = 1.00 − 0.04R (5 ≤ R ≤ 10) . ⎪ ⎩ 0.80 − 0.02R (10 ≤ R ≤ 20)
(6.5)
For example, f = [.1 .8 .1] (R = 5), f = [.2 .6 .2] (R = 10), f = [.3 .4 .3] (R = 20). In other words, the larger the number of ranks (R), the flatter the kernel, and then, the more closely the kernel approximates to that of the simple moving average. The filter F created above is used to smooth the rank membership matrix (M (t) R ) to obtain the smoothed rank membership matrix (at the t-th cycle) as follows: ⎤ ⎡ (t) ⎤ ⎡ (t) ⎤ s1 m1 F m(t) 1 ⎢ .. ⎥ ⎢ .. ⎥ ⎢ .. ⎥ (t) (t) = M R F = ⎣ . ⎦ F = ⎣ . ⎦ = ⎣ . ⎦ = {ssr } (S × R). ⎡
S(t)
m(t) S
m(t) S F
s(t) S
6.3 Estimation by Generative Topographic Mapping
215
Fig. 6.11 Pre- and post-smoothed rank membership profile of student 1
The s-th row vector of the smoothed rank membership matrix is the smoothed rank membership profile of Student s, s(t) s . Let us show an example analyzing data with 15 items and 500 students (J15S500). Suppose that the number of latent ranks is six (R = 6). From Eq. (6.5), the kernel used in smoothing is then ⎡
⎤ 0.12 f = ⎣0.76⎦ . 0.12 Accordingly, the following filter is obtained: ⎤ ⎡ 0.864 0.120 ⎥ ⎢0.136 0.760 0.120 ⎥ ⎢ ⎥ ⎢ 0.120 0.760 0.120 ⎥. ⎢ F=⎢ ⎥ 0.120 0.760 0.120 ⎥ ⎢ ⎣ 0.120 0.760 0.136⎦ 0.120 0.864 The first and last columns are rescaled so that the sum of each becomes 1. Note that the sum of the smoothed rank membership profile is not exactly equal to 1 for each student (the sum of the unsmoothed one is 1). To make the sum equal 1, the filter needs to be slightly improved, but the difference is not significant. Figure 6.11 shows the rank membership profiles before and after smoothing for Student 1 at the first cycle (t = 1). We can see that the elements of each adjacent pair get closer to each other through smoothing, although the difference before and after smoothing is not significant, because the filter whose kernel has the central element 0.76 is a weak smoother. However, this slight difference results in a significant difference as the cycle t progresses.
216
6 Latent Rank Analysis
(t) The process of smoothing the rank membership matrix M (t) is R and obtaining S essential in LRA. If this process is skipped, each latent rank will not be associated with the ranks of both sides, and then, the LRA-GTM will be indistinguishable from LCA. In other words, through this process, neighboring latent classes are associated with each other and become latent ranks. The following is a summary of the smoothing.
Summary of Smoothing Rank Membership Matrix • By smoothing, neighboring classes are related (ordered) to each other and become ranks. • The flatter the kernel, the stronger the smoothing, and the less clear the difference between adjacent ranks. • The less flat the kernel (i.e., the larger the central element of the kernel f 0 ), the weaker the relation of adjacent ranks, and the more similar the ranks become to independent classes.
6.3.3 Update of Rank Reference Matrix In this section, [2-3] in LRA-GTM Procedure (p.208) is described, where the rank reference matrix is updated using the smoothed rank membership matrix (S(t) ). As previously described, the LRA-GTM procedure is identical to that of LCA without the process in [2-2]. In the LCA, the class membership matrix is used to update the class reference matrix (see Eq.(5.10), p.164). Referring to the equation, the rank reference matrix can likewise be updated as follows: A P,t) π (M jr
=
S1(t)jr + β1 − 1 S0(t)jr + S1(t)jr + β0 + β1 − 2
,
where S1(t)jr
=
S
(t) ssr zs j u s j ,
s=1
S0(t)jr =
S
(t) ssr z s j (1 − u s j ).
s=1
All S1(t)jr s and S0(t)jr s are obtained by the following matrix operations: (t) (t) (J × R), S(t) 1 = {S1 jr } = (Z U) S (t) (t) S(t) 0 = {S0 jr } = {Z (1 S 1 J − U)} S . (J × R)
(6.6)
6.3 Estimation by Generative Topographic Mapping
217
Fig. 6.12 Nonsmoothed and smoothed item reference profiles of item 1
Accordingly, the difference between the MAP of LCA (Eq. (5.10)) and that of LRAGTM (Eq. (6.6)) is simple, and the latter is obtained by simply replacing
U0(t)jc , U1(t)jc in Eq. (5.10) −→ S0(t)jr , S1(t)jr in Eq. (6.6). By executing Eq. (6.6) with respect to all items and ranks, the rank reference matrix can be updated. In addition, β0 and β1 in Eq. (6.6) are the hyperparameters of the beta distribution specified as the prior density for π jr . The density function of the beta distribution is defined as β −1
pr (π jr ; β0 , β1 ) =
π jr1 (1 − π jr )β0 −1 B(β0 , β1 )
.
If (β0 , β1 ) = (1, 1), MAP is identical to the MLE: L,t) = π (M jr
S1(t)jr S0(t)jr + S1(t)jr
.
(6.7)
Figure 6.12 is the IRP of Item 1 at Cycle t = 1 (π (1) 1 ), calculated from the smoothed rank membership matrix, S(1) . The figure also shows the IRP of Item 1 calculated from the (unsmoothed) rank membership matrix, M (1) R . By smoothing, the neighboring elements in the IRP become more similar (closer) to each other. The difference is slight at Cycle 1, but it becomes significant as t progresses. This section may be summarized as follows:
218
6 Latent Rank Analysis
Point of Updating Rank Reference Matrix • In LCA, the structural parameter (i.e., class reference matrix) is calculated from the class membership matrix, while in LRA-GTM, the structural parameter (i.e., rank reference matrix) is obtained from the smoothed rank membership matrix. • In the smoothed rank membership profile (s(t) s , ∀s ∈ N S ), the elements of adjacent ranks are slightly closer to each other (see Sect. 6.3.2). • By calculating the rank reference matrix from the smoothed rank membership matrix, the values of the rank reference vectors (column vectors of the rank reference matrix) of adjacent ranks get closer to each other.
6.3.4 Convergence Criterion In this section, Command [2] in LRA-GTM Procedure (p.208), the convergence criterion, is described. In each process from [2-1] to [2-3] at Cycle t, [2-1] (and [2-2]) corresponds to an E step in the EM algorithm, and [2-3] to an M step. Then, as the EM cycle t progresses, the structural parameter (i.e., rank reference matrix) at adjacent cycles gradually remains unchanged. In other words,13 (t−1) (t) . R ≈ R
When the estimate of the rank reference matrix at the previous cycle sufficiently approximates that at the present cycle, the matrix is considered to have converged on a local optimum, and the EM algorithm is then terminated. In addition, a criterion for judging convergence is required. Referring to Eq. (5.12) (p.164), the following criterion is a candidate: (t−1) |U) < c| ln pr ((t−1) |U)|, ln pr ((t) R |U) − ln pr ( R R
(6.8)
where c denotes a sufficiently small constant, such as c = 10−4 or c = 10−5 . In addition, ln pr ((t) R |U) is the ELL. That is, ln pr ((t) R |U) =
J R (t) (t) {(S1(t)jr + β1 − 1) ln π (t) jr + (S0 jr + β0 − 1) ln(1 − π jr )}. j=1 r =1
The ELL increases as t progresses; thus, ELLs of adjacent cycles show the following inequality:
13
≈ means “is approximately equal to.”
6.3 Estimation by Generative Topographic Mapping
219
(t−1) ln pr ((t) |U). R |U) > ln pr ( R
(6.9)
Thus, the LHS of Eq. (6.8) is positive and represents the extent of the ELL update. Meanwhile, the ELL is usually negative; thus, the RHS is enclosed in an absolute value sign. That is, Eq. (6.8) is a convergence criterion to stop the EM algorithm when the difference between the ELLs at Cycles t − 1 and t is smaller than 100c% of the absolute ELL value at Cycle t − 1.
6.3.5 Results Under GTM Learning This section shows the estimation of the rank reference matrix by the LRA-GTM procedure. The following box shows the settings of the analysis. Although the MAP was used as described in Point 6, note that it is identical to the MLE when (β0 , β1 ) = (1, 1) (see Point 7). Estimation Setting of LRA-GTM 1. 2. 3. 4. 5. 6. 7.
Data: J15S500 (www) Number of latent ranks: R = 6 Constant of convergence criterion in [2]: c = 10−4 Prior probability of membership in [2-1]: π = 1/6 × 16 Kernel in [2-2]: f = [0.12 0.76 0.12] Estimator in [2-3]: MAP estimator Prior of π jr in [2-3]: Beta distribution with (β0 , β1 ) = (1, 1)
Following the analysis settings shown in the above box, the estimate of the rank reference matrix was obtained as Table 6.2. The EM algorithm was terminated at the 17th cycle, meaning that for the convergence criterion of Eq. (6.8), the ELLs at the 16th and 17th cycles satisfied A P,17) A P,16) A P,16) |U) − ln pr ((M |U) < c| ln pr ((M |U)|. ln pr ((M R R R
The ELLs at 16th and 17th cycles were indeed obtained as A P,17) |U) = −3900.77, ln pr ((M R A P,16) |U) = −3901.11. ln pr ((M R
Thus, we can see that the convergence criterion was surely satisfied: −3900.77 − (−3901.11) < 0.0001 × | − 3901.11|
∴
0.34 < 0.39.
In other words, Table 6.2 is the 17th update of the rank reference matrix. That is,
220
6 Latent Rank Analysis
ˆ R under GTM learning Table 6.2 Estimate of rank reference matrix Latent Rank Item 1 Item 2 Item 3 Item 4 Item 5 Item 6 Item 7 Item 8 Item 9 Item 10 Item 11 Item 12 Item 13 Item 14 Item 15 Test
1 0.585 0.525 0.613 0.441 0.647 0.647 0.409 0.338 0.352 0.500 0.096 0.065 0.291 0.484 0.398 6.389
2 0.632 0.629 0.610 0.607 0.745 0.775 0.518 0.429 0.320 0.579 0.079 0.098 0.484 0.595 0.574 7.675
3 0.708 0.755 0.708 0.794 0.821 0.911 0.720 0.602 0.298 0.686 0.136 0.156 0.715 0.729 0.756 9.496
4 0.787 0.845 0.773 0.882 0.837 0.967 0.840 0.713 0.282 0.729 0.286 0.239 0.773 0.849 0.827 10.631
5 0.853 0.883 0.801 0.939 0.862 0.963 0.890 0.735 0.377 0.717 0.472 0.421 0.750 0.933 0.835 11.432
6 0.898 0.875 0.839 0.976 0.905 0.915 0.900 0.698 0.542 0.753 0.617 0.636 0.778 0.977 0.834 12.144
CRR* 0.746 0.754 0.726 0.776 0.804 0.864 0.716 0.588 0.364 0.662 0.286 0.274 0.634 0.764 0.706 9.664
*correct response rate
A P,17) ˆ R = (M . R
In the table, πˆ jr represents the CRR of a student belonging to Rank r for Item j. Each column of the rank reference matrix is an estimate of the rank reference vector. For example, the CRRs of a student belonging to Rank 1 are 58.5% for Item 1, 52.5% for Item 2, · · · , and 39.8% for Item 15. In addition, the j-th row vector of the matrix is the IRP of Item j, and the plots with markers in Fig. 6.13 show the IRPs of the 15 items. In this latent rank scale, six ranks are ordered, and Ranks 1 and 6 are the lowest- and highest-ability groups, respectively. Thus, the IRP of each item tends to increase. However, not all items increase monotonically through the latent rank scale. For example, in the IRP for Item 6, the CRR at Rank 5 is 0.963, but that of Rank 6 is 0.915. In addition, for Item 9, the IRP decreases from Rank 1 to 4. As Command (2-4) in LRA-GTM Procedure (Monotonic Increasing IRP) (p.208) shows, if the IRPs are required to increase monotonically, the following command is executed at the end of Cycle t: Sort π (t) j . This command requires the IRP elements of each item to be sorted in ascending order; thus, it ensures that the IRPs are monotonically increasing.
6.3 Estimation by Generative Topographic Mapping
221
Fig. 6.13 IRPs by LRA-GTM (—•—) and LCA (——–)
Techniques for IRPs Tending to be Monotonic 1. The filter with a flatter kernel is used. 2. The number of ranks is reduced. Meanwhile, the two methods listed above facilitate a monotonic increase of each IRP, although they do not ensure the monotonicity of the IRPs. The first method is to use a flatter kernel. The flatter the kernel, the closer (more similar) the IRP values of adjacent ranks become, and the smoother and more monotonic the IRP.
222
6 Latent Rank Analysis
Fig. 6.14 Influence of the number of latent ranks on IRP monotonicity
Method 2 is to reduce the number of ranks (R). When R is reduced, it is easier to obtain monotonically increasing IRPs. Conversely, when R is larger, the IRPs become more flexible and more closely fit minute data oscillations (i.e., measurement error). This tendency generally holds when a complex statistical model is used. For example, when R is three (Fig. 6.14, top), about one-third of students would belong to the highest rank, while when the same data are analyzed with six ranks (Fig. 6.14, bottom), the number of students in the highest rank (i.e., Rank 6) would be reduced by half. If Rank 6 students are good at all items except for the single item with the marker (—•—), the IRPs of most items would then retain their monotonicity, but the IRP of the marked item would drop from Rank 5 to 6. Although this explanation exemplifies the highest and second highest ranks, this kind of phenomenon can also happen at the middle and lower ranks. The point is that when the number of ranks is large, students with a specific response pattern are more likely to be grouped in each rank. This is not always a bad phenomenon. If students with a specific response pattern are gathered in each rank, it is easy to diagnose the characteristics of each rank and provide post-test instructions for them. Conversely, if R is small, the specificity of the responses by the students belonging to each rank is reduced and the generality is increased. In other words, students with a specific response pattern cannot be grouped into a single rank when R is small. Therefore, the IRP value of each rank is then obtained as an average of various types of students, and thus, the IRP is likely to be monotonically increasing. However, it
6.3 Estimation by Generative Topographic Mapping
223
is sometimes difficult to grasp the characteristics of each rank because students with a variety of response patterns are grouped into each rank when R is small. The line-only plots in Fig. 6.13 (p.221) are the IRPs obtained by the LCA analysis with C = 6 used for comparison. The difference between the LRA-GTM and LCA lies in whether the membership matrix is smoothed or not. In LRA-GTM, the neighboring classes are associated by smoothing the rank membership matrix, and the classes then become the ranks. The figure shows that the IRPs under LCA are not smooth because the classes are not related with each other; neighboring classes are independent from each other.14
6.4 IRP Indices There were two procedures used to estimate the rank reference matrix in LRA: the SOM and GTM. The rank reference matrix estimates using LRA-SOM and LRA-GTM were shown in Tables 6.1 (p.206) and 6.2 (p.220), respectively. However, the two procedures are the same when interpreting the results after estimating the rank reference matrix. In this section, the IRP indices (Kumagai, 2007) that help us understand the IRP shape are introduced.
6.4.1 Item Location Index The item location index shows the difficulty of an item. This index has two indicators: β˜ represents the rank whose CRR is closest to 0.5, and b˜ is the CRR at that rank. Figure 6.15 shows the IRPs of two items, where the number of ranks is 10. The IRP of Item 1 (left) is said to be easy because the CRR of students belonging to Rank 6 is about 80%, while Item 2 (right) is more difficult than Item 1 because the CRR of students belonging to Rank 6 is approximately 30%. In the IRP for Item 1, the CRR is the closest to 0.5 at Rank 3 at 0.466. Thus, the ˜ = (3, 0.466). In addition, for Item 2, the ˜ b) item location index for this item is (β, ˜ = (8, 0.493). It can ˜ b) CRR is 0.496 at Rank 8, which is the closest to 0.5; thus, (β, ˜ thus be said that the items with larger β are generally more difficult items. The threshold of 0.5 for the item location index would depend on the purpose of a test. If it is to measure the retention of basic knowledge, β˜ could then be the rank whose CRR first exceeds 0.8. Alternatively, if a test is a psychological questionnaire
14
Latent class and rank are originally different types of variables; thus, Class 1 and Rank 1, Class 2 and Rank 2, · · · , Class 6 and Rank 6 are not comparable. Therefore, they should not be simultaneously plotted with the same vertical axis.
224
6 Latent Rank Analysis
Fig. 6.15 Item location index
used to investigate suicidal ideation,15 the threshold would be set to 0.2 or smaller because “yes” (1) is rarely observed.
6.4.2 Item Slope Index Item slope index is an index related to discriminative capability. An item is said to be the most discriminating at the two adjacent ranks between which the IRP rise is the greatest. Figure 6.16 shows examples of high- and low-discrimination IRPs. In the top of the figure, the IRP increases significantly from Ranks 5 to 6, where the CRRs are 0.1 and 0.9 for the students belonging to Ranks 5 and 6, respectively. Thus, this item is highly discriminating as to whether each student’s rank is ≥ 6 or ≤ 5. Tests are analogous to vision tests (see Sect. 4.3, p.96) or the high jump. Before jumping, the jumping power of each athlete is unknown. However, it is evident after jumping. If an athlete cleared a bar at 2 m, this reveals that his or her jumping ability exceeded 2.0 m, although it is not yet known whether the (maximum) ability is 2.1 or 2.2 m. Conversely, when the athlete failed the 2-m-high bar, it was obvious that his or her jumping ability is < 2 m, although his or her true ability is still veiled. If a student can correctly answer the item shown at the top of the figure, we can then strongly expect the student to have an ability of Rank 6 or higher, because this item can only be passed by students ranking 6 or higher. However, it is still unknown whether the student’s rank is 6, 7, · · · , or 10. In other words, this item passes students 15
If the questionnaire consists of yes/no questions, it can be analyzed. Even if it includes 4-point Likert items, the items can be analyzed by coding the first two choices as 0 and the other two choices as 1.
6.4 IRP Indices
225
Fig. 6.16 High-discrimination and low-discrimination items
with Rank 6 or higher, but fails those ranking lower. Thus, it can be said that this item has the power of discriminating whether each student’s rank is ≤ 5 or ≥ 6. Just as an athlete’s jumping power can be unveiled by having the athlete try bars of various heights, the ability of a student can be measured by having him or her respond to highly discriminative items at various difficulty levels. Figure 6.16 (bottom) shows an item with low discrimination, where the IRP rises slightly from Ranks 5 to 6. Is it certain that a student passing this item has an ability of Rank 6 or higher? As a student with Rank 5 or lower has a 50% chance of passing, it is disputable whether the student’s rank is 6 or higher. Likewise, even if a student fails the item, it cannot be said with confidence that his or her rank is ≤ 5. Therefore, a test must have many highly discriminative items. Moreover, the information about where (at which rank) each item is highly discriminating is important, because there are no items that are highly discriminating across the entire latent scale. Each item is only “locally” discriminative somewhere on the latent rank scale. The item slope index indicates where the largest rise in an IRP occurs. For Item 1 in Fig. 6.17 (left), the slope of the item is steep between Ranks 1 and 5 but gradually
226
6 Latent Rank Analysis
Fig. 6.17 Item slope index
decreases starting with Ranks 6. Thus, Item 1 discriminates well in the low to mid ranks. In particular, the rise between Ranks 2 and 3 is the greatest, with a difference of 0.148. The item slope index includes two quantities: α, ˜ the smaller of the adjacent two ranks where the largest IRP rise is observed; and a, ˜ the difference between the CRRs of Ranks α˜ and α˜ + 1. Accordingly, the slope index of Item 1 is given as (α, ˜ a) ˜ = (2, 0.148). Likewise, from Fig. 6.17 (right), it can be seen that Item 2 is discriminative for mid to high ranks, and the rise is the largest between Ranks 8 and 9 with a difference of 0.166. The slope index for Item 2 is thus obtained as (α, ˜ a) ˜ = (8, 0.166).
6.4.3 Item Monotonicity Index The item monotonicity index is explained next. This index shows whether the IRP of an item is monotonically increasing or not, and if not, the index also quantifies the extent to which the IRP is not monotonically increasing. Figure 6.18 shows an IRP that is not monotonically increasing. The IRP drops between Ranks 2 and 3, Ranks 6 and 7, and Ranks 8 and 9. This index specifies γ˜ as the ratio of the number of pairs where the IRP drops to the number of adjacent pairs, and c˜ as the cumulative drops. In the figure, the IRP drops three times in nine adjacent rank pairs (Ranks 1 and 2, 2 and 3, · · · , 9 and 10); thus, the ratio is 0.333 (= 3/9). In addition, the cumulative drop is then −0.45 (= −0.2 − 0.15 − 0.1). Accordingly, the monotonicity index for the item is given as (γ˜ , c) ˜ = (0.333, −0.45). Note that (γ˜ , c) ˜ = (0, 0) when the IRP is monotonically increasing, and (γ˜ , c) ˜ = (1, −1) when it is monotonically decreasing.
6.4 IRP Indices
227
Fig. 6.18 Item monotonicity index
6.4.4 Section Summary IRT is a parametric model (see Parametric or Nonparametric , p.229); thus, the shape of the item response function is abstracted in the item parameters, while the LRA is a nonparametric model; therefore, there are no parameters that represent the IRP shape. When the number of items is large, it is very hard to check the IRPs of all items. IRP indices are not parameters but summarize the shape of the IRP; thus, they are useful to overview the shape of a number of items. Table 6.3 shows the IRP indices. The left half of the table is for the IRPs obtained by the LRA-SOM (Table 6.1), and the right half for the LRA-GTM (Table 6.2). The number of ranks in these two analyses is six; thus, the possible value sets for β˜ and α˜ are N6 = {1, 2, 3, 4, 5, 6} and N5 , respectively. By examining the IRP indices, we can easily grasp which items are easy and difficult, which items are highly discriminating, and which items are monotonically increasing. IRP Indices and CAT In a computer adaptive test (CAT, see PPT vs CBT , p.59), each time a student responds to an item, a tentative ability of the student is estimated, and the best item for that student is selected from the item bank and presented. For this reason, items that are very easy or difficult for the student and thus are not informative for estimating his or her ability are not presented, which can efficiently shorten the test time. The IRP indices are useful when running the CAT with LRA. Each time an item is responded to, a tentative rank is estimated, and an item with β˜ (and α) ˜ equal to the rank is selected from the bank as the next presented item. For more sophisticated methods, see Kimura and Nagaoka (2012) and Akiyama (2014).
228
6 Latent Rank Analysis
Table 6.3 IRP indices about Tables 6.1 and 6.2 (LRA) Model LRA-SOM LRA-GTM Index Item 1 Item 2 Item 3 Item 4 Item 5 Item 6 Item 7 Item 8 Item 9 Item 10 Item 11 Item 12 Item 13 Item 14 Item 15
α˜ 2 2 3 2 1 2 2 2 5 2 4 5 2 3 2
a˜ .074 .125 .080 .142 .064 .105 .151 .132 .080 .081 .211 .188 .158 .127 .122
β˜ 1 1 1 1 1 1 2 2 6 1 5 6 2 1 1
b˜ γ˜ b˜ γ˜ c˜ α˜ a˜ β˜ c˜ .595 3 .079 1 .585 2 .126 1 .525 .2 −.008 .538 2 .099 2 .610 .2 −.004 .608 2 .186 1 .441 .502 1 .099 1 .647 .685 .694 .2 −.019 2 .136 1 .647 .4 −.052 2 .203 2 .518 .546 2 .173 2 .429 .2 −.037 .468 .467 .4 −.024 5 .165 6 .542 .6 -.070 .535 2 .107 1 .500 .2 −.012 .479 4 .187 5 .472 .2 −.017 5 .215 5 .421 .594 2 .231 2 .484 .2 −.023 .509 2 .134 1 .484 .519 2 .182 2 .574 .2 −.001 .481
Blank cells are 0.
For Item 12, β˜ = 6, which means this item is difficult because the CRR is approximately 50%16 even for the students in the highest rank. In addition, from α˜ = 5, the largest rise in the IRP of the item occurs between Ranks 5 and 6, which indicates that the item is the most discriminative between Ranks 5 and 6. Moreover, the number of items that were not monotonically increasing was two for the LRA-SOM, while it was nine for the LRA-GTM. However, this does not mean that the IRPs estimated by the LRA-GTM are less likely to monotonically increase. This difference depends on the estimation settings employed in the two analyses. The tips to obtain monotonically increasing IRPs will be discussed again in Sect. 6.6.1 (p.233).
16
59.4% in LRA-SOM and 42.1% in LRA-GTM.
6.4 IRP Indices
229
Parametric or Nonparametric The IRP in LRA corresponds to the item response function (IRF) in IRT (e.g., the two-, three-, or four-parameter logistic model). The basic shape of the IRF (sigmoid curve) is determined by a logistic function, and the detailed form is adjusted by the slope parameter a, location parameter b, and lower asymptote parameter c. This type of model (given a functional form with parameters) is called a parametric model. Meanwhile, in LRA (and LCA), the IRP does not have a functional form. Therefore, it does not have parameters for adjusting the functional form neither. Such a model is called a nonparametric model. The rank reference matrix ( R ) in LRA (and class reference matrix in LCA) is the parameter to be estimated, but it is not the parameter that determines the shape of the function. Strictly speaking, however, R is the parameter that determines the shape of the piecewise linear function of the IRP. As can be seen above, the terms parametric and nonparametric include such ambiguities. In general, if a linear or nonlinear function is explicitly used (or declared to be used), then it is a parametric model; otherwise, it is a nonparametric model. In this book, IRT is treated as a parametric model, while the LRA and LCA are nonparametric models.
6.5 Can-Do Chart One of LRT goals is to create a can-do chart (or can-do table). A can-do chart is a diagram or table composed of can-do statements describing what students of each rank can and cannot do. If the can-do chart of a test can be created, the accountability and visibility of the test increase. When the accountability of a test is sufficiently high, it can be called a qualification test (Sect. 6.1.5, p.195). Table 6.4 shows an example of a can-do chart (Matsumiya & Shojima, 2009a, 2009b).17 This example is of a math test for elementary school students with 30 items and five ranks. The IRPs of the items are placed at the upper center of the chart, where the items are sorted by the CRR in the descending order, and the larger the IRP value, the darker the cell is shaded. The left side of the IRPs shows information about each item, such as its content and format, and the right side of IRPs contains the IRP indices. By checking IRP values for each rank by referring to the item contents, it becomes clear what items the students belonging to each rank passed and failed, although it is not a very easy task. Along with this, we can gradually understand what kind of academic ability the students of each latent rank have. For example, if the CRRs of 17
For other practical examples, see Sugino et al. (2012), Sugino and Shojima (2012), and Fujita et al. (2021).
230
6 Latent Rank Analysis
Table 6.4 Example of can-do chart (http://shojima.starfree.jp/ntt/Can-DoChart.pdf)
Rank 1 (i.e., the rank reference vector of Rank 1) are examined, CRRs are low for all items. The students belonging to this rank barely passed Items 3, 7, 9, and 13. However, there were not so many such students. The greater part of the students in the rank failed almost all items. Referring to such considerations, can-do statements can be written for Rank 1 as printed with small type in the bottom of the chart, which are picked out and shown in the following box. The title of the rank can also be named accordingly.
6.5 Can-Do Chart
231
Can-Do Statement for Latent Rank 1 Rank Title: Needs more effort Volume and measurement: Understanding is insufficient, but the student is able to calculate using addition with carrying over and addition of fractions with common denominators. Numerical relationships: Understanding is insufficient, but the student is able to determine the size of groups from a bar graph. For any other ranks, the can-do statements can be described, and the title can also be termed based on the statements. A can-do chart cannot be completed only by the experts of psychometrics and educational measurement; it requires the knowledge and experience of subject-teaching experts. A can-do chart is inevitable and essential to make a test accountable and visible. It is not too much to say that LRA is a mere tool to create a can-do chart. As described in Sect. 1.5.2 (p.10), a good analysis shows (A) a world (road map) of the test, which pictures what the test is measuring (where the students are running) and gives a view of the start and goal of the learning (running).18 Making a Can-Do Chart through LCA and IRT IRPs can also be obtained using LCA (Table 5.1, p.165), and a can-do chart can be created via LCA as well. However, each latent class is not always ordered. Thus, the chart under LCA does not usually become a stepwise (or graded) structure, but rather a patchwork of can-do statements for each class. However, it happens sometimes that the classes are ordered or they can be sorted afterward, and the statements can then be written to give the chart a stepwise structure. However, it might be better to use LRA to order the latent classes and make them latent ranks. To create a can-do chart under IRT, by taking some discrete points (e.g., θ = −2, −1, 0, 1, 2) on the continuous ability scale θ and calculating the CRRs for each item at each ability point, one can write the can-do statements at all the ability points by referring to the CRRs. In this case, however, the number of discrete points and where the points are placed are arbitrary. In the LRA, the number of ranks can be determined by the degree of fit, as described in Sect. 6.9 (p.252).
18
Note that (B) the location of each student is indicated by the latent rank estimate, (C) the next path to take by the rank-up odds (Sect. 6.7.2, p.235), and (D) the features of each item by the IRP and IRP indices.
232
6 Latent Rank Analysis
Fig. 6.19 Test reference profile
6.6 Test Reference Profile This section describes the test reference profile (TRP). In Tables 6.1 and 6.2, which are the rank reference matrix estimates under LRA-SOM and LRA-GTM, respectively, the bottom rows show the TRPs. The TRP is the column sum vector of the rank ˆ R ). That is, reference matrix (
ˆ R 1 J (R × 1). ˆt T R P = In Table 6.1, for example, the TRP value is 6.814 at Rank 1, which is calculated as the sum of the rank reference vector of Rank 1, πˆ 1 = [.595 .538 · · · .481] . Figure 6.19 shows the TRPs, and we see that there is no significant difference between them under LRA-SOM (left) and LRA-GTM (right). Each TRP value represents the expected number-right score (NRS) at the rank. For example, in the left figure, a student belonging to Rank 1 is expected to pass 6.8 items on this 15-item test. If the items have their weights, using the item weight vector, ⎤ w1 ⎢ ⎥ w = ⎣ ... ⎦ , ⎡
wJ
the TRP is obtained as follows:
ˆ ˆt (w) T R P = R w (R × 1). The TRP values in this case represent the expected scores. In each figure, the bar graph represents the latent rank distribution and will be explained in another section.
6.6 Test Reference Profile
233
6.6.1 Ordinal Alignment Condition Latent ranks, unlike latent classes, are ordered. The question is, what makes them ordered? In other words, what is the superiority of Rank 2 to 1, Rank 3 to 2, · · · , Rank R to R − 1? This is shown by the TRP. The monotonicity of the TRP is especially important. If the TRP is monotonically increasing, it means that the expected NRS is higher as the latent rank increases. We can then say that the students belonging to the higher ranks are superior in the sense that they can get a NRS greater than the students belonging to the lower ranks can. If the TRP is not monotonically increasing, we then need to explore other evidence showing the advantage of the students at higher ranks to those at lower ranks. If such evidence is found, we are allowed to use that evidence to show the ordinality of the ranks. Here, the expected NRS (TRP) is used as the basis for the ordinality of the ranks. Two Levels of Ordinality of Latent Rank Scale Strongly Ordinal Alignment Condition The IRPs of all items monotonically increase, and the TRP then monotonically increases as well. Weakly Ordinal Alignment Condition Although TRP monotonically increases, not all IRPs monotonically increase. There are two cases when the TRP is monotonically increasing. The first is when the IRPs of all items are monotonically increasing; the TRP, the sum of the IRPs, then becomes monotonically increasing as well. In this case, it is said that the latent rank scale satisfies strongly ordinal alignment condition (SOAC). Meanwhile, there is a case where the IRPs of all items are not monotonically increasing but the TRP is monotonically increasing. In this case, the latent rank scale is said to fulfill the weakly ordinal alignment condition (WOAC). For the latent scale to be said to be ordinal, it must at least satisfy WOAC. If the weak condition is not met, as mentioned above, it is necessary to find other evidence that the latent scale is ordinal. Alternatively, it is necessary to reanalyze data under different settings and satisfy the weak condition. The following techniques are useful to make it easier to satisfy the weak condition. For a TRP Satisfying the Weakly Ordinal Alignment Condition LRA in General: LRA-SOM: LRA-GTM:
The number of ranks is set to be smaller. The final value of the learning range parameter, σT , is set to be larger. A filter with a flatter kernel is used.
Figure 6.19 shows that TRPs in both LRA-SOM and LRA-GTM are monotonically increasing. However, as the monotonicity indices (γ˜ ) in Table 6.3 (p.228) show, the IRPs of Items 6 and 9 are not monotonically increasing under LRA-SOM, and the
234
6 Latent Rank Analysis
IRPs of nine items are not monotonically increasing under LRA-GTM. Therefore, it is said that the latent scales obtained in these two analyses pass WOAC.
6.7 Latent Rank Estimation This section describes the method of estimating the latent ranks of students. Estimating the latent rank of each student is essential to understand the current ability status of the student and to feed the results back to him or her. Note that the rank reference matrix ( R ) must be obtained before estimating the latent ranks.
6.7.1 Estimation of Rank Membership Profile The rank estimate of a student is based on the rank membership profile, which is ms = {m sr } (R × 1) for Student s, and it is the s-th row vector of the rank membership matrix, ⎤ m 11 · · · m 1R ⎥ ⎢ M R = ⎣ ... . . . ... ⎦ = {m sr } (S × R), m S1 · · · m S R ⎡
where m sr represents the probability that Student s belongs to Rank r . Because Student s is a member of a particular rank, the sum of the elements in ms is 1 (ms 1 R = 1). The rank estimate of Student s should be the rank with the largest membership. It is thus necessary to estimate the rank membership profile of the student in advance. ˆ R ), the membership of Student Given the estimate for the rank reference matrix ( s to Rank r can be estimated as l(us |πˆ r )πr mˆ sr = R , ˆ r )πr r =1 l(us |π where l(us |πˆ r ) =
J
u
{πˆ jrs j (1 − πˆ jr )1−u s j }zs j
j=1
is the probability (i.e., likelihood) that Student s belongs to Rank r . In addition, πr denotes the prior membership to Rank r . The bottom line (Prior) in Table 6.1 can be used as the prior when one is planning to construct an equiprobability rank scale where the number of students is roughly equal at every rank. Otherwise, simply,
6.7 Latent Rank Estimation
235
πr =
1 R
is fine. Thus, the rank membership profile of Student s, ms , is obtained as l(us |πˆ 1 )π1 ⎤ ⎢ R l(us |πˆ r )πr ⎥ ⎥ ⎢ r =1 ⎥ ⎢ .. ˆs =⎢ m ⎥, . ⎥ ⎢ ⎣ l(us |πˆ R )π R ⎦ R ˆ r )πr r =1 l(us |π ⎡
and then the latent rank estimate of the student, Rs , can be obtained by19 Rˆ s = arg max mˆ sr (R × 1). r ∈N R
ˆ R s) and latent Table 6.5 shows the estimates of the rank membership matrices ( M rank estimates. The upper part is the results under LRA-SOM and the lower part under LRA-GTM. From the LRA-SOM results, Student 1 was estimated to have a 27.9% probability of belonging to Rank 1, 37.9% to Rank 2, · · · , 0.1% to Rank 6. The membership in Rank 2 was the highest, and the latent rank estimate for this student is thus 2 ( Rˆ 1 = 2). Comparing the rank estimates of LRA-SOM and LRA-GTM, for example, the rank estimate of Student 7 is different: Rank 5 under LRA-SOM but Rank 6 under LRA-GTM. In fact, it is not necessary to estimate the rank membership matrix (M R ) in the framework of LRA-GTM because it was already obtained. Referring to Command [2-1] in LRA-GTM Procedure (p.208), M R had been repeatedly updated in the process of estimating the rank reference matrix ( R ), of which estimate (Table 6.2, p.220) was obtained at the M step of the 17th EM cycle. In other words, R = (17) R . obtained at the 17th E step can be used as the estimate of the rank Therefore, M (17) R ˆ R = M (17) membership matrix. That is, M R .
6.7.2 Rank-Up and Rank-Down Odds Figure 6.20 shows the plots of the rank membership profiles for the first 15 students. For example, comparing the rank membership profiles of Students 5 and 15, both were estimated to belong to Rank 1. When viewed closely, however, the membership of Student 5 in Rank 1 is higher than that of Student 15. The former student has the highest probability (51.0%) of belonging to Rank 1, and the second highest (40.4%) This equation means “select the r from N R = {1, 2, . . . , R} that maximizes mˆ sr and regard that r as the estimate of Rs .” 19
236
6 Latent Rank Analysis
Table 6.5 Rank membership and latent rank estimates Rank Student 1 Student 2 Student 3 Student 4 Student 5 Student 6 Student 7 Student 8 Student 9 Student 10 Student 11 Student 12 Student 13 Student 14 Student 15 .. . LRD RMD
1 0.279 0.056 0.047 0.004 0.510 0.001 0.076 0.026 0.000 0.086 0.147 0.195 0.000 0.001 0.648 .. . 82 79.93
2 0.379 0.230 0.179 0.028 0.404 0.005 0.137 0.085 0.003 0.319 0.282 0.377 0.003 0.007 0.311 .. . 85 90.96
Rank Student 1 Student 2 Student 3 Student 4 Student 5 Student 6 Student 7 Student 8 Student 9 Student 10 .. . LRD∗4 RMD∗5
1 0.270 0.028 0.023 0.002 0.558 0.000 0.090 0.019 0.000 0.042 .. . 96 83.75
2 0.357 0.158 0.139 0.016 0.397 0.003 0.105 0.056 0.001 0.252 .. . 60 78.69
∗1 latent
∗4 latent
LRA-SOM 3 4 0.254 0.078 0.431 0.228 0.338 0.272 0.119 0.238 0.078 0.007 0.040 0.194 0.167 0.204 0.250 0.409 0.028 0.129 0.413 0.140 0.323 0.186 0.293 0.114 0.029 0.184 0.030 0.080 0.038 0.003 .. .. . . 84 85 83.26 82.82 LRA-GTM 3 4 0.276 0.085 0.474 0.280 0.379 0.285 0.096 0.217 0.038 0.003 0.048 0.248 0.092 0.113 0.243 0.469 0.028 0.113 0.457 0.186 .. .. . . 91 77 81.85 84.92
5 0.010 0.046 0.120 0.328 0.001 0.420 0.251 0.195 0.345 0.033 0.052 0.019 0.419 0.246 0.000 .. . 83 85.38
6 0.001 0.008 0.043 0.281 0.000 0.339 0.165 0.035 0.494 0.009 0.010 0.002 0.365 0.636 0.000 .. . 81 77.66
LRE∗1 2 3 3 5 1 5 5 4 6 3 3 2 5 6 1 .. .
RUO∗2 0.669 0.529 0.804 0.856 0.793 0.808 0.656 0.478
5 0.010 0.054 0.121 0.362 0.001 0.429 0.272 0.191 0.291 0.050 .. . 73 84.24
6 0.001 0.007 0.054 0.307 0.001 0.271 0.327 0.023 0.565 0.012 .. . 103 86.54
LRE∗1 2 3 3 5 1 5 6 4 6 3 .. .
RUO∗2 0.773 0.590 0.752 0.846 0.712 0.633
rank estimate, ∗2 rank-up odds, ∗3 rank-down odds rank distribution, ∗5 rank membership distribution
0.339 0.575 0.777 0.871 0.479 .. .
0.407 0.407 .. .
RDO∗3 0.736 0.534 0.529 0.726 0.463 0.810 0.610 0.699 0.772 0.874 0.516 0.438 0.387 .. .
RDO∗3 0.757 0.332 0.367 0.599 0.579 0.833 0.518 0.515 0.553 .. .
6.7 Latent Rank Estimation
237
Fig. 6.20 Rank membership profiles (LRA-SOM)
of belonging to Rank 2, which means that Student 5 belongs to Rank 1 and is likely to move up to Rank 2. However, the latter Student 15 has a 64.8% probability of belonging to Rank 1 and 31.1% to Rank 2. In other words, Student 15 is less likely to move up to Rank 2 than Student 5. Rank-up odds are useful in this situation. When the rank estimate of Student s is Rˆ s , the rank-up odds are defined as the ratio of the membership in Rank Rˆ s + 1 (mˆ s, Rˆ s +1 20 ) to the membership in Rank Rˆ s (mˆ s Rˆ s ). That is, 20
When there are more than two subscripts, a comma is sometimes used between them for readability.
238
6 Latent Rank Analysis
RUOs =
mˆ s, Rˆ s +1 mˆ s, Rˆ s
.
This indicator measures the possibility of moving up from the current rank to the next higher rank, such that the larger the odds, the higher the possibility of the student moving up to the next rank. Calculating the odds for Students 5 and 15, they are obtained as mˆ 5,2 0.404 = 0.793, = mˆ 5,1 0.510 mˆ 15,2 0.648 = 0.479, = = mˆ 15,1 0.311
RUO5 = RUO15
These results show numerically that Student 5 is more likely to move up to Rank 2 than Student 15. Meanwhile, the rank-down odds are defined by RDOs =
mˆ s, Rˆ s −1 mˆ s, Rˆ s
,
which indicates the possibility of dropping down from the current rank to the lower rank by one level. It is seen from Fig. 6.20 that Students 9 and 14 are classified as Rank 6, but the membership of Student 9 is lower than that of Student 14, which implies that Student 9 is more likely to drop to Rank 5. This situation is numerically represented by the rank-down odds as follows: mˆ 9,5 0.345 = 0.699, = mˆ 9,6 0.494 mˆ 14,5 0.246 = 0.387. = = mˆ 14,6 0.636
RDO9 = RDO14
Bimodal Rank Membership Profile Although the rank membership profiles are unimodal for most students, bimodal ones may be obtained for some students, where the two memberships of a student are high in both low and high ranks. This kind of rank membership profile may appear for a student who can pass some difficult items, but fail some easy items. For this reason, the LRA indicates two possibilities, that the student has both high and low ability. Such a student might have disregarded basic learning contents and engaged in advanced learning contents.
6.7 Latent Rank Estimation
239
Fig. 6.21 Latent rank distribution and rank membership distribution (LRA)
6.7.3 Latent Rank Distribution The latent rank distribution (LRD) shown in Table 6.5 (p.236) represents the frequencies of the latent rank estimates for 500 students and is illustrated in Fig. 6.21. Under the LRA-SOM, for example, the number of students classified into Rank 1 was 82. The frequencies of the six ranks are obtained to be roughly the same because the priors shown in Table 6.1 (p.206) were used to construct an equiprobability rank scale. Meanwhile, under the LRA-GTM, the number of students belonging to each rank differs significantly because no operation was performed to adjust the frequencies of the ranks to be equal. The LRA-SOM takes a longer computation time because it is a sequential learning method, which implies that it analyzes the data carefully. Thus, it is easier to find the prior memberships to form an equiprobability scale. Meanwhile, the LRA-GTM is a batch-type learning method, and its computation time is thus very fast, but it is not easy to incorporate a subroutine to fine-tune the learning process. The line graphs in the figures are the plots of rank membership distributions (RMDs in Table 6.5). The RMD is the column sum vector of the rank membership ˆ R . That is, matrix estimate, M ˆ R 1 S (R × 1). dˆ R M D = M For example, the RMD value at Rank 1 under the LRA-SOM is 79.93, which is the sum of the memberships of all Rank 1 students, {.279, .056, .046, · · · }. The frequencies in the LRD are simply the counts of the rank estimates. Taking Student 1 as an example, the frequency of Rank 2 in the LRD is increased by +1 because the student’s membership to Rank 1 is the highest at 0.379 (37.9%), and thus, the latent rank estimate of the student is Rank 2. Meanwhile, in the RMD, the frequency of Rank 2 is increased by no more than 0.379 because the RMD considers an increment of +1 (100%) to be too great. Similarly, the RMD values at other ranks are added by memberships of the student to the respective ranks. That is, the RMD
240
6 Latent Rank Analysis
Fig. 6.22 Concept of model fit (identical to figure 4.24, p.144)
value at each rank is the accumulation of the possibilities of the students belonging to the rank. Therefore, the LRD can be regarded as the frequency distribution of the present data (i.e., sample), while the RMD is that of the sample population. Note that the sums of the elements of either the LRD or the RMD equals the sample size S (500 in this case).
6.8 Model Fit In this section, the model fit under LRA is described. The procedures are common to LRA-SOM and LRA-GTM. It is required to calculate the fitness to examine whether the present LRA model fits the data. As already shown in previous chapters, model-fit evaluation is performed based on the framework used in structural equation modeling (SEM). That is, when evaluating the fit of the present model (i.e., analysis model), the benchmark model fitting the data better and the null model fitting the data worse are first prepared, and then the fit of the analysis model is evaluated by examining where it is placed between the fits of the benchmark and null models. If the fit of the analysis model is close to that of the benchmark model, the fit of the analysis model can be considered good. Conversely, if it is close to that of the null model, the analysis model is regarded as fitting badly. Figure 6.22 illustrates the concept of model-fit evaluation. The likelihood (vertical axis) directly indicates the fitness of a model, because it measures the occurrence probability of the data under the model and its range is thus [0, 1]. The likelihood of a better-fit model is closer to 1, because the data are more likely to occur under a well-fit model. Accordingly, the likelihoods of the benchmark, analysis, and null models generally hold the following inequality21 : 0 < l N < l A < l B < 1. The second inequality is violated (i.e., l A < l N ) if the analysis model is very poor-fitting, and the third inequality is not met (i.e., l A > l B ) if the analysis model fits very well.
21
6.8 Model Fit
241
However, due to a very narrow range of likelihood, the likelihoods of the three models are compared after taking the logarithm, because the logarithmic transformation expands the interval of [0, 1] to (−∞, 0].22 Thus, the log-likelihoods of the three models generally have the following relationship: −∞ < ell N < ell A < ell B < 0. The expected log-likelihood (ELL) of the analysis model can be evaluated from ˆ R )23 ˆ R ) and rank membership matrix ( M the estimates of the rank reference matrix ( 24 as follows : ˆ R) = ell A (U|
S R J
mˆ sr z s j {u s j ln πˆ jr + (1 − u s j ) ln(1 − πˆ jr )},
s=1 r =1 j=1
where the ELL for Item j is ell A (u j |πˆ j ) =
S R
mˆ sr z s j {u s j ln πˆ jr + (1 − u s j ) ln(1 − πˆ jr )}.
s=1 r =1
6.8.1 Benchmark Model and Null Model The models that sandwich the ELL of the analysis model from the right (positive) and left (negative) sides are the benchmark and null models, respectively. The choice of the two models is left in the hands of the analyst, but following previous chapters, the models shown in the box below are used. Null Model and Benchmark Model Null Model: The model assuming students belong to a single group. Benchmark Model: The model assuming multiple groups in each of which the number-right scores (NRSs) of the students are equal. The null model must be worse fitting than the analysis model. Let us suppose here that the null model is a single-group model, that is, all students are assumed to belong to one group. Such a model can also be regarded as the LRA model with the number of ranks equal to one (R = 1). Then, the rank reference matrix of the LRA model with R = 1 is a matrix with one column (i.e., vector), and it is, in effect, the CRR vector as follows: 22
But they have the same cardinality, as per Georg Cantor. ˆ Note that this is not the final update of the smoothed rank membership matrix, S. 24 This is not the expected log-posterior (ELP) but the ELL; thus, the prior density of the rank reference matrix is not included. 23
242
6 Latent Rank Analysis
⎤ p1 ⎢ ⎥ p = ⎣ ... ⎦ (J × 1). ⎡
pJ In addition, due to R = 1, the rank membership of Student s to Rank 1 is a probability of 1.0 (m s1 = 1, ∀s ∈ N S ). Thus, the ELL of the null model is given as ell N (U| p) =
J
ell N (u j | p j ) =
J S
z s j {u s j ln p j + (1 − u s j ) ln(1 − p j )},
j=1 s=1
j=1
where ell N (u j | p j ) is the ELL of the null model for Item j. Moreover, the benchmark model is a multigroup model in which the number of groups is the number of observed NRS patterns.25 When the number of items is J , the maximum number of NRS patterns is J + 1, {0, 1, · · · , J }. However, all the NRS patterns are not always observed. Thus, suppose the number of observed NRS patterns is G. Then, the group membership profile of Student s is given as ⎡
⎤ m s1 ⎢ m s2 ⎥ ⎢ ⎥ ms = ⎢ . ⎥ = {m sg } (G × 1), ⎣ .. ⎦ m sG where m sg =
1, if Student s’s NRS is g . 0, otherwise
Because Student s belongs to a particular NRS group, one element in the group membership profile ms = {m sg } is 1, and all the other elements are 0.26 The group membership matrix is a matrix collecting the group membership profiles of all S students. That is, ⎤ m 11 · · · m 1G ⎥ ⎢ M G = ⎣ ... . . . ... ⎦ = {m sg } (S × G), m S1 · · · m SG ⎡
25 In Exametrika (software), a LRA-GTM model with R being the number of observed NRS patterns, and under the constraint that the IRPs of all items are monotonically increasing, is specified as the benchmark model. 26 Meanwhile, m in the rank membership profile m represents the membership of Student s in sr s Rank r . What group and rank membership profiles have in common is that the sum of the elements is 1.
6.8 Model Fit
243
where the s-th row vector is the group membership profile of Student s. Thus, each row vector of this matrix has a 1 in it. In addition, the CRRs by NRS group for Item j are ⎤ p j1 ⎢ p j2 ⎥ ⎥ ⎢ p j = ⎢ . ⎥ = { p jg } (G × 1), ⎣ .. ⎦ ⎡
p jG where p jg denotes the CRR of Group g for Item j, and it is calculated by S p jg =
s=1
S
m sg z s j u s j
s=1
m sg z s j
=
mg (z j u j ) mg z j
.
Using these, the ELL of the benchmark model for Item j can be computed by ell B (u j | p j ) =
S G
m sg z s j {u s j ln p jg + (1 − u s j ) ln(1 − p jg )}.
s=1 g=1
Furthermore, the ELL of the benchmark model for the entire test is ell B (U| P G ) =
J
ell B (u j | p j ),
j=1
where ⎤ p11 · · · p1G ⎥ ⎢ P G = ⎣ ... . . . ... ⎦ = { p jg } (J × G) pJ1 · · · pJ G ⎡
is the group reference matrix that collects CRRs of all groups for all items.
244
6 Latent Rank Analysis
Common Benchmark and Null Models The benchmark and null models specified in LRA are the same as those specified in IRT and LCA. When they share the same bench and null models, the model fit between different models, for example, the fit for the 3PLM of IRT, the LCA model with five classes, and the LRA model with seven ranks can be compared. In general, LCA is the most flexible, followed by the LRA, and IRT model is the least flexible. Therefore, the fit of a LCA model tends to be the best when comparing the three models. Conversely, the IRT model can be said to be the most robust, and the LCA can be considered to have a tendency to overfit only the present data.
6.8.2 Chi-square Statistics As shown in Fig. 6.22 (p.240), the χ 2 value of the analysis model is defined as twice the difference between the ELLs of the benchmark and analysis models. The goodness of fit of the analysis model is evaluated in comparison to the benchmark model. The χ 2 values of theanalysis model (and null model) are summarized in χ 2 and DF of AM and NM . Using the ELL and DF for each item, the χ 2 value of an item can be obtained. χ 2 and DF of Analysis and Null Models Analysis Model J J χ A2 j = 2 {ell B (u j | p j ) − ell A (u j |πˆ j )} χ A2 = j=1
d fA =
J
j=1
d fAj
J J = {row( p j ) − row(πˆ j )} = (G − R) (LRA-SOM)
j=1
ed f A =
j=1
J
ed f A j =
j=1
j=1 J j=1
J {row( p j ) − tr F} = (G − tr F) (LRA-GTM) j=1
Null Model J J χ N2 j = 2 {ell B (u j | p j ) − ell N (u j | p j )} χ N2 = d fN =
j=1 J j=1
j=1
d fN j
J J = {row( p j ) − row( p j )} = (G − 1) j=1
j=1
6.8 Model Fit
245
Table 6.6 shows the χ 2 values and DFs under LRA-GTM.27 For instance, the χ 2 values of the analysis and null models for the entire test were calculated as χ A2 = 2 × (ell B − ell A ) = 2 × {−3560.01 − (−3857.98)} = 595.95, χ N2 = 2 × (ell B − ell N ) = 2 × {−3560.01 − (−4350.22)} = 1580.42. Moreover, the IRT, LCA, and LRA share the same benchmark and null models. For this reason, the fit of the three models can be compared. From Table 4.8 (p.147) for the χ 2 values for the 2PLM and 3PLM of IRT and Table 5.3 (p.179) for that of LCA with C = 5, the χ 2 values are sorted in the ascending order as follows: 2 χ LC A5 = 207.98,
χ L2 R A6 = 595.95, 2 χ3P L M = 660.64, 2 χ2P L M = 674.46.
The χ 2 value of LCA is exceptionally small (i.e., good). In general, the larger the number of classes (C)/ranks (R), the better the fit (i.e., the smaller the χ 2 value) of the LCA/LRA model; thus, this result is a particular case with C = 5 and R = 6. The fit of the 2PLM or 3PLM would thus be better than an LCA/LRA with small C/R. In this manner, when the benchmark and null models are common, the fit between different analysis models can be compared. Goodness of Fit or Badness of Fit? This book follows the SEM style in evaluating the model fit, and the χ 2 value represents the distance from the benchmark model. That is, the larger the χ 2 value, the farther the analysis model from the well-fitting benchmark model, which means the fit of the analysis model is worse. In other words, (the magnitude of) the χ 2 value represents the badness of fit. Alternatively, the χ 2 value can be measured by setting the null model as reference. Then, the χ 2 value of the analysis model is defined as χ A2 = 2(ell A − ell N ). In this case, the larger the χ 2 value, the farther the analysis model from the bad-fitting null model. The magnitude of the χ 2 value in this case thus represents the goodness of fit of the analysis model. Some readers may be unsure whether largeness or smallness of the χ 2 value represents better fitting of the analysis model. One should note which one, the benchmark or the null model, is used as a reference (i.e., standard). In the SEM style (in this book), the saturated (benchmark) model is set as the reference.
27
The χ 2 values and DFs under LRA-SOM can also be calculated similarly but are omitted here.
246
6 Latent Rank Analysis
Table 6.6 Log-likelihood and χ 2 of benchmark, null, and analysis models Model
BM∗1
LRA-GTM (R = 6)
Null Model
ell B
ell N
χ N2
d fN
ell A
χ A2
ed f A
−3560.01
−4350.22
1580.42
195
−3857.98
595.95
138.49
Item 01 Item 02 Item 03 Item 04 Item 05
−240.19 −235.44 −260.91 −192.07 −206.54
−283.34 −278.95 −293.60 −265.96 −247.40
86.31 87.02 65.38 147.78 81.73
13 13 13 13 13
−264.50 −253.14 −282.79 −207.08 −234.90
48.61 35.41 43.76 30.02 56.73
9.23 9.23 9.23 9.23 9.23
Item 06 Item 07 Item 08 Item 09 Item 10
−153.94 −228.38 −293.23 −300.49 −288.20
−198.82 −298.35 −338.79 −327.84 −319.85
89.76 139.93 91.13 54.70 63.30
13 13 13 13 13
−168.22 −250.86 −312.62 −317.60 −309.65
28.56 44.97 38.79 34.22 42.91
9.23 9.23 9.23 9.23 9.23
Item 11 Item 12 Item 13 Item 14 Item 15
−224.09 −214.80 −262.03 −204.95 −254.76
−299.27 −293.60 −328.40 −273.21 −302.85
150.36 157.60 132.73 136.52 96.17
13 13 13 13 13
−242.82 −236.52 −287.78 −221.70 −267.79
37.47 43.45 51.50 33.50 26.06
9.23 9.23 9.23 9.23 9.23
Statistic Test
∗1
benchmark model
6.8.3 DF and Effective DF The DF represents the complexity of the model. When DF is small, the number of parameters used in the analysis model is large, which means that the analysis model is then not very different, considering the model complexity (or flexibility), than the benchmark model. Conversely, the larger the DF, the smaller the number of parameters in the analysis model, which means that the analysis model is simpler. The DF of the analysis model is given by the difference in the number of parameters between the benchmark and analysis models. That is, d f A j = (N of parameters of BM) − (N of parameters of AM) = row( p j ) − row(πˆ j ) =G−R
(∀ j ∈ N J ),
(6.10)
where row( p j ) = G denotes the number of parameters of the benchmark model for Item j.28 In addition, row(πˆ j ) = R is the number of parameters of the analysis model for Item j. However, under the LRA, is the number of parameters for an item no less than R? The LRA is different from the LCA in that adjacent latent ranks are ordered. In other words, unlike LCA, the IRP values of each adjacent rank pair are mutually 28
The operator row(·) returns the number of rows about the input matrix. When a vector is input, the operator returns the number of vector elements.
6.8 Model Fit
247
dependent and similar to each other to some extent. That is, the R values in the IRP of each item are not estimated independently but are estimated while being restricted by each other. Therefore, it can be said that the number of parameters per item is in practical terms less than R. In this case, effective degrees of freedom (EDF; Hastie et al., 2001) are used instead of DF. First, under the EDF concept, the number of parameters in the analysis model is defined as 29 N of parameters of AM = tr F, where F (Sect. 6.3.2, p.210) is the filter, which was used to smooth the rank membership profiles. The EDF of Item j is then given as ed f A j = row( p j ) − tr F = G − tr F. For example, when the number of ranks is six, because the filter is ⎤ .864 .120 ⎥ ⎢.136 .760 .120 ⎥ ⎢ ⎥ ⎢ .120 .760 .120 ⎥, F=⎢ ⎥ ⎢ .120 .760 .120 ⎥ ⎢ ⎣ .120 .760 .136⎦ .120 .864 ⎡
the number of parameters of the analysis model is given as tr F = 0.864 + 0.760 + · · · + 0.760 + 0.864 = 4.768. When R = 6, the six IRP values are not independently estimated. The number of parameters is thus regarded to be 4.768. Consequently, the EDF is evaluated as ed f A j = G − 4.768. Note that the EDF can be computed only under LRA-GTM because the filter for generating smoothness (i.e., dependency) has been explicitly defined. In learning under LRA-SOM, however, the mechanism for generating smoothness cannot be easily quantified. Therefore, Eq. (6.10) is used as the DF under LRA-SOM. In addition, the null model DF is calculated in the same manner by comparing its number of parameters with that of the benchmark model. The parameter of the null model per item is only p j ; thus, the number of parameters is 1 (i.e., row( p j ) = 1). In other words, the null model is a model that attempts to explain the data vector of Item j (u j ) only by one parameter ( p j ).
29
If A is a square matrix, tr A returns the sum of the diagonal elements of A.
248
6 Latent Rank Analysis
Table 6.6 (p.246) shows the EDFs of the analysis model under LRA-GTM.30 First, from the DF of the null model per item being 13, the number of parameters in the benchmark model is found to be 14 as follows: d f N j = G − 1 = 13
∴
G = 14.
This is because there were no students who scored 0 and 1 points, although this test has 15 items and the maximum number of NRS patterns is 16. In addition, under R = 6, from tr F = 4.768 as described above, the EDF of the analysis model per item becomes ed f A j = G − tr F = 14 − 4.768 = 9.232 (∀ j ∈ N J ). Consequently, the EDF for the entire test is given as ed f A =
J
ed f A j = J × ed f A j = 15 × 9.232 = 138.491.
j=1
Number of Parameters of IRT Model The number of parameters is three in the 3PLM of IRT (λ j = [a j b j c j ] ), while it is three in the LCA with C = 3 (π j = [π j1 π j2 π j3 ] ), and it is also three in the LRA-SOM with R = 3. However, can these numbers be considered equivalent? In the LCA with C = 3, every minute detail about the shape of IRP is involved in the three parameters π j . In IRT, however, the basic shape of IRF (the IRP counterpart) is determined by a logistic function, and the three item parameters (λ j ) only adjust the slope, location, and lower asymptote of the logistic function. Therefore, in the case of 3PLM, strictly speaking, there are four things that determine the IRF shape: the slope, location, lower asymptote parameters, and logistic function. Therefore, it is difficult to strictly compare DFs of different models because it is necessary to precisely evaluate how many parameters the behavior or basic shape of the logistic function is converted (corresponding) to. This is similar to the difficulty in evaluating the number of parameters the smoothness of LRA-SOM is converted to. The goodness-of-fit comparisons between different models described in this section are useful and convenient, but note that they are not rigorous comparisons.
30
Under LRA-SOM, the DF for each item is 8 (= 14 − 6).
6.8 Model Fit
249
6.8.4 Model-Fit Indices and Information Criteria Using the χ 2 values and DFs of the analysis and null models evaluated by specifying the benchmark model as the reference, various model-fit indices can be calculated. The indices listed in Standardized Fit Indices are also used in SEM. The value range of each index, except for the RMSEA, is [0,1], where it represents a better fit when its value is closer to 1. In SEM, a model is regarded to be good-fitting when the value is ≥ 0.95 (or 0.90). The RMSEA has a value range of [0, ∞), and the closer the value is to 0, the better the fit. If it is ≤ 0.06, a model is considered to be good. These indices are grouped as standardized indices because they have a (half-)bounded value range. Standardized Fit Indices Normed Fit Index (NFI; Bentler and Bonett, 1980)∗1 ! 2" χ (∈ [0, 1]) N F I = 1 − cl 2A χN Relative Fit Index (RFI; Bollen, 1986)∗1 " ! 2 χ /d f A ) (∈ [0, 1]) R F I = 1 − cl A2 χ N /d f N Incremental Fit Index (IFI; Bollen, 1989a)∗1 " ! 2 χA − d f A (∈ [0, 1]) I F I = 1 − cl 2 χN − d f A Tucker-Lewis Index (TLI; Bollen, 1989a)∗1 ! 2 " χ A /d f A − 1 T L I = 1 − cl 2 (∈ [0, 1]) χ N /d f N − 1 Comparative Fit Index (CFI; Bentler, 1990)∗1 " ! 2 χA − d f A (∈ [0, 1]) C F I = 1 − cl 2 χN − d f N Root Mean Square Error of Approximation (RMSEA; Browne and Cudeck, 1993)∗2 # max(0, χ A2 − d f A ) (∈ [0, ∞)) RMSE A = d f A (S − 1) Apply the EDF instead of DF when LRA-GTM is used. ∗1 Larger values close to 1.0 indicate a better fit. ∗2 Smaller values close to 0.0 indicate a better fit.
250
6 Latent Rank Analysis
Table 6.7 Model with low AIC Is efficient
Meanwhile, an index that does not have a bounded range and is used to relatively compare the goodness of fit of two or more models is called an information criterion. Various information criteria have been proposed, and the indices listed in Information Criteria are the three most frequently used criteria in SEM. A single value of an information criterion is useless for evaluating the fit of a model. When there are two or more models (that share the same benchmark and null models), the values evaluating the models can be compared. The model with the lowest value of the information criterion is judged to be the best-fitting model. Information Criteria Akaike Information Criterion (AIC; Akaike, 1987) AI C = χ A2 − 2d f A Consistent AIC (CAIC; Bozdogan, 1987) C AI C = χ A2 − d f A ln(S + 1) Bayesian Information Criterion (BIC; Schwarz, 1978) B I C = χ A2 − d f A ln S Apply EDF instead of DF when LRA-GTM is used. A lower value indicates a better fit.
The goodness of fit that each information criterion is evaluating is the efficiency of the model. An efficient model in this context refers to a model that is simple (i.e., has a small number of parameters) but fits well (i.e., has a small χ 2 value). The AIC, CAIC, and BIC share the same concept: They all evaluate the efficiency of the model. Table 6.7 represents the concept of the AIC, meaning that the better the fit and the larger the DF, the smaller the AIC. The χ 2 value of a model becomes
6.8 Model Fit
251
Table 6.8 Fit indices of LRA-GTM (R = 6) Type Index Test Item 01 Item 02 Item 03 Item 04 Item 05 Item 06 Item 07 Item 08 Item 09 Item 10 Item 11 Item 12 Item 13 Item 14 Item 15
NFI 0.623 0.437 0.593 0.331 0.797 0.306 0.682 0.679 0.574 0.374 0.322 0.751 0.724 0.612 0.755 0.729
∗ RMSEA
RFI 0.469 0.207 0.427 0.058 0.714 0.023 0.552 0.548 0.401 0.119 0.046 0.649 0.612 0.454 0.654 0.618
Standardized Index IFI TLI CFI 0.683 0.535 0.670 0.489 0.244 0.463 0.664 0.502 0.646 0.385 0.072 0.341 0.850 0.783 0.846 0.345 0.027 0.309 0.760 0.646 0.748 0.727 0.604 0.718 0.639 0.467 0.622 0.451 0.156 0.401 0.377 0.057 0.330 0.800 0.711 0.794 0.769 0.667 0.763 0.658 0.503 0.647 0.809 0.723 0.804 0.806 0.715 0.798
RM∗ 0.081 0.092 0.075 0.087 0.067 0.102 0.065 0.088 0.080 0.074 0.085 0.078 0.086 0.096 0.073 0.060
Information Criterion AIC CAIC −264.99 318.97 30.15 −8.79 16.94 −21.99 25.29 −13.64 11.56 −27.38 38.26 −0.67 10.09 −28.84 26.50 −12.43 20.33 −18.61 15.75 −23.18 24.44 −14.49 19.01 −19.92 24.99 −13.95 33.04 −5.89 15.03 −23.90 7.59 −31.34
BIC −264.71 −8.77 −21.97 −13.62 −27.36 −0.65 −28.82 −12.41 −18.59 −23.16 −14.47 −19.91 −13.93 −5.88 −23.88 −31.32
(root mean square error of approximation)
small when the model fit is good. In addition, the DF of a model is large when the model is simple (i.e., parsimonious) because the number of parameters of the model is then small. The AIC of a model is thus lowered as the model is more efficient (both the simplicity and goodness-of-fit model are acceptable). Table 6.8 shows the standardized fit indices and information criteria under LRAGTM31 The values of the indices are nearly equal to those under IRT (Table 4.9, p.149). It is seen that the fit of Items 4, 6, 11, 14, and 15 were relatively good among the 15 items. Because the benchmark and null models are common through these chapters, the indices between different models can be compared. Comparing the AICs of 2PLM and 3PLM of IRT, LCA with C = 5, and the present model (i.e., LRA-GTM with R = 6), we have AI C LC A5 = −62.02, AI C2P L M = 314.46, AI C L R A6 = 318.97, AI C3P L M = 330.64,
31
The results under LRA-SOM are omitted.
252
6 Latent Rank Analysis
in ascending order. Just like the order of the χ 2 values, the AIC of the LCA with C = 5 is the smallest. Therefore, in terms of the AIC, the LCA model with C = 5 is said to be the most efficient among the four models. In addition, the χ 2 value of the LRA-GTM with R = 6 was smaller (i.e., better) than that of 2PLM, but because the LRA has a larger number of parameters, the 2PLM is judged to be more efficient than the LRA-GTM with R = 6.
6.9 Estimating the Number of Latent Ranks This section explains how to determine the number of latent ranks (R). Referring to the fit indices introduced in the previous section, specifically information criteria, the optimal R can be deduced. When analyzing data under various Rs and evaluating an information criterion in each case, the analysis with an R that produces the lowest value of the information criterion can be selected as the optimal R. This is a bottom-up approach for determining the number of ranks. Meanwhile, an optimal R may be determined from a practical standpoint. For instance, the R is (or has to be) three if there are only three teachers in a grade to reorganize the school classes according to the ability grade; or, the R may be found by considering information, experience, and knowledge outside the data (see Importance of Knowledge about Outside Data , p.255): for example, from the per spective of developmental psychology, the optimal R is theoretically found to be five because the number of cognitive stages in this domain for the students of this age is five. This is a top-down approach of determining the number of ranks. As a blended approach of the top-down and bottom-up approaches, a range of R is first selected (in a bottom-up way) in which the information criterion under each of R is not bad, and an R in the range is then picked up as the optimal R (in a top-down way) based on the expertise of the analysis members. Note that it is not a good idea to let the whole decision-making process depend on statistical indices. When relying excessively on the indices, the analyst and administrator cease to think and make decisions, in which case their ownership and responsibility for the test will be gradually attenuated. Table 6.9 shows the χ 2 values and information criteria when analyzing data 32 settings for the anal(J15S500) by LRA-GTM when changing R from 2 to 10. The yses basically conformed to Estimation Setting of LRA-GTM (p.219) by changing Condition 2 (i.e., the number of ranks, R). Note that the prior membership in each rank (Condition 4) and the kernel (Condition 5) also change with R. Note also that the null and benchmark models are invariant throughout the analyses under various Rs. The null model is a single-group model, and the benchmark model is a G-group model, where G is the number of observed NRS patterns. If these two 32
Although these analyses can be performed under LRA-SOM, it is time consuming because the data are analyzed with various Rs. It is recommended that LRA-GTM be used because the LRAGTM is much faster than LRA-SOM.
6.9 Estimating the Number of Latent Ranks
253
Table 6.9 Information criteria with various numbers of latent ranks R 2 3 4 5 6 7 8 9 10 ∗1 ∗2
ell A −3954.64 −3888.14 −3872.66 −3867.15 −3857.98 −3849.18 −3841.02 −3832.83 −3824.49
χ A2 789.27 656.26 625.30 614.29 595.95 578.34 562.04 545.65 528.97
ed f A 180.77 168.08 156.93 147.33 138.49 130.88 124.51 119.39 115.50
AIC 427.73 320.10 311.44 319.62 318.97 316.58 313.01 306.88 297.97
CAIC −334.50 −388.62 −350.28 −301.63 −264.99 −235.31 −212.02 −196.52 −189.05
BIC −334.14 −388.28 −349.97 −301.33 −264.71 −235.05 −211.77 −196.28 −188.82
WOAC∗1 1 1 1 1 1 1 1 1 1
MIR∗2 0.000 0.067 0.200 0.333 0.600 0.600 0.600 0.667 0.733
coded 1 if the TRP satisfies the WOAC monotonic IRPs ratio
models are different, the fit indices under various Rs are not comparable. Therefore, the log-likelihoods of the null and benchmark models were the same as in Table 6.6 (p.246). That is, ell B = −3560.01, ell N = −4350.22, χ N2 = 1580.42, d f N = 195. In addition, the χ 2 values under R being, for example, two and three were obtained as 2 = 2 × {−3560.01 − (−3954.64)} = 789.27, χ A,R=2 2 χ A,R=3 = 2 × {−3560.01 − (−3888.14)} = 656.26.
It can be seen that χ A2 decreases as R increases. This is because the larger the R, the more flexible the model becomes and the better it fits the data. However, a model with a larger R is more likely to overfit the data; thus, an efficient model should be selected by referring to the information criteria. Figure 6.23 plots the AICs and BICs shown in Table 6.9, where the CAIC plot is omitted because it is almost the same as BIC. From the figure, AIC is almost unchanged from Rank 3 through 10, but strictly speaking, AIC is the smallest when R is 10. This indicates that the goodness of fit improves as R increases, but the model contrariwise becomes more complex, which cancels the improvement of the fit, and then there is little change in terms of efficiency. Meanwhile, BIC implies that the model with R = 3 is the most efficient. Moreover, the WOAC shown in Table 6.9 indicates the monotonicity of the TRP under each R. The TRP tends to be less monotonic as R increases, but the TRPs
254
6 Latent Rank Analysis
Fig. 6.23 AIC and BIC under various numbers of latent ranks
are monotonically increasing for all cases of Rs. In other words, all analyses were successful; the latent scale under each R is said to be ordinal. Meanwhile, the monotonic IRPs ratio (MIR) in the table indicates the ratio of the number of IRPs that were monotonically increasing to the number of items. That is,33 J MI R =
j=1
sgn(γ˜ j ) J
.
It may be difficult to determine the optimal R using information criteria. Therefore, it would be better to finalize R based on the expertise and experience of the teacher(s) or subject-teaching expert(s).34 Without referring to the information criteria, an optimal R may be determined by the interpretability of the results, which is a very important aspect of the analysis (see Importance of Interpretability , p.256). For example, the MIR was 0.6 (= 9/15) when R = 6, which means that 9 of the 15 IRPs were not monotonically increasing. This fact can also be confirmed by Table 6.3 (p.228). It is harder to interpret the results when there are more nonmonotonic IRPs, because, in general, the students belonging to higher ranks should have higher ability. It is thus natural that CRRs of the higher ranks are generally greater. Regarding the results under R = 6, however, the IRPs of more than half of the items are not monotonic, which is difficult to interpret even if the TRP is monotonically increasing. Therefore, the optimal R would be four or five. 33
The signum function (or sign function) extracts the sign of the argument. That is, ⎧ ⎪ ⎨−1, if x < 0 sgn(x) = 0, if x = 0 ⎪ ⎩1, if x > 0
If the IRP is not monotonically increasing, the monotonicity index is then positive (γ˜ > 0), thus sgn(γ˜ ) = 1. 34 Or, the author hopes that a reader will find a better way to determine the optimal R.
6.9 Estimating the Number of Latent Ranks
255 Test Length
Fig. 6.24 Missing world outside data Sample Size
Item Universe
Population
DATA
Missing
Furthermore, in general, the larger the sample size, the greater the optimal R tends to be. For example, if the number of items is the same, the optimal R will be larger for data with 5000 students than data with 500 students. This is because the larger the sample size, the greater the number of observed response patterns, and the LRA, like LCA, basically groups together the students with similar response patterns into the same rank (see Figure 5.2, p.156). Importance of Knowledge about Outside Data No matter how many pieces of data are collected, the information contained is only a fraction of the world; even if the sample size is 10,000, it must be < 0.1% of the population, and it is ultimately unknown whether those 10,000 people are unbiased representatives of the population. In addition, even if the test (questionnaire) has 500 items, there must be 5,000 or 10,000 questions that should be investigated to clarify the phenomenon exhaustively. In other words, the analyst should be aware that the data he or she are facing are very small compared to the vast world of missing data outside the data given (Fig. 6.24). It is our knowledge and experience as experts that fill the vast world of missing data outside the data. Some people believe that it is more objective to interpret data using only the information extracted and is more subjective, arbitrary, and unscientific to use experience and information outside the data. However, in the humanities and social sciences, useful knowledge can never be obtained without synthesizing the information and experience inside and outside the data.
256
6 Latent Rank Analysis
Importance of Interpretability In the natural sciences, to explore the truth is of utmost importance, but in the social sciences and humanities, “how the world can be interpreted” is more important. This idea, social constructionism (e.g., Berger & Luckmann, 1966; Spector & Kitsuse, 1977; Searle, 1996; Weinberg, 2014; Gergen, 2015), is a mainstream of modern thought. In other words, it is more essential how we perceive, interpret, and use the test results. Thus, rather than pursuing the true R, the perspective of “how the teacher recognizes those ranks and how he or she can use them to improve students’ future academic performance” should receive priority. Therefore, the teacher (or analyst) should select the optimal R that he or she is confident of.
6.10 Chapter Summary and Supplement This chapter introduced LRA, which is similar to LCA. The latent classes in LCA are nominal groups, while the latent ranks are ordinal groups. The ordinality in this chapter is based on NRS, and the students belonging to higher ranks can score greater expected NRSs (i.e., greater TRP values) than the students belonging to lower ranks. That is, a monotonous increase in TRP is evidence that the latent ranks are ordered. However, if there is another basis to show the ordinality of the ranks, it is unnecessary for the TRP to be monotonically increasing. Before concluding this chapter, some additional information is provided in this section.
6.10.1 LRA-SOM Vs LRA-GTM There are two learning methods in LRA: SOM and GTM. LRA-GTM is batch learning; thus, the analysis time is very short. Accordingly, LRA-GTM is recommended when the sample size is very large and when the optimal number of latent ranks is to be determined. Meanwhile, LRA-SOM is sequential learning. It is thus easier to construct an equiprobability rank scale. In addition, the IRPs are likely to be smoother than the LRA-GTM. However, note that this is due to the effect of tuning the parameters used in the analysis. Specifically, as the final value of the learning range parameter (σT ) in Estimation Setting of LRA-SOM (p.205) approaches 0, the learning range then becomes narrower, and the IRPs are less likely to be smooth. Therefore, depending on the parameters being set, smoother IRPs can be obtained under LRA-GTM.
6.10 Chapter Summary and Supplement
257
Fig. 6.25 Equating by concurrent calibration under LRA
6.10.2 Equating Under LRA In the framework of the LRA, new items can be stored in the item bank through test equating. If the number of items in the item bank is small, it is difficult to administer a reliable CAT. For this purpose, it is necessary to store a large number of items that are highly discriminative at various difficulty levels in the bank. Figure 6.25 shows the image of an equating method, concurrent calibration (Kim & Cohen, 1998, 2002; Hanson & Béguin, 2002). In this method, when new items of a test are planned to be stored in the item bank, some items that are already stored there (i.e., anchor items) are first mixed in the test. The IRPs of the anchor items must have already been estimated (as they were stored in the item bank). Then, the IRPs of only the new items are estimated, while keeping the IRPs of the anchor items fixed, where the number of ranks should be the same as the number of ranks employed in the bank. Consequently, the IRPs of the new items will be obtained as values already equated on the latent rank scale of the item bank. The larger the number of anchor items, the more reliable the equating.
6.10.3 Differences Between IRT, LCA, and LRA Finally, the differences between the IRT, LCA, and LRA are briefly summarized in Table 6.10. First, these three models differ in the level of the latent scale employed: It is continuous in IRT, nominal in LCA, and ordinal in LRA.
258
6 Latent Rank Analysis
Table 6.10 Differences between IRT, LCA, and LRA
Model IRT LCA LRA
Scale Continuous Nominal Ordinal
Parametricity Parametric Nonparametric Nonparametric
Unidimensionality Necessary Unnecessary Necessary
Local Dependence Necessary Necessary Necessary
In addition, the IRT models are parametric, while the LCA and LRA models are nonparametric. This is because the IRF in IRT is formulated by a logistic function, while IRP under LCA and LRA does not have a specific functional form. Moreover, the IRT and LRA assume the data to be unidimensional. The unidimensionality assumption, also referred to as unifactority, requires that all items in a test measure the same content: The items in a math test measure a math ability, and the items in a Spanish test measure a Spanish proficiency. The unidimensionality of the test can be examined by eigen-decomposing the tetrachoric correlation matrix (see Sect. 2.8, p.59). Meanwhile, this assumption is not necessary in LCA, but the more unidimensional the test, the easier it is to interpret the results. The assumption of local independence (see Sect. 4.3.1, p.96), which was required in IRT, is also necessary in LCA and LRA. This assumption is essential to construct the likelihood/log-likelihood and express it in a simple product/sum form (see Sect. 5.3.1 for LCA and Sects. 6.2.1 and 6.3.1 for LRA).
Chapter 7
Biclustering
This chapter describes the biclustering technique (e.g., Govaert & Nadif, 2013; Kasim et al., 2016; Long et al., 2019). In short, biclustering is a method of data matrix clustering applied in two directions (i.e., vertical and horizontal). Figure 7.1 shows a visual representation of biclustering. The figure on the upper left shows an array plot of the raw data. An array plot is a diagram coloring the matrix cells, in which the larger the cell value, the darker the cell color. In this array plot of the binary raw data, the correct responses are shaded in black, and the black-and-white pattern appears to be random. Next, the figure on the lower left shows an array plot of an LCA, where C = 3 and where the data vectors of the students (i.e., row vectors) are sorted and arranged into the respective classes. As a result, the data can be divided into three groups with similar black-and-white patterns in the horizontal direction, which means that students with similar response patterns are grouped together. Moreover, in the upper-right figure, items are divided into four groups, and there are items with similar true/false (TF) patterns in each group. Although groups of items have been referenced by various names (e.g., factor, attribute, skill, and ability), the term field is employed herein. Accordingly, the representation in the upper-right figure can be considered a latent field analysis (LFA); i.e., the figure shows an array plot of an LFA with four fields in which the items are sorted according to the latent field. An LFA and a factor analysis (FA) are similar. However, an FA forms some item groups with high within-item (tetrachoric) correlations, whereas items in a group created through an LFA show similar TF patterns. In fact, when TF patterns of two items are similar, the correlation between them is likely to be high; thus, the FA and LFA results are not entirely different. The lower-right figure shows an array plot of the biclustering technique. As the figure indicates, this method applies the LCA and LFA simultaneously. Whereas the students are grouped into latent classes, the items are classified into latent fields. This © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 K. Shojima, Test Data Engineering, Behaviormetrics: Quantitative Approaches to Human Behavior 13, https://doi.org/10.1007/978-981-16-9986-3_7
259
260
7 Biclustering
Student Clustering
No Student Clustering
No Item Clustering
Fig. 7.1 Visual representation of biclustering technique
Item Clustering
7 Biclustering
261
figure is an example of biclustering using three classes and four fields. Each block specified by a class and field is called a bicluster. The figure indicates that the data matrix is partitioned into 12 (= C × F = 3 × 4) biclusters. Co-clustering A method for simultaneous clustering of two-dimensional (or higher) data array (e.g., student × item) in a two-way direction is called biclustering or two-mode clustering. As an extension of biclustering, for a threedimensional (or higher) data array (e.g., student × item × time), a simultaneous clustering in a three-way direction is called triclustering or three-mode clustering (Fig. 7.2). The LCA and LFA are thus types of one-mode clustering (or simply, clustering), whereas two-mode (or higher) clustering methods are collectively called co-clustering.
Fig. 7.2 Triclustering
262
7 Biclustering
7.1 Biclustering Parameters By analyzing the data matrix using biclustering, the following three matrices are estimated. This section describes these three matrices. Biclustering Parameters 1. Class membership matrix, M C 2. Field membership matrix, M F 3. Bicluster reference matrix, B
7.1.1 Class Membership Matrix First, the class membership matrix is described. Supposing that the number of classes is C, which should be significantly smaller than sample size S, and that m sc denotes the membership of Student s belonging to Class c, ⎤ m s1 ⎥ ⎢ ms = ⎣ ... ⎦ (C × 1) ⎡
m sC is the class membership profile of Student s. Because Student s always belongs to a particular class, the sum of the class membership profile is 1; i.e., 1C ms = 1. The class membership matrix collects class membership profiles of all S students and is given as ⎤ m 11 · · · m 1C ⎥ ⎢ M C = ⎣ ... . . . ... ⎦ = {m sc } (S × C). m S1 · · · m SC ⎡
The s-th row vector in this matrix is the class membership profile of Student s.
7.1.2 Field Membership Matrix Next, the field membership matrix is described. Suppose that J items in a test are classified into F fields, and that F should be significantly smaller than J . In addition,
7.1 Biclustering Parameters
263
let m j f denote the probability that Item j is classified in Field f (∈ N F ). Thus, ⎤ m j1 ⎥ ⎢ m j = ⎣ ... ⎦ (F × 1) ⎡
m jF is called the field membership profile of Item j. Because Item j must be grouped into a certain field, the sum of the field membership profile becomes 1: 1F m j = 1. In addition, a matrix that collects the field membership profiles of all J items, ⎤ m 11 · · · m 1F ⎥ ⎢ M F = ⎣ ... . . . ... ⎦ = {m j f } (J × F), m J1 · · · m J F ⎡
is called the field membership matrix. The j-th row vector in this matrix is the field membership profile of Item j.
7.1.3 Bicluster Reference Matrix Under the condition in which the numbers of classes and fields are C and F, respectively, let the probability that Student s belonging to Class c (m sc = 1) correctly responds to Item j classified in Field f (m j f = 1) be represented by Pr (u s j = 1|m sc = 1, m j f = 1) = π f c (∈ [0, 1]). Then, a matrix with size F × C that collects all π f c s, i.e., ⎤ π11 · · · π1C ⎥ ⎢ B = ⎣ ... . . . ... ⎦ = {π f c } (F × C) π F1 · · · π FC ⎡
is called the bicluster reference matrix. The f -th row vector in B , ⎤ πf1 ⎥ ⎢ π f = ⎣ ... ⎦ (C × 1), ⎡
πfC
264
7 Biclustering
represents the CRRs of students belonging to Classes 1 through C for the items classified in Field f . This vector is called the field reference profile (FRP) of Field f . In addition, the c-th column vector of the matrix ⎤ π1c ⎥ ⎢ π c = ⎣ ... ⎦ (F × 1) ⎡
π Fc
represents the CRRs of students belonging to Class c for the respective fields. This vector is called the class reference vector of Class c.
7.2 Parameter Estimation We can consider biclustering simply as a problem of estimating the class membership matrix, field membership matrix, and bicluster reference matrix from data matrix U. The class membership matrix (M C ) indicates which class each student belongs to, field membership matrix (M F ) indicates which field each item is classified in, and bicluster reference matrix ( B ) shows the CRR of each class for each field. In other words, biclustering can be regarded as a function used to return M C , M F , and B in response to input U, as follows: f Biclustering (U) = {M C , M F , B }. This can also be thought of as a machine processing raw material U into M C , M F , and B (Fig. 7.3).
Fig. 7.3 Biclustering machine
Input Output
Biclustering
7.2 Parameter Estimation
265
7.2.1 Likelihood and Log-Likelihood The basis of a statistical model analysis is the construction of the likelihood. The likelihood is the occurrence probability of the data with the given model parameters. In biclustering, likelihood is the probability of observing data matrix U, given bicluster reference matrix B , class membership matrix M C , and field membership matrix M F , and is expressed as l(U| B ) =
S J F C
u
π f scj (1 − π f c )1−u s j
zs j m sc m j f
.
(7.1)
s=1 j=1 f =1 c=1
In this equation, the likelihood corresponding to each datum can be appropriately selected by the class and field memberships. Suppose that Student 1 belongs to Class 2 (m 12 = 1), and Item 3 is classified in Field 4 (m 34 = 1). In this case, the response of Student 1 (in Class 2) to Item 3 (in Field 4) should depend on the CRR of Class 2 for Field 4; i.e., π42 , from which, the likelihood corresponding to u 13 is obtained as follows: C F u 13
z m m π f c (1 − π f c )1−u 13 13 1c 2 f f =1 c=1
z m m u 13 =1 × · · · × π42 (1 − π42 )1−u 13 13 12 34 × · · · × 1
z u 13 = π42 (1 − π42 )1−u 13 13 , which shows that only the term relating to π42 can be extracted because (m 12 , m 34 ) = (1, 1). After the likelihood is constructed, the parameters that maximize the likelihood simply need to be found. However, it is extremely difficult to maximize the likelihood because the likelihood of the data is the product of the likelihood of each datum; therefore, the logarithm of the likelihood, i.e., the log-likelihood, is taken. The logarithm of Eq. (7.1) is ll(U| B ) =
S J F C
z s j m sc m j f u s j ln π f c + (1 − u s j ) ln(1 − π f c ) .
s=1 j=1 f =1 c=1
(7.2) The log-likelihood of the data is the sum of the log-likelihood of each datum and is thus significantly easier to handle. Moreover, the logarithmic transformation is a monotonic transformation, and the values of the parameters that maximize the likelihood are identical to those that maximize the log-likelihood.
266
7 Biclustering
Why are M C and M F not written in terms of likelihood? Why is the notation for likelihood l(U| B ) instead of l(U| B , M C , M F ), despite the class membership matrix and field membership matrix also being parameters? Similarly, why is the notation for log-likelihood ll(U| B ) instead of ll(U| B , M C , M F )? It is because the parameter types are different. Both membership matrices are distributions by nature. Recall that the s-th row vector of class membership matrix M C (i.e., class membership profile of Student s) represents the probability distribution of the class that Student s belongs to. The membership of Student s to a class might be 1, or the memberships to Classes 1 and 2 might be 0.7 and 0.3, respectively. In any case, the sum of each student’s class membership profile is 1 (1C ms = 1), and the class membership profile of a student is thus a distribution of the probabilities for all respective classes that the student belongs to. Similarly, the field membership profile of each item is a probability distribution. Furthermore, the likelihood has a structure of Cc=1 (·)m sc ; i.e., the log C likelihood has a structure of c=1 m sc (·). This means that, for Student s (∀s ∈ N S ), the (log-)likelihood considers all possibilities regarding which class Student s belongs to. In other words, the (log-)likelihood is said to be marginalized over the class membership profile of Student s. When marginalized, the unknown parameter (which class Student s belongs to) is then effectively nullified. Therefore, the class membership profiles of all students (class membership matrix) are marginalized, and thus, in the (log)likelihood, they are not written on the right side of “|.” Likewise, the field membership matrix is not written because it is marginalized.
7.2.2 Estimation Framework Using EM Algorithm There are three unknown parameters in biclustering: bicluster reference matrix B , class membership matrix M C , and field membership matrix M F . Although it is not easy to estimate them simultaneously, fortunately, M C and M F can be obtained when estimating B . The framework of the estimation is the same as that described in the other chapters, i.e., the EM algorithm (Dempster et al., 1977). The EM algorithm is an extremely important tool in a test data analysis.1 The framework for estimating a bicluster reference matrix using the EM algorithm is shown in Biclustering Estimation Procedure . 1
Because this book introduces many statistical models, the author has tried to reduce the number of mathematical tools. The EM algorithm described in the IRT chapter is the most detailed explanation, but it might be somewhat difficult to read; thus, the explanations in the respective chapters are slightly varied.
7.2 Parameter Estimation
267
Biclustering Estimation Procedure (EM Algorithm) (0) [1] Set (0) B and M F . [2] Repeat the following steps until convergence. M (t−1) }. [2-1] Obtain M C(t) from {U, (t−1) B F (t) (t−1) [2-2] Obtain M F from {U, B , M C(t) }. (t) (t) [2-3] Obtain (t) B from {M C , M F }.
In Command [1], the initial value of the bicluster reference matrix ((0) B ) is first determined. In each latent class, the students with similar response patterns are grouped; thus, the latent classes are created based on the ability level. Likewise, items with similar difficulty levels are grouped into the same latent field. Thus, the initial value of the ( f, c)-th element of the bicluster reference matrix may be specified as πfc =
F − f +c . F +C
For example, when the numbers of fields (F) and classes (C) are 5 and 6, respectively, the initial value of the bicluster reference matrix is given as ⎡
(0) B
0.624 ⎢0.063 ⎢ =⎢ ⎢0.201 ⎣0.050 0.023
0.864 0.333 0.543 0.245 0.054
0.872 0.426 0.228 0.078 0.028
0.898 0.919 0.475 0.233 0.043
0.952 0.990 0.706 0.648 0.160
⎤ 1.000 1.000⎥ ⎥ 1.000⎥ ⎥ (5 × 6). 0.983⎦ 0.983
The larger the class label, the higher the CRR for each field. In other words, these initial settings direct the gathering of students with a higher ability in a class with a larger label.2 In addition, these initial values induce more difficult items to form a larger labeled field. From this matrix, for example, the CRRs of a student belonging to Class 1 are 62.4% for the items classified in Field 1 and 2.3% for those classified in Field 5. Moreover, Command [1] also requires specifying the initial value of the field (0) (1) membership matrix, which is a component of {U, (0) B , M F } for estimating M C in Command [2-1].3 If the CRR of Item j ( p j ) is in the f -th group when the CRRs of all items are partitioned into F difficulty levels (i.e., p j is in the f -th F-quantile in descending order), then the initial values of Item j’s memberships to the respective F fields are set as
2
This is not necessarily true at the end of the analysis. (0) One may first determine the initial value of M C in Command [1] and then estimate M (1) F from (0) (0) {U, B , M C } in Command [2-1].
3
268
7 Biclustering
m (0) jf
=
1, if f = f . 0, otherwise
Because the initial values for the bicluster reference matrix ((0) B ) and field mem) have been specified, the EM algorithm can be started and bership matrix (M (0) F progressed as follows: EM Cycle 1
Initial value
EM Cycle 2
(0) (1) (1) (1) (2) (2) (2) (0) M M , M → → M → → → M → B F C F B C F B → ··· .
7.2.3 Expected Log-Likelihood The expected log-likelihood (ELL) at the t-th EM cycle is the log-likelihood that is marginalized with respect to class membership matrix M C(t) and field membership matrix M (t) F . That is, bicluster reference matrix B is only the unknown parameter in the ELL. In addition, as described in Why M C and M F are not written in likelihood? , the log-likelihood in Eq. (7.2) has already been marginalized with respect to M C(t) and M (t) F . That is, Eq. (7.2) can be expressed as ell(U| B ) =
S J F C
(t) z s j m (t) sc m j f u s j ln π f c + (1 − u s j ) ln(1 − π f c ) .
s=1 j=1 f =1 c=1
(7.3) To obtain this ELL, the class and field membership matrices at the t-th E step (M C(t) and M (t) As shown in Command [2-1] of F ) are necessary. Biclustering Estimation Procedure (p. 267), the class membership matrix is first updated using {U, (t−1) , M (t−1) }. From Eq. (7.1), the probability (i.e., likelihood) B F that the data of Student s (us ) will be observed under the condition that the student belongs to Class c is expressed as l(us |π (t−1) )= c
J F (t−1) u s j 1−u s j zs j m (t−1) jf πfc 1 − π (t−1) . fc j=1 f =1
This equation is obtained as a real number because there are no unknown parameters in it, and the number falls within [0, 1] because this equation describes a likelihood (i.e., probability). Thus, the membership of Student s in Class c at the t-th E step is
7.2 Parameter Estimation
269
)πc l(us |π (t−1) c m (t) . sc = C (t−1) )πc c =1 l(us |π c Notably, at this time, the superscript of the class membership is updated from (t − 1) to (t). In addition, πc in the equation denotes the prior probability that the student belongs to Class c. If one does not have a particular assumption, a discrete uniform distribution is valid as follows: πc =
1 . C
By computing m (t) sc s of all students for all classes, the class membership matrix at the t-th E step is updated as follows: M C(t) = {m (t) sc } (S × C). The s-th row vector in this matrix is the class membership profile of Student s at the t-th E step. , Next, in Command [2-2], the field membership matrix is updated using {U , (t−1) B (t) M C }. From Eq. (7.1), the probability (i.e., likelihood) that the data for Item j, u j (i.e., the j-th column vector of data matrix U), will be observed under the condition in which the item is classified in Field f is expressed as l(u j |π (t−1) )= f
S C (t−1) u s j 1−u s j zs j m (t) sc πfc 1 − π (t−1) . fc s=1 c=1
Note that the class memberships that have just been updated are used here. In addition, this equation does not include unknown parameters; thus, this equation is obtained as a real number within the range of [0, 1]. Accordingly, the probability that Item j is classified in Field f (at the t-th E step) is calculated as )π f l(us |π (t−1) f m (t) = ,
F jf (t−1) )π f f =1 l(us |π f where π f is the prior probability that the item is grouped in Field f . When having no particular supposition, the following discrete uniform distribution is recommended: πf =
1 . F
The field membership matrix at the t-th E step is thus updated as (t) M (t) F = {m j f }. (J × F)
270
7 Biclustering
In addition, the j-th row vector in this matrix is the field membership profile of Item j at the t-th E step. Consequently, the class and field membership matrices at the t-th E step (M C(t) and M (t) F ) are obtained. The ELL of Eq. (7.3) is constructed using these matrices.
7.2.4 Expected Log-Posterior Density The posterior density of B , pr ( B |U), is proportional to the product of likelihood l(U| B ) in Eq. (7.1) and prior density of B , pr ( B ). That is, pr ( B |U) =
l(U| B ) pr ( B ) l(U| B ) pr ( B ) ∝ l(U| B ) pr ( B ), = pr (U) l(U| B ) pr ( B )d B
where the rightmost transformation by “∝ (proportional to)” is derived from the fact that pr (U) = l(U| B ) pr ( B )d B becomes a constant irrelevant to B . Accordingly, the log-posterior can be expressed as the sum of the log-likelihood and log-prior density, as follows: ln pr ( B |U) = ln l(U| B ) + ln pr ( B ) + const. = ll(U| B ) + ln pr ( B ) + const. = ell(U| B ) + ln pr ( B ) + const. Note that the log-likelihood, ln l(U| B ), is the ELL of Eq. (7.3). In addition, as the prior density for each π f c (∀c ∈ NC , ∀ f ∈ N F ), a beta density as the natural conjugate prior (see Sect. 4.5.3, p. 119) is a valid choice. The beta density is β −1
pr (π f c ; β0 , β1 ) =
π f c1 (1 − π f c )β0 −1 B(β0 , β1 )
.
Thus, the prior density for bicluster reference matrix B is given as pr ( B ; β0 , β1 ) =
β −1 F C π f c1 (1 − π f c )β1 −1 f =1 c=1
The logarithm of the prior is thus
B(β0 , β1 )
,
7.2 Parameter Estimation
ln pr ( B ; β0 , β1 ) =
271
F C
(β1 − 1) ln π f c + (β0 − 1) ln(1 − π f c ) + ln B(β0 , β1 ) f =1 c=1
=
F C (β1 − 1) ln π f c + (β0 − 1) ln(1 − π f c )} + const., f =1 c=1
where (β0 , β1 ) are the hyperparameters determining the shape of the beta density, and B(·, ·) is the beta function that becomes an irrelevant constant after the beta density is logarithmically transformed. Finally, the expected log-posterior (ELP) of bicluster reference matrix B constructed at the t-th E step can be obtained as follows: ln pr ( B |U) =ell(U| B ) + ln pr ( B ; β0 , β1 ) + const. =
S J F C s=1 j=1 f =1 c=1
+
(t) (t) z s j m sc m j f u s j ln π f c + (1 − u s j ) ln(1 − π f c )
F C (β1 − 1) ln π f c + (β0 − 1) ln(1 − π f c )} + const. f =1 c=1
=
F S J C f =1 c=1
+
(t) (t) z s j m sc m j f u s j + β1 − 1 ln π f c
s=1 j=1
S J
(t) (t) z s j m sc m j f (1 − u s j ) + β0 − 1 ln(1 − π f c ) + const.
s=1 j=1
=
F C
(t) (t)
U1 f c + β1 − 1 ln π f c + U0 f c + β0 − 1 ln(1 − π f c ) + const.,
(7.4)
f =1 c=1
where U1(t)f c =
S J
(t) m (t) sc m j f z s j u s j ,
(7.5)
(t) m (t) sc m j f z s j (1 − u s j )
(7.6)
s=1 j=1
U0(t)f c =
S J s=1 j=1
can be considered the total numbers of correct and incorrect responses of the students belonging to Class c for the items classified in Field f , respectively. All values of U1 f c s and U0 f c s can be computed through the following matrix operations: (t) (t) (t) (F × C), U (t) 1 = {U1 f c } = M F (Z U) M C (t) (t) (t) U (t) (F × C). 0 = {U0 f c } = M F {Z (1 S 1 J − U)} M C
272
7 Biclustering
7.2.5 Maximization Step In the t-th M step, the parameters that maximize the ELP shown in Eq. (7.4) built at the t-th E step are estimated. Note that all π f c s are mutually disjointed (i.e., independent) in the equation. Thus, the term relating to each π f c , ln pr (π f c |U) = U1(t)f c + β1 − 1 ln π f c + U0(t)f c + β0 − 1 ln(1 − π f c ),
(7.7)
can be optimized individually. To find the π f c that maximizes the above equation, the equation is first differentiated by π f c and set to 0, and the derivative is then solved with respect to π f c . The function is differentiated for measuring its slope, and the point when the slope is 0 is the point that gives the maximum value of the function.4 Differentiating the above equation with respect to π f c and setting it to 0, U1(t)f c + β1 − 1 U0(t)f c + β0 − 1 d ln pr (π f c |U) = − = 0. dπ f c πfc 1 − πfc Solving this with respect to π f c , the MAP at the t-th M step is obtained as follows: A P,t) π (M fc
=
U1(t)f c + β1 − 1 U0(t)f c + U1(t)f c + β0 + β1 − 2
.
Note that the π f c that maximizes the ELL (not ELP) is the MLE. From Eq. (7.7), the ELL related to π f c is ell(U|π f c ) = U1(t)f c ln π f c + U0(t)f c ln(1 − π f c ), and thus, L ,t) π (M = fc
U1(t)f c U1(t)f c + U0(t)f c
is the MLE of π f c at the t-th M step. In addition, note that the MAP is equal to the MLE when the hyperparameters of the beta prior are (β0 , β1 ) = (1, 1).
4
There is a possibility that the point will give the minimum value of the function. However, the function is convex in this case; thus, the maximum value is obtained.
7.2 Parameter Estimation
273
7.2.6 Convergence Criterion of EM Cycles The t-th EM cycle ends by obtaining the bicluster reference matrix estimate at the t-th M step ((t) ), and the E step of the (t + 1)-th EM cycle starts. As the EM cycles progress, the values of the bicluster reference matrix approach certain values and gradually become unchanged as follows: (t−1) . (t) B ≈ B
It is therefore necessary to terminate the EM cycles at an appropriate period and regard the values at the point as the final estimate. A criterion is necessary to judge whether the next EM cycle should be started or stopped. In general, the EM algorithm is terminated when the ELLs of any pair of adjacent cycles are approximately equal. The ELL at the t-th EM cycle is obtained by ell(U|(t) B )=
F C (t) (t) (t)
U1 f c ln π (t) f c + U0 f c ln(1 − π f c ) . f =1 c=1
This equation is obtained as a real number because it includes no unknown parameters, and this value gradually increases (closer to 0) as the EM cycles progress in the following manner: (t−1) ) ell(U|(t) B ) > ell(U| B
∴
(t−1) ell(U|(t) ) > 0. B ) − ell(U| B
Then, as shown in Fig. 7.8, the EM algorithm is terminated when the difference between the ELLs at the t-th and (t − 1)-th cycles becomes smaller than the RHS of the following equation: (t−1) ) < c|ell(U|(t−1) )|. ell(U|(t) B ) − ell(U| B B
(7.8)
The LHS of this equation is positive, and the RHS is enclosed in an absolute value sign because the ELL is negative. With the convergence criterion, c denotes a sufficiently small constant, such as c = 10−4 = 0.0001 or c = 10−5 = 0.00001 (Fig. 7.4). Suppose this convergence criterion is satisfied when the T -th cycle is reached; ) the T -th EM cycle is thus the final cycle, and the latest update (T B is then specified as the estimate for the bicluster reference matrix. That is, ) ˆ B = (T B .
In addition, the latest updates for the class and field membership matrices are employed as the following respective estimates: ˆ C = M C(T ) , M ) ˆ F = M (T M F .
274
7 Biclustering
Fig. 7.4 Convergence criterion
Bad Convergence Criterion To simplify Eq. (7.8), consider the following convergence criterion: (t−1) ) < c. ell(U|(t) B ) − ell(U| B
This equation seems like a proper and straightforward criterion. However, this criterion is not useful because it depends on the size of the data matrix. It can be seen from Eq. (7.3) that the ELL becomes smaller (i.e., negatively larger) as the sample size (S) and number of items (J ) increase because π f c ∈ [0, 1]
∴
ln π f c , ln(1 − π f c ) ≤ 0.
This criterion is thus more likely to be satisfied when the sample size and the number of items are smaller. It is then highly possible that the EM algorithm ends without sufficient convergence.
7.3 Ranklustering Biclustering is a two-mode clustering. For a two-dimensional data array, students are clustered into latent classes, and items are clustered into latent fields, where both the classes and fields are nominal clusters. However, in practical tests, teachers sometimes may want to classify students into ordinal latent ranks (see LRA, Chap. 6, p. 191) instead of nominal latent classes. Latent ranks are ordered latent classes in certain respects; in LRA, the ordinality of ranks is examined by the monotonicity of the expected number-right score (NRS) or the weakly ordinal alignment condition of the test reference profile (TRP) (see Sect. 6.6.1, p. 233). For a data matrix of (student × item), ranklustering (Shojima, 2020b) is a biclustering that clusters students into latent ranks, while classifying items into latent fields.
7.3 Ranklustering
275
Fig. 7.5 Ranklustering example
Figure 7.5 shows an example of ranklustering, where the items are classified into nominal clusters (i.e., latent fields). The “Qs” on the data matrix are test items (questions), representing similar items that are grouped together. Meanwhile, students are classified into ordinal clusters (i.e., latent ranks). The length of each pencil symbolizes the student’s ability. The darker cells in the data matrix indicate that the CRR of the students in the corresponding rank for the items in the corresponding field is higher. Thus, the figure shows that the CRRs for the respective four fields are larger as the latent rank increases.
7.3.1 Parameter Estimation Procedure Ranklustering Estimation Procedure (Monotonic Increasing FRP) [0] Use not the class membership (M C ) but the rank membership (M R ). (0) [1] Set (0) B and M F . [2] Repeat the following steps until convergence. (t−1) [2-1] Obtain M (t) , M (t−1) }. R from {U, B F
[2-1.5] Obtain S(t) by smoothing M (t) R . (t−1) [2-2] Obtain M (t) , S(t) }. F from {U, B (t) (t) [2-3] Obtain (t) B from {S , M F }.
(2-4) Repeat (2-4-1) for f = 1 through F. (2-4-1) Sort π (t) f .
276
7 Biclustering
The above box shows the ranklustering estimation procedure. Although it is almost the same as that of biclustering, there are two major differences. The first is a notational issue, as shown in Command [0]. Ranklustering classifies students into latent ranks; thus, each student’s membership to each rank should be referred to as a rank membership instead of a class membership. When the number of latent ranks is R, the rank membership matrix is defined as follows: ⎤ m 11 · · · m 1R ⎥ ⎢ M R = ⎣ ... . . . ... ⎦ = {m sr } (S × R), m S1 · · · m S R ⎡
where m sr denotes the membership of Student s to Rank r . The s-th row vector in this matrix, ⎤ m s1 ⎥ ⎢ ms = ⎣ ... ⎦ (R × 1), ⎡
ms R is the rank membership profile (RMP) of Student s. According to this, the size of bicluster reference matrix B becomes not F × C, but rather F × R. As the second difference, Command [2-1.5] has been added, in which the smoothed rank membership matrix (S(t) ) is created by smoothing M (t) R . Owing to this procedure, a relation between each pair of adjacent ranks occurs and the ranks are ordered.5 By inserting this procedure, the sequence of updating parameters is given as follows6 : Initial Value
EM Cycle 1
EM Cycle 2
(0) (0) (1) (1) (1) (2) (2) (2) B , M F → M R → S(1) → M F → B → M R → S(2) → M F → B → · · · .
7.3.2 Class Membership Profile Smoothing This section describes Command [2-1.5]. The simplest smoothing approach is using the simple moving average, an example of which is shown in Fig. 7.6. The original RMP of a student is shown in the left part of the figure, and the smoothed RMP, which is made using the simple moving average, is shown in the lower part of the figure. 5
Note that this procedure does not always successfully order all ranks. (t) In this sequence, M F is affected by S(t) . To avoid the effect, the order in the t-th EM cycle can (t) (t) be changed to M F → M (t) → (1) R → S B . In this case, however, the optimal number of latent fields tends to be large.
6
7.3 Ranklustering
277
Fig. 7.6 Smoothing rank membership profile
At the center of the figure, there is a matrix labeled “filter.” Smoothing of the (original) RMP is accomplished by multiplying the RMP by the filter. In addition, the kernel is placed in each column. In the figure, the kernel is ⎡
⎤ 1/3 1 f = ⎣1/3⎦ = 13 . 3 1/3 Note that the sum of the kernel is 1. Because of this filter, as the figure shows, the smoothed rank membership to Rank 3 is calculated as follows: s3 = 0.333 × (m 2 + m 3 + m 4 ) = 0.333 × (0.160 + 0.080 + 0.360) = 0.200. This is the average membership of the neighboring Ranks 2, 3, and 4. Likewise, the other memberships are obtained as s2 = 0.333 × (m 1 + m 2 + m 3 ) = 0.333 × (0.060 + 0.160 + 0.080) = 0.100, s4 = 0.333 × (m 3 + m 4 + m 5 ) = 0.333 × (0.080 + 0.360 + 0.260) = 0.233, s5 = 0.333 × (m 4 + m 5 + m 6 ) = 0.333 × (0.360 + 0.260 + 0.080) = 0.233. However, for the ranks at both ends (i.e., Ranks 1 and 6), the memberships are computed as s1 = 0.500 × (m 1 + m 2 ) = 0.500 × (0.060 + 0.160) = 0.110, s6 = 0.500 × (m 5 + m 6 ) = 0.500 × (0.260 + 0.080) = 0.170,
278
7 Biclustering
because the kernel is adjusted such that its sum is 1. Let the presmoothed RMP, smoothed RMP, and filter be denoted as ⎤ ⎡ ⎤ ⎡ 0.110 0.060 ⎢0.100⎥ ⎢0.160⎥ ⎥ ⎢ ⎥ ⎢ ⎢0.200⎥ ⎢0.080⎥ ⎥ ⎢ ⎥ ⎢ m=⎢ ⎥ , s = ⎢0.233⎥ , ⎥ ⎢ ⎢0.360⎥ ⎣0.233⎦ ⎣0.260⎦ 0.170 0.080
⎡ ⎤ 0.500 0.333 ⎢0.500 0.333 0.333 ⎥ ⎢ ⎥ ⎢ ⎥ 0.333 0.333 0.333 ⎢ ⎥, F=⎢ ⎥ 0.333 0.333 0.333 ⎢ ⎥ ⎣ 0.333 0.333 0.500⎦ 0.333 0.500
respectively. Then, the smoothed RMP is obtained using the following matrix operation: s = F m. If using a kernel with a larger size, each element in the smoothed RMP is then an average of a wider range of ranks. For example, a Size 5 kernel is ⎡ ⎤ 0.2 ⎢0.2⎥ ⎢ ⎥ 1 ⎥ f =⎢ ⎢0.2⎥ = 5 15 . ⎣0.2⎦ 0.2 The kernel elements for the simple moving average are equal. The filter using this kernel then becomes ⎤ ⎡ 0.333 0.250 0.200 ⎥ ⎢0.333 0.250 0.200 0.200 ⎥ ⎢ ⎥ ⎢0.333 0.250 0.200 0.200 0.250 ⎥. F=⎢ ⎢ 0.250 0.200 0.200 0.250 0.333⎥ ⎥ ⎢ ⎣ 0.200 0.200 0.250 0.333⎦ 0.200 0.250 0.333 Note that the elements of the leftmost and rightmost two columns are adjusted such that their sums are 1. By smoothing the RMP with this filter, the smoothed RMP is obtained as ⎤ ⎡ 0.076 ⎢0.148⎥ ⎥ ⎢ ⎢0.213⎥ ⎥ s=⎢ ⎢0.220⎥ . ⎢ ⎥ ⎣0.180⎦ 0.164
7.3 Ranklustering
279
Fig. 7.7 Plots of presmoothed and smoothed rank membership profile
Note that the sum of the elements in the presmoothed RMP is 1; however, that in the smoothed RMP is not.7 Figure 7.7 shows the plots for a presmoothed RMP, an RMP smoothed using a filter with a 13 /3 kernel, and an RMP smoothed using a filter with a 15 /5 kernel. The smoothing reduces the unevenness of the original RMP. The figure also illustrates that an RMP smoothed using a 13 /3 kernel preserves more of the original shape than an RMP smoothed using a 15 /5 kernel. This concludes the description of smoothing using a simple moving average. Another basic smoothing method is the weighted moving average. This method does not use a flat kernel, but instead employs a nonflat one such as ⎡
⎤ 0.1 f = ⎣0.8⎦ . 0.1 Note that the sum of this kernel elements is also 1. The filter using this kernel becomes ⎤ ⎡ 0.889 0.100 ⎥ ⎢0.111 0.800 0.100 ⎥ ⎢ ⎥ ⎢ 0.100 0.800 0.100 ⎥. ⎢ F=⎢ ⎥ 0.100 0.800 0.100 ⎥ ⎢ ⎣ 0.100 0.800 0.111⎦ 0.100 0.889 The first and last columns are adjusted such that the sum of each column is 1. Figure 7.8 shows the result of smoothing the rank membership profile m using this filter.
7
It is possible to make a filter such that the sum of the elements in the smoothed RMP is 1.
280
7 Biclustering
Fig. 7.8 Smoothed rank membership profile based on weighted moving average
To generalize the description thus far, the kernel is first specified as follows: ⎤ fL ⎢ .. ⎥ ⎢.⎥ ⎢ ⎥ ∗ ⎥ f =⎢ ⎢ f0 ⎥ L × 1 , ⎢.⎥ ⎣ .. ⎦ fL ⎡
where f 0 is the center of the kernel, and L symmetric weights are placed in front and behind the center. The length of the kernel thus becomes L ∗ = 2L + 1. Using this kernel, we obtain ⎡
f0 ⎢ .. ⎢. ⎢ ⎢ ⎢ fL ⎢ ⎢ ∗ F =⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎣
··· .. . .. . .. .
fL .. . . . . .. f0 . .. . . . .
⎤ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ (R × R). ⎥ ⎥ fL ⎥ ⎥ .. ⎥ .⎦
fL .. . . . . .. .. f L . f0 . .. . . . . f L · · · f0
However, the sums of the first L − 1 and last L − 1 columns of this matrix are not 1. To make them 1, the filter F is obtained by rescaling F ∗ as follows: F = F ∗ {1 L ∗ (1L ∗ F ∗ )} (R × R). Using this filter F, the RMP of Student s at the t-th EM cycle is obtained using
7.3 Ranklustering
281
s(t) = F m(t) s . To smooth the RMPs of all students, the following matrix operation is used: ⎤ ⎡ (t) ⎤ ⎡ (t) ⎤ m1 F s1 m(t) 1 ⎢ .. ⎥ ⎢ .. ⎥ ⎢ .. ⎥ (t) = MR F = ⎣ . ⎦ F = ⎣ . ⎦ = ⎣ . ⎦. ⎡
S(s)
m(t) S
m(t) S F
s(t) S
The smoothing of the RMPs is extremely important. For this reason, latent ranks that were independent become associated with the respective neighboring ranks. Or, it can instead be stated that this process allows latent classes to become latent ranks. Latent classes that are disjointed remain as nominal clusters, but the latent classes that are ordered by the smoothing change into ordinal clusters (i.e., latent ranks).8
7.3.3 Other Different Points The previous section describes the major change in the estimation procedure of ranklustering from that of biclustering caused by inserting Command [2-1.5]. This section summarizes other miscellaneous changes produced through such command insertion. [2-2] Update of Field Membership Matrix During this process, the field membership matrix is updated. First, the probability that the data vector of Item j (u j ) is observed under the condition that the item is classified in Field f is )π f l(us |π (t−1) f = . m (t)
F jf (t−1) )π f f =1 l(us |π f This equation is apparently the same as that used in biclustering. However, the smoothed rank membership matrix (S(t) ) is used in the likelihood as follows: l(u j |π (t−1) )= f
S R (t−1) u s j 1−u s j zs j ssr(t) πfr 1 − π (t−1) . fr s=1 r =1
[2-3] Update of Bicluster Reference Matrix This command is step M in the ranklustering, and the bicluster reference matrix is updated during this process. Referring to Eq. (7.7) (p. 272), the ranklustering ELL 8
The ordering of classes into ranks is not always successful. The way to check the ordinality is described in Sect. 7.4.3 (p. 287).
282
7 Biclustering
can be represented as ln pr (π f r |U) = S1(t)f r + β1 − 1 ln π f r + S0(t)f r + β0 − 1 ln(1 − π f r ),
(7.9)
where S1(t)f r =
S J
(t) (t) ssr m j f zs j u s j ,
s=1 j=1
S0(t)f r =
S J
(t) (t) ssr m j f z s j (1 − u s j ).
s=1 j=1
To obtain all S1 f r s and S0 f r s, with the smoothed rank membership matrix (S(t) ), the following matrix operation is carried out: (t) (t) (t) S(t) (F × R) 1 = {S1 f r } = M F (Z U) S (t) (t) (t) S(t) (F × R). 0 = {S0 f r } = M F {Z (1 S 1 J − U)} S
Next, by differentiating Eq. (7.9) with respect to π f r and setting the derivative equal to 0, the MAP of π f r at the t-th M step is obtained as A P,t) = π (M fr
S1(t)f r + β1 − 1 S0(t)f r + S1(t)f r + β0 + β1 − 2
.
In addition, the MLE of π f r is given as L ,t) π (M = fr
S1(t)f r S1(t)f r + S0(t)f r
.
[2] Convergence Criterion The criterion for judging the convergence of EM algorithm is identical to that used in biclustering shown in Eq. (7.8) (p. 273). That is, (t−1) ) < c|ell(U|(t−1) )|. ell(U|(t) B ) − ell(U| B B
Note, however, that the smoothed RMPs are used in the ELLs in the above equation as follows: ell(U|(t) B )=
S J F R
(t)
(t) (t) z s j ssr m j f u s j ln π (t) f r + (1 − u s j ) ln(1 − π f r )
s=1 j=1 f =1 r =1
=
R F f =1 r =1
(t) (t)
S1(t)f r ln π (t) f r + S0 f r ln(1 − π f r ) .
7.4 Main Outputs
283
7.4 Main Outputs This section describes the results of a biclustering analysis. The settings for the analysis are shown in Estimation Setting of Biclustering . The sample size and number of items of the analyzed data (J35S515) were 515 and 35, respectively. The analysis model used was ranklustering. This means that 515 students were not classified into nominal clusters (i.e., latent classes) but instead into ordinal clusters (i.e., latent ranks). Estimation Setting of Biclustering 1. 2. 3. 4. 5. 6. 7. 8. 9. 10.
Data: J35S515 (www) Model: Ranklustering (student grading × item clustering) Number of latent fields: F = 5 Number of latent ranks: R = 6 Kernel of filter: f = [0.12 0.76 0.12] Prior of field membership in E step: Discrete uniform 15 /5 Prior of rank membership in E step: Discrete uniform 16 /6 Estimator in M step: MAP estimator Prior of π f r in M step: Beta density with (β0 , β1 ) = (1, 1) Constant of convergence criterion: c = 10−4
The numbers of fields and ranks were set to 5 and 6 (F, R) = (5, 6), respectively.9 In addition, a discrete uniform density was employed as the prior for the ranks and for the fields in each E step. Moreover, in each M step, the beta density with0hyperparameters (β0 , β1 ) = (1, 1) was used as the prior density for each element of the bicluster reference matrix.
7.4.1 Bicluster Reference Matrix The EM algorithm was terminated at the sixth EM cycle, which means that the sixth updated value of the bicluster reference matrix was employed as the final estimate. That is, ˆ B = (6) B . ˆ B . Rank 1 is the Table 7.1 shows the estimate of the bicluster reference matrix, lowest ability group, and Rank 6 is the highest. In addition, although the five latent fields are nominal clusters, they are sorted from easy to hard. More specifically, the 9
A way to find the optimal number of ranks and fields from the data are described in Sect. 7.6 (p. 328).
284
7 Biclustering
ˆ B) Table 7.1 Estimate of bicluster reference matrix ( Field 1 Field 2 Field 3 Field 4 Field 5 TRP∗2 LRD∗3 ∗1 ∗3
Rank 1 Rank 2 Rank 3 Rank 4 Rank 5 Rank 6 LFD∗1 3 0.650 0.792 0.881 0.914 0.938 0.970 7 0.094 0.278 0.617 0.904 0.984 0.998 4 0.225 0.321 0.440 0.616 0.751 0.899 0.104 0.186 0.281 0.361 0.603 0.841 9 0.033 0.055 0.093 0.130 0.262 0.598 12 7.93 12.38 16.34 21.27 28.24 4.83 143 88 77 90 76 41
latent field distribution, latent rank distribution
∗2
test reference profile,
fields are sorted such that the row sums of the bicluster reference matrix are arranged in descending order. For example, this table shows that the CRR for an item classified in Field 1 by the students belonging to Rank 1 is 65.0%, whereas an item classified in the field by Rank 2 students is 79.2%, Rank 3 students is 88.1%, · · · , and Rank 6 students is 97.0%. This indicates that the CRR becomes larger as the rank increases. The latent rank distribution (LRD) and TRP in the table will be described in other sections. The latent field distribution (LFD) shown in the rightmost column in the table is a frequency distribution that counts how many items are classified in each latent field. For example, three items are classified in Field 1 and seven items in Field 2. The field including the largest number of items (12) is Field 5. Figure 7.9 shows the array plots of the data matrices before (left) and after (right) the analysis. The size of the data matrix is (515 students × 35 items). The black cells represent correct responses and the white cells incorrect responses. The figure on the right is an array plot sorted according to the estimated ranks and fields. Ranks 1–6 are sorted from the top, and Fields 1–5 from the left. The black and white blocks become clearer before and after the sorting. In addition, as the rank is higher, the proportion of black cells in a bicluster increases. This is because the higher the rank, the greater the ability of the students in the rank. Moreover, as the field label is larger, the proportion of black cells decreases. This is because the fields are rearranged from easy to hard and are labeled as Fields 1–5, although the five fields are originally nominal clusters. This implies that a bicluster placed lower and more to the left has a larger proportion of black cells because such a bicluster is a submatrix of the responses by students belonging to a higher rank to an easier field. Conversely, a bicluster placed higher and more to the right has a larger proportion of white cells because it is a submatrix of lower rank students for a harder field.
7.4 Main Outputs
Original Data
∗
285
Ranklustering (F, R) = (5, 6)∗
The leftmost column is Field 1, and the topmost row is Rank 1.
Fig. 7.9 Array plot of data sorted using ranklustering
7.4.2 Field Reference Profile The field reference profile (FRP) of Field f is the f -th row vector of the bicluster ˆ B . Figure 7.10 shows the FRP plots of the five fields. reference matrix, It can be seen from the figure that the items in Field 1 form an easy item cluster because the CRRs are high even for the students belonging to Ranks 1 and 2. Conversely, Field 5 items form an item cluster with more difficulty because the CRRs are moderate even for the students in Ranks 5 and 6. The FRP of Field 2 has a large increase between Ranks 2 and 3, indicating that Field 2 items are useful for discriminating whether the ability of a student is Rank 2 or lower or Rank 3 or higher. The
286
7 Biclustering
Fig. 7.10 Field reference profiles Table 7.2 FRP indices (Ranklustering) Index Field 1 Field 2 Field 3 Field 4 Field 5
α˜ 1 2 3 4 5
a˜ 0.142 0.340 0.176 0.242 0.336
β˜ 1 3 3 5 6
b˜ 0.650 0.617 0.440 0.603 0.598
γ˜ 0.000 0.000 0.000 0.000 0.000
c˜ 0.000 0.000 0.000 0.000 0.000
items in Fields 3 and 4 are of moderate difficulty, and the CRRs increase linearly across the latent rank scale. It is difficult to check the shape of each FRP when the number of fields is large. The FRP indices summarize the FRP shape: the field slope index, field location index, and field monotonicity index, which correspond to the IRP indices in the LRA (see Sect. 6.4, p. 223). Table 7.2 shows the FRP indices. First, the field slope index indicates that a˜ is the largest increase in the FRP among adjacent rank pairs, and α˜ is the smaller rank of the pair. For example, for the FRP of Field 2, the CRRs of Ranks 2 and 3 were 0.278 and 0.617, respectively, between which a difference of 0.340 (= 0.617 − 0.278) is the largest increase among the five adjacent rank pairs. Therefore, the slope index for the field is obtained as (α˜ 2 , a˜ 2 ) = (2, 0.340). ˜ and the Next, the field location index expresses the FRP value closest to 0.5 as b, ˜ rank having the value is denoted as β. For example, in Field 3, the FRP value (i.e., CRR) is closest at Rank 3, and the value is then 0.440. The location index of the field is thus (β˜3 , b˜3 ) = (3, 0.440).
7.4 Main Outputs
287
Fig. 7.11 Field monotonicity index (identical to Fig. 6.18, p. 227)
Finally, the field monotonicity index quantifies the monotonicity of the FRP, where γ˜ represents the ratio of the number of adjacent rank pairs between which the CRR drops to the number of pairs, and c˜ are cumulative drops. In Fig. 7.11, among the nine adjacent rank pairs (1, 2), (2, 3), · · · , (9, 10), the CRR decreases in three pairs, (2, 3), (6, 7), and (7, 8); thus, γ˜ = 3/9 = 0.333. In addition, because the decrements between the respective rank pairs are −0.2, −0.15, and −0.1, c˜ = −0.2 − 0.15 − 0.1 = −0.45. Note that (γ˜ , c) ˜ = (0, 0) when the FRP is monotonically increasing, and conversely, (γ˜ , c) ˜ = (1, −1) when it is monotonically decreasing. Figure 7.10 shows that all FRPs increase monotonically. Thus, in Table 7.1, the monotonicity index is obtained as (γ˜ , c) ˜ = (0, 0) for all items.
7.4.3 Test Reference Profile The values of the test reference profile (TRP) are the expected NRSs of the respective ranks. As shown above, the LFD was obtained as ⎡ ⎤ 3 ⎢7⎥ ⎢ ⎥ ⎥ dˆ LFD = ⎢ ⎢ 4 ⎥. ⎣9⎦ 12
288
7 Biclustering
Fig. 7.12 Test reference profile (Ranklustering)
30
Expected Score
25
143
20 15
90
88 77
76
10
41
5 0
1
2
3
4
5
6
Latent Rank
The TRP can then be calculated as ⎤ ⎡ ⎤ ⎡ ⎤ 3 4.834 0.650 0.094 · · · 0.033 ⎢ ⎥ ⎢ ⎥ ⎢ .. . . .. ⎥ ⎢ 7 ⎥ = ⎢ 7.932 ⎥ . = ⎣ ... ⎢ ⎢ ⎥ ⎦ . . . . . .. ⎥ ⎦ ⎣ .. ⎦ ⎣ 0.970 0.998 · · · 0.598 28.239 12 ⎡
ˆ B dˆ L F D ˆt TRP =
That is, the TRP value of each rank is the sum of the rank reference vector (i.e., the column vector corresponding to the rank in the bicluster reference matrix), in which each element is weighted by the number of items corresponding to the field (i.e., LFD). From the TRP, for example, a student belonging to Rank 1 is expected to correctly respond to approximately 5 (4.834) items of this 35-item test. Similarly, a student belonging to Rank 2 is expected to pass approximately eight (7.932) items. The line graph in Fig. 7.12 illustrates the TRP. In addition, the bar graph shows the LRD, which will be described in another section.10 The figure shows that the TRP is monotonically increasing, and the monotonicity of TRP is important in ranklustering because students are classified into ordinal (i.e., latent ranks) rather than nominal (i.e., latent classes) clusters. For the latent scale created by ranklustering (i.e., student grading × item clustering) which is regarded as ordinal, one needs to provide some evidence showing the ordinality of the clusters.11 The TRP can be useful as a proof and provides two levels for showing the ordinality of the latent scale.
10
It is recommended to plot both the TRP and LRD in the same figure because both indicate the characteristics of each rank. 11 This discussion is not required in biclustering (i.e., student clustering × item clustering).
7.4 Main Outputs
289
Two Levels of Ordinality of Latent Rank Scale Strongly ordinal alignment condition The FRPs of all fields monotonically increase, and the TRP then monotonically increases as well. Weakly ordinal alignment condition Although TRP monotonically increases, not all FRPs monotonically increase. The strong condition is satisfied when the FRPs of all fields are monotonically increasing. The TRP then consequently increases monotonically. Figure 7.10 (p. 286) shows that all FRPs monotonically increase, which can also be found from Table 7.2 (p. 286), which indicates the monotonicity index of each field is (γ˜ , c) ˜ = (0, 0). Therefore, the latent scale in this analysis can be regarded to satisfy the strong condition. Meanwhile, the weak condition is satisfied when the TRP is monotonically increasing even though not all FRPs are monotonically increasing. If some (or all) FRPs do not monotonically increase, when the TRP does, this weak condition is met, and the latent scale can then be said to be ordinal in terms of the expected NRS. It is necessary for at least the weak condition to be cleared. If the TRP is not monotonic, the following two methods are recommended. For TRP Satisfying Weakly Ordinal Alignment Condition 1. Use a filter with a flatter and larger-sized kernel. 2. Reduce the number of latent ranks. For Point 1, the rank membership profile is more smoothed as a stronger filter is used, and a strong filter can be made of a flat filter that is large in size and has a small center value. Then, the smoother the rank membership profiles of the students, the more monotonic the FRP of each field becomes, which leads to the TRP monotonicity. Next, for Point 2, reducing the number of ranks makes it more likely for the TRP to monotonically increase. For example, consider a situation in which the number of ranks is reduced from six to two (Fig. 7.13). In the upper left segment of the figure, some FRPs are not monotonically increasing. As a result, the TRP (top right) does not monotonically increase and instead decreases from Rank 4 to Rank 5 because the shape of the TRP is a weighted average of all FRPs. In the figure, the TRP is drawn assuming that the number of items classified in each field is equal. Next, when the number of ranks is reduced to two (bottom left of the figure), it is supposed that the two ranks are created by combining the six ranks as follows12 : Original Scale : {1 2 3 4 5 6} Integrated Scale : { 1 2 } 12 There are many other possibilities such as {1 2 | 3 4 5 6} and {1 2 3 4 5 | 6}. Ranks whose students’ response patterns are similar are grouped.
290
7 Biclustering
Fig. 7.13 Smaller N of latent ranks and TRP monotonicity
Under this situation, the students are divided into two (i.e., high- and low-ability level) groups; thus, each FRP, which is a line graph with six points, becomes a linear plot connecting two points. The new Rank 1 value of each new FRP then becomes almost equal to the average of the first three points of the original FRP. Similarly, the new Rank 2 value of each new FRP is obtained as that of the last three points.13 Because Rank 2 students have a higher ability, the new FRPs tend to increase. Consequently, it is highly possible for the TRP, which is the sum of the weighted FRPs, to be monotonic (bottom right of the figure) when the number of ranks is reduced. Meanwhile, how about when the number of fields is reduced? Is it effective to make the TRP monotonically increase? For instance, consider reducing the number of fields from six to two (Fig. 7.14). In the figure on the upper left, four of the six FRPs are not monotonically increasing. As a result, the TRP is also not monotonically increasing (top right). The number of fields is then reduced to two. The FRPs having similar patterns are merged into two FRPs, and as shown in the bottom left of the figure, the top three FRPs and bottom three FRPs are combined. In this case, each new FRP becomes an average plot of the corresponding three original FRPs.14 In the figure, one of the two FRPs (i.e., the upper FRP) becomes monotonic owing to the averaging; thus, it can be said that the reduction of the number of fields makes some of the new FRPs monotonic. However, the shape of the TRP remains almost the same before and after the integration of the FRPs because it is eventually shaped as an averaged plot of the two FRPs. Therefore, reducing the number of fields has little effect on making the TRP monotonic. 13 14
Note that the average is weighted by the numbers of students belonging to the respective ranks. Note that the average is weighted by the numbers of items classified in the respective fields.
7.4 Main Outputs
291
Fig. 7.14 Smaller N of latent fields and TRP monotonicity
Note that neither Point 1 nor Point 2 is infallible in making the TRP monotonically increase. (2 The last resort is to reorder the values of each FRP, as shown in Command 4) in Ranklustering Estimation Procedure (Monotonic Increasing FRP) (p. 275). That is, (2-4-1) Sort π f is the process of sorting the FRP values of Field f in an ascending order. This is simple but ensures that each FRP is monotonically increasing, and the TRP then becomes monotonic as well. As described above, the TRP monotonicity is important as a basis for the ordinality of the latent rank scale. Note that the basis does not necessarily have to be TRP; in this case, evidence from a different perspective is required to show the ordinality. External Standard for Ordinality of Latent Rank Scale When the TRP is not monotonically increasing, an external standard may be provided to show the ordinality of the latent rank scale. Suppose a situation in which the TRP is not monotonically increasing when data on an English test are analyzed using ranklustering with R = 10. Even then, when making students report their Test of English for International Communication (TOEIC) scores and averaging the scores by the respective ten ranks, if the ten averages are ordered in accord with the ranks, it can be said that the latent rank scale is ordinal in terms of the TOEIC score, although it is necessary to examine whether this reason is sufficiently convincing.
292
7 Biclustering
7.4.4 Field Membership Profile One of the main outputs of ranklustering (and biclustering) is the field membership profile of each item. The field membership profile of each item represents the memberships of the item to the respective fields. The j-th row vector of the field membership matrix (M F ) is the field membership profile of Item j. ˆ F ) has been already given in The estimate of the field membership matrix ( M ˆ B ; Table 7.1, the process of finding the estimate of the bicluster reference matrix ( p. 284). Because the EM algorithm in this analysis was terminated at the sixth cycle, the last two cycles were EM Cycle 5
EM Cycle 6
(5) (5) (6) (6) (6) (5) (6) M · · · → M (5) → S → M → → → S → M → R F B R F B . Then, the latest update of the bicluster reference matrix was specified as the estimate ˆ B = (6) of parameter B , as described thus far. Likewise, the final update of the field membership matrix is determined as the parameter estimate. That is, ˆ F = M (6) M F . ˆ F ). As can be Table 7.3 shows the estimate of the field membership matrix ( M seen from the table, every item had an extremely high probability of being classified into a certain field, and 26 out of all 35 items had a value of 1.0000 when rounded to the fourth decimal place. In addition, for five of the remaining nine items, the largest membership was above 0.99, the exceptions being Items 9, 28, 29, and 34. The LFE in the table indicates the latent field estimate of each item, and is the field with the highest membership of the item selected. For example, the LFE of Item 1 is Field 1 because the item has the highest membership to Field 1. Similarly, that of Item 2 is Field 4 as the membership of the item to the field is the greatest. In addition, the LFD in the bottom line of the table represents the latent field distribution, which is the distribution of the number of items in the respective fields.15 For example, there were three items classified in Field 1 (i.e., Items 1, 31, 32). The field with the largest number of items was Field 5 at 12 items.
7.4.5 Field Analysis The fields of the respective items become clear owing to the field membership matrix estimate. Next, based on this, a field analysis is conducted, which is a process of analyzing and synthesizing the item contents of each field to clarify the field characteristics. 15
It is also listed in the bottom line of Table 7.1 (p. 284).
7.4 Main Outputs
293
ˆ F) Table 7.3 Estimate of field membership profile ( M Field 1 Field 2 Field 3 Field 4 Field 5LFE∗1 Item 01 1.0000 1 Item 02 0.0045 0.9995 4 Item 03 1.0000 3 Item 04 1.0000 4 Item 05 1.0000 4 0.0003 0.9997 5 Item 06 Item 07 1.0000 3 Item 08 1.0000 4 4 Item 09 0.0023 0.9977 Item 10 1.0000 4 1.0000 Item 11 2 4 Item 12 1.0000 Item 13 0.9974 0.0026 4 Item 14 1.0000 5 Item 15 1.0000 5 0.0001 0.9999 5 Item 16 4 Item 17 1.0000 Item 18 1.0000 5 Item 19 1.0000 5 Item 20 1.0000 5 Item 21 1.0000 2 2 Item 22 1.0000 Item 23 1.0000 2 Item 24 1.0000 2 Item 25 1.0000 2 Item 26 1.0000 2 Item 27 0.9995 0.0005 3 Item 28 1.0000 5 Item 29 0.2729 0.7271 5 Item 30 1.0000 5 Item 31 1.0000 1 Item 32 1.0000 1 1.0000 3 Item 33 0.0937 0.9063 5 Item 34 1.0000 5 Item 35 LFD∗2 3 7 4 8 12 Blank cells indicate “0.0000” after rounding to four decimal places. ∗1 latent field estimate, ∗2 latent field distribution
294
7 Biclustering
Table 7.4 Field analysis Item Item 1 Item 31 Item 32 Item 21 Item 23 Item 22 Item 24 Item 25 Item 11 Item 26 Item 7 Item 3 Item 33 Item 27 Item 2 Item 9 Item 10 Item 8 Item 12 Item 4 Item 17 Item 5 Item 13 Item 34 Item 29 Item 28 Item 6 Item 16 Item 35 Item 14 Item 15 Item 30 Item 20 Item 19 Item 18
Item Content
CRR
LFE
.850 .812 .808 .616 .600 .586 .567 .491 .476 .452 .573 .458 .437 .414 .392 .390 .353 .350 .340 .303 .276 .250 .237 .229 .227 .221 .216 .216 .155 .126 .087 .085 .054 .052 .049
1 1 1 2 2 2 2 2 2 2 3 3 3 3 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 LFD
Field Membership Profile Field 1 Field 2 Field 3 Field 4 1 1 1 1 1 1 1 1 1 1 1 1 1 .9995 .0005 .0045 .9955 .0241 .9759 1 1 1 1 1 1 .9974 .0937 .2729 .0171
3
7
4
7
Field 5
Field Analysis Descriptions of Field 1
Descriptions of Field 2
Descriptions of Field 3
Descriptions of Field 4
.0026 .9063 .7271 .9829 1 1 1 1 1 1 1 1 1 12
Descriptions of Field 5
Table 7.4 shows an example of a field analysis. The 35 items were classified into five fields, and the items in the table are sorted by field. For example, Items 1, 31, and 32 are grouped into Field 1 because these three items have the greatest membership to Field 1. By examining the contents of the three items and synthesizing them, the characteristics of Field 1 can be described. Note that such a task would be difficult for psychometricians and data analysts. Knowledge and experience of teachers or researchers regarding the test subject are necessary.
7.4 Main Outputs
295
The items in each field have similar response patterns. They are therefore clustered in the same field, which means that the items in the same field do not always have similar subject contents. However, the fact that the items have similar TF patterns certainly implies that the items are highly associated with each other. It is impossible for the difficulty levels and contents of such items to be completely different. Thus, although not an easy task, a subject-teaching expert can surely describe the characteristics of the respective fields. If the characteristics cannot be stated well, the data may then be reanalyzed with a different number of fields. N of Latent Fields and N of Factors Determining the number of fields is similar to determining the number of factors in a factor analysis (FA). In an FA, a scree plot (Sect. 2.8.4, p. 65) must be first checked to roughly determine the number of factors. The data are analyzed using the FAs whose number of factors are approximately that determined by the scree plot. For example, if one has considered that the optimal number of factors is four from the scree plot, then three FAs with three to five factors are carried out. Next, by examining and comparing the three factor pattern matrices, the number of factors in which the factor pattern matrix becomes the most accountable and interpretable is employed as the optimal number. This kind of decision regarding the number of factors is often made in practice. Likewise, in a field analysis, if the items are not grouped well, it is better to reanalyze the data with a different number of fields. The number of fields that can produce more understandable and interpretable item clustering should then be employed. However, when the fit of the model with that number of fields is extremely poor, the number is invalid. It is recommended that the range of the numbers of fields, for each of which fit is moderately good, first be determined, and the optimal number of fields then be chosen from the range in terms of interpretability and accountability.
7.4.6 Rank Membership Profile This section describes an important output, i.e., the rank membership profile. The rank membership profile of Student s (ms ) is the s-th row vector of the rank memˆ R ), similar to the bership matrix (M R ). The rank membership matrix estimate ( M ˆ F ), has also already been obtained during the field membership matrix estimate ( M ˆ B ). Because the EM process of acquiring the bicluster reference matrix estimate ( algorithm was completed in six cycles, the estimation process was as follows:
296
7 Biclustering
ˆ R) Table 7.5 Estimate of rank membership matrix ( M Rank Student 1 Student 2 Student 3 Student 4 Student 5 Student 6 Student 7 Student 8 Student 9 Student 10 Student 11 Student 12 Student 13 Student 14 Student 15 .. . Student 515 LRD RMD ∗1 Latent
∗4 Latent
1 0.731 0.000 0.613 0.000 0.816 0.758 0.000 0.000 0.828 0.000 0.000 0.000 0.000 0.709 0.949 .. . 0.000 143 133.89
2 0.268 0.000 0.382 0.030 0.240 0.690 0.000 0.000 0.171 0.000 0.085 0.001 0.000 0.289 0.051 .. . 0.000 88 96.10
3 0.002 0.008 0.004 0.751 0.002 0.053 0.000 0.003 0.001 0.000 0.835 0.181 0.002 0.003 0.000 .. . 0.001 77 83.69
4 0.000 0.320 0.000 0.213 0.000 0.000 0.000 0.243 0.000 0.000 0.079 0.813 0.069 0.000 0.000 .. . 0.106 90 84.37
5 0.000 0.672 0.000 0.000 0.000 0.000 0.212 0.753 0.000 0.000 0.000 0.005 0.926 0.000 0.000 .. . 0.888 76 73.95
6 0.000 0.000 0.000 0.000 0.000 0.000 0.787 0.001 0.000 1.000 0.000 0.000 0.003 0.000 0.000 .. . 0.005 41 42.99
LRE∗1 1 5 1 3 1 2 6 5 1 6 3 4 5 1 1 .. . 5
RUO∗2 0.366 0.000 0.623 0.282 0.271 0.077 0.002 0.207 0.095 0.006 0.003 0.408 0.053 .. . 0.006
RDO∗3 0.477 0.039 0.372 0.270 0.323 0.000 0.102 0.22 0.074 .. . 0.120
rank estimate, ∗2 Rank-up odds, ∗3 Rank-down odds rank distribution, ∗5 Rank membership distribution
EM Cycle 5
EM Cycle 6
(5) (5) (6) (6) (6) (5) (6) M · · · → M (5) → S → M → → → S → M → R F B R F B . During this process, the latest update for the rank membership matrix is specified as its estimate. That is,16 ˆ R = M (6) M R . ˆ R . For instance, the Table 7.5 shows the rank membership matrix estimate, M membership of Student 1 to Rank 1 is estimated at 79.4% and the membership to Rank 2 is estimated at 20.5%. In addition, the latent rank estimate of the student is obtained as Rank 1. This is because the membership of the student to the rank is the largest. Thus, the latent rank with the largest membership is employed as the student’s latent rank estimate.
(7)
(6)
(6)
(6)
Although M R can be obtained anew from {U, B , M F }, it is almost equal to M R because of convergence.
16
7.4 Main Outputs
297
0.8
0.8
0.8
0.6 0.4 0.2
Membership
1.0
Membership
Membership
1.0 0.6 0.4 0.2
1
2
3
4
5
1
6
2
3
4
5
1
6
0.8
0.8
0.8
0.4 0.2
Membership
1.0
0.6
0.6 0.4 0.2
2
3
4
5
1
6
2
3
4
5
1
6
0.8
Membership
0.8
Membership
0.8
0.6 0.4 0.2
3
4
5
6
1
2
Student 10
3
4
5
1
6
Membership
0.8
Membership
0.8 0.6 0.4 0.2 0.0
3
4
5
6
2
3
4
5
1
0.8
Membership
0.8
Membership
0.8
0.6 0.4 0.2
4
Latent Rank
5
6
6
0.6 0.4 0.2 0.0
0.0 3
4
Student 15 1.0
2
3
Latent Rank
1.0
1
2
Student 14
Student 13
0.0
5
0.2
6
1.0
0.2
6
0.4
Latent Rank
0.4
5
0.6
0.0 1
Latent Rank
0.6
4
Student 12
0.8
2
3
Latent Rank 1.0
1
2
Student 11
0.0
6
0.2
1.0
0.2
5
0.4
1.0
0.4
6
0.6
Latent Rank
Latent Rank
0.6
4
0.0
0.0 2
3
Student 9 1.0
1
2
Latent Rank
1.0
0.0
5
0.2
Student 8
Student 7
0.2
6
0.4
1.0
0.4
5
0.6
Latent Rank
Latent Rank
0.6
4
0.0
0.0
0.0
3
Student 6
1.0
1
2
Latent Rank
1.0 Membership
Membership
0.2
Student 5
Student 4
Membership
0.4
Latent Rank
Latent Rank
Membership
0.6
0.0
0.0
0.0
Membership
Student 3
Student 2
Student 1 1.0
1
2
3
4
Latent Rank
5
6
1
2
3
4
Latent Rank
Fig. 7.15 Rank membership profiles (Ranklustering)
Figure 7.15 shows the plots of the rank membership profiles for the first 15 students. For example, Students 1 and 3 have the highest membership at Rank 1. The latent rank estimates for the two students are thus Rank 1. Note, however, that the membership of Student 1 is 73.1%, whereas that of Student 3 is 61.3%, which is approximately 12% greater for Student 1. In addition, the memberships of Students 1 and 3 to Rank 2 are 26.8% and 38.2%, respectively. Thus, the probability of belonging to Rank 2 is higher for Student 2. Accordingly, although the latent rank estimates of both students are Rank 1, the possibility of moving up to Rank 2 is higher for
298
7 Biclustering
Student 3. This difference is represented by the rank-up odds (RUO in Table 7.5). Let us denote the latent rank estimate of Student s as Rˆ s ; the odds are then defined as RUOs =
m s, Rˆ s +1 m s, Rˆ s
.
Note, however, that if the student’s rank estimate is the highest ( Rˆ s = R), these odds cannot be calculated. The rank-up odds of Students 1 and 3 are given by m 12 0.268 = 0.366, = m 11 0.731 m 32 0.382 RUO 3 = = = 0.623. m 31 0.613 RUO 1 =
In addition, the rank-up odds ratio of the two students is obtained by RUOR 3/1 =
RUO 3 0.623 = = 1.701, RUO 1 0.366
which indicates that Student 3 is approximately 2 (≈ 1.701) times more likely to move up to Rank 2. Likewise, the rank estimates of Students 7 and 10 are Rank 6 because their memberships to the rank are the highest. However, from their rank membership profiles, the membership of Student 7 to Rank 6 is 78.7%, whereas that of Student 10 is 100%.17 It can thus be stated that Student 7 is more likely to drop to Rank 5. Therefore, the feedback comments for the two students should not be the same, although their rank estimates are equal. The rank-down odds (RDO), shown in Table 7.5, measures this possibility as follows: RDO s =
m s, Rˆ s −1 m s, Rˆ s
,
although these odds are not computed when the student’s rank estimate is the lowest ( Rˆ s = 1). Using this, the RDOs of Students 7 and 10 are calculated as m 7,5 0.212 = 0.270, = m 7,6 0.787 0.000 m 10,5 = 0.000, = = m 10,6 1.000
RDO 7 = RDO 10
17
It was exactly 99.99982%.
7.4 Main Outputs
299
Fig. 7.16 LRD and RMD (Ranklustering)
140
143
120
Frequency
100 80
90
88 77
60
76
40
41 20 0
1
2
3
4
5
6
Latent Rank
respectively.18 These two odds imply that Student 7 is more likely to be transferable to the lower rank by one grade. To directly compare two rank-down odds, the rankdown odds ratio is used. That is, RDOR s/s =
RDO s . RDO s
Normally, the odds ratio is easy to interpret when the smaller odds are specified as the denominator to make the odds ratio > 1. In this case, however, the smaller odds for Student 10 are almost 0; thus, this odds ratio cannot be calculated.
7.4.7 Latent Rank Distribution The latent rank distribution (LRD) is a frequency distribution that counts the number of students belonging to each rank and is listed in the second line from the bottom of Table 7.5. Because this distribution is an extremely important output, it has already been shown in various places, as indicated in the rightmost column of Table 7.1 (p. 284). This is also shown as the bar graph in Fig. 7.12 (p. 288) with the line graph for TRP. By plotting them simultaneously, it was found for instance that the TRP value at Rank 1 is approximately 5, which means that the students belonging to Rank 1 can pass only 5 out of 35 items, and there are 143 such students in the rank. The bar graph in Fig. 7.16 shows the LRD. The number of students was the largest at 143 in Rank 1 and the smallest at 41 in Rank 6. The sum of the frequencies is 515 because the sample size was 515. In addition, the line graph in the figure is the rank membership distribution (RMD). This distribution is obtained as the column sum ˆ R ) as follows: vector of the rank membership matrix ( M 18
When there are more than two subscripts, a comma is sometimes used between them for readability.
300
7 Biclustering
⎤ ⎤⎡ ⎤ ⎡ 133.89 0.731 0.000 · · · 0.000 1 ⎢0.268 0.000 · · · 0.000⎥ ⎢1⎥ ⎢ 96.10 ⎥ ⎥ ⎥⎢ ⎥ ⎢ ˆ R 1 S = ⎢ =M ⎢ .. .. ⎥ (R × 1). .. . . . ⎥ ⎢.⎥ = ⎢ ⎣ . . .. ⎦ ⎣ .. ⎦ ⎣ . ⎦ . 42.99 0.000 0.000 · · · 0.005 1 ⎡
dˆ RMD
In LRD, +1 is added to the frequency of a rank when a student is estimated to belong to the rank, whereas in RMD, the probability (i.e., membership) that the student belongs to the rank is to the frequency. For example, the latent rank estimate of Student 4 was Rank 3; thus, +1 is added to the frequency of Rank 3 in LRD, and no value is added to the frequency for all other ranks, whereas +0.757 is added to the frequency of Rank 3 in the RMD because the membership of the student to this rank is 75.7%, and the frequency of each of the other ranks is likewise added based on the rank membership. In LRD, the addition of +1 to Rank 3 is too high because the possibility of Student 4 belonging to other ranks is neglected, whereas the RMD takes the ambiguity of the possibilities into account. The RMD can be considered the LRD of a sample population. Why is this? Consider first that there are 100 students from the population with the same ability as Student 4; however, the response patterns of 100 students do not then become the same as that of Student 4 because each response is a stochastic variable. The 100 response patterns will be similar to that of Student 4 yet different. By analyzing such responses of the 100 students, it is highly likely that the ranks of 3 out of 100 students will be estimated as Rank 2, 76 as Rank 3, and 21 as Rank 4. In this case, the LRD would be +3 for Rank 2, +76 for Rank 3, and +21 for Rank 4. Therefore, it can be stated that the RMD is an LRD, not for the sample but for the population. Note that the sum of the rank membership profile of each student is 1; thus, the sum of the RMD equals the sample size (515 in this analysis).
7.4.8 Rank Analysis It is important to explore the kind of profile each rank has. More specifically, the goal of the analysis is to clarify what each ranked student can and cannot do and how well. Without such information, it becomes impossible to make a future learning plan for the students in each rank. Figure 7.17 shows the plots of the rank reference vectors, which are the column vectors of the bicluster reference matrix (Table 7.1, p. 284). The top line plot is the rank reference vector for Rank 6, and the bottom is that of Rank 1. The plots are parallel and do not intersect, which shows the ordinality of the latent rank scale. Comparing the line plots of Ranks 2 and 3, a large gap in the CRR is observed in Field 2, which indicates that the items in Field 2 contribute significantly to the discrimination regarding whether students belong to Rank 2 or 3. In other words, it is inevitable for Rank 2 students training on Field 2 items to step up to Rank 3. This
7.4 Main Outputs
301
Fig. 7.17 Rank reference vector (Ranklustering) Table 7.6 Can-do chart (Ranklustering) Field
Field Reference Profile (FRP)
Content
FRP Index
Rank 1 Rank 2 Rank 3 Rank 4 Rank 5 Rank 6
1
Numbers and Calculation
.650
.792
.881
.914
.938
.970
1
.145
1
.652
0
0
2
Volume and Measurement
.094
.278
.617
.904
.984
.998
2
.351
3
.642
0
0
3
Numerical Relationships
.225
.321
.440
.616
.751
.899
3
.163
3
.457
0
0
4
Volume and Measurement
.104
.186
.281
.361
.603
.841
4
.232
4
.401
0
0
5
Diagrams
.033
.055
.093
.130
.262
.598
5
.344
6
.632
0
0
Title 1
Title 2
Title 3
Title 4
Title 5
Title 6
Can-Do Statements Statements for Rank 1
Rank Title
Statements for Rank 2 Statements for Rank 3 Statements for Rank 4 Statements for Rank 5 Statements for Rank 6
can also be concluded from the fact that the FRP of Field 2 shows a large increase from Ranks 2 to 3 (see Fig. 7.10, p. 286). The rank analysis should be conducted after the field analysis (Sect. 7.4.5, p. 292) is achieved because if the characteristics of the respective fields are not fully captured, an explanation such as “students in Rank 1 can pass Field 1 but cannot Field 2” then lacks specifics. If the field analysis is conducted well, a can-do chart, similar to the one presented in Fig. 7.6, can be easily created.19 A can-do chart may be constructed by placing ˆ B ) at the chart center. In addition, the contents of the bicluster reference matrix ( the fields are placed to the left of the bicluster reference matrix, where the results of 19
A can-do chart was also created in the LRA. Refer to Sect. 6.5 (p. 229).
302
7 Biclustering
the field analysis are described. Moreover, the FRP indices summarizing the shape of each FRP are listed to the right of the bicluster reference matrix. After the field contents, the bicluster reference matrix and the FRP indices are arranged, the next step is to examine the rank reference vector of each rank in the bicluster reference matrix. For example, the rank reference vector of Rank 1 is ⎡
⎤ 0.650 ⎢ ⎥ πˆ 1 = ⎣ ... ⎦ , 0.033 which indicates that Rank 1 students master Field 1 at an intermediate level, but not the other fields. If this is a math test, and the content of Field 1 is “calculation,” it is then found that those classified in Rank 1 are students who are only capable of conducting a calculation, based upon which the can-do statements for Rank 1 can be described and the title for the rank can be named. Furthermore, based on the can-do statements, post-test instructions for Rank 1 students can be created. It is a good idea to provide different remedial lessons based on the respective ranks. As described in (A) in Sect. 1.5.2 (p. 10), it is important to show the students the map of a test, i.e., what the test is measuring and its start and goal. A can-do chart visualizes (or verbalizes) such information. The starting point of a test is described in the can-do statements for the lowest rank (i.e., Rank 1). Students who are not yet ready for the test (having less than a Rank 1 ability level) would be better off taking an easier test and discovering the position of each of their positions in the outer world of the present map. In addition, the map destination is the highest rank of the test.20 Furthermore, the intermediate ranks correspond to the road from the start to the goal on the map because they show the process from the lowest to highest rank. Note that not all students go through the same road from the start to the goal. This can-do chart is created from a single test data analysis. It is known that some students make a qualitative leap and others move to the next map without reaching the goal of the present map.
In fact, Field 5 was found to be difficult even for the highest-ranking students (πˆ 56 = 0.632). This test is thus useful for evaluating students whose ability level is slightly higher.
20
7.4 Main Outputs
303
Single-Shot versus Longitudinal Data This book introduces various methods for analyzing single-shot data, which are data taken only once at a certain time, whereas longitudinal data, or multiple-shot data, are data taken at multiple times over a certain period. Longitudinal data are more informative and can be used to analyze changes in the ability of individual students. The can-do chart shown in Fig. 7.6 was created from single-shot data and can be improved by integrating the findings from the longitudinal data. Although a longitudinal data analysis is not yet popular in the field of educational measurements, the study by O’Connell and McCoach (2008) was an innovative approach in this field. In psychology, various studies have already been published (e.g., Bollen & Curran, 2005; Duncan et al., 2006; Little, 2013; Mirman, 2014; Singer & Willett, 2003; Verbeke & Molenberghs, 2009), and many of the techniques described can be applied to a test data analysis. A longitudinal data analysis is generally an extension of a single-shot data analysis, the latter being the basis of the former. However, current methods for analyzing longitudinal data are an extension of an extremely simple singleshot data analysis method, whereas the methods introduced in this book are complex and advanced; thus, they cannot be easily extended and applied to longitudinal data, which would merit further studies.
7.5 Model Fit This section describes the model-fit evaluation for biclustering (and ranklustering). Because this analysis was carried out through ranklustering, the model fit for the ranklustering is described, although the only difference between them is whether the ˆ R) ˆ C ) or rank membership matrix estimate ( M class membership matrix estimate ( M is used. First, the present model employed for analyzing the data is referred to as the with (F, R) = (5, 6) specified in analysis model. In this case, ranklustering Estimation Setting of Biclustering (p. 283) is applied. The benchmark and null models should then be specified. The former and latter are models that fit better and worse than the analysis model, respectively. Figure 7.18 shows an image of the model-fit evaluation. The fit of the analysis model is not evaluated in an absolute sense (individually), but is achieved by comparing it with those of the better-fitting benchmark and worse-fitting null models. The figure compares the two analysis models to the benchmark and null models, and indicates that Analysis Model 2 has a better fit than Analysis Model 1 because the fit of Analysis Model 2 is closer to that of the benchmark model.
304
7 Biclustering
Fig. 7.18 Model-Fit Evaluation
7.5.1 Benchmark Model The benchmark model is first introduced here and is a model that fits the data (by far) better than the analysis model; in SEM, the model that plays this role is referred to as the saturated model, the degree of freedom (DF) of which is 0. The DF in the SEM is the same as the number of constraints; thus, the saturated model is a model with no constraints and can be called a structureless model. This is because the structure is a combination of constraints. In other words, the saturated model is not originally called a model. In this book, the framework for evaluating the fitness follows the SEM approach; however, instead of using the saturated model with 0 DF, a benchmark model that fits the data extremely well is employed and is neither unstructured nor unconstrained. The saturated model used in the test data analysis is described in Saturated Model in Test Data (p. 308). As in the other chapters, the benchmark model in this chapter is also the multigroup model with the number of groups being the number of NRS patterns.21 If the multigroup model is rephrased according to the context of this chapter, it is a bipartitioning model. The biclustering/ranklustering is a method of simultaneous clustering of students into unknown classes/ranks and of items into unknown fields, whereas bipartitioning is a model that simultaneously partitions students into known groups, as well as items. The details of this are as follows: Bipartitioning as the Benchmark Model for Biclustering and Ranklustering 1. Partitions S students into G groups, where G is the number of observed number-right score (NRS) patterns. 2. Partitions J individual items into J groups (e.g., no item is grouped). As described in Point 1, students are partitioned into groups to have the same NRS, where the number of groups is the number of observed NRS patterns, denoted as G. When the number of items is J , the maximum number of NRS patterns is J + 1 (= 0, 1, · · · , J ), but not all patterns occur. 21
By employing common benchmark and null models, the model fit between different analysis models becomes comparable (see Common Benchmark and Null Models , p. 244). In Sect. 6.8 (p. 240), 2PLM and 3PLM of IRT, LCA (C = 5), and LRA (R = 6) were compared.
7.5 Model Fit
305
Fig. 7.19 Bipartitioning using benchmark, null, and analysis models
In addition, Point 2 means that the J items are partitioned into J groups. In other words, J items are partitioned but not grouped. In IRT, LCA, and LRA, individual items are not all grouped together with the other item(s), and thus it is unnecessary to mention this point; however, treating each item individually is equivalent to partitioning J items into J groups. Figure 7.19 (left) shows an image of the benchmark model (bipartitioning). Students are divided into G (the number of NRS patterns observed) groups, and items are partitioned (not grouped) into J groups (one item in one group). Because the number of cells created by this bipartition is J G and one cell has one CRR, the number of parameters for the benchmark model is J G. Let the group membership of Student s be defined by m sg =
1, if the number-right score of the student is g . 0, otherwise
Student s surely belongs to a specific group, and the sum of the memberships is G
m sg = 1.
g=1
In addition, the matrix collecting the m sg s for all students and groups, ⎤ m 11 · · · m 1G ⎥ ⎢ M G = ⎣ ... . . . ... ⎦ = {m sg } (S × G), m S1 · · · m SG ⎡
(7.10)
306
7 Biclustering
is called the group membership matrix. The CRR of the students belonging to Group g for Item j, p jg , is thus calculated by
S p jg =
s=1 m sg z s j u s j
S s=1 m sg z s j
=
mg (z j ⊗ u j ) mg z j
,
where m g is the g-th column vector of the group membership matrix (M G ). Moreover, the matrix of all p jg s, ⎤ p11 · · · p1G ⎥ ⎢ P G = ⎣ ... . . . ... ⎦ = { p jg } (J × G), pJ1 · · · pJ G ⎡
is the group reference matrix. The number of parameters for the benchmark model is the number of cells in Fig. 7.19 (left), as described above, and is simply the number of elements in the group reference matrix, J G. From the group reference matrix ( P G ) and group membership matrix (M G ), the likelihood of the benchmark model can be calculated. The likelihood indicates the basis for evaluating the model fit to the data and is identical to the occurrence probability of the data under the condition that the model and its parameters are given. In this case, the likelihood is the occurrence probability of data U under the bipartitioning model whose parameter is group reference matrix P G (and group membership matrix M G ). That is, l B (U| P G ) =
G S J
u
p jgs j (1 − p jg )1−u s j
m sg zs j
.
s=1 j=1 g=1
This equation has no unknown parameters; thus, it is obtained as a real number and falls within [0, 1] because the likelihood is a probability. The closer the likelihood is to 1, the more valid the values of the parameters are. The log-likelihood is then represented by ell B (U| P G ) =
S J G
m sg z s j {u s j ln p jg + (1 − u s j ) ln(1 − p jg )}.
(7.11)
s=1 j=1 g=1
The likelihood is the product of all likelihoods of individual data, and the loglikelihood is thus the sum of all of their log-likelihoods. In addition, from ln 0 = −∞ and ln 1 = 0, the range of the log-likelihood is specified as (−∞, 0].
7.5 Model Fit
307
The LHS of this equation is denoted as ell (expected log-likelihood) in stead of ll (log-likelihood) because, extracting the log-likelihood regarding Student s, we have ell B (us | P G ) =
G
m sg
g=1
=
G
J
z s j {u s j ln p jg + (1 − u s j ) ln(1 − p jg )}
j=1
m sg X sg
g=1
= E[X s ]. That is, X sg is averaged (or is the expectation) with respect to g.22 Thus, Eq. (7.11) is the sum of the ELLs of each student, as follows: ˆ B) = ell B (U|
S
ˆ B ). ell B (u g |
(7.12)
s=1
A summary of the benchmark model is listed in the following box. Summary of the Benchmark Model Student Partitioning Partition S students into G groups. (G is the number of observed NRS patterns) Item Partitioning Partition J items into J groups (each group has one item). Expected Log-Likelihood S J G ell B (U| P G ) = m sg z s j {u s j ln p jg + (1 − u s j ) ln(1 − p jg )} s=1 j=1 g=1
Number of Parameters J G, or the number of elements in the group reference matrix ( P G )
3 When the probability distribution of {X 1 , X 2 , X 3 } is { p1 , p2 , p3 } (s.t. i=1 pi = 1), the
3 expectation is then calculated by X¯ = i=1 pi X i = E[X ], where E[·] represents the expectation of the argument. The group membership profile is a discrete probability distribution (see Requirements for Discrete Probability Distribution , p. 176). This equation can thus be regarded as the expectation. 22
308
7 Biclustering
Saturated Model in Test Data The benchmark model in this section fits the data better than the analysis model, but is not a saturated model perfectly fitting the data. That is, the likelihood/log-likelihood of the saturated model is 1/0. For the test data, the perfect fit model has a CRR for each (student × item). Let the CRR of Student s for Item j be denoted as ps j ; then ps j = u s j because there is only one response for calculating ps j . In this case, the log-likelihood becomes ll(U| P = { ps j = u s j }) =
S J
z s j u s j ln u s j + (1 − u s j ) ln(1 − u s j ) = 0,
s=1 j=1
where 0 ln 0 = 0 × (−∞) = 0, and the number of parameters of the saturated model is S J , which is equal to the number of data. The term “saturated” indicates that as many parameters as the number of data are used in the model. This model is therefore so structureless that it can no longer be called a model. In addition, the fit of this model is too good be benchmarked.
7.5.2 Null Model The null model is a single-group model that has only one CRR for each item and is identical to the null models used in other chapters. Meanwhile, rephrasing it according to the framework of this chapter, the model is expressed as a bipartitioning model in the following box. Bipartitioning as Null Model for Biclustering and Ranklustering 1. Partitions S students into one group. 2. Partitions J items into J groups (each group has one item). The state of this bipartitioning is illustrated in Fig. 7.19 (center) (p. 305). The null model partitions all students into a single group, which implies, in other words, that no students are classified. It also partitions J items into J groups, which means that there is one item in each group. This item partitioning is the same as that of the benchmark model. With the bipartitioning of the null model, the data matrix is divided into J (= 1 × J ) cells. Figure 7.19 thus illustrates that the difference is with respect to the way of partitioning the students.
7.5 Model Fit
309
Accordingly, the design of the benchmark model when the number of groups is set to one becomes identical to that of the null model. When there is one group, the group reference matrix of the benchmark model, P G (J × G), reduces to p as follows:
Group Reference Matrix
Benchmark Model Null Model P G (J × G) =⇒ p (J × 1)
where ⎡
⎤ p1 ⎢ ⎥ p = ⎣ ... ⎦ = { p j } (J × 1) pJ is the CRR vector. That is, the group reference matrix of the null model is the item CRR vector. Moreover, when there is one group, the group membership matrix reduces to 1 S as follows:
Group Membership Matrix
Benchmark Model Null Model . M G (S × G) =⇒ m G = 1 S (S × 1)
That is, the group membership matrix of the null model becomes a vector in which all S elements are 1 because each student belongs to Group 1 with a 100% membership when there is one group. Consequently, the likelihood of the null model becomes l N (U| p) =
S J 1
u
p j s j (1 − p j )1−u s j
zs j m sc
,
s=1 j=1 c=1
and the log-likelihood is obtained as ell N (U| p) =
S J 1
m sc z s j u s j ln p j + (1 − u s j ) ln(1 − p j )
s=1 j=1 c=1
=
S J
1×z s j u s j ln p j + (1 − u s j ) ln(1 − p j )
s=1 j=1
=
S J
z s j u s j ln p j + (1 − u s j ) ln(1 − p j ) .
(7.13)
s=1 j=1
This log-likelihood can also be said to be ELL. This ELL is computed to be a real number falling within the range of (−∞, 0] because it has no unknown parameters.
310
7 Biclustering
Furthermore, twice the difference between the two models is the χ 2 value. The χ value of the null model (with respect to the benchmark model) is obtained by 2
χ N2 = 2 ell B (U| P G ) − ell N (U| p) .
(7.14)
Although both ell B and ell N are negative, ell N has a larger absolute value because the null model is the worse fitting; thus, χ N2 is obtained as positive. In addition, the DF is the difference in the number of parameters between the benchmark and null models, which is d f N =(N of Parameters for BM) − (N of Parameters for NM) =J (G − 1). The summary for the null model is listed in the following box. Summary of Null Model Student Partitioning Partitions S students into one group (the students are not classified) Item Partitioning Partitions J items into J groups (each group has one item) Expected Log-Likelihood S J
z s j u s j ln p j + (1 − u s j ) ln(1 − p j ) ell N (U| p) = s=1 j=1
Number of Parameters J , or the number of elements in the item CRR vector ( p) Chi-Square
χ N2 = 2 ell B (U| P G ) − ell N (U| p) Degrees of Freedom d f N = J (G − 1)
(7.15)
7.5 Model Fit
311
Worse Fit Model than the Null Model We can consider a null model that fits even worse than the null model introduced in this section. This model considers S students as one group and J items as another group (Fig. 7.20). In this model, the only parameter is
S p=
J
s=1
S s=1
j=1 z s j u s j
=
J
j=1 z s j
1S (Z ⊗ U)1 J . 1S Z1 J
This model attempts to explain the behaviors of all S J data based on only one parameter; thus, the fit of this model is extremely poor. The likelihood and log-likelihood of this model are l(U| p) =
S J
p u s j (1 − p)1−u s j
z sj
,
s=1 j=1
ll(U| P = { ps j = u s j }) =
S J
z s j u s j ln p + (1 − u s j ) ln(1 − p) ,
s=1 j=1
respectively. It is up to the analyst to determine what model is employed as the null model (the specifications of the benchmark model also rest with the analyst). If this model is used as the null model, the model-fit indices that incorporate the ELL of the null model will improve because the ELL of the analysis model becomes relatively closer to that of the benchmark model.
Fig. 7.20 Single student group × single item group model
312
7 Biclustering
7.5.3 Analysis Model This section describes the fitness of the analysis model, which is the ranklustering of grading S students into R ranks and clustering J items into F fields (Fig. 7.19, right, p. 305). The likelihood, in this case, is the occurrence probability of the data (U) ˆ B ) (and the field under the parameter estimate of the bicluster reference matrix ( ˆ R ). That is, ˆ F and M and rank membership matrices, M ˆ B) = l A (U|
S J F R us j
z mˆ mˆ πˆ f r (1 − πˆ f r )1−u s j s j j f sr ,
(7.16)
s=1 j=1 f =1 r =1
The log-likelihood is thus ˆ B) = ell A (U|
S J F R
z s j mˆ j f mˆ sr u s j ln πˆ f r + (1 − u s j ) ln(1 − πˆ f r ) .
s=1 j=1 f =1 r =1
(7.17) This log-likelihood is ELL because the expectation is taken. The expected value is the mean of a random variable. Here, the expectation is taken with respect to two distributions: the rank and field membership profiles of the students and items, respectively. The former is a probability distribution of the membership of each student to its respective ranks. The term regarding Student s in the above equation is ˆ B) = ell A (us |
R
⎡ mˆ sr ⎣
r =1
=
R
F J
z s j mˆ j f u s j ln πˆ f r
⎤
+ (1 − u s j ) ln(1 − πˆ f r ) ⎦
j=1 f =1
mˆ sr X sr
r =1
= E[X s ], which means that the term is the expected value of X sr with respect to r . Accordingly, Eq. (7.17) can be considered the sum of the ELLs of all students, as follows: ˆ B) = ell A (U|
S s=1
ˆ B) ell(us |
(7.18)
7.5 Model Fit
313
Fig. 7.21 χ 2 Values of the analysisl and null models
In addition, the latter distribution is the field membership profile of each item, which is a probability distribution of the membership of each item to its respective fields. Extracting the terms regarding Item j, Eq. (7.17) can be rewritten as follows: ˆ B) = ell A (u j |
F
mˆ j f
F
z s j mˆ sr u s j ln πˆ f r + (1 − u s j ) ln(1 − πˆ f r )
s=1 r =1
f =1
=
S R
mˆ j f X j f
f =1
= E[X j ], which indicates that the ELL of Item j is the expectation of X j f with respect to f ; thus, Eq. (7.17) is also the sum of the ELLs of all items as follows: ˆ B) = ell A (U|
S s=1
ˆ B) = ell A (us |
J
ˆ B ). ell A (u j |
j=1
Therefore, the log-likelihood of the analysis model is an ELL. Note that the fit of a ˆ B ).23 single item can be evaluated using the ELL of the item, i.e., ell A (u j | 2 The next step is to evaluate the χ value of the analysis model. As described in the previous section, the value of the null model is twice the difference between the ELLs of the benchmark and null models (Fig. 7.21). Likewise, the χ 2 value of the analysis model is twice the difference between the ELLs of the benchmark and analysis models. That is,
ˆ B) . χ A2 = 2 ell B (U| P G ) − ell A (U|
23
However, it is unnecessary to conduct the evaluation per item under a biclustering or ranklustering analysis.
314
7 Biclustering
Although both ELLs of the benchmark and analysis models are negative, the ELL of the former is closer to 0, and we therefore have ˆ B ) < ell B (U| P G ) < 0 ell A (U|
∴
ˆ B ) > 0. ell B (U| P G ) − ell A (U|
Thus, χ A2 > 0. This value measures the distance between the benchmark and analysis models. The smaller the value of χ A2 , the closer the ELL of the analysis model is to that of the benchmark model, which means the fit of the analysis model is good. Conversely, when χ A2 is large, the ELL of the analysis model is far from that of the benchmark model, which implies that the analysis model is a poor fit for the data.
7.5.4 Effective Degrees of Freedom of Analysis Model The DF of the analysis model is calculated using d f A = (N of parameters for BM) − (N of parameters for AM). In biclustering, although the second term in the RHS is C F (number of classes × number of fields) because the data matrix is partitioned into C F cells, it is not R F (number of ranks × number of fields) in the ranklustering, even if the data matrix is partitioned into R F cells. In ranklustering, the rank membership profiles are smoothed in Command [2-1.5] of Ranklustering Estimation Procedure (p. 275) such that neighboring ranks are mutually related, which leads to a smoothness of the FRPs. Thus, the R values in the FRP of each field are not estimated independently, but are obtained as a result of being mutually restricted. Therefore, it is an overestimation to consider that the number of parameters in each field is R. The effective degrees of freedom (EDF; Hastie, Tibshirani, & Friedman, 2001) are used in such cases.24 When RMPs are smoothed,25 the framework of the EDF quantifies the number of parameters for the analysis model not as R but as N of parameters for AM = (N of fields) × (N of parameters per field) = Ftr F.
24
(7.19)
See also Sect. 6.8.3 (p. 246). In biclustering (student clustering × item clustering), the EDF is not necessary because the class membership profiles are not smoothed.
25
7.5 Model Fit
315
Accordingly, the EDF of the analysis model is specified as d f A =(N of Parameters for BM) − (N of Parameters for AM) =J G − Ftr F,
(7.20)
where F is the filter used for smoothing. For example, when there are six ranks, suppose the following kernel ⎡ ⎤ 1/3 f = ⎣1/3⎦ 1/3 is used, which is a kernel applied in the simple moving average. Then, the filter becomes ⎤ ⎡ 0.500 0.333 ⎥ ⎢0.500 0.333 0.333 ⎥ ⎢ ⎥ ⎢ 0.333 0.333 0.333 ⎥. ⎢ F=⎢ ⎥ 0.333 0.333 0.333 ⎥ ⎢ ⎣ 0.333 0.333 0.500⎦ 0.333 0.500 Thus, the trace of the filter is obtained as tr F = 0.500 + 0.333 + · · · + 0.333 + 0.500 = 2.333. Although there are six ranks, because the six FRP values are not obtained independently, the number of parameters is estimated as 2.333. For this reason, the number of parameters for the analysis model is given as Ftr F = 2.333F (< 6F). The number of parameters generally represents the flexibility of a model. A model with a larger number of parameters better fits the data. The benchmark model thus fits better than the analysis and null models because it has many different parameters. When using a smoother (i.e., flatter) filter, the stronger the relationships that occur between adjacent ranks, the less flexible the model is, and the smaller the number of parameters that are estimated.
316
7 Biclustering
In addition, the smaller the number of parameters that are applied, from Eq. (7.20), the larger the DF. The DF generally refers to the degree of constraints in a model. In other words, a model with a larger DF has a smaller number of parameters. In ranklustering, such a model has smaller numbers of ranks and fields, or is a model using a flat (i.e., strongly smooth) filter. That is, the larger the DF, the less flexible and more constrained the model is. The least flexible and most constrained model among the benchmark, analysis, and null models is the latter, which assumes S students as one group without considering A summary of the the differences in their abilities. analysis model is listed in Summary of Analysis Model . Summary of the Analysis Model (Biclustering and Ranklustering) Student Partitioning (B) Clusters S students into C classes (R) Grades S students into R ranks Item Partitioning (the same in both models) (B, R) Clusters J items into F fields Expected Log-Likelihood F C S J ˆ B) = (B) ell A (U| z s j mˆ j f mˆ sc u s j ln πˆ f c + (1 − u s j ) ln(1 − πˆ f c )
ˆ B) = (R) ell A (U|
s=1 j=1 f =1 c=1 S J F R
z s j mˆ j f mˆ sr u s j ln πˆ f r + (1 −
s=1 j=1 f =1 r =1 u s j ) ln(1 − πˆ f r ) Number of Parameters (B) C F, or the number of elements in bicluster reference matrix B (R) Ftr F, or (the number of fields) × (the trace of filter) Chi-Square (the same in both models)
ˆ B) (B, R) χ A2 = 2 ell B (U| P G ) − ell A (U| (Effective) Degrees of Freedom (B) d f A = J G − C F (R) d f A = J G − Ftr F
7.5 Model Fit
317
N of Parameters When Not Smoothed When the number of ranks is R, a filter which does not smooth at all is considered the identity matrix with size R, denoted as I R . That is, F = I R. The identity matrix is a square matrix where all diagonal elements are 1 and off-diagonal elements are 0 (see Inverse Matrix and Determinant , p.62). Using this filter, the smoothed rank membership matrix at the t-th cycle, S(t) , is not smoothed as follows: (t) S(t) = M (t) R IR = MR .
When the rank membership matrix is not smoothed, the FRPs are also not smoothed, and the R values in each FRP thus remain unrelated and independent.26 In this case, the number of parameters in each FRP should be R, and Eq. (7.19) estimates the number of parameters as N of parameters of AM = (N of fields) × (N of parameters per field) = F × tr I R = F × R. That is, using tr F as the number of parameters in each FRP has sufficient generality to be applied to unsmoothed models (i.e., biclustering).
7.5.5 Section Summary Thus Far Table 7.7 shows the ELLs, χ 2 values, and DFs of the benchmark, null, and analysis models. First, the relationship between the ELLs of the three models was obtained as ell N (−9862.11) < ell A (−7273.60) < ell B (−5891.31) < 0.
26
The ranklustering using the identity matrix as a filter is identical to biclustering.
(7.21)
318
7 Biclustering
The fit of the benchmark model in which ELL is closest to 0 is the best, whereas that of the null model in which ELL is the most distant from 0 is the worst.27 From these ELLs, the χ 2 values of the null and analysis models were obtained as χ N2 = 2 × {−5891.31 − (−9862.11)} = 7941.60, χ A2 = 2 × {−5891.31 − (−7273.60)} = 2763.50. From these values, it was found that the ratio of the distance between the analysis and benchmark models (2763.50) to the distance between the null and benchmark models (7941.60) is 0.348 (= 2763.50/7941.60),28 which means that the analysis model is located closer to the benchmark model than half the distance between the null and benchmark models. In addition, the number of parameters of the benchmark model was J G = 35 × 34 = 1190. The number of observed NRS patterns G was 34. This is because, although this test has 35 items and a maximum of 36 NRS patterns was 36 {0, 1, · · · , 35}, there are no students whose NRS was 32 or 33. Moreover, the number of parameters of the null model was J = 35, and the DF of the model was d f N = J G − J = 1190 − 35 = 1155. Furthermore, the number of parameters of the analysis model was 23.84. As shown in Estimation Setting of Biclustering (p. 283), the kernel of the filter for the analysis model was ⎡ ⎤ 0.12 f = ⎣0.78⎦ . 0.12 The filter is then given as ⎤ 0.864 0.120 ⎥ ⎢0.136 0.760 0.120 ⎥ ⎢ ⎥ ⎢ 0.120 0.760 0.120 ⎥, F=⎢ ⎥ ⎢ 0.120 0.760 0.120 ⎥ ⎢ ⎣ 0.120 0.760 0.136⎦ 0.120 0.864 ⎡
27
Note that the ELL of an analysis model happens to be closer to 0 if the analysis model fits extremely well, as shown in Sect. 5.6 (p.185). 28 Subtracting this rate from 1 gives the value of the normed fit index (NFI; Bentler & Benett, 1980).
7.5 Model Fit
319
Table 7.7 Expected log-likelihood, χ 2 and DF ELL∗1 NP∗2 χ2 DF Benchmark Model −5891.31 1190 Null Model −9862.11 35 7941.60 1155 Analysis Model −7273.60 23.84 2763.50 1166.16 ∗1 expected
log-likelihood, ∗2 number of parameters
and the trace of this filter is obtained as29 tr F = 0.864 + 0.760 + · · · + 0.760 + 0.864 = 4.767. This indicates that there are six ranks; however, this number is overestimated because the six FRP values in each field are not independently obtained. Therefore, the number of parameters per field is evaluated as 4.767. Accordingly, the number of parameters for the analysis model was evaluated as Ftr F = 5 × 4.767 = 23.836, and the EDF of the analysis model was then given as ed f A = J G − Ftr F = 35 × 34 − 23.84 = 1166.16. In general, the numbers of parameters of the benchmark, null, and analysis models show the following relationship: NP of NM < NP of AM < NP of BM. The benchmark model is the most flexible, having the largest number of parameters, and its ELL is the closest to 0. Meanwhile, the null model is the least flexible (i.e., the most parsimonious), having the smallest number of parameters, and its ELL is the farthest from 0. However, as shown in Table 7.7, although the numbers of parameters of the three models show the following inequality, NP of AM (23.84) < NP of NM (35) < NP of BM (1190),
In Sect. 7.5.4, the number of parameters was 2.333 under the flatter kernel f = [1/3 1/3 1/3] . The flatter the filter that is used, the smaller the number of parameters per field that are evaluated.
29
320
7 Biclustering
the ELL of the null model is the worst, as indicated in Eq. (7.21). This is an unusual case because the ELL of the analysis model is better than that of the null model, despite the number of parameters of the former being smaller. This is a matter of parametrization (see Parametrization , below). A well parametrized model achieves a good fit even when the number of parameters is small. Parametrization Parametrization indicates the types of parameters specified and their placement in a model. The fit of a model is not simply improved by increasing the number of parameters. It can be enhanced only when valid parameters are specified and each of them is placed in the right place within the model. Figure 7.19 (p. 305) shows that the null model places one parameter per item, which means that it attempts to explain the behavior of S data of an item by only one parameter. This is not an efficient parametrization. In comparison, bilustering (ranklustering) assigns one parameter for each bicluster,56 and similar data are collected in each bicluster. Thus, the parameters in biclustering are effectively placed (i.e., parametrized). Consequently, the ELL of the analysis model is better than that of the null model, despite the number of parameters being smaller for the former.
7.5.6 Model-Fit Indices Fit indices are calculated from the χ 2 values and DFs, and are classified into two categories: standardized fit indices and information criteria. The standardized indices commonly used in SEM are listed in Standardized Fit Indices . Each has a bounded interval.31 However, the range of the RMSEA is a left-bounded interval.32
56 In ranklustering, from Eq. (7.19) (p. 314), the number of parameters is reduced according to the filter smoothness. 31 [a, b], [a, b), (a, b), (a, b) are bounded intervals. 32 [a, ∞) and (a, ∞) are left-bounded intervals (or right-unbounded intervals). Similarly, (∞, a] and (∞, a) are right-bounded intervals (or left-unbounded intervals).
7.5 Model Fit
321
Standardized Fit Indices Normed Fit Index (NFI; Bentler & Bonett, 1980)∗1 2 χ (∈ [0, 1]) N F I = 1 − cl 2A χN Relative Fit Index (RFI; Bollen, 1986)∗1 2 χ /d f A R F I = 1 − cl 2A (∈ [0, 1]) χ N /d f N Incremental Fit Index (IFI; Bollen, 1989a)∗1 2 χ − d fA (∈ [0, 1]) I F I = 1 − cl 2A χN − d f A Tucker-Lewis Index (TLI; Bollen, 1989a)∗1 2 χ /d f A − 1 (∈ [0, 1]) T L I = 1 − cl 2A χ N /d f N − 1 Comparative Fit Index (CFI; Bentler, 1990)∗1 2 χ − d fA (∈ [0, 1]) C F I = 1 − cl 2A χN − d f N Root Mean Square Error of Approximation (RMSEA; Browne & Cudeck, 1993)∗2 ! max(0, χ A2 − d f A ) (∈ [0, ∞)) RMSE A = d f A (S − 1) Apply EDF instead of DF when ranklustering is used. ∗1 Larger values close to 1.0 indicate a better fit. ∗2 Smaller values closet to 0.0 indicate a better fit.
Each of the indices has a different concept, and those of the NFI, RFI, and CFI are described in this section. From Standardized Fit Indices , the NFI can be simply expressed as33 NFI = 1 −
χ2 − χ2 χ A2 = N 2 A. 2 χN χN
(7.22)
The structure of this equation is illustrated in Fig. 7.22 (top right), where χ N2 and χ A2 , from their definitions, can be rephrased as34 33 34
The clip function is used to keep the argument within [0, 1], even if χ A2 is abnormal (negative). In fact, the difference is multiplied by 2, but is omitted here to simplify the explanation.
322
7 Biclustering
Fig. 7.22 Concept of NFI, RFI, and CFI
χ N2 = (Fit of BM) − (Fit of NM), χ A2 = (Fit of BM) − (Fit of AM). Thus, the numerator of Eq. (7.22) is given as χ N2 − χ A2 = {(Fit of BM) − (Fit of NM)} − {(Fit of BM) − (Fit of AM)} = (Fit of AM) − (Fit of NM). Accordingly, the NFI can be represented as NFI =
Fit AM ↔ NM (Fit of AM) − (Fit of NM) = . (Fit of BM) − (Fit of NM) Fit BM ↔ NM
In other words, the NFI is an index for measuring how large the difference in fit between the analysis and null models (Fit AM↔NM) is compared to that between the benchmark and null models (Fit BM↔NM). The closer the fit of the analysis model is to that of the benchmark model, the closer the NFI is to 1. Next, it can be seen from Fig. 7.22 (bottom right) that the CFI is a ratio of the efficiencies of the differences between the analysis and benchmark models. Rephrasing the definition of the CFI, this is written as (χ N2 − d f N ) − (χ A2 − d f A ) χ A2 − d f A = χ N2 − d f N χ N2 − d f N {(Fit of AM) − (Fit of NM)} − {(NP of AM) − (NP of NM)} = {(Fit of BM) − (Fit of NM)} − {(NP of BM) − (NP of NM)} (Fit AM ↔ NM) − (NP AM ↔ NM) , (7.23) = (Fit BM ↔ NM) − (NP BM ↔ NM)
CFI = 1 −
7.5 Model Fit
323
where DFs are the difference in the number of parameters between the two models, and are represented as d f N = (NP of BM) − (NP of NM), d f A = (NP of BM) − (NP of AM). In the numerator of Eq. (7.23), (Fit AM↔NM) is the difference in the ELL between the analysis and null models. The larger the value, the better the fit of the analysis model compared to the null model. Meanwhile, (NP AM↔NM) in the numerator is the difference in the number of parameters between these two models. The larger this value is, the more parameters the analysis model uses than the null model. In other words, (NP AM↔NM) represents the degree of wastefulness of the parameters compared to the null model. Accordingly, the entire numerator of Eq. (7.23) represents the difference of the model fitness and parameter wastefulness between the analysis and null models. In other words, the numerator can be said to be the efficiency of the analysis model based on the null model. That is, (Fit AM ↔ NM) − (NP AM ↔ NM)=Eff NM → AM. Similarly, the denominator of Eq. (7.23) can be represented as (Fit BM ↔ NM) − (NP BM ↔ NM)=Eff NM → BM. The CFI can thus be rephrased as CFI =
Eff NM → AM . Eff NM → BM
Consequently, the CFI compares the efficiency of the analysis model (based on the null model) with that of the benchmark model (based on the null model). Finally, the definition of the RFI (Fig. 7.22, bottom left) is RF I = 1 −
χ A2 /d f A . χ N2 /d f N
In the RHS numerator, χ A2 and d f A can be rephrased as (Fit AM↔BM) and (NP AM↔BM), respectively. The numerator can be represented as χ A2 Fit AM ↔ BM = Ineff AM/BM. = d fA NP AM ↔ BM That is, this fraction evaluates the efficiency of the analysis model as the ratio of the fitness to the number of parameters compared to the benchmark model. However, this efficiency represents an “inefficiency” because the larger the fraction is, the worse the fit and the smaller the number of parameters. In addition, the denominator is
324
7 Biclustering
rephrasable as χ N2 Fit NM ↔ BM = = Ineff NM/BM, d fN NP NM ↔ BM which evaluates the inefficiency of the null model compared to the benchmark model. In general, the analysis model is more efficiently parametrized than the null model, that is, 0 < Ineff AM/BM < Ineff NM/BM. We thus have the following ratio: 0
k (k + 1, · · · , C ∗ ) are renamed as k, · · · , C ∗ − 1, and the label of the new class is C ∗ . The second fraction represents the likelihood ratio, where B(·, ·) is the beta function, and β0 and β1 are the hyperparameters of the beta density employed as the prior for the bicluster reference matrix. However, the posterior density of the bicluster reference matrix (beta density) has already been marginalized, leaving only the beta function. For this reason, this method is called the collapsed Gibbs sampling (Liu, 1994). Referring to Eqs. (7.5) and (7.6) (p. 271), the arguments of the beta function are U˜ 1 f c =
S J
m˜ sc m˜ j f z s j u s j ,
s=1 j=1
U˜ 0 f c =
S J
m˜ sc m˜ j f z s j (1 − u s j ),
s=1 j=1
where U˜ 1 f c and U˜ 0 f c are the numbers of correct and incorrect responses of Class c students for Field f items. Note that∗ the data of Student s ∗ are included in these (−s ∗ ) ) ∗ ˜ numbers. Meanwhile, U0 f c and U˜ 1(−s f c do not include the data of Student s . That is, ∗
) U˜ 1(−s fc =
S J
m˜ sc m˜ j f z s j u s j ,
s (=s ∗ ) j=1 ∗
) U˜ 0(−s fc =
S J
m˜ sc m˜ j f z s j (1 − u s j ).
s (=s ∗ ) j=1
The second fraction for (existing) Class c thus represents the ratio of the likelihood for the class when it includes Student s ∗ to that when it does not. Accordingly, if Student s ∗ is highly likely to belong to Class c, the likelihood ratio (i.e., second fraction) becomes > 1. Conversely, if the student is less likely to belong to the class, the ratio is < 1. Consequently, the possibility that Student s ∗ will be seated at Table c is simply rephrased as follows: P of sitting at Table c = (Liveliness of Table c) × (Belongingness to Table c). ∗
∗
∗
In addition, for the new table (i.e., Table C ∗ + 1), U˜ 0(sf c) (= U˜ 0(sf C) ∗ +1 ) and U˜ 1(sf c) ∗ (= U˜ 1(sf C) ∗ +1 ) are the numbers of correct and incorrect responses for Field f items of only Student s ∗ . That is,
7.8 Infinite Relational Model
343 ∗
U˜ 1(sf C) ∗ +1 =
J
m˜ j f z s ∗ j u s ∗ j ,
j=1 ∗
U˜ 0(sf C) ∗ +1 =
J
m˜ j f z s ∗ j (1 − u s ∗ j ).
j=1
Thus, the second fraction for the new table is the ratio of the likelihood of the data if Student s ∗ belongs to the new class to the likelihood if the student does not belong to the new class (i.e., the prior probability). Therefore, the possibility that the student will be seated at the new table is also expressible as (Liveliness × Belongingness). Recall that the binary class memberships of Students s ∗ are updated after the class of the student is updated. In this way, when the classes of all S students are updated in random order, the update of the latent class vector in Cycle t is completed.
7.8.5 Gibbs Sampling for Latent Field Next, the latent field vector is updated from l (t−1) to l (t) F F . The fields of the items are updated in random order. When updating the field of item j ∗ , if the original field of the item (e.g., Field h) has only that item, then Field h vanishes, the labels of fields h + 1, h + 2, · · · are renamed as Field h, h + 1, · · · , and the h-th column of the ˜ F is removed. binary field membership matrix M Suppose that the current number of fields is F ∗ . The field of Item j ∗ is chosen among the existing fields and Field F ∗ + 1 (i.e., new field) by rolling the following dice: p j ∗1 p j ∗2 p j∗ F∗ p j ∗ F ∗ +1 , l F(t)j ∗ ∼ D F ∗ +1 , F ∗ +1 , · · · , F ∗ +1 , F ∗ +1 f =1 p j ∗ f f =1 p j ∗ f f =1 p j ∗ f f =1 p j ∗ f where (t)
(t,− j ∗ )
p j ∗ f = P(l F j ∗ = f |l F , U, γ F , β0 , β1 ) ⎧ C (− j ∗ ) ˜ ˜ Jf ⎪ ⎪ c=1 B(U0 f c + β0 , U1 f c + β1 ) ⎪ , existing field f ∈ N F ∗ ⎪ ∗) ⎨ J − 1 + γ F C (− j (− j ∗ ) B(U˜ 0 f c + β0 , U˜ 1 f c + β1 ) c=1 ∝ . ∗ ∗ C ⎪ ˜ (j ) ˜ (j ) ⎪ γF c=1 B(U0 f c + β0 , U1 f c + β1 ) ⎪ ∗ ⎪ , new field f = F + 1 ⎩ C J − 1 + γF c=1 B(β0 , β1 )
The structure of this equation is the same as that used when updating the latent (t,− j ∗ ) classes of students. The fields of items other than Item j ∗ , l F , are fixed. In ∗ (− j ) is the number of items classified in Field f except for Item j ∗ , as addition, J f follows:
344
7 Biclustering
(− j ∗ ) Jf
=
J f − 1, if Item j ∗ ’s pre-updated field is f . otherwise Jf ,
Moreover, γ F is the attractiveness of Table F ∗ + 1 (i.e., new field), and the greater this value is, the more likely Item j ∗ is to be classified in the new field. Furthermore, respectively, ∗
(− j ) U˜ 1 f c =
S J
m˜ sc m˜ j f z s j u s j ,
s=1 j (= j ∗ ) ∗
(− j ) U˜ 0 f c =
S J
m˜ sc m˜ j f z s j (1 − u s j ),
s=1 j (= j ∗ ) ∗
∗
(j ) (j ) U˜ 1 f c = U˜ 1,F ∗ +1,c =
S
m˜ sc z s j ∗ u s j ∗ ,
s=1 ∗
∗
(j ) (j ) U˜ 0 f c = U˜ 0,F ∗ +1,c =
S
m˜ sc z s j ∗ (1 − u s j ∗ ).
s=1
After the field of Item j ∗ is updated, it is necessary to update the binary field memberships of the item as well. In this way, if the fields of all J items are updated in random order, the update of the latent field vector in Cycle t is completed.
7.8.6 Convergence and Parameter Estimation Before analyzing data using IRM, the only prior setup is to determine γC , γ F , β0 , and β1 . Of these, γC and γ F are the CRP hyperparameters. The greater γC and γ F are, the larger the final numbers of classes and fields, respectively. In addition, β0 and β1 are the hyperparameters of the prior density for the bicluster reference matrix. By iteratively updating the classes of the students and fields of the items, the value at the final cycle T is considered to be an estimate as follows: lˆC = l C(T ) , ) lˆ F = l (T F .
The class of Student s is the s-th element of lˆC , i.e., lˆCs , and the field of Item j is the j-th element of lˆ F , i.e., lˆF j . In addition, the optimal numbers of classes and fields are determined as
7.8 Infinite Relational Model
345
Table 7.14 Bicluster reference matrix (IRM) Field 1 2 3 4 5 6 7 8 9 10 11 12 TRP∗2 LCD∗3 ∗1 latent
C1 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 0.0 2
C2 .605 .023 .173 .003 .003 .240 .070 .010 .005 .106 .010 .007 4.0 103
C3 .711 .031 .308 .175 .227 .280 .120 .038 .038 .231 .008 .066 6.8 64
C4 .944 .458 .069 .187 .168 .384 .021 .042 .014 .167 .014 .042 8.1 35
C5 .893 .725 .523 .046 .137 .373 .241 .102 .080 .102 .034 .017 10.0 43
C6 .886 .758 .990 .745 .215 .372 .111 .030 .020 .190 .030 .030 12.8 49
C7 .850 .720 .222 .047 .963 .458 .197 .056 .639 .194 .097 .007 13.3 35
C8 .861 .842 .985 .970 .139 .389 .239 .779 .015 .162 .088 .022 15.5 33
C9 .909 .909 .962 .948 .935 .535 .225 .346 .038 .173 .019 .069 18.0 25
C10 .968 .989 .984 .958 .305 .637 .889 .469 .141 .219 .391 .024 20.9 31
C11 .897 .942 .971 .890 .916 .545 .388 .260 .962 .192 .067 .015 20.2 51
C12 .933 .989 .983 .989 .978 .748 .839 .967 .783 .317 .583 .195 26.8 29
C13 LFD∗1 3 1.00 1.00 3 1.00 2 3 1.00 3 1.00 1.00 5 1.00 4 2 1.00 2 1.00 1.00 2 1.00 2 4 1.00 35.0 15
field distribution, ∗2 test reference profile, ∗2 latent class distribution,
Cˆ = max lˆC , Fˆ = max lˆ F , respectively. Regarding the convergence of this repetitive procedure, the iteration is terminated when, for example, the fields of the items are invariant for several cycles,48 or when the difference of the log-likelihoods between the adjacent cycles reduces to a certain standard (see Sect. 7.2.6, p. 273). ˆ ˆ B = {πˆ c f } (Cˆ × F) Finally, the MAP estimate of the bicluster reference matrix can be obtained as = πˆ c(MAP) f
U˜ 1 f c + β1 − 1 , ˜ U0 f c + U˜ 1 f c + β0 + β1 − 2
as shown in Table 7.14. Figure 7.28 (left) shows the array plot obtained under the IRM, where the hyperparameters employed in the analysis were (γC , γ F , β0 , β1 ) = (0, 0, 1, 1). Here, both γC and γ F are constrained to 0 to avoid creating a new class and a new field, while reducing them from the initial numbers of classes and fields. For this reason, there were no fields having a single item because, when updating such a field (having only one item), the field is first removed and no new field is created because γ F = 0. Thus, the fields consisting of one item gradually disappear during the repeated cycles. ˆ C) ˆ = (12, 13), The optimal numbers of classes and fields were estimated to be ( F, where the students who scored 0 and 35 (i.e., full marks) were separately classified as Classes 1 and 13, respectively, without mixing them into the other classes; in addition, 48
The classes of students cannot be expected to remain unchanged when S is large.
346
7 Biclustering
Fig. 7.28 Array plot and fit indices of IRM
the numbers of students in the former and latter classes are S1 = 2 and S13 = 15, respectively. Moreover, the numbers of students belonging to the mid-level classes were adjusted to be no less than 20. More specifically, after the IRM convergence, the classes with fewer than 20 students were destroyed, and students were assigned to the other classes through confirmatory biclustering. As the model-fit indices in Fig. 7.28 (right) show, the fitness of this IRM model was satisfactory, and the log-likelihood was closer to zero than that of the benchmark model (ell A > ell B ), and χ A2 thus becomes negative.49 Because the IRM is an extremely good-fitting model, it would be necessary to set a better-fitting benchmark model such as the saturated model (see Saturated Model in Test Data , p.308). The 49
Because the χ 2 values must be nonnegative, χ A2 in the table should originally be written as 0.
7.8 Infinite Relational Model
347
saturated model fits the data perfectly, and the log-likelihood of the model is 0. The χ 2 value of the analysis model is then χ A2 = 2 × {0 − (−5719.01)} = 11438.01. Each index can be re-evaluated based on this χ A2 value.
7.8.7 Section Summary In this section, we described the IRM. This model is extremely useful for minutely adjusting the classification results to maintain the size of each class at more than 20 students, eliminate fields with only one item, and retain students with 0 and full markers in individual classes. However, the computation time is extremely long when the sample size is large. In addition, because random procedures are included in the estimation process, the results are slightly different each time, even with the same settings. Because the rank memberships (and field memberships) are binary, this model cannot provide detailed results, such as “the membership of Student A to Rank 4 is 60%, and that to Rank 5 is 30%.” However, given the latent field estimates and the number of classes obtained using the IRM, by applying the confirmatory biclustering/ranklustering, it is possible to obtain the usual class/rank membership profile for each student. In the case of this post-hoc confirmatory ranklustering, the number of ranks does not need to be Cˆ obtained through the IRM, but can instead be determined by the fit indices and interpretability.
7.9 Chapter Summary In this chapter, for a two-dimensional data array (student × item), biclustering, which gathers those students having a similar response pattern in the row direction and those items having a similar TF pattern in the column direction, was described. The student and item clusters are called the latent classes and latent fields, respectively. Both the classes and fields are nominal clusters. The blocks separated by the classes and fields are called biclusters. Classes created by student clustering can be ordered, and the biclustering in such a case is called ranklustering, where the ordered latent classes are the latent ranks. In biclustering (student clustering × item clustering), the optimal C and F can be determined by referring to the information criteria, whereas in ranklustering (student grading × item clustering), the optimal R and F are selected based on the information criteria among the (R, F) pairs that satisfy WOAC. Note that the optimal C/R and F chosen by the information criteria will generally be larger than those intended by
348
7 Biclustering
Fig. 7.29 Confirmatory biclustering machine (Revision of Fig. 7.3)
the analyst. In such a case, the numbers may be settled based on a comprehensive consideration through professional experience and knowledge. In addition, after conducting an exploratory analysis, it is recommended to conduct a confirmatory analysis by fixing the field membership matrix obtained during the exploratory analysis. The exploratory analysis does not always obtain an item classification that satisfies the analyst, but it will provide a hint for the desirable classification. Based on this hint, the analyst can modify the item classification and examine whether the modified classification is acceptable by a confirmatory analysis. Note that it is not a good choice to conduct a confirmatory analysis alone. If the item classification is derived from only the (top-down) knowledge of the analyst, the bottom-up information buried in the data is less likely to be extracted, which often leads to a poor fit of the model. An exploratory (bottom-up) analysis followed by a confirmatory (top-down) analysis modified by the knowledge of the analyst (or expert) would constitute a well-balanced procedure. Figure 7.29 shows confirmatory biclustering. In the confirmatory analysis, the expert knowledge is reflected in the field membership matrix. In the exploratory analysis, information on outside data is missing, and it is the knowledge of the expert that compensates for such missing information. Only when the information within the data (i.e., the data matrix) and the information outside the data (i.e., knowledge and experience) are properly integrated, does it become possible to achieve excellent analysis results. However, if the outside information is off point, this situation is identical to adding noise into the analysis results, thereby decreasing the validity of the results.
Chapter 8
Bayesian Network Model
This chapter describes the Bayesian network model (BNM; e.g., Pearl, 1988; Jensen & Nielsen, 2007; Darwiche, 2009; Koski & Noble, 2009; Ueno, 2013; Scutari & Denis, 2014), which has been widely used in many fields, including engineering, marketing, behavioral sciences, and cognitive psychology. Almond et al. (2007, 2009, 2015) applied the model to educational measurements, and Culbertson (2016) and Reichenberg (2018) provided quality reviews of its application. The BNM is a method for analyzing the network between variables (items), as shown in Fig. 8.1. In a test data analysis, this model is particularly suitable for analyzing the relationship between items, which referred to as an interitem structure or interitem network. In math and science tests, subject experts may be able to identify the structure from contents of the items without analyzing the test data. However, the structure may not be obvious even for language or social studies because the item contents of these subjects are usually intricately intertwined. Moreover, even if the structure is obvious, it is worth analyzing the data using BNM as new facts are discovered. In addition, if the knowledge of the structure was mainly developed while teaching high-ability students, the structure may not be applicable to low-ability students. In the figure, the circles represent test items, and the numbers in the circles indicate the item labels. Thus, the test had nine items in total. Furthermore, the arrows indicate the relationship between the items. Note that this book uses the term “relationship,” and although it may be tempting to use “causality” instead, the latter should not be used without further consideration. A strict set of criteria must be met to apply the term “causal relationship.” For example, for Events X and Y , Hume lists three requirements for their relationship to be qualified as a causality (Reiss, 2013). Three Requirements for Causal Relationship Developed by David Hume 1. Universal association between X and Y 2. Time precedence of Y by X 3. Spatiotemporal connection between X and Y
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 K. Shojima, Test Data Engineering, Behaviormetrics: Quantitative Approaches to Human Behavior 13, https://doi.org/10.1007/978-981-16-9986-3_8
349
350
8 Bayesian Network Model
Fig. 8.1 Bayesian network model
1
2
5
3
6
4
7
8
9
Clearly, meeting Point 3 is easy. For example, in Fig. 8.1, the relation of Item 1 → 2 is assumed. There is no doubt that these items are temporally close to each other because the items in a test are generally responded to simultaneously and are usually placed in the same booklet if the test is a paper-and-pencil test (PPT), or on the same webpage or linked webpages if it is a computer-based test (CBT). Thus, the items are spatiotemporally close. However, it is difficult to verify Point 1. Whether this point is met depends on how the term “universal” is defined: Point 1 is clearly unsatisfied if the definition is “always used in the same test.” Point 2 is also extremely difficult to fulfill. Because it is impossible to respond to two items simultaneously, the items are temporally sorted; however, not all students respond to the items in the same order. For a PPT, even if most students answer Item 2 after Item 1, some students may respond to Item 1 after Item 2. Furthermore, if a CBT is designed to allow students to return to the previous items, some students may return to Item 1 after responding to Item 2. Thus, it cannot be stated that all responses for Item 1 temporally precede those for Item 2. Consequently, a causal relationship between Items 1 and 2 cannot be qualified.
8.1 Graph The essence of BNM is representing a data structure in a graph, as shown in Fig. 8.1. The arrows express interitem relations; more specifically, the conditional correct response rates (CCRRs), as described in Sect. 2.5.2 (p.36). A BNM is a graphical representation of a CCRR structure, and the purpose of a BNM analysis is to find a graph that expresses a true CCRR structure. A graph that includes a true CCRR structure is called an independent map (I-map), and the minimum I-map is an I-
8.1 Graph
351
Fig. 8.2 Terminology of Bayesian network model
1
Source
Parents of 6
3
2
Sink
5
6
4
7
Source
8 Children of 4
Vertex, Node, or Item (Directed) Edge, Link, or Arc
9
Sink
map with a minimum number of edges. In other words, the goal of a BNM analysis is to find a minimum I-map; however, this is very difficult to achieve when the number of items is large.
8.1.1 Terminology The terminology used in this chapter is summarized in Fig. 8.2. First, each variable observed is expressed as a circle and is referred to as a vertex or node. In this book, the observed variables are test items; thus, they are referred to as “items” for clarity. In addition, each line connecting the items is called an edge, link, or arc, and a line with a direction is a directed edge/link/arc. Usually, all lines are directed in a BNM, and therefore, in this book, they are simply referred to as “edges.” For instance, in the figure, an edge from Item 1 to 2 can be seen, which is also reffered to as the out-edge of Item 1 and in-edge of Item 2. Next, a set of items whose edge links to Item j is called the parent item(s) of Item j and is denoted by pa( j). For example, in the figure, the parents of Items 6, 7, and 8 are pa(6) = {2, 3}, pa(7) = {3, 4, 6}, pa(8) = {4}, respectively. In addition, an item that does not have a single parent is called a source; thus, Items 1 and 4 are sources in the figure and are denoted as follows1 :
1
Here, ∅ indicates an empty set, which is a set with zero elements.
352
8 Bayesian Network Model
pa(1) = {} = ∅, pa(4) = {} = ∅. The number of parent items is called the indegree. For example, the indegree of Item 7 is 3, and that of Item 1 is 0. Moreover, the set of parents, parents’ parents, parents’ parents’ parents, etc., of an item are called the ancestors of the item. For example, the ancestors of Item 6 are anc(6) = {1, 2, 3}. Furthermore, the item set linked from an item is referred to as the child item(s) of the item. In the figure, the children of Items 4, 5, and 6 are ch(4) = {7, 8}, ch(5) = {} = ∅, ch(6) = {7}, respectively. Items that have no children, such as Items 5 and 9, are called sinks. The number of children is the outdegree. For example, the outdegree of Item 4 is 2, and that of Item 5 is 0. In addition, the items that can be reached from a given item are called descendants of the item. For example, the descendants of Item 3 are des(3) = {6, 7, 9}.
8.1.2 Adjacency Matrix If Item j is directly connected to Item k by an edge, Item j is adjacent to Item k. The adjacencies among items can be considered as the structure of a graph. If two graphs have the same adjacencies, the graphs are identical. Figure 8.3 shows the same graph as Fig. 8.2, with the items in different positions. Although both graphs appear quite different, they are the same because they have the same adjacencies. In general, items should be placed such that the edges intersect as little as possible for easier viewing. Moreover, it is better to arrange the items such that the directions of the edges are approximately the same (e.g., top to bottom or left to right), thereby making the graph legible. An adjacency matrix summarizes the adjacencies among the items in a graph and is denoted as A. When the number of items in a test is J , the size of the adjacency matrix A is J × J . The ( j, k)-th element of this matrix is
8.1 Graph
353
Fig. 8.3 Example of a graph with a Muddled layout
1 3 4
2
5
7
6
8
9
Table 8.1 Adjacency matrix for Fig. 8.2
→ 123 456 78 Item 1 11 11 Item 2 1 1 Item 3 Item 4 11 Item 5 1 Item 6 Item 7 Item 8 Item 9 Indegree 0 1 1 0 1 2 3 1 ∗ Blank
cells represent 0.
a jk =
9 Outdegree 2 2 2 2 0 1 1 1 1 1 0 2 11
1, if an edge exists from Item j to k . 0, otherwise
Table 8.1 shows the adjacency matrix derived from Fig. 8.2. Only one adjacency matrix can be derived from a given graph. Likewise, only one graph can be created from a given adjacency matrix. Thus, the adjacency matrix of a graph is a mathematical representation of that graph. In addition, the sum of the j-th column of an adjacency matrix is the number of parent items (i.e., indegree) of Item j. Because there are no parents for Items 1 and 4, as shown in the bottom row of the table, the columnar sums of these items are 0. The indegree vector is calculated as follows:
354
8 Bayesian Network Model
d in = A 1 J .
(8.1)
Moreover, the sum of the j-th row of the adjacency matrix is the number of child items (i.e., outdegree) of Item j. Because Items 5 and 9 have no children, the fifth and ninth cells in the rightmost column of the table are 0. The outdegree vector is computed as follows: d out = A1 J . Furthermore, the number of edges in a graph is obtained through the following: N of edges = 1J A1 J . This calculates the sum of the adjacency matrix elements, which is 11 for the graph in Fig. 8.2. A graph with a large or small number of edges is said to be dense or sparse, respectively.
8.1.3 Reachability For a sequence of adjacent vertices starting with Item j and ending with Item k, Item k is reachable from Item j, or Item j is connected to Item k. For example, in Fig. 8.2, Item 7 is reachable from Item 1. In addition, an adjacency is a connection when there is one edge. Moreover, a series of adjacent vertices (i.e., items) is called a path. There are three paths from Items 1 to 72 : Path A: 1 → 3 → 7 Path B: 1 → 2 → 6 → 7
(8.2)
Path C: 1 → 3 → 6 → 7 Conversely, Item 1 is not reachable from Item 7 because a path can be passed in one direction only (from left to right). The length of a path is expressed as the number of edges. Therefore, the lengths of Paths A, B, and C are 2, 3, and 3, respectively. The reachability between items can be examined using the power of the adjacency matrix. Table 8.2 shows the first (upper left) to the fourth (lower right) powers of the adjacency matrix. The first power of the adjacency matrix is the adjacency matrix. In the second power of the matrix ( A2 ), the (1, 5)-th element is 1, which indicates that only one path with length 2 exists from Item 1 to 5, as follows:
2
A path in which no vertices appear more than once is called a simple path, whereas a path in which one or more vertices that appear more than once are referred to as a walk. Therefore, the three paths shown in Eq. (8.2) are simple paths.
8.1 Graph
355
Table 8.2 Matrix powers of adjacency matrix A1 = A
A2 = A A
→ 123 456 789 1 11 2 11 1 1 3 4 11 5 1 6 7 1 8 1 9
→ 123 456 789 1 12 1 2 1 1 1 3 4 2 5 1 6 7 8 9
A3 = A A A
A4 = A A A A
→ 123 456 789 → 123 456 789 1 2 1 1 2 2 1 2 1 3 3 4 4 5 5 6 6 7 7 8 8 9 9 Blank cells represent 0. Each shaded cell denotes when a nonzero value first appears at that location.
1 → 2 → 5 . In addition, the (1, 6)-th element in the matrix is 2, which means that there are two paths with length 2 from Item 1 to 6, i.e., 1 → 2 → 6 , 1 → 3 → 6 . Similarly, from the third power of the adjacency matrix, there are five paths with length 3: two paths from Item 1 to 7, one path from Item 1 to 9, one path from Item 2 to 9, and one path from Item 3 to 9 as follows:
356
8 Bayesian Network Model
1 → 2 → 6 → 7 , 1 → 3 → 6 → 7 , 1 → 3 → 7 → 9 , 2 → 6 → 7 → 9 , 3 → 6 → 7 → 9 . In addition, the paths with length 4 are as follows: 1 → 2 → 6 → 7 → 9 , 1 → 3 → 6 → 7 → 9 . These two paths are the longest in this graph. There are no paths with length 5, which can be identified from the fifth power of the adjacency matrix as follows: A5 = A A A A A = O, where O is a zero matrix, wherein all entries are 0. Therefore, the sixth or higher power of the matrix becomes O. For example, the sixth and seventh powers of A are A6 = A A5 = AO = O, A7 = A2 A5 = A2 O = O. Accordingly, the longest path in the graph is 1 less than the power providing a zero matrix for the first time. In this graph, A5 = O, and therefore, the longest path is 4. Next, consider the powers of the adjacency matrix to be added as follows: A(n) =
n
A p = A + A2 + · · · + An .
p=1
Because there are nine items in the graph, the maximum path length is 8. The sum of the matrix powers is obtained as follows3 : ⎡ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ (8) 2 8 A = A+ A + ··· + A = ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎣
3
111 1 1
⎤ 12 3 3 1 1 1 1⎥ ⎥ 1 2 2⎥ ⎥ 1 1 1 2⎥ ⎥ ⎥, 1 ⎥ 1 1 1⎥ ⎥ 1 1⎥ ⎥ 1 1⎦ 1
In fact, from A6 = A7 = A8 = O, this is the sum of A, · · · , A5 .
(8.3)
8.1 Graph
357
The blank cells indicate 0. This matrix is called the reachability matrix (with a path length of ≤ 8),4 and accounts for the reachability between each pair of items. For example, the (1, 7)-th element is 3, meaning that there are three paths from Items 1 to 7, and Item 7 is thus reachable from Item 1. Meanwhile, because the (2, 3)-th element is 0, there is no single path starting with Item 2 and ending with Item 3, which indicates that Item 3 is unreachable from Item 2. When the number of items in a graph is large, it becomes difficult to examine whether one item is reachable from another merely by inspecting the graph. Therefore, the reachability matrix is useful for checking whether paths exist between any two items of interest.
8.1.4 Distance and Diameter The shortest path between two items is called the distance between the items. For example, as shown in Eq. (8.2) (p.354), there are three paths from Item 1 to 7, the shortest of which is Path A with a length of 2. Therefore, the distance between Items 1 and 7 is 2. The distance between items can be easily determined by checking the power matrices of the adjacency matrix. First, let the ( j, k)-th element in A p be denoted as a p, jk . For example, a1, jk , a2, jk , and a3, jk are the ( j, k)-th elements of A, A2 , and A3 , respectively. Then, by observing a1, jk , a2, jk , a3, jk , · · · , when a p, jk = 0 for the first time, p is the distance between Items j → k. For example, as Table 8.2 shows, a1,12 = 1, a2,12 = 0, a3,12 = 0, a4,12 = 0. The first nonzero element is when p = 1 (a1,12 ),5 Therefore, the distance between Items 1 → 2 is 1. In addition, from a1,39 = 0, a2,39 = 1, a3,39 = 1, a4,39 = 0, the distance between Items 3 → 9 is 2.6 The longest distance in a graph is called the diameter. In this graph, the largest distance is Path F between Items 1→ 9, although there are three paths between the items as follows:
Adjacency matrix A is a reachability matrix with path length ≤ 1. In Table 8.2, the a1,12 cell is shaded, which indicates it is the first nonzero element. 6 Likewise, the a 2,39 cell is shaded in Table 8.2. 4 5
358
8 Bayesian Network Model
Fig. 8.4 Example of a disconnected graph
1
Component 2
5
3
6
4
7
8
Isolated Item
9
Path D: 1 → 3 → 7 → 9 , Path E: 1 → 2 → 6 → 7 → 9 , Path F: 1 → 3 → 6 → 7 → 9 . The length of the shortest path is Path D; therefore, the distance between the items is three. This interitem distance is the longest among all item pairs. Accordingly, the diameter of the graph is 3. The diameter can also be determined by examining the power matrices of the adjacency matrix. Among the matrices with one or more shaded cells (each of which represents the first nonzero element for that position), the largest p represents the diameter. In Table 8.2, shaded cells are observed when p = 1 ( A), 2 ( A2 ), and 3 ( A3 ). Among them, the largest p is 3; therefore, the diameter is 3.
8.1.5 Connected Graph A connected graph is a graph in which all items are connected by edges. Figure 8.2 shows a connected graph. In addition, a graph is strongly connected if each item is reachable from all other items. The graph in Fig. 8.2 is connected but not strongly so because Item 4 is not reachable from Item 1. If a graph is strongly connected, all off-diagonal elements of the reachability matrix will be positive integers. Figure 8.4 shows a disconnected graph in which not all of the items are connected. The connected subgraphs that comprise the disconnected graph are called components. This disconnected graph has three components: A component with a single vertex is called an “isolated vertex” (or in this book, an isolated item). An isolated item has zero parent items and zero child items.
8.1 Graph
359
Fig. 8.5 Another example of a disconnected graph
1
4
7 3 2
5 8 9
6
It is sometimes difficult to determine whether a graph is connected or disconnected, particularly if the graph is not well-organized or has numerous items. Figure 8.5 shows another disconnected graph. It is not easy to immediately discern whether the graph is connected or disconnected. In fact, the graphs in Figs. 8.5 and 8.4 are identical because the adjacencies of the items are the same. Thus, it would be useful to have an adjacency matrix-based method to judge whether a graph is connected or disconnected. If a graph is connected, each item must be reachable from other items when the directions of the edges are disregarded. If the directions of the edges are not considered, the graph is an undirected graph. Accordingly, the reachability matrix for the undirected graph of the original directed graph should be examined. In the first row of Table 8.3, two directed graphs are presented: The one on the left is connected, and the one on the right is disconnected. The connectivity for each graph was verified through calculation. The second row shows the undirected graphs of both directed graphs by replacing all directed edges with undirected edges. In addition, the adjacency matrices of the undirected graphs ( AU s) are obtained, as shown in the third row of the table. From the adjacency matrix of the directed graph ( A), that of the undirected graph ( AU ) can be obtained through the following: AU = A + A . Using this operation, if the ( j, k)-th element in A is 1 (i.e., if Items j → k are adjacent), then 1 is input in the (k, j)-th element of AU (Items k → j are also adjacent). When the ( j, k)-th and (k, j)-th elements are both 1, the relation between them is Items j → k and Items k → j, which implies that the items are intercommutable (i.e., Items j ↔ k). A bidirected edge is thus an undirected edge.
360
8 Bayesian Network Model
Table 8.3 How to check whether a graph is connected or disconnected Disconnected Graph
Connected Graph 1
1 3
Directed Graph
2
5
4
6
7
3 5
2
8
7
8 9
9
1
Undirected Graph
4 7
Adjacency Matrix AU
1 ⎢ ⎢1 ⎢ ⎢ ⎢1 ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎣
3 5
2
8
8 9
480 394 592 ⎢ ⎢ 394 624 796 ⎢ ⎢ ⎢ 592 796 1135 ⎢ ⎢ ⎢ 321 347 611 ⎢ ⎢ 211 157 237 ⎢ ⎢ ⎢ 729 665 1043 ⎢ ⎢ ⎢ 722 813 1150 ⎢ ⎢ 236 262 380 ⎣
⎤
1 1 1
1 1 1
1 1
1
1
1 1 1
1 1 1
1
347 157 665 611 237 1043 443 131 623 131 99 314 623 314 1190 586 271 1162 205 76 382
321 347 611 443 131 623
722 236 321
6
⎤
⎡
1 1 1 ⎢ ⎢1 1 ⎢ ⎢ ⎢1 1 ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ 1 1 ⎢ ⎢ ⎢ ⎢ ⎢ ⎣
⎥ ⎥ ⎥ ⎥ ⎥ 1 ⎥ ⎥ 1 1 ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ 1 ⎥ ⎥ 1 1⎥ ⎥ 1 1⎥ ⎦ 1 1 1
321 211 729
4
7
9
⎡ (8) Reachability Matrix AU
3 6
⎡
6
1
2 5
4
1 1 1 1 1 1 1
⎤ ⎡
⎥ 813 262 347 ⎥ ⎥ ⎥ 1150 380 611 ⎥ ⎥ 586 205 443 ⎥ ⎥ ⎥ 271 76 131 ⎥ ⎥ ⎥ 1162 382 623 ⎥ ⎥ ⎥ 1516 604 586 ⎥ ⎥ 604 282 205 ⎥ ⎦ 586 205 443
170 85 85 ⎢ ⎢ 85 170 170 ⎢ ⎢ ⎢ 85 170 170 ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ 170 85 85 ⎢ ⎢ ⎢ ⎢ ⎢ ⎣
⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ 1 1 ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ 1 1⎥ ⎥ 1 1⎥ ⎦ 1 1 1
170 85 85 170 111 170 85 85 170
⎤ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ 85 85 170 ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ 170 170 85 ⎥ ⎥ 170 170 85 ⎥ ⎦ 85 85 170
8.1 Graph
361
Next, the reachability matrix of the undirected adjacency matrix ( AU ) is calculated. Because there are nine items, the maximum possible path length is eight if the graph is connected,7 and the reachability matrix of AU is defined as follows: AU(n) =
n
p
AU = AU + AU2 + · · · + AUn ,
p=1
where AU(n) is the reachability matrix with path length ≤ n. The bottom row of Table 8.3 shows the reachability matrix with path length ≤ 8, AU(8) . In the reachability matrix for the left (connected) graph, all elements are positive integers, which indicates that there are one or more paths between any two items. For example, the (1, 2)-th entry is 394, which means that there are 394 walks with length ≤ 8 that start at Item 1 and end at Item 2. Note that the (2, 1)-th element is 394 because there are the same number of walks from Item 2 back to Item 1. Because the same holds for all item pairs, the reachability matrix for an undirected graph is symmetric. In addition, the first diagonal element is 480, which implies that there are 480 walks, each with a length of ≤ 8, starting from and returning to Item 1. For the left-hand graph, even if p = 5, all elements in the reachability matrix become nonzero. Items 5 and 8 are the farthest apart, and the shortest paths from Item 5 to 8 are 5 → 2 → 6 → 7 → 4 → 8 , 5 → 2 → 6 → 7 → 9 → 8 . Meanwhile, in the reachability matrix for the (disconnected) graph on the right, many of the elements are 0. For example, the (1,4)-th element is 0. This means that there is no single walk with a length of ≤ 8 between the items. Because there are nine items, the items must be reachable within eight steps if the graph is connected. Therefore, the graph on the right is disconnected. It does not matter whether a graph is connected or disconnected in a BNM. In this book, however, the target of the BNM analysis is the test data, and the contents of the items within a test are usually interrelated. Therefore, the data-derived network structure is less likely to be disconnected. Even if a network is disconnected, the data should be separately analyzed component-wise if it is not necessary to ensure that there is no relationship between the components. Therefore, test data-derived networks are generally connected in a BNM analysis.
7 A path with length > 8 becomes a walk because at least one item must be passed more than once. If an item is unreachable from another item by a path, it is also unreachable by a walk; thus, it is sufficient to consider the reachability under the condition in which the path length is the longest.
362
8 Bayesian Network Model
Fig. 8.6 Multigraph
1 Multi-Edge
2
5
3
6
4
7
8
Self-Loop
9
8.1.6 Simple Graph A simple graph is a graph that has no multiple edges or self-loops. As shown in Fig. 8.6, if two or more edges are incident to the same two items, the edges are multiple edges (also called “parallel edges”). For instance, there are two edges between Items 1 and 3. This multi-edge from Item 1 to 3 is a unidirected multi-edge. Similarly, the multi-edges between Items 2 and 5 and between Items 4 and 8 are bidirected multi-edges. As shown in the figure, a self-loop is an edge that connects an item to the item itself. In the figure, Items 6 and 8 have a self-loop. A graph that contains multiple edges or self-loops is called a multigraph. By contrast, a simple graph is a network that can be analyzed using a BNM. If it is difficult to tell whether a graph is simple when the number of items is large, an adjacency matrix can be used in the evaluation. Table 8.4 shows the adjacency matrix of the graph in Fig. 8.6. If Item j has a self-loop, the j-th diagonal entry of the adjacency matrix becomes 1.8 The sixth and eighth diagonal elements are 1 because Items 6 and 8 have a self-loop. In addition, the trace of the adjacency matrix is calculated as tr A = 2. When the trace of the adjacency matrix is not 0, it indicates that at least one item has more than one self-loop. In addition, if at least one off-diagonal element in the adjacency matrix is ≥ 2, the graph has one or more multi-edges. For example, Table 8.4 shows that the (1, 3)-th and (4, 8)-th elements are 2, and that the edges from Items 1 to 3 and Items 4 to 8 are unidirected multi-edges.9 However, this method cannot detect the bidirected multi-edge consisting of two unidirected edges and connecting Items 2 and 5. 8 9
When the item has two self-loops, the diagonal element will be 2. If the (3, 1)-th and (8, 4)-th elements are 1, they are bidirected multi-edges.
8.1 Graph
363
Table 8.4 Adjacency matrix of multigraph (Fig. 8.6)
1 2 3 4 5 6 7 8 9 Outdegree → 3 Item 1 12 2 11 Item 2 2 1 1 Item 3 Item 4 3 12 1 Item 5 1 2 1 1 Item 6 1 Item 7 1 3 1 11 Item 8 0 Item 9 Indegree 0 2 2 1 1 3 3 3 2 17 ∗ Blank
cells represent 0.
The self-loops, unidirected multi-edges, and bidirected multi-edges in the graph can be detected simultaneously by examining the adjacency matrix for the undirected graph of the original directed graph, AU , which is the sum of the adjacency matrix and its transpose, which is given as follows: ⎡
112 ⎢ 1 11 ⎢ ⎢ 1 11 ⎢ ⎢ 1 12 ⎢ 1 AU = A + A = ⎢ ⎢ 1 ⎢ 11 ⎢ ⎢ 1 ⎢ ⎣ 1 1 ⎡
1 ⎢1 ⎢ ⎢2 ⎢ ⎢ ⎢ =⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎣
12 1 1
21 11 1 13 1 2 11 21 11 11 3 2 11
⎡
⎤ 1 ⎥ ⎥ ⎢1 1 1 ⎥ ⎥ ⎢ ⎥ ⎥ ⎢2 1 ⎥ ⎥ ⎢ ⎥ ⎢ 1 1 ⎥ ⎥ ⎥ ⎢ ⎥ ⎥+⎢ 1 1 ⎥ ⎥ ⎢ ⎥ ⎥ ⎢ 11 1 ⎥ ⎥ ⎢ ⎢ ⎥ ⎥ 11 11 1⎥ ⎢ ⎥ ⎣ ⎦ 2 1 ⎦ 1 111 1 ⎤ ⎤
⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥. ⎥ ⎥ ⎥ 1⎥ ⎥ 1⎦ 1
364
8 Bayesian Network Model
In this undirected adjacency matrix, AU , if a diagonal element is 2, the corresponding item has a self-loop,10 and, if an off-diagonal element is ≥ 2, there are multi-edges between the two corresponding items. In summary, if any of the elements in the undirected adjacency matrix, AU , is ≥ 2, the graph contains at least one self-loop or one multi-edge. Therefore, the graph is a multigraph. However, if all elements of this matrix are a 0 or 1, the graph is simple. Because the maximum element of this matrix is max( AU ) = 3, the graph is a multigraph.
8.1.7 Acyclic Graph A path that starts from an item and returns to that item is called a “cycle,” “circuit,” or “closed path,” and a graph containing a cycle(s) is referred to as a cyclic graph. Figure 8.7 shows a cyclic graph with the following three cycles: Cycle A: 4 → 7 → 8 → 4 , Cycle B: 2 → 6 → 7 → 9 → 5 → 2 , Cycle C: 2 → 6 → 7 → 8 → 9 → 5 → 2 . Because Cycle A contains three edges, it is a cycle with a length of 3. In addition, this cycle is described using Item 4 as the starting point, even when starting from any of the items in the cycle, with the path revisiting that item. Analogously, Cycles B and C are closed paths with lengths of 5 and 6, respectively. A graph that does not contain a cycle is called an acyclic graph. Figure 8.2 (p.351) shows an example of a (connected) acyclic graph, whereas Fig. 8.4 (p.358) shows a disconnected acyclic graph. The multigraph shown in Fig. 8.6 (p.362) is a cyclic graph because a self-loop and a bidirected multi-edge are both treated as a cycle with a length of 1 and 2, respectively. A network that can be analyzed using a BNM must be acyclic. When a graph is large, it is not easy to detect whether the graph is cyclic or acyclic. Thus, a method is required for such detection. Table 8.5 shows the adjacency and reachability matrices for the graph in Fig. 8.7. It is not obvious from the adjacency matrix (table on the left) whether the graph is cyclic or acyclic. The table on the right shows the reachability matrix. When the j-th diagonal element of this matrix is not 0, at least one path starting from and returning to Item j exists. For example, the second diagonal element is 3, which shows that there are 10
When an item has two self-loops, the diagonal element is 4. In general, if an item has n self-loops, the diagonal element is 2n.
8.1 Graph
365
1
2
5
4
3
6
7
8
9 Fig. 8.7 Cyclic graph Table 8.5 Adjacency and reachability matrices of a cyclic graph (Fig. 8.7)
8 p=1
A
8
A(8) =
p=1
→ 123 456 789 1 11 1 2 3 1 1 1 4 5 1 6 1 11 7 8 1 1 1 9 Blank cells represent 0.
8 p=1
Ap
→ 123 4 56 78 9 1 8 1 6 10 8 11 7 13 2 3 2 43 53 5 3 7 5 9 6 10 8 13 4 4 3 53 55 8 5 3 2 33 32 4 6 4 3 53 55 8 7 5 5 8 4 6 5 10 8 5 4 73 74 8 9 3 1 32 22 3
three paths (cycles) each with a length of ≤ 8, starting from and returning to Item 2. The cycles are as follows: Cycle D: 2 → 6 → 7 → 9 → 5 → 2 , Cycle E: 2 → 6 → 7 → 8 → 9 → 5 → 2 , Cycle F: 2 → 6 → 7 → 8 → 4 → 7 → 9 → 5 → 2 . In Cycle F, Item 7 is visited twice. A cycle such as Cycle F, which passes an item more than once, is called a closed walk. The trace of the reachability matrix is calculated as tr A(8) = 23,
366
8 Bayesian Network Model
indicating that the graph has 23 paths (i.e., closed walks) with a length of ≤ 8, starting from and ending with a given item. In a graph containing a cycle, when the number of items is J , at least one diagonal element is nonzero in the reachability matrix of path length ≤ J − 1. Conversely, for an acyclic graph, as shown in Fig. 8.2 (p.351), all diagonal elements in the reachability matrix are 0, as shown in Eq. (8.3) (p.356). Consequently, for a graph to be acyclic, the trace of the reachability matrix must be 0.
8.1.8 Section Summary and Directed Acyclic Graph Directed Acyclic Graph 1. 2. 3. 4.
The graph must be simple. The graph must be directed. The graph must be acyclic. Either a connected or disconnected graph is acceptable.
that can be analyzed using BNM are summarized in The characteristics of graphs Directed Acyclic Graph . First, BNM graphs must be simple. That is, they must not contain any self-loops or multi-edges. Second, BNM graphs must be directed. In other words, every edge must be directed. A directed edge is a graphical representation of CCRR between incident items. In addition, BNM graphs must be acyclic. An acyclic graph is one that contains no cycles; therefore, no item can be revisited on any path in a graph. Finally, BNM graphs can be connected or disconnected. However, when a graph is disconnected, each component should be analyzed individually because there is not much need to analyze the disconnected components together.11 A graph satisfying Points 1, 2, and 3 is called a directed acyclic graph (DAG). Although “simple DAG” is a more accurate expression, “simple” is omitted for brevity. In other words, a BNM-analyzed network must be a (simple) DAG. mathematically expressed In addition, the conditions for a DAG can be as Mathematical Conditions for a Graph to be DAG , which is useful for checking whether a graph is a DAG. Although Point 4 is irrelevant to a graph being a DAG, it is listed, and the connectivity represents basic information about the graph.
11
If the components are considered to be related but are found to be unrelated, it is important to show their disconnectedness.
8.1 Graph
367
Mathematical Conditions for a Graph to be DAG 1. 2. 3. 4.
If simple, the maximal element in AU is ≤ 1. If directed, the maximal element in AU is ≤ 1. If acyclic, the trace of A(J −1) , tr A(J −1) is 0. If connected, the maximal element in AU(J −1) is greater than 0.
For a simple graph, all elements in the undirected adjacency matrix AU (i.e., the sum of adjacency matrix A and its transpose A ) must be a 0 or 1. In addition, for a directed graph (in which all edges are directed), the mathematical condition is identical to that of a simple graph because if there is an undirected edge between Items j and k, then the ( j, k)-th and (k, j)-th elements in the adjacency matrix are both 1 because Items j → k and k → j are both available. In other words, an undirected (simple) edge is a bidirected multi-edge as follows12 : USE k j —–
≡
j
BME ← −− −− →
k .
In other words, if a graph contains no directed multi-edges, the graph has no simple undirected edges. Moreover, as Point 3 shows, for an acyclic graph, when the number of items is J , all diagonal elements in the reachability matrix with path length ≤ J − 1, A(J −1) , must be 0. When the j-th diagonal element is not 0, there is at least one path (i.e., cycle) starting from and returning to Item j. Because there must not be a cycle starting from and returning to any given item, all diagonal elements must be zero, which requires the trace of the reachability matrix to be 0. Finally, although Point 4 indicates that a graph can be connected, it is not necessary for a graph to be a DAG. A graph is connected if each item is reachable from all other items under the condition that the (directed) graph is undirected. Thus, if all elements in the reachability matrix of the undirected adjacency matrix ( AU ), AU(J −1) , are > 0, each item can be visited from any of the other items. That is,13 J J j=1 k=1
sgn(aU(Jjk−1) )
=
1, if the graph is connected , 0, otherwise
where aU(Jjk−1) is the ( j, k)-th element in AU(J −1) . When the number of items in a graph is large, it is much easier to mathematically verify the connectivity of the graph than 12 13
.
≡ means “equivalent to.”. The signum (sign) function extracts the sign of the argument. That is, ⎧ ⎪ ⎨−1, if x < 0 sgn(x) = 0, if x = 0 . ⎪ ⎩1, if x > 0
368
8 Bayesian Network Model
to determine it through a visual inspection. That is, if an element in AU(J −1) is 0, the product of all elements is 0. Directed Tree A DAG, as shown in Fig. 8.8, is called a “directed tree,” the characteristics of which are as follows: 1. It contains a root. 2. For each item except for the root, one parent exists.
1
2
Root
3
4
Leaf
5
6
Leaf
Leaf
7
8 Leaf
9
Leaf
Fig. 8.8 Directed tree
From Point 1, the root is an item that has no parents (i.e., the indegree is zero); that is, the root is a source. In addition, from Point 2, the vertices (i.e., items) other than the root have only one parent item. Therefore, there is only one directed path from the root to each item in a directed tree. When a tree contains more than one root, it becomes a disconnected graph, where each component is a tree.14 Such a tree is called a forest. The number of child items (i.e., outdegrees) for each item may not be the same. For example, in the figure, the outdegrees of Items 1, 2, and 3 are 3, 2, and 0, respectively. An item with no children is called a leaf. In the figure, Items 3, 5, 6, 8, and 9 are leaves. A tree in which the outdegrees of all items other than the leaves is 2 (3) is called a binary (ternary) tree.
14
A connected tree with multiple roots is impossible to draw. For example, if the same tree as in Fig. 8.8 is placed to the right of the tree, Item 8 in the left tree and Item 5 in the right tree are considered the same item. Both trees are then connected, and the graph will have two roots. However, the number of parents of the connecting item will become two (i.e., Items 2 and 4), which will violate the defining of a graph as a tree.
8.2 D-Separation
369
8.2 D-Separation This section describes the core concept of the BNM theory, i.e., directed separation (d-separation). The BNM theory is based on d-separation and DAG, the latter of which was described in the previous section. D-separation can simplify a complex conditional probability network among the items and mapp it to a graph.
8.2.1 Chain Rule The chain rule re-expresses the joint probability as a product of the conditional probabilities. From Eq. (2.9) (p.36), the joint correct response rate (JCRR) of Items j and k ( p jk ) is expressed as follows: p jk = pk| j × p j , where p j is the correct response rate (CRR) of Item j, and pk| j is the conditional correct response rate (CCRR) of Item k for the students passing Item j. This equation describes the relationship between Items j and k when both items are passed. Letting j represents an event in which Item j is passed, the above equation can then be rewritten as follows: Both items are correct: p jk = pk| j × p j .
(8.4)
This decomposition of JCRR is also applicable when either item is passed and when neither item is passed. Let j represents an event in which Item j fails: Item j is correct and Item k is incorrect: p jk = pk| j × p jm , Item j is incorrect and Item k is correct: p jk = pk| j × p j m, Neither item is correct: p jk = pk| j × p j . Moreover, let the four equations above be simultaneously expressed as P( j, k) = P(k| j) × P( j).
(8.5)
Although Eq. (8.4) represents only the JCRR, Eq. (8.5) includes the above three situations (i.e., both Items j and k are correct, neither are correct, or only one is correct) and is referred to as the joint probability (JP).
370
8 Bayesian Network Model
Referring to Eq. (8.5), the JP of Items j, k, and l is factored as follows: P( j, k, l) = P(l| j, k) × P( j, k) = P(l| j, k) × P(k| j) × P( j),
from P(l|k, j) =
P( j, k, l) P( j, k)
from (8.5).
Further, when the number of items is J , the JP of J items is P(1, · · · , J ) =P(J |1, · · · , J − 1) × P(J − 1|1, · · · , J − 2) × · · · × P(3|1, 2) × P(2|1) × P(1) =
J
P( j|1, · · · , j − 1).
(8.6)
j=1
Note that when j = 1, P(1| ) = P(1). Therefore, the JP of J items can be factored into the product of J conditional probabilities (CPs), and the items labeled < j (i.e., Items 1, · · · , j − 1) are clearly specified as the antecedent items in the CP for Item j, P( j|1, · · · , j − 1). Consider that the number of test items is five and that the CP structure (i.e., the network) of the test was analyzed. First, the JP of the test is factored as P(1, 2, 3, 4, 5) = P(5|1, 2, 3, 4) × P(4|1, 2, 3) × P(3|1, 2) × P(2|1) × P(1). (8.7) Note that the JP is factored but not simplified; in other words, it is simply re-expressed. By using d-separation, this factorization can be simplified. Next, we assume that the CP network consisting of five items is shown in Fig. 8.9. This graph (i.e., Network 1) is a DAG. When agraph is a DAG, the children of each item can be labeled higher than the item (see Topological Sorting ). In Network 1, we assume that the five items are labeled as such. Item 1 is a source unaffected by any other item, which is consistent with the expression P(1) in Eq. (8.7). In addition, Item 2 is affected by Item 1, which matches the expression P(2|1) in Eq. (8.7). Fig. 8.9 Network 1
1
2
4
3
5
8.2 D-Separation
371
Although P(1) and P(2|1) match the graphical representation, they cannot be further simplified, and the next section shows that the three d-separation rules can simplify the CP structure and match the structure with the graph representation of Network 1. Topological Sorting When a graph is a DAG, the children of each item can be labeled higher than the item. Conversely, the parents of each item can be labeled as lower than the item. This numbering scheme is called topological sorting. The item labels in Fig. 8.1 are not the original labels but the topologically sorted labels. Note that the item labeling result in the figure is an example of topological sorting. It is possible to obtain several results from the topological sorting. Figure 8.10 shows another example of labeling through topological sorting for the same graph, where the children of each item are labeled higher than the item itself. 1
2
3
4
6
5
8
7
9 Fig. 8.10 Another example of topological sorting
It is strongly recommended that the item labels be topologically sorted because the labels become consistent with the factored JP. As CP P( j|1, · · · , j − 1) in Eq. (8.6) shows the antecedent items of Item j are Items 1 through j − 1, each of which is labeled as < j.
8.2.2 Serial Connection This section focuses on the CP structure for Item 3, which is expressed as P(3|1, 2) in Eq. (8.7). Meanwhile, in Network 1 (Fig. 8.9), Item 3 is directly affected by only
372
8 Bayesian Network Model
Fig. 8.11 Serial connection
Multiplication
Division
Fraction
Item 2 and is indirectly affected by Items 1 via Item 2. The relationships among Items 1, 2, and 3 were extracted, as shown in Fig. 8.11. Such a linear relationship among a series of items is called a serial connection, which is one of the three d-separation conditions. This connection is similar to the relationship among multiplication, division, and fractions. A student who is able/unable to calculate a division problem (Item k) is usually capable/incapable of calculating a multiplication problem (Item j). Likewise, a student who is capable/incapable of calculating fractions (Item l) is likely to be proficient/poor at division. Suppose that a reader is a teacher and considers a student’s possibility of passing not know whether the student is capable or incapable of Item l (fractions). If you do :::::::::: division (Item k), then the knowledge regarding whether the student is capable or incapable of multiplication (Item j) is informative for estimating the CP (or CCRR) know whether the student is proficient in of Item l (fractions). However, if you do ::::::: division (Item k), then the knowledge of whether the student is capable of multiplication (Item j) is no longer informative when estimating the CCRR of Item l (fractions) because the information regarding the capability of the student for Item j (multiplication) is included in the information of Item k (division), and the teacher usually has information about the student’s response (information) on Item k (division). In general, when information about Item k is obtained, it is said that Item k is instantiated. This situation is analogous to that of blood relations. If the physical states of a child’s parents are known, the states of the parents’ ancestors are uninformative in understanding the state of the child. As a scientific fact, if the DNA of the parents is known, the DNA of the grandparents is not necessary to know the DNA of the child. Similarly, when the information about Item k is known, Item j is no longer informative regarding Item l. In other words, Items k and l are independent when given
8.2 D-Separation
373
information (i.e., response) on Item k. This situation is mathematically expressed as15 P( j, l|k) = P( j|k) × P(l|k). For Network 1 (Fig. 8.9), this situation is presented as P(1, 3|2) = P(1|2) × P(3|2).
(8.8)
Thus, using this serial connection, P(3|2, 1) in Eq. (8.7) can be simplified as P(3|1, 2) =
P(1, 3|2)P(2) P(1, 2, 3) = P(1, 2) P(1|2)P(2) P(1|2)P(3|2)P(2) = P(1|2)P(2) = P(3|2).
from
P(1, 2, 3) = P(1, 3|2) P(2)
from (8.8)
That is, Item 1 was excluded from the antecedent item set of Item 3. Thus, the network structure in Eq. (8.7) is simplified and rewritten as P(1, 2, 3, 4, 5) = P(5|1, 2, 3, 4) × P(4|1, 2, 3) × P(3|2) × P(2|1) × P(1). (8.9)
8.2.3 Diverging Connection This section considers the CP structure for Item 4 in Network 1 (Fig. 8.9), which is expressed as P(4|1, 2, 3) in Eq. (8.9). First, just as Items 3 and 1 were conditionally independent given Item 2, Items 4 and 1 are also conditionally independent. Thus, owing to the serial connection, P(4|1, 2, 3) can be expressed as P(4|1, 2, 3) = P(4|2, 3).
(8.10)
Next, in Network 1, Item 4 is a successor to Item 2, but not a successor to Item 3. Figure 8.12 illustrates the relationship among Items 2, 3, and 4. In the figure on the left, Item j has two children (Items k and l), whereas Item j has four children in the figure on the right. As shown in the figures, a structure in which an item has multiple children is called a diverging connection. This connection specifies the conditional independence between the child items. For the left figure, given Item j, Items k and l are conditionally independent. For P(B, C) = P(B) × P(C) implies that Events B and C are independent. In addition, P(B, C|A) = P(B|A) × P(C|A) means that Events B and C are conditionally independent, given Event A. 15
374
8 Bayesian Network Model
Fig. 8.12 Diverging connection
example, suppose that the contents of the three items are as follows: Item j : Are dolphins mammals or fish? Item k : Are dolphins viviparous or oviparous? Item l : Do dolphins have lungs or gills? It is important to assume again that you are a teacher and consider whether a stunot know whether the student possesses knowldent can pass Item l. If you do :::::::::: edge about Item j (mammal/fish), then the student’s true/false (TF) information about Item k (viviparous/oviparous16 ) is useful for estimating the student’s CP (or already know whether the student CCRR) for Item l (lungs/gills). However, if you ::::::::::: has knowledge of Item j (mammal/fish), the student’s TF information about Item k (viviparous/oviparous) is no longer an important clue in estimating the CCRR of the student for Item l (lungs/gills). In addition, you can know the student’s TF information about Item j because you are his/her teacher (i.e., Item j is instantiated). This situation can also be related to blood relationships. If we know the parental DNA of an individual, the DNA of the individual’s siblings is unnecessary for knowledge regarding the person’s DNA. That is, because information is known about Item j, Item k is no longer informative about Item l. The reverse is also true: When information is known about Item j, Item l has no useful information about Item k. Therefore, Items k and l are independent when given Item j, which can be represented as follows: P(k, l| j) = P(k| j) × P(l| j). In addition, in the figure on the right, the four child items are conditionally independent given Item j, which is mathematically expressed as follows:
16
Viviparous animals are born from their mother’s womb, whereas oviparous animals hatch from eggs.
8.2 D-Separation
375
P(k1 , k2 , k3 , l| j) = P(k1 | j) × P(k2 | j) × P(k3 | j) × P(k4 | j) × P(l| j). In Network 1 (Fig. 8.9), the relationships among Items 2, 3, and 4 correspond to this diverging connection. Because Item 2 is given, the information about Item 3 is not important for estimating the CP (or CCRR) of Item 4, and vice versa. In other words, given Item 2, Items 3 and 4 are conditionally independent, as follows: P(3, 4|2) = P(3|2) × P(4|2).
(8.11)
Using this equation, P(4|2, 3) in Eq. (8.10) can be further simplified as P(4|2, 3) =
P(3, 4|2)P(2) P(2, 3, 4) = P(2, 3) P(3|2)P(2) P(3|2)P(4|2) = P(3|2) = P(4|2).
from
P(2, 3, 4) = P(3, 4|2) P(2)
from (8.11)
That is, Item 3 was excluded from the antecedent item set of Item 4, and Eq. (8.9) can thus be written as P(1, 2, 3, 4, 5) = P(5|1, 2, 3, 4) × P(4|2) × P(3|1) × P(2|1) × P(1).
(8.12)
8.2.4 Converging Connection This section considers the CP for Item 5, which is expressed as P(5|1, 2, 3, 4) in Eq. (8.12). First, from the serial connection rule, when the parent items of an item are instantiated, the item is independent of the predecessors of the parent(s). Therefore, the CP for Item 5 can be represented as P(5|1, 2, 3, 4) = P(5|3, 4). Figure 8.13 illustrates the relationship among Items 3, 4, and 5. In the figure on the left, Item l has two parent items, whereas Item l has four parents in the figure on the right. Such structures are called converging connections and constitute the third d-separation rule. Notably, this rule is contrary to the previous two: When a child item is instantiated, the parent items are conditionally dependent on each other. Conversely, if the child item is not instantiated, the parent items are conditionally independent of each other. However, the response to the child item is usually observed in the test data, and the parent items become conditionally dependent on each other.
376
8 Bayesian Network Model
lost
heart
forget Fig. 8.13 Converging connection
Items j ( = lost, late, or dead), k ( = heart, mind, or spirit), and l ( = forget) in the left figure are kanji17 learned by Japanese elementary school students. Here, (forget) is composed of (lost) and (heart).18 Again, suppose that you are a teacher and estimate the CP (or CCRR) that a student can write the kanji “ (heart)” (Item k). When you do not know whether :::::::::: a student can write “ (forget)” (Item l), then the information about whether the student can write “ (lost)” (Item j) is uninformative as to whether the student can (heart)” (Item k). This is because both characters (Items j and k) are quite write “ different in terms of meaning and shape. Therefore, the TF for “ (lost)” provides little information about the TF for “ (heart).” (forget)” (Item l), the do know that the student can write “ However, if you ::::::: TF for “ (lost)” (Item j) is informative for estimating the CCRR of “ (heart)” (heart).” Therefore, the CCRR (forget)” contains “ (Item l). This is because “ of “ (heart)” can be estimated by combining the TF information regarding “ (lost)” and “ (forget).” In addition, Item l is usually observed in the test data. Therefore, unlike the previous two d-separation rules, P(l| j, k) cannot be factored into the product of P(l| j) and P(l|k). That is, P(l| j, k) = P(l| j) × P(l|k). Similarly, in the figure on the right, where the number of parent items is four, P(l| j, k1 , k2 , k3 ) cannot be decomposed. According to the serial connection rule, the upstream and downstream items are is instantiated. In addition, according to the divergindependent when the middle item :::::::::::: 17
Chinese characters were introduced to ancient Japan and are used in the modern Japanese language. They are considerably different from their counterparts used in modern Mandarin in China but quite similar to those in Mandarin used in Taiwan. 18 Many Chinese characters are formed by combining two or more characters. For example, (man)+ (tree)= (rest).
8.2 D-Separation
377
ing connection rule, the siblings sharing a common parent item are independent when Conversely, in the converging connection the common parent item is instantiated. ::::::::: not instantiated, the parent items become independent. rule, when the child item is :::::::::::::: Accordingly, P(5|4, 3) in Network 1 cannot be further simplified. Therefore, the factorization of JP for Network 1 is finalized as P(1, 2, 3, 4, 5) = P(5|1, 2, 3, 4) × P(4|1, 2, 3) × P(3|1, 2) × P(2|1) × P(1) = P(5|3, 4) × P(4|2) × P(3|1) × P(2|1) × P(1).
(8.13)
8.2.5 Section Summary Summary of D-Separation 1. D-separation converts the structure of the graph (i.e., the adjacencies) of the test items into the CP network. 2. Serial connection: Each item is conditionally independent of the ancestors of the item’s parents. 3. Diverging connection: Each item is conditionally independent of the siblings, which have the same parent(s). 4. Converging connection: An item is conditionally dependent on all the parent(s). A summary of d-separation is shown in Summary of D-Separation . As described in Point 1, BNM is a method of graphically representing the CP structure among items. A graphical representation of a structure has the advantage of improving the structural interpretability. Figure 8.9 is much more understandable than Eq. (8.13), even if the equation is simplified. In addition, not everyone who practically applies the analysis results is trained in advanced mathematics. Point 2 (a serial connection) means that each item is influenced only by its parent item(s) and not by its nonparent ancestors. Moreover, Point 3 (a diverging connection) indicates that regardless of the number of siblings an item has, the item itself is unaffected by the sibling items. Finally, Point 4 (a converging connection) means that when an item has multiple parents, it is affected by all the parents of the item. Points 2, 3, and 4 in Summary of D-Separation can be summarized in a sentence indicating that an item is affected only by its parent item(s). In other words, the item is unaffected by items other than its parent item(s). Accordingly, Eq. (8.13) can be rewritten as P(1, 2, 3, 4, 5) = P(5|3, 4) × P(4|2) × P(3|1) × P(2|1) × P(1) = P(5| pa(5)) × P(4| pa(4)) × P(3| pa(3)) × P(2| pa(2)) × P(1).
378
8 Bayesian Network Model
Consequently, when the number of items is J , the interitem CP structure in Eq. (8.6) (p.370) is represented as P(1, · · · , J ) = P(J | pa(J )) × P(J − 1| pa(J − 1)) × · · · × P(2| pa(2)) × P(1) =
J
P( j| pa( j)),
j=1
where P( j| pa( j)) = P( j) if pa( j) = ∅. Table 8.6 exemplifies the JP factorization for the three DAGs. Although the factorization results differ depending on the graph structure, it can be confirmed whether each individual item is conditioned only by its parent item(s). In addition, the source items were not conditioned by any other items. For instance, in the disconnected graph shown in Fig. 8.4, because Items 1, 4, and 5 are source items, they have no antecedent items. Furthermore, the directed tree in Fig. 8.8 clearly shows that all the items except the root item (i.e., Item 1) have only one antecedent item.
8.3 Parameter Learning In a BNM analysis, the network structure (i.e., DAG) is first determined, and the parameters of the network are then estimated. The former is called structure learning or structure exploration, and the latter is referred to as parameter learning. For example, several relationships are conceivable amongthree items. Suppose that the item labels are already topologically sorted (see Topological Sorting , p.371). The number of feasible DAGs is 8 (= 23 ), depending on whether an edge is assumed for Item 1 → 2, Items 1 → 3, and Items 2 → 3, the DAGs for three of which are shown in Fig. 8.14. The objective of structure learning is to determine the best network (i.e., Network 2-1 in the figure) among the eight DAGs, and the objective of parameter learning is to estimate the parameters, i.e., the CPs, that are set in the best network. In this section, we discuss the latter.
8.3.1 Parameter Set The parameters to be set depend on the DAG structure for analysis. For Network 2-1 shown in Fig. 8.14, to specify the network parameters, the JP of the network is first factored as P(1, 2, 3) = P(3|1)P(2|1)P(1).
8.3 Parameter Learning
379
Table 8.6 DAG and factorization of joint response rate Factorization
DAG 1
2
5
3
4
7
6
P(1, · · · , 9) = P(9|7, 8)P(8|4)P(7|3, 4, 6) P(6|2, 3)P(5|2)P(4) P(3|1)P(2|1)P(1) 8
9
Figure 8.1 (p.350) 1 Component
2
5
3
6
4
7
P(1, · · · , 9) = P(9|7, 8)P(8|4)P(7|4) P(6|2, 3)P(5)P(4) P(3|1)P(2|1)P(1)
8
Isolated Item
9
Figure 8.4 (p.358) P(1, · · · , 9) = P(9|7)P(8|4)P(7|4) P(6|2)P(5|2)P(4|1) P(3|1)P(2|1)P(1)
Figure 8.8 (p.368)
Item 1 is the source, as well as the parent of Items 2 and 3. As Network 2-1 shows, Item 1 has two parameters corresponding to P(1), i.e., the CRR and the incorrect response rate (IRR), which are estimated as follows: IRR of Item 1: π1 = 0.35, CRR of Item 1: π1 = 0.65.
380
8 Bayesian Network Model
Fig. 8.14 Structure and parameter learning
In fact, the CRR is the only parameter because the IRR is obtained by π1 = 1 − π1 . Next, Item 2 has four parameters corresponding to CP P(2|1). They are as follows19 : The CCRR of Item 2, given Item 1 is incorrect: π2|1 = 0.75, The CIRR of Item 2, given Item 1 is incorrect: π2|1 = 0.25, The CCRR of Item 2, given Item 1 is correct: π2|1 = 0.43, The CIRR of Item 2, given Item 1 is correct: π2|1 = 0.57. The practical parameters are only the CCRRs because both CIRRs are the remainder obtained by subtracting the corresponding CCRR from 1 (π2|1 = 1 − π2|1 and π2|1 = 1 − π2|1 ). Likewise, the parameters to be estimated for Item 3 are π3|1 and π3|1 . Consequently, Network 2-1 contains five parameters to be estimated, which are sorted in a matrix as follows: ⎤ ⎡ π1 = ⎣π2|1 π2|1 ⎦ . π3|1 π3|1 This parameter set enumerates all network parameters. Subscript “” indicates a “graph.” The content of the parameter set was uniquely derived from the DAG. The 19
CCRR and CIRR indicate conditional CRR and IRR, respectively.
8.3 Parameter Learning
381
Table 8.7 Parameters of network 2-2
Factor Parameter P(1) π1 CRR of Item 1 P(2) π2 CRR of Item 2 CCRR of Item 3, given Item 2 is incorrect π3|2 P(3|2) π3|2 CCRR of Item 3, given Item 2 is correct
Table 8.8 Parameters of network 2-3
Factor Parameter P(1) π1 CRR of Item 1 P(2) π2 CRR of Item 2 π3|12 CCRR of Item 3 given (Item 1, Item2) is (0, 0) P(3|1, 2)
π3|12
CCRR of Item 3 given (Item 1, Item2) is (0, 1)
π3|12
CCRR of Item 3 given (Item 1, Item2) is (1, 0)
π3|12
CCRR of Item 3 given (Item 1, Item2) is (1, 1)
j-th row in the parameter set lists the parameters for Item j, and the number of parameters is different for each item. For Network 2-2, because the JP is factored as P(1, 2, 3) = P(3|2)P(2)P(1), the parameters to be estimated are as listed in Table 8.7. The source items (i.e., Items 1 and 2) have one parameter. In addition, because Item 3 has one parent, the number of CCRRs to be estimated for this item is two. The parameter set is then ⎡
⎤ π1 ⎦. = ⎣ π2 π3|2 π3|2 Next, for Network 2-3, the JP is factored as P(1, 2, 3) = P(3|1, 2)P(2)P(1). Therefore, the network has six parameters, as listed in Table 8.8. A graph with more edges contains more parameters. Item 3, which has two parent items, has four parameters because a CCRR can be set for each parent item response pattern (PIRP). The parameter set is then
382
8 Bayesian Network Model
Table 8.9 Adjacency matrix of graph in Fig. 8.2 (identical to Table 8.1, p.353)
123 456 78 → Item 1 11 11 Item 2 1 1 Item 3 Item 4 11 Item 5 1 Item 6 Item 7 Item 8 Item 9 Indegree 0 1 1 0 1 2 3 1 ∗ Blank
9 Outdegree 2 2 2 2 0 1 1 1 1 1 0 11 2
cells represent 0.
⎡
⎤ π1 ⎦. = ⎣ π2 π3|12 π3|12 π3|12 π3|12 Because Networks 2-1, 2-2, and 2-3 are small graphs, it is easy to enumerate the parameters in the graph. However, it is not an easy task to list all parameters when a graph is large and complex. Finally, we enumerate the parameters for the graph in Fig. 8.2 (p.351). When a graph is large, the parameters should be specified using the adjacency matrix of the graph ( A). Table 8.9 shows the adjacency matrix for the graph in Fig. 8.2. The j-th column vector in the adjacency matrix indicates the parents of Item j. For example, the sixth column shows that Items 2 and 3 are the parents of Item 6. The PIRPs are thus (Item 2, Item 3) = {(0, 0), (0, 1), (1, 0), (1, 1)}. Corresponding to these four patterns, parameters {π6|23 , π6|23 , π6|23 , π6|23 }
(8.14)
are set for the CP of Item 6, i.e., P(6|2, 3). In addition, from the seventh column in adjacency matrix A, the parents of Item 7 are Items 3, 4, and 6. The PIRPs are then (Item 3, Item 4, Item 6) ={(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)}. Accordingly, eight parameters are specified as the CP for Item 7, i.e., P(7|3, 4, 6), corresponding to the above eight patterns, as follows:
8.3 Parameter Learning
383
{π7|346 , π7|346 , π7|346 , π7|346 , π7|346 , π7|346 , π7|346 , π7|346 }.
(8.15)
When data are binary, the number of parameters for an item with n parents is 2n . That is, when the number of parents for an item is 0, 1, 2, 3, · · · , the number of parameters for the item is then 20 = 1, 21 = 2, 22 = 4, 23 = 8, · · · . In addition, as shown in the bottom row of the table, the columnar sums of the adjacency matrix are the indegrees of the respective items. From Eq. (8.1) (p.354), the indegree vector is obtained by
d in
⎡ ⎤ 0 ⎢1⎥ ⎢ ⎥ = A 1 J = ⎢ . ⎥ = {din, j } (9 × 1). ⎣ .. ⎦ 2
From this vector, the total number of parameters for the DAG shown in Fig. 8.2 is 9
2din, j = 20 + 21 + · · · + 22 = 26.
j=1
Because the subscripts of the parameters are extremely complex, the parameters are renamed. Because the number of parameters for each item is equal to the number of PIRPs for the item, the number of parameters for Item j is first denoted as D j = 2din, j , and the parameters corresponding to the patterns are then represented as {π j1 , · · · , π j D j }. For example, the parameters in Eq. (8.14) are renamed as {π6|23 , π6|23 , π6|23 , π6|23 }
⇒
{π61 , π62 , π63 , π64 },
and the parameters in Eq. (8.15) are renamed as {π7|346 , π7|346 , π7|346 , π7|346 , π7|346 , π7|346 , π7|346 , π7|346 } ⇓ {π71 , π72 , π73 , π74 , π75 , π76 , π77 , π78 }. Consequently, the DAG in Fig. 8.2 has 26 parameters, as listed in Table 8.10, and the parameter set for the DAG is represented as follows:
384
8 Bayesian Network Model
Table 8.10 Parameters of graph in Fig. 8.2 (p.351)
Factor Parameter P(1) π11 (π1 ) π (π ) P(2|1) 21 2|1 π22 (π2|1 )
CRR of Item 1 CCRR of Item 2, given that Item 1 is incorrect (0)
π31 (π3|1 )
CCRR of Item 3, given that Item 1 is incorrect (0)
π32 (π3|1 )
CCRR of Item 3, given that Item 1 is correct (1)
P(3|1)
P(4) π41 (π4 ) π (π ) P(5|2) 51 5|2 π52 (π5|2 )
CCRR of Item 2, given that Item 1 is correct (1)
CRR of Item 4 CCRR of Item 5, given that Item 2 is incorrect (0) CCRR of Item 5, given that Item 2 is correct (1)
π61 (π6|23 ) CCRR of Item 6, given that Items (2, 3) are (0, 0) P(6|2, 3) π62 (π6|23 ) CCRR of Item 6, given that Items (2, 3) are (0, 1) π63 (π6|23 ) CCRR of Item 6, given that Items (2, 3) are (1, 0) π64 (π6|23 ) CCRR of Item 6, given that Items (2, 3) are (1, 1) π71 (π7|346 ) CCRR of Item 7, given that Items (3, 4, 6) are (0, 0, 0) π72 (π7|346 ) CCRR of Item 7, given that Items (3, 4, 6) are (0, 0, 1) π73 (π7|346 ) CCRR of Item 7, given that Items (3, 4, 6) are (0, 1, 0) P(7|3, 4, 6) π74 (π7|346 ) CCRR of Item 7, given that Items (3, 4, 6) are (0, 1, 1) π75 (π7|346 ) CCRR of Item 7, given that Items (3, 4, 6) are (1, 0, 0) π76 (π7|346 ) CCRR of Item 7, given that Items (3, 4, 6) are (1, 0, 1) π77 (π7|346 ) CCRR of Item 7, given that Items (3, 4, 6) are (1, 1, 0) π78 (π7|346 ) CCRR of Item 7, given that Items (3, 4, 6) are (1, 1, 1) P(8|4)
π81 (π8|4 )
CCRR of Item 8, given that Item 4 is incorrect (0)
π82 (π8|4 )
CCRR of Item 8, given that Item 4 is correct (1)
π91 (π9|78 ) CCRR of Item 9, given that Items (7, 8) are (0, 0) P(9|7, 8) π92 (π9|78 ) CCRR of Item 9, given that Items (7, 8) are (0, 1) π93 (π9|78 ) CCRR of Item 9, given that Items (7, 8) are (1, 0) π94 (π9|78 ) CCRR of Item 9, given that Items (7, 8) are (1, 1)
⎤
⎡
π11 ⎢π21 ⎢ ⎢π31 ⎢ ⎢π41 ⎢ = ⎢ ⎢π51 ⎢π61 ⎢ ⎢π71 ⎢ ⎣π81 π91
π22 π32 π52 π62 π63 π64 π72 π73 π74 π75 π76 π77 π82 π92 π93 π94
⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥. ⎥ ⎥ ⎥ π78 ⎥ ⎥ ⎦
8.3 Parameter Learning
385
The number of columns in the parameter set is equal to the maximum number of PIRPs. In this set, the number of columns is eight (= 23 ) because the number of parents for Item 7, which is three, is at maximum. Finally, the relationship between the numberof parent items (i.e., indegree) and PIRP is summarized in Indegree and PIRP . Indegree and PIRP As the number of parent items (i.e., indegree) of an item increases by 1, the number of parameters set for the item is doubled because when the indegree is 0, 1, 2, 3, · · · , the number of PIRPs is 1, 2, 4, 8, · · · . Thus, for an item with four parents, the number of parameters is 16 (= 24 ). However, not all 16 patterns are always observed, e.g., when the sample size is small. Specifically, PIRPs in which easy parent items fail and difficult parent items pass are unlikely to occur. In other words, even if the number of parent items increases by 1, the number of observed PIRPs does not usually double. This situation can occur even when the indegree is small. The number of PIRPs is 2 (= 21 ) when the indegree is 1. However, if the parent item is too easy and the sample size is small, all the students will pass the parent item, and the number of observed PIRPs will then become only 1. Therefore, only CCRR must be estimated, given that the parent item is correct (π j2 , or π j|k if the parent is Item k), whereas the CCRR is no longer necessary (or, rather, it cannot be estimated without an assumption such as setting a prior density, as described later), given that the parent item is incorrect (π j1 or π j|k ). In addition, π j|k is essentially the same as π j (CRR of Item j) in this case. As the indegree increases by 1, the number of observed PIRPs does not always double but steadily increases. For example, as the indegree increases from 3 to 4, the number of observed PIRPs may not double from 8 to 16 but increases from 8 to, for example, 12. That is, as the indegree increases, the number of observed PIRPs also increases, which increases the number of parameters. Therefore, it is practical to limit the indegree of each item to 3 to make the number of PIRPs ≤ 8.
8.3.2 Parent Item Response Pattern This section details PIRPs, which is an important BNM concept. Let the PIRP indicator of Student s for Item j be defined as γs jd =
1, if the response pattern of Student s for pa( j) is the d-th pattern. , 0, otherwise (8.16)
386
8 Bayesian Network Model
Fig. 8.15 Parent item response pattern
where pa( j) is the parent item set of Item j. In other words, the indicator denotes the PIRP that Student s took for the item set pa( j). If the pattern is d-th, γs jd = 1. Figure 8.15 exemplifies the PIRP of two students. The left figure shows Network 1 (Fig. 8.9, p.370). According to the DAG, the JP of the test can be factored as follows: P(1, 2, 3, 4, 5) = P(5|3, 4) × P(4|2) × P(3|2) × P(2|1) × P(1). In addition, the parameter set is specified as follows: ⎡
π11 ⎢π21 ⎢ = ⎢ ⎢π31 ⎣π41 π51
⎤ ⎥ π22 ⎥ ⎥. π32 ⎥ ⎦ π42 π52 π53 π54
(8.17)
Then, suppose that the item response patterns of two students are obtained as u1 = [0 1 1 0 1] , u2 = [1 0 1 1 1] . The top right table in the figure shows the PIRP of Student 1 (where blank cells represent 0), whereas the bottom right table shows that of Student 2.
8.3 Parameter Learning
387
Because Item 1 is a source, it has only one (= 20 ) PIRP (i.e., no parent items). Accordingly, for Students 1 and 2, the PIRP of Item 1 is the first (γ111 = 1 and γ211 = 1). Next, Item 2 has only one parent (i.e., Item 1); therefore, there are two (= 21 ) PIRPs: 0 and 1. For Student 1, the response to the parent item (Item 1) is 0 (incorrect), this PIRP is the first pattern; thus, γ121 = 1. Meanwhile, for Student 2, the response to the parent item was 1 (correct). Therefore, γ222 = 1 because the PIRP is the second pattern. Similarly, Item 5 has two parents (Items 3 and 4); thus, there are four (= 22 = 4) PIRPs: 00, 01, 10, and 11. The response pattern of Student 1 for Items 3 and 4 is 10; therefore, γ153 = 1 because this PIRP is the third pattern, whereas the response pattern of Student 2 is 11. Thus, γ254 = 1 because this PIRP is the fourth pattern. From the tables in the figure, the PIRPs of both students are sorted in a matrix as follows: ⎤ ⎤ ⎡ ⎡ 1 1 1 ⎥ ⎥ ⎢1 ⎢ 1 ⎥ ⎥ ⎢ ⎢ ⎥ and 2 = ⎢ 1 1 ⎥ . 1 (8.18) 1 = ⎢ ⎥ ⎥ ⎢ ⎢ ⎦ ⎦ ⎣ 1 ⎣1 1 1 The PIRPs of Student s can be expressed as s = {γs jd } (J × Dmax ). The ( j, d)-th element of this matrix is γs jd , as shown in Eq. (8.16), and the size of the matrix is J × Dmax , where Dmax = max{D1 , · · · , D J } = max{2din,1 , · · · , 2din,J }. That is, the maximum number of PIRPs is the number of columns in the matrix. For Network 1 (Fig. 8.15), the number of matrix columns reaches four because the indegree of Item 5 is the largest (= 2), and there are four (= 22 ) PIRPs for the item. PIRP specifies the data used to estimate the parameters. For example, the datum of Student 1 for Item 2 (u 12 = 1) is used when π21 (the CCRR, given parent Item 1 or π2|1 ) is estimated, but it is not used when π22 (the CCRR, given Item is correct, ::::: or = π2|1 ). Similarly, the datum of the student for Item 4 is used to 1 is incorrect, ::::::: estimate the CCRR (π42 ), given that the parent (i.e., Item 2) is correct, because the student passed Item 2. For the datum of the student for Item 5 (u 15 = 1), because the responses to the parents of Item 5 (i.e., Items 3 and 4) are (1, 0), the datum is used to estimate π53 (= π5|34 ). Accordingly, the parameters set in Network 1, to which the data of Student 1 can contribute, are
388
8 Bayesian Network Model
Fig. 8.16 Three-dimensional array of parent item response pattern
⎡
1 ⎢1 ⎢ 1 = ⎢ ⎢ 1 ⎣ 1 ⎡
⎤
⎡
⎤ π11 ⎥ ⎢π21 π22 ⎥ ⎥ ⎢ ⎥ ⎥ ⎢π31 π32 ⎥ ⎥ ⎢ ⎥ ⎦ ⎣π41 π42 ⎦ 1 π51 π52 π53 π54 ⎤ 1
π11 ⎢ π21 ⎢ π32 =⎢ ⎢ ⎣ π42
⎥ ⎥ ⎥. ⎥ ⎦
(8.19)
π53 π54
Figure 8.16 shows a three-dimensional array20 that collects the PIRPs of all S students. The unused cells in the array are shaded. As the table for Student 3 shows, if a student fails all items, all PIRPs are the first pattern, while for a student passing all items (e.g., Student S), all PIRPs become the last pattern. This three-dimensional array is called a PIRP array and is denoted as = {γs jd } (S × J × Dmax ). In this array, the γs jd of Eq. (8.16) is placed in the s-th layer, j-th row, and d-th column. In addition, the size of the array is S × J × Dmax .
20
A matrix is a two-dimensional array.
8.3 Parameter Learning
389
Fig. 8.17 Road to likelihood Data
1 2 3
PIRP
Likelihood
4 Parameter
5 Graph (DAG) Adjacency Matrix
8.3.3 Likelihood This section describes the derivation of the likelihood from the DAG. In a BNM, the likelihood is the probability constructed using the model (DAG) and the parameters set in the DAG. In this section, the network to be analyzed (DAG) is assumed to be predetermined. Exploring the optimal network is a type of structure learning, as described in Sect. 8.5. Once the DAG is fixed, the adjacency matrix A is uniquely derived (Fig. 8.17), and vice versa. In other words, they are different representations of the same thing. In addition, once the DAG (or adjacency matrix A) is fixed, the parameter set to be estimated ( ) is determined (Sect. 8.3.1, p.378). Moreover, the PIRP array is obtained from the DAG and the item response matrix U (Sect. 8.3.2, p.385). In this figure, U is explained later. The likelihood can then be constructed using the item response matrix U, parameter set , and PIRP array . Table 8.11 shows data obtained from the five-item test to be analyzed using Network 1. The file name for these data is J5S10. Although the data are unrealistic as test data because the sample size is extremely small, it is helpful to understand how to construct the likelihood from a small data sample size because the BNM likelihood is quite complex. Moreover, this table shows the PIRPs of the students. However, the PIRPs for each student in this table are represented not by a matrix (e.g., Eq. (8.18), p.387), but by a vector. For example, the PIRPs of Student 2 are represented as [1 2 1 1 4]. Using a vector of numbers from 1 to Dmax (= 4),
n Dmax
⎡ ⎤ 1 ⎢2⎥ ⎢ = n4 = ⎣ ⎥ , 3⎦ 4
the vector representations of the PIRP matrices for Students 1 and 2, u1 and u2 , are given as follows:
390
8 Bayesian Network Model
Table 8.11 Five-item data (J5S10)
Student 01 02 03 04 05 06 07 08 09 10
1 0 1 0 1 0 0 1 1 1 1
⎡
u1
1 ⎢ ⎢ = 2 n4 = ⎢ ⎢1 ⎣1
5 3 4 1 4 3 3 3 3 3 4
⎡ ⎤ ⎡ ⎤ 1 1 ⎥ ⎢1 ⎥ ⎥ ⎢2⎥ ⎢ ⎥ ⎥ ⎢ ⎥ = ⎢2⎥ ⎥ ⎣3⎦ ⎢ ⎥ ⎦ ⎣2⎦ 4 1 3 ⎤ ⎡ ⎤ ⎡ ⎤ 1 ⎥ 1 ⎢2⎥ 1 ⎥ ⎢2⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ 1 ⎥ ⎥ ⎣ 3⎦ = ⎢1⎥ ⎦ ⎣1⎦ 4 1 4
1 ⎢1 ⎢ = 1 n4 = ⎢ ⎢ 1 ⎣ 1 ⎡
u2
PIRP Matrix (U ) 1 2 3 4 1 1 2 2 1 2 1 1 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1 1 2 2 2 1 2 1 1 1 2 1 1 1 2 2 2
Item Response (U) 2 3 4 5 1 1 0 1 0 1 1 1 0 0 0 0 1 1 1 0 0 1 0 1 0 1 0 0 1 1 0 0 0 1 0 0 0 1 0 0 1 1 1 1
1
⎤
In other words, if the j-th element of the PIRP vector of Student s (us ) is d, the ( j, d)-th entry has a 1 in s . A matrix collecting the PIRP vectors of all students is referred to as the PIRP matrix and is denoted as U = {u s j } (S × J ), u s j = π j n Dmax , where π j is the (vertically arranged) j-th row vector of the parameter set .21 Although the PIRP matrix (U ) and the three-dimensional PIRP array () have the same information, the former is more summarized. All pieces are prepared such that the likelihood can be constructed. The likelihood is the product of the likelihood of each student. Let the likelihood of Student s be denoted as l(us | ), which also means the probability of observing us , given 21
Note again that the vectors in this book are vertical (column) vectors. The horizontal (row) vectors are represented with the prime (’).
8.3 Parameter Learning
391
parameter set (Eq. (8.17), p.386). The parameter set is currently unknown, and its estimation is the goal of parameter learning. The likelihood of all students is then expressed as follows: l(U| ) =
S
l(us | ).
(8.20)
s=1
This factorization reflects the natural assumption that the student data are mutually independent. For example, let us consider the likelihood of Student 1. The parameters used to construct the student’s likelihood are given in Eq. (8.19) and are reprinted as follows: ⎡
1 ⎢1 ⎢ G 1 = ⎢ ⎢ 1 ⎣ 1
⎡ π11 ⎥ ⎢π21 ⎥ ⎢ ⎥ ⎢π31 ⎥ ⎢ ⎦ ⎣π41 1 π51 1
⎤
⎤ π22 π32 π42 π52 π53 π54
⎡
π11 ⎥ ⎢ π21 ⎥ ⎢ ⎥=⎢ π32 ⎥ ⎢ ⎦ ⎣ π42
⎤ ⎥ ⎥ ⎥. ⎥ ⎦ π53 π54
In other words, although the data of Student 1 can be used to estimate these parameters, they do not contribute to estimating any of the other parameters. These parameters were selected by the PIRPs of the student. Note that only one parameter was selected for each row (i.e., item). In addition, from the item response vector of Student 1, ⎡ ⎤ ⎡ ⎤ 0 u 11 ⎢u 12 ⎥ ⎢1⎥ ⎢ ⎥ ⎢ ⎥ ⎥ ⎢ ⎥ u1 = ⎢ ⎢u 13 ⎥ = ⎢1⎥ , ⎣u 14 ⎦ ⎣0⎦ 1 u 15 the likelihood for the student is obtained as l(u1 | ) = (1 − π11 ) × π21 × π32 × (1 − π42 ) × π53 .
(8.21)
For passed/failed items, the CCRR (π )/CIRR (1 − π ) is incorporated into the likelihood. This rule is common among all models introduced in this book. Note that although the conditional part of the likelihood is written as for convenience, the parameters used in this likelihood are in fact the five parameters appearing on the RHS of this equation. In addition, this equation remains unchanged if expressed as follows: u 11 (1 − π11 )1−u 11 l(u1 | ) =π11 u 12 × π21 (1 − π21 )1−u 12 u 13 × π32 (1 − π32 )1−u 13 u 14 × π42 (1 − π42 )1−u 14 u 15 × π53 (1 − π53 )1−u 15 .
392
8 Bayesian Network Model
If the response to an item is correct (1) or incorrect (0), the CCRR (π ) or CIRR (1 − π ), respectively, is selected as the parameter for the item. Furthermore, in the parameter set, only one parameter was selected in each row, which can be expressed using the PIRP array for Student 1 ( 1 ) as follows: γ u γ 1−u 11 l(u1 | ) = π11111 11 1 − π11111 γ γ u γ γ 1−u 12 × π21121 · π22122 12 1 − π21121 · π22122 γ γ u γ γ 1−u 13 × π31131 · π32132 13 1 − π31131 · π32132 γ γ u γ γ 1−u 14 × π41141 · π42142 14 1 − π41141 · π42142 γ γ γ γ u γ γ γ γ 1−u 15 × π51151 π52152 · π53153 π54154 15 1 − π51151 π52152 · π53153 π54154 , where π jd is selected when γs jd = 1; otherwise, it is excluded when γs jd = 0. This equation can be summarized as follows: l(u1 | ) =
Dj J j=1
γ
π jd1 jd
u s j 1−u s j Dj γ 1− . π jd1 jd
d=1
d=1
In this representation, the appropriate parameter is first selected using the PIRPs for each item, and the CCRR or CIRR is then selected using the item response. However, this order is reversible. If the likelihood is constructed by first selecting the CCRR or CIRR using the item response and then selecting the parameter using the PIRPs, the likelihood is rewritten as l(u1 | ) =
J
u
π jd1 j (1 − π jd )1−u 1 j
γ1 j1
γ u × · · · × π jd1 j (1 − π jd )1−u 1 j 1 j D j
j=1 Dj J u1 j γ = π jd (1 − π jd )1−u 1 j 1 jd . j=1 d=1
This expression is preferred because of its simplicity. Accordingly, the likelihood for Student s is represented as l(us | ) =
Dj n
u π jds j (1
− π jd )
1−u s j γs jd
z s j
j=1 d=1
=
Dj n j=1 d=1
u
π jds j (1 − π jd )1−u s j
γs jd zs j
.
8.3 Parameter Learning
393
This expression also considers the missing data. Consequently, from Eq. (8.20), the likelihood for all students was obtained as follows: l(U| ) =
Dj S n
u
π jds j (1 − π jd )1−u s j
γs jd zs j
.
(8.22)
s=1 j=1 d=1
This equation indicates that the likelihood is the occurrence probability of the item response data (U) for a given parameter set ( ) or that of U under the DAG (i.e., Network 1). This is because, as Fig. 8.17 (p.389) shows, once the DAG to be analyzed is fixed, the parameter set ( ) to be estimated is uniquely derived.22 It is reasonable to find a parameter set value that increases the probability of occurrence (i.e., likelihood) because it is natural to consider that data occur because of high probability. The method of estimating by maximizing the likelihood (i.e., occurrence probability) is the maximum likelihood estimation (MLE) method, and the parameter set estimated using the MLE method is called the maximum likelihood estimate (also abbreviated as MLE).
8.3.4 Posterior Density In practice, a Bayesian estimation is preferred over the MLE. In a Bayesian estimation, the posterior density, which consists of the likelihood and prior density, is constructed, and the parameter set ( ) is estimated by maximizing the posterior density. The two main advantages of a Bayesian estimation are as follows. Advantages of Bayesian Estimation in BNM 1. It can give a more stable estimate of the parameter set. 2. It can estimate the parameters corresponding to the unobserved PIRPs. As Point 1 indicates, in a Bayesian estimation, the estimate for each element of the parameter set ( ) is less likely to be extreme, particularly when the sample size is small, which is an advantage. When an item is difficult but the sample size is small and almost all the students can pass the item, the CRR (and CCRR) for the item will be high. However, using prior knowledge that the item is difficult in a Bayesian estimation, the estimated parameter (CRR or CCRR) will not be excessively large. Moreover, several simulation studies have reported that the Bayesian method is better at recovering the true value of the parameter set (e.g., Suzuki, 1999), which means that the item response data (U) are first generated with a predetermined parameter set ( ); next, the reproduction of the parameter set is attempted under the condition that the generated item response data are given and the true value of the parameter set is unknown, and the Bayesian estimate is closer to the true value of the parameter set. 22
For the same reason, the likelihood is the probability of observing U under adjacency matrix A.
394
8 Bayesian Network Model
Regarding Point 2, the MLE cannot obtain a parameter whose PIRP was not observed. For example, when Item 1 is the parent of Item 2, the parameters of Item 2 are the CCRRs of Item 2 when Item 1 fails and passes: π21 and π22 (or π2|1 and π2|1 ). If all students pass Item 1, the MLE will not estimate the former parameter because no such cases exist. As the principle here, the MLE cannot estimate the parameters corresponding that are not observed. to the PIRPs Using Bayes’ Theorem (see p.105), the posterior density of the parameter set can be calculated as pr ( |U) =
l(U| ) pr ( ) ∝ l(U| ) pr ( ), pr (U)
where pr ( ) expresses the prior density for parameter set . In addition, pr (·) denotes a probability density function (PDF) for a continuous stochastic argument or a probability mass function for a discrete stochastic argument. In the above equation, pr (U) becomes constant because this term contains no unknown parameters. Therefore, it is negligible when maximizing the posterior density. The prior density for each parameter π jd should be the same, and the beta density (see Sect. 4.5.3, p.119) should be employed as the prior density, as the natural conjugate for π jd . The PDF of the beta density is β −1
pr (π jd ; β0 , β1 ) =
π jd1 (1 − π jd )β0 −1 B(β0 , β1 )
,
where
1
B(β0 , β1 ) =
t β1 −1 (1 − t)β0 −1 dt
0
is the beta function. In addition, β0 and β1 are the hyperparameters that determine the shape of the beta density, for which β0 = β1 = 1/2 is recommended in several studies (e.g., Haldane, 1931; Ordentlich & Cover, 1996; Suzuki, 1999). Because the priors of all parameters are independent, the prior for the parameter set becomes the simple product of each prior. That is, pr ( ; β0 , β1 ) =
Dj β −1 J π jd1 (1 − π jd )β0 −1 j=1 d=1
∝
Dj J j=1 d=1
B(β0 , β1 ) β −1
π jd1 (1 − π jd )β0 −1 .
8.3 Parameter Learning
395
Because the beta function in the denominator also becomes constant, it can be ignored. Consequently, the posterior density for the parameter set is obtained as pr ( |U) ∝l(U| ) pr ( ) ∝
Dj S n
u π jd1 j (1
− π jd )
1−u 1 j γs jd z s j
s=1 j=1 d=1
×
Dj J
β −1
π jd1 (1 − π jd )β0 −1 .
j=1 d=1
(8.23)
8.3.5 Maximum a Posteriori Estimate A Bayesian estimation gives the parameter set ( ), which maximizes the posterior density pr ( |U). However, maximizing it is extremely difficult because it is a product of numerous terms, whereas maximizing its logarithm is much easier because it is the sum of the terms.23 The logarithm in Eq. (8.23) is given by ln pr ( |U) =
Dj S n
γs jd z s j u 1 j ln π jd + (1 − u 1 j ) ln(1 − π jd )
s=1 j=1 d=1 Dj J + (β1 − 1) ln π jd + (β0 − 1) ln(1 − π jd ) j=1 d=1
=
Dj n
(U1 jd + β1 − 1) ln π jd + (U0 jd + β0 − 1) ln(1 − π jd )
j=1 d=1
=
Dj n
ln pr (π jd |u j ),
(8.24)
j=1 d=1
where u j is the j-th column vector in item response matrix U; that is, the item response vector for Item j.24 In addition, U1 jd =
S
γs jd z s j u s j ,
s=1
U0 jd =
S
γs jd z s j (1 − u s j ),
s=1
Because a logarithmic transformation is monotonic, the x that gives the maximum f (x) also gives the maximum ln f (x). 24 Note that not all data in u are necessary for estimating π ; Only data whose PIRP is d-th are j jd used. 23
396
8 Bayesian Network Model
where U1 jd and U0 jd are the numbers of students passing and failing Item j, respectively, out of all students whose PIRP for Item j is d-th. For the data (J5S10) in Table 8.11 (p.390), U1 jd s and U0 jd s are calculated as follows: ⎡
6 ⎢1 ⎢ U 1 = {U1 jd } = ⎢ ⎢5 ⎣1 0
⎡
⎤ 3 4 2 02
⎥ ⎥ ⎥ ⎥ ⎦ 2
and
4 ⎢3 ⎢ U 0 = {U0 jd } = ⎢ ⎢1 ⎣5 1
⎤ 3 0 2 04
⎥ ⎥ ⎥, ⎥ ⎦
(8.25)
1
where the cells with no corresponding parameters are shaded. For example, the (1, 1)-th element in U 1 was U111 = 6. This indicates that there are six students passing Item 1 out of all students whose PIRP for Item 1 was the first pattern (i.e., all 10 students because the parent set of Item 1 is empty). In addition, U121 = 1 in the (2, 1)-th element means that there is one student who passed Item 2 out of the four (= U021 + U121 ) students whose PIRP for Item 2 (response pattern to Item 1) was 0 (i.e., the first pattern). Moreover, the (5, 2)-th cells in both matrices were 0 (U052 = U152 = 0), which implies that the second pattern for the parents of Item 5 (Items 3, 4)=(0, 1) was not observed. The parameter for where the PIRP was not observed cannot be estimated using the MLE (see MLE in BNM ). For the data in Table 8.11, the parameter corresponding to the second PIRP of Item 5, π52 (= π5|34 ), cannot be estimated. Equation (8.24) is the objective function to be maximized,25 and we let the estimate for the parameter set ( ) that maximizes the objective function be denoted ˆ . The estimate that maximizes the posterior density is called the maximum a by posteriori (MAP) estimate. In addition, as Eq. (8.24) shows that the parameters are mutually disjointed in the log-posterior density, which can be optimized by optimizing each individual parameter. Thus, maximizing the log-posterior density of each parameter ln pr (π jd |u j ) is identical to maximizing the entire log-posterior density, ln pr (G |U). When the hyperparameters are set to (β0 , β1 ) = (1, 1), for example, the logposterior density functions for π11 , π21 , and π54 are ln pr (π 11 |u1 ) = 6 × ln π11 + 4 × ln(1 − π11 ), ln pr (π 21 |u2 ) = 1 × ln π21 + 3 × ln(1 − π21 ), ln pr (π 54 |u5 ) = 2 × ln π54 + 1 × ln(1 − π54 ), respectively. In addition, these functions are plotted in Fig. 8.18.
25
Maximizing (or minimizing) an objective function to obtain the parameter(s) is called optimization.
8.3 Parameter Learning
397
Maximum Likelihood Estimation in BNM Without the prior density for the parameter set, the logarithm of the likelihood shown in Eq. (8.22) is given as ll(U| ) =
Dj S n
γs jd z s j u 1 j ln π jd + (1 − u 1 j ) ln(1 − π jd )
s=1 j=1 d=1 Dj n = U1 jd ln π jd + U0 jd ln(1 − π jd ) j=1 d=1
=
Dj n
ll(u j |π jd )
(8.26)
j=1 d=1
A method for finding the parameters that maximize the log-likelihood is maximum likelihood estimation method, and the parameters obtained using an MLE are referred to as the maximum likelihood estimates. In an MLE, each parameter can be optimized individually. The only difference between Eqs. (8.24) and (8.26) (i.e., the log-posterior and log-likelihood) is the inclusion of constants β0 − 1 and β1 − 1 regarding the hyperparameters in the prior density. A Bayesian estimation is thus identical to the MLE when (β0 , β1 ) = (1, 1). In the log-likelihood of Eq. (8.26), when the d-th PIRP for Item j is not observed, U0 jd and U1 jd are 0, which leads to the log-likelihood of π jd being a constant equal to 0. That is, ll(u j |π jd ) = 0. Then, the parameters that maximize the log-likelihood cannot be found because there is no parameter and thus no concavity in the log-likelihood. However, using a Bayesian estimation, even if U0 jd = U1 jd = 0, the logposterior density does not become constant, as follows: ln pr (π jd |u j ) = (β1 − 1) ln π jd + (β0 − 1) ln(1 − π jd ). However, it should be questioned whether the estimate is meaningful. Nevertheless, as described in Advantages of Bayesian Estimation in BNM (p.393), Bayesian estimates are more stable, particularly when the sample size is small. This is because, as the number of parent items increases, the number of PIRPs increases, and with this, thenumber of students in each PIRP decreases (see also Indegree and PIRP , p.385).
398
8 Bayesian Network Model
Fig. 8.18 Log-likelihood function
By differentiating the function with respect to the parameter, setting the derivative equal to 0, and solving the equation with respect to the parameter, the parameter that maximizes the objective function can be found. To differentiate the function is to obtain the slope of the function because the function reaches the maximum when the slope is 0.26 By differentiating Eq. (8.24) with respect to π jd and setting it to 0, d ln pr (π jd |u j ) U1 jd + β1 − 1 U0 jd + β0 − 1 = − = 0. dπ jd π jd 1 − π jd is obtained. Then, solving this equation with respect to π jd , we have A P) = πˆ (M jd
U1 jd + β1 − 1 . U0 jd + U1 jd + β0 + β1 − 2
This value is the MAP estimate for π jd , which maximizes the log-posterior density. Unless the prior density is assumed, the estimate is obtained as follows: L) = πˆ (M jd
U1 jd . U0 jd + U1 jd
This is the MLE of the parameter that maximizes the log-likelihood, which implies that the MAP is reduced to MLE when (β0 , β1 ) = (1, 1). In addition, they approximate each other as U0 jd and U1 jd both increase. This equation also suggests the vulnerability of the MLE because it is not computable owing to division by zero when the PIRP is not observed (U0 jd = U1 jd = 0). The MAP of the parameter set is obtained immediately by the following matrix operation27 : 26
Because this objective function is concave upward and unimodal, the objective function is at its global maximum when the slope is 0. 27 “” is the Hadamard division (p.xvi).
8.3 Parameter Learning
399
A P) ˆ (M = U 1 + (β1 − 1)1 J 1Dmax U 0 + U 1 + (β0 + β1 − 2)1 J 1Dmax . Consequently, from Eq. (8.25), the MAP of the parameter set with (β0 , β1 ) = (1, 1) (i.e., MLE) is obtained as ⎡
⎤ πˆ 11 ⎢ πˆ 21 πˆ 22 ⎥ ⎢ ⎥ ⎥ ˆ = ⎢ πˆ 31 πˆ 32 ⎢ ⎥ ⎣ πˆ 41 πˆ 42 ⎦ πˆ 51 πˆ 52 πˆ 53 πˆ 54 ⎡ ⎤ ⎛⎡ 4 6 ⎢1 3 ⎥ ⎜⎢ 3 3 ⎢ ⎥ ⎜⎢ ⎥ ⎜⎢ 1 0 =⎢ ⎢5 4 ⎥ ⎜⎢ ⎣1 2 ⎦ ⎝⎣ 5 2 104 0022 ⎡ ⎤ 1/2 ⎢ 1/4 1/2 ⎥ ⎢ ⎥ ⎥, 5/6 1 =⎢ ⎢ ⎥ ⎣ 1/6 1/2 ⎦ 0 — 1/3 2/3
⎤
⎤⎞
⎡
6 ⎥ ⎢1 ⎥ ⎢ ⎥ + ⎢5 ⎥ ⎢ ⎦ ⎣1 0 1
3 4 2 02
⎥⎟ ⎥⎟ ⎥⎟ ⎥⎟ ⎦⎠ 2
where “—” in the (5, 2)-th cell indicates that it was impossible to estimate π52 because of division by zero. Figure 8.19 shows the parameter estimates written for the DAG. The table next to Item 1 shows the CRR of this source item (π11 = 0.600), which also means that
Fig. 8.19 Parameter learning result for DAG
400
8 Bayesian Network Model
the incorrect response rate is 0.400. The table for Item 2 shows the CCRR for each pattern of the parent item (i.e., Item 1); in addition, the CCRR of Item 2 is π21 = 0.250 when Item 1 is incorrect and π22 = 0.500 when Item 1 is correct. In the test data, the CCRR of a child item (i.e., successor) becomes large when the parent item (i.e., predecessor) of the child is passed. The greater the difference between the CCRRs of a child item when the parent item is passed and failed, the more likely the parent item measures the basic ability of the child item. The table next to Item 5 shows the CCRRs of the item. From the table, when the response pattern to parent Items 3 and 4 is (0, 0), the CCRR of Item 5 is 0. In addition, the CCRR was 0.667 when the PIRP was (1, 1). This suggests that to pass Item 5, students must be able to pass either Items 3 or 4, or both items. In addition, because no students failed Item 3 and passed Item 4 (U052 = U152 = 0), the CCRR of the pattern was not estimated. If the hyperparameters are set as β0 = β1 = β, then the parameter is estimated as π52 =
1 β −1 = . 2β − 2 2
That is, if both hyperparameters are set to the same value, the CCRR for the PIRP that was not observed is 1/2; however, this value has little practical meaning.
8.4 Model Fit It is necessary to evaluate the fitness of the model after parameter learning. By doing so, one can examine whether the DAG (Network 1) fits the data. The fitness of the BNM model can be determined in a similar manner as discussed in previous chapters. The model under the current analysis (i.e., analysis model) was compared to a well-fitting model (i.e., benchmark model) and a poor-fitting model (i.e., null model). The analyst can specify both the benchmark and null models, and the same models as in the other chapters are also employed in this chapter; that is, the benchmark model is a multigroup model where the number of groups is the number of observed NRS patterns, and the null model is a single-group model.
8.4.1 Analysis Model Likelihood l(U| ) is the basis for evaluating the fitness of the analysis model, and it is the probability that the data occur under the model (DAG) and its parameters. Therefore, likelihood falls in the range [0, 1], and when the likelihood is closer to 1, the data (U) are more likely to occur given the DAG and its parameter set ( ). ˆ ) can be obtained as a real ˆ , l(U| Because the parameter set is estimated as number. However, it is difficult to calculate because it is a product of numerous terms;
8.4 Model Fit
401
thus, the log-likelihood is computed. From Eq. (8.26) (p.397), the log-likelihood of the analysis model, ll A , is calculated as ˆ ) = ll A (U|
Dj n U1 jd ln πˆ jd + U0 jd ln(1 − πˆ jd ) = −26.41.
(8.27)
j=1 d=1
As the structure of this equation suggests, the log-likelihood is computable for each item; however, unlike IRT, LCA, and LRA, in a BNM, it is not necessary to calculate the fitness of each individual item because that of the interitem structure is essential. The number of parameters set in the analysis model is 11, as shown in Eq. (8.17) (p.386), which is also the sum of the number of PIRPs for each item, as follows: N of elements in =
J
Dj =
j=1
J
2din, j = 11.
j=1
However, because the PIRP of Item 5, (Items 3, 4)= (0, 1), was not observed in the data (U052 = U152 = 0), π52 could not be estimated. Thus, the number of parameters (M A P) can be obtained using was ten in this analysis. Although the MAP for π52 , πˆ 52 a Bayesian estimation, it is not included in the number of parameters. Furthermore, in the log-likelihood given by Eq. (8.27), the term regarding π52 does not contribute to the entire log-likelihood because the term vanishes because U052 = U152 = 0. Conversely, if U0 jd + U1 jd > 0, π jd can be included in the number of parameters. The number of students whose PIRP of Item j is d-th is obtained as follows: U0 jd + U1 jd =
S
{γs jd z s j u s j + γs jd z s j (1 − u s j )}
s=1
=
S
γs jd z s j
s=1
= U jd , Therefore, the number of parameters is given by NP =
Dj J j=1 d=1
sgn(U jd ) =
Dj J j=1 d=1
sgn(U0 jd + U1 jd ).
402
8 Bayesian Network Model
Student 01 02 03 04 05 06 07 08 09 10
0.6
1 0.4
0.9
2
3 0.4
0.3
4
5
1 0 1 0 1 0 0 1 1 1 1
2 1 0 0 1 0 0 1 0 0 1
U 3 1 1 0 1 1 1 1 1 1 1
4 0 1 0 1 0 0 0 0 0 1
5 1 1 0 0 1 0 0 0 0 1
1 1 1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1 1 1
UΓ 3 1 1 1 1 1 1 1 1 1 1
4 1 1 1 1 1 1 1 1 1 1
5 1 1 1 1 1 1 1 1 1 1
Fig. 8.20 DAG and PIRP of the null model
8.4.2 Null Model This section describes the null model, which is a single-group model that fits the data worse than the analysis model.28 Figure 8.20 (left) illustrates the DAG for the null model. As shown, the DAG of the null model does not contain an edge. A graph with no edges is called an empty graph or an edgeless graph. The DAG for the null model is thus an empty graph with J vertices29 ; in other words, all items are mutually independent. The adjacency matrix for such an empty graph becomes a zero matrix (i.e., A = O). The JP of the null model with J items is factored as P(1, 2, 3, 4, 5) = P(5) × P(4) × P(3) × P(2) × P(1). In addition, the number of PIRPs for each item is 1 (= 20 ) because the indegree of each item is 0. Therefore, the PIRPs of all students for all items become the first pattern, as shown in Fig. 8.20 (right). Thus, each item has only one parameter, that is, the CRR (not the CCRR), and the MLE of the parameter set is obtained as follows: ⎤ ⎡ ⎤ ⎡ ⎤ p1 0.6 πˆ 11 ⎢πˆ 21 ⎥ ⎢0.4⎥ ⎢ p2 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎥ ⎢ ⎥ ⎢ ⎥ =⎢ ⎢πˆ 31 ⎥ = ⎢0.9⎥ = ⎢ p3 ⎥ = p. ⎣πˆ 41 ⎦ ⎣0.3⎦ ⎣ p4 ⎦ 0.4 πˆ 51 p5 ⎡
L) ˆ (M
The parameter set for the null model is the item CRR vector ( p). In the DAG, the number shown in the upper-right corner of each item is the CRR estimate. The DAG 28 29
The fit of the null model can be better if that of the analysis model is extremely poor. In graph theory, a graph with no vertices, and thus no edges, is referred to as a null graph.
8.4 Model Fit
403
clearly shows that the number of parameters for the null model is equal to the number of items, J . Consequently, the likelihood of the null model, l N , is represented as l N (U| p) =
S J
u
p j s j (1 − p j )1−u s j ,
s=1 j=1
and the log-likelihood of the null model, ll N , is computed as ll N (U| p) =
S J
u s j ln p j + (1 − u s j ) ln(1 − p j ) = −28.88.
(8.28)
s=1 j=1
8.4.3 Benchmark Model The benchmark model is a multigroup model in which the number of groups is equal to the number of observed NRS patterns, and it fits the data better than the analysis model.30 The number of NRS patterns is J + 1 (0, 1, 2, · · · , J ) when the number of items is J ; however, not all the patterns are always observed. For example, there are no zero markers if a test is extremely easy. For the analysis data (Table 8.11, p.390), the number of items was five. Therefore, the maximum number of NRS patterns was six. Because all the patterns were observed, the students could be classified into six groups. In Fig. 8.21 (right), U ∗ shows the data sorted in ascending order according to the NRS pattern. For example, Student 3 with NRS = 0 belongs to Group 1 alone, and Students 5, 8, and 9 with NRS = 2 are classified in Group 3. As shown in Fig. 8.21 (left), for the benchmark model, the DAG for each NRS group is an empty graph. That is, the JP for Group g is factored as Pg (1, 2, 3, 4, 5) = Pg (5) × Pg (4) × Pg (3) × Pg (2) × Pg (1). This equation implies that the parameters for Group g are the CRR vector of the group (i.e., pg ). Figure 8.21 (left) also describes the estimates of the parameters. For example, in Group 3 (students whose NRS = 2), the CRR of Item 1 was 2/3 because two out of three students belonging to the group passed the item. Consequently, the parameter estimate for the entire benchmark model can be obtained as follows: ⎤ p11 · · · p1G ⎥ ⎢ P G = ⎣ ... . . . ... ⎦ (J × G). pJ1 · · · pJ G ⎡
30
The fit of the benchmark model can be poor if that of the analysis model is extremely good.
404
8 Bayesian Network Model
Fig. 8.21 DAG and PIRP of the benchmark model
This matrix is identical to the group reference matrix in the LRA (Chap. 6, p.191) and biclustering (Chap. 7, p.259). The g-th column vector in this matrix was pg . Moreover, the number of parameters of the benchmark model was J G, which is the same as the number of elements in the matrix. To calculate the log-likelihood of the benchmark model, the group membership matrix (M G ) was initially specified. When the number of NRS groups is G, the size of the matrix is S × G, and the (s, g)-th element is m sg =
1, if student s belongs to group g . 0, otherwise
8.4 Model Fit
405
The matrices for the sorted data (U ∗ ) and the original (i.e., unsorted) data (U), M ∗G and M G are obtained as follows: ⎡
1 ⎢0 ⎢ ⎢0 ⎢ ⎢0 ⎢ ⎢0 M ∗G = ⎢ ⎢0 ⎢ ⎢0 ⎢ ⎢0 ⎢ ⎣0 0
0 1 0 0 0 0 0 0 0 0
0 0 1 1 1 0 0 0 0 0
0 0 0 0 0 1 1 0 0 0
0 0 0 0 0 0 0 1 1 0
⎤ 0 0⎥ ⎥ 0⎥ ⎥ 0⎥ ⎥ 0⎥ ⎥, 0⎥ ⎥ 0⎥ ⎥ 0⎥ ⎥ 0⎦ 1
⎡ 0 ⎢0 ⎢ ⎢1 ⎢ ⎢0 ⎢ ⎢0 MG = ⎢ ⎢0 ⎢ ⎢0 ⎢ ⎢0 ⎢ ⎣0 0
0 0 0 0 0 1 0 0 0 0
0 0 0 0 1 0 0 1 1 0
1 0 0 0 0 0 1 0 0 0
0 1 0 1 0 0 0 0 0 0
⎤ 0 0⎥ ⎥ 0⎥ ⎥ 0⎥ ⎥ 0⎥ ⎥. 0⎥ ⎥ 0⎥ ⎥ 0⎥ ⎥ 0⎦ 1
Based on group membership matrix M G , the likelihood for the benchmark model, l B , is constructed as l B (U| P G ) =
S G J
u
p jgs j (1 − p jg )1−u s j
m sg
.
s=1 g=1 j=1
Owing to the group memberships, the data on each student contribute only to the likelihood of the group to which the student belongs. Accordingly, the log-likelihood of the benchmark model, ll B , is computed as ll B (U| P G ) =
S G J
m sg u s j ln p jg + (1 − u s j ) ln(1 − p jg ) = −8.94.
s=1 g=1 j=1
The likelihood takes a value in the interval [0, 1], and the closer the likelihood is to 1, the better the model fits. Thus, the log-likelihood takes a value between −∞ (= ln 0) and 0 (= ln 1), and the closer the log-likelihood is to 0, the better fitting the model is. Comparing the log-likelihood of the benchmark model with those of the analysis and null models, we obtain the following relationship: ll N < ll A < ll B < 0, which shows that the fitness of the benchmark model is the best among the three models, followed by that of the analysis model, with the null model being the worst.
406
8 Bayesian Network Model
Fig. 8.22 χ 2 values of analysis and null models (identical to Fig. 7.21, p.313)
8.4.4 Chi-Square Statistic and Degrees of Freedom The fitness of a model is evaluated based on its distance from the benchmark model (Fig. 8.22). More specifically, twice the difference from the log-likelihood of the model to that of the benchmark model is the χ 2 value of the model. That is, the χ 2 values of the analysis and null models are as follows: ˆ )} χ A2 =2 × {ll B (U| P G ) − ll A (U| =2 × {−8.94 − (−26.41)} =34.95 χ N2 =2 × {ll B (U| P G ) − ll N (U| p} =2 × {−8.94 − (−28.88)} =39.89 The χ 2 value of the model represents the distance from the model to the benchmark model. Thus, as the χ 2 value decreases, the model moves closer to the benchmark model, which means that the model fits the data. The χ 2 value of the analysis model is smaller than that of the null model, and this holds here as well (χ A2 < χ N2 ). In addition, the DF of the model is the difference between the number of parameters in the model and the benchmark model. Thus, the DFs of the analysis and null models are obtained as d fA = JG −
Dj J
sgn(U jd ) = 20,
j=1 d=1
d f N = J (G − 1) = 25. The DF of a model represents its parsimony. The more parameters a model contains, the more flexible the model fitting the data. That is, the good/poor fit of the benchmark/null model is attributed to the number of J G/J parameters. Usually, the number of parameters in the analysis model is smaller and larger than that of the benchmark and null models, respectively, which also holds here (0 < d f A < d f N ).
8.4 Model Fit
407
Table 8.12 Model fit indices (BNM) χ 2 and d f ll B −8.94 ll N −28.88 ll A −26.41 χ N2 39.89 χ A2 34.95 d fN 25 d fA 20
Standardized Index InformationCriterion NFI 0.124 AIC −5.05 −13.01 RFI 0.000 CAIC −11.10 IFI 0.248 BIC TLI 0.000 CFI 0.000 RMSEA 0.288
8.4.5 Section Summary Using the χ 2 values and DFs of the analysis and null models, various model-fit indices were calculated. The indices are roughly grouped into standardized indices and information criteria. To calculate them, refer to Standardized Fit Indices (p.321) and Information Criteria (p.326). Table 8.12 shows the fit indices for Network 1. For the standardized indices, the NFI–CFI take a value within the range of [0, 1], and when the indices are approximately 1, the model is a good fit. The range of RMSEA is [0, ∞), and a value of approximately 0 indicates that the fit of the model is poor. From the table, it cannot be said that Network 1 (Fig. 8.19, p.399) fits the data (J5S10). The information criteria are used to compare the fit of multiple models. Each information criterion represents the efficiency of the model, and the efficiency is good when χ 2 is small (i.e., the model is a good fit) even if the DF is large (i.e., the number of parameters is small). As the information criteria decrease, the model is more efficient. Note that information criteria are not needed when there is only one analysis model.
8.5 Structure Learning Sections 8.3 and 8.4 described the parameter learning and model-fit evaluation, respectively. Note that these were conducted under the condition that the network (DAG) had already been fixed. Usually, however, the (true) network induced from data must be unknown when the data are acquired. The structure learning described in this section is a stage for exploring the optimal DAG for the data. The basis of the search is data fitness. That is, the optimal DAG is selected from numerous model candidates by referring to the fit indices. BIC is often used in the selection process.
408
8 Bayesian Network Model
Structure learning is not an easy problem to solve because there are numerous candidate DAGs.31 To estimate the number, first, when the number of items is J , the number of item pairs is J C2
=
J (J − 1) J! = . 2!(J − 2)! 2
In addition, there are three relationships between each item pair. For example, the relationships between Items j and k are as follows32 : Status in Graph Status in A k a jk = ak j = 0 j j → k
a jk = 1 ∧ ak j = 0
j ← k
a jk = 0 ∧ ak j = 1
From this, the maximum number of candidate graphs is N of simple graphs = 3 J (J −1)/2 . This number gives 3, 27, 729, 59049, 14348907, . . . when J = 2, 3, 4, 5, 6, · · · . Note that because this number includes cyclic graphs, it is the number of simple graphs when the number of items is J . DAG, the labels of the items can be topologically sorted (see When a graph is a Topological Sorting , p.371). If this can be achieved, the children of each item are labeled higher than the item. Then, in the adjacency matrix, the cells where 1 can be entered are limited to the strictly upper triangular cells of the matrix, as follows33 : ⎡
⎤
0 ⎢0 ⎢ A=⎢. ⎣ ..
0 .. . . . . 0 0 ···
⎥ ⎥ ⎥. ⎦ 0
Accordingly, if the item labels can be topologically sorted (i.e., the graph is a DAG), the number of graphs to be examined can be drastically reduces to N of DAGs = 2 J (J −1)/2 .
31
Refer also to Section 9.4.2 (p.468). j ↔ k is a multi-edge, meaning j → k and j ← k . Therefore, this pattern is excluded. 33 If the main diagonal elements of an upper/lower triangular matrix are 0, the matrix is called strictly upper/lower triangular. 32
8.5 Structure Learning
409
This number is 2, 8, 64, 1024, 32768, · · · when J = 2, 3, 4, 5, 6, . . .. For the test data, it is easy to topologically sort the items. The simplest method is to relabel items in descending order of CRR (the easiest as Item 1 and the hardest as Item J ), because it is unnatural for the parent(s) of an item to be harder than the item. However, because a typical test has 20–50 items, the number of DAGs to be explored is still extremely large. Thus, it is impossible to explore all DAGs and identify the best one. Therefore, one can only find a DAG that is not the best fit, but that fits satisfactorily.
8.5.1 Genetic Algorithm Several methods, such as the K2 algorithm (Cooper and Herskovits, 1992), the SGS algorithm (Spirites et al., 1993), the PC algorithm (Spirites et al., 2010), the MMHC algorithm (Tsamardinos et al., 2006), and the RAI algorithm (Yehezkel and Lerner, 2009), are effective for structural learning. In addition, methods based on a genetic algorithm (GA; Holland, 1975; Goldberg, 1989; Michalewicz, 1996; De Jong, 2006) have been proposed by Larranãga et al. (1996a, 1996b), Wong et al. (1999), and Fukuda et al. (2014). The GA simplifies the evolution mechanism and is effective for solving combinatorial optimization problems. This problem is not as simple as finding the optimal (continuous) value that maximizes a continuous function (as shown in Fig. 8.18, p.398) and is a difficult problem involving finding the optimal combination of qualitative (discrete) variables; that is, the aim is to determine the best combination of discrete status (i.e., no edge/edge) regarding the relation between each item pair of all J (J − 1)/2 pairs. Figure 8.23 shows the general framework of GA. The GA starts at Generation 1. Suppose there are I individuals who have a different adjacency matrix in Generation (t) (t) t, { A(t) 1 , A2 , · · · , A I }, as shown in (1). The number of individuals (I ) is denoted as the population. The adjacency matrix of each individual is a different binary pattern and is regarded as individual genes (i.e., chromosomes). The gene loci are the strictly upper triangular cells in the adjacency matrix. Therefore, the number of loci is equal to J (J − 1)/2 when the number of items is J . It is therefore reasonable to consider the adjacency matrix as an individual gene because it uniquely corresponds (1) (1) to a DAG. The individuals in the first generation { A(1) 1 , A2 , · · · , A I } may be generated randomly. Selection in (2) is the next stage in which the fitness of each individual gene (i.e., adjacency matrix) is evaluated. Although the fitness depends on the purpose of each analysis, information criteria such as the BIC and AIC are frequently used for selecting fitter individuals (i.e., survivors). This process selects survivors to generate individuals during the next generation (t + 1). Assuming that r S ∈ (0, 1) is the survival rate, the number of survivors (I S ) is34 34
Note that if I S is not an integer, it is rounded off.
410
8 Bayesian Network Model
Fig. 8.23 Framework of genetic algorithm
I S = r S × I. There are several methods of selecting I S survivors, including roulette wheel selection, rank selection, and tournament selection. The simplest method is to select I S survivors from I individuals (i.e., the population) to increase the BIC order. The (I − I S ) individuals who are not selected are discarded. In addition, crossover in (3) is the process of creating I individuals in Generation t + 1 from I S survivors in Generation t. When an individual is created, two survivors are randomly selected, and the genes of both survivors are randomly mixed. At this point, one can program a scenario such that better-fitting survivors are given more chances to produce children, which accelerates the convergence and increases the possibility of getting stuck in the local optima. Here, convergence means that all individuals in a generation have the same gene. Uniform crossover randomly mixes ) two genes.35 In the figure, the gene for Individual 1 in Generation t + 1 (i.e., A(t+1) 1 is created using a uniform crossover from Individuals 3 and 1 in Generation t. In a crossover, elitism selection is often used to preserve some top elites with the best fitness in the new generation without changing their genes to prevent the loss of the best-fitting individuals during a crossover. Note that the number of preserved elites (I E ) must be predetermined. This method also prevents the fitness of the best individual in each generation from retrogressing. Another operation often used in a crossover is a mutation. This method changes each genetic locus with a certain probability (r M ) for each individual. Let the ( j, k)th (s.t. j < k) element of the gene for Individual i in Generation t + 1 (i.e., Ai(t+1) ) be defined by 35
Note that single-point crossover and two-point crossover are also frequently used.
8.5 Structure Learning
411
(t+1) ai,(t+1,new) = (1 − B M )ai,(t+1) jk jk + B M (1 − ai, jk ) a (t+1) , if B M = 0 = i, jk (t+1) , 1 − ai, jk , if B M = 1
where B M is a Bernoulli random variable, equaling 1 with probability r M and equaling 0 with probability (1 − r M ). That is,36 B M ∼ Ber n(r M ). Although the value of genetic locus ai,(t+1) jk remains unchanged with probability 1 − r M , it changes to 1/0 if the original value is 0/1 with probability r M . If the genes of all individuals in a generation are extremely similar, those of individuals in the next generation will not vary. This is a drawback, particularly for the early generation trying various gene patterns. This operation can produce a new gene pattern. Termination Conditions of Genetic Algorithm 1. When the fitness of the best individual in a generation meets a predetermined criterion 2. When the average fitness in a generation meets a certain criterion 3. When the difference in fitness of the best individuals between successive generations falls within a certain criterion 4. When the fitness of the best individual does not change for a certain number of generations When generations are repeated, well-fitting individuals become predominant in a given generation. Meanwhile, the GA should be terminated when a predetermined condition is satisfied. Some typical examples of the condition are listed in Termination Conditions of GA . In addition, Estimates of A (GA) also list some ˆ typical methods for determining the final estimate for the adjacency matrix (A). Estimates of A (GA) 1. The adjacency matrix of the optimal individual at the final generation. 2. The rounded average matrix of the adjacency matrices at the final generation. 3. The rounded average matrix of the adjacency matrices for the survivors at the final generation.
36 X ∼ D means “X is randomly sampled from distribution D.” Flipping a coin once is a random sampling from Ber n(1/2). Rolling a dice and getting a number > 4 is a random sampling from Ber n(1/3).
412
8 Bayesian Network Model
Suppose that GA terminates at Generation T . Point 1 indicates that the gene of ) (T ) (T ) (T ) the best individual in Generation T { A(T 1 , A2 , · · · , A I }, A Best , is specified as the estimate of the adjacency matrix, as follows: ) Aˆ = A(T Best .
In addition, Point 2 indicates that when the average matrix of the adjacency matrices in Generation T is I ) 1 (T ) A¯ = a¯ (T = A(T ) , jk I i=1 i
the rounded matrix represents the estimated adjacency matrix as follows37 : ) Aˆ = {aˆ jk } = round a¯ (T . jk Finally, Point 3 is different from Point 2 in calculating the average adjacency matrix. The average adjacency matrix at Point 3 is calculated from the adjacency matrices for the survivors in the final generation. Notably, there are many methods other than those listed in the box, and the GA analysis yields different results even if the same data are analyzed under the same settings because they contain random operations. Although various DAGs can be tested using the GA, not all gene patterns can be examined; thus, the final DAG chosen may not be the best solution among all possible DAGs. However, in practice, it is sufficient to find only one good-fitting individual because the DAG that best fits the data is not necessarily the best one in the real world. Because data naturally do not contain any external information (i.e., knowledge and experience) (see Importance of Knowledge about Outside Data , p.255), the DAG that best fits the data is likely to be a DAG that overfits the internal information and may not adequately explain the external world (i.e., reality or phenomena). Accordingly, the DAG obtained using the GA is often slightly unusual to subject experts; thus, the final structure is usually required to be reviewed and slightly modified. must at least specify the settings listed in In summary, to conduct a GA, one Settings for Executing Simple GA . However, there are many variations, and this is an example.
37
Note that round(·) is the rounding function.
8.5 Structure Learning
413
Settings for Executing Simple GA 1. Gene: Adjacency matrix A is specified in this analysis. The loci are the strictly upper triangular cells in A under the condition that the item labels are topologically sorted. Each locus takes a value of 0 (no edge) or 1 (edge). 2. Population: The number of individuals (I ) in a generation 3. Survival rate: The rate of survivors (r S ) in the population 4. Fitness evaluation: The index (such as BIC and AIC) used to evaluate the fitness of each individual 5. Selection: Rank, roulette wheel, tournament, elitism selections, etc. 6. Crossover: Uniform, single-point, two-point crossovers, etc. 7. Mutation rate: The rate at which each genetic locus changes (r M ) 8. Termination condition: See Termination Conditions of GA (p. 411) 9. Estimate of A: See Estimates of A (GA)
8.5.2 Population-Based Incremental Learning This section introduces the method proposed by Fukuda et al. (2014), which is an application of population-based incremental learning (PBIL; Baluja, 1994), into BNM structure learning. As shown in Fig. 8.24, the PBIL learning progresses through generations, which is the same as in a typical GA. However, instead of waiting for highly adapted individuals to be randomly generated, the average gene of each generation evolves through the generations. The upper left corner in Fig. 8.24 shows the generational gene in Period t. This generational gene is the average gene of all the individuals in Period t. Because the ( j, k)-th elements of all individuals are binary, the ( j, k)-th element in the generational gene (i.e., the average adjacency matrix) is the probability of Item j → k. In the figure, (1) shows that I individuals are generated from this t-th generational ¯ (t) ).38 Specifically, the ( j, k)-th (s.t. j < k) element of Individual i A(t) is gene ( A i determined as ai,(t)jk ∼ Ber n a¯ (t) jk , (t)
(t) ¯ where a¯ (t) jk is the ( j, k)-th element of the t-th generational gene ( A ). That is, ai, jk is 1 with a probability of a¯ (t) ¯ (t) jk and 0 with a probability of (1 − a jk ). Accordingly, as (t) a¯ jk approximates 1, the corresponding element of each individual is more likely to be 1.
38
For the first generational gene, all the loci may be set to 0.5.
414
8 Bayesian Network Model
Fig. 8.24 Framework of PBIL (BNM)
Next, the survivors are selected in (2), where roulette wheel, rank, and tournament selections can be employed, as in a typical GA. Individuals who were not selected are discarded. Elitism selection can be used jointly during this process. (t) Then, in (3), suppose that the I S best-fitting survivors are renamed { A(t) S1 , A S2 , (t) (t) ¯ S ) is calculated · · · , AS I S }. Then, the average gene of the I S survivors in Period t ( A as IS IS 1 1 (t) (t) (t) ¯A(t) (J × J ). A = a¯ S, jk = a S = I S i=1 Si I S i=1 Si, jk In this matrix, if the ( j, k)-th element is ≥ 0.5, more than half of the survivors have ¯ (t) an edge from Item j to k. Suppose that A(t) R (J × J ) is the rounded matrix of A S . (t) That is, the ( j, k)-th element of A R is (t) ¯ S, jk = round a (t) R, jk = round a
IS 1 (t) a Si, jk . I S i=1
In other words, the ( j, k)-th element of A(t) R is 1 if the corresponding element of (t) ¯A(t) S is ≥ 0.5 and is 0 otherwise. Using this A R , the (t + 1)-th generational gene is updated as follows:
8.5 Structure Learning
415
¯ (t+1) = A ¯ (t) + α A(t) − A ¯ (t) = {a¯ (t+1) } (J × J ), A R jk where α ∈ (0, 1) is a constant called the learning rate. In this equation, the ( j, k)-th element is represented as (t) = a¯ (t) ¯ (t) a¯ (t+1) jk jk + α a R, jk − a jk a¯ (t) ¯ (t) , if a (t) jk − α a R, jk = 0 jk (t) = . (t) (t) a¯ jk + α 1 − a¯ jk , if a R, jk = 1 ¯ (t+1) < a¯ (t) That is, when a (t) R, jk = 0, then a jk jk , and the ( j, k)-th element approaches 0. Moreover, the higher the learning rate (α), the closer the element is to 0. Conversely, if a (t) R, jk = 1, the higher the α, the closer the element is to 1 because (t) > a ¯ . a¯ (t+1) jk jk ¯ (t+1) is mutated with a probability of r M , as folFurthermore, each element in A lows: a¯ (t+1,new) = (1 − B M )a¯ (t+1) + BM jk jk ⎧ (t+1) ⎨a¯ jk , = a¯ (t+1) ⎩ jk + U 2
a¯ (t+1) +U jk
if B M = 0 , if B M = 1
2 ,
where B M ∼ Ber n(r M ), U ∼ U [0, 1]. In the above equation, U [0, 1] is the uniform density within the range [0, 1], and U is a random sample from the density. This formula means that, although each element remains unchanged with probability 1 − r M , it is replaced by the average of a¯ (t+1) jk and U with probability r M . Note that this formula is not the only method for gene mutations. There are various methods of mutating genes. Estimates of A (PBIL) 1. 2. 3. 4.
The adjacency matrix of the best individual in the final period The rounded average matrix of the individuals in the final period The rounded average matrix of the survivors in the final period The rounded generational gene in the final period
From the (t + 1)-th generational gene, I individuals are randomly regenerated, as shown in (4) in Fig. 8.24. By repeating these processes, the generational gene
416
8 Bayesian Network Model
is updated and evolved to produce better-adapted individuals. For the termination conditions of the PBIL, those in the GA (Termination Conditions of GA , p.411) can be used. Methods forestimating the adjacency matrix estimate are listed in Estimates of A (PBIL) . Points 1–3 are the same as those shown in Estimates of A (GA) . Point 4 states that when the final period is T , the estimate is ¯ (T ) , as follows: specified as the rounded matrix of A ) Aˆ = {aˆ jk } = round a¯ (T . jk
(8.29)
The previous descriptions provide a brief overview of the PBIL. Notably, this method involves randomly generating and mutating genes; therefore, the analyses (even with the same settings) usually do not yield the same results. Although numer- ous settings are required to execute PBIL, the set listed in Settings for Executing PBIL is basic, and many options and specifications can be added. Settings for Executing PBIL (BNM) 1. Gene: Adjacency matrix A is specified in this analysis. The genetic loci are the strictly upper triangular cells in A under the condition that the item labels are topologically sorted. Each locus takes a value of 0 (no edge) or 1 (edge). 2. Population: The number of individuals (I ) in a generation 3. Survival rate: The rate of survivors (r S ) in the population 4. Fitness evaluation: The index (such as BIC and AIC) used to evaluate the fitness of each individual 5. Selection: Rank, roulette wheel, tournament, elitism selections, etc. 6. Learning rate: The updated size of the generational gene (α) 7. Mutation rate: The rate at which each locus changes (r M ) 8. Termination condition: see Termination Conditions of GA (p.411) 9. Estimate of A: see Estimates of A (PBIL)
8.5.3 PBIL Structure Learning In this section, structure learning is conducted using PBIL for the five-item data (J5S10) shown in Table 8.11 (p.390). The analysis settings are as follows:
8.5 Structure Learning
417
Parameters for Executing PBIL (BNM) 1. 2. 3. 4. 5. 6. 7. 8.
Gene: A (the maximum indegree per item is 2) Population: I = 20 Survival rate: r S = 0.5; thus, I S = r S I = 10 Fitness evaluation: BIC Selection: Simple rank selection (and elitism selection with I E = 1) Learning rate: α = 0.05 Mutation rate: r M = 0.002 Termination condition: When the fitness of the best individual does not change for ten generations ) Eq. (8.29) 9. Estimate: Aˆ = {aˆ jk } = round a¯ (T jk
First, the CRRs of the data (J5S10) in Table 8.11 are ⎤ ⎡ ⎤ 0.6 p1 ⎢ p2 ⎥ ⎢0.4⎥ ⎢ ⎥ ⎢ ⎥ ⎥ ⎢ ⎥ p=⎢ ⎢ p3 ⎥ = ⎢0.9⎥ . ⎣ p4 ⎦ ⎣0.3⎦ 0.4 p5 ⎡
Because Items 3, 1, 2, 5, and 4 are topologically sorted in descending order of the CRR (from the easiest to the hardest), it is reasonable to assume that an item with a low CRR (i.e., a harder item) is less likely to be a prerequisite (i.e., parent) item for other items with a higher CRR (i.e., easier items). The genetic loci can thus be set as follows: ⎡ ⎤ → 31254 ⎢ Item 3 ⎥ ⎢ ⎥ ⎢ Item 1 ⎥ ⎢ ⎥. A=⎢ ⎥ Item 2 ⎢ ⎥ ⎣ Item 5 ⎦ Item 4 When the items are arranged in the original order, the loci are ⎤ → 12345 ⎥ ⎢ Item 1 ⎥ ⎢ ⎥ ⎢ Item 2 ⎥. A∗ = ⎢ ⎥ ⎢ Item 3 ⎥ ⎢ ⎦ ⎣ Item 4 Item 5 ⎡
In addition, as described in Point 1, the maximum indegree (i.e., number of parents) for each item was set to 2, because the number of items was small. However, even
418
8 Bayesian Network Model
when the number of items is large, the maximum indegree per item should be ≤ 3. This is because, when the maximum number of parent items is 4, the maximum number of PIRPs is 16 (= 24 ), and the maximum number of parameters for such an item is 16. A DAG that is too complex for us to understand and explain is worthless in practice, even if it is a true model. In social sciences such as psychology, pedagogy, sociology, and behavioral sciences, a simpler, explainable, and interpretable model is often preferred when the true model is extremely complex. However, a simple model with a poor fit should be avoided. For Point 5, simple rank selection was used to select the top I S (= 10) survivors in descending order of fitness (i.e., ascending order of BIC). At this time, an elitism selection with I E = 1 was jointly employed; that is, the individual with the lowest BIC was adopted as an individual in the next generation without changing its gene. Finally, as described in Point 8, the PBIL is terminated when the BIC of the best individual is invariant for 10 periods. PBIL was terminated during the 15th period. Because the number of items and sample size are both small, the BIC of the best individual did not change at −18.6011 from the fifth period. The generational gene in the 15th period was obtained as follows: ⎤ ⎡ → 3 1 2 5 4 ⎢ Item 3 0.000 0.712 0.654 0.693 0.257 ⎥ ⎥ ⎢ ⎢ Item 1 (15) 0.232 0.232 0.768 ⎥ ⎥. ⎢ ¯A =⎢ 0.257 0.232 ⎥ ⎥ ⎢ Item 2 ⎣ Item 5 0.260 ⎦ Item 4 In this analysis, as described in Point 9, the rounded matrix of the generational gene at the final period is employed as the adjacency matrix estimate; that is, the estimate is given as ⎡
⎤ → 31254 ⎢ Item 3 1 1 1 0 ⎥ ⎢ ⎥ ⎢ Item 1 0 0 1⎥ ⎢ ⎥. ˆ A=⎢ 0 0⎥ ⎢ Item 2 ⎥ ⎣ Item 5 0⎦ Item 4 ˆ Except for Item 3, which is a Figure 8.25 shows the DAG of this estimate (A). source, the number of parents of each item was one; thus, this DAG is a tree (see Directed Tree , p.368). The figure also shows the MLEs of the parameters.39 For all the items except Item 3 (i.e., the source), the CCRR was 0 when the PIRP was incorrect (0). 39
Although the MLEs were employed because they were easier for readers to verify, the MAPs are recommended in practice.
8.5 Structure Learning
419
Fig. 8.25 DAG and parameter estimates obtained using PBIL Table 8.13 Model fit indices (PBIL)
χ 2 and d f ll B −8.94 ll N −28.88 ll A −25.54 χ N2 39.89 χ A2 33.20 d fN 25 d fA 21
Standardized Index NFI 0.168 RFI 0.009 IFI 0.354 TLI 0.025 CFI 0.181 RMSEA 0.254
Information Criterion AIC −8.80 CAIC −17.15 BIC −15.15
The DAG should be modified if it is strange to teachers of the subject because a model that is accountable and fits well is preferred to the model that best fits the data. As previously mentioned, the model that best fits the data is not always the model that is most applicable to the real world. Table 8.13 shows the model-fit indices for the DAG. The indices improved compared with those of Network 1 (Table 8.12, p.407). Note that both tables share the same log-likelihoods for the benchmark and null models (i.e., ll B and ll N ). The benchmark model is a multigroup model in which the DAG of each group is empty, whereas the null model is a single-group model in which the DAG is empty. Accordingly, the χ 2 values for the null model were also the same in both tables. For the log-likelihood of the analysis model (ll A ), the model in this table approximates 0. In other words, the DAG fits the data better than Network 1. In addition, the DF of this DAG (i.e., d f A ) was higher than that of Network 1. The higher the DF, the more parsimonious the DAG is because DF indicates that such a DAG requires fewer parameters. Because the log-likelihood (ll A ) and DF (d f A ) both improved, all indices improved. Namely, NFI–CFI approached 1, RMSEA approached 0, and
420
8 Bayesian Network Model
all three information criteria became smaller (i.e., negatively larger). Therefore, the PBIL-learned structure fits the data well.
8.6 Chapter Summary This chapter introduced BNM. The network among the test items is represented by a DAG, where a directed edge implies that the successor item has a CCRR for each pattern of the predecessor item(s). In addition, the JP of the items is uniquely factored as the product of the CCRRs according to the DAG. The kernel of a BNM analysis is structure learning, which identifies a structure that explains well the information contained in the data. If the structure is obvious without analyzing the data acquired through the experience or expertise of the analyst, structure learning is not necessary. However, structure learning often presents a structure that is unexpected for analysts. As mentioned earlier, when the number of items is J , there can be 2 J (J −1)/2 DAGs, even after the items are topologically sorted. This number is tremendously large, i.e., 2190 ≈ 1.57 × 1057 (J = 20) and 2435 ≈ 8.87 × 10130 (J = 30). Thus, there are almost an infinite number of wellfitting DAGs that analysts cannot conceive. Although structure learning cannot test all possible DAGs, it can still scan a wide range of DAGs beyond the reach of human imagination. In IRT, LCA, and LRA, the test items are assumed to be unrelated because of the assumption of local independence.40 In addition, fields are supposed to be unassociated in biclustering. Although there are only a few applications of BNM in a test data analysis, it is crucial to explore and identify a structure that abstracts such data.
40
Ueno (2002) proposed a model that combined IRT and BNM. In addition, Shojima (2011) proposed a model that incorporated LRA (LCA) and BNM.
Chapter 9
Local Dependence Latent Rank Analysis
This chapter describes a local dependence latent rank analysis (LD-LRA; Shojima 2011), which combines a latent rank analysis (LRA; Chap. 6, p. 191) and the Bayesian network model (BNM; Chap. 8, p. 349). The LRA is a method used to classify students into latent ranks (ordinal latent classes), whereas BNM is a model used to find a good structure among the test items. An LD-LRA is thus a model that can classify students into ranks and find a good structure at each rank. This chapter also refers to a local dependence latent class analysis (LD-LCA) because the ranks reduce to classes if not ordered based on their ability level. Figure 9.1 illustrates an LD-LRA. The steps at the bottom of the figure represent the latent rank scale with five ranks: Rank 1 is the lowest ability group, whereas Rank 5 is the most advanced. In addition, the numbers in the circles indicate the test items (there are six items in total). In each rank, the items are connected by directed edge(s). This is the conditional correct response rate (CCRR) structure at each rank and is called a directed acyclic graph (DAG, see Sect. 8.1.8, p. 366). For example, the DAG at Rank 1 has directed edges of Item 1 → 2 and Item 1 → 3. The DAG at each rank is not necessarily congruent with those at the other ranks.
9.1 Local Independence and Dependence This section describes the difference between local independence and dependence. Furthermore, the difference between global independence and dependence is described.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 K. Shojima, Test Data Engineering, Behaviormetrics: Quantitative Approaches to Human Behavior 13, https://doi.org/10.1007/978-981-16-9986-3_9
421
422
9 Local Dependence Latent Rank Analysis
Fig. 9.1 Local dependence LRA
9.1.1 Local Independence In a usual LRA, items are locally independent; thus, the usual LRA can be seen as a local independence LRA (LI-LRA), as illustrated in Fig. 9.2. The figure shows that the DAG at each rank is an empty graph (i.e., edgeless graph). In other words, the items at each rank are independent of each other. Here, the term “local” in local independence represents each rank, and the number of loci is equal to the number of ranks (R).1 Thus, the local independence is expressed with the DAG at each rank being an empty graph. Accordingly, the JP at Rank r is factored as Pr (1, . . . , J ) = Pr (J ) × Pr (J − 1) × · · · × Pr (1) =
J
Pr ( j).
j=1
In Fig. 9.2, it is obtained as Pr (1, . . . , 6) = Pr (6) × Pr (5) × · · · × Pr (1). No items have a parent; thus, the PIRP of each item is the first pattern, i.e., “no parent.” When the number of parents is k, the number of PIRPs is 2k , and it is therefore 1 (= 20 ) because k = 0 in this case. If the number of parents is 0, 1, 2, and 3, the PIRPs are 1
Similarly, local independence in LCA means that each item is independent at each latent class, and each class is thus a locus.
9.1 Local Independence and Dependence
423
Fig. 9.2 Local independence LRA (normal LRA)
N of PIs N of PIRPs 0 1 1 2 2 4 3 8
PIRP − 0, 1 00, 01, 10, 11 000, 001, 010, 011, 100, 101, 110, 111
As learned from the BNM, the number of parameters for an item is the number of PIRPs; thus, when all items are independent (indegree of 0), the number of parameters per item is 1. That is, the number of parameters for Item j at Rank r is only π jr . Accordingly, Fig. 9.2 has 30 (6 items × 5 ranks) parameters, i.e., ⎡ π11 · · · ⎢ .. . . R = ⎣ . . π61 · · ·
⎤ π15 .. ⎥ (6 × 5). . ⎦ π65
This is the rank reference matrix (Sect. 6.1.6, p. 197). In general, when the numbers of items and ranks are J and R, respectively, the rank reference matrix is expressed as ⎤ ⎡ π11 · · · π1R ⎥ ⎢ R = ⎣ ... . . . ... ⎦ (J × R). πJ1 · · · πJ R
424
9 Local Dependence Latent Rank Analysis
Local Independence in Item Response Theory In item response theory (IRT), a “locus” is any continuous point on the continuous ability scale (i.e., θ ; Fig. 9.3). Students with an average ability have θ = 0, and students with higher ability have a larger θ . Although the range of θ is (−∞, ∞) by definition, the practical range is approximately (−3, 3). Despite the practical range being limited to (−3, 3), the number of loci is infinite because every single continuous point is a locus.
Fig. 9.3 Local independence in IRT
9.1.2 Global Independence People who criticize the assumption of the local independence in IRT, LCA, and LRA as being extremely severe and unrealistic might be confusing global and local independence. Figure 9.4 (left) illustrates global independence, which means that
9.1 Local Independence and Dependence
425
Fig. 9.4 Global independence and dependence
all items are independent (i.e., DAG is empty) for all students when regarded as a single group. This global independence is identical with the local independence when there is a single locus. Such assumption is unnatural because the items in a test are normally related to each other. In other words, the contents of the items in a test often partially overlap. Even if items with unrelated topics are grouped together in a test, there will still be correlations among the items because the students with higher ability can pass more items than the students with lower ability. This model can hardly be expected to fit the data and in fact has already been described many times in the previous chapters as the null model (e.g., Sect. 8.4.2, p. 402). That is, the global independence model is consistently employed as the null model in this book.
9.1.3 Global Dependence Items in a test are usually associated with each other. In other words, they are globally dependent (Fig. 9.4, right), and can be locally independent even if they are globally dependent. Simpson’s paradox (Fig. 9.5) is a good example of data being globally dependent but locally independent. The upper table in the figure is a cross between Items j and k for 100 students. These items are said to be related because there is a total of 68 students who passed and failed the items. The item lift (Sect. 2.5.3, p. 39) and mutual information (Sect. 2.5.4, p. 42) between these items are indeed l jk = 1.36 and M I jk = 0.096; however, they take a value of 1 and 0, respectively, when the two items are independent. The cross tables at the bottom of the figure are for male and female students. For the 50 boys, from among the 40 students who failed Item j, 8 (20%) of the students passed Item k, whereas from among the 10 students passing Item j, 2 (20%) students passed Item k. Thus, the two items are unassociated for the boys, and the item lift
426
9 Local Dependence Latent Rank Analysis
Item
0 1
Item
0 1
All Students 0 1 0.34 (34) 0.16 (16) 0.16 (16) 0.34 (34)
Male Students 0 1 0.64 (32) 0.16 (08) 0.16 (08) 0.04 (02)
Female Students 0 1 0 0.04 (02) 0.16 (08) 1 0.16 (08) 0.64 (32) 0
Item
Fig. 9.5 Simpson’s paradox (Example 1)
and the mutual information are therefore 1 and 0, respectively. Likewise, for the 50 girls, the two items are found to be independent; thus, the two indices are 1 and 0, respectively. As these tables imply, when the relationship seen in a whole group is (remarkably) different from those seen in subgroups, the state is called a Simpson’s paradox.2 The three tables in the figure indicate that Items j and k are globally dependent but locally independent. In other words, they are associated when considering a whole group (all students) but locally unassociated at the loci of “sex = male” and “sex = female.” In a normal LRA, each rank is a locus, and items are assumed to be independent at each locus (Fig. 9.6). However, this does not mean that items are globally independent. In the figure, Items j and k are globally dependent, as shown by the two indices, but are locally independent at each of the five ranks. For instance, the two items were unassociated at Rank 1 because 4 of the 16 students passing Item j passed Item k (25%), whereas 1 of the 4 students passing Item j passed Item k (25%). Accordingly, the item lift and mutual information are thus 1 and 0, respectively. Likewise, these two items are unassociated through Ranks 2–5. There is no doubt the items are associated. Thus, there is no analyst who considers such items to be globally independent. The test items are globally dependent on each other. Therefore, their analysis is worthwhile. The items can then be locally independent even if they are globally dependent. That is, the global dependence and 2
This paradox is also applicable to the reverse situation, when the cross table for all students is unassociated but that for each gender is associated. In addition, it can refer to the correlation as well. For example, two items are Simpson’s paradoxical when they are correlated/uncorrelated in a whole group, but uncorrelated/correlated in subgroups.
9.1 Local Independence and Dependence
427
Fig. 9.6 Simpson’s paradox (Example 2)
local independence are compatible and assumed in IRT, LCA, and LRA. In other words, Simpson’s paradox is not a paradox for these methods. In fact, it does not matter whether the items are globally dependent or independent because the items in a test are usually globally dependent; however, this does not always hold between all item pairs. It is often observed that some item pairs are uncorrelated (i.e., globally independent). Note, however, that the null model is a model that unrealistically assumes all item pairs in a test are globally independent. Therefore, this model fits the data poorly.
9.1.4 Local Dependence Occasionally, the global and local states differ. The conditions in which two states are different are referred to as Simpson’s paradox. Table 9.1 shows four (2 × 2) combinations of the global and local states. The LD-LRA described in this chapter is a model that assumes local dependence among items. However, it does not matter whether the items are globally dependent or independent because, although not all the item pairs are globally dependent on each other, the test items usually have such dependence. In general, the larger the number of loci, the higher the local independency.3 The number of students in a locus produces the local dependency (i.e., interitem 3
Accordingly, the assumption of local independence is least restrictive in IRT because the number of loci is infinite.
428
9 Local Dependence Latent Rank Analysis
Table 9.1 (GI or GD)×(LI or LD) Local Independence (Normal LRA)
Global Independence Dependence
Dependence (LD-LRA)
Independence
Dependence
Status Items j and k are unassociated both in a whole student group and in each rank. Items j and k are unassociated in a whole student group but associated in each rank (Simpson’s Paradox). Items j and k are associated in a whole student group but assumed to be unassociated in each rank (Simpson’s Paradox). Items j and k are associated in a whole student group and assumed to be associated in each rank.
association at a locus). The larger the number of students in a locus, the wider the range of ability of the students, and the larger the number of students passing and failing both Items j and k, which produces the association between items. By contrast, as the number of loci increases and the number of students in each locus decreases, the range of ability in each locus reduces, the interitem association becomes weaker, and the state among items approximates to a local independency. In LRA, each latent rank is the locus; thus, the larger the number of ranks, the more likely the assumption of local independence will hold. This is because, as the number of ranks increases, the number of students belonging to each rank decreases, and the range of ability in each rank becomes narrower. Therefore, the LD-LRA is more effective when the number of ranks is small. In addition, local independency is contingent on the subject of the test. It is empirically known that local independence is more likely to be held in language, history, and social study tests, whereas it is more difficult to maintain in math and science tests. This is because the contents of the items are more hierarchical in math and science tests, that is, the correct answer of Item j is used to pass Item k. For example, the correct answer of Item j, “find x where y = x 2 − 2x + 1 is minimum,” is required for passing Item k, “find the minimum value of y.” In such items, the numbers of students passing both Items j and k, failing both Items j and k, and passing Item j but failing Item k increase, whereas there are few students failing Item j but passing Item k. That is, the numbers of joint correct and incorrect responses increase owing to such a hierarchical relationship, and the two items are more strongly associated (local independence between these items is less plausible) as the two numbers increase, even if the number of students in the locus is small.
9.1 Local Independence and Dependence
429
A group of items with such a hierarchical relationship is called a testlet, and this item format is frequently used in math and science tests.4 Although it is also possible to create such items in language, history, and social study tests, the degree to which items are associated is usually less strong than that in math and science tests because students can often solve subsequent items by guessing from the context even if they fail the antecedent items. Suitable Conditions for LD summarizes the situations in which it is better to assume local dependence. Suitable Conditions for Local Dependence 1. When the number of loci (ranks) is small 2. When a test contains a testlet(s)
9.1.5 Local Dependence Structure (Latent Rank DAG) The local dependence structure is represented by the DAG at each latent rank. This implies that the graph at each rank must be simple, directed,and acyclic, which can be easily accomplished by topologically sorting the items ( Topological Sorting , p. 371). As described in the chapter on BNM, a DAG is a graphical representation of a CCRR structure among the test items. For example, in the DAG at Rank 1 in Fig. 9.1, Items 1, 2, and 3 assume the following structure: 2 ← 1 → 3 , whereas there are no edges between Items 4, 5, and 6; thus, they are isolated items and are locally independent. A graph is said to be disconnected (Sect. 8.1.5, p. 358) when it has an isolated item or a disconnected component. The DAG at Rank 1 is thus a disconnected graph. The chapter focusing on BNM specifies that a DAG will typically be connected. This is because a BNM analysis explores the global dependency structure among items under the condition that the number of student groups (loci) is one. The contents of the items in a test normally partially overlap, and globally independent items or components (groups of items) are less likely to be employed in a test. However, the DAG at each rank will mostly be disconnected because the ranks are ordered by the ability level, and thus the ability range at each rank is insufficiently wide to generate large correlations (or associations) among the items. Items 4, 5, and 6 are indeed difficult for the students belonging to Rank 1; it is therefore highly 4
Regarding the number of correct responses in a testlet as a testlet datum (0, 1, 2, or 3 if there are three items in the testlet), the graded response model (Samejima 1969) and partial credit model (Masters 1982; Muraki 1992) can analyze such testlet items in IRT, whereas the graded LRA model (Shojima 2007c) is available for an LRA.
430
9 Local Dependence Latent Rank Analysis
possible that they will fail all three items. The interitem dependency structure does not emerge among such items. For example, if Items 4 and 5 have the following relation: 4 → 5 , the CCRR of Item 5 should vary depending on whether the parent (i.e., Item 4) is correct. However, when all students belonging to Rank 1 fail Items 4 and 5, there will only be one pattern for Item 4 (i.e., incorrect), and therefore two CCRRs cannot be specified for Item 5. Meanwhile, Items 1 and 2 are simple for the students at Rank 5 to the extent at which an interitem dependency structure does not appear. Accordingly, a dependency structure normally appears between limited items at each rank. In other words, it is normal for the DAG represented in each rank to vary because the set of items focused in each rank differs. In general, the local dependency between easier items is inclined to emerge at a lower rank, whereas that between more difficult items is likely to emerge at a higher rank. Conversely, when the DAG is congruent at all ranks, the data can then be analyzed without dividing the students into ranks and by regarding them as a single group. In this case, the model is a typical (single-group) BNM. That is, the main objective of the LD-LRA is to explore a distinctive DAG structure at each rank. The above discussions are summarized in Characteristics of DAGs in LD-LRA . Characteristics of DAGs in LD-LRA 1. The DAG at each rank is disconnected in most cases. 2. The DAG at each rank is usually incongruent with those of the other ranks.
9.2 Parameter Learning This section describes a procedure employed to estimate the parameter set in each DAG and the latent rank to which each student belongs, under the condition that the number of ranks (R) and the DAG at each rank are given. Before analyzing the data, both the number of ranks and the DAG of each rank are unknown. Thus, they should be given before estimating the parameters. In addition, because it is difficult to construct the DAG at each rank without knowing what kind of group belongs to the rank, one should first determine the number of ranks, and then explore the DAG at each rank. This is.
9.2 Parameter Learning
431
Estimation Order of LD-LRA 1. Estimate the number of latent ranks (LRA) 2. Specify the DAG at each latent rank (structure learning) 3. Estimate the parameter set (parameter learning) As specified in Point 1, the number of ranks (R) must first be determined. By executing normal LRAs under various Rs, an optimal R can be selected by referring to their BICs or AICs (Sect. 6.9, p. 252). Alternatively, an optimal R can be chosen by consulting the experience and expertise among Rs with a more satisfactory fitting.5 Then, as indicated in Point 2, specifying the DAGs of the ranks is called structure learning, and estimating the parameters derived from the DAGs is referred to as parameter learning. Structure learning is described in Sect. 9.4 (p. 465). This section describes parameter learning (Point 3) under the condition in which the number of ranks (Point 1) and the DAG of each rank (Point 2) are known. In practice, structure and parameter learnings (Points 2 and 3) are conducted at almost the same time. Once the structure (i.e., the DAGs of all ranks) is specified, the parameters are derived and estimated under the structure, and the fitness of the structure is then calculated to decide whether to adopt that structure. If an acceptable structure is not obtained, the number of ranks can be determined again by returning to Point 1. This section analyzes a 12-item math test for high school students (J12S5000), and the contents of this test are shown in Table 9.2. The 12 items are classified into three testlets, where Testlet 1 is the easiest and Testlet 3 is the most difficult. In addition, as shown in the table, the structure in each testlet is hierarchical such that the previous item(s) must be correctly answered to pass the subsequent item(s). Thus, within each testlet, the CRR of the item with a larger label is lower. The sample size of the data is 5000. Figure 9.7 presents that the DAG structure on the latent rank scale. From the figure, it can be observed that the number of ranks is five. Although an edge can be incident with two items of distinct testlets, there are no such edges in this structure.6 In practice, such edges should be present across different testlets, and it is recommended to assume them if the structure becomes more interpretable. Moreover, the DAGs at Ranks 2 and 3 are set to be congruent. Furthermore, within each testlet, edges are set between items with a smaller label at a lower rank, and edges tend to be set between items with a larger label at a higher rank. For instance, Items 1 → 2 are set at Ranks 1, 2, and 3, but Items 2 → 3 are set at Ranks 2, 3, 4, and 5. This is because Items 1, 2, and 3 are set from the easiest to the 5
The latter method is more preferable and is often adopted in practice. Although statistical indices are objective and reproducible, they only summarize the information inside the data. The latter approach knowledge and experience) into account. See takes the information outside the data (i.e., also Importance of Knowledge about Outside Data knowledge and experience) into account. See
(p. 255) and Importance of Interpretability (p. 256). 6
Note that no edge can bridge two items at different ranks.
432
9 Local Dependence Latent Rank Analysis
Table 9.2 Item contents of 12-item math test (J12S5000)
Testlet Item Content 1 Find the value of a that minimizes the y-coordinate of intersection point P 1 made by y = f (x) = x 2 − 2(a + 2)x + a 2 − a + 1 and the y-axis. 2 Find the minimum of the y-coordinate of intersection point P. 3 Find the x-coordinate of the intersection point made by f (x) and the x-axis when the y-coordinate of intersection point P is at a minimum. 4 Find the value of a at which f (x) becomes symmetric with respect to the 2 y-axis ( f 1 ). 5 Find the value of a at which f (x) is tangential to the x-axis ( f 2 ). 6 Find the difference in size between the x-coordinates of f 1 and f 2 . 7 Find the difference in size between the y-coordinates of f 1 and f 2 . 3 8 Find the probability that the number of intersection points made by y = g(x) = x 2 − (b − 2)/a and the x-axis, provided that a and b are integers determined by a hexahedral dice. 9 Find the probability that the number of intersection points made by g(x) and the x-axis is 1. 10 Find the probability that the number of intersection points made by g(x) and the x-axis is 2. 11 Find the expected number of intersection points made by g(x) and the x-axis. 12 Find the probability that there is at least one intersection point made by f (x) and the x-axis and that all the x-coordinates of the intersection points are integers.
Fig. 9.7 Latent rank DAG
9.2 Parameter Learning
433
most difficult in this order (i.e., Item 1 is the easiest). Items 2 and 3 are difficult for the students at lower ranks; the structure of Item 2 → 3 does not emerge. Likewise, the structure of Item 1 → 2 does not appear at higher ranks because these items are easy for the students with high ability.
9.2.1 Rank Membership Matrix The rank membership matrix is a S × R matrix and is given as ⎤ m 11 · · · m 1R ⎥ ⎢ M R = ⎣ ... . . . ... ⎦ = {m sr } (S × R), m S1 · · · m SR ⎡
where the (s, r )-th element, m sr , represents the membership of Student s belonging to Rank r . Each element of this matrix is also a parameter to be estimated. In this case, the size of the matrix is (5000 × 5). The number of parameters in the matrix is S R, which varies depending on the sample size. These parameters are thus nuisance parameters. The s-th row vector in the matrix M R , ⎤ m s1 ⎥ ⎢ ms = ⎣ ... ⎦ , ⎡
ms R is the rank membership profile of Student s. Each student necessarily belongs to one of the ranks. The sum of this vector is thus 1 as follows: ms 1 R = 1. Because the sum is 1, and each element is nonnegative, the rank membership profile of each student can be viewed as a discrete probability distribution. This also implies that the row sum vector of rank membership matrix M R becomes M R 1 R = 1S .
9.2.2 Temporary Estimate of Smoothed Membership For Point 1 in Estimation Order of LD-LRA , five ranks were employed in this analysis, which implies that the normal (local independence) LRA with R = 5 was complete. As shown in Eq. (6.3) (p. 209), the estimation process of the normal LRA
434
9 Local Dependence Latent Rank Analysis
applying the EM algorithm (Dempster et al., 1977) is EM Cycle 1
(0) R
EM Cycle 2
(1) (2) (2) (1) (2) M → M (1) → S → → → S → R R R R → ··· .
Suppose that the EM algorithm was terminated at Cycle T , and the T -th rank mem) bership matrix (M (T R ) was then estimated. The smoothed rank membership matrix (T ) at Cycle T (S ) was also obtained using filter F as follows7 : S(t) = M (t) R F. The kernel of the filter is ⎤ fL ⎢ .. ⎥ ⎢.⎥ ⎢ ⎥ ∗ ⎥ f =⎢ ⎢ f0 ⎥ L × 1 , ⎢ .. ⎥ ⎣.⎦ fL ⎡
(9.1)
which is a symmetric kernel with respect to the central element ( f 0 ). In addition, the kernel length (L ∗ = 2L + 1) is an odd number. Using this kernel, F ∗ is first created as follows: ⎡ ⎤ f0 · · · f L ⎢ .. . . .. . . ⎥ ⎢. ⎥ . . . ⎢ ⎥ ⎢ .. ⎥ . ⎢ f L . f0 . . f L ⎥ ⎢ ⎥ ⎢ ⎥ F ∗ = ⎢ . . . ... . . . ... . . . ⎥ (R × R). ⎢ ⎥ ⎢ ⎥ .. .. ⎢ f L . f0 . f L ⎥ ⎢ ⎥ ⎢ .. . . .. ⎥ ⎣ . .⎦ . f L · · · f0 However, the sums of the elements are not 1 in the first and last L − 1 columns. To make them 1, F ∗ is adjusted as follows: F = F ∗ {1 L ∗ (1L ∗ F ∗ )} (R × R). The kernel employed in this analysis is
7
See Sect. 6.3.2 (p. 210) for details.
9.2 Parameter Learning
435
⎡
⎤ 0.1 ⎢0.2⎥ ⎢ ⎥ ⎥ f =⎢ ⎢0.4⎥ . ⎣0.2⎦ 0.1 The filter is then given as ⎡ 0.571 ⎢0.286 ⎢ F=⎢ ⎢0.143 ⎣
0.222 0.444 0.222 0.111
0.100 0.200 0.400 0.200 0.100
⎤ ⎥ 0.111 ⎥ 0.222 0.143⎥ ⎥. 0.444 0.286⎦ 0.222 0.571
This smoothing filter is stronger (i.e., flatter) than the one used in the normal LRA with R = 5.8 For the LD-LRA, it is empirically known to be better to use a flatter filter. The reason for this is discussed in Sect. 9.2.9 (p. 458).
9.2.3 Local Dependence Parameter Set The CCRR structure (i.e., the DAG at each rank) is also an object to be identified during the parameter learning stage. First, referring to Sect. 8.2.5 (p. 377), the JP at Rank r is factored into the chain of CCRRs as follows: Pr (1, . . . , J ) = Pr (J | par (J )) × Pr (J − 1| par (J − 1)) × · · · × Pr (2| par (2)) × Pr (1) =
J
Pr ( j| par ( j)),
j=1
where par ( j) is the parent item set of Item j at Rank r . In Fig. 9.7, the JPs at the five ranks are factored as shown in Table 9.3. Next, the parameter set for each rank is derived from the DAG of the rank. Let the parameter set at Rank r be denoted as D I r ,9 and suppose from Sect. 8.3.1 (p. 378) that πr jd is the CCRR corresponding to the d-th PIRP for Item j at Rank r , the parameter sets for the five ranks can be specified as follows:
8 9
From Eq. (6.5) (p. 214), the kernel used in the normal LRA with R = 5 is f = [0.1 0.8 0.1]. The subscript DI means “(local) dependence among items.”.
436
9 Local Dependence Latent Rank Analysis
Table 9.3 Joint probability factorization induced from Fig. 9.7
Rank 1
Factorization
Testlet
P1 (1, · · · , 12) = P1 (12)P1 (11)P1 (10)P1 (9)P1 (8)
3
×P1 (7)P1 (6)P1 (5|4)P1 (4) 2
2
×P1 (3)P1 (2|1)P1 (1)
1
P2 (1, · · · , 12) = P2 (12)P2 (11|8)P2 (10|9, 8)P2 (9|8)P2 (8)
3
×P2 (7)P2 (6)P2 (5|4)P2 (4) 3
4
5
⎡
π1,1,1 ⎢ π1,2,1 ⎢ ⎢ π1,3,1 ⎢ ⎢π ⎢ 1,4,1 ⎢π ⎢ 1,5,1 ⎢ ⎢ π1,6,1 ⎢ ⎢ π1,7,1 ⎢ ⎢ π1,8,1 ⎢ ⎢ π1,9,1 ⎢ ⎢π1,10,1 ⎢ ⎣π1,11,1 π1,12,1
2
×P2 (3|2)P2 (2|1)P2 (1)
1
P3 (1, · · · , 12) = P3 (12)P3 (11|8)P3 (10|9, 8)P3 (9|8)P3 (8)
3
×P3 (7)P3 (6)P3 (5|4)P3 (4)
2
×P3 (3|2)P3 (2|1)P3 (1)
1
P4 (1, · · · , 12) = P4 (12)P4 (11|9, 8)P4 (10|8)P4 (9)P4 (8)
3
×P4 (7|5, 4)P4 (6|5, 4)P4 (5)P4 (4)
2
×P4 (3|2)P4 (2)P4 (1)
1
P5 (1, · · · , 12) = P5 (12|10)P5 (11|10, 9)P5 (10)P5 (9)P5 (8)
3
×P5 (7|5, 4)P5 (6|5, 4)P5 (5)P5 (4)
2
×P5 (3|2)P5 (2)P5 (1)
1
D I 1 ⎤⎡ π2,1,1 π1,1,1 π1,1,1 π1,1,1 ⎥⎢ π2,2,1 π1,2,2 ⎥⎢ ⎥⎢ π2,3,1 ⎥⎢ ⎥⎢ π ⎥⎢ 2,4,1 ⎥⎢ π π1,5,2 ⎥⎢ 2,5,1 ⎥⎢ ⎥⎢ π2,6,1 ⎥⎢ ⎥⎢ π2,7,1 ⎥⎢ ⎥⎢ π2,8,1 ⎥⎢ ⎥⎢ π2,9,1 ⎥⎢ ⎥⎢π2,10,1 ⎥⎢ ⎦⎣π2,11,1 π1,12,2 π1,12,1 π1,12,1 π2,12,1
D I 2 π2,2,2 π2,3,2 π2,5,2
π2,9,2 π2,10,2 π2,10,3 π2,11,2
⎤⎡
π3,1,1 ⎥⎢ π3,2,1 ⎥⎢ ⎥⎢ π3,3,1 ⎥⎢ ⎥⎢ π ⎥⎢ 3,4,1 ⎥⎢ π ⎥⎢ 3,5,1 ⎥⎢ ⎥⎢ π3,6,1 ⎥⎢ ⎥⎢ π3,7,1 ⎥⎢ ⎥⎢ π3,8,1 ⎥⎢ ⎥⎢ π3,9,1 ⎥⎢ ⎢ π2,10,4 ⎥ ⎥⎢π3,10,1 ⎦⎣π3,11,1 π3,12,1
D I 3 π3,2,2 π3,3,2 π3,5,2
π3,9,2 π3,10,2 π3,10,3 π3,11,2
⎤ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ π3,10,4 ⎥ ⎥ ⎦
9.2 Parameter Learning
⎡
π4,1,1 ⎢ π4,2,1 ⎢ ⎢ π4,3,1 ⎢ ⎢ π4,4,1 ⎢ ⎢ π4,5,1 ⎢ ⎢ π4,6,1 ⎢ ⎢ π4,7,1 ⎢ ⎢ π4,8,1 ⎢ ⎢ π4,9,1 ⎢ ⎢π4,10,1 ⎢ ⎣π4,11,1 π4,12,1
437
D I 4 π4,3,2 π4,6,2 π4,6,3 π4,7,2 π4,7,3 π4,10,2 π4,11,2 π4,11,3
⎤⎡
π5,1,1 ⎥ ⎢ π5,2,1 ⎥⎢ ⎥ ⎢ π5,3,1 ⎥⎢ ⎥ ⎢ π5,4,1 ⎥⎢ ⎥ ⎢ π5,5,1 ⎥⎢ ⎢ π4,6,4 ⎥ ⎥ ⎢ π5,6,1 ⎥ π4,7,4 ⎥ ⎢ ⎢ π5,7,1 ⎥ ⎢ π5,8,1 ⎥⎢ ⎥ ⎢ π5,9,1 ⎥⎢ ⎥ ⎢π5,10,1 ⎥⎢ π4,11,4 ⎦ ⎣π5,11,1 π1,12,1
D I 5 π5,3,2 π5,6,2 π5,6,3 π5,7,2 π5,7,3
π5,11,2 π5,11,3 π1,12,2
⎤ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ π5,6,4 ⎥ ⎥ π5,7,4 ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ π5,11,4 ⎦
(9.2)
The j-th row in D I r lists the parameters of Item j for Rank r . When the indegree (i.e., number of parents) of Item j at Rank r is din,r j , the number of parameters is Dr j = 2din,r j . That is, when the indegree is 0, 1, 2, · · · , the number of parameters is 20 = 1, 21 = 2, 22 = 4, · · · . For example, Item 1 has no parents at Rank 1; the parameter is thus only the CRR (π1,1,1 ) for the item at the rank. In addition, Item 2 has a parent (i.e., Item 1) at Rank 1; the parameters are thus the CCRRs of Item 2 when Item 1 is passed or failed (π1,2,1 or π1,2,2 ). Moreover, Item 10 has two parents (i.e., Items 8 and 9) in Rank 2. Thus, the four parameters (π2,10,1 , π2,10,2 , π2,10,3 , and π2,10,4 ) are set corresponding to the four PIRPs (00, 01, 10, and 11). Furthermore, a set layering the parameter sets for all ranks is the local dependence parameter set, which is given as D I = { D I 1 , D I 2 , D I 3 , D I 4 , D I 5 } = {πr, j,d } (R × J × Dmax ). This set is a three-dimensional array and has blank cells. In addition, Dmax is Dmax = max{2din,11 , . . . , 2din,r j , . . . , 2din,RJ } = max{D11 , . . . , Dr j , . . . , DRJ }. In this case, Dmax = 4. The elements in the local dependence parameter set are the structural parameters because the number of elements in this set neither increases nor decreases regardless of the sample size.
438
9 Local Dependence Latent Rank Analysis
9.2.4 Likelihood The likelihood is a probability that data U are observed given the local dependence parameter set ( D I ), which is denoted as l(U| D I ). Because the likelihood is the product of the likelihoods of the students based on the natural assumption that the responses of each student are independent from those of the other students, it is expressed as l(U| D I ) =
S
l(us | D I ),
s=1
where l(us | D I ) represents the likelihood of Student s, which can be represented with the rank membership profile of Student s, ms , as follows: l(us | D I ) =
R
l(us | D I r )m sr .
(9.3)
r =1
For this reason, if Student s belongs to Rank 4 (m s4 = 1), for example, only the likelihood for Rank 4 is selected as follows: l(us | D I ) = l(us | D I 1 )0 × l(us | D I 2 )0 × l(us | D I 3 )0 × l(us | D I 4 )1 × l(us | D I 5 )0 = l(us | D I 4 ).
For a more detailed description on how to construct the likelihood of Student s for Rank r , l(us | D I r ), refer to Sect. 8.3.3 (p. 389). Considering a specific example of Student 4, from the data (J12S5000), the student responses are u4 = [100 1100 01001]. In addition, let the PIRPs of Student s for Rank r be denoted as sr = {γsr jd } (J × Dmax ), where γsr jd
1, if the PIRP to par ( j)o f Studentsis thed-th pattern = , 0, otherwise
then, the PIRPs of Student 4 at Rank 1 is derived as
9.2 Parameter Learning
439
⎡
41
1
⎢ 1 ⎢ ⎢1 ⎢ ⎢1 ⎢ ⎢ 1 ⎢ ⎢1 =⎢ ⎢1 ⎢ ⎢1 ⎢ ⎢1 ⎢ ⎢1 ⎢ ⎣1 1
00
⎤ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ (12 × 4). ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎦
At Rank 1, Item 1 is the parent of Item 2, and Student 4 passed Item 1; thus, the PIRP for Item 2 is thus the second pattern.10 Therefore, 1 is entered into the second column (γ4122 = 1). Similarly, the PIRP of the student for Item 5 is the second pattern (γ4152 = 1) because this student passed the parent of Item 5 (i.e., Item 4). Other items are sources with no parents, and their PIRPs are first11 ; thus a value of 1 is entered in the first column for the items. Likewise, the PIRPs of Student 4 for Ranks 2–5 are specified as follows: ⎡
1
⎢ ⎢ ⎢1 ⎢ ⎢1 ⎢ ⎢ ⎢ ⎢1 ⎢ ⎢1 ⎢ ⎢1 ⎢ ⎢1 ⎢ ⎢ ⎢ ⎣1 1
42 ⎤ 00 ⎥ 1 ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ 1 ⎥ ⎥ ⎥, ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ 1 ⎥ ⎦
43 ⎤ ⎡ 44 ⎤ 00 1 ⎢ 1 ⎥ ⎢1 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢1 ⎥ ⎢1 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢1 ⎥ ⎢1 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ 1 ⎥ ⎢1 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢1 ⎥ ⎢ 1⎥ ⎢ ⎥, ⎢ ⎥, ⎢1 ⎥ ⎢ 1⎥ ⎢ ⎥ ⎢ ⎥ ⎢1 ⎥ ⎢1 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢1 ⎥ ⎢1 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ 1 ⎥ ⎢1 ⎥ ⎢ ⎥ ⎢ ⎥ ⎣1 ⎦ ⎣ 1 ⎦ 1 1 ⎡
1
⎡ 45 ⎤ 1 ⎢1 ⎥ ⎢ ⎥ ⎢1 ⎥ ⎢ ⎥ ⎢1 ⎥ ⎢ ⎥ ⎢1 ⎥ ⎢ ⎥ ⎢ 1⎥ ⎢ ⎥. ⎢ 1⎥ ⎢ ⎥ ⎢1 ⎥ ⎢ ⎥ ⎢1 ⎥ ⎢ ⎥ ⎢1 ⎥ ⎢ ⎥ ⎣ 1 ⎦ 1
Note that because the DAGs of Ranks 2 and 3 are congruent, the PIRPs for the ranks are equal. That is, 42 = 43 .
10
When the indegree is 1, the failing and passing of the parent item have the first and second patterns, respectively. 11 When the indegree of an item is 0, the PIRP of the item is the first pattern because there is only one pattern, “no parent.”.
440
9 Local Dependence Latent Rank Analysis
In addition, the (6, 4)-th element in 44 is 1 (γ4464 = 1). This is because at Rank 4, Item 6 has two parents (i.e., Items 4 and 5), Student 4 passed both items (11), and this PIRP is the fourth pattern.12 Similarly, γ4,4,11,2 = 1 because the response pattern for the two parents of Item 11 (i.e., Items 8 and 9 at Rank 4) is the second pattern (01). Moreover, γ4,5,11,3 = 1 is because the response pattern for the two parents of Item 11 (i.e., Items 9 and 10 at Rank 5) is the third pattern (10). Furthermore, as can be seen from 41 to 45 , sr has a single 1 in each row. These PIRPs indicate which datum of each student will be used to estimate each parameter. For example, if Student 4 belongs to Rank 2 (although unknown at this time), out of the Rank 2 parameter set ( D I 2 ), the parameters that data u4 of Student 4 can contribute to are indicated by 42 , as follows: ⎡
42 D I 2
1
00
⎤
⎡
π2,1,1 ⎥ ⎢ π2,2,1 ⎥ ⎢ ⎥ ⎢ π2,3,1 ⎥ ⎢ ⎥ ⎢ π2,4,1 ⎥ ⎢ ⎥ ⎢ π2,5,1 ⎥ ⎢ ⎥ ⎢ π2,6,1 ⎥⎢ ⎥ ⎢ π2,7,1 ⎥ ⎢ ⎥ ⎢ π2,8,1 ⎥ ⎢ ⎥ ⎢ π2,9,1 ⎥ ⎢ ⎥ ⎢ π2,10,1 ⎥ ⎢ ⎦ ⎣ π2,11,1 π2,12,1
⎢ 1 ⎢ ⎢1 ⎢ ⎢1 ⎢ ⎢ 1 ⎢ ⎢1 =⎢ ⎢1 ⎢ ⎢1 ⎢ ⎢1 ⎢ ⎢ 1 ⎢ ⎣1 1 ⎡ π2,1,1 ⎢ π2,2,2 ⎢ ⎢ π2,3,1 ⎢ ⎢ π2,4,1 ⎢ ⎢ π2,5,2 ⎢ ⎢ π2,6,1 =⎢ ⎢ π2,7,1 ⎢ ⎢ π2,8,1 ⎢ ⎢ π2,9,1 ⎢ ⎢ π2,10,2 π2,10,3 ⎢ ⎣ π2,11,1 π2,12,1
⎤ π2,2,2 π2,3,2 π2,5,2
π2,9,2 π2,10,2 π2,10,3 π2,11,2
⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ π2,10,4 ⎥ ⎥ ⎦
⎤ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥. ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ π2,10,4 ⎥ ⎥ ⎦
Because there is only one 1 in each row of sr , only one parameter is picked from each row of the parameter set of Rank r ( D I r ). That is, the data of Student s (us ) are used to estimate J parameters out of the Rank r parameter set.
12
When the indegree is 2, there are four PIRPs (00, 01, 10, and 11), of which 11 is the fourth pattern.
9.2 Parameter Learning
441
Fig. 9.8 Four-dimensional PIRP array
In general, an array that collects all the PIRPs of Student s for all ranks can be represented as s = {γsr jd } (R × J × Dmax ). This array is three-dimensional. In this case, the size of this array is (5 × 12 × 4). Accordingly, an array that layers the PIRP arrays of all students, i.e., = {γsr jd } (S × R × J × Dmax ), becomes four-dimensional (Fig. 9.8). Although a four-dimensional array is not visually expressed, it is easy to create a multidimensional array in a calculation program. Using this PIRP array, the likelihood that the data of Student s (us ) are observed given the parameter set of Rank r ( D I r ), l(us | D I r ), is represented as l(us | D I r ) =
Dr j J j=1
=
γsr jd πr jd
d=1
Dr j J
u s j 1−
Dr j
γsr jd πr jd
1−u s j zs j
d=1 u
πr jds j (1 − πr jd )1−u s j
zs j γsr jd
.
(9.4)
j=1 d=1
For this reason, from u4 = [100 1100 01001], for example, the likelihood of Student 4 at Rank 2 can be properly constructed as
442
9 Local Dependence Latent Rank Analysis
l(u4 | D I 2 ) = π2,1,1 × (1 − π2,2,2 ) × (1 − π2,3,1 ) × π2,4,1 × π2,5,2 × (1 − π2,6,1 ) × (1 − π2,7,1 ) × (1 − π2,8,1 ) × π2,9,1 × (1 − π2,10,2 ) × (1 − π2,11,1 ) × π2,12,1 , where the terms for the chosen π correspond to the items that were passed, whereas those with (1 − π ) correspond to the failed items. The likelihood for Student s “at Rank r ” was constructed as Eq. (9.4). In addition, as in Eq. (9.3) (p. 438), by incorporating the rank membership profile of Student s (ms ), the likelihood of the student can be rewritten as l(us | D I ) =
Dr j R J
u
πr jds j (1 − πr jd )1−u s j
zs j γsr jd m sr
.
r =1 j=1 d=1
Consequently, the likelihood of all students is obtained as l(U| D I ) =
S
l(us | D I )
s=1
=
Dr j S R J us j z γ m πr jd (1 − πr jd )1−u s j s j sr jd sr .
(9.5)
s=1 r =1 j=1 d=1
9.2.5 Posterior Distribution A prior density should be assumed for each of the local dependence parameter set ( D I = {πr jd }), because the estimates of the parameters become more stable in most cases. The same prior should be assigned for each element in D I , and the beta density (see Sect. 4.5.3, p. 119) is a valid choice because it is the natural conjugate for this type of likelihood. The probability density function of the beta density is defined as β −1
pr (πr jd ; β0 , β1 ) =
πr jd1 (1 − πr jd )β0 −1 B(β0 , β1 )
,
where β0 and β1 are the hyperparameters of the beta density. Accordingly, the prior density for all parameters is expressed as pr ( D I ; β0 , β1 ) =
Dr j R J r =1 j=1 d=1
pr (πr jd ; β0 , β1 ).
9.2 Parameter Learning
443
Using this prior density, the posterior density of the local independence parameter set ( D I ) is represented as l(U| D I ) pr ( D I ; β0 , β1 ) pr (U) ∝l(U| D I ) pr ( D I ; β0 , β1 )
pr ( D I |U) =
Dr j S R J us j z γ m ∝ πr jd (1 − πr jd )1−u s j s j sr jd sr s=1 r =1 j=1 d=1
×
Dr j R J
β −1
πr jd1 (1 − πr jd )β0 −1 .
(9.6)
r =1 j=1 d=1
9.2.6 Estimation of Parameter Set Taking the logarithm of Eq. (9.6), the (expected) log-posterior density is obtained as ln pr ( D I |U) =
Dr j S R J
z s j γsr jd m sr u s j ln πr jd + (1 − u s j ) ln(1 − πr jd )
s=1 r =1 j=1 d=1
+
Dr j R J
{(β1 − 1) ln πr jd + (β0 − 1) ln(1 − πr jd )
r =1 j=1 d=1
=
Dr j S R J r =1 j=1 d=1
+
S
z s j γsr jd m sr u s j + β1 − 1 ln πr jd
s=1
z s j γsr jd m sr (1 − u s j ) + β0 − 1 ln(1 − πr jd ) .
s=1 (T ) The tentative estimate for the smoothed rank membership matrix (S(T ) = {ssr }, (T ) S × R) has already been obtained in Sect. 9.2.2 (p. 433). By substituting ssr for m sr in the above equation, we obtain
444
9 Local Dependence Latent Rank Analysis
ln pr ( D I |U) Dr j S R J (T ) = z s j γsr jd ssr u s j + β1 − 1 ln πr jd r =1 j=1 d=1
+
S
s=1
(T ) z s j γsr jd ssr (1 − u s j ) + β0 − 1 ln(1 − πr jd )
s=1
=
Dr j R J
(T ) (T ) S1r jd + β1 − 1 ln πr jd + S0r jd + β0 − 1 ln(1 − πr jd ) , (9.7)
r =1 j=1 d=1
where (T ) S1r jd
=
S
(T ) γsr jd ssr zs j u s j ,
s=1 (T ) S0r jd =
S
(T ) γsr jd ssr z s j (1 − u s j ).
s=1 (T ) (T ) In other words, S1r jd and S0r jd are the numbers of students passing and failing Item j, respectively, out of the Rank r students with the d-th PIRP for Item j. The value of D I that maximizes Eq. (9.7) is the MAP of D I .13 Differentiating Eq. (9.7) with respect to πr jd , all terms unrelated to πr jd vanish as follows:
d ln pr (πr jd |u j ) d ln pr ( D I |U) = . dπr jd dπr jd This is because all parameters in D I are mutually disjointed, this fact also implies that each parameter can be independently optimized. First, by setting the derivative to 0, (T ) (T ) S1r S0r d ln pr (πr jd |u j ) jd + β1 − 1 jd + β0 − 1 = − = 0, dπr jd πr jd 1 − πr jd
and then solving this equation with respect to πr jd , the MAP is found as πˆ r(MAP) = jd
(T ) S1r jd + β1 − 1 (T ) (T ) S0r jd + S1r jd + β0 + β1 − 2
.
Note that if the prior density is not assumed or if β0 = β1 = 1, the MLE of πr jd is obtained as 13
An MLE without the prior density.
9.2 Parameter Learning
445 L) πˆ r(M = jd
(T ) S1r jd (T ) (T ) S0r jd + S1r jd
.
Estimation Process of LD-LRA EM Cycle 1
LI-LRA LD-LRA
(0) R
EM Cycle T
(1) (T ) (T ) (1) (T ) M → M (1) → S → → · · · → → S → R R R R −−−→ ˆ DI → M ˆ R
To summarize, the difference in the parameter estimation process between the normal (local independence) and local dependence LRAs is shown in Estimation Process of LD-LRA . In the normal LRA, the estimation process ends ) at Cycle T by calculating the rank reference matrix (T R from the smoothed rank (T ) membership matrix (S ). In other words, the estimates of the rank reference matrix and the rank membership matrix are ) ˆ R = (T R , ) ˆ R = M (T M R .
ˆ D I is calculated and treated as Meanwhile, in the LD-LRA, after obtaining S(T ) , 14 the final estimate for the parameter set. Estimation Setting of LD-LRA 1. 2. 3. 4. 5. 6. 7.
Data: J12S5000 (www) Number of latent ranks: R = 5 Constant of convergence criterion: c = 10−4 Prior probability of rank membership: π = 1/5 × 15 Filter kernel: f = [0.1 0.2 0.4 0.2 0.1] Estimator: MAP estimator Prior of πr jd : Beta distribution with (β0 , β1 ) = (2, 2)
The settings for the present data analysis are shown in Estimation Setting of LD-LRA , where Point 3 indicates the convergence criterion of the EM algorithm under the normal LRA with R = 5. As a result of the analysis, the parameter set estimate was obtained as follows:
(T )
In practice, to determine whether the T -th cycle is final, R must be computed to evaluate the convergence of the EM algorithm.
14
446
9 Local Dependence Latent Rank Analysis
ˆ DI1 ⎤⎡ 0.549 0.456 0.000 0.000 ⎢ 0.030 0.444 ⎥⎢ 0.035 ⎢ ⎥⎢ ⎢ 0.083 ⎥⎢ 0.020 ⎢ ⎥⎢ ⎢ 0.421 ⎥⎢ 0.495 ⎢ ⎥⎢ ⎢ 0.101 0.240 ⎥⎢ 0.148 ⎢ ⎥⎢ ⎢ 0.025 ⎥⎢ 0.066 ⎢ ⎥⎢ ⎢ 0.016 ⎥⎢ 0.045 ⎢ ⎥⎢ ⎢ 0.286 ⎥⎢ 0.407 ⎢ ⎥⎢ ⎢ 0.326 ⎥⎢ 0.264 ⎢ ⎥⎢ ⎢ 0.181 ⎥⎢ 0.081 ⎢ ⎥⎢ ⎣ 0.106 ⎦⎣ 0.041 0.055 0.086 ⎡
⎡
0.836 ⎢ 0.720 ⎢ ⎢ 0.058 ⎢ ⎢ 0.740 ⎢ ⎢ 0.635 ⎢ ⎢ 0.008 ⎢ ⎢ 0.010 ⎢ ⎢ 0.760 ⎢ ⎢ 0.805 ⎢ ⎢ 0.150 ⎢ ⎣ 0.064 0.227
ˆ DI4
ˆ DI2
⎤⎡
0.683 ⎥⎢ 0.040 0.568 ⎥⎢ ⎥⎢ 0.032 0.459 ⎥⎢ ⎥⎢ 0.612 ⎥⎢ ⎥⎢ 0.227 0.351 ⎥⎢ ⎥⎢ 0.205 ⎥⎢ ⎥⎢ 0.156 ⎥⎢ ⎥⎢ 0.581 ⎥⎢ ⎥⎢ 0.330 0.734 ⎥⎢ ⎢ 0.133 0.159 0.745 ⎥ ⎥⎢ 0.092 ⎦⎣ 0.056 0.445 0.152
⎤⎡
0.931 ⎥ ⎢ 0.869 ⎥⎢ ⎥ ⎢ 0.099 0.713 ⎥⎢ ⎥ ⎢ 0.846 ⎥⎢ ⎥ ⎢ 0.811 ⎥⎢ ⎢ 0.105 0.023 0.684 ⎥ ⎥ ⎢ 0.015 ⎥ 0.031 0.039 0.542 ⎥ ⎢ ⎢ 0.016 ⎥ ⎢ 0.880 ⎥⎢ ⎥ ⎢ 0.912 ⎥⎢ ⎥ ⎢ 0.825 0.844 ⎥⎢ 0.124 0.105 0.825 ⎦ ⎣ 0.082 0.153
ˆ DI3
⎤
⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ 0.556 ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ 0.845 ⎥ 0.160 0.211 0.843 ⎥ ⎥ ⎦ 0.636 0.728 0.617
ˆ DI5
⎤
⎥ ⎥ ⎥ 0.789 ⎥ ⎥ ⎥ ⎥ ⎥ 0.125 0.040 0.788 ⎥ ⎥. 0.034 0.064 0.650 ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ 0.190 0.216 0.915 ⎦ 0.341
Figure 9.9 also shows these parameter estimates displayed on the latent rank DAGs. For example, the CRR of Item 1 at Rank 1 is 0.456, and increases to 0.549, 0.683, 0.836, and 0.931, corresponding to Ranks 2, 3, 4, and 5. The CRR of an item generally increases as the rank increases. In addition, for Item 1 → 2 at Rank1, the CCRRs of Item 2 are 0.030 and 0.444 when the parent (i.e., Item 1) is incorrect and correct, respectively. The small numbers next to the CCRRs are the PIRPs, where 0 and 1 indicate that the parent (i.e., Item 1) as failed and passed, respectively. Naturally, given that the parent is correct, the CCRR is larger. This edge of Item 1 → 2 is also set at Ranks 2 and 3, and the CCRR increases with the rank if the parent is passed. Moreover, Rank 2 has a structure of Item 8 → 10 ← 9.15 There are two parents for Item 10. When an item has two parents, there are four PIRPs (i.e., 00, 01, 10, and 11), and thus four CCRRs are estimated. For example, the CCRR of Item 10 is 0.159 when the parents (i.e., Items 8 and 9) are (1, 0), whereas it increases to 0.745 15
This type of structure is called a converging connection (Sect. 8.2.4, p. 375).
9.2 Parameter Learning
447
Fig. 9.9 Parameter estimates for Latent rank DAG
when the parents are (1, 1), which suggests that the abilities to pass Items 8 and 9 are the bases for the ability to pass Item 10. This fact also implies that the relationship between these two parent items is uncompensatory. If the CCRR of Item 10 by students passing Item 8 but failing Item 9 (i.e., the PIRPs of the students are 01) are large, the ability required for Item 9 can compensate (cover) for the ability for Item 8. In reality, because the CCRR is 0.133, the ability required for Item 9 does not compensate for the ability to solve Item 8. This is also suggested by the contents of Items 8, 9, and 10 (Table 9.2, p. 432). It
448
9 Local Dependence Latent Rank Analysis
is indeed difficult to correctly answer Item 10 unless both Items 8 and 9 are passed. As for a testlet in a math test, the parent items do not compensate for one another in most cases. This is because the answers for the parent items of an item are often used to solve the item. In other words, the parent items constitute the stairs to the final item (step) in the testlet. By examining these latent rank DAGs in detail, one can capture the characteristics of each rank and may have an idea of how to post-instruct the students belonging to the respective ranks after testing. Local Dependence Latent Class Analysis When the student clusters are not ordered, each cluster is nominal (LCA), not ordinal (LRA). In addition, if no items at each class are independent, the LCA is the local dependence LCA (LD-LCA), where each latent class is the locus. By contrast, the normal LCA (Chap. 5, p. 155) can be regarded as the local independence LCA. The parameter set for the LD-LCA can be estimated using EM Cycle 1
LI-LCA LD-LCA
C(0)
EM CycleT
. → M C(1) → C(1) → · · · → M C(T ) → C(T ) −−−→ ˆ DI → M ˆC
In this flowchart, C and M C are the class reference and class membership matrices, respectively. That is, if only the step for smoothing the membership matrix is skipped from each cycle during the LD-LRA estimation process, the results for the LD-LCA can be obtained. To simplify, it is empirically known that the parameters can be obtained through the following procedure: EM Cycle 1
LD-LCA
(0) DI
EM CycleT
(T ) (T ) M → M C(1) → (1) → · · · → → DI C DI .
The values at the final cycle are then treated as the estimate of the parameters, as follows: ) ˆ D I = (T DI ,
ˆ C = M C(T ) . M
9.2 Parameter Learning
449
9.2.7 Estimation of Rank Membership Matrix As the next step, the rank membership matrix (M R ) is obtained from the estiˆ D I ). The estimation procedure in mate of the local dependence parameter set ( Estimation Process of LD-LRA is shown again below. EM Cycle 1
LI-LRA (0) R LD-LRA
EM Cycle T
(1) (T ) (T ) (1) (T ) . M → M (1) → S → → · · · → → S → R R R R −−−→ ˆ DI → M ˆ R
In a normal LRA (LI-LRA), the estimate for the rank membership matrix is ) ˆ R = M (T M R ;
however, this cannot be regarded as the rank membership matrix estimate in the LD) (T −1) and obtained along the LI-LRA estimation LRA, because M (T R is made from R ˆ R = {mˆ sr } (S × R), should process. The rank membership matrix for the LD-LRA, M ˆ D I . From Eq. (9.4) (p. 441), the membership for which Student s be created from belongs to Rank r can be calculated as ˆ D I r )πr l(us | , mˆ sr = R ˆ r =1 l(us | D I r )πr where πr is the prior probability that the student belongs to Rank r . ˆ R . For example, Table 9.4 shows the estimate of the rank membership matrix, M the membership for which Student 1 belongs to Rank 3 is 0.389 (38.9%). Because this is the highest value among the five memberships, Rank 3 is selected as the latent rank estimate (LRE in the table) for the student. In addition, the rank-up odds in the table are the ratio of the membership of Student s’s LRE ( Rˆ s ), mˆ s Rˆ s , to that of the next rank ( Rˆ s + 1), mˆ s, Rˆ s +1 . That is, RUO s =
mˆ s, Rˆ s +1 mˆ s, Rˆ s
,
which measures the possibility of moving from the current rank to a rank that is one level higher; in addition, the larger the value, the higher the possibility of the student moving up to the next rank. Conversely, the rank-down odds are calculated as RDO s =
mˆ s, Rˆ s −1 mˆ s, Rˆ s
,
which quantifies the possibility of moving from the current rank to a rank that is one level lower.
450
9 Local Dependence Latent Rank Analysis
Table 9.4 Rank membership estimate and latent rank estimate (LD-LRA) Rank Student 1 Student 2 Student 3 Student 4 Student 5 Student 6 Student 7 Student 8 Student 9 .. . Student 5000 LRD∗4 RMD∗5 ∗1 ∗4
1 0.000 0.428 0.000 0.118 0.000 0.536 0.427 0.688 0.034 .. . 0.536 1829 1121.84
2 0.050 0.422 0.000 0.223 0.001 0.325 0.384 0.238 0.362 .. . 0.383 593 1087.86
3 0.389 0.121 0.006 0.380 0.016 0.111 0.160 0.070 0.401 .. . 0.081 759 873.80
4 0.239 0.029 0.214 0.272 0.673 0.027 0.028 0.004 0.192 .. . 0.001 569 835.53
5 0.322 0.000 0.780 0.007 0.310 0.000 0.001 0.000 0.010 .. . 0.000 1250 1080.98
LRE∗1 3 1 5 3 4 1 1 1 3 .. . 1
RUO∗2 0.613 0.986 0.717 0.461 0.607 0.898 0.346 0.479 .. . 0.715
RDO∗3 0.128 0.274 0.587 0.025
0.902 .. .
latent rank estimate, ∗2 rank-up odds, ∗3 rank-down odds latent rank distribution, ∗5 rank membership distribution
The rank-up and rank-down odds for Student 1 are obtained as mˆ 14 0.239 = 0.613, = mˆ 13 0.389 mˆ 12 0.050 RDO 1 = = 0.128. = mˆ 13 0.389 RUO 1 =
Accordingly, although it is observed that the ability of Student 1 is valid for Rank 3, it is easier to level up to Rank 4 rather than level down to Rank 2. Figure 9.10 illustrates the rank membership profiles of the first nine students. Note that the rank membership profile of Student s is the s-th row vector in the rank membership matrix. It can be observed from the figure that both Students 1 and 9 belong to Rank 3; however, Student 9 also has a high membership to Rank 2 and is likely to drop to Rank 2. The rank-down odds of Student 9 are indeed 0.902. The latent rank distribution (LRD) and rank membership distribution (RMD) shown in the bottom of Table 9.4 are plotted in Fig. 9.11. The LRD is the frequency distribution of the LREs of all students. The sum of the LRD is thus the sample size S (in this case, 5000). Rank 1 has the most students at 1829 (36.5% of 5000), whereas Rank 4 has the fewest students at 569 (11.4%). The RMD is the column sum vector
9.2 Parameter Learning
451
0.8
0.8
0.8
0.6 0.4 0.2
Membership
1.0
Membership
Membership
1.0 0.6 0.4 0.2
1
2
3
4
1
5
2
0.4 0.2
3
4
1
5
Student 4
Student 5
0.8
0.8
0.8
0.4 0.2 0.0
Membership
1.0
Membership
1.0
0.6
0.6 0.4 0.2 0.0
2
3
4
5
0.4
2
3
4
0.2
5
1
Student 7
Student 8 0.8
0.8
Membership
0.8
Membership
1.0
0.0
0.6 0.4 0.2 0.0
2
3
4
5
3
4
5
Student 9
1.0
1
2
Latent Rank
1.0
0.2
5
0.6
Latent Rank
0.4
4
0.0 1
Latent Rank
0.6
3
Student 6
1.0
1
2
Latent Rank
Latent Rank
Latent Rank
Membership
0.6
0.0
0.0
0.0
Membership
Student 3
Student 2
Student 1 1.0
0.6 0.4 0.2 0.0
1
2
Latent Rank
3
4
5
1
Latent Rank
2
3
4
5
Latent Rank
Fig. 9.10 Rank membership profiles (LD-LRA) Fig. 9.11 LRD and RMD (LD-LRA)
2000 1829
Frequency
1500 1250
1000 759
500
0
593
1
2
569
3
4
5
Latent Rank
of the rank membership matrix. For example, from Table 9.4, the first element in the RMD is 1121.84, which is the sum of the memberships of 5000 students in Rank 1. That is, 1121.84 = 0.000 + 0.428 + 0.000 + 0.118 + · · · + 0.536. In general, the r -th element in the RMD is given by
(9.8)
452
9 Local Dependence Latent Rank Analysis
RMDr = mˆ 1r + mˆ 2r + · · · + mˆ Sr , which is the sum of the r -th column in the rank membership matrix. Therefore, all elements in the RMD are computed by ˆ R 1 S (R × 1). dˆ RMD = M The sum of this vector is also S (in this case, 5000). As shown in Eq. (9.8), the membership of Student 2 in Rank 1 is 0.428, which is the largest value in the rank membership profile of the student. The LRE of the student is thus Rank 1. However, the LRD adds this largest membership as +1 to Rank 1, even though the student’s likelihood of belonging to the rank is only 0.428 (42.8%), whereas the RMD adds it as +0.428. Thus, the ambiguity of belonging to the rank is considered. Accordingly, the LRD can be regarded as the frequency distribution of the 5000 students, whereas the RMD can be regarded as the frequency distribution of the population of the 5000 samples. As shown in Fig. 9.11, the RMD is usually flatter than the LRD.
9.2.8 Marginal Item Reference Profile Let π jr be denoted as the CRR of the students belonging to Rank r for Item j, which can be written as π jr = Pr ( j|r ), while πr jd in Sect. (9.2.6) can be likewise formulated as πr jd = Pr ( j|r, d), where d implies the d-th PIRP. The following relationship can thus be derived from these two equations. Pr ( j|r ) =
Dr j
Pr ( j|r, d)Pr (d),
d=1
where Dr j is the number of PIRPs of Item j at Rank r (Dr j = 2din,r j ). In addition, Pr (d) is the of students with the d-th PIRP out of Rank r students who respond ratio S mˆ sr z s j ).16 That is, to Item j ( s=1
16
The responses of those for whom the item is presented but do not respond to the item are included as “nonresponses (false answers)” (see Table 2.2, p. 16).
9.2 Parameter Learning
453
Table 9.5 Marginal rank reference matrix (LD-LRA)
Latent Rank Item 1 Item 2 Item 3 Item 4 Item 5 Item 6 Item 7 Item 8 Item 9 Item 10 Item 11 Item 12 TRP2 1 correct
1 0.456 0.210 0.083 0.421 0.156 0.025 0.016 0.286 0.326 0.181 0.106 0.055 2.321
2 0.549 0.296 0.140 0.495 0.239 0.066 0.045 0.407 0.441 0.298 0.193 0.086 3.255
3 0.683 0.474 0.316 0.612 0.432 0.205 0.156 0.581 0.624 0.498 0.387 0.152 5.121
2 test
response rate (item mean),
S Pr (d) =
s=1
S
4 0.836 0.720 0.554 0.740 0.635 0.385 0.304 0.760 0.805 0.650 0.565 0.227 7.179
reference profile
mˆ sr z s j γsr jd
s=1
CRR1 0.678 0.514 0.350 0.613 0.445 0.251 0.199 0.565 0.606 0.475 0.400 0.162
5 0.931 0.869 0.741 0.846 0.811 0.631 0.517 0.880 0.912 0.825 0.808 0.317 9.090
mˆ sr z s j
.
Accordingly, from πˆ r jd , πˆ jr is obtained as πˆ jr =
Dr j
S πˆ r jd
s=1
S
mˆ sr z s j γsr jd
s=1
d=1
mˆ sr z s j
.
Thus, πˆ jr can be regarded as a marginalized πˆ r jd with respect to the PIRPs. The matrix containing all πˆ jr s is ⎡
ˆ MR
πˆ 11 · · · ⎢ .. . . =⎣ . . πˆ J 1 · · ·
⎤ πˆ 1R .. ⎥ = {πˆ } (J × R), jr . ⎦ πˆ JR
which is called the marginal rank reference matrix. ˆ MR ), where Table 9.5 shows the estimate of the marginal rank reference matrix ( the j-th row vector is the marginal IRP of Item j. The three panels on the left in Fig. 9.12 show these marginal IRPs separately by testlet and show that every marginal IRP monotonically increases, indicating that the CRR of each item increases with rank. The marginal IRP of each item is makes it easy to check the CRRs of the item.
9 Local Dependence Latent Rank Analysis
Testlet 1
Testlet 1
1.0 0.8
1
0.6 0.4 0.2 0.0
1 2 3
1 2 3
2 3
1 2 3
1 2 3
3.0
Expected Score
Correct Response Rate
454
2.5 2.0 1.5 1.0 0.5 0.0
1
2
3
4
1
5
2
Testlet 2 4
0.8
4
4 5
5 6 7
5 6 7
6 7
1
2
3
0.6 0.4
4
0.2
4 5
4 5 6 7
6 7
3 2 1 0
4
1
5
2
Testlet 3
0.6
0.2 0.0
9 8 10 11 12
9 8 10 11 12
1
2
9 8 10 11 12
4
5
Testlet 3
1.0 0.8
3
Latent Rank
9 8 10 11 12
9 8 10 11
12
5
Expected Score
Correct Response Rate
Latent Rank
0.4
5
Testlet 2
1.0
0.0
4
Latent Rank
Expected Score
Correct Response Rate
Latent Rank
3
4 3 2 1 0
3
4
5
1
Latent Rank
2
3
4
5
Latent Rank
Fig. 9.12 Marginal IRP (Left) and testlet reference profile (Right)
In addition, within each testlet, the marginal IRP with a larger item label is placed lower.17 For example, in Testlet 1, Item 2 is plotted lower than Item 1, and Item 3 is lower than Item 2. This implies that the subsequent items in the testlet are more difficult. This type of interitem relationship is common in a testlet (see IRP Ordinality and Graded Response , p. 457). Table 9.6 shows the IRP indices (item slope, location, and monotonicity indices; Kumagai 2007) that summarize the IRP shape and can also be used to summarize the shape of the marginal IRP (see Sect. 6.4, p. 223). As for the item slope index, a˜ 17
9.
Items 8 and 9 are the only exceptions, and the marginal IRP of Item 8 is lower than that of Item
9.2 Parameter Learning
455
Table 9.6 IRP indices (LD-LRA) Index Item 1 Item 2 Item 3 Item 4 Item 5 Item 6 Item 7 Item 8 Item 9 Item 10 Item 11 Item 12
α˜ 3 3 3 3 3 4 4 3 2 2 4 4
a˜ 0.152 0.246 0.238 0.128 0.203 0.246 0.214 0.179 0.183 0.201 0.243 0.090
β˜ 1 3 4 2 3 4 5 3 2 3 4 5
b˜ 0.456 0.474 0.554 0.495 0.432 0.385 0.517 0.581 0.441 0.498 0.565 0.317
γ˜ 0 0 0 0 0 0 0 0 0 0 0 0
c˜ 0 0 0 0 0 0 0 0 0 0 0 0
is the largest rise between an adjacent pair in the marginal IRP, and α˜ is the smaller rank of the pair. As for the marginal IRP of Item 2, the largest rise is 0.246, which is observed between Ranks 3 and 4 (0.474 and 0.720, respectively). The item slope index is thus (α˜ 2 , a˜ 2 ) = (3, 0.246). In addition, in the item location index, b˜ represents the IRP value of a rank that is closest to 0.5 and β˜ indicates that rank.18 For example, for Item 2, the CRR is 0.474 at Rank 3, which is the closest to 0.5. Accordingly, (β˜3 , b˜3 ) = (3, 0.474). Finally, in the item monotonicity index, γ˜ indicates the ratio of the adjacent ranks, where the IRP drops to all adjacent ranks, and c˜ represents the cumulative drops. Clearly, this index takes (γ˜ , c) ˜ = (0, 0) if the IRP monotonically increases and (γ˜ , c) ˜ = (1, −1) if it monotonically decreases. The table shows that the monotonicity index of each item obtained is (γ˜ , c) ˜ = (0, 0) because all items monotonically increase. Furthermore, each panel on the right-hand side in Fig. 9.12 plots the sum of the marginal IRPs for each testlet. This plot can be referred to as the testlet reference profile. For example, the testlet reference profile of Testlet 1 is the sum of the marginal IRPs of the three items in Testlet 1. In addition, the vertical axis represents the expected NRS of the testlet. The profile of a testlet can be used to identify the characteristics of the testlet. If it were horizontal, the entire testlet would not be discriminative, which is an extremely grave situation. The five column vectors of the marginal rank reference matrix (Table 9.5) are plotted in Fig. 9.13. The r -th column vector in the matrix is called the marginal rank reference vector of Rank r . The bottom and top plots in the figure are those of Ranks 1 and 5, respectively. The dashed line plot is for the CRRs of all students. 18
The criterion is not necessarily “closest to 0.5.” It can be “closest to 0.6” or “the first exceeding 0.8.”.
456
9 Local Dependence Latent Rank Analysis 1.0
5
3
2
Item 5
Item 4
Item 3
4
3
3 2
4
1
0.0 Item 2
4
5
1
2 3 1
1
3
5
2 2 1
2 1
2 1
4 3 2 1 Item 12
3
4
2 1
3
5
3
1
5
Item 11
3 2
Item 1
5
4
Item 10
2 1
0.4
0.2
4
4
2 1
4
4 3
0.6
5
5
Item 9
5
5
Item 8
4
5
Item 7
3
5
Item 6
Correct Response Rate
0.8
4
Fig. 9.13 Marginal rank reference vector (LD-LRA)
As the rank increases, the plot is placed higher. In addition, the plots steeply rise from Item 3 to 4 and from Item 7 to 8. This is because these item pairs are the boundaries between two different testlets: The former pair is between Testlets 1 and 2, and the latter is between Testlets 2 and 3. In addition, the first item in a testlet is generally easy, and the last item is hard. This type of structure is often viewed in a math or science test that uses testlets. The marginal rank reference vector for each rank gives the rough characteristics of the rank.19 In addition, the large gap between plots represents the significant difference between the ranks. For example, Items 8, 9, and 10 are spread significantly between the plots of Ranks 1 and 2, which implies that the abilities of Rank 1 students required for passing the items are greatly inferior to those of the Rank 2 students.
19
The detailed feature of each rank is clear from Fig. 9.9 (p. 447).
9.2 Parameter Learning
457
IRP Ordinality and Graded Response The ordinality seen in marginal IRPs in a testlet, as shown in Fig. 9.12 (p. 454), is a feature of a testlet with strong local dependency. This ordinality is often observed in a testlet where the answer to a previous item is used to pass the subsequent item. In such a testlet, let K and ttestlet be the numbers of items and correct responses in the testlet, respectively. Their relations are mostly as follows: ::::: ttestlet 0 1 2 .. . K
Correctly Responded Item(s) No item First item in the testlet First two items in the testlet .. . All K items in the testlet
Therefore, as described in Footnote 4 (p. 429), the number of correct responses in a testlet being k (∈ {0, 1, 2, · · · , K }) can be regarded as a graded response of k, and it is effective to analyze such graded response data using a testlet model. Meanwhile, in a testlet with a weak local dependency, passing k items in the testlet does not necessarily mean passing the first k items. Note that this IRP ordinality in a testlet does not imply that all students will pass the items in the testlet in a stepwise manner because the IRP value at each rank is the mean of the binary data of the students belonging to the rank. That is, the ordinality in the full student group does not necessarily mean that all individual students will meet that ordinality. In the present data with a sample size of 5000 (J12S5000), there are 56 students who failed Item 1 but passed Item 2. As described above, although the students who pass k students who pass the first k items, exceptions are always items are mostly ::::: present. When the number of items passed in a testlet is regarded as a graded response, the responses of such students passing only Item 2 or only Item 1 are equally coded as 1. Whether the two different response patterns, i.e., (Item 1, 2) is (0, 1) or (1, 0), are regarded as the same graded response may depend on the purpose of the analysis. In a psychological questionnaire and social survey, Likert scales are often employed with the following choices: “strongly disagree,” “disagree,” “neither agree nor disagree,” “agree,” and “strongly agree.” This is a 5-point Likert item, and 3- to 7-point Likert items are common. For a K -point Likert item, the datum is generally coded as a graded response: 0, 1, 2, · · · , K − 1 (or 1, 2, 3, · · · , K ).
458
9 Local Dependence Latent Rank Analysis
Fig. 9.14 Test Reference Profile (LD-LRA)
12
Expected Score
10
1829
8 1250
6 4
759 593
569
2 0
1
2
3
4
5
Latent Rank
9.2.9 Test Reference Profile The test reference profile (TRP) is the column sum vector of the marginal rank reference matrix MR . That is, ˆ MR 1 J . ˆt TRP = The TRP represents the expected NRS of the test at each rank and is shown in the bottom line of Table 9.5. The TRP (and LRD) is plotted in Fig. 9.14, which shows that the TRP monotonically increases. Thus, the latent rank scale satisfies the weakly ordinal alignment condition (WOAC). With the LD-LRA, as well as the LRA and ranklustering, it is important to satisfy the WOAC because this fact indicates that the student clusters obtained by the analysis are ordinal classes (i.e., ranks). If the WOAC remains unsatisfied, the following can be attempted: For TRP Satisfying Weakly Ordinal Alignment Condition 1. Reduce the number of latent ranks 2. Use a filter with a flatter and larger sized kernel As specified in Point 1, the TRP is likely to monotonically increase as the number of ranks, R, is smaller. This mechanism was described in detail in Figure 7.13 (p. 290). In summary, when R is small, the marginal IRP of each item is inclined to increase monotonically, and as a result, the TRP also tends to increase monotonically. For Point 2, if a flatter kernel is employed, the TRP is more likely to monotonically increase. A flatter kernel is one with a smaller center value ( f 0 in Eq. (9.1), p. 434) or a longer length (L ∗ = 2L + 1). The flatter the kernel is, the stronger the smoothing filter and the smoother the rank membership matrix. The marginal IRP of each item
9.2 Parameter Learning
459
also tends to be smoother and monotonically increase; the TRP is thus likely to monotonically increase. Two Levels of Ordinality of Latent Rank Scale Strongly Ordinal Alignment Condition The IRPs of all items monotonically increase, and the TRP then monotonically increases as well. Weakly Ordinal Alignment Condition Although TRP monotonically increases, not all IRPs monotonically increase. If the marginal IRPs of all items increase monotonically, the TRP being the sum of the marginal IRPs also increases monotonically. This situation, called a strongly ordinal alignment condition (SOAC), is satisfied. In this analysis, as shown in Fig. 9.12, the SOAC holds. Although the basis for the ordinality of the latent rank scale does not necessarily resort to the TRP, alternative evidence is required to prove the ordinality if the TRP does not monotonically increase. If one can find an alternative standard, the student clusters can be viewed as ranks.
9.3 Model Fit The fitness of the model can be examined after the parameters are estimated. For the analysis model shown in Fig. 9.9 (p. 447), the fitness is evaluated by comparing it to those of the null and benchmark models.
9.3.1 Benchmark Model The benchmark model better fits the data. Although the benchmark model used is the choice of the analyst, in this book, an identical benchmark model is shared among the chapters, but expressed differently. As shown in Fig. 9.15 (left), the benchmark model can be expressed as the local independence model with G loci, where G is the number of NRS patterns. When the number of items is J , the maximum possible number of NRS patterns is J + 1 {0, 1, 2, . . . , J }. This test being analyzed has 12 items, and the maximum number of patterns is thus 13. Although not all of the NRS patterns can always be observed, all 13 patterns were observed in these test data; thus, G = 13 here. A matrix collecting the CRRs of all G groups for all J items is called the group reference matrix. That is,
460
9 Local Dependence Latent Rank Analysis
Fig. 9.15 DAGs of benchmark and null models
⎤ p11 · · · p1G ⎥ ⎢ P G = ⎣ ... . . . ... ⎦ = { p jg } (J × G), pJ1 · · · pJ G ⎡
where p jg is the CRR of Group g for Item j. In addition, a matrix collecting the memberships of all S students to all G groups is referred to as the group membership matrix, which is ⎤ m 11 · · · m 1G ⎥ ⎢ M G = ⎣ ... . . . ... ⎦ = {m sg } (S × G), m S1 · · · m SG ⎡
where m sg =
1, if Studentsbelongs to Groupg 0, otherwise
.
Each student belongs to any of the G groups. The sum of each row in M G is thus 1, as follows: M G 1G = 1 S .
9.3 Model Fit
461
Using the group reference and membership matrices ( P G and M G ), the likelihood of the benchmark model can be calculated as l B (U| P G ) =
S G J
u
p jgs j (1 − p jg )1−u sg
m sg
.
s=1 g=1 j=1
The log-likelihood of the benchmark model is thus obtained as ll B (U| P G ) =
S G J
m sg u s j ln p jg + (1 − u sg ) ln(1 − p jg )
s=1 g=1 j=1
= −21318.47. In addition, the number of parameters (NP) of the benchmark model is the number of elements in the group reference matrix, J G.
9.3.2 Null Model The null model fits the data worse than the analysis model. Although the designation of the null model also depends on the analyst, a single model, i.e., the (single-group) global independence model (Fig. 9.15, right), is consistently used in this book. This global independence model is identical with the local independence model when there is only a single group; however, in this case, it is not referred to as “locally independent” but rather “globally independent.” Each item must be globally related with other items (a dependency structure must be present among the items), and the null model fits the data poorly. The parameter set of the null model is the vector of the CRRs of the items, which can be regarded as a group reference matrix with a single column. Accordingly, the NP of the null model is J . The CRR vector is ⎡ ⎤ p1 ⎢ .. ⎥ p = ⎣ . ⎦ = { p j } (J × 1), pJ where p j is the CRR of Item j. Therefore, the likelihood of data U under the null model is defined as l N (U| p) =
S J s=1 j=1
and the log-likelihood is computed as
u
p j s j (1 − p j )1−u sg ,
462
9 Local Dependence Latent Rank Analysis
ll N (U| p) =
S J
u s j ln p j + (1 − u sg ) ln(1 − p j )
s=1 j=1
= −37736.23. The χ 2 value of the null model, which is twice the difference between the loglikelihoods of the benchmark and null models, is calculated as χ N2 = 2 × {ll B (U| P G ) − ll N (U| p)} = 2 × {−21318.47 − (−37736.23)} = 32835.53. Furthermore, the DF of the null model, which is the difference of the NPs of these two models, is obtained as d f N = N PB − N PN = J G − J = 12 × (13 − 1) = 144.
9.3.3 Analysis Model The analysis model is the model currently applied to the data, and its fitness to the data needs to be evaluated. This model is normally worse-fitting than the benchmark model and better-fitting than the null model. The fitness of the analysis model is considered to be good/bad if it is close to that of the benchmark/null model. Substituting the estimates of the local independence parameter set and rank memˆ R ) in Eq. (9.5), the likelihood of the analysis model can ˆ D I and M bership matrix ( be defined as ˆ DI ) = l A (U|
Dr j S R J us j z γ mˆ πˆ r jd (1 − πˆ r jd )1−u s j s j sr jd sr . s=1 r =1 j=1 d=1
Then, the log-likelihood of the analysis model is computed as ˆ DI ) = ll A (U|
Dr j S R J
z s j γsr jd mˆ sr u s j ln πˆ r jd + (1 − u s j ) ln(1 − πˆ r jd )
s=1 r =1 j=1 d=1
= −26657.78. The closer the log-likelihood is to 0, the more approximate the likelihood (i.e., the occurrence probability of the data) is to 1, which implies that the fitness of the model is better. Accordingly, the likelihoods and log-likelihoods of the benchmark, analysis,
9.3 Model Fit
463
and null models normally have the following inequalities20 : 0 < l N < l A < l B < 1, ll N < ll A < ll B < 0. These inequalities hold in this analysis as well. In addition, the χ 2 value of the analysis model, which is twice the difference between the log-likelihoods of the benchmark and analysis models, is computed as ˆ D I )} χ A2 = 2 × {ll B (U| P G ) − ll A (U| = 2 × {−21318.47 − (−26657.78)} = 10678.64. The NP of the analysis model is the number of elements in the local dependence parameter set, D I in Eq. (9.2) (p. 437). These parameters are derived from the local dependence DAGs (Fig. 9.7, p. 432). In fact, the parameters corresponding to the unobserved PIRPs are not (cannot be) estimated. For example, if Item j has three parents at Rank r , there are 23 = 8 PIRPs, and the eight parameters (πr j1 , . . . , πr j8 ) corresponding to the eightPIRPs are set. However, if all eight PIRPs are not observed (see Indegree and PIRP , p. 385), for example, if pattern 010 is not observed, then it is no longer necessary to estimate the corresponding parameter (πr j3 ).21 To determine whether each PIRP was observed, the PIRP array ( = {γsr jd }) can be referenced. For example, Sr(Tjd) =
S
(T ) z s j ssr γsr jd
s=1
=
S
(T ) z s j ssr γsr jd (1
s=1
− us j ) +
S
(T ) z s j ssr γsr jd u s j
s=1
(T ) (T ) = S0r jd + S1r jd , (T ) determines whether the d-th PIRP of Item j at Rank r was observed, where ssr is (T ) 22 the (s, r )-th element in the smoothed rank membership matrix (S ) at the final cycle in the EM algorithm. In addition, Sr(Tjd) is the number of Rank r students whose (T ) (T ) PIRP for Item j is the d-th pattern, which is the sum of S0r jd and S1r jd from Eq. (9.7) (p. 444). If Sr(Tjd) > 0, πr jd can be estimated, and the number of (estimable)
20
If the fitness of the analysis model is extremely good/poor, the log-likelihood of the analysis/null model can be closer to 0 than that of the benchmark/analysis model. 21 The eight PIRPs are sorted as 000, 001, 010, 011, 100, 101, 110, and 111 in the binary numbering system, and 010 is thus the third pattern. 22 Note that it is not M ˆ R because S(T ) was used to obtain ˆ DI .
464
9 Local Dependence Latent Rank Analysis
parameters for the analysis model is then given as N PA =
Dj R J
sgn(Sr(Tjd) ).
r =1 j=1 d=1
In this analysis, all PIRPs were observed, and thus N PA = 100. The NP is an index of the model flexibility. The larger the NP, the more flexible the model is and the better the fit of the data. For this reason, the benchmark/null model with a large/small NP fits the data well/poorly. The NP of the analysis model is normally smaller/larger than that of the benchmark/null model, as follows: N PN < N PA < N PB . These inequalities hold in this analysis as well.23 A model with a large NP is said to be wasteful, whereas a model with a small NP is said to be parsimonious; thus, the benchmark model is wasteful, and the null model is parsimonious. Therefore, the analysis model generally aims to be placed in a balanced (or economic) position between the two extreme models. The DF of the analysis model is defined as the difference between the NPs of the benchmark and analysis models. That is, it is calculated as d f A = N PB − N PA = JG −
Dj R J
sgn(Uˆ r jd )
r =1 j=1 d=1
= 12 × 13 − 100 = 56. The DF is an index of the model parsimony. As the DF reaches 0, the NP of the analysis model (N PA ) is approximate to the NP of the benchmark model (N PB ), indicating that the analysis model is as wasteful as the benchmark model. By contrast, if the DF of the analysis model is large, the NP of the analysis model (N PA ) is small, and the analysis model is as parsimonious as the null model. The analysis model is normally less parsimonious than the null model (i.e., N PA > N PN ), and the DFs of the two models have the following relation: d fA > d fN . This relation is satisfied in this analysis as well.
23
If a flexible model using many parameters is specified as the analysis/null model, the NP of the analysis/null model can be larger than that of the benchmark/analysis model.
9.3 Model Fit
465
Table 9.7 Model fit indices (LD-LRA) χ 2 and d f ll B −21318.47 ll N −37736.23 ll A −26657.78 χ N2 32835.53 χ A2 10678.64 d fN 144 d fA 56
Standardized Index NFI 0.675 RFI 0.164 IFI 0.676 TLI 0.164 CFI 0.675 RMSEA 0.195
Information Criterion AIC 10566.64 CAIC 10201.66 BIC 10201.67
9.3.4 Section Summary Using the χ 2 values and DFs of the analysis and null models, various model-fit indices can be calculated. The indices are mainly classified as standardized indices and information criteria. Refer to Standardized Fit Indices (p. 321) for the former and Information Criteria (p. 326) for the latter. Table 9.7 shows the results of the fit indices of the entire test. Although it is possible to calculate the indices for each item, there is not much demand for doing so. The information criteria are used for comparing multiple analysis models.
9.4 Structure Learning The descriptions presented thus far are the parameter learning and model-fit evaluation under the condition that the number of ranks (R) and the local dependence DAGs are given. However, they are unknown when the data are taken; thus, the parameter learning and model-fit evaluation are executable only after they are fixed. Once again, the analysis procedure of the LD-LRA is shown below: Estimation Order of LD-LRA (Revisited) 1. Estimate the number of latent ranks (LRA) 2. Specify the DAG at each latent rank (structure learning) 3. Estimate the parameter set (parameter learning) This section describes the method for determining the local dependence DAGs (Point 2) under the condition that the number of ranks (R) has been already determined (Point 1), which is called the structure learning stage. In fact, structure learning (Point 2) and parameter learning (Point 3, Sect. 9.2, p. 430) are almost integrated and simultaneous in a calculation program. After determining a particular structure, the parameter learning and fitness evaluation under the structure are exe-
466
9 Local Dependence Latent Rank Analysis
cuted instantaneously. In other words, an optimal structure can be found by evaluating the model-fit under various structures. Meanwhile, it is almost impossible to find an optimal R and structure (Points 1 and 2) simultaneously because the analyst generally has an idea about the number of ranks but does not necessarily have a detailed hypothesis about the structure. Therefore, as a standard approach, the number of ranks is first determined and then structure learning is carried out. If not satisfied with the structure obtained, the structure learning should be conducted again after changing the number of ranks.
9.4.1 Adjacency Array Directed Acyclic Graph 1. 2. 3. 4.
The graph must be simple. The graph must be directed. The graph must be acyclic. The graph being either connected or disconnected is acceptable.
The local dependence structure at each the struc rank must be a DAG. Therefore, ture must have the conditions listed in Directed Acyclic Graph . First, as shown in Point 1, the structure (i.e., graph) must be simple, and should have no multiedges or self-loops. A graph that contains a multiedge or a self-loop is referred to as a multigraph (Fig. 8.6, p. 362). In addition, Point 2 indicates that all the edges in a graph are directed. Moreover, Point 3 requires the graph to be acyclic, which means that the graph does not contain a path (cycle) that starts with a particular item and ends with the same item. A graph with one or more cycles is referred to as a cyclic graph (Fig. 8.7, p. 365). Finally, according to Point 4, a connected graph is a graph where each item is connected with one or more items. In other words, each item is reachable from all other items when considering all directed edges as undirected. If not, such a graph is called a disconnected graph (Fig. 8.4, p. 358). Note, however, that the connectivity is not a requirement for a graph to be a DAG. A graph is a DAG if the graph fulfills Points 1, 2, and 3. This section introduces the adjacency array, which is a three-dimensional array necessary for structure learning: A = {ar jk } (R × J × J ). The r -th layer in this three-dimensional array,
9.4 Structure Learning
467
⎤ ar 11 · · · ar 1J ⎥ ⎢ Ar = ⎣ ... . . . ... ⎦ (J × J ), ar J 1 · · · ar J J ⎡
is the adjacency matrix at Rank r , where ar jk indicates that ar jk =
1, if an edge from Item j to kat Rankr exists . 0, otherwise
Mathematical Conditions for a Graph Being DAG (LD-LRA) 1. 2. 3. 4.
If simple, the maximal element in AUr is 1. If directed, the maximal element in AUr is 1. If acyclic, the trace of Ar(J −1) , tr Ar(J −1) , is 0. (J −1) If connected, the minimal element in AUr is greater than 0.
For the graph at Rank r being a DAG, the adjacency matrix of the rank ( Ar ) must fulfill the conditions listed in Mathematical Conditions for Graph Being DAG (see also Sect. 8.1.8, p. 366), where AUr = Ar + Ar is the adjacency matrix when all directed edges are replaced by undirected edges (or bidirected multi-edges). If the maximum element of this matrix is 1, then this graph is simple (Condition 1) because the maximum element becomes ≥ 2 when the graph has a multi-edge or a self-loop. In addition, at this time, the graph is directed. In other words, the mathematical conditions for a graph to be both simple and directed are the same. A graph being directed implies that all edges in the graph are unidirected. The bidirected edge of Item j ↔ k is certainly a directed edge but is identical to the undir ected edge of Item j—k in the sense that either item is adjacent to the other item. That is, BSE
j ←→ k
≡
j
BME ← −− −− →
k
≡
USE k j —–
In general, as shown above, a bidirected (simple) edge, bidirected multiedge, and undirected (simple) edge are three different representations of the same adjacency relationship between two items. The congruence of the right two adjacencies is particularly important here. That is, to check whether or not the graph has multiedges is the same as checking whether or not the graph has undirected edges. Therefore, the diagnostic criteria for the simplicity and directionality of a graph become identical. In addition, Ar(J −1) in Condition 3 is the reachability matrix (see Sect. 8.1.3, p. 354),
468
9 Local Dependence Latent Rank Analysis
Ar(J −1) =
J −1
Arp = Ar + Ar2 + · · · + ArJ −1 ,
p=1
which is the sum of the powers of the adjacency matrix. By referring to this reachability matrix summing the powers up to J − 1, one can examine whether an item is reachable from another item. For example, the ( j, k)-th element in the matrix can be n (ar(Jjk−1) = n), which indicates that there exist n paths (walks) with length ≤ J − 1 from Item j to k. Moreover, tr Ar(J −1) =
J
ar(Jj j−1) = ar(J11−1) + · · · + ar(JJ −1) J
j=1
is the sum of the diagonal elements of Ar(J −1) , where ar(Jj j−1) represents the number of walks with length ≤ J − 1, which start from Item j and return to the same item; thus, ar(Jj j−1) > 0 implies that there is at least one cycle. Accordingly, when tr Ar(J −1) = 0, it is proven that there are no cycles in the graph. (J −1) in Condition 4 is the reachability matrix when all directed Furthermore, AUr edges are replaced by the undirected edges. That is, (J −1) AUr
=
J −1
J −1 2 AUr = AUr + AUr + · · · + AUr . p
p=1
If a graph is connected, each item must be reachable from all other items in the underlying undirected graph of the directed graph. Thus, if all elements in this matrix are > 0, it then shows that all items are reachable from each other in the underlying undirected graph. Note again that the connectivity of a graph is not required for the graph to be a DAG. If a graph is simple, directed, and acyclic, the graph is a DAG even if the graph is disconnected.
9.4.2 Number of DAGs and Topological Sorting “Specifying a structure” can be rephrased as “specifying the relationships between all item pairs at all ranks.” For example, in the DAG at Rank r , there can be three relationships between Items j and k: j
k ,
j → k , j ← k .
9.4 Structure Learning
469
Thus, because the number of item pairs is J (J − 1)/2 when there are J items, the number of graphs in total is (J ) = 3 J (J −1)/2 . NSG
Note that this is the number of simple graphs because it does not exclude the graphs containing a cycle(s). Also note that this number increases explosively as the number of items (J ) increases (Table 9.8) and exceeds ten billion (1010 ) when J = 7. Out of the number of simple graphs, how many have no cycles (i.e., DAGs)? From (J ) ) satisfies the following recurrence Robinson (1973), the number of DAGs (NDAG 24 relation : (J ) NDAG
=
J j=1
(−1)
j−1
J (J − j) 2 j (J − j) NDAG j
(0) (1) (NDAG = NDAG = 1).
This number is much smaller than the number of simple graphs (NSG ), and is approximately one billion (109 ) when J = 7; thus, NDAG /NSG < 0.1. However, because J = 12 in the data (S12S5000), the number is astronomically large and exceeds 1026 . In addition, note that NDAG is the number of DAGs at each rank. When the number of ranks is R, (J ) R . N of DAGs for all ranks = NDAG Therefore, it is difficult to apply structure learning and pick up only one structure. The 1, 2, and 3 listed adjacency matrix in each rank must fulfill Conditions in Mathematical Conditions for Graph Being DAG for a graph to be a DAG. However, there is a simpler way, which is sorting the items topologically (see Topological Sorting , p. 371). After the items are topologically sorted and rela beled, if they comply with the rule that the parents of Item j are selected from Items 1, · · · , j − 1, the graph drawn is then always a DAG.25 Figure 9.16 is a complete graph with six items after sorting the items topologically. A (directed) complete graph is a simple graph in which every pair of items is connected by a (directed) edge. As shown in the figure, the six items are topologically sorted and arranged from left to right, and all edges thus point toward the right direction, which implies that the graph has no cycle that starts from an item and returns to the same item. In practice, a DAG is barely complete, and is instead a graph with some edges removed from a complete graph, which is still a DAG. Therefore, a graph cannot be cyclic when the items are topologically sorted. n n! = m!(n−m)! is a binomial coefficient, which is also frequently denoted by n Cm m and represents the number of combinations of m members that can be selected from a set of n members. 25 This rule is rephrased as the children of Item j being selected from Items j + 1, · · · , J . 24
Here,
J 1 2 3 4 5 6 7 8 9 10 11 12
N SG 1 3 27 729 59049 14348907 10460353203 22876792454961 150094635296999121 2954312706550833698643 174449211009120179071170507 30903154382632612361920641803529
Table 9.8 Number of graphs (at each latent rank)
N D AG 1 3 25 543 29281 3781503 1138779265 783702329343 1213442454842881 4175098976430598143 31603459396418917607425 521939651343829405020504063 NT S 1 2 8 64 1024 32768 2097152 268435456 68719476736 35184372088832 36028797018963968 73786976294838206464
470 9 Local Dependence Latent Rank Analysis
9.4 Structure Learning
471
Fig. 9.16 Topological sorting (six-item example)
It must be awkward for ordinary data to topologically sort the variables because this imposes a strong constraint on the variables, thus requiring a strong rationale. Meanwhile, it is relatively easier to apply this sorting to test items in descending order of their CRRs. This is because, in general, an item with a low CRR (difficult item) is normally inadequate for the parent of an item with a higher CRR (easier item). Even if such a structure is true, and such cases must occur in reality, the results are not necessarily translatable into knowledge that can be used in the field of education, where teachers normally teach the easier items earlier. In humanities and social sciences, extracting pragmatic knowledge from data is often more emphasized than pursuing the truth. From the rightmost column in Table 9.5 (p. 453), when the 12 items of the data under analysis (J12S5000) are topologically sorted, the order of the items is obtained as Table 9.9, where Item 1 is the easiest, followed by Items 4, 9, 8, · · · . Under this
Table 9.9 Topological sorting of 12-item math test (J12S5000)
Original Item 1 Item 4 Item 9 Item 8 Item 2 Item 10 Item 5 Item 11 Item 3 Item 6 Item 7 Item 12
CRR∗ 0.678 0.613 0.606 0.565 0.514 0.475 0.445 0.400 0.350 0.251 0.199 0.162
TS 1 2 3 4 5 6 7 8 9 10 11 12
*correct response rate
472
9 Local Dependence Latent Rank Analysis
order, only those upper triangular elements other than the diagonal elements can have an edge in the adjacency matrix at each rank. That is, the corresponding elements in the adjacency matrix at Rank r are the shaded cells shown below. ⎤ → 1 4 9 8 2 10 5 11 3 6 7 12 ⎥ ⎢ Item 1 ⎥ ⎢ ⎥ ⎢ Item 4 ⎥ ⎢ ⎥ ⎢ Item 9 ⎥ ⎢ ⎥ ⎢ Item 8 ⎥ ⎢ ⎥ ⎢ Item 2 ⎥ ⎢ ⎥. ⎢ Ar = ⎢ Item 10 ⎥ ⎥ ⎢ Item 5 ⎥ ⎢ ⎥ ⎢ Item 11 ⎥ ⎢ ⎥ ⎢ Item 3 ⎥ ⎢ ⎥ ⎢ Item 6 ⎥ ⎢ ⎦ ⎣ Item 7 Item 12 ⎡
(9.9)
If complying with the rule in which only the shaded cells are allowed to set up a directed edge, the resulting graph is a DAG regardless of the number of edges and their placements. By restoring the item order of this matrix to the original, Eq. (9.9) is represented as ⎤ → 1 2 3 4 5 6 7 8 9 10 11 12 ⎥ ⎢ Item 1 ⎥ ⎢ ⎥ ⎢ Item 2 ⎥ ⎢ ⎥ ⎢ Item 3 ⎥ ⎢ ⎥ ⎢ Item 4 ⎥ ⎢ ⎥ ⎢ Item 5 ⎥ ⎢ ∗ ⎥ ⎢ Ar = ⎢ Item 6 ⎥ ⎥ ⎢ Item 7 ⎥ ⎢ ⎥ ⎢ Item 8 ⎥ ⎢ ⎥ ⎢ Item 9 ⎥ ⎢ ⎥ ⎢ Item 10 ⎥ ⎢ ⎦ ⎣ Item 11 Item 12 ⎡
Clearly, if a graph is constructed when considering a constraint in which only the shaded cells can have an edge, the graph is still a DAG.
9.4 Structure Learning
473
Merit of Topological Sorting 1. The graph is always created as a DAG. 2. The number of structures to be explored in the structure learning can be significantly curtailed. Topological sorting is dispensable; however, as described in Point 1 of the above box, it is extremely useful for the graph created under this constraint to become a DAG. In addition, as shown in Point 2, doing so can significantly reduce the number of possible structures to be explored during structure learning. Without topologically sorting the items, for example, the relationships between Items 1 and 4 are the following three patterns: 1
4 ,
1 → 4 , 1 ← 4 ; however, with such sorting, the relations are restricted to the following two patterns: 1
4 ,
1 → 4 . This applies to the relationships between all item pairs; thus, the number of structures can largely be reduced. Accordingly, when the number of items is J , with topological sorting, the number of possible DAGs to be explored in the structure learning reduces to (J ) = 2 J (J −1)/2 . NTS
As shown in Table 9.8 (p. 470); however, NTS exceeds 70 quintillion (1018 ) when J = 12, which is still inconceivably large. To further reduce the number of structures, the analyst may translate the knowledge regarding the phenomenon into constraint(s) and additionally impose them on the adjacency matrix. For example, as shown in Fig. 9.7 (p. 432), unless considering an edge between two items of different testlets, the constraints on the adjacency matrix Ar∗ become
474
9 Local Dependence Latent Rank Analysis
⎤ → 1 2 3 4 5 6 7 8 9 10 11 12 ⎥ ⎢ Item 1 ⎥ ⎢ ⎥ ⎢ Item 2 ⎥ ⎢ ⎥ ⎢ Item 3 ⎥ ⎢ ⎥ ⎢ Item 4 ⎥ ⎢ ⎥ ⎢ Item 5 ⎥ ⎢ ∗ ⎥. ⎢ Ar = ⎢ Item 6 ⎥ ⎥ ⎢ Item 7 ⎥ ⎢ ⎥ ⎢ Item 8 ⎥ ⎢ ⎥ ⎢ Item 9 ⎥ ⎢ ⎥ ⎢ Item 10 ⎥ ⎢ ⎦ ⎣ Item 11 Item 12 ⎡
(9.10)
⎤ → 1 2 3 4 5 6 7 8 9 10 11 12 ⎥ ⎢ Item 1 ⎥ ⎢ ⎥ ⎢ Item 4 ⎥ ⎢ ⎥ ⎢ Item 9 ⎥ ⎢ ⎥ ⎢ Item 8 ⎥ ⎢ ⎥ ⎢ Item 2 ⎥ ⎢ ⎥. Ar = ⎢ Item 10 ⎥ ⎢ ⎥ ⎢ Item 5 ⎥ ⎢ ⎥ ⎢ Item 11 ⎥ ⎢ ⎥ ⎢ Item 3 ⎥ ⎢ ⎥ ⎢ Item 6 ⎥ ⎢ ⎦ ⎣ Item 7 Item 12 ⎡
Thus, with this matrix holding 18 shaded cells, the number of possible DAGs is 218 = 262, 144. Consequently, because there are five ranks in this analysis, the total number of possible DAGs is finally given as26 {218 }5 = 290 = 1.24 × 1027 . This number exceeds one octillion (1027 ).
9.4.3 Population-Based Incremental Learning The number of graph candidates is still extremely large; however, the analyst must present one, or sometimes a few, of such candidates as a result of the analysis. The 26
For real numbers a, b, and c, {a b }c = a bc .
9.4 Structure Learning
475
Generational Gene
Generational Gene
(3) Update (4) Generate
Bad Fit
Good Fit
(1) Generate
(2) Fig. 9.17 Framework of PBIL (LD-LRA)
chapter on BNM used the population-based incremental learning (PBIL; Fukuda et al. 2014; see Sect. 8.5.2), which is a type of genetic algorithm (GA), in structure learning. This section describes a method used to apply PBIL to the structural learning in an LD-LRA. Figure 9.17 shows a PBIL framework used in an LD-LRA. Basically, similar to a normal GA, PBIL selects individuals that are highly adapted to the data in each generation (or periods) t = 1, 2, 3, . . .. Each individual has its own unique set of genes (chromosomes), which is the adjacency array ( A). The r -th layer in this three-dimensional array, Ar , holds the information regarding the DAG at Rank r . The element in each locus of a gene can take a value of 0 (no edge) or 1 (edge). The number of individuals in a period (i.e., population) is denoted as I . As (t) (t) shown in (1), during Period t, the genes for I individuals ( A(t) 1 , A2 , · · · , A I ) are first generated. They are different binary patterns and randomly generated from the (t) generational gene in Period t (A¯ ), which is the three-dimensional array, as shown
476
9 Local Dependence Latent Rank Analysis
in the follows27 : (t) A¯ = {a¯ r(t)jk ∈ [0, 1]} (R × J × J ),
where a¯ r(t)jk represents the possibility that there is an edge of Item j → k at Rank r , and the (r, j, k)-th element in Ai(t) is generated from (t) ai,r ¯ r(t)jk ), jk ∼ Bern(a
where Bern( p) is a Bernoulli distribution that generates 1 with probability p and (t) 0 with probability 1 − p. Thus, ai,r jk is a binary random variable, and is 1 with (t) probability a¯ r jk and 0 with probability 1 − a¯ r(t)jk . Accordingly, the closer a¯ r(t)jk is to 1, (t) the more likely ai,r jk is to be sampled as 1. The gene of Individual i can be fixed by sampling the elements for all loci. When the topological sorting is applied, only the strictly upper triangular elements (t) (t) are considered the loci of the r -th layer in A¯ (A¯ r , J × J ), and all lower triangular elements (including the main diagonal elements) are 0. This means that an edge of Item j → k is not set if the CRR of Item j is smaller than that of Item k (i.e., p j < pk ). In Fig. 9.17 (2), imitating natural selection, the better-fitting individuals are selected, where various methods are available, such as roulette, rank, and tournament selections. The figure depicts the simple rank selection process. This method selects the I S best-fitting individuals from the population, where I S is the number of survivors, and r S = I S /I is the survival rate. To sort the individuals based on their fitness, model-fit indices can be used. Among them, information criteria such as the AIC or BIC are frequently used. The information criteria highly evaluate a well-balanced model between fitness (i.e., χ 2 value) and parsimony (i.e., DF). As the information criterion decreases, the model better fits the data with a smaller number of parameters. In addition, the elitism selection can be used together at the selection stage. This method preserves some individuals with a high fitness during Period t without changing their genes and employs them as the individuals in Period t + 1. The figure shows the case in which there is one elite (I E = 1). In Fig. 9.17 (3), the generational gene is updated. First, the genes of the survivors (t) (t) are averaged. Supposing that these survivors are denoted as A(t) S1 , A S2 , · · · , A S I S in the decreasing order of their fitness (e.g., increasing order of their BICs), then the averaged gene of the survivors is given by (t) A¯ S
27
IS 1 (t) = A(t) = a¯ S,r jk = I S i=1 Si
IS i=1
(t) a Si,r jk
IS
(R × J × J ).
(1) For the generational gene during Period 1, A¯ , the elements of all loci may be 0.5.
9.4 Structure Learning
477
(t) If a¯ S,r jk ≥ 0.5, it means that more than half of the survivors in Period t have the edge of Item j → k at Rank r . This fact also suggests that this edge was effective in saving those survivors. (t) Then, rounding off all elements of the averaged gene A¯ S , a three-dimensional array denoted as A(t) R = {a R,r jk } (R × J × J ) is created, where the (r, j, k)-th element is
a (t) R,r jk
(t) = round a¯ S,r jk = round
IS 1 (t) . a I S i=1 Si,r jk
(9.11)
That is, A(t) R is a binary array in which the elements with corresponding elements in (t) ¯ AS of ≥ 0.5 are all 1 and are 0 otherwise. Using this A(t) R , the generational gene in Period t + 1 is updated as
(t+1) (t) ¯ (t) = {a¯ (t+1) } (R × J × J ), A¯ = A¯ + α A(t) R −A r jk where α ∈ (0, 1) is the learning rate. In this array, the (r, j, k)-th element is
= a¯ r(t)jk + α a (t) ¯ r(t)jk a¯ r(t+1) jk R,r jk − a a¯ (t) − α a¯ r(t)jk , if a (t) R,r jk = 0
= r(t)jk . (t) a¯ r jk + α 1 − a¯ r jk , if a (t) R,r jk = 1 approaches 1 when a (t) ¯ r(t+1) < a¯ r(t)jk , whereas it That is, a¯ r(t+1) jk R,r jk = 0 because a jk ¯ r(t+1) > a¯ r(t)jk . In addition, the larger the approximates 0 when a (t) R,r jk = 1 because a jk learning rate (α) is, the closer it gets to the targeted value (i.e., 0 or 1). Moreover, the elements are mutated with a probability of r M per locus, as follows: a¯ r(t+1,new) = (1 − B M )a¯ r(t+1) jk jk + B M ⎧ (t+1) ⎨a¯ r jk , = a¯ r(t+1) ⎩ jk + U 2
a¯ r(t+1) jk + U
if B M = 0 , if B M = 1
where B M ∼ Bern(r M ), U ∼ U [0, 1].
2 ,
478
9 Local Dependence Latent Rank Analysis
In addition, U [0, 1] is the uniform density with an interval of [0, 1], and U is a random sample from the density. This equation implies that a¯ r(t+1) jk is invariant with a (t+1) probability of 1 − r M , but is replaced by an average of a¯ r jk and U with a probability of r M . Note that this procedure is not the only way to mutate the gene. Various other ways also exist. Termination Conditions of PBIL 1. When the fitness of the best individual in a period meets a predetermined criterion 2. When the average fitness in a period meets a certain criterion 3. When the difference in fitness of the best individuals between successive periods falls within a certain criterion 4. When the fitness of the best individual does not change for a certain number of periods Next, as shown in Fig. 9.17 (4), I individuals are randomly generated again from the (t + 1)-th generational gene. By repeating this cycle, the generational gene evolves to produce better-fitting individuals. For the rules of terminating the cycles, refer to Termination Conditions of PBIL as examples. Determination of Aˆ 1. 2. 3. 4.
The adjacency array of the optimal individual in the final period The rounded average array of the individuals in the final period The rounded average array of the survivors in the final period The rounded generational gene in the final period
Moreover, Determination of Aˆ lists some examples of how to finally determine an estimate for the adjacency array after the PBIL cycles. Point 1 considers the optimal individual in Period T as the estimate. That is, ) Aˆ = A(T S1 .
In Point 2, the individuals in Period T are first averaged as follows: I 1 (T ) A¯ = a¯ r(Tjk) = A(T ) . I i=1 i
9.4 Structure Learning
479
Then, the rounded version of this array is specified as the estimate. That is,
Aˆ = {aˆ r jk } = round a¯ r(Tjk) . Meanwhile, Point 3 considers the rounded version of the averaged array over the survivors as the estimate: ) Aˆ = A(T R . (T ) Finally, Point 4 specifies the rounded A¯ as the estimate:
Aˆ = {aˆ r jk } = round a¯ r(Tjk) .
(9.12)
Thus far, we have seen an overview of the PBIL. To summarize, one needs to specify the settings as listed in Settings for Executing PBIL to apply PBIL; how ever, the list is basic, and various options can be added. Also note that, in general, a GA analysis including PBIL produces slightly different results even if identical data are analyzed under the same settings because random operations are included in generating individuals and mutating genes. Settings for Executing PBIL 1. Gene: Adjacency array A and the loci at Rank r are the strictly upper triangular cells at the r -th layer in A provided that the items are topologically sorted. Each locus takes a value of 0 (no edge) or 1 (edge). 2. Population: The number of individuals (I ) in a generation 3. Survival rate: The rate of survivors (r S ) in the population 4. Fitness evaluation: The index (such as BIC and AIC) used to evaluate the fitness of each individual 5. Selection: Rank, roulette-wheel, tournament, elitism selections, etc. 6. Learning rate: The update size of the generational gene (α) 7. Mutation rate: The rate at which each locus changes (r M ) 8. Termination condition: See Termination Conditions of PBIL ˆ ˆ 9. Estimate of A: See Determination of A
480
9 Local Dependence Latent Rank Analysis
9.4.4 Structure Learning by PBIL Parameters for Executing PBIL (LD-LRA) 1. Gene: Adjacency array A (a) Items are topologically sorted (Table 9.9) (b) Maximum N of parents per item is 2. Population: I = 50 Survival ratio: r S = 0.5; thus, I S = r S I = 25. Fitness evaluation: BIC Selection: Simple rank selection (and elitism selection with I E = 1) learning rate: α = 0.05 Mutation probability: r M = 0.002 Termination condition: When the fitness of the optimal individual does not change for 15 periods.
Eq. (9.12) 9. Estimate: Aˆ = {aˆ r jk } = round a¯ r(Tjk)
2. 3. 4. 5. 6. 7. 8.
Structure learning with PBIL was conducted under the settings shown in the above box. Point 1 implies that, at every rank, no item can be a parent of an item easier than the item. In addition, edges across different testlets were assumed to be allowed. In other words, only the strictly upper triangular elements were the loci in the adjacency matrix for each rank, as shown in Eq. (9.9). Meanwhile, the upper limit for the number of indegrees per item was set to 2. However, the limit was not 2 from the beginning, but was set to J/2 during the first period, and then gradually (linearly) reduced to 2 as t progresses. It is not recommended to limit the maximum number of indegrees to 2 from the first period. As a result of the topological sorting, for example, Item 12 can be a parent for all 11 remaining items. If the maximum number is small when generating the individuals at the first period, the locus means of the survivors regarding the item (Eq. (9.11), p. 477) are inclined to be small and become zero when rounded off. Therefore, it is better to set a larger number of indegrees in the early periods and gradually reduce the number. The PBIL was terminated at period T = 64. The generational gene was obtained as follows:
(64) A¯ 1
Rank 1 1 ⎢ Item 1 ⎢ ⎢ Item 4 ⎢ ⎢ Item 9 ⎢ ⎢ Item 8 ⎢ ⎢ Item 2 ⎢ =⎢ ⎢ Item 10 ⎢ Item 5 ⎢ ⎢ Item 11 ⎢ ⎢ Item 3 ⎢ ⎢ Item 6 ⎢ ⎣ Item 7 Item 12
⎡
4 0.976
9 0.027 0.980
8 0.021 0.031 0.981
2 0.967 0.021 0.022 0.019
10 0.019 0.148 0.978 0.961 0.045
5 0.021 0.023 0.021 0.019 0.977 0.021
11 0.026 0.916 0.023 0.021 0.026 0.145 0.021
3 0.029 0.890 0.023 0.019 0.023 0.019 0.028 0.025
6 0.056 0.021 0.023 0.021 0.028 0.029 0.019 0.965 0.019
7 0.026 0.023 0.057 0.024 0.028 0.029 0.028 0.021 0.951 0.023
12 0.028 0.974 0.022 0.019 0.024 0.019 0.022 0.019 0.066 0.033 0.030
⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎦
⎤
9.4 Structure Learning 481
(64) A¯ 2
Rank 2 1 ⎢ Item 1 ⎢ ⎢ Item 4 ⎢ ⎢ Item 9 ⎢ ⎢ Item 8 ⎢ ⎢ Item 2 ⎢ =⎢ ⎢ Item 10 ⎢ Item 5 ⎢ ⎢ Item 11 ⎢ ⎢ Item 3 ⎢ ⎢ Item 6 ⎢ ⎣ Item 7 Item 12
⎡
4 0.979
9 0.025 0.025
8 0.021 0.025 0.981
2 0.949 0.024 0.021 0.019
10 0.019 0.028 0.970 0.961 0.019
5 0.042 0.019 0.079 0.026 0.032 0.019
11 0.019 0.019 0.034 0.021 0.954 0.981 0.019
3 0.023 0.019 0.030 0.974 0.977 0.019 0.021 0.032
6 .019 0.041 0.026 0.075 0.965 0.034 0.019 0.028 0.019
7 0.021 0.019 0.023 0.964 0.098 0.019 0.022 0.913 0.021 0.021
12 .028 0.015 0.019 0.019 0.024 0.022 0.058 0.023 0.019 0.203 0.023
⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎦
⎤
482 9 Local Dependence Latent Rank Analysis
(64) A¯ 3
Rank 3 1 ⎢ Item 1 ⎢ ⎢ Item 4 ⎢ ⎢ Item 9 ⎢ ⎢ Item 8 ⎢ ⎢ Item 2 ⎢ =⎢ ⎢ Item 10 ⎢ Item 5 ⎢ ⎢ Item 11 ⎢ ⎢ Item 3 ⎢ ⎢ Item 6 ⎢ ⎣ Item 7 Item 12
⎡
4 .966
9 0.019 0.923
8 0.085 0.080 0.969
2 0.971 0.026 0.019 0.021
10 0.021 0.023 0.977 0.973 0.021
5 0.021 0.055 0.019 0.019 0.019 0.019
11 0.021 0.028 0.021 0.019 0.027 0.979 0.019
3 0.019 0.021 0.058 0.019 0.019 0.021 0.979 0.019
6 0.019 0.973 0.019 0.021 0.019 0.019 0.981 0.019 0.021
7 0.021 0.019 0.021 0.019 0.021 0.023 0.021 0.035 0.019 0.039
12 0.044 0.019 0.029 0.019 0.019 0.027 0.019 0.979 0.974 0.030 0.050
⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎦
⎤
9.4 Structure Learning 483
(64) A¯ 4
Rank 4 1 ⎢ Item 1 ⎢ ⎢ Item 4 ⎢ ⎢ Item 9 ⎢ ⎢ Item 8 ⎢ ⎢ Item 2 ⎢ =⎢ ⎢ Item 10 ⎢ Item 5 ⎢ ⎢ Item 11 ⎢ ⎢ Item 3 ⎢ ⎢ Item 6 ⎢ ⎣ Item 7 Item 12
⎡
4 .981
9 .977 .024
8 .019 .021 .976
2 .952 .023 .227 .019
10 .019 .019 .978 .028 .026
5 .024 .026 .047 .019 .021 .019
11 .019 .019 .755 .024 .019 .979 .023
3 .023 .023 .027 .019 .031 .023 .150 .021
6 .021 .019 .248 .019 .019 .118 .979 .965 .025
7 .023 .021 .019 .025 .028 .121 .021 .022 .969 .979
12 .042 .019 .024 .022 .021 .024 .019 .019 .028 .023 .068 ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎦
⎤
484 9 Local Dependence Latent Rank Analysis
(64) A¯ 5
Rank 5 1 ⎢ Item 1 ⎢ ⎢ Item 4 ⎢ ⎢ Item 9 ⎢ ⎢ Item 8 ⎢ ⎢ Item 2 ⎢ =⎢ ⎢ Item 10 ⎢ Item 5 ⎢ ⎢ Item 11 ⎢ ⎢ Item 3 ⎢ ⎢ Item 6 ⎢ ⎣ Item 7 Item 12
⎡
4 .981
9 0.023 0.853
8 0.025 0.041 0.954
2 0.977 0.021 0.019 0.022
10 0.019 0.019 0.027 0.971 0.028
5 0.022 0.023 0.036 0.049 0.020 0.057
11 0.080 0.032 0.034 0.021 0.019 0.977 0.016
3 0.019 0.027 0.042 0.017 0.019 0.019 0.048 0.019
6 0.028 0.842 0.030 0.021 0.023 0.019 0.974 0.019 0.090
7 0.019 0.855 0.021 0.019 0.023 0.339 0.019 0.024 0.019 0.979
12 0.023 0.023 0.024 0.106 0.966 0.019 .021 0.026 0.024 0.971 0.049
⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎦
⎤
9.4 Structure Learning 485
486
9 Local Dependence Latent Rank Analysis
The elements ≥ 0.5 are shaded; thus, the matrix replacing these elements as 1 and the other elements as 0 is the adjacency array estimate in this structure learning. Figure 9.18 shows the structure estimated (and the parameter set estimated). Edges were allowed across testlets; thus, Item 1 → 4 and Item 11 → 6 were such edges at Rank 1. In addition, some invalid edges did not vanish and survived, such as Item 4→11 at Rank 1. With this edge, the CCRRs of Item 11 were 0.103 and 0.109 given that Item 4 was correct and incorrect, respectively. In other words, the CCRR of Item 11 does not change with respect to the parent item response pattern (PIRP); thus, this edge is said to be unnecessary. Why did such useless edges remain alive? They did so owing to the large sample size (S = 5000) and the property of the BIC as the fitness index employed. The BIC is defined by B I C = χ A2 − d f A ln S = 2(ll B − ll A ) − (N PB − N PA ) ln S =2
S ˆ D I )} − (N PB − N PA ) ln S, {ll B (us | P G ) − ll A (us | s=1
where the last equal sign is derived from Eq. (9.5) (p. 442). From the first term, if the sample size changes from S to S + 1, χ A2 increases by ˆ D I )}, 2{ll B (u S+1 | P G ) − ll A (u S+1 | which shows that χ A2 increases almost linearly with respect to the sample size. Meanwhile, the second term remains almost unchanged because ln S ≈ ln(S + 1) when the sample size is large (here , ln 5000 = 8.5172 and ln 5001 = 8.5174). In other words, as the sample size increases, the second term (d f A ln S) becomes relatively smaller, and the BIC becomes almost identical to χ A2 . In this example, χ A2 is approximately 10,000, whereas d f A ln S is approximately 200. Thus, when the sample size is large, rather than increasing d f A ln S by reducing the number of parameters, reducing χ A2 is more effective at decreasing BIC. Accordingly, individuals with a greater number of edges are inclined to have a smaller BIC and survive. It must be difficult to interpret the results shown in Fig. 9.18. It might have been better not to assume any edges across testlets. The model selected by the structure learning is not always a model interpretable by the analyst, accountable for the phenomenon, or explainable to influencers in the society (e.g., pedagogists and educational policy makers). An expert(s) should modify the model by removing and adding edges without damaging the goodness of fit as much as possible and fix the final structure. Table 9.10 shows the model-fit indices. The BIC decreases by approximately 1400, compared with that in Table 9.7. This is a product of the structure (machine) learning. Meanwhile, RFI, TLI, and RMSEA worsen because the PBIL in this analysis
9.4 Structure Learning
487
Fig. 9.18 Latent rank DAG structure obtained using PBIL
searched for the best model in terms of the BIC. If another index were adopted as a fitness measure, a different structure would have been selected.
488
9 Local Dependence Latent Rank Analysis
Table 9.10 Model fit indices (PBIL) χ2 ll B ll N ll A χ N2 χ A2 d fN d fA
and d f −21318.47 −37736.23 −25846.39 32835.53 9055.85 144 23
Information Criterion Standardized Index 9009.85 NFI 0.724 AIC 8859.95 RFI 0.000 CAIC IFI 0.725 BIC 8859.95 TLI 0.000 CFI 0.724 RMSEA 0.280
9.5 Chapter Summary This chapter introduced the LD-LRA. Whereas the normal LRA assumes local independence among items, the LD-LRA assumes local dependence among them, where “local” refers to each latent rank. In addition, “dependence” refers to the relationships (or structure) of the CCRRs among the items, which is represented by a DAG introduced in the chapter on BNM. Therefore, the LD-LRA is regarded as a statistical model that combines the LRA and BNM. In other words, BNM is the LD-LRA when the number of ranks is one. In the LD-LRA, the structure can be identified through structure learning. In this chapter, the PBIL, a type of GA, was employed, and it was difficult to obtain a satisfactory model regardless of the method used because the number of model candidates was quite large, and therefore only a small fraction of the total number of candidates could be examined. In addition, when selecting survivors, a certain fitness index must be used. A fitness index can only evaluate one single aspect of the model from a single perspective. Although the BIC is frequently used in model ranking, it is not a good measure when the sample size is large. There is no single index that is valid under all situations. Therefore, there is very little possibility that the model selected by the structure learning will be satisfactory as is. It is generally necessary to modify the model by referring to knowledge and experience. If a good-fitting and well interpretable model is acquired, the teacher can have very useful insights regarding the field of education because the LD-LRA efficiently represents the internal world of the test. As described in Sect. 1.5.2 (p. 10), a good test data analysis (A) visualizes (and verbalize) the entire map of the test. That is, it is important to clarify what is being measured by the test, where it begins, and what are the goals. In the LD-LRA, the local dependence DAGs such as Fig. 9.9 (p. 447) represent the map, and Ranks 1 and R are the start and goal of the map, respectively. The DAG at each rank can be a guide to instruct the students at that rank. In addition, a good analysis (B) shows where each student is positioned on the
9.5 Chapter Summary
489
map, which is indicated by the rank membership profile of the student (cf. Fig. 9.10, p. 451); (C) points to the path (item) that each student should take (learn) for moving up to the next rank, as shown in Fig. 9.13 (p. 456); and (D) clarifies the feature of each path (item), which is clear from the marginal IRP (Fig. 9.12, p. 454) and IRP indices (Table 9.6, p. 455).
Chapter 10
Local Dependence Biclustering
This chapter introduces local dependence biclustering (LDB; Shojima, 2021), which incorporates biclustering (Chap. 7, p. 259) and a Bayesian network model (BNM; Chap. 8, p. 349). Thus, it is desirable to first read those chapters. Biclustering is a method of simultaneously clustering students into latent classes and items into latent fields. Both the classes and fields are nominal clusters. Meanwhile, ranklustering (Sect. 7.3, p. 274) is used to perform item clustering when students are clustered into ordinal classes (i.e., latent ranks). In addition, BNM is a method for identifying the interitem-dependency structure. Figure 10.1 illustrates LDB. Strictly speaking, it is an image of the local dependence ranklustering (LDR), because the horizontal axis is the latent rank scale. In the figure, the number of ranks is five. In addition, the nodes in the DAG at each rank represent the latent fields. While LD-LRA (Chap. 9, p. 421) is a method for analyzing the dependency structure among items at each rank (i.e., locus), LDR (LDB) is a model that considers the fields at each rank (class). The structure at each rank must be a DAG (Sect. 8.1.8, p. 366). That is, all edges in the graph at each rank are directed, and the graph does not contain a path (i.e., cycle) that starts from a field and returns to that field. The (usual) biclustering is illustrated in Fig. 10.2, although it is strictly ranklustering because the horizontal axis is the latent rank scale. In ranklustering/biclustering, the fields are locally independent at each rank/class; in other words, this model can be referred to as local independence ranklustering/biclustering.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 K. Shojima, Test Data Engineering, Behaviormetrics: Quantitative Approaches to Human Behavior 13, https://doi.org/10.1007/978-981-16-9986-3_10
491
492
10 Local Dependence Biclustering
Fig. 10.1 Local dependence biclustering (ranklustering)
Fig. 10.2 Local independence biclustering (ranklustering)
10.1 Parameter Learning This section describes the parameter learning in LDR, which includes that of LDB, under the condition that the numbers of latent fields and ranks (F and R, respectively) are already determined. Additionally, it is assumed that the structure to be analyzed has been given. In other words, an LDR analysis consists of three steps, as shown in Estimation Order of LDR , and this section focuses on the third step.
10.1 Parameter Learning
493
Estimation Order of LDR 1. Estimate the numbers of ranks and fields (ranklustering) 2. Estimate the latent rank DAG structure (structure learning) 3. Estimate the parameter set (parameter learning) In the first step, the usual ranklustering is conducted to determine the number of fields and ranks, as described in Sect. 7.6 (p. 328). An optimal pair (F, R) can be selected from various combinations of (F, R) by referring to a model-fit index, such as the BIC. Alternatively, it can be decided by consulting the knowledge and experience of the pairs that fit the data satisfactorily. It is further assumed that the second step (structure learning) has been completed, and the interfield structure at each rank has been given. Step 2 is described in Sect. 10.3 (p. 524).
10.1.1 Results of Ranklustering First, the results of the ranklustering, Step 1 of Estimation Order of LDR , are shown. The settings for the analysis are as in Estimation Setting of Ranklustering . The data to be analyzed are J35S515, for which the number of items and sample size are 35 and 515, respectively. These data are identical to those analyzed in Sect. 7.4 (p. 283). The number of fields and ranks were assumed to be (F, R) = (10, 5), and the EM algorithm was terminated at the sixth cycle. Estimation Setting of Ranklustering 1. 2. 3. 4. 5. 6. 7. 8. 9.
Data: J35S515 (www) Number of latent fields: F = 10 Number of latent ranks: R = 5 Kernel of filter: f = [0.1 0.8 0.1] Prior of field membership in E step: Discrete uniform 110 /10 Prior of rank membership in E step: Discrete uniform 15 /5 Estimator in M step: MAP estimator Prior of π f r in M step: Beta density with (β0 , β1 ) = (1, 1) Constant of convergence criterion: c = 10−4
An analysis of the ranklustering proceeds as follows: Initial Value
EM Cycle 1
EM Cycle 2
(0) (0) (1) (1) (1) (2) (2) (2) B , M F → M R → S(1) → M F → B → M R → S(2) → M F → B → · · · , (t) where M (t) F and M R are the field and rank membership matrices at the t-th EM cycle, respectively, and S(t) is the smoothed rank membership matrix. In addition, (t) B is the bicluster reference matrix (Section 7.1.3, p. 263) with size F × R:
494
10 Local Dependence Biclustering
Fig. 10.3 Array plot of sorted data by ranklustering
⎤ π11 · · · π1F ⎥ ⎢ B = ⎣ ... . . . ... ⎦ = {π f r } (F × R), π F1 · · · π F R ⎡
where π f r is the CRR of a student belonging to Rank r for an item classified in Field (6) (6) f . Because this process was completed at the sixth EM cycle, M (6) F , M R , S , and (6) B are available. Figure 10.3 (left) shows the array plot for the original data. The black cells indicate that the corresponding item responses were correct. The figure on the right-hand side shows the array plot sorted by ranklustering. The top and bottom clusters in the figure
10.1 Parameter Learning
495
are the lowest and highest ability groups (i.e., Ranks 1 and 5), respectively. Thus, the lower the row, the more cells are black. The rank membership matrix M R is an S × R matrix: ⎤ m 11 · · · m 1R ⎥ ⎢ M R = ⎣ ... . . . ... ⎦ = {m sr } (S × R), m S1 · · · m S R ⎡
where m sr ∈ [0, 1] represents the membership of Student s to Rank r . Now, M (6) R and its smoothed matrix S(6) are available as the final update. Each student is classified in the rank that has the largest membership, as displayed in Fig. 10.3 (right). In an LDR/LDB analysis, it is recommended that the number of ranks/classes be smaller because the ability range at each rank/class thus becomes wider. The larger the number of ranks, the smaller the number of students in each rank; the abilities of the students at each rank tend to be more homogeneous, and the local independence among fields becomes stronger (see Sect. 9.1.4, p. 427). Conversely, the smaller the number of ranks, the larger the number of students in each rank, and the broader the ability range at each rank, which makes it easier to examine the local dependency structure.
10.1.2 Dichotomized Field Membership Matrix The field membership matrix is given as ⎤ m 11 · · · m 1F ⎥ ⎢ M F = ⎣ ... . . . ... ⎦ = {m j f } (J × F), m J1 · · · m J F ⎡
where m j f ∈ [0, 1] represents the membership of Item j to Field f . Now, M (6) F is in the available state. ˜ ˜ j f } (J × From this M (6) F , the dichotomized field membership matrix M F = {m F) is created. The ( j, f )-th element of this matrix is m˜ j f =
) (T ) 1, if m (T j f = max m j , 0, otherwise
) (T ) where m(T j is the j-th row vector of M F and is thus the field membership profile of Item j. This equation indicates that m˜ j f = 1 when the membership of Item j to Field f is the largest; otherwise, m˜ j f = 0 ( f = f ). This means that the field ) membership matrix is finalized, or the proposal of M (T F is accepted as the item
496
10 Local Dependence Biclustering
Table 10.1 Item Classification Results by Ranklustering (10, 5) Field 1 2 3 4 5 6 7 8 9 10
N of Items 3 2 2 1 3 3 4 2 8 7
Item 1, 31, 32 21, 22 23, 24 7 11, 25, 26 2, 3, 27 8, 9, 10, 33 4, 12 5, 6, 13, 16, 17, 28, 29, 34 14, 15, 18, 19, 20, 30, 35
classification result. The field to which each item is classified must be determined to examine the local dependency structure among the fields. Table 10.1 shows the item classification results. For example, because the memberships of Items 1, 31, and 32 to Field 1 are the largest, their dichotomized memberships to the field are all 1, as follows: m˜ 1,1 = m˜ 31,1 = m˜ 32,1 = 1, m˜ 1, f = m˜ 31, f = m˜ 32, f = 0
( f = 1).
Thus, they were classified as Field 1 in the table. In Fig. 10.3 (right), Fields 1 and 10 are placed in the leftmost and rightmost columns. Field 9 has the most items, with eight items classified in it, whereas Field 4 has the least items, including only Item 7. The 10 fields are labeled in decreasing order of the CRR mean of the items in each field. That is, let the mean regarding Field f be denoted as
J j=1
Mean of CRRs of Field f items: p¯ f = J
m˜ j f p j
j=1
m˜ j f
.
Then the following inequality holds: p¯ 1 > p¯ 2 > · · · > p¯ 10 . In other words, Fields 1 and 10 are the easiest and most difficult item groups, respectively.
10.1 Parameter Learning
497
In an LDR (and LDB) analysis, it is recommended that the number of fields be larger. This is because as the number of fields increases, the number of items in each field is likely to decrease. A detailed explanation of this is provided in the next section.
10.1.3 Local Dependence Parameter Set Suppose that the structure learning (Step 2 of Estimation Order of LDR , p. 493) is completed and the latent rank DAG structure is determined as shown in Fig. 10.4. Thus, parameter learning is conducted under this structure. The graph at each rank must be a DAG, and it is clear from the figure that the graphs are DAGs. In addition, for each edge, the label of the out-field (i.e., predecessor) is smaller than that of the in-field (i.e., successor), which is equivalent to topological sorting of the fields. In this case, the graph is always a DAG, regardless of the number of edges and their positions. Generally, when F fields are topologically sorted, the joint probability (JP) of F fields at Rank r is decomposed as Pr (1, . . . , F) = Pr (F| par (F)) × Pr (F − 1| par (F − 1)) × · · · × Pr (1) =
F
Pr ( f | par ( f )),
f =1
Fig. 10.4 Latent rank DAG structure
498
10 Local Dependence Biclustering
Table 10.2 Joint Probability Factorization Induced by Figure 10.4 Factorization Rank 1 P1 (1, · · · , 10)= P1 (10) × P1 (9) × P1 (8) × P1 (7) × P1 (6) ×P1 (5) × P1 (4) × P1 (3) × P1 (2) × P1 (1) 2 P2 (1, · · · , 10)= P2 (10) × P2 (9) × P2 (8|5, 6) × P2 (7|1, 4) × P2 (6) ×P2 (5|4) × P2 (4) × P2 (3) × P2 (2|1) × P2 (1) 3 P3 (1, · · · , 10)= P3 (10) × P3 (9) × P3 (8|4, 6) × P3 (7|4) × P3 (6|4) ×P3 (5|3) × P3 (4|2) × P3 (3) × P3 (2) × P3 (1) 4 P4 (1, · · · , 10)= P4 (10) × P4 (9) × P4 (8|1, 6) × P4 (7) × P4 (6) ×P4 (5|3, 4) × P4 (4) × P4 (3) × P4 (2) × P4 (1) 5 P5 (1, · · · , 10)= P5 (10|7, 9) × P5 (9|7) × P5 (8|6) × P5 (7) × P5 (6) ×P5 (5) × P5 (4) × P5 (3) × P5 (2) × P5 (1)
where par ( f ) is the parent field set of Field f at Rank r . Note that Pr ( f | par ( f )) = Pr ( f ) if par ( f ) = ∅. In this analysis, referring to Fig. 10.4, this factorization is given as shown in Table 10.2. The graph for Rank 1 is an empty (edgeless) graph,1 which implies that the 10 fields are locally independent at the rank. The next problem is how to specify the parameters for each Pr ( f | par ( f )). For P( j| pa( j)) in the BNM and Pr ( j| par ( j)) in LD-LRA, the parameters were set corresponding to the parent item response patterns (PIRPs). Recall that in the BNM, when Item j has three parents, the eight (= 23 ) PIRPs are as follows: 000, 001, 010, 011, 100, 101, 110, 111. Because the CCRRs of Item j are calculable under the respective PIRPs, the parameters for P( j| pa( j)) are designated as follows: π j1 , π j2 , π j3 , π j4 , π j5 , π j6 , π j7 , π j8 . Similarly, in LD-LRA, the parameters are created according to the PIRPs. Those for Pr ( j| par ( j)) are thus specified as πr j1 , πr j2 , πr j3 , πr j4 , πr j5 , πr j6 , πr j7 , πr j8 . For this reason, as recommended in Indegree and PIRP (p. 385), the indegree per item should not be more than three, because the number of PIRPs (parameters) is 2n if the indegree is n.
1
An empty graph is also a DAG.
10.1 Parameter Learning
499
Let the parameters be defined here as well. Suppose that Field f has two parents (Fields k and l) at Rank r , and Fields k and l have five and four items, respectively. There are then the following 512 (= 29 ) PIRPs: Field k Field l
Field k Field l
Field k Field l
Field k Field l
Plan (A) 00000 0000, 00000 0001, 00000 0010, · · · , 11111 1111 . The 512 parameters corresponding to them are then designated as follows: πr f 1 , πr f 2 , πr f 3 , . . . , πr f 512 All 512 parameters may not be necessary because not all 512 PIRPs may be observed; however, it is a highly unrealistic condition for a field to have hundreds of parameters. There are too many parameters to estimate properly. Thus, a method for reducing the number of parameters is required. We consider that the number-right score (NRS) for each field is used. Then, because the numbers of NRS patterns for Fields k and l are six (0–5) and five (0–4), respectively, although all the patterns are not necessarily observed, the maximum number of PIRPs becomes 30 (= 6 × 5) as follows: Plan (B) 00, 01, 02, 03, 04, 05, 10, 11, 12, . . . , 54. Therefore, the number of parameters is reduced to 30, and they are πr f 1 , πr f 2 , πr f 3 , · · · , πr f 30 . However, the number of parameters is still unrealistically large in practice for a single field. Alternatively, let the NRS patterns for all items in the parent fields be considered. Because the number of items classified in Fields k and l is nine, all the NRSs are then the following 10 patterns: Plan (C) 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. With this condition, the number of parameters can be reduced to the following 10: πr f 1 , πr f 2 , πr f 3 , · · · , πr f 10 . Compared to Plan (A), which has no constraint but 512 patterns (parameters), Plan (C) has only 10 parameters as a result of a stronger constraint. From the viewpoint of an analyst, this is still many parameters for a field. For this reason, it was previously stated that a larger number of fields is recommended, as this leads to a smaller number of items in each field. If a field has two parents and
500
10 Local Dependence Biclustering
each parent has two items, the number of PIRPs (parameters) will then be five (0–4). Therefore, this section adopts Plan (C) to specify the PIRPs. How to Select a Plan for Creating PIRPs In the BNM and LD-LRA, PIRPs were defined using Plan (A); however, they could be based on Plan (B) or (C). Which plan to choose depends on the assumptions (constraints) that the analyst has regarding the data. The set of assumptions reflects the analyst’s thought and policy, and the analyst as a human being cannot be free from them. Note, however, that the model fit should be checked. It is wrong-headed and thoughtless to hold to a set of assumptions, even if the model fit is poor. Let the parameter set for Rank r be denoted by D Fr = {πr f d },2 where πr f d is the d-th parameter for Field f at Rank r . Then, the parameter set for Rank 5 is given as D F5 = ⎡ ⎤ π5,1,1 ⎢ π5,2,1 ⎥ ⎢ ⎥ ⎢ π5,3,1 ⎥ ⎢ ⎥ ⎢π ⎥ ⎢ 5,4,1 ⎥ ⎢ ⎥ ⎢ π5,5,1 ⎥ ⎢ ⎥. ⎢ π5,6,1 ⎥ ⎢ ⎥ ⎢ π5,7,1 ⎥ ⎢ ⎥ ⎢π ⎥ π π π ⎢ 5,8,1 5,8,2 5,8,3 5,8,4 ⎥ ⎣π ⎦ π π π π 5,9,1 5,9,2 5,9,3 5,9,4 5,9,5 π5,10,1 π5,10,2 π5,10,3 π5,10,4 π5,10,5 π5,10,6 π5,10,7 π5,10,8 π5,10,9 π5,10,10 π5,10,11 π5,10,12 π5,10,13
In this matrix for Rank 5, Field 8 (the eighth row) has four parameters. This is because the field has a parent (i.e., Field 6) that has three items; thus, the number of NRS patterns is four (0–3). Similarly, Field 10 has two parents (i.e., Fields 7 and 9) with 12 items in total; the number of NRS patterns is 13; this matrix thus has 13 parameters in the 10th row. However, all the parameters will not always be estimated in practice because not all PIRPs (i.e., NRS patterns) are necessarily observed. For example, in Field 8, if there are no students belonging to Rank 5 who pass three items in the items classified in the field’s parents (i.e., Fields 7 and 9), then the fourth parameter (π5,10,4 ) will not be estimated.3 Similarly, the parameter sets for Ranks 1 through 4 can be specified as follows:
2
The subscript D F means “(local) dependence among fields.” Bayesian estimation can produce an estimate for the parameter even under this condition, but it is not necessarily meaningful.
3
10.1 Parameter Learning
501
D F1 = ⎡ ⎤ π1,1,1 ⎢ π1,2,1 ⎥ ⎢ ⎥ ⎢ π1,3,1 ⎥ ⎢ ⎥ ⎢π ⎥ ⎢ 1,4,1 ⎥ ⎢ ⎥ ⎢ π1,5,1 ⎥ ⎢ ⎥, ⎢ π1,6,1 ⎥ ⎢ ⎥ ⎢ π1,7,1 ⎥ ⎢ ⎥ ⎢ π1,8,1 ⎥ ⎢ ⎥ ⎣π ⎦ 1,9,1 π1,10,1 π5,10,2 π5,10,3 π5,10,4 π5,10,5 π5,10,6 π5,10,7 π5,10,8 π5,10,9 π5,10,10 π5,10,11 π5,10,12 π5,10,13
D F2 = ⎡ ⎤ π2,1,1 ⎢ π2,2,1 π2,2,2 π2,2,3 π2,2,4 ⎥ ⎢ ⎥ ⎢ π2,3,1 ⎥ ⎢ ⎥ ⎢π ⎥ ⎢ 2,4,1 ⎥ ⎢ ⎥ ⎢ π2,5,1 π2,5,2 ⎥ ⎢ ⎥, ⎢ π2,6,1 ⎥ ⎢ ⎥ ⎢ π2,7,1 π2,7,2 π2,7,3 π2,7,4 π2,7,5 ⎥ ⎢ ⎥ ⎢ π2,8,1 π2,8,2 π2,8,3 π2,8,4 π ⎥ 2,8,5 π2,8,6 ⎢ ⎥ ⎣π ⎦ 2,9,1 π2,10,1 π5,10,2 π5,10,3 π5,10,4 π5,10,5 π5,10,6 π5,10,7 π5,10,8 π5,10,9 π5,10,10 π5,10,11 π5,10,12 π5,10,13
D F3 = ⎡ ⎤ π3,1,1 ⎢ π3,2,1 ⎥ ⎢ ⎥ ⎢ π3,3,1 ⎥ ⎢ ⎥ ⎢π ⎥ ⎢ 3,4,1 π3,4,2 π3,4,4 ⎥ ⎢ ⎥ ⎢ π3,5,1 π3,5,2 π3,5,3 ⎥ ⎢ ⎥, ⎢ π3,6,1 π3,6,2 ⎥ ⎢ ⎥ ⎢ π3,7,1 π3,7,2 ⎥ ⎢ ⎥ ⎢ π3,8,1 π3,8,2 π3,8,3 π3,8,4 π ⎥ 3,8,5 ⎢ ⎥ ⎣π ⎦ 3,9,1 π3,10,1 π5,10,2 π5,10,3 π5,10,4 π5,10,5 π5,10,6 π5,10,7 π5,10,8 π5,10,9 π5,10,10 π5,10,11 π5,10,12 π5,10,13
D F4 = ⎡ ⎤ π4,1,1 ⎢ π4,2,1 ⎥ ⎢ ⎥ ⎢ π4,3,1 ⎥ ⎢ ⎥ ⎢π ⎥ ⎢ 4,4,1 ⎥ ⎢ ⎥ ⎢ π4,5,1 π4,5,2 π4,5,3 π4,5,4 ⎥ ⎢ ⎥. ⎢ π4,6,1 ⎥ ⎢ ⎥ ⎢ π4,7,1 ⎥ ⎢ ⎥ ⎢ π4,8,1 π4,8,2 π4,8,3 π4,8,4 π ⎥ 4,8,5 π4,8,6 π4,8,7 ⎢ ⎥ ⎣π ⎦ 4,9,1 π4,10,1 π5,10,2 π5,10,3 π5,10,4 π5,10,5 π5,10,6 π5,10,7 π5,10,8 π5,10,9 π5,10,10 π5,10,11 π5,10,12 π5,10,13
502
10 Local Dependence Biclustering
Each field has only one parameter at Rank 1, where all the fields are locally independent. Consequently, a set of all parameters induced by the graphs, D F , is given as D F = { D F1 , D F2 , D F3 , D F4 , D F5 } = {πr f d } (R × F × Dmax ). This set is three-dimensional and not dense. In addition, let Jr f denotes the number of items in the parent(s) of Field f , and Dmax is then Dmax = max{J11 + 1, . . . , Jr f + 1, . . . , JRF + 1}. In this analysis, because the parents of Field 8 at Rank 5 have the most items (i.e., 12 items), the maximum number of NRS patterns per field, Dmax , is thus 13.
10.1.4 PIRP Array This analysis assumed that the CCRRs differed depending on the PIRP based on Plan (C). In this section, we examine how to create PIRPs under Plan (C) through a concrete example. Item responses of Student 7 in the data (J35S515) are4 u7 = [11111 11110 10100 11000 11111 11110 11101] . Counting the NRS for each field by referring to Table 10.1 (p. 496), they are given as u F7 = [32213 33171] .
(10.1)
For instance, three items (i.e., Items 1, 31, and 32) are classified in Field 1, and Student 7 passed all of them; thus, the NRS for Field 1 in u F7 is 3. In addition, Field 10 has seven items (i.e., Items 14, 15, 18, 19, 20, 30, and 35), but the student passed only one item (i.e., Item 14) of the seven; the 10th element in u F7 is thus 1. ˜ F ), From the data matrix (U) and the dichotomized field membership matrix ( M the field NRS data for all students can be obtained as follows: ˜ F (S × F). UF = UM
(10.2)
The (s, f )-th element of this matrix represents the NRS of Student s for Field f . The field NRS vector for Student 7, shown in Eq. (10.1), is contained in the seventh row of the matrix. Table 10.3 shows the PIRP arrays of Student 7 for Ranks 2 and 3. In the table for Rank 2, the second column from the left (PF) shows the parent field of each field. 4
The data are sectioned by every five items for ease of reading.
10.1 Parameter Learning
503
Table 10.3 Student 7’s PIRP array for latent ranks 2 and 3
Rank 2 Field 1 2 3 4 5 6 7 8 9 10
*1
PF − 1 − − 4 − 1, 4 5, 6 − −
*2
3 2 2 1 3 3 3 1 7 1
NRS − 3 − − 1 − 4 6 − −
Parent Item Response Pattern (
)
1 1 1 1 1 1 1 1 1 1
*1 parent field(s), *2 number-right score
Rank 3 Field 1 2 3 4 5 6 7 8 9 10
PF*1 − − − 2 3 4 4 4, 6 − −
3 2 2 1 3 3 3 1 7 1
NRS*2 − − − 2 2 1 1 4 − −
Parent Item Response Pattern (
)
1 1 1 1 1
1 1 1
1 1 1
*1 parent field(s), *2 number-right score
For example, the parent of Field 2 (at Rank 2) is Field 1, and those of Field 7 are Fields 1 and 4. In addition, the third column is u F7 . The fourth column indicates the NRS in the parent(s) of each field, and from the eighth column onwards are the PIRPs of the respective fields. For example, Field 2 has a parent (i.e., Field 1); Student 7 passed three items in Field 1; the NRS pattern (i.e., PIRP) is the fourth pattern, and a 1 is entered in the fourth cell. Similarly, Field 8 has two parents (i.e., Fields 5 and 6); Student 7 passed six items in the fields; the PIRP is the seventh, and the seventh cell is 1. For the fields that do not have a parent (i.e., source fields), the number of items in the parent fields is zero; thus, the NRS is also zero; the PIRP is then the first pattern.
504
10 Local Dependence Biclustering
As for the lower table (for Rank 3), the rightmost (10 × 13) matrix is the PIRP array of Student 7 for Rank 3, denoted as 73 . Similarly, the PIRPs in the upper table (for Rank 2) can be expressed as 72 . Furthermore, 71 , 74 , and 75 for Ranks 1, 4, and 5, respectively, can also be created in the same manner. Accordingly, an array collecting all the PIRPs for Student 7 is created as 7 = { 71 , 72 , 73 , 74 , 75 } (R × F × Dmax ). This is a three-dimensional array. Finally, a four-dimensional array collecting the PIRP arrays of all S students can be expressed as = { 1 , · · · , s , · · · , S } = {γsr f d } (S × R × F × Dmax ), where the (s, r, f, d)-th element represents γsr f d =
1, if Student s s NRS for par ( f )item(s) is d − 1 . 0, otherwise
10.1.5 Likelihood Let the likelihood be constructed using the data matrix (U), PIRP array (), and parameter set ( D F ). First, the likelihood of all students is factored as follows: l(U| D F ) =
S
l(us | D F ),
s=1
where l(us | D F ) is the likelihood of Student s. This decomposition is based on the natural assumption that the responses of each student are independent of those of the other students. In addition, with the rank memberships of Student s and the field memberships of Item j, the likelihood of Student s can be expressed as l(us | D F ) =
R r =1
l(us | D Fr )m sr =
R J F
l(u s j |π r f )m sr m j f ,
(10.3)
r =1 j=1 f =1
where l(us | D Fr ) represents the likelihood of observing us under the condition that Student s belongs to Rank r . Moreover, l(u s j |π r f ) is the likelihood of u s j if Student s belongs to Rank r and Item j is classified into Field f . Furthermore, π r f denotes the parameters specified for Field f at Rank r . That is, it is the f -th row vector in the parameter set for Rank r ( D Fr ). Datum u s j can contribute to estimating only one of the parameters in π r f , and the PIRP selects the parameter from π r f . Note that the PIRPs of Student s at Rank
10.1 Parameter Learning
505
r are collected in sr (F × Dmax ). For example, let us consider the likelihood of Student 7’s datum for Item 21 (u 7,21 ) at Rank 2. First, from Table 10.3, the PIRPs of the student at the rank are as follows: ⎡ ⎤ 1000000000000 ⎢ ⎥ 1 ⎢ ⎥ ⎢1 ⎥ ⎢ ⎥ ⎢1 ⎥ ⎢ ⎥ ⎢ 1 ⎥ ⎢ ⎥ (10 × 13), 72 = ⎢ ⎥ 1 ⎢ ⎥ ⎢ ⎥ 1 ⎢ ⎥ ⎢ ⎥ 1 ⎢ ⎥ ⎣1 ⎦ 1 where an element in the second row is focused on because Item 21 is classified in Field 2 (see Table 10.1, p. 496). Note that the f -th row is checked if Item j is grouped into Field f . In addition, from Fig. 10.4 (p. 497), Field 2’s parent is Field 1 at Rank 2. Student 7 passed three items in Field 1 from Eq. (10.1) (p. 502); the PIRP is thus the fourth pattern; the element in the second row and fourth column in 72 (γ7224 , the shaded cell) is thus 1. This implies that the fourth parameter in the second row vector of D F2 is selected. Accordingly, in the likelihood of u 7,21 at Rank 2, only the term regarding π2,2,4 can remain due to γ7,2,2,4 as follows: l(u 7,21 |π 2,2 ) =
4
γ7,2,2,d π2,2,d
u 7,21 1−u 7,21 4 γ7,2,2,d 1− π2,2,d
d=1 u
d=1
7,21 = π2,2,4 (1 − π2,2,4 )1−u 7,21 .
In other words, u 7,21 is only used to estimate π2,2,4 . Additionally, from u 7,21 = 1, the above equation is further reduced to 1 (1 − π2,2,4 )0 = π2,2,4 . l(u 7,21 |π 2,2 ) = π2,2,4
In general, the likelihood of the response of Student s belonging to Rank r for Item j classified in Field f can be represented as l(u s j |π r f ) =
J r f +1
γ πr fsrdf d
d=1
=
d=1
Jr f +1
1−
γ πr fsrdf d
d=1
Jr f +1
u s j
u
πr fs dj (1 − πr f d )1−u s j
γsr f d
.
1−u s j
506
10 Local Dependence Biclustering
Consequently, the likelihood for all the data is expressed as l(U| D F ) =
r f +1 F J S R J
u
πr fs dj (1 − πr f d )1−u s j
γsr f d m sr m j f
.
(10.4)
s=1 r =1 j=1 f =1 d=1
10.1.6 Posterior Density A prior density can be assumed for each element of the local dependence parameter set ( D F = {πr f d }), which often contributes to making the parameter estimates more stable (preventing them from being extremely large or small). Beta density (see Sect. 4.5.3, p. 119) is the conjugate prior and is imposed on all the elements of D F . The density function of the beta distribution is as follows: β −1
pr (πr f d ; β0 , β1 ) =
πr f1d (1 − πr f d )β0 −1 B(β0 , β1 )
,
where β0 and β1 are the hyperparameters of this density; thus, the prior density for all parameters is represented as pr ( D F ; β0 , β1 ) =
r f +1 R F J
pr (πr f d ; β0 , β1 ).
r =1 f =1 d=1
Then, with this prior density pr ( D F ; β0 , β1 ), the posterior density of the local dependence parameter set ( D F ) can be constructed as l(U| D F ) pr ( D F ; β0 , β1 ) pr (U) ∝l(U| D F ) pr ( D F ; β0 , β1 )
pr ( D F |U) =
∝
r f +1 S R J F J us j z γ m m πr f d (1 − πr f d )1−u s j s j sr f d sr j f
s=1 r =1 j=1 f =1 d=1
×
r f +1 R F J
β −1
πr f1d (1 − πr f d )β0 −1 .
(10.5)
r =1 f =1 d=1
10.1.7 Estimation of Parameter Set Taking the logarithm of Eq. (10.5), the log-posterior density is obtained as follows:
10.1 Parameter Learning
507 J +1
ln pr ( D F |U) =
f F r R J S
z s j γsr f d m sr m j f u s j ln πr f d + (1 − u s j ) ln(1 − πr f d )
s=1 r =1 j=1 f =1 d=1 f +1 F Jr R {(β1 − 1) ln πr f d + (β0 − 1) ln(1 − πr f d )
+
r =1 f =1 d=1
=
f +1 F Jr J R S
r =1 f =1 d=1
+
J S
z s j γsr f d m sr m j f u s j + β1 − 1 ln πr f d
s=1 j=1
z s j γsr f d m sr m j f u s j (1 − u s j ) + β0 − 1 ln(1 − πr f d ) .
s=1 j=1
In addition, from Sect. 10.1.1 (p. 493), by substituting the current estimate of the (T ) }, S × R) for the field membership smoothed rank membership matrix (S(T ) = {ssr ˜ matrix ( M F = {m˜ j f }, J × F) in the above equation, we obtain
=
ln pr ( D F |U) r f +1 S R F J J r =1 f =1 d=1
+
S J
(T ) z s j γsr f d ssr m˜ j f u s j + β1 − 1 ln πr f d
s=1 j=1 (T ) z s j γsr f d ssr m˜ j f (1
− u s j ) + β0 − 1 ln(1 − πr f d )
s=1 j=1
=
r f +1 R F J
(T ) (T ) S1r f d + β1 − 1 ln πr f d + S0r f d + β0 − 1 ln(1 − πr f d ) ,
r =1 f =1 d=1
(10.6) where (T ) S1r fd =
S J
(T ) γsr f d ssr m˜ j f z s j u s j ,
(10.7)
(T ) γsr f d ssr m˜ j f z s j (1 − u s j ).
(10.8)
s=1 j=1 (T ) S0r fd =
S J s=1 j=1
(T ) (T ) In these equations, S1r f d and S0r f d are the sums of the correct and incorrect responses, respectively, of the Rank r students whose NRS for par ( f ) items is d − 1 (i.e., PIRP is d-th). The D F that maximizes Eq. (10.6) is the MAP of D F . Differentiating Eq. (10.6) with respect to πr f d , all the terms unrelated to πr f d are eliminated; thus,
508
10 Local Dependence Biclustering
d ln pr (πr f d |U) d ln pr ( D F |U) = dπr f d dπr f d remains. This implies that all parameters are mutually disjoint, and each parameter can be optimized individually. Setting this first derivative to 0 yields (T ) (T ) S1r S0r d ln pr (πr f d |U) f d + β1 − 1 f d + β0 − 1 = − = 0, dπr f d πr f d 1 − πr f d
and solving the equation with respect to πr f d , the MAP of πr f d that maximizes ln pr (πr f d |U) can then be obtained as πˆ r(MAP) fd
=
(T ) S1r f d + β1 − 1 (T ) (T ) S0r f d + S1r f d + β0 + β1 − 2
.
Note that this is identical to the MLE unless the prior density is assumed, or when β0 = β1 = 1, even if we assume a prior. That is, the MLE is given as L) πˆ r(M fd =
(T ) S1r fd (T ) (T ) S0r f d + S1r f d
.
(10.9)
With (β0 , β1 ) = (1, 1), the parameter set is estimated as follows: ⎤ 0.654 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 ⎥ ⎢ 0.076 ⎥ ⎢ ⎥ ⎢ 0.184 ⎥ ⎢ ⎥ ⎢ 0.382 ⎥ ⎢ ⎥ ⎢ 0.050 ⎥ ˆ D F1 = ⎢ ⎥ ⎢ 0.098 ⎥ ⎢ ⎥ ⎢ 0.218 ⎥ ⎢ ⎥ ⎢ 0.061 ⎥ ⎢ ⎦ ⎣ 0.056 ⎡
0.024
⎤ 0.822 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 ⎥ ⎢ 0.146 0.318 0.383 0.597 ⎥ ⎢ ⎥ ⎢ 0.332 ⎥ ⎢ ⎥ ⎢ 0.493 ⎢ ⎥ ⎢ 0.160 0.255 ⎥ ⎢ ⎥ ˆ D F2 = ⎢ ⎥ 0.254 ⎢ ⎥ ⎢ 0.123 0.293 0.217 0.306 0.376 ⎥ ⎢ ⎥ ⎢ 0.065 0.089 0.236 0.443 0.196 0.285 0.624 ⎥ ⎢ ⎥ ⎣ 0.110 ⎦ ⎡
0.036
10.1 Parameter Learning
509
⎤ 0.892 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 ⎥ ⎢ 0.874 ⎥ ⎢ ⎥ ⎢ 0.803 ⎥ ⎢ ⎥ ⎢ 0.473 0.492 0.650 ⎥ ⎢ ⎥ ⎢ 0.273 0.319 0.714 ⎥ ⎢ ˆ D F3 = ⎢ ⎥ 0.403 0.486 ⎥ ⎢ ⎥ ⎢ 0.316 0.408 ⎥ ⎢ ⎥ ⎢ 0.103 0.166 0.177 0.439 0.590 ⎥ ⎢ ⎦ ⎣ 0.180 ⎡
0.043
⎤ 0.920 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 ⎥ ⎢ 0.971 ⎥ ⎢ ⎥ ⎢ 0.970 ⎥ ⎢ ⎥ ⎢ 0.701 ⎥ ⎢ ⎥ ⎢ 0.287 0.477 0.911 0.952 ⎥ ⎢ ˆ D F4 = ⎢ ⎥ 0.726 ⎥ ⎢ ⎥ ⎢ 0.482 ⎥ ⎢ ⎥ ⎢ 0.003 0.000 0.370 0.370 0.401 0.532 0.779 ⎥ ⎢ ⎦ ⎣ 0.362 ⎡
0.086
⎤ 0.963 0.000 0.000 0.000 0.000 0.000 0.000 0.000 .000 .000 .000 .000 .000 ⎥ ⎢ 0.996 ⎥ ⎢ ⎥ ⎢ 0.995 ⎥ ⎢ ⎥ ⎢ 0.865 ⎥ ⎢ ⎥ ⎢ 0.994 ⎥ ⎢ ˆ D F5 = ⎢ ⎥ 0.918 ⎥ ⎢ ⎥ ⎢ 0.733 ⎥ ⎢ ⎥ ⎢ 0.511 0.444 0.594 0.917 ⎥ ⎢ ⎦ ⎣ 0.406 0.519 0.650 0.679 0.851 0.087 0.028 0.065 0.043 0.110 0.117 0.118 0.163 0.217 0.275 0.262 0.257 0.950 ⎡
All the parameters specified in Sect. 10.1.3 (p. 497) were estimated, which impliess that all PIRPs based on Plan (C) were observed. The MAP estimator can produce an estimate even for a parameter whose PIRP is not observed. However, when the MAP is employed, even if the hyperparameters are set to (β0 , β1 ) = (1, 1), as in this case, the MAP is the same as the MLE, and the MLE, however, can provide an estimate only for the parameters for which PIRP is observed. In other words, the ( f, d)-th parameter at Rank r , πr, f,d , could not be estimated if all the students’ PIRPs were not d-th (γs,r, f,d = 0 ∀s ∈ N S ). In this situation, Eqs. (10.7) and (10.8) (p. 507) would yield the following values:
510
10 Local Dependence Biclustering (T ) (T ) S0r f d = S1r f d = 0.
The denominator of the MLE in Eq. (10.9) (p. 508) was thus 0, and the MLE could not be estimated. If (β0 , β1 ) = (1, 1), the MAP is estimated as πˆ r, f,d =
β1 − 1 . β0 − β1 − 2
Because this value is not based on any datum but calculated from only the hyperparameters, it would be difficult to find a practical meaning for this value. Note that this takes a value of 0.5 when β0 = β1 = 1. If the number of parameters is small, the estimates can be easily written in the DAGs; however, this is difficult in this case because of the large number of parameters. Thus, it is recommended to add line plots for the parameters, as shown in Fig. 10.5. Each field at each rank has one or more black bars. Fields with only one bar are the source fields. In Rank 1, all 10 fields are the sources. For instance, in the black bar of Field 1 at Rank 1, a red dot is placed around the midpoint of the bar. The black bar represents the CCR range of [0, 1], and the red dot represents the CCR, which was estimated to be 0.654, as shown in the first line of the ˆ D F1 ). Thus, the red dot is placed at 65.4% of the bar from Rank 1 parameter set ( the bottom. In addition, Field 3 is a source at all ranks; thus, the field at each rank has only one bar. The Field 3 bars show that as the rank increases, because the CRR of the field increases, the red dot rises. Meanwhile, the fields with more than one bar have at least one parent and thus have two or more CCRRs (parameters) corresponding to the respective PIRPs, which will be explained in detail in Sect. 10.2.1.
10.2 Outputs This section describes the outputs provided by LDR (LDB). The field PIRP profile introduced in the following section is a unique result for LDR (LDB). Many of the other outputs are similar to those of ranklustering (biclustering).
10.2.1 Field PIRP Profile The field PIRP profile is a plot of the parameter estimates for each field, which is important for interpreting the features of the field. Figure 10.6 shows the field PIRP profiles, where those for the source fields are omitted because their plots are single dots. Thus, the figure does not include those of Rank 1 because all the fields at that rank are sources. The upper left panel in the figure shows the field PIRP profiles at Rank 2 for Fields 2, 5, 7, and 8, which have one or more parents. The source fields at Rank 2
10.2 Outputs
Fig. 10.5 Parameter estimates for latent rank DAGs
511
512
10 Local Dependence Biclustering Rank 3
Rank 2 1.0
0.8
0.4 0.2
8
2
0.6
2 7 5 5 2 7 8
2
8
7
7 8 7
Correct Response Rate
Correct Response Rate
1.0
8 8
0.8
5 4
0.6 0.4
4 6 7 5
0.2
8
8
4 6 7 5 8
8 8
8
0.0
0.0 0
1
2
3
4
5
6
7
8
9
10
11
0
12
1
2
3
4
Rank 4
Correct Response Rate
5
8 8
0.6
8
5 8
8
8
5 0.2
0.8
9 8
0.6 0.4
8 9
8
8
0
1
0.0 2
3
8
9
10
11
4
5
6
7
12
8
9
10
PIRP (Number-Right Score) in Parent Field(s)
11
12
10 9
9
9 8
0.2
10 0.0
7
Rank 5
5
0.8
0.4
6
1.0
Correct Response Rate
1.0
5
PIRP (Number-Right Score) in Parent Field(s)
PIRP (Number-Right Score) in Parent Field(s)
0
10 10 10 1
2
3
10 10 10
4
5
6
10
10
7
8
10 10 10
9
10
11
12
PIRP (Number-Right Score) in Parent Field(s)
Fig. 10.6 Field PIRP profile
(i.e., Fields 1, 3, 4, 6, 9, and 10) are not plotted. In this panel, the plot with marker 2 represents the field PIRP profile for Field 2, which plots the second-row elements ˆ D F2 ),5 of the Rank 2 parameter set ( πˆ D F22 = [0.146 0.318 0.383 0.597] . Field 2 has a parent (i.e., Field 1), and the parent has three items. In this analysis, the PIRP was defined as the NRS (i.e., Plan (C)); thus, the horizontal axis of the figure represents the NRS. There are four PIRPs: 0, 1, 2, and 3. This profile indicates that in Rank 2, the CCRRs of a student passing 0, 1, 2, or 3 items in Field 1 for an item in Field 2 (i.e., Item 21 or 22) are 0.146, 0.318, 0.383, and 0.597, respectively. In other words, the CCRR for a Field 2 item increases as the NRS for the Field 1 items increases. This fact encourages the Rank 2 students to review the Field 1 items before learning the Field 2 items. Usually, each profile rises with the horizontal axis because it is natural for the CCRR to grow with respect to the NRS for the items in the parent field(s). As shown in Fig. 10.6, the profiles are generally rising rightward. Unless the profile of a field
5
ˆ D F2 . Note that this vector is vertical (i.e., transposed) after it is extracted from
10.2 Outputs
513
is upward, it indicates that the parent(s) of the field were designated improperly, suggesting that the parent(s) were relieved or replaced by the other field(s). Therefore, it is recommended that the field PIRP profiles be referred to when determining and revising the latent DAG structure. For example, if the structure were to be modified further, either parent of Field 8 at Rank 2 (i.e., Field 5 or 6) might be dismissed because the field PIRP profile for Field 8 at Rank 2 is slightly uneven.
10.2.2 Rank Membership Matrix Estimate Estimation Process of LDR
LIR
, ↓
LDR
The above box shows the parameter estimation process of the LDR (and local independence ranklustering, LIR).6 Note that this process has progressed to the point ˆ D F ), as shown in the lower right of obtaining the local dependence parameter set ( part of the chart. ) In the LIR, the rank reference matrix at the final T -th cycle (M (T R ) was adopted as the rank reference matrix estimate; however, this must be recalculated because ) (T −1) . In other words, it is necessary to re-estimate the M (T R was obtained from B ˆ R placed at the bottom right of the chart) based on rank membership matrix (i.e., M ˆ DF . Note, however, that it is not necessary to re-estimate the field membership matrix because it was already determined as the dichotomized field membership matrix ) ˜ F ), which was created from M (T (M F in Sect. 10.1.2 (p. 495). This implies that we accepted the item classification results by the LIR because this classification must be determined in advance of building an interfield-dependency structure. The field ˆ R , but this might cause a ˆ D F and M membership matrix can be re-estimated from change in the item classification. ˜ F , because the likelihood of observing Student s’s data under ˆ D F and M With ˆ D Fr ) from the condition that the student belongs to Rank r can be expressed as l(us | Eq. (10.3) (p. 504), the membership of the student to the rank can be obtained as
6
If the smoothing process is skipped, this is the estimation process for the LDB.
514
10 Local Dependence Biclustering
Table 10.4 Rank Membership Estimate and Latent Rank Estimate (LDR) Rank Student 1 Student 2 Student 3 Student 4 Student 5 Student 6 Student 7 Student 8 Student 9 .. . Student 515 LRD∗4 RMD∗5 ∗1 ∗4
1 0.916 0.000 0.850 0.000 0.936 0.022 0.000 0.000 0.911 .. . 0.000 163 148.27
2 0.083 0.000 0.150 0.013 0.063 0.931 0.000 0.000 0.089 .. . 0.000 91 103.00
3 0.000 0.098 0.000 0.931 0.000 0.047 0.000 0.018 0.000 .. . 0.010 102 105.61
4 0.000 0.860 0.000 0.055 0.000 0.000 0.010 0.781 0.000 .. . 0.577 91 86.10
5 0.000 0.042 0.000 0.000 0.000 0.000 0.990 0.202 0.000 .. . 0.413 68 72.02
LRE∗1 1 4 1 3 1 2 5 4 1 .. . 4
RUO∗2 0.091 0.049 0.177 0.060 0.068 0.051 0.258 0.098 .. . 0.716
RDO∗3 0.113 0.014 0.024 0.010 0.023 .. . 0.018
latent rank estimate, ∗2 rank-up odds, ∗3 rank-down odds latent rank distribution, ∗5 rank membership distribution
mˆ sr
ˆ D Fr )πr πr Jj=1 Ff=1 l(u s j |πˆ r f )m˜ j f l(us | = R = R , J F ˆ ˆ r f )m˜ j f r =1 πr j=1 f =1 l(u s j |π r =1 l(us | D Fr )πr
where πr is the prior probability of the student belonging to Rank r , which is usually set to πr = 1/R. ˆ R ). Each row in the Table 10.4 shows the rank membership matrix estimate ( M table represents the rank membership profile of the corresponding student. For example, Student 1 has a 91.6% membership to Rank 1 (mˆ 11 = 0.916), 8.3% to Rank 2 (mˆ 12 = 0.083), and almost 0% to Ranks 3, 4, and 5. Student 1’s latent rank estimate (LRE) is thus Rank 1 because the student is most likely to belong to Rank 1. Similarly, Student 2 has the largest membership to Rank 4 (mˆ 24 = 0.860); thus, the student’s LRE is Rank 2. In addition, Fig. 10.7 plots the rank membership profiles of the first nine students. Both Students 2 and 8 belong to Rank 4, but the membership of Student 8 is smaller (mˆ 84 < mˆ 44 ), and the membership of Student 3 to Rank 5 is moderately large (20.2%), which suggests that Student 8 belongs to Rank 4 but is likely to move up to Rank 5. The rank-down odds (RDO) represent this possibility of dropping to the next lower rank and are obtained by
10.2 Outputs
515
0.8
0.8
0.8
0.6 0.4 0.2
Membership
1.0
Membership
Membership
1.0 0.6 0.4 0.2
1
2
3
4
1
5
0.4 0.2
2
3
4
1
5
0.8
0.8
0.8
0.4 0.2 0.0
Membership
1.0
Membership
1.0
0.6
0.6 0.4 0.2
3
4
5
1
0.4 0.2
2
3
4
1
5
Student 7
Student 8 0.8
0.8
0.0
Membership
0.8
Membership
1.0
0.2
0.6 0.4 0.2 0.0
2
3
4
5
3
4
5
Student 9
1.0
1
2
Latent Rank
1.0
0.4
5
0.6
Latent Rank
Latent Rank
0.6
4
0.0
0.0 2
3
Student 6
Student 5
Student 4 1.0
1
2
Latent Rank
Latent Rank
Latent Rank
Membership
0.6
0.0
0.0
0.0
Membership
Student 3
Student 2
Student 1 1.0
0.6 0.4 0.2 0.0
1
Latent Rank
2
3
4
5
Latent Rank
1
2
3
4
5
Latent Rank
Fig. 10.7 Rank membership profiles (LDR)
R D Os =
mˆ s, Rˆ s −1 mˆ s, Rˆ s
,
where Rˆ s is the LRE of Student s. The RDO of Student 2 is listed as 0.113 in Table 10.4, which is calculated by R D O4 =
mˆ 23 0.098 = 0.113. = mˆ 24 0.860
Conversely, the possibility of transitioning to the next higher rank is represented by the rank-up odds (RUO) and is given by RU O s =
mˆ s, Rˆ s +1 mˆ s, Rˆ s
.
Moreover, in Fig. 10.8, the bar graph shows the latent rank distribution (LRD), which is the frequency distribution for the LREs of the 515 students and is also listed in the second row from the bottom in Table 10.4. It can be seen from the figure that the ranks with the most and least students are Ranks 1 and 5 with 163 and 53 students, respectively.
516
10 Local Dependence Biclustering
Fig. 10.8 LRD and RMD (LDR) Frequency
150
163
100
102 91
91 68
50
0
1
2
3
4
5
Latent Rank
The line plot in the figure represents the rank membership distribution (RMD), which is also displayed in the bottom row of Table 10.4. The RMD is the column sum vector of the rank membership matrix. That is, ˆ R 1 S (R × 1). dˆ R M D = M The sum of the rank membership profile of each student is 1. For Student 2, the frequency of Rank 4 is incremented by +1 because the student belongs to the rank; however, this may be overadditive because the membership of the student to the rank is 0.860 (86.0%). Meanwhile, the RMD adds +0.860 to the frequency of Rank 4, +0.098 to that of Rank 3, and the respective membership to each rank. Thus, the LRD is the rank distribution for the sample, whereas the RMD is the rank distribution for the population. In this analysis, there were no significant differences between the LRD and RMD.
10.2.3 Marginal Field Reference Profile Suppose that π f r is the CRR of a student belonging to Rank r for an item classified in Field f ; it can be expressed as π f r = Pr ( f |r ). Meanwhile, πr f d estimated in Sect. (10.1.7) can be represented as πr f d = Pr ( f |r, d),
10.2 Outputs
517
where d indicates that this parameter corresponds to the d-th PIRP. 7 Therefore, the following relationship can be derived from the above two equations: Jr f +1
Pr ( f |r ) =
Pr ( f |r, d)Pr (d),
d=1
where Jr f represents the number of items in the parent(s) of Field f at Rank r , and Pr (d) is the ratio of the students whose PIRP is d-th among the students belonging to Rank r . Because the students who missed all the items in Field f (to whom all the items were not presented) are not counted in the number, this ratio is obtained as J ˆ sr sgn ˜ j f z s j γsr f d s=1 m j=1 m J
S ˆ sr sgn ˜ j f zs j s=1 m j=1 m
S Pr (d) =
.
Accordingly, πˆ f r is given from πˆ r f d as follows: πˆ f r =
J ˆ sr sgn ˜ j f z s j γsr f d s=1 m j=1 m J
S ˆ sr sgn ˜ j f zs j s=1 m j=1 m
S
Jr f +1
πˆ r f d
d=1
.
Note that this operation marginalizes πˆ r f d over d and obtains πˆ f r . Computing this for all the ranks and fields, we have ⎡
ˆ MB
πˆ 11 · · · ⎢ .. . . =⎣ . . πˆ F1 · · ·
⎤ πˆ 1R .. ⎥ = {πˆ } (F × R). fr . ⎦
πˆ FR
This matrix is obtained by marginalizing the local dependence parameter set with respect to the PIRPs; thus, it is called the marginal bicluster reference matrix. Table 10.5 shows the marginal bicluster reference matrix. In this table, the f -th row vector represents the marginal FRP of Field f . For example, it is found from the marginal FRP of Field 1 that the average CRR of a Field 1 item is 0.654 for the Rank 1 students and 0.822 for the Rank 2 students. In addition, the latent field distribution (LFD) in the table represents the number of items classified in the respective fields. The test reference profile (TRP) in the table is described later in the text. Figure 10.9 shows the marginal FRPs of all 10 fields. Each plot shows the change in CRR on the latent rank scale. For instance, Field 2 is found to be an item group in which CRRs sharply increase from Rank 1 to 2. In addition, Field 3 items are difficult for Rank 2 students but easy for Rank 3 students. Table 10.6 shows the indices summarizing the shape of the marginal FRP of each field. The field slope index, denoted by (α, ˜ a), ˜ expresses the discrimination of the 7
In this analysis, based on Plan (C), d represents that the NRS for the items in the parent(s) of Field f at Rank r is d − 1.
518
10 Local Dependence Biclustering
Table 10.5 Marginal Bicluster Reference Matrix (LDB)
Field 1 Field 2 Field 3 Field 4 Field 5 Field 6 Field 7 Field 8 Field 9 Field 10 TRP∗2 LRD∗3 ∗1 ∗3
Rank 1 0.654 0.076 0.184 0.382 0.050 0.098 0.218 0.061 0.056 0.024 4.92 163
Rank 2 0.822 0.507 0.332 0.493 0.207 0.254 0.312 0.172 0.110 0.036 8.74 91
latent field distribution, latent rank distribution
∗2
Rank 3 0.892 0.874 0.803 0.627 0.618 0.455 0.374 0.272 0.180 0.043 13.66 102
Rank 4 0.920 0.971 0.970 0.701 0.926 0.726 0.482 0.570 0.362 0.086 18.87 91
Rank 5 0.963 0.996 0.995 0.865 0.994 0.918 0.733 0.863 0.715 0.377 26.49 68
LFD∗1 3 2 2 1 3 3 4 2 8 7
test reference profile,
field: α˜ indicates the smaller rank of the adjacent pair between which the marginal FRP exhibits the steepest rise, and a˜ is the magnitude of the rise. For example, from Table 10.5, the marginal FRP of Field 3 shows the sharpest rise from Rank 2 to 3 (πˆ 32 = 0.332, πˆ 33 = 0.803); thus, the slope index for the field is (α˜ 3 , a˜ 3 ) = (2, 0.471). ˜ represents the field location index, and β˜ is the rank when the ˜ b) In addition, (β, marginal FRP is closest to 0.5,8 and b˜ is the value of the marginal FRP. For example, the marginal FRP of Field 4 is approximately 0.5 at Rank 2 (πˆ 42 = 0.493); thus, the location index of the field is obtained as (β˜4 , b˜4 ) = (2, 0.493). Moreover, (γ˜ , c) ˜ is the field monotonicity index. When the marginal FRP is not monotonically increasing, γ˜ represents the ratio of the adjacent rank pairs between which the marginal FRP drops to all R − 1 adjacent rank pairs, and c˜ indicates the cumulative drops. Note that this index is obtained as (γ˜ , c) ˜ = (0, 0) when the marginal FRP monotonically increases; conversely, it is (γ˜ , c) ˜ = (1, −1) when the FRP is monotonically decreasing. As shown in Fig. 10.9, the marginal FRPs of all fields monotonically increase. Therefore, the monotonicity indices for all the fields are (γ˜ , c) ˜ = (0, 0), as shown in Table 10.6. The marginal FRP of a field summarizes the features of the group of items classiˆ M B ) is fied in the field. Each row vector of the marginal bicluster reference matrix ( the marginal FRP of the corresponding field. The field PIRP profile provides the most detailed information about each field; however, it may not be useful for grasping the
8
The criterion is not necessarily “closest to 0.5.” It can be “closest to 0.6” or “first exceeding 0.8.”
Field 1 1.0 0.8 0.6 0.4 0.2 0.0 3
4
5
0.6 0.4 0.2 0.0 1
2
Field 4 1.0 0.8 0.6 0.4 0.2 0.0 1
2
3
4
5
0.8 0.6 0.4 0.2 0.0 4
5
Correct Response Rate
Latent Rank
Correct Response Rate
Correct Response Rate
Field 7
3
0.8 0.6 0.4 0.2 0.0 1
2
Field 5 0.8 0.6 0.4 0.2 0.0 1
2
3
4
5
Field 8 0.8 0.6 0.4 0.2 0.0 2
3 Latent Rank
4
5
4
5
4
5
Field 6 1.0 0.8 0.6 0.4 0.2 0.0 1
2
3 Latent Rank
1.0
1
3 Latent Rank
Latent Rank
1.0
2
5
1.0
Latent Rank
1
4
Field 3 1.0
Latent Rank Correct Response Rate
Correct Response Rate
Latent Rank
3
Correct Response Rate
2
0.8
4
5
Correct Response Rate
1
Field 2 1.0
Correct Response Rate
519 Correct Response Rate
Correct Response Rate
10.2 Outputs
Field 9 1.0 0.8 0.6 0.4 0.2 0.0 1
2
3 Latent Rank
Field 10 1.0 0.8 0.6 0.4 0.2 0.0 1
2
3
4
5
Latent Rank
Fig. 10.9 Marginal field reference profile (LDR)
general tendency of the field. In contrast, the marginal FRP outlines the characteristics of each field. In addition, a can-do chart as in Fig. 7.6 (p. 301) can be created ˆ M B ). from the marginal bicluster reference matrix ( ˆ M B represents the marginal rank refFurthermore, the r -th column vector of erence vector (RRV) of Rank r , and Fig. 10.10 shows those of all the R ranks, which are useful for understanding the differences between the ranks by focusing on large gaps between the plots. For example, there exists a spread at Field 2 between the plots of Ranks 1 and 2, which indicates that the Rank 1 students are much inferior to the Rank 2 students in terms of the ability required to pass Field 2 items. This fact also suggests that for Rank 1 students, relearning or reviewing the range covered by Field 2 items is necessary to move up to Rank 2. Similarly, the fields that largely separate the plots of Ranks 2 and 3 are Fields 3 and 5, which implies that the Rank 2 students are much less capable than the Rank
520
10 Local Dependence Biclustering
Table 10.6 FRP indices (LDR)
Index Field 1 Field 2 Field 3 Field 4 Field 5 Field 6 Field 7 Field 8 Field 9 Field 10
1.0
Correct Response Rate
0.8
5 4 3 2
5 4 3
α˜ 1 1 2 4 2 3 4 3 4 4
5 4
5 4 3
b˜ 0.654 0.507 0.332 0.493 0.618 0.455 0.402 0.576 0.362 0.377
γ˜ 0 0 0 0 0 0 0 0 0 0
5
5
4 3
1
β˜ 1 2 2 2 3 3 4 4 4 5
a˜ 0.168 0.431 0.471 0.164 0.411 0.271 0.252 0.298 0.353 0.291
4
3
5
4 3
2 2
0.4
1 2
2 0.2
4 3 2 1
1 1
1
5
5
2
0.6
c˜ 0 0 0 0 0 0 0 0 0 0
1
4 3 2 1
3 2 1
8
9
0.0 1
2
3
4
5
6
Field Fig. 10.10 Rank reference vector (LDR)
7
5
4 3 2 1 10
10.2 Outputs
521
3 students considering the abilities required in Fields 3 and 5. In other words, Rank 2 students can move up to Rank 3 if they train on the items of Fields 3 and 5.
10.2.4 Test Reference Profile The test reference profile (TRP) is the expected NRS on the test for students belonging to the respective ranks. From Table 10.5, the LFD is obtained as dˆ LFD = [3 2 2 1 3 3 4 2 8 7] , and the TRP is then calculated as ⎡ ˆ MB dˆ LFD = ⎢ ˆt TRP = ⎣
0.654 0.076 · · · .. .. . . . . . 0.963 0.996 · · ·
⎡ ⎤ ⎡ 4.915 ⎤ ⎤ 3 0.024 ⎢ ⎥ ⎢ 8.744 ⎥ ⎥ .. ⎥ ⎢2⎥ = ⎢ . ⎥ ⎢ 13.657 ⎥ . ⎦⎢ ⎥. ⎣ .. ⎦ ⎢ ⎣ 18.867 ⎦ 0.377 7 26.488
This equation shows that the r -th element in the TRP is the weighted sum of the r -th column elements in the marginal bicluster reference matrix, and the weights of the column elements are given by the LFD. For example, the TRP shows that the students who belong to Rank 1 are expected to pass approximately five (4.915) items on this 35-item test. Similarly, it shows that Rank 2 students can pass approximately ten (8.764) items. The line plot in Figure 10.11 shows the TRP. In addition, the bar graph represents the LRD, which has already been shown in Fig. 10.8; however, plotting them in the same figure helps to interpret the TRP, because both show the characteristics
Fig. 10.11 Test reference profile (LDR) Expected Score
25
163 20 15
102 91
91
10
68 5 0
1
2
3
Latent Rank
4
5
522
10 Local Dependence Biclustering
of the respective ranks. It can be observed from the figure that the TRP increases monotonically. The monotonicity of the TRP is required in the LDR.9 In the LDR, students are classified into ordinal clusters (i.e., latent ranks) rather than nominal clusters (i.e., latent classes). To show that the latent scale is ordinal, it is necessary to provide evidence for ordinality. If the TRP increases monotonically, it can be said that the latent clusters are ordered in terms of the TRP (i.e., the expected NRS). For ordinality, two levels can be considered: strongly and weakly ordinal alignment conditions (SOAC and WOAC, respectively). All the marginal FRPs are monotonically increasing; the TRP as the weighted sum of the marginal FRPs is monotonically increasing; the SOAC is then fulfilled. In this analysis, all the marginal FRPs monotonically increase. Thus, the latent rank scale satisfies the SOAC. Meanwhile, the WOAC is fulfilled when the TRP is monotonic, even if all marginal FRPs are not monotonic. The latent scale can be regarded as ordinal in terms of the TRP (i.e., the expected NRS). Unless the weak condition is satisfied, further evidence is necessary to show that the latent scale is ordinal.
10.2.5 Model Fit This section describes the model-fit evaluation of the LDR. As in the other chapters, let the model applied to the data herein be the analysis model. It is compared with a better-fitting benchmark model and a worse-fitting null model. The benchmark and null models can be chosen by the analyst, and those used in this book are consistent throughout the chapters. That is, the benchmark model is an interitem local independence model with the loci being the NRS patterns, and the null model is an interitem global independence model where the number of loci is one. For the data under analysis (J35S515), the log-likelihoods of the benchmark and null models and the χ 2 value and DF of the null model are shown in Table 7.7 (p. 319). ˆ D F ), From Eq. (10.4) (p. 506), and using the local dependence parameter set ( ˜ F ), ˆ R ), and dichotomized field membership matrix ( M rank membership matrix ( M the likelihood for the analysis model (Fig. 10.4, p. 497) is given as ˆ DF ) = l A (U|
r f +1 F J S R J
u
πˆ r fs dj (1 − πˆ r f d )1−u s j
γsr f d mˆ sr m˜ j f
,
s=1 r =1 j=1 f =1 d=1
and the log-likelihood is thus obtained as ˆ DF ) = ll A (U|
f +1 F Jr S R J
γsr f d mˆ sr m˜ j f u s j ln πˆ r f d + (1 − u s j ) ln(1 − πˆ r f d ) .
s=1 r =1 j=1 f =1 d=1
9
Not the LRD but the local dependence ranklustering. Note also that the monotonicity of the TRP is not necessary in local dependence biclustering (LDB).
10.2 Outputs
523
In addition, the χ 2 value for the analysis model is calculated as χ A2 = 2(ll A − ll B ), where ll B represents the log-likelihood of the benchmark model. The number of parameters in the analysis model (N PA ) is the number of estimated ˆ D F ). For an element πr f d to be parameters in the local dependence parameter set ( estimable, the d-th PIRP of Field f at Rank r must be observed, which is recorded for each student in γsr f d of the PIRP array (). Using these γsr f d s, we consider the following Sr(Tf d) =
S J
(T ) z s j ssr m˜ j f γsr f d ,
s=1 j=1 (T ) where ssr is the (s, r )-th element of S(T ) , the smoothed rank membership matrix at the final T -th EM cycle.10 Then, Sr(Tf d) can be regarded as the number of students belonging to Rank r whose PIRP for Field f is the d-th pattern, and if it is positive (Sr(Tf d) > 0), πr f d can be estimated. Therefore, the sum of the positive numbers,
N PA =
r f +1 R F J
sgn(Sr(Tf d) ),
r =1 f =1 d=1
is the number of parameters. In this analysis, N PA is calculated as 100. In addition, the DF of the analysis model (d f A ) is obtained as d f A = N P B − N P A, where N P B is the number of parameters for the benchmark model. Various model-fit indices can be computed from the χ 2 values and DFs of the analysis and null models. They can be calculated with reference to Standardized Fit Indices (p. 321) and Information Criteria (p. 326), and they are shown in Table 10.7. Compared to Table 7.7 (p. 319), which shows the model-fit indices for local independence ranklustering with (F, R) = (5, 6) (denoted as LIR (5, 6)), it can be seen that the χ 2 value of the analysis model is greatly reduced, which indicates that the current analysis model (also denoted LDR (10, 5)) better fits the data than LIR (5, 6). However, these two models differ in terms of the number of fields (five/ten), number of ranks (six/five), and local independence/dependence; thus, the improvement in the χ 2 value is not attributable only to whether local dependence is assumed. In ˆ R ). This is because S(T ) was used to obtain Note that it is not the rank membership matrix ( M ˆ D F .
10
524
10 Local Dependence Biclustering
Table 10.7 Model-fit indices (LDR) χ 2 and d f ll B −5891.31 ll N −9862.11 ll A −6804.90 χ N2 7941.60 χ A2 1827.17 d fN 1155 d fA 1088
NFI RFI IFI TLI CFI RMSEA
Information Criterion Standardized Index −348.83 0.770 AIC −4968.60 0.756 CAIC 0.892 BIC −4966.48 0.884 0.891 0.036
addition, the DF of LDR (10, 5) is smaller than that of LIR (5, 6), which implies that the former is more wasteful (has more parameters). Moreover, as for the standardized indices (NFI–RMSEA), those for LDR (10, 5) are much better than those for LIR (5, 6) (Table 7.8, p. 325). Furthermore, the three information criteria for LDR (10, 5) are also better than those for LIR (5, 6) (cf. Table 7.9, p. 328).
10.3 Structure Learning This section explains how to determine the interfield local dependency structure for each rank. To do this, as shown in Estimation Order of LDR (p. 492), the numbers of fields and ranks must be fixed in advance by performing (usual) ranklustering. In addition, it is necessary to understand the contents of each field by field analysis (Sect. 7.4.5, p. 292) and grasp the abilities of students at each rank using rank analysis (Sect. 7.4.8, p. 300). Without this knowledge, it is almost impossible to build a meaningful local dependency structure. Based on this, a machine learning method, such as the genetic algorithm employed in the BNM (Sect. 8.5, p. 407) and LD-LRA (Sect. 9.4, p. 465), is an effective tool for identifying the structure. Exploring a number of structures and selecting one of them based on a machine learning rule is referred to as structure learning. In the author’s experience, however, the structure recommended by structure learning is almost always unsatisfactory and even unrealistic for explaining the present situation (i.e., phenomenon); thus, in most cases, the recommended structure is modified based on the knowledge and experience of the analyst and then finalized. This is because the best-fitting model in terms of a fit index is not necessarily the best applicable model for the phenomenon. Indeed, the recommended structure is often not even used as a hint for the optimal structure. As a result, the labor and time required to determine the final structure are not very different, whether first recommended by machine learning and modified and finalized based on the knowledge and expe-
10.3 Structure Learning
525
rience of the analyst or based on the knowledge and experience from the beginning. If the number of fields is ten or less, it is not impossible to create a meaningful and accountable structure based on experience and knowledge, for which some tips are provided in this section. Tips for Determining Structure (LDR) 1. The fields are topologically sorted. 2. A slightly higher number of parents is first specified for each field. 3. The parents are then selected to make the field PIRP profile rise. First, regarding Point 1, to topologically sort the fields ( Topological Sorting , p. 371) is effective in greatly reducing the number of candidate graphs to be explored and ensuring that each is a DAG. In this analysis, the 35 items were classified into ten fields, as shown in Table 10.1 (p. 496); the field labels were topologically sorted; and then in the graph at each rank a single edge from a larger labeled field to a smaller labeled one was disallowed. Under such a rule, the parent label(s) of each field are smaller than the field label, and the graph thus created is necessarily a DAG regardless of the number of edges and their positions, because no cycle can be formed in the graph (see also Fig. 8.10, p. 371). Next, Point 2 recommends that all the related fields of a field be specified as the parents of the field if they have a smaller label than that of the field. In this chapter, Plan (C), which assumes that the CRR depends on the NRS in the parent field(s), was adopted. This plan was the most restrictive constraint among Plans (A), (B), and (C), as otherwise the number of parameters would be impractically large. Under Plan (C), even when a field has three parents, if each parent contains only two items,11 the number of parameters specified for the field is not more than 7 (0–6), whereas they are 64 (= 26 ) and 27 (= 33 ) under Plans (A) and (B), respectively. Therefore, even when a slightly higher number of parents is set for each field, the number of parameters for the field is not too large for them to be estimated in practice. Then, as described in Point 3, the parents of each field are gradually pruned such that the field PIRP profile rises rightward. This profile plots the CCRRs on the vertical axis and the NRS in the parent field(s) on the horizontal axis (Fig. 10.6, p. 512); thus, it is natural that this profile grows because the CRR for a field item should increase as the NRS in the parent(s) of the field increases. If a field has many parents, the field PIRP profile may be uneven and not slope upward from the beginning. However, in the process of pruning unnecessary parents and reserving a small number of truly necessary parents, the field PIRP profile would slope increasingly upward. Analyst can determine the extent of pruning based on their experience and knowledge by referring to the model-fit indices. The simplest way to identify redundant parents is to refer to the (polychoric) correlation matrix (Sect. 2.6.2, p. 45) calculated from the U F (Eq. (10.2), p. 502).
11
Note also that a larger number of fields helps to reduce the number of items per field.
526
10 Local Dependence Biclustering
The data record the NRS for each field. If the ( f, g)-th element of the correlation matrix is large, Fields f and g are strongly correlated. Additionally, if f < g, it is highly probable that Field f is a parent of Field g.
10.4 Chapter Summary This chapter introduced LDB and LDR, which incorporate the BNM into biclustering and ranklustering, respectively. The former classifies students into nominal clusters (i.e., latent classes), whereas the latter classifies them into ordinal clusters (i.e., latent ranks). This chapter mainly focused on the LDR, which assumes an interfield local dependency structure at each rank. When all the fields are locally independent at every rank, the LDR is identical to the usual ranklustering. This method also fulfills Conditions (A)–(D) for a good test data analysis as described in Sect. 1.5.2 (p. 10). That is, the local dependence DAG structure shows a map of the test (A); the LRE and rank-up odds of each student show the position of the student (B); the RRV (Fig. 10.10, p. 520) shows the next path (i.e., next field to learn) (C); and the field PIRP profiles and marginal FRPs show the features of the paths (i.e., fields) (D). However, it is difficult to determine the structure at each rank. A machine learning method does not always yield a satisfactory structure; thus, this chapter recommends a heuristic method that is slow but steady.
Chapter 11
Bicluster Network Model
This chapter introduces the bicluster network model (BINET; Shojima, 2019), a combination of the Bayesian network model (BNM; Chap. 8, p. 349) and biclustering (Chap. 7, p. 259). This model also closely resembles local dependence biclustering (LDB; Chap. 10, p. 491). It is recommended that readers first read Chaps. 8 and 10 before going any further in this chapter. Biclustering is a method of clustering students into latent classes and clustering items into latent fields, where both latent classes and fields are nominal clusters. Ranklustering (Sect. 7.3, p. 274), meanwhile, is a method of clustering students into latent ranks and classifying items into latent fields. In addition, the BNM is a method used to identify the (global) dependency structure between items. Moreover, LDB is used to examine the local dependency structure between fields. An example of the results obtained by BINET is shown in Fig. 11.1. The large squares represent the latent classes, and the number of classes (C) is 13 in the figure. The classes are not ordinal but nominal clusters, but in general, the low-ability group is labeled as Class 1, and a high-ability class is labeled by a larger number. In Fig. 11.1, Class 13 is the most advanced ability group. In addition, the smaller boxes shaped like diamonds express latent fields, and the number of fields (F) is 12. The fields are also unordered nominal clusters, but the easiest item group is generally labeled as Field 1. In this figure, Field 12 is assumed as the most difficult item cluster. Furthermore, the figure indicates that if a student wants to level up from Class 1 to 2, the student needs to train on Field 1 items. Similarly, for a student in Class 2, the diagram recommends improving Field 2 to move to Class 4 or to improve Field 5 to move to Class 3. Each class is characterized by a different proficiency pattern for the respective 12 fields.1 Thus, as a student significantly advances in any field(s), he or she moves to a higher class and can finally, after clearing all the fields, reach Class C. 1
For instance, Class 13 is characterized as 111111111111 (12 1s) because the class masters all the 12 fields. © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 K. Shojima, Test Data Engineering, Behaviormetrics: Quantitative Approaches to Human Behavior 13, https://doi.org/10.1007/978-981-16-9986-3_11
527
528
11 Bicluster Network Model
Fig. 11.1 Bicluster network model
Some fields may appear multiple times on the same graph. For example, Field 5 appears on the edges of Class 2 → 3 and Class 6 → 9. This is because progress of student is not at the level of mastery. Class 2 students need to improve (not master) Field 5 items to move up to Class 3, but Class 5 students are required to master the field items to move to Class 9.
11.1 Difference Between BINET and LDB The BINET is very similar to the LDB and LDR. The most significant difference, nevertheless, is that in LDB, the vertices represent the fields, whereas in BINET, they represent the classes. Moreover, the LDB forms a DAG at each rank, whereas BINET forms a single graph, although this difference stems from the way the results
11.1 Difference Between BINET and LDB
529
Fig. 11.2 Latent field DAGs
are expressed. In fact, Fig. 11.1 layers over the 12 DAGs obtained at the respective 12 fields, as shown in Fig. 11.2, and shows them as one integrated graph. For example, the graph in Field 1 in Fig. 11.2 contains only an edge from Class 1 → 2; thus, in Fig. 11.1, an edge from Class 1 → 2 is named Field 1. Similarly, the graph at Field 2 in Fig. 11.2 has two edges, Class 2 → 4 and Class 3 → 5; thus, in Fig. 11.1, the edges named Field 2 are placed at the corresponding positions. Moreover, the graph in Field 11 in Fig. 11.2 is empty (edgeless); in other words, the classes are locally independent in the field. For this reason, Fig. 11.1 has no edges labeled as Field 11. Simply put, BINET is an analysis that performs the BNM in each field and then expresses (integrates) DAGs as a single-layered graph. Integrating them into a single graph is not indispensable but desirable in terms of interpretability and information portability. Difference Between BINET and LDB LDB (LDR): The model exploring the local dependency structure among latent fields at each latent class (rank), where each class (rank) is a locus. BINET: The model exploring the local dependency structure among latent classes at each latent field, where each field is a locus. As shown in Difference between BINET and LDB , the roles of the classes and fields in the LDB are swapped in the BINET. The LDB focuses on discovering an effective learning sequence of the fields for each class, but the findings at each class
530
11 Bicluster Network Model
Fig. 11.3 Two images of biclustering
can only be applicable to the class. Meanwhile, the BINET can directly provide a roadmap of which field(s) should be learned to move up to a higher class. In other words, it can be said that BINET focuses on students’ interclass transition. Let us now consider the differences among BINET, LDB, and biclustering. First, recall that biclustering is the same as local independence biclustering; thus, this model can be represented as Fig. 11.3 (top), which is an example of (F, C) = (6, 5). The figure implies that each latent class is the “locus” and shows that the six fields are “locally” independent at each locus. However, the figure can be viewed differently, as shown at the bottom of the figure, where the five classes are locally independent regarding each latent field as the locus. For this reason, both of the following are
11.1 Difference Between BINET and LDB
531
Table 11.1 Difference among Biclustering, LDB, and BINET Model Biclustering LDB BINET
Locus class/field class field
Node field/class field class
Interfield independent dependent independent
Interclass independent independent dependent
true for this model: (Usual) biclustering is a model that assumes local independence between fields/classes when classes/fields are regarded as loci. The differences among biclustering, LDB, and BINET are summarized in Table 11.1. Biclustering/LDB is the local independence/dependence model between fields under the condition that each class is a locus. This difference between the two models is explained in Chap. 10 (p. 491). Meanwhile, biculstering can be considered the local independence model between classes when each field is regarded as a locus, as described above. Thus, the difference between biclustering and BINET can be described as the local independence/dependence model between classes, given that each field is a locus.
11.2 Parameter Learning This section describes the parameter learning in BINET. The data to be analyzed are J35S515, with the sample size and the number of items as 515 and 35, respectively. Parameter learning, which estimates the CCRRs induced by the structure (i.e., graph), is performed, given that the following conditions: (1) The number of classes and fields (C and F, respectively) and (2) the structure (i.e., DAG at each field) have already been determined. ˆ C) ˆ = (12, 13), as shown in Fig. 11.4, which are the results obtained For (1), let ( F, by the infinite relational model (IRM; Sect. 7.8, p. 336). The reason IRM was employed instead of conventional biclustering is that biclustering may produce fields containing only one item, whereas IRM can easily avoid generating such fields. More specifically, in the IRM estimation process, the fields of items are settled one by one, where each item is classified stochastically into an existing field or a new field. If an item is classified into a new field, a field with one item is created, but if the probability of being classified in a new field is set to zero, the fields with one item will eventually disappear as the estimation cycles progress. There must not be a field with only one item in the BINET that regards each field as a locus. It is impossible for such a locus (i.e., field) with one item to assume multiple CCRRs (i.e., parameters) because such a field has only one PIRP. For more on this, refer to the following box.
532
11 Bicluster Network Model
Original Data
∗
IRM (F, C) = (12, 13)∗
The leftmost column is Field 1 and the topmost row is Rank 1.
Fig. 11.4 Array plot of sorted data by infinite relational model
A Field with Only One Item The situation where there is a field with only one item is equivalent to the situation in LDB where there is a class with only one student. Suppose that an LDB model has the structure of Field f → g at Class c and many students belong to the class. Then, based on Plan (C), k + 1 PIRPs (NRSs) are observed if Field f has k items, and k + 1 parameters are then induced by the structure. However, if Class c has only one student; only one PIRP, the student’s NRS, will be observed; the only CCRR corresponding to the PIRP will then be specified for Field g. This also implies that the structure of Field f → g is dispensable from the beginning because the CCRR of a Field g item is invariant with respect to the status in Field f (there is only one status in Field f ). Consequently, the result under this case is equivalent to the case assuming Fields f and g are locally independent.
11.2 Parameter Learning
533
In addition, with the IRM, it is easy to create a class consisting of zero scorers only or a class of full scorers only, which is easily achieved by forcibly allotting Class 1 (C) when sampling a class of each zero (full) scorer. In the IRM analysis of Fig. 11.4, two students with zero marks were classified as Class 1 and 15 students with full marks as Class 13. Furthermore, as for (2), structure learning refers to exploring the candidates for the optimal structure and choosing one of them. Suppose that the structure shown in Fig. 11.1 was selected by a structure learning method. Note that the local dependency structure between the latent classes in each field must be a DAG.
11.2.1 Binary Field Membership Matrix The optimal number of fields has been settled by the IRM as F = 12, which also implies that the field into which each item is classified has been fixed. Thus, the ˜ F = {m˜ j f } (J × F) can be created, where binary field membership matrix M m˜ j f =
1, if Item j is classified in Field f . 0, otherwise
Table 11.2 shows the item classification results by IRM. Each field had two or more items, and Field 6 had the most items. In addition, for example, Items 11, 21, and 22 are classified in Field 2; thus, the memberships of the items are given as
Table 11.2 Item classification result by IRM Field 1 2 3 4 5 6 7 8 9 10 11 12 ∗N
Jf ∗ 3 3 2 3 3 5 4 2 2 2 2 4
Item 1, 31, 32 11, 21, 22 23, 24 25, 26, 27 2, 3, 4 7, 8, 9, 10, 33 12, 13, 16, 17 28, 29 5, 6 34, 35 14, 15 18, 19, 20, 30
of items in field f or LFD (latent field distribution)
534
11 Bicluster Network Model
m˜ 11,2 = m˜ 21,2 = m˜ 22,2 = 1, m˜ 11, f = m˜ 21, f = m˜ 22, f = 0
( f = 2).
Moreover, the fields are labeled in the decreasing order of the average CRR of the items in each field. More specifically, let p j be the CRR of Item j; the average CRR of Field f items, p¯ f , is given by J j=1
p¯ f = J
m˜ j f p j
j=1
m˜ j f
.
Then, the following inequality holds among these 12 p¯ f s: p¯ 1 > p¯ 2 > · · · > p¯ 12 Accordingly, Fields 1 and 12 are the easiest and hardest item groups, respectively.
11.2.2 Class Membership Matrix ˜ C = {m˜ sc ∈ {0, 1}}, S × The IRM also estimated the class membership matrix ( M C), which indicates the class to which each student belongs. The membership of Student s to Class c is represented by m˜ sc , and it is a binary variable coded 1 if the student belongs to the class. Note that these binary class memberships are the student clustering results, not by the BINET, but by the IRM, and these are, thus, indispensable to be reestimated under BINET later, although the number of classes is accepted as C = 13, following the IRM biclustering result (Fig. 11.4). Let m sc represent the membership of Student s to Class c; then, ⎤ m s1 ⎥ ⎢ ms = ⎣ ... ⎦ (C × 1) ⎡
m sC is called the class membership profile of Student s. The sum of this vector is 1. A matrix collecting the class membership profiles of all students, ⎤ m 11 · · · m 1C ⎥ ⎢ M C = ⎣ ... . . . ... ⎦ (S × C), m S1 · · · m SC ⎡
is the class membership matrix.
11.2 Parameter Learning
535
11.2.3 Parent Student Response Pattern Suppose that the interclass local dependence structure in each field (i.e., locus) is fixed as Fig. 11.2 (p. 529). The graph in each field is a DAG because, for every edge, the label of the out-class (predecessor) is smaller than that of the in-class (i.e., successor). This condition also implies that the class labels are topologically sorted. Under this condition, a graph is always a DAG, regardless of the number of edges the graph has and where they are placed. Generally, when C classes are topologically sorted, the joint probability (JP) of the C classes in Field f (i.e., the data of all the students for Field f items) can be factored as P f (1, . . . , C) = P f (C|pa f (C)) × P f (C − 1|pa f (C − 1)) × · · · × Pc (1) =
C
P f (c|pa f (c)),
c=1
where pa f (c) is the parent class set of Class c in Field f . Note that pa f (c) = ∅ if P f (c|pa f (c)) = P f (c). Table 11.3 shows the factorization induced from Fig. 11.2 (or Fig. 11.1) for each field. Because the graph in Field 11 is empty (i.e., edgeless),2 all classes of the field are locally independent, as shown in the eleventh row of the table. Next, the manner of specifying the parameter set for P f (c|pa f (c)) is described, which depends on the definition of the PIRP. The PIRP was created in the BNM and LD-LRA based on Plan (A),3 while in LDB, it was done based on Plan (C). In the BINET, first, rather than calling it a parent item response pattern (PIRP), calling it a parent student response pattern (PSRP) is more accurate. This is because each field is the locus in the BINET, and those who belong to the parent(s) of each class are the students. Let us consider the PSRP based on Plan (C) in this section. Suppose that there exists a structure of (Classes d → c ← e) in Field f . That is, Class c has two parents (i.e., Classes d and e). It is also assumed that Classes d and e have seven and six students, respectively. If Plan (A) is adopted, there are 8192 (= 26+7 ) PSRPs as follows: Class d
Class e
Class d
Class e
Class d
Class e
Class d
Class e
Plan (A) 0000000 000000, 0000000 000001, 0000000 000010, · · · , 1111111 111111,
and then, corresponding to these 13-digit patterns, 8192 parameters are set as π f c1 , π f c2 , π f c3 , . . . , π f c8192 ,
2 3
An empty graph is also a DAG. For Plans (A), (B), and (C), refer to Sect. 10.1.3 (p. 497).
536
11 Bicluster Network Model
Table 11.3 Joint probability factorization induced by Fig. 11.1
Field Factorization 1 P1 (1, · · · , 13)= P1 (13) P1 (12) P1 (11) P1 (10) P1 (9) P1 (8) P1 (7) P1 (6) P1 (5) P1 (4) P1 (3) P1 (2|1) P1 (1) 2 P2 (1, · · · , 13)= P2 (13) P2 (12) P2 (11) P2 (10) P2 (9) P2 (8) P2 (7) P2 (6) P2 (5|3) P2 (4|2) P2 (3) P2 (2) P2 (1) 3 P3 (1, · · · , 13)= P3 (13) P3 (12) P3 (11) P3 (10) P3 (9) P3 (8) P3 (7) P3 (6) P3 (5|4) P3 (4) P3 (3) P3 (2) P3 (1) 4 P4 (1, · · · , 13)= P4 (13) P4 (12) P4 (11|7) P4 (10) P4 (9) P4 (8) P4 (7) P4 (6|5) P4 (5 P4 (4) P4 (3) P4 (2) P4 (1) 5 P5 (1, · · · , 13)= P5 (13) P5 (12|8, 10) P5 (11) P5 (10) P5 (9|6) P5 (8) P5 (7|4) P5 (6) P5 (5) P5 (4) P5 (3|2) P5 (2) P5 (1) 6 P6 (1, · · · , 13)= P6 (13) P6 (12) P6 (11) P6 (10) P6 (9) P6 (8) P6 (7) P6 (6) P6 (5) P6 (4) P6 (3) P6 (2) P6 (1) 7 P7 (1, · · · , 13)= P7 (13) P7 (12) P7 (11) P7 (10|6) P7 (9) P7 (8) P7 (7) P7 (6) P7 (5) P7 (4) P7 (3) P7 (2) P7 (1) 8 P8 (1, · · · , 13)= P8 (13) P8 (12|11) P8 (11|9) P8 (10) P8 (9) P8 (8|6) P8 (7) P8 (6) P8 (5) P8 (4) P8 (3) P8 (2) P8 (1) 9 P9 (1, · · · , 13)= P9 (13) P9 (12|8) P9 (11|9) P9 (10) P9 (9) P9 (8) P9 (7) P9 (6) P9 (5) P9 (4) P9 (3) P9 (2) P9 (1) 10 P10 (1, · · · , 13)= P10 (13) P10 (12) P10 (11) P10 (10) P10 (9) P10 (8) P10 (7) P10 (6) P10 (5) P10 (4) P10 (3) P10 (2) P10 (1) 11 P11 (1, · · · , 13)= P11 (13) P11 (12) P11 (11) P11 (10) P11 (9) P11 (8) P11 (7) P11 (6) P11 (5) P11 (4) P11 (3) P11 (2) P11 (1) 12 P12 (1, · · · , 13)= P12 (13|12) P12 (12) P12 (11) P12 (10) P12 (9) P12 (8) P12 (7) P12 (6) P12 (5) P12 (4) P12 (3) P12 (2) P12 (1)
where π f cd represents the passing student rate (PSR4 ) of Class c for a Field f item, in which the PSRP (response pattern by the students belonging to the parent classes of Class c) is the d-th pattern (see also PIRP under LDB and PSRP under BINET ). However, not all the 8192 parameters are estimated because all 8192 PSRPs will never be observed. Recall that the number of items classified in Field f is likely to be ≤ 10. Even if Field f has ten items, the maximum number of response patterns (i.e., PSRPs) by 13 students belonging to the parents of Class c (i.e., Classes d and e) is ten (i.e., the number of items of Field f ), which is far from 8192. In addition, the ten items in Field f usually have (observe) ten different 13-digit binary patterns among the 8192 PSRPs. Accordingly, out of the designated 8192 parameters, the practical number of parameters to be estimated is ten. 4
The PSR is essentially the same as the CRR, but the former is an expression that emphasizes the viewpoint from an item more.
11.2 Parameter Learning
537
PIRP Under LDB and PSRP Under BINET In the LDB (LDR) where each class (rank) is the locus, and each student is the respondent or subject, πc f d (πr f d ) denotes the parameter at the locus of Class c (Rank r ), with the node of f (i.e., Field f ), and for the d-th PIRP. In other words, πc f d is the conditional CRR for a Field f item of the Class c students whose PIRP is the d-th pattern. Conversely, as the BINET views each field as the locus and each item as the subject (or respondent), π f cd represents the parameter at the locus of Field f , with the node of c (i.e., Class c), and for the d-th PSRP. That is, π f cd is the conditional PSR for the Class c students of a Field f item whose PSRP is the d-th. In this example, the number of students in Classes d and e is 13, which is small in reality, and it is not unusual for the number of students in a class to be ≥ 100 when the sample size is large. If the two classes have 200 students in total, the number of PSRPs in the above example will be 2200 (> 1060 ). Even if Field f has ten items, it is hopeless that any two of the ten 2200 -digit binary patterns are the same. The larger the sample size, the lower the possibility that an item pair has the same student response pattern. Accordingly, the number of PSRPs is then ten, and the corresponding ten parameters are specified and estimated. Next, let us consider the total passing students (TPS) in each class as PSRP. If Classes d and e have seven and six students, respectively, the classes can then have eight (0–7) and seven (0–6) TPS patterns, respectively. Accordingly, the PSRPs are, thus, the following 56 (= 8 × 7) two-digit patterns: Plan (B) 00, 01, 02, 03, 04, 05, 10, 11, 12, . . . , 87. Corresponding to these 56 PSRPs, the 56 parameters are designated as follows: π f c1 , π f c2 , π f c3 , . . . , π f c56 . Compared with Plan (A), the number of parameters can be reduced from 8192 to 56, but the number of items in Field f (J f ) is usually less than 56; thus, it is pointless to prepare more than J f parameters. Moreover, when there are 100 students in each of Classes d and e, it is almost impossible for any two items to share the same TPS pattern. Therefore, in this case as well, one parameter is allocated for one item because the number of PSRPs (and parameters) is equal to the number of items. Alternatively, we consider the number of TPS patterns in all parent classes. If Classes d and e have seven and six students, respectively, the number of students in the classes is 13 in total; thus, the number of TPS patterns is 14 (0–13). In this case, PSRPs are Plan(C) 0, 1, 2, 3, . . . , 13, and the 14 parameters are induced as follows:
538
11 Bicluster Network Model
π f c1 , π f c2 , π f c3 , . . . , π f c14 . However, the number of items in Field f is ≤ 10; thus, the number of observed PSRP patterns is ≤ 10. In addition, when the number of students in Classes d and e is large (e.g., 100) in total, it hardly occurs for any two items to have the same TPS pattern (among the 101 patterns). Consequently, the number of parameters to be estimated reduces to the number of items in the field (J f ). After verifying each of the three plans, it is found that a parameter is allotted (estimated) for each item in Field f because an individual item usually has its own unique PSRP, and any two of them generally do not share the same PSRP. Thus, π f cd is rewritable as π f cj (s.t. Item j ∈ Field f ). Meaning of π f cj (1) This means that the PSR of Class c students for each Field f item differs. (2) This means that the PSRs of Class c students for Items j and k (∈ Field f ) are different if the TPSs of the pa f (c) students for the two items are different. (3) This means that the PSRs of Class c students for Items j and k (∈ Field f ) are different if the PSRs of the pa f (c) students for the two items are different. (4) This means that the PSR for Item j (∈ Field f ) of the Class c students is different from that of the pa f (c) students. As concluded above, π f cj can be simply interpreted as (1) in the above box, and (1) was originally expressed as (2) based on Plan (C) and deduced from it because, in general, there are no two items for which the TPSs are the same. In addition, (2) and (3) have almost the same meaning because the PSR is simply a TPS divided by the number of students belonging to the parent class(es). Consequently, (4) can be inferred from (1) and (3). It concludes that the PSRs of the students belonging to a class and the students belonging to the class’s parent(s) for the same item can be verified by each parameter. Of these four different expressions of the same meaning, (4) is the most significant in BINET.
11.2.4 Local Dependence Parameter Set Now, let us see how the parameters are specified from the graph by taking Field 2 as an example. From Table 11.2 (p. 533), the field consists of three items (Items 11, 21, and 22), and from Fig. 11.2 (p. 529), it contains two edges of Class 2 → 4 and Class 3 → 5. Let the parameter set for the field be denoted by DC2 5 and is specified as follows: 5
The subscript DC means “(local) dependence among classes.”
11.2 Parameter Learning
539
⎤ π2,1,1 ⎥ ⎢ π2,2,1 ⎥ ⎢ ⎥ ⎢ π2,3,1 ⎥ ⎢ ⎢ π2,4,1 π2,4,2 π2,4,3 π2,4,2 ⎥ ⎥ ⎢ ⎥ ⎢ π2,5,1 π2,5,2 π2,5,3 ⎥ ⎢ ⎥ ⎢ π2,6,1 ⎥ ⎢ ⎥ ⎢ = ⎢ π2,7,1 ⎥ ⎥ ⎢ π2,8,1 ⎥ ⎢ ⎥ ⎢ π2,9,1 ⎥ ⎢ ⎥ ⎢ π2,10,1 ⎥ ⎢ ⎥ ⎢ π2,11,1 ⎥ ⎢ ⎦ ⎣ π2,12,1 π2,13,1 ⎡
DC2
(13 × 4).
The c-th row of this matrix contains the parameters for Class c. First, for the local independent classes (i.e., Classes 1, 6, 7, 8, 9, 10, 11, 12, and 13) and the source classes (i.e., Classes 2 and 3), these 11 classes have only one parameter. For example, Class 1 has only π2,1,1 . This implies that the three Field 2 items (i.e., Items 11, 21, and 22) have the same CRR for Class 1 students. Meanwhile, Classes 4 and 5 each have three parameters. The three parameters correspond to three items (i.e., Items 11, 21, and 22). The size of this matrix is (13 × 4). The number of rows (13) is the number of classes (C), while the number of columns (4) is derived from the largest number of columns among DC1 , . . ., DC12 . Likewise, the parameter sets of all the other fields are given as DC4 (Items 25–27) ⎤ ⎡ DC1 (Items 1, 31, 32) ⎤ ⎡ DC3 (Items 23, 24) ⎤ ⎡ π3,1,1 π4,1,1 π1,1,1 ⎢π1,2,1 π1,2,2 π1,2,3 π1,2,32 ⎥ ⎢π3,2,1 ⎥ ⎢π4,2,1 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢π ⎥ ⎢π ⎥ ⎢π ⎥ ⎢ 1,3,1 ⎥ ⎢ 3,3,1 ⎥ ⎢ 4,3,1 ⎥ ⎢π ⎥ ⎢π ⎥ ⎢π ⎥ ⎢ 1,4,1 ⎥ ⎢ 3,4,1 ⎥ ⎢ 4,4,1 ⎥ ⎢π ⎥ ⎢π ⎥ ⎢ ⎥ ⎢ 1,5,1 ⎥ ⎢ 3,5,1 π3,5,2 π3,5,3 π3,5,4 ⎥ ⎢π4,5,1 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ π π π π π π ⎢ 1,6 ⎥ ⎢ 3,6,1 ⎥ ⎢ 4,6,1 4,6,2 4,6,3 3,5,4 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢π1,7,1 ⎥ ,⎢π3,7,1 ⎥ ,⎢π4,7,1 ⎥, ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢π1,8,1 ⎥ ⎢π3,8,1 ⎥ ⎢π4,8,1 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢π1,9,1 ⎥ ⎢π3,9,1 ⎥ ⎢π4,9,1 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢π1,10,1 ⎥ ⎢π3,10,1 ⎥ ⎢π4,10,1 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢π1,11,1 ⎥ ⎢π3,11,1 ⎥ ⎢π4,11,1 π4,11,2 π4,11,3 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎣π1,12,1 ⎦ ⎣π3,12,1 ⎦ ⎣π4,12,1 ⎦ π1,13,1 π3,13,1 π4,13,1
540
11 Bicluster Network Model
DC5 (Items 2–4) ⎡ ⎤ ⎡ DC6 (Items 7–10, 33) ⎤ ⎡ DC7 (Items 12, 13, 16, 17) ⎤ π6,1,1 π3,5,2 π3,5,3 π3,5,4 π7,1,1 π5,1,1 ⎢π ⎥ ⎢π6,2,1 ⎥ ⎢π7,2,1 ⎥ ⎢ 5,2,1 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢π ⎥ ⎢ ⎥ ⎢π ⎥ ⎢ 5,3,1 π5,3,2 π5,3,3 π3,5,2 ⎥ ⎢π6,3,1 ⎥ ⎢ 7,3,1 ⎥ ⎢π ⎥ ⎢π ⎥ ⎢π ⎥ ⎢ 5,4,1 ⎥ ⎢ 6,4,1 ⎥ ⎢ 7,4,1 ⎥ ⎢π ⎥ ⎢π ⎥ ⎢π ⎥ ⎢ 5,5,1 ⎥ ⎢ 6,5,1 ⎥ ⎢ 7,5,1 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ π π π ⎢ 5,6,1 ⎥ ⎢ 6,6,1 ⎥ ⎢ 7,6,1 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢π5,7,1 π5,7,2 π5,7,3 ⎥ ,⎢π6,7,1 ⎥ ,⎢π7,7,1 ⎥, ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢π5,8,1 ⎥ ⎢π6,8,1 ⎥ ⎢π7,8,1 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢π5,9,1 π5,9,2 π5,9,3 ⎥ ⎢π6,9,1 ⎥ ⎢π7,9,1 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢π5,10,1 ⎥ ⎢π6,10,1 ⎥ ⎢π7,10,1 π7,10,2 π7,10,3 π7,10,4 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢π5,11,1 ⎥ ⎢π6,11,1 ⎥ ⎢π7,11,1 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎣π5,12,1 π5,12,2 π5,12,3 ⎦ ⎣π6,12,1 ⎦ ⎣π7,12,1 ⎦ π5,13,1 π6,13,1 π7,13,1
DC8 (Items 28, 29) ⎤ ⎡ DC9 (Items 5, 6) DC10 (Items 34, 35) ⎤ ⎡ ⎤ π9,1,1 π10,1,1 π13,5,2 π13,5,2 π13,5,2 π8,1,1 ⎢π8,2,1 ⎥ ⎢π9,2,1 ⎥ ⎢π10,2,1 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢π ⎥ ⎢π ⎥ ⎢π ⎥ ⎢ 8,3,1 ⎥ ⎢ 9,3,1 ⎥ ⎢ 10,3,1 ⎥ ⎢π ⎥ ⎢π ⎥ ⎢π ⎥ ⎢ 8,4,1 ⎥ ⎢ 9,4,1 ⎥ ⎢ 10,4,1 ⎥ ⎢π ⎥ ⎢π ⎥ ⎢π ⎥ ⎢ 8,5,1 ⎥ ⎢ 9,5,1 ⎥ ⎢ 10,5,1 ⎥ ⎢π ⎥ ⎢π ⎥ ⎢π ⎥ ⎢ 8,6,1 ⎥ ⎢ 9,6,1 ⎥ ⎢ 10,6,1 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢π8,7,1 ⎥ ,⎢π9,7,1 ⎥ ,⎢π10,7,1 ⎥, ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢π8,8,1 π8,8,2 π3,5,3 π3,5,4 ⎥ ⎢π9,8,1 ⎥ ⎢π10,8,1 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢π8,9,1 ⎥ ⎢π9,9,1 ⎥ ⎢π10,9,1 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢π8,10,1 ⎥ ⎢π9,10,1 ⎥ ⎢π10,10,1 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢π8,11,1 ⎥ ⎢π9,11,1 π9,11,2 π3,5,3 π3,5,4 ⎥ ⎢π10,11,1 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎣π8,12,1 ⎦ ⎣π9,12,1 π9,12,2 ⎦ ⎣π10,12,1 ⎦ π8,13,1 π9,13,1 π10,13,1 ⎡
⎡ DC11 (Items 14, 15) ⎤ ⎡ DC12 (Items 18–20, 30) ⎤ π12,1,1 π11,1,1 π13,5,2 π13,5,2 π13,5,2 ⎢π11,2,1 ⎥ ⎢π12,2,1 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢π11,3,1 ⎥ ⎢π12,3,1 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢π11,4,1 ⎥ ⎢π12,4,1 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢π11,5,1 ⎥ ⎢π12,5,1 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢π11,6,1 ⎥ ⎢π12,6,1 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢π11,7,1 ⎥ ,⎢π12,7,1 ⎥. ⎢ ⎥ ⎢ ⎥ ⎢π11,8,1 ⎥ ⎢π12,8,1 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢π11,9,1 ⎥ ⎢π12,9,1 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢π11,10,1 ⎥ ⎢π12,10,1 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢π11,11,1 ⎥ ⎢π12,11,1 ⎥ ⎢ ⎥ ⎢ ⎥ ⎣π11,12,1 ⎦ ⎣π12,12,1 ⎦ π11,13,1 π12,13,1 π12,13,2 π12,13,3 π12,13,4
11.2 Parameter Learning
541
The size of all matrices is (13 × 4). If the c-th row in DC f contains only one parameter, it indicates that Class c is a locally independent class (or source) in Field f . For example, every row in DC6 has only one parameter because all 13 classes are locally independent in Field 6. However, if Class c is locally dependent in Field f , the c-th row of DC f has the same number of parameters as the number of items in Field f (J f ). For example, the third row of DC1 has three parameters, which correspond to the field items (i.e., Items 1, 31, and 32). A three-dimensional array collecting the parameter sets of all the 12 fields, DC = { DC1 , DC2 , . . . , DC12 }
(F × C × Dmax ),
is the local dependence parameter set in the BINET, where Dmax is the number of columns of DC1 , . . ., DC12 , which is four here. Thus, the size of this parameter set is (12 × 13 × 4) for the analysis. Note that this array is sparse.
11.2.5 Parameter Selection Array Next, we introduce the parameter selection array = {γ j f cd } (J × F × C × Dmax ). This is a four-dimensional array that determines which parameter each single datum is used for the estimation. For example, as shown in Fig. 11.2 (p. 529), Classes 4 and 5 are locally dependent in Field 2 (i.e., Items 11, 21, and 22). Thus, the respective arrays with size (C × Dmax ) for the three Field 2 items are extracted from the parameter selection array as follows: ⎡
11,2
⎡ ⎡ ⎤ ⎤ ⎤ 1000 1000 1000 ⎢ 1000 ⎥ ⎢ 1000 ⎥ ⎢ 1000 ⎥ ⎢ ⎢ ⎢ ⎥ ⎥ ⎥ ⎢ 1000 ⎥ ⎢ 1000 ⎥ ⎢ 1000 ⎥ ⎢ ⎢ ⎢ ⎥ ⎥ ⎥ ⎢ 1000 ⎥ ⎢ 0100 ⎥ ⎢ 0010 ⎥ ⎢ ⎢ ⎢ ⎥ ⎥ ⎥ ⎢ 1000 ⎥ ⎢ 0100 ⎥ ⎢ 0010 ⎥ ⎢ ⎢ ⎢ ⎥ ⎥ ⎥ ⎢ 1000 ⎥ ⎢ 1000 ⎥ ⎢ 1000 ⎥ ⎢ ⎢ ⎢ ⎥ ⎥ ⎥ ⎢ ⎢ ⎥ ⎥ ⎥ =⎢ ⎢ 1000 ⎥ , 21,2 = ⎢ 1000 ⎥ , 22,2 = ⎢ 1000 ⎥ . ⎢ 1000 ⎥ ⎢ 1000 ⎥ ⎢ 1000 ⎥ ⎢ ⎢ ⎢ ⎥ ⎥ ⎥ ⎢ 1000 ⎥ ⎢ 1000 ⎥ ⎢ 1000 ⎥ ⎢ ⎢ ⎢ ⎥ ⎥ ⎥ ⎢ 1000 ⎥ ⎢ 1000 ⎥ ⎢ 1000 ⎥ ⎢ ⎢ ⎢ ⎥ ⎥ ⎥ ⎢ 1000 ⎥ ⎢ 1000 ⎥ ⎢ 1000 ⎥ ⎢ ⎢ ⎢ ⎥ ⎥ ⎥ ⎣ 1000 ⎦ ⎣ 1000 ⎦ ⎣ 1000 ⎦ 1000 1000 1000
For example, Item 22 data are used to estimate the following parameters:
(11.1)
542
11 Bicluster Network Model
⎤ π2,1,1 ⎥ ⎢ π2,2,1 ⎥ ⎢ ⎥ ⎢ π2,3,1 ⎥ ⎢ ⎢ π2,4,1 π2,4,3 π2,4,2 ⎥ ⎥ ⎢ ⎥ ⎢ π2,5,1 π2,5,3 ⎥ ⎢ ⎥ ⎢ π2,6,1 ⎥ ⎢ ⎥. ⎢ = ⎢ π2,7,1 ⎥ ⎥ ⎢ π2,8,1 ⎥ ⎢ ⎥ ⎢ π2,9,1 ⎥ ⎢ ⎥ ⎢ π2,10,1 ⎥ ⎢ ⎥ ⎢ π2,11,1 ⎥ ⎢ ⎦ ⎣ π2,12,1 π2,13,1 ⎡
DC2 22,2
Note that because Items 11, 21, and 22 are classified in Field 2, the partial arrays for the items in the other fields are null arrays (i.e., zero arrays), as follows: 11, f = 21, f = 22, f = O C×Dmax ( f = 2),
(11.2)
where O m×n represents a matrix of size (m × n), all of whose elements are 0. The parameter selection array is, thus, a four-dimensional array collecting such J F matrices, each of which size is C × Dmax . That is, ⎧ ⎫ ⎪ ⎨ 1,1 . . . 1,F ⎪ ⎬ .. . . .. = . . ⎪ = {γ j f cd } (J × F × C × Dmax ). ⎪ . ⎩ ⎭ J,1 . . . J,F
11.2.6 Likelihood Using the local dependence parameter set ( DC ) and parameter selection array (), the likelihood of Field f can be constructed as l(U|π f c ) =
S
J D max
u
zs j m˜ sc m˜ j f γ j f cd
u
zs j m˜ sc γ j f cd
π f scdj (1 − π f cd )1−u s j
s=1 j=1 d=1
=
S
J D max
π f scdj (1 − π f cd )1−u s j
,
(11.3)
s=1 j=1 d=1
where π f c is the c-th row vector of DC f , which is the parameter set related to Class c in Field f . In addition, m˜ sc is the binary class membership coded 1 if Student s belongs to Class c and coded 0 otherwise. Moreover, m˜ j f is the binary field mem-
11.2 Parameter Learning
543
bership coded 1 when Item j is classified in Field f and coded 0 otherwise. These binary memberships (i.e., m˜ sc and m˜ j f ) are the classification results by the IRM. As Eq. (11.2) shows, γ j f cd = 0 when Item j is not classified in Field f . In other words, the parameter selection array () includes information contained in the binary ˜ F is redundant and is, thus, removed from ˜ F ). That is, M field membership matrix ( M the final expression in the above equation. For example, let the likelihood of Class 4 in Field 2 be closely examined. Field 2 has three items (Items 11, 21, and 22), and Class 4 in the field is locally dependent, as shown in Fig. 11.2 (p. 529). Then, with ( f, c) = (2, 4), Eq. (11.3) is given as l(U|π 2,4 ) =
S
u s11 π2,4,1 (1 − π2,4,1 )1−u s11
zs11
z u s21 × π2,4,2 (1 − π2,4,2 )1−u s21 s21
s=1
u s22 z m˜ s4 × π2,4,3 (1 − π2,4,3 )1−u s22 s22 . That is, as shown in this equation, the data of Class 4 students for Items 11, 21, and 22 are used to estimate π2,4,1 , π2,4,2 , and π2,4,3 , respectively. As for locally independent Class c other than Classes 4 and 5 in Field 2, the likelihood becomes l(U|π 2,c ) =
S
z s11 u s11 +z s21 u s21 +z s31 u s31 π2,c,1
s=1
× (1 − π2,c,1 )zs11 (1−u s11 )+zs21 (1−u s21 )+zs31 (1−u s31 )
m˜ sc
(c = 4, 5),
which means that all the data of the Class c (= 4, 5) students for Items 11, 21, and 22 are consumed to estimate only π2,c,1 . In this equation, π 2,c represents the c-th row vector of DC2 but contains only π2,c,1 . For instance, π 2,1 = [π2,1,1 ] for Class 1. Consequently, the likelihood for all data can be defined as l(U| DC ) =
F
C
f =1 c=1
l(U|π f c ) =
F
S
J D C
max
us j z m˜ γ π f cd (1 − π f cd )1−u s j s j sc j f cd . f =1 c=1 s=1 j=1 d=1
(11.4)
11.2.7 Posterior Distribution Let a prior density be imposed on the local dependence parameter set ( DC = {π f cd }), which can make the parameters more stable (less extremely large/small). It is natural for all the elements in DC to assume the same natural conjugate beta density (see Sect. 4.5.3, p. 119). The density function of the beta distribution is defined as
544
11 Bicluster Network Model β −1
pr (π f cd ; β0 , β1 ) =
π f 1cd (1 − π f cd )β0 −1 B(β0 , β1 )
,
where β0 and β1 are the hyperparameters of beta density. Accordingly, the prior density for all the parameters is constructed as pr ( DC ; β0 , β1 ) =
C D F
max
pr (π f cd ; β0 , β1 ).
f =1 c=1 d=1
With this prior density pr ( DC ; β0 , β1 ), the posterior density of the local dependence parameter set ( DC ) is obtained as l(U| DC ) pr ( DC ; β0 , β1 ) pr (U) ∝l(U| DC ) pr ( DC ; β0 , β1 )
pr ( DC |U) =
∝
C
F
S
J D max
u
π f scdj (1 − π f cd )1−u s j
zs j m˜ sc γ j f cd
f =1 c=1 s=1 j=1 d=1
×
F
C D max
β −1
π f 1cd (1 − π f cd )β0 −1 .
(11.5)
f =1 c=1 d=1
11.2.8 Estimation of Parameter Set By taking the logarithm of Eq. (11.5), the log-posterior density is obtained as follows: ln pr ( DC |U) =
Dmax F S J C
z s j γ j f cd m˜ sc u s j ln π f cd + (1 − u s j ) ln(1 − π f cd )
f =1 c=1 s=1 j=1 d=1
+
Dmax F C
{(β1 − 1) ln π f cd + (β0 − 1) ln(1 − π f cd )
f =1 c=1 d=1
=
Dmax F S J C f =1 c=1 d=1
+
S J s=1 j=1
s=1 j=1
γ j f cd m˜ sc z s j u s j + β1 − 1 ln π f cd
γ j f cd m˜ sc z s j (1 − u s j ) + β0 − 1 ln(1 − π f cd )
11.2 Parameter Learning
=
545
Dmax C F U1 f cd + β1 − 1 ln π f cd + U0 f cd + β0 − 1 ln(1 − π f cd ) , f =1 c=1 d=1
(11.6) where U1 f cd =
S J
γ j f cd m˜ sc z s j u s j ,
s=1 j=1
U0 f cd =
S J
γ j f cd m˜ sc z s j (1 − u s j ).
s=1 j=1
The DC that maximizes Eq. (11.6) is the MAP of DC . Differentiating Eq. (11.6) with respect to π f cd , only the terms related to the parameter remain, as follows: d ln pr (π f cd |U) d ln pr ( D F |U) = . dπ f cd dπ f cd This is because all the parameters are disjointed from each other, and this also implies that each parameter can be optimized individually. Let the above derivative be equal to 0: U1 f cd + β1 − 1 U0 f cd + β0 − 1 d ln pr (π f cd |U) = − = 0. dπ f cd π f cd 1 − π f cd Then, by solving the equation with respect to π f cd , the MAP of π f cd that maximizes ln pr (π f cd |U) is obtained as follows: = πˆ (MAP) f cd
U1 f cd + β1 − 1 ) U0 f cd + U1(Tf cd + β0 + β1 − 2
.
Note that unless the prior density is assumed or when β0 = β1 = 1, the MAP is reduced to the MLE, as follows: πˆ (ML) f cd =
U1 f cd . U0 f cd + U1 f cd
(11.7)
Then, with (β0 , β1 ) = (1, 1), the estimate of the parameter set is obtained as
546
11 Bicluster Network Model
ˆ ⎤ ⎡ DC1 (Item 1, 31, 32) .000 ⎢.583 .573 .660 .000⎥ ⎥ ⎢ ⎥ ⎢.714 ⎥ ⎢ ⎥ ⎢.952 ⎥ ⎢ ⎥ ⎢.899 ⎥ ⎢ ⎥ ⎢.891 ⎥ ⎢ ⎥, ⎢.857 ⎥ ⎢ ⎥ ⎢.869 ⎥ ⎢ ⎥ ⎢.920 ⎥ ⎢ ⎥ ⎢.978 ⎥ ⎢ ⎥ ⎢.902 ⎥ ⎢ ⎦ ⎣.943 1.00
ˆ DC2 (Item 11, 21, 22) ˆ ⎡ ⎤ ⎡ DC3 (Item 23, 24)⎤ .000 .000 ⎢.019 ⎥ ⎥ ⎢.170 ⎢ ⎥ ⎥ ⎢ ⎢.026 ⎥ ⎥ ⎢.305 ⎢ ⎥ ⎥ ⎢ ⎢.257 .543 .571 .000⎥ ⎢.057 ⎥ ⎢ ⎥ ⎥ ⎢ ⎢.488 .860 .837 ⎥ ⎢.442 .605 .000 .000⎥ ⎢ ⎥ ⎥ ⎢ ⎢.762 ⎥ ⎥ ⎢1.00 ⎢ ⎥ ⎥ ⎢ ⎢.724 ⎥, ⎥ , ⎢.214 ⎢ ⎥ ⎥ ⎢ ⎢.848 ⎥ ⎥ ⎢1.00 ⎢ ⎥ ⎥ ⎢ ⎢.920 ⎥ ⎥ ⎢.980 ⎢ ⎥ ⎥ ⎢ ⎢1.00 ⎥ ⎥ ⎢1.00 ⎢ ⎥ ⎥ ⎢ ⎢.948 ⎥ ⎥ ⎢.980 ⎢ ⎥ ⎥ ⎢ ⎣1.00 ⎦ ⎦ ⎣1.00 1.00 1.00 (11.8)
ˆ ⎡ DC4 (Item 25–27)⎤ .000 ⎥ ⎢.000 ⎥ ⎢ ⎥ ⎢.172 ⎥ ⎢ ⎥ ⎢.181 ⎥ ⎢ ⎥ ⎢.039 ⎥ ⎢ ⎢.939 .776 .531 .000⎥ ⎥ ⎢ ⎥, ⎢.038 ⎥ ⎢ ⎥ ⎢.980 ⎥ ⎢ ⎥ ⎢.960 ⎥ ⎢ ⎥ ⎢.968 ⎥ ⎢ ⎥ ⎢.961 .941 .784 ⎥ ⎢ ⎦ ⎣1.00 1.00
ˆ ˆ ⎡ DC5 (Item 2–4) ⎤ ⎡ DC6 (Item 7–10, 33) ⎤ .000 .000 .000 .000 .000 ⎢.000 ⎥ ⎥ ⎢.239 ⎢ ⎥ ⎥ ⎢ ⎢.078 .484 .109 .000⎥ ⎢.278 ⎥ ⎢ ⎥ ⎥ ⎢ ⎢.162 ⎥ ⎥ ⎢.383 ⎢ ⎥ ⎥ ⎢ ⎢.132 ⎥ ⎥ ⎢.372 ⎢ ⎥ ⎥ ⎢ ⎢.211 ⎥ ⎥ ⎢.371 ⎢ ⎥ ⎥ ⎢ ⎢1.00 1.00 .914 ⎥, ⎥ , ⎢.457 ⎢ ⎥ ⎥ ⎢ ⎢.131 ⎥ ⎥ ⎢.388 ⎢ ⎥ ⎥ ⎢ ⎢1.00 .920 .920 ⎥ ⎥ ⎢.536 ⎢ ⎥ ⎥ ⎢ ⎢.301 ⎥ ⎥ ⎢.639 ⎢ ⎥ ⎥ ⎢ ⎢.922 ⎥ ⎥ ⎢.545 ⎢ ⎥ ⎥ ⎢ ⎣.966 1.00 1.00 ⎦ ⎦ ⎣.752 1.00 1.00 (11.9)
11.2 Parameter Learning
ˆ DC7 (Item 12, 13, 16, 17) ⎤ ⎡ .000 ⎥ ⎢.068 ⎥ ⎢ ⎥ ⎢.117 ⎥ ⎢ ⎥ ⎢.014 ⎥ ⎢ ⎥ ⎢.238 ⎥ ⎢ ⎥ ⎢.107 ⎥ ⎢ ⎥, ⎢.193 ⎥ ⎢ ⎥ ⎢.235 ⎥ ⎢ ⎥ ⎢.220 ⎥ ⎢ ⎢.968 .806 .871 .935⎥ ⎥ ⎢ ⎥ ⎢.387 ⎥ ⎢ ⎦ ⎣.845 1.00
547
ˆ ⎡ DC8 (Item 28, 29)⎤ .000 ⎢.005 ⎥ ⎢ ⎥ ⎢.031 ⎥ ⎢ ⎥ ⎢.029 ⎥ ⎢ ⎥ ⎢.093 ⎥ ⎢ ⎥ ⎢.020 ⎥ ⎢ ⎥ ⎢.043 ⎥, ⎢ ⎥ ⎢.879 .697 .000 .000⎥ ⎢ ⎥ ⎢.340 ⎥ ⎢ ⎥ ⎢.468 ⎥ ⎢ ⎥ ⎢.255 ⎥ ⎢ ⎥ ⎣.966 1.00 ⎦ 1.00
ˆ ˆ ⎡ DC10 (Item 34, 35)⎤ ⎡ DC11 (Item 14, 15)⎤ .000 .000 .000 .000 .000 .000 .000 .000 ⎥ ⎢.005 ⎢.102 ⎥ ⎥ ⎢ ⎢ ⎥ ⎥ ⎢.000 ⎢.227 ⎥ ⎥ ⎢ ⎢ ⎥ ⎥ ⎢.000 ⎢.157 ⎥ ⎥ ⎢ ⎢ ⎥ ⎥ ⎢.023 ⎢.093 ⎥ ⎥ ⎢ ⎢ ⎥ ⎥ ⎢.020 ⎢.184 ⎥ ⎥ ⎢ ⎢ ⎥ ⎥ , ⎢.086 ⎢.186 ⎥, ⎥ ⎢ ⎢ ⎥ ⎥ ⎢.076 ⎢.152 ⎥ ⎥ ⎢ ⎢ ⎥ ⎥ ⎢.000 ⎢.160 ⎥ ⎥ ⎢ ⎢ ⎥ ⎥ ⎢.387 ⎢.210 ⎥ ⎥ ⎢ ⎢ ⎥ ⎥ ⎢.059 ⎢.186 ⎥ ⎥ ⎢ ⎢ ⎥ ⎦ ⎣.586 ⎣.310 ⎦ 1.00 1.00
ˆ ⎡ DC9 (Item 5, 6) ⎤ .000 ⎥ ⎢.000 ⎥ ⎢ ⎥ ⎢.031 ⎥ ⎢ ⎥ ⎢.000 ⎥ ⎢ ⎥ ⎢.070 ⎥ ⎢ ⎥ ⎢.010 ⎥ ⎢ ⎥, ⎢.643 ⎥ ⎢ ⎥ ⎢.000 ⎥ ⎢ ⎥ ⎢.020 ⎥ ⎢ ⎥ ⎢.129 ⎥ ⎢ ⎢1.00 .941 .000 .000⎥ ⎥ ⎢ ⎦ ⎣.828 .759 1.00
ˆ DC12 (Item 18–20, 30) ⎤ ⎡ .000 ⎥ ⎢.005 ⎥ ⎢ ⎥ ⎢.063 ⎥ ⎢ ⎥ ⎢.036 ⎥ ⎢ ⎥ ⎢.012 ⎥ ⎢ ⎥ ⎢.026 ⎥ ⎢ ⎥. ⎢.000 ⎥ ⎢ ⎥ ⎢.015 ⎥ ⎢ ⎥ ⎢.060 ⎥ ⎢ ⎥ ⎢.016 ⎥ ⎢ ⎥ ⎢.010 ⎥ ⎢ ⎦ ⎣.190 1.00 1.00 1.00 1.00
11.3 Output and Interpretation This section describes themain outputs of BINET and how to interpret them. To summarize, with reference to Estimation Process of BINET , the parameter set estimate ˜ C ) and the binary ˆ DC ) was obtained from the binary class membership matrix ( M ( ˜ F ),6 which were both the results of the IRM. field membership matrix ( M
˜ F was not used directly because the information contained was included in the parameter In fact, M selection array ().
6
548
11 Bicluster Network Model
Estimation Process of BINET
11.3.1 Class Membership Matrix Estimate ˜ C ) in Estimation Process of BINET The binary class membership matrix ( M presents the student classification results not by BINET but by IRM. It is, thus, necessary to reestimate the class memberships of each student. From the likelihood of all data (Eq. (11.4), p. 543), the likelihood that Student s belongs to Class c is given as F
f =1
l(us |π f c ) =
F
J D max
u
π f scdj (1 − π f cd )1−u s j
zs j γ j f cd
.
f =1 j=1 d=1
The membership of the student to the class is, thus, expressed as !F
f =1 l(us |π f c )πc !F c =1 f =1 l(us |π f c )πc
mˆ sc = C
,
where πc is the prior probability that the (each) student belongs to Class c and is usually assumed as πc = 1/C unless there is prior knowledge of it. When this mˆ sc is calculated for all the students (∀s ∈ N S ) and classes (∀c ∈ NC ), the class membership ˆ C (S × C) can be obtained. matrix estimate M ˆ C ), where each Table 11.4 shows the estimate of the class membership matrix ( M blank cell represents 0.000 for readability. For example, as Student 1 has the largest membership to Class 2 (84.8%), the latent class estimate (LCE) for this student is Class 2. In addition, there were five students whose LCE classified by the BINET was different from that of the IRM. Figure 11.5 shows the class membership profiles of the first nine students. This shows that most students have a large membership in any one of the 13 classes. This is because the latent classes are nominal clusters and are dissimilar, even from their adjacent classes. In contrast, if the clusters are ordinal (i.e., latent ranks), they are ordered by making closer clusters more similar (see BINET with Latent Rank Scale ), and the memberships to the clusters around the optimal cluster become moderately large.
11.3 Output and Interpretation
549
Table 11.4 Class membership estimate and latent class estimate (BINET) Student C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 1 .000 .848 .141 .008 .003 .000 .000 .000 .000 .000 2 .001 .998 3 .035 .838 .126 4 .001 .152 .848 5 .533 .183 .283 .002 6 .001 .978 .021 7 8 1.00 9 .881 .115 .003 .001 .. .. .. .. .. .. .. .. .. .. .. . . . . . . . . . . . 515 .002 2 103 67 32 42 49 36 32 25 32 LCD∗2 CMD∗3 2.0 87.4 76.0 37.0 44.9 46.9 36.8 33.2 25.5 32.2 ∗1 latent
class estimate, ∗2 latent class distribution, ∗3 class membership distribution Student 2 1.0
0.8
0.8
0.8
0.6 0.4 0.2
Membership
1.0
0.0
0.6 0.4 0.2 0.0
2
4
6
8
10
12
0.6 0.4 0.2 0.0
2
4
Latent Class
6
8
10
12
2
0.8
0.8
0.2
Membership
0.8
Membership
1.0
0.4
0.6 0.4 0.2
4
6
8
10
2
12
4
6
8
10
2
12
0.8
Membership
0.8
Membership
0.8
0.6 0.4 0.2
6
8
10
12
8
10
12
0.6 0.4 0.2 0.0
0.0 4
6
Student 9 1.0
2
4
Latent Class
1.0
0.0
12
0.2
Student 8
Student 7
0.2
10
0.4
1.0
0.4
12
0.6
Latent Class
Latent Class
0.6
10
0.0
0.0
0.0
8
Student 6
Student 5 1.0
0.6
6
Latent Class
1.0
2
4
Latent Class
Student 4 Membership
Student 3
1.0 Membership
Membership
Student 1
Membership
C11 C12 C13 LCE∗1 2 .000 .000 .000 10 3 8 2 4 .061 .938 12 10 2 .. .. .. .. . . . . 11 .993 .005 51 29 15 49.4 28.8 15.0
2
4
Latent Class
Fig. 11.5 Class membership profiles (BINET)
6
8
Latent Class
10
12
2
4
6
8
Latent Class
550
11 Bicluster Network Model
In this sense, Student 5 is a rare case because of the large value of memberships in more than one class (i.e., Classes 2, 3, and 4). This implies that the response pattern of the student is not typical, and it is necessary to closely investigate his or her learning history. BINET with Latent Rank Scale In this chapter, the students are clustered into latent classes, but they can be classified into latent ranks. To do so, the optimal number of ranks (R) and fields (F) should first be obtained by ranklustering (Sect. 7.3, p. 274), as shown in the flowchart below.
If there is a field having only one item, sucha field cannot be analyzed in the BINET (see A Field with Only One Item , p. 532). Let the field be Field A, and the only item in Field A was Item α. Then, field membership profile (Table 7.3, p. 293) of Item α would indicate the field in which membership is the second largest next to Field A. Let the field also be Field B. Then, using confirmatory ranklustering (Sect. 7.7, p. 333), Item α should be forcibly classified in Field B (i.e., Fields A and B should be merged). After the number of ranks and fields are fixed by the (confirmatory) ranklustering, the local dependence parameter set (DR = {π f r d }, F × R × Dmax )7 can be estimated using the parameter selection array ( = {γ j f r d }, J × F × R × Dmax ) that summarizes the local dependency structure. Consequently, by referring to Eq. (11.6) (p. 545), and using the smoothed rank (T ) }, S × R), the objective function for estimembership matrix (S(T ) = {ssr mating the parameter set can be specified as ln pr (DR |U) =
Dmax R F
S1 f r d + β1 − 1 ln π f r d
f =1 r =1 d=1
+ S0 f r d + β0 − 1 ln(1 − π f r d ) S1 f r d =
S J
(T ) γ j f r d ssr zs j u s j
s=1 j=1
S0 f r d =
S J
(T ) γ j f r d ssr z s j (1 − u s j ).
s=1 j=1
The latent class distribution (LCD) in the second line from the bottom in Table 11.4 represents the frequency distribution of the number of students in each class. Class 7
The subscript DR means “(local) dependence among ranks”.
11.3 Output and Interpretation
551
Fig. 11.6 LCD and CMD (BINET)
Frequency
100
103
80 67
60
51
49
40
42 36
32
32
20
32 25
29 15
0
1
2
3
4
5
6
7
8
9
10 11 12 13
Latent Class
2 had the greatest number of students at 103, and Class 1 (i.e., the group of zero markers) had the lowest number of students at two. In addition, the CMD in the bottom row of this table indicates the column sum vector of the class membership matrix, as follows: ⎤ 2.0 ⎢ 87.4 ⎥ ⎥ ˆ C 1 S = ⎢ =M ⎢ .. ⎥ (C × 1). ⎣ . ⎦ ⎡
dˆ CMD
15.0 Fig. 11.6 plots these LCD and CMD. Figure 11.7 (left) shows the class membership profile of Student 5 represented in the dependency structure, where a higher membership is represented by a circle with a larger diameter. This student has the largest membership in Class 2 (53.3%), followed by Class 4 (28.3%). The student could move to Class 4 if he or she relearned and reviewed Field 2 (i.e., Items 11, 21, and 22) or move up to Class 3 if he or she did the same for Field 5 (i.e., Items 2–4). This type of representation can provide feedback for each student. It is important for students to be able to see where they are in the learning structure (i.e., map) for the test. It is also essential to inform them of what to learn next to reach the goal (i.e., Class C of full markers). For instance, it is reckless for a Class 2 student to deal with the difficult Field 12 items (i.e., Items 18–20 and 30). The graph tells us that the next field the Class 2 students should work on is either or both Fields 2 and 5. Furthermore, Fig. 11.7 (right) shows the LCD represented on the dependency structure. A class with more students is equipped with a longer bar.
552
11 Bicluster Network Model
Class Membership Profile (Student 5)
Latent Class Distribution
13
13
12
12
12 5
9
12 8
11
5
5
9
8
11
9
8
9 8
4
10 5
8
9
7
8
4
6 7 2
3
6 5
5
2
3
4 2
7
4
5 5
10 5
4
7
5 9
3 2
5
5
2 1
1
1
1
Fig. 11.7 CMP and LCD shown in graph
11.3.2 Interpretation for Parameter Estimates Next, we explain how to interpret the estimate of the local dependence parameter set ˆ DC ). There are two primary points to be checked: (1) the parameters related to the ( locally dependent classes and (2) the marginal parameters. Of these, (1) is focused on in this section, while (2) is explained in the next section. As for (1) locally dependent classes, they are classes that have at least one parent in Fig. 11.2 (p. 529): Class 2 in Field 1, Classes 4 and 5 in Field 2, . . ., Class 13 in Field 12. For example, the parameter set estimate for Field 2 (i.e., Items 11, 21, ˆ DC2 , is obtained as shown in Eq. (11.8) (p. 546). In this field, Class 4 is a and 22), locally dependent class, and the PSRs of the class students for the field items (i.e., Items 11, 21, and 22) were estimated as πˆ 2,4 = [.257 .543 .571]. These values to the bold cells in Field 2 of Table 11.5. As described also correspond by (3) in Meaning of π f cj (p. 544), the PSRs of Class 4 students for Field 2 items differ when those of the parent (i.e., Class 2) students differ. In other words, as (4) of the box says, each of the three PSRs of Class 4 students must be compared to that of Class 2 students. The PSRs of Class 2 students for Items 11, 21, and 22 are
11.3 Output and Interpretation
553
(πˆ 2,11 , πˆ 2,21 , πˆ 2,22 ) = (.010, .039, .010). They are picked from the shaded cells in Field 2 of Table 11.5, where each PSR (i.e., cell) of πˆ cj is calculated by S m˜ sc z s j u s j + β1 − 1 πˆ cj = S s=1 . ˜ sc z s j + β0 + β1 − 1 s=1 m ˆ C = {mˆ sc } Note that the class membership matrix employed in this equation is not M ˜ (S × C) estimated in Sect. 11.3.1 (p. 548) but M C = {m˜ sc } (S × C) obtained by IRM. This definition is consistent because the local dependence parameter set ( DC ) ˜ C ) in Eq. (11.6) was estimated using the latter binary class membership matrix ( M (p. 545). For this reason, the bold cells in Field 2 of Table 11.5 accords with πˆ 2,4 . These results indicate that with changing Classes 2 → 4, the PSRs of Items 11, 21, and 22 rise as Item 11: πˆ 2,11 = .010 Item 21: πˆ 2,21 = .039
→ →
πˆ 2,4,1 = 0.257, πˆ 2,4,2 = 0.543,
Item 22: πˆ 2,22 = .010
→
πˆ 2,4,3 = 0.571.
(11.10)
The PSR of Item 11 was approximately only 1% for Class 2 students but exceeded 25% for Class 4 students. In addition, the PSRs for Items 21 and 22 were very small for the Class 2 students, whereas those for the Class 4 students exceeded 50%. This implies that the ability required in Field 2 (i.e., Items 11, 21, and 22) is the key to classifying students into Class 2 or 4. From the array plot that sorts the original data considering the students’ LCEs (estimated by BINET) and items’ field memberships (estimated by IRM and employed in the BINET),8 the rows for Classes 2 and 4 are extracted and shown in Fig. 11.8 (left). The upper and lower rows show Classes 2 and 4, respectively. The plots indicate that almost all Class 2 students failed Field 2 (cf. the second cell from the left) items, whereas approximately half of the Class 4 students passed the field items. Note that for other fields, the differences between Classes 2 and 4 were not significant. Figure 11.8 (right) is the density plot of the array plot, where each cell is shaded darker as the PSR of the cell is larger. That is, the cell with a PSR of 0.0 is filled with white, whereas the cell with a PSR is 1.0 is filled with black. As will be seen later (in Table 11.6, p. 560), the PSR of Class 4 for Field 1 is 0.952, and thus, the leftmost cell in the lower row is almost black. It can also be seen from the two rows that the difference in darkness for Field 2 (cf. the second column from the left) is the largest. In fact, Field 2, as the biggest gap between Classes 2 and 4, was already found in the IRM analysis results (see Fig. 11.4, p. 532). Therefore, the dependency structure 8
As mentioned above, the LCEs estimated by the IRM and BINET were the same, except for five students. This array plot is, thus, almost identical to that sorted by the IRM shown in Fig. 11.4 (right) (p. 532).
554
11 Bicluster Network Model
Table 11.5 Passing student rates (BINET) Field Item C1 1 1 .000 31 .000 32 .000 2 11 .000 21 .000 22 .000 3 23 .000 24 .000 4 25 .000 26 .000 27 .000 5 2 .000 3 .000 4 .000 6 7 .000 8 .000 9 .000 10 .000 33 .000 7 12 .000 13 .000 16 .000 17 .000 8 28 .000 29 .000 9 5 .000 6 .000 10 34 .000 35 .000 11 14 .000 15 .000 12 18 .000 19 .000 20 .000 30 .000 ∗ correct
C2 .583 .573 .660 .010 .039 .010 .204 .136 .000 .000 .000 .000 .000 .000 .379 .262 .214 .107 .233 .068 .058 .039 .107 .000 .010 .000 .000 .165 .039 .000 .010 .019 .000 .000 .000
response rate
C3 .688 .766 .688 .016 .063 .000 .438 .172 .188 .141 .188 .078 .484 .109 .438 .250 .156 .203 .344 .141 .109 .063 .156 .000 .063 .063 .000 .250 .203 .000 .000 .047 .078 .063 .063
C4 1.00 .914 .943 .257 .543 .571 .057 .057 .200 .114 .229 .171 .286 .029 .543 .171 .314 .371 .514 .029 .000 .000 .029 .000 .057 .000 .000 .143 .171 .000 .000 .029 .057 .057 .000
C5 .930 .884 .884 .488 .860 .837 .442 .605 .093 .000 .023 .163 .233 .000 .512 .326 .279 .279 .465 .349 .279 .116 .209 .000 .186 .116 .023 .140 .047 .023 .023 .000 .000 .023 .023
C6 .939 .857 .878 .449 .918 .918 1.00 1.00 .939 .776 .531 .265 .306 .061 .551 .224 .388 .327 .367 .163 .041 .061 .163 .000 .041 .020 .000 .224 .143 .041 .000 .000 .000 .000 .102
C7 1.00 .743 .829 .714 .800 .657 .229 .200 .086 .029 .000 1.00 1.00 .914 .600 .486 .371 .514 .314 .257 .114 .171 .229 .000 .086 .657 .629 .257 .114 .143 .029 .000 .000 .000 .000
C8 .879 .909 .818 .636 .970 .939 1.00 1.00 1.00 1.00 .939 .121 .273 .000 .606 .242 .394 .333 .364 .424 .121 .182 .212 .879 .697 .000 .000 .212 .091 .091 .061 .000 .000 .000 .061
C9 1.00 .920 .840 .800 1.00 .960 .960 1.00 1.00 1.00 .880 1.00 .920 .920 .800 .440 .480 .360 .600 .480 .160 .120 .120 .320 .360 .040 .000 .160 .160 .000 .000 .000 .040 .040 .160
C10 .935 1.00 1.00 1.00 1.00 1.00 1.00 1.00 .968 1.00 .935 .419 .323 .161 .774 .387 .710 .613 .710 .968 .806 .871 .935 .581 .355 .161 .097 .258 .161 .484 .290 .000 .000 .000 .065
C11 1.00 .922 .784 .980 .941 .922 .980 .980 .961 .941 .784 1.00 .961 .804 .627 .529 .549 .471 .549 .569 .373 .275 .333 .314 .196 1.00 .941 .157 .216 .078 .039 .000 .000 .000 .039
C12 1.00 .897 .931 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 .966 1.00 1.00 .966 .552 .828 .724 .690 .897 .828 .828 .828 .966 1.00 .828 .759 .414 .207 .690 .483 .138 .138 .172 .310
C13 CRR∗ 1.00 .850 1.00 .812 1.00 .808 1.00 .476 1.00 .616 1.00 .586 1.00 .600 1.00 .567 1.00 .491 1.00 .452 1.00 .414 1.00 .392 1.00 .458 1.00 .303 1.00 .573 1.00 .350 1.00 .390 1.00 .353 1.00 .437 1.00 .340 1.00 .237 1.00 .216 1.00 .276 1.00 .221 1.00 .227 1.00 .250 1.00 .216 1.00 .229 1.00 .155 1.00 .126 1.00 .087 1.00 .049 1.00 .052 1.00 .054 1.00 .085
11.3 Output and Interpretation
Array Plot
555
Density Plot
Fig. 11.8 Array and density plots of Classes 2 and 4
Array Plot
Density Plot
Fig. 11.9 Array plot and density plot of Classes 3 and 5
of Class 2 → 4 in Field 2 was assumed in the BINET. As will also be seen, when constructing the graph (i.e., exploring the dependency structure), it is a good idea to first take any two rows (i.e., classes) from the IRM density plot, then look for a field(s) where the difference in darkness stands out, and finally assume a local dependence between the classes in the field(s). Because the PSRs of the two classes for such a field differ greatly, it is recommended to compare the PSRs of the two classes item by item. Similarly, for Classes 3 and 5, Fig. 11.9 shows the array plot (left) and density plot (right) comparing Classes 3 (top row) and 5 (bottom row). It can be seen that the field where the difference is the most outstanding is Field 2 (the second row from the left). Generally, in BINET, the edges in the graph should be set between class pairs, in which the PSRs of the fields are significantly different. Let us take a look at one more example. As shown in Fig. 11.7 (p. 552), a local dependency structure was assumed for Classes 6 → 10 in Field 7. This edge was specified considering that Field 7 is the most distinguishing difference between Classes 6 (top row) and 10 (bottom row) in Fig. 11.10, showing the array and density plots for the two classes. Figure 11.11 shows all such PSR differences for all locally dependent classes compared with their respective parents. In this figure, the center panel of the top
556
11 Bicluster Network Model
Array Plot
Density Plot
Fig. 11.10 Array plot and density plot of Classes 6 and 10
row conveys Eq. (11.10) (p. 557). That is, the PSRs of the three items in Field 2 (Items 11, 21, and 22) are (.010, .039, .010) for the Class 2 students, but they are (.257, .543, .571) for the Class 4 students. In addition, for example, the center panel in the third row of the figure represents the local dependency structure of Class 4 → 7 in Field 5 (Items 2–4). From the shaded cells in Field 5 of Table 11.5 (p. 554), the PSRs of Items 2–4 for Class 4 students are (πˆ 4,2 , πˆ 4,3 , πˆ 4,4 ) = (.171, .286, .029), whereas those for the Class 7 students are πˆ 5,7 = [1.000 1.000 .914] ˆ DC5 from the bold cells in Field 5 of Table 11.5 (or also from the seventh row of in (Eq. (11.9), p. 546). These values indicate that most Class 7 students could pass Field 5 items. The PSRs are generally greater for the upper-level classes (with a larger label) than for the lower-level classes. Figure 11.12 summarizes all plots shown in Fig. 11.11. This figure can tell each student to grasp at a glance which field should be focused on and the criteria (i.e., PSRs) of the field items to move to the next class. For example, a Class 6 student should concentrate on learning the two items in Field 8 (Items 28 and 29) to grade up to Class 8 or should focus on the four items in Field 7 (Items 12, 13, 16, and 17) to level up to Class 10. Such a graphical representation can greatly help the teacher understand the map of learning and guide students in each class.
11.3.3 Marginal Field Reference Profile Let π f c denote the PSR of Class c students for a Field f item; it is represented as π f c = Pr ( f |c),
11.3 Output and Interpretation
Fig. 11.11 Local dependence passing student rates (BINET)
557
558
11 Bicluster Network Model
13
Latent Class
12
Latent Field
12 5
8
9
11
5 9
8
9 5
8
4
10 7
6 4
7
5 5
2
3
4
3 2
5
2 1
1 Fig. 11.12 Local dependence PSRs on graph
while π f cd obtained in Sect. (11.2.8) is expressed as π f cd = Pr ( f |c, d), where d is the d-th item classified in Field f . Thus, the following relationship can be derived from these two equations: Pr ( f |c) =
Dmax d=1
Pr ( f |c, d)Pr (d),
11.3 Output and Interpretation
559
where Pr (d) can be obtained by S Pr (d) =
J
s=1 S s=1
j=1
J
mˆ sc z s j γ j f cd mˆ sc z s j m˜ j f
j=1
.
When there are no missing data, it can be simply expressed by Pr (d) =
1 , Jf
where J f is the number of items in Field f . Accordingly, by marginalizing πˆ f cd with respect to d, πˆ f c can be obtained as πˆ f c =
Dmax
S
J
s=1 πˆ f cd S s=1 d=1
j=1
J
mˆ sc z s j γ j f cd
j=1
mˆ sc z s j m˜ j f
.
Note that the PSR of a source (i.e., locally independent class) is marginalized. The matrix collecting these πˆ f c s for all classes and fields is given as ⎤ πˆ 11 . . . πˆ 1C ⎥ ⎢ = ⎣ ... . . . ... ⎦ = {πˆ f c } (F × C). πˆ F1 . . . πˆ FC ⎡
ˆ MB
ˆ DC ) This matrix is obtained by marginalizing the local dependence parameter set ( over the parameter selection array () and is, thus, called the marginal bicluster reference matrix. ˆ M B ). Table 11.6 shows the estimate of the marginal bicluster reference matrix ( For example, the marginal (i.e., averaged) PSR for three items in Field 2 of Class 7 students was 72.4%, and that of Class 8 students was 84.8%. In addition, because all classes were locally independent in Field 6, each class in ˆ DC6 ) in the field had only one parameter, as shown in the parameter set for the field ( Eq. (11.9) (p. 546). In a field where all classes are locally independent, the parameter ˆ DC6 estimates are already marginalized; thus, the estimates in the first column of were copied and pasted in the sixth row (i.e., Field 6) of the marginal bicluster ˆ M B ). reference matrix ( ˆ M B ) is the The f -th row vector of the marginal bicluster reference matrix ( marginal field reference profile (FRP) of Field f . Figure 11.13 plots the marginal FRPs of the 12 fields. The marginal FRP of each field helps us understand how the average PSR for the field items changes with class. In general, the marginal FRPs
560
11 Bicluster Network Model
Table 11.6 Marginal bicluster reference matrix (BINET) Field 1 2 3 4 5 6 7 8 9 10 11 12 TRP∗2 LCD∗3 ∗1 latent
C1 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 0.0 2
C2 .605 .019 .170 .000 .000 .239 .068 .005 .000 .102 .005 .005 3.9 103
C3 .714 .026 .305 .172 .224 .278 .117 .031 .031 .227 .000 .063 6.7 67
C4 .952 .457 .057 .181 .162 .383 .014 .029 .000 .157 .000 .036 7.9 32
C5 .899 .729 .523 .039 .132 .372 .238 .093 .070 .093 .023 .012 9.9 42
C6 .891 .762 1.00 .748 .211 .371 .107 .020 .010 .184 .020 .026 12.7 49
C7 .857 .724 .214 .038 .971 .457 .193 .043 .643 .186 .086 .000 13.2 36
C8 .869 .848 1.00 .980 .131 .388 .235 .788 .000 .152 .076 .015 15.5 32
C9 .920 .920 .980 .960 .947 .536 .220 .340 .020 .160 .000 .060 18.0 25
C10 .978 1.00 1.00 .968 .301 .639 .895 .468 .129 .210 .387 .016 21.0 32
C11 .902 .948 .980 .895 .922 .545 .387 .255 .971 .186 .059 .010 20.2 51
C12 .943 1.00 1.00 1.00 .989 .752 .845 .983 .793 .310 .586 .190 27.0 29
C13 LFD∗1 3 1.00 3 1.00 1.00 2 1.00 3 3 1.00 1.00 5 1.00 4 2 1.00 2 1.00 1.00 2 2 1.00 4 1.00 35.0 15
field distribution, ∗2 test reference profile, ∗2 latent class distribution,
increase as the class label increases because the average ability level of an upper class tends to be higher.9
11.3.4 Test Reference Profile The test reference profile (TRP) represents the expected NRSs to the entire test regarding the respective latent classes. First, from Table 11.6, the latent field distribution (LFD), the frequency distribution of the number of items for the respective classes, is obtained as dˆ LFD = [3 3 2 3 3 5 4 2 2 2 2 4] , and then the TRP is computed as
9
Note that the trend is not very firm because the classes are nominal clusters. If the clusters are ordinal (i.e., latent ranks), these marginal FRPs will be smoother. When analyzing with the ranks, it is recommended to set a smaller number of ranks because the field PSRs of each rank become less unique and similar to those of the adjacent ranks. This is because the ranks are created by making neighboring clusters similar. Accordingly, when the number of ranks is small, the PSRs of each individual rank become more unique, and it is then easy for any rank pair to find a field(s) for which PSRs differ remarkably.
0.0 1 2 3 4 5 6 7 8 9 10111213
0.4 0.2 0.0 1 2 3 4 5 6 7 8 9 10111213 Latent Class
Field 4
Field 5
1.0 0.8 0.6 0.4 0.2 0.0 1 2 3 4 5 6 7 8 9 10111213
Correct Response Rate
Latent Class
0.8 0.6 0.4 0.2 0.0 1 2 3 4 5 6 7 8 9 10111213
Field 7
Field 8
1.0 0.8 0.6 0.4 0.2 0.0 1 2 3 4 5 6 7 8 9 10111213
Correct Response Rate
Latent Class
0.8 0.6 0.4 0.2 0.0 1 2 3 4 5 6 7 8 9 10111213 Latent Class
Field 10
Field 11
1.0 0.8 0.6 0.4 0.2 0.0 1 2 3 4 5 6 7 8 9 10111213 Latent Class
0.8 0.6 0.4 0.2 0.0 1 2 3 4 5 6 7 8 9 10111213
Field 6 1.0 0.8 0.6 0.4 0.2 0.0 1 2 3 4 5 6 7 8 9 10111213 Latent Class
1.0
Latent Class
Field 3 1.0
Latent Class
1.0
Latent Class
Correct Response Rate
0.2
0.6
Correct Response Rate
0.4
0.8
Correct Response Rate
0.6
Field 2 1.0
Field 9 1.0 0.8 0.6 0.4 0.2 0.0 1 2 3 4 5 6 7 8 9 10111213 Latent Class
1.0 0.8 0.6 0.4 0.2 0.0 1 2 3 4 5 6 7 8 9 10111213 Latent Class
Correct Response Rate
0.8
Correct Response Rate
Field 1 1.0
561
Correct Response Rate
Correct Response Rate
Correct Response Rate
Correct Response Rate
Correct Response Rate
11.3 Output and Interpretation
Field 12 1.0 0.8 0.6 0.4 0.2 0.0 1 2 3 4 5 6 7 8 9 10111213 Latent Class
Fig. 11.13 Marginal field reference profile (BINET)
⎡ ⎢ ˆ MB dˆ LFD = ⎢ ˆt TRP = ⎢ ⎣
⎤ 0.0 3.9 ⎥ ⎥ .. ⎥ , .⎦ 35.0
which means that the c-th element in the TRP is the weighted sum of the c-th column elements in the marginal bicluster reference matrix, and the weight of the f -th column elements is the f -th element in the LFD. From the above equation, for example, a student in Class 2 is expected to pass approximately four (3.9) items of this 35-item test, and a Class 3 student is expected to pass about seven (6.7) items on the test. The line graph in Fig. 11.14 shows the TRP, while the bar graph represents the number of students in the respective classes (i.e., LCD; see also Fig. 11.6). The figure shows that the TRP has an upward trend, but it drops slightly from Class 10 to 11. The classes are originally labeled based on the IRM analysis such that the
562
11 Bicluster Network Model
Fig. 11.14 Test reference profile (BINET)
35
Expected Score
30
103
25 20
67
15
51
49 42
10
36
32
32
32
29
25
5
15
0
1
2
3
4
5
6
7
8
9
10 11 12 13
Latent Class
TRP is monotonically increasing, but the TRP under the BINET analysis does not necessarily increase monotonically. However, because the latent scale in this analysis is nominal, it does not matter whether the TRP is monotonic. If the latent scale is ordinal and the TRP is not monotonic, the latent scale does not satisfy the weak ordinal alignment condition (WOAC). If fulfilled, it certifies that the latent clusters are arranged ordinally, even if the marginal FRPs of all fields are not monotonic.
11.3.5 Model Fit Based on Benchmark Model This section describes the evaluation of BINET fitness to data. As in other chapters, the model analyzing the data (i.e., analysis model) is compared with a well-fitting benchmark model and a poor-fitting null model. Note that the benchmark and null models can be designated by the analyst, but those employed in this book are consistently the same throughout the chapters. That is, the benchmark model is a multigroup model where students are grouped by NRS, and items are locally independent in each group (i.e., locus), while the null model is a single-group model in which items are (globally) independent. The log-likelihoods of these two models to the data under analysis (J35S515) were calculated and are shown in Table 7.7 (p. 319). Additionally, the χ2 value and DF for the null model are shown in the table. ˆ DC , class memFrom Eq. (11.4) (p. 543), with local dependence parameter set ˜ F , the likelihood of the ˆ C , and binary field membership matrix M bership matrix M analysis model (Fig. 11.1, p. 528) can be expressed as l A (U| DC ) =
F
S
J D C
max
f =1 c=1 s=1 j=1 d=1
Thus, the log-likelihood is given as
u
πˆ f scdj (1 − πˆ f cd )1−u s j
zs j mˆ sc γ j f cd
.
11.3 Output and Interpretation
563
Table 11.7 Model fit indices (BINET) χ2 and d f
Standardized Index
Information Criterion
ll B
−5891.31
NFI
1.000
AIC
−2240.35
ll N
−9862.11
RFI
1.000
CAIC
−6507.68
ll A
−5776.14
IFI
1.000
BIC
−6505.73
χ2N
7941.60
TLI
1.000
χ2A
−230.35
CFI
1.000
RMSEA
0.000
d fN
1155
d fA
1005
ll A (U| DC ) =
S J D F C max
z s j mˆ sc γ j f cd u s j ln πˆ f cd + (1 − u s j ) ln(1 − πˆ f cd ) .
f =1 c=1 s=1 j=1 d=1
Accordingly, the χ2 value for the analysis model is computed by χ2A = 2(ll A − ll B ), where ll B represents the log-likelihood of the benchmark model. In addition, the number of parameters in the analysis model (N P A ) equals the ˆ DC ), and it number of parameter estimates in the local dependence parameter set ( is 185 in this analysis. The DF for the analysis model (d f A ) is, thus, obtained as d f A = N P B − N P A, where N P B (= 1190) is the number of parameters in the benchmark model. From the χ2 values and DFs of the analysis and null models, various fitness indices can be calculated based on Standardized Fit Indices (p. 321) and Information Criteria (p. 326). Table 11.7 shows the model-fit indices for this analysis. As can be seen from the table, the log-likelihood of the analysis model is larger (i.e., closer to zero) than that of the benchmark model, which implies that the analysis model has better-fitting data than the benchmark model. This is also shown by χ2A being negative. A χ2 value must be nonnegative by definition, and all the χ2 -based indices are, thus, meaningless.
564
11 Bicluster Network Model
11.3.6 Model Fit Based on Saturated Models When classifying students into nominal clusters (i.e., latent classes), the analysis model almost always fits the data very well. In such a case, the log-likelihood of the analysis model becomes larger than that of the benchmark model, which causes χ2A to become negative, and the fitness indices cannot be evaluated properly. To deal with this matter, consider a further well-fitting benchmark model. One idea is to designate an LCA model in which the number of classes (C) is larger than that of the analysis model (for instance, 30) as the benchmark model (denoted by LCA (30)). Meanwhile, C = 13 (and F = 12) for the analysis model (denoted by BINET (13, 12)); thus, the model fit of LCA (30) would be better than that of BINET (13, 12). However, LCA (30) may not always be calculatable (e.g., for data with a small sample size). Conversely, when data have a very large sample size, the optimal C of the analysis model for the data can be > 40, and such an analysis model will then fit better than the benchmark model. Therefore, it is inappropriate to specify a model with a large C as the benchmark model. Rather, we designate a multigroup model with the number of groups greater than the number of observed NRS patterns as the benchmark model. As for the number of groups, a promising candidate is the number of response patterns. Table 11.8 explains the number of response patterns. If data are obtained as in the left table, the data can be grouped by the same binary pattern, as shown in the right table. The results show that the number of response patterns is 11. Thus far, the benchmark model has grouped students according to the number of observed NRS patterns, which is six (0–5) in the table. The number of response patterns is usually much greater than the number of NRS patterns: In the data with 35 items and 515 students, which is under analysis (J35S515), the number of observed NRS patterns was 34,10 whereas the number of response patterns was 488, which implies that only 27 (= 515 − 488) out of the 515 students shared the same response pattern with other students. In addition, two of the 27 students were in fact zero markers, and 15 of them were full markers, indicating that most of the students had their own unique response patterns. In general, the response patterns of any two students are rarely duplicated when the number of items is large. The multigroup model in which the number of groups is the number of response patterns is called the saturated model. Here, it is the 488-group model. Then, group membership matrix M G = {m sg } (515 × 488) can be specified, where m sg is a binary indicator coded 1 when Student s belongs to Group g and otherwise 0. Next, the group reference matrix ( P G = { p jg }, 35 × 488) is also fixed, where p jg is the PSR of Group g students for Item j. Note that this p jg is also binary because all Group g students share the same response pattern, and their responses to Item j can, thus, be correct or incorrect (or missing). Using them, the log-likelihood of the saturated model for Item j is obtained as 10
The maximum possible number of NRS patterns is 36 (0–35), but there were no students whose NRS was 32 or 33.
11.3 Output and Interpretation
565
Table 11.8 Data grouped by response pattern Grouped
Original ID 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15
ll S (u j | p j ) =
1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1
S G
2 1 1 0 1 1 1 0 1 0 0 1 1 0 0 1
3 1 0 0 1 0 0 1 0 0 1 1 0 0 1 1
4 1 0 0 1 0 1 0 0 0 0 0 0 0 1 1
5 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1
G 1 2 3 4 5 6 7
8 9 10 11
ID 09 05 03 13 10 07 14 02 08 12 06 11 04 01 15
1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1
2 0 1 0 0 0 0 0 1 1 1 1 1 1 1 1
3 0 0 0 0 1 1 1 0 0 0 0 1 1 1 1
4 0 0 0 0 0 0 1 0 0 0 1 0 1 1 1
5 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1
m sg z s j {u s j ln p jg + (1 − u s j ) ln(1 − p jg )} = 0,
s=1 g=1
where p j is the j-th row vector of the group reference matrix ( P G ). As the above equation shows, the log-likelihood of Item j is 0. This is because if Student s belongs to Group g, then u s j = 0 when p jg = 0 and u s j = 1 when p jg = 1.11 As a result, the log-likelihood of the saturated model for the entire test is given as ll S (U| P G ) =
J
ell S (u j | p j ) = 0.
j=1
Note that because the saturated model perfectly fits the data, the log-likelihood of the model becomes 0 (i.e., the likelihood is then 1). In addition, the number of parameters in the saturated model is N P S = G × J = 488 × 35 = 17,080.
11
Let 0 × ln 0 = 0 × (−∞) = 0.
566
11 Bicluster Network Model
Table 11.9 Model fit indices based on saturated model (BINET)
ll S ll N ll A χ2N χ2A d fN d fA
χ2 and d f 0.00 −9862.11 −5776.14 19724.23 11552.28 17045 16895
Standardized Index NFI 0.414 RFI 0.409 IFI 1.000 TLI 1.000 CFI 1.000 RMSEA 0.000
Information Criterion AIC −22237.72 CAIC −93975.69 BIC −93942.92
Table 11.9 shows the model-fit indices calculated based on the saturated model. For example, the χ2 value and DF of the analysis model are obtained as follows: χ2A = 2 × {0 − (−5776.14)} = 11,552.28, d f A = N P S − N P A = 17,080 − 185 = 16,895. Because the log-likelihood cannot be > 0 (i.e., the likelihood cannot be > 1), the log-likelihood of the analysis model is never greater than that of the saturated model. Therefore, with the saturated model, the fitness of the analysis model can always be properly evaluated. However, the saturated model may be a little too good to fit the data as a comparison object. Thus, the NFI and RFI were indeed very poor. In addition, the IFI, TLI, and CFI do not work in practice. To evaluate the fitness of an analysis model, it is necessary to designate better- and worse-fitting models and then sandwich the fitness of the former model with those of the latter models (see Fig. 4.23, p. 139). Therefore, it is important to properly specify the models that can sandwich the analysis model, which is up to the analyst.
11.4 Structure Learning The previous sections described estimating the parameter set and evaluating the model fitness under the condition that the local dependency structure was already fixed. However, the structure is usually not given a priori; thus, this section shows a method of determining the local dependency structure with the assumption that the number of classes (ranks) and fields is already determined by biclustering (ranklustering) or IRM, and there is no field having only one item. Determining the local dependency structure is referred to as structure learning. To achieve this, a genetic algorithm such as that used in the BNM (Sect. 8.5, p. 407) and LD-LRA (Sect. 9.4, p. 465) is effective. In the author’s experience, however, structure learning methods often select and recommend an uninterpretable structure.
11.4 Structure Learning
567
Although this is partly due to the author’s poor settings for the learning method employed, because the number of candidate structures is enormous, there are a certain number of models that do not explain real phenomena at all but fit the data very well. Such a good-fitting (but unaccountable) model is often selected by machine learning methods. Therefore, it is recommended to develop a rational structure with proper experience and knowledge from the beginning, rather than identifying an uninterpretable model through machine learning and constantly making adjustments to it. Thus, this section shares some knowledge on structure identification. Knacks for Structure Identification (BINET) 1. No fields containing one item only should be created, and all fields are to be labeled in the order of easiness. 2. A large number of classes (C) may as well be supposed, and classes should be topologically sorted. 3. The difference of the class reference vectors obtained by IRM for each pair of classes should be calculated, and if the difference vector has only one large element (i.e., field), the edge between the two classes in the field may as well be assumed. that a field is not created with only one item (see First, as for Point 1, ensure A Field with Only One Item , p. 532). Moreover, it is recommended that the fields are labeled in order of the average PSR from the easiest one, Field 1, to the hardest one, Field F. Next, Point 2 proposes to take a larger number of classes, but the reason for this will be described further in the text. In addition, the classes should be labeled in the order of the average CRR (Class 1 as the lowest and Class C as the most advanced ability group). If there are students with a zero score and students with a full score, let them be grouped in Class 1 and Class C, respectively. Furthermore, Point 2 also strongly recommends that the classes are topologically sorted. When the classes are topologically sorted by class labels, the possible number of structure candidates greatly reduces and any structure always becomes aDAG regardless of the number of edges and their placement ( Topological Sorting , p. 371). Finally, the class reference vectors (CRVs) described in Point 3 are the column vectors of the bicluster reference matrix obtained in the IRM shown in Table 7.14 (p. 345). For example, the CRV of Class 3 is shown in the third column of the table as follows: πˆ 3 = [.711 .031 . . . .066] . Then, we consider the difference in the CRVs of two classes. For example, the CRV difference of Classes 2 and 1 is calculated as πˆ 2 − πˆ 1 = [.605 .023 .173 .003 .003 .240 .070 .010 .005 .106 .010 .007] .
This vector shows that Field 1 has the largest difference. In other words, Field 1 is the item cluster that best discriminates whether a student belongs to Class 1 or Class 2. Based on this IRM result, an edge for Classes 1 → 2 in Field 1 was therefore assumed in the BINET, as shown in Fig. 11.1 (p. 528). When the number of classes C is small, the ability differences between any two classes would range across multiple fields. For example, if C = 3, the students in Class 2 must perform better in all fields than the students in Class 1, so the difference vector between the two classes has several large elements (i.e., fields), making it impossible to narrow the F fields down to one or two. Conversely, when C is large, the difference vector of a class pair will have only one or two large elements. In other words, if C is large, then for each field one can find a class pair whose difference vector is large in only that field. Because the number of classes here is large (C = 13), Classes 1 and 2 were found, as in the above example regarding Field 1, by searching for a pair whose difference vector is large only in the Field 1 element. For this reason, a large C is recommended in Point 2 of "Knacks for Structure Identification (BINET)." Similarly, an edge for Classes 5 → 6 was set in Field 4, because the CRV difference between Classes 6 and 5 was

$$\hat{\boldsymbol{\pi}}_6 - \hat{\boldsymbol{\pi}}_5 = [-.007\ \ .033\ \ .467\ \ .699\ \ .077\ \ -.001\ \ -.130\ \ -.072\ \ -.060\ \ .088\ \ -.004\ \ .013]',$$

in which the largest element is Field 4 (the fourth element). In addition, Field 3 (the third element) also shows a large difference between the CRVs of the two classes; thus, a structure that also places the Class 5 → 6 edge in Field 3 is worth considering.
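As a concrete illustration of the Point 3 heuristic, the following sketch scans the CRV differences of adjacent (topologically sorted) class pairs and proposes an edge in the dominant field, with a second candidate when another element is also clearly large (as with Fields 4 and 3 above). The reference matrix `pi_hat` (F × C) and the thresholds are illustrative assumptions, not values from the book.

```python
import numpy as np

def propose_edges(pi_hat, large=0.4, runner_up=0.25):
    """Yield (class_c, class_d, field) edge candidates (all labels 1-based)."""
    F, C = pi_hat.shape
    for c in range(C - 1):                      # adjacent pair c -> c + 1
        diff = pi_hat[:, c + 1] - pi_hat[:, c]  # CRV difference vector
        order = np.argsort(diff)[::-1]          # fields by descending difference
        if diff[order[0]] >= large:
            yield (c + 1, c + 2, int(order[0]) + 1)
            if diff[order[1]] >= runner_up:     # a second clearly large field
                yield (c + 1, c + 2, int(order[1]) + 1)
```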
The final graph can be fixed gradually by combining such partial structures into a graph, analyzing the data under that graph, checking its fit, and repeating this process several times. This process is certainly a tough task involving much trial and error, but it is often quicker than starting from an unrealistic structure produced by a machine learning method and then modifying it. In addition, in the process of converging on one structure, one deepens one's knowledge and experience of the subject measured by the test.
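The iterate-and-check process just described might be sketched, in a deliberately simplified greedy form, as follows; it reuses the hypothetical `propose_edges` helper above together with a `fit` function returning, e.g., $-$BIC, both of which are assumptions of this sketch rather than the book's procedure.

```python
def refine(pi_hat, fit):
    """Greedy variant of the iterate-and-check loop (illustrative only)."""
    edges = set()
    for cand in propose_edges(pi_hat):
        trial = edges | {cand}
        # Keep an edge only if the fit improves; in practice one would also
        # check that the resulting graph remains pedagogically interpretable.
        if fit(frozenset(trial)) > fit(frozenset(edges)):
            edges = trial
    return edges
```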
11.5 Chapter Summary

This chapter introduced the BINET. Although this model resembles the LDB in that it is built by incorporating biclustering and the BNM, the latent classes are the loci in the LDB, where the interfield dependency structure is analyzed within each class, whereas the latent fields are the loci in the BINET, where the interclass structure is identified within each field. This book repeatedly emphasizes that a model employed in test data analysis should satisfy the four points listed in Sect. 1.5.2 (p. 10): (A) showing an entire map of what the test measures, (B) indicating the position of each student on that map, (C) giving advice about the next action to be taken by each student, and (D) showing the characteristics of each particular road (i.e., field) on the map. The BINET can be said to satisfy all four points. As for (A), the graph obtained by the BINET gives us a map of the routes that students can take to reach the goal (i.e., Class C) from the starting point (i.e., Class 1). In addition, (B) is indicated by the class membership profile of each student. As for (C), the graph shows the students in each class the next class and the field they should learn next. Finally, for (D), the BINET (IRM) specifies the field into which each item is classified, and the statistical features of each field are clarified by the marginal FRP of the field. However, it is difficult to finalize the dependency structure from among the many possible structures. One can apply a machine learning method and determine the best data-fitting structure, but the statistically best-fitting structure is not necessarily a pedagogically meaningful and interpretable model for guiding students. Instead, by properly consulting with experienced and knowledgeable people, examining each meaningful edge one by one, and integrating them, the entire structure can be properly constructed and completed.
References
Akaike, H. (1980b). Likelihood and the Bayes procedure. In J. M. Bernardo, M. H. De Groot, D. V. Lindley, & A. F. M. Smith (Eds.), Bayesian statistics (pp. 143–166; discussion pp. 185–203). Valencia University Press.
Akaike, H. (1987). Factor analysis and AIC. Psychometrika, 52, 317–332.
Akiyama, M. (2014). Algorithm for computerized adaptive test based on latent rank theory: Propositions of the algorithm and its validations by simulation. Japanese Journal for Research on Testing, 10, 81–94. (in Japanese)
Albert, J. H. (1992). Bayesian estimation of normal ogive item response curves using Gibbs sampling. Journal of Educational Statistics, 17, 251–269.
Aldous, D. (1983). Exchangeability and related topics. In École d'été de probabilités de Saint-Flour XIII (pp. 1–198).
Almond, R. G., DiBello, L. V., Moulder, B., & Zapata-Rivera, J.-D. (2007). Modeling diagnostic assessments with Bayesian networks. Journal of Educational Measurement, 44, 341–359.
Almond, R. G., Mulder, J., Hemat, L. A., & Yan, D. (2009). Bayesian network models for local dependence among observable outcome variables. Journal of Educational and Behavioral Statistics, 34, 491–521.
Almond, R. G., Mislevy, R. J., Steinberg, L. S., Yan, D., & Williamson, D. M. (2015). Bayesian networks in educational assessment. Springer.
Amari, S. (1980). Topographic organization of nerve fields. Bulletin of Mathematical Biology, 42, 339–364.
Angoff, W. H. (1984). Scales, norms, and equivalent scores. Educational Testing Service. (Reprint of chapter in R. L. Thorndike (Ed.) (1971), Educational measurement (2nd ed.). American Council on Education.)
Baluja, S. (1994). Population-based incremental learning: A method for integrating genetic search based function optimization and competitive learning. Technical Report CMU-CS-94-163, Carnegie Mellon University.
Bandeen-Roche, K., Miglioretti, D. L., Zeger, S. L., & Rathouz, P. J. (1997). Latent variable regression for multiple discrete outcomes. Journal of the American Statistical Association, 92, 1375–1386.
Bartholomew, D. J. (1987). Latent variable models and factor analysis. Oxford University Press.
Barton, M. A., & Lord, F. M. (1981). An upper asymptote for the three-parameter logistic item-response model. Educational Testing Service Research Report, 81-20.
Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107, 238–246.
Bentler, P. M., & Bonett, D. G. (1980). Significance tests and goodness of fit in the analysis of covariance structures. Psychological Bulletin, 88, 588–606.
Berger, P. L., & Luckmann, T. (1966). The social construction of reality: A treatise in the sociology of knowledge. Anchor.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 397–479). Addison-Wesley.
Bishop, C. M., Svensén, M., & Williams, C. K. I. (1998). The generative topographic mapping. Neural Computation, 10, 215–234.
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: An application of an EM algorithm. Psychometrika, 46, 443–459.
Bock, R. D., & Mislevy, R. J. (1982). Adaptive EAP estimation of ability in a microcomputer environment. Applied Psychological Measurement, 6, 431–444.
Bollen, K. A. (1986). Sample size and Bentler and Bonett's nonnormed fit index. Psychometrika, 51, 375–377.
Bollen, K. A. (1989). A new incremental fit index for general structural equation models. Sociological Methods & Research, 17, 303–316.
Bollen, K. A. (1989). Structural equations with latent variables. John Wiley.
Bollen, K. A., & Curran, P. J. (2005). Latent curve models: A structural equation perspective. Wiley.
Bove, G., Okada, A., & Vicari, D. (2021). Cluster analysis for asymmetry. In G. Bove, A. Okada, & D. Vicari (Eds.), Methods for the analysis of asymmetric proximity data (pp. 119–160). Springer.
Bozdogan, H. (1987). Model selection and Akaike's information criterion (AIC): The general theory and its analytical extensions. Psychometrika, 52, 345–370.
Brin, S., Motwani, R., Ullman, J., & Tsur, S. (1997). Dynamic itemset counting and implication rules for market basket data. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 255–264).
Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 136–162). Sage.
Camilli, G., & Shepard, L. (1994). Methods for identifying biased test items. Sage.
Cattell, R. B. (1952). Factor analysis. Harper.
Cattell, R. B. (1978). Use of factor analysis in behavioral and life sciences. Plenum.
Cheng, L., Watanabe, Y., & Curtis, A. (Eds.). (2004). Washback in language testing: Research contexts and methods. Lawrence Erlbaum Associates.
Chino, N. (1990). A generalized inner product model for the analysis of asymmetry. Behaviormetrika, 27, 25–46.
Chino, N., & Shiraiwa, K. (1993). Geometrical structures of some nondistance models for asymmetric MDS. Behaviormetrika, 20, 35–47.
Clauser, B. E., & Bunch, M. B. (Eds.). The history of educational measurement: Key advancements in theory, policy, and practice. Routledge.
Clogg, C. C. (1995). Latent class models. In G. Arminger, C. C. Clogg, & M. E. Sobel (Eds.), Handbook of statistical modeling for the social and behavioral sciences (pp. 311–359). Plenum Press.
Cohen, J. (1983). The cost of dichotomization. Applied Psychological Measurement, 7, 249–253.
Cooper, G. F., & Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9, 309–347.
Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78, 98–104.
Cover, T. M., & Ordentlich, E. (1996). Universal portfolios with side information. IEEE Transactions on Information Theory, 42, 348–363.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of a test. Psychometrika, 16, 297–334.
Croon, M. (2002). Ordering the classes. In J. A. Hagenaars & A. L. McCutcheon (Eds.), Applied latent class analysis (pp. 137–162). Cambridge University Press.
Culbertson, M. J. (2016). Bayesian networks in educational assessment: The state of the field. Applied Psychological Measurement, 40, 3–21.
Darwiche, A. (2009). Modeling and reasoning with Bayesian networks. Cambridge University Press.
de Ayala, R. J. (2008). The theory and practice of item response theory. Guilford Press.
de Jong, K. A. (2006). Evolutionary computation: A unified approach. A Bradford Book.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.
de Rooij, M., & Heiser, W. J. (2003). A distance representation of the quasi-symmetry model and related distance models. In H. Yanai, A. Okada, K. Shigemasu, Y. Kano, & J. J. Meulman (Eds.), New developments in psychometrics (pp. 487–494). Springer.
Divgi, D. R. (1979). Calculation of the tetrachoric correlation coefficient. Psychometrika, 44, 169–172.
Drasgow, F., & Lissak, R. I. (1983). Modified parallel analysis: A procedure for examining the latent dimensionality of dichotomously scored item responses. Journal of Applied Psychology, 68, 363–373.
Duncan, T. E., Duncan, S. C., Strycker, L. A., Li, F., & Alpert, A. (2006). An introduction to latent variable growth curve modeling: Concepts, issues, and applications (2nd ed.). Routledge.
Everitt, B. S., & Hand, D. J. (1981). Finite mixture distributions. Chapman & Hall.
Everitt, B. S., Landau, S., Leese, M., & Stahl, D. (2011). Cluster analysis (5th ed.). Wiley.
Ferguson, G. A. (1941). The factorial interpretation of test difficulty. Psychometrika, 6, 323–333.
Finch, W. H., & French, B. F. (2018). Educational and psychological measurement. Routledge.
Fischer, G. H., & Molenaar, I. W. (Eds.). (1995). Rasch models: Foundations, recent developments, and applications. Springer.
Fruchter, B. (1954). Introduction to factor analysis. Van Nostrand.
Fukuda, S., Yamanaka, Y., & Yoshihiro, T. (2014). A probability-based evolutionary algorithm with mutations to learn Bayesian networks. International Journal of Artificial Intelligence and Interactive Multimedia, 3, 7–13.
Fujita, T., Nakagawa, H., Sasa, H., Enomoto, S., Yatsuka, M., & Miyazaki, M. (2021). Japanese teachers' mental readiness for online teaching of mathematics following unexpected school closures. International Journal of Mathematical Education in Science and Technology. Advance online publication. https://doi.org/10.1080/0020739X.2021.2005171
Gergen, K. (2015). An invitation to social construction (3rd ed.). Sage.
Goldberg, D. E. (1989). Genetic algorithms in search, optimization, and machine learning. Addison-Wesley Professional.
Gonzalez, J., & Wiberg, M. (2017). Applying test equating methods: Using R. Springer.
Govaert, G., & Nadif, M. (2013). Co-clustering: Models, algorithms and applications. Wiley-ISTE.
Gower, J. C. (1977). The analysis of asymmetry and orthogonality. In J.-R. Barra, F. Brodeau, G. Romier, & B. Van Cutsem (Eds.), Recent developments in statistics (pp. 109–123). North-Holland.
Guilford, J. P. (1965). The minimal phi coefficient and the maximal phi. Educational and Psychological Measurement, 25, 3–8.
Guttman, L. (1954). Some necessary conditions for common factor analysis. Psychometrika, 19, 149–161.
Hagenaars, J. A., & McCutcheon, A. L. (Eds.). (2002). Applied latent class analysis. Cambridge University Press.
Haldane, J. B. S. (1931). A note on inverse probability. Proceedings of the Cambridge Philosophical Society, 28, 55–61.
Hambleton, R. K., & Swaminathan, H. (1984). Item response theory: Principles and applications. Springer.
Hanson, B. A., & Béguin, A. A. (2002). Obtaining a common scale for item response theory item parameters using separate versus concurrent estimation in the common-item equating design. Applied Psychological Measurement, 26, 3–24.
Harris, B. (1988). Tetrachoric correlation coefficient. In L. Kotz & N. L. Johnson (Eds.), Encyclopedia of statistical sciences (Vol. 9, pp. 223–225). Wiley.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning: Data mining, inference, and prediction. Springer.
Hattie, J. (1985). Methodology review: Assessing unidimensionality of tests and items. Applied Psychological Measurement, 9, 139–164.
Holland, J. H. (1975). Adaptation in natural and artificial systems: An introductory analysis with applications to biology, control, and artificial intelligence. The University of Michigan Press.
Horn, J. L. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30, 179–185.
Hubert, L. (1973). Min and max hierarchical clustering using asymmetric similarity measures. Psychometrika, 38, 63–72.
Ishiguro, K., Ueda, N., & Sawada, H. (2012). Subset infinite relational models. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (Vol. 22, pp. 547–555). PMLR.
Jensen, F. V., & Nielsen, T. D. (2007). Bayesian networks and decision graphs. Springer.
Jöreskog, K. G. (1969). A general approach to confirmatory maximum likelihood factor analysis. Psychometrika, 34, 183–202.
Jöreskog, K. G. (1971). Simultaneous factor analysis in several populations. Psychometrika, 36, 409–426.
Jöreskog, K. G., & Sörbom, D. (1979). Advances in factor analysis and structural equation models. Abt Books.
Kaiser, H. F. (1960). The application of electronic computers to factor analysis. Educational and Psychological Measurement, 20, 141–151.
Kaiser, H. F. (1960). Varimax solution for primary mental abilities. Psychometrika, 25, 153–158.
Kasim, A., Shkedy, Z., Kaiser, S., Hochreiter, S., & Talloen, W. (2016). Applied biclustering methods for big and high-dimensional data using R. Chapman & Hall/CRC.
Kato, K., Yamada, T., & Kawahashi, I. (2014). Item response theory with R. Ohmsha. (in Japanese)
Kaplan, D. (2000). Structural equation modeling: Foundations and extensions. Sage.
Kemp, C., Tenenbaum, J., Griffiths, T., Yamada, T., & Ueda, N. (2006). Learning systems of concepts with an infinite relational model. In Proceedings of the 21st AAAI Conference (pp. 381–388).
Kim, S. H., & Cohen, A. S. (2002). A comparison of linking and concurrent calibration under item response theory. Applied Psychological Measurement, 22, 131–143.
Kim, C.-J., & Nelson, C. R. (2017). State-space models with regime switching: Classical and Gibbs-sampling approaches with applications. MIT Press.
Kimura, T., & Nagaoka, K. (2012). Computer adaptive test based on latent rank theory: A proposition of the algorithm and its validation. Japanese Journal for Research on Testing, 8, 69–84. (in Japanese)
Kitagawa, G. (1997). Information criteria for the predictive evaluation of Bayesian models. Communications in Statistics: Theory and Methods, 26, 2223–2246.
Kohonen, T. (1995). Self-organizing maps. Springer.
Kolen, M. J., & Brennan, R. L. (2014). Test equating, scaling, and linking: Methods and practices (3rd ed.). Springer.
Konishi, S., & Kitagawa, G. (1996). Generalized information criteria in model selection. Biometrika, 83, 875–890.
Koski, T., & Noble, J. (2009). Bayesian networks: An introduction. Wiley.
Kosugi, K. E. (2018). Asymmetrical triadic relationship based on the structural difficulty. Behaviormetrika, 45, 7–23.
Kristof, W. (1963). The statistical theory of stepped-up reliability when a test has been divided into several equivalent parts. Psychometrika, 28, 221–238.
Krumhansl, C. L. (1978). Concerning the applicability of geometric models to similarity data: The interrelationship between similarity and spatial density. Psychological Review, 85, 445–463.
Kuder, G. F., & Richardson, M. W. (1937). The theory of the estimation of test reliability. Psychometrika, 2, 151–160.
Kumagai, R. (2007). Item characteristic indices and parameter estimation in NTT when considering it as a discrete variable IRT. Workshop "Neural Test Theory" (Tokyo, Nov. 2007; in Japanese; https://shojima.starfree.jp/ntt/Kumagai2007WS1121.pptx).
Larrañaga, P., Kuijpers, C. M. H., Murga, R. H., & Yurramendi, Y. (1996). Learning Bayesian network structures by searching for the best ordering with genetic algorithms. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, 26, 487–493.
Larrañaga, P., Poza, M., Yurramendi, Y., Murga, R. H., & Kuijpers, C. M. H. (1996). Structure learning of Bayesian networks by genetic algorithms: A performance analysis of control parameters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18, 912–926.
Lawley, D. N., & Maxwell, A. E. (1971). Factor analysis as a statistical method. Butterworths.
Lazarsfeld, P. F., & Henry, N. W. (1968). Latent structure analysis. Houghton Mifflin.
Leighton, J., & Gierl, M. (Eds.). (2007). Cognitive diagnostic assessment for education: Theory and applications. Cambridge University Press.
Liao, W.-W., Ho, R.-G., Yen, Y.-C., & Cheng, H.-C. (2012). The four-parameter logistic item response theory model as a robust method of estimating ability despite aberrant responses. Social Behavior & Personality, 40, 1679–1694.
Linn, R. L. (1993). Educational measurement (3rd ed.). Macmillan USA.
Little, T. D. (2013). Longitudinal structural equation modeling. Guilford Press.
Liu, J. S. (1994). The collapsed Gibbs sampler in Bayesian computations with applications to a gene regulation problem. Journal of the American Statistical Association, 89, 958–966.
Long, S. (1983). Confirmatory factor analysis. Sage University Paper Series on Quantitative Applications in the Social Sciences, No. 33. Sage.
Long, B., Zhang, Z., & Yu, P. S. (2019). Relational data clustering: Models, algorithms, and applications. Chapman & Hall/CRC.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Routledge.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Addison-Wesley.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174.
Matloff, N. (2020). Probability and statistics for data science. CRC Press.
Matsumiya, I., & Shojima, K. (2009a). Making a can-do table using neural test theory. In Proceedings of the 37th Annual Meeting of the Behaviormetric Society of Japan (pp. 58–59). (in Japanese)
Matsumiya, I., & Shojima, K. (2009b). Making a subject-based can-do table using neural test theory. In Proceedings of the 7th Annual Meeting of the Japan Association for Research on Testing (pp. 232–233). (in Japanese)
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). Macmillan.
Messick, S. (1995). Validity of psychological assessment. American Psychologist, 50, 741–749.
McDonald, R. P. (1999). Test theory: A unified treatment. Erlbaum.
McDonald, R. P., & Ahlawat, K. S. (1974). Difficulty factors in binary data. British Journal of Mathematical and Statistical Psychology, 27, 82–99.
McLachlan, G., & Peel, D. (2000). Finite mixture models. John Wiley & Sons.
Michalewicz, Z. (1996). Genetic algorithms + data structures = evolution programs (3rd ed.). Springer.
Mirman, D. (2014). Growth curve analysis and visualization using R. Chapman & Hall/CRC.
Mislevy, R. J. (1986). Bayes modal estimation in item response models. Psychometrika, 51, 177–195.
Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159–176.
Muthén, L. K., & Muthén, B. O. (2017). Mplus user's guide (8th ed.). Muthén & Muthén.
Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9, 249–265.
Nichols, P. D., Chipman, S. F., & Brennan, R. L. (Eds.). (1995). Cognitively diagnostic assessment. Routledge.
O'Connell, A. A., & McCoach, D. B. (Eds.). (2008). Multilevel modeling of educational data. Information Age Publishing.
Oja, E., & Kaski, S. (Eds.). (1999). Kohonen maps. Elsevier Science.
Okada, K. (2012). A Bayesian approach to asymmetric multidimensional scaling. Behaviormetrika, 39, 49–62.
Okada, A., & Imaizumi, T. (1987). Nonmetric multidimensional scaling of asymmetric proximities. Behaviormetrika, 21, 81–96.
Okada, A., & Imaizumi, T. (1997). Asymmetric multidimensional scaling of two-mode three-way proximities. Journal of Classification, 14, 195–224.
Okada, A., & Iwamoto, T. (1996). University enrollment flow among the Japanese prefectures: A comparison before and after the Joint First Stage Achievement Test by asymmetric cluster analysis. Behaviormetrika, 23, 169–185.
Okada, A., & Tsurumi, H. (2012). Asymmetric multidimensional scaling of brand switching among margarine brands. Behaviormetrika, 39, 111–126.
Okada, A., & Tsurumi, H. (2013). External analysis of asymmetric multidimensional scaling based on singular value decomposition. In P. Giudici, S. Ingrassia, & M. Vichi (Eds.), Statistical models for data analysis (pp. 269–278). Springer.
Okada, A., & Yokoyama, S. (2015). Asymmetric CLUster analysis based on SKEW-symmetry: ACLUSKEW. In I. Morlini, T. Minerva, & M. Vichi (Eds.), Advances in statistical models for data analysis (pp. 191–199). Springer.
Olsson, U. (1979). Maximum likelihood estimation of the polychoric correlation coefficient. Psychometrika, 44, 443–460.
Olsson, U., Drasgow, F., & Dorans, N. J. (1982). The polyserial correlation coefficient. Psychometrika, 47, 337–347.
Olszewski, D., & Šter, B. (2014). Asymmetric clustering using the alpha-beta divergence. Pattern Recognition, 47, 2031–2041.
Otsu, T. (2004). Nonlinear factor analysis with spline transformation of latent variables. The Japanese Journal of Behaviormetrics, 31, 1–15.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. Morgan Kaufmann.
Raykov, T. (1997). Estimation of composite reliability for congeneric measures. Applied Psychological Measurement, 21, 173–184.
Raykov, T. (2007). Reliability if deleted, not "alpha if deleted": Evaluation of scale reliability following component deletion. British Journal of Mathematical and Statistical Psychology, 60, 201–216.
Raykov, T. (2008). Alpha if item deleted: A note on criterion validity loss in scale revision if maximising coefficient alpha. British Journal of Mathematical and Statistical Psychology, 61, 275–285.
Reckase, M. D. (1979). Unifactor latent trait models applied to multifactor tests: Results and implications. Journal of Educational Statistics, 4, 207–230.
Reckase, M. D. (2009). Multidimensional item response theory. Springer.
Reiss, J. (2013). Philosophy of economics: A contemporary introduction. Routledge.
Reichenberg, R. (2018). Dynamic Bayesian networks in educational measurement: Reviewing and advancing the state of the field. Applied Measurement in Education, 31, 335–350.
Ritter, H., & Schulten, K. (1986). On the stationary state of Kohonen's self-organizing sensory mapping. Biological Cybernetics, 54, 99–106.
Robinson, R. W. (1973). Counting labeled acyclic digraphs. In F. Harary (Ed.), New directions in the theory of graphs (pp. 239–273). Academic Press.
Rost, J. (1990). Rasch models in latent classes: An integration of two approaches to item analysis. Applied Psychological Measurement, 14, 271–282.
Rost, J. (1991). A logistic mixture distribution model for polytomous item responses. British Journal of Mathematical and Statistical Psychology, 44, 75–92.
Rubin, D. B., & Thayer, D. T. (1982). EM algorithms for ML factor analysis. Psychometrika, 47, 69–76.
Rupp, A. A., Templin, J., & Henson, R. A. (2010). Diagnostic measurement: Theory, methods, and applications. Guilford Press.
Saburi, S., & Chino, N. (2008). A maximum likelihood method for an asymmetric MDS model. Computational Statistics and Data Analysis, 52, 4673–4684.
Şahin, A., & Anil, D. (2017). The effects of test length and sample size on item parameters in item response theory. Educational Sciences: Theory & Practice, 17, 321–335.
Saito, T., & Yadohisa, H. (2005). Data analysis of asymmetric structures: Advanced approaches in computational statistics. Marcel Dekker.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph, 17.
Sangole, A. (2009). Spherical self-organizing maps: A comprehensive view. VDM Verlag.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461–464.
Scutari, M., & Denis, J.-B. (2014). Bayesian networks: With examples in R. Chapman & Hall/CRC.
Searle, J. (1996). The construction of social reality. Penguin.
Shiba, S. (1972). Factor analysis methods. University of Tokyo Press. (in Japanese)
Shigemasu, K., & Nakamura, T. (1996). A Bayesian marginal inference in estimating item parameters using the Gibbs sampler. Behaviormetrika, 23, 97–110.
Shojima, K. (2007a). A sketch of artimage processing. In T. Saijo, G. Sugamura, & S. Saito (Eds.), The emerging human sciences (pp. 146–160). Kitaoji-Shobo. (in Japanese)
Shojima, K. (2007b). Selection of item response model by genetic algorithm. Behaviormetrika, 34, 1–26.
Shojima, K. (2007c). The graded neural test model: A neural test model for ordered polytomous data. DNC Research Note, 07-03. (https://shojima.starfree.jp/ntt/Shojima2007RN07-03.pdf)
Shojima, K. (2008a). Neural test theory: A latent rank theory for analyzing test data. DNC Research Note, 08-01. (https://shojima.starfree.jp/ntt/Shojima2008RN08-01.pdf)
Shojima, K. (2008b). The batch-type neural test model: A latent rank model with the mechanism of generative topographic mapping. DNC Research Note, 08-06. (https://shojima.starfree.jp/ntt/Shojima2008RN08-06.pdf)
Shojima, K. (2008–). Exametrika (Software). (https://shojima.starfree.jp/exmk/index.htm)
Shojima, K. (2009). Neural test theory. In K. Shigemasu, A. Okada, T. Imaizumi, & T. Hoshino (Eds.), New trends in psychometrics (pp. 407–416). Universal Academy Press.
Shojima, K. (2010). Classical test theory. In M. Ueno & K. Shojima (Eds.), New trends in educational evaluation (pp. 37–55). Asakura-Shoten. (in Japanese)
Shojima, K. (2011). Local dependence model in latent rank theory. Japanese Journal of Applied Statistics, 40, 141–156.
Shojima, K. (2012). Asymmetric triangulation scaling: Asymmetric multidimensional scaling for visualizing inter-item dependency structure. Behaviormetrika, 39, 27–48.
Shojima, K. (2014). Asymmetrika 2.2 (Software). (https://shojima.starfree.jp/asmk/index.htm)
Shojima, K. (2018). Achievement test. Annual Review of Japanese Child Psychology, 57, 215–234. (in Japanese)
Shojima, K. (2019). Bicluster network model. In Proceedings of the 47th Annual Meeting of the Behaviormetric Society (pp. 130–133).
Shojima, K. (2020a). The maximal Pearson's correlation coefficient between Likert items from the perspective of Hitchcock's transportation problem. In Proceedings of the 84th Annual Convention of the Japanese Psychological Association, PF-001.
Shojima, K. (2020b). Ranklustering: Biclustering for (sample ranking) × (variable clustering). In Proceedings of the 48th Annual Meeting of the Behaviormetric Society (pp. 198–201).
Shojima, K. (2021). Local dependence biclustering. In Proceedings of the 49th Annual Meeting of the Behaviormetric Society (pp. 106–109).
Shojima, K., & Toyoda, H. (2002). Estimation of Cronbach’s alpha coefficient in the context of item response theory. The Japanese Journal of Psychology, 73, 227–233. (in Japanese). Shojima, K., & Toyoda, H. (2004). Item parameter estimation when a test contains different item response models. The Japanese Journal of Educational Psychology, 52, 61–70. (in Japanese). Singer, J. D. & Willett, J. B. (2003). Applied longitudinal data analysis: Modeling change and event occurrence. Oxford University Press. Spearman, C. E. (1904). “General intelligence”, objectively determined and measured. American Journal of Psychology, 15, 201–293. Spearman, C. (1927). The abilities of man. Macmillan. Spearman, C. E. (1904). “General intelligence”, objectively determined and measured. American Journal of Psychology, 15, 201–293. Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & van der Linde, A. (2002). Bayesian measures of model complexity and fit (with discussion). Journal of the Royal Statistical Society, B64, 583–639. Spirites, P., Glymour, C., & Scheines, R. (1993). Causation, prediction, and search. Springer. Spirites, P., Glymour, C., Scheines, R., & Tillman, R. (2010). Automate search for causal relations: Theory and practice. In R. Dechter, H. Geffner, & J. Halpern (Eds.), Heuristics, probability, and causality: A tribute to Judea Pearl (pp. 467–506). Colledge Publications. Steinberg, L., & Thissen, D. (2006). Using effect sizes for research reporting: Examples using item response theory to analyze differential item functioning. Psychological Methods, 11, 402–415. Suess, E. A. A., & Trumbo, B. E. (2010). Introduction to probability simulation and Gibbs sampling with R (Use R!). Springer. Sugino, N., Yamakawa, K., Ohba, H., Shojima, K., Shimizu, Y., & Nakano, M. (2012). Characterizing individual learners on an empirically-developed can-do system: An application of latent rank theory. In W. M. Chan, K. N., Chin, S. K. Bhatt, & I. Walker (Eds.), Perspectives on individual characteristics and foreign language education (pp. 131–149). De Gruyter Mouton. Sugino, N., & Shojima, K. (2012). An application of latent rank theory in developing a can-do chart of Japanese EFL learners. Studies in English Language Education, 3, 21–30. Sugino, N., Yamakawa, K., Ohba, H., Shojima, K., Shimizu, Y., & Nakano, M. (2012). Characterizing individual learners on an empirically-developed can-do system: An application of latent rank theory. In W. M. Chan, K. N., Chin, S. K. Bhatt, & I. Walker (Eds.), Perspectives on individual characteristics and foreign language education (pp. 131–149). De Gruyter Mouton. Suzuki, J. (1999). Learning Bayesian belief networks based on the MDL principle: An efficient algorithm using Branch and Bound technique. IEICE Transactions on Fundamentals, E82-D, 356–367. Takane, Y., & de Leeuw, J. (1987). On the relationship between item response theory and factor analysis of discretized variables. Psychometrika, 52, 393–408. Takeuchi, A., Saito, T., & Yadohisa, H. (2007). Asymmetric agglomerative hierarchical clustering algorithms and their evaluations. Journal of Classification, 24, 123–143. Tatsuoka, K. K. (2009). Cognitive assessment: An introduction to the rule space method. Routledge. Thissen, D. (1991). MULTILOG user’s guide (Version 6). Scientific Software. Thurstone, L. L. (1938). Primary mental abilities. Psychometric Monographs No. 1. Thurstone, L. L. (1947). Multiple-factor analysis: A development and expansion of the vectors of mind. University of Chicago Press. Tobler, W. (1976–77). 
Spatial interaction patterns. Journal of Environmental Systems, 6, 271–301. Toyoda, H. (1998). An introduction to covariance structure analysis: Structural equation modeling. Asakura Shoten (in Japanese). Toyoda, H. (2012). An introdution to item response theory (2nd Edn). Asakura Shoten (in Japanese). Tsamardinos, I., Brown, L. E., & Aliferis, C. F. (2006). The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65, 31–78. Tsutakawa, R. K. (1984). Estimation of two-parameter logistic item response curves. Journal of Educational Statistics, 9, 263–276.
Tsutakawa, R. K., & Lin, H. Y. (1986). Bayesian estimation of item response curves. Psychometrika, 51, 251–267. Ueno, M. (2002). An extension of the IRT to a network model. Behaviormetrika, 29, 59–79. Ueno, M. (2013). Bayesian network. Coronasha (in Japanese). van der Linden, W. J. (2016). Handbook of item response theory, Volume one: Models. Chapman & Hall/CRC. van der Linden, W. J. & Glas, C. A. W. (Eds.) (2000). Computerized adaptive testing: Theory and practice. Springer. Verbeke, G., & Molenberghs, G. (2009). Linear mixed models for longitudinal data. Springer. von Davier, A. (2012). Statistical models for test equating, scaling, and linking. Springer. Wainer, H., Dorans, N. J., Flaugher, R., Green, B. F., & Mislevy, R. J. (2000). Computerized adaptive testing: A primer (2nd ed.). Routledge. Watanabe, S. (2009). Algebraic geometry and statistical learning theory. Cambridge University Press. Watanabe, S. (2010). Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research, 11, 3571–3594. Weeks, D. G., & Bentler, P. M. (1982). Restricted multidimensional scaling models for asymmetric proximities. Psychometrika, 47, 201–208. Weinberg, D. (2014). Contemporary social constructionism: Key themes. Temple University Press. Wierzcho´n, S., & Kłopotek, M. (2018). Modern algorithms of cluster analysis. Springer. Wong, M. L., Lam, W., & Leung, K. S. (1999). Using evolutionary programming and minimum description length principle for data mining of Bayesian networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21, 174–178. Yadohisa, H., & Niki, N. (1999). Vector field representation of asymmetric proximity data. Communications in Statistics, Theory and Methods, 28, 35–48. Yanai, H., & Ichikawa, M. (2007). Factor analysis. In C. R. Rao, & S. Sinharay (Eds.) Handbook of statistics 26: Psychometrics (pp. 257–296). Elsevier. Yehezkel, R., & Lerner, B. (2009). Bayesian network structure learning by recursive autonomy identification. Journal of Machine Learning Research, 10, 1527–1570. Yen, W. M. (1984). Effect of local item dependence on the fit and equating performance of the three-parameter logistic model. Applied Psychological Measurement, 8, 125–145. Yen, W. M. (1993). Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational Measurement, 30, 187–213. Zielman, B., & Heiser, W. J. (1993). Analysis of asymmetry by a slide-vector. Psychometrika, 58, 101–114. Zimowski, M. F., Muraki, E., Mislevy, R. J., & Bock, R. D. (2003). BILOG-MG 3: Item analysis and test scoring with binary logistic models. Scientific Software International, Inc.