
ASSESSING MEASUREMENT INVARIANCE FOR APPLIED RESEARCH

Assessing Measurement Invariance for Applied Research will provide psychometricians and researchers across diverse disciplines in the social sciences with the necessary knowledge and skills to select and apply appropriate methods to assess measurement invariance. It is a user-friendly guide that describes a variety of statistical methods using a pedagogical framework emphasizing conceptual understanding, with extensive illustrations that demonstrate how to use software to analyze real data. A companion website (people.umass.edu/cswells) provides downloadable computer syntax and the data sets demonstrated in this book so readers can use them to become familiar with the analyses and understand how to apply the methods with proficiency to their own work. Evidence-supported methods that can be readily applied to real-world data are described and illustrated, providing researchers with many options from which to select given the characteristics of their data. The approaches include observed-score methods and those that use item response theory models and confirmatory factor analysis.

Craig S. Wells is a professor in the Research, Educational Measurement and Psychometrics program at the University of Massachusetts (UMass) Amherst, where he teaches courses on statistics and psychometrics, including structural equation modeling and psychometric modeling. He is also the associate director for the Center for Educational Assessment at UMass. His research interests pertain to statistical methods for detecting differential item functioning and assessing the fit of item response theory models. He also has a keen interest in philosophy of science, especially how it relates to hypothesis testing. He coedited Educational Measurement: From Foundations to Future (2016) and was president of the Northeastern Educational Research Association in 2017.


EDUCATIONAL AND PSYCHOLOGICAL TESTING IN A GLOBAL CONTEXT

Series Editor: Neal Schmitt, Michigan State University

The Educational and Psychological Testing in a Global Context series features advanced theory, research, and practice in the areas of international testing and assessment in psychology, education, counseling, organizational behavior, human resource management, and all related disciplines. It aims to explore, in great depth, the national and cultural idiosyncrasies of test use and how they affect the psychometric quality of assessments and the decisions made on the basis of measures. Our hope is to contribute to the quality of measurement and to facilitate the work of professionals who must use practices or measures with which they may be unfamiliar or adapt familiar measures to a local context.

Published Titles:
Adapting Tests in Linguistic and Cultural Contexts (2017)
International Applications of Web-Based Testing: Challenges and Opportunities (2017)
Schooling Across the Globe: What We Have Learned from 60 Years of Mathematics and Science International Assessments (2018)
Higher Education Admission Practices: An International Perspective (2020)

Forthcoming Titles:
Comparative Histories of Psychological Assessment


ASSESSING MEASUREMENT INVARIANCE FOR APPLIED RESEARCH

CRAIG S. WELLS
University of Massachusetts Amherst


University Printing House, Cambridge CB2 8BS, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
314–321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre, New Delhi – 110025, India
79 Anson Road, #06–04/06, Singapore 079906

Cambridge University Press is part of the University of Cambridge. It furthers the University’s mission by disseminating knowledge in the pursuit of education, learning, and research at the highest international levels of excellence.

www.cambridge.org
Information on this title: www.cambridge.org/9781108485227
DOI: 10.1017/9781108750561

© Cambridge University Press 2021

This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published 2021

A catalogue record for this publication is available from the British Library.

ISBN 978-1-108-48522-7 Hardback

Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this publication and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.


This book is dedicated to my loving family: Amanda, Madeline, and Elianna.


Contents

List of Figures
List of Tables
Preface

1 Introduction
   What Is Measurement Invariance?
   Why Should We Assess Measurement Invariance?
   Forms of DIF
   Classification of DIF Detection Methods
   The Conditioning Variable: To Purify or Not to Purify?
   Considerations When Applying Statistical Tests for DIF

2 Observed-Score Methods
   Transformed Item Difficulty (Delta-Plot) Method
   Mantel–Haenszel DIF Procedure
   Standardization DIF Procedure
   Logistic Regression
   SIBTEST
   Applying Observed-Score Methods to Incomplete Data Sets

3 Item Response Theory
   Unidimensional IRT Models
   Dichotomous Item Responses
   Polytomous Item Responses
   Multidimensional IRT Models
   Model Assumptions
   Parameter Estimation
   The Latent Variable Scale
   Linking IRT Scales
   Item and Testing Information Functions
   Test Information
   IRT Parameter Estimation Software

4 Methods Based on Item Response Theory
   b- and a-Plots
   Lord’s Chi-Square and Wald Statistic
   Likelihood-Ratio Test
   Raju’s Area DIF Measures
   DFIT
   Applying IRT DIF Methods to Incomplete Data Sets

5 Confirmatory Factor Analysis
   Basic Principles of CFA
   Example of CFA Using Mplus
   Analyzing Categorical Indicators
   Description and Example of the Bifactor Model

6 Methods Based on Confirmatory Factor Analysis
   Multigroup Confirmatory Factor Analysis
   MIMIC Model
   Applying CFA-Based Methods to Incomplete Data Sets

Appendix A: A Brief R Tutorial
References
Author Index
Subject Index

You can download the data files and computer syntax described in this book at the companion website: people.umass.edu/cswells

Figures

1.1 An example of functional relationships for two groups on a dichotomously scored item that is invariant.
1.2 An example of functional relationships for two groups on a dichotomously scored item that lacks invariance.
1.3 An illustration of an invariant and non-invariant polytomous item.
1.4 Two examples of nonuniform DIF.
2.1 Plot of the Δ-values for the reference and focal groups.
2.2 Delta-plot of the reference and focal groups produced by the diagPlot function within R package deltaPlotR.
2.3 Plot of the conditional p-values for the reference and focal groups: (a) item 10 and (b) item 20.
2.4 Plot of the conditional p-values and difference in conditional p-values across the matching variable for item 20.
2.5 Plot of conditional expected scores and difference in conditional expected scores across the matching variable for item 11.
3.1 An example of an IRF for a hypothetical multiple-choice math item.
3.2 An example of an IRF from the 3PL model.
3.3 Examples of IRFs from the 3PL model.
3.4 Examples of IRFs from the 2PL model.
3.5 Examples of IRFs from the 1PL model.
3.6 Example of boundary response functions for Samejima’s GR model.
3.7 Category response functions based on the boundary response functions shown in Figure 3.6.
3.8 Conditional expected item score based on the category response functions shown in Figure 3.7.
3.9 Category response functions for two partial-credit items from the GPC model.
3.10 An example of an IRS for a multidimensional 2PL model.
3.11 An example of an empirical and model-based IRF for a fitting item (top) and a misfitting item (bottom).
3.12 An example of the conditional observed proportions for an item with 10 intervals (groups) used in item parameter estimation.
3.13 An example of a TCF based on the 3PL model for a 36-item, multiple-choice test.
3.14 A scatterplot of unscaled item difficulty parameter estimates for two groups with disparate proficiency distributions.
3.15 Examples of the IIF for a 2PL and 3PL model item.
3.16 Examples of the IIF from the 3PL model.
3.17 Example of a TIF and CSEE based on the 3PL model for a 36-item test.
3.18 A plot for the conditional reliability along with the TIF from Figure 3.17.
4.1 IRFs illustrating uniform and nonuniform DIF.
4.2 IRFs illustrating uniform DIF where the difference in the difficulty parameter estimates equals 0.5 for a high and low discriminating item.
4.3 b-plot using difficulty parameter estimates from the 3PL model.
4.4 a-plot using discrimination parameter estimates from the 3PL model.
4.5 b-plot from Samejima’s GR model using the thresholds.
4.6 b-plot from Samejima’s GR model using the average of the thresholds.
4.7 b-plot using bs from the 3PL model and the individual thresholds from the GR model.
4.8 Example of noncentral chi-square distribution used to test a range-null hypothesis using Lord’s chi-square statistic.
4.9 Comparison of IRFs for the reference and focal groups for item 20, flagged as DIF using Lord’s chi-square test.
4.10 Comparison of expected item scores for item 11, which was flagged for DIF using Lord’s chi-square test.
4.11 Plot of IRFs for the focal and reference groups produced by the PlotNcdif function for item 20 (large DIF).
4.12 Plot of IRFs for the focal and reference groups produced by the PlotNcdif function for item 9 (trivial DIF).
4.13 Plot of expected test scores for the focal and reference groups produced by the plot.tcf function.
4.14 Plot of IRFs for the focal and reference groups produced by the PlotNcdif function for item 4.
4.15 Plot of IRFs for the focal and reference groups produced by the PlotNcdif function for item 11.
5.1 Path diagram for a two-factor model.
5.2 Illustrating under-identified, just-identified, and over-identified models using simple linear regression.
5.3 Illustration of thresholds for an ordinal item with four categories.
5.4 Example of a bifactor model.
6.1 Example of a MIMIC model to test uniform DIF in the first indicator.
6.2 Example of a MIMIC model that includes the interaction term.
6.3 Path diagram illustrating longitudinal measurement invariance.
A.1 Screenshot of the R website.
A.2 Screenshot of the R window and console.

Tables

2.1 Selected Output from Delta-Plot Analysis Using the deltaPlot Function in R and a Threshold of 1.5 to Flag Items as DIF.
2.2 2 × 2 Contingency Table for the mth Score Level.
2.3 Hypothetical 2 × 2 Contingency Table Data for the mth Score Level.
2.4 Abbreviated Output from MH DIF Analysis Using the difMH Function in R.
2.5 Output from the CI.for.D Function that Provides the 100(1 − α)% Confidence Interval for DαMH.
2.6 Output from the PDIF Function that Computes PDIF and Its Corresponding 100(1 − α)% Confidence Interval.
2.7 G × 2 Contingency Table for the mth Score Level.
2.8 Output (Abbreviated) from GMH DIF Analysis Using the difGMH Function in R.
2.9 2 × C Contingency Table for the mth Score Level.
2.10 Output from Mantel DIF Procedure for Polytomous Data Using the Mantel.poly Function in R.
2.11 Output from GMH DIF Procedure for Polytomous Data Using the GMH.poly Function in R.
2.12 Abbreviated Output from Standardization DIF Analysis Using the difSTD Function in R.
2.13 Output from Standardization DIF Analysis for Polytomous Data Using the SMD Function in R.
2.14 Output from the lrm Function When Performing Logistic Regression for Model 3 that Includes the Matching and Grouping Variables along with the Interaction Term for Item 1.
2.15 Output from the lrm Function When Performing Logistic Regression for Model 1 that Includes Only the Matching Variable for Item 1.
2.16 Abbreviated Output from the DIF.Logistic Function.
2.17 Output from the lrm Function When Performing Logistic Regression for Model 3 for Dichotomous Data and Three Groups for Item 4.
2.18 Output from the lrm Function When Performing Logistic Regression for Model 1 for Dichotomous Data and Three Groups for Item 4.
2.19 Abbreviated Output from the DIF.Logistic.MG Function.
2.20 Output from the lrm Function When Performing Logistic Regression for Model 3 for Polytomous Data for Item 1.
2.21 Output from the lrm Function When Performing Logistic Regression for Model 1 for Polytomous Data for Item 1.
2.22 Abbreviated Output from the DIF.Logistic Function.
2.23 Output from the SIBTEST Function for Testing Item 21 for DIF.
2.24 Conditional Item Scores from the SIBTEST Function for Testing Item 21 for DIF.
2.25 Abbreviated Output from SIBTEST.exp Function Applied to Dichotomous Data.
2.26 Output from the SIBTEST Function for Testing Item 12 for DIF.
2.27 Output from SIBTEST.exp Function Applied to Polytomous (Likert-type) Data with the Two-Step Purification Procedure.
2.28 Output from the SIBTEST Function for Testing DBF Based on Items 16 through 20.
3.1 flexMIRT Input File for Fitting 3PL and GPC Models.
3.2 Item Parameter Estimates for 3PL and GPC Model from flexMIRT.
3.3 flexMIRT Input File for Fitting Samejima’s GR Model.
3.4 Item Parameter Estimates for GR Model from flexMIRT Reported in Slope-Intercept Form (Top) and IRT Parameterization.
4.1 Output from the b-plot Analysis Using the b.plot Function in R with the Two-Step Purification Procedure.
4.2 Metadata for the Reference Group Found in Ref.est.
4.3 Abbreviated Output from the LordChi Function for Testing DIF under the 3PL Model.
4.4 Metadata for the Reference Group’s Likert-type Data in the Data Frame Ref.est.
4.5 Abbreviated Printout of Lord.results for Likert Data Using Samejima’s GR Model.
4.6 Metadata for the Reference Group’s Stored in the Data Frame Ref.est.
4.7 Example Data File (Abbreviated) with 30 Dichotomous Items (Columns 1 through 30) and Group Membership in Column 31.
4.8 Information to Execute the IRTLRDIF Program for the 3PL Model.
4.9 Abbreviated Output from the IRTLRDIF Program.
4.10 Content of Input File to Run with IRTLRDIF.
4.11 Content of Input File to Run with IRTLRDIF with an Anchor.
4.12 Content of Input File to Run with IRTLRDIF for Samejima’s GR Model.
4.13 Output from IRTLRDIF for Testing DIF Using Samejima’s GR Model.
4.14 Content of Input File to Run with IRTLRDIF for a Mixed-Format Test that Uses the 3PL, 2PL, and Samejima’s GR Models.
4.15 Abbreviated Output from the difRaju Function for the 3PL Model with g Fixed to 0.2.
4.16 Abbreviated Output from difRaju Function for the 1PL Model.
4.17 NCDIF Estimates from the 3PL Model.
4.18 CDIF Estimates for the 3PL Model.
4.19 NCDIF Estimates from Samejima’s GR Model.
5.1 Hypothetical Correlation Matrix for Six Observed Variables Based on Two Factors.
5.2 Hypothetical Correlation Matrix for Six Observed Variables Based on One Factor.
5.3 Mplus Input Code for Two-Factor CFA Model.
5.4 Descriptive Statistics for Subscore Indicators Reported by Mplus.
5.5 Model Fit Information for Fitting a Two-Factor CFA Model Reported by Mplus.
5.6 Standardized Residuals from a Two-Factor Model Reported by Mplus.
5.7 Unstandardized Parameter Estimates.
5.8 Standardized Parameter Estimates.
5.9 R-Squared Values.
5.10 Mplus Input Code for Fitting a One-Factor Model to the Subscores.
5.11 Mplus Input Code for Fitting a Two-Factor Model to Item Responses.
5.12 Mplus Input Code for Fitting a One-Factor Model to Item Responses.
5.13 Mplus Output for Chi-Square Difference Test.
5.14 Mplus Input Code for Fitting a One-Factor Model to Likert-Type Responses.
5.15 Mplus Input Code for Fitting a Bifactor Model.
5.16 Parameter Estimates from the Bifactor and One-Factor Models.
6.1 Mplus Input File for Testing Configural Invariance (Equal Dimensionality Structure) for the Reference Group (Reference = 0) on the Subscores.
6.2 Fit Statistics for Evaluating Configural, Metric, Scalar, and Strict Factorial Invariance on the Subscore Data.
6.3 Mplus Input File for Configural Invariance Model in Which the Factor Loadings and Intercepts Are Freely Estimated in Each Group.
6.4 Parameter Estimates from Mplus for the Configural Invariance, Metric Invariance, Scalar Invariance, and Partially Invariant Scalar Models.
6.5 Mplus Input File for Metric Invariance Model in Which the Factor Loadings Are Constrained to Be Equal between the Groups While the Intercepts Are Freely Estimated.
6.6 Mplus Input File for the Scalar Invariance Model in Which the Factor Loadings and Intercepts Are Constrained to Be Equal between the Groups.
6.7 Modification Indices Reported in the Mplus Output for the Scalar Model.
6.8 Mplus Input File for the Scalar Partial Invariance Model in Which the Intercept for the Fifth Indicator (C5) Has Been Freely Estimated in Both Groups.
6.9 Modification Indices Reported in the Mplus Output for the Partially Invariant Scalar Model.
6.10 Mplus Input File for Testing Strict Factorial Invariance in Which the Factor Loadings, Intercepts (Except for C5), and Indicator Residuals Have Been Constrained to Be Equal between the Groups.
6.11 Mplus Input File for Configural, Metric, and Scalar Invariance Using the MODEL = CONFIG METRIC SCALAR Command.
6.12 Abbreviated Mplus Output Reporting the Statistics for Testing Configural, Metric, and Scalar Invariance.
6.13 Mplus Input File for Testing Configural Invariance (Equal Dimensionality Structure) for the Reference Group (Reference = 0) on the Item Responses Using Categorical Indicators.
6.14 Fit Statistics for Evaluating Configural, Metric, Scalar, and Strict Factorial Invariance on the Subscore Data.
6.15 Mplus Input File for Configural Invariant (Baseline) Model in Which Factor Loadings and Thresholds Are Freely Estimated in Each Group Using Categorical Indicators.
6.16 Mplus Input File for the Metric Invariance Model in Which Factor Loadings Are Constrained to Be Equal between Groups While the Intercepts Are Freely Estimated Using Categorical Indicators.
6.17 Mplus Input File for the Partial Metric Invariance Model in Which the Factor Loadings Are Constrained to Be Equal between Groups Except for Indicator 19.
6.18 Mplus Input File for the Scalar Invariance Model in Which the Factor Loadings and Intercepts Are Constrained to Be Equal between the Groups Using Categorical Indicators.
6.19 Mplus Input File for the Scalar Partial Invariance Model in Which the Thresholds for the 19th and 40th Indicators Have Been Freely Estimated in Both Groups.
6.20 Threshold Estimates for Items 12 and 40.
6.21 Mplus Input File for the Scalar Partial Invariance Model in Which the Thresholds for the 12th, 19th, and 40th Indicators Have Been Freely Estimated in Both Groups.
6.22 Mplus Input File for the Strict (Partial) Invariance Model in Which the Residual Variances Are Unconstrained.
6.23 Mplus Input File for Testing a One-Factor Model on the Subscore Data.
6.24 Mplus Input File for Testing Whether the First Content Domain Functions Differentially between the Groups.
6.25 Mplus Input File for Fixing the Path Coefficients between the Grouping Variable and Indicators to Zero.
6.26 Mplus Input File for Testing Nonuniform and Uniform DIF.
6.27 Mplus Input Code for Fixing the Path Coefficients to Zero and Requesting Modification Indices when Testing Nonuniform and Uniform DIF.
6.28 Mplus Input File for Testing Uniform DIF via the Wald Statistic and Estimating the Parameters Using Robust WLS Estimation.
6.29 Augmented Model Where the Path Coefficient for the Grouping Variable Was Estimated for the First Indicator.
6.30 Compact Model Where the Path Coefficient for the Grouping Variable Was Fixed to Zero for Each Indicator.
6.31 Mplus Input Code When Fixing the Path Coefficients to Zero.
6.32 Logistic Regression Odds Ratio Statistics Reported in Mplus Output.
6.33 Mplus Input File for Testing Nonuniform and Uniform DIF Using the Categorical Indicators.
6.34 Mplus Input Code for Fixing the Path Coefficients to Zero for Categorical Indicators.
6.35 Mplus Input Code for Estimating the Path Coefficient Regressing Item 12 on the Grouping Variable but Fixing the Path Coefficient to Zero for the Interaction Term.
6.36 Mplus Input Code for Fitting a Baseline Model for Assessing Longitudinal Measurement Invariance.
6.37 Fit Statistics for Evaluating Configural, Metric, Scalar, and Strict Factorial Longitudinal Invariance for the Anxiety Test.
6.38 Mplus Input Code for Fitting the Metric Longitudinal Invariance Model.
6.39 Input Code for Fitting the Scalar Longitudinal Invariance Model.
A.1 R Code Used in This Appendix.


Preface

Why write a book about assessing measurement invariance? This question is not about justifying the importance of examining measurement invariance (I address that issue in Chapter 1) but instead about why a book on measurement invariance is useful for social science researchers. In the past 30 years, there have been several excellent texts that have described statistical methods for detecting a lack of measurement invariance (often referred to as differential item functioning), such as Holland and Wainer (1993), Camilli and Shepard (1994), Zumbo (1999), Osterlind and Everson (2009), Engelhard (2012), and Millsap’s comprehensive Statistical Approaches to Measurement Invariance (2011) in which he provided a unifying theory of measurement invariance. Each of these books provided a valuable contribution to the literature, and we are indebted to the authors and the many other researchers who have added to the theory and practice of assessing measurement invariance.

What sets this book apart from those excellent resources on this topic is that, as the title implies, this book focuses on the practical application of assessing measurement invariance, with less emphasis on theoretical development or exposition. As such, a primary emphasis of the book is to describe the methods using a pedagogical framework, followed by extensive illustrations that demonstrate how to use software to analyze real data. My intention is that you will use this book to learn about practical methods to assess measurement invariance and how to apply them to your own data.

This book is intended to be a user-friendly guide for anyone who wants to assess measurement invariance in their own work. To support this goal, the computer syntax and data sets used in this book are available for download at the following website: www.people.umass.edu/cswells. As someone who has taught statistics and psychometrics for many years, I have learned that students learn better when they are actively engaged in the learning process. Therefore, I invite you to download the example data files and syntax and perform the analyses while you read along in this book.


My goal is to provide you with enough instruction and practice in the most useful methods so that you can apply what you have learned to novel problems and data sets in your own work.

The audience I envisioned while writing this book is applied researchers, both graduate students and professionals, from the social sciences in which educational tests, psychological inventories, questionnaires, and surveys are administered. This includes researchers from diverse fields, such as education, psychology, public policy, higher education, marketing, sports management, sociology, and public health, to name a few. The purpose of this book is to provide researchers and psychometricians from these diverse fields with the necessary knowledge and skills to select and apply an appropriate method to assess measurement invariance for the problems they are trying to solve and the research questions they are trying to answer. Regardless of your field, I wrote this book assuming you understand basic statistics taught in an introductory statistics course (e.g., mean, standard deviation, correlation, regression, effect size, hypothesis testing, confidence intervals), as well as the concepts of reliability and validity.

This book includes chapters that describe the big ideas and the theoretical concepts of measurement invariance and various statistical procedures, while other chapters include a pedagogical framework for you to learn a new method, including guided practice opportunities. In Chapter 1 I describe the basic ideas of measurement invariance, including why it is important to assess, and I address important issues to consider when applying the methods described in this book. Chapters 2, 4, and 6 are written to teach you specific methods for assessing measurement invariance. The basic structure of each of these chapters is a description of the method, followed by detailed illustrations on how to examine measurement invariance using computer software. Chapter 2 describes observed-score methods where the groups are matched using raw scores. Chapter 4 describes detection methods that use item response theory (IRT) models. Chapter 6 describes methods that rely on confirmatory factor analysis (CFA). For those who are unfamiliar with IRT, Chapter 3 provides a description of the basic concepts of and models used in IRT. I also provide an illustration of how to use the computer program flexMIRT (Cai, 2017) to estimate the parameters for several IRT models. Chapter 5 provides a description of the underlying concepts of CFA that are needed to understand the CFA-based methods described in Chapter 6. Several CFA models are fit to data using the computer program Mplus (Muthén & Muthén, 2008–2017). Finally, Appendix A provides a brief tutorial on the computer program R, because many of the techniques described in this book rely on packages in R.


There are many statistical methods available that I could have included in this book. Given that I was writing for practitioners and not writing a methodological tome that addresses all methods that have been created, I had to strategically select the methods I wanted to address. To accomplish this goal, I decided to include methods that met four criteria. The methods must (1) have been studied empirically and shown to provide reasonably accurate assessment of measurement invariance; (2) be able to incorporate an effect size or some type of information for determining whether the lack of invariance was nontrivial; (3) be relatively simple to use, especially for practitioners who are non-psychometricians; and (4) have software readily available to implement them. Indeed, there are methods that meet these criteria that I may not have included, but there will always be more to learn and write.

One of the challenges in writing a book that encompasses methods from different areas (i.e., observed-score and IRT- and CFA-based detection methods) is that the same symbols can take on different meanings depending on the context. I have attempted to use the symbols in their original meaning, which may cause confusion. For example, the Greek letter α is often used to represent the nominal Type I error rate, common odds ratio, or item discrimination parameter in an IRT model. It is important for the reader to discern the appropriate meaning of a symbol based on the context. To help prevent possible confusion, I have tried to define the symbols when they are first introduced in a chapter.

This book would not have been written if it were not for the support and encouragement of many people. First, I would like to acknowledge my colleagues at UMass Amherst: Steve Sireci, who recommended I write the book and whose support, guidance, and mentoring has had a positive impact on my career; Ronald Hambleton, whose ideas have permeated my understanding of measurement invariance and psychometrics in general; April Zenisky, Lisa Keller, Jennifer Randall, and Scott Monroe, who have all supported me when I needed it. I also would like to acknowledge Allan Cohen, Daniel Bolt, James Wollack, and Michael Subkoviak for their tutelage and persistent support during my doctoral studies; and Ron Serlin, who taught me the value of philosophy of science and the range-null hypothesis, which I have incorporated into DIF detection methods. I would also like to thank Neal Schmitt for giving me the opportunity to write this book, as well as for providing encouraging words along the journey.


I would be remiss if I did not acknowledge the hard work and support of Emily Watton, my editor at Cambridge University Press, for her tireless effort in bringing this manuscript to press, and Clare Diston, my copyeditor, whose attention to detail and knowledge of the English language brought out the best in my writing. Last but not least, I would like to express special thanks to my wife, Amanda, for her encouragement, support, and valuable feedback.


 1

Introduction

What Is Measurement Invariance?

The concept underlying measurement invariance is often introduced using a metaphoric example via physical measurements such as length or weight (Millsap, 2011). Suppose I developed an instrument to estimate the perimeter of any object. My instrument is invariant if it produces the same estimate of the object’s perimeter, regardless of the object’s shape. For example, if my instrument provides the same estimate of the perimeter for a circle and a rectangle that have the same true perimeter, then it is invariant. However, if for a circle and a rectangle of the same true perimeter my measure systematically overestimates the perimeters of rectangles, then my measure is not invariant across objects. The object’s shape should be an irrelevant factor in that my instrument is expected to provide an accurate estimate of the perimeter, regardless of the object’s shape. However, when we have a lack of measurement invariance, the estimated perimeter provided by my instrument is influenced not only by the true perimeter but also by the object’s shape. When we lack measurement invariance, irrelevant factors systematically influence the estimates our instruments are designed to produce.

We can apply the concept of measurement invariance from physical variables to variables in the social sciences. To do so, let’s suppose I have a constructed-response item, scored 0 to 10, that measures Grade 8 math proficiency. For the item to be invariant, the expected scores for students with the same math proficiency level should be equal, regardless of other variables such as country membership. However, if, for example, Korean students with the same math proficiency level as American students have higher expected scores than Americans, then the item lacks measurement invariance. In this case, an irrelevant factor (i.e., country membership) plays a role in estimating item performance beyond math proficiency. When using my non-invariant instrument to estimate the perimeter of an object, I need the estimate from my instrument, as well as the shape of the object, to provide an accurate estimate. The same is true for the non-invariant math item. To estimate accurately a student’s math proficiency, I would need their response on the item and their country membership. For an invariant math item, however, I would only need their item response.

While the use of physical measurements can be useful for introducing the concept of measurement invariance, there are two important differences when extending the idea to constructs in the social sciences, such as math proficiency or depression. First, the variables we measure in the social sciences are latent and cannot be directly observed. Instead, we make inferences from our observations that are often based on responses to stimuli such as multiple-choice, Likert-type, or constructed-response items. As a result, we must deal with unreliability, which makes it more difficult to determine whether our measures (or items) are invariant. Second, in the physical world we can obtain a gold standard that provides very accurate measurements. The gold standard can be used to match object shapes based on their true perimeter, which then allows us to compare the estimates produced by my instrument between different object shapes of the same perimeter. Unfortunately, there are no gold standards in the social sciences for the latent variables we are measuring. Latent variables that are used to match students are flawed to a certain degree, which, again, makes it difficult to assess measurement invariance.

Measurement invariance in the social sciences essentially indicates that a measure (or its items) is behaving in the same manner for people from different groups. To assess measurement invariance, we compare the performance on the item or set of items between the groups while matching on the proficiency level of the latent variable. While the idea of the items behaving in the same way between groups is useful for conveying the essence of measurement invariance, it is too simple to provide an accurate technical definition to understand the statistical approaches for examining measurement invariance. To fully understand what I mean by an item being invariant across groups within a population, I will begin by first defining the functional relationship between the latent variable being measured, which I will denote as θ, and item performance, denoted Y. The general notation for a functional relationship can be expressed as f(Y|θ), which indicates that the response to the item or set of items is a function of the latent variable. For example, if an item is scored dichotomously (Y = 0 for an incorrect response, Y = 1 for a correct response), then f(Y|θ) refers to the probability of correctly answering the item given an examinee’s level on the latent variable, and can be written as P(Y = 1|θ). For an item in which measurement invariance is satisfied, the functional relationship is the same in both groups; that is,

P(Y = 1 | θ, G = g1) = P(Y = 1 | θ, G = g2).    (1.1)

G refers to group membership, with g1 and g2 representing two separate groups (e.g., Korea and America). Another way of expressing measurement invariance is that group membership does not provide any additional information about the item performance above and beyond θ (Millsap, 2011). In other words,

P(Y = 1 | θ, G) = P(Y = 1 | θ).    (1.2)
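To make Equations (1.1) and (1.2) concrete, the short R sketch below evaluates the conditional probabilities for two groups under a hypothetical two-parameter logistic (2PL) item. The 2PL form and the parameter values here are illustrative assumptions for this sketch only; they are not taken from the book or its data sets.

# Hypothetical 2PL item: P(Y = 1 | theta) = 1 / (1 + exp(-a * (theta - b)))
irf <- function(theta, a, b) 1 / (1 + exp(-a * (theta - b)))

theta <- seq(-3, 3, by = 0.5)            # grid of proficiency values

# Invariant item: both groups share the same parameters, so Equation (1.1) holds
p_group1 <- irf(theta, a = 1.2, b = 0)
p_group2 <- irf(theta, a = 1.2, b = 0)
all.equal(p_group1, p_group2)            # TRUE: group adds nothing beyond theta

# Non-invariant item: the same item is harder for Group 2 at every theta
p_group2_dif <- irf(theta, a = 1.2, b = 0.5)
round(p_group1 - p_group2_dif, 3)        # nonzero conditional differences signal DIF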


To illustrate the idea of measurement invariance graphically, Figure 1.1 provides an example of the functional relationships for two groups on a dichotomously scored item that is invariant. The horizontal axis represents the proficiency level on the latent variable – in this book, I will refer to the level on the latent variable as proficiency. The vertical axis provides the probability of a correct response. Because the item is invariant, the functional relationships for both groups are identical (i.e., the probability of a correct response given θ is identical in both groups). The proficiency distributions for each group are shown underneath the horizontal axis.

Figure 1.1 An example of functional relationships for two groups on a dichotomously scored item that is invariant.


We can see that Group 1 has a higher proficiency than Group 2. The difference in proficiency distributions between the groups highlights the idea that matching on proficiency is an important aspect of the definition and assessment of measurement invariance. If we do not control for differences on θ between the groups, then differences in item performance may be due to true differences on the latent variable, not necessarily a lack of measurement invariance. The difference between latent variable distributions is referred to as impact. For example, since Group 1 has a higher mean θ distribution than Group 2, then Group 1 would have, on average, performed better on the item than Group 2, even if the functional relationship was identical, as shown in Figure 1.1. As a result, the proportion of examinees in Group 1 who answered the item correctly would have been higher compared to Group 2. However, once we control for differences in proficiency by conditioning on θ, item performance is identical. The fact that we want to control for differences in the latent variable before we compare item performance highlights the idea that we are not willing to assume the groups have the same θ distributions when assessing measurement invariance.
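The distinction between impact and a lack of invariance can be verified with a small simulation. The R sketch below is a minimal illustration under assumed values: both groups answer the same invariant 2PL item, but Group 1 has a higher proficiency distribution, so the overall proportions correct differ (impact) while the conditional probabilities do not.

set.seed(1)
irf <- function(theta, a, b) 1 / (1 + exp(-a * (theta - b)))

n <- 100000
theta1 <- rnorm(n, mean = 0.5, sd = 1)    # Group 1: higher mean proficiency
theta2 <- rnorm(n, mean = -0.5, sd = 1)   # Group 2: lower mean proficiency

# Both groups respond to the SAME invariant item (a = 1.2, b = 0)
y1 <- rbinom(n, 1, irf(theta1, a = 1.2, b = 0))
y2 <- rbinom(n, 1, irf(theta2, a = 1.2, b = 0))

mean(y1); mean(y2)                        # overall p-values differ: impact, not DIF

# Conditioning on theta removes the difference: compare examinees near theta = 0
mean(y1[abs(theta1) < 0.1]); mean(y2[abs(theta2) < 0.1])   # approximately equal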

Figure 1.2 illustrates an item that lacks measurement invariance. In this case, the probability of a correct response conditioned on θ is higher for Group 1, indicating that the item is relatively easier for Group 1.

Figure 1.2 An example of functional relationships for two groups on a dichotomously scored item that lacks invariance.


In other words, Group 1 examinees of the same proficiency level as examinees from Group 2 have a higher probability of answering the item correctly. When measurement invariance does not hold, as shown in Figure 1.2, then the functional relationships for Groups 1 and 2 are not the same (i.e., f(Y|θ, G = g1) ≠ f(Y|θ, G = g2)). Therefore, to explain item performance we need proficiency and group membership. In the case of non-invariance, the item is functioning differentially between the groups; in other words, the item is exhibiting differential item functioning (DIF). In this book, I will refer to a lack of measurement invariance as DIF. In fact, many of the statistical techniques used to assess measurement invariance are traditionally referred to as DIF methods.

The concept of measurement invariance can be applied to polytomously scored items; that is, items that have more than two score points (e.g., partial-credit or Likert-type items). For a polytomous item, Y could refer to the probability of responding to a category, or it could refer to the expected score on the item. For example, Figure 1.3 illustrates an invariant (top plot) and non-invariant (bottom plot) functional relationship for a polytomous item with five score categories. The vertical axis ranges from 0 to 4 and represents the expected scores on the polytomous item conditioned on θ (i.e., E(Y = y|θ)). The expected item scores conditioned on proficiency are identical when invariance is satisfied but differ when the property of invariance is not satisfied. Measurement invariance can also be extended to compare performance on a subset of items from a test (e.g., items that represent a content domain). In this case, the functional relationship looks a lot like a polytomous item in that we are comparing the expected score conditioned on the latent variable. When a scale based on a subset of items from a test lacks invariance, we often refer to it as differential bundle functioning (DBF). A special case of DBF is when we examine the performance of all items on a test. In this case, we are examining the invariance at the test score level. When the invariance is violated at the test score level, we refer to it as differential test functioning (DTF).

Figure 1.3 An illustration of an invariant and non-invariant polytomous item.
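For a polytomous item, the same comparison can be made on the expected item score conditioned on θ. As a rough sketch only, the R code below computes E(Y|θ) for a hypothetical five-category item (scored 0–4) under a Samejima-style graded response model with assumed parameters and compares two groups whose thresholds differ; the parameter values are illustrative, not values from the book.

# Expected score for a 5-category graded response item (scores 0-4)
# a: discrimination, b: vector of 4 thresholds
expected_score <- function(theta, a, b) {
  p_star <- sapply(b, function(bk) 1 / (1 + exp(-a * (theta - bk))))  # P(Y >= k | theta)
  p_star <- cbind(1, p_star, 0)               # add P(Y >= 0) = 1 and P(Y >= 5) = 0
  p_cat  <- p_star[, 1:5] - p_star[, 2:6]     # category probabilities P(Y = k | theta)
  as.vector(p_cat %*% (0:4))                  # E(Y | theta)
}

theta <- seq(-3, 3, by = 1)
ref   <- expected_score(theta, a = 1.5, b = c(-1.5, -0.5, 0.5, 1.5))
focal <- expected_score(theta, a = 1.5, b = c(-1.0,  0.0, 1.0, 2.0))  # shifted thresholds
round(ref - focal, 2)    # nonzero conditional differences indicate DIF for this item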

Why Should We Assess Measurement Invariance? There are two basic reasons for why we should care about whether a test and its items are invariant across groups in a population. The first reason pertains to test validity in that the presence of DIF can impede test score interpretations and uses of the test. The Standards for Educational and

https://doi.org/10.1017/9781108750561.002 Published online by Cambridge University Press

6

Introduction

Figure 1.3

An illustration of an invariant and non-invariant polytomous item.

Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014) describe five sources of validity: content, response processes, internal structure, relation to other variables, and consequences. Providing evidence to support measurement invariance is one of the aspects of internal structure. The presence of DIF is an indication that there may be a construct-irrelevant factor (or factors) influencing item performance. The consequences of having items that lack invariance in a test can be severe in some cases. For example, a lack of invariance at the item level can manifest to the test score level, leading to unfair comparisons of examinees from different groups. If the DIF is large enough, examinees may be placed into the wrong performance category (imagine how disheartening it would be, after working diligently to

https://doi.org/10.1017/9781108750561.002 Published online by Cambridge University Press

Why Should We Assess Measurement Invariance?

7

successfully build skills in, say, math, to be placed into a performance category below your expectation because of something other than math proficiency). A lack of invariance is not just a validity concern for largescale assessments but for any test in which a decision is being made using a test score, such as remediation plans for struggling students in schools, determining whether an intervention is effective for a student, assigning grades or performance descriptors to students’ report cards, etc. The examples I have discussed so far have pertained to educational tests. However, the importance of measurement invariance also applies to noncognitive tests such as psychological inventories, attitudinal measures, and observational measures. In fact, it is important to establish measurement invariance for any measure prior to making any group comparisons using its results. Essentially, any time we are planning on using a score from an instrument, we should collect evidence of measurement invariance so that we can be confident that no construct-irrelevant factor is playing a meaningful role in our interpretations and uses. In addition to the direct effect that DIF can have on test score interpretation and use, it can also indirectly influence validity through its deleterious effect on measurement processes. For example, DIF can disrupt a scale score via its negative effect on score equating or scaling when using item response theory (IRT). A common goal in many testing programs is to establish a stable scale over time with the goal of measuring improvement (e.g., the proportion of proficient students within Grade 8 math increases over consecutive years) and growth (each student demonstrates improved proficiency over grades). Tests contain items that are common between testing occasions (e.g., administration years) that are used to link scales so that the scales have the same meaning. If some of the common items contain DIF, then the equating or scaling can be corrupted, which results in an unstable scale. This type of DIF, where the groups are defined by testing occasion, is referred to as item parameter drift in that some of the items become easier or harder over time after controlling for proficiency differences (Goldstein, 1983). A consequence of item parameter drift is that inferences drawn from test scores may be inaccurate (e.g., examinees may be placed into the wrong performance categories). A second purpose for assessing measurement invariance is when we have substantive research questions pertaining to how populations may differ on a latent variable. For example, suppose we want to compare geographic regions on a math test. In addition to examining mean differences between regions, assessing measurement could provide useful information about how the items are functioning across the regions. We could find that

https://doi.org/10.1017/9781108750561.002 Published online by Cambridge University Press

8

Introduction

certain domains of items are relatively harder for a particular region, suggesting that perhaps that group did not have the same opportunity to learn the content. Assessing measurement invariance in this context could also be useful for psychological latent variables. For instance, we may be interested in comparing gender groups on a measure of aggressiveness. Items that are flagged as DIF may provide insight into differences between the groups.

Forms of DIF It is helpful to have nomenclature to classify the types of non-invariance. There are two basic forms of DIF: uniform and nonuniform (Mellenbergh, 1982). Uniform DIF occurs when the functional relationship differs between the groups consistently or uniformly across the proficiency scale. The plot shown in Figure 1.2 provides an example of uniform DIF. In this case, the probability of a correct response for Group 1 is higher compared to Group 2 throughout the proficiency scale. At the item level, the difference in the functional relationships for uniform DIF is defined only by the item difficulty, whereas the item discrimination is the same in both groups. As I will describe in Chapter 3, where I address IRT, the item discrimination is related to the slope of the functional relationship curve. In uniform DIF, the curve shifts to the right or left for one of the groups, while the slope remains the same. Nonuniform DIF occurs when the lack of invariance is due to the discrimination between the groups, regardless of whether the difficulty differs between the groups. Whereas for uniform DIF the item can only be harder or easier for one of the groups, nonuniform DIF can take on many forms. Figure 1.4 provides two examples of nonuniform DIF. In the top plot, the DIF is defined only by the difference in discrimination between the two groups; in this case, the curve is flatter for Group 2, indicating a less discriminating item compared to Group 1. The difference between Groups 1 and 2 in answering the item correctly depends on the θ value; for lower θ values, the item is relatively easier for Group 2, whereas for higher θ values, the item is relatively harder compared to Group 1. The bottom plot in Figure 1.4 provides another example of nonuniform DIF, but in this case the item differs with respect to discrimination and difficulty such that the item is less discriminating and more difficult in Group 2. When testing for DIF, our goal is often not only to detect DIF but also to describe the nature or form of the DIF.

https://doi.org/10.1017/9781108750561.002 Published online by Cambridge University Press

Forms of DIF

9

Group 1 Group 2

0

0.2

P(Y=1| θ) 0.4 0.6

0.8

1.0

Nonuniform DIF: Discrimination Only

Proficiency

Group 1 Group 2

0

0.2

P(Y=1| θ) 0.4 0.6

0.8

1.0

Nonuniform DIF: Discrimination and Difficulty

Proficiency

Figure 1.4 Two examples of nonuniform DIF.

Another important factor to consider when describing DIF is whether the set of DIF items is consistently harder or easier for one of the groups. When the DIF is consistent across items (e.g., the DIF items are all harder in one of the groups), then it is referred to as unidirectional DIF. If, on the other hand, some of the DIF items are easier in one group, while some of the other DIF items are harder, then that is referred to as bidirectional DIF. The reason it is helpful to make this distinction is that the effect of unidirectional DIF can often pose a more serious risk to psychometric procedures such as equating and making test score comparisons. In addition, unidirectional DIF can also make it more difficult to detect DIF items in that the DIF has a larger impact on the latent variable used to match examinees (see discussion on purification procedures for further details and how to mitigate the effect of unidirectional DIF).

https://doi.org/10.1017/9781108750561.002 Published online by Cambridge University Press

10

Introduction

Classification of DIF Detection Methods The statistical techniques used to assess measurement invariance that we will explore in this book can be classified under three general approaches. Each of the approaches differs with respect to how the latent variable used to match examinees is measured. The first class of DIF detection methods, referred to as observed-score methods, uses the raw score as a proxy for θ. The raw scores are used to match examinees when comparing item performance. For example, measurement invariance is assessed by comparing item performance (e.g., the proportion correct for a dichotomously scored item) for examinees from different groups with the same raw score. Observed-score methods have the advantage of providing effect sizes to classify a detected item as nontrivial DIF. The observed-score methods addressed in this book include the Mantel–Haenszel procedure (Holland, 1985; Holland & Thayer, 1988), the standardization DIF method (Dorans & Kulick, 1986; Dorans & Holland, 1993), logistic regression (Swaminathan & Rogers, 1990), and the Simultaneous Item Bias Test (SIBTEST; Shealy & Stout, 1993a, 1993b). I will describe the observedscore methods in Chapter 2. The second class of methods uses a nonlinear latent variable model to define θ and subsequently the functional relationship. These methods rely on IRT models. The plots shown in Figures 1.1–1.4 are examples of item response functions provided by IRT models. One of the advantages of using IRT to examine DIF is that the models provide a convenient evaluation of DIF that is consistent with the definition of DIF. The IRT methods addressed in this book include b-plot, Lord’s chi-square (Lord, 1977, 1980), the likelihood-ratio test (Thissen, Steinberg, & Wainer, 1993), Raju’s area measure (Raju, 1988, 1990), and differential functioning of items and tests (DFIT; Raju, van der Linden, & Fleer, 1995). I will describe the basic ideas of IRT in Chapter 3 and the IRT-based DIF methods in Chapter 4. The third class of methods uses a linear latent variable model via confirmatory factor analysis (CFA). Although there is a strong relationship between CFA and IRT, and the methods used to examine DIF are similar in some respects, they are distinct in important ways. For example, CFA and IRT evaluate the fit of the respective latent variable model using very different approaches and statistics. One of the advantages of using CFA to assess measurement invariance is that it provides a comprehensive evaluation of the data structure and can easily accommodate complicated multidimensional models. The methods we will examine in this book include

https://doi.org/10.1017/9781108750561.002 Published online by Cambridge University Press

The Conditioning Variable: To Purify or Not?

11

multigroup CFA for both continuous and categorical data, MIMIC (multiple indicators, multiple causes) models, and longitudinal measurement invariance. The fundamental ideas of CFA will be described in Chapter 5, with Chapter 6 containing a description of the CFA-based methods for assessing measurement invariance.

The Conditioning Variable: To Purify or Not to Purify? As described previously, an important aspect of the definition of measurement invariance is the conditioning on the level of the latent variable; that is, the performance on the item (or set of items) is the same between the groups for examinees with the same proficiency or θ level. Therefore, it is crucial that a measure provides an accurate representation of the latent variable – a criterion that is not influenced by a lack of invariance or DIF. If the conditioning variable is influenced by DIF, then the examinees may not be accurately matched. For example, if several of the items on the criterion lack invariance in such a way that they are relatively harder in one of the groups, then examinees with the same raw score from different groups will not necessarily be equivalent on the latent variable. As a result, the comparison on the item performance may not be accurate because we have not matched examinees accurately, leading to a DIF analysis that identifies invariant items as DIF and DIF items as invariant. Therefore, it is crucial that we use a criterion that is invariant between the groups. We can use either an internal or external criterion as the conditioning variable. An external criterion is one that is based on a variable that is not part of the items contained in the test being evaluated for measurement invariance. An internal criterion, on the other hand, uses a composite score based on the items from the test being assessed for measurement invariance. To illustrate the advantages and disadvantages of both approaches, consider a hypothetical situation where we are testing DIF between females and males on a math test used to place incoming university students into an appropriate math course. If all students take an admissions exam such as the SAT (Scholastic Aptitude Test), then we could feasibly use the SAT-Quantitative score as the conditioning variable. The advantage of using an external criterion is that if the math placement test contains DIF items, then those items will not influence our measure of θ. Of course, when choosing an external criterion, we would want evidence that it is invariant for the groups we are comparing. The challenge in using an external criterion is that it is rarely available, and the external criterion must be measuring the same latent variable as the test being assessed for

https://doi.org/10.1017/9781108750561.002 Published online by Cambridge University Press

12

Introduction

measurement invariance (which may be questionable in many contexts). Both challenges typically preclude the use of an external criterion in most situations. An internal criterion is the most common conditioning variable used when assessing measurement invariance, and in this book it will be used exclusively. The items used to define the conditioning variable are referred to as anchor items. The advantages of using an internal criterion are that it is readily available and it measures the relevant latent variable. The disadvantage of using an internal criterion is that if any of the anchor items contain DIF, then the conditioning variable, which is based on the items from the same test, may not be accurate (Dorans & Holland, 1993).1 One way to address this limitation is to purify the conditioning variable by removing the DIF items from the conditioning variable (Holland & Thayer, 1988; Dorans & Holland, 1993; Camilli & Shepard, 1994). Although the details of purifying the conditioning variable vary across methods, the general procedure is as follows. First, we test each item for DIF using all items to define the conditioning variable. Second, we retest each of the items for DIF, but define the conditioning variable using only the items that were not identified as DIF in the first step. At this point, we can either continue this procedure until the same items are flagged as DIF in subsequent stages, or simply stop after the second step. There is evidence, however, that two stages are sufficient for purifying the anchor (Clauser, Mazor, & Hambleton, 1993; Zenisky, Hambleton, & Robin, 2003). The result of purifying the conditioning variable is that you have a set of anchor items that are invariant and, thus, provide an uncontaminated measure of the conditioning variable for both groups. Although purifying the conditioning variable seems to solve the main disadvantage of using an internal criterion, it is not without limitations. First, because statistical tests used to identify DIF items are not infallible, it is possible still to have DIF items in the anchor. In fact, unless the statistical power is very high, then it is likely you will have at least one DIF item in the anchor. Although this can be a serious limitation, the purification approach seems to remove the most egregious DIF items, leaving items that may not have a meaningful impact on the conditioning variable. The second limitation of the purification approach is that when we remove items from the anchor, we are reducing the reliability of the 1

In fact, if all items are systematically harder for one group, then we cannot flag any items as DIF because the bias will be subsumed into the difference in the latent variable distributions of the groups.

https://doi.org/10.1017/9781108750561.002 Published online by Cambridge University Press

Considerations When Applying Statistical DIF Tests

13

conditioning variable, which can have a negative impact on the DIF statistics, especially those that use observed scores to define the conditioning variable described in Chapter 2. This can be particularly problematic when using large sample sizes where statistical tests tend to flag many items as DIF. In this case, it may be prudent to use an effect size along with the statistical test to flag an item as nontrivial DIF when implementing the purification approach (more on the use of effect sizes will be described later in this chapter). Regardless of the limitations of the purification approach, it is wise to use it when testing for DIF.

Considerations When Applying Statistical Tests for DIF Statistical and Practical Significance Most approaches for assessing measurement invariance rely on traditional statistical significance testing in which the null hypothesis being tested states that the item is invariant between populations, and the alternative hypothesis states that the item functions differentially between the populations. Although significance tests are useful for assessing measurement invariance, the challenge in applying them is that the traditional null hypothesis is always false in real data (Cohen, 1994). To elaborate on this issue given the definition of measurement invariance, the traditional null hypothesis specifies that the item performance conditioned on θ is identical in both populations. For example, for a dichotomously scored item, this means that the probability of a correct response given θ is identical for all examinees (as shown in Figure 1.1). Unfortunately, this is an unrealistic hypothesis, even for items that function comparably in the populations. The consequence of testing this type of null hypothesis in real data is that, as the group sample sizes increase, the statistical tests tend to be statistically significant, even for differences that are trivial. This problem (which is a ubiquitous problem in all statistical significance tests that test a point-null hypothesis) has led some researchers to state that the test statistics have too much power or are oversensitive, when in fact the problem is that the null hypothesis is always false and, as a result, meaningless. Furthermore, testing the point-null hypothesis does not support the inferences we want to draw, including the main inference that the DIF is nontrivial in the population. There are two basic strategies to help alleviate this problem. The most common solution is to examine effect sizes post hoc to help judge the practical significance of a statistically significant result. In this strategy,

https://doi.org/10.1017/9781108750561.002 Published online by Cambridge University Press

14

Introduction

referred to as a blended approach (Gomez-Benito, Hidalgo, & Zumbo, 2013), we first test an item for DIF using traditional significance testing. If the DIF is statistically significant, we then use an effect size to qualify whether the DIF is trivial or nontrivial. The advantage of this approach is that it is relatively simple to apply (presuming you have an effect size available), and it helps support the claim we want to make – that is, the DIF is nontrivial. The disadvantage of the blended approach is that it does not control the Type I error rate for our desired claim. A Type I error for our claim is that the lack of invariance is nontrivial, when in fact the DIF is trivial. The second strategy is to test a range-null hypothesis (Serlin & Lapsley, 1993) instead of a point-null hypothesis. The range-null hypothesis approach specifies under the null hypothesis a magnitude of DIF that is considered trivial; in other words, the null hypothesis states that the item is essentially as DIF-free as can be expected in real data (but not exactly DIFfree). The alternative hypothesis states that the item displays nontrivial DIF. The advantage of this approach is that a statistically significant result implies the DIF is nontrivial and it controls the Type I error rate for the desired claim or inference. A further advantage is that it requires the test developer to consider what trivial and nontrivial DIF are prior to the statistical test. The disadvantage of this approach is that it requires large sample sizes to have sufficient power to reject a false null hypothesis, and it is difficult to apply for every statistical test. For every statistical approach addressed in this book, my goal is to describe how to determine whether the DIF is trivial or nontrivial, not simply to determine if an item is functioning differentially. To help support this goal, I will rely heavily on the blended approach and will apply the range-null hypothesis when discussing Lord’s chi-square DIF statistic. Type I Error Rate Control When assessing measurement invariance, we often test many items for DIF. This introduces the problem of having inflated Type I error rates. For example, if you test 50 items for DIF, then the chances of observing at least one Type I error (when the items are DIF-free) is very high; more specifically, for α ¼ 0.05, P ðat least one rejectionjH 0 is trueÞ ¼ 1  ð1  0:05Þ50 ¼ 0:92:

(1.3)

This is problematic for at least two reasons. First, if you are trying to

https://doi.org/10.1017/9781108750561.002 Published online by Cambridge University Press

Considerations When Applying Statistical DIF Tests

15

understand the causes of DIF, then flagging items that do not function differentially will only add confusion to the analysis, leading to a lack of confidence in the results or the manufacturing of erroneous claims about why the groups differ. Second, we may decide to remove an item that exhibits DIF from the test for fairness issues or so that the item does not affect the psychometric operation (e.g., equating). In this case, we are removing an item unnecessarily, leading to less reliable results and a financial loss. Developing effective items is a time-consuming and financially extensive procedure – items cost a lot of money to develop. Both of these reasons illustrate the importance of having a controlled Type I error rate when testing for DIF. There are two basic solutions to the problem of inflated Type I error rates. The first is to use a procedure to correct for the inflated Type I error rate, such as familywise error rate procedures (e.g., Dunn-Bonferroni; Holm, 1979) or the false discovery rate procedure developed by Benjamini and Hochberg (1995). Of these procedures, it seems that controlling the false discovery rate using the Bejamini-Hochberg procedure is more appropriate than controlling the familywise error rate, given that it has more statistical power and the collection of items being tested for DIF does not necessarily comprise a family of tests that is connected to an overarching hypothesis. For example, when testing 50 items for DIF using α ¼ 0.05, the α level for each comparison would be 0.05/50 ¼ 0.001 for each item when using Dunn-Bonferroni. Unless we are working with very large sample sizes, this α level may be overly strict, which has a negative impact on power. The second solution is to use effect sizes to help qualify statistically significant results as nontrivial or meaningful. In this case, items that are falsely flagged as DIF must also pass a benchmark indicating that the DIF is nontrivial. As a result, many of the items that are Type I errors would be ignored as being trivial and be classified as essentially free of DIF. Meeting statistical significance and having an effect size that exceeds a certain value helps mitigate the problem with inflated Type I error rates that is addressed in the blended and range-null hypothesis results (although, technically, the range-null hypothesis does a better job of controlling the Type I error rate for trivial DIF items). The inflated Type I error rate is not only an issue when testing many items within a test, but also when comparing more than two groups for each item. For example, in international assessments such as the Trends in International Mathematics and Science Study (TIMSS), we have access to examinees from many countries, and we may want to examine whether the

https://doi.org/10.1017/9781108750561.002 Published online by Cambridge University Press

16

Introduction

items are performing differentially between several countries. In this case, we perform multiple tests of DIF on each item. Here, however, the analyses belong to a family of comparisons and using one of the familywise error rate control procedures is reasonable. If we are comparing three groups, for instance, then using Fisher’s least significant difference (LSD) is most powerful, presuming we can test the omnibus hypothesis using a test statistic. For analyses with four or more groups, a procedure such as Holm’s (1979) method is perhaps most appropriate and powerful.

https://doi.org/10.1017/9781108750561.002 Published online by Cambridge University Press

 2

Observed-Score Methods

What distinguishes observed-score methods from other types of method described in this book is that they typically use raw scores1 to match examinees from the reference and focal groups. As a result, there is no need to fit a latent variable model such as an IRT model (the disadvantages of using latent variable models are that they often require large sample sizes to obtain accurate parameter estimates, and they require acceptable model fit to obtain valid DIF statistics [Bolt, 2002]). Another advantage of using observed-score methods is that many of the procedures provide an effect size measure in addition to a hypothesis test. Many of the effect sizes, in fact, have well-established benchmarks that test developers and researchers can use to classify an item as exhibiting negligible, moderate, and large DIF. Because of these advantages, observed-score methods are a popular approach for testing DIF. There are a few disadvantages to using observed-score methods compared to latent variable models. Approaches based on latent variable models have slightly more power to detect DIF when the model fits the data and there is a sufficient sample size to estimate the parameters (Camilli & Shepard, 1994). In addition, measurement error can influence the Type I and II error rates for observed-score methods, especially when the groups differ considerably with respect to proficiency (Gotzmann, 2001; Gierl, Gotzmann, & Boughton, 2004). Latent variable models, on the other hand, remove the effect due to unreliability when comparing groups. Lastly, observed-score methods may not be appropriate for

1

It is possible, and sometimes advantageous, to incorporate proficiency estimates from a latent variable model into the classical methods in lieu of raw scores. For example, if you have a test design that uses spiraling booklets or matrix sampling where not all examinees are administered all items, an IRT proficiency estimate could be obtained and used to match examinees in lieu of raw scores, presuming the item parameter estimates for the groups are on the same scale.

17

https://doi.org/10.1017/9781108750561.003 Published online by Cambridge University Press

18

Observed-Score Methods

1multidimensional data, where latent variable models can model the multidimensionality and perform DIF analyses within that context. As briefly described in Chapter 1, it is important to control for differences in proficiency in the reference and focal groups when comparing item performance via a DIF procedure. Most observed-score methods match examinees from different groups using the raw scores (the raw score is used as an estimate for the latent variable being measured). For instance, the standardization DIF procedure (Dorans & Kulick, 1986; Dorans & Holland, 1993) compares the conditional classical p-values between the groups when conditioning on raw scores. Therefore, the raw score plays a vital role in matching examinees. There are two basic factors that influence how well the raw scores match examinees. First, it is crucial that the anchor items used to create the raw scores are free from DIF. Otherwise, you may not accurately match examinees from each group, which leads to faulty detection of DIF. To address this issue, we often remove the items from the anchor used to create the raw scores that exhibit DIF using a purification procedure (Candell & Drasgow, 1988; Holland & Thayer, 1988). The details of each purification procedure will be provided for each method. The second challenge in applying observed-score methods is that if the test scores do not have sufficient reliability and the groups differ with respect to proficiency, then the raw scores may not match examinees well (Donoghue, Holland, & Thayer, 1993; Shealy & Stout, 1993a; Uttaro & Millsap, 1994). As a result, the performance for several of the methods suffers when the test does not contain a sufficient number of items to provide reasonable reliability. Missing data provide another challenge in applying observed-score methods. Because raw scores are based on summing the item responses, if an examinee did not respond to an item, then the raw score will not be comparable to examinees who responded to all the items. If you want to use an observed-score method in the presence of missing data, you will have to decide how to deal with the missing data. In some cases, it makes sense to replace the omitted response with a reasonable value. For example, for dichotomously scored or partial-credit items you may simply replace an omit with a 0. Replacing a value with a 0 may be reasonable if it is consistent with the scoring model; that is, when scoring examinees, omits are replaced with a 0. However, if the operational scoring procedure does not score omits as 0, then replacing omits with a 0 may not be appropriate. Also, the same strategy cannot be used for Likert-type survey items. In such cases, you may want to consider either using an IRT

https://doi.org/10.1017/9781108750561.003 Published online by Cambridge University Press

Transformed Item Difficulty (Delta-Plot) Method

19

model to obtain a proficiency estimate2 to use in lieu of the raw score, or replacing the missing value using a multiple imputation technique (see Enders, 2010).

Transformed Item Difficulty (Delta-Plot) Method Dichotomous Data One of the earliest approaches for detecting items that function differentially between two groups is the transformed item difficulty (TID) method, also known as the delta-plot method, developed by Angoff (1972). In the delta-plot method, we first convert the proportion correct (i.e., p-values) for each item within each group to a z-score using the inverse of the normal cumulative function and then perform a linear transformation to a metric with a mean of 13 and standard deviation of 4. The newly transformed scale is referred to as the delta (Δ) metric. The p-values for item i in each group are converted to the Δ-metric using the following equation:    Δi ¼ 13  4 Φ1 pi ,

ð2:1Þ

where Φ1[.] is a function based on the inverse of the standard normal cumulative function and pi is the p-value for item i. For example, an item with a p-value of 0.50 corresponds to a Δ-value of 13 (i.e., Δ ¼ 13  4ðΦ1 ½0:50Þ ¼ 13  4ð0Þ ¼ 13). There are a few things worth noting about the Δ-metric. First, it is on an interval-level scale, as opposed to the p-values, which are on an ordinal scale. The advantage of the interval-level scale is that it is more conducive for making comparisons when examining DIF. For instance, the relationship between the Δ-values tends to be linear, whereas the relationship between p-values tends to be curvilinear. Second, Δ-values are inversely related to the p-values, which means that higher Δ-values indicate a more difficult item (as opposed to a higher p-value indicating an easier item). Third, because Δ-values are based on classical item difficulty estimates (i.e., p-values), the Δ-values are sample dependent and as a result are influenced by the group’s proficiency distribution (e.g., groups with higher 2

With the exception of logistic regression, the observed-score methods require matching examinees using a discrete matching variable; however, IRT proficiency estimates are continuous. One way to address this issue is to convert the IRT proficiency estimates to expected true scores using the test characteristic curve and then rounding the number to the nearest integer.

https://doi.org/10.1017/9781108750561.003 Published online by Cambridge University Press

20

Observed-Score Methods

Figure 2.1

Plot of the Δ-values for the reference and focal groups.

proficiency will have lower Δ-values since the items will appear to be easier compared to groups with lower overall proficiency). This third fact indicates that you cannot directly compare Δ-values between the groups since the observed differences may be due to disparate proficiency distributions, not DIF. Once the p-values have been transformed onto the Δ-metric, we examine the relationship between the Δ-values for both groups via a bivariate plot. Figure 2.1 provides a scatterplot, referred to as a delta-plot, for Δ-values from a 30-item, multiple-choice test. The x- and y-axes represent the Δ-values for the reference and focal groups, respectively. To aid interpretation, it is prudent to use a square plot where the x- and y-axes are the same length and scale. The points on the plot represent the pair of Δ-values for each item. The dashed line is the identity line and can be useful for inferring differences in the groups’ proficiency levels. In this case, because most of the points are located above the identity line, it indicates that the focal group’s proficiency was less than the reference group’s proficiency (when the groups have equal proficiency, the points are expected to be scattered around the identity line). Bivariate outliers in the delta-plot indicate possible items that are functioning differentially between the groups. To help identify the outliers, we fit a principal axis line through the points (solid line shown in

https://doi.org/10.1017/9781108750561.003 Published online by Cambridge University Press

Transformed Item Difficulty (Delta-Plot) Method

21

Figure 2.1). Principal axis regression, unlike ordinary least squares regression, minimizes the squared perpendicular distances between the observed data and predicted values (i.e., points on the line) and results in a symmetric line (i.e., the line is the same whether you regress the focal group’s Δ-values onto the reference group or vice versa). The principal axis line can be expressed in the linear form ^y i ¼ a þ bx i , where ^y i represents the predicted Δ-value for item i with a Δ-value of x; and b and a represent the slope and intercept, respectively. The slope (b) and intercept (a) for the principal axis line are computed as follows: b¼

S 2ΔF



S 2ΔR

þ

rffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi   S 2ΔF  S 2ΔR

2

þ 4ðS ΔR ΔF Þ2

2S ΔR ΔF

,

ð2:2Þ

and  F  aΔ R, a¼Δ

ð2:3Þ

where and represent the variance of the Δ-values for the focal and reference groups, respectively; S ΔR ΔF represents the covariance of the  R represent the mean of the  F and Δ Δ-values for the two groups; and Δ Δ-values for the focal and reference groups, respectively. For the data shown in Figure 2.1, the slope equals 0.95 and the intercept is 1.68. Although the delta-plot method is relatively simple to implement, one of the challenges we face is determining an appropriate criterion to flag items. Essentially, we flag items in which the perpendicular distance from the principal axis line exceeds a certain value or threshold. For example, Educational Testing Service (ETS), which pioneered the use of the Δ-scale, developed criteria for classifying items as negligible, intermediate, and large DIF, based, in part, on the difference on the Δ-metric. Absolute differences on the Δ-metric that are less than 1.0 are considered negligible, values greater than 1.0 but less than 1.5 are considered moderate, and values greater than 1.5 are considered large DIF. The perpendicular distance (di) for item i from the principal axis line is computed as follows: S 2ΔF

S 2ΔR

di ¼

bΔiR  ΔiF þ a pffiffiffiffiffiffiffiffiffiffiffiffiffi : b2 þ 1

ð2:4Þ

A commonly used threshold is 1.5 (Muniz, Hambleton, & Xing, 2001; Robin, Sireci, & Hambleton, 2003); in other words, the data point must be greater than 1.5 units from the principal axis line to be flagged as DIF. One of the issues with using an a priori threshold of 1.5 to flag items is that it may lead to very conservative flagging rates, which means that

https://doi.org/10.1017/9781108750561.003 Published online by Cambridge University Press

22

Observed-Score Methods

it will have deflated Type I error rates and low power in flagging DIF items (Magis & Facon, 2012). An alternative approach, developed by Magis and Facon (2012), is to determine a threshold under the null hypothesis that the items are DIF-free and an assumption of bivariate normality. Essentially, the threshold is based on the standardized residuals (i.e., standardized distances from the principal axis line), and therefore uses the data and a nominal Type I error rate (α) to define a threshold value. The threshold value, denoted T α , is computed as follows: sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi h i  b2 S 2ΔR  2bS ΔR ΔF þ S 2ΔF α , T α ¼ Φ1 1  2 b2 þ 1 

ð2:5Þ

where Φ1[.] is a function based on the inverse of the standard normal cumulative function; and   α is the nominal Type I error rate (e.g., for α ¼ 0.05, Φ1 1  :05 2 ¼ 1:96). Using an α of 0.05, T α ¼ 0:60 for our example, which means that an item that exceeds a distance of 0.60 raw units from the principal axis line would be flagged as exhibiting DIF. The threshold value will vary across samples as a function of the variability around the principal axis line. For example, if the data points exhibited larger scatter around the line, then the threshold would be larger. Although we will perform the delta-plot method using a function within R, due to the procedure’s relative simplicity it can be performed with many software packages including Microsoft Excel. To illustrate how to perform the delta-plot method using a function in R, we will use the same data that we have been exploring so far. The data are based on a 30-item, multiplechoice test that has been dichotomously scored with 2,500 examinees in each group. The data file, “MCData.csv,” is a comma-separated values (CSV) text file, with the first 30 columns containing the item responses and the last column containing the grouping variable (0 ¼ reference, 1 ¼ focal). The first row of the data file contains the variable names (i.e., item1, item2, . . ., item30, group), which will be printed in the output, making it easier to read. You can read the data file into R using the following command:3 MC.data 0.90 provides a sufficient condition for treating the data as unidimensional enough, regardless of the other indices we will examine. However, for values less than 0.90, we can still examine two other indices to determine if the data are unidimensional enough. In addition to evaluating the loadings for the specific factors to help determine if the specific factors are useful, there is an index we can compute that is based on the factor loadings that can help determine if a unidimensional model is good enough (even if the data are strictly multidimensional). The index, referred to as the expected common variance (ECV; Reise, Moore, & Haviland, 2010), provides information about the strength of the general factor relative to the specific factors. ECV is

https://doi.org/10.1017/9781108750561.006 Published online by Cambridge University Press

Description and Example of the Bifactor Model

293

based on the common variance explained by the general factor divided by the total common variance and is computed as follows: IG P

ECV ¼

i¼1 IG P i¼1

λ2iG

þ

λ2iG Is S P P s¼1 i¼1

:

(5.22)

λ2is

λ2iG

represents the loading for the general factor squared for item i, and λ2is represents the loading for the specific factor s squared for item i. Using the factor loading estimates provided in Table 5.16, ECV equals 0.91, which indicates that much of the common variance is explained by the general factor (however, this does not necessarily mean that much of the total observed-score variance is explained by the general factor). Although more research is needed to determine appropriate benchmarks for ECV, Reise et al. (2013) provided a tentative benchmark of 0.60 for determining if the data are unidimensional enough; in other words, ECV values greater than 0.60 suggest the data are unidimensional enough. In addition to ECV and PUC values, researchers can compute reliability coefficients to determine if composite scores predominantly reflect a single common factor, even when the data follow a bifactor measurement model. As noted by Reise (2012), the presence of multidimensionality does not necessitate the creation of subscales, nor does it prevent the interpretability of a single composite score. Instead, test developers must make the distinction between the degree of multidimensionality and the degree to which a total score reflects a common variable. Coefficient omega hierarchical (ωH ) can provide evidence regarding the degree to which a total score reflects a common dimension and is computed as follows: G P

ωH ¼  G P i¼1

λiG

i¼1

2 λiG

2

þ

 s S P P s¼1

i¼1

2 λis

þ

IG P

:

(5.23)

^ 2 ðεi Þ σ

i¼1

λiG represents the loading for the general factor for item i, and λis ^ 2 ðεÞ represents represents the loading for the specific factor s for item i. σ the measurement error variance. Using the factor loading estimates from the bifactor model provided in Table 5.16, ωH equals 0.90, which indicates that much of the total observed-score variance was explained by the general factor. Large ωH values indicate that composite scores primarily reflect a single dimension, thus providing evidence that reporting a unidimensional score is a reasonable option.

https://doi.org/10.1017/9781108750561.006 Published online by Cambridge University Press

294

Confirmatory Factor Analysis

ωH may be relatively small because either the error variance is large (indicating low reliability) or the variance due to the specific factors is large, indicating a strong multidimensional solution. To help tease apart these two sources, we can compute the reliability of subscale scores, controlling for the effect of the general factor. This reliability coefficient, which Reise (2012) termed omega subscale, ωs , can be computed for each specific factor s as follows: I Ps ωs ¼ 

Gs P

i¼1

i¼1

2 λiG s

2 λis

þ

S P s¼1



s P

2 λis

þ

i¼1

I Gs P

:

(5.24)

^ 2 ðεi Þ σ

i¼1

High values indicate that the subscales provide reliable information above and beyond the general factor, whereas low values suggest that the subscales are not precise indicators of the specific factors. For our example, since ωH was large, we would expect ωs to be small, which it is for each of the three factors (ωs1 ¼ 0:002, ωs1 ¼ :01, and ωs1 < 0:001).

https://doi.org/10.1017/9781108750561.006 Published online by Cambridge University Press

 6

Methods Based on Confirmatory Factor Analysis

Confirmatory factor analysis (CFA) provides a convenient framework for evaluating measurement invariance. In CFA we specify a measurement model that defines the factorial structure underlying the observed data (e.g., item responses). The factorial structure includes the number of latent variables (i.e., factors) and the pattern of factor loadings (see Chapter 5 for a more detailed description of CFA). If the CFA model adequately captures the factorial structure, then group membership will not provide any additional information about observed variables above and beyond that explained by the latent variable. In fact, DIF has been framed as a dimensionality issue (Ackerman, 1992) and CFA makes the role of dimensionality in the analyses more explicit. The methods described in this chapter share characteristics with many of the other methods addressed in this book. First, similar to item response theory (IRT; described in Chapters 3 and 4), CFA is based on a latent variable model that depicts the relationship between the latent variable (or variables) and the observed variables purported to measure the latent variable (see Chapter 5 for a description of CFA). Second, methods using CFA can test measurement invariance at the item and subscore (i.e., differential bundle functioning) level. Third, when analyzing item responses, the factor loadings and thresholds represent an item’s discrimination and difficulty, which are analogous to the parameters bearing the same name in IRT models. Fourth, statistical significance tests that are used to evaluate measurement invariance are often based on comparing the fit of nested models using the likelihood-ratio test (or difference in chi-square test statistics), which is similar to logistic regression and the likelihood-ratio test in IRT. Although CFA-based methods share many characteristics with other DIF methods, there are important differences. First, CFA methods explicitly incorporate dimensionality into the analyses. A CFA model that does not accurately represent the factorial structure, which includes the 295

https://doi.org/10.1017/9781108750561.007 Published online by Cambridge University Press

296

Methods Based on Confirmatory Factor Analysis

number of latent variables and the pattern of factor loadings, cannot be used to assess measurement invariance. IRT models, on the other hand, delegate dimensionality to the model assumptions realm, resulting in dimensionality playing an implicit role in the IRT DIF methods (this is not to say dimensionality is not important in IRT, because it is). Second, CFA has a collection of fit indices that are used to complement the significance test to help judge if the model fit is acceptable and whether the lack of measurement is nontrivial. These fit indices are necessary because the hypothesis test is often statistically significant, even for trivial DIF. Third, an important distinction between CFA and IRT is that CFA defines the relationship between the latent variable and indicator using a linear function. This can pose a challenge when analyzing item responses since the functional relationship between the latent variable and indicator is not linear. Fortunately, robust estimation methods have been developed that can be used to assess measurement invariance for categorical data from dichotomous and polytomous items. Fourth, CFA has some unique advantages over some of the other methods discussed in this book. For example, CFA models control for measurement error, which can result in more accurate results in some cases. It is also relatively easy to incorporate more than two groups into the same analysis and accommodate multidimensional data. Although CFA provides a powerful approach for evaluating invariance, it comes with challenges as well. First, applying CFA is often more timeconsuming than other methods because it typically requires fitting multiple models, and the number of models increases for test data that contain several DIF items. Second, the CFA models cannot accommodate guessing behavior for data from multiple-choice items (unlike the 3PL model in IRT). Third, CFA often lacks useful effect sizes to help judge if an item demonstrates nontrivial DIF. Lastly, assessing measurement invariance for categorical indicators can be a time-consuming and laborious process. I will address three popular methods for evaluating measurement invariance using CFA: multigroup CFA, the MIMIC model, and CFA to evaluate longitudinal measurement invariance. I will illustrate each method using Mplus on real data. For analyses using other software such as LISREL and the lavaan package in R, see the website for example code.

Multigroup Confirmatory Factor Analysis Multigroup confirmatory factor analysis (MG-CFA) is a flexible and comprehensive method for examining measurement invariance. In

https://doi.org/10.1017/9781108750561.007 Published online by Cambridge University Press

Multigroup Confirmatory Factor Analysis

297

addition to examining item characteristics (e.g., discrimination and difficulty), it includes an explicit evaluation of the dimensionality between the groups. MG-CFA can be applied at the item level or to subscores (analogous to differential bundle functioning) and to two or more groups simultaneously. Also, it can be applied to unidimensional as well as multidimensional models such as higher-order and bifactor models. In fact, its strength is its applicability to complicated multidimensional models. MGCFA shares a few characteristics with the IRT likelihood-ratio test (see Chapter 4), in that we fit a measurement model for each group separately and then compare the fit of several nested models that are defined by imposing constraints on the parameter estimates between the groups. Types of Measurement Invariance There are four types of measurement invariance that are assessed using MG-CFA: configural (equal dimensional structure), metric (equal factor loadings), scalar (equal intercepts/thresholds), and strict factorial invariance (equal indicator residual variances). Configural invariance is evaluated first because examining the factor loadings and intercepts/thresholds requires the same measurement model. Metric invariance is assessed prior to scalar invariance because the interpretation of the intercepts/ thresholds depends on equal factor loadings (this is analogous to the likelihood-ratio DIF procedure where you first show that the discrimination parameter is DIF-free before examining DIF in the difficulty parameter). Of these four types of invariance, strict factorial invariance is often considered overly restrictive and unnecessary for most testing applications, and as a result, is usually not conducted. To test each type of invariance, we begin with a fully unconstrained model where the parameter estimates are freely estimated in each group. We then proceed by placing strategic constraints on the model parameter estimates to evaluate the specific types of invariance. When fitting the measurement models to evaluate the type of invariance using MG-CFA, it is important to consider how the scale and location for the latent variable is defined for each group. In particular, it is crucial that the mean and variance of the latent variable(s) are not constrained to be equal between the groups; otherwise, you would be testing not only measurement invariance hypotheses but also whether the groups differ on the latent variable (which may be an interesting hypothesis, but it does not pertain directly to measurement invariance). An effective method for defining the latent variable scale is to fix or place equality constraints on the parameter

https://doi.org/10.1017/9781108750561.007 Published online by Cambridge University Press

298

Methods Based on Confirmatory Factor Analysis

estimates for one indicator in both groups – the indicator with the constrained parameter estimates is referred to as the reference item and is excluded from being assessed for equal factor loadings or intercept/threshold. Typically, the factor loading is fixed to 1.0 and the intercept/threshold is constrained to be equal in both groups for the reference item. By fixing the reference item’s factor loading to 1.0, the scale of the latent variable(s) is defined by the reliable portion of this item in both groups. This approach also places the parameter estimates for each group onto the same scale so that they can be compared. Furthermore, throughout this process, the mean of the latent variable for one of the groups (e.g., reference group) is fixed to 0 with the variance based on the reliable portion of the indicator. For the other groups in the model (e.g., focal group), the mean and standard deviation of the latent variable are freely estimated. The first step in applying MG-CFA to evaluate measurement invariance is to test configural invariance to determine if the dimensional structure is the same between the groups (e.g., the number of factors and the pattern of the indicator–factor relationships are the same in each group). Configural invariance is assessed by examining the fit of the measurement model for each group separately using CFA. If configural invariance is not satisfied (e.g., the data for the focal group follow a two-dimensional structure, whereas the data for the reference group are unidimensional), then you have determined that the construct is not being measured in the same manner in both groups, and invariance is not supported. If configural invariance is satisfied, then you proceed by testing metric invariance (equal factor loadings) to determine whether the factor loadings differ between the groups. To test the equal factor loadings hypothesis (i.e., metric invariance), you first fit a measurement model in each group with all the parameter estimates unconstrained between the two groups except for the reference item. This model has no equality constraints placed on the parameter estimates (other than the reference item) and is referred to as the freebaseline or configural invariance model. The subsequent fit statistics (chisquare and fit indices such as CFI) provide an indication of how well the model fits when the parameter estimates are freely estimated in each group. To test whether the factor loadings are invariant between the groups, you then fit a constrained model, referred to as the metric invariance model, in which the factor loadings for the groups are constrained to be equal (the parameter estimates for the reference item maintain the same fixed values defined in the configural invariance model). If the fit of the metric invariance model is significantly worse than the configural model, then

https://doi.org/10.1017/9781108750561.007 Published online by Cambridge University Press

Multigroup Confirmatory Factor Analysis

299

we can reject the hypothesis of invariance in the factor loadings and conclude that at least one of the factor loadings differs between the groups. Since the configural and metric invariance models are nested, we can perform a significance test by taking the difference between the respective chi-square statistics. Under the null hypothesis, the difference in the chi-square statistics is distributed as a chi-square with degrees of freedom equal to the number of parameters being tested (i.e., number of constrained factor loadings, not including the reference item). The null hypothesis being tested in this case is an omnibus hypothesis that states that the factor loadings being tested are equal in both populations. With very large sample sizes, the chi-square difference test is often significant, even when the difference in factor loadings is trivial in the populations. Therefore, the purpose of the analysis is to determine which of the factor loadings exhibit a nontrivial lack of invariance. To do this, we start by examining the modification indices to determine which items require the factor loadings to be freely estimated to improve fit in the metric model. In this case, it is advisable to use a stepwise procedure when deciding which factor loadings to freely estimate in each group. In other words, you first identify the indicator that has the largest modification index and then fit the metric invariance model again, but this time, freely estimating the factor loadings for this indicator in each group. You then perform the chi-square difference test again by comparing the chi-square test statistics between the partial metric invariance model and the configural invariance model to determine if other factor loadings should be freely estimated. You continue this stepwise process until either the chi-square difference test is not significant, or the practical significance based on the change in fit indices between the models or the difference in factor loadings is trivial. One of the challenges in using the change in fit indices (e.g., difference in CFI values between the baseline and metric model) is that there are no clear benchmarks for what is nontrivial for item response data. More research is needed in this area to establish guidelines on how to use the change in fit indices or other effect size measures (see later in the chapter) to judge the practical significance of a lack of invariance. After establishing metric (or partial metric) invariance, the next step is to compare the intercepts (for continuous indicators) and/or thresholds (for categorical indicators) between the groups to test the scalar invariance hypothesis. At the item level, this is analogous to testing if a lack of invariance is occurring in the item difficulties. To test whether the intercepts/thresholds are invariant between the groups, you fit a constrained

After establishing metric (or partial metric) invariance, the next step is to compare the intercepts (for continuous indicators) and/or thresholds (for categorical indicators) between the groups to test the scalar invariance hypothesis. At the item level, this is analogous to testing whether a lack of invariance is occurring in the item difficulties. To test whether the intercepts/thresholds are invariant between the groups, you fit a constrained model, referred to as the scalar invariance model, in which the intercepts/thresholds, in addition to the factor loadings, are constrained to be equal between the groups (the parameter estimates for the reference item maintain the same fixed values defined in the configural and metric models). We then compare the fit between the scalar and metric invariance models using the chi-square difference test. If the chi-square difference is statistically significant (which occurs often when analyzing data with large sample sizes), then we examine the modification indices to determine which constrained intercepts/thresholds are responsible for the decreased model fit. We then relax the constraint for the indicator with the largest modification index and refit the scalar invariance model. In addition to the chi-square difference test, it is important to examine the change in fit indices and the intercepts/thresholds for the offending indicators to determine whether the lack of invariance is nontrivial in the context of a statistically significant chi-square difference test. We continue this process until we have a scalar invariance model that provides reasonably good fit compared to the metric invariance model, with the nontrivial differences in intercepts/thresholds left unconstrained.

Strict factorial invariance can be assessed by placing equality constraints on the residual variances (and any correlated residuals) in a similar manner as previously described. We fit a model in which the residual variances (along with the factor loadings and intercepts/thresholds) are constrained to be equal. The chi-square difference test is used to determine whether these additional constraints have significantly deteriorated model fit relative to the scalar invariance model. For a significant chi-square difference test, you then examine the modification indices to determine which residuals are responsible for producing the worse fit. However, strict factorial invariance is often not examined because it is considered unnecessary and overly strict.
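
The scalar and strict models described above can be specified in the same lavaan sketch by extending the group.equal argument, and partially invariant solutions can be obtained with group.partial; again, the indicator names are placeholders.

# Scalar invariance model: loadings and intercepts constrained equal
fit.scalar <- cfa(model, data = dat, group = "group",
                  group.equal = c("loadings", "intercepts"))
anova(fit.metric, fit.scalar)

# Partial scalar invariance: free the offending intercept (x4 is illustrative)
fit.scalar.partial <- cfa(model, data = dat, group = "group",
                          group.equal = c("loadings", "intercepts"),
                          group.partial = c("x4 ~ 1"))

# Strict factorial invariance: residual variances constrained as well
fit.strict <- cfa(model, data = dat, group = "group",
                  group.equal = c("loadings", "intercepts", "residuals"))
anova(fit.scalar, fit.strict)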

Partial Measurement Invariance

When analyzing real test data at the item level, it is common to observe at least one indicator that lacks measurement invariance in either the factor loading or the intercept/threshold. As a result, in most cases we are not looking for absolute measurement invariance, in which every indicator must have invariant parameter values. Instead, our goal is often to establish partial measurement invariance, in which certain indicators have invariant parameter values while other indicators have factor loadings and/or thresholds that are not invariant.

Reference Indicator

An important aspect of MG-CFA is the role of the reference item. To compare the factor loadings and intercepts/thresholds, the parameter estimates must be on the same scale. To define the scale of the latent variable and place the model parameter estimates onto the same scale, we fix the factor loading to 1.0 and constrain the intercept/threshold to be equal across the groups for one of the indicators. If the parameter values for this indicator lack invariance, however, then the parameter estimates for the two groups may not be on the same scale, making the invariance analyses inaccurate. This is the same issue we face when placing the item parameter estimates onto the same scale in IRT-based methods: if the anchor contains DIF items, then the scaling may be corrupted, leading to inaccurate DIF results. Therefore, it is crucial that the factor loadings and intercepts/thresholds for the reference item are invariant. One strategy for selecting an appropriate reference item is to examine the items for DIF using another method, such as Lord's chi-square, logistic regression, or the likelihood-ratio test, and to select a DIF-free item to serve as the reference item. A second strategy is to assess measurement invariance using one item as the reference item and then to reanalyze the data using another item (one that was invariant in the first analysis) as the reference item. If the original reference item is invariant in the second analysis, then the results should be consistent. If not, then the results from the second reference item may be more accurate.
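
As a sketch of the second strategy in lavaan, the model can be re-specified so that a different indicator serves as the reference item: the NA* premultiplier frees the loading that lavaan would otherwise fix, and 1* fixes the loading of the new reference item. Here x2 is an arbitrary placeholder assumed to have been invariant in the first analysis.

# Re-specify the model with x2 as the reference indicator
model.ref2 <- 'F =~ NA*x1 + 1*x2 + x3 + x4 + x5 + x6'

# Repeat the invariance analyses and check that the conclusions
# (including those about the original reference item) are consistent
fit.configural2 <- cfa(model.ref2, data = dat, group = "group")
fit.metric2 <- cfa(model.ref2, data = dat, group = "group",
                   group.equal = "loadings")
anova(fit.configural2, fit.metric2)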

Effect Sizes

A major challenge we face when using MG-CFA to assess measurement invariance is determining whether a statistically significant result based on the chi-square difference test reflects nontrivial differences in the parameter values. One reason this is challenging is that we often use very large sample sizes when conducting MG-CFA, which results in high power to reject the null hypothesis that the factor loadings and intercepts/thresholds are identical in the population. Therefore, it is imperative that we use other evidence and rely on our judgment to determine whether a statistically significant result is practically meaningful, in that the lack of invariance has a consequential effect. One approach that we can use to judge the meaningfulness of a significant difference is to use the change in fit indices between the nested models (Cheung & Rensvold, 2002; Meade, Johnson, & Braddy, 2008). For example, we can compute the change in the CFI index between the configural and metric invariance models as follows:

\Delta\mathrm{CFI} = \mathrm{CFI}_{\mathrm{Configural}} - \mathrm{CFI}_{\mathrm{Metric}}.    (6.1)
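
In the lavaan sketch used earlier, ΔCFI can be computed directly from the fitted model objects (the object names are placeholders):

# Change in CFI between the configural and metric invariance models
delta.cfi <- fitMeasures(fit.configural, "cfi") - fitMeasures(fit.metric, "cfi")
delta.cfi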

When invariance is satisfied, we expect ΔCFI to be close to 0. However, if the CFI fit index drops considerably after placing equality constraints on the parameter estimates, then we have evidence to support the inference that the lack of invariance is nontrivial. The challenge in using this approach is that there are no well-established benchmarks for classifying the lack of invariance as nontrivial. Cheung and Rensvold (2002) suggested that if the change in the CFI fit index exceeds 0.01, then the lack of invariance may be considered nontrivial. It is unclear, however, whether this benchmark is appropriate for all contexts and whether it can be used with categorical data (Cheung and Rensvold [2002] examined polytomous data but treated the indicators as continuous). Even though there is a dearth of research examining appropriate benchmarks for the change in fit indices, the fit indices may still be useful in that, if the change is very large, there is compelling evidence to support a lack of invariance. On the other hand, if the change in fit indices is close to 0, then that is evidence to support the inference that the lack of invariance, despite a significant chi-square difference test, is trivial. For situations where it is unclear, we will likely have to rely on other information.

A second approach for evaluating whether a statistically significant result is meaningful is to compute an effect size suggested by Millsap and Olivera-Aguilar (2012). For example, if the factor loading for an indicator is invariant, then we can compute the proportion of the group difference in means on the indicator that is due to the difference in intercepts as follows:

\mathrm{ES}_i = \frac{\hat{\tau}_{iR} - \hat{\tau}_{iF}}{\hat{\mu}_{iR} - \hat{\mu}_{iF}},    (6.2)

where \hat{\tau}_{iR} and \hat{\tau}_{iF} represent the estimated intercepts for indicator i in the reference and focal groups, respectively, and \hat{\mu}_{iR} and \hat{\mu}_{iF} represent the indicator means for the reference and focal groups, respectively.
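
A small helper function for Equation (6.2) is sketched below; the intercept and mean estimates supplied to it are hypothetical values used purely for illustration.

# Proportion of the group difference in indicator means attributable to the
# difference in intercepts (Millsap & Olivera-Aguilar, 2012)
es.intercept <- function(tau.R, tau.F, mu.R, mu.F) {
  (tau.R - tau.F) / (mu.R - mu.F)
}

# Hypothetical values for a single indicator
es.intercept(tau.R = 0.55, tau.F = 0.35, mu.R = 3.10, mu.F = 2.60)  # 0.40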

According to Millsap and Olivera-Aguilar (2012), if the proportion of the difference in means due to intercept differences is small (e.g.,

For example, you could type 743 + 548 and hit the Enter key. R would then print the result in the R console window. However, using the R console to conduct analyses is cumbersome and inefficient in most cases. Fortunately, R provides a script file where you can enter code to perform computations and analyses. A script file is simply a text file where you can enter R code. One of the advantages of using script files is that you can save the file for future use. Also, you can add comments that describe aspects of your analysis using the # symbol (R ignores anything on a line after the # symbol). In Windows, you can create a new script file through the File pull-down menu by selecting "New script." A window will appear where you can enter your R code (you can also use other text editors).

Figure A.2  Screenshot of the R window and console.

Once you have entered code into the script file, you can execute it by highlighting the text and pressing Ctrl and R simultaneously. When you do this, R copies and pastes the code to the R console and executes the analysis or request. After you enter R code into the script file, you can save the file using the extension *.R. Table A.1 provides the R code stored in the script file "Appendix A.R," which we will be using as an example in this Appendix (you can download the script file from the book's website).
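
For instance, a short script (not the contents of "Appendix A.R," just an illustration) could contain the following lines; highlighting them and pressing Ctrl and R sends them to the console.

# Example script: comments follow the # symbol
y <- 743 + 548   # store the sum in the object y
y                # typing an object's name prints its value (1291)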

Importing Data into R and Assigning Values to Objects

Before you perform analyses, you will often have to input data into R. There are many ways of inputting data into R, and I will discuss just a few of them. If you have only a few data points, you can enter the data manually using the c command. For example, suppose we had test scores for five examinees. We could enter the test scores using the R console or a script file as follows:

test.score